LUDWIG-MAXIMILIANS-UNIVERSITÄT
TECHNISCHE UNIVERSITÄT MÜNCHEN

Fakultät für Informatik

Masterarbeit in Bioinformatik

Predict Subcellular Localization in All Kingdoms

Tatyana Goldberg

Aufgabensteller: Prof. Dr. Burkhard Rost
Betreuer: Dipl. Bioinf. Tobias Hamp
Abgabedatum: 17. Oktober 2011

Ich versichere, dass ich diese Masterarbeit selbständig verfasst und nur die angegebenen Quellen und Hilfsmittel verwendet habe.

17. Oktober 2011 Tatyana Goldberg

Abstract

The prediction of protein subcellular localization is an important step towards understanding protein function. Here, a new method for predicting localization in all six taxonomic kingdoms is presented. The method was developed on a non-redundant data set of proteins of known localization from SWISS-PROT. Three localization classes were targeted for archaea, six for bacteria and eleven for eukaryota. Prediction requires an amino acid sequence and the taxonomic classification. For the development of the method, support vector machines were used and a range of string kernels examined. The kernel using evolutionary profiles was selected as the most appropriate for detecting compartment-specific patterns. A number of multiclass classification techniques were then compared, including one-against-all, various types of ensembles of nested dichotomies and the nested dichotomy with a fixed structure. The latter allowed the prediction of protein subcellular localization by mimicking the cascading mechanism of cellular sorting. Though its overall accuracy was comparable to or higher than that of the other classification techniques, its computational time was significantly lower. Three separate classifiers were trained on non-membrane proteins, transmembrane proteins and proteins of all types, allowing the latter to be applied to large-scale screenings of entire proteomes. When evaluated on the non-redundant test sets, the method developed on all types of proteins achieved the highest level of accuracy for archaeal proteins, 84% accuracy for bacterial and 64% for eukaryotic proteins, thus outperforming current state-of-the-art predictors. In addition, the prediction methods were benchmarked on three independent data sets that were not used during their development. The method developed here surpassed the other methods in nearly all benchmarks.


Zusammenfassung

Die Vorhersage der subzellulären Lokalisierung eines Proteins ist ein wichtiger Schritt zur Aufklärung seiner Funktion. Hier wird eine neue Methode zur Vorhersage der Lokalisierung in allen sechs taxonomischen Reichen vorgestellt. Die Methode wurde auf einem nicht-redundanten Datensatz mit Proteinen bekannter Lokalisierung aus der SWISS-PROT Datenbank entwickelt. Drei Klassen wurden für Archaeen, sechs für Bakterien und elf für Eukaryoten vorhergesagt. Die Vorhersage erfordert eine Aminosäuresequenz und die taxonomische Klassifizierung. Für die Entwicklung der Vorhersagemethode wurden Support Vector Machines benutzt und eine Reihe von String-Kernels getestet. Der Kernel, der mit evolutionären Profilen arbeitet, wurde als der geeignetste zur Erkennung Kompartiment-spezifischer Muster in der Sequenz ausgewählt. Danach wurde eine Anzahl von verschiedenen Multiclass-Klassifizierungstechniken verglichen, darunter One-against-all, verschiedene Ensembles verschachtelter Dichotomien und die verschachtelte Dichotomie mit einer festen Struktur. Letztere erlaubt die Vorhersage der subzellulären Lokalisierung durch Nachahmung des kaskadierenden Mechanismus der Proteinsortierung in der Zelle. Obwohl die Genauigkeit dieser Technik vergleichbar oder höher war als die der anderen Klassifizierungstechniken, war ihre Rechenzeit deutlich niedriger. Drei verschiedene Klassifikatoren wurden auf Nicht-Membranproteinen, Transmembranproteinen und Proteinen aller Art trainiert. Letzterer ermöglichte die Anwendung auf Large-scale-Screenings gesamter Proteome. Die Auswertung der auf Proteinen aller Art entwickelten Methode zeigte, dass diese Methode ein Höchstmaß an Genauigkeit für Proteine aus Archaeen, 84% Genauigkeit für Proteine aus Bakterien und 64% Genauigkeit für Proteine aus Eukaryoten erreichen kann. Damit war sie besser als die aktuellen "State-of-the-Art"-Vorhersagemethoden. Die Methode wurde zusätzlich mit anderen Methoden auf drei unabhängigen Datensätzen verglichen, die nicht während ihrer Entwicklung verwendet wurden. Die hier entwickelte Methode übertraf die anderen Methoden in fast allen Benchmark-Tests.


Acknowledgments

I consider myself very fortunate to have had the opportunity to study at two outstanding German universities, the Ludwig Maximilian University and the Technical University of Munich. Both universities provided me with an excellent academic environment throughout the years of my study, which I greatly enjoyed. I want to express my sincere gratitude to the professors who supported me all the way to the point where I am now, and to my fellow students for the great time we had together. Most of all, I wish to thank Prof. Burkhard Rost for offering me a warm welcome into his group and for supervising this thesis. His great experience, patience and fruitful discussions have continuously encouraged me. He gave me opportunities such as mentoring two programs for high-school students, involvement in the teaching of the 'Bioinformatics Lab' course, and attending an international conference on computational biology and presenting a poster there, all of which helped me to enhance my skills considerably and were greatly appreciated. I also owe many thanks to Tobias Hamp for providing excellent support throughout my thesis. I thank him for all the help with machine learning, for making me take the statistics more seriously and for continually telling me not to worry too much. I also thank him for proofreading this thesis and for his valuable comments. Further, I wish to thank the whole Rost Lab group that I had the pleasure to work with. Especially, I want to thank Laszlo Kajan and Guy Yachdav for introducing me to the Rost Lab and for always having good advice for me concerning work or future plans. I also thank Timothy Karl for the help with the computer cluster, Shruti Rastogi for the initial help with LOCtree and for providing the LocDB data, Edda Kloppmann for her comments on my work, Yana Bromberg for our delightful conversations, Marc Offman for his eccentric jokes, and Andrea Schafferhans and Shaila Rössle for their good mood. I would like to thank Christian Schäfer, Esmeralda Vicedo, Arthur Dong, Maina Bitar, Dedan Githae and the students for the range of activities we undertook together. Many thanks go to Marlena Drabik and Lothar Richter for their help with administrative matters.

On the personal side, I want to thank my lovely mother Nelly for teaching me to appreciate knowledge and education. I thank my brother Valerij, who has always been my big hero and always will be. Margarita, his wife, and my three little nephews deserve special thanks for the fun we have together. Finally, I thank my boyfriend Taras for always being there for me, even over long distances and time differences. Last but not least, I wish to thank Sebastian Briesemeister from the University of Tübingen for running the MultiLoc2 predictions, and all those who make their data and implementations publicly available.

Contents

Abstract i

Zusammenfassung iii

Acknowledgments v

List of Figures xi

List of Tables xiii

1 Introduction 1
1.1 Subcellular Localization as a Functional Characteristic of a Protein 1
1.2 Transmembrane Proteins 1
1.3 Cell Compartmentalization 2
1.4 Protein Sorting 4
1.4.1 The Protein Trafficking System 5
1.4.2 Sorting Signals 6
1.5 Prediction of Subcellular Localization 6
1.6 Novel Method 8

2 Materials and Methods 9
2.1 Workflow 9
2.2 Data Sets for Development 12
2.2.1 Data Sets Extraction 12
2.2.2 Homology Reduction 13
2.2.3 Scores for Measuring Sequence Similarity 16
2.2.4 Size Increase of the Training Sets 17


2.3 Data Sets for Testing Only 18
2.3.1 LocDB 18
2.3.2 Newly Added SWISS-PROT Proteins 20
2.4 Support Vector Machines 21
2.4.1 Linear Classification 21
2.4.2 Soft Margin 24
2.4.3 Non-Linear Classification 25
2.4.4 The Kernel Matrix 27
2.4.5 Sequential Minimal Optimization 28
2.5 String Kernels 28
2.5.1 String Subsequence Kernel 28
2.5.2 Mismatch Kernel 29
2.5.3 Profile Kernel 30
2.6 Multiclass Classification 32
2.6.1 One-Against-All 32
2.6.2 Ensemble of Nested Dichotomies 33
2.6.3 Ensemble of Class Balanced Nested Dichotomies 33
2.6.4 Ensemble of Data Balanced Nested Dichotomies 34
2.6.5 Predefined Nested Dichotomies 35
2.7 Performance Evaluation 37
2.7.1 Accuracy, Coverage and their Geometric Average 37
2.7.2 The Standard Error 38
2.7.3 Stratified k-fold Cross Validation 40
2.8 External Prediction Methods 43
2.8.1 CELLO v.2.5 43
2.8.2 LOCtree 43
2.8.3 MultiLoc2 44
2.8.4 WoLFPSORT 44
2.9 Box Plots 45

3 Results and Discussion 47
3.1 Training and Test Sets 47
3.2 Kernel Selection 49
3.2.1 Parameter Optimization 49


3.2.2 Comparative Evaluation 51
3.3 Classification Model Selection 54
3.4 Model Parameter Optimization 55
3.4.1 Generalization Performance 55
3.4.2 Classification Runtime 56
3.5 Testing 58
3.5.1 Performance Evaluation 58
3.5.2 Prediction Reliability 63
3.5.3 Comparison with the External Classifiers 64
3.6 Application to the Independent Test Sets 69
3.6.1 Re-Training of the Final Classification Model 69
3.6.2 Comparison with the External Classifiers 69
3.7 Localization-wise Performance of MyND 74

4 Conclusion 77

5 Appendix 79
5.1 Effectiveness of the Mismatch Kernel 79
5.2 Effectiveness of the String Subsequence Kernel 80
5.3 Effectiveness of the Profile Kernel 81
5.4 MyND Benchmark on Eukaryotic Non-membrane Proteins 82
5.5 MyND Benchmark on Eukaryotic Transmembrane Proteins 83

Bibliography 84


List of Figures

1.1 Subcellular compartments of prokaryotic and eukaryotic cells 3
1.2 The protein trafficking system in eukaryotic cells 5

2.1 Workflow 10
2.2 HSSP-curve at HSSP-value=0 16
2.3 Optimal separating hyperplane for perfectly linearly separable data 22
2.4 Linear separating hyperplane for linearly non-separable data 24
2.5 Optimal separating hyperplane in the high dimensional space 26
2.6 Two different nested dichotomies for a five-class classification problem 34
2.7 Hierarchical architectures of MyND 36
2.8 The standard error 39
2.9 Example of a 5-fold cross-validation 40
2.10 Example of a 5-by-10 cross-validation 42
2.11 Configuration of a box plot 45

3.1 MyND performance on the data sets of all bacterial proteins 59
3.2 MyND performance on the data sets of all eukaryotic proteins 62
3.3 Accuracy versus coverage for the probability scores of MyND 63

5.1 Analysis of Mismatch kernel parameter combinations 79
5.2 Analysis of String Subsequence kernel parameter combinations 80
5.3 Analysis of Profile kernel parameter combinations 81


List of Tables

2.1 Keywords for subcellular localization annotation in SWISS-PROT 14
2.2 Homology reduced SWISS-PROT data sets 15
2.3 Homology reduced LocDB data sets 19
2.4 Homology reduced data set of newly added SWISS-PROT proteins 20
2.5 Confusion matrix 38

3.1 Training and test sets 48
3.2 Evaluation of kernel functions on the data sets of all proteins 51
3.3 Evaluation of kernel functions on the data sets of non-membrane and transmembrane proteins 53
3.4 Evaluation of multi-class classification approaches 54
3.5 Performance of optimized multi-class classification approaches 56
3.6 Testing time of optimized multi-class classification approaches 57
3.7 MyND performance on non-redundant test sets 60
3.8 Comparison of MyND to external classifiers on all bacterial proteins 65
3.9 Comparison of MyND to external classifiers on all eukaryotic proteins 68
3.10 Comparison of MyND to external classifiers on LocDB human proteins 70
3.11 Comparison of MyND to external classifiers on LocDB plant proteins 72
3.12 Comparison of MyND to external classifiers on newly added SWISS-PROT proteins 73

5.1 MyND performance on eukaryotic non-membrane proteins 82
5.2 MyND performance on eukaryotic transmembrane proteins 83

1 Introduction

1.1 Subcellular Localization as a Functional Characteristic of a Protein

Unlike prokaryotic cells, which generally consist of one compartment, the cytoplasm, eukaryotic cells are organized into different membrane-surrounded compartments in which thousands of proteins reside. These proteins are molecules that are involved in various biochemical processes required for the viability and the functionality of a cell. Since every subcellular compartment is usually specialized to perform a particular cellular function, it is assumed that proteins that are found in the same compartment have either similar functions or contribute to a common physiological function [1, 2, 3]. Therefore, knowledge of the subcellular localization of a protein can help to annotate its interacting partners and to draw conclusions about its functional role in the cell. This information can be incorporated into drug discovery and other research initiatives. Most eukaryotic and all prokaryotic proteins are synthesized in the cytoplasm, and some of them are sorted to the locations where they perform their biological functions. The sorting of proteins is carried out by localization signals encoded in their amino acid sequences [4]. Because the experimental determination of subcellular localization continues to be a very tedious and time consuming process, the development of accurate automated prediction methods has evolved into one of the major tasks in bioinformatics.

1.2 Transmembrane Proteins

Different proteins are exposed to different physico-chemical environments, and the environment of a protein influences its structure [5]. A prominent example is the contrast between globular proteins, i.e. proteins that are soluble in aqueous solvent, and transmembrane proteins, which are not. The globular protein fold is defined in terms of hydrophobic amino

acid residues which, driven from water by the hydrophobic effect, are pushed together in the protein interior [6, 7]. The globular fold allows these proteins to exist in biological fluids and to accomplish a wide range of cellular functions. They can act as enzymes that catalyze chemical reactions, as transfer agents for other molecules or as regulatory messengers, to name just a few. Transmembrane proteins, in contrast, have extended hydrophobic regions that avoid water by embedding into the membrane and interacting with the lipid interior of the bilayer. Their hydrophilic regions are placed on one or the other side of the membrane and are in contact with the polar lipid head groups and the aqueous environment [8, 9, 10]. Transmembrane proteins carry out functions such as cell signaling and the transport of ions, neurotransmitters and other solutes across membranes.

1.3 Cell Compartmentalization

All living organisms are classified into one of two categories: prokaryotes and eukaryotes. Prokaryotes are further classified into the kingdoms of archaea and bacteria, whereas eukaryotes are classified into the kingdoms of protista, plantae, fungi and animalia. Prokaryotes originated much earlier than eukaryotes [11]. They are usually smaller in size and have a simpler cell structure than eukaryotes (Figure 1.1). The distinguishing feature of eukaryotic cells is the presence of a clearly defined nucleus and other membrane-bound compartments. Each membrane-bound compartment, or organelle, performs its specialized cellular function. The list below gives an overview of the major cellular compartments of prokaryotic and eukaryotic cells and their basic functions (based on the descriptions in [4]).

Cytoplasm surrounds the nucleus and is the main site for protein synthesis. It consists of a gel-like fluid, the cytosol, and the cytoplasmic organelles that are placed in it. The cytosol contains an organized network of fibrous molecules, the cytoskeleton, that gives the cell its shape and allows organelles to move within it.

Endoplasmic reticulum (ER) forms an interconnected network of sheets, sacs and tubules. The rough ER is specialized in the synthesis and transport of membrane proteins and of soluble proteins destined for secretion or for other organelles. The synthesis of lipids that are used by the rest of the cell, including the ER itself, takes place in the smooth ER.


Figure 1.1: Subcellular compartments of a prokaryotic (bacterial) and a eukaryotic (animal) cell. The two cell types are distinguished by fundamental differences in their cell plans. A prokaryotic cell lacks a nucleus and the other membrane-bound organelles that are present in all eukaryotic cells. A eukaryotic plant cell has an additional organelle, the chloroplast. Figure adapted from [12].

Endosomes are membrane-bound compartments that transport molecules from the plasma membrane to lysosomes for digestion.

Golgi apparatus is a system of stacked, disc-like sacs that collects proteins from the ER, modifies them and packages them into vesicles for shipping to other organelles or out of the cell.

Lysosomes are organelles that contain digestive enzymes involved in intracellular degradation of intact organelles and cellular macromolecules.

Mitochondria and plastids generate most of the energy required for cell survival. Although both organelles synthesize some of their own proteins, most of the proteins that act there are encoded in the nucleus and imported from the cytoplasm. While mitochondria are found in all eukaryotic cells, plastids are present only in cells of plantae and protista. The most prominent members of the plastid family of organelles are the chloroplasts.


Nucleus contains the genetic material of the cell and is the main site for DNA and RNA synthesis. The nucleus is found in all eukaryotic cells but not in prokaryotes.

Peroxisomes (also known as microbodies) are small compartments that contain enzymes employed in various oxidative reactions and in the degradation of peroxides.

Plasma membrane is a selectively permeable membrane that controls the flow of various substances in and out of the cell. Plant, fungal, bacterial and some archaeal cells are additionally surrounded by a cell wall that provides these cells with structural support and protection.

Vacuoles are organelles for the storage of water, food particles and gases. They are much larger in plant cells than in animal cells; in plants they take up most of the cell space.

Fimbria are hair-like structures on the surface of archaeal and bacterial cells. They are used to attach to surfaces and to cells of the same species, and play roles in biofilm formation, conjugation, DNA uptake, cell-cell interactions and twitching motility.

Outer membrane is the membrane in gram-negative bacteria that is separated from the cell membrane (inner membrane) by the periplasmic space.

1.4 Protein Sorting

The sorting of proteins within prokaryotic and eukaryotic cells is a reasonably well understood process. In eukaryotes, the sorting is much more complex, as it involves a larger number of different subcellular compartments. It is guided by a trafficking system, which starts with the synthesis of a protein in the cytosol and ends when the protein reaches its final destination. The transport of a protein depends on a number of factors, the key factor being the protein sequence itself. For proteins not remaining in the cytosol, the sequence encodes sorting signals that are present in the form of signal peptides or signal patches.


1.4.1 The Protein Trafficking System

The trafficking system incorporates three different mechanisms by which a protein can travel from one compartment to another [4]. These are schematically summarized in Figure 1.2: (1) The bidirectional gated transport occurs between the cytosol and the nucleus, where the nuclear pore complexes function as selective gates for specific macromolecules while allowing smaller molecules free passage. (2) The transmembrane transport involves the participation of membrane-bound protein translocators which allow the migration of specific proteins from the cytosol into the mitochondria, the ER, plastids or peroxisomes. The transported proteins must usually unfold during the translocation process. (3) The vesicular transport does not require crossing membranes. Instead, proteins are loaded onto transport vesicles that ferry them to a different compartment and discharge them by fusion with that compartment. The transport of soluble proteins from the ER to the Golgi apparatus, and from there to lysosomes, endosomes, secretory vesicles or the cell membrane, occurs in this way.

Figure 1.2: The trafficking system of proteins in eukaryotic cells. The movement of proteins between different subcellular compartments is realized via gated transport, transmembrane transport or vesicular transport. The signals that guide the movement of a protein through the system are contained in its amino acid sequence. Along the path, a decision is made at each intermediate station (boxes) whether the protein is to be retained in the compartment or transported further. Figure adapted from [4].


1.4.2 Sorting Signals

After synthesis on ribosomes, some proteins remain in the cytosol whereas others travel to one or more subcellular compartments. Thus, the final localization of a protein must somehow be determined by regions in the amino acid sequence that direct the protein outside the cytosol. Indeed, these regions are known as sorting signals, and they are classified into two types [4]: (1) The first type consists of a contiguous stretch, typically 15 to 30 residues long, at the C- or N-terminus of the sequence [13]. This signal peptide is usually cleaved from the protein by a signal peptidase once the translocation process has been completed. (2) The second type consists of residues that can be distant from each other in the amino acid sequence but that form a patch on the surface of the folded protein; this patch constitutes the signal.

1.5 Prediction of Subcellular Localization

Within the last two decades, large-scale sequencing projects have made a tremendous number of protein sequences available in public databases, but only a small portion of them has a reliable functional annotation. For example, in March 2011, the NCBI RefSeq database of nucleotide and protein sequences [14] counted over 12 million protein sequences from more than 12,000 genomes, while SWISS-PROT [15], the manually curated database of experimental annotations, contained only about 500,000 protein entries. Of these 500,000 proteins, only 300,000 were annotated with a subcellular localization. Though automated methods for the prediction of subcellular localization may not yet be as reliable as wet-lab experiments, their speed and prediction accuracy make them an appealing alternative to experimental methods. Most of the currently available prediction methods can roughly be divided into one of the following four categories:

(1) Sequence Homology Based Methods. Many efforts have been made to infer subcellular localization based on sequence similarity. Such methods include phylogenetic profiling [16], domain projection [17] and other sequence homology based methods [18, 19, 20]. Though homology based methods are the most reliable, their applicability is limited to cases where a homologous sequence with an experimental annotation is known.


(2) Sorting Signals Based Methods. This type of prediction method aims to identify short sequence motifs that are responsible for protein sorting. These include nuclear localization signals (NLS) [21, 22] and N-terminal peptides targeting proteins to chloroplasts, to mitochondria or for secretion [23, 24, 25, 26]. Unfortunately, the majority of signal peptides are still unknown, and for signal patches the situation is even worse.

(3) Text Analysis Based Methods. The growing number of publicly available biological databases has led to the development of prediction tools based on automatic text analysis. These methods transfer missing annotations from annotated Gene Ontology (GO) terms [27, 28], SWISS-PROT keywords [29, 30] or PubMed abstracts [31], and thus require the proteins to already be annotated to some degree.

(4) Ab initio Prediction Methods. Finally, the observation that subcellular localization correlates with amino acid composition has resulted in a variety of ab initio prediction methods. Many of them also use other sequence-derived features, such as structural and evolutionary information. These methods utilize machine learning algorithms, including neural networks [32], k-nearest neighbor classifiers [33], Markov models [34, 35] and support vector machines [36, 37, 38, 39, 40]. Though the accuracy of these methods is generally lower than that of the other methods described above, their advantage is that they can be applied to virtually any protein sequence without limitations.

In addition to the prediction methods mentioned above, there are also hybrid approaches that combine different sources of information. For example, the sequence-derived features used by ab initio predictors can be combined with features such as the presence of homologous sequences, signal peptides or GO annotations [41, 42, 43, 44, 45]. Other prediction methods integrate various prediction tools; these methods are called meta-predictors. A meta-predictor assigns the subcellular localization based on the most accurate prediction for a given query protein [46, 47, 48].


1.6 Novel Method

This thesis presents a novel sequence-based method for predicting the subcellular localization of archaeal, bacterial and eukaryotic proteins. Several challenges associated with the development of such a prediction method are addressed. First, the biological features to be used for the detection of localization signals in uncharacterized proteins are determined, since only a fraction of proteins have clearly identifiable localization signals in their sequence. Second, the data set used for the assessment of the prediction method is homology reduced in order to avoid overestimates of the prediction performance. Third, the prediction results are made interpretable for users, i.e. a measure of the reliability of each prediction is provided, so that users can concentrate on the results that are more likely to be correct. Finally, the prediction method is compared to other state-of-the-art methods using independent data sets to ensure realistic performance estimates. Different subcellular localization prediction methods are built on different data sets and predict different numbers of classes, which makes such a comparison a difficult task.

The development of the method presented here is performed on a set of protein sequences with experimentally determined localization annotations extracted from the SWISS-PROT database [15]. Using stratified k-fold cross-validation, the performance of several kernel functions and their parameters is evaluated. Additionally, the performance and computation time of several multiclass classifiers are examined. Three different types of classification models are built: a general model for the classification of all types of proteins, a model for non-membrane proteins and a model specialized for transmembrane proteins. Each prediction is accompanied by a probability score reflecting the level of confidence in the localization class assigned to the protein of interest. The prediction method is compared to the best publicly available predictors. The comparison is performed on the redundancy reduced test sets of our prediction method as well as on several additional sets that were not incorporated in the development of any prediction method tested here.

We believe that the method developed here, allowing the prediction of subcellular localization for sequences of soluble and membrane spanning proteins from all kingdoms of life, provides a valuable contribution to the field.

2 Materials and Methods

2.1 Workflow

The workflow of the development of the prediction methods for subcellular localization, as well as their assessment using the techniques described in the following sections, is shown in Figure 2.1. The first step of the development process involved the extraction of sets of archaeal, bacterial and eukaryotic proteins, together with their experimentally determined annotations of subcellular localization, from the SWISS-PROT database [15] (Section 2.2.1), as well as their homology reduction (Section 2.2.2). Each of the three data sets was split into subsets of non-membrane and transmembrane proteins in order to train classifiers specialized on non-membrane proteins, on transmembrane proteins and on proteins of all types. Since the set of archaeal transmembrane proteins contained labels of only one localization class (Table 2.2), a classifier specialized for archaeal transmembrane proteins was not built. Thus, the number of individual classifiers developed in this work was eight (Section 2.6.5). Each of the eight data sets was divided into five equally sized subsets (Section 2.7.3) in order to train and test the corresponding classification model. Four subsets were used for training and one for testing. The subsets were rotated such that each subset was used for testing exactly once. Note that a certain degree of homology was introduced within each of the training sets in order to increase their size (Section 2.2.4). Support vector machines (SVMs) were employed as the base classifiers for our prediction methods (Section 2.4). The performance of any SVM-based classifier depends highly on the choice of the kernel function and the tuning of its parameters. Therefore, the first task during the training process was to evaluate the performance of three different string-based kernel functions, namely the string subsequence kernel, the mismatch kernel and the profile kernel (Section 2.5). Each of the kernels was used in combination with the simplest and fastest multiclass classification approach, one-against-all (Section 2.6.1).


Figure 2.1: The workflow gives an overview of the steps involved in the construction of our classification approaches. The training phase involved four separate steps (kernel parameter optimization, kernel selection, classification model selection and model parameter optimization), because simultaneous training on all possible combinations of the parameters involved is unfeasible.

The performance was assessed by partitioning each of the training sets further into ten subsets. The classifier was trained on 90% of the initial training set and validated on the remaining 10%. This was repeated ten times, and the average over the ten classifiers formed the result. For each kernel, the parameters that led to the highest performance of the classifier were considered optimal. Finally, the kernel function with the highest average score over the five folds, together with its optimal parameters, was chosen for the subsequent applications.

The second task of the training phase was to assess the performance of different multiclass classification approaches: one-against-all, ensembles of nested dichotomies, ensembles of class balanced nested dichotomies, ensembles of data balanced nested dichotomies and predefined nested dichotomies (Section 2.6). Here, the performance was again evaluated for each of the five splits of the training set in a stratified 10-fold cross-validation. The three classifiers with the best average performance over the five splits of the training set were chosen.
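For illustration, the nested evaluation scheme described above can be sketched as follows. This is a minimal sketch using scikit-learn and synthetic placeholder data; the thesis itself used WEKA with custom string kernels, so the `SVC` classifier and the random features here are assumptions, not the actual setup.

```python
# Minimal sketch of the nested evaluation scheme: stratified 5-fold outer
# splits for testing, stratified 10-fold inner cross-validation on each
# training portion for model selection. Placeholder data throughout.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

X = np.random.rand(120, 20)          # placeholder feature vectors
y = np.repeat([0, 1, 2], 40)         # placeholder localization labels

outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in outer.split(X, y):
    X_tr, y_tr = X[train_idx], y[train_idx]
    inner = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    scores = []
    for fit_idx, val_idx in inner.split(X_tr, y_tr):
        clf = SVC(kernel="linear").fit(X_tr[fit_idx], y_tr[fit_idx])
        scores.append(clf.score(X_tr[val_idx], y_tr[val_idx]))
    # in the real workflow, the kernel and parameter set with the best
    # mean inner score would be selected here before touching the test split
    print("mean inner validation accuracy: %.3f" % np.mean(scores))
```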

The final task of the training process involved the tuning of the SVM cost parameter C (Section 2.4.2). Optimizing this parameter manually would be a very tedious task. Therefore, the meta learner CVParameterSelection [49] implemented in WEKA [50] was used, which performed the parameter optimization automatically for each binary SVM classifier using cross-validation. The range for the parameter C was defined between 0.001 and 1000, and the number of folds was set to 5. CVParameterSelection then built a classifier from the entire training set using the parameter value with the highest cross-validation performance.
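An analogous parameter search can be written with scikit-learn's GridSearchCV; the snippet below is only a hedged stand-in for WEKA's CVParameterSelection, with made-up data and a logarithmic grid over the stated range.

```python
# Hedged analog of WEKA's CVParameterSelection: choose the SVM cost C on a
# logarithmic grid between 0.001 and 1000 via 5-fold cross-validation, then
# refit on the entire training set with the winning value (refit=True is
# the GridSearchCV default). Data below are placeholders.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X = np.random.rand(60, 10)
y = np.repeat([0, 1], 30)

search = GridSearchCV(SVC(kernel="linear"),
                      param_grid={"C": [1e-3, 1e-2, 1e-1, 1, 10, 100, 1000]},
                      cv=5)
search.fit(X, y)
print(search.best_params_["C"])
```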

The outcome of the training phase is a classifier with the highest performance and the best generalization power for unseen cases. The classifiers were therefore applied to the test sets, which were unseen during training. Additionally, the test sets were used to compare the performance of our classifiers to the performances of four state-of-the-art publicly available predictors of subcellular localization: CELLO v.2.5, LOCtree, MultiLoc2 and WoLFPSORT (Section 2.8).

After the classification model was selected and its parameters optimized, each of our eight classifiers was retrained on the corresponding complete data set, without any partitioning. The re-trained classifiers constituted our final classification models, which were subsequently benchmarked against the four external predictors on the data sets derived from the LocDB database and from newly added SWISS-PROT proteins (Section 2.3).


2.2 Data Sets for Development

The SWISS-PROT database [15] is a manually curated knowledge base of protein sequences and their functional annotations. SWISS-PROT entries are continuously updated and annotated by expert biologists. This database is therefore the standard resource to use when working with proteins. We extracted three data sets of archaeal, bacterial and eukaryotic protein sequences, together with their subcellular localization annotations, from SWISS-PROT release 2011_04. Each of the data sets was first internally homology reduced and then divided into training and test sets by a stratified 5-fold cross validation. As the performance of an SVM-based classifier increases with growing training set size [51], we subsequently increased the size of all our training sets by allowing a certain degree of redundancy within them. The procedure of data preparation for the development of our classification approaches is described in the sections below.

2.2.1 Data Sets Extraction

Taxonomic classification. The taxonomic classification of the source organism, as maintained at the NCBI [52], was identified by checking the annotation information in the organism classification (OC) lines of a SWISS-PROT entry. Archaeal, bacterial and eukaryotic proteins were selected.

Subcellular localization annotation. Subcellular localization annotations from the comment (CC) lines starting with "-!- SUBCELLULAR LOCATION" were selected. Proteins annotated with unclear localizations were excluded. For example, a bacterial protein annotated with "SUBCELLULAR LOCATION: Cell membrane" in the CC field was not included, as it is not clear whether the protein is found in or associated with the inner or the outer cell membrane. In addition, proteins annotated with two or more subcellular localizations were also not included in the data set; for example, a protein entry annotated with "SUBCELLULAR LOCATION: Cell membrane. Cytoplasm." was excluded. Since the products of alternative splicing can be localized in different subcellular compartments due to the presence or absence of a sorting signal, proteins annotated with 'Isoform' in the CC lines were also excluded from the data set. Table 2.1 summarizes the

keywords that were used to search against the categorization of localizations in the CC lines. Note that several subcellular localizations are categorized by multiple keywords, meaning that proteins matching any of these keywords were included.

Non-membrane and transmembrane proteins. Protein entries lacking the term "membrane" were considered non-membrane proteins. Transmembrane proteins, i.e. proteins spanning the membrane at least once, were identified by the additional terms "Single-pass" or "Multi-pass" in the subcellular location lines.

Experimental annotation. All proteins with subcellular localization annotations not based on experimentally proven findings were excluded. These were indicated by the qualifiers "Potential", "Probable" and "By similarity" in the CC lines. The term "Potential" indicates that there is some logical or conclusive evidence that the given annotation could apply. This qualifier is used to make annotations based on the results of protein sequence analysis tools, if the result makes sense in the context of a given protein. The term "Probable" is stronger than "Potential", denoting that there is at least some experimental evidence for the provided information. The term "By similarity" is added to annotations that were proven for a protein or a part of it and that are then transferred to other protein family members within a certain taxonomic range.

Data statistics. All proteins with a sequence length of less than 50 amino acids were removed. This left around 600 non-membrane and 70 transmembrane archaeal proteins, 11,300 non-membrane and 1,100 transmembrane bacterial proteins, and 29,000 non-membrane and 6,000 transmembrane eukaryotic proteins. These proteins comprised altogether 18 localization classes in the data set of experimentally annotated subcellular localizations derived from SWISS-PROT.
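The extraction rules above can be summarized in a short filtering sketch. The field handling and the multi-localization check below are simplifications introduced for illustration; the real parser works on full SWISS-PROT records.

```python
# Simplified sketch of the SWISS-PROT filtering rules of Section 2.2.1;
# the CC-line handling here is deliberately crude.
NON_EXPERIMENTAL = ("Potential", "Probable", "By similarity")

def accept_entry(cc_location: str, sequence: str) -> bool:
    """True if an entry passes the extraction filters described above."""
    if any(q in cc_location for q in NON_EXPERIMENTAL):
        return False                  # not experimentally verified
    if "Isoform" in cc_location:
        return False                  # alternative splicing products
    if cc_location.count(".") > 1:    # crude multi-localization check,
        return False                  # e.g. "Cell membrane. Cytoplasm."
    if len(sequence) < 50:            # too short for signal detection
        return False
    return True

def is_transmembrane(cc_location: str) -> bool:
    return "Single-pass" in cc_location or "Multi-pass" in cc_location

print(accept_entry("SUBCELLULAR LOCATION: Cytoplasm.", "M" * 100))  # True
```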

2.2.2 Homology Reduction

Bias in protein databases towards certain protein families [53] may lead to significant overestimates of classifier performance if classifiers are trained on highly homologous sequences. In such cases, the predictions are based on highly similar sequences rather than on general features. Thus, the development and evaluation of our classification approaches involved the construction of sequence-unique sets, which were obtained in two redundancy reduction steps.


Localization       Keywords in SWISS-PROT
Chloroplast        Chloroplast
Cytoplasm          Cytoplasm
Plasma membrane    Cell membrane
Outer membrane     Outer membrane
Endosome           Endosome
ER                 Endoplasmic reticulum, Sarcoplasmic reticulum
Fimbrium           Fimbrium
Golgi apparatus    Golgi
Inner membrane     Inner membrane
Lysosome           Lysosome
Melanosome         Melanosome
Mitochondria       Mitochondrion
Nucleus            Chromosome, Nuclear, Nucleus, Perinuclear region
Periplasm          Periplasm
Peroxisome         Peroxisome, Glyoxysome, Glycosome
Plastid            Plastid
Secreted           Secreted, Cell junction
Vacuole            Vacuole

Table 2.1: Keywords used to search against the CC lines in the SWISS-PROT database. One keyword is sufficient for the assignment of a protein to one of the seventeen subcellular localization classes.

In the first step, the UniqueProt software [54] was run to remove sequences from our sets of archaeal, bacterial and eukaryotic proteins until each of the sets contained only sequence pairs with HSSP-value≤0 (equation 2.1). This value was chosen because below this threshold the annotation of subcellular localization based on the sequence homology was found to be unreliable [55].


In the second reduction step, an all-against-all BLAST2 [56] search at E-value ≤ 10⁻³ was performed. In both redundancy reduction steps, information from the entire protein sequences was used, including the low complexity regions, and only alignments of at least 35 residues were considered, as this is the expected upper bound on the size of the majority of localization signals [57, 21]. Due to the very low number (below 5) of eukaryotic protein sequences left in the localization classes lysosome, melanosome and endosome after the redundancy reduction procedure, these classes were excluded from further analyses. The final sequence-unique sets comprised 59 archaeal, 479 bacterial and 1682 eukaryotic proteins, as summarized in Table 2.2.
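A greedy version of this internal redundancy reduction might look as follows. `pairwise_hits` is a hypothetical list of precomputed all-against-all alignments, and `hssp_value` implements equation 2.1 from Section 2.2.3; UniqueProt's actual strategy for choosing which sequence to drop is more sophisticated.

```python
# Greedy sketch of the two-step internal redundancy reduction: a pair is
# treated as redundant if its alignment spans at least 35 residues and is
# either above the HSSP curve (HSSP-value > 0) or has BLAST E-value <= 1e-3.
# One member of every redundant pair is dropped.
def reduce_redundancy(ids, pairwise_hits, hssp_value):
    keep = set(ids)
    for a, b, pid, aln_len, evalue in pairwise_hits:
        if a not in keep or b not in keep or aln_len < 35:
            continue
        if hssp_value(aln_len, pid) > 0 or evalue <= 1e-3:
            keep.discard(b)   # UniqueProt is smarter about which one to keep
    return keep
```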

Localization              Archaea           Bacteria           Eukaryota
                          NM   TM   Sum     NM   TM   Sum      NM    TM   Sum
Chloroplast                -    -    -       -    -    -       133   11   144
Cytosol                   41    -   41     179    -  179       220    -   220
Endoplasmic Reticulum      -    -    -       -    -    -        10   65    75
Extracellular space        5    -    5      82    -   82       596    -   596
Fimbrium                   -    -    -      16    -   16         -    -     -
Golgi apparatus            -    -    -       -    -    -         3   17    20
Mitochondria               -    -    -       -    -    -       140   87   227
Nucleus                    -    -    -       -    -    -       320    5   325
Outer membrane             -    -    -       -    6    6         -    -     -
Plasma membrane            -   13   13       -  144  144         -   40    40
Periplasm                  -    -    -      52    -   52         -    -     -
Peroxisome                 -    -    -       -    -    -         6    2     8
Plastid                    -    -    -       -    -    -        14    -    14
Vacuole                    -    -    -       -    -    -         3   10    13
Total                     46   13   59     329  150  479      1445  237  1682

Table 2.2: Number of sequences per localization class in the homology reduced sets of archaeal, bacterial and eukaryotic proteins. Each of the three homology reduced sets contained no sequence pair with HSSP-value > 0. Abbreviations: NM, non-membrane; TM, transmembrane.


2.2.3 Scores for Measuring Sequence Similarity

Two scores were used to measure the pairwise sequence similarity of proteins. The first score is the E-value as derived from BLAST searches. The second score is the HSSP-value introduced by Sander and Schneider [58] and modified by Rost [59]. The HSSP-value is estimated by first aligning two sequences against each other by pairwise BLAST. Then the length of the alignment L (not counting gaps) and the percentage of identical residues PID, i.e. the number of identical residues in the alignment times 100, divided by L, are transformed into the HSSP-value by:

\[
\text{HSSP-value}(L, PID) = PID - \begin{cases}
100 & \text{for } L \le 11 \\
480 \cdot L^{\,-0.32\,(1 + e^{-L/1000})} & \text{for } 11 < L \le 450 \\
19.5 & \text{for } L > 450
\end{cases}
\tag{2.1}
\]

The HSSP-value estimates whether a protein alignment is above the HSSP curve (HSSP-value>0) or below it (HSSP-value<0) (Figure 2.2). Positive HSSP values denote a degree of sequence proximity and therefore structural homology whereas negative HSSP values provide an estimate about the distance between two aligned protein sequences. Nair and Rost [18] showed that HSSP-values are also accurate in distinguishing protein pairs with co-localizations from those with different localizations.
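For concreteness, equation 2.1 translates directly into a short function; the sketch below assumes L is the ungapped alignment length and PID the percentage of identical residues, as defined above.

```python
# Direct transcription of equation 2.1.
import math

def hssp_value(L: int, pid: float) -> float:
    if L <= 11:
        threshold = 100.0
    elif L <= 450:
        threshold = 480.0 * L ** (-0.32 * (1.0 + math.exp(-L / 1000.0)))
    else:
        threshold = 19.5
    return pid - threshold

# a 100-residue alignment at 40% identity lies above the HSSP curve:
print(hssp_value(100, 40.0) > 0)   # True (the value is roughly +11)
```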

Figure 2.2: The HSSP-curve separates proteins of homologous and non-homologous structures. A pairwise alignment can be represented as a dot in the graph. If the dot is placed above the curve, then structural homology can be inferred.


2.2.4 Size Increase of the Training Sets

In order to obtain unbiased estimates, we have to analyze homology reduced sets of protein sequences. However, it has been shown that larger data sets improve the performance of SVMs through an increased coverage of the sequence space [51], and the performance of SVMs can also be improved through training on sequence redundant sets. Therefore, the size of our training data was increased by allowing a certain degree of homology within each of the training sets of archaeal, bacterial and eukaryotic proteins. This procedure increased the size of our training data considerably, by a factor of 5.

Outline of the algorithm (a code sketch follows the steps below):

1. Start with the homology reduced set and align it against all proteins extracted from SWISS-PROT by pairwise BLAST (e.g. BLAST2 at E-value < 10⁴ in our case).

2. Compile HSSP-values (equation 2.1) for each pair of aligned sequences.

3. Find all structural homologs to the sequences in the homology reduced set at HSSP-value ≥ 60.

4. Align all sequences found in the previous step against each other by a pairwise BLAST.

5. Find all pairs that are structural homologs at HSSP-value ≥ 60.

6. Remove sequences from the previous step that have HSSP-value>0 to more than one sequence in the homology reduced set.
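In code, the enrichment could be sketched as below. `blast_hits` is a hypothetical helper yielding (seed, hit, PID, alignment length) tuples from the permissive BLAST search of step 1, and steps 4 and 5 (the all-against-all alignment among the homologs themselves) are omitted for brevity.

```python
# Sketch of the training-set enrichment: collect close homologs of the
# sequence-unique seeds (steps 1-3) and keep only homologs tied to exactly
# one seed (step 6), so each added protein inherits an unambiguous
# localization label. Steps 4-5 are omitted here.
def enrich(seed_ids, blast_hits, hssp_value, threshold=60):
    seeds_of = {}                           # homolog -> seeds it matches
    for seed, hit, pid, aln_len in blast_hits(seed_ids):
        if hit not in seed_ids and hssp_value(aln_len, pid) >= threshold:
            seeds_of.setdefault(hit, set()).add(seed)
    return {h for h, s in seeds_of.items() if len(s) == 1}
```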


2.3 Data Sets for Testing Only

A fair comparison between the performances of different prediction methods is possible only when they are all trained on exactly the same data set. Unfortunately, the external prediction methods were all trained on sets from different SWISS-PROT releases. Therefore, the predictors developed here were evaluated, and their performances compared to those of CELLO v.2.5, LOCtree, MultiLoc2 and WoLFPSORT, not only on the data described in Section 2.2 but also on the following independent test sets:

◦ Homo sapiens and Arabidopsis thaliana proteins from the LocDB database [60].

◦ SWISS-PROT proteins added between releases 2011_04 and 2011_09.

Each of the independent data sets was first filtered for sequences with a length of at least 50 residues to enable the identification of localization signals. The sets were then homology reduced against the whole set of proteins used for the development of our prediction methods. This homology reduction was performed such that between the two sets no protein pair had a level of sequence similarity corresponding to HSSP-value > 5. This threshold ensured a sequence identity of at most 25% over 250 aligned residues. Afterwards, the independent sets were internally homology reduced at HSSP-value ≤ 0 and BLAST E-value ≤ 10⁻³ with a minimum alignment length of 35 residues. After the homology reduction, the remaining protein sequences in the independent sets had never been used for the development of our classifiers. With the exception of LOCtree, which applied homology based and text analysis based predictors to SWISS-PROT proteins to increase its data set, and WoLFPSORT, which extracted an additional set of Arabidopsis thaliana proteins from the Gene Ontology website, it is very unlikely that any of the other predictors tested here used any of these remaining proteins, as they were all developed on SWISS-PROT releases older than 2011_04. It should be mentioned that proteins added from sources other than SWISS-PROT to the training sets of LOCtree and WoLFPSORT may lead to an underestimation of the performance of the methods to which they are compared, including our method.

2.3.1 LocDB

LocDB [60] is an expert driven database of experimental annotations for the subcellular localizations of proteins in Homo sapiens and Arabidopsis thaliana.

Each LocDB entry is derived from the UniProt database [61], which is a combination of the TrEMBL (automatically annotated) and the SWISS-PROT (manually annotated) databases. By collecting subcellular localization information from the primary literature and other databases, LocDB extends the number of experimental annotations for Homo sapiens and Arabidopsis thaliana contained in SWISS-PROT by a factor of 4. The subcellular localization annotations are provided according to the Gene Ontology terminology [62] and grouped into 12 classes: cytoplasm, endoplasmic reticulum, endosome, extracellular, Golgi apparatus, mitochondrion, nucleus, peroxisome, plasma membrane, plastid, vacuole and vesicles. Since we were interested in the prediction of proteins found in a single subcellular localization, all proteins annotated as multi-localized in the LocDB set were eliminated. Furthermore, proteins derived from TrEMBL, i.e. proteins whose existence has not been supported experimentally, were also removed. All proteins with a sequence length of less than 50 residues were excluded, as were proteins whose localization annotation did not correspond to the localization classes defined for our methods. After the homology reduction procedure, sequence-unique sets of 232 Homo sapiens and 67 Arabidopsis thaliana proteins were obtained (Table 2.3).

Localization              Homo sapiens    Arabidopsis thaliana
Cytosol                   63              4
Endoplasmic Reticulum     9               4
Extracellular space       16              8
Golgi apparatus           7               3
Mitochondria              43              9
Nucleus                   39              9
Plasma membrane           50              15
Peroxisome                -               3
Vacuole                   5               12
Total                     232             67

Table 2.3: Number of sequences per localization class in the homology reduced sets of Homo sapiens and Arabidopsis thaliana proteins derived from LocDB. Both sets contained no sequence pairs with HSSP-value > 0 (eq. 2.1) and no protein sequences with HSSP-value > 5 to any of the sequences used for the development of our prediction methods. Abbreviations as in Table 2.2.


2.3.2 Newly Added SWISS-PROT Proteins

All proteins added to the SWISS-PROT database between releases 2011_04 and 2011_09 were collected in accordance with the procedure described in Section 2.2.1. The data obtained represented the most reliable set for the comparative evaluation of the different prediction methods. Unfortunately, it suffered from its small size, as the homology reduction procedure removed all archaeal proteins, leaving only 12 bacterial and 53 eukaryotic proteins in the homology reduced set. Due to the extremely small size of the homology reduced bacterial data set, it was not used for our analyses. The number of protein sequences in each localization class in the homology reduced data set of eukaryotic proteins is given in Table 2.4.

Localization              Eukaryota
Chloroplast               1
Cytosol                   12
Endoplasmic reticulum     3
Golgi apparatus           1
Extracellular space       8
Mitochondria              5
Nucleus                   18
Plasma membrane           3
Vacuole                   2
Total                     53

Table 2.4: Number of sequences per localization class in the homology reduced set of eukaryotic SWISS-PROT proteins added between releases 2011_04 and 2011_09. The data set contained no sequence pairs with HSSP-value > 0 and no protein sequences with HSSP-value > 5 to any of the sequences used for the development of our prediction methods.


2.4 Support Vector Machines

Support Vector Machines (SVMs) are a well-established and widely used machine learning algorithm introduced by Cortes and Vapnik in 1995 [63]. SVMs separate binary labeled data by finding an optimal separating hyperplane that is maximally distant from both classes. The optimal separating hyperplane generalizes well, meaning that it can successfully be applied to new, unseen data. SVMs have shown excellent performance in various protein classification tasks. Examples include the detection of remote protein homology [64], the prediction of DNA binding residues [65], the classification of protein secondary structure [66], the identification of surface loop flexibility [67] and the discrimination between coding and non-coding RNAs [68]. The first part of this section describes how SVMs find an optimal separating hyperplane in the case of linearly separable training data. Then the case of outliers, for which no perfect linear separation is possible, is dealt with. Finally, kernel functions and kernel matrices are introduced, which allow a linear separation, in a higher dimensional feature space, of data that are not linearly separable in the original space. The following descriptions are based on the tutorial by Burges [69] and the publication of Cortes and Vapnik [63].

2.4.1 Linear Classification

In the case where the training data are perfectly linearly separable, SVMs build a hyperplane that separates the two classes. Instances with one label are then found only on one side of the hyperplane, whereas instances with the other label lie only on the other side. Obviously, there can be more than one separating hyperplane for two classes (Figure 2.3A). However, the optimal separating hyperplane is the hyperplane with the maximum distance to both classes (Figure 2.3B); it is the unique solution to the separating hyperplane problem. More formally, let x be the training data associated with labels +1 and -1,

\[ (x_1, y_1), \ldots, (x_n, y_n), \quad x_i \in \mathbb{R}^n, \; y_i \in \{+1, -1\}, \]

that can be separated by the hyperplane

\[ (w \cdot x) + b = 0 \tag{2.2} \]


where w is the normal vector of the hyperplane. The term $w \cdot x$ is the dot product of the vectors w and x, defined as $\sum_i w_i x_i$. Additionally, $|b|/\|w\|$ is the perpendicular distance from the hyperplane to the origin, where $\|w\| = \sqrt{\sum_i w_i^2}$ is the Euclidean norm of w.

Figure 2.3: The case where binary labeled data can be linearly separated by a hyperplane. (A) Each of the three hyperplanes separates the two classes linearly. (B) The optimal separating hyperplane is, however, the hyperplane with the maximal margin (the distance between the hyperplanes H1 and H2). Figure adapted from [63].

The optimal separating hyperplane is the hyperplane with the largest margin, i.e. the distance between the closest vector and the hyperplane is maximal. The preferred choice is thus the w and b that maximize the margin, or equivalently the distance between two parallel hyperplanes H1 and H2 that are placed as far from each other as possible while still separating the data. These hyperplanes can be described by H1: $(w \cdot x) + b = +1$ and H2: $(w \cdot x) + b = -1$. If none of the data points fall between the hyperplanes H1 and H2, the training data satisfy the following conditions:

\[ x_i \cdot w + b \ge +1 \quad \text{for } y_i = +1 \tag{2.3} \]

\[ x_i \cdot w + b \le -1 \quad \text{for } y_i = -1 \tag{2.4} \]

which can be combined to

\[ y_i (x_i \cdot w + b) - 1 \ge 0, \quad i \in \{1, \ldots, N\} \tag{2.5} \]


The hyperplane H1 has the distance $|1 - b|/\|w\|$ from the origin. Similarly, the hyperplane H2 has the distance $|-1 - b|/\|w\|$ from the origin. Thus, the margin is simply $2/\|w\|$. As a result, by minimizing $\|w\|^2$ with respect to the constraints 2.5, we can find the pair of hyperplanes that maximizes the margin.

Support vectors. The training points that lie on one of the hyperplanes H1 and H2 are called support vectors. Their removal or change leads to a modification of the optimal separating hyperplane.

Lagrangian formulation. The optimization problem stated above can be described using a Lagrangian formulation. By introducing positive Lagrange multipliers $\alpha_i$, one for each constraint in equation 2.5, we obtain the Lagrangian:

\[ L = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j \, (x_i \cdot x_j) \tag{2.6} \]

subject to:

\[ \sum_{i=1}^{N} \alpha_i y_i = 0, \quad \alpha_i \ge 0 \tag{2.7} \]

The optimization problem now is to maximize L with respect to the αi, subject to constraints 2.7. The solution for w can then be given by:

\[ w = \sum_{i=1}^{N_s} \alpha_i y_i x_i \tag{2.8} \]

where $N_s$ is the number of support vectors. Therefore, according to equation 2.8, the optimal hyperplane is determined by the Lagrange multipliers and the support vectors.

Finally, the linear decision function for any instance x based on the optimal separating hyperplane is:

\[ f(x) = \mathrm{sign}(w \cdot x + b) \tag{2.9} \]

where

\[ \mathrm{sign}(a) = \begin{cases} +1, & a \ge 0 \\ -1, & a < 0 \end{cases} \]
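As a worked toy example of equations 2.8 and 2.9, consider two points that are both support vectors; the coordinates and multipliers below are chosen by hand so that the margin constraints hold exactly, purely for illustration.

```python
# Toy instance of equations 2.8 and 2.9: with x1=(1,1), y1=+1 and
# x2=(2,3), y2=-1, both points are support vectors, and alpha1=alpha2=0.4
# satisfies the margin constraints w.x_i + b = y_i exactly.
import numpy as np

sv = np.array([[1.0, 1.0], [2.0, 3.0]])    # support vectors
y = np.array([+1.0, -1.0])
alpha = np.array([0.4, 0.4])               # Lagrange multipliers

w = (alpha * y) @ sv                       # eq. 2.8: w = sum a_i y_i x_i
b = y[0] - w @ sv[0]                       # from w.x1 + b = +1

def f(x):                                  # eq. 2.9
    return 1 if w @ x + b >= 0 else -1

print(w, b)                                # [-0.4 -0.8] 2.2
print(f(np.array([0.0, 0.0])), f(np.array([3.0, 4.0])))   # 1 -1
```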


2.4.2 Soft Margin

The previous section introduced the method for perfectly linearly separable data. However, if applied to non-separable (i.e. noisy) data, the method will fail to find a feasible hyperplane. The soft margin method introduces a solution to this. It relaxes the constraints 2.3 and 2.4 by introducing a cost for errors, the slack variables

$\xi_i \ge 0, \; i \in \{1, \ldots, N\}$ (Figure 2.4). The constraints then become:

\[ x_i \cdot w + b \ge +1 - \xi_i \quad \text{for } y_i = +1 \tag{2.10} \]

\[ x_i \cdot w + b \le -1 + \xi_i \quad \text{for } y_i = -1 \tag{2.11} \]

Figure 2.4: The case where outliers (indicated by red circles) make the data not perfectly linearly separable. The introduction of slack variables $\xi$ allows SVMs to generate an approximately linearly separating hyperplane in such a case.

From the constraints 2.10 and 2.11 it follows that a way to assign an extra cost for errors is to change the objective function to be minimized from $\|w\|^2/2$ to $\|w\|^2/2 + C(\sum_i \xi_i)^k$, where C is a parameter to be chosen manually. A large C corresponds to assigning a higher penalty to errors. The problem can again be expressed in the Lagrangian formulation (eq. 2.6) with $0 \le \alpha_i$, $\sum_i \alpha_i y_i = 0$ and the additional constraint $\alpha_i \le C$. The solution for w is given by equation 2.8. The only difference to the optimal separating hyperplane is that the $\alpha_i$ are here bounded from above by C.


2.4.3 Non-Linear Classification

The previous two sections described how SVMs use a linear decision function for perfectly linearly separable data and for data containing noise and outliers. Unfortunately, most real-world data cannot be separated linearly, i.e. by a straight plane between the two classes. The solution to this, in addition to setting soft margins, is to map the linearly non-separable data into a higher dimensional space in which a linear decision function can be found. Using the terminology of the previous two sections, we extend the use of SVMs to linearly non-separable data and introduce the following definitions.

Feature Space. The high dimensional space in which data that are linearly non-separable in the original space can be separated by a linear decision function is called the feature space.

Feature Map. Let $\Phi : \mathbb{R}^n \rightarrow \mathbb{R}^m$ be a non-linear function that maps input vectors of the space $\mathbb{R}^n$ to vectors of a higher dimensional space $\mathbb{R}^m$, with $m > n$. This mapping is called the feature map.

The Kernel Trick. First notice that, according to eq. 2.6, the data appear in the training problem only in the form of dot products x_i · x_j. If we map the data by the function Φ into a higher dimensional space, the training algorithm depends on the data only in the form of Φ(x_i) · Φ(x_j). Computing these dot products explicitly is very cost intensive or even intractable, especially when the dimensionality of the feature space is infinite. Boser et al. [70] showed that for the construction of the optimal separating hyperplane in a high dimensional feature space one does not need to consider the space in its explicit form. Instead, one defines a kernel function that directly calculates the dot products between the support vectors and the vectors of the feature space (eq. 2.8 and 2.9).

Kernel Function. The kernel function is a function of the form

K(x_i, x_j) = Φ(x_i) · Φ(x_j),   (2.12)

that allows the computation of a dot product between two vectors in a high dimensional space without explicitly knowing Φ. According to equation 2.9, the decision function of the optimal separating hyperplane can be formulated using K:

f(x) = sign(Σ_{i=1}^{N_s} α_i y_i K(x_i, x) + b),   (2.13)

where the x_i are support vectors and the explicit computation of Φ(x) is avoided by using K(x_i, x) = Φ(x_i) · Φ(x) instead. The situation is schematically illustrated in Figure 2.5.

Figure 2.5: The realization of an optimal separating hyperplane in the high dimensional space by SVMs. The input vectors are mapped from ℝ^n into a higher dimensional space ℝ^m, where the kernel function calculates the dot product of the vectors, thus allowing the construction of an optimal separating hyperplane in the high dimensional feature space. Figure adapted from [63].

The advantage of the kernel function is two-fold. First, it allows the construction of a decision function that is non-linear in the input space but equivalent to a linear decision function in the high dimensional feature space. Second, the runtime of a kernel function does not differ significantly from that of a function working with the un-mapped training data.
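A worked toy example of equation 2.12: for the homogeneous quadratic kernel K(x, y) = (x · y)² on ℝ², the explicit feature map Φ(x) = (x₁², √2·x₁x₂, x₂²) reproduces exactly the same value. A minimal Python sketch (illustrative, not part of the thesis code):

    import numpy as np

    def phi(x):
        # Explicit feature map for the homogeneous quadratic kernel, R^2 -> R^3.
        return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

    def k(x, y):
        # Kernel function: the same dot product in R^3, without ever computing phi.
        return (x @ y) ** 2

    x, y = np.array([1.0, 2.0]), np.array([3.0, 0.5])
    print(phi(x) @ phi(y))   # 16.0 (up to floating-point rounding)
    print(k(x, y))           # 16.0 -- identical, as equation 2.12 promises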


2.4.4 The Kernel Matrix

From equation 2.13 it follows that we only need pairwise kernel values to find the hyperplane. For a set of input points we can compute a kernel matrix, a square matrix containing the values produced by a kernel function K. It can serve as input to the SVM and thus reduce its running time by avoiding the re-computation of kernel values. An important property of a kernel matrix is that it is positive semi-definite, i.e. for all x ∈ ℝ^n we have

x^T K x = Σ_{i,j=1}^{n} x_i x_j K_ij ≥ 0   (2.14)

The structure of a kernel matrix can be expressed as:

K = ( K_Tr       K_TrTe
      K_TrTe^T   K_Te   ),   (2.15)

where Tr denotes the training instances (labeled), Te the test instances (unlabeled) and K(x_i, x_j) = Φ(x_i) · Φ(x_j) with x_i, x_j ∈ ℝ^n. The SVM uses the labeled part of the data to learn to generalize well, and the classifications are made on the unlabeled part.

Normalization. The generalization and thus the overall performance of the SVM can be significantly improved by normalizing the data in the feature space [71]. In this case, only feature vectors of the same length enter the optimization problem. The normalization is achieved by replacing K(x_i, x_j) with K(x_i, x_j)/√(K(x_i, x_i) · K(x_j, x_j)).

Application to the test data. The computation of the kernel values is a time-intensive task, especially for large data sets. Therefore, for training purposes, the K_Tr sub-matrix was computed only once. For testing purposes, K_Tr was extended with the kernel values between the training and the test instances (the K_TrTe sub-matrices) as well as the values between each test instance and itself (the diagonal of the K_Te sub-matrix, required for normalization). All other values in the kernel matrix K were set to 0.
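A minimal Python sketch of this normalization and of feeding a precomputed kernel matrix to an SVM; the use of scikit-learn and of a plain linear kernel are assumptions made purely for illustration:

    import numpy as np
    from sklearn.svm import SVC

    def normalize(K):
        """Replace K(x_i, x_j) with K(x_i, x_j) / sqrt(K(x_i, x_i) * K(x_j, x_j))."""
        d = np.sqrt(np.diag(K))
        return K / np.outer(d, d)

    rng = np.random.default_rng(1)
    X_tr, X_te = rng.normal(size=(30, 5)), rng.normal(size=(10, 5))
    y_tr = rng.integers(0, 2, 30)

    K_tr = normalize(X_tr @ X_tr.T)              # training sub-matrix, computed once
    clf = SVC(kernel="precomputed").fit(K_tr, y_tr)

    # For testing, only the test-vs-training values and each test point's
    # self-kernel (for normalization) are needed.
    K_te = X_te @ X_tr.T
    d_te = np.sqrt(np.einsum("ij,ij->i", X_te, X_te))   # K(x, x) per test point
    d_tr = np.sqrt(np.einsum("ij,ij->i", X_tr, X_tr))
    print(clf.predict(K_te / np.outer(d_te, d_tr)))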


2.4.5 Sequential Minimal Optimization

All prediction methods established in this work were developed using a freely available Java implementation of SVMs in the standard distribution of the WEKA workbench [50]. In WEKA, SVMs are implemented using the Sequential Minimal Optimization (SMO) algorithm [72]. The idea behind the SMO algorithm is to solve the optimization problem by dividing it into sub-problems that can be solved individually. This speeds up the runtime of the SVM algorithm, especially when applied to large data sets.

2.5 String Kernels

There are a number of sequence-based kernel functions designed for protein classification tasks. In this work, the String Subsequence Kernel [73], the Mismatch Kernel [74] and the Profile Kernel [75] were applied and their performances evaluated. The main idea behind them is to compare two protein sequences by counting the number of common subsequences of a fixed length. No biological knowledge is incorporated, in the sense that protein sequences are simply represented as strings of amino acids. Some notation used throughout the text is introduced here. We define a finite alphabet of amino acids A with |A| = 20. A protein sequence s is represented by a set of (not necessarily contiguous) subsequences of length k ≥ 1, the so-called k-mers. These are used for a non-linear transformation of the protein sequences into a high dimensional space.

2.5.1 String Subsequence Kernel

The string subsequence kernel (SSK) [73] measures the similarity between two sequences by the weighted number of common k-mers, which may be contiguous as well as non-contiguous. The degree of contiguity determines the weight: more contiguous subsequences receive higher weights, while subsequences with gaps are weighted less. The SSK feature space is determined by three values: k, λ and l. The length of a k-mer shared between two sequences is given by k. λ is a real value between 0 and 1, the decay factor that penalizes non-contiguous k-mers. Lastly, l is the length of an occurrence of α = α_1...α_k, α ∈ A^k, in a sequence s, that is, the length of the k-mer including the gaps.


More formally, let |s| be the length of a sequence s. If there exist indices i = (i_1, ..., i_|α|) with 1 ≤ i_1 < ... < i_|α| ≤ |s| such that α_j = s_{i_j}, then α is called a subsequence of s, written α = s[i]. The length of the occurrence is given by l(i) = i_|α| − i_1 + 1. The SSK feature map for a given sequence s is defined by:

Φ_α(s) = Σ_{i: α = s[i]} λ^{l(i)},   α ∈ A^k, 0 ≤ λ ≤ 1   (2.16)

The SSK kernel for two input sequences s_1 and s_2 is defined as the dot product of their feature vectors, holding the weighted sum of the number of all common subsequences of length k:

K_k(s_1, s_2) = Σ_{α ∈ A^k} Φ_α(s_1) · Φ_α(s_2)
             = Σ_{α ∈ A^k} Σ_{i: α = s_1[i]} Σ_{j: α = s_2[j]} λ^{l(i)} λ^{l(j)}
             = Σ_{α ∈ A^k} Σ_{i: α = s_1[i]} Σ_{j: α = s_2[j]} λ^{l(i)+l(j)}   (2.17)

At first glance, the SSK looks like an exponential-time algorithm, but Lodhi et al. [73] propose an efficient recursive solution based on dynamic programming that allows the computation of the kernel in O(k|s_1||s_2|) time and space.
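For illustration, the following Python sketch evaluates equation 2.17 by brute force, enumerating all index tuples; it is exponential and only feasible for very short toy strings, unlike the dynamic-programming solution of [73]:

    from itertools import combinations

    def ssk(s, t, k, lam):
        """Brute-force SSK (equation 2.17): sum of lam^(l(i)+l(j)) over all
        index tuples i in s and j in t that spell the same k-mer."""
        def weighted_kmers(seq):
            out = {}
            for idx in combinations(range(len(seq)), k):
                kmer = "".join(seq[i] for i in idx)
                span = idx[-1] - idx[0] + 1          # l(i), gaps included
                out[kmer] = out.get(kmer, 0.0) + lam ** span
            return out
        a, b = weighted_kmers(s), weighted_kmers(t)
        return sum(w * b.get(kmer, 0.0) for kmer, w in a.items())

    print(ssk("cat", "cart", k=2, lam=0.5))   # shared 2-mers: ca, ct, at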

2.5.2 Mismatch Kernel

The mismatch kernel [74] measures the similarity between two sequences by counting the number of k-mers shared between them, allowing a certain number of mismatches by which the k-mers may differ. The number of mismatches is given by the parameter m. More specifically, the mismatch kernel defines the mismatch neighborhood N_(k,m)(α) around the contiguous subsequence α = α_1...α_k as the set of all k-length subsequences β that differ from α by at most m mismatches. For a given sequence s the mismatch feature map is given by:

Φ_(k,m)(s) = Σ_{k-mers α ∈ s} (ϕ_β(α))_{β ∈ A^k},   (2.18)

where

ϕ_β(α) = 1 if β belongs to N_(k,m)(α), and 0 otherwise.


The mismatch kernel calculates the dot product of feature vectors holding the counts of all k-mers with at most m mismatches between a pair of compared sequences:

K_(k,m)(s_1, s_2) = Φ_(k,m)(s_1) · Φ_(k,m)(s_2)   (2.19)

Storing and computing high-dimensional feature vectors is very costly. Therefore, the mismatch kernel uses a data structure called the mismatch tree for the efficient computation of the kernel matrix. Similar to the suffix tree [76], the mismatch tree is a rooted tree of depth k. Each internal node has |A| branches, each labeled with a symbol from A. A leaf node, representing a k-mer in the mismatch feature space, is obtained by concatenating the symbols along the path from root to leaf. The kernel function performs a depth-first traversal of the tree, starting at the root node and recursively visiting all subtrees of a node. It stores at each node of depth d a set of pointers to all substrings (k-mers) whose d-length prefixes differ by at most m mismatches from the substring on the path from the root node to the node of depth d. Thus, the mismatch kernel matrix can be computed by walking down the paths corresponding to the k-mers (with mismatches) that occur in the two compared strings. The number of k-mers that differ from a fixed k-mer by at most m mismatches is O(k^m |A|^m). The complexity of calculating a kernel value for a pair of sequences s_1 and s_2 is thus O(k^{m+1} |A|^m (|s_1| + |s_2|)).
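The feature map of equation 2.18 can be sketched by brute force in Python (illustrative only; the mismatch-tree traversal described above avoids materializing these feature vectors):

    from collections import Counter
    from itertools import combinations, product

    ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 amino acids

    def neighborhood(kmer, m):
        """N_(k,m)(kmer): all k-mers differing from kmer in at most m positions."""
        out = {kmer}
        for positions in combinations(range(len(kmer)), m):
            for subs in product(ALPHABET, repeat=m):
                beta = list(kmer)
                for pos, aa in zip(positions, subs):
                    beta[pos] = aa
                out.add("".join(beta))
        return out

    def mismatch_kernel(s, t, k, m):
        """Equation 2.19 by brute force: dot product of the feature vectors
        counting, for every k-mer beta, how many k-mers of the sequence have
        beta in their (k, m)-neighborhood."""
        def feature_map(seq):
            phi = Counter()
            for j in range(len(seq) - k + 1):
                phi.update(neighborhood(seq[j:j + k], m))
            return phi
        a, b = feature_map(s), feature_map(t)
        return sum(c * b[beta] for beta, c in a.items())

    print(mismatch_kernel("ACDEFG", "ACDEYG", k=3, m=1))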

2.5.3 Profile Kernel

Evolutionary Sequence Profile

The key feature of the profile kernel [75] is, as the name already states, the use of protein sequence profiles. A protein sequence profile is estimated by aligning a sequence of interest against a group of sequences and extracting information from the alignments of sequences of families of similar proteins [77]. For the query protein, this information is expressed in the profile by the conservation scores of its residues. The profile is represented as a position-specific scoring matrix with n rows, where n is the length of the query sequence, and over 20 columns. The first 20 columns contain, for each position in the target sequence, the scores for finding each of the 20 amino acid residues there. Strongly conserved residues are indicated by high positive scores, weakly conserved residues by scores around zero, and non-conserved residues by negative scores.


We built evolutionary profiles utilizing the Position-Specific Iterated BLAST program (PSI-BLAST) [78]. Three iterations with the E-value parameter set to 10⁻³ were performed. The protein database searched against was a combination of the SWISS-PROT, TrEMBL [15] and PDB [79] databases, redundancy reduced at 80% sequence identity.

Computation of the Profile Kernel

The profile kernel makes use of the evolutionary profile P(s) of a sequence s to define the mutational neighborhood around a contiguous k-mer α. Unlike the SSK and the mismatch kernel, which define the inexact matching neighborhood by the number of mismatches, the profile kernel defines it through evolutionary profiles. Therefore, the inexact matching neighborhood of a k-mer differs from sequence to sequence, or even between different regions of the same sequence.

More specifically, for each k-mer s[j+1 : j+k] = s_{j+1}...s_{j+k} in s (0 ≤ j ≤ |s| − k), the profile kernel specifies the positional mutation neighborhood M_(k,σ)(P(s[j+1 : j+k])), using the corresponding segment of the profile P(s), as the set of all k-length subsequences β = b_1...b_k whose cumulative conservation score is less than the user-defined threshold σ, that is, −Σ_{i=1}^{k} log p_{j+i}(b_i) < σ. The profile feature map for a given sequence s can then be set as:

Φ_(k,σ)(P(s)) = Σ_{j=0}^{|s|−k} (ϕ_β(P(s[j+1 : j+k])))_{β ∈ A^k},   (2.20)

where

ϕ_β(P(s[j+1 : j+k])) = 1 if β belongs to M_(k,σ)(P(s[j+1 : j+k])), and 0 otherwise.

The profile kernel is again defined as the dot product of the feature vectors:

K_(k,σ)(P(s_1), P(s_2)) = Φ_(k,σ)(P(s_1)) · Φ_(k,σ)(P(s_2))   (2.21)

Similar to the mismatch kernel, the profile kernel computes the kernel matrix using a tree data structure. The only difference for the profile kernel is that instead of k-mers, k-long profiles are stored on the path from the root to the leaf. Thus, an internal node of depth d stores a set of pointers to all k-length profiles P (s[j + 1 : j + k]), whose

current cumulative conservation scores are less than the σ threshold. More specifically, −Σ_{i=1}^{d} log p_{j+i}(b_i) < σ, where b_1...b_d is the prefix spelled by the path to the current node.

The complexity of computing a profile kernel value K(s_1, s_2) depends on the size of the positional mutation neighborhood of a k-length profile. It has been observed that with a typical choice of σ, the mutation neighborhood allows up to m = 2 mismatches. The number of k-mers β in the positional mutation neighborhood of s[j+1 : j+k] is O(k^m |A|^m). The complexity of computing a kernel value for a pair of protein sequences s_1 and s_2 is O(k^{m+1} |A|^m (|s_1| + |s_2|)) with m ≤ 2.
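A minimal Python sketch of the membership test behind M_(k,σ); the random probability matrix stands in for a real PSI-BLAST profile and is purely an assumption for illustration:

    import numpy as np
    from itertools import product

    ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

    def in_mutation_neighborhood(profile_segment, beta, sigma):
        """Test whether k-mer beta lies in M_(k,sigma)(P(s[j+1:j+k])):
        -sum_i log p_{j+i}(b_i) < sigma. profile_segment is a (k, 20) array
        of position-specific probabilities."""
        score = 0.0
        for i, aa in enumerate(beta):
            score -= np.log(profile_segment[i, ALPHABET.index(aa)])
        return score < sigma

    # A toy 3-position profile: near-uniform, so only mildly conserved.
    rng = np.random.default_rng(2)
    seg = rng.dirichlet(np.ones(20), size=3)     # each row sums to 1

    hits = ["".join(b) for b in product(ALPHABET, repeat=3)
            if in_mutation_neighborhood(seg, "".join(b), sigma=6.0)]
    print(len(hits), "k-mers in the positional mutation neighborhood")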

2.6 Multiclass Classification

A multiclass classification algorithm extends the idea of binary classification by allowing the data to be mapped to labels from more than two classes. The multiclass classification problem can be solved by decomposing it into a set of binary classification tasks that are efficiently solved by binary classifiers, such as SVMs in our case. The binary classifiers can then be combined to form the final multiclass classifier. Several methods have been proposed for constructing multiclass classifiers in this way. Here, the one-against-all approach [80] and approaches arranging the classes into a tree with a different binary classifier at each internal node [81, 82, 83] are presented.

2.6.1 One-Against-All

Breaking a multiclass classification problem into several binary classification problems is an easy way of tackling it. Given a set of n classes, the one-against-all approach solves n individual binary classification problems using n binary classifiers. Each classifier discriminates between the positive training instances belonging to one class of the set and the negative training instances belonging to the remaining n−1 classes. As a result, every instance is used as a positive example exactly once and n−1 times as a negative example. The classification result is determined by the classifier that outputs the highest value (the winner); its class label is assigned to the new test instance.
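A minimal Python sketch of the one-against-all scheme (hypothetical, using scikit-learn rather than the WEKA setup of this thesis):

    import numpy as np
    from sklearn.svm import SVC

    def one_against_all_fit(X, y):
        """Train one binary SVM per class: members of the class are the
        positives, everything else the negatives."""
        return {c: SVC(kernel="linear").fit(X, np.where(y == c, 1, -1))
                for c in np.unique(y)}

    def one_against_all_predict(models, X):
        """The winner is the classifier with the highest decision value."""
        classes = sorted(models)
        scores = np.column_stack([models[c].decision_function(X) for c in classes])
        return np.array(classes)[scores.argmax(axis=1)]

    rng = np.random.default_rng(3)
    X = np.vstack([rng.normal(m, 0.5, (20, 2)) for m in (-2, 0, 2)])
    y = np.repeat([0, 1, 2], 20)
    models = one_against_all_fit(X, y)
    print((one_against_all_predict(models, X) == y).mean())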


2.6.2 Ensemble of Nested Dichotomies

Another approach to a multiclass classification problem is to decompose the classes into a system of nested dichotomies (NDs) [84], which can be represented as binary trees. Each internal node of the tree stores one binary classifier and a set of corresponding classes. The root node contains the entire set of classes and divides it into two subsets. One of them is regarded as the positive subset, the other as the set of negative examples, and the classifier at the root node learns to distinguish between them. The two successor nodes of the root inherit the two subsets of the original set of classes, and the procedure is repeated recursively until a leaf node is reached. Each leaf node contains exactly one class label, so the number of leaf nodes corresponds to the number of classes. NDs do not require any prior knowledge about the relatedness of the classes. Therefore, many possible tree structures can be generated for a given set of classes. Figure 2.6 shows an example of two different decompositions of a five-class classification problem. Obviously, the classification results of different NDs differ, as they involve learning different binary problems, and it is not clear which ND is the most appropriate. Therefore, if every ND is equally probable, it makes sense to consider an ensemble of nested dichotomies (ENDs), that is, a set of possible NDs, for a given multiclass classification problem. The result of such an END is the average over the estimates obtained from the individual NDs. Although building a single binary tree requires time linear in the number of classes, the algorithm must be applied several times to build the ensemble. In fact, there are ∏_{i=2}^{n} (2i − 3) possible systems of NDs for an n-class problem (see the sketch below). This can be seen as a drawback of the ENDs method, at least compared to the one-against-all approach. Frank and Kramer [81] showed that 10 to 20 NDs are sufficient to reach the maximum accuracy of an ENDs-based classifier.
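The count of possible NDs and one simple (not necessarily uniform) way to sample a random ND can be sketched in a few lines of Python:

    import random

    def n_dichotomies(n):
        """prod_{i=2}^{n} (2i - 3): the number of distinct systems of
        nested dichotomies for an n-class problem."""
        out = 1
        for i in range(2, n + 1):
            out *= 2 * i - 3
        return out

    def random_nd(classes):
        """Sample one nested dichotomy: recursively split the class set into
        two random non-empty subsets until the leaves hold single classes."""
        if len(classes) == 1:
            return classes[0]
        shuffled = random.sample(classes, len(classes))
        cut = random.randint(1, len(classes) - 1)
        return (random_nd(shuffled[:cut]), random_nd(shuffled[cut:]))

    print(n_dichotomies(5))                # 105 trees for five classes
    print(random_nd(["cy", "nu", "mi", "ex", "pm"]))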

2.6.3 Ensemble of Class Balanced Nested Dichotomies

One variant of nested dichotomies are class-balanced nested dichotomies (CBNDs), which aim to class-balance each internal node of the binary tree. Ensembles of CBNDs (ECBNDs) are defined analogously to ENDs: while ENDs sample from the space of all possible tree structures, ECBNDs sample from the space of class-balanced tree structures and build an ensemble of balanced trees. This implies that each internal node in a class-balanced binary tree passes two equal-sized class subsets to its successor nodes. This restriction limits the number of possible sets of classes a node can inherit, and as a result the number of possible CBNDs is always smaller than the number of possible NDs, so finding appropriate NDs within an ensemble becomes more likely. Another advantage of the CBNDs approach is that the depth of a class-balanced binary tree is always logarithmic in the number of classes, which in turn speeds up the runtime of the learning algorithm.


Figure 2.6: Two systems of nested dichotomies for a five-class classification problem.

2.6.4 Ensemble of Data Balanced Nested Dichotomies

Another variant of nested dichotomies are data-balanced nested dichotomies (DBNDs). DBNDs are motivated by the fact that many multiclass classification problems are not data-balanced, meaning that the distribution of instances across classes may vary strongly. In this case class-balancing does not imply data-balancing, i.e. an equal number of inherited instances in the two successor nodes of an internal node. DBNDs are built by randomly assigning classes to two subsets until the number of instances in one of the subsets exceeds half the total number of instances in the parent node. The two data-balanced subsets are then passed to the successor nodes. Thus, heavily populated classes are located high up in the tree structure, making the ensemble of possible DBNDs (EDBNDs) biased towards populous classes. However, it has been shown in [83] that the accuracy of EDBNDs is comparable to that of ENDs and ECBNDs on the UCI data sets [85]. This was the reason for investigating this approach on our data.


2.6.5 Predefined Nested Dichotomies

In addition to the state-of-the-art multiclass classification approaches presented above, we defined nested dichotomies of a fixed architecture for archaeal, bacterial and eukaryotic proteins. For each taxon, three classifiers were built depending on the type of proteins used for their development: non-membrane proteins, transmembrane proteins and proteins of both types. Since the transmembrane archaeal proteins carried labels of only one localization class (Table 2.2), no classifier was built for this type of protein. Based on the number of class labels in our data set, the tree architectures for proteins of all types and for non-membrane proteins turned out to be the same for archaea; this was also the case for eukaryota. For bacteria, however, there were three different architectures, one for each protein type. The architectures of the trees are shown in Figure 2.7. Each system of nested dichotomies receives a protein sequence as input, and the SVM at the root node discriminates between the intra-cellular compartments and the compartments of the secretory pathway. Since proteins destined for the extra-cellular space, the plasma membrane, the Golgi apparatus or the ER are transported via the same mechanism (Section 1.4.1), their sequences are more similar to each other than to sequences of proteins remaining in the intra-cellular space [86]. Therefore, in our classification systems the eukaryotic compartments of the secretory pathway are grouped together. Similarly, in bacteria, the non-cytoplasmic compartments form their own branch of the tree, distinct from the cytosol. In fact, we also implemented a different tree architecture for bacterial proteins of all types, in which the compartments of the inner and outer membranes were separated from the non-membrane compartments by grouping them into two distinct branches of the tree. The resulting architecture, however, turned out to be less efficient than the architecture shown in Figure 2.7. In eukaryotes, the intra-cellular compartments are further divided into the branches 'cytosol and nucleus' and 'non-cytosol'. This division was motivated by the different mechanisms of protein transport for each of these groups (Section 1.4.1) and the consequent differences in the targeted proteins. We refer to our system of nested dichotomies by the term MyND. In fact, MyND for transmembrane proteins is the first ab initio predictor capable of distinguishing between eight different subcellular classes. Most of the available subcellular localization prediction methods either discard the class of transmembrane proteins from their data sets [34, 42, 39, 44, 45] or group them into at most five localization classes [87, 88].


Figure 2.7: The hierarchical architectures of MyNDs for archaeal, bacterial and eukaryotic proteins of three types: all, non-membrane and transmembrane proteins. The architectures were set up to imitate the biological mechanism of protein sorting as closely as possible. The branches of the trees represent paths of the protein sorting mechanism, whereas internal nodes are decision points along the paths. The leaves of the trees comprise the final localization classes for which the prediction is made. Abbreviations: extracellular (for non-membrane proteins) and plasma membrane (for transmembrane proteins), cytosol, inner membrane, periplasmic space, outer membrane, fimbrium, endoplasmic reticulum, Golgi apparatus, nucleus, vacuole, peroxisome, mitochondria, plastid and chloroplast.


2.7 Performance Evaluation

In order to provide information about the accuracy of a prediction method, it must be evaluated on external test data, i.e. data not used during the training of the method. Such an evaluation yields the standard performance estimates, such as accuracy, coverage and their geometric average. The precision of these estimates is quantified by the statistical measure of the standard error. In addition to the statistical estimates, the general approach of stratified k-fold cross validation is also introduced in this section.

2.7.1 Accuracy, Coverage and their Geometric Average

A binary prediction of the type "does a particular protein belong to the localization class L or not" results, for a given set of proteins, in four possible outcomes:

◦ True Positives: the number of correctly predicted proteins in localization L

◦ False Positives: the number of proteins predicted to be in localization L and ob- served experimentally in not-L

◦ False Negatives: the number of proteins predicted to be in not-L and observed to be in L

◦ True Negatives: the number of proteins correctly predicted to be in not-L

The four possibilities that occur in a binary classification task are summarized in the so-called confusion matrix in Table 2.5. These values form the basis for a variety of performance estimates. The accuracy/specificity is defined as:

Acc(L) = 100 · TP / (TP + FP)   (2.22)

and puts the number of correctly predicted proteins observed to be in L in relation to the overall number of proteins predicted to be in L. Thus, Acc(L) measures the ability of a classifier to predict proteins located in L. Accordingly, the coverage/sensitivity is defined as

Cov(L) = 100 · TP / (TP + FN)   (2.23)

and sets the number of correctly predicted proteins observed to be in L in relation to the overall number of proteins observed in L. Therefore, Cov(L) measures the ability of a classifier to correctly identify, among all proteins observed to be in L, those located in L. Here, Acc(L) and Cov(L) were combined through their geometric average:

gAv(L) = √(Acc(L) · Cov(L))   (2.24)

Finally, the overall accuracy Q was used, defined as the total number of correct predictions divided by the total number of proteins in the test set:

Q = 100 · Σ TP / Σ (TP + FN)   (2.25)

                          Predicted to be in L    Predicted to be in not-L
Observed to be in L       True Positives (TP)     False Negatives (FN)
Observed to be in not-L   False Positives (FP)    True Negatives (TN)

Table 2.5: The confusion matrix for four possible outcomes of a binary classification task.
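A minimal Python sketch of equations 2.22-2.25, with made-up confusion counts for two hypothetical classes:

    import math

    def class_metrics(tp, fp, fn):
        """Per-class estimates from the confusion matrix (Table 2.5):
        accuracy/specificity (eq. 2.22), coverage/sensitivity (eq. 2.23)
        and their geometric average (eq. 2.24), all in percent."""
        acc = 100 * tp / (tp + fp)
        cov = 100 * tp / (tp + fn)
        return acc, cov, math.sqrt(acc * cov)

    def overall_accuracy(per_class):
        """Eq. 2.25: summed true positives over the test set size,
        given {class: (tp, fp, fn)}."""
        tp_sum = sum(tp for tp, _, fn in per_class.values())
        total = sum(tp + fn for tp, _, fn in per_class.values())
        return 100 * tp_sum / total

    counts = {"cytosol": (90, 5, 10), "extracellular": (40, 10, 5)}
    print(class_metrics(*counts["cytosol"]))   # (94.7..., 90.0, 92.3...)
    print(overall_accuracy(counts))            # 89.6...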

2.7.2 The Standard Error

A measure for the precision of the estimates presented in the previous Section is provided by the standard error (SE). The SE of an estimated value is defined as the standard deviation (SD) of its sampling distribution [89]. That is, the higher the variability in the sampling distribution (e.g. for small samples), the larger the standard error and the less precise the estimated value (Figure 2.8A). Conversely, the lower the variability in the sampling distribution (e.g. for large samples), the smaller the standard error and the more precise the estimated value (Figure 2.8B). The SE is particularly important when comparing two values of an estimate, i.e. when asking whether one value is higher than the other in a statistically meaningful way. Generally, the computation of the SE of an estimate such as accuracy, coverage or their geometric average is difficult or even mathematically intractable [90]. Bootstrapping [91] presents a way out of this dilemma. It measures the precision of an estimate x̄ in the following way:


Figure 2.8: The standard error is a measure for the uncertainty of an estimated value. (A) The standard error is large if the variability in the sampling distribution is high. (B) In the case of low variability in the sampling distribution the standard error is small.

1. Randomly select with replacement a set of n predictions from the original data set

2. Compute an estimate x for this set of predictions

3. Repeat the previous steps m times to obtain the bootstrap estimates x_1, ..., x_m

Based on the bootstrap estimates, the standard deviation can be calculated as:

SD = √( Σ_{i=1}^{m} (x_i − x̄)² / m ),   (2.26)

where x_i is the value observed for sample i, x̄ is the estimated value and m is the number of samples. The standard error is then obtained by:

SE = SD / √(m − 1)   (2.27)

However, it has been observed (B. Rost, personal communication) that the SE obtained in this way for a number of prediction methods operating on protein sequences varies on data sets other than those used for the development of these methods. In fact, the SEs were smaller than those reported in the original publications. Based on this observation, and on the fact that a prediction for a protein sequence is usually not made more than once, the aforementioned algorithm was slightly modified: each bootstrap sample was built by drawing protein sequences from the original data set without replacement, so that each sequence entered a sample at most once. The size of each sample was set to 15% of the original data set, and 1000 samples were drawn in total.
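A minimal Python sketch of this modified bootstrap (illustrative; the 0/1 vector marks which predictions of a hypothetical classifier were correct):

    import numpy as np

    def bootstrap_se(correct, sample_frac=0.15, n_samples=1000, seed=0):
        """Standard error of the overall accuracy, following the modified
        bootstrap described above: each sample draws 15% of the predictions
        without replacement, so each protein enters a sample at most once."""
        rng = np.random.default_rng(seed)
        correct = np.asarray(correct, dtype=float)   # 1 = correct, 0 = wrong
        size = max(1, int(sample_frac * len(correct)))
        estimates = [rng.choice(correct, size=size, replace=False).mean()
                     for _ in range(n_samples)]
        sd = np.std(estimates)                        # eq. 2.26
        return sd / np.sqrt(n_samples - 1)            # eq. 2.27

    rng = np.random.default_rng(1)
    predictions = rng.random(400) < 0.84   # a classifier right ~84% of the time
    print(100 * bootstrap_se(predictions)) # SE of the accuracy, in percent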


2.7.3 Stratified k-fold Cross Validation

Training a classifier bears the risk of adapting it to the training set and thus overestimating its performance. A realistic performance value can therefore only be obtained if the classifier is trained and tested on two different, non-overlapping data sets. However, keeping one part of the data out and not using it for learning means wasting the information contained in it. This scenario is especially critical when only a limited amount of data is available. A possible solution to this problem is stratified k-fold cross validation. In this method, the original data set is randomly partitioned into k equally sized subsets such that each subset contains about the same proportion of class labels as the original data set. Each subset participates in the testing phase exactly once while the remaining k − 1 subsets are used for training. This procedure yields k performance values, one for each test set. These values are averaged and the result gives the prediction performance of the classifier. An example of a stratified 5-fold cross validation is schematically illustrated in Figure 2.9.

Figure 2.9: In a 5-fold cross-validation the original data set is divided into five subsets of equal size. Four subsets are used for training while the remaining subset is used for testing. This is repeated five times so that each subset is used for testing exactly once. The estimate of the final predictor is the average of the estimates of five models, each trained and tested on different data sets.
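For illustration, the following Python sketch runs a stratified 5-fold cross validation on made-up data; scikit-learn's StratifiedKFold is an assumption standing in for the partitioning used in this thesis:

    import numpy as np
    from sklearn.model_selection import StratifiedKFold
    from sklearn.svm import SVC

    rng = np.random.default_rng(4)
    X = rng.normal(size=(100, 6))
    y = np.array([0] * 70 + [1] * 30)        # imbalanced class labels

    # Each fold preserves the 70/30 class proportions of the full data set.
    scores = []
    for train_idx, test_idx in StratifiedKFold(n_splits=5, shuffle=True,
                                               random_state=0).split(X, y):
        clf = SVC(kernel="linear").fit(X[train_idx], y[train_idx])
        scores.append(clf.score(X[test_idx], y[test_idx]))
    print(np.mean(scores))                   # the cross-validated estimate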


In this work, the method of a stratified k-fold cross validation was utilized within three different contexts: the kernel and classification model selection, the parameter optimization of different kernel functions and classification models, and the performance estimation of a classifier.

Kernel and Classification Model Selection

Different kernel functions (Section 2.5) were compared with respect to their generalization ability via cross validation. Cross validation was also used for selecting the best-performing multiclass classification approach (Section 2.6). The kernel and classification model selection took place in a 5-by-10-fold cross validation environment. This means that it operated only on the 80% of the entire data set initially set aside for training. The training set was randomly partitioned into 10 subsets; the classifier was trained repeatedly on nine of them and validated on the remaining one. The kernel and multiclass classification approach that led to the best performance on the validation sets were used for training the classifier, whose generalization performance was then assessed on the test set. Figure 2.10 demonstrates the general approach of a 5-by-10 cross validation.

Parameter Optimization

The performance of the classification methods developed in this work depended on the SVM cost parameter C (Section 2.4.2) and on at least two further kernel parameters (Section 2.5). Parameter optimization was performed using a 5-by-10 cross validation.

Performance Estimation

As already mentioned, the method of a stratified k-fold cross-validation was used to estimate the performance values of a classifier such as the class-wise accuracy, coverage, their geometric average, and the overall prediction accuracy (Section 2.7.1). It permitted the use of the entire data set for testing and obtaining these estimates.


Figure 2.10: In a 5-by-10 cross validation each of the five training sets is partitioned into ten equally sized subsets. One of the subsets is used for validating and the remaining nine for training. Splitting the initial data set into test, training and validation sets allows the test set to be completely detached from the entire training process of a classifier.


2.8 External Prediction Methods

We compared the prediction performance of our classifiers to that of four state-of-the-art prediction methods for subcellular localization: CELLO v.2.5 [40], LOCtree [42], MultiLoc2 [44] and WoLFPSORT [41], which are briefly described below.

2.8.1 CELLO v.2.5

CELLO v.2.5 is a two-level system of SVMs that predicts bacterial proteins in five classes: cytoplasm, inner membrane, periplasm, outer membrane and extracellular space. Eukaryotic proteins are predicted in twelve classes: chloroplast, cytoplasmic, cytoskeletal, endoplasmic reticulum, extracellular, Golgi apparatus, lysosomal, mitochondrial, nuclear, peroxisomal, plasma membrane and vacuole. The first level of the prediction system is composed of a number of SVMs, each trained on a specific type of sequence feature. These features comprise the amino acid composition of the entire sequence, gapped dipeptides and fragments of equal length from various partitions of the sequence. The second level holds a "jury" SVM, which collects the outputs of the first-level SVMs and makes the final prediction of the most probable localization class. For bacteria, the data set is that of PSORTb [92], which was extracted from SWISS-PROT release 40.29 and not homology reduced. For eukaryota it is that of PLOC [37], which was derived from SWISS-PROT release 39 and homology reduced at 80% pairwise sequence identity. CELLO v.2.5 is available at http://cello.life.nctu.edu.tw/.

2.8.2 LOCtree

LOCtree is an SVM-based prediction method. It utilizes a hierarchy of binary classifiers for the prediction of prokaryotic, eukaryotic non-plant and plant non-membrane proteins. Prokaryotic proteins are predicted in three localization classes: cytoplasm, extra-cellular space and periplasm. LOCtree predicts five classes for eukaryotic non-plant proteins: cytoplasm, extra-cellular space, mitochondria, nucleus and other organelles. An additional sixth class, chloroplast, is predicted for plant proteins. The data set is derived from SWISS-PROT release 40 and redundancy reduced at an HSSP-value ≤ 5 (eq. 2.1); below this threshold the annotation of subcellular localization based solely on sequence homology was found to be erroneous [18]. The training sets are increased in size using LOChom [29] and LOCkey [18], approaches that annotate sequences by homology

and by additional functional keywords contained in SWISS-PROT. LOCtree predictions are based on the amino acid composition in three states: the evolutionary sequence profile of the entire sequence, the 50 N-terminal residues and the predicted secondary structure. Additionally, the output of SignalP [93] is used for the prediction of extra-cellular eukaryotic proteins. LOCtree is available at http://www.predictprotein.org/.

2.8.3 MultiLoc2

MultiLoc2 is another two-level system of SVMs. It predicts nine localization classes for animal and fungal proteins: cytoplasmic, endoplasmic reticulum, extracellular, Golgi apparatus, lysosomal, mitochondrial, nuclear, peroxisomal and plasma membrane; an additional tenth class, chloroplast, is predicted for plant proteins. MultiLoc2 is trained on the data set of MultiLoc [94], which was derived from SWISS-PROT release 42 and homology reduced at 80% pairwise sequence identity. The first level of the prediction system consists of six sub-predictors that consider the N-terminal part of the query sequence, its overall amino acid composition, the presence of signal anchors and of certain motifs (extracted from the PROSITE [95] and NLSdb [21, 96] databases), as well as its phylogenetic profile and GO [62] terms. The outputs are collected into a protein profile vector that serves as input to the second-level SVM, which makes the localization prediction. MultiLoc2 is available at http://abi.inf.uni-tuebingen.de/Services/MultiLoc2/.

2.8.4 WoLFPSORT

WoLFPSORT is a k-nearest neighbor classifier and the newest member of the PSORT family of subcellular localization predictors [97, 98, 99]. It predicts twelve localization classes for eukaryotic proteins: chloroplast, cytosol, cytoskeleton, endoplasmic reticulum, extra-cellular, Golgi body, lysosome, mitochondrion, nucleus, peroxisome, plasma membrane and vacuolar membrane. In addition, it is able to predict some dual localizations such as "cytosol and nucleus". The data set was derived from SWISS-PROT release 45 and not homology reduced; additionally, several hundred Arabidopsis entries were extracted from the GO website. The predictions are based on sequence features such as the amino acid composition, sorting signals and target peptides, the length of the sequence, hydropathy and charge. WoLFPSORT is available at http://wolfpsort.org/.


2.9 Box Plots

Box plots [100] were used to visualize the performance estimates of the kernel functions across the range of their parameters. A box plot, also called a box-and-whiskers plot, is a standardized way to summarize the distribution of data based on the five-number summary: minimum, first quartile, median, third quartile and maximum. The configuration of a box plot is illustrated in Figure 2.11.

Figure 2.11: A box plot splits the data set into quartiles. The "box" begins at the first quartile and extends to the third quartile. The second quartile corresponds to the median. Two dashed lines, called whiskers, extend to the smallest and largest values in the data set.

Box plots are very useful for comparison of two or more data sets, as their medians and ranges can be visualized immediately. Box plots can be interpreted as follows:

◦ The first quartile of data is represented by the lower edge of the ”box”. This is the point at which 25% of the values in the data set are lower than the value at the first quartile.

◦ The third quartile is depicted by the upper box edge and is the point at which 75% of the values are lower than the value at the third quartile. Thus, the length of the ”box” corresponds to the middle 50% of the values in the data set.

◦ The median is shown by the bold bar inside the box. This is the score at the second quartile.


◦ Minimum and maximum scores in the data set are represented by the whiskers below and above the box. The range of the whiskers is no more than 1.5 times the length of the box.

◦ Outliers are the values that fall outside the range of the whiskers. Outliers are labeled with a circle in the graph. There can be outliers above or below the whiskers, or multiple outliers in the data set.
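A minimal matplotlib sketch of such a box plot, with made-up accuracy distributions for two hypothetical kernels:

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(5)
    # Overall accuracies of two hypothetical kernels across parameter settings.
    kernel_a = rng.normal(88, 3, 40)
    kernel_b = np.append(rng.normal(82, 2, 38), [70, 95])   # with two outliers

    fig, ax = plt.subplots()
    ax.boxplot([kernel_a, kernel_b], whis=1.5)  # whiskers: at most 1.5x the box
    ax.set_xticklabels(["Profile", "Mismatch"])
    ax.set_ylabel("Overall accuracy (%)")
    plt.show()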

3 Results and Discussion

3.1 Training and Test Sets

The development of the classification approaches for this thesis required partitioning the data sets into training and test sets. This was done by randomly dividing each data set into five equally sized, mutually exclusive subsets (Section 2.7.3). By rotating over the subsets, the classifiers could be trained on different combinations of four of the five subsets and tested on the remaining subset, which was not involved in any step of the training. A certain degree of homology was introduced within each training set, thereby increasing it considerably in size (Section 2.2.4). In doing so, care was taken to ensure that no homologue of any protein in the test set was included. The distribution of protein sequences across the different subcellular localizations in all training and test sets is shown in Table 3.1.

                                  Set 1          Set 2          Set 3          Set 4          Set 5
                                  Training Test  Training Test  Training Test  Training Test  Training Test
Archaeal All Proteins
Cytosol                           199      8     184      8     214      8     160      8     175      9
Plasma membrane/Extra-cellular    34       4     21       4     34       4     31       4     36       2
Total                             233      12    205      12    248      12    191      12    211      11
Archaeal Non-Membrane Proteins
Cytosol                           130      8     189      8     210      8     195      8     208      9
Extra-cellular                    4        1     4        1     4        1     4        1     4        1
Total                             134      9     193      9     214      9     199      9     212      10



                                  Set 1          Set 2          Set 3          Set 4          Set 5
                                  Training Test  Training Test  Training Test  Training Test  Training Test
Bacterial All Proteins
Cytosol                           1478     36    890      36    1522     36    1319     36    1455     35
Fimbrium                          52       3     28       3     46       3     50       3     44       4
Outer membrane                    8        1     8        1     8        1     6        1     6        2
Periplasm                         89       10    95       10    98       10    93       10    89       12
Plasma membrane                   279      29    259      29    286      29    261      29    247      28
Extra-cellular                    127      16    134      16    130      16    109      16    124      18
Total                             2033     95    1414     95    2090     95    1838     95    1965     99
Bacterial Non-Membrane Proteins
Cytosol                           1112     36    1508     36    1376     36    1172     36    1496     35
Fimbrium                          50       3     51       3     45       3     27       3     47       4
Periplasm                         92       10    97       10    100      10    93       10    82       12
Extra-cellular                    132      16    120      16    126      16    118      16    128      18
Total                             1386     65    1776     65    1647     65    1410     65    1753     69
Bacterial Transmembrane Proteins
Outer membrane                    6        1     8        1     8        1     8        1     6        2
Plasma membrane                   260      29    256      29    269      29    285      29    262      28
Total                             266      30    264      30    277      30    293      30    268      30
Eukaryotic All Proteins
Chloroplast                       1127     29    1200     29    1210     29    1028     29    1255     28
Cytosol                           824      44    861      44    868      44    879      44    824      44
ER                                178      15    182      15    163      15    180      15    185      15
Golgi                             67       4     64       4     65       4     59       4     57       4
Mitochondria                      615      45    655      45    673      45    645      45    684      47
Nucleus                           1234     65    1210     65    1171     65    1223     65    1270     65
Peroxisome                        35       2     45       2     39       2     45       2     48       -
Plastid                           89       3     86       3     81       3     79       3     93       2
Vacuole                           49       3     45       3     53       3     45       3     48       1
Plasma membrane/Extra-cellular    2479     127   2601     127   2359     127   2578     127   2435     128
Total                             6697     337   6949     337   6682     337   6761     337   6899     334
Eukaryotic Non-Membrane Proteins
Chloroplast                       1088     27    1148     27    1009     27    1057     27    1242     25
Cytosol                           834      44    764      44    774      44    819      44    773      44
ER                                26       2     23       2     22       2     26       2     27       2
Golgi                             12       1     8        1     10       1     13       -     13       -
Mitochondria                      397      28    400      28    387      28    387      28    381      28
Nucleus                           1161     64    1121     64    1070     64    1167     64    1169     64
Peroxisome                        29       1     29       1     30       1     33       1     23       2
Plastid                           86       3     80       3     70       3     88       3     96       2
Vacuole                           13       1     12       1     10       1     14       -     11       -
Extra-cellular                    2078     119   2026     119   2017     119   2022     119   1941     120
Total                             5724     290   5611     290   5399     290   5626     288   5676     287
Eukaryotic Transmembrane Proteins
Chloroplast                       24       2     20       2     25       2     24       2     19       3
ER                                99       13    93       13    84       13    93       13    99       13
Golgi                             36       3     32       3     32       3     33       3     27       5
Mitochondria                      232      17    199      17    240      17    242      17    187      19
Nucleus                           5        1     6        1     6        1     5        1     6        1
Peroxisome                        6        -     7        -     5        -     6        1     4        1
Vacuole                           27       2     23       2     25       2     29       2     24       2
Plasma membrane                   340      8     398      8     395      8     399      8     220      8
Total                             769      46    778      46    812      46    831      47    586      52

Table 3.1: Number of protein sequences per localization class in the training and test sets which were used for the development of the eight classifiers applied in this thesis.

3.2 Kernel Selection

A string kernel function receives a set of protein sequences as input and outputs a kernel matrix for them. In this Section, the String Subsequence kernel, the Mismatch kernel and the Profile kernel (Section 2.5) are compared in terms of their individual performance, using a 10-fold cross validation (Section 2.7.3) on the training sets of all archaeal, bacterial and eukaryotic proteins (Table 3.1). Each training set was randomly partitioned into ten subsets, such that nine of them were repeatedly used for training and the prediction accuracy was assessed on each validation subset exactly once. Thus, the performance result of a kernel function for each training set was the average over the 10 runs of the algorithm. The kernel functions were tested over a range of possible parameter combinations. The simplest and fastest SVM multiclass classification approach, one-against-all (Section 2.6.1), was used.

3.2.1 Parameter Optimization

String Subsequence kernel. The String Subsequence kernel is parametrized by three values: the subsequence length k, the penalty factor for non-contiguous matches λ, and the λ-pruning parameter l denoting the subsequence length including gaps (Section 2.5.1). The classifications were carried out using λ ∈ {0.01, 0.1, 0.5, 1} and k ∈ {2, 3, 6}. Lodhi et al. [73] suggest choosing l to be about three times the value of k, thus l ∈ {8, 12, 18} was set. The influence of the parameters on the averages of the overall accuracies (eq. 2.25) over the five training sets using the one-against-all classifier is summarized in the supplementary Figure 4.2 in the Appendix. The results showed that most of the variation in prediction accuracy was due to the subsequence length k. The SVM predictor was more accurate for smaller substrings (k=2 for archaea and k=3 for bacteria and eukaryota). Larger values of k resulted in poor performance, indicating that the classifier captures the similarity between two strings through rather short non-contiguous substrings. As expected, the optimal value for l was three times the value of k. This resulted in a higher weighting of the decay parameter, which had its peaks at λ ∈ {0.5, 1} for archaea and bacteria, and at λ=0.5 for eukaryota.


Mismatch kernel. In order to find the optimal parameters for the Mismatch kernel, combinations of the parameters k (the length of the subsequence shared between two compared sequences) and m (the number of mismatches by which the subsequences may differ; Section 2.5.2) were tested. We set k ∈ {2, 3, 4, 5, 6, 7, 9, 11}; larger values of k led to a kernel matrix with almost all off-diagonal entries equal to 0. Small values m ∈ {1, 2} were selected for efficiency in training the classifier. The Mismatch kernel failed to produce kernel matrices for the eukaryotic training data sets. Thus, the supplementary Figure 4.1 in the Appendix shows the results for archaeal and bacterial proteins only. Here again, the variation in prediction accuracy correlated strongly with the subsequence length k. The highest accuracies were observed for k = 3 for archaea and k ∈ {4, 5} for bacteria. The mismatch parameter m had a rather small influence on the prediction performance in archaea and no influence in bacteria.

Profile kernel. The Profile kernel generates the feature space in which a classifier is trained from the evolutionary profiles generated by PSI-BLAST (Section 2.5.3). The feature vectors of an input sequence are determined by the length k of the substrings whose cumulative conservation scores in the profile are less than the σ threshold. Parameters k ∈ {2, 3, 4, 5, 6, 7} and σ ∈ {3, 4, 5, 6, 7, 8, 9, 10, 11} with k < σ, and a maximum of 3 mismatches, as presented in the original publication [75], were used. The overall accuracies of the classifier across the parameters of the Profile kernel are presented in supplementary Figure 4.3 in the Appendix. The results showed that, similar to the other two kernel functions, the performance of the classifier varied strongly with the subsequence length k. The classifier was more effective using subsequences of small or moderate length. For archaeal proteins the overall accuracy peaked at k of 2-5, for bacterial proteins at k of 4-5 and for eukaryotic proteins at k of 5-6. Surprisingly, for subsequence lengths above 6 the overall prediction accuracy of the classifier declined steadily. An analysis of the effectiveness of the parameter σ revealed that choices in the range of 5-11 yielded the best results.


3.2.2 Comparative Evaluation

                   String Subsequence Kernel   Mismatch Kernel    Profile Kernel
                   λ     k   l    Q            k   m   Q          k   σ    Q
Archaea All Proteins, Nclasses=2
Training Set 1     1.0   2   8    97±3         3   1   97±3       4   9    97±3
Training Set 2     1.0   3   18   97±3         3   1   97±3       2   4    97±3
Training Set 3     1.0   3   18   97±3         3   1   97±3       2   4    97±3
Training Set 4     1.0   2   8    97±3         4   2   97±3       4   8    97±3
Training Set 5     1.0   2   8    97±3         3   1   97±3       4   9    97±3
Average                           97±3                 97±3                97±3
Bacteria All Proteins, Nclasses=6
Training Set 1     0.5   3   12   88±2         4   1   87±2       5   9    93±2
Training Set 2     1.0   3   12   88±2         4   1   88±2       5   9    93±2
Training Set 3     0.5   3   12   88±2         5   2   88±2       4   7    93±2
Training Set 4     0.5   3   12   88±2         4   1   88±2       4   7    93±2
Training Set 5     0.5   3   12   88±2         5   2   87±2       5   9    93±2
Average                           88±2                 88±2                93±2
Eukaryota All Proteins, Nclasses=10
Training Set 1     0.5   3   8    69±1                            6   11   81±1
Training Set 2     0.5   3   12   70±1         Not                5   9    81±1
Training Set 3     0.5   3   8    69±1         available          5   8    82±1
Training Set 4     0.5   3   8    69±1                            5   9    81±1
Training Set 5     0.5   3   8    69±1                            5   9    81±1
Average                           69±1                                     81±1

Table 3.2: Comparison of the three kernel functions (Section 2.5) employed by the one-against-all classifier. The classifier was trained and tested in a 10-fold cross validation (Section 2.7.3) on the training sets of all archaeal, bacterial and eukaryotic proteins. For each training set only the highest overall accuracies (± one standard error) are reported. The kernel parameters are displayed before the results; for their description refer to the text. The Mismatch kernel failed to produce results on our largest sets, those of eukaryotic proteins. Abbreviations: Q, the overall accuracy (eq. 2.25); Nclasses, the number of predicted localization classes.

The performance results of the String Subsequence kernel, the Mismatch kernel and the Profile kernel on each of our training sets of all archaeal, bacterial and eukaryotic proteins are displayed in Table 3.2. The results show only the highest overall classification accuracies achieved for each kernel function over the range of tested parameters. Where multiple parameter combinations produced exactly the same result, one combination was chosen at random. The overall accuracies peaked for each kernel function at rather short subsequence lengths, indicating that fragments of short or moderate length determine subcellular localization better than longer fragments. Generally, the overall accuracies decline with a growing number of localization classes. The String Subsequence kernel and the Mismatch kernel achieved equal levels of accuracy on the archaeal and bacterial data sets. While the Profile kernel was comparable in overall accuracy to the other kernel functions on the archaeal data sets, it significantly outperformed the other two approaches on the larger data sets of bacterial and eukaryotic proteins. Thus, we conclude that the sorting signals for subcellular localization are conserved within evolutionary profiles and that the signal encoded in a profile is stronger than the signal encoded in the sequence alone. Based on these results, the Profile kernel was selected for the training of our SVM-based classifiers.

Profile kernel performance on the sets of non-membrane and transmembrane proteins. Table 3.3 shows the highest overall accuracies achieved by the Profile kernel on the sets of non-membrane and transmembrane proteins. The parameter combinations tested for these sets were the same as for the sets of all proteins. Here again, where multiple parameter combinations produced exactly the same result, one combination was chosen at random.


Profile kernel
                   k   σ    Q            k   σ    Q            k   σ    Q
Non-Membrane Proteins
                   Archaea, Nclasses=2   Bacteria, Nclasses=4  Eukaryota, Nclasses=10
Training Set 1     4   9    97±3         4   7    95±2         5   8    83±1
Training Set 2     4   9    97±3         4   7    95±2         5   8    83±1
Training Set 3     4   9    97±3         4   7    94±2         5   8    84±1
Training Set 4     4   9    97±3         4   7    94±2         5   8    83±1
Training Set 5     4   9    97±3         4   7    95±2         5   8    84±1
Average                     97±3                  95±2                  83±1
Transmembrane Proteins
                                         Bacteria, Nclasses=2  Eukaryota, Nclasses=8
Training Set 1                           3   6    100*         3   6    83±3
Training Set 2     Not                   5   11   100*         3   6    83±3
Training Set 3     available             3   6    100*         4   9    83±3
Training Set 4                           3   6    100*         3   6    83±3
Training Set 5                           3   6    100*         5   11   85±3
Average                                           100*                  83±3

Table 3.3: The highest overall accuracies (± one standard error) achieved by the one-against-all classifier using the Profile kernel (Section 2.5.3) in a 10-fold cross validation (Section 2.7.3) on the training sets of non-membrane and transmembrane proteins. The kernel parameters are displayed before the results; for their description refer to the text. No classifier was built for the set of archaeal transmembrane proteins, as these proteins carried labels of only one localization class (Section 2.6.5). Abbreviations: Q, the overall accuracy (eq. 2.25); Nclasses, the number of predicted localization classes; asterisk, overoptimistic estimate due to the small data set size.


3.3 Classification Model Selection

In the previous Section, the Profile kernel was found to perform best among the compared kernel functions. Here, the effect on the overall accuracy of five different multiclass classification approaches, all using the Profile kernel, is investigated: one-against-all, ensembles of nested dichotomies (ENDs), ensembles of class-balanced nested dichotomies (ECBNDs), ensembles of data-balanced nested dichotomies (EDBNDs) and the nested dichotomy with a predefined structure (MyND) (Section 2.6). The ensembles of nested dichotomies were built using 20 ensemble members, i.e. the prediction was formed as the average over the results of 20 different systems of nested dichotomies. For this experiment, a stratified 10-fold cross validation was used on the training sets of all archaeal, bacterial and eukaryotic proteins.

                         One-Against-All   ENDs    ECBNDs   EDBNDs   MyND
Archaea, Nclasses=2      97±3              97±3    97±3     97±3     97±3
Bacteria, Nclasses=6     93±2              96±2    96±2     96±2     96±2
Eukaryota, Nclasses=10   81±1              86±1    86±1     86±1     86±1

Table 3.4: Comparison of the overall accuracies (± one standard error) of five multiclass classification approaches (Section 2.6), trained and tested in a 10-fold cross validation (Section 2.7.3) on the training sets of all archaeal, bacterial and eukaryotic proteins. The kernel function employed was the Profile kernel (Section 2.5.3). The displayed results are averages over the five training sets. Abbreviations: Nclasses, the number of predicted localization classes.

The averaged results across the five training sets of all archaeal, bacterial and eukaryotic proteins for each multiclass classifier are summarized in Table 3.4. For archaea, the performances of all classifiers were exactly the same. This was not a surprising result, given the small size of the training sets and the fact that only two localization classes are involved. However, for larger data sets and higher numbers of classes there was a significant improvement in the overall accuracy for all hierarchy-based approaches compared to the one-against-all approach, which uses no class hierarchy at all. Among

the four hierarchy-based classification approaches there was no significant difference in the overall accuracy. Based on these findings, ENDs, the approach that generates nested dichotomies at random, was selected for the parameter optimization task. We decided to also select MyND, the approach that incorporates knowledge about the relatedness of the classes into model building, and EDBNDs, the approach employing data-balanced classification models.

3.4 Model Parameter Optimization

The generalization performance of an SVM-based classifier depends not only on the kernel function and its parameters but also on a good setting of the complexity parameter C (Section 2.4.2). The parameter C controls the trade-off between the width of the margin of a hyperplane used for the classification task and the number of misclassified instances. The selection of the optimal parameter C was performed in an automated manner using the meta-classifier CVParameterSelection, available in the standard distribution of WEKA. CVParameterSelection ran tests over a number of values and selected the best-performing parameter for training the classifier and the subsequent prediction. Powers of ten in the range between 0.01 and 1000 were tested, and the number of folds was set to 5. The optimization of C was performed for each binary classifier separately.
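A minimal Python sketch of the same idea (hypothetical; scikit-learn's GridSearchCV stands in for WEKA's CVParameterSelection):

    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    rng = np.random.default_rng(6)
    X, y = rng.normal(size=(120, 8)), rng.integers(0, 2, 120)

    # Powers of ten between 0.01 and 1000, selected by 5-fold cross validation.
    search = GridSearchCV(SVC(kernel="linear"),
                          {"C": [0.01, 0.1, 1, 10, 100, 1000]}, cv=5)
    search.fit(X, y)
    print(search.best_params_)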

3.4.1 Generalization Performance

The overall accuracy of the optimized classification models of ENDs, EDBNDs and MyND is shown in Table 3.5. The results demonstrate that the parameter C is indeed an important factor influencing the generalization performance, as the overall accuracy improved for all classification methods over the different data sets tested. Interestingly, despite the different sizes of the data sets, there were no significant differences in accuracy between the individual classifiers. This indicates that, in terms of accuracy, the ensemble of randomly chosen binary structures competes with the ensemble of data-balanced structures, and both classification models are comparable to the classification model whose architecture was chosen to follow the biological pathways of protein sorting.


                         ENDs    EDBNDs   MyND
Archaea, Nclasses=2      98±3    98±3     98±3
Bacteria, Nclasses=6     97±2    97±2     97±2
Eukaryota, Nclasses=10   88±2    88±2     88±2

Table 3.5: Comparison of the overall accuracy (± one standard error) for ENDs, EDBNDs and MyNDs (Section 2.6) with the optimally chosen SVM complexity parameter C (Section 2.4.2), trained and tested in a 5-fold cross validation (Section 2.7.3) on the training sets of all archaeal, bacterial and eukaryotic proteins. The kernel function used was the Profile kernel (Section 2.5.3). The averages over all training sets are reported. Abbreviations as for Table 3.4.

3.4.2 Classification Runtime

The comparison of the generalization performance of ENDs, EDBNDs and MyND did not reveal one method as superior to the others. Hence, the testing time of the optimized classification models was evaluated. The results of this evaluation may be of special interest to users who want to apply the method to large-scale analyses of data sets. The runtime was measured on a Dell M605 machine with a Six-Core AMD Opteron processor (2.4 GHz, 6 MB cache, 75 W ACP) running Linux. The average number of sequences tested per minute over the five training sets is shown in Table 3.6. The results indicate that, in contrast to accuracy, there are severe differences in computational time between the three classification approaches. MyND was significantly more efficient than the ensemble-based methods. When ensemble-based methods are applied to two-class data sets, all nested dichotomies are automatically data-balanced, and thus no severe differences were expected between ENDs and EDBNDs; our results confirmed this. However, when the approaches were applied to data sets with more than two classes, MyND, the scheme with a fixed structure, exhibited a significant advantage over ENDs, the ensemble of randomly composed NDs, and EDBNDs, the ensemble of data-balanced NDs.


Based on the classification speed and the accuracy of MyNDs, it was decided to select this multiclass classification approach for our further prediction tasks.

                         ENDs       EDBNDs     MyND
Archaea, Nclasses=2      162·10^3   174·10^3   1218·10^3
Bacteria, Nclasses=6     41·10^3    61·10^3    304·10^3
Eukaryota, Nclasses=10   4.6·10^3   5·10^3     79·10^3

Table 3.6: Comparison of prediction time for ENDs, EDBNDs and MyNDs (Section 2.6) with the optimally chosen complexity parameter C (Section 2.4.2), trained and tested on the training sets of all archaeal, bacterial and eukaryotic proteins. The kernel function used was the Profile kernel (Section 2.5.3). The average number of protein sequences processed per minute over the five training sets is reported. Abbreviations are used as for Table 3.4.


3.5 Testing

After selecting the kernel function and optimizing its parameters and the parameters of the classification model, each of the eight classifiers (Figure 2.7) was evaluated on the test data sets that were kept distinct from the training data sets (Table 3.1) and thus did not participate in any step of the training process. The results of the evaluation are shown in Table 3.7. Additionally, the reliability of the probability scores that are provided by MyNDs for each prediction was examined. Further, the accuracy of MyND predictions was compared to the accuracy of four external classifiers (Section 2.8) on the MyND test data sets.

3.5.1 Performance Evaluation

Accurate distinction between archaeal classes. MyND seemed to perform very well on the test sets of all and non-membrane archaeal proteins with accuracy and coverage at 100% (Table 3.7). These results, however, should be interpreted with caution, as we had only a very limited number of instances available in each of the testing sets.

Overall accuracy of 84% for all bacterial proteins. In order to make a prediction for an unknown protein in Bacteria, the method first determines whether the protein is localized in the cytosol or is sorted to another subcellular compartment (Figure 2.7). The SVM that makes this decision achieved an overall accuracy (eq. 2.25) of 92% (Figure 3.1A, not shown in Table 3.7). The non-cytoplasmic proteins were further classified into proteins residing in the plasma membrane and proteins that are sorted to the cell exterior, at an overall accuracy of 89% (Figure 3.1A). For gram-negative bacteria, MyND predicts two further localization classes, periplasm and outer membrane. Nevertheless, the accuracy did not differ significantly between gram-positive and gram-negative bacteria. For gram-negative bacteria, the overall accuracy at the decision point between extra-cellular and other compartments was 86%. Despite the highly unbalanced testing data sets (the cytosolic class comprised roughly 40% and the plasma membrane class roughly 30% of all sequences), the overall accuracy over six localization classes was 84%.
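To make the cascading mechanism concrete, the following is a minimal, hedged sketch (not the thesis implementation) of a nested dichotomy with a fixed structure: each internal node holds a binary SVM that routes a protein either to a leaf (a localization class) or to the next decision node, and the probabilities along the path are multiplied into a reliability score (Section 3.5.2). The tree, the class names and the toy data are illustrative only:

    import numpy as np
    from sklearn.svm import SVC

    class Node:
        def __init__(self, left, right):
            self.clf = SVC(probability=True)
            self.left, self.right = left, right  # each a Node or a class label (leaf)

        def side_classes(self, side):
            child = self.left if side == 0 else self.right
            return [child] if isinstance(child, str) else child.all_classes()

        def all_classes(self):
            return self.side_classes(0) + self.side_classes(1)

        def fit(self, X, y):
            left = set(self.side_classes(0))
            side = np.array([0 if label in left else 1 for label in y])
            self.clf.fit(X, side)
            for s, child in ((0, self.left), (1, self.right)):
                if isinstance(child, Node):
                    child.fit(X[side == s], y[side == s])

        def predict_one(self, x, score=1.0):
            p = self.clf.predict_proba(x.reshape(1, -1))[0]
            s = int(np.argmax(p))
            child = self.left if s == 0 else self.right
            score *= p[s]  # multiply the path probabilities (prediction reliability)
            return (child, score) if isinstance(child, str) else child.predict_one(x, score)

    # Simplified bacterial hierarchy: cytosol vs. non-cytosolic,
    # then plasma membrane vs. extra-cellular.
    tree = Node("cytosol", Node("plasma membrane", "extra-cellular"))

    rng = np.random.default_rng(0)
    X = rng.normal(size=(90, 10))  # toy sequence-derived features
    y = np.array(["cytosol", "plasma membrane", "extra-cellular"] * 30)
    tree.fit(X, y)
    label, reliability = tree.predict_one(X[0])
    print(label, round(reliability, 2))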


Figure 3.1: MyND performance on the 5-fold cross-validated test data sets of all bacterial proteins. (A) Overall performance. The highest accuracy of 93% was achieved for Level 1 predictions, which separate proteins into cytosolic and non-cytoplasmic proteins. The overall accuracy declined at lower levels of the hierarchical tree (Figure 2.7). The overall accuracy of the Level 2 leaves, which include the cytosolic and inner membrane classes and thus represent an accuracy that separates proteins into three classes, decreased to 89%. For simplicity, the curve for Level 3 predictions is not shown. Level 4 accuracy includes the accuracies of the outer membrane, periplasmic, plasma membrane and cytosolic classes and separates proteins into five classes at 86%. The difference in accuracy between Level 0 and Level 5, which separates proteins into one of six subcellular localization classes, was 9%. (B) Class-wise performance. Only performances for localization classes with at least 20 members are shown. MyND was significantly more accurate in predicting plasma membrane proteins (accuracy of 99% at 55% coverage), followed by cytosolic proteins (accuracy of 97% at 55% coverage). The prediction of periplasmic proteins was significantly worse (accuracy of 78% at 55% coverage). The lowest accuracy was achieved for extra-cellular proteins, at the level of 70%.

Table 3.7: Performance of MyNDs on the test data sets of archaeal, bacterial and eukaryotic proteins. [Table body not recoverable from the source; it lists Nprot, Acc, Cov and gAv per localization class for the classifiers trained on all proteins, on non-membrane proteins and on transmembrane proteins. Localization classes: Archaea: cytosol, plasma membrane/extra-cellular; Bacteria: cytosol, fimbrium, periplasm, outer membrane, extra-cellular, plasma membrane; Eukaryota: chloroplast, cytosol, ER, Golgi, mitochondria, nucleus, peroxisome, plastid, vacuole, plasma membrane/extra-cellular.] Abbreviations: Nprot, the number of proteins with known localization in the given test set; Acc, the accuracy (eq. 2.22); Cov, the coverage (eq. 2.23); gAv, the geometric average of Acc and Cov (eq. 2.24); Q, the overall prediction accuracy (eq. 2.25); asterisk, overoptimistic estimate given by a single unique sequence; dagger, unrealistic upper bound of the standard error (Section 2.7.2) due to the small data set size.
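For reference when reading Table 3.7 and the following tables, the per-class measures can be restated. This is a hedged reconstruction of eqs. 2.22-2.25 from the abbreviations above, assuming the standard per-class definitions, with TP_i, FP_i and FN_i counted per localization class i and N the total number of tested proteins:

    \mathrm{Acc}_i = \frac{TP_i}{TP_i + FP_i}, \qquad
    \mathrm{Cov}_i = \frac{TP_i}{TP_i + FN_i}, \qquad
    \mathrm{gAv}_i = \sqrt{\mathrm{Acc}_i \cdot \mathrm{Cov}_i}, \qquad
    Q = \frac{1}{N}\sum_i TP_i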


Ten-state overall accuracy of 64% on the sets of all eukaryotic proteins. The first decision made by the MyND classification system for all eukaryotic proteins is whether a protein is intra-cellular or belongs to the secretory pathway (Figure 2.7). This decision was made at the level of 85% accuracy (Figure 3.2A). The secretory pathway proteins were further sub-classified into the class of endoplasmic reticulum and the classes of Golgi apparatus and extra-cellular/plasma membrane. The extra-cellular/plasma membrane proteins were among the most accurately predicted, at 76% accuracy and 90% coverage (Figure 3.2B, Table 3.7). It should be noted that the class of extra-cellular proteins was also the most populated, comprising roughly 40% of all tested sequences. Descending the hierarchical tree (Figure 2.7), the sub-classification of intra-cellular and secretory pathway proteins into four further classes occurred at a lower accuracy of 74% (Figure 3.2A). The second most accurate prediction was made for the class of nuclear proteins, the second largest class in our testing data set. The prediction of cytosolic proteins was made at 44% accuracy and 43% coverage. The non-cytosolic proteins were sub-classified into five classes, for three of which it was not possible to obtain any positive prediction at all. These three classes were peroxisome, plastid and vacuole, comprising 8, 14 and 13 proteins, respectively. The other two classes, mitochondrial and chloroplast proteins, were predicted at an accuracy of about 50% and a coverage not higher than 55%. The accuracy at the lowest level of the tree was the overall accuracy Q of 64% (eq. 2.25) for the classification of proteins into one of ten subcellular localization classes.

More experimental annotations required for transmembrane eukaryotic proteins. The MyND classifier trained on the sets of eukaryotic non-membrane proteins was comparable in its prediction performance to MyND trained on the combined sets of non-membrane and transmembrane proteins (Table 3.7). However, the performance of MyND designed for eukaryotic transmembrane proteins only was roughly 10% lower than that of its counterparts. The most accurately predicted class, with an accuracy of 78%, was mitochondria. The prediction accuracy for the other localization classes was roughly 50%. The difficulty of the classifier in correctly recognizing chloroplast, Golgi and plasma membrane proteins was evident from the low levels of coverage, below 40%. Since the number of transmembrane proteins was very limited (it comprised less than 15% of the total number of eukaryotic proteins), we believe that with a higher diversity of protein sequences with reliable subcellular localization annotations in the public databases it will be possible to achieve better performance results with our prediction scheme.

Figure 3.2: MyND performance on the 5-fold cross-validated test data sets of all eukaryotic proteins. (A) Overall performance. The decision node at Level 1 in the hierarchical tree (Figure 2.7) separates intra-cellular proteins from proteins belonging to the secretory pathway at 90% accuracy and 80% coverage. Similarly to Figure 3.1A, the prediction accuracy declined with the depth of the classification tree. The predictions at Level 2, where secretory pathway proteins are separated into the extra-cellular/plasma membrane and Golgi classes, and intra-cellular proteins into non-cytosolic and nucleus/cytosolic classes, were made at 84% accuracy and 80% coverage. The classification into one of ten localization classes at Level 6 was performed at a significantly lower accuracy of around 75% and 80% coverage. For simplicity, the performances at Levels 3-5 are not shown. (B) Class-wise performance. Only performances for localization classes with at least 20 members are shown. MyND performed best at predicting extra-cellular and plasma membrane proteins (95% accuracy at 50% coverage). The prediction of nuclear proteins was about 10% less accurate at 50% coverage, while the performance for mitochondrial proteins was significantly worse, with only 53% correctly predicted at 50% coverage. The predictions for cytosolic, endoplasmic reticulum and chloroplast proteins were made at insufficient levels of accuracy and coverage, below 50%.


3.5.2 Prediction Reliability

Each MyND prediction is accompanied by a probability score for the most likely localization class. The scores are formed by multiplying the probability scores obtained from the individual two-class classifiers [81] of which MyND is composed (Figure 2.7), and are provided in the range between 1 and 100. Intuitively, higher-scored predictions result in a higher accuracy. For example, at a probability score of 80, the accuracy for all bacterial proteins was slightly above 90% (Figure 3.3A). This implies that nine out of ten proteins predicted to be in localization L were also experimentally observed to be in L. At the same time, the coverage was about 70%, denoting that only seven out of ten proteins observed to be in L were predicted to be in L. For eukaryotic proteins, the accuracy at a probability score of 80 was about 90%, with a coverage only slightly above 50% (Figure 3.3B). In general, when considering predictions with the highest levels of accuracy, it should be kept in mind that the coverage of such predictions is only about 20%. At the other extreme, at high levels of coverage, the percentage of correct predictions is 84% and 64% for bacteria and eukaryota, respectively.

Figure 3.3: Accuracy versus coverage for the probability scores of MyND. The x-axis gives the probability scores. The y-axis provides the percentage of accuracy (eq. 2.22) and the percentage of coverage (eq. 2.23). TPs were defined as the number of correct predictions with probability scores above the given thresh- old, FNs as the number of correct predictions with probability scores below the threshold and FPs as the number of wrong predictions with probability scores above the threshold. The curves were obtained on test data sets of all bacterial proteins (A) and all eukaryotic proteins (B).
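The curves in Figure 3.3 follow directly from these definitions. The sketch below recomputes accuracy and coverage at a given score threshold exactly as described in the caption; the example predictions are illustrative:

    def acc_cov_at(threshold, predictions):
        """predictions: list of (is_correct, probability_score) pairs."""
        tp = sum(1 for ok, s in predictions if ok and s >= threshold)
        fn = sum(1 for ok, s in predictions if ok and s < threshold)
        fp = sum(1 for ok, s in predictions if not ok and s >= threshold)
        acc = 100.0 * tp / (tp + fp) if tp + fp else 0.0  # eq. 2.22
        cov = 100.0 * tp / (tp + fn) if tp + fn else 0.0  # eq. 2.23
        return acc, cov

    preds = [(True, 95), (True, 82), (False, 88), (True, 40), (False, 35)]
    for t in (30, 80):
        print(t, acc_cov_at(t, preds))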


3.5.3 Comparison with the External Classifiers

The performance of MyND trained on the data sets of all (non-membrane and transmembrane) proteins was compared to the best publicly available prediction methods for subcellular localization. The external prediction methods were CELLO v.2.5, LOCtree, MultiLoc2 and WoLFPSORT (Section 2.8). It is important to note that some of the proteins in our test sets may be homologous to proteins in the training sets of the predictors considered; as we could not retrain these predictors on our training data, their results may be somewhat overestimated. The external prediction methods were run with their default parameters. The testing sets were the MyND test sets of all bacterial and eukaryotic proteins (Table 3.1).

Bacterial proteins most accurately predicted by MyND. In order to ensure a fair comparison of MyND to other predictors specialized in bacterial proteins, the test sets were restricted to the subcellular localizations predicted by CELLO v.2.5 and LOCtree. Since neither method discriminates between the extra-cellular and fimbrium classes, these classes were grouped into one common class. The number of tested classes was five for the comparison with CELLO v.2.5 and three for the comparison with LOCtree (Table 3.8). On the five-class data set, MyND significantly outperformed CELLO v.2.5, with an 8% higher overall accuracy, and on the three-class data set, the prediction performance of MyND was 5% higher than that of LOCtree. MyND was significantly more accurate in discriminating periplasmic and outer membrane proteins. This was achieved, however, at the cost of reduced coverage. The extra-cellular class predictions of MyND were well balanced between accuracy and coverage, while CELLO v.2.5 and LOCtree tended to under-predict this class. Overall, MyND outperformed the other methods in terms of the average prediction measure (gAv) for all localization classes.


Localization       Nprot   Acc     Cov     gAv     Acc     Cov     gAv
5 Classes                  Methods: MyND           CELLO v.2.5
Cytosol            179     87±6    92±5    89±6    77±7    94±4    85±7
Plasma membrane    144     96±4    95±4    95±5    98±3    83±7    91±7
Periplasm          52      75±16   58±16   66±15   51±12   83±12   65±13
Outer membrane     6       100∗    67±48†  82±47†  50±33   83±35†  65±32
Extra-cellular     98      74±10   77±11   75±11   88±12   39±11   59±11
Q                          86±4                    78±4
3 Classes                  Methods: MyND           LOCtree
Cytosol            179     88±6    92±6    90±6    83±6    96±3    89±6
Periplasm          52      81±15   58±16   68±15   53±13   79±13   65±13
Extra-cellular     98      77±10   77±10   77±11   91±10   42±11   62±12
Q                          82±5                    77±6

Table 3.8: Comparison of MyND trained on the data sets of all bacterial proteins to external prediction methods (Section 2.8). The methods were tested on the MyND test sets of all bacterial proteins (Table 3.1). Extra-cellular and fimbrium proteins were grouped into one common class. CELLO v.2.5 predicted bacterial proteins in five localization classes, LOCtree in only three. Abbreviations are used as for Table 3.7.

Running predictions on eukaryotic proteins. Plastid proteins were excluded from the test sets, as none of the external methods were trained to predict them. For the predictions using MultiLoc2, vacuolar proteins were also removed. A class of organelles, containing endoplasmic reticulum (ER), Golgi apparatus, peroxisomal and vacuolar proteins, was created for the fair comparison with LOCtree, which does not distinguish between these localizations. The external prediction methods were run separately on distinct taxonomies where necessary; for example, LOCtree was run on two distinct sets of plant and non-plant proteins. To deal with multiple localizations predicted by WoLFPSORT and CELLO v.2.5, one localization was chosen at random. It was further noted that WoLFPSORT and CELLO v.2.5 distinguish the classes of cytoskeleton and cytoplasm; in this study, the predicted instances of both classes were regarded as cytoplasmic. The predictions of CELLO v.2.5, WoLFPSORT and MultiLoc2 for the extra-cellular and plasma membrane classes were grouped together in order to enable a comparison with MyND.
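The label harmonization just described amounts to a small mapping step applied to each external prediction. The following hedged sketch illustrates it; the class names are shortened and the mapping tables are a simplified rendering of the rules above:

    import random

    TO_MYND = {
        "cytoskeleton": "cytosol",  # cytoskeleton predictions counted as cytoplasmic
        "extra-cellular": "extra-cellular/plasma membrane",
        "plasma membrane": "extra-cellular/plasma membrane",
    }
    LOCTREE_ORGANELLES = {"ER", "Golgi", "peroxisome", "vacuole"}

    def harmonize(predicted, for_loctree=False):
        if isinstance(predicted, list):           # multiple predicted localizations:
            predicted = random.choice(predicted)  # one is chosen at random
        label = TO_MYND.get(predicted, predicted)
        if for_loctree and label in LOCTREE_ORGANELLES:
            label = "organelles"
        return label

    print(harmonize(["cytoskeleton", "nucleus"]))
    print(harmonize("Golgi", for_loctree=True))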


MyND outperformed other prediction methods on all eukaryotic proteins. In the comparison on the test sets of all eukaryotic proteins (Table 3.1), MyND outperformed the other prediction methods in terms of overall accuracy Q (eq. 2.25) (Table 3.9). It showed the highest gAv, the geometric average of accuracy and coverage (eq. 2.24), for predicting nuclear and extra-cellular/plasma membrane proteins. In comparison to CELLO v.2.5, WoLFPSORT and MultiLoc2, it also achieved the highest gAv for ER and Golgi proteins. In comparison to LOCtree, MyND further achieved the highest gAv for organellar proteins. MyND was least successful in predicting chloroplast proteins and failed to identify vacuolar and peroxisomal proteins. In fact, CELLO v.2.5 and WoLFPSORT were also unable to identify vacuolar proteins. Peroxisomal proteins were either over-predicted (low accuracy, high coverage; MultiLoc2), under-predicted (high accuracy, low coverage; CELLO v.2.5) or not predicted at all (WoLFPSORT). The sets of vacuolar and peroxisomal proteins were the smallest ones. Chloroplast proteins were most accurately predicted by MultiLoc2 and LOCtree (100% and 81%, respectively), the only methods that distinguish between plant and non-plant proteins. For MultiLoc2, however, this came at the cost of an extremely low coverage (13%). It should be mentioned here that LOCtree was trained explicitly on non-membrane proteins, while in our study it was tested on a combination of non-membrane and transmembrane proteins. All tested methods, with the exception of WoLFPSORT, overlapped in the prediction of cytosolic proteins with a gAv of roughly 45%. In terms of overall accuracy Q, MyND scored at least 4% higher than the currently best predictors of subcellular localization.

MyND was most accurate in predicting nearly all localizations of eukaryotic membrane proteins. The prediction performance of MyND trained on the set of all eukaryotic proteins was compared with the external classifiers on separate sets of non-membrane and transmembrane proteins. The results revealed that, in comparison to the test on all proteins, the overall accuracies improve on non-membrane proteins (Appendix, Table 5.1) and decline on transmembrane proteins (Appendix, Table 5.2). The improvement in the overall accuracies was mainly due to the lower number of false positives and thus higher accuracies for the class of extra-cellular/plasma membrane proteins (Appendix, Table 5.1). This indicated that most of the prediction methods have a general bias towards the plasma membrane localization for transmembrane proteins, i.e. membrane-spanning regions are confused with signals or indicators for the plasma membrane. The low accuracies and high coverages achieved by MyND, CELLO v.2.5, WoLFPSORT and MultiLoc2 for plasma membrane proteins confirmed this (Appendix, Table 5.2). It is interesting to observe that, though LOCtree explicitly excluded membrane proteins from its training sets, it reached an overall accuracy at least 11% higher than CELLO v.2.5, WoLFPSORT or MultiLoc2. Moreover, it performed extremely well on chloroplast membrane proteins, with a gAv of 81%. Nevertheless, MyND outperformed the other methods in terms of overall accuracy in both tests. On transmembrane proteins it showed the highest accuracies for nearly all subcellular localizations. The test also showed that the prediction performance can be improved by using different MyND systems if the type of the tested proteins is known a priori. For instance, MyND trained on all eukaryotic proteins reached 45% overall accuracy on transmembrane proteins (Appendix, Table 5.2), while MyND trained on transmembrane proteins only scored 10% higher on the same data set (Table 3.7).

Lower estimates observed than reported in the original publications. During the evaluation, it was noticed that all prediction methods, with the exception of MultiLoc2, have published performance estimates on unseen protein sequences that are higher than those observed in this study. In particular, CELLO v.2.5 reported an overall accuracy of 90% on bacterial proteins and 85% on eukaryotic proteins, while achieving only 78% and 60%, respectively, on our data sets. WoLFPSORT also reported overall accuracies above 73%, while achieving only 60% on our eukaryotic data. Thus, the differences in overall accuracy must be due to the underlying test and training data sets. Indeed, CELLO v.2.5 allowed up to 80% sequence identity for eukaryotic proteins and did not remove homology between its bacterial training and test sets. Homology was also not reduced between the training and test sets of WoLFPSORT. LOCtree applied a very thorough procedure of redundancy reduction between its sets and reported an overall accuracy of 83% for prokaryotic proteins, 74% for non-plant and 70% for plant proteins. In our study, lower estimates of 77% for bacterial and 64% for eukaryotic non-membrane proteins were measured. The differences may be explained by the fact that LOCtree chose HSSP-value ≤ 5 (eq. 2.1) as the threshold for sequence similarity, while we used the more stringent thresholds of HSSP-value ≤ 0 and BLAST E-value ≤ 10^-3 (Section 2.2.2). MultiLoc2 published performance estimates close to our results. In its benchmark study against WoLFPSORT it reported similar results on animal proteins and a clear superiority on fungi and plant proteins. Unexpectedly, in this study, all other methods outperformed MultiLoc2 by at least 9%. The reason for the low performance of MultiLoc2 on our test data set is unknown.


Localization                    Nprot  Acc    Cov    gAv    Acc    Cov    gAv    Acc    Cov    gAv
9 Classes                              Methods: MyND        CELLO v.2.5          WoLFPSORT
Chloroplast                     144    45±13  25±9   34±7   61±16  24±9   38±8   68±14  35±8   48±9
Cytosol                         220    44±8   43±8   44±6   56±10  39±8   47±7   32±7   40±8   36±5
ER                              75     48±17  31±13  38±10  100∗   1±3    12±3   100∗   5±6    23±6
Golgi                           20     67±0   10±18  26±16  100∗   5±11   22±11  0      0      0
Mitochondria                    227    55±9   49±9   52±6   57±9   49±9   53±7   52±9   42±8   46±7
Nucleus                         325    66±6   74±6   70±6   50±5   85±5   65±5   57±6   66±6   61±5
Peroxisome                      8      0      0      0      100∗   25±49  50±48  0      0      0
Vacuole                         13     0      0      0      0      0      0      0      0      0
Extra-cellular/Plasma membrane  636    77±3   90±3   83±4   68±4   77±4   72±4   74±4   85±4   80±4
Q                                      65±3                 60±3                 60±3
8 Classes                              Methods: MyND        MultiLoc2
Chloroplast                     144    45±13  25±9   34±7   100∗   15±7   38±7
Cytosol                         220    44±8   43±8   44±6   28±5   78±7   46±4
ER                              75     48±17  31±13  38±10  20±11  19±11  19±7
Golgi                           20     67±0   10±18  26±16  16±21  15±21  15±11
Mitochondria                    227    55±9   49±9   52±6   55±12  30±8   41±6
Nucleus                         325    66±6   74±6   70±6   79±7   45±6   60±6
Peroxisome                      8      0      0      0      9±10   63±46  24±8
Extra-cellular/Plasma membrane  636    77±3   90±3   83±4   77±4   66±5   71±4
Q                                      65±3                 51±3
6 Classes                              Methods: MyND        LOCtree
Chloroplast                     144    45±14  25±10  34±7   81±10  47±9   61±10
Cytosol                         220    44±8   43±9   44±6   42±8   48±8   44±6
Mitochondria                    227    55±8   49±8   52±6   47±6   63±8   54±6
Nucleus                         325    66±6   74±6   70±6   51±6   57±6   54±5
Organelles                      116    60±17  28±10  41±9   30±15  15±9   21±7
Extra-cellular/Plasma membrane  636    76±4   90±3   83±4   83±4   79±4   81±4
Q                                      65±3                 61±3

Table 3.9: Comparison of MyND trained on the data sets of all eukaryotic proteins to external prediction methods. The methods were tested on the MyND test sets of all eukaryotic proteins (Table 3.1). CELLO v.2.5, WoLFPSORT and MultiLoc2 predictions of extra-cellular and plasma membrane were grouped together, as MyND does not discriminate between these two localizations. As a result, CELLO v.2.5 and WoLFPSORT predicted proteins in nine localization classes. MultiLoc2 did not predict the class of vacuole. The classes of ER, Golgi, peroxisome and vacuole were grouped as organelles for the predictions using LOCtree. Note that the training sets of the external predictors may contain homologs to the sequences in the MyND test sets, leading to overestimated results. Abbreviations are used as for Table 3.7.


3.6 Application to the Independent Test Sets

The performances of MyND, CELLO v.2.5, WoLFPSORT, MultiLoc2 and LOCtree were re-examined and compared on additional protein sequences that were not included in the training sets of any of these prediction methods. For this task, the more general MyND prediction model for eukaryotic proteins was selected, i.e. the one trained on the combined set of non-membrane and transmembrane proteins. This model has the advantage that it does not require knowledge of whether a protein is transmembrane or not and can thus be applied to whole-proteome data sets. It should be noted that LOCtree applied homology-based and text analysis-based tools to SWISS-PROT proteins without subcellular localization annotations, and WoLFPSORT included several hundred Arabidopsis thaliana entries from the GO web site. Thus, if a protein sequence tested here was homologous to any of these additional protein sequences of LOCtree or WoLFPSORT, its prediction was to some degree overestimated.

3.6.1 Re-Training of the Final Classification Model

In order to run the predictions with MyND, it was re-trained on the entire data set of non-membrane and transmembrane proteins, without partitioning it into subsets. The optimal parameters for the Profile kernel (Section 2.5.3) were found by testing the combinations of values k ∈ {5, 6} and σ ∈ {8, 9, 10, 11}. The best performing values were k=6 and σ=10.
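In outline, this search is a small exhaustive grid over the two kernel parameters. The sketch below is hedged: profile_kernel() and cv_overall_accuracy() are hypothetical toy stand-ins for the real Profile kernel computation (Section 2.5.3) and the cross-validated evaluation (Section 2.7.3), and the sequences are made up for illustration:

    from itertools import product
    import numpy as np

    def profile_kernel(seqs, k, sigma):
        # Hypothetical stand-in: counts k-mers shared between sequences, damped
        # by sigma. The real kernel works on evolutionary profiles instead.
        def kmers(s):
            return {s[i:i + k] for i in range(len(s) - k + 1)}
        sets = [kmers(s) for s in seqs]
        return np.array([[len(a & b) / sigma for b in sets] for a in sets])

    def cv_overall_accuracy(K, labels):
        # Hypothetical stand-in for the evaluation: a leave-one-out
        # nearest-neighbour check on the kernel similarities.
        correct = 0
        for i in range(len(labels)):
            sims = [(K[i, j], labels[j]) for j in range(len(labels)) if j != i]
            correct += max(sims)[1] == labels[i]
        return correct / len(labels)

    # Toy sequences and labels, purely illustrative.
    seqs = ["MKKLLPTAAAGLLLLAAQPAMA", "MKQSTIALALLPLLFTPVTKA",
            "MSDNGPQNQRNAPRITFGGPSD", "MAVMAPRTLVLLLSGALALTQT"]
    labels = ["secreted", "secreted", "cytosol", "cytosol"]

    best = max(((cv_overall_accuracy(profile_kernel(seqs, k, s), labels), (k, s))
                for k, s in product((5, 6), (8, 9, 10, 11))))
    print("best (k, sigma):", best[1])  # the text reports k=6, sigma=10 as optimal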

3.6.2 Comparison with the External Classifiers

LocDB

The LocDB database is a manually curated knowledge base of experimental subcellular localization annotations for Homo sapiens and Arabidopsis thaliana proteins. The information is derived either from SWISS-PROT entries or by mining the primary literature and other databases, thus considerably extending the number of annotations available in SWISS-PROT. Although the prediction method developed here was designed not to differentiate between plant and animal proteins, two separate analyses were performed in order to investigate the prediction performance of MyND on proteins from different eukaryotic taxonomies.


Localization                    Nprot  Acc    Cov    gAv    Acc    Cov    gAv    Acc    Cov    gAv
7 Classes                              Methods: MyND        CELLO v.2.5          WoLFPSORT
Cytosol                         63     54±21  30±15  40±11  49±20  27±13  36±11  55±15  54±16  54±11
ER                              9      23±28  33±48  28±21  0      0      0      100∗   11±25  33±30
Golgi                           7      0      0      0      0      0      0      0      0      0
Mitochondria                    43     51±17  47±19  49±15  78±17  60±19  69±16  63±19  56±19  59±14
Nucleus                         39     40±15  64±18  51±13  33±12  69±17  47±11  50±21  44±21  47±15
Vacuole                         5      0      0      0      0      0      0      0      0      0
Extra-cellular/Plasma membrane  66     58±13  70±13  64±12  62±14  73±13  67±12  55±12  79±11  66±11
Q                                      49±8                 51±8                 55±8
6 Classes                              Methods: MyND        MultiLoc2
Cytosol                         63     56±21  30±14  41±11  47±11  76±14  60±11
ER                              9      23±28  33±48  28±21  7±14   11±34  9±11
Golgi                           7      0      0      0      27±30  43±50  34±28
Mitochondria                    43     51±17  47±19  49±15  72±22  42±18  55±16
Nucleus                         39     40±15  64±18  51±13  69±32  23±15  40±15
Extra-cellular/Plasma membrane  66     61±13  70±13  65±13  92±10  53±13  70±13
Q                                      50±8                 50±8
6 Classes                              Methods: MyND        LOCtree
Cytosol                         63     54±21  30±14  41±11  47±17  37±16  41±11
Mitochondria                    43     51±17  47±19  49±15  59±16  74±17  66±16
Nucleus                         39     40±15  64±18  51±13  32±13  51±20  41±11
Organelles                      21     31±33  24±23  27±15  23±32  19±22  21±16
Extra-cellular/Plasma membrane  66     58±13  70±13  64±12  65±15  48±14  56±12
Q                                      50±8                 48±8

Table 3.10: Comparison of MyND trained on the data sets of all eukaryotic proteins to external prediction methods. The test set was the sequence-unique set of Homo sapiens proteins derived from LocDB (Table 2.3). None of the proteins in the test set shared a significant sequence similarity to any protein with a subcellular localization annotation in the SWISS-PROT 2011_04 release. CELLO v.2.5, WoLFPSORT and MultiLoc2 predictions of extra-cellular and plasma membrane were grouped together, as MyND does not discriminate between these two localizations. As a result, CELLO v.2.5 and WoLFPSORT assigned proteins to seven localization classes. MultiLoc2 does not make predictions for the class of vacuole, thus the number of predicted localization classes for MultiLoc2 was six. The classes of ER, Golgi, peroxisome and vacuole were grouped as organelles for the predictions using LOCtree. Abbreviations are used as for Table 3.7.


Bias in LocDB or moderate performance of MyND on Homo sapiens proteins. The analysis of the performance of the individual predictors on the homology-reduced set of Homo sapiens proteins (Table 2.3) revealed a clear reduction in the overall accuracies of most predictors (Table 3.10) when compared to the results in Table 3.9. The difference was 5% for WoLFPSORT, 9% for CELLO v.2.5, 13% for LOCtree and 15% for MyND. The only method with an unchanged overall accuracy was MultiLoc2. It was also the only method that was able to correctly predict Golgi proteins. However, for each localization class MultiLoc2 seemed to be prone to either over-prediction (low Acc, high Cov) or under-prediction (high Acc, low Cov). About one fourth of the proteins in the testing set were annotated as cytoplasmic. All methods predicted cytoplasmic proteins at a roughly similar accuracy of 50%. Higher levels of coverage were reached only by MultiLoc2 (60%) and WoLFPSORT (54%); the coverages of the other methods were at about 30%. This result indicates either a better performance of MultiLoc2 and WoLFPSORT on cytoplasmic proteins or a bias of LocDB towards this localization. In contrast to the results on the MyND testing sets (Table 3.9), MyND was outperformed by almost all other methods in terms of accuracy and coverage for mitochondrial proteins. For the other localization classes MyND showed the highest gAv. Overall, MyND was outperformed by CELLO v.2.5 and WoLFPSORT by 2% and 6%, respectively. It had the same level of overall accuracy as MultiLoc2 and a 2% higher level than LOCtree.

Outstanding performance of MyND on Arabidopsis thaliana proteins. The homology-reduced testing set of Arabidopsis thaliana proteins contained only a few (less than 10) proteins for most localization classes (Table 2.3). Therefore, some care should be taken when interpreting the results. Nevertheless, MyND outperformed all other methods by at least 5% overall accuracy (Table 3.11). Surprisingly, the lowest levels of overall accuracy were observed for WoLFPSORT and LOCtree. The low performances could be explained by the absence of chloroplast proteins from the testing set, the class of proteins that is very accurately predicted by both methods (Table 3.9). Similarly to the previous tests, MyND showed the highest levels of accuracy and coverage for nuclear and organellar proteins. In contrast to the results on Homo sapiens proteins (Table 3.10), it also scored highest in terms of gAv for cytosolic proteins, thus underlining its generalizability for this class of proteins. However, for mitochondrial proteins it achieved the same level of gAv as CELLO v.2.5 and a lower level than WoLFPSORT, MultiLoc2 and LOCtree. The predictions of the different methods for extra-cellular/plasma membrane proteins were in general consistent.

Localization                    Nprot  Acc     Cov     gAv     Acc     Cov     gAv     Acc     Cov     gAv
8 Classes                              Methods: MyND           CELLO v.2.5             WoLFPSORT
Cytosol                         4      27±31   75±41†  45±30   33±47   50±50   41±43   13±21   50±50   25±17
ER                              4      25±41   50±50   35±27   0       0       0       0       0       0
Golgi                           3      0       0       0       0       0       0       0       0       0
Mitochondria                    9      40±48   22±43   30±24   40±49   22±40   30±22   100∗    22±41   47±40
Nucleus                         9      78±29†  78±42†  78±33†  39±26   78±37   55±28   55±33   67±46   60±33
Peroxisome                      3      0       0       0       0       0       0       0       0       0
Vacuole                         12     50±0    8±19    20±17   0       0       0       25±43   17±26   20±16
Extra-cellular/Plasma membrane  23     52±23   65±26   58±18   53±21   70±26   61±17   58±25   48±27   53±20
Q                                      45±15                   40±15                   34±15
7 Classes                              Methods: MyND           MultiLoc2
Cytosol                         4      33±48   75±41†  50±30   18±22   100∗    43±21
ER                              4      29±46   50±50   38±42   0       0       0
Golgi                           3      0       0       0       33±0    33±0    33±42
Mitochondria                    9      50±50   22±42   33±26   75±41†  33±47   50±32
Nucleus                         9      88±33†  78±42†  82±33†  100∗    56±50†  75±49†
Peroxisome                      3      0       0       0       0       0       0
Extra-cellular/Plasma membrane  23     63±20   65±26   64±23   90±20†  39±28   59±24
Q                                      53±16                   40±16
6 Classes                              Methods: MyND           LOCtree
Cytosol                         4      27±31   75±41†  45±30   11±32   25±45   17±18
Mitochondria                    9      40±48   22±44   30±24   44±49   44±49   44±32
Nucleus                         9      78±38†  78±42†  78±33†  20±28   33±48   26±20
Organelles                      22     50±35   27±26   37±20   50±50   9±15    21±13
Extra-cellular/Plasma membrane  23     52±23   65±26   58±18   71±25   52±28   61±23
Q                                      49±15                   33±15

Table 3.11: Comparison of MyND trained on the data sets of all eukaryotic proteins to external prediction methods. The test set was the sequence-unique set of Arabidopsis thaliana proteins derived from LocDB (Table 2.3). None of the proteins in the test set shared a significant sequence similarity to any protein with a subcellular localization annotation in the SWISS-PROT 2011_04 release. CELLO v.2.5, WoLFPSORT and MultiLoc2 predictions of the extra-cellular and plasma membrane compartments were grouped together, as MyND does not discriminate between these two localizations. As a result, CELLO v.2.5 and WoLFPSORT assigned proteins to eight localization classes. MultiLoc2 does not make predictions for the class of vacuole, thus the number of predicted localization classes for MultiLoc2 was seven. The classes of ER, Golgi, peroxisome and vacuole were grouped as organelles for the predictions using LOCtree. Abbreviations are used as for Table 3.7.


Newly Added SWISS-PROT Proteins

Localization                    Nprot  Acc     Cov     gAv     Acc     Cov     gAv     Acc     Cov     gAv
6 Classes                              Methods: MyND           CELLO v.2.5             WoLFPSORT
Cytosol                         12     75±45   25±30   43±27   80±38†  33±33   52±32   80±42   33±33   52±34
ER                              3      50±50   67±0    58±49†  0       0       0       0       0       0
Mitochondria                    5      50±50   60±49†  55±50†  33±46   40±50   37±44   38±48   60±48   47±32
Nucleus                         18     70±28   89±18†  79±23†  58±22   78±22   67±23   68±27   72±28   70±24
Vacuole                         2      0       0       0       0       0       0       0       0       0
Extra-cellular/Plasma membrane  11     71±29   91±21†  81±33†  47±26   73±29   59±27   63±33   91±20†  75±25
Q                                      64±15                   53±15                   58±15
5 Classes                              Methods: MyND           MultiLoc2
Cytosol                         12     75±45   25±30   43±27   56±27   83±25†  68±26
ER                              3      67±0    67±0    67±50†  67±0    67±0    67±35†
Mitochondria                    5      50±50   60±49†  55±50†  33±47   40±49   37±43
Nucleus                         18     70±28   89±18†  79±23   85±24   61±24   72±28
Extra-cellular/Plasma membrane  11     71±29   91±21†  81±33†  70±30   64±30   67±33
Q                                      67±15                   63±15
5 Classes                              Methods: MyND           LOCtree
Cytosol                         12     75±45   25±30   43±27   63±40†  42±34   51±30
Mitochondria                    5      50±50   60±49   55±50†  63±48†  100∗    79±47†
Nucleus                         18     70±28   89±18†  79±23†  50±29   56±29   53±24
Organelles                      6      80±41†  67±47†  73±50†  0       0       0
Extra-cellular/Plasma membrane  11     71±29   91±21†  81±33†  69±30   82±25†  75±33†
Q                                      68±15                   55±15

Table 3.12: Comparison of MyND trained on the data sets of all eukaryotic proteins to external prediction methods. The test set was the sequence-unique set of newly added SWISS-PROT proteins (Table 2.4). None of the proteins in the test set shared a significant sequence similarity to any protein with a subcellular localization annotation in the SWISS-PROT 2011_04 release. For each prediction method the results were obtained as described for Table 3.11. Abbreviations are used as for Table 3.7.

MyND very accurate on newly added SWISS-PROT proteins. The analysis of the performance of MyND on the redundancy-reduced set of SWISS-PROT proteins added between releases 2011_04 and 2011_07 (Table 2.4), and its comparison to the external prediction methods, confirmed our previous findings. Namely, it achieved the highest performance in terms of accuracy and coverage for organellar and extra-cellular proteins (Table 3.12). MyND also showed the highest coverage for nuclear proteins, and in accuracy it was outperformed only by MultiLoc2. Mitochondrial proteins were most accurately predicted by LOCtree, followed by MyND. An interesting result was observed for cytosolic proteins: while MyND, CELLO v.2.5, WoLFPSORT and LOCtree tended to under-predict them (high Acc, low Cov), MultiLoc2 was biased towards over-prediction. Overall, MyND outperformed the other methods by at least 4% overall accuracy.

3.7 Localization-wise Performance of MyND

The performance of MyND was evaluated and compared to CELLO v.2.5, WoLFPSORT, MultiLoc2 and LOCtree on five different redundancy-reduced test sets: the sets of bacterial and eukaryotic proteins used for testing MyND, the sets of Homo sapiens and Arabidopsis thaliana proteins derived from LocDB, and the set of SWISS-PROT proteins added after the 2011_04 release. The results showed that MyND was the most accurate method in predicting subcellular localization for proteins in four of the testing sets (Tables 3.8-3.9, 3.11-3.12). For the fifth testing set it remained unclear whether it contained a bias towards certain localizations or whether MyND was indeed least successful in predicting them (Table 3.10). The following list gives an overview of the performances of the individual classifiers in predicting the main compartments of eukaryotic cells.

Chloroplast. Chloroplast was the compartment for which the MyND performance was least accurate. The most accurate predictions were achieved by MultiLoc2 and LOCtree, the only methods discriminating between plant and non-plant proteins.

Cytosol. MyND predictions for prokaryotic proteins localized in the cytosol were very balanced between accuracy and coverage, while CELLO v.2.5 and LOCtree tended to over-predict them. MyND was also the method with the highest accuracy. For eukaryotic proteins, the performances of the different methods varied strongly between the data sets. On the large sets, MyND again showed a good balance and was never the worst in accuracy or coverage.

Mitochondria. MyND showed a performance equivalent to the other methods on mitochondrial proteins derived from SWISS-PROT. On proteins from the LocDB database, the predictions of the individual methods disagreed considerably. The highest accuracy for Homo sapiens proteins was achieved by CELLO v.2.5 and the highest coverage by LOCtree. For Arabidopsis thaliana proteins, WoLFPSORT scored highest in accuracy and LOCtree again in coverage.

Nucleus. Localization assignments to the nucleus reached high values of the geometric average of accuracy and coverage for all five methods. The method with the highest value on all our testing sets was MyND. For the most part, MyND also showed the best balance between accuracy and coverage.

Plasma membrane/extra-cellular space. The prediction method described here was designed not to discriminate between the compartments of plasma membrane and extra-cellular space. Therefore, the predictions of the other methods for the two compartments were grouped together. MyND performed best on SWISS-PROT proteins, but on proteins with literature-mined annotations it was outperformed either in accuracy or in coverage by other methods. However, it should be noted that while the other methods tended to over-predict (WoLFPSORT) or under-predict (CELLO v.2.5, MultiLoc2 and LOCtree), MyND predictions were more balanced.

Endoplasmic reticulum, Golgi apparatus, peroxisome and vacuole. The localization compartments of endoplasmic reticulum, Golgi apparatus, peroxisome and vacuole were the least represented in the testing sets. Due to the extremely low number of protein sequences in these classes, all of the methods failed to predict at least one of the compartments correctly. Furthermore, the predictions were characterized by a high number of false negatives, leading to low coverage values. Nevertheless, MyND showed in most cases the highest values of accuracy and coverage, highlighting its advantage over the other methods even on poorly represented compartments.


4 Conclusion

This thesis has presented the development of a fast and accurate method for the prediction of protein subcellular localization. The method was trained and tested in a 5-fold cross-validation on non-redundant data sets of annotated proteins from the SWISS-PROT 2011_04 release. Non-membrane and transmembrane proteins were included in the data sets. The aim was to predict three classes for archaea, six classes for bacteria and eleven classes for eukaryota based on the input protein sequence.

The development of this method involved the comparison of three popular sequence kernels in terms of their overall accuracy. The results show that on the smallest, two-class archaeal data set there were no significant differences between the individual kernels. However, on the bacterial and eukaryotic data sets with a larger number of classes, the Profile kernel clearly outperformed the String Subsequence and the Mismatch kernels.

Subsequent to the choice of the kernel, an SVM-based multiclass classification approach was selected. The one-against-all approach was benchmarked against ensembles of randomly chosen nested dichotomies, ensembles of class-balanced nested dichotomies, ensembles of data-balanced nested dichotomies and a predefined nested dichotomy whose structure followed the general pathways of protein sorting. No differences in performance between the approaches were observed on the two-class data set; on the data sets with a larger number of classes, however, the one-against-all approach performed worst. The SVM parameter C was optimized for the hierarchy-based approaches and no significant differences in overall accuracy between them were found. A comparison of the approaches in terms of computational time, however, revealed a clear superiority of the predefined nested dichotomy over the others. Thus, the predefined nested dichotomy was chosen as the underlying approach for the subcellular localization prediction method described in this thesis.

For each taxon, three different types of predictors were built, specific to non-membrane proteins, transmembrane proteins and proteins of all kinds. The results showed that the predictors specialized in non-membrane and transmembrane proteins perform comparably to or better than the general predictor trained on all types of proteins. Nevertheless, the advantage of the latter is its independence from prior information about the query protein; it can therefore be applied to screenings of entire proteomes and hypothetical protein sequences. For each localization, the prediction of membrane-spanning proteins can then be carried out using MEMSAT-SVM [101], PolyPhobius [102] or SCAMPI [103], the methods found to be most accurate in the parallel thesis of Jonas Reeb [104]. The general method suited for all types of proteins outperformed the other predictors on the test sets of bacterial and eukaryotic proteins that did not share sequence similarity with the proteins used for training. This indicates that the signal for protein sorting may be better inferred from the evolutionary profile of a protein, which is used by the Profile kernel, than derived from the sequence alone.

Additionally, the prediction performance of the general method was compared to the other predictors on independent data sets that were not used during the development of any method tested here. It was found that, on the literature-mined data sets, the predictors predominantly disagreed and our method was not among the best scoring on the set of human proteins. However, on the literature-mined sets of plant proteins, as well as on the SWISS-PROT proteins added after the 2011_04 release, it performed favorably.

The prediction performance of our methods implies that they can be employed as a cheap starting point for the careful design of wet-lab experiments, and their prediction speed indicates their suitability for large-scale studies of proteins. Furthermore, the localization signals identified by the Profile kernel may enhance our understanding of the biological mechanisms of protein sorting. The flexible structure of our methods allows them to be extended by new or finer-grained subcellular locations, such as the inner and outer membranes of eukaryotic organelles or various subcompartments of the nucleus. Since the Profile kernel incorporates information from the entire protein sequence, the prediction methods developed here may be expected to improve in accuracy with re-training on larger and more diverse data sets. Furthermore, the inclusion of other sequence-derived features (e.g. secondary structure, binding motifs, solvent accessibility) may also positively affect their prediction performance.

The prediction methods developed here are planned to be embedded in PredictProtein, the highly used Internet service for sequence analysis and the prediction of various aspects of protein function and structure [105]. The Linux binaries as well as the benchmarking data sets will be freely available at www.rostlab.org.

5 Appendix

5.1 Effectiveness of the Mismatch Kernel

Figure 5.1: The influence of the Mismatch kernel parameters on the overall accuracy (eq. 2.25) of the one-against-all classifier (Section 2.6.1). For the description of the kernel parameters refer to Section 2.5.2. Note that each combination of the parameters k and m was evaluated in a 10-fold cross-validation for each of the five training sets separately (Section 2.7.3); the averages of the results over all training sets are reported. The data sets were the combination of non-membrane and transmembrane proteins with two class labels in archaea and six class labels in bacteria (Table 3.1). The standard error (Section 2.7.2) in the overall prediction accuracy was 3% for archaea and 2% for bacteria.


5.2 Effectiveness of the String Subsequence Kernel

Figure 5.2: The influence of the String Subsequence kernel parameters on the overall accuracy (eq. 2.25) of the one-against-all classifier (Section 2.6.1). For the description of the kernel parameters refer to Section 2.5.1. Each combination of the parameters λ, k and l was tested in a 10-fold cross-validation for each of the five training sets separately (Section 2.7.3); the averages of the results over all training sets are reported. The data sets were the combination of non-membrane and transmembrane proteins with two class labels in archaea, six class labels in bacteria and ten class labels in eukaryota (Table 3.1). The standard error (Section 2.7.2) in the overall prediction accuracy was 3% for archaea, 2% for bacteria and 1% for eukaryota.


5.3 Effectiveness of the Profile Kernel

Figure 5.3: The influence of the Profile kernel parameters on the overall accuracy (eq. 2.25) of the one-against-all classifier (Section 2.6.1). For the description of the kernel parameters refer to Section 2.5.3. Each combination of the parameters σ and k was evaluated in a 10-fold cross-validation for each of the five training sets separately (Section 2.7.3). The results were obtained as reported in the description of Figure 5.2.


5.4 MyND Benchmark on Eukaryotic Non-membrane Proteins

Localization                    Nprot  Acc    Cov    gAv    Acc    Cov    gAv    Acc    Cov    gAv
9 Classes                              Methods: MyND        CELLO v.2.5          WoLFPSORT
Chloroplast                     133    46±15  25±9   34±7   63±17  23±9   38±8   69±15  32±10  47±9
Cytosol                         220    46±9   43±8   45±6   59±10  39±8   48±6   35±7   40±8   38±5
ER                              10     13±27  10±19  11±12  100∗   10±21  32±20  0      0      0
Golgi                           3      50±0   33±0   41±45  0      0      0      0      0      0
Mitochondria                    140    47±9   48±10  47±7   54±9   67±10  60±7   49±10  51±10  50±8
Nucleus                         320    68±6   75±6   71±6   53±5   86±5   67±5   59±6   67±6   63±6
Peroxisome                      6      0      0      0      100∗   33±46  58±47  0      0      0
Vacuole                         3      0      0      0      0      0      0      0      0      0
Extra-cellular/Plasma membrane  596    84±4   90±3   87±4   83±4   77±4   80±4   85±3   86±4   86±4
Q                                      68±3                 66±3                 65±3
8 Classes                              Methods: MyND        MultiLoc2
Chloroplast                     133    46±15  25±9   34±7   100∗   13±7   36±7
Cytosol                         220    46±9   43±8   45±6   30±4   78±6   48±4
ER                              10     13±27  10±19  11±12  6±9    20±27  11±7
Golgi                           3      50±0   33±0   41±45  0      0      0
Mitochondria                    140    47±9   48±10  47±7   53±11  44±10  48±8
Nucleus                         320    68±6   75±6   71±6   79±8   46±6   60±6
Peroxisome                      6      0      0      0      15±17  67±49  32±17
Extra-cellular/Plasma membrane  596    84±4   90±3   87±4   89±3   67±4   77±5
Q                                      68±3                 56±3
6 Classes                              Methods: MyND        LOCtree
Chloroplast                     133    46±15  25±9   34±7   83±12  43±11  60±11
Cytosol                         220    46±8   43±8   45±5   48±8   48±8   48±6
Mitochondria                    140    47±10  48±10  47±8   38±8   60±10  48±7
Nucleus                         320    68±6   75±6   71±6   57±7   58±7   57±6
Organelles                      22     18±27  9±16   13±10  11±13  18±20  14±8
Extra-cellular                  596    83±4   90±3   87±4   87±4   82±4   84±4
Q                                      68±3                 64±3

Table 5.1: Comparison of MyND trained on the data sets of all eukaryotic proteins to external prediction methods (Section 2.8). The test sets were the sets of non- membrane eukaryotic proteins that were distinct from the MyND training sets. CELLO v.2.5, WoLFPSORT and MultiLoc2 predictions of extra-cellular and plasma membrane compartments were grouped together for a fair comparison with MyND. Classes of ER, Golgi, peroxisome and vacuole were grouped as organelles for the predictions using LOCtree. Abbreviations are used as for Table 3.7.


5.5 MyND Benchmark on Eukaryotic Transmembrane Proteins

Localization                    Nprot  Acc    Cov    gAv    Acc    Cov    gAv    Acc    Cov    gAv
8 Classes                              Methods: MyND        CELLO v.2.5          WoLFPSORT
Chloroplast                     11     33±47  27±30  30±20  43±48  27±30  34±25  58±33  64±31  61±33
ER                              65     55±18  34±14  43±10  0      0      0      100∗   6±8    25±7
Golgi                           17     100∗   6±13   24±12  100∗   6±13   24±12  0      0      0
Mitochondria                    87     74±13  52±13  62±12  86±18  21±11  42±10  62±18  28±11  41±11
Nucleus                         5      13±30  20±40  16±20  4±9    20±40  9±8    0      0      0
Vacuole                         10     0      0      0      0      0      0      0      0      0
Extra-cellular/Plasma membrane  40     31±10  78±17  49±10  20±8   88±14  42±7   25±8   80±14  44±8
Q                                      45±7                 26±7                 30±7
7 Classes                              Methods: MyND        MultiLoc2
Chloroplast                     11     33±47  27±30  30±20  100∗   36±34  60±34
ER                              65     55±18  34±14  43±10  34±19  18±11  25±9
Golgi                           17     100∗   6±13   24±12  23±28  18±11  20±14
Mitochondria                    87     74±13  52±13  62±12  88±31† 8±6    27±7
Nucleus                         5      13±30  20±40  16±20  0      0      0
Extra-cellular/Plasma membrane  40     31±10  78±17  49±10  21±10  50±20  32±9
Q                                      45±7                 21±7
5 Classes                              Methods: MyND        LOCtree
Chloroplast                     11     33±47  27±29  30±21  71±28  91±20  81±30
Mitochondria                    87     74±14  52±12  62±11  67±12  69±12  68±11
Nucleus                         5      11±26  20±40  15±19  2±6    20±40  7±6
Organelles                      94     70±16  33±12  48±11  62±26  14±8   29±8
Extra-cellular                  40     30±10  78±17  48±10  30±15  33±19  31±11
Q                                      45±7                 41±7

Table 5.2: Comparison of MyND trained on the data sets of all eukaryotic proteins to external prediction methods (Section 2.8). The test sets were the sets of transmembrane eukaryotic proteins that were distinct from the MyND training sets. For each prediction method the results were obtained as explained in the description of Table 5.1. Abbreviations are used as for Table 3.7.


Bibliography

[1] L. J. Jensen, R. Gupta, N. Blom, D. Devos, J. Tamames, C. Kesmir, H. Nielsen, H. H. Staerfeldt, K. Rapacki, C. Workman, C. A. F. Andersen, S. Knudsen, A. Krogh, A. Valencia, and S. Brunak. Prediction of human protein function from post-translational modifications and localization features. J Mol Biol, 319(5):1257–1265, Jun 2002.

[2] P. J. Coates and P. A. Hall. The yeast two-hybrid system for identifying protein- protein interactions. J Pathol, 199(1):4–7, Jan 2003.

[3] Tijana Milenkovic and Natasa Przulj. Uncovering biological network function via graphlet degree signatures. Cancer Inform, 6:257–273, 2008.

[4] B. Alberts, D. Bray, J. Lewis, M. Raff, K. Roberts, and J.D. Watson. Molecular Biology of the Cell. Garland, 4th edition, 2002.

[5] M. A. Andrade, S. I. O’Donoghue, and B. Rost. Adaptation of protein surfaces to subcellular location. J Mol Biol, 276(2):517–525, Feb 1998.

[6] B. W. Matthews. Structural and genetic analysis of protein stability. Annu Rev Biochem, 62:139–160, 1993.

[7] G. D. Rose, A. R. Geselowitz, G. J. Lesser, R. H. Lee, and M. H. Zehfus. Hydropho- bicity of amino acid residues in globular proteins. Science, 229(4716):834–838, Aug 1985.

[8] G. von Heijne. Membrane proteins: the amino acid composition of membrane- penetrating segments. Eur J Biochem, 120(2):275–278, Nov 1981.

[9] D. M. Engelman, T. A. Steitz, and A. Goldman. Identifying nonpolar transbilayer helices in amino acid sequences of membrane proteins. Annu Rev Biophys Biophys Chem, 15:321–353, 1986.


[10] G. von Heijne. Membrane protein structure prediction: hydrophobicity analysis and the positive-inside rule. J Mol Biol, 225(2):487–494, May 1992.

[11] T. Vellai and G. Vida. The origin of eukaryotes: the difference between prokaryotic and eukaryotic cells. Proc Biol Sci, 266(1428):1571–1577, Aug 1999.

[12] Kara Rogers, editor. The Cell. New York, N.Y. : Britannica Educational Pub. in association with Rosen Educational Services, 2011.

[13] Olof Emanuelsson, Søren Brunak, Gunnar von Heijne, and Henrik Nielsen. Locating proteins in the cell using TargetP, SignalP and related tools. Nat Protoc, 2(4):953–971, 2007.

[14] Kim D. Pruitt, Tatiana Tatusova, and Donna R. Maglott. NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res, 33(Database issue):D501–D504, Jan 2005.

[15] A. Bairoch and R. Apweiler. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res, 28(1):45–48, Jan 2000.

[16] E. M. Marcotte, I. Xenarios, A. M. van Der Bliek, and D. Eisenberg. Localizing proteins in the cell from their phylogenetic profiles. Proc Natl Acad Sci U S A, 97(22):12115–12120, Oct 2000.

[17] Richard Mott, Jörg Schultz, Peer Bork, and Chris P. Ponting. Predicting protein cellular localization using a domain projection method. Genome Res, 12(8):1168–1174, Aug 2002.

[18] Rajesh Nair and Burkhard Rost. Sequence conserved for subcellular localization. Protein Sci, 11(12):2836–2847, Dec 2002.

[19] K. O. Wrzeszczynski and B. Rost. Annotating proteins from endoplasmic reticulum and Golgi apparatus in eukaryotic proteomes. Cell Mol Life Sci, 61(11):1341–1353, Jun 2004.

[20] J. L. Gardy, M. R. Laird, F. Chen, S. Rey, C. J. Walsh, M. Ester, and F. S. L. Brinkman. PSORTb v.2.0: expanded prediction of bacterial protein subcellular localization and insights gained from comparative proteome analysis. Bioinformatics, 21(5):617–623, Mar 2005.


[21] M. Cokol, R. Nair, and B. Rost. Finding nuclear localization signals. EMBO Rep, 1(5):411–415, Nov 2000.

[22] Alex N. Nguyen Ba, Anastassia Pogoutse, Nicholas Provart, and Alan M. Moses. NLStradamus: a simple hidden Markov model for nuclear localization signal prediction. BMC Bioinformatics, 10:202, 2009.

[23] M. G. Claros and P. Vincens. Computational method to predict mitochondrially imported proteins and their targeting sequences. Eur J Biochem, 241(3):779–786, Nov 1996.

[24] H. Nielsen, J. Engelbrecht, S. Brunak, and G. von Heijne. Identification of prokary- otic and eukaryotic signal peptides and prediction of their cleavage sites. Protein Eng, 10(1):1–6, Jan 1997.

[25] O. Emanuelsson, H. Nielsen, and G. von Heijne. ChloroP, a neural network-based method for predicting chloroplast transit peptides and their cleavage sites. Protein Sci, 8(5):978–984, May 1999.

[26] O. Emanuelsson, H. Nielsen, S. Brunak, and G. von Heijne. Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J Mol Biol, 300(4):1005–1016, Jul 2000.

[27] Sang-Mun Chi. Prediction of protein subcellular localization by weighted gene ontology terms. Biochem Biophys Res Commun, 399(3):402–405, Aug 2010.

[28] Suyu Mei, Wang Fei, and Shuigeng Zhou. Gene ontology based transfer learning for protein subcellular localization. BMC Bioinformatics, 12:44, 2011.

[29] Rajesh Nair and Burkhard Rost. Inferring sub-cellular localization through auto- mated lexical analysis. Bioinformatics, 18 Suppl 1:S78–S86, 2002.

[30] Z. Lu, D. Szafron, R. Greiner, P. Lu, D. S. Wishart, B. Poulin, J. Anvik, C. Mac- donell, and R. Eisner. Predicting subcellular localization of proteins using machine- learned classifiers. Bioinformatics, 20(4):547–556, Mar 2004.

[31] Scott Brady and Hagit Shatkay. EpiLoc: a (working) text-based system for predicting protein subcellular location. Pac Symp Biocomput, pages 604–615, 2008.


[32] A. Reinhardt and T. Hubbard. Using neural networks for prediction of the sub- cellular location of proteins. Nucleic Acids Res, 26(9):2230–2236, May 1998.

[33] Kuo-Chen Chou and Hong-Bin Shen. Predicting eukaryotic protein subcellular location by fusing optimized evidence-theoretic k-nearest neighbor classifiers. J Proteome Res, 5(8):1888–1897, Aug 2006.

[34] Z. Yuan. Prediction of protein subcellular locations using Markov chain models. FEBS Lett, 451(1):23–26, May 1999.

[35] Tien-ho Lin, Robert F. Murphy, and Ziv Bar-Joseph. Discriminative motif finding for predicting protein subcellular localization. IEEE/ACM Trans Comput Biol Bioinform, 8(2):441–451, 2011.

[36] S. Hua and Z. Sun. Support vector machine approach for protein subcellular localization prediction. Bioinformatics, 17(8):721–728, Aug 2001.

[37] Keun-Joon Park and Minoru Kanehisa. Prediction of protein subcellular locations by support vector machines using compositions of amino acids and amino acid pairs. Bioinformatics, 19(13):1656–1663, Sep 2003.

[38] Chin-Sheng Yu, Chih-Jen Lin, and Jenn-Kang Hwang. Predicting subcellular localization of proteins for Gram-negative bacteria by support vector machines based on n-peptide compositions. Protein Sci, 13(5):1402–1406, May 2004.

[39] Andrea Pierleoni, Pier Luigi Martelli, Piero Fariselli, and Rita Casadio. BaCelLo: a balanced subcellular localization predictor. Bioinformatics, 22(14):e408–e416, Jul 2006.

[40] C. S. Yu, Y. C. Chen, C. H. Lu, and J. K. Hwang. Prediction of protein subcellular localization. Proteins: Structure, Function and Bioinformatics, 64:643–651, 2006.

[41] P. Horton, K. J. Park, T. Obayashi, and K. Nakai. Protein subcellular localization prediction with WoLF PSORT. In Proceedings of the 4th Annual Asia Pacific Bioinformatics Conference (APBC06), pages 39–48, 2006.

[42] Rajesh Nair and Burkhard Rost. Mimicking cellular sorting improves prediction of subcellular localization. J Mol Biol, 348(1):85–100, Apr 2005.

[43] Sebastian Briesemeister, Torsten Blum, Scott Brady, Yin Lam, Oliver Kohlbacher, and Hagit Shatkay. SherLoc2: a high-accuracy hybrid method for predicting subcellular localization of proteins. J Proteome Res, 8(11):5363–5366, Nov 2009.

[44] Torsten Blum, Sebastian Briesemeister, and Oliver Kohlbacher. MultiLoc2: integrating phylogeny and gene ontology terms improves subcellular protein localization prediction. BMC Bioinformatics, 10:274, 2009.

[45] Catherine Mooney, Yong-Hong Wang, and Gianluca Pollastri. SCLpred: protein subcellular localization prediction by N-to-1 neural networks. Bioinformatics, Aug 2011.

[46] Yao Qing Shen and Gertraud Burger. ’unite and conquer’: enhanced prediction of protein subcellular localization by integrating multiple specialized tools. BMC Bioinformatics, 8:420, 2007.

[47] Kuo-Chen Chou and Hong-Bin Shen. Cell-PLoc: a package of web servers for predicting subcellular localization of proteins in various organisms. Nat Protoc, 3(2):153–162, 2008.

[48] Johannes Assfalg, Jing Gong, Hans-Peter Kriegel, Alexey Pryakhin, Tiandi Wei, and Arthur Zimek. Supervised ensembles of prediction methods for subcellular localization. Journal of Bioinformatics and Computational Biology, 7(2):269–285, 2009.

[49] R. Kohavi. Wrappers for Performance Enhancement and Oblivious Decision Graphs. PhD thesis, Department of Computer Science, Stanford University, 1995.

[50] G. Holmes, A. Donkin, and I. H. Witten. WEKA: a machine learning workbench. In Proceedings of the Second Australia and New Zealand Conference on Intelligent Information Systems, pages 357–361, 1994.

[51] G. I. Webb and X. Yu, editors. Advances in Artificial Intelligence, 17th Australian Joint Conference on Artificial Intelligence, Cairns, Australia, December 4-6, 2004, Proceedings, volume 3339 of Lecture Notes in Computer Science. Springer, 2004.

[52] Dennis A. Benson, Ilene Karsch-Mizrachi, David J. Lipman, James Ostell, and Eric W. Sayers. GenBank. Nucleic Acids Res, 37(Database issue):D26–D31, Jan 2009.

[53] U. Hobohm, M. Scharf, R. Schneider, and C. Sander. Selection of representative protein data sets. Protein Sci, 1(3):409–417, Mar 1992.

[54] Sven Mika and Burkhard Rost. UniqueProt: creating representative protein sequence sets. Nucleic Acids Res, 31(13):3789–3791, Jul 2003.

[55] B. Rost. Enzyme function less conserved than anticipated. J Mol Biol, 318(2):595–608, Apr 2002.

[56] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. Basic local alignment search tool. J Mol Biol, 215(3):403–410, Oct 1990.

[57] Jannick Dyrløv Bendtsen, Henrik Nielsen, Gunnar von Heijne, and Søren Brunak. Improved prediction of signal peptides: SignalP 3.0. J Mol Biol, 340(4):783–795, Jul 2004.

[58] C. Sander and R. Schneider. Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins, 9(1):56–68, 1991.

[59] B. Rost. Twilight zone of protein sequence alignments. Protein Eng, 12(2):85–94, Feb 1999.

[60] S. Rastogi and B. Rost. LocDB: experimental annotations of localization for Homo sapiens and Arabidopsis thaliana. Nucleic Acids Res, 39(Database issue):D230–D234, Jan 2011.

[61] UniProt Consortium. Ongoing and future developments at the Universal Protein Resource. Nucleic Acids Res, 39(Database issue):D214–D219, Jan 2011.

[62] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M. Rubin, and G. Sherlock. Gene Ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet, 25(1):25–29, May 2000.

[63] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20:273–297, 1995.

[64] T. Jaakkola, M. Diekhans, and D. Haussler. Using the Fisher kernel method to detect remote protein homologies. Proc Int Conf Intell Syst Mol Biol, pages 149–158, 1999.

[65] Yanay Ofran, Venkatesh Mysore, and Burkhard Rost. Prediction of DNA-binding residues from sequence. Bioinformatics, 23(13):i347–i353, Jul 2007.

[66] J. J. Ward, L. J. McGuffin, B. F. Buxton, and D. T. Jones. Secondary structure prediction with support vector machines. Bioinformatics, 19(13):1650–1655, Sep 2003.

[67] Howook Hwang, Thom Vreven, Troy W. Whitfield, Kevin Wiehe, and Zhiping Weng. A machine learning approach for the prediction of protein surface loop flexibility. Proteins, 79(8):2467–2474, Aug 2011.

[68] Rajesh Nair and Burkhard Rost. Protein subcellular localization prediction using artificial intelligence technology. Methods Mol Biol, 484:435–463, 2008.

[69] C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121–167, 1998.

[70] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pages 144–152. ACM, Pittsburgh, 1992.

[71] R. Herbrich and T. Graepel. A PAC-Bayesian margin bound for linear classifiers. IEEE Transactions on Information Theory, 48(12):3140–3150, 2002.

[72] J. Platt. Fast training of support vector machines using sequential minimal optimization. In Advances in Kernel Methods - Support Vector Learning. MIT Press, 1998.

[73] H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins. Text classification using string kernels. Journal of Machine Learning Research, 2:419–444, 2002.

[74] Christina S. Leslie, Eleazar Eskin, Adiel Cohen, Jason Weston, and William Stafford Noble. Mismatch string kernels for discriminative protein classification. Bioinformatics, 20(4):467–476, Mar 2004.

[75] Rui Kuang, Eugene Ie, Ke Wang, Kai Wang, Mahira Siddiqi, Yoav Freund, and Christina Leslie. Profile-based string kernels for remote homology detection and motif extraction. Proc IEEE Comput Syst Bioinform Conf, pages 152–160, 2004.

[76] Dan Gusfield. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, New York, 1997.

[77] M. Gribskov, A. D. McLachlan, and D. Eisenberg. Profile analysis: detection of distantly related proteins. Proc Natl Acad Sci U S A, 84(13):4355–4358, Jul 1987.

[78] S. F. Altschul, T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res, 25(17):3389–3402, Sep 1997.

[79] H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov, and P. E. Bourne. The Protein Data Bank. Nucleic Acids Res, 28(1):235–242, Jan 2000.

[80] Erin L. Allwein, Robert E. Schapire, and Yoram Singer. Reducing multiclass to binary: a unifying approach for margin classifiers. Journal of Machine Learning Research, 1:113–141, 2000.

[81] E. Frank and S. Kramer. Ensembles of nested dichotomies for multi-class problems. In Proceedings of the 21st International Conference on Machine Learning (ICML), 2004.

[82] Arthur Zimek, Fabian Buchwald, Eibe Frank, and Stefan Kramer. A study of hierarchical and flat classification of proteins. IEEE/ACM Trans Comput Biol Bioinform, 7(3):563–571, 2010.

[83] L. Dong, E. Frank, and S. Kramer. Ensembles of balanced nested dichotomies for multi-class problems. In PKDD, pages 84–95, 2005.

[84] J. Fox. Applied Regression Analysis, Linear Models, and Related Methods. Sage Publications, 1997.

[85] C. Blake and C. Merz. UCI repository of machine learning databases. University of California, Irvine, Department of Information and Computer Science, [www.ics.uci.edu/~mlearn/MLRepository.html], 1998.

[86] B. Martoglio and B. Dobberstein. Signal sequences: more than just greasy peptides. Trends Cell Biol, 8(10):410–415, Oct 1998.

[87] S. Maetschke, M. Gallagher, and M. Bodén. A comparison of sequence kernels for localization prediction of transmembrane proteins. In IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, pages 367–372, 2007.

[88] Andrea Pierleoni, Pier Luigi Martelli, and Rita Casadio. MemLoci: predicting subcellular localization of membrane proteins in eukaryotes. Bioinformatics, 27(9):1224–1230, May 2011.

[89] E. L. Lehmann and G. Casella. Theory of Point Estimation. Springer, New York, 1998.

[90] W. Guan. From the help desk: bootstrapped standard errors. The Stata Journal, 3(1):71–80, 2003.

[91] B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap. Chapman & Hall, New York, 1993.

[92] Jennifer L. Gardy, Cory Spencer, Ke Wang, Martin Ester, Gábor E. Tusnády, István Simon, Sujun Hua, Katalin deFays, Christophe Lambert, Kenta Nakai, and Fiona S. L. Brinkman. PSORT-B: improving protein subcellular localization prediction for Gram-negative bacteria. Nucleic Acids Res, 31(13):3613–3617, Jul 2003.

[93] H. Nielsen, J. Engelbrecht, S. Brunak, and G. von Heijne. A neural network method for identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Int J Neural Syst, 8(5-6):581–599, 1997.

[94] Annette Höglund, Pierre Dönnes, Torsten Blum, Hans-Werner Adolph, and Oliver Kohlbacher. MultiLoc: prediction of protein subcellular localization using N-terminal targeting sequences, sequence motifs and amino acid composition. Bioinformatics, 22(10):1158–1165, May 2006.

[95] Christian J. A. Sigrist, Lorenzo Cerutti, Nicolas Hulo, Alexandre Gattiker, Laurent Falquet, Marco Pagni, Amos Bairoch, and Philipp Bucher. PROSITE: a documented database using patterns and profiles as motif descriptors. Brief Bioinform, 3(3):265–274, Sep 2002.

[96] Rajesh Nair, Phil Carter, and Burkhard Rost. NLSdb: database of nuclear localization signals. Nucleic Acids Res, 31(1):397–399, Jan 2003.

[97] K. Nakai and M. Kanehisa. A knowledge base for predicting protein localization sites in eukaryotic cells. Genomics, 14:897–911, 1992.

[98] P. Horton and K. Nakai. Better prediction of protein cellular localization sites with the k nearest neighbors classifier. In Proceedings of the Fifth International Conference on Intelligent Systems for Molecular Biology, pages 147–152, 1997.

[99] H. Bannai, Y. Tamada, O. Maruyama, K. Nakai, and S. Miyano. Extensive feature detection of N-terminal protein sorting signals. Bioinformatics, 18(2):298–305, 2002.

[100] Robert McGill, John W. Tukey, and Wayne A. Larsen. Variations of box plots. The American Statistician, 32(1):12–16, 1978.

[101] Timothy Nugent and David T. Jones. Transmembrane protein topology prediction using support vector machines. BMC Bioinformatics, 10:159, 2009.

[102] Lukas Käll, Anders Krogh, and Erik L. L. Sonnhammer. An HMM posterior decoder for sequence feature prediction that includes homology information. Bioinformatics, 21 Suppl 1:i251–i257, Jun 2005.

[103] Andreas Bernsel, Håkan Viklund, Jenny Falk, Erik Lindahl, Gunnar von Heijne, and Arne Elofsson. Prediction of membrane-protein topology from first principles. Proc Natl Acad Sci U S A, 105(20):7177–7181, May 2008.

[104] Jonas Reeb. Evaluation of methods to predict transmembrane alpha-helices in proteins. Bachelor's Thesis, Ludwig-Maximilians-University and Technical University Munich, I12 - Department for Bioinformatics and Computational Biology, Rost Lab, Sep 2011.

[105] Burkhard Rost, Guy Yachdav, and Jinfeng Liu. The PredictProtein server. Nucleic Acids Res, 32(Web Server issue):W321–W326, Jul 2004.
