UNIVERSITÀ DEGLI STUDI DI FIRENZE
Dipartimento di Sistemi e Informatica
Dottorato di Ricerca in Ingegneria Informatica e dell’Automazione, XVI Ciclo

Kernel Methods, Multiclass Classification and Applications to Computational Molecular Biology

Andrea Passerini

Dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer and Control Engineering

Ph.D. Coordinator: Prof. Edoardo Mosca
Advisors: Prof. Paolo Frasconi, Prof. Giovanni Soda

Academic Year 2003-2004

Abstract
Support Vector Machines for pattern recognition were initially conceived for the binary classification case. A common approach to addressing multiclass problems with binary classifiers is to reduce the multiclass problem to a set of binary sub-problems, and combine their predictions in order to obtain a multiclass prediction. Reduction schemes can be represented by Error Correcting Output Codes, while binary predictions can be combined using a decoding function which outputs a score for each possible class. We propose a novel decoding function which computes the conditional probability of the class given the binary predictions, and present an extensive set of experiments showing that it outperforms all the decoding functions commonly used in practice. An alternative approach for solving multiclass problems is to directly extend binary classification algorithms to the multiclass case. Various multicategory extensions for SVM have been proposed so far. We review most of them, showing their similarities and differences, as well as their connections to the ECOC schemes. We report a series of experiments comparing different multiclass methods under various conditions, showing that while they perform similarly at the optimum, interesting differences emerge when they are forced to produce degenerate solutions. Moreover, we present a novel bound on the leave-one-out error of ECOC of kernel machines, showing that it can be successfully employed for model selection.

In the second part of this thesis, we present applications of kernel machines to problems in computational molecular biology. We address the problem of predicting disulphide bridges between cysteines, which represent strong constraints on protein 3D structure, and whose correct location can significantly help the overall folding prediction. The problem of disulphide bridge prediction can be divided into two successive steps: first, for each cysteine in a given protein, predict whether or not it is involved in a disulphide bond; second, given the subset of disulphide bonded cysteines in the protein, predict their connectivity pattern, by coupling each cysteine with the correct partner. We focus on the first step, and develop state-of-the-art learning algorithms combining kernel machines and connectionist models. Disulphide bridges are not the only type of binding a cysteine can be involved in: many cysteines actually bind different types of ligands, usually including metal ions, forming complexes which play very important roles in biological systems. We employed kernel machine algorithms to learn to discriminate between ligand bound and disulphide bound cysteines, in order to gain deeper insight into the role of each cysteine in a given protein. We developed ad-hoc kernels able to exploit information on residue similarity, showing that they obtain performance similar to standard kernels with much simpler models.

Contents
Acknowledgements

1 Introduction
  1.1 Kernel Machines and Multiclass Classification
  1.2 Applications to Bioinformatics

I Kernel Machines and Multiclass Classification

2 Kernel Methods
  2.1 Statistical Learning Theory
    2.1.1 Loss Function and Risk Minimization
    2.1.2 VC Dimension and Bounds on Expected Risk
    2.1.3 Structural Risk Minimization
    2.1.4 Empirical Estimates of the Expected Risk
  2.2 Support Vector Machines
    2.2.1 Hard Margin Hyperplanes
    2.2.2 Soft Margin Hyperplanes
    2.2.3 Nonlinear Support Vector Machines
    2.2.4 Bounds on the LOO Error of SVM
  2.3 Other Support Vector Methods
    2.3.1 Support Vector Regression
    2.3.2 Support Vector Clustering
  2.4 Kernel Theory
    2.4.1 Positive Definite and Mercer Kernels
    2.4.2 Regularization Theory
  2.5 Kernel Design
    2.5.1 Basic Kernels
    2.5.2 Kernel Combination
    2.5.3 Kernels on Discrete Structures
    2.5.4 Kernels from Generative Models
      2.5.4.1 Dynamic Alignment Kernels
      2.5.4.2 Fisher Kernel
    2.5.5 Hyperparameter Tuning

3 Multiclass Classification
  3.1 Error Correcting Output Codes
    3.1.1 Decoding Functions Based on Conditional Probabilities
  3.2 Multiclass Classification with Kernel Machines
    3.2.1 ECOC of Kernel Machines
    3.2.2 Multicategory Support Vector Machines
    3.2.3 Connections between ECOC and MSVM
  3.3 Bounds on the LOO Error
    3.3.1 LOO Error Bounds for ECOC of Kernel Machines
  3.4 Experiments
    3.4.1 Comparison between Different Decoding Functions
    3.4.2 Comparison between Different Multiclass Methods
    3.4.3 Hyperparameter Tuning
  3.5 Conclusions

II Cysteine Bonding State Prediction

4 Protein Structure
  4.1 Overview
  4.2 Protein Structure Determination
    4.2.1 X-Ray Crystallography
    4.2.2 NMR Spectroscopy
  4.3 Protein Structure Prediction
    4.3.1 Comparative Modeling
    4.3.2 Fold Recognition
    4.3.3 De Novo Protein Structure Prediction
      4.3.3.1 Predictions in 1D
      4.3.3.2 Predictions in 2D
      4.3.3.3 Predictions in 3D

5 Disulphide Bonding State Prediction
  5.1 Disulphide Bonds Formation
  5.2 Cysteine Bonding State Prediction
    5.2.1 Overview of Current Methods
    5.2.2 Output-Local Predictor
      5.2.2.1 Implementation using probabilistic SVM
      5.2.2.2 A Fully-Observed Mixture of SVM Experts
      5.2.2.3 Spectrum Kernel
    5.2.3 Output-Global Refiners
      5.2.3.1 Hidden Markov Models
      5.2.3.2 Bidirectional Recurrent Neural Networks
    5.2.4 Data Preparation
      5.2.4.1 Input Encoding
      5.2.4.2 Cysteines Conservation
    5.2.5 Results
      5.2.5.1 Output-local experts
      5.2.5.2 Filtering with HMM
      5.2.5.3 Filtering with BRNN
      5.2.5.4 Combining BRNN and HMM Advantages
  5.3 Connectivity Prediction
  5.4 Conclusions

6 Cysteine Binding Types Prediction
  6.1 Data Preparation
  6.2 PROSITE Patterns as a Baseline
  6.3 Prediction by Support Vector Machines
  6.4 Results and Discussion
  6.5 Conclusions

Bibliography
List of Figures
2.1 Confidence-based losses for binary classification (a) and regression (b).
2.2 Structural risk minimization (SRM) induction principle: the entire class of functions is divided into nested subsets with decreasing VC dimension. Within each subset, a trained function is obtained by minimizing the empirical risk only. Finally, we choose the training function which minimizes the bound on the expected risk as given by the sum of the empirical risk and the VC confidence.
2.3 Separable classification problem solved by support vector machines. The solid line represents the separating hyperplane. Support vectors (in black) are the points lying on the two closest hyperplanes on both sides of the separating one, corresponding to a confidence margin of one. All other points (in white) do not contribute to the decision function.
2.4 Non separable classification problem solved by support vector machines. The solid line represents the separating hyperplane, while dotted lines are hyperplanes with confidence margin equal to one. Grey points are unbound SVs, black points are bound SVs and extra borders indicate bound SVs which are also training errors. All other points do not contribute to the decision function.
2.5 (a) No linear decision function can separate black points from white ones. (b) A nonlinear decision surface correctly separating points. (c) By nonlinearly mapping points into a higher dimensional feature space we can find a separating hyperplane, which corresponds to the nonlinear decision surface (b) in the input space.
2.6 (a) The ε-insensitive region around the target function. Points lying within the region are considered as correctly predicted. (b) Large margin separation corresponds in regression to flat estimation functions. Circles are training points, and dotted lines enclose the ε-insensitive region. Solid lines are candidate regression functions, the bold one being the flattest which still fits training data with an approximation up to ε.
2.7 Regression problem solved by support vector machines. The dotted line is the true function, and the dashed lines enclose the ε-insensitive region. The solid line is the learned regression function. Grey points are unbound SVs, and black points are bound SVs. All other points do not contribute to the estimation function. Flatness in feature space implies smoothness of the regression function in input space.
2.8 Support vector clustering. Lines represent cluster boundaries, formed by unbound SVs (grey points). Bound support vectors (black points) are outliers and do not belong to any cluster. All other points stay within the boundaries of their respective cluster.
2.9 Example of extraction of all valid subtrees of a given tree.
2.10 Examples of matching and non matching nodes between two trees, where the nodes to be compared are the root nodes (shaded) of the respective trees.
2.11 State diagram for a PHMM modeling pairs of sequences AB. The state AB emits common or similar symbols for both sequences, while the states A and B model insertions in sequence A and B respectively.
3.1 Bayesian network describing the probabilistic relationships amongst margins, codewords, and class.
3.2 Multiclass classification problem solved by multicategory extension of support vector machines. Solid lines represent separating hyperplanes, while dotted lines are hyperplanes with confidence margin equal to one. Grey points are unbound SVs, black points are bound SVs and extra borders indicate bound SVs which are also training errors. All other points do not contribute to the decision function.
3.3 Multiclass classification problem solved by multicategory extension of support vector machines. Solid lines represent separating hyperplanes, while dotted lines are hyperplanes with confidence margin equal to one. Grey points are unbound SVs. The multiclass geometric margin is given by the minimum of the biclass geometric margins.
3.4 Test accuracy plotted against kernel hyperparameter γ. Data sets anneal, ecoli, glass, soybean, yeast.
3.5 Test accuracy plotted against kernel hyperparameter γ. Data sets letter, optdigits, pendigits, satimage, segment.
3.6 Test accuracy plotted against kernel hyperparameter γ for different multiclass methods and datasets. Regularization parameter C is optimized for each value of γ and for each method independently by a validation procedure. All ECOC schemes employ the likelihood decoding function.
3.7 Empirical comparison between the test error (dashed line) and the leave-one-out bound (solid line) of Corollary 4.1. The likelihood decoding function is used in all the experiments.
4.1 (a) All amino acids share a common part, containing a central carbon atom known as carbon alpha (Cα), bonded to a hydrogen atom, an amino group and a carboxyl group. The fourth valence of the Cα is involved in binding a side chain which is peculiar to each amino acid. (b) Peptide bonds form by condensation of the carboxyl and amino group of two successive amino acids, eliminating a molecule of water.
4.2 Regular elements of secondary structure. (a) An alpha helix is a sequence of residues with hydrogen bonds between C'=O of residue n and NH of residue n + 4, forming a helix of 3.6 residues per turn. (b) A beta sheet is a planar conformation made by two or more aligned beta strands connected by hydrogen bonds. Beta sheets can be either parallel or antiparallel depending on the direction of the aligned strands. Mixed sheets containing both parallel and antiparallel strands also exist but are far less common. Figures are taken from http://www.agsci.ubc.ca/courses/fnh/301/protein/protprin.htm.
4.3 The primary structure (a) of a protein is the sequence of its residues. Local regularities such as alpha helices and beta strands form the protein secondary structure (b). The combination of such elements within a given polypeptide chain determines its tertiary structure (c). Multimeric proteins made of several chains also have a quaternary structure (d) given by the overall arrangement of the chains.
4.4 Potential architecture for protein structure prediction by combination of predictions for different subproblems.
4.5 (a) Finite state automata correcting inconsistencies in secondary structure predictions. Constraints imposed by the automata are: alpha helices at least four residues long, beta strands at least two residues long, presence of a coil between a beta strand and an alpha helix (and vice versa) as well as at the beginning and end of a chain. (b) Example of errors corrected by the automata. Note that such FSA is designed for a particular mapping from eight to three classes of secondary structure.
5.1 Ball-&-stick representation of a disulphide bridge: the bridge is drawn as a cylinder connecting the sulphur atoms of two oxidized cysteines.
5.2 Schematic reactions between enzyme oxidoreductase and chain in oxidation (a), reduction (b) and isomerization (c), involving the formation of a temporary mixed enzyme-protein disulphide bond.
5.3 Reducing systems in the cytosol of E.coli: the thioredoxin system (a) employs thioredoxin enzymes (Trx) to reduce disulphide bonded cysteines in a chain, and thioredoxin reductase to reduce the resulting oxidized enzyme, taking an electron from NADPH. The glutaredoxin system (b) employs the molecule glutathione (Glu) to form a protein-Glu mixed disulphide, which is reduced by glutaredoxin enzymes (Grx). The resulting Grx-Glu intermediate is in turn reduced by a second molecule of glutathione, and the Glu-Glu complex is finally reduced by glutathione reductase, again taking an electron from NADPH.
5.4 Model for disulphide bridge formation in the periplasm of E.coli [FG03]. A solid line indicates oxidation of the molecule originating the line, while a dashed line stands for reduction. The dotted line indicates an isomerization reaction, which does not involve a net change in redox state. Unfolded proteins are oxidized by the enzyme DsbA, which is in turn reoxidized by transferring electrons to DsbB within a respiratory chain. Misfolded proteins are either refolded by isomerization or unfolded by reduction, both tasks being accomplished by DsbC, DsbG and DsbE, which are in turn reduced by the thioredoxin-system dependent membrane reductase DsbD.
5.5 Model for disulphide bridge formation in the endoplasmic reticulum of S.cerevisiae [FG03]. A solid line indicates oxidation of the molecule originating the line, while a dashed line stands for reduction. The dotted line indicates an isomerization reaction, which does not involve a net change in redox state. Unfolded proteins are oxidized by the enzyme PDI, which is in turn reoxidized by the peripheral membrane proteins ERO1α and ERO1β, whose recharging mechanism has not been completely explained. PDI is also involved in correcting misfolded proteins, and is supposed to be implied in their possible unfolding, but this last capacity has not been proved.
5.6 The two-stage system. The protein classifier on the left uses a global descriptor based on amino acid frequencies. The local context classifier is fed by profiles derived from multiple alignments together with the protein global descriptor.
5.7 The four-state HMM constraining the number of bonded cysteines in a sequence to be even. States on the left are non-bonding cysteine states, while states on the right are bonded cysteine ones. The end state can only be reached by paths containing an even number of bonded cysteines.
5.8 A BRNN architecture realizing a noncausal IO-isomorph transduction.
6.1 Disulphide bridge vs metal binding prediction by SVM with 3rd degree polynomial kernel. Test and train accuracies with 95% confidence intervals are plotted, together with the fraction of support vectors over the number of training examples, for growing sizes of the window of 2k+1 residue profiles around the target cysteine, with k going from 1 to 25. Results are averaged over a three fold cross validation procedure.
6.2 Sequence logos [SS90] of the context of cysteines involved in metal bindings (upper) and disulphide bridges (lower) respectively, with a window of 17 residues on each side of the bound cysteine. Hydrophobic residues are shown in black, positively charged residues are blue and negatively charged are red, while uncharged polar residues are green.
6.3 Sequence logos [SS90] of the 17 amino acid context of cysteines that do not match any PROSITE pattern, and are either correctly predicted as MBS (upper) or mistakenly predicted as DB (lower) by an SVM with 3rd degree polynomial kernel. Hydrophobic residues are shown in black, positively charged residues are blue and negatively charged are red, while uncharged polar residues are green.
6.4 Disulphide bridge vs metal binding prediction by 3rd degree polynomial kernel SVM. Window of 17 residue profiles on both sides of the target cysteine. Test accuracies with 95% confidence intervals are plotted versus the "C" regularization parameter of the SVM. Default regularization is computed as the inverse of the average of K(x, x) with respect to the training set.
6.5 Disulphide bridge vs metal binding prediction by 3rd degree polynomial kernel SVM. Window of 17 residue profiles on both sides of the target cysteine. Precision/recall curves over the test set for different rejection rates for disulphide bridge (a) and metal binding (b) prediction.
6.6 Disulphide bridge vs metal binding prediction by SVM with 3rd degree polynomial kernel with McLachlan (a) or Blosum62 (b) similarity matrix. Test and train accuracies with 95% confidence intervals are plotted, together with the fraction of support vectors over the number of training examples, for growing sizes of the window of 2k+1 residue profiles around the target cysteine, with k going from 1 to 25. Results are averaged over a three fold cross validation procedure.
List of Tables
3.1 Characteristics of the data sets used.
5.1 Summary of the experimental results.
5.2 Confusion matrices at protein class level (none/all/mix) for the 'BRNN S24+f' architecture with k = 8, both before (a) and after (b) post-processing by FSA.
6.1 Non homologous sequences obtained by running uniqueprot [MR03] with hssp distance set to zero. Sequences containing disulphide bridges were obtained from resolved proteins in PDB [BBB+02], while those with metal bindings were recovered from Swiss-Prot version 41.23 [BBA+03] by keyword matching.
6.2 Disulphide bridge vs metal binding prediction by decision rules learned by c4.5 from patterns extracted from PROSITE. Precision and recall for both classes, confusion matrix and overall accuracy.
6.3 Disulphide bridge vs metal binding prediction by 3rd degree polynomial kernel SVM. Window of 3 residue profiles on both sides of the target cysteine. Precision and recall for both classes, confusion matrix and overall accuracy.
6.4 Disulphide bridge vs metal binding prediction by 3rd degree polynomial kernel SVM. Window of 17 residue profiles on both sides of the target cysteine. Precision and recall for both classes, confusion matrix and overall accuracy.
6.5 Disulphide bridge vs metal binding prediction by 3rd degree polynomial kernel SVM. Window of 17 residue profiles on both sides of the target cysteine. Recall and number of examples for cysteines with disulphide bridge or metal ligand, and details for different kinds of metal binding.
6.6 Disulphide bridge vs metal binding prediction by 3rd degree polynomial kernel SVM. Window of 17 residue profiles on both sides of the target cysteine. Performances for different rejection rates, where the rejection percentage is computed separately for examples predicted as disulphide bridges or metal bindings, in order to account for the unbalanced distribution of examples between the two classes (i.e. a 5% rejection rate indicates that 5% of examples predicted as disulphide bridges are to be rejected, as well as 5% of examples predicted as metal bindings). Reported results include rejection thresholds, accuracies, precision and recall, and the percentage of rejected examples belonging to each of the two classes.
Acknowledgements
First of all, I would really like to thank my advisors Paolo Frasconi and Giovanni Soda for encouraging me in the decision to undertake this rewarding path, and all the people in the Machine Learning and Neural Network Group for being good fellow travellers. Various researchers contributed to the results in different parts of this thesis, including Alessio Ceroni, Paolo Frasconi, Massimiliano Pontil and Alessandro Vullo, and I should also thank Fabrizio Costa and Sauro Menchetti for fruitful discussions. Huge thanks go to my family, my friends and especially my girlfriend, for bearing with me in the zombie state I fell into while writing most of this thesis. Special thanks should finally go to good old Chianti wine for its contribution in inspiring some of the ideas contained in this thesis.
Chapter 1
Introduction
Learning from examples can be seen as the problem of approximating an unknown function given a finite number of possibly noisy input-output pairs. A learning algorithm is characterized by the set of candidate functions, also called the hypothesis space, and by the search strategy within this space. The sparseness and finiteness of training data pose the problem of generalization to unseen cases, that is, the capacity of the learned function to predict the correct output for an unseen input. A learning algorithm which simply outputs the function that best fits the training data, choosing from the set of all possible functions from the input to the output space, would simply memorize training examples without really developing a model of the underlying target function, and would fail to generalize to unseen cases. Moreover, the problem is ill-posed, as there is no unique solution. The problem of avoiding overfitting [MP92] the training data is typically addressed by restricting the set of allowed candidate functions, either by directly reducing the hypothesis space, by acting on the search strategy, or both, and is referred to in different ways, such as the bias-variance dilemma [GBD92], inductive bias [Mit97] or capacity control [GVB+92]. In regularization theory, turning an ill-posed problem into a well-posed one is done by adding a regularization term to the objective function [Tik63], which corresponds in the learning framework to modifying the search strategy by trading off between fitting the training data and the complexity of the learned function. These concepts were investigated by Vapnik in the late Seventies [Vap79] and led to the development of the Support Vector Machine [Vap95] learning algorithm, later generalized to Kernel Machines. In the case of classification tasks, kernel machine algorithms learn a decision function which separates examples with a large margin, possibly accounting for training errors, thus actually trading off between function complexity and fitting the training data. Versions of the algorithm have been developed for tasks different from classification, such as regression [Vap95], clustering [BHHSV01] and
ranking [FSS98, CD02a], and have been successfully applied to a vast range of learning tasks, from handwritten digit recognition [CV95] to text categorization [Joa98b].

This thesis is divided into two parts: the first deals with kernel machines from a theoretical standpoint, providing an extensive review of kernel methods and presenting results on multiclass classification learning. The second part deals with applications of kernel methods to protein structure prediction, presenting results on the prediction of cysteine bonding state.
1.1 Kernel Machines and Multiclass Classification
Support Vector Machines for pattern recognition were originally conceived for binary classification tasks. A common approach to addressing multiclass problems with binary classification algorithms is to divide the multiclass problem into several binary subproblems, and combine the predictions for such subproblems into a multiclass prediction. Common examples of this approach include the one-vs-all [Nil65] and the all-pairs [Fri96] strategies. A general framework for such approaches is represented by Error Correcting Output Codes (ECOC) [DB95, ASS00]. They consist of a coding matrix of values in {−1, 0, +1} with a number of rows equal to the number of classes. Each row is a codeword for the corresponding class, while each column induces a binary classification task, which consists of discriminating the subset of classes having +1 in the column from that of classes having −1, while classes having 0 are not considered. In order to classify an unseen example, the vector of predictions from all binary classifiers is computed, and the example is assigned to the class whose codeword is closest to such vector, where closeness is measured by a given decoding function (a minimal sketch of this scheme is given at the end of this section). We propose a novel decoding function which computes the conditional probability of the class given the vector of predictions, and present an extensive set of experiments showing that it outperforms all the decoding functions commonly used in practice.

An alternative approach to addressing multiclass problems is to directly extend binary classification algorithms to the multiclass case. Various multicategory extensions of support vector machines have been proposed in the literature [Vap98, WW98, BB99, GEPM00, CS00]. We extensively review most of them, focusing on their similarities and differences, and highlight the connections to ECOC schemes, building on the work on continuous code learning [CS00]. We present experimental results comparing various methods for multiclass classification and discuss their performance, showing that interesting differences emerge when the methods are forced to produce a degenerate solution. Finally, we propose a novel bound on the leave-one-out error of ECOC of kernel machines, which is valid for any monotonic non-increasing margin-based decoding function, and show that it can be successfully employed for model selection. In chapter 2 we present an extensive introduction to kernel methods, while chapter 3 focuses on multiclass categorization. The proposed results on ECOC of kernel machines are based on [PPF02, PPF04], while the section on multicategory SVM and the comparisons between different multiclass methods are not published elsewhere.
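To make the reduction scheme concrete, here is a minimal sketch (ours, not from the thesis; the names one_vs_all_matrix and hamming_decode are illustrative) that builds a one-vs-all coding matrix and decodes the vector of binary margins with a simple Hamming-style decoding function. The likelihood decoding function proposed in this thesis is developed in chapter 3.

```python
import numpy as np

# Illustrative ECOC sketch. The coding matrix M in {-1, 0, +1}^{k x s}
# has one row (codeword) per class and one column per binary task.

def one_vs_all_matrix(k):
    """One-vs-all coding: +1 on the diagonal, -1 elsewhere."""
    return 2 * np.eye(k, dtype=int) - 1

def hamming_decode(M, f):
    """Assign the example to the class whose codeword is closest to the
    sign vector of the binary predictions f; zero entries abstain."""
    mismatches = (M != 0) * (1 - np.sign(f) * M) / 2
    return int(np.argmin(mismatches.sum(axis=1)))

M = one_vs_all_matrix(4)
f = np.array([-0.3, 1.2, -0.8, -0.1])  # margins of the four binary classifiers
print(hamming_decode(M, f))            # -> 1
```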
1.2 Applications to Bioinformatics
The explosion of genomic scale sequencing projects has provided us with a huge amount of data from a wide variety of organisms. However, in order to fully exploit the information contained in such data, increasingly difficult analyses are necessary: identifying single genes within a genome, the proteins synthesized from such genes, and the three dimensional structure of proteins, which is crucial to derive information on their biological function. The gap between the information available at the different steps is dramatically increasing, calling for automatic algorithms to fill it. The determination of a protein fold, that is its 3D structure, given the sequence of its residues is a challenging task in this context. No simple rule, like those discovered for transcription and translation, is available to map sequences of residues to three dimensional coordinates. Experimental methods such as X-ray crystallography [Dre94] and NMR spectroscopy [Wü86] are highly expensive and time consuming, and cannot be applied in all situations. However, resolved three dimensional structures provide a valuable source of information which can be employed to predict the structure of new proteins. Given a new protein, the approach to predicting its three dimensional structure depends on how much the target protein resembles already resolved ones. If resolved proteins with enough sequence similarity to the target protein are available, its fold can be predicted by comparative modeling [KNV03] techniques, relying on the observation that the 3D structure of proteins is more conserved during evolution than their residue composition. When such similar proteins are not available, however, different methods have to be used. Difference in primary structure does not always imply difference in fold, as many examples of so-called remote homologues exist. Threading techniques [God03] search for remote homologues of a target protein by trying to fit its sequence into known folds. Machine learning algorithms, including kernel machines, have been successfully employed for remote homology modeling tasks [Nob04]. For proteins with new folds, however, these methods fail, and de novo protein structure prediction [CRBB03] is necessary. This complex task is typically addressed by solving a number of simpler sub-problems and combining their predictions in order to provide an overall fold prediction [Ros98]. Such sub-problems include 1D predictions, such as secondary structure and solvent accessibility, as well as 2D predictions, such as contact maps and disulphide bridges, and many machine learning algorithms have been employed to address them.

In this thesis we address the problem of predicting the bonding state of cysteines, the residues involved in the formation of disulphide bridges. Such residues are very reactive, and under favorable conditions pairs of cysteines within a given protein bind together forming a disulphide bridge, which helps stabilize the three dimensional structure of the protein [WWNS00, Pai00]. By correctly predicting the connectivity pattern of bridges in a protein, strong constraints on the protein fold are obtained, which can be employed to help the overall three dimensional structure prediction.
The problem of disulphide bridge prediction is usually divided into two successive steps: first, for each cysteine in a protein, predict whether it is in an oxidized state, and thus involved in a disulphide bond, or reduced; second, given the subset of oxidized cysteines in a protein, predict their connectivity pattern by pairing each cysteine with its correct partner. We focused on the first task,
and developed learning algorithms employing kernel machines combined with connectionist models, while the second task was addressed in a parallel project within our lab [?]. Disulphide bridges are not the only type of bond a cysteine can be involved in, as cysteines can also bind different ligands, usually containing metal ions, forming complexes which play very important roles in biological systems [GWG+03, KH04]. We employed kernel machines to learn to discriminate between ligand bound and disulphide bound cysteines, thus obtaining a finer prediction of the role of each cysteine in a given protein. We developed ad-hoc kernels able to exploit information on residue similarity as contained in substitution matrices [McL72, HH92], showing that they obtain performance similar to standard kernels with much simpler models. In chapter 4 we provide an overview of protein structure, describing both experimental resolution methods and prediction methods. In chapter 5 we describe disulphide bridge prediction, and present our methods for disulphide bonding state prediction described in [FPV02, CFPV03b, CFPV04]. Finally, in chapter 6 we address the problem of discriminating between ligand bound and disulphide bound cysteines, reporting the results in [?].
Part I
Kernel Machines and Multiclass Classification
Chapter 2
Kernel Methods
The theory underlying kernel machines was developed by Vapnik [Vap79] in the late Seventies, but it has received increasing attention quite recently, with the introduction of Support Vector Machines [Vap95, Vap98] for pattern recognition, at first in their hard margin implementation for linearly separable binary classification problems, and successively extended to accommodate nonlinear separations and the possibility of committing errors on the training set. Support Vector Machines have been successfully applied to a wide range of applications, from handwritten digit recognition [CV95] to text categorization [Joa98b]. Versions of the algorithm have been developed to manage learning tasks other than binary classification, such as novelty detection [SWS+00], unsupervised clustering [BHHSV01], regression [Vap95], feature extraction [SSM99, WMC+01, GWBV02] and multiclass classification, which will be discussed in detail in chapter 3. Many tutorials and books have been written on kernel machines (see for example [Bur98a, SS02, CST00]). In this chapter we provide a general introduction to kernel machines. In section 2.1 we review the most important results in statistical learning theory, justifying the effectiveness of kernel machines in terms of generalization. In section 2.2 we introduce Support Vector Machines, from the initial formulation for linearly separable binary classification problems to the non-separable and nonlinear extensions, together with results aimed at efficiently estimating the generalization performance of the algorithm. In section 2.3 we present different machines for learning tasks other than binary classification, such as regression and clustering. Section 2.4 describes kernel functions more formally, while section 2.5 addresses the problem of designing appropriate kernels to treat different types of data.
2.1 Statistical Learning Theory
The problem of generalization of learning algorithms has been addressed in many different ways, from the bias-variance tradeoff [GBD92] to overfitting [MP92], and is crucial to the problem of learning from examples. Given a learning task and a finite number of training examples, a machine that is too complex will perfectly memorize all training data, without being able to make predictions on unseen examples, while one that is too simple will not have the power to learn the given task. A balance between machine capacity and performance on the training set must be achieved in order to obtain good generalization [Vap79]. A detailed discussion of statistical learning theory and its implications for kernel machines can be found in [Vap95, Vap98, EPP00, CS01, HTF01].
2.1.1 Loss Function and Risk Minimization

Let $D_m = \{(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}\}_{i=1}^{m}$ be a training set, whose data are independently drawn and identically distributed with respect to an unknown probability distribution $P(x, y)$. Suppose we have a set of functions $\{f_\alpha\}$ with parameters $\alpha$, that is, for each value of $\alpha$ we have a function $f_\alpha : \mathcal{X} \to \mathcal{Y}$. Let us give a formal definition of the loss incurred by the function $f_\alpha$ at example $(x, y)$.
Definition 2.1.1 (Loss Function) Given a triplet $(x, y, f_\alpha(x))$ containing a pattern $x \in \mathcal{X}$, its observation $y \in \mathcal{Y}$ and a prediction $f_\alpha(x)$, we define as loss function any map $\ell : \mathcal{X} \times \mathcal{Y} \times \mathcal{Y} \to [0, \infty]$ such that $\ell(x, y, y) = 0$ for all $x \in \mathcal{X}$ and $y \in \mathcal{Y}$.

Common examples of loss functions in the case of binary classification, where $\mathcal{Y} = \{-1, 1\}$, are the misclassification error:
$$\ell(x, y, f_\alpha(x)) = \begin{cases} 0 & \text{if } y = f_\alpha(x) \\ 1 & \text{otherwise} \end{cases} \qquad (2.1)$$
and the soft margin loss function [BM92] (see fig. 2.1(a)):
$$\ell(x, y, f_\alpha(x)) = |1 - y f_\alpha(x)|_+ = \begin{cases} 0 & \text{if } y f_\alpha(x) \ge 1 \\ 1 - y f_\alpha(x) & \text{otherwise} \end{cases} \qquad (2.2)$$
which takes into account the confidence of the prediction. For regression tasks ($\mathcal{Y} = \mathbb{R}$), common losses are the square error:
$$\ell(x, y, f_\alpha(x)) = (y - f_\alpha(x))^2 \qquad (2.3)$$
and the extension of the soft margin loss called the $\epsilon$-insensitive loss (see fig. 2.1(b)):
$$\ell(x, y, f_\alpha(x)) = |y - f_\alpha(x)|_\epsilon = \begin{cases} 0 & \text{if } |y - f_\alpha(x)| \le \epsilon \\ |y - f_\alpha(x)| - \epsilon & \text{otherwise} \end{cases} \qquad (2.4)$$
Figure 2.1. Confidence-based losses for binary classification (a) and regression (b): (a) the soft margin loss $|1 - yf(x)|_+$; (b) the $\epsilon$-insensitive loss $|y - f(x)|_\epsilon$.

The $\epsilon$-insensitive loss does not penalize deviations up to $\epsilon$ from the target value, and gives a linear penalty to further deviations. Note that all these losses depend on $x$ only through $f_\alpha(x)$, while definition 2.1.1 is more general.
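The four losses (2.1)-(2.4) translate directly into code. The following is a minimal illustrative sketch, assuming a real-valued prediction fx = f(x); it is not code from the thesis.

```python
import numpy as np

# Executable transcription of the losses (2.1)-(2.4); y is the observation
# and fx = f(x) the (real-valued) prediction.

def misclassification(y, fx):           # eq. (2.1), y in {-1, +1}
    return float(np.sign(fx) != y)

def soft_margin(y, fx):                 # eq. (2.2): |1 - y f(x)|_+
    return max(0.0, 1.0 - y * fx)

def square_error(y, fx):                # eq. (2.3)
    return (y - fx) ** 2

def eps_insensitive(y, fx, eps=0.1):    # eq. (2.4): |y - f(x)|_eps
    return max(0.0, abs(y - fx) - eps)

print(soft_margin(+1, 0.4), eps_insensitive(1.0, 1.05))  # 0.6 0.0
```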
Given a loss function to weight errors on individual patterns, we can define the expectation of the test error for a trained function on the entire set of possible patterns.
Definition 2.1.2 (Expected Risk) Given a probability distribution $P(x, y)$ of patterns and observations, a trained function $f_\alpha : \mathcal{X} \to \mathcal{Y}$ and a loss function $\ell : \mathcal{X} \times \mathcal{Y} \times \mathcal{Y} \to [0, \infty)$, the expected risk for $f_\alpha$ is defined as
$$R[f_\alpha] = \int_{\mathcal{X} \times \mathcal{Y}} \ell(x, y, f_\alpha(x))\, dP(x, y). \qquad (2.5)$$

The expected risk is also known as the generalization or true error. We cannot directly minimize such risk, as the probability distribution $P(x, y)$ is unknown. The only error we can actually measure is the mean error rate on the training set $D_m$, called the empirical risk.
Definition 2.1.3 (Empirical Risk) Given a training set $D_m = \{(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}\}_{i=1}^m$ of $m$ patterns and observations, a trained function $f_\alpha : \mathcal{X} \to \mathcal{Y}$ and a loss function $\ell : \mathcal{X} \times \mathcal{Y} \times \mathcal{Y} \to [0, \infty)$, the empirical risk for $f_\alpha$ is defined as
$$R_{emp}[f_\alpha] = \frac{1}{m} \sum_{i=1}^{m} \ell(x_i, y_i, f_\alpha(x_i)). \qquad (2.6)$$

Minimizing this risk alone, however, gives no guarantee on the value of the expected risk itself. If we choose the set of functions to be the set of all functions from the input to the output space, we can always find a function which has zero empirical error, mapping each training pattern $x_i$ to its observation $y_i$, and maps every other pattern $x_j$, $j > m$, to a fixed value, thus achieving no learning at all.
In order to generalize to unseen patterns, we have to restrict the set of possible learning functions, taking into account the complexity or capacity of such set with respect to the learning task and the number of training examples available.
2.1.2 VC Dimension and Bounds on Expected Risk
A well known measure of the complexity of a set of functions $\{f_\alpha\}$ is the Vapnik Chervonenkis (VC) dimension [VC71]. In the case of binary classification, a set of $m$ points is shattered by the set of functions $\{f_\alpha\}$ if, for each of the $2^m$ possible labellings of the points, there exists a function in $\{f_\alpha\}$ which correctly assigns all labels. The VC dimension of $\{f_\alpha\}$ is given by the maximum $m$ for which a set of $m$ points shattered by $\{f_\alpha\}$ exists. If the maximum does not exist, the VC dimension is said to be infinite. It can be proved [Bur98a] that the VC dimension of the set of oriented hyperplanes in $\mathbb{R}^n$ is $n + 1$. Vapnik [Vap95] derived a bound on the expected risk which holds with probability $1 - \eta$ for $\eta \in [0, 1]$, and depends only on the empirical risk and the so called VC confidence:
$$R[f_\alpha] \le R_{emp}[f_\alpha] + \sqrt{\frac{h\left(\log(2m/h) + 1\right) - \log(\eta/4)}{m}} \qquad (2.7)$$
Fixing the probability $\eta$ and the training set size $m$, the VC confidence is a monotonically increasing function of the VC dimension $h$. That is, given two learning functions with equal empirical risk, the one belonging to the set of functions with smaller VC dimension will result in a better upper bound on the expected risk.
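As a quick numerical illustration of bound (2.7), the sketch below (ours) evaluates the VC confidence term for a few values of the VC dimension h and the sample size m, showing that the term grows with h and shrinks with m.

```python
import numpy as np

def vc_confidence(h, m, eta=0.05):
    """Capacity term of bound (2.7): VC dimension h, sample size m,
    confidence level 1 - eta."""
    return np.sqrt((h * (np.log(2 * m / h) + 1) - np.log(eta / 4)) / m)

# The term grows with the VC dimension and shrinks with the sample size:
print(vc_confidence(h=10, m=1000))    # ~0.26
print(vc_confidence(h=100, m=1000))   # ~0.64
print(vc_confidence(h=10, m=100000))  # ~0.03
```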
2.1.3 Structural Risk Minimization

The bound on the expected risk in equation (2.7) leads to the induction principle called structural risk minimization [Vap79]. The VC confidence term of the bound depends on the set of functions under investigation, while the empirical risk depends on the particular function chosen by the training procedure. In order to minimize the bound with respect to these two terms, we first divide the entire class of possible functions into nested subsets with decreasing VC dimension. Within each subset, we obtain a trained function by minimizing the empirical risk only. Finally, we choose the trained machine for which the sum of empirical risk and VC confidence is minimal (see fig. 2.2).
2.1.4 Empirical Estimates of the Expected Risk

The expected risk of a learned function can be empirically estimated by a hold out procedure, which consists of training on a subset of the available data, and testing the learned function on the remaining unseen cases. When the available data are scarce, however, more complex procedures are usually employed in order to reduce the bias induced by the choice of the train/test split. A very common procedure is the so called k-fold cross validation.
Figure 2.2. Structural risk minimization (SRM) induction principle: the entire class of functions is divided into nested subsets with decreasing VC dimension. Within each subset, a trained function is obtained by minimizing the empirical risk only. Finally, we choose the training function which minimizes the bound on the expected risk as given by the sum of the empirical risk and the VC confidence.
The training set $D_m = \{(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}\}_{i=1}^m$ is randomly split into $k$ mutually exclusive subsets $D_m^1, \dots, D_m^k$ of approximately equal size. The learning function is trained $k$ times, each time $t$ training on $D_m \setminus D_m^t$ and testing on $D_m^t$. The cross validation error estimate of the function trained on the full dataset is given by the number of errors committed in all the tests of the procedure, divided by the size of the dataset:
$$R^m_{kcv}[f_\alpha] = \frac{1}{m} \sum_{t=1}^{k} \sum_{i \in D_m^t} \ell\!\left(x_i, y_i, f_\alpha^{D_m \setminus D_m^t}(x_i)\right), \qquad (2.8)$$
where $f_\alpha^{D_m \setminus D_m^t}$ is the function trained on the subset $D_m \setminus D_m^t$, and $\ell$ is the loss function (see section 2.1.1).

A similar approach is followed in the leave one out (LOO) procedure, where the learning function $f_\alpha^i$ is iteratively trained on $D_m \setminus \{(x_i, y_i)\}$ and tested on the remaining example $(x_i, y_i)$, for all $i \in [1, m]$. The LOO error of the function $f_\alpha$ trained over the entire set $D_m$ is defined as
$$R^m_{loo}[f_\alpha] = \frac{1}{m} \sum_{i=1}^{m} \ell\!\left(x_i, y_i, f_\alpha^i(x_i)\right). \qquad (2.9)$$
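Both estimators translate directly into code. The sketch below is illustrative, not the experimental code of this thesis, and assumes a scikit-learn-style classifier factory make_clf exposing fit and predict.

```python
import numpy as np

def kfold_cv_error(make_clf, X, y, k=5, seed=0):
    """Estimator (2.8): split D_m into k folds, train on D_m minus the
    t-th fold, count errors on the t-th fold, divide the total by m."""
    m = len(y)
    idx = np.random.default_rng(seed).permutation(m)
    errors = 0
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        clf = make_clf().fit(X[train], y[train])
        errors += np.sum(clf.predict(X[fold]) != y[fold])
    return errors / m

def loo_error(make_clf, X, y):
    """Estimator (2.9): leave one example out in turn, i.e. k = m."""
    return kfold_cv_error(make_clf, X, y, k=len(y))
```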
The popularity of the LOO error estimate is mostly due to the fact that it is an almost unbiased estimate of the generalization error, as shown in the following theorem [LB67].
Theorem 2.1.1 (Bias of Leave One Out Estimator) The leave one out estimator is almost unbiased, that is
$$E\!\left[R^m_{loo}[f_\alpha]\right] = E\!\left[R^{m-1}[f_\alpha]\right], \qquad (2.10)$$
where the expectation on the left hand side is over training sets of size $m$, while the one on the right hand side is over training sets of size $m - 1$.

On the other hand, the variance of the LOO error estimator can be large [Bur89]. A bound on the variability of such estimator, based on the VC dimension of the class of learning functions, is presented in the following theorem [KR97].
Theorem 2.1.2 (Variability of Leave One Out Estimator) Let $f_\alpha$ be a function learned by empirical error minimization over a class of functions with VC dimension $h$. Then for every $\nu > 0$, with probability $1 - \nu$,
$$\left| R^m_{loo}[f_\alpha] - R^m[f_\alpha] \right| \le \frac{1}{\nu} \sqrt{\frac{8\,(h+1)\left(\log(9m/h) + 2\right)}{m}} \qquad (2.11)$$

For a detailed review of estimators of the generalization error see [Koh95], while [EP03] studies the advantages of using the LOO error estimator in terms of the stability of learning algorithms.
2.2 Support Vector Machines
Support Vector Machines [CV95, Vap95, Vap98] (SVM) are the first effective application of the principles of structural risk minimization, by providing a learning procedure which tries to fit the training data while keeping the VC dimension of the class of candidate functions as low as possible. While the simplest formulation of SVM deals with linearly separable data, extensions have been developed [CV95] to allow violations of the separability constraint, and to implement nonlinear separations [BGV92].
2.2.1 Hard Margin Hyperplanes

Let $\mathcal{X}$ be a space endowed with a dot product $\langle \cdot, \cdot \rangle$. Any hyperplane in such a space can be represented as
$$\{x \in \mathcal{X} \mid \langle w, x \rangle + b = 0\}, \quad w \in \mathcal{X},\ b \in \mathbb{R}, \qquad (2.12)$$
where $w$ is normal to the hyperplane and $|b|/\|w\|$ is the distance of the hyperplane from the origin. This formulation still allows one to obtain an infinite number of equivalent hyperplanes by multiplying both $w$ and $b$ by the same non-zero constant. We thus define the
pair $(w, b) \in \mathcal{X} \times \mathbb{R}$ a canonical form for the hyperplane (2.12) with respect to the set $x_1, \dots, x_m \in \mathcal{X}$ if it is scaled such that
$$\min_{i=1,\dots,m} |\langle w, x_i \rangle + b| = 1. \qquad (2.13)$$
This implies that the minimal distance from the hyperplane is equal to $1/\|w\|$.

Let $D_m = \{(x_i, y_i) \in \mathcal{X} \times \{\pm 1\}\}_{i=1}^m$ be a training set. We want to find a hyperplane which separates the positive from the negative examples, that is, a decision function
$$f_{w,b}(x) = \mathrm{sgn}(\langle w, x \rangle + b) \qquad (2.14)$$
satisfying $f_{w,b}(x_i) = y_i$ for all $(x_i, y_i) \in D_m$. Suppose such a hyperplane exists, that is, the training set $D_m$ is linearly separable. The canonical form (2.13) for the hyperplane implies
$$y_i(\langle w, x_i \rangle + b) \ge 1 \quad \forall (x_i, y_i) \in D_m.$$
The sum of the minimal distances on both sides of the separating hyperplane is called the (geometric) margin and is equal to $2/\|w\|$, while the hyperplane itself is called a hard margin hyperplane. A slightly different definition of margin is the one indicating the confidence of the predictions of the decision function. We define the (confidence) margin of $f_{w,b}$ on $D_m$ as
$$\min_{D_m} y_i(\langle w, x_i \rangle + b),$$
which is always equal to one for canonical hyperplanes and linearly separable sets. Given an example $(x_i, y_i)$, we also say it is classified with confidence margin equal to
$$y_i(\langle w, x_i \rangle + b),$$
which can be negative if the example is incorrectly classified. In the following, unless otherwise specified, we will always write margin referring to the geometric margin. It is rather intuitive that, in order to generalize well, we should choose a separating hyperplane with the largest possible margin. A more formal justification of large margin algorithms is given in the following theorem [SS02, BST99].

Theorem 2.2.1 (Margin Error Bound) Consider the set of decision functions $f(x) = \mathrm{sign}\,\langle w, x \rangle$ with $\|w\| \le \Lambda$ and $\|x\| \le R$, for some $R, \Lambda > 0$. Moreover, let $\rho > 0$ and let $\nu$ denote the fraction of training examples with margin smaller than $\rho/\|w\|$, referred to as the margin error.
For all distributions $P$ generating the data, with probability at least $1 - \delta$ over the drawing of the $m$ training patterns, and for any $\rho > 0$ and $\delta \in (0, 1)$, the probability that a test pattern drawn from $P$ will be misclassified is bounded from above by
$$\nu + \sqrt{\frac{c}{m}\left(\frac{R^2 \Lambda^2}{\rho^2} \ln^2 m + \ln(1/\delta)\right)}. \qquad (2.15)$$
Here, $c$ is a universal constant. It is interesting to note the similarity between this bound and the bound on the expected risk of equation (2.7). The margin error bound is still the sum of a training error (the margin error $\nu$) and a capacity term. This second term tends to zero as $m$ tends to infinity, and is proportional to $R$ and $\Lambda$. With such values fixed, the capacity term is inversely proportional to $\rho$. However, increasing $\rho$ leads to an increase of the margin error $\nu$, as there will be more training patterns with margin smaller than $\rho/\|w\|$. By finding a balance between these two opposite behaviours we can minimize the test error probability and achieve good generalization. Finally, note that maximizing the margin $\rho/\|w\|$ amounts to minimizing $\|w\|$ for a fixed value of $\rho$ (i.e. $\rho = 1$ for canonical hyperplanes).

We can now construct a large margin separating hyperplane by solving the following problem:
$$\min_{w \in \mathcal{X},\, b \in \mathbb{R}} \frac{1}{2}\|w\|^2 \qquad (2.16)$$
subject to
$$y_i(\langle w, x_i \rangle + b) \ge 1 \quad \forall (x_i, y_i) \in D_m. \qquad (2.17)$$
Note that this is a convex quadratic optimization problem: the objective function is convex, and the points satisfying the constraints form a convex set. Therefore it admits a unique global minimum $(w^*, b^*)$. The constraints (2.17) can be included by means of a set of Lagrange multipliers $\alpha_i \ge 0$, giving the Lagrangian
$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^m \alpha_i \left( y_i(\langle w, x_i \rangle + b) - 1 \right), \qquad (2.18)$$
which must be maximized with respect to $\alpha_i$ and minimized with respect to $w$ and $b$ (the solution is termed a saddle point). According to the KKT theorem [Fle87], at the saddle point either $\alpha_i = 0$, or $\alpha_i > 0$ and $y_i(\langle w, x_i \rangle + b) - 1 = 0$, the latter meaning that the pattern $x_i$ is at the minimal distance from the separating hyperplane. Such points are called Support Vectors, and are the only critical points of the training set, lying on the two closest hyperplanes $H_1$ and $H_{-1}$ on both sides of the separating one (see fig. 2.3). Removing all other points would lead to the same separating hyperplane. The optimization problem can be turned into its dual formulation, called the Wolfe dual [Fle87], which has some interesting properties, both for efficiency and for the subsequent extension to nonlinear cases. By setting to zero the derivatives with respect to the primal variables we obtain:
$$\frac{\partial}{\partial b} L(w, b, \alpha) = 0 \;\Rightarrow\; \sum_{i=1}^m \alpha_i y_i = 0, \qquad (2.19)$$
$$\frac{\partial}{\partial w} L(w, b, \alpha) = 0 \;\Rightarrow\; w = \sum_{i=1}^m \alpha_i y_i x_i. \qquad (2.20)$$
Figure 2.3. Separable classification problem solved by support vector machines. The solid line represents the separating hyperplane. Support vectors (in black) are the points lying on the two closest hyperplanes on both sides of the separating one, corresponding to a confidence margin of one. All other points (in white) do not contribute to the decision function.
Substituting the conditions (2.19) and (2.20) in the Lagrangian (2.18), we obtain the dual form of the optimization problem:
$$\max_{\alpha \in \mathbb{R}^m} \sum_{i=1}^m \alpha_i - \frac{1}{2} \sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle, \qquad (2.21)$$
subject to
$$\alpha_i \ge 0, \quad i = 1, \dots, m, \qquad (2.22)$$
$$\sum_{i=1}^m \alpha_i y_i = 0. \qquad (2.23)$$
Note that the dual form (2.21) has a number of variables equal to the number of training patterns $m$, while in the primal one (2.16) it is equal to the dimension of the pattern space $\mathcal{X}$. The proportion between these two numbers can suggest which of the two optimization problems is the more efficient to solve. Substituting the expansion of $w$ (2.20) into the decision function (2.14), we obtain its formulation in terms of the training patterns $x_i$:
$$f_{\alpha,b}(x) = \mathrm{sgn}\left( \sum_{i=1}^m y_i \alpha_i \langle x, x_i \rangle + b \right), \qquad (2.24)$$
where it is clear that only support vectors, for which $\alpha_i > 0$, contribute to the decision function.
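The expansion (2.24) can be checked numerically against an off-the-shelf implementation. The sketch below (ours) uses scikit-learn's SVC, whose dual_coef_ attribute stores the products y_i alpha_i for the support vectors; a very large C approximates the hard margin case on separable data.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=40, centers=2, random_state=0)
clf = SVC(kernel="linear", C=1e6).fit(X, y)  # large C ~ hard margin

# Expansion (2.24): f(x) = sgn( sum_i y_i alpha_i <x, x_i> + b ), where the
# sum runs over support vectors only and dual_coef_ stores y_i * alpha_i.
x = X[0]
fx = clf.dual_coef_ @ (clf.support_vectors_ @ x) + clf.intercept_
assert np.sign(fx) == np.sign(clf.decision_function([x]))
```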
2.2.2 Soft Margin Hyperplanes

Hard margin hyperplanes can only be found for linearly separable problems. Moreover, in order to have a robust classifier in terms of generalization, it is necessary to balance performance on the training set with model complexity, thus tolerating a certain fraction of outliers among the training patterns. This can be achieved [CV95] by introducing slack variables relaxing the separation constraints (2.17):
$$y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i, \quad i = 1, \dots, m, \qquad (2.25)$$
with
$$\xi_i \ge 0, \quad i = 1, \dots, m. \qquad (2.26)$$
In order to penalize such relaxations we add an error penalty term to the objective function (2.16), which becomes:
$$\min_{w \in \mathcal{X},\, b \in \mathbb{R},\, \xi \in \mathbb{R}^m} \frac{1}{2}\|w\|^2 + \frac{C}{m} \sum_{i=1}^m \xi_i, \qquad (2.27)$$
subject to (2.25) and (2.26). Here $C > 0$ is a cost parameter balancing model complexity versus margin errors, given by all nonzero $\xi_i$, and can therefore be used to achieve a minimum of the margin error bound of theorem 2.2.1. It can be shown [CV95] that the Wolfe dual problem has the same formulation as in the separable case (2.21), except for an additional upper constraint $C$ on the $\alpha$ values:
$$0 \le \alpha_i \le C/m, \quad i = 1, \dots, m. \qquad (2.28)$$
All the unbound support vectors (those with $\alpha_i < C/m$) lie on the hyperplanes $H_1$ and $H_{-1}$ closest to the separating one. Conversely, the bound ones (those with $\alpha_i = C/m$) correspond to violations $\xi_i > 0$ of the separation constraint, and have a distance $\xi_i/\|w\|$ from their respective hyperplane. When $\xi_i > 1$ they are also training errors (see fig. 2.4).
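Bound and unbound support vectors can be read off the alpha values of a trained machine. The following illustrative sketch again uses scikit-learn's SVC; note that its box constraint is 0 <= alpha_i <= C rather than the C/m scaling of (2.28), a difference in convention only.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=40, centers=2, cluster_std=3.0, random_state=0)
C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)

alpha = np.abs(clf.dual_coef_).ravel()  # the alpha_i of the support vectors
at_bound = np.isclose(alpha, C)         # alpha_i = C: margin violators
print(at_bound.sum(), "bound and", (~at_bound).sum(), "unbound support vectors")
```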
2.2.3 Nonlinear Support Vector Machines

Linear support vector machines can deal with linearly separable data, possibly tolerating a certain quantity of outliers in the training patterns. In order to learn decision functions which are not linear in the data, a more general approach is required [BGV92]. The basic idea is that of transforming the input data $x \in \mathcal{X}$ into a higher dimensional feature space $\mathcal{H}$ by a nonlinear map $\Phi : \mathcal{X} \to \mathcal{H}$, and finding a separating hyperplane in such space:
$$\{\Phi(x) \in \mathcal{H} \mid \langle w, \Phi(x) \rangle + b = 0\}, \quad w \in \mathcal{H},\ b \in \mathbb{R}, \qquad (2.29)$$
corresponding to a decision function
Figure 2.4. Non separable classification problem solved by support vector machines. The solid line represents the separating hyperplane, while dotted lines are hyperplanes with confidence margin equal to one. Grey points are unbound SVs, black points are bound SVs and extra borders indicate bound SVs which are also training errors. All other points do not contribute to the decision function.
$$f_{w,b}(\Phi(x)) = \mathrm{sgn}(\langle w, \Phi(x) \rangle + b). \qquad (2.30)$$
As an example, consider the problem in figure 2.5(a), for which no separating hyperplane in $\mathbb{R}^2$ exists. By the nonlinear map $\Phi : \mathbb{R}^2 \to \mathbb{R}^3$ such that
$$\Phi\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} x_1^2 \\ \sqrt{2}\, x_1 x_2 \\ x_2^2 \end{pmatrix} \qquad (2.31)$$
Figure 2.5. (a) No linear decision function can separate black points from white ones. (b) A nonlinear decision surface correctly separating points. (c) By nonlinearly mapping points into a higher dimensional feature space we can find a separating hyperplane, which corresponds to the nonlinear decision surface (b) in the input space.

we can transform the input data into a higher dimensional feature space where a separating hyperplane $((w_1, w_2, w_3)^T, b)$ does exist (see fig. 2.5(c)). Note that the linear decision
function in $\mathbb{R}^3$ corresponds to a nonlinear decision surface in $\mathbb{R}^2$ (see fig. 2.5(b)):
$$f_{w,b}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \mathrm{sgn}\left( w_1 x_1^2 + w_2 \sqrt{2}\, x_1 x_2 + w_3 x_2^2 + b \right). \qquad (2.32)$$
In terms of statistical learning theory, this amounts to increasing the capacity of the class of learning functions in order to account for the difficulty of the learning task. The feature space $\mathcal{H}$ can have very high dimension, even infinite, thus apparently contradicting the principle of structural risk minimization (see section 2.1.3). The trick is in the idea of large margin hyperplanes, which can have VC dimension much smaller than that of the entire class of hyperplanes. In the following theorem [Vap79], we employ $\hat{x}$ instead of $x$ in order to stress that the points live in a possibly infinite dimensional space $\mathcal{H}$.

Theorem 2.2.2 (VC Dimension of Margin Hyperplanes) Let $\langle w, \hat{x} \rangle$ be a set of canonical hyperplanes with respect to a set of points $\mathcal{H}^* = \{\hat{x}_1, \dots, \hat{x}_m\}$. Let $R$ be the radius of the smallest sphere centered at the origin and containing $\mathcal{H}^*$. The set of decision functions $f_w(\hat{x}) = \mathrm{sign}\,\langle w, \hat{x} \rangle$ defined on $\mathcal{H}^*$, and satisfying the constraint $\|w\| \le \Lambda$, has a VC dimension satisfying
$$h \le R^2 \Lambda^2. \qquad (2.33)$$

The VC dimension can therefore be controlled by increasing the margin of the set of admissible hyperplanes. Extensions of the theorem exist for hyperplanes with a nonzero offset $b$ [Vap98], and for the general case of functions defined on the entire domain $\mathcal{H}$ [BST99]. Considering that the feature space $\mathcal{H}$ can have very high, even infinite, dimension, it becomes difficult and computationally expensive to directly compute the transformation $\Phi$. However, this is not actually necessary: in both the dual objective function (2.21) and the expanded decision function (2.24), the inputs $x$ only appear in the form of dot products $\langle x_i, x_j \rangle$. We can therefore use the so-called kernel trick, employing a kernel function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ which satisfies
$$k(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle \quad \forall x_i, x_j \in \mathcal{X}, \qquad (2.34)$$
obtaining the following optimization problem:
$$\max_{\alpha \in \mathbb{R}^m} \sum_{i=1}^m \alpha_i - \frac{1}{2} \sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j k(x_i, x_j), \qquad (2.35)$$
with constraints (2.23) and (2.28), which is still quadratic provided $k$ is positive definite (see section 2.4.1). The corresponding decision function is given by:
$$f_{\alpha,b}(x) = \mathrm{sgn}\left( \sum_{i=1}^m y_i \alpha_i k(x, x_i) + b \right). \qquad (2.36)$$
Note that we don't even need to know the true mapping Φ in order to employ kernel k, provided that such a mapping exists. Generally speaking, the input space X can now be any set; we just need to find a suitable positive definite kernel for it. This allows us to treat non-vectorial data such as strings or graphs, as will be discussed in section 2.5.3. From now on, we will write a generic input pattern as x, turning to boldface x whenever we need to stress its vectorial nature. Kernels will be more formally defined in section 2.4, where we will also show how to verify whether a candidate function is actually a valid positive definite kernel.
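To make the kernel trick concrete, here is a minimal numeric check (a sketch, not part of the thesis) that the homogeneous polynomial kernel of degree two computes the dot product in the feature space of the map Φ of eq. (2.31), without ever constructing Φ explicitly:

    import numpy as np

    def phi(x):
        # Explicit feature map of eq. (2.31): IR^2 -> IR^3.
        x1, x2 = x
        return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

    def poly2_kernel(x, z):
        # Degree-2 homogeneous polynomial kernel: k(x, z) = <x, z>^2.
        return np.dot(x, z) ** 2

    x = np.array([1.0, 2.0])
    z = np.array([3.0, -1.0])

    # k computes the feature-space dot product without forming Phi, eq. (2.34).
    assert np.isclose(np.dot(phi(x), phi(z)), poly2_kernel(x, z))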
2.2.4 Bounds on the LOO Error of SVM

In order to estimate the generalization performance of support vector machines, the LOO procedure is a desirable approach (see section 2.1.4). However, it is very time consuming, especially when the number of training examples is large. To obtain an efficient estimate of the generalization error, upper bounds on the LOO error have been proposed, which can be computed with little or no additional effort given the learned machine. Let ℓ be the misclassification loss (eq. (2.1)) for binary classification, which can be equivalently written as¹
ℓ(x, y, f(x)) = θ(−yf(x)), (2.37)

where θ is the Heaviside function. The LOO error of f can be written as
R^m_loo[f] = (1/m) Σ_{i=1}^m θ(−y_i f^i(x_i)) = (1/m) Σ_{i=1}^m θ(−y_i f(x_i) + y_i(f(x_i) − f^i(x_i))). (2.38)

Thus, by providing an upper bound U_i for y_i(f(x_i) − f^i(x_i)), we get the following upper bound on the LOO error:
R^m_loo[f] ≤ (1/m) Σ_{i=1}^m θ(−y_i f(x_i) + U_i), (2.39)

since θ is monotonically increasing. The simplest bound [Vap95] arises from the fact that removing non support vectors does not change the solution computed by the SVM. The number of support vectors N_SV is thus an upper bound for the number of possible errors (U_i = 2 if x_i is a support vector), giving the bound

R^m_loo[f] ≤ N_SV / m. (2.40)

A few bounds have been found for unbiased SVM. Jaakkola and Haussler [JH98] proved the inequality
¹ To simplify notation, from here on we drop the explicit dependence of f on its parameters.
y_i(f(x_i) − f^i(x_i)) ≤ α_i k(x_i, x_i) = U_i, (2.41)

leading to the upper bound
R^m_loo[f] ≤ (1/m) Σ_{i=1}^m θ(−y_i f(x_i) + α_i k(x_i, x_i)). (2.42)

Let R be the radius of the smallest sphere in feature space containing all training points x_i, i ∈ [1, m], computed as
R = min_{o∈H} max_{x_i} ||Φ(x_i) − o||, (2.43)

where o is the center of this minimal enclosing sphere. Then Vapnik [Vap98] derived a bound for unbiased hard margin SVM given by
R^m_loo[f] ≤ (1/m) (R² / γ²), (2.44)

where γ² = 1/||w||² is the squared margin of the separating hyperplane. Note that this bound justifies the idea of constructing large margin hyperplanes, as maximizing the margin corresponds to minimizing the bound, for fixed values of the sphere radius. Assume that the set of support vectors remains unchanged during the LOO procedure. Then Chapelle et al. [CVBM02] proved the following equality:
y_i(f(x_i) − f^i(x_i)) = α_i S_i², (2.45)

where S_i is called the span of the point Φ(x_i), and is its distance from the set Λ_i, where
Λ_i = { Σ_{j≠i, α_j>0} λ_j Φ(x_j) | Σ_{j≠i} λ_j = 1 }. (2.46)

We can thus compute the exact number of errors made by the LOO procedure, giving

R^m_loo[f] = (1/m) Σ_{i=1}^m θ(−y_i f(x_i) + α_i S_i²). (2.47)

Finally, Joachims [Joa99, Joa00] derived a bound which is valid for general biased soft margin SVM, provided they are stable, that is, they have at least one unbound support vector. The bound is given by
R^m_loo[f] ≤ (1/m) |{ i : 2α_iR² + ξ_i ≥ 1 }|, (2.48)

where ξ_i are the slack variables (see eq. (2.26)) for the non separable case, and R² is an upper bound on k(x, x) for all x.
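As an illustration, the support-vector-count bound (2.40) and the Jaakkola-Haussler bound (2.42) can be computed directly from a trained machine. The sketch below uses scikit-learn (an assumption: the thesis prescribes no implementation, and sklearn's SVC is a biased machine while (2.42) was derived for unbiased SVM, so the result is only indicative); the toy data are hypothetical.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))
    y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

    clf = SVC(kernel="rbf", gamma=1.0, C=10.0).fit(X, y)
    m = len(X)

    # Bound (2.40): the LOO error rate is at most the fraction of SVs.
    sv_bound = len(clf.support_) / m

    # Bound (2.42): count points with y_i f(x_i) <= alpha_i k(x_i, x_i).
    alpha = np.zeros(m)
    alpha[clf.support_] = np.abs(clf.dual_coef_).ravel()
    f_all = clf.decision_function(X)
    jh_bound = np.mean(y * f_all <= alpha * 1.0)  # k(x, x) = 1 for RBF

    print(f"SV bound: {sv_bound:.2f}, Jaakkola-Haussler bound: {jh_bound:.2f}")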
2.3 Other Support Vector Methods
The idea of large margin classifiers, as well as that of decision functions characterized by a subset of the training patterns (the support vectors), have been successfully extended to different learning tasks. In regression, large margins correspond to flat estimation functions, while in clustering the task is to find the smallest sphere enclosing the data in the feature space, which gives cluster boundaries when mapped back to the input space. Other learning tasks for which support vector methods have been conceived include ranking [FSS98, CD02a] and multiclass classification, which will be discussed in detail in chapter 3.
2.3.1 Support Vector Regression

The basic ideas of SVM were later extended [Vap95] to regression estimation, where the task is to learn a real valued function of the data given a finite number of examples D_m = {(x_i, y_i) ∈ X × IR}_{i=1}^m. In its first formulation, the SV regression algorithm seeks to estimate a linear function of the data:
f(x) = < w, x > + b, (2.49)

where b ∈ IR and w ∈ X. In order to trade off function complexity against fitting of the training data, we introduce an ε-insensitive region [Vap95] around the target function, that is, a tolerance on the distance between the target output and the function estimate (see fig. 2.6(a)), below which a pattern is considered correctly assigned. Such patterns correspond to those which are separated with a confidence margin greater than one in SVM, and the introduction of the ε-insensitive region makes it possible to obtain a sparse solution in a fashion analogous to that of SVM. The notion of large margin is here replaced by that of flatness: by minimizing ||w|| we search for the flattest estimation function which fits the training data (see fig. 2.6(b)). For a detailed treatment of the connections between large margins and the ε-margin of regression estimation see [PRE98]. By further introducing a set of slack variables ξ_i, ξ_i* to allow for errors on the training set, we obtain the following constrained optimization problem:
min_{w∈X, b∈IR, ξ,ξ*∈IR^m} (1/2)||w||² + (C/m) Σ_{i=1}^m (ξ_i + ξ_i*), (2.50)

subject to < w, x_i > + b − y_i ≤ ε + ξ_i, (2.51)
y_i − (< w, x_i > + b) ≤ ε + ξ_i*, (2.52)
ξ_i, ξ_i* ≥ 0, (2.53)

where i goes from 1 to m. In a similar fashion as for SVM, we consider the Lagrangian
Figure 2.6. (a) ε-insensitive region around the target function. Points lying within the region are considered correctly predicted. (b) Large margin separation corresponds in regression to flat estimation functions. Circles are training points, and dotted lines enclose the ε-insensitive region. Solid lines are candidate regression functions, the bold one being the flattest which still fits the training data with approximation up to ε.
L = (1/2)||w||² + (C/m) Σ_{i=1}^m (ξ_i + ξ_i*) − Σ_{i=1}^m (η_iξ_i + η_i*ξ_i*)
 − Σ_{i=1}^m α_i(ε + ξ_i + y_i − < w, x_i > − b)
 − Σ_{i=1}^m α_i*(ε + ξ_i* − y_i + < w, x_i > + b), (2.54)

with α_i, α_i*, η_i, η_i* ≥ 0, ∀i ∈ [1, m]. By setting to zero the derivatives of L with respect to the primal variables we obtain:
∂L/∂b = Σ_{i=1}^m (α_i − α_i*) = 0, (2.55)
∂L/∂w = w − Σ_{i=1}^m (α_i − α_i*) x_i = 0, (2.56)
∂L/∂ξ_i = C/m − α_i − η_i = 0, (2.57)
∂L/∂ξ_i* = C/m − α_i* − η_i* = 0. (2.58)

Finally, substituting (2.55), (2.56), (2.57) and (2.58) into (2.54) we derive the dual problem
max_{α,α*∈IR^m} −(1/2) Σ_{i,j=1}^m (α_i* − α_i)(α_j* − α_j) < x_i, x_j > − ε Σ_{i=1}^m (α_i* + α_i) + Σ_{i=1}^m y_i(α_i* − α_i), (2.59)

subject to Σ_{i=1}^m (α_i − α_i*) = 0, (2.60)
α_i, α_i* ∈ [0, C/m], ∀i ∈ [1, m]. (2.61)

Note that, by virtue of (2.56), we can rewrite the estimated function (2.49) as a linear combination of the training patterns:
f(x) = Σ_{i=1}^m (α_i* − α_i) < x_i, x > + b. (2.62)

Again, both the dual optimization problem (2.59) and the learned function (2.62) contain the input patterns x_i only in the form of dot products < x_i, x_j >, thus allowing us to use a kernel k(x_i, x_j) corresponding to a dot product in a higher dimensional feature space. It follows from the KKT theorem [Fle87] that at the saddle point of the Lagrangian we have
α_i(ε + ξ_i + y_i − < w, x_i > − b) = 0, (2.63)
α_i*(ε + ξ_i* − y_i + < w, x_i > + b) = 0, (2.64)
(C/m − α_i)ξ_i = 0, (2.65)
(C/m − α_i*)ξ_i* = 0. (2.66)

These last conditions highlight some interesting analogies with the SVM case (see also fig. 2.7). Let us define the ε-tube of f as {(x, y) ∈ X × IR : |f(x) − y| < ε}.
Figure 2.7. Regression problem solved by support vector machines. The dotted line is the true function, and the dashed lines enclose the ε-insensitive region. The solid line is the learned regression function. Grey points are unbound SVs, and black points are bound SVs. All other points do not contribute to the estimation function. Flatness in feature space implies smoothness of the regression function in input space.
• All patterns within the ε-tube, for which |f(x_i) − y_i| < ε, have α_i, α_i* = 0 and thus don't contribute to the estimated function f.
• Patterns for which either 0 < α_i < C/m or 0 < α_i* < C/m are on the border of the ε-tube, that is |f(x_i) − y_i| = ε. They are the unbound support vectors.

• The remaining training patterns are margin errors (either ξ_i > 0 or ξ_i* > 0), and reside outside the ε-insensitive region. They are bound support vectors, with corresponding α_i = C/m or α_i* = C/m.
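The pattern just described is easy to observe in practice. The sketch below fits an ε-SVR with scikit-learn (an assumption: the thesis prescribes no implementation, and sklearn's parametrization uses C where the formulation above uses C/m) on hypothetical 1-D data and checks which points become support vectors:

    import numpy as np
    from sklearn.svm import SVR

    rng = np.random.default_rng(0)
    X = np.sort(rng.uniform(-3, 3, size=(80, 1)), axis=0)
    y = np.sinc(X).ravel() + rng.normal(scale=0.05, size=80)

    eps = 0.1
    svr = SVR(kernel="rbf", gamma=0.5, C=1.0, epsilon=eps).fit(X, y)

    residuals = np.abs(svr.predict(X) - y)
    print(f"{len(svr.support_)} of {len(X)} points are support vectors")
    # Points strictly inside the eps-tube (residual < eps) have
    # alpha_i = alpha_i* = 0 and should not appear among svr.support_;
    # points outside the tube are bound SVs with coefficients at the bound.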
2.3.2 Support Vector Clustering

A support vector algorithm has been used in [TD99, SPST+01] in order to characterize a set of data in terms of support vectors, thus allowing the computation of a set of contours which enclose the data points. These contours were interpreted as cluster boundaries in [BHHSV01], resulting in the so-called support vector clustering (SVC) algorithm. In SVC, data are mapped to a high dimensional feature space using a Gaussian kernel (see section 2.5.1). In this space, the smallest sphere containing the data is searched for. When mapped back to the input space, the sphere forms a set of contours which enclose the data points, thus determining a clustering of the input space. By decreasing the width of the Gaussian kernel it is possible to increase the number of clusters in the solution. Outliers can be dealt with by relaxing the enclosing constraint and allowing some points
to stay out of the sphere in feature space. In the following, we briefly describe the algorithm, showing its connections to the general SVM case. Given a set of m points x_i ∈ X, i = 1, ..., m, and a nonlinear transformation Φ from X to some higher dimensional feature space H, we can define the problem of finding the smallest enclosing sphere of radius R in the feature space as follows:
min_{R∈IR, o∈H, ξ∈IR^m} R² + (C/m) Σ_{i=1}^m ξ_i (2.67)

subject to ||Φ(x_i) − o||² ≤ R² + ξ_i, ∀i ∈ [1, m] (2.68)
ξ_i ≥ 0, ∀i ∈ [1, m], (2.69)
where o is the center of the sphere, ξ_i are slack variables allowing for soft constraints, and C is a cost parameter balancing the radius of the sphere against the number of outliers. We consider the Lagrangian
L = R² + (C/m) Σ_{i=1}^m ξ_i − Σ_{i=1}^m α_i(R² + ξ_i − ||Φ(x_i) − o||²) − Σ_{i=1}^m η_iξ_i, (2.70)

with α_i ≥ 0 and η_i ≥ 0 for all i ∈ [1, m]. Setting to zero the derivatives with respect to the primal variables R, o and ξ_i we obtain
∂L/∂R = Σ_{i=1}^m α_i − 1 = 0, (2.71)
∂L/∂o = o − Σ_{i=1}^m α_iΦ(x_i) = 0, (2.72)
∂L/∂ξ_i = C/m − α_i − η_i = 0, ∀i ∈ [1, m]. (2.73)

By substituting (2.71), (2.72) and (2.73) into (2.70) we derive the Wolfe dual problem
max_{α∈IR^m} Σ_{i=1}^m α_i < Φ(x_i), Φ(x_i) > − Σ_{i,j=1}^m α_iα_j < Φ(x_i), Φ(x_j) >, (2.74)

subject to 0 ≤ α_i ≤ C/m, ∀i ∈ [1, m]. (2.75)

The distance of a given point x from the center of the sphere
R²(x) = ||Φ(x) − o||² (2.76)

can be written using (2.72) as
R²(x) = < Φ(x), Φ(x) > − 2 Σ_{i=1}^m α_i < Φ(x), Φ(x_i) > + Σ_{i,j=1}^m α_iα_j < Φ(x_i), Φ(x_j) >. (2.77)

As for the other SV algorithms, both (2.74) and (2.77) contain only dot products in the feature space, which can be substituted by a kernel function k. It has been noted [TD99] that polynomial kernels do not yield tight contours for clusters. Gaussian kernels (see section 2.5)
k(x_i, x_j) = exp(−||x_i − x_j||² / (2σ²)) (2.78)

with width parameter σ were successfully employed [BHHSV01] and are assumed in the rest of the section. The KKT conditions [Fle87] imply that at the saddle point of the Lagrangian
η_iξ_i = 0, (2.79)
α_i(R² + ξ_i − ||Φ(x_i) − o||²) = 0, (2.80)

showing the presence of (see fig. 2.8):
Figure 2.8. Support vector clustering. Lines represent cluster boundaries, formed by unbound SVs (grey points). Bound support vectors (black points) are outliers and do not belong to any cluster. All other points stay within the boundaries of their respective cluster.
• Unbound support vectors (0 < α_i < C/m), whose images lie on the surface of the enclosing sphere, corresponding to cluster boundaries in the input space.
• Bound support vectors (α_i = C/m), whose images lie outside of the enclosing sphere, and which correspond to outliers, staying out of the cluster boundaries.
• All other points (α_i = 0), which stay within cluster boundaries, with images inside the enclosing sphere.
The radius R of the enclosing sphere can be computed by (2.77) provided x is an unbound support vector, whose image lies exactly on the sphere surface. In order to assign points to clusters, note that given a pair of points belonging to different clusters, any path connecting them must exit the sphere in feature space, that is, it must contain a segment of points x such that R(x) > R. We can therefore define an adjacency matrix A_ij such that
A_ij = 1 if R(x) ≤ R for all x on the segment connecting x_i and x_j, and A_ij = 0 otherwise. (2.81)

Clusters are defined as the connected components of the graph induced by A. Outliers are not classified by this procedure, and can possibly be assigned to the cluster they are closest to. In [BHHSV01], an iterative version of the algorithm is also proposed: start with a Gaussian kernel whose variance is big enough to ensure a single cluster solution; then systematically decrease it, together with the cost parameter C, along a direction that guarantees a minimal number of unbound support vectors. This allows one to find a range of possible solutions of increasing complexity, stopping when the fraction of support vectors exceeds a certain threshold.
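The cluster-assignment step is straightforward to implement once the dual variables α are available. Below is a sketch (assuming α has already been obtained from a QP solver applied to (2.74)-(2.75); the sampling resolution n is an arbitrary choice) of the squared distance (2.77) and the segment test (2.81) for a Gaussian kernel:

    import numpy as np

    def gaussian_kernel(A, B, sigma):
        # k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)), eq. (2.78).
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))

    def radius2(x, X, alpha, sigma):
        # Squared distance from the sphere center, eq. (2.77);
        # k(x, x) = 1 for the Gaussian kernel.
        kxi = gaussian_kernel(x[None, :], X, sigma).ravel()
        K = gaussian_kernel(X, X, sigma)
        return 1.0 - 2.0 * alpha @ kxi + alpha @ K @ alpha

    def adjacent(xi, xj, X, alpha, sigma, R2, n=10):
        # Segment test of eq. (2.81): sample n points on the segment
        # between x_i and x_j and check they all stay inside the sphere.
        ts = np.linspace(0.0, 1.0, n)
        return all(radius2((1 - t) * xi + t * xj, X, alpha, sigma) <= R2
                   for t in ts)

The threshold R2 would be obtained by evaluating radius2 at any unbound support vector, as noted above; the connected components of the resulting adjacency matrix then give the clusters.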
2.4 Kernel Theory
The kernel trick introduced in section 2.2.3 allows us to implicitly compute a dot product between instances in a possibly infinite dimensional feature space. In this section we treat in more detail the theory underlying kernel functions, showing how to verify whether a given function is actually a valid kernel, and how, given a valid kernel, to construct a feature space such that the kernel computes a dot product in that space. We then highlight the connections between kernel machines and regularization theory, showing how kernel machines can be seen as instances of regularized empirical risk minimization.
2.4.1 Positive Definite and Mercer Kernels

We start by introducing the concept of positive definite kernels and showing how to build a space where they act as a dot product. We treat the case of real valued matrices and functions, but the results can be extended to complex matrices as well. Moreover, we give some necessary and sufficient conditions for a function k to be a positive definite kernel. Details and proofs of the reported results can be found in [Aro50, BCR84, Sai88, SS02].
Definition 2.4.1 (Gram Matrix) Given a function k : X × X → IR and patterns x_1, ..., x_m ∈ X, the m × m matrix K such that

K_ij = k(x_i, x_j) (2.82)

is called the Gram matrix of k with respect to x_1, ..., x_m.
Definition 2.4.2 (Positive Definite Matrix) A symmetric m × m matrix K is positive definite if

Σ_{i,j=1}^m c_ic_jK_ij ≥ 0, ∀c ∈ IR^m. (2.83)

Definition 2.4.3 (Positive Definite Kernel) A function k : X × X → IR which for all m ∈ IN and all x_1, ..., x_m ∈ X gives rise to a positive definite Gram matrix is called a positive definite kernel.
We will now build a so-called reproducing kernel map for the positive definite kernel k, that is, a map from X into a space where k acts as a dot product. Let

Φ : X → IR^X, Φ(x) = k(·, x) (2.84)

be a map from X to the space IR^X of functions mapping X into IR. We need to turn IR^X into a vector space endowed with a dot product < ·, · > satisfying

k(x, x′) = < Φ(x), Φ(x′) >. (2.85)

In order to make it a vector space we just need to take the span of kernel k, that is, all functions
f(·) = Σ_{i=1}^m α_ik(·, x_i) (2.86)

for all m ∈ IN, α_i ∈ IR, x_i ∈ X. A dot product in this space between f and another function
g(·) = Σ_{j=1}^{m′} β_jk(·, x′_j) (2.87)

can be defined as
< f, g > = Σ_{i=1}^m Σ_{j=1}^{m′} α_iβ_jk(x_i, x′_j). (2.88)

This dot product satisfies the reproducing property
< k(·, x), f(·) > = f(x). (2.89)

In particular, for f = k(·, x′) we have

< k(·, x), k(·, x′) > = k(x, x′). (2.90)

By satisfying equation (2.85) we showed that each positive definite kernel can be seen as a dot product in another space. In order to show that the converse is also true, it suffices to prove that, given a map Φ from X to a dot product space, the corresponding function k(x, x′) = < Φ(x), Φ(x′) > is a positive definite kernel. This can be proved by noting that for all m ∈ IN, c ∈ IR^m and x_1, ..., x_m ∈ X we have
Σ_{i,j=1}^m c_ic_jk(x_i, x_j) = < Σ_{i=1}^m c_iΦ(x_i), Σ_{j=1}^m c_jΦ(x_j) > = || Σ_{i=1}^m c_iΦ(x_i) ||² ≥ 0. (2.91)

The existence of a map to a dot product space satisfying (2.85) is therefore an alternative definition of a positive definite kernel. The completion of the dot product space IR^X, obtained by adding all limit functions of Cauchy sequences in that space, is called a reproducing kernel Hilbert space (RKHS).
Definition 2.4.4 (Reproducing Kernel Hilbert Space) Let X be a non empty set and H a Hilbert space of functions f : X → IR with dot product < ·, · >. H is called a reproducing kernel Hilbert space with kernel k if:

1. k has the reproducing property

< k(·, x), f > = f(x), ∀f ∈ H, ∀x ∈ X; (2.92)

2. k spans H (the bar denotes completion):

H = span{ k(·, x) | x ∈ X }. (2.93)

An alternative way to build a space with a dot product corresponding to k(x, x′) is given by Mercer's theorem [Mer09, Kon86].
Theorem 2.4.1 (Mercer's Theorem) Let X be a compact set with Borel measure μ, and k : X × X → IR a symmetric real valued function determining an integral operator T_k : L_2(X) → L_2(X), such that

(T_kf)(x) := ∫_X k(x, x′)f(x′) dμ(x′), (2.94)

which is positive definite; that is, for all f ∈ L_2(X), we have

∫_{X×X} k(x, x′)f(x)f(x′) dμ(x) dμ(x′) ≥ 0. (2.95)

Let λ_k be the k-th eigenvalue of T_k, with corresponding eigenfunction ψ_k. Then for all x, x′ ∈ X

k(x, x′) = Σ_{k=1}^∞ λ_kψ_k(x)ψ_k(x′), (2.96)

where the convergence is absolute and uniform.

The Mercer kernel k in eq. (2.96) corresponds to a dot product in ℓ_2 thanks to the map Φ : X → ℓ_2 such that

Φ(x) = ( √λ_k ψ_k(x) )_{k=1}^∞, (2.97)

proving that k is also a positive definite kernel. Mercer's condition (eq. (2.95)) is therefore an alternative way to verify whether a given function k is actually a positive definite kernel. It can be shown [Bur98b, SSM98] that this condition is satisfied for any function k which can be expressed as
k(x, x′) = Σ_{i=1}^∞ c_i < x, x′ >^i, (2.98)

where the c_i are positive real coefficients and the series converges uniformly. It is interesting to see how to find a reproducing kernel Hilbert space corresponding to Mercer kernel k, with functions
f(x) = Σ_{i=1}^∞ α_ik(x, x_i) = Σ_{i,j=1}^∞ α_iλ_jψ_j(x)ψ_j(x_i). (2.99)

Given a dot product < ·, · >, by linearity we have

< f, k(·, x′) > = Σ_{i,j,k=1}^∞ α_iλ_jψ_j(x_i) < ψ_j, ψ_k > λ_kψ_k(x′). (2.100)

Since the eigenfunctions ψ_i are orthogonal in L_2(X), we can choose < ·, · > such that
< ψ_j, ψ_k > = δ_jk / λ_j, (2.101)

with δ_jk the Kronecker symbol², showing that (2.99) satisfies the reproducing kernel property (2.92).
² δ_jk = 1 if j = k, and δ_jk = 0 otherwise.
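A finite-sample analogue of Mercer's expansion (2.96) can be observed numerically: the eigendecomposition of a Gaussian Gram matrix typically has a rapidly decaying spectrum, so a truncated expansion already reconstructs the kernel values well. A sketch (the data, kernel width and truncation level are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 1))
    K = np.exp(-(X - X.T) ** 2 / 2.0)        # Gaussian Gram matrix

    lam, V = np.linalg.eigh(K)               # eigenvalues in ascending order
    K_top = (V[:, -20:] * lam[-20:]) @ V[:, -20:].T   # top-20 eigenpairs

    # The reconstruction error is typically tiny, mirroring the uniform
    # convergence of the Mercer expansion.
    print(np.abs(K - K_top).max())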
2.4.2 Regularization Theory

Learning from examples can be seen as the problem of approximating a target function given a finite number of training data. The problem of approximating a function from sparse data is ill-posed, and regularization theory is a classical way to turn it into a well-posed one. Recall the problem of empirical risk minimization (see section 2.1.1), where we have a training set D_m = {(x_i, y_i) ∈ X × Y}_{i=1}^m, a set F of functions from X to Y, and a loss function ℓ : X × Y × Y → [0, ∞). Assume for simplicity that Y ≡ IR. Assume also that R_emp[f] is continuous in f³. Regularization theory deals with the problem of restricting F in order to make it a compact set, thus obtaining a well-posed minimization problem [Tik63, Vap98]. Instead of directly specifying a compact set for F, which would cast the problem into a complex constrained optimization task, we add a regularization term Ω[f] to the objective functional, such that Ω[f] ≥ 0 for all f, and the sets

F_c = { f : Ω[f] ≤ c }, c ≥ 0,

are all compact. This results in a regularized risk functional
R_reg[f] = R_emp[f] + λΩ[f], (2.102)

giving a well-posed minimization problem [Tik63, Vap98]. Here the regularization parameter λ > 0 trades off training errors against the complexity of the function, thus providing a means to control overfitting. By choosing Ω to be convex, and provided R_emp[f] is also convex, the problem has a unique global minimum. When F is a reproducing kernel Hilbert space H (see def. 2.4.4) associated to a kernel k, the representer theorem [KW71] gives an explicit form for the minimizers of R_reg[f]. We present a more general version [SHSW01, SS02] of the theorem with respect to the original one.
Theorem 2.4.2 (Representer Theorem) Let D_m = {(x_i, y_i) ∈ X × IR}_{i=1}^m be a training set, ℓ : (X × IR × IR)^m → IR ∪ {∞} a cumulative loss function, H a RKHS with norm || · ||_H, and Ω : [0, ∞) → IR a strictly monotonically increasing function. Then each minimizer f ∈ H of the regularized risk

ℓ((x_1, y_1, f(x_1)), ..., (x_m, y_m, f(x_m))) + Ω(||f||_H) (2.103)

admits a representation of the form

f(x) = Σ_{i=1}^m α_ik(x_i, x). (2.104)

³ This doesn't hold for the misclassification loss (eq. (2.1)), which should be replaced by a continuous approximation such as the soft margin loss (eq. (2.2)).
The theorem states that, regardless of the dimension of the RKHS H, the solution lies in the span of the m kernels centered on the training points. If both Ω and ℓ are also convex, the solution is unique. A semiparametric extension of the representer theorem is given by the following theorem [SS02].
Theorem 2.4.3 (Semiparametric Representer Theorem) Let us add to the assumptions of the representer theorem a set of M functions {ψ_p}_{p=1}^M : X → IR such that the m × M matrix (ψ_p(x_i))_{ip} has rank M. Then any f̃ := f + g, with f ∈ H and g ∈ span{ψ_p}, minimizing the functional

ℓ((x_1, y_1, f̃(x_1)), ..., (x_m, y_m, f̃(x_m))) + Ω(||f||_H) (2.105)

admits a representation of the form

f̃(x) = Σ_{i=1}^m α_ik(x_i, x) + Σ_{p=1}^M β_pψ_p(x), (2.106)

with β_p ∈ IR, ∀p ∈ [1, M].

We are now able to highlight the connections between regularization theory and kernel machines. Given the regularized risk (2.105), different choices of the loss function ℓ, the regularization function Ω and the real valued functions ψ_p give rise to different types of kernel machines. Support vector machines for binary classification (see section 2.2), for example, employ the soft margin loss function (eq. 2.2), which as a cumulative loss becomes
ℓ((x_1, y_1, f(x_1)), ..., (x_m, y_m, f(x_m))) = (1/m) Σ_{i=1}^m |1 − y_if(x_i)|_+, (2.107)

and a regularizer of the form Ω(||f||_H) = (λ/2)||f||², where the regularization parameter λ corresponds to the inverse of the cost C. A single additional constant function ψ_1(x) = 1 is used as an offset in biased support vector machines, those with hyperplanes not passing through the origin (that is, b ≠ 0, see eq. (2.12)). For λ → 0 we obtain hard margin SVM, where all training patterns have to be correctly classified. Note that the decision function for SVM is actually sign(f). Support vector regression is obtained using the ε-insensitive loss (eq. 2.4)

ℓ((x_1, y_1, f(x_1)), ..., (x_m, y_m, f(x_m))) = (1/m) Σ_{i=1}^m |y_i − f(x_i)|_ε, (2.108)

with the same regularizer Ω and constant function ψ_1 as for the SVM for binary classification.
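To see the representer theorem at work in the simplest possible setting, one can swap the soft margin loss for the squared loss: the regularized risk minimizer is then kernel ridge regression, whose expansion coefficients solve a linear system. This example is ours, not part of the thesis; data and hyperparameters are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(50, 1))
    y = np.sin(X).ravel() + rng.normal(scale=0.1, size=50)

    # Gaussian Gram matrix.
    K = np.exp(-((X - X.T) ** 2) / 2.0)

    # Minimizing (1/m) sum_i (y_i - f(x_i))^2 + lam * ||f||_H^2 over the
    # RKHS: by theorem 2.4.2, f(x) = sum_i alpha_i k(x_i, x), and the
    # optimal coefficients satisfy (K + lam * m * I) alpha = y.
    lam, m = 0.1, len(X)
    alpha = np.linalg.solve(K + lam * m * np.eye(m), y)

    f = lambda x: np.exp(-((X.ravel() - x) ** 2) / 2.0) @ alpha
    print(f(0.5), np.sin(0.5))  # the fit should roughly track sin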
2.5 Kernel Design
Kernel design deals with the problem of choosing an appropriate kernel for the task at hand, that is, a similarity measure on the data capable of best capturing the available information. We start by introducing a few basic kernels commonly used in practice, and show how to build complex kernels by combining simpler ones, which allows different parts or characteristics of the input data to be treated in different ways. We then introduce the notion of kernels on discrete structures, providing examples of kernels for strings, trees and graphs, and that of generative kernels, which aim at combining the expressive power of generative models with the separation capabilities of discriminative models. Finally, we discuss the problem of choosing kernel hyperparameters, a task which is part of the general problem of model selection. Unless otherwise specified, in the rest of the chapter we will refer to positive definite kernels simply as kernels.
2.5.1 Basic Kernels

A simple example of effective kernels commonly used in practice is that of the polynomial kernels, both homogeneous

k(x, x′) = < x, x′ >^d (2.109)

and inhomogeneous

k(x, x′) = (< x, x′ > + c)^d, (2.110)

with d ∈ IN and c ∈ IR⁺₀. They correspond respectively to the feature space of all possible monomials of degree d (eq. (2.109)), and that of all possible monomials of degree up to d (eq. (2.110)). Another interesting class of kernels is that of the radial basis function (RBF) ones, satisfying

k(x, x′) = f(d(x, x′)), (2.111)

where d is a metric on X and f a function on IR⁺₀. Gaussian kernels [BGV92, GBV93]

k(x, x′) = exp(−||x − x′||² / (2σ²)) (2.112)

with σ > 0 are a popular example of these translation invariant kernels. Recalling the decision function for support vector machines (eq. (2.36)), we can see that the smaller the variance σ², the fewer support vectors will be involved in the decision on a test pattern, namely those nearest to it according to the distance induced by the norm || · ||. Roughly speaking, a smaller variance usually corresponds to a greater number of support vectors, and thus to a more complex model. Tuning this hyperparameter according to the complexity of the problem at hand is another means to perform risk minimization and prevent overfitting (see section 2.5.5).
Finally, sigmoid kernels
k(x, x′) = tanh(κ < x, x′ > +ϑ), (2.113) where κ > 0 and ϑ > 0, have been successfully used in practice, even if they are not actually positive definite (see [Sch97] for details).
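For reference, the basic kernels of this section are one-liners; a sketch (the parameter defaults are arbitrary choices):

    import numpy as np

    def poly_homogeneous(x, z, d=3):
        # Homogeneous polynomial kernel, eq. (2.109).
        return np.dot(x, z) ** d

    def poly_inhomogeneous(x, z, d=3, c=1.0):
        # Inhomogeneous polynomial kernel, eq. (2.110).
        return (np.dot(x, z) + c) ** d

    def gaussian(x, z, sigma=1.0):
        # Gaussian RBF kernel, eq. (2.112).
        return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

    def sigmoid(x, z, kappa=1.0, theta=1.0):
        # Sigmoid kernel, eq. (2.113); not positive definite in general.
        return np.tanh(kappa * np.dot(x, z) + theta)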
2.5.2 Kernel Combination

The class of kernels has a few interesting closure properties useful for combination. It is closed under addition, multiplication by a positive constant and pointwise limits [BCR84]; that is, kernels form a convex cone. Moreover, it is closed under product [PS72]: if k_1(x, x′) and k_2(x, x′) are kernels, then k(x, x′) = k_1(x, x′)k_2(x, x′) is also a kernel. These properties remain valid when the two kernels are defined on different domains [Hau99]:
Proposition 2.5.1 (Direct Sum) If k_1 and k_2 are kernels defined respectively on X_1 × X_1 and X_2 × X_2, then their direct sum,

(k_1 ⊕ k_2)((x_1, x_2), (x′_1, x′_2)) = k_1(x_1, x′_1) + k_2(x_2, x′_2) (2.114)

is a kernel on (X_1 × X_2) × (X_1 × X_2), with x_1, x′_1 ∈ X_1 and x_2, x′_2 ∈ X_2.
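As an illustration, the direct sum is trivial to implement as a higher-order function; the tensor product of the next proposition is identical with the sum replaced by a product (a sketch; the component kernels chosen here are arbitrary):

    import numpy as np

    def direct_sum(k1, k2):
        # Proposition 2.5.1: a kernel on pairs (x1, x2), eq. (2.114).
        # For the tensor product of eq. (2.115), replace + by *.
        def k(xpair, zpair):
            (x1, x2), (z1, z2) = xpair, zpair
            return k1(x1, z1) + k2(x2, z2)
        return k

    # Hypothetical use: a linear kernel on one part of the input and a
    # Gaussian kernel on the other.
    k1 = lambda x, z: np.dot(x, z)
    k2 = lambda x, z: np.exp(-np.sum((x - z) ** 2) / 2.0)
    k = direct_sum(k1, k2)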
Proposition 2.5.2 (Tensor Product) If k_1 and k_2 are kernels defined respectively on X_1 × X_1 and X_2 × X_2, then their tensor product,

(k_1 ⊗ k_2)((x_1, x_2), (x′_1, x′_2)) = k_1(x_1, x′_1)k_2(x_2, x′_2) (2.115)

is a kernel on (X_1 × X_2) × (X_1 × X_2), with x_1, x′_1 ∈ X_1 and x_2, x′_2 ∈ X_2.

These combinations allow us to treat in a diverse way parts of an individual which have different meanings. A general extension of these concepts is at the basis of the so-called convolution kernels [Hau99, Wat00] for discrete structures. Suppose x ∈ X is a composite structure made of "parts" x_1, ..., x_D such that x_d ∈ X_d for all d ∈ [1, D]. This can be formally represented by a relation R on X_1 × ... × X_D × X such that R(x_1, ..., x_D, x) is true iff x_1, ..., x_D are the parts of x. For example, if X_1 = ... = X_D = X are sets containing all finite strings over a finite alphabet A, we can define a relation R(x_1, ..., x_D, x) which is true iff x = x_1 ∘ ... ∘ x_D, with ∘ denoting concatenation of strings. Note that in this example x can be decomposed in multiple ways. If their number is finite, the relation is said to be finite. Given a set of kernels k_d : X_d × X_d → IR, one for each of the parts of x, the R-convolution kernel is defined as
(k_1 ⋆ ... ⋆ k_D)(x, x′) = Σ_R Π_{d=1}^D k_d(x_d, x′_d), (2.116)
where the sum runs over all the possible decompositions of x and x′. For finite relations R, this can be shown to be a valid kernel [Hau99]. It’s easy to verify [Hau99] that Gaussian kernels are a particular case of R-convolution kernels, as well as simple exponential kernels
(k_1 ⋆ ... ⋆ k_D)(x, x′) = exp( Σ_{d=1}^D f_d(x_d)f_d(x′_d) / σ_d² ) = Π_{d=1}^D k_d(x_d, x′_d), (2.117)

where
k_d(x, x′) = exp( f_d(x)f_d(x′) / σ_d² ) (2.118)
and f_d : X_d → IR, σ_d > 0 for each d ∈ [1, D]. In both cases there is only one way to decompose each x. A more complex type of R-convolution kernel is the so-called analysis of variance (ANOVA) kernel [Wah90, SGV98, Vap98, SGV+99]. Let X = S^n for some set S, and k_i : S × S → IR, i ∈ [1, n], a set of kernels, which will typically be the same function. For D ∈ [1, n], the ANOVA kernel of order D is defined by

k(x, x′) = Σ_{1 ≤ i_1 < ... < i_D ≤ n} Π_{d=1}^D k_{i_d}(x_{i_d}, x′_{i_d}). (2.119)
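A direct, if naive, implementation of the ANOVA kernel enumerates the index tuples explicitly (a sketch; this is exponential in n and serves only to make the definition concrete, since efficient recursive evaluations exist):

    import numpy as np
    from itertools import combinations

    def anova_kernel(x, z, base, D):
        # Eq. (2.119): sum over all index tuples i_1 < ... < i_D of the
        # product of base kernels applied coordinate-wise.
        n = len(x)
        return sum(
            np.prod([base(x[i], z[i]) for i in idx])
            for idx in combinations(range(n), D)
        )

    # Example with the product base kernel k_i(a, b) = a * b on scalars.
    x, z = np.array([1.0, 2.0, 3.0]), np.array([0.5, 1.0, -1.0])
    print(anova_kernel(x, z, base=lambda a, b: a * b, D=2))  # -6.5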