Using Machine Learning to Determine Fold Class and Secondary Structure Content from Raman Optical Activity and Raman Vibrational Spectroscopy

A thesis submitted to the University of Manchester for the degree of MPhil in the Faculty of Life Sciences

2012

Myra Kinalwa-Nalule


Table of Contents

Table of Contents ...... 2

List of Appendices...... 6

List of Figures ...... 7

List of Tables ...... 10

List of Abbreviations ...... 12

Abstract ...... 13

Declaration ...... 15

Copyright Statement ...... 16

Acknowledgements ...... 17

1. Protein Structure ...... 18

1.1 Introduction ...... 18

1.2 Protein structure ...... 20

1.2.1 Primary, secondary and tertiary protein structure ...... 20

1.2.2 The helix structure ...... 23

1.2.3 The β-sheet structure ...... 24

1.2.4 Disordered structure ...... 25

1.2.5 Protein Motifs ...... 27

1.3 Classification of proteins and protein databases ...... 29

1.4 Determination of protein structure ...... 34

1.5 References ...... 37

2. Vibrational Spectroscopy ...... 42

2.1 Introduction ...... 42

2.2 Vibrational Energies ...... 42

2.3 Raman Spectroscopy ...... 43


2.3.1 The Raman Effect ...... 46

2.4 Raman Optical Activity (ROA) ...... 47

2.4.1 Basic ROA Theory ...... 49

2.5 Spectroscopic Structural Band Assignments ...... 52

2.6 Circular Dichroism ...... 53

2.7 Chemometrics analysis of vibrational spectroscopy ...... 56

2.8 References ...... 57

3. Machine Learning ...... 62

3.1 Introduction ...... 62

3.2 Support Vector Machines (SVM) Classification ...... 64

3.2.1 Soft Margin ...... 67

3.2.2 Kernel Methods ...... 68

3.3 SVM Regression ...... 72

3.4 Random Forests ...... 75

3.4.1 Introduction ...... 75

3.4.2 Gini Index ...... 76

3.4.3 Random Forest prediction ...... 77

3.4.4 Out-of-Bag (OOB) data ...... 77

3.4.5 Distance Metric ...... 77

3.5 Partial Least Squares (PLS) Regression ...... 79

3.5.1 Introduction ...... 79

3.5.2 The PLS Model ...... 80

3.5.3 PLS number of components ...... 82

3.6 Chemometric studies of protein vibrational spectroscopy ...... 82

3.7 Related Chemometrics Methods ...... 86

3.8 References...... 90


4 Data and Methods ...... 94

4.1 Datasets ...... 94

4.2 Dataset representation ...... 98

4.3 Data Processing ...... 99

4.3.1 Binning ...... 99

4.3.2 Range Selection ...... 100

4.3.3 Scaling ...... 100

4.4 Training the Models ...... 101

4.4.1 SVM Models ...... 101

4.4.2 PLS Models ...... 103

4.4.3 Random Forest Models ...... 105

4.5 References ...... 107

5. Support Vector Machine (SVM) Classification and Regression analyses of Raman and ROA spectra ...... 108

SVM Analyses of Raman and ROA ...... 108

5.1 Data Pre-processing ...... 109

5.1.1 Choosing the Bin Factor...... 109

5.2 Results and Discussion ...... 111

5.2.1 SVM Classification Analyses Results of ROA and Raman spectra...... 111

5.3 SVM Regression Analyses Results of Raman and ROA spectra ...... 111

5.3.1 SVM Regression Analyses of Raman and ROA ...... 115

5.3.2 Discussion...... 116

5.4 References ...... 119

6. Partial Least Squares (PLS) Regression Analyses of Raman and ROA spectra ...... 120


6.1 Methods ...... 120

6.2 Results and Discussion ...... 121

6.2.1 The Backbone regions ...... 122

6.2.2 2nd Derivative Raman spectra data ...... 123

6.2.3 Alternative helix structure assignment ...... 124

6.3 References ...... 128

7. Random Forest Cluster Analyses of ROA and Raman spectra ...... 129

7.1 Model Performance Analysis ...... 129

7.2 Results and Discussion ...... 131

7.2.1 ROA spectra Random Forest Analyses ...... 131

7.2.2 Raman spectra Random Forest Analyses ...... 135

7.2.3 Variable Importance ...... 140

7.3 References ...... 143

8. Conclusion ...... 146


List of Appendices

Appendices...... 148

Appendix A- Summary table of proteins used in the analyses with pdb code and SCOP class...... 148

Appendix B- The binAvg.pl script used to bin the data...... 150

Appendix C- The ranges.pl script used to select amide regions...... 156

Appendix D- The svmscale.pl used to scale the data...... 159

Appendix E- Graphs showing predictions of SVM classification models (positives are above zero mark and negatives are below the zero mark)...... 163

Appendix F- Graphs of SVM regression models showing correlation plots ...... 203

Appendix G- Graphs showing correlation of PLS regression models...... 225

Appendix H- Bar graphs of variable importance and mean decrease in accuracy for the Random Forest analyses...... 260

Appendix I- MDS plots for Random Forest analyses...... 277

Appendix J-Key to proteins used in ROA MDS Plot...... 279

Appendix K- Key to the proteins used in Raman MDS plots ...... 280

Appendix L-Published papers...... 281


List of Figures

Figure 1.1 The general structure of an amino acid……………………………………..……21

Figure 2.1 A. Raman spectra of human lactoferrin protein B. Second derivative Raman spectra of human lactoferrin protein ……..…………………………………………45

Figure 2.2 Energy level diagram for Raman scattering; (a) Stokes Raman scattering (b) anti-Stokes Raman scattering. The energy difference between the incident and scattered photons is represented by the arrows of different lengths. The population of vibrationally excited states is low under normal conditions, so the initial state of a scattering molecule is usually the ground state and the scattered photon has lower energy than the exciting photon. This Stokes-shifted scatter is what is usually observed in Raman spectroscopy. Raman scattering from vibrationally excited molecules leaves the molecules in the ground state. This anti-Stokes-shifted Raman spectrum is always weaker than the Stokes-shifted Raman spectrum…..…46

Figure 2.3 Raman Optical Activity (ROA) spectra of human lactoferrin protein ...... 48

Figure 2.4 Illustration of energy levels of the different forms of ROA as the molecule goes from the low energy ground state v0 to the higher energy state v1. Vi represents the higher energy vibration levels. The subscripts refer to the state of the scattered light whilst the superscripts refer to the state of the incident light. In A, ICP ROA, the incident laser is modulated between left and right circular polarization states and the differential Raman intensity is measured in an unpolarized state (I^R – I^L). In B, SCP ROA, fixed incident light is used and the difference between left and right polarized Raman scattered light is measured (I_R – I_L). In C, DCPI ROA, both the incident light and the Raman scattered light are modulated in-phase (I_R^R – I_L^L). In D, DCPII ROA, both the incident and the scattered light are modulated out-of-phase (I_L^R – I_R^L). The interchanged subscripts and superscripts in the differential Raman scattered intensity equations in C and D indicate the in-phase and out-of-phase modulations of light for DCPI ROA and DCPII ROA respectively (Nafie, 2011)...... 51

Figure 2.5 Optically active molecules absorb different amounts of right-(R) and left-(L) circularly polarized light. AL and AR are the absorbed left- and right- polarized light..…..54

Figure 3.1 An overview of the training procedure. The classifier is trained on training data to find the optimum parameters of the classifying model without overfitting. Overfitting makes the model a poor discriminator when used to predict on new data. (a) Training data and producing an overfitting model. (b) Using the overfitting model on the test data yields poor results. (c) Training the model on training data to improve performance. (d) Using the improved model on test data yields better generalisation results. Distinct marker symbols represent the test and training data. (Chih-Wei, 2008)…………………………....65

Figure 3.2 The SVM algorithm finds the largest distance between the hyperplanes, the margin. The symbol · denotes the dot product. The vector w is a normal vector: it is perpendicular to the hyperplane. The algorithm chooses w and b to maximise the distance between the parallel hyperplanes while separating the data as much as possible at the same time. (Schölkopf B., Oldenbourg R., 1997)……………………………………………………………………….67

Figure 3.3 The SVM architecture: the diagram shows the overall structure of SVMs. Similarity measures are applied to the input vector V in the form of kernel functions. These vary from dot product kernels for linear separation to Gaussian and sigmoid kernels for non-linear separation. Only the support vectors are included in the optimal classification solution. The weights λi are adjusted to find the optimal separating function f(x) (Schölkopf B., 2005)…………………………………………………………………………………………71

Figure 3.4 One-dimensional regression with an epsilon (ε) insensitive band. The band marks the margins where erroneously classified data points are allowed. The points marked ξ along the dashed lines represent the data points that fall outside the 'accepted' margin of error…74

Figure 3.5 Detailed picture of the epsilon band with slack variables and selected data. ε represents the epsilon-insensitive region; ξi and ξi* represent the data points outside the error margin. The data points outside the dotted lines have fallen outside the allowed error limits and are referred to as the slack variables. The terms wx+b+ε and wx+b−ε refer to the optimum margins of separation; wx+b refers to the hyperplane………………………….74

Figure 4.1 An illustration showing how the raw spectral data were encoded in the feature vectors used in the machine learning algorithms. The spectra are contained in text files (A). Each text file has rows of wavelength and spectral intensity pairs. Each file is read in by the binning script (Appendix B) and then the scaling script (Appendix D). The scaling script outputs each file to a single file (B) as a single row. These rows of spectral files (B) make up the input vectors for the machine learning algorithms...... 99

Figure 4.2 A figure of a Raman spectrum with the subdivisions of the spectral data used in the analyses marked. The analyses were performed on the following regions: A. the backbone stretch modes (850-1100 cm-1); B. amide III (1200-1340 cm-1); C. amide II (1540-1600 cm-1); D. amide I (1600-1700 cm-1)……………………………………………………………….100

Figure 4.3 Grid search plot for C = 2^-10…2^4 and γ = 2^-12…2^6. An initial search produces a coarse grid. A region with the highest accuracy is noted on the coarse grid and the C and γ ranges in this region are investigated further, creating a finer grid search, i.e. C = 2^-4…2^4 and γ = 2^-8…2^0. This grid plot was taken from the LIBSVM example data heart_scale (www.csie.ntu.edu.tw/~cjlin/libsvm)...... 100

Figure 7.1 Shepard plots of ROA (left) and Raman (right) for the different stress functions provided in the MATLAB software: Sammon mapping, metric stress and squared stress. The scatter for the squared stress is not close to the 1:1 line except for the very few points at the largest dissimilarity values. Of the three stress measures, Sammon mapping tends closest to the 1:1 line, which means the interpoint distances will be closely approximated from the original dissimilarities, therefore preserving them…………………………………………130


Figure 7.2 The figures at the top show the RF multidimensional scaling (MDS) plots used to visualise the clusters in the data from Raman (above right (B)) and ROA (above left (A)) whole spectra. The MDS clusters were calculated using the Sammon mapping stress function. ROA data showed clusters of the α-helical, the β-sheet and the α/β proteins. Raman data showed clusters of the α-helical and the other structural class proteins. The figures at the bottom show the plot of the Random Forest dissimilarities vs. the Euclidean distances to show how well the two distances correlated, with high Spearman's correlation values of 0.91 for Raman whole spectra (bottom right (D)) and 0.83 for ROA data (bottom left (C))………...... 131

Figure 7.3 Multidimensional Scaling Plot for ROA Amide I spectra………...…………….133

Figure 7.4 Multidimensional Scaling Plot for ROA Amide I & III spectra……...…………133

Figure 7.5 Multidimensional Scaling Plot for ROA Amide I & II& III spectra...... 134

Figure 7.6 Multidimensional Scaling Plot for ROA Amide I & II spectra...... 135

Figure 7.7 Multidimensional Scaling Plot for ROA Amide II spectra...... 135

Figure 7.8 Multidimensional Scaling Plot for ROA Amide III spectra...... 136

Figure 7. 9 Multidimensional Scaling Plot for ROA Amide II & III spectra...... 136

Figure 7.10 Multidimensional Scaling Plot for Raman Amide III spectra...... 138

Figure 7.11 Multidimensional Scaling Plot for Raman Amide I & II& III spectra...... 138

Figure 7.12 Multidimensional Scaling Plot for Raman Amide II & III spectra...... 139

Figure 7.13 Multidimensional Scaling Plot for Raman Amide II spectra...... 139


List of Tables

Table 2.1 Some of the secondary structure band assignments for ROA and Raman spectra...... 53

Table 4.1 Table of protein names, pdb codes (where available) and structural information. The analyses in which the proteins were used and the type of spectra are indicated by the "•" mark in the respective column...... 95

Table 5.1 Performance accuracies for SVM classification models Bins 10, 20 and 100 cm-1 for ROA data analyses...... 110

Table 5.2 Performance accuracies for SVM classification models Bins 10, 20 and 100 cm-1 for Raman data analyses...... 110

Table 5.3 SVM classification performance accuracies for amide regions and whole spectra of Raman data...... 112

Table 5.4 SVM classification performance accuracies for amide regions and whole spectra of ROA data...... 113

Table 5.5 ROA SVM regression performance accuracies for analyses using whole spectra and spectra from amide regions...... 116

Table 5.6 Raman SVM regression performance accuracies for analyses using whole spectra and spectra from amide regions...... 116

Table 6.1 ROA PLS regression statistics for analyses using whole spectra and spectra from amide regions...... 122

Table 6.2 Raman PLS regression statistics for analyses using whole spectra and spectra from amide regions...... 122

Table 6.3 PLS regression results on Raman spectra in the 850-1100 cm-1 backbone region...... 123

Table 6.4 PLS regression results on ROA spectra in the 850-1100 cm-1 backbone region...123

Table 6.5 PLS regression results on 2nd derivative Raman spectra. The performance results are compared to the results of analyses on whole Raman spectra without the 2nd derivative……………………………………………………………………………………124

Table 6.6 PLS regression performance statistics on combined helical content of α-helix and 3₁₀ helix...... 125

Table 6.7 Comparison of prediction accuracies for Raman amide II, III and I&II&III between α-helix content alone and α-helix content combined with 3₁₀ helix content...... 126


Table 7.1 Percentage of proteins whose classes were correctly predicted. Numbers in parentheses show the number of correctly predicted observations out of the total number of observations…………………………………………………………………………………130

Table 7.2 Most important bins for assigning fold class for Raman Random Forest analyses...... 141

Table 7.3 Most important bins for assigning fold class for ROA Random Forest analyses...... 141


List of Abbreviations

AFM  Atomic Force Microscopy
CASP  Critical Assessment of Methods of Protein Structure Prediction
CID  Circular Intensity Difference
CD  Circular Dichroism
DALI  Distance Matrix Alignment
DCPI ROA  In-phase dual circular polarization Raman Optical Activity
DCPII ROA  Out-of-phase dual circular polarization Raman Optical Activity
ε-SVM regression  epsilon Support Vector Machines regression
FRET  Fluorescence Resonance Energy Transfer
FSSP  Families of Structurally Similar Proteins
ICP ROA  Incident Circularly Polarized Raman Optical Activity
k-NN  k-Nearest Neighbour
NMR  Nuclear Magnetic Resonance spectroscopy
PCA  Principal Component Analysis
PDB  Protein Data Bank
PLS  Partial Least Squares
PLS-DA  Partial Least Squares-Discriminant Analysis
PSA test  Prostate Specific Antigen test
PPII  Polyproline helix II
RBF  Radial Basis Function
RF  Random Forests
ROA  Raman Optical Activity
SVM  Support Vector Machines
SCOP  Structural Classification of Proteins
SCP ROA  Scattered Circular Polarized Raman Optical Activity
VCD  Vibrational Circular Dichroism


Abstract

The objective of this project was to apply machine learning methods to determine protein secondary structure content and protein fold class from ROA and Raman vibrational spectral data. Raman and ROA are sensitive to biomolecular structure, with the bands of each spectrum corresponding to structural elements in proteins; when combined, they give a fingerprint of the protein. However, there are many bands about which little is known. There is a need, therefore, to find ways of extracting information from spectral bands and to investigate which regions of the spectra contain the most useful structural information.

Support Vector Machines (SVM) classification and Random Forests (RF) classification were used to mine protein fold class information, and Partial Least Squares (PLS) regression was used to determine the secondary structure content of proteins. The classification methods were used to group proteins into α-helix, β-sheet, α/β and disordered fold classes. The PLS regression was used to determine percentage protein structural content from Raman and ROA spectral data.

The analyses were performed on spectral bin widths of 10 cm-1 and on the spectral amide regions I, II and III. The full spectra and different combinations of the amide regions were also analysed. The SVM analyses, both classification and regression, generally did not perform well. SVM classification models, for example, had low Matthews Correlation Coefficient (MCC) values below 0.5, although this is still better than a value near zero, which would indicate prediction no better than random chance.

The SVM regression analyses also showed very poor performance, with average R2 values below 0.5. R2 is the Pearson correlation coefficient and shows how well predicted and observed structural content values correlate; an R2 value of 1 indicates a perfect correlation and therefore a good prediction model. The Partial Least Squares regression analyses yielded much improved results with very high accuracies. Analyses of the full spectrum and the spectral amide regions produced high R2 values of 0.8-0.9 for both ROA and Raman spectral data.

This high accuracy was also seen in the analysis of the 850-1100 cm-1 backbone region for both ROA and Raman spectra, which indicates that this region could make an important contribution to protein structure analysis. PLS regression analysis of 2nd derivative Raman spectra showed much improved performance, with high R2 values of 0.81-0.97.

The Random Forest algorithm used here for classification showed good performance. The 2-dimensional plots used to visualise the classification clusters showed clear clusters in some analyses; for example, tighter clustering was observed for the amide I, amide I & III and amide I & II & III spectral regions than for the amide II, amide III and amide II & III spectra. The Random Forest algorithm also determines variable importance, which showed which spectral bins were crucial in the classification decisions. The ROA Random Forest analyses generally performed better than the Raman Random Forest analyses: the highest percentage of correctly classified proteins was 75% for ROA, compared with 50% for Raman.

The analyses presented in this thesis have shown that Raman and ROA vibrational spectra contain information about protein secondary structure and that these data can be extracted using mathematical methods such as the machine learning techniques presented here. The machine learning methods applied in this project were used to mine information about protein secondary structure, and the work presented here demonstrates that these techniques are useful and could be powerful tools in the determination of protein structure from spectral data.


Declaration

No part of the work presented in this thesis has been submitted in support of an application for another degree at the University of Manchester or any other research or learning institute.

Parts of the work submitted in this thesis have been published in the sources listed below and the published papers are in Appendix L.

Accurate Determination of Protein Secondary Structure Content from Raman and Raman Optical Activity Spectra. Kinalwa M., Blanch E. W. and Doig A. J. Analytical Chemistry, 2010, 82, 6463-6471

Determination of Protein Fold Class from Raman or Raman Optical Activity Spectra using Random Forest. Kinalwa M., Blanch E. W. and Doig A. J. Protein Science, 2010, 20, 1668-1674


Copyright Statement

i. The author of this thesis (including any appendices and/or schedules to this thesis) owns any copyright in it (the "Copyright") and she has given The University of Manchester the right to use such Copyright for any administrative, promotional, educational and/or teaching purposes.

ii. Copies of this thesis, either in full or in extracts, may be made only in accordance with the regulations of the John Rylands University Library of Manchester. Details of these regulations may be obtained from the Librarian. This page must form part of any such copies made.

iii. The ownership of any patents, designs, trade marks and any and all other intellectual property rights except the Copyright (the “Intellectual Property Rights”) and any reproductions of copyright works, for example graphs and tables (“Reproductions”), which may be described in this thesis, may not be owned by the author and may be owned by third parties. Such Intellectual Property Rights and Reproductions cannot and must not be made available for use without the prior written permission of the owner(s) of the relevant Intellectual Property Rights and/or Reproductions.

iv. Further information on the conditions under which disclosure, publication and exploitation of this thesis, the Copyright and any Intellectual Property Rights and/or Reproductions described in it may take place is available from the Head of School of the Faculty of Life Sciences (or the Vice-President) and the Dean of the Faculty of Life Sciences, for Faculty of Life Sciences' candidates.


Acknowledgements

I would like, first of all, to thank my mother, Thecla Kinalwa, who has supported me throughout this degree and without whom this opportunity would not have been possible.

Secondly, I would like to thank my supervisors Dr. Ewan Blanch and Professor Andrew Doig, who have also been my teachers, critics and mentors throughout these four years and for whose support I will eternally be grateful.

Special thanks also go to Dr. Jo Avis, my advisor, for her advice and whose support I knew I could always count on.

Last but not least, I would like to extend my gratitude to Professor L. D. Barron and Dr. L. Hecht at the Department of Chemistry of the University of Glasgow for provision of Raman and ROA spectra.


1. Protein Structure

1.1 Introduction

The main objectives of the project were to apply machine learning analytical techniques to determine protein structural content and protein fold class from Raman scattering and Raman Optical Activity (ROA) vibrational spectra. Vibrational spectroscopy is an umbrella term for a group of methods that can be used to access protein structural information. ROA and Raman spectroscopy are discussed in Chapter 2. The analyses in this project were performed on spectra from Raman scattering and Raman Optical Activity. Raman and ROA spectroscopies are complementary to each other. Raman spectroscopy is an insightful probe into the side chains and structure of proteins. ROA is sensitive to the most chiral, and rigid, parts of the protein. It is for this reason that ROA spectra are dominated by bands from the peptide backbone and so give more direct information about secondary and tertiary structure (Barron et al., 1992, 2003). Generally, in Raman and ROA vibrational spectroscopies, laser light irradiates the sample and photons are scattered. When this happens, there is a transfer of energy from the light photons to the molecule, causing vibrational excitation. This results in Stokes scattering, where the scattered light is at a lower frequency than the incident light. It is the intensities and the frequencies of the scattered light that are measured to give information about the structure of the protein.

Raman and ROA are sensitive to biomolecular structure, with the bands of each spectrum corresponding to structural elements in proteins; when combined, these give a spectral fingerprint of the protein. However, there are many bands about which little is known. There is also the problem of analysing overlapping bands, a common occurrence in vibrational spectra. The commonly used method for analysing overlapping bands is curve fitting, which finds the best fit of component bands within a band envelope from a number of possible solutions. However, the solution depends on the given number of bands and band shapes, and may not be the best fit over each component band (Kaoshi and Ozaki, 2003). There is a need, therefore, to find ways of extracting information from spectral bands and to investigate which regions of the spectra contain the most useful structural information. This project investigated a range of computational methods, such as Support Vector Machines (SVMs), regression and decision trees, to derive information about protein structure from Raman and ROA spectra. This involved analysing how much information about the structure of proteins could be extracted from spectral data by each of these methods, and examining how much information these methods can obtain about the structural hierarchy of proteins.

Whilst vibrational spectroscopic methods do not usually provide information at atomic resolution, there is no size limit to the proteins that can be studied, and molecules can be investigated under more physiologically relevant conditions, i.e. in solution. This is vital, as protein structure can then be studied in a state that is as close as possible to the natural state of the biomolecules in vivo. In addition, vibrational spectra can be used to study proteins whose structures cannot be resolved by more common methods. For instance, many proteins are not easily crystallisable and therefore cannot have their structures solved by X-ray crystallography. This project aimed to use machine learning methods to mine for structural information, which could be useful in the development of new techniques to determine protein structure from spectral data.


1.2 Protein structure

The 3-dimensional structure of a protein is usually a requirement for protein function. A protein exhibits four hierarchical levels of structure: primary, secondary, tertiary and quaternary. The genetically determined linear composition of amino acid units linked by peptide bonds is the primary structure. The secondary structure is formed by amino acid residues linked by hydrogen bonds and constitutes mainly α-helices and β-sheets. Other elements of secondary structure include β-turns and unordered structure. β-turns are sharp turns that connect the adjacent strands in an anti-parallel β-sheet; however, they can also be found elsewhere. Unordered structure is generally a catch-all term for regions that do not fall into one of the other categories. Tertiary structure arises when various elements of secondary structure pack tightly together to form the well-defined 3-dimensional structure, which is held together by favourable interactions between the side chains. Quaternary structure is the association of tertiary subunits to form a protein with multiple components. Vibrational spectroscopy produces spectral bands that are sensitive to the different structural elements of proteins. Understanding protein structure is very important because the behaviour and function of a protein are intrinsically linked to its structure, and so studying structure is an important step towards understanding biological processes.

1.2.1 Primary, secondary and tertiary protein structure

The backbone of a polypeptide chain comprises the amide nitrogen (NH), the alpha carbon (Cα) and the carbonyl carbon (CO) contributed by each amino acid unit (Figure 1.1). The side chains comprise the "R" groups and are bonded to the backbone. Proteins are uniquely identified by the number of amino acids they contain and the sequential order of those amino acids. The primary structure is therefore a linear composition of amino acid units linked by peptide bonds. The peptide bond is formed by a condensation reaction between the NH of one amino acid residue and the CO of the neighbouring amino acid. Electron delocalization across the carbonyl group gives the peptide bond partial double-bond character, which makes the bond more rigid and stronger. This also means that the three non-hydrogen atoms that participate in the bond (the carbonyl oxygen O, the carbonyl carbon C, and the amide nitrogen N) are co-planar and that free rotation about the bond is limited. The only degrees of freedom for the peptide units are the rotations about the Cα-C and N-Cα bonds on either side of the peptide bond. The angle of rotation around the Cα-C bond is called the psi (ψ) torsion angle and the angle around the N-Cα bond is called the phi (φ) torsion angle (Figure 1.1). As these are the only degrees of freedom, the number of conformations the polypeptide can adopt is restricted. The Ramachandran plot, developed by Ramachandran et al. (1963), is used to show the possible stable conformations, which exclude those with steric clashes, in protein structural elements.

[Figure: structural diagram of an amino acid, showing the central alpha carbon bonded to an amino group (H-N), a carboxylic acid group (C=O, OH), a hydrogen atom and a side chain R group, with the phi (φ) and psi (ψ) rotation angles marked.]

Figure 1.1 The general structure of an amino acid.
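The φ and ψ torsion angles defined above are dihedral angles determined by four consecutive backbone atom positions. As an illustrative sketch (not part of the thesis itself, and using hypothetical coordinates rather than real protein atoms), such an angle can be computed as follows:

```python
import numpy as np

def dihedral(p0, p1, p2, p3):
    """Signed torsion angle in degrees defined by four atom positions.

    For phi the four atoms are C(i-1), N(i), CA(i), C(i);
    for psi they are N(i), CA(i), C(i), N(i+1).
    """
    p0, p1, p2, p3 = (np.asarray(p, dtype=float) for p in (p0, p1, p2, p3))
    b1, b2, b3 = p1 - p0, p2 - p1, p3 - p2
    # Normals to the planes spanned by (b1, b2) and (b2, b3)
    n1, n2 = np.cross(b1, b2), np.cross(b2, b3)
    # Auxiliary vector perpendicular to n1, in the plane normal to b2
    m1 = np.cross(n1, b2 / np.linalg.norm(b2))
    # Signed angle between the two planes, measured around b2
    return np.degrees(np.arctan2(np.dot(m1, n2), np.dot(n1, n2)))

# Eclipsed (cis-like) arrangement -> 0 degrees; right-angle twist -> 90 degrees
print(dihedral((1, 0, 0), (0, 0, 0), (0, 1, 0), (1, 1, 0)))  # 0.0
print(dihedral((1, 0, 0), (0, 0, 0), (0, 1, 0), (0, 1, 1)))  # 90.0
```

Scanning φ and ψ over a protein's backbone in this way is exactly what produces the Ramachandran plot mentioned above.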


The amide bonds are covalent bonds that hold proteins together. Many proteins also form disulphide bridges between the side chains of cysteine residues. The rest of the stabilizing forces in a protein are provided by noncovalent interactions for example van der Waals interactions and hydrogen bonds.

Van der Waals forces are weak interactions between groups of atoms arising from fluctuations in the electron distributions around the nuclei of the atoms. The fluctuating electron cloud around one atomic nucleus induces an opposite fluctuating dipole in a non-bonded neighbouring atom. Van der Waals interactions diminish as the interacting species move further apart, so only atoms within about 5 Å of each other can participate in such interactions. Van der Waals interactions are weak but collectively make a significant energetic contribution to holding the protein structure together (Petsko and Ringe, 2004).

Hydrogen bonds are formed when a hydrogen atom acquires a partial positive charge because it is covalently bonded to a more electronegative atom (the donor atom) and is attracted to a neighbouring atom with a partial negative charge (the acceptor atom). The charge delocalization creates a local dipole moment that leads to hydrogen bonding. This non-covalent bond between the partially charged atoms draws the non-hydrogen atoms closer together.

Most known proteins fold up into globular structures, with hydrophobic residues tightly packed into a core away from contact with water, creating the tertiary structure (Branden and Tooze, 1999). There are three common and well recognised types of secondary structure element, namely α-helices, β-sheets and β-turns. Secondary structure contributes to the overall stability of the protein. Helices and β-sheets contain extensive networks of hydrogen bonds, and the hydrogen bonding in these elements provides much of the stabilization that allows the formation of the hydrophobic core (Branden and Tooze, 1999). The structural assembly of more than one tertiary structural subunit forms the quaternary structure of proteins.


1.2.2 The helix structure

Alpha helices are formed when a stretch of consecutive residues have φ, ψ angles of approximately -60˚ and -50˚ respectively. In an α-helix, the carbonyl oxygen of residue i forms a hydrogen bond with the amide nitrogen four residues further along (i+4). All of the NH and CO groups are therefore joined by hydrogen bonds, apart from the first NH groups and the last CO groups at the ends of the α-helix. This bonding results in a helical structure with 3.6 residues per turn and side chains protruding on the outside.

In globular proteins, alpha helices vary in length from five residues to over forty residues. The average length is about ten residues (Petsko and Ringe, 2004). With a rise of 1.5 Å per residue along the helical axis, an average helix is therefore about 15 Å in length.
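The helix dimensions quoted above (1.5 Å rise per residue, 3.6 residues per turn) make the length and turn count a simple product; a minimal sketch of this arithmetic:

```python
# Approximate alpha-helix dimensions from the standard parameters
# quoted in the text: 1.5 A rise per residue, 3.6 residues per turn.
RISE_PER_RESIDUE = 1.5   # Angstroms along the helical axis
RESIDUES_PER_TURN = 3.6

def helix_length(n_residues):
    """Length of an alpha-helix along its axis, in Angstroms."""
    return n_residues * RISE_PER_RESIDUE

def helix_turns(n_residues):
    """Number of complete turns in the helix."""
    return n_residues / RESIDUES_PER_TURN

# The average ten-residue helix is 15 A long, or about 2.8 turns.
print(helix_length(10))           # 15.0
print(round(helix_turns(10), 1))  # 2.8
```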

Helical structures can be either right-handed or left-handed. The left-handed α-helix is not allowed because of steric restrictions on the conformation of L-amino acids; hence, the α-helix observed in proteins is always right-handed. With 3.6 residues per turn, there is a tendency for amino acids to alternate between hydrophobic and hydrophilic every three to four residues. This creates an α-helix with a hydrophobic face and a hydrophilic face; such a helix is called an amphipathic helix. The most common location of this kind of α-helix is along the outside of a protein, with one side facing the interior hydrophobic core and the other facing the solvent (Drin et al., 2007; Hristova et al., 1999; Segrest et al., 1990, 1999).

There are variations of the α-helix with hydrogen bonds to residues i+5 or i+3, instead of the standard i+4, called the π helix and the 3₁₀ helix, respectively. The 3₁₀ helix has 3 residues per turn while the π helix has 4.4 residues per turn. Both occur rarely in proteins, and usually at the ends of alpha helices. They are not energetically favourable, since the backbone atoms are too tightly packed in the 3₁₀ helix and too loosely packed in the π helix (Petsko and Ringe, 2004).
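The three helix families described above differ only in their hydrogen-bond offset and residues per turn; a small lookup table (values taken directly from the text) makes the comparison explicit:

```python
# Helix families distinguished by the residue offset k of the
# carbonyl(i) -> amide-N(i+k) hydrogen bond (values from the text).
HELIX_TYPES = {
    "3_10 helix":  {"hbond_offset": 3, "residues_per_turn": 3.0},
    "alpha helix": {"hbond_offset": 4, "residues_per_turn": 3.6},
    "pi helix":    {"hbond_offset": 5, "residues_per_turn": 4.4},
}

def helix_from_offset(k):
    """Return the helix type whose i -> i+k hydrogen bond matches k."""
    for name, params in HELIX_TYPES.items():
        if params["hbond_offset"] == k:
            return name
    return None

print(helix_from_offset(4))  # alpha helix
print(helix_from_offset(3))  # 3_10 helix
```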

1.2.3 The β-sheet structure

The second major structural element found in globular proteins is the β-sheet. In proteins, β-strands that are widely separated in the protein sequence may be aligned adjacent to each other such that hydrogen bonds form between them. The hydrogen bonds are formed between the CO groups of one β-strand and the NH groups of an adjacent β-strand. β-sheets are either parallel, with the strands all running in the same direction from amino to carboxyl terminus, or anti-parallel, in which each successive strand runs in the opposing direction. β-sheets can also be mixed, containing both parallel and anti-parallel strands (Lesk, 2001). Each of the two forms has a distinctive pattern of hydrogen bonding. Anti-parallel sheets have narrowly spaced hydrogen bonds that alternate with widely spaced ones, while parallel β-sheets have evenly spaced hydrogen bonds. Within both types, all possible hydrogen bonds are formed, except on the two flanking β-strands of the sheet, which each have only one neighbouring β-strand. The edge strands may make hydrogen bonds with water if they are exposed, or pack against polar side chains. The sheet may also curve round on itself to form a barrel (referred to as a β-barrel structure), with the two edge strands hydrogen bonding to one another to complete the closed cylinder (Hames and Hooper, 2000).

Antiparallel sheet strands are commonly connected by β-turns, while parallel sheet strands are commonly connected by an α-helix that packs against a face of the β-sheet (Petsko and Ringe, 2004; Branden and Tooze, 1999).


The simplest β-structure secondary element is the β-turn, which involves four residues. A β-turn is formed when the carbonyl oxygen of one residue (i) makes a hydrogen bond to the amide hydrogen of residue i+3, reversing the chain direction. This secondary structure element is also called a reverse turn or a hairpin turn. β-turns are usually found on the surfaces of folded proteins, where they are in contact with the aqueous environment. By reversing the direction of the chain they limit the size of the molecule, making it more compact. This motif is very frequent in β-sheets between two antiparallel β-strands (Finkelstein and Ptitsyn, 2002).

1.2.4 Disordered structure

Traditionally, it had been widely accepted that disordered proteins would be degraded by proteases in vivo. However, Dunker et al. (2002) highlighted the existence of intrinsic disorder in vivo and the mechanisms that protect disordered proteins from being eliminated. Disordered regions are often shielded from proteases by steric configurations that render them inaccessible; other disordered regions contain residues that are resistant to proteolytic degradation; and certain disordered regions exist only transiently, constantly changing binding partners. Comparison of evolutionary rates between ordered and disordered regions within the same protein showed faster rates of evolution in the disordered regions, while residues within the functional regions were highly conserved. The faster rates of evolution suggest that the disordered regions lack stable side chain interactions, and this has been cited as evidence for the existence of disorder in vivo by Dunker et al. (2002).

The same group also found that disordered proteins were involved in molecular recognition in protein binding interactions and that they are also found at protein modification sites.

Disordered regions are known to be functionally important and varied in structure. Uversky (2002) suggested the existence of the pre-molten globule state, in which the secondary structure of the protein molecule is less compact than in the native or molten globule state. The molten globule is described by Finkelstein and Ptitsyn (2002) as an intermediate of folding and unfolding, less compact than the native protein, with secondary structure that is nearly as developed as that of the native protein but with very little ordering of its side chains. Neural networks were used by Romero et al. (1997) to predict disordered regions, achieving 70% prediction accuracy on disordered regions of short, medium and long length and showing that these regions were compositionally different from each other. Radivojac et al. (2007) revealed that disordered proteins could be distinctly identified from ordered proteins by bioinformatics analysis, using the distinct sequences found in the disordered regions and their preferences for specific amino acid compositions. Intrinsically disordered proteins have been shown to play important functional roles in regulation, cell signalling and control pathways (Uversky et al., 2005; Iakoucheva et al., 2002; Wright et al., 1999). NACP, a protein reported to confer protection against the Alzheimer's apoE4 allele, has been shown to be natively unfolded by Weinreb et al. (1996). Intrinsically disordered proteins were grouped by Dunker et al. (2002), according to their functions, into molecular recognition, molecular assembly, protein modification and entropic chain activities. This recent research into the disordered structure of proteins has challenged the dominant theory that the three-dimensional structure of a protein is the requirement for the protein to carry out its functions. Like the three-dimensional protein structure, the disordered structure appears to play an important functional role as well.

Another related secondary structure is the polyproline II helix (PPII) structure. The polyproline II helix has φ and ψ angles of approximately -75˚ and +145˚ respectively on the Ramachandran plot (Stapley and Creamer, 1999). The PPII structure tends to occur on the surface of the protein molecule, where it can form hydrogen bonds with the solvent but forms few main-chain hydrogen bonds with the rest of the protein (Adzhubei and Sternberg, 1993). Deber and Rath (2005) suggested that PPII plays a role in protein recognition functions and in protein unfolded states.
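The canonical (φ, ψ) values quoted in this chapter (about -60˚, -50˚ for the α-helix and -75˚, +145˚ for PPII) define regions of the Ramachandran plot; a toy nearest-region assignment illustrates the idea (the 30° tolerance is an illustrative choice, not a published cutoff, and this is not a real secondary-structure assignment method):

```python
import math

# Canonical (phi, psi) angles in degrees, as quoted in the text.
REGIONS = {
    "alpha-helix": (-60.0, -50.0),
    "PPII":        (-75.0, 145.0),
}

def classify_phi_psi(phi, psi, tolerance=30.0):
    """Assign (phi, psi) to the nearest canonical region, if within tolerance."""
    best, best_dist = None, float("inf")
    for name, (p0, s0) in REGIONS.items():
        dist = math.hypot(phi - p0, psi - s0)
        if dist < best_dist:
            best, best_dist = name, dist
    return best if best_dist <= tolerance else "other"

print(classify_phi_psi(-57, -47))   # alpha-helix
print(classify_phi_psi(-70, 150))   # PPII
print(classify_phi_psi(-120, 130))  # other
```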

1.2.5 Protein Motifs

Secondary structure elements are connected by segments such as reverse turns to form structural motifs. Here, the term motif refers to a set of contiguous secondary structure elements that either have a functional significance or define part of a domain. The simplest motif consists of two α-helices joined by a loop region; this helix-turn-helix motif is specific for DNA and calcium binding (Branden and Tooze, 1999). Several motifs associated with specific functions are discussed below.

EF hand motif

The EF hand motif is formed by two α-helices connected by a short loop and is associated with calcium binding in domains that regulate cellular activity, for example in the protein calmodulin (Taylor et al., 1991) and the muscle protein parvalbumin (Moews and Kretsinger, 1975). The loop region provides the right conformation to bind the calcium ion.

The EF motif is also found in DNA binding proteins, for example catabolite activator protein (CAP), which binds to DNA to activate transcription (Branden and Tooze, 1999).

The helix-turn-helix motif provides a scaffold that holds the substrate in the proper position to bind and release calcium (Branden and Tooze, 1999).

The β-α-β motif

The β-α-β motif is frequently used to connect two parallel β-strands and is found as part of almost every protein structure that has a parallel β-sheet. The α-helix in the β-α-β motif connects the carboxyl end of one β-strand with the amino end of the next, and is usually oriented such that the helical axis is approximately parallel to the β-strands, with the α-helix packing against them. The β-strands and the α-helix are connected by loop regions. The loop that connects the carboxyl end of the first β-strand to the amino end of the α-helix is often involved in the active site of a protein (Argos et al., 1997).

α/β structures are built up from the β-α-β motif (Reardon and Farber, 1995; Petsko and Farber, 1990; Farber, 1993). There are three main classes of α/β proteins. In the first class, the core is made up of parallel β-strands and the α-helices that connect them lie on the outside of the barrel, forming the TIM barrel structure (Banner et al., 1975; Lesk, 2001). The second class consists of an open β-sheet with the parallel strands laid out adjacent to each other, surrounded by α-helices on both sides; this fold is often referred to as the Rossmann fold. The third class is formed from amino acid sequences with repetitive leucine-rich regions along the sequence. These leucine-rich regions form α-helices and β-strands; the β-strands form an approximately semi-oval, curved parallel sheet with all the α-helices outside the curved face (Branden and Tooze, 1999; Ptitsyn and Finkelstein, 2002).

Several secondary structural elements and motifs in a single polypeptide chain combine to form one or several compact domains. This constitutes the tertiary structure of the protein.

The fundamental unit of the tertiary structure is a domain. A domain is defined as a polypeptide chain or a part of a polypeptide chain that can fold independently into a stable tertiary structure. The α-helices and β-strands are adjacent to each other in the three dimensional globular structure and are connected by loop regions (Thornton et.al., 1988).

Efimov (1993) divides loop regions into four types classified according to the secondary structures they connect: αα, αβ, βα and ββ. These can be further subdivided according to the positions adopted by the α-helices and β-strands and their overall structure. The number of combinations of different motifs in proteins is quite limited, suggesting that some combinations are structurally favoured (Petsko and Ringe, 2004). It is because of these structurally favoured motifs that similar domain structures occur in different proteins with different functions and different amino acid sequences.

The explosion in the amount of sequence data and the number of proteins whose structures have been resolved has led to the need to store these data. Numerous protein databases have been established with different techniques to group proteins into structural classes as a way of understanding the structure and functions of proteins.

1.3 Classification of proteins and protein databases

Understanding of protein structure can be improved by comparing new structures to similar known folds in order to assign new fold families. Amino acid sequences with high similarity imply homology and an evolutionary relationship, and this can be extended to similar three-dimensional folds (Holm et al., 1992; Orengo et al., 1993; Holm and Sander, 1994; Pascarella and Argos, 1992). However, for proteins that lack significant sequence similarity, having a similar fold does not always imply an evolutionary relationship (Orengo et al., 1994; Thornton et al., 1999; Doolittle, 1994). Sequence motifs or fingerprints, which characterise short regions of amino acids related to functional or structural constraints, are also used in fold comparison databases to identify new folds. Another method is the comparison of three-dimensional models to sequences whose structure is unknown. Protein repositories allow new experimental structures to be identified by the comparative methods, developed by different groups, discussed below.


CATH

The CATH classification uses sequence alignments of the resolved structures (Orengo et al., 1997; 2002a; 2002b) in the Protein Data Bank (PDB) (Holm et al., 1992). The four main levels of classification used are Class, Architecture, Topology and Homologous superfamily. Global similarities suggest evolutionary relationships, and structural similarity is applied in the assignment of families where an evolutionary relationship is lacking (Greene et al., 2007; Pearl et al., 2002; Pearl et al., 2003). Multidomain proteins are broken down into their constituent domain folds. Sequence alignment groups sequences into protein classes based on secondary structure composition, i.e. the amounts of α, β and α-β structure. Architecture defines the shape of the protein by the assembly of the secondary structure elements, without taking their connections into consideration; the architecture is assigned manually with reference to the literature. The topology level groups proteins assembled with the same arrangement and number of structure elements as well as the same connecting links. At the homologous superfamily level, proteins are clustered by structural and functional similarity. CATH also classifies proteins into sequence families by grouping proteins with >35% sequence identity.
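The >35% identity grouping used by CATH (and the 30% family threshold used by SCOP, discussed below) rests on pairwise sequence identity. A minimal sketch over a pre-computed alignment; the gap handling and the helper name `same_cath_sequence_family` are illustrative assumptions, not CATH's actual procedure:

```python
def sequence_identity(aln1, aln2):
    """Percent identity between two aligned sequences of equal length.
    Positions where either sequence has a gap ('-') are skipped."""
    assert len(aln1) == len(aln2), "sequences must be aligned"
    pairs = [(a, b) for a, b in zip(aln1, aln2) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)

def same_cath_sequence_family(aln1, aln2, threshold=35.0):
    """Crude check against the >35% identity criterion quoted above."""
    return sequence_identity(aln1, aln2) > threshold

# Toy example: 5 of 6 ungapped positions match.
print(round(sequence_identity("MKT-AVL", "MKTQAVF"), 1))  # 83.3
```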

PFAM

The PFAM database uses protein family profiles, built from a sequence set representative of a protein family using the hidden Markov model algorithm, to search sequence databases for homologues. Sequences that align against the profile with a score above a given threshold are added to the full alignment representing the protein family. The database is based on UniProt and the recent release contains 12,273 families (Finn et al., 2010). The FSSP database (families of structurally similar proteins; Holm and Sander, 1996) is a collection of 2,860 entries of three-dimensional fold classifications based on structural alignments of known protein chains from the PDB. The sequences are aligned to a family of representatives using the DALI (distance matrix alignment) algorithm, with a cut-off threshold of less than 30% sequence identity. The DALI algorithm uses the three-dimensional coordinates of each protein to calculate inter-residue distance matrices between structure elements of the proteins and so make alignments. The resulting alignment of sequence families is then hierarchically clustered, grouping proteins with strong sequence similarity into fold classes, with those of similar topology but low evolutionary similarity in the lower levels of the tree (Holm, 2010).
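The inter-residue distance matrix at the heart of DALI is straightforward to compute from atomic coordinates. A minimal sketch of just this step (the Cα coordinates below are hypothetical, and this is not the full DALI alignment procedure):

```python
import math

def distance_matrix(coords):
    """Pairwise inter-residue distance matrix from a list of (x, y, z)
    C-alpha coordinates, the core quantity in distance-matrix alignment."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)]
            for i in range(n)]

# Hypothetical C-alpha positions (Angstroms) for a 4-residue fragment,
# spaced at the typical ~3.8 A between consecutive residues.
ca = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]
D = distance_matrix(ca)
print(round(D[0][1], 1))  # 3.8
print(round(D[0][2], 2))  # 5.37
```

DALI compares such matrices between two proteins to find equivalent residue pairs, rather than comparing the coordinates directly, which makes the comparison independent of orientation.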

Structural Classification of Proteins (SCOP) database

SCOP (Structural Classification of Proteins; Murzin et al., 1995) is a protein structural database that was built to facilitate the understanding of, and provide access to, the growing amount of information on resolved protein structures. It provides an extensive description of the structural and evolutionary relationships of proteins whose three-dimensional structures have been solved. The classification of protein structures in the database is based on evolutionary relationships and on the principles that govern their three-dimensional structure. It includes proteins that are in the current PDB; as of 2011, there are 74,140 PDB entries in the database. SCOP uses visual inspection together with automatic tools to speed up the process of classification. The unit of classification is the domain: small proteins are treated as a whole, while domains in large proteins are classified individually.

The classification is based on hierarchical levels that embody the evolutionary and structural relationships. Proteins are clustered together into families on the basis of one of two criteria that imply a common evolutionary origin: residue identities of 30% and greater, or lower sequence identities (for example, 15% or higher) where functions and structures are very similar. Proteins classed under one superfamily are those with low sequence identity but whose structures, and in many cases functional features, suggest a common evolutionary origin. Superfamilies and families share a common fold when they have the same secondary structures in the same arrangement with the same topological connections. SCOP uses unique identifiers to represent each entry in its hierarchical levels and is also integrated with data about protein families from other databases such as Pfam, CATH and InterPro (Conte et al., 2002; Andreeva et al., 2004).

Folds in the SCOP database are grouped into five classes: all α, proteins whose structures are essentially formed by α-helices; all β, proteins whose structures are essentially formed by β-sheets; α + β, proteins in which the α-helices and β-strands are largely segregated; α/β, proteins in which the α-helices and β-strands are largely interspersed; and multi-domain, for proteins with domains of different folds and for which there are no known homologues (Murzin et al., 1995).

The SCOP database was created as a tool for understanding evolutionary relationships based on the structures of proteins. The classifications also show how diverse protein structures are.

Understanding protein structure is very important because the behaviour of a protein is determined by its structure, so studying structure can help us learn more about protein function and malfunction. Hegyi and Gerstein (1999) investigated the relationship between the structure and function of proteins by comparing functionally characterized enzymes in SwissProt with structurally characterised domains in SCOP. This study showed that α/β folds tend to be enzymes while all-α proteins tend to be non-enzymes.

These repositories facilitate the progress of structural biology making information on structures of proteins and complex assemblies accessible. Various methods have been used to study and determine protein structures.


1.4 Determination of protein structure

Different methods have been developed and used by different groups to predict or solve the structures of proteins. Commonly used structural methods like X-ray crystallography and NMR are not applicable to all proteins, so it is important to develop other resources to study protein structure. The methods used range from laboratory-based to computational approaches, and this variety allows a wider number of structures to be studied: for example, X-ray crystallography requires the protein under investigation to be crystallized, and where this is not achievable other methods can be applied.

X-ray crystallography is the most widely used technique for structural studies and is the source of most entries in the Protein Data Bank (PDB) (Sussman et al., 1998). In X-ray crystallography an X-ray beam irradiates the protein crystal and the regular lattice spacing of the crystal diffracts the X-rays. The diffracted X-rays produce reflections at specific points on a detector, and the pattern of diffracted spots is then used to calculate the three-dimensional structure of the protein (Rupp, 2009). X-ray crystallography provides a high level of structural detail and can be used on large proteins of at least 150,000 subunit molecular weight. However, it only works with crystals, whose production is a complex process (Weber, 1997; Rhodes, 2006), and there is also the phase problem, where the phase of the X-ray wave is lost in the measurement. Another problem with using X-rays is the damage that can be caused to biomolecules by free radicals released by the excitation of atoms, which can cause denaturation and cleavage of polypeptides.

Nuclear magnetic resonance (NMR) spectroscopy is also widely used to investigate the structures of proteins in solution (Blumich, 1994). It is non-invasive, but it has an upper size limit of ~60 kDa and it requires labelling with isotopes such as ²H, ¹³C and ¹⁵N, which have a magnetic moment (Sattler and Fesik, 1996). NMR involves placing active nuclei in a magnetic field, which causes them to behave like bar magnets: the nuclei align along or against the direction of the magnetic field, resonating between their spin field and the external field. The resonance frequency depends on the nature of the nucleus, the strength of the external magnetic field and the local environment of the atom. Protons in different chemical environments within a molecule give rise to different chemical shifts; the chemical shift is measured as a change in frequency relative to a reference frequency. Each proton's identity can be assigned from its respective resonance frequency.
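The chemical shift described above is conventionally reported in parts per million relative to the reference frequency. A minimal sketch of this calculation (the frequencies used are illustrative):

```python
def chemical_shift_ppm(freq_hz, ref_freq_hz):
    """Chemical shift delta in ppm: (nu - nu_ref) / nu_ref * 1e6."""
    return (freq_hz - ref_freq_hz) / ref_freq_hz * 1e6

# Illustrative: a proton resonating 2000 Hz above a 500 MHz reference
# has a chemical shift of 4 ppm.
print(round(chemical_shift_ppm(500_002_000, 500_000_000), 3))  # 4.0
```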

The properties of fluorophores, for example tyrosine, are also used to study proteins. Fluorescence Resonance Energy Transfer (FRET), for example, can be used to study conformational changes and protein–protein interactions using fluorophore-tagged proteins (Periasamy and Day, 2005). FRET involves the transfer of energy between two fluorophores, provided they have overlapping absorption and fluorescence profiles.

Another method, which can be used with crystalline and non-crystalline samples, is electron microscopy (Goodhew et al., 2001). Electrons at high velocity in a vacuum are focused onto a specimen using magnets. The electrons are scattered by the electron clouds around the atoms of the molecules. Electron microscopy can be used with very large complexes, it does not necessarily need crystals and there is no phase problem. However, the bombarding electrons and the vacuum can destroy the samples.

Atomic Force Microscopy (AFM) is another method used to study the structure of proteins (Morris et al., 1999; Jena et al., 2002). In AFM the protein is bound to a tip at the end of a cantilever that is moved over the surface of the specimen while measuring the forces of interaction holding the protein structure. A picture of the sample can be built by mapping the interaction between the probe tip and the sample, as in AFM imaging. One disadvantage of the AFM method is that the tip can easily destroy the sample. Vibrational spectroscopies, like the Raman and ROA techniques discussed in chapter 2, are non-invasive and are applicable to a wide range of proteins and experimental conditions.

Inference of homology from sequence similarity has become routine in protein structural studies. When two protein sequences are homologous, they may both be descendants of a common ancestor, or they may have converged from different origins towards similar functional or structural constraints. The challenge is to determine homologues accurately given the different evolutionary relationships between proteins. An evaluation of different search algorithms by Pearson and Sierk (2005) showed that the percentage accuracy of sequence searches for homologue matches varied.

Computational methods have also been developed to investigate the structures of proteins. Comparative modelling is based on sequence matches through homology: the new target sequence is compared to sequence databases, any matched sequence is scanned against a database of known structures, and a template is built. The key part of comparative modelling is the quality of the sequence alignment. Loop regions are tidied up by looking for the best fit from a database. In the absence of a previously identified template, ab initio physics-based approaches are used to calculate energy potentials and run simulations (Leach, 2001). Another related method, threading (Jones et al., 1992), uses a combination of ab initio and comparative modelling: threading involves a search against a database of folds, subsequently ranking the folds by calculated energies.

Critical Assessment of methods of Protein Structure Prediction (CASP) is a community-wide protein structure prediction exercise, run every two years since 1994, that aims to assess the various methods for protein fold prediction (Moult, 2005). Structural targets, gathered from structural groups, are distributed amongst theoretical groups for prediction using ab initio, comparative modelling and threading methods. The predictions are then assessed and the results published. Recent observations indicate that challenges still faced by template-based modelling include the identification of the correct template, a step that depends on the generation of correct alignments. Another bottleneck is the large search space for any particular protein target (Xu et al., 2009). Template-free model prediction using ab initio prediction of three-dimensional structures is still not reliable (Ben-David et al., 2009).

The problems mentioned above make it useful to investigate other methods to complement efforts in studying protein structure. Vibrational spectroscopy is another important and useful method for probing the structures of proteins, and is discussed further in chapter 2. Spectral data have previously been shown to contain information about secondary structure, such that the data can be clustered by Principal Component Analysis (PCA) into structural classes, e.g. α-helix, β-sheet, α/β and disordered (Barron et al., 2003). Our analyses looked at deriving fold class and structural content information from data obtained with two vibrational spectroscopy techniques, namely Raman spectroscopy and Raman Optical Activity (ROA). The protein assignments used in the analyses, with SCOP as a reference, were of four classes: α-helix, β-sheet, αβ and other. Using SVM regression and classification, PLS regression and Random Forest classification we have been able to extract structural fold class and quantitative structural fractions of α-helix, β-sheet, αβ and other from the Raman and ROA spectral data of proteins. The results of these analyses are discussed in chapters 5, 6 and 7 of this thesis.
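The PCA step mentioned above can be sketched with numpy via the singular value decomposition; the "spectra" below are random stand-ins, not real Raman/ROA measurements, and real analyses would use preprocessed experimental data:

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Project mean-centred spectra onto their leading principal
    components via SVD. Rows of X are spectra, columns wavenumber bins."""
    Xc = X - X.mean(axis=0)                       # mean-centre each bin
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T               # scores in PC space

rng = np.random.default_rng(0)
spectra = rng.normal(size=(10, 50))  # 10 synthetic spectra, 50 bins each
scores = pca_scores(spectra)
print(scores.shape)  # (10, 2)
```

Plotting such scores against each other is what allows visually clustering proteins by structural class, as in the Barron et al. (2003) analyses cited above.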


1.5 References

Adzhubei A.A., Sternberg M.J. (1993) Left-handed polyproline II helices commonly occur in globular proteins. Journal of Molecular Biology 229, 472-493

Andreeva A., Howorth D., Brenner S.E., Hubbard T.J.P., Chothia C., Murzin A.G. (2004). SCOP database in 2004: refinements integrate structure and sequence family data. Nucleic. Acid Research 32, 226-229

Argos P., Rossman M.G., Johnson J.E. (1997) A four-helical super-secondary structure. Biochemical and Biophysical Research Communications 75, 83-86

Banner D.W., Bloomer A.C. (1975) Structure of chicken muscle triose phosphate isomerase determined crystallographically at 2.5 Å resolution using amino acid sequence data. Nature 255, 609-614

Barron L.D., Blanch E.W., McColl I.H., Syme C.D., Hecht L., Nielsen K. (2003) Structure and behaviour of proteins, nucleic acids and viruses from vibrational Raman optical activity. Spectroscopy 17, 101-126

Barron L.D., Cooper A., Ford S.J., Hecht L., Wen Z.Q. (1992) Vibrational Raman optical activity of enzymes. Faraday Discussions 93, 259-268.

Barron L.D., Wen Z.Q., Hecht L. (1992) Vibrational Raman optical activity of proteins. Journal of American Chemical Society 114, 784-786

Ben-David M., Noivirt-Brik O., Paz A., Prilusky J., Sussman J.L., Levy Y. (2009) Assessment of CASP8 structure predictions for template free targets. Proteins 77, 50-65

Branden C. and Tooze J. (1999) Introduction to Protein Structure. 2nd ed., Garland Publishing: New York

Blanch E.W., McColl I.H.,Hecht L.,Nielsen K., Barron L.D. (2004) Structural characterization of proteins and viruses using Raman optical activity. Vibrational Spectroscopy 35, 87-92

Blumich B.,(1994) Methods and applications of solid-state NMR. Springer-Verlag, Berlin

Davidson R.A., Deber A.R. (2005) The structure of unstructured regions in peptides and proteins: role of the polyproline II helix in protein folding and recognition. Biopolymers 80, 179-185

Doolittle R.F. (1994) Convergent evolution: The need to be explicit. Trends in Biochemical Sciences 214, 149-159

Drin G., Casella J.-F., Gautier R., Boehmer T., Schwartz T.U., Antonny B. (2007) A general amphipathic α-helical motif for sensing membrane curvature. Nature Structural and Molecular Biology 14, 138-146


Dunker A.K, Brown C.J., Lawson J.D., Iakoucheva L.M., Obradovic Z. (2002) Intrinsic disorder and protein function. Biochemistry 41, 6573-6582

Dunker A.K., Brown C.J., Obradovic Z.,(2002) Identification and functions of usefully disordered proteins. Advances in Protein Chemistry 62, 25-49

Efimov A.F. (1993) Patterns of loops regions in proteins. Current Opinion in Structural Biology 3, 379-384

Farber G., Petsko G. (1990) The evolution of α/β barrel enzymes. Trends in Biochemical Sciences 15, 228-234

Farber G. (1993) An α/β barrel full of evolutionary trouble. Current Opinion in Structural Biology 3, 409-412

Finkelstein A.V., Ptitsyn B.O. (2002) Protein Physics: A course of lectures, Academic Press London, UK

Finn R.D., Mistry J. , Tate J., Coggill P., Heger A., Pollington J. E., Gavin O.L., Gunasekaran P., Ceric G, Forslund K., Holm L., Sonnhammer E. L. L, Eddy S.R.,Bateman A. (2010) The Pfam protein families database. Nucleic Acids Research 38, 211-222

Goodhew P.J., Humphreys J., Beanland R. (2001) Electron microscopy and analysis. 3rd ed., Taylor and Francis Inc, London

Greene L.H., Lewis T.E., Addou S., Cuff A., Dallman T., Dibley M., Redfern O., Pearl F., Nambudiry R., Reid A., Sillitoe I., Yeats C., Thornton J.M., Orengo C.A. (2007). The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution. Nucleic Acids Research 35, 291-297

Hames B.D., Hooper N.M.(2000) Instant Notes Biochemistry, 2nd ed., BIOS Scientific Publishers Limited, Oxford

Hegyi H., Gerstein M. (1999) The relationship between protein structure and function: A comprehensive survey with application to the yeast genome. Journal of Molecular Biology 288, 147-164

Holm L., Ouzounis C., Sander C., Tuparev G., Vriend G. (1992). A database of protein structure families with common folding motifs. Protein Science 1, 1691-1698

Holm L., Sander C. (1994) The FSSP database of structurally aligned protein fold families. Nucleic Acids Research 22, 3600-3609

Holm M., Sander C. (1996) Dali/FSSP classification of three-dimensional folds. Nucleic Acids Research 24, 206-210

Holm L., Rosenstroem P. (2010) Dali Server: conservation mapping in 3D. Nucleic Acids Research 38, 545-549

38

Hristova K., Wimley W.C., Mishra V.K., Anantharamiah G.M., Segrest J. P., White S.H. (1999) An amphipathic α-Helix at a membrane interface: a structural study using a novel x- ray diffraction method. Journal of Molecular Biology 290, 99-117

Iakoucheva L.M., Brown C.J., Lawson J.D., Obradovic Z., Dunker A. K. (2002) Intrinsic disorder in cell-signalling and cancer-associated proteins. Journal of Molecular Biology 323, 573-584

Jena B.P., Hoerber Heinrich J.K. (2002) Atomic Force Microscopy in Cell Biology. Academic Press, London

Jones D.T., Taylor W.R., THorton J.M. (1992) A new approach to protein fold recognition. Nature 358, 86-89

Koashi K. and Ozaki Y. (2003) Analysis of overlapping bands by an analytic geometric approach: The potential of band-stripping and the complementary matching method in extraction of band components from overlapping band. Applied Spectroscopy 57, 1528-1538

Lo Conte L., Brenner S. E., Hubbard T.J.P., Chothia C., Murzin A. (2002). SCOP database in 2002: refinements accommodate structural genomics. Nucleic Acid Research 30, 264-267

Moews P.C. and Kretsinger R.H. (1975) Refinement of he structure of carp muscle calcium- binding parvalbumin by model building and difference fourier analysis. Journal of Molecular Biology 91, 201-228

Morris V.J., Kirby A.R., Gunning A.P. (1999) Atomic force microscopy for biologists.Imperial College Press, London

Moult J. (2005) A decade of CASP progress; bottlenecks and prognosis in protein structure prediction. Current Opinion in Structural Biology 15, 285-289.

Murzin A.G., Brenner S. E., Hubbard T., Chothia C. (1995) SCOP: A Structural Classification of Proteins Database for the Investigation of Sequences and Structures. Journal of Molecular Biology 247, 536-540

Leach A.R. (2001) Molecular modelling: Principles and applications, 2nd ed., Pearson Eduaction

Lesk A.M. (2001) Introdcution to protein structure: the structural biology of proteins, Oxford University Press Inc., New York

Orengo C.A., Flores T.P., Taylor W.R., Thorton J.M.(1993) Identifying and classifying protein fold families. Protein Engineering 6, 485-500

Orengo C.A., Jones D.T., Thornton J.M.(1994) Protein superfamilies and domains superfolds. Nature 372, 631-634

Orengo C.A., Bray J.E., Buchan D.W.A., Harrison,A., Lee D., Pearl F., Sillitoe I., Todd A.E., Thornton J.M. (2002a). The CATH protein family database: A resource for structural an functional annotation of genomes. Proteomics 2, 11-21

39

Orengo C.A., Pearl F., Thornton J.M. (2002b). The CATH domain structure database. Methods of Biochemical Analysis 44, 249-271

Orengo C.A., Michie A.D., Jones S., Jones D.T., Swindells M.B., Thorton J.M. (1997) CATH-a hierarchic classfication of protein domain structures. Structure 5, 1093-1108

Pascarella S., Argos P. (1992) A data bank merging related protein structures and sequences Protein Engineering 5,121-137

Pearl F., Lee D., Bray J.E., Buchan D.W.A., Shepherd A.J., Orengo C.A. (2002). The CATH extended protein-family database: providing structural annotations for genome sequences. Protein Science 11, 233-244

Pearl F.M., Bennett C.F., Bray J.E., Harrison A.P., Martin N., Shepherd A., Sillitoe I., Thornto J., Orengo C.A. (2003). The CATH database: an extended protein family resource for structural and functional genomics. Nucleic Acids Research 31, 452-455

Pearson W.R., Sierk M. (2005) The limits of protein sequence comparison. Current Opinion in Structural Biology 15, 254-260

Periasamy A., Day R. N. (2005) Molecular imaging: FRET microscopy and spectroscopy, Oxford University Press, New York

Petsko G. A., Ringe D. ( 2004) Protein Structure and Function, Blackwell Publishing, Oxford

Radivojac P, Iakoucheva L.M, Oldfield C. J, Obradovic Z., Uversky V. N, A. Keith Dunker A. K. (2007) Intrinsic disorder and functional proteomics. Biophysical Journal 92, 1439-1456

Ramachandran G.N., Ramakrishnan C., Sasissekharan V. (1963) Stereochemistry of polypeptide chain configurations. Journal of Molecular Biology 7, 95-99

Reardon D. And Farber G. (1995) The structure and evolution of α/β proteins. Journal of the Federation of American Societies of Experimental Biology 9, 497-503

Rhodes G. (2006) Crystallography made crystal clear: A guide for users of macromolecular models, Academic Press

Romero P., Obradovid Z., Kissinger C. , Villafranca J.E. , and Dunker A.K. (1997 ) Identifying disordered regions in proteins from amino acid sequence. International Conference on Neural Networks 1, 90-95

Rupp B. (2009) Biomolecular Crystallography: Principles. Practice and application to structural Biology, Garland Science, New York

Sattler M., Fesik S.W.,(1996) Use of deuterium labelling in NMR: overcoming a sizebale problem. Structure 4, 1245-1249

40

Segrest J. P., Deloof H., Dohlman J. G, Brouillette C. G., Anantharamaiah G. M. (1990). Amphipathic helix motif - classes and properties. Proteins: Structure Function Genetics 8, 103-117.

Segrest J. P., Jones M. K., Deloof H., Brouillette C. G.,Venkatachalapathi Y. V., Anantharamaiah G. M. (1992). The amphipathic helix in the exchangeable apolipoproteins - a review of secondary structureand function. Journal of Lipid Research 33, 141-166

Stapley B.J. and Creamer T.P. (1999) A survey of left handed polyproline II helices. Protein Science 8, 587-595 Sussman J. L., Lin D., Jiang J., Manning N. O., Prilusky J., Ritter O. and Abola E. E. (1998) Protein Data Bank (PDB): Database of Three-Dimensional Structural Information of Biological Macromolecules. Acta Crystallographica 54, 1078-1084 Taylor D.A. (1991) Structure of a recombinant calmodulin from Drosophila melanogaster refined at 2.2-A resolution. Journal of Biological Chemisrtry 266, 21375-21380

Thorton J.M., Sibanda B.L., Edwards M.S., Barrow D.J. (1988) Analysis, design and modification of loop regions in proteins. Bioessays 8, 63-69

Thornton,J.M., Orengo,C.A., Todd,A.E., Pearl,F.M. (1999). Protein folds, functions and evolution. Journal of Molecular Biology 293, 333-342

Weinreb P.H., Zhen W., Poon A.W, Conway K. A., Lansbury P. T., Jr. (1996) NACP, A protein implicated in Alzheimer‟s disease and learning, is natively unfolded. Biochemistry 35, 13709-13715

Weber P.C. (1997) Overview of protein crystallisation methods. Methods in Enzymology 276, 13-22

Wright P.E. and Dyson H.J. (1999) Intrinsically unstructured proteins: re-assessing the protein structure paradigm. Journal of Molecular Biology 293, 321-331

Uversky V.N. (2002) Natively unfolded proteins: a point where biology waits for physics. Protein Science 11, 739-756

Uversky V. N.,Oldfield C.J. and Dunker A. K. (2005) Showing your ID: Intrinsic disorder as an ID for recognition, regulation and cell signalling. Journal of Molecular Recognition 18, 343-384

Xu J., Peng J., Zhao F. (2009) Template-based and free modelling by RAPTOR++ in CASP8. Proteins 77, 133-137

41

2. Vibrational Spectroscopy

2.1 Introduction

Vibrational spectroscopic analyses allow biomolecules to be probed in order to determine their structure. As discussed in Chapter 1, although methods such as X-ray crystallography and NMR have proven to be powerful techniques for studying the structure of proteins, they are not always viable. Vibrational spectroscopy studies protein structure in aqueous solution, which is typically closer to the protein's natural environment.

Whilst the analysis of structural information from X-ray diffraction data and NMR experiments is now routine, the extraction of structural information from vibrational spectra is often more complicated and subjective. There is a need, therefore, to better understand and correctly identify spectral marker bands and the protein structural features to which they correspond.

2.2 Vibrational Energies

Vibrational spectroscopy methods generally involve irradiating biological samples with laser light; the photons may be scattered, absorbed, or simply pass through the sample without interacting with any material. Energy is transferred from a photon to a molecule if the photon energy equals the difference between the ground-state and excited-state energies of the molecule. This energy transfer causes vibrational excitation as the molecule absorbs energy and is elevated to a higher energy state. Molecular energy can be divided into translational, vibrational and rotational energies. Translational energy is described in terms of three orthogonal vectors in Cartesian space, x, y and z, and hence has 3 degrees of freedom. Rotational energy is, in general, also described by three degrees of freedom. Linear molecules such as CO2, however, have only two rotational degrees of freedom, because rotation about the molecular axis produces no observable change. Hence, the CO2 molecule has three translational degrees of freedom and two rotational degrees of freedom. Thus, if a molecule has N atoms, the number of vibrational degrees of freedom, and therefore the number of possible vibrations, is 3N-6 for all molecules except linear molecules, which have 3N-5 possible vibrations. The types of motion that contribute to the 3N-6 or 3N-5 vibrational modes include stretching motions between two bonded atoms, bending motions between three atoms connected by two bonds, and out-of-plane deformation modes that change an otherwise planar structure into a non-planar one (Ewen and Dent, 2005; Nafie, 2011). In the case of infrared spectroscopy, if the frequency of a specific vibration is equal to the frequency of the radiation directed at the molecule, the molecule absorbs the radiation. As the molecule absorbs radiation it gains energy, which causes it to undergo transitions through different vibrational excitation levels.
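The 3N-6 / 3N-5 counting rule above is easy to encode; the helper below and the example molecules are purely illustrative.

```python
def vibrational_modes(n_atoms: int, linear: bool = False) -> int:
    """Number of vibrational normal modes of an N-atom molecule.

    3 translational and 3 rotational degrees of freedom are subtracted
    from the 3N total; a linear molecule loses only 2 rotations because
    rotation about the molecular axis produces no observable change.
    """
    return 3 * n_atoms - (5 if linear else 6)

print(vibrational_modes(3, linear=True))  # CO2 (linear): 4 vibrational modes
print(vibrational_modes(3))               # H2O (bent): 3 vibrational modes
```

For CO2 the four modes are the symmetric and antisymmetric stretches plus a doubly degenerate bend.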

In Raman spectroscopy, the light photon interacts with the molecules, displacing the electron clouds around the nuclei. The displaced electrons relax back to their normal state to re-establish the stable state of the molecule, and the photon is scattered off the molecule. The intensities and energies of the Raman scattered light are measured and contain structural details of the protein being studied.

A combination of fundamental frequencies, differences of fundamental frequencies, coupling interactions of two fundamental absorption frequencies, and coupling interactions between fundamental vibrations and overtones or combination bands creates a unique spectrum for each compound (Banwell, 1972).

2.3 Raman Spectroscopy

The process of inelastic light scattering that is central to Raman spectroscopy was first observed experimentally in 1928 by Raman and Krishnan (Raman and Krishnan 1928, 1929; Raman 1928). Incident light irradiating a molecule can distort the electron distribution of the molecule, and the incident photons are then scattered. When the electron cloud relaxes without any nuclear movement, the photons are said to be elastically scattered: they have the same frequency, and therefore wavelength, as the incident photons, and there is no change in their energy. This is referred to as Rayleigh scattering. If the scattering process involves motion of the nuclei, energy is transferred between the molecule and the photon, and the photons are scattered at frequencies above or below the frequency of the incident photons. This type of scattering is inelastic and constitutes Raman scattering. Raman spectroscopy measures the difference in energy between the incident photons and the Raman scattered photons (Figure 2.1A). A molecule will only generate a Raman spectral band if the vibration changes the polarizability.

For infrared spectroscopy, in contrast, the vibration has to involve a change in dipole moment. The Raman effect is due to the interaction of the electromagnetic field of the incident radiation with a molecule (Long, 2002). When a molecule is placed in an electromagnetic field, it undergoes a distortion in which the positively charged nuclei are attracted towards the negative pole of the field and the electrons towards the positive pole. This induces an electric dipole moment in the molecule, and the molecule is said to be polarised. For example, large atoms such as xenon have a strong polarizability because their electron clouds, distant from the nucleus, are relatively easy to distort with an applied electric field; helium atoms, which are smaller and more compact, have a small polarizability. Polarizabilities for atoms are the same in all directions, whereas polarizabilities for molecules may vary with direction about the molecule, depending on the molecule's symmetry. It is convenient to label polarizability components using Cartesian coordinates to indicate the particular direction to which each component refers. The size of the induced dipole moment, μ, depends on the strength of the incident electric field, E, and the ease with which the molecule's electron distribution can be distorted (Eqn 2.1) (Banwell, 1972):

μ = αE        (2.1)

where α is the polarizability of the molecule and E is the electromagnetic field of the incident radiation. Figure 2.1A shows an example Raman spectrum of the human lactoferrin protein, and Figure 2.1B shows its second-derivative Raman spectrum. Second-derivative Raman spectral data were used in the Partial Least Squares (PLS) regression analyses discussed in Chapter 6.

Figure 2.1 (A) Raman spectrum of human lactoferrin protein. (B) Second-derivative Raman spectrum of human lactoferrin protein.
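The second-derivative preprocessing applied to Raman spectra before PLS regression is commonly implemented with a Savitzky-Golay filter; the sketch below is a minimal illustration on a synthetic two-band spectrum, and the window and polynomial settings are assumptions, not the values used in this work.

```python
import numpy as np
from scipy.signal import savgol_filter

# Synthetic "spectrum": two overlapping Gaussian bands on a wavenumber axis
wavenumbers = np.linspace(800, 1800, 1001)          # cm^-1, 1 cm^-1 spacing
spectrum = (np.exp(-((wavenumbers - 1240) / 20) ** 2)
            + 0.8 * np.exp(-((wavenumbers - 1660) / 25) ** 2))

# Second derivative: band maxima become sharp negative minima, which
# helps resolve overlapped bands before regression analysis.
second_deriv = savgol_filter(spectrum, window_length=15, polyorder=3, deriv=2)

print(wavenumbers[np.argmin(second_deriv)])  # strongest minimum, near 1240 cm^-1
```

The narrower of the two bands gives the deeper second-derivative minimum, which is one reason the derivative form sharpens overlapped features.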

2.3.1 The Raman Effect

The Raman effect arises when light interacts with a molecule and polarises its electron distribution. This interaction can be considered as an unstable 'complex' that disintegrates, releasing the light as scattered radiation (Long, 2002). Raman scattering plots are often presented as a Stokes spectrum, given as the shift of the scattered photon energy with respect to the energy of the laser beam. Excitation of the molecule from the ground state to a higher energy level leads to absorption of energy by the molecule and loss of energy by the photons; this causes what is known as the Stokes shift in the spectrum. However, there are instances when the molecule is already in a higher energy state, such that when the incident beam interacts with the molecule, energy is transferred from the molecule to the photons and the energy of the scattered light is higher than that of the incident laser beam. The latter is referred to as the anti-Stokes shift. At any one time there are more molecules in the ground state, at the lower energy level, than in any higher energy state, and as such anti-Stokes bands are very weak (Ferraro et al., 2003).


Figure 2.2 Energy level diagram for Raman scattering; (a) Stokes Raman scattering (b) anti-Stokes Raman scattering. The energy difference between the incident and scattered photons is represented by the arrows of different lengths. The population of vibrational excited states is low under normal conditions. Here, the initial state of a scattering molecule is the ground state and the scattered photon will have lower energy than the exciting photon. This Stokes shifted scatter is what is usually observed in Raman spectroscopy. Raman scattering from vibrationally excited molecules leaves the molecules in the ground state. This anti-Stokes-shifted Raman spectrum is always weaker than the Stokes shifted Raman spectrum.
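The population argument above can be made quantitative with a Boltzmann factor; the sketch below uses standard physical constants and an illustrative band at 1000 cm^-1 and room temperature to estimate the excited-state population responsible for anti-Stokes scattering, ignoring the frequency dependence of the scattering cross-section.

```python
import math

h = 6.62607015e-34      # Planck constant, J s
c = 2.99792458e10       # speed of light in cm/s (wavenumbers are in cm^-1)
k_B = 1.380649e-23      # Boltzmann constant, J/K

def excited_state_population(wavenumber_cm: float, temperature_K: float = 298.0) -> float:
    """Boltzmann ratio N(v=1)/N(v=0) for a vibration at the given wavenumber."""
    return math.exp(-h * c * wavenumber_cm / (k_B * temperature_K))

# For a 1000 cm^-1 band at 298 K, well under 1% of molecules start in v = 1,
# so the anti-Stokes band is correspondingly much weaker than the Stokes band.
print(excited_state_population(1000.0))
```

Lower-wavenumber vibrations are more thermally populated, which is why anti-Stokes lines are relatively stronger at small Raman shifts.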

Figure 2.2 above illustrates the energy difference between the initial and final vibrational levels of the incident and Raman scattered photons, respectively. Some of the vibrational energy is dissipated as heat, but because of the low intensity of Raman scattering the heat emitted can be neglected, as it does not significantly affect the temperature of the sample in the absence of absorbance. Raman spectroscopy provides molecular vibrational spectra by means of the inelastic scattering of visible light (Barron, 2004). During the Stokes Raman scattering event, the interaction of the molecule with an incident visible photon of energy ħω, where ω is its angular frequency and ħ is the reduced Planck constant, can leave the molecule in an excited vibrational state of energy ħωv, with a corresponding energy loss, and hence a shift to lower angular frequency, ω − ωv, of the scattered photon. Therefore, by analysing the scattered light with a spectrometer, a complete vibrational spectrum may be obtained.

2.4 Raman Optical Activity (ROA)

Raman optical activity (ROA) is another vibrational spectroscopic technique used to study the structure of biological molecules (Barron et al., 2000, 2006). ROA is a chirally sensitive form of Raman spectroscopy and is therefore an insightful probe of the backbone structure of proteins. ROA measures the difference in the intensity of vibrational Raman scattering from chiral molecules in right- and left-circularly polarised light (Barron, 2004; Barron and Buckingham, 1971; Barron et al., 1973; Hug, 2002). ROA is sensitive to the chiral parts of a molecule: while conventional Raman spectra tend to be dominated by bands from amino acid side chains, most bands in ROA spectra arise largely from secondary, loop and turn structures, thereby providing information about the tertiary fold (Nafie, 2011).


Figure 2.3 Raman Optical Activity (ROA) spectrum of human lactoferrin protein

Figure 2.3 shows an example of a protein ROA spectrum. ROA spectra, unlike Raman spectra, are bisignate and so have both positive and negative bands. The sensitivity of ROA to the chirality of 3-dimensional structures gives rich structural detail, and the ability to study samples in aqueous solution, with no restrictions on the size of the biomolecules, makes ROA ideal for studies of protein structure. The data used in this project were produced by two forms of ROA: Incident Circularly Polarized (ICP) ROA, which uses incident circularly polarized radiation, and Scattered Circularly Polarized (SCP) ROA, which uses scattered circularly polarized radiation. ROA, using the ICP strategy, was first described by Barron and Atkins (1971) and has been used as a probe of the structure and behaviour of proteins (Blanch et al., 2004; Barron et al., 2000, 2002, 2006) and of proteinaceous particles, including bacteriophages and virus capsids (Blanch et al., 1999, 2001, 2002a, 2002b), in aqueous solution.


2.4.1 Basic ROA Theory

In ROA, the quantity measured can be expressed as the dimensionless circular intensity difference (CID) (Barron, 2004). This analytical measure of chiral signal strength is given by:

Δ = (I^R − I^L) / (I^R + I^L)        (2.2)

where I^R and I^L are the Raman scattered intensities for right- and left-circularly polarized incident light, respectively. The ratio normalizes out instrument variations and the variable of time, allowing an analytical standard to be measured. ROA is thus measured as the difference between Raman scattering of left- and right-circularly polarized radiation. The Δ of Eqn 2.2 is observed by means of a polarization modulator placed between the scattering molecule and the detector (Oboodi et al., 1985). The original form of ROA measured experimentally was ICP right-angle scattering, where the scattered radiation could be polarized or depolarized depending on whether the polarization analyzer is placed perpendicular or parallel to the scattering plane (Barron, 1989).
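Equation 2.2 translates directly into code; the intensity values below are synthetic, chosen only to give CIDs of the ~10^-4 magnitude typical of protein ROA.

```python
import numpy as np

def circular_intensity_difference(I_R: np.ndarray, I_L: np.ndarray) -> np.ndarray:
    """Dimensionless CID, Delta = (I^R - I^L) / (I^R + I^L)  (Eqn 2.2)."""
    return (I_R - I_L) / (I_R + I_L)

# Synthetic example: a small chiral difference riding on a large Raman background
I_R = np.array([1.0002e6, 9.9990e5, 1.0001e6])   # arbitrary detector counts
I_L = np.array([9.9980e5, 1.0001e6, 9.9990e5])

print(circular_intensity_difference(I_R, I_L))
```

Because the chiral signal is such a small fraction of the total scattering, long acquisition times and stable instrumentation are needed in practice.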

Hecht et al. (1989) implemented backscattered ROA, which has a higher signal-to-noise ratio than 90° or forward scattering. In backscattered ROA, the light is scattered at 180° with respect to the incident beam. Almost all protein ROA experiments are performed using the backscattering geometry.

There are several forms of ROA, two of which are mentioned above. In Incident Circular Polarization (ICP) ROA, the incident laser is modulated between left- and right-circularly polarised states while the fixed linearly polarized or unpolarized Raman intensity is measured. In Scattered Circular Polarization (SCP) ROA, the incident radiation has fixed polarization or is unpolarized, and the difference in intensity between right- and left-circularly polarised Raman scattered radiation (IR − IL) is measured. In-phase dual circular polarization (DCPI) ROA involves switching the polarization states of both the incident and the scattered Raman radiation simultaneously between right- and left-circular states. In the last form of ROA, out-of-phase dual circular polarization (DCPII) ROA, the incident and scattered polarization states are modulated in opposite senses (Che et al., 1991; Barron et al., 1989; Nafie, 2011). Figure 2.4 gives an overview of the different forms of ROA.



Figure 2.4 Illustration of the energy levels of the different forms of ROA as the molecule goes from the low-energy ground state v0 to the higher energy level v1; vi represents the higher vibrational energy levels. Subscripts refer to the polarization state of the scattered light, whilst superscripts refer to the state of the incident light. In A, ICP ROA, the incident laser is modulated between left- and right-circular polarization states and the differential Raman intensity is measured in an unpolarized state (I^R − I^L). In B, SCP ROA, fixed incident light is used and the difference between right- and left-circularly polarized Raman scattered light is measured (I_R − I_L). In C, DCPI ROA, both the incident light and the Raman scattered light are modulated in phase (I_R^R − I_L^L). In D, DCPII ROA, both the incident and the scattered light are modulated out of phase (I_L^R − I_R^L). The interchanged subscripts and superscripts in the differential Raman scattered intensity expressions for C and D indicate the in-phase and out-of-phase modulations of light for DCPI ROA and DCPII ROA, respectively (Nafie, 2011).


In ROA, contributions to the spectral intensities arise from the molecular electric quadrupole and magnetic dipole moments interacting with the incident radiation. The effects of the radiation's interaction on the electric dipole, magnetic dipole and electric quadrupole moments of the molecule are represented as vectors in the Cartesian directions x, y and z. Equation 2.2 above is evaluated by expressing the electric and magnetic field vectors produced by the oscillating electric dipole, magnetic dipole and electric quadrupole moments induced by right- and left-circularly polarized incident light. In Cartesian tensor notation, these are expressed in terms of the electric field amplitude of the incident radiation and the aforementioned property tensors: the electric dipole-electric dipole polarizability ααβ, the electric dipole-magnetic dipole polarizability Gαβ, and the electric dipole-electric quadrupole polarizability Aαβ. All the different polarizability-polarizability and polarizability-optical activity tensor products are averaged over all orientations of the scattering molecule to generate tensor products that are invariant to axis rotations (Barron, 2004; Nafie, 2011).

2.5 Spectroscopic Structural Band Assignments

Vibrational spectra are sensitive to biomolecular structure, providing a wealth of information on side-chain and backbone conformations. The bands of each spectrum generally correspond to specific structural elements in proteins and, when combined, give a fingerprint of the protein. There are a number of bands, however, about which little is known. It is necessary to find ways to extract information from spectral bands and to investigate which regions of the spectra contain the most structural information. Vibrations of the backbone in polypeptides and proteins are usually associated with four main regions of the Raman spectrum (McColl et al., 2003; Barron et al., 2003; Blanch et al., 2004; Lee and Krimm, 1998; Tsuboi et al., 2000). The extended amide III region, from ~1230-1350 cm-1, is often assigned to the in-phase combination of the in-plane N-H deformation with the C-N stretch. The amide II region, from ~1510-1570 cm-1, arises from the out-of-phase combination of the in-plane N-H deformation and the C-N stretch; it is not as prominent in Raman and ROA spectra but is stronger in IR spectra. The amide I region, from ~1630-1700 cm-1, arises mostly from the C=O stretch. The extended amide III region also involves mixing between the N-H and Cα-H deformations, and this coupling generates a rich and informative ROA band structure (Barron et al., 2003). Table 2.1 shows a few of the sources of spectral band structural information in the amide I, II and III regions.

Table 2.1 Some secondary structure band assignments for Raman and ROA spectra

Structure type | Amide region | Raman (cm-1)        | ROA (cm-1)
α-helix        | Amide III    | ~1330, ~1300, ~1299 | ~1340, ~1342
α-helix        | Amide I      | ~1665               | -
β-sheet        | Amide III    | ~1230-1245          | ~-1219 - -1247, ~+1260, ~+1295
β-sheet        | Amide II     | -                   | ~-1550
β-sheet        | Amide I      | ~1665-1680          | ~-1658, ~+1677
β-turns        | Amide III    | 1260-1295           | ~+1296, -1347, ~-1376
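The amide-region boundaries quoted in the text can be collected into a small lookup helper; this is an illustration of the assignments above, not a full band-assignment routine.

```python
# Amide regions of the protein Raman/ROA spectrum, in cm^-1, using the
# approximate boundaries quoted in the text.
AMIDE_REGIONS = {
    "extended amide III": (1230, 1350),
    "amide II": (1510, 1570),
    "amide I": (1630, 1700),
}

def amide_region(wavenumber: float) -> str:
    """Return the amide region a band falls in, or 'unassigned'."""
    for name, (low, high) in AMIDE_REGIONS.items():
        if low <= wavenumber <= high:
            return name
    return "unassigned"

print(amide_region(1660))   # amide I (mostly C=O stretch)
print(amide_region(1245))   # extended amide III (beta-sheet range)
```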

2.6 Circular Dichroism

Another spectroscopic method related to the study of 3-dimensional protein structure is circular dichroism (CD), including its vibrational form, vibrational circular dichroism (VCD) (Fasman, 1996; Berova et al., 2000; Norden and Roger, 1997). Circular dichroism measures the difference in absorption of right- and left-circularly polarized light by a molecule. For a molecule to give a CD signal it must be optically active: for example, it contains a chiral centre, as in proteins, where the Cα atom of each amino acid residue except glycine is bonded to four different substituents; it is situated in an asymmetric environment created by its 3-dimensional structure; or it is covalently linked to a chiral centre (Kelly et al., 2005). Figure 2.5 is a pictorial depiction of the CD differential absorption.


Figure 2.5 Optically active molecules absorb different amounts of right- (R) and left- (L) circularly polarized light. AL and AR are the absorbances of the left- and right-circularly polarized components, respectively.

The differential absorbance in the CD signal is measured as an ellipticity:

Ellipticity θ = constant × (AL − AR)   (in units of degrees)

where AL is the absorbance of left-circularly polarized light and AR is the absorbance of right-circularly polarized light.
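The conversion constant is not stated in the text; a commonly quoted value follows from θ = (ln 10 / 4)(180/π) ΔA ≈ 32.98 ΔA degrees, which is assumed in this sketch.

```python
import math

def ellipticity_degrees(A_L: float, A_R: float) -> float:
    """Ellipticity theta (degrees) from the differential absorbance.

    theta = (ln 10 / 4) * (180 / pi) * (A_L - A_R) ~= 32.98 * dA
    (assumed standard conversion; the text leaves the constant unstated).
    """
    return (math.log(10) / 4.0) * (180.0 / math.pi) * (A_L - A_R)

# A tiny differential absorbance yields a millidegree-scale ellipticity,
# which is the typical magnitude of protein CD signals.
print(ellipticity_degrees(0.5004, 0.5000))
```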

To compare ellipticity (θ) values from different protein samples, θ is normalized to obtain the mean molar ellipticity per residue (mrd) (Kelly and Price, 2000; Adler et al., 1973):

(2.3)

where MW is the molecular weight, c is the concentration, l is the spectrometer path length, ΔA (= AL − AR) is the differential absorbance and nr is the number of residues in the protein.


As linearly polarised light traverses an optically active molecule, the properties of its two circularly polarized components change as they are slowed to different extents. The two components are also absorbed to different extents, so they emerge with different amplitudes: the more strongly absorbed component has the smaller amplitude. Each component also experiences a different refractive index, which causes the plane of polarization to be rotated by a small angle, producing an effect called optical rotatory dispersion (ORD). The superposition of the resulting radiation is no longer along the plane of the original linearly polarised light but traces an ellipse along which the radiation rotates. This ellipticity (θ) constitutes circular dichroism.

The intensities and energies of the electronic transitions in a molecule exposed to circularly polarised light depend on the φ and ψ angles of the amide groups of the biomolecule. The observed CD spectra are therefore distinct for different secondary structure motifs in proteins and can yield valuable information about the structural properties of molecules. This is possible because the chromophores in different structural motifs absorb light in different amounts. CD spectra are sensitive to protein secondary structure, and bands from different structural elements such as α-helix, β-sheet, disordered structure and β-turns can be determined (Johnson, 1990; Correa and Ramos, 2009).

The CD spectrum of a protein is the sum of its structural element components and can therefore be used to estimate the secondary structure composition of a protein (Saxena and Wetlaufer, 1971; Chen and Yang, 1971; Johnson 1988, 1990; Yang et al., 1986). CD can be used with small quantities of sample, does not require crystals, and is widely used in structural biology in combination with other methods such as NMR, X-ray crystallography and fluorescence, which are discussed briefly in Chapter 1. Several groups have applied CD to the study of biological systems (Woody, 1995), protein-ligand interactions and conformational changes (Greenfield, 1999), protein folding and unfolding (Kelly and Price, 1997; Correa and Ramos, 2009), and globular protein and nucleic acid structure (Bondesen and Schuh, 2001; Eriksson and Norden, 2001; Norden and Kurucsev, 1994).

2.7 Chemometrics analysis of vibrational spectroscopy

Vibrational spectroscopic methods like those discussed above provide rich structural information which can be mined using computational bioinformatics methods. Chemometric analysis (Geladi et al., 2004) of spectral data is widely applied as an investigative tool for different purposes, including classification, identification and recognition (Breitman et al., 2007; Navas et al., 2008; Manzano et al., 2009). Among the different chemometric tools, Principal Component Analysis (PCA) is a powerful data-mining technique that reduces data dimensionality and provides a more interpretable representation of the data under investigation (Jollife, 2002; Bishop, 2006). Barron et al. (2003) showed how proteins could be clustered in a PCA plot based on their structural classes.
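A PCA of a spectral data matrix, of the kind used to cluster proteins by structural class, reduces to mean-centring followed by a singular value decomposition; the data below are synthetic stand-ins for real ROA spectra.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data matrix: 20 "spectra" x 500 wavenumber points, built from
# two underlying spectral components plus measurement noise.
scores_true = rng.normal(size=(20, 2))
components_true = rng.normal(size=(2, 500))
X = scores_true @ components_true + 0.05 * rng.normal(size=(20, 500))

# PCA via SVD of the mean-centred data matrix
X_centred = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(X_centred, full_matrices=False)
scores = U * s                     # PC scores: projection of each spectrum
explained = s**2 / np.sum(s**2)    # fraction of variance per component

# Two components dominate, as constructed
print(explained[:2].sum())
```

Plotting the first two columns of `scores` against each other gives the kind of 2-D map in which structurally similar proteins cluster together.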

PCA and Partial Least Squares-Discriminant Analysis (PLS-DA) on Raman spectra have been used to classify and identify adulterated milk powder samples (Almeida et al., 2011). The benefits of combining Raman spectroscopy and PCA for identifying and analysing potential drug polymorphs were reported by Strachan et al. (2004). A similar approach (Sato-Berru et al., 2006) was applied to discriminate and identify carbon nanotubes with different wall compositions.

Vibrational spectroscopy has also been used in the detection of disease in human cells (Wong et al., 1991; Diem et al., 1999, 2004; Kendall et al., 2009). Chemometric analyses applied to infrared spectroscopy have been used to differentiate between cancerous tissues (Romeo and Diem, 2005; Bird et al., 2008; Lasch et al., 2004; Baker et al., 2010), cells at different stages of the cell division cycle (Boydston-White et al., 2006), and drug interactions in human cell lines (Flower et al., 2011).

Chemometric methods have also been used in the diagnosis of disease. A study by Thompson et al. (2004) based on prostate-biopsy specimens showed that 15% of 2940 men with prostate-specific antigen (PSA) levels of 3.1 to 4 ng per millilitre, levels considered to be normal, did have prostate cancer. Thompson et al. (2004) used logistic regression to assess the risk of prostate cancer in men aged between 62 and 99 years; the analyses included variables such as age in years, family history of prostate cancer, race and PSA level. Their results show that the application of chemometric methods to diagnosis can give useful information. Pathologists manually examine biopsies to determine a diagnosis, but this approach can be time-consuming and is susceptible to variability in judgement (Bhargava, 2007; Stoler and Schiffman, 2001). A study of diagnostic interpretations of 887 cervical cancer biopsies by Stoler and Schiffman (2001) showed only 42.26% agreement between pathologists' diagnoses.

Vibrational spectroscopy is a non-destructive method for the analysis of cells, tissues and fluids and, combined with chemometric analysis methods, could be useful in the diagnosis of diseases such as cancer. Chemometric pattern recognition methods could be a viable alternative to manual diagnosis of disease, providing an accurate, less subjective and automated method of disease diagnosis. The project reported in this thesis applied chemometric analyses to the extraction of protein fold class and secondary structure information from ROA and Raman vibrational spectroscopy.

Chemometric methods applied to spectral data provide a powerful means of mining data, with the analyses performed and the quantitative estimates extracted depending on the system under investigation.

The machine learning methods used in this thesis were Random Forests, Support Vector Machines and PLS Regression; these are discussed in detail in the following chapter. The results from these analyses are presented in Chapters 5 and 6.


2.8 References

Adler A.J., Greenfield N.J. and Fasman G.D. (1973) Circular dichroism and optical rotatory dispersion of proteins and polypeptides. Methods in Enzymology 27, 675-735

Almeida M.R., Oliveira K., Stephani R., Oliveira L.F. (2011) Fourier-transform Raman analysis of milk powder: a potential method for rapid quality screening. Journal of Raman Spectroscopy 42, 1548-1552

Bayley P.M. (1973) The analysis of circular dichroism of biomolecules. Progress in Biophysics and Molecular Biology 27, 1-76

Baker M., Clarke C., Demoulin D., Nicholson J., Lyng F., Byrne H., Hart C., Brown M., Clarke N., Gardner P. (2010) An investigation of the RWPE prostate derived family of cell lines using FTIR spectroscopy. Analyst 135, 887-894.

Banwell C.N. (1972) Fundamentals of molecular spectroscopy, 3rd ed., McGraw-Hill Book Company: Maidenhead

Barron L.D. (2004) Molecular light scattering and optical activity, 2nd ed., Cambridge University Press: Cambridge

Barron L.D., Hecht L., Blanch E.W., Bell A.F. (2000) Solution Structure and Dynamics of Biomolecules from Raman Optical Activity. Progress in Biophysics & Molecular Biology 73, 1-49

Barron, L.D., Blanch, E.W., Hecht, L. (2002) Unfolded proteins studied by Raman optical activity. Advances in Protein Chemistry 62, 51–90

Barron L. D., Buckingham A. D. (1971) Rayleigh and Raman scattering from optically active molecules. Molecular Physics 20, 1111-1119

Barron L.D., Bogaard M.P., Buckingham A.D. (1973) Raman scattering of circularly polarized light by optically active molecules. Journal of the American Chemical Society 95, 603-605

Barron L.D., Blanch E.W., McColl I.H., Syme C.D., Hecht L., Nielsen K. (2003) Structure and Behaviour of Proteins, Nucleic Acids and Viruses from Vibrational Raman Optical Activity. Spectroscopy 17, 101-126

Barron L.D., Zhu F., Hecht L. (2006) Raman optical activity: An incisive probe of chirality, and of biomolecular structure and behaviour. Vibrational Spectroscopy 42, 15-24

Barron L.D., Hecht L., Hug W. and MacIntosh M.J. (1989) Backscattered Raman Optical Activity with a CCD Detector. Journal of the American Chemical Society 111, 8731-8732


Berova N., Nakanishi K., Woody R. (2000) Circular Dichroism: Principles and Applications, 2nd ed., Wiley-VCH, New York

Sato-Berru R.Y., Basiuk E.V., Saniger J.M. (2006) Application of principal component analysis to discriminate the Raman spectra of functionalized multiwalled carbon nanotubes. Journal of Raman Spectroscopy 37, 1302-1306

Blanch E.W., Hecht L., Syme C.D., Volpetti V., Lomonossoff G.P., Nielsen K., Barron L.D. (2002a) Molecular structures of viruses from Raman optical activity. Journal of General Virology 83, 2593-2600

Blanch E.W., Robinson D.J., Hecht L., Syme C.D., Nielsen K., Barron L.D. (2002b) Solution structures of potato virus X and narcissus mosaic virus from Raman optical activity. Journal of General Virology 83, 241-246

Blanch E.W., Hecht, L., Day L.A., Pederson D.M., and Barron, L.D. (2001) Tryptophan absolute stereochemistry in viral coat proteins from Raman optical activity. Journal of the American Chemical Society 123, 4863–4864.

Blanch E.W., Gill, A.C., Rhie A.G.O., Hope J., Hecht L., Nielsen K., and Barron L.D. (2004). Raman optical activity demonstrates poly(L-proline) II helix in the N-terminal region of the ovine prion protein: implications for function and misfunction. Journal of Molecular Biology 343,467–476.

Bhargava R. (2007) Towards a practical Fourier transform infrared chemical imaging protocol for cancer histopathology. Analytical and Bioanalytical Chemistry 389, 1155–1169

Bishop C.M. (2006) Pattern recognition and machine learning, Springer, New York

Bird B., Romeo M.J., Diem M., Bedrossian K., Laver N., Naber S. (2008) Cytology by infrared micro-spectroscopy: automatic distinction of cell types in urinary cytology. Vibrational Spectroscopy 48, 101-106

Breitman M., Ruiz-Moreno S., Lopez Gil A. (2007) Experimental problems in Raman spectroscopy applied to pigment identification in mixtures. Spectrochimica Acta Part A 68, 1114-1119

Bondesen B.A., Schuh M.D. (2001) Circular dichroism of globular proteins. Journal of Chemical Education 78, 1244-1247

Boydston-White S., Romeo M., Chernenko T., Regina A., Miljković M., Diem M. (2006) Cell-cycle-dependent variations in FTIR micro-spectra of single proliferating HeLa cells: Principal component and artificial neural network analysis. Biochimica et Biophysica Acta 1758, 908-914

Che D., Hecht L., Nafie L.A. (1991) Dual and incident circular polarization Raman optical activity backscattering of (−)-trans-pinane. Chemical Physics Letters 180, 182-190


Chen Y., Yang J.T. (1971) A new approach to the calculation of secondary structures of globular proteins by optical rotatory dispersion and circular dichroism. Biochemical and Biophysical Research Communications 44, 1285-1291

Correa D.H.A., Ramos C.H.I. (2009) The use of circular dichroism spectroscopy to study protein folding, form and function. Journal of Biochemistry Research 3, 164-173

Eriksson M., Nordén B. (2001) Linear and circular dichroism of drug-nucleic acid complexes. Methods in Enzymology 340, 68-98


Fasman, G.D. (1996) Circular Dichroism and the Conformational Analysis of Biomolecules, Plenum Press, New York

Ferraro J.R., Nakamoto K., Brown C.W. (2003) Introductory Raman spectroscopy, Academic Press, San Diego

Flower K.R., Khalifa I., Bassan P., Demoulin D., Jackson E., Lockyer N.P., McGown A.T., Miles P., Vaccari L., Gardner P. (2011) Synchrotron FTIR analysis of drug treated ovarian A2780 cells: an ability to differentiate cell response to different drugs? Analyst 136, 498-507

Geladi P., Sethson B., Nystrom J., Lillhonga T., Lestander T., Burger J. (2004) Chemometrics in Spectroscopy Part 2. Examples. Spectrochimica Acta B 59, 1347-1357

Greenfield N.J. (1999) Applications of circular dichroism in protein and peptide analysis. Trends in Analytical Chemistry 18, 236-244

Hecht L., Barron L.D., Hug W. (1989) Vibrational Raman optical activity in backscattering. Chemical Physics Letters 158, 341–344.

Hug W. (2002) Raman optical activity. In Handbook of Vibrational Spectroscopy, Volume 1, J.M. Chalmers and P.R. Griffiths, eds., John Wiley & Sons, Inc., Chichester, pp. 745-758

Jolliffe I.T. (2002) Principal component analysis, 2nd ed., Springer-Verlag, New York

Johnson W.C. Jr. (1988) Secondary structure of proteins through circular dichroism spectroscopy. Annual Reviews of Biophysics and Biophysical Chemistry 17, 145-166

Johnson W. C. Jr. (1990) Protein secondary structure and circular dichroism: A practical guide. Proteins: Structure, Function and Genetics 7, 205-214

Kelly S.M., Price N.C. (1997) The application of circular dichroism to studies of protein folding and unfolding. Biochimica et Biophysica Acta 1338, 161-185

Kelly S.M., Jess J.T., Price N.C. (2005) How to study proteins by circular dichroism. Biochimica et Biophysica Acta 1751, 119-139


Kelly S.M., Price N.C. (2000) The use of circular dichroism in the investigation of protein structure and function. Current Protein and Peptide Science 1, 349-384

Lasch P., Haensch W., Naumann D., Diem M. (2004) Imaging of colorectal adenocarcinoma using FT-IR microspectroscopy and cluster analysis. Biochimica et Biophysica Acta 1688, 176-186

Long D.A. (2002) The Raman effect, John Wiley & Sons, Chichester

Manzano E., Navas N., Checa-Moreno R., Rodriguez-Simon L., Capitan-Vallvey L.F. (2009) Preliminary study of UV ageing process of proteinaceous paint binder by FT-IR and principal component analysis. Talanta 77, 1724-1731

McColl I.H., Blanch E.W., Gill A.C., Rhie A. G.O., Ritchie M.A., Hecht L., Nielsen K., Barron L.D. (2003) A new perspective on β-sheet structures using vibrational Raman optical activity: From Poly(L-lysine) to the prion protein. Journal of the American Chemical Society 125, 10019-10026

Nafie L.A. (2011) Vibrational optical activity: principles and application, John Wiley & Sons, Chichester

Lee S.H., Krimm S. (1998) General treatment of vibrations of helical molecules and application to transition dipole coupling in amide I and amide II modes of α-helical poly(L- alanine). Chemical Physics 230,277–295.

Navas N., Romero-Pastor J., Manzano E., Cardell C. (2008) Benefits of applying combined diffuse reflectance FTIR spectroscopy and principal component analysis for the study of blue tempera historical painting. Analytica Chimica Acta 630, 141-149

Nordén B., Kurucsev T. (1994) Analysing DNA complexes by circular and linear dichroism. Journal of Molecular Recognition 7, 144-155

Oboodi M.R., Davies M.A., Gumnia U., Blackburn M.B., Diem M. (1985) Instrumental advances in Raman optical activity. Journal of Raman Spectroscopy 16, 366-372

Raman C. V. and Krishnan K.S. (1929) The production of new radiations by light scattering-Part 1. Proceedings of the Royal Society 122, 23-35

Raman C.V. (1928) A new radiation. Indian Journal of Physics 2, 387-398

Raman C.V. and Krishnan K.S. (1928) The negative absorption of radiation. Nature 122, 12-13

Romeo M.J., Diem M. (2005) Infrared spectral imaging of lymph nodes: strategies for analysis and artefact reduction. Vibrational Spectroscopy 38, 115-119

Rodger A., Norden B. (1997) Circular Dichroism and Linear Dichroism, Oxford University Press, Oxford


Saxena V.P., Wetlaufer D.B. (1971) A new basis for interpreting the circular dichroic spectra of proteins. Proceedings of the National Academy of Sciences of the United States of America 68, 969-972

Smith E., Dent G. (2005) Modern Raman Spectroscopy: A Practical Approach, John Wiley & Sons, Chichester

Strachan C.J., Pratiwi D., Gordon K.C., Rades T. (2004) Quantitative analysis of polymorphic mixtures of carbamazepine by Raman spectroscopy and principal components analysis. Journal of Raman Spectroscopy 35, 347-352

Stoler M.H., Schiffman M. (2001) Interobserver reproducibility of cervical cytologic and histologic interpretations: realistic estimates from the ASCUS-LSIL triage study. Journal of the American Medical Association 285, 1500-1505

Thompson I. M., Pauler D. K., Goodman P. J., Tangen C. M., Lucia M. S., Parnes H. L., Minasian L. M., Ford L. G., Lippman S. M., Crawford E. D., Crowley J. J., Coltman C. A. Jr (2004) Prevalence of prostate cancer among men with a prostate-specific antigen level ≤4.0 ng per millilitre. New England Journal of Medicine 350, 2239–2246

Tsuboi M., Suzuki M., Overman S.A., Thomas G.J. (2000) Intensity of the polarized Raman band at 1340-1345 cm⁻¹ as an indicator of protein α-helix orientation: application to Pf1 filamentous virus. Biochemistry 39, 2677-2684

Woody R.W. (1995) Circular dichroism. Methods in Enzymology 246, 34-71

Yang J.T., Wu C.S.C., Martinez H.M. (1986) Calculation of protein conformation from circular dichroism. Methods in Enzymology 130, 208-269


3. Machine Learning

3.1 Introduction

There has been an explosion in the amount of biological data in recent years resulting from high-throughput experiments. This has culminated in a rapid increase in the number of databanks holding large quantities of data, such as GenBank and the Protein Data Bank (PDB).

As a result, the rate at which information is retrieved from these data must keep pace to prevent bottlenecks in the flow from raw data to fully analysed information. One of the ways this can be achieved, in addition to wet lab experiments, is the use of software analysis tools. Bioinformatics employs mathematical, statistical and computer science methods to analyse biological data.

Some bioinformatics problems can be modelled in a machine learning framework. Machine learning methods search for and exploit complex patterns in data. 'The field of pattern recognition is concerned with discovery of regularities in data through the use of computer algorithms which allows classification of data into categories' (Bishop, 2006). The learning problem is presented as a mapping x → y, where x is a large dataset {x1,...,xN} called the training set and y is the set of categories into which x will be classified; y represents a target vector holding the class. Given a training set (x1, y1),...,(xm, ym), the main goal is to find the function that best generalises to previously unseen data, assigning it the best predicted class y.

The function f(x) takes new data x as its input and outputs the predicted vector y. The precise output of the function can be adjusted by varying its parameters for optimal performance. This is referred to as learning the classifier, also known as training the model.

Once the model is trained it can be used to predict class labels for new examples, which constitute the test set. The ability to identify new examples is known as generalisation. The original variable inputs are usually transformed into a new feature space to reduce variability and speed up computing time; this is referred to as feature extraction. Cases where the input training data have known corresponding target vectors are known as supervised learning problems. If the aim of the application is to assign the input training data to one of a distinct number of categories, that is termed classification. If the desired outputs are continuous variables, that is termed regression. In certain pattern recognition problems the training set has no known target vectors; in this case the aim is to find groups of similar examples, as in clustering methods.
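The training/prediction workflow described above can be illustrated with a deliberately simple supervised classifier. The sketch below (plain Python with invented data, using a nearest-centroid rule rather than any method employed in this thesis) trains on labelled examples and then generalises to an unseen input:

```python
# Illustrative sketch only: a minimal supervised classifier (nearest
# centroid) trained on labelled 2-D points, then used to predict the
# class of an unseen example. All data values are invented.

def train(training_set):
    """Learn one centroid (mean vector) per class label."""
    sums, counts = {}, {}
    for x, y in training_set:
        s = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            s[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}

def predict(model, x):
    """Assign x to the class whose centroid is nearest (Euclidean)."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(model, key=lambda y: dist2(model[y]))

# Training set: (input vector x, target label y) pairs.
training_set = [([1.0, 1.0], "A"), ([1.2, 0.8], "A"),
                ([4.0, 4.0], "B"), ([3.8, 4.2], "B")]
model = train(training_set)
label = predict(model, [1.1, 0.9])   # unseen test example
```

The split between a labelled training set and a held-out test set, and the use of the trained model on new inputs, mirrors the procedure used throughout this chapter.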

In our analyses, supervised learning was used to train the models for Support Vector Machine (SVM) regression and classification applications. The other machine learning methods employed, which are discussed in later sections, were Random Forests classification and Partial Least Squares (PLS) regression.

3.2 Support Vector Machines (SVM) Classification

Support Vector Machine (SVM) classification, introduced by Vapnik (1995), aims to find patterns in data. Given a SVM classification problem with a set of data points x and class labels y, we want to learn the mapping X → Y, where x ∈ X is a data vector and y ∈ Y is a class label. The goal of SVM classifiers is to establish a function that divides X, leaving all points of the same class on the same side of a separating margin while maximising the minimum distance between the two classes and the hyperplane. This is referred to as generalisation: given a previously unseen x ∈ X, the classifier finds the suitable y ∈ Y. In 2-class classification, as used in our analyses, the classes are assigned as y = 1 or y = −1. Finding the suitable class label involves learning or training the classifier y = f(x, α), where f is the generalisation function and α are the parameters that are varied to optimise the separating function. The training data set comprises N input vectors x1,...,xN with corresponding target values y1,...,yN. Different functions can generalise a given set of data to give completely different sets of test label predictions, so it is necessary to have a way of choosing the most suitable function. The Support Vector Machine algorithm approaches this problem by searching for the separating boundary with the maximum margin, defined as the smallest distance between the decision boundary and any of the data points, as shown below in figure 3.1. The 'optimal' boundary is the hyperplane most distant from both the +1 and −1 data sets.

Figure 3.1 An overview of the training procedure. The classifier is trained on training data to find the optimum parameters of the classifying model without overfitting; overfitting makes the model a poor discriminator when used to predict on new data. (a) Training data producing an overfitting model. (b) Using the overfitting model on the test data yields poor results. (c) Training the model on training data to improve performance. (d) Using the improved model on the test data yields better generalisation results. Distinct marker symbols in the figure represent the training and test data. (Chih-Wei, 2008)


The simplest SVM model of linear separation of the input space is represented by Eq. (3.1) below:

y(x) = wᵀφ(x) + b        (3.1)

where w is a weight vector, φ represents the function that projects the data into feature space and b is the intercept term. An input vector x is assigned to class y = +1 if y(x) ≥ 0 and to class y = −1 if y(x) < 0. The decision boundary is therefore determined by the relation y(x) = 0. If two points xa and xb lie on the decision boundary, so that y(xa) = y(xb) = 0, then wᵀ(xa − xb) = 0; hence the vector w is orthogonal to every vector along the decision boundary and determines its orientation. By geometry, the distance between the bounding hyperplanes is 2/||w||. Among all the possible hyperplanes separating the data there is a unique one that yields the maximum margin of separation between the classes, so we want to maximise 2/||w||, or equivalently minimise ||w|| (Burges, 1998; Muller et al., 2001; Chen et al., 2007). Constraints are added to ensure data points are on either side of the separating hyperplane and not in between.

w·xi + b ≥ +1 for yi = +1        (3.2)

w·xi + b ≤ −1 for yi = −1        (3.3)

Equations (3.2) and (3.3) can be combined and written as

yi(w·xi + b) ≥ 1        (3.4)

where yi is 1 if xi belongs to one class and −1 if xi belongs to the other class. If the boundary classifies the vectors correctly then Eq. (3.4) holds for every point, with equality for the points closest to the boundary, and the distance 1/||w|| from these points to the hyperplane is identical to the half-margin shown in figure 3.2 below. The capacity (the ability of a function to separate data in all possible ways) decreases with increasing margin, so maximising the margin leads to a specific choice of decision boundary. In order to maximise the margin, the parameters w and b have to be optimised.
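The margin and the constraints of Eq. (3.4) can be checked numerically. The sketch below uses an invented 2-D data set and a hand-chosen hyperplane (w, b), not one learned by an SVM:

```python
# Sketch of the margin constraints on a toy 2-D set: verify
# yi*(w.xi + b) >= 1 for every point and compute the margin 2/||w||.
# The hyperplane (w, b) and the points are invented for illustration.

import math

w, b = [1.0, 1.0], -3.0          # hypothetical separating hyperplane
points = [([1.0, 0.5], -1), ([0.5, 1.0], -1),   # class y = -1
          ([3.0, 2.0], +1), ([2.5, 2.5], +1)]   # class y = +1

def decision(x):                  # y(x) = w.x + b
    return sum(wi * xi for wi, xi in zip(w, x)) + b

# Every point must satisfy yi * y(xi) >= 1 (Eq. 3.4)
constraints_hold = all(y * decision(x) >= 1 for x, y in points)

margin = 2.0 / math.sqrt(sum(wi * wi for wi in w))   # 2/||w||
```

For this toy hyperplane all four constraints hold and the margin is 2/√2 ≈ 1.41; an SVM solver would search over all (w, b) satisfying the constraints for the one maximising this quantity.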

[Figure: two classes, y = 1 and y = −1, separated by the hyperplane w·x + b = 0 with bounding hyperplanes w·x + b = +1 and w·x + b = −1]

Figure 3.2 The SVM algorithm finds the largest distance between the bounding hyperplanes. The vector w is a normal vector: it is perpendicular to the hyperplane. The algorithm chooses w and b to maximise the distance between the parallel hyperplanes while still separating the data. (Schölkopf and Oldenbourg, 1997)

The support vectors lie closest to the decision boundary and are critical to the separation of the data points.

3.2.1 Soft Margin

The above discussion applies only to linearly separable sets. If the sets are not linearly separable, for example because a high noise level causes a large overlap of the classes, a hyperplane exactly classifying the sets does not exist. The method called the soft margin is a solution to such cases (Chen et al., 2007). This method replaces the restriction in Eq. (3.4) with the following:

yi(w·xi + b) ≥ 1 − ξi,   ξi ≥ 0        (3.5)

where the ξi, called slack variables, are positive variables that indicate tolerances of misclassification. If the point xi satisfies the inequality in Eq. (3.4), the term ξi is zero and Eq. (3.5) reduces to Eq. (3.4). If, on the other hand, point xi does not satisfy the inequality in Eq. (3.4), the term −ξi is added to the right hand side of Eq. (3.4) to obtain the inequality in Eq. (3.5). This replacement means that a training vector is allowed to lie in a limited region on the erroneous side of the boundary. A classifier that generalises well can be found by controlling both the classifier capacity, through variation of the margin via ||w||, and the sum of the slacks Σi ξi. Several variants of the soft margin SVM algorithm have been proposed; for example, the C-SVM introduces the C parameter (Chen et al., 2007):

minimise  ½||w||² + C Σi ξi   subject to   yi(w·xi + b) ≥ 1 − ξi,   ξi ≥ 0        (3.6)

where C is a parameter that controls the trade-off between the requirement for a large margin and the tolerance of a few misclassifications. The parameter C can be regarded as a regularization parameter.
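The role of the slack variables and the C parameter can be made concrete. In the sketch below the hyperplane, the data points and the value of C are all invented; it simply evaluates ξi = max(0, 1 − yi(w·xi + b)) and the resulting C-SVM objective:

```python
# Illustrative sketch of the soft margin: for a fixed hypothetical
# hyperplane, the slack xi = max(0, 1 - yi*(w.xi + b)) measures how far
# each point violates the margin, and the C-SVM objective trades margin
# width against total slack. Data and (w, b, C) are invented.

w, b, C = [1.0, 0.0], -2.0, 10.0
points = [([3.5, 1.0], +1),   # well inside the +1 side: slack 0
          ([2.5, 0.0], -1)]   # on the wrong side: positive slack

def slack(x, y):
    margin_term = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    return max(0.0, 1.0 - margin_term)          # slack variable

slacks = [slack(x, y) for x, y in points]
objective = 0.5 * sum(wi * wi for wi in w) + C * sum(slacks)
```

A larger C penalises the slack terms more heavily, pushing the solution towards fewer misclassifications at the cost of a narrower margin.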

3.2.2 Kernel Methods

In cases where the data are not linearly separable, the data are projected into a higher dimensional space using a kernel function. The fundamental concept of a kernel method is a mapping of the vector space itself to a higher dimensional feature space. This makes it applicable to non-linear classification problems (Bishop, 2006; Tsuda and Vert, 2004).

Various methods can then be applied in that space to find relations between data points. The data are represented by an n × n matrix of pairwise comparisons ki,j = k(xi, xj). Most kernel methods can only process square matrices which are symmetric and positive semi-definite; that is, the matrix should satisfy ki,j = kj,i for any 1 ≤ i, j ≤ n. The kernel function makes it possible for kernel methods to operate in the feature space without ever computing the coordinates of the data in that space, but rather by simply computing the inner products between all pairs of data in the feature space (Tsuda and Vert, 2004). This operation is often computationally cheaper than the explicit computation of the coordinates.

Linear kernel

The linear kernel is the simplest of the kernel functions. Given vectors x and x′, they are compared using their inner product:

k(x, x′) = ⟨x, x′⟩ = xᵀx′        (3.7)

If the data analysed are not vectors, they can first be represented as vectors by mapping them into a feature space using a transformation φ; the kernel function is then defined as the pairwise inner product of the transformed vectors:

k(x, x′) = ⟨φ(x), φ(x′)⟩        (3.8)

The transformation φ does not need to be computed explicitly for every point, as only the pairwise dot products are necessary. This makes computation simpler by reducing the effective dimensionality.
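A kernel (Gram) matrix of pairwise comparisons can be sketched as follows, here with the linear kernel on invented vectors, verifying the required symmetry ki,j = kj,i:

```python
# Illustrative sketch: the linear kernel compares vectors by their inner
# product, and the n x n kernel (Gram) matrix k[i][j] = k(xi, xj) built
# from it is symmetric, as kernel methods require. Data are invented.

def linear_kernel(x, z):
    return sum(a * b for a, b in zip(x, z))     # k(x, z) = <x, z>

data = [[1.0, 2.0], [0.0, 1.0], [2.0, 0.5]]
n = len(data)
K = [[linear_kernel(data[i], data[j]) for j in range(n)] for i in range(n)]

symmetric = all(K[i][j] == K[j][i] for i in range(n) for j in range(n))
```

Any valid kernel substituted for `linear_kernel` yields a matrix with the same symmetry property, which is what allows the classifier to work entirely from pairwise comparisons.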

Gaussian Radial Basis Function kernel

The Gaussian Radial Basis Function introduces the notion of distance into the measure of similarity. The output of the kernel depends on the Euclidean distance of xj from xi.

Given a set of input vectors {x1,…,xN} with their corresponding target values {y1,…,yN}, the aim is to find a function f(x) that fits every target value exactly. This is achieved by expressing f(x) as a linear combination of radial basis functions, one centred on every data point:

f(x) = Σn wn exp(−||x − xn||² / 2σ²)        (3.9)

A support vector will be the centre of an RBF, and σ determines the area of influence this support vector has over the data space. A larger value of σ will give a smoother decision surface and a more regular decision boundary, because an RBF with large σ allows a support vector to exert a strong influence over a larger area. A larger σ value also increases the value of α (the Lagrange multiplier) for the classifier: when one support vector influences a larger area, all other support vectors in the area increase in α value to counter this influence, so all α values reach a balance at a larger magnitude. A larger σ value will also reduce the number of support vectors, since each support vector can cover a larger space and fewer are needed to define a boundary (Bishop, 2006).
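The effect of σ on the area of influence can be seen by evaluating the Gaussian RBF kernel for one pair of invented points at a small and a large σ:

```python
# Illustrative sketch of the Gaussian RBF kernel: similarity decays with
# Euclidean distance, and a larger sigma gives each centre a wider area
# of influence, as discussed above. The sigma values are arbitrary.

import math

def rbf_kernel(x, z, sigma):
    d2 = sum((a - b) ** 2 for a, b in zip(x, z))      # squared distance
    return math.exp(-d2 / (2.0 * sigma ** 2))

x, z = [0.0, 0.0], [2.0, 0.0]
narrow = rbf_kernel(x, z, sigma=0.5)   # small sigma: influence dies off fast
wide = rbf_kernel(x, z, sigma=5.0)     # large sigma: still highly similar
```

With σ = 0.5 the two points are effectively dissimilar (kernel value near zero), while with σ = 5 they remain highly similar, illustrating why large σ smooths the decision surface.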

Polynomial kernel

The polynomial kernel function is directional, i.e. the output depends on the direction of the two vectors in the low-dimensional space, because of the dot product within the kernel. The equation below shows a polynomial kernel function:

k(x, x′) = (⟨x, x′⟩)ᵈ        (3.10)

where d is the degree of the function.
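For a degree-2 polynomial kernel, the kernel trick can be verified directly: k(x, z) = (x·z)² equals the inner product of an explicit feature map φ(x) = (x1², √2·x1x2, x2²), so the high-dimensional coordinates never need to be computed. The vectors below are invented:

```python
# Illustrative sketch of the kernel trick for a degree-2 polynomial
# kernel: (x.z)^2 equals phi(x).phi(z) for the explicit quadratic
# feature map phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2).

import math

def poly2_kernel(x, z):
    return sum(a * b for a, b in zip(x, z)) ** 2

def phi(x):
    x1, x2 = x
    return [x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2]

x, z = [1.0, 2.0], [3.0, 1.0]
via_kernel = poly2_kernel(x, z)                       # (x.z)^2
via_map = sum(a * b for a, b in zip(phi(x), phi(z)))  # phi(x).phi(z)
```

Both routes give the same value, but the kernel route never constructs the three-dimensional feature vectors; for higher degrees and dimensions the saving becomes substantial.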


[Figure: the input data vector V is compared with support vectors x1, x2, x3, x4 via kernel functions k; the weighted outputs λ feed the classification function f(x) = sgn(w·x + b)]

Figure 3.3 The SVM architecture. The diagram shows the overall structure of SVMs. Similarity measures are applied to the input vector V in the form of kernel functions, which vary from dot product kernels for linear separation to Gaussian and sigmoid kernels for non-linear separation. Only the support vectors are included in the optimal classification solution. The weights λi are adjusted to find the optimal separating function f(x). (Scholkopf B., 2005)

Figure 3.3 above is a schematic representation of the classification process in an SVM. If the training data of the input space cannot be separated linearly, the data are mapped into a higher dimensional space called feature space, where they can be linearly separated. To reduce the complexity of dimensionality, kernel methods are used for non-linear data: the nonlinear transformation is replaced by an inner product defined by a kernel function. The criteria for selecting kernel functions are an area of intense research in SVMs (Tao Wu et al., 2001). At present, selection of kernels is done by experimentation and experience. Test or training errors are also used as a measure of the efficiency of the kernel. It is important to obtain a function that generalises not only on the training data but also on unseen examples; to achieve this it may be necessary to select a kernel function with some training error. This presents a trade-off between a low training error and the ability to generalise.


3.3 SVM Regression

Support Vector Machines (SVMs) can be applied to regression problems by using variations of the loss function. A loss function represents a measure of the loss of accuracy arising from the difference between the true observed value and the estimated or predicted value.

SVM regression optimises the generalisation margins for regression using Vapnik's loss function, which allows for some errors situated within a certain distance of the true value (Vapnik, 1995). This type of function is called the epsilon-insensitive (ε-insensitive) loss function and is defined as:

L(y, f(x)) = 0 if |y − f(x)| ≤ ε;  |y − f(x)| − ε otherwise        (3.11)

where ε > 0 is a predefined threshold value that controls the amount of deviation allowed from the true value. The aim under the ε-insensitive loss function is to find a function f(x) that has at most ε deviation from the desired targets for all training data and at the same time is as flat as possible, i.e. produces flat hyperplanes in the n-dimensional space.
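The ε-insensitive loss of Eq. (3.11) is straightforward to sketch; the threshold and the predictions below are arbitrary illustrations:

```python
# Illustrative sketch of the epsilon-insensitive loss: deviations within
# epsilon of the true value cost nothing; larger deviations are charged
# only for the excess beyond epsilon. All values are invented.

def eps_insensitive_loss(y_true, y_pred, eps):
    dev = abs(y_true - y_pred)
    return 0.0 if dev <= eps else dev - eps

# True value 10.0, band half-width 0.5: the first two predictions are
# inside the band (zero loss), the third overshoots by 1.5.
losses = [eps_insensitive_loss(10.0, p, eps=0.5) for p in (10.2, 10.5, 12.0)]
```

This tolerance band is what distinguishes SVM regression from, for example, least-squares fitting, where every deviation contributes to the loss.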

In SVM regression, the input vector x is first mapped onto an n-dimensional feature space using some fixed (nonlinear) mapping, and then a linear model is constructed in this feature space (Farag and Mohamed, 2004; Smola and Schölkopf, 2004). Using mathematical notation, the linear model (in the feature space) is given by Eq. (3.12):

f(x) = w·φ(x) + b        (3.12)

where w is the weight vector, φ(x) is the mapping of the input x into the feature space, b is the offset value and w·φ(x) is the dot product of the vectors w and φ(x), all values being represented in the feature space.


The SVM algorithm aims to find a function, Eq. (3.12) above, that is flat and fits all the training data points. This is achieved by minimising the norm ||w||², since a small ||w|| corresponds to a flat function. The problem can thus be represented as a convex optimisation problem, i.e. the solution is subject to constraints and lies within a specified subspace:

minimise  ½||w||²        (3.13)

subject to  yi − w·φ(xi) − b ≤ ε  and  w·φ(xi) + b − yi ≤ ε        (3.14)

The constraints in Eq. (3.14) set limits on the accepted solution set. However, there can be cases where solutions fall outside this limit. This can be handled by introducing slack variables ξi, ξi*, i = 1,...,n, to measure the deviation of training samples outside the ε-insensitive zone (figure 3.4). Eq. (3.13) and Eq. (3.14) are then formulated as:

minimise  ½||w||² + C Σi (ξi + ξi*)        (3.15)

subject to  yi − w·φ(xi) − b ≤ ε + ξi,  w·φ(xi) + b − yi ≤ ε + ξi*,  ξi, ξi* ≥ 0        (3.16)


[Figure: one-dimensional regression data plotted as y against x with an ε-insensitive band around the fitted function]

Figure 3.4 One-dimensional regression with an epsilon (ε) insensitive band. The band marks the margins within which deviations incur no loss; the points marked ξ along the dashed lines represent data points that fall outside the 'accepted' margin of error.

Figure 3.5 shows the one-dimensional linear regression function with the epsilon-insensitive band. The slack variables ξi, ξi* measure the cost of the errors on the training points; the value of the slack variable is zero for all points inside the band.

[Figure: the regression line x → wx + b with the ε-tube bounded by x → wx + b + ε and x → wx + b − ε]

Figure 3.5 Detailed picture of the epsilon band with slack variables and selected data points. ε defines the epsilon-insensitive region; ξi and ξi* mark the data points outside the error margin, i.e. points that have fallen outside the allowed error limits, measured by the slack variables. The lines wx + b + ε and wx + b − ε mark the optimum margin of separation; wx + b is the hyperplane.

75

In figure 3.5, ξi measures the error induced by observations larger than the upper bound of the ε-tube and ξi* the error for observations smaller than the lower bound. The dashed lines indicate the boundaries of the region where the loss is zero, i.e. where the target value and the true observed value are close.
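The regression slack variables of Eqs (3.15)–(3.16) can be illustrated for a fixed, hand-chosen one-dimensional model f(x) = wx + b (not a fitted SVM):

```python
# Illustrative sketch of the regression slack variables: for a fixed
# hypothetical linear model f(x) = w*x + b, xi measures how far a target
# lies above the epsilon-tube and xi_star how far below. Data invented.

w, b, eps = 2.0, 1.0, 0.5

def slacks(x, y):
    f = w * x + b
    xi = max(0.0, y - f - eps)        # target above the upper bound
    xi_star = max(0.0, f - y - eps)   # target below the lower bound
    return xi, xi_star

inside = slacks(1.0, 3.2)    # f = 3.0, |deviation| = 0.2 <= eps: both zero
above = slacks(1.0, 4.0)     # deviation +1.0: xi = 0.5
below = slacks(1.0, 1.5)     # deviation -1.5: xi_star = 1.0
```

Only the excess beyond the ε-tube is charged, mirroring the zero-loss band in figures 3.4 and 3.5.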

3.4 Random Forests

3.4.1 Introduction

Random Forests (RF) was developed by Breiman (2001) and uses a combination of many decision trees, such that each tree depends on randomly selected vectors. To grow the trees, the Random Forest algorithm generates random vectors containing random integers. These vectors determine the way in which the training data set is split into subsamples: the integer values in each vector specify which examples from the training data are selected for each subsample. For each subsample the best splitter is used to split the node in question; this is done recursively for each vector, creating an ensemble of trees, which lends its name to the technique. Each subsample then becomes a classifier to which an input vector is applied for classification by the best split. At each node the best splitter is determined using the Gini index, which chooses the split that most increases the information gain. The information is computed from the probabilities of the possible outcomes of each split; it is measured in units of bits and referred to as entropy. The information gain is the difference in information before and after a sample is split, and the split with the most information gain is the most favourable. Each tree in the ensemble is grown to maximal size by repeating the splitting process at each node, such that a different subset of variables is selected from the training data, the number of variables in each subset being set by the integers in the random vectors generated.


3.4.2 Gini Index

The criterion used for splitting the nodes in RF trees is the Gini index rule (Breiman, 1984). The splitting decision is based upon the decrease in impurity at the node; the impurity at a node reflects the proportion of misclassified samples at that node, so the higher the number of samples at a node belonging to one class, the purer the node. The Gini value of a split at a node into two subsets is defined as:

Gini(sp, n) = PL·imp(nL) + PR·imp(nR)        (3.17)

where n is a given node and sp is a split of the node. After splitting, the new nodes are referred to as the left and right child nodes:

PL is the proportion of samples at node n that pass into the left child node nL

PR is the proportion of samples at node n that pass into the right child node nR

imp(nL) is the impurity of the left child node nL

imp(nR) is the impurity of the right child node nR

The chosen split maximises the average purity of the two child nodes; the selected splits are those that decrease the Gini index the most. The impurity value imp for a node n is:

imp(n) = 1 − (p1² + p2² + ... + pc²)    (3.18)

where p1,...,pc are the frequency ratios of classes 1,...,c in node n

After splitting the node n into nR and nL child nodes, containing NR and NL samples respectively, the Gini value ∆imp(sp,n) is calculated from the impurities of the right and left nodes as:


∆imp(sp,n) = (NL/N)·imp(nL) + (NR/N)·imp(nR)    (3.19)

where N is the total number of samples in the node, so that PL = NL/N and PR = NR/N. The split variable that generates the smallest Gini value ∆imp(sp,n) is chosen to split the node on.
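As an illustration (not part of the analyses reported in this thesis), the impurity of Eq. 3.18 and the split value of Eq. 3.19 can be sketched in Python; the function names are my own:

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity imp(n) = 1 - sum of squared class frequencies (Eq. 3.18)."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def split_value(left_labels, right_labels):
    """Weighted child impurity (Eq. 3.19); the smallest value marks the best split."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini_impurity(left_labels) + \
           (len(right_labels) / n) * gini_impurity(right_labels)

# A pure node has impurity 0; a 50/50 node has impurity 0.5.
print(gini_impurity(["a", "a", "a", "a"]))  # 0.0
print(gini_impurity(["a", "a", "b", "b"]))  # 0.5
# A split that separates the two classes perfectly scores 0.
print(split_value(["a", "a"], ["b", "b"]))  # 0.0
```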

3.4.3 Random Forest prediction

The trees are grown to their maximal size. Classification trees assign each case to one class only; an ensemble of 75 trees, for example, produces 75 assignments for each case. To combine the trees, the votes for each class are counted and the class with the most votes is the overall 'winner'. The number of votes is the RF score, and the proportion of the votes corresponds to the probability of class membership. RF allows class weights to be set to correct imbalances in the data and to boost the accuracy of a specified class; this is described in the methods chapter 4, section 4.4.3.
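The voting scheme above can be sketched as follows (an illustrative Python fragment with a hypothetical function name, including the optional class weighting):

```python
from collections import Counter

def forest_predict(tree_votes, class_weights=None):
    """Combine per-tree class assignments for one case by (optionally
    weighted) majority vote; return the winner and its share of the vote."""
    weights = class_weights or {}
    tally = Counter()
    for label in tree_votes:
        tally[label] += weights.get(label, 1.0)
    winner, score = tally.most_common(1)[0]
    return winner, score / sum(tally.values())

# 75 trees voting on one case: 50 say "alpha", 25 say "beta".
votes = ["alpha"] * 50 + ["beta"] * 25
winner, share = forest_predict(votes)
print(winner, share)                                   # alpha wins with 2/3 of the vote
weighted = forest_predict(votes, class_weights={"beta": 3.0})
print(weighted[0])                                     # up-weighting "beta" reverses the outcome
```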

3.4.4 Out-of-Bag (OOB) data

Each tree is grown out of 2/3 of the original data while the remaining 1/3 are used to test a single tree. The left out data, referred to as Out-of-bag (OOB), is used to assess the performance accuracy of the model at each tree (Breiman, 1996a; 1996b). Model results are combined to give a single prediction through a voting system for classification problems. All training data is used in the training, but only a subset at a time.
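The roughly one-third of samples left out of each tree arises from bootstrap sampling (drawing n examples with replacement); the expected out-of-bag fraction is (1 − 1/n)ⁿ ≈ 1/e ≈ 0.37. A quick numerical check (illustrative, not the thesis code):

```python
import random

random.seed(0)
n = 1000                # number of training samples
indices = range(n)

# Bootstrap sample: n draws with replacement, as used to grow one tree.
in_bag = set(random.choices(indices, k=n))
oob = [i for i in indices if i not in in_bag]

# About a third of the samples are left out-of-bag for this tree.
print(len(oob) / n)
```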

3.4.5 Distance Metric

RF grows trees to their full size without pruning. The left-out data are run down each tree: if observations i and j end up in the same terminal node, the similarity between i and j is increased by one. The more similar two data records are, the more likely they are to land in the same terminal node of a tree. The distance metric can be used to construct a dissimilarity matrix

for input into multidimensional scaling (MDS) (Groenen and van de Velden, 2004; Naud, 2006). The dissimilarity matrix represents the pairwise relationships between the samples derived from the data features. Dissimilarity matrices reduce the high dimensionality of large data sets, as the size of the matrix depends only on the number of objects (O(N²)) and is independent of the data's dimension. Multidimensional scaling (MDS) is a method for analysing dissimilarity matrices to produce distances in a specified number of dimensions. MDS analysis based on two dimensions, as used for the analysis presented here, produces a two-dimensional map of the data. MDS aims to arrange the data points in a space with a specified number of dimensions so as to reproduce the observed distances; in this case the MDS represents the RF distances in a two-dimensional map.
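The proximity counting described above can be sketched as follows (a minimal Python illustration; the input format and function name are my own):

```python
def proximity_matrix(leaf_ids):
    """leaf_ids[t][i] = terminal node reached by sample i in tree t.

    Proximity between samples i and j is the fraction of trees in which
    they land in the same terminal node; dissimilarity is 1 - proximity.
    """
    n_trees = len(leaf_ids)
    n = len(leaf_ids[0])
    prox = [[0.0] * n for _ in range(n)]
    for leaves in leaf_ids:
        for i in range(n):
            for j in range(n):
                if leaves[i] == leaves[j]:
                    prox[i][j] += 1.0 / n_trees
    return prox

# Three trees, four samples: samples 0 and 1 always share a leaf.
leaves = [[0, 0, 1, 2], [3, 3, 4, 4], [5, 5, 5, 6]]
prox = proximity_matrix(leaves)
dissimilarity = [[1.0 - p for p in row] for row in prox]
print(dissimilarity[0][1])  # 0.0: samples 0 and 1 are maximally similar
```

The resulting dissimilarity matrix is what would be passed to an MDS routine.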

As stated above, the MDS moves the data variables around in order to find the configuration that best estimates the observed distances. The MDS algorithm evaluates all the different possible configurations and uses a function minimisation with the aim of maximising the goodness-of-fit to choose the optimum configuration. The most common measure of goodness-of-fit of the approximation between the original dissimilarities and the interpoint distances produced by MDS is the stress measure.

In MDS, for a set of n m-dimensional data items, a configuration X of n points in a reduced r-dimensional space is sought that minimises the stress function. Stress is defined in terms of the squared differences between the dissimilarities and the distances among the n r-dimensional points in the output space.

There are variations in the calculation of stress. In metric stress, the stress is normalised by the sum of squares of the dissimilarities. In squared stress (sstress), the stress is normalised by the fourth power of the Euclidean distances. In Sammon mapping, the sum of the squared differences between the distances and the dissimilarities, each scaled by the corresponding dissimilarity before summing, is normalised by the sum of the dissimilarities (Ahmed et al., 2005). Sammon mapping was chosen for our data because it gave the smallest stress value; the smaller the stress function, the better the model represents the input data (Jaworska and Chupetlovska-Anastasova, 2009). The actual axes on the MDS plot are arbitrary, meaning that the distances remain the same whichever way the map is oriented.
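One common form of the Sammon stress described above can be computed directly (an illustrative Python sketch; a perfect configuration gives zero stress):

```python
def sammon_stress(D, d):
    """Sammon stress between original dissimilarities D and map distances d:
    E = (1 / sum_{i<j} D_ij) * sum_{i<j} (D_ij - d_ij)**2 / D_ij
    """
    total, err = 0.0, 0.0
    n = len(D)
    for i in range(n):
        for j in range(i + 1, n):
            total += D[i][j]
            err += (D[i][j] - d[i][j]) ** 2 / D[i][j]
    return err / total

# A configuration reproducing the dissimilarities exactly has stress 0.
D = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
print(sammon_stress(D, D))  # 0.0
```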

3.5 Partial Least Squares (PLS) Regression

3.5.1 Introduction

PLS regression, introduced by Wold (1966) in the social sciences, is a widely used technique in chemometrics and has been applied by various groups to data analyses such as correlating the expression levels of particular genes with those of other genes in a study of expression data from Saccharomyces cerevisiae (Datta, 2001), sensory evaluation (Martens and Naes, 1989) and the analysis of neuroimaging data (McIntosh et al., 1996). Boulesteix and Strimmer (2005) used a partial least squares approach to investigate the relationships between transcription factors using gene expression data.

PLS reduces dimensions of data and can handle a high number of variables (Chun, 2008).

Partial Least Squares (PLS) regression is a multivariate data analysis technique which can be used to relate several response (Y) variables to explanatory (X) variables. The method aims to identify the underlying factors, or linear combinations of the X variables, which best model the Y dependent variables. PLS can deal efficiently with data sets containing many highly correlated variables and substantial random noise. The need to predict Y response variables from X variables is a common problem in scientific data, for example in biology: estimating the rate of photosynthesis from the amount of light, or estimating the growth rate of a specimen from the amount of substrate.


Given a data set with response variables Y in matrix form and a data set with predictor variables X in matrix form, PLS is a multiple linear regression technique that builds a linear regression model (Boulesteix and Strimmer, 2007).

Y = XB + E    (3.20)

where Y is a p × n matrix of response variables, X is a p × m matrix of predictor variables, B is an m × n matrix of regression coefficients and E is a p × n matrix of noise terms. PLS extracts factors from the predictor matrix X, producing a factor score matrix T = XW, where W is a weight matrix.

The Y matrix is decomposed as the response matrix Y = UQᵀ + F, where U is the score matrix and Q is a matrix of regression coefficients (the loadings matrix) of U. The overall aim is to use the scores of X to predict the responses in the population. The factors T are used to predict the Y-scores U, and the predicted Y-scores are then used to construct predictions for the responses. The scores for X and Y are chosen such that the directions between the factors capture high variance, but are limited to the directions that yield accurate predictions (Höskuldsson, 2004; Tobias, 1997).

3.5.2 The PLS Model

The PLS model is developed from a set of P observations with M x-variables, xm where m = 1,...,M, and N y-variables, yn where n = 1,...,N, forming two matrices X of P by M and Y of P by N dimensions (Abdi, 2007). The PLS algorithm iteratively projects the rows and columns of the X-matrix into a lower-dimensional space as vectors: each column defines an element of the vector of loadings pi and each row defines an element of the scores vector ti. Each component defines one coordinate axis in this lower-dimensional space.


The algorithm then defines a lower-dimensional hyperplane with one axis per component. The direction cosines of these axes are the loadings pi. Each row i of the X-variable matrix is then projected down onto this hyperplane, with the coordinates of the projected data points given by the scores ti. The loadings are the direction cosines of each of the principal components with respect to each of the coordinate axes. The hyperplane is positioned such that it approximates the row vectors of the X-variable matrix while at the same time ensuring that the coordinates of the projected data, the scores ti, are related to the responses yi, as the aim of regression is to obtain a meaningful relationship between the X and Y matrices. A principal component can be pictured as a line drawn through the data showing the correlation between X and Y. A simple linear regression model between the score factors of X, ti (Eq. 3.21), and those from the Y response matrix, ui (Eq. 3.22), is then computed (Geladi and Kowalski, 1986).

X = TPᵀ + E    (3.21)

where T is the matrix of X-scores ti, P the matrix of loadings pi and E a matrix of X-residuals.

The Y response matrix is treated in the same way, decomposing into loadings q and score vectors u:

Y = UQᵀ + F    (3.22)

where U is the matrix of Y-scores ui, Q the matrix of loadings qi and F a matrix of Y-residuals.

The PLS algorithm extracts latent factors, u (as in Eq. 3.22) and t (as in Eq. 3.21) from the dependent variables Y and the predictor variables X respectively. These latent factors are defined as linear combinations constructed between predictor X and response variables Y, such that the original multidimensionality is reduced to a lower dimension that describes the structure in the relationships between predictor variables (X) and between these latent factors and the response variables (Y).

The latent factors t are extracted from the predictor variables X such that the explained variance in the dependent variables Y is maximised. The latent factors t are also referred to as X-scores and the latent factors u as Y-scores. New X-scores and Y-scores are obtained by varying the directions (loadings) in the factor space such that the captured variation is maximised while approaching the Y responses as closely as possible. The predicted score vectors are removed from the matrices Y and X and the steps reiterated with the remainders of Y and X until the X- and Y-scores converge. The PLS regression technique makes it possible to analyse the effects of linear combinations of several predictors on a response variable, and is suitable for the analysis of data where the number of predictor variables is higher than the number of observations (Carrascal et al., 2009).
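The score/weight steps above can be illustrated with a minimal single-factor PLS1 sketch for one response variable (illustrative Python, not the implementation used for the analyses in this thesis):

```python
import numpy as np

def pls1_one_component(X, y):
    """Extract one PLS latent factor for a single response y:
    weight w, X-scores t, X-loading p and inner regression coefficient b."""
    w = X.T @ y                      # covariance direction between X and y
    w = w / np.linalg.norm(w)        # normalised weight vector
    t = X @ w                        # X-scores (latent factor t)
    p = (X.T @ t) / (t @ t)          # X-loadings
    b = (y @ t) / (t @ t)            # regression of y on the scores
    return w, t, p, b

# Toy data in which y depends on the first predictor only,
# so a single latent factor captures the relationship exactly.
X = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 3.0]])
y = np.array([2.0, 0.0, 4.0, 0.0])
w, t, p, b = pls1_one_component(X, y)
y_hat = b * t
print(np.allclose(y_hat, y))  # True
```

In the full algorithm this step would be followed by deflation of X and y and extraction of further components.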

3.5.3 PLS number of components

As not all the principal components are used, PLS uses cross validation to choose the number of components (Geladi and Kowalski, 1986). Cross validation tests the prediction accuracy of the model. The data are divided into g groups; one group at a time is removed and a model produced from the remaining data. The model is then used to make predictions for the removed group, and the differences between the predicted and actual Y-values are calculated. This is repeated until every group has been left out once. The sum of squares of these differences is calculated for each candidate model, and the model that yields the smallest value is used.
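The group-wise cross-validation loop above can be sketched as follows. As a stand-in for varying the number of PLS components, this illustrative Python fragment varies the number of leading predictor columns in an ordinary least-squares model and computes the predicted residual sum of squares (PRESS) for each candidate:

```python
import numpy as np

def press(X, y, k, g=4):
    """PRESS for a model using the first k predictor columns,
    estimated by g-fold cross-validation."""
    folds = np.array_split(np.arange(len(y)), g)
    total = 0.0
    for hold in folds:
        train = np.setdiff1d(np.arange(len(y)), hold)
        coef, *_ = np.linalg.lstsq(X[train][:, :k], y[train], rcond=None)
        resid = y[hold] - X[hold][:, :k] @ coef
        total += float(resid @ resid)
    return total

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = X[:, 0] - 2.0 * X[:, 1] + 0.05 * rng.normal(size=40)

# Only the first two columns are informative, so PRESS should drop
# sharply once both are included.
scores = {k: press(X, y, k) for k in range(1, 6)}
best_k = min(scores, key=scores.get)
print(best_k)
```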

3.6 Chemometric studies of protein vibrational spectroscopy

Random Forest (RF) clustering was used to profile tumour samples based on tissue microarray data from patients with renal cell carcinoma (Shi et al., 2005) and visualised using multidimensional scaling plots based on the Random Forest dissimilarity. The RF clustering separated the grade 2 samples into two distinct groups with significantly different survival profiles: one group had a median survival time of 12 years and the other of 27 years. This showed that Random Forest clustering could differentiate tumour marker expression patterns and was sensitive to expression heterogeneity.

Another method of studying protein structure is the deconvolution of the amide bands into contributions from different structural signatures, where the abundance of each structure is proportional to the area under the curve. More recent methods include regression methods in which a relationship is established between a matrix of spectra (X) and a vector of responses (Y) that define the structural fractions, as discussed in Section 3.5. The interval Partial Least Squares (iPLS) regression method of Navea et al. (2005) was used to study the structural conformations of 24 proteins from infrared spectra of the amide I, II and III regions. iPLS regression develops local PLS models from sub-regions of the whole spectrum, cutting out the noise from the other regions. The spectra were divided into intervals varying from 38.6 to 7.7-9.6 cm-1, and all different combinations of intervals were then used to build PLS models to determine which spectral ranges would improve the predictive capability of the models.

The best iPLS model for prediction of α-helix structure was obtained from the amide I band, with the highest percentage variance of 79.18% and a root mean square (RMS) value of 0.12. The RMS value is defined in Chapter 4 of this thesis. The best iPLS model for β-sheet prediction from the amide I band had a percentage variance of 83.05% and an RMS value of 0.08. Navea et al. (2005) concluded that their approach revealed that the amide III band (1350-1200 cm-1) was not relevant to the prediction of secondary structure from the IR data.

Navea et al. (2005) found that the best model for the determination of β content was built from a combination of two intervals, 1647.1-1639.4 cm-1 from amide I and 1569.9-1562.2 cm-1 from amide II, with a percentage variation of 87.48% and an RMS value of 0.08. The best model for α determination was from the intervals 1668.3-1658.7 cm-1 and 1550.6-1542.9 cm-1, showing a percentage variation of 82.19% and an RMS of 0.10. These iPLS results showed the positive effect of combining two or more spectral bands in the determination of secondary structural content.

The PLS regression method was used by Dousseau and Pezolet (1990) to determine the structural fractions in infrared (IR) spectral data from 13 proteins, estimating the α-helix, β-sheet and undetermined structure content. The analyses concentrated on the amide I and II regions of the IR spectra. Less accurate determinations were obtained from models built on the amide I band alone than on the amide I and II bands combined. The PLS models from amide I and II gave an R2 value of 0.95 for α content, 0.96 for β-sheet content and 0.57 for undetermined content. The PLS models from the amide I band alone gave R2 values of 0.79 for α content, 0.79 for β-sheet content and 0.57 for undetermined content.

IR spectra of 18 proteins with known X-ray crystal structures were analysed (Lee et al., 1990) using factor analysis to investigate which spectral ranges within the amide regions most accurately estimate structural content. The group obtained high accuracies, with R2 values of 0.995 and 0.989 for α-helix and β-sheet respectively from normalised IR data. Analyses of 2nd-derivative IR data yielded slightly lower correlation values of 0.991 and 0.963 for α-helix and β-sheet respectively.

A study by Oberg et al. (2004) compared protein structural determination results from CD and IR spectra of 50 proteins with known X-ray crystallography structural references. The group's studies aimed to predict the fractions of the structural assignment classes α-helix, β-sheet, turns and others, and to investigate the structural information in the amide I and II regions. The IR amide I region, analysed by the PLS algorithm, showed root mean square (RMS) values of 8.97, 6.91 and 9.66 for α-helix, β-sheet and other respectively. The


IR amide II prediction analysis showed less accuracy, with RMS values of 7.53, 10.83 and 10.43. The group's results also showed that the CD and IR methods can complement each other, as analyses of combined IR+CD spectral data produced more accurate results than analyses of the separate IR and CD spectral data. Their analyses also showed lower correlation coefficient values when α-helix and β-sheet were further broken down into their sub-categories, for example 3₁₀ helix, π helix, anti-parallel β-sheet and parallel β-sheet.

A method was developed by Provencher and Glockner (1981) in which one circular dichroism spectrum at a time is analysed as a linear combination of 16 CD spectra.

Provencher and Glockner (1981) reported correlation coefficient values R2, with root mean square error values in brackets, from circular dichroism for α-helix, β-sheet and other respectively as follows: 0.96 (5%), 0.94 (6%) and 0.49 (11%).

A study using Raman spectroscopy (Bussian and Sander, 1989) obtained R2 values of 0.95 for α-helix and 0.88 for β-sheet. An earlier analysis of Raman amide I (Williams, 1983) reported R2 values of 0.97, 0.97 and 0.37 for α-helix, β-sheet and undetermined respectively.

The analysis of Raman amide III (Williams, 1986) showed R2 values of 0.98 for α-helix, 0.97 for β-sheet and -0.08 for undetermined, which are higher than our determined R2 values from the Raman amide III band of 0.78, 0.76 and 0.75 for α-helix, β-sheet and other respectively.

3.7 Related Chemometrics Methods

PCA decomposes the covariance matrix of the N-dimensional input matrix X into its N eigenvectors and eigenvalues. The eigenvectors are perpendicular to each other and provide information about patterns in the data. The eigenvectors of the covariance matrix are ordered by eigenvalue, from highest to lowest. The eigenvector with the highest eigenvalue is the first principal component of the data set. The smaller the eigenvalue, the less significant the component, and components with small eigenvalues can be discarded. Thus, from a data set of N dimensions, N eigenvectors and eigenvalues are calculated, the first v eigenvectors are chosen, and the final data set has only v dimensions, thereby reducing the dimensionality of the data.
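The eigendecomposition route to PCA described above can be sketched as follows (illustrative Python using NumPy; the function name is my own):

```python
import numpy as np

def pca_reduce(X, v):
    """Project centred data onto the v eigenvectors of the covariance
    matrix with the largest eigenvalues."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # returned in ascending order
    order = np.argsort(eigvals)[::-1]        # reorder from highest to lowest
    components = eigvecs[:, order[:v]]
    return Xc @ components, eigvals[order]

# Three variables, two of which are almost perfectly correlated.
rng = np.random.default_rng(1)
t = rng.normal(size=200)
X = np.column_stack([t, 2 * t + 0.01 * rng.normal(size=200),
                     rng.normal(size=200)])
scores, eigvals = pca_reduce(X, v=2)
print(scores.shape)  # (200, 2): dimensionality reduced from 3 to 2
```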

PCA is a widely used method for the analysis of spectra, as discussed in chapter 2 section 3. PCA software is widely available and easy to access, and the method has the advantage of reducing the dimensionality of the data. The orthogonality of the eigenvectors eliminates the problem of high collinearity between spectral data points, which can lead to a degradation of prediction accuracy.

Classification tree algorithms have been developed by various groups. Among the tree-based programs are CART (Breiman et al., 1984), QUEST (Loh and Shih, 1997), CRUISE (Kim and Loh, 2001) and C4.5 (Quinlan and Kohavi, 1999; Quinlan, 1993). QUEST and CRUISE use a bivariate linear discriminant model, in which the model is fitted between all pairs of variables and the pair with the smallest estimated misclassification cost is selected. Each node is split using the univariate split selection rules reported in Kim and Loh (2001). CART splits nodes recursively with the aim of making the data more homogeneous until the full tree is grown. Homogeneity at a node can be measured by the Gini index (Breiman et al., 1984).

The Gini index is a split criterion that favours the reduction of impurity at a node. The C4.5 decision tree algorithm uses 'if-then' rules as the tree grows. At each node the split criterion is based on an 'if' test: if the sample at the node satisfies the given conditions, the next node is the left child node; otherwise it is the right child node. The direction in which the tree grows at each node is determined by the result at the previous node, until a leaf node is reached. The predicted class is specified by the leaf node.

k-Nearest Neighbour (k-NN) (Friedman et al., 1975; Zouhal et al., 1998) is a learning algorithm that classifies a sample according to the classes of the k training samples in closest proximity. The class to which the majority of these k neighbours belong is returned as the predicted class for the new observation. Euclidean distance, the square root of the sum of the squared differences over all the data features, is used as the measure of similarity between observations.
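A minimal k-NN classifier following the description above (illustrative Python; function name and toy data are my own):

```python
import math
from collections import Counter

def knn_predict(train, labels, x, k=3):
    """Classify x by majority vote among its k nearest training
    samples, using Euclidean distance."""
    dists = sorted((math.dist(p, x), c) for p, c in zip(train, labels))
    nearest = [c for _, c in dists[:k]]
    return Counter(nearest).most_common(1)[0][0]

# Two well-separated clusters of three points each.
train = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["low", "low", "low", "high", "high", "high"]
print(knn_predict(train, labels, (0.5, 0.5)))  # 'low'
print(knn_predict(train, labels, (5.5, 5.5)))  # 'high'
```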

Artificial Neural Networks (Bishop, 1995; Fausett, 1994) are systems of information-processing nodes whose structure and function are based on the biological nervous system. A neural network consists of a set of neurons, each connected to several other nodes, and each connection has an associated weight. The network classifier is learned by adjusting the weights. The inputs are multiplied by the weights and summed; the output is subjected to a threshold and belongs to class +1 or -1 depending on whether it is above or below the set threshold. Neural networks can be multilayered, with each layer combining the preceding classifiers. A two-layer network acts as a binary classifier, while a multilayer network generalises in high dimensions, using hyperplanes to bisect the feature space.
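The weighted-sum-and-threshold neuron described above can be written in a few lines (illustrative Python with hand-set weights, not a trained network):

```python
def neuron(x, weights, bias, threshold=0.0):
    """Single artificial neuron: weighted sum of inputs followed by a
    hard threshold giving class +1 or -1."""
    s = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if s > threshold else -1

# Hand-set weights realising a logical AND of two binary inputs.
w, b = (1.0, 1.0), -1.5
print([neuron(x, w, b) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]])
# [-1, -1, -1, 1]
```

Training would adjust the weights and bias from labelled examples rather than setting them by hand.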

Radial Basis Function (RBF) networks use radial basis functions for classification instead of hyperplanes. An RBF node has its maximum value at the mean of a given set of vectors, and its value decreases with distance from this mean centre point, which serves as a reference for how far a data point lies from the centre of the RBF for that node. The output layer of the RBF network combines the responses from the individual RBF nodes.
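The response of one such node, taking the commonly used Gaussian form (an assumption here; other radial functions exist), can be sketched as:

```python
import math

def rbf_node(x, centre, sigma=1.0):
    """Gaussian radial basis function: maximal at the centre and
    decaying with Euclidean distance from it."""
    r2 = sum((xi - ci) ** 2 for xi, ci in zip(x, centre))
    return math.exp(-r2 / (2.0 * sigma ** 2))

centre = (0.0, 0.0)
print(rbf_node((0.0, 0.0), centre))  # 1.0 at the centre
print(rbf_node((2.0, 0.0), centre) < rbf_node((1.0, 0.0), centre))  # True
```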


Artificial neural networks can model complex nonlinear systems, though they can be computationally expensive.

Naive Bayes (Lee, 2010) is a classification technique in which the probability that an observation belongs to each class is calculated; the outcome is the class with the highest probability. The naive Bayes classifier assumes that, given a class, the data features are independent of one another. Given a class i and a data vector x1, x2,...,xm, the probabilistic model for a naive Bayes classifier is P(Ci | x1, x2,...,xm), the probability that the data vector belongs to class i. The prior probability P(C+1) can be calculated as the ratio of the number of samples in class +1 to the total number of samples. Under the independence assumption the posterior can be written as P(Ci | x1,...,xm) ∝ P(Ci) P(x1 | Ci) · ... · P(xm | Ci). A new observation is classified to the class with the maximum posterior probability.
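For categorical features, the prior and likelihood estimates described above can be sketched as follows (illustrative Python; function names and the toy data are my own, and no smoothing of zero counts is applied):

```python
from collections import Counter, defaultdict

def train_nb(samples, labels):
    """Estimate priors P(Ci) and per-feature likelihoods P(xj | Ci)
    by counting over categorical training data."""
    priors = Counter(labels)
    likelihood = defaultdict(Counter)
    for x, c in zip(samples, labels):
        for j, value in enumerate(x):
            likelihood[(c, j)][value] += 1
    return priors, likelihood

def predict_nb(priors, likelihood, x):
    """Return the class with the maximum posterior, assuming the
    features are independent given the class."""
    total = sum(priors.values())
    best, best_p = None, -1.0
    for c, count in priors.items():
        p = count / total                      # prior P(Ci)
        for j, value in enumerate(x):
            p *= likelihood[(c, j)][value] / count  # likelihood P(xj | Ci)
        if p > best_p:
            best, best_p = c, p
    return best

samples = [("sunny", "hot"), ("sunny", "mild"),
           ("rainy", "mild"), ("rainy", "cool")]
labels = ["out", "out", "in", "in"]
priors, lik = train_nb(samples, labels)
print(predict_nb(priors, lik, ("sunny", "mild")))  # 'out'
```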

Chemometrics methods have been applied to a wide number of areas. For example, the Naive Bayes classifier has been used in text classification (McCallum and Nigam, 1998; Su et al., 2011) and drug target studies (Bender et al., 2004; Xia et al., 2004; Klon et al., 2004). Reinhardt and Hubbard (1998) applied neural networks to the prediction of the subcellular location of proteins. Binary-classification ANNs were used by Gniadecka et al. (1997, 2004) to distinguish basal cell carcinoma cells from normal samples based on Raman spectra. Nijssen et al. (2002) used PCA on Raman spectra to reduce the data dimensions and then applied cluster analysis to distinguish carcinoma tissue from normal dermis and epidermis tissue. A multidimensional type of clustering was used by Zhu et al. (2006) to classify 80 protein ROA spectra into their structural classes. The prediction of the subcellular location of proteins has also been carried out using SVM analyses of protein sequences (Chou et al., 2002; Park et al., 2003; Garg et al., 2005; Matsuda et al., 2005).


Sequence-search-based methods have been used to predict structure: protein fold class can be inferred by identifying proteins of similar sequence or similar structure. Structure is conserved more than sequence, so the absence of sequence similarity is not always an indication of a different fold class, as proteins can retain similar structures after sequence similarity has been lost. The methods discussed above can be applied to the investigation of protein structure in the absence of sequence similarity, and there is always a need to expand the ways in which protein structure can be determined. The use of chemometrics analyses of vibrational spectra has been shown to be a useful probe in areas such as cancer diagnostics and quality control. Following on from the various studies that have demonstrated the capability of such analyses, this thesis reports on the application of PLS regression, Random Forest classification, and SVM regression and classification to determine protein fold class and secondary structure from Raman and ROA spectra. These analyses are discussed in the following chapters.


3.8 References

Abdi, H. (2007). Partial least square regression (PLS regression). In N.J. Salkind (Ed.): Encyclopedia of Measurement and Statistics. Thousand Oaks (CA): Sage. pp. 740-744

Ahmed A.A., Yi S., Hongchi S. (2005) Variants of Multidimensional Scaling for Node Localization. Proceedings of the 11th International Conference on Parallel and Distributed Systems (ICPADS'05)

Bender, A., Mussa, H. Y.,Glen, R. C., Reiling, S. (2004) Molecular similarity searching using atom environments, information-based feature selection, and a Naive Bayesian classifier. Journal of Chemical Information and Computer Science 44, 170-178

Bishop C.M. (2006) Pattern Recognition and Machine Learning, 1st ed., Springer Publishing: New York

Boulesteix A. L., Strimmer K. (2005) Predicting transcription factor activities from combined analysis of microarray and chip data: A partial least squares approach. Theoretical Biology and Medical Modeling 2, 23

Boulesteix A.L., Strimmer K. (2007) Partial least squares: a versatile tool for the analysis of high-dimensional genomic data. Briefings in Bioinformatics 8, 32-44

Breiman L., Friedman J.H., Olshen R.A., Stone C.J. (1984) Classification and Regression Trees, Wadsworth, Belmont

Breiman, L. (1996a). Bagging predictors. Machine Learning 26, 123–140

Breiman, L. (1996b). Out-of-bag estimation, ftp.stat.berkeley.edu/pub/users/breiman/OOBestimation.ps

Burges C.J.C.,(1998) A tutorial on Support vector machines for pattern recognition. Data Mining and Knowledge Discovery 2, 121-167

Bussian B.M. and Sander C. (1989) How to determine protein secondary structure in solution by Raman spectroscopy: practical guide and test case DNase I. Biochemistry 28, 4271-4277

Carrascal L.M., Galván I., Gordo O. (2009) Partial least squares regression as an alternative to current regression methods used in ecology. Oikos 118, 681-690

Chen P., Lin C. and Scholkopf B.,(2007) A tutorial on υ-support vector machines. (www.csie.ntu.edu.tw/~cjlin/papers/nusvmtutorial.pdf)

Chih-Wei H., (2008) A practical guide to support vector classification. (www.csie.ntu.edu.tw/~cjlin)


Chou, K. C., Cai, Y. D.,(2002) Using functional domain composition and support vector machines for prediction of protein subcellular location. Journal of Biological Chemistry 277, 45765-45769.

Chun H. (2008) Sparse partial least squares regression for simultaneous dimension reduction, UMI Microform, Ann Arbor, MI

Datta S. (2001) Exploring relationships in gene expressions: a partial least squares approach. Gene Expression 9, 257–264

Dousseau F. and Pezolet M. (1990) Determination of the secondary structure content of proteins in aqueous solutions from their amide I and amide II infrared bands. Comparison between classical and partial least-squares methods. Biochemistry 29, 8771-8779

Farag A. and Mohamed R.M. (2004) Regression using Support Vector Machines: Basic Foundations. (www.cvip.uofl.edu)

Fausett L.,(1994) Fundamentals of Neural Networks: architectures, algorithms and applications. Prentice Hall, Upper Saddle, NJ

Friedman J.H., Baskett F., Shustek L.J. (1975) An algorithm for finding nearest neighbors. IEEE Transactions on Computers C-24, 1000-1006

Garg, A., Bhasin, M., Raghava, G. P. (2005) Support vector machine based method for subcellular localization of human proteins using amino acid compositions, their order, and similarity search. Journal of Biological Chemistry 280, 14427-14432

Geladi P. and Kowalski B. (1986), Partial least squares regression: A tutorial. Analytica Chimica Acta 185, 1-17

Gniadecka M., Wulf H.C., Mortensen N.N., Nielsen O.F., Christensen D.H.,(1997) Diagnosis of Basal Cell Carcinoma by Raman Spectroscopy. Journal of Raman Spectroscopy 28, 125-129

Gniadecka M., Philipsen P.A., Sigurdsson S., Wessel S., Nielsen O.F., Christensen D.H., Hercogova J., Rossen K., Thomsen H.K., Gniadecki R., Hansen L.K., Wulf H.C. (2004) Melanoma diagnosis by Raman spectroscopy and neural networks: structure alterations in proteins and lipids in intact cancer tissue. The Journal of Investigative Dermatology 122, 443-449

Groenen P.J.F. and van de Velden M. (2004) Multidimensional Scaling. Econometric Institute Report.

Hennessey J.P. and Johnson W.C. (1981) Information content in the circular dichroism of proteins. Biochemistry 20,1085-1094

Höskuldsson A. (2004) PLS regression and the covariance. http://www.acc.umu.se/~tnkjtg/Chemometrics/Editorial


Huebner W. and Rahmelow K. (1996) Secondary structure determination of proteins in aqueous solution by infrared spectroscopy: A comparison of multivariate data analysis methods. Analytical Biochemistry 241, 5-13

Jaworska N. and Chupetlovska-Anastasova A. (2009) A review of MDS and its utility in various psychological domains. Tutorials in Quantitative Methods for Psychology 5, 1-10

Jollife I.T. (1986) Principal Components Analysis, Springer Verlag, New York

Kim H., Loh W.-Y. (2001) Classification trees with unbiased multiway splits. Journal of the American Statistical Association 96, 589-604

Klon, A. E., Glick, M.,Thoma, M., Acklin, P., Davies, J. W.,(2004)Finding more needles in the haystack: A simple and efficient method for improving high-throughput docking results. Journal of Medicinal Chemistry 47, 2743-2749

Kohavi R., Quinlan R. (1999) Decision Tree Discovery. (http://ai.stanford.edu/~ronnyk/treesHB.pdf)

Lee D.C., Haris P.I., Chapman D., Mitchell R.C. (1990) Determination of protein secondary structure using factor analysis of infrared spectra. Biochemistry 29, 9185-9193

Lees J.G., Miles A.J., Janes R.W., Wallace B.A. (2006) Novel methods for secondary structure determination using low wavelength (VUV) circular dichroism spectroscopic data. BMC Bioinformatics 7,507-517

Lee J.K.(2010) Statistical Bioinformatics for Biomedical and Life Science researchers, John Wiley & Sons, Hoboken, New Jersey

Loh W.-Y., Shih Y.S. (1997) Split selection methods for classification trees. Statist. Sinica 7, 815-840

Matsuda, S., Vert, J. P., Saigo, H., Ueda, N., Toh, H., Akutsu, T. (2005) A novel representation of protein sequences for prediction of subcellular location using support vector machines. Protein Science 14, 2804-2813

Martens H., Naes T. (1989) Multivariate Calibration, Wiley, London

McCallum A., Nigam K. (1998) A comparison of event models for Naive Bayes text classification. AAAI-98 Workshop on Learning for Text Categorization.

McIntosh A.R., Bookstein F.L., Haxby J.V., Grady C.L. (1996) Spatial pattern analysis of functional brain images using partial least squares. Neuroimage 3, 143-157

Müller K., Mika S., Rätsch G., Tsuda K., Schölkopf B. (2001) An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks 12, 181-201

Naud A. (2006) An accurate MDS-based algorithm for the visualization of large multidimensional datasets. (www.phys.uni.torun.pl/~naud)


Navea S., Tauler R., de Juan A. (2005) Application of the local regression method interval partial least-squares to the elucidation of protein secondary structure, Analytical Biochemistry 336, 231–242

Nijssen A., Schut T.C.B., Heule F., Caspers P.J., Hayes D.P., Neumann M.H.A., Puppels G.J. (2002) Discriminating basal cell carcinoma from its surrounding tissue by Raman spectroscopy. The Journal of Investigative Dermatology 119, 64-69

Oberg K.A., Ruysschaert J., Goormaghtigh E. (2004) The optimization of protein secondary structure determination with infrared and circular dichroism spectra. European Journal of Biochemistry 271, 2937-2948

Park K.J., Kanehisa M. (2003) Prediction of protein subcellular locations by support vector machines using compositions of amino acids and amino acid pairs. Bioinformatics 19, 1656-1663

Provencher S.W. and Glockner J. (1981) Estimation of globular protein secondary structure from circular dichroism. Biochemistry 20, 33-37

Quinlan R. (1993) C4.5 Programs for Machine Learning, Morgan Kaufmann Publishers Inc, San Mateo, CA

Reinhardt A., Hubbard T. (1998) Using neural networks for prediction of the subcellular location of proteins. Nucleic Acids Research 26, 2230-2236.

Schölkopf B. (1997) Support Vector Learning, Oldenbourg Verlag, München

Schölkopf B. (2005) (www.ccs.neu.edu/home/vip/teach/MLcourse/lectures/aop05_scholkopf_km.pdf)

Smola A.J., Schölkopf B. (2004) A tutorial on support vector regression. Statistics and Computing 14, 199-222

Su J., Sayyad-Shirabad J., Matwin S. (2011) Large scale text classification using semi-supervised multinomial Naive Bayes. Proceedings of the 28th International Conference on Machine Learning

Tao W., Hangen H., Dewen H. (2002) On the separability of kernel functions. Proceedings of the 9th International Conference on Neural Information Processing (ICONIP ’02)

Tobias R.D. (1997) An Introduction to partial least squares regression, TS-509, SAS Institute Inc., Cary, N.C.

Tsuda K., Vert J.-P. (2004) Kernel Methods in Computational Biology, MIT Press, Cambridge, MA

Vapnik V. (1999) The Nature of Statistical Learning Theory, Springer-Verlag, New York

Williams R.W. (1983) Estimation of protein secondary structure from the laser Raman amide I spectrum. Journal of Molecular Biology 166, 581-603


Williams R. W. (1986) Protein secondary structure analysis using Raman amide I and amide III spectra. Methods in Enzymology 130, 311-331

Wold H. (1966) Estimation of principal components and related models by iterative least squares. In Krishnaiaah P.R. (Ed.) Multivariate Analysis. Academic Press, New York

Xia X., Maliski E.G., Gallant P., Rogers D. (2004) Classification of kinase inhibitors using a Bayesian model. Journal of Medicinal Chemistry 47, 4463-4470

Zhou, G. P., Assa-Munt, N. (2001) Some insights into protein structural class prediction. Proteins: Structure Function Genetics 44, 57-59


4. Data and Methods

4.1 Datasets

A dataset of ROA and Raman spectra of proteins and viruses was compiled from previous studies published in the literature (Zhu et al., 2006; McColl et al., 2003). The samples were prepared with typical concentrations of 10-100 mg/ml and data collection times of 4-24 hours. Both ICP and SCP ROA spectra were used; however, for the analyses discussed in this thesis all spectra were processed the same way (i.e. binning and scaling), as in the far-from-resonance limit ICP and SCP spectra are identical (Nafie, 1997). Where a protein had several spectra available, the spectrum obtained under conditions closest to physiological was retained and the others were excluded. The Raman and ROA data include an array of α-helical, β-strand and disordered poly-L-amino acids. The classes of the proteins in the data set were assigned from the known structural classes reported by SCOP (Murzin et al., 1995), using a 3-state scheme of α-helix, β-sheet and other. For the SVM analyses, the polypeptides and proteins were split into training and validation data sets as the algorithm required. Table 4.1 is a summary of the protein data used.

Table 4.1 Table of protein names, PDB codes (where available) and structural information. The analyses in which the proteins were used and the types of spectra available are indicated by a "•" mark in the respective column.

Protein | PDB code | No. of residues | % α (SCOP) | % β (SCOP) | % other (SCOP) | Analyses/spectra
creatine kinase (rabbit) | 1i0e | 365 | 31 | 13.2 | 54.2 | • • • •
filamentous bacteriophage fd | 1fdm | 50 | 50 | 0 | 50 | • • • • •
serum albumin (human) | 1e78 | 578 | 70.2 | 0 | 29.1 | • • • • •
filamentous bacteriophage Pf1 | 1pfi | 46 | 91.3 | 0 | 8.7 | • • • • •
prion protein (ovine, 90-230) | 1qm1 | 104 | 51 | 3.8 | 45.2 | • • • • •
S100B (rabbit) | 1uwo | 91 | 59.3 | 0 | 40.7 | • • • •
S100A6 (calcyclin, rabbit) | 1k96 | 89 | 69.7 | 0 | 30.3 | • • • • •
cowpea mosaic virus (empty protein capsid) | 1ny7 | 558 | 5.4 | 41 | 49.6 | • • • • •
serum amyloid P component (human) | 1sac | 204 | 3.9 | 44.1 | 50.5 | • • • •
invertase | 2ac1 | 537 | 2.6 | 45.1 | 49.5 | • • • • •
P.69 pertactin (Bordetella pertussis) | 1dab | 539 | 0 | 59.9 | 45.6 | • • • •
pepsin (porcine) | 1am5 | 324 | 9 | 41.7 | 45.7 | • • • •
serum amyloid protein (domain a) | 1lgn | 204 | 4.4 | 44.6 | 49.5 | • • • •
satellite tobacco mosaic virus | 1a34 | 147 | 0 | 39.5 | 54.4 | • • • • •
trypsin (bovine) | 1k1l | 223 | 7.2 | 32.3 | 59.2 | • • • •
ovomucoid (turkey) | 1m8c | 56 | 17.9 | 16.1 | 66.1 | • • • •
lactoferrin (human) | 1fck | 691 | 29.4 | 18.1 | 47.9 | • • • • •
MS2 virus (empty protein capsid) | 2ms2 | 129 | 16.3 | 46.5 | 34.9 | • • • • •
ribonuclease A (bovine) | 1afu | 124 | 17.7 | 37.4 | 45.2 | • • • • •
subtilisin Carlsberg (Bacillus licheniformis) | 1ndq | 269 | 30.9 | 18.2 | 50.9 | • • • •
ubiquitin (bovine) | 1ubq | 76 | 15.8 | 31.6 | 46.1 | • • • •
Bowman-Birk inhibitor (soybean) | 1pi2 | 61 | 0 | 23 | 77 | • • • • •
α-1-acid glycoprotein (bovine) | 3kq0 | 175 | 18.9 | 41.1 | 33.1 | • • • • •
ribonuclease B (bovine) | 1rbb | 124 | 17.7 | 33.1 | 46.8 | • • • • •
tobacco mosaic virus | 1vtm | 158 | 34.8 | 4.4 | 59.5 | • • • • •
α-lactalbumin (bovine) | 1f6s | 122 | 31.1 | 8.2 | 47.5 | • • • • •
filamentous bacteriophage M13 | 2cpb | 50 | 74 | 0 | 26 | • • • • •
aldolase (rabbit) | 2ald | 363 | 42.1 | 13.8 | 41.9 | • • • • •
lysozyme (hen) | 1lsc | 129 | 30.2 | 6.2 | 52.7 | • • • • •
ovomucoid (human) | 1m8c | 56 | 17.9 | 16.1 | 66.1 | • • • • •
lysozyme (human) | 1gaz | 130 | 30 | 7.7 | 53.1 | • • • • •
immunoglobulin G (human) | 1ig2 | 455 | 3.3 | 40.9 | 52.5 | • • • • •
insulin (bovine, monomeric, low pH) | 1zeh | 51 | 52.9 | 7.8 | 35.3 | • • • •
β-chymotrypsin (bovine) | 4cha | 239 | 7.5 | 31.8 | 58.6 | • • • •
amylase (Bacillus licheniformis) | 1kgu | 496 | 17.9 | 17.9 | 57.3 | • • • • •
avidin (hen) | 1rav | 124 | 0 | 50 | 42.7 | • • • • •
β-lactoglobulin (bovine) | 1b8e | 152 | 8.6 | 44.7 | 40.8 | • • • • •
concanavalin A (jack bean) | 3cna | 237 | 0 | 40.5 | 59.5 | • • • •
trypsinogen (bovine) | 2tga | 223 | 7.2 | 32.3 | 59.2 | • • • • •
antifreeze protein type III (arctic flounder) | 1b7i | 66 | 6.1 | 12.1 | 74.2 | • • • •
metallothionein (rabbit) | 4mt2 | 62 | 0 | 0 | 100 | • • • •
full length prion protein (ovine, 23-230) | 1y2s | 113 | 38.3 | 3.5 | 46 | • • • •
β-lactoglobulin (bovine, pH 2) | 1dv9 | 162 | 9.9 | 36.4 | 51.9 | • • • •
lysozyme (equine) | 2eql | 129 | 31.8 | 9.3 | 51.2 | • •
filamentous bacteriophage Ike | 1ifl | 53 | 96.2 | 0 | 3.8 | •
lysozyme (equine, A-state) | • •
β-lactoglobulin (bovine, pH=2.0) | • •
β-lactoglobulin (bovine, pH=6.5) | • •
β-lactoglobulin (bovine, pH=9.0) | • •
ubiquitin (bovine) | • • •
cowpea mosaic virus (with RNA 1) | • •
cowpea mosaic virus (with RNA 2) | • •
ovalbumin (hen) | • •
α-lactalbumin (human) | • • •
α-synuclein (human) | •
α-lactalbumin (bovine, A-state) | • •
lysozyme (hen, reduced) | • •
lysozyme (human, pre-fibrillar intermediate) | • • •
prion protein (ovine, 94-233, reduced) | • •
narcissus mosaic virus | • •
papaya mosaic virus | • •
potato virus X | • •
regulatory protein 2 (Streptococcus pyogenes) | • •
A-gliadin (wheat) | • •
tobacco mosaic virus dimer | • •
tobacco rattle virus | • •
prion protein (mouse, 23-231) | • •
ovomucoid (hen) | • • •
ribonuclease B (bovine) | • • •
β-sheet poly(L-lysine) | • • •
ABA-1 allergen (Ascaris lumbricoides) | • • •
α-casein (bovine) | • • •
β-casein (bovine) | • • •
κ-casein (bovine) | • •
ω-gliadin (wheat) | • •
β-synuclein (human) | • •
γ-synuclein (human) | • •
T-A-1 peptide (wheat glutenin subunit) | • • •
tau46 (human P301L mutant) | • •
tau46 (human) | • •
OOAAAAAAAOO peptide | • •
phosvitin (hen) | • •
disordered poly(L-glutamic acid) | • •
disordered poly(L-lysine) | • •
structure | • •
α-1-acid glycoprotein (bovine) | • •
A-gliadin in 60% MeOH | • •
rF1 antigen | • •
ribonuclease A (bovine, reduced) | • •
disordered poly(L-alanine) | • •
α-helical poly(L-alanine) | • •
β-sheet poly(L-alanine) | • •
α-helical poly(benzyl) | • •
disordered poly(L benzyl) | • •
α-helical poly(L-glutamic acid) | • •
disordered poly(L-histidine) | • •
α-helical poly(L-histidine) | • •
disordered poly(L-leucine) | • •
α-helical poly(L-leucine) | • •
α-helical poly(L-lysine) | • •
disordered poly(L-ornithine) | • •
α-helical poly(L-ornithine) | • •
disordered poly(L-proline) | • •
disordered poly(L-tryptophan) | • •
α-helical poly(L-tryptophan) | • •
disordered poly(L-threonine) | • •
disordered poly(L-tyrosine) | • •
α-helical poly(L-tyrosine) | • •

SVM=Support Vector Machines; PLS=Partial Least Squares regression; RF=Random Forests

SVM analyses

In the SVM classification analyses, a set of 67 ROA spectra and 53 Raman spectra of proteins, including denatured proteins and polypeptides, was used. The ROA models were built from a training set of 35 proteins comprising 9 αβ proteins, 9 α-helix proteins, 8 β-sheet proteins and 9 'other' proteins. The ROA validation set of 32 spectra had 8 αβ proteins, 5 α-helix proteins, 8 β-sheet proteins and 11 'other' proteins. The denatured proteins and polypeptides were removed for the subsequent SVM regression analyses to match proteins to their SCOP references. The models from ROA spectra for the SVM regression analyses were built from 25 proteins (8 αβ, 7 α-helix, 8 β-sheet and 2 'other' proteins) and the models from Raman spectra were built from 15 proteins (4 αβ, 3 α-helix, 5 β-sheet and 3 'other' proteins). The validation set for the SVM regression models had 17 proteins for ROA (6 αβ, 3 α-helix, 5 β-sheet and 3 'other' proteins) and 11 proteins for Raman (6 αβ, 4 α-helix and 1 β-sheet protein).

PLS regression and Random Forests analyses

A dataset of 44 ROA and 24 Raman spectra was used for the PLS regression and Random Forests analyses. The nature of these algorithms, which are discussed in chapter 3, did not require splitting the datasets into training and validation sets as the SVM analyses did. The ROA dataset comprised 15 αβ proteins, 9 α-helical proteins, 14 β-sheet proteins and 4 'other' proteins. The Raman dataset comprised 8 αβ proteins, 6 α-helical proteins, 7 β-sheet proteins and 3 'other' proteins. A table of all the proteins used in the analyses can be seen in Appendix A.

4.2 Dataset representation

Pattern recognition algorithms require the input data to be represented as vectors. Each vector consisted of a class label (-1 or +1), or a percentage structural fraction, followed by a sparse representation of the input vector of real numbers, in this case the binned ROA spectral intensities. The intensities of the spectra formed the input matrix X; the class labels {-1, +1} or the percentage structural fractions made up the Y response vector. Each spectrum consists of (x_i, y_i) pairs, where x_i is a wavenumber and y_i the corresponding intensity. The mean intensity within each bin range is used as a feature in the vectors. Figure 4.1 shows a simple illustration of spectral files being processed into the output vectors that constitute the input matrix X.


[Figure 4.1: spectrum text files of wavenumber-intensity pairs (A) are converted by the binning and scaling Perl scripts into single-row feature vectors (B)]

Figure 4.1 An illustration showing how the raw spectral data were encoded in the feature vectors used in the machine learning algorithms. The spectra are contained in text files (A); each text file has rows of wavenumber and spectral intensity pairs. Each file is read in by the binning script (Appendix B) and then the scaling script (Appendix D). The scaling script outputs each file as a single row to a combined file (B). These rows (B) make up the input vectors for the machine learning algorithms.

4.3 Data Processing

4.3.1 Binning

Spectral intensities from across the whole range of each spectrum were first grouped into bins of width 10, 20 or 100 cm-1 by the Perl script binAvg.pl (Appendix B), which averages the intensities within each interval. Binning reduces the effect of experimental noise in the spectra. With a bin width of 10 cm-1, the full spectrum from 620 to 1850 cm-1 gives 123 bins in total, each typically containing 3 or 4 data points. Bin widths of 20 cm-1 and 100 cm-1 were also investigated, giving 62 and 13 bins respectively; 20 cm-1 bins typically held 7 or 8 data points and 100 cm-1 bins about 36. Decreasing the bin width much below 10 cm-1 was not possible, as data would not always be available for each bin. These bin intensities constituted the vectors in the input matrices used by the machine learning algorithms.
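The binning step itself was performed by the Perl script binAvg.pl, which is not reproduced here; the following Python sketch (the function name and toy spectrum are illustrative, not from the thesis) shows the same averaging-into-bins operation for a 620-1850 cm-1 spectrum at 10 cm-1 width:

```python
# Illustrative re-implementation of the binning step: average spectral
# intensities within fixed-width wavenumber bins.

def bin_spectrum(pairs, lo=620.0, hi=1850.0, width=10.0):
    """Average (wavenumber, intensity) pairs into bins of `width` cm-1."""
    n_bins = int((hi - lo) // width)          # (1850 - 620) / 10 = 123 bins
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for wavenumber, intensity in pairs:
        if lo <= wavenumber < hi:
            b = int((wavenumber - lo) // width)
            sums[b] += intensity
            counts[b] += 1
    # mean intensity per bin; None where a bin received no data points
    return [s / c if c else None for s, c in zip(sums, counts)]

# toy spectrum sampled every 3.4 cm-1, roughly 3-4 points per 10 cm-1 bin
spectrum = [(620.0 + 3.4 * i, float(i)) for i in range(362)]
bins = bin_spectrum(spectrum)
print(len(bins))   # 123 features per spectrum, as in the text
```

With a coarser sampling step, some bins would receive no points, which is why bin widths much below 10 cm-1 were not usable.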


4.3.2 Range Selection

Further investigations involved systematically testing the amide I, II and III regions of the spectra to determine which sections yield more structural information. Combinations of the amide regions were also analysed: I&II, I&III, II&III and I&II&III. The selection of the amide ranges was scripted in ranges.pl (Appendix C). Analyses were performed on the amide I (1600-1700 cm-1), amide II (1540-1600 cm-1) and amide III (1200-1340 cm-1) regions, combinations of these, and the backbone stretch modes (850-1100 cm-1). Figure 4.2 illustrates the subdivisions of the spectral data.
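A minimal Python counterpart to the ranges.pl script might look as follows; the region boundaries are those quoted above, while the function name and toy spectrum are illustrative:

```python
# Keep only the (wavenumber, intensity) pairs falling in selected amide regions.

REGIONS = {
    "amide I":   (1600.0, 1700.0),
    "amide II":  (1540.0, 1600.0),
    "amide III": (1200.0, 1340.0),
    "backbone":  (850.0, 1100.0),
}

def select_ranges(pairs, region_names):
    """Return the pairs lying inside any of the named regions (e.g. I&III)."""
    bounds = [REGIONS[name] for name in region_names]
    return [(w, i) for w, i in pairs
            if any(lo <= w <= hi for lo, hi in bounds)]

spectrum = [(600.0 + 5.0 * k, 1.0) for k in range(251)]   # 600-1850 cm-1
amide_I_III = select_ranges(spectrum, ["amide I", "amide III"])
print(len(amide_I_III))
```

The filtered pairs are then binned and scaled exactly as the full spectra were, so each region combination yields its own input matrix.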


Figure 4.2 A Raman spectrum (hen lysozyme) with the subdivisions of the spectral data used in the analyses marked. The analyses were performed on the following regions: A. the backbone stretch modes (850-1100 cm-1); B. amide III (1200-1340 cm-1); C. amide II (1540-1600 cm-1); D. amide I (1600-1700 cm-1).

4.3.3 Scaling

Scaling is important as it prevents attributes with greater numeric ranges from dominating those with smaller numeric ranges. Scaling was performed by the Perl script svmscale.pl (Appendix D). For the ROA spectra, each attribute in the vector was linearly scaled to the range [-1, +1], so that the lowest value in each vector is -1 and the highest is +1. For the Raman spectra, the intensities of each spectrum were scaled to the range [0, 1]. The linear scaling method used is shown below:

scaled(z_i) = 2(z_i - m_i)/(M_i - m_i) - 1    (4.1)

where z_i is an attribute of the vector, M_i is the maximal value of the vector and m_i is the minimal value of the vector.
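Equation 4.1 can be sketched directly in code; the Raman case to [0, 1] simply drops the "2(...) - 1" stretch (function names are illustrative, not from the thesis scripts):

```python
# Linear scaling of equation 4.1: ROA vectors to [-1, +1], Raman to [0, 1].

def scale_roa(vec):
    m, M = min(vec), max(vec)
    return [2.0 * (z - m) / (M - m) - 1.0 for z in vec]

def scale_raman(vec):
    m, M = min(vec), max(vec)
    return [(z - m) / (M - m) for z in vec]

v = [3.0, 5.0, 11.0]
print(scale_roa(v))    # [-1.0, -0.5, 1.0]
print(scale_raman(v))  # [0.0, 0.25, 1.0]
```

Note that the scaling is per vector (per spectrum), so each spectrum spans the full target range regardless of its absolute intensity.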

4.4 Training the Models

4.4.1 SVM Models

Choosing the best C and γ parameters

The models were trained using cross validation, which prevents producing a classifier that is too specific to the training dataset; ideally, we require a classifier that can generalise what it has learnt to novel data. In t-fold cross validation, the training set is split into t subsets with the same distribution of positive and negative examples. Then t-1 subsets are selected to train the SVM model and the remaining held-out subset is tested with the model. This is repeated until each subset has been tested against the other t-1. The average accuracy is reported as the cross-validation accuracy.

When cross validation is used to produce a model, a grid search is performed to find the best C and γ parameters. Pairs of (C, γ) are tried and the one with the best cross-validation accuracy is selected. Grid search parallelises the parameter search, since each (C, γ) pair is evaluated independently, as opposed to methods that use iterative processes. Figure 4.3 below shows a grid-search plot.



Figure 4.3 Grid search plot using C = 2^-10…2^4 and γ = 2^-12…2^6. An initial search produces a coarse grid. The region with the highest accuracy is noted on the coarse grid, and the C and γ ranges in this region are investigated further, creating a finer grid search, i.e. C = 2^-4…2^4 and γ = 2^-8…2^0. This grid plot was produced from the LIBSVM example data heart_scale (www.csie.ntu.edu.tw/~cjlin/libsvm).

A coarse grid is searched first. After identifying a promising region on it, a finer grid search over that region is conducted. Once the best pair is found, the whole training set is trained again with these parameters to generate the final classifier.
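The coarse-then-fine search can be sketched as follows. The cross-validation accuracy function is mocked here with a smooth surface peaking at one (C, γ) pair; in the thesis it came from LIBSVM's cross-validation mode, and all names are illustrative:

```python
# Coarse-then-fine grid search over (log2 C, log2 gamma).
import itertools

def cv_accuracy(log2C, log2g):
    # stand-in for t-fold cross-validation accuracy; peaks at C=2^2, gamma=2^-4
    return 1.0 - 0.01 * ((log2C - 2) ** 2 + (log2g + 4) ** 2)

def grid_search(c_range, g_range):
    """Return the (log2 C, log2 gamma) pair with the best CV accuracy."""
    scores = {(c, g): cv_accuracy(c, g)
              for c, g in itertools.product(c_range, g_range)}
    return max(scores, key=scores.get)

# coarse grid over wide exponent ranges, then a finer grid around the winner
coarse_c, coarse_g = grid_search(range(-10, 5, 2), range(-12, 7, 2))
fine = grid_search([coarse_c + d * 0.25 for d in range(-4, 5)],
                   [coarse_g + d * 0.25 for d in range(-4, 5)])
print(fine)
```

Because every (C, γ) pair is scored independently, the inner dictionary comprehension could be evaluated in parallel without changing the result.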

While training, the model weights can be adjusted to improve accuracy. The weights are a penalty for misclassification of data. The ratio of the number of samples in the two classes was used to decide which weights to use. As an example, given a dataset with two classes 1 and -1, if class 1 has 10 samples and class -1 has 20 samples then it is twice as likely that samples in class 1 will be misclassified as class -1 samples. The ratio of the samples (1:-1) is 1:2, so the penalty for misclassifying a positive (1) sample as a negative (-1) sample is set to twice the penalty for misclassifying a negative sample as a positive sample.
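The penalty-ratio rule above amounts to weighting each class inversely to its sample count; a minimal sketch (the function name is illustrative):

```python
# Derive misclassification penalty weights from class sample counts.

def class_weights(counts):
    """counts: {class_label: n_samples} -> {class_label: penalty_weight}"""
    largest = max(counts.values())
    return {label: largest / n for label, n in counts.items()}

weights = class_weights({+1: 10, -1: 20})
print(weights)   # {1: 2.0, -1: 1.0}: misclassifying a +1 costs twice as much
```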

The Matthews correlation coefficient (MCC) was used to analyse the performance of the SVM classification models. The MCC is calculated as:

MCC = (TP × TN - FP × FN) / sqrt[(TP + FN)(TP + FP)(TN + FP)(TN + FN)]    (4.2)

where TP is true positives, TN is true negatives, FN is false negatives and FP is false positives. The MCC is a reliable evaluation of the prediction that takes into account both sensitivity and specificity. Sensitivity is the probability that a protein is predicted to belong to a given class when in fact it is a true member of that class. Specificity is the probability that a protein is predicted not to belong to a given class when in fact it is not a member of that class. The MCC is an estimate of the measure of similarity between the true classes and the predicted classes

(Baldi, 2000). The MCC is always between -1 and +1. A value of +1 indicates a perfect prediction, 0 indicates random predictions and -1 indicates false predictions. The SVM analyses results are discussed in chapter 5.
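Equation 4.2 translates directly into code; as a check, the example below uses the amide I α-helix counts from Table 5.3 (TP=5, TN=17, FP=1, FN=2), which reproduce the tabulated MCC of 0.69:

```python
# Matthews correlation coefficient, equation 4.2.
import math

def mcc(tp, tn, fp, fn):
    denom = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

print(round(mcc(5, 17, 1, 2), 2))   # 0.69, matching Table 5.3
```

Returning 0 when the denominator vanishes (e.g. a class with no predicted positives) is a common convention, consistent with the "0 indicates random prediction" reading.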

4.4.2 PLS Models

PLS Cross Validation

The PLS algorithm uses cross validation and the root mean squared error to validate all possible correlation models. 5-fold cross validation was carried out by splitting the data into 5 groups. Each group is removed in turn, a model is produced from the remaining groups, and the model is validated on the removed group using the root mean squared error. The model with the lowest root mean squared error is chosen. The PLS algorithm used the maximal possible number of components; in our analyses this was 10. Models were produced for each of the amide region datasets and the different combinations, as well as the whole spectra.

Performance analysis of the PLS Regression models

The PLS models' performances were reported using the squared Pearson correlation coefficient (R2), the root mean squared error (RMS) and the determination enhancement parameter δ, the ratio of the standard deviation of the observed data to the RMS. The R2 value measures the correlation between the observed percentage structure content and the value predicted by the trained PLS regression models; a low correlation between the predicted and observed values of percentage structure content results in very low R2 values.

RMS = sqrt[ Σ (fp - fo)^2 / N ]    (4.3)

where fp is the predicted fraction of the structural motif, fo is the observed fraction of the structural motif and N is the number of samples.

σ = sqrt[ Σ (fo - Avg)^2 / N ]    (4.4)

where fo is the observed fraction of the structural motif, Avg is the mean of the observed structural fractions and N is the number of samples.

The root mean squared error (RMS) measures the deviation of the predicted values from the observed values. The determination enhancement parameter δ compares the variation in the observed data to the error in the predictions and so estimates the amount of structural information extracted: the higher the value of δ, the higher the determination accuracy. The results of the PLS regression analyses on Raman and ROA spectra are discussed in chapter 6.
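The three measures can be computed together as below. Note one assumption made explicit here: δ is taken as σ/RMS, the reading under which larger δ means better determination, consistent with the surrounding text; the toy observed/predicted fractions are illustrative:

```python
# R-squared (squared Pearson correlation), RMS (eq. 4.3) and delta = sigma/RMS.
import numpy as np

def performance(f_obs, f_pred):
    f_obs, f_pred = np.asarray(f_obs, float), np.asarray(f_pred, float)
    rms = np.sqrt(np.mean((f_pred - f_obs) ** 2))          # eq. 4.3
    sigma = np.sqrt(np.mean((f_obs - f_obs.mean()) ** 2))  # eq. 4.4
    r2 = np.corrcoef(f_obs, f_pred)[0, 1] ** 2
    return r2, rms, sigma / rms    # delta > 1 means structure was extracted

r2, rms, delta = performance([10, 30, 50, 70], [12, 28, 55, 66])
print(round(rms, 2), round(delta, 1))
```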

4.4.3 Random Forest Models

The Random Forest (RF) analyses were performed using programs developed for MATLAB. Random Forests combines an ensemble of decision trees, each grown from a random bootstrap sample of the input data. The number of trees to be grown was set to 500. The number of predictors sampled for splitting at each node was set, by default, to the square root of the number of input variables. The RF program in MATLAB has an option that allows setting of class weights. The weights were set roughly on ratios based on the number of samples of each class in the set, e.g. [9 14 15 4] for ROA and [6 7 8 3] for Raman, for the α-helix, β-sheet, αβ and 'other' classes respectively.
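The same settings can be sketched outside MATLAB; the snippet below transplants them into scikit-learn as an illustration (500 trees, sqrt(n_features) predictors per split, class weights from the class sizes), with toy data standing in for the spectra:

```python
# Random Forest with the thesis's settings, sketched with scikit-learn.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
y = np.repeat([0, 1, 2, 3], [9, 14, 15, 4])   # class sizes as in the ROA set
X = rng.normal(size=(len(y), 123))            # e.g. 123 binned intensities
X[y == 1] += 2.0                              # shift one class (toy data only)

counts = np.bincount(y)
class_weights = {c: int(n) for c, n in enumerate(counts)}  # {0: 9, 1: 14, ...}

rf = RandomForestClassifier(n_estimators=500,     # 500 trees
                            max_features="sqrt",  # sqrt(n_features) per split
                            class_weight=class_weights,
                            oob_score=True,       # out-of-bag accuracy estimate
                            random_state=0).fit(X, y)
print(round(rf.oob_score_, 2))
```

The out-of-bag score gives a built-in accuracy estimate without a separate validation split, which is one reason the RF analyses did not require the dataset splitting used for the SVMs.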

Visualisation

RF proximity distances were fed into Multidimensional Scaling (MDS) to create 2D visualisation plots (Groenen and van de Velden, 2004; Naud, 2006). MDS maps the similarity between each pair of objects (p, q) into a predetermined MDS configuration X consisting of all the objects. The inter-object distances dpq in X are related to the RF proximity distances ppq by a loss function, which sums the differences between the proximities ppq and the distances dpq over all pairs of objects in X. To find the best configuration, MDS iterates over candidate configurations, measuring the fit between the proximities ppq and the distances dpq; the best fit is the configuration in which the distances dpq are as close as possible to the proximities ppq. The iterative process proceeds until the squared error of the differences ppq - dpq(X) is minimised.
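A classical (Torgerson) MDS embedding is used below as a simple, non-iterative stand-in for the stress-minimising MDS described above, with distances formed as 1 - proximity; the function name and toy proximity matrix are illustrative:

```python
# Embed an RF proximity matrix in 2-D via classical (Torgerson) MDS.
import numpy as np

def classical_mds(D, dim=2):
    n = len(D)
    J = np.eye(n) - np.ones((n, n)) / n          # centring matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centred squared distances
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:dim]         # largest eigenvalues first
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0.0))

# toy proximity matrix: samples 0-2 mutually similar, 3-5 mutually similar
prox = np.full((6, 6), 0.1)
prox[:3, :3] = 0.9
prox[3:, 3:] = 0.9
np.fill_diagonal(prox, 1.0)
coords = classical_mds(1.0 - prox)               # 6 points in 2-D
```

Points with high mutual RF proximity land close together in the 2D plot, which is what makes the visualisation useful for spotting fold-class clusters.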


Performance analysis of the Random Forest models

The measure of accuracy of the Random Forest models was the number of correctly predicted observations divided by the total number of samples, expressed as a percentage. The Random Forest results are discussed in chapter 7.


4.5 References

Groenen P.J.F. and van de Velden M. (2004) Multidimensional Scaling. Econometric Institute Report.

MATLAB (http://www.mathworks.com/help/techdoc/)

Murzin A.G., Brenner S. E., Hubbard T., Chothia C. (1995) SCOP: A structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology 247, 536-540

Naud A. (2006) An accurate MDS-based algorithm for the visualization of large multidimensional datasets. (www.phys.uni.torun.pl/~naud)

McColl I.H., Blanch E.W., Gill A.C., Rhie A.G.O., Ritchie M.A., Hecht L., Nielsen K., Barron L.D. (2003) A new perspective on β-sheet structures using vibrational Raman optical activity: from poly(L-lysine) to the prion protein. Journal of the American Chemical Society 125, 10019-10026

Nafie L. A. (1997) Infrared and Raman vibrational optical activity. Annual Review of Physical Chemistry 48, 357–386

Zhu F., Isaacs N.W., Hecht L., Tranter G.E., Barron L.D. (2006) Raman optical activity of proteins, carbohydrates and glycoproteins. Chirality 18, 103-115


5. Support Vector Machine (SVM) Classification and Regression analyses of Raman and ROA spectra

SVM Analyses of Raman and ROA

The aim of the SVM analyses was to investigate the determination, from Raman and ROA spectra, of the structural fold classes α-helix, β-sheet, αβ and 'other' and of the fractions of the corresponding structural motifs, and the relationship between spectral features and these motifs. Raman and ROA spectral bands arise from vibrational modes that are characteristic of the arrangement of atoms in the protein molecule, as discussed in chapter 2 on vibrational spectroscopy. Each protein produces a distinct spectrum that distinguishes it uniquely from other proteins. It follows that chemometric methods can be applied to try to extract this structural information and group proteins into the fold classes α-helix, β-sheet, αβ and 'other'.

The SVM classification algorithm classifies samples by producing a hyperplane that best divides the data, projected into feature space, into their classes. SVM classification is discussed in further detail in chapter 3 on machine learning methods. The classes and structural fractions of the proteins used to build the SVM prediction models were based on the SCOP classification of proteins discussed in section 1.3 of chapter 1 on protein structure. Models were generated for the α-helix, β-sheet, αβ and 'other' fold classes. The models allow fold class prediction on unseen data, so the spectrum of a protein of unknown structure can be analysed to predict its fold class, giving us more information about the protein. The measure used to test the performance of the models was the Matthews correlation coefficient (MCC), an accuracy value that indicates how close the predicted values are to the observed values; it is discussed in detail in chapter 4 on methods. The higher the MCC value, the better the performance of the model.


5.1 Data Pre-processing

5.1.1 Choosing the Bin Factor

The first step was varying the bin size between 10, 20 and 100 cm-1 to test the sensitivity and performance of the models. To achieve this, the Raman and ROA spectra were averaged over 10, 20 and 100 cm-1 intervals. The most favourable bin size was the one that produced the most accurate predictive models. Tables 5.1 and 5.2 below show the results for bin sizes 10, 20 and 100 cm-1 for Raman and ROA. The results showed that a bin size of 10 cm-1 produced the best performing models. The ROA models for the α-helix, β-sheet, αβ and 'other' classes yielded MCC values of 0.62, 0.74, 0.61 and 0.13 respectively for bin size 10 cm-1. The ROA bin size of 20 cm-1 showed poorer performance, with an MCC value of 0.59 for αβ prediction and MCC values of 0.00 for the α-helix, β-sheet and 'other' fold classes. The ROA bin size of 100 cm-1 performed even worse, with negative MCC values of -0.04, -0.10 and -0.35 for the αβ, α-helix and 'other' predictions respectively. The Raman models showed poorer performance, with lower MCC values for bin size 10 cm-1 of 0.07, -0.13 and -0.14 for αβ, α-helix and β-sheet respectively, compared to those for ROA for the same fold classes.

The Raman model performances produced very low MCC values; however, the 'other' prediction with the 10 cm-1 bin was the exception, yielding an MCC value of 0.63, better than the MCC value from the 'other' model derived from the ROA data averaged over 10 cm-1. Overall, the Raman models generated from data averaged over 10, 20 or 100 cm-1 performed worse than the ROA models. The bin size of 10 cm-1 was chosen for all subsequent analyses because it showed the best performance accuracy of the three bin sizes tested.


Table 5.1 Performance accuracies (MCC) of the SVM classification models for bin widths of 10, 20 and 100 cm-1, Raman data analyses.

Raman Classification
Bin width        | AlphaBeta | AlphaHelix | BetaSheet | Other
Bin 10 cm-1 MCC  | 0.07      | -0.13      | -0.14     | 0.63
Bin 20 cm-1 MCC  | -0.01     | -0.17      | -0.13     | 0.20
Bin 100 cm-1 MCC | 0.12      | -0.01      | 0.00      | 0.20

Table 5.2 Performance accuracies (MCC) of the SVM classification models for bin widths of 10, 20 and 100 cm-1, ROA data analyses.

ROA Classification
Bin width        | AlphaBeta | AlphaHelix | BetaSheet | Other
Bin 10 cm-1 MCC  | 0.61      | 0.62       | 0.74      | 0.13
Bin 20 cm-1 MCC  | 0.59      | 0.00       | 0.00      | 0.00
Bin 100 cm-1 MCC | -0.04     | -0.10      | 0.53      | -0.35

The analyses of the variation of bin sizes (tables 5.1 and 5.2) were a preliminary step, run as a separate experiment, to investigate the optimum bin size to use. The data sets were re-arranged (i.e. samples that had been in the test sets were moved to the training set) for the subsequent analyses whose results are presented in tables 5.3 and 5.4. The investigation of optimum bin size was done independently of the subsequent Support Vector Machine classification and regression, Random Forest classification and Partial Least Squares regression analyses, and the order of the data sets was changed between analyses during data preparation and pre-processing. However, this should not change the finding that a bin size of 10 cm-1 gives the most accurate results. The 10 cm-1 bin size was therefore used in the subsequent Support Vector Machine classification and regression, Random Forest classification and Partial Least Squares regression analyses. The MCC values reported in tables 5.3 and 5.4 vary because of the re-arrangement (shuffling) of the data samples during script testing and data processing.


5.2 Results and Discussion

5.2.1 SVM Classification Analyses Results of ROA and Raman spectra

On the whole, the analyses of whole ROA spectra yielded better results than those of whole Raman spectra. The ROA models from the full spectra for α-helix, β-sheet and αβ prediction showed high accuracy, with the most correctly predicted proteins for their respective classes. The α-helix, β-sheet and αβ models had MCC values of 0.62, 0.78 and 0.67 respectively. Poor performance was observed for the ROA 'other' fold class prediction, which gave a low MCC value of 0.05. For all the fold classes, Raman SVM completely failed to predict fold class, as the MCC values were all close to zero. Bar graphs of ROA and Raman SVM classification positive and negative predictions are in Appendix E.

The SVM model performances varied across the different amide regions I, II and III (refer to section 4.3.2 of chapter 4 for the amide region ranges) and their different combinations I&II, I&III, II&III and I&II&III (Table 5.3). The Raman models for α-helix fold class prediction generally performed best across the Raman amide combinations compared to the other three classes. The highest MCC values were yielded by α-helix prediction on the amide I&III Raman spectra, with 0.90, and 'other' structural class prediction on the amide III band Raman spectra, with an MCC value of 1, predicting all the 'other' class proteins correctly. The Raman full spectra models showed very poor performance accuracy. Table 5.3 below shows the tabulated performance results for the Raman analyses using SVM classification.


Table 5.3 SVM classification performance accuracies for amide regions and whole spectra of Raman data

Raman Classification

Alpha Helical Models
Amide Region: I | II | III | I&II | I&III | II&III | I&II&III | Whole
TP:  5 | 2 | 2 | 4 | 6 | 3 | 5 | 1
TN:  17 | 14 | 18 | 15 | 18 | 18 | 17 | 15
FP:  1(1ab) | 4(2b:2ab) | 0 | 3(2b:1ab) | 0 | 1(1b) | 1(1b) | 3(1a:1b:1o)
FN:  2 | 5 | 5 | 3 | 1 | 4 | 2 | 6
MCC: 0.69 | 0.07 | 0.47 | 0.40 | 0.90 | 0.46 | 0.70 | 0.03

Beta Sheet Models
Amide Region: I | II | III | I&II | I&III | II&III | I&II&III | Whole
TP:  0 | 1 | 0 | 0 | 0 | 0 | 0 | 0
TN:  20 | 20 | 21 | 18 | 18 | 19 | 15 | 21
FP:  1(1ab) | 1(1ab) | 0 | 3(2ab:1o) | 3(2ab:1o) | 2(2o) | 6(2ab:4o) | 0
FN:  4 | 3 | 4 | 4 | 4 | 4 | 4 | 4
MCC: -0.10 | 0.27 | 0.00 | -0.16 | -0.16 | -0.00 | -0.25 | 0.00

AlphaBeta Models
Amide Region: I | II | III | I&II | I&III | II&III | I&II&III | Whole
TP:  4 | 1 | 2 | 2 | 2 | 2 | 2 | 1
TN:  16 | 17 | 19 | 15 | 16 | 18 | 14 | 17
FP:  3(2a:1b) | 2(1b:1o) | 0 | 4(3a:1o) | 2(1b:1o) | 2(2b) | 5(3a:1b:1o) | 2(1ab:1o)
FN:  2 | 5 | 4 | 4 | 4 | 4 | 4 | 5
MCC: 0.48 | 0.08 | 0.52 | 0.12 | 0.26 | 0.30 | 0.07 | 0.08

Other Models
Amide Region: I | II | III | I&II | I&III | II&III | I&II&III | Whole
TP:  7 | 7 | 8 | 6 | 5 | 6 | 3 | 1
TN:  13 | 9 | 17 | 12 | 10 | 9 | 15 | 12
FP:  4(3b:1ab) | 8(5a:2ab:1b) | 0 | 5(3b:2ab) | 7(2a:4b:1ab) | 8(3a:2b:3ab) | 2(1b:1ab) | 5(2a:1b:2ab)
FN:  1 | 1 | 0 | 2 | 3 | 2 | 5 | 7
MCC: 0.61 | 0.41 | 1 | 0.41 | 0.20 | 0.26 | 0.30 | -0.18

TP=true positives; TN=true negatives; FP=false positives; FN=false negatives; MCC=Matthews Correlation Coefficient; a=α-helix; b=β-sheet; ab=αβ; o=other.

As seen for the Raman models, the performances of the ROA models, tabulated in Table 5.4 below, varied across the different amide combinations and fold classes with no distinct trends. The performance showed a very slight improvement compared to the Raman models. Most of the MCC values were very close to zero. The exception was the model from the combination of amides I&II&III for α-helix prediction, which produced an MCC value of 1. The other fold class predictions from the amide I&II&III combination produced reasonably good results; the β-sheet model had an MCC value of 0.56 and the 'other' prediction model an MCC value of 0.59. The αβ model had a low MCC value of 0.11.


Average MCC values were reported by the ROA model from the amide I band spectra for β-sheet prediction, with 0.54, and by the model from the amide I&III combination, with an MCC value of 0.59.

Table 5.4 SVM classification performance accuracies for amide regions and whole spectra of ROA data

ROA Classification

Alpha Helical Models
Amide Region: I | II | III | I&II | I&III | II&III | I&II&III | Whole
TP:  1 | 1 | 3 | 1 | 3 | 3 | 5 | 3
TN:  27 | 26 | 21 | 23 | 24 | 19 | 27 | 26
FP:  0 | 1(1ab) | 6(4ab:2o) | 4(2ab:2o) | 3(2a:1o) | 8(5ab:3o) | 0 | 1(1o)
FN:  4 | 4 | 2 | 4 | 2 | 2 | 0 | 2
MCC: 0.48 | 0.24 | 0.31 | 0.05 | 0.45 | 0.23 | 1 | 0.62

Beta Sheet Models
Amide Region: I | II | III | I&II | I&III | II&III | I&II&III | Whole
TP:  6 | 0 | 4 | 7 | 3 | 5 | 3 | 5
TN:  20 | 25 | 24 | 19 | 24 | 22 | 24 | 24
FP:  4(3ab:1o) | 0 | 0 | 5(3ab:1a:1o) | 0 | 2(2o) | 0 | 0
FN:  2 | 8 | 4 | 1 | 5 | 3 | 5 | 3
MCC: 0.54 | 0 | 0.35 | 0.60 | 0.41 | 0.42 | 0.56 | 0.78

AlphaBeta Models
Amide Region: I | II | III | I&II | I&III | II&III | I&II&III | Whole
TP:  3 | 2 | 0 | 2 | 5 | 1 | 2 | 6
TN:  21 | 17 | 23 | 23 | 16 | 22 | 20 | 22
FP:  3(2a:1b) | 7(5o:2b) | 0 | 1(1b) | 8(3a:2b:3o) | 2(1o:1b) | 3(1o:1a:1b) | 2(1b:1a)
FN:  5 | 6 | 8 | 6 | 3 | 7 | 7 | 2
MCC: 0.28 | -0.04 | 0 | 0.13 | 0.26 | 0.06 | 0.11 | 0.67

Other Models
Amide Region: I | II | III | I&II | I&III | II&III | I&II&III | Whole
TP:  4 | 5 | 2 | 5 | 5 | 1 | 5 | 2
TN:  19 | 13 | 21 | 18 | 21 | 19 | 21 | 18
FP:  2(1a:1ab) | 8(3ab:5b) | 0 | 3(1b:2ab) | 0 | 2(1b:1ab) | 0 | 3(3b)
FN:  7 | 6 | 9 | 6 | 6 | 10 | 6 | 9
MCC: 0.33 | 0.07 | 0.34 | 0.34 | 0.59 | -0.01 | 0.59 | 0.05

TP=true positives; TN=true negatives; FP=false positives; FN=false negatives; MCC=Matthews Correlation Coefficient; a=α-helix; b=β-sheet; ab=αβ; o=other.

The SVM classification technique generally predicted very poorly, with low accuracies for both Raman and ROA spectra. The full-spectrum analyses of ROA spectra had the best accuracy, but only in the predictions for the α-helix, β-sheet and αβ classes. These analyses showed that examining individual amide band regions of the Raman and ROA spectra, or their different combinations, added no value to the structural information derived by the SVM classification technique. The good performance from the ROA full-spectrum analyses, however, suggested that this technique might work better for ROA than for Raman spectra. Note that the values for the 10 cm-1 bin for the full spectra presented in Tables 5.3 and 5.4 differ from those presented in Tables 5.1 and 5.2.
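The per-fold-class decisions described above can be sketched as a one-vs-rest kernel SVM. This is not the thesis pipeline: the scikit-learn library, the synthetic "spectra" and all parameter values below are assumptions for illustration only.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical stand-in for binned spectra: 20 "spectra" of 50 intensity
# bins each, generated around two class-dependent template vectors
# (class 1 playing the role of, e.g., alpha-helical proteins vs. the rest).
rng = np.random.default_rng(0)
X = np.vstack([np.full(50, i % 2) + 0.1 * rng.standard_normal(50)
               for i in range(20)])
y = np.array([i % 2 for i in range(20)])

# RBF-kernel SVM for one binary fold-class decision (class vs. rest).
clf = SVC(kernel="rbf", C=10.0, gamma="scale").fit(X, y)
train_acc = clf.score(X, y)
```

On real spectra the classes overlap far more than in this toy data, which is one reason the MCC values in Tables 5.3 and 5.4 are often close to zero.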


5.3 SVM Regression Analyses Results of Raman and ROA spectra

SVM regression works on the same principle as SVM classification, where a separating margin splits the input data using a kernel function that transforms the data into a high-dimensional feature space. In SVM regression, the generalisation margin is optimised such that the function finds a boundary, the ε-tube, about the margin; data points within a certain distance of the actual values lie inside the tube and are not penalised, while points that fall outside the boundary incur penalties measured by slack variables. The loss function is referred to as the ε-insensitive loss function. SVM regression is discussed in further detail in chapter 3 on machine learning.
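The ε-insensitive loss described above can be written down directly. A minimal Python sketch (the function name and tolerance values are illustrative, not from the thesis):

```python
def epsilon_insensitive_loss(y_true, y_pred, eps=0.1):
    """Loss used by SVM regression: deviations inside the eps-tube cost
    nothing; points outside pay linearly for their slack beyond eps."""
    return [max(0.0, abs(t - p) - eps) for t, p in zip(y_true, y_pred)]

# A prediction 0.05 off lies inside the tube (loss 0); a prediction
# 0.3 off pays 0.3 - 0.1 = 0.2 of slack.
losses = epsilon_insensitive_loss([1.0, 2.0], [1.05, 2.3], eps=0.1)
```

The width of the tube (ε) and the penalty weight on the slack together control how tightly the fitted function tracks the training data.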

5.3.1 SVM Regression Analyses of Raman and ROA

The SVM regression showed generally poor performance across all the models in the four fold classes for both Raman and ROA. Model performance was measured using the R2 correlation coefficient: the lower the R2 value, the lower the correlation between the observed percentage content and the value predicted by the SVM regression models. Graphs of the SVM regression models showing correlation plots are in Appendix F. Model performance was also described by the root mean squared deviation (RMSD) and the determination enhancement parameter ζ, the ratio of the standard deviation of the observed data to the RMSD (ζ = SD/RMSD). The higher the value of ζ, the higher the determination accuracy. Tables 5.5 and 5.6 show the performance accuracies of the SVM regressions for both ROA and Raman spectra analyses.
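The three performance measures used throughout the regression chapters (R2, RMSD and ζ = SD/RMSD) can be computed as follows. This is an illustrative sketch, not the analysis code used in the thesis:

```python
import math

def regression_stats(observed, predicted):
    """Return (R2, RMSD, zeta) for paired observed/predicted values.

    R2 is the squared Pearson correlation; RMSD is the root mean squared
    deviation; zeta = SD(observed) / RMSD, so zeta > 1 means the model
    predicts better than simply quoting the mean."""
    n = len(observed)
    mo = sum(observed) / n
    mp = sum(predicted) / n
    cov = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    so = math.sqrt(sum((o - mo) ** 2 for o in observed))
    sp = math.sqrt(sum((p - mp) ** 2 for p in predicted))
    r2 = (cov / (so * sp)) ** 2
    rmsd = math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)
    sd = so / math.sqrt(n)  # population SD of the observed values
    return r2, rmsd, sd / rmsd
```

For example, observed contents of 10, 20, 30 and 40% predicted as 12, 18, 32 and 38% give an RMSD of 2% and a ζ well above 1.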

The SVM regression technique predicted the fractions of the different structural motifs very poorly. The 'other' models from both Raman and ROA were particularly poor, and most showed over-fitting, with the same value being predicted for every sample in the test set. These cases are marked by the null values for R2 in the tables below.


Table 5.5 ROA SVM regression performance accuracies for analyses using whole spectra and spectra from amide regions

                 Alpha Helix (H)             Beta Sheet (E)              Other (O)
                 R2    RMS(δ)  SD     ζ      R2    RMS(δ)  SD     ζ      R2    RMS(δ)  SD     ζ
Amide I          0.02  25.32   21.14  0.83   0.44  14.90   16.57  1.11   0.00  18.16   16.54  0.91
Amide II         0.45  19.94   21.14  1.06   0.39  17.68   16.57  0.94   -     18.26   16.54  0.91
Amide III        0.40  16.95   21.14  1.25   0.17  17.50   16.57  0.95   0.30  14.85   16.54  1.11
Amide I&II       0.14  29.24   21.14  0.72   0.38  15.58   16.57  1.06   0.00  20.55   16.54  0.80
Amide II&III     0.07  42.05   21.14  0.50   0.13  18.54   16.57  0.89   0.27  15.72   16.54  1.05
Amide I&III      0.67  12.27   21.14  1.72   0.43  14.24   16.57  1.16   0.10  16.11   16.54  0
Amide I&II&III   0.53  14.31   21.14  1.48   0.53  13.09   16.57  1.27   0.02  16.96   16.54  0.97
Whole            0.67  14.69   21.14  1.44   0.40  15.17   16.57  1.09   0.10  18.27   16.54  0.91
H = α-helix; E = β-sheet; O = other; R2 = regression coefficient; RMS(δ) = root mean squared deviation; SD = standard deviation of SCOP reference %; ζ = SD/RMS(δ).

Table 5.6 Raman SVM regression performance accuracies for analyses using whole spectra and spectra from amide regions

                 Alpha Helix (H)             Beta Sheet (E)              Other (O)
                 R2    RMS(δ)  SD     ζ      R2    RMS(δ)  SD     ζ      R2    RMS(δ)  SD     ζ
Amide I          0.12  30.08   26.68  0.89   0.03  22.27   10.96  0.49   -     18.07   18.21  1.01
Amide II         0.03  34.11   26.68  0.78   0.06  29.67   10.96  0.37   -     18.07   18.21  1.01
Amide III        0.08  46.58   26.68  0.57   0.01  25.52   10.96  0.43   -     18.07   18.21  1.01
Amide I&II       0.59  21.05   26.68  1.27   0.19  21.21   10.96  0.52   -     18.07   18.21  1.01
Amide II&III     0.10  32.51   26.68  0.82   0.42  15.68   10.96  0.70   0.23  19.04   18.21  0.96
Amide I&III      0.51  20.95   26.68  1.27   0.14  19.67   10.96  0.56   0.42  14.05   18.21  1.30
Amide I&II&III   0.53  22.30   26.68  1.19   0.23  17.63   10.96  0.62   0.46  12.96   18.21  1.40
Whole            0.03  38.18   26.68  0.70   -     14.98   10.96  0.73   0.00  30.76   18.21  0.59
H = α-helix; E = β-sheet; O = other; R2 = regression coefficient; RMS(δ) = root mean squared deviation; SD = standard deviation of SCOP reference %; ζ = SD/RMS(δ).

5.3.2 Discussion

The SVM regression analyses generally showed very poor performance for both ROA and Raman, with very low correlation between the observed and predicted values and low determination accuracies. Moderate performance was observed for the models built from amide I&II&III ROA spectra for prediction of α-helix and β-sheet, which both had R2 values of 0.53. The α-helix models for both Raman and ROA had comparably better results than the prediction accuracies for the other fold classes, with a few models reporting R2 values greater than 0.5. The ROA α-helix models from amide I&III, amide I&II&III and the full spectrum had R2 values of 0.67, 0.53 and 0.67 respectively. The α-helix prediction models from the Raman amide I&III, amide I&II&III and amide I&II spectral bands had R2 values of 0.51, 0.53 and 0.59 respectively.


The SVM analyses failed to predict with any significant accuracy across all models for all the fold classes investigated here. In the SVM classification analyses of Raman spectra (Table 5.3), a few α-helix models had high MCC values: 0.69 for amide I, 0.90 for amide I&III and 0.70 for amide I&II&III. By contrast, in the ROA classification analyses (Table 5.4), the α-helix models from amide spectra reported MCC values below 0.5, except the model from amide I&II&III, which had an MCC value of 1. Some ROA β-sheet models showed moderate prediction accuracy: amide I with an MCC of 0.54, amide I&II with an MCC of 0.60 and amide I&II&III with an MCC of 0.56. The predictions for the αβ and 'other' motifs from both ROA and Raman amide band spectra had low accuracies, except the 'other' models built from the amide I and amide III Raman spectral bands, which reported MCC values of 0.61 and 1 respectively.

The α-helix and β-sheet structural motifs are stabilised by well-ordered hydrogen bonding, whereas the 'other' motif (turns and random coils) has higher variability and looser spatial arrangements. The spectral features of the former are therefore better defined than those of the latter, whose spectral characteristics (location, intensity) may vary. This may explain why some models built from Raman and ROA amide band spectra had relatively high accuracy for α-helix and β-sheet predictions compared to predictions of the αβ and 'other' fold classes: the more uniform spectral features of the more ordered α-helix and β-sheet motifs are easier to model. The poor results from the αβ prediction models might be due to the difficulty of differentiating the combined αβ patterns from the individual motifs; perhaps the models fail to find a consensus pattern that represents the combined motifs.

The classification analyses of full ROA spectra produced good quality models, except for the 'other' model, while the models from full Raman spectra were poor in all four fold classes, indicating that SVM classification might be better applied to analyses of well-defined structural motifs using full spectra from ROA rather than Raman. The models built from amide band spectra did not generally confer any improvement in accuracy, as might have been expected from excluding the noise from other regions.

The SVM regression analyses of ROA and Raman did not yield very good results (Table 5.5 and Table 5.6). Slightly better accuracies (R2 values higher than 0.50) were observed for some of the α-helix prediction models. This could be due to the same reasons cited above.

Sreerama and Woody (2000) investigated the number of proteins needed in a reference set to sufficiently represent the protein fold class space. The group used five protein reference sets of 29, 37, 42, 43 and 48 proteins, depending on the circular dichroism wavelength range used for the analyses. Their findings showed that the larger reference sets performed better than the smaller sets because of the higher structural and spectral variability covered by a larger protein reference set.

In the SVM classification analyses, the models were built from a set of 35 ROA proteins and 28 Raman proteins, including denatured proteins and polypeptides. The denatured proteins and polypeptides were removed for the subsequent SVM regression analyses in order to match proteins to their SCOP references, so the SVM regression models were built from 25 ROA proteins and 15 Raman proteins. Further analyses, the PLS regression and Random Forest clustering discussed in chapters 6 and 7, used similar reference sets to the SVM regression analyses but showed very good results. In the SVM classification, the ROA predictions had more models with MCC values greater than 0.5, suggesting that the SVM technique may be more sensitive to the number of proteins in the training set, and may need a larger number of proteins to reach a performance accuracy comparable to that of the other analysis methods, PLS regression and Random Forests.


5.4 References

Sreerama N. and Woody R. W. (2000) Estimation of the protein secondary structure from circular dichroism spectra: Comparison of CONTIN, SELCON and CDSSTR methods with an expanded reference set. Analytical Biochemistry 287, 252-260


6. Partial Least Squares (PLS) Regression Analyses of Raman and ROA spectra

Partial Least Squares (PLS) regression finds the relation between a set of explanatory (X) variables and response (Y) variables by identifying the underlying latent factors of X that best model the Y variables. The latent factors derived are those that contribute most to the variation in the response while maintaining the relationship between the X and Y variables. PLS regression is discussed further in chapter 3. Here, PLS regression was used to determine the fractions of the α-helix, β-sheet and 'other' structural motifs from Raman and ROA spectra. The PLS regression models performed better than the SVM regression models of chapter 5.

6.1 Methods

Analyses were done on 44 ROA spectra and 24 Raman spectra using MATLAB PLS software with 10 principal components. The PLS models' performances were reported using the squared Pearson correlation coefficient (R2), the root mean squared deviation (RMS) and the determination enhancement parameter ζ, the ratio of the standard deviation (SD) of the observed data to the RMS. The ζ parameter estimates the amount of structural information extracted: the higher the ζ value, the more accurate the prediction. The subcategories of the secondary structures were designated as: H = α-helix, E = β-sheet and O = other (which includes 310 helix).
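The underlying PLS fit, performed in the thesis with MATLAB PLS software, can be sketched with the standard NIPALS PLS1 algorithm for a single response variable. The Python implementation below is an illustrative assumption, not the MATLAB code used; with as many components as predictors it coincides with ordinary least squares.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Minimal NIPALS PLS1 sketch (univariate y).

    Returns regression coefficients and intercept. Each iteration finds
    a weight vector maximising covariance with y, then deflates X and y
    before extracting the next latent component."""
    X = np.asarray(X, float)
    y = np.asarray(y, float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk
        w /= np.linalg.norm(w)
        t = Xk @ w                     # scores for this component
        tt = t @ t
        p = Xk.T @ t / tt              # X loadings
        qk = yk @ t / tt               # y loading
        Xk = Xk - np.outer(t, p)       # deflate X
        yk = yk - qk * t               # deflate y
        W.append(w); P.append(p); q.append(qk)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)
    return coef, y_mean - x_mean @ coef

# Toy check: a response exactly linear in two predictors is fitted exactly.
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 7.0], [2.0, 1.0], [4.0, 6.0]])
y = X @ np.array([2.0, -1.0]) + 0.5
coef, intercept = pls1_fit(X, y, n_components=2)
pred = X @ coef + intercept
```

In spectroscopic use, X holds the binned spectral intensities and y the secondary structure percentage; the number of components (10 here) limits how much spectral variation the model may exploit.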

Plotted graphs of the PLS regression models showing the correlation between predicted and observed structural content are in Appendix G.

The amide I, II and III and backbone spectral regions were analysed in isolation and in combination to investigate the determination of secondary structure. Analyses were performed on the following regions: amide I (1600-1700 cm-1), amide II (1540-1600 cm-1) and amide III (1200-1340 cm-1), combinations of these, the backbone stretch modes (850-1100 cm-1) and second derivative Raman spectra.

6.2 Results and Discussion

PLS regression on ROA and Raman spectra performed well compared to SVM regression (Tables 5.5 and 5.6 in chapter 5). High correlation between the predicted and observed structural content values was seen across all three structural classes, and excellent results were obtained from the analyses of whole Raman and ROA spectra. The Raman results showed that the most accurate predictions were obtained when the whole spectrum was used: α-helix prediction had an R2 of 0.97, β-sheet prediction an R2 of 0.88 and 'other' prediction an R2 of 0.95. The full spectra outperformed all spectral subdivisions, showing that there is structural information in the regions outside the amide and backbone regions.

Tables 6.1 and 6.2 show the PLS regression results for the ROA and Raman data for the amide I, II and III regions, the different combinations of these regions, and the whole spectra. High correlations were reported for all amide regions and their combinations for both ROA and Raman. The ROA amide II predictions were an exception, showing low accuracy, with RMS values ranging from 11-19% across the different classes; the R2 values for the ROA amide II models were 0.29, 0.32 and 0.33 for the α-helix, β-sheet and 'other' predictions respectively. This was also observed in the Raman analyses of the amide region spectra, where the predictions of α-helix and β-sheet structural content from amide II spectra showed low correlation, with R2 values of 0.52 and 0.30. However, the Raman amide II model for 'other' structure performed better, with an R2 value of 0.70. The 'other' models generated from ROA spectra performed slightly less accurately than the Raman models for the amide I, II, I&II and II&III spectra. The ROA and Raman analyses showed that the amide I and amide III spectral data were more useful than isolated amide II data. The combined spectra also showed high prediction accuracy, demonstrating that combining spectra from different amide ranges provides useful structural information. The α-helix results generally produced higher correlation values than those of the β-sheet or 'other' classes; the higher accuracy for α-helix may be due to the greater variability in the backbone dihedral angles of the β-sheet and 'other' structural classes.

Table 6.1 ROA PLS regression statistics for analyses using whole spectra and spectra from amide regions

                 Alpha Helix (H)             Beta Sheet (E)              Other (O)
                 R2    RMS(δ)  SD     ζ      R2    RMS(δ)  SD     ζ      R2    RMS(δ)  SD     ζ
Amide I          0.72  12.29   23.30  1.90   0.81  7.67    17.81  2.32   0.44  10.80   14.58  1.35
Amide II         0.29  19.37   23.30  1.20   0.32  14.52   17.81  1.23   0.33  11.76   14.58  1.24
Amide III        0.84  9.25    23.30  2.52   0.81  7.78    17.81  2.29   0.75  7.21    14.58  2.02
Amide I&II       0.84  1.93    23.30  12.07  0.84  6.96    17.81  2.56   0.67  8.22    14.58  1.77
Amide II&III     0.87  8.23    23.30  2.83   0.87  6.36    17.81  2.80   0.78  6.92    14.58  2.11
Amide I&III      0.89  7.67    23.30  3.03   0.91  5.25    17.81  3.40   0.80  6.48    14.58  2.25
Amide I&II&III   0.90  7.42    23.30  3.14   0.94  4.34    17.81  4.11   0.83  5.95    14.58  2.45
Whole            0.98  2.88    23.30  8.09   0.98  2.47    17.81  7.22   0.96  2.73    14.58  5.34
H = α-helix; E = β-sheet; O = other; R2 = correlation coefficient; RMS(δ) = root mean squared deviation; SD = standard deviation (SCOP reference %); ζ = SD/RMS(δ).

Table 6.2 Raman PLS regression statistics for analyses using whole spectra and spectra from amide regions

                 Alpha Helix (H)             Beta Sheet (E)              Other (O)
                 R2    RMS(δ)  SD     ζ      R2    RMS(δ)  SD     ζ      R2    RMS(δ)  SD     ζ
Amide I          0.88  9.43    27.64  2.93   0.68  10.22   18.18  1.78   0.76  8.83    18.22  2.06
Amide II         0.52  18.79   27.64  1.47   0.30  14.89   18.18  1.22   0.70  9.72    18.22  1.87
Amide III        0.78  2.17    27.64  2.17   0.76  8.71    18.18  2.09   0.75  8.88    18.22  2.05
Amide II&III     0.90  8.37    27.64  3.30   0.86  6.58    18.18  2.76   0.94  4.32    18.22  4.22
Amide I&II       0.93  7.29    27.64  3.79   0.86  6.58    18.18  2.76   0.82  7.57    18.22  2.41
Amide I&III      0.89  9.02    27.64  3.06   0.90  5.70    18.18  3.19   0.87  6.33    18.22  2.88
Amide I&II&III   0.94  6.50    27.64  4.26   0.92  4.89    18.18  3.71   0.91  5.22    18.22  3.49
Whole            0.97  4.39    27.64  6.30   0.88  4.86    18.18  2.92   0.95  4.08    18.22  4.46
H = α-helix; E = β-sheet; O = other; R2 = correlation coefficient; RMS(δ) = root mean squared deviation; SD = standard deviation (SCOP reference %); ζ = SD/RMS(δ).

6.2.1 The Backbone regions

Analyses were also performed on the spectral region 850-1100 cm-1, which is designated the backbone region and is associated with the backbone skeletal stretches of the protein (Wen et al., 1994). It was included in our analyses to investigate the quality of prediction of the structural fold fractions from this region. The performance statistics are shown in Tables 6.3 and 6.4 below. The Raman backbone region gave high R2 values of 0.97, 0.91 and 0.93 for the α-helix, β-sheet and 'other' models respectively. The ROA analyses, on the other hand, gave lower R2 values of 0.67 for the α-helix model, 0.81 for the β-sheet model and 0.01 for the 'other' model. The Raman backbone analyses produced very accurate secondary structure predictions, comparable to those from the full spectra.

Table 6.3 PLS regression results on Raman spectra in the 850-1100 cm-1 backbone region

Raman           R2    RMS(δ)  SD     ζ (SD/RMS)
H (α-helix)     0.97  4.92    27.64  5.62
E (β-sheet)     0.91  5.37    18.18  3.38
O (other)       0.93  4.58    18.22  3.98
H = α-helix; E = β-sheet; O = other; R2 = regression coefficient; RMS(δ) = root mean squared deviation; SD = standard deviation of SCOP reference %.

Table 6.4 PLS regression results on ROA spectra in the 850-1100 cm-1 backbone region

ROA             R2    RMS(δ)  SD     ζ (SD/RMS)
H (α-helix)     0.67  13.32   23.30  1.75
E (β-sheet)     0.81  7.67    17.81  2.32
O (other)       0.01  10.81   14.58  1.35
H = α-helix; E = β-sheet; O = other; R2 = regression coefficient; RMS(δ) = root mean squared deviation; SD = standard deviation of SCOP reference %.

6.2.2 2nd Derivative Raman spectra data

To investigate whether the performance accuracy could be improved by removing interference from the spectra, we also analysed the 2nd derivative of the Raman spectra, and very high correlation coefficients were obtained from the analyses of the 2nd derivative whole-spectrum data. The 1st derivative measures how the intensity varies with wavenumber (cm-1) and can be calculated from differences in intensity at points along the spectrum. The 2nd derivative makes it easier to determine the true nature of the intensity peaks by highlighting the varying spectral features arising from vibrational modes of the structural motifs, as discussed in chapter 2, while removing background interference present in the original spectra. Table 6.5 below compares the results from the whole Raman spectra with those from the 2nd derivative Raman spectra. The R2 values were slightly higher than for the raw whole spectra, with strong correlation shown by values of 0.99 for all three structural classes.
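The derivative spectra described above can be approximated numerically. The thesis does not state the differentiation method used, so the simple finite-difference sketch below, applied to a hypothetical Gaussian "amide I band", is an assumption; smoothing derivative filters such as Savitzky-Golay are common in practice because plain differencing amplifies noise.

```python
import numpy as np

# Hypothetical spectrum on an evenly spaced wavenumber grid (2 cm^-1 bins),
# with a fake Gaussian amide I band centred at 1655 cm^-1.
wavenumbers = np.arange(600.0, 1800.0, 2.0)
intensity = np.exp(-((wavenumbers - 1655.0) / 15.0) ** 2)

# Finite differences: the 1st derivative tracks how intensity changes with
# wavenumber; differencing again gives the 2nd derivative, which sharpens
# overlapping peaks and flattens slowly varying baselines (a band maximum
# appears as a pronounced negative trough in the 2nd derivative).
step = wavenumbers[1] - wavenumbers[0]
d1 = np.diff(intensity) / step
d2 = np.diff(intensity, n=2) / step ** 2
```

Each differencing step shortens the array by one point, so the derivative spectra must be re-aligned with a correspondingly trimmed wavenumber axis before any regression.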


Table 6.5 PLS regression results on 2nd derivative Raman spectra. The performance results are compared to the results of analyses on whole Raman spectra without the 2nd derivative.

                 Alpha Helix (H)             Beta Sheet (E)              Other (O)
                 R2    RMS(δ)  SD     ζ      R2    RMS(δ)  SD     ζ      R2    RMS(δ)  SD     ζ
Raw              0.97  4.39    27.64  6.30   0.88  6.24    18.17  2.91   0.95  4.08    18.22  4.46
2nd derivative   0.99  1.68    27.64  16.42  0.99  0.63    18.17  28.69  0.99  1.36    18.22  13.38
H = α-helix; E = β-sheet; O = other; R2 = regression coefficient; RMS(δ) = root mean squared deviation; SD = standard deviation of SCOP reference %; ζ = SD/RMS(δ).

6.2.3 Alternative helix structure assignment

Lees et al. (2006) investigated various methods of determining secondary structure from circular dichroism spectroscopy data and examined how different secondary structure assignments affected the performance of PLS regression prediction models, for example dividing the β-strand motif into parallel and antiparallel. An assignment scheme of PPII helix, core β-sheet, βD, α-helix and other produced a PPII helix prediction with an R2 (regression coefficient) value of 0.64; the α-helix model reported an R2 value of 0.97, core β-sheet an R2 value of 0.86 and other an R2 value of 0.86. PLS regression analysis using an assignment scheme of parallel β-strand, anti-parallel β-strand, β-turns, core β-sheet, PPII helix and 310 helix reported R2 values of 0.88 for core β-sheet, 0.64 for PPII helix and 0.39 for 310 helix. A simple 3-state secondary structure assignment scheme of α-helix, β-sheet and 'other' gave high prediction accuracies from PLS regression, with R2 values of 0.97 for α-helix, 0.90 for β-sheet and 0.78 for other. This follows the general trend of the results discussed in this chapter, with α-helix performing better than the other two structural classes. However, the ROA results (Table 6.1) were as accurate for β-sheet as for α-helix secondary structure prediction.

The group concluded that the 3-state structural assignment was sufficiently robust, as most proteins rich in α-helix and β-strand structures will have little 310 helix and β-bridge content. It is also the structural assignment used by most structural prediction programs today and would therefore be easily adaptable to different structural analysis problems. Splitting the structural classes into smaller components, for example splitting the β-sheet class into parallel β-strand, anti-parallel β-strand and β-turns, does not appear to add much structural information. Table 6.6 showed no significant improvement in the prediction of α-helix secondary structure content, suggesting that analysis using the 3-state scheme of α-helix, β-sheet and other may be sufficient to extract structural information from spectral data.

Oberg et al. (2004) performed analyses on circular dichroism spectra, investigating different structural assignment schemes and whether the assignments enhanced the performance of PLS regression models. They found that combinations of helix structures (α-helix & 310 helix, or α-helix & irregular structures) or of strand structures (β-strand & isolated-extended β residues) did not improve prediction accuracy compared to the α-helix or β-strand predictions on their own. The isolated-extended β residues were described as residues with φ,ψ angles that fall within the β-sheet Ramachandran range but do not form β-sheet structure. The findings of Oberg et al. (2004) coincide with our results for PLS analysis using a combined 310 helix and α-helix assignment for helix secondary structure: combining the helix content did not significantly improve the prediction accuracy of the Raman or ROA helix prediction models. Table 6.6 below shows the performance accuracies when the helical content was combined for the Raman and ROA spectra.

Table 6.6 PLS regression performance statistics on combined helical content of α-helix and 310 helix

                 Raman                       ROA
                 R2    RMS(δ)  SD     ζ      R2    RMS(δ)  SD     ζ
Amide I          0.88  9.33    27.55  2.95   0.71  12.35   23.09  1.87
Amide II         0.49  19.15   27.55  1.44   0.05  19.04   23.09  1.21
Amide III        0.81  11.86   27.55  2.32   0.82  9.64    23.09  2.40
Amide II&III     0.90  8.62    27.55  3.19   0.86  8.57    23.09  2.70
Amide I&II       0.91  8.02    27.55  3.43   0.87  1.61    23.09  14.39
Amide I&III      0.89  8.82    27.55  3.12   0.89  7.50    23.09  3.08
Amide I&II&III   0.95  6.16    27.55  4.47   0.90  7.34    23.09  3.15
Whole            0.94  6.39    27.55  4.31   0.98  3.01    23.09  7.65
R2 = correlation coefficient; RMS(δ) = root mean squared deviation; SD = standard deviation (SCOP reference %); ζ = SD/RMS(δ).


The results showed consistent or slightly lower ζ scores and R2 correlation values for the combined helical predictions (Table 6.6) compared to the predictions for α-helix alone (Tables 6.1 and 6.2). Small increases in prediction accuracy were seen only for Raman amide II, III and I&II&III. The prediction accuracies for Raman amide II, III and I&II&III, for α-helix content alone and for α-helix content with 310 helix, are summarised in Table 6.7.

Table 6.7 Comparison of prediction accuracies for Raman amide II, III and I&II&III between α-helix content alone and α-helix content with 310 helix content

                       α-helix (H)       α-helix (H) + 310 helix
                       R2     ζ          R2     ζ
Raman amide II         0.94   4.26       0.95   4.47
Raman amide III        0.78   2.17       0.81   2.32
Raman amide I&II&III   0.94   4.26       0.95   4.47
H = α-helix; R2 = correlation coefficient; ζ = SD/RMS; RMS = root mean squared deviation; SD = standard deviation (SCOP reference %).

The ROA analyses did not show improvements in prediction accuracy for the combined helix predictions. The amide II&III combined helical content (α-helix & 310 helix) prediction was unusual in reporting the same R2 value of 0.87 as that for α-helix content alone: the correlation between the observed structural content and the predicted variable remained constant, even though the prediction now covered more of the helical content.

The PLS regression algorithm produced very accurate predictions of the fractions of structural motifs from ROA and Raman spectra. The full spectra appear to contain more information, as they have slightly higher performance indices than the amide band regions, showing that the full spectral features contain information that can be lost when only the amide regions are used.

In the analyses of the amide spectral bands, the models from the amide II band of both ROA and Raman showed low prediction accuracy. The exceptions were the Raman amide II models, which had an R2 value of 0.52 for α-helix prediction and 0.70 for 'other' prediction. This suggests that some structural information can be mined from the amide II region, which is often left out of structural analyses of spectral data because the amide II Raman band has been reported to be weak (Salzer and Siesler, 2009; Socrates, 2001; Susi and Byler, 1988).

The analyses using the secondary structure scheme of α-helix and 310 helix concurred with the studies of the Lees and Oberg groups discussed above, in that an assignment scheme distinguishing helical structure into α-helix and 310 helix did not enhance the performance of the α-helix prediction models. Sreerama and Woody (2004) and Lees et al. (2006) used multivariate regression analyses to determine secondary structure from circular dichroism (CD) spectral data and reported correlation coefficient values in the ranges R2 = 0.85-0.94 for α-helix and R2 = 0.64-0.84 for β-sheet. The analyses discussed here reported higher correlation coefficient values for α-helix, in the range R2 = 0.97-0.98 for both ROA and Raman data (Tables 6.1 and 6.2). β-sheet secondary structure analysis also gave high correlation coefficient values, in the range R2 = 0.88-0.98 for both Raman and ROA spectral data (Tables 6.1 and 6.2).

The excellent results from the PLS regression algorithm demonstrate a strong relationship between ROA and Raman spectral features and structural information, which can be extracted using multivariate analytical methods such as PLS regression. Unlike SVM regression, PLS regression produced very good results despite the small protein sets, making it well suited to analyses of smaller protein sets. The analyses presented here were also submitted for publication (Kinalwa et al., 2010).


6.3 References

Kinalwa M., Blanch E. W., Doig A. J. (2010) Accurate Determination of Protein Secondary Structure Content from Raman and Raman Optical Activity Spectra. Analytical Chemistry, 82, 6463-6471

Lees J.G., Miles A.J., Janes R.W., Wallace B.A. (2006) Novel methods for secondary structure determination using low wavelength (VUV) circular dichroism spectroscopic data. BMC Bioinformatics 7,507-518

Oberg K.A., Ruysschaert J., Goormaghtigh E. (2004) The optimization of protein secondary structure determination with infrared and circular dichroism spectra. European Journal of Biochemistry 271, 2937-2948

Salzer R., Siesler H. W.(2009) Infrared and Raman spectroscopic imaging, Wiley-VCH, Weinheim,Germany

Sreerama N., Woody R.W.(2004) Computation and Analysis of Protein Circular Dichroism Spectra. Methods in Enzymology 383,318-351

Socrates G.(2001) Infrared and Raman characteristic group frequencies: tables and charts, 3rd ed., John Wiley & Sons, West Sussex, England

Susi H., Byler D. M. (1988) Fourier deconvolution of the amide I Raman band of proteins as related to conformation. Applied Spectroscopy 42, 819-826

Wen Z.Q., Hecht L., Barron L.D. (1994) β-Sheet and associated turn signatures in vibrational Raman optical activity spectra of proteins. Protein Science 3, 435-439


7. Random Forest Cluster Analyses of ROA and Raman spectra

A random forest (RF) is an ensemble of many decision trees in which each tree is grown from randomly selected input vectors. The trees are grown without pruning, splitting nodes according to the Gini index criterion, and error rates are estimated using the out-of-bag (OOB) samples. Each tree is grown to maximal size, repeating the splitting process at each node with a different randomly selected subset of variables each time. Class voting then takes place: the class assigned to a sample is the class to which the majority of the trees in the 'forest' have assigned it. The random forest clustering algorithm was used to determine the fold classes of the proteins from their ROA and Raman spectra.
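The leaf-sharing proximity measure behind the RF clustering (see chapter 3, section 3.4.5) can be sketched as follows. The scikit-learn library, the synthetic data and all parameter values here are assumptions for illustration, not the thesis implementation: two samples are deemed similar when many trees route them to the same terminal node.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical stand-in for binned spectra: 40 "spectra" of 30 bins,
# generated around two class-dependent templates.
rng = np.random.default_rng(1)
X = np.vstack([np.full(30, i % 2) + 0.2 * rng.standard_normal(30)
               for i in range(40)])
y = np.array([i % 2 for i in range(40)])

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# apply() returns, for each sample, the leaf index it reaches in every
# tree. Proximity = fraction of trees in which two samples share a leaf;
# dissimilarity = sqrt(1 - proximity), as used for the MDS embedding.
leaves = forest.apply(X)                                  # (n_samples, n_trees)
prox = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)
dissim = np.sqrt(1.0 - prox)
```

The dissimilarity matrix produced this way is what a multidimensional scaling step can then embed in two dimensions for visual cluster inspection.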

7.1 Model Performance Analysis

The measure of accuracy of the Random Forest models was the percentage of correctly predicted observations over the total number of samples. On this measure, the ROA RF models yielded better results than the Raman models, with more samples correctly predicted. Table 7.1 shows the percentages of correct predictions for both the ROA and Raman RF models generated from whole and amide region spectra. Predictions from every amide region of ROA performed better than predictions from the ROA whole spectra, for which the RF correctly predicted only 55% of the samples (22/44). The lowest accuracy was seen for the Raman amide II RF model, which correctly predicted only 2 out of 24 samples (8%).

130

Table 7.1 Percentage of proteins whose classes were correctly predicted. Numbers in parentheses show the number of correctly predicted observations out of the total number of observations

Region           Raman         ROA
Amide I          54% (13/24)   64% (28/44)
Amide II         8% (2/24)     64% (28/44)
Amide III        46% (11/24)   59% (26/44)
Amide I+II       50% (12/24)   73% (32/44)
Amide II+III     38% (9/24)    75% (33/44)
Amide I+III      50% (12/24)   61% (27/44)
Amide I+II+III   38% (9/24)    70% (31/44)
Whole            38% (9/24)    55% (22/44)

The Raman and ROA spectral datasets of Table 7.1 above were visualised in the two multidimensional scaling (MDS) plots below. Multidimensional scaling is a technique that computes points whose inter-point distances are as close as possible to the proximity-based distances produced by the Random Forest algorithm. The measure of how close the proximity distances are to the MDS-calculated distances is referred to as the stress function. In these analyses, the Shepard plot was used to visualise the fit. The Shepard plot, shown in Figure 7.1, is a scatter plot of the inter-point distances against the original dissimilarities; the nearer the scatter lies to the 1:1 line, the better the fit of the distances to the dissimilarities.

[Figure 7.1 panels: "ROA Spectra - Comparing Metric Criteria: Stress, Sammon Mapping, Squared Stress" (left) and "Raman Spectra - Comparing Metric Criteria: Stress, Sammon Mapping, Squared Stress" (right); both plot Distances against Dissimilarities.]

Figure 7.1 Shepard plots for ROA (left) and Raman (right) data, comparing the different stress functions provided in the MATLAB software: Sammon mapping, metric stress and squared stress. The scatter for the squared stress is not close to the 1:1 line except for the very few points at the largest dissimilarity values. Of the three stress measures, Sammon mapping tends closest to the 1:1 line, meaning the inter-point distances are most closely approximated from the original dissimilarities and are therefore best preserved.
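The thesis used MATLAB's MDS routines with the Sammon mapping stress. As an illustrative alternative with a closed-form solution, the classical (Torgerson) variant of metric MDS, not the Sammon mapping actually used here, can be sketched in a few lines:

```python
import numpy as np

def classical_mds(D, dims=2):
    """Classical (Torgerson) MDS sketch: embed points so their Euclidean
    inter-point distances approximate the dissimilarity matrix D."""
    D = np.asarray(D, float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centring matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centred Gram matrix
    evals, evecs = np.linalg.eigh(B)
    order = np.argsort(evals)[::-1][:dims]   # largest eigenvalues first
    L = np.sqrt(np.clip(evals[order], 0.0, None))
    return evecs[:, order] * L

# Toy check: distances taken from genuine 2-D points are reproduced
# exactly (up to rotation/reflection of the configuration).
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 1.0]])
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
emb = classical_mds(D, dims=2)
D_hat = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=2)
```

Iterative stress-minimising variants such as Sammon mapping start from a configuration like this and then weight small dissimilarities more heavily, which is why they fit the 1:1 line of the Shepard plot more closely.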


The MDS configuration in which the points are arranged is predetermined by the user. In our analyses, we found the two-dimensional configuration sufficient for visualisation purposes. The MDS technique is discussed in chapter 4 section 4.4.3 and chapter 3 section 3.4.5. The data points are coloured by their SCOP structural class: red, αβ proteins; blue, α-helical proteins; green, β-sheet proteins; black, other proteins. The ROA whole spectra, figure 7.2 below left side, show clearly distinct clusters of α-helical proteins, αβ proteins and β-sheet proteins. The other proteins are dispersed among the β-sheet proteins and αβ proteins. The Raman spectral data, on the other hand, do not show the clear, distinct clusters observed in the ROA data. The figures below show these plots for both the ROA and Raman data.

[Figure: metric MDS visualisations of the ROA (A) and Raman (B) whole spectra, and plots of RF dissimilarities vs. Euclidean distances for ROA (C) and Raman (D).]

Figure 7.2 The figures at the top show the RF multidimensional scaling (MDS) plots used to visualise the clusters in the data from the Raman (above right (B)) and ROA (above left (A)) whole spectra. The MDS clusters were calculated using the Sammon mapping stress function. The ROA data showed clusters of the α-helical, β-sheet and α/β proteins. The Raman data showed clusters of the α-helical and the other structural class proteins. The figures at the bottom plot the Random Forest dissimilarities vs. the Euclidean distances to show how well the two distances correlated, with high Spearman's correlation values of 0.91 for the Raman whole spectra (bottom right (D)) and 0.83 for the ROA data (bottom left (C)).
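The Spearman's correlation values quoted in the caption (0.91 and 0.83) measure how well the ranking of the RF dissimilarities agrees with the ranking of the Euclidean distances. A minimal pure-Python sketch with invented distance values (the thesis computed these in MATLAB):

```python
def rank(values):
    # average ranks (1-based), handling ties by averaging the tied group
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    # Spearman's rho: the Pearson correlation of the ranks
    rx, ry = rank(x), rank(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

rf_diss = [0.9, 1.2, 1.5, 1.1, 1.8]   # toy RF dissimilarities
euclid  = [0.4, 0.7, 1.0, 0.6, 1.3]   # toy Euclidean distances
print(round(spearman(rf_diss, euclid), 2))  # monotone toy data -> 1.0
```

A value near 1 indicates the two distance measures order the protein pairs almost identically, which is what the high correlations in figure 7.2 show.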


7.2 Results and Discussion

7.2.1 ROA spectra Random Forest Analyses

The ROA plots for amide I, amide I & III and amide I & II & III (figures 7.3, 7.4 and 7.5) showed clear clusters of α-helix, β-sheet and αβ proteins. In figure 7.4, showing the amide I & III plot, the other proteins are scattered throughout the plot. The other protein 42 (orosmucoid) and β-sheet protein 7 (ubiquitin) were grouped nearer the α-helix cluster, while the disordered protein 43 was nearer the β-sheet cluster. The amide I & II & III plot shows proteins 41 and 42 (Bowman-Birk inhibitor and orosmucoid, respectively) clustering closer to the α-helix cluster. Orosmucoid (3kq0) is an important drug-binding protein whose secondary structure, solved by Schonfeld et al. (2008), comprises an 8-stranded β-barrel with four loops that form the ligand-binding site, together with α-helix structure. The secondary structure of Bowman-Birk inhibitor (1pi2) has two domains connected by a region of four antiparallel β-strands and four connecting loops (Chen et al., 1992). It is possible that RFs can distinguish the presence of ordered structure and classify these proteins closer to an ordered structure motif.

The β-sheet protein 7 (ubiquitin, 1ubq) is also an outlier, lying closer to the α-helix cluster. A small protein of 76 amino acid residues, ubiquitin plays an important role in the degradation of proteins by signalling to cells which proteins are to be removed. The grouping of ubiquitin could be explained by its secondary structure, which contains β-sheet with antiparallel strands and a short α-helix region. Appendix J has the full list of proteins in the MDS plots for the ROA analyses.



Figure 7.3 Multidimensional Scaling Plot for ROA Amide I spectra


Figure 7.4 Multidimensional Scaling Plot for ROA Amide I & III spectra



Figure 7.5 Multidimensional Scaling Plot for ROA Amide I & II & III spectra

The amide I & II plot, seen in figure 7.6, shows a distinct α-helix cluster, while the other subdivisions do not show any distinct groupings. The α-helix outlier avidin (36) lies closer to the β-sheet proteins. The high affinity of avidin (1rav) for biotin makes it an important component in biotechnological and biomedical applications (Nardone et al., 1998); for example, in immunoassays the avidin-biotin system is used to increase sensitivity and specificity. Structurally, avidin is a tetramer of identical subunits, each comprising an 8-stranded antiparallel β-barrel. This is probably why the RF groups it closer to the β-sheet proteins. The other proteins are scattered throughout the plot. The plots for amide II and amide III, seen in figures 7.7 and 7.8 respectively, do not show any clear clusters. The α-helical proteins show some clustering for amide II & III in figure 7.9.



Figure 7.6 Multidimensional Scaling Plot for ROA Amide I & II spectra


Figure 7.7 Multidimensional Scaling Plot for ROA Amide II spectra



Figure 7.8 Multidimensional Scaling Plot for ROA Amide III spectra


Figure 7.9 Multidimensional Scaling Plot for ROA Amide II & III spectra


7.2.2 Raman spectra Random Forest Analyses

On the whole, the Raman MDS plots did not show any significant clustering between the fold classes, unlike those observed in the ROA MDS plots. The most distinct cluster observed in the Raman MDS plots was formed by the αβ proteins in almost all the Raman amide plots. β-Sheet protein 7 (human lysozyme) lies closer to the αβ protein cluster in the amide III plot (figure 7.10) and the amide I+II+III plot (figure 7.11). The α-helix proteins 21 (α-lactalbumin) and 18 (satellite tobacco mosaic virus) in the amide II & III plot (figure 7.12) group nearer the αβ cluster of proteins. The proteins α-lactalbumin (1f6s) and human lysozyme (1gaz) are homologous in amino acid sequence and tertiary structure (Takano et al., 2000; Chrysina et al., 2000). The structure of human lysozyme has an α domain with four α-helices and a β domain with two β-sheets. The lactose synthase regulator α-lactalbumin also has α and β structural motifs, with a larger α-helical subdomain and a smaller subdomain with three antiparallel β-strands. This RF classification decision shows the sensitivity of the method to the spectral features that arise from the ordered α-helix and β-sheet structural motifs. The amide II plot (figure 7.13) showed no distinct groupings at all. See Appendix K for the full key of proteins used in the Raman RF analyses. The MDS plots not shown here are in Appendix I.



Figure 7.10 Multidimensional Scaling Plot for Raman Amide III spectra


Figure 7.11 Multidimensional Scaling Plot for Raman Amide I & II & III spectra



Figure 7.12 Multidimensional Scaling Plot for Raman Amide II & III spectra


Figure 7.13 Multidimensional Scaling Plot for Raman Amide II spectra


7.2.3 Variable Importance

The Random Forest algorithm measures which bins were most important in the classification of proteins. Tables 7.2 and 7.3 show which bins were most important in the clustering of the proteins into the α-helix, β-sheet, αβ and other fold classes. The variable importance is calculated from the difference between the proportion of correctly predicted classes and the total number of cases predicted; this figure is referred to as the margin M. The cases not drawn into a tree's bootstrap sample constitute what are termed the out-of-bag cases. Given the known margin Mold, calculated before the out-of-bag cases are permuted, the values of the x-th variable on the k-th tree are permuted and a new margin Mnew is calculated. The difference between the original margin Mold and the margin of the permuted subset Mnew represents the variable importance. The Random Forest algorithm is explained in detail in chapter 3. The bigger the difference, the higher the importance attached to the given variable in the classification decision. Bar graphs of the variable importance for these analyses are shown in Appendix H; the M values presented in Tables 7.2 and 7.3 are estimated from those bar graphs.
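The permutation step described above can be sketched generically. The snippet below uses a hypothetical one-feature classifier on made-up data, and substitutes a deterministic cyclic shift for the random permutation that Random Forests apply over the out-of-bag cases; the thesis analyses used existing Random Forest software, not this code:

```python
def accuracy(classify, X, y):
    # fraction of cases whose predicted class matches the true class
    return sum(classify(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(classify, X, y, var):
    # Margin-style importance: M_old - M_new, where M_new is the accuracy
    # after permuting column `var` (here cyclically shifted for
    # determinism; a Random Forest uses a random permutation).
    m_old = accuracy(classify, X, y)
    col = [row[var] for row in X]
    col = col[1:] + col[:1]            # stand-in for a random permutation
    X_perm = [row[:var] + [v] + row[var + 1:] for row, v in zip(X, col)]
    m_new = accuracy(classify, X_perm, y)
    return m_old - m_new

# Toy example: the class depends only on feature 0, so permuting
# feature 1 costs nothing while permuting feature 0 hurts.
X = [[0.1, 5.0], [0.2, 1.0], [0.9, 4.0], [0.8, 2.0]] * 5
classify = lambda row: 'helix' if row[0] < 0.5 else 'sheet'
y = [classify(row) for row in X]
print(permutation_importance(classify, X, y, 0))  # prints 0.5
print(permutation_importance(classify, X, y, 1))  # prints 0.0
```

The informative feature shows a large margin drop when scrambled, while the irrelevant one shows none, which is exactly the behaviour the M values in Tables 7.2 and 7.3 summarise per bin.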

Table 7.2 Most important bins for assigning fold class for Raman Random Forest analyses

Region           Bin Number   Spectral range (cm-1)   M variable
Amide I          Bin 7        1665-1675               0.08
                 Bin 8        1675-1685               0.05
                 Bin 9        1685-1695               0.05
Amide II         Bin 2        1525-1535               0.036
Amide III        Bin 6        1255-1265               0.06
Amide I+II       Bin 14       1665-1675               0.06
                 Bin 15       1675-1685               0.05
Amide II+III     Bin 4        1235-1245               0.024
                 Bin 5        1245-1255               0.024
                 Bin 6        1255-1265               0.024
Amide I+III      Bin 21       1665-1675               0.036
                 Bin 22       1675-1685               0.028
Amide I+II+III   Bin 28       1665-1675               0.036
                 Bin 29       1675-1685               0.02
Whole            Bin 37       965-975                 0.012
*M variables are read from the variable importance bar graphs in Appendix H


Table 7.3 Most important bins for assigning fold class for ROA Random Forest analyses

Region           Bin Number   Spectral range (cm-1)   M variable
Amide I          Bin 6        1655-1665               0.08
Amide II         Bin 3        1535-1545               0.032
Amide III        Bin 14       1335-1345               0.09
Amide I+II       Bin 7        1575-1585               0.05
                 Bin 13       1655-1665               0.05
                 Bin 14       1665-1675               0.07
Amide II+III     Bin 14       1335-1345               0.07
Amide I+III      Bin 14       1335-1345               0.09
                 Bin 20       1655-1665               0.07
Amide I+II+III   Bin 14       1335-1345               0.07
                 Bin 27       1655-1665               0.06
                 Bin 28       1665-1675               0.03
Whole            Bin 70       1295-1305               0.013
                 Bin 105      1645-1655               0.006
*M variables are read from the variable importance bar graphs in Appendix H

Some of the bins in Tables 7.2 and 7.3 are already associated with structural motif markers. The 1665-1675 cm-1 Raman variable importance bin may be due to the β-sheet signature amide I bands associated with the range ~1665-1680 cm-1 (Barron et al., 2003; Blanch et al., 2004). Amide III bands that arise from a β-sheet motif in Raman spectra, assigned to the region ~1230-1245 cm-1, could be linked to the 1235-1245 cm-1 bin. An amide III band at ~1252 cm-1, previously assigned to undefined structure, falls within the 1245-1255 cm-1 bin (Maiti et al., 2004; Barron et al., 2003; Blanch et al., 2004).
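Converting between the bin numbers in Tables 7.2 and 7.3 and wavenumber ranges is simple arithmetic over the 10 cm-1 bin width. In the sketch below the region start of 1605 cm-1 is an assumption inferred from Table 7.2 (bin 7 spanning 1665-1675 cm-1), not a value stated explicitly in the thesis:

```python
def bin_range(region_start, bin_number, width=10):
    # Wavenumber range (low, high) covered by a given bin; bins 1-indexed.
    low = region_start + (bin_number - 1) * width
    return (low, low + width)

def find_bin(region_start, wavenumber, width=10):
    # Inverse lookup: which bin contains a band at `wavenumber`?
    return (wavenumber - region_start) // width + 1

# Assuming the binned amide I region starts at 1605 cm-1:
print(bin_range(1605, 7))    # prints (1665, 1675)
print(find_bin(1605, 1668))  # prints 7
```

This makes it easy to check, for example, that a β-sheet marker band near 1668 cm-1 would land in the high-importance 1665-1675 cm-1 bin.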

The α-helix characteristic amide I ROA couplet, negative at ~1640 cm-1 and positive at ~1665 cm-1, could be represented by the 1645-1655 cm-1 and 1655-1665 cm-1 bins. A β-sheet signature in the amide I region of ROA, a couplet negative at ~1658 cm-1 and positive at ~1677 cm-1 (Wen et al., 1994; Barron et al., 2003), could be denoted by the 1665-1675 cm-1 bin. An amide III positive band at ~1340 cm-1 associated with α-helix in a hydrophobic environment would fall in the 1335-1345 cm-1 bin (Barron et al., 2003; Blanch et al., 2004).

ROA studies by Wen and co-workers (1994) on α-lactalbumin and lysozyme attributed a positive ~1339 cm-1 band and a negative ~1245 cm-1 band to loop structure, possibly from binding-site helix-loop-helix motifs. Barron et al. (1992) conducted ROA studies on homologous proteins α-lactalbumin and bovine serum albumin and suggested that the positive ROA band at ~1340 cm-1 could arise from reverse-turn conformations in the proteins.

Certain bins appearing together in Tables 7.2 and 7.3 could indicate associations between different bands, where some bands may be boosted by others. For example, in Table 7.2 the Raman variable importance bins 1665-1675 cm-1 and 1675-1685 cm-1 appear together as important variables in the amide I & II, amide I & III and amide I & II & III combinations.

The amide II band region has been used in studies to investigate structural conformations. Oberg and co-workers (2004) analysed the IR and CD amide II region and noted that this region is usually left out of structural analyses because of the complex dependence of the band shapes on structural conformation. A study by Jacob et al. (2009) demonstrated that the amide II band in Raman spectra of poly-alanine peptides showed a large intensity for the 310 helix and weak visibility for the α-helical conformation. A similar observation was reported by Gangani et al. (2002), who found that the VCD amide II band showed the same trend for 310 helix and α-helix signals. Navea et al. (2005) used Partial Least Squares analysis to analyse the structural content of IR spectra and cited the amide II region as hardly being used in the determination of the structure content of proteins, due to overlapping of D2O bands and other bands, which makes it difficult to make structural motif assignments. The RF variable importance analyses reported the bins 1525-1535 cm-1 from the Raman spectra and 1535-1545 cm-1 from the ROA spectra as important variables, with values of 0.036 and 0.032 respectively (Tables 7.2 and 7.3). This could imply that there are some regions within the amide II band that contain structural information.

The Random Forest algorithm proved sensitive to structural information in ROA and Raman spectra, although it appeared more sensitive to ROA, as the ROA analyses reported higher accuracies than the Raman analyses (Kinalwa et al., 2010). Our variable importance analyses showed which spectral regions were significant in the classification decision, some of which have been reported in the literature, but also others, such as those from the amide II region, which is usually omitted from structural studies.


7.3 References

Barron L.D., Blanch E.W., McColl I.H., Syme C.D., Hecht L., Nielsen K. (2003) Structure and behaviour of proteins, nucleic acids and viruses from vibrational Raman optical activity. Spectroscopy 17, 101-126

Barron L.D., Cooper A., Ford S.J., Hecht L., Wen S.Q. (1992) Vibrational Raman optical activity of enzymes. Faraday Discussions 93, 259-268.

Blanch E.W., McColl I.H., Hecht L., Nielsen K., Barron L.D. (2004) Structural characterization of proteins and viruses using Raman optical activity. Vibrational Spectroscopy 35, 87-92

Chen P., Rose J., Love R., Wei C.H., Wang B.C. (1992) Reactive sites of an anticarcinogenic Bowman-Birk proteinase inhibitor are similar to other trypsin inhibitors. Journal of Biological Chemistry 267, 1990-1994

Chrysina E.D., Brew K., Acharya K.R. (2000) Crystal structures of apo- and holo-bovine alpha-lactalbumin at 2.2-A resolution reveal an effect of calcium on inter-lobe interactions. Journal of Biological Chemistry 275, 37021-37029

Gangani R. A., Silva D., Yasui S. C., Kubelka J., Formaggio F., Crisma M., Toniolo C., Keiderling T.A. (2002) Discriminating 310- from α-helices: Vibrational and electronic CD and IR absorption study of related aib-containing oligopeptides. Biopolymers 65, 299-243.

Jacob C.R., Luber S., Reiher M. (2009) Analysis of secondary structure effects on the IR and Raman spectra of polypeptides in terms of localized vibrations. Journal of Physical Chemistry B 113, 6558-6573

Kinalwa M., Blanch E.W., Doig A.J. (2010) Determination of protein fold class from Raman or Raman optical activity spectra using Random Forest. Protein Science 20, 1668-1674

Maiti N.C., Apetri M.M., Zagorski M.G., Carey P.R., Anderson V.E. (2004) Raman spectroscopic characterization of secondary structure in natively unfolded proteins: α-Synuclein. Journal of the American Chemical Society 126, 2399-2408

Nardone E., Rosano C., Santambrogio P., Curnis F., Corti A., Magni F., Siccardi A.G., Paganelli G., Losso R., Apreda B., Bolognesi M., Sidoli A., Arosio P. (1998) Biochemical characterization and crystal structure of a recombinant hen avidin and its acidic mutant expressed in Escherichia coli. European Journal of Biochemistry 256, 453-460

Navea S., Tauler R., Juan A. (2005) Application of the local regression method interval partial least-squares to the elucidation of protein secondary structure. Analytical Biochemistry 336, 231-242


Oberg K.A., Ruysschaert J., Goormaghtigh E. (2004) The optimization of protein secondary structure determination with infrared and circular dichroism spectra. European Journal of Biochemistry 271, 2937-2948

Schonfeld D.L., Ravelli R.B., Mueller U., Skerra A. (2008) The 1.8-A crystal structure of alpha 1-acid glycoprotein (orosmucoid) solved by UV RIP reveals the broad drug-binding activity of this human plasma lipocalin. Journal of Molecular Biology 384, 393-405

Takano K., Yamagata Y., Yutani K. (2000) Role of amino acid residues at turns in the conformational stability and folding of human lysozyme. Biochemistry 39, 8655-8665

Wen Z.Q., Hecht L., Barron L.D. (1994) α-Helix and associated loop signatures in vibrational Raman optical activity spectra of proteins. Journal of the American Chemical Society 116, 443-445

Wen Z.Q., Hecht L., Barron L.D. (1994) β-Sheet and associated turn signatures in vibrational Raman optical activity spectra of proteins. Protein Science 3, 435-439


Chapter 8

Conclusion

The machine learning analyses in this thesis have been used to extract structural information from Raman and ROA spectral data. The analyses were applied to spectral data binned at 10 cm-1 widths. Full spectrum data from 605 to 1785 cm-1 were analysed, as well as the amide I, II and III regions in isolation and in different combinations. Partial Least Squares (PLS) regression was used on Raman and ROA spectra to determine the secondary structure contents (α-helix, β-sheet, or other) of proteins with high accuracy. Analysis was also carried out using second-derivative Raman data, yielding correlation and root mean square (rms) values of 0.99 and 0.6-1.7%, respectively. The results shown here outperformed similar chemometric analyses applied to data from circular dichroism (CD), a high-throughput method used for the measurement of α-helix and β-sheet structure content. Applying this regression analysis to the different spectral amide regions showed the importance of these regions in secondary structure analysis.
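The rms figures quoted above are root-mean-square deviations between PLS-predicted and observed secondary structure contents. A minimal sketch with invented helix percentages, for illustration only:

```python
import math

def rms_error(predicted, observed):
    # root-mean-square deviation between predicted and observed values
    n = len(predicted)
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n)

# Invented α-helix contents (%) for four proteins, not data from this work
observed  = [40.0, 12.0, 55.0, 30.0]
predicted = [41.0, 11.0, 54.0, 32.0]
print(round(rms_error(predicted, observed), 2))  # prints 1.32
```

An rms of 0.6-1.7% on percentage structure contents means the PLS predictions deviate from the reference values by only one or two percentage points on average.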

The Random Forest algorithm was used to determine protein fold class from Raman or ROA spectra. The Raman analysis showed poorer performance than the ROA analysis. The best performance was obtained from the ROA amide II and III data, with 75% of the proteins correctly assigned to the α-helix, β-sheet, αβ and other SCOP classes. The SVM classification and SVM regression analyses showed poor accuracy compared to the previous two types of analyses.

The varying performance accuracies of the methods discussed highlight the importance of the choice of machine learning method used to investigate the problem under study. The high accuracies obtained show the potential of machine learning methods to mine and extract information on protein structure, and their application as useful tools in the determination of protein structure.


Appendix A – Summary table of proteins used in the analyses with pdb code and SCOP class.

Protein name PDB code SCOP classification ABA-1 allergen (Ascaris lumbricoides) filamentous bacteriophage fd FD bacteriophage1 1ifj coiled coil proteins filamentous bacteriophage IKe IKE virus 1ifi coiled coil proteins filamentous bacteriophage M13 M13 virus 1ifl coiled coil proteins filamentous bacteriophage Pf1 PF1 virus 1pfi coiled coil proteins α-helical poly(L-glutamic acid) α helix polyglutamic acid α helical α-helical poly(L-lysine) α helical polylysine α helical S100A6 (calcyclin, rabbit) 2cnp α helical S100B (rabbit) α helical serum albumin (human) human serum albumin 1ao6 all α proteins prion protein (ovine,90-230) helical domain of S23 α-gliadin (wheat) α gliadin unfolded aldolase (rabbit) aldolase 1ado α and β proteins(a/b) creatine kinase (rabbit) creatin kinase 2crk all α α-lactalbumin (bovine) bovine α lactalbumin 1f6s α and β proteins (a+b) α-lactalbumin (human) human α lactalbumin 1b9o α and β proteins (a+b) lysozyme (equine) equine lysozyme 2eql α and β proteins (a+b) lysozyme (hen) hen lysozyme 1lse α and β proteins (a+b) lysozyme (human) human lysozyme 2vb1 α and β proteins (a+b) narcissus mosaic virus narcissus mosaic virus3 α and β proteins (a+b) papaya mosaic virus papaya mosaic virus10 α helical proteins potato virus X pvx virus9 α helical proteins regulatory protein 2 (Streptococcus pyogenes) tobacco mosaic virus tobacco mosaic virus11 2tmv α helical proteins tobacco mosaic virus dimer α helical proteins tobacco rattle virus tobacco rattle virus α helical proteins prion protein (mouse, 23-231) amylase (Bacillus licheniformis) amylase 1bli all β proteins response regulator O2 receiver domain (Streptococcus pneumoniae) molten globule state ovalbumin (hen) ovalbumin (hen) 1ova ovomucoid (hen) hen ovomucoid small proteins ovomucoid (turkey) turkey ovomucoid 1m8c α/β ribonuclease A (bovine) ribonuclease A 1rph α and β ribonuclease B (bovine) ribonuclease B α and β subtilisin Carlsberg (Bacillus licheniformis) subtilisin 1sca α and β proteins 
(a/b) invertase β proteins 2ac1 N/A β-chymotrypsin (bovine) α chymotrypsin 4cha all β proteins β-lactoglobulin (bovine, pH=2.0) bovine β lactoglobulin 1cj5 all β proteins bovine β β-lactoglobulin (bovine, pH=6.5) lactoglobulin(pH6.8) 1cj5 all β proteins bovine β β-lactoglobulin (bovine, pH=9.0) lactoglobulin(pH9) 1cj5 all β proteins α and β proteins MS2 virus (empty protein capsid) MS2 virus 1ms2 (a+b) pepsin (porcine) pepsin 3pep all β proteins satellite tobacco mosaic satellite tobacco mosaic virus virus 1a34 all β proteins trypsin (bovine) trypsin 1k1l all β proteins trypsinogen (bovine) trypsinogen 1tgt all β proteins ubiquitin (bovine) ubiquitin 1g6j α and β proteins (a+b) serum amyloid protein(domain a) serum amyloid proteins 1lgn all β proteins avidin (hen) avidin 1rav all β proteins concanavalin A (jack bean) concanavalin 2cna,2ctv all β proteins cowpea mosaic virus (empty protein capsid) cowpea mosaic virus 2bfu all β proteins cowpea mosaic virus (with RNA 1) same as above with RNA 1ny7 all β proteins cowpea mosaic virus (with RNA 2) same as above with RNA 1ny7 all β proteins immunoglobulin G (human) immunoglobin G 1hzh α and β proteins (a+b) P.69 pertactin (Bordetella pertussis) P69 pertactin 1dab all β proteins β-sheet poly(L-lysine) polylysine all β proteins human serum amyloid serum amyloid P component (human) protein 1sac all β proteins


antifreeze protein type III (arctic flounder) antifreeze protein Mainly disordered T-A-1 peptide (wheat glutenin subunit) Mainly disordered disordered/Mainly Bowman-Birk inhibitor (soybean) bowman-birk inhibitor 1pi2 disordered α-casein (bovine) α casein Mainly disordered β-casein (bovine) β casein Mainly disordered κ-casein (bovine) kappa-casein Mainly disordered ω-gliadin (wheat) omega gliadin Mainly disordered α-synuclein (human) α synuclein 1xq8 Mainly disordered β-synuclein (human) β synuclein Mainly disordered γ-synuclein (human) gamma synuclein Mainly disordered tau46 (human P301L mutant) mutants of tau proteins Mainly disordered tau46 (human) tau protein Mainly disordered metallothionein (rabbit) rabbit metallothionein7,8 All disordered OOAAAAAAAOO peptide All disordered phosvitin (hen) phosvitin All disordered disordered polyglutamic disordered poly(L-glutamic acid) acid All disordered disordered poly(L-lysine) disordered polylysine All disordered collagen structure collagen structure All disordered α-1-acid glycoprotein (bovine) orosomucoid6 unfolded (glycoprotein) A-gliadin in 60% MeOH α gliadin α disordered insulin (bovine, monomeric, low PH) insulin 2a3g small proteins α-lactalbumin (bovine, A-state) α lactalbumin 1f6r α and β proteins (a+b) lactoferrin (human) lactoferrin4,5 1cb6 α and β proteins rF1 antigen lactoferrin antigen lysozyme (equine, A-state) equine lysozyme 2eql α and β proteins (a+b) lysozyme (hen, reduced) reduced lysozyme 2vb1 unfolded human lysozyme lysozyme (human, pre-fibillar intermediate) partially unfolded 1gaz disordered prion protein (ovine, 94-233, reduced) helical domain of S23 1qm1 all α full length prion protein (ovine,23-230) ovine prion protein α and β proteins (a+b) ribonuclease A (bovine, reduced) reduced ribonuclease A unfolded disordered poly(L-alanine) disordered polyalanine all disordered α-helical poly(L-alanine) α helical polyalanine α helical α-helical poly(L-alanine) α helical polyalanine α helical α-helical 
poly(L-alanine) α helical polyalanine α helical β-sheet poly(L-alanine) β sheet polyalanine β sheet α-helical poly(benzyl) α helical poly benzyl α helical α-helical poly(benzyl) α helical benzyl α helical disordered poly(L benzyl) disordered poly benzyl all disordered disordered polyglutamic disordered poly(L-glutamic acid) acid all disordered disordered polyglutamic disordered poly(L-glutamic acid) acid all disordered disordered poly(L-histidine) disordered polyhistidine all disordered α-helical poly(L-histidine) α helix polyhistidine α helical disordered poly(L-leucine) disordered polyluecine all disordered α-helical poly(Lleucine) helical polyleucine α helical α-helical poly(L-lysine) α helical polylysine α helical disordered poly(L-ornathine) disordered polyornathine all disordered α-helical poly(L-ornathine) α helical polyornathine α helical α-helical poly(L-ornathine) α helical polyornathine α helical disordered poly(L-proline) disordered polyproline all disordered disordered poly(L-proline) disordered polyproline all disordered disordered disordered poly(L-tryptophan) polytryptophan all disordered α-helical poly(L-tryptophan) α helical polytrptophan α helical disordered poly(L-threonine) disordered polythreonine all disordered disordered poly(L-tyrosine) disordered polytyrosine all disordered α-helical poly(L-tyrosine) helical polytyrosine α helical disordered poly(L-lysine) disordered polylysine all disordered disordered poly(L-lysine) disordered polylysine all disordered


Appendix B- The binAvg.pl script used to bin the data

#!/usr/local/bin/perl
use POSIX;
use FileHandle;
use Getopt::Long;

my $label = 1;
my $result = GetOptions(

"outfile=s" => \$outfile,

"label=i" => \$label,

)or die "$!\n";

#$numArgs = $#ARGV + 1;

#print "you entered $numArgs command-line arguments.\n";

#foreach $argnum(0..$#ARGV)

#{

#print "$ARGV[$argnum]\n";

#}

#print @ARGV," the ARGV array\t\n";

sub midpoint

{

($low,$high) = (@_);

$middle = ($low+$high)/2;


return $middle;

}

sub roundoff

{

my ( $x, $factor) = @_;

$b = sprintf("%.0f", floor($x));

$div = sprintf("%.0f",floor($b/$factor));

$value = $div*$factor;

return $value;

}

sub getFiles#not working yet

{

my $argsref = shift(@_);

my @fileArray;

foreach (@{$argsref})

{

print "$_\n";

#push(@fileArray,"$_");

}

#print @fileArray;

#return (\@fileArray);

}

sub correctFilename

{

my $spectralFile = shift(@_);

$filenamelength = length($spectralFile);


#get file extension length

$filextension = $filenamelength-3;

#get file extension txt

$extensionfragment = substr($spectralFile, $filextension, $filenamelength);

$extensionstring = ".";

$extensionstring .= $extensionfragment ;

#modify last segment of $spectralFile to '.txt'

substr( $spectralFile, $filextension, $filenamelength)= $extensionstring;

return $spectralFile;

}

sub getFile

{

my $spectralFile = shift(@_);

#$spectralFile = correctFilename($spectralFile);

#print $spectralFile;

open (FILE, $spectralFile) or die ("Error opening $argsref: $!");

chomp;

while (<FILE>){

unless ($_=~/^\s+$/){ #get rid on newlines in file

@lines = <FILE>;

}

return \@lines;

}

}

sub receive_array

{


my $argsarray = shift(@_);

my $class = shift(@_);

my $filename = shift (@_);

my $vector = "$class ";

local @store_lines = @{$argsarray};

($lo_bound,$hi_bound) = split (/\s+/,$store_lines[0]);

$round_lobound = roundoff($lo_bound,10);

$size = scalar @store_lines;

#print $round_lobound,"\n";

#print $store_lines[$#store_lines],"\n";

#print $#store_lines,"\n";

#print $size,"\n";

#$fh = new FileHandle ">> $outfile";

$filenamelength = length($filename);

#get file extension length

$filextension = $filenamelength-3;

$outfile = substr($filename,0,$filextension);

my $trainfile = $outfile."train.data";

#print $trainfile;

$fh = new FileHandle "> $trainfile";

#foreach $line (@{$argsarray}){

#print $line,"\n";

#}

for($i=$round_lobound; $i<$store_lines[$#store_lines]; $i+=20)

{

$lo = $i;

#print "start is $lo\n";

for ($x =0; $x<20; $x++)

{

$lo += 1;

#print "x is $x\n";


}

#print "range is $i - $lo\n";

for my $element (0..$#store_lines)

{

($x,$y) = split(/\s+/,$store_lines[$element]);

#print $y,"\n";

if($x > $i && $x < $lo)

{

$ttIntensity += $y;

$count_elements++;

#print $store_lines[$element],"\n";

}#end if

} #end for element loop

$mid = midpoint($i,$lo);

chomp($mid);

$avgIntensity = $ttIntensity/$count_elements;

#print "Total wavenumber in this bin is $ttIntensity\n";

#print "The mid point of this range is $mid $avgIntensity\n";

#print "there are $count_elements elements in this range\n";

#print "The average intensity in this range is $avgIntensity\n";

$vector .= "1:$avgIntensity ";

if (defined $fh) { print $fh $vector;}

$count_elements = 0;

$ttIntensity = 0;

$avgIntensity = 0;

$vector = ' ';

}#end for round_lobound
print $fh "\n";


$fh->close;

}

#$fh = new FileHandle ">> outtest.txt";

foreach $file (@ARGV){

#print "\n";

##how to pass array to bin subroutine

#$a=getFiles(\@ARGV);

$a=getFile($file);

#$a=getFile(testtext.txt);

@b = @$a;
receive_array(\@b,$label,$file);

}


Appendix C- The ranges.pl script used to select amide regions

#!/usr/local/bin/perl
use POSIX;
use FileHandle;

#$spectralFile = 'ABA-1A.txt';

#@array = (100..900);
my $buildString = ' ';

#################################################################### ######################### #while ($check ne 'q') { # print "enter lower bound of range:";

# chomp(my $first = <STDIN>);

# print "enter upper bound of range:";

# chomp(my $second = <STDIN>);

# $buildString .= "$first"."..".$second.",";

# print "Enter q to quit or press 'Enter' to continue\n";

# chomp($check = <STDIN>);
#print $check,"\n";
#}

#chop($buildString);

# $temp2 = 0; # @temp=split(/,/,$buildString);

# foreach $foo (@temp) # {

# $ranges[$temp1][$temp2]=$foo; # $temp1++; # $temp2 += 1;

# }

foreach $spectralFile (@ARGV){

$rangeFile = $spectralFile."range.txt";

open (FILE, "$spectralFile") or die ("Error opening $argsref: $!"); chomp; while ($line = ){ unless ($_=~/^\s+$/){ #get rid on newlines in file @trio = split(' ',$line); #print $line,

#@lines = ;

} } print $rangeFile,"\n";

$fh = new FileHandle "> $rangeFile";

#Amide I+II
#@ranges = ([1510..1580],[1600..1700]);

#Amide I+III
#@ranges = ([1200..1340],[1600..1700]);

#Amide II+III
#@ranges = ([1200..1340],[1510..1600]);

#Amide I+II+III
#@ranges = ([1200..1340],[1510..1580],[1600..1700]);

#for a single range

#Amide I
#@ranges = ([1600..1700]);

#Amide II
#@ranges = ([1510..1600]);

#Amide III
#@ranges = ([1200..1340]);

#whole range
#@ranges = ([500..1800]);

#Backbone region
@ranges = ([850..1100]);

my $label = shift(@trio);
if (defined $fh) { print $fh $label," ";}

foreach $v (@trio){
    ($index,$waveNumber,$intensity) = split(':',$v);


    #print $waveNumber,"\n";
    for (my $x = 0; $x <= $#ranges; $x++) {
        #for (my $y = 0; $y <= $#{$ranges[$x]}; $y++) {
        my $firstValue = $ranges[$x][0];
        #print $ranges[$x][0],"\n";
        my $secondValue = $ranges[$x][$#{$ranges[$x]}];
        #print $ranges[$x][$#{$ranges[$x]}],"\n";
        if ($waveNumber >= $firstValue && $waveNumber <= $secondValue){
            print $waveNumber,"\n";
            if (defined $fh) { print $fh $waveNumber.":".$intensity," ";}
        }
    }
}

close FILE;
}
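The script keeps only those `wavenumber:intensity` features whose wavenumber falls inside one of the selected amide windows. The same filtering step, re-sketched in Python for illustration (the function name and example window bounds mirror the Amide I+II ranges above but are otherwise hypothetical):

```python
# Illustrative sketch (not ranges.pl itself): keep only the features whose
# wavenumber lies inside one of the selected amide windows.
AMIDE_I_II = [(1510, 1580), (1600, 1700)]  # window bounds in cm^-1

def select_ranges(features, windows):
    """features: list of (wavenumber, intensity); windows: list of (lo, hi)."""
    return [(w, i) for (w, i) in features
            if any(lo <= w <= hi for (lo, hi) in windows)]

print(select_ranges([(1500, 0.1), (1550, 0.5), (1650, 0.9)], AMIDE_I_II))
# [(1550, 0.5), (1650, 0.9)]
```

Swapping the window list reproduces the other region choices commented out in the script (Amide I, Amide III, backbone, and so on).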


Appendix D- The svmscale.pl script used to scale the data

#!C:/Perl/bin/perl.exe

use FileHandle;

$max = 1;

$min = -1;

$scalefile = "scale.train.data";

sub openfile {

$outfile = shift(@_);

open (FILE, $outfile) or die ("Error opening $outfile: $!");

while ($line = <FILE>)

{

    chomp($line);

    #$line =~ s/^\d+\s+//;

    #print $line,"\n";

    @arrayline = split(/\s+/,$line);

#$data .= $line;

}

close FILE;
return \@arrayline;

}

sub findrange {


(@arrayline) = @{(shift)};

#print "@arrayline";

shift @arrayline;

foreach my $val (@arrayline) {

#print $val,"\n";

($f,$v) = split(/:/,$val);

#push (@values,$v);

#print $f,"\n";

if ($v < $min){

$min = $v

}

if ($v > $max){

$max = $v

}

}

print "maximum is ", $max,"\n";

print "minimum is ",$min,"\n";

return ($min,$max);

}

sub scale {

$min = shift(@_);

$max = shift(@_);

$scaledfile = shift(@_);

(@arrayline) = @{(shift)};

my $label = '';


my $string = '';

#$scaledfile =~ /(\w+)\./;

#print $1,"\n";

#$outfile = $1;

$label = shift(@arrayline);

#print $label,"\n";

# my $scalefile = $outfile.".train.scale.data";

#print $scalefile,"\n";

$fh = new FileHandle ">> $scalefile";

foreach my $val (@arrayline) {

($f,$v) = split(/:/,$val);

#print $f,":",$v,"\n";

#push (@values,$v);

$vscale = ((2*($v-$min))/($max-$min))-1;

#if (defined $fh) { print $fh $f,":",$vscale,"\t";}

$string .= "$f:$vscale ";

#print "$f:$vscale\n ";

}

$label .= " $string";

#print $label;

if (defined $fh) { print $fh $label;}

print $fh "\n";

$fh->close;

#$label = '';

#$string = '';

}


@array = ("PLEHs.train.data","PBAHs.train.data");

#change to $file(\@ARGV) to pass files at command line
foreach $file (@ARGV)

{
print $file,"\n";

$array = openfile($file);

@a = @$array;

#print "@a\n";

($min,$max) = findrange(\@a);

scale($min,$max,$file,\@a);

#print "@a\n";

@a = ();

$min = '';

$max='';

}
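The core of svmscale.pl is the mapping `vscale = 2*(v - min)/(max - min) - 1`, which places every feature value on the interval [-1, 1]. A simplified sketch of that step in Python (for illustration only: it takes the minimum and maximum directly from the data, whereas the script seeds them at -1 and 1 before scanning):

```python
# Sketch of the scaling step in svmscale.pl: map each value onto [-1, 1]
# using vscale = 2*(v - min)/(max - min) - 1.
def scale_values(values):
    lo, hi = min(values), max(values)
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]

print(scale_values([0.0, 5.0, 10.0]))
# [-1.0, 0.0, 1.0]
```

Scaling all features to a common range in this way is the standard preprocessing step recommended for SVM training, so that features with large numeric ranges do not dominate the kernel.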


Appendix E- Graphs showing predictions of SVM classification models (positives are above zero mark and negatives are below the zero mark)

[Figures: each bar graph plots the SVM decision value for every test protein, with positive predictions above and negative predictions below zero, and the four fold classes (Alpha Beta, Alpha Helix, Beta Sheet, Disordered) shown as separate series. Only the figure captions are reproduced here.]

Bar Graph of ROA SVM Classification Full Spectrum αβ Model for Bin 10 cm⁻¹
Bar Graph of ROA SVM Classification Full Spectrum α-Helix Model for Bin 10 cm⁻¹
Bar Graph of ROA SVM Classification Full Spectrum β-Sheet Model for Bin 10 cm⁻¹
Bar Graph of ROA SVM Classification Full Spectrum Other Model for Bin 10 cm⁻¹
Bar Graph of ROA SVM Classification Full Spectrum αβ Model for Bin 20 cm⁻¹
Bar Graph of ROA SVM Classification Full Spectrum α-Helix Model for Bin 20 cm⁻¹
Bar Graph of ROA SVM Classification Full Spectrum β-Sheet Model for Bin 20 cm⁻¹
Bar Graph of ROA SVM Classification Full Spectrum Other Model for Bin 20 cm⁻¹
Bar Graph of ROA SVM Classification αβ Model for Bin 100 cm⁻¹
Bar Graph of ROA SVM Classification Full Spectrum α-Helix Model for Bin 100 cm⁻¹
Bar Graph of ROA SVM Classification Full Spectrum β-Sheet Model for Bin 100 cm⁻¹
Bar Graph of ROA SVM Classification Full Spectrum Other Model for Bin 100 cm⁻¹
Bar Graph of Raman SVM Classification Full Spectrum αβ Model for Bin 10 cm⁻¹
Bar Graph of Raman SVM Classification Full Spectrum α-Helix Model for Bin 10 cm⁻¹
Bar Graph of Raman SVM Classification Full Spectrum β-Sheet Model for Bin 10 cm⁻¹
Bar Graph of Raman SVM Classification Full Spectrum Other Model for Bin 10 cm⁻¹
Bar Graph of Raman SVM Classification Full Spectrum αβ Model for Bin 20 cm⁻¹
Bar Graph of Raman SVM Classification Full Spectrum α-Helix Model for Bin 20 cm⁻¹
Bar Graph of Raman SVM Classification Full Spectrum β-Sheet Model for Bin 20 cm⁻¹
Bar Graph of Raman SVM Classification Full Spectrum αβ Model for Bin 100 cm⁻¹
Bar Graph of Raman SVM Classification Full Spectrum α-Helix Model for Bin 100 cm⁻¹
Bar Graph of Raman SVM Classification Full Spectrum Other Model for Bin 100 cm⁻¹
Bar Graph of Raman SVM Classification Amide I αβ Model using Bin 10 cm⁻¹
Bar Graph of Raman SVM Classification Amide I α-Helix Model using Bin 10 cm⁻¹
Bar Graph of Raman SVM Classification Amide I β-Sheet Model using Bin 10 cm⁻¹
Bar Graph of Raman SVM Classification Amide I Other Model using Bin 10 cm⁻¹
Bar Graph of Raman SVM Classification Amide II αβ Model using Bin 10 cm⁻¹
Bar Graph of Raman SVM Classification Amide II α-Helix Model using Bin 10 cm⁻¹
Bar Graph of Raman SVM Classification Amide II β-Sheet Model using Bin 10 cm⁻¹
Bar Graph of Raman SVM Classification Amide II Other Model using Bin 10 cm⁻¹
Bar Graph of Raman SVM Classification Amide III αβ Model using Bin 10 cm⁻¹
Bar Graph of Raman SVM Classification Amide III α-Helix Model using Bin 10 cm⁻¹
Bar Graph of Raman SVM Classification Amide III β-Sheet Model using Bin 10 cm⁻¹
Bar Graph of Raman SVM Classification Amide III Other Model using Bin 10 cm⁻¹
Bar Graph of Raman SVM Classification Amide I&II αβ Model using Bin 10 cm⁻¹
Bar Graph of Raman SVM Classification Amide I&II α-Helix Model using Bin 10 cm⁻¹
Bar Graph of Raman SVM Classification Amide I&II β-Sheet Model using Bin 10 cm⁻¹
Bar Graph of Raman SVM Classification Amide I&II Other Model using Bin 10 cm⁻¹
Bar Graph of Raman SVM Classification Amide I&III αβ Model using Bin 10 cm⁻¹
Bar Graph of Raman SVM Classification Amide I&III α-Helix Model using Bin 10 cm⁻¹
Bar Graph of Raman SVM Classification Amide I&III β-Sheet Model using Bin 10 cm⁻¹
Bar Graph of Raman SVM Classification Amide I&III Other Model using Bin 10 cm⁻¹
Bar Graph of Raman SVM Classification Amide II&III αβ Model using Bin 10 cm⁻¹
Bar Graph of Raman SVM Classification Amide II&III α-Helix Model using Bin 10 cm⁻¹
Bar Graph of Raman SVM Classification Amide II&III β-Sheet Model using Bin 10 cm⁻¹
Bar Graph of Raman SVM Classification Amide II&III Other Model using Bin 10 cm⁻¹
Bar Graph of Raman SVM Classification Amide I&II&III αβ Model using Bin 10 cm⁻¹
Bar Graph of Raman SVM Classification Amide I&II&III α-Helix Model using Bin 10 cm⁻¹
Bar Graph of Raman SVM Classification Amide I&II&III β-Sheet Model using Bin 10 cm⁻¹
Bar Graph of Raman SVM Classification Amide I&II&III Other Model using Bin 10 cm⁻¹
Bar Graph of ROA SVM Classification Amide I αβ Model using Bin 10 cm⁻¹
Bar Graph of ROA SVM Classification Amide I α-Helix Model using Bin 10 cm⁻¹
Bar Graph of ROA SVM Classification Amide I β-Sheet Model using Bin 10 cm⁻¹
Bar Graph of ROA SVM Classification Amide I Other Model using Bin 10 cm⁻¹
Bar Graph of ROA SVM Classification Amide II αβ Model using Bin 10 cm⁻¹
Bar Graph of ROA SVM Classification Amide II α-Helix Model using Bin 10 cm⁻¹
Bar Graph of ROA SVM Classification Amide II β-Sheet Model using Bin 10 cm⁻¹
Bar Graph of ROA SVM Classification Amide II Other Model using Bin 10 cm⁻¹
Bar Graph of ROA SVM Classification Amide III αβ Model using Bin 10 cm⁻¹
Bar Graph of ROA SVM Classification Amide III α-Helix Model using Bin 10 cm⁻¹
Bar Graph of ROA SVM Classification Amide III β-Sheet Model using Bin 10 cm⁻¹
Bar Graph of ROA SVM Classification Amide III Other Model using Bin 10 cm⁻¹
Bar Graph of ROA SVM Classification Amide I&II αβ Model using Bin 10 cm⁻¹
Bar Graph of ROA SVM Classification Amide I&II α-Helix Model using Bin 10 cm⁻¹
Bar Graph of ROA SVM Classification Amide I&II β-Sheet Model using Bin 10 cm⁻¹
Bar Graph of ROA SVM Classification Amide I&II Other Model using Bin 10 cm⁻¹
Bar Graph of ROA SVM Classification Amide II&III αβ Model using Bin 10 cm⁻¹
Bar Graph of ROA SVM Classification Amide II&III α-Helix Model using Bin 10 cm⁻¹
Bar Graph of ROA SVM Classification Amide II&III β-Sheet Model using Bin 10 cm⁻¹
Bar Graph of ROA SVM Classification Amide II&III Other Model using Bin 10 cm⁻¹
Bar Graph of ROA SVM Classification Amide I&III αβ Model using Bin 10 cm⁻¹
Bar Graph of ROA SVM Classification Amide I&III α-Helix Model using Bin 10 cm⁻¹
Bar Graph of ROA SVM Classification Amide I&III β-Sheet Model using Bin 10 cm⁻¹
Bar Graph of ROA SVM Classification Amide I&III Other Model using Bin 10 cm⁻¹
Bar Graph of ROA SVM Classification Amide I&II&III αβ Model using Bin 10 cm⁻¹
Bar Graph of ROA SVM Classification Amide I&II&III α-Helix using Bin 10 cm⁻¹
Bar Graph of ROA SVM Classification Amide I&II&III β-Sheet using Bin 10 cm⁻¹
Bar Graph of ROA SVM Classification Amide I&II&III Other using Bin 10 cm⁻¹

Appendix F- Graphs of SVM regression models showing correlation plots
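Each correlation plot below reports an ordinary least-squares trendline through the (actual %, predicted %) points together with its R². For reference, a self-contained Python sketch of how such a fit is computed (an illustrative helper, not part of the thesis scripts):

```python
# Sketch: least-squares trendline and R^2 for a set of (actual, predicted) points.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    # Coefficient of determination: R^2 = 1 - SS_res / SS_tot
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    r2 = 1.0 - ss_res / ss_tot
    return slope, intercept, r2

print(fit_line([0, 1, 2, 3], [1, 3, 5, 7]))
# (2.0, 1.0, 1.0)
```

An R² near 1 indicates that the predicted secondary-structure percentages track the actual values closely; values near 0 indicate little or no linear relationship.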

[Figures: each correlation plot shows predicted % secondary-structure values against actual % values for one SVM regression model; the fitted trendline and R² from each plot are retained with its caption.]

Graph of ROA SVM Regression Amide I α-Helix Model using Bin 10 cm⁻¹ (y = 0.1273x + 9.6237, R² = 0.0278)
Graph of ROA SVM Regression Amide I β-Sheet Model using Bin 10 cm⁻¹ (y = 0.7512x + 9.7375, R² = 0.4437)
Graph of ROA SVM Regression Amide I Other Model using Bin 10 cm⁻¹ (y = -0.0041x + 46.873, R² = 0.0009)
Graph of ROA SVM Regression Amide II α-Helix Model using Bin 10 cm⁻¹ (y = 0.5875x - 3.2381, R² = 0.4462)
Graph of ROA SVM Regression Amide II Other Model using Bin 10 cm⁻¹ (y = -2E-15x + 46, R² = #N/A)
Graph of ROA SVM Regression Amide III α-Helix Model using Bin 10 cm⁻¹ (y = 0.5687x + 10.577, R² = 0.3963)
Graph of ROA SVM Regression Amide III β-Sheet Model using Bin 10 cm⁻¹ (y = 0.4167x + 12.56, R² = 0.171)
Graph of ROA SVM Regression Amide III Other Model using Bin 10 cm⁻¹ (y = 0.1231x + 42.961, R² = 0.0893)
Graph of ROA SVM Regression Amide I&II α-Helix Model using Bin 10 cm⁻¹ (y = 0.5049x + 18.767, R² = 0.1352)
Graph of ROA SVM Regression Amide I&II β-Sheet Model using Bin 10 cm⁻¹ (y = 0.5674x + 16.357, R² = 0.3845)
Graph of ROA SVM Regression Amide I&II Other Model using Bin 10 cm⁻¹ (y = 0.0063x + 45.303, R² = 0.0001)
Graph of ROA SVM Regression Amide I&III α-Helix Model using Bin 10 cm⁻¹ (y = 0.5367x + 9.4388, R² = 0.6683)
Graph of ROA SVM Regression Amide I&III β-Sheet Model using Bin 10 cm⁻¹ (y = 0.555x + 15.543, R² = 0.4298)
Graph of ROA SVM Regression Amide I&III Other Model using Bin 10 cm⁻¹ (y = 0.1489x + 41.789, R² = 0.1022)
Graph of ROA SVM Regression Amide II&III α-Helix Model using Bin 10 cm⁻¹ (y = 0.3262x + 46.009, R² = 0.0715)
Graph of ROA SVM Regression Amide II&III β-Sheet Model using Bin 10 cm⁻¹ (y = 0.3141x + 20.346, R² = 0.1268)
Graph of ROA SVM Regression Amide II&III Other Model using Bin 10 cm⁻¹ (y = 0.3727x + 27.362, R² = 0.2693)
Graph of ROA SVM Regression Amide I&II&III α-Helix Model using Bin 10 cm⁻¹ (y = 0.4335x + 11.298, R² = 0.5334)
Graph of ROA SVM Regression Amide I&II&III β-Sheet Model using Bin 10 cm⁻¹ (y = 0.5824x + 15.179, R² = 0.5289)
Graph of ROA SVM Regression Amide I&II&III Other Model using Bin 10 cm⁻¹ (y = 0.0398x + 47.468, R² = 0.0164)
Graph of ROA SVM Regression Full Spectrum α-Helix Model using Bin 10 cm⁻¹ (y = 0.5421x + 13.335, R² = 0.6728)
Graph of ROA SVM Regression Full Spectrum β-Sheet Model using Bin 10 cm⁻¹ (y = 0.3852x + 15.295, R² = 0.4088)
Graph of ROA SVM Regression Full Spectrum Other Model using Bin 10 cm⁻¹ (y = 0.0005x + 45.97, R² = 0.1032)
Graph of Raman SVM Regression Amide II α-Helix Model using Bin 10 cm⁻¹ (y = -0.0735x + 24.816, R² = 0.032)
Graph of Raman SVM Regression Amide II β-Sheet Model using Bin 10 cm⁻¹ (y = 0.6712x + 16.01, R² = 0.0618)
Graph of Raman SVM Regression Amide II Other Model using Bin 10 cm⁻¹ (y = 0.9672x, R² = #N/A)
Graph of Raman SVM Regression Amide III α-Helix Model using Bin 10 cm⁻¹ (y = 0.5687x + 10.577, R² = 0.3963)
Graph of Raman SVM Regression Amide III β-Sheet Model using Bin 10 cm⁻¹ (y = 0.4167x + 12.56, R² = 0.171)
Graph of Raman SVM Regression Amide III Other Model using Bin 10 cm⁻¹ (y = 0.1231x + 42.961, R² = 0.0893)
Graph of Raman SVM Regression Amide I&II α-Helix Model using Bin 10 cm⁻¹ (y = 0.5049x + 18.767, R² = 0.1352)
Graph of Raman SVM Regression Amide I&II β-Sheet Model using Bin 10 cm⁻¹ (y = 0.5674x + 16.357, R² = 0.3845)
Graph of Raman SVM Regression Amide I&II Other Model using Bin 10 cm⁻¹ (y = 0.0063x + 45.303, R² = 0.0001)
Graph of Raman SVM Regression Amide I&III α-Helix Model using Bin 10 cm⁻¹ (y = 0.5367x + 9.4388, R² = 0.6683)
Graph of Raman SVM Regression Amide I&III β-Sheet Model using Bin 10 cm⁻¹ (y = 0.555x + 15.543, R² = 0.4298)
Graph of Raman SVM Regression Amide I&III Other Model using Bin 10 cm⁻¹ (y = 0.1489x + 41.789, R² = 0.1022)
Graph of Raman SVM Regression Amide II&III α-Helix Model using Bin 10 cm⁻¹ (y = 0.3262x + 46.009, R² = 0.0715)
Graph of Raman SVM Regression Amide II&III β-Sheet Model using Bin 10 cm⁻¹ (y = 0.3141x + 20.346, R² = 0.1268)
Graph of Raman SVM Regression Amide II&III Other Model using Bin 10 cm⁻¹ (y = -0.2176x + 56.883, R² = 0.2262)
Graph of Raman SVM Regression Amide I&II&III α-Helix Model using Bin 10 cm⁻¹ (y = 0.7053x - 0.908, R² = 0.5319)
Graph of Raman SVM Regression Amide I&II&III β-Sheet Model using Bin 10 cm⁻¹ (y = 0.7886x + 11.272, R² = 0.232)
Graph of Raman SVM Regression Amide I&II&III Other Model using Bin 10 cm⁻¹ (y = 0.388x + 26.27, R² = 0.4586)
Graph of Raman SVM Regression Full Spectrum α-Helix Model using Bin 10 cm⁻¹ (y = 0.182x + 36.39, R² = 0.0259)
Graph of Raman SVM Regression Full Spectrum β-Sheet Model using Bin 10 cm⁻¹ (y = 0, R² = #N/A)
Graph of Raman SVM Regression Full Spectrum Other Model using Bin 10 cm⁻¹ (y = -0.0214x + 46.192, R² = 0.0008)

Appendix G- Graphs showing correlation of PLS regression models

Graph of ROA PLS Regression Amide I α-Helix Model using Bin 10 cm -1

ROA Amide I: Alpha Helix PLS Regression Model 80 70

60

50 40 30 y = 0.6895x + 7.8848 R² = 0.6895 20

10 Predicted%Values 0 -10 0 20 40 60 80 100 -20 Actual%Values

Graph of ROA PLS Regression Amide I β-Sheet Model using Bin 10 cm -1

ROA Amide I: β-Sheet PLS Regression Model 60 y = 0.7089x + 6.3815 50 R² = 0.7089

40

30

20

Predicted%Values 10

0 0 10 20 30 40 50 60 70 -10 Actual%Values

227

Graph of ROA PLS Regression Amide I Other Model using Bin 10 cm -1

ROA Amide I: Other PLS Regression Model 100 y = 0.4671x + 26.482 90 R² = 0.4671

80 70 60 50 40 30 Predicted%Values 20 10 0 0 20 40 60 80 100 120 Actual%Values

Graph of ROA PLS Regression Amide II α-Helix Model using Bin 10 cm -1

ROA Amide II: α-Helix Amide II PLs Regression Model 60 y = 0.2906x + 18.014 50 R² = 0.2906

40

30

20

Predicted%Values 10

0 0 20 40 60 80 100 -10 Actual%Values

228

Graph of ROA PLS Regression Amide II β-sheet Model using Bin 10 cm -1

ROA Amide II: β-sheet PLS Regression Model 45 y = 0.2916x + 15.574 R² = 0.2916 40

35

30 25 20 15

10 Predicted%Values 5 0 -5 0 10 20 30 40 50 60 70 Actual%Values

Graph of ROA PLS Regression Amide II Other Model using Bin 10 cm -1

ROA Amide II: Other PLS Regression Model y = 0.3346x + 33.067 80 R² = 0.3346 70 60 50 40 30

Predicted%Values 20 10 0 0 20 40 60 80 100 120 Actual%Values

229

Graph of ROA PLS Regression Amide III α-Helix Model using Bin 10 cm -1

ROA Amide III: α-Helix PLS Regression Model 100 y = 0.7994x + 5.1312 R² = 0.7994

80

60

40

20 Predicted%Values

0 0 20 40 60 80 100 -20 Actual%Values

Graph of ROA PLS Regression Amide III β-Sheet Model using Bin 10 cm -1

[Scatter plot of predicted vs actual % values; fit: y = 0.7525x + 5.4253, R² = 0.7525]

Graph of ROA PLS Regression Amide III Other Model using Bin 10 cm -1

[Scatter plot of predicted vs actual % values; fit: y = 0.7759x + 11.134, R² = 0.7759]

Graph of ROA PLS Regression Amide I&II α-Helix Model using Bin 10 cm -1

[Scatter plot of predicted vs actual % values; fit: y = 0.7888x + 5.3625, R² = 0.7888]

Graph of ROA PLS Regression Amide I&II β-Sheet Model using Bin 10 cm -1

[Scatter plot of predicted vs actual % values; fit: y = 0.7632x + 5.1898, R² = 0.7632]

Graph of ROA PLS Regression Amide I&II Other Model using Bin 10 cm -1

[Scatter plot of predicted vs actual % values; fit: y = 0.6866x + 15.574, R² = 0.6866]

Graph of ROA PLS Regression Amide I&III α-Helix Model using Bin 10 cm -1

[Scatter plot of predicted vs actual % values; fit: y = 0.8614x + 3.545, R² = 0.8614]

Graph of ROA PLS Regression Amide I&III β-Sheet Model using Bin 10 cm -1

[Scatter plot of predicted vs actual % values; fit: y = 0.821x + 4.0047, R² = 0.821]

Graph of ROA PLS Regression Amide I&III Other Model using Bin 10 cm -1

[Scatter plot of predicted vs actual % values; fit: y = 0.8185x + 9.0339, R² = 0.8185]

Graph of ROA PLS Regression Amide II&III α-Helix Model using Bin 10 cm -1

[Scatter plot of predicted vs actual % values; fit: y = 0.8602x + 3.5491, R² = 0.8602]

Graph of ROA PLS Regression Amide II&III β-Sheet Model using Bin 10 cm -1

[Scatter plot of predicted vs actual % values; fit: y = 0.8483x + 3.3251, R² = 0.8483]

Graph of ROA PLS Regression Amide II&III Other Model using Bin 10 cm -1

[Scatter plot of predicted vs actual % values; fit: y = 0.8334x + 8.2805, R² = 0.8334]

Graph of ROA PLS Regression Amide I&II&III α-Helix Model using Bin 10 cm -1

[Scatter plot of predicted vs actual % values; fit: y = 0.8652x + 3.422, R² = 0.8652]

Graph of ROA PLS Regression Amide I&II&III β-Sheet Model using Bin 10 cm -1

[Scatter plot of predicted vs actual % values; fit: y = 0.8541x + 3.1982, R² = 0.8541]

Graph of ROA PLS Regression Amide I&II&III Other Model using Bin 10 cm -1

[Scatter plot of predicted vs actual % values; fit: y = 0.8353x + 8.186, R² = 0.8353]

Graph of ROA PLS Regression Full Spectrum α-Helix Model using Bin 10 cm -1

[Scatter plot of predicted vs actual % values; fit: y = 0.9831x + 0.439, R² = 0.9831]

Graph of ROA PLS Regression Full Spectrum β-Sheet Model using Bin 10 cm -1

[Scatter plot of predicted vs actual % values; fit: y = 0.986x + 0.2963, R² = 0.986]

Graph of ROA PLS Regression Full Spectrum Other Model using Bin 10 cm -1

[Scatter plot of predicted vs actual % values; fit: y = 0.9736x + 1.3172, R² = 0.9736]

Graph of Raman PLS Regression Amide I α-Helix Model using Bin 10 cm -1

[Scatter plot of predicted vs actual % values; fit: y = 0.868x + 4.0501, R² = 0.868]

Graph of Raman PLS Regression Amide I β-Sheet Model using Bin 10 cm -1

[Scatter plot of predicted vs actual % values; fit: y = 0.7396x + 4.7229, R² = 0.7396]

Graph of Raman PLS Regression Amide I Other Model using Bin 10 cm -1

[Scatter plot of predicted vs actual % values; fit: y = 0.8027x + 9.4879, R² = 0.8027]

Graph of Raman PLS Regression Amide II α-Helix Model using Bin 10 cm -1

[Scatter plot of predicted vs actual % values; fit: y = 0.5138x + 14.342, R² = 0.5138]

Graph of Raman PLS Regression Amide II Other Model using Bin 10 cm -1

[Scatter plot of predicted vs actual % values; fit: y = 0.6461x + 17.102, R² = 0.6461]

Graph of Raman PLS Regression Amide II β-Sheet Model using Bin 10 cm -1

[Scatter plot of predicted vs actual % values; fit: y = 0.3259x + 12.78, R² = 0.3259]

Graph of Raman PLS Regression Amide III α-Helix Model using Bin 10 cm -1

[Scatter plot of predicted vs actual % values; fit: y = 0.6998x + 9.302, R² = 0.6998]

Graph of Raman PLS Regression Amide III β-sheet Model using Bin 10 cm -1

[Scatter plot of predicted vs actual % values; fit: y = 0.5824x + 7.6235, R² = 0.5824]

Graph of Raman PLS Regression Amide III Other Model using Bin 10 cm -1

[Scatter plot of predicted vs actual % values; fit: y = 0.6492x + 16.718, R² = 0.6492]

Graph of Raman PLS Regression Amide I&II α-Helix Model using Bin 10 cm -1

[Scatter plot of predicted vs actual % values; fit: y = 0.9087x + 2.6937, R² = 0.9087]

Graph of Raman PLS Regression Amide I&II β-Sheet Model using Bin 10 cm -1

[Scatter plot of predicted vs actual % values; fit: y = 0.7854x + 4.069, R² = 0.7854]

Graph of Raman PLS Regression Amide I&II Other Model using Bin 10 cm -1

[Scatter plot of predicted vs actual % values; fit: y = 0.8181x + 8.7128, R² = 0.8181]

Graph of Raman PLS Regression Amide I&III α-Helix Model using Bin 10 cm -1

[Scatter plot of predicted vs actual % values; fit: y = 0.92x + 2.3603, R² = 0.92]

Graph of Raman PLS Regression Amide I&III β-Sheet Model using Bin 10 cm -1

[Scatter plot of predicted vs actual % values; fit: y = 0.8728x + 2.4113, R² = 0.8728]

Graph of Raman PLS Regression Amide I&III Other Model using Bin 10 cm -1

[Scatter plot of predicted vs actual % values; fit: y = 0.8648x + 6.5343, R² = 0.8648]

Graph of Raman PLS Regression Amide II&III α-Helix Model using Bin 10 cm -1

[Scatter plot of predicted vs actual % values; fit: y = 0.8881x + 3.3015, R² = 0.8881]

Graph of Raman PLS Regression Amide II&III β-Sheet Model using Bin 10 cm -1

[Scatter plot of predicted vs actual % values; fit: y = 0.7897x + 3.9872, R² = 0.7897]

Graph of Raman PLS Regression Amide II&III Other Model using Bin 10 cm -1

[Scatter plot of predicted vs actual % values; fit: y = 0.9244x + 3.48, R² = 0.9244]

Graph of Raman PLS Regression Amide I&II&III α-Helix Model using Bin 10 cm -1

[Scatter plot of predicted vs actual % values; fit: y = 0.9514x + 1.433, R² = 0.9514]

Graph of Raman PLS Regression Amide I&II&III β-Sheet Model using Bin 10 cm -1

[Scatter plot of predicted vs actual % values; fit: y = 0.8652x + 2.5549, R² = 0.8652]

Graph of Raman PLS Regression Amide I&II&III Other Model using Bin 10 cm -1

[Scatter plot of predicted vs actual % values; fit: y = 0.9193x + 3.9004, R² = 0.9193]

Graph of Raman PLS Regression Full Spectrum α-Helix Model using Bin 10 cm -1

[Scatter plot of predicted vs actual % values; fit: y = 0.9635x + 1.1198, R² = 0.9635]

Graph of Raman PLS Regression Full Spectrum β-Sheet Model using Bin 10 cm -1

[Scatter plot of predicted vs actual % values; fit: y = 1.0381x - 0.6914, R² = 0.9261]

Graph of Raman PLS Regression Full Spectrum Other Model using Bin 10 cm -1

[Scatter plot of predicted vs actual % values; fit: y = 0.9368x + 3.0402, R² = 0.9368]

Graph of ROA PLS Regression Amide I α-Helix+310 Helix Model using Bin 10 cm-1

[Scatter plot of predicted vs actual % values; fit: y = 0.7095x + 7.7551, R² = 0.6264]

Graph of ROA PLS Regression Amide II α-Helix+310 Helix Model using Bin 10 cm-1

[Scatter plot of predicted vs actual % values; fit: y = 0.3859x + 16.391, R² = 0.2289]

Graph of ROA PLS Regression Amide III α-Helix+310 Helix Model using Bin 10 cm-1

[Scatter plot of predicted vs actual % values; fit: y = 0.652x + 9.3623, R² = 0.652]

Graph of ROA PLS Regression Amide I&II α-Helix+310 Helix Model using Bin 10 cm -1

[Scatter plot of predicted vs actual % values; fit: y = 0.7998x + 5.345, R² = 0.7998]

Graph of ROA PLS Regression Amide I&III α-Helix+310 Helix Model using Bin 10 cm-1

[Scatter plot of predicted vs actual % values; fit: y = 0.8495x + 4.0484, R² = 0.8495]

Graph of ROA PLS Regression Amide II&III α-Helix+310 Helix Model using Bin 10 cm-1

[Scatter plot of predicted vs actual % values; fit: y = 0.8221x + 4.7487, R² = 0.8221]

Graph of ROA PLS Regression Amide I&II&III α-Helix+310 Helix Model using Bin 10 cm -1

[Scatter plot of predicted vs actual % values; fit: y = 0.8658x + 3.5828, R² = 0.8658]

Graph of ROA PLS Regression Full Spectrum α-Helix+310 Helix Model using Bin 10 cm-1

[Scatter plot of predicted vs actual % values; fit: y = 0.9893x + 0.2943, R² = 0.9796]

Graph of Raman PLS Regression Amide I α-Helix+310 Helix Model using Bin 10 cm-1

[Scatter plot of predicted vs actual % values; fit: y = 0.8602x + 4.7632, R² = 0.8602]

Graph of Raman PLS Regression Amide II α-Helix+310 Helix Model using Bin 10 cm -1

[Scatter plot of predicted vs actual % values; fit: y = 0.5239x + 15.714, R² = 0.4863]

Graph of Raman PLS Regression Amide III α-Helix+310 Helix Model using Bin 10 cm-1

[Scatter plot of predicted vs actual % values; fit: y = 0.6729x + 10.795, R² = 0.6729]

Graph of Raman PLS Regression Amide I&II α-Helix+310 Helix Model using Bin 10 cm-1

[Scatter plot of predicted vs actual % values; fit: y = 0.8882x + 3.6916, R² = 0.8882]

Graph of Raman PLS Regression Amide I&III α-Helix+310 Helix Model using Bin 10 cm-1

[Scatter plot of predicted vs actual % values; fit: y = 0.8882x + 3.6916, R² = 0.8882]

Graph of Raman PLS Regression Amide II&III α-Helix+310 Helix Model using Bin 10 cm -1

[Scatter plot of predicted vs actual % values; fit: y = 0.8495x + 4.9681, R² = 0.8495]

Graph of Raman PLS Regression Amide I&II&III α-Helix+310 Helix Model using Bin 10 cm -1

[Scatter plot of predicted vs actual % values; fit: y = 0.9333x + 2.1999, R² = 0.9333]

Graph of Raman PLS Regression Full Spectrum α-Helix+310 Helix Model using Bin 10 cm -1

[Scatter plot of predicted vs actual % values; fit: y = 0.9689x + 1.0607, R² = 0.9689]

Graph of ROA PLS Regression Backbone Region α-Helix Model using Bin 10 cm -1

[Scatter plot of predicted vs actual % values; fit: y = 0.7434x + 6.9387, R² = 0.7434]

Graph of ROA PLS Regression Backbone Region β-Sheet Model using Bin 10 cm -1

[Scatter plot of predicted vs actual % values; fit: y = 0.7045x + 6.4449, R² = 0.7045]

Graph of ROA PLS Regression Backbone Region Other Model using Bin 10 cm -1

[Scatter plot of predicted vs actual % values; fit: y = 0.4671x + 26.482, R² = 0.4671]

Graph of Raman PLS Regression Backbone Region α-Helix Model using Bin 10 cm -1

[Scatter plot of predicted vs actual % values; fit: y = 0.9786x + 0.7308, R² = 0.9786]

Graph of Raman PLS Regression Backbone Region β-Sheet Model using Bin 10 cm -1

[Scatter plot of predicted vs actual % values; fit: y = 0.9296x + 1.1548, R² = 0.9296]

Graph of Raman PLS Regression Backbone Region Other Model using Bin 10 cm -1

[Scatter plot of predicted vs actual % values; fit: y = 0.9234x + 3.6852, R² = 0.9234]

Appendix H- Bar graphs of variable importance and mean decrease in accuracy for the Random Forest analyses
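The two importance measures plotted in this appendix can be sketched with scikit-learn's random forest as follows. The data are synthetic stand-ins for binned spectra with fold-class labels, and permutation importance is used here as a stand-in for the out-of-bag mean decrease in accuracy; the impurity-based `feature_importances_` corresponds to the mean decrease in Gini index.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic stand-ins: 80 "spectra" x 10 bins, with the class label
# driven by bins 7 and 8 (indices 6 and 7) so they should rank highest.
rng = np.random.default_rng(1)
X = rng.normal(size=(80, 10))
y = (X[:, 6] + X[:, 7] > 0).astype(int)

rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=1)
rf.fit(X, y)

# Mean decrease in Gini index per bin (bottom panels of each figure).
gini_importance = rf.feature_importances_

# Permutation importance per bin, an approximation to the
# mean decrease in accuracy (top panels of each figure).
acc_importance = permutation_importance(
    rf, X, y, n_repeats=10, random_state=1).importances_mean

print("most important bin (Gini):", gini_importance.argmax() + 1)
print("most important bin (accuracy):", acc_importance.argmax() + 1)
```

Plotting either vector against bin number (or bin centre wavenumber) reproduces the style of the bar graphs below.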

Raman Whole Variable Importance (top) and Gini Index (bottom) plots for Spectral Subdivisions

[Bar charts of mean decrease in accuracy (top) and mean decrease in Gini index (bottom) against wavenumber (cm-1)]

Bins are 10 cm-1 wide, from 605-615 cm-1 to 1765-1775 cm-1.

Raman Amide I variable importance plots: Mean decrease in Accuracy (top); Mean decrease in Gini index (bottom).

[Bar charts of mean decrease in accuracy (top) and mean decrease in Gini index (bottom) against bin number (1-10)]

Bins are 10 cm-1 wide, from 1605-1615 cm-1 (bin 1) to 1695-1705 cm-1 (bin 10). The 1665-1695 cm-1 region is most important.

Bin key: 1: 1605-1615 cm-1; 2: 1615-1625 cm-1; 3: 1625-1635 cm-1; 4: 1635-1645 cm-1; 5: 1645-1655 cm-1; 6: 1655-1665 cm-1; 7: 1665-1675 cm-1; 8: 1675-1685 cm-1; 9: 1685-1695 cm-1; 10: 1695-1705 cm-1

Raman Amide II variable importance plots: Mean decrease in Accuracy (top); Mean decrease in Gini index (bottom).

[Bar charts of mean decrease in accuracy (top) and mean decrease in Gini index (bottom) against bin number (1-9)]

Bins are 10 cm-1 wide, from 1515-1525 cm-1 (bin 1) to 1595-1605 cm-1 (bin 9). The 1525-1535 cm-1 region is most important.

Bin key: 1: 1515-1525 cm-1; 2: 1525-1535 cm-1; 3: 1535-1545 cm-1; 4: 1545-1555 cm-1; 5: 1555-1565 cm-1; 6: 1565-1575 cm-1; 7: 1575-1585 cm-1; 8: 1585-1595 cm-1; 9: 1595-1605 cm-1

Raman Amide III variable importance plots: Mean decrease in Accuracy (top); Mean decrease in Gini index (bottom).

[Bar charts of mean decrease in accuracy (top) and mean decrease in Gini index (bottom) against bin number (1-14)]

Bins are 10 cm-1 wide, from 1205-1215 cm-1 (bin 1) to 1335-1345 cm-1 (bin 14). The 1255-1265 cm-1 region is most important.

Bin key: 1: 1205-1215 cm-1; 2: 1215-1225 cm-1; 3: 1225-1235 cm-1; 4: 1235-1245 cm-1; 5: 1245-1255 cm-1; 6: 1255-1265 cm-1; 7: 1265-1275 cm-1; 8: 1275-1285 cm-1; 9: 1285-1295 cm-1; 10: 1295-1305 cm-1; 11: 1305-1315 cm-1; 12: 1315-1325 cm-1; 13: 1325-1335 cm-1; 14: 1335-1345 cm-1

Raman Amide I+II Variable Importance and Gini Index Plots

[Bar charts of mean decrease in accuracy (top) and mean decrease in Gini index (bottom) against bin number (1-17)]

Raman Amide I+II variable importance plots: Mean decrease in Accuracy (top); Mean decrease in Gini index (bottom). Bins are 10 cm-1 wide, from 1515-1525 cm-1 (bin 1) to 1675-1685 cm-1 (bin 17).

Bin key: 1: 1515-1525 cm-1; 2: 1525-1535 cm-1; 3: 1535-1545 cm-1; 4: 1545-1555 cm-1; 5: 1555-1565 cm-1; 6: 1565-1575 cm-1; 7: 1575-1585 cm-1; 8: 1585-1595 cm-1; 9: 1595-1605 cm-1; 10: 1605-1615 cm-1; 11: 1615-1625 cm-1; 12: 1625-1635 cm-1; 13: 1635-1645 cm-1; 14: 1645-1655 cm-1; 15: 1655-1665 cm-1; 16: 1665-1675 cm-1; 17: 1675-1685 cm-1

Raman Amide I+III Variable Importance and Gini Index Plots

[Bar charts of mean decrease in accuracy (top) and mean decrease in Gini index (bottom) against bin number]

Bin key: 1: 1205-1215 cm-1; 2: 1215-1225 cm-1; 3: 1225-1235 cm-1; 4: 1235-1245 cm-1; 5: 1245-1255 cm-1; 6: 1255-1265 cm-1; 7: 1265-1275 cm-1; 8: 1275-1285 cm-1; 9: 1285-1295 cm-1; 10: 1295-1305 cm-1; 11: 1305-1315 cm-1; 12: 1315-1325 cm-1; 13: 1325-1335 cm-1; 14: 1605-1615 cm-1; 15: 1615-1625 cm-1; 16: 1625-1635 cm-1; 18: 1635-1645 cm-1; 19: 1645-1655 cm-1; 20: 1655-1665 cm-1; 21: 1665-1675 cm-1; 22: 1675-1685 cm-1; 23: 1685-1695 cm-1; 24: 1695-1705 cm-1

Raman Amide II+III Variable Importance and Gini Index Plots

[Bar charts of mean decrease in accuracy (top) and mean decrease in Gini index (bottom) against bin number (1-23)]

Bin key: 1: 1205-1215 cm-1; 2: 1215-1225 cm-1; 3: 1225-1235 cm-1; 4: 1235-1245 cm-1; 5: 1245-1255 cm-1; 6: 1255-1265 cm-1; 7: 1265-1275 cm-1; 8: 1275-1285 cm-1; 9: 1285-1295 cm-1; 10: 1295-1305 cm-1; 11: 1305-1315 cm-1; 12: 1315-1325 cm-1; 13: 1325-1335 cm-1; 14: 1515-1525 cm-1; 15: 1525-1535 cm-1; 16: 1535-1545 cm-1; 17: 1545-1555 cm-1; 18: 1555-1565 cm-1; 19: 1565-1575 cm-1; 20: 1575-1585 cm-1; 21: 1585-1595 cm-1; 22: 1595-1605 cm-1; 23: 1605-1615 cm-1

Raman Amide I+ II+III Variable Importance and Gini Index Plots

[Bar charts of mean decrease in accuracy (top) and mean decrease in Gini index (bottom) against bin number]

Bin key: 1: 1205-1215 cm-1; 2: 1215-1225 cm-1; 3: 1225-1235 cm-1; 4: 1235-1245 cm-1; 5: 1245-1255 cm-1; 6: 1255-1265 cm-1; 7: 1265-1275 cm-1; 8: 1275-1285 cm-1; 9: 1285-1295 cm-1; 10: 1295-1305 cm-1; 11: 1305-1315 cm-1; 12: 1315-1325 cm-1; 13: 1325-1335 cm-1; 14: 1515-1525 cm-1; 15: 1525-1535 cm-1; 16: 1535-1545 cm-1; 17: 1545-1555 cm-1; 19: 1555-1565 cm-1; 20: 1565-1575 cm-1; 21: 1605-1615 cm-1; 22: 1615-1625 cm-1; 23: 1625-1635 cm-1; 24: 1635-1645 cm-1; 25: 1645-1655 cm-1; 26: 1655-1665 cm-1; 27: 1665-1675 cm-1; 28: 1675-1685 cm-1; 29: 1685-1695 cm-1; 30: 1695-1705 cm-1; 31: 1705-1710 cm-1

Fig. 6 ROA Whole Variable Importance (top) and Gini Index (bottom) plots for Spectral Subdivisions

[Bar charts of mean decrease in accuracy (top) and mean decrease in Gini index (bottom) against wavenumber (cm-1)]

Bins are 10 cm-1 wide, from 605-615 cm-1 to 1765-1775 cm-1.

ROA Amide I Variable Importance & Gini Index Plots

[Bar charts of mean decrease in accuracy (top) and mean decrease in Gini index (bottom) against bin number (1-10)]

Bin key: 1: 1605-1615 cm-1; 2: 1615-1625 cm-1; 3: 1625-1635 cm-1; 4: 1635-1645 cm-1; 5: 1645-1655 cm-1; 6: 1655-1665 cm-1; 7: 1665-1675 cm-1; 8: 1675-1685 cm-1; 9: 1685-1695 cm-1; 10: 1695-1705 cm-1

ROA Amide II Variable Importance & Gini Index Plots

[Bar charts of mean decrease in accuracy (top) and mean decrease in Gini index (bottom) against bin number (1-9)]

Bin key: 1: 1515-1525 cm-1; 2: 1525-1535 cm-1; 3: 1535-1545 cm-1; 4: 1545-1555 cm-1; 5: 1555-1565 cm-1; 6: 1565-1575 cm-1; 7: 1575-1585 cm-1; 8: 1585-1595 cm-1; 9: 1595-1605 cm-1

ROA Amide III Variable Importance & Gini Index Plots

[Bar charts of mean decrease in accuracy (top) and mean decrease in Gini index (bottom) against bin number (1-14)]

Bin key: 1: 1205-1215 cm-1; 2: 1215-1225 cm-1; 3: 1225-1235 cm-1; 4: 1235-1245 cm-1; 5: 1245-1255 cm-1; 6: 1255-1265 cm-1; 7: 1265-1275 cm-1; 8: 1275-1285 cm-1; 9: 1285-1295 cm-1; 10: 1295-1305 cm-1; 11: 1305-1315 cm-1; 12: 1315-1325 cm-1; 13: 1325-1335 cm-1; 14: 1335-1345 cm-1

ROA Amide I+ II Variable Importance & Gini Index Plots

[Bar charts of mean decrease in accuracy (top) and mean decrease in Gini index (bottom) against bin number (1-17)]

Bin key: 1: 1515-1525 cm-1; 2: 1525-1535 cm-1; 3: 1535-1545 cm-1; 4: 1545-1555 cm-1; 5: 1555-1565 cm-1; 6: 1565-1575 cm-1; 7: 1575-1585 cm-1; 8: 1585-1595 cm-1; 9: 1595-1605 cm-1; 10: 1605-1615 cm-1; 11: 1615-1625 cm-1; 12: 1625-1635 cm-1; 13: 1635-1645 cm-1; 14: 1645-1655 cm-1; 15: 1655-1665 cm-1; 16: 1665-1675 cm-1; 17: 1675-1685 cm-1

ROA Amide I+ III Variable Importance & Gini Index Plots

[Bar charts of mean decrease in accuracy (top) and mean decrease in Gini index (bottom) against bin number]

Bin key: 1: 1205-1215 cm-1; 2: 1215-1225 cm-1; 3: 1225-1235 cm-1; 4: 1235-1245 cm-1; 5: 1245-1255 cm-1; 6: 1255-1265 cm-1; 7: 1265-1275 cm-1; 8: 1275-1285 cm-1; 9: 1285-1295 cm-1; 10: 1295-1305 cm-1; 11: 1305-1315 cm-1; 12: 1315-1325 cm-1; 13: 1325-1335 cm-1; 14: 1605-1615 cm-1; 15: 1615-1625 cm-1; 16: 1625-1635 cm-1; 18: 1635-1645 cm-1; 19: 1645-1655 cm-1; 20: 1655-1665 cm-1; 21: 1665-1675 cm-1; 22: 1675-1685 cm-1; 23: 1685-1695 cm-1; 24: 1695-1705 cm-1

ROA Amide II+ III Variable Importance & Gini Index Plots

[Bar charts of mean decrease in accuracy (top) and mean decrease in Gini index (bottom) against bin number (1-23)]

Bin key: 1: 1205-1215 cm-1; 2: 1215-1225 cm-1; 3: 1225-1235 cm-1; 4: 1235-1245 cm-1; 5: 1245-1255 cm-1; 6: 1255-1265 cm-1; 7: 1265-1275 cm-1; 8: 1275-1285 cm-1; 9: 1285-1295 cm-1; 10: 1295-1305 cm-1; 11: 1305-1315 cm-1; 12: 1315-1325 cm-1; 13: 1325-1335 cm-1; 14: 1515-1525 cm-1; 15: 1525-1535 cm-1; 16: 1535-1545 cm-1; 17: 1545-1555 cm-1; 18: 1555-1565 cm-1; 19: 1565-1575 cm-1; 20: 1575-1585 cm-1; 21: 1585-1595 cm-1; 22: 1595-1605 cm-1; 23: 1605-1615 cm-1

ROA Amide I+ II+ III Variable Importance & Gini Index Plots

[Bar charts of mean decrease in accuracy (top) and mean decrease in Gini index (bottom) against bin number]

Bin key: 1: 1205-1215 cm-1; 2: 1215-1225 cm-1; 3: 1225-1235 cm-1; 4: 1235-1245 cm-1; 5: 1245-1255 cm-1; 6: 1255-1265 cm-1; 7: 1265-1275 cm-1; 8: 1275-1285 cm-1; 9: 1285-1295 cm-1; 10: 1295-1305 cm-1; 11: 1305-1315 cm-1; 12: 1315-1325 cm-1; 13: 1325-1335 cm-1; 14: 1515-1525 cm-1; 15: 1525-1535 cm-1; 16: 1535-1545 cm-1; 17: 1545-1555 cm-1; 19: 1555-1565 cm-1; 20: 1565-1575 cm-1; 21: 1605-1615 cm-1; 22: 1615-1625 cm-1; 23: 1625-1635 cm-1; 24: 1635-1645 cm-1; 25: 1645-1655 cm-1; 26: 1655-1665 cm-1; 27: 1665-1675 cm-1; 28: 1675-1685 cm-1; 29: 1685-1695 cm-1; 30: 1695-1705 cm-1; 31: 1705-1710 cm-1

Appendix I- Multidimensional Scaling (MDS) plots for Random Forest analyses
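The MDS plots in this appendix embed a random forest proximity matrix (the fraction of trees in which two spectra fall in the same terminal node) into two dimensions. A minimal sketch with synthetic two-class data follows; scikit-learn does not expose proximities directly, so they are computed here from the leaf assignments returned by `apply`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.manifold import MDS

# Synthetic stand-ins: two well-separated groups of 12 "spectra" each,
# standing in for proteins of two hypothetical fold classes.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1, 0.3, size=(12, 8)),
               rng.normal(1, 0.3, size=(12, 8))])
y = np.array([0] * 12 + [1] * 12)

rf = RandomForestClassifier(n_estimators=300, random_state=2).fit(X, y)

# Proximity: fraction of trees in which two samples share a leaf.
leaves = rf.apply(X)                      # shape (n_samples, n_trees)
prox = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)

# Embed 1 - proximity as a precomputed distance matrix in 2-D.
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=2).fit_transform(1.0 - prox)
print(coords.shape)
```

Scattering `coords` and labelling each point with its protein number (as in the keys of Appendices J and K) reproduces the style of the plots below.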

Multidimensional Scaling Plot for Raman Amide I spectra

[2-D MDS scatter of proteins numbered as in Appendix K, coloured by fold class: AlphaBeta, AlphaHelix, BetaSheet, Disordered]

Multidimensional Scaling Plot for Raman Amide I&II spectra

[2-D MDS scatter of proteins numbered as in Appendix K, coloured by fold class: AlphaBeta, AlphaHelix, BetaSheet, Disordered]

Multidimensional Scaling Plot for Raman Amide I & III spectra

[2-D MDS scatter of proteins numbered as in Appendix K, coloured by fold class: AlphaBeta, AlphaHelix, BetaSheet, Disordered]

Appendix J- Key to proteins used in ROA MDS Plot

No. Protein PDB Code
1 turkey ovomucoid 1m8c
2 human lactoferrin 1fck
3 MS2 virus 2ms2
4 ribonuclease A 1afu
5 ribonuclease B 1rbb
6 subtilisin 1ndq
7 ubiquitin 1ubq
8 aldolase 2ald
9 hen lysozyme 1lsc
10 *human ovomucoid 1m8c
11 human lysozyme 1gaz
12 insulin 1zeh
13 α-lactalbumin 1f6s
14 full-length prion protein 1y2s
15 equine lysozyme 2eql
16 tobacco mosaic virus 1vtm
17 filamentous bacteriophage M13 2cpb
18 creatine kinase 1i0e
19 filamentous bacteriophage fd 1fdm
20 human serum albumin 1e78
21 filamentous bacteriophage Pf1 1pfi
22 prion protein (90-230) 1qm1
23 S100Bs 1uwo
24 S100A6 calcyclin 1k96
25 immunoglobulin G 1ig2
26 cowpea mosaic virus 1ny7
27 serum amyloid protein 1lgn
28 invertase 2ac1
29 Bordetella pertussis P.69 pertactin 1dab
30 pepsin 1am5
31 human serum amyloid P component 1sac
32 satellite tobacco mosaic virus 1a34
33 bovine trypsin 1k1l
34 chymotrypsin 4cha
35 amylase 1kgu
36 avidin 1rav
37 bovine β-lactoglobulin 1b8e
38 concanavalin A 3cna
39 trypsinogen 2tga
40 bovine β-lactoglobulin pH 2 1dv9
41 Bowman-Birk inhibitor 1pi2
42 orosomucoid 3kq0
43 antifreeze protein 1b7i
44 metallothionein 4mt2

Appendix K- Key to the proteins used in Raman MDS plots

No. Protein PDB Code
1 lactoferrin 1fck
2 MS2 capsid 2ms2
3 ovine PrP90-230 isoform 1qm1
4 ribonuclease A 1afu
5 insulin 1zeh
6 hen lysozyme 1lsc
7 human lysozyme 1gaz
8 aldolase 2ald
9 apo-S100A6 1k96
10 filamentous bacteriophage fd 1fdm
11 human serum albumin 1e78
12 filamentous bacteriophage IKE 1ifl
13 filamentous bacteriophage M13 2cpb
14 tobacco mosaic virus 1vtm
15 concanavalin A 3cna
16 human immunoglobulin G 2ig2
17 cowpea mosaic virus 1ny7
18 satellite tobacco mosaic virus 1a34
19 invertase 2ac1
20 bovine β-lactoglobulin pH 2 1dv9
21 α-lactalbumin 1f6s
22 orosomucoid 3kq0
23 Bowman-Birk protease inhibitor 1pi2
24 metallothionein 4mt2

Appendix L- Published Papers
