Using Machine Learning to Determine Fold Class and Secondary Structure Content from Raman Optical Activity and Raman Vibrational Spectroscopy
A thesis submitted to the University of Manchester for the degree of MPhil in the Faculty of Life Sciences
2012
Myra Kinalwa-Nalule

Table of Contents

List of Appendices
List of Figures
List of Tables
List of Abbreviation
Abstract
Declaration
Copyright Statement
Acknowledgements

1. Protein Structure
  1.1 Introduction
  1.2 Protein structure
    1.2.1 Primary, secondary and tertiary protein structure
    1.2.2 The helix structure
    1.2.3 The β-sheet structure
    1.2.4 Disordered structure
    1.2.5 Protein Motifs
  1.3 Classification of proteins and protein databases
  1.4 Determination of protein structure
  1.5 References

2. Vibrational Spectroscopy
  2.1 Introduction
  2.2 Vibrational Energies
  2.3 Raman Spectroscopy
    2.3.1 The Raman Effect
  2.4 Raman Optical Activity (ROA)
    2.3.1 Basic ROA Theory
  2.5 Spectroscopic Structural Band Assignments
  2.6 Circular Dichroism
  3. Chemometrics analysis of vibrational spectroscopy
  4. References

3. Machine Learning
  3.1 Introduction
  3.2 Support Vector Machines (SVM) Classification
    3.2.1 Soft Margin
    3.2.2 Kernel Methods
  3.3 SVM Regression
  3.4 Random Forests
    3.4.1 Introduction
    3.4.2 Gini Index
    3.4.3 Random Forest prediction
    3.4.5 Out-of-Bag (OOB) data
    3.4.5 Distance Metric
  3.5 Partial Least Squares (PLS) Regression
    3.5.1 Introduction
    3.5.2 The PLS Model
    3.5.3 PLS number of components
  3.6 Chemometric studies of protein vibrational spectroscopy
  3.7 Related Chemometrics Methods
  3.8 References

4 Data and Methods
  4.1 Datasets
  4.2 Dataset representation
  4.3 Data Processing
    4.3.1 Binning
    4.3.2 Range Selection
    4.3.3 Scaling
  4.4 Training the Models
    4.4.1 SVM Models
    4.4.2 PLS Models
    4.4.3 Random Forest Models
  4.5 References

5. Support Vector Machine (SVM) Classification and Regression analyses of Raman and ROA spectra
  SVM Analyses of Raman and ROA
  5.1 Data Pre-processing
    5.1.1 Choosing the Bin Factor
  5.2 Results and Discussion
    5.2.1 SVM Classification Analyses Results of ROA and Raman spectra
  5.3 SVM Regression Analyses Results of Raman and ROA spectra
    5.3.1 SVM Regression Analyses of Raman and ROA
    5.3.2 Discussion
  5.4 References