VARIABLE SELECTION TO IMPROVE CLASSIFICATION IN STRUCTURE-ACTIVITY STUDIES AND SPECTROSCOPIC ANALYSIS

By
COLLIN GARETH WHITE

Bachelor of Science in Chemistry
Oklahoma State University
Stillwater, Oklahoma
2010

Submitted to the Faculty of the Graduate College of the Oklahoma State University in partial fulfillment of the requirements for the Degree of DOCTOR OF PHILOSOPHY
May, 2016

Dissertation Approved:

Barry K. Lavine, Dissertation Adviser
Ziad El Rassi
John Gelder
Nicholas F. Materer
R. Russell Rhinehart

ACKNOWLEDGEMENTS

I would like to thank my advisor, Dr. Barry K. Lavine, for his support, guidance and patience. I would also like to thank my graduate advisory committee: Dr. Ziad El Rassi, Dr. Nicholas Materer, Dr. John Gelder, and Dr. R. Russell Rhinehart. I am also thankful to present and former members of the Lavine research group: Nuwan Perera, Tao Ding, Kaushalya S. Dahal, Francis Kwofie, Nikhil S. Mirjankar, Sandhya R. Pampati, and Razvan I. Stoian. Special thanks go to Dr. Ayuba Fasasi and Matthew D. Allen for their helpful discussions and guidance. Additionally, I would like to thank Dr. Douglas R. Heisterkamp from the computer science department and former engineering student Solomon Gebreyohannes for their assistance. Finally, I wish to thank my parents, Karl A. White and Barbara L. White, and my brother, Ethan M. White, for their support and encouragement.

Acknowledgements reflect the views of the author and are not endorsed by committee members or Oklahoma State University.

Name: COLLIN GARETH WHITE
Date of Degree: MAY 2016
Title of Study: VARIABLE SELECTION TO IMPROVE CLASSIFICATION IN STRUCTURE-ACTIVITY STUDIES AND SPECTROSCOPIC ANALYSIS
Major Field: CHEMISTRY

Abstract: A genetic algorithm for variable selection to improve classifications is explored and validated on a wide range of data.
In one study, 147 tetralin and indan musks and nonmusks, compiled from the literature to investigate the relationship between molecular structure and musk odor quality, were correctly classified by 45 molecular descriptors identified by the pattern recognition GA, which revealed an asymmetric data structure. A 3-layer feed-forward neural network trained by back propagation was used to develop a discriminant that correctly classified all of the compounds in the training set as musk or nonmusk. The neural network was successfully validated using an external prediction set of 37 compounds. In another study, 172 tetralin-, indan- and isochroman-like compounds were collected from the published literature to investigate the relationship between chemical structure and musk odor quality. The 20 molecular structural descriptors selected by the pattern recognition GA yielded a discriminant that was successfully validated using an external validation set consisting of 19 compounds. A third study describes the development of a prototype pattern recognition library search system for the infrared spectral libraries of the Paint Data Query (PDQ) database. The system is intended to improve the discrimination capability and permit quantification of discriminant power for automotive paint comparisons involving the original equipment manufacturer. It consists of two separate but interrelated components: search prefilters to cull the library spectra to a specific assembly plant, and a cross correlation library search algorithm that utilizes both forward and backward searching to identify the year, line and model of the unknown in the spectral set identified by the search prefilters. The genetic algorithm was able to identify spectral variables from the clear coat, surfacer-primer and e-coat layers of the original manufacturer's automotive paint that were characteristic of the assembly plant of the vehicle.

TABLE OF CONTENTS

Chapter

I. INTRODUCTION

II. METHODOLOGY
    2.1 Introduction
    2.2 Principal Component Analysis
    2.3 Genetic Algorithm for Pattern Recognition Analysis
        2.3.1 PCKaNN Fitness Function
        2.3.2 Modifications to PCKaNN
            2.3.2.1 Transductive Learning
            2.3.2.2 Multivariate Calibration
            2.3.2.3 Pinch Ratio Clustering
    2.4 Software Design and Implementation
        2.4.1 Use of the Java GA
        2.4.2 Algorithm Implementation

III. IMPROVING THE PDQ DATABASE TO ENHANCE INVESTIGATIVE LEAD INFORMATION FROM AUTOMOTIVE PAINTS
    3.1 Introduction
    3.2 Methodology
        3.2.1 Search Prefilters
        3.2.2 Cross-Correlation Library Search System
    3.3 Results and Discussion
        3.3.1 Manufacturer Level Prefilter
        3.3.2 General Motors
        3.3.3 Chrysler
        3.3.4 Ford
    3.4 Conclusion

IV. ODOR-STRUCTURE RELATIONSHIP STUDIES OF POLYCYCLIC MUSKS
    4.1 Introduction
    4.2 Experimental
        4.2.1 Musk Compounds
        4.2.2 Molecular Descriptor Generation
    4.3 Indan and Tetralin Study
        4.3.1 Pattern Recognition Results
        4.3.2 Interpretation of Selected Descriptors
    4.4 Indan, Tetralin and Isochroman Study
        4.4.1 Pattern Recognition Results
        4.4.2 Interpretation of Selected Descriptors
    4.5 Conclusion

V. CONCLUSION
    5.1 Forensic Automotive Paint Analysis
    5.2 Structure Activity Correlations of Musks

LIST OF TABLES

Table

3.1. Training Set and Validation Set for Manufacturer Search Prefilter
3.2. Distribution of GM Assembly Plants into Plant Groups
3.3. Training Set and Validation Set for General Motors Plant Groups
3.4. Assembly Plants and Subplants Comprising Plant Group 1
3.5. Assembly Plants Comprising Plant Group 2
3.6. Assembly Plants and Subplants Comprising Plant Group 3
3.7. Assembly Plants and Subplant Comprising Plant Group 4
3.8. Assembly Plants and Subplant Comprising Plant Group 5
3.9. Prototype Cross Correlation Library Search Results for GM
3.10. OMNIC Library Search Results for GM
3.11. Distribution of Chrysler Assembly Plants into Plant Groups
3.12. Training Set and Validation Set for Chrysler Plant Groups
3.13. Assembly Plants and Subplants Comprising Plant Group 11
3.14. Assembly Plant and Subplants Comprising Plant Group 12