Real-Time Wavelet Compression and Self-Modeling Curve

REAL-TIME WAVELET COMPRESSION AND SELF-MODELING CURVE
RESOLUTION FOR ION MOBILITY SPECTROMETRY

A dissertation presented to the faculty of the College of Arts and Sciences of Ohio University

In partial fulfillment of the requirements for the degree
Doctor of Philosophy

Guoxiang Chen
March 2003

This dissertation entitled
REAL-TIME WAVELET COMPRESSION AND SELF-MODELING CURVE
RESOLUTION FOR ION MOBILITY SPECTROMETRY

BY
GUOXIANG CHEN

has been approved for the Department of Chemistry and Biochemistry and the College of Arts and Sciences by

Peter de B. Harrington
Associate Professor of Chemistry and Biochemistry

Leslie A. Flemming
Dean, College of Arts and Sciences
CHEN, GUOXIANG. Ph.D. March 2003. Analytical Chemistry
Real-Time Wavelet Compression and Self-Modeling Curve Resolution for Ion Mobility Spectrometry (203 pp.)
Director of Dissertation: Peter de B. Harrington
Chemometrics has proven useful for solving chemistry problems. Most chemometric methods are applied in post-run analyses, in which data are processed after being collected and archived. However, many applications require real-time processing so that information about a complex chemical system is available immediately. Moreover, real-time chemometrics can eliminate the burden of storing the large amounts of raw data that post-run analyses require. These attributes are important for the construction of portable intelligent instruments.
Ion mobility spectrometry (IMS) furnishes inexpensive, sensitive, fast, and portable sensors that afford a wide variety of potential applications. SIMPLe-to-use Interactive Self-modeling Mixture Analysis (SIMPLISMA) is a self-modeling curve resolution method that has been demonstrated as an effective tool for enhancing IMS measurements. However, all of the previously reported studies have applied SIMPLISMA as a post-run tool.
A modified SIMPLISMA algorithm, referred to as RTSIMPLISMA, was developed for modeling IMS data in real-time. The real-time algorithm can determine the number of components in the IMS data automatically. Resolved concentration and spectral profiles are simultaneously displayed on a virtual instrument while the data is collected from an ion mobility spectrometer.
The computational burden of real-time SIMPLISMA increases as the number of collected spectra grows. When data processing consumes too large a share of computer resources, a spectrum may fail to be acquired. To alleviate this problem, a two-dimensional wavelet compression (WC2) was applied prior to RTSIMPLISMA modeling. The optimal settings of WC2-RTSIMPLISMA for processing IMS data were obtained, with which satisfactory models could be resolved when the data were compressed to 1/256 of their original size.
A novel real-time WC2 has been developed to compress data as they are acquired from IMS sensors. RTSIMPLISMA was applied to the WC2-processed data in real-time, which significantly accelerated the real-time modeling. An integrated software package was developed to implement the real-time WC2-RTSIMPLISMA algorithm and was used for the rapid processing of IMS data of drugs and explosives. The real-time algorithm was able to reveal very small features in the IMS data and to rapidly model the dynamic changes during the course of an IMS measurement.
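As a rough illustration of how a compression to 1/256 can arise, the sketch below applies four levels of a wavelet transform along each of the two data dimensions and keeps only the smooth (approximation) coefficients, so each dimension shrinks by 2^4 = 16 and the matrix by 16 × 16 = 256. This is a minimal sketch under stated assumptions, not the dissertation's implementation: it uses the Haar wavelet for brevity, whereas the figure captions refer to daublet, coiflet, and symmlet filters; the function names `haar_smooth` and `wc2_compress` are hypothetical; and the 1024 × 512 array size is an assumption chosen to be consistent with the 1024-spectrum data sets and the 32 × 64 compressed size mentioned in the List of Figures.

```python
import numpy as np


def haar_smooth(x, levels, axis):
    """Keep only the smooth (approximation) Haar coefficients along one axis.

    Each level pairs neighbouring points and halves the length, so `levels`
    levels reduce the length along `axis` by a factor of 2**levels.
    """
    x = np.moveaxis(np.asarray(x, dtype=float), axis, 0)
    for _ in range(levels):
        if x.shape[0] % 2:
            raise ValueError("length must be divisible by 2 at every level")
        # Orthonormal Haar low-pass step: (a + b) / sqrt(2)
        x = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    return np.moveaxis(x, 0, axis)


def wc2_compress(data, row_levels=4, col_levels=4):
    """Hypothetical two-dimensional compression (illustrative only).

    axis 0 indexes spectra (acquisition order); axis 1 indexes the points
    within each spectrum (drift time).
    """
    out = haar_smooth(data, row_levels, axis=0)
    return haar_smooth(out, col_levels, axis=1)


# Synthetic stand-in for an IMS data matrix (assumed size: 1024 spectra x 512 points).
rng = np.random.default_rng(0)
data = rng.random((1024, 512))
compressed = wc2_compress(data)

print(compressed.shape)               # (64, 32)
print(data.size // compressed.size)   # 256
```

Discarding the detail coefficients is what makes the transform lossy; how much model quality survives depends on the wavelet type and compression level, which is what the optimization described in the abstract addresses.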

Approved: Peter de B. Harrington
Associate Professor of Chemistry and Biochemistry
Acknowledgments
I would like to thank my research advisor, Dr. Peter de B. Harrington, for his invaluable support and guidance during my stay at Ohio University. This dissertation could not have been written and the research could not have been accomplished without his help. I would also like to thank my dissertation committee members, Drs. Gary W. Small, Howard D. Dewald, Martin T. Tuck, Wen-jia R. Chen, and Xiaozhuo Chen, for their great help in my academic progress and research pursuits. Paul Schmittauer is thanked for his assistance with electronic techniques.
I would like to thank the Department of Chemistry and Biochemistry at Ohio University for offering me the opportunity to conduct my doctoral research. The Center for Intelligent Chemical Instrumentation at Ohio University is thanked for supporting the conference trips. Ohio University is thanked for the support of the Donald R. Clippinger Fellowship. The US Army ERDEC, GeoCenters, and Ion Track Instruments are thanked for the partial support of this research. Metara Inc. is thanked for supporting me in writing this dissertation while working. Dr. Willem Windig at Eigenvector Research Inc. is thanked for his permission to use the spectral data files and MATLAB scripts.
I would also like to thank the members of Dr. Harrington’s research group for their helpful suggestions. Special thanks are given to Libo Cao for her consistent help over the years. Dr. Tricia L. Buxton Derringer is also thanked for the bacterial data set.
I would like to thank Zhuo Chen for her love, encouragement, and valuable support.
I would like to thank my father and the other family members, who are always caring and supportive in my life.
Table of Contents

Page
Abstract ........ 3
Acknowledgments ........ 5
List of Tables ........ 9
List of Figures ........ 10
List of Abbreviations ........ 18

Chapter 1 Introduction ........ 21
  1.1 General Statement ........ 21
  1.2 Ion Mobility Spectrometry ........ 23
  1.3 Self-Modeling Curve Resolution ........ 26
  1.4 Data Compression ........ 29
  1.5 The Research Objectives ........ 32

Chapter 2 SIMPLISMA and Wavelet Transform ........ 34
  2.1 SIMPLISMA ........ 34
  2.2 Wavelet Transform ........ 48

Chapter 3 Real-Time Self-Modeling Mixture Analysis ........ 59
  3.1 Introduction ........ 59
  3.2 Theory ........ 61
  3.3 Experimental Section ........ 63
  3.4 Results and Discussion ........ 67
  3.5 Conclusions ........ 86

Chapter 4 RTSIMPLISMA Applied to Two-Dimensional Wavelet Compressed Ion Mobility Data ........ 87
  4.1 Introduction ........ 87
  4.2 Theory ........ 91
  4.3 Experimental Section ........ 94
  4.4 Results and Discussion ........ 97
    4.4.1 Conventional SIMPLISMA Models ........ 97
    4.4.2 Optimization of WC2-RTSIMPLISMA ........ 103
    4.4.3 RTSIMPLISMA Applied to Windig Standard Data Sets ........ 117
  4.5 Conclusions ........ 137

Chapter 5 Real-Time Two-Dimensional Wavelet Compression and Its Application to Real-Time Self-Modeling of IMS Data ........ 143
  5.1 Introduction ........ 143
  5.2 Theory ........ 144
  5.3 Experimental Section ........ 148
  5.4 Results and Discussion ........ 151
    5.4.1 Time Performance of Real-Time WC2-RTSIMPLISMA ........ 151
    5.4.2 Enhanced IMS Measurement by Real-Time WC2-RTSIMPLISMA ........ 156
    5.4.3 Real-Time Self-Modeling of IMS Data of Explosives ........ 164
    5.4.4 Internal Reference Method for Real-Time WC2-RTSIMPLISMA ........ 172
  5.5 Conclusions ........ 186

Chapter 6 Summary and Future Work ........ 187
References ........ 191
Appendix A: Publications ........ 200
Appendix B: Presentations ........ 201
Appendix C: MATLAB Scripts ........ 202
List of Tables

Table ........ Page
Table 2.1 The concentrations of the compounds A and B during the reaction course. ........ 41
Table 3.1 Time performances of different methods and batch number R for processing 550 spectra. ........ 85
Table 4.1 Contribution of wavelet type and compression level to the variation of RRMSES and RRMSEC. ........ 111
Table 4.2 Compression levels, compression factor (C.F.), percent correct n_c, average RRMSES, minimum RRMSES, and the corresponding wavelet type for different compression levels for drug data set. ........ 112
Table 4.3 Compression levels, percent correct n_c, average RRMSES, minimum RRMSES, and the corresponding wavelet type for different compression levels for bacterial data set. ........ 113
Table 5.1 Experimental setup for the data sets in Section 5.4.4. In the table, t_i is the time when the sample was inserted into the desorber; t_s is the time when the measurement stopped; n_s is the total number of spectra collected. Sample volume was 1 µL and the sample disk was removed at 5 s after it was inserted. ........ 175
List of Figures

Figure ........ Page
Figure 1.1 The schematic diagram of an ion mobility spectrometer. ........ 24
Figure 2.1 The virtual spectra of 1 mM aqueous solution of the pure compound A (Panel A) and 1 mM aqueous solution of the pure compound B (Panel B). ........ 42
Figure 2.2 The three-dimensional surface plot of the synthesized mixture spectra of the two-component virtual reaction system. ........ 43
Figure 2.3 SIMPLISMA resolved spectra of compounds A (Panel A) and B (Panel B) with the number of components being predefined to two (α=0.05). ........ 45
Figure 2.4 SIMPLISMA resolved concentration profiles of compounds A (Panel A) and B (Panel B) with the number of components being predefined to two (α=0.05). ........ 46
Figure 2.5 SIMPLISMA model of the synthesized data set with the number of components being predefined to three (α=0.05). ........ 47
Figure 2.6 Schematic of multi-level operations of the pyramid WT algorithm with dyadic sampling. ........ 50
Figure 2.7 Father and mother wavelets of daublet 4 (Panel A, four coefficients), daublet 14 (Panel B, 14 coefficients), coiflet 3 (Panel C, 18 coefficients), and symmlet 6 (Panel D, 12 coefficients). ........ 52
Figure 2.8 Illustration of the multi-level operations of the pyramid algorithm for forward WT using daublet 14. ........ 55
Figure 2.9 The enlarged view of the smooth and detail parts of WT spectrum at level 5 in Figure 2.8. ........ 56
Figure 2.10 Reconstructed spectra from the smooth parts of forward WT with level 1 to 5 corresponding to Figure 2.8. ........ 58
Figure 3.1 Structures for (A) diisopropyl methanephosphonate (DIMP), and (B) Pinacolyl methyl phosphonofluoridate (soman). ........ 64
Figure 3.2 The graphical user interface for real-time SIMPLISMA. ........ 66
Figure 3.3 The 3D surface plot of the IMS data of ethanol. ........ 68
Figure 3.4 RTSIMPLISMA resolved concentration profiles (Panel A) and component spectra (Panel B) for ethanol data. ........ 69
Figure 3.5 The 3D surface plot of the IMS data of DIMP. The data set was acquired from the CAM at positive mode. ........ 70
Figure 3.6 RTSIMPLISMA resolved concentration profiles (Panel A) and component spectra (Panel B) for DIMP data. ........ 72
Figure 3.7 SIMPLISMA-det resolved concentration profiles (Panel A) and component spectra (Panel B) for DIMP data. ........ 73
Figure 3.8 RTSIMPLISMA model after processing 25 spectra for DIMP data. Panel A presents the concentration profiles and Panel B presents the component spectra. ........ 75
Figure 3.9 RTSIMPLISMA model after processing 40 spectra for DIMP data. ........ 76
Figure 3.10 RTSIMPLISMA model after processing 135 spectra for DIMP data. ........ 77
Figure 3.11 RTSIMPLISMA resolved model for DIMP data with NPV threshold β0 = 0.008. ........ 78
Figure 3.12 RTSIMPLISMA model for DIMP data with NPV threshold β0 = 0.04. ........ 79
Figure 3.13 RTSIMPLISMA model for DIMP data with NPV threshold β0 = 0.24. ........ 80
Figure 3.14 Comparison of time performance for real-time implementation of SIMPLISMA-det, RTSIMPLISMA, and data acquisition only. ........ 82
Figure 3.15 Effects on time performance by real-time implementation of RTSIMPLISMA for batches of R spectra. ........ 84
Figure 4.1 Structures for (A) cocaine, and (B) heroin. ........ 90
Figure 4.2 Schematic diagram of the implementation principle of the WC2-RTSIMPLISMA algorithm. ........ 92
Figure 4.3 The cocaine-heroin data set comprised 1024 spectra on a 3D surface plot (Acquired from ITEMISER® ITMS in positive mode). ........ 98
Figure 4.4 The TMAH-preprocessed Bacillus cereus data set comprised 1024 spectra on a 3D surface plot (Acquired from Barringer IONSCAN350 spectrometer in positive mode). ........ 99
Figure 4.5 Conventional SIMPLISMA model from the original cocaine-heroin data set (three-component model). (A) Concentration profiles. (B) Component spectra. ........ 101
Figure 4.6 Conventional SIMPLISMA model from the Bacillus cereus data set (four-component model). (A) Concentration profiles. (B) Component spectra. ........ 102
Figure 4.7 Relative purity curves of determinant-based SIMPLISMA for the drug and bacterial data sets. ........ 104
Figure 4.8 Relative purity curves of Gram-Schmidt-based SIMPLISMA for the drug data set. ........ 105
Figure 4.9 Relative purity curves of Gram-Schmidt-based SIMPLISMA for the bacterial data set. ........ 106
Figure 4.10 Percent correct number of components with respect to the threshold ∆0. ........ 108
Figure 4.11 The 4 × 4 daublet 14-daublet 4 compressed Bacillus cereus dataset comprised of 32 × 64 points in a 3D surface plot. ........ 116
Figure 4.12 Reconstructed RTSIMPLISMA model from the 4 × 4 daublet 14-daublet 4 compressed drug data set. (A) Concentration profiles. (B) Component spectra. ........ 118
Figure 4.13 Reconstructed RTSIMPLISMA model from the 4 × 4 daublet 14-daublet 4 compressed bacterial data set. (A) Concentration profiles. (B) Component spectra. ........ 119
Figure 4.14 The reconstructed data set from the 4 × 4 daublet 14-daublet 4 WC2-RTSIMPLISMA model from the bacterial data set. ........ 120
Figure 4.15 RTSIMPLISMA relative purity curve for the Windig Raman data set. The transition point is highlighted. ........ 122
Figure 4.16 RTSIMPLISMA resolved spectra for the Windig Raman data set. ........ 124
Figure 4.17 Conventional SIMPLISMA (Panel A) and RTSIMPLISMA (Panel B) resolved concentration profiles for the Windig Raman data set. ........ 125
Figure 4.18 Conventional SIMPLISMA resolved spectra for the Windig Raman data set (α = 0.03). ........ 126
Figure 4.19 RTSIMPLISMA relative purity curve for the Windig FTIR microscopy data set. ........ 127
Figure 4.20 RTSIMPLISMA resolved spectra for the Windig FTIR microscopy data set. ........ 129
Figure 4.21 Conventional SIMPLISMA (Panel A) and RTSIMPLISMA (Panel B) resolved concentration profiles for the Windig FTIR microscopy data set. ........ 130
Figure 4.22 Conventional SIMPLISMA resolved spectra for the Windig FTIR microscopy data set (α = 0.03). ........ 131
Figure 4.23 RTSIMPLISMA relative purity curve for the Windig NIR data set. ........ 132
Figure 4.24 Resolved spectra for the Windig NIR data set with conventional SIMPLISMA applied on the positive part of the inverted second derivative data set (α = 0.1). ........ 133
Figure 4.25 RTSIMPLISMA resolved spectra for the Windig NIR data set. ........ 135
Figure 4.26 RTSIMPLISMA relative purity curve for the Windig time resolved mass spectrometry data set. ........ 136
Figure 4.27 Reference spectra for the three photographic color coupling compounds in the Windig time resolved mass spectrometry data set. ........ 138
Figure 4.28 TSIMPLISMA resolved spectra for the Windig time resolved mass spectrometry data set (α = 0.03). ........ 139
Figure 4.29 RTSIMPLISMA resolved spectra for the Windig time resolved mass spectrometry data set. ........ 140
Figure 4.30 TSIMPLISMA (Panel A) and RTSIMPLISMA (Panel B) resolved concentration profiles for the Windig time resolved mass spectrometry data set. ........ 141
Figure 5.1 Circular buffer. As a new point is received, it is placed into the memory pointed by pointer Pn. The start position of the data to be processed is located by Pp. ........ 147
Figure 5.2 Structures for (A) urea nitrate, (B) cyclotrimethylenetrinitramine (RDX), (C) 2,4,6-trinitrotoluene (TNT), and (D) 3,4-methylenedioxymethamphetamine (MDMA). ........ 150
Figure 5.3 The vector φ(4) that defines the FIR filter for column compression. ........ 153
Figure 5.4 The time performance curve for RTSIMPLISMA without compression. ........ 154
Figure 5.5 The time performance curves for data acquisition only and real-time WC2-RTSIMPLISMA. ........ 155
Figure 5.6 IMS data set of blank trap disk on 3D surface plot. ........ 157
Figure 5.7 The average IMS spectra for three replicates of IMS measurement of a blank trap disk and the average spectrum of the data set that only has RIP. ........ 158
Figure 5.8 Real-time WC2-RTSIMPLISMA resolved spectra for the data sets from the three replicates of blank trap disk experiments. ........ 160
Figure 5.9 The variation profiles corresponding to different drift time for the raw data set of blank 3 in Figure 5.7. ........ 161
Figure 5.10 IMS data set of 3.6×10^1 pg TNT on 3D surface plot. ........ 162
Figure 5.11 The average IMS spectra for two replicates of IMS measurement of 3.6×10^1 pg TNT and the average spectrum of the data set that only has RIP. ........ 163
Figure 5.12 Real-time WC2-RTSIMPLISMA resolved spectra for the two replicate data sets of 36 pg TNT. ........ 165
Figure 5.13 IMS data set of explosives (urea nitrate, RDX, and TNT) on a 3D surface plot. ........ 166
Figure 5.14 Real-time WC2-RTSIMPLISMA resolved concentration profiles at the final point (258.3 s). ........ 168
Figure 5.15 Real-time WC2-RTSIMPLISMA resolved component spectra at different acquisition time (Part I). ........ 169
Figure 5.16 Real-time WC2-RTSIMPLISMA resolved component spectra at different acquisition time (Part II). (177.1 - 249.0 s) ........ 170
Figure 5.17 Real-time WC2-RTSIMPLISMA resolved component spectra at different acquisition time (Part III). (249.1 - 258.3 s) ........ 171
Figure 5.18 SIMPLISMA-det resolved concentration profiles from the raw IMS data of explosives. ........ 173
Figure 5.19 SIMPLISMA-det resolved component spectra from the raw IMS data set of explosives. ........ 174
Figure 5.20 Ion mobility spectra of 1 µL ethanol solution with 1.0 × 10^2 ng MDMA, 1.0 × 10^2 ng cocaine, and 2.0 × 10^2 ng heroin, respectively, collected on the ITEMISER ITMS in positive ion mode. ........ 176
Figure 5.21 Real-time WC2-RTSIMPLISMA resolved component spectra from the data set of drug mixture A. ........ 178
Figure 5.22 Real-time WC2-RTSIMPLISMA resolved component spectra for drug mixture A with internal reference spectra of cocaine, MDMA, and heroin. ........ 179
Figure 5.23 Real-time WC2-RTSIMPLISMA resolved concentration profiles for drug mixture A with internal reference spectra of cocaine, MDMA, and heroin. ........ 181
Figure 5.24 Real-time WC2-RTSIMPLISMA resolved component spectra for drug mixture B. ........ 183
Figure 5.25 Real-time WC2-RTSIMPLISMA resolved component spectra for drug mixture B with appended IMS reference spectra of cocaine, MDMA, and heroin. ........ 184
Figure 5.26 Real-time WC2-RTSIMPLISMA resolved concentration profiles for drug mixture B with appended reference IMS spectra of cocaine, MDMA, and heroin. ........ 185
List of Abbreviations

Recommended publications
  • Curriculum Vitae

    Curriculum Vitae

    Jeff Goldsmith 722 W 168th Street, 6th floor New York, NY 10032 jeff[email protected] Date of Preparation April 20, 2021 Academic Appointments / Work Experience 06/2018{Present Department of Biostatistics Mailman School of Public Health, Columbia University Associate Professor 06/2012{05/2018 Department of Biostatistics Mailman School of Public Health, Columbia University Assistant Professor 01/2009{12/2010 Department of Biostatistics Bloomberg School of Public Health, Johns Hopkins University Research Assistant (R01NS060910) 01/2008{12/2009 Department of Biostatistics Bloomberg School of Public Health, Johns Hopkins University Research Assistant (U19 AI060614 and U19 AI082637) Education 08/2007{05/2012 Johns Hopkins University PhD in Biostatistics, May 2012 Thesis: Statistical Methods for Cross-sectional and Longitudinal Functional Observations Advisors: Ciprian Crainiceanu and Brian Caffo 08/2003{05/2007 Dickinson College BS in Mathematics, May 2007 Jeff Goldsmith 2 Honors 04/2021 Dean's Excellence in Leadership Award 03/2021 COPSS Leadership Academy For Emerging Leaders in Statistics 06/2017 Tow Faculty Scholar 01/2016 Public Voices Fellow 10/2013 Calderone Junior Faculty Prize 05/2012 ASA Biometrics Section Travel Award 12/2011 Invited Paper in \Highlights of JCGS" Session at Interface 05/2011 Margaret Merrell Award for Outstanding Research by a Biostatistics Doc- toral Student 05/2011 School-wide Teaching Assistant Recognition Award 05/2011 Helen Abbey Award for Excellence in Teaching 03/2011 ENAR Distinguished Student Paper Award 05/2010 Jane and Steve Dykacz Award for Outstanding Paper in Medical Statistics 05/2009 Nominated for School-wide Teaching Assistant Recognition Award 08/2007{05/2012 Sommer Scholar 05/2007 James Fowler Rusling Prize 05/2007 Lance E.
  • Multivariate Chemometrics As a Strategy to Predict the Allergenic Nature of Food Proteins

    Multivariate Chemometrics As a Strategy to Predict the Allergenic Nature of Food Proteins

    S S symmetry Article Multivariate Chemometrics as a Strategy to Predict the Allergenic Nature of Food Proteins Miroslava Nedyalkova 1 and Vasil Simeonov 2,* 1 Department of Inorganic Chemistry, Faculty of Chemistry and Pharmacy, University of Sofia, 1 James Bourchier Blvd., 1164 Sofia, Bulgaria; [email protected]fia.bg 2 Department of Analytical Chemistry, Faculty of Chemistry and Pharmacy, University of Sofia, 1 James Bourchier Blvd., 1164 Sofia, Bulgaria * Correspondence: [email protected]fia.bg Received: 3 September 2020; Accepted: 21 September 2020; Published: 29 September 2020 Abstract: The purpose of the present study is to develop a simple method for the classification of food proteins with respect to their allerginicity. The methods applied to solve the problem are well-known multivariate statistical approaches (hierarchical and non-hierarchical cluster analysis, two-way clustering, principal components and factor analysis) being a substantial part of modern exploratory data analysis (chemometrics). The methods were applied to a data set consisting of 18 food proteins (allergenic and non-allergenic). The results obtained convincingly showed that a successful separation of the two types of food proteins could be easily achieved with the selection of simple and accessible physicochemical and structural descriptors. The results from the present study could be of significant importance for distinguishing allergenic from non-allergenic food proteins without engaging complicated software methods and resources. The present study corresponds entirely to the concept of the journal and of the Special issue for searching of advanced chemometric strategies in solving structural problems of biomolecules. Keywords: food proteins; allergenicity; multivariate statistics; structural and physicochemical descriptors; classification 1.
  • Copula-Based Analysis of Dependent Data with Censoring and Zero Inflation

    Copula-Based Analysis of Dependent Data with Censoring and Zero Inflation

    Copula-Based Analysis of Dependent Data with Censoring and Zero Inflation by Fuyuan Li B.S. in Telecommunication Engineering, May 2012, Beijing University of Technology M.S. in Statistics, May 2014, The George Washington University A Dissertation submitted to The Faculty of The Columbian College of Arts and Sciences of The George Washington University in partial satisfaction of the requirements for the degree of Doctor of Philosophy January 10, 2019 Dissertation directed by Huixia J. Wang Professor of Statistics The Columbian College of Arts and Sciences of The George Washington University certifies that Fuyuan Li has passed the Final Examination for the degree of Doctor of Philosophy as of December 7, 2018. This is the final and approved form of the dissertation. Copula-Based Analysis of Dependent Data with Censoring and Zero Inflation Fuyuan Li Dissertation Research Committee: Huixia J. Wang, Professor of Statistics, Dissertation Director Tapan K. Nayak, Professor of Statistics, Committee Member Reza Modarres, Professor of Statistics, Committee Member ii c Copyright 2019 by Fuyuan Li All rights reserved iii Acknowledgments This work would not have been possible without the financial support of the National Science Foundation grant DMS-1525692, and the King Abdullah University of Science and Technology office of Sponsored Research award OSR-2015-CRG4-2582. I am grateful to all of those with whom I have had the pleasure to work during this and other related projects. Each of the members of my Dissertation Committee has provided me extensive personal and professional guidance and taught me a great deal about both scientific research and life in general.
  • Principal Component Analysis

    Tutorial. Chemometrics and Intelligent Laboratory Systems, 2 (1987) 37-52. Elsevier Science Publishers B.V., Amsterdam. Printed in The Netherlands. Principal Component Analysis. SVANTE WOLD *, Research Group for Chemometrics, Institute of Chemistry, Umeå University, S 901 87 Umeå (Sweden); KIM ESBENSEN and PAUL GELADI, Norwegian Computing Center, P.B. 335 Blindern, N 0314 Oslo 3 (Norway) and Research Group for Chemometrics, Institute of Chemistry, Umeå University, S 901 87 Umeå (Sweden). CONTENTS: 1. Introduction: history of principal component analysis. 2. Problem definition for multivariate data. 3. A chemical example. 4. Geometric interpretation of principal component analysis. 5. Mathematical definition of principal component analysis. 6. Statistics; how to use the residuals. 7. Plots. 8. Applications of principal component analysis. 8.1 Overview (plots) of any data table. 8.2 Dimensionality reduction. 8.3 Similarity models. 9. Data pre-treatment. 10. Rank, or dimensionality, of a principal components model.
  • Chemometrics in Europe: Selected Results

    Journal of Research of the National Bureau of Standards, Volume 93, Number 3, May-June 1988. Accuracy in Trace Analysis. $[C] = \{[E]'[E]\}^{-1}[E]'[A]$ to recalculate the concentration profiles. This process (truncation, normalization and pseudoinverse followed by pseudoinverse) was repeated until no further refinement occurred. The concentration profiles and spectra of the three unknown components of stearyl alcohol in carbon tetrachloride obtained in this manner were found to make chemical sense. This EFA procedure, unlike others, was successful in extracting concentration profiles from situations where one component profile was completely encompassed underneath another component profile. References: [1] Malinowski, E. R., and Howery, D. G., Factor Analysis in Chemistry, Wiley Interscience, New York (1980). [2] Gemperline, P. G., J. Chem. Inf. Comput. Sci. 24, 206 (1984). [3] Vandeginste, B. G. M., Derks, W., and Kateman, G., Anal. Chim. Acta 173, 253 (1985). [4] Gampp, H., Maeder, M., Meyer, C. ...tions to the field. Obviously, this is not the place to offer a review on chemometrics, let alone one that is restricted to a continent. The definition of chemometrics [1] comprises three distinct areas characterized by the key words "optimal measurements," "maximum chemical information" and, for analytical chemistry, something that sounds like the synopsis of the other two: "optimal way [to obtain] relevant information." Information Theory. Eckschlager and Stepanek [2-5] pioneered the adaption and application of information theory in analytical chemistry. One of their important results gives the information gain of a quantitative determination [5]: $I(q\|p) = \ln\frac{(x_2 - x_1)\sqrt{n_A}}{s\sqrt{2\pi e}}$ (1)
  • Chemometrics: Theory and Application

    Chapter 7. Chemometrics: Theory and Application. Hilton Túlio Lima dos Santos, André Maurício de Oliveira, Patrícia Gontijo de Melo, Wagner Freitas and Ana Paula Rodrigues de Freitas. http://dx.doi.org/10.5772/53866 1. Introduction. This chapter presents chemometrics as an important area of chemistry that helps in working with the large amounts of data obtained in analysis. The term chemometrics was introduced in the early 1970s by Svante Wold (Sweden) and Bruce Kowalski (USA). According to the International Chemometrics Society, founded in 1974, the accepted definition of chemometrics is the chemical discipline that uses mathematical and statistical methods (i) to design or select optimal measurement procedures and experiments and (ii) to provide maximum chemical information by analyzing chemical data [1]. When a study involves many variables, it becomes a multivariate analysis, so it is necessary to build a data matrix, and it is normal to pre-process it. Pre-processing is a procedure that adjusts factors measured in different units to values that give each factor the same chance to contribute to the model. The next step is usually a pattern recognition method, used to find similarities in the data. Commonly used methods fall into an unsupervised group, which includes HCA and PCA, and a supervised group, which includes KNN. Hierarchical Cluster Analysis (HCA) is used to examine the distances among samples in a two-dimensional plot (dendrogram) and to cluster similar samples (Figure 1).
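    The pre-processing step described in this excerpt can be illustrated with a minimal sketch. It assumes autoscaling (mean-centering each variable and dividing by its standard deviation) as the pre-processing, which is one common choice and not necessarily the one used in the chapter, and it computes the pairwise Euclidean distances that a hierarchical clustering would start from. The data values and the names `autoscale`, `euclidean` and `samples` are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def autoscale(rows):
    """Mean-center each column and scale it to unit (sample) standard deviation."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [stdev(col) for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]


def euclidean(a, b):
    """Euclidean distance between two samples (rows)."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical data: 4 samples x 2 variables recorded in very different units.
samples = [
    [1.0, 1000.0],
    [2.0, 1100.0],
    [1.5, 5000.0],
    [2.5, 5200.0],
]

scaled = autoscale(samples)

# Pairwise distance matrix: the usual starting point for HCA.
distances = [[euclidean(a, b) for b in scaled] for a in scaled]
```

    Without scaling, the second variable would dominate every distance simply because its numbers are larger; after autoscaling both variables contribute on the same footing.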
  • A Synergistic Use of Chemometrics and Deep Learning Improved the Predictive Performance of Near-Infrared Spectroscopy Models for Dry Matter Prediction in Mango Fruit

    Chemometrics and Intelligent Laboratory Systems 212 (2021) 104287. A synergistic use of chemometrics and deep learning improved the predictive performance of near-infrared spectroscopy models for dry matter prediction in mango fruit. Puneet Mishra (a,*), Dario Passos (b). (a) Wageningen Food and Biobased Research, Bornse Weilanden 9, P.O. Box 17, 6700AA, Wageningen, the Netherlands. (b) CEOT, Universidade do Algarve, Campus de Gambelas, FCT Ed.2, 8005-189 Faro, Portugal. Keywords: 1D-CNN; Neural networks; Fruit quality; Artificial intelligence; Ensemble pre-processing. Abstract: This study provides an innovative approach to improve deep learning (DL) models for spectral data processing with the use of chemometrics knowledge. The technique proposes pre-filtering the outliers using the Hotelling's T² and Q statistics obtained with partial least-square (PLS) analysis and spectral data augmentation in the variable domain to improve the predictive performance of DL models made on spectral data. The data augmentation is carried out by stacking the same data pre-processed with several pre-processing techniques such as standard normal variate, 1st derivatives, 2nd derivatives and their combinations. The performance of the approach is demonstrated on a real near-infrared (NIR) data set related to dry matter (DM) prediction in mango fruit. The data set consisted of a total of 11,961 spectra and reference DM measurements. The results showed that removing the outliers and augmenting spectral data improved the predictive performance of DL models. Furthermore, this innovative approach not only improved DL models but attained the lowest root mean squared error of prediction (RMSEP) on the mango data set, i.e., 0.79% compared to the best known RMSEP of 0.84%.
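    The augmentation idea in this abstract (stacking the same spectrum after several pre-processing steps) can be sketched as follows. This is only an illustration under stated assumptions: derivatives are approximated by simple finite differences, whereas published NIR work often uses smoothed derivatives, and the abstract does not say which was used. The spectrum values and the names `snv`, `first_difference` and `augment` are hypothetical.

```python
from statistics import mean, stdev


def snv(spectrum):
    """Standard normal variate: center and scale one spectrum by its own mean and std."""
    m, s = mean(spectrum), stdev(spectrum)
    return [(v - m) / s for v in spectrum]


def first_difference(spectrum):
    """Finite-difference stand-in for a 1st derivative (one point shorter)."""
    return [b - a for a, b in zip(spectrum, spectrum[1:])]


def augment(spectrum):
    """Stack raw, SNV, 1st-difference and 2nd-difference versions along the variable axis."""
    d1 = first_difference(spectrum)
    d2 = first_difference(d1)
    return list(spectrum) + snv(spectrum) + d1 + d2


# Hypothetical absorbance values at 6 wavelengths.
spectrum = [0.42, 0.45, 0.51, 0.60, 0.58, 0.49]
augmented = augment(spectrum)
```

    Each sample keeps a single row, but that row now carries several views of the same measurement, which is what "augmentation in the variable domain" refers to.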
  • Philip Ernst: Tweedie Award

    IMS Bulletin, Volume 47, Issue 3, April/May 2018. Philip Ernst: Tweedie Award. The Institute of Mathematical Statistics has selected Philip A. Ernst as the winner of this year's Tweedie New Researcher Award. Dr. Ernst received his PhD in 2014 from the Wharton School of the University of Pennsylvania and is now an Assistant Professor of Statistics at Rice University: http://www.stat.rice.edu/~pe6/. Philip's research interests include exact distribution theory, stochastic control, optimal stopping, mathematical finance and statistical inference for stochastic processes. The IMS Travel Awards Committee selected Philip "for his fundamental contributions to exact distribution theory, in particular for his elegant resolution of the Yule's nonsense correlation problem, and for his development of novel stochastic control techniques for computing the value of insider information in mathematical finance problems." Philip Ernst will present the Tweedie New Researcher Invited Lecture at the IMS New Researchers Conference, held this year at Simon Fraser University from July … CONTENTS: Tweedie Award winner. Members' news: Peter Bühlmann, Peng Ding, Peter Diggle, Jun Liu, Larry Brown, Judea Pearl. Medallion Lecture previews: Jean Bertoin, Davar Khoshnevisan, Ming Yuan. Recent papers: Stochastic Systems; Probability Surveys. Journal News: Statistics Surveys; possible new Data Science journal? New Researcher Travel Awards; Student Puzzle 20. Obituaries: Walter Rosenkrantz, Herbert Heyer, Jørgen Hoffmann-Jørgensen, James Thompson,
  • Abstracts Chemometrics and Analytical Chemistry 2014

    Abstracts Chemometrics and Analytical Chemistry 2014 PL1 PERSPECTIVES ON THE INTERDISCIPLINARY NATURE OF CHEMOMETRICS AND THE FUTURE OF ITS IDENTITY AS A DISCIPLINE Paul J. Gemperline1, Maryann Cuellar1, and Paul Trevorrow2 1Department of Chemistry, East Carolina University, Greenville, NC 27858 2John Wiley & Sons Limited, The Atrium, Southern Gate, Chichester, West Sussex, United Kingdom. PO19 8SQ Chemometrics as an identifiable discipline is about 40 years old. It is characterized by its specialized jargon and distinctive mathematical methods. In the pioneering years it was identified as a sub discipline of analytical chemistry. It experienced rapid growth during the 1990’s and 2000’s, during which time publications grew at a linear rate while citations grew nearly exponentially, a dramatic indication of chemometrics’ growing impact in a broad range of multidisciplinary fields. By these measures, chemometrics has been highly successful as a discipline. However, as the use of data analytics has become ubiquitous in the past decade, is research in chemometrics at risk of becoming irrelevant? Has the pace of innovation and development of new chemometric methods stalled, or worse yet, is it in decline? Are there still new methods to be discovered and invented, or, as knowledge and expertise in mathematics and computational methods has risen to new levels throughout the world, have all novel and innovative mathematical tools been discovered? If chemometrics is to remain a viable discipline, where will the next innovations come from? This perspective will examine the impact of chemometrics in three specialized research areas as bellwether indicators of the discipline and its future, including work on use of Raman spectroscopy for monitoring bioprocesses.
  • Chemometric Study of the Correlation Between Human Exposure to Benzene and PAHs and Urinary Excretion of Oxidative Stress Biomarkers

    atmosphere Article. Chemometric Study of the Correlation between Human Exposure to Benzene and PAHs and Urinary Excretion of Oxidative Stress Biomarkers. Flavia Buonaurio (1), Enrico Paci (2), Daniela Pigini (2), Federico Marini (1), Lisa Bauleo (3), Carla Ancona (3) and Giovanna Tranfo (2,*). (1) Department of Analytical Chemistry, Sapienza University, Piazzale Aldo Moro, 5, 00185 Rome, Italy. (2) INAIL Research, Department of Occupational and Environmental Medicine, Epidemiology and Hygiene, Via di Fontana Candida 1, 00078 Monte Porzio Catone (RM), Italy. (3) Department of Epidemiology Lazio Regional Health Service, Via Cristoforo Colombo, 112, 00154 Rome, Italy. * Corresponding author. Received: 29 October 2020; Accepted: 8 December 2020; Published: 11 December 2020. Abstract: Urban air contains benzene and polycyclic aromatic hydrocarbons (PAHs) which have carcinogenic properties. The objective of this paper is to study the correlation of exposure biomarkers with biomarkers of nucleic acid oxidation, also considering smoking. In 322 subjects, seven urinary dose biomarkers were analyzed for benzene, pyrene, nitropyrene, benzo[a]pyrene, and naphthalene exposure, and four effect biomarkers for nucleic acid and protein oxidative stress. Chemometrics was applied in order to investigate the existence of a synergistic effect for the exposure to the mixture and the contribution of active smoking. There is a significant difference between nicotine, benzene and PAH exposure biomarker concentrations of smokers and non-smokers, but the difference is not statistically significant for oxidative stress biomarkers.
  • Sequential Truncation of R-Vine Copula Mixture Model for High-Dimensional Datasets

    Hindawi, International Journal of Mathematics and Mathematical Sciences, Volume 2021, Article ID 3214262, 14 pages, https://doi.org/10.1155/2021/3214262. Research Article. Sequential Truncation of R-Vine Copula Mixture Model for High-Dimensional Datasets. Fadhah Amer Alanazi, Prince Sultan University, Riyadh, Saudi Arabia. Correspondence should be addressed to Fadhah Amer Alanazi. Received 12 May 2021; Accepted 22 July 2021; Published 31 July 2021. Academic Editor: Fernando Bobillo. Copyright © 2021 Fadhah Amer Alanazi. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Uncovering hidden mixture dependencies among variables has been investigated in the literature using mixture R-vine copula models. They provide considerable flexibility for modeling multivariate data. As the dimensions increase, the number of the model parameters that need to be estimated is increased dramatically, which comes along with massive computational times and efforts. This situation becomes even much more complex and complicated in the regular vine copula mixture models. Incorporating the truncation method with a mixture of regular vine models will reduce the computation difficulty for the mixture-based models. In this paper, the tree-by-tree estimation mixture model is joined with the truncation method to reduce computational time and the number of parameters that need to be estimated in the mixture vine copula models. A simulation study and real data applications illustrated the performance of the method. In addition, the real data applications show the effect of the mixture components on the truncation level.
  • Chemometrics and Data Analysis

    Fundamentals of Chemometrics and Modeling. Dr. Tom Dearing, CPAC, University of Washington.

    Outline
    • Fundamentals of Chemometrics
      – Introduction to Chemometrics
      – Measurements
      – The Data Analysis Procedure
    • Basic Modeling
      – Principal Component Analysis
      – Scores and Loadings
    • Advanced Modeling
      – Partial Least Squares
      – Latent Variables
      – Scores and Loadings
      – Calibration and Validation
      – Prediction
    • Case Study

    Section 1: Through the looking glass…..

    Chemometrics
    • Chemometrics is: The science of extracting information from measurements made on chemical systems with the use of mathematical and statistical procedures.
    • Keywords and phrases: data analysis, data processing, univariate, multivariate, variance, modeling, scores, loadings, calibration and validations, predictions, real time decision making.

    Measurements
    [Figure: Near IR Tablet Data; signal intensity vs. wavenumber (cm-1)]
    [Figure: intensity (counts) vs. wavelength (nm)]
    • Measurements come in many different forms.
      – Spectroscopic: Near IR, Fluorescence, Raman.
      – Chromatographic: Gas Chromatography, HPLC.
      – Physical: Temperature, Pressure, Flow rate, Melting Points, Viscosity, Concentrations.
    • All measurements yield data.
    • NIR data set containing 255 spectra measured at 650 different wavenumbers has 165750 data points!!

    Two Types Of Data
    • Univariate
      – One variable to measure
      – One variable to predict
      – Typically select one wavelength and monitor change of absorbance over time.
      – Wavelength must not have contributions or overlapping from other peaks.
    • Multivariate
      – Multiple variables
      – Multiple predictions
      – Typically use entire spectra.
      – Allows investigation into the relationship between variables.
      – Allows revealing of latent variation within a set of spectra.

    Multivariate Analysis
    • Analysis performed on multiple sets of measurements, wavelengths, samples and data sets.
    • Analysis of variance and dependence between variables is crucial to multivariate analysis.
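    As a small worked illustration of the last two points, the sketch below checks the data-point count quoted in the slides (255 spectra × 650 wavenumbers) and computes Pearson correlations between the variables (columns) of a tiny spectra matrix. The absorbance values and the names `pearson`, `correlation_matrix` and `spectra` are hypothetical and chosen only to show dependence between variables; they are not taken from the slides.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_matrix(rows):
    """Correlation between every pair of variables (columns) of a samples x variables table."""
    columns = [list(col) for col in zip(*rows)]
    return [[pearson(a, b) for b in columns] for a in columns]


# Size of the NIR data set mentioned in the slides.
n_spectra, n_wavenumbers = 255, 650
n_points = n_spectra * n_wavenumbers  # 165750

# Hypothetical data: 5 spectra x 3 wavelengths. The second column rises
# linearly with the first, while the third tends to fall as the first rises.
spectra = [
    [0.10, 0.25, 0.90],
    [0.20, 0.45, 0.70],
    [0.30, 0.65, 0.80],
    [0.40, 0.85, 0.60],
    [0.50, 1.05, 0.50],
]
corr = correlation_matrix(spectra)
```

    Strongly correlated columns such as the first two carry largely the same information, which is the kind of dependence that multivariate methods like PCA are designed to exploit.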