REAL-TIME WAVELET COMPRESSION AND SELF-MODELING CURVE

RESOLUTION FOR ION MOBILITY SPECTROMETRY

A dissertation presented to

the faculty of

the College of Arts and Sciences of Ohio University

In partial fulfillment

of the requirements for the degree

Doctor of Philosophy

Guoxiang Chen

March 2003

This dissertation entitled

REAL-TIME WAVELET COMPRESSION AND SELF-MODELING CURVE

RESOLUTION FOR ION MOBILITY SPECTROMETRY

BY

GUOXIANG CHEN

has been approved for

the Department of Chemistry and Biochemistry

and the College of Arts and Sciences by

Peter de B. Harrington

Associate Professor of Chemistry and Biochemistry

Leslie A. Flemming

Dean, College of Arts and Sciences

CHEN, GUOXIANG. Ph.D. March 2003.

Real-Time Wavelet Compression and Self-Modeling Curve Resolution for Ion Mobility

Spectrometry (203 pp.)

Director of Dissertation: Peter de B. Harrington

Chemometrics has proven useful for solving chemistry problems. Most of the

chemometric methods are applied in post-run analyses, in which data are processed after

being collected and archived. However, in many applications, real-time processing is

required to obtain knowledge underlying complex chemical systems instantly. Moreover,

real-time chemometrics can eliminate the storage burden for large amounts of raw data

that occurs in post-run analyses. These attributes are important for the construction of

portable intelligent instruments.

Ion mobility spectrometry (IMS) furnishes inexpensive, sensitive, fast, and

portable sensors that afford a wide variety of potential applications. SIMPLe-to-use

Interactive Self-modeling Mixture Analysis (SIMPLISMA) is a self-modeling curve resolution method that has been demonstrated as an effective tool for enhancing IMS measurements. However, all of the previously reported studies have applied

SIMPLISMA as a post-run tool.

A modified SIMPLISMA algorithm, referred to as RTSIMPLISMA, was developed for modeling IMS data in real-time. The real-time algorithm can determine the number of components in the IMS data automatically. Resolved concentration and spectral profiles are simultaneously displayed on a virtual instrument while the data is collected from an ion mobility spectrometer.

The computational burden of real-time SIMPLISMA increases as the

number of collected spectra grows. A spectrum will not be acquired when the data

processing consumes too large a share of computer resources. To alleviate this problem, a

two-dimensional wavelet compression (WC2) was applied prior to RTSIMPLISMA

modeling. The optimal settings of WC2-RTSIMPLISMA for processing IMS data were

obtained, with which satisfactory models could be resolved when the data were compressed

to 1/256 of their original size.

A novel real-time WC2 has been developed to compress data as it is acquired from

IMS sensors. RTSIMPLISMA was applied to the WC2-processed data in real time, which

significantly accelerated the real-time modeling. An integrated software package was developed to implement the real-time WC2-RTSIMPLISMA algorithm and

used for the rapid processing of the IMS data of drugs and explosives. The real-time

algorithm was able to reveal very small features in the IMS data and rapidly model the dynamic changes during the course of an IMS measurement.

Approved: Peter de B. Harrington

Associate Professor of Chemistry and Biochemistry


Acknowledgments

I would like to thank my research advisor, Dr. Peter de B. Harrington, for his invaluable support and guidance during my stay at Ohio University. This dissertation could not have been written and the research could not have been accomplished without his help. I would also like to thank my dissertation committee members, Drs. Gary W.

Small, Howard D. Dewald, Martin T. Tuck, Wen-jia R. Chen and Xiaozhuo Chen, for their great help in my academic progress and research pursuits. Paul Schmittauer is thanked for his assistance with electronic techniques.

I would like to thank the Department of Chemistry and Biochemistry at Ohio

University for offering me the opportunity to conduct my doctoral research. The Center for Intelligent Chemical Instrumentation at Ohio University is thanked for supporting the conference trips. Ohio University is thanked for the support of the Donald R. Clippinger Fellowship. The US Army ERDEC, GeoCenters, and Ion Track Instruments are thanked for the partial support of this research. Metara Inc. is thanked for supporting me in writing this dissertation while working. Dr. Willem Windig at Eigenvector Research Inc. is thanked for his permission to use the spectral data files and MATLAB scripts.

I would also like to thank the members of Dr. Harrington’s research group for their helpful suggestions. Special thanks are given to Libo Cao for her consistent help over the years. Dr. Tricia L. Buxton Derringer is also thanked for the bacterial data set.

I would like to thank Zhuo Chen for her love, encouragement, and valuable support.

I would like to thank my father and the other family members who are always caring and supportive in my life.

Table of Contents

Abstract

Acknowledgments

List of Tables

List of Figures

List of Abbreviations

Chapter 1 Introduction

1.1 General Statement

1.2 Ion Mobility Spectrometry

1.3 Self-Modeling Curve Resolution

1.4 Data Compression

1.5 The Research Objectives

Chapter 2 SIMPLISMA and Wavelet Transform

2.1 SIMPLISMA

2.2 Wavelet Transform

Chapter 3 Real-Time Self-Modeling Mixture Analysis

3.1 Introduction

3.2 Theory

3.3 Experimental Section

3.4 Results and Discussion

3.5 Conclusions

Chapter 4 RTSIMPLISMA Applied to Two-Dimensional Wavelet Compressed Ion Mobility Data

4.1 Introduction

4.2 Theory

4.3 Experimental Section

4.4 Results and Discussion

4.4.1 Conventional SIMPLISMA Models

4.4.2 Optimization of WC2-RTSIMPLISMA

4.4.3 RTSIMPLISMA Applied to Windig Standard Data Sets

4.5 Conclusions

Chapter 5 Real-Time Two-Dimensional Wavelet Compression and Its Application to Real-Time Self-Modeling of IMS Data

5.1 Introduction

5.2 Theory

5.3 Experimental Section

5.4 Results and Discussion

5.4.1 Time Performance of Real-Time WC2-RTSIMPLISMA

5.4.2 Enhanced IMS Measurement by Real-Time WC2-RTSIMPLISMA

5.4.3 Real-Time Self-Modeling of IMS Data of Explosives

5.4.4 Internal Reference Method for Real-Time WC2-RTSIMPLISMA

5.5 Conclusions

Chapter 6 Summary and Future Work

References

Appendix A: Publications

Appendix B: Presentations

Appendix C: MATLAB Scripts

List of Tables

Table 2.1 The concentrations of the compounds A and B during the reaction course.

Table 3.1 Time performances of different methods and batch number R for processing 550 spectra.

Table 4.1 Contribution of wavelet type and compression level to the variation of RRMSES and RRMSEC.

Table 4.2 Compression levels, compression factor (C.F.), percent correct n_c, average RRMSES, minimum RRMSES, and the corresponding wavelet type for different compression levels for drug data set.

Table 4.3 Compression levels, percent correct n_c, average RRMSES, minimum RRMSES, and the corresponding wavelet type for different compression levels for bacterial data set.

Table 5.1 Experimental setup for the data sets in Section 5.4.4. In the table, t_i is the time when the sample was inserted into the desorber; t_s is the time when the measurement stopped; n_s is the total number of spectra collected. Sample volume was 1 µL and the sample disk was removed at 5 s after it was inserted.

List of Figures

Figure 1.1 The schematic diagram of an ion mobility spectrometer.

Figure 2.1 The virtual spectra of 1 mM aqueous solution of the pure compound A (Panel A) and 1 mM aqueous solution of the pure compound B (Panel B).

Figure 2.2 The three-dimensional surface plot of the synthesized mixture spectra of the two-component virtual reaction system.

Figure 2.3 SIMPLISMA resolved spectra of compounds A (Panel A) and B (Panel B) with the number of components being predefined to two (α = 0.05).

Figure 2.4 SIMPLISMA resolved concentration profiles of compounds A (Panel A) and B (Panel B) with the number of components being predefined to two (α = 0.05).

Figure 2.5 SIMPLISMA model of the synthesized data set with the number of components being predefined to three (α = 0.05).

Figure 2.6 Schematic of multi-level operations of the pyramid WT algorithm with dyadic ...

Figure 2.7 Father and mother wavelets of daublet 4 (Panel A, four coefficients), daublet 14 (Panel B, 14 coefficients), coiflet 3 (Panel C, 18 coefficients), and symmlet 6 (Panel D, 12 coefficients).

Figure 2.8 Illustration of the multi-level operations of the pyramid algorithm for forward WT using daublet 14.

Figure 2.9 The enlarged view of the smooth and detail parts of WT spectrum at level 5 in Figure 2.8.

Figure 2.10 Reconstructed spectra from the smooth parts of forward WT with level 1 to 5 corresponding to Figure 2.8.

Figure 3.1 Structures for (A) diisopropyl methanephosphonate (DIMP), and (B) pinacolyl methyl phosphonofluoridate (soman).

Figure 3.2 The graphical user interface for real-time SIMPLISMA.

Figure 3.3 The 3D surface plot of the IMS data of ethanol.

Figure 3.4 RTSIMPLISMA resolved concentration profiles (Panel A) and component spectra (Panel B) for ethanol data.

Figure 3.5 The 3D surface plot of the IMS data of DIMP. The data set was acquired from the CAM in positive mode.

Figure 3.6 RTSIMPLISMA resolved concentration profiles (Panel A) and component spectra (Panel B) for DIMP data.

Figure 3.7 SIMPLISMA-det resolved concentration profiles (Panel A) and component spectra (Panel B) for DIMP data.

Figure 3.8 RTSIMPLISMA model after processing 25 spectra for DIMP data. Panel A presents the concentration profiles and Panel B presents the component spectra.

Figure 3.9 RTSIMPLISMA model after processing 40 spectra for DIMP data.

Figure 3.10 RTSIMPLISMA model after processing 135 spectra for DIMP data.

Figure 3.11 RTSIMPLISMA resolved model for DIMP data with NPV threshold β0 = 0.008.

Figure 3.12 RTSIMPLISMA model for DIMP data with NPV threshold β0 = 0.04.

Figure 3.13 RTSIMPLISMA model for DIMP data with NPV threshold β0 = 0.24.

Figure 3.14 Comparison of time performance for real-time implementation of SIMPLISMA-det, RTSIMPLISMA, and data acquisition only.

Figure 3.15 Effects on time performance by real-time implementation of RTSIMPLISMA for batches of R spectra.

Figure 4.1 Structures for (A) cocaine, and (B) heroin.

Figure 4.2 Schematic diagram of the implementation principle of the WC2-RTSIMPLISMA algorithm.

Figure 4.3 The cocaine-heroin data set comprised 1024 spectra on a 3D surface plot (Acquired from ITEMISER® ITMS in positive mode).

Figure 4.4 The TMAH-preprocessed Bacillus cereus data set comprised 1024 spectra on a 3D surface plot (Acquired from Barringer IONSCAN 350 spectrometer in positive mode).

Figure 4.5 Conventional SIMPLISMA model from the original cocaine-heroin data set (three-component model). (A) Concentration profiles. (B) Component spectra.

Figure 4.6 Conventional SIMPLISMA model from the Bacillus cereus data set (four-component model). (A) Concentration profiles. (B) Component spectra.

Figure 4.7 Relative purity curves of determinant-based SIMPLISMA for the drug and bacterial data sets.

Figure 4.8 Relative purity curves of Gram-Schmidt-based SIMPLISMA for the drug data set.

Figure 4.9 Relative purity curves of Gram-Schmidt-based SIMPLISMA for the bacterial data set.

Figure 4.10 Percent correct number of components with respect to the threshold ∆0.

Figure 4.11 The 4 × 4 daublet 14-daublet 4 compressed Bacillus cereus data set comprised of 32 × 64 points in a 3D surface plot.

Figure 4.12 Reconstructed RTSIMPLISMA model from the 4 × 4 daublet 14-daublet 4 compressed drug data set. (A) Concentration profiles. (B) Component spectra.

Figure 4.13 Reconstructed RTSIMPLISMA model from the 4 × 4 daublet 14-daublet 4 compressed bacterial data set. (A) Concentration profiles. (B) Component spectra.

Figure 4.14 The reconstructed data set from the 4 × 4 daublet 14-daublet 4 WC2-RTSIMPLISMA model from the bacterial data set.

Figure 4.15 RTSIMPLISMA relative purity curve for the Windig Raman data set. The transition point is highlighted.

Figure 4.16 RTSIMPLISMA resolved spectra for the Windig Raman data set.

Figure 4.17 Conventional SIMPLISMA (Panel A) and RTSIMPLISMA (Panel B) resolved concentration profiles for the Windig Raman data set.

Figure 4.18 Conventional SIMPLISMA resolved spectra for the Windig Raman data set (α = 0.03).

Figure 4.19 RTSIMPLISMA relative purity curve for the Windig FTIR microscopy data set.

Figure 4.20 RTSIMPLISMA resolved spectra for the Windig FTIR microscopy data set.

Figure 4.21 Conventional SIMPLISMA (Panel A) and RTSIMPLISMA (Panel B) resolved concentration profiles for the Windig FTIR microscopy data set.

Figure 4.22 Conventional SIMPLISMA resolved spectra for the Windig FTIR microscopy data set (α = 0.03).

Figure 4.23 RTSIMPLISMA relative purity curve for the Windig NIR data set.

Figure 4.24 Resolved spectra for the Windig NIR data set with conventional SIMPLISMA applied on the positive part of the inverted second derivative data set (α = 0.1).

Figure 4.25 RTSIMPLISMA resolved spectra for the Windig NIR data set.

Figure 4.26 RTSIMPLISMA relative purity curve for the Windig time resolved mass spectrometry data set.

Figure 4.27 Reference spectra for the three photographic color coupling compounds in the Windig time resolved data set.

Figure 4.28 TSIMPLISMA resolved spectra for the Windig time resolved mass spectrometry data set (α = 0.03).

Figure 4.29 RTSIMPLISMA resolved spectra for the Windig time resolved mass spectrometry data set.

Figure 4.30 TSIMPLISMA (Panel A) and RTSIMPLISMA (Panel B) resolved concentration profiles for the Windig time resolved mass spectrometry data set.

Figure 5.1 Circular buffer. As a new point is received, it is placed into the memory pointed to by pointer Pn. The start position of the data to be processed is located by Pp.

Figure 5.2 Structures for (A) urea nitrate, (B) cyclotrimethylenetrinitramine (RDX), (C) 2,4,6-trinitrotoluene (TNT), and (D) 3,4-methylenedioxymethamphetamine (MDMA).

Figure 5.3 The vector φ(4) that defines the FIR filter for column compression.

Figure 5.4 The time performance curve for RTSIMPLISMA without compression.

Figure 5.5 The time performance curves for data acquisition only and real-time WC2-RTSIMPLISMA.

Figure 5.6 IMS data set of blank trap disk on 3D surface plot.

Figure 5.7 The average IMS spectra for three replicates of IMS measurement of a blank trap disk and the average spectrum of the data set that only has RIP.

Figure 5.8 Real-time WC2-RTSIMPLISMA resolved spectra for the data sets from the three replicates of blank trap disk.

Figure 5.9 The variation profiles corresponding to different drift time for the raw data set of blank 3 in Figure 5.7.

Figure 5.10 IMS data set of 3.6 × 10^1 pg TNT on 3D surface plot.

Figure 5.11 The average IMS spectra for two replicates of IMS measurement of 3.6 × 10^1 pg TNT and the average spectrum of the data set that only has RIP.

Figure 5.12 Real-time WC2-RTSIMPLISMA resolved spectra for the two replicate data sets of 36 pg TNT.

Figure 5.13 IMS data set of explosives (urea nitrate, RDX, and TNT) on a 3D surface plot.

Figure 5.14 Real-time WC2-RTSIMPLISMA resolved concentration profiles at the final point (258.3 s).

Figure 5.15 Real-time WC2-RTSIMPLISMA resolved component spectra at different acquisition time (Part I).

Figure 5.16 Real-time WC2-RTSIMPLISMA resolved component spectra at different acquisition time (Part II). (177.1 - 249.0 s)

Figure 5.17 Real-time WC2-RTSIMPLISMA resolved component spectra at different acquisition time (Part III). (249.1 - 258.3 s)

Figure 5.18 SIMPLISMA-det resolved concentration profiles from the raw IMS data of explosives.

Figure 5.19 SIMPLISMA-det resolved component spectra from the raw IMS data set of explosives.

Figure 5.20 Ion mobility spectra of 1 µL ethanol solution with 1.0 × 10^2 ng MDMA, 1.0 × 10^2 ng cocaine, and 2.0 × 10^2 ng heroin, respectively, collected on the ITEMISER ITMS in positive ion mode.

Figure 5.21 Real-time WC2-RTSIMPLISMA resolved component spectra from the data set of drug mixture A.

Figure 5.22 Real-time WC2-RTSIMPLISMA resolved component spectra for drug mixture A with internal reference spectra of cocaine, MDMA, and heroin.

Figure 5.23 Real-time WC2-RTSIMPLISMA resolved concentration profiles for drug mixture A with internal reference spectra of cocaine, MDMA, and heroin.

Figure 5.24 Real-time WC2-RTSIMPLISMA resolved component spectra for drug mixture B.

Figure 5.25 Real-time WC2-RTSIMPLISMA resolved component spectra for drug mixture B with appended IMS reference spectra of cocaine, MDMA, and heroin.

Figure 5.26 Real-time WC2-RTSIMPLISMA resolved concentration profiles for drug mixture B with appended reference IMS spectra of cocaine, MDMA, and heroin.

List of Abbreviations

2D ...... two-dimensional

3D ...... three-dimensional

ALS ...... alternating least squares

ANOVA ...... analysis of variance

APCI ...... atmospheric-pressure chemical ionization

CAM ...... chemical agent monitor

C.F. ...... compression factor

CIN ...... code interface node

COO ...... correlation around the origin

CWT ...... continuous wavelet transform

DIMP ...... diisopropyl methanephosphonate

DWT ...... discrete wavelet transform

DSP ...... digital signal processing

EFA ...... evolving factor analysis

FAME ...... fatty acid methyl ester

FIR ...... finite impulse response

FT ...... Fourier transform

FTIR ...... Fourier transform infrared

GCIN ...... get chemical information now

HELP ...... heuristic evolving latent projection

IMS ...... ion mobility spectrometry

ITTFA ...... iterative target testing factor analysis

K0 ...... reduced mobility

LabVIEW ...... Laboratory Virtual Instrument Engineering Workbench

MDMA ...... 3,4-methylenedioxymethamphetamine

MIA ...... multivariate image analysis

MRA ...... multiresolution analysis

NIR ...... near infrared

NPV ...... new pure variable

OPA ...... orthogonal projection analysis

PCA ...... principal component analysis

PMF ...... positive matrix factorization

PV ...... pure variable

QMF ...... quadrature mirror filters

RDX ...... research department explosive

RIP ...... reactant ion peak

RRMSE ...... relative root-mean-square error

RRMSEC ...... RRMSE of RTSIMPLISMA concentration profiles

RRMSES ...... RRMSE of RTSIMPLISMA spectra

RRSSQ ...... relative root of the sum of squares

RTSIMPLISMA ...... real-time SIMPLISMA

SFA ...... subwindow factor analysis

SIMPLISMA ...... SIMPLe-to-use Interactive Self-modeling Mixture Analysis

SIMPLISMA-det ...... determinant-based SIMPLISMA

SIMPLISMA-gs ...... SIMPLISMA based on Gram-Schmidt method

SMCR ...... self-modeling curve resolution

TMAH ...... tetramethylammonium hydroxide

TMOS ...... tetramethyl orthosilicate

TSIMPLISMA ...... transpose SIMPLISMA

TNT ...... 2,4,6-trinitrotoluene

UV ...... ultraviolet

VI ...... virtual instrument

WC ...... wavelet compression

WC2 ...... two-dimensional wavelet compression

WFA ...... window factor analysis

WT ...... wavelet transform

Chapter 1 Introduction

1.1 General Statement

Chemometrics has proven an invaluable aid for solving chemical problems since the inception of the field in its modern form in the 1970s. Optimal experimental design and useful information about chemical systems can be obtained by using chemometric methods.1 Chemometrics has capitalized on the advances in mathematics, computer technology, and other chemical disciplines.

Tremendous advances have occurred in the computer industry in the past two decades. Hardware has shrunk in physical size while growing in power. Modern chemical instruments that benefit from these advances can generate larger amounts of data, both faster and over longer durations. The most advanced waveform digitizer cards can operate at sampling rates as high as billions of points per second, which provides very high resolution for measured signals but also produces very large data sets.2 On the other hand, advanced computer hardware and software also give chemometricians greater scope to explore the information underlying more complex chemical systems. As a result, high-dimensional data have become common in analytical chemistry.3

Currently, most chemometric methods are applied in post-run analyses, in which

data processing is accomplished after the data has been collected and archived. Real-time

chemometric methods are desired for simultaneous modeling of on-line data as it is

acquired. Instant real-time information enables one to monitor the chemical system and control the process. This capability is important for intelligent instrumentation that may need to tune the instrument or respond to the measurement. Temporal information adds a new dimension that can simplify complex systems and overcome instrumental limitations.4 Real-time analysis is also desired for portable instruments, which may have

limited storage capacities.

In general, real-time chemometrics involves data acquisition, data analysis, and

data presentation that are integrated towards a complete solution. Unlike the offline

chemometrics that mainly focuses on the effectiveness of a method, real-time

chemometrics has to find a compromise between effectiveness and efficiency. Usually,

data analysis is a crucial factor that depends on the complexity of the underlying

algorithm. Real-time calculations must be computationally efficient because the

calculations must be completed in the free time between spectral acquisitions. For

example, in ion mobility spectrometry (IMS) measurements the spectra are generated

at a rate between 10 and 30 Hz, depending on the instrument. At the end of each

spectrum there is a 5 to 10 ms window before the next spectrum is generated. As the

number of collected spectra grows, the computational burden of modeling the

changes in the spectra with time eventually exceeds this free time window, and

a spectrum will not be acquired, which may result in the loss of chemical information.
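The timing constraint described above can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes a hypothetical cost model in which the per-spectrum modeling time grows linearly with the number of spectra collected so far. The function names and the constants `base_ms` and `per_spectrum_ms` are assumptions, not measured values from this work.

```python
def spectrum_period_ms(rate_hz: float) -> float:
    """Time between the starts of consecutive spectra, in milliseconds."""
    return 1000.0 / rate_hz


def first_overrun(free_window_ms: float, base_ms: float,
                  per_spectrum_ms: float, max_spectra: int = 100_000):
    """Return the index of the first spectrum whose processing time exceeds
    the free window, or None if no overrun occurs within max_spectra.

    Assumed (hypothetical) cost model: cost(n) = base_ms + per_spectrum_ms * n.
    """
    for n in range(1, max_spectra + 1):
        if base_ms + per_spectrum_ms * n > free_window_ms:
            return n
    return None


# Acquisition rates quoted in the text: 10 to 30 Hz.
for rate in (10.0, 30.0):
    print(f"{rate:.0f} Hz: one spectrum every {spectrum_period_ms(rate):.1f} ms")

# With a 5 ms free window and the assumed cost model, find the first dropped spectrum.
print("first overrun at spectrum", first_overrun(5.0, 1.0, 0.25))
```

Under any cost model that grows with the number of spectra, the overrun point is finite, which is why the later chapters reduce the data size before modeling.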

Several strategies could be used to alleviate the computational constraints on data processing algorithms. The first is to optimize the chemometric algorithm with respect to sequential processing of data. For instance, recursive implementations can lower computational overhead.5 Another tactic is to shift to more powerful numerical processors and faster storage, although hardware improvements often increase the cost of the system. An alternative approach is to reduce the data size and complexity. Simple reduction of data size by undersampling or thresholding discards useful information. Data compression has attracted renewed interest because it can reduce the data size while retaining the important information in the compressed domain.6
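As a generic illustration of why recursive implementations lower overhead, the sketch below maintains the per-channel mean and variance of all spectra seen so far using Welford's update, so each new spectrum costs a fixed amount of work regardless of how many came before. This is not the recursive SIMPLISMA of reference 5; the class name `RunningStats` and its interface are assumptions made for this example.

```python
import numpy as np


class RunningStats:
    """Per-channel mean and variance updated one spectrum at a time (Welford)."""

    def __init__(self, n_channels: int):
        self.count = 0
        self.mean = np.zeros(n_channels)
        self._m2 = np.zeros(n_channels)  # running sum of squared deviations

    def update(self, spectrum: np.ndarray) -> None:
        """Fold one new spectrum into the statistics in O(n_channels) time."""
        self.count += 1
        delta = spectrum - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (spectrum - self.mean)

    @property
    def variance(self) -> np.ndarray:
        """Population variance of each channel over the spectra seen so far."""
        if self.count == 0:
            return np.zeros_like(self._m2)
        return self._m2 / self.count
```

Recomputing the same statistics from scratch after every acquisition would instead cost time proportional to the number of stored spectra.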

1.2 Ion Mobility Spectrometry

Since its first introduction as plasma chromatography in the 1960s, ion mobility

spectrometry (IMS) has been widely used for the detection of chemical warfare agents,7, 8

environmental pollutants,9-11 explosives,12-15 drugs of abuse,16-18 inorganic substances,19

and bacteria.20-23 IMS offers the advantages of portability, low cost, high sensitivity, rapid

response, and real-time monitoring capabilities.24

The details of the operating principle of IMS can be found elsewhere.25-27 The principle of IMS is similar to that of time-of-flight (TOF) mass spectrometry. However, IMS devices have much smaller drift tubes that operate at atmospheric pressure. The schematic of an ion mobility spectrometer is given in Figure 1.1. The IMS device includes four components: an ion source region, an ion gate, a drift region, and a detector. Like most IMS devices, the three IMS devices that were used in this dissertation use a radioactive 63Ni source (a beta emitter) to initiate ionization. Other ionizing sources include photoionization,11 laser ionization,28 surface ionization,29 and electrospray.30

Figure 1.1 The schematic diagram of an ion mobility spectrometer.25 The sample is introduced from the sample inlet and ionized by the ionizing source. The ionized species are separated in the drift tube by different mobilities. The ion current is measured by the collector electrode. (Labeled parts: sample inlet, ionizing source, ion source region, repelling rings, ion gate, drift tube, drift rings, electric field, collector electrode.)

The ionizing source produces reactant ions in the ion source region through atmospheric-pressure chemical ionization (APCI). The reactant ions transfer charge to the analytes by competitive charge transfer reactions. The analyte ions are directed through the drift tube by an electric field when the ion gate is periodically opened by removing an opposing potential barrier for a short period of time. The ions travel with different mobilities, based largely on the size of the ion, and are detected at a collector electrode. The readout of the detector is synchronized to the gate pulse to yield ion mobility spectra as a plot of the ion current with respect to drift time.
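The link between drift time and mobility can be sketched numerically. The standard relations used here are not stated in this section and are given as background: the mobility is K = L²/(V·t_d) for drift length L, drift voltage V, and drift time t_d, and the reduced mobility K0 normalizes K to 760 Torr and 273.15 K. The function name and the example values are hypothetical.

```python
def reduced_mobility(drift_length_cm: float, voltage_v: float,
                     drift_time_s: float, pressure_torr: float,
                     temperature_k: float) -> float:
    """Reduced mobility K0 in cm^2 V^-1 s^-1.

    K  = L^2 / (V * t_d)                  (mobility at the operating conditions)
    K0 = K * (P / 760) * (273.15 / T)     (normalized to standard conditions)
    """
    mobility = drift_length_cm ** 2 / (voltage_v * drift_time_s)
    return mobility * (pressure_torr / 760.0) * (273.15 / temperature_k)


# Hypothetical drift tube: 10 cm long, 2000 V across it, ion arriving at 20 ms.
k0 = reduced_mobility(10.0, 2000.0, 0.020, 760.0, 273.15)
```

Because K0 corrects for pressure and temperature, it allows peaks recorded on different instruments or days to be compared on a common scale.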

The IMS devices can be operated in negative and positive modes. The form of the reactant ions is dependent on the instrument design. Usually, a dopant is added to increase the selectivity of the charge transfer reactions, which subsequently suppresses the signal from interferents and improves the selectivity of the instrument. For an IMS device without dopant operated in positive mode, the reactant ions would be protonated water clusters, $\mathrm{H^+\cdot(H_2O)}_n$. For the production of positive ions, the monomer ions can be formed for an analyte (A) by the proton transfer reaction (1.1):

$$\mathrm{H^+\cdot(H_2O)}_n + \mathrm{A} \rightarrow \mathrm{H^+\cdot A} + n\,\mathrm{H_2O} \tag{1.1}$$

Dimer ions can be formed at higher concentrations by the following reaction:

$$\mathrm{H^+\cdot A} + \mathrm{A} \rightarrow \mathrm{H^+\cdot A_2} \tag{1.2}$$

In negative mode, the reactant ions are $\mathrm{(H_2O)}_n\cdot\mathrm{O_2^-}$. Three possible reactions may occur to produce negative ions. These reactions include ion transfer (1.3), charge transfer (1.4), and dissociative charge transfer (1.5):

$$\mathrm{A} + \mathrm{(H_2O)}_n\cdot\mathrm{O_2^-} \rightarrow \mathrm{A\cdot O_2^-} + n\,\mathrm{H_2O} \tag{1.3}$$

$$\mathrm{A} + \mathrm{(H_2O)}_n\cdot\mathrm{O_2^-} \rightarrow \mathrm{A^-} + \mathrm{O_2} + n\,\mathrm{H_2O} \tag{1.4}$$

$$\mathrm{A{-}X} + \mathrm{(H_2O)}_n\cdot\mathrm{O_2^-} \rightarrow \mathrm{X^-} + \mathrm{A} + \mathrm{O_2} + n\,\mathrm{H_2O} \tag{1.5}$$

for which X is an electronegative moiety of the molecule, and $\mathrm{O_2^-}$ is superoxide.

Owing to the ionization chemistry described above, ion mobility spectra have some unique attributes compared to data from other spectrometric methods. Each ion mobility spectrum typically contains a background component in the form of a reactant ion peak (RIP). Each spectrum is closed in that the ion current (i.e., the sum of the peak intensities) integrates to a constant value. As a consequence, the intensities of the analyte ion peaks are inversely correlated with that of the RIP.

1.3 Self-Modeling Curve Resolution

In instrumental analysis, the conventional way to determine the spectral profiles of the pure components in a mixture and their contributions, usually designated as concentration profiles, is to separate the components before applying spectral analysis. The most commonly used tool is chromatography hyphenated with a spectrometer, such as liquid chromatography with mass spectrometry (LC/MS) and gas chromatography with Fourier transform infrared spectroscopy (GC/FTIR), for which pure standard samples and reference spectra are required.

As a complement to analytical separations, self-modeling curve resolution (SMCR) has been developed to mathematically determine the spectral and concentration profiles of the pure components in mixture spectra without the need for a priori knowledge or reference data of the components. Two popular aliases of SMCR are multivariate curve resolution and self-modeling mixture analysis. As a mathematical separator, SMCR usually takes as input a two-dimensional (2D) data matrix acquired from an analytical instrument. The rows of the matrix are usually a series of spectra, and the columns correspond to measurement resolution elements, such as wavelengths or drift times. SMCR is able to resolve the pure component information when the signal response of each component varies differently during the course of the measurement. For example, suppose FTIR is used to continually monitor a reaction with two reactants and one product, which is a three-component mixture, and that the FTIR spectra of the components are different. A set of mixture spectra is obtained in which each spectrum corresponds to the overall contributions of all the components. As the reaction proceeds, the concentrations (contributions) of the reactants decrease and that of the product increases. Applying SMCR to the mixture spectra can determine the FTIR spectra and the concentration profiles of the reactants and product. In this example, the variations in the mixture spectra are associated with the reaction kinetics. Any physical or chemical factor that causes this sort of variation, such as temperature or pressure, might be used to produce such a data matrix. In other words, SMCR can be used to explore the pure components underlying any data matrix that has variations in two or more dimensions. Tutorials and reviews about SMCR are available.31, 32
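The bilinear structure that SMCR exploits can be sketched with synthetic data. The example below assumes a hypothetical two-component first-order conversion A → B with Gaussian-shaped pure spectra; all names and numerical values are illustrative. It builds the mixture matrix D = C Sᵀ (rows are spectra, columns are resolution elements) and shows that, once the pure spectra are known, the concentration profiles follow by least squares. A real SMCR method must estimate both factors from D alone, which this sketch does not attempt.

```python
import numpy as np


def gaussian(x: np.ndarray, center: float, width: float) -> np.ndarray:
    """Unit-height Gaussian band used as a stand-in for a pure spectrum."""
    return np.exp(-0.5 * ((x - center) / width) ** 2)


# Hypothetical axes: 40 time points and 200 spectral channels.
time = np.linspace(0.0, 10.0, 40)
channel = np.arange(200, dtype=float)

# Concentration profiles C (40 x 2) for a first-order conversion A -> B.
rate_constant = 0.4  # assumed value, for illustration only
conc_a = np.exp(-rate_constant * time)
C = np.column_stack([conc_a, 1.0 - conc_a])

# Pure spectra S (200 x 2): two partly overlapping bands.
S = np.column_stack([gaussian(channel, 80.0, 15.0),
                     gaussian(channel, 110.0, 15.0)])

# Mixture data: each row is the concentration-weighted sum of the pure spectra.
D = C @ S.T

# With S known, solve S @ C_est.T = D.T for the concentration profiles.
C_est = np.linalg.lstsq(S, D.T, rcond=None)[0].T
```

The rank of D equals the number of components whose profiles vary independently, which is why a data matrix with variation in both dimensions is required.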

The first SMCR method was reported by Lawton and Sylvestre in 1971.33 This pioneering method mathematically resolved a two-component system in UV-visible spectra of mixtures with varying concentration ratios. Many SMCR methods have been reported since then, including iterative target testing factor analysis (ITTFA),34 evolving factor analysis (EFA),35 window factor analysis (WFA),36 heuristic evolving latent projection (HELP),37 subwindow factor analysis (SFA),38 orthogonal projection analysis (OPA),39 SIMPLe-to-use Interactive Self-modeling Mixture Analysis (SIMPLISMA),40 positive matrix factorization (PMF),41 and alternating least squares (ALS).42

Since SIMPLISMA was developed by Windig and Guilment,40 the algorithm has been applied to the deconvolution of multiresponse fluorescence spectra,43 peak purity assay of high performance liquid chromatography with diode-array detection,44 spectral resolution of near infrared spectra,45 investigation of hydrogen peroxide activation by nitriles,46 study of dust emitted by lead and zinc smelters,47 investigation of protonation equilibria by diffuse reflectance UV-visible spectra,48 and processing nuclear magnetic resonance (NMR) images.32 Windig has demonstrated the applications of SIMPLISMA to spectral data files of Raman spectra, FTIR microscopy spectra, near-infrared (NIR) spectra, time resolved mass spectra, and Raman imaging data.49, 50

SIMPLISMA has proven an effective and efficient tool for simplifying the intricate ion chemistry that may occur in IMS data. SIMPLISMA is beneficial with IMS because overlapping peaks may be resolved, and the selectivity and sensitivity of the measurement may be enhanced.4 The decomposition of the data into concentration profiles and spectra provides a concentration-independent representation of the spectral components. In addition, the spectra obtained from SIMPLISMA models may facilitate classification of complex models. Harrington et al. have shown that SIMPLISMA could be a useful tool for characterizing mixed analytes with IMS.4 Rauch et al. developed a recursive SIMPLISMA that was faster than conventional SIMPLISMA for IMS data.5 Utilizing SIMPLISMA with thermal desorption IMS, Reese and Harrington resolved methamphetamine hydrochloride from cigarette smoke residue.51 Shaw and Harrington used SIMPLISMA to detect methamphetamine in the presence of nicotine.52 Buxton and Harrington applied SIMPLISMA to the identification of explosives53 and bacteria23 by IMS. Patchett et al. used SIMPLISMA as an aid to identify γ-hydroxybutyrate, a popular drug of abuse, which was spiked in beer, tap water, and a soft drink.17

1.4 Data Compression

Recently, data compression applications have been revitalized in analytical chemistry.54-57 Data compression is advantageous because it reduces the size of the data and the computational burden of processing them without losing important chemical information. Instead, noise is lost during compression, and processing of compressed data is much more efficient. There are several areas of analytical chemistry for which compression has become important. High-resolution and multidimensional measurements can generate very large data sets that are cumbersome to manipulate even on state-of-the-art workstations. Miniaturized sensors may be much smaller than handheld computers or portable computer equipment and require embedded processors with limited memory, power, and processing capabilities. Wireless communication between sensors and data stations may be bandwidth limited. Real-time chemometric processing and modeling require fast computation and can become infeasible when the data set grows too large.

Currently, there are many data compression techniques available, which can be divided into two major groups. The first group of techniques reduces the data dimensionality by projecting the data onto a few new key variables.58 The new basis onto which the data are projected is constructed so that the maximum variance in the data is retained. The most popular method is principal component analysis (PCA),59 which achieves compression because the original data can be represented by a few orthogonal principal components (PCs). The PCs are the eigenvectors of the covariance (or cross-product) matrix of the data, which cumulatively account for the variance in the data set. The new basis is a set of orthogonal axes corresponding to the eigenvectors with the highest eigenvalues. PCA allows efficient compression and provides visualization of the key features in the data. However, it is computationally inefficient for large data sets. Thus, PCA and singular value decomposition (SVD) have also been coupled to compression methods.60, 61 Moreover, PCA projects the data onto a variable ordinate system, so data projected onto different principal component bases are not comparable. Variable ordinate systems also have the disadvantage that the data are transformed to an abstract domain that often conveys little physical or chemical information.

The second group of compression methods projects the data onto basis functions to obtain a transform domain. The Fourier transform (FT) and the wavelet transform (WT) have been commonly used.62, 63 Wavelet and Fourier transforms are fixed ordinate, so that data projected onto these new coordinate systems are comparable. The WT has been widely used because of its simplicity, speed, and multiresolution capability. The WT technique has been applied to compressing absorbance spectra64-67 and ion mobility spectra57 and to denoising absorbance spectra,64, 68 chromatograms,69 and electrochemical signals.70, 71 WT has also been used with other chemometric approaches such as multicalibration,72-74 ,75, 76 multivariate curve resolution,63, 77, 78 and artificial neural networks.79, 80 Reviews and tutorials of WT are available.62, 81-85 In compression using WT, i.e., wavelet compression (WC), the smooth part in the transform domain resembles the original data, which makes compressed data sets not only visually comparable but also amenable to secondary processing such as SMCR methods. The WC coupled with the real-time SIMPLISMA algorithm in this dissertation is based on this strategy.

Compared to one-dimensional compression, multidimensional compression affords much greater compression factors without degrading the signal quality. The 2D compression of sensor data was first demonstrated with Fourier compression.54 A 2D discrete sine transform was used to compress large sets of IMS data so that they would be amenable to PCA.61 Two-dimensional wavelet compression was the logical progression and was applied to ion mobility spectra57 and to NIR spectra for monitoring wood chips.56 Multidimensional compression is especially desirable for real-time processing of ion mobility spectra, for which the instrument acquires data very rapidly, up to thousands of spectra per minute. Two-dimensional compression will be implemented in this dissertation.
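To make the gain from multidimensional compression concrete, the following is a minimal sketch and not the implementation used in this work. It assumes the simplest orthogonal wavelet (the two-coefficient daublet 2, i.e., Haar) and keeps only the smooth part along each dimension. The function names `haar_smooth` and `compress_2d` and the toy matrix are hypothetical.

```python
import math

def haar_smooth(v):
    """One low-pass step of the Haar (daublet 2) wavelet: scaled pairwise sums."""
    return [(v[i] + v[i + 1]) / math.sqrt(2.0) for i in range(0, len(v) - 1, 2)]

def compress_2d(X, levels_drift, levels_time):
    """Keep only the smooth part along drift time (within each row),
    then along acquisition time (down each column)."""
    Y = [list(row) for row in X]
    for _ in range(levels_drift):
        Y = [haar_smooth(row) for row in Y]
    for _ in range(levels_time):
        cols = [haar_smooth(list(col)) for col in zip(*Y)]
        Y = [list(row) for row in zip(*cols)]
    return Y

# Toy data matrix: 8 spectra (rows) with 16 points each (columns)
X = [[1.0] * 16 for _ in range(8)]
Y = compress_2d(X, levels_drift=2, levels_time=1)
factor = (len(X) * len(X[0])) / (len(Y) * len(Y[0]))
print(len(Y), len(Y[0]), factor)
```

Two levels along drift time alone would give a compression factor of 4; adding one level along acquisition time doubles it, which is the sense in which the compression factors of the two dimensions multiply.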

1.5 The Research Objectives

This dissertation involves research conducted on implementing real-time self-modeling curve resolution and data compression for ion mobility spectrometry (IMS). The SMCR method was SIMPLISMA. As discussed earlier, SIMPLISMA has proven useful for enhancing IMS measurements. However, all of the previous studies have applied SIMPLISMA in post-run analyses, for which the data are processed in a separate step after collection. Storage burden can be a problem, especially for portable instruments such as ion mobility spectrometers that have limited storage capacity and a high data acquisition rate. Post-run analyses are not amenable to continuous monitoring and online processes for which real-time information is desired. In addition, SIMPLISMA modeling is an interactive process that requires chemical knowledge from users, which may limit its applications in practice.

The research in this dissertation addressed these problems by implementing SIMPLISMA in real-time. The IMS data will be modeled as they are acquired and the model will be simultaneously updated during the measurement course. The SIMPLISMA algorithm was modified to automatically determine the complexity of the model without input from users.

SIMPLISMA is an algorithm with extensive calculations that demands advanced processing power. In portable IMS sensors, the processor may not be powerful enough to implement the algorithm on uncompressed data. The acquisition speed is reduced as the size of the data matrix increases, which has the deleterious effect of causing spectra to be lost. Real-time modeling must be accomplished in the idle periods between the spectral acquisitions. Two-dimensional (2D) wavelet compression (WC2) was applied to compress the data prior to SIMPLISMA modeling. The wavelet type and compression level were optimized for IMS data. The method was evaluated with different samples and three different ion mobility spectrometers.

This dissertation includes six chapters. The general background is given in this chapter. The general theory of SIMPLISMA and WT will be introduced in Chapter 2. Additional background and theory of each project will be presented in the corresponding chapters. The real-time implementation of SIMPLISMA will be given in Chapter 3. The application of the modified SIMPLISMA, denoted RTSIMPLISMA, to two-dimensional wavelet compressed ion mobility spectra will be presented in Chapter 4. Two diverse IMS data sets and published reference data were used to evaluate this method. The real-time two-dimensional wavelet compression with RTSIMPLISMA will be presented in Chapter 5. An integrated software package, called Get Chemical Information Now (GCIN), was developed. The real-time approach was applied to rapid processing of IMS data for explosives and drugs. Finally, the summary and future work will be given in the last chapter. The resulting papers and presentations will be included in Appendix A and B, respectively. Partial codes are adapted to MATLAB scripts and listed in Appendix C.

The following conventions are applied to the mathematical notation throughout the dissertation: (1) Bold uppercase symbols designate matrices, such as $\mathbf{X}$. (2) Bold lowercase symbols designate column vectors. For example, $\mathbf{x}$ refers to a column vector in the matrix $\mathbf{X}$. (3) Italic regular symbols designate single variables, such as $n_s$ for the number of spectra.

Chapter 2 SIMPLISMA and Wavelet Transform

The theories of SIMPLISMA and WT are described in this chapter. Along with the mathematical equations, simple examples are given to make the descriptions instructive. Both algorithms have been modified for real-time implementations in this dissertation. The modified algorithms will be presented in the theory part of the corresponding chapter.

2.1 SIMPLISMA

Suppose $n_s$ spectra are collected that contain $n_x$ points in each spectrum. A data matrix $\mathbf{X}$ of size $n_s \times n_x$ is constructed so that the rows comprise $n_s$ spectra ordered with respect to spectrum number, and the columns comprise $n_x$ concentration profiles ordered with respect to measurement resolution element (i.e., drift time in terms of IMS).

The objective of SIMPLISMA is to extract the pure component information from the data matrix $\mathbf{X}$. SIMPLISMA can decompose $\mathbf{X}$ into a matrix of concentration profiles ($\mathbf{C}$) and a matrix of component spectra ($\mathbf{S}$):

$$\mathbf{X} = \mathbf{C}\mathbf{S}^{T} \tag{2.1}$$

for which, given that the number of pure components in the model is $n_c$, $\mathbf{C}$ is an $n_s \times n_c$ matrix and $\mathbf{S}$ is an $n_x \times n_c$ matrix.

SIMPLISMA estimates $\mathbf{C}$ with pure variables. Pure variables are associated with the columns of $\mathbf{X}$ that estimate the spectral intensity variations of pure components, which have commonly been referred to as concentration profiles. A pure variable is a resolution element in the data set that has a unique concentration profile and a relatively large variance with respect to intensity. Pure variables are estimated by finding the variable that maximizes the purity.40 For the conventional SIMPLISMA, the purity ($p_{ij}$) of a candidate variable is defined as:

$$p_{ij} = e_j \times w_{ij} = \frac{\sigma_j}{\mu_j + \alpha} \times w_{ij} \tag{2.2}$$

for which $i$ is the index of the component, and $\mu_j$ and $\sigma_j$ are the mean and the standard deviation of the $j$th candidate variable, i.e., the $j$th column of $\mathbf{X}$, respectively. The term $e_j$ may be recognized as an expression for the relative standard deviation of the $j$th column of $\mathbf{X}$. Variables with relatively large variance have a large $e_j$. The influence of noise is suppressed by the term $\alpha$, termed the damping factor. Usually, $\alpha$ is 5% of the maximum peak intensity of the mean of the data set. The $w_{ij}$ term is the weight that characterizes the linear independence of the $j$th candidate variable with respect to the previously resolved $i-1$ components. For the first component, the weight values are:

$$w_{1j} = \frac{\sigma_j}{\sigma_j + \alpha} \tag{2.3}$$
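As an illustration of how the purity of the first component might be evaluated, the sketch below applies the definitions of $e_j$ and $w_{1j}$ column by column. It is a minimal example under stated assumptions, not the code of this work: the population form of the standard deviation is assumed, and the function names, the toy matrix, and the value of `alpha` are hypothetical.

```python
import math

def column_stats(X, j):
    """Mean and (population) standard deviation of column j of X, a list of rows."""
    col = [row[j] for row in X]
    mu = sum(col) / len(col)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in col) / len(col))
    return mu, sigma

def first_purity(X, alpha):
    """Purity of every candidate variable for the first component:
    p_1j = e_j * w_1j with e_j = sigma/(mu + alpha) and w_1j = sigma/(sigma + alpha)."""
    purities = []
    for j in range(len(X[0])):
        mu, sigma = column_stats(X, j)
        e_j = sigma / (mu + alpha)
        w_1j = sigma / (sigma + alpha)
        purities.append(e_j * w_1j)
    return purities

# Toy data: 4 spectra (rows) x 3 variables (columns).
# Column 0 varies strongly, column 1 is constant, column 2 is low-level noise.
X = [[1.0, 5.0, 0.01],
     [2.0, 5.0, 0.02],
     [3.0, 5.0, 0.01],
     [4.0, 5.0, 0.02]]
p = first_purity(X, alpha=0.25)
r1 = max(range(len(p)), key=p.__getitem__)  # index of the first pure variable
print(r1, p)
```

In this toy case the noisy column has a relative standard deviation comparable to that of the first column when no damping is applied; the damping factor is what keeps it from being selected.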

Therefore, the first pure variable is selected by:

$$p_1 = \max_{j=1}^{n_x} \left( e_j \times w_{1j} \right) \tag{2.4}$$

Suppose the purity is maximized when $j = r_1$; the $r_1$th column of $\mathbf{X}$ is then taken as the concentration profile of the first component. The other pure variables ($i \ge 2$) are selected by:

$$p_i = \max_{j=1}^{n_x} \left( e_j \times w_{ij} \right) \tag{2.5}$$

for which $w_{ij}$ represents the independence weight for the $j$th candidate variable with respect to the previously resolved $i-1$ components. There are two methods to calculate $w_{ij}$. The first is a determinant-based method, by which the weight is calculated from the determinant of the correlation around the origin (COO) matrix.40 SIMPLISMA using this method will be referred to as SIMPLISMA-det in this dissertation. First, the columns of $\mathbf{X}$, $\mathbf{x}_j$, are normalized by:

$$\mathbf{x}_j^{+} = \frac{\mathbf{x}_j}{\sqrt{\sum_{i=1}^{n_s} \left( x_{ij} + \alpha \right)^2}} \tag{2.6}$$

Equation (2.6) differs slightly from the normalization method of the original SIMPLISMA by Windig,40 which is given in equation (2.7). Equation (2.6) requires slightly simpler calculations.

$$\mathbf{x}_j^{+} = \frac{\mathbf{x}_j}{\sqrt{\mu_j^2 + \left( \sigma_j + \alpha \right)^2}} \tag{2.7}$$

The COO matrix $\mathbf{O}$, of size $n_x \times n_x$, is obtained from the normalized data matrix $\mathbf{X}^{+}$:

$$\mathbf{O} = \mathbf{X}^{+T}\mathbf{X}^{+} / n_s \tag{2.8}$$

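The weight for the $i$th component and the $j$th candidate variable, $w_{ij}$, is calculated by:

$$w_{ij} = \begin{vmatrix} o_{j,j} & o_{j,r_1} & \cdots & o_{j,r_{i-1}} \\ o_{r_1,j} & o_{r_1,r_1} & \cdots & o_{r_1,r_{i-1}} \\ \vdots & \vdots & \ddots & \vdots \\ o_{r_{i-1},j} & o_{r_{i-1},r_1} & \cdots & o_{r_{i-1},r_{i-1}} \end{vmatrix} \tag{2.9}$$

The index $r_{i-1}$ represents the column index of $\mathbf{X}$ for the $(i-1)$th pure variable. A larger weight implies greater independence.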
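The determinant-based weight can be sketched as follows. This is an illustrative example under stated assumptions rather than the code of this work: the columns are scaled by the square root of the sum of squared damped intensities, the COO matrix is formed between variables (columns), and all function names and the toy data are hypothetical.

```python
import math

def normalize_columns(X, alpha):
    """Scale each column j by sqrt(sum_i (x_ij + alpha)^2) (assumed normalization)."""
    ns, nx = len(X), len(X[0])
    norms = [math.sqrt(sum((X[i][j] + alpha) ** 2 for i in range(ns))) for j in range(nx)]
    return [[X[i][j] / norms[j] for j in range(nx)] for i in range(ns)]

def coo_matrix(Xn):
    """Correlation-around-the-origin matrix between variables (nx x nx)."""
    ns, nx = len(Xn), len(Xn[0])
    return [[sum(Xn[i][a] * Xn[i][b] for i in range(ns)) / ns for b in range(nx)]
            for a in range(nx)]

def det(M):
    """Determinant by Gaussian elimination with partial pivoting."""
    A = [row[:] for row in M]
    n, d = len(A), 1.0
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(A[r][c]))
        if abs(A[piv][c]) < 1e-15:
            return 0.0
        if piv != c:
            A[c], A[piv] = A[piv], A[c]
            d = -d
        d *= A[c][c]
        for r in range(c + 1, n):
            f = A[r][c] / A[c][c]
            for k in range(c, n):
                A[r][k] -= f * A[c][k]
    return d

def det_weight(O, j, pure):
    """Determinant of the COO submatrix for candidate j and the pure variables found so far."""
    idx = [j] + list(pure)
    return det([[O[a][b] for b in idx] for a in idx])

# Toy data: column 1 is a multiple of column 0; column 2 varies in the opposite direction.
X = [[1.0, 2.0, 4.0],
     [2.0, 4.0, 3.0],
     [3.0, 6.0, 2.0],
     [4.0, 8.0, 1.0]]
O = coo_matrix(normalize_columns(X, alpha=0.1))
w_dependent = det_weight(O, 1, pure=[0])
w_independent = det_weight(O, 2, pure=[0])
print(w_dependent, w_independent)
```

A candidate that is proportional to an already selected pure variable makes the submatrix singular, so its weight vanishes, while a candidate with a different profile keeps a positive weight.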

Another algorithm for weight calculation is based on the Gram-Schmidt distance calculation.5, 86 A mathematical description of Gram-Schmidt orthogonalization can be found elsewhere.87 To estimate the independence weight, the Gram-Schmidt method constructs an orthogonal basis from the previously determined pure variables. The first vector of the basis ($\mathbf{v}_1$) is the normalized concentration profile of the first pure variable ($\mathbf{c}_1$).

$$\mathbf{v}_1 = \frac{\mathbf{c}_1}{\lVert \mathbf{c}_1 \rVert} \tag{2.10}$$

The remaining vectors of the basis set ($\mathbf{v}_i$, $i \ge 2$) are defined by the previously determined $i-1$ pure variables.

$$\mathbf{v}_i = \mathbf{c}_i - \sum_{k=1}^{i-1} \frac{\mathbf{v}_k^{T}\mathbf{c}_i}{\mathbf{v}_k^{T}\mathbf{v}_k}\,\mathbf{v}_k \tag{2.11}$$

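The weight for the $i$th component and the $j$th candidate variable is calculated by:

$$w_{ij} = 1 - \frac{\mathbf{x}_j^{*T}\mathbf{x}_j^{*}}{\mathbf{x}_j^{T}\mathbf{x}_j} \tag{2.12}$$

for which $\mathbf{x}_j$ is the $j$th candidate variable, i.e., the $j$th column of $\mathbf{X}$, and $\mathbf{x}_j^{*}$ is the projection of $\mathbf{x}_j$ onto the orthogonal basis:

$$\mathbf{x}_j^{*} = \sum_{k=1}^{i-1} \frac{\mathbf{v}_k^{T}\mathbf{x}_j}{\mathbf{v}_k^{T}\mathbf{v}_k}\,\mathbf{v}_k \tag{2.13}$$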

Geometrically, the independence weight of the candidate variable is associated with the angle ($\theta$) between the variable and its projection onto the orthogonal basis. The weight ($w_{ij}$) in equation (2.12) equals $\sin^2\theta$. When the candidate concentration profile is orthogonal to the basis, the weight $w_{ij}$ is unity, and when the candidate concentration profile is similar to the others in the basis, the weight is diminished. The SIMPLISMA based on the Gram-Schmidt method is referred to as SIMPLISMA-gs.
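The geometric interpretation of the Gram-Schmidt weight can be checked with a short sketch. It is a minimal illustration, not the implementation of this work: the projection is computed with each basis vector divided by its squared length (an assumption, so that the basis need not be normalized), and the function names and toy vectors are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def gram_schmidt_basis(pure_profiles):
    """Orthogonal basis built from the concentration profiles of the selected pure variables.
    The first vector is scaled to unit length; later ones are orthogonalized only."""
    basis = []
    for c in pure_profiles:
        v = list(c)
        for u in basis:
            coef = dot(u, c) / dot(u, u)
            v = [vi - coef * ui for vi, ui in zip(v, u)]
        if not basis:
            norm = dot(v, v) ** 0.5
            v = [vi / norm for vi in v]
        basis.append(v)
    return basis

def gs_weight(x, basis):
    """Weight = 1 - |projection|^2 / |x|^2, i.e., sin^2 of the angle to the basis."""
    proj = [0.0] * len(x)
    for u in basis:
        coef = dot(u, x) / dot(u, u)
        proj = [p + coef * ui for p, ui in zip(proj, u)]
    return 1.0 - dot(proj, proj) / dot(x, x)

basis = gram_schmidt_basis([[1.0, 0.0, 0.0]])   # one pure variable found so far
w_parallel = gs_weight([2.0, 0.0, 0.0], basis)  # same profile, different scale
w_orthogonal = gs_weight([0.0, 3.0, 0.0], basis)
w_45_degrees = gs_weight([1.0, 1.0, 0.0], basis)
print(w_parallel, w_orthogonal, w_45_degrees)
```

The three candidates give weights of zero, unity, and one half, corresponding to angles of 0, 90, and 45 degrees between the candidate and the basis.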

The algorithm stops searching for pure variables when the predetermined number of components is obtained. The concentration profiles ($\mathbf{C}$) comprise the $n_c$ columns of $\mathbf{X}$ that furnish the largest purity values. The component spectra are resolved from the concentration profiles by:

$$\mathbf{S} = \mathbf{X}^{T}\mathbf{C}\left(\mathbf{C}^{T}\mathbf{C}\right)^{-1} \tag{2.14}$$

Each column of the resolved spectra $\mathbf{S}$ is normalized to unit vector length by dividing the spectrum by the square root of its sum of squares. The normalized $\mathbf{S}$ is used to generate new concentration profiles by:

$$\mathbf{C} = \mathbf{X}\mathbf{S}\left(\mathbf{S}^{T}\mathbf{S}\right)^{-1} \tag{2.15}$$

The purpose of the normalization is to remove model ambiguity and to transfer the intensity units to the concentration profiles. Concentration profiles will be scale-invariant if normalization to unit length is used when SIMPLISMA is applied to the same data set represented in the wavelet domain.
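The two least-squares steps and the intermediate normalization can be sketched in a few lines. This is an illustrative example only: the small matrix helpers, the function name `resolve`, and the three-point toy system (two components whose pure variables are the first and last columns) are hypothetical and are not taken from this work.

```python
def transpose(A):
    return [list(r) for r in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def inverse(A):
    """Gauss-Jordan inverse of a small, well-conditioned square matrix."""
    n = len(A)
    M = [list(map(float, A[i])) + [1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        p = M[c][c]
        M[c] = [v / p for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [row[n:] for row in M]

def resolve(X, C):
    """S = X^T C (C^T C)^-1, normalize columns of S to unit length, then C = X S (S^T S)^-1."""
    S = matmul(matmul(transpose(X), C), inverse(matmul(transpose(C), C)))
    for k in range(len(S[0])):
        norm = sum(row[k] ** 2 for row in S) ** 0.5
        for row in S:
            row[k] /= norm
    C_new = matmul(matmul(X, S), inverse(matmul(transpose(S), S)))
    return S, C_new

# Toy system: spectra [1, 1, 0] and [0, 1, 1]; columns 0 and 2 of X are pure variables.
X = [[0.75, 1.0, 0.25],
     [0.50, 1.0, 0.50],
     [0.25, 1.0, 0.75]]
C = [[row[0], row[2]] for row in X]
S, C_new = resolve(X, C)
X_model = matmul(C_new, transpose(S))
print(S, C_new)
```

After the normalization the spectra have unit length and the concentration profiles are rescaled by the same factor, so the product of the two still reproduces the data matrix.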

Take the virtual aqueous reaction (2.16) as an example. The reaction is a two-component (compounds A and B) system. If we want to investigate the relationship of the concentration of B ($c_B$) with respect to time, two methods can be used. One method is to use chromatography to separate the mixture and determine $c_B$. Assume the concentrations of A ($c_A$) and B ($c_B$) were measured every 1 minute and the results of nine measurements are given in Table 2.1. The results reveal that the concentration of B relates to time ($t$) by equation (2.17).

$$\mathrm{A} \rightarrow \mathrm{B} \tag{2.16}$$

$$c_B = 0.25\,t^{0.50} \tag{2.17}$$

Alternatively, we could use a chemometric method with a spectrometric measurement and avoid the chromatographic separation. For illustration purposes, we can simulate the spectrometric measurement and synthesize the mixture spectra based on the following three assumptions. First, imagine that the spectra of 1 mM aqueous solutions of the pure compounds are known, as given in Figure 2.1. The spectra are not from a real spectrometer and serve only for illustration. In practice, ultraviolet (UV)-visible, NIR, FTIR, or Raman spectroscopy can be used. Second, assume that water gives no response in the spectrometer. Third, assume that the mixture spectral intensity at point number $n$ ($I(n)$) is the linear combination of the contributions of A and B:

$$I(n) = I_A(n)\,c_A + I_B(n)\,c_B \tag{2.18}$$

for which $I_A(n)$ and $I_B(n)$ are the intensities of A and B at point number $n$, respectively.
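The linear mixing assumption can be turned into a synthetic data set as sketched below. The peak centroids follow the description of the virtual spectra and the concentrations are those measured every minute over nine minutes, but the Gaussian peak shape, the peak widths, and the peak heights are assumptions made only for illustration, as is the helper `gaussian`.

```python
import math

def gaussian(n_points, center, width, height):
    """Hypothetical peak shape: a Gaussian sampled at point numbers 1..n_points."""
    return [height * math.exp(-0.5 * ((n - center) / width) ** 2)
            for n in range(1, n_points + 1)]

n_points = 128
# Assumed pure spectra of 1 mM A and B: centroids at 30/80 and 30/70; widths and heights are made up.
I_A = [a + b for a, b in zip(gaussian(n_points, 30, 4, 0.5), gaussian(n_points, 80, 4, 0.3))]
I_B = [a + b for a, b in zip(gaussian(n_points, 30, 4, 0.5), gaussian(n_points, 70, 8, 0.3))]

# Concentrations (mM) at 1 to 9 min
c_A = [0.75, 0.65, 0.56, 0.50, 0.44, 0.39, 0.34, 0.30, 0.25]
c_B = [0.25, 0.35, 0.44, 0.50, 0.56, 0.61, 0.66, 0.70, 0.75]

# Each row is one mixture spectrum: I(n) = I_A(n) * c_A + I_B(n) * c_B
X = [[ia * ca + ib * cb for ia, ib in zip(I_A, I_B)] for ca, cb in zip(c_A, c_B)]
print(len(X), len(X[0]))
```

Because the first peaks of the two assumed spectra coincide and the two concentrations always sum to 1 mM, the intensity at point number 30 is practically constant over time, which illustrates why a completely overlapped peak carries no information for resolving the components.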

Table 2.1 The concentrations of the compounds A and B during the reaction course.

Time (min)    Concentration of A (mM)    Concentration of B (mM)
1             0.75                       0.25
2             0.65                       0.35
3             0.56                       0.44
4             0.50                       0.50
5             0.44                       0.56
6             0.39                       0.61
7             0.34                       0.66
8             0.30                       0.70
9             0.25                       0.75

[Figure 2.1 plot: Panels A and B, Intensity versus Point Number]

Figure 2.1 The virtual spectra of a 1 mM aqueous solution of the pure compound A (Panel A) and a 1 mM aqueous solution of the pure compound B (Panel B). These are computer-synthesized data. The spectrum of A contains two peaks with centroids at point numbers 30 and 80. The spectrum of B contains two peaks with centroids at point numbers 30 and 70.

Figure 2.2 The three-dimensional surface plot of the synthesized mixture spectra of the two-component virtual reaction system. This data set is synthesized assuming we know the spectra of 1 mM pure A and B (Figure 2.1) and the concentrations varied as given in Table 2.1. The data set is the linear combination of the contributions of A and B.

The synthesized data set is presented in Figure 2.2, for which $\mathbf{X}$ in equation (2.1) corresponds to a matrix of 9 rows and 128 columns. The data set is challenging in that the first peaks of the two compounds completely overlap with each other while the second peaks partially overlap and have different peak widths. Applying SIMPLISMA-det to $\mathbf{X}$ with $n_c$ predefined as 2 and $\alpha$ set to 0.5 resolves the pure component spectra ($\mathbf{S}$) and concentration profiles ($\mathbf{C}$) that are given in Figure 2.3 and Figure 2.4. The two selected pure variables correspond to point numbers 70 and 80, which afford the maximum purity, i.e., the maximum intensity variation and independence weight. The component spectra in Figure 2.3 were normalized, so the relative intensity has no unit. Moreover, the amplitude of the relative intensity differs from that of the spectra of the 1 mM pure compounds in Figure 2.1 because of the normalization. However, the shapes and positions of the peaks are retained. Note that the first two spectra are enough to extract the pure component spectra, although nine spectra are used here to observe the concentration profiles. The concentration profiles in Figure 2.4 carry different values from those in Table 2.1. The least-squares fit of the concentration profile of B results in equation (2.19) with an R-square of 0.9995, which predicts the same order as the expression in equation (2.17). Given the initial concentrations of A and B, the accurate model could be obtained. Identical results are obtained using SIMPLISMA-gs.

$$c_B = 0.38\,t^{0.50} \tag{2.19}$$
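A power-law model of this kind can be fitted by ordinary least squares after taking logarithms. The sketch below is only an illustration of the fitting step: it is applied to the tabulated concentrations of B rather than to the resolved concentration profile, which is not tabulated here, and the function name `fit_power_law` is hypothetical. Fitting in log space is also an assumption; the fitting procedure behind the reported equation is not stated.

```python
import math

def fit_power_law(t, c):
    """Least-squares fit of c = a * t**b by linear regression of log(c) on log(t).
    Returns a, b, and the R-square of the regression in log space."""
    lx = [math.log(v) for v in t]
    ly = [math.log(v) for v in c]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((x - mx) ** 2 for x in lx)
    sxy = sum((x - mx) * (y - my) for x, y in zip(lx, ly))
    b = sxy / sxx
    log_a = my - b * mx
    ss_res = sum((y - (log_a + b * x)) ** 2 for x, y in zip(lx, ly))
    ss_tot = sum((y - my) ** 2 for y in ly)
    return math.exp(log_a), b, 1.0 - ss_res / ss_tot

t = [1, 2, 3, 4, 5, 6, 7, 8, 9]                                # time (min)
c_B = [0.25, 0.35, 0.44, 0.50, 0.56, 0.61, 0.66, 0.70, 0.75]   # concentration of B (mM)
a, b, r2 = fit_power_law(t, c_B)
print(a, b, r2)
```

With a resolved profile that differs from the true concentrations only by a constant scale factor, the same fit would return a different pre-exponential factor but the same exponent, which is the sense in which the resolved profile predicts the same order.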

[Figure 2.3 plot: Panels A and B, Relative Intensity versus Point Number]

Figure 2.3 SIMPLISMA resolved spectra of compounds A (Panel A) and B (Panel B) with the number of components predefined as two (α=0.05). The input of the SIMPLISMA algorithm is the synthesized data set plotted in Figure 2.2. The selected pure variables correspond to point numbers 70 and 80.

[Figure 2.4 plot: Panels A and B, Integrated Concentration (mM) versus Time (min)]

Figure 2.4 SIMPLISMA resolved concentration profiles of compounds A (Panel A) and B (Panel B) with the number of components predefined as two (α=0.05). The input of the SIMPLISMA algorithm is the synthesized data set plotted in Figure 2.2. The selected pure variables correspond to point numbers 70 and 80.

Selection of the predefined number of components $n_c$ is crucial for the modeling accuracy of SIMPLISMA. Modeling with an inappropriate $n_c$ can result in an erratic model that does not reveal the true chemical information underlying the data set. If $n_c$ is predefined as three instead of two, SIMPLISMA resolves the data set with the model given in Figure 2.5, for which only the information about B is correctly modeled by component 2. The resolved spectra of components 1 and 3 result from over-fitting the model to the data set and do not provide the correct information about the mixture data. In this dissertation, a method has been developed to automatically estimate the number of components. More details will be given later.

[Figure 2.5 plot: Panels A, B, and C, Relative Intensity versus Point Number; Panel D, Integrated Concentration (mM) versus Time (min) for Components 1, 2, and 3]

Figure 2.5 SIMPLISMA model of the synthesized data set with the number of components predefined as three (α=0.05). The components are ordered by the purity values. Panels A, B, and C present the resolved spectra of components 1, 2, and 3, respectively. Panel D presents the resolved concentration profiles of the three components. The selected pure variables correspond to point numbers 70, 80, and 67.

The above example demonstrates that the method using SIMPLISMA is advantageous over the conventional method in the following respects. First, the method using SIMPLISMA would be more rapid because a measurement using chromatography consumes considerably more time than one using spectrometry. A typical chromatographic separation takes 10 to 30 min, while a spectrometric measurement usually takes less than 30 s. Second, reference materials are required to determine the concentrations in the conventional method. Moreover, knowledge about the components is required to obtain optimal chromatographic conditions to separate the mixture. In addition, SIMPLISMA provides the spectra of the pure components, which present additional information about the mixture. Finally, spectrometry is more suitable for the study of reaction kinetics because the reaction can be continuously monitored in situ. For the chromatographic method, the reaction must be appropriately terminated before the mixture is sampled and introduced into a chromatograph.

2.2 Wavelet Transform

The WT is similar to the Fourier transform (FT); both involve the convolution (forward transform, or decomposition) and deconvolution (inverse transform, or reconstruction) of signals. However, the WT processes signals with wavelets, while the FT represents signals in terms of sine and cosine waves. Unlike sine and cosine waves, wavelets are aperiodic. The WT signal is localized in both frequency and time. Each level of the WT provides the time information at a resolution reduced by a factor of two. Studying the time behavior at different levels and resolutions furnishes multiresolution analysis (MRA).

The wavelet transform can be implemented in two ways, as the discrete wavelet transform (DWT) or as the discretized continuous wavelet transform (CWT). For real-time purposes, the DWT is advantageous over the discretized CWT in that the DWT is considerably faster than the CWT but provides sufficient information. The term WT refers to the DWT in the following text because only the DWT will be used in this dissertation. Accordingly, WC refers to wavelet compression by the DWT.

The pyramid algorithm88 has been commonly used for MRA and is implemented by multi-level operations. The schematic of the WT is presented in Figure 2.6.

[Figure 2.6 schematic: the raw spectrum $\mathbf{x}^{(0)}$ passes through a low-pass filter and a high-pass filter to give the smooth part $\mathbf{x}^{(1)}$ and the detail part $\mathbf{d}^{(1)}$; the smooth part is decomposed again to give $\mathbf{x}^{(2)}$ and $\mathbf{d}^{(2)}$, and so on up to $\mathbf{x}^{(l)}$ and $\mathbf{d}^{(l)}$]

Figure 2.6 Schematic of the multi-level operations of the pyramid WT algorithm with dyadic sampling. The forward WT decomposes the raw spectrum ($\mathbf{x}^{(0)}$) into a smooth part ($\mathbf{x}^{(1)}$) and a detail part ($\mathbf{d}^{(1)}$) at level 1. The detail part is left alone while the smooth part is further decomposed into the smooth and detail parts at level 2, and so on. The complete WT stops when only one point is retained in $\mathbf{x}^{(l)}$. A WT that stops before that point affords a partial WT.

In terms of spectral analysis, the forward WT recursively splits a spectrum into smooth and detail parts, each half the length of the input, by passing the signal through low-pass and high-pass filters, respectively, with dyadic sampling. The two filters are defined by basis wavelets that contain a finite number of non-zero coefficients. The basis wavelet for the low-pass filter is the father wavelet ($\mathbf{h}$) and that for the high-pass filter is the mother wavelet ($\mathbf{g}$), which together satisfy the condition of quadrature mirror filters (QMF):

$$g_m = (-1)^m h_{M-m-1} \tag{2.20}$$

for which $M$ is the filter length in number of coefficients and $m$ is the index of the coefficients. Theoretically, the number of wavelet types could be infinite. Twenty-seven commonly used types of wavelets were evaluated in this work, including 15 from the Daubechies family (daublet 2, 4, …, 30), 5 from the coiflet family (coiflet 1, 2, …, 5), and 7 from the symmlet family (symmlet 4, 5, …, 10), all of which are orthogonal wavelets because the wavelet coefficients fulfill certain orthogonality properties. Different types of wavelets have different coefficients, which determine the shape and length (i.e., number of coefficients) of the basis wavelet. The number in the wavelet name is associated with its length. The wavelets “daublet n”, “coiflet n”, and “symmlet n” have n, 6n, and 2n coefficients, respectively. Some basis wavelets are given in Figure 2.7.
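The quadrature mirror relation can be illustrated with the four-coefficient daublet 4 filter. The sketch below is a minimal example: the closed-form father wavelet coefficients are the standard ones for this filter, the index $m$ is assumed to run from 0 to $M-1$, and the function name `mother_from_father` is hypothetical.

```python
import math

def mother_from_father(h):
    """Apply g_m = (-1)**m * h_(M-m-1) for m = 0 .. M-1 (zero-based index assumed)."""
    M = len(h)
    return [(-1) ** m * h[M - m - 1] for m in range(M)]

# Father wavelet (low-pass) coefficients of daublet 4
s3 = math.sqrt(3.0)
h = [c / (4.0 * math.sqrt(2.0)) for c in (1 + s3, 3 + s3, 3 - s3, 1 - s3)]
g = mother_from_father(h)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

print(sum(h), sum(g), dot(h, h), dot(h, g))
```

The low-pass coefficients sum to the square root of 2 and have unit length, the high-pass coefficients sum to zero, and the two filters are orthogonal to each other, which are properties that make the transform invertible by simple transposition.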

[Figure 2.7 plot: Panels A to D, Coefficient Values versus Coefficient Number, each showing the father and mother wavelets]

Figure 2.7 Father and mother wavelets of daublet 4 (Panel A, four coefficients), daublet 14 (Panel B, 14 coefficients), coiflet 3 (Panel C, 18 coefficients), and symmlet 6 (Panel D, 12 coefficients).

The complete WT stops when only one point is retained in the smooth part, which requires that the number of points in the raw spectrum be a power of 2 because of the multi-level operations with dyadic sampling. A WT that stops before that point affords a partial WT. A partial WT stopping at level $l$ requires the number of data points to be a multiple of $2^l$. Let $\mathbf{x}^{(0)}$ denote a spectrum in $\mathbf{X}$ with $n_x$ points, where $n_x$ is a power of 2, and let $\mathbf{x}^{(l)}$ and $\mathbf{d}^{(l)}$ denote the smooth and detail parts of the WT at the $l$th level, respectively, where the length of $\mathbf{x}^{(l)}$ and $\mathbf{d}^{(l)}$ is $n_x / 2^l$. The linear operators (filters) for low-pass, $\mathbf{H}: \mathbf{x}^{(l)} \rightarrow \mathbf{x}^{(l+1)}$, and high-pass, $\mathbf{G}: \mathbf{x}^{(l)} \rightarrow \mathbf{d}^{(l+1)}$, are derived from $\mathbf{h}$ and $\mathbf{g}$. Essentially, the operators $\mathbf{H}$ and $\mathbf{G}$ are two circulant matrices that contain the finite non-zero elements from $\mathbf{h}$ and $\mathbf{g}$, respectively. To simplify the mathematical description, we define $\mathrm{T}_r(\mathbf{v} - t)$ as an operator that constructs a matrix containing $r$ rows of circular translates of a row vector ($\mathbf{v}$). The vector $\mathbf{v}$ is translated by multiples of $t$ ($r, t \in \mathbb{Z}$) and padded with zeros. For example, a daublet 4 filter has four coefficients and is the second member of the Daubechies family (see Figure 2.7A). Its basis wavelet includes four coefficients $\{c_1, c_2, c_3, c_4\}$, so the father wavelet $\mathbf{h}$ is a row vector:

$$\mathbf{h} = [\,c_1 \;\; c_2 \;\; c_3 \;\; c_4\,] \tag{2.21}$$

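Hence, applying the operator $\mathrm{T}_3(\mathbf{h} - 2)$ to $\mathbf{h}$ constructs a matrix ($\mathbf{H}$):

$$\mathbf{H} = \begin{bmatrix} c_1 & c_2 & c_3 & c_4 & 0 & 0 & 0 & 0 \\ 0 & 0 & c_1 & c_2 & c_3 & c_4 & 0 & 0 \\ 0 & 0 & 0 & 0 & c_1 & c_2 & c_3 & c_4 \end{bmatrix} \tag{2.22}$$

If we use daublet 4 to process the spectrum ($\mathbf{x}$), the operator $\mathbf{H}: \mathbf{x}^{(l)} \rightarrow \mathbf{x}^{(l+1)}$ is:

$$\mathbf{H} = \begin{bmatrix} \mathrm{T}_{(n_x/2^{l+1} - 1)}(\mathbf{h} - 2) \\ c_3 \;\; c_4 \;\; 0 \;\; \cdots \;\; 0 \;\; c_1 \;\; c_2 \end{bmatrix} \tag{2.23}$$

Similarly, the operator $\mathbf{G}: \mathbf{x}^{(l)} \rightarrow \mathbf{d}^{(l+1)}$ can be obtained from the corresponding mother wavelet $\mathbf{g}$ and equation (2.20). The pyramid algorithm of MRA can consequently be written as:

$$\mathbf{x}^{(l+1)} = \mathbf{H}\mathbf{x}^{(l)} \tag{2.24}$$

$$\mathbf{d}^{(l+1)} = \mathbf{G}\mathbf{x}^{(l)} \tag{2.25}$$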
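One level of the pyramid algorithm in matrix form can be sketched as follows. This is an illustrative example and not the implementation of this work: the helper `filter_matrix` builds rows that are circular translates by two positions, with the last row wrapping around, and the eight-point test signal is hypothetical. The daublet 4 coefficients are the standard closed-form values.

```python
import math

def filter_matrix(f, n):
    """(n/2) x n matrix whose rows are the filter f translated circularly by multiples of 2."""
    rows = []
    for r in range(n // 2):
        row = [0.0] * n
        for m, coef in enumerate(f):
            row[(2 * r + m) % n] += coef
        rows.append(row)
    return rows

def apply(A, x):
    """Matrix-vector product."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]

s3 = math.sqrt(3.0)
h = [c / (4.0 * math.sqrt(2.0)) for c in (1 + s3, 3 + s3, 3 - s3, 1 - s3)]  # father wavelet
g = [(-1) ** m * h[len(h) - m - 1] for m in range(len(h))]                   # mother wavelet (QMF)

n = 8
H = filter_matrix(h, n)   # low-pass operator
G = filter_matrix(g, n)   # high-pass operator

x = [0.0, 1.0, 4.0, 9.0, 3.0, 2.0, 7.0, 5.0]
smooth = apply(H, x)      # next-level smooth part, 4 points
detail = apply(G, x)      # next-level detail part, 4 points
flat_detail = apply(G, [1.0] * n)
print(smooth, detail)
```

Because the stacked operators form an orthogonal matrix, the sum of squares of the smooth and detail coefficients equals that of the input, and a constant signal produces no detail coefficients.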

The spectrum is reconstructed by the inverse WT. The inversion process is simple since the filters form orthogonal bases. The smooth part at the $l$th level can be recursively obtained by:

$$\mathbf{x}^{(l)} = \mathbf{H}^{T}\mathbf{x}^{(l+1)} + \mathbf{G}^{T}\mathbf{d}^{(l+1)} \tag{2.26}$$

The reconstructed spectrum is obtained by the inverse WT of a subset of the wavelet coefficients in the wavelet domain. Different coefficient selection criteria have been established.85 In this dissertation, an incomplete transform is used and the compressed spectrum includes the smooth coefficients at the final transform level. Take a partial WT of level 4 applied to a spectrum with 1024 points as an example. The last smooth part has 64 points, which are used for reconstruction or as the input of the hyphenated algorithm. During reconstruction, the detail coefficients are set to zero. The reconstructed estimate $\hat{\mathbf{x}}^{(l)}$ is obtained from equation (2.27).

$$\hat{\mathbf{x}}^{(l)} = \mathbf{H}^{T}\mathbf{x}^{(l+1)} \tag{2.27}$$
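The compression and reconstruction steps can be sketched with a filter-bank form of the same operators. This is a minimal illustration under stated assumptions: the daublet 4 filter is used instead of the longer filters evaluated in this work, the boundary is treated as periodic, and the function names and the synthetic Gaussian peak are hypothetical.

```python
import math

def forward(x, h, g):
    """One decomposition level with periodic boundaries: returns (smooth, detail)."""
    n = len(x)
    smooth = [sum(h[m] * x[(2 * r + m) % n] for m in range(len(h))) for r in range(n // 2)]
    detail = [sum(g[m] * x[(2 * r + m) % n] for m in range(len(g))) for r in range(n // 2)]
    return smooth, detail

def inverse(smooth, detail, h, g):
    """One reconstruction level: x = H^T smooth + G^T detail."""
    n = 2 * len(smooth)
    x = [0.0] * n
    for r in range(n // 2):
        for m in range(len(h)):
            x[(2 * r + m) % n] += h[m] * smooth[r] + g[m] * detail[r]
    return x

def compress(x, h, g, levels):
    """Partial WT that keeps only the smooth part at the final level."""
    s = list(x)
    for _ in range(levels):
        s, _discarded = forward(s, h, g)
    return s

def reconstruct(s, h, g, levels):
    """Inverse WT with all detail coefficients set to zero."""
    x = list(s)
    for _ in range(levels):
        x = inverse(x, [0.0] * len(x), h, g)
    return x

s3 = math.sqrt(3.0)
h = [c / (4.0 * math.sqrt(2.0)) for c in (1 + s3, 3 + s3, 3 - s3, 1 - s3)]
g = [(-1) ** m * h[len(h) - m - 1] for m in range(len(h))]

spectrum = [math.exp(-0.5 * ((n - 64) / 10.0) ** 2) for n in range(128)]  # synthetic peak
s1, d1 = forward(spectrum, h, g)
exact = inverse(s1, d1, h, g)                        # all coefficients kept
compressed = compress(spectrum, h, g, levels=2)      # 128 points -> 32 points
approx = reconstruct(compressed, h, g, levels=2)
rel_err = sum((a - b) ** 2 for a, b in zip(spectrum, approx)) / sum(a * a for a in spectrum)
print(len(compressed), rel_err)
```

Keeping every coefficient recovers the signal to within rounding error, whereas discarding the detail parts of a smooth, broad peak loses only a small fraction of the signal energy. How many levels can be discarded before distortion appears depends on the peak widths and on the wavelet.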

Figure 2.8 illustrates the multi-level operations of the WT of an IMS spectrum with 1024 points. The basis wavelet is daublet 14, which was given in Figure 2.7B. The length of the smooth part is reduced to 512, 256, 128, 64, and 32 points at levels 1 to 5, respectively. The smooth part resembles the raw spectrum but in a compressed representation. The detail parts of the wavelet spectra at levels 1, 2, and 3 bear very little signal information, while there are significant signals in the detail parts at level 4 and higher. A clearer representation of the smooth and detail parts of the wavelet spectrum at level 5 is given in Figure 2.9. Compared to the raw spectrum, the smooth part in Figure 2.9A retains the major information of the raw spectrum, but some information is lost and transferred to the detail part given in Figure 2.9B. If all of the wavelet coefficients in both the smooth and detail parts of the wavelet spectrum are used, the reconstruction perfectly recovers the raw spectrum. However, the beauty of the WT is its capability to reconstruct the raw signal from a small part of the wavelet coefficients, which is the principal basis for data compression. As shown in Figure 2.10, the spectra reconstructed using the smooth part at levels 1 to 4 recover most of the information in the raw spectrum. Distortion is found in Figure 2.10F, where the smooth part at level 5 is used, which suggests that this compression level is too large.

[Figure 2.8 plot: stacked Panels A (Raw Spectrum) and B to F (Levels 1 to 5), Intensity versus Point Number]

Figure 2.8 Illustration of the multi-level operations of the pyramid algorithm for the forward WT using daublet 14. Panel A presents a raw IMS spectrum with 1024 points. Panels B to F present the wavelet spectra of WT levels 1 to 5, respectively, in which the spectrum to the left of the dashed line represents the smooth part and that to the right of the dashed line represents the detail part.

[Figure 2.9 plot: Panels A and B, Intensity versus Point Number]

Figure 2.9 The enlarged view of the smooth and detail parts of the WT spectrum at level 5 in Figure 2.8. Panel A presents the smooth part that is transformed from the smooth part of level 4. Panel B presents all of the detail parts from level 1 to level 5.

[Figure 2.10 plot: stacked Panels A (Raw Spectrum) and B to F (Levels 1 to 5) versus Point Number]

Figure 2.10 Reconstructed spectra from the smooth parts of the forward WT with levels 1 to 5 corresponding to Figure 2.8. Panel A presents a raw IMS spectrum with 1024 points. Panels B to F present the spectra reconstructed from the smooth parts at levels 1 to 5, respectively.

Chapter 3 Real-Time Self-Modeling Mixture Analysis

3.1 Introduction

Portable IMS sensors were developed in the 1980s and were first used in battlefield environments in 1992-93.89 Generally, these sensors have a bar reading to indicate the intensity level inside a predefined drift time window and can raise an alarm when target compounds are present in field detection. The predefined windows are selected according to the positions of the characteristic peaks of standard target compounds and must be programmed into the IMS instrument. The predefined-window method is simple and easy to use. However, the method is susceptible to interferences. In field detection, IMS spectra typically represent mixtures. The peak capacity of IMS is much lower than that of gas chromatography and mass spectrometry. Interferences from other compounds that give peaks in a target drift time window may trigger false positive alarms.

Some IMS sensors average the acquired signals to yield a single spectrum for each sample27. The averaged spectrum is used for identification of target compounds. To some extent, signal averaging can reduce the false positive alarms and the interference of instrumental noise. However, signal averaging discards temporal information regarding the dynamic response of the spectrometer and smaller short-duration features in the data may be diluted to near the noise level when averaged, which generates false negative alarms.

As discussed earlier, SIMPLISMA, one of the SMCR methods, has proven to be a useful tool to improve the selectivity and sensitivity of IMS measurements. The basis of applying SMCR is to have at least 2D variations underlying the data. A collection of IMS spectra ordered by acquisition time constitutes a 2D data matrix, in which the variations correspond to the spectral profiles in the direction of drift time (row variation) and the intensity variations in the direction of acquisition time (column variation). Before analytes are introduced into an IMS device, only the RIP appears in the spectrum. When the analytes are introduced, the RIP decreases as it transfers charge to the analyte, which is manifested in an increase in the analyte peak intensity. When the introduction of the analytes stops, the analyte peaks decrease to baseline (i.e., clear-down) and the RIP increases. Different analytes have different clear-down rates, which results in the different column variation profiles to which SMCR is applicable. In some applications, thermal desorption was used to add dissimilarity to the column variation profiles when two analytes have a similar clear-down rate.51

All of the previous studies have involved post-run analysis, in which chemometric methods were applied in a separate step after the data collection has been completed. IMS sensors can acquire spectra at rates between 10 and 30 Hz. At these rates, a data matrix with a million data points could be acquired in a minute. In post-run analyses, therefore, huge quantities of data need to be stored in a computer before data analysis. In addition, the temporal information during the measurement process cannot be modeled in real time. This project integrates data acquisition, SIMPLISMA analysis, and data visualization in a real-time package for dynamically modeling IMS measurements. The resolved component spectra and concentration profiles are displayed as IMS data are acquired through a user-friendly virtual instrument (VI) interface written with LabVIEW.

The concentration profiles indicate changes of the individual component concentrations in the instrument response with respect to sample acquisition time and the spectra indicate the characteristic peaks of the components with respect to ion mobility. This display allows subtle changes in the instrument's response to be easily visualized as the data are acquired.

3.2 Theory

The theory of SIMPLISMA with two different methods for measuring the independence of the candidate concentration profiles has been described earlier. The determinant-based SIMPLISMA was denoted SIMPLISMA-det and the other, which uses the Gram-Schmidt method, was denoted SIMPLISMA-gs. The real-time SIMPLISMA (RTSIMPLISMA) is based on SIMPLISMA-gs because the determinant calculation was not amenable to a real-time implementation due to computational constraints5. Another reason, as will be discussed in a later chapter, is that the determinant-based SIMPLISMA does not furnish a transition point in the relative purity curve.

Efficient computation is a key factor for real-time chemometrics. This project adapts the recursive SIMPLISMA algorithm5 with some other modifications to real-time analysis. First, the calculation of purity is simplified for real-time implementation. The real-time algorithm estimates the average noise level ($\sigma$) by

$$\sigma = \frac{1}{n_s}\sum_{k=1}^{n_s}\sqrt{\frac{\sum_{j=1}^{m}\left(x_{kj}-\bar{x}_k\right)^2}{m-1}} \qquad (3.1)$$

for which $n_s$ is the number of spectra, and $m$ is the number of points in the baseline region of a spectrum, where no peaks occur with respect to drift time (typically, 1.0-3.0 ms for portable IMS). The term $\bar{x}_k$ represents the mean of the $m$ points in the baseline region of the $k$th spectrum. The value of $\sigma$ is updated in real time. To eliminate channels that do not convey signal, the columns in the data matrix that have standard deviations less than $3\sigma$ are excluded as candidate pure variables. In other words, only those columns with standard deviations greater than $3\sigma$ are evaluated; therefore the damping factor $\alpha$ in Eq. (2.2) is not used in the modified algorithm. In addition, the real-time algorithm removes the term $\mu_j$ in Eq. (2.2) because dividing the standard deviation by the mean causes low-valued columns to have high priority in being selected as pure variables despite low independence weights. The modified purity calculation is

$$p_{ij} = \sigma_j \times w_{ij} \qquad (3.2)$$
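The noise estimate and the $3\sigma$ screening of candidate columns can be sketched as follows. This is a minimal illustration, assuming that the average noise level is the mean over spectra of the sample standard deviation of the baseline points; the function names, the baseline index range, and the toy data with a deterministic noise pattern are hypothetical. The sketch recomputes everything from the full matrix, whereas a real-time version would update the values as each spectrum arrives.

```python
import math


def sample_std(values):
    """Sample standard deviation with an (n - 1) denominator."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))


def average_noise_level(spectra, baseline):
    """Mean over spectra of the baseline standard deviation (assumed form)."""
    start, stop = baseline
    return sum(sample_std(s[start:stop]) for s in spectra) / len(spectra)


def candidate_columns(spectra, sigma, factor=3.0):
    """Indices of drift-time channels whose standard deviation over
    acquisition time exceeds factor * sigma."""
    n_points = len(spectra[0])
    return [j for j in range(n_points)
            if sample_std([s[j] for s in spectra]) > factor * sigma]


# Toy data: 20 spectra x 12 channels with a deterministic +/-0.01 "noise"
# pattern and one analyte channel (index 8) that grows with acquisition time.
spectra = []
for k in range(20):
    row = [0.01 if (k + j) % 2 == 0 else -0.01 for j in range(12)]
    row[8] += 0.05 * k
    spectra.append(row)

sigma = average_noise_level(spectra, baseline=(0, 5))
selected = candidate_columns(spectra, sigma)
print(sigma, selected)
```

Only the channel that carries a time-varying signal survives the screening, so the purity calculation is restricted to columns that can plausibly represent a component.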

Determination of the number of pure variables is important for applying SIMPLISMA. The appropriately selected pure variables should account for all of the components in the data. In this work, a simple method has been used to check whether a candidate variable represents a new component. The method compares the purity of the candidate variable ($p_{ij}$) with the purity of the first selected pure variable ($p_1$):

$$\beta = \frac{p_{ij}}{p_1} \qquad (3.3)$$

The relative purity of the first pure variable is 1. For the remaining pure variables, the relative purities are less than 1 and decrease in order of selection because the variables with the largest purities are always selected first. A value of $\beta$ greater than a threshold value $\beta_0$, called the new pure variable threshold (NPV threshold), indicates that a new pure variable has been found. In practice, if the NPV threshold is too high, new components might not be detected, while spurious components may be modeled if the threshold is too low.
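The relative purity test can be expressed in a few lines. The sketch below assumes that the purities are already available in order of selection; the function name and the purity values are hypothetical and were chosen only to show how the threshold changes the component count.

```python
def count_components(purities, npv_threshold):
    """Count the pure variables whose relative purity beta = p / p1 exceeds
    the NPV threshold. `purities` must be in order of selection."""
    p1 = purities[0]
    count = 0
    for p in purities:
        beta = p / p1
        if beta <= npv_threshold:
            break  # later candidates have even smaller relative purities
        count += 1
    return count


# Hypothetical purities of successively selected variables.
purities = [1.0, 0.12, 0.02, 0.009, 0.003]

for threshold in (0.008, 0.010, 0.040, 0.240):
    print(threshold, count_components(purities, threshold))
```

Because the relative purities decrease monotonically, the search can stop at the first candidate that falls below the threshold.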

3.3 Experimental Section

The IMS data used in this work were generated by a Chemical Agent Monitor (CAM) Type 482-301N (Graseby Ionics, Watford, Herts, UK) with a single modification. The chemistry of the reactant ions was based on water rather than acetone to make the instrument more sensitive and less selective. The CAM was interfaced with a single-processor Pentium Pro 200 MHz/64 MB RAM computer through a data acquisition board Type AT-MIO-16X (National Instruments, TX, USA). The operating system was Windows 98 Second Edition. This computer was used to collect, analyze, and display data.

Chemicals used for the experiments were diisopropyl methanephosphonate (DIMP) (Lancaster, 98%) and ethanol (95%). DIMP is a chemical simulant of soman (pinacolyl methyl phosphonofluoridate), a lethal nerve agent. The structures of DIMP and soman are given in Figure 3.1. Samples (3-5 drops) were dropped onto paper tissues (Delicate Task Wipers, Kimberly-Clark Corp., GA) and placed in 21×70 mm vials (Fisher Scientific, Miami, FL, CAT NO: 03-339-21F). The vapors were generated by evaporating the samples in the vials at room temperature.

[Chemical structure drawings: panels A and B.]

Figure 3.1 Structures for (A) diisopropyl methanephosphonate (DIMP), and (B) pinacolyl methyl phosphonofluoridate (soman).

All VIs were developed in-house using LabVIEW 5.1 (National Instruments, Austin, TX, USA) and Visual C++ 6.0 (Microsoft, Redmond, WA, USA) and were executed with LabVIEW 5.1. LabVIEW furnishes an intuitive graphical programming environment for data acquisition, instrument control, data analysis, and data visualization, and offers the advantages of high programming productivity and ease of use. The LabVIEW Code Interface Node (CIN) was used to implement SIMPLISMA. All of the SIMPLISMA algorithms were written in Visual C++ and the compiled module was called by the CIN in the VI. The real-time VI system continuously acquires data from the spectrometer and simultaneously extracts spectra and concentration profiles using SIMPLISMA. The graphical user interface is given in Figure 3.2. There are three graph sub-windows on the VI panel, which display the concentration profiles, the resolved spectra, and the single spectrum simultaneously. The negative values in the resolved models were kept unchanged. The negative peaks in the models mathematically reflect the correlations between selected pure variables. All calculations used single-precision (32-bit) floating-point arithmetic.

All spectra were acquired by the CAM in positive ion mode. The number of points for each spectrum is adjustable. The first 80 points of each spectrum, where the gating pulse is located, were discarded before applying real-time SIMPLISMA. The data acquisition frequency was 80 kHz throughout this work.

Bad spectra may be acquired when the system driver cannot read data from the data acquisition device fast enough to keep up with the device throughput, so that the onboard device buffer overflows. Bad spectra were detected by the absence of a gating pulse and discarded.

Figure 3.2 The graphical user interface for real-time SIMPLISMA.

3.4 Results and Discussion

The benefits of SIMPLISMA are evident in its application to the IMS data of ethanol. Figure 3.3 gives the data collected from ethanol vapor in a 3D surface plot, in which the RIP (6.39 ms) is not resolved from the product ion peak for ethanol (6.50 ms). The experiment was initiated by acquiring 60 spectra from the empty compartment. Then the vial containing ethanol was put near the instrument inlet and removed after 2 s. The acquisition stopped when 210 spectra were processed. The RTSIMPLISMA model with the NPV threshold set to 0.010 is given in Figure 3.4. Two components, the reactant ions and the ethanol, have been resolved from the overlapping peaks. The concentration changes of the two components are given in Figure 3.4A, in which the concentration of the reactant ions decreases while the ethanol concentration increases. Figure 3.4B presents the resolved spectra of ethanol and the reactant ions. The negative peak in the spectrum of ethanol suggests that this peak is partially correlated with the RIP.

Figure 3.5 gives the 3D surface plot of the DIMP data. The ion mobility spectra of DIMP are more complicated than those of ethanol in that monomer and dimer peaks appear in the IMS signal of this compound. The experiment was initiated by acquiring 30 spectra from the empty compartment. The vial containing DIMP was positioned near the inlet of the IMS and removed after about 2 s. As DIMP is sampled into the CAM, the intensity of the RIP decreases, and the monomer and dimer peaks concomitantly increase.

Figure 3.3 The 3D surface plot of the IMS data of ethanol. The data set was acquired from the CAM in positive ion mode.

[Plot: Panel A, Integrated Intensity (V) vs. Spectrum Number; Panel B, Relative Intensity vs. Drift Time (ms); traces: Reactant Ion Peak, Ethanol.]

Figure 3.4 RTSIMPLISMA resolved concentration profiles (Panel A) and component spectra (Panel B) for ethanol data.

Figure 3.5 The 3D surface plot of the IMS data of DIMP. The data set was acquired from the CAM in positive ion mode.

After the sample is removed from the inlet of the CAM, the intensity of the RIP increases, and the monomer and dimer peaks decrease to baseline. The clear-down rates differ for these two peaks. Therefore a three-component (reactant ion, monomer, and dimer) system should be observed. As given in Figure 3.6, which presents the final results after collecting 592 spectra, RTSIMPLISMA successfully resolved the three components. The NPV threshold $\beta_0$ was 0.010. The concentration profiles in Figure 3.6A indicate that the dimer peak decays faster than the corresponding monomer peak when the concentration of the analyte decreases. Figure 3.6B gives the spectra of the three components. The data size of 592 spectra was 3.6 MB (i.e., 592 spectra × 1500 points/spectrum × 4 bytes/point). The data size of the resolved model was less than 0.05 MB, of which about 7 KB was required for the concentration profiles (i.e., 3 components × 592 points × 4 bytes/point) and 18 KB for the resolved spectra (i.e., 3 components × 1500 points × 4 bytes/point). Hence, the real-time implementation greatly reduces the required storage capacity.
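The storage comparison follows directly from the stated dimensions. The short calculation below recomputes the sizes, assuming 4-byte single-precision values and decimal units (1 KB = 1000 bytes); the variable names are illustrative only.

```python
BYTES_PER_POINT = 4  # single-precision (32-bit) values

n_spectra = 592
n_points = 1500
n_components = 3

# Raw data matrix: every point of every spectrum is stored.
raw_bytes = n_spectra * n_points * BYTES_PER_POINT

# Resolved model: one concentration profile and one spectrum per component.
conc_bytes = n_components * n_spectra * BYTES_PER_POINT
spec_bytes = n_components * n_points * BYTES_PER_POINT
model_bytes = conc_bytes + spec_bytes

ratio = raw_bytes / model_bytes

print("raw data (bytes):", raw_bytes)
print("concentration profiles (bytes):", conc_bytes)
print("resolved spectra (bytes):", spec_bytes)
print("model total (bytes):", model_bytes)
print("raw/model ratio:", ratio)
```

Under these assumptions the resolved model occupies well under 1% of the space needed for the raw data matrix.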

As a comparison, Figure 3.7 gives the results of SIMPLISMA-det applied to the same data in post-run analysis. The damping factor ($\alpha$) is 5% of the maximum peak intensity of the mean of the data set. Interestingly, SIMPLISMA-det fails to differentiate the monomer and dimer peaks according to the resolved components given in Figure 3.7A and Figure 3.7B. The third component is a spurious component that represents noise. The criterion based on Eq. (3.1) removes the influence of noise, and the real-time algorithm could resolve the subtle difference in the concentration profiles.

[Plot: Panel A, Integrated Intensity (V) vs. Spectrum Number; Panel B, Relative Intensity vs. Drift Time (ms); traces: Reactant Ion Peak, Dimer Peak, Monomer Peak.]

Figure 3.6 RTSIMPLISMA resolved concentration profiles (Panel A) and component spectra (Panel B) for DIMP data.

[Plot: Panel A, Integrated Intensity (V) vs. Spectrum Number; Panel B, Relative Intensity vs. Drift Time (ms); traces: Component 1, Component 2, Component 3.]

Figure 3.7 SIMPLISMA-det resolved concentration profiles (Panel A) and component spectra (Panel B) for DIMP data.

Figure 3.8, Figure 3.9, and Figure 3.10 illustrate the real-time results after different numbers of spectra have been acquired. Figure 3.8 gives the results after acquiring 25 spectra, and Figure 3.9 and Figure 3.10 give the results after acquiring 40 and 135 spectra, respectively. When 25 spectra have been collected, there is only one component in the system because the CAM has not yet sampled the DIMP vapor. The concentration profiles in Figure 3.8A indicate no change. After 40 spectra have been collected, DIMP has started to be introduced to the CAM. However, the monomer and dimer peaks are correlated while the peaks increase. The clear-down process is typically the stage at which monomer and dimer ions can be resolved. Therefore only two components are resolved according to Figure 3.9. When 135 spectra were acquired, three components were resolved as given in Figure 3.10. Figure 3.10 is almost identical to Figure 3.6, which indicates model convergence after 150 spectra.

The effects of the NPV threshold on the real-time SIMPLISMA model have been investigated. Different thresholds should affect the number of resolved components in the model. This hypothesis is supported by the results in Figure 3.11, Figure 3.12, and Figure 3.13. By changing the NPV threshold $\beta_0$ from 0.010 to 0.008, 0.040, and 0.240, the component number changes from 3 to 4, 2, and 1, respectively. In Figure 3.11, when $\beta_0$ is reduced to 0.008, a spurious component (component 4) indicates that the NPV threshold is too “sensitive” for the data set, which is referred to as over-resolution. On the other hand, Figure 3.12 and Figure 3.13 indicate that the values of $\beta_0$ are too large and new components are not resolved. In Figure 3.12, $\beta_0$ was set to 0.040, and the monomer and dimer peaks were recognized as a single component (component 2) that was discernible from the RIP (component 1). When $\beta_0$ was set to a value as large as 0.240 (Figure 3.13), only one component (component 1) was resolved from the data set. The single component includes all of the peaks: the RIP, the monomer peaks, and the dimer peaks of DIMP.

[Plot: Panel A, Integrated Intensity (V) vs. Spectrum Number; Panel B, Relative Intensity vs. Drift Time (ms); trace: Component 1.]

Figure 3.8 RTSIMPLISMA model after processing 25 spectra for DIMP data. Panel A presents the concentration profiles and Panel B presents the component spectra.

[Plot: Panel A, Integrated Intensity (V) vs. Spectrum Number; Panel B, Relative Intensity vs. Drift Time (ms); traces: Component 1, Component 2.]

Figure 3.9 RTSIMPLISMA model after processing 40 spectra for DIMP data. Panel A presents the concentration profiles and Panel B presents the component spectra.

[Plot: Panel A, Integrated Intensity (V) vs. Spectrum Number; Panel B, Relative Intensity vs. Drift Time (ms); traces: Component 1, Component 2, Component 3.]

Figure 3.10 RTSIMPLISMA model after processing 135 spectra for DIMP data. Panel A presents the concentration profiles and Panel B presents the component spectra.

[Plot: Panel A, Integrated Intensity (V) vs. Spectrum Number; Panel B, Relative Intensity vs. Drift Time (ms); traces: Component 1, Component 2, Component 3, Component 4.]

Figure 3.11 RTSIMPLISMA resolved model for DIMP data with NPV threshold $\beta_0$ = 0.008. Panel A presents the concentration profiles and Panel B presents the component spectra.

[Plot: Panel A, Integrated Intensity (V) vs. Spectrum Number; Panel B, Relative Intensity vs. Drift Time (ms); traces: Component 1, Component 2.]

Figure 3.12 RTSIMPLISMA model for DIMP data with NPV threshold $\beta_0$ = 0.04. Panel A presents the concentration profiles and Panel B presents the component spectra.

[Plot: Panel A, Integrated Intensity (V) vs. Spectrum Number; Panel B, Relative Intensity vs. Drift Time (ms); trace: Component 1.]

Figure 3.13 RTSIMPLISMA model for DIMP data with NPV threshold $\beta_0$ = 0.24. Panel A presents the concentration profiles and Panel B presents the component spectra.

Hence, the value of $\beta_0$ can significantly affect the sensitivity for modeling new components. In the case of DIMP, setting $\beta_0$ within the range [0.01, 0.04) obtained a three-component model. In practice, $\beta_0$ can be set to a low value initially to explore a mixture, and obvious spurious components can then be eliminated by increasing $\beta_0$. The value of $\beta_0$ can be adjusted during the acquisition.

To improve the implementation efficiency of real-time SIMPLISMA, time performance analyses were performed. The real-time VI system recorded the local time while saving each spectrum. Subtracting the local time recorded for the first spectrum from that recorded for each spectrum gave the acquisition time of each spectrum. Time performance curves of the real-time VI were obtained by plotting acquisition time against spectrum number. Figure 3.14 gives the time performance curves for the VI that implemented SIMPLISMA-det, RTSIMPLISMA, and only data acquisition without running SIMPLISMA. For the first two cases, SIMPLISMA updates the model for each spectrum. The data points between 1 ms and 20 ms, a total of 1420 points, are used for the SIMPLISMA computations. Because RTSIMPLISMA only processes the columns with standard deviations greater than three times the noise level, the calculations for finding the pure variables were simplified. Therefore, the slope of curve 2 is less than that of curve 3 in Figure 3.14. However, the slopes of time performance curves 2 and 3 increase as the number of acquired spectra increases. In other words, the acquisition rate decreases as the data size increases. This effect might be detrimental for the implementation of RTSIMPLISMA for large data sets. Some spectra were lost due to the increase of computational burden.

[Plot: Time (s) vs. Spectrum Number; curves: 1. Data Acquisition Only, 2. RTSIMPLISMA, 3. SIMPLISMA-det.]

Figure 3.14 Comparison of time performance for real-time implementation of SIMPLISMA-det, RTSIMPLISMA, and data acquisition only.

This problem was partly addressed in two ways. One was to pass as little data as possible to the SIMPLISMA computations without removing useful information, i.e., to reduce the size of the input data as much as possible. For the DIMP case, for instance, the range of data was between 3 ms and 13 ms. The input data size could be reduced by 47% compared to the range of 1 to 19 ms. This range could be set on the VI panel. The other way to accelerate the real-time system was the batch processing method, i.e., to apply SIMPLISMA after every R (R ≥ 2) spectra are collected, where the batch number R can be set on the VI panel. Figure 3.15 gives the time performance curves for the VI that implemented the modified SIMPLISMA after collecting every 5, 25, and 50 spectra. The result indicates that a larger batch number R results in faster data processing. The time performances of different algorithms and values of R for processing 550 spectra are given in Table 3.1. Compared with 38.7 s for data acquisition without running SIMPLISMA, more time was required when running RTSIMPLISMA (360.9 s) and SIMPLISMA-det (587.3 s), both of which updated the model for each additional spectrum. Using RTSIMPLISMA with R set to 50 reduced the time usage to 65.7 s. Therefore, the batch processing method made the real-time SIMPLISMA implementation faster. However, batch processing may miss some real-time information. Moreover, some spectra were still lost according to the time performance curves. This problem will be studied in the later chapters.
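The batch processing scheme can be sketched as a loop that appends each spectrum and triggers a model update only after every R spectra. The sketch is illustrative: `run_batched` and `count_work` are hypothetical names, and the model update is replaced by a stand-in that merely records how many spectra it would have to process, under the assumption that each update works on all spectra acquired so far.

```python
def run_batched(spectra, batch_size, update_model):
    """Append spectra one at a time and call `update_model` on the
    accumulated data after every `batch_size` spectra."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    acquired = []
    updates = 0
    for spectrum in spectra:
        acquired.append(spectrum)
        if len(acquired) % batch_size == 0:
            update_model(acquired)
            updates += 1
    return updates


def count_work(n_spectra, batch_size):
    """Return (number of updates, total spectra processed over all updates)
    for a stand-in model update that touches every accumulated spectrum."""
    touched = []
    updates = run_batched(range(n_spectra), batch_size,
                          lambda data: touched.append(len(data)))
    return updates, sum(touched)


results = {r: count_work(550, r) for r in (1, 5, 25, 50)}
for r, (updates, work) in results.items():
    print(r, updates, work)
```

Under this assumption the accumulated work grows roughly quadratically with the number of spectra when R = 1, which is consistent with the increasing slopes of the time performance curves, and it falls sharply as R increases.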

[Plot: Time (s) vs. Spectrum Number; curves: 1. Data Acquisition Only, 2. R = 50, 3. R = 25, 4. R = 5.]

Figure 3.15 Effects on time performance by real-time implementation of RTSIMPLISMA for batches of R spectra.

Table 3.1 Time performances of different methods and batch number R for processing 550 spectra.

Method                   R     Time usage (s)
Data acquisition only    -     38.7
SIMPLISMA-det            1     587.3
RTSIMPLISMA              1     360.9
RTSIMPLISMA              5     104.4
RTSIMPLISMA              25    71.2
RTSIMPLISMA              50    65.7
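The timings in Table 3.1 can be restated as slowdown factors relative to acquisition alone and as effective spectrum rates. The calculation below uses only the tabulated values; the dictionary layout and variable names are illustrative.

```python
N_SPECTRA = 550
ACQUISITION_ONLY_S = 38.7

# (method, batch number R) -> time usage in seconds, from Table 3.1
timings_s = {
    ("SIMPLISMA-det", 1): 587.3,
    ("RTSIMPLISMA", 1): 360.9,
    ("RTSIMPLISMA", 5): 104.4,
    ("RTSIMPLISMA", 25): 71.2,
    ("RTSIMPLISMA", 50): 65.7,
}

# Time relative to acquiring the same spectra without any modeling.
slowdown = {key: t / ACQUISITION_ONLY_S for key, t in timings_s.items()}

# Effective number of spectra handled per second.
rate_hz = {key: N_SPECTRA / t for key, t in timings_s.items()}

for key in timings_s:
    print(key, round(slowdown[key], 2), round(rate_hz[key], 2))
```

Expressed this way, per-spectrum updating with RTSIMPLISMA takes roughly nine times as long as acquisition alone, whereas R = 50 brings the cost to less than twice the acquisition-only time.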

3.5 Conclusions

LabVIEW was used to construct a user-friendly VI system for IMS. A real-time SIMPLISMA algorithm has been successfully implemented to model IMS measurements. The real-time algorithm was compiled with Visual C++ and called by a CIN in the VI system. The integrated real-time system displays the original IMS data, resolved concentration profiles, and component spectra in the three graph sub-windows on the VI panel. The resolved concentration profiles give information about changes of component intensities while the resolved spectra indicate the characteristic peaks of the components with respect to ion mobility.

This work reduces the influence of noise according to the noise level calculated in real time, whereas the conventional SIMPLISMA uses a damping factor $\alpha$ to remove the influence of noise. Additionally, the number of components in a mixture is automatically determined. A major problem is that the time consumed by the SIMPLISMA computation increases as the data matrix grows during data acquisition. Some real-time data may be lost due to the increase in modeling time. The problem was partially solved by running real-time SIMPLISMA in batch processing. In a later chapter, data compression will be applied prior to SIMPLISMA, which would reduce the time required for modeling without losing important chemical information.

Chapter 4 RTSIMPLISMA Applied to Two-Dimensional Wavelet Compressed Ion Mobility Data

4.1 Introduction

Chapter 3 presented a modified SIMPLISMA (RTSIMPLISMA) that was able to dynamically model the IMS data in real time. Real-time modeling could alleviate storage burdens and provide a global perspective of the measurement process. However, a key issue for real-time processing is that the algorithms must be computationally efficient so that the processing does not lag behind the data acquisition. The demand for processing power of many algorithms may increase linearly or geometrically with respect to spectrum and resolution element numbers. If the algorithm consumes too large a share of computer resources, the data acquisition may be deprived of resources so that data stored on the acquisition board is overwritten before it is read by the computer. The problem is especially severe for processes with high acquisition rates. IMS is one such process and can acquire millions of data points in a minute. Batch processing could partially alleviate the problem; however, the model then could not be instantly updated for each input spectrum.

Another problem may occur with the determination of the number of pure components, or chemical rank, in the mixture data. Chemical rank is different from mathematical rank because it is relevant to the chemistry underlying the data set. The conventional SIMPLISMA usually combines two methods to judge whether all of the components are resolved. The first method is to reconstruct the data matrix ($\hat{\mathbf{X}}$) and compare it with the original data matrix ($\mathbf{X}$) by calculating the relative root of the sum of squares (RRSSQ) of the difference between the two matrices. The equation of RRSSQ is as follows:

$$\mathrm{RRSSQ} = \sqrt{\frac{\sum_{i=1}^{n_s}\sum_{j=1}^{n_x}\left(x_{ij}-\hat{x}_{ij}\right)^2}{\sum_{i=1}^{n_s}\sum_{j=1}^{n_x}x_{ij}^2}} \qquad (4)$$

for which $x_{ij}$ is the element in the $i$th row and $j$th column of $\mathbf{X}$; $\hat{x}_{ij}$ is the element in the $i$th row and $j$th column of $\hat{\mathbf{X}}$; $n_s$ is the number of spectra and $n_x$ is the number of points in each spectrum. If the inclusion of a new variable in the model does not significantly reduce the RRSSQ, then the variable is considered as the first spurious component. However, the RRSSQ method may not give a reliable result because some spurious components can also change the RRSSQ significantly. In addition, it is hard to establish a general criterion to decide whether an RRSSQ difference is significant or not. The second method is to visually inspect the concentration profile of the candidate pure variable. The first variable with a profile similar to signal noise is considered as the first spurious component. With these conventional methods, automatic determination of chemical rank is difficult because the modeling accuracy greatly relies on the knowledge and experience of the inspector.
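As a minimal sketch, RRSSQ can be computed for matrices stored as lists of rows, assuming that it is the square root of the ratio of the summed squared residuals to the summed squared data. The function name and the small test matrices are hypothetical.

```python
import math


def rrssq(x, x_hat):
    """Relative root of the sum of squares of the difference between the
    original matrix `x` and the reconstructed matrix `x_hat`."""
    residual = sum((a - b) ** 2
                   for row, row_hat in zip(x, x_hat)
                   for a, b in zip(row, row_hat))
    total = sum(a ** 2 for row in x for a in row)
    return math.sqrt(residual / total)


x = [[3.0, 4.0], [0.0, 0.0]]

print(rrssq(x, x))                          # perfect reconstruction
print(rrssq(x, [[3.0, 1.0], [0.0, 0.0]]))   # one element off by 3
print(rrssq(x, [[0.0, 0.0], [0.0, 0.0]]))   # nothing modeled
```

A value of 0 corresponds to a perfect reconstruction and a value of 1 to a model that explains none of the data, so the drop in RRSSQ after adding a component is the quantity whose significance must be judged.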

The method using the new pure variable (NPV) threshold ($\beta_0$) reported in the previous chapter was designed for automatic determination of the chemical rank in IMS data. The NPV threshold method was simple and easy to implement. However, it is difficult to find a single threshold that works well for various data sets. Under-resolution or over-resolution might occur for the RTSIMPLISMA model, which leads to the wrong number of components, if the threshold is not selected appropriately.

In this chapter, RTSIMPLISMA will be further modified to yield a more reliable estimate of the number of components to include in the model. Moreover, 2D wavelet compression (WC2) will be used to reduce the size of the input data for RTSIMPLISMA (WC2-RTSIMPLISMA), which allows faster implementation of RTSIMPLISMA, improves the RTSIMPLISMA models by removing noise from the data, and lowers the storage burden, which makes real-time modeling of large data sets possible. The RTSIMPLISMA model will be transformed to the uncompressed representation using the inverse wavelet transform. The effects of wavelet filter types and compression levels will be investigated. The ultimate goal was to generate a package of settings for WC2-RTSIMPLISMA that will be suitable for real-time processing of large IMS data sets.

The drug and bacterial samples from which ion mobility spectra were collected were chemically diverse. In addition, the two ion mobility spectrometers differed in operating principle. The IONSCAN used a pinhole inlet and an ion shutter, while the ITEMISER used a membrane inlet and a field-free region to trap ions prior to injection. The drug sample was a mixture of cocaine and heroin, which has the street name speed-ball. Mixed drug abuse has been on the rise.90, 91 The structures of cocaine and heroin are given in Figure 4.1. The bacterial data was provided by Buxton.92 Bacterial cells are not volatile. Identification of bacteria by IMS was achieved by in situ derivatization produced by the thermal hydrolysis of bacterial lipids with the desorber heater and tetramethylammonium hydroxide (TMAH) derivatizing reagent. TMAH was added to the IMS sample disk to hydrolyze and methylate lipids in a procedure similar to the one used by mass spectrometrists.93 The bacterium studied was the food-borne pathogen Bacillus cereus, which can be fatal for individuals with compromised immune systems.

[Chemical structure drawings: panels A and B.]

Figure 4.1 Structures for (A) cocaine, and (B) heroin.

4.2 Theory

Figure 4.2 presents the schematic of WC2-RTSIMPLISMA applied to a data set containing three components, or pure variables. In terms of IMS signals, row compression corresponds to the compression of the drift-time dimension of the data matrix while column compression refers to the acquisition-time dimension. By applying WC to each row of $\mathbf{X}$, the spectra are compressed in the row direction. The columns of the compressed data matrix are further compressed to furnish the 2D compressed matrix $\mathbf{X}_C$. For 2D compression notation, $l_r \times l_c$ $w_r$-$w_c$ refers to $l_r$-level compression using wavelet type $w_r$ applied to row compression, i.e., the drift time dimension, and $l_c$-level compression using wavelet type $w_c$ applied to column compression, i.e., the acquisition time dimension. The compression efficiency is evaluated with the compression factor (C.F.), which is measured as the ratio of the number of points retained in the compressed matrix ($N$) to the original data size ($N_0$):

$$\mathrm{C.F.} = \frac{N}{N_0} \qquad (5)$$
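The compression factor can be estimated from the compression levels alone under a simplifying assumption: each level retains only the smooth half of the coefficients, so every level halves the corresponding dimension. This dyadic assumption, the function name, and the matrix dimensions below are illustrative; with longer filters the retained count may differ slightly depending on how the boundaries are handled.

```python
def compression_factor(n_spectra, n_points, row_levels, col_levels):
    """C.F. = N / N0, assuming each compression level keeps only the smooth
    half of the coefficients along the compressed dimension."""
    kept_points = n_points // 2 ** row_levels    # drift-time dimension
    kept_spectra = n_spectra // 2 ** col_levels  # acquisition-time dimension
    retained = kept_spectra * kept_points        # N
    original = n_spectra * n_points              # N0
    return retained / original


# Hypothetical 1024 x 1024 data matrix compressed 3 levels along the rows
# and 2 levels along the columns.
cf = compression_factor(1024, 1024, row_levels=3, col_levels=2)
print(cf)
```

For dimensions that are powers of two, the factor reduces to 1 / 2^(l_r + l_c), so the 3 × 2 example retains 1/32 of the original points.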

Instead of using the raw data as input, WC2-RTSIMPLISMA directly processes the compressed data $\mathbf{X}_C$. The matrix $\mathbf{X}_C$ in the wavelet domain can be decomposed into concentration profiles and component spectra in the wavelet domain, denoted as $\mathbf{C}_C$ and $\mathbf{S}_C$, respectively.

$$\mathbf{X}_C = \mathbf{C}_C \mathbf{S}_C^{T} \qquad (6)$$

The RTSIMPLISMA model in the wavelet domain is inversely transformed to the uncompressed representation.

[Diagram: X is reduced by row compression and column compression to X_C; RTSIMPLISMA resolves C_C and S_C^T; the inverse WT yields C and S^T.]

Figure 4.2 Schematic diagram of the implementation principle of the WC2-RTSIMPLISMA algorithm.

RTSIMPLISMA is further modified for better determination of pure components in IMS data. First, the calculation of purity is modified to:

$$p_{ij} = \sum_{k=1}^{n_s}\left(x_{kj}-\bar{x}_j\right)^2 \times w_{ij} \qquad (4.7)$$

The index $k$ represents the spectrum number. The first part of the equation is proportional to the variance of variable $j$, whereas the standard deviation was used in the algorithm in the previous chapter. Furthermore, the threshold method was revised for better determination of the number of components, $n_c$. The new thresholding method developed in this work is based on the following observations. First, the standard deviation of the concentration profile of a real component should be greater than three times the noise level of the data set, which is estimated from the data points within 1.5 to 3.0 ms with respect to drift time for IMS data. Second, the purity of the last real component should be less than one percent of that of the first pure variable, if the first rule is satisfied. Finally, the difference of the relative purities of two adjacent components in terms of $\Delta \log \beta$ should be greater than a threshold ($\Delta_0$).

The relative root-mean-square error of RTSIMPLISMA spectra (RRMSES) and the relative root-mean-square error of RTSIMPLISMA concentration profiles (RRMSEC) are used to assess the model accuracy. RRMSES compares the reconstructed WC2-RTSIMPLISMA component spectra ($\hat{\mathbf{S}}$) with the RTSIMPLISMA spectra without compression ($\mathbf{S}$):

$$\mathrm{RRMSES} = \sqrt{\frac{\sum_{i=1}^{n_c}\sum_{j=1}^{n_x}\left(\hat{s}_{ij}-s_{ij}\right)^2}{\sum_{i=1}^{n_c}\sum_{j=1}^{n_x}s_{ij}^2}} \qquad (8)$$

Likewise, RRMSEC is calculated by:

$$\mathrm{RRMSEC} = \sqrt{\frac{\sum_{i=1}^{n_c}\sum_{k=1}^{n_s}\left(\hat{c}_{ik}-c_{ik}\right)^2}{\sum_{i=1}^{n_c}\sum_{k=1}^{n_s}c_{ik}^2}} \qquad (9)$$

This assessment approach reflects the relative errors of RTSIMPLISMA models and makes the errors from different data sets comparable.

4.3 Experimental Section

Two different ion mobility spectrometers were used. Both were interfaced with computers through data acquisition devices (National Instruments, TX, USA). Homemade virtual instrument (VI) programs were implemented in LabVIEW 6.02 (National Instruments, TX, USA) to acquire data from the IMS instruments. The average baseline signal was subtracted from the original signals before storing. The baseline region was located from 1.5 ms to 3.0 ms of the IMS spectrum, a region where peaks usually do not occur.
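The baseline correction described above can be sketched as follows. This is only an illustration, not the LabVIEW implementation: it assumes that drift time zero coincides with the first sampled point, so that a time in milliseconds maps to an index through the acquisition rate, and the function name `subtract_baseline` and the synthetic spectrum are hypothetical.

```python
def subtract_baseline(spectrum, sample_rate_hz=80_000.0, start_ms=1.5, end_ms=3.0):
    """Subtract the mean of the baseline window from every point of one spectrum.

    Assumes index = drift time x sample rate (drift time zero at index 0).
    """
    first = int(round(start_ms * 1e-3 * sample_rate_hz))
    last = int(round(end_ms * 1e-3 * sample_rate_hz))
    window = spectrum[first:last]
    if not window:
        raise ValueError("baseline window lies outside the spectrum")
    baseline = sum(window) / len(window)
    return [value - baseline for value in spectrum]


# Synthetic 1500-point spectrum (hypothetical): 0.2 V offset plus one peak.
raw = [0.2] * 1500
raw[600] = 1.2
corrected = subtract_baseline(raw)
print(round(corrected[130], 6), round(corrected[600], 6))
```

At the 80 kHz acquisition rate quoted in this section, the 1.5 to 3.0 ms window corresponds to points 120 through 239 under the stated indexing assumption.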

The first spectrometer is an ion trap mobility spectrometer (ITMS), the ITEMISER contraband detection and identification system (Ion Track Instruments, Inc., Wilmington, MA, USA). The ITMS system was interfaced with a laptop with a single 850 MHz Pentium III processor and 384 MB of memory through a PCMCIA card (Type DAQCard-AI-16XE-50). The operating system was Windows 2000-SP2. The drug data set was collected in positive polarity, which is the conventional mode of analysis for drugs. The acquisition rate was 80 kHz, and each spectrum consisted of 1500 points.

The second spectrometer is a Barringer IONSCAN 350 (Barringer Instruments Inc., New Jersey, USA). The IONSCAN was interfaced with a single processor PII 200 MHz/64 MB RAM computer through a data acquisition board (Type AT-MIO-16XE-10). The operating system was Windows 98 Second Edition. The IONSCAN was operated in positive polarity. Data sets were collected for this instrument by placing a small amount of a prepared sample solution on a sample filter. The sample filter was placed in a filter cartridge that is heated by the desorber to vaporize the sample into the instrument inlet. A single run for this instrument was limited to 20 s. The acquisition rate was 80 kHz, and each spectrum consisted of 1600 points.

Cocaine (SIGMA Chemical Co., St. Louis, MO, USA; Lot 97H1018) and heroin (Lipomed, Inc., Cambridge, MA, USA), both in the form of freebase, were prepared in absolute ethanol. Unlike the CAM, the ITEMISER does not respond to the presence of ethanol, because the ethanol response is suppressed by the ammonium dopant. The concentrations were 0.02 mg/mL and 0.20 mg/mL for cocaine and heroin, respectively. The drug mixture was prepared by adding 50 µL of each drug solution to an Eppendorf tube (Brinkmann Instruments, Inc., Westbury, NY, USA). A 10 µL aliquot of the mixture solution was placed on a sample trap for narcotics mode (Ion Track Instruments, Inc., Wilmington, MA, USA). The sample trap was exposed to air to evaporate the ethanol, which yielded samples comprising 0.1 µg of cocaine and 1.0 µg of heroin. Several hundred blank spectra were collected before the sample trap was placed into the thermal desorber of the ITMS. The data acquisition was halted when the ITMS returned to baseline response.
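As a check on the sample masses quoted for the drug mixture, the arithmetic can be written out. The sketch assumes that the two 50 µL volumes are additive (100 µL total) and that the whole 10 µL aliquot is deposited on the trap; note that 1 mg/mL equals 1 µg/µL.

$$m_{\text{cocaine}} = 0.02\ \mathrm{\mu g/\mu L} \times \frac{50\ \mathrm{\mu L}}{100\ \mathrm{\mu L}} \times 10\ \mathrm{\mu L} = 0.1\ \mathrm{\mu g}$$

$$m_{\text{heroin}} = 0.20\ \mathrm{\mu g/\mu L} \times \frac{50\ \mathrm{\mu L}}{100\ \mathrm{\mu L}} \times 10\ \mathrm{\mu L} = 1.0\ \mathrm{\mu g}$$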

Freeze-dried Bacillus cereus (#11778) was purchased from the American Type Culture Collection, Manassas, VA, USA. Specimens were rehydrated with brain-heart infusion broth with 3% NaCl and grown in the infusion for 24 h. The brain-heart infusion was purchased from Difco, Lot 9316001. Several drops of the broth were used to inoculate a brain-heart infusion agar plate (Sigma, Lot 90K0804). The agar was also autoclaved at 121 °C for 15 minutes to ensure sterility. The plate was left at room temperature for 48 h. Bacillus cereus cells were placed on an IMS sample filter (Barringer Instruments, Part No. PL09045) along with 1 µL of 0.1 M TMAH (Sigma, Lot 18H0443). The sample filter was placed above the desorber heater on the IONSCAN to thermally hydrolyze the sample at 300 °C. The resulting volatile compounds were introduced into the IONSCAN by the carrier gas. The IONSCAN system used nicotinamide as an internal calibrant. The data acquisition stopped when the instrument returned to baseline response.

The MATLAB code for the conventional SIMPLISMA algorithm, which is used only in Section 4.4.3, was obtained from Windig.94 All the other programs were written at Ohio University and compiled with Borland C++ 5.02. MATLAB programs were written to perform statistical calculations. The programs were run on a desktop PC with a 1.2 GHz processor and 512 MB RAM. The operating system was Windows 2000-SP2. All calculations used single-precision (32 bit) floating-point arithmetic.

4.4 Results and Discussion

Two different data sets were used that represent the traditional application of drug detection and a newer application of characterizing bacteria. These data sets also had different signal and noise levels, and were useful for evaluating the WC2-RTSIMPLISMA algorithm. Because the wavelet compression algorithm used here can only process data of dyadic length, the data sets were culled to retain 1024 spectra for the Bacillus cereus data (i.e., bacterial data) and 512 spectra for the cocaine-heroin data (i.e., drug data), both of which had 1024 points in the drift time dimension. In this context, the raw data refer to the culled data sets, given in Figure 4.3 and Figure 4.4 as 3D surface plots, in contrast to the data compressed and reconstructed by WC. The number of components of both data sets is known from the chemistry underlying each data set. The drug data set includes three components, which represent the RIP, cocaine, and heroin, respectively. The bacterial data set was reported to have four components (i.e., TMAH, nicotinamide, a Bacillus cereus component, and another bacterial component) according to previous studies.23

4.4.1 Conventional SIMPLISMA Models

Figure 4.3 The cocaine-heroin data set comprised 1024 spectra on a 3D surface plot (acquired from the ITEMISER® ITMS in positive mode).

Figure 4.4 The TMAH-preprocessed Bacillus cereus data set comprised 1024 spectra on a 3D surface plot (acquired from the Barringer IONSCAN 350 spectrometer in positive mode).

The conventional SIMPLISMA accurately modeled the raw data sets with a damping factor α of 0.3 of the maximum peak intensity of the mean spectrum for the bacterial data and 0.05 for the drug data unless otherwise stated. The bacterial data set required a larger value for α, because it was noisier than the drug data set, a consequence of the spectra being obtained from the IONSCAN®. The noise level of the drug data set was 5.5×10⁻³ V (0.21% of maximum RIP intensity) and that of the bacterial data set was 5.4×10⁻² V (1.3% of maximum RIP intensity). The SIMPLISMA models are given in Figure 4.5 and Figure 4.6, respectively, where the components are ordered by purity values. In the spectral plots, the lower abscissa corresponds to drift time and the upper abscissa to the reduced mobilities, which are calculated using cocaine (1.16 cm² V⁻¹ s⁻¹)25 and nicotinamide (1.86 cm² V⁻¹ s⁻¹)23 as the calibrant ions for the drug and bacterial data, respectively. Negative peaks that indicate correlations among the pure variables may occur in the spectra. The drug data set yielded a three-component model that comprised the ammonium reagent, cocaine, and heroin peaks. The reactant ions are formed from ammonia, which is an internal dopant in the ITMS. This ion suppresses signals arising from substances with lower proton affinities and transfers charge to drugs that have proton affinities comparable to that of ammonia. Each IMS spectrum is closed in that the ion current (i.e., spectral intensities) integrates to a constant value.
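The conversion from drift time to reduced mobility with a single calibrant ion can be sketched as below. The sketch assumes the usual single-calibrant relation, in which the product of reduced mobility and drift time is constant under fixed instrument conditions; the function name and the drift times in the example are hypothetical and are not values from the data sets.

```python
def reduced_mobility(drift_time_ms, calibrant_drift_time_ms, calibrant_k0):
    """Reduced mobility from a calibrant ion: K0 = K0_cal * t_cal / t.

    Assumes both ions are measured under identical conditions, so that
    K0 * drift time is the same constant for the calibrant and the analyte.
    """
    if drift_time_ms <= 0 or calibrant_drift_time_ms <= 0:
        raise ValueError("drift times must be positive")
    return calibrant_k0 * calibrant_drift_time_ms / drift_time_ms


# Hypothetical example: the calibrant (cocaine, K0 = 1.16 cm2 V-1 s-1) is
# assumed to appear at 9.0 ms; an unknown peak is assumed to appear at 10.5 ms.
k0_unknown = reduced_mobility(10.5, 9.0, 1.16)
print(f"K0 = {k0_unknown:.2f} cm2 V-1 s-1")
```

Because the relation is a simple inverse proportionality, the upper (mobility) abscissa of such plots is nonlinear in drift time, which is consistent with the unevenly spaced mobility ticks in the figures.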

Because ionization occurs through charge transfer reactions, the RIP decreases concomitantly with the increase of analyte peaks in Figure 4.5A and Figure 4.6A. In Figure 4.5B, the three small peaks from 10 to 13 ms in the cocaine spectrum may be related to cluster ions that formed during the analysis. The bacterial data set was more complex and the SIMPLISMA model comprised four components that corresponded to the nicotinamide reactant ions, peaks pertaining to the TMAH derivatizing agent, and two peaks that corresponded to bacteria. Figure 4.6A gives the concentration profiles. The TMAH profile increases rapidly, because it is a volatile compound. The slow increase in height of the bacterial peaks indicates the reaction rate of the thermal hydrolysis/methylation of the lipids in Bacillus cereus cells. The resolved SIMPLISMA spectra of all four components are given in Figure 4.6B.

Figure 4.5 Conventional SIMPLISMA model from the original cocaine-heroin data set (three-component model). (A) Concentration profiles. (B) Component spectra. [Panel A: integrated intensity (V) versus spectrum number and acquisition time (s). Panel B: relative intensity versus drift time (ms) and reduced mobility (cm² V⁻¹ s⁻¹). Traces: RIP, cocaine, heroin.]

Figure 4.6 Conventional SIMPLISMA model from the Bacillus cereus data set (four-component model). (A) Concentration profiles. (B) Component spectra. [Panel A: integrated intensity (V) versus spectrum number and acquisition time (s). Panel B: relative intensity versus drift time (ms) and reduced mobility (cm² V⁻¹ s⁻¹). Traces: calibration peak, TMAH, bacterial component, Bacillus cereus.]

4.4.2 Optimization of WC2-RTSIMPLISMA

The relative purity curves, which plot the logarithm of the relative purities with respect to component number, are given in Figure 4.7, Figure 4.8, and Figure 4.9. The relative purity curves of determinant-based SIMPLISMA are given in Figure 4.7, and those of Gram-Schmidt-based SIMPLISMA in Figure 4.8 (drug data) and Figure 4.9 (bacterial data). In the figures, SIMPLISMA-gs, RTSIMPLISMA-s, and RTSIMPLISMA-v correspond to the purity calculation method using equation (2.2), the method using standard deviation (equation (3.2)), and the one using equation (4.7), respectively. WC2 refers to 4 × 4 daublet 14-daublet 4 compression. From Figure 4.7, the determinant-based SIMPLISMA differs from the Gram-Schmidt-based SIMPLISMA in that the former does not have a clear transition point where the slope of the relative purity curve changes, which makes it unsuitable for accurately determining the number of components (nc) in the model. For the latter, a threshold based on calculating ∆ log β after the transition point in the relative purity curves discloses the number of components in the model. The components before the transition point are real components and furnish larger purity values, while those afterwards correspond to spurious components. The transition points for the drug data set for all four of the Gram-Schmidt methods are obvious. The transition point for the RTSIMPLISMA using variance is the most apparent. The relative purity curves for the raw drug data and the 2D compressed drug data almost overlap each other for the first six points, which suggests that the compression has an insignificant effect on the model convergence when noise is relatively low in the data. However, the curve of the compressed bacterial data diverges from that of the raw data at the fifth component. Generally, the first spurious component corresponds to the distribution of noise across the spectra. The 2D wavelet compression removed high frequency noise from the bacterial data set and reduced the relative purity of the spurious components, which enhances the difference in relative purities between the chemical and the spurious components. Therefore, RTSIMPLISMA-v prevailed over the other methods and was selected for further study.

Figure 4.7 Relative purity curves of determinant-based SIMPLISMA for the drug and bacterial data sets. [Plot of log β versus component number.]

Figure 4.8 Relative purity curves of Gram-Schmidt-based SIMPLISMA for the drug data set. (SIMPLISMA-gs: SIMPLISMA using Gram-Schmidt method for raw data; RTSIMPLISMA-s: RTSIMPLISMA using standard deviation for purity calculation for raw data; RTSIMPLISMA-v: RTSIMPLISMA using variance for purity calculation for raw data; WC2-RTSIMPLISMA-v: RTSIMPLISMA using variance for purity calculation for 2D compressed data; the highlighted points indicate transition points.) [Plot of log β versus component number.]

Figure 4.9 Relative purity curves of Gram-Schmidt-based SIMPLISMA for the bacterial data set. (SIMPLISMA-gs: SIMPLISMA using Gram-Schmidt method for raw data; RTSIMPLISMA-s: RTSIMPLISMA using standard deviation for purity calculation for raw data; RTSIMPLISMA-v: RTSIMPLISMA using variance for purity calculation for raw data; WC2-RTSIMPLISMA-v: RTSIMPLISMA using variance for purity calculation for 2D compressed data; the highlighted points indicate transition points.) [Plot of log β versus component number.]

As discussed earlier, RTSIMPLISMA uses a threshold ∆0 to determine whether the transition point has been reached. To investigate the threshold value, RTSIMPLISMA with different threshold values was applied to 1458 data sets that were generated by 4 × 4 2D compression of both the drug and bacterial data sets with 27 types of wavelets in each dimension (27 × 27 filter pairs for each of the two data sets). The twenty-seven wavelets included 15 from the Daubechies family (daublet 2, 4, …, 30), 5 from the coiflet family (coiflet 1, 2, …, 5), and 7 from the symmlet family (symmlet 4, 5, …, 10). The drug data set was compressed to 64 × 64 points and the bacterial data set to 32 × 64 points. The compression factor is 1/256. The results of percent correct nc with respect to ∆0 in Figure 4.10 reveal that the optimal ∆0 is located between 0.45 and 0.65, in which the average percent correct nc is 98.7% for the drug data set and 75.1% for the bacterial data set. Compared to the drug data set, the percent correct nc for the bacterial data set is lower and more sensitive to the change of ∆0, for two reasons. First, the bacterial data set is noisier and more complex. From the relative purity curve of RTSIMPLISMA-v in Figure 4.9, the transition point is not as clearly defined as it was for the drug data set in Figure 4.8. Second, the 2D compressions using some of the wavelet filters altered the relative purity curves of the bacterial data set more than those of the drug data. In other words, those 2D compressions distorted the data set and changed its chemical rank. For example, the 4 × 4 daublet 8-symmlet 8 compression reduced the number of components in the bacterial data set to three, in which the calibration peak and the TMAH peak could not be separated and were modeled as the same component. In practice, it is more informative to resolve extra components than to underestimate the number of components. Although the percent correct nc is the highest when ∆0 is equal to 0.55, the value 0.5 was selected as the optimal ∆0, because the percent correct nc for both values of ∆0 is similar, while a lower ∆0 tends to find more components when the correct nc cannot be resolved.

Figure 4.10 Percent correct number of components with respect to the threshold ∆0. [Plot of correct nc (%) versus ∆0 for the drug and bacterial data sets.]
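The counts and compression factors quoted in this section can be reproduced with a short sketch. It assumes that each wavelet compression level halves the number of points retained along one dimension, which is consistent with the compression factors listed for the different level patterns; the function names are hypothetical.

```python
from fractions import Fraction


def compression_factor(row_level, column_level):
    """Fraction of points retained after 2D compression.

    Assumes each level halves one dimension, so the factor is
    1 / 2**(row_level + column_level).
    """
    return Fraction(1, 2 ** (row_level + column_level))


def compressed_length(n_points, level):
    """Number of points left along one dimension after `level` halvings."""
    if n_points % (2 ** level) != 0:
        raise ValueError("length must be divisible by 2**level (dyadic length)")
    return n_points // (2 ** level)


# 27 wavelets: 15 daublets, 5 coiflets, 7 symmlets.
n_wavelets = 15 + 5 + 7
# One filter per dimension gives 27 x 27 pairs, applied to two data sets.
n_data_sets = n_wavelets * n_wavelets * 2

print(compression_factor(4, 4))    # 4 x 4 compression
print(compressed_length(1024, 4))  # 1024 drift-time points at level 4
print(n_data_sets)
```

Under these assumptions the 4 × 4 pattern gives 1/256, a 1024-point dimension shrinks to 64 points, and the filter pairs for the two data sets total 1458, matching the figures stated in the text.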

The effects of wavelet type and compression level on WC2-RTSIMPLISMA were evaluated using compression levels that ranged from 1 to 6 and 27 wavelet filters applied to one of the sample and drift time dimensions of the drug data set. RTSIMPLISMA was applied to the one-dimensional compressed data. The RRMSES and RRMSEC were calculated for each level and each wavelet filter. In total, 162 (i.e., 27 × 6) RRMSES and 162 RRMSEC values were obtained for each dimension. Two-factor analysis of variance (ANOVA) was used to evaluate the results. The reference Fcrit was obtained for a 5% significance level. Both RRMSES and RRMSEC were used for ANOVA. The ratios F/Fcrit are reported in Table 4.1. This statistic indicates the significance with which a factor contributes to the total variation. The F/Fcrit ratios for the wavelet filter and compression level factors were calculated for the two dimensions. The compression level has a greater impact on the RRMSES and RRMSEC than the wavelet type.
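The F statistics behind such a comparison can be sketched as follows. This is an illustration only: it assumes a two-factor ANOVA without replication (one value per wavelet and level combination, as the 27 × 6 design suggests), and the function name and the toy table are hypothetical. The critical value Fcrit is not computed here, because it requires an F-distribution quantile from a statistics library.

```python
def two_factor_anova_f(table):
    """F statistics for a two-factor ANOVA without replication.

    table[i][j] holds one response (e.g. an RRMSES value) for level i of
    factor A (rows, e.g. wavelet type) and level j of factor B (columns,
    e.g. compression level). Returns (F_A, F_B, df_A, df_B, df_error).
    """
    a, b = len(table), len(table[0])
    if a < 2 or b < 2 or any(len(row) != b for row in table):
        raise ValueError("need a rectangular table with at least 2 x 2 cells")
    grand = sum(sum(row) for row in table) / (a * b)
    row_means = [sum(row) / b for row in table]
    col_means = [sum(table[i][j] for i in range(a)) / a for j in range(b)]
    # Sums of squares for each factor, the total, and the residual.
    ss_a = b * sum((m - grand) ** 2 for m in row_means)
    ss_b = a * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in table for v in row)
    ss_error = ss_total - ss_a - ss_b
    df_a, df_b = a - 1, b - 1
    df_error = df_a * df_b
    ms_error = ss_error / df_error
    if ms_error <= 0.0:
        raise ValueError("residual variance is zero; F is undefined")
    return (ss_a / df_a) / ms_error, (ss_b / df_b) / ms_error, df_a, df_b, df_error


# Toy 3 x 3 table (hypothetical values).
toy = [[1.0, 2.0, 3.0], [2.0, 4.0, 3.0], [3.0, 3.0, 6.0]]
f_a, f_b, df_a, df_b, df_error = two_factor_anova_f(toy)
print(f_a, f_b, df_a, df_b, df_error)
```

For a 27 × 6 table the same formulas give 26, 5, and 130 degrees of freedom for wavelet type, compression level, and error, respectively. Dividing each F by the critical value at the 5% level for the matching degrees of freedom would then give a ratio of the kind reported in Table 4.1.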

To further investigate the effects of compression levels on modeling accuracy, the 27 wavelet filters were applied to the drug and bacterial data sets in both dimensions. Compression levels were varied from 2 to 5 for the drift and acquisition time dimensions, respectively. The average RRMSES for each compression level pattern was calculated by averaging all of the RRMSES values from the 729 different wavelet filter combinations that accurately determined nc. The deviation of the average RRMSES was obtained by t-statistics with a significance level of 5%. The percent correct nc of each level pattern is calculated by dividing the number of models with correct nc by the total number of models.

The average RRMSES, the compression factor, the minimum RRMSES for each compression level, and the corresponding wavelet type combination are given in Table 4.2 and Table 4.3. With the same row compression level, the average RRMSES values are approximately equal for different column compression levels, while the RRMSES values differ considerably for different row compression levels, suggesting that the row compression level is the more important factor with respect to the reconstruction errors. This finding can be explained by the pure variable selection mechanism of RTSIMPLISMA. In raw data sets, pure variables are selected from nx variables, i.e., 1024 in this case. Row compression, i.e., compression in the drift time direction, decreases the number of variables in the variable pool, while column compression does not change it. As a result, row compression may significantly change the selected concentration profiles (C), which can furnish a larger RRMSEC. The errors in C propagate to the spectra S and may increase RRMSES, because S is calculated from C. Therefore, two factors contribute to RRMSE; one is from the compression and the other from changes to the concentration profiles. The effect of the latter is more significant than the former in low level compression (less than level five). Consequently, the wavelet filter affects row compression and spectral reconstruction errors more significantly than column compression. The spectral dimension is more important for optimizing compression level and wavelet filters than the acquisition time dimension.

Table 4.1 Contribution of wavelet type and compression level to the variation of RRMSES and RRMSEC (values are F/Fcrit).

| Source of variation | Row compression, RRMSES | Row compression, RRMSEC | Column compression, RRMSES | Column compression, RRMSEC |
|---|---|---|---|---|
| Wavelet type | 2.0 | 1.3 | 1.1 | 1.3 |
| Compression level | 58.5 | 17.4 | 12.5 | 6.5 |

Table 4.2 Compression levels, compression factor (C.F.), percent correct nc, average RRMSES, minimum RRMSES, and the corresponding wavelet type for different compression levels for the drug data set.

| Compression levels (row × column) | C.F. | Percent correct nc (%) | Average RRMSES (%) | Minimum RRMSES (%) | Wavelet for minimum RRMSES, row | Wavelet for minimum RRMSES, column |
|---|---|---|---|---|---|---|
| 2 × 4 | 1/64 | 100 | 2.55±0.09 | 1.07 | daublet 16 | daublet 8 |
| 3 × 4 | 1/128 | 100 | 4.76±0.16 | 1.36 | daublet 22 | daublet 2 |
| 4 × 2 | 1/64 | 96 | 22.30±1.70 | 6.40 | daublet 14 | coiflet 5 |
| 4 × 3 | 1/128 | 97 | 22.21±1.69 | 6.28 | daublet 14 | coiflet 5 |
| 4 × 4 | 1/256 | 97 | 22.18±1.68 | 6.29 | daublet 14 | symmlet 6 |
| 4 × 5 | 1/512 | 78 | 22.16±1.87 | 6.29 | daublet 14 | symmlet 4 |
| 5 × 4 | 1/512 | 69 | 46.82±1.83 | 23.18 | daublet 10 | symmlet 7 |

Table 4.3 Compression levels, percent correct nc, average RRMSES, minimum RRMSES, and the corresponding wavelet type for different compression levels for the bacterial data set.

| Compression levels (row × column) | Correct nc (%) | Average RRMSES (%) | Minimum RRMSES (%) | Wavelet for minimum RRMSES, row | Wavelet for minimum RRMSES, column |
|---|---|---|---|---|---|
| 2 × 4 | 96 | 9.26±0.43 | 2.61 | daublet 24 | daublet 4 |
| 3 × 4 | 91 | 9.42±0.40 | 3.25 | symmlet 9 | daublet 4 |
| 4 × 2 | 76 | 19.39±1.16 | 10.32 | daublet 22 | daublet 4 |
| 4 × 3 | 83 | 21.08±1.27 | 10.34 | daublet 22 | daublet 6 |
| 4 × 4 | 77 | 22.00±1.26 | 10.39 | daublet 22 | daublet 4 |
| 4 × 5 | 25 | 31.50±2.16 | 11.81 | daublet 22 | daublet 2 |
| 5 × 4 | 54 | 60.01±2.52 | 38.51 | daublet 12 | daublet 4 |

The average RRMSES values from the bacterial data set were greater than those from the drug data set, and the bacterial data set furnished a lower percent correct nc than the drug data set, because the former is noisier. Wavelet compression removes noise from the data. RTSIMPLISMA can remove noise to some extent but not as well as wavelet compression. Therefore, higher noise levels may contribute in part to the higher RRMSES. Alternatively, the higher uncertainty in selecting pure variables from noisy data may also lead to a greater RRMSES. In addition, the greater the compression level, the lower the percent correct nc that was obtained, because greater compression may have altered the RTSIMPLISMA models.

The compression level and the wavelet filter were selected with respect to computation time and minimized reconstruction errors. For the compression level, computational efficiency correlates with reduced data set size, because SIMPLISMA has a greater computational burden than the wavelet compression. Summarizing from both tables, 4 × 4 compression is selected as the optimal compression level for IMS data, for which the RRMSES values are acceptable (around 6% and 10% for the drug and bacterial data sets, respectively). With the 4 × 5 compression, the percent correct nc is reduced considerably in comparison to 4 × 4 compression although the minimum RRMSES values are similar, which suggests that the compression error increases considerably from level 4 to level 5 compression. The optimal wavelet filter pair was daublet 14-daublet 4 instead of either daublet 14-symmlet 6 or daublet 22-daublet 4, which yielded the minimum RRMSES in the tables. This choice rests on two considerations. First, shorter wavelet filters are more computationally efficient; for example, the daublet 22 filter has 22 coefficients, while the daublet 4 filter has 4 coefficients. Second, with the daublet 14-daublet 4 filter pair, the RRMSES values for the drug and bacterial data sets are 6.60% and 11.58%, respectively, which are similar to the corresponding minimum RRMSES values.

Figure 4.11 The 4 × 4 daublet 14-daublet 4 compressed Bacillus cereus data set, comprising 32 × 64 points, in a 3D surface plot.

The WC2-RTSIMPLISMA models with RRMSES less than 10% matched well with the original RTSIMPLISMA models. However, 10% is not a strict threshold. The RRMSES criterion should be used in combination with comparative observation of the specific models. The 4 × 4 daublet 14-daublet 4 compressed bacterial data set, which consists of 32 × 64 points, is given in Figure 4.11. The critical chemical information remains in the compressed data. Compared to the raw data in Figure 4.4, the compressed domain furnishes a clearer representation of the chemical information, even though the data size was reduced to 1/256 of that of the raw data. The reconstructed WC2-RTSIMPLISMA models are given in Figure 4.12 and Figure 4.13. These reconstructed models are comparable with the original SIMPLISMA models in Figure 4.5 and Figure 4.6. The reconstructed models differ from the original ones in that they remove most of the noise while characterizing the same changes in the analytical signal. Note that, in Figure 4.13A, there are some pronounced variations of the bacterial concentration profiles of the WC2-RTSIMPLISMA model from those of the raw data, but the spectra in Figure 4.13B corresponded well. The variation of the concentration profiles was inconsequential. The data sets ($\hat{X}$) can be reconstructed by the product of the reconstructed WC2-RTSIMPLISMA resolved models, i.e., concentration profiles ($\hat{C}$) and component spectra ($\hat{S}$). The reconstructed bacterial data set is given in Figure 4.14, which significantly differs from the raw data in Figure 4.4. Most of the noise has been removed and important chemical information is represented more clearly.
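The reconstruction described above can be written compactly. The form below assumes the same bilinear model as the wavelet-domain decomposition, with concentration profiles stored as columns of $\hat{C}$ and component spectra as columns of $\hat{S}$; the dimension labels are added for illustration.

$$\underset{(n_s \times n_x)}{\hat{X}} = \underset{(n_s \times n_c)}{\hat{C}}\ \underset{(n_c \times n_x)}{\hat{S}^T}$$

Because only $n_c$ components are retained, variation that is not captured by the resolved components, which is mostly noise, is absent from $\hat{X}$.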

4.4.3 RTSIMPLISMA Applied to Windig Standard Data Sets

Figure 4.12 Reconstructed RTSIMPLISMA model from the 4 × 4 daublet 14-daublet 4 compressed drug data set. (A) Concentration profiles. (B) Component spectra. [Panel A: integrated intensity (V) versus spectrum number and acquisition time (s). Panel B: relative intensity versus drift time (ms) and reduced mobility (cm² V⁻¹ s⁻¹). Traces: RIP, cocaine, heroin.]

Figure 4.13 Reconstructed RTSIMPLISMA model from the 4 × 4 daublet 14-daublet 4 compressed bacterial data set. (A) Concentration profiles. (B) Component spectra. [Panel A: integrated intensity (V) versus spectrum number and acquisition time (s). Panel B: relative intensity versus drift time (ms) and reduced mobility (cm² V⁻¹ s⁻¹). Traces: calibration peak, TMAH, bacterial component, Bacillus cereus.]

Figure 4.14 The reconstructed data set from the 4 × 4 daublet 14-daublet 4 WC2-RTSIMPLISMA model from the bacterial data set.

Four standard industrial data sets have been published by Windig49,94 and investigated by a number of SMCR methods. Shen et al. measured the chemical rank of the data sets using subspace comparisons.95 De Braekeleer and Massart applied the orthogonal projection approach (OPA)96 to the data sets, and Grande and Manne evaluated a new pure variable searching method using convexity.97 The Windig data sets include Raman spectra of a reaction, FTIR microscopy spectra of a polymer laminate, NIR spectra of mixtures of five solvents, and time-resolved mass spectra of a mixture of three photographic color coupling compounds. In this section, RTSIMPLISMA was applied to the four data sets without compression. The transition threshold (∆0) was set to 0.5 for all of the data sets except the NIR data set, for which ∆0 was 0.45. While searching for pure variables by RTSIMPLISMA, the criterion that requires the standard deviation of a pure variable to be more than three times the noise level was not applied. The MATLAB code for conventional SIMPLISMA in this section was obtained from Windig.94 The conventional SIMPLISMA differs from SIMPLISMA-det in that it used equation (2.7) for data normalization while SIMPLISMA-det used equation (2.6). RTSIMPLISMA also used equation (2.6) to normalize data. As a result, the models from Windig SIMPLISMA and RTSIMPLISMA have different ordinate amplitudes in the spectra and concentration profiles, although the reconstructed data will be similar.

The Raman spectra (16 × 151) were obtained from a study of the formation of silica glasses from the reaction of tetramethyl orthosilicate, Si(OCH3)4 (TMOS), in aqueous methanol. Sixteen Raman spectra with 151 spectral channels were collected to study the hydrolysis and condensation of TMOS during the process. Hydrolysis products (Si(OCH3)n(OH)4−n, n = 1, 2 and 3) and two condensation products (Si(OSi)(OR)3 and Si(OSi)2(OR)3, R = CH3 or H) are formed under the normal reaction conditions. The detailed description of the reaction is available elsewhere.98

Figure 4.15 RTSIMPLISMA relative purity curve for the Windig Raman data set. The transition point is highlighted. [Plot of log β versus component number.]

The RTSIMPLISMA relative purity curve of this data set is given in Figure 4.15. The fifth point is the transition point, suggesting that the number of components in this data set is four. The resolved spectra of the four components are given in Figure 4.16. The first component corresponds to the hydrolysis products Si(OCH3)n(OH)4−n, which have characteristic peaks at 673, 696 and 726 cm−1 for n = 3, 2 and 1, respectively. TMOS has a peak at 644 cm−1 that is resolved as the third component. The first condensation product, Si(OSi)(OR)3, has a peak at 608 cm−1 and a shoulder peak at 586 cm−1, and the second condensation product, Si(OSi)2(OR)3, has a peak at 525 cm−1. These two products are resolved as condensation products B and A in the RTSIMPLISMA model, respectively. The RTSIMPLISMA resolved concentration profiles are given in Figure 4.17A. The models agree well with the models by conventional SIMPLISMA that are given in Figure 4.17B and Figure 4.18.

The FTIR microscopy data set (17 × 81) was obtained from scanning the cross section of a 240 µm thick polymer laminate by FTIR microscopy with 81 spectral channels (1628 - 701 cm−1). The inner layer of the laminate was isophthalic polyester (IP) (2-3 µm) and the other layers consisted of polyethylene (PE) and polyethylene terephthalate (PET). The detailed description of the experimental setup is available elsewhere.99

Figure 4.16 RTSIMPLISMA resolved spectra for the Windig Raman data set. Hydrolysis products: Si(OCH3)n(OH)4−n; TMOS: Si(OCH3)4; Condensation product A: Si(OSi)2(OR)3; Condensation product B: Si(OSi)(OR)3. [Relative intensity versus wavenumber (cm−1), one panel per component.]

Figure 4.17 Conventional SIMPLISMA (Panel A) and RTSIMPLISMA (Panel B) resolved concentration profiles for the Windig Raman data set. [Integrated intensity versus spectrum number.]

Figure 4.18 Conventional SIMPLISMA resolved spectra for the Windig Raman data set (α = 0.03). Hydrolysis products: Si(OCH3)n(OH)4−n; TMOS: Si(OCH3)4; Condensation product A: Si(OSi)2(OR)3; Condensation product B: Si(OSi)(OR)3. [Relative intensity versus wavenumber (cm−1), one panel per component.]

Figure 4.19 RTSIMPLISMA relative purity curve for the Windig FTIR microscopy data set. The transition point is highlighted. [Plot of log β versus component number.]

The relative purity curve is given in Figure 4.19. The differences of the relative purities of adjacent components in terms of ∆ log β for the first five components are 1.60, 0.61, 1.00, and 0.03, respectively. The difference of relative purity between the fourth and the fifth components is clearly less than the others, which suggests that the fourth point is the transition point and that there are three components in the data set. The conventional SIMPLISMA and RTSIMPLISMA models are given in Figure 4.20, Figure 4.21, and Figure 4.22. Subtle differences can be found between the concentration profiles of IP (Figure 4.21) from SIMPLISMA and RTSIMPLISMA. The reason is that the two algorithms selected similar, but not identical, pure variables. SIMPLISMA selected variables 31, 15, and 36, which correspond to 1269, 1466 and 1223 cm−1, respectively, while RTSIMPLISMA selected 32, 15 and 35 (1269, 1466 and 1234 cm−1). However, the resolved spectra from SIMPLISMA and RTSIMPLISMA are similar to each other.
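The way the transition point follows from the ∆ log β values quoted for the FTIR data can be sketched as below. The sketch implements only the ∆ log β threshold rule, under one reading of the text: the transition point is the first point whose drop to the next point is smaller than ∆0, and the number of components is one less than the transition point. The function name is hypothetical, and the log β values are built from the four quoted differences with the first point set to zero.

```python
def find_transition_point(log_beta, delta0=0.5):
    """Locate the transition point on a relative purity curve.

    log_beta: log relative purities ordered by component number.
    Returns (transition_point, n_components) using 1-based point numbers,
    or None if every drop is at least delta0.
    Assumption: the transition point is the first point whose drop to the
    next point is smaller than delta0.
    """
    for i in range(len(log_beta) - 1):
        drop = log_beta[i] - log_beta[i + 1]
        if drop < delta0:
            return i + 1, i
    return None


# Curve rebuilt from the quoted differences 1.60, 0.61, 1.00 and 0.03.
ftir_log_beta = [0.0, -1.60, -2.21, -3.21, -3.24]
print(find_transition_point(ftir_log_beta, delta0=0.5))
```

With ∆0 = 0.5 the first three drops exceed the threshold and the fourth does not, so the sketch reports the fourth point as the transition point and three components, in agreement with the text. The noise-level and one-percent purity rules are not included here.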

The NIR data set (140 × 700) was obtained for mixtures of five solvents including 2-butanol, methylene chloride, methanol, dichloropropane, and acetone. Seventy different mixtures of the solvents were prepared, and two NIR spectra were collected for each mixture. In total, 140 spectra were collected with 700 spectral channels from 1100 to 2498 nm (resolution 2 nm). The detailed experimental procedure can be found elsewhere.100 This data set is challenging because the peaks are broad and severely overlapping, and the baseline was shifted along the wavelength direction.

The relative purity curve of the NIR data set is given in Figure 4.23. The value of ∆ log β between the sixth and seventh components is 0.09, which is significantly smaller than the values for the previous components, suggesting that the sixth point is the transition point and that there are five components in the data set. SIMPLISMA was not able to resolve the five components successfully. Windig applied SIMPLISMA to the positive part of the inverted second derivative data and resolved satisfactory spectra, which are given in Figure 4.24. The spectra agree well with the reference spectra. However, the spectra resolved by applying RTSIMPLISMA to the data set (Figure 4.25) were not in good agreement with the reference data, although the correct number of components was determined. Considerable correlations were found among the resolved components, which was also a problem for OPA.96 Poorer models were obtained when RTSIMPLISMA was applied to the positive part of the inverted second derivative data, suggesting that RTSIMPLISMA has difficulty resolving highly overlapping data sets with broad peaks and a nonzero baseline.

Figure 4.20 RTSIMPLISMA resolved spectra for the Windig FTIR microscopy data set. [Panels: PET, PE, and IP; axes: relative intensity versus wavenumber (cm−1).]

Figure 4.21 Conventional SIMPLISMA (Panel A) and RTSIMPLISMA (Panel B) resolved concentration profiles for the Windig FTIR microscopy data set. [Traces: PET, PE, and IP; axes: integrated intensity versus spectrum number.]

Figure 4.22 Conventional SIMPLISMA resolved spectra for the Windig FTIR microscopy data set (α = 0.03). [Panels: PET, PE, and IP; axes: relative intensity versus wavenumber (cm−1).]

Figure 4.23 RTSIMPLISMA relative purity curve for the Windig NIR data set. The transition point is highlighted. [Axes: log β versus component number.]

Figure 4.24 Resolved spectra for the Windig NIR data set with conventional SIMPLISMA applied on the positive part of the inverted second derivative data set (α = 0.1). (Panel A: methylene chloride; Panel B: butanol; Panel C: methanol; Panel D: dichloropropane; and Panel E: acetone.) [Axes: relative intensity versus wavelength (nm).]

The time-resolved mass spectra (20 × 739) were obtained from an experiment on a mixture of three photographic color coupling compounds (A, B, and C) dissolved in methanol. After evaporation of the solvent, the sample was introduced into a mass spectrometer on a probe with a heated filament. The mixture composition changed over time because of the different evaporation profiles of the three compounds, which yielded a mixture data set. The experimental details can be found elsewhere.101

The relative purity curve of the time-resolved mass spectrometry data set is given in Figure 4.26. The values of ∆ log β between points 3 and 4 and between points 2 and 3 are 0.45 and 0.22, respectively, both of which are less than the transition threshold of 0.5. However, the relative purity at point 4 is greater than 1% (i.e., log β is larger than −2.0), and the value of ∆ log β between points 4 and 5 (1.44) is large. Moreover, the value of ∆ log β between points 5 and 6 (0.04) is very small. Therefore, it is concluded that the fifth point is the transition point and that there are four components in the mixture spectra. Windig resolved three components from the spectra by applying SIMPLISMA along the column direction (i.e., the time direction) instead of the mass-to-charge ratio direction. The method was denoted TSIMPLISMA. The reference spectra of the three compounds, the TSIMPLISMA resolved spectra, and the RTSIMPLISMA resolved spectra are given in Figure 4.27, Figure 4.28, and Figure 4.29, respectively.

Compared to the reference spectra, both the TSIMPLISMA and RTSIMPLISMA resolved spectra represent the major features of the three compounds in the first three components. TSIMPLISMA resolved three components from the data set. However, RTSIMPLISMA resolved one extra component, which is in agreement with the results of the simplified Borgen method97 and subspace comparisons.95 The extra component is a pyrolysis product formed during the measurement process.97 The resolved concentration profiles of TSIMPLISMA and RTSIMPLISMA are given in Figure 4.30. The concentration profile of the pyrolysis product increases with time and is clearly different from those of the other components. This finding shows that the component is largely independent of the others and should be included in the models.

Figure 4.25 RTSIMPLISMA resolved spectra for the Windig NIR data set. Panels A, B, C, D, and E correspond to components 1, 2, 3, 4, and 5, respectively. [Axes: relative intensity versus wavelength (nm).]

Figure 4.26 RTSIMPLISMA relative purity curve for the Windig time resolved mass spectrometry data set. The transition point is highlighted. [Axes: log β versus component number.]
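The transition-point reasoning applied to the relative purity curves in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the RTSIMPLISMA implementation: the function name, the `purity_floor` argument (which encodes the 1% relative purity check invoked for the mass spectrometry data), and the absolute log β values are hypothetical. Only the threshold of 0.5 and the differences 0.22, 0.45, 1.44, and 0.04 are taken from the text.

```python
def find_transition_point(log_beta, threshold=0.5, purity_floor=-2.0):
    """Return the 1-based point number of the transition point, or None.

    A point qualifies when its log(relative purity) is already below
    `purity_floor` and the drop to the next point is below `threshold`.
    Both conditions together are an assumed reading of the text.
    """
    for i in range(len(log_beta) - 1):
        drop = log_beta[i] - log_beta[i + 1]
        if log_beta[i] < purity_floor and drop < threshold:
            return i + 1
    return None


# Hypothetical curve: the absolute values are invented, but the successive
# drops between points 2-3, 3-4, 4-5, and 5-6 are 0.22, 0.45, 1.44, and 0.04.
log_beta = [0.0, -1.0, -1.22, -1.67, -3.11, -3.15]

point = find_transition_point(log_beta)
n_components = point - 1  # the transition point itself is not a component
print(point, n_components)
```

With the purity floor disabled (`purity_floor=float("inf")`), the same curve would stop at point 2 because the drop of 0.22 is already below the threshold, which is why the additional purity check matters for this data set.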

4.5 Conclusions

A WC2-RTSIMPLISMA method was developed. RTSIMPLISMA automatically determines the number of components by locating the transition point of the relative purity curve. RTSIMPLISMA performed well for three of the four Windig standard data sets. The algorithm had difficulty resolving accurate models from the NIR data set, although the correct number of components was found. Drug and bacterial data sets were used to evaluate the WC2-RTSIMPLISMA method. The results showed that satisfactory models could be obtained with 4 × 4 daublet 14-daublet 4 compression prior to RTSIMPLISMA processing. The 2D compression results in a compression factor of 1/256 that retains the key chemical information in IMS spectra. Compared to row compression, compressing the acquisition time dimension has less influence on model accuracy. Future work will focus on the real-time implementation of the WC2-RTSIMPLISMA algorithm.

Figure 4.27 Reference spectra for the three photographic color coupling compounds in the Windig time resolved mass spectrometry data set. [Panels: Compound A, Compound B, and Compound C; horizontal axis: m/z.]

Figure 4.28 TSIMPLISMA resolved spectra for the Windig time resolved mass spectrometry data set (α = 0.03). [Panels: Compound A, Compound B, and Compound C; axes: relative abundance versus m/z.]

Figure 4.29 RTSIMPLISMA resolved spectra for the Windig time resolved mass spectrometry data set. [Panels: Compound A, Compound B, Compound C, and pyrolysis product; axes: relative abundance versus m/z.]

Figure 4.30 TSIMPLISMA (Panel A) and RTSIMPLISMA (Panel B) resolved concentration profiles for the Windig time resolved mass spectrometry data set. [Traces: Compounds A, B, and C in both panels, plus the pyrolysis product in Panel B; axes: integrated intensity versus spectrum number.]

Chapter 5 Real-Time Two-Dimensional Wavelet Compression and Its Application to Real-Time Self-Modeling of IMS Data

5.1 Introduction

The work in Chapter 4 developed a WC2-RTSIMPLISMA algorithm and found the optimal settings of the algorithm for processing ion mobility spectra. There, WC2-RTSIMPLISMA was implemented off-line. The dimensions of the data matrix were culled to powers of two because the multiple-level operation subsamples the smooth part by a factor of 2 at each successive level. Commonly, signals of non-dyadic length are padded to dyadic length with zeros or other values. However, padding adds to the computational burden and can introduce edge distortion into the reconstructed spectrum. In this work, the WC algorithm was modified so that it can efficiently compress data of arbitrary length without redundant padding. Removing the dyadic-length constraint improves computational efficiency by eliminating the extra computations introduced by padding.

In addition, a real-time wavelet compression algorithm is presented and integrated with RTSIMPLISMA, which affords real-time WC2-RTSIMPLISMA. The IMS data were compressed in both the drift time and acquisition time dimensions as they were acquired from the spectrometer. The integrated software package for real-time modeling is called Get Chemical Information Now (GCIN). The time performance of real-time WC2-RTSIMPLISMA was assessed by recording the time spent on processing, and the effects of real-time WC2 on RTSIMPLISMA were evaluated. The real-time WC2-RTSIMPLISMA algorithm was applied to the detection of low quantities of TNT and to the dynamic modeling of a mixture of three explosives. A novel method that uses reference data to improve the resolving power of the real-time WC2-RTSIMPLISMA algorithm was proposed. The method borrows an idea from chromatography. Drug data sets were used to evaluate the proposed method.

5.2 Theory

A set-aside approach was reported to handle non-dyadic signals for the complete wavelet transform.56 In that approach, the last point was left alone in the detail part, and the inverse transform was not described. In this work, a slightly different algorithm, hereafter referred to as the add-one method, was developed to efficiently compress data with an arbitrary number of points. The algorithm determines whether the number of smooth points at each level, denoted as $n_x^{l}$ for level $l$, is even. If $n_x^{l}$ is odd, then the last point is duplicated and concatenated to the end of the smooth part. The resulting smooth part is further transformed into smooth and detail parts. With the add-one algorithm, the padding is optimal with respect to the level of compression, and edge artifacts are avoided because the added point is duplicated from the end point.

For the inverse transform, the end point of the smooth part is discarded if the original number of points at that level is odd. For example, if the original length of a spectrum is 150, the traditional padding method requires 106 points to make the data dyadic in length (i.e., $2^8$) for the complete DWT, while only four duplications, at levels 1, 3, 5, and 6, need to be made using the add-one method.
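The add-one bookkeeping described above can be illustrated with a short Python sketch. It is a minimal example under stated assumptions: the function names are hypothetical, and the Haar filter is used as a stand-in because it keeps the code short, whereas this work uses daublet filters. The first function only tracks lengths, so it records the odd lengths at which a duplication occurs rather than committing to a level-numbering convention.

```python
import numpy as np


def add_one_lengths(n, levels):
    """Track the smooth-part length through `levels` add-one decompositions.

    Returns the final length and the list of odd lengths that required the
    end point to be duplicated before the next decomposition.
    """
    duplicated = []
    for _ in range(levels):
        if n % 2 == 1:            # odd length: duplicate the end point
            duplicated.append(n)
            n += 1
        n //= 2                   # the smooth part keeps half of the points
    return n, duplicated


def haar_forward(x, levels):
    """Multi-level Haar transform with add-one padding (Haar is an assumption)."""
    x = np.asarray(x, dtype=float)
    details, was_odd = [], []
    for _ in range(levels):
        odd = len(x) % 2 == 1
        if odd:
            x = np.append(x, x[-1])
        pairs = x.reshape(-1, 2)
        details.append((pairs[:, 0] - pairs[:, 1]) / np.sqrt(2.0))
        x = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2.0)
        was_odd.append(odd)
    return x, details, was_odd


def haar_inverse(smooth, details, was_odd):
    """Invert haar_forward, discarding the duplicated end point where needed."""
    x = smooth
    for d, odd in zip(reversed(details), reversed(was_odd)):
        x = (np.column_stack((x + d, x - d)) / np.sqrt(2.0)).ravel()
        if odd:
            x = x[:-1]
    return x


print(add_one_lengths(150, 8))    # complete transform of a 150-point spectrum
print(add_one_lengths(1420, 4))   # 4-level compression of a 1420-point spectrum
```

For a 150-point signal the sketch reports four duplications in total, compared with the 106 padded points needed to reach 256, and the inverse recovers the original 150 points.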

The implementation of WC2-RTSIMPLISMA in real time requires updating the compressed data matrix after each spectrum is acquired. Real-time row compression is relatively simple. For each IMS spectrum, data points are accumulated in a fixed-length memory buffer of the data acquisition device. The WC can be applied immediately after a complete spectrum is acquired. The compression of columns is different because the length in the column direction increases gradually during the course of the measurement. A simple strategy is to apply the pyramid WT algorithm to the entire data set after each spectrum is acquired. However, this involves redundant calculations because the old rows are transformed over and over. In this project, a new algorithm was designed to eliminate the redundancy, so that each incoming row is processed only once. This is achieved by exploiting the circular finite impulse response (FIR) filter technique from digital signal processing (DSP).102 First, the new method replaces the multi-level operations of the pyramid algorithm with one single-level operation. The goal is to find a vector $\phi$ from which a matrix $\Phi$ can be constructed so that the smooth part at level $l$ can be obtained from the raw signal $x^{(0)}$ by one single-level operation. The mathematical expression is given in equations (5.1) and (5.2).

$$\Phi = T^{a}(\phi : -b) = H : x^{(0)} \rightarrow x^{(l)} \qquad (5.1)$$

$$x^{(l)} = \Phi x^{(0)} \qquad (5.2)$$

for which the matrix $\Phi$ contains $a$ rows of circular translates of $\phi$ by multiples of $b$ ($a, b \in \mathbb{Z}$). (The term $T$ was defined in Chapter 2.) In other words, the smooth part at level $l$ can be obtained by circularly passing the original spectrum through the FIR filter defined by $\phi$. The vector $\phi$ can be deduced from equation (2.24). As a result, $\phi^{(l)}$ for calculating $x^{(l)}$ can be derived from the father wavelet $h$:

$$\phi^{(l)} = h, \quad l = 1$$
$$\phi^{(l)} = h \times T^{M}(\phi^{(l-1)} : -2^{l-1}), \quad l > 1 \qquad (5.3)$$

for which $M$ is the length of $h$ and $l$ is the compression level number.
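The idea of collapsing the multi-level pyramid into one FIR filter can be checked numerically. The Python sketch below is illustrative only and rests on assumptions: the function names are hypothetical, the filter is applied as a circular correlation $y_k = \sum_j f_j\, x_{(sk + j) \bmod n}$ (the indexing convention of Chapter 2 may differ), and the composite filter is built by summing copies of the previous-level filter shifted by multiples of $2^{l-1}$ and weighted by the daublet 4 coefficients.

```python
import numpy as np


def daublet4():
    """Father wavelet (scaling) coefficients of the 4-tap Daubechies filter."""
    s3 = np.sqrt(3.0)
    return np.array([1 + s3, 3 + s3, 3 - s3, 1 - s3]) / (4 * np.sqrt(2.0))


def composite_filter(h, level):
    """Build the single-level filter equivalent to `level` pyramid steps."""
    phi = np.array(h, dtype=float)
    for l in range(2, level + 1):
        step = 2 ** (l - 1)
        new = np.zeros(len(phi) + step * (len(h) - 1))
        for m, hm in enumerate(h):
            # add a copy of the previous filter shifted by m * 2^(l-1)
            new[m * step:m * step + len(phi)] += hm * phi
        phi = new
    return phi


def circular_smooth(x, f, stride):
    """y[k] = sum_j f[j] * x[(stride*k + j) mod n]."""
    n = len(x)
    starts = stride * np.arange(n // stride)[:, None]
    idx = (starts + np.arange(len(f))[None, :]) % n
    return x[idx] @ f


def pyramid_smooth(x, h, level):
    """Smooth part after `level` successive filter-and-subsample steps."""
    for _ in range(level):
        x = circular_smooth(x, h, 2)
    return x


phi4 = composite_filter(daublet4(), 4)
x = np.random.default_rng(0).normal(size=64)
same = np.allclose(pyramid_smooth(x, daublet4(), 4), circular_smooth(x, phi4, 16))
print(len(phi4), same)
```

Under these assumptions the level-four daublet 4 filter has $(2^4 - 1)(4 - 1) + 1 = 46$ taps, and one pass with a stride of 16 reproduces the four-level pyramid output on a signal whose length is a multiple of 16.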

Similar to FIR filtering in DSP, a circular buffer is used to minimize memory usage and improve data processing efficiency.103 For column compression, the circular buffer length is set to the length of $\phi^{(l)}$, while the buffer width is equal to the number of points in a row-compressed spectrum. The circular buffer contains the most recent data to be processed. A dynamic pointer always points to the data point that entered the buffer most recently. Since the buffer is circular, the address next to the dynamic pointer always refers to the memory occupied by the oldest data point, which will be overwritten by the incoming data.

The schematic of the circular buffer is given in Figure 5.1, in which Pn represents the dynamic pointer for the newly arriving data point and Pp represents the start point of the data to be processed. At the beginning of the acquisition, Pn and Pp point to the same position. As a new point is received, it is placed at the position pointed to by Pn. Afterwards, Pn is moved clockwise one position while Pp stays unchanged. Once Pn reaches the position pointed to by Pp, the buffer is full and the data in the buffer are convolved with the FIR filter. The convolution result is stored in another dynamically growing memory block. Afterwards, Pp is moved clockwise $2^{l}$ positions and Pn is moved clockwise one position. Since the value at the Pn position has been processed and is no longer useful, it will be overwritten by newly acquired data. The next convolution occurs when Pn reaches the position next to Pp. The above process repeats until the acquisition is stopped. If the buffer is not full when the acquisition is stopped, the end point in the buffer is duplicated to fill the unoccupied positions.

Figure 5.1 Circular buffer. As a new point is received, it is placed into the memory pointed to by pointer Pn. The start position of the data to be processed is located by Pp.
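The circular-buffer scheme can be sketched as a small Python class. This is a simplified illustration, not the Visual C++ module used in this work: the class and attribute names are hypothetical, absolute row counters stand in for the pointers Pn and Pp (their values modulo the buffer length give the buffer positions), and the flush rule in `finish` is an assumption chosen so that N input rows yield ceil(N / 2^l) compressed rows.

```python
import numpy as np


class StreamingColumnCompressor:
    """Compress the acquisition-time dimension one incoming row at a time."""

    def __init__(self, phi, step, width):
        self.phi = np.asarray(phi, dtype=float)   # FIR filter for level l
        self.step = step                          # 2**l rows per output row
        self.buf = np.zeros((len(self.phi), width))
        self.received = 0    # rows written so far (plays the role of Pn)
        self.start = 0       # first row of the next window (role of Pp)
        self.outputs = []

    def push(self, row):
        size = len(self.phi)
        self.buf[self.received % size] = row      # overwrite the oldest slot
        self.received += 1
        if self.received - self.start == size:    # window is full
            order = (self.start + np.arange(size)) % size
            self.outputs.append(self.phi @ self.buf[order])
            self.start += self.step

    def finish(self):
        """Pad unfinished windows by duplicating the last acquired row."""
        n_real = self.received
        if n_real:
            last = self.buf[(n_real - 1) % len(self.phi)].copy()
            while self.start < n_real:
                self.push(last)
        return np.array(self.outputs)


rng = np.random.default_rng(1)
phi = rng.normal(size=46)          # placeholder filter with 46 taps
rows = rng.normal(size=(100, 3))   # 100 row-compressed spectra of width 3
comp = StreamingColumnCompressor(phi, step=16, width=3)
for r in rows:
    comp.push(r)
print(len(comp.outputs), comp.finish().shape)
```

Each incoming row is written once and every output row costs a single filter application, so the work per spectrum does not grow with the number of spectra already acquired. With 46 taps and a step of 16, this flush rule gives 228 output rows for 3637 input rows.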

5.3 Experimental Section

The IMS data used in this work were generated by an ion trap mobility spectrometer (ITMS), the ITEMISER contraband detection and identification system (Ion Track Instruments, Inc., Wilmington, MA, USA). The ITMS system was interfaced with a laptop through a National Instruments (Austin, TX, USA) PCMCIA card (Type DAQCard-AI-16XE-50). The laptop used for the time performance experiments was equipped with an Intel Pentium® III 800 MHz processor and 128 MB of memory, and its operating system was Microsoft Windows 98 Second Edition. The laptop used for the other experiments was equipped with an Intel Pentium® III 1 GHz processor and 384 MB of memory, and its operating system was Microsoft Windows XP. The explosive data were collected in explosive (negative) mode, while the drug data were collected in narcotics (positive) mode. The acquisition rate was 80 kHz, and each spectrum consisted of 1500 points.

Chemicals used were urea nitrate, RDX (cyclotrimethylenetrinitramine), TNT (2,4,6-trinitrotoluene), heroin, MDMA (3,4-methylenedioxymethamphetamine), cocaine, and ethanol (absolute). The stock solution of each explosive and drug was prepared in ethanol (ca. 6.0×10^−1 mg/mL urea nitrate, 1.8×10^−2 mg/mL RDX, 3.6×10^−2 mg/mL TNT, 1.0×10^−1 mg/mL heroin, 1.0×10^−1 mg/mL MDMA, and 1.0×10^−1 mg/mL cocaine). The structures of MDMA and the explosives are given in Figure 5.2. The explosive or drug solution was placed on a multi-use sample trap (Ion Track Instruments, Inc., Wilmington, MA, USA, Part No. M0001166-E). The volume was 1.0 µL for each sample. The sample trap was exposed to air to evaporate the ethanol. Several hundred blank spectra were collected before the sample trap was placed into the thermal desorber of the ITMS. The thermal desorption time was 5 s.

All of the programs were written at Ohio University using LabVIEW 6.1 full version (National Instruments, USA) and Visual C++ 6.0 (Microsoft, Seattle, WA, USA). The in-house software package, called GCIN (Version 3.0), includes data acquisition, data visualization, and a user interface. An offline version of the GCIN software package was developed to study IMS data in post-run analyses. The offline version includes more features, such as the calculation and display of relative purity curves and conventional SIMPLISMA processing. A description of the software is given in Appendix C. The software user interface was written in LabVIEW. The real-time WC2-RTSIMPLISMA algorithm was written and compiled in Visual C++. The compiled module was called through the code interface node (CIN) of the VIs in LabVIEW. The real-time VI system continuously acquired data from the ion trap mobility spectrometer and simultaneously implemented 2D wavelet compression and RTSIMPLISMA. The compressed data and RTSIMPLISMA models were displayed on the VI panel in real time. The negative values in the RTSIMPLISMA models were retained because negative peaks indicate correlations between pure variables in the model and deviations from linearity (e.g., peak shape distortions). All of the calculations used single-precision (32-bit) floating-point arithmetic.

Figure 5.2 Structures for (A) urea nitrate, (B) cyclotrimethylenetrinitramine (RDX), (C) 2,4,6-trinitrotoluene (TNT), and (D) 3,4-methylenedioxymethamphetamine (MDMA).

5.4 Results and Discussion

5.4.1 Time Performance of Real-Time WC2-RTSIMPLISMA

The optimal wavelet type for row compression is daublet 14 and that for column compression is daublet 4, according to the previous studies. The optimal compression level is four. The optimal threshold ∆0 is 0.5 for RTSIMPLISMA. The father wavelets of daublet 14 and daublet 4 were given in Figure 2.7. Daublet 14 contains fourteen non-zero coefficients and daublet 4 has four non-zero coefficients. A wavelet with more non-zero coefficients generally furnishes better approximations but also incurs greater processing overhead. Therefore, the selection criterion for wavelet filters for real-time compression is biased towards the smallest set of filter points that can still adequately retain the characteristic trends in the data.

As described in the theory section, the row compression used multi-level decomposition with the add-one method. Each spectrum includes 1420 points, which can be compressed to 89 points with a 4-level compression that requires a single duplication at level three. The column compression used the FIR method. For the daublet 4 filter at level four, the vector $\phi^{(4)}$ in equation (5.3) that defines the FIR filter is given in Figure 5.3. The vector contains 46 non-zero coefficients. Therefore, the size of the circular buffer allocated for column compression was 16,376 bytes (i.e., 89 × 46 × 4).

The RIP is a prominent feature in the IMS blank spectra. In terms of SIMPLISMA, the RIP represents a pure component in IMS data. IMS data containing only the RIP component were used to study how real-time 2D compression improves the speed of SIMPLISMA modeling. The time performance curves are given in Figure 5.4 and Figure 5.5. Figure 5.4 gives the time performance curve for real-time SIMPLISMA without any compression. Polynomial fitting showed that the consumed time $t$ was a quadratic function of the spectrum number $n_s$, with the model below:

$$t = 0.00020\,n_s^{2} + 0.050\,n_s - 0.35 \qquad (5.4)$$

The processing time increases as the square of the number of spectra. The average processing rate was 2.20 spectra/s, which is much slower than the rate for acquisition only (20.6 spectra/s according to Figure 5.5). The rate for the last 100 spectra was slower still, at 1.20 spectra/s. Moreover, for acquisition only, the processing time was linear in the spectrum number ($t = 0.049\,n_s$). Figure 5.5 shows that compressing the data before the implementation of RTSIMPLISMA improves the computational performance. The average processing rate for 25,000 spectra with 2D 4 × 4 level compression was 18.6 spectra/s. As with data acquisition only, the consumed time was linear in the number of acquired spectra ($t = 0.052\,n_s - 23.7$). The results suggest that, with respect to time constraints, the real-time implementation of RTSIMPLISMA with the 4 × 4 WC2 compression was successful. The average rate of level 2 × 2 WC2-RTSIMPLISMA was 6.88 spectra/s, and the rate for the last 100 spectra was 3.44 spectra/s. Therefore, the level 2 × 2 WC2 compression also improved the speed compared to RTSIMPLISMA without compression, although the improvement was not as large as with the 4 × 4 compression.

Figure 5.3 The vector $\phi^{(4)}$ that defines the FIR filter for column compression. [Axes: amplitude versus point number.]

Figure 5.4 The time performance curve for RTSIMPLISMA without compression. [Axes: acquisition time (s) versus spectrum number.]

Figure 5.5 The time performance curves for data acquisition only and real-time WC2-RTSIMPLISMA. [Curves: acquisition only, 4 × 4 WC2-RTSIMPLISMA, and 2 × 2 WC2-RTSIMPLISMA; axes: acquisition time (s) versus spectrum number.]
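The rates quoted in this section can be cross-checked against the fitted models with a short Python sketch. It relies on two assumptions that are not stated in the text: the uncompressed run is taken to contain roughly 2000 spectra (the count at which equation (5.4) gives about 900 s), and the values 1214, 1345, and 3633 s are taken to be the total times for 25,000 spectra in the three runs of Figure 5.5. The function names are hypothetical.

```python
def time_no_compression(n_s):
    """Equation (5.4): consumed time (s) for RTSIMPLISMA without compression."""
    return 0.00020 * n_s**2 + 0.050 * n_s - 0.35


def average_rate(n_s):
    """Average processing rate (spectra/s) over the first n_s spectra."""
    return n_s / time_no_compression(n_s)


def rate_over_last(n_s, window=100):
    """Processing rate (spectra/s) over the last `window` spectra."""
    dt = time_no_compression(n_s) - time_no_compression(n_s - window)
    return window / dt


n_total = 2000  # assumed run length for the uncompressed experiment
print(round(average_rate(n_total), 2), round(rate_over_last(n_total), 2))

# Assumed total times (s) for 25,000 spectra in the three runs of Figure 5.5.
end_times = {"acquisition only": 1214, "4 x 4 WC2": 1345, "2 x 2 WC2": 3633}
for label, seconds in end_times.items():
    print(label, round(25000 / seconds, 2))
```

Under these assumptions the quadratic model gives an average rate close to 2.2 spectra/s and a rate near 1.2 spectra/s for the last 100 spectra, and the three total times correspond to about 20.6, 18.6, and 6.88 spectra/s, in line with the reported values.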

5.4.2 Enhanced IMS Measurement by Real-Time WC2-RTSIMPLISMA

Two data sets are used in this section to illustrate the potential of real-time WC2-RTSIMPLISMA for enhancing IMS measurements. The first is from a blank trap disk experiment. A multi-use blank trap disk was placed into the desorber of the ITMS. The disk was not removed until the major residue had evaporated and only the RIP appeared. The real-time GCIN program was then started, and the cleaned blank trap disk was inserted into the desorber after about 300 spectra had been collected. The disk was removed after 5 s. The process was stopped when about 800 spectra had been collected. The process was repeated three times (Blank 1, Blank 2, and Blank 3). The IMS data of Blank 3 are given in Figure 5.6. In this figure, a subtle change in the concentration profile of the RIP can be found around the 300th spectrum, but it is not obvious. Moreover, the average spectra of the three replicates are almost identical to the RIP, as presented in Figure 5.7. Therefore, signal averaging, which is commonly used for preprocessing IMS data before peak identification, does not provide any useful information. However, real-time WC2-RTSIMPLISMA was able to model the subtle change and reveal the chemical information underlying it. The final resolved spectra of the three replicates are given in Figure 5.8. The unknown component in the models accounts for the subtle change. The SIMPLISMA spectrum of the unknown component has two peaks at 4.01 and 4.80 ms. These small features can be revealed by closely examining the raw data. Figure 5.9 presents the variation profiles at different drift times for the Blank 3 data set. The variation profile at 2.00 ms represents the typical variation profile of noise across the data set. The profile at 3.60 ms corresponds to the change of the RIP during the course of the measurement. The intensity of the RIP decreased at 14 s, when the blank trap disk was inserted into the desorber, and increased at 19 s, when the disk was removed. In this period, the intensity of the peaks at 4.01 and 4.80 ms changed in the opposite direction to the RIP. The amplitude of the two peaks was very small, about 1-2% of the maximum intensity of the RIP. Therefore, this change can hardly be found in the 3D plot or the average spectrum. However, the change was modeled by real-time WC2-RTSIMPLISMA at 19 s, and the two-component model was reported until the end of the experiment (37 s).

Figure 5.6 IMS data set of blank trap disk on 3D surface plot. The data set was acquired from the ITEMISER ITMS in explosive mode with the implementation of WC2-RTSIMPLISMA in real time.

Figure 5.7 The average IMS spectra for three replicates of IMS measurement of a blank trap disk and the average spectrum of the data set that only has RIP. [Traces: Blank 1, Blank 2, Blank 3, and RIP; axes: intensity (V) versus drift time (ms).]

The other experiment is similar to the blank trap disk experiment. The only difference is that, instead of using a blank trap disk as the sample, 1.0 µL of an ethanol solution of TNT (ca. 3.6×10^1 pg TNT) was placed onto the trap disk and air-dried before the disk was inserted into the desorber. Two replicate data sets (TNT 1 and TNT 2) were collected for this experiment. The 3D surface plot of TNT 1 is given in Figure 5.10, and the average spectra of the data sets are given in Figure 5.11. As with the blank data set, one can hardly resolve the chemical information during the process by visual inspection of the two figures. Figure 5.10 does show that some change occurred after the sample disk was inserted into the desorber, because the RIP intensity decreased slightly at that point. However, real-time WC2-RTSIMPLISMA successfully captured the subtle change and resolved three components within 5 s after the sample was introduced. The resolved spectra of the two replicate data sets are given in Figure 5.12. The three components represent the RIP, the unknown blank component, and TNT, respectively. The peak at 6.38 ms is the TNT peak. Furthermore, a sample trap disk with 3.6 pg TNT was tested with the same procedure. The TNT peak was resolved in only two of the four replicates, suggesting that the repeatability of the method was reduced when the amount of TNT was at the pg level.

Figure 5.8 Real-time WC2-RTSIMPLISMA resolved spectra for the data sets from the three replicates of blank trap disk experiments. Panels A, B, and C correspond to Blank 1, Blank 2, and Blank 3 in Figure 5.7, respectively. [Components: RIP and unknown; axes: relative intensity versus drift time (ms).]

Figure 5.9 The variation profiles corresponding to different drift times for the raw data set of Blank 3 in Figure 5.7. Panels A, B, C, and D correspond to the drift times of 2.00, 3.60, 4.01, and 4.80 ms, respectively. [Axes: intensity (V) versus spectrum number and acquisition time (s).]

Figure 5.10 IMS data set of 3.6×10^1 pg TNT on 3D surface plot. The data set was acquired from the ITEMISER ITMS in explosive mode with the implementation of WC2-RTSIMPLISMA in real time.

Figure 5.11 The average IMS spectra for two replicates of IMS measurement of 3.6×10^1 pg TNT and the average spectrum of the data set that only has RIP. [Traces: TNT 1, TNT 2, and RIP; axes: intensity (V) versus drift time (ms).]

5.4.3 Real-Time Self-Modeling of IMS Data of Explosives

Three explosives were used to test the efficacy of real-time WC2-RTSIMPLISMA. An ethanol solution of urea nitrate (ca. 6.0×10^2 ng urea nitrate) was placed on the sample trap disk, and the acquisition was started. The disk was inserted into the desorber at 27 s and removed at 32 s. Then RDX (ca. 18 ng) was deposited onto the same disk, which was reinserted into the desorber at 106 s and removed at 111 s. Afterwards, TNT (ca. 36 ng) was deposited onto the same disk, which was reinserted into the desorber at 175 s and removed at 180 s. The VI stored the data points and acquisition times during the real-time process. The ITEMISER response for this experiment is given as a 3D surface plot in Figure 5.13. The data set comprised 3637 spectra with 1420 points per spectrum, all of which were processed in real time. The 2D wavelet compressed data contained 228 compressed spectra with 89 points per spectrum. The compression factor was 1/256.

Figure 5.12 Real-time WC2-RTSIMPLISMA resolved spectra for the two replicate data sets of 36 pg TNT. Panel A corresponds to TNT 1 in Figure 5.11, and Panel B to TNT 2. [Components: RIP, unknown, and TNT; axes: relative intensity versus drift time (ms).]

Figure 5.13 IMS data set of explosives (urea nitrate, RDX, and TNT) on a 3D surface plot. The data were acquired from the ITEMISER ITMS in explosive mode with the implementation of WC2-RTSIMPLISMA in real time.

The real-time process ended at 258.3 s, and the data size was 21 MB. Figure 5.14 gives the final concentration profiles, in uncompressed representation, from the modeling results of real-time WC2-RTSIMPLISMA. There are two RIP components in the model, one of which resulted from an instrumental drift time shift that occurred for several spectra between 200 and 250 s. RIP 1 is the normal RIP for the ITMS system, while RIP 2 was due to the drift time shift. RIP 1 decreased at 27 s, 106 s, and 175 s, when urea nitrate, RDX, and TNT, respectively, were added and depleted the reactant ions. The intensity of RIP 2 was very small before 200 s, suggesting that there was no significant drift time shift in this period. For RIP 2, spikes were found in the period from 200 to 250 s.

The real-time resolved spectra for different acquisition time zones are given in

Figure 5.15 (0 -177.0 s), Figure 5.16 (177.1 - 249.0 s), and Figure 5.17 (249.1 - 258.3 s).

The real-time WC2-SIMPLISMA was able to detect the added explosives in about 1-2 s after the sample trap with the corresponding explosives were placed into the desorber.

Only the spectral profile of RIP 1 was resolved and displayed on the VI panel before 28.2 s (Figure 5.15A). The two component model with RIP 1 and urea nitrate (Figure 5.15B) was replaced by a three component model (Figure 5.15C) very shortly (0.7 s). The unknown component probably resulted from blank response of the trap disk, the explosive residues on the trap, or the cluster ions formed by urea nitrate. The peak at 6.53 ms corresponds to the drift time of RDX might resulted from the RDX residue on the trap disk. Introducing RDX at 106 s suppressed the intensity of the unknown component and made it indiscernible. As a result, the components given in Figure 5.15D include RIP1, 168

Spectrum Number 500 1000 1500 2000 2500 3000 3500

RIP 1

11

6 Urea Nitrate 8 4 0 RDX

2 0 TNT Integrated Intensity (V) 2 0 RIP 2 2

-1

10 60 110 160 210 Acquisition Time (s)

Figure 5.14 Real-time WC2-RTSIMPLISMA resolved concentration profiles at the final point (258.3 s). 169

Point Number Point Number 100 500 900 1300 100 500 900 1300

A B 0.2 RIP1 RIP1 0.2 Urea Nitrate

0.1 0.1 Relative Intensity Relative Intensity

0.0 0.0

2 7 12 17 2 7 12 17 Drift Time (ms) Drift Time (ms) Point Number Point Number 100 500 900 1300 100 500 900 1300

C D RIP 1 RIP 1 Urea Nitrate Urea Nitrate 0.15 Unknown RDX 0.13

0.00 0.02 Relative Intensity Relative Intensity

-0.15 -0.09 2 7 12 17 2 7 12 17 Drift Time (ms) Drift Time (ms)

Figure 5.15 Real-time WC2-RTSIMPLISMA resolved component spectra at different acquisition time (Part I). (A: 0-28.2 s; B: 28.3-29.0 s, C: 29.1-108.1 s, and D: 108.2-

177.0 s). 170

Figure 5.16 Real-time WC2-RTSIMPLISMA resolved component spectra at different acquisition times (Part II). (177.1-249.0 s) Panels: RIP 1, urea nitrate, RDX, and TNT; axes: relative intensity versus drift time (ms) and point number.

Figure 5.17 Real-time WC2-RTSIMPLISMA resolved component spectra at different acquisition times (Part III). (249.1-258.3 s) Panels: RIP 1, urea nitrate, RDX, TNT, and RIP 2; axes: relative intensity versus drift time (ms) and point number.

urea nitrate, and RDX. The TNT spectrum was detected from 177.1 s (Figure 5.16). While all four resolved components (i.e., RIP 1, urea nitrate, RDX, and TNT) were retained in the model from 177.1 s to the end of the process, another RIP component (RIP 2) was resolved from 249.1 s.

The SIMPLISMA-det resolved concentration and spectral profiles from the raw data set are presented in Figure 5.18 and Figure 5.19. They agree with the final concentration and spectral profiles from real-time WC2-RTSIMPLISMA. Small differences remain between the two models because WC2 introduces minor distortion to the data set. In the final concentration profiles of RIP 1 and RIP 2 from real-time WC2-RTSIMPLISMA, the separate spikes due to the drift time shift were modeled as small broad peaks, in contrast to those from SIMPLISMA-det. From the above results, we can conclude that the real-time WC2-RTSIMPLISMA was able to sensitively model the chemical mixtures on line and consequently raise an alarm for any change in the surrounding chemical environment. Moreover, the instrumental status could be monitored as well.

5.4.4 Internal Reference Method for Real-Time WC2-RTSIMPLISMA

Five drug data sets were collected for this section. The experimental setup of the data sets is described in Table 5.1. The experiments were performed on the ITEMISER ITMS in positive (narcotics) mode. The sample volume was 1.0 µL.

An ion mobility spectrum from each of the MDMA, cocaine, and heroin data sets is given in Figure 5.20. The RIP appeared in all of the spectra because the quantity of the drugs was not large enough to completely deplete the reagent ions. MDMA has a characteristic peak at 6.68 ms, cocaine at 8.30 ms, and heroin at 9.17 ms.

Figure 5.18 SIMPLISMA-det resolved concentration profiles from the raw IMS data of explosives. Panels: RIP 1, urea nitrate, RDX, TNT, and RIP 2; axes: integrated intensity (V) versus acquisition time (s) and spectrum number.

Figure 5.19 SIMPLISMA-det resolved component spectra from the raw IMS data set of explosives. Panels: RIP 1, urea nitrate, RDX, TNT, and RIP 2; axes: relative intensity versus drift time (ms) and point number.

Table 5.1 Experimental setup for the data sets in Section 5.4.4. In the table, ti is the time when the sample was inserted into the desorber; ts is the time when the measurement stopped; ns is the total number of spectra collected. Sample volume was 1 µL and the sample disk was removed 5 s after it was inserted.

Data set         ti (s)   ts (s)   ns     Sample constituent
MDMA             8        35       1288   1.0 × 10² ng MDMA
Cocaine          11       42       754    1.0 × 10² ng cocaine
Heroin           7        35       1286   2.0 × 10² ng heroin
Drug Mixture A   -        238      3372   66 ng heroin, 33 ng MDMA, and 33 ng cocaine
Drug Mixture B   4        29       1080   33 ng heroin, 33 ng MDMA, and 33 ng cocaine

Figure 5.20 Ion mobility spectra of 1 µL ethanol solution with 1.0 × 10² ng MDMA, 1.0 × 10² ng cocaine, and 2.0 × 10² ng heroin, respectively, collected on the ITEMISER ITMS in positive ion mode. The 342nd spectrum in the MDMA data set, the 249th spectrum in the cocaine data set, and the 495th spectrum in the heroin data set are presented. Panels: MDMA, cocaine, and heroin, each with the RIP and drug peaks labeled; axes: intensity (V) versus drift time (ms).

The following experiment was performed to illustrate the internal reference method for real-time WC2-RTSIMPLISMA. A 1.0 µL aliquot of drug mixture A was placed onto the sample disk. The disk was inserted into the desorber of the ITMS at 12 s and removed at 17 s. Afterwards, 1.0 × 10² ng MDMA was deposited onto the same sample disk. The disk was reinserted into the desorber at 56 s and removed at 61 s. Then 1.0 × 10² ng cocaine was placed onto the disk. The disk was reinserted into the desorber at 111 s and removed at 116 s. Lastly, 2.0 × 10² ng heroin was placed onto the disk. The disk was reinserted into the desorber at 169 s and removed at 175 s. The process was stopped at 238 s.

Before the reference data (i.e., the data of the single-drug samples) was collected, the real-time WC2-RTSIMPLISMA was not able to completely resolve the four components in the data set. The resolved spectra at 47 s (i.e., the 800th spectrum) are given in Figure 5.21, in which MDMA and cocaine were modeled into one single component, referred to as the mixture component in the figure. The difficulty of resolving the four components (reactant ions, heroin, MDMA, and cocaine) from the mixture data set resulted from the similar concentration profiles among the three drugs on the ITEMISER ITMS, especially between MDMA and cocaine. After the reference spectra of the three drugs were collected following the mixture data, the four components could be completely resolved. The final resolved spectra are given in Figure 5.22. Five components were resolved, including a RIP component, an unknown component, and three drug components that correspond to MDMA, cocaine, and heroin, respectively. The

Figure 5.21 Real-time WC2-RTSIMPLISMA resolved component spectra from the data set of drug mixture A. Traces: RIP, mixture component, and heroin; axes: relative intensity versus drift time (ms).

Figure 5.22 Real-time WC2-RTSIMPLISMA resolved component spectra for drug mixture A with internal reference spectra of cocaine, MDMA, and heroin. Panels: RIP, MDMA, cocaine, heroin, and unknown; axes: relative intensity versus drift time (ms) and point number.

resolved concentration profiles are given in Figure 5.23, which shows that the unknown component appeared when the heroin sample was introduced into the ITEMISER ITMS. However, the peak was not found in the ion mobility spectrum of the single-heroin sample, suggesting that the component may be an intermolecular cluster ion formed during the measurement process. Because the same disk was used during the entire process and drug residues remained on the disk, the three drug components appeared each time the disk was inserted into the desorber. Nevertheless, the maximum of the concentration profile for each drug was located in the period when the corresponding single-drug sample was introduced. The concentration profiles of cocaine and MDMA were very similar to each other before the reference data was collected and therefore could not be resolved. The internal reference data spiked into the mixture data set enlarged their difference and made them discernible from each other.

The above method could be used as an aid for enhancing the resolving power of the SMCR method when the reference substance is available. Like the internal standard method in chromatography, it can offset the effect of instrumental shift on the accuracy of qualitative identification. However, using an internal reference lowers the portability of the instrument and increases analysis time and cost. This problem can be addressed by the digital internal reference method used in this work. The reference is “digital” instead of real because the reference data obtained in the lab is programmed into the instrument in advance. The reference data is appended to the sample data in field detection. This digital

Figure 5.23 Real-time WC2-RTSIMPLISMA resolved concentration profiles for drug mixture A with internal reference spectra of cocaine, MDMA, and heroin. Panels: RIP, MDMA, cocaine, heroin, and unknown; axes: integrated intensity (V) versus acquisition time (s) and spectrum number.

reference method is feasible provided the instrument can be well calibrated. The ITEMISER ITMS is such an instrument with well-defined calibrants for both modes. For real-time WC2-RTSIMPLISMA, it is even more advantageous to append the compressed reference data to the compressed sample data and apply RTSIMPLISMA to the resulting data set.

The data set of drug mixture B was used to illustrate the feasibility of the proposed digital reference method. Unlike drug mixture A, the quantity of heroin was reduced by half to 1.0 × 10² ng (as given in Table 5.1), which made the concentration profiles of the three drugs more difficult to differentiate. The component spectra resolved by real-time WC2-RTSIMPLISMA are presented in Figure 5.24. The three drug components were modeled into one single component, referred to as the mixture component in the figure.

The three single-drug data sets in Table 5.1 were used as reference data. The 4 × 4 daublet 14 – daublet 4 2D compression was applied to each of the data sets. Appending the compressed reference data sets to the compressed data set of drug mixture B resulted in a data set of 278 spectra with 89 points in each spectrum. The combined compressed data was analyzed by RTSIMPLISMA and the resolved model is given in Figure 5.25 and Figure 5.26. In Figure 5.25, the three drug components were successfully modeled with three resolved spectra. Two RIP components were resolved due to the instrumental shift with respect to drift time. In the resolved concentration profiles in Figure 5.26, the profiles for the three drug components were similar to one

Figure 5.24 Real-time WC2-RTSIMPLISMA resolved component spectra for drug mixture B. Traces: RIP and mixture component; axes: relative intensity versus drift time (ms).

Figure 5.25 Real-time WC2-RTSIMPLISMA resolved component spectra for drug mixture B with appended IMS reference spectra of cocaine, MDMA, and heroin. Panels: RIP 1, MDMA, cocaine, heroin, and RIP 2; axes: relative intensity versus drift time (ms) and point number.

Figure 5.26 Real-time WC2-RTSIMPLISMA resolved concentration profiles for drug mixture B with appended reference IMS spectra of cocaine, MDMA, and heroin. Panels: RIP 1, MDMA, cocaine, heroin, and RIP 2; axes: integrated intensity (V) versus spectrum number.

another in the mixture data set. Appending the reference data enabled the complete resolution of the three components.
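The appending step of the digital reference method can be sketched as a row-wise concatenation of compressed data blocks. The following is a minimal illustration rather than the GCIN implementation: the function name `append_reference`, the block names, and the small matrix sizes are assumptions chosen for the example, and the blocks are assumed to have already been compressed to the same number of drift-time points.

```python
import numpy as np

def append_reference(sample, references):
    """Stack compressed reference blocks under the compressed sample block.

    Rows are (compressed) spectra and columns are (compressed) drift-time
    points. Returns the combined matrix and one label per row naming the
    block the row came from.
    """
    blocks = [("sample", sample)] + list(references.items())
    n_points = sample.shape[1]
    for name, block in blocks:
        if block.shape[1] != n_points:
            raise ValueError(f"{name}: expected {n_points} points, got {block.shape[1]}")
    combined = np.vstack([block for _, block in blocks])
    labels = np.concatenate([np.full(block.shape[0], name) for name, block in blocks])
    return combined, labels

# Placeholder blocks; real blocks would hold compressed spectra.
sample = np.zeros((6, 5))
references = {"MDMA": np.ones((4, 5)), "cocaine": np.ones((3, 5))}
combined, labels = append_reference(sample, references)
print(combined.shape)
```

Keeping a label per row makes it possible to report only the sample part of each resolved concentration profile after the curve resolution step.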

5.5 Conclusions

An integrated real-time 2D wavelet compression algorithm has been coupled with a real-time SIMPLISMA algorithm in the GCIN software package. The algorithm compresses spectra as they are collected with respect to both the spectral dimension (i.e., drift time) and the sample acquisition time dimension. A novel recursive algorithm was developed for compressing the sample acquisition time dimension in real-time. The RTSIMPLISMA has been enhanced to automatically determine the number of components in the model. The results demonstrate that the computational efficiency of RTSIMPLISMA increased considerably with the real-time 4 × 4 level 2D wavelet compression. The real-time WC2-RTSIMPLISMA was able to reveal very small features in IMS data. The algorithm could resolve the mixture IMS data of the explosives of interest, including urea nitrate, RDX, and TNT. Moreover, the drug data sets that could not be completely resolved by regular WC2-RTSIMPLISMA could be resolved using the internal reference method.
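The idea of compressing both dimensions by keeping only the low-frequency part of a wavelet transform can be sketched as follows. This is a simplified batch illustration, not the recursive real-time algorithm: it swaps in the Haar wavelet for the daublet filters used in this work, rescales the approximation coefficients to pairwise means so that intensities stay on the raw scale, and the function names and the padding rule for odd lengths are assumptions.

```python
import numpy as np

def haar_approx(x, levels, axis):
    """Keep only rescaled Haar approximation coefficients along one axis."""
    x = np.moveaxis(np.asarray(x, dtype=float), axis, 0)
    for _ in range(levels):
        if x.shape[0] % 2:
            # Pad odd lengths by repeating the last element (an assumption).
            x = np.concatenate([x, x[-1:]], axis=0)
        # Haar low-pass filter and downsampling, scaled to a pairwise mean.
        x = (x[0::2] + x[1::2]) / 2.0
    return np.moveaxis(x, 0, axis)

def compress_2d(data, time_levels=4, drift_levels=4):
    """Compress columns (drift time) and then rows (acquisition time)."""
    return haar_approx(haar_approx(data, drift_levels, axis=1), time_levels, axis=0)

# 64 spectra of 160 points each shrink by a factor of 16 in each dimension.
compressed = compress_2d(np.ones((64, 160)))
print(compressed.shape)
```

Each level halves the length of the compressed dimension, so a 4 × 4 level compression reduces the number of matrix elements by a factor of about 256, which is where the gain in computational efficiency comes from.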

Chapter 6 Summary and Future Work

In Chapter 1, a brief overview of SMCR, IMS, and data compression was given, and the motivations of the research projects were stated. In Chapter 2, the theory of SIMPLISMA and WT, the two major chemometric methods used in the dissertation, was introduced.

In Chapter 3, the conventional SIMPLISMA was modified to furnish an algorithm amenable to real-time implementation. The number of components can be automatically estimated using the NPV threshold method during an IMS measurement process. The real-time SIMPLISMA resolves the spectral and concentration profiles of the pure components in a mixture data set while it is being acquired from an IMS device. The time constraint was found to be a major problem for the real-time SIMPLISMA. Batch processing could partially address the problem.
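As a reminder of the quantity on which SIMPLISMA is built, the basic purity of a variable can be sketched as the ratio of its standard deviation to its offset-corrected mean. The following minimal example covers only the selection of the first pure variable in conventional SIMPLISMA; the weighting used for subsequent components, the real-time modifications, and the NPV threshold are omitted, and the function name and the default offset are assumptions.

```python
import numpy as np

def first_pure_variable(data, offset_fraction=0.05):
    """Return the index of the purest variable and the purity spectrum.

    data: (n_spectra, n_points) non-negative intensity matrix.
    offset_fraction: offset as a fraction of the largest column mean; it
    damps the purity of low-intensity, noise-dominated variables.
    """
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=0)
    std = data.std(axis=0)
    offset = offset_fraction * mean.max()
    purity = std / (mean + offset)
    return int(np.argmax(purity)), purity

# Column 0 is constant, column 1 varies slightly, column 2 appears once.
example = np.array([[1.0, 1.0, 0.0],
                    [1.0, 2.0, 0.0],
                    [1.0, 1.0, 0.0],
                    [1.0, 2.0, 4.0]])
index, purity = first_pure_variable(example)
print(index, purity)
```

A variable that responds to a single component has a large standard deviation relative to its mean and therefore a high purity, whereas a constant background variable has a purity of zero.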

In Chapter 4, a WC2-RTSIMPLISMA method was developed. The optimal settings for the method were obtained. A transition point was found in the relative purity curves of two IMS data sets and four reference data sets. The number of components in the studied data sets could be accurately determined by searching for the transition point. A 4 × 4 2D wavelet compression was applied, which reduced the input size of RTSIMPLISMA without degrading the quality of the final model.

In Chapter 5, a real-time WC2 algorithm was developed that could significantly accelerate the real-time implementation of RTSIMPLISMA. A software package (GCIN) was developed. The real-time WC2-RTSIMPLISMA proved to be a useful tool for discovering the dynamic information during the IMS measurements. The internal reference method was used as an aid for resolving mixture data sets that could not be resolved by regular WC2-RTSIMPLISMA.

Real-time WC2 can be coupled with other chemometric algorithms, such as OPA and EFA. The WC2 uses a partial linear WT and the compressed data preserves the same ordinate as the raw data, making it very easy to combine with other algorithms. One could treat the WC2 algorithm as a black box without needing to understand the details inside it. Only a few simple parameters (i.e., compression levels and wavelet types) need to be selected. Moreover, WC2 can be applied as a real-time compression tool in industrial fields. An example is remote data acquisition: data could be compressed on site before it is sent over the internet or a wireless network. This compression is especially desirable when data throughput is large or network bandwidth is limited.

In Chapter 5, the RTSIMPLISMA algorithm was not able to resolve the correct model of the Windig NIR data set due to the broad peaks and severely shifted baseline, although the correct number of components was determined. Significant negative values were found in the resolved spectra due to the correlations. Alternating least squares (ALS) has been commonly used to address this type of problem by applying non-negativity constraints or using unconstrained estimation and then subsequently setting negative values to zero in each iteration (104). ALS regression might be applied to improve the RTSIMPLISMA model of the Windig NIR data set. Future work could also include extending the WC2-RTSIMPLISMA to other research areas, such as chemical imaging, electrochemical signals, and protein analysis. Currently, multivariate image analysis (MIA) is a very active area in chemometrics (3, 105). The data in MIA are high-dimensional and complex. The powerful compression capability of WC2-RTSIMPLISMA could reduce the dimensionality and RTSIMPLISMA could simplify the complexity by extracting the important features underlying the chemical images.
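The ALS variant mentioned above, in which each unconstrained least-squares estimate is followed by setting negative values to zero, can be sketched as follows. This is an illustrative outline and not a method that was applied in this dissertation: the function name `als_nonneg`, the fixed number of iterations, and the small synthetic data set are assumptions, and a practical implementation would add a convergence test and profile normalization.

```python
import numpy as np

def als_nonneg(data, spectra_init, n_iter=50):
    """Alternate unconstrained least squares with clipping of negatives to zero.

    data: (n_spectra, n_points); spectra_init: (n_components, n_points).
    Returns (concentrations, spectra) with data ~= concentrations @ spectra.
    """
    data = np.asarray(data, dtype=float)
    spectra = np.clip(np.asarray(spectra_init, dtype=float), 0.0, None)
    for _ in range(n_iter):
        # Solve data ~= C @ S for C, then remove negative concentrations.
        conc = np.linalg.lstsq(spectra.T, data.T, rcond=None)[0].T
        conc = np.clip(conc, 0.0, None)
        # Solve data ~= C @ S for S, then remove negative intensities.
        spectra = np.linalg.lstsq(conc, data, rcond=None)[0]
        spectra = np.clip(spectra, 0.0, None)
    return conc, spectra

# Synthetic two-component example; the initial spectra would normally be
# the (possibly partly negative) spectra from a SIMPLISMA-type model.
true_spectra = np.array([[1.0, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 1.0]])
true_conc = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
data = true_conc @ true_spectra
conc, spectra = als_nonneg(data, true_spectra - 0.05)
print(conc.shape, spectra.shape)
```

Clipping after an unconstrained fit is the simpler of the two options named above; a true non-negative least-squares solver enforces the constraint inside each fit instead.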

The real-time WC2-RTSIMPLISMA was designed for IMS devices. However, its application is not limited to IMS devices. A second direction for future work could be to extend the tool to other instruments. The studies in Chapter 4 found that WC2-RTSIMPLISMA could resolve the components from mixture data sets other than IMS, such as Raman, time-resolved mass spectrometry, and FTIR microscopy data. The GCIN software can be transferred to other instruments by modifying only the data acquisition module.

Furthermore, the real-time WC2-RTSIMPLISMA was used for qualitative purposes in this dissertation. IMS is basically a qualitative method because the amplitude of the response to an analyte depends strongly on the concentrations of the other compounds in the mixture, resulting in a nonlinear relationship between the analyte concentration and the amplitude of the instrumental response. A simple experiment was performed on the ITEMISER to explore the potential of using the algorithm for quantitative measurements. Sample disks with different amounts of TNT were inserted into the desorber of the ITEMISER sequentially. The real-time WC2-RTSIMPLISMA modeled the concentration changes during the process, from which a standard curve could be plotted of the TNT quantities against the changes in the TNT concentration profile. The results were not included in this dissertation due to the large error (relative error > 100%) and low repeatability (relative standard deviation > 50%) found when using the standard curve to predict TNT samples. However, the experiment suggested the potential of using real-time SMCR to quantitatively determine an analyte in a mixture on other instruments with linear responses. This real-time quantitation method can potentially diminish the matrix effects that occur in the conventional standard curve method, because the SMCR can mathematically resolve the analyte signal from the mixture. In addition, the effects of instrumental conditions could be minimized because the signals of the standard and the sample are acquired continuously in one data set. Meanwhile, it is a method that can be automated. However, it should be pointed out that the optimal settings for real-time WC2-RTSIMPLISMA should be investigated before it is used for any other data types.
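The standard curve step described above can be sketched for an instrument with a linear response. The example is hypothetical: the function names are assumptions, the numbers are placeholders rather than measured TNT data, and the response is taken to be the change in the resolved concentration profile when each standard is introduced.

```python
import numpy as np

def fit_standard_curve(amounts, responses):
    """Least-squares line: response = slope * amount + intercept."""
    slope, intercept = np.polyfit(amounts, responses, deg=1)
    return float(slope), float(intercept)

def predict_amount(response, slope, intercept):
    """Invert the standard curve to estimate an amount from a response."""
    return (response - intercept) / slope

def relative_error(predicted, actual):
    """Relative error in percent."""
    return 100.0 * abs(predicted - actual) / actual

# Placeholder standards with an exactly linear response.
amounts = np.array([1.0, 2.0, 3.0, 4.0])
responses = 2.0 * amounts + 1.0
slope, intercept = fit_standard_curve(amounts, responses)
estimate = predict_amount(7.0, slope, intercept)
print(slope, intercept, estimate)
```

The relative error and the relative standard deviation of repeated predictions are the figures of merit quoted above for judging whether such a curve is usable.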

References

(1) Massart, D. L.; Vandeginste, B. G. M.; Buydens, L. M. C.; Jong, S.; Lewi, P. J.; Smeyers-Verbeke, J. Handbook of Chemometrics and Qualimetrics: Part A; Elsevier: Amsterdam, The Netherlands, 1997.

(2) http://www.gage-applied.com/ Accessed 2/2003.

(3) Lavine, B. K.; Workman, J. Chemometrics. Anal. Chem. 2002, 74, 2763-2769.

(4) Harrington, P. B.; Reese, E. S.; Rauch, P. J.; Hu, L.; Davis, D. M. Interactive self-modeling mixture analysis of ion mobility spectra. Appl. Spectrosc. 1997, 51, 808-816.

(5) Rauch, P. J.; Harrington, P. B.; Davis, D. M. Near real-time self-modeling mixture analysis. Chemom. Intell. Lab. Syst. 1997, 39, 175-185.

(6) Harrington, P. B.; Chen, G.; Urbas, A. A. Strategies for smarter chemical sensors. Int. J. Ion Mobility Spectrom. 2001, 4, 26-30.

(7) Asbury, G. R.; Wu, C.; Siems, W. F.; Hill, H. H. Separation and identification of some chemical warfare degradation products using electrospray high resolution ion mobility spectrometry with mass selected detection. Anal. Chim. Acta 2000, 404, 273-283.

(8) Fällman, Å.; Rittfeldt, L. Detection of chemical warfare agents in water by high temperature solid phase microextraction-ion mobility spectrometry (HT-SPME-IMS). Int. J. Ion Mobility Spectrom. 2001, 4, 85-87.

(9) Sielemann, S.; Baumbach, J. I.; Schmidt, H.; Pilzecker, P. Quantitative analysis of benzene, toluene, and m-xylene with the use of a UV-ion mobility spectrometer. Field Anal. Chem. Technol. 2000, 4, 157-169.

(10) Wan, C.; Harrington, P. B.; Davis, D. M. Trace analysis of BTEX compounds in water with a membrane interfaced ion mobility spectrometer. Talanta 1998, 46, 1169-1179.

(11) Sielemann, S.; Baumbach, J. I.; Schmidt, H.; Pilzecker, P. Detection of alcohols using UV-ion mobility spectrometers. Anal. Chim. Acta 2001, 431, 293-301.

(12) Asbury, G. R.; Klasmeier, J.; Hill, H. H. Analysis of explosives using electrospray ionization/ion mobility spectrometry (ESI/IMS). Talanta 2000, 50, 1291-1298.

(13) Garofolo, F.; Migliozzi, V.; Roio, B. Application of ion mobility spectrometry to the identification of trace levels of explosives in the presence of complex matrices. Rapid Commun. Mass Spectrom. 1994, 8, 527-532.

(14) Eiceman, G. A.; Preston, D.; Tiano, G.; Rodriguez, J.; Parmeter, J. E. Quantitative calibration of vapor levels of TNT, RDX, and PETN using a diffusion generator with gravimetry and ion mobility spectrometry. Talanta 1997, 45, 54-74.

(15) Ewing, R. G.; Atkinson, D. A.; Eiceman, G. A.; Ewing, G. J. A critical review of ion mobility spectrometry for the detection of explosives and explosive related compounds. Talanta 2001, 54, 515-529.

(16) Wu, C.; Siems, W. F.; Hill, H. H. Secondary electrospray ionization ion mobility spectrometry/mass spectrometry of illicit drugs. Anal. Chem. 2000, 72, 396-403.

(17) Patchett, M. L.; Minoshima, Y.; Harrington, P. B. Detection of gamma-hydroxybutyrate and gamma-butyrolactone by ion mobility spectrometry. Spectroscopy 2002, 17, 16-+.

(18) Fytche, L. M.; Hupe, M.; Kovar, J. B.; Pilon, P. Ion mobility spectrometry of drugs of abuse in customs scenarios: concentration and temperature study. J. Forensic Sci. 1992, 37, 1550-1566.

(19) Tabrizchi, M. Ion mobility spectrometry of alkali salts. Int. J. Ion Mobility Spectrom. 2001, 4, 74-76.

(20) Snyder, A. P.; Thornton, S. N.; Dworzanski, J. P.; Meuzelaar, H. L. C. Detection of the picolinic acid biomarker in Bacillus spores using a potentially field-portable pyrolysis gas chromatography ion mobility spectrometry system. Field Anal. Chem. Technol. 1996, 1, 45-59.

(21) Vinopal, R. T.; Jadamec, J. R.; deFur, P.; Demars, A. L.; Jakubielski, S.; Green, C.; Anderson, C. P.; Dugas, J. E.; DeBono, R. F. Fingerprinting bacterial strains using ion mobility spectrometry. Anal. Chim. Acta 2002, 457, 83-95.

(22) Ogden, I. D.; Strachan, N. J. C. Enumeration of Escherichia coli in cooked and raw meats by ion mobility spectrometry. 1993, 74, 402-405.

(23) Buxton, T. L. Ph.D. Dissertation, Ohio University, Athens, OH, 2002.

(24) Hill, H. H.; Simpson, G. Capabilities and limitations of ion mobility spectrometry for field screening applications. Field Anal. Chem. Technol. 1997, 1, 119-134.

(25) Eiceman, G. A.; Karpas, Z. Ion Mobility Spectrometry; CRC Press: Boca Raton, 1994.

(26) St. Louis, R. H.; Hill, H. H. Ion mobility spectrometry in analytical chemistry. Anal. Chem. 1990, 21, 322-355.

(27) Hill, H. H.; Siems, W. F.; St. Louis, R. H.; McMinn, D. G. Ion mobility spectrometry. Anal. Chem. 1990, 62, A1201-A1209.

(28) Schmidt, S.; Appel, M. F.; Garnica, R. M.; Schindler, R. N.; Benter, T. Atmospheric pressure laser ionization. An analytical technique for highly selective detection of ultralow concentrations in the gas phase. Anal. Chem. 1999, 71, 3721-3729.

(29) Wu, C.; Hill, H. H.; Rasulev, U. K.; Nazarov, E. G. Surface ionization ion mobility spectrometry. Anal. Chem. 1999, 71, 273-278.

(30) Wittmer, D.; Luckenbill, B. K.; Hill, H. H.; Chen, Y. H. Electrospray-ionization ion mobility spectrometry. Anal. Chem. 1994, 66, 2348-2355.

(31) Hamilton, J. C.; Gemperline, P. J. Mixture analysis using factor analysis. J. Chemom. 1990, 4, 1-13.

(32) Windig, W.; Antalek, B. Resolving nuclear magnetic resonance data of complex mixtures by three-way methods: Examples of chemical solutions and the human brain. Chemom. Intell. Lab. Syst. 1999, 46, 207-219.

(33) Lawton, W. H.; Sylvestre, E. A. Mathematical determination of the pure spectra of two components in a two-component mixture. Technometrics 1971, 13, 617-633.

(34) Vandeginste, B. G. M.; Derks, W.; Kateman, G. Multicomponent self-modeling curve resolution in high-performance liquid-chromatography by iterative target transformation analysis. Anal. Chim. Acta 1985, 173, 253-264.

(35) Maeder, M. Evolving factor-analysis for the resolution of overlapping chromatographic peaks. Anal. Chem. 1987, 59, 527-530.

(36) Malinowski, E. R. Window factor-analysis - theoretical derivation and application to flow-injection analysis data. J. Chemom. 1992, 6, 29-40.

(37) Kvalheim, O. M.; Liang, Y. Z. Heuristic evolving latent projections - resolving 2-way multicomponent data. Anal. Chem. 1992, 64, 936-946.

(38) Cuesta-Sánchez, F.; van den Bogaert, B.; Rutan, S. C.; Massart, D. L. Multivariate peak purity approaches. Chemom. Intell. Lab. Syst. 1996, 34, 139-171.

(39) Cuesta-Sánchez, F.; Toft, J.; van den Bogaert, B.; Massart, D. L. Orthogonal projection approach applied to peak purity assessment. Anal. Chem. 1996, 68, 79-85.

(40) Windig, W.; Guilment, J. Interactive self-modeling mixture analysis. Anal. Chem. 1991, 63, 1425-1432.

(41) Paatero, P. Least squares formulation of robust non-negative factor analysis. Chemom. Intell. Lab. Syst. 1997, 37, 23-35.

(42) Tauler, R.; Smilde, A. K.; Kowalski, B. R. Selectivity, local rank, 3-way data-analysis and ambiguity in multivariate curve resolution. J. Chemom. 1995, 9, 31-58.

(43) Smith, D. S.; Kramer, J. R. Multisite metal binding to fulvic acid determined using multiresponse fluorescence. Anal. Chim. Acta 2000, 416, 211-220.

(44) Garrido Frenich, A.; Torres-Lapasió, J. R.; De Braekeleer, K.; Massart, D. L.; Martínez Vidal, J. L.; Martínez Galera, M. Application of several modified peak purity assays to real complex multicomponent mixtures by high-performance liquid chromatography with diode-array detection. J. Chromatogr., A 1999, 855, 487-499.

(45) De Braekeleer, K.; Cuesta-Sánchez, F.; Hailey, P. A.; Sharp, D. C. A.; Pettman, A. J.; Massart, D. L. Influence and correction of temperature perturbations on NIR spectra during the monitoring of a polymorph conversion process prior to self-modeling mixture analysis. J. Pharm. Biomed. Anal. 1998, 17, 141-152.

(46) Vacque, V.; Dupuy, N.; Sombret, B.; Huvenne, J. P.; Legrand, P. Self-modeling mixture analysis applied to FT-Raman spectral data of hydrogen peroxide activation by nitriles. Appl. Spectrosc. 1997, 51, 407-415.

(47) Batonneau, Y.; Laureyns, J.; Merlin, J. C.; Bremard, C. Self-modeling mixture analysis of Raman microspectrometric investigations of dust emitted by lead and zinc smelters. Anal. Chim. Acta 2001, 446, 23-37.

(48) Gargallo, R.; Cuesta-Sánchez, F.; Izquierdo-Ridorsa, A.; Massart, D. L. Application of eigenstructure tracking analysis and SIMPLISMA to the study of the protonation equilibria of cCMP and several polynucleotides. Anal. Chem. 1996, 68, 2241-2247.

(49) Windig, W. Spectral data files for self-modeling curve resolution with examples using the SIMPLISMA approach. Chemom. Intell. Lab. Syst. 1997, 36, 3-16.

(50) Windig, W.; Antalek, B.; Lippert, J. L.; Batonneau, Y.; Brémard, C. Combined use of conventional and second-derivative data in the SIMPLISMA self-modeling mixture analysis approach. Anal. Chem. 2002, 74, 1371-1379.

(51) Reese, E. S.; Harrington, P. B. The analysis of methamphetamine hydrochloride by thermal desorption ion mobility spectrometry and SIMPLISMA. J. Forensic Sci. 1999, 44, 68-76.

(52) Shaw, L. A.; Harrington, P. B. Seeing through the smoke with dynamic data analysis - Detection of methamphetamine in forensic samples contaminated with nicotine. Spectroscopy 2000, 15, 40-+.

(53) Buxton, T. L.; Harrington, P. B. Rapid multivariate curve resolution applied to identification of explosives by ion mobility spectrometry. Anal. Chim. Acta 2001, 434, 269-282.

(54) Cai, C.; Harrington, P. B.; Davis, D. M. Two-dimensional Fourier compression. Anal. Chem. 1997, 69, 4249-4255.

(55) Harrington, P. B.; Isenhour, T. L. Application of robust eigenvectors to the compression of infrared spectral libraries. Anal. Chem. 1988, 60, 2687-2692.

(56) Trygg, J.; Kettaneh-Wold, N.; Wallbäcks, L. 2D wavelet analysis and compression of on-line industrial process data. J. Chemom. 2001, 15, 299-319.

(57) Urbas, A. A.; Harrington, P. B. Two-dimensional wavelet compression of ion mobility spectra. Anal. Chim. Acta 2001, 446, 393-412.

(58) Daszykowski, M.; Walczak, B.; Massart, D. L. Projection methods in chemistry. Chemom. Intell. Lab. Syst. 2003, 65, 97-112.

(59) Malinowski, E. R. Factor Analysis in Chemistry, 2nd ed.; John Wiley & Sons, Inc.: New York, 1991.

(60) Vogt, F.; Tacke, M. Fast principal component analysis of large data sets based on information extraction. J. Chemom. 2002, 16, 562-575.

(61) Harrington, P. B.; Hu, L. Recovery of variable loadings and eigenvalues directly from Fourier compressed data. Appl. Spectrosc. 1998, 52, 1328-1338.

(62) Walczak, B., Ed. Wavelets in Chemistry; Elsevier: Amsterdam, The Netherlands, 2000.

(63) Harrington, P. B.; Rauch, P. J.; Cai, C. Multivariate curve resolution of wavelet and Fourier compressed spectra. Anal. Chem. 2001, 73, 3247-3256.

(64) Ehrentreich, F.; Sümmchen, L. Spike removal and denoising of Raman spectra by wavelet transform methods. Anal. Chem. 2001, 73, 4364-4373.

(65) Ho, H. L.; Cham, W. K.; Chau, F. T.; Wu, J. Y. Application of biorthogonal wavelet transform to the compression of ultraviolet-visible spectra. Comput. Chem. 1999, 23, 85-96.

(66) Chau, F. T.; Gao, J. B.; Shih, T. M.; Wang, J. Compression of infrared spectral data using the fast wavelet transform method. Appl. Spectrosc. 1997, 51, 649-659.

(67) Leung, A. K. M.; Chau, F. T.; Gao, J. B.; Shih, T. M. Application of wavelet transform in infrared spectrometry: spectral compression and library search. Chemom. Intell. Lab. Syst. 1998, 43, 69-88.

(68) Alsberg, B. K.; Woodward, A. M.; Winson, M. K.; Rowland, J.; Kell, D. B. Wavelet denoising of infrared spectra. Analyst 1997, 122, 645-652.

(69) Lasa, J.; Sliwka, I.; Rosiek, J.; Wal, K. Application of the discrete wavelet transforms for denoising in GC analysis. Chem. Anal. 2001, 46, 529-537.

(70) Chen, H. Wavelet analyses of electroanalytical chemistry responses and an adaptive wavelet filter. Anal. Chim. Acta 1997, 346, 319-325.

(71) Wu, S.; Nie, L.; Wang, J.; Lin, X.; Zheng, L.; Rui, L. Flip shift subtraction method: a new tool for separating the overlapping voltammetric peaks on the basis of finding the peak positions through the continuous wavelet transform. J. Electroanal. Chem. 2001, 508, 11-27.

(72) Eriksson, L.; Trygg, J.; Johansson, E.; Bro, R.; Wold, S. Orthogonal signal correction, wavelet analysis, and multivariate calibration of complicated process fluorescence data. Anal. Chim. Acta 2000, 420, 181-195.

(73) Trygg, J.; Wold, S. PLS regression on wavelet compressed NIR spectra. Chemom. Intell. Lab. Syst. 1998, 42, 209-220.

(74) Ren, S.; Gao, L. Simultaneous quantitative analysis of overlapping spectrophotometric signals using wavelet multiresolution analysis and partial least squares. Talanta 2000, 50, 1163-1173.

(75) Mehay, A. W.; Cai, C.; Harrington, P. B. Regularized linear discriminant analysis of wavelet compressed ion mobility spectra. Appl. Spectrosc. 2002, 56, 223-231.

(76) Walczak, B.; van den Bogaert, B.; Massart, D. L. Application of wavelet packet transform in pattern recognition of near-IR data. Anal. Chem. 1996, 68, 1742-1747.

(77) Walczak, B.; Massart, D. L. Wavelet packet transform applied to a set of signals: A new approach to the best-basis selection. Chemom. Intell. Lab. Syst. 1997, 38, 39-50.

(78) Zhang, X.; Zheng, J.; Gao, H. Curve fitting using wavelet transform for resolving simulated overlapped spectra. Anal. Chim. Acta 2001, 443, 117-125.

(79) Cai, C.; Harrington, P. B. Wavelet transform preprocessing for temperature constrained cascade correlation neural networks. J. Chem. Inf. Comput. Sci. 1999, 39, 874-880.

(80) Collantes, E. R.; Duta, R.; Welsh, W. J.; Zielinski, W. L.; Brower, J. Preprocessing of HPLC trace impurity patterns by wavelet packets for pharmaceutical fingerprinting using artificial neural networks. Anal. Chem. 1997, 69, 1392-1397.

(81) Walczak, B.; Massart, D. L. Wavelets - something for analytical chemistry. Trends Anal. Chem. 1997, 16, 451-463.

(82) Jetter, K.; Depczynski, U.; Molt, K.; Niemöller, A. Principles and applications of wavelet transformation to chemometrics. Anal. Chim. Acta 2000, 420, 169-180.

(83) Leung, A. K. M.; Chau, F. T.; Gao, J. B. A review on applications of wavelet transform techniques in chemical analysis: 1989-1997. Chemom. Intell. Lab. Syst. 1998, 43, 165-184.

(84) Alsberg, B. K.; Woodward, A. M.; Kell, D. B. An introduction to wavelet transforms for chemometricians: a time-frequency approach. Chemom. Intell. Lab. Syst. 1997, 37, 215-239.

(85) Walczak, B.; Massart, D. L. Noise suppression and signal compression using the wavelet packet transform. Chemom. Intell. Lab. Syst. 1997, 36, 81-94.

(86) Cuesta-Sánchez, F.; Khots, M. S.; Massart, D. L. Algorithm for the assessment of peak purity in liquid chromatography with photodiode-array detection. Anal. Chim. Acta 1994, 285, 181-192.

(87) Strang, G. Linear Algebra and Its Applications, 2nd ed.; Academic Press: New York, 1980.

(88) Mallat, S. A theory for multiresolution signal decomposition. IEEE Trans. Pattern Anal. Machine Intell. 1989, 11, 674-693.

(89) Baumbach, J. I.; Eiceman, G. A. Ion mobility spectrometry: Arriving on site and moving beyond a low profile. Appl. Spectrosc. 1999, 53, 338A-355A.

(90) Collins, R. L.; Ellickson, P. L.; Bell, R. M. Simultaneous polydrug use among teens: prevalence and predictors. J. Subst. Abuse 1998, 10, 233-253.

(91) Roberts, A. J.; Polis, I. Y.; Gold, L. H. Intravenous self-administration of heroin, cocaine, and the combination in Balb/c mice. Eur. J. Pharmacol. 1997, 326, 119-125.

(92) Harrington, P. B.; Buxton, T. L.; Chen, G. Classification of bacteria by thermal methylation hydrolysis ion mobility spectrometry using SIMPLISMA and multidimensional wavelet compression. Int. J. Ion Mobility Spectrom. 2001, 4, 148-151.

(93) DeLuca, S.; Sarver, E. W.; Harrington, P. d. B.; Voorhees, K. J. Direct analysis of bacterial fatty acids by Curie-point pyrolysis tandem mass spectrometry. Anal. Chem. 1990, 62, 1465-1472.

(94) ftp://ftp.clarkson.edu/pub/hopkepk/Chemdata/Windig Accessed 12/2001.

(95) Shen, H.; Liang, Y.; Kvalheim, O. M.; Manne, R. Determination of chemical rank of two-way data from mixtures using subspace comparisons. Chemom. Intell. Lab. Syst. 2000, 51, 49-59.

(96) De Braekeleer, K.; Massart, D. L. Evaluation of the orthogonal projection approach (OPA) and the SIMPLISMA approach on the Windig standard spectral data sets. Chemom. Intell. Lab. Syst. 1997, 39, 127-141.

(97) Grande, B.-V.; Manne, R. Use of convexity for finding pure variables in two-way data from mixtures. Chemom. Intell. Lab. Syst. 2000, 50, 19-33.

(98) Lippert, J. L.; Melpolder, S. B.; Kelts, L. M. Raman-spectroscopic determination of the pH-dependence of intermediates in sol-gel silicate formation. J. Non-Cryst. Solids 1988, 104, 139-147.

(99) Windig, W.; Markel, S. Simple-to-use interactive self-modeling mixture analysis of FTIR microscopy data. J. Mol. Struct. 1993, 292, 161-170.

(100) Windig, W.; Stephenson, D. A. Self-modeling mixture analysis of 2nd-derivative near-infrared spectral data using the SIMPLISMA approach. Anal. Chem. 1992, 64, 2735-2742.

(101) Phalp, J. M.; Payne, A. W.; Windig, W. The resolution of mixtures using data from automated probe mass spectrometry. Anal. Chim. Acta 1995, 318, 43-53.

(102) Gubner, J. A.; Chang, W. Wavelet transforms for discrete-time periodic signals. Signal Proc. 1995, 42, 167-180.

(103) TMS320C3X User's Guide. Texas Instruments. 1994.

(104) Bro, R.; De Jong, S. A fast non-negativity-constrained least squares algorithm. J. Chemom. 1997, 11, 393-401.

(105) Wise, B. M.; Geladi, P. A brief introduction to multivariate image analysis (MIA). Newsletter for The North American Chapter of the International Chemometrics Society 2000, 22, 3-7.

Appendix A: Publications

(1) Chen, G.; Harrington, P. B. Real-time interactive self-modeling mixture analysis. Appl. Spectrosc. 2001, 55, 621-629.

(2) Harrington, P. B.; Chen, G.; Urbas, A. A. Strategies for smarter chemical sensors. Int. J. Ion Mobility Spectrom. 2001, 4, 26-30.

(3) Harrington, P. B.; Buxton, T. L.; Chen, G. Classification of bacteria by thermal hydrolysis methylation ion mobility spectrometry using SIMPLISMA and multidimensional wavelet compression. Int. J. Ion Mobility Spectrom. 2001, 4, 148-151.

(4) Chen, G.; Harrington, P. B. SIMPLISMA applied to two-dimensional wavelet compressed IMS data. Anal. Chim. Acta 2003, In press.

(5) Chen, G.; Harrington, P. B. Real-time two-dimensional wavelet compression and its application on real-time modeling ion mobility data. Anal. Chim. Acta 2003, In press.

Appendix B: Presentations

(1) G. Chen, P. B. Harrington, “Real-time interactive self-modeling mixture analysis of ion mobility spectra”, oral presentation at The 51st Pittsburgh Conference on Analytical Chemistry and Applied Spectroscopy, New Orleans, LA, March 13, 2000, 41.

(2) G. Chen, P. B. Harrington, “Temperature-constrained radial basis function neural networks”, oral presentation at The 52nd Pittsburgh Conference on Analytical Chemistry and Applied Spectroscopy, New Orleans, LA, March 5, 2001, 312.

(3) G. Chen, P. B. Harrington, “Detection of heroin in drugs of abuse using multivariate curve resolution with two-dimensional wavelet compression”, oral presentation at The 53rd Pittsburgh Conference on Analytical Chemistry and Applied Spectroscopy, New Orleans, LA, March 20, 2002, 889.

(4) G. Chen, P. B. Harrington, “Real-time self-modeling mixture analysis with wavelet compression for detection of explosives using an ion trap mobility spectrometer”, poster at The 53rd Pittsburgh Conference on Analytical Chemistry and Applied Spectroscopy, New Orleans, LA, March 20, 2002, 2126P.

Appendix C: MATLAB Scripts

function [c, s] = rtsimplisma(x, scanrate, beta, isIMSdata)
% function [c, s] = rtsimplisma(x, scanrate, beta, isIMSdata)
% Input:  x = data matrix with spectra in rows
%         scanrate = points/s for IMS data only. Default is 80,000
%         beta = delta_log_belta0, threshold to determine transition point. Default is 0.4
%                (suggest set to 0.5 for compressed data)
%         isIMSdata = 1 for IMS data. For non-IMS data, baseline subtraction and 3 S/N rule
%                     will not be applied. Default is 1
% Output: c = resolved concentration profiles in columns
%         s = resolved spectra in rows
% Adapted from C++ code that was used for Chapter 4
if nargin == 1
    scanrate = 80000;
    beta = 0.4;
    isIMSdata = 1;
elseif nargin == 2
    beta = 0.4;
    isIMSdata = 1;
elseif nargin == 3
    isIMSdata = 1;
end
[ns, nx] = size(x);
if isIMSdata ~= 1
    xnew = x;
    sqtot = sum(x.^2);
    tot = sum(x);
    SS = sqtot - tot.*tot/ns;
else
    baseline = x(:, round(1.5*scanrate/1000):round(3.0*scanrate/1000));
    x = x - mean(baseline,2)*ones(1,nx);
    noise = mean(std(baseline'));
    min_SS = 9*noise^2*(ns-1);
    sqtot_x = sum(x.^2);
    tot_x = sum(x);
    SS_x = sqtot_x - tot_x.*tot_x/ns;
    ixnew = find(SS_x > min_SS);
    xnew = x(:, ixnew);
    sqtot = sqtot_x(ixnew);
    SS = SS_x(ixnew);
end

% maximum number of components, in case relative purity curve does not converge
MAX_NC = 15;
ipurevar = [];
purity = [];
gs = [];
% first pure variable
[p1, ip1] = max(SS);
ipurevar = [ipurevar ip1];
purity = [purity p1];
c1 = xnew(:, ipurevar);
gs = c1/sqrt(c1'*c1);
cur_relpurity = 0.0;
last_relpurity = 0.0;

% search pure variables
for ii=2:MAX_NC
    proj = gs'*xnew;
    proj = proj.^2;
    proj = sum(proj, 1);
    w = 1-proj./sqtot;
    [newpurity, inewpurevar] = max(SS.*w);
    cur_relpurity = log10(newpurity/p1);
    delta_relpurity = abs(last_relpurity-cur_relpurity);
    if abs(cur_relpurity) > 2.0
        if delta_relpurity

    if dot ~= 0
        s(k,:) = s(k,:) ./ sqrt(dot);
    else
        s(k,:) = zeros(size(s(k,:)));
    end
end
% calculate concentration profiles
c = x*pinv(s);

function [fir] = buildfir(h, level)
% function [fir] = buildfir(h, level)
% Input:  h = father wavelet for smooth part, e.g., for daub4, h = [0.48 0.84 0.22 -0.13]
%         level = wavelet compression level
% Output: fir = FIR filter by which the raw signal is transformed to the smooth part
%               at the input level
% See Equation (5.3) for mathematical description
if level == 1
    fir = h;
else
    fir = h;
    len_h = length(h);
    for j = 2:level
        len = length(fir);
        T = [];
        % length of the filter after this level (each row of T has this length)
        len_fir = 2^(j-1)*(len_h-1)+len;
        for i = 1:length(h)
            % row i holds the current filter shifted right by (i-1)*2^(j-1) points
            paddingLeft = zeros(1,(i-1)*2^(j-1));
            paddingRight = zeros(1, 2^(j-1)*(len_h-i));
            T = [T; paddingLeft fir paddingRight];
        end
        % weight the shifted copies by h and sum them
        fir = h*T;
    end
end