Sensors and Actuators B 80 (2001) 243–254

Artificial intelligence methods for selection of an optimized array for identification of volatile organic compounds

Robi Polikar a,*, Ruth Shinar b, Lalita Udpa c, Marc D. Porter b

a Department of Electrical and Computer Engineering, Rowan University, 136 Rowan Hall, Glassboro, NJ 08028, USA
b Ames Laboratory, USDOE and Department of Chemistry, Microanalytical Instrumentation Center, Iowa State University, Ames, IA 50011, USA
c Department of Electrical and Computer Engineering, Iowa State University, Ames, IA 50011, USA

Accepted 19 July 2001

Portions of this work were completed while the corresponding author was with the Department of Electrical and Computer Engineering of Iowa State University.
* Corresponding author. Tel.: +1-856-256-5372; fax: +1-856-256-5241. E-mail address: [email protected] (R. Polikar).

Abstract

We have investigated two artificial intelligence (AI)-based approaches for the optimum selection of a sensor array for the identification of volatile organic compounds (VOCs). The array consists of quartz crystal microbalances (QCMs), each coated with a different polymeric material. The first approach uses a decision tree classification algorithm to determine the minimum number of features that are required to classify the training data correctly. The second approach employs the hill-climb search algorithm to search the feature space for the optimal minimum feature set that maximizes the performance of a neural network classifier. We also examined the value of simple statistical procedures that could be integrated into the search algorithm in order to reduce computation time. The strengths and limitations of each approach are discussed. © 2001 Elsevier Science B.V. All rights reserved.

Keywords: Optimum coating selection; Decision tree; Wrapper search; Neural network classification

1. Introduction

Piezoelectric chemical sensors, such as surface acoustic wave (SAW) devices and quartz crystal microbalances (QCMs), have been widely used for detection and identification of volatile organic compounds (VOCs) [1–8]. In general, an array of polymer-coated sensors is used for detection, where the change in the resonant frequency of each sensor as a function of VOC concentration constitutes a response pattern. Over the past 15 years, a significant amount of work has been done on developing pattern recognition algorithms, using principal component analysis, neural networks and fuzzy inference systems, for various gas sensing problems [9–13]. However, these methods can only be successful if the features (polymer-coated sensor responses) used to identify the VOCs allow an efficient separation of patterns in the feature space. The challenge is then to identify a subset of polymer coatings such that a classification algorithm provides optimum classification performance. Selection of coatings is usually based on various chemical properties (e.g. solubility parameters [14–16]) of the VOCs and the compatibility of each with a range of compositionally different polymer coatings. Some researchers have also tried using various signal processing metrics, such as the Euclidean distance [17] or principal component analysis [18], to obtain the optimum set of coatings for specific applications.

Since there may be a large number of polymers suitable for the identification of a VOC, the selection of the smallest set giving the best performance is an ill-defined problem. This situation arises because testing every possible combination is usually not manageable. Furthermore, many researchers have investigated the relationship between the number of sensors and the performance of the array [19], and found that using as many sensors as possible does not necessarily improve the performance of a classification system. In fact, Park and Zellers [20], and Park et al. [21], through a careful analysis of the required number of sensors versus the number of analytes, and Osbourn et al. [22], through an examination of the effects of increasing the sensor size, have shown that the performance of classifiers for VOC identification typically degrades as the number of sensors increases beyond a certain number. Therefore, an efficient algorithm for optimum selection of sensors is of paramount importance.

For small pools of potential coatings, an exhaustive search may be manageable. For example, Zellers and coworkers used extended disjoint principal components regression analysis to conduct an exhaustive search on a 10-polymer dataset and identified four polymers as requisite array elements for optimum identification of six VOCs [20,21,23]. The use of four polymers out of 10 amounts to 210 possible combinations, which is manageable for an exhaustive search. However, as the number of possible coatings increases, an exhaustive search becomes computationally prohibitive. Adding only two more coatings to the pool, for instance, requires evaluating 495 possible four-coating combinations, and a more practical problem of choosing 6 out of 20 coatings requires testing 38,760 different combinations of coatings.
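The combination counts quoted above follow directly from the binomial coefficient C(n, k). As a quick check, the following short Python snippet (ours, for illustration only) reproduces them:

```python
from math import comb

# Number of k-coating arrays an exhaustive search must evaluate,
# for each coating-pool size quoted in the text.
for n, k in [(10, 4), (12, 4), (20, 6)]:
    print(f"C({n},{k}) = {comb(n, k)}")
# C(10,4) = 210    -- still manageable exhaustively
# C(12,4) = 495    -- two extra coatings more than double the work
# C(20,6) = 38760  -- a practical pool size is already prohibitive
```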
In efforts to reduce the number of candidate coatings from a larger pool of potentially useful coatings, various pattern recognition (PR) algorithms have been developed. Principal component analysis (PCA), a dimensionality reduction technique, has been one of the most popular of such techniques. Carey et al. used PCA [24] to reduce the feature vector obtained from 27 sensors to less than 8 for an identification problem consisting of 14 VOCs. Avila et al. introduced correspondence analysis as an alternative to PCA [25] and showed that it had computational advantages as well as a performance improvement over PCA on the same dataset used by Carey et al. [24]. PCA has been employed not only in the gas sensor area, but also in many other areas where data analysis for dimensionality reduction is important. With PCA, the strategy is to find a set of n orthogonal vectors along which the m-dimensional data have the largest variance, such that n < m. PCA is, therefore, a dimensionality reduction procedure, rather than a feature selection procedure. This distinction is because the principal components are computed as the projection of the data on a set of orthogonal vectors that are the eigenvectors of the covariance matrix of the data. The covariance matrix may, and frequently does, contain significant information obtained from each sensor. Consequently, PCA does not reduce the number of sensors, nor does it identify the optimum set of coatings.

Recently, Osbourn and Martinez [26], and Ricco et al. [27] introduced visual empirical region of influence pattern recognition (VERI-PR) for identification of VOCs. Various shortcomings of neural network and statistical techniques for pattern recognition have also been addressed by these authors. For example, VERI does not require or assume any specific probability distributions to be known, and it does not require a large number of parameters to be adjusted by the user. Furthermore, VERI is a versatile algorithm capable not only of pattern recognition, but also of optimum feature selection. The optimal feature selection capabilities of VERI on a VOC identification problem have been reported to be very promising [28]. However, the feature selection module is based on an exhaustive search, called leave-one-out, and therefore the authors recommended its use for pools of less than 20 coatings.

Selection of optimum coatings for gas sensing is actually a subset of the more general need for choosing an optimum subset of features for a PR problem. Feature subset selection is commonly encountered in pattern analysis, machine learning and artificial intelligence [29,30]. Many studies have shown that most classification algorithms perform best when the feature space includes only the most relevant information that is required for identification [31–33].

While having relevant features is a key to the successful performance of any classification algorithm, the definition of a relevant feature has been extensively debated. Some studies suggest algorithms that are preprocessing in nature. These preprocessing algorithms can be viewed as filtering the data, and thus eliminating irrelevant features. Statistical measures, such as the properties of the probability distribution function of the data, are often employed for filtering out the irrelevant features; consequently, these algorithms are referred to as filter approaches [34–36]. Filter algorithms, however, are independent of the classification algorithm to be used to process the data. Some researchers suggest that the relevant features for any set of data are dependent on the classification algorithm [30,37]. For example, a good set of features for a neural network may not be as effective for decision trees. Such studies indicate that a feature selection algorithm must be based on, or wrapped around, the classification algorithm [37]. Feature selection algorithms that use such an approach are known as wrapper approaches. Most wrapper approaches, on the other hand, suffer from large computational time and space complexity problems, particularly for data sets with a large number of features.

Due to the limited number of possible coatings typically used in the gas sensing area, the computational complexity of wrapper approaches does not constitute a major drawback. We have therefore analyzed two techniques based on the wrapper approach, and we report herein on the performances of these two artificial intelligence (AI) approaches for selecting the optimum set of coatings for VOC identification. The first approach is based on Quinlan's iterative dichotomizer 3 (ID3) algorithm [31], a decision tree algorithm that integrates classification and feature selection. The second approach is a modified version of the wrapper model of Kohavi and John [37], which uses a hill-climb search algorithm to search the feature space for an optimum set of features. The original wrapper model combines the hill-climb search with ID3; we have explored integrating the hill-climb search with a multilayer perceptron (MLP) neural network. We have also investigated the value of using a different starting point for the search, based on the variance of the data, to accelerate the convergence of the hill-climb search. This scheme allowed us to significantly reduce the computational complexity of the search.

We emphasize that our goal was to develop a systematic and efficient procedure for determining the optimum coatings, and we note that the best set of coatings for any application depends on the analytes to be detected and identified. The analytes and coatings used in this study were selected from those that have been reported extensively in the literature.

2. Experimental

2.1. Experimental system and sample preparation

The sensor responses used in this study were from 9 MHz QCMs purchased from Standard Crystals that were subsequently coated with several different polymer films. Cr/Au contacts were deposited onto the quartz by means of a resistive heating evaporator. The films were cast on the QCMs from dilute solutions of polymers, typically 20 µl of 0.3–3% (w/w), spinning at 2000–5000 rpm. The sensors were then dried at 65°C for 24 h. The thickness of the coatings was calculated from the frequency shifts detected after coating application [38]. The coated QCMs were subsequently mounted in a sealed test fixture, which could house up to six sensors. An array of 12 crystals, coated with the following polymers, was used to detect and identify 12 VOCs. The polymers were Apiezon L (APZ), poly(isobutylene) (PIB), poly[di(ethylene glycol) adipate] (DEGA), sol–gel (SG), poly[bis(cyanoallyl)polysiloxane] (OV275), poly(dimethylsiloxane) (PDS), poly(diphenoxyphosphazene) (PDPP), polychloroprene (PCP), poly[dimethylsiloxane-co-methyl(3-hydroxypropyl)siloxane]-graft-poly(ethylene glycol) 3-aminopropyl ether (PDS-CO), hydroxy-terminated poly(dimethylsiloxane) (PDS-OH), polystyrene beads (PSB), and graphite (GRAP).

Fig. 1 depicts a schematic of the experimental setup. The vapor generation system consisted of a gas stream module and a three-way switchable valve. The gas stream module included a reference module, dry nitrogen flowing at 200 sccm that served to establish the baseline response, and an analyte module. The analyte vapor was generated by means of calibrated mass flow controllers (Tylan General FC-280 AV) and conventional gas bubblers containing the analytes. The bubblers were composed of two connected compartments. The carrier gas bubbled through the solution in the first compartment, supplying the vapor, whereas the second analyte-containing compartment served as a headspace equilibrator. This process resulted in a gas stream saturated with the analyte vapor. The saturated analyte vapor was further diluted with nitrogen to obtain the desired concentrations at a total flow rate of 200 sccm. The sensors were exposed periodically to the reference gas or to the diluted analyte vapor stream by means of the computer-controlled three-way valve and an MKS multi-gas controller (model 147B) that controlled the mass flow controllers. Polyethylene and Teflon® tubing together with stainless steel or brass valves were used, but only Teflon® and stainless steel were exposed to the analytes. All experiments were performed at ambient temperature.

To evaluate sensor performance, the resonant frequency of the sensors was monitored before and following exposure to VOCs. Repeated measurements indicated reproducibility of the collected data, with small variations of 2–4%. The variability, due to small temperature fluctuations, was within experimental error. The frequency response was monitored using an HP8753C network analyzer, interfaced to an IEEE 488 card installed in a PC running HP8516A resonator-measurement software. Real-time data were displayed and saved. The data were then analyzed to obtain frequency shifts (relative to the baseline) versus VOC concentration. Typical noise levels (standard deviations of the baseline) for the QCMs were around 0.01 Hz. Further details regarding the experimental setup can be found in [39,40].

2.2. Data collection and handling

The 12 VOCs used were acetone (AC), methyl ethyl ketone (MEK), ethanol (ET), methanol (ME), 1,2-dichloroethane (DCA), acetonitrile (ACN), 1,1,1-trichloroethane (TCA), trichloroethylene (TCE), hexane (HX), octane (OC), toluene (TL) and xylene (XL).

Fig. 1. Experimental setup.

These VOCs were exposed to the sensor array at seven different concentration levels, namely 70, 140, 210, 250, 300, 350 and 700 ppm, yielding the 84 responses that constituted the experimental database. Response patterns at each concentration were composed of 12 features, representing the resonant frequency change of a particular sensor to each of the above listed VOCs. These responses were considered as signature patterns of their respective VOCs. The signature pattern for toluene at 250 ppm is shown in Fig. 2, as a typical response pattern.

Fig. 2. A typical signature pattern for toluene.

We note that sensor responses to any given VOC were notably linear with concentration within the 50–1500 ppm range. Therefore, the available data with responses to 12 VOCs at seven concentrations were interpolated to include 15 concentrations. This interpolation was achieved through linear regression of the experimental data. Linear regression coefficients with r² > 0.998 were obtained for each sensor and each VOC. The interpolated data set allowed us to obtain estimated frequency responses of sensors at 70, 100, 150, 200, 250, 300, 350, 400, 450, 500, 600, 700, 800, 900, 1000, and 1500 ppm, resulting in 180 data instances. The interpolated data, however, were not used for any of the training algorithms, but were simply used for evaluating neural network performances at intermediate concentration levels.

It has also been realized that the change in frequency is linearly dependent on the thickness of the coatings: the thicker the coating, the higher the frequency response. However, no attempt has been made to date to normalize the response with respect to the coating thickness in order to test the generalization capabilities of the neural network classifiers.
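As an illustration of this interpolation step, the sketch below fits an independent line (shift = a·concentration + b) for every VOC–sensor pair and evaluates it at intermediate concentrations. The array names, shapes, and random placeholder values are our assumptions for the sketch; the real inputs were the measured QCM frequency shifts.

```python
import numpy as np

rng = np.random.default_rng(0)
concs = np.array([70, 140, 210, 250, 300, 350, 700])  # measured levels (ppm)
# Placeholder for the measured data: 12 VOCs x 7 concentrations x 12 sensors.
shifts = rng.random((12, 7, 12)) * concs[None, :, None] / 700.0

new_concs = np.array([100, 150, 200, 400, 450, 500, 600, 800, 900, 1000])
interpolated = np.empty((12, len(new_concs), 12))

for v in range(12):            # each VOC
    for s in range(12):        # each sensor (coating)
        # Linear regression of shift vs. concentration for this pair.
        a, b = np.polyfit(concs, shifts[v, :, s], deg=1)
        interpolated[v, :, s] = a * new_concs + b
```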

3. Results

In the following sections, we describe and evaluate two methods to select a subset of the 12 coatings that can uniquely identify all 12 VOCs. The first method is the new version of Quinlan's C4.5 decision tree algorithm, C5.0, which is based on its well-known predecessor ID3 [31]. C5.0 is actually a classification algorithm, with a built-in feature selector that automatically chooses the best features that would maximize its own performance. The second method expands the recent work of Kohavi and John [37], and is based on an organized search of the feature space for the optimum feature subset.

We note that the frequency response patterns (signature patterns) for the various VOCs will be referred to as feature vectors. The response of each individual sensor (coated with a different polymer) constitutes an individual feature. We therefore use the terms feature and sensor response interchangeably.

3.1. Method I: ID3/C4.5/C5.0 family of decision trees

3.1.1. Generating decision trees

Decision trees are compact forms of displaying a list of IF-THEN rules in a hierarchical order. These rules are used to make classification decisions about the input pattern. Decision trees are among the most commonly used machine learning algorithms for classification applications, ID3 being one of the most popular [30,31].

ID3 classifies test data by constructing a decision tree from training data. The algorithm determines the features necessary for correct classification of the training data. The decision tree starts by identifying the most important feature, based on the information content of each feature. For a training data set, T, the probability, P, that a certain response pattern belongs to a specific class C_i is:

P = \frac{\mathrm{freq}(C_i, T)}{|T|} \qquad (1)

where freq(C_i, T) is the number of patterns in T that belong to class C_i, and |T| is the total number of patterns in the training data set. The information I associated with the probability P is defined as:

I = -\log_2 \left( \frac{\mathrm{freq}(C_i, T)}{|T|} \right) \qquad (2)

and is measured in units of bits. Note that, by definition, P must lie in the [0, 1] interval. The minus sign in Eq. (2) assures that I is a positive quantity. The average amount of information, info(T) in bits, needed to identify the class of any pattern in training set T is then defined as the sum over all classes weighted by their frequency of occurrence:

\mathrm{info}(T) = -\sum_{i=1}^{N} \frac{\mathrm{freq}(C_i, T)}{|T|} \times \log_2 \frac{\mathrm{freq}(C_i, T)}{|T|} \qquad (3)

The information needed to identify the class of any pattern after the training data set has been partitioned into K subsets based on the value of feature X is given by:

\mathrm{info}_X(T) = \sum_{i=1}^{K} \frac{|T_i|}{|T|} \times \mathrm{info}(T_i) \qquad (4)

where |T_i| is the number of patterns in partition i of the training set T. Eq. (3) is generally referred to as the entropy before partitioning, and Eq. (4) as the entropy after partitioning of the training set T. The original ID3 algorithm uses a criterion called "gain" to determine the additional information obtained by partitioning T using the feature X. Thus:

\mathrm{gain}(X) = \mathrm{info}(T) - \mathrm{info}_X(T) \qquad (5)

ID3 first selects the feature that has the largest gain and places that feature at the root of the tree. This feature is then removed from the feature set, and the feature with the next largest gain becomes the second most important feature, and so forth. This criterion, however, has a very strong bias in favor of features that have many outcomes, that is, features that partition the data set into the largest number of classes. Although a very legitimate bias, use of this criterion can cause poor classifier performance if an irrelevant feature uniquely identifies all classes.

This problem can be overcome by defining the split_info of a feature:

\mathrm{split\_info}(X) = -\sum_{i=1}^{K} \frac{|T_i|}{|T|} \times \log_2 \frac{|T_i|}{|T|} \qquad (6)

where split_info represents the potential amount of information that is generated by dividing T into K partitions by the feature X. Then:

\mathrm{gain\_ratio}(X) = \frac{\mathrm{gain}(X)}{\mathrm{split\_info}(X)} \qquad (7)

is the proportion of the generated information that is useful to the split of T. If the number of partitions, K, generated by the feature X is excessively large, split_info(X) will also be large, making gain_ratio(X) small.
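Eqs. (1)–(7) reduce to simple bookkeeping over class frequencies. A minimal sketch (the toy labels and the candidate two-way partition are illustrative only):

```python
import math
from collections import Counter

def info(labels):
    """Entropy of a set of class labels, Eq. (3)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(labels, partitions):
    """Eqs. (4)-(7) for one candidate feature, where `partitions` holds the
    label subsets T_i induced by that feature's values."""
    n = len(labels)
    info_x = sum(len(t) / n * info(t) for t in partitions)                # Eq. (4)
    gain = info(labels) - info_x                                          # Eq. (5)
    split = -sum(len(t) / n * math.log2(len(t) / n) for t in partitions)  # Eq. (6)
    return gain / split                                                   # Eq. (7)

# Toy example: six patterns from three VOC classes, split into two partitions.
labels = ["ET", "ET", "TL", "TL", "XL", "XL"]
parts = [["ET", "ET", "TL"], ["TL", "XL", "XL"]]
print(info(labels), gain_ratio(labels, parts))  # ~1.585 bits, ~0.667
```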

Details and example uses of this procedure can be found elsewhere [41]. C4.5, and more recently C5.0, the newest version of the ID3 family of decision trees, add a number of new features to the algorithm, such as cross-validation and boosting, as described later in this section. A sample decision tree generated by C5.0 for this particular problem of determining optimum coatings is shown in Fig. 3.

Fig. 3. Sample decision tree generated by C5.0.

The decision tree given by C5.0 can easily be converted into a set of rules from which the classification can be made. For example, the first rule generated by the tree in Fig. 3 can be expressed as "IF the PIB response is less than 0.19947, AND the response of PSB is less than 0.057534, THEN the VOC is ET (7)", where the sensor responses are given as normalized frequency deviations from the fundamental frequency of each coated sensor in response to pure nitrogen. The number in parentheses (7) refers to the number of patterns that were classified correctly with this rule. In this case, all seven responses to the seven different concentrations of ethanol were successfully classified by this rule. If applicable, a second number following a slash sign is given, corresponding to the number of misclassification cases.

This decision tree algorithm was tested on the original experimentally obtained data set consisting of the responses of 12 sensors to 12 VOCs. A number of trees were constructed using various options such as pruning, cross-validation, and boosting.

Pruning is a procedure for removing redundancy from the generated decision tree. Pruning usually results in much simpler trees, using a significantly smaller number of features than the original tree, at a possible cost of minor performance deterioration. The features used in the final tree are considered the most important features.

Cross-validation is used to optimize the generated tree by evaluating it on a test data set. Cross-validation is achieved by partitioning the entire database (training and testing) into M blocks, where each block is internally divided into a training sub-block and a testing sub-block (hold-out set). During this partitioning, the number of patterns and the class distributions are made as uniform as possible. M trees are generated from these M blocks of data, and the average error rate over the M hold-out sets is considered to be a good predictor of the error rate of the tree built from the full data set. Finally, boosting is also a procedure for generating multiple trees, where the misclassified test signals of the previous tree are moved to the training data set. Boosting [42,43] is a common procedure used for improving the performance of a classifier.

3.1.2. Results using decision trees

Among the many trees generated using these various options, none performed satisfactorily. Although able to reduce the number of features from 12 to 5–7, the classification performance using these features was in the range of 63–83%. Contrary to its intention, previous studies have shown that this algorithm is most useful when the features selected in its decision tree are actually used to train a neural network [44]. In other words, this algorithm appears to be a good feature selection algorithm, rather than a classification algorithm, although it was originally designed as a classification scheme.

One of the better trees, constructed using the boosting, pruning, and cross-validation options, is the five-feature tree shown in Fig. 4. As evident, this tree used the features APZ, PIB, OV275, PSB, and PDS.

Fig. 4. Optimal decision tree generated by C5.0.

These features were then used to train a neural network. An MLP neural network with a 5 × 25 × 12 architecture was trained, where the numbers refer to the number of nodes in each layer. The responses of the five sensors constituted the input layer; there were 25 hidden layer nodes and 12 output nodes, each output representing one of the 12 VOCs. The training data were obtained by randomly selecting 30 patterns from the database of 84 patterns. The remaining 54 patterns constituted the test data, and they were used to evaluate the classification performance. The results are summarized in Table 1. The Train columns indicate the number of signals used in the training data for each VOC, and the Performance columns indicate the number of correctly classified patterns for each VOC out of the seven that were in the original dataset.

Table 1
Results of the neural network trained with the features suggested by C5.0

VOC   Train   Performance     VOC   Train   Performance
AC    2       6/7             TCA   2       7/7
MEK   1       6/7             TCE   3       7/7
ET    4       6/7             HX    4       7/7
ME    2       7/7             OC    2       7/7
ACN   2       6/7             TL    3       7/7
DCA   3       7/7             XL    2       7/7

With only four misclassified signals and a test data performance of 93%, this neural network performed significantly better than the best decision tree used as a classifier. The same neural network was then tested on the expanded (synthetic) data set, and the results are shown in Table 2. With eight misclassifications, the network that was trained with patterns from the original (experimentally obtained) database had a correct classification performance of 96% on the expanded data set.

The performance of the neural network, and consequently that of the feature selection capability of the decision tree method, must be evaluated with some caution, however. The feature subset containing five features was the best of over 40 different feature subsets suggested by C5.0 in various attempts. A number of different parameters had to be adjusted by trial and error in order to obtain this feature set. Therefore, this algorithm may not be the most efficient one to use, particularly for users who are not very familiar with decision tree algorithms.
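For readers who wish to re-create the neural network experiment above, a rough modern equivalent of the 5 × 25 × 12 MLP can be set up with scikit-learn as sketched below. The random feature matrix stands in for the responses of the five C5.0-selected coatings, and the training options are our assumptions, not the study's original settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
X = rng.random((84, 5))           # placeholder: 84 patterns x 5 selected coatings
y = np.repeat(np.arange(12), 7)   # 12 VOC classes x 7 concentrations

# 30 training / 54 testing patterns, as in the text (stratified for stability).
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=30, stratify=y, random_state=0)

# One hidden layer of 25 nodes mirrors the 5 x 25 x 12 architecture.
clf = MLPClassifier(hidden_layer_sizes=(25,), max_iter=5000, random_state=0)
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```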

3.2. Method II: modified wrapper approach

The decision tree-based approach for the selection of optimum features is a very efficient algorithm, rapidly converging to a solution. This algorithm is difficult to use, however, requiring the adjustment of several parameters. Furthermore, since it does not perform satisfactorily as a classifier, a separate classification algorithm, such as a neural network, must be used for the actual classification. On the other hand, the features chosen by the decision tree may not always be optimal for neural network classification. Most importantly, the decision tree is unable to give a pre-specified number of features that would work best. The number of features used in the tree is determined by the algorithm and not by the user. Pre-specifying a number K for the number of features, and being able to ask the algorithm to find the best K features that would optimize the performance among all other K-feature subsets, is a very desirable property. These concerns motivated us to look for an alternate method for obtaining an optimal feature subset.

Table 2
Results on the expanded data set (15 concentrations)

VOC   Performance     VOC   Performance
AC    14/15           TCA   13/15
MEK   14/15           TCE   15/15
ET    10/15           HX    15/15
ME    15/15           OC    14/15
ACN   14/15           TL    13/15
DCA   12/15           XL    13/15

More recently developed wrapper approaches [37,45] have been successfully used as feature selection algorithms, where the features are selected based on the performance of the subsequent classification algorithm. Thus, the features selected are the optimum features for the particular classification algorithm to be used. Wrapper approaches also allow us to select the number of features, and there are fewer parameters to be selected. These benefits, however, come at a cost of computational complexity.

3.2.1. Strong and weak relevance

Kohavi and John expanded the meaning of relevance in feature selection by defining strong relevance and weak relevance as follows [37]: let X_i be a feature, S_i = {X_1, ..., X_{i-1}, X_{i+1}, ..., X_m} be the set of all features except X_i, and let x_i and s_i be the value assignments to X_i and S_i, respectively. The feature X_i is strongly relevant if and only if there exist some x_i, y, and s_i such that the probabilistic relation in Eq. (8) holds:

P(Y = y \mid X_i = x_i, S_i = s_i) \neq P(Y = y \mid S_i = s_i), \quad \text{for } P(X_i = x_i, S_i = s_i) > 0 \qquad (8)

where Y is a random variable for the set of classes, and y is a class assignment to the current pattern. A feature X_i is weakly relevant if it is not strongly relevant and there exists a subset of features S_i' of S_i for which there exist some x_i, y, and s_i' with P(X_i = x_i, S_i' = s_i') > 0 such that:

P(Y = y \mid X_i = x_i, S_i' = s_i') \neq P(Y = y \mid S_i' = s_i') \qquad (9)

These definitions are based on Bayes classifiers [46], which are statistically considered to be optimal classifiers. However, these classifiers require that the distribution of the data and their classes be fully known, which is seldom true. According to the above definitions, a feature is strongly relevant if removing this feature alone results in performance degradation of an optimal Bayes classifier. A feature, X, is then weakly relevant if it is not strongly relevant and there exists a subset of features, S', that does not include X, such that the performance of the Bayes classifier on S' is worse than the performance on S' ∪ {X}.

Kohavi and John's [37] approach, originally developed to improve the classification accuracy of ID3, simply searches a subset of the feature space in an organized manner. The algorithm is based on testing all feature subsets within a limited search space using ID3 until there is no further improvement in ID3 performance. The underlying idea is that any feature subset selection algorithm should be based on the subsequent classification algorithm intended for use. Therefore, the feature subset selection algorithm must work together with, or be "wrapped around", its intended classifier.

The best method for finding the optimum feature subset is to search the feature space exhaustively for every possible feature combination. The problem, however, is that such a search can be computationally prohibitive. Because of this problem, only a subset of the feature space must be searched in an organized manner by exploiting any additional information that is available. The search can start, for instance, using all features and progress by removing features that do not contribute notably to the classification performance (backward search), or alternatively, the search can begin devoid of features and proceed by adding features that contribute the most to classification performance (forward search).

We adopted the forward search approach, where we started the search with zero features, and the performance was evaluated for each feature using a classifier acting as an evaluation function. Once the feature that gave the best performance was identified, one and only one feature was iteratively added to the search. These two-feature subsets were then evaluated by the classifier, and the best two-feature subset was determined. A third feature was then added to those two features, and this procedure was continued until adding new features did not improve performance. This search procedure is commonly known as the hill-climb search algorithm, where a subset of the feature space is searched until the best performance is found.

Kohavi and John [37] suggested that this method could find all strongly relevant features as well as some weakly relevant features. These researchers also suggested that this method would work best with decision tree algorithms or with Bayes classifiers. They therefore used the classification performances of ID3 and naive Bayes classifiers as their evaluation functions.

The reduction in computational complexity using the hill-climb can easily be seen from a numerical example. Searching for the best feature subset from a set of 12 features requires evaluating (that is, training and testing) C(12/1) + C(12/2) + C(12/3) + ... + C(12/11) + C(12/12) = 4095 different networks, where C(n/k) is the number of possible combinations of choosing k features from a set of n. The maximum number of subsets searched using the hill-climb search, on the other hand, is N(N−1)/2, where N is the original number of features. For 12 features, there would be 66 subsets to search, which is computationally more feasible.
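Both counts are easy to verify (a two-line check, added for illustration):

```python
from math import comb

N = 12
exhaustive = sum(comb(N, k) for k in range(1, N + 1))  # every non-empty subset
hill_climb_bound = N * (N - 1) // 2                    # bound quoted in the text
print(exhaustive, hill_climb_bound)                    # 4095 66
```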

Fig. 5 shows the complete search space for N = 4, where each node has a binary code indicating the features that are included and the ones that are not. For instance, the feature subset [0, 1, 1, 0] includes the second and third but not the first and fourth features. Note that each node is connected to nodes that have one and only one feature added or deleted. Every feature set obtained from a previous node (parent node) by adding or removing one feature is called a child node. Obtaining the children of a parent node is referred to as expanding. As such, the hill-climb search algorithm can be formally described as follows:

1. Let S be the initial feature subset, typically (0, 0, ..., 0, 0).
2. Expand S: find all children of S by adding or removing one feature at a time.
3. Apply the evaluation function, f, to each child, s.
4. Let s_max be the child with the highest evaluation, f(s).
5. If f(s_max) > f(S), then S ← s_max and return to step 2; else
6. Return S as the solution.

Fig. 5. Complete feature space for N = 4.

Fig. 6 illustrates the application of this algorithm to the 12-feature space. Note that at each stage, the number of possible combinations that need to be evaluated decreases by 1. Therefore, the maximum number of feature subsets that must be evaluated is N + (N−1) + (N−2) + ... + (N−(N−1)), or N(N−1)/2. Also note that in most cases, the total number of feature subsets to be searched will be less than this number, since the search will stop when the optimum set is found.
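The six steps translate directly into code. Below is a minimal sketch with a pluggable evaluation function f; in this study f was the test performance of a T × 25 × 12 MLP trained on the candidate subset, for which a simple stub stands in here:

```python
def expand(S):
    """Children of subset S: flip exactly one feature in or out (step 2)."""
    return [S[:i] + (1 - S[i],) + S[i + 1:] for i in range(len(S))]

def hill_climb(n_features, f):
    S = (0,) * n_features                  # step 1: start devoid of features
    best = f(S)
    while True:
        children = expand(S)               # step 2
        scores = [f(s) for s in children]  # step 3
        s_max = children[scores.index(max(scores))]  # step 4
        if max(scores) > best:             # step 5: climb while improving
            S, best = s_max, max(scores)
        else:
            return S                       # step 6

# Stub evaluation: count agreements with a made-up "ideal" subset.
target = (1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0)
f = lambda S: sum(a == b for a, b in zip(S, target))
print(hill_climb(12, f))  # recovers the target mask
```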

3.2.2. Results using wrapper approach

The hill-climb search algorithm has a potential problem of getting trapped at a local performance maximum, since the search continues only until the performance improvement stops. It is quite possible for the performance to stop improving at a certain feature set, but then continue to improve. Since the feature space in this database was manageably small, all 66 subsets in the hill-climb search space were examined, thus effectively eliminating this potential problem of a local performance maximum. For each feature subset, a new MLP was trained with the experimentally generated training data set consisting of 30 patterns, and the network was then tested on the remaining 54 patterns. The performance of the MLP (percentage of correctly classified test samples) was used as the evaluation function. The network architecture was T × 25 × 12, where T was the number of features in the current feature subset being evaluated. The same training and testing data sets were used for each network. On a reasonably configured machine, for example, a Pentium III running at 800 MHz or better with 128 MB RAM or more, the algorithm takes a little over 1 h to complete.

The feature subset that had the best performance for the least number of features on the model neural network architecture (4 × 25 × 12, with 30 training and 54 testing data) was PIB, OV275, SG, and PDPP. Although another subset with an additional feature had a slightly higher performance, the four-feature subset was preferred because of its smaller dimension. To avoid rapid growth in adding features, we also added a subroutine to the program code that marginally penalizes the addition of an extra feature. It is interesting to note that PIB, OV275, and PDPP were also on the most successful coatings list of Zellers et al. [23] for a similar list of VOCs. The algorithm was therefore able to pick the best set of coatings in a fraction of the time, without requiring an exhaustive search of all possible coatings.
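The exact form of that penalizing subroutine is not given here; one plausible version, which deducts a small constant from the raw accuracy for every selected feature (the value of LAMBDA below is hypothetical), is:

```python
LAMBDA = 0.005  # hypothetical per-feature penalty; not the study's actual value

def penalized_score(accuracy, subset):
    """Marginally penalize larger subsets so that, of two subsets with
    near-equal accuracy, the smaller one wins the comparison."""
    return accuracy - LAMBDA * sum(subset)
```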

Fig. 6. Hill-climb search for the feature space with 12 features. The evaluation function is the performance of the T × 25 × 12 MLP neural network, where T is the number of features in the current feature space.

The feature subset chosen by the hill-climb search, based on 30 training patterns, classified all 54 validation patterns (previously unseen by the network) as well as the training patterns correctly, giving 100% classification performance. The distribution of the training data and the performance are given in Table 3. This network was also tested with the expanded synthetic data set of 15 concentrations, and all but three patterns (all ethanol) of the total 180-signal set were classified correctly, giving a classification performance of 98.3%.

Table 3
The performance of the best feature subset as chosen by hill-climb

VOC   Train   Performance     VOC   Train   Performance
AC    2       7/7             TCA   2       7/7
MEK   1       7/7             TCE   2       7/7
ET    2       7/7             HX    3       7/7
ME    3       7/7             OC    4       7/7
ACN   2       7/7             TL    4       7/7
DCA   3       7/7             XL    2       7/7

3.3. Improving the wrapper approach

The hill-climb search technique is not guaranteed to find the best feature set, since it is prone to getting trapped in a local maximum in the performance space. When the total number of features is small, the algorithm can search the entire hill-climb search space, which essentially eliminates the local performance maximum problem. Note that the hill-climb search space is typically orders of magnitude smaller than the entire feature space. When the total number of features is large, however, searching even the hill-climb search space may be quite computationally expensive. Furthermore, when starting with one feature, the initial steps are more likely to result in the network not converging, or in the search being trapped at a local maximum. This situation arises because only one (or very few) feature is not sufficient for convergence to the desired error minimum or for satisfactory performance.

On the other hand, we note that the time required for selecting the next feature decreases as the number of selected features increases. In other words, finding the best subset with k features takes more time than finding the best subset with k + 1 features, given that the best k features are known from previous iterations. We observe that the search spends most of its time during the initial steps, and that the networks are most likely to fail during these steps. This observation leads us to a more computationally efficient approach: if we can initially identify a few of the best possible features and then start the hill climbing from that point, the search time can be significantly reduced. Furthermore, selecting the first few critical features increases performance by avoiding any initial missteps in starting the hill climb, and reduces the possibility of being trapped at a local maximum. We note that if no prior information is known about the data and/or the relevance of the possible features, statistical procedures can be used to determine important features.

One such procedure is using the variance of the features among different classes. Intuitively, features whose values change when the class changes carry more information than features whose values do not change with class. In addition, if the value of a particular feature is constant regardless of the class, then that feature provides no discriminatory information and is therefore of no use. This approach, however, has a major flaw. If a particular feature changes in each case, then the variance of this feature would be very high; nevertheless, the high variance would render this feature useless for classification. Care must therefore be taken to select the features that have the maximum variance among different classes, but the minimum variance among the patterns of the same class (a sketch of such a ranking follows the list below). Such features are good candidates as the best starting features. An effective normalization scheme is also necessary for this approach to work.

When applied to the database we analyzed, the features that had the highest variance among different classes (and the smallest within individual classes) were PIB and OV275, with which the hill-climb search agreed by identifying them as two of the best features. However, identifying these two features took about 1 h (on a PIII 800 MHz machine with 128 MB RAM) using the hill-climb search, whereas variance-based identification of the initial two features took only a few seconds. Finally, when the hill-climb search was initialized with PIB and OV275, the search identified the other two features as SG and PDPP in less than 15 min. Since the time required for computing the variances is negligible compared to the search time, the total running time was also less than 15 min.

For completeness, the following is the list of the features, in descending order of their variances:

1. PIB
2. OV275
3. GRAP
4. PSB
5. PDPP
6. APZ
7. DEGA
8. PCP
9. PDS-CO
10. PDS
11. SG
12. PDS-OH
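A minimal sketch of this selection criterion, ranking features by the variance of the class means (between-class) relative to the average within-class variance, is given below; a ranking of this kind produced the list above. The data array here is a random placeholder with one deliberately informative feature.

```python
import numpy as np

def variance_ranking(X, y):
    """Rank features by between-class variance of the class means divided
    by the mean within-class variance; high scores mark good starters."""
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    within = np.array([X[y == c].var(axis=0) for c in classes]).mean(axis=0)
    between = means.var(axis=0)
    return np.argsort(between / (within + 1e-12))[::-1]  # best first

# Placeholder 84 x 12 response matrix: 12 VOC classes, 7 patterns each.
rng = np.random.default_rng(2)
y = np.repeat(np.arange(12), 7)
X = rng.random((84, 12))
X[:, 0] += 0.5 * y                 # make feature 0 class-dependent
print(variance_ranking(X, y)[:2])  # indices of the two best starting features
```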

It should be noted that SG, which was chosen by the hill-climb search, was at the bottom of the list. At issue then is how this approach alone would perform if the features were chosen from the top of this list. To answer this question, the top four coatings from this list were chosen and the standard network was trained again with 30 cases. The distribution of the training data and the results of this network are shown in Table 4. Note that the worst performance came from TCE, which happened to have only one representative in the training set.

Table 4
The performance of the coatings chosen based on their variance

VOC   Train   Performance     VOC   Train   Performance
AC    2       6/7             TCA   3       5/7
MEK   2       5/7             TCE   1       1/7
ET    2       7/7             HX    4       7/7
ME    2       6/7             OC    3       6/7
ACN   3       6/7             TL    2       5/7
DCA   3       7/7             XL    3       7/7

Apart from TCE's six misclassifications, there were nine other misclassifications. These results prove that features should not be chosen based on their variance only. However, choosing a few features with the highest variance, and using them as the initial features in the hill-climb search, may provide the best of both worlds by reducing the total processing time of the search algorithm, as well as the possibility of being trapped at a local maximum.

4. Conclusions and future work

We examined the viability of two feature selection methods for the VOC identification problem. The first approach, using a decision tree to determine the features carrying the most information and then training a neural network with these features, performed fairly well on both the experimental and expanded databases. The correct classification performance was 93% on the original experimentally generated data set, and 96% on the expanded data set. One major drawback of this scheme is the number of parameters that need to be optimized for the various options of the decision tree algorithm. On the other side, decision trees are considerably faster to train than neural networks, whose training time constitutes a major drawback of the second approach.

The second approach, based on a hill-climb search of the feature space, performed very well: the network trained with the four features selected by the hill-climb search classified all patterns correctly. This approach, although significantly faster and computationally more efficient than an exhaustive search, can still be computationally expensive if the number of features becomes large. Using simple statistical measures for selecting the first few features has been introduced as an intuitive and effective solution for reducing the computation time.

If the original feature space is very large, and/or the statistical properties of the data do not allow accurate predictions of the initial feature set, yet another alternative can be integrating the decision tree-based approach into the hybrid procedure. In such cases, the decision tree can be used to obtain a rough estimate of the relevant features, and the hill-climb search can be started with these features. Finally, we note two of the most important advantages of the hill-climb approach: it allows the user to pre-specify the number of features desired, and it requires a smaller number of parameters to be optimized. Future work includes developing better search algorithms, and/or better starting points for these selection algorithms. We are also interested in pursuing the optimum selection of coatings for mixtures of VOCs.

Acknowledgements

The authors gratefully acknowledge the assistance and suggestions of Guojun Liu, Robert Lipert, and Bikas Vaidya. This work was supported by Fisher Controls International of Marshalltown, Iowa and the Microanalytical Instrumentation Center of Iowa State University. The Ames Laboratory is operated for the US Department of Energy by Iowa State University under contract W-7405-Eng-82.

References

[1] S.L. Rose-Pehrsson, J.W. Grate, D.S. Ballantine, P.C. Jurs, Detection of hazardous vapors including mixtures using pattern recognition analysis of responses from surface acoustic wave devices, Anal. Chem. 60 (1988) 2801–2811.
[2] A. D'Amico, C. Di Natale, E. Verona, in: K. Rogers (Ed.), Handbook of Biosensors and Electronic Noses, CRC Press, Boca Raton, FL, 1997, Chapter 9, pp. 197–223.
[3] J.W. Grate, S. Rose-Pehrsson, D.L. Venezky, M. Klutsy, H. Wohltjen, Smart sensor system for trace organophosphorus and organosulfur vapor detection employing a temperature-controlled array of surface acoustic wave sensors, automated sample preconcentration, and pattern recognition, Anal. Chem. 65 (1993) 1868–1881.
[4] J.W. Grate, B.M. Wise, M.H. Abraham, Method for unknown vapor characterization and classification using a multivariate sorption detector: initial derivation and modeling based on polymer-coated acoustic wave sensor arrays and linear solvation energy relationships, Anal. Chem. 71 (1999) 4544–4553.
[5] J.D.N. Cheeke, Z. Wang, Acoustic wave gas sensors, Sens. Actuators B 59 (1999) 146–153.
[6] C.K. O'Sullivan, G.G. Guilbault, Commercial quartz crystal microbalances — theory and applications, Biosens. Bioelectron. 14 (1999) 663–670.
[7] L. Cui, M.J. Swann, A. Glidle, J.R. Barker, J.M. Cooper, Odour mapping using microresistor and piezoelectric sensor pairs, Sens. Actuators B 66 (2000) 94–97.
[8] T. Nakamoto, A. Iguchi, T. Moriizumi, Vapor supply method in odor sensing system and analysis of transient sensor responses, Sens. Actuators B 71 (2000) 155–160.
[9] J.W. Gardner, Detection of vapours and odours from a multisensor array using pattern recognition. Part 1. Principal component and cluster analysis, Sens. Actuators B 4 (1991) 109–115.

[10] P.M. Schweizer-Berberich, S. Vaihinger, W. Gopel, Characterization of food freshness with sensor arrays, Sens. Actuators B 18/19 (1994) 282–290.
[11] N. Ryman-Tubb, They all stink! Chemometrics and the neural approach, Proc. SPIE Virtual Intell. 2878 (1996) 117–127.
[12] B. Yea, T. Osaki, K. Sugahara, R. Konishi, The concentration estimation of inflammable gases with a semiconductor gas sensor utilizing neural networks and fuzzy inference, Sens. Actuators B 41 (1997) 121–129.
[13] Z. Wang, J. Hwang, B.R. Kowalski, ChemNets: theory and application, Anal. Chem. 67 (1995) 1497–1504.
[14] D.S. Ballantine, S.L. Rose, J.W. Grate, H. Wohltjen, Correlation of surface acoustic wave device coating responses with solubility properties and chemical structure using pattern recognition, Anal. Chem. 58 (1986) 3058–3066.
[15] J.W. Grate, M.H. Abraham, Solubility interactions and the design of chemically-selective sorbent coatings for chemical sensors and arrays, Sens. Actuators B 3 (1991) 85–111.
[16] D. Amati, D. Arn, N. Blom, M. Ehrat, J. Saunois, H.M. Widmer, Sensitivity and selectivity of surface acoustic wave sensors for organic solvent vapor detection, Sens. Actuators B 7 (1992) 587–591.
[17] K. Nakamura, T. Suzuki, T. Nakamoto, T. Moriizumi, Sensing film selection of QCM odor sensor suitable for apple flavor discrimination, IEICE Trans. Electron. E83 (2000) 1051–1056.
[18] T. Nakamoto, K. Nakamura, T. Moriizumi, Classification and evaluation of sensing films for QCM odor sensors by steady-state sensor response measurement, Sens. Actuators B 69 (2000) 295–301.
[19] C. Di Natale, A. D'Amico, A.M.F. Davide, Redundancy in sensor arrays, Sens. Actuators A 37/38 (1993) 612–617.
[20] J. Park, E.T. Zellers, Determining the minimum number of sensors required for multiple vapor recognition with arrays of polymer-coated SAW sensors, Proc. Electrochem. Soc. 99 (1999) 132–137.
[21] J. Park, W.A. Groves, E.T. Zellers, Vapor recognition with small arrays of polymer-coated microsensors, Anal. Chem. 71 (1999) 3877–3886.
[22] G.C. Osbourn, R.F. Martinez, J.W. Bartholomew, W.G. Yelton, A.J. Ricco, Optimizing chemical sensor array, Proc. Electrochem. Soc. 99 (1999) 127–131.
[23] E.T. Zellers, S.A. Batterman, M. Han, S.J. Patrash, Optimal coating selection for the analysis of organic vapor mixtures with polymer-coated surface acoustic wave sensor arrays, Anal. Chem. 67 (1995) 1092–1106.
[24] W.P. Carey, K.R. Beebe, B.R. Kowalski, D.L. Illman, T. Hirschfeld, Selection of adsorbates for chemical sensor arrays by pattern recognition, Anal. Chem. 58 (1986) 149–153.
[25] F. Avila, D.E. Myers, C. Palmer, Correspondence analysis and adsorbate selection for chemical sensor arrays, J. Chemom. 5 (1991) 455–465.
[26] G.C. Osbourn, R.F. Martinez, Empirically defined regions of influence for clustering analysis, Pattern Recog. 28 (1995) 1793–1806.
[27] A.J. Ricco, R.C. Crooks, G.C. Osbourn, Surface acoustic wave chemical sensor arrays: new chemically sensitive interfaces combined with novel cluster analysis to detect volatile organic compounds and mixtures, Accounts Chem. Res. 31 (1998) 289–296.
[28] G.C. Osbourn, J.W. Bartholomew, A.J. Ricco, G.C. Frye, Visual-empirical region of influence pattern recognition applied to chemical microsensor array selection and chemical analysis, Accounts Chem. Res. 31 (1998) 297–305.
[29] R.O. Duda, D. Stork, P.E. Hart, Pattern Classification, 2nd Edition, Wiley, New York, 2001.
[30] T.M. Mitchell, Machine Learning, WCB/McGraw-Hill, Boston, 1997.
[31] J.R. Quinlan, Induction of decision trees, Machine Learning 1 (1986) 81–106.
[32] D.W. Aha, D. Kibler, M.K. Albert, Instance-based learning algorithms, Machine Learning 6 (1991) 37–66.
[33] B.V. Dasarathy, Nearest Neighborhood (NN) Norms: NN Pattern Classification Techniques, IEEE Computer Society Press, Los Alamitos, 1990.
[34] H. Almuallim, T.G. Dietterich, Learning Boolean concepts in the presence of many irrelevant features, Artif. Intell. 69 (1994) 279–306.
[35] K. Kira, L.A. Rendell, The feature selection problem: traditional methods and a new algorithm, in: Proceedings of the 10th National Conference on Artificial Intelligence (AAAI-92), 1992, pp. 129–134.
[36] D. Koller, M. Sahami, Toward optimal feature selection, in: Proceedings of the 13th International Conference on Machine Learning (ICML-96), 1996, pp. 284–292.
[37] R. Kohavi, G.H. John, Wrappers for feature subset selection, Artif. Intell. 97 (1997) 273–324.
[38] H. Wohltjen, Mechanism of operation and design considerations for surface acoustic wave device vapor sensors, Sens. Actuators 5 (1984) 307–325.
[39] R. Polikar, Algorithms for Enhancing Pattern Separability, Optimum Feature Selection and Incremental Learning with Applications to Gas Sensing Systems, Ph.D. Dissertation, Iowa State University, Ames, IA, 2000.
[40] R. Shinar, G. Liu, M.D. Porter, Graphite microparticles as coatings for quartz crystal microbalance-based gas sensors, Anal. Chem. 72 (2000) 5981–5987.
[41] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, CA, 1993.
[42] R. Schapire, The strength of weak learnability, Machine Learning 5 (1990) 197–227.
[43] Y. Freund, R. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, J. Comput. Syst. Sci. 55 (1997) 119–139.
[44] M. Seo, Automatic Ultrasonic Signal Classification Scheme, MS Thesis, Iowa State University, Ames, IA, 1997.
[45] R. Kohavi, G.H. John, in: H. Liu, H. Motoda (Eds.), Feature Extraction, Construction and Selection: A Data Mining Perspective, Kluwer Academic Publishers, Norwell, MA, 1998.
[46] K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd Edition, Academic Press, San Diego, 1990.

Biographies

Robi Polikar received his BS degree in electronics and communications engineering from Istanbul Technical University in 1993, and MS and PhD degrees, both co-majors in biomedical engineering and electrical engineering, from Iowa State University, Ames, Iowa, in 1995 and 2000, respectively. He is currently an assistant professor of electrical and computer engineering at Rowan University, Glassboro, NJ. His current research interests include signal processing, pattern recognition, and applications of neural networks for biomedical engineering.

Ruth Shinar received her PhD in physical chemistry from the Hebrew University, Jerusalem, Israel, in 1977. She was a post-doctoral fellow at the University of California, Santa Barbara before joining the Microelectronics Research Center and then the Microanalytical Instrumentation Center at Iowa State University. Her research interests include surface chemistry, chemical sensors, photovoltaics, bioassays, and chip-scale instrumentation.

Lalita Udpa received her PhD in electrical engineering from Colorado State University in 1986. She is currently a professor of electrical engineering at Iowa State University. Dr. Udpa works primarily in the area of nondestructive evaluation (NDE). Her research interests include numerical modeling of the forward problem and solution of the inverse problems in NDE. She works extensively on the application of signal processing, pattern recognition, and neural network algorithms. She also teaches graduate-level pattern recognition and signal processing classes at Iowa State University.

Marc D. Porter received his BS and MS degrees in chemistry from Wright State University in 1977 and 1979, respectively, and his PhD in analytical chemistry from Ohio State University in 1984. His graduate work focused on new ways to characterize electrode materials. He then joined Bell Communications Research as a post-doctoral associate, and explored structural and electrochemical issues in self-assembled monolayers. He is presently a professor of chemistry at Iowa State University and is the Director of its Microanalytical Instrumentation Center. His research interests include surface analytical chemistry, monolayer assemblies, chip-scale instrumentation, analytical separations, bioassays, and chemically selective microscopies.