Research Article

Received: 20 February 2011, Revised: 05 October 2011, Accepted: 27 December 2011, Published online in Wiley Online Library: 2012

(wileyonlinelibrary.com) DOI: 10.1002/cem.1423

Boosting partial least-squares discriminant analysis with application to near infrared spectroscopic variety discrimination

Shi-Miao Tan(a), Rui-Min Luo(a), Yan-Ping Zhou(a)*, Hui Xu(a), Dan-Dan Song(a), Tan Ze(b), Tian-Ming Yang(c)** and Yan Nie(d)

In the present study, boosting has been combined with partial least-squares discriminant analysis (PLS-DA) to develop a new pattern recognition method called boosting partial least-squares discriminant analysis (BPLS-DA). BPLS-DA is implemented by first constructing a series of PLS-DA models on various weighted versions of the original calibration set and then combining the predictions of these models by weighted majority vote. Coupled with near infrared (NIR) spectroscopy, BPLS-DA has been applied to discriminate different tea varieties. For comparison with BPLS-DA, conventional principal component analysis, linear discriminant analysis (LDA), and PLS-DA have also been investigated. Experimental results show that inter-variety differences can be accurately and rapidly distinguished via NIR spectroscopy coupled with BPLS-DA. Moreover, the introduction of boosting drastically enhances the performance of an individual PLS-DA model, and BPLS-DA is a well-performing pattern recognition technique superior to LDA. Copyright © 2012 John Wiley & Sons, Ltd.

Keywords: boosting partial least-squares discriminant analysis; near infrared spectroscopy; tea variety discrimination

* Correspondence to: Yan-Ping Zhou, Key Laboratory of Pesticide and Chemical Biology of Ministry of Education, College of Chemistry, Central China Normal University, Wuhan, 430079, China.
E-mail: [email protected]

** Correspondence to: Tian-Ming Yang, College of Pharmacy, South-Central University for Nationalities, Wuhan, 430074, China.
E-mail: [email protected]

a S.-M. Tan, R.-M. Luo, Y.-P. Zhou, H. Xu, D.-D. Song
Key Laboratory of Pesticide and Chemical Biology of Ministry of Education, College of Chemistry, Central China Normal University, Wuhan 430079, China

b T. Ze
State Key Laboratory of Chemo/Biosensing and Chemometrics, College of Chemistry and Chemical Engineering, Hunan University, Changsha 410082, China

c T.-M. Yang
College of Pharmacy, South-Central University for Nationalities, Wuhan 430074, China

d Y. Nie
College of Urban & Environment Sciences, Central China Normal University, Wuhan 430079, China

1. INTRODUCTION

Identification, authentication, and adulteration issues are of utmost importance in a wide variety of fields, such as food [1,2], agriculture [3], pharmaceuticals [4], and herbal medicine [5]. Such issues are in essence discrimination/classification problems. Traditionally, discrimination tasks are executed by using several different wet chemistry methods to detect the disparities of the constituents of interest among the associated samples. The wet chemistry tools involved, including Raman spectroscopy [2], mass spectroscopy [6], and high performance liquid chromatography [7], are precise but time consuming, laborious, and invasive. Therefore, rapid and simple analytical techniques for discrimination tasks are in high demand. Near infrared (NIR) spectroscopy, as a rapid, simple, nondestructive, and environmentally friendly technique, has proven to be a good alternative to the wet chemical tools, as shown by its extensive applications in the agricultural [8], manufacturing [9], pharmaceutical [10], and food industries [11–13].

Because of the low selectivity of the NIR absorption bands, the NIR spectroscopic technique is often coupled with multivariate chemometric algorithms for realizing the discrimination tasks. The fundamental idea behind the multivariate chemometric tool, that is, the pattern recognition technique, is to establish a calibration model relating the measured NIR signals to certain properties of the samples, say the sample class membership. This calibration model is then applied to predict the same properties of samples outside the calibration set from their measured NIR responses. Whether the properties can be predicted accurately depends greatly on the performance of the applied pattern recognition method. Partial least-squares discriminant analysis (PLS-DA) has proven fruitful in the pattern recognition community. This may be because PLS-DA is able to remedy the problems of correlated inputs and limited observations. However, PLS-DA might still be susceptible to overfitting by introducing noninformative variables, limiting its practical efficiency in NIR spectroscopic applications. In addition, the performance of PLS-DA is strongly dependent upon the homogeneity of the model error and the uniformity of the data sets [14,15]. PLS-DA may show deteriorated performance in cases where the calibration samples are singularly distributed

J. Chemometrics 2012; 26:34–39 Copyright © 2012 John Wiley & Sons, Ltd. BPLS-DA for NIR tea variety discrimination

into clusters and the model errors are highly heterogeneous. Appropriate sample weighting is indispensable for further improving the model performance. Boosting, which originated in the machine learning community, was first developed by Schapire [16] and has recently been progressively introduced in chemistry [17–24]. It refers to the formulation of an accurate prediction rule by combining a series of rough and moderately inaccurate rules-of-thumb. This technique is able to reduce bias and variance simultaneously by integrating multiple predictions. Another attractive property of boosting lies in its capability of optimizing the weights assigned to the samples. In addition, boosting holds great potential for accurately capturing the nonlinear structure of data by adaptively adding a set of basic learners. Boosting has attracted considerable attention for classification problems [21–24], and much has been reported about its success in yielding more accurate predictions [25,26].

Inspired by these appealing properties of boosting and the drawbacks of PLS-DA, we invoked boosting to improve the performance of PLS-DA, forming a new pattern recognition technique called boosting partial least-squares discriminant analysis (BPLS-DA). BPLS-DA is carried out by constructing a series of PLS-DA models and then combining the predictions from the resultant PLS-DA models via weighted majority voting. Coupled with NIR spectroscopy, the newly proposed BPLS-DA has been applied to discriminate various tea varieties, namely Biluochun, Xihulongjing, Xinyangmaojian, Qihong, Tieguanyin, and Yinzhen. The feasibility of tea variety discrimination by using NIR spectroscopy combined with pattern recognition has been validated recently by He et al. [27]. In the current study, the NIR spectra obtained from these six tea varieties have been combined into one data set for the discrimination analysis. Such a data set may present some inherent characteristics, viz. collinearity, nonlinearity, heterogeneity, or the existence of outliers. Therefore, before pattern recognition, data analysis has been carried out, including nonlinearity and outlier detection. For the recognition analysis, PCA has first been applied to ascertain whether discrimination is possible with the NIR spectra. For comparison, PLS-DA and linear discriminant analysis (LDA) have also been investigated. The results show that NIR spectroscopy combined with BPLS-DA holds great potential as an accurate, rapid, and noninvasive strategy for identifying tea quality. In addition, the introduction of boosting substantially improves the performance of PLS-DA, and the performance of BPLS-DA compares favorably with that of LDA.

2. THEORY

2.1. Boosting partial least-squares discriminant analysis

There are several implementations of boosting for classification purposes, such as gradient boosting [28], stochastic variants of boosting [29], adaptive boosting (AdaBoost) [30], and many others [31,32]. AdaBoost has proven to be the foremost and most popular version of boosting for classification. In the present study, a version of AdaBoost for the multiclass case [30], that is, AdaBoost.M1, is introduced to improve the performance of PLS-DA. In BPLS-DA, PLS-DA is taken as the basic classification learner. The procedure of BPLS-DA is described as follows.

Consider the original calibration set with $N$ samples belonging to $K$ classes. The class membership is defined as $k \in \{1, 2, \ldots, K\}$, where $K$ is the total number of classes. First, initialize identical weights for all the original calibration samples, $w_{n,1} = 1/N$ ($n = 1, \ldots, N$, where $N$ is the size of the original calibration set). Then, for $t = 1$ to $T$ ($T$ is the ensemble size), perform the following steps.

Step 1. According to the weights $w_t$, $N$ samples, called the boosting set, are picked with replacement from the original calibration set. The weight distribution specifies the relative importance of each instance for the current cycle.

Step 2. Construct a PLS-DA model on the boosting set and use this discriminant rule to estimate the class memberships of the original calibration samples. If a sample is misclassified, its associated error $e_{n,t}$ is 1; otherwise, the error is 0.

Step 3. Calculate the sum of the weighted errors of all the original calibration samples by the following formulation:

$$\mathrm{err}_t = \sum_{n=1}^{N} w_{n,t}\, e_{n,t} \qquad (1)$$

In essence, $\mathrm{err}_t$ equals the sum of the weights associated with the misclassified samples in the original calibration set. If $\mathrm{err}_t > 1/2$, set $T = t - 1$ and abort the loop.

Step 4. Let $\beta_t = \mathrm{err}_t / (1 - \mathrm{err}_t)$ and update the weight of each instance in the original calibration set by using

$$w_{n,t+1} = w_{n,t}\, \beta_t^{(1 - e_{n,t})} \qquad (2)$$

Notice that the superscript $(1 - e_{n,t})$ means that $\beta_t$ is raised to the $(1 - e_{n,t})$th power. The new weights should be normalized for the next iteration so that $\sum_{n=1}^{N} w_{n,t+1} = 1$. In Eq. (2), $\beta_t$ is an indicator of the confidence of the PLS-DA model built in the $t$th cycle; a lower $\beta_t$ indicates higher confidence in the PLS-DA model. Such a weight-updating strategy implies that if a sample is misclassified by the PLS-DA model constructed in step 2, its weight rises relative to the correctly classified samples once the weights are normalized. The ensemble size $T$ is vital for decreasing the variability of the ensemble prediction and for offering reliable estimation of the unknown samples. It is usually determined via cross validation [30].

After $T$ cycles are finished, $T$ PLS-DA models are induced. As far as prediction is concerned, each PLS-DA model gives a prediction for the $i$th unknown sample (i.e., $y_{i,t} \in \{1, 2, \ldots, K\}$) and a corresponding $\beta_t$. These $T$ predictions are combined into a final result. The importance of each prediction for the final result is measured by $\log(1/\beta_t)$, with a larger value of $\log(1/\beta_t)$ indicating higher importance. That is, if $C_k$ is the collection of the PLS-DA models classifying the $i$th sample into the $k$th class, then the “vote” for the $i$th instance identified as the $k$th class can be computed by using

$$V_k = \sum_{t \in C_k} \log(1/\beta_t), \quad k = 1, 2, \ldots, K \qquad (3)$$

and this instance is assigned to the class with the maximal vote. This is the so-called weighted majority voting [30]. During iteration, it sometimes happens that $\mathrm{err}_t$ equals 0 or 1, so that $\log(1/\beta_t)$ or $\beta_t$ cannot be computed. Therefore, we reset $\mathrm{err}_t$ to 0.995 when it is equal to 1 and to 0.005 when it equals 0.
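The weight update of Eqs. (1) and (2) and the weighted majority vote of Eq. (3) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`clamp_error`, `boosting_round`, `weighted_vote`) and the toy numbers are assumptions made for illustration, and the resampling of the boosting set, the PLS-DA fit itself, and the abort test for an error above 1/2 are left out.

```python
import math


def clamp_error(err, low=0.005, high=0.995):
    """Reset err_t to 0.005 or 0.995 when it equals 0 or 1 (see the text)."""
    if err <= 0.0:
        return low
    if err >= 1.0:
        return high
    return err


def boosting_round(weights, misclassified):
    """One weight update following Eqs. (1) and (2).

    weights       -- current sample weights w_{n,t}, summing to 1
    misclassified -- errors e_{n,t}: 1 if sample n was misclassified, else 0
    Returns (err_t, beta_t, normalized weights w_{n,t+1}).
    """
    # Eq. (1): weighted error = total weight of the misclassified samples
    err = clamp_error(sum(w * e for w, e in zip(weights, misclassified)))
    beta = err / (1.0 - err)
    # Eq. (2): correctly classified samples (e = 0) are multiplied by beta
    raw = [w * beta ** (1 - e) for w, e in zip(weights, misclassified)]
    total = sum(raw)
    return err, beta, [w / total for w in raw]


def weighted_vote(predictions, betas, n_classes):
    """Eq. (3): each model votes for its predicted class with log(1/beta_t)."""
    votes = [0.0] * n_classes
    for label, beta in zip(predictions, betas):
        votes[label - 1] += math.log(1.0 / beta)  # labels run from 1 to K
    return votes.index(max(votes)) + 1


# Toy example (hypothetical numbers): four samples, the first misclassified.
err, beta, new_weights = boosting_round([0.25, 0.25, 0.25, 0.25], [1, 0, 0, 0])
print(err, beta, new_weights)

# Three models predict classes 1, 2, 2; the first model is far more confident.
print(weighted_vote([1, 2, 2], [0.1, 0.5, 0.5], n_classes=2))
```

In the toy update, the single misclassified sample ends up carrying half of the total weight after normalization, which is how the next cycle is steered towards the samples that are hard to classify. In the toy vote, one confident model (small beta) outweighs two less confident ones.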


3. EXPERIMENTAL

3.1. Sample preparation

Six varieties of tea leaf samples, originating from different provinces in China, were purchased from a supermarket. The tea leaves were crushed in a beater mill and sieved through a 0.5 mm screen, and the sieved tea powder was used for the NIR spectral measurements. The tea information, including the category, variety, origin, and number of samples, is summarized in Table 1.

3.2. Acquisition of NIR data

The NIR spectra of the powdered tea samples were collected on an Antaris II NIR spectrometer (Thermo Electron Co., Waltham, MA, USA) in reflectance mode. The spectrometer is furnished with a quartz sample cup, an integrating sphere, and an indium gallium arsenide (InGaAs) detector. Under steady temperature and humidity, the spectra were measured over the range 4000–10,000 cm⁻¹ at a resolution of 8 cm⁻¹, and 64 scans were accumulated per measurement. The background was collected at the beginning of the experiment and then every 1 h.

In total, 300 spectra were collected, from 50 Biluochun, 50 Xihulongjing, 50 Xinyangmaojian, 50 Qihong, 50 Tieguanyin, and 50 Yinzhen samples. The whole data set was randomly split into a calibration set of 176 samples and a prediction set of 124 samples for LDA, PLS-DA, and BPLS-DA modeling. The calibration set was composed of 25 Biluochun, 30 Xihulongjing, 33 Xinyangmaojian, 32 Qihong, 26 Tieguanyin, and 30 Yinzhen tea samples. The remaining samples formed the prediction set.
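As a quick consistency check on the split, the composition of the prediction set can be derived from the numbers above. The short Python sketch below only restates that arithmetic; the variable names are illustrative.

```python
# 50 samples were measured per variety; the calibration counts are those
# quoted in the text. The prediction set is whatever remains.
SAMPLES_PER_VARIETY = 50
calibration = {
    "Biluochun": 25,
    "Xihulongjing": 30,
    "Xinyangmaojian": 33,
    "Qihong": 32,
    "Tieguanyin": 26,
    "Yinzhen": 30,
}

prediction = {name: SAMPLES_PER_VARIETY - n for name, n in calibration.items()}
n_calibration = sum(calibration.values())
n_prediction = sum(prediction.values())

print(prediction)
print(n_calibration, n_prediction)
```

The totals agree with the stated calibration and prediction set sizes of 176 and 124 samples.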

4. RESULTS AND DISCUSSION

Figure 1(a) depicts the raw NIR spectra of the powdered tea samples, showing three clusters of absorption bands. The band cluster ranging from 4000 to 5000 cm⁻¹ might be attributed to the second overtone of the C–H deformation mode and the aggregation of the O–H and N–H combination modes. The two bands from 5200 to 7200 cm⁻¹ possibly arise from the C–H stretching overtone and the O–H and N–H stretching overtones, respectively. The weak absorption peak around 8500 cm⁻¹ is due to the second overtone of the C–H stretching mode. All of these absorption bands might be caused by the many ingredients of tea, such as polyphenols, alkaloids, proteins, amino acids, and some aroma compounds. It can also be seen that the original NIR spectra of different samples show considerable baseline drifts. These baseline drifts could be effectively eliminated by the second derivative treatment, as indicated in Figure 1(b). In addition, it has been shown that the wavenumber interval selection procedure can eliminate the extra variability generated by noncomposition-related factors, such as perturbations in the experimental conditions and the physical properties of samples, thereby offering improved performance for multivariate spectral analysis [33]. Consequently, the derivative NIR spectra were analyzed using the moving window partial least-squares method [33]. This method ascertained several informative spectral regions, that is, 4040–4180, 4570–4690, 5130–5380, 5830–5940, 6790–6890, 6970–7310, and 7510–7620 cm⁻¹. The preprocessed spectra were used for the ultimate tea variety identification analysis.

Figure 1. (a) Raw near infrared (NIR) spectra of six varieties of tea powder. (b) Second derivatives of the raw NIR spectra.

Because such a data set may present some inherent attributes, for example, nonlinearity or the presence of outliers, data analysis was performed before pattern recognition. The runs test method [34] was applied to test whether nonlinearity exists and to quantify its extent if present. For this data set, the presence of serious nonlinearity was shown by the runs test, which yielded a statistical value of 16.24 whose absolute value is much larger than the critical value of 1.96 [34]. The absence of outliers was verified by the Cook's squared distance ($CD_i^2$) method [35] (results not shown).

Table 1. Category, number, and origin of tea samples

No.  Tea variety     Sample number  Tea origin  Tea category
1    Biluochun       50             Jiangsu     Green tea
2    Xihulongjing    50                         Green tea
3    Xinyangmaojian  50             Henan       Green tea
4    Qihong          50             Anhui       Black tea
5    Tieguanyin      50                         Oolong tea
6    Yinzhen         50             Hunan       Yellow tea

As for the recognition analysis, PCA was first used for exploring the cluster tendency. The score plot of the top two principal components is shown in Figure 2. From Figure 2, it can be seen that although some points overlap each other, an obvious cluster tendency occurs. This good cluster tendency may be due to the fact that the diverse varieties of tea differ considerably in their botanical, genetic, and agronomical characteristics, and especially in the tea treatment processes and regions of origin. In addition, a visual inspection of Figure 2 shows that the three kinds of green tea lie adjacent to each other. Qihong, a black tea, lies adjacent to the three varieties of green tea and overlaps only slightly with Xihulongjing. Tieguanyin (oolong) and Yinzhen (yellow tea) are fully separated from each other and clearly differentiated from the other four varieties of tea. This phenomenon arises from the distinct inner chemical compositions of the diverse categories of tea, which result mainly from the dissimilar manufacturing procedures of these four categories. For green tea, the tea leaves are only roasted but not fermented before being dried, and so contain more of the simple flavonoids called catechins. The tea leaves for yellow tea are slightly fermented via a step called “sealed yellowing” before being dried. Oolong and black tea are semi-fermented and fully fermented, respectively. The fermentation is an oxidation procedure converting the catechins into theaflavins and thearubigins. Hence, different categories of tea present various chemical characteristics. All of these indicate that the minor NIR spectroscopic differences can provide enough information for the further tea variety discrimination analysis.

In the current study, as a comparison, LDA was first used for the discrimination analysis. By using LDA, the six varieties of tea in the calibration set are identified accurately, while those in the prediction set suffer from serious overlapping and confusion, indicating that LDA presents good learning performance but is less effective on unknown samples. This may result from the fact that the data set possesses strong nonlinearity and that the number of variables is much larger than the number of samples.

To compare with BPLS-DA, PLS-DA was also employed to relate the measured NIR signals to the variety memberships of the tea samples. For convenience, the variety membership of each sample is coded as a dummy vector by assigning a value of 1 if the spectrum belongs to a particular class or 0 if it does not belong to this class. That is, the dummy codes for the six varieties can be represented as follows: Biluochun (1, 0, 0, 0, 0, 0), Xihulongjing (0, 1, 0, 0, 0, 0), Xinyangmaojian (0, 0, 1, 0, 0, 0), Qihong (0, 0, 0, 1, 0, 0), Tieguanyin (0, 0, 0, 0, 1, 0), and Yinzhen (0, 0, 0, 0, 0, 1). Calibration was then carried out by regressing the dummy variables on the spectra. The dummy vectors estimated by PLS-DA for the calibration and prediction sets are shown in Figure 3(a) and (b), respectively. The six varieties of tea samples can be recognized by locating the maximal dummy code of each sample. For example, as shown in Figure 3(a), the maximal values of the dummy codes for the first two samples are located in t6, indicating that these two samples, which actually belong to Biluochun, are misclassified as Yinzhen. A careful inspection of Figure 3 shows that although a majority of the tea samples are identified accurately, quite a few are misclassified. Table 2 lists the classification results, from which one can obtain that the total recognition rates (RR)

Figure 2. Score plot by the top two principal components for 300 powdered tea samples, where symbols ○ (circle), ⊳ (triangle-right), + (plus), * (star), ◇ (diamond), and ☆ (pentagram) represent Biluochun, Xihulongjing, Xinyangmaojian, Qihong, Tieguanyin, and Yinzhen tea samples, respectively.

Figure 3. The estimated dummy vectors by partial least-squares discriminant analysis refer to the variety memberships of the samples in (a) the calibration set and (b) the prediction set. Here, t1, t2, t3, t4, t5, and t6 represent Biluochun, Xihulongjing, Xinyangmaojian, Qihong, Tieguanyin, and Yinzhen tea samples, respectively.
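The dummy coding behind Figure 3, and the rule of assigning a sample to the variety with the maximal estimated dummy code, can be sketched in a few lines of Python. This is an illustration only: the helper names (`dummy_code`, `assign_variety`) and the estimated vector are hypothetical and are not taken from the study's data.

```python
VARIETIES = ["Biluochun", "Xihulongjing", "Xinyangmaojian",
             "Qihong", "Tieguanyin", "Yinzhen"]  # order of t1 ... t6


def dummy_code(variety):
    """Return the 0/1 dummy vector of a variety, e.g. Biluochun -> (1,0,0,0,0,0)."""
    return [1 if name == variety else 0 for name in VARIETIES]


def assign_variety(estimated):
    """Assign a sample to the variety whose estimated dummy code is maximal."""
    best = max(range(len(estimated)), key=lambda j: estimated[j])
    return VARIETIES[best]


print(dummy_code("Biluochun"))

# Hypothetical PLS-DA output for one Biluochun sample: the largest value
# sits in t6, so the sample would be (wrongly) assigned to Yinzhen.
estimated = [0.31, 0.05, 0.12, -0.02, 0.08, 0.46]
print(assign_variety(estimated))
```

The regression output is not constrained to 0 or 1, so only the position of the maximum matters for the class assignment.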


Table 2. Classification results using BPLS-DA compared with those obtained by PLS-DA

                            PLS-DA                                  BPLS-DA
Data set         Variety    Misclassified  RR (%)  Total RR (%)     Misclassified  RR (%)  Total RR (%)
Calibration set  BLC        13             48      89               0              100     100
(176 samples)    XHLJ       0              100                      0              100
                 XYMJ       7              79                       0              100
                 QH         0              100                      0              100
                 TGY        0              100                      0              100
                 YZ         0              100                      0              100
Prediction set   BLC        17             32      83               2              92      98
(124 samples)    XHLJ       0              100                      0              100
                 XYMJ       4              76                       0              100
                 QH         0              100                      0              100
                 TGY        0              100                      0              100
                 YZ         0              100                      0              100

Misclassified, number of misclassified samples. BLC, XHLJ, XYMJ, QH, TGY, and YZ, respectively, represent Biluochun, Xihulongjing, Xinyangmaojian, Qihong, Tieguanyin, and Yinzhen tea samples. BPLS-DA, boosting partial least-squares discriminant analysis; PLS-DA, partial least-squares discriminant analysis; RR, recognition rate.
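The recognition rates in Table 2 follow directly from the misclassification counts. The Python sketch below recomputes the prediction-set values under one assumption: the per-variety prediction set sizes are taken as 50 minus the calibration counts given in Section 3.2. The function and variable names are illustrative.

```python
def recognition_rate(n_samples, n_misclassified):
    """Recognition rate (%) = correctly classified samples / all samples."""
    return 100.0 * (n_samples - n_misclassified) / n_samples


def total_rate(sizes, errors):
    """Total recognition rate (%) over all varieties of one data set."""
    return recognition_rate(sum(sizes.values()), sum(errors.values()))


# Prediction set sizes (assumed: 50 per variety minus the calibration counts).
prediction_sizes = {"BLC": 25, "XHLJ": 20, "XYMJ": 17,
                    "QH": 18, "TGY": 24, "YZ": 20}
# Misclassified samples in the prediction set, as listed in Table 2.
plsda_errors = {"BLC": 17, "XHLJ": 0, "XYMJ": 4, "QH": 0, "TGY": 0, "YZ": 0}
bplsda_errors = {"BLC": 2, "XHLJ": 0, "XYMJ": 0, "QH": 0, "TGY": 0, "YZ": 0}

for variety, size in prediction_sizes.items():
    print(variety,
          round(recognition_rate(size, plsda_errors[variety])),
          round(recognition_rate(size, bplsda_errors[variety])))

print(round(total_rate(prediction_sizes, plsda_errors)),
      round(total_rate(prediction_sizes, bplsda_errors)))
```

Rounded to whole percentages, these values reproduce the prediction-set entries of Table 2.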

for the calibration and prediction sets are 89% and 83%, respectively. PLS-DA provided RRs of 48% and 79%, respectively, for Biluochun and Xinyangmaojian in the calibration set. The RRs for these two varieties in the prediction set were 32% and 76%. For the remaining four varieties, PLS-DA provided RRs of 100% in both the calibration and prediction sets. It seems that PLS-DA distinguishes the diverse categories of tea with relatively satisfactory results but provides relatively poor discrimination among the green teas. These results tally with those obtained by PCA. This may be due to the compositional similarities among the green teas and the deficiency of PLS-DA in calibrating a data set possessing complex and unknown nonlinearity.

To further improve the discrimination rule, the BPLS-DA method was employed. In BPLS-DA, two parameters need to be identified, that is, the number of latent variables in each cycle and the ensemble size. BPLS-DA shared the number of latent variables with PLS-DA. An ensemble size of 11 was identified by cross validation, as shown in Figure 4. The discrimination results by BPLS-DA, together with those by PLS-DA, are documented in Table 2. By using BPLS-DA, the total RRs for the calibration and prediction sets were improved from 89% to 100% and from 83% to 98%, respectively. Compared with PLS-DA, BPLS-DA yielded much higher RRs for Biluochun and Xinyangmaojian in both the calibration and prediction sets. For the other four varieties, both BPLS-DA and PLS-DA provided fully accurate discrimination for the calibration and prediction sets. All of these are clearly shown in Table 2, indicating that the inter-variety differences of tea can be well discriminated by using NIR spectroscopy coupled with BPLS-DA. This should benefit from the introduction of boosting, which has the ability to simultaneously reduce the bias and variance. This property of boosting improves the generalization performance of a single PLS-DA model. Furthermore, boosting is capable of mitigating the overfitting to singular sample distributions and heterogeneous errors by optimizing the sample weighting and combining the model ensemble. The great potential of boosting for accurately capturing the nonlinear structure of data may be another advantageous factor. Because the ensemble size for this data set is 11, the time required to run BPLS-DA is only several seconds.

Figure 4. The curve of ensemble size versus PRESS obtained by cross validation on the original calibration set (x-axis: ensemble size; y-axis: PRESS). The vertical solid line indicates the optimal ensemble size selected.

5. CONCLUSIONS

This paper developed a new multivariate pattern recognition method named BPLS-DA, in which PLS-DA acts as the basic learner. BPLS-DA establishes a sequence of PLS-DA models iteratively and integrates the outputs of all the resultant PLS-DA models to obtain the final results. BPLS-DA was applied to NIR spectroscopic tea variety discriminant analysis. Experimental results revealed that boosting effectively improved the performance of the single PLS-DA model, and BPLS-DA compared favorably with LDA. In addition, it is expected that NIR spectroscopy coupled with BPLS-DA might hold great potential as an accurate, rapid, and noninvasive strategy for other discrimination tasks.


Acknowledgements

The authors are grateful to the National Natural Science Foundation (no. 21105035), the self-determined research funds of CCNU from the colleges' basic research and operation of MOE (no. CCNU09A01012), the Fundamental Research Funds for the Central Universities (nos. 111016 and 20110348), the open funds of the State Key Laboratory of Chemo/Biosensing and Chemometrics of Hunan University (no. 200910), and the Hubei Province Natural Science Foundation (grant no. 2010CBB00402) for supporting the research work.

REFERENCES

1. Gonzalvez A, Armenta S, de la Guardia M. Trace-element composition and stable-isotope ratio for discrimination of foods with protected designation of origin. Trends Anal. Chem. 2009; 28: 1295–1311.
2. Kizil R, Irudayaraj J. Discrimination of irradiated starch gels using FT-Raman spectroscopy and chemometrics. J. Agric. Food Chem. 2006; 54: 13–18.
3. Bilal M, Jaffrezic A, Dudal Y, Guillou CL, Menasseri S, Walter C. Discrimination of farm waste contamination by fluorescence spectroscopy coupled with multivariate analysis during a biodegradation study. J. Agric. Food Chem. 2010; 58: 3093–3100.
4. Shen H, Carter JF, Brereton RG, Eckers C. Discrimination between tablet production methods using pyrolysis-gas chromatography–mass spectrometry and pattern recognition. Analyst 2003; 128: 287–292.
5. Fu HY, Huan SY, Xu L, Tang LJ, Jiang JH, Wu HL, Shen GL, Yu RQ. Moving window partial least-squares discriminant analysis for identification of different kinds of bezoar samples by near infrared spectroscopy and comparison of different pattern recognition methods. J. Near Infrared Spectrosc. 2007; 15: 291–297.
6. Nasioudis A, Memboeuf A, Heeren RMA, Smith DF, Vékey K, Drahos L, van den Brink OF. Discrimination of polymers by using their characteristic collision energy in tandem mass spectrometry. Anal. Chem. 2010; 82: 9350–9356.
7. Michael EH, Birgitte A, Jørn S. Automated and unbiased classification of chemical profiles from fungi using high performance liquid chromatography. J. Microbiol. Methods 2005; 61: 295–304.
8. Alain C, Martine D, Marcia V. Nondestructive measurement of fresh tomato lycopene content and other physicochemical characteristics using visible-NIR spectroscopy. J. Agric. Food Chem. 2008; 56: 9813–9818.
9. Jens B, Jacques W. Industrial applications of online monitoring of drying processes of drug substances using NIR. Org. Process Res. Dev. 2008; 12: 235–242.
10. Christoffer A, Jonas J, Stefan AE, Sune S, Staffan F. Time-resolved NIR spectroscopy for quantitative analysis of intact pharmaceutical tablets. Anal. Chem. 2005; 77: 1055–1059.
11. Jing M, Cai WS, Shao XG. Quantitative determination of the components in corn and tobacco samples by using near-infrared spectroscopy and multiblock partial least-squares. Anal. Lett. 2010; 43: 1910–1921.
12. Šašić S, Ozaki Y. Short-wave near-infrared spectroscopy of biological fluids. 1. Quantitative analysis of fat, protein, and lactose in raw milk by partial least-squares regression and band assignment. Anal. Chem. 2001; 73: 64–71.
13. Liu YL, Chen YR, Ozaki Y. Two-dimensional visible/near-infrared correlation spectroscopy study of thermal treatment of chicken meats. J. Agric. Food Chem. 2000; 48: 901–908.
14. Ferré J, Rius FX. Selection of the best calibration sample subset for multivariate regression. Anal. Chem. 1996; 68: 1565–1571.
15. Gy PM. Introduction to the theory of sampling I. Heterogeneity of a population of uncorrelated units. Trends in Anal. Chem. 1995; 67–76.
16. Schapire RE. The strength of weak learnability. Mach. Learn. 1990; 5: 197–227.
17. Zhang MH, Xu QS, Massart DL. Boosting partial least-squares. Anal. Chem. 2005; 77: 1423–1431.
18. Zhou YP, Jiang JH, Wu HL, Shen GL, Yu RQ, Yukihiro O. Dry film method with ytterbium as the internal standard for near infrared spectroscopic plasma glucose assay coupled with boosting support vector regression. J. Chemometr. 2006; 20: 13–21.
19. Zhou YP, Jiang JH, Lin WQ, Zou HY, Wu HL, Shen GL, Yu RQ. Boosting support vector regression in QSAR studies of bioactivities of chemical compounds. Eur. J. Pharm. Sci. 2006; 28: 344–353.
20. Bjørn-Helge M, Vegard HS, Tormod N. Ensemble methods and partial least squares regression. J. Chemometr. 2004; 11: 498–507.
21. He P, Xu CJ, Liang YZ, Fang KT. Improving the classification accuracy in chemistry via boosting technique. Chemometr. Intell. Lab. Syst. 2004; 70: 39–46.
22. He P, Fang KT, Liang YZ, Li BY. A generalized boosting algorithm and its application to two-class chemical classification problem. Anal. Chim. Acta 2005; 543: 181–191.
23. Shao XG, Bian XH, Cai WS. An improved boosting partial least squares method for near-infrared spectroscopic quantitative analysis. Anal. Chim. Acta 2010; 666: 32–37.
24. Zhang MH, Xu QS, Daeyaert F, Lewi PJ, Massart DL. Application of boosting to classification problems in chemometrics. Anal. Chim. Acta 2005; 544: 167–176.
25. Simone B, Agostino Di C. Improving nonparametric regression methods by bagging and boosting. Comput. Stat. Data. An. 2002; 38: 407–420.
26. Schapire RE, Rochery M, Rahim M, Gupta N. Boosting with prior knowledge for call classification. IEEE T. Speech Audi. P. 2005; 13: 174–181.
27. He Y, Li XL, Deng XF. Discrimination of varieties of tea using near infrared spectroscopy by principal component analysis and BP model. J. Food Eng. 2007; 79: 1238–1242.
28. Jerome HF. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001; 29: 1189–1232.
29. Jerome HF. Stochastic gradient boosting. Comput. Stat. Data. An. 2002; 38: 367–378.
30. Yoav F, Robert ES. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 1997; 55: 119–139.
31. Yoav F, Raj I, Robert ES, Yoram S. An efficient boosting algorithm for combining preferences. J. Mach. Learn. Res. 2003; 4: 933–969.
32. Mohammad A, Romuald B, Hubert C. A new boosting algorithm for improved time-series forecasting with recurrent neural networks. Inform. Fusion 2008; 9: 41–55.
33. Jiang JH, James BR, Siesler HW, Ozaki Y. Wavelength interval selection in multicomponent spectral analysis by moving window partial least-squares regression with applications to mid-infrared and near-infrared spectroscopic data. Anal. Chem. 2002; 74: 3555–3565.
34. Centner V, De Noord OE, Massart DL. Detection of nonlinearity in multivariate calibration. Anal. Chim. Acta 1998; 376: 153–168.
35. Cook RD. Detection of influential observations in linear regression. Technometr. 1977; 19: 15–18.
