Bayesian Wavelet Regression on Curves With Application to a Spectroscopic Calibration Problem

P. J. Brown, T. Fearn, and M. Vannucci

P. J. Brown is Pfizer Professor of Medical Statistics, Institute of Mathematics and Statistics, University of Kent, Canterbury, Kent CT2 7NF, U.K. (E-mail: [email protected]). T. Fearn is Professor of Applied Statistics, Department of Statistical Science, University College London, London WC1E 6BT, U.K. (E-mail: [email protected]). M. Vannucci is Assistant Professor of Statistics, Department of Statistics, Texas A&M University, College Station, TX 77843 (E-mail: [email protected]). This work was supported by the U.K. Engineering and Physical Sciences Research Council under the Stochastic Modelling in Science and Technology Initiative, grant GK/K73343. M. Vannucci also acknowledges support from the Texas Higher Education Advanced Research Program, grant 010366-0075, from the Texas A&M International Research Travel Assistant Program, and from the National Science Foundation CAREER award DMS-0093208. The spectroscopic calibration problem was provided by the Flour Milling and Baking Research Association. The authors thank the associate editor and a referee, as well as Mike West, Duke University, and Adrian Raftery, University of Washington, for suggestions that helped improve the article.

© 2001 American Statistical Association, Journal of the American Statistical Association, June 2001, Vol. 96, No. 454, Applications and Case Studies.

Motivated by calibration problems in near-infrared (NIR) spectroscopy, we consider the setting in which the many predictor variables arise from sampling an essentially continuous curve at equally spaced points and there may be multiple predictands. We tackle this regression problem by calculating the wavelet transforms of the discretized curves, then applying a Bayesian variable selection method using mixture priors to the multivariate regression of predictands on wavelet coefficients. For prediction purposes, we average over a set of likely models. Applied to a particular problem in NIR spectroscopy, this approach was able to find subsets of the wavelet coefficients with overall better predictive performance than the more usual approaches. In the application, the available predictors are measurements of the NIR reflectance spectrum of biscuit dough pieces at 256 equally spaced wavelengths. The aim is to predict the composition (i.e., the fat, flour, sugar, and water content) of the dough pieces using the spectral variables. Thus we have a multivariate regression of four predictands on 256 predictors with quite high intercorrelation among the predictors. A training set of 39 samples is available to fit this regression. Applying a wavelet transform replaces the 256 measurements on each spectrum with 256 wavelet coefficients that carry the same information. The variable selection method could then use subsets of these coefficients that gave good predictions for all four compositional variables on a separate test set of samples. Selecting in the wavelet domain rather than from the original spectral variables is appealing in this application, because a single wavelet coefficient can carry information from a band of wavelengths in the original spectrum. This band can be narrow or wide, depending on the scale of the wavelet selected.

KEY WORDS: Markov chain Monte Carlo; Mixture prior; Model averaging; Multivariate regression; Near-infrared spectroscopy; Variable selection.

1. INTRODUCTION

This article presents a new way of tackling linear regression problems in which the predictor variables arise from sampling an essentially continuous curve at equally spaced points. The work was motivated by calibration problems in near-infrared (NIR) spectroscopy, of which the following example is typical.

1.1 Near-Infrared Spectroscopy of Biscuit Doughs

Quantitative NIR spectroscopy is used to analyze such diverse materials as food and drink, pharmaceutical products, and petrochemicals. The NIR spectrum of a sample of, say, wheat flour is a continuous curve measured by modern scanning instruments at hundreds of equally spaced wavelengths. The information contained in this curve can be used to predict the chemical composition of the sample. The problem lies in extracting the relevant information from possibly thousands of overlapping peaks. Osborne, Fearn, and Hindle (1993) described applications in food analysis and reviewed some of the standard approaches to the calibration problem.

The example studied in detail here arises from an experiment done to test the feasibility of NIR spectroscopy to measure the composition of biscuit dough pieces (formed but unbaked biscuits), for possible on-line implementation. (For a full description of the experiment, see Osborne, Fearn, Miller, and Douglas 1984.) Briefly, two similar sample sets were made up, with the standard recipe varied to provide a large range for each of the four constituents under investigation: fat, sucrose, dry flour, and water. The calculated percentages of these four ingredients represent the $q = 4$ responses. There were $n = 39$ samples in the calibration or training set, with sample 23 excluded from the original 40 as an outlier, and a further $m = 39$ in the separate prediction or validation set, again after one outlier was excluded. Thus $Y$ and $Y_f$, the matrices of compositional data for the training and validation sets, are both of dimension $39 \times 4$.

An NIR reflectance spectrum is available for each dough piece. The original spectral data consist of 700 points measured from 1100 to 2498 nanometers (nm) in steps of 2 nm. For our analyses using wavelets, we have chosen to reduce the number of spectral points to save computational time. The first 140 and last 49 wavelengths, which were thought to contain little useful information, were removed, leaving a wavelength range from 1380 nm to 2400 nm, over which we took every other point, thus increasing the gap to 4 nm and reducing the number of points to $p = 256$. The matrices $X$ and $X_f$ of spectral data are then $39 \times 256$. Samples of three centered spectra are given on the left side of Figure 1.

The aim is to derive an equation that will predict the response values $Y$ from the spectral data $X$ for future samples where $Y$ is unknown but $X$ can be measured cheaply and rapidly.

[Figure 1 about here. Left column: three centered NIR reflectance spectra plotted against wavelength, 1400–2400 nm; right column: the corresponding wavelet coefficients, indexed 0–250.]

Figure 1. Original Spectra (left column) and Wavelet Transforms (right column).

1.2 Standard Analyses

The most commonly used approaches to this calibration problem regress $Y$ on $X$, with the linear form

$$Y = XB + E$$

being justified either by appeals to the Beer–Lambert law (Osborne et al. 1993) or on the grounds that it works in practice. In Section 6 we also investigate logistic transformations of the responses, showing that overall their impact on the prediction performance of the model is not beneficial.

The problem is not straightforward, because there are many more predictor variables (256) than training samples (39) in our example. The most commonly used methods for overcoming this difficulty fall into two broad classes: variable selection and factor-based methods. When scanning NIR instruments first appeared, the standard approach was to select (typically using a stepwise procedure) predictors at a small number of wavelengths and use multiple linear regression with this subset (Hrushka 1987). Later, this approach was largely superseded by methods that reduce the $p$ spectral variables to scores on a much smaller number of factors and then regress on these scores. Two variants, principal components regression (PCR; Cowe and McNicol 1985) and partial least squares regression (PLS; Wold, Martens, and Wold 1983), are now widely used, with equal effectiveness, as the standard approaches. The increasing power of computers has triggered renewed research interest in wavelength selection, now using computer-intensive search methods.

1.3 Selecting Wavelet Coefficients

The approach that we investigate here involves selecting variables, but we select from derived variables. The idea is to transform each spectrum into a set of wavelet coefficients, the whole of which would suffice to reconstruct the spectrum, and select good predictors from among these. There are good reasons for thinking that this approach might have advantages over selecting from the original variables.

In previous work (Brown, Fearn, and Vannucci 1999; Brown, Vannucci, and Fearn 1998a,b) we explored Bayesian approaches to the problem of selecting predictor variables in this multivariate regression context. We apply this methodology to wavelet selection here.

We know that wavelets can be used successfully for compression of curves like the spectra in our example, in the sense that the curves can be accurately reconstructed from a fraction of the full set of wavelet coefficients (Trygg and Wold 1998; Walczak and Massart 1997). Furthermore, the wavelet decomposition of the curve is a local one, so that if the information relevant to our prediction problem is contained in a particular part or parts of the curve, as it typically is, then this information will be carried by a very small number of wavelet coefficients. Thus we may expect selection to work. The ability of wavelets to model the curve at different levels of resolution gives us the option of selecting from our curve at a range of bandwidths. In some situations it may be advantageous to select a sharp band, as we do when we select one of the original variables; in other situations a broad band, averaging over many adjacent points, may be preferable. Selecting from the wavelet coefficients gives us both of these options automatically and in a very computationally efficient framework.
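To make the compression property just described concrete, the following sketch (ours, not from the paper; the authors' computations used MATLAB toolboxes) reconstructs a smooth synthetic "spectrum" from a small fraction of its wavelet coefficients using the PyWavelets library. Here "db4" denotes the Daubechies wavelet with four vanishing moments, matching the MP(4) family adopted later in Section 6.2; the synthetic curve and the choice of 20 retained coefficients are purely illustrative assumptions.

```python
# Sketch (ours): compression of a smooth curve by wavelet coefficient selection.
import numpy as np
import pywt

p = 256
t = np.linspace(0.0, 1.0, p)
spectrum = np.exp(-((t - 0.3) ** 2) / 0.01) + 0.5 * np.exp(-((t - 0.7) ** 2) / 0.04)

# Full orthogonal DWT: 256 points become 256 coefficients, no information lost.
coeffs = pywt.wavedec(spectrum, "db4", mode="periodization")
flat, slices = pywt.coeffs_to_array(coeffs)

# Keep the 20 largest coefficients in absolute value; zero out the rest.
keep = 20
cutoff = np.sort(np.abs(flat))[-keep]
flat_kept = np.where(np.abs(flat) >= cutoff, flat, 0.0)

recon = pywt.waverec(
    pywt.array_to_coeffs(flat_kept, slices, output_format="wavedec"),
    "db4", mode="periodization")
print("relative L2 error:",
      np.linalg.norm(recon - spectrum) / np.linalg.norm(spectrum))
```

For a curve this smooth, the relative error is small even with less than a tenth of the coefficients retained, which is the sense of "compression" intended above.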

Not surprisingly, the Fourier transform has also been successfully used for data compression and denoising of NIR spectral data (McClure, Hamid, Giesbrecht, and Weeks 1984). For this purpose, there is probably very little to choose between the Fourier and wavelet approaches. When it comes to selecting small numbers of coefficients for prediction, however, the local nature of wavelets makes them the obvious choice.

It is worth emphasizing that what we are doing here is not what is commonly described as wavelet regression. We are not fitting a smooth function to a single noisy curve by using either thresholding or shrinkage of wavelet coefficients (see Clyde and George 2000 for Bayesian approaches that use mixture modeling, and Donoho, Johnstone, Kerkyacharian, and Picard 1995). Unlike those authors, we have several curves, the spectra of 39 dough pieces, each of which is transformed to wavelet coefficients. We then select some of these wavelet coefficients (the same ones for each spectrum), not because they give a good reconstruction of the curves (which they do not) or to remove noise (of which there is very little to start with) from the curves, but rather because the selected coefficients are useful for predicting some other quantity measured on the dough pieces. One consequence of this is that it is not necessarily the large wavelet coefficients that will be useful; small coefficients in critical regions of the spectrum also may carry important predictive information. Thus the standard thresholding or shrinkage approaches are just not relevant to this problem.

2. PRELIMINARIES

2.1 Wavelet Bases and Wavelet Transforms

Wavelets are families of functions that can accurately describe other functions in a parsimonious way. In $L^2(\mathbb{R})$, for example, an orthonormal wavelet basis is obtained as translations and dilations of a mother wavelet $\psi$ as $\psi_{j,k}(x) = 2^{j/2}\psi(2^j x - k)$ with $j, k$ integers. A function $f$ is then represented by a wavelet series as

$$f(x) = \sum_{j,k \in \mathbb{Z}} d_{j,k}\, \psi_{j,k}(x), \qquad (1)$$

with wavelet coefficients $d_{j,k} = \int f(x)\psi_{j,k}(x)\,dx$ describing features of the function $f$ at the spatial location $2^{-j}k$ and frequency proportional to $2^j$ (or scale $j$).

Daubechies (1992) proposed a class of wavelet families that have compact support and a maximum number of vanishing moments for any given smoothness. These are used extensively in statistical applications.

Wavelets have been an extremely useful tool for the analysis and synthesis of discrete data. Let $Y = (y_1, \ldots, y_n)$, $n = 2^J$, be a sample of a function at equally spaced points. This vector of observations can be viewed as an approximation to the function at the fine scale $J$. A fast algorithm, the discrete wavelet transform (DWT), exists for decomposing $Y$ into a set of wavelet coefficients (Mallat 1989) in only $O(n)$ operations. The DWT operates in practice by means of linear recursive filters. For illustration purposes, we can write it in matrix form as $Z = WY$, where $W$ is an orthogonal matrix corresponding to the discrete wavelet transform and $Z$ is a vector of wavelet coefficients describing features of the function at scales from the fine $J - 1$ to a coarser one, say $J - r$. An algorithm for the inverse construction also exists.

Wavelet transforms can be computed very rapidly and have good compression properties. Because they are localized in both time and frequency, wavelets have the ability to represent many classes of functions in a sparse form by describing important features with few coefficients. (For a general exposition of wavelet theory see Daubechies 1992.)

2.2 Matrix-Variate Distributions

In what follows we use notation for matrix-variate distributions due to Dawid (1981). We write

$$V - M \sim \mathcal{N}(\Gamma, \Sigma)$$

when the random matrix $V$ has a matrix-variate normal distribution with mean $M$ and covariance matrices $\gamma_{ii}\Sigma$ and $\sigma_{jj}\Gamma$ for its $i$th row and $j$th column. Such a $V$ could be generated as $V = M + A'UB$, where $M$, $A$, and $B$ are fixed matrices such that $A'A = \Gamma$ and $B'B = \Sigma$, and $U$ is a random matrix with independent standard normal entries. This notation has the advantage of preserving the matrix structure instead of reshaping $V$ as a vector. It also makes for much easier formal Bayesian manipulation.

The other notation that we use is

$$W \sim \mathcal{IW}(\delta; \Sigma)$$

for a random matrix $W$ with an inverse Wishart distribution with scale matrix $\Sigma$ and shape parameter $\delta$. The shape parameter differs from the more conventional degrees of freedom, again making for very easy Bayesian manipulations. With $U$ and $B$ defined as earlier, and with $U$ as $n \times p$ with $n > p$, $W = B'(U'U)^{-1}B$ has an inverse Wishart distribution with shape parameter $\delta = n - p + 1$ and scale matrix $\Sigma$. The expectation of $W$ exists for $\delta > 2$ and is then $\Sigma/(\delta - 2)$. (More details of these notations, and a corresponding form for the matrix-variate $T$, can be found in Brown 1993, App. A, or Dawid 1981.)

3. MODELING

3.1 Multivariate Regression Model

The basic setup that we consider is a multivariate linear regression model, with $n$ observations on a $q$-variate response and $p$ explanatory variables. Let $Y$ denote the $n \times q$ matrix of observed values of the responses and let $X$ be the $n \times p$ matrix of predictor variables. Our special concern is with functional predictor data; that is, the situation in which each row of $X$ is a vector of observations of a curve $x(t)$ at $p$ equally spaced points.

The standard multivariate normal regression model has, conditional on $\alpha$, $B$, $\Sigma$, and $X$,

$$Y - 1_n\alpha' - XB \sim \mathcal{N}(I_n, \Sigma), \qquad (2)$$

where $1_n$ is an $n \times 1$ vector of 1's, $\alpha$ is a $q \times 1$ vector of intercepts, and $B = (\beta_1, \ldots, \beta_q)$ is a $p \times q$ matrix of regression coefficients. Without loss of generality, we assume that the columns of $X$ have been centered by subtracting their means.
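Before turning to the prior specification, a toy simulation may help to fix the notation. The sketch below (ours; the dimensions follow the application but every parameter value is an arbitrary placeholder) draws one training set from model (2), generating the error matrix by the Dawid construction $V = M + A'UB$ with $A = I_n$, so that the rows of $E$ are independent $N(0, \Sigma)$ vectors.

```python
# Toy simulation (ours) of model (2): Y - 1_n alpha' - X B ~ N(I_n, Sigma).
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 39, 256, 4                       # dimensions as in the application

X = rng.standard_normal((n, p))
X -= X.mean(axis=0)                        # center the columns of X
alpha = rng.standard_normal(q)             # q x 1 vector of intercepts
B = 0.05 * rng.standard_normal((p, q))     # p x q regression coefficients
Sigma = 0.05 * np.eye(q)                   # q x q error covariance (placeholder)

# Dawid's construction with A = I_n: if Sigma = L L', rows of E = U L' are
# independent N(0, Sigma) draws, where U has iid standard normal entries.
L = np.linalg.cholesky(Sigma)
E = rng.standard_normal((n, q)) @ L.T
Y = np.outer(np.ones(n), alpha) + X @ B + E
```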

The unknown parameters are $\alpha$, $B$, and the $q \times q$ error covariance matrix $\Sigma$. A conjugate prior for this model is as follows. First, given $\Sigma$,

$$\alpha' - \alpha_0' \sim \mathcal{N}(h, \Sigma) \qquad (3)$$

and, independently,

$$B - B_0 \sim \mathcal{N}(H, \Sigma). \qquad (4)$$

The marginal distribution of $\Sigma$ is then

$$\Sigma \sim \mathcal{IW}(\delta; Q). \qquad (5)$$

Note that the priors on both $\alpha$ and $B$ have covariances dependent on $\Sigma$ in a way that directly extends the univariate regression natural conjugate prior distributions. In practice, we let $h \to \infty$ to represent vague prior knowledge about $\alpha$ and take $B_0 = 0$, leaving the specification of $H$, $\delta$, and $Q$ to incorporate prior knowledge about our particular application.

3.2 Transformation to Wavelets

We now transform the predictor variables by applying to each row of $X$ a wavelet transform, as described in Section 2.1. In matrix form, multiplying each row of $X$ by the same matrix $W$ is equivalent to multiplying $X$ on the right side by $W'$. The wavelet transform is orthogonal (i.e., $W'W = I$), and thus (2) can be written as

$$Y - 1_n\alpha' - XW'WB \sim \mathcal{N}(I_n, \Sigma). \qquad (6)$$

We can now express the model in terms of wavelet transformations of the predictors as

$$Y - 1_n\alpha' - Z\tilde{B} \sim \mathcal{N}(I_n, \Sigma), \qquad (7)$$

where $Z = XW'$ is now a matrix of wavelet coefficients and $\tilde{B} = WB$ is a matrix of regression coefficients. The transformed prior on $\tilde{B}$, in the case of inclusion of all predictors, is

$$\tilde{B} \sim \mathcal{N}(\tilde{H}, \Sigma), \qquad (8)$$

where $\tilde{H} = WHW'$ and the parameters $\alpha$ and $\Sigma$ are unchanged by the orthogonal transformations, as are the priors (3) and (5).

In practice, wavelets exploit the recursive application of filters, and the $W$-matrix notation is more useful for explanation than for computation. Vannucci and Corradi (1999) proposed a fast recursive algorithm for computing quantities such as $WHW'$. Their algorithm has a useful link to the two-dimensional DWT (DWT2), making computations simple. The matrix $WHW'$ can be computed from $H$ with an $O(n^2)$ algorithm. (For more details, see secs. 3.1 and 3.2 of Vannucci and Corradi 1999.)

3.3 A Framework for Variable Selection

To perform selection in the wavelet coefficient domain, we further elaborate the prior on $\tilde{B}$ by introducing a latent binary $p$-vector $\gamma$. The $j$th element of $\gamma$, $\gamma_j$, may be either 1 or 0, depending on whether the $j$th column of $Z$ is or is not included in the model. When $\gamma_j$ is unity, the covariance matrix of the corresponding row of $\tilde{B}$ is "large," and when $\gamma_j$ is 0, the covariance matrix is a zero matrix. We have assumed that the prior expectation of $\tilde{B}$ is 0, and so $\gamma_j = 0$ effectively deletes the $j$th explanatory variable (or wavelet coefficient) from the model. This gives, conditional on $\gamma$,

$$\tilde{B}_\gamma \sim \mathcal{N}(\tilde{H}_\gamma, \Sigma), \qquad (9)$$

where $\tilde{B}_\gamma$ and $\tilde{H}_\gamma$ are just $\tilde{B}$ and $\tilde{H}$ with the rows and, in the case of $\tilde{H}$, columns for which $\gamma_j = 0$ deleted. Under this prior, each row of $\tilde{B}$ is modeled as having a scale mixture of the type

$$\tilde{B}_{[j]} \sim (1 - \gamma_j)\, I_0 + \gamma_j\, \mathcal{N}(0, \tilde{h}_{jj}\Sigma), \qquad (10)$$

with $\tilde{h}_{jj}$ equal to the $j$th diagonal element of the matrix $\tilde{H} = WHW'$ and $I_0$ a distribution placing unit mass on the $1 \times q$ zero vector. Note that the rows of $\tilde{B}$ are not independent.

A simple prior distribution $\pi(\gamma)$ for $\gamma$ takes the $\gamma_j$ to be independent with $\Pr(\gamma_j = 1) = w_j$ and $\Pr(\gamma_j = 0) = 1 - w_j$, with hyperparameters $w_j$ to be specified, for $j = 1, \ldots, p$. In our example we take all of the $w_j$ equal to a common $w$, so that the number of nonzero elements of $\gamma$ has a binomial distribution with expectation $pw$.

Mixture priors have been widely used for variable selection in the original model space, originally by Leamer (1978) and more recently by George and McCulloch (1997) and Mitchell and Beauchamp (1988) for the linear multiple regression case. Carlin and Chib (1995), Chipman (1996), and Geweke (1996), among others, concentrated on special features of these priors. Clyde, DeSimone, and Parmigiani (1996) used model mixing in prediction problems with correlated predictors when expressing the space of models in terms of an orthogonalization of the design matrix. Their methods are not directly applicable to our situation, because the wavelet transforms do not leave us with an orthogonal design. The use of mixture priors for selection in the multivariate regression setup has been investigated by Brown et al. (1998a,b).

4. SELECTING WAVELET COEFFICIENTS

4.1 Posterior Distribution of $\gamma$

The posterior distribution of $\gamma$ given the data, $\pi(\gamma \mid Y, Z)$, assigns a posterior probability to each $\gamma$-vector and thus to each possible subset of predictors (wavelet coefficients). This posterior arises from the combination of a likelihood that gives great weight to subsets explaining a high proportion of the variation in the responses $Y$ and a prior for $\gamma$ that penalizes large subsets. It can be computed by integrating out $\alpha$, $B$, and $\Sigma$ from the joint posterior distribution of these parameters and $\gamma$ given the data. With the vague ($h \to \infty$) prior for $\alpha$, this parameter is essentially estimated by the mean $\bar{Y}$ in the calibration data (see Smith 1973), and to simplify the formulas that follow, we now assume that the columns of $Y$ have been centered. (Full details of the prior to posterior analysis have been given by Brown et al. 1998b, who also considered other prior structures.) After some manipulation, we have

$$\pi(\gamma \mid Y, Z) \propto g(\gamma) = \big|\tilde{H}_\gamma^{1/2\prime} Z_\gamma' Z_\gamma \tilde{H}_\gamma^{1/2} + I\big|^{-q/2}\, |Q_\gamma|^{-(n+\delta+q-1)/2}\, \pi(\gamma), \qquad (11)$$

where $Q_\gamma = Q + Y'Y - Y'Z_\gamma (Z_\gamma' Z_\gamma + \tilde{H}_\gamma^{-1})^{-1} Z_\gamma' Y$ and $Z_\gamma$ is $Z$ with the columns for which $\gamma_j = 0$ deleted. Care is needed in computing (11); the alternative forms discussed later may be useful.

A simplifying feature of this setup is that all of the computations can be formulated as least squares problems with modified $Y$ and $Z$ matrices. By writing

$$\tilde{Z}_\gamma = \begin{pmatrix} Z_\gamma \tilde{H}_\gamma^{1/2} \\ I_{p_\gamma} \end{pmatrix}, \qquad \tilde{Y} = \begin{pmatrix} Y \\ 0 \end{pmatrix},$$

where $\tilde{H}_\gamma^{1/2}$ is a matrix square root of $\tilde{H}_\gamma$ and $p_\gamma$ is the number of 1's in $\gamma$, the relevant quantities entering into (11) can be computed as

$$\tilde{H}_\gamma^{1/2\prime} Z_\gamma' Z_\gamma \tilde{H}_\gamma^{1/2} + I = \tilde{Z}_\gamma' \tilde{Z}_\gamma \qquad (12)$$

and

$$Q_\gamma = Q + \tilde{Y}'\tilde{Y} - \tilde{Y}'\tilde{Z}_\gamma (\tilde{Z}_\gamma' \tilde{Z}_\gamma)^{-1} \tilde{Z}_\gamma' \tilde{Y}; \qquad (13)$$

that is, $Q_\gamma$ is given by $Q$ plus the residual sum of products matrix from the least squares regression of $\tilde{Y}$ on $\tilde{Z}_\gamma$. The QR decomposition can then be used (see, e.g., Seber 1984, chap. 10, sec. 1.1b), which avoids "squaring" as in (12) and (13).

4.2 Metropolis Search

Equation (11) gives the posterior probability of each of the $2^p$ different $\gamma$ vectors, and thus of each choice of wavelet coefficient subsets. What remains to do is to look for "good" wavelet components by computing these posterior probabilities. When $p$ is much greater than about 25, too many subsets exist for this to be feasible. Fortunately, we can use simulation methods that will find the $\gamma$ vectors with relatively high posterior probabilities. We can then quickly identify useful coefficients that have high marginal probabilities of $\gamma_j = 1$.

Here we use a Metropolis search, as suggested for model selection by Madigan and York (1995) and applied to variable selection for regression by Brown et al. (1998a), George and McCulloch (1997), and Raftery, Madigan, and Hoeting (1997). The search starts from a randomly chosen $\gamma^0$ and then moves through a sequence of further values of $\gamma$. At each step, the algorithm generates a new candidate $\gamma$ by randomly modifying the current one. Two types of moves are used:

• Add or delete a component by choosing at random one component in the current $\gamma$ and changing its value. This move is chosen with probability $\phi$.

• Swap two components by choosing independently at random a 0 and a 1 in the current $\gamma$ and changing both of them. This move is chosen with probability $1 - \phi$.

The new candidate model, $\gamma^*$, is accepted with probability

$$\min\left\{1, \frac{g(\gamma^*)}{g(\gamma)}\right\}. \qquad (14)$$

Thus a more probable $\gamma^*$ is always accepted, and a less probable one may be accepted. There is scope for further ingenuity in designing the sequence of random moves. For example, moves that add or subtract or swap two or three or more at a time, or a combination of these, may be useful.

The sequence of $\gamma$'s generated by the search is a realization of a Markov chain, and the choice of acceptance probabilities ensures that the equilibrium distribution of this chain is the distribution given by (11). In typical uses of such schemes, the realizations are monitored after a suitable burn-in period to verify that they appear stationary. Here we have a closed form for the posterior distribution, and are using the chain simply to explore this distribution. Thus we have not been so concerned about strict convergence of the Markov chain. Following Brown et al. (1998a), we adopt a strategy of running the chain from a number of different starting points (four here) and looking at the four marginal distributions provided by the computed $g(\cdot)$ values of the visited $\gamma$'s. We also look for good indication of mixing and explorations with returns. Because we know the relative probabilities, we do not need to worry about using a burn-in period.

Note that, given the form of the acceptance probability (14), $\gamma$-vectors with high posterior probability have a greater chance of appearing in the sequence. Thus we might expect that a long run of such a chain would visit many of the best subsets.
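A compact sketch of this search (ours; it reuses the hypothetical `log_g` function above) follows. Working on the log scale turns the acceptance test (14) into a comparison of log $g$ values, and the `marginals` helper normalizes the relative probabilities of the distinct visited models to approximate the marginal inclusion probabilities $P(\gamma_j = 1)$.

```python
# Sketch (ours) of the Metropolis search of Section 4.2.
import numpy as np

def metropolis_search(Y, Z, H_tilde, Q_prior, delta, prior_w,
                      n_iter=100_000, phi=0.5, seed=0):
    rng = np.random.default_rng(seed)
    p = Z.shape[1]
    gamma = rng.random(p) < prior_w                 # random starting gamma
    if not gamma.any():
        gamma[rng.integers(p)] = True               # guard: keep model nonempty
    current = log_g(gamma, Y, Z, H_tilde, Q_prior, delta, prior_w)
    visited = {tuple(np.flatnonzero(gamma)): current}
    for _ in range(n_iter):
        cand = gamma.copy()
        if rng.random() < phi:                      # add or delete one component
            j = rng.integers(p)
            cand[j] = not cand[j]
        else:                                       # swap a random 1 and a random 0
            ones, zeros = np.flatnonzero(gamma), np.flatnonzero(~gamma)
            if ones.size == 0 or zeros.size == 0:
                continue
            cand[rng.choice(ones)] = False
            cand[rng.choice(zeros)] = True
        if not cand.any():
            continue                                # log_g is undefined for the empty model
        proposal = log_g(cand, Y, Z, H_tilde, Q_prior, delta, prior_w)
        # Acceptance probability (14), applied on the log scale.
        if np.log(rng.random()) < proposal - current:
            gamma, current = cand, proposal
        visited[tuple(np.flatnonzero(gamma))] = current
    return visited

def marginals(visited, p):
    # Normalize g over the distinct visited models; sum over models containing j.
    logs = np.array(list(visited.values()))
    w = np.exp(logs - logs.max())
    w /= w.sum()
    probs = np.zeros(p)
    for model, weight in zip(visited, w):
        probs[list(model)] += weight
    return probs
```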
5. PREDICTION

Suppose now that we wish to predict $Y_f$, an $m \times q$ matrix of further $Y$-vectors given the corresponding $X$-vectors, $X_f$ $(m \times p)$. First, we treat $X_f$ exactly as the training data have been treated, by subtracting the training data means and transforming to wavelet coefficients $Z_f$ $(m \times p)$. The model for $Y_f$, following the model for the training data (6), is

$$Y_f - 1_m\alpha' - Z_f\tilde{B} \sim \mathcal{N}(I_m, \Sigma). \qquad (15)$$

If we believe our Bayesian mixture model, then logically we should apply the same latent structure model to prediction as well as to training. This has the practical appeal of providing averaging over a range of likely models (Madigan and Raftery 1994).

Results of Brown et al. (1998b) demonstrate that with the columns of $Y_f$ centered using the mean $\bar{Y}$ from the training set, the expectation of the predictive distribution $p(Y_f \mid \gamma, Z, Y)$ is given by $Z_{f,\gamma}\hat{B}_\gamma$ with

$$\hat{B}_\gamma = \big(Z_\gamma' Z_\gamma + \tilde{H}_\gamma^{-1}\big)^{-1} Z_\gamma' Y = \tilde{H}_\gamma^{1/2}\big(\tilde{Z}_\gamma' \tilde{Z}_\gamma\big)^{-1}\tilde{Z}_\gamma' \tilde{Y}. \qquad (16)$$

Averaging over the posterior distribution of $\gamma$ gives

$$\hat{Y}_f = \sum_\gamma Z_{f,\gamma}\, \hat{B}_\gamma\, \pi(\gamma \mid Z, Y), \qquad (17)$$

and we might choose to approximate this by some restricted set of $\gamma$ values, perhaps the $r$ most likely values from the Metropolis search.

6. APPLICATION TO NEAR-INFRARED SPECTROSCOPY OF BISCUIT DOUGHS

We now apply the methodology developed earlier to the spectroscopic calibration problem described in Section 1.1. First, however, we report the results of some other analyses of these data.

6.1 Analysis by Standard Methods

For all of the analyses carried out here, both compositional and spectral data were centered by subtracting the training set means from the training and validation data. The responses, but not the spectral data, were also scaled, to give each of the four variables unit variance in the training set. Mean squared prediction errors were converted back to the original scale by multiplying them by the training sample variances. This preprocessing of the data makes no difference to the standard analyses, which treat the response variables separately, but it simplifies the prior specification for our multivariate wavelet analysis.

Osborne et al. (1984) derived calibrations by multiple regression, using various stepwise procedures to select wavelengths for each constituent separately. The mean squared errors of predictions on the 39 validation samples for their calibrations are reported in the first row of Table 1.

Table 1. Mean Squared Errors of Prediction on the 39 Biscuit Dough Pieces in the Validation Set Using Four Calibration Methods

Method             Fat     Sugar   Flour   Water
Stepwise MLR       0.044   1.188   0.722   0.221
Decision theory    0.076   0.566   0.265   0.176
PLS                0.151   0.583   0.375   0.105
PCR                0.160   0.614   0.388   0.106

Brown et al. (1999) also selected small numbers of wavelengths to find calibration equations for this example. Their Bayesian decision theory approach differed from the approach of Osborne in being multivariate (i.e., in trying to find one small subset of wavelengths suitable for predicting all four constituents) and in using a more extensive search using simulated annealing. The results for this alternative wavelength selection approach are given in the second row of Table 1.
For the purpose of comparison, we carried out two other analyses using partial least squares regression (PLS) and principal components regression (PCR). These approaches, both of which construct factors from the full spectral data and then regress constituents on the factors, are very much the standard tools in NIR spectroscopy (see, e.g., Geladi and Martens 1996; Geladi, Martens, Hadjiiski, and Hopke 1996). For the computations, we used the PLS Toolbox 2.0 of Wise and Gallagher (Eigenvector Research, Manson, WA). Although there are multivariate versions of PLS, we took the usual approach of calibrating for each constituent separately. The number of factors used, selected in each case by cross-validation on the training set, was five for each of the PLS equations and six for each of the PCR equations. The results given in rows three and four of Table 1 show that, as usual, there is little to choose between the two methods. These results are for PLS and PCR using the reduced 256-point spectrum that we used for our wavelet analysis. Repeating the analyses using the original 700-point spectrum yielded results very similar to those for PLS and somewhat worse than those reported for PCR.

Because shortly we need to specify a prior distribution on regression coefficients, it is interesting to examine those resulting from a factor-type approach. Combining the coefficients for the regression of constituent on factor scores with the loadings that produce scores from the original spectral variables gives the coefficient vector that would be applied to a measured spectrum to give a prediction. Figure 2 plots these vectors for the PLS equations for the four constituents, showing the smoothness in the 256 coefficients that we attempt to reflect in our prior distribution for $B$.

6.2 Wavelet Transforms of Spectra

To each spectrum we apply a wavelet transform, converting it to a set of 256 wavelet coefficients. We used the MATLAB toolbox Wavbox 4.3 (Taswell 1995) for this step. Using spectra with $2^m$ ($m$ integer) points is not a real restriction here; methods exist to overcome the limitation, allowing the DWT to be applied to any length of data. We used MP(4) (Daubechies 1992, p. 194), wavelets with four vanishing moments. The Daubechies wavelets have compact support, important for good localization, and a maximum number of vanishing moments for a given smoothness. A large number of vanishing moments leads to high compressibility, because the fine-scale wavelet coefficients are essentially 0 where the functions are smooth. On the other hand, the support of the wavelets increases with an increasing number of vanishing moments, so there is a trade-off with the localization properties. Some rather limited exploration suggested that the chosen wavelet family is a good compromise for these data.

The graphs on the right side of Figure 1 show the wavelet transforms corresponding to the three NIR spectra in the left column. Coefficients are ordered from coarsest to finest.

[Figure 2 about here. Four panels (Fat, Sugar, Flour, Water): coefficient plotted against wavelength, 1400–2400 nm.]

Figure 2. Coefficient Vectors From 5-Factor PLS Equations.

6.3 Prior Settings

We need to specify the values of $H$, $\delta$, and $Q$ in (3), (4), and (5) and the probability $w$ that an element of $\gamma$ is 1. We wish to put in weak but proper prior information about $\Sigma$. We choose $\delta = 3$, because this is the smallest integer value such that the expectation of $\Sigma$, $E(\Sigma) = Q/(\delta - 2)$, exists. The scale matrix $Q$ is chosen as $Q = kI_q$ with $k = 0.05$, comparable in size to the expected error variances of the standardized $Y$ given $X$. With $\delta$ small, the choice of $Q$ is unlikely to be critical.

Much more likely to be influential are the choices of $H$ and $w$ in the priors for $B$ and $\gamma$. To reflect the smoothness in the coefficients $B$, as exemplified in Figure 2, while keeping the form of $H$ simple, we have taken $H$ to be the variance matrix of a first-order autoregressive process, with $h_{ij} = \sigma^2\rho^{|i-j|}$. We derived the values $\sigma^2 = 254$ and $\rho = 0.32$ by maximizing a type II likelihood (Good 1965). Integrating $\alpha$, $B$, and $\Sigma$ from the joint distribution given by (2), (3), (4), and (5) for the regression on the full untransformed spectra, with $h \to \infty$ and $B_0 = 0$, we get

$$f \propto |K|^{-q/2}\, |Q|^{(\delta+q-1)/2}\, \big|Q + Y'K^{-1}Y\big|^{-(\delta+n+q-1)/2}, \qquad (18)$$

where

$$K = I_n + XHX'$$

and the columns of $Y$ are centered, as in Section 4.1. With $k = 0.05$ and $\delta = 3$ already fixed, (18) is a function, via $H$, of $\sigma^2$ and $\rho$. We used for our prior the values of these hyperparameters that maximize (18). Possible underestimation due to the use of the full spectra was taken into account by multiplying the estimate of $\sigma^2$ by the inflation factor $256/20$, reflecting our prior belief as to the expected number of included coefficients.

Figure 3 shows the diagonal elements of the matrix $\tilde{H} = WHW'$ implied by our choice of $H$. The variance matrix of the $i$th column of $\tilde{B}$ in (8) is $\sigma_{ii}\tilde{H}$, so this plot shows the pattern in the prior variance of the regression coefficients when the predictors are wavelet coefficients. The wavelet coefficients are ordered from coarsest to finest, so the decreasing prior variance means that there will be more shrinkage at the finer levels. This is a logical consequence of the smoothness that we have tried to express in the prior distribution. The spikes in the plot at the level transitions are from the boundary condition problems of the discrete wavelet transform.

We know from experience that good predictions can usually be obtained with 10 or so selected spectral points in examples of this type. Having no previous experience in selecting wavelet coefficients, and wanting to induce a similarly "small" model without constraining the possibilities too severely, we chose $w$ in the prior for $\gamma$ so that the expected model size was $pw = 20$. We have given equal prior probability here to coefficients at the different levels. Although we considered the possibility of varying $w$ in blocks, we had no strong prior opinions about which levels were likely to provide the most useful coefficients, apart from a suspicion that neither the coarsest nor the finest levels would feature strongly.

6.4 Implementing the Metropolis Search

We chose widely different starting points for the four Metropolis chains, by setting to 1 the first 1, the first 20, the first 128 (i.e., half), and all elements of $\gamma$. There were 100,000 iterations in each run, where an iteration comprised either adding/deleting or swapping, as described in Section 4.2. The two moves were chosen with equal probability, $\phi = 1/2$. Acceptance of the possible move by (14) was by generation of a Bernoulli random variable. Computation of $g(\gamma)$ and $g(\gamma^*)$ was done using the QR decomposition of MATLAB.

For each chain, we recorded the visited $\gamma$'s and their corresponding relative probability $g(\gamma)$. No burn-in was necessary, as relatively unlikely $\gamma$'s would automatically be downweighted in our analysis. There were approximately 38,000–40,000 successful moves for each of the four runs. Of these moves, around 95% were swaps. The relative probabilities of the set of distinct visited $\gamma$ were then normalized to 1 over this set. Figure 4 plots the marginal probabilities for components of $\gamma$, $P(\gamma_j = 1)$, $j = 1, \ldots, 256$. The spikes show where regressor variables have been included in subsets with high probability.
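The prior pattern of Figure 3 can be reproduced with a short computation. The sketch below (ours) builds the AR(1) matrix $H$ with $h_{ij} = \sigma^2\rho^{|i-j|}$, using $\sigma^2 = 254$ and $\rho = 0.32$ as quoted above, forms the DWT matrix $W$ by transforming the columns of the identity, and reads off the diagonal of $WHW'$. The paper computes $WHW'$ with the fast recursive algorithm of Vannucci and Corradi (1999); the brute-force matrix product here is purely illustrative.

```python
# Sketch (ours) of the diagonal of H_tilde = W H W' plotted in Figure 3.
import numpy as np
import pywt

p, sigma2, rho = 256, 254.0, 0.32
i = np.arange(p)
H = sigma2 * rho ** np.abs(np.subtract.outer(i, i))   # AR(1) covariance matrix

def dwt_matrix(p, wavelet="db4"):
    # Column j of W is the DWT of the j-th standard basis vector (by linearity).
    cols = []
    for j in range(p):
        e = np.zeros(p)
        e[j] = 1.0
        flat, _ = pywt.coeffs_to_array(pywt.wavedec(e, wavelet, mode="periodization"))
        cols.append(flat)
    return np.column_stack(cols)

W = dwt_matrix(p)
assert np.allclose(W @ W.T, np.eye(p), atol=1e-8)     # orthogonality: W'W = I
H_tilde = W @ H @ W.T
print(H_tilde.diagonal()[:8])   # coarsest-level prior variances; spikes appear
                                # at level transitions, as noted in the text
```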

[Figure 3 about here. Prior variance of regression coefficient plotted against wavelet coefficient index, 0–250.]

Figure 3. Diagonal Elements of $\tilde{H} = WHW'$, With $\tilde{h}_{ii}$ Plotted Against $i$.

[Figure 4 about here. Four panels (i)–(iv): probability (0–1) plotted against wavelet coefficient index, 0–250.]

Figure 4. Marginal Probabilities of Components of $\gamma$ for Four Runs.

[Figure 5 about here. Panel (a): number of ones plotted against iteration number, 0–100,000; panel (b): log relative probabilities plotted against iteration number.]

Figure 5. Plots in Sequence Order for Run (iii). (a) The Number of 1's; (b) Log Relative Probabilities.

For one of the runs, (iii), Figure 5 gives two more plots: the number of 1's, and the log-relative probabilities, $\log(g(\gamma))$, of the visited $\gamma$, plotted over the 100,000 iterations. The other runs produced very similar plots, quickly moving toward models of similar dimensions and posterior probability values.

Despite the very different starting points, the regions explored by the four chains have clear similarities in that plots of marginals are overall broadly similar. However, there are also clear differences, with some chains making frequent use of variables not picked up by others. Although we would not claim convergence, all four chains arrive at some similarly "good," albeit different, subsets. With mutually correlated predictor variables, as we have here, there will always tend to be many solutions to the problem of finding the best predictors. We adopt the pragmatic stance that all we are trying to do is identify some of the good solutions. If we happen to miss some other good ones, then this is unfortunate but not disastrous.

6.5 Results

We pooled the distinct $\gamma$'s visited by the four chains, normalized the relative probabilities, and ordered them according to probability. Then we predicted the further 39 unseen samples using Bayes model averaging. Mean squared prediction errors converted to the original scale were 0.063, 0.449, 0.348, and 0.050. These results use the best 500 models, accounting for almost 99% of the total visited probability and using 219 wavelet coefficients. They improve considerably on all of the standard methods reported in Table 1. The single best subset among the visited ones had 10 coefficients, accounted for 9% of the total visited probability, and least squares predictions gave mean squared errors of 0.059, 0.466, 0.351, and 0.047.

Examining the scales of the selected coefficients is interesting. The model making the best predictions used coefficients (10, 11, 14, 17, 33, 34, 82, 132, 166, 255), which include (0%, 0%, 0%, 37%, 6%, 6%, 1%, 2%) of all of the coefficients at the eight levels from the coarsest to the finest scales. The most useful coefficients seem to be in the middle of the scale, with some more toward the finer end. Note that the very coarsest coefficients, which would be essential to any reconstruction of the spectra, are not used, despite the smoothness implicit in $H$ and the consequent increased shrinkage associated with finer scales, as seen in Figure 3.

Some idea of the locations of the wavelet coefficients selected by the modal model can be obtained from Figure 6. Here we used the linearity of the wavelet transform and the linearity of the prediction equation to express the prediction equation as a vector of coefficients to be applied to the original spectral data. This vector is obtained by applying the inverse wavelet transform to the columns of the matrix of the least squares estimates of the regression coefficients. Because selection has discarded unnecessary details, most of the coefficients are very close to 0. We thus display only the 1600–1800 nm range, which permits a better view of the features of the coefficients in the range of interest. These coefficient vectors can be compared directly with those shown in Figure 2.
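The mapping from wavelet-domain estimates back to the wavelength scale used for Figure 6 is just the inverse DWT applied column by column, since $\tilde{B} = WB$ implies $B = W'\tilde{B}$. A minimal sketch (ours; `B_tilde_hat` is a hypothetical placeholder for the least squares estimates of the modal model, not the paper's fitted values):

```python
# Sketch (ours): inverse DWT of each column of a wavelet-domain coefficient matrix.
import numpy as np
import pywt

def idwt_columns(B_tilde, wavelet="db4"):
    # Recover the coefficient layout used by coeffs_to_array, then invert per column.
    p, q = B_tilde.shape
    template = pywt.wavedec(np.zeros(p), wavelet, mode="periodization")
    _, slices = pywt.coeffs_to_array(template)
    cols = []
    for k in range(q):
        coeffs = pywt.array_to_coeffs(B_tilde[:, k], slices, output_format="wavedec")
        cols.append(pywt.waverec(coeffs, wavelet, mode="periodization"))
    return np.column_stack(cols)

B_tilde_hat = np.zeros((256, 4))   # placeholder for the selected-model estimates
B_hat = idwt_columns(B_tilde_hat)  # coefficient vectors on the wavelength scale
```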
These vectors are applied to the same spectral data to produce predictions, and despite the different scales, some comparisons are possible. For example, the two coefficient vectors for fat can be easily interpreted. Fats and oils have absorption bands at around 1730 and 1765 nm (Osborne et al. 1993), and strong positive coefficients in this region are the major features of both plots. The wavelet-based equation (Fig. 6) is simpler, the selection having discarded much unnecessary detail. The other coefficient vectors show less resemblance and are also harder to interpret. One thing to bear in mind in interpreting these plots is that the four constituents add up to 100%. Thus it should not be surprising to find the fat measurement peak in all eight plots, nor that the coefficient vectors for sugar and flour are so strongly inversely related.

Finally, we comment briefly on results that we obtained by investigating logistic transformations of the data. Our response variables are in fact percentages and are constrained to sum to 100; thus they lie on a simplex, rather than in the full $q$-dimensional space. Sample ranges are $18 \pm 3$ for fat, $17 \pm 7$ for sugar, $51 \pm 6$ for flour, and $14 \pm 3$ for water. Following Aitchison (1986), we transformed the original $Y$'s into log ratios of the form

$$Z_1 = \ln(Y_1/Y_3), \qquad Z_2 = \ln(Y_2/Y_3), \qquad Z_3 = \ln(Y_4/Y_3). \qquad (19)$$

The choice of the third ingredient for the denominator was the most natural, in that flour is the major constituent and also because ingredients in recipes often are expressed as a ratio to flour content. We centered and scaled the $Z$ variables and recomputed empirical Bayes estimates for $\sigma^2$ and $\rho$. (The other hyperparameters were not affected by the logistic transformation.) We then ran four Metropolis chains using the starting points used previously with the data in the original scale. Diagnostic plots and plots of the marginals appeared very similar to those of Figures 4 and 5. The four chains visited 151,183 distinct models. We finally computed Bayes model averaging and least squares predictions with the best model, unscaling the predicted values and transforming them back to the original scale as

$$Y_i = \frac{100\exp(Z_i)}{\sum_{j=1}^{3}\exp(Z_j) + 1}, \quad i = 1, 2, \qquad Y_3 = \frac{100}{\sum_{j=1}^{3}\exp(Z_j) + 1}, \qquad Y_4 = \frac{100\exp(Z_3)}{\sum_{j=1}^{3}\exp(Z_j) + 1}.$$

The best 500 visited models accounted for 99.4% of the total visited probability, used 214 wavelet coefficients, and gave Bayes mean squared prediction errors of 0.058, 0.819, 0.457, and 0.080. The single best subset among the visited ones had 10 coefficients, accounted for 15.8% of the total visited probability, and gave least squares prediction errors of 0.091, 0.793, 0.496, and 0.119. The logistic transformation does not seem to have a positive impact on the predictive performance, and overall, the simpler analysis on the original scale seems to be adequate.

Our approach gives Bayes predictions on the original scale satisfying the constraint of summing exactly to 100. This stems from the conjugate prior, linearity in $Y$, zero mean for the prior distribution of the regression coefficients, and the vague prior for the intercept. It is easily seen that with $X$ centered, $\hat{\alpha}'1 = 100$ and $\hat{B}1 = 0$ for either least squares or Bayes estimates, and hence all predictions sum to 100. In addition, had we imposed a singular distribution to deal with the four responses summing to 100, then a proper analysis would suggest eliminating one component, but then the predictions for the remaining three components would be as we have derived. Thus our analysis is Bayes for the singular problem, even though superficially it ignores this aspect. This does not address the positivity constraint, but the composition variables are all so far from the boundaries compared to residual error that this is not an issue. Our desire to stick with the original scale is supported by the Beer–Lambert law, which linearly relates absorbance to composition. The logistic transform distorts this linear relationship.
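For completeness, a small sketch (ours) of the log-ratio transform (19) and the back-transform displayed above; the assertion checks that the inverse recovers a composition exactly, so that back-transformed predictions sum to 100 by construction.

```python
# Sketch (ours): Aitchison log-ratio transform (19) and its inverse.
import numpy as np

def to_log_ratios(y):
    # y = (fat, sugar, flour, water); flour (y[2]) is the reference ingredient.
    return np.log(np.array([y[0], y[1], y[3]]) / y[2])

def from_log_ratios(z):
    e = np.exp(z)
    denom = 1.0 + e.sum()
    return 100.0 * np.array([e[0] / denom, e[1] / denom, 1.0 / denom, e[2] / denom])

y = np.array([18.0, 17.0, 51.0, 14.0])   # near the centers of the sample ranges
assert np.allclose(from_log_ratios(to_log_ratios(y)), y)
```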

[Figure 6 about here. Four panels (Fat, Sugar, Flour, Water): coefficient plotted against wavelength, 1400–1800 nm.]

Figure 6. Coefficient Vectors From the "Best" Wavelet Equations.

7. DISCUSSION

In specifying the prior parameters for our example, we made a number of arbitrary choices. In particular, the choice of $H$ as the variance matrix of an autoregressive process is a rather crude representation of the smoothness in the coefficients. It might be interesting to try to model this in more detail. However, although there may be room for improvement, the simple structure used here does appear to work well.

Another area for future investigation is the use of more sophisticated wavelet systems in this context. The additional flexibility of wavelet packets (Coifman, Meyer, and Wickerhauser 1992) or m-band wavelets (Mallet, Coomans, Kautsky, and De Vel 1997) might lead to improved predictions, or might be just an unnecessary complication.

[Received October 1998. Revised November 2000.]

REFERENCES

Aitchison, J. (1986), The Statistical Analysis of Compositional Data, London: Chapman and Hall.

Brown, P. J. (1993), Measurement, Regression, and Calibration, Oxford, U.K.: Clarendon Press.

Brown, P. J., Fearn, T., and Vannucci, M. (1999), "The Choice of Variables in Multivariate Regression: A Bayesian Non-Conjugate Decision Theory Approach," Biometrika, 86, 635–648.

Brown, P. J., Vannucci, M., and Fearn, T. (1998a), "Bayesian Wavelength Selection in Multicomponent Analysis," Journal of Chemometrics, 12, 173–182.

——— (1998b), "Multivariate Bayesian Variable Selection and Prediction," Journal of the Royal Statistical Society, Ser. B, 60, 627–641.

Carlin, B. P., and Chib, S. (1995), "Bayesian Model Choice via Markov Chain Monte Carlo," Journal of the Royal Statistical Society, Ser. B, 57, 473–484.

Chipman, H. (1996), "Bayesian Variable Selection With Related Predictors," Canadian Journal of Statistics, 24, 17–36.

Clyde, M., DeSimone, H., and Parmigiani, G. (1996), "Prediction via Orthogonalized Model Mixing," Journal of the American Statistical Association, 91, 1197–1208.

Clyde, M., and George, E. I. (2000), "Flexible Empirical Bayes Estimation for Wavelets," Journal of the Royal Statistical Society, Ser. B, 62, 681–698.

Coifman, R. R., Meyer, Y., and Wickerhauser, M. V. (1992), "Wavelet Analysis and Signal Processing," in Wavelets and Their Applications, eds. M. B. Ruskai, G. Beylkin, R. Coifman, I. Daubechies, S. Mallat, Y. Meyer, and L. Raphael, Boston: Jones and Bartlett, pp. 153–178.

Cowe, I. A., and McNicol, J. W. (1985), "The Use of Principal Components in the Analysis of Near-Infrared Spectra," Applied Spectroscopy, 39, 257–266.

Daubechies, I. (1992), Ten Lectures on Wavelets (Vol. 61, CBMS-NSF Regional Conference Series in Applied Mathematics), Philadelphia: Society for Industrial and Applied Mathematics.

Dawid, A. P. (1981), "Some Matrix-Variate Distribution Theory: Notational Considerations and a Bayesian Application," Biometrika, 68, 265–274.

Donoho, D., Johnstone, I., Kerkyacharian, G., and Picard, D. (1995), "Wavelet Shrinkage: Asymptopia?" (with discussion), Journal of the Royal Statistical Society, Ser. B, 57, 301–369.

Geladi, P., and Martens, H. (1996), "A Calibration Tutorial for Spectral Data. Part 1. Data Pretreatment and Principal Component Regression Using Matlab," Journal of Near Infrared Spectroscopy, 4, 225–242.

Geladi, P., Martens, H., Hadjiiski, L., and Hopke, P. (1996), "A Calibration Tutorial for Spectral Data. Part 2. Partial Least Squares Regression Using Matlab and Some Neural Network Results," Journal of Near Infrared Spectroscopy, 4, 243–255.

George, E. I., and McCulloch, R. E. (1997), "Approaches for Bayesian Variable Selection," Statistica Sinica, 7, 339–373.

Geweke, J. (1996), "Variable Selection and Model Comparison in Regression," in Bayesian Statistics 5, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, Oxford, U.K.: Clarendon Press, pp. 609–620.

Good, I. J. (1965), The Estimation of Probabilities. An Essay on Modern Bayesian Methods, Cambridge, MA: MIT Press.

Hrushka, W. R. (1987), "Data Analysis: Wavelength Selection Methods," in Near-Infrared Technology in the Agricultural and Food Industries, eds. P. Williams and K. Norris, St. Paul, MN: American Association of Cereal Chemists, pp. 35–55.

Leamer, E. E. (1978), "Regression Selection Strategies and Revealed Priors," Journal of the American Statistical Association, 73, 580–587.

Madigan, D., and Raftery, A. E. (1994), "Model Selection and Accounting for Model Uncertainty in Graphical Models Using Occam's Window," Journal of the American Statistical Association, 89, 1535–1546.

Madigan, D., and York, J. (1995), "Bayesian Graphical Models for Discrete Data," International Statistical Review, 63, 215–232.

Mallat, S. G. (1989), "Multiresolution Approximations and Wavelet Orthonormal Bases of $L^2(\mathbb{R})$," Transactions of the American Mathematical Society, 315, 69–87.

Mallet, Y., Coomans, D., Kautsky, J., and De Vel, O. (1997), "Classification Using Adaptive Wavelets for Feature Extraction," IEEE Transactions on Pattern Analysis and Machine Intelligence, 19, 1058–1066.

McClure, W. F., Hamid, A., Giesbrecht, F. G., and Weeks, W. W. (1984), "Fourier Analysis Enhances NIR Diffuse Reflectance Spectroscopy," Applied Spectroscopy, 38, 322–329.

Mitchell, T. J., and Beauchamp, J. J. (1988), "Bayesian Variable Selection in Linear Regression," Journal of the American Statistical Association, 83, 1023–1036.

Osborne, B. G., Fearn, T., and Hindle, P. H. (1993), Practical NIR Spectroscopy, Harlow, U.K.: Longman.

Osborne, B. G., Fearn, T., Miller, A. R., and Douglas, S. (1984), "Application of Near-Infrared Reflectance Spectroscopy to Compositional Analysis of Biscuits and Biscuit Doughs," Journal of the Science of Food and Agriculture, 35, 99–105.

Raftery, A. E., Madigan, D., and Hoeting, J. A. (1997), "Bayesian Model Averaging for Linear Regression Models," Journal of the American Statistical Association, 92, 179–191.

Seber, G. A. F. (1984), Multivariate Observations, New York: Wiley.

Smith, A. F. M. (1973), "A General Bayesian Linear Model," Journal of the Royal Statistical Society, Ser. B, 35, 67–75.

Taswell, C. (1995), "Wavbox 4: A Software Toolbox for Wavelet Transforms and Adaptive Wavelet Packet Decompositions," in Wavelets and Statistics, eds. A. Antoniadis and G. Oppenheim, New York: Springer-Verlag, pp. 361–375.

Trygg, J., and Wold, S. (1998), "PLS Regression on Wavelet-Compressed NIR Spectra," Chemometrics and Intelligent Laboratory Systems, 42, 209–220.

Vannucci, M., and Corradi, F. (1999), "Covariance Structure of Wavelet Coefficients: Theory and Models in a Bayesian Perspective," Journal of the Royal Statistical Society, Ser. B, 61, 971–986.

Walczak, B., and Massart, D. L. (1997), "Noise Suppression and Signal Compression Using the Wavelet Packet Transform," Chemometrics and Intelligent Laboratory Systems, 36, 81–94.

Wold, S., Martens, H., and Wold, H. (1983), "The Multivariate Calibration Problem in Chemistry Solved by PLS," in Matrix Pencils, eds. A. Ruhe and B. Kagstrom, Heidelberg: Springer, pp. 286–293.