R in Chemistry
R News: The Newsletter of the R Project
Volume 6/3, August 2006

Contents of this issue:
Editorial
Non-linear regression for optimising the separation of carboxylic acids
Fitting dose-response curves from bioassays and toxicity testing
The pls package
Some Applications of Model-Based Clustering in Chemistry
Mapping databases of X-ray powder patterns
Generating, Using and Visualizing Molecular Information in R

Editorial
by Ron Wehrens and Paul Murrell

Welcome to the third issue of R News for 2006, our second special issue of the year, this time with a focus on uses of R in Chemistry. Thanks go to our guest editor, Ron Wehrens, for doing all of the heavy lifting to get this issue together. Ron describes below the delights that await in the following pages. Happy reading!

Paul Murrell
The University of Auckland, New Zealand
[email protected]

R has become the standard for statistical analysis in biology and bioinformatics, but it is also gaining popularity in other fields of natural sciences. This special issue of R News focuses on the use of R in chemistry. Although a large number of the Bioconductor packages can be said to relate to chemical subdisciplines such as biochemistry or analytical chemistry, we have deliberately focused on those applications with a less obvious bioinformatics component.

Rather than providing a comprehensive overview, the issue gives a flavour of the diversity of applications. The first two papers focus on fitting equations that are derived from chemical knowledge, in this case with nonlinear regression. Peter Watkins and Bill Venables show an example from chromatography, where the retention behaviour of carboxylic acids is modelled. Their paper features some nice examples of how to initialize the optimization. In the next paper, Johannes Ranke describes the drfit package for fitting dose-response curves.

The following three papers consider applications that are more analytical in nature, and feature spectral data of various kinds. First, Bjørn-Helge Mevik discusses the pls package, which implements PCR and several variants of PLS. He illustrates the package with an example using near-infrared data, which is appropriate, since this form of spectroscopy would not be used today but for the existence of multivariate calibration techniques. Then, Chris Fraley and Adrian Raftery describe several chemical applications of the mclust package for model-based clustering. The forms of spectroscopy here yield images rather than spectra; the examples focus on segmenting microarray images and dynamic magnetic resonance images. Ron Wehrens and Egon Willighagen continue with a paper describing self-organising maps for large databases of crystal structures, as implemented in the package wccsom. To compare the spectral-like descriptors of crystal packing, a specially devised similarity measure has to be used.

The issue concludes with a contribution by Rajarshi Guha on the connections between R and the Chemistry Development Kit (CDK), another open-source project that is rapidly gaining widespread popularity. With CDK, it is easy to generate descriptors of molecular structure, which can then be used in R for modelling and predicting properties. The paper includes a description of the rcdk package, where the key point is the connection between R and Java (which underlies CDK).

Institute for Molecules and Materials
Analytical Chemistry
The Netherlands
Ron Wehrens
[email protected]

R News ISSN 1609-3631

Non-linear regression for optimising the separation of carboxylic acids
by Peter Watkins and Bill Venables

In analytical chemistry, models are developed to describe a relationship between a response and one or more stimulus variables. The most frequently used model is the linear one, where the relationship is linear in the parameters that are to be estimated. This is generally applied to instrumental analysis where the instrument response, as part of a calibration process, is related to a series of solutions of known concentration. Estimation of the linear parameters is relatively simple and is routinely applied in instrumental analysis. Not all relationships, though, are linear with respect to the parameters. One example is the Arrhenius equation, which relates reaction rates to temperature:

$$k = A \exp(-E_a/RT) \times \exp(\varepsilon) \tag{1}$$

where $k$ is the rate coefficient, $A$ is a constant, $E_a$ is the activation energy, $R$ is the universal gas constant, and $T$ is the absolute temperature in kelvin. As $k$ is the measured response at temperature $T$, $A$ and $E_a$ are the parameters to be estimated. The last factor indicates that there is an error term, $\varepsilon$, which we assume is multiplicative on the response. In this article we will assume that the error term is normal and homoscedastic, that is, $\varepsilon \sim N(0, \sigma^2)$.

One way to find estimates for $A$ and $E_a$ is to transform the Arrhenius equation by taking logarithms of both sides. This converts the relationship from a multiplicative one to a linear one with homogeneous, additive errors. In this form linear regression may be used to estimate the coefficients in the usual way.

Note, however, that if the original error structure is not multiplicative and the appropriate model is, for example, as in the equation

$$k = A \exp(-E_a/RT) + \varepsilon \tag{2}$$

then taking logarithms of both sides does not lead to a linear relationship. While it may be useful to ignore this as a first step, the optimum estimates can only be obtained using non-linear regression techniques, that is, by least squares on the original scale and not on the logarithmic scale. Starting from initial values for the unknown parameters, the estimates are iteratively refined until, it is hoped, the process converges to the maximum likelihood estimates.

This article is intended to show some of the powerful general facilities available in R for non-linear regression, illustrating the ideas with simple, yet important non-linear models typical of those in use in Chemometrics. The particular example on which we focus is one for the response behaviour of a carboxylic acid using reverse-phase high performance liquid chromatography, and we use it to optimise the separation of a mixture of acids.

Non-linear regression in general is a very unstructured class of problems, as the response function of the regression may literally be any function at all of the parameters and the stimulus variables. In specific applications, however, certain classes of non-linear regression models occur frequently. The general facilities in R allow the user to build up a knowledge base in the software itself that allows the fitting algorithm to find estimates for initial values and to find derivatives of the response function with respect to the unknown parameters automatically. This greatly simplifies the model fitting process for such classes of models and usually makes the process much more stable and reliable.

The working example

Aromatic carboxylic acids are an important class of compounds since many are pharmacologically and biologically significant. Thus it is useful to be able to separate, characterise and quantify these types of compounds. One way to do this is with chromatography, more specifically, reverse-phase high performance liquid chromatography (RP-HPLC). Due to the ionic nature of these compounds, analysis by HPLC can be complicated, as the hydrogen ion concentration is an important factor for separation of these compounds. Waksmundzka-Hajnos (1998) reported a widely used equation that models the separation of monoprotic carboxylic acids (i.e. containing a single hydrogen ion) depending on the hydrogen ion concentration, $[\mathrm{H}^+]$. This is given by

$$k = \frac{k_{-1} + k_0\,([\mathrm{H}^+]/K_a)}{1 + [\mathrm{H}^+]/K_a} \tag{3}$$

where $k$ is the capacity factor, $k_0$ and $k_{-1}$ are the $k$ values for the non-ionised and ionised forms of the carboxylic acid, and $K_a$ is the acidity constant. In this form, non-linear regression can be applied to Equation 3 to estimate the parameters $k_0$, $k_{-1}$ and $K_a$. Deming and Turoff (1978) reported HPLC measurements for four carboxylic acids containing a single hydrogen ion: benzoic acid (BA), o-aminobenzoic acid (OABA), p-aminobenzoic acid (PABA) and p-hydroxybenzoic acid (HOBA).

The R data set we use here has a stimulus variable named pH and four responses named BA, OABA, PABA and HOBA, the measured capacity factors for the acids at different pH values. The data set is given as an appendix to this article.

Initial values

While non-linear regression is appropriate for the final estimates, we can pay less attention to the error structure when trying to find initial values.

[...]

Parameters in the first set (for our example, just $K_a$) are called the "non-linear parameters" while the second set are the "linear parameters" (though a better term might be "conditionally linear parameters"). The so-called "partially linear" fitting algorithm requires only that initial values be supplied for the non-linear parameters. To illustrate the entire fitting process, we note that when BA is the response, a reasonable initial value is $K_a = 0.0001$. Using this we can fit the regression as follows:

> tmp <- transform(ba, H = 10^(-pH))
> ba.nls <- nls(BA ~ cbind(1, H/Ka)/(1 + H/Ka),
    data = tmp, algorithm = "plinear",
    start = c(Ka = 0.0001), trace = TRUE)
13.25806 : 0.000100 3.531484 54.819291
7.180198 : 4.293692e-05 5.890920e-01 4.197178e+01
1.441950 : 5.600297e-05 1.635335e+00 4.525067e+01
1.276808 : 5.906698e-05 1.831871e+00 4.597552e+01
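The partially linear fit above depends on the article's appendix data, which is not reproduced here. As a minimal sketch of the same technique, the following fits Equation 3 to simulated data. The pH grid, the "true" parameter values, the noise level and the names `sim`, `true_Ka`, `true_k_m1` and `true_k_0` are all assumptions made for illustration; they are not the Deming and Turoff measurements.

```r
# Simulated data only: hypothetical values, NOT the published measurements.
set.seed(1)

true_Ka   <- 6e-05   # assumed acidity constant
true_k_m1 <- 2       # assumed capacity factor of the ionised form
true_k_0  <- 45      # assumed capacity factor of the non-ionised form

pH <- seq(3, 6, by = 0.25)   # hypothetical pH grid
H  <- 10^(-pH)               # hydrogen ion concentration

# Equation 3 plus additive normal noise (error structure as in Equation 2)
k <- (true_k_m1 + true_k_0 * H / true_Ka) / (1 + H / true_Ka) +
  rnorm(length(pH), sd = 0.3)
sim <- data.frame(pH = pH, H = H, k = k)

# Partially linear fit: only the non-linear parameter Ka needs a start value.
# The two columns of cbind() carry the conditionally linear parameters,
# which nls() reports as .lin1 (k_{-1}) and .lin2 (k_0).
fit <- nls(k ~ cbind(1, H / Ka) / (1 + H / Ka),
           data = sim, algorithm = "plinear",
           start = c(Ka = 1e-04))

print(coef(fit))
```

Because the data are simulated from known values, the estimates can be compared directly with `true_Ka`, `true_k_m1` and `true_k_0`, which is a convenient check that the model formula matches Equation 3 before applying it to real measurements.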