Comparison of Goodness of Fit Statistics for Linear Regression, Part I

Chemometrics in Spectroscopy
Spectroscopy 19(4), April 2004, www.spectroscopyonline.com

The authors begin a discussion of the statistical tools available to compare and correlate two or more data sets.

Jerome Workman Jr. and Howard Mark

Jerome Workman Jr. serves on the Editorial Advisory Board of Spectroscopy and is chief technical officer and vice president of research and engineering for Argose, Inc. (Waltham, MA). He can be reached by e-mail at [email protected].

Howard Mark serves on the Editorial Advisory Board of Spectroscopy and runs a consulting service, Mark Electronics (69 Jamie Court, Suffern, NY 10901). He can be reached via e-mail at hlmark@prodigy.net.

The scope of this series of columns is to provide statistical tools for comparing two columns of data. With respect to analytical applications such data might be represented for simple linear regression as the concentration of a sample (X) versus an instrument response when measuring the sample (Y). X and Y also can denote a comparison of the reference analytical results (X) versus predicted results (Y) from a calibrated instrument. At other times one might use X and Y to represent the instrument response (X) to a reference value (Y). Whatever data pairs one is comparing, there are several statistical tools that are useful to assess the meaning of a change in Y as a function of a change in X. These include, but are not limited to: correlation (r), the coefficient of determination (R²), the slope (K1), intercept (K0), the z-statistic, and of course the respective confidence limits for these statistical parameters. The use of graphical representation also is a powerful tool for discerning the relationships between X and Y paired data sets.

The specific software used for this pedagogical exercise is MathCad 2001i software (MathSoft Engineering and Education, Inc., Cambridge, MA), which we find particularly useful for describing the precise mathematics employed behind each set of examples. The mathematical tools used here can be employed whenever the assumptions of linear correlation are suspected or assumed for a set of X and Y data. The data set used for this example is from Miller and Miller (1) as shown in Table I. This data set is used so that the reader can compare the statistics calculated and displayed using the formulas and figures described in this reference with respect to those shown in this series of columns. The correlation coefficient and other goodness of fit parameters can be evaluated properly using standard statistical tests. The worksheets provided in this column series can be customized for specific applications providing information tailored for particular method comparisons and validation studies.

Looking at Cause and Effect

Several general assumptions are made when performing X and Y linear regression computations. One is assuming that if the correlation between X and Y is significantly large then some cause-and-effect relationship possibly could exist between changes in X and changes in Y. However, it is important to remember that probability alone tells us only if X and Y appear to be related. If no cause–effect relationship exists between X and Y the regression model will have no true predictive importance. Thus knowledge of cause-and-effect creates a basis for decision-making when using regression models. Limitations of inferences derived from probability and statistics arise from limited knowledge of the characteristics and stability of: the nature and origins of the set of samples used for X and Y comparison, the characteristics of the measuring instruments used for collecting both X and Y data, the set of operators performing the measurements, and the precise set of measurement or experimental conditions.

One must note that probability alone can detect only “alikeness” in special cases, thus cause–effect cannot be directly determined, only estimated. If linear regression is to be used for comparison of X and Y, one must assess whether the five assumptions for use of regression apply. As a refresher, recall that the assumptions required for the application of linear regression for comparisons of X and Y include: the errors (variations) are independent of the magnitudes of X or Y; the error distributions for both X and Y are known to be normally distributed (Gaussian); the mean and variance of Y depend solely upon the absolute value of X; the mean of each Y distribution is a straight-line function of X; and the variance of X is small or near zero while the variance of Y is exactly the same for all values of X.

The requirement for a priori knowledge useful for providing a scientific basis for comparison of X and Y data poses six key questions for the statistician or analyst when using regression as a comparative tool:
• If X is a true predictor of Y, does a cause–effect relationship exist?
• If X is a true predictor of Y, what is the optimum mathematical relationship to describe a measurement device response with respect to the reference data? (Such information defines the optimum mathematical tools to use for comparison.)
• What are the effects of operator and measurement or experimental conditions on the change in X relative to Y?
• What are the effects on X and Y of making measurements using multiple instruments with multiple operators?
• What is the theoretical response for the X value with respect to the Y value?
• What is the limit of detection (LOD) relative to changes in X and Y? Is this limit acceptable for the intended application?

In routine comparisons of X and Y data for spectroscopic analysis, when X and Y denote a comparison of the reference analytical results versus instrument response, at least three main categories of modeling problems are found:
• The technique is not optimal: the instrument response (Y) is a predictor of analyte values (X). The limitation for modeling is in the representation of calibration set chemistry, sample presentation, and unknown variations of instrument and operator during measurement.
• There is no clear, specific analyte signal: the instrument response does not change adequately with a variation in the analyte value. This phenomenon indicates that small changes in analyte concentration are not detected by the measurement instrument. Different or additional instrument response information is required to describe the analyte (the problem is underdetermined).
• The instrument response changes dramatically with little or no change in analyte value. In this example, additional clarification is required to define the relationship between the analyte value and the spectroscopic–chemical data for the sample, as interfering factors other than analyte concentration are affecting the instrument response.

Factors affecting the integrity of spectroscopic data include: the variations in sample chemistry, the variations in the physical condition of samples, and the variation in measurement conditions. Calibration data sets must represent several sample spaces to include: compositional space, instrument space, and measurement or experimental condition space (for example, sample handling and presentation spaces). Interpretive spectroscopy in which spectra–structure correlations are understood is a key intellectual process in approaching spectroscopic measurements if one is to achieve an understanding in the X and Y relationships of these measurements.

The Concept of Correlation

The main concept addressed in this column series is the idea of correlation. Correlation can be referred to as the apparent degree of relationship between variables. (The term apparent is used because there is no true inference of cause-and-effect when two variables are highly correlated.) One can assume that a cause-and-effect relationship exists, but this assumption cannot be validated using correlation alone as the test criteria. Correlation often has been referred to as a statistical parameter seeking to define how well a linear or other fitting function describes the relationship between variables; however, two variables might be highly correlated under a specific set of test conditions and not correlated under a different set of experimental conditions. In this case the correlation is conditional and so also is the cause-and-effect phenomenon. If two variables always are correlated perfectly under a variety of conditions one might have a basis for cause-and-effect, and such a basic relationship permits a well-defined ...

Figure 1. An illustration of the use of scatter plots for gleaning visual information with respect to the correlation between variables X (abscissa) and Y (ordinate). Panels: (a), (b), (c), (d).

... significance and meaning of correlation for specific test cases. It should be pointed out that when only two variables are compared for correlation this is referred to as simple correlation. However, when more than two variables are compared for correlation this is termed multiple correlation. In spectroscopy, correlation is used in two main ways: for calibration of the instrument response (Y) at one or more channels (as absorbance or reflectance of the sample at some wavelength or series of wavelengths) to the known analyte property (X) for that sample; ...

... visual display to assess qualitatively the potential relationship between two or more variables. Figure 1a illustrates a positive, high correlation between X and Y. Figure 1b indicates no real correlation between the variables. Figure 1c demonstrates a high, negative correlation between the variables. Figure 1d shows several phenomena in the relationship between X and Y. An initial observation indicates that there are three potential outlier samples: one above the line in the upper left corner, and two beneath the line in the lower right corner.
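
The qualitative cases described for Figure 1a through 1c can also be mimicked numerically. The following is a minimal Python sketch, not the MathCad worksheets used in this column; it assumes an ordinary least-squares fit of Y on X and the usual Pearson correlation coefficient, and it borrows the column's symbols K1 (slope), K0 (intercept), r, and R². The function name linear_fit_stats and all data values are hypothetical, invented only for illustration; they are not the Miller and Miller data of Table I.

```python
import math


def linear_fit_stats(x, y):
    """Least-squares slope (K1), intercept (K0), Pearson r, and R^2 for paired data.

    Hypothetical helper written for illustration; assumes Y is regressed on X.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must be paired and contain at least two points")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross-products about the means
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each show some variation")
    k1 = sxy / sxx                  # slope
    k0 = mean_y - k1 * mean_x       # intercept
    r = sxy / math.sqrt(sxx * syy)  # correlation coefficient
    return {"K1": k1, "K0": k0, "r": r, "R2": r * r}


# Invented data sets loosely imitating the situations in Figure 1a, 1b, and 1c.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
cases = {
    "high positive correlation": [2.1, 3.9, 6.2, 7.8, 10.0],
    "no real correlation": [5.0, 1.0, 4.0, 1.0, 5.0],
    "high negative correlation": [10.0, 7.8, 6.2, 3.9, 2.1],
}

for label, y in cases.items():
    s = linear_fit_stats(x, y)
    print(f"{label}: K1={s['K1']:.3f}  K0={s['K0']:.3f}  r={s['r']:.4f}  R2={s['R2']:.4f}")
```

In this sketch the sign of r follows the sign of the slope, and a value of r near zero corresponds to the structureless scatter of the "no real correlation" case. As the column stresses, a large r by itself says nothing about cause and effect, and a numerical summary of this kind does not replace inspecting the scatter plot for outliers such as those noted for Figure 1d.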
