
BASIC STATISTICAL CONCEPTS FOR SENSORY EVALUATION

It is important when taking a sample or designing an experiment to remember that no matter how powerful the statistics used, the inferences made from a sample are only as good as the data in that sample. . . . No amount of sophisticated statistical analysis will make good data out of bad data. There are many scientists who try to disguise badly constructed experiments by blinding their readers with a complex statistical analysis." -- O'Mahony (1986, pp. 6, 8)

INTRODUCTION

The main body of this book has been concerned with using good sensory test methods that can generate quality data in well-designed and well-executed studies. Now we turn to summarize the applications of statistics to sensory data analysis. Although statistics are a necessary part of sensory research, the sensory scientist would do well to keep in mind O'Mahony's admonishment: Statistical analysis, no matter how clever, cannot be used to save a poor experiment. The techniques of statistical analysis do, however, serve several useful purposes, mainly in the efficient summarization of data and in allowing the sensory scientist to make reasonable conclusions from the information gained in an experiment. One of the most important of these is to help rule out the effects of chance variation in producing our results. "Most people, including scientists, are more likely to be convinced by phenomena that cannot readily be explained by a chance hypothesis" (Carver, 1978, p. 587).

Statistics function in three important ways in the analysis and interpretation of sensory data. The first is the simple description of results. Data must be summarized in terms of an estimate of the most likely values to represent the raw numbers. For example, we can describe the data in terms of averages and standard deviations (a measure of the spread in the data). This is the descriptive function of statistics. The second goal is to provide evidence that our experimental treatment, such as an ingredient or processing variable, actually had an effect on the sensory properties of the product, and that any differences we observe between treatments were not simply due to chance variation. This is the inferential function of statistics that provides a kind of confidence or support for our conclusions about products and variables we are testing. The third goal is to estimate the degree of association between our experimental variables (called independent variables) and the attributes measured as our data (called the dependent variables). This estimation function of statistics can be a valuable addition to the normal sensory testing process and is sometimes overlooked. Statistics such as the correlation coefficient and chi-square can be used to estimate the strength of relationship between our variables, the size of experimental effects, and the goodness-of-fit of equations or models we generate from the data (some very important concerns).

These statistical appendices are prepared as a general guide to statistics as they are applied in sensory evaluation.
The reader need not have any familiarity with statistical concepts but does need a secondary school level of proficiency in mathematics. Statistics is neither complex nor impenetrable, even to those who have found it confusing in previous encounters. Some of our personal viewpoints are sprinkled amidst the formal discourse in order to give the user of statistics a sense of an underlying philosophy and some of the pitfalls inherent in these tools. Statistics form an important part of the equipment of the sensory scientist. Since most evaluation procedures are conducted along the lines of scientific inquiry, there is error in measurement and a need to separate those outcomes that may have arisen from chance variation from those results that are due to experimental variables (ingredients, processes, packaging, shelf-life). In addition, since the sensory scientist uses human beings as measuring instruments, there is added variability compared to other analytical procedures such as physical or chemical measurements done with instruments. This makes the conduct of sensory testing especially challenging, and makes the use of statistical methods a necessity.

The statistical sections are divided into separate topics so that readers who are familiar with some areas of statistical analysis can skip to sections of special interest. Students who desire further explanation or additional worked examples may wish to refer to O'Mahony (1986), Sensory Evaluation of Food: Statistical Methods and Procedures. Like the following sections, it is written with minimal expectations of mathematical competence. It also contains a good deal of pithy commentary on the use of statistics in sensory testing.
For the more statistically proficient sensory scientist, the books by Gacula and Singh (1984), Statistical Methods in Food and Consumer Research, and Piggott (1986), Statistical Procedures in Food Research, contain information on more complex designs and advanced topics. This statistical appendix is not meant to supplant courses in statistics, which are recommended and should probably be required for every sensory professional. Rather, it is intended as a quick reference for the most commonly used types of statistical tests. Instructors may also wish to use this as a refresher or supplemental guide for students who may have had a statistics course but who have lost some familiarity with the tests due to lack of practice or application.

It is very prudent for sensory scientists to maintain an open dialogue with statistical consultants or other statistical experts who can provide advice and support for sensory research. This advice should be sought early on and continuously throughout the experimental process, analysis, and interpretation of results. R. A. Fisher is reported to have said: "To call in the statistician after the experiment is done may be no more than asking him to perform a postmortem examination: He may be able to tell you what the experiment died of" (Fisher, Indian Statistical Congress, 1938). Obtaining advice about the alternatives for statistical analysis is often a good idea. To be fully effective, the sensory professional should use statistical consultants early in the experimental design phase and not as magicians to rescue an experiment gone wrong. Every sensory scientist should be sufficiently familiar with statistics so that the arguments and rationale from statisticians can be understood and appreciated. This does not mean that the sensory professional has to take a backseat to the statistical advisor in setting up the study. Sensory workers understand more about what can be done given the limitations on human attention and sensory function.
Often the best experimental design for a problem is not entirely workable from a practical point of view, since human testing necessarily involves fatigue, adaptation, and loss of concentration, as well as difficulty in maintaining attention and, at some point, loss of motivation. The negotiation between the sensory scientist and the statistician can yield the best practical result.

BASIC STATISTICAL CONCEPTS

Why are statistics important in sensory evaluation? The primary reason is that there is variation or error in measurement. Variation implies that a repeated observation of the same event will not necessarily be the same as the first observation of that event. Due to factors beyond our control, the numerical values we obtain as data are not the same over repeated observations. In sensory evaluation, this is obvious in the fact that different participants in a sensory test give different data. It is against this background of uncontrolled variation that we wish to tell whether the experimental variable of interest had a reliable effect on the perceptions of our panelists. These experimental variables in food and consumer product research are usually changes in ingredients, in processing conditions, or in packaging. In manufacturing, the variability may be across lots or batches, or across different production lines or manufacturing sites. For foods, an important concern is how the product changes over time. These changes can be desirable, as in cheese ripening or in the aging of meats and wines. They can also be undesirable and limit the useful lifetime of the product after processing. Hence there is a need for shelf-life or stability testing where time and conditions of storage are critical variables.

The existence of error implies that there is also some true state of the world and some average or best estimate of a representative value for the product. Because our measurements are imperfect, they vary. Imperfect measurement implies lack of precision or repeatability. Examples of measurements prone to variation are easy to find in sensory testing. Consumers differ in how much they like or prefer one product over another. Even well-trained panelists will give different intensity ratings to a flavor or texture characteristic.
In repeating a triangle test or other discrimination procedure, we would not expect exactly the same number of people to give the right and wrong answer on a subsequent test. However, we need to use the outcome of a sensory test to guide business decisions about products and to determine if research results are likely due to chance or to the experimental variables we have manipulated. Unfortunately, the error in our measurements introduces an element of risk in making decisions. Statistics are never completely foolproof or airtight. Decisions even under the best conditions of experimentation always run the risk of being wrong. However, statistical methods help us to minimize, control, and estimate that risk. Here are hypothetical examples of risky decisions:

1. Lowering the sweetener level in a cola by 5% to save on costs, because a test showed that 9 of 10 people saw no difference

2. Lowering the level of fragrance in an air freshener by 50% to produce a lighter fragranced product, because a test showed that 9 of 10 people smelled a difference

Intuition tells us that in both of these cases, the number of observations was probably too small to safely generalize to the population of interest. That is, we probably can't infer from these results that the bulk of consumers are likely to behave in the same ways. We could make decisions based on these results, but there is still a high degree of uncertainty about the generality of the results and, therefore, risk.

The methods of statistics give us rules to estimate and minimize the risk in decisions when we generalize from a sample (an experiment or test) to the greater population of interest. They are based on consideration of three factors: the actual measured values, the error or variation around the values, and the number of observations that are made (sometimes referred to as "sample size"). From the examples given above, most people would conclude that the decisions were risky since only 10 people participated in the tests. Yet the actual level of risk and the possibility of making erroneous conclusions are not only determined by this factor but also by the levels of error in measurement, sources of variation in the experimental setup, and the actual numbers returned in the data. The interplay of these three factors forms the basis for statistical calculations in all of the major statistical tests used with sensory data, including t-tests on means, analysis of variance and F-ratios, and comparisons of proportions or frequency counts. In the case of the t-test on means, the factors are (1) the actual difference between the means, (2) the variation or error inherent in the experimental measurement, and (3) the sample size or number of observations we made.

So variation in measurement is a fact of life in sensory research. What are some of the sources of error variation in sensory measurement? There are differences among people due to their physiological makeup.
People have different sensitivities and responsiveness to sensory stimulation, as shown in conditions like smell blindness, called specific anosmias (discussed in Chapter 2). Physical differences can arise from genetic factors, differences in anatomy and physiological functioning, a person's state of health or medication, and so on. People are also notoriously variable in their likes and dislikes. Even within a person, there are differences in sensory function from moment to moment. The sensory nerves have spontaneous background levels of intrinsic neural activity. This background activity varies over time. Our responsiveness is determined by our state of adaptation to background levels of the energy or stimuli that are ambient in the situation. For example, the level of sodium in our saliva in part determines how we perceive the salt in foods. This salivary sodium is not constant; it changes as a function of salivary flow rate, our level of arousal, and other factors.

Our test samples themselves may not be completely identical. Consider any sample of meat to be used in a sensory test. Even samples cut from the same muscle of the same animal may differ in connective tissue and fat content. It is not always possible to achieve perfect control of portion size, temperature, time since cooking, and many other variables, especially in off-site consumer testing. People will differ in their interpretations of the words that denote the characteristics we asked them to judge. They may not use scales in the same way, being more or less timid about using the endpoints on rating scales, or they may have different conservative biases towards staying in the middle of the scale. Although variation can be minimized in sensory testing when it is conducted with trained panels under controlled conditions, some variability in the data is inevitable. How can we characterize this variability?
Variation in the data produces a distribution of the values across the available measurement points. These distributions can be represented graphically as histograms. A histogram is a type of graphical representation, a picture of the frequency counts of how many times each measurement point is represented in our data set. A distribution then is merely a count of the frequency (number of times) a given value occurs in our collected observations. We often graph these data in a bar graph, the most common kind of histogram.

[Figure AI.1 (panel A, experimental sample): A histogram showing a sample distribution of data from a panel's ratings of the perceived intensity of a sensory characteristic on a 15-point category scale.]

Examples of distributions include sensory thresholds among a population, different ratings by subjects on a sensory panel (as in Figure AI.1), or judgments of product liking on a 9-point scale across a sample of consumers. In doing our experiment, we assume that our measurements are more or less representative of the entire population of people or samples that could ever be measured or who might try our product. Thus, the experimental measurements are referred to as a sample, and the underlying or parent group as a population. The distribution of our data bears some resemblance to the parent population, but it may differ due to the variability in the experiment and error in our measuring. The laws of statistics can tell us how much our sample distribution is likely to reflect the true or underlying population from which our sample is drawn.

Data Description

How do we describe our measurements? Consider a sample distribution, as pictured in the histogram of Figure AI.1. These measurements can be characterized and summarized in a few parameters. There are two important aspects we can use for the summary. First, what is the best single estimate of our measurement? Second, what was the variation around this value?

Description of the best or most likely single value involves measures of central tendency. Three are commonly used. The mean is commonly called an average, and is the sum of all data values divided by the number of observations. This is a good representation of the central value of the data for distributions that are symmetric, i.e., not too heavily weighted in high or low values but evenly dispersed over the high and low ends. Another common measure is the median, or 50th percentile, the middle value when the data are ranked. The median is a good representation of the central value even when the data are not symmetrically distributed. When there are some extreme values at the high end, for example, the mean will be unduly influenced by the higher values (they pull the average up). The median is simply the middle value after the measurements are rank ordered from lowest to highest, or the average of the two middle values when there are an even number of data points. For some types of data, we need to know the mode, which is the most frequent value. This is appropriate when our data are only separated into name-based categories. For example, we could ask for the modal response to the question: When is the product consumed (breakfast, lunch, dinner, or snack)? So a list of items or responses with no particular ordering to the categories can be summarized by the most frequent response. Often our measurements will be distributed around a central value that is the most common.
The most typical situation with ordered numerical scales of measurement is to obtain a symmetric or bell-shaped curve, or at least a triangular histogram in which the central value is also the most frequent. Extreme values occur less frequently. In this case, the mean, median, and mode will be at about the same value.

The second way to describe our data is to look at the variability or spread in our observations. This is usually achieved with a measure called the standard deviation. This specifies the degree to which our measures are dispersed about the central value. The standard deviation is derived using the following reasoning: First, we ask, how does each data point deviate from the mean? This entails a process of subtraction. Next, we need to average all these differences to get a single measure of the overall tendency of scores to be "off" from the mean. However, since some are positive and some are negative, we can't just add them up, or the positives and negatives will cancel. So we square them first before adding (and take a square root later). Remember, we are not measuring a whole population but rather doing an experiment that "samples" observations from all of those that are possible about a set of products or a set of people. The standard deviation of such an experimental group of data has the following form:

Standard deviation of a sample (S):

    S = sqrt[ sum(X - M)^2 / (N - 1) ]                    (AI.1)

where M = mean of the X scores = (sum X) / N, and sum X = the sum of the scores

Computationally, this is more easily calculated as

    S = sqrt[ (sum X^2 - (sum X)^2 / N) / (N - 1) ]       (AI.2)

where sum X^2 = the sum of all the squared scores, and (sum X)^2 = the square of the sum of the scores.

A good exercise for students is to derive the equivalence of these two formulas, AI.1 and AI.2. Since the experiment or sample is only a small representation of a much larger population, there is a tendency to underestimate the true degree of variation that is present. To counteract this potential bias, the value of N - 1 is used in the denominator, forming what is called an "unbiased estimate" of the standard deviation. In some statistical procedures, we don't use the standard deviation but its squared value. This is called the sample variance, S^2 in this notation.

Another useful measure of the variability in the data is the coefficient of variation. This weights the standard deviation for the size of the mean and can be a good way to compare the variation from different methods, scales, experiments, or situations. In essence, the measure becomes dimensionless, or a pure measure of the percent of variation in our data. The coefficient of variation (CV) is expressed as a percentage in the following formula:

    CV (%) = (S / M)(100)                                 (AI.3)

where S = the sample standard deviation and M = the mean value.

For some scaling methods such as magnitude estimation, variability tends to increase with increasing mean values, so the standard deviation by itself may not say much about the amount of error in the measurement. The error changes with the level of the mean. The coefficient of variation, on the other hand, is a relative measure of error that takes into account the intensity value along the scale of measurement. The example below shows the calculations of the mean, median, mode, standard deviation, and coefficient of variation for the data in the histogram of Figure AI.1. The data are shown in Table AI.1.

Table AI.1. Data from Figure AI.1 (rank ordered)

     2     5     7     9
     3     5     7     9
     3     6     7     9
     4     6     8     9
     4     6     8    10
     4     6     8    10
     4     6     8    10
     5     6     8    11
     5     6     8    11
     5     7     9    12    13

N = 41

Mean of the scores = (sum X) / N = (2 + 3 + 3 + 4 + ... + 11 + 12 + 13) / 41 = 7.049

Median = middle score = 7

Mode = most frequent score = 6

Standard deviation:

    S = sqrt[ (sum X^2 - (sum X)^2 / N) / (N - 1) ] = sqrt[ (2,303 - 83,521/41) / 40 ] = 2.578

CV (%) = 100 (S / mean) = 100 (2.578 / 7.049) = 36.6%
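For readers who want to check this arithmetic by machine, the following short sketch (Python, standard library only; the variable names are ours, not from the text) recomputes the summary statistics for the 41 ratings in Table AI.1. It applies both the definitional formula (AI.1) and the computational shortcut (AI.2), which also serves as a numerical check of their equivalence:

```python
import math
from statistics import median, mode

# The 41 ratings from Table AI.1, listed in rank order
ratings = [2, 3, 3, 4, 4, 4, 4, 5, 5, 5,
           5, 5, 6, 6, 6, 6, 6, 6, 6, 7,
           7, 7, 7, 8, 8, 8, 8, 8, 8, 9,
           9, 9, 9, 9, 10, 10, 10, 11, 11, 12,
           13]

n = len(ratings)                      # N = 41
mean = sum(ratings) / n               # (sum X) / N

# Definitional formula (AI.1): S = sqrt( sum((X - M)^2) / (N - 1) )
s_def = math.sqrt(sum((x - mean) ** 2 for x in ratings) / (n - 1))

# Computational formula (AI.2): S = sqrt( (sum X^2 - (sum X)^2 / N) / (N - 1) )
s_comp = math.sqrt((sum(x * x for x in ratings)
                    - sum(ratings) ** 2 / n) / (n - 1))

cv = 100 * s_comp / mean              # coefficient of variation (AI.3), in percent

print(round(mean, 3), median(ratings), mode(ratings),
      round(s_comp, 3), round(cv, 1))  # -> 7.049 7 6 2.578 36.6
```

Both formulas return the same value of S, and the script reproduces the mean of 7.049, median of 7, mode of 6, S of 2.578, and CV of 36.6% calculated above.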

Population Statistics

In making decisions about our data, we would like to infer from our experiment what might happen in the population as a whole. That is, we would like our results from a subsample of the population to apply equally well when projected to other people or other products. By population, we don't necessarily mean the population of the nation or the world. We use this term to mean the group of people (or sometimes products) from which we drew our experimental panel (or samples) and the group to which we would like to apply our conclusions from the study. The laws of statistics tell us how well we can generalize from our experiment (or sensory test) to the rest of the population of interest.

Since people are variable, populations can be described by the same sort of statistics as the sample statistics in the last section. For example, we can make guesses about the mean and the standard deviation, based on our samples or experiments. The population means and standard deviations are usually denoted by Greek letters, as opposed to Roman letters for sample-based statistics. Many things we measure about a group of people will be normally distributed. That means the values form a bell-shaped curve described by an equation usually attributed to Gauss. The bell curve is symmetric around a mean value: Values are more likely to be close to the mean than far from it. The curve is described by the parameters of its mean and its standard deviation, as shown in Figure AI.2. The standard deviation of a population, sigma, is similar to our formula for the sample standard deviation and is given by

    sigma = sqrt[ sum(X - mu)^2 / N ]                     (AI.4)

where X = each score (value for each person or product), mu = the population mean, and N = the number of items in the population.

How does the standard deviation relate to the normal distribution? This is an important relationship, which forms the basis of statistical risk estimation and inferences from samples to populations. Since we know the exact shape of the normal distribution (given by its equation), standard deviations describe known percentages of observations at certain degrees of difference from the mean. In other words, proportions of observations correspond to areas under the curve. Furthermore, any score, X, can be described in terms of a Z-value, which states how far the score is from the mean in standard deviation units. Thus,

    Z = (X - mu) / sigma                                  (AI.5)

[Figure AI.2: The normal distribution curve is described by the parameters of its mean and its standard deviation; the Z-axis marks off equal standard deviation units, and areas under the curve mark off discrete and known percentages of observations. Important properties: (1) areas under the curve correspond to proportions of the population; (2) each standard deviation subsumes a known proportion of the area, with frequency given by (1 / (sigma sqrt(2 pi))) e^(-(X - mu)^2 / (2 sigma^2)); (3) since proportions are related to probabilities, we know how likely or unlikely certain values are going to be. Extreme scores (away from the mean) are rare or improbable.]

Z-scores represent differences from the mean value, but they are also related to areas under the normal curve. When we define the standard deviation as one unit, the Z-score also is related to the area under the curve, to the left or right of its value, expressed as a percentage of the total area. In this case, the Z-score becomes a useful value to know when we want to see how likely a certain observation would be and when we make certain assumptions about what the population may be like. Given assumptions about the shape of the curve, and knowing its mean and standard deviation, we can tell what percentage of observations will lie a given distance (Z-score) from the mean. This Z-value is useful when we want to state how far a score is from the mean and how likely or unlikely it is. Since the frequency distribution actually tells us how many times we expect different values to occur, we can convert this Z-score to a probability value (sometimes called a p-value) representing the area under the curve to the left or right of the Z-value. In statistical testing, where we look for the rarity of a calculated event, we are usually examining the "tail" of the distribution, or the smaller area that represents the probability of values more extreme than the Z-score.

Another way to think about this is that, given a certain amount of variability in our situation, we can state how likely or unlikely an observation will be, knowing its position relative to the mean. The position is the distance from the mean in standard deviation units, hence the sigma in the denominator of Eq. (AI.5). When we use distributions like the normal curve, we know that some statistics we calculate in an experiment will be distributed like Z. This is a very powerful connection, since it lets us see how likely our experimental outcome would be when chance or pure error variation alone is operating.
When we see events that are very unlikely, we conclude that chance alone cannot be blamed for our result and that something else must be the cause, such as the variables we are testing in our experiment. This is the logic of testing null and alternative hypotheses, discussed below. This probability value represents the area under the curve outside our given Z-score, and is the chance (expected frequency) with which we would see a score of that magnitude or one that is even greater. The p-value is found from the area under the curve outside of the Z-score (in the "tail" of the distribution), and tables converting Z-values to p-values are commonly found in statistics texts. An example is Table A, Appendix VI.
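Instead of a printed table, the same Z-to-p conversion can be computed directly. The sketch below (Python; illustrative, not part of the original text) uses the standard library's complementary error function, since the area under the standard normal curve to the right of Z equals erfc(Z / sqrt(2)) / 2:

```python
import math

def z_score(x, mu, sigma):
    """Distance of a score from the mean, in standard deviation units (Eq. AI.5)."""
    return (x - mu) / sigma

def upper_tail_p(z):
    """Area under the standard normal curve to the right of z,
    i.e., the one-tailed p-value for that Z-score."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# Half the area lies above the mean (Z = 0), and the familiar cutoff
# Z = 1.96 leaves 2.5% in the upper tail:
print(round(upper_tail_p(0.0), 2))   # -> 0.5
print(round(upper_tail_p(1.96), 3))  # -> 0.025
```

A two-tailed p-value, appropriate when extreme scores in either direction count as evidence, is simply twice the one-tailed area.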

HYPOTHESIS TESTING AND STATISTICAL INFERENCE

Statistical inference has to do with how we draw conclusions about what populations are like based on samples of data from experiments. This is the logic that is used to determine whether our experimental variables had a real effect or whether our results were likely to be due to chance or unexplained random variation. Before we move on to this notion of statistical decision making, a simpler example of inferences about populations, namely confidence intervals, will be illustrated.

One example of inference is in the estimation of where the true population values are likely to occur based on our sample. In addition to examining the probability of occurrence of individual scores, we can also examine the certainty with which our sample estimates will fall inside a range of values on the scale of measurement. For example, we might want to know the following information: Given the sample mean and standard deviation, within what interval is the true or population value likely to occur? For small samples, we use the t-statistic to help us. The t-statistic is like Z, but it describes the distribution of small experiments better than the Z-statistic that governs large populations. Since most experiments are much smaller than populations, and sometimes are a very small sample indeed, the t-statistic is useful for much sensory evaluation work. Often we use the 95% confidence interval to describe where the value of the mean is expected to fall 95% of the time, given the information in our sample or experiment. For a mean value M of N observations, the 95% confidence interval is given by

    M plus-or-minus t (S / sqrt(N))                       (AI.6)

where t is the t-value corresponding to N - 1 degrees of freedom (explained below) that includes 2.5% of expected variation in the upper tail outside this value and 2.5% in the lower tail (hence a two-tailed value, also explained below). Suppose we obtain a mean value of 5.0 on a 9-point scale, with a standard deviation of 1.0 in our sample, and there are 15 observations. The t-value for this experiment is based on 14 or N - 1 degrees of freedom and is shown in Table B, Appendix VI, to be 2.145. So our best guess is that the true mean lies in the range of 5 plus-or-minus 2.145 (1/sqrt(15)), or between 4.45 and 5.55. This could be useful, for example, if we wanted to ensure that our product had a mean score of at least 4.0 on this scale. We would be fairly confident, given the sample values from our experiment, that it would in fact exceed this value.

For continuous and normally distributed data, we can similarly estimate a 95% confidence interval on the median (Smith, 1988). The interval for the median, Xmed, is given by

    Xmed - 1.253 t (S / sqrt(N))  to  Xmed + 1.253 t (S / sqrt(N))        (AI.7)
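These interval formulas are easy to turn into code. The sketch below (Python; the function names are ours, not from the text) reproduces the worked example above for the mean, and applies the 1.253 widening factor of Eq. AI.7 for a median:

```python
import math

def mean_ci(mean, s, n, t):
    """Confidence interval on a mean (Eq. AI.6): mean +/- t * s / sqrt(n),
    where t is the two-tailed critical value for n - 1 degrees of freedom."""
    half = t * s / math.sqrt(n)
    return mean - half, mean + half

def median_ci(med, s, n, t):
    """Confidence interval on a median (Eq. AI.7): the same half-width,
    widened by the factor 1.253."""
    half = 1.253 * t * s / math.sqrt(n)
    return med - half, med + half

# Worked example from the text: mean 5.0, S = 1.0, N = 15, t(14 df) = 2.145
lo, hi = mean_ci(5.0, 1.0, 15, 2.145)
print(round(lo, 2), round(hi, 2))  # -> 4.45 5.55
```

For the same data, the median interval runs from about 4.31 to 5.69, reflecting the fact that a sample median is a noisier estimate than a sample mean.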

For larger samples, say N > 50, we can replace the t-value with its Z-approximation, using Z = 1.96 in these formulas for the 95% confidence interval.

A great deal of product experimentation involves comparisons, not just the estimation of single population values. In most sensory studies, there is a comparison of two or more samples, treatments, groups of people, or other levels of some independent variable. In research or experimentation, questions about comparing treatments or levels of an independent variable are rephrased as questions about distributions. For example, we may have changed an ingredient or a process in our product. Let's call that a "treatment," since many statistics books talk in terms of treatments rather than variables. We collect some information about our two products, such as sensory scores on some scale. We then would like to know whether the mean values from the two products are likely to be the same or different in the greater population. Statistics can help us make a good or safe conclusion from our data. Note that our comparison of means is usually phrased as "same" or "different." Another way to look at this is to ask whether the groups from which our experimental values were drawn likely came from the same parent population or from two different ones. This is usually phrased in terms of means but could be in terms of standard deviations, too.

To help us answer these questions, we take our data description tools and first calculate means and standard deviations. From these values, we do some further calculations to come up with values called test statistics. These statistics, like the Z-score mentioned above, have known distributions, so we can tell how likely or unlikely the observations will be when chance variation alone is operating. When chance variation alone seems very unlikely (usually 1 chance in 20 or less), then we reject this notion and conclude that our observations must be due to our actual experimental treatment.
This is the logic of statistical hypothesis testing. It's that simple.

One such statistic, very useful for small experiments, is called Student's t-statistic. Student was the pseudonym of the original publisher of this statistic, a man named Gosset who worked for the Guinness Brewery and did not want other breweries to know that Guinness was using statistical methods (O'Mahony, 1986). The t-statistic is useful for comparing means when small experiments have been done, as occurs in a lot of sensory research. By small experiments, we mean experiments with numbers of observations per variable in the range of about 50 or less. Conceptually, the t-statistic is the difference between the means divided by an estimate of the error or uncertainty around those means, called the standard error of the means. This is very much like a Z-statistic, but it is adjusted for the fact that we have a small sample, not a big population. In fact, the t-distribution looks a lot like a bell curve, only flatter. This makes sense because we stand a proportionally larger chance of seeing more extreme values when we only take a small sample of a population.

Imagine that we did our experiment many times and each time calculated a mean value. These means themselves, then, could be plotted in a histogram and would have a distribution of values. The standard error of the mean is like the standard deviation of this distribution of means. If you had lots of time and money, you could repeat the experiment over and over and estimate the population values from looking at the distribution of sample mean scores. However, we don't usually do such a series of experiments, so we need a way to estimate this error. Fortunately, the error in our single experiment gives us a hint of how well our obtained mean is likely to reflect the population mean. That is, we can estimate where the limits of confidence lie around the mean value we obtained.
The laws of statistics tell us that the standard error of the mean is simply the sample standard deviation divided by the square root of the number of observations (N). This makes sense in that the more observations we make, the more likely it is that our obtained mean actually lies close to the true population mean. In a way, this is a tremendous shortcut. The experiment does not have to be repeated endless times; rather, we can estimate with known confidence the region within which the true value must lie, from the data in a single experiment. The larger the experiment in terms of observations, the narrower the region of certainty within which the true mean can be pinpointed.

So in order to test whether the mean we see in our experiment is different from some other value, there are three things we need to know: the mean itself, the sample standard deviation, and the number of observations. An example of this form of the t-test is given below, but first we need to take a closer look at the logic of statistical testing.

The logical process of statistical inference is similar for the t-tests and all other statistical tests. The only difference is that the t-statistic is computed for testing differences between two means, while other statistics are used to test for differences among other values, like proportions or standard deviations. In the t-test, we first assume that there is no difference between population means. Another way to think about this is that it implies that the experimental means were drawn from the same parent population. This is called the null hypothesis. Next, we look at our t-value calculated in the experiment and ask how likely this value would be, given our assumption of no difference. Because we know the shape of the t-distribution, just as with a Z-score, we can tell how far out in the tail our calculated t-statistic lies.
From the area under the curve out in that tail, we can tell what percent of the time we could expect to see this value. If the t-value we calculate is very high and positive or very low and negative, it is unlikely, a rare event given our assumption. This happens when there is a big difference between the mean values, relative to the error in estimating those means. Remember, we have based this on the assumption that the population means are not different, so we would expect the samples to be very different only on rare occasions when chance alone is operating. If this rarity passes some arbitrary cutoff point, usually 1 chance in 20 or less, we conclude that our initial assumption was probably wrong. Then we conclude that the population means are in fact different, or that the sample means were drawn from different parent populations. In practical terms, this usually implies that our treatment variable (ingredients, processing, packaging, shelf-life) did produce a different sensory effect from some comparison level. We conclude that the difference was not likely to happen from chance variation alone.

This is the logic of null hypothesis testing. It is designed to keep us from concluding that the experiment had an effect when there really was only a difference due to chance. Furthermore, it limits our likelihood of making this mistake to a maximum value of 1 chance in 20 in the long run (when certain conditions are met; see the postscript at the end of this chapter).

Here is a worked example of a simple t-test. We do an experiment with the following scale, rating a new ingredient formulation against a control for overall sweetness level:

[ ]    [ ]    [ ]    [ ]    [ ]    [ ]    [ ]
much less sweet        about the same        much more sweet

We convert the box ratings to scores 1 (for the leftmost box) through 7 (for the rightmost). The data from 10 panelists look as follows:

Panelist    Rating        Panelist    Rating
1           5             6           7
2           5             7           5
3           6             8           5
4           4             9           6
5           3             10          4

We now set up our null hypothesis and an alternative hypothesis different from the null. A common notation is to let the symbol H0 stand for the null hypothesis and Ha stand for the alternative. Several different alternatives are possible, so it takes some careful thought as to which one to choose. This is discussed further below. The null hypothesis is also stated as an equation concerning the population value, not our sample, as follows:

H0: μ = 4.0. This is the null hypothesis.

Ha: μ ≠ 4.0. This is the alternative hypothesis.

Note that the Greek letter "mu" (μ) is used, since these are statements about population means, not sample means from our data. Also note that the alternative hypothesis is nondirectional, since the population mean could be higher or lower than our null hypothesis value of 4.0. So the actual t-value after our calculations might be positive or negative. This is called a two-tailed test. If we were only interested in an alternative hypothesis (Ha) with a "greater than" or "less than" prediction, the test would be one-tailed, as we would only examine one end of the t-distribution when checking for the significance of the result. Here are the calculations from the data set above:

ΣX = 50     ΣX² = 262     (ΣX)² = 2,500

Mean = ΣX/N = 50/10 = 5.0

s = √[(262 − 2,500/10)/9] = √(12/9) = 1.155

t = (5.0 − 4.0) / (1.155/√10) = 1.0/0.365 = 2.74
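These hand calculations can be checked with a few lines of code (a sketch using only the Python standard library; not part of the original text):

```python
import math

ratings = [5, 5, 6, 4, 3, 7, 5, 5, 6, 4]   # the 10 panelists' scores
n = len(ratings)
mean = sum(ratings) / n                     # 5.0
# sample standard deviation, with N - 1 in the denominator
s = math.sqrt(sum((x - mean) ** 2 for x in ratings) / (n - 1))
# t-test against the scale midpoint of 4.0 ("about the same")
t = (mean - 4.0) / (s / math.sqrt(n))
```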

So our obtained t-value for this experiment is 2.74. Next we need to know whether this value is larger than what we would expect by chance alone less than 5% of the time. Statistical tables for the t-distribution tell us that for a sample size of 10 people, we expect a t-value beyond ±2.262 only 5% of the time. The two-tailed test looks at both high and low tails and adds them together, since the test is nondirectional, with t high or low. So the critical value of +2.262 cuts off 2.5% of the total area under the t-distribution in the upper half, and −2.262 cuts off 2.5% in the lower half. Any values higher than 2.262 or lower than −2.262 would be expected less than 5% of the time. In statistical talk, we say that the probability of our obtained result is less than 0.05, since 2.74 > 2.262. In other words, we obtained a t-value from our data that is even more extreme than the cutoff value of 2.262.

So far all of this is some simple math and a cross-referencing of the obtained t-value to what is predicted from the tabled t-values under the null hypothesis. The next step is the inferential leap of statistical decision making. Since the obtained t-value was bigger in magnitude than the critical t-value, H0 is rejected and the alternative hypothesis is accepted. In other words, our population mean is likely to be different from the middle-of-the-scale value of 4.0. We don't actually know how likely this is, but we know that the experiment would produce the sort of result we see only about 5% of the time when the null is true. So we infer that it is probably false. Looking back at the data, this does not seem too unreasonable, since 7 of the 10 panelists scored higher than the null hypothesis value of 4.0. When we reject the null hypothesis, we claim that there is a statistically significant result. The use of the term "significance" is unfortunate, for in simple everyday English it means "important."
In statistical terms, "significance" only implies that a decision has been made, but it does not tell us whether the result was important or not. The steps in this chain of reasoning, along with some decisions made early in the process about the alpha level and power of the test, are shown in Figure AI.3.

Before going ahead, there are some important concepts in this process of statistical testing that need further explanation. The first is degrees of freedom. When we look up the critical values for a statistic, the values are tabled not in terms of how many observations were in our sample but in terms of how many degrees of freedom we have. Degrees of freedom have to do with how many parameters we are estimating from our data relative to the number of observations. In essence, this notion asks how much the resulting values would be free to move, given the constraints we have from estimating other statistics. For example, when we estimate a mean, we have freedom for that value to move or change until the last data point is collected. Another way to think about this is the following: If we knew all but one data point and already knew the mean, we would not need that last data point. It would be determined by all the other data points and the mean itself, so it has no freedom to change. We could calculate what it would have to be. As another example, how much do we know about our sample variance? If we had four observations, and knew three of them and our sample variance, we could calculate what the value of the fourth, unknown data point would be. In general, degrees of freedom are equal to the sample size minus one for each parameter we are estimating. Most statistics are tabled by their degrees of freedom. If we wanted to compare two groups of N1 and N2 observations, we would have to calculate some parameters like means and standard deviations for each group.
So the total number of degrees of freedom is N1 − 1 + N2 − 1, or N1 + N2 − 2.

A second important consideration is whether our statistical test is a one- or a two-tailed test. Do we wish to test whether the mean is simply different from some value, or whether it is larger or smaller than some value? If the question is simply "different from," then we need to examine the probability that our test statistic will fall into either the low or high tail of its distribution. As stated above in the example of the simple t-test, if the question is directional, e.g., "greater than" some value, then we examine only one tail. Most statistical tables have entries for one- and two-tailed tests. It is important, however, to think carefully about our underlying theoretical question. The choice of statistical alternative hypotheses is related to the research hypothesis. In some sensory tests, like paired preference, we don't have any way of predicting which way the preference will go, and so the statistical test is two-tailed. This is in contrast to some discrimination tests like the triangle procedure. In these tests, we don't expect performance below chance unless there is something very wrong with the experiment, so the alternative hypothesis is one-sided.

Formulate null hypothesis and alternative hypothesis

Choose alpha level for Type I error

Choose sample size (calculate beta risk or power if needed)

CONDUCT EXPERIMENT - GATHER DATA

Calculate summary statistics and statistics for null hypothesis test

Compare statistic to critical values or to probability levels re: alpha

Reject null hypothesis, or fail to reject / accept (depending upon power)

Draw conclusions, interpret, and make recommendations

Figure AI.3. Steps in statistical decision making in an experiment. The items before the collection of the data concern the experimental design and statistical conventions to be used in the study. After the data are gathered, the inferential process begins, first with data description, then computation of the test statistic, and then comparison of the test statistic to the critical value for our predetermined alpha level and the size of the experiment. If the computed test statistic is greater in magnitude than the critical value, we reject the null hypothesis in favor of the alternative hypothesis. If the computed test statistic has a value smaller in magnitude than the critical value, we can make two choices. We can reserve judgment if the sample size is small, or we can accept the null hypothesis if we are sure that the power and sensitivity of the test are high. A test of good power is determined in part by having a substantial number of observations, and test sensitivity is determined by having good experimental procedures and controls (see Appendix V).

In a discrimination test, then, the alternative hypothesis is that the true proportion correct is greater than chance. This alternative looks in only one direction, and the test is therefore one-tailed.

A third important statistical concept to keep in mind is the type of distribution. So far there are three different kinds of distributions we have discussed, and it is easy to become confused about them. First, there are overall population distributions. They tell us what the whole world would look like if we measured all possible values. This is usually not known, but we can make inferences about it from our experiments. Second, we have sample distributions derived from our actual data. What does our sample look like? The data distribution can be pictured in a graph such as a histogram. Third, there are distributions of test statistics.
If the null hypothesis is true, how is the test statistic distributed over many experiments? How will the test statistic be affected by samples of different sizes? What values would be expected? What variance would be due to chance alone? It is against these expected values that we examine our calculated value and get some idea of its probability.

With the realization that statistical decisions are based on probabilities, it is clear that some uncertainty is involved. Our test statistic may only happen 5% of the time under a true null hypothesis, but the null might still be true, even though we rejected it. So there is a chance that our decision was a mistake and that we made an error. It is also possible that we sometimes fail to reject the null when a true difference exists. These two kinds of mistakes are called Type I and Type II errors.

A Type I error is committed when we reject the null hypothesis when it is actually true. In terms of a t-test comparison of means, the Type I error implies that we concluded that two population means are different when they are in fact the same; i.e., our data were in fact sampled from the same parent population. In other words, our treatment did not have an effect, but we mistakenly concluded that it did. The process of statistical testing is valuable, though, because it protects us from committing this kind of error and going down blind alleys in terms of future research decisions, by limiting the proportion of times we could make these decisions. The upper limit on how often these bad decisions could be made is the p-value we choose for our statistical cutoff, usually 5%. This upper limit on the risk of Type I error (over the long term) is called alpha-risk. As shown in Table AI.2, there are two kinds of errors we can make. The other kind of error occurs when we miss a difference that is real.
This is called a Type II error, formally defined as a failure to reject the null hypothesis when the alternative hypothesis is actually true. Failing to detect a difference in a t-test or, more generally, failing to observe that an experimental treatment had an effect can have important or even devastating business implications. Failing to note that a revised manufacturing process

Table AI.2. Errors in Statistical Decision Making

                            Outcome of Evaluation
True State                  Difference Reported               No Difference Reported
A difference exists         "hit rate"                        Type II error, "missing the effect"
                                                              (probability is beta-risk)
No difference exists        Type I error, "false alarm"       "correct rejection"
                            (probability is alpha-risk)

was in fact an improvement would lose the potential benefit if the revision was not adopted as a new standard procedure. Similarly, revised ingredients might be passed over when they in fact produce improvements in the product as perceived by consumers. Alternatively, bad ingredients might be accepted for use if the modified product's flaws are undetected. It is necessary to have a sensitive enough test to protect against this kind of error. The long-term risk or probability of making this kind of mistake is called beta-risk, and one minus the beta-risk is defined as the statistical power of the test. The protection against Type II error by statistical means and by experimental strategy is discussed in Appendix V.

VARIATIONS OF THE T-TEST

A very common and useful test in sensory evaluation is to examine whether the means from two small groups of observations (e.g., two products or two different panels) are statistically different or whether we can conclude that they are about the same. This is tested using the t-statistic, which is distributed like a Z-score, only a little bit wider and flatter. We compare the t-values we obtain to tabled values for probabilities less than our accepted level of risk, usually with a probability of .05.

There are three kinds of t-tests commonly used. The first is a test of an experimental mean against a fixed value, like a population mean or a specific point on a scale, such as the middle of a just-right scale, as in the example above. The second test applies when observations are paired, for example when each panelist evaluates two products; the scores are associated, since each pair comes from a single person. This is called the paired t-test or dependent t-test. The third type of t-test is performed when different groups of panelists evaluate the two products. This is called the independent groups t-test. The formulas for each test are similar, in that they take the general form of a difference between means divided by a standard error. However, the actual computations are a bit different. The section below gives examples of the three comparisons of means involving the t-statistic.

The first type of t-test is the test against a population mean or another fixed value. In this case t is equal to

t = (mean of sample − population mean) / (standard deviation of sample / √N)     (AI.8)

The calculations for this test are the same as in the example shown above, for the just-right scale for sweetness.

The second kind of t-test is the test of paired observations, also called the dependent t-test. To calculate this value of t, we first have to arrange the pairs of observations in two columns and subtract one from the other to create a difference score for each pair. The difference scores then become the numbers used in further calculations. The null hypothesis is that the mean of the difference scores is zero. This makes sense, since if there were no systematic difference between the two products, then some difference scores would be positive and some negative, canceling each other when summed and leaving a mean difference of zero. We also need to calculate a standard deviation of these difference scores, and a standard error by dividing this standard deviation by the square root of N, the number of panelists.

t = (mean of difference scores) / (standard deviation of difference scores / √N)     (AI.9)

Here is an example of a t-test where each panelist tasted both products, so we can perform a paired t-test. Products were rated on a 25-point scale for acceptance. Note that we compute a difference score (D) in this situation.

Panelist    Product A    Product B    Difference, D    D²
1           20           22            2                4
2           18           19            1                1
3           19           17           −2                4
4           22           18           −4               16
5           17           21            4               16
6           20           23            3                9
7           19           19            0                0
8           16           20            4               16
9           21           22            1                1
10          19           20            1                1

Calculations:

ΣD = 10, mean of D = 10/10 = 1
ΣD² = 68

Standard deviation of D = √{[ΣD² − (ΣD)²/N] / (N − 1)} = √[(68 − 100/10)/9] = √(58/9) = 2.538

Standard error of D = 2.538/√10 = 0.803

and finally,

t = (mean of D) / (standard error of D) = 1/0.803 = 1.24
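The same paired calculation can be expressed in a few lines of code (a standard-library Python sketch, not from the original text):

```python
import math

product_a = [20, 18, 19, 22, 17, 20, 19, 16, 21, 19]
product_b = [22, 19, 17, 18, 21, 23, 19, 20, 22, 20]

# difference scores, one per panelist
d = [b - a for a, b in zip(product_a, product_b)]
n = len(d)
mean_d = sum(d) / n
# standard deviation of the difference scores (N - 1 denominator)
sd_d = math.sqrt((sum(x * x for x in d) - sum(d) ** 2 / n) / (n - 1))
t = mean_d / (sd_d / math.sqrt(n))
```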

This value does not exceed the tabled value for the 5%, two-tailed limit on t, and so we conclude there is insufficient evidence for a difference. In other words, we do not reject the null hypothesis. The two samples were rather close, compared to the level of error among panelists.

The third type of t-test is conducted when there are different groups of people; it is sometimes called the independent groups t-test. Sometimes the experimental constraints dictate a situation where we have two groups that each evaluate only one product. Then a different formula for the t-test applies. Now the data are no longer paired, and a different calculation is needed to estimate the standard error, since two groups were involved and they have to be combined somehow to get a common estimate of the standard deviation. We also have different degrees of freedom, now given by the sum of the two group sizes minus 2, or (N1 + N2 − 2).

t = (mean of sample 1 − mean of sample 2) / (pooled standard error)

For the independent t-test, the pooled error requires some computational work. It provides an estimate of the error, combining the error levels of the two groups. The pooled standard error is given by the following formula:

Pooled standard error = √{[(ΣX1² − (ΣX1)²/N1) + (ΣX2² − (ΣX2)²/N2)] / (N1 + N2 − 2)} · √(1/N1 + 1/N2)     (AI.10)

A computationally equivalent form of the t-test then is as follows:

t = (mean1 − mean2) · √{[N1 · N2 · (N1 + N2 − 2)] / (N1 + N2)} / √{(ΣX1² − (ΣX1)²/N1) + (ΣX2² − (ΣX2)²/N2)}     (AI.11)

Here is a worked example of an independent groups t-test. In this hypothetical case, we have two panels, one from a manufacturing site and one from a research site, both evaluating the perceived pepper heat from an ingredient submitted for use in a highly spiced product. The product managers have become concerned that the plant QC panel may not be very sensitive to pepper heat, due to their dietary consumption or other factors, and that the use of ingredients is getting out of line with what research and development personnel feel is an appropriate level of pepper. So the sample is evaluated by both groups, and an independent groups t-test is performed. Our null hypothesis is that there is no difference in the population means, and our alternative hypothesis is that the QC panel will have lower mean ratings in the long run (a one-tailed situation). The data set consists of pepper heat ratings on a 15-point category scale:

Manufacturing QC Panel    R&D Test Panel
 7                         9
12                        10
 6                         8
 5                         7
 8                         7
 6                         9
 7                         8
 4                        12
 5                         9
 3

First, some preliminary calculations:

N1 = 10     Σscore1 = 63     mean1 = 6.30     Σscore1² = 453
N2 = 9      Σscore2 = 79     mean2 = 8.78     Σscore2² = 713

Now we have all the information we need to calculate the value of t. Using the computational formula, the calculations are as follows:

t = (6.30 − 8.78) · √{[10 · 9 · (10 + 9 − 2)] / (10 + 9)} / √[453 − (3,969/10) + 713 − (6,241/9)] = −2.556     (AI.12)
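As a check on this arithmetic, the pooled-error computation can be sketched in Python (a standard-library illustration, not part of the original text; the helper name is ours):

```python
import math

qc = [7, 12, 6, 5, 8, 6, 7, 4, 5, 3]   # Manufacturing QC panel, N1 = 10
rd = [9, 10, 8, 7, 7, 9, 8, 12, 9]     # R&D test panel, N2 = 9

def sum_sq_dev(x):
    # sum of squared deviations: sum(x^2) - (sum(x))^2 / N
    return sum(v * v for v in x) - sum(x) ** 2 / len(x)

n1, n2 = len(qc), len(rd)
mean1, mean2 = sum(qc) / n1, sum(rd) / n2
pooled_var = (sum_sq_dev(qc) + sum_sq_dev(rd)) / (n1 + n2 - 2)
pooled_se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
t = (mean1 - mean2) / pooled_se
```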

Degrees of freedom are 17 (= 10 + 9 − 2). The critical t-value for a one-tailed test at 17 df is 1.740, so this is a statistically significant result. Our QC panel does seem to be giving lower scores for pepper heat than does the R&D panel. Given this result, we might want to conduct a consumer study to see whether the pepper level has become objectionable to consumers. We also might want to replicate the study, since the sample size was fairly small. If the difference is confirmed to be a problem in the consumer population, we should consider calibrating our panels against the consumer information so that these internal panels become more accurate reflections of what is perceived in the true target market.

Note that the variability is also a little higher in the QC panel. For highly unequal variability (one standard deviation more than 3 times as large as the other), some adjustments must be made. The problem of unequal variance becomes more serious when the two groups are also very different in size. The t-distribution becomes a poor estimate of what to expect under a true null, so the alpha level is no longer adequately protected. One approach is to adjust the degrees of freedom and formulas. This approach is given in advanced statistics books (e.g., Snedecor and Cochran, 1989). The nonpooled estimates of the t-value are provided by some statistics packages, and it is usually prudent to examine these adjusted t-values if unequal group sizes and unequal variances happen to be the situation with your data.

The Sensitivity of the Dependent t-Test for Sensory Data

In sensory testing, it is often valuable to have each panelist try all of the products in our test. For simple paired tests of two products, this enables the use of the dependent t-test. This is especially valuable when the question is simply whether a modified process or ingredient has changed the sensory attributes of a product. The dependent t-test is preferable to the separate-groups approach, in which different people try each product. The reason is apparent in the calculations. In the dependent or paired t-test, the statistic is calculated from a difference score. This means that the differences among panelists in overall sensory sensitivity, or even in their idiosyncratic scale usage, are largely removed from the situation. It is common to observe that some panelists have a "favorite" part of the scale and may restrict their responses to one section of the allowable responses. However, with the dependent t-test, as long as panelists rank-order the products in the same way, there will be a statistically significant result. This is one way to partition the variation due to subject differences from the variation due to other sources of error. In general, partitioning of error adds power to statistical tests, as shown in the section on repeated measures ANOVA (see Appendix III). Of course, there are biases and potential problems in having people evaluate both products, like the introduction of contextual biases, sequential order effects, and possible fatigue and carryover effects from tasting both products. However, the advantage gained in the sensitivity of the test usually far outweighs the liabilities of repeated testing.

Consider the following example. In this data set, we have paired observations and a fair degree of spread in the actual ratings being given by subjects. This might be due to individual differences in sensitivity, as would be found with bitter tastes, for example.
However, most panelists agreed that one product was rated higher than the other. This kind of situation, very common in sensory work, leads to a significant paired t-test. However, if the data were mistakenly treated with the independent groups t-test, the result would not break through the critical value of t, and no statistically significant difference would be reported.

Judge    Product A    Product B    Difference (B − A)
1        3            5             2
2        2            4             2
3        1            2             1
4        6            8             2
5        4            5             1
6        6            6             0
7        8            9             1
8        9            9             0
9        4            3            −1
10       7            9             2
11       4            6             2
12       1            3             2

A paired-samples t-test on Product A vs. Product B gives a mean difference of 1.167, a standard deviation of the differences of 1.03, and a t-value of 3.924. At 11 degrees of freedom, this has a p-value of only .002, in other words less than 5% and thus statistically significant. The independent groups t-test, on the other hand, tests a mean value of 4.58 against a mean value of 5.75, with standard deviations of 2.6 and 2.5, respectively. Given this high level of variation among judges, the t-value is only 1.105, which has a p-value of .28, much higher than the 5% required for statistical significance. Even though the degrees of freedom are larger, the between-person variance here swamps the result. The consistent direction of difference, however, allows the paired t-test to detect the pattern of differences.

Why does this happen? The data are shown graphically in Figure AI.4. Note that for 9 of the 12 panelists, Product B is rated higher than Product A. For two others, the scores were tied, and for one other person, the reverse pattern was found. Since the paired t-test works on the basis of differences, and not absolute scores, the pattern of differences is consistent enough to yield a significant result. The independent groups t-test, on the other hand, looks at actual scores and the amount of variance among the panelists, which is considerable. The pattern of correlation between the two scores for each panelist is what helps the dependent t-test in this case. This correlation is shown in the lower section of Figure AI.4. In any case where such a correlation is seen, the dependent t-test will be more sensitive than the independent groups t-test. However, the decision to use one or the other must be based on the actual experimental design.
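Both analyses of this data set can be reproduced with a short script (a Python sketch using the data above; not part of the original text):

```python
import math

a = [3, 2, 1, 6, 4, 6, 8, 9, 4, 7, 4, 1]   # Product A ratings
b = [5, 4, 2, 8, 5, 6, 9, 9, 3, 9, 6, 3]   # Product B ratings
n = len(a)

def sum_sq_dev(x):
    # sum of squared deviations: sum(x^2) - (sum(x))^2 / N
    return sum(v * v for v in x) - sum(x) ** 2 / len(x)

# Dependent (paired) t-test on the difference scores
d = [y - x for x, y in zip(a, b)]
t_paired = (sum(d) / n) / math.sqrt(sum_sq_dev(d) / (n - 1) / n)

# Independent groups t-test on the raw scores, with a pooled error
pooled_var = (sum_sq_dev(a) + sum_sq_dev(b)) / (2 * n - 2)
t_indep = (sum(b) / n - sum(a) / n) / math.sqrt(pooled_var * (2 / n))
```

The paired value (about 3.92) exceeds the two-tailed critical t at 11 df, while the independent-groups value (about 1.10) does not, mirroring the comparison described in the text.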

[Figure AI.4 appears here. Upper panel: product scores for the 12 panelists (arbitrarily numbered), with lines showing the differences between Product A and Product B; two ties and one reversal are marked. Lower panel: ratings of Product B plotted against ratings of Product A for each individual panelist, R² = 0.849.]

Figure AI.4. The upper panel shows directions of difference in the paired data for the two products. The lower panel shows the pattern of correlation, where the dependent t-test becomes more efficient at detecting differences than unpaired methods.

Sensory practitioners should consider using a paired experimental design whenever this correlation is likely (probably most of the time!). In other words, have each panelist evaluate both products if that is at all possible.

SUMMARY: STATISTICAL HYPOTHESIS TESTING

Statistical testing is designed to prevent us from concluding that a treatment had an effect when none was really present, and our differences were merely due to chance or to experimental error variation. Since test statistics like Z and t have known distributions, we can tell whether our results would be extreme, i.e., in the tails of these distributions, a certain percentage of the time when only chance variation was operating. This allows us to reject the notion of chance variation in favor of concluding that there was an actual effect. The steps in statistical testing are summarized in the flowchart shown in Figure AI.3.

Sample size, or the number of observations to make in an experiment, is one important decision. As noted above, this helps determine the power and sensitivity of the test, as the standard errors decrease as a function of 1 divided by the square root of N. Also note, however, that this square-root function means that the advantage of increasing sample size becomes smaller and smaller as N gets large. In other words, there is a law of diminishing returns. At some point, the cost considerations in doing a large test will outweigh the advantage in reducing uncertainty and lowering risk. An accomplished sensory professional will have a feeling for how well the sensitivity of the test balances against the informational power, uncertainty, and risks involved, and for about how many people are enough to ensure a sensitive test. These issues are discussed further in the section on beta-risk and statistical power.

Note that statistical hypothesis testing by itself is a somewhat impoverished manner of performing scientific research. Rather than establishing theorems, laws, or general mathematical relationships about how nature works, we are simply making a binary yes/no decision, either that a given experimental treatment had an effect or that it did not.
Statistical tests can be thought of as a starting point, or a kind of necessary hurdle, that is part of experimentation to help rule out the effects of chance. However, it is not the end of the story, only the beginning. In addition to statistical significance, the sensory scientist must always describe the effects. It is easy for students to forget this point and make the mistake of reporting significance but failing to describe what happened. There are further matters of scientific investigation after the issue of statistical significance is dispensed with. For example, it is desirable to develop a quantitative model of the phenomenon if possible. Then and only then does sensory evaluation advance from mere exploration to a science that is more generally useful in predicting quantitative relationships between physical and sensory variables.
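The logic of "ruling out chance" can be illustrated with a small simulation. This is a sketch of my own (a two-sample z-test on simulated ratings, not a procedure from the text): when the null hypothesis is true by construction, about 5% of experiments will still cross the alpha = .05 cutoff, which is exactly the false-alarm rate the test is designed to limit.

```python
import math
import random

random.seed(1)  # reproducible sketch

def z_two_sample(a, b):
    """Two-sample z-statistic for equal-sized samples of unit-variance data."""
    n = len(a)
    return (sum(a) / n - sum(b) / n) / math.sqrt(2.0 / n)

# Simulate experiments in which the null hypothesis is TRUE: both "products"
# are rated by drawing from the same normal population.
trials = 5000
rejections = 0
for _ in range(trials):
    a = [random.gauss(0, 1) for _ in range(30)]
    b = [random.gauss(0, 1) for _ in range(30)]
    if abs(z_two_sample(a, b)) > 1.96:  # two-tailed cutoff for alpha = .05
        rejections += 1

rejection_rate = rejections / trials  # close to 0.05 by construction
```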

POSTSCRIPT: WHAT P-VALUES SIGNIFY AND WHAT THEY DON'T

Probably no single statistical concept is more often misunderstood and more often abused than the obtained p-value that we find for a statistic after conducting an analysis. It is easy to forget that this p-value is based on a hypothetical curve for the test statistic, like the t-distribution, that is calculated under the assumption that the null hypothesis is true. So the obtained p-value is taken from the very situation that we are trying to reject or eliminate as a possibility. Once this fact is realized, it is easier to put the p-value into proper perspective and give it the respect it deserves, but no more.

What does the p-value mean? Let's reiterate. It is the probability of observing a value of the test statistic (t, z, r, chi-square, or F-ratio) that is as large or larger than the one we obtain in our experimental analysis, when the null hypothesis is true: that much, no more and no less. In other words, assuming a true null, how likely or unlikely would the obtained value of the t-test be? When it becomes somewhat unlikely, say it is expected less than 5% of the time, we reject this null in favor of the alternative hypothesis and conclude that there is statistical significance. Thus we have gained some assurance of a relationship between our experimental treatments and the sensory variables we are measuring. Or have we? Here are some common misinterpretations of the p-value:

1. The p-value represents the odds against chance. This is absolutely false (Carver, 1978). The proposition that the result is a finding that depends to some degree on random events is always true. There is always some random or unexplained error in our measurement, or there would be no need to do statistics. This interpretation also puts the cart before the horse. The chance of observing the t-statistic under a true null is not the same as the chance of the null being true given the observations. In mathematical logic, the probability of A given B is not necessarily the same as the probability of B given A.
If I find a dead man on my front lawn, the chance that he was shot in the head is quite slim (less than 5%) in my neighborhood, but if I find a man shot through the head, he is more than likely dead (95% or more) (example modified from Carver, 1978).

2. A related way of misinterpreting the p-value is to say that the p-value represents the chance of making a Type I error. This is also not strictly true, although it is widely assumed to be true (Pollard and Richardson, 1987). Indeed, that is what our alpha cutoff is supposed to limit, in the long run. But the actual value of alpha in restricting our liability depends also on the incidence of true differences vs. no differences in our long-term testing program. Only when the chance of a true difference is about 50% is the alpha value an accurate reflection of our liability in rejecting a true null. The incidence is usually not known but can be estimated. In some cases, like quality control or shelf-life testing, there are many more "no difference" situations, and then alpha widely underestimates our chances of being wrong once the null is rejected. The table below shows how this can occur. In this example, we have a 10% incidence of a true difference, alpha is set at .05, and the beta-risk (the chance of missing a difference) is 20%, a reasonable value in practical research settings. For ease of computation, 1,000 tests are conducted:

Incidence                    Difference Detected    Difference Not
(in 1,000 total tests)       by Test                Detected by Test    Total
Difference exists            80                     20                  100 tests
(10% incidence)                                     (beta-risk = 20%)
Difference does not exist    45                     855                 900 tests
                             (alpha = 5% of 900)
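The table's arithmetic can be checked in a few lines, using only the stated assumptions (10% incidence, alpha = .05, beta = .20, 1,000 tests); the variable names are mine:

```python
# Checking the table's arithmetic under its stated assumptions.
total = 1000
true_diff = total * 0.10        # 100 tests where a difference really exists
no_diff = total - true_diff     # 900 tests where it does not

hits = true_diff * (1 - 0.20)   # 80 real differences detected (beta = 20%)
false_alarms = no_diff * 0.05   # 45 spurious rejections (alpha = 5% of 900)

# Probability that a detected difference is real (Bayes' theorem):
p_real_given_detected = hits / (hits + false_alarms)  # 80/125 = 0.64
```

Even with correct statistics throughout, only 64% of rejected nulls correspond to real differences in this scenario.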

The chance of being correct, having decided on a difference, is 80/125, or a little under 2/3. There is still about a 1/3 chance (45/125) of being wrong once you have rejected the null, even though we have done all our statistics correctly and are sure we are running a good testing program. The problem in this scenario is the low probability beforehand of actually being sent something worth testing. An example of this kind of misinterpretation is common in elementary statistics textbooks. For example, a well-known and reputable introductory statistics text gives the following incorrect information:

The probability of rejecting H0 erroneously (committing a Type I error) is known; it is equal to alpha. ... Thus you may be 95% confident that your decision to reject H0 is correct (Welkowitz et al., 1982, p. 163).

This is absolutely untrue, as the above example illustrates. With a low true incidence of actual differences, the chance of being wrong once H0 is rejected may be very high indeed. This can occur not only in quality control and shelf-life studies but also in research and development, where small changes in ingredient substitution or minor modifications in processing have come to be the "safe" style of research in a conservative corporate environment.

3. One minus the p-value gives us the reliability of our data, the faith we should have in the alternative hypothesis, or an index of the degree of support for our research hypothesis in general. All of these are false (Carver, 1978). Reliability, certainly, is important (getting the same result on repeated testing) but is not estimated simply as 1 - p. Actually, the value of a replicated experiment is much higher than a low p-value, especially if the replication comes from another laboratory and another test panel. It is this constancy that is the reward of most basic scientific inquiry:

Science is very rarely concerned with significant differences; it is overwhelmingly concerned with significant sameness. We begin to believe that an effect is real when an original study is repeated in other centers, at other times, and with other classes of subject and gives broadly similar results. It is the invariance of the effect when other things change that we seek. In my view the use of p-values has little or nothing to contribute to the scientific way of working. What is needed from a study is an estimate of the size of the effect to be measured and an estimate of how accurately it has been measured (Nelder, 1988, p. 431).

Effect size is not the same as having a low p-value, but has to do with the variance in the data accounted for by the experimental variable. Once again we return to measures of association rather than measures of difference. The estimation of effect size is a complicated matter, as it can depend on the range of variables tested, the pure error rate, other variables in the experimental design, the quality of the test setup, controls, and so on. It is beyond the scope of this chapter to introduce the sensory student to the complexities of this part of statistical science, but the interested reader is referred to writings by Rosenthal (1987) for an introduction. The notion of estimating the chance of being right or wrong given a certain outcome is covered in the branch of statistics known as Bayesian statistics, after Bayes's theorem, which allows the kind of calculation illustrated in the table above (see Berger and Berry, 1988).

Interpreting p-values as evidence for the alternative hypothesis is just as wrong as interpreting them as accurate measures of evidence against the null. Once again, it depends on the incidence or base rate. There is a misplaced feeling of elation that students and even some more mature researchers seem to get when they obtain a low p-value, as if this were an indication of how good their experimental hypothesis was. It is surprising, then, that journal editors continue to allow the irrelevant convention of adding extra stars, asterisks, or other symbols to indicate low p-values (**.01, ***.001, etc.) beyond the preset alpha level in experimental reports. The informational content of these extra stars is minimal, and this convention should be done away with. It is empty rhetoric, a case of misleading advertisement for the value of the experimental results.

Overzealous teachers and careless commentators have given the world the impression that our standard statistical measures are inevitable, necessary, optimal and mathematically certain.
In truth, statistics is a branch of rhetoric and the use of any particular statistic ... is nothing more than an exhortation to your fellow humans to see meaning in data as you do (Raskin, 1988, p. 432).

REFERENCES

Basker, D. 1988. Critical values of differences among rank sums for multiple comparisons. Food Technology, 42(2), 79, 80-84.
Berger, J.O., and Berry, D.A. 1988. Statistical analysis and the illusion of objectivity. American Scientist, 76, 159-165.
Carver, R.P. 1978. The case against statistical significance testing. Harvard Educational Review, 48, 378-399.
Gacula, M.C., Jr., and Singh, J. 1984. Statistical Methods in Food and Consumer Research. Academic, Orlando, FL.
Hays, W.L. 1973. Statistics for the Social Sciences, 2d ed. Holt, Rinehart and Winston, New York.
Lundahl, D.S., and McDaniel, M.R. 1988. The panelist effect-fixed or random? Journal of Sensory Studies, 3, 113-121.
Nelder, J.A. 1988. Letter to the editor. American Scientist, 76, 431-432.
O'Mahony, M. 1986. Sensory Evaluation of Food: Statistical Methods and Procedures. Dekker, New York.
Piggott, J.R. 1986. Statistical Procedures in Food Research. Elsevier Applied Science, London.
Pollard, P., and Richardson, J.T.E. 1987. On the probability of making Type I errors. Psychological Bulletin, 102, 159-163.
Rosenthal, R. 1987. Judgment Studies: Design, Analysis and Meta-analysis. Cambridge University Press, Cambridge.
Siegel, S. 1956. Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill, New York.
Snedecor, G.W., and Cochran, W.G. 1989. Statistical Methods, 8th ed. Iowa State University Press, Ames, IA.
Sokal, R.R., and Rohlf, F.J. 1981. Biometry, 2d ed. Freeman, New York.
Stone, H., and Sidel, J.L. 1993. Sensory Evaluation Practices, 2d ed. Academic, San Diego, CA.
Welkowitz, J., Ewen, R.B., and Cohen, J. 1982. Introductory Statistics for the Behavioral Sciences. Academic, New York.

APPENDIX II

NONPARAMETRIC AND BINOMIAL-BASED STATISTICAL METHODS

"Although statistical tests provide the right tools for basic psychophysical research, they are not ideally suited for some of the tasks encountered in sensory analysis."-M. O'Mahony (1986, p. 401)

INTRODUCTION TO THE NONPARAMETRIC TESTS

The t-test and other "parametric" statistics work well for situations in which the data are graded or nearly continuous, as in rating scales. In other situations, however, we categorize performance into right and wrong answers and count the numbers of people who get tests correct or incorrect, or we count the number who make a choice of one product over another. Common examples of this kind of testing include the triangle test and other similar procedures where there is a forced choice, such as paired preference tests (e.g., which product did you like better?). In these situations, we want to use a specific kind of distribution for statistical testing, one that is based on discrete, categorical data. This is called the binomial distribution and is described in this section. The binomial distribution is useful for tests based on proportions, where we have counted people in different categories or choices. The binomial distribution is actually a special case where there are only two outcomes (e.g., right and wrong answers in a triangle test). Sometimes we may have more than two alternatives for classifying responses, in which case multinomial distribution statistics apply.

A commonly used statistic for comparing frequencies when there are two or more response categories is the chi-square statistic. For example, we might want to know whether the meals during which a food product is consumed (say, breakfast, lunch, dinner, snacks) differ among teenagers and adults. If we asked consumers about the situation in which they most commonly consumed the product, the data would consist of counts of the frequencies for each group of consumers. The chi-square statistic could then be used to compare these two frequency distributions. It will also indicate whether there is any association between the response categories (meals) and the group variable, or whether these two variables are independent.

Some response alternatives have more than categorical or nominal properties and represent ordered levels of response. For example, we might ask for ranked preference of three or more flavor variations for a new product. Consumers would rank them from most appealing to least appealing, and we might want to know whether there is any consistent trend, or whether all flavors are about equally preferred across the group. For rank order data, there are a number of statistical techniques within the nonparametric toolbox. Many of these are very powerful in the statistical sense, even rivaling their parametric counterparts. That is, they are very sensitive to deviations from random (chance) responding.

Since it has been argued that many sensory measurements, even rating scales, do not have interval-level properties (see Chapter 7), it often makes sense to apply a nonparametric test, especially one based on ranks, if the researcher has any doubts about the level of measurement inherent in the scaling data. Alternatively, the nonparametric tests can be used as a check on conclusions from the traditional tests.
Since the nonparametric tests involve fewer assumptions than their parametric alternatives, they are more "robust" and less likely to lead to erroneous conclusions or misestimation of the true alpha-risk when assumptions have been violated. Furthermore, they are often quick and easy to calculate, so reexamination of the data does not entail a lot of extra work. In addition to the issue of ordinal levels of measurement, nonparametric methods are also appropriate when the data deviate from a normal distribution. If inspection of the data set shows a pattern of high or low outliers, i.e., some asymmetry or skew, then a nonparametric test should be conducted.

When data are ranked or have ordinal-level properties, a good measure of central tendency is the median. For data that are purely categorical (nominal level), the measure of central tendency to report is the mode, the most frequent value. Various measures of dispersion can be used as alternatives to the standard deviation. When the distribution of the data is not normal, the 95% confidence interval for the median of N scores lies between the values of the observations numbered [(N + 1)/2] ± 0.98√N when the data are rank ordered. When the data are reasonably normal, the standard error of the median is 1.253 s/√N, where s is the standard deviation of the sample, and the 95% confidence interval becomes the median ± t(1.253 s/√N), where t is the two-tailed t-value for N - 1 degrees of freedom (Smith, 1988). Another simple alternative is to state the semi-interquartile range, or one-half the difference between the data values at the 75th and 25th percentiles.

There are several nonparametric versions of the correlation coefficient. One commonly used is the rank order correlation attributed to Spearman.
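The two median confidence-interval recipes above can be sketched as follows (function names are mine; the 1.253 multiplier is √(π/2), the usual constant for the standard error of the median of normal data):

```python
import math

def median_ci_positions(n):
    """Rank positions bounding the 95% CI for the median of non-normal data:
    the sorted observations numbered (N + 1)/2 +/- 0.98*sqrt(N)."""
    center = (n + 1) / 2
    half = 0.98 * math.sqrt(n)
    return math.floor(center - half), math.ceil(center + half)

def median_se_normal(s, n):
    """Approximate standard error of the median for roughly normal data:
    sqrt(pi/2) * s / sqrt(N), i.e., about 1.253 times s/sqrt(N)."""
    return math.sqrt(math.pi / 2) * s / math.sqrt(n)

# For 25 rank-ordered scores, the 95% CI for the median runs from the
# 8th to the 18th sorted observation.
lo, hi = median_ci_positions(25)
```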
Nonparametric statistical tests with worked examples are shown in the texts by Siegel (1956), Hollander and Wolfe (1975), and Conover (1980). It is advisable that the sensory professional have some familiarity with the common nonparametric tests so that they can be used when the assumptions of the parametric tests seem doubtful. Many statistical computing packages will also offer nonparametric modules and various choices of these tests. The sections below illustrate some of the common binomial, chi-square, and rank order statistics, with some worked examples from sensory applications.
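As a minimal illustration of the Spearman coefficient just mentioned, the sketch below computes it from the classic formula 1 - 6Σd²/[n(n² - 1)] on tie-free data; the data and function name are hypothetical, not from the text:

```python
def spearman_rho(x, y):
    """Spearman rank correlation for tie-free data:
    rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), where d = rank difference."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical panel data: sweetness ratings vs. preference ranks.
rho = spearman_rho([2.0, 3.5, 5.1, 6.8, 9.0], [1, 2, 3, 5, 4])  # 0.9
```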

BINOMIAL-BASED TESTS ON PROPORTIONS

The binomial distribution describes the frequencies of events with discrete or categorical outcomes. Examples of such data in sensory testing would be the proportion of people preferring one product over another in a test or the proportion answering correctly in a triangle test. The distribution is based on the binomial expansion, (p + q)^n, where p is the probability of one outcome, q is the probability of the other outcome (q = 1 - p), and n is the number of samples or events. Under the null hypothesis in most discrimination tests, the value of p is determined by the number of alternatives and therefore equals one-third in the triangle test and one-half in the duo-trio or paired comparison tests.

A classic and familiar example of binomial-based outcomes is tossing a coin. Assuming that it is a fair coin puts the expected probability at one-half for each outcome (heads or tails). Over many tosses (analogous to many observations in a sensory test) we can predict the likely or expected numbers of heads and tails, and how often these various possibilities are likely to occur. For predicting the number of each possibility, we "expand" the combinations of (p + q)^n, letting p represent the numbers of heads and q the numbers of tails in n total throws, as follows:

1. For one throw, the values are (p + q)^1, or p + q = 1. One head or one tail can occur, and they will occur with probability p = q = 1/2. The coefficients (multipliers) of each term, divided by the total number of outcomes, give us the probability of each combination.

2. For two throws, the values are (p + q)^2, so the expansion is p^2 + 2pq + q^2 = 1 (note that the probabilities total 1). p^2 is associated with the outcome of two heads, and the multiplicative rule of probabilities tells us that the chances are (1/2)(1/2) = 1/4. 2pq represents one head and one tail (this can occur two ways); the probability is 1/2. In other words, there are four possible combinations and two of them include one head and one tail, so the chance of this is 2/4, or 1/2. Similarly, q^2 is associated with two tails, and the probability is 1/4.

3. Three throws will yield the following outcomes: (p + q)^3, or p^3 + 3p^2q + 3pq^2 + q^3. This expansion tells us that there is a 1/8 chance of three heads or three tails, but there are three ways to get two heads and one tail (HHT, HTH, THH) and similarly three ways to get one head and two tails, so the probability of each of these two outcomes is 3/8. Note that there are eight possible outcomes for three throws, or more generally 2^n outcomes of n observations.
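The three-throw case can be verified by brute force, enumerating all 2^3 equally likely outcomes (a small sketch of mine, not from the text):

```python
from itertools import product
from math import comb

# Enumerate all 2^3 equally likely outcomes of three fair-coin tosses
# and tally how many show each number of heads.
counts = {k: 0 for k in range(4)}
for outcome in product("HT", repeat=3):
    counts[outcome.count("H")] += 1

probs = {k: counts[k] / 8 for k in counts}
# Matches the expansion (p + q)^3: probabilities 1/8, 3/8, 3/8, 1/8,
# with the coefficients 1, 3, 3, 1 given by comb(3, k).
```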

As the expansion continues, the distribution of events, in terms of the possible numbers of one outcome, will form a bell-shaped distribution (when p = q = 1/2), much like the normal distribution, as shown in Figure AII.1. The coefficients for each term in the expansion are given by the formula for combinations, for the outcome in which p appears A times out of N tosses:

Coefficient = N! / [(N - A)! A!]

This is the number of ways the particular event can occur. Thus we can find an exact probability for any sample based on the expansion. This is manageable for small samples, but as the number of observations becomes large, the binomial distribution begins to resemble the normal distribution reasonably well, and we can use a Z-score approximation to simplify our calculations. For small samples, we can actually do these calculations, but reference to a table will save time. For the sake of illustration, here is an example of a small experiment (see O'Mahony, 1986, for a similar example). Ten people are polled in a pilot test of an alternative formula for a food product. Eight prefer the new product over the old; two prefer the old product. What is the chance that we would see a preference split of 8/10 or more, if the true probabilities were

Figure AII.1. The binomial expansion is shown graphically for tossing a coin, with outcomes of heads and tails for various numbers of throws. A, frequencies (probabilities) associated with a single toss; B, frequencies expected from 2 tosses; C, from 3 tosses; D, from 4 tosses; and E, from 10 tosses. Note that as the number of events (or observations) increases, the distribution begins to take on the bell-shaped appearance of the normal distribution.

1/2 (i.e., a 50/50 split in the population)? The binomial expansion for 10 observations, p = q = 1/2, is:

p^10 + 10p^9q + 45p^8q^2 + 120p^7q^3 + ..., etc.

In order to see if there is an 8-to-2 split or larger, we need to calculate the proportion of times these outcomes can occur. Note that this includes the values in the "tail" of the distribution; that is, it includes the outcomes of a 9-to-1 and a 10-to-none preference split. So we only need the first three terms, or (1/2)^10 + 10(1/2)^9(1/2) + 45(1/2)^8(1/2)^2. This sums to about .055, or 5.5%. Thus, if the true split in the population as a whole were 50/50, we'd see a result this extreme (or more extreme) only about 5% of the time. Note that this is about where we usually reject the null hypothesis. Also note that we have only looked at one tail of the distribution for this computation. In a preference test, we would normally not predict at the outset that one item is preferred over another. This requires a two-tailed test, and we would need to double this value.

So for small experiments, we can usually go directly to tables of the cumulative binomial distribution, which give us exact probabilities for our outcomes based on the tails or ends of the expansion equation. For larger experiments (N > 25 or so), we can use the normal distribution approximation, since with large samples the binomial distribution looks a lot like the normal distribution. Using the normal distribution, the extremeness of a proportion can be represented by a Z-score, rather than figuring all the probabilities and expansion terms. For any proportion we observe, the disparity from what is expected by chance can be expressed as a Z-score, which of course has an associated probability value useful for statistical testing.
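Before turning to the normal approximation, the exact tail computation for the 8-out-of-10 example can be sketched in a few lines (the function name is mine):

```python
from math import comb

def binomial_tail(x, n, p):
    """Exact probability of x or more 'successes' out of n trials,
    each with success probability p (upper tail of the expansion)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x, n + 1))

# 8 or more of 10 preferring the new product, under a true 50/50 split:
one_tailed = binomial_tail(8, 10, 0.5)  # (1 + 10 + 45)/1024 = 56/1024 ~ 0.055
two_tailed = 2 * one_tailed             # for a no-direction preference test
```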

Z = (proportion observed - proportion expected - continuity correction) / (standard error of the proportion)    (AII.1)

The continuity correction accounts for the fact that we can't have fractional observations, and the distribution of binomial outcomes is not really a continuous measurement variable. In other words, there are only a limited number of whole-number outcomes, since we are counting discrete events (you can't have half a person prefer product A over product B). The continuity correction adjusts for this approximation by the maximum amount of deviation from a continuous variable in the counting process, or one-half of one observation.

The standard error of the proportion is estimated as the square root of p times q divided by the number of observations (N). Note that, as with the t-value, our standard error, or the uncertainty around our observations, decreases as the reciprocal of the square root of N. So once again, our certainty that the observed proportion lies near the true population proportion increases as N gets large. The mathematical notation for Z then becomes

Z = [p_obs - p_exp - (0.5/N)] / √(pq/N)    (AII.2)

where
p_obs = X/N
X = the number of observations of one outcome
p_exp = p, the chance probability under the null hypothesis assumption
q = 1 - p

Computationally, this is sometimes written as the following equation, after multiplying the numerator and denominator by N:

Z = (X - Np - 0.5) / √(Npq)    (AII.3)
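Eq. (AII.3) is easy to apply directly. A sketch (hypothetical data, function name mine): in a paired preference test with chance p = 1/2, suppose 60 of 100 consumers prefer product A.

```python
import math

def z_proportion(x, n, p):
    """Continuity-corrected Z for x 'successes' out of n observations,
    chance probability p, following Eq. (AII.3)."""
    q = 1 - p
    return (x - n * p - 0.5) / math.sqrt(n * p * q)

# Hypothetical paired preference test: 60 of 100 consumers prefer product A.
z = z_proportion(60, 100, 0.5)  # (60 - 50 - 0.5)/5 = 1.9
```

A Z of 1.9 falls short of the two-tailed 1.96 cutoff but exceeds a one-tailed 1.65 criterion, illustrating why the directionality of the hypothesis matters.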

Tables of minimum numbers correct, commonly used for triangle tests, paired preference tests, etc., solve this equation for X as a function of N and Z (see Roessler et al., 1978). The tables show the minimum number of people who have to get the test correct to reject the null hypothesis and thus conclude that a difference exists. For one-tailed discrimination tests at an alpha-risk of 0.05, the Z-value is 1.65. In this case, the minimum value can be found by solving Eq. (AII.2) or (AII.3) for X, with the equal sign changed to "greater than" (Z must exceed 1.65). Given the value of 1.65 for Z, 1/3 for p, and 2/3 for q, the equation can be solved for X and N as follows:

X ≥ (2N + 3)/6 + 0.778√N    (AII.4)

Of course, X must be rounded up to the next whole number, since you can't have a fraction of a person answer correctly, and the value must exceed the 1.65 critical one-tailed Z. For tests that have a chance probability of 1/2 (such as the duo-trio, paired comparison, or dual standard), the same transformation takes this form:

X ≥ (N + 1)/2 + 0.825√N    (AII.5)
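Eqs. (AII.4) and (AII.5) can be turned into small helper functions (names mine); for a triangle test with 30 panelists, the minimum works out to 15 correct. Note that the normal approximation can run one response more conservative than exact binomial tables near a rounding boundary.

```python
import math

def min_correct_triangle(n):
    """Minimum correct to declare a triangle-test difference at one-tailed
    alpha = .05, using the normal approximation of Eq. (AII.4), p = 1/3."""
    return math.ceil((2 * n + 3) / 6 + 0.778 * math.sqrt(n))

def min_correct_paired(n):
    """Same criterion for p = 1/2 tests (duo-trio, paired comparison,
    dual standard), Eq. (AII.5)."""
    return math.ceil((n + 1) / 2 + 0.825 * math.sqrt(n))

x30 = min_correct_triangle(30)  # 15 correct of 30 needed
```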

We can also use these relationships to determine confidence intervals on proportions. The 95% confidence interval on an observed proportion, p_obs (= X/N, where X is the number correct in a choice test), is equal to:

CI = p_obs ± Z√(pq/N)    (AII.6)

where Z takes the value of 1.96 for the two-tailed 95% interval of the normal distribution. This equation is useful for estimating the interval within which a true proportion is likely to occur. The two-tailed situation is applicable to a paired preference test, as shown in the following example. Suppose we test 100 consumers and 60% show a preference for product A over product B. What is the confidence interval around the level of 60% preference for product A, and does this interval overlap the null hypothesis value of 50%? Using Eq. (AII.6),

CI = .60 ± 1.96√[(.5)(.5)/100] = .60 ± .098

In this case, the lower limit is 50.2%, so there is just enough evidence to conclude that, given this result, the true population proportion would not fall at 50% (with 95% confidence).
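The worked example can be reproduced with a short helper (the function name is mine; as in the text's example, the chance values p = q = .5 are used in the standard error):

```python
import math

def proportion_ci(p_obs, n, z=1.96, p_null=0.5):
    """95% CI on an observed proportion per Eq. (AII.6), using the chance
    probability p_null (and q = 1 - p_null) in the standard error,
    as in the text's worked example."""
    half_width = z * math.sqrt(p_null * (1 - p_null) / n)
    return p_obs - half_width, p_obs + half_width

# 60 of 100 consumers prefer product A:
low, high = proportion_ci(0.60, 100)  # 0.502 to 0.698
```

Since the lower limit (50.2%) just clears the null value of 50%, the result is marginally significant in the two-tailed sense.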

CHI-SQUARE

The chi-square statistic is useful for comparing frequencies of events classified in a table of categories. If each observation can be classified by two or more variables, it enters into the frequency count for a part of a matrix or classification table, where rows and columns represent the levels of each variable. For example, we might want to know whether there is any relationship between gender and consumption of a new reduced-fat product. Each person could be classified as a high- vs. low-frequency user of the product, and also classified by gender. This would create a two-way table with four cells representing the counts of people who fall into each of the four groups. If there were no association between the two variables, we would expect the classification by consumption group to follow the overall percentages of classification by gender. For example, if we had a 50/50 split in sampling the two sexes, and also a 50/50 split in sampling our high- and low-frequency groups, we would expect an even proportion of about 25% of the total number of observations to fall in each cell of our table. To the extent that one or more cells in the table are disproportionately filled or lacking in observations, we would find evidence of an association, or lack of independence, between gender and product use. Two examples follow, one with no association between the variables and the other with a clear association. (Numbers represent counts of 200 total participants.)

No Association                            Clear Association
            Usage Group                               Usage Group
            Low     High    (Total)                   Low     High    (Total)
Males        50      50     (100)         Males        75      25     (100)
Females      50      50     (100)         Females      20      80     (100)
(Total)    (100)   (100)    (200)         (Total)     (95)   (105)    (200)

In the example at the left, the within-cell entries for frequency counts are exactly what we would expect based on the marginal totals; with one-half of the groups classified according to each variable, we expect one-fourth of the total in each of the four cells. Of course, a result of exactly 25% in each cell would rarely be found in real life, but the actual proportions would be close to 1/4 if there were no association. In the right-hand example, we see that females are more inclined to fall into the high-usage group and that the reverse is true for males. So knowing gender does help us predict something about the variance in usage group, and we conclude that there is a relationship between these two variables.

In the more general sense, the chi-square statistic is useful for comparing any two distributions of data across two variables. The general form of the statistic is to (1) compute the expected frequency minus the observed frequency, (2) square this value, (3) divide by the expected frequency, and (4) sum these values across all cells, as in Eq. (AII.7). The expected frequency is what would be predicted from random chance or from some knowledge or theory of what is a likely outcome based on previous or current observations.

χ² = Σ [(Observed - expected)² / expected]    (AII.7)

The statistic has degrees of freedom equal to the product of the number of rows minus one, times the number of columns minus one. What if we have no prediction about the expected frequencies ahead of time? In the absence of any knowledge about expected frequencies, they are estimated from the marginal proportions of the parent variables, as in the example of gender and product usage stated above. Specifically, the expected frequency for a cell can be found from the product of the two marginal totals associated with that cell's row and column, divided by the total number of observations. An example is shown in Figure AII.2 for a two-by-two table.

As stated above, the chi-square test for independence can be used to see whether there is any association between two nominal classification variables. Each observation is classified on both variables and thus occupies one cell in a two-way table. One variable is represented in the rows and the other in the columns. The cell entries represent frequency counts of the number of items classified that way (on that given combination of the two variables). This simple version of the chi-square test can be applied to several sensory data sets, including the same-different paired test or the A, not-A test. This form of the test and a simple computational formula are shown in Figure AII.2.

Some care is needed in applying chi-square tests, as they are temptingly easy to perform and widely applicable to questions of cross-classification and association between variables. Tests using chi-square usually assume that each observation is independent, e.g., that each tally is a different person. The test is not considered appropriate by some writers for related-samples data, such as repeated observations on the same person. The chi-square test is also not robust if the frequency counts in any cells are too small; a common rule of thumb requires a minimum count of about five observations per cell.
Another common mistake is to use percentages rather than raw frequency counts in the calculations, an error that can greatly inflate the value of chi-square. Beyond these precautions, the statistic is useful for many purposes. As we shall see in the rank order statistics, many of those tests are also based on the chi-square distribution.

McNEMAR TEST

The chi-square statistic is most often applied to independent observations classified on categorical variables. However, many other statistical tests follow a chi-square distribution as a test statistic. Repeated observations of a group on a simple dichotomous variable (two classes) can be tested for change or difference using the McNemar test for the significance of changes. This is a simple test originally designed for before-and-after experiments, such as the effect of information on attitude change. However, it can be applied to any situation where test panelists view two products and their responses are categorized into two classes. For example, we might want to see if the number of people who report liking a product changes after the presentation of some information about the product, such as its nutritional content. Stone and Sidel (1993) give an example of using the McNemar test for changes to assess whether just-right scales show a difference between two products.

NONPARAMETRIC AND BINOMIAL-BASED STATISTICAL METHODS

General form of the chi-square test for a 2 x 2 table:

Chi-square = N(AD − BC)² / [(E)(F)(G)(H)]

with the cells and marginal totals laid out as

          A     B   | G (= A + B)
          C     D   | H (= C + D)
          ---------
          E     F   | N

where E = A + C, F = B + D, and N = A + B + C + D.

Two applications:
1. Same/Different Test: columns are the pairs presented (same, different); rows are the responses (same, different).
2. A, not-A Test: columns are the samples presented (A, not-A); rows are the responses (A, not-A).

Example (A, not-A test):

                      Sample presented:
                      A       not-A    (sum)
Response   A          30      15       45
           not-A      10      25       35
(sum)                 40      40       80

Chi-squared = 80[(30)(25) − (10)(15)]² / [(40)(40)(45)(35)] = 11.42   (p < .05)

Figure AII.2. Some uses of the chi-square test for 2 x 2 contingency tables.
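The 2 x 2 shortcut formula is easy to verify in code. The following Python sketch (ours, not from the text) computes the chi-square value for the A, not-A example, whose cell counts are 30, 15, 10, and 25:

```python
def chi_square_2x2(a, b, c, d):
    # Shortcut chi-square for a 2x2 table laid out as
    #   [a  b]
    #   [c  d]
    # chi-square = N(AD - BC)^2 / [(A+B)(C+D)(A+C)(B+D)]
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator

# A, not-A example: rows are responses, columns are samples presented
print(round(chi_square_2x2(30, 15, 10, 25), 2))  # 11.43 (truncated to 11.42 in the figure)
```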

The general form of the test classifies responses in a two-by-two matrix, with the same response categories as rows and columns. Since the test is designed to examine changes or differences, the two cells with the same values of the row and column variables are ignored. We are interested only in the other two corners of the table, where the classification differs. Such an analysis could be performed after an experiment on change in liking following presentation of nutritional information, as shown below, with the frequency counts represented by the letters a through d:

                                    Before Information Presented
                                    Number Liking      Number Disliking
                                    the Product        or Neutral
After information:
  Number liking the product         a                  b
  Number disliking or neutral       c                  d

The McNemar test calculates the following statistic:

Chi-square = (|b − c| − 1)² / (b + c)   (AII.7)

Note that the absolute value of the difference in the two cells where change occurs is taken and that the other two cells (a and d) are ignored. The obtained value must exceed the critical value of chi-square for df = 1, which is 3.84 for a two-tailed test and 2.71 for a one-tailed test with a directional alternative hypothesis. It is important that the expected cell frequencies be larger than 5. Expected frequencies are given by the sum of the two cells of interest, divided by 2. Here is an example testing for a change in preference/choice response following a taste test among 60 consumers:

                          Before Tasting
                          Prefer Product A    Prefer Product B
After tasting:
  Prefer product A        12                  33
  Prefer product B        8                   7

Chi-square = (|33 − 8| − 1)² / (33 + 8) = 576/41 = 14.05

This is larger than the critical value of 3.84, so we can reject the null hypothesis and conclude that there was a change in preference favoring product A, as suggested by the frequency counts. Although there was a 2-to-1 preference for B before tasting (marginal totals of 40 vs. 20), 33 of those 40 people switched to product A, while less than half of those preferring product A beforehand switched in the other direction. This disparity in changes drives the significance of the McNemar calculation and result.
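The McNemar computation is short enough to script directly. A minimal Python sketch (ours; the function name is an assumption, not from the text) applied to the preference example:

```python
def mcnemar_chi_square(b, c):
    # McNemar statistic with continuity correction:
    # chi-square = (|b - c| - 1)^2 / (b + c),
    # where b and c are the two "change" cells of the 2x2 table
    return (abs(b - c) - 1) ** 2 / (b + c)

# 33 panelists switched from B to A; 8 switched from A to B
print(round(mcnemar_chi_square(33, 8), 2))  # 14.05
# compare against 3.84, the critical chi-square for df = 1 (two-tailed)
```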

USEFUL RANK ORDER TESTS

The simplest case of ranking is the paired comparison, when only two items are to be ranked as to which is stronger or more preferred. Thus it is not surprising that a simple test based on comparisons between two samples is based on the same binomial statistics that we use for the paired comparison and paired preference tests. However, since the responses are not constrained to a choice format, the test can be used with any data, such as scaled responses, that have at least ordinal properties. In other words, it must be possible to use the responses to judge whether one product was given a higher score than another, or whether the two products were scored equally. Obviously, in cases of no difference, we expect all the positive rankings (product A over B) to equal all the negative rankings (product B over A), so that the binomial probability of 1/2 can be used. In the two-sample case, where every panelist scores both products, the scores can be paired. A simple nonparametric test of difference with paired data is the sign test. Since it only involves counting the rankings, it is very easy to calculate. Probabilities can be examined from the binomial cumulative table, or one- or two-tailed tables can be used (as appropriate to the hypotheses of the study) from the critical-value tables used for discrimination tests with p = 1/2 (Roessler et al., 1978). The sign test is the nonparametric parallel of the dependent groups or paired t-test and is just about as sensitive to differences for small samples. Unlike the t-test, we do not need to fulfill the assumptions of normally distributed data, so the sign test is applicable to situations with skewed data having outlying values, whereas the t-test is not. With skewed data, the t-test can be misleading, since high outliers will exert undue leverage on the value of the mean. Since the sign test only looks for consistency in the direction of comparisons, the skew or outliers are not so influential.
There are several nonparametric counterparts to the independent groups t-test. One of these, the Mann-Whitney U test, is shown later. Here is an example of the sign test. Its basic principle is simply to count the direction of paired scores and assume a 50/50 split under the null hypothesis. In this example, we have panelists scoring two products on a rating scale, and every panelist views both products. For example, these could be scores for oxidation flavors from a small QC panel, where one product is the control and a second is an aged sample in a shelf-life study.

Note how the data are paired. Plus or minus signs are given to each pairing according to whether B is greater than A (plus) or A is greater than B (minus), hence the name of the test. Ties are omitted, losing some statistical power, so the test works best at detecting trends and differences when there are not too many ties.

Panelist    Score, Product A    Score, Product B    Sign (B > A)
1           3                   5                   +
2           7                   9                   +
3           4                   6                   +
4           5                   3                   −
5           6                   6                   0
6           8                   7                   −
7           4                   6                   +
8           3                   7                   +
9           7                   9                   +
10          6                   9                   +

Count the number of plus signs (= 7), and omit ties, leaving 9 untied pairs. We can then find the probability of 7/9 in a two-tailed binomial probability table, which is .09. While this is not enough evidence to reject the null, it might warrant further testing, as there seems to be a consistent trend. The paired t-test for these data gives us a p-value of .051, suggesting a similar decision.
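The binomial probability behind the sign test can be computed directly rather than read from a table. A Python sketch (ours, not from the text): for 7 pluses out of 9 untied pairs, the upper-tail probability under p = 1/2 comes to about .09.

```python
from math import comb

def sign_test_upper_tail(n_plus, n):
    # P(X >= n_plus) for X ~ Binomial(n, 1/2):
    # the one-sided tail probability for the sign test
    return sum(comb(n, k) for k in range(n_plus, n + 1)) / 2 ** n

# 7 of 9 untied pairs favored product B (ties dropped)
print(round(sign_test_upper_tail(7, 9), 3))  # 0.09
```

Doubling this tail gives a two-sided value near .18; whether doubling is appropriate depends on the alternative hypothesis chosen before testing.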

MANN-WHITNEY U TEST

A parallel test to the independent groups t-test is the Mann-Whitney U test. It is almost as easy to calculate as the sign test and thus stands as a good alternative to the independent groups t-test when the assumptions of normal distributions and equal variance are doubtful. The test can be used for any situation in which two groups of data are to be compared and the level of measurement is at least ordinal. For example, two manufacturing sites or production lines might send representative samples of soup to a sensory group for evaluation. Mean intensity scores for saltiness might be generated for each sample, and then the two sets of scores would be compared. If no difference were present between the two sites, then rankings of the combined scores would find the two sites to be interspersed. On the other hand, if one site was producing consistently saltier soup than the other, then that site should move toward higher rankings and the other site toward lower rankings. The U test is sensitive to just such patterns in a set of combined ranks. The first step is to rank the combined data and then find the sum of the ranks for the smaller of the two groups. For a small experiment, with the larger of the two groups having fewer than 20 observations, the following formula should be used:

U = n1n2 + [n1(n1 + 1)]/2 − R1   (AII.8)

where n1 = the smaller of the two samples

n2 = the larger of the two samples

R1 = the sum of the ranks assigned to the smaller group

The next step is to test whether U has the correct form, since it may be high or low depending on the trends for the two groups. The smaller of the two forms is desired. If U is larger than n1n2/2, it is actually a value called U' and must be transformed to U by the following formula:

U = n1n2 − U'   (AII.9)

Critical values for U are shown in Table E, Appendix VI. Note that the obtained value for U must be equal to or smaller than the tabled value in order to reject the null, as opposed to many other tabled statistics where the obtained value must exceed a tabled value. If the sample size is very large, with n2 greater than 20, the U statistic can be converted to a Z-score by the following formula, analogous to a difference between means divided by a standard deviation:

z = [U − n1n2/2] / √[n1n2(n1 + n2 + 1)/12]   (AII.10)

If there are ties in the data, the standard deviation in the above formula needs adjustment as follows:

S.D. = √{ [n1n2/(N(N − 1))] [(N³ − N)/12 − ΣT] }   (AII.11)

where N = n1 + n2 and T = (t³ − t)/12, with t the number of observations tied for a given rank. This demands an extra housekeeping step where ties must be counted and the value for T computed before Z can be found. A worked example for a small sample is illustrated below. In our example of saltiness scores for soups, let's assume we have the following panel means:

Site A    Site D
4.7       8.2
3.5       6.6
4.3       4.1
5.2       5.5
4.2       4.4
2.7

So there were 6 samples (= n2) taken from site A and 5 (= n1) from site D. The ranking of the 11 scores would look like this:

Score    Rank    Site
8.2      1       D
6.6      2       D
5.5      3       D
5.2      4       A
4.7      5       A
4.4      6       D
4.3      7       A
4.2      8       A
4.1      9       D
3.5      10      A
2.7      11      A

R1 is then the sum of the ranks for site D (= 1 + 2 + 3 + 6 + 9 = 21). Plugging into Eq. (AII.8), we find that U = 30 + 15 − 21 = 24. Next, using Eq. (AII.9), we check to make sure we have U and not U' (the smaller of the two is needed). Since 24 is larger than n1n2/2 = 15, we did in fact obtain U', so we subtract it from 30, giving a value of 6. This is then compared to the critical value in Appendix VI, Table E. For these sample sizes, the U value must be 3 or smaller to reject the null at a two-tailed probability of .05, so there is not enough evidence to reject the null in this comparison. Inspection of the rankings shows a lot of overlap between the two sites, in spite of the generally higher scores at site D. The independent groups t-test on these data also gives a p-value higher than .05, so there is agreement in this case. Siegel (1956) states that the Mann-Whitney test is about 95% as powerful as the corresponding t-test.
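The small-sample U calculation can be checked with a short script. This Python sketch (ours; it assumes no tied scores, which holds for these data) reproduces the soup example and returns the corrected U of 6:

```python
def mann_whitney_u(group_small, group_large):
    # Small-sample Mann-Whitney U: rank the combined scores
    # (rank 1 = highest, matching the worked example), sum the
    # ranks of the smaller group (R1), apply Eq. AII.8, then
    # convert U' to U if needed (Eq. AII.9). Assumes no ties.
    n1, n2 = len(group_small), len(group_large)
    combined = sorted(group_small + group_large, reverse=True)
    r1 = sum(combined.index(x) + 1 for x in group_small)
    u = n1 * n2 + n1 * (n1 + 1) // 2 - r1
    if u > n1 * n2 / 2:        # we actually computed U'
        u = n1 * n2 - u
    return u

site_d = [8.2, 6.6, 4.1, 5.5, 4.4]        # n1 = 5 (smaller group)
site_a = [4.7, 3.5, 4.3, 5.2, 4.2, 2.7]   # n2 = 6
print(mann_whitney_u(site_d, site_a))  # 6
```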

RANKED DATA WITH MORE THAN TWO SAMPLES

Two tests are commonly used in sensory evaluation for ranked products where there are three or more items being compared. The Friedman "analysis of variance" on ranked data is a relatively powerful test that can be applied to any data set where all products are viewed by all panelists; that is, there is a complete ranking by each participant. The data set for the

Friedman test thus takes the same form as a one-way analysis of variance with repeated measures, except that ranks are used instead of raw scores. It could be used for any data set where the rows form a set of related or matched observations and where it is appropriate for the numbers to be converted to ranks within the row. The test is very sensitive to any pattern that consistently ranks one of the products higher or lower than the middle ranking. The statistic is compared to a chi-square value that depends on the number of items and the number of panelists. The second test that is common in sensory work is Kramer's rank sum test. Critical values of significance for this test were recalculated and published by Basker (1988) (see Table J, Appendix VI), since older tables contained some errors. Each of these methods is illustrated with an example below.

Example of the Friedman Test: Eighteen employees are asked to rank three flavor submissions for their appropriateness in a chocolate/malted milk drink. We would like to know if there is a significant overall difference among the candidates as ranked. The Friedman test constructs a chi-square statistic based on the column totals, Tj, in each of the J columns. For a matrix of K rows and J columns, we compare the obtained value to a chi-square value with J − 1 degrees of freedom. Here is the general formula:

X² = {12/[K(J)(J + 1)]}[Σ(Tj)²] − 3(K)(J + 1)   (AII.12)

Flavor Ranks

Panelist                  A       B       C
1                         1       3       2
2                         2       3       1
3                         1       3       2
4                         1       2       3
5                         3       1       2
6                         2       3       1
7                         3       2       1
8                         1       3       2
9                         3       1       2
10                        3       1       2
11                        2       3       1
12                        2       3       1
13                        3       2       1
14                        2       3       1
15                        2.5     2.5     1
16                        3       2       1
17                        3       2       1
18                        2       3       1
Sum (column total, T):    39.5    42.5    26

X² = {12/[K(J)(J + 1)]}[Σ(Tj)²] − 3(K)(J + 1)
   = {12/[18(3)(4)]}[(39.5)² + (42.5)² + (26.0)²] − 3(18)(4) = 8.58
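The Friedman statistic for the flavor example can be checked in code. A Python sketch (ours, not from the text), taking the column rank totals and panelist count as inputs:

```python
def friedman_chi_square(col_totals, k):
    # Friedman statistic (Eq. AII.12) from the J column rank
    # totals Tj and the number of panelists K:
    # X^2 = 12/(K*J*(J+1)) * sum(Tj^2) - 3*K*(J+1)
    j = len(col_totals)
    return 12 / (k * j * (j + 1)) * sum(t ** 2 for t in col_totals) - 3 * k * (j + 1)

# Flavor example: rank totals 39.5, 42.5, 26.0 over 18 panelists
print(round(friedman_chi_square([39.5, 42.5, 26.0], 18), 2))  # 8.58
```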

In the chi-square table for J − 1 degrees of freedom, in this case df = 2, the critical value is 5.99. Since our obtained value of 8.58 exceeds this, we can reject the null. This makes sense, since product C had a predominance of first rankings in this test. Note that in order to compare individual samples, we require another test. The sign test is appropriate, although if many pairs of samples are intercompared, then the alpha level needs to be reduced to compensate for the experiment-wise increase in risk. Another approach is to use the least-significant-difference (LSD) test for ranked data, where the LSD is equal to 1.96 √[NK(K + 1)/6] for K items ranked by N panelists. Items whose rank sums differ by more than this amount may be considered significantly different.

Here is an example of the Kramer rank-sum test: Three flavors are ranked by 20 panelists. The ranks are summed, and the sums are compared against Basker's recalculated minimum critical values for the rank-sum test (Table J, Appendix VI). Since only the differences between rank sums need to be compared to the critical values, there is slightly less calculation than with Friedman.

Flavor Ranks

Panelist               A       B       C
1                      1       3       2
2                      2       3       1
3                      1       3       2
4                      1       2       3
5                      3       1       2
6                      2       3       1
7                      3       2       1
8                      1       3       2
9                      3       1       2
10                     3       1       2
11                     2       3       1
12                     2       3       1
13                     3       2       1
14                     2       3       1
15                     2.5     2.5     1
16                     3       2       1
17                     3       2       1
18                     2       3       1
19                     2       2       2
20                     2       2       2
Sum (column total):    43.5    46.5    30

Differences: A vs. B = 3.0;  B vs. C = 16.5;  A vs. C = 13.5

When compared to the minimum critical difference at p < .05 (= 14.8), there is a significant difference between B and C, but not between any other pair. What about the comparison of A vs. C, where the difference was close to the critical value? A simple sign test between A and C would have yielded a 15/18 split, short of statistical significance, so both tests give the same conclusion. It is sometimes valuable to confirm the outcome with an additional test to gain confidence in the conclusion (as long as the additional test is appropriate).
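The rank-sum comparisons amount to one subtraction per pair. A Python sketch (ours; the critical difference of 14.8 is taken from the text's use of Basker's table, not computed here):

```python
from itertools import combinations

# Column rank sums from the Kramer example (20 panelists, 3 flavors)
rank_sums = {"A": 43.5, "B": 46.5, "C": 30.0}
CRITICAL_DIFF = 14.8  # tabled minimum significant difference, p < .05

for x, y in combinations(rank_sums, 2):
    diff = abs(rank_sums[x] - rank_sums[y])
    verdict = "significant" if diff > CRITICAL_DIFF else "not significant"
    print(f"{x} vs. {y}: {diff:.1f} -> {verdict}")
# only B vs. C (16.5) exceeds the critical difference
```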

THE SPEARMAN RANK ORDER CORRELATION

The common correlation coefficient, r, is also known as the Pearson product-moment correlation coefficient. It is a powerful tool for estimating the degree of linear association between two variables. However, it is very sensitive to outliers in the data. If the data do not achieve an interval scale of measurement, or have a high degree of skew or outliers, the nonparametric alternative given by Spearman's formula should be considered. The Spearman rank order correlation was one of the first to be developed (Siegel, 1956) and is commonly signified by the Greek letter ρ (rho). The statistic asks whether the two variables line up in similar rankings. Tables of significance indicate whether an association exists based on these rankings. The data must first be converted to ranks, and a difference score calculated for each pair of ranks, similar to the way differences are computed in the paired t-test. These difference scores, d, are then squared and summed. The formula for rho is as follows:

rho = 1 − [6Σd²] / (N³ − N)   (AII.13)

Thus the value for rho is easy to calculate unless there is a high proportion of ties. If greater than one-fourth of the data are tied, an adjustment should be made. The formula is very robust in the case of a few ties, with changes in rho usually only in the third decimal place. If there are many ties, a correction must be calculated for each tied case based on (t³ − t)/12, where t is the number of items tied at a given rank. These values are then summed for each variable, x and y, to give values Tx and Ty; rho is then calculated as follows:

rho = [Σx² + Σy² − Σd²] / [2√(Σx²Σy²)]   (AII.14)

where Σx² = [(N³ − N)/12] − ΣTx and Σy² = [(N³ − N)/12] − ΣTy

Suppose we wished to examine whether there was a relationship between mean chewiness scores for a set of products evaluated by a texture panel and mean scores on a scale for hardness. Perhaps we suspect that the same underlying process variable gives rise to textural problems observable in both mastication and initial bite. Mean panel scores over 10 products might look like this, with the calculation of rho following:

Product    Chewiness    Rank    Hardness    Rank    Difference    d²
A          4.3          7       5.0         6       1             1
B          5.6          8       6.1         8       0             0
C          5.8          9       6.4         9       0             0
D          3.2          4       4.4         4       0             0
E          1.1          1       2.2         1       0             0
F          8.2          10      9.5         10      0             0
G          3.4          5       4.7         5       0             0
H          2.2          3       3.4         2       1             1
I          2.1          2       5.5         7       5             25
J          3.7          6       4.3         3       3             9

The sum of the d² values is 36, so rho computes to the following:

rho = 1 − [6(36)] / (1,000 − 10) = 1 − .218 = .782

This is a moderately high degree of association, significant at the .01 level. This is obvious from the good agreement in rankings, with the exception of products I and J. Note that product F is a high outlier on both scales. This inflates the Pearson correlation to .839, as it is sensitive to the leverage exerted by this point that lies away from the rest of the data set.
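With the ranks in hand, rho takes one line of arithmetic. A Python sketch (ours, not from the text) using the two rank columns from the chewiness/hardness table:

```python
def spearman_rho(ranks_x, ranks_y):
    # No-tie Spearman formula: rho = 1 - 6*sum(d^2) / (N^3 - N)
    n = len(ranks_x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks_x, ranks_y))
    return 1 - 6 * d2 / (n ** 3 - n)

chewiness_ranks = [7, 8, 9, 4, 1, 10, 5, 3, 2, 6]  # products A through J
hardness_ranks = [6, 8, 9, 4, 1, 10, 5, 2, 7, 3]
print(round(spearman_rho(chewiness_ranks, hardness_ranks), 3))  # 0.782
```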

CONCLUSIONS

Some of the nonparametric parallels to common statistical tests are shown in Table AII.1. Further examples can be found in statistical texts such as Siegel (1956). The nonparametric statistical tests are valuable to the sensory scientist for several reasons, and familiarity with the most commonly used tests should be part of a complete sensory training program.

Table AII.1. Parametric and Nonparametric Statistical Tests¹

Purpose                        Parametric Test                 Nonparametric Parallel
Compare two products           paired (dependent)              sign test
(matched data)                 t-test on means
Compare two products           independent groups              Mann-Whitney U test
(separate groups)              t-test on means
Compare multiple products      one-way analysis of variance    Friedman test on ranks;
(complete block design)        with repeated measures          Kramer (Basker) test
Test association of two        Pearson (product-moment)        Spearman rank order
continuous variables           correlation, r                  correlation, rho

¹Nonparametric tests are performed on ranked data instead of raw numbers. Other nonparametric tests are available for each purpose; the listed ones are common.

Also, the binomial distribution forms the basis for the choice tests commonly used in discrimination testing, so it is important to know how this distribution is derived and when it approximates normality. The chi-square statistics are useful for a wide range of problems involving categorical variables and as a nonparametric measure of association. They also form the basis for other statistical tests such as the Friedman and McNemar tests. Nonparametric tests may be useful for scaled data where the interval-level assumptions are in doubt. In addition, skewed data are common with some forms of sensory measurement, such as magnitude estimation. This type of scaling tends to have a few high values above the bulk of a data set, in other words, positive skew. The analyst may choose to transform such data to make them a better approximation to the normal distribution, and logarithmic transformation is common with this scaling technique (see Chapter 7). An alternative to transformation is the application of nonparametric statistical summaries such as medians and nonparametric tests of significance. In the case of deviations from the assumptions of a parametric test, confirmation with a nonparametric test may lend more credence to the significance of a result.

REFERENCES

Basker, D. 1988. Critical values of differences among rank sums for multiple comparisons. Food Technology, 42(2), 79, 80-84.
Conover, W.J. 1980. Practical Nonparametric Statistics, 2d ed. Wiley, New York.
Hollander, M., and Wolfe, D.A. 1973. Nonparametric Statistical Methods. Wiley, New York.
O'Mahony, M. 1986. Sensory Evaluation of Food: Statistical Methods and Procedures. Dekker, New York.
Roessler, E.B., Pangborn, R.M., Sidel, J.L., and Stone, H. 1978. Expanded statistical tables estimating significance in paired-preference, paired-difference, duo-trio and triangle tests. Journal of Food Science, 43, 940-943.
Siegel, S. 1956. Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill, New York.
Smith, G.L. 1988. Statistical analysis of sensory data. In J.R. Piggott, ed. Sensory Analysis of Foods. Elsevier Applied Science, London.
Stone, H., and Sidel, J.L. 1993. Sensory Evaluation Practices, 2d ed. Academic, San Diego.

ANALYSIS OF VARIANCE

INTRODUCTION

The Parable of the Turtles

The turtles would return to the shore every spring and lay their eggs in the soft sand. The turtle eggs were greatly prized by the shore people. It was always good after a hard winter to find the spots where the eggs were buried. Finding the eggs presented two problems to the egg hunters. The first was to know what part of the beach had been disturbed. The turtles would hollow out part of the ground, leaving piles of soft sand and low spots. The beach was mostly flat, just above the high tide, so the bumps would sometimes show up clearly. However, there was a natural bumpiness to the beach in spots, so seeing the nests was sometimes harder than at other times. The hunters had to see where the bumps and hollows seemed larger than usual. The second problem was to tell the poison turtle eggs from the edible ones. The poison turtles would build smaller nests, with smaller bumps and hollows, than the good turtles, so the egg hunters would have to be careful to see that the nests looked deep enough, and the piles high enough, to ensure that the eggs were safe. An old member of the clan taught them how. By taking your boat out into the surf, you could look at the beach head on and see where the turtles had come back to nest.

Analysis of variance is the most common statistical test performed in descriptive analysis and other sensory tests where more than two products are compared using scaled responses. It provides a very sensitive tool for seeing whether treatment variables such as changes in ingredients, processes, or packaging had an effect on the sensory properties of products. As in the story above, it is a method for finding variation that can be attributed to some specific cause, against the background of existing variation due to other causes. These other unexplained causes produce the experimental error or noise in the data. As with the two breeds of turtles, we can also find differences among various treatments, and we seek ways to get the most sensitive perspective on those differences or effects. The best point of view can be achieved by good experimental design, careful control, and good testing practices, and also by taking advantage of statistical tests that give us the best separation of our treatment effects from other effects (such as panelists' scale usage habits) and from error variation.

The following sections illustrate some of the basic ideas in analysis of variance and provide some worked examples. Other designs and analyses for more complex experiments can be found in Winer (1971), Cochran and Cox (1957), Gacula and Singh (1984), and Hays (1975). As noted by O'Mahony (1986), there is an unfortunate variation in the terms used by different statisticians for the experimental designs and models in analysis of variance. Where practical, we have tried to use the same nomenclature as O'Mahony (1986), since that work is already familiar to many workers in sensory evaluation, and the terminology of Winer (1971), a classic treatise on ANOVA for behavioral data. Calculations and worked examples are given through two-factor experiments with repeated measures and simple between-groups multifactor experiments. As this guide is meant for students and practitioners, some theory and development of models has been left out. However, the reader can refer to the statistics texts cited above if deeper explanation of the underlying models is desired.

BASIC ANALYSIS OF VARIANCE, RATIONALE, AND WORKED EXAMPLE

Analysis of variance is a way to address multiple treatments or levels, i.e., to compare several means at the same time. Not all experiments have just two cases to compare; some have many levels of an ingredient or process variable, and some experiments may have several factors, with products formulated at different levels. Factors in ANOVA terminology mean independent variables, in other words, the variables that you manipulate or that are under your direct control in an experiment. Note that the statistical comparison could be achieved by performing multiple t-tests, but the risk of rejecting a true null hypothesis would become inflated.

Analysis of variance estimates the variance or squared deviations attributable to each factor. This can be thought of as the degree to which each factor or variable moves the data away from the grand or overall mean of the data set. It also estimates the variance or squared deviation due to error. Error can be thought of as other variance not attributable to the factors we manipulate. In an analysis of variance, we construct a ratio of the factor variance to the error variance. This ratio follows the distribution of an F-statistic. A significant F-ratio for a given factor implies that at least one of the individual comparisons among means is significant for that factor. The F-value represents the pooled variation of the means of our factor of interest from the overall mean of the data set, divided by the mean-squared error. In other words, we have a model where there is some overall tendency for the data, and then variation around that value. The means from each of our treatment levels, and their differences from this grand mean, represent a way to measure the effect of those treatments. However, we have to view those differences in the light of the random variation that is present in our experiment.
So, like the t- or z-statistic, the F-ratio is a ratio of signal to noise. In a simple two-level experiment with only one variable, the F-statistic is simply the square of the t-value, so there is an obvious relationship between the F- and t-statistics. The statistical distributions for F indicate whether the ratio we obtain in the experiment is one we would expect only rarely by the operation of chance. Thus we apply the usual statistical reasoning when deciding to accept or reject a null hypothesis. The null hypothesis for ANOVA is usually that the means for the treatment levels would all be equal in the parent population. Another way to think about this test is that it is a test of a model with different terms, and we wish to see if the factors driving those terms are making a real contribution beyond what might happen by simple sampling error. Analysis of variance is thus based on a linear model that says that any observation is composed of several influences: the grand mean, plus (or minus) whatever deviations are caused by each treatment factor, plus the interactions of treatment factors, plus error. The worked example below will examine this in more detail, but first a look at some of the rationale and derivation. The rationale proceeds as follows:

1. We wish to know whether there are any significant differences among multiple means, relative to the error in our experimental measures.
2. To do this, we examine variances, or squared standard deviations.
3. We look at the variance of our sample means from the overall ("grand") mean of all our data. This is sometimes called the variance due to "treatments." Treatments are just the particular levels of our independent variable.
4. This variance is examined relative to the variance within treatments, i.e., the unexplained error or variability not attributed to the treatments themselves. Rephrased, the ANOVA examines variation between treatments, relative to variation within treatments.
Variation within treatments, or error variance, can also be thought of as the variance arising from differences in the multiple measurements taken of each treatment, whether from multiple panelists, replicates, instrumental measures, or whatever. The test is done by calculating a ratio, which is distributed as an F-statistic. The F-distribution looks like a t-distribution squared but depends on the number of degrees of freedom associated with our treatments (the numerator of the ratio) and the degrees of freedom associated with our error (the denominator of the ratio).

Here is a mathematical derivation; a similar but more detailed explanation can be found in O'Mahony (1986). Variances (squared standard deviations) are noted by s², x represents each score, and M is the mean of the x scores, or (Σx)/N. Variance is based on the deviation of each score from the mean, given by

s² = Σ(x − M)²/(N − 1) = [Σx² − (Σx)²/N]/(N − 1)   (AIII.1)

This expression can be thought of as a "mean-squared deviation." For our experimental treatments, we can speak of the mean squares due to treatments, and for error, we can speak of the "mean-squared error." The ratio of these two quantities gives us the F-ratio, which we compare to the distribution of the F-statistic. Note that in the second or computational formula for s², we accumulate sums of squared observations. This forms an important step in the construction of the F-ratio. Sums of squares form the basis of the calculations in ANOVA. To calculate the sums of squares, it is helpful to think about partitioning the total variation: Total variance is partitioned into variance between treatments and variance within treatments (or error). This can also be done for the sums of squares (SS).

SS total = SS between + SS within (AIII.2)

This is useful since SS within ("error") is tough to calculate-it is like a pooled standard deviation over many treatments. However, SS total is easy! It is simply the numerator of our overall variance or

SS total = Σx² − (Σx)²/N, over all x data points   (AIII.3)

So we usually estimate SS within (error) as SS total minus SS between. A mathematical proof of how the SS can be partitioned like this is found in O'Mahony (1986, Appendix C, p. 379). Based on this idea, here is the calculation in a simple one-way ANOVA.

"One-way" merely signifies that there is only one treatment variable or fac• tor of interest. Remember, each factor may have multiple levels, which are usually the versions of the product to be compared. In the following exam• ples, we will talk in terms of products and sensory judges or panelists. LetT= a total (it is useful to work in sums). Let a= number of products (or treatments). Let b =number of panelists per treatment. The product, ab = N.

SS total = Σx² − T²/N   (AIII.3a), where T without a subscript is the grand total of all data.

O'Mahony calls T²/N a "correction factor" or C, a useful convention.

SS between = (Σ Ta²)/b − C   (AIII.3b), where the a subscript refers to the different products.

SS within = SS total - SS between

We then divide each SS by its associated degrees of freedom to get our mean squares. We have mean squares associated with products and mean squares associated with error. These two estimates of variance will be used to form our F-ratio. The sample data set is shown in Table AIII.1.

Table AIII.1. Data Set for Simple One-Way ANOVA

Panelist   Product A   Product B   Product C
1          6           8           9
2          6           7           8
3          7           10          12
4          5           5           5
5          6           5           7
6          5           6           9
7          7           7           8
8          4           6           8
9          7           6           5
10         8           8           8
Totals:    61          68          79
Means:     6.1         6.8         7.9

Question: Did the treatment we used on the products make any difference? In other words, are these means likely to represent real differences or just the effects of chance variation? The ANOVA will help address these questions. First, column totals are calculated and a grand total determined, as well as the correction factor ("C") and the sum of the squared data points.

Sums: Ta = 61 (product A), Tb = 68 (product B), Tc = 79 (product C)   (AIII.4)

Grand T (sum of sums) = 208

T²/N = (208)²/30 = 1,442.13 (O'Mahony's "C" factor)
Σx² = 1,530 (sum of all individual scores, after they are squared)

Given this information, the sums of squares can be calculated as follows:

SS total = 1,530 − 1,442.13 = 87.87

SS due to treatments ("between")   (AIII.5)

= (61² + 68² + 79²)/10 − 1,442.13 = 16.47

Next, we need to find the degrees of freedom. The total degrees of freedom in the simple one-way ANOVA are the number of observations minus 1. The degrees of freedom for the treatment factor are the number of levels minus 1. The degrees of freedom for error are the total degrees of freedom minus the treatment ("between") df.

df total = N − 1 = 29
df for treatments = 3 − 1 = 2   (AIII.6)
df for error = df total − df between = 29 − 2 = 27

Finally, a table is constructed to show the calculations of the mean squares (our variance estimates) for each factor, and then to construct the F-ratio. The mean squares are the SS divided by the appropriate degrees of freedom. Our table looks like this:

Source of Variance   SS      df   Mean Squares   F
Total                87.87   29
Between              16.47   2    8.235          3.119
Within (error)       71.4    27   2.64

A value of F = 3.119 at 2 and 27 degrees of freedom is just short of significance at p = .06. Most statistical software programs will now give an exact p-value for the F-ratio and degrees of freedom. If the ANOVA is done "by hand," the F-ratio should be compared to the critical value found in a table such as Appendix VI, Table D. We see from this table that the critical value for 2 and 27 df is 3.35.
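For readers who want to verify the table, the whole one-way calculation can be reproduced in a few lines. This is an illustrative sketch in Python (ours, not the book's) using the Table AIII.1 data; the F of 3.11 matches the printed 3.119 up to rounding of the intermediate mean squares.

```python
# One-way ANOVA on the Table AIII.1 data, following the sums-of-squares
# recipe in equations AIII.3a-AIII.6.
products = {
    "A": [6, 6, 7, 5, 6, 5, 7, 4, 7, 8],
    "B": [8, 7, 10, 5, 5, 6, 7, 6, 6, 8],
    "C": [9, 8, 12, 5, 7, 9, 8, 8, 5, 8],
}
allx = [x for scores in products.values() for x in scores]
N = len(allx)                      # 30 observations
C = sum(allx) ** 2 / N             # "correction factor" T^2/N
ss_total = sum(x * x for x in allx) - C
ss_between = sum(sum(s) ** 2 / len(s) for s in products.values()) - C
ss_within = ss_total - ss_between

df_between = len(products) - 1     # 2
df_within = N - 1 - df_between     # 27
F = (ss_between / df_between) / (ss_within / df_within)
print(round(ss_total, 2), round(ss_between, 2), round(F, 2))
# → 87.87 16.47 3.11
```

With exact p-values unavailable in the standard library, the F would still be compared to the tabled critical value of 3.35 for 2 and 27 df.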

AN INTERLUDE FOR THE CONFUSED

If this all seems a little mysterious, you are probably not alone in feeling that way. Let's look at the end-product of the ANOVA, namely, the F-ratio. What is an F-ratio? An F-ratio is a statistic generated in analysis of variance. It has a distribution that looks a lot like chi-square and is shaped like the distribution of t, only squared. It is a ratio that measures treatment effects (such as product differences) relative to error, just like a t-value or Z-value from a test on proportions. Mathematically, it is the ratio of the mean-squared differences among treatments over the mean-squared error. In the two-product situation, this is just the square of the t-value. Why does an F-ratio have two numbers for degrees of freedom? The first df is for the numerator (treatments minus one), while the second is the df for the error term in the denominator. This is necessary since the shape of the F-distribution changes depending on both degrees of freedom. Since we need to know how big F needs to get in order to leave 5% or less in the tail of the distribution, we need to know how the distribution is shaped. The degrees of freedom determine this. Note that this 5% is the upper tail cutoff; since F is based on variance, it will not have positive and negative values like t or z. In other words, F-ratio tests for ANOVA are all one-tailed. What does a significant F-ratio mean? It means that there is at least one significant difference among the means that you are comparing. In other words, the treatment you are testing had an effect. Strictly speaking, the deviation of treatment means from the grand mean of the data set was large, compared to the error, and would be a rare event under a true null hypothesis. Note that this does not tell you which pairs of means were different. After the ANOVA, there are paired comparisons in order to make conclusions about differences between specific levels.
However, the ANOVA is a very sensitive first pass at the data to see if anything was happening that would be unlikely under a true null.
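The statement above that F is just the square of t in the two-product case can be checked directly. Here is a sketch in Python with two made-up independent groups, comparing a pooled-variance (unpaired, equal-variance) t statistic to the equivalent one-way ANOVA F:

```python
import math

# Two made-up groups of ratings
g1 = [6, 7, 5, 8, 6, 7]
g2 = [8, 9, 7, 8, 9, 10]
n1, n2 = len(g1), len(g2)
m1, m2 = sum(g1) / n1, sum(g2) / n2

# Pooled-variance t statistic
ss1 = sum((x - m1) ** 2 for x in g1)
ss2 = sum((x - m2) ** 2 for x in g2)
sp2 = (ss1 + ss2) / (n1 + n2 - 2)               # pooled variance
t = (m1 - m2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))

# One-way ANOVA F for the same two groups
allx = g1 + g2
grand = sum(allx) / len(allx)
ss_between = n1 * (m1 - grand) ** 2 + n2 * (m2 - grand) ** 2
ms_error = (ss1 + ss2) / (n1 + n2 - 2)
F = ss_between / ms_error                       # df_between = 1

assert abs(F - t ** 2) < 1e-9                   # F equals t squared
```

The final assertion holds for any two groups, which is why F(1, df) and t(df) tests lead to identical conclusions.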

MULTIPLE-FACTOR ANALYSIS OF VARIANCE AND THE CONCEPT OF A LINEAR MODEL

Most foods require the analysis of multiple attributes. These attributes may vary as a function of ingredients and process variables. In many cases, we will have more than one experimental variable of interest, e.g., two or more ingredients or two or more processing changes. The most powerful, applicable statistical tool for analysis of scaled data where we have two or more independent variables (called factors) is the multifactor analysis of variance. A simple sample problem and hypothetical data set are shown in Table AIII.2. We have two sweeteners, sucrose and high-fructose corn syrup (HFCS), being blended in a food (say, a breakfast cereal), and we would like to understand the impact of each on the sweetness of the product. We vary the amount of each sweetener added to the product (2%, 4%, and 6% of each) and have a panel of four individuals rate the product for sweetness. Four panelists are probably too few for most experiments, but this example is simplified for the sake of clarity. We use three levels of each sweetener, in a factorial design. This means that every level of one factor is combined with every level of the other factor. We would like to know whether these levels of sucrose had any effect, whether the levels of HFCS had any effect, and whether the two sweeteners in combination produced any result that would not be predicted from the average response to each sweetener. This last item we call an interaction (more on this below). First, let's look at the cell means and marginal means:

Table AIII.2. Data Set for a Two-Factor Analysis of Variance

                        Factor 1: Level of Sucrose
Factor 2:               Level 1 (2%)   Level 2 (4%)   Level 3 (6%)
Level of HFCS (data:)
Level A (2%)            1, 1, 2, 4     3, 5, 5, 5     6, 4, 6, 7
Level B (4%)            2, 3, 4, 5     4, 6, 7, 5     6, 8, 8, 9
Level C (6%)            5, 6, 7, 8     7, 8, 8, 6     8, 8, 9, 7

Numbers in the table refer to the data values given by the four panelists.

                         Factor 1
Factor 2:                Level 1   Level 2   Level 3   Mean (across Factor 1)
Level A                  2.0       4.5       5.75      4.08
Level B                  3.5       5.5       7.75      5.58
Level C                  6.5       7.25      8.0       7.25
Mean (across Factor 2)   4.0       5.75      7.17      "Grand Mean" 5.63

Next, let's look at some graphs of these means to see what happened. Figure AIII.1 shows the trends in the data. Here's what happened in the analysis: The ANOVA will test hypotheses from a linear model. This model states that any score in the data set is determined by a number of factors:

Score = grand mean + factor 1 effect + factor 2 effect + interaction effect + error

In plain English, there is an overall tendency for these products to be sweet that is estimated by the grand mean. For each data point, there is some perturbation from that mean due to the first factor, some due to the second factor, some due to the particular ways in which the two factors interact or combine, and some random error process.

[Figure AIII.1. Summary of mean effects: means from the two-factor sweetener experiment, plotting mean sweetness rating against the level of sweetener Factor 1, with separate lines for Levels A, B, and C of sweetener Factor 2.]

The null hypotheses for the factor effects are based on means. The means for the null hypothesis are the population means we would expect from each treatment, averaged across all the other factors that are present. These can be thought of as "marginal means," since they are estimated by the row and column totals (we would often see them in the margins of a data matrix as calculations proceeded). We are testing whether the marginal means are likely to be equal (in the underlying population), or whether there were some systematic differences among them, and whether this variance was large relative to error, in fact so large that we would rarely expect such variation under a true null hypothesis. The ANOVA uses an F-ratio to compare the effect variance to our sample error variance. The exact calculations for this ANOVA are not presented here, but they are performed in the same way as the two-factor complete design ANOVA, sometimes called "repeated measures" (Winer, 1971), which is illustrated in a later section. The output of our ANOVA will be presented in a table like this:

Effect        Sums of Squares   df   MS      F
Factor 1      60.22             2    30.11   25.46
Error         7.11              6    1.19
Factor 2      60.39             2    30.20   16.55
Error         10.94             6    1.82
Interaction   9.44              4    2.36    5.24
Error         5.22              12   0.43

We then decide significance by looking up the critical F-ratio for our numerator and denominator degrees of freedom. If the obtained F is greater than the tabulated F, we reject the null hypothesis and conclude that our factor had some effect. The critical F-values for these comparisons are 5.14 for the sweetener factors (both significant) and 3.26 for the interaction effect, since it has a higher number of degrees of freedom (see Appendix VI, Table D). The interaction arises since the higher level of sweetener 2 in Figure AIII.1 shows a flatter slope than the lower levels. So there is some saturation or flattening out of response, a common finding at high sensory intensities.
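As a check on the two-factor table, this Python sketch (ours, not the book's) recomputes the factor and interaction sums of squares from the raw data of Table AIII.2 using the same correction-factor method as before. The two main-effect sums of squares, 60.39 (sucrose columns) and 60.22 (HFCS rows), correspond to the two factor rows of the printed table, and the interaction comes out at 9.44:

```python
# Cells of Table AIII.2: rows = HFCS levels A-C, columns = sucrose levels 1-3,
# four panelist ratings per cell.
cells = [
    [[1, 1, 2, 4], [3, 5, 5, 5], [6, 4, 6, 7]],   # HFCS level A
    [[2, 3, 4, 5], [4, 6, 7, 5], [6, 8, 8, 9]],   # HFCS level B
    [[5, 6, 7, 8], [7, 8, 8, 6], [8, 8, 9, 7]],   # HFCS level C
]
allx = [x for row in cells for cell in row for x in cell]
N = len(allx)                                     # 36 observations
C = sum(allx) ** 2 / N                            # correction factor T^2/N

row_totals = [sum(sum(cell) for cell in row) for row in cells]
col_totals = [sum(sum(cells[r][c]) for r in range(3)) for c in range(3)]

ss_hfcs = sum(t * t for t in row_totals) / 12 - C       # 12 obs per row
ss_sucrose = sum(t * t for t in col_totals) / 12 - C    # 12 obs per column
ss_cells = sum(sum(cell) ** 2 / 4 for row in cells for cell in row) - C
ss_interaction = ss_cells - ss_hfcs - ss_sucrose        # leftover cell variation

print(round(ss_sucrose, 2), round(ss_hfcs, 2), round(ss_interaction, 2))
# → 60.39 60.22 9.44
```

The interaction sum of squares is what remains of the between-cell variation after both main effects are removed, which is exactly the "combination effect" described above.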

A Note About Interactions

What is an "interaction"? Unfortunately, the word has both a common meaning and a statistical meaning. The common meaning is when two things act on or influence one another. The statistical meaning is similar, but it does not imply that a physical interaction, say, between two food chemicals, occurred. Instead, the term "interaction" means that the effect of one variable was different depending on the level of the other variable. Two examples of interaction follow. For the sake of simplicity, only means are given to represent two variables at two points each.

[Figure AIII.2. Interaction effects. Upper panel: a magnitude interaction (mean firmness ratings of two products by two panels, with different slopes); lower panel: a crossover interaction (mean acceptability ratings of two products by two consumer groups, reversing order).]

In the first example, we see two panels evaluating the firmness of texture of two food products. One panel sees a big difference between the two products while the second finds only a small difference. This is visible as a difference in the slope of the lines connecting the product means. Such a difference in slope is called a magnitude interaction when both slopes have the same sign, and it is fairly common in sensory research. For example, panelists may all evaluate a set of products in the same rank order but may be more or less fatigued across replications. Thus decrements in scores will occur for some panelists more than others, creating a panelist-by-replicate interaction. The second example of an interaction is a little less common. In this case the relative scores for the two products change position from one panel to the other. One panel sees product 1 as deserving a higher rating than product 2, while the other panel finds product 2 to be superior. This sort of interaction can happen with consumer acceptance ratings when there are market segments, or in descriptive analysis if one panel misunderstands the scale direction or it is misprinted on the ballot (e.g., with end-anchor words reversed). This is commonly called a crossover interaction. Figure AIII.2 shows these interaction effects.

ANALYSIS OF VARIANCE FROM COMPLETE BLOCK DESIGNS AND PARTITIONING OF PANELIST VARIANCE

The concept behind the complete block analysis of variance for sensory data is simply that all panelists view all products, or all levels of our treatment variable (Gacula and Singh, 1984). This type of design is also called the "repeated measures" analysis of variance in the behavioral sciences (Winer, 1971; O'Mahony, 1986). The design is analogous to the dependent or paired observations t-test, but considers multiple levels of a variable, not just two. Like the dependent t-test, it has added sensitivity, since the variation due to panelist differences can be partitioned from the analysis, in this case taken out of the error term. When the error term is reduced, the F-ratio due to the treatment or variable of interest will be larger, so it is "easier" to find statistical significance. This is especially useful in sensory evaluation, where panelists, even well-trained ones, may use different parts of the scale, or they may simply have different sensitivities to the attribute being evaluated. When observations are correlated, i.e., all subjects rank order treatments the same, the complete block ANOVA will usually produce a significant difference between products, in spite of panelists using different ranges of the scale. The important idea for sensory research is that variation due to panelist idiosyncrasies can be estimated and partitioned. Given the differences among individual humans in their sensory functions, this is an important ability. The example below shows the kind of situation where a complete block analysis, like the dependent t-test, will have added value in finding significant differences. In this example, two ratings by two subjects are shown in Figure AIII.3. The differences between products, also called "within-subject differences," are in the same direction and of the same magnitude. However, the differences between panelists in their preferred part of the scale are quite large. In any conventional analysis, this variance between people would swamp the product effect by creating a large error term. However, the panelist differences can be pulled out of the error term in a complete block design using repeated measures analysis.

[Figure AIII.3. Panelist vs. product variation: line scale ratings for two products, A and B, by two panelists, Judge 1 and Judge 2. The two hypothetical panelists agree on the rank order of the products and the approximate sensory difference, but use different parts of the scale. The differences can be separated into the between-products (within-panelist) differences and the difference between the panelists in the overall part of the scale they have used. The dependent t-test separates these two sources of difference by converting the raw scores to difference scores (between products) in the analysis. This provides a more sensitive comparison than leaving the interindividual difference in the error term.]

Note that the term "repeated measures" is another piece of statistical jargon that is sometimes interpreted differently by different researchers. Do not confuse it with simple replications.
In this case, we mean it to represent the partitioning of panelist variance in the analysis of variance that is possible when a complete design is used, i.e., when each and every panelist evaluated all of the products in the experiment. Note: The within-subject effects in repeated measures terminology correspond to between-treatment effects in simple ANOVA terminology. This can be confusing. To see the advantage of this analysis, we will show examples with and without the partitioning of panelist variance. Here is a worked example, first without partitioning panelist effects, or as if there were three independent groups evaluating each product. This is a simple one-way ANOVA as shown for the data in Table AIII.1. Three products are rated. They differ in having three levels of an ingredient. The sample data set is shown in Table AIII.3, with one small change from the first example of simple ANOVA. Here is how the one-way ANOVA would look:

Sums: Ta = 54, Tb = 62, Tc = 74
Grand T (sum of sums) = 190
T²/N = (190)²/30 = 1,203.3 (O'Mahony's "C" factor)   (AIII.7)
Σx² = 1,352
SS total = 1,352 − 1,203.3 = 148.7

SS due to products = (Σ Tproduct²)/npanelists − C   (AIII.8)

= (54² + 62² + 74²)/10 − 1,203.3 = 20.3   (AIII.9)

SS error = SS total − SS products = 148.7 − 20.3 = 128.4

MS products = 20.3/2 = 10.15 (degrees of freedom are 2 for three treatments)
MS error = 128.4/27 = 4.76 (df error = df total − df products = 29 − 2 = 27)

F(2, 27) = MS products/MS error = 10.15/4.76 = 2.13   (AIII.10)

For 2 and 27 degrees of freedom, this gives p = .14 (p > .05, not significant). The critical F-ratio for 2 and 27 degrees of freedom is 3.35, as shown in Appendix VI, Table D. Now, here is the difference in the complete block ANOVA. An additional computation requires row sums and sums of squares for the row variable, which is our panelist effect, as shown in Table AIII.4. In the one-way analysis, the data set was analyzed as if there were 30 different people contributing the ratings. Actually, there were 10 panelists who viewed all products. This fits the requirement for a complete block design. We can thus further partition the error term into an effect due to panelists ("between-subjects" effect) and residual error. To do this, we need to estimate the effect due to interpanelist differences. Take the sum across rows (to get panelist sums), then square them and sum again down the

Table AIII.3. Data Set for the Complete Block Design

Panelist   Product A   Product B   Product C
1          6           8           9
2          6           7           8
3          7           10          12
4          5           5           5
5          6           5           7
6          5           6           9
7          7           7           8
8          4           6           8
9          7           6           5
10         1           2           3

Table AIII.4. Data Set for the Complete Block Design Showing Panelist Calculations (Rows)

Panelist   Product A   Product B   Product C   Σpanelist   (Σpanelist)²
1          6           8           9           23          529
2          6           7           8           21          441
3          7           10          12          29          841
4          5           5           5           15          225
5          6           5           7           18          324
6          5           6           9           20          400
7          7           7           8           22          484
8          4           6           8           18          324
9          7           6           5           18          324
10         1           2           3           6           36
Totals     54          62          74          190         3,928

(Σx² = 1,352; Σ(Σpanelist)² = 3,928)

new column, as shown in Table AIII.4. The panelist sum of squares is analogous to the product sum of squares, but now we are working across rows as well as down the columns.

SS panelists = [Σ(Σpanelist)²]/(number of products) − C = 3,928/3 − 1,203.3 = 106.0   (AIII.11)

"C," once again, is the "correction factor," or the grand total squared, divided by the number of observations. In making this calculation, we have used 9 more degrees of freedom from the total, so these are no longer available to our estimate of error df below. A new sum of squares for residual error can now be calculated:

SS error = SS total − SS products − SS panelists = 148.7 − 20.3 − 106.0 = 22.36   (AIII.12)

and the mean square for error (MS error) is SS error/18 = 22.36/18 = 1.24

Note that there are now only 18 degrees of freedom left for the error, since we took another 9 to estimate the panelists' variance. However, the mean-square error has shrunk from 4.76 to 1.24. Finally, a new F-ratio for the product effect ("within subjects") has removed the between-subjects effect from the error term:

F = MS products/MS error = 10.15/1.24 = 8.17   (AIII.13)

At 2 and 18 degrees of freedom, this is significant at p = .003.
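The partitioning in this complete block example can be reproduced from the data of Table AIII.3. Here is an illustrative Python sketch (ours, not the book's); the exact F works out to about 8.14, in line with the printed 8.17, the small gap coming from rounding of the intermediate sums of squares in the text.

```python
# Complete block (repeated measures) one-way ANOVA on the Table AIII.3 data.
scores = [  # rows = panelists 1-10, columns = products A, B, C
    [6, 8, 9], [6, 7, 8], [7, 10, 12], [5, 5, 5], [6, 5, 7],
    [5, 6, 9], [7, 7, 8], [4, 6, 8], [7, 6, 5], [1, 2, 3],
]
allx = [x for row in scores for x in row]
N = len(allx)
C = sum(allx) ** 2 / N                            # correction factor

ss_total = sum(x * x for x in allx) - C
col_totals = [sum(row[j] for row in scores) for j in range(3)]
ss_products = sum(t * t for t in col_totals) / 10 - C
ss_panelists = sum(sum(row) ** 2 for row in scores) / 3 - C
ss_error = ss_total - ss_products - ss_panelists  # residual after partitioning

df_products, df_panelists = 2, 9
df_error = (N - 1) - df_products - df_panelists   # 18
F = (ss_products / df_products) / (ss_error / df_error)
print(round(ss_panelists, 1), round(ss_error, 1), round(F, 2))
# → 106.0 22.4 8.14
```

Removing the line that subtracts ss_panelists (and restoring the 9 degrees of freedom to the error) reproduces the nonsignificant one-way result, which is the whole point of the comparison.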

Why was this significant when panelist variance was partitioned, but not in the usual one-way ANOVA? The answer lies in the systematic variation due to panelists' scale use and the ability of the two-way ANOVA to remove this effect from the error term. Making error smaller is a general goal of just about every sensory study, and here we see a powerful way to do this mathematically. A few warnings about the repeated measures ANOVA from complete block designs: The model underlying the repeated measures analysis of variance from complete block designs has more assumptions than the simple one-way ANOVA. Most important among these is an assumption that the covariance (or degree of relationship) among all pairs of treatment levels is the same. Unfortunately, this is rarely the case in observations of human judgment. Consider the following experiment: We have sensory judges examine an ice cream batch for shelf-life in a heat shock experiment on successive days. We are lucky enough to get the entire panel to sit for every experimental session, so we can use a complete block or repeated measures analysis. However, the panel is also becoming involved in testing frozen yogurt, and their frame of reference is being altered slightly, a trend that seems to be affecting the data as time progresses. So their data from adjacent days are more highly correlated than their data from the first and last test. Such time-dependencies violate the assumption of homogeneity of covariance. But all is not lost. A few statisticians have suggested some solutions, if the violations are not too bad (Greenhouse and Geisser, 1959; Huynh and Feldt, 1976). Both of these techniques adjust your degrees of freedom in a conservative manner to try to account for a violation of the assumptions and still protect you from that terrible deed of making a Type I error.
The corrections are via an "epsilon" value that you will sometimes see in ANOVA package printouts, and adjusted p-values, often abbreviated G-G or H-F. If the violations are really bad, there is a test for this called "sphericity" and a criterion that can be applied (Huynh and Mandeville, 1979). The recommendation in the case of lack of sphericity is to use a multivariate analysis of variance approach, or MANOVA, which does not labor under the covariance assumptions of repeated measures. Since most packaged printouts give you MANOVA statistics nowadays anyway, it doesn't hurt to give them a look and see if your conclusions about significance would be any different.

The Value of Using Panelists as Their Own Controls

The data set in the complete block example was quite similar to the data set used in the one-way ANOVA illustrated first. The only change was in panelist 10, who rated the products all as an 8 in the first example. In the second example, this nondiscriminating panelist was removed and data were substituted from an insensitive panelist, but one with correct rank ordering. This panelist rated the products 1, 2, and 3, following the general trend of the rest of the panel, but on an overall lower level of the scale. Notice the effect of substituting a person who is an outlier on the scale but who discriminates the products in the proper rank order. Because his values are quite low, he adds more to the overall variance than to the product differences, so the one-way ANOVA goes from nearly significant (p = .06) to much less evidence against the null (p = .14). In other words, the panelist who didn't differentiate the products but who sat in the middle of the data set was not very harmful to the one-way ANOVA, while the panelist with low values is seen as errorful, even though he or she discriminated among the products. Since the complete block design allows us to partition out overall panelist differences and focus just on product differences, the fact that he or she was a low rater does not hurt this analysis. In fact, the F-ratio for products is now highly significant (p = .003). In general, the panelists are monotonically increasing, with the exceptions of panelists 4, 5, and 9 (dotted lines), as shown in Figure AIII.4. The panelist with low ratings follows the majority trend and thus helps the situation.

[Figure AIII.4. Panelist trends in the repeated measures example: each panelist's ratings plotted across products A, B, and C. Note that panelist 10 has rank-ordered the products in the same way as the panel means, but is a low outlier. This panelist is problematic for the one-way ANOVA, but less so when the panelist effects are partitioned as in the repeated-measures models.]

The same statistically significant result is obtained in a Friedman "analysis of variance" on ranks (see Appendix II). A potential insight here is the following: Having a complete design allows repeated measures ANOVA. This allows us to "get rid of" panelist differences in scale usage, sensory sensitivity, anosmia, etc., and focus on product trends. Since human beings are notoriously hard to calibrate, this is highly valuable in sensory work.
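For reference, the Friedman statistic for the same Table AIII.3 data can be computed by hand. This Python sketch uses average ranks for ties and omits the tie correction, so software that applies the correction will report a slightly different value; the statistic is compared to chi-square with k − 1 = 2 degrees of freedom (critical value 5.99 at the .05 level):

```python
# Friedman rank test on the complete block data (rows = panelists).
scores = [
    [6, 8, 9], [6, 7, 8], [7, 10, 12], [5, 5, 5], [6, 5, 7],
    [5, 6, 9], [7, 7, 8], [4, 6, 8], [7, 6, 5], [1, 2, 3],
]

def ranks(row):
    """Average ranks within one panelist's row (ties share the mean rank)."""
    return [sum(1 for y in row if y < x) + (sum(1 for y in row if y == x) + 1) / 2
            for x in row]

n, k = len(scores), len(scores[0])
rank_sums = [sum(ranks(row)[j] for row in scores) for j in range(k)]
chi2 = 12 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3 * n * (k + 1)
print(rank_sums, round(chi2, 2))   # → [14.5, 18.5, 27.0] 8.15
```

Since 8.15 exceeds the critical chi-square of 5.99, the rank test agrees with the repeated measures ANOVA that the products differ.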

SENSORY PANELISTS: FIXED OR RANDOM EFFECTS

In a fixed effects model, specific levels are chosen for a treatment variable, levels that may be replicated in other experiments. Common examples of fixed-effect variables might be ingredient concentrations, processing temperatures, or times of evaluation in a shelf-life study. In a random effects model, the values of a variable are chosen by being randomly selected from the population of all possible levels. Future replications of the experiment might or might not select this exact same level, person, or item. The implication is that future similar experiments would also seek another random sampling rather than targeting specific levels or spacing of a variable. In this ANOVA model, the particular level chosen is thought to exert a systematic influence on scores for other variables in the experiment. In other words, interaction is assumed. Examples of random effects in experimental design are common in the behavioral sciences. Words chosen for a memory study or odors sampled from all available odor materials for a recognition screening test are random, not fixed, stimulus effects. Such words or odors represent a more-or-less random choice from among the entire set of such possible words or odors and do not represent specific levels of a variable that we have chosen for study. Furthermore, we wish to generalize to all such possible stimuli and make conclusions about the parent set as a whole, and not just the words or odors that we happened to pick. In some cases, products in a sensory study would be random effects if they represented a sampling of a large group of possible products to include without any systematic concern for the identity or specific characteristics of each product. Sensory panelists are also an example of a random effect. Unfortunately, an ongoing issue is whether sensory panelists are ever fixed effects.
The fixed-effects model is simpler, and it is the one most people learn in a beginning statistics course, so it has unfortunately persisted in the literature even though behavioral science suggests that human beings are a random effect as they are used in sensory work. In some cases, between-groups variables, like age group or gender, represent a fixed effect, since those categorizations have been specifically chosen. So the group variable can be a fixed effect, but the participants within groups are still a random effect.

While they are never truly randomly sampled, they meet the criteria of being a sample of a larger population of potential panelists and of not being available for subsequent replications (for example, in another lab). Each panel has variance associated with its composition; that is, it is a sample of a larger population. Also, each product effect includes not only the differences among products and random error but also the interaction of each panelist with the product variable. For example, panelists might have steeper or shallower slopes for responding to increasing levels of the ingredient that forms the product variable. This common type of panelist interaction necessitates the construction of F-ratios with interaction terms in the denominator. The use of an interaction term as the denominator for error has important consequences for the analysis and its sensitivity. Using the wrong error term can lead to erroneous rejection of the null hypothesis. Sensory scientists who mistakenly choose the wrong error term because they used the simple fixed-effects ANOVA will be taking additional risks due to incorrect estimation of the appropriate F-ratio. Fixed effects, on the other hand, tend to be specific levels of a variable that experimenters are interested in, whereas random effects are samples of a larger population to which they wish to generalize the other results of the experiment. Sokal and Rohlf (1981) make the following useful distinction: "[Fixed vs. random effects models depend] on whether the different levels of that factor can be considered a random sample of more such levels, or are fixed treatments whose differences the investigator wishes to contrast" (p. 206).

This view has not been universally applied to panelist classification within the sensory evaluation community. Here are some common rejoinders to this position: When panelists get trained, are they no longer a random sample and therefore a fixed effect? The answer is that they weren't really a random sample to begin with, and even the best-designed consumer study is never truly a random sample. The random sample argument is a kind of red herring. The panelists probably weren't chosen to represent specific levels of a variable (age, sex, etc.), and even if they are screened for acuity, the screening criteria are not going to form a group comparison in further descriptive analyses. Most importantly, we wish to generalize these results to any panel of different people, similarly screened and trained, from the population of qualifying individuals. Hays (1973) puts this in perspective by stating that even though the sample has certain characteristics, that doesn't invalidate its status as a sample of a larger group:

Granted that only subjects about whom an inference is to be made are those of a certain age, sex, ability to understand instructions, and so forth, the experimenter would, nevertheless, like to extend his inference to all possible such subjects (p. 552).

The second complaint concerns the use of the interaction term in AN OVA. We can assume no interaction in the model, or even test for the existence of an interaction. The answer is that you can, but why choose a laxer and riskier model, inflating your chance of Type I error? You already have violated the assumptions of your statistical analysis in several ways: The variables are not really random variables, they are not usually normally distributed, and they are not truly interval scales and worthy of . Each of these violations misestimates the true chance of making a Type I error, so why make the situation even worse, when the correct analysis is simple to do? Also, the test for no interaction, when successful, depends on a failure to reject the null, which is an ambiguous result, since it can happen from a sloppy experiment with high error variance, just as well as from a situation where there is truly no effect. Next, let's take a closer look at this panelist by sample interaction. At issue is what is the correct error term for ANOVA. If panelists are a fixed effect, the smallest within-cell error term is correct. However, if panelists are a random effect (most commonly they are), the panelist-by-sample

Figure AIII.5. An example of panelist crossover interaction that would enter into an error term for a model where panelists are a random effect. [Line plot of scale values for Panelists 1, 2, and 3 and their means across Low, Medium, and High moisture levels, with one panelist crossing over the others.]

(P x S) interaction is the correct error term. The F-ratio model is the following:

F (samples) = [(sample variance) + (P x S variance) + error] / [(P x S variance) + error]     (AIII.14)

In other words, since the expected mean squares for samples contain the P x S variance, we must divide by the P x S mean square in order to construct a proper F-ratio. An important underlying principle of the F-ratio is that the numerator and denominator should differ by only one term. Using the fixed-effects error term alone in the denominator results in the construction of an improper F-ratio. Here is an example of this type of "error." Three panelists rate a product for moistness. One panelist is confused and shows a crossover interaction. The data points and means are shown in Figure AIII.5. Here's another way to think about this: It's not just the spread in the data at each point that is part of the error but also the change in slope for each panelist when looking at the effect of interest (in this case, moisture levels). Other approaches have been taken to this issue. Stone and Sidel (1993) recommended examining the panelist interaction for significance and using it as an error term only if it is significant. However, this strategy may result in using the smaller error term when the decision is based on a failure to reject the null. The lack of a significant interaction effect may occur for a number of reasons, so this approach entails additional risk.
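To make the choice of error term concrete, here is a minimal sketch (not from the original text; the data and function name are hypothetical) that computes the mixed-model F for samples from a panelists-by-samples table, using the P x S interaction mean square as the denominator per equation AIII.14:

```python
import numpy as np

def mixed_model_f(table):
    """Two-way ANOVA for a panelists x samples table (one score per cell).
    With panelists as a random effect, the panelist-by-sample (P x S)
    interaction is the residual and serves as the error term for the
    sample F-ratio (cf. equation AIII.14)."""
    x = np.asarray(table, dtype=float)
    p, s = x.shape
    grand = x.mean()
    ss_samples = p * ((x.mean(axis=0) - grand) ** 2).sum()
    ss_panel = s * ((x.mean(axis=1) - grand) ** 2).sum()
    # With one score per cell, the P x S interaction is what remains
    ss_ps = ((x - grand) ** 2).sum() - ss_samples - ss_panel
    ms_samples = ss_samples / (s - 1)
    ms_ps = ss_ps / ((p - 1) * (s - 1))
    return ms_samples / ms_ps

# Hypothetical moistness ratings (3 panelists x 3 moisture levels);
# panelist 3 crosses over, inflating the P x S error term and
# wiping out the otherwise rising sample effect.
f = mixed_model_f([[3, 5, 7],
                   [4, 6, 8],
                   [7, 5, 3]])   # F = 0.25
```

Swapping the third panelist's row for another rising pattern makes the same sample means highly significant, which is exactly the point of Figure AIII.5: crossover disagreement among panelists belongs in the error term.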

PLANNED COMPARISONS BETWEEN MEANS FOLLOWING ANOVA

Finding a significant F-ratio in the ANOVA is only one step in the statistical analysis of experiments with more than two products. In most experiments it is also necessary to compare the treatments to see which pairs were different. A number of techniques are available to do this, most of them based on variations of the t-test. The rationale is to avoid the inflated risk of Type I error that would be inherent in making comparisons simply by repeating t-tests. For example, the Duncan test attempts to maintain "experiment-wise" alpha at .05. In other words, across the entire set of paired comparisons of the product means, we would like to keep alpha-risk at a maximum of 5%. Since risk is a function of the number of tests, the critical value of the t-statistic is adjusted to maintain risk at an acceptable level. Different approaches exist, differing in assumptions and in the degree of "liberality," that is, the amount of evidence needed to reject the null. Common types include the Scheffe test; the Tukey or HSD (honestly-significant-difference) test; the Newman-Keuls test; the Duncan test; and the LSD (least-significant-difference) test. The Scheffe test is the most conservative. The Duncan procedure guards against Type I error among a set of comparisons when there is already a significant effect noted in the ANOVA for the appropriate factor containing the means of interest (Gacula and Singh, 1984). This is a good compromise test to use for sensory data. An example in Winer (1971, pp. 200-201) shows how some of the different tests differ in the critical differences that are required. The general pattern is that the Scheffe test requires the largest difference between totals and the LSD test the least. Also, the LSD test does not change based on the number of means being compared, so the LSD test is the least conservative. This may have value when the risk of missing a difference (Type II error) has important consequences. The LSD test and the Duncan test are illustrated below.
The least significant difference or LSD test is quite popular, since you simply compute the difference between means required for significance, based on your error term from the ANOVA. The good part about this is that the error term is a nicely pooled estimate of error considering all your treatments simultaneously. However, the LSD test does very little to protect you from making too many comparisons, since the critical values do not go up with the number of means to compare, as is the case with some of the other statistics. So you are relying on your ANOVA to ensure that you have not committed Type I error.

LSD = t √[2(MS error) / n]     (AIII.15)

where n = the number of panelists in a one-way ANOVA or one-factor repeated measures design, and t = the t-value for a two-tailed test with the degrees of freedom for the error term.
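Equation AIII.15 can be sketched in a few lines of code (the function name and example values below are our own; the t-value would come from a standard table):

```python
import math

def lsd(t_crit, ms_error, n):
    """Least significant difference (equation AIII.15):
    LSD = t * sqrt(2 * MS_error / n), where n is the number of
    observations per mean and t_crit is the two-tailed t-value
    at the error degrees of freedom."""
    return t_crit * math.sqrt(2.0 * ms_error / n)

# Hypothetical values: MS_error = 0.75 on 6 df, n = 4 observations per
# mean; t(.05, two-tailed, 6 df) = 2.447 from a standard t-table.
critical_gap = lsd(2.447, 0.75, 4)   # about 1.50; any two means
                                     # differing by more are declared different
```

Note that the result depends only on the pooled error term and n, not on how many means are being compared, which is why the LSD test is the least conservative of the group.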

Calculations for the Duncan multiple-range test use a "studentized range statistic," usually abbreviated with a lower case q. The general formula for comparing pairs of individual means is the quantity

q = (mean 1 - mean 2) / √(MS error / n)     (AIII.16)

This calculated value in any comparison must exceed a tabled value of q based on the number of means separating the two we wish to compare, when all the means are rank ordered. MS error is the error term associated with that factor in the ANOVA from which the means originate, n is the number of observations contributing to each mean, and q(p) is the studentized range statistic from Duncan's tables (see Appendix VI, Table G). The subscript, p, indicates the number of means between the two we are comparing (including themselves), when they are rank ordered. The degrees of freedom are n - 1. Note that the quantity is very much like a t-test. Consider these general steps:

1. Run the ANOVA; find the MS error term.
2. Determine the means for comparison.
3. Rank order the means.
4. Find q values for p (number of means between the pair, plus 2) and n - 1 df.
5. Find the critical differences that must be exceeded by the differences between the means.

These critical differences are useful when you have many means to compare. Here is a sample problem. From a simple one-way ANOVA on four observations, the means were as follows: treatment A = 9, treatment B = 8, and treatment C = 5.75. If we compare treatments A and C, the calculated value of q becomes

q = (9 - 5.75) / √(0.75/4) = 3.25 / 0.433 = 7.5     (AIII.17)

The critical value of q, for p = 3 and alpha = .05, is 4.516, so we can reject the null. An alternative computation is to find a critical difference by multiplying the tabled value of q times the denominator (our error term) to find the difference between the means that must be exceeded for significance. This is sometimes easier to tabulate if you are comparing a number of means "by hand." In the above example, using the steps above for finding a critical difference, we multiply q (or 4.516) times the denominator term for the pooled standard error (0.433), giving a critical difference of 1.955. Since 9 - 5.75 (= 3.25) exceeds the critical difference of 1.955, we can conclude that these two samples were different.
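Both routes through the Duncan comparison can be sketched as follows (the function names are our own; the numbers replicate the worked example, so small rounding differences from the hand calculation are expected):

```python
import math

def duncan_q(mean_a, mean_b, ms_error, n):
    """Pairwise studentized-range statistic (equation AIII.16):
    q = |Ma - Mb| / sqrt(MS_error / n)."""
    return abs(mean_a - mean_b) / math.sqrt(ms_error / n)

def critical_difference(q_tabled, ms_error, n):
    """Smallest difference between two means that reaches significance:
    the tabled q (for p means spanned) times the pooled standard error."""
    return q_tabled * math.sqrt(ms_error / n)

# Values from the worked example: MS_error = 0.75, n = 4 observations
# per mean, treatments A = 9 and C = 5.75, tabled q = 4.516 for p = 3.
q = duncan_q(9, 5.75, 0.75, 4)             # 3.25 / 0.433, about 7.5
cd = critical_difference(4.516, 0.75, 4)   # about 1.955
```

Either comparison leads to the same decision: the computed q (7.5) exceeds the tabled q (4.516), just as the observed difference (3.25) exceeds the critical difference (1.955).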

TWO-WAY ANOVA FROM RANDOMIZED COMPLETE BLOCK DESIGNS

A common design in sensory analysis is the two-way ANOVA with all panelists rating all products. This design would be useful with a descriptive panel, for example, where all panelists commonly evaluate all products (a complete block design). A common example is when there is one treatment variable plus replications. Each score is a function of the panelist effect, treatment effect, replication effect, interactions, and error. The error terms for treatment and replication are the interaction effects of each with panelists, and these form the denominators of the F-ratios. This is done since panelist effects are random effects, and the panelist-by-treatment interaction is embedded in the expected mean squares for treatments. For purposes of this example, treatments and replications are considered fixed effects (this is a mixed model, or Type III in the SAS language). The sample data set is shown in Table AIII.5. Once again there is a small number of panelists, so that the calculations will be a little simpler. However, in most real sensory studies the panel size would be considerably larger, e.g., 10 to 12 for descriptive data and 50 to 100 for consumer studies. The underlying model says that the total variance is a function of the product effect, replicate effect, panelist effect, the three two-way interactions, the three-way interaction, and random error. We have no estimate of the smallest within-cell error term other than the three-way interaction. Another way to think about this is that each score deviates from the grand mean as a function of that particular product mean, that particular panelist mean, that particular replication mean, plus (or minus) any disturbing influences from interactions. Step 1. First, we calculate main effects and sums of squares. As in the one-way ANOVA with repeated measures, there are certain values we need to accumulate:

Grand total = 135
(Grand total)² / N = "correction factor" = 18,225 / 18 = 1,012.5
Sum of squared data = 1,083

There are three "marginal sums" we need to calculate in order to estimate main effects. The product marginal sums (across panelists and reps):

Table AIII.5. Data Set for a Two-Factor ANOVA with Partitioning of Panelist Variation

                  Rep 1               Rep 2
Product        A    B    C         A    B    C
Panelist 1     6    8    9         4    5   10
Panelist 2     6    7    8         5    8    8
Panelist 3     7   10   12         6    7    9

ΣA = 34   (ΣA)² = 1,156
ΣB = 45   (ΣB)² = 2,025
ΣC = 56   (ΣC)² = 3,136

The sum of squares for products then becomes: SSprod = [(1,156 + 2,025 + 3,136) / 6] - correction factor = 1,052.83 - 1,012.5 = 40.33

(We need the value 1,052.83 later. Let's call it PSS1, for "partial sum of squares.") Similarly, we calculate replicate and panelist sums of squares. The replicate marginal sums (across panelists and products) are:

Σrep1 = 73   (Σrep1)² = 5,329
Σrep2 = 62   (Σrep2)² = 3,844

The sum of squares for replicates then becomes: SSreps = [(5,329 + 3,844) / 9] - correction factor = 1,019.22 - 1,012.5 = 6.72

Note: the divisor, 9, is not the number of reps (2) but the number of panelists times the number of products (3 x 3 = 9). Think of this as the number of observations contributing to each marginal total. (We need the value 1,019.22 later in calculations. Let's call it PSS2.) As in other repeated measures designs, we need the panelist sums (across products and reps):

Σpan1 = 42   (Σpan1)² = 1,764
Σpan2 = 42   (Σpan2)² = 1,764
Σpan3 = 51   (Σpan3)² = 2,601

The sum of squares for panelists then becomes: SSpan = [(1,764 + 1,764 + 2,601) / 6] - correction factor = 1,021.5 - 1,012.5 = 9.00 (We will need the value 1,021.5 later in calculations. Let's call it PSS3, for partial sum of squares #3.)

Step 2. Next, we need to construct summary tables of interaction sums. (a) Here are the rep-by-product interaction calculations:

         rep1   rep2      Squared values
A         19     15        361     225
B         25     20        625     400
C         29     27        841     729

Sum of squared values = 3,181;  3,181/3 = 1,060.33 = PSS4

To calculate the sum of squares, we need to subtract the PSS values for each main effect and then add back the correction term (this is dictated by the underlying variance component model):

SS rep x prod = (3,181/3) - PSS1 - PSS2 + C = 1,060.33 - 1,052.83 - 1,019.22 + 1,012.5 = 0.78

(b) Here are the panelist-by-product interaction calculations:

         pan1   pan2   pan3      Squared values
A         10     11     13        100    121    169
B         13     15     17        169    225    289
C         19     16     21        361    256    441

Sum of squared values = 2,131;  2,131/2 = 1,065.5 = PSS5

SS pan x prod = (2,131/2) - PSS1 - PSS3 + C = 1,065.5 - 1,052.83 - 1,021.5 + 1,012.5 = 3.67

(c) Here are the rep-by-panelist interaction calculations:

         rep1   rep2      Squared values
pan1      23     19        529    361
pan2      21     21        441    441
pan3      29     22        841    484

Sum of squared values = 3,097;  3,097/3 = 1,032.33 = PSS6

SS rep x pan = (3,097/3) - PSS2 - PSS3 + C = 1,032.33 - 1,019.22 - 1,021.5 + 1,012.5 = 4.11

(d) The final estimate is for the sum of squares for the three-way interaction. This is all we have left in this design, since we are running out of degrees of freedom. It is found by taking the sum of the squared data values, subtracting each PSS from the interactions, adding back each PSS from the main effects, and subtracting the correction factor.

SS (3-way) = Σx² - PSS4 - PSS5 - PSS6 + PSS1 + PSS2 + PSS3 - C
= 1,083 - 1,060.33 - 1,065.5 - 1,032.33 + 1,052.83 + 1,019.22 + 1,021.5 - 1,012.5 = 5.89

Step 3. Using the above values, we can calculate the final results:

Source              Sum of Squares   df   Mean Square     F
products                40.33         2      20.17      22.00
prod x pan               3.67         4       0.92
replicates               6.72         1       6.72       3.27
rep x pan                4.11         2       2.06
prod x rep               0.78         2       0.39       0.26
prod x rep x pan         5.89         4       1.47

Note that only the product effect was significant. There was a tendency for rep 2 to be lower, but it could not achieve significance, perhaps due to too few degrees of freedom. For main effects, df = levels - 1; e.g., three products gives 2 df. For interactions, df = the product of the df for the individual factors; e.g., prod x pan df = (3 - 1) x (3 - 1) = 4. The critical F-ratios are 6.94 for the product effect (2, 4 df), 18.51 for the replicate effect (1, 2 df), and 6.94 for the interaction (2, 4 df).
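The hand partitioning above can be checked in a few lines. This sketch (our own, using the Table AIII.5 data) reproduces the main-effect and interaction sums of squares and the product F-ratio against its panelist-by-product error term; exact arithmetic may differ slightly from hand-rounded intermediate values:

```python
import numpy as np

# Scores from Table AIII.5, indexed [panelist, product, rep]
scores = np.array([
    [[6, 4], [8, 5], [9, 10]],   # panelist 1: products A, B, C x (rep 1, rep 2)
    [[6, 5], [7, 8], [8, 8]],    # panelist 2
    [[7, 6], [10, 7], [12, 9]],  # panelist 3
], dtype=float)

n_pan, n_prod, n_rep = scores.shape
cf = scores.sum() ** 2 / scores.size          # correction factor = 1,012.5

def ss_for(sums, n_per_sum):
    """Sum of squares from a set of marginal (or cell) sums."""
    return (np.asarray(sums) ** 2).sum() / n_per_sum - cf

ss_prod = ss_for(scores.sum(axis=(0, 2)), n_pan * n_rep)   # 40.33
ss_rep  = ss_for(scores.sum(axis=(0, 1)), n_pan * n_prod)  # 6.72
ss_pan  = ss_for(scores.sum(axis=(1, 2)), n_prod * n_rep)  # 9.00

# Interaction SS: cell-sum SS minus the two main effects involved
ss_pan_prod = ss_for(scores.sum(axis=2), n_rep) - ss_pan - ss_prod   # 3.67
ss_rep_pan  = ss_for(scores.sum(axis=1), n_prod) - ss_rep - ss_pan   # 4.11
ss_prod_rep = ss_for(scores.sum(axis=0), n_pan) - ss_prod - ss_rep   # 0.78

# Products are tested against the panelist-by-product interaction
ms_prod = ss_prod / (n_prod - 1)
ms_pan_prod = ss_pan_prod / ((n_pan - 1) * (n_prod - 1))
f_products = ms_prod / ms_pan_prod   # 22.0 with exact arithmetic
```

The same `ss_for` helper handles every margin of the panelist x product x rep array; only the axis being summed over and the count of observations per sum change.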

A Note on Blocking

In factorial designs with two or more variables, the sensory specialist will often have to make some decisions about how to group the products that will be viewed in a single session. Sessions or days of testing often form the blocks of an experimental design. The examples considered previously are fairly common in that the two variables in a simple factorial design are often products and replicates. Of course, judges are a third factor, but a special one. Let's put judges aside for the moment and look at a complete block design in which each judge participates in the evaluation of all products and all of the replicates. This is a common design in descriptive analysis using trained panels. Consider the following scenario: Two sensory technicians (students in a food company on summer internships) are given a sensory test to design. The test will involve comparisons of four different processed soft-cheese spreads, and a trained descriptive panel is available to evaluate key attributes such as cheese flavor intensity, smoothness, and mouthcoating. Due to the tendency of this product to coat the mouth and other similar carryover effects, only four products can be presented in a sitting. The panel is available for testing in four separate sessions on different days. There are then two factors to be assigned to blocks of sessions in this experiment, the products and the replicates. Technician A decides to present one version of the cheese spread on each day but replicate it four times within a session (this technician obviously had a sensory evaluation course and was taught the value of replication). Technician B, on the other hand, presents all four products on each day, so that the sessions (days) become blocked as replicates. Both technicians use counterbalanced orders of presentation, random codes, and other reasonable good practices of sensory testing. The blocking schemes are illustrated in Figure AIII.6. Which design seems better?
A virtually unanimous opinion is that assigning all four products within the same block (session) is better than assigning replicates within a session. The observers will have a more stable frame of reference within a session than across sessions, and this will improve the sensitivity of the product comparison. There may be day-to-day variations in uncontrolled factors that may confound the product comparisons across days (changes in conditions or in the panelists themselves) and add to random error.

A Blocking Problem

Four products are available on four days. Four replicates are desired. The product causes carryover and fatigue and tends to coat the mouth, so only four products can be tested in a day. How are the two factors to be assigned to the blocks of days (sessions) for the sensory panel?

Approach "A": assign replicates within a block, products across days (sessions):

Day 1        Day 2        Day 3        Day 4
Product 1    Product 2    Product 3    Product 4

Approach "B": assign products within each block, replicates across days:

Day 1    Day 2    Day 3    Day 4
Rep 1    Rep 2    Rep 3    Rep 4

Which blocking scheme is better and why?

Figure AIII.6. Examples of blocking strategy for the hypothetical example of the processed cheese spread.

Having the four products present in the same session lends a certain directness to the comparison without any burden of memory load. There is less likelihood of drift in scale usage within a session as opposed to testing across days. If identical versions of a product are presented on the same day, panelists may attempt to find differences that are not really there or may distribute their ratings across the scales unnecessarily due to the frequency effect (see Chapter 9). Of course, there may be good reasons under some circumstances to test different products on different days. There may be limitations in preparation or pilot plant operations. There may be a need to present products at a carefully controlled time interval since preparation or manufacture. However, in most circumstances the sensory specialist will have the opportunity to present at least a subset of the products to be tested within a session, if not the complete set in the experiment. Why then assign products within a block and replicates across sessions? Simply stated, the product comparison is most often the more critical comparison of the two. Product differences are likely to be the critical question in a study. Replicates can tell us something about panel stability, panelist discrimination, and reliability, and they certainly add sensitivity and statistical power to the test. However, replicates as an experimental question are generally of less importance than the product comparison. From this kind of reasoning, we can formulate a general principle for assignment of variables to blocks in sensory studies where the experimental blocks are test sessions: Assign the variable of greatest interest within a block so that all levels of that factor are evaluated together.
Conversely, assign the variable of secondary interest across the blocks if there are limitations in the number of products that can be presented and if such limitations prevent a complete design within each block or session.

Table AIII.6. Data Set for a Two-Factor ANOVA with a Comparison Between Groups ("Split Plot")

Panelist   Group   Product J   Product K   Product L
1          1       3           5           4
2          1       4           5           6
3          1       1           3           4
4          1       3           4           5
5          1       4           5           3
6          1       4           5           5
7          1       6           6           7
8          1       6           8           4
9          2       5           7           9
10         2       3           5           8
11         2       6           4           7
12         2       1           5           8
13         2       2           6           9
14         2       3           3           6
15         2       4           5           7
16         2       3           3           8

SPLIT-PLOT OR BETWEEN-GROUPS (NESTED) DESIGNS

It is not always possible to have all subjects rate all products. A common design uses different groups of people to evaluate different levels of a variable. Also, we might simply want to compare two panels, having presented them with the same levels of a test variable. For example, we might have a set of products evaluated at two sites in order to see whether panelists in two manufacturing plants, or a QC panel and an R&D panel, are in agreement. In this case, there will be repeated measures on one variable (the products), since all panelists see all products, but we also have a between-groups variable that we wish to compare. We call this a "split plot" design in keeping with the nomenclature of Stone and Sidel (1993). However, simply bear in mind that we have one group variable and one repeated measures variable. In behavioral research, these are sometimes called "between-subjects" and "within-subjects" effects. A sample data set for such a design is shown in Table AIII.6. In this example, two panels of eight panelists each rated a juice product for perceived sourness. Three levels of acidification were presented. We wish to know whether the products differed and whether the panels scored the products in a similar fashion. The analysis proceeds as follows: Let n = the number of panelists per group (= 8); let A = the number of groups (= 2); let B = the number of products (= 3). First we need to assemble a two-factor summary table, giving the sums across the cells in a matrix. The cells are the sums of each product's scores for each group (a 2 x 3 matrix), as follows:

AB Summary Table
Group    Product J   Product K   Product L   Total
1        31          41          38          110
2        27          38          62          127
Total    58          79          100         237

Next we need to square these sums:

AB² Summary Table
Group    Product J   Product K   Product L   Total
1        961         1,681       1,444       4,086
2        729         1,444       3,844       6,017
Total                                        10,103

Now we find the panelist marginal sums and squared values, just as in repeated measures analysis. These are the sums across rows in the data set and the squares of those sums. We call these values P and P² in the following analysis, since they are derived from people or panelists. Since this is fairly obvious from the preceding examples, it is not shown here.

Next we accumulate some partial sums of squares to use in computation. There are six of these needed for this design. The first two are just for the correction factor and the total sum of squares. The next three are for the group effect, the product effect, and the group-by-product interaction. The last is for the panelist effect.

PSS1 = (grand total)² / nab = (237)² / [(8)(2)(3)] = 56,169 / 48 = 1,170.19
PSS2 = Σx² = 1,347
PSS3 = (ΣTa²) / nb = (110² + 127²) / [(8)(3)] = (12,100 + 16,129) / 24 = 1,176.21
PSS4 = (ΣTb²) / na = (58² + 79² + 100²) / [(8)(2)] = (3,364 + 6,241 + 10,000) / 16 = 1,225.31
PSS5 = [Σ(AB)²] / n = (961 + 1,681 + ...) / 8 = 10,103/8 = 1,262.88
PSS6 = (ΣP²) / b = (12² + 15² + ...) / 3 = 3,669 / 3 = 1,223

Now we can compute our complete sums of squares, mean squares, and F-ratios.

SS (group effect) = PSS3 - PSS1 = 1,176.21 - 1,170.19 = 6.02
SS (error, panelists within groups) = PSS6 - PSS3 = 1,223 - 1,176.21 = 46.79
SS (products) = PSS4 - PSS1 = 1,225.31 - 1,170.19 = 55.12
SS (product-by-group interaction) = PSS5 - PSS3 - PSS4 + PSS1 = 1,262.88 - 1,176.21 - 1,225.31 + 1,170.19 = 31.54
SS (error, product-by-panelist within groups) = PSS2 - PSS5 - PSS6 + PSS3 = 1,347 - 1,262.88 - 1,223 + 1,176.21 = 37.33

The final analysis table looks like this:

Effect        SS       df    MS      F       p
Groups        6.02     1     6.02    1.80    .20
Error         46.79    14    3.34
Products      55.12    2     27.56   20.67   .00003
Interaction   31.54    2     15.77   11.82   .0002
Error         37.33    28    1.33

In conclusion, our products differed, but these differences depended on which panel you asked, as shown in the ANOVA table and the means below. Group 1 had difficulty discriminating between products K and L, while the other group clearly differentiated all three products. Another way to describe this interaction might be to say that group 2 had a steeper slope (a magnitude interaction). The big difference, however, resides in the much higher score given to product L, in comparison to the others, by group 2. These same results are found using MANOVA statistics and with Greenhouse-Geisser corrections, so we have some confidence that the repeated measures model here did not lead us astray.

Product Means:
Group    Product J   Product K   Product L
1        3.9         5.1         4.8
2        3.4         4.8         7.8
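The split-plot partitioning can be verified with a short script. This sketch (our own, using the Table AIII.6 data) follows the PSS bookkeeping above exactly:

```python
import numpy as np

# Ratings from Table AIII.6: two groups of eight panelists x three products
group1 = np.array([[3, 5, 4], [4, 5, 6], [1, 3, 4], [3, 4, 5],
                   [4, 5, 3], [4, 5, 5], [6, 6, 7], [6, 8, 4]], dtype=float)
group2 = np.array([[5, 7, 9], [3, 5, 8], [6, 4, 7], [1, 5, 8],
                   [2, 6, 9], [3, 3, 6], [4, 5, 7], [3, 3, 8]], dtype=float)
data = np.stack([group1, group2])   # shape: (groups a, panelists n, products b)
a, n, b = data.shape

cf       = data.sum() ** 2 / data.size                    # PSS1 = 1,170.19
pss_tot  = (data ** 2).sum()                              # PSS2 = 1,347
pss_grp  = (data.sum(axis=(1, 2)) ** 2).sum() / (n * b)   # PSS3 = 1,176.21
pss_prod = (data.sum(axis=(0, 1)) ** 2).sum() / (n * a)   # PSS4 = 1,225.31
pss_ab   = (data.sum(axis=1) ** 2).sum() / n              # PSS5 = 1,262.88
pss_pan  = (data.sum(axis=2) ** 2).sum() / b              # PSS6 = 1,223

ss_groups      = pss_grp - cf                             # 6.02
ss_err_between = pss_pan - pss_grp                        # 46.79
ss_products    = pss_prod - cf                            # 55.12
ss_interaction = pss_ab - pss_grp - pss_prod + cf         # 31.54
ss_err_within  = pss_tot - pss_ab - pss_pan + pss_grp     # 37.33

f_groups   = (ss_groups / (a - 1)) / (ss_err_between / (a * (n - 1)))             # 1.80
f_products = (ss_products / (b - 1)) / (ss_err_within / (a * (n - 1) * (b - 1)))  # 20.67
```

Note the two separate error terms: panelists-within-groups for the between-groups effect, and product-by-panelist-within-groups for the repeated (product) effect.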

Split-Plot Analysis with Unequal Group Size

The first example of a between-groups comparison used equal group sizes. However, many sensory studies, especially in consumer work, will not be able to obtain equal sample sizes. For unequal sample sizes, there are a few more calculations that must be done. In addition to the summary table based on simple cell, row, and column sums, we also construct a summary table based on means and compute some additional partial sums of squares. These values based on means are then multiplied by the harmonic mean of the group sizes. The harmonic mean is the number of groups divided by the sum of the reciprocals of the group sizes. Here is a worked example with two groups (a gender variable) evaluating the sweetness of a product with different amounts of added sucrose and fructose, using a simple 9-point intensity scale for sweetness. A similar worked example can be found in Winer (1971, chapter on multifactor experiments with repeated measures, p. 600). The research questions involve whether either or both groups can tell the difference in sweetness among these three products and whether males and females differ. A split-plot analysis of variance can help us decide whether there are product differences and evidence for any gender group difference. The data set is shown in Table AIII.7.

Group and Product Summary

Step 1: Compute within-cell and marginal sums:

Gender (Group)   Sucrose 1%   Sucrose 2%   Fructose 2%   Group Sums
Male             23           38           51            112
Female           22           57           73            152
Product sums     45           95           124           (total = 264)

PSS (total) = Σx² = 1,356 (used later for the total SS)
PSS (groups) = [(112)²/30] + [(152)²/48] = 899.47 (used for the group SS)
PSS (int) = [(23)² + (38)² + (51)²]/10 + [(22)² + (57)² + (73)²]/16 = 1,023.78
PSS (panelists) = Σ(each panelist's sum)² / 3 = 3,200 / 3 = 1,066.67

Now we must construct the second set of partial sums of squares, based on the summary table of means. The mean values are as follows:

Gender (Group)   Sucrose 1%   Sucrose 2%   Fructose 2%   Sum (rows)
Male             2.30         3.80         5.10          11.20
Female           1.38         3.56         4.56          9.49
Sum (columns)    3.67         7.36         9.66          20.69

Table AIII.7. Data Set for Between-Groups Comparison with Unequal Group Size

Gender (Group)   Sucrose 1%   Sucrose 2%   Fructose 2%
Male             2            3            6
Male             0            5            6
Male             3            0            0
Male             3            8            5
Male             5            6            6
Male             1            3            5
Male             3            0            4
Male             4            5            7
Male             0            3            5
Male             2            5            7
Female           0            5            1
Female           0            1            2
Female           0            5            3
Female           7            7            5
Female           1            2            7
Female           0            6            9
Female           2            3            1
Female           2            4            3
Female           2            5            8
Female           1            3            6
Female           0            3            7
Female           0            3            3
Female           2            2            2
Female           1            4            7
Female           3            3            8
Female           1            1            1

Next we need the partial sums of squares based on these means and the harmonic mean group size as a multiplier. We will notate the new partial sums of squares as MPSS to denote that they are based on the means.

The harmonic mean group size is n(h) = 2 / [1/10 + 1/16] = 12.307

MPSS (correction factor) = (20.69)² / 6 = 71.346
MPSS (groups) = [(11.2)² + (9.49)²] / 3 = 71.833
MPSS (products) = [(3.67)² + (7.36)² + (9.66)²] / 2 = 80.475
MPSS (int) = (2.30)² + ... + (4.56)² = 81.084

Now we can compute the sums of squares. Note that for some of these we can use the partial sums of squares as in the equal-group-size ANOVA, and for others we use the new MPSS sums based on means.

SS (groups) = [MPSS (groups) - MPSS (correction factor)] x harmonic mean group size = [71.833 - 71.346] x 12.307 = 5.99
SS (between-groups error) = PSS (panelists) - PSS (groups) = 1,066.67 - 899.47 = 167.2
SS (products) = [MPSS (products) - MPSS (correction factor)] x harmonic mean group size = [80.475 - 71.346] x 12.307 = 112.42
SS (interaction) = [MPSS (int) - MPSS (products) - MPSS (groups) + MPSS (correction factor)] x harmonic mean group size = [81.084 - 80.475 - 71.833 + 71.346] x 12.307 = 1.435
SS (error) = PSS (total) - PSS (int) - PSS (panelists) + PSS (groups) = 1,356 - 1,023.78 - 1,066.67 + 899.47 = 165.02

df (total) = N - 1 = 78 - 1 = 77
df (products) = 3 - 1 = 2
df (group) = 2 - 1 = 1
df (between panelists) = 26 - 1 = 25
df (between error) = df (between panelists) - df (group) = 25 - 1 = 24
df (interaction) = df (group) x df (products) = (1)(2) = 2
df (error) = df (total) - (sum of all other df) = 48

The ANOVA table appears as follows:

Source            SS       df    MS      F       p
Groups            5.99     1     5.99    0.86    (NS)
"Between" error   167.2    24    6.97
Products          112.42   2     56.21   16.38   <.001
Interaction       1.435    2     0.72    0.21    (NS)
Error             165.02   48    3.43

So there is a clear difference among the products, but no effect of gender or any interaction. The lack of interaction implies that the groups perceived the differences among the products in a similar manner.
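The harmonic-mean adjustment used above can be sketched in a few lines (our own illustration; the MPSS values are taken from the worked example):

```python
def harmonic_mean_n(*group_sizes):
    """Harmonic mean group size: the number of groups divided by the
    sum of the reciprocals of the group sizes."""
    return len(group_sizes) / sum(1.0 / n for n in group_sizes)

# Group sizes from the worked example: 10 males and 16 females
nh = harmonic_mean_n(10, 16)         # about 12.308

# SS(groups) = [MPSS(groups) - MPSS(correction factor)] x harmonic mean n
ss_groups = (71.833 - 71.346) * nh   # about 5.99, as in the table above
```

Note that the harmonic mean (12.308) is smaller than the arithmetic mean of the group sizes (13), which slightly down-weights the sums of squares relative to a naive average when group sizes are unbalanced.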

EPILOGUE: OTHER TECHNIQUES

Analysis of variance remains the bread-and-butter everyday statistical tool for the vast majority of sensory experiments and multiproduct comparisons. However, it is not without its shortcomings. One concern is that we end up examining each scale individually, even when our descriptive profile may contain many scales for flavor, texture, and appearance. Furthermore, many of these scales are intercorrelated. They may be providing redundant information, or they may be driven by the same underlying or latent causes. A more comprehensive approach would include multivariate techniques such as principal components analysis to assess these patterns of correlation. Analysis of variance can also be done following principal components analysis, using the factor scores rather than the raw data. This has the advantage of simplifying the analysis and reporting of results, although the loss of detail about individual scales may not always be desired. See Chapter 17 for descriptions of multivariate techniques and their applications. A second concern is the restrictive assumptions of ANOVA, which are often violated. Normal distributions, equal variance, and, in the case of repeated measures, homogeneity of covariance are not always the case in human judgments, so these violations lead to unknown changes in the risk levels. Risk may be underestimated, since the statistical probabilities of our analysis are based on distributions and assumptions that are not always descriptive of our data. For such reasons, it has recently become popular to use MANOVA, a multivariate model approach to effect testing, instead of repeated-measures ANOVA. While this more conservative approach is reasonable from an experimental point of view, it is often the case in applied sensory testing that missing a true difference is more damaging than reporting a false difference.
In order to catch all the differences that might be present among products, it makes sense to use the most sensitive analysis we can. The partitioning of panelist variation in the analysis of complete block designs is a very sensitive method. Of course, the two approaches can easily be compared. Many current statistical analysis software packages offer both types of analysis, and some even give them automatically or as defaults. The sensory scientist can then compare the outcomes of the two types of analyses. If they are the same, the conclusions are straightforward. If they differ, some caution is warranted in drawing conclusions about statistical significance, and closer examination of the data to see exactly what is happening with the raw numbers is called for. The analyses shown in this section are relatively simple ones. It is obvious that more complex experimental designs are likely to come the way of a sensory testing group. In particular, incomplete designs, in which people evaluate only some of the products in the design, are common. Product developers often have many variables at several different levels each to screen. We have stressed the complete block designs here because of the efficient and powerful partitioning of panelist variance that is possible. Discussions of incomplete designs and their analysis can be found in Cochran and Cox (1957) and other statistics texts. More complex designs are not shown here for several reasons. First, the goal of this section is to give researchers unfamiliar with ANOVA a concise treatment and worked examples to illustrate how the analyses progress and from whence the mysterious F-ratios arise. This should serve instructors of sensory science who wish students to have some proficiency in the analysis of simple data sets using ANOVA.
We recognize that it is unlikely, at this point in the history of computing, that many sensory professionals or statistical analysis services will spend much time doing ANOVAs by hand. If the experimental designs are complex or if many dependent measures have been collected, software packages are likely to be used. In these cases, the authors of the program have taken over the burden of computing and partitioning variance. However, decisions can still be made on a theoretical level by the sensory scientist. For example, we have included all interaction terms in the above analyses, but the linear models on which ANOVAs are based need not include such terms if there are theoretical or practical reasons to omit them. Many current statistical analysis packages allow the specification of the linear model, returning some discretionary modeling power to the scientist.

REFERENCES

Cochran, W.G., and Cox, G.M. 1957. Experimental Designs, 2d ed. Wiley, New York.
Gacula, M.C., Jr., and Singh, J. 1984. Statistical Methods in Food and Consumer Research. Academic, Orlando, FL.
Greenhouse, S.W., and Geisser, S. 1959. On methods in the analysis of profile data. Psychometrika, 24, 95-112.
Hays, W.L. 1973. Statistics for the Social Sciences, 2d ed. Holt, Rinehart and Winston, New York.
Huynh, H., and Feldt, L.S. 1976. Estimation of the Box correction for degrees of freedom in the randomized block and split plot designs. Journal of Educational Statistics, 1, 69-82.
Huynh, H., and Mandeville, G.K. 1979. Validity conditions in repeated measures designs. Psychological Bulletin, 86, 964-973.
Lundahl, D.S., and McDaniel, M.R. 1988. The panelist effect - fixed or random? Journal of Sensory Studies, 3, 113-121.
O'Mahony, M. 1986. Sensory Evaluation of Food: Statistical Methods and Procedures. Dekker, New York.
Sokal, R.R., and Rohlf, F.J. 1981. Biometry, 2d ed. Freeman, New York.
Stone, H., and Sidel, J.L. 1993. Sensory Evaluation Practices, 2d ed. Academic, San Diego.
Winer, B.J. 1971. Statistical Principles in Experimental Design, 2d ed. McGraw-Hill, New York.

CORRELATION, REGRESSION, AND MEASURES OF ASSOCIATION

INTRODUCTION

Sensory scientists are frequently confronted with situations where they would like to know whether there is a significant association between two sets of data. For example, the sensory specialist wants to know if the perceived brown-color intensity (dependent variable) of a series of cocoa powder-icing sugar mixtures increased as the amount of cocoa (independent variable) in the mixture increased. In another example, the sensory scientist wants to know if the perceived sweetness of grape juice (dependent variable) is related to the total concentration of fructose and glucose (independent variable) in the juice, as determined by high-pressure liquid chromatography. In these cases we need to determine if there is evidence for an association between the independent and dependent variables. In some cases, we may also be able to infer a cause-and-effect relationship between the independent and the dependent variables. The measures of association between two sets of data are called correlation coefficients, and if the size of the calculated correlation coefficient leads us to reject the null hypothesis of no association, then we know that the change in the independent variables (our treatments, ingredients, or processes) was associated with a change in the

dependent variables (the variables we measured, such as sensory-based responses or consumer acceptance). However, from one point of view, this hypothesis-testing approach is scientifically impoverished. It is not a very informative method of scientific research, because the statistical hypothesis decision is a binary, yes/no relationship: we conclude either that there is evidence for an association between our treatments and our observations or that there is not. Having rejected the null hypothesis, we still do not know the degree of relationship or the tightness of the association between our variables. Even worse, we have not specified any type of mathematical model or equation that might characterize the association.

Associational measures, also known as correlation coefficients, can address the first of these two questions. In other words, the correlation coefficient will allow us to decide whether there is an association between the two data series. Modeling, in which we attempt to fit a mathematical function to the relationship, addresses the second question. The most widespread measure of association is the simple correlation coefficient (Pearson's correlation coefficient). The most common approach to model fitting is simple linear regression. These two approaches have similar underlying calculations and are related to one another. The first sections of this chapter will show how correlations and regressions are computed and give some simple examples.
The later sections will deal with related topics: how to build more complex models involving several variables, how measures of association are derived from other statistical methods such as analysis of variance, and finally how to compute measures of association when the data do not have interval scale properties (the Spearman rank correlation coefficient and Cramer's φ'²).

Correlation is of great importance in sensory science because it functions as a building block for other statistical procedures. Thus methods like principal components analysis draw part of their calculations from measures of correlation among variables. In modeling relationships among data sets, regression and multiple regression are standard tools. A common application is the predictive modeling of consumer acceptability of a product based on other variables. Those variables may be descriptive attributes, as characterized by a trained (nonconsumer) panel, or they may be ingredient or processing variables, or even instrumental measures. The procedure of multiple regression allows one to build a predictive model for consumer likes based on a number of such other variables. A common application of regression and correlation is in sensory-instrumental relationships. Finally, measures of correlation are valuable tools in specifying how reliable our measurements are, by comparing panelist scores over multiple observations and assessing agreement among panelists and panels.

When the sensory scientist is investigating possible relationships between two data series, the first step is to plot the data in a scatter diagram with the X series on the horizontal axis and the Y series on the vertical axis. Blind application of correlation and regression analyses may lead to wrong conclusions about the relationship between the two variables. As we will see, the most common correlation and regression methods estimate the parameters of the "best" straight line through the data (the regression line) and the closeness of the points to the line (the simple correlation coefficient). However, the relationship may not necessarily be described well by a straight line; in other words, the relationship may not necessarily be linear. Plotting the data in scatter diagrams will alert the specialist to problems in fitting linear models to data that are not linearly related (Anscombe, 1973). For example, the four data sets listed in Table AIV.1 and plotted in Figure AIV.1 clearly are not all accurately described by a linear model. In all four cases, the 11 observation pairs have mean x-values equal to 9.0, mean y-values equal to 7.5, a correlation coefficient equal to 0.82, and a regression line equation of y = 3 + 0.5x. However, as Figure AIV.1 clearly indicates, the data sets are very different. These data sets are the so-called Anscombe quartet, created by their author to highlight the perils of blindly calculating linear regression models and Pearson correlation coefficients without first determining through scatter plots whether a simple linear model is appropriate (Anscombe, 1973).

Table AIV.1. The Anscombe Quartet: Four Data Sets Illustrating Principles Associated with Linear Correlation¹

        a              b              c              d
    x      y       x      y       x      y       x      y
    4    4.26      4    3.10      4    5.39      8    6.58
    5    5.68      5    4.74      5    5.73      8    5.76
    6    7.24      6    6.13      6    6.08      8    7.71
    7    4.82      7    7.26      7    6.42      8    8.84
    8    6.95      8    8.14      8    6.77      8    8.47
    9    8.81      9    8.77      9    7.11      8    7.04
   10    8.04     10    9.14     10    7.46      8    5.25
   11    8.33     11    9.26     11    7.81      8    5.56
   12   10.84     12    9.13     12    8.15      8    7.91
   13    7.58     13    8.74     13   12.74      8    6.89
   14    9.96     14    8.10     14    8.84     19   12.50

¹Anscombe (1973).

[Figure AIV.1. Scatter plots of the Anscombe quartet, panels A-D (Anscombe, 1973).]

CORRELATION

When two variables are related, a change in one is usually accompanied by a change in the other. However, the changes in the second variable may not be linearly related to changes in the first. In these cases the correlation coefficient is not a good measure of association between the variables (Figure AIV.2). The concomitant change between the variables may occur because the two variables are causally related; in other words, the change in one variable causes the change in the other variable. In the cocoa-icing sugar

[Figure AIV.2. The correlation coefficient could be used as a summary measure for plots A and B, but not for plots C and D (in analogy to Meilgaard et al., 1991).]

example, the increased concentration of cocoa powder in the mixture caused the increase in perceived brown color. However, two variables may not necessarily be causally related, because a third factor may drive the changes in both variables, or there may be several intervening variables in the causal chain between the two variables (Freund and Simon, 1992). An anecdotal example often used in statistics texts is the following: some years after World War II, a statistician found a correlation between

the number of storks and the number of babies born in England. This did not mean that the storks "caused" the babies. Both variables were related to the rebuilding of England (an increase in families and in the roofs that storks use as nesting places). One immediate cautionary note in associational statistics is that the causal inference is often not clear. The usual warning to students of statistics is that "correlation does not imply causation."

The rate of change or dependence of one variable on another can be measured by the slope of a line relating the two variables. However, that slope will numerically depend on the units of measurement. For example, the relationship between sweetness and molarity of a sugar will have a different slope if concentration is measured in percent-by-weight, or if the units are changed from molar to millimolar concentration. In order to have a measure of association that is free from the particular units chosen, we must standardize both variables so that they have a common system of measurement. The statistical approach to this problem is to replace each score with its standard score, its difference from the mean in standard-deviation units. Once this is done, we can measure the association by a measure known as the Pearson correlation coefficient (Blalock, 1979). When we have not standardized the variables, we can still measure the association, but it is then called covariance (the two measures "vary together") rather than correlation. The measure of correlation is extremely useful and very widespread in its applications. The simple or Pearson correlation coefficient is calculated using the following computational equation (Snedecor and Cochran, 1980):

r = [ΣXY - (ΣX)(ΣY)/n] / sqrt{ [ΣX² - (ΣX)²/n] [ΣY² - (ΣY)²/n] }     (AIV.1)

where X = the series of data points for the independent variable
      Y = the series of data points for the dependent variable

Each data set has n data points, and the degrees of freedom associated with the simple correlation coefficient are n - 2. If the calculated r-value is larger than the r-value listed in the correlation table for the appropriate alpha level, then the correlation between the variables is significant. The value of Pearson's correlation coefficient always lies between -1 and +1. Values of r close to 1 in absolute value indicate that a very strong linear relationship exists between the two variables. When r is equal to zero, there is no linear relationship between the two variables. Positive values of r indicate a tendency for the variables to increase together. Negative values of r indicate a tendency for large values of one variable to be associated with small values of the other variable. Figure AIV.2 highlights the differences in the scatter plots for r-values close to 1 and close to zero.

Pearson's Correlation Coefficient Example

In this study, a series of 14 cocoa-icing sugar mixtures was rated by 20 panelists for brown-color intensity on a 9-point category scale. Was there a significant correlation between the percentage of cocoa powder added to the icing sugar mixture and the perceived brown-color intensity?

Cocoa Added, %   Mean Brown-Color Intensityᵃ      X²        Y²        XY
30               1.73                              900      2.99     51.82
35               2.09                             1225      4.37     73.18
40               3.18                             1600     10.12    127.27
45               3.23                             2025     10.42    145.23
50               4.36                             2500     19.04    218.18
55               4.09                             3025     16.74    225.00
60               4.68                             3600     21.92    280.91
65               5.77                             4225     33.32    375.23
70               6.91                             4900     47.74    483.64
75               6.73                             5625     45.26    504.54
80               7.05                             6400     49.64    563.64
85               7.77                             7225     60.42    660.68
90               7.18                             8100     51.58    646.36
95               8.54                             9025     73.02    811.82

ΣX = 875   ΣY = 73.32   ΣX² = 60,375   ΣY² = 446.56   ΣXY = 5,167.50

ᵃValues rounded after calculation of squares and products.

r = [5,167.50 - (875)(73.32)/14] / sqrt{ [60,375 - (875)²/14] [446.56 - (73.32)²/14] } = 0.9806

At an alpha level of 5%, the tabular value of the correlation coefficient for 12 degrees of freedom is 0.4575. The calculated value of 0.9806 exceeds the tabled value, and we can conclude that there is a significant association between the perceived brown-color intensity of the cocoa-icing sugar mixture and the percentage of cocoa added to the mixture. Since the only ingredient changing in the mixture is the amount of cocoa, we can also conclude that the increased amount of cocoa causes the increased perception of brown color.
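As a numerical check, Eq. (AIV.1) can be computed directly from the raw data in the table above. The following Python sketch is not part of the original text; it simply re-derives the correlation from the cocoa percentages and mean brown-color ratings of the worked example.

```python
import math

def pearson_r(x, y):
    # Computational form of Eq. (AIV.1): raw sums of X, Y, X^2, Y^2, and XY
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    syy = sum(v * v for v in y)
    sxy = sum(a * b for a, b in zip(x, y))
    numerator = sxy - sx * sy / n
    denominator = math.sqrt((sxx - sx ** 2 / n) * (syy - sy ** 2 / n))
    return numerator / denominator

# Cocoa percentages (X) and mean brown-color ratings (Y) from the example
cocoa = [30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95]
brown = [1.73, 2.09, 3.18, 3.23, 4.36, 4.09, 4.68, 5.77,
         6.91, 6.73, 7.05, 7.77, 7.18, 8.54]

r = pearson_r(cocoa, brown)  # approximately 0.98, matching the hand calculation
```

With n - 2 = 12 degrees of freedom, r is then compared against the tabled critical value (0.4575 at alpha = 0.05) exactly as in the text.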

Coefficient of Determination

The coefficient of determination is the square of Pearson's correlation coefficient (r²). It is the estimated proportion of the variance of the data set Y that can be attributed to its linear correlation with X, while 1 - r² (the coefficient of nondetermination) is the proportion of the variance of Y that is free from the effect of X (Freund and Simon, 1992). The coefficient of determination can range between 0 and 1; the closer the r² value is to 1, the better the straight-line fit.

LINEAR REGRESSION

Regression is a general term for fitting a function, usually a linear one, to describe the relationship among variables. Various methods are available for fitting lines to data, some based on closed-form mathematical solutions and others based on iterative, step-by-step trials that minimize some residual-error or badness-of-fit measure. In all of these methods, there must be some measurement of how good the fit of the model or equation is to the data. The least-squares criterion is the most common measure of fit for linear relationships (Snedecor and Cochran, 1980; Afifi and Clark, 1984). In this approach, the best-fitting straight line is found by minimizing the squared deviations of every data point from the line in the y-direction (Figure AIV.3).

The equation is:

y = a + bx     (AIV.2)

where a = value of the estimated intercept
      b = value of the estimated slope

[Figure AIV.3. Least-squares regression line and residuals, showing the X-mean, Y-mean, fitted values, observed values, and residuals. The least-squares criterion minimizes the squared residuals for each point (from Neter and Wasserman, 1974).]

The estimated least-squares regression line is calculated using the following equations:

b = [ΣXY - (ΣX)(ΣY)/n] / [ΣX² - (ΣX)²/n]     (AIV.3)

a = ΣY/n - b(ΣX/n)     (AIV.4)

It is possible to assess the goodness of fit of the equation in different ways. The usual methods used are the coefficient of determination and analysis of variance (Piggott, 1986).

Analysis of Variance

In a linear regression, it is possible to partition the total variation into the variation explained by the regression and the residual or unexplained variation (Neter and Wasserman, 1974). The F-test is calculated as the ratio of the regression mean square to the residual mean square, with 1 and (n - 2) degrees of freedom. This tests whether the fitted regression line has a nonzero slope. The equations used to calculate the total sum of squares and the sums of squares associated with the regression are:

SStotal = Σ(Y - Ȳ)²     (AIV.5)

SSregression = Σ(Ŷ - Ȳ)²     (AIV.6)

SSresidual = Σ(Y - Ŷ)² = SStotal - SSregression     (AIV.7)

where Ȳ is the mean of the Y values and Ŷ is the value predicted by the fitted line.

Analysis of Variance for Linear Regression

Source of      Degrees of    Sum of          Mean                   F-Value
Variation      Freedom       Squares         Squares
Regression     1             SSregression    SSregression/1         MSregression/MSresidual
Residual       n - 2         SSresidual      SSresidual/(n - 2)
Total          n - 1         SStotal

Prediction of the Regression Line

It is possible to calculate confidence intervals for the fitted regression line (Neter and Wasserman, 1974). These confidence limits lie on smooth curves (the branches of a hyperbola) on either side of the regression line (Figure AIV.4). The confidence intervals are at their smallest at the mean value of the X series, and the intervals become progressively larger as one moves away from the mean value of the X series. All predictions using the regression line should fall within the range of X values used to calculate the fitted regression line; no predictions should be made outside this range.

Linear Regression Example

We will return to the cocoa powder-icing sugar example used in the Pearson's correlation coefficient section. A linear regression line was fitted to the data using Eqs. (AIV.3) and (AIV.4).

[Figure AIV.4. The confidence region for a regression line consists of two curved lines on either side of the linear regression line. These curved lines are closest to the regression line at the mean values for X and Y (from Neter and Wasserman, 1974).]

b = [5,167.50 - (875)(73.32)/14] / [60,375 - (875)²/14] = 585.0 / 5,687.5 = 0.102857

a = 5.237 - (0.102857)(62.5) = -1.1916

The fitted linear regression equation is y = 0.1029x - 1.1916, and the coefficient of determination is 0.9616 with 12 degrees of freedom.
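The same slope and intercept can be verified in code. This Python sketch (ours, not from the original text) applies the least-squares slope and intercept formulas to the cocoa data:

```python
def least_squares(x, y):
    # Least-squares slope (b) and intercept (a) for the line y = a + b*x
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    sxy = sum(p * q for p, q in zip(x, y))
    b = (sxy - sx * sy / n) / (sxx - sx ** 2 / n)
    a = sy / n - b * sx / n
    return a, b

# Cocoa percentages (X) and mean brown-color ratings (Y) from the example
cocoa = [30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95]
brown = [1.73, 2.09, 3.18, 3.23, 4.36, 4.09, 4.68, 5.77,
         6.91, 6.73, 7.05, 7.77, 7.18, 8.54]

a, b = least_squares(cocoa, brown)  # b near 0.103, a near -1.19
```

Predictions from the fitted line should only be made for cocoa levels within the 30-95% range used to fit it, as the text cautions.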

MULTIPLE LINEAR REGRESSION

Multiple linear regression (MLR) calculates the linear combination of independent variables (more than one X) that is maximally correlated with the dependent variable (a single Y). The regression is performed in a least-squares fashion by minimizing the sum of squares of the residuals (Stevens, 1986; Afifi and Clark, 1984). The regression equation should be cross-validated by applying it to an independent sample from the same population; if the predictive power drops sharply, then the equation has only limited utility. In general, one needs about 15 subjects (or observations) per independent variable to have some hope of successful cross-validation, although some scientists will do multiple linear regression with as few as 5 times as many observations as independent variables. When the independent variables are intercorrelated, or multicollinear, it spells real trouble for the potential use of MLR. Multicollinearity limits the possible magnitude of the multiple correlation coefficient (R) and makes it difficult to determine the importance of a given independent variable in the regression equation, as the effects of the independent variables are confounded by their high intercorrelations. Also, when variables are multicollinear, the order in which independent variables enter the regression equation makes a difference in the amount of variance of Y that each variable accounts for; only with totally uncorrelated independent variables does the order have no effect. Multiple linear regression analyses can be performed using any reputable statistical analysis software package (Piggott, 1986).
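For readers curious about the mechanics, an MLR fit with an intercept can be obtained by solving the normal equations. The Python sketch below is a minimal illustration with invented data (two hypothetical predictors of a liking score; all names and numbers are ours, not from the text). As the text notes, real analyses would use a statistical package, which also provides diagnostics for multicollinearity.

```python
def solve_linear_system(A, b):
    # Gaussian elimination with partial pivoting for a small square system
    n = len(A)
    M = [row[:] + [bv] for row, bv in zip(A, b)]
    for i in range(n):
        pivot = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[pivot] = M[pivot], M[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            for c in range(i, n + 1):
                M[r][c] -= f * M[i][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][c] * x[c] for c in range(i + 1, n))) / M[i][i]
    return x

def mlr_fit(X, y):
    # Solve (X'X) beta = X'y after prepending an intercept column of ones
    Xd = [[1.0] + list(row) for row in X]
    k = len(Xd[0])
    XtX = [[sum(r[i] * r[j] for r in Xd) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yv for r, yv in zip(Xd, y)) for i in range(k)]
    return solve_linear_system(XtX, Xty)

# Hypothetical data: liking constructed exactly as 1 + 2*sweetness + 0.5*firmness
sweetness = [1, 2, 3, 4, 5, 6]
firmness = [2, 1, 4, 3, 6, 5]
liking = [1 + 2 * s + 0.5 * f for s, f in zip(sweetness, firmness)]

coefs = mlr_fit(list(zip(sweetness, firmness)), liking)  # [intercept, b1, b2]
```

Because the predictors here are only mildly correlated, the fit recovers the coefficients cleanly; with highly multicollinear predictors, the normal-equations matrix becomes near-singular and the individual coefficients become unstable, which is the practical problem the paragraph above describes.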

OTHER MEASURES OF ASSOCIATION

Spearman Rank Correlation

When the data are derived not from an interval scale but from an ordinal scale, the simple correlation coefficient is not appropriate; the Spearman rank correlation coefficient should be used instead. This correlation coefficient is a measure of the association between the ranks of the independent and the dependent variables. This measure is also useful when the data are not normally distributed (Blalock, 1979). The Spearman coefficient is often indicated by the symbol rs. Like the simple correlation coefficient, the Spearman coefficient ranges between -1 and 1. Values of rs close to 1 in absolute value indicate that a very strong relationship exists between the ranks of the two variables. When rs is equal to zero, there is no relationship between the ranks of the two variables. Positive values of rs indicate a tendency for the ranks of the variables to increase together. Negative values of rs indicate a tendency for large rank values of one variable to be associated with small rank values of the other variable. The Spearman correlation coefficient is calculated using the following equation:

rs = 1 - [6Σd²] / [n(n² - 1)]     (AIV.8)

where n = the number of ranked products
      d = the difference in ranks between the two data series

Critical values of rs are found in the Spearman rank correlation table. When n is greater than 60, the Pearson tabular values can be used to determine the critical values for the Spearman correlation coefficient.

Spearman Correlation Coefficient Example

We return to the cocoa powder example used for the Pearson's correlation coefficient. In this case, two panelists were asked to rank the perceived brown intensities of the 14 cocoa powder-icing sugar mixtures. Was there a significant correlation between the ranks assigned by the two panelists to the cocoa mixtures?

Cocoa in Mixture, %   Panelist A (X)   Panelist B (Y)    d     d²
30                     1                1                0     0
35                     2                2                0     0
40                     4                3                1     1
45                     6                4                2     4
50                     7                7                0     0
55                     5                8               -3     9
60                     9                6                3     9
65                     3                5               -2     4
70                    10                9                1     1
75                     8               10               -2     4
80                    11               12               -1     1
85                    14               13                1     1
90                    13               11                2     4
95                    12               14               -2     4

Σd² = 42

rs = 1 - (6)(42) / [14(14² - 1)] = 1 - 252/2,730 = 0.9077     (AIV.9)

The tabular value for the Spearman correlation coefficient with 14 ranks at an alpha level of 5% is equal to 0.464. The calculated value exceeds this, so we can conclude that the two panelists' rank orderings of the brown-color intensities of the cocoa powder-icing sugar mixtures were significantly similar. Note, however, that there is no direct causal relationship between the rank order of panelist A and that of panelist B.
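Eq. (AIV.8) is easy to verify in code. This Python sketch (ours, not from the original text) uses the two panelists' ranks from the table above; note that this simple formula assumes there are no tied ranks:

```python
def spearman_rs(rank_x, rank_y):
    # Eq. (AIV.8): r_s = 1 - 6*sum(d^2) / (n*(n^2 - 1)), d = rank differences
    n = len(rank_x)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Ranks assigned by each panelist to the 14 mixtures (from the worked example)
panelist_a = [1, 2, 4, 6, 7, 5, 9, 3, 10, 8, 11, 14, 13, 12]
panelist_b = [1, 2, 3, 4, 7, 8, 6, 5, 9, 10, 12, 13, 11, 14]

rs = spearman_rs(panelist_a, panelist_b)  # sum(d^2) = 42, rs about 0.91
```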

Cramer's Measure (φ'²)

When the data are derived not from an interval or an ordinal scale but from a nominal scale, the appropriate measure of association is the Cramer measure, φ'² (phi-hat prime squared) (Herzberg, 1989). This association coefficient is a squared measure and can range from 0 to 1. The closer the value of φ'² is to 1, the greater the association between the two nominal variables; the closer φ'² is to zero, the smaller the association. The Cramer coefficient of association is calculated using the following equation:

φ'² = χ² / (nq)     (AIV.10)

where n = the sample size
      q = one less than the smaller of the number of categories in the rows (r) and the columns (c)
      χ² = the chi-square value observed for the data

Cramer Coefficient Example

The following data set is hypothetical. Altogether, 190 consumers of chewing gum indicated which flavor they usually used. The sensory scientist was interested in determining whether there was an association between gender and gum flavor. The observed values were:

          Fruit Flavor Gum   Mint Flavor Gum   Bubble Gum   Total
Men       35                 15                50           100
Women     12                 60                18            90
Total     47                 75                68           190

The expected value for women using fruit-flavored gum is 47 × (90/190) = 22.263; for men using bubble gum, the expected value is 68 × (100/190) = 35.789. The expected values are:

          Fruit Flavor Gum   Mint Flavor Gum   Bubble Gum   Total
Men       24.737             39.474            35.789       100.000
Women     22.263             35.526            32.210        89.999
Total     47.000             75.000            67.999       189.999

χ² = Σ [(Observed - Expected)² / Expected]     (AIV.11)

Thus the calculated χ² = 52.935, and the degrees of freedom are equal to (r - 1)(c - 1):

χ² = (35 - 24.737)²/24.737 + (15 - 39.474)²/39.474 + (50 - 35.789)²/35.789
   + (12 - 22.263)²/22.263 + (60 - 35.526)²/35.526 + (18 - 32.210)²/32.210
   = 52.935     (AIV.12)

In this case, with two rows and three columns, we get (2 - 1)(3 - 1) = 2. The tabular value for χ² at alpha = 0.05 with df = 2 is 5.991, so the χ² value is significant. To determine the strength of the association between gender and use of gum flavors, we use Eq. (AIV.10):

φ'² = χ² / (nq) = 52.935 / (190 × 1) = 0.2786     (AIV.13)

The Cramer value of association is 0.2786, so there is not a strong association between gender and the use of gum flavors.
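The chi-square and Cramer calculations can be reproduced with a short routine. This Python sketch (ours, for illustration) computes the expected values, χ², and φ'² directly from the observed contingency table of the example:

```python
def cramer_phi2(table):
    # Chi-square from observed vs. expected counts, then phi'^2 = chi2 / (n*q)
    rows, cols = len(table), len(table[0])
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(rows)) for j in range(cols)]
    chi2 = 0.0
    for i in range(rows):
        for j in range(cols):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    q = min(rows, cols) - 1  # one less than the smaller table dimension
    return chi2, chi2 / (n * q)

# Observed counts: men and women by usual gum flavor (fruit, mint, bubble)
gum = [[35, 15, 50],
       [12, 60, 18]]

chi2, phi2 = cramer_phi2(gum)  # chi2 about 52.9, phi'^2 about 0.28
```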

REFERENCES

Afifi, A.A., and Clark, V. 1984. Computer-Aided Multivariate Analysis. Lifetime Learning Publications, Belmont, CA, pp. 80-119.
Anscombe, F.J. 1973. Graphs in statistical analysis. American Statistician, 27, 17-21.
Blalock, H.M. 1979. Social Statistics, 2d ed. McGraw-Hill, New York.
Freund, J.E., and Simon, G.A. 1992. Modern Elementary Statistics, 8th ed. Prentice-Hall, Englewood Cliffs, NJ, p. 474.
Herzberg, P.A. 1989. Principles of Statistics. Robert E. Krieger, Malabar, FL, pp. 378-380.
Meilgaard, M., Civille, G.V., and Carr, B.T. 1991. Sensory Evaluation Techniques, 2d ed. CRC, Boca Raton, FL.
Neter, J., and Wasserman, W. 1974. Applied Linear Statistical Models. Richard D. Irwin, Homewood, IL, pp. 53-96.
Piggott, J.R. 1986. Statistical Procedures in Food Research. Elsevier Applied Science, New York, pp. 61-100.
Snedecor, G.W., and Cochran, W.G. 1980. Statistical Methods, 7th ed. Iowa State University Press, Ames, IA, pp. 175-193.
Stevens, J. 1986. Applied Multivariate Statistics for the Social Sciences. Erlbaum, Hillsdale, NJ.

STATISTICAL POWER AND TEST SENSITIVITY

Research reports in the literature are frequently flawed by conclusions that state or imply that the null hypothesis is true. For example, following the finding that the difference between two sample means is not statistically significant, instead of properly concluding from this failure to reject the null hypothesis that the data do not warrant the conclusion that the population means differ, the writer concludes, at least implicitly, that there is no difference. The latter conclusion is always strictly invalid, and it is functionally invalid unless power is high. -J. Cohen (1988)

INTRODUCTION

The need for sensitive sensory methods and powerful statistical procedures arises from several sources in applied testing. The first is the need to discover when treatments of interest are having an effect. In food product development, these treatments usually involve changes in food constituents, the methods of processing, or types of packaging. A purchasing department may change suppliers of an ingredient. A QC department may test for the stability of a product during its shelf-life. In each of these cases, it is desirable to know when a product has become perceivably different from some comparison or control treatment, and thus sensory tests are conducted. Most statistical tests are done to ensure that a true null hypothesis is not


rejected without cause. In other words, the statistics that most of us are first taught involve tests against an assumption that the ingredient, process, or packaging change has had no effect. When enough evidence is gathered to show that our data would be rare occurrences given this assumption, we reject the assumption and conclude that a difference did occur. This process keeps us from making the Type I error discussed in Appendix I. This is a very important process, for it keeps us from concluding that a difference or even an improvement in the product is perceived when there is little or no change, and it provides some evidence that our results were unlikely to be due to chance variation. In practical terms, this keeps a research program focused on real effects and ensures that business decisions about changes are made with some confidence.

However, another kind of error in statistical decision making is also important: the error associated with accepting the null hypothesis when a difference did occur. Missing a true difference can be as dangerous as finding a spurious one, especially in product research. In order to provide tests of good sensitivity, then, the sensory evaluation specialist conducts tests using good design principles and sufficient numbers of judges and replicates. The principles of good practice are discussed in Chapter 3. These principles provide methodological rules of thumb that ensure that a test has minimized error variance due to unwanted sources. For example, we counterbalance or randomize orders of product presentation to minimize possible order or carryover effects of one product on another. Most of these practices are aimed at reducing unwanted error variance. Panel screening, orientation, and training are the other tools at the disposal of the sensory specialist that can help minimize unwanted variability.
Another example is the use of reference standards, both for sensory terms and for intensity levels in descriptive judgments. Reference standards for sensory terms help align the panelists' concepts of what is to be rated, and standards for intensity values help align their frame of reference for perceptual strength. Both types of training or calibration can lower error variance. Considering the general form of the t-test, we discover that two of the three variables in the statistical formula are under some control of the sensory scientist. Remember that the t-test takes this form:

t = (difference between means) / (standard error)

where the standard error is the sample standard deviation divided by the square root of the sample size. Of course, the exact mathematical calculations depend on which of the three types of t-test one is conducting, but they all follow this pattern. The standard deviation or error variance can be minimized by good experimental controls, panel training, and so on. Another tool for reducing error is partitioning, for example the removal of panelist effects in the complete-block ANOVA designs or in the paired t-test. A more direct test of product differences is possible without including systematic judge variation in the error term, or denominator, of the statistical formula. As the denominator of a test statistic (such as an F-ratio or a t-value) becomes smaller, the value of the test statistic becomes larger, and it is easier to reject the null hypothesis; the probability of observing the results (under the assumption of a true null) shrinks. The second factor under the control of the sensory professional is the sample size, which usually refers to the number of judges or observations. In some ANOVA models, additional degrees of freedom can also be gained by replication.

In applied work, it is also necessary to base some business decisions on acceptance of the null hypothesis. Sometimes we conclude that two products are sensorially similar, or that they are a good enough match that no systematic difference is likely to be observed by regular users of the product. In this scenario, it is critically important that a sensitive and powerful test be conducted so that a true difference is not missed; otherwise the conclusion of "no difference" could be spurious. Such decisions are common in statistical quality control, ingredient substitution, supplier changes, shelf-life and packaging studies, and a range of associated research questions.
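The partitioning idea described above can be made concrete with a paired t-test, which works on within-judge difference scores and so removes consistent judge-to-judge offsets from the error term. Below is a minimal Python sketch (ours, not from the original text) using hypothetical 9-point ratings from six judges who each scored two products:

```python
import math

def paired_t(x, y):
    # t = mean of the paired differences / standard error of those differences
    n = len(x)
    d = [a - b for a, b in zip(x, y)]
    mean_d = sum(d) / n
    sd = math.sqrt(sum((v - mean_d) ** 2 for v in d) / (n - 1))
    return mean_d / (sd / math.sqrt(n))

# Hypothetical ratings: each judge scores product P slightly above product Q
product_p = [6.1, 7.2, 5.5, 6.8, 7.0, 5.9]
product_q = [5.6, 6.5, 5.1, 6.2, 6.4, 5.5]

t = paired_t(product_p, product_q)
```

Because every judge shifts in the same direction, the difference scores have a small standard deviation, the denominator shrinks, and t is large; an unpaired analysis of the same numbers would bury that consistency in between-judge variation.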
The goal of such tests is to match an existing product or provide a new process or cost reduction that does not change or harm the sensory quality of the item. In some cases, the goal may be to match a competitor's successful product. In these practical scenarios, it is necessary to estimate the power of the test, which is the probability that a true difference would be detected. In statistical terms, this is usually described in an inverse way: first the quantity beta is defined as the long-term probability of missing a true difference, that is, the probability that a Type II error is committed; then 1 minus beta is defined as the power of the test. Power depends on several interacting factors, namely, the amount of error variation, the sample size, and the size of the difference one wants to be sure to detect in the test. This last item must be defined and set using the professional judgment of the sensory specialist or by management. This judgment causes no end of worry, since it requires an explicit decision process rather than simply "cranking through the numbers." Having made a decision, the analyst has stuck his neck out and may have to answer for what turns out, in retrospect, to be a poor recommendation. The program may be dealing with a new product for which there is not enough of a track record to estimate how big a difference might be important. However, in much applied research with existing food products, there is a knowledge base, much more so than in basic research.

This chapter will discuss the factors contributing to test power and give some worked examples and practical scenarios where power is important in sensory testing. Discussions of statistical power and worked examples can also be found in Amerine et al. (1965), Gacula and Singh (1984), and Gacula (1991, 1995). Gacula's writings include considerations of test power in substantiating claims for sensory equivalence of products. Examples specific to discrimination tests can be found in Schlich (1993) and in Ennis (1993). General references on statistical power include the classic text by Cohen (1988), his overview article written for behavioral scientists (Cohen, 1992), and the introductory statistics text by Welkowitz et al. (1982).

FACTORS AFFECTING THE POWER OF STATISTICAL TESTS

Mathematically, the power of a statistical test is determined by three interacting variables. Each of these entails choices on the part of the experimenter. They may seem arbitrary, but in the words of Cohen, "all conventions are arbitrary. One can only demand of them that they not be unreasonable" (1988, p. 12). Two choices are made in the routine process of experimental design, namely, the sample size and the alpha level.

The sample size is usually the number of judges in the sensory test. This is commonly represented by the letter N in statistical equations. In more complex designs like multifactor ANOVA, N can reflect both the number of judges and replications, or the total number of degrees of freedom contributing to the error terms for the treatments being compared. Often this value is strongly influenced by company traditions or lab "folklore" about panel size. It may also be influenced by cost considerations or the time needed to recruit, screen, and/or train and test a sufficiently large number of participants. However, this variable is the one most often considered in determinations of test power, as it can readily be modified in the experimental design phase.

Many experimenters will choose the number of panelists using considerations of desired test power. Gacula (1993) gives the following example. For a moderate-to-large consumer test, we might want to know whether the products differ by as much as 1/2 point in their mean values on the 9-point scale. Suppose we had prior knowledge that for this product the standard deviation is about 1 scale point (σ = 1); we can then find the required number of people for an experiment with 5% alpha and 10% beta (i.e., 90% power). This is given by the following relationship:

N = (Zα + Zβ)² σ² / (μ1 − μ2)²     (AV.1)

where μ1 − μ2 = the minimal difference we must be sure to detect, and Zα and Zβ = the Z-scores associated with the desired Type I and Type II error limits.

In other words, 52 observers are required to ensure that a 1/2-point difference in means can be ruled out at 90% power when a nonsignificant result is obtained. Most calculations of sample size are some variation on this formula, in which Zα, Zβ, and the effect size (in this case μ1 − μ2) are critical.

The second variable affecting power is the alpha level, or the choice of an upper limit on the probability of rejecting a true null hypothesis. In other words, this is the upper limit on Type I error in our experiments. Usually we set this value at the traditional level of 0.05, but there are no hard-and-fast rules about this magical number. In many cases in exploratory testing or industrial practice, the concern over Type II error (missing a true difference) is of sufficient weight that the alpha level for reporting statistical significance will float up to 0.10 or even higher. This strategy shows us intuitively that there is a direct relationship between the size of the alpha level and power, or in other words, an inverse relationship between alpha-risk and beta-risk. This intuitive appreciation can be enhanced if we think about the process of accepting a null hypothesis and using this decision as the basis for a conclusion of no difference between products. Consider the following outcome: We allow alpha to float up to 0.10 or 0.20 (or even higher) and still fail to find a significant p-value for our statistical test. Now we have a highly inflated risk of finding a spurious difference, but an enhanced ability to reject the null. If we still fail to reject the null, even at such relaxed levels, then there probably is no true difference among our products. In essence, we are giving the null more of a chance to be disproved. If we still can't reject it with this greater opportunity, we can be more certain that there is no difference, or none that is measurable.
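The sample-size relationship in Equation AV.1 can be sketched in a few lines, with `NormalDist` supplying the Z-scores. Note that published worked examples differ in whether Zα is taken as one- or two-tailed, so the result of this sketch (about 43 with a two-tailed Zα) need not agree exactly with the figure quoted above.

```python
from math import ceil
from statistics import NormalDist

def sample_size(diff, sigma, alpha=0.05, beta=0.10, two_tailed=True):
    """Observers needed to detect `diff` (the relationship in Eq. AV.1)."""
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2) if two_tailed else z.inv_cdf(1 - alpha)
    z_beta = z.inv_cdf(1 - beta)
    return (z_alpha + z_beta) ** 2 * sigma ** 2 / diff ** 2

# Half-point difference, sigma of 1 scale point, 5% alpha, 90% power:
n = sample_size(diff=0.5, sigma=1.0)
print(ceil(n))  # prints 43 (round up to whole panelists)
```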
This, of course, assumes a good (not sloppy) experiment, good laboratory practices, and sufficient sample size, i.e., meeting all the usual concerns about reasonable methodology. The inverse relationship between alpha and beta will be illustrated in a simple example to follow.

The third factor in the determination of power concerns the effect size one is testing against as an alternative hypothesis. This is usually a stumbling block for scientists who do not realize that they have already made two important decisions in setting up the test: the sample size and the alpha level. However, this third decision seems much more subjective to most people, so they are often psychologically unwilling to accept that there is a necessary and important choice to be made. For the general t-test, this effect size is the distance between the mean of the t-distribution (usually zero) under the null hypothesis and the mean of the t-distribution under some fixed alternative hypothesis. One can also think of this as the distance between the mean of a control product and the mean of a test product under an alternative hypothesis, in standard deviation units. For example, let's assume that our control product has a mean of 6.0 on some scale and the sample has a standard deviation of 2.0 scale units. We could test whether the comparison product had a value of less than 4.0 or greater than 8.0, i.e., one standard deviation from the mean in a two-tailed test. In plain language, this is the size of a difference that one wants to be sure to detect in the experiment. If the means of the treatments were two standard deviations apart, most scientists would call this a relatively strong effect, one that a good experiment would not want to miss after the statistical test is conducted. If the means were one standard deviation apart, this is an effect size that is common in many experiments.
If the means were less than one-half of one standard deviation apart, that would be a smaller effect, but one that still might have important business implications. Various authors have reviewed the effect sizes seen in behavioral research and have come up with guidelines for small, medium, and large effect sizes based on what is seen during the course of experimentation with human beings (Cohen, 1988; Welkowitz et al., 1982).

Several problems arise. First, this idea of effect size seems arbitrary, and an experimenter may not have any knowledge to aid in the decision. The sensory professional may simply not know how much of a consumer impact a given difference in the data is likely to produce. It is much easier to "let the statistics make the decision" by setting an alpha level according to tradition and concluding that no significant difference means that two products are sensorially equal. As shown above, this is bad logic and poor experimental testing. Experienced sensory scientists may have some advantage here, because they may have additional information at their disposal that makes this decision less arbitrary. They may know, on the basis of testing over a period of time, what effect sizes tend to have practical impact. They may have information about what size of a difference, when seen by a trained panel, will matter to consumers. Furthermore, they should have a good understanding of how the standard deviations relate to scale values. Trained panels will show standard deviations around 10% of scale range (Lawless, 1988). This is roughly 1 point on a 9-point category scale or 1 1/2 points on a 15-point scale. The value will be slightly higher for difficult sensory attributes like aroma or odor intensity, and lower for "easier" attributes like visual and some textural attributes.
Consumers, on the other hand, will show intensity attributes with variation in the realm of 25% of scale range, and sometimes even higher values for hedonics (acceptability). In the case of choice data from discrimination experiments, the values of effect size can be set on the basis of d' (d-prime) estimates from signal detection models, or from models based on the proportions of discriminators in a difference test (discussed later).

Another problem is that the effect size is often stated in units of standard deviations, so the absolute quantity on a scale will change as the experimental error increases or decreases. Therefore the effect size should not be based solely on published guidelines from the behavioral science literature but on knowledge of one's experimental tools, their reliability, and the range of experimental error variance. As noted earlier, a sloppy experiment has low sensitivity no matter what the sample size. Sensitivity in this sense is an ability to make sound conclusions and recommendations from the data, rather than a mathematical quantity. The term "sensitivity" here refers to the overall quality of the test and is thus a broader consideration than power. Sensitivity entails low error, high power, sufficient sample size, good testing conditions, good design, and so on. The term "power" refers to the formal statistical concept describing the probability of accepting a true alternative hypothesis (e.g., finding a true difference).

In a parallel fashion, Cohen (1988) drew an important distinction between effect size and "operative effect size" and showed how a good design can increase the effective sensitivity of an experiment. He used the illustration of a paired t-test as opposed to an independent-groups t-test. In the paired design, subjects function as their own controls, since they evaluate both products.
The between-person variation is "partitioned" out of the picture by computation of difference scores for each individual, i.e., differences between the two products of interest. This effectively takes judge variation in scale usage out of the picture. In contrast, in a between-groups design, the differences in scale usage become part of the error variance. In Cohen's example, test sensitivity is increased by a better experimental design.

The third problem with effect size is that clients are often unaware of it and do not understand why some apparently arbitrary decision has to enter into scientific experimentation. This is a communication problem. It is partly due to the limited statistical training of most people in decision-making positions. If the issue of statistical power and test sensitivity needs to be aired in reports of experiments or in defending the conclusions and recommendations from them, it may be useful to recruit a statistical consultant to help explain the decision process.

In mathematical terms, this effect size can be stated for the t-test as the number of standard deviations separating means, usually signified by the letter d. In the case of choice data, the common estimate is our old friend d' (d-prime) from signal detection theory, sometimes signified as a population estimate by the Greek letter delta (Ennis, 1993). For analyses based on correlation, the simple Pearson's r is a common and direct measure of association. Some guidelines for effect size parameters in simple statistical tests on behavioral data are shown here (Cohen, 1988).

Parameter      Test           Small Effect    Medium Effect    Large Effect
d              t-test         .20             .50              .80
r              correlation    .10             .30              .50
(p1 − p2)      proportions    .10             .30              .50
f              ANOVA          .10             .25              .40

In the analysis of variance, the situation is a little more complicated. In the case of ANOVA, a value analogous to d is given by the letter f in Cohen's text, and a related term in other texts is given by the Greek letter φ (phi) (e.g., Beyer, 1987). Both of these represent the pooled estimate of treatment differences relative to error, an effect size measure for testing multiple means in ANOVA. The value of f is the square root of treatment variance divided by error variance (in other words, the ratio of the standard deviation of the treatment means to the pooled standard deviation for error). The relationship between f and φ is similar to the relationship between d and the t-value in a t-test: the φ-value is simply the f-value multiplied by the square root of the per-treatment sample size. In the two-sample case, the value of f is simply d/2. Of course, in ANOVA we are testing for differences among several means, and the power calculation changes depending on how the treatment means are separated. They could be bunched at the extremes of their range, bunched in the middle, or evenly distributed. The distribution of the means affects the exact relationship, but 2d is an upper limit. For a further discussion of power estimates in ANOVA, the reader is referred to Cohen (1988) and Hays (1973).

Cohen's f parameter is also useful because it is related to the variance accounted for by the variable in question. Variance accounted for is sometimes symbolized as η² (eta squared), similar to the R² of multiple regression. The relationship between f and eta is η² = f² / (1 + f²). So for most effect sizes, f is close to eta, the square root of the proportion of variance due to the treatment (an f of .3 gives a proportion of variance accounted for of .0826, or close to 9%). This is also then a measure of the "degree of association" between our results and the original treatments or independent variables. Some people find this "degree of association" a useful way to think about effect size, much like a correlation. The diagrams below will illustrate how effect size, alpha, and beta interact.
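The conversion between f and variance accounted for can be checked directly; a one-line sketch of the relation given above:

```python
def eta_squared(f):
    """Proportion of variance accounted for, from Cohen's f."""
    return f ** 2 / (1 + f ** 2)

# An f of .3 corresponds to roughly 8-9% of variance accounted for:
print(round(eta_squared(0.3), 4))  # prints 0.0826
```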
As an example, suppose we perform a test with a rating scale, e.g., a just-about-right scale, and we want to test whether the mean rating for the product is higher than the midpoint of the scale. This is the simple t-test against a fixed value, and our hypothesis is one-tailed. For the simple one-tailed t-test, alpha represents the area under the t-distribution to the right of the cutoff determined by the limiting p-value (usually 5%). It also represents the upper tail of the sampling distribution of the mean, as shown in Figure AV.1. The value of beta is shown by the area underneath the alternative hypothesis curve to the left of the cutoff, as shaded in Figure AV.1. We have shown the sampling distribution for the mean value under the null as the bell-shaped curve on the left. The dashed line indicates the cutoff value for the upper 5% of the tail of this distribution. This would be the common value set for statistical significance, so that for a given sample size (N), the t-value at the cutoff would keep us from making a Type I error more than 5% of the time (when the null is true). The right-hand curve represents the sampling distribution for the mean under a chosen alternative hypothesis. This may seem a little hypothetical, but it is fairly straightforward to calculate what this would look like. We know the mean from our choice of effect size, and we can base the variance on our estimate from the sample standard error. When we choose the value for the mean score of our test product, the d-value becomes determined by the difference of this mean from the control, divided by the standard deviation. Useful examples are drawn in Gacula's (1991, 1995) discussions and in the section on hypothesis testing in Sokal and Rohlf (1981).

Figure AV.1. Power shown as the tail of the alternative hypothesis distribution, relative to the cutoff determined by the null hypothesis distribution. The diagram is most easily interpreted as a one-tailed t-test. A test against a fixed value of a mean would be done against a population value or a chosen scale point, such as the midpoint of a just-right scale. The value of the mean for the alternative hypothesis can be based on research, prior knowledge, or the effect size, d, the difference between the means under the null and alternative hypotheses, expressed in standard deviation units. Beta is given by the shaded area underneath the sampling distribution for the alternative hypothesis, below the cutoff determined by alpha. Power is 1 minus beta.

In this diagram, we can see how the three interacting variables work to determine the size of the shaded area for beta-risk. As the cutoff is changed by changing the alpha level, the shaded area becomes larger or smaller (see Figure AV.2). As the alpha-risk is increased, the beta-risk is decreased, all other factors being held constant. This is shown by shifting the critical value for a significant t-statistic to the left, increasing the alpha "area" and decreasing the area associated with beta.

Figure AV.2. Increasing the alpha level moves the cutoff to the left, decreasing the area associated with beta and improving power.

A second effect comes from changing the effect size or alternative hypothesis. If we test against a larger d-value, the distributions become more separated, and the area of overlap is decreased. Beta-risk decreases when we choose a bigger effect size for the alternative hypothesis. Conversely, testing for a small difference in the alternative hypothesis would pull the two distributions closer together, and if alpha is maintained at 5%, the beta-risk associated with the shaded area would have to get larger. A good example of this change with different alternative hypotheses can be seen in Gacula and Singh (1984). In their example, the meat from a cattle population is expected to have a certain value on a tenderness scale. The alternative hypothesis may state a specific difference between the expected value and the test sample mean on the tenderness scale. The chances of missing a true difference are very high if the alternative hypothesis states that the difference is very small. If the alternative hypothesis is that the difference is large, the power of the test is greater. This should seem intuitively correct: it's easier to detect a bigger difference than a smaller one, all other things in the experiment being equal.

The third effect comes from changing the sample size or the number of observations. The effect of increasing N is to shrink the effective standard deviation of the sampling distributions, decreasing the standard error of the mean. This makes the distributions taller and thinner, so there is less overlap and less area associated with beta. The t-value for the cutoff moves to the left in absolute terms.

In summary, we have four interacting variables, and knowing any three, we can determine the fourth. These are alpha, beta, N, and effect size. If we wish to specify the power of the test up front, we have to make at least two other decisions, and then the fourth will be determined for us.
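These three effects can be sketched numerically with a normal approximation to the one-tailed test (an exact t-based calculation would give slightly lower power at small N; the sketch only illustrates the direction of each effect):

```python
from math import sqrt
from statistics import NormalDist

z = NormalDist()

def power_one_tailed(d, n, alpha=0.05):
    """Approximate power of a one-tailed, one-sample test.

    d: effect size in standard deviation units; n: sample size.
    The cutoff sits at z(1 - alpha) under the null; power is the
    area of the alternative distribution beyond that cutoff.
    """
    cutoff = z.inv_cdf(1 - alpha)
    return 1 - z.cdf(cutoff - d * sqrt(n))

base = power_one_tailed(d=0.5, n=30)
# Raising alpha shifts the cutoff left and increases power:
assert power_one_tailed(0.5, 30, alpha=0.10) > base
# A larger effect size increases power:
assert power_one_tailed(0.8, 30) > base
# More observations increase power:
assert power_one_tailed(0.5, 60) > base
print(round(base, 2))  # prints 0.86
```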
For example, if we want 80% test power (beta = 0.20), alpha equals 0.05, and we can test only 50 subjects, then the effect size we are able to detect at this level of power is fixed. If we desire 80% test power, want to detect 0.5 standard deviations of difference, and set alpha at 0.05, then we can calculate the number of panelists that must be tested (i.e., N has been determined by the specification of the other three variables). In many cases, experiments are conducted only with initial concern for alpha and sample size. In that case, there is a monotonic relationship between the other two variables that can be examined after the experiment to tell us what power can be expected for different effect sizes. These relationships are illustrated below using GPOWER, a freeware program capable of showing these relationships for the simple test designs often encountered in research (Erdfelder et al., 1996). Tables for the power of statistical tests can also be found in Cohen (1988).

For a specific illustration, let's examine the simple between-groups t-test to look at the relationship between alpha, beta, effect size, and N. In this situation, we want to compare two means generated from independent groups, and the alternative hypothesis predicts that the means are not equal (i.e., no direction is predicted). Figure AV.3 shows the power of the two-tailed independent-groups t-test as a function of different sample sizes (N) and different alternative hypothesis effect sizes (d). If we set the lower limit of acceptable power at 50%, we can see from these curves that using 200 panelists would allow us to detect a small difference of about 0.3 standard deviations.
With 100 subjects, this difference must be about 0.4 standard deviations, and for small sensory tests of 50 or 20 panelists (25 or 10 per group, respectively), we can only detect differences of about 0.6 or 0.95 standard deviations with a 50/50 chance of missing a true difference. This indicates the liabilities in using a small sensory test to justify a "parity" decision about products.

Figure AV.3. Power of the two-tailed independent-groups t-test as a function of different sample sizes (N = 20, 50, 100, 200) and different alternative hypothesis effect sizes (d); the decision is two-tailed at alpha = .05. The effect size d represents the difference between the means in standard deviations. Computed from the GPOWER program of Erdfelder et al. (1996).

Often, sensory scientists want to know the required sample size for a test so that they can recruit the appropriate number of consumers or panelists for a study. Figure AV.4 shows the sample size required at different levels of power for a between-groups t-test and a decision that is two-tailed. An example of such a design would be a consumer test for product acceptability, with scaled data and each of the products placed with a different consumer group (a so-called "monadic" design). Note that the sample-size scale is log transformed, since the group size becomes very large if we are looking for small effects.

Figure AV.4. Number of judges required for the independent-groups t-test at different levels of power; the decision is two-tailed at alpha = .05. Note that the sample size is plotted on a log scale. Computed from the GPOWER program of Erdfelder et al. (1996).

For a very small effect of only 0.2 standard deviations, we need 388 consumers to have a minimal power level of 0.5. If we want to increase power to 90%, the number exceeds 1,000. On the other hand, for a big difference of 0.8 standard deviations (about 1 scale point on the 9-point hedonic scale), we only need 28 consumers for 50% power and 68 consumers for 90% power. This illustrates why some sensory tests done for research and product development purposes are smaller than the corresponding marketing research tests. Often the goal is to uncover a reliable and perceivable leap forward in product technology, an improvement of solid and practical significance. There is reason for concern and great potential waste of resources if a research program is started based on a false difference, so alpha-risk is key. Market research tests, on the other hand, may be aimed at finding small advantages in a mature or optimized product system, and this requires a test of high power to keep both alpha- and beta-risks low. Of course, there are many exceptions to these generalizations in both product development and marketing research.
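The figures quoted from GPOWER can be approximately reproduced with a normal approximation for the two-tailed, two-group comparison (GPOWER itself uses the noncentral t-distribution, so values for small groups differ slightly from this sketch):

```python
from math import sqrt
from statistics import NormalDist

z = NormalDist()

def power_two_groups(d, n_total, alpha=0.05):
    """Approximate two-tailed power; n_total split into two equal groups."""
    n = n_total / 2  # per-group size
    z_alpha = z.inv_cdf(1 - alpha / 2)
    delta = d * sqrt(n / 2)  # standardized mean difference in SE units
    # Area beyond either cutoff under the alternative distribution:
    return (1 - z.cdf(z_alpha - delta)) + z.cdf(-z_alpha - delta)

p_small = power_two_groups(d=0.2, n_total=388)  # small effect, 388 consumers
p_large = power_two_groups(d=0.8, n_total=68)   # large effect, 68 consumers
print(round(p_small, 2), round(p_large, 2))     # about 0.50 and 0.91
```

The small-effect case lands almost exactly at the 50% power quoted above; the large-effect case comes out near 0.91, close to the 90% figure from the t-based calculation.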

WORKED EXAMPLES

Example 1. Claim of equivalence and calculations from obtained results. Gacula (1991, 1995) gives examples of calculations of test power using several scenarios devoted to substantiating claims of product equivalence. These are mostly based on larger-scale consumer tests, where the sample size justifies the use of the normal distribution (Z) rather than the small-sample t-test. In such an experiment, the calculation of power is straightforward once the mean difference associated with the alternative hypothesis is stated. The calculation for power follows this relationship:

Power = 1 − beta = 1 − Φ[(Xc − μa) / SE]

where Xc represents the cutoff value for a significantly higher score, determined by the alpha level. For a one-tailed test, the cutoff is equal to the mean plus 1.645 times the standard error. For a two-tailed alternative, the value of Z changes from 1.645 to 1.96. The Greek letter Φ represents the value of the cumulative normal distribution; in other words, we are converting the Z-score to a proportion or probability value. Since many tables of the cumulative normal distribution give the larger proportion rather than the tail (as is true in Gacula's tables), it is sometimes necessary to subtract the tabled value from 1 to get the other tail. The parameter μa represents the mean as determined by the alternative hypothesis. This equation simply finds the area underneath the alternative hypothesis Z-distribution beyond the cutoff value Xc.

Here is a scenario similar to one from Gacula (1991). A consumer group of 92 panelists evaluates two products and gives them mean scores of 5.9 and 6.1 on a 9-point hedonic scale. This is not a significant difference, and the sensory professional is tempted to conclude that the products are equivalent. Is this conclusion justified?

The standard deviation for this study was 1.1, giving a standard error of 0.11. The cutoff values for the 95% confidence interval are then 1.96 standard errors, or the mean plus or minus 0.22. We can immediately see that the two means lie within the 95% confidence interval, and the statistical conclusion of no difference seems to be justified. A two-tailed test is used to see whether the new product is higher than the standard product receiving a 5.9. The two-tailed test requires a cutoff that is 1.96 standard errors, or

0.22 units above the mean. This sets our upper cutoff value Xc at 5.9 + 0.22, or 6.12. Once this boundary has been determined, it can be used to split the distributions expected on the basis of the alternative hypotheses into two sections. This is shown in Figure AV.5. The section that is higher in this case represents the detection of a difference, or power (null rejected), while the section that is lower represents the chance of missing the difference, or beta (null accepted).

In this example, Gacula originally used the actual mean difference of 0.20 as the alternative hypothesis. This would place the alternative hypothesis mean at 5.9 + 0.2, or 6.1. To estimate beta, we need to know the area in the tail of the alternative hypothesis distribution to the left of the cutoff. This can be found once we know the distance of the cutoff from our alternative mean of 6.1. In this example, there is a small difference from the cutoff of only 6.12 − 6.1, or 0.02 units on the original scale; dividing 0.02 by the standard error gives about 0.2 Z-score units from the mean of the alternative to the cutoff. Essentially, this mean lies very near to the cutoff, and we have split the alternative sampling distribution about in half. The area in the tail associated with beta is large, about 0.57, so power is about 43% (1 minus beta). Thus we see that this test falls below practical limits for power, and the conclusion of equivalence is not strongly supported under the assumption that the true mean lies so close to 5.9. However, we have tested against a very small difference as the basis for our alternative hypothesis.

Suppose we relax the alternative hypothesis. Let's presume that we determined before the experiment that a difference of one-half of one standard deviation on our scale would be considered the lower limit of a difference of any practical importance.
We could then set the mean for the alternative hypothesis at 5.9 plus one-half of the standard deviation (1.1/2, or 0.55). The mean for the alternative now becomes 5.9 + 0.55, or 6.45. Our cutoff of 6.12 is now 6.45 − 6.12 = 0.33 units away; dividing 0.33 by the standard error of 0.11 converts this to Z-score units, giving a value of 3. This has effectively shifted the expected distribution to the right, while our decision cutoff remains the same at 6.12. The area in the tail associated with beta would now be less than 1%, and power would be about 99%. The choice of an alternative hypothesis can greatly affect the confidence of our decisions. If the business decision justifies a choice of one-half of a scale unit as a practical cutoff (based on one-half of one standard deviation), then we can see that our difference of only 0.2 units between mean scores is fairly "safe." The power calculations tell us exactly how safe this would be. Essentially, there is only a 1% chance of seeing this result or one more extreme if our true mean score were 0.55 units higher. The observed events are fairly unlikely given this alternative, so we reject it in favor of accepting the null.
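Both power figures in this example follow from the relationship given above; a quick check, using the rounded standard error and cutoff from the text:

```python
from statistics import NormalDist

phi = NormalDist().cdf  # cumulative normal, the Phi of the power formula

se = 0.11      # standard error, 1.1 / sqrt(92), rounded as in the text
cutoff = 6.12  # 5.9 + 1.96 * SE, rounded as in the text

# Alternative mean at the observed 6.1: power is only about 43%.
power_small = 1 - phi((cutoff - 6.1) / se)
# Alternative mean at 5.9 + 0.55 = 6.45: power is about 99%.
power_large = 1 - phi((cutoff - 6.45) / se)

print(round(power_small, 2), round(power_large, 3))  # prints 0.43 0.999
```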

Figure AV.5. Power first depends on setting a cutoff value based on the sample mean, standard error, and the alpha level; in the example shown, this value is 6.12 (scores above 6.12 fall in the region of rejection of the null, with alpha-risk at 5%). The cutoff value can then be used to determine power and beta-risk for various expected distributions of means under alternative hypotheses. In Gacula's first example, the alternative hypothesis was based on the actual mean obtained for the comparison product, 6.1, or 0.2 units of difference; the power calculation gives only 43%, which does not provide a great deal of confidence in a conclusion about product equivalence. The lower panel shows the power for testing against an alternative hypothesis that states that the true comparison product mean is 6.45 or larger; our sample and experiment would detect this larger difference with about 99% power.

Example 2. Computing sample size for a difference test. Amerine et al. (1965) gave a useful general formula for computing the necessary number of judges in a discrimination test, based on beta-risk, alpha, and the critical difference that must be detected. This last item is conceived of as the difference between the chance probability p0 and the probability associated with an alternative hypothesis, pa. This difference is discussed below from two slightly different perspectives: the chance-adjusted probability in Schlich's (1993) tables and the sensory difference in Ennis's (1993) tables taken from Thurstonian or signal detection models. Whatever model we may apply, choice of the alternative probability requires some degree of professional judgment, like any assessment of power. For the sake of example, we will take the chance-adjusted probability for 50% correct, which is halfway between the chance probability and 100% detection. In the triangle test, the chance-adjusted 50% probability is 66.7% (halfway between 33.3% and 100%). This has sometimes been used as a functional definition of threshold (see Chapter 6) and also represents the best estimate when 50% of the population can discriminate the difference in the so-called "proportions of discriminators" model. The formula is as follows:

N = {[Zα √(p0(1 − p0)) + Zβ √(pa(1 − pa))] / (pa − p0)}²     (AV.2)

For a one-tailed test (at α = 0.05), Zα = 1.645, and if beta is kept to 10% (90% power), then Zβ = 1.28. The critical difference, pa − p0, has a strong influence on the equation. In the case where it is set to 33.3% for the triangle test (a threshold of sorts), we then require the number of respondents shown in the following calculation:

N = {[1.645 √(.333(.666)) + 1.28 √(.666(.333))] / .333}² = 17.11

So you would need a panel of about 17 or more persons to protect against missing a difference this big and limit your risk to 10%. Note that this is a fairly gross test, as the difference we are trying to detect is potentially large. If half of a consumer population notices a difference, the product could be in trouble. Now suppose we do not wish to be this lenient but prefer a test that will be sure to catch a difference about half this big, at the 95% power level instead of 90%. Let's change one variable at a time to see which has a bigger effect. First, power goes from 90% to 95%, and if we remain one-tailed, Zβ now equals 1.645. So the numbers become:

N = {[1.645 √(.333(.666)) + 1.645 √(.666(.333))] / .333}² = 22.06

So with the increase in power alone, we need five additional people for our panel. However, if we decrease the effect size we want to detect by half, the numbers become:

N = {[1.645 √(.333(.666)) + 1.28 √(.666(.333))] / .167}² = 68.09

Now the required panel size has quadrupled. The combined effect of changing both beta and testing for a smaller effect is:

N = {[1.645 √(.333(.666)) + 1.645 √(.666(.333))] / .167}² = 85.73

Note that in this example, the effect of halving the effect size (critical difference) was greater than the effect of halving the beta-risk. Choosing a reasonable alternative hypothesis is a decision that deserves some care. If the goal is to ensure that almost no person will see a difference, or only a small proportion of consumers (or only a small proportion of the time), a large test may be necessary to have confidence in a "no difference" conclusion. A panel size of 86 testers is probably larger than most people consider for a triangle test. Yet it is not unreasonable to have this extra size in the panel when the research question and important consequences concern a parity or equivalence decision. Similar "large" sample sizes can be found in the test for similarity as outlined by Meilgaard et al. (1991).
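The four calculations above can be reproduced with a short function (a sketch, not the authors' code; the function name is ours). Following the text, the critical difference in the denominator is varied separately, while the variance terms keep p0 = .333 and pa = .666; the text's printed values of 22.06 and 85.73 differ slightly from these results, presumably from rounding in the original calculations:

```python
# Sketch of formula AV.2 (Amerine et al.) for discrimination-test
# sample size. p0 is the chance proportion, pa the alternative;
# z values are one-tailed normal deviates for alpha and beta.
import math

def n_required(p0, pa, diff, z_alpha=1.645, z_beta=1.28):
    # Numerator combines the binomial standard deviations under the
    # null (p0) and the alternative (pa); diff is the critical difference.
    num = z_alpha * math.sqrt(p0 * (1 - p0)) + z_beta * math.sqrt(pa * (1 - pa))
    return (num / diff) ** 2

print(round(n_required(0.333, 0.666, 0.333), 2))                # ~17.11
print(round(n_required(0.333, 0.666, 0.333, z_beta=1.645), 2))  # ~21.6 (text: 22.06)
print(round(n_required(0.333, 0.666, 0.167), 2))                # ~68.0 (text: 68.09)
print(round(n_required(0.333, 0.666, 0.167, z_beta=1.645), 2))  # ~86.1 (text: 85.73)
```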

POWER IN SIMPLE DIFFERENCE AND PREFERENCE TESTS

The scenarios in which we test for the existence of a difference or the existence of a preference often involve important conclusions when no significant effect is found. These are testing situations where acceptance of the null, and therefore establishing the power of the test, are of great importance. Perhaps for this reason, power and beta-risk in these situations have been addressed by several theorists. The difference testing approaches of Schlich and Ennis are discussed below, and a general approach to statistical power is shown in the introductory text by Welkowitz et al. (1982). Schlich (1993) published risk tables for discrimination tests and offered a SAS routine to calculate beta-risk based on exact binomial probabilities. His article also contains tables of alpha- and beta-risk for small discrimination tests and minimum numbers of testers and correct responses associated with different levels of alpha and beta. Separate tables are computed for the triangle test and for the duo-trio test. The duo-trio table is also used for the directional paired comparison, as the tests are both one-tailed with a chance probability of one-half. The binomial-based tables are more accurate than other tables that are based on approximations to the normal distribution. The simple tables showing minimum numbers of testers and correct responses for different levels of beta and alpha are abridged and shown for the triangle test (Table AV.1) and for the duo-trio test (Table AV.2). In both of these cases, the effect size parameter is stated as the chance-adjusted percent correct. This is based on Abbott's formula, where the chance-adjusted proportion, p_adj, is based on the difference between the alternative hypothesis percent correct, pa, and the chance percent correct, p0, by the following formula:

p_adj = (pa − p0) / (1 − p0)
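Abbott's formula and its inverse can be sketched in a few lines (a minimal illustration; the function names are ours):

```python
# Abbott's formula: convert an observed proportion correct (pa) to the
# chance-adjusted proportion ("proportion of discriminators"), and back.
def chance_adjusted(pa, p0):
    return (pa - p0) / (1 - p0)

def raw_proportion(p_adj, p0):
    return p0 + p_adj * (1 - p0)

# Triangle test: two-thirds correct corresponds to 50% discriminators.
print(round(chance_adjusted(2/3, 1/3), 2))  # 0.5
print(round(raw_proportion(0.5, 1/3), 3))   # 0.667
```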

Table AV.1. Minimum Number of Responses (n) and Correct Responses (x) to Obtain a Level of Type I and Type II Risks in the Triangle Test. (Pc is the chance-adjusted percent correct or proportion of discriminators.)

                                Type II Risk
                         0.20        0.10        0.05
          Type I Risk   n    x      n    x      n    x
Pc = 50%      .10      12    7     15    8     20   10
              .05      16    9     20   11     23   12
              .01      25   15     30   17     35   19
Pc = 40%      .10      17    9     25   12     30   14
              .05      23   12     30   15     40   19
              .01      35   19     47   24     56   28
Pc = 30%      .10      30   14     43   19     54   23
              .05      40   19     53   24     66   29
              .01      62   30     82   38     97   44
Pc = 20%      .10      62   26     89   36    119   47
              .05      87   37    117   48    147   59
              .01     136   59    176   74    211   87

Table AV.2. Minimum Number of Responses (n) and Correct Responses (x) to Obtain a Level of Type I and Type II Risks in the Duo-Trio Test. (Pc is the chance-adjusted percent correct or proportion of discriminators.)

                                Type II Risk
                         0.20        0.10        0.05
          Type I Risk   n    x      n    x      n    x
Pc = 50%      .10      19   13     26   17     33   21
              .05      23   16     33   22     42   27
              .01      40   28     50   34     59   39
Pc = 40%      .10      28   18     39   24     53   32
              .05      37   24     53   33     67   41
              .01      64   42     80   51     96   60
Pc = 30%      .10      53   32     72   42     96   55
              .05      69   42     93   55    119   69
              .01     112   69    143   86    174  103
Pc = 20%      .10     115   65    168   93    214  117
              .05     158   90    213  119    268  148
              .01     252  145    325  184    391  219
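Entries in these tables can be checked with an exact binomial calculation in the spirit of Schlich's SAS routine (a sketch, not Schlich's code; function names are ours). For the first Table AV.1 cell (triangle test, Pc = 50%, alpha .10, beta .20, n = 12, x = 7):

```python
# Exact binomial check of a tabled entry. Alpha is the upper-tail
# probability at chance; beta is the lower-tail probability under
# the alternative implied by the proportion of discriminators.
from math import comb

def binom_tail(n, x, p):
    """P(X >= x) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x, n + 1))

n, x = 12, 7
p0 = 1/3                   # chance level in the triangle test
pa = p0 + 0.5 * (1 - p0)   # 50% discriminators -> 2/3 correct (Abbott's formula)
alpha = binom_tail(n, x, p0)       # risk of declaring a spurious difference
beta = 1 - binom_tail(n, x, pa)    # risk of missing a true difference
print(round(alpha, 3), round(beta, 3))  # about 0.066 and 0.178, within the .10/.20 limits
```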

We have tabulated the values for p_adj at 20% to 50% above chance. In Schlich's tables, other levels of effect size are given, and he discusses p_adj as the proportion of individuals discriminating. This is the so-called discriminator or guessing model discussed in Chapter 5. The logic of this approach is to adjust the observed proportion correct for those who may have obtained the right answer (e.g., chosen the odd sample in the triangle test) by mere guessing. Schlich suggests the following guidelines for effect size: 50% above chance is a large effect (50% discriminators), 37.5% above chance is a medium effect, and 25% above chance is a small effect. Schlich also gave some examples of useful scenarios in which the interplay of alpha and effect size is driven by competing business interests. For example, manufacturing might wish to make a cost reduction by changing an ingredient, but if a spurious difference is found, they will not recommend the switch and will not save any money. Therefore the manufacturing decision is to keep alpha low. A marketing manager, on the other hand, might want to ensure that very few if any consumers can see a difference. Thus they wish to test against a small effect size, or be sure to detect even a small number of discriminators. Keeping the test power high (beta low) under both of these conditions will drive the required sample size (N) to a very high level, perhaps hundreds of subjects, as seen in the examples below. It is not often practical to run discrimination tests with more than 100 testers, so some compromise must be forthcoming in negotiations between the conflicting parties, and the sensory manager will have to communicate the risks that are involved within the constraints of practical testing.
Schlich's tables provide a crossover point for a situation in which both alpha and beta will be low given a sufficient number of testers and a certain effect size. If fewer than the tabulated number (x) answer correctly, the chance of Type I error will increase should you decide that there is a difference, but the chance of Type II error will decrease should you decide that there is no difference. Conversely, if the number of correct judges exceeds that in the table, the chance of finding a spurious difference will decrease should you reject the null, but the chance of missing a true difference will increase if you accept the null. So it is possible to use these minimal values for a decision rule. Specifically, if the number correct is less than x, accept the null, and there will be lower beta-risk than that listed in the column heading. If the number correct is greater than x, reject the null, and alpha-risk will be lower than that listed. It is also possible to interpolate to find other values. Of course, a better approach than interpolation is to actually compute the value from the SAS routine.

Another set of tabulations for power in discrimination tests has been given by Ennis (1993). Instead of basing the alternative hypothesis on the proportions of discriminators, he has computed a measure of sensory difference or effect size based on Thurstonian modeling. These models take into account the fact that different tests may have the same chance probability level, but some discrimination methods are more difficult than others. The concept of "more difficult" shows up in the signal detection models as higher variability in the perceptual comparisons. The more difficult test requires a bigger sensory difference to obtain the same number of correct judges. The triangle test is more difficult than the three-alternative forced-choice test (3-AFC). In the 3-AFC, the panelist's attention is usually directed to a specific attribute rather than to choosing the odd sample.
However, the chance percent correct for both triangle and 3-AFC tests is 1/3. The correction for guessing, being based on the chance level, does not take into account the difficulty of the triangle procedure. The "difficulty" arises due to the inherent variability in judging differences between things, as opposed to judging simply how strong or weak a given attribute is. In the triangle test, three items must be compared (three paired comparisons, really) to find the outlier, while in the 3-AFC test one merely has to find the strongest or weakest. Thurstonian models (a variation of signal detection models) are an improvement over the "proportion of discriminators" model since they do account for the difference in inherent variability. Ennis's tables use the Thurstonian sensory differences, symbolized by the lowercase Greek letter delta, δ. This is much like the d value in the t-test discussed above or the d' (d-prime) value in signal detection, but it is a population (model) statistic rather than a sample statistic. Delta represents the sensory difference in standard deviations. The standard deviations are theoretical variability estimates of the sensory impressions created by the different products. The delta values have the advantage that they are comparable across methods, unlike the percent correct or the chance-adjusted percent correct. Thus it is advantageous for sensory scientists who may deal with various difference test methods to begin to think in terms of Thurstonian sensory distance rather than effect size based on percent correct. Table AV.3 shows the numbers of judges required for different levels of power (80%, 90%) and different delta values in the duo-trio, triangle, 2-AFC (paired comparison), and 3-AFC tests. The lower numbers of judges required for the 2-AFC and 3-AFC tests arise from their higher sensitivity to differences, i.e., lower inherent variability under the Thurstonian models.
Table AV.3. Number of Judges Required for Different Levels of Power and Sensory Difference (δ) for Triangle, 3-AFC, Duo-Trio, and 2-AFC Tests, with Alpha = 0.05

   δ      2-AFC   Duo-Trio   3-AFC   Triangle
80% Power
 0.50       78     3,092       64     2,742
 0.75       35       652       27       576
 1.00       20       225       15       197
 1.25       13       102        9        88
 1.50        9        55        6        47
 1.75        7        34        5        28
 2.00        6        23        3        19
90% Power
 0.50      108     4,283       89     3,810
 0.75       48       902       39       802
 1.00       27       310       21       276
 1.25       18       141       13       124
 1.50       12        76        9        66
 1.75        9        46        6        40
 2.00        7        31        5        26

In terms of delta values, we can see that the usual discrimination tests done with 25 or 50 panelists will detect only gross differences (delta > 1.5) if the triangle or duo-trio procedures are used. This fact offers some warning that the "nonspecific" tests for overall difference (triangle, duo-trio) are really only gross tools that are better suited to giving confidence when a difference is detected. When no difference is detected, there is often a chance that a small difference was missed. The AFC tests, on the other hand, like a paired comparison test where the attribute of difference is specified (e.g., "pick which one is sweeter"), are safer when a no-difference decision is important. Although this is a more valid approach across test methods than the simple correction for guessing, it is sometimes difficult to know what is a large or small difference based on the delta value or hypothetical sensory differences. Furthermore, the models may not always hold, given changing strategies of judges in actual situations (Antinone et al., 1994; Macmillan and Creelman, 1991). Sensory professionals may consider either approach depending on their degree of comfort with the underlying theory and its communication value to management. However, they should bear in mind that if the "proportion of discriminators" is used in setting the effect size, that effect size is not constant if the task is changed. Being a "discriminator" in a triangle test is different than being a discriminator in a 3-AFC test. The logic and merits of the two approaches are discussed further in Chapter 5.
A useful table for various simple tests was given by Welkowitz et al. (1982), in which the effect size and sample size are considered jointly to produce a power table as a function of alpha. This produces a value we will tabulate as the capital Greek letter delta (Δ, to distinguish it from the lowercase delta in Ennis's tables), while the raw effect size is given by the letter d. Δ can be thought of as the d-value corrected for sample size. The Δ- and d-values take the forms shown in Table AV.4 for the simple statistical tests. Computing these delta values, which take into account the sample size, allows the referencing of power questions to one simple table (Table AV.5).

Table AV.4. Conversion of Effect Size (d) to Delta Value (Δ) Considering Sample Size

Test                        d-Value                        Δ
One-sample t-test           d = (μ0 − μ1)/σ                Δ = d √N
Dependent t-test            d = (μ0 − μ1)/σ                Δ = d √N
Independent-groups t-test   d = (μ0 − μ1)/σ                Δ = d √(nh/2), where nh = 2N1N2/(N1 + N2)
Correlation                 d = r                          Δ = d √(N − 1)
Proportions                 d = (P0 − Pa)/√(P0(1 − P0))    Δ = d √N

Table AV.5. Effect Size Adjusted for Sample Size (Δ) to Show Power as a Function of Alpha. (Table entries are power.)

        Two-tailed alpha:   .05    .025    .01    .005
        One-tailed alpha:   .10    .05     .02    .01
  Δ
 0.2                        .11    .05     .02    .01
 0.4                        .13    .07     .03    .01
 0.6                        .16    .09     .03    .01
 0.8                        .21    .13     .06    .04
 1.0                        .26    .17     .09    .06
 1.2                        .33    .22     .13    .08
 1.4                        .40    .29     .18    .12
 1.6                        .48    .36     .23    .16
 1.8                        .56    .44     .30    .22
 2.0                        .64    .52     .37    .28
 2.2                        .71    .59     .45    .35
 2.4                        .77    .67     .53    .43
 2.6                        .83    .74     .61    .51
 2.8                        .88    .80     .68    .59
 3.0                        .91    .85     .75    .66
 3.2                        .94    .89     .78    .70
 3.4                        .96    .93     .86    .80
 3.6                        .97    .95     .90    .85
 3.8                        .98    .97     .94    .91
 4.0                        .99    .98     .95    .92

From Welkowitz et al., 1982, Appendix Table H. Reprinted by permission.

In other words, all of these simple tests allow power calculations via the same table. Here is a worked example, using a two-tailed test on proportions (Welkowitz et al., 1982). Suppose a marketing researcher thinks that a product improvement will produce a preference difference of about 8% against the standard product. In other words, he or she expects that in a preference test, the split among consumers of this product would be something like 46% preferring the standard product and 54% preferring the new product. The researcher conducts a preference test with 400 people, considered by "intuition" to be a hefty sample size, and finds no difference. What is the power of the test, and what is the chance that the researcher missed a true difference of that size? First, the d-value becomes (.54 − .50)/√(.50(1 − .50)), or 0.08. Note that the test is two-tailed, as we make the conservative prediction that the preference could in fact go either way. The delta-value is d √N, or .08 × √400, or 1.60. Referring to Table AV.5, we find that with alpha set at the traditional 5% level, the power is 48%, so there is still a 52% chance of Type II error (missing a difference) even with this "hefty" sample size.
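The worked example can be reproduced in a few lines (a sketch; function names are ours). We approximate the table lookup with the normal relation power ≈ 1 − Φ(z − Δ), using z = 1.645, which is our assumption about how the table's first column is generated:

```python
# Delta for a test on proportions (Table AV.4) and approximate power.
from math import sqrt
from statistics import NormalDist

def delta_for_proportion(p_alt, p_null, n):
    d = (p_alt - p_null) / sqrt(p_null * (1 - p_null))  # raw effect size d
    return d * sqrt(n)                                  # Delta = d * sqrt(N)

delta = delta_for_proportion(0.54, 0.50, 400)   # 1.6, as in the text
power = 1 - NormalDist().cdf(1.645 - delta)     # normal approximation to the table
print(round(delta, 2), round(power, 2))         # 1.6 and about 0.48
```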

The problem in this example is that the alternative hypothesis predicts a close race. If the researcher wants to distinguish advantages that are this small, even larger tests must be conducted to be conclusive. It is perhaps more common in mature product lines that a "win" this small will be found in some testing programs. The product may already be optimized, and only small advantages accrue with further research and development. So the alternative hypothesis here may not be unrealistic. Furthermore, getting a parity or sensory equivalence result could involve an important business decision: The alleged improvement may not be worthwhile, and money can be saved by not making the change in ingredients or processing. We can then turn the situation around and ask how many consumers should be in the test, given the small win that is expected and the importance of a "no preference" conclusion. We can use the following relationship for proportions:

N = 2(Δ/d)²

For a required power of 80%, and keeping alpha at the traditional 5% level, we find that a delta value of 2.80 is required. Substituting in our example, we get:

N = 2(2.80/.08)² = 2,450

This might not seem like a common consumer test size to sensory scientists, who are more concerned with alpha-risk, but in marketing research or political polling of close races, these larger samples are sometimes justified, as our example shows.
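The sample-size relation can be checked directly (a trivial sketch; the function name is ours):

```python
# Required sample size from N = 2 * (Delta / d)**2, using the values
# in the text (Delta = 2.80 for 80% power at alpha = .05, d = 0.08).
def n_for_power(delta, d):
    return 2 * (delta / d) ** 2

print(round(n_for_power(2.80, 0.08)))  # 2450
```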

SUMMARY AND CONCLUSIONS

A finding of "no difference" is often of great importance in applied sensory evaluation in support of product research. Many business decisions in foods and consumer products are made on the basis of small product changes that result in cost reduction, a change of process variables in manufacturing, or a simple change of ingredients or suppliers. Whether or not consumers will notice the change is the inference made from sensory research. In many cases, some insurance is provided by designing a sensitive test under controlled conditions. This is the philosophy of the "safety net" approach to sensory testing. It might be paraphrased as follows: "If we don't see a difference under controlled conditions using trained (or selected, screened, oriented, etc.) panelists, then consumers are unlikely to notice this change under the more casual and variable conditions of natural observation." This logic, of course, depends on the assumption that the laboratory test is in fact more sensitive in allowing the detection of sensory differences than the consumer's experience. This is usually so, but not always, since the consumer has extended and multiple opportunities to observe the product under a variety of conditions, while the laboratory-based sensory analysis is often limited in time, scope, and conditions of evaluation. In these scenarios, our everyday statistics aimed at rejecting the null are not much help. Most of our statistics are designed to provide evidence for differences against an initial assumption or "straw man" that results would be null. Many researchers even go so far as to embrace the old saying that "science cannot prove a negative" and thus infer that theory only moves ahead when hypotheses are rejected. This is simply an illogical extension of overly rigid reasoning and can be harmful in the realm of applied sensory technology, product research, and business decision making.
In traditional experimental statistics, null hypothesis rejection provides evidence of the dependence of sensory changes on various treatments or, more broadly speaking, of the association of dependent and independent variables. However, when the goal is to provide evidence that two treatments are sensorially not different, our hypothesis testing must be supplemented with additional information. As stated above, a conclusion of "no difference" based only on a failure to reject the null hypothesis is not logically airtight. If we fail to reject the null, at least three possibilities arise: First, there may have been too much error or random noise in the experiment, so that statistical significance was lost or swamped by large standard deviations. It is a simple matter to do a sloppy experiment. Second, we may not have tested a sufficient number of panelists. If the sample size is too small, we may miss statistical significance because the confidence intervals around our observations are simply too wide to rule out potentially important sensory differences. Third, there may truly be no difference (or no practical difference) between our products. So a failure to reject the null hypothesis is ambiguous, and it is not proper to conclude that two products are sensorially equivalent simply based on a failure to reject the null. More information is needed.

One approach to this is experimental. If the sensory test is sensitive enough to show a difference in some other condition or comparison, it is difficult to argue that the test was simply not sensitive enough to find a difference in a similar study. Consideration of a track record or demonstrated history of detecting differences with the test method is helpful. Another approach is to "bracket" the test comparison with known levels of difference. In other words, include extra products in the test that one would expect to be different.
Baseline or positive and negative control comparisons can be tested, and if the panel finds significant differences between those benchmark products, we have evidence that the tool is working. Of course, the more commonsense impact the additional comparisons have, the more convincing the experiment becomes. If the experimental comparison was not significant, then the difference is probably smaller than the sensory difference in the benchmark or bracketing comparisons that did reach significance. Using meta-analytic comparisons (Rosenthal, 1987), this conclusion about relative effect size may be formally tested. A related approach is to turn the significance test around, as in the test for significant similarity discussed in Chapter 6. In that approach, the performance in a discrimination test must be at or above chance but significantly below some chosen cutoff for concluding that products are similar, practically (not statistically) speaking. As suggested above, another approach to this problem is based on the track record of the sensory method and panel. As applied in a particular laboratory and with a known panel, it is reasonable to conclude that a tool that has often shown differences in the past is operating well and is sufficiently discriminative. Given the history of the sensory procedure under known conditions, it should be possible to use this sort of commonsense approach to minimize risk in decision making. In an ongoing sensory testing program for discrimination, it would be reasonable to use a panel of good size (say, 50 screened testers), perform a replicated test, and know whether the panel had shown reliable differences in the past. In such a scenario, the panel would provide evidence for sensory equivalence when no significant difference is found. However, this is slippery ground, because even good panels can have a bad day, or someone in the prep area may have made a mistake in coding or serving.
One would certainly not want to reach a conclusion about equivalence based on an obtained p-value of 0.051 when alpha is 0.05. This charges directly into the realm of the ambiguous failure to reject the null once again. The third approach is to do a formal analysis of the test power. When a failure to reject the null is accompanied by evidence that the test was of sufficient power, reasonable scientific conclusions may be stated and business decisions can be made with reduced risk. Sensory scientists would do well to make some estimates of test power before conducting any experiment where a null result will generate important actions. Often our intuitive estimates of test power are higher than statistical calculations show them to be. It is easy to overestimate the statistical sensitivity of a test. On the other hand, it is possible to design an overly sensitive test that finds small significant differences of no practical import. As in other statistical areas, considerations of test power and sensitivity must also be based on the larger framework of practical experience and consumer and/or marketplace validation of the sensory procedure. Finally, it is important to remember that having a powerful test may lead to rejection of the null hypothesis when the difference has little or no practical meaning. It is easy to get caught up in power calculations and begin designing and executing experiments with larger and larger sample sizes in order to be able to accept a true null and have that conclusion of "no difference" be believable. However, the downside of large samples is that rejection of the null may come too easily: Small effects of little practical importance may be detected, and this may cause management to take courses of action that are poorly justified.
The implications of an experimental outcome must be considered in light of the effect size that can have real impact on the perception of a product by consumers. This requires professional experience, product knowledge, and common sense.

"Virtually any study can be made to show significant results if one uses enough subjects, regardless of how nonsensical the content may be. There is surely nothing on earth that is completely independent of anything else." - W.L. Hays, 1973, p. 415

REFERENCES

Amerine, M.A., Pangborn, R.M., and Roessler, E.B. 1965. Principles of Sensory Evaluation of Food. Academic, New York.
Antinone, M.A., Lawless, H.T., Ledford, R.A., and Johnston, M. 1994. The importance of diacetyl as a flavor component in full fat cottage cheese. Journal of Food Science, 59, 38-42.
Beyer, W.H., ed. 1987. Handbook of Tables for Probability and Statistics, 2d ed. CRC, Boca Raton, FL.
Cohen, J. 1988. Statistical Power Analysis for the Behavioral Sciences, 2d ed. Erlbaum, Hillsdale, NJ.
--. 1992. A power primer. Psychological Bulletin, 112, 155-159.
Ennis, D.M. 1993. The power of sensory discrimination methods. Journal of Sensory Studies, 8, 353-370.
Erdfelder, E., Faul, F., and Buchner, A. 1996. GPOWER: a general power analysis program. Behavior Research Methods, Instruments, & Computers, 28, 1-11.
Gacula, M.C., Jr. 1991. Claim substantiation for sensory equivalence and superiority. In H.T. Lawless and B.P. Klein, eds. Sensory Science Theory and Applications in Foods. Dekker, New York, pp. 413-436.
Gacula, M.C., Jr., and Singh, J. 1984. Statistical Methods in Food and Consumer Research. Academic, Orlando, FL.
--. 1993. Design and Analysis of Sensory Optimization. Food & Nutrition Press, Trumbull, CT.
Hays, W.L. 1973. Statistics for the Social Sciences. Holt, Rinehart and Winston, New York.

Lawless, H.T. 1988. Odour description and odour classification revisited. In D.M.H. Thomson, ed. Food Acceptability. Elsevier Applied Science, London, pp. 27-40.
Macmillan, N.A., and Creelman, C.D. 1991. Detection Theory: A User's Guide. Cambridge University Press, Cambridge, UK.
Meilgaard, M., Civille, G.V., and Carr, B.T. 1991. Sensory Evaluation Techniques, 2d ed. CRC, Boca Raton, FL.
Rosenthal, R. 1987. Judgment Studies: Design, Analysis and Meta-analysis. Cambridge University Press, Cambridge, UK.
Schlich, P. 1993. Risk tables for discrimination tests. Food Quality and Preference, 4, 141-151.
Sokal, R.R., and Rohlf, F.J. 1981. Biometry, 2d ed. Freeman, New York.
Welkowitz, J., Ewen, R.B., and Cohen, J. 1982. Introductory Statistics for the Behavioral Sciences. Academic, New York.

STATISTICAL TABLES

A. Cumulative Probabilities of the Standard Normal Distribution Entry 784

B. Table of Critical Values of t 785

C. Table of Critical Values of Chi-Square 786

D. Percentage Points of the F-Distribution 787

E. Critical Values of U 789

F. Table of Critical Values of ρ, the Spearman Rank Correlation Coefficient, and r, the Pearson Product-Moment Correlation Coefficient 790

G. Critical Values for Duncan's New Multiple Range Test 791

H. Critical Values of the Triangle Test for Similarity 792

I. Table of Probabilities Associated with Values as Small as Observed Values of x in the Binomial Test 793

J. Critical Values of Differences Between Rank Sums 794

A. Cumulative Probabilities of the Standard Normal Distribution. Entry is the area 1 − α under the standard normal curve from −∞ to z(1 − α).

  z    .00    .01    .02    .03    .04    .05    .06    .07    .08    .09
 .0  .5000  .5040  .5080  .5120  .5160  .5199  .5239  .5279  .5319  .5359
 .1  .5398  .5438  .5478  .5517  .5557  .5596  .5636  .5675  .5714  .5753
 .2  .5793  .5832  .5871  .5910  .5948  .5987  .6026  .6064  .6103  .6141
 .3  .6179  .6217  .6255  .6293  .6331  .6368  .6406  .6443  .6480  .6517
 .4  .6554  .6591  .6628  .6664  .6700  .6736  .6772  .6808  .6844  .6879
 .5  .6915  .6950  .6985  .7019  .7054  .7088  .7123  .7157  .7190  .7224
 .6  .7257  .7291  .7324  .7357  .7389  .7422  .7454  .7486  .7517  .7549
 .7  .7580  .7611  .7642  .7673  .7704  .7734  .7764  .7794  .7823  .7852
 .8  .7881  .7910  .7939  .7967  .7995  .8023  .8051  .8078  .8106  .8133
 .9  .8159  .8186  .8212  .8238  .8264  .8289  .8315  .8340  .8365  .8389
1.0  .8413  .8438  .8461  .8485  .8508  .8531  .8554  .8577  .8599  .8621
1.1  .8643  .8665  .8686  .8708  .8729  .8749  .8770  .8790  .8810  .8830
1.2  .8849  .8869  .8888  .8907  .8925  .8944  .8962  .8980  .8997  .9015
1.3  .9032  .9049  .9066  .9082  .9099  .9115  .9131  .9147  .9162  .9177
1.4  .9192  .9207  .9222  .9236  .9251  .9265  .9279  .9292  .9306  .9319
1.5  .9332  .9345  .9357  .9370  .9382  .9394  .9406  .9418  .9429  .9441
1.6  .9452  .9463  .9474  .9484  .9495  .9505  .9515  .9525  .9535  .9545
1.7  .9554  .9564  .9573  .9582  .9591  .9599  .9608  .9616  .9625  .9633
1.8  .9641  .9649  .9656  .9664  .9671  .9678  .9686  .9693  .9699  .9706
1.9  .9713  .9719  .9726  .9732  .9738  .9744  .9750  .9756  .9761  .9767
2.0  .9772  .9778  .9783  .9788  .9793  .9798  .9803  .9808  .9812  .9817
2.1  .9821  .9826  .9830  .9834  .9838  .9842  .9846  .9850  .9854  .9857
2.2  .9861  .9864  .9868  .9871  .9875  .9878  .9881  .9884  .9887  .9890
2.3  .9893  .9896  .9898  .9901  .9904  .9906  .9909  .9911  .9913  .9916
2.4  .9918  .9920  .9922  .9925  .9927  .9929  .9931  .9932  .9934  .9936
2.5  .9938  .9940  .9941  .9943  .9945  .9946  .9948  .9949  .9951  .9952
2.6  .9953  .9955  .9956  .9957  .9959  .9960  .9961  .9962  .9963  .9964
2.7  .9965  .9966  .9967  .9968  .9969  .9970  .9971  .9972  .9973  .9974
2.8  .9974  .9975  .9976  .9977  .9977  .9978  .9979  .9979  .9980  .9981
2.9  .9981  .9982  .9982  .9983  .9984  .9984  .9985  .9985  .9986  .9986

Selected cumulative probabilities:
1 − α:     .90    .95    .975   .98    .99    .995   .999
z(1 − α): 1.282  1.645  1.960  2.054  2.326  2.576  3.090

Reprinted from J. Neter and W. Wasserman, Applied Linear Statistical Models, 1974, by permission of Richard D. Irwin, Homewood, IL.
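Table entries for any z-value (including values not tabled above) can be reproduced from the error function; a minimal sketch, with a function name of our choosing:

```python
# Cumulative standard normal probability via the error function.
from math import erf, sqrt

def phi(z):
    """Area under the standard normal curve from -infinity to z."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# phi(1.96) is about 0.9750, matching the z(1 - alpha) entry for .975
```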

Table of Critical Values of t

      Level of Significance for One-Tailed Test
      .10     .05     .025    .01     .005    .0005
df    Level of Significance for Two-Tailed Test
      .20     .10     .05     .02     .01     .001
1    3.078   6.314  12.706  31.821  63.657  636.619
2    1.886   2.920   4.303   6.965   9.925   31.598
3    1.638   2.353   3.182   4.541   5.841   12.941
4    1.533   2.132   2.776   3.747   4.604    8.610
5    1.476   2.015   2.571   3.365   4.032    6.859
6    1.440   1.943   2.447   3.143   3.707    5.959
7    1.415   1.895   2.365   2.998   3.499    5.405
8    1.397   1.860   2.306   2.896   3.355    5.041
9    1.383   1.833   2.262   2.821   3.250    4.781
10   1.372   1.812   2.228   2.764   3.169    4.587

11   1.363   1.796   2.201   2.718   3.106    4.437
12   1.356   1.782   2.179   2.681   3.055    4.318
13   1.350   1.771   2.160   2.650   3.012    4.221
14   1.345   1.761   2.145   2.624   2.977    4.140
15   1.341   1.753   2.131   2.602   2.947    4.073
16   1.337   1.746   2.120   2.583   2.921    4.015
17   1.333   1.740   2.110   2.567   2.898    3.965
18   1.330   1.734   2.101   2.552   2.878    3.922
19   1.328   1.729   2.093   2.539   2.861    3.883
20   1.325   1.725   2.086   2.528   2.845    3.850
21   1.323   1.721   2.080   2.518   2.831    3.819
22   1.321   1.717   2.074   2.508   2.819    3.792
23   1.319   1.714   2.069   2.500   2.807    3.767
24   1.318   1.711   2.064   2.492   2.797    3.745
25   1.316   1.708   2.060   2.485   2.787    3.725
26   1.315   1.706   2.056   2.479   2.779    3.707
27   1.314   1.703   2.052   2.473   2.771    3.690
28   1.313   1.701   2.048   2.467   2.763    3.674
29   1.311   1.699   2.045   2.462   2.756    3.659
30   1.310   1.697   2.042   2.457   2.750    3.646
40   1.303   1.684   2.021   2.423   2.704    3.551
60   1.296   1.671   2.000   2.390   2.660    3.460
120  1.289   1.658   1.980   2.358   2.617    3.373

inf  1.282   1.645   1.960   2.326   2.576    3.291

Reprinted from E.S. Pearson and H.O. Hartley, Biometrika Tables for Statisticians, Vol. 1, 3rd ed., 1966, by permission of the Trustees of Biometrika.

Table of Critical Values of Chi-Square

Probability under H0 that the obtained chi-square equals or exceeds the tabled value

df    .20     .10     .05     .02     .01     .001
1    1.64    2.71    3.84    5.41    6.64   10.83
2    3.22    4.60    5.99    7.82    9.21   13.82
3    4.64    6.25    7.82    9.84   11.34   16.27
4    5.99    7.78    9.49   11.67   13.28   18.46
5    7.29    9.24   11.07   13.39   15.09   20.52

6    8.56   10.64   12.59   15.03   16.81   22.46
7    9.80   12.02   14.07   16.62   18.48   24.32
8   11.03   13.36   15.51   18.17   20.09   26.12
9   12.24   14.68   16.92   19.68   21.67   27.88
10  13.44   15.99   18.31   21.16   23.21   29.59

11  14.63   17.28   19.68   22.62   24.72   31.26
12  15.81   18.55   21.03   24.05   26.22   32.91
13  16.98   19.81   22.36   25.47   27.69   34.53
14  18.15   21.06   23.68   26.87   29.14   36.12
15  19.31   22.31   25.00   28.26   30.58   37.70
16  20.46   23.54   26.30   29.63   32.00   39.29
17  21.62   24.77   27.59   31.00   33.41   40.75
18  22.76   25.99   28.87   32.35   34.80   42.31
19  23.90   27.20   30.14   33.69   36.19   43.82
20  25.04   28.41   31.41   35.02   37.57   45.32

21  26.17   29.62   32.67   36.34   38.93   46.80
22  27.30   30.81   33.92   37.66   40.29   48.27
23  28.43   32.01   35.17   38.97   41.64   49.73
24  29.55   33.20   36.42   40.27   42.98   51.18
25  30.68   34.38   37.65   41.57   44.31   52.62

26  31.80   35.56   38.88   42.86   45.64   54.05
27  32.91   36.74   40.11   44.14   46.96   55.48
28  34.03   37.92   41.34   45.42   48.28   56.89
29  35.14   39.09   42.56   46.69   49.59   58.30
30  36.25   40.26   43.77   47.96   50.89   59.70

Reprinted from E.S. Pearson and C.M. Thompson, Table of percentage points of the chi-square distribution, Biometrika, Vol. 32, 1941, by permission of the Biometrika Trustees.

Percentage Points of the F-Distribution

v1     1      2      3      4      5      10     20
v2              Upper 5% points
5    6.61   5.79   5.41   5.19   5.05   4.74   4.56
6    5.99   5.14   4.76   4.53   4.39   4.06   3.87
7    5.59   4.74   4.35   4.12   3.97   3.64   3.44
8    5.32   4.46   4.07   3.84   3.69   3.35   3.15
9    5.12   4.26   3.86   3.63   3.48   3.14   2.94
10   4.96   4.10   3.71   3.48   3.33   2.98   2.77
11   4.84   3.98   3.59   3.36   3.20   2.85   2.65
12   4.75   3.89   3.49   3.26   3.11   2.75   2.54
13   4.67   3.81   3.41   3.18   3.03   2.67   2.46
14   4.60   3.74   3.34   3.11   2.96   2.60   2.39

15   4.54   3.68   3.29   3.06   2.90   2.54   2.33
16   4.49   3.63   3.24   3.01   2.85   2.49   2.28
17   4.45   3.59   3.20   2.96   2.81   2.45   2.23
18   4.41   3.55   3.16   2.93   2.77   2.41   2.19
19   4.38   3.52   3.13   2.90   2.74   2.38   2.16
20   4.35   3.49   3.10   2.87   2.71   2.35   2.12
21   4.32   3.47   3.07   2.84   2.68   2.32   2.10
22   4.30   3.44   3.05   2.82   2.66   2.30   2.07
23   4.28   3.42   3.03   2.80   2.64   2.27   2.05
24   4.26   3.40   3.01   2.78   2.62   2.25   2.03
25   4.24   3.39   2.99   2.76   2.60   2.24   2.01
26   4.23   3.37   2.98   2.74   2.59   2.22   1.99
27   4.21   3.35   2.96   2.73   2.57   2.20   1.97
28   4.20   3.34   2.95   2.71   2.56   2.19   1.96
29   4.18   3.33   2.93   2.70   2.55   2.18   1.94
30   4.17   3.32   2.92   2.69   2.53   2.16   1.93
40   4.08   3.23   2.84   2.61   2.45   2.08   1.84
60   4.00   3.15   2.76   2.53   2.37   1.99   1.75
120  3.92   3.07   2.68   2.45   2.29   1.91   1.66
inf  3.84   3.00   2.60   2.37   2.21   1.83   1.57

             Upper 1% points
5   16.26  13.27  12.06  11.39  10.97  10.05   9.55
6   13.75  10.92   9.78   9.15   8.75   7.87   7.40
7   12.25   9.55   8.45   7.85   7.46   6.62   6.16
8   11.26   8.65   7.59   7.01   6.63   5.81   5.36
9   10.56   8.02   6.99   6.42   6.06   5.26   4.81

(Table D continued on next page)

(Table D continued) Percentage Points of the F-Distribution

v1     1      2      3      4      5      10     20
v2              Upper 1% points
10  10.04   7.56   6.55   5.99   5.64   4.85   4.41
11   9.65   7.21   6.22   5.67   5.32   4.54   4.10
12   9.33   6.93   5.95   5.41   5.06   4.30   3.86
13   9.07   6.70   5.74   5.21   4.86   4.10   3.66
14   8.86   6.51   5.56   5.04   4.69   3.94   3.51
15   8.68   6.36   5.42   4.89   4.56   3.80   3.37
16   8.53   6.23   5.29   4.77   4.44   3.69   3.26
17   8.40   6.11   5.18   4.67   4.34   3.59   3.16
18   8.29   6.01   5.09   4.58   4.25   3.51   3.08
19   8.18   5.93   5.01   4.50   4.17   3.43   3.00
20   8.10   5.85   4.94   4.43   4.10   3.37   2.94
21   8.02   5.78   4.87   4.37   4.04   3.31   2.88
22   7.95   5.72   4.82   4.31   3.99   3.26   2.83
23   7.88   5.66   4.76   4.26   3.94   3.21   2.78
24   7.82   5.61   4.72   4.22   3.90   3.17   2.74
25   7.77   5.57   4.68   4.18   3.85   3.13   2.70
26   7.72   5.53   4.64   4.14   3.82   3.09   2.66
27   7.68   5.49   4.60   4.11   3.78   3.06   2.63
28   7.64   5.45   4.57   4.07   3.75   3.03   2.60
29   7.60   5.42   4.54   4.04   3.73   3.00   2.57
30   7.56   5.39   4.51   4.02   3.70   2.98   2.55
40   7.31   5.18   4.31   3.83   3.51   2.80   2.37
60   7.08   4.98   4.13   3.65   3.34   2.63   2.20
120  6.85   4.79   3.95   3.48   3.17   2.47   2.03
inf  6.63   4.61   3.78   3.32   3.02   2.32   1.88

This table is abridged from Table 18 of the Biometrika Tables for Statisticians, Vol. 1, 3rd ed., edited by E.S. Pearson and H.O. Hartley. Reproduced by the kind permission of the Trustees of Biometrika.

Critical Values of U for a One-Tailed Test at a = .025 or for a Two-Tailed Test at a = .05

n1:   10   11   12   13   14   15   16   17   18   19   20
n2
8     17   19   22   24   26   29   31   34   36   38   41
9     20   23   26   28   31   34   37   39   42   45   48
10    23   26   29   33   36   39   42   45   48   52   55
11    26   30   33   37   40   44   47   51   55   58   62
12    29   33   37   41   45   49   53   57   61   65   69
13    33   37   41   45   50   54   59   63   67   72   76
14    36   40   45   50   55   59   64   69   74   78   83
15    39   44   49   54   59   64   70   75   80   85   90
16    42   47   53   59   64   70   75   81   86   92   98
17    45   51   57   63   69   75   81   87   93   99  105
18    48   55   61   67   74   80   86   93   99  106  112
19    52   58   65   72   78   85   92   99  106  113  119
20    55   62   69   76   83   90   98  105  112  119  127

Critical Values of U for a One-Tailed Test at a = .05 or for a Two-Tailed Test at a = .10

8     20   23   26   28   31   33   36   39   41   44   47
9     24   27   30   33   36   39   42   45   48   51   54
10    27   31   34   37   41   44   48   51   55   58   62
11    31   34   38   42   46   50   54   57   61   65   69
12    34   38   42   47   51   55   60   64   68   72   77
13    37   42   47   51   56   61   65   70   75   80   84
14    41   46   51   56   61   66   71   77   82   87   92
15    44   50   55   61   66   72   77   83   88   94  100
16    48   54   60   65   71   77   83   89   95  101  107
17    51   57   64   70   76   83   89   96  102  109  115
18    55   61   68   75   82   88   95  102  109  116  123
19    58   65   72   80   87   94  101  109  116  123  130
20    62   69   77   84   92  100  107  115  123  130  138

From D. Auble, 1953. Extended tables for the Mann-Whitney U-statistic. Reprinted with permission of the Institute of Educational Research, Indiana University.

Table of Critical Values of p, the Spearman Rank Correlation Coefficient, and r, the Pearson Product-Moment Correlation Coefficient

            p              r
N      .05     .01     .05     .01
4    1.000            .811    .917
5     .900   1.000    .754    .874
6     .829    .943    .707    .834
7     .714    .893    .666    .793
8     .643    .833    .632    .765
9     .600    .783    .602    .735
10    .564    .746    .576    .708
12    .506    .712    .532    .661
14    .456    .645    .497    .623
16    .425    .601    .468    .590
18    .399    .564    .444    .561
20    .377    .534    .423    .537
22    .359    .508    .404    .515
24    .343    .485    .388    .496
26    .329    .465    .374    .478
28    .317    .448    .361    .463
30    .306    .432    .349    .449

From G.J. Glasser and R.F. Winter, Biometrika, Vol. 48, by permission of the Trustees of Biometrika.

Critical Values for Duncan's New Multiple Range Test:
Protection Level P = (.95)^(p-1); Significance Level a = .05

      Number of Means Bracketing Comparison(1)
df      2       3       4       5
6     3.461   3.587   3.649   3.680
7     3.344   3.477   3.548   3.588
8     3.261   3.399   3.475   3.521
9     3.199   3.339   3.420   3.470
10    3.151   3.293   3.376   3.430
11    3.113   3.256   3.342   3.397
12    3.082   3.225   3.313   3.369
13    3.055   3.200   3.289   3.348
14    3.033   3.178   3.268   3.329
15    3.014   3.160   3.250   3.312
16    2.998   3.144   3.235   3.298
17    2.984   3.130   3.222   3.285
18    2.971   3.118   3.210   3.274
19    2.960   3.107   3.199   3.264
20    2.950   3.097   3.190   3.255
24    2.919   3.066   3.160   3.226
30    2.888   3.035   3.131   3.199
40    2.858   3.006   3.102   3.171
60    2.829   2.976   3.073   3.143
120   2.800   2.947   3.045   3.116
inf   2.772   2.918   3.017   3.089

(1) Number of means, when rank ordered, between the pair being compared and including the pair itself.
From D.B. Duncan, A New Multiple Range Test, Biometrics, Vol. 11, 1955. Reprinted with permission of the International Biometric Society.

Critical Values of the Triangle Test for Similarity: m (Maximum Number Correct) as a Function of N, Beta, and the Proportion Discriminating

              Proportion Discriminating
N    Beta     20%     30%
30   .05              11
     .10      10      11
33   .05              12
     .10      11      13
36   .05              13
     .10      12      14
42   .05              16
     .10      14      17
48   .05      16      19
     .10      17      20
54   .05      18      22
     .10      20      23
60   .05      21      25
     .10      22      26
72   .05      26      30
     .10      27      32
84   .05      31      36
     .10      32      38
96   .05      36      42
     .10      38      44

Accept the null with 100(1 - beta)% confidence if the number of correct choices does not exceed the tabled value for the allowable proportion of discriminators. Reprinted from M. Meilgaard, G.V. Civille, and B.T. Carr, Sensory Evaluation Techniques, copyright 1991, CRC Press, Boca Raton, FL.

Table of Probabilities Associated with Values as Small as Observed Values of x in the Binomial Test (decimal points omitted)

N \ x   0    1    2    3    4    5    6    7    8    9    10   11
5      031  188
6      016  109  344
7      008  062  227
8      004  035  145  363
9      002  020  090  254
10     001  011  055  172  377
11          006  033  113  274
12          003  019  073  194  387
13          002  011  046  133  291
14          001  006  029  090  212  395
15               004  018  059  151  304
16               002  011  038  105  227  402
17               001  006  025  072  166  315
18               001  004  015  048  119  240  407
19                    002  010  032  084  180  324
20                    001  006  021  058  132  252  412
21                    001  004  013  039  095  192  332
22                         002  008  026  067  143  262  416
23                         001  005  017  047  105  202  339
24                         001  003  011  032  076  154  271  419
25                              002  007  022  054  115  212  345

These probabilities are one-tailed. To use for two-tailed tests, as in forced preference, the probability must be doubled. Reprinted from H. Walker and D. Lev, Statistical Inference, 1943, by permission of Holt, Rinehart and Winston.
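Each entry is a one-tailed cumulative binomial probability with p = 1/2, so any entry can be regenerated exactly; a minimal sketch (the function name is ours):

```python
# One-tailed cumulative binomial probability for p = 1/2,
# matching the table entries (which omit decimal points).
from math import comb

def cum_binom_half(x, n):
    """P(X <= x) for X ~ Binomial(n, p = 1/2)."""
    return sum(comb(n, k) for k in range(x + 1)) / 2 ** n

# e.g. N = 5, x = 0 gives .031; doubling gives the two-tailed probability
```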

Critical Values of Differences Between Rank Sums, p = 0.05

No. of              Number of Products
Assessors    2     3     4     5     6     7     8     9     10
20          8.8  14.8  21.0  27.3  33.7  40.3  47.0  53.7   60.6
21          9.0  15.2  21.5  28.0  34.6  41.3  48.1  55.1   62.1
22          9.2  15.5  22.0  28.6  35.4  42.3  49.2  56.4   63.5
23          9.4  15.9  22.5  29.3  36.2  43.2  50.3  57.6   65.0
24          9.6  16.2  23.0  29.9  36.9  44.1  51.4  58.9   66.4
25          9.8  16.6  23.5  30.5  37.7  45.0  52.5  60.1   67.7
26         10.0  16.9  23.9  31.1  38.4  45.9  53.5  61.3   69.1
27         10.2  17.2  24.4  31.7  39.2  46.8  54.6  62.4   70.4
28         10.4  17.5  24.8  32.3  39.9  47.7  55.6  63.6   71.7
29         10.6  17.8  25.3  32.8  40.6  48.5  56.5  64.7   72.9
30         10.7  18.2  25.7  33.4  41.3  49.3  57.5  65.8   74.2
31         10.9  18.5  26.1  34.0  42.0  50.2  58.5  66.9   75.4
32         11.1  18.7  26.5  34.5  42.6  51.0  59.4  68.0   76.6
33         11.3  19.0  26.9  35.0  43.3  51.7  60.3  69.0   77.8
34         11.4  19.3  27.3  35.6  44.0  52.5  61.2  70.1   79.0
35         11.6  19.6  27.7  36.1  44.6  53.3  62.1  71.1   80.1
36         11.8  19.9  28.1  36.6  45.2  54.0  63.0  72.1   81.3
37         11.9  20.2  28.5  37.1  45.9  54.8  63.9  73.1   82.4
38         12.1  20.4  28.9  37.6  46.5  55.5  64.7  74.1   83.5
39         12.2  20.7  29.3  38.1  47.1  56.3  65.6  75.0   84.6
40         12.4  21.0  29.7  38.6  47.7  57.0  66.4  76.0   85.7
41         12.6  21.2  30.0  39.1  48.3  57.7  67.2  76.9   86.7
42         12.7  21.5  30.4  39.5  48.9  58.4  68.0  77.9   87.8
43         12.9  21.7  30.8  40.0  49.4  59.1  68.8  78.8   88.8
44         13.0  22.0  31.1  40.5  50.0  59.8  69.6  79.7   89.9
45         13.1  22.2  31.5  40.9  50.6  60.4  70.4  80.6   90.9
46         13.3  22.5  31.8  41.4  51.1  61.1  71.2  81.5   91.9
47         13.4  22.7  32.2  41.8  51.7  61.8  72.0  82.4   92.9
48         13.6  23.0  32.5  42.3  52.2  62.4  72.7  83.2   93.8
49         13.7  23.2  32.8  42.7  52.8  63.1  73.5  84.1   94.8
50         13.9  23.4  33.2  43.1  53.3  63.7  74.2  85.0   95.8
55         14.5  24.6  34.8  45.2  55.9  66.8  77.9  89.1  100.5
60         15.2  25.7  36.3  47.3  58.4  69.8  81.3  93.1  104.9
65         15.8  26.7  37.8  49.2  60.8  72.6  84.6  96.9  109.2
70         16.4  27.7  39.2  51.0  63.1  75.4  87.8 100.5  113.3
80         17.5  29.6  42.0  54.6  67.4  80.6  93.9 107.5  121.2
90         18.6  31.4  44.5  57.9  71.5  85.5  99.6 114.0  128.5
100        19.6  33.1  46.9  61.0  75.4  90.1 105.0 120.1  135.5
110        20.6  34.8  49.2  64.0  79.1  94.5 110.1 126.0  142.1
120        21.5  36.3  51.4  66.8  82.6  98.7 115.0 131.6  148.4

From D. Basker, Critical values of differences among rank sums for multiple comparisons, Food Technology, 42: 79, 1988. Reprinted by permission of the Institute of Food Technologists.

BINOMIAL PROBABILITIES FROM DISCRIMINATION TESTS

A duo-trio test is conducted to determine the amount of added diacetyl that might be detected in cottage cheese. Three tests are conducted. In all tests the control sample has no added diacetyl (but probably some small amount from the natural fermentation process). Test sample "A" has 1 ppm added diacetyl. Test sample "B" has 10 ppm added diacetyl. In the first test with test sample "A," 25 panelists participate; 17 match the sample to the correct reference item. A second test is run with 50 participants, and 34 match the correct sample to the reference. In a third test with test sample "B," only 20 panelists are available to participate; 16 match the sample to the correct reference item. What are the proportions correct, the Z-scores, and the p-values for each test? The Z-score is given by the formula

Z = [(Proportion correct - Proportion expected) - Continuity correction] / (Standard error of the proportion)

or by the computationally equivalent

Z = (X - Np - 0.5) / sqrt(Npq)

where
Proportion correct = X/N, X = number correct, N = total judges
Proportion expected = chance level = 1/2 for the duo-trio test
Continuity correction = 1/(2N)
Standard error of the proportion = sqrt(pq/N), where p = chance proportion (= 1/2) and q = 1 - p
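The formula translates directly into a few lines of code; a minimal sketch (the function name is ours):

```python
# Z-score for an observed proportion correct against a chance level p,
# with the 1/(2N) continuity correction described above.
from math import sqrt

def z_vs_chance(x, n, p=0.5):
    """Z = [(x/n - p) - 1/(2n)] / sqrt(p(1 - p)/n)."""
    return ((x / n - p) - 1 / (2 * n)) / sqrt(p * (1 - p) / n)

# The three tests worked below: 17/25 gives 1.60, 34/50 gives 2.40, 16/20 gives 2.46
```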

If the obtained Z-score is greater than 1.645, we have an observation that would occur less than 1 chance in 20 under a true null hypothesis. The null hypothesis is that the population proportion correct, Pc = 1/2, and the alternative hypothesis is that Pc > 1/2. For the first test, we have 68% correct and

Z = [(17/25 - 1/2) - (1/50)] / sqrt[(0.5)(0.5)/25] = 0.16/0.10 = 1.60

which does not exceed our critical one-tailed Z-value of 1.645, and thus does not provide sufficient evidence against the null. The p-value is greater than 0.05. For the second test, we have 68% correct and

Z = [(34/50 - 1/2) - (1/100)] / sqrt[(0.5)(0.5)/50] = 0.17/0.0707 = 2.40

This now does exceed our critical one-tailed value for Z, and the p-value is less than 0.05, so now there is sufficient evidence to reject the null and conclude that there is a perceptible difference from the added diacetyl. Note that there was the same proportion correct in both tests (68%) but that the greater sample size in the second test gave more statistical power to detect a significant difference. Primarily, the standard error of the proportion is lower due to the increased "N." In plain terms, we have a smaller confidence interval around the observed proportion, and a lower probability that the actual proportion could be as low as the chance level. However, the different statistical decisions in the two cases raise the issue of whether the difference is of any practical significance; the effect is near the significance cutoff, and clearly a lot of people on the panel missed it entirely. So the statistical decision here needs to be supplemented by some professional judgment in drawing conclusions and making recommendations. For the third test, we have 16/20 correct (80%) and

Z = [(16/20 - 1/2) - (1/40)] / sqrt[(0.5)(0.5)/20] = 0.275/0.1118 = 2.46

This is also sufficient evidence against the null (exceeds 1.645).

COMPLETE BLOCK ANALYSIS OF VARIANCE

A common experimental design in descriptive analysis is the complete block with replicates, involving three factors: products (or samples), replicates, and panelists. The analysis of variance is performed as shown below. Texture evaluation was a major focus of a study on the sensory properties of hearts of palm.* The example compares descriptive panel ratings on a 15-point category scale for firmness of the product's core. There are three products, two canned processed versions (A and B, below) and one fresh sample (product E, below). Three replicates are performed on each sample by nine panelists. The raw data set is shown below:

           Product "A"     Product "B"     Product "E"
Panelist   R1   R2   R3    R1   R2   R3    R1   R2   R3
1           1    1    1     1    2    1    13   15   15
2           3    1    2     2    5    4    15   15   12
3           1    1   11    11    3    8    14   12   12
4           3    2    4     5   12    4    11   11    8
5           3    1    2     5    2    6    13   13   13
6           8    6    2     3    2    3    11   12    9
7           1    2    3     3    3    5     8   13    8
8           2    1    8     2    2    1    12   14    7
9           1    2    2     3    3    2    11    8   13

The R1 , R2 , and R3 headings are for the three replicates, of course.

*Data are from the Master of Professional Studies, Project Report of Torres, V. A. 1992. "The U.S. market for palm hearts: an introductory study," Cornell University, and were reported in Lawless, H., Torres, V. and Figueroa, E. 1993. Sensory Evaluation of Hearts of Palm, Journal of Food Science, 58, 134-137.


The analysis proceeds with the following calculations: Grand total of scores: 496; C = (496)2/N = 246,016/81 = 3037.23 ("correction factor"). This factor is used later in calculations of sums of squares. Next we find the sums of squares for the main effects of products and replicates:

Product sums:
Sum A = 75     (Sum A)2 = 5625
Sum B = 103    (Sum B)2 = 10,609
Sum E = 318    (Sum E)2 = 101,124

[(Sum A)2 + (Sum B)2 + (Sum E)2] = 117,358
117,358/27 = PSS1 (partial sum of squares #1) = 4346.59

The PSS values are used in later calculations.

PSS1 - C = SS (products) = 4346.59 - 3037.23 = 1309.36

Replicate sums:
Sum R1 = 166   (Sum R1)2 = 27,556
Sum R2 = 164   (Sum R2)2 = 26,896
Sum R3 = 166   (Sum R3)2 = 27,556

[(Sum R1)2 + (Sum R2)2 + (Sum R3)2] = 82,008
82,008/27 = 3037.33 = PSS2
SS Reps = 3037.33 - 3037.23 = 0.10

The panelist effects must also be calculated to take advantage of the complete block design. They will be partitioned from the error terms, a useful adjustment for differences in scale usage.

Summary of Panelist Effects

               Sum     (Sum)2
Panelist 1  =   50      2500
Panelist 2  =   59      3481
Panelist 3  =   73      5329
Panelist 4  =   60      3600
Panelist 5  =   58      3364
Panelist 6  =   56      3136
Panelist 7  =   46      2116
Panelist 8  =   49      2401
Panelist 9  =   45      2025

Sum of (Sum Pan)2 = 27,952
PSS3 = 27,952/9 = 3105.78
SS (panelists) = 3105.78 - 3037.23 = 68.54

Next, interaction sums of squares are found:

Rep x Product Interaction:

          Product
           A    B    E
Rep 1     23   35  108
Rep 2     17   34  113
Rep 3     35   34   97

Summing and squaring: 23^2 + 17^2 + ... = 529 + 289 + ... = 39,422
PSS4 = 39,422/9 = 4380.22
SS (Rep x Product) = PSS4 - PSS1 - PSS2 + C = 4380.22 - 4346.59 - 3037.33 + 3037.23 = 33.53

Product-by-Panelist Interaction:

Panelist  Sum PanA  Sum PanB  Sum PanE   (PA)2   (PB)2   (PE)2
1             3         4        43         9      16     1849
2             6        11        42        36     121     1764
3            13        22        38       169     484     1444
4             9        21        30        81     441      900
5             6        13        39        36     169     1521
6            16         8        32       256      64     1024
7             6        11        29        36     121      841
8            11         5        33       121      25     1089
9             5         8        32        25      64     1024

[Sum (PA)2 + Sum (PB)2 + Sum (PE)2] = 769 + 1505 + 11,456 = 13,730
PSS5 = 13,730/3 = 4576.67
SS (Product x Panelist) = PSS5 - PSS1 - PSS3 + C = 161.53

Rep x Panelist Interaction:

Panelist  Sum PR1  Sum PR2  Sum PR3   (PR1)2  (PR2)2  (PR3)2
1            15       18       17       225     324     289
2            20       21       18       400     441     324
3            26       16       31       676     256     961
4            19       25       16       361     625     256
5            21       16       21       441     256     441
6            22       20       14       484     400     196
7            12       18       16       144     324     256
8            16       17       16       256     289     256
9            15       13       17       225     169     289

[Sum (PR1)2 + Sum (PR2)2 + Sum (PR3)2] = 3212 + 3084 + 3268 = 9564
PSS6 = 9564/3 = 3188
SS (Rep x Panelist) = PSS6 - PSS2 - PSS3 + C = 82.12

SS 3-Way Interaction: Fortunately, this can be calculated from the partial sums of squares and the sum of all the squared data points:

SS (3-way) = Sum(x2) - PSS4 - PSS5 - PSS6 + PSS1 + PSS2 + PSS3 - C
           = 4866 - 4380.22 - 4576.67 - 3188 + 4346.59 + 3037.33 + 3105.78 - 3037.24

It is sometimes easier to calculate this expression in two stages:

Total Plusses            Total Minuses
 4866.00   Sum(x2)        4380.22   PSS4
 4346.59   PSS1           4576.67   PSS5
 3037.33   PSS2           3188.00   PSS6
 3105.78   PSS3           3037.24   C
15,355.70                15,182.13

SS (3-way interaction) = 15,355.70 - 15,182.13 = 173.57

Now we are ready to table the ANOVA values.

ANOVA Table

Source          SS       df     MS       F       p
Products      1309.36     2    654.68   64.81   <0.001
  error 1      161.53    16     10.10
Replicates       0.10     2      0.05    <1      NS
  error 2       82.12    16      5.13
P x R           33.53     4      8.38    1.54    NS
  error 3      173.57    32      5.42

The error terms are as follows:

Error 1 (products) is the product by panelists interaction. Error 2 (replicates) is the replicate by panelist interaction. Error 3 is the 3-way interaction.
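As a check on the hand calculations, the same sums of squares and F-ratio can be recomputed in a few lines; a sketch, with the data layout and variable names of our choosing (values transcribed from the raw data table above):

```python
# Recheck the complete-block ANOVA for the hearts-of-palm firmness data
# (3 products x 9 panelists x 3 replicates).

firmness = {  # firmness[product][panelist] = [rep1, rep2, rep3]
    "A": [[1, 1, 1], [3, 1, 2], [1, 1, 11], [3, 2, 4], [3, 1, 2],
          [8, 6, 2], [1, 2, 3], [2, 1, 8], [1, 2, 2]],
    "B": [[1, 2, 1], [2, 5, 4], [11, 3, 8], [5, 12, 4], [5, 2, 6],
          [3, 2, 3], [3, 3, 5], [2, 2, 1], [3, 3, 2]],
    "E": [[13, 15, 15], [15, 15, 12], [14, 12, 12], [11, 11, 8],
          [13, 13, 13], [11, 12, 9], [8, 13, 8], [12, 14, 7], [11, 8, 13]],
}

products, n_pan, n_rep = list(firmness), 9, 3
N = len(products) * n_pan * n_rep                       # 81 observations
grand = sum(x for p in products for pan in firmness[p] for x in pan)
C = grand ** 2 / N                                      # correction factor

ss_products = sum(sum(x for pan in firmness[p] for x in pan) ** 2
                  for p in products) / (n_pan * n_rep) - C
ss_panelists = sum(sum(sum(firmness[p][i]) for p in products) ** 2
                   for i in range(n_pan)) / (len(products) * n_rep) - C

# product-by-panelist interaction: the error term for the product effect
pss5 = sum(sum(firmness[p][i]) ** 2
           for p in products for i in range(n_pan)) / n_rep
ss_error1 = pss5 - ss_products - ss_panelists - C

f_products = (ss_products / 2) / (ss_error1 / 16)       # df = 2 and 16
print(round(ss_products, 2), round(ss_error1, 2), round(f_products, 1))
```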

In summary, there was a large difference in firmness between at least one pair of the samples. In a way this is not surprising since the mean values were 2.8 (product A), 3.8 (product B), and 11.8 for product E, with the fresh product being much firmer than the processed vegetables. There was no replicate effect and no evidence for any change of product firmness across the three replicates. The remaining part of the story is to test for differences among the three samples to see which pairs were different. This is illustrated by finding the Duncan range statistic for the three comparisons as shown below. COMPLETE BLOCK ANALYSIS OF VARIANCE 801

The critical range statistic, q, for 16 df comparing two adjacent means is 2.998, and for two means with one intervening (across three total) is 3.144. The range, or difference between means, must therefore exceed

q x sqrt(MS error/N) = 2.998 x sqrt(10.10/27) = 1.83 for two adjacent ranked means

and

= 3.144 x sqrt(10.10/27) = 1.92 for the high and low value

The value of 10.10 is the mean square error term from our ANOVA table for the product factor and there were 27 observations per mean. Examining our product differences:

B - A = 3.8 - 2.8 = 1.0, which does not exceed 1.83.
E - A = 11.8 - 2.8 = 9.0, which does exceed 1.92.
E - B = 11.8 - 3.8 = 8.0, which does exceed 1.83.
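These range computations are easy to script as a check; a sketch, with variable names of our choosing and values taken from the ANOVA above:

```python
# Duncan critical ranges for the three firmness means.
from math import sqrt

ms_error, n_per_mean = 10.10, 27
means = {"A": 75 / 27, "B": 103 / 27, "E": 318 / 27}   # about 2.8, 3.8, 11.8

def duncan_range(q):
    """Critical range = q * sqrt(MS_error / n)."""
    return q * sqrt(ms_error / n_per_mean)

r2, r3 = duncan_range(2.998), duncan_range(3.144)      # about 1.83 and 1.92
# B - A falls short of r2; E - A exceeds r3 and E - B exceeds r2.
```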

So product E, as expected, is significantly firmer than the other two.

GLOSSARY OF TERMS

The following list is focused on the concepts related to sensory evaluation methods, the physiology and function of the senses, and statistical terms common in sensory work. Glossary entries for some older terms are modified from Amerine, Pangborn, and Roessler, Principles of Sensory Evaluation of Food (1965). Omitted from the list are specific sensory descriptive attributes that may be found in the ASTM lexicon for aroma and flavor (Civille and Lyon, 1996) and texture-related terms found in Chapter 11 and in Muñoz (1986).

A, not-A test  A discrimination procedure in which a previously identified sample ("A") is presented and also a test sample. Both are presented blind-labeled in the test. The panelist's task is to assign the label "A" or "not-A" to each of the samples.

ABX test  A discrimination procedure in which two reference samples are given ("A" and "B"), representing the two levels of the variable under investigation, e.g., a control sample and a test sample. The third or "X" sample represents either the control or test sample (this is counterbalanced across judges or trials), and the panelist's task is to match the test item "X" to the correct sample from the original reference pair.

Alpha-risk  The long-term probability of rejecting a null hypothesis when it is true. When set before an experiment, usually at 5%, it is the upper acceptable limit on the chance of committing a Type I error. Widely and mistakenly believed to be the chance of having made a wrong decision after rejecting the null in a given experiment. See beta-risk.

Affective test  A general class of sensory tests that assess the acceptability of products or the relative preference among a set of products.


Aftertaste  A sensation following the removal of a taste stimulus that may comprise a continuation of the sensory quality perceived during the presence of the stimulus, or a different quality induced by salivary dilution, rinses with water, or the act of swallowing.

Ageusia  Loss of taste sensitivity.

Alternative hypothesis  See Null hypothesis.

Analysis of variance (ANOVA)  A class of statistical methods used for separating the sources of variance in a data set. In most sensory evaluation, the analysis of variance (ANOVA) allows tests of multiple levels of treatment variables against the background of error variance, and thus is a test of experimental effects based on F-ratios.

Anchor  (1) A term used to label a point on an intensity scale. (2) A physical reference standard used to exemplify a point on an intensity scale or to stabilize intensity scale usage across panelists by providing anchoring terms or physical examples.

Anosmia  Loss of smell sensitivity. Also "specific anosmia," which is the insensitivity to the smell of a closely related family of chemical compounds among individuals with an otherwise normal sense of smell, believed to be of genetic origin, but modifiable through exposure or by physical maturation.

Appearance  The visual properties of a product including size, shape, color, texture, gloss, transparency, cloudiness, and so on.

Aroma  The fragrance or odor of a product as perceived by the nose from sniffing through the external nares. In some cultures, aroma may also refer to retronasal smell (q.v.).

Ascending method of limits  A threshold study using the method of limits in which only series of ascending concentrations or ascending intensity levels are used in the stimulus sequences.

Assimilation effect  The tendency for a product to be judged as more similar to another product or closer to expectations than it would be judged in isolation or under other baseline conditions.
Astringency  (1) A sensory experience in the oral cavity that may include drying sensations, roughing of the oral tissues, and a tightening, drawing, or puckering sensation, as found after stimulation with alums or tannins. (2) A sensation of tightening or drawing felt on the external skin after chemical stimulation.

Attribute response restriction effect ("dumping effect")  The inflation of a rating on an attribute due to the absence of an appropriate attribute on a ballot that would allow a participant to respond to a salient sensation. The response is "dumped" onto an inappropriate question or scale. The effect may arise in consumer acceptance testing due to the inability to voice a concern or negative response to a question that was omitted, or in descriptive testing when inadequately trained panels assign a sensation to an inappropriate intensity scale when the appropriate intensity scale is omitted.

Beta-risk  The long-term probability of accepting a null hypothesis when it is false, i.e., the chance of missing a true difference or making a Type II error.

Bias  Any systematic tendency to distort the overt response to a stimulus so that it becomes an inaccurate reflection of the actual sensation or hedonic reaction.

Blind test  A test in which the detailed identity of the product or products is unknown to the participant. In practice, some knowledge of product identity is required in order to ensure that the product is evaluated relative to the proper category or a reasonable frame of reference. Nonetheless, in sensory evaluations such conceptual information is generally minimal, and the identity of products with regard to which sample is a control or standard vs. a test product is always disguised in the formal evaluation phase.
Category scale  Any of a class of scaling procedures in which the judge is presented with discrete and limited alternatives as choices for rating the perceived intensity of a specified attribute or degree of liking of a product.

Central location test  A consumer test that is conducted in a specific location and is therefore generally under more controlled conditions than a home use test.

Chemesthesis  The sensory mechanisms across the body surface and various orifices that are generally responsive to chemical stimulation, especially to irritation, and found primarily in mucous membranes. The term refers to a more general chemical sensitivity than that found in the highly differentiated receptors for taste and smell.

Chemoreceptor  A cell that is responsive to chemical stimulation and initiates a neural signal to higher brain centers that is interpreted as a smell, taste, or chemical irritation.

Chi-square distribution  A statistical distribution based on variance ratios. The distribution gives rise to various statistical tests, loosely referred to as chi-square tests, useful for comparing frequency counts, proportions, and trends in ranked data.

Chroma  A dimension of perceived color in the Munsell system describing the color purity or saturation, i.e., the amount of color-inducing wavelengths relative to the amount of gray or white light that dilutes or desaturates the perceived color (syn.: saturation).

Circumvallate papillae  A group of button-shaped structures on the dorsal posterior tongue arranged in an inverted V. Each papilla is surrounded by a circular trench containing taste buds.

Coding  The act of assigning labels to products so that their identity is disguised, as in "blind-labeled coding."
Coefficient of correlation  A statistical value assessing the degree of linear relationship between two variables and ranging from -1.0 (perfect inverse relationship) through 0 (no relationship) to +1.0 (perfect direct relationship).

Comparative judgment  Direct evaluation of one product relative to another, usually along a specified dimension such as a sensory attribute or degree of liking.

Complete block  An experimental design in which all treatment levels are assigned to one block, as in having all panelists evaluate all versions of a product (the panel is a block) or in having all products viewed within a single session (the session is a block).

Contrast effect  The tendency for a product to be judged as more extreme in the presence of another product than it would be if judged in isolation, i.e., lower in intensity in the presence of a stronger product or higher in intensity in the presence of a weaker product. Contrast effects may also occur in hedonic judgment.

Correction for continuity  A mathematical correction used in the approximation of continuous distributions by discrete data, often used in tests on proportions or frequencies.

Degrees of freedom  A value for the number of observations that are unconstrained or free to vary once our statistical values are calculated from a sample data set. In most cases, the degrees of freedom are given by the number of observations in the data minus 1. Tabled statistics such as t-values and F-values are often given in terms of the degrees of freedom.

Dependent variable  A variable that is measured as the data in an experiment, such as proportions of correct judgments or numbers from scaled ratings; a dependent variable is free to vary as a function of the independent variables plus random error.

Descriptive analysis  A class of sensory tests in which a trained panel rates specified attributes of a product on scales of perceived intensity.

Detection threshold The minimum level of energy that must be present to be detected by an observer an arbitrary percentage of the time the stimulus is presented, usually 50%.
Difference test Any of a class of tests designed to demonstrate the sensory difference of two products, or to eliminate that possibility, including the simple discrimination tests such as forced-choice, the rated-degree-of-difference procedures, and scaled attribute ratings when focused on the simple question of the existence of a difference.
Difference threshold The amount of decrease or increase of stimulus energy that is necessary to create a reliably noticeable change. This is operationalized in the method of constant stimuli as the amount of increase (or decrease) that is judged stronger on 75% of trials (or weaker on 25% of trials). When expressed on a percentage or proportional basis, this is synonymous with the just-noticeable difference.
Discrimination The behavioral demonstration that two stimuli are responded to differently, on a reliable basis.
Discrimination test A class of sensory tests in which two products or stimuli are compared to see whether panelists can differentiate them on a sensory basis. These include choice tests such as the triangle procedure and the duo-trio test, and also the directed tests of difference such as paired comparisons and other n-alternative forced-choice tests that focus on a specific attribute.
Dual-standard test A discrimination test in which four items are presented. The first two items are a reference pair of the two levels of the variable under investigation, e.g., a control sample and a test sample. These are inspected by the panelists. The next two items are blind-coded versions of the same items, and the panelist's task is to correctly match each of the blind-coded items to its corresponding reference item from the first pair.
Dumping effect See Attribute response restriction effect.
Duo-trio test A discrimination test in which three items are presented: a reference and then two test items, one of which matches the reference and the other of which is a variation of the variable under investigation. The judge's task is to match the correct test item to the reference.
Enhancement A situation in which a product or attribute is rated as more intense or more liked in the presence of another stimulus, product, or sensory attribute.
Error of anticipation A change in response during a stimulus sequence that occurs before a change in sensation that would justify such a response.

Error of habituation A failure to change response during a stimulus sequence when a change in sensation justifies such a change in response.
Expert An individual acknowledged or self-ordained to act as a judge of sensory attributes, defects, or overall product quality based on experience and/or training. Generally not used in sensory evaluation tests. See Panelist.
Factorial designs Experimental designs in which all levels of one test variable are combined with all levels of another test variable.
Fechner's law A principle stating that the perceived intensity of a stimulus is proportional to the logarithm of stimulus energy.
Filiform papillae The small, conical protrusions on the dorsal anterior tongue surface responsive to tactile stimulation but not taste.
Flavor A complex group of sensations comprising olfactory, taste, and other chemical sensations such as irritation or chemical heat.
Flavor profile A proprietary descriptive analysis procedure developed by the A. D. Little consulting group in the late 1940s and characterized by a high degree of panel training, use of a category scale for sensory intensity, and a consensus procedure for arriving at the product's sensory specification.
Focus group A technique for qualitative research in which a group of approximately 10 consumers is given a semistructured interview by a nondirective moderator, and discussion of a range of issues is facilitated.
Foliate papillae A set of structures consisting of grooves in the lateral tongue surfaces containing taste buds.
Forced-choice Describing a test situation in which the participant must choose one response from a set of fixed alternatives, and nonresponse is not allowed.
Free-choice profiling A descriptive method in which untrained or minimally trained panelists evaluate products using their own individual set of descriptors.
Data are generally submitted to Procrustes analysis to produce a consensus perceptual map of important dimensions and to assess the degree of agreement of individual panelists' data with the statistically manufactured consensus map.
Fungiform papillae Structures on the dorsal anterior tongue surface and anterior edges, intermingled with the smaller filiform papillae, resembling small buttons or mushroom-shaped protrusions and sometimes containing taste buds in their upper surface.
Grading The process of applying a categorical value to a lot or group of products. While this may be based on sensory information, it is no longer generally considered an example of sensory evaluation methodology.
Gustation The sense of taste, limited to those sensations mediated by the taste buds and receptor cells of the tongue and soft palate; believed to comprise the sensations of sweet, sour, salty, bitter, and umami tastes.
Halo effect (1) The tendency of a product to be viewed more positively than normal due to one or more overriding or influential sensory attributes or other positive influences. (2) The tendency of a sensory attribute to be rated as more intense or more hedonically positive due to other logically unrelated sensory attributes in a product. Opposite: "horns" effect.
Hedonic Referring to the likes, dislikes, or preferences of a person.
Home use test A consumer test involving placement of the products in the home for a period of time, in order to determine acceptability or preference under realistic conditions.
Hue A dimension of perceived color in the Munsell system, referring to variation in the quality of color, as in the variation across the visible spectrum of monochromatic lights.
Hypogeusia Partial loss of taste sensitivity.
Hyposmia Partial loss of smell sensitivity.
Incomplete block An experimental design in which only some of the levels of a variable are assigned to each block.
For example, each panelist may evaluate only some of the versions of a product, or only some of the products are evaluated within a single session (see Complete block).
Independent variable A variable that is manipulated directly by an experimenter; these form the factors of a study.
Inhibition A general term for the reduction in sensation or response as a function of another stimulus being present, in contrast to how the same stimulus or product would be responded to in another baseline condition.
Intensity scale A set of numerical or categorical response alternatives used to report the perceived (sensation) strength of a stimulus or product.
Judge See Panelist.
Just-noticeable difference (JND) See Difference threshold.
Labeled magnitude scale (LMS) A hybrid scaling procedure in which a vertical line scale is labeled with general intensity descriptors in a quasi-logarithmic spacing. The judge's task is to mark a point on the line to represent the perceived intensity of a stimulus. Data are often comparable to those generated from magnitude estimation.

Line marking A class of scaling techniques in which a person responds by choosing a position on a line (vertical or horizontal, marked or unmarked) representing different perceived intensities of a specified attribute or different degrees of liking for a product. Also called visual analog scales, line scales.
Magnitude estimation A scaling procedure in which numbers are generated freely by a panelist to represent the ratios of perceived intensities of products or stimuli; originally claimed to produce ratio-scale data.
Masking A complete or partial reduction in intensity of a sensation as a function of other stimuli or sensations that are present at the same time.
Mean A measure of central tendency. The arithmetic mean or average is the sum of all the observed values divided by the number of observations. The geometric mean is the Nth root of the product of N observations.
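The two means in the entry above can be sketched in a few lines of Python. The ratings shown are hypothetical; the geometric mean is computed via logarithms, which is numerically safer than multiplying many values:

```python
import math

def arithmetic_mean(values):
    """Sum of the observed values divided by the number of observations."""
    return sum(values) / len(values)

def geometric_mean(values):
    """Nth root of the product of N observations, computed via logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical magnitude estimates from one panelist
ratings = [2.0, 8.0, 16.0, 4.0]
print(arithmetic_mean(ratings))  # 7.5
print(geometric_mean(ratings))   # about 5.66; never larger than the arithmetic mean
```

The geometric mean is the conventional summary for magnitude estimation data (which are ratio-like and positively skewed), which is why both forms appear in the definition.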

Method of adjustment A classical psychophysical procedure in which a variable stimulus is adjusted by a panelist to be sensorially equal to a reference item.
Method of constant stimuli A classical psychophysical procedure used primarily for measuring difference thresholds. Varying levels of a comparison stimulus are compared to a constant reference item in pairs, and the panelist's task is to indicate which of each pair is perceived as more intense.
Method of limits A classical psychophysical procedure in which a stimulus energy level is raised and lowered to determine the point at which it becomes detectable in threshold measurement. In modern adaptations, a forced-choice test may be added at each energy level.
Mouthfeel A category of sensations occurring in the oral cavity, related to the oral tissues and their perceived condition (e.g., drying, coating), as opposed to texture, in which the sensations are related to the food itself, as in firmness, hardness, etc.

N-alternative forced-choice A discrimination test procedure in which one sample represents a variation of the experimental variable under study, and the other N - 1 samples are control or baseline samples. The panelist's task is to identify which of the N blind-coded samples is perceived as most intense in a specified attribute. When there are only two samples, this is the paired comparison procedure.
Nares The openings to the nose (internal or external).
Newtonian fluids Fluids in which shear stress at a given shear rate remains constant over a period of time.

Null hypothesis An exact mathematical statement concerning a value from a population distribution (such as a mean or proportion) for which it is possible to compute a statistic and the probability of obtaining a statistic as extreme or more extreme from an experiment. The alternative hypothesis is a contrasting statement about the value from the population distribution. In simple difference testing for scaled data (e.g., where the t-test is used), the null assumes that the population mean values for two treatments are equal. In simple difference testing on proportions (e.g., where the data represent a count of correct judgments, as in the triangle test), the null hypothesis is that the population proportion correct equals the chance probability. The null is often misphrased as "there is no difference" (a conclusion from the experiment, not a null hypothesis).
Objective Describing data that are capable of being verified by independent observation or measurement and therefore not dependent on the report of a single individual.
Observer See Panelist.
Odor The characteristic smell of a substance.
Odorant A stimulus giving rise to an odor or odors.
Olfactometer A device for delivering controlled and repeatable concentrations of an odorant to an experimental subject for olfactory evaluation.
Olfactory epithelium Those portions of the interior of the nasal passages that contain olfactory receptors, are innervated by the olfactory nerve, are differently pigmented, and contain different cells from the surrounding nasal epithelium.
One- and two-tailed tests Describes the consideration of one or both tails of a statistic's distribution in determining the obtained p-value.
In a one-tailed test, the alternative hypothesis is directional (e.g., the population mean of the test sample is greater than that of the control sample), while in a two-tailed test, the alternative hypothesis does not state a direction (e.g., the population mean of the test sample is not equal to the population mean of the control sample).
P-value The probability of observing a test statistic as large or larger than the one calculated from an experiment, when the null hypothesis is true. Used as the basis for rejecting the null hypothesis when compared to the preset alpha level. Often mistakenly assumed to be the probability of making an error when the null hypothesis is rejected.
Paired comparison (1) A discrimination test procedure in which two products are presented and the judge's task is to choose the one that is perceived as higher or more intense in a specified attribute. (2) A preference test in which the judge's task is to choose which of two items is more appealing or more acceptable on a sensory basis.
Panel A group of people that comprises a test population chosen for specific characteristics such as product usage, sensory acuity, or willingness to participate in repeated sensory tests.
Panelist Generally, a participant in a sensory evaluation. Several related terms are commonly used. "Panelist" connotes a participant as a member of a group that is often tested on more than one occasion (as in a descriptive analysis panel). However, "consumer panelists" in an affective (hedonic) test may participate only once. "Judge" is an older term for a sensory panelist with some special qualifications, such as having been screened for sensory function and possibly trained. It is usually applied to participants in sensory-analytic techniques such as discrimination, descriptive, or quality control tests.
"Subject" is a term for a participant in a psychological experiment, and to the extent that sensory tests are focused experiments on sensory-product interactions, this term has been borrowed in sensory evaluation. "Observer" is an older term for a subject, and connotes participation in any sensory or perceptual experiment, not just visual tests.
Parameter A characteristic that is measured about something, such as the mean of a distribution.
Perception The act of becoming aware of a stimulus and its qualities based on the sensations that are caused and the interpretation of those sensations based on previous experience.
Power (1) The statistical estimation of the probability that a test will reject a false null hypothesis, e.g., that a true difference will be detected, estimated as 1 minus beta. (2) Loosely used, the ability of a sensory test to detect true differences or effects, called "sensitivity" in Appendix V.
Power law (or power function, Stevens's power function) A psychophysical function in which perceived intensity is proportional to the physical stimulus energy raised to an exponent that is characteristic of a given sensory modality. This forms a straight line in a log-log plot and is a common finding with magnitude estimation data.
Preference Referring to the comparison of two or more products on an hedonic basis.
Preference test A test involving choice or ranking of two or more products for their appeal on a sensory basis.
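The p-value and null-hypothesis entries above can be made concrete with the triangle test: under the null hypothesis each judge picks the odd sample with chance probability 1/3, and the one-tailed p-value is the binomial probability of a result as extreme or more extreme than the one observed. A minimal sketch in Python; the counts are hypothetical:

```python
from math import comb

def binomial_tail_p(correct, n, chance):
    """One-tailed p-value: probability of `correct` or more successes in
    `n` trials when each succeeds with probability `chance` (the null)."""
    return sum(comb(n, k) * chance**k * (1 - chance)**(n - k)
               for k in range(correct, n + 1))

# Hypothetical triangle test: 13 of 24 panelists chose the odd sample
p = binomial_tail_p(13, 24, 1/3)
print(round(p, 4))  # p is about 0.03, below the usual 0.05 alpha level
```

Note that this p-value is the probability of the data given a true null, not the probability that rejecting the null is an error, exactly as the P-value entry cautions.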

Procrustes analysis A class of statistical techniques combining aspects of principal components analysis with methods of rotation, translation, centering, and reflection to produce a consensus configuration or perceptual map of products from individual data. The individual data sets need not use the same attributes or descriptors but may be idiosyncratic, as in free-choice profiling.
Principal components analysis A widely used variation of factor analytic statistical techniques that reduces a set of dependent variables to a smaller set of underlying variables (called factors) based on patterns of correlation among the original variables. The analysis produces factor loadings, which indicate the strength of relationship of the new factors to the original variables, and factor scores, which are the values of the products or stimuli on the factors.
Psychophysics The branch of experimental psychology dealing with quantification of sensory-based responses as a function of stimulus energy.
Quality (1) (when plural) The sensory attributes of a product that differ in kind rather than intensity or degree. (2) The constellation of attributes that determines the degree of liking for a product. (3) The judgment of expert judges who have been exposed to a wide range of products in the specific category over a long period of time and who evaluate the total characteristics of a product. (4) An absence of defects. (5) Conformance to specifications. (6) Fitness for intended purposes.
Quality control The measurement of sensory and physical attributes of products in order to control and maintain an acceptable level of variation in production, and actions taken on the basis of those measurements.
Quantitative descriptive analysis (QDA) A proprietary descriptive analysis method developed at the Stanford Research Institute in the early 1970s and characterized by use of line scales, replicated experimental designs, consumer-oriented descriptive terminology, and use of analysis of variance.

Ranking The act of ordering a group of products with respect to the perceived intensity of a sensory attribute or the degree of liking.

Rating A general term for the application of numerical values or numerical response categories to products based on their sensory attributes.
Recognition threshold The minimum energy level of a stimulus that permits correct identification of the stimulus, i.e., naming or labeling.
Reference standard (1) A physical example or set of examples used to illustrate a specific sensory attribute. (2) An example used to illustrate a value on an intensity scale.

Responsiveness The general reactivity of an observer to a stimulus. Near detection threshold, this is called "sensitivity."
Retronasal smell The act of perceiving a volatile olfactory stimulus originating in the mouth via transportation of the stimulus molecules up the back of the nasopharynx and into the region of the olfactory receptors from the direction opposite to that of aroma. Often mistaken for "taste" by untrained observers.
Rheopectic Describing a condition in which apparent viscosity increases with the time of shearing, but the change is reversible (rare in food).
Saliva The liquid expressed by the salivary glands, either in a resting or stimulated state, containing water, mucopolysaccharides, and various ions including bicarbonate.
Same-different test A discrimination test procedure in which two pairs of samples are presented to judges. One pair contains identical samples (e.g., two versions of a control sample), and the second pair contains the samples that differ in the variable under investigation (e.g., a control sample and a test sample). The panelist's task is to respond as to whether the items of the pair are sensorially the same or different. See A, not-A test.
Sample (1) A subset of the possible participants in a study. When used in the statistical sense of "sample size," it refers to the number of panelists participating in a study or experiment, usually symbolized by the letter N. (2) A product under testing.
Saturation A dimension of perceived color describing the color purity, i.e., the relative amount of color-inducing wavelengths vs. the amount of gray or white light that dilutes or desaturates the perceived color (syn.: chroma).
Score A value from a measurement, synonymous with a datum. To assign a value or response to an item, usually but not necessarily numerical, based on predetermined criteria. See Grading, Ranking, Rating.
Sensory adaptation A decrease in the sensitivity or responsiveness of an observer as a function of constant stimulation.
Sensory evaluation A scientific discipline used to evoke, measure, analyze, and interpret people's reactions to products based on the senses.
Sensitivity (1) The degree to which an observer is responsive to a stimulus at or near threshold. For a given observer or a given stimulus material, sensitivity is proportional to the reciprocal of the threshold energy or concentration. (2) The general ability of an experimental procedure to detect differences or statistically significant effects, including aspects of experimental design; good sensory methodology; experimental control of product handling and serving; environmental factors; panelist factors such as screening, training, and other calibration; and statistical power considerations such as the sample size.
Shear thickening A condition in which apparent viscosity increases with the time of shearing but the change is irreversible.
Shear thinning A condition in which apparent viscosity decreases with the time of shearing but the change is irreversible. The fluid stays in the thinned state once the shear stress is removed, as in some gum solutions and starch pastes.
Somesthesis The combined sensory mechanisms along the body surface for sensing touch, warmth, cold, pressure, pain, and related tactile sensations.
Spectrum technique A proprietary descriptive analysis technique developed by the Sensory Spectrum consulting group in the 1980s and characterized by use of highly calibrated observers using cross-referenced intensity scales.
Statistic (1) A value calculated from data, such as t or F, also called a test statistic. (2) A variable with tabled values with an expected distribution, based on certain assumptions.
The tabled values provide limits on alpha-risk, based on the probability of obtaining values as extreme or more extreme, and thus provide a critical value to which the obtained test statistic is compared in making statistical decisions concerning "significance."
Stimulus That object, energy, or physical property that causes a sensation.
Stimulus error An error in sensory judgment caused by presuppositions or expectations concerning the identity or nature of the product to be evaluated.
Subject See Panelist.
Subjective Not independently verifiable, as in the expressed likes or dislikes of a specific individual.
Subliminal Below threshold (from "limen," meaning threshold).
Suprathreshold Referring to evaluation in a range clearly above the threshold, as in scaling intensity along the full dynamic range of a stimulus.
Synergy The failure of a model to predict additivity in the direction of a result larger than expected. In taste and smell mixtures, synergy is claimed under various conditions but can most reasonably be viewed as an intensity value of the mixture that is higher than would be predicted from the individual dose-response curves (i.e., psychophysical functions) of the components when scaled alone.

Taste bud A collection of modified epithelial cells, residing in a papilla on the tongue or soft palate, responsive to taste stimuli and making synaptic contact with primary afferent taste nerves.
Texture Characteristics of a product perceived by the visual or tactile senses, including geometric qualities, surface attributes, perceived changes under deforming forces or when forced to flow if liquid, phase change behavior such as melting, and residual tactile sensations following chewing, swallowing, and expectoration.
Texture profile A method for evaluating perceived food texture developed by the General Foods Corporation in the 1960s and including mechanical and rheological (force-related) properties as well as geometric properties of foods, evaluated by a trained panel using intensity-calibrated scales.
Time-intensity scaling A class of methods involving the evaluation of sensory attributes or hedonics over time after the exposure to and sampling of a product; often involving the measurement of rate of change, duration, or other time-related parameters of sensation.
Thixotropic Describing conditions in which apparent viscosity decreases with the time of shearing but the change is reversible; in other words, the fluid will revert to its original state on standing, as in some starch gel pastes.
Threshold A general term referring either to a concept or its measurement. (1) Conceptually, the threshold is the point at which an observer first responds to a stimulus as the physical energy level is increased. (2) When operationalized, it represents an arbitrarily chosen point on a data function relating probability of response to physical energy, e.g., the 50% point in a forced-choice test after correction for chance performance.
Triangle test A discrimination test in which three products are presented, two being the same and a third that is a different version of the variable under investigation.
The judge's task is to choose the item that is most different from the other two.
Treatment The different levels of an experimental (independent) variable. In other words, what has been changed about a product and is the subject of the test.
Turbidity The ability of a sample, particularly a fluid, to scatter a beam of incident light as it passes through a product. Turbidity is usually measured instrumentally but corresponds to the sensory properties of cloudiness or haziness (as opposed to clarity, transparency).
Turbinate bones The bony shelflike protrusions extending laterally from the midline or septum into the nasal cavity. Olfactory epithelia lie above the uppermost or third turbinate bones.
Two-tailed test A statistical test with a nondirectional alternative hypothesis, necessitating estimation of values as extreme or more extreme than the obtained statistic in both positive and negative tails of a distribution.
Type I error The act of rejecting the null hypothesis when it is true. This usually results in a conclusion that two versions of a product are different when they are not, or that a treatment variable had an effect when the results were due to chance variation or error alone.
Type II error The act of accepting the null hypothesis when it is false. This usually results in a conclusion that two products are equivalent when in fact a true sensory difference exists, or that a treatment variable did not have any effect when in fact it did.
Umami Referring to the taste associated with salts of glutamate, aspartate, and some ribonucleotides that are known to synergistically combine in mixtures.
A Japanese term translating roughly as "delicious taste" but denoting some of the same properties as the English "meaty," "savory," or "brothy."
Value A dimension of color in the Munsell system synonymous with lightness, i.e., the overall amount of light reflected from a product.
Viscosity Resistance to flow, measured physically or evaluated on the basis of sensory perception.
Weber's law A principle stating that the difference threshold is proportional to the starting intensity level of the comparison stimulus. The ratio of the difference threshold to the reference stimulus is thus a constant value.

ADDITIONAL REFERENCES

Amerine, M.A., Pangborn, R.M., and Roessler, E.B. 1965. Principles of Sensory Evaluation of Food. Academic Press, New York.
Civille, G.V., and Lyon, B.G. 1996. Aroma and Flavor Lexicon for Sensory Evaluation. American Society for Testing and Materials, West Conshohocken, PA.
Munoz, A.M. 1986. Development and application of texture reference scales. Journal of Sensory Studies, 1, 55-85.

INDEX

A scaling, 219 Anosmia, 52, 796 Abbott's formula, 159, 772 specific, 20, 53-55, 56, 174, 185, 189 Acceptance, 628 Anscombe's quartet, 740-741 Acceptance tests, 450-457 Appearance, 73, 406, 796 Acceptor set size, 467-468 reflectance, 409 Ascending method of limits, 796 Appropriateness scales, 467 Adaptation, 43-44, 58, 181, 187-188, Aroma, 406, 796 307, 626, 806 Assimilation effect, 796 cross-adaptation, 49 ASTM (American Society for Testing sensory, 806 and Materials), 2, 19, 176, 200, Adaptation level, 306-308 226 Adaptive procedures (thresholds), Astringency, 42, 66-67, 796 193-195 Attitudes, measurement of, 509 Affective shifts, 310-311 Affective test, 795-796 Aftertaste, 796 Agree/disagree scales, 509 B Ageusia, 796 Alpha-risk, 666, 758, 795 Barter scales, 469-470 interaction with beta, 763 Basker's tables, 238, 444, 446, 448, Alternative hypothesis, 662-663, 695-696 758-759, 796 , 551-552 Analysis of variance, 352, 587, 701-737, Bayesian statistics, 677 746, 796 Beidler taste equation, 177-178 multiple factors, 708-710 Beidler, L. M., 37 one-way, 702-707 Beta-risk, 667, 756, 797 repeated measures, 712-717 Bias, 304-306, 319-325, 797 split plot, 731-735 centering, 307, 321-323 two-way, 725-728 contextual, 213 Anchors, 320, 796 contraction, 323 intensity, 567 frequency, 320-321 end, 234 number (logarithmic), 323-324


Bias (Continued) light source, 414-416 number usage, 213, 222 lightness, 408, 424 odd-sample, 331 metameric match, 416 positional, 331 purity, 408 response, 177, 213 reflectance, 426 transfer, 324 saturation, 408 Binomial distribution, 437-439, 679, value, 417-418 681-686 Color solids Bipolar scales, 230 chromaticity, 422-423 Blind-labeled testing, 23, 474, 625 CIE, 419, 422, 424, 427 Blind test, defined, 797 color gamut, 420, 422-423 Blocking, 728-730 conversion equations, 425 Gardner, 424 HunterLAB, 424 matchable colors, 420 C mathematical, 418-419, 421 Munsell, 417 Cannibalization, 611 nonmatchable colors, 420 Capsaicin, 62, 202 three-light system, 419-421 Category ratings, 36-37, 208, 220-224, tristimulus values, 420-421, 423 241-246, 302 Communication skills, 643 Category review, 605-606, 609-611 Comparative judgment, 798 Category scale, 797 Competitive surveillance, 604 Centering bias, 460-461 Complete block design, 629, 712-717, Central location test, 486-487, 797 798 Central tendency, measures of, Concept evaluation, 520 653-654 Concept mapping, 538 Chemesthesis, 61-62, 797 Cones, 410 Chemoreceptor, 797 Confidence intervals, 659-660 Chi-square distribution, 797 Consumer complaints, 619 Chi-square statistic, 439, 680, 686-688 Consumer field tests, 480-484 Children, preference testing of, Consumer panels, standing (ongoing), 455-457 485-586 Chili pepper, 61-62, 202, 229, 248, 249 Consumer sensory evaluation, 430-431 Chroma, 797 Consumer testing, 15-17, 107, 628-629 Circumvallate papillae, 798 central location, 104 Claim substantiation, 5, 20, 509 Consumer tests, types of, 484-489 Clarity, 412 Consumers CO2 (carbon dioxide), 66, 72 recruiting, 494-495 Coding, 798 screening, 493 of questionnaire responses, 640 (See also Qualifying panelists) Coefficient of variation, 655-656 Content analysis, 537 Coefficient of determination, 745-746 Context effects, 248-249, 301-340 Collinearity, 587 antidotes, 332-335 Color, 406, 408 Contrast effect, 248-249, 306-309, 798 blindness, 410-411,
416 Correction for continuity, 798 brightness, 408 Correlation, 398, 413 chroma, 409, 417-418, 421, 424 causality, 741-745 color-rendering index, 415-416 coefficient, 798 color temperature, 415-416 delta, 427 Cramer's, 739, 751-752 differences, 426 nonparametric, rank order, 681, hue, 408, 417-418, 421, 424 697-698 Pearson's, 739-745

Correspondence analysis, 618 normal distribution in, 132 Counterbalancing, 333 paired comparison test, 117 Covariance, 743 replications in, 135 Cross-modality matching, 36, 229, 230 Smith's test, 136-137 strengths of, 128 triangle test, 121 two-out-of-five test, 126 weaknesses of, 128 D z-test on proportion, 132, 137 Discussion guide, 529, 531 d-prime (d'), 146 Distributions, statistical, 653 Dairy judging, 561-564 Dual-standard test, 626, 799 Degrees of freedom, 664, 798 Dumping effect, 328-329, 352, 799 Demographic information, 501 Duo-trio test, 8, 626, 799 Dependent variable, 798 balanced reference, 122 Descriptive analysis, 9-11, 71, 217, 362, constant reference, 122 371, 557, 565-567, 596, 798 probability, 122 ballot training, 363, 365 concepts, 343, 342 consensus training, 363 descriptors, 343, 344, 345, 346, 348 deviation from reference, 366 E dynamic Flavor Profile, 368 intensity variation, 367 Effect size, 677, 758-759, 761 language, 342 in ANOVA, 761 limitations, 252 Electrogustometry, 188 reference standards, 343, 346, 347, Electromagnetic spectrum, 409 351, 363-364 Emmert's law, 302 Sensory Spectrum, 358 Employee panels, 472-474, 485, 642 Texture Profile, 356 End-use avoidance, 222-223, 247, 320 uses, 341 End anchors, verbal, 628 Desensitization, 63, 72 Enhancement, 799 Detection threshold, 29, 173, 799 Errors Diacetyl, 189, 190 of anticipation, 181, 330, 799 Difference from ideal, 237 in decisions, 4, 95, 153, 134, 137 Difference test, 799 of habituation, 181, 330, 800 Difference threshold, 29, 30, 54, 152, in measurement, 650 514, 799 stimulus, 330 Discriminal dispersion, 152, 153 time-order, 331 Discriminant analysis, 611 Type I and Type II, 4, 95, 133, 134, Discrimination, 799 137, 141, 551, 587, 666, 756, 809 Discrimination testing, 6-9, 16, 94, 116, Executive summary, 541 799 Expectation effects, 323, 330 A-not-A test, 124, 795 Experimental design, 103, 629-630, ABX test, 127, 795 633-635 binomial distribution, 129 binomial tables, 130,
completely randomized design, 104 chi-square distribution, 131 design structure, 103 duo-trio test, 122 incomplete block design, 108 Harris-Kalmus test, 126 randomized complete block design, mistakes in, 138 105 n-alternative forced-choice, 117, treatment structure, 103, 106 125, 129, 155 Expert, 800 822 INDEX

F
Fabric, 390
Face scales, for product acceptance, 455-456
Facilities
    booths, 85-89
    climate control, 90
    discussion area, 87
    evaluation area, 85, 86, 91
    lights, 91
    preparation area, 87, 89
    serving hatches, 88
    tables, 86
    testing, 493
    waiting area, 87
Factor analysis, 608
Factorial design, 708, 800
False close, 536
False enhancement, 326-329
Fechner, G. T., 29, 32, 152, 173, 237
Fechner's law, 800
Field services (agencies), 492-499
Field tests, management decisions in, 512
Filiform papillae, 800
Fixed vs. random effects, in ANOVA, 719-772
Flavor, 18, 53, 174, 406-407, 800
    oxidized, 58
    trigeminal, 61-67
Flavor Profile, 9-10, 222, 347, 356, 393, 800
    amplitude, 347, 348, 352
    consensus, 347-349, 350
    panel leader, 350
    profile attribute analysis, 349
    sunburst graph, 348
    training, 350
Focus groups, 520-522, 523-533, 629, 800
    analysis and reporting, 536-541
    follow-up to field tests, 499
    in sensory evaluation, 526-528
    moderating, 533-536, 544-545, 619
    multiple moderators, 543-544
    videotaping, 532
Focus panels, 528
Foliate papillae, 800
Food action rating scale (FACT), 466
Food choice, 521
Forced-choice, 626, 800
Fovea, 410
Fragrance testing, 488
Frame of reference, 234-235, 248, 301
Free-choice profiling, 368, 592, 617-618, 800
    idiosyncratic descriptors, 369, 370
    Procrustes analysis, 369
    training, 369
Frequency effects, 315-317
Friedman analysis of variance on ranks, 238, 446-449, 694-696
Funding for sensory research, 644
Fungiform papillae, 800

G
Gas chromatography, 200
Geometric mean, 629
Glomeruli, olfactory, 51, 52
Glossiness, 409, 413
Glutamate, monosodium (MSG), 43, 236
Good practice, 574-577
Grading, 800
Guessing models, 159-162
Gustation, 801

H
Halo effects, 69, 326, 327, 432, 507, 801
Haze, 411-413
Hedonic, 801
Hedonic (affective) tests, 10-12
Hedonic scale, 9-point, 254-258, 450-455, 628
Hedonic scaling, 430, 450-457
Hedonic tests, 230, 596
Helson, H., 307
Histograms, 652-653
Home use test, 16, 629, 801
Horns effect, 69, 326
Hue, 801
Human subjects, 109
, 801
Hyposmia, 801
Hypothesis testing, statistical, 658-667, 674-675

I
Ideal product, scaling of, 461-462
IFT (Institute of Food Technologists), 3
Illiteracy, 432, 499
Incentives, 107-108, 503

Incidence, product usage, 493
Incomplete block designs, 736, 801
Independent t-test, 668, 669-670, 692
Independent variable, 801
Indirect scales, 37, 152, 235
Individual differences, 20
Ingestion, 96
Inhibition, 801
Instructions, 96
Instrumental analysis, 17, 19, 550
Instrumental sensory correlations, 630, 739
Intensity scale, 801
Interactions
    in ANOVA, 710-712
    crossover, 712
    magnitude, 712
Interpretation, 335
Interval scales, 212, 680
Interviewers, field service, 493
Interviewing, 501-502
Irritation, nasal, 65

J
Judge, 801
Judgment process, 210
Just-about-right scale, 321-323
Just-noticeable difference, 237, 801
Just-right scales, 457-462
    analysis of, 459, 463-465

K
Kramer rank sum test, 695, 696-697

L
Labeled magnitude scale, 232, 801
Likert scales, 223, 509
Line marking, statistics, 352, 585
Line scales, 209, 217-220

M
Magnitude estimation, 52, 208-209, 215, 226-255, 699, 802
    comparison to category scales, 245-246
    log transformation of data, 228, 252
    use of zero, 226
    warm-up task, 227
Magnitude matching, 232
Maltol, 326
Mann-Whitney U-test, 692-694
MANOVA, 717-736
Market gap analysis, 604
Marketing research, 21-23, 480-484, 766
Masking, 802
McNemar test, 688, 690-691
Mean, 653, 802
Means separation
    following ANOVA, 707, 722-725
    canonical variate analysis, 588
Median, 629, 653, 680
Merton, R. K., 520, 542
Meta-analysis, 237
Metallic taste, 42-43
Method of adjustment, 29, 30-31, 311, 802
Method of constant stimuli, 29, 30, 152, 802
Method of limits, 29-30, 173, 181-182, 802
Method of single stimuli, 32, 220
Michaelis-Menten kinetic equation, 37, 178
Mixture interactions, 43, 58
    enhancement of, 45, 70-71
    release from suppression, 46, 47, 60-61
    suppression of, 44, 59, 69
    synergy in, 45-46, 71
Mode, 211, 653
Modulus, 226
Monadic sequential tests, 491
Monadic tests, 532, 491
Mouthfeel, 802
Multidimensional scaling (MDS), 612-617
Multiple pairs test, 185-186, 198
Multiple standards difference test, 572-574
Multivariate
    canonical variate analysis, 588
    discriminant analysis, 588
    multivariate analysis of variance, 586
    preference mapping, 457, 596-598
    principal component analysis, 588, 596
    Procrustes analysis, 592, 593, 596

N
n-Alternative forced-choice test, 2-AFC, 117, 125, 129, 802
n-Alternative forced-choice test, 3-AFC, 125, 129, 155, 802
Nares, 802
Naturalistic observation, 525
Newtonian fluids, 802
Nominal scales, 211, 751
Nonparametric statistical tests, 679-700
Normal distribution, 657
Null hypothesis, 661-665, 758
Numbers of panelists (sample size), 757, 770-771

O
Objective, 805
Odor, 805
    classification of, 55
    identification of, 54-55
Odor units, 198-201
Odorant, 805
Olfactometer, 805
Olfactory bulbs, 51-52
Olfactory epithelium, 805
Olfactory receptors, 50-52
Olfactory sensitivity, 186
One-on-one interviews, 522-525, 542
One-tailed tests, 118, 120, 663, 664, 805
Open-ended questions, 498, 509-511, 629
Optimization, 468
Ordinal scales, 211, 680
Orthogonality, 252, 589
Oxidation, , 566

P
p-value, 658, 675-677, 805
Packaging, 17-18
Pain, 219
Paired comparison, 8, 152, 626, 691, 805
Paired comparison tests
    difference, 117, 119
    directional, 117-118
    probability, 118
    same/different test, 117
    simple difference test, 117
    in home use tests, 492
    statistical analysis, 435-440
Paired t-test, 667, 671-674, 691
Palate cleansers, 95
Panel, 804
Panelist, 804
    calibration of, 225, 248, 554-555
    partitioning of variation, 712-717
    screening of, 174-175, 188, 568-570
    training of, 15, 566, 755
Paper, 590
Papillae, 40-41
    circumvallate, 41
    filiform, 40
    foliate, 41
    fungiform, 40, 62
Paradox of the nondiscriminating discriminators, 154-155
Parameter, 804
Parducci, A., 515
Pass-fail QC methods, 558-559
Peer debriefing, 557
Perception, 804
Perceptual mapping, 605, 606-618
Phenylthiocarbamide (PTC), 46-47, 184-185, 189
Piperine, 62
Population, statistical, 655
Population statistics, 656-658
Poulton, 519
Power, 4, 154, 155, 667, 804
    in discrimination tests, 771-775
    statistical, 754-782
Power law, 804
Preference, 457, 628, 804
Preference ranking, 444-449
Preference tests, 11, 431-449, 804
    following difference tests, 452-455
    misuse of, 449
    no preference option, 440-445
    null hypothesis, 455
    paired, 431-440
Principal component analysis, 415, 608, 755
    factor pattern, 590
    Kaiser criterion, 589
    rotation, 590, 591
    scree test, 589
Probits, 197-198
Procrustes analysis, 805
    consensus in, 593, 594, 595
    generalized, 369, 370, 617
    rotation, 595
    translation, 595

Product concept test, 481-483
Product development, 520
Product distribution, in consumer tests, 495
Projective mapping, 542
Projective methods, 534, 543
PROP, 6-n-propylthiouracil, 46
Psychometric function, 30-31, 176
Psychophysical function, 174
Psychophysics, 18, 28-38, 209-210, 805
Purchase intent, 508
Purkinje shift, 410

Q
Qualifying panelists, in acceptance/preference testing, 470-474
Qualitative research, 603
Quality, 805
    definition of, 548-549
    scoring of, 567-571
Quality control, 18, 335, 805
    program development, 552-555
    requirements, 555-557
    sensory methods, 558-572
Quality judging, quality grading, 6, 23, 24
Quantitative descriptive analysis, 10, 351, 361, 805
    cobweb, 354, 355
    consensus in, 351, 356
    panel leader in, 351, 356
    polar coordinate graphs, 354
    radar plots, 354, 355
    training in, 351, 353, 356
Questions
    order of, 501
    wording of, 503-509
    design of, 499-511, 628-629, 635, 638-639
    sample of, 514-518

R
R-index, 142
Random effects, in ANOVA, 630
, 96-99, 104, 107, 333
Range-frequency theory, 315-318
Range effect, 249
Rank order tests, 691-699
Ranking, 238-239, 628, 680, 749, 805
Rated difference from control, 191-193
Rating, 805
Rating scales, 32
Ratings
    degree of difference, 559-561
    quality of, 561-565
Ratio scales, 213
Receiver operating characteristic (ROC curve), 150-151
Recognition threshold, 179-180, 805
Reference standard, 805
Regression, 740
    linear, 739, 745-748
    multiple, 739, 749
Reliability, 3, 15-16, 19
    in qualitative research, 525
Repeated measures
    ANOVA, 712-717
    covariance assumption, 717
Repertory grid method, 371, 542
Responsiveness, 806
Response output function, 229, 242, 304-306
Response restriction, 326-329
Retina, 409-410
Retronasal smell, 806
Reversed pair technique, 308, 313-315
Rheopectic, 806
Ribosides, ribonucleotides, 43
Rods, 410
Rogers, C., 533

S
Saliva, 41, 63-64, 67, 806
Same-different test, 806
Samples, 806
    carriers of, 92, 94, 95
    containers, 94
    number per session, 103
    serving temperature of, 93
    size of, 92, 164, 387, 489-490
    statistical, 653
    three-digit codes in, 97-99
Satisfaction scales, 507-508
Saturation, 806
Savory flavors, 43
Scale points, numbers of, 222
Scales, pseudonumerical, 251-252

Scaling, 627
    orthogonal terms, 252
    practical guidelines, 246-252
    relative-to-reference, 250-251
Score, 806
Scoville units, 201-202
Screening of consumers, 493
Segmentation, 604
Semantic scaling, 233
Semi-ascending paired difference method, 192
Sensitivity, 806
Sensory evaluation, 806
    and business decisions, 5, 18
    careers in, 3
    definition of, 1-3
    training in, 3
    Type I and Type II, 15-16
Sensory interactions, between modalities, 67-73
Sensory quality shifts, 310-311
Sensory quality tests, guidelines for, 575-576
Sequential effects, 156
Sequential sensitivity analysis, 158, 187
Shear thickening, 807
Shear thinning, 807
Shelf-life, 335, 557, 650
Shine, 414
Side-by-side tests, 490
Sign test, 691-692
Signal detection theory (SDT), 141-150, 177, 195-197, 559
Similarity test, 142, 157-159, 166-169, 771-780
Size constancy, 302
Skip pattern, in questionnaires, 500
Smell, olfaction, 50-51
    retronasal, 50
Social desirability effects, 506
Somesthesis, 807
Sorting, 542, 613-617
Sounds, 383-384
Spearman's rho, 697-698
Spectrum, 807
    lexicon, 358, 359
    panel leader, 359
    reference scales, 361
    training, 359, 361
Staircase procedure, 194
Standard deviation, 654-656
Standard error of the mean, 661
Statistic, 807
Statistical analysis, 2-3, 16
Statistical consultants, use of, 649
Statistics, nonparametric, 214-216
Stevens, S. S., 32, 211, 228
Stimulus, 807
Stimulus error, 807
Stopping rule, 187, 191
Store retrieval, 605-606
, 490
Stress, 612
Stuart-Maxwell test, 459-460
Subject, 807
Subjective, 807
Subliminal, 807
Substance P, 63
Sums of squares, 704
Suprathreshold, 807
Synergy, 199, 807

T
t-statistic, Student's, 659-660
t-tests, 651, 667-674, 755-756, 761-762
    unequal group size, 674
    unequal variance, 674
Tactical research, 602
Taints, 20, 143, 174
Tannins, 66
Taste, 39-49, 406-407
    blindness to PTC and PROP, 46-47
    categories, 42-43
    inhibition by capsaicin, 72
    palatal taste area, 41
    receptors, 39
    taste bud, 39
Taste bud, 808
Taste nerves, 41
    , 41
    glossopharyngeal, 41
    greater superficial petrosal, 41
    vagus, 41
Telephone interviews, 499-500
Test objectives, 13
Test sensitivity, 754-782
    vs. statistical power, 760
Test specification sheets, 495-496
Tests of frequencies, 686-688
Tests of proportions, 679-686
Tetrad procedures, 155
Texture
    auditory, 379, 383-384

    contrast, 380-381, 389
    crispness, 383-386
    crunchiness, 383-386
    melting, 388-389
    mouthfeel, 388-389
    phase change, 388-389
    tactile, 379-380, 387, 390-391
    visual, 379-380, 386-387
Texture Profile, 10, 223, 248, 356, 358, 392, 808
    reference scales for, 357
    scales for, 393-395
    training for, 358, 394
Texture Profile Analysis (TPA), 398-399
Thixotropic, 808
Thorndike, 326
Threshold, 29, 47, 53, 143, 165, 173, 808
    absolute (detection), 29, 173
    difference, 152, 174
    difference, JND (just-noticeable difference), 29, 30, 54, 152, 174
    ascending forced-choice method, 183-187
    forced-choice, 30, 181-191
    recognition, 179-180
    terminal, 180
Thurstone, L. L., 142, 152, 235, 255
Thurstonian scaling, 12, 142, 151, 190, 213, 235-236, 255, 774-775
Time-intensity scaling, 219, 325, 329, 808
Tip-of-the-nose phenomenon, 54
Tolerance discrimination ratio, 237
Top box score, 508
Transformation of data, 252
Treatment, 808
Triangle test, 7, 154-161, 626, 808
    probability, 121, 154-161
Trigeminal irritation, 181
Trigeminal nerves, 61-67
Turbidity, 411-413, 808
Turbinate bones, 808
Two-tailed, 663, 664, 803, 809
Type I error, 4, 95, 133, 134, 137, 141, 587, 666, 756, 809
Type II error, 4, 95, 133, 134, 137, 141, 587, 666, 756, 809

U
U.S. Army Quartermaster Food and Container Institute, 19
Umami taste, 42, 43, 809
Urban, 176

V
Validation of field test questionnaires, 498
Validity, 4, 15, 19
    in qualitative research, 525-526
Value, 809
Variability, 2, 3, 4
    measures of, 654-656
    sources of, 651-652
Variance, 655, 704
Viscosity, 809
Visual illusions, 302-303

W
Weber, E. H., 28-29
Weber's law, 28, 173, 809
Wine aroma wheel, 56-58
Wine quality assessment, 579-582
Within subject analysis, 629

Z
Z-scores, 146-147, 658