
where Var Wn is the variance of Wn and Z is a standard normal random variable. We now have the basis for an approximate test; for example, we would reject H0: θ ≤ θ0 at level 0.05 if (Wn − θ0)/√(Var Wn) ≥ 1.645. Note that Var Wn could depend on θ0 and we can still use it in the test statistic. This type of test, where we use the actual variance of Wn, is called a score test. If Var Wn also depends on unknown parameters, we could look for an estimate S²n of Var Wn with the property that (Var Wn)/S²n converges in probability to one. Then, using Slutsky's Theorem (see Casella and Berger 2001, Sect. 5.5), we can deduce that (Wn − θ0)/Sn also converges in distribution to a standard normal random variable. The large-sample test based on this statistic is called a Wald test.

4. Conclusions

Hypothesis testing is one of the most widely used, and some may say abused, methodologies in statistics. Formally, the hypotheses are specified, an α-level is chosen, a test statistic is calculated, and it is reported whether H0 or H1 is accepted. In practice, it may happen that hypotheses are suggested by the data, the choice of α-level may be ignored, more than one test statistic is calculated, and many modifications to the formal procedure may be made. Most of these modifications cause bias and can invalidate the method. For example, a hypothesis suggested by the data is likely to be one that has 'stood out' for some reason, and hence H1 is likely to be accepted unless the bias is corrected for (using something like Scheffé's method; see Hsu 1996).

Perhaps the most serious criticism of hypothesis testing is the fact that, formally, it can only be reported that either H0 or H1 is accepted at the prechosen α-level. Thus, the same conclusion is reached if the test statistic only barely rejects H0 and if it rejects H0 resoundingly. Many feel that this is important information that should be reported, and thus it is almost required to also report the p-value of the hypothesis test.

For further details on hypothesis testing see the classic book by Lehmann (1986). Introductions are also provided by Casella and Berger (2001) or Schervish (1995), and a good introduction to multiple comparisons is Hsu (1996); see also Hypothesis Tests, Multiplicity of.

See also: Explanation: Conceptions in the Social Sciences; Hypothesis Testing: Methodology and Limitations

Bibliography

Berger R L 1982 Multiparameter hypothesis testing and acceptance sampling. Technometrics 24: 295–300
Casella G, Berger R L 2001 Statistical Inference, 2nd edn. Wadsworth/Brooks Cole, Pacific Grove, CA
Hsu J C 1996 Multiple Comparisons, Theory and Methods. Chapman & Hall, London
Hwang J T, Casella G, Robert C, Wells M T, Farrell R H 1992 Estimation of accuracy in testing. Annals of Statistics 20: 490–509
Lehmann E L 1986 Testing Statistical Hypotheses, 2nd edn. Springer, New York
Schervish M 1995 Theory of Statistics. Springer, New York

G. Casella and R. L. Berger

Hypothesis Testing: Methodology and Limitations

Hypothesis tests are part of the basic methodological toolkit of social and behavioral scientists. The philosophical and practical debates underlying their application are, however, often neglected. The fruitful application of hypothesis testing can benefit from a clear insight into the underlying concepts and their limitations.

1. The Idea of Hypothesis Testing

A test is a statistical procedure to obtain a statement on the truth or falsity of a proposition, on the basis of empirical evidence. This is done within the context of a model, in which the fallibility or variability of this empirical evidence is represented by probability. In this model, the evidence is summarized in observed data, which is assumed to be the outcome of a stochastic, i.e., probabilistic, process; the tested proposition is represented as a property of the probability distribution of the observed data.

1.1 Some History

The first published statistical test was by John Arbuthnot in 1710, who wondered about the fact that in human births, the fraction of boys born year after year appears to be slightly larger than the fraction of girls (cf. Hacking 1965). He calculated that this empirical fact would be exceedingly unlikely (he obtained a probability of 1/483600000000000000000000) if the probability of a male birth were exactly 0.5, and argued that this was a proof of divine providence, since boys, some of whom will be soldiers, have a higher risk of an early death, so that a higher ratio of male births is needed to obtain an equal ratio of males among young adults. We see here the basic elements of a test: the proposition
that the male birth ratio is 0.5, related to data by regarding these as outcomes of stochastic variables, the calculation that the data would be unlikely if the proposition were true, and a further conclusion interpreting the falsity of the proposition.

One of the first statistical procedures that comes close to a test in the modern sense was proposed by Karl Pearson in 1900. This was the famous chi-squared test for comparing an observed frequency distribution to a theoretically assumed distribution. Pearson derived the now well-known statistic to test the proposition, or hypothesis, that the probabilities of the various possible (finitely many) outcomes of some random variable are given by certain preassigned numbers. Pearson proved that this statistic has (in a large-sample approximation) the chi-squared distribution; this distribution can therefore be used to calculate the probability that, if the hypothesis holds, the test statistic will assume a value equal to or larger than the value actually observed.

The idea of testing was further codified and elaborated in the first decades of the twentieth century, mainly by R. A. Fisher (e.g., 1925). In his significance tests the data are regarded as the outcome of a random variable X (usually a vector or matrix), which has a probability distribution which is a member of some family of distributions; the tested hypothesis, also called the null hypothesis, is an assertion which defines a subset of this family; a test statistic T = t(X), which is a function of X, is used to indicate the degree to which the data deviate from the null hypothesis; and the significance of the given outcome of the test statistic is calculated as the probability, if the null hypothesis is true, to obtain a value of T which is at least as high as the given outcome. If the probability distribution of T is not uniquely determined by the null hypothesis, then the significance is the maximal value of this probability, for all distributions of T compatible with the null hypothesis. The significance probability is now often called the p-value (the letter p referring to probability). With Fisher originates the convention to consider a statistical testing result as 'significant' if the significance probability is 0.05 or less, but Fisher was the first to recognize the arbitrary nature of this threshold.

A competing approach was proposed in 1928 by J. Neyman and Egon Pearson (the son of Karl). They criticized the arbitrariness in Fisher's choice of the test statistic and asserted that for a rational choice of test statistic one needs not only a null hypothesis but also an alternative hypothesis, which embodies a proposition that competes with the proposition represented by the null hypothesis. They formalized the testing problem as a two-decision problem. Denoting the null hypothesis by H0 and the alternative by H1, the two decisions were represented as 'reject H0' and 'do not reject H0.' (The second decision was also called 'accept H0,' but it will be discussed below that this is an unfortunate term.) Two errors are possible: rejecting a true H0, and failing to reject a false H0. Neyman and Pearson conceived of the null hypothesis as a standard situation, the burden of proof residing with the researcher to demonstrate (if possible) the untenability of this proposition. Correspondingly, they called the error of rejecting a true H0 an error of the first kind and the error of failing to reject a false H0 an error of the second kind. Errors of the first kind are considered more serious than errors of the second kind. The probability of (correctly) rejecting H0 if H1 is true, which is 1 minus the probability of an error of the second kind, given that the alternative hypothesis is true, they called the power of the test. Neyman and Pearson proposed the requirement that the probability of an error of the first kind, given that the null hypothesis is indeed true, do not exceed some threshold value called the significance level, usually denoted by α. Further they proposed to determine the test so that, under this essential condition, the power will be maximal.

In the Neyman–Pearson formulation, we obtain richer results (namely, specific precepts for constructing good tests) at the cost of a more demanding model. In addition to Fisher's null hypothesis, we also need to specify an alternative hypothesis; and we must conceive the testing problem as a two-decision situation. This led to vehement debate between Fisher on the one hand, and Neyman and E. Pearson on the other. This debate and the different philosophical positions are summarized by Hacking (1965) and Gigerenzer et al. (1989), who also give a further historical account. The latter study also discusses how this controversy was resolved in the teaching and practice of statistics in the social sciences by a hybrid theory which combines elements of both approaches to testing, and which has been treated often as an objective precept for the scientific method, glossing over the philosophical controversies. Examples of this hybrid character are that, in accordance with the Neyman–Pearson approach, the theory is explained by making references to both the null and the alternative hypotheses, and to errors of the first and second kind (although power tends to be treated in a limited and often merely theoretical way), whereas, in the spirit of Fisher, statistical tests are regarded as procedures to give evidence about the particular hypothesis tested and not merely as rules of behavior that will in the long run have certain (perhaps optimal) error rates when applied to large numbers of hypotheses and data sets. Lehmann (1993) argues that indeed a unified formulation is possible, combining the best features of both approaches.

Instead of implementing the hypothesis test as a 'reject/don't reject' decision with a predetermined significance level, another approach often is followed: to report the p-value or significance probability, defined as the smallest value of the significance level at which the observed outcome would lead to rejection of the null hypothesis. Equivalently, this can be defined
as the probability, calculated under the null hypothesis, of observing a result deviating from the null hypothesis at least as much as the actually observed result. This deviation is measured by the test statistic, and the p-value is just the tail probability of the test statistic. For a given significance level α, the null hypothesis is rejected if and only if p ≤ α.

2. Hypothesis Tests in Empirical Research

2.1 The Role of the Null Hypothesis

In the social and behavioral sciences, the standard use of hypothesis testing is directed at single research questions, practically always fragments of larger investigations, about the existence of some or other effect. This effect could be a difference between two group means, the effect of some explanatory variable on some dependent variable as expressed by a regression coefficient in multiple linear regression analysis, etc. Expressed schematically, the researcher would like to demonstrate a research hypothesis which states that the effect in question, indeed, exists. The negation of this hypothesis, then, is understood as the proposition that the effect is absent, and this proposition is put up as the null hypothesis. The research hypothesis, stating that the effect exists, is the alternative hypothesis. Rejecting the null hypothesis is interpreted as support for the existence of the hypothesized effect. In this way, the burden of proof rests with the researcher in the sense that an error of the first kind is to state that there is evidence supporting the existence of the effect if, in reality, the effect is absent. The usual welcoming phrase for the rejection of the null hypothesis is the statement that a significant result has been obtained.

2.2 Example: The t-test

Consider, as a major example, the proposition that two groups, arbitrarily labeled A and B, are different with respect to some numerical characteristic X. Available data are measurements XAi of this characteristic for some individuals i in group A and measurements XBi of the characteristic for other individuals i in group B. The first step in modeling this is to consider the observed values X as outcomes of random variables, usually with the assumption that they are stochastically independent and have a probability distribution not depending on the individual i. The next step is, usually, to focus on the expected values, i.e., population means in the two groups, denoted here by µA and µB. The tested proposition is formalized as the statement that µA and µB are different (two-sided alternative hypothesis) or as the statement that one (e.g., µA) is bigger than the other (e.g., µB) (one-sided alternative hypothesis). The null hypothesis H0 is defined as the statement that µA and µB are equal.

The most commonly used test for this testing problem is Student's t-test, called after the pseudonym of W. Gosset, who laid the mathematical basis for this test in 1908. The test statistic is

  T = (MA − MB) / √( S² (1/nA + 1/nB) )

where MA and MB are the two sample means, S² is the pooled within-group variance, and nA and nB are the two sample sizes. (Definitions of these quantities can be found in any statistics textbook.) This formula illustrates the property of many test statistics that an observed effect (here the difference between the two sample means) is compared to a measure of variability (based here on the within-group variance). Student/Gosset showed that, if the variable X has a normal distribution with the same variance in the two groups, then T has, if the null hypothesis is true, the so-called t distribution on df = nA + nB − 2 degrees of freedom. For the two-sided alternative hypothesis, H0 is rejected for large values of the absolute value of T; for the one-sided hypothesis, H0 is rejected for large values of T itself. The threshold beyond which H0 is rejected is called the critical value, and is determined from the t-distribution in such a way that the significance level has the pre-assigned value. The one-sided threshold at significance level α is denoted tdf;α, so that the 'rejection region' for the one-sided t-test is given by {T > tdf;α}. The power of the one-sided test is larger than that of the two-sided test for µA > µB (corresponding to the one-sided alternative hypothesis) and smaller for µA < µB. This is, in the Neyman–Pearson formulation, the reason for using the one-sided test if the alternative hypothesis is one-sided (in which case µA < µB values are not taken into consideration).

The conventional value of the significance level in the social and behavioral sciences is 0.05. For large values of the combined sample size nA + nB, the critical value of the t-test approximates the critical value that can be computed from the standard normal distribution. This is because the fact that the variance is estimated and not known beforehand becomes less and less important as the combined sample size gets larger; if the population variance was known beforehand and substituted for S², the test statistic would have the standard normal distribution, and the test would be the so-called z-test. For this test, the critical value is 1.645 for the one-sided test and 1.960 for the two-sided test. The power depends on many quantities: the sample sizes, the population means and variances, and the significance level. As examples, the power of the one-sided z-test for α = 0.05 is equal to 0.50 for

  (µA − µB)/σ = 1.645 √(1/nA + 1/nB)

where σ is the within-group standard deviation. The power is equal to 0.95 if (µA − µB)/σ is equal to twice this value.

2.3 The Role of Assumptions

The probability statements that are required for statistical tests do not come for free, but are based on certain assumptions about the observations used for the test. In the two-sample t-test, the assumptions are that the observations of different individuals are outcomes of statistically independent, normally distributed, random variables, with the same expected value for all individuals within the same group, and the same variance for all individuals in both groups. Such assumptions are not automatically satisfied, and for some assumptions it may be doubted whether they are ever satisfied exactly. The null hypothesis H0 and alternative hypothesis H1 are statements which, strictly speaking, imply these assumptions, and which therefore are not each other's complement. There is a third possibility: the assumptions are invalid, and neither H0 nor H1 is true. The sensitivity of the probabilistic properties of a test to these assumptions is referred to as the lack of robustness of the test. The focus of robustness studies has been on the assumptions of the null hypothesis and the sensitivity of the probability of an error of the first kind to these assumptions, but studies on robustness for deviations from assumptions of the alternative hypothesis have also been done, cf. Wilcox (1998).

One general conclusion from robustness studies is that tests are extremely sensitive to the independence assumptions made. Fortunately, those assumptions are often under control of the researcher through the choice of the experimental or observational design. Traditional departures from independent observations are multivariate observations and within-subject repeated measures designs, and the statistical literature abounds with methods for such kinds of dependent observations. More recently, methods have been developed for clustered observations (e.g., individual respondents clustered within groups) under the names of multilevel analysis and hierarchical linear modeling.

Another general conclusion is that properties of tests derived under the assumption of normal distributions, such as the t-test, can be quite sensitive to outliers, i.e., single, or a few, observations that deviate strongly from the bulk of the observations. Since the occurrence of outliers has a very low probability under normal distributions, they are 'assumed away' by the normality assumption. The lack of robustness and sensitivity to outliers have led to three main developments.

First, there are non-parametric tests, which do not assume parametric families of distributions (such as the normal). Most of these are based on the ranks of the observations rather than their numerical values. They are standard fare in statistical textbooks. Second, robust tests have been developed, based on numerical values of the observations, which are less sensitive to outliers or heavy-tailed distributions, e.g., by some kind of automatic downweighting of outlying observations (e.g., Wilcox 1998). The purpose of such tests is to retain a high power as much as possible while decreasing the sensitivity to deviations from the normal distribution or to departures from other assumptions. Third, diagnostic means have been developed to spot single observations, or groups of observations, with an undue influence on the result of the statistical procedure (e.g., Cook and Weisberg 1999, Fox 1991). The idea behind most of these diagnostics is that most of the observations come from a distribution which corresponds well enough to the assumptions of the statistical model, but that the observations could be contaminated by a small number of poorly fitting observations. Ideally, these observations should also be recognizable by close inspection of the data or data collection process. After deleting such poorly fitting observations one can proceed with the more traditional statistical procedures, assuming that the remaining data do conform to the model assumptions.

3. Confidence Intervals

Confidence intervals are closely related to hypothesis tests. They focus on a parameter in the statistical model. Examples of such parameters are, in the two-sample situation described above, the difference of the two population means, µA − µB, or the within-group standard deviation, σ. To frame this in general terms, consider a one-dimensional statistical parameter denoted by θ. A null hypothesis test is about the question whether the data is compatible with the hypothesis that the parameter θ (in this case µA − µB) has the particular value 0. Instead, one could put forward the question: what are the values of θ with which the data is compatible? This question can be related to hypothesis testing by considering the auxiliary problem of testing the null hypothesis H0: θ = θ0 against the alternative hypothesis H1: θ ≠ θ0, for an arbitrary but fixed value θ0. The data may be said to be compatible with any value of θ0 for which this null hypothesis would not be rejected at the given level of significance. Thus, the confidence interval is the interval of non-rejected null hypotheses.

Another way of viewing the confidence interval is as follows. A confidence coefficient 1 − α is determined by the researcher (a conventional value is 0.95). The confidence interval is an interval with lower boundary L and upper boundary U, calculated from the data and therefore being random variables, with the property that the probability that L ≤ θ ≤ U, i.e., the
interval contains the true parameter value, is at least 1 − α. It can be proved mathematically that the interval of non-rejected null hypotheses has precisely this property.

The confidence interval can be inverted to yield a hypothesis test for any null hypothesis of the form H0: θ = θ0 in the following way. For the most usual null hypothesis that θ = 0, the hypothesis is rejected if the value 0 is not included in the confidence interval. More generally, the null hypothesis H0: θ = θ0 is rejected whenever θ0 is outside the confidence interval. This shows that a confidence interval is strictly more informative than a single hypothesis test.

4. Problems in the Interpretation and Use of Hypothesis Tests

While hypothesis tests have been applied routinely by hosts of social and behavioral scientists, there have been continuing debates, sometimes vehement and polemic, about their use, among those inclined to philosophical or foundational discussion. Many contentious issues can be found in Harlow et al. (1997), Nickerson (2000), and Wilkinson and TFSI (1999). Within the context of the present contribution I can only give a personally colored discussion of some main points which might contribute to a better understanding and a more judicious application of hypothesis tests.

The reason that there has been so much criticism of the nature and use of hypothesis tests is, in my view, the difficulty of reasoning with uncertain evidence and the natural human tendency to take recourse to behavioral rules instead of cogitating afresh, again and again, about how to proceed. The empirical researcher would like to conclude unequivocally whether a theory is true or false, whether an effect is present or absent. However, experimental and observational data on human behavior are bound to be so variable that the evidence produced by these data is uncertain. Our human tendency is to reason, if only provisionally, as if our conclusions are certain rather than tentative; to be concise rather than accurate in reporting our arguments and findings; and to use simple rules for steering how we make inferences from data to theories. We are tempted to reason as if 'p > α' implies that the investigated effect is nil, and the presumed theory is false, while 'p ≤ α' proves that the effect indeed exists and the theory is true.

However, such cognitive shortcuts amount to serious misuse of hypothesis testing. Much of the criticism of the use of hypothesis testing therefore is well-founded. In my view this implies that better interpretations of hypothesis testing should be promoted, and that tests should be used less mechanically and combined with other argumentations and other statistical procedures. In any case, the user of tests (or other statistical procedures) should realize that these procedures are based on the assumption that there is random variability in the data which cannot be completely filtered out of the results. Whether a test result is significant or not depends partly on chance, and each researcher obtaining a significant result should be aware that this could be an error of the first kind, just as a non-significant result could be an error of the second kind.

4.1 Some Misinterpretations

The following list tries to correct misinterpretations that have haunted applications of statistical tests. A more extensive discussion can be found, e.g., in Nickerson (2000).

(a) A common misinterpretation is that nonrejection implies support for the null hypothesis. Nonrejection should be interpreted, however, as an undecided outcome: there is not enough evidence against the null hypothesis, but this does not imply that there is evidence for the null hypothesis. Maybe the sample size is small, or error variability is large, so that the data does not contain much information anyway. Usually the alternative hypothesis contains probability distributions that approximate the null distribution; e.g., when testing µA − µB = 0 against µA − µB ≠ 0, the difference between the two population means can be tiny, while the alternative hypothesis is still true. In other words, the power of practically all tests is quite low in a sizeable part of the alternative hypothesis, so if one of the distributions in this part would prevail, chances would be low of rejecting the null hypothesis. Therefore, nonrejection provides support for those parts of the alternative practically as strongly as for the null hypothesis itself, and nonrejection may not be interpreted as support for the null hypothesis and against the alternative hypothesis.

(b) If one wishes to get more information about whether a nonsignificant result provides support for the null hypothesis, a power study is not the answer. Statistical power is the probability of rejecting the null hypothesis if a given effect size obtains. Its interpretation cannot be inverted as being the degree of support for the null hypothesis in the case of nonsignificance. Power studies are important in the stage of planning a study. Once the study has been conducted, and one wishes to see in more detail to what extent the null and alternative hypotheses are supported, a confidence interval is the appropriate procedure (also see Wilkinson and TFSI, 1999, p. 596).

(c) Test results tell us nothing about the probabilities of null or alternative hypotheses. The probabilities known as the significance level or the power of the test are probabilities of certain sets of outcomes given the condition that the null or, respectively, the alternative hypothesis is true.

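How strongly significance depends on sample size can be illustrated with the large-sample z approximation of Sect. 2.2. In this sketch (hypothetical numbers: σ = 1 and a standardized mean difference of 0.05, i.e., a twentieth of a standard deviation), the very same tiny effect is far from significant with 50 observations per group, yet overwhelmingly significant with 500,000 per group:

```python
from math import erf, sqrt

def norm_cdf(x):
    # Standard normal cumulative distribution function.
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def one_sided_p_value(m_a, m_b, sigma, n_a, n_b):
    # Large-sample (z-test) p-value for H0: muA = muB against H1: muA > muB,
    # treating sigma as known, as in the z-test of Sect. 2.2.
    z = (m_a - m_b) / (sigma * sqrt(1.0 / n_a + 1.0 / n_b))
    return 1.0 - norm_cdf(z)

# Same standardized difference of 0.05, two very different sample sizes:
print(one_sided_p_value(0.05, 0.0, 1.0, 50, 50))          # not significant
print(one_sided_p_value(0.05, 0.0, 1.0, 500000, 500000))  # highly significant
```

This is the arithmetic behind point (d) below: a very low p-value by itself says nothing about the size of the effect.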

(d) A significant result is not necessarily important. Very low p-values do not in themselves imply large effect sizes. A small estimated effect size still can yield a low p-value, e.g., if the corresponding standard error is very small, which can be caused by a large sample or by good experimental control. Investigation of effect sizes is important, and is a different thing than hypothesis testing. The importance of reporting effect sizes is stressed in Schmidt (1996) and Wilkinson and TFSI (1999).

(e) The alternative hypothesis is not the same as a scientific theory. If a researcher is investigating some theory empirically, this can be based on tests of which the alternative hypotheses are deduced from the theory, but always under the assumption that the study was well designed and usually under additional technical assumptions such as, e.g., normality of the distributions. Since the alternative hypothesis is only a consequence and not a sufficient condition for the theory, it is possible that the alternative hypothesis is true but the theory is not. To find support for theories, one has to find corroborations of many consequences of the theory under various circumstances, not just of one of its implications. On the other hand, it is possible that the theory is true but the alternative hypothesis deduced from it is not, e.g., because the experiment or observation was ill-designed or because the auxiliary technical assumptions are not satisfied.

(f) The null hypothesis is just what it says: a hypothesis. The test is not invalidated by the mere fact that there are other reasons for thinking, or knowing, that the null hypothesis is wrong. Even if there may be other reasons for which the null hypothesis is wrong, still it can be sensible to check whether the data at hand are, or are not, compatible with it.

4.2 Limitations of the Neyman–Pearson Approach to Hypothesis Testing

The Neyman–Pearson (1928) formulation changed the concept of hypothesis testing because it provided a rational basis for the choice of a test statistic. This was obtained at the cost of a quite narrow frame for the notion of a hypothesis test: a two-decision problem in which one error ('type I') is considered much more serious than the other error ('type II'). This model for decision-making is often inappropriate and, when it is useful, it is more commonly a useful approximation than a faithful representation of the actual decision situation. Some of its limitations are discussed briefly in this section.

(a) Even when a dichotomous decision must be taken as a result of a scientific investigation, it is possible that both types of error should be considered equally serious. This is the case, e.g., when two therapies are compared and the most promising one must be chosen for further study or for practical application. If the costs of both kinds of error (choose therapy B, while A is better; or choose therapy A, while B is better) are about the same, then there is no asymmetry between the two competing hypotheses, and a significance test is not in order.

(b) Often there are more than two outcomes of the decision situation. For example, in the two-sample situation discussed above, where the population means are µA and µB, and where the researcher entertains a theory implying that µA > µB, it is natural to define three decision outcomes defined as 'µA > µB (support for the theory),' 'undecided,' and 'µA < µB (evidence against the theory).' Usually, however, such a situation is represented by a Neyman–Pearson testing problem with null hypothesis H0: 'µA = µB' and alternative H1: 'µA > µB.' If the t-statistic for µA − µB yields a strongly negative outcome ('significance in the wrong direction') then the researcher only concludes that the null hypothesis is not rejected, whereas the data conveys the message that the alternative hypothesis should be rejected. The data then have to be considered as evidence against the theory embodied in H1. Such a three-decision formulation is closer to the actual purpose of hypothesis tests in many situations where a one-dimensional parameter is being tested, cf. Leventhal (1999).

(c) The Neyman–Pearson approach is formulated as if the data at hand is the only evidence available about the null and alternative hypotheses: it is an in vacuo formulation. This is almost never realistic. Often the propositions under study are theoretically plausible or implausible, there is earlier empirical evidence, there may be converging evidence (as in the case of triangulation studies) in the same investigation, and other evidence may become available in the future. In the light of this surrounding evidence, it is unrealistic, perhaps arrogant, to presume that the researcher has to take either of two decisions: 'reject H0' or 'do not reject H0.' It is more realistic to see the data at hand as part of the evidence that was and will be accumulated about these hypotheses. This is a major reason why presenting the result of the test as a p-value is more helpful to other researchers than presenting only the 'significant–nonsignificant' dichotomy (also see Wilkinson and TFSI 1999).

There are several ways in which the results obtained from the data at hand, perhaps summarized in the p-value, can be combined with other evidence. In the first place, an informal combination can be and often will be made by a sensible, nonformalized weighing of the various available pieces of evidence. One formal way of combining evidence is provided by the Bayesian paradigm (e.g., Gelman et al. 1995). This paradigm presupposes, however, that the prior and current ideas of the researcher about the plausibility of all possible values of the statistical parameter under study be represented in a probability distribution for this parameter, a very strong requirement indeed. Other formal ways of combining evidence have been developed and are known collectively by the name of
Meta-analysis methods (e.g., Cook et al. 1992, Schmidt 1996) require that the several to-be-combined hypothesis tests address the same substantive hypothesis, or that they can be regarded as tests of the same substantive parameter.

5. Conclusion

The statistical hypothesis test is one of the basic elements of the toolbox of the empirical researcher in the social and behavioral sciences. The difficulty of reasoning on the basis of uncertain data has led, however, to many misunderstandings and unfortunate applications of this tool. The Neyman–Pearson approach is a mathematically elegant formulation of what a hypothesis test could be. Its precise formulation rarely applies literally to the daily work of social and behavioral scientists; very often it is a useful approximation to a part of the research question, sometimes it is inappropriate, but sacrosanct it is not.

A hypothesis test evaluates the data using a test statistic set up to contrast the null hypothesis with the alternative hypothesis, and the p-value is the probability to obtain, if the null hypothesis is true, outcomes of the test statistic that are at least as high as the outcome actually observed. Low values therefore are evidence against the null hypothesis, contrasted with the alternative hypothesis. It is more conducive to the advance of science to report p-values than merely whether the hypothesis was rejected at the conventional 0.05 level of significance.

In the interpretation of test outcomes one should be aware that these are subject to random variability, and that the probability calculations are based on assumptions which may be in error. Incorrect assumptions can seriously invalidate the conclusions. Nonparametric and robust statistical procedures have been developed which depend less on these kinds of assumption, and diagnostics have been developed for checking some of the assumptions. When reporting empirical results, it is usually helpful to report not only tests but also estimated effect sizes and/or confidence intervals for important parameters. To sum up: 'It seems there is no escaping the use of judgment in the use and interpretation of statistical significance tests' (Nickerson 2000, p. 256).

See also: Hypothesis Testing in Statistics; Hypothesis Tests, Multiplicity of; Model Testing and Selection, Theory of; Significance, Tests of

Bibliography

Cook R D, Weisberg S 1999 Applied Regression Including Computing and Graphics. Wiley, New York
Cook T D, Cooper H, Cordray D S, Hartmann H, Hedges L V, Light R J, Louis T A, Mosteller F 1992 Meta-analysis for Explanation: A Casebook. Russell Sage Foundation, New York
Fisher R A 1925 Statistical Methods for Research Workers. Oliver and Boyd, Edinburgh, UK
Fox J 1991 Regression Diagnostics. Sage, Newbury Park, CA
Gelman A, Carlin J B, Stern H S, Rubin D B 1995 Bayesian Data Analysis. Chapman & Hall, London
Gigerenzer G, Swijtink Z, Porter T, Daston L, Beatty J, Krüger L 1989 The Empire of Chance. Cambridge University Press, New York
Hacking I 1965 Logic of Statistical Inference. Cambridge University Press, Cambridge, UK
Harlow L L, Mulaik S A, Steiger J H (eds.) 1997 What if there were no Significance Tests? Erlbaum, Mahwah, NJ
Lehmann E L 1993 The Fisher, Neyman–Pearson theories of testing hypotheses: one theory or two? Journal of the American Statistical Association 88: 1242–9
Leventhal L 1999 Updating the debate on one- versus two-tailed tests with the directional two-tailed test. Psychological Reports 84: 707–18
Neyman J, Pearson E S 1928 On the use and interpretation of certain test criteria for purposes of statistical inference. Biometrika 20A: 175–240, 263–94
Nickerson R S 2000 Null hypothesis significance testing: A review of an old and continuing controversy. Psychological Methods 5: 241–301
Schmidt F L 1996 Statistical significance testing and cumulative knowledge in psychology: Implications for training of researchers. Psychological Methods 1: 115–29
Wilcox R R 1998 The goals and strategies of robust methods (with discussion). British Journal of Mathematical and Statistical Psychology 51: 1–64
Wilkinson L, Task Force on Statistical Inference 1999 Statistical methods in psychology journals. American Psychologist 54: 594–604

T. A. B. Snijders

Hypothesis Tests, Multiplicity of

Standard procedures for controlling the probability of erroneous statistical inferences apply only for a single comparison or determination from a given set of data. Multiplicity is encountered whenever a data analyst draws more than one inference from a given set of data. With multiplicity, standard procedures must be adjusted for the number of comparisons or determinations that actually or potentially are performed. If some adjustment is not made, the probability of erroneous inferences can be expected to increase with the degree of multiplicity. In other words, the data analyst loses control of a critical feature of the analysis.

Questions arising from problems of multiplicity raise a diversity of issues, issues that are important, difficult, and not fully resolved. This account emphasizes […]

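The growth of the error probability with the number of comparisons, and the simplest standard adjustment for it, can be sketched numerically. The following minimal illustration (not from this article) assumes k independent tests of true null hypotheses, each at level alpha, so the familywise error rate is 1 - (1 - alpha)^k; the Bonferroni rule, one common adjustment among many, tests each hypothesis at level alpha/k instead.

```python
# Familywise error rate (FWER) for k independent tests of true null
# hypotheses, each performed at level alpha:
#   P(at least one type I error) = 1 - (1 - alpha)^k.

def familywise_error_rate(alpha: float, k: int) -> float:
    """Probability of at least one false rejection among k independent tests."""
    return 1.0 - (1.0 - alpha) ** k

alpha = 0.05
for k in (1, 5, 20):
    unadjusted = familywise_error_rate(alpha, k)
    # Bonferroni adjustment: test each of the k hypotheses at level alpha / k,
    # which keeps the familywise error rate at or below alpha.
    adjusted = familywise_error_rate(alpha / k, k)
    print(f"k={k:2d}  unadjusted FWER={unadjusted:.3f}  Bonferroni FWER={adjusted:.3f}")
```

With 20 unadjusted comparisons at the 0.05 level, the chance of at least one spurious 'significant' result is about 0.64; the Bonferroni adjustment holds it just below 0.05, at the cost of reduced power for each individual comparison.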

Copyright © 2001 Elsevier Science Ltd. All rights reserved. International Encyclopedia of the Social & Behavioral Sciences. ISBN: 0-08-043076-7