
Psychological Bulletin 1975, Vol. 82, No. 1, 1-20

Consequences of Prejudice Against the Null Hypothesis

Anthony G. Greenwald Ohio State University

The consequences of prejudice against accepting the null hypothesis were examined through (a) a mathematical model intended to simulate the research-publication process and (b) case studies of apparent erroneous rejections of the null hypothesis in published psychological research. The input parameters for the model characterize investigators' probabilities of selecting a problem for which the null hypothesis is true, of reporting, following up on, or abandoning research when data do or do not reject the null hypothesis, and they characterize editors' probabilities of publishing manuscripts concluding in favor of or against the null hypothesis. With estimates of the input parameters based on a questionnaire survey of a sample of social psychologists, the model output indicates a dysfunctional research-publication system. Particularly, the model indicates that there may be relatively few publications on problems for which the null hypothesis is (at least to a reasonable approximation) true, and of these, a high proportion will erroneously reject the null hypothesis. The case studies provide additional support for this conclusion. Accordingly, it is concluded that research traditions and customs of discrimination against accepting the null hypothesis may be very detrimental to research progress. Some procedures that can help eliminate this bias are prescribed.

In a standard college dictionary (Webster's New World, College Edition, 1960), null is defined as "invalid; amounting to nought; of no value, effect, or consequence; insignificant." In statistical hypothesis testing, the null hypothesis most often refers to the hypothesis of no difference between treatment effects or of no association between variables. Interestingly, in the behavioral sciences, researchers' null hypotheses frequently satisfy the nonstatistical definition of null, being "of no value," "insignificant," and presumably "invalid." My aims here are to document this state of affairs, to examine its consequences for the archival accumulation of scientific knowledge, and lastly, to make a positive case for the formulation of more potent and acceptable null hypotheses as a part of an overall research strategy.

Because of my familiarity with its literature, most of the illustrative material I use is drawn from social psychology. This should not be read as an implication that the problems being discussed are confined to social psychology. I suspect they are equally characteristic of other behavioral science fields that are lacking in well-established organizing theoretical systems.

Preparation of this report was facilitated by grants from National Science Foundation (GS-3050) and U.S. Public Health Service (MH-20527-02). Although they should not be held responsible for positions espoused herein, I am very grateful to the following for providing comments on earlier drafts: Mari R. Jones, Paul Isaac, David Bakan, Timothy C. Brock, Bibb Latané, Thomas M. Ostrom, Hanan C. Selvin, Martin Fishbein, Zick Rubin, and Richard A. Zeller.

Requests for reprints should be sent to Anthony G. Greenwald, Department of Psychology, Ohio State University, 404C West 17th Avenue, Columbus, Ohio 43210.

The

My paraphrasing of some widespread beliefs of behavioral scientists concerning the null hypothesis appears below. Some partial sources for the content of this listing are

Festinger (1953, pp. 142-143), Wilson and Miller (1964), Aronson and Carlsmith (1969, p. 21), and Mills (1969, pp. 442-448).

1. Given the characteristics of statistical analysis procedures, a null result is only a basis for uncertainty. Conclusions about relationships among variables should be based only on rejections of null hypotheses.

2. Little knowledge is achieved by finding out that two variables are unrelated. Science advances, rather, by discovering relationships between variables.

3. If statistically significant effects are obtained in an experiment, it is fairly certain that the experiment was done properly.

4. On the other hand, it is inadvisable to place confidence in results that support a null hypothesis because there are too many ways (including incompetence of the researcher), other than the null hypothesis being true, for obtaining a null result.

Given the existence of such beliefs among behavioral science researchers, it is not surprising that some observers have arrived at conclusions such as:

Many null hypotheses tested by classical procedures are scientifically preposterous, not worthy of a moment's credence even as approximations. (Edwards, 1965, pp. 401-402)

It [the null hypothesis] is usually formulated for the express purpose of being rejected. (Siegel, 1956, p. 7)

Refutations of Null Hypothesis "Cultural Truisms"

I am sure that many behavioral science researchers endorse the beliefs previously enumerated but would have difficulty in providing a rational defense for these beliefs should they be strongly attacked. That is, these attitudes toward the null hypothesis may have some of the characteristics of cultural truisms as described by McGuire (1964). Cultural truisms are beliefs that are so widely and unquestioningly held that their adherents (a) are unlikely ever to have heard them being attacked and may therefore (b) have difficulty defending them against an attack. If I am correct, the reader will have difficulty defending the preceding beliefs against the following attacks (the numbered paragraphs correspond to those in the preceding listing.) Briefly stated, these attacks are:

1. The notion that you cannot prove the null hypothesis is true in the same sense that it is also true that you cannot prove any exact (or point) hypothesis. However, there is no reason for believing that an estimate of some parameter that is near a zero point is less valid than an estimate that is significantly different from zero. Currently available Bayesian techniques (e.g., Phillips, 1973) allow methods of describing acceptability of null hypotheses.

2. The point is commonly made that theories predict relationships between variables; therefore, finding relationships between variables (i.e., non-null results) helps to confirm theories and thereby to advance science. This argument ignores the fact that scientific advance is often most powerfully achieved by rejecting theories (cf. Platt, 1964). A major strategy for doing this is to demonstrate that relationships predicted by a theory are not obtained, and this would often require acceptance of a null hypothesis.

3. I am aware of no reason for thinking that a statistically significant rejection of a null hypothesis is an appropriate basis for assuming that the conceptually intended variables were manipulated or measured validly. The significant result (barring Type I error) does indicate that some relationship or effect was observed, but that is all it indicates. The researcher who would claim that his data show a relationship between two variables should be as clearly obliged to show that those variables are the ones intended as should the researcher who would claim that his data show the absence of a relationship.

4. Perhaps the most damaging accusation against the null hypothesis is that incompetence is more likely to lead to erroneous nonsignificant, "negative," or null results than to erroneous significant or "positive" results. There is some substance to this accusation, when the incompetence has the effect of introducing noise or unsystematic error into data. Examples of this sort of incompetence are the use of unreliable paper-and-pencil measures, conducting research in a "noisy" setting (i.e., one with important extraneous variables uncontrolled), unreliable apparatus functioning, inaccurate placement of recording or stimulating electrodes, random errors in data recording or transcribing, and making too few observations. These types of incompetence are often found in the work of the novice researcher and are proper cause for caution in accepting null findings as adequate evidence for the absence of effects or relationships. Some other very common types of incompetence are much more likely to produce false positive or significant results. These types of incompetence result in the introduction of systematic errors into data collection. Examples of such sources of artifact (cf. Rosenthal & Rosnow, 1969) are experimenter bias, inappropriate demand characteristics, nonrandom sampling, invalid or contaminated manipulations or measures, systematic apparatus malfunction (e.g., errors in calibration), or systematic error (either accidental or intentional) in data recording or transcribing. This latter category of incompetence is by no means confined to novices and may be quite difficult to detect, particularly since our existing customs encourage greater suspicion of null findings than of significant findings.

Behavioral Symptoms of Anti-Null-Hypothesis Prejudice

We should not perhaps be very disturbed about the existence of the beliefs previously listed if those beliefs would prove to be unrelated to behavior. The following is a list of some possible behavioral symptoms of prejudice against null hypotheses: (a) designing research so that the personal prediction of the researcher is identified with rejection rather than acceptance of the null hypothesis; (b) submitting results for publication more often when the null hypothesis has been rejected than when it has not been rejected; (c) continuing research on a problem when results have been close to rejection of the null hypothesis ("near significant"), while abandoning the problem if rejection of the null hypothesis is not close; (d) elevating ancillary hypothesis tests or fortuitous findings to prominence in reports of studies for which the major dependent variables did not provide a clear rejection of the null hypothesis; (e) revising otherwise adequate operationalizations of variables when unable to obtain rejection of the null hypothesis and continuing to revise until the null hypothesis is (at last!) rejected or until the problem is abandoned without publication; (f) failing to report initial data collections (renamed as "pilot data" or "false starts") in a series of studies that eventually leads to a prediction-confirming rejection of the null hypothesis; (g) failing to detect data analysis errors when an analysis has rejected the null hypothesis by miscomputation, while vigilantly checking and rechecking computations if the null hypothesis has not been rejected; and (h) using stricter editorial standards for evaluating manuscripts that conclude in favor of, rather than against, the null hypothesis.

Perhaps the enumeration of the items on this list will arouse sufficient recognition of symptoms in readers to convince them that the illness of anti-null-hypothesis prejudice indeed exists. However, just as a hypochondriac should have better evidence that he is ill than that the symptoms he has just heard about seem familiar, so should we have better evidence than symptom recognition for making conclusions about the existence of prejudice against the null hypothesis.

A Survey to Estimate Bias Against the Null Hypothesis

In order to obtain some more concrete evidence regarding the manifestations of anti-null-hypothesis prejudice, I conducted a survey of reviewers and authors of articles submitted to the Journal of Personality and Social Psychology (JPSP). The sample included the primary (corresponding) authors and the reviewers for all manuscripts that I processed as an associate editor of JPSP during a 3-month period in 1973. The sample thus consisted of 48 authors and 47 reviewers to whom I sent a questionnaire. Returns were obtained from 36 authors (75%) and 39 reviewers (81%). The major items in the questionnaire assessed behavior in situations in which bias for or against the null hypothesis could occur. These situations were (a) initial formulation of a problem, (b) setting probabilities of Type I and Type II error, and (c) deciding what action to pursue once results were obtained. All questions were stated with reference to a test of the "focal hypothesis" for a new line of research. The focal hypothesis test was further defined as "the one hypothesis test that is of greatest importance" to the line of investigation. Responses were indicated on probability scales that could range from 0 to 1.00. The major results are given in Table 1.

TABLE 1
Results of Survey of JPSP Authors and Reviewers to Determine Prejudice Toward or Against the Null Hypothesis

Question | Mean responses for: Reviewers | Authors | All | SD (All)

1. What is the probability that your typical prediction will be for a rejection (rather than an acceptance) of a null hypothesis?

2. Indicate the level of alpha you typically regard as a satisfactory basis for rejecting the null hypothesis.

3. Indicate the level of beta you would regard as a satisfactory basis for accepting the null hypothesis.

4. After an initial full-scale test of the focal hypothesis that allows rejection of the null hypothesis, what is the probability that you will
(a) submit the results for publication before further data collection,
(b) conduct an exact replication before deciding whether to submit for publication,
(c) conduct a modified replication before deciding whether to submit,
(d) give up the problem.
Total

5. After an initial full-scale test of the focal hypothesis that does not allow rejection of the null hypothesis, what is the probability that you will
(a) submit the results for publication before further data collection,
(b) conduct an exact replication before deciding whether to submit for publication,
(c) conduct a modified replication before deciding whether to submit,
(d) give up the problem.
Total

Note. Table entries are means of respondents' estimates of probabilities, based on the number of responses given in parentheses.

With the exception of responses to two questions, the results for authors and reviewers were quite similar. This was not terribly surprising because there was substantial overlap between the populations from which these two subsamples were drawn. From Questions 4a and 4d it can be seen that authors reported they were more likely to report null hypothesis rejections and less likely to abandon the problem following a null hypothesis rejection than were reviewers. Given these rather limited differences, the following discussion of these data treats only the overall responses for the combined sample.

The questionnaire results gave several strong confirmations of existence of prejudice against the null hypothesis. In the stage of formulation of a problem, respondents indicated a strong preference for identifying their own predictions with an expected rejection, rather than an acceptance of the null hypothesis. The mean probability of the researcher's personal prediction being of the null hypothesis rejection (Question 1: X̄ = .81 ± .04) is substantially greater than .50.¹ This state of affairs is consistent with supposing that researchers set themselves the goal of confirming a theoretically predicted relation between variables more often than refuting one, despite good reason to believe that knowledge may advance more rapidly by the latter strategy (Platt, 1964).

¹ The errors of estimates given are equal to the limits of 95% confidence intervals, approximately plus or minus twice the standard deviation of the estimated mean.

In setting the probability of Type I error, respondents indicated relatively close adherence to the .05 alpha criterion (Question 2: X̄ = .046 ± .004). Responses to Question 3 indicated a substantial lack of standard practice with regard to Type II errors (i.e., accepting the null hypothesis when in truth it should be rejected). About 50% of the respondents failed to answer the question requesting specification of a preferred Type II error (beta) criterion. Those who did indicate a Type II error criterion indicated much more tolerance for this type of error than for a Type I error, the resulting estimate of beta being approximately .30 (Question 3: X̄ = .27 ± .09). This estimate, it should be noted, is in line with Cohen's (1962) conclusion that studies published in the Journal of Abnormal and Social Psychology were relatively low on power (probability of rejecting the null hypothesis when the alternative is true; power = 1.00 - beta). In regard to tolerance for Type I and Type II errors then, the questionnaire respondents appeared biased toward null hypothesis acceptance in the sense that they reported more willingness to err by accepting, rather than rejecting, the null hypothesis. Such a conclusion would, I think, be quite misleading. Rather, responses to other questions not summarized in Table 1 and the frequency of nonresponse to Question 3 indicated that most respondents did not take seriously the idea of setting a Type II error criterion in advance. For example, the responses to questions asking for probability of setting alpha and beta criterions in advance of data collection indicated a .63 (±.09) probability that alpha would be set in advance of data collection, compared with only a .17 (±.06) probability that beta would be set in advance. Rather than indicating a prejudice toward acceptance of the null hypothesis then, I think the responses to the questions on alpha and beta indicate that acceptance of the null hypothesis is not usually treated as a viable research outcome.

In terms of what is done after completion of a full-scale data collection to test a focal hypothesis, a major bias is indicated in the .49 (±.06) probability of submitting a rejection of the null hypothesis for publication (Question 4a) compared to the low probability of .06 (±.03) for submitting a nonrejection of the null hypothesis for publication (Question 5a). A secondary bias is apparent in the probability of continuing with a problem and is computed conditionally upon the decision to write a report having not been made following data collection. This derived index has a value of .86 (±.05) when the initial result is a rejection of the null hypothesis, compared to .70 (±.05) when the initial result is a nonrejection of the null hypothesis, indicating greater likelihood of proceeding in the former case.²

² In the case of a result rejecting the null hypothesis, this index is computed as [(4b + 4c) ÷ (4b + 4c + 4d)], the numbers referring to the responses to the questions given in Table 1.

In sum, the questionnaire responses of a sample of contributors to the social psychological literature gave self-report evidence of substantial bias against the null hypothesis in formulating a research problem and in deciding what to do with the data once collected. In the following section, the impact of these biases on the content of the archival literature is considered.

The alpha criterion most commonly employed in the behavioral sciences is .05. Without giving the matter much thought, one may guess on this basis that approximately 1 in 20 publications may be an erroneous rejection of a true null hypothesis. However, some thought on the matter soon brings the discovery that the probability of a published article being a Type I error depends on much more than (a) the researcher's alpha criterion. The other determinants include (b) the probability of accepting the null hypothesis when it is false (Type II error or beta), (c) the a priori probability of an investigator selecting a problem for which the null hypothesis is true or false, (d) the probability of rejections versus nonrejections of the null hypothesis being submitted for publication, (e) the probability of the researcher's giving up in despair after achieving a rejection versus a nonrejection of the null hypothesis, and (f) the probability of an editor's accepting an article that reports a rejection versus a nonrejection of the null hypothesis. All of these probabilities represent opportunities for the occurrence of strategies that discriminate against the null hypothesis. The model I develop functions to derive consequences for the content of published literature from assumptions made about these strategies.

Model Description

In the model employed for the research-publication system (see Figure 1), a critical notion is that of a focal hypothesis test. It is assumed that in any line of investigation, there is one statistical test that is of major interest. This may be a test for a main or interaction effect in an analysis of variance, a test of the difference between two groups or treatments, a test of correlation between two variables, and the like. This statistical test is assumed to be made in terms of a rejection or acceptance (nonrejection, if you prefer) of a null hypothesis of no main effect, no interaction, and so forth. In conducting this focal hypothesis test, the researcher is assumed to have formulated an extent of deviation from the null hypothesis (an alternative hypothesis, H1) that he would like to be able to detect with probability (power) 1 - β. In practice, this formulation of H1 may often be an implicit consequence of setting a critical region for rejection of the null hypothesis with a given risk, α, of Type I error. For example, assuming β = α, the start of the critical region is effectively a midpoint between the null hypothesis and H1.

In the model, the fate of a research problem is traced in terms of the probabilities of alternative outcomes at four types of choice points: (a) the researcher's formulation of a hypothesis, (b) his collection of data, (c) his evaluation of obtained results, and (d) an editor's judgment of a manuscript reporting the research results. At each of these points in the research-publication process, behavioral bias relating to the null hypothesis may enter. The model incorporates parameters that serve to quantify these biases, and these are listed here in their sequential order of occurrence in the research-publication process.

The probability that the null hypothesis is true for the focal hypothesis. Specification of this parameter requires a clear definition of the null hypothesis. If by the null hypothesis one refers to the hypothesis of exactly no difference or exactly no correlation, and so forth, then the initial probability of the null hypothesis being true must be regarded effectively as zero, as would be the probability of any other point hypothesis. In most cases, however, the investigator should not be concerned about the hypothesis that the true value of a statistic equals exactly zero, but rather about the hypothesis that the effect or relationship to be tested is so small as not to be usefully distinguished from zero. For the purposes of the model then, the probability of the null hypothesis being true becomes identified with the probability that the true state of affairs underlying the focal hypothesis is within a null range (cf. Hays, 1973, pp. 850-853). In the model, the probability that the investigator's focal hypothesis is one for which truth is within such a null range is represented as h0. The probability that truth is outside this range is h1 = 1.00 - h0. One would have to be omniscient to be assured of selecting accurate values for the h0 and h1 parameters. It seems, however, that the values of these parameters should be clearly weighted in the direction of starting with a false null hypothesis (i.e., h1 > h0). Some reasons for this are that (a) researchers identify their personal predictions predominantly with the falsity of the null hypothesis (see Table 1, Question 1), and there may often be good reason for them to make these predictions; and (b) as argued by McGuire (1973), there is usually at least a narrow sense in which

most researchers' predictions are correct. For no outstandingly good reason, the values of .20 and .80 were selected for h0 and h1, respectively. To compensate for the difficulty of justifying this initial assumption, system results are given below for other values of these parameters.³

[Figure 1 panel labels: FORMULATE PROBLEM; COLLECT DATA; WRITE REPORT?; EDITOR'S DECISION]

FIGURE 1. Model of research-publication system. (Five types of sequential decision points in the research-publication process are represented by rows of circles that can be thought of as spinners in a board game, each spinner selecting one of two departures from the decision point. The spinners shown on four of the circles depict a published Type I error resulting from a researcher's first data collection on a problem.)

Outcome of data collection. As used here data collection refers to the researcher's activities subsequent to problem formulation, up to and including the statistical analysis of results. It is assumed that any such data collection can be characterized by probabilities of Type I and Type II errors that are either explicitly chosen by the investigator or else follow implicitly (cf. Cohen, 1962) from his choices of sample size, dependent measures, statistical tests, and the like. (Because investigators may often examine data midway in a planned piece of research and thereupon terminate or otherwise alter plans, the notion of a data collection is somewhat vague in practice and must necessarily be so in the model). The outcome of a data collection will either be a rejection or a nonrejection of the null hypothesis. The probability of rejection if the null hypothesis is true is characterized in the model as r0 and is approximately equivalent to the researcher's alpha criterion. Based

³ The equations for computing system output indexes have been prepared as a computer program in the BASIC language. This program generates system output indexes in response to values of the system input parameters entered at a terminal by the user. The program therefore permits ready examination of consequences of assumptions other than those made presently about values of the system's input parameters. A listing of this program may be obtained from the author.
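The BASIC listing mentioned in the footnote is not reproduced in the article. As a rough stand-in, the following is a minimal Python sketch of the computation as the Model Description states it. It is not the author's program. The names `ModelParams`, `trace_branch`, and `system_output` are hypothetical. The default values are the article's estimates for h0, r0, r1, w0, w1, e0, e1, c0, and c1. The sketch assumes that the continuation probabilities apply to any unpublished result (whether unwritten or declined by an editor), and that system alpha and system beta are proportions among publications on problems for which the null hypothesis is true and false, respectively.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelParams:
    """Input parameters; defaults are the article's questionnaire-based estimates."""
    h0: float = 0.20  # P(null hypothesis is true for the focal hypothesis)
    r0: float = 0.05  # P(reject | null true), roughly alpha
    r1: float = 0.70  # P(reject | null false), roughly 1 - beta
    w0: float = 0.06  # P(write report | nonrejection)
    w1: float = 0.49  # P(write report | rejection)
    e0: float = 0.25  # P(editor accepts | report of nonrejection)
    e1: float = 0.25  # P(editor accepts | report of rejection)
    c0: float = 0.70  # P(continue | unpublished nonrejection)
    c1: float = 0.86  # P(continue | unpublished rejection)
    max_collections: int = 3  # the "three-strikes-and-out" rule


def trace_branch(p: ModelParams, null_true: bool):
    """Follow one truth state through up to max_collections data collections.

    Returns (P(published rejection), P(published nonrejection),
    P(researcher would still continue after the last allowed collection)).
    """
    r = p.r0 if null_true else p.r1
    publish_rejection = r * p.w1 * p.e1
    publish_nonrejection = (1 - r) * p.w0 * p.e0
    # Assumption: an unpublished result (not written up, or declined by
    # the editor) leads to a further data collection with probability c1 or c0.
    go_on = (r * (1 - p.w1 * p.e1) * p.c1
             + (1 - r) * (1 - p.w0 * p.e0) * p.c0)
    reach = 1.0  # probability of arriving at the current data collection
    pub_rej = pub_non = 0.0
    for _ in range(p.max_collections):
        pub_rej += reach * publish_rejection
        pub_non += reach * publish_nonrejection
        reach *= go_on
    return pub_rej, pub_non, reach


def system_output(p: ModelParams) -> dict:
    """Compute a few output indices from the input parameters."""
    rej0, non0, open0 = trace_branch(p, null_true=True)
    rej1, non1, open1 = trace_branch(p, null_true=False)
    h1 = 1 - p.h0
    published = p.h0 * (rej0 + non0) + h1 * (rej1 + non1)
    return {
        # Share of publications on true-null problems that reject the null.
        "system_alpha": rej0 / (rej0 + non0),
        # Share of publications on false-null problems that accept the null.
        "system_beta": non1 / (rej1 + non1),
        "type1_share_of_journal": p.h0 * rej0 / published,
        "type2_share_of_journal": h1 * non1 / published,
        # Lines published or voluntarily abandoned within the limit.
        "resolved_within_limit": 1 - (p.h0 * open0 + h1 * open1),
    }


if __name__ == "__main__":
    for name, value in system_output(ModelParams()).items():
        print(f"{name}: {value:.3f}")
```

Under these assumptions the default parameter values give a system alpha of about .30, a system beta of about .05, and about 62% of lines of investigation resolved within three data collections, which agrees with the values the article reports. Changing `h0` or any other field of `ModelParams` allows the kind of sensitivity check the footnote describes.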

on the questionnaire responses, r0 is estimated at .05. If the null hypothesis is false, then the probability of its rejection is characterized as r1 and this should be approximately 1.00 minus the researcher's beta criterion. This value is estimated at .70 based on the questionnaire responses. Probabilities of nonrejection of the null hypothesis are 1.00 minus r0 (which equals .95) or 1.00 minus r1 (which equals .30), respectively.⁴

⁴ The .05 level is probably a conservatively low estimate of alpha employed by the researchers to whom questionnaires were sent. In response to a question that asked for an estimate of a level of alpha which "although not satisfactory for rejecting the null hypothesis, would lead you to consider that the null hypothesis is sufficiently likely to be false so as to warrant additional data collection before drawing a conclusion," the mean response was .11 (±.02). This suggests that researchers may be willing to treat "marginally significant" results more like null hypothesis rejections than like nonrejections. Further, the .05 estimate of r0 is based on the classical hypothesis-testing assumption of an exact null hypothesis, rather than a range null hypothesis, as is employed in the model. The adoption of the range hypothesis framework has the effect of increasing alpha over its nominal level, the extent of the increase being dependent on the width of the null range in relation to the power (1 - β) of the research. Since full development of this point is beyond the scope of the present exposition, it shall simply be noted that the presently employed estimates of r0 and r1 are at best approximate. The estimates actually employed, as derived from the questionnaire responses, are conservative in the sense that they probably err by leading to an overly favorable estimate of system output.

Probability of writing a report. The model assumes that upon completing a data collection, the researcher examines his results and decides whether or not to write a report. The probability of deciding to write if the null hypothesis has not been rejected is represented as w0 and is estimated at .06, based on the questionnaire results (see Table 1, Question 5a). When the null hypothesis has been rejected the probability of deciding to write is represented as w1 and is estimated at .49, based on the questionnaire responses (Question 4a).

Probability of editorial acceptance. In order for the result of a data collection to appear in print, it has to be accepted for publication by an editor. In the model, an editor accepts an article with probability e0 if it reports a nonrejection of the null hypothesis and e1 if it reports a rejection of the null hypothesis. The questionnaire data did not permit any estimates of these parameters and they have been estimated, somewhat arbitrarily, as both being equal to .25. Thus, although the model permits analysis of the consequences of editorial discrimination for or against the null hypothesis, no initial assumption has been made regarding the existence of such bias.

If at first you don't succeed. The researcher may be left holding a bagful of data if (a) he has decided not to report the results or (b) he has decided to report them but has been unable to obtain the cooperation of an editor. At this point, the model allows the researcher to decide whether to continue research or to abandon the problem. If the result of the preceding data collection was a nonrejection of the null hypothesis, the probability of continuing is represented as c0; if the result was a rejection of the null hypothesis, the probability of continuing is represented as c1. Estimates of these parameters have been derived from the questionnaire responses by computing the probability of continuations b and c in response to Questions 4 and 5, conditional on a decision to write not having been made. The resulting estimates are .70 for c0 and .86 for c1.

The model assumes that the researcher continues research by returning to the data collection stage, at which point the fate of his research is subject to the r, w, e, and c parameters as before. In carrying out computations based on the model, a three-strikes-and-out rule was assumed. That is, if the researcher has not achieved publication after three data collections, it is assumed that he will abandon the problem. With parameter values estimated for the present system, 62% of lines of investigation are published or abandoned after three attempts in any case. The limitation to three data collections is of little practical importance since the major output indices of the model (see below) change little with additional iterations.

The Figure 1 representation of the model portrays the researcher's choice points as spinners in a game of chance, the parameter values then being represented by the areas in which each spinner may stop. This illustration is intended to make it clear that the model parameters are conditional probabilities, each indicating the probability of a specific departure from a choice point once that choice point has been reached, rather than being an attempted judgment of the research process.

Limitations of the Model

No pretense is made for this model providing anything more than a potentially useful approximation to the research-publication system. Limitations in the accuracy with which some central model parameters can be estimated have already been mentioned. Perhaps the most glaring weakness in the model is its assumption that the probability of editorial acceptance of a report is independent of the sequence of events that precede submission to a journal. The model considers all manuscripts that reject the null hypothesis to be equivalent before the editorial process regardless of the number of data collections in which the null hypothesis was rejected. Similarly, all manuscripts that report acceptance of the null hypothesis are regarded as equivalent. Perhaps even more importantly, the model assumes the editorial process to be insensitive to the actual truth-falsity of the null hypothesis. The performance of the system would be at least a little better, on the various output criteria to be reported, if the model assumed some success of the editorial process in weeding out Type I and Type II errors rather than these having a likelihood of acceptance equal to true rejections and acceptances of the null hypothesis, respectively. These modifications have not been made partly because they would add complexity and also because the elaboration of additional parameters for the editorial process would not seriously affect the relations between the model's input parameters and its output indices. (They would affect absolute values of the output indices.)

Note a general caution: The model parameter estimates based on questionnaire responses are certainly more appropriate to some areas of behavioral science research than to others. Particularly, they are appropriate to areas of research in which null hypothesis decision procedures (Rozeboom, 1960) are dominant. Further, given the use of null hypothesis decision procedures, assumptions made about the present state of the system are most appropriate for those areas of research in which measurement error is substantial in relation to the magnitude of theoretically or practically meaningful effects. These are areas in which investigators are prone to work with relatively high risks of Type I error and to proceed otherwise in ways that tend to discriminate against acceptance of the null hypothesis. Within psychology, for example, much research in psychophysics, neuropsychology, and operant behavior would not properly be considered in terms of the present model. On the other hand, much research in social, developmental, experimental, clinical, industrial, and counseling psychology would, I expect, be reasonably well simulated by the model.

Model Output Indices

In order to illustrate how the model's output indices respond to change in model parameters, Figure 2 presents seven output measures as a function of the model parameter h0 (probability that the null hypothesis is true for the focal hypothesis test). These results have been obtained with model parameters other than h0 held constant at their previously described values (estimated from questionnaire responses).

If Type I and Type II errors are examined as a percentage of total journal content, it may be seen that these represent a gratifyingly small proportion of total published content (upper portion of Figure 2), given the estimated present-system value of h0 = .20. It then becomes a bit disturbing to note that the Type I error rate of the system (system alpha) is rather high, .30. (System alpha is computed as the proportion of all publications on the right side of Figure 1 that are Type I errors.) System beta (the proportion of all publications on the left side of Figure 1 that are Type II errors) is quite low, .05.

It is somewhat coincidental, but nonetheless remarkable, that the system output levels of

alpha (.30) and beta (.05) are exactly the reverse of the alpha (r0 = .05) and beta (1.00 - r1 = .30) levels used as estimates of model input parameters. The explanation for the discrepancy between the system alpha and system beta indices, on the one hand, and Type I and Type II errors considered as a percentage of all publications, on the other, can be found in an index giving the percentage of all publications in which the null hypothesis is reported as rejected for the focal hypothesis test (upper portion of Figure 2). This index has the quite high value of 91.5% when h0 = .20. It is apparent then that the high value of system alpha, despite the low proportion of publications that are Type I errors, is a consequence of the fact that system output includes very few publications of true acceptances of the null hypothesis.

FIGURE 2. Seven output indices for the research-publication system model. (To illustrate responsiveness of output indices to an input parameter, the seven indices are plotted as a function of system parameter h0 [which equals the probability that the researcher formulates a problem for which the null hypothesis is, in fact, true]. Horizontal axis: probability that null hypothesis is true.)

An information transmission index. Because it is difficult to interpret the Type I and Type II percentage error indices or the system alpha and beta indices directly as measures of the quality of functioning of the research-publication system, it is desirable to have an index that better summarizes the system's accuracy in communicating information about the truth and falsity of researchers' null hypotheses to journal readers. An information transmission index, computed as shown in Table 2, can partially serve this purpose.

TABLE 2

                      Published result
Truth of H0      Not reject H0    Reject H0    Sum
H0 true          p00              p01          p0.
H0 false         p10              p11          p1.
Sum              p.0              p.1          1.00

Note: Table entries are proportions of only those lines of investigation that have reached the stage of journal publication. The index is computed as (cf. Attneave, 1969, pp. 46 ff):

\[ \left(-\sum_{i=0}^{1} p_{i\cdot}\log_2 p_{i\cdot}\right) + \left(-\sum_{j=0}^{1} p_{\cdot j}\log_2 p_{\cdot j}\right) - \left(-\sum_{i=0}^{1}\sum_{j=0}^{1} p_{ij}\log_2 p_{ij}\right) \]

To interpret the information transmission index, assume that a journal reader is presented with a list of the focal hypotheses tested in an upcoming journal issue. Maximally, reading the journal might reduce the reader's uncertainty about the truth-falsity of the several focal hypotheses by an average of 1.00 bit. The information transmission index will approach this maximum value to the extent that (a) there is a fifty-fifty likelihood that the null hypothesis is true or false for the published articles (i.e., the reader's uncertainty is maximal), (b) there is a fifty-fifty true-false reporting ratio for the null hypothesis in the published articles (i.e., the journal's content is maximally uncertain), and (c) the published conclusions are perfectly accurate (or perfectly inaccurate!) regarding the truth-falsity of the null hypothesis. It is important to note that this index bears little direct relation to the percentage of articles reporting a correct result. To appreciate this, consider that a journal may print nothing but correct rejections of the null hypothesis. By definition then, all of its content would be correct. However, the reader who had an advance list of the focal hypotheses of the to-be-published articles would gain no information regarding the truth-falsity of any focal null hypothesis from actually reading the journal, since he could know, by extrapolation from past experience, that the null hypothesis would invariably be rejected.

In Figure 2 (lower part), it is apparent that the information transmission index has a very low value (about .10 bits) given the present-system assumption that h0 = .20. The fact that the information transmission index increases dramatically as h0 increases reflects primarily some virtue in compensating, at the problem formulation stage, for biases against the null hypothesis residing elsewhere in the system.

Comment on the information transmission index. A few of my colleagues have objected to the information transmission index as a summary of system functioning because it takes no account of their primary criterion for evaluating published research: the importance of the problem with which the research is concerned. These colleagues pointed out that archives full of confirmations and rejections of trivial null hypotheses would get high marks on the transmission index but would make for poor science. I am in full sympathy with this view and would not like readers to construe my preference for the information transmission index as a call for journals to catalog trivial results. Thus, it should be emphasized that the information transmission index is insensitive to several possible system virtues. Particularly, (a) it takes no account of the value to readers of the conceptual content of journal articles; (b) it ignores the information contained in tests of nonfocal hypotheses; and (c) by conceptualizing the test of the focal hypothesis as having just an accept-reject outcome, it ignores possible information in the direction or magnitude of effect shown by the focal hypothesis test. Further, the assumption implicit in the index, that readers can be aware in advance of articles' focal hypotheses, is obviously out of touch with reality. Despite these limitations, it is difficult to formulate an index that better summarizes functioning of the research-publication system.

There is an alternative form of the information transmission index that may seem preferable to the one shown in Table 2. This alternate index is based on all lines of investigation (not just those that reach the stage of publication) and classifies the outcomes of these lines as published rejection of the null hypothesis, published nonrejection, and also nonpublication. This index has the virtues of (a) summarizing activity in the whole system (rather than just the published portion) and (b) allowing nonpublication to provide information about the truth-falsity of the null hypothesis. Computations have been made for this index, the results indicating system functioning at about as poor a level as does the index described in Table 2. The alternate index has not been presented in Figure 2 chiefly because its implicit assumption, that system output watchers can keep track of lines of research that do not achieve publication, seems too unreasonable.

A final index shown in Figure 2, the probability of achieving a publication given embarkation on a research problem, is one that ought to be of practical concern to researchers. This index, plotted as a function of the h0 parameter, indicates interestingly that the system "rewards" researchers with publications to the extent that they formulate a problem for which the null hypothesis is false.

A Check on the Model's Accuracy

One means of obtaining a rough check on the model's validity is to compare its predicted proportion of articles for which a focal null hypothesis is accepted against the actual content of the literature. With the assistance of John A. Miller and Karl E. Rosenberg, such a check was made for the Journal of Personality and Social Psychology for the year 1972. Every article published that year was read to determine, first, what the focal null hypothesis was and, second, whether the article concluded in favor of acceptance or rejection of that null hypothesis. Out of 199 articles for which a focal null hypothesis was identified, 24 reported acceptance (or nonrejection) of that hypothesis. A 95% confidence interval for the proportion of articles reporting null hypothesis acceptance (12.1% ± 4.5%) included the model's estimated value for the present system of 8.5%, providing some evidence supporting the model's validity. A similar check of four psychological journals in the mid 1950s by Sterling (1959) yielded a lower estimate of 8 out of 294 (which equals 2.7% ± 1.9%) articles that reported nonrejection of a focal hypothesis test. However, it is possible that Sterling may have used a more lenient criterion for declaring that an article rejected the null hypothesis for a focal hypothesis test (cf. Sterling, 1959, pp. 31-32).5

5 It gives me pause, in reading over this paragraph, to consider whether or not I would have reported the results of the content check of JPSP if it had not been confirming of the model.

Toward a More Satisfactory System

The foregoing results strongly suggest that the research-publication system is functioning well below its potential in research areas characterized by prejudice against the null hypothesis. With the system model it is easy to demonstrate the improvement in system functioning that is potentially possible if biases against the null hypothesis are eliminated. Figure 3 shows the consequences of step-by-step restoration of equal status to the null hypothesis, as reflected in values of the information transmission index. It is quite apparent from Figure 3 that unbiased behavior at the various stages of the research-publication process can have highly desirable effects on the informativeness of published research. The methods of achieving such unbiasedness are considered in more detail below.

FIGURE 3. Effects on information transmission index of step-by-step alterations in research-publication system parameters to reduce bias against the null hypothesis. (Present-system parameters estimated from survey results are given in the text. Hypothetical changed parameter values are indicated in parentheses in the legend, and characterize also all points to the right of the one in which the change is first indicated.) Legend steps: (1) parameters estimated from survey; (2) + continue all problems (c0 = c1 = 1.0); (3) + write up all results (w0 = w1 = 1.0); (4) + make H0 likelihood = .50 (h0 = h1 = .5); (5) + make beta = alpha = .05 (r0 = 1 - r1 = .05); (6) + reduce alpha and beta to .01 (r0 = 1 - r1 = .01).

System Effect on Generality of Research Findings

The information transmission and other system output indexes are insensitive to what may be the worst consequence of prejudice against the null hypothesis: the archival accumulation of valid results with extremely limited generality.

Consider the situation of the researcher who starts off with the hypothesis that an increase in variable x produces an increase in variable y. Since he is very convinced of the virtues of the theory that led to this prediction, he is willing to proceed through a number of false starts and pilot tests during which he uses a few different experimenters to collect data, a few different methods of manipulating variable x, a few different measures of variable y, and a few different settings to mask the true purpose of the experiment. At last, he obtains the result that confirms the expected impact of x on y but is properly concerned that the result may have been an unreplicable Type I error. To relieve this concern, he conducts an exact replication using the same combination of experimenter, operationalization of x, measure of y, and experimental setting that previously "worked," and is gratified to discover that his finding is replicated. Concerned about the validity of his procedures and measures, he also obtains evidence indicating that the manipulation of x was perceived as intended and that the measure of y had adequate reliability and validity. He then publishes his research concluding that increases in x cause increases in y and, therefore, that his theory, which predicted this relationship, is supported.
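As a rough illustration of two computations used in the preceding sections, the following Python sketch evaluates the Table 2 information transmission index and the 95% confidence intervals quoted for the content checks. It is not part of the original analysis. The function names are hypothetical, the 2 x 2 table is back-solved from the rounded values reported in the text (system alpha = .30, system beta = .05, 91.5% of publications rejecting the null hypothesis) rather than taken from the model itself, and the interval uses the ordinary normal approximation, which is an assumption about how the quoted intervals were obtained.

```python
import math


def entropy_bits(probs):
    """Shannon entropy in bits; cells with zero proportion contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def information_transmitted(table):
    """Table 2 index: H(row sums) + H(column sums) - H(cells).

    table[i][j]: i = 0 if H0 is true, 1 if false;
                 j = 0 if the published result does not reject H0, 1 if it does.
    """
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    cells = [p for r in table for p in r]
    return entropy_bits(rows) + entropy_bits(cols) - entropy_bits(cells)


def present_system_table(system_alpha=0.30, system_beta=0.05, reject_share=0.915):
    """ASSUMPTION: cell proportions back-solved from rounded figures in the text."""
    # q = share of publications for which H0 is true, from
    # system_alpha * q + (1 - system_beta) * (1 - q) = reject_share
    q = ((1 - system_beta) - reject_share) / ((1 - system_beta) - system_alpha)
    return [
        [(1 - system_alpha) * q, system_alpha * q],          # H0 true
        [system_beta * (1 - q), (1 - system_beta) * (1 - q)],  # H0 false
    ]


def proportion_ci(k, n, z=1.96):
    """Normal-approximation interval: returns (proportion, half-width)."""
    p = k / n
    return p, z * math.sqrt(p * (1 - p) / n)


if __name__ == "__main__":
    bits = information_transmitted(present_system_table())
    print(f"information transmitted: {bits:.3f} bits")
    for k, n in [(24, 199), (8, 294)]:
        p, half_width = proportion_ci(k, n)
        print(f"{k}/{n}: {100 * p:.1f}% +/- {100 * half_width:.1f}%")
```

Under these assumptions the index comes out close to the "about .10 bits" reported for the present system, and a table containing only rejections of the null hypothesis yields zero bits, which is the point of the example of a journal that prints nothing but correct rejections.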

The potential fault in this conclusion should be obvious from the way I have presented the problem, but it is not likely to be obvious to the researcher conducting the investigation. Because of his investment in confirming his theory with a rejection of the null hypothesis, he has overlooked the possibility that the observed x-y relationship may be dependent on a specific manipulation, measure, experimenter, setting, or some combination of them. In his eagerness to proclaim a general x-y relationship, he has been willing to attribute his previous false starts to his (or, better, his research assistants') incompetence and, on this basis, does not feel it either necessary or desirable to inform the scientific community about them.

This style of researcher's approach has been well described by McGuire (1973):

The more persistent of us typically manage at last to get control of the experimental situation so that we can reliably demonstrate the hypothesized relationship. But note that what the experiment tests is not whether the hypothesis is true but rather whether the experimenter is a sufficiently ingenious stage manager. (p. 449)

For further discussion of situations in which findings of limited generality appear to be much more general, I refer the reader to Campbell's (1969, pp. 358-363) typology of threats to valid inference.

If the results generated by the model are to be believed, then the existing archival literature in the behavioral sciences should contain some blatant Type I errors. Although the absolute frequency of Type I error publications is not expected to be high, there should be some true null hypotheses for which only rejections of the null hypothesis have been published. About the only way to demonstrate the existence of Type I errors conclusively is to demonstrate that "established" findings cannot be replicated and that such failures to replicate cannot easily be regarded as Type II errors. As mentioned before, the fact that two of the three following cases are drawn from social psychology reflects only my familiarity with this field, not any belief that social psychology is more prone to such errors than are other areas of behavioral science research.

Attitude and Selective Learning

Between 1939 and 1958, approximately 10 studies (referenced in Greenwald & Sakumura, 1967) reported the consistent finding that subjects, when exposed to information on a controversial topic, more easily learned information that was agreeable rather than disagreeable to their existing attitude on the issue. This selective learning effect was regarded as sufficiently established to appear in many introductory psychology and social psychology textbooks, the study of Levine and Murphy (1943) particularly being regarded as somewhat of a classic.

Starting in 1963, however, almost all published studies that included a test of this hypothesis failed to confirm it (Brigham & Cook, 1969; Fitzgerald & Ausubel, 1963; Greenwald & Sakumura, 1967; Waly & Cook, 1966). In one study (Malpass, 1969) the hypothesis appeared to be confirmed in only one of three conditions in which it was tested. In general, the studies reported since 1963 have been quite carefully done, each publication typically reporting the results of more than one replication of the hypothesis test and with careful attempts to control extraneous variables that might contaminate the tests. Therefore it does not seem reasonable to suggest that these recent findings should be regarded uniformly as Type II errors. Because the recent investigations have also made strenuous attempts, generally unsuccessful, to explain the earlier findings in terms of interactions with previously uncontrolled factors, the possibility that most of the earlier results were Type I errors is currently very plausible. This apparent epidemic of Type I error can be readily understood in terms of the hypothesized present research-publication system. Several of the earlier publications reported rejections of the null hypothesis with an alpha criterion greater than .05. After the selective learning effect had thus established some precedent in the literature, presumably researchers and editors were more disposed to regard a rejection of the null hypothesis as true than false. Possibly, also, investigators who could not obtain the established finding were content to regard their experiments as inadequate in some respect or other and did not even bother to seek publication for what they may have believed to be Type II errors, nor did they bother to conduct further research that might have explained their failure to replicate published findings.

The Sleeper Effect

The sleeper effect in persuasion is said to occur when a communication from an untrustworthy or inexpert source has a greater persuasive impact after some time delay than it does on original exposure. That is, the communication presumably achieves its effect while the audience "sleeps" on it. This result is established well enough so that it is described in most introductory social psychology texts. The research history of the sleeper effect demonstrates a variety of ways in which Type I publication errors may occur (if one assumes, that is, that the effect is not a genuine one).

The original report of a sleeper effect by Hovland, Lumsdaine, and Sheffield (1949) involved the use of an alpha criterion that was inflated by selective sampling from multiple post hoc tests of the hypothesis. That is, the effect was not predicted and was found on only a subset (not an a priori one) of the opinion items used by the investigators. In subsequent years, experimental investigators have chosen to look for the sleeper effect in terms of a comparison between the temporal course of opinion changes induced by the same communication from a trustworthy, versus an untrustworthy, source. That is, the increase in effect over time with the untrustworthy source should not be matched by a similar increase when the source is trustworthy. Significant interaction effects involving these two variables of source credibility and time since communication have been reported in a number of studies (e.g., Gillig & Greenwald, 1974; Hovland & Weiss, 1951; Kelman & Hovland, 1953; Shulman & Worrall, 1970; Watts & McGuire, 1964). However, in none of these studies was there reported a significant increase in impact, with passage of time since the communication, for subjects receiving the communication from an untrustworthy source. That is, the interaction effects were due primarily or entirely to loss of effects, with passage of time, for subjects receiving the communication from a trustworthy source.

The sleeper effect, it is clear, was established in the literature by a series of studies, each of which employed an ostensible alpha criterion of .05 but for which the effective alpha criterion was substantially higher. In the original Hovland et al. (1949) study, alpha was inflated through the selective reporting of post hoc significance tests; in the later studies, it was inflated by use of an inappropriate overall interaction effect test instead of the simple effect of the time variable within the untrustworthy source conditions.

Evidence that the original and subsequent sleeper effect reports are likely to have been Type I errors has come recently from a series of seven investigations by Gillig and Greenwald (1974) involving a total of 656 subjects. With their procedures, a true sleeper effect (increase over time) of .50 points on the 15-point opinion measure they used would have been detected with better than .95 probability. A 95% confidence interval (±.27 scale points) around the observed mean change of +.14 clearly included the hypothesis of zero change.

Quasi-Sensory Communication

A perennially interesting subject for behavioral science research concerns the possibility of perception of events that provide no detectable inputs to known sensory receptors. Research on extrasensory perception or quasi-sensory communication (Clement, 1972; McBain, Fox, Kimura, Nakanishi, & Tirada, 1970) is so plagued with research-publication system problems that no reasonable person can regard himself as having an adequate basis for a true-false conclusion. This state of affairs is not due to lack of research. It would be difficult to estimate, on the basis of the published literature, the amount of research energy that has been invested in parapsychological questions, and this is precisely the problem. It is a certainty that the published literature, both in favor of and against quasi-sensory communication, represents only a small fraction of the total research effort. Two anecdotes in my own experience are illustrative:

1. A physicist at Ohio State University once described to me an investigation, conducted with a colleague as a digression at their laboratory, into the detection of human-expressed affect by plants.6 They happened to have electronic apparatus of sufficient accuracy to detect electric potential changes of as small a magnitude as 10 nV between two points on the same or opposite surfaces of a leaf. This was approximately one part in 10^7 of the baseline voltage. They failed to detect responses of this magnitude reliably in a number of tests involving verbally and facially communicated threats to the plant. When I learned of this (at a cocktail party, of course) I asked if they had any intention either to publish their results or to repeat the experiment. The reply was negative, although I expect the scientific community would have been informed had their results been positive.

2. As an editorial consultant to a journal, I was asked to review an article that obtained an extrasensory perception effect that would reject the hypothesis of no effect if alpha were set at .10. I advised the editor that the result was one that had a higher probability of being Type I error than the ostensible .10, but the appropriate editorial response, since the study was competently done and the problem was interesting, was to guarantee to publish the results if the investigators would agree (a) to conduct a replication and (b) to publish the outcome of the replication (as well as the already submitted study). Two years later, the study was published (Layton & Turnbull, 1974; see also Greenwald, 1974) with the results of the replication failing to confirm the original findings.

6 I am grateful to James T. Tough and James C. Garland for permission to give this informal report of their investigation.

Now we all know that anecdotes are unacceptable as scientific evidence because of the inflated probability that unusual events will be noticed and propagated as anecdotes. What is distressing is that the published literature on quasi-sensory communication (and other topics) also seems to be highly likely to detect and communicate relatively unlikely events. As it is functioning in at least some areas of behavioral science research, the research-publication system may be regarded as a device for systematically generating and propagating anecdotal information.

My criticisms of researchers' null-hypothesis-related strategies are not new. They have been expressed, in part, by several previous writers, the article by Bakan (1966) being perhaps closest to the approach I have taken. The point that Type I publication errors are underestimated by reported alpha criteria has been made also in critiques of the use of significance tests in sociology (Selvin, 1966) and psychology (Sterling, 1959) (see also the anthology edited by Morrison & Henkel, 1970). What I have attempted to add to the previous critiques is a quantitative assessment of the magnitude of the problem as it exists, by means of (a) a questionnaire survey and (b) a system simulation employing system parameters derived from the survey results. The obtained quantitative estimates must be regarded as frightening, even calling into question the scientific basis for much published literature.

Previous critics have not been negligent in suggesting remedies for what they too have regarded as an undesirable situation. Some suggestions have been intended for use in conjunction with the standard significance testing approach. For example, Cohen (1962) has pointed out that social psychological experiments often have power adequate to detect only relatively large effects. His suggestion for higher powered experiments, if adopted, should be expected to result in an increase in the frequency of null hypothesis rejections relative to nonrejections. However, it is also possible that increased awareness of experimental power may lead to taking null results more seriously. Hays (1963) has suggested using estimates of magnitude of association to accompany the standard reports of alpha levels. This would help to assure that trivial effects associated with a rejection of the null hypothesis would be recognized as such, but might have no systematic effect on the treatment of null results.

Other writers have recommended departures from the significance testing framework. Particularly, suggestions for the use of interval estimation (Grant, 1962) or Bayesian analytic techniques (Bakan, 1966; Edwards, Lindman, & Savage, 1963) would help to avoid prejudice against the null hypothesis, because with these procedures, results need not be stated in terms of acceptance or rejection of a null hypothesis. Despite the good reasons for using interval estimation and Bayesian techniques that have been advanced by several writers, inspection of current journals makes it apparent that tests of significance against point null hypotheses remain the predominant mode of analysis for behavioral data. (Further, there is little evidence that behavioral researchers have given increased attention to the power of their research designs or to magnitude of association, in pursuit of the suggestions by Cohen, Hays and others.)

It would be a mistake, I think, to expect that a recommendation to adopt some analysis strategy other than (or in addition to) significance testing might, by itself, eliminate bias against accepting the null hypothesis. This is because, as has been shown here, the problem exists as much or more in the behavior of investigators before collecting and after analyzing their data as in the techniques they use for analysis. Further, since a research enterprise may often be directed quite properly at the determination of whether a given relationship or effect does or does not approximate a zero value, it seems inappropriate to urge the dropping of methods of analysis in which null hypotheses are compared with alternatives. As noted earlier, a research question stated in null hypothesis versus alternative hypothesis form is especially appropriate for theory-testing research. In such research, a result that can be used to accept a null hypothesis may often serve to advance knowledge by disproving the theory.

For these reasons, my basic recommendation is a suggested attitude change of researchers (and editors) toward the null hypothesis. Support for the null hypothesis must be regarded as a research outcome that is as acceptable as any other. I cannot leave this recommendation just baldly stated because I suspect that most readers will not know how to go about analyzing and reporting data in a fashion that can lead to the acceptance of the null hypothesis. I conclude, therefore, by considering some technical points related to acceptance of the null hypothesis. It should be clear to readers, as it is to me, that what follows is a rather low-level consideration of technical matters, directed at users around my own level of statistical naiveté but nonetheless accurate as far as I can determine through consultation with more expert colleagues.

How to Accept the Null Hypothesis Gracefully

Use a range, rather than a point, null hypothesis. The procedural recommendations to follow are much easier to apply if the researcher has decided, in advance of data collection, just what magnitude of effect on a dependent measure or measure of association is large enough not to be considered trivial. This decision may have to be made somewhat arbitrarily but seems better to be made somewhat arbitrarily before data collection than to be made after examination of the data. The minimum magnitude of effect that the researcher is willing to consider nontrivial is then a boundary of the null range. The illustrations that follow employ a "two-tailed" null range that is symmetric around the zero point of a test statistic.

Select N on the basis of desirable error of estimate of the test statistic. Assume, for example, that in an experiment with one treatment condition and a control condition, the researcher had decided that a treatment versus control difference of .50 units on his dependent measure is a minimum nontrivial effect. (Therefore, the null range is (-.50, +.50) on this measure.) It would seem inappropriate to collect data with N only large enough so that the estimate of the treatment effect would have a standard error of, say, .50. To appreciate this, consider that a 95% confidence interval based on this imprecise an estimate would encompass about twice the width of the null range. I can think of no hard and fast way of specifying a desirable degree of precision, but I would suggest that an error of estimate of effect on the order of 10%-20% of the width of the null range may often be appropriate. (A 95% confidence interval then would be 40%-80% of the null range's width.) More precision than this may often be desirable, but the researcher has to make such decisions based on the cost of obtaining such precision relative to the value of the knowledge obtained thereby.7

Have convincing evidence that manipulations and measures are valid. Whether the data are to be used to accept or reject a null hypothesis or to make some other conclusion, it seems essential that the researcher be able to document the validity of his procedures relative to the conceptual variables being studied. In the case of accepting the null hypothesis, the results are patently useless if the researcher has not defended himself against the argument that his operations lacked correspondence with the variables that were critical to his hypothesis test. However, the researcher drawing a conclusion that rejects a null hypothesis should feel equal compulsion to demonstrate that his procedures were valid.

Compute the posterior probability of the null (range) hypothesis. I refer the reader to statistical texts (e.g., Hays, 1973, chap. 19; Mosteller & Tukey, 1969, pp. 160-183; Phillips, 1973) for an introduction to Bayesian methods (see also Edwards et al., 1963). Figure 4 offers a comparison of three modes of analysis (significance testing, interval estimation, and Bayesian posterior probability computation) for some hypothetical data. These hypothetical data are for the difference between two correlated means on a measure for which the researcher's null range is (-.50, +.50). The standard error of the difference scores is assumed to be 1.00, and the obtained sample mean difference (MD) is +.25, a point clearly within the null range. Each analysis is shown for three sample sizes as an aid to comparing the different analysis procedures.

FIGURE 4. Comparison of significance testing, confidence interval estimation, and posterior probability estimation for three sample sizes. (The example assumes a null hypothesis range of (-.50, +.50), a variance of 1.00, and an obtained sample estimate of +.25 for a hypothetical treatment effect. The distribution centered over 0.00 is the expected distribution of sample mean estimates of the effect if the point null hypothesis of 0.00 is true. This is used to compute significance levels [αs]. The distribution centered over +.25 is the Bayesian posterior likelihood distribution, the posterior probability [P''] estimate being computed as the fraction of the area under this distribution falling in the interval (-.50, +.50). LB and UB are lower and upper confidence interval boundaries.)

The first analysis employs a standard two-tailed significance test for a point (not range) null hypothesis. This would seem to be the analysis currently preferred by most behavioral scientists.
At cz = .05, this analysis analysis method is presented for three sam- does not reject the null (point) hypothesis for the smallest sample size shown but does do

7 Setting N to achieve a given level of precision so for the two larger sample sizes, despite (a) requires some advance estimate of variability of the the observed data point being well within the data. If such information is unavailable at the out- null range and (b) the fact that with the set of data collection, it may then be necessary to determine this variability on the basis of initial data larger sample sizes we should have more con- collection. fidence in the accuracy of this estimate. 18 ANTHONY G. (
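As an illustration only, the three analyses of the Figure 4 example can be worked through numerically. The sketch below is not taken from the article. It assumes a normal approximation throughout, reads the stated variability of 1.00 as the standard deviation of the difference scores (consistent with the caption's variance of 1.00), and takes 20 as the smallest sample size, which is an assumption because only N = 80 and N = 160 are named in the text. The function names are hypothetical. The helper `n_for_precision` applies the earlier suggestion that the error of estimate be on the order of 10%-20% of the null range's width.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for the z-based steps


def n_for_precision(sd, null_half_width, fraction):
    """Smallest N whose standard error is at most `fraction` of the
    full width of a symmetric null range (-half_width, +half_width)."""
    target_se = fraction * 2 * null_half_width
    return ceil((sd / target_se) ** 2)


def compare_analyses(mean_diff, sd, n, null_half_width=0.50, alpha=0.05):
    """Apply the three analyses of the Figure 4 example to one sample size.

    Assumptions (not from the article): known sd, normal sampling
    distribution, and a uniform prior, so that the posterior for the true
    difference is Normal(mean_diff, standard error).
    """
    se = sd / sqrt(n)

    # 1. Two-tailed significance test of the point null (difference = 0).
    z = mean_diff / se
    p_value = 2 * (1 - STD_NORMAL.cdf(abs(z)))

    # 2. Confidence interval, and whether it lies inside the null range.
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    lb, ub = mean_diff - z_crit * se, mean_diff + z_crit * se
    ci_within = -null_half_width <= lb and ub <= null_half_width

    # 3. Posterior probability of the null range: area of the posterior
    #    distribution falling between the two null-range boundaries.
    posterior = NormalDist(mean_diff, se)
    p_null = posterior.cdf(null_half_width) - posterior.cdf(-null_half_width)

    return {
        "n": n,
        "p_value": p_value,
        "rejects_point_null": p_value < alpha,
        "ci": (lb, ub),
        "ci_within_null_range": ci_within,
        "posterior_p_null": p_null,
    }


# Sample sizes: 80 and 160 appear in the text; 20 is an assumed value.
for n in (20, 80, 160):
    r = compare_analyses(mean_diff=0.25, sd=1.0, n=n)
    print(
        f"N={r['n']:>3}  p={r['p_value']:.4f}  "
        f"CI=({r['ci'][0]:+.3f}, {r['ci'][1]:+.3f})  "
        f"CI in null range: {r['ci_within_null_range']}  "
        f"P(null range)={r['posterior_p_null']:.4f}"
    )
```

Under these assumptions the point-null test rejects at N = 80 and N = 160 but not at N = 20, the 95% interval lies inside the null range only for the two larger samples, and the posterior probability of the null range exceeds .50 for all three sample sizes and increases with N. With a standard deviation of 1.00 and a null range of (-.50, +.50), `n_for_precision` returns 100 for the 10% criterion and 25 for the 20% criterion.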

Clearly, computation of the significance level of an obtained result relative to an exact null hypothesis is not a useful way of going about accepting a range null hypothesis. With a relatively large N it is, rather, a good means of exercising prejudice against the null hypothesis.

The second analysis shown in Figure 4 presents 95% confidence intervals for the MD = +.25 result for the three sample sizes. If we consider the containment of the 95% confidence interval within the null range as a criterion for accepting the null hypothesis, then we should accept the null hypothesis for the two larger sample sizes. This is definitely an improvement over the significance test analysis, but it still has some drawbacks. Particularly, (a) we are at a loss to make direct use of the data for the smallest sample size, for which the 95% confidence interval overspreads the null range; and (b) the conclusion does not reflect the increase in confidence that should be associated with the result for N = 160 relative to that for N = 80. It is apparent that these drawbacks of the confidence interval procedure stem from the awkwardness of relating the interval estimation procedure to a decision relative to the null hypothesis (cf. Mosteller & Tukey, 1969, pp. 180-183).

The final procedure illustrated in Figure 4 involves the computation of posterior likelihood distributions based on the obtained data. When in a Bayesian analysis one starts from ignorance (a "diffuse," "uniform," or "gentle" prior likelihood distribution), the posterior likelihood distribution is constructed directly from the mean and variability of the obtained data, much as is a confidence interval. A critical difference from the confidence interval analysis is that the assumptions underlying the Bayesian analysis facilitate drawing a conclusion about the acceptability of the null hypothesis. For the posterior distributions presented in Figure 4, a uniform prior distribution is assumed. The resulting posterior probability statements have the desirable feature of allowing us to conclude that the (range) null hypothesis is considerably more likely than its complement for all three sample sizes, while at the same time allowing expressions of the increased certainty afforded by the larger sample sizes for the MD = +.25 result.

To provide a more concrete illustration of a posterior probability computation used as the basis for accepting a null hypothesis, consider the data from the Gillig and Greenwald (1974) sleeper effect study described in the earlier section on Type I errors. In this study, Gillig and Greenwald were employing a 15-point opinion scale as the dependent measure. They considered that a change from an immediate posttest to a delayed posttest of less than .50 on this scale was a trivial effect (.50 was less than 25% of the standard deviation of the obtained difference scores). They employed 273 subjects to estimate this change, so that the standard error of their estimate of the effect was .134 (which equals 13.4% of the width of the (-.50, +.50) null range). Computation of the posterior likelihood distribution of the hypothesis, assuming a uniform prior distribution, indicated that 99.6% of the area under the posterior distribution was within the null range. The .996 figure can therefore be taken as a posterior probability measure of acceptance of the null (range) hypothesis for these data. This figure can be expressed alternately as a posterior odds ratio of .996/(1 - .996) = 249:1 in favor of the null hypothesis. For comparison, an odds ratio of 19:1 (α = .05) is frequently considered "significant" in rejecting a point null hypothesis (as contrasted with all possible alternatives).

Report all results of research for which conditions appropriate to testing a given hypothesis have been established. As has been demonstrated earlier, successful communication of information through archival publication is severely threatened by self-censorship on the part of investigators who obtain unpredicted (often meaning null) findings. The only justifiable basis for withholding a report of the results of a data collection should be that the hypothesis intended for testing was not actually tested. This could come about through failures of manipulation, measurement, randomization, and so forth. As previously noted, the investigator should be prepared for these possibilities, meaning that he should be able to support a decision to withhold data by demonstrating that such an invalidating condition obtained. Given a valid hypothesis test, the only justifiable procedure for reporting less than all of the data obtained is the decidedly dubious one of discarding portions of the data randomly; any nonrandom decision procedure with widespread application would result in publications being a biased sample of actual research results. Therefore, researchers should make a point of including at least brief mentions of findings of preliminary data collections, explaining why these results have been ignored (if they have), in reports of data on which more final conclusions have been based. It will be obvious that the admonition to publish all one's data fails to take into account the reality of editorial rejection. This point prompts a few final comments. First, it is a truly gross ethical violation for a researcher to suppress reporting of difficult-to-explain or embarrassing data in order to present a neat and attractive package to a journal editor. Second, it is to be hoped that journal editors will base publication decisions on criteria of importance and methodological soundness, uninfluenced by whether a result supports or rejects a null hypothesis.

As has, I hope, been clear, there is a moral to all this. In the interest of making this moral fully explicit (and also for the benefit of the reader who has started at this point), I offer the following two boiled-down recommendations.

1. Do research in which any outcome (including a null one) can be an acceptable and informative outcome.

2. Judge your own (or others') research not on the basis of the results but only on the basis of adequacy of procedures and importance of findings.⁸

⁸ Concluding note: Although I have not had occasion to cite their work directly in this report, the articles of Binder (1963), Campbell and Stanley (1963), Lykken (1968), and Walster and Cleary (1970) have stimulated some of the ideas developed here.

REFERENCES

Aronson, E., & Carlsmith, J. M. Experimentation in social psychology. In G. Lindzey & E. Aronson (Eds.), Handbook of social psychology (2nd ed., Vol. 2). Reading, Mass.: Addison-Wesley, 1969.

Attneave, F. Applications of information theory to psychology. New York: Holt, 1959.

Bakan, D. The test of significance in psychological research. Psychological Bulletin, 1966, 66, 432-437.

Binder, A. Further considerations of testing the null hypothesis and the strategy and tactics of investigating theoretical models. Psychological Review, 1963, 70, 107-115.

Brigham, J. C., & Cook, S. W. The influence of attitude on the recall of controversial material: A failure to confirm. Journal of Experimental Social Psychology, 1969, 5, 240-243.

Campbell, D. T. Prospective: Artifact and control. In R. Rosenthal & R. L. Rosnow (Eds.), Artifact in behavioral research. New York: Academic Press, 1969.

Campbell, D. T., & Stanley, J. C. Experimental and quasi-experimental designs for research on teaching. In N. L. Gage (Ed.), Handbook of research on teaching. Chicago: Rand McNally, 1963.

Clement, D. E. Quasi-sensory communication: Still not proved. Journal of Personality and Social Psychology, 1972, 23, 103-104.

Cohen, J. The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 1962, 65, 145-153.

Edwards, W. Tactical note on the relation between scientific and statistical hypotheses. Psychological Bulletin, 1965, 63, 400-402.

Edwards, W., Lindman, H., & Savage, L. J. Bayesian statistical inference for psychological research. Psychological Review, 1963, 70, 193-242.

Festinger, L. Laboratory experiments. In L. Festinger & D. Katz (Eds.), Research methods in the behavioral sciences. New York: Holt, 1953.

Fitzgerald, D., & Ausubel, D. P. Cognitive versus affective factors in the learning and retention of controversial material. Journal of Educational Psychology, 1963, 54, 73-84.

Gillig, P. M., & Greenwald, A. G. Is it time to lay the sleeper effect to rest? Journal of Personality and Social Psychology, 1974, 29, 132-139.

Grant, D. A. Testing the null hypothesis and the strategy and tactics of investigating theoretical models. Psychological Review, 1962, 69, 54-61.

Greenwald, A. G. Significance, nonsignificance, and interpretation of an ESP experiment. Journal of Experimental Social Psychology, 1974, 10, in press.

Greenwald, A. G., & Sakumura, J. S. Attitude and selective learning: Where are the phenomena of yesteryear? Journal of Personality and Social Psychology, 1967, 7, 387-397.

Hays, W. L. Statistics for psychologists. New York: Holt, Rinehart & Winston, 1963.

Hays, W. L. Statistics for social scientists (2nd ed.). New York: Holt, Rinehart & Winston, 1973.

Hovland, C. I., Lumsdaine, A. A., & Sheffield, F. D. Experiments on mass communication. Princeton: Princeton University Press, 1949.

Hovland, C. I., & Weiss, W. The influence of source credibility on communication effectiveness. Public Opinion Quarterly, 1951, 15, 635-650.

Kelman, H. C., & Hovland, C. I. "Reinstatement" of the communicator in delayed measurement of opinion change. Journal of Abnormal and Social Psychology, 1953, 48, 327-335.

Layton, B. D., & Turnbull, B. Belief, evaluation, and performance on an ESP task. Journal of Experimental Social Psychology, 1974, 10, in press.

Levine, J. M., & Murphy, G. The learning and forgetting of controversial material. Journal of Abnormal and Social Psychology, 1943, 38, 507-517.

Lykken, D. T. Statistical significance in psychological research. Psychological Bulletin, 1968, 70, 151-159.

Malpass, R. S. Effects of attitude on learning and memory: The influence of instruction induced sets. Journal of Experimental Social Psychology, 1969, 5, 441-453.

McBain, W. N., Fox, W., Kimura, S., Nakanishi, M., & Tirado, J. Quasi-sensory communication: An investigation using semantic matching and accentuated effect. Journal of Personality and Social Psychology, 1970, 14, 281-291.

McGuire, W. J. Inducing resistance to persuasion: Some contemporary approaches. In L. Berkowitz (Ed.), Advances in experimental social psychology (Vol. 1). New York: Academic Press, 1964.

McGuire, W. J. The yin and yang of progress in social psychology: Seven koan. Journal of Personality and Social Psychology, 1973, 26, 446-456.

Mills, J. The experimental method. In J. Mills (Ed.), Experimental social psychology. Toronto: Macmillan, 1969.

Morrison, D. E., & Henkel, R. E. (Eds.). The significance test controversy. Chicago: Aldine, 1970.

Mosteller, F., & Tukey, J. W. Data analysis, including statistics. In G. Lindzey & E. Aronson (Eds.), Handbook of social psychology (2nd ed., Vol. 2). Reading, Mass.: Addison-Wesley, 1969.

Phillips, L. D. Bayesian statistics for social scientists. New York: Crowell, 1973.

Platt, J. R. Strong inference. Science, October 1964, pp. 347-353.

Rosenthal, R., & Rosnow, R. L. (Eds.). Artifact in behavioral research. New York: Academic Press, 1969.

Rozeboom, W. W. The fallacy of the null-hypothesis significance test. Psychological Bulletin, 1960, 57, 416-428.

Selvin, H. C., & Stuart, A. Data-dredging procedures in survey analysis. American Statistician, 1966, 20, 20-23.

Shulman, G. I., & Worrall, C. Salience patterns, source credibility, and the sleeper effect. Public Opinion Quarterly, 1970, 34, 371-382.

Siegel, S. Nonparametric statistics for the behavioral sciences. New York: McGraw-Hill, 1956.

Sterling, T. D. Publication decisions and their possible effects on inferences drawn from tests of significance -- or vice versa. Journal of the American Statistical Association, 1959, 54, 30-34.

Walster, G. W., & Cleary, T. A. A proposal for a new editorial policy in the social sciences. American Statistician, 1970, 24, 16-19.

Waly, P., & Cook, S. W. Attitudes as a determinant of learning and memory: A failure to confirm. Journal of Personality and Social Psychology, 1966, 4, 280-288.

Watts, W. A., & McGuire, W. J. Persistency of induced opinion change and retention of the inducing message contents. Journal of Abnormal and Social Psychology, 1964, 68, 233-241.

Wilson, W. R., & Miller, H. A note on the inconclusiveness of accepting the null hypothesis. Psychological Review, 1964, 71, 238-242.

(Received February 18, 1974)