
Statistical Science, 1991, Vol. 6, No. 4, 363-403

Replication and Meta-Analysis in Parapsychology

Jessica Utts

Abstract. Parapsychology, the laboratory study of psychic phenomena, has had its history interwoven with that of statistics. Many of the controversies in parapsychology have focused on statistical issues, and statistical models have played an integral role in the experimental work. Recently, parapsychologists have been using meta-analysis as a tool for synthesizing large bodies of work. This paper presents an overview of the use of statistics in parapsychology and offers a summary of the meta-analyses that have been conducted. It begins with some anecdotal information about the involvement of statistics and statisticians with the early history of parapsychology. Next, it is argued that most nonstatisticians do not appreciate the connection between power and "successful" replication of experimental effects. Returning to parapsychology, a particular experimental regime is examined by summarizing an extended debate over the interpretation of the results. A new set of experiments designed to resolve the debate is then reviewed. Finally, meta-analyses from several areas of parapsychology are summarized. It is concluded that the overall evidence indicates that there is an anomalous effect in need of an explanation.

Key words and phrases: Effect size, psychic research, statistical controversies, randomness, vote-counting.

Jessica Utts is Associate Professor, Division of Statistics, University of California at Davis, 469 Kerr Hall, Davis, California 95616.

1. INTRODUCTION

In a June 1990 Gallup Poll, 49% of the 1236 respondents claimed to believe in extrasensory perception (ESP), and one in four claimed to have had a personal experience involving ESP (Gallup and Newport, 1991). Other surveys have shown even higher percentages; the University of Chicago's National Opinion Research Center recently surveyed 1473 adults, of which 67% claimed that they had experienced ESP (Greeley, 1987).

Public opinion is a poor arbiter of science, however, and experience is a poor substitute for the scientific method. For more than a century, small numbers of scientists have been conducting laboratory experiments to study phenomena such as telepathy, clairvoyance and precognition, collectively known as "psi" abilities. This paper will examine some of that work, as well as some of the statistical controversies it has generated.

Parapsychology, as this field is called, has been a source of controversy throughout its history. Strong beliefs tend to be resistant to change even in the face of data, and many people, scientists included, seem to have made up their minds on the question without examining any empirical data at all. A critic of parapsychology recently acknowledged that "The level of the debate during the past 130 years has been an embarrassment for anyone who would like to believe that scholars and scientists adhere to standards of rationality and fair play" (Hyman, 1985a, page 89). While much of the controversy has focused on poor experimental design and potential fraud, there have been attacks and defenses of the statistical methods as well, sometimes calling into question the very foundations of probability and statistical inference.

Most of the criticisms have been leveled by psychologists. For example, a 1988 report of the U.S. National Academy of Sciences concluded that "The committee finds no scientific justification from research conducted over a period of 130 years for the existence of parapsychological phenomena" (Druckman and Swets, 1988, page 22). The chapter on parapsychology was written by a subcommittee chaired by a psychologist who had published a similar conclusion prior to his appointment to the committee (Hyman, 1985a, page 7). There were no parapsychologists involved with the writing of the report.

Resulting accusations of bias (Palmer, Honorton and Utts, 1989) led U.S. Senator Claiborne Pell to request that the Congressional Office of Technology Assessment (OTA) conduct an investigation with a more balanced group. A one-day workshop was held on September 30, 1988, bringing together parapsychologists, critics and experts in some related fields (including the author of this paper). The report concluded that parapsychology needs "a fairer hearing across a broader spectrum of the scientific community, so that emotionality does not impede objective assessment of experimental results" (Office of Technology Assessment, 1989).

It is in the spirit of the OTA report that this article is written. After Section 2, which offers an anecdotal account of the role of statisticians and statistics in parapsychology, the discussion turns to the more general question of replication of experimental results. Section 3 illustrates how replication has been (mis)interpreted by scientists in many fields. Returning to parapsychology in Section 4, a particular experimental regime called the "ganzfeld" is described, and an extended debate about the interpretation of the experimental results is discussed. Section 5 examines a meta-analysis of recent ganzfeld experiments designed to resolve the debate. Finally, Section 6 contains a brief account of meta-analyses that have been conducted in other areas of parapsychology, and conclusions are given in Section 7.

2. STATISTICS AND PARAPSYCHOLOGY

Parapsychology had its beginnings in the investigation of purported mediums and other anecdotal claims in the late 19th century. The Society for Psychical Research was founded in Britain in 1882, and its American counterpart was founded in Boston in 1884. While these organizations and their members were primarily involved with investigating anecdotal material, a few of the early researchers were already conducting "forced-choice" experiments such as card-guessing. (Forced-choice experiments are like multiple choice tests; on each trial the subject must guess from a small, known set of possibilities.) Notable among these was Nobel Laureate Charles Richet, who is generally credited with being the first to recognize that probability theory could be applied to card-guessing experiments (Rhine, 1977, page 26; Richet, 1884).

F. Y. Edgeworth, partly in response to what he considered to be incorrect analyses of these experiments, offered one of the earliest treatises on the statistical evaluation of forced-choice experiments in two articles published in the Proceedings of the Society for Psychical Research (Edgeworth, 1885, 1886). Unfortunately, as noted by Mauskopf and McVaugh (1979) in their historical account of the period, Edgeworth's papers were "perhaps too difficult for their immediate audience" (page 105).

Edgeworth began his analysis by using Bayes' theorem to derive the formula for the posterior probability that chance was operating, given the data. He then continued with an argument "savouring more of Bernoulli than Bayes," in which "it is consonant, I submit, to experience, to put 1/2 both for α and β," that is, for both the prior probability that chance alone was operating, and the prior probability that "there should have been some additional agency." He then reasoned (using a Taylor series expansion of the posterior probability formula) that if there were a large probability of observing the data given that some additional agency was at work, and a small objective probability of the data under chance, then the latter (binomial) probability "may be taken as a rough measure of the sought a posteriori probability in favour of mere chance" (page 195). Edgeworth concluded his article by applying his method to some data published previously in the same journal. He found the probability against chance to be 0.99996, which he said "may fairly be regarded as physical certainty" (page 199). He concluded:

    Such is the evidence which the calculus of probabilities affords as to the existence of an agency other than mere chance. The calculus is silent as to the nature of that agency-whether it is more likely to be vulgar illusion or extraordinary law. That is a question to be decided, not by formulae and figures, but by general philosophy and common sense [page 199].

Both the statistical arguments and the experimental controls in these early experiments were somewhat loose. For example, Edgeworth treated as binomial an experiment in which one person chose a string of eight letters and another attempted to guess the string. Since it has long been understood that people are poor random number (or letter) generators, there is no statistical basis for analyzing such an experiment. Nonetheless, Edgeworth and his contemporaries set the stage for the use of controlled experiments with statistical evaluation in laboratory parapsychology. An interesting historical account of Edgeworth's involvement and the role telepathy experiments played in the early history of randomization and experimental design is provided by Hacking (1988).

One of the first American researchers to use statistical methods in parapsychology was John Edgar Coover, who was the Thomas Welton Stanford Psychical Research Fellow in the Psychology Department at Stanford University from 1912 to 1937 (Dommeyer, 1975). In 1917, Coover published a large volume summarizing his work (Coover, 1917). Coover believed that his results were consistent with chance, but others have argued that Coover's definition of significance was too strict (Dommeyer, 1975). For example, in one evaluation of his telepathy experiments, Coover found a two-tailed p-value of 0.0062. He concluded, "Since this value, then, lies within the field of chance deviation, although the probability of its occurrence by chance is fairly low, it cannot be accepted as a decisive indication of some cause beyond chance which operated in favor of success in guessing" (Coover, 1917, page 82). On the next page, he made it explicit that he would require a p-value of 0.0000221 to declare that something other than chance was operating.

It was during the summer of 1930, with the card-guessing experiments of J. B. Rhine at Duke University, that parapsychology began to take hold as a laboratory science. Rhine's laboratory still exists under the name of the Foundation for Research on the Nature of Man, housed at the edge of the Duke University campus. It wasn't long after Rhine published his first book in 1934 that the attacks on his methodology began. Since his claims were wholly based on statistical analyses of his experiments, the statistical methods were closely scrutinized by critics anxious to find a conventional explanation for Rhine's positive results.

The most persistent critic was a psychologist from McGill University named Chester Kellogg (Mauskopf and McVaugh, 1979). Kellogg's main argument was that Rhine was using the binomial distribution (and normal approximation) on a series of trials that were not independent. The experiments in question consisted of having a subject guess the order of a deck of 25 cards, with five each of five symbols, so technically Kellogg was correct.

By 1937, several mathematicians and statisticians had come to Rhine's aid. Mauskopf and McVaugh (1979) speculated that since statistics was itself a young discipline, "a number of statisticians were equally outraged by Kellogg, whose arguments they saw as discrediting their profession" (page 258). The major technical work, which acknowledged that Kellogg's criticisms were accurate but did little to change the significance of the results, was conducted by Charles Stuart and Joseph A. Greenwood and published in the first volume of the Journal of Parapsychology (Stuart and Greenwood, 1937). Stuart, who had been an undergraduate in mathematics at Duke, was one of Rhine's early subjects and continued to work with him as a researcher until Stuart's death in 1947. Greenwood was a Duke mathematician, who apparently converted to a statistician at the urging of Rhine.

Another prominent figure who was distressed with Kellogg's attack was E. V. Huntington, a mathematician at Harvard. After corresponding with Rhine, Huntington decided that, rather than further confuse the public with a technical reply to Kellogg's arguments, a simple statement should be made to the effect that the mathematical issues in Rhine's work had been resolved. Huntington must have successfully convinced his former student, Burton Camp of Wesleyan, that this was a wise approach. Camp was the 1937 President of the IMS. When the annual meetings were held in December of 1937 (jointly with AMS and AAAS), Camp released a statement to the press that read:

    Dr. Rhine's investigations have two aspects: experimental and statistical. On the experimental side mathematicians, of course, have nothing to say. On the statistical side, however, recent mathematical work has established the fact that, assuming that the experiments have been properly performed, the statistical analysis is essentially valid. If the Rhine investigation is to be fairly attacked, it must be on other than mathematical grounds [Camp, 1937].

One statistician who did emerge as a critic was William Feller. In a talk at the Duke Mathematical Seminar on April 24, 1940, Feller raised three criticisms of Rhine's work (Feller, 1940). They had been raised before by others (and continue to be raised even today). The first was that inadequate shuffling of the cards resulted in additional information from one series to the next. The second was what is now known as the "file-drawer effect," namely, that if one combines the results of published studies only, there is sure to be a bias in favor of successful studies. The third was that the results were enhanced by the use of optional stopping, that is, by not specifying the number of trials in advance. All three of these criticisms were addressed in a rejoinder by Greenwood and Stuart (1940), but Feller was never convinced. Even in its third edition published in 1968, his book An Introduction to Probability Theory and Its Applications still contains his conclusion about Greenwood and Stuart: "Both their arithmetic and their experiments have a distinct tinge of the supernatural" (Feller, 1968, page 407). In his discussion of Feller's position, Diaconis (1978) remarked, "I believe Feller was confused. . . he seemed to have decided the opposition was wrong and that was that."

Several statisticians have contributed to the literature in parapsychology to greater or lesser degrees. T. N. E. Greville developed applicable statistical methods for many of the experiments in parapsychology and was Statistical Editor of the Journal of Parapsychology (with J. A. Greenwood) from its start in 1937 through Volume 31 in 1967; Fisher (1924, 1929) addressed some specific problems in card-guessing experiments; Wilks (1965a, b) described various statistical methods for parapsychology; Lindley (1957) presented a Bayesian analysis of some parapsychology data; and Diaconis (1978) pointed out some problems with certain experiments and presented a method for analyzing experiments when feedback is given.

Occasionally, attacks on parapsychology have taken the form of attacks on statistical inference in general, at least as it is applied to real data. Spencer-Brown (1957) attempted to show that true randomness is impossible, at least in finite sequences, and that this could be the explanation for the results in parapsychology. That argument re-emerged in a recent debate on the role of randomness in parapsychology, initiated by psychologist J. Barnard Gilmore (Gilmore, 1989, 1990; Utts, 1989; Palmer, 1989, 1990). Gilmore stated that "The agnostic statistician, advising on research in psi, should take account of the possible inappropriateness of classical inferential statistics" (1989, page 338). In his second paper, Gilmore reviewed several non-psi studies showing purportedly random systems that do not behave as they should under randomness (e.g., Iversen, Longcor, Mosteller, Gilbert and Youtz, 1971; Spencer-Brown, 1957). Gilmore concluded that "Anomalous data . . . should not be found nearly so often if classical statistics offers a valid model of reality" (1990, page 54), thus rejecting the use of classical statistical inference for real-world applications in general.

3. REPLICATION

Implicit and explicit in the literature on parapsychology is the assumption that, in order to truly establish itself, the field needs to find a repeatable experiment. For example, Diaconis (1978) started the summary of his article in Science with the words "In search of repeatable ESP experiments, modern investigators. . . " (page 131). On October 28-29, 1983, the 32nd International Conference of the Parapsychology Foundation was held in San Antonio, Texas, to address "The Repeatability Problem in Parapsychology." The Conference Proceedings (Shapin and Coly, 1985) reflect the diverse views among parapsychologists on the nature of the problem. Honorton (1985a) and Rao (1985), for example, both argued that strict replication is uncommon in most branches of science and that parapsychology should not be singled out as unique in this regard. Other authors expressed disappointment in the lack of a single repeatable experiment in parapsychology, with titles such as "Unrepeatability: Parapsychology's Only Finding" (Blackmore, 1985), and "Research Strategies for Dealing with Unstable Phenomena" (Beloff, 1985).

It has never been clear, however, just exactly what would constitute acceptable evidence of a repeatable experiment. In the early days of investigation, the major critics "insisted that it would be sufficient for Rhine and Soal to convince them of ESP if a parapsychologist could perform successfully a single 'fraud-proof' experiment" (Hyman, 1985a, page 71). However, as soon as well-designed experiments showing significant results emerged, the critics realized that a single experiment could be statistically significant just by chance. British psychologist C. E. M. Hansel quantified the new expectation, that the experiment should be repeated a few times, as follows:

    If a result is significant at the .01 level and this result is not due to chance but to information reaching the subject, it may be expected that by making two further sets of trials the antichance odds of one hundred to one will be increased to around a million to one, thus enabling the effects of ESP-or whatever is responsible for the original result-to manifest itself to such an extent that there will be little doubt that the result is not due to chance [Hansel, 1980, page 298].

In other words, three consecutive experiments at p ≤ 0.01 would convince Hansel that something other than chance was at work.

This argument implies that if a particular experiment produces a statistically significant result, but subsequent replications fail to attain significance, then the original result was probably due to chance, or at least remains unconvincing. The problem with this line of reasoning is that there is no consideration given to sample size or power. Only an experiment with extremely high power should be expected to be "successful" three times in succession.
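Hansel's three-in-a-row criterion is reasonable only if each replication has very high power. A minimal sketch in Python makes the point concrete; the trial count and the assumed true hit rate of 0.33 (against a chance rate of 0.25) are illustrative choices of mine, not figures taken from the paper:

```python
from math import comb

def binom_sf(k, n, p):
    """Upper tail P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def critical_value(n, p0, alpha):
    """Smallest hit count whose one-tailed p-value is at most alpha."""
    return next(k for k in range(n + 1) if binom_sf(k, n, p0) <= alpha)

# Hypothetical forced-choice experiment: 100 trials, chance hit rate 0.25,
# assumed true hit rate 0.33.
n, p0, p_true, alpha = 100, 0.25, 0.33, 0.01
c = critical_value(n, p0, alpha)      # hits needed to reach p <= 0.01
power = binom_sf(c, n, p_true)        # P(a single experiment "succeeds")
three_in_a_row = power ** 3           # P(three consecutive successes)
print(c, round(power, 3), round(three_in_a_row, 3))
```

Even though the assumed effect is real and fairly large, a single experiment of this size succeeds well under half the time, so demanding three consecutive significant results would reject the effect far more often than not.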
Another December of 1987, it was decided to prematurely investigator has attempted to duplicate his pro- terminate a study on the effects of aspirin in reduc- cedure, and he obtained a nonsignificant value ing attacks because the data were so convinc- of t with the same number of subjects. The ing (see, e.g., Greenhouse and Greenhouse, 1988; direction was the same in both sets of data. Rosenthal, 1990a). The physician-subjects had been You are reviewing the literature. What is the randomly assigned to take aspirin or a . highest value of t in the second set of data that There were 104 heart attacks among the 11,037 you would describe as a failure to replicate? subjects in the aspirin group, and 189 heart attacks [1982, page 281. among the 11,034 subjects in the placebo group In reporting their results, Tversky and Kahne- (chi-square = 25.01, p < 0.00001). mann stated: After showing the results of that study, I pre- sented the audience with two hypothetical experi- The majority of our respondents regarded t = ments conducted to try to replicate the original 1.70 as a failure to replicate. If the data of two result, with outcomes in Table 2. such studies (t = 2.46 and t = 1.70) are pooled, I asked the audience to indicate which one they the value of t for the combined data is about thought was a more successful replication. The au- 3.00 (assuming equal variances). Thus, we are dience chose the second one, as would most journal faced with a paradoxical state of affairs, in editors, because of the "significant p-value." In which the same data that would increase our fact, the first replication has almost exactly the confidence in the finding when viewed as part same proportion of heart attacks in the two groups of the original study, shake our confidence as the original study and is thus a very close repli- when viewed as an independent study [1982, cation of that result. The second replication has page 281. 
At a recent presentation to the History and Phi- losophy of Science Seminar at the University of California at Davis, I asked the following question. TABLE1 Two scientists, Professors A and B, each have a Attempted replciations for professor B theory they would like to demonstrate. Each plans n Number of successes One-tailed p-value to run a fixed number of Bernoulli trials and then test H,: p = 0.25 versos Ha: p > 0.25. Professor A has access to large numbers of students each semester to use as subjects. In his first experiment, he runs 100 subjects, and there are 33 successes (p = 0.04, one-tailed). Knowing the importance of replication, Professor A runs an additional 100 sub- jects as a second experiment. He finds 36 successes (p = 0.009, one-tailed). Professor B only teaches small classes. Each quarter, she runs an experiment on her students to test her theory. She carries out ten studies this way, with the results in Table 1. TABLE2 I asked the audience by a show of hands to Hypothetical replications of the aspirin / heart indicate whether or not they felt the scientists had attack study successfully demonstrated their theories. Professor A's theory received overwhelming support, with Replication # 1 Replication #2 approximately 20 votes, while Professor B's theory Heart attack Heart attack received only one vote. Yes No Yes No If you aggregate the results of the experiments Aspirin 11 1156 20 2314 for each professor, you will notice that each con- Placebo 19 1090 48 2170 ducted 200 trials, and Professor B actually demon- Chi-square 2.596, p = 0.11 13.206, p = 0.0003 strated a higher level of success than Professor A, 368 J. UTTS very different proportions, and in fact the relative been consistent effects of the same magnitude. risk from the second study is not even contained in Rosenthal also advocates this view of replication: a 95% confidence interval for relative risk from the The traditional view of replication focuses on original study. 
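The figures quoted above can be checked directly. The sketch below (Python; the function names are mine, not from the paper) recomputes the chi-square statistics for the original aspirin study and the two hypothetical replications, and verifies the relative-risk claim using the usual log-scale confidence interval:

```python
from math import exp, log, sqrt

def chisq_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the
    2x2 table [[a, b], [c, d]], via n*(ad - bc)^2 / (marginal products)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def relative_risk_ci(x1, n1, x2, n2, z=1.96):
    """Relative risk (group 2 risk / group 1 risk) with a 95% CI
    computed on the log scale."""
    rr = (x2 / n2) / (x1 / n1)
    se = sqrt(1 / x1 - 1 / n1 + 1 / x2 - 1 / n2)
    return rr, exp(log(rr) - z * se), exp(log(rr) + z * se)

# Original study: 104 heart attacks among 11,037 aspirin subjects,
# 189 among 11,034 placebo subjects (text reports chi-square = 25.01).
chi2_orig = chisq_2x2(104, 11037 - 104, 189, 11034 - 189)
rr_orig, lo, hi = relative_risk_ci(104, 11037, 189, 11034)

# Table 2 replications (counts of heart attack / no heart attack).
chi2_rep1 = chisq_2x2(11, 1156, 19, 1090)
chi2_rep2 = chisq_2x2(20, 2314, 48, 2170)
rr_rep1, _, _ = relative_risk_ci(11, 11 + 1156, 19, 19 + 1090)
rr_rep2, _, _ = relative_risk_ci(20, 20 + 2314, 48, 48 + 2170)

print(round(chi2_orig, 2), round(chi2_rep1, 3), round(chi2_rep2, 3))
print(round(rr_orig, 3), round(lo, 2), round(hi, 2),
      round(rr_rep1, 3), round(rr_rep2, 3))
```

The first replication's relative risk falls almost exactly on the original study's, and well inside the original 95% interval; the second replication's does not, despite its smaller p-value.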
The magnitude of the effect has been much more closely matched by the "nonsignificant" replication.

Fortunately, psychologists are beginning to notice that replication is not as straightforward as they were originally led to believe. A special issue of the Journal of Social Behavior and Personality was entirely devoted to the question of replication (Neuliep, 1990). In one of the articles, Rosenthal cautioned his colleagues: "Given the levels of statistical power at which we normally operate, we have no right to expect the proportion of significant results that we typically do expect, even if in nature there is a very real and very important effect" (Rosenthal, 1990b, page 16).

Jacob Cohen, in his insightful article titled "Things I Have Learned (So Far)," identified another misconception common among social scientists: "Despite widespread misconceptions to the contrary, the rejection of a given null hypothesis gives us no basis for estimating the probability that a replication of the research will again result in rejecting that null hypothesis" (Cohen, 1990, page 1307).

Cohen and Rosenthal both advocate the use of effect sizes as opposed to significance levels when defining the strength of an experimental effect. In general, effect sizes measure the amount by which the data deviate from the null hypothesis in terms of standardized units. For instance, the effect size for a two-sample t-test is usually defined to be the difference in the two means, divided by the standard deviation for the control group. This measure can be compared across studies without the dependence on sample size inherent in significance levels. (Of course there will still be variability in the sample effect sizes, decreasing as a function of sample size.) Comparison of effect sizes across studies is one of the major components of meta-analysis.

Similar arguments have recently been made in the medical literature. For example, Gardner and Altman (1986) stated that the use of p-values "to define two alternative outcomes-significant and not significant-is not helpful and encourages lazy thinking" (page 746). They advocated the use of confidence intervals instead.

As discussed in the next section, the arguments used to conclude that parapsychology has failed to demonstrate a replicable effect hinge on these misconceptions of replication and failure to examine power. A more appropriate analysis would compare the effect sizes for similar experiments across experimenters and across time to see if there have been consistent effects of the same magnitude. Rosenthal also advocates this view of replication:

    The traditional view of replication focuses on significance level as the relevant summary statistic of a study and evaluates the success of a replication in a dichotomous fashion. The newer, more useful view of replication focuses on effect size as the more important summary statistic of a study and evaluates the success of a replication not in a dichotomous but in a continuous fashion [Rosenthal, 1990b, page 28].

The dichotomous view of replication has been used throughout the history of parapsychology, by both parapsychologists and critics (Utts, 1988). For example, the National Academy of Sciences report critically evaluated "significant" experiments, but entirely ignored "nonsignificant" experiments. In the next three sections, we will examine some of the results in parapsychology using the broader, more appropriate definition of replication. In doing so, we will show that the results are far more interesting than the critics would have us believe.

4. THE GANZFELD DEBATE IN PARAPSYCHOLOGY

An extensive debate took place in the mid-1980s between a parapsychologist and a critic, questioning whether or not a particular body of parapsychological data had demonstrated psi abilities. The experiments in question were all conducted using the ganzfeld setting (described below). Several authors were invited to write commentaries on the debate. As a result, this data base has been more thoroughly analyzed by both critics and proponents than any other and provides a good source for studying replication in parapsychology.

The debate concluded with a detailed series of recommendations for further experiments, and left open the question of whether or not psi abilities had been demonstrated. A new series of experiments that followed the recommendations were conducted over the next few years. The results of the new experiments will be presented in Section 5.

4.1 Free-Response Experiments

Recent experiments in parapsychology tend to use more complex target material than the cards and dice used in the early investigations, partially to alleviate boredom on the part of the subjects and partially because they are thought to "more nearly resemble the conditions of spontaneous psi occurrences" (Burdick and Kelly, 1977, page 109). These experiments fall under the general heading of "free-response" experiments, because the subject is asked to give a verbal or written description of the target, rather than being forced to make a choice from a small discrete set of possibilities. Various types of target material have been used, including pictures, short segments of movies on video tapes, actual locations and small objects.

Despite the more complex target material, the statistical methods used to analyze these experiments are similar to those for forced-choice experiments. A typical experiment proceeds as follows.
Before conducting any trials, a large pool of potential targets is assembled, usually in packets of four. Similarity of targets within a packet is kept to a minimum, for reasons made clear below. At the start of an experimental session, after the subject is sequestered in an isolated room, a target is selected at random from the pool. A sender is placed in another room with the target. The subject is asked to provide a verbal or written description of what he or she thinks is in the target, knowing only that it is a photograph, an object, etc.

After the subject's description has been recorded and secured against the potential for later alteration, a judge (who may or may not be the subject) is given a copy of the subject's description and the four possible targets that were in the packet with the correct target. A properly conducted experiment either uses video tapes or has two identical sets of target material and uses the duplicate set for this part of the process, to ensure that clues such as fingerprints don't give away the answer. Based on the subject's description, and of course on a blind basis, the judge is asked to either rank the four choices from most to least likely to have been the target, or to select the one from the four that seems to best match the subject's description. If ranks are used, the statistical analysis proceeds by summing the ranks over a series of trials and comparing the sum to what would be expected by chance. If the selection method is used, a "direct hit" occurs if the correct target is chosen, and the number of direct hits over a series of trials is compared to the number expected in a binomial experiment with p = 0.25.

Note that the subjects' responses cannot be considered to be "random" in any sense, so probability assessments are based on the random selection of the target and decoys. In a correctly designed experiment, the probability of a direct hit by chance is 0.25 on each trial, regardless of the response, and the trials are independent. These and other issues related to analyzing free-response experiments are discussed by Utts (1991).

4.2 The Psi Ganzfeld Experiments

The ganzfeld procedure is a particular kind of free-response experiment utilizing a perceptual isolation technique originally developed by Gestalt psychologists for other purposes. Evidence from spontaneous case studies and experimental work had led parapsychologists to a model proposing that psychic functioning may be masked by sensory input and by inattention to internal states (Honorton, 1977). The ganzfeld procedure was specifically designed to test whether or not reduction of external "noise" would enhance psi performance.

In these experiments, the subject is placed in a comfortable reclining chair in an acoustically shielded room. To create a mild form of sensory deprivation, the subject wears headphones through which white noise is played, and stares into a constant field of red light. This is achieved by taping halved translucent ping-pong balls over the eyes and then illuminating the room with red light. In the psi ganzfeld experiments, the subject speaks into a microphone and attempts to describe the target material being observed by the sender in a distant room.

At the 1982 Annual Meeting of the Parapsychological Association, a debate took place over the degree to which the results of the psi ganzfeld experiments constituted evidence of psi abilities. Psychologist and critic Ray Hyman and parapsychologist Charles Honorton each analyzed the results of all known psi ganzfeld experiments to date, and they reached strikingly different conclusions (Honorton, 1985b; Hyman, 1985b). The debate continued with the publication of their arguments in separate articles in the March 1985 issue of the Journal of Parapsychology. Finally, in the December 1986 issue of the Journal of Parapsychology, Hyman and Honorton (1986) wrote a joint article in which they highlighted their agreements and disagreements and outlined detailed criteria for future experiments. That same issue contained commentaries on the debate by 10 other authors.

The data base analyzed by Hyman and Honorton (1986) consisted of results taken from 34 reports written by a total of 47 authors. Honorton counted 42 separate experiments described in the reports, of which 28 reported enough information to determine the number of direct hits achieved.
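The direct-hit analysis described above lends itself to a simple exact computation. The sketch below (Python) evaluates the one-tailed binomial p-value for a hypothetical series of trials; the counts are invented for illustration and do not come from any study in the data base:

```python
from math import comb

def direct_hit_pvalue(hits, trials, p_chance=0.25):
    """Exact one-tailed binomial p-value P(X >= hits), where each of
    `trials` independent trials is a direct hit with probability p_chance."""
    return sum(comb(trials, k) * p_chance ** k * (1 - p_chance) ** (trials - k)
               for k in range(hits, trials + 1))

# Hypothetical series: 13 direct hits in 30 ganzfeld trials,
# against a chance expectation of 30 * 0.25 = 7.5 hits.
p_value = direct_hit_pvalue(13, 30)
print(round(p_value, 4))
```

Because the target is chosen at random from four possibilities on each trial, this binomial calculation is valid regardless of how nonrandom the subjects' verbal responses are.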
Twenty three of assessments are based on the random selection of the studies (55%) were classified by Honorton as the target and decoys. In a correctly designed ex- having achieved statistical significance at 0.05. periment, the probability of a direct hit by chance is 0.25 on each trial, regardless of the response, and 4.3 The Vote-Counting Debate the trials are independent. These and other issues Vote-counting is the term commonly used for the related to analyzing free-response experiments are technique of drawing inferences about an experi- discussed by Utts (1991). mental effect by counting the number of significant versus nonsignificant studies of the effect. Hedges 4.2 The Psi Gandeld Experiments and Olkin (1985) give a detailed analysis of the The ganzfeld procedure is a particular kind of inadequacy of this method, showing that it is more free-response experiment utilizing a perceptual and more likely to make the wrong decision as the 370 J. UTTS number of studies increases. While Hyman ac- that the number of significant studies (using his knowledged that "vote-counting raises many prob- definition of a study) only dropped from 55% to lems" (Hyman, 1985b, page 8), he nonetheless spent 45%. Next, he proposed that a uniform index of half of his critique of the ganzfeld studies showing success be applied to all studies. He used the num- why Honorton's count of 55% was wrong. ber of direct hits, since it was by far the most Hyman's first complaint was that several of the commonly reported measure and was the measure studies contained multiple conditions, each of which used in the first published psi ganzfeld study. He should be considered as a separate study. Using then conducted a detailed analysis of the 28 studies this definition he counted 80 studies (thus further reporting direct hits and found that 43% were sig- reducing the sample sizes of the individual studies), nificant at 0.05 on that measure alone. Further, he of which 25 (31%) were "successful." 
Honorton's response to this was to invite readers to examine the studies and decide for themselves if the varying conditions constituted separate experiments.

Hyman next postulated that there was selection bias, so that significant studies were more likely to be reported. He raised some important issues about how pilot studies may be terminated and not reported if they don't show significant results, or may at least be subject to optional stopping, allowing the experimenter to determine the number of trials. He also presented a chi-square analysis that "suggests a tendency to report studies with a small sample only if they have significant results" (Hyman, 1985b, page 14), but I have questioned his analysis elsewhere (Utts, 1986, page 397).

Honorton refuted Hyman's argument with four rejoinders (Honorton, 1985b, page 66). In addition to reinterpreting Hyman's chi-square analysis, Honorton pointed out that the Parapsychological Association has an official policy encouraging the publication of nonsignificant results in its journals and proceedings, that a large number of reported ganzfeld studies did not achieve statistical significance and that there would have to be 15 studies in the "file-drawer" for every one reported to cancel out the observed significant results.

The remainder of Hyman's vote-counting analysis consisted of showing that the effective error rate for each study was actually much higher than the nominal 5%. For example, each study could have been analyzed using the direct hit measure, the sum of ranks measure or one of two other measures used for free-response analyses. Hyman carried out a simulation study that showed the true error rate would be 0.22 if "significance" was defined by requiring at least one of these four measures to achieve the 0.05 level. He suggested several other ways in which multiple testing could occur and concluded that the effective error rate in each experiment was not the nominal 0.05, but rather was probably close to the 31% he had determined to be the actual success rate in his vote-count.

Honorton acknowledged that there was a multiple testing problem, but he had a two-fold response. First, he applied a Bonferroni correction and found that the number of significant studies (using his definition of a study) only dropped from 55% to 45%. Next, he proposed that a uniform index of success be applied to all studies. He used the number of direct hits, since it was by far the most commonly reported measure and was the measure used in the first published psi ganzfeld study. He then conducted a detailed analysis of the 28 studies reporting direct hits and found that 43% were significant at 0.05 on that measure alone. Further, he showed that significant effects were reported by six of the 10 independent investigators and thus were not due to just one or two investigators or laboratories. He also noted that success rates were very similar for reports published in refereed journals and those published in unrefereed monographs and abstracts.

While Hyman's arguments identified issues such as selective reporting and optional stopping that should be considered in any meta-analysis, the dependence of significance levels on sample size makes the vote-counting technique almost useless for assessing the magnitude of the effect. Consider, for example, the 24 studies where the direct hit measure was reported and the chance probability of a direct hit was 0.25, the most common type of study in the data base. (There were four direct hit studies with other chance probabilities and 14 that did not report direct hits.) Of the 24 studies, 13 (54%) were "nonsignificant" at α = 0.05, one-tailed. But if the 367 trials in these "failed replications" are combined, there are 106 direct hits, z = 1.66, and p = 0.0485, one-tailed. This is reminiscent of the dilemma of Professor B in Section 3.

Power is typically very low for these studies. The median sample size for the studies reporting direct hits was 28. If there is a real effect and it increases the success probability from the chance 0.25 to an actual 0.33 (a value whose rationale will be made clear below), the power for a study with 28 trials is only 0.181 (Utts, 1986). It should be no surprise that there is a "repeatability" problem in parapsychology.
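Both the pooled z and the power figure quoted above can be checked directly from the reported counts. A sketch (illustrative code, not the original authors' computation):

```python
from math import comb, erf, sqrt

def binom_upper_tail(k, n, p):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

# Pooled "failed replications": 106 direct hits in 367 trials at chance 0.25,
# using a normal approximation with continuity correction.
n, hits, p0 = 367, 106, 0.25
z = (hits - 0.5 - n * p0) / sqrt(n * p0 * (1 - p0))
p_value = 0.5 * (1 - erf(z / sqrt(2)))  # one-tailed; close to the reported 0.0485
print(round(z, 2), round(p_value, 4))

# Exact power of a single 28-trial study: find the smallest hit count that is
# significant at the one-tailed 0.05 level, then evaluate its probability
# under an assumed true hit rate of 0.33.
n_study, alpha, p1 = 28, 0.05, 0.33
k_crit = next(k for k in range(n_study + 1)
              if binom_upper_tail(k, n_study, p0) <= alpha)
power = binom_upper_tail(k_crit, n_study, p1)
print(k_crit, round(power, 3))  # power near the reported 0.181
```

With only 28 trials, a study needs 12 or more hits (a 43% hit rate) to reach significance, which a true 33% hit rate produces less than a fifth of the time; this is the arithmetic behind the "repeatability" remark above.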
4.4 Flaw Analysis and Future Recommendations

The second half of Hyman's paper consisted of a "Meta-Analysis of Flaws and Successful Outcomes" (1985b, page 30), designed to explore whether or not various measures of success were related to specific flaws in the experiments. While many critics have argued that the results in parapsychology can be explained by experimental flaws, Hyman's analysis was the first to attempt to quantify the relationship between flaws and significant results.

Hyman identified 12 potential flaws in the ganzfeld experiments, such as inadequate randomization, multiple tests used without adjusting the significance level (thus inflating the significance level from the nominal 5%) and failure to use a duplicate set of targets for the judging process (thus allowing possible clues such as fingerprints). Using cluster and factor analyses, the 12 binary flaw variables were combined into three new variables, which Hyman named General Security, Statistics and Controls.

Several analyses were then conducted. The one reported with the most detail is a factor analysis utilizing 17 variables for each of 36 studies. Four factors emerged from the analysis. From these, Hyman concluded that security had increased over the years, that the significance level tended to be inflated the most for the most complex studies and that both effect size and level of significance were correlated with the existence of flaws.

Following his factor analysis, Hyman picked the three flaws that seemed to be most highly correlated with success, which were inadequate attention to both randomization and documentation and the potential for ordinary communication between the sender and receiver. A regression equation was then computed using each of the three flaws as dummy variables, and the effect size for the experiment as the dependent variable. From this equation, Hyman concluded that a study without these three flaws would be predicted to have a hit rate of 27%. He concluded that this is "well within the statistical neighborhood of the 25% chance rate" (1985b, page 37), and thus "the ganzfeld psi data base, despite initial impressions, is inadequate either to support the contention of a repeatable study or to demonstrate the reality of psi" (page 38).

Honorton discounted both Hyman's flaw classification and his analysis. He did not deny that flaws existed, but he objected that Hyman's analysis was faulty and impossible to interpret. Honorton asked psychometrician David Saunders to write an Appendix to his article, evaluating Hyman's analysis. Saunders first criticized Hyman's use of a factor analysis with 17 variables (many of which were dichotomous) and only 36 cases and concluded that "the entire analysis is meaningless" (Saunders, 1985, page 87). He then noted that Hyman's choice of the three flaws to include in his regression analysis constituted a clear case of multiple analysis, since there were 84 possible sets of three that could have been selected (out of nine potential flaws), and Hyman chose the set most highly correlated with effect size. Again, Saunders concluded that "any interpretation drawn from [the regression analysis] must be regarded as meaningless" (1985, page 88).

Hyman's results were also contradicted by Harris and Rosenthal (1988b) in an analysis requested by Hyman in his capacity as Chair of the National Academy of Sciences' Subcommittee on Parapsychology. Using Hyman's flaw classifications and a multivariate analysis, Harris and Rosenthal concluded that "Our analysis of the effects of flaws on study outcome lends no support to the hypothesis that ganzfeld research results are a significant function of the set of flaw variables" (1988b, page 3).

Hyman and Honorton were in the process of preparing papers for a second round of debate when they were invited to lunch together at the 1986 Meeting of the Parapsychological Association. They discovered that they were in general agreement on several major issues, and they decided to coauthor a "Joint Communiqué" (Hyman and Honorton, 1986). It is clear from their paper that they both thought it was more important to set the stage for future experimentation than to continue the technical arguments over the current data base. In the abstract to their paper, they wrote:

We agree that there is an overall significant effect in this data base that cannot reasonably be explained by selective reporting or multiple analysis. We continue to differ over the degree to which the effect constitutes evidence for psi, but we agree that the final verdict awaits the outcome of future experiments conducted by a broader range of investigators and according to more stringent standards [page 351].

The paper then outlined what these standards should be. They included controls against any kind of sensory leakage, thorough testing and documentation of randomization methods used, better reporting of judging and feedback protocols, control for multiple analyses and advance specification of number of trials and type of experiment. Indeed, any area of research could benefit from such a careful list of procedural recommendations.

4.5 Rosenthal's Meta-Analysis

The same issue of the Journal of Parapsychology in which the Joint Communiqué appeared also carried commentaries on the debate by 10 separate authors. In his commentary, psychologist Robert Rosenthal, one of the pioneers of meta-analysis in psychology, summarized the aspects of Hyman's and Honorton's work that would typically be included in a meta-analysis (Rosenthal, 1986). It is worth reviewing Rosenthal's results so that they can be used as a basis of comparison for the more recent psi ganzfeld studies reported in Section 5.
Rosenthal, like Hyman and Honorton, focused only on the 28 studies for which direct hits were known. He chose to use an effect size measure called Cohen's h, which is the difference between the arcsine-transformed proportions of direct hits that were observed and expected:

h = 2(arcsin √p1 - arcsin √p2),

where p1 is the observed and p2 the expected proportion of direct hits. One advantage of this measure over the difference in raw proportions is that it can be used to compare experiments with different chance hit rates. If the observed and expected numbers of hits were identical, the effect size would be zero.

Of the 28 studies, 23 (82%) had effect sizes greater than zero, with a median effect size of 0.32 and a mean of 0.28. These correspond to direct hit rates of 0.40 and 0.38 respectively, when 0.25 is expected by chance. A 95% confidence interval for the true effect size is from 0.11 to 0.45, corresponding to direct hit rates of from 0.30 to 0.46 when chance is 0.25.

A common technique in meta-analysis is to calculate a "combined z," found by summing the individual z scores and dividing by the square root of the number of studies. The result should have a standard normal distribution if each z score has a standard normal distribution. For the ganzfeld studies, Rosenthal reported a combined z of 6.60 with a p-value of 3.37 × 10^-11. He also reiterated Honorton's file-drawer assessment by calculating that there would have to be 423 studies unreported to negate the significant effect in the 28 direct hit studies.

Finally, Rosenthal acknowledged that, because of the flaws in the data base and the potential for at least a small file-drawer effect, the true average effect size was probably closer to 0.18 than 0.28. He concluded, "Thus, when the accuracy rate expected under the null is 1/4, we might estimate the obtained accuracy rate to be about 1/3" (1986, page 333). This is the value used for the earlier power calculation.
It is worth mentioning that Rosenthal was commissioned by the National Academy of Sciences to prepare a background paper to accompany its 1988 report on parapsychology. That paper (Harris and Rosenthal, 1988a) contained much of the same analysis as his commentary summarized above. Ironically, the discussion of the ganzfeld work in the National Academy Report focused on Hyman's 1985 analysis, but never mentioned the work it had commissioned Rosenthal to perform, which contradicted the final conclusion in the report.

5. A META-ANALYSIS OF RECENT GANZFELD EXPERIMENTS

After the initial exchange with Hyman at the 1982 Parapsychological Association Meeting, Honorton and his colleagues developed an automated ganzfeld experiment that was designed to eliminate the methodological flaws identified by Hyman. The execution and reporting of the experiments followed the detailed guidelines agreed upon by Hyman and Honorton.

Using this "autoganzfeld" experiment, 11 experimental series were conducted by eight experimenters between February 1983 and September 1989, when the equipment had to be dismantled due to lack of funding. In this section, the results of these experiments are summarized and compared to the earlier ganzfeld studies. Much of the information is derived from Honorton et al. (1990).

5.1 The Automated Ganzfeld Procedure

Like earlier ganzfeld studies, the "autoganzfeld" experiments require four participants. The first is the Receiver (R), who attempts to identify the target material being observed by the Sender (S). The Experimenter (E) prepares R for the task, elicits the response from R and supervises R's judging of the response against the four potential targets. (Judging is double blind; E does not know which is the correct target.) The fourth participant is the lab assistant (LA) whose only task is to instruct the computer to randomly select the target. No one involved in the experiment knows the identity of the target.

Both R and S are sequestered in sound-isolated, electrically shielded rooms. R is prepared as in earlier ganzfeld studies, with white noise and a field of red light. In a nonadjacent room, S watches the target material on a television and can hear R's target description ("mentation") as it is being given. The mentation is also tape recorded.

The judging process takes place immediately after the 30-minute sending period. On a TV monitor in the isolated room, R views the four choices from the target pack that contains the actual target. R is asked to rate each one according to how closely it matches the ganzfeld mentation. The ratings are converted to ranks and, if the correct target is ranked first, a direct hit is scored. The entire process is automatically recorded by the computer. The computer then displays the correct choice to R as feedback.

There were 160 preselected targets, used with replacement, in 10 of the 11 series. They were arranged in packets of four, and the decoys for a given target were always the remaining three in the same set. Thus, even if a particular target in a set were consistently favored by Rs, the probability of a direct hit under the null hypothesis would remain at 1/4. Popular targets should be no more likely to be selected by the computer's random number generator than any of the others in the set. The selection of the target by the computer is the only source of randomness in these experiments. This is an important point, and one that is often misunderstood. (See Utts, 1991, for elucidation.)

Eighty of the targets were "dynamic," consisting of scenes from movies, documentaries and cartoons; 80 were "static," consisting of photographs, art prints and advertisements. The four targets within each set were all of the same type. Earlier studies indicated that dynamic targets were more likely to produce successful results, and one of the goals of the new experiments was to test that theory.

The randomization procedure used to select the target and the order of presentation for judging was thoroughly tested before and during the experiments. A detailed description is given by Honorton et al. (1990, pages 118-120).

Three of the 11 series were pilot series, five were formal series with novice receivers, and three were formal series with experienced receivers. The last series with experienced receivers was the only one that did not use the 160 targets. Instead, it used only one set of four dynamic targets in which one target had previously received several first place ranks and one had never received a first place rank. The receivers, none of whom had had prior exposure to that target pack, were not aware that only one target pack was being used. They each contributed one session only to the series. This will be called the "special series" in what follows.

Except for two of the pilot series, numbers of trials were planned in advance for each series. Unfortunately, three of the formal series were not yet completed when the funding ran out, including the special series, and one pilot study with advance planning was terminated early when the experimenter relocated. There were no unreported trials during the 6-year period under review, so there was no "file drawer." Overall, there were 183 Rs who contributed only one trial and 58 who contributed more than one, for a total of 241 participants and 355 trials. Only 23 Rs had previously participated in ganzfeld experiments, and 194 Rs (81%) had never participated in any parapsychological research.

5.2 Results

While acknowledging that no probabilistic conclusions can be drawn from qualitative data, Honorton et al. (1990) included several examples of session excerpts that Rs identified as providing the basis for their target rating. To give a flavor for the dream-like quality of the mentation and the amount of information that can be lost by only assigning a rank, the first example is reproduced here. The target was a painting by Salvador Dali called "Christ Crucified." The correct target received a first place rank. The part of the mentation R used to make this assessment read:

. . . I think of guides, like spirit guides, leading me and I come into a court with a king. It's quiet. . . . It's like heaven. The king is something like Jesus. Woman. Now I'm just sort of summersaulting through heaven. . . . Brooding. . . . Aztecs, the Sun God. . . . High priest. . . . Fear. . . . Graves. Woman. Prayer. . . . Funeral. . . . Dark. Death. . . . Souls. . . . Ten Commandments. Moses. . . . [Honorton et al., 1990].

Over all 11 series, there were 122 direct hits in the 355 trials, for a hit rate of 34.4% (exact binomial p-value = 0.00005) when 25% were expected by chance. Cohen's h is 0.20, and a 95% confidence interval for the overall hit rate is from 0.30 to 0.39. This calculation assumes, of course, that the probability of a direct hit is constant and independent across trials, an assumption that may be questionable except under the null hypothesis of no psi abilities.
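These overall figures can be approximately reproduced as follows (a sketch; the confidence interval here is the simple normal-approximation interval, which may differ slightly from whatever method Honorton et al. used):

```python
from math import asin, comb, sqrt

n, hits, p0 = 355, 122, 0.25
p_hat = hits / n

# Exact binomial upper-tail p-value for 122 or more hits out of 355 at chance 0.25.
p_value = sum(comb(n, k) * p0**k * (1 - p0)**(n - k) for k in range(hits, n + 1))

# Cohen's h for the observed hit rate against chance.
h = 2 * (asin(sqrt(p_hat)) - asin(sqrt(p0)))

# Approximate (Wald) 95% confidence interval for the overall hit rate.
half_width = 1.96 * sqrt(p_hat * (1 - p_hat) / n)
ci = (p_hat - half_width, p_hat + half_width)

print(f"{p_value:.5f}", round(h, 3), [round(x, 2) for x in ci])
```

All three printed quantities land close to the reported p-value of 0.00005, effect size of 0.20 and interval of 0.30 to 0.39.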
Honorton et al. (1990) also calculated effect sizes for each of the 11 series and each of the eight experimenters. All but one of the series (the first novice series) had positive effect sizes, as did all of the experimenters. The special series with experienced Rs had an exceptionally high effect size with h = 0.81, corresponding to 16 direct hits out of 25 trials (64%), but the remaining series and the experimenters had relatively homogeneous effect sizes given the amount of variability expected by chance. If the special series is removed, the overall hit rate is 32.1%, h = 0.16. Thus, the positive effects are not due to just one series or one experimenter.

Of the 218 trials contributed by novices, 71 were direct hits (32.5%, h = 0.17), compared with 51 hits in the 137 trials by those with prior ganzfeld experience (37%, h = 0.26). The hit rates and effect sizes were 31% (h = 0.14) for the combined pilot series, 32.5% (h = 0.17) for the combined formal novice series, and 41.5% (h = 0.35) for the combined experienced series. The last figure drops to 31.6% if the outlier series is removed. Finally, without the outlier series the hit rate for the combined series where all of the planned trials were completed was 31.2% (h = 0.14), while it was 35% (h = 0.22) for the combined series that were terminated early. Thus, optional stopping cannot account for the positive effect.

There were two interesting comparisons that had been suggested by earlier work and were preplanned in these experiments. The first was to compare results for trials with dynamic targets with those for static targets. In the 190 dynamic target sessions there were 77 direct hits (40%, h = 0.32) and for the static targets there were 45 hits in 165 trials (27%, h = 0.05), thus indicating that dynamic targets produced far more successful results.

The second comparison of interest was whether or not the sender was a friend of the receiver. This was a choice the receiver could make. If he or she did not bring a friend, a lab member acted as sender. There were 211 trials with friends as senders (some of whom were also lab staff), resulting in 76 direct hits (36%, h = 0.24). Four trials used no sender. The remaining 140 trials used nonfriend lab staff as senders and resulted in 46 direct hits (33%, h = 0.18). Thus, trials with friends as senders were slightly more successful than those without.
Consonant with the definition of replication based on consistent effect sizes, it is informative to compare the autoganzfeld experiments with the direct hit studies in the previous data base. The overall success rates are extremely similar. The overall direct hit rate was 34.4% for the autoganzfeld studies and was 38% for the comparable direct hit studies in the earlier meta-analysis. Rosenthal's (1986) adjustment for flaws had placed a more conservative estimate at 33%, very close to the observed 34.4% in the new studies.

One limitation of this work is that the autoganzfeld studies, while conducted by eight experimenters, all used the same equipment in the same laboratory. Unfortunately, the level of funding available in parapsychology and the cost in time and equipment to conduct proper experiments make it difficult to amass large amounts of data across laboratories. Another autoganzfeld laboratory is currently being constructed at the University of Edinburgh in Scotland, so interlaboratory comparisons may be possible in the near future. Based on the effect size observed to date, large samples are needed to achieve reasonable power. If there is a constant effect across all trials, resulting in 33% direct hits when 25% are expected by chance, to achieve a one-tailed significance level of 0.05 with 95% probability would require 345 sessions.

We end this section by returning to the aspirin and heart attack example in Section 3 and expanding a comparison noted by Atkinson, Atkinson, Smith and Bem (1990, page 237). Computing the equivalent of Cohen's h for comparing observed heart attack rates in the aspirin and placebo groups results in h = 0.068. Thus, the effect size observed in the ganzfeld data base is triple the much publicized effect of aspirin on heart attacks.
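The 345-session figure follows from the standard normal-approximation sample-size formula for a one-sample test of a proportion, and the aspirin comparison is the same Cohen's h computation. In the sketch below, the aspirin and placebo heart attack rates (roughly 1.71% for placebo and 0.94% for aspirin) are an assumption based on the commonly cited trial, not figures quoted in this paper:

```python
from math import asin, ceil, sqrt

def cohens_h(p1, p2):
    """Cohen's h: difference of arcsine-transformed proportions."""
    return 2 * (asin(sqrt(p1)) - asin(sqrt(p2)))

# Sessions needed for one-tailed alpha = 0.05 and power = 0.95, using the
# normal-approximation formula for a one-sample test of a proportion.
p0, p1 = 0.25, 0.33
z_alpha = z_beta = 1.645  # 95th percentile of the standard normal
n = ((z_alpha * sqrt(p0 * (1 - p0)) + z_beta * sqrt(p1 * (1 - p1)))
     / (p1 - p0)) ** 2
print(ceil(n))  # about 345 sessions

# Cohen's h for placebo versus aspirin heart attack rates
# (the two rates below are assumed, as noted above).
print(round(cohens_h(0.0171, 0.0094), 3))  # about 0.068
```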
6. OTHER META-ANALYSES IN PARAPSYCHOLOGY

Four additional meta-analyses have been conducted in various areas of parapsychology since the original ganzfeld meta-analyses were reported. Three of the four analyses focused on evidence of psi abilities, while the fourth examined the relationship between extroversion and psychic functioning. In this section, each of the four analyses will be briefly summarized.

There are only a handful of English-language journals and proceedings in parapsychology, so retrieval of the relevant studies in each of the four cases was simple to accomplish by searching those sources in detail and by searching other bibliographic data bases for keywords. Each analysis included an overall summary, an analysis of the quality of the studies versus the size of the effect and a "file-drawer" analysis to determine the possible number of unreported studies. Three of the four also contained comparisons across various conditions.

6.1 Forced-Choice Precognition Experiments

Honorton and Ferrari (1989) analyzed forced-choice experiments conducted from 1935 to 1987, in which the target material was randomly selected after the subject had attempted to predict what it would be. The time delay in selecting the target ranged from under a second to one year. Target material included items as diverse as ESP cards and automated random number generators. Two investigators, S. G. Soal and Walter J. Levy, were not included because some of their work has been suspected to be fraudulent.

Overall Results. There were 309 studies reported by 62 senior authors, including more than 50,000 subjects and nearly two million individual trials. Honorton and Ferrari used z/√n as the measure of effect size (ES) for each study, where n was the number of Bernoulli trials in the study. They reported a mean ES of 0.020, and a mean z-score of 0.65 over all studies. They also reported a combined z of 11.41, p = 6.3 × 10^-25. Some 30% (92) of the studies were statistically significant at α = 0.05. The mean ES per investigator was 0.033, and the significant results were not due to just a few investigators.

Quality. Eight dichotomous quality measures were assigned to each study, resulting in possible scores from zero for the lowest quality, to eight for the highest. They included features such as adequate randomization, preplanned analysis and automated recording of the results. The correlation between study quality and effect size was 0.081, indicating a slight tendency for higher quality studies to be more successful, contrary to claims by critics that the opposite would be true. There was a clear relationship between quality and year of publication, presumably because over the years experimenters in parapsychology have responded to suggestions from critics for improving their methodology.

File Drawer. Following Rosenthal (1984), the authors calculated the "fail-safe N" indicating the number of unreported studies that would have to be sitting in file drawers in order to negate the significant effect. They found N = 14,268, or a ratio of 46 unreported studies for each one reported. They also followed a suggestion by Dawes, Landman and Williams (1984) and computed the mean z for all studies with z > 1.65. If such studies were a random sample from the upper 5% tail of a N(0,1) distribution, the mean z would be 2.06. In this case it was 3.61. They concluded that selective reporting could not explain these results.
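Rosenthal's fail-safe N has a simple closed form. The sketch below applies it to the reported combined z; because the sum of study z scores is reconstructed here from the rounded combined z, the result only approximates the published 14,268:

```python
from math import sqrt

def fail_safe_n(z_combined, k, z_alpha=1.645):
    """Number of unreported null-result studies needed to drag the
    combined z down to the one-tailed 0.05 threshold (Rosenthal, 1984)."""
    sum_z = z_combined * sqrt(k)  # total of the study z scores
    return sum_z**2 / z_alpha**2 - k

# Reported values for the 309 forced-choice precognition studies.
n_fs = fail_safe_n(11.41, 309)
print(round(n_fs))  # same order of magnitude as the published 14,268
```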
Comparisons. Four variables were identified that appeared to have a systematic relationship to study outcome. The first was that the 25 studies using subjects selected on the basis of good past performance were more successful than the 223 using unselected subjects, with mean effect sizes of 0.051 and 0.008, respectively. Second, the 97 studies testing subjects individually were more successful than the 105 studies that used group testing; mean effect sizes were 0.021 and 0.004, respectively. Timing of feedback was the third moderating variable, but information was only available for 104 studies. The 15 studies that never told the subjects what the targets were had a mean effect size of -0.001. Feedback after each trial produced the best results; the mean ES for the 47 studies was 0.035. Feedback after each set of trials resulted in a mean ES of 0.023 (21 studies), while delayed feedback (also 21 studies) yielded a mean ES of only 0.009. There is a clear ordering; as the gap between time of feedback and time of the actual guesses decreased, effect sizes increased.

The fourth variable was the time interval between the subject's guess and the actual target selection, available for 144 studies. The best results were for the 31 studies that generated targets less than a second after the guess (mean ES = 0.045), while the worst were for the seven studies that delayed target selection by at least a month (mean ES = 0.001). The mean effect sizes showed a clear trend, decreasing in order as the time interval increased from minutes to hours to days to weeks to months.

6.2 Attempts to Influence Random Physical Systems

Radin and Nelson (1989) examined studies designed to test the hypothesis that "The statistical output of an electronic RNG [random number generator] is correlated with observer intention in accordance with prespecified instructions" (page 1502). These experiments typically involve RNGs based on radioactive decay, electronic noise or pseudorandom number sequences seeded with true random sources. Usually the subject is instructed to try to influence the results of a string of binary trials by mental intention alone. A typical protocol would ask a subject to press a button (thus starting the collection of a fixed-length sequence of bits), and then try to influence the random source to produce more zeroes or more ones. A run might consist of three successive button presses, one each in which the desired result was more zeroes or more ones, and one as a control with no conscious intention. A z score would then be computed for each button press.

The 832 studies in the analysis were conducted from 1959 to 1987 and included 235 "control" studies, in which the output of the RNGs was recorded but there was no conscious intention involved. These were usually conducted before and during the experimental series, as tests of the RNGs.

Results. The effect size measure used was again z/√n, where z was positive if more bits of the specified type were achieved. The mean effect size for control studies was not significantly different from zero (-1.0 × 10^-5). The mean effect size for the experimental studies was also very small, 3.2 × 10^-4, but it was significantly higher than the mean ES for the control studies (z = 4.1).

Quality. Sixteen quality measures were defined and assigned to each study, under the four general categories of procedures, statistics, data and the RNG device. A score of 16 reflected the highest quality. The authors regressed mean effect size on mean quality for each investigator and found a slope of 2.5 × 10^-5 with standard error of 3.2 × 10^-5, indicating little relationship between quality and outcome. They also calculated a weighted mean effect size, using quality scores as weights, and found that it was very similar to the unweighted mean ES. They concluded that "differences in methodological quality are not significant predictors of effect size" (page 1507).

File Drawer. Radin and Nelson used several methods for estimating the number of unreported studies (pages 1508-1510). Their estimates ranged from 200 to 1000 based on models assuming that all significant studies were reported. They calculated the fail-safe N to be 54,000.

6.3 Attempts to Influence Dice

Radin and Ferrari (1991) examined 148 studies, published from 1935 to 1987, designed to test whether or not consciousness can influence the results of tossing dice. They also found 31 "control" studies in which no conscious intention was involved.
The weighted mean ES for Honorton, Ferrari and Bem (1991) conducted a the experimental studies was 0.0122 with a stan- meta-analysis to examine the relationship between dard error of 0.00062; for the control studies the scores on tests of extroversion and scores on mean and standard error were 0.00093 and 0.00255, psi-related tasks. They found 60 studies by 17 respectively. Weights for each study were de- investigators, conducted from 1945 to 1983. termined by quality, giving more weight to high- Results. The effect size measure used for this quality studies. Combined z scores for the exper- analysis was the correlation between each subject's imental and control studies were reported by Radin extroversion score and ESP score. A variety of and Ferrari to be 18.2 and 0.18, respectively. measures had been used for both scores across stud- Quality. Eleven dichotomous quality measures ies, so various correlation coefficients were used. were assigned, ranging from automated recording Nonetheless, a stem and leaf diagram of the corre- to whether or not control studies were interspersed lations showed an approximate bell shape with with the experimental studies. The final quality mean and standard deviation of 0.19 and 0.26, score for each study combined these with informa- respectively, and with an additional outlier at r = tion on method of tossing the dice, and with source 0.91. Honorton et al. reported that when weighted of subject (defined below). A regression of quality by degrees of freedom, the weighted mean r was score versus effect size resulted in a slope of -0.002, 0.14, with a 95% confidence interval covering 0.10 with a standard error of 0.0011. However, when to 0.19. effect sizes were weighted by sample size, there was Forced-Choice versus Free-Response Re- a significant relationship between quality and ef- sults. Because forced-choice and free-response tests fect size, leading Radin and Ferrari to conclude differ qualitatively, Honorton et al. 
chose to exam- that higher-quality studies produced lower weighted ine their relationship to extroversion separately. effect sizes. They found that for free-response studies there was File Drawer. Radin and Ferrari calculated a significant correlation between extroversion and Rosenthal's fail-safe, N for this analysis to be ESP scores, with mean r = 0.20 and z = 4.46. Fur- 17,974. Using the assumption that all significant ther, this effect was homogeneous across both studies were reported, they estimated the number investigators and extroversion scales. of unreported studies to be 1152..As a final assess- For forced-choice studies, there was a significant ment, they compared studies published before and correlation between ESP and extroversion, but only after 1975, when the Journal of Parapsychology for those studies that reported the ESP results adopted an official policy of publishing nonsigni- to the subjects before measuring extroversion. ficant results. They concluded, based on that an- Honorton et al. speculated that the relationship alysis, that more nonsignificant studies were was an artifact, in which extroversion scores published after 1975, and thus "We must consi- were temporarily inflated as a result of positive der the overall (1935-1987) data base as suspect feedback on ESP performance. with respect to the filedrawer problem." Confirmation with New Data Following the Comparisons. Radin and Ferrari noted that extroversion/ESP meta-analysis, Honorton et al. there was bias in both the experimental and control attempted to confirm the relationship using studies across die face. Six was the face most likely the autoganzfeld data base. Extroversion scores to come up, consistent with the observation that it based on the Myers-Briggs Type Indicator were has the least mass. Therefore, they examined re- available for 221 of the 241 subjects who had sults for the subset of 69 studies in which targets participated in autoganzfeld studies. 
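Several quantities recurring in these meta-analyses (the z/√n effect size, Rosenthal's fail-safe N, and a weighted mean correlation) are simple to compute. A minimal sketch follows; it is not the authors' code, the inputs are hypothetical, and the standard-error formula for the weighted mean r is one common convention, not necessarily the one used in the studies above:

```python
from statistics import NormalDist
from math import sqrt

def effect_size(hits, n, p0):
    """z/sqrt(n) effect size for n trials with chance hit probability p0."""
    z = (hits - n * p0) / sqrt(n * p0 * (1 - p0))
    return z / sqrt(n)

def fail_safe_n(z_scores, alpha=0.05):
    """Rosenthal's fail-safe N: number of unreported null (z = 0) studies
    needed to pull the Stouffer combined z below the one-tailed cutoff."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)   # 1.645 for alpha = 0.05
    return (sum(z_scores) / z_alpha) ** 2 - len(z_scores)

def weighted_mean_r(rs, dfs):
    """Mean correlation weighted by degrees of freedom, with a rough
    normal-approximation 95% interval using var(r) ~ (1 - r^2)^2 / df."""
    w = sum(dfs)
    r_bar = sum(r * df for r, df in zip(rs, dfs)) / w
    se = sqrt(sum(df * (1 - r ** 2) ** 2 for r, df in zip(rs, dfs))) / w
    return r_bar, (r_bar - 1.96 * se, r_bar + 1.96 * se)

# Hypothetical inputs, for illustration only
print(effect_size(122, 355, 0.25))
print(fail_safe_n([2.0] * 10))
print(weighted_mean_r([0.1, 0.2, 0.3], [20, 50, 30]))
```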
The correlation between extroversion scores and ganzfeld rating scores was r = 0.18, with a 95% confidence interval from 0.05 to 0.30. This is consistent with the mean correlation of r = 0.20 for free-response experiments, determined from the meta-analysis. These correlations indicate that extroverted subjects can produce higher scores in free-response ESP tests.

7. CONCLUSIONS

Parapsychologists often make a distinction between "proof-oriented research" and "process-oriented research." The former is typically conducted to test the hypothesis that psi abilities exist, while the latter is designed to answer questions about how psychic functioning works. Proof-oriented research has dominated the literature in parapsychology. Unfortunately, many of the studies used small samples and would thus be nonsignificant even if a moderate-sized effect exists.

The recent focus on meta-analysis in parapsychology has revealed that there are small but consistently nonzero effects across studies, experimenters and laboratories. The sizes of the effects in forced-choice studies appear to be comparable to those reported in some medical studies that had been heralded as breakthroughs. (See Section 5; also Honorton and Ferrari, 1989, page 301.) Free-response studies show effect sizes of far greater magnitude.

A promising direction for future process-oriented research is to examine the causes of individual differences in psychic functioning. The ESP/extroversion meta-analysis is a step in that direction. In keeping with the idea of individual differences, Bayes and empirical Bayes methods would appear to make more sense than the classical inference methods commonly used, since they would allow individual abilities and beliefs to be modeled. Jefferys (1990) reported a Bayesian analysis of some of the RNG experiments and showed that conclusions were closely tied to prior beliefs even though hundreds of thousands of trials were available.

It may be that the nonzero effects observed in the meta-analyses can be explained by something other than ESP, such as shortcomings in our understanding of randomness and independence. Nonetheless, there is an anomaly that needs an explanation. As I have argued elsewhere (Utts, 1987), research in parapsychology should receive more support from the scientific community. If ESP does not exist, there is little to be lost by erring in the direction of further research, which may in fact uncover other anomalies. If ESP does exist, there is much to be lost by not doing process-oriented research, and much to be gained by discovering how to enhance and apply these abilities to important world problems.

ACKNOWLEDGMENTS

I would like to thank Deborah Delanoy, Charles Honorton, Wesley Johnson, Scott Plous and an anonymous reviewer for their helpful comments on an earlier draft of this paper, and Robert Rosenthal and Charles Honorton for discussions that helped clarify details.

REFERENCES

Atkinson, R. L., Atkinson, R. C., Smith, E. E. and Bem, D. J. (1990). Introduction to Psychology, 10th ed. Harcourt Brace Jovanovich, San Diego.
Beloff, J. (1985). Research strategies for dealing with unstable phenomena. In The Repeatability Problem in Parapsychology (B. Shapin and L. Coly, eds.) 1-21. Parapsychology Foundation, New York.
Blackmore, S. J. (1985). Unrepeatability: Parapsychology's only finding. In The Repeatability Problem in Parapsychology (B. Shapin and L. Coly, eds.) 183-206. Parapsychology Foundation, New York.
Burdick, D. S. and Kelly, E. F. (1977). Statistical methods in parapsychological research. In Handbook of Parapsychology (B. B. Wolman, ed.) 81-130. Van Nostrand Reinhold, New York.
Camp, B. H. (1937). (Statement in Notes Section.) Journal of Parapsychology 1 305.
Cohen, J. (1990). Things I have learned (so far). American Psychologist 45 1304-1312.
Coover, J. E. (1917). Experiments in Psychical Research at Leland Stanford Junior University. Stanford Univ.
Dawes, R. M., Landman, J. and Williams, J. (1984). Reply to Kurosawa. American Psychologist 39 74-75.
Diaconis, P. (1978). Statistical problems in ESP research. Science 201 131-136.
Dommeyer, F. C. (1975). Psychical research at Stanford University. Journal of Parapsychology 39 173-205.
Druckman, D. and Swets, J. A., eds. (1988). Enhancing Human Performance: Issues, Theories, and Techniques. National Academy Press, Washington, D.C.
Edgeworth, F. Y. (1885). The calculus of probabilities applied to psychical research. In Proceedings of the Society for Psychical Research 3 190-199.
Edgeworth, F. Y. (1886). The calculus of probabilities applied to psychical research. II. In Proceedings of the Society for Psychical Research 4 189-208.
Feller, W. K. (1940). Statistical aspects of ESP. Journal of Parapsychology 4 271-297.
Feller, W. K. (1968). An Introduction to Probability Theory and Its Applications 1, 3rd ed. Wiley, New York.
Fisher, R. A. (1924). A method of scoring coincidences in tests with playing cards. In Proceedings of the Society for Psychical Research 34 181-185.
Fisher, R. A. (1929). The statistical method in psychical research. In Proceedings of the Society for Psychical Research 39 189-192.
Gallup, G. H., Jr., and Newport, F. (1991). Belief in paranormal phenomena among adult Americans. Skeptical Inquirer 15 137-146.
Gardner, M. J. and Altman, D. G. (1986). Confidence intervals rather than p-values: Estimation rather than hypothesis testing. British Medical Journal 292 746-750.
Gilmore, J. B. (1989). Randomness and the search for psi. Journal of Parapsychology 53 309-340.
Gilmore, J. B. (1990). Anomalous significance in pararandom and psi-free domains. Journal of Parapsychology 54 53-58.
Greeley, A. (1987). Mysticism goes mainstream. American Health 7 47-49.
Greenhouse, J. B. and Greenhouse, S. W. (1988). An aspirin a day...? Chance 1 24-31.
Greenwood, J. A. and Stuart, C. E. (1940). A review of Dr. Feller's critique. Journal of Parapsychology 4 299-319.
Hacking, I. (1988). Telepathy: Origins of randomization in experimental design. Isis 79 427-451.
Hansel, C. E. M. (1980). ESP and Parapsychology: A Critical Re-evaluation. Prometheus Books, Buffalo, N.Y.
Harris, M. J. and Rosenthal, R. (1988a). Interpersonal Expectancy Effects and Human Performance Research. National Academy Press, Washington, D.C.
Harris, M. J. and Rosenthal, R. (1988b). Postscript to Interpersonal Expectancy Effects and Human Performance Research. National Academy Press, Washington, D.C.
Hedges, L. V. and Olkin, I. (1985). Statistical Methods for Meta-Analysis. Academic, Orlando, Fla.
Honorton, C. (1977). Psi and internal attention states. In Handbook of Parapsychology (B. B. Wolman, ed.) 435-472. Van Nostrand Reinhold, New York.
Honorton, C. (1985a). How to evaluate and improve the replicability of parapsychological effects. In The Repeatability Problem in Parapsychology (B. Shapin and L. Coly, eds.) 238-255. Parapsychology Foundation, New York.
Honorton, C. (1985b). Meta-analysis of psi ganzfeld research: A response to Hyman. Journal of Parapsychology 49 51-91.
Honorton, C., Berger, R. E., Varvoglis, M. P., Quant, M., Derr, P., Schechter, E. I. and Ferrari, D. C. (1990). Psi communication in the ganzfeld: Experiments with an automated testing system and a comparison with a meta-analysis of earlier studies. Journal of Parapsychology 54 99-139.
Honorton, C. and Ferrari, D. C. (1989). "Future telling": A meta-analysis of forced-choice precognition experiments, 1935-1987. Journal of Parapsychology 53 281-308.
Honorton, C., Ferrari, D. C. and Bem, D. J. (1991). Extraversion and ESP performance: A meta-analysis and a new confirmation. Research in Parapsychology 1990. The Scarecrow Press, Metuchen, N.J. To appear.
Hyman, R. (1985a). A critical overview of parapsychology. In A Skeptic's Handbook of Parapsychology (P. Kurtz, ed.) 1-96. Prometheus Books, Buffalo, N.Y.
Hyman, R. (1985b). The ganzfeld psi experiment: A critical appraisal. Journal of Parapsychology 49 3-49.
Hyman, R. and Honorton, C. (1986). Joint communique: The psi ganzfeld controversy. Journal of Parapsychology 50 351-364.
Iversen, G. R., Longcor, W. H., Mosteller, F., Gilbert, J. P. and Youtz, C. (1971). Bias and runs in dice throwing and recording: A few million throws. Psychometrika 36 1-19.
Jefferys, W. H. (1990). Bayesian analysis of random event generator data. Journal of Scientific Exploration 4 153-169.
Lindley, D. V. (1957). A statistical paradox. Biometrika 44 187-192.
Mauskopf, S. H. and McVaugh, M. (1979). The Elusive Science: Origins of Experimental Psychical Research. Johns Hopkins Univ. Press.
McVaugh, M. R. and Mauskopf, S. H. (1976). J. B. Rhine's Extrasensory Perception and its background in psychical research. Isis 67 161-189.
Neuliep, J. W., ed. (1990). Handbook of replication research in the behavioral and social sciences. Journal of Social Behavior and Personality 5 (4) 1-510.
Office of Technology Assessment (1989). Report of a workshop on experimental parapsychology. Journal of the American Society for Psychical Research 83 317-339.
Palmer, J. (1989). A reply to Gilmore. Journal of Parapsychology 53 341-344.
Palmer, J. (1990). Reply to Gilmore: Round two. Journal of Parapsychology 54 59-61.
Palmer, J. A., Honorton, C. and Utts, J. (1989). Reply to the National Research Council study on parapsychology. Journal of the American Society for Psychical Research 83 31-49.
Radin, D. I. and Ferrari, D. C. (1991). Effects of consciousness on the fall of dice: A meta-analysis. Journal of Scientific Exploration 5 61-83.
Radin, D. I. and Nelson, R. D. (1989). Evidence for consciousness-related anomalies in random physical systems. Foundations of Physics 19 1499-1514.
Rao, K. R. (1985). Replication in conventional and controversial sciences. In The Repeatability Problem in Parapsychology (B. Shapin and L. Coly, eds.) 22-41. Parapsychology Foundation, New York.
Rhine, J. B. (1934). Extrasensory Perception. Boston Society for Psychical Research, Boston. (Reprinted by Branden Press, 1964.)
Rhine, J. B. (1977). History of experimental studies. In Handbook of Parapsychology (B. B. Wolman, ed.) 25-47. Van Nostrand Reinhold, New York.
Richet, C. (1884). La suggestion mentale et le calcul des probabilites. Revue Philosophique 18 608-674.
Rosenthal, R. (1984). Meta-Analytic Procedures for Social Research. Sage, Beverly Hills.
Rosenthal, R. (1986). Meta-analytic procedures and the nature of replication: The ganzfeld debate. Journal of Parapsychology 50 315-336.
Rosenthal, R. (1990a). How are we doing in soft psychology? American Psychologist 45 775-777.
Rosenthal, R. (1990b). Replication in behavioral research. Journal of Social Behavior and Personality 5 1-30.
Saunders, D. R. (1985). On Hyman's factor analysis. Journal of Parapsychology 49 86-88.
Shapin, B. and Coly, L., eds. (1985). The Repeatability Problem in Parapsychology. Parapsychology Foundation, New York.
Spencer-Brown, G. (1957). Probability and Scientific Inference. Longmans Green, London and New York.
Stuart, C. E. and Greenwood, J. A. (1937). A review of criticisms of the mathematical evaluation of ESP data. Journal of Parapsychology 1 295-304.
Tversky, A. and Kahneman, D. (1982). Belief in the law of small numbers. In Judgment Under Uncertainty: Heuristics and Biases (D. Kahneman, P. Slovic and A. Tversky, eds.) 23-31. Cambridge Univ. Press.
Utts, J. (1986). The ganzfeld debate: A statistician's perspective. Journal of Parapsychology 50 395-402.
Utts, J. (1987). Psi, statistics, and society. Behavioral and Brain Sciences 10 615-616.
Utts, J. (1988). Successful replication versus statistical significance. Journal of Parapsychology 52 305-320.
Utts, J. (1989). Randomness and randomization tests: A reply to Gilmore. Journal of Parapsychology 53 345-351.
Utts, J. (1991). Analyzing free-response data: A progress report. In Psi Research Methodology: A Re-examination (L. Coly, ed.). Parapsychology Foundation, New York. To appear.
Wilks, S. S. (1965a). Statistical aspects of experiments in telepathy. N.Y. Statistician 16 (6) 1-3.
Wilks, S. S. (1965b). Statistical aspects of experiments in telepathy. N.Y. Statistician 16 (7) 4-6.

Comment

M. J. Bayarri and James Berger

M. J. Bayarri is Titular Professor, Department of Statistics and Operations Research, University of Valencia, Avenida Dr. Moliner 50, 46100 Burjassot, Valencia, Spain. James Berger is the Richard M. Brumfield Distinguished Professor of Statistics, Purdue University, West Lafayette, Indiana 47907.

1. INTRODUCTION

There are many fascinating issues discussed in this paper. Several concern parapsychology itself and the interpretation of statistical methodology therein. We are not experts in parapsychology, and so have only one comment concerning such matters: in Section 3 we briefly discuss the need to switch from P-values to Bayes factors in discussing evidence concerning parapsychology.

A more general issue raised in the paper is that of replication. It is quite illuminating to consider the issue of replication from a Bayesian perspective, and this is done in Section 2 of our discussion.

2. REPLICATION

Many insightful observations concerning replication are given in the article, and these spurred us to determine if they could be quantified within Bayesian reasoning. Quantification requires clear delineation of the possible purposes of replication, and at least two are obvious. The first is simple reduction of random error, achieved by obtaining more observations from the replication. The second purpose is to search for possible bias in the original experiment. We use "bias" in a loose sense here, to refer to any of the huge number of ways in which the effects being measured by the experiment can differ from the actual effects of interest. Thus a clinical trial without a placebo can suffer a placebo "bias"; a survey can suffer a "bias" due to the sampling frame being unrepresentative of the actual population; and possible sources of bias in parapsychological experiments have been extensively discussed.

Replication to Reduce Random Error

If the sole goal of replication of an experiment is to reduce random error, matters are very straightforward. Reviewing the Bayesian way of studying this issue is, however, useful and will be done through the following simple example.

EXAMPLE 1. Consider the example from Tversky and Kahneman (1982), in which an experiment results in a standardized test statistic of z1 = 2.46. (We will assume normality to keep computations trivial.) The question is: What is the highest value of z2 in a second set of data that would be considered a failure to replicate? Two possible precise versions of this question are: Question 1: What is the probability of observing a z2 for which the null hypothesis would be rejected in the replicated experiment? Question 2: What value of z2 would leave one's overall opinion about the null hypothesis unchanged?

Consider the simple case where Z1 ~ N(z1 | θ, 1) and (independently) Z2 ~ N(z2 | θ, 1), where θ is the mean and 1 is the standard deviation of the normal distribution. Note that we are considering the case in which no experimental bias is suspected, and so the means for each experiment are assumed to be the same. Suppose that it is desired to test H0: θ ≤ 0 versus H1: θ > 0, and suppose that initial prior opinion about θ can be described by the noninformative prior π(θ) = 1. We consider the one-sided testing problem with a constant prior in this section, because it is known that then the posterior probability of H0, to be denoted by P(H0 | data), equals the P-value, allowing us to avoid complications arising from differences between Bayesian and classical answers.

After observing z1 = 2.46, the posterior distribution of θ is

    θ | z1 ~ N(z1, 1).

Question 1 then has the answer (using predictive Bayesian reasoning)

    P(rejecting at level α | z1) = Φ((z1 - c_α)/√2),

where Φ is the standard normal cdf and c_α is the (one-sided) critical value corresponding to the level, α, of the test. For instance, if α = 0.05, then this probability equals 0.7178, demonstrating that there is a quite substantial probability that the second experiment will fail to reject. If α is chosen to be the observed significance level from the first experiment, so that c_α = z1, then the probability that the second experiment will reject is just 1/2. This is nothing but a statement of the well-known martingale property of Bayesianism, that what you "expect" to see in the future is just what you know today. In a sense, therefore, Question 1 is exposed as being uninteresting.

Question 2 more properly focuses on the fact that the stated goal of replication here is simply to reduce uncertainty in stated conclusions. The answer to the question follows immediately from noting that the posterior from the combined data (z1, z2) is

    θ | z1, z2 ~ N((z1 + z2)/2, 1/2),

so that

    P(H0 | z1, z2) = Φ(-(z1 + z2)/√2).

Setting this equal to P(H0 | z1) and solving for z2 yields z2 = (√2 - 1)z1 = 1.02. Any value of z2 greater than this will increase the total evidence against H0, while any value smaller than 1.02 will decrease the evidence.

Replication to Detect Bias

The aspirin example dramatically raises the issue of bias detection as a motive for replication. Professor Utts observes that replication 1 gives results that are fully compatible with those of the original study, which could be interpreted as suggesting that there is no bias in the original study, while replication 2 would raise serious concerns of bias. We became very interested in the implicit suggestion that replication 2 would thus lead to less overall evidence against the null hypothesis than would replication 1, even though in isolation replication 2 was much more "significant" than was replication 1. In attempting to see if this is so, we considered the Bayesian approach to the study of bias within the framework of the aspirin example.
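The two numerical answers in Example 1 (the 0.7178 rejection probability and the z2 = 1.02 threshold) can be checked directly; a minimal sketch, assuming the same normal model and flat prior:

```python
from statistics import NormalDist
from math import sqrt

nd = NormalDist()
z1, alpha = 2.46, 0.05

# Question 1: predictively, Z2 | z1 ~ N(z1, 2), so the probability that the
# replication rejects at level alpha is Phi((z1 - c_alpha) / sqrt(2)).
c_alpha = nd.inv_cdf(1 - alpha)
p_reject = nd.cdf((z1 - c_alpha) / sqrt(2))
print(round(p_reject, 4))          # 0.7178, as in the text

# Question 2: P(H0 | z1, z2) = Phi(-(z1 + z2)/sqrt(2)) equals
# P(H0 | z1) = Phi(-z1) exactly when z2 = (sqrt(2) - 1) * z1.
z2_threshold = (sqrt(2) - 1) * z1
print(round(z2_threshold, 2))      # 1.02
```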

EXAMPLE 2. For simplicity in the aspirin example, we reduce consideration to

δ = true difference in heart attack rates between aspirin and placebo populations, multiplied by 1000;
Y = difference in observed heart attack rates between aspirin and placebo groups in the original study, multiplied by 1000;
Xi = difference in observed heart attack rates between aspirin and placebo groups in Replication i, multiplied by 1000.

We assume that the replication studies are extremely well designed and implemented, so that one is very confident that the Xi have mean δ. Using normal approximations for convenience, the data can be summarized as

    Xi ~ N(δ, σi²),  i = 1, 2,

with actual observations x1 = 7.704 and x2 = 13.07.

Consider now the bias issue. We assume that the original experiment is somewhat suspect in this regard, and we will model bias by defining the mean of Y to be δ + β, where β is the unknown bias. Then the data in the original experiment can be summarized by

    Y ~ N(δ + β, (1.54)²),

with the actual observation being y = 7.707. Bayesian analysis requires specification of a prior distribution, π(β), for the suspected amount of bias. Of particular interest then are the posterior distribution of β, assuming replication i has been performed, given by

    π(β | y, xi) ∝ π(β) exp{-(β - (y - xi))² / (2(1.54² + σi²))},

where σi is the standard deviation (4.82 or 3.63) from replication i, and the posterior probability of H0, given by

    P(H0 | y, xi) = ∫ P(H0 | y, xi, β) π(β | y, xi) dβ.

Recall that our goal here was to see if Bayesian analysis can reproduce the intuition that the original experiment could be trusted if replication 1 had been done, while it could not be trusted (in spite of its much larger sample size) had replication 2 been performed. Establishing this requires finding a prior distribution π(β) for which π(β | y, x1) has little effect on P(H0 | y, x1), but π(β | y, x2) has a large effect on P(H0 | y, x2). To achieve the first objective, π(β) must be tightly concentrated near zero. To achieve the second, π(β) must be such that a large |y - xi|, which suggests the presence of a large bias, can result in a substantial shift of posterior mass for β away from zero.

A sensible candidate for the prior density π(β) is the Cauchy(0, V) density

    π(β) = 1 / (πV[1 + (β/V)²]).

Flat-tailed densities, such as this, are well known to have the property that when discordant data are observed (e.g., when |y - xi| is large), substantial mass shifts away from the prior center towards the likelihood center. It is easy to see that a normal prior for β cannot have the desired behavior.

Our first surprise in consideration of these priors was how small V needed to be chosen in order for P(H0 | y, x1) to be unaffected by the bias. For instance, even with V = 1.54/100 (recall that 1.54 was the standard deviation of Y from the original experiment), computation yields P(H0 | y, x1) = 4.3 x 10^-5, compared with the P-value (and posterior probability from the original experiment assuming no bias) of 2.8 x 10^-7. There is a clear lesson here; even very small suspicions of bias can drastically alter a small P-value. Note that replication 1 is very consistent with the presence of no bias, and so the posterior distribution for the bias remains tightly concentrated near zero; for instance, the mean of the posterior for β is then 7.2 x 10^-4 and the standard deviation is 0.25.

When we turned attention to replication 2, we found that it did not seriously change the prior perceptions of bias. Examination quickly revealed the reason; even the maximum likelihood estimate of the bias is no more than 1.4 standard deviations from zero, which is not enough to change strong prior beliefs. We therefore considered a third experiment, defined in Table 1. Transforming to approximate normality, as before, yields

    X3 ~ N(δ, σ3²),

with x3 = 22.72 being the actual observation. The maximum likelihood estimate of bias is now 3.95 standard deviations from zero, so there is potential for a substantial change in opinion about the bias. Sure enough, computation when V = 1.54/100 yields E[β | y, x3] = -4.9 with (posterior) standard deviation equal to 6.62, which is a dramatic shift from prior opinion (that β is Cauchy(0, 1.54/100)). The effect of this is to essentially ignore the original experiment in overall assessments of evidence. For instance, P(H0 | y, x3) = 3.81 x 10^-11, which is very close to P(H0 | x3) = 3.29 x 10^-11. Note that, if β were set equal to zero, the overall posterior probability of H0 (and P-value) would be 2.62 x 10^-13.

TABLE 1
Frequency of heart attacks in replication 3

             Yes     No
Aspirin        5   2309
Placebo       54   2116

Thus Bayesian reasoning can reproduce the intuition that replication which indicates bias can cast considerable doubt on the original experiment, while replication which provides no evidence of bias leaves evidence from the original experiment intact. Such behavior seems only obtainable, however, with flat-tailed priors for bias (such as the Cauchy) that are very concentrated (in comparison with the experimental standard deviation) near zero.

3. P-VALUES OR BAYES FACTORS?

Parapsychology experiments usually consider testing of H0: No parapsychological effect exists. Such null hypotheses are often realistically represented as point nulls (see Berger and Delampady, 1987, for the reason that care must be taken in such representation), in which case it is known that there is a large difference between P-values and posterior probabilities (see Berger and Delampady, 1987, for a review). The article by Jefferys (1990) dramatically illustrates this, showing that a very small P-value can actually correspond to evidence for H0 when considered from a Bayesian perspective. (This is very related to the famous "Jeffreys paradox.") The argument in favor of the Bayesian approach here is very strong, since it can be shown that the conflict holds for virtually any sensible prior distribution; a Bayesian answer can be wrong if the prior information turns out to be inaccurate, but a Bayesian answer that holds for all sensible priors is unassailable.

Since P-values simply cannot be viewed as meaningful in these situations, we found it of interest to reconsider the example in Section 5 from a Bayes factor perspective. We considered only analysis of the overall totals, that is, x = 122 successes out of n = 355 trials. Assuming a simple Bernoulli trial model with success probability θ, the goal is to test H0: θ = 1/4 versus H1: θ ≠ 1/4. To determine the Bayes factor here, one must specify g(θ), the conditional prior density on H1. Consider choosing g to be uniform and symmetric, that is,

    g_r(θ) = (2r)^-1,  if 0.25 - r ≤ θ ≤ 0.25 + r,
             0,        otherwise.

Crudely, r could be considered to be the maximum change in success probability that one would expect given that ESP exists. Also, these distributions are the "extreme points" over the class of symmetric unimodal conditional densities, so answers that hold over this class are also representative of answers over a much larger class. Note that here r ≤ 0.25 (because 0 ≤ θ ≤ 1); for the given data the θ > 0.5 are essentially irrelevant, but if it were deemed important to take them into account one could use the more sophisticated binomial analysis in Berger and Delampady (1987).

For g_r, the Bayes factor of H1 to H0, which is to be interpreted as the relative odds for the hypotheses provided by the data, is given by

    B(r) = [ ∫ from 0.25-r to 0.25+r (2r)^-1 θ^122 (1 - θ)^233 dθ ] / [ (1/4)^122 (3/4)^233 ].

This is graphed in Figure 1.

FIG. 1. The Bayes factor of H1 to H0 as a function of r, the maximum change in success probability that is expected given that ESP exists, for the ganzfeld experiment.

The P-value for this problem was 0.00005, indicating overwhelming evidence against H0 from a classical perspective. In contrast to the situation studied by Jefferys (1990), the Bayes factor here does not completely reverse the conclusion, showing that there are very reasonable values of r for which the evidence against H0 is moderately strong, for example 100/1 or 200/1. Of course, this evidence is probably not of sufficient strength to overcome strong prior opinions against H1 (one obtains final posterior odds by multiplying prior odds by the Bayes factor). To properly assess strength of evidence, we feel that such Bayes factor computations should become standard in parapsychology.

As mentioned by Professor Utts, Bayesian methods have additional potential in situations such as this, by allowing unrealistic models of iid trials to be replaced by hierarchical models reflecting differing abilities among subjects.

ACKNOWLEDGMENTS

M. J. Bayarri's research was supported in part by the Spanish Ministry of Education and Science under DGICYT Grant BE91-038, while visiting Purdue University. James Berger's research was supported by NSF Grant DMS-89-23071.
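The Bayes factor B(r) defined in Section 3 can be reproduced by one-dimensional numerical integration. A sketch follows (not the authors' computation): a simple midpoint rule, working on the log scale relative to the null to avoid underflow; the step count is an arbitrary choice.

```python
from math import log, exp

def bayes_factor(r, x=122, n=355, p0=0.25, steps=20000):
    """B(r): uniform g_r on [p0 - r, p0 + r] under H1 versus the point null
    theta = p0.  The integrand is the likelihood ratio to the null, so the
    (1/4)^x (3/4)^(n-x) denominator is absorbed into each term."""
    lo, hi = p0 - r, p0 + r
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        th = lo + (k + 0.5) * h          # midpoint of subinterval k
        log_ratio = x * log(th / p0) + (n - x) * log((1 - th) / (1 - p0))
        total += exp(log_ratio) * h
    return total / (2 * r)               # (2r)^-1 is the uniform density

# Trace of the curve in Figure 1 (values depend on the integration grid)
for r in (0.05, 0.1, 0.15, 0.2, 0.25):
    print(r, round(bayes_factor(r), 1))
```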

Comment

Ree Dawson

This paper offers readers interested in statistical science multiple views of the controversial history of parapsychology and how statistics has contributed to its development. It first provides an account of how both design and inferential aspects of statistics have been pivotal issues in evaluating the outcomes of experiments that study psi abilities. It then emphasizes how the idea of science as replication has been key in this field, in which results have not been conclusive or consistent and thus meta-analysis has been at the heart of the literature in parapsychology. The author not only reviews past debate on how to interpret repeated psi studies, but also provides very detailed information on the Honorton-Hyman argument, a nice illustration of the challenges of resolving such debate. This debate is also a good example of how statistical criticism can be part of the scientific process and lead to better experiments and, in general, better science.

Ree Dawson is Senior Statistician, New England Biomedical Research Foundation, and Statistical Consultant, RFE/RL Research Institute. Her mailing address is 177 Morrison Avenue, Somerville, Massachusetts 02144.

REPLICATION IN PARAPSYCHOLOGY 383

The remainder of the paper addresses technical issues of meta-analysis, drawing upon recent research in parapsychology for an in-depth application. Through a series of examples, the author presents a convincing argument that power issues cannot be overlooked in successive replications and that comparison of effect sizes provides a richer alternative to the dichotomous measure inherent in the use of p-values. This is particularly relevant when the potential effect size is small and resources are limited, as seems to be the case for psi studies.

The concluding section briefly mentions Bayesian techniques. As noted by the author, Bayes (or empirical Bayes) methodology seems to make sense for research in parapsychology. This discussion examines possible Bayesian approaches to meta-analysis in this field.

BAYES MODELS FOR PARAPSYCHOLOGY

The notion of repeatability maps well into the Bayesian set-up in which experiments, viewed as a random sample from some superpopulation of experiments, are assumed to be exchangeable. When subjects can also be viewed as an approximately random sample from some population, it is appropriate to pool them across experiments. Otherwise, analyses that partially pool information according to experimental heterogeneity need to be considered. Empirical and hierarchical Bayes methods offer a flexible modeling framework for such analyses, relying on empirical or subjective sources to determine the degree of pooling. These richer methods can be particularly useful for meta-analysis of experiments in parapsychology conducted under potentially diverse conditions.

For the recent ganzfeld series, assuming them to be independent binomially distributed as discussed in Section 5, the data can be summed (pooled) across series to estimate a common hit rate. Honorton et al. (1990) assessed the homogeneity of effects across the 11 series using a chi-square test that compares individual effect sizes to the weighted mean effect. The chi-square statistic, χ²(10) = 16.25, is not statistically significant (p = 0.093) and largely reflects the contribution of the last "special" series (which contributes 9.2 units to the chi-square value) and, to a lesser extent, the novice series with a negative effect (which contributes 2.5 units). The outlier series can be dropped from the analysis to provide a more conservative estimate of the presence of psi effects for these data (this result is reported in Section 5). For the remaining 10 series, the chi-square value of 7.01 strongly favors homogeneity, although more than one-third of its value is due to the novice series (number 4 in Table 1). This pattern points to the potential usefulness of a richer model to accommodate series that may be distinct from the others. For the earlier ganzfeld data analyzed by Honorton (1985b), the appeal of a Bayes or other model that recognizes the heterogeneity across studies is clear cut: χ²(23) = 56.6, p = 0.0001, where only those studies with a common chance hit rate have been included (see Table 2).
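The weighted homogeneity test described above can be sketched numerically as follows. This is a generic Q-statistic computation, not Honorton et al.'s actual code; the function name and any data passed to it are illustrative.

```python
def homogeneity_chi_square(effects, variances):
    """Chi-square test of homogeneity: compares each effect size to the
    precision-weighted mean effect, Q = sum_i w_i * (y_i - ybar_w)**2
    with w_i = 1/var_i, approximately chi-square on k - 1 degrees of
    freedom for k studies under homogeneity."""
    weights = [1.0 / v for v in variances]
    ybar = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    q = sum(w * (y - ybar) ** 2 for w, y in zip(weights, effects))
    return q, len(effects) - 1
```

A value of Q large relative to the chi-square reference distribution signals heterogeneity and motivates the partially pooled models discussed next.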
Historic reliance on vote-counting approaches to determine the presence of psi effects makes it natural to consider Bayes models that focus on the ensemble of experimental effects from parapsychological studies, rather than individual estimates. Recent work in parapsychology that compares effect sizes across studies, rather than estimating separate study effects, reinforces the need to examine this type of model. Louis (1984) develops Bayes and empirical Bayes methods for problems that consider the ensemble of parameter values to be the primary goal, for example, multiple comparisons. For the simple compound normal model, Y_i ~ N(θ_i, 1), θ_i ~ N(μ, τ²), the standard Bayes estimates (posterior means) are

  θ*_i = μ + D(Y_i - μ),  D = τ² / (τ² + 1),

where the θ_i represent the experimental effects of interest. When an ensemble loss function is assumed, these are modified approximately to

  θ~_i = μ + D^{1/2}(Y_i - μ).

The new estimates adjust the shrinkage factor D so that their sample mean and variance match the posterior expectation and variance of the θ's. Similar results are obtained when the model is generalized to the case of unequal variances, Y_i ~ N(θ_i, σ_i²), θ_i ~ N(μ, τ²). For this model, the fraction of the θ~_i above (or below) a cut point C is a consistent estimate of the fraction of θ_i > C (or θ_i < C). Thus, the use of ensemble, rather than component-wise, loss can help detect when individual effects are above a specified threshold.
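A minimal numerical sketch of the two estimators for the compound normal model is given below. The moment estimates of μ and τ², and the square-root adjustment of D as the approximate ensemble-loss modification, are assumptions made for illustration in the spirit of Louis (1984), not a reproduction of his derivation.

```python
import math

def ensemble_bayes(y, mu=None, tau2=None):
    """Shrinkage estimates for the compound normal model
    Y_i ~ N(theta_i, 1), theta_i ~ N(mu, tau2).
    Returns the component-wise posterior means and the less-shrunk
    'ensemble' estimates whose sample moments better match the
    posterior moments of the ensemble of theta's."""
    n = len(y)
    if mu is None:
        mu = sum(y) / n
    if tau2 is None:
        # method of moments: marginally Var(Y_i) = tau2 + 1
        s2 = sum((yi - mu) ** 2 for yi in y) / (n - 1)
        tau2 = max(s2 - 1.0, 0.0)
    D = tau2 / (tau2 + 1.0)                       # usual shrinkage factor
    post_means = [mu + D * (yi - mu) for yi in y]          # theta*_i
    ensemble = [mu + math.sqrt(D) * (yi - mu) for yi in y] # theta~_i
    return post_means, ensemble
```

Because sqrt(D) > D for 0 < D < 1, the ensemble estimates shrink less toward μ, spreading the estimates to match the dispersion of the underlying effects.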
[Table 1. Recent ganzfeld series — columns: series type (three pilot, five novice, three experienced series, plus overall), N trials, hit rate, Y_i, σ_i.]

For the meta-analysis of ganzfeld experiments, the observed binomial proportions transformed on the logit (or arcsin square-root) scale can be modeled in this framework. Letting d_i and m_i denote the number of direct hits and misses, respectively, for the ith experiment, and p_i the corresponding population proportion of direct hits, the Y_i are the observed logits

  Y_i = log(d_i / m_i),

and σ_i², estimated by maximum likelihood as 1/d_i + 1/m_i, is the variance of Y_i conditional on θ_i = logit(p_i). The threshold logit(0.25) = -1.10 can be used to identify the number of experiments for which the proportion of direct hits exceeds that expected by chance.

Table 1 shows Y_i and σ_i for the 11 ganzfeld series. All but one of the series are well above the threshold; Y_4 marginally falls below -1.10. Any shrinkage toward a common hit rate will lead to an estimate, θ*_i or θ~_i, above the threshold. The use of ensemble loss (with its consistency property) provides more convincing support that all θ_i > -1.10, although posterior estimates of uncertainty are needed to fully calibrate this.

[Table 2. Earlier ganzfeld studies — columns: N trials, hit rate, Y_i, σ_i.]

For the earlier ganzfeld data in Table 2, ensemble loss can similarly be used to determine the number of studies with θ_i < -1.10 and, specifically, whether the negative effects of studies 4 and 24 (Y_4 = -1.21 and Y_24 = -1.33) occurred as a result of chance fluctuation.

Features of the ganzfeld data in Section 5, such as the outlier series, suggest that further elaboration of the basic Bayesian set-up may be necessary for some meta-analyses in parapsychology. Hierarchical models provide a natural framework to specify these elaborations and explore how results change with the prior specification. This type of sensitivity analysis can expose whether conclusions are closely tied to prior beliefs, as observed by Jeffreys for RNG data (see Section 7). Quantifying the influence of model components deemed to be more subjective or less certain is important to broad acceptance of results as evidence of psi performance (or lack thereof).

Consider the initial model commonly used for Bayesian analysis of discrete data:

  Y_i | p_i, n_i ~ Binomial(n_i, p_i),
  θ_i = logit(p_i),
  θ_i ~ N(μ, τ²),

with noninformative priors assumed for μ and τ (e.g., log τ locally uniform). The distinctiveness of the last "special" series and, in general, the different types of series (pilot versus formal, novice versus experienced) raise the question of whether the experimental effects follow a normal distribution. Weighted normal plots (Ryan and Dempster, 1984) can be used to graphically diagnose the adequacy of second-stage normality (see Dempster, Selwyn and Weeks, 1983, for examples with binary response and normal superpopulation).

Alternatively, if nonnormality is suspected, the model can be revised to include some sort of heavy-tailed prior to accommodate possibly outlying series or studies. West (1985) incorporates additional scale parameters, one for each component of the model (experiment), that flexibly adapt to atypical θ_i and discount their influence on posterior estimates, thus avoiding under- or over-shrinkage due to such θ_i. For example, the second stage can specify the prior as a scale mixture of normals, θ_i | λ_i ~ N(μ, τ²/λ_i), with the λ_i drawn independently from a mixing distribution. This approach for the prior is similar to others for maximum likelihood estimation that modify the sampling error distribution to yield estimates that are "robust" against outlying observations.
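The logit effect sizes and the chance threshold can be computed directly. The helper below is illustrative only, not part of the original analysis; the hit/miss counts in the usage note are a hypothetical example, not values from Table 1.

```python
import math

def logit_effects(hits, misses):
    """Observed logits Y_i = log(d_i / m_i) and maximum likelihood
    variance estimates 1/d_i + 1/m_i for each experiment's
    direct-hit and miss counts."""
    return [(math.log(d / m), 1.0 / d + 1.0 / m)
            for d, m in zip(hits, misses)]

# Chance threshold for a one-in-four design: logit(0.25)
THRESHOLD = math.log(0.25 / 0.75)   # about -1.10
```

For instance, an experiment with 10 direct hits and 15 misses in 25 trials gives Y = log(10/15), about -0.41, comfortably above the chance threshold of -1.10.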
Like its maximum likelihood counterparts, the Bayes model provides, in addition to the robust effect estimates θ*, (posterior) scale estimates γ*. These can be interpreted as the weight given to the data for each θ_i in the analysis and are useful for diagnosing which model components (series or studies) are unusual and how they influence the shrinkage. When more complex groupings among the θ_i are suspected, for example, a bimodal distribution of studies from different sites or experimenters, other mixture specifications can be used to further relax the shrinkage toward a common value.

For the 11 ganzfeld series, the last "outlier" series, quite distinct from the others (hit rate = 0.64), is moderately precise (N = 25). Omitting it from the analysis causes the overall hit rate to drop from 0.344 to 0.321. The scale mixture model is a compromise between these two values (on the logit scale), discounting the influence of series 11 on the estimated posterior common hit rate used for shrinkage. The scale factor γ*_11, an indication of how separate θ_11 is from the other parameters, also causes θ~_11 to be shrunk less toward the common hit rate than the other, more homogeneous θ_i, giving more weight to individual information for that series (see West, 1985). The heterogeneity of the earlier ganzfeld data is more pronounced, and those studies are taken from a variety of sources over time. For these data, the γ* can be used to explore atypical studies (e.g., study 6, with hit rate = 0.90, contributes more than 25% to the chi-square value for homogeneity) and groupings among effects, as well as to protect the analysis from misspecification of second-stage normality.

Variation among ganzfeld series or studies, and the degree to which pooling or shrinking is appropriate, can be investigated further by considering a range of priors for τ². If the marginal likelihood of τ² dominates the prior specification, then results should not vary as the prior for τ² is varied. Otherwise, it is important to identify the degree to which subjective information about interexperimental variability influences the conclusions. This sensitivity analysis is a Bayesian enrichment of the simpler test of homogeneity directed toward determining whether or not complete pooling is appropriate.

To assess how well heterogeneity among historical control groups is determined by the data, Dempster, Selwyn and Weeks (1983) propose three priors for τ² in the logistic-normal model. The prior distributions range from one strongly favoring individual estimates, through the uniform reference prior that is flat on the log τ scale, to one strongly favoring complete pooling, p(τ²) dτ proportional to τ^{-3} (the latter forcing complete pooling for the compound normal model; see Morris, 1983). For their two examples, the results (estimates of linear treatment effects) are largely insensitive to variation in the prior distribution, but the number of studies in each example was large (70 and 19 studies available for pooling). For the 11 ganzfeld series, τ² may be less well determined by the data. The posterior estimate of τ² and its sensitivity to the prior will also depend on whether individual scale parameters are incorporated into the model. Discounting the influence of the last series will both shift the marginal likelihood toward smaller values of τ² and concentrate it more in that region.
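One way to check whether the marginal likelihood of τ² dominates the prior is to evaluate it over a grid of τ² values and see how peaked it is. The sketch below, with μ profiled out at its weighted-mean estimate for each τ², is a simplified illustration for the unequal-variance normal model, not the computation used by any of the cited authors.

```python
import math

def profile_loglik_tau2(y, s2, tau2):
    """Marginal log-likelihood of tau^2 (up to a constant) for
    Y_i ~ N(theta_i, s2_i), theta_i ~ N(mu, tau2), so marginally
    Y_i ~ N(mu, s2_i + tau2); mu is set to its precision-weighted
    estimate at each tau^2 value."""
    w = [1.0 / (v + tau2) for v in s2]
    mu = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    return sum(0.5 * math.log(wi) - 0.5 * wi * (yi - mu) ** 2
               for wi, yi in zip(w, y))
```

Evaluating this over, say, `[0.0, 0.01, 0.05, 0.1, ...]` shows whether the data concentrate on a narrow range of τ²; a flat profile signals that the prior for τ² will drive the conclusions.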
The issue of objective assessment of experimental results is one that extends well beyond the field of parapsychology, and this paper provides insight into issues surrounding the analysis and interpretation of small effects from related studies. Bayes methods can contribute to such meta-analyses in two ways. They permit experimental and subjective evidence to be formally combined to determine the presence or absence of effects that are not clear cut or are controversial (e.g., psi abilities). They can also help uncover the sources and degree of uncertainty in the scientific conclusions.

Comment

Persi Diaconis

In my experience, parapsychologists use statistics extremely carefully. The plethora of widely significant p-values in the many thousands of published parapsychological studies must give us pause for thought. Either something spooky is going on, or it is possible for a field to exist on error and artifact for over 100 years. The present paper offers a useful review by an expert and a glimpse at some tantalizing new studies.

Persi Diaconis is Professor of Mathematics at Harvard University, Science Center, 1 Oxford Street, Cambridge, Massachusetts 02138.

My reaction is that the studies are crucially flawed. Since my reasons are somewhat unusual, I will try to spell them out.

I have found it impossible to usefully judge what actually went on in a parapsychology trial from the published record. Time after time, skeptics have gone to watch trials and found subtle and not-so-subtle errors. Since the field has so far failed to produce a replicable phenomenon, it seems to me that any trial that asks us to take its findings seriously should include full participation by qualified skeptics. Without a magician and/or a knowledgeable psychologist skilled at running experiments with human subjects, I don't think a serious effort is being made.

I recognize that this is an unorthodox set of requirements. In fact, one cannot judge what "really goes on" in studies in most areas, and it is impossible to demand wide replicability in others. Finally, defining "qualified skeptic" is difficult. In defense, most areas have many easily replicable experiments, and many have their findings explained and connected by unifying theories. It simply seems clear that when making claims at such extraordinary variance with our daily experience, claims that have been made and washed away so often in the past, such extraordinary measures are mandatory before one has the right to ask outsiders to spend their time in review. The papers cited in Section 5 do not actively involve qualified skeptics, and I do not feel they have earned the right to our serious attention.

The points I have made above are not new. Many appear in the present article. This does not diminish their utility nor their applicability to the most recent studies.

Parapsychology is worth serious study. First, there may be something there, and I marvel at the patience and drive of people like Jessica Utts and Ray Hyman. Second, if it is wrong, it offers a truly alarming massive case study of how statistics can mislead and be misused. Third, it offers marvelous combinatorial and inferential problems. Chung, Diaconis, Graham and Mallows (1981), Diaconis and Graham (1981) and Samaniego and Utts (1983) offer examples not cited in the text. Finally, our budding statistics students are fascinated by its claims; the present paper gives a responsible overview, providing background for a spectacular classroom presentation.

Comment: Parapsychology - On the Margins of Science?

Joel B. Greenhouse

Professor Utts reviews and synthesizes a large body of experimental literature as well as the scientific controversy involved in the attempt to establish the existence of paranormal phenomena. The organization and clarity of her presentation are noteworthy. Although I do not believe that this paper will necessarily change anyone's views regarding the existence of paranormal phenomena, it does raise very interesting questions about the process by which new ideas are either accepted or rejected by the scientific community. As students of science, we believe that scientific discovery advances methodically and objectively through the accumulation of knowledge (or the rejection of false knowledge) derived from the implementation of the scientific method. But, as we will see, there is more to the acceptance of new scientific discoveries than the systematic accumulation and evaluation of facts. The recognition that there is a social process involved with the acceptance or rejection of scientific knowledge has been the subject of study of sociologists for some time. The scientific community's rejection of the existence of paranormal phenomena is an excellent case study of this process (Allison, 1979; Collins and Pinch, 1979).

Joel B. Greenhouse is Associate Professor of Statistics, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213-3890.

Implicit in Professor Utts' presentation, and paramount to the acceptance of parapsychology as a legitimate science, are the description and documentation of the professionalization of the field of parapsychology. It is true that many researchers in the field have university appointments; there are organized professional societies for the advancement of parapsychology; there are journals with rigorous standards for published research; the field has received funding from federal agencies; and parapsychology has received recognition from other professional societies, such as the IMS and the American Association for the Advancement of Science (Collins and Pinch, 1979). Nevertheless, most readers of Statistical Science would agree that parapsychology is not accepted as part of orthodox science and is considered by most of the scientific community to be on the margins of science, at best (Allison, 1979; Collins and Pinch, 1979). Why is this the case? Professor Utts believes that it is because people have not examined the data. She states that "Strong beliefs tend to be resistant to change even in the face of data, and many people, scientists included, seem to have made up their minds on the question without examining any empirical data at all."

The history of science is replete with examples of resistance by the established scientific community to new discoveries. A challenging problem for science is to understand the process by which a new theory or discovery becomes accepted by the community of scientists and, likewise, to characterize the nature of the resistance to new ideas. Barber (1961) suggests that there are many different sources of resistance to scientific discovery. In 1900, for example, Karl Pearson met resistance to his use of statistics in applications to biological problems, illustrating a source of resistance due to the use of a particular methodology. The Royal Society informed Pearson that future papers submitted to the Society for publication must keep the mathematics separate from the biological applications. Pearson's response to the antimathematical prejudice expressed by the Royal Society was to establish, with Galton's support, a new journal, Biometrika, to encourage the use of mathematics in biology. Galton (1901) wrote an article for the first issue of the journal, explaining the need for this new voice of "mutual encouragement and support" for mathematics in biology and saying that "a new science cannot depend on a welcome from the followers of the older ones, and [therefore] ... it is advisable to establish a special Journal for Biometry."

Another obvious source of resistance to new scientific ideas, and the one referred to by Professor Utts above, is the prevailing substantive beliefs and theories held by scientists at any given time. Barber offers the opposition to Copernicus and his heliocentric theory and to Mendel's theory of genetic inheritance as examples of how, because of preconceived ideas, theories and values, scientists are not as open-minded to new advances as one might think they should be. It was R. A. Fisher who said that each generation seems to have found in Mendel's paper only what it expected to find and ignored what did not conform to its own expectations (Fisher, 1936). Lavoisier understood the role of preconceived beliefs as a source of resistance when he wrote in 1785,

    I do not expect my ideas to be adopted all at once. The human mind gets creased into a way of seeing things. Those who have envisaged nature according to a certain point of view during much of their career, rise only with difficulty to new ideas. (Barber, 1961.)

I suspect that this paper by Professor Utts, synthesizing the accumulation of research results supporting the existence of paranormal phenomena, will continue to be received with resistance by the orthodox scientific community "even after examining the data." In part, this resistance is due to the popular perception of the association between parapsychology and the occult (Allison, 1979) and due to the continued suspicion and documentation of fraud in parapsychology (Diaconis, 1978). An additional and important source of resistance to the evidence presented by Professor Utts, however, is the lack of a model to explain the phenomena. Psychic phenomena are unexplainable by any current scientific theory and, furthermore, directly contradict the laws of physics. Acceptance of psi implies the rejection of a large body of accumulated evidence explaining the physical and biological world as we know it. Thus, even though the effect size for a relationship between aspirin and the prevention of heart attacks is three times smaller than the effect size observed in the ganzfeld data base, it is the existence of a biological mechanism to explain the effectiveness of aspirin that accounts, in part, for acceptance of this relationship.

In evaluating the evidence in favor of the existence of paranormal phenomena, it is necessary to consider alternative explanations or hypotheses for the results and, as noted by Cornfield (1959), "If important alternative hypotheses are compatible with available evidence, then the question is unsettled, even if the evidence is experimental" (see also Platt, 1964). Many of the experimental results reported by Professor Utts need to be considered in the context of explanations other than the existence of paranormal phenomena. Consider the following examples:

(1) In the various psi experiments that Professor Utts discusses, the null hypothesis is a simple chance model. However, as noted by Diaconis (1978) in a critique of parapsychological research, "In complex, badly controlled experiments simple chance models cannot be seriously considered as tenable explanations; hence, rejection of such models is not of particular interest." Diaconis shows that the underlying probabilistic model in many of these experiments (even those that are well controlled) is much more complicated than chance.

(2) The role that experimenter expectancy plays in the reporting and interpreting of results cannot be underestimated. Rosenthal (1966), based on a meta-analysis of the effects of experimenters' expectancies on the results of their research, found that experimenters tended to get the results they expected to get. Clearly this is an important potential confounder in parapsychological research. Professor Utts comments on a debate between Honorton and Hyman, parapsychologist and critic, respectively, regarding evidence for psi abilities, and, although not necessarily a result of experimenter expectancy, describes how "... each analyzed the results of all known psi ganzfeld experiments to date, and reached strikingly different conclusions."

(3) What is an acceptable response in these experiments? What constitutes a direct hit? If the response is close, who decides whether or not that constitutes a hit (see (2) above)? In an example of a response of a Receiver in an automated ganzfeld procedure, Professor Utts describes the "dream-like quality of the mentation." Someone must evaluate these stream-of-consciousness responses to determine what is a hit. An important methodological question is: How sensitive are the results to different definitions of a hit?

(4) In describing the results of different meta-analyses, Professor Utts is careful to raise questions about the role of publication bias. Publication bias, or "the file-drawer problem," arises when only statistically significant findings get published, while statistically nonsignificant studies sit unreported in investigators' file drawers. Typically, Rosenthal's (1979) method is used to calculate the "fail-safe N," that is, the number of unreported studies that would have to be sitting in file drawers in order to negate the significant effect. Iyengar and Greenhouse (1988) describe a modification of Rosenthal's method, however, that gives a fail-safe N that is often an order of magnitude smaller than Rosenthal's, suggesting that the sensitivity of the results of meta-analyses of psi experiments to unpublished negative studies is greater than is currently believed.
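Rosenthal's fail-safe N can be sketched as follows. This is a generic implementation of the 1979 formula based on the Stouffer combined z; the function name and the cutoff default are illustrative choices (z = 1.645 corresponds to a one-tailed p of 0.05), and it does not include the Iyengar-Greenhouse modification.

```python
def fail_safe_n(z_scores, z_alpha=1.645):
    """Number of unpublished mean-zero studies needed to pull the
    Stouffer combined z, sum(z_i) / sqrt(k + N), down to z_alpha:
    solving gives N = (sum(z_i) / z_alpha)**2 - k."""
    k = len(z_scores)
    s = sum(z_scores)
    return max(0.0, (s / z_alpha) ** 2 - k)
```

For example, four studies each sitting exactly at z = 1.645 would need about 12 file-drawer null studies to drop the combined result below significance.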
Even if parapsychology is thought to be on the margins of science by the scientific community, parapsychologists should not be held to a different standard of evidence to support their findings than orthodox scientists, but like other scientists they must be concerned with spurious effects and the effects of extraneous variables. The experimental results summarized by Professor Utts appear to be sensitive to the effect of alternative hypotheses like the ones described above. Sensitivity analyses, which ask, for example, how large an effect due to experimenter expectancy there would have to be to account for the effect sizes being reported in the psi experiments, are not addressed here. Again, the ability to account for and eliminate the role of alternative hypotheses in explaining the observed relationship between aspirin and the prevention of heart attacks is another reason for the acceptance of those results.

A major new technology discussed by Professor Utts in synthesizing the experimental parapsychology literature is meta-analysis. Until recently, the quantitative review and synthesis of a research literature, that is, meta-analysis, was considered by many to be a questionable research tool (Wachter, 1988). Resistance by statisticians to meta-analysis is interesting because, historically, many prominent statisticians found the combining of information from independent studies to be an important and useful methodology (see, e.g., Fisher, 1932; Cochran, 1954; Mosteller and Bush, 1954; Mantel and Haenszel, 1959). Perhaps the more recent skepticism about meta-analysis arises because of its use as a tool to advance discoveries that themselves were the objects of resistance, such as the effectiveness of psychotherapy (Smith and Glass, 1977) and now the existence of paranormal phenomena. It is an interesting problem for the history of science to explore why and when in the development of a discipline it turns to meta-analysis to answer research questions or to resolve controversy (e.g., Greenhouse et al., 1990).

One argument for combining information from different studies is that a more powerful result can be obtained than from a single study. This objective is implicit in the use of meta-analysis in parapsychology and is the force behind Professor Utts' paper. The issue is that by combining many small studies consisting of small effects there is a gain in power to find an overall statistically significant effect. It is true that the meta-analyses reported by Professor Utts find extremely small p-values, but the estimate of the overall effect size is still small. As noted earlier, because of the small magnitude of the overall effect size, the possibility that other extraneous variables might account for the relationship remains.

Professor Utts, however, also illustrates the use of meta-analysis to investigate how studies differ and to characterize the influence of covariates or moderating variables on the combined estimate of effect size. For example, she compares the mean effect size of studies where subjects were selected on the basis of good past performance to studies where the subjects were unselected, and she compares the mean effect size of studies with feedback to studies without feedback. To me, this latter use of meta-analysis highlights the more valuable and important contribution of the methodology. Specifically, the value of quantitative methods for research synthesis is in assessing the potential effects of study characteristics and in quantifying the sources of heterogeneity in a research domain, that is, in studying systematically the effects of extraneous variables. Tom Chalmers and his group at Harvard have used meta-analysis in just this way, not only to advance the understanding of the effectiveness of medical treatments but also to study the characteristics of good research in medicine, in particular, the randomized controlled clinical trial. (See Mosteller and Chalmers, 1991, for a review of this work.)

Professor Utts should be congratulated for her courage in contributing her time and statistical expertise to a field struggling on the margins of science, and for her skill in synthesizing a large body of experimental literature. I have found her paper to be quite stimulating, raising many interesting issues about how science progresses or does not progress.

ACKNOWLEDGMENT

This work was supported in part by MHCRC Grant MH30915 and Grant MH15758 from the National Institute of Mental Health, and Grant CA54852 from the National Cancer Institute. I would like to acknowledge stimulating discussions with Professors Larry Hedges, Michael Meyer, Ingram Olkin, Teddy Seidenfeld and Larry Wasserman, and thank them for their patience and encouragement while preparing this discussion.

Comment

Ray Hyman

Utts concludes that "there is an anomaly that needs explanation." She bases this conclusion on the ganzfeld experiments and four meta-analyses of parapsychological studies. She argues that both Honorton and Rosenthal have successfully refuted my critique of the ganzfeld experiments. The meta-analyses apparently show effects that cannot be explained away by unreported experiments nor by over-analysis of the data. Furthermore, effect size does not correlate with the rated quality of the experiment.

Ray Hyman is Professor of Psychology, University of Oregon, Eugene, Oregon 97403.

Neither time nor space is available to respond in detail to her argument. Instead, I will point to some of my concerns. I will do so by focusing on those parts of Utts' discussion that involve me. Understandably, I disagree with her assertions that both Honorton and Rosenthal successfully refuted my criticisms of the ganzfeld experiments.

Her treatment of both the ganzfeld debate and the National Research Council's report suggests that Utts has relied on second-hand reports of the data. Some of her statements are simply inaccurate. Others suggest that she has not carefully read what my critics and I have written. This remoteness from the actual experiments and details of the arguments may partially account for her optimistic assessment of the results. Her paper takes the reported data at face value and focuses on the statistical interpretation of these data.

Both the statistical interpretation of the results of an individual experiment and of the results of a meta-analysis are based on a model of an ideal world. In this ideal world, effect sizes have a tractable and known distribution, and the points in the sample space are independent samples from a coherent population. The appropriateness of any statistical application in a given context is an empirical matter. That is why such issues as the adequacy of randomization, the non-independence of experiments in a meta-analysis and the over-analysis of data are central to the debate. The optimistic conclusions from the meta-analyses assume that the effect sizes are unbiased estimates from independent experiments and have nicely behaved distributional properties.

Before my detailed assessment of all the available ganzfeld experiments through 1981, I accepted the assertions by parapsychologists that their experiments were of high quality in terms of statistical and experimental methodology. I was surprised to find that the ganzfeld experiments, widely heralded as the best exemplar of a successful research program in parapsychology, were characterized by obvious possibilities for sensory leakage, inadequate randomization, over-analysis and other departures from parapsychology's own professed standards. One response was to argue that I had exaggerated the number of flaws. But even internal critics agreed that the rate of defects in the ganzfeld data base was too high.

The other response, implicit in Utts' discussion of the ganzfeld experiments and the meta-analyses, was to admit the existence of the flaws but to deny their importance. The parapsychologists doing the meta-analysis would rate each experiment for quality on one or more attributes. Then, if the null hypothesis of no correlation between effect size and quality were upheld, the investigators concluded that the results could not be attributed to defects in methodology.

This retrospective sanctification, using statistical controls to compensate for inadequate experimental controls, has many problems. The quality ratings are not blind. As the differences between myself and Honorton reveal, such ratings are highly subjective. Although I tried my best to restrict my ratings to what I thought were objective and easily codeable indicators, my quality ratings provide a different picture than do those of Honorton. Honorton, I am sure, believes he was just as objective in assigning his ratings as I believe I was.

Another problem is the number of different properties that are rated. Honorton's ratings of quality omitted many attributes that I included in my ratings. Even in those cases where we used the same indicators to make our assessments, we differed because of our scaling. For example, on adequacy of randomization I used a simple dichotomy. Either the experimenter clearly indicated using an appropriate randomization procedure or he did not. Honorton converted this to a trichotomous scale. He distinguished between a clearly inadequate procedure, such as hand-shuffling, and failure to report how the randomization was done. He then assigned the lowest rating to failure to describe the randomization. In his scheme, clearly inadequate randomization was of higher quality than failure to describe the procedure. Although we agreed on which experiments had adequate randomization, inadequate randomization or inadequate documentation, the different ways these were ordered produced important differences between us in how randomization related to effect size. These are just some of the reasons why the finding of no correlation between effect size and rated quality does not justify concluding that the observed flaws had no effect.

I will now consider some of Utts' assertions and hope that I can go into more detail in another forum. Utts discusses the conclusions of the National Research Council's Committee on Techniques for the Enhancement of Human Performance. I was chairperson of that committee's subcommittee on paranormal phenomena. She wrongly states that we restricted our evaluation only to significant studies. I do not know how she got such an impression, since we based our analysis on meta-analyses whenever these were available. The two major inputs for the committee's evaluation were a lengthy evaluation of contemporary parapsychology experiments by John Palmer and an independent assessment of these experiments by James Alcock. Our sponsors, the Army Research Institute, had commissioned the report from the parapsychologist John Palmer. They specifically asked our committee to provide a second opinion from a non-parapsychological perspective. They were most interested in the experiments on remote viewing and random number generators. We decided to add the ganzfeld experiments. Alcock was instructed, in making his evaluation, to restrict himself to the same experiments in these categories that Palmer had chosen. In this way, the experiments we evaluated, which included both significant and nonsignificant ones, were, in effect, selected for us by a prominent parapsychologist.

Utts mistakenly asserts that my subcommittee on parapsychology commissioned Harris and Rosenthal to evaluate parapsychology experiments for us. Harris and Rosenthal were commissioned by our evaluation subcommittee to write a paper on evaluation issues, especially those related to experimenter effects. On their own initiative, Harris and Rosenthal surveyed a number of data bases to illustrate the application of methodological procedures such as meta-analysis. As one illustration, they found a correlation between criterion variables and flaws of "only" 0.46. A true correlation of this magnitude would be impressive given the nature and split of the dichotomous variables. But, because it was not statistically significant, Harris and Rosenthal concluded that there was no relationship between quality and effect size.
A canonical correlation on included a meta-analysis of the subsample of this sample of 28 nonindependent cases, of course, ganzfeld experiments used by Honorton in his has virtually no chance of being significant, even if rebuttal to my critique. it were of much greater magnitude. Because Harris and Rosenthal did not them- What this amounts to is that the alleged contra- selves do a first-hand evaluation of the ganzfeld dictory conclusions of Harris and Rosenthal are experiments, and because they used Honorton's rat- based on a meta-analysis that supports Honorton's ings for their illustration, I did not refer to their position when Honorton's ratings are used and analysis when I wrote my draft for the chapter on supports my position when my ratings are used. the paranormal. Rosenthal told me, in a letter, that Nothing substantive comes from this, and it is he had arbitrarily used Honorton's ratings rather redundant with what Honorton and I have already than mine because they were the most recent avail- published. Harris and Rosenthal's footnote adds able. I assumed that Harris and Rosenthal were nothing because it supports the null hypothesis using Honorton's sample and ratings to illustrate with a statistical test that has no power against a meta-analytic procedures. I did not believe they reasonably sized alternative. It is ironic that Utts, were making a substantive contribution to the after emphasizing the importance of considering debate. statistical power, places so much reliance on the Only after the committee's complete report was outcome of a powerless test. in the hands of the editors did someone become (I should add that the recurrent charge that the concerned that Harris and Rosenthal had come to a NRC committee completely ignored Harris and conclusion on the ganzfeld experiments different Rosenthal's conclusions is not strictly correct. I from the committee. 
Apparently one or more com- wrote a response to the Harris and Rosenthal paper mittee members contacted Rosenthal and asked him that was included in the same supplementary to explain why he and Harris were dissenting. volume that contains their commissioned paper.) Because some committee members believed that Utts' discussion of the ganzfeld debate, as I have we should deal with this apparent discrepancy, I indicated, also shows unfamiliarity with details. contacted Rosenthal and pointed out if he had used She cites my factor analysis and Saunders' critique my ratings with _the very same analysis he had as if these somehow jeopardized the conclusions I applied to Honorton's ratings, he would have drew. Again, the matter is too complex to discuss reached a conclusion opposite to what Harris and adequately in this forum. The "factor analysis" she he had asserted. I did this, not to suggest my is talking about is discussed in a few pages of my ratings were necessarily more trustworthy than critique. I introduced it as a convenient way to Honorton's, but to point out how fragile any conclu- summarize my conclusions, none of which depended sions were based on this small and limited sample. on this analysis. I agree with what Saunders has to Indeed, the data were so lacking in robustness that say about the limitations of factor analysis in this the difference between my rating and Honorton's context. Unfortunately, Saunders bases his criti- rating of one investigator (Sargent) on one at- cism on wrong assumptions about what I did and tribute (randomization) sufficed to reverse the con- why I did it. His dismissal of the results as clusions Harris and Rosenthal made about the "meaningless" is based on mistaken algebra. I in- correlation between quality and effect size. cluded as dummy variables five experimenters in Harris and Rosenthal responded by adding a foot- the factor analysis. Because an experimenter can note to their paper. 
In this footnote, they repor- only appear on one variable, this necessarily forces ted an analysis using my ratings rather than the average intercorrelation among the experi- Honorton's. This analysis, they concluded, still sup- menter variables to be negative. Saunders falsely ported the null hypothesis of no correlation be- asserts that this negative correlation must be -1. tween quality and effect size. They used 6 of my 12 If he were correct, this would make the results dichotomous ratings of flaws as predictors and the z meaningless. But he could be correct only if there score and effect size as criterion variables in both were just two investigators and that each one ac- multiple regression and canonical correlation anal- counted for 50% of the experiments. In my case, as yses. They reported an "adjusted" canonical corre- I made sure to check ahead of time, the use of five 392 J.UTTS experimenters, each of whom contributed only a precise departure of a specific kind from the orbit few studies to the data base, produced a mildly expected on the basis of Newtonian mechanics. He negative intercorrelation of -0.147. To make sure knew exactly what he had to account for. even that small correlation did not distort the re- The "anomaly" or "anomalies" that Utts talks sults, I did the factor analysis with and without the about are different. We do not know what it is that dummy variables. The same factors were obtained we are asked to account for other than something in both cases. that sometimes produces nonchance departures However, I do no& wish to defend this factor from a statistical model, whose appropriateness is analysis. None of my conclusions depend on it. I itself open to question. would agree with any editor who insisted that I The case rests on a handful of meta-analyses that omit it from the paper on the grounds of redun- suggest effect sizes different from zero and uncorre- dancy. 
I am discussing it here as another example lated with some non-blindly determined indices of that suggests that Utts is not familiar with some quality. For a variety of reasons, these retrospec- relevant details in literature she discusses. tive attempts to find evidence for paranormal phe- nomena are problematical. At best, they should provide the basis for parapsychologists designing CONCLUSIONS prospective studies in which they can specify, in advance, the complete sample space and the critical Utts may be correct. There may indeed be an region. When they get to the point where they can anomaly in the parapsychological findings. Anoma- specify this along with some boundary conditions lies may also exist in non-parapsychological do- and make some reasonable predictions, then they mains. The question is when is an anomaly worth will have demonstrated something worthy of our taking seriously. The anomaly that Utts has in attention. mind, if it exists, can be described only as a depar- In this context, I agree with Utts that Honorton's ture from a generalized statistical model. From the recent report of his automated ganzfeld experi- evidence she presents, we might conclude that we ments is a step in the right direction. He used the are dealing with a variety of different anomalies ganzfeld meta-analyses and the criticisms of the instead of one coherent phenomenon. Clearly, the existing data base to design better experiments and reported effect sizes for the experiments with ran- make some predictions. Although he and Utts be- dom number generators are orders of magnitude lieve that the findings of meaningful effect sizes in lower than those for the ganzfeld experiments. Even the dynamic targets and a lack of a nonzero effect within the same experimental domain, the effect size in the static targets are somehow consistent sizes do not come from the same population. The with previous ganzfeld results, I disagree. 
I believe effects sizes obtained by Jahn are much smaller the static targets are closer in spirit to the original than those obtained'by Schmidt with similar ex- data base. But this is a minor criticism. periments on random number generators. In Honorton's experiments have produced intrigu- the ganzfeld experiments, experimenters differ ing results. If, as Utts suggests, independent labo- significantly in the effect sizes each obtains. ratories can produce similar results with the same This problem of what effect sizes are and what relationships and with the same attention to rigor- they are measuring points to a problem for para- ous methodology, then parapsychology may indeed psychologists. In other fields of science such as have finally captured its elusive quarry. Of course, astronomy, an "anomaly" is a very precisely speci- on several previous occasions in its century-plus fied departure from a well-established substantive history, parapsychology has felt it was on the theory. When Leverrier discovered Neptune by threshold of a breakthrough. The breakthrough studying the perturbations in the orbit of Uranus, never materialized. We will have to patiently wait he was able to characterize the anomaly as a very to see if the current situation is any different. REPLICATION OF PARAPSYCHOLOGY Comment Robert L. Morris Experimental sciences by their nature have found intervals rather than arbitrarily chosen signifi- it relatively easy to deal with simple closed sys- cance levels seems to indicate much greater consis- tems. When they come to study more complex, open tency in the findings than has previously been systems, however, they have more difficulty in gen- claimed. 
erating testable models, must rely more on multi- Second, when one codes the individual studies for variate approaches, have more diversity from flaws and relates flaw abundance with effect size, experiment to experiment (and thus more difficulty there appears to be little correlation for all but one in constructing replication attempts), have more data base. This contradicts the frequent assertion noise in the data, and more difficulty in construct- that parapsychological results disappear when ing a linkage between concept and measurement. methodology is tightened. Additional evidence on Data gatherers and other researchers are more this point is the series of studies by Honorton and likely to be part of the system themselves. Exam- associates using an automated ganzfeld procedure, ples include ecology, economics, apparently better conducted than any of the previ- and parapsychology. Parapsychology can be re- ous research, which nevertheless obtained an effect garded as the study of apparent new means of size very similar to that of the earlier more diverse communication, or transfer of influence, between data base. organism and environment. Any observer attempt- Third, meta-analysis allows researchers to look ing to decide whether or not such psychic communi- at moderator variables, to build a clearer picture of cation has taken place is one of several elements in the conditions that appear to produce the strongest a complex open system composed of an indefinite effects. Research in any real scientific discipline number of interactive features. The system can be must be cumulative, with later researchers build- modeled, as has been done elsewhere (e.g., Morris, ing on the work of those who preceded them. If our 1986) such as to organise our understanding of how earlier successes and failures have meaning, they observers can be misled by themselves, or by delib- should help us obtain increasingly consistent, erate frauds. 
Parapsychologists designing experi- clearer results. If psychic ability exists and is suffi- mental studies must take extreme care to ensure ciently stable that it can be manifest in controlled that the elements in the experimental system do experimental studies, then moderator variables not interact in unanticipated ways to produce arti- should be present in groups of studies that would fact or encourage fraudulent procedures. When re- indicate conditions most favourable and least searchers follow up ;the findings of others, they favourable to the production of large effect sizes. must ensure that the new experimental system From the analyses presented by Utts, for instance, sufficiently resembles the earlier one, regarding its it seems evident that group studies tend to produce important components and their potential interac- poor results and, however convenient it may be to tions. Specifying sufficient resemblance is more dif- conduct them, future researchers should apparently ficult in complex and open systems, and in areas of focus much more on individual testing. When doing research using novel methodologies. ganzfeld studies, it appears best to work with dy- As a result, parapsychology and other such areas namic rather than static target material and with may well profit from the application of modern experienced participants rather than novices. If meta-analysis, and meta-analytic methods may in such results are valid, then future researchers who turn profit from being given a good stiff workout by wish to get strong results now have a better idea of controversial data bases, as suggested by Jessica what procedures to select to increase the likelihood Utts in her article. Parapsychology would appear to of so doing, what elements in the experimental gain from meta-analytic techniques, in at least system seem most relevant. The proportion of stud- three important areas. 
ies obtaining positive results should therefore First, in assessing the question of replication increase. rate, the new focus on effect size and confidence However, the situation may be more complex than the somewhat ideal version painted above. As noted earlier, meta-analysis may learn from para- psychology as well as vice versa. Parapsychological Robert L. Morris occupies the Koestler Chair of data may well give meta-analytic techniques a good Parapsychology in the Department of Psychology at workout and will certainly pose some challenges. the University of Edinburgh, 7 George Square, Ed- None of the cited meta-analyses, as described above, inburgh EH8 9JZ, United Kingdom. apparently employed more than one judge or UTTS evaluator. Certainly none of them cited any corre- dance of existing faults and how to avoid them. If lation values between evaluators, and the correla- results are as strong under well-controlled con- tions between judges of research quality in other ditions as under sloppy ones, then additional social sciences tend to be "at best around .50," research such as that done by Honorton and associ- according to Hunter and Schmidt (1990, page 497). ates under tight conditions should continue to pro- Although Honorton and Hyman reported a rela- duce positive results. tively high correlation of 0.77 between themselves, In addition to the replication issue, there are they were each doing their own study and their some other problems that need to be addressed. So flaw analyses did reach somewhat different conclu- far, the assessment of moderator variables has been sions, as noted by Utts. Other than Hyman, the univariate, whereas a multivariate approach would evaluators cited by Utts tend to be positively ori- seem more likely to produce a clearer picture. Mod- ented toward parapsychology; roughly speaking, all erator variables may covary, with each other or evaluators doing flaw analyses found what they with flaws. 
For instance, in the dice data higher might hope to find, with the exception of the PK effect sizes were found for flawed studies and for dice data base. Were evaluators blind as to study studies with selected subjects. Did studies using outcome when coding flaws? No comment is made special subjects use weaker procedures? on this aspect. The above studies need to be repli- Given the importance attached to effect size and cated, with multiple (and blind) evaluators and incorporating estimates of effect size in designing reported indices of evaluator agreement. Ideally, studies for power, we must be careful not to assume evaluator should be assessed and taken that effect size is independent of number of trials or into account as well. A study with all hostile evalu- subjects unless we have empirical reason to do so. ators may report very high evaluator correlations, Effect size may decrease with larger N if experi- yet be a less valid study than one that employs a menters are stressed or bored towards the end of a range of evaluators and reports lower correlations long study or if there are too many trials to be among evaluators. conducted within a short period of time and sub- But what constitutes a replication of a meta- jects are given less time to absorb their instructions analysis? As with experimental replications, it may or to complete their tasks. On one occasion there is be important to distinguish between exact and con- presentation of an estimated "true average effect ceptual replications. In the former, a replicator size," (0.18 rather than 0.28) without also present- would attempt to match all salient features of the ing an estimate of effect size dispersal. 
Future initial analysis, from the selection of reports to the investigators should have some sense of how the coding of features to the statistical tests employed, likelihood that they will obtain a hit rate of 113 such as to verify that the stated original protocol (where 114 is expected) will vary in accordance had been followed faithfully and that a simi- with conditions. lar outcome results. For conceptual replication, There are a few additional quibbles with particu- replicators would take the stated outcome of the lar points. In Utts' example experiment with Pro- meta-analysis and attempt their own independent fessor A versus Professor B, sex of professor is a analysis, with their own initial report selection possible confounding variable. When Honorton criteria, coding criteria and strategy for statistical omitted studies that did not report direct hits as a testing, to see if similar conclusions resulted. Con- measure, he may have biased his sample. Were ceptual replication allows more room for bias and there studies omitted that could have reported di- resultant debate when findings differ, but when rect hits but declined to do so, conceivably because r,esults are similar they can be assumed to have they looked at that measure, saw no results and more legitimacy. Given the strong and surpris- dropped it? This objection is only with regard to the ing (for many) conclusions reached in the meta- initial meta-analysis and is not relevant for the analysis reported by Utts, it is quite likely later series of studies which all used direct hits. In that others with strong views on parapsychology Honorton's meta-analysis of forced-choice precogni- will attempt to replicate, hoping for clear confirma- tion experiments, the comparison variables of feed- tion or disconfirmation. The diversity of methods back delay and time interval to target selection they are likely to employ and the resultant debates appear to be confounded. 
Studies delaying target should provide a good oppoktunity for airing the selection cannot provide trial by trial feedback, for many conceptual problems still present in meta- instance. Also, I am unsure about using an approxi- analysis. If results differ on moderator variables, mation to Cohen's h for assessing the effect size for there can come to be empirical resolution of the the aspirin study. There would appear to be a very differences as further results unfold. With regard striking effect, with the aspirin condition heart to flaw analysis, such analyses have already fo- attack rate only 55% that of the rate for the placebo cused attention in ganzfeld research on the abun- condition. How was the expected proportion of REPLICATION OF PARAPSYCHOLOGY 395 misses estimated; perhaps Cohen's h greatly un- The above objections should not detract from the derestimates effect size when very low probability overall value of the Utts survey. The findings she events (less than 1in 50 for heart attack in the reports will need to be replicated; but even as is, placebo condition and less than 1 in a 100 for they provide a challenge to some of the cherished aspirin) are involved. I'm not a statistician and arguments of counteradvocates, yet also challenge thus don't know if there is a relevant literature on serious researchers to use these findings effectively this point. as guidelines for future studies.

Comment Frederick Mosteller

Frederick Mosteller is Roger I. Lee Professor of Mathematical Statistics, Emeritus, at Harvard University and Director of the Technology Assessment Group in the Harvard School of Public Health. His mailing address is Department of Statistics, Harvard University, Science Center, 1 Oxford Street, Cambridge, Massachusetts 02138.

Dr. Utts's discussion stimulates me to offer some comments that bear on her topic but do not, in the main, fall into an agree-disagree mode. My references refer to her bibliography.

Let me recommend J. Edgar Coover's work to statisticians who would like to read about a pretty sequence of experiments developed and executed well before Fisher's book on experimental design appeared. Most of the standard kinds of ESP experiments (though not the ganzfeld) are carried out and reported in this 1917 book. Coover even began looking into the amount of information contained in cues such as whispers. He also worked at exposing mediums. I found the book most impressive. As Utts says in her article, the question of significance level was a puzzling one, and one we still cannot solve even though some fields seem to have standardized on 0.05.

When Feller's comments on Stuart and Greenwood's sampling experiments came out in the first edition of his book, I was surprised. Feller devotes a problem to the results of generating 25 symbols from the set a, b, c, d and e (page 45, first edition) using random numbers, with 0 and 1 corresponding to a, 2 and 3 to b, etc. He asks the student to find out how often the 25 produce 5 of each symbol. He asks the student to check the results using random number tables. The answer seems to be about 1 chance in 500. In a footnote Feller then says "They [random numbers] are occasionally extraordinarily obliging: cf. J. A. Greenwood and E. E. Stuart, Review of Dr. Feller's Critique, Journal of Parapsychology, vol. 4 (1940), pp. 298-319, in particular p. 306." The 25 symbols of 5 kinds, 5 of each, correspond to the cards in a parapsychology deck.

The point of page 306 is that Greenwood and Stuart on that page claim to have generated two random orders of such a deck using Tippett's table of random numbers. Apparently Feller thought that it would have taken them a long time to do it. If one assumes that Feller's way of generating a random shuffle is required, then it would indeed be unreasonable to suppose that the experiments could be carried out quickly. I wondered then whether Feller thought this was the only way to produce a random order of such a deck of cards. If you happen to know how to shuffle a deck efficiently using random numbers, it is hard to believe that others do not know. I decided to test it out and so I proposed to a class of 90 people in mathematical statistics that we find a way of using random numbers to shuffle a deck of cards. Although they were familiar with random numbers, they could not come up with a way of doing it, nor did anyone after class come in with a workable idea, though several students made proposals. I concluded that inventing such a shuffling technique was a hard problem and that maybe Feller just did not know how at the time of writing the footnote. My face-to-face attempts to verify this failed because his response was evasive. I also recall Feller speaking at a scientific meeting where someone had complained about mistakes in published papers. He said essentially that we won't have any literature if mistakes are disallowed, and further claimed that he always had mistakes in his own papers, hard as he tried to avoid them. It was fun to hear him speak.

Although I find Utts's discussion of replication engaging as a problem in human perception, I do always feel that people should not be expected to carry out difficult mathematical exercises in their head, off the cuff, without computers, textbooks or advisors. The kind of problem treated requires careful formulation and then careful analysis. Even after a careful analysis is completed, there can be vigorous reasonable arguments about the appropriateness of the formulation and its analysis. These investigations leave me reinforced with the belief that people cannot do hard mathematical problems in their heads, rather than with an attitude toward or against ESP investigations.

When I first became aware of the work of Rhine and others, the concept seemed to me to be very important and I asked a psychologist friend why more psychologists didn't study this field. He responded that there were too many ways to do these experiments in a poorly controlled manner. At the time, I had just discovered that when viewed with light coming from a certain angle, I could read the backs of the cards of my parapsychology deck as clearly as the faces. While preparing these remarks in 1991, I found a note on page 305 of volume 1 of The Journal of Parapsychology (1937) indicating that imperfections in the cards precluded their use in unscreened situations, but that improvements were on the way. Thus I sympathize with Utts's conclusion that much is to be gained by studying how to carry out such work well. If there is no ESP, then we want to be able to carry out null experiments and get no effect; otherwise we cannot put much belief in work on small effects in non-ESP situations. If there is ESP, that is exciting. However, thus far it does not look as if it will replace the telephone.
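For readers who, like Mosteller's class, wonder how a deck can be shuffled efficiently with random numbers: one standard answer, now usually called the Fisher-Yates shuffle, needs just one uniform random integer per card. A minimal sketch in Python; the `rand` hook is a stand-in for however the random integers are obtained (a table of random digits would serve):

```python
import random

def shuffle_deck(deck, rand=random.randrange):
    """Fisher-Yates shuffle: walk from the last position down, swapping
    each position with a uniformly chosen position at or before it.
    Every ordering of the deck is produced with equal probability."""
    deck = list(deck)
    for i in range(len(deck) - 1, 0, -1):
        j = rand(i + 1)  # uniform integer in 0..i, one draw per card
        deck[i], deck[j] = deck[j], deck[i]
    return deck

# A 25-card parapsychology deck: 5 each of 5 symbols.
deck = [s for s in "ABCDE" for _ in range(5)]
shuffled = shuffle_deck(deck)
```

Since each step needs only a single bounded random integer, a random order of a 25-card deck costs about two dozen table look-ups, quick enough to make Greenwood and Stuart's claim entirely plausible.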

Rejoinder Jessica Utts

I would like to thank this distinguished group of Hyman and Morris, by either clarifying my origi- discussants for their thought-provoking contribu- nal statements or by adding more information from tions. They have raised many interesting and di- the original reports. verse issues. Certain points, such as Professor Mosteller's enlightening account of Feller's posi- Points Raised by Diaconis tion, require no further comment. Other points in- dicate the need for clarification and elaboration of Diaconis raised the point that qualified skeptics my original material. Issues raised by Professors and magicians should be active participants in Diaconis and Hyman and subsequent conversations parapsychology experiments. I will discuss this with Robert Rosenthal and Charles Honorton have general concept in the next section, but elaborate led me to consider the topic of "Satisfying the here on the steps that were taken in this regard for Skeptics." Since the conclusion in my paper was the autoganzfeld experiments described in Section not that psychic phenomena have been proved, but 5 of my paper. As reported by Honorton et al. rather that there is an anomalous effect that needs (1990): to be explained, comments by several of the discus- Two experts on the simulation of psi ability sants led me to address the question "Should Psi have examined the autoganzfeld system and Research be Ignored by the Scientific Community?" protocol. Ford Kross has been a professional Finally, each of the discussants addressed repli- mentalist [a magician who simulates psychic ,cation and modeling issues. The last part of my abilities] for over 20 years. . . Mr. Kross has rejoinder comments on some of these ideas and provided us with the following statement: "In discusses them in the context of parapsychology. 
my professional capacity as a mentalist, I have reviewed Psychophysical Research Laborato- CLARIFICATION AND ELABORATION ries' automated ganzfeld system and found it to Since my paper was a survey of hundreds of provide excellent security against deception by experiments and many published reports, I could subjects." We have received similar comments obviously not provide all of the details to accom- from , Professor of Psychology at pany this overview. However, there were details . Professor Bem is well lacking in my paper that have led to legitimate known for his research in social and personal- questions and misunderstandings from several of ity psychology. He is also a member of the the discussants. In this section, I address specific Psychic Entertainers Association and has per- points raised by Professors Diaconis, Greenhouse, formed for many years as a mentalist. He vis- REPLICATION IN PARAPSYCHOLOGY 397 ited PRL for several days and was a subject in cal question is the same. Under the null hypothe- Series 101" [pages 134- 1351. sis, since the target is randomly selected from the four possibilities presented, the probability of a Honorton has also informed me (personal communi- direct hit is 0.25 regardless of who does the judg- cation, July 25, 1991) that several self-proclaimed ing. Thus, the observed anomalous effects cannot skeptics have visited his laboratory and received be explained by assuming there was an over- demonstrations of the autoganzfeld procedure and optimistic judge. that no one expressed any concern with the secu- If Professor Greenhouse is suggesting that the rity arrangements. source of judging may be a moderating variable This may not completely satisfy Professor Diaco- that determines the magnitude of the demonstrated nis' objections, but it does indicate a serious effort anomalous effect, I agree. The parapsychologists on the part of the researchers to involve such peo- have considered this issue in the context of whether ple. 
Further, the original publication of the research in Section 5 followed the reporting criteria established by Hyman and Honorton (1986), thus providing much more detail for the reader than the earlier published records to which Professor Diaconis alludes.

Points Raised by Greenhouse

Greenhouse enumerated four items that offer alternative explanations for the observed anomalous effects. Three of these (items 2-4) will be addressed in this section by elaborating on the details provided in my paper. His item 1 will be addressed in a later section.

Item 2 on his list questioned the role of experimenter expectancy effects as a potential confounder in parapsychological research. While the expectations of the experimenter may influence the reporting of results, the ganzfeld experiments (as well as other psi experiments) are conducted in such a way that experimenter expectancy cannot account for the results themselves. Rosenthal, who Greenhouse cites as the expert in this area, addressed this in his background paper for the National Research Council (Harris and Rosenthal, 1988a) and concluded that the ganzfeld studies were adequately controlled in this regard. He also visited the autoganzfeld laboratory and was given a demonstration of that procedure.

Greenhouse's item 3, the question of what constitutes a direct hit, was addressed in my paper but perhaps needs elaboration. Although free-response experiments do generate substantial amounts of subjective data, the statistical analysis requires that the results for each trial be condensed into a single measure of whether or not a direct hit was achieved. This is done by presenting four choices to a judge (who of course does not know the correct answer) and asking the judge to decide which of the four best matches the subject's response. If the judge picks the target, a direct hit has occurred. It is true that different judges may differ on their opinions of whether or not there has been a direct hit on any given trial, but in all cases the statistical question is the same. Under the null hypothesis, since the target is randomly selected from the four possibilities presented, the probability of a direct hit is 0.25 regardless of who does the judging. Thus, the observed anomalous effects cannot be explained by assuming there was an over-optimistic judge.

If Professor Greenhouse is suggesting that the source of judging may be a moderating variable that determines the magnitude of the demonstrated anomalous effect, I agree. The parapsychologists have considered this issue in the context of whether or not subjects should serve as judges for their own sessions, with differing opinions in different laboratories. This is an example of an area that has been suggested for further research.

Finally, Greenhouse raised the question of the accuracy of the file-drawer estimates used in the reported meta-analyses. I agree that it is instructive to examine the file-drawer estimate using more than one model. As an example, consider the 39 studies from the direct hit and autoganzfeld data bases. Rosenthal's fail-safe N estimates that there would have to be 371 studies in the file-drawer to account for the results. In contrast, the method proposed by Iyengar and Greenhouse gives a file-drawer estimate of 258 studies. Even this estimate is unrealistically large for a discipline with as few researchers as parapsychology. Given that the average number of trials per experiment is 30, this would represent almost 8000 unreported trials, and at least that many hours of work.
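Rosenthal's fail-safe N has a simple closed form: it is the number of unreported zero-effect studies that would pull the Stouffer combined z-score down to bare significance. The sketch below uses made-up per-study z-scores purely to illustrate the computation; the z-scores of the 39 studies discussed above are not reproduced here.

```python
import math

def stouffer_z(z_scores):
    """Stouffer's combined z: Z = sum(z) / sqrt(k)."""
    return sum(z_scores) / math.sqrt(len(z_scores))

def failsafe_n(z_scores, z_crit=1.645):
    """Rosenthal's fail-safe N: the number of file-drawer studies with z = 0
    needed to reduce the combined z to z_crit (p = .05, one-tailed).
    Solves sum(z) / sqrt(k + N) = z_crit for N."""
    k = len(z_scores)
    return (sum(z_scores) / z_crit) ** 2 - k

# Hypothetical z-scores for three studies (illustration only):
zs = [2.0, 2.0, 2.0]
n_fs = failsafe_n(zs)
print(round(n_fs, 2))  # about 10.3 hidden null studies would be needed

# Sanity check: padding with roughly that many zero z-scores drives the
# combined z down to about the 1.645 cutoff.
print(round(stouffer_z(zs + [0.0] * round(n_fs)), 2))
```

The same formula applied to the 39-study combined z reproduces estimates of the order reported above; the Iyengar and Greenhouse selection-model estimate requires their model and is not shown here.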
There are pros and cons to any method of estimating the number of unreported studies, and the actual practices of the discipline in question should be taken into account. Recognizing publication bias as an issue, the Parapsychological Association has had an official policy since 1975 against the selective reporting of positive results. Of the original ganzfeld studies reported in Section 4 of my paper, less than half were significant, and it is a matter of record that there are many nonsignificant studies and "failed replications" published in all areas of psi research. Further, the autoganzfeld database reported in Section 5 has no file-drawer. Given the publication practices and the size of the field, the proposed file-drawer cannot account for the observed effects.

Points Raised by Hyman

One of my goals in writing this paper was to present a fair account of recent work and debate in parapsychology.
Thus, I was disturbed that Hyman, who has devoted much of his career to the study of parapsychology, and who had first-hand knowledge of the original published reports, believed that some of my statements were inaccurate and indicated that I had not carefully read the reports. I will address some of his specific objections and show that, except where noted, the accuracy of my original statements can be verified by further elaboration and clarification, with due apology for whatever necessary details were lacking in my original report.

Most of our points of disagreement concern the National Academy of Sciences (National Research Council) report Enhancing Human Performance (Druckman and Swets, 1988). This report evaluated several controversial areas, including parapsychology. Professor Hyman chaired the Parapsychology Subcommittee. Several background papers were commissioned to accompany this report, available from the "Publication on Demand Program" of the National Academy Press.
One of the papers was written by Harris and Rosenthal, and entitled "Human Performance Research: An Overview."

Professor Hyman alleged that "Utts mistakenly asserts that my subcommittee on parapsychology commissioned Harris and Rosenthal to evaluate parapsychology experiments for us. . . ." I cannot find a statement in my paper that asserts that Harris and Rosenthal were commissioned by the subcommittee, nor can I find a statement that asserts that they were asked to evaluate parapsychology experiments. Nonetheless, I believe our substantive disagreement results from the fact that the work by Harris and Rosenthal was written in two parts, both of which I referenced in my paper. They were written several months apart, but published together, and each had its own history.

The first part (Harris and Rosenthal, 1988a) is the one to which I referred with the words "Rosenthal was commissioned by the National Academy of Sciences to prepare a background paper to accompany its 1988 report on parapsychology" (p. 372). According to Rosenthal (personal communication, July 23, 1991), he was asked to prepare a background paper to address evaluation issues and experimenter effects to accompany the report in five specific areas of research, including parapsychology.
The second part was a "Postscript" to the commissioned paper (Harris and Rosenthal, 1988b), and this is the one to which I referred on page 371 as "requested by Hyman in his capacity as Chair of the National Academy of Sciences' Subcommittee on Parapsychology." (It is probably this wording that led Professor Hyman to his erroneous allegation.) The postscript began with the words "We have been asked to respond to a letter from Ray Hyman, chair of the subcommittee on parapsychology, in which he raises questions about the presence and consequence of methodological flaws in the ganzfeld studies. . . ."

In reference to this postscript, I stand corrected on a technical point, because Hyman himself did not request the response to his own letter. As noted by Palmer, Honorton and Utts (1989), the postscript was added because:

At one stage of the process, John Swets, Chair of the Committee, actually phoned Rosenthal and asked him to withdraw the parapsychology section of his [commissioned] paper. When Rosenthal declined, Swets and Druckman then requested that Rosenthal respond to criticisms that Hyman had included in a July 30, 1987 letter to Rosenthal [page 38].

A related issue on which I would like to elaborate concerns the correlation between flaws and success in the original ganzfeld data base. Hyman has misunderstood both my position and that of Harris and Rosenthal. He believes that I implicitly denied the importance of the flaws, so I will make my position explicit. I do not think there is any evidence that the experimental results were due to the identified flaws. The flaw analysis was clearly useful for delineating acceptable criteria for future experiments. Several experiments were conducted using those criteria. The results were similar to the original experiments. I believe that this indicates an anomaly in need of an explanation.

In discussing the paper and postscript by Harris and Rosenthal, Hyman stated that "The alleged contradictory conclusions [to the National Research Council report] of Harris and Rosenthal are based on a meta-analysis that supports Honorton's position when Honorton's [flaw] ratings are used and supports my position when my ratings are used." He believes that Harris and Rosenthal (and I) failed to see this point because the low power of the test associated with their analysis was not taken into account.

The analysis in question was based on a canonical correlation between flaw ratings and measures of successful outcome for the ganzfeld studies. The canonical correlation was 0.46, a value Hyman finds to be impressive. What he has failed to take into account, however, is that a canonical correlation gives only the magnitude of the relationship, and not the direction. A careful reading of Harris and Rosenthal (1988b) reveals that their analysis actually contradicted the idea that the flaws could account for the successful ganzfeld results, since "Interestingly, three of the six flaw variables correlated positively with the flaw canonical variable and with the outcome canonical variable but three correlated negatively" (page 2, italics added).
Rosenthal (personal communication, July 23, 1991) verified that this was indeed the point he was trying to make. Readers who are interested in drawing their own conclusions from first-hand analyses can find Hyman's original flaw codings in an Appendix to his paper (Hyman, 1985, pages 44-49).

Finally, in my paper, I stated that the parapsychology chapter of the National Research Council report critically evaluated statistically significant experiments, but not those that were nonsignificant. Professor Hyman "does not know how [I] got such an impression," so I will clarify by outlining some of the material reviewed in that report. There were surveys of three major areas of psi research: remote viewing (a particular type of free-response experiment), experiments with random number generators, and the ganzfeld experiments. As an example of where I got the impression that they evaluated only significant studies, consider the section on remote viewing.
It began by referencing a published list of 28 studies. Fifteen of these were immediately discounted, since "only 13. . . were published under refereed auspices" (Druckman and Swets, 1988, page 179). Four more were then dismissed, since "Of the 13 scientifically reported experiments, 9 are classified as successful" (page 179). The report continued by discussing these nine experiments, never again mentioning any of the remaining 19 studies. The other sections of the report placed similar emphasis on significant studies. I did not think this was a valid statistical method for surveying a large body of research.

Minor Point Raised by Morris

The final clarification I would like to offer concerns the minor point raised by Professor Morris, that "When Honorton omitted studies that did not report direct hits as a measure, he may have biased his sample." This possibility was explicitly addressed by Honorton (1985, page 59). He examined what would happen if z-scores of zero were inserted for the 10 studies for which the number of direct hits was not measured, but could have been. He found that even with this conservative scenario, the combined z-score only dropped from 6.60 to 5.67.
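Honorton's conservative adjustment is easy to verify with Stouffer's method. The study count of 28 (the direct-hit studies in the original ganzfeld meta-analysis) is assumed here rather than stated in this passage, but it reproduces both of the quoted figures:

```python
import math

# Combined z of 6.60 over the k studies reporting direct hits;
# k = 28 is assumed from the original ganzfeld meta-analysis.
k, z_combined = 28, 6.60
z_sum = z_combined * math.sqrt(k)        # Stouffer: Z = sum(z) / sqrt(k)

# Insert z = 0 for the 10 omitted studies: the sum of z-scores is
# unchanged, but the study count grows from k to k + 10.
z_conservative = z_sum / math.sqrt(k + 10)
print(round(z_conservative, 2))  # 5.67, matching Honorton's figure
```

The point of the calculation is that even the most pessimistic treatment of the omitted studies leaves the combined result far beyond conventional significance levels.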
SATISFYING THE SKEPTICS

Parapsychology is probably the only scientific discipline for which there is an organization of skeptics trying to discredit its work. The Committee for the Scientific Investigation of Claims of the Paranormal (CSICOP) was established in 1976 by philosopher Paul Kurtz and sociologist Marcello Truzzi when "Kurtz became convinced that the time was ripe for a more active crusade against parapsychology and other pseudo-scientists" (Pinch and Collins, 1984, page 527). Truzzi resigned from the organization the next year (as did Professor Diaconis) "because of what he saw as the growing danger of the committee's excessive negative zeal at the expense of responsible scholarship" (Collins and Pinch, 1982, page 84). In an advertising brochure for their publication The Skeptical Inquirer, CSICOP made clear its belief that paranormal phenomena are worthy of scientific attention only to the extent that scientists can fight the growing interest in them. Part of the text of the brochure read: "Why the sudden explosion of interest, even among some otherwise sensible people, in all sorts of paranormal 'happenings'?. . . Ten years ago, scientists started to fight back. They set up an organization - The Committee for the Scientific Investigation of Claims of the Paranormal."

During the six years that I have been working with parapsychologists, they have repeatedly expressed their frustration with the unwillingness of the skeptics to specify what would constitute acceptable evidence, or even to delineate criteria for an acceptable experiment. The Hyman and Honorton Joint Communiqué was seen as the first major step in that direction, especially since Hyman was the Chair of the Parapsychology Subcommittee of CSICOP.
Hyman and Honorton (1986) devoted eight pages to "Recommendations for Future Psi Experiments," carefully outlining details for how the experiments should be conducted and reported. Honorton and his colleagues then conducted several hundred trials using these specific criteria and found essentially the same effect sizes as in earlier work, for both the overall effect and effects with moderator variables taken into account. I would expect Professor Hyman to be very interested in the results of these experiments he helped to create. While he did acknowledge that they "have produced intriguing results," it is both surprising and disappointing that he spent only a scant two paragraphs at the end of his discussion on these results.

Instead, Hyman seems to be proposing yet another set of requirements to be satisfied before parapsychology should be taken seriously. It is difficult to sort out what those requirements should be from his account: "[They should] specify, in advance, the complete sample space and the critical region. When they get to the point where they can specify this along with some boundary conditions and make some reasonable predictions, then they will have demonstrated something worthy of our attention."

Diaconis believes that psi experiments do not deserve serious attention unless they actively involve skeptics. Presumably, he is concerned with subject or experimenter fraud, or with improperly controlled experiments. There are numerous documented cases of fraud and trickery in purported psychic phenomena. Some of these were observed by Diaconis and reported in his article in Science. Such cases have mainly been revealed when investigators attempted to verify the claims of individual psychic practitioners in quasi-experimental or uncontrolled conditions. These instances have received considerable attention, probably because the claims are so sensational, the fraud is so easy to detect by a skilled observer and they are an easy target for skeptics looking for a way to discredit psychic phenomena. As noted by Hansen (1990), "Parapsychology has long been tainted by the fraudulent behavior of a few of those claiming psychic abilities" (page 25).
Control against deception by subjects in the laboratory has been discussed extensively in the parapsychological literature (see, e.g., Morris, 1986, and Hansen, 1990). Properly designed experiments should preclude the possibility of such fraud. Hyman and Honorton (1986, page 355) explicitly discussed precautions to be taken in the ganzfeld experiments, all of which were followed in the autoganzfeld experiments. Further, the controlled laboratory experiments discussed in my paper usually used a large number of subjects, a situation that minimizes the possibility that the results were due to fraud on the part of a few subjects. As for the possibility of experimenter fraud, it is of course an issue in all areas of science. There have been a few such instances in parapsychology, but since parapsychologists tend to be aware of this possibility, they were generally detected and exposed by insiders in the field.

It is not clear whether or not Diaconis is suggesting that a magician or "qualified skeptic" needs to be present at all times during a laboratory experiment. I believe that it would be more productive for such consultation to occur during the design phase, and during the implementation of some pilot sessions. This is essentially what was done for the autoganzfeld experiments, in which Professor Hyman, a skeptic as well as an accomplished magician, participated in the specification of design criteria, and mentalists Bem and Kross observed experimental sessions. Bem is also a well-respected experimental psychologist.

While I believe that the skeptics, particularly some of the more knowledgeable members of CSICOP, have served a useful role in helping to improve experiments, their counter-advocacy stance is counterproductive. If they are truly interested in resolving the question of whether or not psi abilities exist, I would expect them to encourage evaluation and experimentation by unbiased, skilled experimenters. Instead, they seem to be trying to discourage such interest by providing a moving target of requirements that must be satisfied first.
SHOULD PSI RESEARCH BE IGNORED BY THE SCIENTIFIC COMMUNITY?

In the conclusion of my paper, I argued that the scientific community should pay more attention to the experimental results in parapsychology. I was not suggesting that the accumulated evidence constitutes proof of psi abilities, but rather that it indicates that there is indeed an anomalous effect that needs an explanation. Greenhouse noted that my paper will not necessarily change anyone's view about the existence of paranormal phenomena, an observation with which I agree. However, I hope it will change some views about the importance of further investigation.

Mosteller and Diaconis both acknowledged that there are reasons for statisticians to be interested in studying the anomalous effects, regardless of whether or not psi is real. As noted by Mosteller, "If there is no ESP, then we want to be able to carry out null experiments and get no effect, otherwise we cannot put much belief in work on small effects in non-ESP situations." Diaconis concluded that "Parapsychology is worthy of serious study" partly because "If it is wrong, it offers a truly alarming massive case study of how statistics can mislead and be misused."

Greenhouse noted several sociological reasons for the resistance of the scientific community to accepting parapsychological phenomena. One of these is that they directly contradict the laws of physics. However, this assertion is not uniformly accepted by physicists (see, e.g., Oteri, 1975), and some of the leading parapsychological researchers hold Ph.D.s in physics.

Another reason cited by Greenhouse, and supported by Hyman, is that psychic phenomena are currently unexplainable by a unified scientific theory. But that is precisely the reason for more intensive investigation. The history of science and medicine is replete with examples where empirical departures from expectation led to important findings or theoretical models. For example, the causal connection between cigarette smoking and lung cancer was established only after years of statistical studies, resulting from the observation by one physician that his lung cancer patients who smoked did not recover at the same rate as those who did not. There are many medications in common use for which there is still no medical explanation for their observed therapeutic effectiveness, but that does not prohibit their use.

There are also examples where a coherent theory of a phenomenon was impossible because the requisite background information was missing. For instance, the current theory of endorphins as an explanation for the success of acupuncture would have been impossible before the discovery of endorphins in the 1970s.
Mosteller's observation that ESP will not replace the telephone leads to the question of whether or not psi abilities are of any use even if they do exist, since the effects are relatively small. Again, a look at history is instructive. For example, in 1938 Fortune Magazine reported that "At present, few scientists foresee any serious or practical use for atomic energy."

Greenhouse implied that I think parapsychology is not accepted by more of the scientific community only because they have not examined the data, but this misses the main point I was trying to make. The point is that individual scientists are willing to express an opinion without any reference to data. The interesting sociological question is why they are so resistant to examining the data. One of the major reasons is undoubtedly the perception identified by Greenhouse that there is some connection between parapsychology and the occult, or worse, religious beliefs. Since religion is clearly not in the realm of science, the very thought that parapsychology might be a science leads to what psychologists call "cognitive dissonance." As noted by Griffin (1988), "People feel unpleasantly aroused when two cognitions are dissonant - when they contradict one another" (page 33). Griffin continued by observing that there are also external reasons for scientists to discount the evidence, since "It is generally easier to be a skeptic in the face of novel evidence; skeptics may be overly conservative, but they are rarely held up to ridicule" (page 34).

In summary, while it may be safer and more consonant with their beliefs for individual scientists to ignore the observed anomalous effects, the scientific community should be concerned with finding an explanation. The explanations proposed by Greenhouse and others are simply not tenable.
REPLICATION AND MODELING

Parapsychology is one of the few areas where a point null hypothesis makes some sense. We can specify what should happen if there is no such thing as ESP by using simple binomial models, either to find p-values or Bayes factors. As noted by Mosteller, if there is no ESP, or other nonstatistical explanation for an effect, we should be able to carry out null experiments and get no effect. Otherwise, we should be worried about using these simple models for other applications.

Greenhouse, in his first alternative explanation for the results, questioned the use of these simple models, but his criticisms do not seem relevant to the experiments discussed in Section 5 of my paper. The experiments to which he referred were either poorly controlled, in which case no statistical analysis could be valid, or were specifically designed to incorporate trial by trial feedback in such a way that the analysis needed to account for the added information. Models and analyses for such experiments can be found in the references given at the end of Diaconis' discussion.

For the remainder of this discussion, I will confine myself to models appropriate for experiments such as the autoganzfeld described in Section 5. It is this scenario for which Bayarri and Berger computed Bayes factors, and for which Dawson discussed possible Bayesian models.

If ESP does exist, it is undoubtedly a gross oversimplification to use a simple non-null binomial model for these experiments. In addition to potential differences in ability among subjects, there were also observed differences due to dynamic versus static targets, whether or not the sender was a friend, and how the receiver scored on measures of extraversion. All of these differences were anticipated in advance and could be incorporated into models as covariates.
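The point-null binomial model mentioned above is easy to make concrete: under the null, each session is a Bernoulli trial with hit probability 0.25, and the one-sided exact p-value is a finite sum. The counts in this sketch are hypothetical, not figures from the autoganzfeld database.

```python
from math import comb

def binom_pvalue(hits, trials, p0=0.25):
    """Exact one-sided p-value: P(X >= hits) for X ~ Binomial(trials, p0)."""
    return sum(comb(trials, k) * p0**k * (1 - p0)**(trials - k)
               for k in range(hits, trials + 1))

# Hypothetical counts: 40 direct hits in 120 trials against a chance rate of 0.25.
print(binom_pvalue(40, 120))
```

Nothing about the judging procedure changes this null distribution, which is why the choice of judge cannot by itself manufacture a significant result.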
It is nonetheless instructive to examine the Bayes factor computed by Bayarri and Berger for the simple non-null binomial model. First, the observed anomalous effects would be less interesting if the Bayes factor was small for reasonable values of r, as it was for the random number generator experiments analyzed by Jefferys (1990), most of which purported to measure psychokinesis instead of ESP. Second, the Bayes factor provides a rough measure of the strength of the evidence against the null hypothesis and is a much more sensible summary than the p-value. The Bayes factors provided by Bayarri and Berger are probably more conservative, in the sense of favoring the null hypothesis, than those that would result from priors elicited from parapsychologists, but are probably reasonable for those who know nothing about past observed effects. I expect that most parapsychologists would not opt for a prior symmetric around chance, but would still choose one with some mass below chance. The final reason it is instructive to examine these Bayes factors is that they provide a quantitative challenge to skeptics to be explicit about their prior probabilities for the null and alternative hypotheses.
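Readers who want to take up that challenge can compute such a Bayes factor numerically. The prior in this sketch (uniform on hit rates between 0.25 and 1) and the illustrative counts are my own choices, not the prior or the data used by Bayarri and Berger:

```python
import math

def log_binom_lik(x, n, p):
    """Log-likelihood of x hits in n trials at hit rate p."""
    return (math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
            + x * math.log(p) + (n - x) * math.log(1 - p))

def bayes_factor_01(x, n, p0=0.25, lo=0.25, hi=1.0, grid=10000):
    """BF_01 = P(data | H0) / P(data | H1), where H1 puts a uniform prior on
    the hit rate over (lo, hi); the marginal is a midpoint-rule integral."""
    log_m0 = log_binom_lik(x, n, p0)
    step = (hi - lo) / grid
    # Average the likelihood ratio over the prior (density 1 / (hi - lo)),
    # working relative to the null likelihood for numerical stability.
    marginal_ratio = sum(
        math.exp(log_binom_lik(x, n, lo + (i + 0.5) * step) - log_m0)
        for i in range(grid)) * step / (hi - lo)
    return 1.0 / marginal_ratio

# Illustrative counts only: 122 hits in 355 trials (about a 34% hit rate
# against a 25% chance rate).
bf01 = bayes_factor_01(122, 355)
print(bf01)  # well below 1: these data would favor the alternative
```

A skeptic who announces prior odds on the null can multiply them by this factor to obtain posterior odds, which is exactly the explicitness the paragraph above asks for.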
Dawson discussed the use of more complex Bayesian models for the analysis of the autoganzfeld data. She proposed a hierarchical model where the number of successes for each experiment followed a binomial distribution with hit rate p_i, and logit(p_i) came from a normal distribution with noninformative priors for the mean and variance. She then expanded this model to include heavier tails by allowing an additional scale parameter for each experiment. Her rationale for this expanded model was that there were clear outlier series in the data.

The hierarchical model proposed by Dawson is a reasonable place to start, given only that there were several experiments trying to measure the same effect, conducted by different investigators. In the autoganzfeld database, the model could be expanded to incorporate the additional information available. Each experiment contained some sessions with static targets and some with dynamic targets, some sessions in which the sender and receiver were friends and others in which they were not, and some information about the extraversion score of the receiver. All of this information could be included by defining the individual session as the unit of analysis, and including a vector of covariates for each session. It would then make sense to construct a logistic regression model with a component for each experiment, following the model proposed by Dawson, and a term Xβ to include the covariates. A prior distribution for β could include information from earlier ganzfeld studies. The advantage of using a Bayesian approach over a simple logistic regression is that information could be continually updated. Some of the recent work in Bayesian design could then be incorporated so that future trials make use of the best conditions.
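The generative side of that expanded model can be sketched by simulation: an experiment-level intercept on the logit scale plus session covariates for target type, sender friendship and extraversion. All parameter values below are invented for illustration and are not estimates from the autoganzfeld data.

```python
import math
import random

random.seed(1)

def logistic(t):
    return 1.0 / (1.0 + math.exp(-t))

# Invented parameters: a baseline near the chance rate (logit(0.25) is about
# -1.1), with dynamic targets, sender-as-friend and extraversion all assumed
# to push the session-level hit rate upward.
MU, SIGMA = -1.1, 0.2                     # experiment-level intercepts
BETA = {"dynamic": 0.4, "friend": 0.2, "extraversion": 0.1}

def simulate(n_experiments=10, sessions_per_exp=2000):
    """Draw sessions from the expanded hierarchical model:
    hit probability = logistic(alpha_experiment + x'beta)."""
    sessions = []
    for _ in range(n_experiments):
        alpha = random.gauss(MU, SIGMA)   # random effect for this experiment
        for _ in range(sessions_per_exp):
            x = {"dynamic": random.random() < 0.5,
                 "friend": random.random() < 0.5,
                 "extraversion": random.gauss(0.0, 1.0)}  # standardized score
            eta = alpha + sum(BETA[k] * x[k] for k in BETA)
            sessions.append((x, random.random() < logistic(eta)))
    return sessions

def hit_rate(sessions, cond):
    hits = [hit for x, hit in sessions if cond(x)]
    return sum(hits) / len(hits)

sessions = simulate()
print(round(hit_rate(sessions, lambda x: x["dynamic"]), 3))      # dynamic targets
print(round(hit_rate(sessions, lambda x: not x["dynamic"]), 3))  # static targets
```

Fitting, rather than simulating, this model would proceed by Bayesian logistic regression with experiment-specific intercepts, with the prior on the coefficient vector informed by the earlier ganzfeld studies, as described above.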
Several of the discussants addressed the concept of replication. I agree with Mosteller's implication that it was unwise for the audience in my seminar to respond to my replication questions so quickly, and that was precisely my point. Most nonstatisticians do not seem to understand the complexity of the replication question. Parenthetically, when I posed the same scenario to an audience of statisticians, very few were willing to offer a quick opinion.

Bayarri and Berger provided an insightful discussion of the purpose of replication, offering quantitative answers to questions that were implicit in my discussion. Their analyses suggest some alternatives to power analysis that might be considered when designing a new study to try to replicate a questionable result.

Morris addressed the question of what constitutes a replication of a meta-analysis. He distinguished between exact and conceptual replications. Using his distinction, the autoganzfeld meta-analysis could be viewed as a conceptual replication of the earlier ganzfeld meta-analysis. He noted that when such a conceptual replication offers results similar to those of the original meta-analysis, it lends legitimacy to the original results, as was the case with the autoganzfeld meta-analysis.

Greenhouse and Morris both noted the value of meta-analysis as a method of comparing different conditions, and I endorse that view. Conditions found to produce different effects in one meta-analysis could be explicitly studied in a conceptual replication. One of the intriguing results of the autoganzfeld experiments was that they supported the distinction between effect sizes for dynamic versus static targets found in the earlier ganzfeld work, and they supported the relationship between ESP and extraversion found in the meta-analysis by Honorton, Ferrari and Bem (1990).

Most modern parapsychologists, as indicated by Morris, recognize that demonstrating the validity of their preliminary findings will depend on identifying and utilizing "moderator variables" in future studies. The use of such variables will require more complicated statistical models than the simple binomial models used in the past. Further, models are needed for combining results from several different experiments that do not oversimplify at the expense of lost information.

In conclusion, the anomalous effect that persists throughout the work reviewed in my paper will be better understood only after further experimentation that takes into account the complexity of the system. More realistic, and thus more complex, models will be needed to analyze the results of those experiments. This presents a challenge that I hope will be welcomed by the statistics community.

ADDITIONAL REFERENCES

ALLISON, P. (1979). Experimental parapsychology as a rejected science. The Sociological Review Monograph 27 271-291.
BARBER, B. (1961). Resistance by scientists to scientific discovery. Science 134 596-602.
BERGER, J. O. and DELAMPADY, M. (1987). Testing precise hypotheses (with discussion). Statist. Sci. 2 317-352.
CHUNG, F. R. K., DIACONIS, P., GRAHAM, R. L. and MALLOWS, C. L. (1981). On the permanents of complements of the direct sum of identity matrices. Adv. Appl. Math. 2 121-137.

COCHRAN, W. G. (1954). The combination of estimates from different experiments. Biometrics 10 101-129.
COLLINS, H. and PINCH, T. (1979). The construction of the paranormal: Nothing unscientific is happening. The Sociological Review Monograph 27 237-270.
COLLINS, H. M. and PINCH, T. J. (1982). Frames of Meaning: The Social Construction of Extraordinary Science. Routledge & Kegan Paul, London.
CORNFIELD, J. (1959). Principles of research. American Journal of Mental Deficiency 64 240-252.
DEMPSTER, A. P., SELWYN, M. R. and WEEKS, B. J. (1983). Combining historical and randomized controls for assessing trends in proportions. J. Amer. Statist. Assoc. 78 221-227.
DIACONIS, P. and GRAHAM, R. L. (1981). The analysis of sequential experiments with feedback to subjects. Ann. Statist. 9 236-244.
FISHER, R. A. (1932). Statistical Methods for Research Workers, 4th ed. Oliver and Boyd, London.
FISHER, R. A. (1935). Has Mendel's work been rediscovered? Ann. of Sci. 1 116-137.
GALTON, F. (1901-2). Biometry. Biometrika 1 7-10.
GREENHOUSE, J., FROMM, D., IYENGAR, S., DEW, M. A., HOLLAND, A. and KASS, R. (1990). Case study: The effects of rehabilitation for aphasia. In The Future of Meta-Analysis (K. W. Wachter and M. L. Straf, eds.) 31-32. Russell Sage Foundation, New York.
GRIFFIN, D. (1988). Intuitive judgment and the evaluation of evidence. In Enhancing Human Performance: Issues, Theories and Techniques, Background Papers, Part I. National Academy Press, Washington, D.C.
HANSEN, G. (1990). Deception by subjects in psi research. Journal of the American Society for Psychical Research 84 25-80.
HUNTER, J. and SCHMIDT, F. (1990). Methods of Meta-Analysis. Sage, London.
IYENGAR, S. and GREENHOUSE, J. (1988). Selection models and the file drawer problem (with discussion). Statist. Sci. 3 109-135.
LOUIS, T. A. (1984). Estimating an ensemble of parameters using Bayes and empirical Bayes methods. J. Amer. Statist. Assoc. 79 393-398.
MANTEL, N. and HAENSZEL, W. (1959). Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute 22 719-748.
MORRIS, C. (1983). Parametric empirical Bayes inference: Theory and applications (rejoinder). J. Amer. Statist. Assoc. 78 47-65.
MORRIS, R. L. (1986). What psi is not: The necessity for experiments. In Foundations of Parapsychology (H. L. Edge, R. L. Morris, J. H. Rush and J. Palmer, eds.) 70-110. Routledge & Kegan Paul, London.
MOSTELLER, F. and BUSH, R. R. (1954). Selected quantitative techniques. In Handbook of Social Psychology (G. Lindzey, ed.) 1 289-334. Addison-Wesley, Cambridge, Mass.
MOSTELLER, F. and CHALMERS, T. (1991). Progress and problems in meta-analysis. Statist. Sci. To appear.
OTERI, L., ed. (1975). Quantum Physics and Parapsychology. Parapsychology Foundation, New York.
PINCH, T. J. and COLLINS, H. M. (1984). Private science and public knowledge: The Committee for the Scientific Investigation of Claims of the Paranormal and its use of the literature. Social Studies of Science 14 521-546.
PLATT, J. R. (1964). Strong inference. Science 146 347-353.
ROSENTHAL, R. (1966). Experimenter Effects in Behavioral Research. Appleton-Century-Crofts, New York.
ROSENTHAL, R. (1979). The "file drawer problem" and tolerance for null results. Psychological Bulletin 86 638-641.
RYAN, L. M. and DEMPSTER, A. P. Weighted normal plots. Technical Report 394, Dana-Farber Cancer Inst., Boston, Mass.
SAMANIEGO, F. J. and UTTS, J. (1983). Evaluating performance in continuous experiments with feedback to subjects. Psychometrika 48 195-209.
SMITH, M. and GLASS, G. (1977). Meta-analysis of psychotherapy outcome studies. American Psychologist 32 752-760.
WACHTER, K. (1988). Disturbed by meta-analysis? Science 241 1407-1408.
WEST, M. (1985). Generalized linear models: Scale parameters, outlier accommodation and prior distributions. In Bayesian Statistics 2 (J. M. Bernardo, M. H. DeGroot, D. V. Lindley and A. F. M. Smith, eds.) 531-558. North-Holland, Amsterdam.