The Null Hypothesis Statistical Testing Controversy: Why Do We Still Use NHST?

Ryan G. Barnhart

A thesis submitted to the Faculty of Graduate Studies in partial fulfillment of the requirements for the degree of Master of Arts

Graduate Programme in Psychology
York University
Toronto, ON

June 2010

©2010

Permission has been granted to: a) YORK UNIVERSITY LIBRARIES to lend or sell copies of this thesis in paper, microform or electronic formats, and b) LIBRARY AND ARCHIVES CANADA to reproduce, lend, distribute, or sell copies of this thesis anywhere in the world in microform, paper or electronic formats and to authorize or procure the reproduction, loan, distribution or sale of copies of this thesis anywhere in the world in microform, paper or electronic formats.

The author reserves other publication rights, and neither the thesis nor extensive extracts from it may be printed or otherwise reproduced without the author's written permission.

Abstract

In spite of a controversy that has lingered over the past 60 years, the use of null hypothesis statistical testing (NHST) has remained a part of experimental practice in psychology. The use of NHST has been the target of such strong attack that it is rare to find an experimental journal in psychology that has not discussed its shortfalls. Despite cogent criticisms, the technique remains in use. The present investigation examines the history of NHST in psychology and attempts to determine why it has remained a part of the repertoire of methodological approaches. It is argued that as psychology developed its quantified and experimental branches, certain underlying theoretical issues went unresolved. As experimental practices evolved from early psychophysics, introspection, and correlation, the quality and meaning of results were plagued by subjectivity. NHST has survived attack for its perceived ability to provide objective determinations and a standard model of experimentation for psychology.

Acknowledgements

I would like to thank my advisor, Dr. Alexandra Rutherford, for her patience, guidance, and support throughout a difficult and often trying process. I would also like to thank Dr. Christopher D. Green, whose insightful comments and unique perspective helped me approach this topic from new directions. Great appreciation is also extended to my colleagues in History and Theory, whose faith and friendship helped me see this project through to the end. I am also indebted to my wife Ann for the inspiration she constantly provides, driving me towards being a better man. My final acknowledgements are extended to my mother Gail and mentors Peter Chow and Jimmy Kolios for their love, loyalty and individual gifts of spirit.

Table of Contents

Introduction 1

Chapter 1: The NHST Debate: Past, Present and Persistent 9
    The Substance of the NHST Debate: An Introduction 11
    The Decision Making Problem 12
    The Problem of Appropriateness 14
    The Problem of Logic 15
    The Problem of Sensitivity 17
    Some Remarks on the Controversy 19
    The NHST Debate: Origins 21
    Sir Francis Galton: Defining Correlation 22
    Karl Pearson, Skew Distributions and Chi-Square 24
    Gosset, Fisher, Neyman and Pearson 29
    Some Remarks Regarding the Origins of NHST 37
    The NHST Debate: The Persistence of a Problem in Psychology 40
    The NHST Debate: Concluding Remarks 44

Chapter 2: Quantified Reasoning and a Standardized Image of Psychology 47
    Fechner's Mathematical Psychology 48
    Fechner's in the Service of Precision 56
    Wundt and the Discipline of Experimental Psychology 60
    Physiological Psychology as a Distinct Discipline 61
    Inside Wundt's Psychological Laboratory 64
    Wundt's Methods for Psychology as a Legitimizing and Standardizing Force 66
    Quantified Reasoning and a Standardized Image of Psychology: Concluding Argument 69

Chapter 3: The Fall of Introspection and Rise of Correlation 72
    The Confusions of Introspection 74
    The Problems of Introspection 79
    A Changing Discipline 81
    Correlation and Intelligence: The Rise of Galtonian Psychology 86
    The Fall of Introspection and Rise of Correlation: Concluding Argument 94

Chapter 4: Controlled Experimentation, NHST and the Current Issue of Method 96
    In Want of More than Correlation 98
    NHST as the Solution to Controlled Experimentation and Consensus 104
    The NHST Controversy Revisited 108
    The Problem of Consensus: A Review 109
    Why Our Use of NHST Has Not Changed 117
    Controlled Experimentation, NHST and the Current Issue of Method: Concluding Argument 121

Concluding Remarks 123

References 128

Introduction

The procedure known as Null Hypothesis Statistical Testing (NHST) has received a substantial amount of attention in the scientific literature over the past 50 years. The technique itself, in brief, requires that investigators formulate both a hypothesis regarding their investigation and its opposite, the null hypothesis; conduct experimentation to obtain data; and then estimate the probability (the p value) of having obtained these data (or more extreme data) by chance (the assumption of equipotentiality), under the mathematical constraints provided by the conditions stipulated in the null hypothesis. The goal of this process is to determine whether the obtained data are sufficiently unlikely to have occurred by chance alone (the decision point is typically set at p < .05) that the conditions of the null hypothesis can be rejected as an adequate description of the data.
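To make the sequence of steps concrete, the following minimal sketch (assuming only the numpy and scipy libraries; the groups, means and sample sizes are hypothetical) carries out the procedure as a two-sample t-test:

```python
# A minimal sketch of the NHST procedure described above, run as a
# two-sample t-test. All numbers are hypothetical illustrations.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Step 1: the null hypothesis stipulates that the two groups share a
# common population mean; the research hypothesis is its opposite.
control = rng.normal(loc=100.0, scale=15.0, size=30)
treated = rng.normal(loc=108.0, scale=15.0, size=30)  # a true difference exists

# Step 2: compute the test statistic and the p value, i.e., the probability
# of data at least this extreme under the conditions of the null hypothesis.
t_stat, p_value = stats.ttest_ind(control, treated)

# Step 3: the conventional decision point rejects the null when p < .05.
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
print("reject H0" if p_value < 0.05 else "fail to reject H0")
```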

There have been many detractors of NHST and many reasons for concern regarding the use of the method (see Bakan, 1966; Cohen, 1962, 1994; Kline, 2004; Meehl, 1967/1970, 1978; Nunnally, 1960; Rozeboom, 1960; Trafimow & Rice, 2009). What has remained consistent across the criticisms are the warnings expressed to psychologists who have adopted NHST as the primary method for designing and analyzing results from their experimental investigations. Contemporary authors like Trafimow and Rice (2009) point to the lack of an actual connection between obtained statistical probabilities, in the form of p values, and either of the formulated hypotheses. This line of argument can be traced back to Lancelot Hogben, who in 1957 noted the same deficiency in the reasoning behind the procedure. Contemporary authors like Gigerenzer, Krauss and Vitouch (2004) have noted that habit-forming rituals have sprung up around the improper teaching, use and understanding of what a p value represents. These thoughts echo those of John Tukey, who had already expressed such concerns in 1954. Other contemporary authors like Jacob Cohen (1994) have attempted to remove NHST altogether and replace it with confidence intervals because, according to him, "NHST has not only failed to support the advance of psychology as a science but also has seriously impeded it" (p. 997). Why would psychology continue not only to use, but to emphasize the teaching of, NHST as one of its primary tools for scientific discovery?1

This thesis is an attempt to understand how the controversy surrounding NHST has resulted in little to no change in the prevalence and nature of its use in psychology. While the primary approach to addressing the controversies of NHST has been to analyze the technical aspects of the method itself, little attention has been paid to the nature of the controversy as it unfolded in psychology's own history. Historical debates surrounding the identification of meaningful units of measurement and the correct application of the normal distribution have direct bearing upon the NHST controversy. The NHST procedure was developed outside of psychology, and most credit Ronald A. Fisher with its development (see Rodgers, 2010). Psychology incorporated the NHST procedure as its dominant approach over a 25-year span beginning approximately in 1925 (Rucci & Tweney, 1980). How the method of NHST found a home in psychology, and furthermore became the dominant approach to experimental methodology, has yet to be connected to the history of the NHST controversy itself. Contemporary psychology has a unique relationship with the method of NHST.

1. By way of the techniques of ANOVA, t-tests, and statistical methods of inference based upon the null-testing principle more generally.

Understanding how, amidst 50 years of critique by prominent psychologists, NHST has retained widespread use requires historical examination. Over the course of this project I propose an explanation for why psychology appears so committed to the method of NHST. In doing so, I evaluate the following issues:

1) What is the history of the debate regarding NHST in psychology, and does it have a connection to debates that developed prior to the construction of the procedure?

2) What is it about psychology's early history that uniquely prepared it to adopt NHST as its primary method of investigation?

3) Why has psychology, given the volume and persuasiveness of the literature that has appeared since the 1950s, refused to change its approach to NHST?

By conducting this analysis, I hope to demonstrate that the persistence of NHST in psychology is linked to the specific demands that arose as the discipline organized itself around certain experimental and theoretical approaches. Much has been made of the apparent flaws in the methodology supporting NHST, but little attention has been paid to the specific needs of psychology as a discipline that support its continued use. This project is in no way to be understood as a plea in support of NHST; my aim is to identify certain key characteristics of experimental psychology that have created the apparent dependence upon it. As a result, this thesis is not necessarily prescriptive regarding the future of NHST, but descriptive with respect to the framework within which a certain dependence has been brought about.

Chapter 1 examines the history of the NHST debate, traces the intellectual history of its testing methods from Francis Galton to Ronald A. Fisher and Egon S. Pearson, and examines the NHST controversy in light of psychology's past. The primary goals of this chapter are two-fold: 1) to introduce the origins of the NHST controversy and provide a brief intellectual history from Galton to Fisher, Pearson and Neyman, and 2) to demonstrate that experimental psychology has largely ignored the calls for reform that have existed within the NHST debate over the past 50 years. My concluding argument to this chapter is that while the NHST debate is specific to the methods of statistical testing such as ANOVA and t-tests, its resiliency in psychology is rooted in a more distant past. The continued commitment to NHST by psychologists can be understood as due to the perceived methodological needs and demands that developed during experimental psychology's early history and the rise of quantification.

Chapter 2 traces the rise of a quantified and experimental psychology. Adopting the perspective that experimental psychology has two "founders," Gustav Theodor Fechner and Wilhelm Wundt (see Boring, 1950), this chapter is an attempt to understand why and how psychology adopted a quantified, experimental approach and was successful in its attempt. The primary goal of this chapter is to demonstrate that quantification came to psychology through the requirements of evidence as understood by Fechner. The chapter follows German traditions of thought from Kant, Herbart and Fechner through to Wundt. Wundt's physiological approach to psychology is demonstrated to have begun the processes of legitimization and standardization by which experimental psychology would eventually be accepted as a discipline. It is argued in this chapter that despite certain controversies regarding the quantification of mental processes, the production of instruments, published experimental results, and textbooks allowed

Wundt to establish his physiological psychology. This chapter articulates a space where a statistical psychology might soon emerge and demonstrates that Wundt's efforts represent the first steps towards establishing a consensus regarding what experimental psychology was and could do.

Chapter 3 outlines the transitions that psychology underwent at the turn of the 20th century. This chapter details the collapse of the first dominant psychological methodology, introspectionism, and how this collapse helped create a space for new approaches to experimental psychology to be developed. Introspection, in its various forms, came to be identified as a method fraught with insurmountable errors and subjectivities. The influence of persons like Charles Spearman led to the construction of a correlational psychology. Correlational psychology is mapped out beginning with Galton, but carried forward by Spearman and his ability to provide evidence for the idea of general intelligence. The belief in correlational measures of intelligence (amongst other traits) helped solidify trust in statistics as a useful approach to observing and describing mental phenomena. The concluding argument of this chapter is that while trust in statistical techniques grew, their inability to provide consensus regarding the interpretation of results reflected the same problems of quantity that experimental psychology had effectively ignored from Fechner through Wundt.

Chapter 4 describes the history of NHST's rise to dominance in psychology and revisits the contemporary controversy. The investigation begins with an examination of certain criticisms that experimental psychologists brought to bear against the use of correlation. Correlational techniques could not supply a definitive criterion for deciding whether a given relationship was meaningful, and so remained purely descriptive tools. With an existing set of tools based upon the critical ratio (CR), which aimed to determine the significance of the results of correlational studies, the move to ANOVA could be understood as building on the foundation of existing methods in psychology. At the heart of the shift to ANOVA and NHST remained the issue of quantification in psychology. As Boring (1920a) had discussed, statistical information reverted to a discussion of probabilities based upon frequencies and often neglected a consideration of what constituted a psychological unit. Without comprehension of the psychological units under investigation, statistical measures would remain arbitrary and their importance based upon subjective judgments. The resolution of the quantity objection by S. S. Stevens, by way of his new definition of measurement (and measurement scales), permitted psychology to resolve two issues: 1) the determination of psychological units was unimportant, because measurement does not require the evidence of objective units of measure, and 2) with no necessary units of measure, the use of NHST provided a means by which the significance of any experiment could be judged by the p value obtained. The utility of NHST thus lay in its ability to provide clear rules to guide the interpretation of results. Consensus based upon the p statistic could be found in editorial policies, textbooks and research, though the statistic was routinely discussed in explicitly incorrect terms. The p statistic was often inaccurately treated as the probability of the stated hypotheses being true and as the ultimate measure of the goodness of one's research, and the NHST procedure that generated the statistic was increasingly thought of as a decision-generating process. These ideas were widely taught and rarely corrected during the 1950s. I conclude by proposing that the reason psychology has resisted change is connected to the function that NHST came to play in light of the longstanding needs of the discipline, namely, providing a standard for consensus regarding the quality and interpretation of research.

The general conclusion to this thesis synthesizes the previous chapters. The final concluding argument represents a summary of the individual histories that have contributed to psychology's dependence on NHST. Based upon the concluding arguments of chapter 4, this discussion attempts to show how, beginning with Fechner, psychology lacked a formal definition of what constituted a psychological unit. While still without a formal definition, experimental psychology first appeared and was quickly led away from philosophy by a community of psychologists who adopted the experimental approach in the years following Wundt's first laboratory. Correlational psychology provided the discipline a new way of approaching certain phenomena, but could not supply psychology with the ability to develop consensus regarding the meaning of results or what it was that was frequently being measured.

As NHST rapidly gained acceptance and widespread use from the 1930s onwards, experimental psychology appeared to have obtained a method that eliminated the problems surrounding what constituted significance in its results. By the 1950s experimental psychology had taken up NHST as its primary method and indoctrinated practitioners in its use. Stevens's measurement scales resolved the theoretical issue of quantity, and NHST appeared capable of supplying machine-like mechanics of precise decision making. The continued use of NHST represents a lasting commitment to these perceived benefits. For psychology, NHST rose to dominance because it appeared to supply an approach that could unify the discipline methodologically and generate an objective criterion (p values) of the value and importance of experimental results.

Chapter 1

The NHST Debate: Past, Present and Persistent

In 1998 Alan Kaufman declared that "The controversy about the use or misuse of statistical significance testing that has been evident in the literature for the past 10 years has become the major methodological issue of our generation" (p. 1). While Kaufman's claim to the existence of a controversy was accurate, the time frame he offered was not. Four years prior to Kaufman's observation, Cohen (1994) reported that the controversy regarding the testing of statistical significance had been in existence for at least four decades (p. 997). In August of 1999 the final report by Leland Wilkinson and the American Psychological Association's Task Force on Statistical Inference (TFSI) appeared. This document described the impetus for the TFSI's creation as born of "...[the] continuing debate over the applications of significance testing in psychology journals and [in response to] the publication of Cohen's (1994) article" (Wilkinson, 1999, p. 594). While this debate has remained active over the past 12 years, the problems surrounding the use and misuse of NHST have been evident in the scientific literature since as early as the 1930s. The purpose of this chapter is to introduce the major criticisms of NHST, outline the origins of the NHST debate, briefly examine the history behind the methods, and demonstrate the NHST controversy's persistence in psychology.

The NHST debate has been dominated by a number of specific arguments that have remained relatively unchanged and been repeatedly revisited (for a concise summary, see Rodgers, 2010). This chapter outlines these arguments and what they tell us about NHST. The problems regarding the use of NHST are not exclusive to psychology, and the case against NHST has been supplied by researchers across disciplines. Special consideration will be given to certain aspects of the criticisms supplied by Joseph Berkson, Hanan Selvin and Lancelot Hogben from outside of psychology, and Bill Rozeboom, David Bakan, Jum Nunnally and Paul Meehl from within. While not a complete depiction of all the literature opposing NHST, these authors represent a selection of some of the most widely read from medicine, sociology and psychology. An introduction to the problems surrounding the procedure will provide a framework for understanding how these issues relate to the specific dilemma in psychology. The intellectual history of the developments that led to NHST exposes the critical, often oppositional, approaches and thinking that have led to the contemporary debate.

Modern statistical methods developed around the demands of specific types of investigations and data. Beginning with Galton, this chapter will examine how correlation and descriptive statistics provided a necessary foundation upon which statistical testing could evolve. This chapter will also detail how Karl Pearson, Galton's very good friend and protégé, became dissatisfied with certain assumptions and qualities of Galton's ideas, and how this dissatisfaction led Pearson in a new direction. Pearson's new approaches would not be limited by Galton's adherence to the idea of normality2 and would establish the first statistical tests of significance. Also given consideration are the statistical ideas developed by William Sealy Gosset, R. A. Fisher, Jerzy Neyman and Egon Pearson. Gosset found that while Galton-Pearson statistics were useful in certain domains, his practical problems could not be solved using existing methods.3 Having assisted Gosset in his work, R. A. Fisher developed the NHST procedure and the methods of t-testing and ANOVA out of the research Gosset had begun. Fisher's method of NHST altered the way in which one approached experimentation and the treatment of data, but did not escape modification. Jerzy Neyman and Egon S. Pearson introduced the ideas of alternative hypothesis testing to ANOVA and, while strongly opposed by Fisher, the accept/reject decision-making approach to NHST evolved. As the NHST controversy took shape in the 1950s, psychology embraced the procedure as its primary method of data analysis. Lee Cronbach, in his 1957 presidential address to the American Psychological Association, announced to the discipline that NHST had become the ultimate method for experimental psychologists. Rather than pausing to evaluate the nature of the controversy that had begun to foment, psychologists rallied around NHST, and its related test statistic, the p value, became the benchmark of good experimental practice. The legacy of this relationship has been a 50-year period of stagnation, during which psychologists have employed NHST in much the same way as it was employed in the 1950s. This stagnation has persisted despite some researchers' claims that a "quiet methodological revolution" based on modeling has been brewing in psychology for several decades (e.g., Rodgers, 2010, p. 1).

2. Normality is the state of affairs or assumption that collected data are distributed symmetrically about the arithmetic mean (μ), with the dispersion of values measured as the standard deviation (σ).

The Substance of the NHST Debate: An Introduction

The debate over NHST appears to have begun shortly after the appearance of the method itself. For example, a series of articles published by scientists from medicine, sociology and psychology appeared as early as 1938 and established the primary issues that have subsequently defined the NHST debate. Very little in the way of new arguments has appeared since (Denis, 1999).

3. Gosset investigated and published two important papers concerning the use of small-sample statistics.

I have selected these articles to demonstrate what I believe to be the four basic criticisms of NHST. The first class of criticisms revolves around the use of p values and the mechanized processes that have sprung up around them. I define this as the "decision-making problem." The second class involves the problem of applying the NHST model to certain forms of inquiry. I have termed this the "problem of appropriateness." The third class proposes that the logical structure of NHST is impotent for scientific inquiry. I have labeled this the "problem of logic." The fourth class I have called the "problem of sensitivity"; it is a technical concern regarding the effects of sample size upon test statistics.

The Decision Making Problem

As early as 1938 Joseph Berkson criticized a trend within scientific investigation to make decisions concerning the meaning of experimental results using the null hypothesis and the significance of results. Berkson, a medical researcher, was concerned that scientists were not taking the time to consider alternate forms of evidence other than p values when making decisions regarding the form and interpretation of results. P values, according to Berkson (1938), often failed to deliver any of the information a scientist should desire. He claimed that he had "encountered numerous situations in which the test [Chi-Square] did not adequately perform the function for which [he] thought [he] could use it" (Berkson, 1938, p. 526). This conclusion of Berkson's stood in opposition to what he saw as the "routine" (Berkson, 1938, p. 528) scientific practice in which decisions were generated by p values. He argued that the use of statistical testing of significance was often an inappropriate or inadequate basis for decision-making. Berkson followed his initial criticism with a second four years later. "Tests of Significance Considered as Evidence" appeared in 1942 in the Journal of the American Statistical Association. Berkson's critical discussion characterized the NHST procedure, and its reliance on p values, as counter-intuitive and not in keeping with the inferential aims of experimentalists. Scientists were not typically engaged in disproving things; they were "looking for appropriate evidence for affirmative conclusions" (Berkson, 1942/1970, p. 286). The danger inherent in the NHST approach, according to Berkson (1942/1970), is that "the pursuit of a false principle for testing the null hypothesis will lead to false conclusions" (p. 287). Berkson's article detailed what he saw as the ways that scientific experimentation occurred and how NHST did not provide evidence towards any hypotheses of interest involved in this activity. Bill Rozeboom in 1960 criticized NHST in a similar fashion. Rozeboom (1960) suggested that what NHST had done to scientific investigation was to recast it as a decision-making process revolving around the acceptance or rejection of hypotheses. This reflected the same disconnect with the process of scientific inquiry that Berkson had previously suggested. As Rozeboom (1960) explained:

...the primary aim of a scientific experiment is not to precipitate decisions, but to make an appropriate adjustment in the degree to which one accepts, or believes, the hypothesis or hypotheses being tested. And even if the purpose of the experiment were to reach a decision, it could not be a decision to accept or reject the hypothesis. (p. 221)

I suggest that it was for this same reason that Cohen (1994) argued that NHST actually impedes the progress of science.

The Problem of Appropriateness

In 1957 Hanan Selvin attacked the appropriateness of tests of significance in the domain of sociological research.4 Selvin, who obtained his Ph.D. from Columbia University in 1956, was familiar with the rising controversy over NHST. Columbia University's Bureau of Applied Social Research had investigated practices of experimentation and the testing of significance throughout the 1950s and generated two important volumes that discussed the topic: Union Democracy appeared in 1956,5 and The Student Physician: Introductory Studies in the Sociology of Medical Education in 1957.6 In his article "A Critique of Tests of Significance in Survey Research," which appeared in the American Sociological Review, Selvin identified two crippling issues that rendered NHST inappropriate for sociological inquiry. The problems regarding the use of NHST in the conduct of sociological research fell under two categories, "Problems of Design" and "Problems of Interpretation" (Selvin, 1957/1970, pp. 96-100).

According to Selvin (1957/1970), the "problem of design" in sociological research "centers about the concept of experimental control" (p. 96). Sociologists were unable to isolate or control for specific effects and could not randomize the uncontrollable. R. A. Fisher7 was explicit in his discussion about randomization and denied the use of systematic treatments, even where the reduction of error would be the result (Gigerenzer et al., 1989, p. 74). Because sociologists could not employ these controls, the experimental models required for NHST were beyond their reach. Selvin's (1957/1970) "problem of interpretation" exposed the disciplinary problems of applying NHST, but also demonstrated that certain specific assumptions regarding the correct use of NHST were, at large, not being met. Some sociological research at the time collected data and examined it afterwards, testing various hypotheses and attempting to draw out statistical significance regardless of what the "true levels" of a certain phenomenon were and what they meant (Selvin, 1957/1970, p. 106). According to Fisher (1966), NHST was to be considered inseparable from the experiment and not to be used this way. The rising concern with statistical significance also posed a threat to the structure of sociological research, as it was rapidly becoming a criterion of good research and often begged criticism if absent (see Selvin, 1957/1970, p. 94).

4. While the argument that follows concerns a particular disciplinary problem, randomization and design have been hotly contested issues in the debates surrounding statistical testing (Gigerenzer et al., 1989).
5. The authors of Union Democracy were Seymour M. Lipset, Martin A. Trow and James S. Coleman.
6. The authors of The Student Physician were Robert K. Merton, George G. Reader, M.D. and Patricia Kendall.

The Problem of Logic

1957 also saw the publication of Lancelot Hogben's thorough and potent attack on NHST in his book Statistical Theory. Hogben's attack on NHST was calculated and methodical. His approach combined methodological concern with historical consideration and represents one of the fullest early treatments of the issues of NHST (Morrison & Henkel, 1970, p. 3). Hogben, a trained physiologist, was well known as a medical statistician.8 Through a careful examination of R. A. Fisher's writing on the topics of statistical significance, the null hypothesis and the treatment of errors, Hogben (1957/1970) identified a series of logical inconsistencies and potential errors that he concluded were fatal flaws in the NHST procedure. One aspect that primarily concerned him was the inadequacy of the method of NHST to provide meaningful information. Hogben (1957/1970) wrote:

Few who follow Fisher's test prescriptions appear to realize that there never can be a unique denial of a hypothesis itself formulated, like the theory of the gene, in statistical terms, nor that a 20 to 1 convention can have any relevance to one's assessment of its truth. (p. 48)

Hogben's criticism demonstrates the weaknesses inherent in the logical structure of NHST. The denial of the null hypothesis, according to Hogben, does not provide a definitive criterion as to the truth or falsity of specific results. Another aspect of the problems of logic surrounding the use of NHST is the fact that the point-null hypothesis is almost invariably false. David Bakan (1966) reported finding that under the conditions of "sharp" (p. 434) null hypotheses (point-null) the p value was nearly always significant. The "sharp" or point-null hypothesis is the application of NHST designed to determine whether or not the sampled means "for two groups which differ in some identified properties (such as social class, intelligence, diagnosis, racial or religious background) [can be found] to differ not at all [emphasis added] in the output variable of interest" (Meehl, 1967/1970, p. 258). Paul Meehl (1967/1970) considered this quality of the NHST procedure to be inconsistent with the normal aims of scientific practice. Where normal procedures of scientific investigation sought to expand our knowledge of studied phenomena, no useful information was generated by testing the null hypothesis, because it should already be intuitively false. Bakan (1966) suggested that "sharp" or point-null hypotheses "rarely exist in nature" (p. 426) and Meehl (1967/1970) wrote:

Now our general background knowledge in the social sciences, or, for that matter, even "common sense" considerations, make such an exact equality of all determining variables, or a precise "accidental" counterbalancing of them, so extremely unlikely that no psychologist or statistician would assign more than a negligibly small probability to such a state of affairs. (p. 258)

Both Bakan's and Meehl's arguments are sound. However, it should be noted that Bakan (1966) did note an upsurge of scientists using what he called "loose" (p. 434) null hypotheses, which avoid the point-null problem by specifying a range of expected difference rather than a point value of zero; a sketch of this contrast follows below.

8. He was also known as a zoologist and was a Fellow of the Royal Society.
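A minimal simulation can make this concrete (assuming numpy and scipy; the effect size, sample size and tolerance below are hypothetical choices, and the "loose" check is only a crude stand-in for Bakan's idea):

```python
# With a "sharp" (point) null, even a trivially small true difference is
# declared significant once N is large enough. All numbers are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 200_000
group_a = rng.normal(0.0, 15.0, n)
group_b = rng.normal(0.2, 15.0, n)    # nearly, but not exactly, equal means

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"sharp null: p = {p_value:.6f}")   # typically far below .05

# A "loose" null in Bakan's sense specifies, in advance, a range of
# difference that would count as negligible, rather than exactly zero.
observed_diff = abs(group_b.mean() - group_a.mean())
tolerance = 1.0
print("meaningful difference" if observed_diff > tolerance
      else "difference falls within the tolerated range")
```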

The Problem of Sensitivity

The problem of sensitivity was discussed repeatedly in the early years of the NHST controversy. It appears that Neyman and Pearson (1933) were the first to demonstrate that the determination of p values was troublingly linked to the size of a given sample. Neyman and Pearson confirmed that the probability of rejecting the null hypothesis based upon the p value increased as a function of the sample size. The problem of sensitivity is therefore a technical issue surrounding the role that the sample size plays in the determination of p values in statistical testing. It was also a driving force behind their principle of statistical power.9 This aspect of NHST worried scientists like Berkson (1938), who noted that "when the numbers in the data [the amount of data, the character of the data or the sample size] are quite large, the P's tend to come out small" (p. 526). Because Berkson's concern was a growing over-dependence of research upon p values, this issue represented a way that scientists and research could be led to his proposed false conclusions. NHST's sensitivity led Jum Nunnally (1960) to view the rejection of null hypotheses as a fruitless enterprise. Nunnally (1960) suggested that "Experience shows that when large numbers of subjects are used in studies, nearly all comparisons of means are 'significantly' different and all correlations are 'significantly' different from zero" (p. 643). This arbitrary attainment of statistical significance exposed the limits of NHST for analyzing certain data and represented an avenue for the publication of misguided research. Nunnally (1960) wrote:

The point of view taken here is that if the null hypothesis is not [emphasis added] rejected, it is usually because the N [sample size] is too small. If enough data is gathered, the hypothesis will generally be rejected. If the rejection of the null hypothesis were the real intention of psychological experiments, there usually would be no need to gather data. (p. 643)

Often the sensitivity to sample size was improperly corrected by using an approach of sufficiency for significance: only the number of participants needed to obtain significant results would be used in a study, in order to suppress size-related significant findings. Nunnally (1960) refers to this approach as "the small N fallacy" (p. 644). The problem of sensitivity does not go away with smaller N; the study simply provides a less accurate estimate of the effects under investigation (Nunnally, 1960).

9. Power is a measure relating the hypothetical strength of a treatment effect to its ability to be detected by the research design employed, or alternatively the probability associated with the rejection of H0 when it is in fact false.
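The sensitivity problem is easy to demonstrate by simulation (assuming numpy and scipy; the fixed effect and sample sizes are hypothetical): the same small difference drifts from "non-significant" to "significant" purely as N grows.

```python
# One small, fixed true effect; p values shrink as the sample size grows.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
true_effect = 0.5            # a small mean difference against sd = 15

for n in (10, 100, 1_000, 10_000, 100_000):
    a = rng.normal(0.0, 15.0, n)
    b = rng.normal(true_effect, 15.0, n)
    _, p = stats.ttest_ind(a, b)
    print(f"N = {n:>6}: p = {p:.4f}")
# As Berkson put it, "when the numbers in the data are quite large,
# the P's tend to come out small."
```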

Some Remarks on the Controversy

I suggest that the criticisms of NHST supplied by Berkson (1938; 1942/1970) and Rozeboom (1960) are attempts to remind scientists of the difference between statistical and substantive significance. The problem with reliance upon NHST for the production of p values is that the statistic is driven by a variety of factors unrelated to the actual effects under investigation (sample size, disturbance and confounding variables, etc.). The information that it supplies should be viewed as only that: information. Fisher viewed the information obtained through a calculated p value as evidence. The mechanized approach to statistical testing, which some have identified as establishing a "ritual" (Gigerenzer, Krauss, & Vitouch, 2004) or dogmatic conviction that p values generate decisions, stands in opposition to the original intentions of R. A. Fisher. Fisher10 (1966) wrote:

No such selection [the choice of the p statistic] can eliminate the whole of the possible effects of chance coincidence, and if we accept this convention, and agree that an event which would occur by chance only one in 70 trials is decidedly "significant," in the statistical sense, we thereby admit that no isolated experiment, however significant in itself, can suffice for the experimental demonstration of any natural phenomenon; for the "one chance in a million" will undoubtedly occur, with no less and no more than its appropriate frequency, however surprised we may be that it should occur to us. (pp. 13-14)

Fisher (1966) prescribed that researchers should know the full range of outcomes expected in an experiment and that the p value be a specific criterion, not a routine value. Fisher's p value was connected to degrees of belief, not to reject/accept decisions (Gigerenzer et al., 1989, pp. 206-212). The decision-making approach to NHST was developed by Neyman and Pearson (1933), and the overall model was eventually hybridized.11 However, this does not resolve all the problems of NHST. As Hogben (1957/1970), Nunnally (1960), Rozeboom (1960/1970), Bakan (1966/1970) and Meehl (1967/1970) have all argued, the logic and structure of NHST are flawed. The information provided by the NHST procedure does not supply the individual scientist with any specific information regarding their hypotheses of interest.

With an appreciation of the specific issues that plague the use of NHST, it would seem natural to ask why psychologists continue to use a statistical procedure that has received such extensive negative attention. I suggest that identifying what specific needs were being met with the creation of NHST will shed light on its widespread use. Furthermore, tracing the development of NHST from Galton through to Fisher, Neyman, and Pearson helps historicize the aims surrounding the production and use of the procedure. Psychologists were not alone in adopting NHST as a popular technique of data analysis.

10. Lancelot Hogben (1957/1970) noted that Fisher's method of statistical testing evolved from its early stages, when he was rather liberal with declaring a p < .05 criterion, to his later life, when he was very cautious and advised that scientists know and determine specific p values tailored to their experiments.
11. A more complete discussion of this point follows in chapter 4.

The NHST Debate: Origins

NHST, and indeed inferential statistical testing generally, was effectively non-existent prior to the turn of the 20th century (see Smith, Best, Cylke, & Stubbs, 2000, for a concise overview of pre-NHST data analytic strategies). The New Psychology that emerged at the turn of the 20th century was in a state of flux. Early psychology as a discipline had been developed out of the psychophysical investigations of G. T. Fechner and was developed into a systematized program of investigation by Wilhelm Wundt.12 The role that statistics often played in this program was in the service of psychophysical measurements. Reaction times and subjective judgments of sensations could be evaluated and corrected using the Law of Average Error, or Gauss' Error Law,13 as in the case of Fechner (1860/1966). Similar uses were those of Jastrow (1887), who used averages, tabled ratios and the method of least squares to determine the most representative values hidden within collected data (see p. 125). Explanations of psychological phenomena that relied upon numerical description were typically based upon aggregates of data collected from small numbers of individuals and generalized into theories of mental contents and activities. Methods were often beset by ambiguity and subjectivity. Psychology was in need of a new approach to mental quantification, one that could address this concern.

12. This portion of psychology's history will be explored in full in chapter 2.
13. Also referred to as the Normal Law or error curve, where error is minimized by assuming the arithmetic mean is the most representative single value for a normal distribution of values, where chance events are assumed to occur, and justified by the Method of Least Squares.
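The claim in note 13, that the arithmetic mean is the error-minimizing single value, is a standard least-squares result and can be made explicit in one line:

```latex
% The arithmetic mean minimizes the sum of squared deviations:
% set the derivative of the total squared error to zero.
\frac{d}{dc}\sum_{i=1}^{N}(x_i - c)^2 \;=\; -2\sum_{i=1}^{N}(x_i - c) \;=\; 0
\quad\Longrightarrow\quad
c \;=\; \frac{1}{N}\sum_{i=1}^{N} x_i \;=\; \bar{x}.
```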

Danziger (1987) suggests that this period in experimental psychology's history can be characterized as its turn from a "Wundtian" to a "Galtonian" (p. 37) tradition, namely, a move away from controlled experimental introspection and towards studies of distributions (p. 38). In psychology's move away from introspection, Francis Galton played an important role. As a well-known man of science, Galton endorsed the use of correlation and descriptive statistics for psychology. Hence, I suggest Galton is a natural starting point for the investigation of NHST's origins, as he was particularly influential in establishing modern statistics and seeding it across disciplinary boundaries.

Sir Francis Galton: Defining Correlation

In 1888, Sir Francis Galton was the first to provide a clear definition of the correlation of any two objects, or variables, one to the other (Stigler, 1989). Prior to that moment, Galton (1888) suggests, there had been much talk of "co-relation or correlation of structure," especially within the field of biology, yet no one had developed a precise definition (p. 135). As an example within the American context, Charles Sanders Peirce and Henry Pickering Bowditch in 1877 presented the results of a study regarding the growth of children for the Massachusetts Board of Health. This study employed a statistical measure of association for a 2 x 2 contingency table that seems to have anticipated Galton's correlation (Rovine & Anderson, 2004). Related to correlation and regression was Gauss' Error Law, which in the 1800s had found application in many of the natural sciences. From astronomy, physics, and thermodynamics, the error curve was being incorporated and applied wherever it appeared possible. It was Galton's far-flung interests and awareness of a variety of scientific studies and their methods (including Gauss' Error Law), from meteorology, anthropology, biology and of course eugenics, that set the stage for his work on correlation.
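Although Galton worked graphically and with quartile-based units rather than with a formula, the quantity he defined is easy to compute in its modern product-moment form (the form Karl Pearson later made standard). A minimal sketch, assuming numpy, with hypothetical paired measurements:

```python
# Pearson's product-moment correlation computed from standardized
# deviations. The paired measurements are hypothetical stand-ins for
# the anthropometric data Galton worked with.
import numpy as np

rng = np.random.default_rng(3)
forearm = rng.normal(45.0, 2.0, 500)                  # cm, hypothetical
stature = 3.2 * forearm + rng.normal(0.0, 4.0, 500)   # a correlated measure

def correlation(x, y):
    """Mean product of standardized deviations (Pearson's r)."""
    zx = (x - x.mean()) / x.std()
    zy = (y - y.mean()) / y.std()
    return float((zx * zy).mean())

print(f"r = {correlation(forearm, stature):.3f}")
print(f"numpy check: {np.corrcoef(forearm, stature)[0, 1]:.3f}")
```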

For many years Galton had been using methods related to what would become the correlation coefficient and linear regression. Stigler (1989) has suggested that Galton spent about 20 years working on the topic of correlation before he published his formal statement. In his 1872 paper outlining how best to chart the navigation of sailing vessels, Galton drew from vector decomposition and addition the fundamental premises upon which regression and correlation rely. His Law of Regression appeared in 1886,14 two years prior to the appearance of the formal statement of the correlation. Galton (1889) later wrote that the development of the correlation had been brought about by his observation of a mysterious relationship between various anthropological measures:15

Reflection soon made it clear to me that not only were the two new problems identical in principle with the old one of kinship which I had already solved, but that all three of them were no more than special cases of a much more general problem - namely that of correlation. (p. 82)

Adolphe Quetelet and his pursuit of l'homme moyen16 had also piqued Galton's interest in the Normal Law, or error curve, and in variation (see Gigerenzer et al., 1989, p. 53). Galton's seminal investigations into human statistics often carried mention, awareness, and acknowledgement of Quetelet and his theoretical bent.17

14. See Galton (1886), pp. 246-263.
15. In this case, the related lengths of a person's bones, with their stature and body size (see Galton, 1889, pp. 81-82).
16. The middle, average or ideal man of society. This statistically constructed person could be studied and understood through the accumulation of large amounts of social data and the application of the Normal Law, or Law of Errors, as well as the arithmetic mean (see Quetelet, 1835).

Correlation provided the essential framework upon which modern statistical techniques could emerge. Galton, known less for his mathematical talent than for his novel approach to problem solving, helped bring about the modern era of statistical testing. He also played an important role by supporting and working alongside his protégé Karl Pearson.

Karl Pearson, Skew Distributions and Chi-Square

In 2004, historian of science Theodore Porter published a biography of Karl Pearson that exposed the wide range of talents, interests and passions that Pearson possessed. Important to this examination is Pearson's relentless commitment to the advancement of statistics and science. In describing Pearson's emergence as a pioneer of modern statistics, Porter (2004) suggested that:

Pearson saw that statistics could be made mathematical and might infuse the practices of the 'descriptive scientist' with some of its logical accuracy. [Pearson saw that] This could be especially valuable in regard to the great social and economic questions [of the time], where interested opinion so often held sway... [and] in 1893, he would acquire a passionate commitment to the statistical study of biological evolution, Galton's favourite field. (pp. 215-216)

However, it should be noted that Pearson's first foray into the field of statistics came during his time teaching mathematics to engineers (Porter, 2004).

17. For example, see Galton (1874; 1876; 1896/1962).

One of Karl Pearson's pioneering developments in statistics was the family of skew distributions. The skew distributions grew out of practical problems of science with which he was closely associated. They were developed in response to specific cases where there was an apparent failure of the Normal Law to describe acquired scientific data. Pearson was routinely called upon for his statistical expertise. In one such case, Professor Walter Weldon (1893) reported finding large and abnormal deviations in his data. The data, measurements of certain organs of crabs found in the Bay of Naples and Plymouth Sound, stood out as incompatible with his previous findings in shrimp. While measuring the organs of shrimp, Weldon had concluded that the variations in the size of the measured organs distributed themselves symmetrically about the mean in such a way as to be consistent with the Normal Law (see pp. 318-319). His new results regarding the organs of crabs from Naples and Plymouth startled him. Weldon (1893) reported:

I was led to hope that the result obtained might arise from the presence, in the sample measured, of two races of individuals, clustered symmetrically about separate mean magnitudes. Professor Karl Pearson has been kind enough to test this supposition for me: he finds that the observed distribution corresponds fairly well with that resulting from the grouping of two series of individuals. (p. 324)

Pearson, a man known for his exactness and commitment to scientific rigour, was not one to accept "corresponds fairly well" as an adequate scientific explanation. In 1893 Pearson addressed certain failures of the Normal Law, or normal distribution as it was coming to be known.18 Pearson (1893) wrote:

If a series of measurements physical, biological, anthropological, or economical, not of the same object, but of a group of objects of the same type or family, be made, and a curve be constructed by plotting up the number of times the measurements fall within a small unit of range to the range, this curve may be termed a frequency curve. As a rule this frequency curve takes the well known form of the curve of errors, and such a curve may be termed a normal frequency curve. The latter curve is symmetrical about its maximum ordinate. Occasionally, however, frequency curves do not take the normal form, and are then generally, but not necessarily, asymmetrical. (pp. 675-676)

Pearson (1895) followed up his 1893 examination by publishing results, obtained by studying the data of Weldon and others, showing that certain families of these abnormal frequency curves were in fact compound normal curves whose overlap could be resolved into some number n of component normal curves. Pearson then extended his work on this topic by introducing the idea of skew curves in a paper presented in 1895 entitled "Contributions to the Mathematical Theory of Evolution. II. Skew Variation in Homogeneous Material." The importance of Pearson's family of skew distributions cannot be overstated. The Normal Law had developed an iconic space in the collective mind of the scientific community, and notably with Galton. In Natural Inheritance, Galton (1889) wrote:

I know of scarcely anything so apt to impress the imagination as the wonderful form of cosmic order expressed by the "Law of Frequency of Error."... Whenever a large sample of chaotic elements are taken in hand and marshalled in the order of their magnitude, an unsuspected and most beautiful form of regularity proves to have been latent all along. The tops of the marshalled row form a flowing curve of invariable proportions: and each element, as it is sorted into place, finds, as it were, a pre-ordained niche, accurately adapted to fit it. (p. 66)

Many persons were swayed by Galton's insistent belief in the Normal Law. Pearson was not one of them. According to Porter (2004), Pearson offered "a critique of the easy assumption of the normal law," and "sought to dislodge [the normal law] even from its original and most secure ground as the distribution of errors of observation" (p. 255). Pearson (1900a) wrote:

Now it appears to me, that if the earlier writers on probability had not proceeded so entirely from the mathematical standpoint, but had endeavored first to classify experience in deviations from the average, and then to obtain some measure of the goodness of fit provided by the normal curve, that curve would never have obtained its present position in the theory of errors. Even today there are those who regard it as some sort of fetish; and while admitting it to be at fault as a means of generally describing the distribution of variation of a quantity x from its mean, assert that there must be some unknown quantity y of which x is an unknown function, and that y really obeys the normal law! This might be reasonable if there were but few exceptions to this universal law of error; but the difficulty is to find even the few variables which obey it, and these few are not those usually cited as illustrations by the writers on the subject. (p. 173)

The skew curves represented the general solution to any series of distributions. However, "He did not abandon the normal curve. For purposes of analyzing data, the normal was more tractable than his complicated skew curves" (Porter, 2004, p. 254). Because Pearson did not abandon the normal curve, many refused to acknowledge the skew curves. They instead retained a steadfast commitment to the normal curve regardless of the current evidence. In response, Pearson felt he needed to defend his family of skew curves (Porter, 2004, p. 255). To achieve this he developed the goodness-of-fit test, or χ² statistic. The χ² statistic is related to expected frequencies by a direct proportionality to the sample variance.19 Describing the utility of his new function, Pearson (1900a) suggested:

In other words, if a curve is a good fit to a sample, to the same fineness of grouping it may be used to describe other samples from the same population. If it is a bad fit, then this curve cannot serve to the same fineness of grouping to describe other samples from the same population. We thus seem in a position [using the χ² statistic] to determine whether a given form of frequency curve will effectively describe the samples drawn from a given population to a certain degree of fineness of grouping. (p. 163)

Barnard (1992) proposed that it was at this moment that statistics emerged from its early form into its modern 20th-century manifestation. Using this obtained statistic and Pearson's χ² tables, a decision could be made whether or not a curve or sample deviated from another curve or sample in a quantifiable manner, not expected by chance (Pearson, 1900a). The χ² statistic represents the first widely used statistical test that anticipated the development of NHST. It provided a firm foundation upon which modern statistical testing could emerge. Scientists were encouraged to evaluate the importance or chance-like qualities of their data based upon departures from expected outcomes. The interpretation of chance is closely related to the assumptions of the null hypothesis. The χ² statistic provided the necessary inspiration for R. A. Fisher to develop his approach to statistical testing in the form of t-tests and ANOVAs. Fisher (1928) noticed that the χ² statistic could provide a useful means of deciding whether or not acquired data deviated from expected data in such a way that a given null hypothesis might be discredited.

18. Pearson did not rely solely upon the data supplied by Weldon, but also upon that from Thompson, who studied prawns, and Bateson, who studied earwigs (see Pearson, 1893, p. 676).
19. χ² = s²(N − 1)/σ²: chi-square is defined by the sample variance s², the total number of observations N, and the known population variance σ².
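Pearson's goodness-of-fit logic can be sketched in a few lines of code (a minimal illustration assuming numpy and scipy; the bin counts and the candidate distribution's bin probabilities below are hypothetical):

```python
# A minimal sketch of Pearson's goodness-of-fit idea: compare observed bin
# counts with the counts a candidate frequency curve predicts, and sum the
# squared discrepancies scaled by the expected counts.
import numpy as np
from scipy import stats

observed = np.array([12, 41, 88, 95, 43, 21])   # counts in six measurement bins
n = observed.sum()

# Bin probabilities implied by a hypothetical candidate curve.
probs = np.array([0.05, 0.15, 0.30, 0.30, 0.15, 0.05])
expected = n * probs

chi2 = float(((observed - expected) ** 2 / expected).sum())
df = len(observed) - 1                          # no parameters estimated here
p = stats.chi2.sf(chi2, df)

# A small p suggests the candidate curve is a bad fit to this sample,
# in the sense Pearson describes above.
print(f"chi2 = {chi2:.2f}, df = {df}, p = {p:.4f}")
```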

Gösset, Fisher, Neyman and Pearson The rise of statistical testing of significance can be attributed to Karl Pearson and the biometrie school,20 but also to William Sealy Gösset.21 His work, presented in his 1908 paper "The Probable Error of a Mean," addressed the issues regarding the evaluation of small sample statistics in comparison to overall population statistics. R.A

R. A. Fisher (1928) wrote of Gosset's work:

The study of the exact distributions of statistics commences in 1908 with "Student's" paper The Probable Error of a Mean. Once the true nature of the

20 see Gigerenzer et al. 1989, p. 79.
21 The Guinness brewery did not allow their scientists to publish research outside of their involvement at the brewery. Hence, Gosset is also known as Student, his pen-name for his scientific publications. All references and citations made to Gosset can be found in the works cited under his pen-name Student.

problem was indicated, a large number of sampling problems were within reach of mathematical solution. (p. 23)

Gosset's (1908a) paper introduced the concept of the z-test, which was later transformed by Fisher into the t-test (Fisher, 1928; Eisenhart, 1979).
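The modern descendant of that z-test is the one-sample t-test. As a minimal sketch under assumed data (the eight measurements below are invented, and the scipy routine stands in for the tables Gosset lacked), the test judges a hypothesized population mean using only the variability the small sample itself affords:

import numpy as np
from scipy.stats import ttest_1samp

# A hypothetical small sample, e.g. eight brewery measurements
sample = np.array([10.2, 9.8, 10.5, 10.1, 9.9, 10.4, 10.0, 10.3])

# Test against a hypothesized population mean of 10.0; the sample's own
# standard deviation supplies the only indication of variability
t_stat, p_value = ttest_1samp(sample, popmean=10.0)
print(t_stat, p_value)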

Gosset's ability to perform mathematical and statistical analysis was well-known to his colleagues. McMullen suggests that it was well known "that he could calculate a probable error in 1903" (see McMullen, 1939, p. 204). It was his emerging talent in statistics that led to his 1904 report to Guinness22 regarding "the importance of probability theory in setting an exact value on the results of brewery experiments, many of which were probable but not certain" (Boland, 1984, p. 179). The report, entitled "The Application of the Law of Error to Work of the Brewery," created the opportunity for Gosset to meet Karl Pearson, and in 1906, to take a leave of absence from Guinness to study statistics in close contact with Pearson's biometric laboratory. Out of his leave of absence sprang two important papers in the development of modern statistical methods. Both papers appeared in Biometrika in 1908. Gosset's first paper appeared in March 1908 as "The Probable Error of a Mean." This was Gosset's first

22 Just prior to the turn of the 20th century, the Guinness brewing company decided that they needed to evaluate their brewing process in order to ensure the quality of their product. To this end, Arthur Guinness, Son & Co., began hiring scientists to perform this task. William Sealy Gosset was one of the first scientists to be hired as a brewer for this purpose in 1899 (Boland, 1984; McMullen, 1939; Pearson, 1939). Gosset had been educated, and was a scholar, in mathematics and chemistry at Winchester and New College, Oxford (McMullen, 1939). Gosset has been recognized by many as a brilliant scientist and statistician. Gosset's vision and ability to identify and state problems were the driving force behind some of the most important changes in statistical theory as it emerged around the turn of the 20th century (see Boland, 1984; Fisher-Box, 1978, 1981, 1987; McMullen, 1939; Pearson, 1939; Ziliak & McCloskey, 2008). Gosset conceived and developed his z-test in response to the issues and incompatibilities of Karl Pearson's large sample statistics and the work of the Guinness brewing company. In attempting to analyze its brewing strategies, Guinness was necessarily concerned with problems surrounding the study of small sample statistics (McMullen, 1939).

attempt to bring to light the issues regarding the application of statistical methods to small sample data. He outlined a strategy for the testing of statistics obtained from small samples against what was to be expected from the assumed population statistics that the sample data was hypothetically drawn from. Statistical treatments of small sample data had often been handled by allowing for multiple repetitions of the same experiment.23 The problem as outlined by Gosset (1908a) was that:

There are other experiments, however, which cannot be easily repeated very often; in such cases it is sometimes necessary to judge of the certainty of the results from a very small sample, which itself affords the only indication of the variability. (p. 2)

The general solution to this problem was what Gosset was after. While he did not actually achieve the general solution to this problem, he did begin the investigation that would eventually achieve this. Gosset (1908a), so good at properly defining the problem, stated it this way:

Again, although it is well known that the method of using the normal curve is only trustworthy when the sample is "large," no one has yet told us clearly where the limit between "large" and "small" samples is to be drawn. (p. 2)

The quantity z and its distribution were derived as the value that is "obtained by dividing [the difference between] the mean of the sample and the mean of the population by the standard deviation of the sample" (Gosset, 1908a, p. 2). Gosset (1908a) concluded, following some practical

23 Multiple repetitions of single experiments allow for the suppression of the error of random sampling and permit a close approximation of the law of distribution regarding the data (whether it be normal or otherwise).

examples of his new method, that the use of his z distribution was "A very close fit" (p. 18) to the expected curve, even under small sample conditions, provided that the distribution was approximately normal. The second paper appeared a few months later, titled "The Probable Error of a Correlation Coefficient." Gosset was quite aware that most statisticians of the time were concerned with large samples. Small sample problems seemed to them reckless and silly. Hence, though there were occasions where persons of authority spoke on the significance of correlations with samples as small as 21, no one had as of yet clearly defined what significance could be assigned correlations derived from small samples (Gosset, 1908b, p. 302). To this end, Gosset's second paper was the first of its kind to define this particular problem and explore the issue empirically. The empirical investigation was done in order to possibly understand the stability of statistical measures under small sample conditions. Gosset (1908b) did not develop a general solution to this problem, but he brought new angles to this issue in ". . . hope[s] they may serve as illustrations for the successful solver of this problem" (p. 310). The introduction of the problems regarding the use of small samples to estimate the nature and properties of a given distribution proved fruitful. In 1912 Ronald Aylmer Fisher read Gosset's 1908 papers on the probable errors of the correlation coefficient and mean. Fisher's first published article,24 "On an Absolute Criterion for Fitting Frequency Curves," appeared that year. This paper was to introduce Fisher's concept of

24 Gosset's 1908 papers were the inspiration for Fisher's second published article as well. In 1915 Fisher published "Frequency Distribution of the Values of the Correlation Coefficient in Samples from an Indefinitely Large Population." This paper was highly dependent upon geometric arguments and established the general solution to Gosset's inquiry regarding the distribution of the correlation coefficient regardless of the sample size.

maximum likelihood (Fisher, 1912, 1928; Fisher-Box, 1978, 1981). In this paper, Fisher proposed a different denominator in Gosset's variance equation, that of n−1 rather than n. Encouraged by F. J. M. Stratton, his college tutor,25 Fisher decided to write to Gosset regarding the discrepancy. Theirs would be a long friendship with continuous correspondence26 (Boland, 1984; McMullen, 1939; Fisher-Box, 1978, 1981, 1987). The proposed correction of the n−1 denominator in Student's variance formula was an example of Fisher's exceptional mathematical ability.27 Fisher's n−1 denominator was the first modern occurrence of the concept of degrees of freedom28 (Fisher-Box, 1978, 1981, 1987). The concept of degrees of freedom cannot be completely understood without the geometric argument for it.29 Many persons had, and have, difficulty

25 Fisher was then still an undergraduate at Gonville and Caius College, Cambridge (Fisher-Box, 1978, 1987).
26 Gosset and Fisher began writing to each other fairly regularly. Gosset was quite surprised to find out that Fisher was neither a Major nor a Professor, was a school teacher during the war, and was currently not working. Gosset, aware of some changes being made at the Rothamsted Experimental Station, wrote to Fisher encouraging him to seek employment there as a statistician (Fisher-Box, 1978, 1981). The Rothamsted Experimental Station was an agricultural station where experiments were performed to investigate the growth and response of crops under changing conditions.
27 Ronald Fisher had attended college to study mathematics. This aspect of Fisher's education may seem curious given his well known interests in biology and evolution. According to Fisher-Box (1978, 1987), it was his interest in these things which led to his study in mathematics. Very early in his life Fisher had demonstrated a talent for mathematics. His talent in mathematics led him to the awareness that if one is to study the nature of genetic transmission and evolution, one must understand and apply the methods pertaining to probability and statistics (Fisher-Box, 1978, 1987). Fisher's talent for mathematics was not, however, of the ordinary variety (Neyman, 1951). Fisher, since childhood, had very poor eyesight. He suffered from extreme myopia (Yates & Mather, 1963) and this often led to headaches and difficulties reading and studying. In response to this, Fisher was led to develop strategies involving visualization of mathematical methods, most importantly geometrical arguments and proofs (Fisher-Box, 1978). This ability to both visualize and conceptually see n-dimensional space was to inform Fisher of many of the special qualities of statistical distributional methods and led to his use of the concept of degrees of freedom.
28 The first formal statement of the concept of degrees of freedom did not appear until 1922, with Fisher's paper "On the Interpretation of χ² from Contingency Tables, and the Calculation of P." While Hotelling (1951) acknowledges Gauss as the developer of the concept of degrees of freedom, Hotelling (1942/1943) also places Fisher, rightly, as the person who brought the method back into use in a new light, that of the theories of estimation which were theoretically and technically different than those of Gauss.
29 Degrees of freedom represent the range of possible integer values which might be occupied within a given dimension of space when constraints or conditions are placed upon the data; i.e., when M = N, a degree of freedom is lost. Hence, when evaluating the variance we use the mean (an unbiased estimator) and lose a degree of freedom.

understanding it. However, as Fisher (1922) described, the inconsistencies and confusions that appeared in the current literature could be straightened out provided the true number of degrees of freedom in which observation and expectation might in reality differ were accounted for.
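The practical force of the n−1 correction can be seen in a short simulation. The sketch below (in Python; the sample size, trial count, and population variance of 4.0 are arbitrary choices for the demonstration) shows that dividing the squared deviations about the sample mean by n systematically underestimates the population variance, while dividing by the n−1 degrees of freedom does not:

import numpy as np

rng = np.random.default_rng(0)
n, trials = 5, 100_000        # small samples, many repetitions (arbitrary)
true_variance = 4.0           # the simulated population variance

samples = rng.normal(loc=0.0, scale=2.0, size=(trials, n))
deviations = samples - samples.mean(axis=1, keepdims=True)
ss = (deviations ** 2).sum(axis=1)   # squared deviations about each sample mean

print(np.mean(ss / n))        # biased: falls noticeably short of 4.0
print(np.mean(ss / (n - 1)))  # unbiased: close to the true variance 4.0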

Fisher began work at the Rothamsted Experimental Station in 1919. This location was to prove an exceptional source of inspiration for Fisher; many of his ideas regarding the design of experiments and advanced statistical techniques were borne out by the challenges he faced at Rothamsted (Fisher-Box, 1978, 1981; Pearson, 1974). The series of papers30 produced there are considered by some (see Neyman, 1951; Pearson, 1974; Salsburg, 2001) as examples of the most important developments in modern statistical theory, especially where it involves practical experimentation. The correspondence between Fisher and Gosset also led to the eventual development of the t statistic. The t statistic, or distribution, is the general solution to Gosset's z distribution. The correction of the z statistic, when stated as t, accounts for the degrees of freedom.31 The first appearance of Gosset and Fisher's t-test was, according to Eisenhart (1979), in Fisher's 1925 Statistical Methods for Research Workers; however, it is suggested that Fisher began to use the t notation sometime in

1922. Due to a lack of complete mathematical proofs and his unusual use of n-dimensional geometry, Fisher's Statistical Methods for Research Workers was not warmly received (Fisher-Box, 1978). Later in 1925, Fisher published the complete proof of Gosset's t distribution, and demonstrated its wide range of applications (Fisher, 1925). According to Fisher (1922, 1925), this distribution could be used to test the significance of means, correlations, and regressions amongst others. Samples could be evaluated based upon their characteristic distributions, compared relative to other distributions and evaluated for significance at the p < .05 convention in a manner not unlike Pearson's χ² statistic (Fisher, 1928, p. 100). The t distribution would also become the foundation for the creation and appearance of the method of analysis of variance (ANOVA). Fisher introduced his method of NHST around this concept and it was supported by t-tests, ANOVA and the χ² statistic. ANOVA seems to have first appeared in Fisher and MacKenzie's 1923 paper "Studies in Crop Variation. II. The Manurial Response of Different Potato Varieties." ANOVA stands as a technique of estimation that followed from work that Fisher had begun for himself in 1912 with the development of the Method of Maximum Likelihood. Maximum Likelihood is a technique for estimating the fitting of frequency curves and was developed to demonstrate how Karl Pearson's Method of Moments was often inefficient and ineffective to this end (Hotelling, 1951). ANOVA received a more extensive treatment as outlined and explained in Fisher's 1925 Statistical Methods for Research Workers (see Fisher, 1928). The intimate connection between Gosset's and Fisher's work spawned what Fisher saw as the general class of distributions, r-to-z, that were related to Gosset's z distribution by the following relationship: z = ½ ln[(1 + r)/(1 − r)].

30 Of special note, the series entitled Studies in Crop Variation.
31 The t distribution is related to the z distribution as t = z√(ν/V), where ν = the degrees of freedom and V is a chi-square distribution.
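In modern notation the r-to-z transformation just given is the inverse hyperbolic tangent, which a two-line check (in Python, with an arbitrary correlation of .6) confirms:

import numpy as np

r = 0.6  # an arbitrary sample correlation, chosen for illustration
z = 0.5 * np.log((1 + r) / (1 - r))   # Fisher's r-to-z transformation
print(z, np.arctanh(r))               # identical: arctanh is the same function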

In 1934 Snedecor tabled Fisher's r-to-z ratio as mean square ratios and called the table the F-table in honour of Fisher (Rucci & Tweney, 1980). The modern method of analysis of variance entails evaluating variables by their sum of squared deviations about a sample mean, divided by their respective degrees of freedom, thus creating the mean square value. The obtained mean square values are then divided again by the mean square error and a value for the F ratio is obtained.32 This value can then be evaluated against the tabled critical values for the F distribution, for given degrees of freedom. The use of ANOVAs and t-tests relies upon the principle of testing the null hypothesis. Fisher saw the null hypothesis as being possibly discounted by evidence, but never proposed that isolated experiments could provide definitive information. Fisher's proposed interpretation of statistical testing, as contributing to degrees of belief, has often been forgotten and confused with the decision-making approach prescribed by Neyman and Pearson (Gigerenzer et al., 1989). 1933 saw Jerzy Neyman and Egon S. Pearson propose a modification of Fisher's approach to null hypothesis statistical testing. The problem that Neyman and Pearson saw in the Fisherian approach to statistical testing was that it did not account for alternative hypotheses (Neyman & Pearson, 1933). Neyman and Pearson proposed that not only should a null hypothesis be formulated, but an alternative hypothesis should be formulated as well. This second or opposite hypothesis would represent the likely alternative to the null when the null is rejected as the possible explanation of the observed data (Neyman & Pearson, 1933).

32 The F-statistic is a ratio statistic giving an estimated measure of the proportion of obtained variance in the dependent variable, relative to the estimated amount most likely obtained by error. The larger the ratio value, the more variance in the data is possibly attributable to the independent variables used to study the dependent variable.
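The mean-square computation described above can be made concrete with a minimal sketch (in Python; the three treatment groups are invented for illustration, and the closing scipy call is only a cross-check on the hand computation):

import numpy as np
from scipy.stats import f, f_oneway

# Three hypothetical treatment groups (values invented for illustration)
groups = [np.array([4.1, 5.0, 4.8, 5.3]),
          np.array([5.9, 6.2, 5.5, 6.4]),
          np.array([4.9, 5.1, 5.6, 5.0])]

k = len(groups)                          # number of groups
n = sum(len(g) for g in groups)          # total observations
grand_mean = np.mean(np.concatenate(groups))

# Sums of squared deviations, between group means and within groups
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

ms_between = ss_between / (k - 1)        # treatment mean square
ms_error = ss_within / (n - k)           # mean square error
F_ratio = ms_between / ms_error
p = f.sf(F_ratio, k - 1, n - k)          # area beyond F in the tabled distribution

print(F_ratio, p)
print(f_oneway(*groups))                 # scipy agrees with the hand computation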

This accept and reject approach to statistical testing was strongly opposed by Fisher. He believed that while it appeared appropriate to infer that if NHST could discount one hypothesis, it should be able to prove an alternative one, this alternative hypothesis "however reasonable or true it may be, is ineligible as a null hypothesis to be tested by experiment, because it is inexact" (Fisher, 1966, p. 16). Fisher saw the Neyman-

Pearson approach as operating outside of the framework that he had devised for statistical testing. Fisher (1966) wrote:

A good deal of confusion has certainly been caused by the attempt to formalise the

exposition of tests of significance in a logical framework different from that [in] which they were in fact first developed. (p. 26)

According to Gigerenzer et al. (1989), the Neyman-Pearson approach to inference, which involved the use of two hypotheses and introduced power as a fundamental concept, was blurred or ignored by textbook writers and introduced as if there did not exist any controversy. The modern version of NHST employed by most scientists is better understood as a hybridized model of Fisherian and Neyman-Pearson statistics, and thus not all criticisms of the NHST procedure are rightly directed at Fisher alone.

Some Remarks Regarding the Origins of NHST

The advent of the method of correlation, inspired by biological and eugenic studies, created the space for distributional statistics to enter psychology. Galton had been aware that the method of correlation extended well beyond the realm of biology, and was actually useful as a method of comparison for any possible set of given variables (Gigerenzer et al., 1989). Fancher (1996) suggested that "while often wrong in their specifics, his [Francis Galton's] theories provided innumerable basic foundations on which others could build. He pioneered the very idea that tests could be employed to measure psychological differences among people" (p. 219). Correlation made it possible to study more than one variable at any time in a manner that was of technical ease (see

MacKenzie, 1981, p. 59). Just two years following the defining of the method of correlation, James McKeen Cattell, then professor of psychology at the University of Pennsylvania, together with the endorsement and interest of Galton, proposed a research program designed to study and measure mental traits and abilities that employed Galton's technique (see Cattell & Galton, 1890, p. 308). Using a variety of mental measures and their relationships, Cattell and Galton (1890) envisioned experimental psychology moving towards the "certainty and exactness of the physical sciences" (p. 373). These thoughts were echoed by F. Y. Edgeworth in 1893. Edgeworth (1893) argued for the study of social statistical entities, much like Quetelet's (p. 670), that could uncover the regularities of social variation using Galton's method of correlation.33 While it is unclear why, in 1889 John Dewey wrote a review of Galton's Natural Inheritance. Dewey (1889) was greatly impressed by Galton's work on heredity and claimed it was "doubtless the ablest work on the subject extant" (p. 331). Dewey (1889) also acknowledged the "double interest" (p. 331) of Natural Inheritance as Galton had "collected a large mass of statistical information" and "more important, [had] developed some new and interesting statistical methods" (p. 331). Dewey (1889) was particularly impressed by the fact that Galton "found a remarkable

33 . . . and with a certain measure of common sense regarding its interpretation (p. 672).

Galton's development of new methods, and see how far they can be applied in their own fields" (Dewey, 1889, pp. 333-334). Dewey joined the calls to action and incorporation, of Cattell and Edgeworth. He promoted the utility of studying human and social phenomena through the application of the Normal Law and correlation. Once correlation was adopted, other statistical methods soon appeared. The advent of small sample statistics, the ?2 statistic and the tests that followed (t, ANOVAs, etc.) were developed in response to the needs of practical experimentation in the brewery.

Psychologists wanting to study certain aspects of human behaviour or investigate certain mental characteristics were at times confined to the study of small samples in much the same way Gosset and certain other experimentalists were. As statistical methods and measures of correlation began to appear in psychological literature with increasing frequency at the turn of the 20th century, psychologists needed a technique of assessing the meaning of these measures. By the 1950s, NHST, exemplified by ANOVA techniques, dominated the experimental psychological literature (Rucci & Tweney, 1980).

NHST appears to have provided the opportunity to obtain seemingly systematic and definitive results regarding the interpretation of data. These critical aspects and needs of experimental psychologists will be explored in more detail in the chapters that follow.

The NHST Debate: The Persistence of a Problem in Psychology

In 1957 Lee Cronbach gave his presidential address to the American Psychological Association. In this address, Cronbach outlined two disciplines within psychology. Each had a unique history, and each was characterized by its use of statistics. He suggested that there existed a "correlational" psychology and an "experimental" psychology (Cronbach, 1957, p. 671). He suggested that "Fisher made the experimenter an expert puppeteer" while "the correlational psychologist is a mere observer of a play where Nature pulls a thousand strings" (Cronbach, 1957, p. 675). This observation is consistent with what Danziger (1987) has offered as the distinction between the Galtonian (correlational) and neo-Galtonian (experimental) branches of psychology as the discipline developed following the turn of the 20th century. For experimental psychology at the turn of the 20th century and into the early 1930s, ANOVA did not yet exist or had not appeared in psychological journals. However, there existed many alternative methods of statistically testing the significance of results. These methods were acknowledged as early as 1904 by Thorndike in his Introduction to the Theory of Mental and Social Measurements. They had the common practice of measuring the amount of variability belonging to a distribution of data, as a measure of the average deviation (A.D.), standard deviation (S.D.), percentiles (Q) or the probable error (P.E.) (see Thorndike, 1913, pp. 91-108). One such method, the Critical Ratio (CR), functioned in a manner similar to that of z or t testing. The differences between the average values for a given group under study could be analyzed as a measure of relative dispersion. Leith (1936) suggested:

The sole purpose in computing the critical ratio is to measure the probability that a difference as large as the observed difference would be obtained if the samples were drawn from the same statistical population, in the face of an established presumption (possibly very mild) that they are not so drawn. (p. 556)

It takes one only a moment to refer to a contemporary statistics textbook to find that Leith's interpretation of the CR, and its purpose, is consonant with the aims of modern psychology's use of the null hypothesis in NHST. These precursory methods and the use of the χ² statistic possibly link certain aspects of the NHST debate to criticisms raised by Edwin Boring in the years immediately before the appearance of Fisher's ANOVA and the formal discussion of NHST. Boring in 1919 expressed a general distrust in statistical methods. Statistical methods were still relatively new to experimental psychology, and Boring at the time was at Cornell University under the tutelage of E. B. Titchener. Titchener maintained the belief that introspection was the method of experimental psychology, perhaps explaining why Boring was skeptical of statistical techniques. Boring's paper was concerned with the usefulness of statistical measures and their related tests of significance for psychology. The paper, "Mathematical vs. Scientific Significance," outlined the tenuous nature of applying the rules of statistics to observations and measurements of human subjects. Boring pointed to the possible limitations of statistical significance in light of the intuition of an experienced psychologist. He further suggested that statistical treatments of data were perhaps inhibiting the production of scientific knowledge:

So it happens that the competent scientist does the best he can in obtaining unselected samples, makes his observations, computes a difference and its "significance," and then - absurd as it may seem - very often discards his mathematical result, because in his judgment the mathematically "significant"

difference nevertheless is not large compared with what he believes is the

discrepancy between his samples and the larger groups which they represent. (Boring, 1919, p. 337)

Boring (1919) contended that the use of statistics had led psychologists away from thinking freely and coming to conclusions drawn in discerning and meaningful ways. Psychologists were being removed from their objects of study. Boring (1919) wrote "the case is one of many where statistical ability, divorced from a scientific intimacy with the fundamental observations, leads nowhere" (p. 338). Regardless, by the 1950s the method of NHST (as ANOVA) had become the dominant approach to experimental design and analysis in experimental psychology (Rucci & Tweney, 1980). The large scale production of textbooks dedicated to Fisherian statistics, their enthusiastic uptake by psychologists, and the decision to make statistical methods of inference based upon Fisher's model a graduate requirement between 1951 and 1954, all helped solidify the role NHST was going to play for future psychologists (Gigerenzer et al., 1989). The NHST debate seems to have been passed over by experimental psychology, not necessarily at an individual level, but certainly at the level of disciplinary concern.

Contemporary authors have demonstrated that little change has been brought about by the 50 years of NHST debate. Roger Kirk in 2004 discussed the many pitfalls and difficulties of using NHST, but did not stray from a general message:

A significance test does not tell us how large the effect is or whether the effect is important or useful. Unfortunately, all too often the primary focus of research is on rejecting a null hypothesis and obtaining a small p value. The focus should be

on what the data tell us about the phenomenon under investigation. (Kirk, 2004, p.

213)

The argument has been put forward elsewhere by Kirk (1996), Marewski and Olsson (2009), Kline (2004), Gigerenzer et al. (1989), and as recently as last year by Trafimow and Rice (2009). The Trafimow and Rice (2009) article presents an empirical study of the logical connection between the probability of obtained data, given the null hypothesis, and the opposite relationship,34 by way of studying the correlation proposed to exist between the two. The results clearly demonstrate that there is little to no probabilistic information that the NHST procedure can actually provide regarding any of the hypotheses of interest to the scientist. The symmetry of Boring's early criticisms and the mid-century critiques of Berkson, Hogben, Bakan and Meehl, to that offered by Kirk, Trafimow and Rice and other contemporary authors, is clear. The state of affairs regarding NHST appears to have not changed.

34 The probability of the null hypothesis given the data.
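The asymmetry at issue can be stated in one line of probability theory. The two quantities are linked only through Bayes' theorem,

p(H0 | D) = p(D | H0) p(H0) / p(D),

so the probability a significance test reports, p(D | H0), cannot by itself yield the probability of the null hypothesis given the data without the remaining terms. This restatement is offered here as a gloss on the empirical point, not as part of Trafimow and Rice's own derivation.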

The NHST Debate: Concluding Remarks

If Boring's 1919 criticisms are taken seriously from a methodological point of view, then the NHST debate could be considered to be linked to psychology's history before NHST was formally proposed and before it became a controversy. Boring's criticisms reflect three of the four proposed aspects of the current NHST debate: the decision making problem, the problem of appropriateness and the problem of logic.

The decision making problem surrounding the use of statistical significance is a critical aspect of Boring's 1919 paper. He claimed much the same as Berkson did in

1938, that the reliance upon statistical significance was often an inappropriate means of analyzing data. His argument stressed the distinction that Rozeboom (1960/1970) would later make, that the statistical and substantive significance of data were often two distinctly different interpretations. The problem of appropriateness is evident when Boring (1919) proposes that "It is a common experience of scientific persons working with human [emphasis added] data that these formulae [chi-square in particular] frequently give values for the probability of differences that are 'too high'" (p. 336).

Boring's claim regarding unguided statistical information and scientific obscurity possesses a distinct similarity to the problem of logic. I suggest that these scientific impediments are similar to those that Hogben (1957/1970), Bakan (1966), Berkson

(1942/1970), Meehl (1967/1970, 1978) and more recently Cohen (1994) and many others have suggested result from following the NHST procedure.

It further appears that contemporary authors continue to repackage the same arguments that were made quite strongly perhaps starting with Boring in 1919 and the authors of the controversy from the 1950s and 1960s. I argue here that the reasons for psychology's stagnancy when it comes to re-examining and perhaps changing its methods have a great deal to do with the apparent needs of the discipline in the years leading up to its adoption of NHST. As Danziger (1987) has suggested, the move to Fisherian statistics represented a solution to psychology's problems of consensus surrounding the interpretation and meaning of experimental research. The apparent laissez-faire approach to addressing the issues regarding NHST and statistical methods becomes more understandable within a historical context. As we examine the historical roots of experimental psychology in light of certain unresolved problems surrounding its emergence as a quantified discipline, we find that experimental approaches rose and fell frequently prior to NHST. Examining this rising and declining of experimental methods helps expose the intradisciplinary needs and pressures that NHST must have appeared to have met in order for it to have retained a prominent position in the tool-box of experimental psychologists. Gustav Theodor Fechner's psychophysics and Wilhelm Wundt's program for an experimental psychology established what would become the experimental discipline of psychology. Given the great interest in physiology and materialist science in late 19th century Germany, a certain amount of interest in the functions of mind and body existed. Building upon his philosophical antecedents, Fechner proposed to solve the mind-body problem by experimentally demonstrating his version of dual aspect theory.

Fechner's experimental demonstrations and mathematical determinations would help fuel increased interest in experiments that sought to quantify the mental. However, some like Johannes von Kries argued that Fechner had failed in his attempt to quantify the mental and that the task was impossible. This apparent controversy was sidestepped by early psychologists whose interests at the time were in experimental practices and not theoretical concerns. At the forefront of this movement was Wilhelm Wundt. His laboratory and the practices that were begun there provided a standard image that would establish the first norms of experimental practice and begin the process of organizing what was to become experimental psychology. The unanswered theoretical concerns that developed during these years were carried forward and would persist. These concerns helped establish the role that statistics and eventually NHST would play for the young discipline.

Chapter 2

Quantified Reasoning and a Standardized Image of Psychology

The rise of experimental and quantitative psychology began in 19th century

Germany. The emergence of a quantified psychology was the result of a new approach to the mind.35 Correcting and extending the works of J. F. Herbart and E. H. Weber, experimental psychology, in its modern form, took shape in 1860 with Gustav Theodor Fechner36 and the advent of psychophysics (see Michell, 1999, p. 3; Murray, 1987, pp. 73-80; Heidelberger, 2004). The first psychological laboratory would follow soon after in

1879, established by Wilhelm Wundt. The experimental approach to psychology promoted in Wundt's laboratory had been based in part upon Fechner's theories but was enhanced and refined by his knowledge of physiology. In this way, experimental psychology's history as a discipline is bound up within a history of its acquisition of forms of quantified reasoning and experimentation in 19th century Germany. The purpose of this chapter is to examine the rise of quantified reasoning in psychology within a primarily German tradition. This examination exposes the deep roots of the quantity objection that sits at the heart of the cause behind NHST's persistence in psychology. The quantity objection and the problems of psychological investigation were overridden by early psychologists in the desire for discipline-building and methods of standardized practice. Fechner was not concerned with establishing psychology as a

35 See Leary's (1978) analysis of the philosophical transformations which took place from Kant to Beneke in The philosophical development of the conception of psychology in Germany, 1780-1850, in the Journal of the History of the Behavioral Sciences, 14, 113-121.
36 Fechner was first trained as a medical doctor and then took on the task of becoming a trained physicist whose knowledge of mathematics and experimental methods would be of great asset to his ultimate goals (see Boring, 1950, p. 277).

science, but rather with the exploration of possible evidence for his worldview that attempted to navigate its way past established traditions begun by Immanuel Kant. But, by introducing quantitative evidence as the demonstration of proof of his beliefs, Fechner inadvertently provided a foundation for an experimental program of psychological investigation. His efforts to quantify sensation and provide a precise description of the nature of sensory responses represented the entry point for statistics to make their way into psychology (Sheynin, 2004). Though criticisms were aimed at Fechner and his theories, his ideas inspired certain German intellectuals practicing experimental approaches to physiology. Wundt took a portion of Fechner's ideas and established a quantified science of psychology that he named physiological psychology. Through Wundt, a standardized formula for experimental practice and the image of laboratories devoted to psychological inquiry was developed. Wundt's example sparked the interests of a collection of scientists who adopted, at least in part, the traditions begun by Wundt. Experimental approaches to psychology, over time, established a difference from philosophy and physiology that provided it with scientific credibility. By developing common methods and approaches, as well as textbooks, psychology provided evidence of being an established science regardless of existing theoretical difficulties.

Fechner's Mathematical Psychology

Described as the father of experimental psychology by Boring (1950, p. 246), Fechner wrote Elemente der Psychophysik, which first appeared in 1860, to remedy what he saw as a problem: a lack of knowledge concerning "the relation of mind and matter" (p. 1). According to Fechner (1860/1966), "Psychophysics, already related to psychology and physics by name, must on the one hand be based on psychology, and on the other hand promises to give psychology a mathematical foundation" (p. 10). The question addressed here is why a mathematical foundation was seen as necessary for psychophysics, and by extension psychology, when Fechner himself suggested that

"knowledge of the mind has, at least up to a certain point, established for itself a solid basis in psychology and logic." (p.l)

In the decades leading up to Fechner's mathematico-experimental treatment of sensation, there existed a rather vigorous debate concerning the nature of the mind and experience. Among German intellectual, political and social circles of this era, Immanuel Kant was amongst the most influential.37 Kant's Critique of Pure Reason, which appeared in 1781 and was later revised in 1787, criticized the British empiricist ideas that began with John Locke and were extended by David Hume. The empiricist psychology of the time had reduced all knowledge and knowing to human experience. To Kant this was unacceptable, and his Critique of Pure Reason was motivated by a need to demonstrate how one might come to "understand by a priori knowledge, not knowledge independent of this or that experience, but knowledge absolutely independent of all experience" (Kant,

1787/2003, p. 43). This concept of knowledge as independent of the knower created a new space that strengthened the immaterialist philosophy of an independent world whose existence did not rely upon human sensory experience. In so doing, Kant (1787/2003) put into play a new form of discourse, one involving "objective reality" (p. 89), that defined an external object as an "object in itself" (p. 88).

37 Kant's influence should not be understated; his work has had more impact on psychology "than [the work of] any other philosopher of the eighteenth century" (see Büchner, 1897).

Kant's views on psychology appear complex and wide ranging,38 however, he did suggest that psychology and psychological investigation did not, and would not, fit within his conceptions of what constituted a science. In Kant's Metaphysische Anfangsgründe der Naturwissenschaft (Metaphysical Foundations of Natural Science) of 1786, he proposed that "a rational doctrine of Nature deserves the label 'natural science' only when the laws of Nature are known a priori and aren't mere laws of experience" (Kant, 1786/2009, p. 2). According to Kant (1786/2009), the pure part39 of a science leads it to being a natural science by way of its structure. The structure of a priori principles binds the knowledge of natural science together and provides meaning to it all. Geometry and mathematics for Kant (1786/2009) are good examples of this natural structure. Hence, Kant (1786/2009) claimed that no science can be a "natural science unless mathematics is brought into it" (p. 4). A mathematical psychology, following Kant's reasoning, was not possible given that the mind was not knowable as substance (an object in itself) and hence could not be known in an objective sense or be represented mathematically. Kant further proposed that empirical psychology was not fit to be considered a subject of metaphysics either (Kant, 1787/2003, p. 664) and in so doing placed psychology in the realm of natural history. However, it appears that in placing psychology outside of philosophy, Kant actually provided the push it needed towards becoming a science40 (see Bell, 2005, p. 143; Leary, 1978, p. 116).

38 see the discussion regarding Kant's various views on psychology as put forth by Bell, 2005, pp. 145-151.
39 that part of it which is only known a priori.
40 By science I am using the contemporary understanding of the term here. Kant would not have suggested that psychophysics or experimental psychology ever developed a priori structures. Psychology was and has remained an empirical science.

Fechner's prescription for a quantified, mathematical psychology was in response both to the philosophic problem surrounding the nature and relationship of the mind to the physical world, as well as to the scientific prescriptions of Kant's philosophy. Fechner (1860/1966) was concerned with generating evidence for the "dualistic" (p. 4) relationship that he believed existed in the mind-body relationship. Fechner's (1860/1966) adherence to dual aspect theory proposed that, much like a circle, whose convex and concave sides are only visible based upon situated perspective (inside the circle or outside), the reason why the connection between mind and body had not been teased apart up until that point was that the problem had not been approached with perspective in mind (p. 2). He wrote:

What will appear to you as your mind from the internal standpoint, where you

yourself are this mind, will, on the other hand, appear from the outside point of

view as the material basis of this mind. There is a difference whether one thinks

with the brain or examines the brain of a thinking person. (p. 3)

Body and mind parallel each other; changes in one correspond to changes in the other. Why? Leibniz says: one can hold different opinions. Two clocks mounted

on the same board adjust their movement to each other by means of their common

attachment (if they do not vary too much from each other); this is the usual dualistic notion of the mind-body relation. (p. 4)

Fechner (1860/1966) proposed, in the analogy of the two clocks, that the reason why the clocks remain in sync lies not in their common connection but in the fact that they are two faces of the same entity. Thus by extension, dual aspect theory resolves the mind-body relation through perspective, provided one accepts that the two are part of the same whole. Fechner (1860/1966) stated, "These are my fundamental opinions. They will not clear up the ultimate nature of body and mind, but I do seek by means of them to unify the most general factual relationships between them under a single point of view" (p. 5). Fechner's "factual relationships" were those relationships that could be described by a governing mathematical relationship. While Fechner is considered to be the first successful experimenter of mind, his theory was in no way an oddity of the time. Following Kant and his Critiques, other scientists and philosophers took up Kant's criteria for science and attempted to apply them to studies of the mind. Fechner, in this way, was influenced by Herbart41 (see Boring, 1950, p. 284; Fechner, 1860/1966, p. 46). He sought to exploit what he saw as Herbart's failure, the absence of a full mathematical treatment combined with experimental demonstration. Johann Friedrich Herbart has been identified as the "father" of scientific pedagogy (Boring, 1950, p. 250). His mathematical approach to psychology, while a departure from Kant's view that psychology could not become a science, represents an acceptance of Kant's claim that a science must include mathematics. Herbart's thought had been noticeably influenced by Sir Isaac Newton's Philosophia Naturalis Principia Mathematica, which had appeared in 1687.42 Newton had proposed the existence of a measurable and law-like relation between brain, nerve fibers, and electric and elastic

41 While Fechner did not directly acknowledge Herbart's system of psychology, he makes repeated mention of Herbart's psychology as an effective means of contrasting the differences between what Herbart failed to do and what he saw himself as able to achieve: the quantification of sensation and the development of a mathematical and experimental system of psychology (see Heidelberger, 2004, p. 31).
42 For full discussion of this point please see Boudewijnse, Murray & Bandomir, 1999, pp. 165-166.

responses. Herbart felt that a scientific psychology could be based upon Newton's claim to law-like regularities between the mind and body, which were explained through Kant's necessary condition of mathematics. In his 1824 text Psychologie als Wissenschaft, he proposed a model of psychology based upon mathematics in a manner that mirrored Newton's mathematical model of the cosmos.43

From this model Herbart proposed a new concept of measurement and its application to a mathematics of the mind. He emphasized that the method of psychology was observation, that psychology was mechanical and that its objects were amenable to mathematics (see Boring, 1950, pp. 252-253). In this way Herbart diverged from Kant and his a priori principles, but also proposed a means for psychology to obtain the type of "objectivity" that Kant had established in his Critique of Pure Reason (see Daston & Galison, 2007, p. 258; Kant, 1787/2003, pp. 72-73). This was to remedy the problem proposed by Kant (1787/2003) that the soul (or mind) "can never be met with in any experience, and such, therefore, that there is no way of attaining to it, as an objectively valid concept" (p. 341). Unfortunately, Herbart felt that there was no way to properly introduce experimentation44 into his new mathematical psychology because quantity was unattainable (Zupan, 1976, p. 151). It appears that the absence of experimental applications cast Herbart as having "failed to accomplish the task which he set himself" (Wundt, 1902/1969, p. 6) and contributed to his disappearance from many

43 See the discussion and distinction of "physical" bodies or Körper found in Herbart's early writing and the later replacement with mental "bodies" or Vorstellung as outlined by Boudewijnse, Murray & Bandomir, 1999, pp. 163-164.
44 This was a reflection of his having adopted Kant's claim that psychology could never become experimental as there was no obvious way to conduct experimentation on mind (see Boring, 1950, p. 254).

non-German histories of psychology (see Boring, 1950, pp. 250-261; Boudewijnse, Murray & Bandomir, 1999; Boudewijnse, Murray & Bandomir, 2001). Fechner succeeded where Herbart had failed by way of his faith in dual aspect theory. Fechner (1860/1966) felt that Herbart's contention that psychology could not be experimental was simply an expression of his inability to address the "question of whether and to what extent it is possible to measure sensation itself or mentality in general" (p. 46). Fechner's (1860/1966) use of dual aspect theory permitted an intimate and direct functional relationship between the inner and outer worlds of experience:

Generally the measurement of a quantity consists of ascertaining how often a unit quantity of the same kind is contained in it. In this definition, sensitivity is an abstract capacity and as little a measure as is abstract energy. But instead of measuring it by itself, one can measure something related to it, something of which it is a function, which in accord with this concept increases and decreases with sensitivity - or conversely with which sensitivity increases and decreases - and thus we obtain an indirect measure in the same way as we do energy. (p. 38)

By proposing a direct connection between body and mind (the physical and the mental), that they are part of the same whole, Fechner suggested that changes in one quantified aspect of the physical should create changes in the quantified experience of the mental. This aspect of Fechner's theory was created in imitation of the law of conservation of energy. The nature of these quantifiable changes is not assumed to be one-to-one; determining the nature of the law-like regularities was to be the goal of psychophysics (see Fechner, 1860/1966, pp. 19-37).

Fechner achieved the quantification of sensation in his mathematical and experimental expression of psychophysics. Fechner's psychophysics built upon the discoveries of E. H. Weber, who had already deduced the general form of the stimulus-sensation relationship. In his Elemente, Fechner (1860/1966) wrote:

The more exact formulation stating that the magnitude of the stimulus increment must increase in proportion to the stimulus already present, in order to bring about an equal increase in sensation, was first made with some generality by E. H. Weber and supported by his experiments. I have therefore called it Weber's Law. (p. 54)

Weber's Law governing the quantified relationship between stimulus intensity and sensation magnitude had been published in 1846 (see Boring, 1950, p. 280). Fechner deduced his more generally applicable formulation of Weber's Law in 1851, in his Zend-Avesta, where there is no mention of Weber's Law (see Fechner, 1851/1987).
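In the standard modern statement (which the thesis's sources present in varying notation), Weber's observation and Fechner's generalization of it read:

ΔI / I = k (Weber's Law: the just noticeable stimulus increment ΔI is a constant fraction k of the stimulus intensity I already present)

S = c log(I / I0) (Fechner's Law: sensation magnitude S grows as the logarithm of stimulus intensity, with I0 the threshold intensity and c a constant)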

When Fechner set about to produce his experimental method of mental measurement, there existed a strong current of activity in Germany reflecting an interest in the measurement of mind. Beginning with a body of interested German psychologists and physiologists that emerged in the mid 19th century, the examination of the mind and its various physical relations began to take a rigorous experimental and mathematical form (see Daston & Galison, 2007, p. 258). Fechner's psychophysics, in this way, made its appearance at a time when there existed a community of scientists interested in related matters. While Fechner himself had developed a bit of a tarnished reputation for his interest in issues that other philosophers viewed as dubious,45 his psychophysics

45 Fechner wrote many works on the issues of life, death, and God (see Boring, 1950).

demonstrated a sufficient grounding in experimentalism and mathematics that it drew some attention (see Boring, 1950, p. 281).

Fechner's Statistics in the Service of Precision

The 19th century saw a rapid rise of mechanical procedures and other methods of experimentation that were designed to deliver results that could be deemed unbiased or objective. Nowhere else was this found to be more true than in Germany (see Olesko, 1995, p. 103). There was a drive within the scientific community towards an "increasing separation of science from philosophy and the humanities" (Heidelberger, 1987, p. 119). Individual judgments were no longer considered to be entirely reliable. The experimenter in 19th century Germany had to be regulated by outside, objective forces. With this in mind Fechner (1860/1966) wrote, "judgment needs to be mediated by a process of calculation" (p. 63). In arguing for the attainment of objectively valid procedures that generated precise quantities by way of measurement and calculation, Fechner was of the mind that he was generating a form of strong argument for his claim to have successfully quantified an aspect of mentality. The precision of discrete quantities, for the German investigators of the 19th century, had come to occupy a social function. There was a certain ethical or value-laden dimension that had emerged regarding the attainment of precise quantities. While this was not an aspect exclusive to German scientific investigation, it certainly presented itself as an integral component, here more than elsewhere. Emerging as a prominent aspect of this ideal, Gauss' method of least squares and the normal distribution, which first appeared in published form in 1809, became a much discussed method in the 1830s (see Olesko,

1995, p. 107). The normal distribution and probability calculus became a very influential technology for scientific claims. The method proved itself useful in estimating what approximate true values existed amongst a large pool of aggregate data. The math itself was quite clever, but it was the interpretation of the results of Gauss' work that was more influential. The method of least squares suggested in its use that the effects of errors present within measurements were reduced to a minimum. Furthermore, it proposed to help identify those errors that were systematic, thereby providing a means of eliminating them, and leaving the data bare with only the "accidental errors" of measurement (Olesko, 1995, p. 108). The errors could be born either of the experimenter, the tools of measurement, or even inherent in the properties of the object of study. No matter the source, Gauss' method proposed to strike to the heart of the truth behind measurements and uncover real or true values.

The very use of rigid, numerical methods also led to an objectification of science, by eliminating personal "whim," by making the observers only observe what can serve as input for these [in this case Gauss'] methods. (Swijtink, 1987, p. 263)

Thus Gauss' new mathematical technology appeared to operate independently of the desires of the investigator and produced a refined, precise image of what lay beneath the data; it provided a technology of objective observation (Swijtink, 1987). From a methodological point of view, Fechner was very concerned with the concepts of precision, objectivity and validity. He developed his methods of measurement with particular interest in minimizing the influence of individual personalities or subjectivities; quantities could argue for themselves. Fechner (1860/1966) notes, with respect to the method of just noticeable differences (j.n.d.),46 that "one of its major drawbacks is the fact that whatever is termed just noticeable leaves more room for error as a subjective judgment" (p. 63). In order to remedy the apparent subjectivities involved in the measurement of sensitivity, Fechner believed that the method of average error47 could be applied. Fechner (1860/1966) wrote:

To my knowledge, however, this method [the method of average error] has been used only with respect to objective measures of precision of physical and astronomical observations or to the determination of the magnitude of the source

of errors in this judgment. It has never been considered or used as a psychophysical method for determining the acuteness of the senses. Yet it seems to me to be one of the most useful for this purpose. . . . I have used it in determining

the precision of judgments of visual extent and touch. (pp. 62-63)

Fechner's concern over precision produced very large data sets. In order to correctly determine the thresholds for weight differences, Fechner used two experimental series. These series each consisted of "24,576 simple liftings or cases" (Fechner, 1860/1966, p. 153), and these simple "liftings" were extended for each experimental condition.48 Fechner applied the method of average error to all the data that he acquired, for and under each of the methods he devised (see Fechner, 1860/1966, p. 63). He attempted

46 The method of just noticeable differences was at the time most notably used by E. H. Weber in his investigations of sensitivity for "weight, touch, and visual space perception" (Fechner, 1860, trans. 1966, p. 62). It involves the discrimination of sensitivity thresholds at which a sensation can just be detected and acknowledged to have developed through the application of a given stimulus, and was inclusively discussed in the brief discussion of Weber's Law.
47 Fechner acquired massive amounts of data in order to determine regularities. The aggregate data would yield representative values by determining the average error. Using the average error a measured true value could be determined as Gauss had prescribed in the method of least squares.
48 For example, Fechner's method of weight detection involved a series of experimental conditions where detected differences were of interest. For a single . . .

to improve upon the theory of errors by attempting to derive a new formula for error estimation that relied upon finite numbers of observations rather than a hypothetical infinite set of observations (see Fechner, 1860, pp. 100-107). At the time this was a novel correction to the application of the normal law to the theory of errors. It also is an example of how Fechner the statistician, though not always discussed as such, introduced statistical measures into the domain of the psychological as a tool of the experimenter (Sheynin, 2004). Collected data, after being transformed by the application of the error law, exhibited characteristic quantities that obeyed specific and regular tendencies, indeed, psychophysical regularities.
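The logic linking the method of average error to Gauss' least squares can be restated in a few lines of code. The sketch below (in Python; the six measurement values are invented) shows that the estimate minimizing the sum of squared errors over repeated measurements of a single quantity is exactly the arithmetic mean, i.e., the kind of representative value Fechner's aggregate data yielded:

import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical repeated measurements of a single true quantity
measurements = np.array([10.1, 9.9, 10.3, 10.0, 9.8, 10.2])

# Least squares: find the estimate m minimizing the sum of squared errors
result = minimize_scalar(lambda m: np.sum((measurements - m) ** 2))

print(result.x)             # the least-squares estimate...
print(measurements.mean())  # ...coincides with the arithmetic mean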

Fechner did not develop an all encompassing research program based upon his psychophysics. In fact, the bulk of Fechner's psychophysical research that appeared in the 27 years following the publication of his Elemente simply addressed his critics and sought to clarify and reinforce his views (Murray, 1990). The objections to his claims of measurement and quantification49 centered upon his ability to have measured sensation, and to have successfully quantified an aspect of mental life (Heidelberger, 2004, pp. 207-

208). Fechner's disinterest in developing a program of investigation should come as no surprise, given our discussion of Fechner's reasoning behind the advent of psychophysics. Evidence, as proof of dual aspect theory, was at the heart of Fechner's research. Yet, that research had "shown us [experimental psychologists] how a 'mathematical psychology' may, within certain limits, be realised in practice" (Wundt, 1902/1969, p. 6). The formulation and execution of developing a research program, and a discipline devoted to

These will be addressed later in this paper. 60 an experimental psychology, was the work of Wilhelm Wundt (see Danziger, 1980, p. 1 10). Using Wundt's model, experimental psychology would eventually develop a standardized form that would propel it beyond the conceptual problems inherent in psychological quantification.

Wundt and the Discipline of Experimental Psychology

For psychologists, the year 1979 marked the 100th birthday of psychology as a discipline. The chosen birth year commemorated the establishment of the first psychological laboratory at the University of Leipzig by Wilhelm Wundt in 1879 (Hatfield, 2002). Wundt's founding of the first psychological laboratory and the methods he devised for the new science appear to be the greatest gifts he provided the new field of experimental psychology50 (Danziger, 1980). Wundt, who was first a medical doctor and physiologist, has been described by Boring (1950) as the "first man who without reservation is properly called a psychologist" (p. 316). It would seem that the reasoning behind this declaration lies not in some successful navigation of the theoretical and practical considerations of psychological matter, but in Wundt's ability to lead and grow a movement. The "movement," as it appeared to William James in 1875, was a shift that had begun in the "physical sciences," whereby new interest had developed in "metaphysical problems" (p. 114). James (1875/1980) went on to say that:

Nowhere is the new movement more conspicuous than in psychology which is of course the antechamber to metaphysics. The physiologists of Germany, devoid for the most part of any systematic bias, have, by their studies on the senses and the brain, really inaugurated a new era in this science. . . Such as they are, Professor Wundt, the title of whose latest work heads our article, is perhaps their paragon; and his whole career is at the same time a superb illustration of that thoroughness in education for which Germany is so renowned. (pp. 114-115)

Wundt's reputation in 1875 as the leader of a new and burgeoning field had already become apparent. He was creating a new discipline recognizable as Physiological Psychology, a distinct form of Philosophy. How this was achieved strikes at the heart of this investigation. The emerging discipline was to be based upon controlled experimentation, observation, and measurement.

50 See Danziger's 1980 article, "Wundt's psychological experiment in the light of his philosophy of science," Psychological Research, 42, 109-122, for a more complete discussion.

Physiological Psychology as a Distinct Discipline

Wilhelm Wundt's interest and career in psychology was intimately related to his experience studying medicine and physiology. Wundt received his medical degree from Heidelberg University in 1855. With medical degree and license in hand, Wundt spent nearly the next 25 years of work and study developing alongside many important figures in medicine and physiology. His first medical work served to develop his interest in psychology and psychophysics. As reported by Bringmann, Bringmann and Balance (1980), Wundt acknowledged his career in psychology as beginning at a time when he was working under Ewald Hasse at Heidelberg's University Hospital. At this time "Observations about touch sensitivity in hysterical patients led him to challenge the formulations of the veteran psychophysicist, Ernest Heinrich Weber" (p. 23). Though Fechner's formulation of psychophysics had not yet occurred, the methods of E. H. Weber were the stage upon which Fechner would act out his discoveries.

Wundt would soon turn from medicine towards physiology. This marked a move towards what he would later describe as "physiological psychology" (Wundt, 1902/1969), a term that he felt was interchangeable with "experimental psychology" (see Wundt, 1907, p. 27, note 3a). The shift in interest, according to Bringmann, Bringmann and Balance (1980), took place in the middle years of the 1850s, and was drawn out by experiences Wundt obtained while in Berlin working closely with Emil du Bois-Reymond and Johannes Müller (see pp. 25-26). It also prepared Wundt for his employment as Hermann von Helmholtz's assistant. Fancher (1996) has suggested that Wundt's knowledge of Helmholtz's research helped him fashion his own experiments regarding reaction-time and shift the site of experimentation from the "peripheral" parts of the mind-body relation to the "central," or integral neurological, parts of the mind itself (p. 146). This provided him with the crucial insights into the boundaries between psychology and physiology needed to be the first to "successfully demarcate the problem areas of scientific psychology as distinct from either physiology or philosophy, thus insuring psychology's survival as an autonomous discipline" (Bringmann, Bringmann, & Balance, 1980, p. 5).

By the time that Wundt was called to take up a professorship at Leipzig in 1874, he had already begun to lecture extensively on the nature and grounds of what he saw as physiological psychology and was a "self-taught expert in laboratory work" (Bringmann, Bringmann, & Balance, 1980, p. 29). Wundt's experimental psychology found a natural fit in Leipzig. According to G. Stanley Hall (1924), "no place in Germany was so favorable for the reception of the new mode of applying the exact methods of physical science to psychology" (p. 312). For Wundt, Fechner's psychophysics made possible the realization that mathematical relationships could be established between the body and the mind. According to Wundt (1902/1969):

Fechner's determinations are. . . a specific science of the 'interactions of mind and body.' But, in saying this we do not lessen the magnitude of his achievement. He was the first to show how Herbart's idea of an 'exact psychology' might be turned to practical account. (p. 7)

For his colleagues at Leipzig, this provided Wundt with a solid foundation upon which he could establish the new science, because "the succession to Wundtism from the doctrines of Herbart and Fechner was both legitimate and direct" (Hall, 1924, p. 312). Wundt's experimental psychology came together in the 1880s at Leipzig. The establishment of Wundt's laboratory in 1879 provided him with the space necessary to conduct his experiments and to organize his discipline. It also became the site that would define the image of psychology as it began to move forward as a scientific discipline. The site where psychological knowledge was generated moved from the philosophic and speculative arenas to the laboratory. This was a move motivated by the imperatives of precision and objectivity, and was also a result of psychology's increasing turn to quantification.

Inside Wundt's Psychological Laboratory

In 1928 James McKeen Cattell wrote of Wundt and his laboratory, saying "he was interested in the laboratory as a system" (p. 545). In this sense, Cattell (1928) was indicating the relationship that Wundt's psychological laboratory, and laboratories in general, had with the rise of industrialism in the 19th century. The laboratory became a site for the efficient production of psychologists and psychological knowledge, while the procedures and tools of the laboratory led to a mechanization of the methods of psychological experimentation in both the figurative and literal senses.

In the literal sense, Wundt's laboratory was filled with expensive mechanical equipment designed to be used in experimentation, in order to deliver the most accurate of measurements and images. Harper (1950) suggested that "By 1894 Wundt's 'Institut' was the largest and best equipped laboratory of experimental psychology in the world" (p. 161). Indeed, Krohn (1892) wrote "Professor Wundt's laboratory is so well known, and his apparatus has found way into so many places, that perhaps it needs no detailed description" (p. 590). Such was the belief in the proven productivity of laboratories and the equipment used to perform psychological investigation that the entire purpose of Krohn's (1892) paper was to supply the "Many teachers who contemplate starting laboratories in this country [the United States] . . . information [regarding German laboratories] as to the apparatus used, its cost and value" (p. 585). Wundt, in 1893, published a "Note on Psychological Instruments" for the purpose of informing his readers where to obtain the necessary equipment to conduct experimental psychological investigations and perform demonstrations, as he had been receiving numerous letters requesting the information (Gundlach, 2007, p. 197).

The use of highly specialized equipment in the laboratory also facilitated the figurative sense in which mechanization took place. As suggested by Gundlach (2007), instruments used in psychological research helped catalyze the emergence of an independent discipline of experimental psychology. This aspect of Wundt's laboratory reflects what Daston (1992), Daston and Galison (1992, 2007), and Sturm and Ash (2005) have described: scientific respectability, in this case for psychology, obtained through the objectivity derived from its instruments of investigation. The nature of the instruments used in early experimental psychology created certain useful appearances. Importantly, they created the appearance of psychological communities of experts that were distinct from others. The instruments were often "expensive, beautiful and even mysterious, and they turn[ed] those who control[led] them into experts" (Sturm & Ash, 2005, pp. 3-4).

The importance of Wundt's laboratory to the science of experimental psychology cannot be overstated. Harper (1950) identified Wundt's laboratory as having produced more psychological research than any other laboratory in the 19th century (see p. 161). However, it appears that the true impact of Wundt's laboratory lay in its ability to supply the standard model of what a psychological laboratory should be. This standardized form would help carry experimental psychology forward into the 20th century. It would help establish the organizing principles that would provide evidence, both within and without the discipline, that psychology was an experimental science.

Wundt's Methods for Psychology as a Legitimizing and Standardizing Force

Wundt's laboratory and experimental methods played a significant role for experimental psychology as a form of rhetoric. Experimental psychology, or a psychology based upon quantified measures, was acceptable because it appealed to the typical sentiments to be found in German investigators of the 19th century. It also appealed increasingly to a Euro-American science community that looked to the ideals of precision and objectivity as the mark of science. Benschop and Draaisma (2000) have characterized Wundt's psychological laboratory as having created a "veritable cult" of precision (p. 7). At the heart of this precision measurement and exact experimentation lay an aspect important to the proliferation of psychology as an experimental science: the production of standards for psychology (Benschop & Draaisma, 2000).

The emphasis that Wundt placed upon the creation of standard measures and methods of experimentation is a reflection of strong social currents within German and other European states at the time. As described by Olesko (1995), the German states during the 19th century were consumed by a need to reform and reinstitute standards relating to weights and measures. This imperative led to the science of metrology, a branch of physics devoted to the precise determination of these standards of quantity. These standards were seen as essential to the means of production, whether of goods, justice or scientific endeavor.

Standards enhanced the qualities of the experimental investigator by underscoring the essentially public nature of the methods used in precise determinations: they were public in the sense that they had to be reported so as to be judged, and if necessary imitated; and they were public in the sense that they were subject to the policing and surveillance of outside agencies responsible for certifying their authenticity. (Olesko, 1995, p. 125)

Benschop and Draaisma (2000) have argued that while Wundt's experimental psychology directly inherited the methods of physiology, it also adopted the physicists' obsessive concern with metrology (see p. 15). As a means of combating the discrepancies of imprecision, Wundt and many other psychologists began publishing the standards obtained in research. These standards took the form of published research and, more importantly, textbooks. When large amounts of printed material began to appear from Wundt's laboratory and on the pages of Philosophische Studien, the imperatives of precise measurement and the availability of standards seemed apparent (Benschop & Draaisma, 2000).

The development and publication of new textbooks of psychology also played a significant role in the establishment of the early experimental discipline. Psychology in Germany had long been a branch of philosophy. While the primary purpose of the German university system in the 19th century was "the training of Germany's educated elite" (Ash, 1980, p. 77), professors were supplied a certain amount of freedom over what a course taught. This freedom, known as Lehrfreiheit, provided the space that an experimental psychology needed in order to appear in the German system (Ash, 1980). Psychology, being under the branch of philosophy, was understood as the scientific basis for pedagogy, and hence many textbooks on the topic of psychology already existed in the early years of the 19th century (see Teo, 2007). With interest in psychology having reached its peak by the 1870s in the German university system (Ash, 1980), a new approach seemed necessary. Psychology's territories needed to change. The experimental approach supplied by Wundt, and others, slowly began to draw experimental psychology out and away from philosophy. With new textbooks came the clear demarcation of psychology's methods and territories. Early experimental psychology was often accused of being a branch of physiology; in his textbook, Wundt (1902/1969) could now publicly defend his view that "Physiological psychology is. . . first of all psychology. . . It is not a province of physiology" (p. 2). The new textbooks could then serve not only to educate, but also to perform functions related to experimental psychology's new desired identity. Textbooks have been shown to operate as a force for the legitimization of a science (Ash, 2005) as well as an effective tool for boundary making (Morawski, 1992).

The new experimental psychology that owed itself to Wundt possessed not only a growing sense of scientific respectability, but also a demonstrable form. Wundt's methods, instruments and structures within the laboratory would serve as a model of experimental psychology that could be standardized so that it might be emulated and replicated. Boring (1950) suggested that Wundt's laboratory

did more than set the fashion for the new psychology; it defined experimental psychology for the time being, because the work of this first laboratory was really the practical demonstration that there could be an experimental psychology and was thus an example of what an experimental psychology would be like in fact. (pp. 339-340)

Wundt's laboratory created the primary standard by which other laboratories would establish and evaluate themselves. Many of Wundt's students would use their time in the Leipzig laboratory to move elsewhere and establish laboratories of their own. Harper (1950), in an assessment of the laboratories in existence at or before 1900, indicates that of the 47 laboratories existing worldwide, 14 were established by students of Wundt (see p. 161). Many more students studied in Wundt's laboratory only for a short time (see Tinker, 1932, pp. 630-631). Nicolas and Ferrand (1991) suggested that the reasoning behind this short-term form of study was to attend "for one or two semesters to study the experimental techniques used and then" bring these methods back "to their former university" (p. 194). Thus a systematic incorporation took place, at universities all over the world, of a model of psychology that in one way or another adopted or expanded the experimental practice provided by Wundt.

Quantified Reasoning and a Standardized Image of Psychology: Concluding Argument

The methods, perspectives and laboratory techniques of Wilhelm Wundt helped organize early psychology around a series of central practices. They also helped psychology side-step questions of whether it was being accurate in its methods of quantification and subsequent mathematization. Fechner had provided the essential insight as to how one might go about actually quantifying aspects of mentality, and Wundt provided the methods and organizational skills to transform psychology into a science. Wundt's efforts provided the initial move towards organizing certain practices of psychological experimentation in such a way that they were recognizable as a discipline. The contents of his laboratory, the ways in which psychological investigations were to be structured (with an emphasis on observation, precision, quantification, and measurement), and the production of journals, posted results, and textbooks, were important markers of professionalization. The social-scientific climate of late 19th-century Germany was also uniquely prepared for this form of psychology to appear and stake its claim.

However, and importantly for the eventual emergence of the NHST controversy, there still remained many objections and considerations with regard to what it was Fechner and Wundt were quantifying. As Michell (1999) has pointed out, Fechner "believed that the exact sciences must use mathematics and that the use of mathematics entails measurement" (p. 87). This relationship exposes the unanswered question of what it was psychologists were measuring, an objection brought to bear against Fechner by many, including Johannes von Kries in 1882.52 However, with experimental psychology moving towards legitimization,51 this issue was pushed to the side. Hornstein (1988) suggested:

The fact that psychophysicists went on with their research despite the ongoing debate about what they were really measuring merely strengthened their position. In other words, the theoretical ambiguities surrounding their work became buried in the profusion of empirical findings that emerged from their laboratories. . . The strategy of splitting theoretical issues from methodological practice was an enormously profitable one, because it made it seem possible to proceed with the work despite the ongoing debates about what it meant. (p. 7)

This division of theoretical and empirical issues has been a lasting legacy and represents one of the first keys to understanding NHST's resiliency in psychology. It is a point to which we will return in chapter 4. As psychology moved into the 20th century, the theoretical concerns that surrounded psychophysics and experimental approaches to psychology would reemerge. The method of introspection, considered by many early psychologists as central to the emergent discipline, came under scientific scrutiny. Problems of subjectivity, and certain ambiguities regarding how to interpret the results of introspection-based experiments, would lead to its eventual abandonment. One of the methods endorsed to replace introspection as the primary tool of psychologists would be correlation. The appearance and rise of a correlational psychology is, according to Danziger (1987), the moment when a shift in psychology took place. The shift marks a turn away from the Wundtian model and towards Danziger's (1987) Galtonian model of psychology. The success of early experimental psychology and its methods would give way to the study of distributions and statistics. Correlational psychology would also attempt, unsuccessfully, to articulate the lasting problem of the quantity objection.

51 This point in psychology's history may reflect what Christopher D. Green (2004) has identified as Foucault's stage of epistemologization. This is where an emerging science is capable of transcending the social practices that first established its course and conduct (Green, 2004).

52 For a more complete discussion see Michell (1999), pp. 88-89.

Chapter 3

The Fall of Introspection and Rise of Correlation

Experimental psychology emerged in the late 19th century, recognizable as a discipline distinct from philosophy. While I have discussed the birth of the discipline in Wilhelm Wundt's first laboratory and the Leipzig school in the previous chapter, there were a number of experimental traditions that grew up in various other locations. These traditions, which developed around the same time53 (the Paris or French tradition, the Würzburg school, and the American or Clark school), all kept apace of the Leipzig school and can be differentiated from each other by examining their methodology (Danziger, 1980, 1985). With regard to what is at stake in this historical investigation, the confusions and disagreements that began to swirl around introspection created the necessary space for a methodological shift. As psychology moved into the 20th century, certain disagreements as to the nature and use of introspection led to controversy. In particular, the British and German interpretations of the role and place of introspection in psychology led to confusion and the near abandonment of the method.

E. W. Scripture in 1894 itemized a number of issues in American psychology in an effort to encourage awareness of them. He sought to promote within American psychology the ideal of accuracy. According to Scripture, American psychology was woefully inadequate in its attempts to be an accurate science. Scripture (1894) wrote,

There are periods in the life of a science when it becomes necessary to take a decided stand against the tendencies prevailing at the place and time, when the battle of the moment is not against avowed enemies, but against the very ones that apparently support the cause. The danger that threatens us comes directly from the psychological laboratories. . . The struggle for ever increasing accuracy is the vital principle of all the sciences. No astronomer, physicist or biologist would for an instant hesitate to declare that his work aims at the employment of ever more careful methods. How different is the case in psychology! We frequently hear it stated that psychological experiments and measurements can never be exact or trustworthy; at best they can only give an inkling of the facts of the case. (p. 427)

The issues that Scripture identified as most salient in the experimental methods of psychology were all connected to a culture of inaccuracy. He proposed four specific sources of error: "Errors of the apparatus," "Errors of surroundings," "Errors due to poor powers of introspection" and "Errors of statement" (Scripture, 1894, pp. 428-430). While all four sources of error represented a problem for psychology to address according to Scripture, our focus will only be upon the introspective method.

The purpose of this chapter is to examine the confusions surrounding, and the impact of, the perceived problems of the introspective method at the turn of the 20th century. The chapter examines introspection from two views. First, the confusions, and the way that introspection came to be regarded as a single approach, are discussed. The inaccuracies regarding what was meant by introspection, combined with the occurrence of different traditions, created confusion and disagreement regarding its use. These confusions led to the eventual grouping of various introspective methods into a single false entity called "introspectionism," based upon the classical view. Second, the problems that psychologists identified with introspectionism are examined. The decline of introspective investigation will demonstrate that as psychology emerged at the turn of the century, the broadening of the theoretical boundaries that psychology occupied made room for a New Psychology to appear. The New Psychology would permit the method of correlation, which was being used and generated outside of psychology, to acquire a significant role in the emerging science. This chapter will further evaluate how psychologists came to be interested in the statistical conception of general intelligence.

53 I am referring to the historical period of 1879 through 1914 as demonstrated by Danziger (1980, 1985).

This interest would lead psychology away from its focus on statistics as a method for the treatment of errors within data, and towards statistics as a means for discovering underlying phenomena. This led to a change in the way theories were constructed. The influence of Ernst Mach and the idea of functional relations shifted psychology's approach, moving statistics into a primary role within the domains of method and analysis. The chapter will close with a brief discussion of one of the practical problems that correlational techniques created for psychologists, namely the recurrent issue of what it was, and is, that psychologists are actually measuring when correlating certain performance measures.

The Confusions of Introspection

The tradition of experimental psychology established by Wundt in the late 19th century had a definite method. The method of introspection that Wundt employed, innere Wahrnehmung, meaning "internal perception" (Danziger, 1980, p. 244), had at its heart the idea of strictly controlled self-observation in relation to experiment (see Wundt, 1902/1969, p. 4). This stood in contrast to Selbstbeobachtung, the German term for introspection that simply meant "self observation" and was not bound to experimental control (Danziger, 1980, p. 244). Wundt held that the experimental method combined with innere Wahrnehmung was of "cardinal importance" (Wundt, 1902/1969, p. 4) to anyone endeavoring upon the conduct of psychophysical experimentation or measurement.

The American psychologist William James also believed in the necessity of introspection. However, James' use of introspection was more in keeping with what Wundt had rejected, Selbstbeobachtung. James wrote, "Introspective Observation is what we have to rely on first and foremost and always. . . I regard this belief as the most fundamental of all the postulates of Psychology" (James, 1890, p. 185). These two rising schools of psychology defined themselves through introspection, but in very different ways. The major distinctions between their methods of introspection, as identified by Danziger (1980), follow from the different perspectives to be found in the British54 introspective tradition and the German.55 While both methods undoubtedly turned attention to the contents of the conscious mind, the ways that this was considered possible and reliable were distinctly different. The duplicity of the meanings of introspection within the different traditions helped foster general inaccuracies regarding the method, its applications and its uses.

Wundt's awareness of the specific criticisms and problems associated with Selbstbeobachtung led to his adoption of Fechner's methods, and to his prescription to avoid the alternatives. Wundt identified the primary problem inherent in the nature of Selbstbeobachtung. Wundt (1902/1969) wrote,

For while in natural science it is possible, under favourable conditions, to make an accurate observation without recourse to experiment, there is no such possibility in psychology. It is only with grave reservations that what is called 'pure self-observation' [Selbstbeobachtung] can properly be termed observation at all, and under no circumstances can it lay claim to accuracy. (p. 4)

Wundt accepted what appeared to be the subjective nature of unguided self-observation. He drew upon Kant's description of the interacting elements of directed attention upon the conscious psychical processes and suggested that "the very intention to observe, which is a necessary condition of all exact investigation, modifies essentially the rise and progress of psychical processes" (Wundt, 1902/1969, p. 24). It was for this reason he wrote that "the experimental method is of cardinal importance; it and it alone makes a scientific introspection possible" (Wundt, 1902/1969, p. 4). Introspection as self-observation (Selbstbeobachtung) was not to be trusted; only inner perception (innere Wahrnehmung), embedded within the strictly controlled contexts of experimentation where replicable results could be provided, was reliable (Danziger, 1980). Thus, through experimentation, internal perceptions could be manipulated in such a way as to approximate external perceptions (Danziger, 1980). Classical introspection was of no use to psychology, according to Wundt, should it be used outside of the controlled conditions of experimentation.

54 The British tradition had been reflected in James' (1890) use and description of introspection. This tradition was rooted in the philosophies of Locke, Hume and both the Mills, amongst others (Boring, 1953). The British tradition of introspection reflected the classical view. Classical British introspection proposed that one could examine the contents of the mind in simple or complex forms and maintain validity through its application at the correct moments and through intensive training. While Boring (1953) has contended that Wundt's view of introspection is "classical" introspection, Danziger (1980) has demonstrated that the "classical" should be understood as existing before Wundt, and as being involved with the nature of, and arguments surrounding, philosophy's use of the term rather than experimental psychology's. The examination of the mental contents of one's mind could occur in retrospective examination, when the contents were fresh in memory and available to the introspecting psychologist (James, 1890, p. 189). Once fixed upon early in memory, introspection could make the contents of the mind its "prey" (James, 1890, p. 189). This view of introspection provided a wide range of applications and aspects of mental life to be considered as possible for study.

55 The British trust in introspection stood in sharp contrast to the German tradition, which had taken a more skeptical approach. This more skeptical approach had grown up around the philosophies of Auguste Comte, Friedrich Albert Lange and, more importantly, Kant's perspective on introspection. Kant had claimed that it was only the experiential world that was available to the mind. Kant (1787/2003) wrote: "We admit that we know objects only in so far as we are externally affected. . . . as regards the inner sense. . . by means of it we intuit ourselves only as we are inwardly affected by ourselves; in other words, that so far as inner intuition is concerned, we know our own subject only as appearance, not as it is itself" (p. 168). According to Kant, introspection or self observation could not provide any meaningful information apart from subjective interpretations, as it had no objective reality. Hence, the foundation which Fechner gave introspection was based upon the notion that the "feeling measure" (Fechner, 1851/1987, p. 203), and not mental contents and structures, was readily available to strict experimentation and mathematical treatment. This was Fechner's method of providing reliable results by mediating the process of introspection through mathematics and perceived quantities (Fechner, 1860/1966, p. 63).

Introspection, in one form or another, was critically linked to experimental psychology. While the use of introspection was being promoted as a defining method of psychology, as with James' declaration of introspection as "first and foremost and always," a great amount of confusion surrounding the term remained. Ladd (1894), for example, claimed that "no method can be developed in psychology which will enable us to dispense with introspection. . . [is] too obvious to require discussion" (p. 6). Ladd (1894) argued that "scientific psychology is the science of the phenomena of consciousness, as such. And no interpretation of consciousness is possible in any terms whatever without self-consciousness [introspection]" (p. 6). Ladd (1894), like Wundt, saw introspection as indistinguishable from experimentation. However, his use of introspection better reflected James' usage and the British tradition.

Titchener adopted the experimental approach to introspection associated with Wundt, but he investigated certain things that Wundt felt were inappropriate for his introspective approach (Danziger, 1980). Titchener (1910) promoted "systematic introspection"56 and felt that the introspective method was aptly suited to examine certain functions, like memory, that Wundt's strict experimental approach expressly denied. However, his associations with Wundt through the time that he had spent at Leipzig, combined with an intentional representation of himself as a faithful follower of Wundt (see Fancher, 1996, p. 172), often led American psychologists to confuse his system of psychology with Wundt's, as in the case of his student E. G. Boring (see Boring, 1950, pp. 410-414).

Systematic introspection was also employed in its most "advanced" form by the Würzburg School. However, one of Wundt's former students, Oswald Külpe, would take introspection beyond even the bounded conditions of systematic introspection and bring the Würzburg School, Wundt and Titchener into opposition (Boring, 1953, p. 171). The more liberal approach to introspection, combined with the conflation of its meanings into a single general term, created a suitable enemy for the opponents of the method.

It appears that at the time of the publication of the fifth edition of his Grundzüge der physiologischen Psychologie in 1902, Wundt had been aware of certain currents of thought that had confounded experimental introspection with its non-experimental brethren. Wundt wrote in both attack upon self-observation and defense of inner perception. He regarded his experimental method as defensible, and attempted to emphasize the confusion regarding the distinctions between the experimentally dependent form that he advocated and the unbounded classical conception. He wrote,

. . . the objection urged against experimental psychology, that it seeks to do away with introspection. . . is based upon a misunderstanding. The only form of introspection which experimental psychology seeks to banish from the science is that professing self-observation which thinks it can arrive directly, without further assistance, at an exact characterization of mental facts, and which is therefore inevitably exposed to the grossest of self-deception. (Wundt, 1902/1969, p. 7)

According to Wundt, the ability to obtain accurate measurements and observations was difficult, but not impossible. Unfortunately, many misunderstood the distinction that Wundt had drawn. Introspection was not to be Wundt's alone, and many like Sully (1881), Bain (1893) and Titchener (1899) reported full-scale analyses of the method, each strongly supporting its author's respective interpretation.

56 Titchener's use of introspection was labeled systematic introspection; it sought to retrospectively examine the contents of consciousness and reduce them to the basic elements of the sensory processes (Danziger, 1980). Wundt's psychology was not atomistic or elemental. He opposed Herbart's psychology on this matter, and is better understood as being more holistic in his approach (Blumenthal, 1975, p. 1084).

The Problems of Introspection

In 1912, Knight Dunlap, a colleague of John B. Watson's at Johns Hopkins University, published a withering attack on English approaches to introspection. Dunlap acknowledged the differences between the German and English traditions, but for the purpose of his attack focused upon the English approaches of William James and G. F. Stout, "because these two have made the attempt to work out a system in which 'introspection' is not only admitted, but is really provided for" (Dunlap, 1912, pp. 405-406). Dunlap, who shared many interests with Watson (Dorcus, 1950), felt that introspection should no longer be a part of the vocabulary of psychological science. According to Dunlap, the logic of introspection in both James' and Stout's systems was critically flawed. Dunlap (1912) suggested that for this reason:

most psychologists who use the term 'introspection' and define it as the observation of consciousness not only do not seek to apply it in strict accordance with the definition, but they even apply it to the whole range of psychological observation. . . This practice constitutes effectively the reductio ad absurdum of the 'introspection' theory. Starting as a distinctive kind of observation, the observation of an observation of something, it finishes as the only kind of observation. In other words, there would seem to be really nothing to observe except the observation of something else! (p. 412)

According to Dunlap, the inconsistencies and absurdities of introspection were unsalvageable. He suggested it was "probably better to banish it for the present from psychological usage" (Dunlap, 1912, p. 412). The attacks on introspection continued, and psychologists like Raymond Dodge (1912) found it surprising that experimental psychology was attached to introspection in a "jealous" manner and with an "insistent confidence" that "approaches dogmatism" (p. 214). B. H. Bode (1913) wrote, "That almost all the important results of psychology should be based upon a method which is unclear in its nature and aim is an intolerable state of affairs" (p. 85). A certain sense of dissatisfaction within the psychological community had developed, and reformation or change in experimental methods was being called for.

By the early years of the 20th century, confusions regarding the different traditions of introspection had helped create a straw man to knock down. The different traditions of introspection, including the German, were uncritically blurred together, and "introspectionism" was created because the "new schools [of psychology] needed a clear and stable contrasting background against which to exhibit their novel features" (see Boring, 1953, p. 172). As Danziger (1980) has reported, from 1903 through 1913 the more liberal uses of introspection "flourished," and the stricter experimental approach began to recede (see p. 255). As the more liberal forms of introspection took dominance from Wundt's strict experimentalism, the criticisms regarding classical introspection would return and play a role in the effective abandonment of introspection in psychology.

A Changing Discipline

The debate that swirled around the use of introspection as psychology moved into the 20th century would eventually see the method abandoned. Psychologists were discovering new areas of human mental life to examine, and not all of these were accessible to introspection. New and existing approaches that did not rely upon or require introspection for investigative purposes quickly became more prominent. This fostered a certain amount of conflict within the young discipline over proposed methods. The dissension amongst the experimental psychologists of the late 1890s and early 1900s over introspective techniques helped strengthen other approaches to psychological investigation.

In 1898 W. McDougall urged psychologists to reconsider their definitions of what constituted mental measurement and analysis. What seemed apparent to McDougall (1898) was that there existed certain mental processes that occurred outside of conscious perception. The problems of neurosis and psychosis embodied these types of evidence, yet were inaccessible via the existing introspective methods. McDougall (1898) declared,

the fact that it is not only the clear, vivid affections of consciousness with which the science [psychology] has to deal, but that it must take account of other processes also, processes less easily recognisable by direct introspection, and in fact usually only discoverable indirectly by inference. (p. 15)

McDougall's argument reflected growing concerns amongst many new and interested investigators who suggested that psychology needed to reconsider its definitions regarding mental processes. Defining mental processes as only those capable of study under the method of introspection was unacceptably limiting.

S. E. Sharp in 1899 published an article in the American Journal of Psychology that had at its heart two particular issues. First, she sought to introduce other psychologists to what she believed constituted "Individual Psychology" at the time; second, she sought to introduce the methods that it employed (see Sharp, 1899, pp. 329-330). Sharp identified three distinct forms of Individual Psychology as she found them at the time: a French tradition begun by Alfred Binet,57 a German tradition owing itself to Emil Kraepelin, and a less organized group of psychologists in America conducting research in similar areas who were more closely linked to the German tradition (see Sharp, 1899, pp. 329-330). To this area of psychology, which Sharp (1899) acknowledged as "of but recent date" (p. 329), new methods were being applied to capture and compare higher mental processes that were outside the scope of Wundt's experimental introspection, with its emphasis on basic mental processes. Individual psychology, according to Sharp's (1899) reading of Binet,58 attempted to shift the focus of experimental psychology away from the general properties of psychical processes and towards the study of how those processes vary from one individual to another (see p. 330). Binet, much like McDougall, had come to the conclusion that a large variety of mental processes occurred in the unconscious mind. Binet's studies of personality led him to discard theories of associationism and eventually pursue experimental methods of measuring intelligence (Varon, 1936).

In 1901 G. Stanley Hall published an article in Harper's Monthly magazine entitled "The New Psychology." The new psychology, according to Hall, had moved from touch, the "mother-sense" (p. 727) of all other senses, into new areas by way of scientific advance. The areas that Hall (1901) advocated as the legitimate domains of the "New Psychology" were not limited, but "include[d] almost every kind of psychic activity" (p. 730). Hall suggested that imagination, sentiment, volition, animal and human intelligence, and importantly education, had become ready to be enriched by advancements in the field of experimental psychology. This new psychology was also endorsed by James McKeen Cattell at the St. Louis World's Fair in 1904. Cattell suggested that the most valuable experimental work produced to date was independent of introspection (Angell, 1905). He furthermore suggested that the "quantitative" and "genetic" methods of the natural sciences were actually now the methods of psychology (Angell, 1905, p. 537).

In the United States the breakdown of introspection as one of the defining methods upon which experimental psychology stood encouraged the rise of behaviourism, which spelled the end of mainstream psychology's use of introspection there. In 1913, John Watson published his behaviourist manifesto.59 In this article he clearly delineated the goals of a new science of psychology, behaviourism, that would be identifiable by its difference from the older psychology and its closer kinship to natural science. Watson (1913) claimed,

Psychology as the behaviorist views it is a purely objective experimental branch of natural science. . . Introspection forms no essential part of its methods, nor is the scientific value of its data dependent upon the readiness with which they lend themselves to interpretation in terms of consciousness. (p. 248)

57 Binet is not discussed in full here because his methods, as they made their way into psychology, were influential more for their suggestive character than for any statistical contribution (see Bell, 1916).

58 Sharp cites Binet and Henri from their 1895 paper: Binet & Henri (1895). La psychologie individuelle. L'Année Psychologique, 2, p. 411.

While Watson attempted to redirect the future of American psychology, the older traditions of psychology continued to exist in greatly diminished capacity (see Boring, 1950, p. 645). Furthermore, aspects not dissimilar to introspection found their way into behaviourism, most notably the use of self-report.

At the end of the 19th century and in the early 20th, American psychology began to move into the realm of applied forms; it ventured into military, economic and, importantly, educational interests. As psychology moved into the new century, interested psychologists like Cattell were not alone. Joseph Jastrow had begun a research program looking at the intersection of mental testing and education,60 and many other psychologists, like G. Stanley Hall,61 John Dewey,62 and E. L. Thorndike,63 would elevate the position of psychology with respect to education. Hall and the group of psychologists with whom he worked at Clark University (the Clark School) had found new ways of shifting the focus of psychological investigation away from processes and towards performance criteria. The Clark School represented a distinct approach to psychological investigation (Danziger, 1985) and in many respects formed a basis for the Galtonian approach. The studies performed by the Clark School often employed very large samples of participants, were interested in the distribution of scores or performance criteria, and demonstrated a firm and effective grasp of statistical techniques (Danziger, 1985). Much of the reasoning behind their arguments for psychology's growing role in education and applied forms depended upon a radical reformulation of their objects of study. Psychologists reformulated the role of individuals within experimentation. Persons were examined as understood by performance measures,64 and revealed through averages, medians and variation. The objects of psychological inquiry could be found by looking within data sets. Psychology slowly began to become a statistically centered science.

59 Watson's arguments of 1913, while often portrayed as historically momentous, have been described by Danziger (1980) as contributing little that was novel to the arguments against introspection.

60 As an example of this, in 1901, William Chandler Bagley, under the supervision of Jastrow, published an article detailing the correlations believed to exist between quickness of motor response and mental ability in school children (see Bagley, 1901).

61 Hall foresaw the usefulness psychology would have to education (see Hall, 1885).

62 Dewey was an educational reformer who helped guide the future of curriculum construction in America (see Caswell, 1949).

63 Thorndike felt psychology would guide, inform and create a better understanding of what was both attainable and expectable in education (see Thorndike, 1910).

64 Like test scores, or physical measures.

Correlation and Intelligence: The Rise of Galtonian Psychology

As discussed in chapter 1, Francis Galton had been greatly impressed by the work of Adolphe Quetelet. Quetelet's ideas surrounding the interpretation of statistical regularities in various vital and social statistics engendered in Galton a particular viewpoint of what was implied by statistical measures. With regard to Galton's interpretation of Quetelet's reports on human differences, Galton (1892/1962) wrote,

Its application [the frequency of error, or Gauss' method of average error] had been extended by Quételet to the proportions of the human body, on the grounds that the differences, say in stature, between men of the same race might theoretically be treated as if they were Errors made by Nature in her attempt to mould individual men of the same race according to the same ideal pattern. Fantastic as such a notion may appear to be when it is expressed in these bare terms, without the accompaniment of a full explanation, it can be shown to rest on a perfectly just basis. (p. 28)

Galton had been so greatly impressed by Quetelet's application of statistics, and by the appeal of his l'homme moyen, that he himself appears to have adopted a similar view of the informative function of statistical distributions. In 1879 Galton proposed a method of superimposing images or photographs from a family of people, or even of a lineage of dogs, in order to see the variations of the persons or animals as a window into their hereditary characteristics. These images would not display "a living person"; the image would be "the portrait of a type and not an individual" (see Galton, 1879, p. 133). Galton's use of the word "type" here reflects a common use that referred to the "ideal," or something akin to Quetelet's use of "ideal pattern." Galton came to believe, through his interaction with Quetelet's statistics, in the concept that there might be a "type" of individual, and that variation represented a measure of departure from the "typical."65 Using this notion of the nature and application of heredity and statistics respectively, Galton was one of the early quasi-psychologists to have been influenced by the use and consumption of statistical information. Galton was inspired to promote the statistical approach to the study of human characteristics in other areas of scientific inquiry. This was one of the roles he would play in experimental psychology.

Galton's Hereditary Genius had first appeared in 1869. Galton's discussion of the hereditary qualities of the concept of genius was novel and critical to psychology's late-century departure from introspection. While Galton at the time did not set about to assess the correlation of intelligence measures, his work set the stage upon which the idea of mental abilities and their possible measurement would find a home in psychology. Galton (1892/1962) used the idea that he could demonstrate how "reputation may be accepted as a test of ability" (p. 49). Reputation as a criterion represented an objective and external measure of intelligence. Investigators could use performance criteria that did not entirely exist within the inner sense to evaluate individuals' intelligence. The insight that Galton had obtained was that there existed relationships between performance evaluations and what we would call intelligence. Later, Galton's ideas of the co-occurrence of certain physical traits would be the inspiration for what would soon be his determination of the correlation.

65 Galton refers to "type" in Hereditary Genius usually in reference to the stable model upon which a particular race or quality might rest, "the type [emphasis added] being supposed constant" (see Galton, 1896/1962, p. 423). It seems this was reinforced during Galton's investigation of the nature and transmission of genius.
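Stated in modern notation (an illustrative reconstruction, not Quetelet's or Galton's own formalism), the reasoning in the passage above treats each individual's stature $X$ as an error-laden observation of a single ideal "type" $\mu$:

\[
X = \mu + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^{2}),
\]

so that the population distribution of stature should itself follow the normal error curve, with the mean locating the "type" and the dispersion $\sigma$ measuring Nature's "errors" of manufacture about it.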

While these ideas did not immediately bring Galton into contact with psychology, near the turn of the 20th century they did. One of the earliest published attempts to statistically measure mental abilities, and to promote the idea of statistically amenable mental tests, occurred in 1890 with James McKeen Cattell. Cattell (1890) suggested that:

Psychology cannot attain the certainty and exactness of the physical sciences, unless it rests on a foundation of experiment and measurement. A step in this direction could be made by applying a series of mental tests and measurements to a large number of individuals. The results would be of considerable scientific value in discovering the constancy of mental processes, their interdependence, and their variation under different circumstances. (p. 373)

Cattell's vision of the usefulness of Galton's statistics mirrored the methods and interests of many psychologists engaged in what was becoming Individual Psychology. In a co-authored paper, Cattell and Farrand (1896) acknowledged the encouragements that Cattell had received from the likes of Galton, Kraepelin, and Binet in promoting the methods and interests of Individual Psychology (see p. 618). Cattell (1893) began to advocate the use of statistics in psychology, suggesting that "There seems to be scientific value in the collection of statistics concerning the inheritance of mental traits, and in other directions" (p. 322). He felt that there might be great use in statistics should they prove capable of identifying mental traits within certain classes of individuals and of determining the possible correlation of thoughts with the time required for them to gain an association. These aspects represented the possible basis for a science of mental mechanics (see Cattell, 1893, p. 322).

In Cattell's 1890 declaration of a systematic treatment of mental measurement, a letter containing some remarks by Galton was included at the end. Galton was quite aware of the community of interested psychologists who were to read the article printed in Mind. He encouraged psychologists to take up a program of study, not unlike his own at the Anthropometric Laboratory, designed to study the mental characteristics of persons so as "to obtain a general knowledge of the capacities of a man" (see Cattell & Galton, 1890, p. 381).

While Galton's influence had always been strong, it is somewhat surprising to find that fourteen years later Charles Spearman (1904a) would be led to declare that "Psychologists, with scarcely an exception, never seem to have become acquainted with the brilliant work being carried on since 1886 by the Galton-Pearson school" (p. 96) and that in the area of psychology "correlation has in general met with great neglect" (see Spearman, 1904b, p. 206). While there were some, like Clark Wissler, Hall, Thorndike and Cattell, whose psychological interests had led them towards the use and application of statistics and correlation in their researches upon the mind, they were doing so in a crude and ineffective manner (see Spearman, 1904b). Spearman, who was known to possess a high degree of mathematical talent (see Holzinger, 1945), decided to make every effort towards putting psychology on a path to becoming a correlational science.

In 1904 Spearman published two very important papers regarding the nature of correlation66 and the measurement of mental abilities. In the first of the two papers, The Proof and Measurement of Association between Two Things, Spearman (1904a) developed a method whereby the usual requirement, that measures of any kind used in the calculation of correlation be either interval or ratio, could be avoided.67 While this was important in its utility to a science like psychology, whose measures have often been of a less than material kind, it was Spearman's second paper, appearing only four months later, that would come to occupy the attention of many psychologists. Spearman (1904b) titled the second paper "General Intelligence," Objectively Determined and Measured; it was to be his manifesto for a new statistical psychology. Spearman (1904b) wrote,

The present article, therefore, advocates a "Correlational Psychology," for the purpose of positively determining all psychical tendencies, and in particular those which connect together the so-called "mental tests" with psychical activities of greater generality and interest. (p. 205)

Spearman (1904b) analyzed the nature and history of the problems surrounding "mental correlation" (p. 206) and concluded that what was lacking in the area was a useful method of determining what intelligence was and how it should appear. Some researchers had attempted to assess the correlation between certain methods of testing and a criterion of intelligence, but Spearman found fatal flaws in each and every attempt (see pp. 206-219).

66 While some have suggested that Spearman developed an understanding and method of correlation independent of Galton (see Cattell, 1945, p. 86), what remains undoubted is that in one manner or another Galton and Karl Pearson played a significant role in influencing Spearman's notion that mental abilities could be tested (see Spearman, 1904a; Holzinger, 1945; Burt, 1946; Cattell, 1945).

67 See Spearman's discussion of the method of ranks (Spearman, 1904a).
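The arithmetic of the rank method mentioned in note 67 can be sketched briefly (a minimal illustration with invented scores and a simplified, no-ties computation; Spearman's own procedures, including his correction formulas, were more involved):

```python
# Sketch of rank correlation in the spirit of Spearman's method of ranks.
# Raw scores are first converted to ranks, so only ordinal information
# is used; the coefficient is then
#     rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)),
# where d is the difference between an individual's two ranks.
# Invented data; assumes no tied scores.

def to_ranks(scores):
    """Rank each score (1 = highest), returned in the original order."""
    ordered = sorted(scores, reverse=True)
    return [ordered.index(s) + 1 for s in scores]

def spearman_rho(xs, ys):
    rx, ry = to_ranks(xs), to_ranks(ys)
    n = len(xs)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Six hypothetical pupils' marks in classics and in pitch discrimination,
# echoing the kinds of measures Spearman compared.
classics = [90, 75, 86, 68, 53, 70]
pitch = [45, 32, 50, 30, 25, 28]
print(spearman_rho(classics, pitch))  # ~0.89
```

Because only the ordering of scores enters the computation, any monotone rescaling of the raw marks leaves rho unchanged, which is how the method sidesteps the demand that psychological measures be interval or ratio in kind.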

The goal of Spearman's (1904b) paper was to demonstrate the basis of his new statistical technique, commonly recognized today as factor analysis, and to demonstrate the results of that technique. Correlations of various measures of ability interrelated, giving rise to a general factor common to all the tests and a specific factor unique to the type of test in question. Spearman (1904b) wrote:

The above and other analogous observed facts indicate that all branches of intellectual activity have in common one fundamental function (or group of functions), whereas the remaining or specific elements of the activity seem in every case to be wholly different from that in all the others. (p. 284)

Spearman was proposing not only that it was possible to objectively determine intelligence, but that intelligence was made visible in the latent variables (multiple correlations) discovered to operate between performance estimates on certain tasks and tests. This change in focus, for which Spearman both pleaded and provided evidence, was an interesting one. The transition Spearman requested, towards psychology becoming a correlational discipline, altered the theoretical approach one takes to relationships between events. The nature of the correlational measure of intelligence outlined in Spearman's paper on "General Intelligence" is such that scientific investigation moved from observations up to the level of theory. In order for theories to be developed using the correlational approach, testing must precede formal statements. Observations would reveal the functional relationships between things. A large part of the success of this tradition owes to Ernst Mach and his philosophy of science, in which he had developed the idea of functional relations, rather than causes, to explain phenomena (Winston, 2001). Mach's concept of function was consonant with the theoretical prescriptions of correlation and the Galton-Pearson school. Pearson, himself a follower of Mach, had conceived of causes as being better understood as contingencies (Pearson, 1900b). Contingencies were revealed by correlation, and their strength of association was measured by the magnitude of the correlation. Certain psychologists like E. B. Titchener, Hermann Ebbinghaus, Oswald Külpe and Titchener's student E. G. Boring found Mach very influential (see Winston, 2001, pp. 117-119). Mach's philosophies of function, independent variables, and experimentation would contribute to the steady rise of correlation in psychology. For psychology the use of the correlational method provided an avenue to the determination of measures of association between tests of certain abilities. These associations then gave rise to the possible explanation of a functional dependence upon intelligence. However, factor analysis and functional relationships also led to the problem of consensus regarding the definition of what intelligence was. What was it the tests were measuring? Spearman's factor analytic method had provided some amount of evidence to back the idea of a general factor of intelligence, but it had not and could not provide an explanation of what exactly intelligence was.
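
The logic behind the general factor can be sketched in a few lines. The following is a minimal modern illustration of the idea, not Spearman's actual 1904 computations: if every test j measures a common factor g with loading g_j plus a wholly specific factor, then the correlation between tests j and k is the product g_j * g_k, and the loadings can be recovered from triads of observed correlations. The correlations below are hypothetical, constructed to fit a single general factor exactly.

import math

# Hypothetical correlations among three mental tests, consistent with
# one general factor having loadings 0.9, 0.8 and 0.7.
r12, r13, r23 = 0.72, 0.63, 0.56

# Under the one-factor model r_jk = g_j * g_k, so each loading follows
# from a triad of correlations:
g1 = math.sqrt(r12 * r13 / r23)
g2 = math.sqrt(r12 * r23 / r13)
g3 = math.sqrt(r13 * r23 / r12)
print(g1, g2, g3)  # approximately 0.9, 0.8, 0.7: each test's saturation with g

Real data fit such a model only approximately, and nothing in the arithmetic says what g is; that interpretive gap is the subject of the discussion that follows.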

Understanding what intelligence was remained a metaphysical problem that seemed to have no solution. Furthermore, at the time these discussions of intelligence were taking place, little knowledge existed as to what the actual distribution of intelligence would look like in the general population. While still a relatively new concept, discussing what psychological unit might support a quantified definition of intelligence seemed pointless to some. This point was expressed by Boring (1920a) when he wrote:

The great difficulty is, as we have just pointed out, to find anything that we may properly call a psychological unit. . . . For example, with intelligence, which is the mental capacity most often measured, we have seen that, not only is there no attempt to make equal increments on an intelligence scale correspond to equal increments of intelligence, but that the concept of intelligence is so vague that any such accurate quantitative relationship is in practice almost meaningless. (p. 31)

It appears that the resolution of this problem was to be articulated by Boring himself, not long after he expressed his initial criticisms. With respect to intelligence testing, Boring (1923/1961) wrote:

They mean in the first place that intelligence as a measurable capacity must at the start be defined as the capacity to do well in an intelligence test. Intelligence is what the tests test. . . . It would be better if the psychologists could have used some other and more technical term . . . however . . . no harm need result if we but remember that measurable intelligence is simply what the tests of intelligence test. (p. 210)

This argument is not nonsense. Boring provides an operational definition rather than a formal definition. Boring proposed that, given a dependent relationship between two qualities, here test performance and intelligence, changes in one should reflect changes in the other in an ordered fashion. Consensus regarding intelligence could possibly be achieved by adopting this operational definition, rather than having to deal with the metaphysical problems surrounding quantified intelligence. American psychologists believed they had already made an excellent demonstration of the utility of this approach. The usefulness of the applied forms of mental testing, as in the case of the Army Alpha and Beta testing (see Carson, 1993), had demonstrated the practical existence of the concepts they attempted to measure. Hall (1919) had put it this way:

The efficiency of war in this country has been incalculably increased by what might be called the vocational guidance work that has been done by scores of expert psychologists. . . . America's war work here will long be not only an object lesson but an inspiration for the application of the same principles of fitting every man to his job in all the great lines of business. (p. 80)

According to Hall (1919), the ability to measure intelligence, attitudes, and aptitudes was putting psychology on the path to being a reliable applied science whose expert knowledge was going to reshape the lives of all Americans.

The Fall of Introspection and Rise of Correlation: Concluding Argument

With the work of Galton and Spearman, statistics had established itself as a reliable and valid method for the investigation of psychological subject matter. The new methods of correlation and factor analysis managed not only to permit the study of new phenomena, they provided evidence of their own validity. However, the underlying issues relating to the questions of what indeed was being quantified were passed over as too metaphysical and unimportant. In part, this critical aspect created the necessary issue that helped prepare the way for ANOVA and NHST. While psychologists had developed trust in their correlations and various statistical measures, nothing in these numbers could provide consensus as to what the significance of these numbers was. As psychology in the 1930s moved towards inferential models and methods of theory building based on the application of Fisher's ANOVA, this question became increasingly important. The following chapter will seek to examine the conditions within psychology's experimental methods that led to its adoption of Fisher's methods. Fisher's ANOVA technique, and NHST in general, appear to have become entrenched as a result of certain functions they came to serve. Psychologists' experimental practices were still without criteria that could aid in the process of developing a unified approach to experimentation. This left the problems of how to interpret experimental results, and how to achieve consensus amongst psychologists regarding their meanings, to the individual. NHST appears to have become the dominant approach in psychology because it was thought to be able to resolve this issue.

Chapter 4
Controlled Experimentation, NHST and the Current Issue of Method

The progression of experimental methods in psychology appears to have been rather disjointed. From Fechner's early experiments to a standard image based on Wundt's laboratory, the science of psychology rapidly changed its approaches. Dissatisfied psychologists raised concerns over the apparent subjectivities and inaccuracies of introspective investigation. This friction helped solidify an already broad scope of what was to be considered valid areas of psychological study. Behaviourism and individual, comparative and correlational psychology, amongst others, all made an appearance in the early years of the 20th century. Introspective investigation gave way to other methods of investigation, including the correlational approach. Correlational psychology carried with it the experimental approaches of the Galton-Pearson school and was based in part upon the philosophy of Ernst Mach. Mach's concept of function, rather than causation, helped resolve certain complications regarding experimentation that had plagued early experimental psychology. However, correlation would soon be overtaken as the most important tool in the psychologist's repertoire. Correlation, while useful, did not offer criteria for making determinations about the importance of results from experimentation or study. The method of NHST represented a more useful technique to employ in experimental settings and uniquely satisfied issues regarding the meaning of experimental results.

Cronbach's 1957 assertion that there existed two distinct streams of psychology, the correlational and the experimental, is a generalized statement regarding the state of experimental psychology in the 1950s. Supporting Cronbach's description of a two-stream psychology stands the investigation of materials published between 1925 and 1950 by Rucci and Tweney (1980). This analysis, devoted to documenting the appearance and frequency of use of ANOVA within the psychological literature, tracks changes over this period that show a slow but steady increase in ANOVA from the early 1930s to a position of dominance in the early 1950s. The purpose of this chapter is to examine the conditions that led to the adoption and proliferation of inferential statistical testing, most importantly NHST, in this period. According to Rucci and Tweney (1980), the use of the correlational approach remained relatively stable over the time that ANOVA rose to prominence. Correlational psychology was simply different in its purpose than ANOVA. By examining the differences in approach and the perceived benefits of NHST, the possible reason why psychology has struggled to change its approach to the procedure will become evident. This chapter will first examine criticisms that appeared in response to perceived deficiencies in the correlational approach. An historical line will also be drawn to the pre-NHST approach of the critical ratio (CR), which determined the significance of results when comparing means. The proponents' arguments for the adoption of NHST will also be briefly examined. Following this introduction, the argument as to why psychology has not changed its approach will be stated. Psychology has a longstanding history of difficulties surrounding the meaning and interpretation of its experimental results. This has been both a product of a multitude of approaches and a reason for this differentiation. NHST appeared to resolve the issue of consensus in psychology; namely, it appeared to provide a research strategy that could at once unify the discipline and provide clear and concise criteria that could mechanize the process of inference for psychologists. This claim will be examined from psychophysical experimentation through to the correlational, as this history was constructed in the previous chapters. The more current controversy will also be examined in light of this claim. The aim of this chapter is to synthesize the historical developments previously examined in order to show how NHST was used to organize psychological research and remove the vagaries that plagued the interpretation of experimental results. This aspect of NHST, here defined as its power to provide consensus, explains its continued use.

In Want of More than Correlation

Between 1912 and 1920, the psychologist James Burt Miner produced a yearly review of the advances in correlational statistics. Miner surveyed and reported on a large number of studies undertaken in psychology using the correlational method. The article, consistently titled "Correlation," appeared in the Psychological Bulletin. Miner's first article, of 1912, contained a critical discussion of correlation supplied by the German psychologist W. Betz. Miner (1912), summarizing Betz, wrote:

Betz emphasizes that correlation alone does not demonstrate a functional connection. Moreover, if changing one variable necessarily changes the second it is not shown that the converse is true; not all functional connections are reversible. An inventory of correlations cannot disclose psychological secrets unless supplemented by an understanding of mental facts. Correlations serve two purposes in psychology: (1) mass-studies, in which traits are described in popular terms, to aid in the educational or social description of groups; (2) the discovery of functional connections by using the greatest care in analysis and experiment with small groups. (p. 223)

In response, Miner (1912) turned to Pearson's (1900b) thoughts on the nature and use of correlation, as found in The Grammar of Science. Here Pearson (1900b) detailed a change in scientific aims, from scientific determination or causality to the nature of contingencies and correlations. This was the Machian philosophy of functional relations. Miner seems to have been confused as to what Betz was criticizing. Betz's proposed limitations of the correlational approach stated that correlation did not imply necessary functional relationships.68 With correlation and regression, relationships between variables can be arranged in nonsensical fashion.69 However, "Correlation," according to its author, had been developed "to encourage the more frequent use of this important statistical tool" (see Miner, 1920, p. 388). In the issues that appeared after the 1912 article, little to no mention was made of any controversies regarding the application of statistical correlation to matters considered psychological.70

68 An example of a necessary functional relationship would be the ideal gas law, PV = nRT. This relationship demonstrates the functional relationship that exists between the product of the (P)ressure and (V)olume of a gas and the product of the (n)umber of moles of a particular gas, the gas constant (R) and the (T)emperature of the gas. This law can be derived from first principles using the kinetic theory of gases. No one particular aspect of this relationship can be thought to cause any other aspect, but in this approximation each aspect, once known, will fully determine the remainder. The relationship also holds under any reordering or direction of analysis.
69 This does not occur in the use of correlation and regression as functional relationships. Reversibility and reordering of relationships do not maintain determinations and often lose their sense or meaning. If an observed score on an intelligence test is predicted by age, high school mathematics grade, and some form of health index, then the same functional relationship can be reversed to suggest that one's age is a function of the other variables, which is blatantly not true.
70 The only controversy which was consistently addressed was the ongoing defense and examination of Spearman's General Intelligence (see Miner, 1912; 1913; 1914; 1915; 1916; 1917; 1918; 1919; 1920).
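
A small numerical illustration may make the contrast in footnotes 68 and 69 concrete. The example is my own, not drawn from Miner or Betz: the slope for predicting y from x is cov(x, y)/var(x), while the slope for predicting x from y is cov(x, y)/var(y), and the two are reciprocals only when the correlation is perfect.

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    # Population covariance; cov(a, a) is the variance.
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)

x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]  # imperfectly related to x (r = 0.8)

b_yx = cov(x, y) / cov(x, x)  # slope predicting y from x
b_xy = cov(y, x) / cov(y, y)  # slope predicting x from y
print(b_yx, 1 / b_xy)  # 0.8 versus 1.25: the relation does not invert

Unlike the gas law, where solving for any one term preserves the relationship, each regression direction is a different functional statement, and reversing it changes the numbers.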

While Miner downplayed the limitations and encouraged the use of the correlational approach to psychological investigation, others were less enthusiastic. E. G. Boring's attacks on the effectiveness and use of correlational methods exemplified a resistance amongst psychologists who remained steady in their belief in more theoretically complete models than correlation could supply. In 1919 Boring wrote:

It is useless to try to limit the scientist to the mere description of his samples. Science begins with description but it ends in generalization . . . [with correlation] conclusions must ultimately be left to the scientific intuition of the experimenter and his public. . . . It is equivalent merely to saying that, given only approximate control of experimental conditions, only approximate results can be achieved. (p. 337)

Boring continued his assault on the topic in two papers that both appeared in 1920. In the first, appearing in January, Boring (1920a) revisited the problems of psychological measurement. Boring attempted to articulate the apparent difficulties psychologists had in defining their experimental units of measure. This inadequacy, combined with uninvestigated assumptions regarding the structure of the underlying distribution of the data, created certain conflicts with von Kries' principle of insufficient reason.71 Boring (1920a) concluded:

71 von Kries had defined the law of insufficient reason as the occasion whereby, in the application of probability theory, one assumes that there is no reason not to assume the equipotential principle. The equipotential principle states that, in a given system of accessible states, if all states are equally probable of being occupied, then the probability of any given state being occupied is equal to the inverse of the number of states available to the system. A consequence of this theoretical conception would be that, provided equipotentiality is a correct assumption, after a long run of repeated events and observations all the probabilities of a given system will become stable and observable. However, the frequentist interpretation of probability is nonsense in a system where frequency cannot be or has not been observed. The principle of insufficient reason attempts to determine expected frequency from the assumed probabilities that are estimated. What Boring is saying is that we have no reason to expect all outcomes to be equally probable.
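
In modern notation (my rendering, not von Kries' or Boring's), the equipotential assumption described in footnote 71 is simply

P(s_i) = \frac{1}{N}, \qquad i = 1, \dots, N,

for a system with N accessible states, so that after n observations each state is expected to appear roughly n/N times. Boring's complaint is that nothing entitles the psychologist to this assumption in advance of observation.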

The initial error in the application of the theory of probabilities was the assumption of the law of insufficient reason. It was wrongly supposed that knowledge could somehow be wrought out of ignorance. This very error, however, has never been routed. It has gone on, multiplying mischief. The substitute for insufficient reason is cogent reason. The more we know of the intimate nature of the entity with which we are dealing the more accurate and complete can our descriptions become. (p. 33)

Psychologists, according to Boring (1920a), needed to revisit their objects of investigation and study them more closely. Assuming a normal distribution before experimentation was to assume that all outcomes of experiment and observation were equally likely. This assumption seemed unnecessary and often wrong to Boring. He felt that with a more complete understanding of certain aspects of psychological investigation, it would be clear that the law of insufficient reason was being inappropriately applied in lieu of critical thinking. Boring's second paper on the topic appeared in Science in August. This article was a criticism of some research produced by J. Johnstone on bacteria counts in polluted shellfish. Boring (1920b) proposed that "biological and mental phenomena, of whose conditions of variability we are thus ignorant, do not necessarily give symmetrical distributions when observed" (p. 129). Boring proposed that in certain cases assuming the normal law, and the probabilities based upon it, was inappropriate prior to experimentation or study. These cases were instances where the researcher was aware that it was unlikely they would observe a normal distribution, or was capable of making careful observations rather than simply assuming the possible fit of a normal distribution. Boring thought that scientific investigators should attempt to observe the nature of their objects of study before assuming equipotentiality. With both these papers Boring expressed his disdain for the thoughtless application of the normal distribution. For him, an absence of knowledge regarding the underlying distributional character did not create sufficient reason for the law of insufficient reason. It appeared to Boring that critical thinking and good experimental practice were being discarded and replaced by theories of probability applied uncritically at the outset.

A growing number of psychologists highlighted the technical dangers of overstepping the limitations of correlation. Burks (1926a, 1926b) analyzed a number of psychologists' uses of correlation and its derivative methods. Burks noticed a trend in certain psychological literature. This literature appeared to suggest that there might exist a "widespread impression that the methods of partial and multiple correlation are adequate substitutes for controlled experimentation in the solution of problems of causation" (Burks, 1926a, p. 532). Correlation is not capable of generating necessary functional relationships72 between variables and represents nothing more than a measure of the strength of association or co-occurrence between them.73 The technique of partial correlation is not a perfect treatment of how to isolate influence. Hence, specific sources of effect or response in a variable of interest can never be thought of as perfectly attributable, to any degree, to an influencing independent variable. Multiple correlation can be thought of as measuring the relationship between one variable and all the predictor variables from any given multiple regression. As before, the functional relations between variables can be reformulated in nonsense fashion. Hence, correlation was not flawed in any particular sense, according to Burks (1926a, 1926b), but certain lingering beliefs and desires regarding causation remained in the minds of psychologists, and this led to its misuse.

72 Necessary functional relationships, see footnote 68.
73 The correlation between two variables is a measure of the proportion of the covariance and the inverse square root of the product of the individual variances. The part and partial correlations are adjusted measures of the correlation of two or more variables, with the hypothetical proportion of a third or more variables subtracted from the general correlation between two variables, ratio-adjusted for the proportion of relationship uniquely belonging to the variable(s) we wish to remove from a position of influence.
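
Rendered in modern notation (my formulation of the prose in footnote 73, not Burks' own symbols), the quantities at issue are

r_{XY} = \frac{\operatorname{cov}(X, Y)}{\sqrt{\operatorname{var}(X)\,\operatorname{var}(Y)}},
\qquad
r_{XY \cdot Z} = \frac{r_{XY} - r_{XZ}\, r_{YZ}}{\sqrt{(1 - r_{XZ}^{2})(1 - r_{YZ}^{2})}} .

The second expression makes Burks' caution visible: the "removal" of Z is an algebraic adjustment of coefficients, not an experimental control over Z itself.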

Clark Hull (1927) demonstrated certain ambiguities of measurement borne of scaling gradations. The fineness of a measurement scale would generate fluctuating accuracies of prediction, but not as a product of an underlying relationship. The strength of a correlation was thus subject to ambiguous conditions. Hull (1927) also explored the weaknesses inherent in the predictive use of regression and correlation, and how these issues were nothing more than limitations of the mathematics and the effects of number.74 Correlation was a method not without its own difficulties when applied to psychological investigation.

The Galtonian model that psychology adopted had apparently resolved the issues that introspection had introduced. However, it had divorced itself from the hallmark of the early psychology of Wundt: control and experimentation. Correlation was increasingly recognized as a method of scientific investigation that amounted to "a mere piling up of facts [that] can only lead to a chaotic and unproductive situation" (Lewin, 1936, p. 4). The Galtonian model could only describe relations between variables of interest; it could not actually disclose the underlying mechanisms behind the relationships. Effects and relationships could not be stated in the form of testable theories, only discussed and reflected upon after investigations were complete. This left a number of psychologists wanting a way to recapture experimentation.

74 Pearson (1896-1897) had noted the appearance of spurious correlations, where correlations could be obtained but might not logically contain a functional relationship. In this case the matter was attributed to the natural relationships between indices (number).

NHST as the Solution to Controlled Experimentation and Consensus

During the 1930s, when psychologists first began to employ NHST in the form of ANOVA and t-tests, there existed a similar method already in use in some areas of psychological and educational research. The critical ratio (CR) was a mathematical treatment of the difference between two means divided by the combined variability of the two groups.75 This value was evaluated as a relative measure of the difference between means in terms of dispersive measures. The goal in the use of the CR was, as mentioned in chapter 1, the evaluation of the significance of data. The CR was one of the first methods widely used in psychology that attempted to provide a mathematical criterion of significance. This technique of significance testing was an objective method of determining whether important results had been achieved.

Another important aspect of the method of the CR was that it reflected the emergence of a new concept in psychological experimentation: that of the treatment group. Two groups under investigation on a specific variable could be compared, and the results of this comparison could be evaluated as to the likelihood of their chance occurrence (Zubin, 1936). This technique, very similar in quality to the modern t-test (see Shen, 1940), provided a foundation upon which the methods of NHST and ANOVA grew (see Rucci & Tweney, 1980). The emergence of the treatment group also marked the shift of psychological investigation away from the Galtonian model to that of the neo-Galtonian (Danziger, 1987). If significant differences could be detected on a desired measure between a control group that had been kept under normal conditions and a treatment group that had received some single, identifying and clearly controlled condition, then the CR could stand as objective evidence of the functional influence of the treatment variable. The advent of the treatment group, as discussed in chapter 1, marks the third major stage in psychological methods.

75 It also appeared early on as a measure based on the probable error (P.E.) rather than on the variance or standard deviation (as an example see Nicholls, 1923).
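
In modern notation the CR takes a form like the sketch below; this is my reconstruction of the general idea rather than any one historical author's formula, and early workers often divided by the probable error instead (see footnote 75). The data are hypothetical.

import math

def critical_ratio(group1, group2):
    # CR = difference of means over the standard error of that difference.
    n1, n2 = len(group1), len(group2)
    m1, m2 = sum(group1) / n1, sum(group2) / n2
    v1 = sum((x - m1) ** 2 for x in group1) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in group2) / (n2 - 1)
    return (m1 - m2) / math.sqrt(v1 / n1 + v2 / n2)

control = [11, 9, 10, 12, 8, 10]
treatment = [13, 12, 14, 11, 15, 13]
print(critical_ratio(treatment, control))  # about 3.67

A CR of roughly three or more was conventionally read as marking a difference too large to attribute to chance, which is the sense in which it supplied a mathematical criterion of significance.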

While the treatment group gained popularity, the use of ANOVA was not immediately adopted by experimental psychology.76 The first evidence of its applicability to educational and psychological interests appears in 1934, with two papers published in related journals. In July, J. O. Irwin (1934) discussed various applications of the methods of correlation to psychological data. His article, "Correlation Methods in Psychology," appeared in the British Journal of Psychology, and it was intended to evaluate various means of establishing decision criteria that would help bring clarity to psychological analyses.

76 As outlined in chapter 1, the rise of NHST took effectively 15 years (see Rucci & Tweney, 1980).

"Correlation Methods in Psychology," appeared in The British Journal ofPsychology, and it was intended to evaluate various means of establishing decision criterion that would help bring clarity to various psychological analyses. Irwin found that most psychological analyses simply "sum[med] them [their analyses] up by use of a single coefficient" (Irwin, p. 86). Correlations and regressions were at times ambiguous in demonstrating what variables were best as predictors and what correlations should be

76As outlined in chapter 1, the rise ofNHST took effectively 15 years (see Rucci & Tweeny, 1980). 106 considered large. Part of the benefit of testing for significance was that it made the result more easily "understood by an intelligent layman" (Irwin, 1934, p. 87). Of all the methods discussed, under certain circumstances, certain methods were determined and demonstrated as superior than others. In the case where one variable is qualitative and another is quantitative, Irwin (1934) suggests: we get a clearer picture by comparing the variation from one array to another with the variation within arrays, which can be more conveniently be done by the analysis of variance technique, than by a mere consideration of the correlation

ratio, (p. 86) Irwin (1934), furthermore, discussed the ability of ANOVA to handle "small or even moderate samples" (p. 87), referred to R. A. Fisher's tables for z, and acknowledged the .05 criterion (see p. 88). The method was promoted as an effective means of obtaining clear results for psychological investigations. The second paper to appear was W. Reitz' Statistical Techniquesfor the Study of Institutional Differences in the Journal ofExperimental Education. Reitz' s paper demonstrated the practicality of the technique of ANOVA in light of existing methods for testing the significance of a correlation coefficient as previously developed by Karl Pearson. What Reitz (1934) achieved was a clear demonstration that Fisher's method could provide, at the least, an equally reliable method of determining the significance of a correlation coefficient. The problem to which Fisher's technique was amenable was that of determining if legitimate differences existed between educational institutions using the intelligence evaluations of students. However, Reitz (1934) suggests that: 107

The present statistical technique . . . need not be confined to so-called initiatory measures, but may well be applied to measures of the finished product, such as marks, college grade points, number of quarters or semesters in residence. (p. 22)

Reitz (1934) further advocated the use of ANOVA as an effective means of establishing admission criteria for institutions of higher learning. Interestingly, the two 1934 papers complemented each other in an important respect. Irwin's paper promoted the application of ANOVA based on the theoretical concerns of experimental and scientific pursuits, while Reitz's was able to demonstrate practical application.
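
Irwin's comparison of "the variation from one array to another with the variation within arrays" is the F ratio of a one-way analysis of variance. The sketch below is a minimal modern rendering of that comparison, not a reproduction of Irwin's or Reitz's calculations; the three "arrays" of scores, one per institution, are hypothetical.

def one_way_f(groups):
    # F = mean square between groups / mean square within groups.
    scores = [x for g in groups for x in g]
    grand_mean = sum(scores) / len(scores)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (len(groups) - 1)
    ms_within = ss_within / (len(scores) - len(groups))
    return ms_between / ms_within

groups = [[4, 5, 6], [6, 7, 8], [9, 10, 11]]
print(one_way_f(groups))  # 19.0: between-array variation dominates

The observed F would then be referred to Fisher's tables (via his z, as Irwin did) to decide whether the institutional differences were significant at the .05 level.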

Other papers concerning the theoretical use and application of ANOVA would begin to appear over the later stages of the 1930s and early 1940s. In 1938 R. S. Crutchfield appealed to psychologists to be aware of and utilize the method of ANOVA. According to Crutchfield, ANOVA's ability to handle multiple variables and still produce meaningful results was a break with the older tradition of single-variable manipulation, which diminished the generalizable nature of results (see Crutchfield, 1938, p. 340). ANOVA was promoted as a method that could increase objectivity in experimental investigations while concurrently handling the increased complexity. Crutchfield attempted to demonstrate the validity of his claims in an experiment involving the performance of rats on a string-pulling task. In 1939, G. Rubin-Rabson performed a study of the memorization of piano music under different conditions that used ANOVA as its core method of analysis. Baxter in 1940 conducted a reaction time experiment using ANOVA as a means of determining any significant differences that might exist under different stimulus modalities. Crutchfield and Tolman in 1940 urged the community of experimental psychologists to recognize that "Multiple-variable design is appropriate, in general, for any comprehensive psychological study which is concerned with the generalized effects of a multiplicity of variables and their complex interactions" (p. 41). In two related papers appearing in 1937 and 1941, Gaskill and Cox demonstrated the clear application of the methods of analysis of variance and covariance to measures of emotional response as revealed by respiration, heart-rate and blood pressure. In the concluding remarks to their 1941 paper on emotional responses as measured by heart-rate and blood pressure, Gaskill and Cox concluded that while analysis of variance and covariance were effective for the analysis of their data, analyses of the correlations were "not only insufficient but distinctly misleading" (p. 420).

By the 1950s ANOVA had become the dominant method of experimental analysis (Rucci & Tweney, 1980). While correlation remained a robust technique, the use of the CR had been in rapid decline in step with the rise of ANOVA. Where the CR had been able to give a measure of the effectiveness of single variables, ANOVA was a technique that could manipulate and account for many variables at once. ANOVA provided for psychologists, once again, the opportunity to develop theories and test them. However, as early as 1938 Joseph Berkson was raising concerns over the use of statistical tests as automatic processes that generated decisions for the experimenter, and the controversy regarding the use of NHST began to foment.

The NHST Controversy Revisited

The history of quantification and the rise of statistical methods in psychology, to this point, reveal what I am suggesting here as the major issue that NHST appeared to solve: consensus. One of the major criticisms that developed out of psychophysical measurement was the quantity objection. The use of introspection was fraught with subjectivities that gave unreliable results. Correlations and multiple correlations provided no objective criteria for determining which observed values were important and which were not, and they were difficult to interpret. NHST not only helped psychology regain an experimental approach after the demise of experimental introspection, it appeared to provide what seemed to be precise and unambiguous results. How can the resiliency and longevity of NHST over the past 50 or more years be explained? I have suggested that it has persisted because it seems to provide a means of evaluating a wide variety of complex data, generating what appear to be unambiguous results upon which everyone could agree, based upon p values.

The Problem of Consensus: A Review

At each successive developmental stage, from Fechner and Wundt through to the neo-Galtonian approach, there has been a persistent and nagging issue that has shaped psychology's experimental methods. I suggest that this issue has been the desire to achieve clear, objective, and mutually agreeable methods of experimentation and interpretations of results. This need to develop a unified approach to experimental investigation and its results has been defined as consensus. Each successive method adopted as primary for psychologists has wrestled with this need.

Of the various objections leveled against psychophysics, one particular problem that has remained is the quantity objection (see Murray, 1990; Micheli, 1999). Johannes von Kries was extremely critical of Fechner's claims of a foundation for psychophysical measurement. According to von Kries, the primary issue regarding Fechner's Psychophysical Law was that stimulus sensation could not be quantified and therefore could not be considered to have been measured (Niall, 1995). von Kries held that quantification could not be accomplished in psychophysics and psychology because in these realms "a reduction of terms to terms of spatial and temporal magnitudes is excluded" (Niall, 1995, p. 290). von Kries argued that because psychophysics could not state theoretically what a unit quantity77 of sensation was, it was not safe to assume that magnitude was a meaningful concept to apply indiscriminately. von Kries (as cited in Niall, 1995) stated:

If a location on the skin is subjected first to a two-pound and then a three-pound load, and subsequently to a ten-pound and then a fifteen-pound load, the latter two sensations of pressure occupy quite a different place in the whole series of sensations than the first two. Thus the one increment is something quite different from the other; at first they admit of no comparison. The claim that they may be equal makes absolutely no sense. . . . This is clearest in cases where we are not bent on objectification, such as with pain. What it means to say that one pain may be exactly ten times as strong as another, is simply unfathomable. (pp. 291-292)

77 von Kries required Fechner to state unit quantity in terms of the objective measures of physics: length, mass, and time.

Fechner's psychophysics is defined not by its definition and demonstration of unit quantity,78 but by its application of a convention based upon a theoretical unit. von Kries (as cited in Niall, 1995) wrote:

This is the sense in which Fechner thought he had revealed the just-noticeable difference as the 'true' standard of measure for the intensity of sensation. In contrast, our discussion has shown that we are really dealing with an arbitrary convention about relations of magnitudes, which can be expedient or inexpedient, but which cannot be right or wrong. (pp. 292-293)

78 See the discussion of Fechner's method of measurement in chapter 2.

Fechner's reply to von Kries proposed that the experimental method had revealed the nature of the relationship and hence the truth of its existence (Niall, 1995; Murray, 1990).79 Where von Kries attacked the theoretical foundation from below, Fechner defended it from an empirical perspective from above. If Niall's and Murray's arguments are correct, then von Kries' critique and Fechner's reply do not have a point of contact. von Kries accused Fechner of assuming a congruence of unit measures, while Fechner claimed unit measures were the demonstrated products of his experimental approach.

The question of quantity remained a point of active inquiry in psychology's early years as a discipline. E. B. Titchener and William James sought to side-step the issue raised by the quantity objection by reformulating psychophysical measurement not as the measurement of sensation, but as the magnitude of sensation distances (see Micheli, 1999, pp. 90-91). Hugo Münsterberg saw no way around the problems of measurement. He went so far as to say that he had "never measured a psychical fact," nor ever "heard that anybody has measured a psychical fact," and he strongly disbelieved that "in the centuries to come a psychical fact will ever be measured" (Münsterberg, 1898, p. 159). Cattell defended his belief in psychophysical and mental measurement by suggesting that mental processes were so inextricably bound up with the physical that they had to submit themselves to measurement and hence be quantifiable, while concurrently stating that psychology is not concerned "with the metaphysics of these magnitudes, nor even with such critical discussion as may fall within the limits of a scientific theory of knowledge" (see Cattell, 1893, p. 321).

What appears to have been the most satisfying rebuttal of the quantity objection was provided by S. S. Stevens in 1951. Stevens, then professor of psychology at Harvard, proposed a radical reinterpretation of what constituted measurement. Stevens's artifice was to suggest that measurement (and as a result quantity) was nothing more than "the assignment of numbers to objects or events in accordance with a rule of some sort" (Stevens, 1958, p. 383). In his 1959 account of his new theory, Stevens wonders at the community of "practitioners of the physical sciences [who] prefer to cling to the narrower view that only certain of the tidier rules are admissible [as measurement]" and who are unwilling to accept his "liberal and open-handed definition of measurement" (see p. 19). Stevens' approach to measurement altered the importance of the quantity objection to psychology. By suggesting that quantity and measurement were not actually aspects bound to the same domain (see Stevens, 1959, pp. 19-20), Stevens in practical terms resolved the quantity objection for psychology. If, as Stevens (1959) suggested, mathematics is concerned only with systems of symbols, with numerosity and not quantity, and if measurement is merely a set of rules governing the application of numbers to objects and events obtained empirically (Stevens, 1959, p. 20), then the quantity objection has no place in any criticism regarding any form of measurement.

The problems associated with the use of introspection also reflected the need for consensus. As scientists, psychologists wanted to obtain objective methods and be able to agree on the reliability of results. Wundt had prescribed rigorous methods to ensure the reliability and scrutiny of results from introspecting psychologists; calibration of man and machine was the key (Benschop & Draaisma, 2000). The rhetorical function of the lab declared to the world that psychology had become and was developing into a modern science (Sturm & Ash, 2005), regardless of the controversies that had swirled around the quantity objection and the method of introspection. As Gail Hornstein suggested, the low cost, ease and appearance of clear experimental results kept psychophysics and early psychology moving forward rather than dealing with theoretical issues (Hornstein, 1988, pp. 7-8). However, as detailed in chapter 3, the problems of liberal self-observation and the conflation of the various approaches to introspection80 were its eventual downfall. The introspective methods used in the early 20th century did not survive the strong criticisms of those like Dodge (1912), Bode (1913) and Dunlap (1912), who questioned the place of introspection in psychology and proposed that the method needed to be banished from psychology altogether. The pressures supplied by alternative forms of psychological inquiry (individual psychology, comparative psychology, correlational psychology and others) also demonstrated that at the time the discipline could not be defined by a single principal means of investigation. The needs and approaches of experimental psychology at the turn of the 20th century were larger than the scope of a single method. There could be no measure of consensus using introspection. Individual reports of self-examining persons were far too idiosyncratic and not necessarily replicable, and far too many areas of interest were inaccessible to study using the introspective method.

79 Murray (1990) further cites Fechner as acknowledging that should von Kries and the quantity objection be correct, there was nothing to support psychophysical measurement.
80 Introspectionism was easily defeated because as a single entity it did not exist; see chapter 3.

As correlational psychology became a popular branch of experimental psychology, two issues related to the need to develop consensus emerged. Both issues concerned the interpretation of the correlation coefficient. First, no one could define what constituted a proper interpretation of the magnitude of the calculated correlation. What needed answering was how large the measured correlation was and to what extent the correlation was an effective measure of the bivariate relationship. The second issue was more a product of the history of quantification in psychology: what meaning did the reported statistics or unit measures, especially in the measurement of intelligence, possess? This point was discussed earlier in chapter 3.

In the years leading up to the adoption of NHST by psychologists, the statement of experimental results was primarily descriptive. Without a specific criterion for determining degrees of difference or the importance of measured magnitudes of correlation, there was often discussion of potentially important differences or values, but inference was generally absent. Boring (1919) wrote that ". . . since in the nature of the case it is impossible for him to state in numerical terms the degree of representativeness that his samples possess . . . the probability of difference is not wholly satisfactory but it is inescapable" (p. 337). Without a standard method, psychologists appeared to approach the discussion of their results in a variety of ways. One of the leading journals in experimental psychology during these years stands as an example of this. As reported by Gigerenzer et al. (1989):

. . . the volumes of the Journal of Experimental Psychology in the 1920s contain articles in which results are summarized and described in detail, but no inference is drawn by eyeballing curves, by personal judgment, by bargaining with the reader, or by comparing mean differences with the variances without specifying a rule of comparison. More formal criteria such as the probable error, the critical ratio, or Karl Pearson's chi-square method were also used, but in other articles the issues of data description and inference were not even distinguished. (p. 206)

However, with a move towards experimentation and the discussion of group differences due to some specific (assumed causal) factor, mere descriptions of results were ineffectual. The use of NHST permitted an analysis of differences in statistical measures of correlation and mean differences that appeared to imply causation and to operate by a rule (p values). Statistical significance appeared to offer the ability to demonstrate that one had obtained unarguably important and scientifically sound results.

As intelligence testing became more popular, so did the construction of an increasingly large battery of tests. Boring (1923/1961) had operationally defined intelligence as what the tests tested. Paul Horst81 remained unsatisfied and suggested that in measuring intelligence, "in order for the counting process [measurement] to be useful in prediction it is necessary to know what to count and what to do with the numbers resulting from the counting process" (Horst, 1932, p. 637). Hornstein (1988) noted that an increasingly large number of tests came to be confused or comingled with the definition Boring had supplied. If it was the case that intelligence was defined by the tests themselves, then there possibly existed a wide spectrum of different forms of intelligence. This was not what Charles Spearman and his correlational psychology had proposed in his theory of intelligence.82 However, by the 1930s the debates regarding the structure and meaning of intelligence seemed to fall away. The activities of experimentation and the widespread use of intelligence testing in applied areas like education, industry and the military had trumped the questions. According to Hornstein (1988):

Like the psychophysicists who simply pointed to their empirical findings as evidence that it must be possible to quantify perception, the testers could use the existence and widespread use of tests to support their claim that intelligence could likewise be quantified. (p. 12)

The quantification of intelligence that emerged in the early years of the 20th century carried with it the same need for consensus that had plagued psychophysics. No one could agree conclusively as to how and what was being measured.

81 Horst was a professor of psychology at the University of Washington, Seattle. He specialized in mathematical statistics. He also took a one-year leave as professor in 1949-1950 to be the Director of Research at the Educational Testing Service in Princeton, New Jersey.

82 As discussed in chapter 3, Spearman had begun working on the problem of intelligence as early as 1904 and continued for many years (see Hart & Spearman, 1912; Spearman, 1904a, 1904b, 1914). He formally proposed his two-factor theory of intelligence in 1914; the general and specific factors would account for intelligence. Spearman (1914) proposed this regardless of the fact that his factor analytic method could be used to support a variety of outcomes regarding the number of actual underlying factors.

Why Our Use of NHST Has Not Changed

NHST rose to prominence in the 1950s; effectively 60 years later, we continue to use NHST in virtually the same manner as in its earliest appearances. Psychology has insulated itself, intentionally or not, against criticisms that could deflect its use of NHST. During the 1940s, psychology did come to agreement regarding the training of all future psychologists. The methods of statistical analysis were agreed upon as being fundamental to all experimental work (Anastasi, 1947; Buxton, 1946; Stoke, 1946). About this time, the methods of NHST appeared in statistical texts for psychologists, along with the inconsistencies and misinterpretations that led many to draw incorrect conclusions regarding the method's use and application. NHST was incorrectly perceived to provide a mechanized process that generated rule-governed results. This standardized approach was translated into a fundamental rubric for determining the quality and validity of experimental results, one that could be found operating in the practices of some journal editors. NHST came to occupy a place in psychological research that appeared to create consensus regarding experimental results. It would appear that statistical significance replaced substantive significance. Psychology has been more concerned with the apparent utility of NHST than with the debate regarding its use.

In the United States, during the 1940s, a discussion of the essential areas of study for students of psychology took place. A large number of reports, produced by some of the leading psychologists of the time, suggested that the training of psychologists should include a thorough introduction to statistics. In fact, as described by Stuart Stoke in 1946, in order for any student of psychology to be capable of doing good work, they must be proficient "Certainly [with] statistics; and most of our groups say experimental methodology" (Stoke, 1946, p. 114). Furthermore, "The status of a student who cannot argue in terms of statistics is definitely low" (Stoke, 1946, p. 115). Claude Buxton in the same year presented a model for a standard system of course materials for introductory psychology, also containing statistics as a core element (see Buxton, 1946, p. 310). The issue of the certification of practicing psychologists (clinical in particular) was also important following the war, as there was increased demand for psychologists (see Ullman, 1947; American Psychological Association, 1947). Anne Anastasi (1947) also argued for the incorporation of statistical studies in undergraduate courses for psychology. As psychology moved towards the use of NHST as a primary method, it placed statistics as a central theme of study.

For early experimental psychology, textbooks were educational but also served to reinforce the notion of psychology as a science. The textbooks were used to introduce the methods of the new science and to mark the boundaries of the territories belonging to psychological investigation.83 These efforts helped to organize and develop consensus regarding what it meant to be a psychologist and how to go about conducting psychological research in the 1940s and 1950s. As discussed in chapter 1, the textbooks provided to young psychologists were often flawed in one form or another. Rucci and Tweney's (1980) investigation of the principal appearances and uses of ANOVA contains a systematic survey of the early rise of statistical textbooks for psychologists. The authors note the 1950 edition of Guilford's textbook on statistics, which was identified as one amongst a few that were widely used and influential (see p. 176). Gigerenzer, Krauss and Vitouch (2004) point to Guilford's 194284 textbook as an example of the inconsistent statistical treatments and often erroneous information taught to psychology students in the 1950s. The Guilford text "marked the beginning of a genre of statistical texts that vacillate between the researchers' desire for probabilities of hypotheses and what significance testing can actually provide" (p. 4). Furthermore, Guilford was guilty of ignoring the work of E. S. Pearson and Jerzy Neyman on power analysis. Guilford claimed in his text that the concept of power was far too complicated to be discussed (see Gigerenzer et al., 1989, p. 208). Psychology texts that did incorporate the theories did not reflect the existing controversies between the theories of Fisher and those of Neyman and Pearson, but instead "hybridized" the theories, and this obscured the controversies (Gigerenzer et al., 1989, p. 208). While there has been some improvement in the content of statistical textbooks for social and behavioural scientists, the NHST controversy remains very nearly absent in modern texts (Gliner, Leech, & Morgan, 2002). Psychologists trained in the 1950s and afterward were poorly educated on the topic of NHST.

83 Refer to chapter 2 of this paper.
84 As well as its subsequent revised editions.

Regardless of how problematic NHST has been demonstrated to be, and how deficient the source materials for the study of statistical testing, psychologists have been forced to conform to disciplinary pressures surrounding its use. In a 1962 editorial appearing in the Journal of Experimental Psychology, A. W. Melton expounded upon some of the practices that he believed, as editor of the journal for the previous twelve years, brought a certain distinction to the quality of material accepted for publication. Of particular note was that effective use of NHST was of primary importance. For Melton (1962), what constituted effective use was achieving statistical significance not merely at a minimum p value of .05; only results demonstrating significance at the .01 level were actually worthy admissions for publication. This viewpoint of Melton's only helped encourage and cement the idea that NHST was the standard by which research should be judged. Gigerenzer et al. report that by 1955 more than eighty percent of all the articles appearing in four of the leading journals in psychology used NHST to demonstrate or validate conclusions from the data (Gigerenzer et al., 1989, p. 206). The age of statistical consensus was born. Interestingly, Melton's criteria stand in sharp contrast to Jacob Cohen's report of the same year. In an analysis of statistical power, Cohen (1962) detailed the chance-like character of a number of psychological hypotheses and theories that had been published based upon the .05 criterion. Regardless of the NHST controversy, the idea of statistical consensus based on the use of the .05 criterion has remained a part of editorial policies even in more contemporary settings (Kupfersmid, 1988; Kupfersmid & Fiala, 1991).

The results of the organized investigation into the application and uses of NHST in psychology performed by the Task Force on Statistical Inference (1999) determined that although there may be flaws, and although psychologists seem irrevocably doomed to misuse NHST, it possesses a certain utility that cannot be discarded. The reason why NHST became the dominant method and remained resilient in light of attack lies in its unique utility: that of the production of consensus. As Danziger suggested, NHST and statistical inference ". . . proved to be very attractive to psychologists because it finally enabled them to develop a unified research practice" (Danziger, 1987, p. 46) and "one should not underestimate the potential merits of this style of research [neo-Galtonian models of performance analysis] from a certain point of view. It was and remains capable of yielding results of some utility in situations where its limitations are acceptable" (Danziger, 1987, p. 43). This point exemplifies the statements of the TFSI (1999) and of nearly all the others who acknowledge the limitations of NHST yet return to its utility because of some sort of unstated, limited circumstances where it is effective at gleaning the truth.
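
Cohen's point can be illustrated with a rough calculation. The sketch below uses a normal approximation rather than Cohen's exact tables, and the sample size is hypothetical, but it shows the kind of situation he described: with 20 subjects per group, a medium true effect (d = 0.5) is detected only about a third of the time.

import math

def normal_cdf(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def approx_power(d, n_per_group):
    # Approximate power of a two-tailed, two-sample test at alpha = .05
    # for a true standardized effect size d.
    z_crit = 1.96  # two-tailed critical value at .05
    se = math.sqrt(2 / n_per_group)  # SE of the standardized mean difference
    return 1 - normal_cdf(z_crit - d / se) + normal_cdf(-z_crit - d / se)

print(round(approx_power(0.5, 20), 2))  # about 0.35 for 20 subjects per group

Under such conditions a literature filtered on p < .05, or on Melton's stricter .01, records sampling luck as much as real effects.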

Controlled Experimentation, NHST and the Current Issue of Method: Concluding Argument

The NHST controversy is not a simple problem of technical adequacy. Within psychology this controversy is related to the history of the discipline as it has developed from its earliest experimental forms. By divorcing the issue of unit quantity from experimental concern, S. S. Stevens gave modern psychology permission to use NHST as a method to achieve consensus not just across results from an experiment, but across different intradisciplinary concerns. From editorial policies that demand the .05 criterion for publication (Kupfersmid, 1988; Kupfersmid & Fiala, 1991; Sedlmeier & Gigerenzer, 1989; Vacha-Haase, 2001) to courses and textbooks in statistics and psychological measurement that make no mention of existing controversies (Gigerenzer et al., 1989; Micheli, 1999), psychology seems to have a system in place to maintain its current state. It would appear that this is the reason why Rodgers (2010) has argued for a change in the undergraduate and graduate curriculum for psychologists.85 It would appear that statistics and measurement share the common property assigned to NHST by John (1992): these techniques act as forms of rhetoric within psychology that maintain a blind adherence to a belief that the discipline has achieved the scientific method for itself.

85 Rodgers (2010) has suggested that psychologists should now begin to focus on teaching and learning the techniques of statistical modeling and move away from NHST.

Concluding Remarks

The answer as to why psychology has not changed its approach to its use of NHST, I have contended, rests not on technical reasons, but on the emergent needs of a young discipline grappling with the nature of its subject matter and its status as a science. Experimental psychology appeared in the years following G. T. Fechner's demonstration of the principle of psychophysical measurement. Yet Fechner (1860/1966) initially proposed that his system did not actually define the quantity of sensations; it conceived of quantified sensation ". . . merely [as] a matter of definition, and implies, as generally understood, no specific measure of sensation" (p. 14). Yet his work in part laid the foundation upon which a quantified discipline of experimental psychology would emerge.

The arguments provided by von Kries in 1882 suggest that the issue of quantification, borne of Fechner's theories, was far from resolved. However, by the time von Kries had leveled his quantity objection against Fechner's psychophysics, Wilhelm Wundt had already established the first laboratory dedicated to experimental psychology. Wundt's own experiences86 had led him to adopt something of Fechner's psychophysical methods, and his model of experimental psychology reflected this. Wundt's model of experimental psychology was successful in establishing a standardized model upon which a great many of the early laboratories of psychology would soon bloom. The activities of the laboratory allowed the practical concerns of conducting research to bypass the concerns of the theorist (Hornstein, 1988). A discipline was created before the existing controversies regarding the idea of quantifying mentality had been properly addressed.

86 See Fancher, 1996, pp. 145-146.

However, Wundt's dependence upon the introspective method meant that his approach became conflated with many other experimental programs that used introspection in different ways. Introspection soon became introspectionism, and the collective arguments concerning subjectivity and errors of interpretation prompted the first reformulation of experimental psychology. What emerged from this first reformulation was a discipline that was varied in its approaches and focus.

What made the emergent sub-discipline of correlational psychology most appealing was its ability to address and investigate the concept of intelligence.

Correlational psychology attempted to achieve consensus as to what it was experimental psychologists were quantifying by relating observed measures (though only correlationally) to underlying substructures that were themselves born of the statistical technique. Unfortunately, correlational psychology was limited to descriptive analyses.

Experimental psychology, having earlier attempted to model itself after the natural sciences, was in want of a method that might more fully expose the functional relationships between psychological processes. To this end, psychologists of the early 20th century approached statistical information (aggregate data) using a variety of techniques, like the critical ratio (CR), that were designed to determine whether differences existed in a meaningful or significant fashion. This set the stage for NHST, and while many were slow to take up the approach, its practical usefulness lay in its ability to "alleviate problems of reaching a consensus about the significance of aggregate data" (Danziger, 1987, p. 46).

The quantity objection was finally displaced from psychology with S. S. Stevens' redefinition of what constituted measurement. Measurement, as proposed by Stevens, did not have to do with quantity; it was a form of numerology that obeyed a system of rules governing the assignment of numbers to objects. The effect of Stevens' redefinition was the liberation of psychology from the difficulties it had encountered in interpreting its various quantified methods (including the testing of intelligence and related measures). However, I suggest that Boring's (1920a) criticisms remain relevant to experimental psychologists. Boring contended that experimental psychologists need to understand something of what it is they are studying before applying the principles of the normal law and assuming that statistical significance is a proper criterion for interpreting data.
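For concreteness, the critical ratio mentioned above was typically computed as the difference between two group means divided by the standard error of that difference, with values of roughly 2 (or, more conservatively, 3) read as marking a dependable difference. The notation below is mine, a sketch of the technique as commonly described in histories of the period rather than any single author's formula:

```latex
\mathrm{CR} \;=\; \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{SE_1^{2} + SE_2^{2}}},
\qquad \mathrm{CR} \geq 2 \text{ (or } 3\text{) taken as a dependable difference.}
```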

NHST was adopted and has remained in use in psychology because it could fulfill two needs. First, NHST represented a return to experimental control without having to discard the Galtonian approach of accumulating large amounts of aggregate data. Second, NHST at the time (the 1950s) represented the only possible way to discover the underlying significance of collected data in a manner that could both handle the large number of variables in operation within a psychological experiment and prescribe a definite criterion for consensus (the p < .05 criterion). Using NHST, psychologists could examine any aspect of quantified psychological phenomena provided they could offer a reasonable justification for the application of numbers. Observations and the results of research could then be mechanically evaluated against the .05 criterion. The utility of NHST to psychology has been its ability to produce consensus regarding the interpretation and evaluation of conducted research, and this has provided a delimited resource for publishable material.

While it is unclear when the NHST controversy will leave us at last, a few suggestions may speed the process. Students of psychology often arrive at introductory statistics ill equipped to analyze the underlying philosophies and mathematical structures that support it. With undergraduate and graduate courses in psychology requiring little to no mathematics training, students learn the practice of statistics and not the theories. I suggest that this encourages a "how" approach rather than a "why" approach to an applied form of mathematics that they do not fully grasp. These same students then find themselves incapable of interrogating what it is they are taught. The result is mechanized use.

Students of psychology should receive more training in basic mathematics concurrent with their introduction to statistics. This approach is currently in use in physics, engineering and nearly all other natural sciences: the pace of an undergraduate physics course is synchronized with the pace of the supporting calculus course. Alternative practices, like the Bayesian approach to statistics and statistical modeling, that are less interested in providing all-or-none information should also become components of undergraduate training. Currently, students are typically unaware that there are alternative interpretations of probability; the frequentist interpretation is taught in support of NHST, and little else apart from non-parametric approaches is endorsed. The promotion of modeling also encourages a more intimate and careful appreciation of functional relationships.

NHST has often been used in crude and thoughtless ways. Variables like gender or race are often inserted into the ANOVA model simply to arbitrarily reduce the error term. The result can sometimes be the maligning of persons of different genders, races or any other demographic consideration that can be numerically coded.87 Helping to create a more capable, patient and considerate user of statistical methods should help psychology break away from the unique attractions that NHST has provided for so many years.
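The mechanized use described above reduces, in practice, to a few lines of boilerplate. The sketch below is purely illustrative, using hypothetical data and scipy's standard t test (an assumption of mine, not a tool discussed in this thesis), and is meant only to show how little judgment the ritual demands:

```python
# Illustrative only: the "mechanized" NHST ritual applied to hypothetical data.
from scipy import stats

group_a = [4.1, 5.3, 4.8, 5.0, 4.6, 5.2, 4.9, 5.5]  # hypothetical scores
group_b = [4.0, 4.4, 4.2, 4.7, 4.3, 4.1, 4.6, 4.5]

t_stat, p_value = stats.ttest_ind(group_a, group_b)

# The decision rule is purely mechanical: compare p to .05 and report.
verdict = "significant" if p_value < 0.05 else "not significant"
print(f"t = {t_stat:.2f}, p = {p_value:.3f} -> {verdict}")
```

Nothing in this procedure asks what the numbers measure, how large the effect is, or whether the design could have detected it, which is precisely the criticism rehearsed throughout this thesis.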

87 This idea is similar to, and supported by, the notion of epistemological violence (see Teo, 2008).

References

American Psychological Association. (1947). Recommended graduate training program in clinical psychology. American Psychologist, 2(12), 539-558.
Anastasi, A. (1947). The place of experimental psychology in the undergraduate curriculum. American Psychologist, 2(2), 57-62.
Angell, J. R. (1905). Psychology at the St. Louis congress. The Journal of Philosophy, Psychology and Scientific Methods, 2(20), 533-546.
Ash, M. G. (1980). Experimental psychology in Germany before 1914: Aspects of an academic identity problem. Psychological Research, 42, 75-86.
Ash, M. G. (1995). The self-presentation of a discipline: History of psychology in the United States between pedagogy and scholarship. In L. Graham, W. Lepenies, & P. Weingart (Eds.), Functions and uses of disciplinary histories (pp. 143-189). Dordrecht: D. Reidel Publishing Company.
Bain, A. (1893). The respective spheres and mutual helps of introspection and psychophysical experiment in psychology. Mind, 2(5), 42-53.
Bagley, W. C. (1901). On the correlation of mental and motor ability in school children. The American Journal of Psychology, 12(2), 193-205.
Bakan, D. (1966). The test of significance in psychological research. Psychological Bulletin, 66(6), 423-437.
Barnard, G. A. (1992). Introduction to Pearson, K. (1900), On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. In S. Kotz & N. L. Johnson (Eds.), Breakthroughs in statistics: Vol. 2. NY: Springer-Verlag.
Baxter, B. (1940). The application of a factorial design to a psychological problem. Psychological Review, 47(6), 494-500.
Bell, M. (2005). The German tradition of psychology in literature and thought, 1700-1840. NY: Cambridge University Press.
Benschop, R., & Draaisma, D. (2000). In pursuit of precision: The calibration of minds and machines in late nineteenth-century psychology. Annals of Science, 57, 1-25.
Berkson, J. (1938). Some difficulties of interpretation encountered in the application of the chi-square test. Journal of the American Statistical Association, 33(203), 526-536.
Berkson, J. (1942/1970). Tests of significance considered as evidence. In D. E. Morrison & R. E. Henkel (Eds.), The significance test controversy. Chicago, IL: Aldine Publishing.
Blumenthal, A. L. (1975). A reappraisal of Wilhelm Wundt. American Psychologist, 30(11), 1081-1088.
Bode, B. H. (1913). The method of introspection. The Journal of Philosophy, Psychology and Scientific Methods, 10(4), 85-91.
Boland, P. J. (1984). A biographical glimpse of William Sealy Gosset. The American Statistician, 38(3), 179-183.
Boring, E. G. (1919). Mathematical vs. scientific significance. The Psychological Bulletin, 16(10), 335-338.

Boring, E. G. (1920a). The logic of the normal law of error in mental measurement. The American Journal of Psychology, 31(1), 1-33.
Boring, E. G. (1920b). A priori use of the Gaussian law. Science, 52(1336), 129-130.
Boring, E. G. (1923/1961). Intelligence as the tests test it. In J. J. Jenkins & D. G. Paterson (Eds.), Studies in individual differences. East Norwalk, CT: Appleton-Century-Crofts.
Boring, E. G. (1950). A history of experimental psychology. Englewood Cliffs, NJ: Prentice-Hall.
Boring, E. G. (1953). A history of introspection. Psychological Bulletin, 50(3), 169-189.
Boudewijnse, G. A., Murray, D. J., & Bandomir, C. A. (1999). Herbart's mathematical psychology. History of Psychology, 2(3), 163-193.
Boudewijnse, G. A., Murray, D. J., & Bandomir, C. A. (2001). The fate of Herbart's mathematical psychology. History of Psychology, 4(2), 107-132.
Bringmann, W. G., Bringmann, N. J., & Balance, W. D. G. (1980). Wilhelm Maximilian Wundt 1832-1874: The formative years. In W. G. Bringmann, R. D. Tweney, & E. R. Hilgard (Eds.), Wundt studies: A centennial collection. Toronto, ON: C. J. Hogrefe, Inc.
Burks, B. S. (1929a). On the inadequacy of the partial and multiple correlation technique. Journal of Educational Psychology, 17(8), 532-540.
Burks, B. S. (1929b). On the inadequacy of the partial and multiple correlation technique. Journal of Educational Psychology, 17(9), 625-630.

Buxton, C. E. (1946). Planning the introductory psychology course. American Psychologist, 1(8), 303-311.
Carson, J. (1993). Army alpha, army brass, and the search for army intelligence. Isis, 84(2), 278-309.
Caswell, H. L. (1949). The influence of John Dewey on the curriculum of American schools. Teachers College Record, 51, 144-146.
Cattell, J. McK. (1893). Mental measurement. The Philosophical Review, 2(3), 316-332.
Cattell, J. McK. (1928). Early psychological laboratories. Science, 67(1744), 543-548.
Cattell, J. McK., & Farrand, L. (1896). Physical and mental measurements of the students of Columbia University. Psychological Review, 3(6), 618-648.
Cattell, J. McK., & Galton, F. (1890). Mental tests and measurements. Mind, 15(59), 373-381.
Cattell, R. B. (1945). The life and work of Charles Spearman. Journal of Personality, 14(2), 85-92.
Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65(3), 145-153.
Cohen, J. (1994). The Earth is round (p < .05). American Psychologist, 49, 997-1003.
Cronbach, L. J. (1957). The two disciplines of scientific psychology. American Psychologist, 12, 671-684.
Crutchfield, R. (1938). Efficient factorial design and analysis of variance illustrated in psychological experimentation. Journal of Psychology, 5, 339-346.
Crutchfield, R., & Tolman, E. C. (1940). Multiple-variable design for experiments involving interaction of behavior. Psychological Review, 47(1), 38-42.
Danziger, K. (1980). The history of introspection reconsidered. Journal of the History of the Behavioral Sciences, 16, 241-262.
Danziger, K. (1985). The origins of the psychological experiment as a social institution. American Psychologist, 40(2), 133-140.
Danziger, K. (1987). Statistical method and the historical development of research practice in American psychology. In L. Krüger, G. Gigerenzer, & M. S. Morgan (Eds.), The probabilistic revolution: Vol. 2. Ideas in the sciences. Cambridge, MA: The MIT Press.
Daston, L. (1992). Objectivity and the escape from perspective. Social Studies of Science, 22(4), 597-618.
Daston, L., & Galison, P. (1992). The image of objectivity [Special issue: Seeing science]. Representations, 40, 81-128.
Daston, L., & Galison, P. (2007). Objectivity. Brooklyn, NY: Zone Books.
Denis, D. (1999). Null hypothesis significance testing: History, criticisms and alternatives. Unpublished master's thesis, York University, Toronto, Ontario, Canada.
Dewey, J. (1889). Galton's statistical methods [Review of the book Natural inheritance]. Publications of the American Statistical Association, 1(7), 331-334.
Dodge, R. (1912). The theory and limitations of introspection. The American Journal of Psychology, 23(2), 214-229.
Dunlap, K. (1912). The case against introspection. Psychological Review, 19(5), 404-413.
Edgeworth, F. Y. (1893). Statistical correlation between social phenomena. Journal of the Royal Statistical Society, 56(4), 670-675.
Eisenhart, C. (1979). On the transition from "Student's" z to "Student's" t. The American Statistician, 33(1), 6-10.
Fancher, R. E. (1996). Pioneers of psychology (3rd ed.). New York: W. W. Norton & Company Inc.

Fechner, G. T. (1987). Outline of a new principle of mathematical psychology (E. Scheerer, Trans.). Psychological Research, 49, 203-207. (Original work published 1851)
Fechner, G. T. (1966). Elements of psychophysics: Vol. 1 (H. E. Adler, Trans.; D. H. Howes & E. G. Boring, Eds.). NY: Holt, Rinehart and Winston. (Original work published 1860)
Fisher, R. A. (1912). On an absolute criterion for fitting frequency curves. Messenger of Mathematics, 41, 155-160.
Fisher, R. A. (1915). Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population. Biometrika, 10, 507-521.
Fisher, R. A. (1922). On the interpretation of χ2 from contingency tables, and the calculation of P. Journal of the Royal Statistical Society, 85, 87-94.
Fisher, R. A. (1925). Applications of "Student's" distribution. Metron, 5, 90-104.
Fisher, R. A. (1928). Statistical methods for research workers (2nd ed.). Edinburgh, UK: Oliver and Boyd.
Fisher, R. A., & MacKenzie, W. A. (1923). Studies in crop variation. II. The manurial response of different potato varieties. Journal of Agricultural Science, 13, 311-320.
Fisher-Box, J. (1978). R. A. Fisher: The life of a scientist. New York: John Wiley & Sons.
Fisher-Box, J. (1981). Gosset, Fisher, and the t distribution. The American Statistician, 35(2), 61-66.
Fisher-Box, J. (1987). Guinness, Gosset, Fisher, and small samples. Statistical Science, 2(1), 45-52.
Galton, F. (1872). On the employment of meteorological statistics in determining the best course for a ship whose sailing qualities are known. Proceedings of the Royal Society of London, 21, 263-274.
Galton, F. (1879). Composite portraits, made by combining those of many different persons into a single resultant figure. The Journal of the Anthropological Institute of Great Britain and Ireland, 8, 132-144.
Galton, F. (1886). Regression towards mediocrity in hereditary stature. The Journal of the Anthropological Institute of Great Britain and Ireland, 15, 246-263.
Galton, F. (1888). Co-relations and their measurement, chiefly from anthropometric data. Proceedings of the Royal Society of London, 45, 135-145.
Galton, F. (1989). Kinship and correlation. Statistical Science, 4(2), 81-86. (Original work published 1889)
Galton, F. (1889). Natural inheritance [Electronic version]. NY: MacMillan & Co.
Galton, F. (1962). Hereditary genius (2nd ed.). Cleveland, OH: Meridian. (Original work published 1892)

Gaskill, H. V., & Cox, G. M. (1937). Patterns in emotional reactions: I. Respiration; the use of analysis of variance and covariance in psychological data. Journal of General Psychology, 15, 21-38.
Gaskill, H. V., & Cox, G. M. (1941). Patterns in emotional reactions: II. Heart rate and blood pressure. Journal of General Psychology, 24, 409-421.
Gigerenzer, G. (1987). Probabilistic thinking and the fight against subjectivity. In L. Krüger, G. Gigerenzer, & M. S. Morgan (Eds.), The probabilistic revolution: Vol. 2. Ideas in the sciences. Cambridge, MA: The MIT Press.
Gigerenzer, G., Krauss, S., & Vitouch, O. (2004). The null ritual: What you always wanted to know about significance testing but were too afraid to ask. In D. Kaplan (Ed.), The Sage handbook of quantitative methodology for the social sciences. Thousand Oaks, CA: Sage.
Gigerenzer, G., Swijtink, Z., Porter, T., Daston, L., Beatty, J., & Krüger, L. (1989). The empire of chance: How probability changed science and everyday life. New York, NY: Cambridge University Press.
Gliner, J. A., Leech, N. L., & Morgan, G. A. (2002). Problems with null hypothesis statistical testing (NHST): What do the textbooks say? The Journal of Experimental Education, 71(1), 83-92.
Green, C. D. (2004). Digging archaeology: Sources of Foucault's historiography [Electronic version]. Journal of the Interdisciplinary Crossroads, 1, 121-141.
Gundlach, H. (2007). What is a psychological instrument? In M. G. Ash & T. Sturm (Eds.), Psychology's territories: Historical and contemporary perspectives from different disciplines. Mahwah, NJ: Lawrence Erlbaum Associates.
Hall, G. S. (1901). The new psychology. Harper's Monthly Magazine, 103, 727-732.
Hall, G. S. (1919). Practical applications of psychology as developed by the war. Pedagogical Seminary, 26, 76-89.
Hall, G. S. (1924). Founders of modern psychology. NY: D. Appleton & Co.
Harper, R. S. (1950). The first psychological laboratory. Isis, 41(2), 158-161.
Hart, B., & Spearman, C. (1912). General ability, its existence and nature. British Journal of Psychology, 5(1), 51-84.
Hatfield, G. (2002). Psychology, philosophy and cognitive science: Reflections on the history and philosophy of experimental psychology. Mind & Language, 17(3), 207-232.
Heidelberger, M. (2004). Nature from within: Gustav Theodor Fechner and his psychophysical worldview. Pittsburgh, PA: University of Pittsburgh Press.
Heidelberger, M. (1987). Fechner's indeterminism: From freedom to laws of chance. In L. Krüger, L. Daston, & M. Heidelberger (Eds.), The probabilistic revolution: Vol. 1. Ideas in history. Cambridge, MA: The MIT Press.
Hogben, L. (1957/1970). Statistical prudence and statistical inference. In D. E. Morrison & R. E. Henkel (Eds.), The significance test controversy. Chicago, IL: Aldine Publishing.
Holzinger, K. J. (1945). Spearman as I knew him. Psychometrika, 10(4), 231-235.

Hornstein, G. A. (1988). Quantifying psychological phenomena: Debates, dilemmas, and implications. In J. G. Morawski (Ed.), The rise of experimentation in American psychology (pp. 1-34). New Haven, CT: Yale University Press.
Horst, P. (1932). Measurement relationship and correlation. The Journal of Philosophy, 29(23), 631-637.
Hotelling, H. (1942/1943). Dr. Peters' criticisms of Fisher's statistics. Journal of Educational Research, 36, 707-711.
Hotelling, H. (1951). The impact of R. A. Fisher on statistics. Journal of the American Statistical Association, 46(253), 35-46.
Howell, D. C. (2002). Statistical methods for psychology (5th ed.). CA: Wadsworth Group.
Hull, C. (1927). The correlation coefficient and its prognostic significance. Journal of Educational Research, 15(5), 327-338.
Irwin, J. O. (1934). Correlation methods in psychology. British Journal of Psychology, 25(1), 86-91.
James, W. (1890). The principles of psychology: Vol. 1 [Electronic version]. NY: Henry Holt & Co.
James, W. (1980). Review of Wundt's Principles of Physiological Psychology. In W. G. Bringmann, R. D. Tweney, & E. R. Hilgard (Eds.), Wundt studies: A centennial collection. Toronto, ON: C. J. Hogrefe, Inc. (Original work published 1875)

Jastrow, J. (1887). The psycho-physic law and star magnitudes. The American Journal of Psychology, 1(1), 112-127.
John, I. D. (1992). Statistics as rhetoric in psychology. Australian Psychologist, 27(3), 144-149.
Kant, I. (2003). Critique of pure reason (N. K. Smith, Trans.). NY: Palgrave MacMillan. (Original work published 1787)
Kaufman, A. S. (1998). Introduction to the special issue on statistical significance testing. Research in the Schools, 5, 1.
Kelley, T. L. (1923). The principles and technique of mental measurement. The American Journal of Psychology, 34(3), 408-432.
Kirk, R. E. (1996). Practical significance: A concept whose time has come. Educational and Psychological Measurement, 56, 746-759.
Kirk, R. E. (2004). Promoting good statistical practices: Some suggestions. Educational and Psychological Measurement, 61, 213-218.
Kline, R. B. (2004). Beyond significance testing: Reforming data analysis methods in behavioral research. Washington, DC: American Psychological Association.
Krohn, W. O. (1892). Facilities in experimental psychology at the various German universities. The American Journal of Psychology, 4(4), 585-594.
Kupfersmid, J. (1988). Improving what is published. American Psychologist, 43, 635-642.
Kupfersmid, J., & Fiala, M. (1991). A survey of attitudes and behaviors of authors who publish in psychology and education journals. American Psychologist, 46(3), 249-250.

Ladd, G. T. (1894). President's address before the New York meeting of the American Psychological Association. The Psychological Review, 1(1), 1-21.
Leary, D. E. (1978). The philosophic development of the conception of psychology in Germany, 1780-1850. Journal of the History of the Behavioral Sciences, 14, 113-121.
Leith, J. D. (1936). Error in "Error in the use of the standard error" by W. R. Van Voorhis. Journal of Educational Psychology, 27, 556-557.
Lewin, K. (1936). Principles of topological psychology (F. Heider & G. Heider, Trans.) [Electronic version]. NY: McGraw-Hill.
MacKenzie, D. (1981). Statistics in Britain 1865-1930. Edinburgh: Edinburgh University Press.
Masin, S. C., Zudini, V., & Antonelli, M. (2009). Early alternative derivations of Fechner's law. Journal of the History of the Behavioral Sciences, 45(1), 56-65.
McMullen, L. (1939). "Student" as a man. Biometrika, 30(3/4), 205-210.
McDougall, W. (1898). A contribution towards an improvement in psychological method. Mind, 7, 15-33.
Meehl, P. E. (1967/1970). Theory-testing in psychology and physics: A methodological paradox. In D. E. Morrison & R. E. Henkel (Eds.), The significance test controversy. Chicago, IL: Aldine Publishing.
Meehl, P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46, 806-834.
Melton, A. W. (1962). Editorial. Journal of Experimental Psychology, 64(6), 553-557.
Michell, J. (1999). Measurement in psychology: A critical history of a methodological concept. New York: Cambridge University Press.
Michell, J. (2008). Is psychometrics pathological science? Measurement: Interdisciplinary Research & Perspective, 6(1), 7-24.
Miner, J. B. (1912). Correlation. Psychological Bulletin, 9(6), 222-231.
Miner, J. B. (1913). Correlation. Psychological Bulletin, 10(11), 425-433.
Miner, J. B. (1914). Correlation. Psychological Bulletin, 11(5), 177-185.
Miner, J. B. (1915). Correlation. Psychological Bulletin, 12(5), 179-186.
Miner, J. B. (1916). Correlation. Psychological Bulletin, 13(5), 208-215.
Miner, J. B. (1917). Correlation. Psychological Bulletin, 14(5), 176-185.
Miner, J. B. (1918). Correlation. Psychological Bulletin, 15(4), 114-122.
Miner, J. B. (1919). Correlation. Psychological Bulletin, 16(11), 382-389.
Miner, J. B. (1920). Correlation. Psychological Bulletin, 17(11), 388-396.
Morawski, J. G. (1992). There is more to our history of giving: The place of introductory textbooks in American psychology. American Psychologist, 47, 161-169.
Morrison, D. E., & Henkel, R. E. (1970). The significance test controversy. Chicago, IL: Aldine Publishing.

Münsterberg, H. (1898). The danger from experimental psychology. Atlantic Monthly, 81, 159-167.
Münsterberg, H. (1899). Psychology. Science, 9(212), 91-93.
Murray, D. J. (1987). A perspective for viewing the integration of probability theory into psychology. In L. Krüger, G. Gigerenzer, & M. S. Morgan (Eds.), The probabilistic revolution: Vol. 2. Ideas in the sciences. Cambridge, MA: The MIT Press.
Murray, D. J. (1990). Fechner's later psychophysics. Canadian Psychology/Psychologie canadienne, 31(1), 54-60.
Neyman, J. (1951). Fisher's collected papers [Review of the book Contributions to mathematical statistics by R. A. Fisher]. The Scientific Monthly, 72(6), 406-408.
Neyman, J., & Pearson, E. S. (1933). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 231, 289-337.
Niall, K. K. (1995). Conventions of measurement in psychophysics: von Kries on the so-called psychophysical law. Spatial Vision, 9(3), 275-305.
Nicholls, E. E. (1923). Performances in certain mental tests of children classified as underweight and normal. Journal of Comparative Psychology, 3(3), 147-179.
Nicolas, S., & Ferrand, L. (1999). Wundt's laboratory at Leipzig in 1891. History of Psychology, 2(3), 194-203.

Nunnally, J. (1960). The place of statistics in psychology. Educational and Psychological Measurement, 20(4), 641-650.
Olesko, K. M. (1995). The meaning of precision: The exact sensibility in early nineteenth-century Germany. In M. N. Wise (Ed.), The values of precision (pp. 103-134). Princeton, NJ: Princeton University Press.
Pearson, E. S. (1939). "Student" as statistician. Biometrika, 30(3/4), 210-250.
Pearson, E. S. (1974). Memoirs of the impact of Fisher's work in the 1920s. International Statistical Review, 42(1), 5-8.
Pearson, K. (1893). Contributions to the mathematical theory of evolution. Journal of the Royal Statistical Society, 56(4), 675-679.
Pearson, K. (1895). Contributions to the mathematical theory of evolution. II. Skew variation in homogeneous material. Philosophical Transactions of the Royal Society of London, 186, 343-414.
Pearson, K. (1896/1897). Mathematical contributions to the theory of evolution: On a form of spurious correlation which may arise when indices are used in the measurement of organs. Proceedings of the Royal Society of London, 60, 489-498.
Pearson, K. (1900a). On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine, 50, 157-175.
Pearson, K. (1900b). The grammar of science (2nd ed.). London: Adam and Charles Black.
Pillsbury, W. B. (1904). A suggestion toward a reinterpretation of introspection. The Journal of Philosophy, Psychology and Scientific Methods, 1(9), 225-228.
Porter, T. M. (2004). Karl Pearson: The scientific life in a statistical age. Princeton, NJ: Princeton University Press.
Reitz, W. (1934). Statistical techniques for the study of institutional differences. Journal of Experimental Education, 5(1), 11-24.
Rodgers, J. L. (2010). The epistemology of mathematical and statistical modeling: A quiet methodological revolution. American Psychologist, 65(1), 1-12.
Rovine, M. J., & Anderson, D. R. (2004). Peirce and Bowditch: An American contribution to correlation and regression. The American Statistician, 58(3), 232-236.
Rozeboom, W. W. (1960). The fallacy of the null-hypothesis significance test. Psychological Bulletin, 57(5), 416-428.
Rucci, A. J., & Tweney, R. D. (1980). Analysis of variance and the "second discipline" of scientific psychology: A historical account. Psychological Bulletin, 87(1), 166-184.
Salsburg, D. (2001). The lady tasting tea: How statistics revolutionized science in the twentieth century. New York, NY: W. H. Freeman and Company.
Scripture, E. W. (1894). Accurate work in psychology. The American Journal of Psychology, 6(3), 427-430.
Sedlmeier, P., & Gigerenzer, G. (1989). Do studies of statistical power have an effect on the power of studies? Psychological Bulletin, 105(2), 309-316.
Sedlmeier, P., & Gigerenzer, G. (1997). Intuitions about sample size: The empirical law of large numbers. Journal of Behavioral Decision Making, 10, 33-51.
Selvin, H. C. (1957/1970). A critique of tests of significance in survey research. In D. E. Morrison & R. E. Henkel (Eds.), The significance test controversy. Chicago, IL: Aldine Publishing.
Sharp, E. (1899). Individual psychology: A study in psychological method. The American Journal of Psychology, 10(3), 329-391.
Shen, E. (1940). Experimental design and statistical treatment in educational research. Journal of Experimental Education, 8(3), 346-353.
Sheynin, O. (2004). Fechner as a statistician. British Journal of Mathematical and Statistical Psychology, 57, 53-72.
Smith, L. D., Best, L. A., Cylke, V. A., & Stubbs, D. A. (2000). Psychology without p values: Data analysis at the turn of the 19th century. American Psychologist, 55(2), 260-263.

Spearman, C. (1904a). The proof and measurement of the association between two things. The American Journal of Psychology, 15(1), 72-101.
Spearman, C. (1904b). "General intelligence," objectively determined and measured. The American Journal of Psychology, 15(2), 201-292.
Spearman, C. (1914). The theory of two factors. Psychological Review, 21(2), 101-115.
Stevens, S. S. (1958). Measurement and man. Science, 127(3295), 383-389.
Stevens, S. S. (1959). Measurement, psychophysics, and utility. In C. W. Churchman & P. Ratoosh (Eds.), Measurement: Definitions and theories. NY: John Wiley & Sons.
Stigler, S. (1989). Francis Galton's account of the invention of correlation. Statistical Science, 4(2), 73-86.
Stoke, S. M. (1946). Undergraduate training for graduate students of psychology. American Psychologist, 1(4), 113-115.
Student. (1908a). The probable error of a mean. Biometrika, 6(1), 1-25.
Student. (1908b). Probable error of a correlation coefficient. Biometrika, 6(2/3), 302-310.
Sturm, T., & Ash, M. G. (2005). Roles of instruments in psychological research. History of Psychology, 8(1), 3-34.
Sully, J. (1881). Illusions of introspection. Mind, 6(21), 1-18.
Swijtink, Z. G. (1987). The objectification of observation: Measurement and statistical methods in the nineteenth century. In L. Krüger, L. Daston, & M. Heidelberger (Eds.), The probabilistic revolution: Vol. 1. Ideas in history. Cambridge, MA: The MIT Press.
Teo, T. (2007). Local institutionalization, discontinuity, and German textbooks of psychology, 1816-1854. Journal of the History of the Behavioral Sciences, 43(2), 135-157.
Teo, T. (2008). From speculation to epistemological violence: A critical-hermeneutic reconstruction. Theory & Psychology, 18(1), 47-67.
Thorndike, E. L. (1913). An introduction to the theory of mental and social measurements (2nd ed.) [Electronic version]. NY: Teachers College, Columbia University.

Tinker, M. A. (1980). Wundt's doctoral students and their theses 1875-1920. In W. G. Bringmann, R. D. Tweney, & E. R. Hilgard (Eds.), Wundt studies: A centennial collection. Toronto, ON: C. J. Hogrefe, Inc.
Titchener, E. B. (1912). Prolegomena to a study of introspection. The American Journal of Psychology, 23(3), 427-448.
Titchener, E. B. (1910). A text-book of psychology: Part II [Electronic version]. NY: MacMillan.
Trafimow, D., & Rice, S. (2009). A test of the null hypothesis significance testing procedure correlation argument. The Journal of General Psychology, 136(3), 261-269.
Trendler, G. (2009). Measurement theory, psychology and the revolution that cannot happen. Theory & Psychology, 19(5), 579-599.
Tukey, J. W. (1954). Unsolved problems of experimental statistics. Journal of the American Statistical Association, 49(268), 706-731.
Ullmann, C. A. (1947). The training of clinical psychologists. American Psychologist, 2(5), 173-175.
Vacha-Haase, T. (2001). Statistical significance should not be considered one of life's guarantees: Effect sizes are needed. Educational and Psychological Measurement, 61(2), 219-224.
Varon, E. J. (1936). Alfred Binet's concept of intelligence. Psychological Review, 43(1), 32-58.
Watson, J. B. (1913). Psychology as the behaviorist views it. Psychological Review, 20(2), 158-177.
Weldon, W. F. R. (1893). Report of the committee, consisting of Mr. Galton (Chairman), Mr. F. Darwin, Professor Macalister, Professor Meldola, Professor Poulton, and Professor Weldon, "for conducting statistical inquiries into the measurable characteristics of plants and animals." Part I. "An attempt to measure the death-rate due to the selective destruction of Carcinus moenas with respect to a particular dimension." Proceedings of the Royal Society of London, 57, 360-379.
Wilkinson, L., & the Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594-604.
Winston, A. S. (2001). Cause into function: Ernst Mach and the construction of explanation in psychology. In C. D. Green, M. Shore, & T. Teo (Eds.), The transformation of psychology: Influences of 19th-century philosophy, technology, and natural science (pp. 107-131). Washington, DC: American Psychological Association.
Wundt, W. (1907). Outlines of psychology (C. H. Judd, Trans.). NY: Wilhelm Engelmann.
Wundt, W. (1969). Principles of physiological psychology (E. B. Titchener, Trans.). NY: The MacMillan Co. (Original work published 1902)
Yates, F., & Mather, K. (1963). Ronald Aylmer Fisher. Biographical Memoirs of Fellows of the Royal Society of London, 9, 91-120.
Ziliak, S. T., & McCloskey, D. N. (2008). The cult of statistical significance: How the standard error costs us jobs, justice and lives. Ann Arbor, MI: The University of Michigan Press.
Zubin, J. (1936). Note on a graphic method for determining the significance of the difference between group frequencies. Journal of Educational Psychology, 27(6), 431-444.
Zupan, M. L. (1976). The conceptual development of quantification in experimental psychology. Journal of the History of the Behavioral Sciences, 12, 145-158.