
UC San Diego Previously Published Works

Title: How persuasive is a good fit? A comment on theory testing

Permalink: https://escholarship.org/uc/item/5vt0z72k

Journal: Psychological Review, 107(2)

Authors: Roberts, Seth; Pashler, Harold

Publication Date: 2000

Peer reviewed

Psychological Review (in press)

Running head: Testing Theories With Free Parameters

How Persuasive is a Good Fit?

Seth Roberts, University of California, Berkeley
Harold Pashler, University of California, San Diego

Quantitative theories with free parameters often gain credence when they "fit" data closely. This is a mistake, we argue. A good fit reveals nothing about (a) the flexibility of the theory (how much it cannot fit), (b) the variability of the data (how firmly the data rule out what the theory cannot fit), and (c) the likelihood of other outcomes (perhaps the theory could have fit any plausible result)–and a reader needs to know all three to decide how much the fit should increase belief in the theory. As far as we can tell, the use of good fits as evidence receives no support from philosophers of science nor from the history of psychology; we have been unable to find examples of a theory supported mainly by good fits that has led to demonstrable progress. We consider and rebut arguments used to defend the use of good fits as evidence–for example, that a good fit is meaningful when the number of free parameters is small compared to the number of data points, or when one model fits better than others. A better way to test a theory with free parameters is to (a) determine how the theory constrains possible outcomes (i.e., what it predicts); (b) assess how firmly actual outcomes agree with those constraints; and (c) determine if plausible alternative outcomes would have been inconsistent with the theory, allowing for the variability of the data.

How Persuasive is a Good Fit?

Many quantitative psychological theories with free parameters are supported mainly or entirely by demonstrations that they can "fit" data–that the parameters can be adjusted so that the output of the theory resembles actual results. The similarity is often shown via a graph with two functions: one labeled observed (or data), the other labeled predicted (or theory or simulated). That the theory fits data is supposed to show that the theory should be taken seriously–should be published, for example.

This type of argument is common; judging from a search of Psychological Abstracts, the research literature probably contains thousands of examples. Early instances involved sensory processes (Hecht, 1934) and animal learning (Hull, 1943), but it is now used in many areas. Here are three recent examples:


1. Cohen, Dunbar, and McClelland (1990) proposed a parallel-distributed-processing model to explain the Stroop effect and related data. The model was meant to embody a "continuous" view of automaticity, in contrast to an "all-or-none" (p. 332) view. The model contained many adjustable parameters, including number of units per module, ratio of training frequencies, learning rate, maximum response time, initial input weights, indirect pathway strengths, cascade rate, noise, magnitude of attentional influence (two parameters), and response-mechanism parameters (three). The model was fit to six data sets. Some parameters (e.g., number of units per module) were separately adjusted for each data set; other parameters were adjusted based on one data set and held constant for the rest. The function relating cycle time (model) to average reaction time (observed) was always linear, but its slope and intercept varied from one data set to the next. That the model could fit several data sets led the authors to conclude that, compared to the all-or-none view, "a more useful approach is to consider automaticity in terms of a continuum" (Cohen et al., 1990, p. 357)–although they did not try to fit a model based on the all-or-none view.

2. Zhuikov, Couvillon, and Bitterman (1994) presented a theory to explain goldfish avoidance conditioning. It is a quantitative version of Mowrer's two-process theory, in which some responses are generated by fear, some by reinforcement. When some simplifying assumptions are made, the theory has three equations and six adjustable parameters. The authors fit the theory to data from four experiments and concluded that "the good fit suggests that the theory is worth developing further" (Zhuikov, Couvillon, & Bitterman, 1994, p. 32).

3. Rodgers and Rowe (1993) proposed a theory that explains how teenagers come to engage in various sexual behaviors for the first time. It emphasizes contact with other teenagers–a "contagion" (p. 479) explanation. The theory has eight equations with twelve free parameters. Rodgers and Rowe fitted the theory to survey data about the prevalence of kissing, petting, and intercourse in boys and girls of different ages and races and concluded that the theory "appears to have successfully captured many of the patterns in two empirical data sets" (p. 505). This success was the main support for the theory.

Why the Use of Good Fits as Evidence Is Wrong

This type of argument has three serious problems. First, what the theory predicts–how much it constrains the fitted data–is unclear. Theorists who use good fits as evidence seem to reason as follows: if our theory is correct, it will be able to fit the data; our theory fits the data; therefore it is more likely that our theory is correct. However, if a theory does not constrain possible outcomes, the fit is meaningless.

A prediction is a statement of what a theory does and does not allow. When a theory has adjustable parameters, a particular fit is just one example of what it allows. To know what a theory predicts for a particular measurement, you need to know all of what it allows (what else it can fit) and all of what it does not allow (what it cannot fit). For example, suppose two measures are positively correlated, and it is shown that a certain theory can produce such a relation–that is, can fit the data. This does not show that the theory predicts the correlation. A theory predicts such a relation only if it could not fit other possible relations between the two measures–zero correlation, negative correlation–and this is not shown by fitting a positive correlation.

When a theory does constrain possible outcomes, it is necessary to know how much. The more constraint–the narrower the prediction–the more impressive a confirmation of the constraint (e.g., Meehl, 1997). Without knowing how much a theory constrains possible outcomes, you cannot know how impressed to be when observation and theory are consistent.

Second, the variability of the data (e.g., between-subject variation) is unclear. How firmly do the data agree with the predictions of the theory? Are they compatible with the outcomes that the theory rules out? The more conclusively the data rule out what the theory rules out, the more impressive the confirmation. For example, suppose a theory predicts that a certain measure should be greater than zero. If the measure is greater than zero, the shorter the confidence interval, the more impressive the confirmation. That a theory fits data does not show how firmly the data rule out outcomes inconsistent with the theory; without this information, you cannot know how impressed to be that theory and observation are consistent.

Adding error bars may not solve this problem; it is variability on the constrained dimension(s) that matters. For example, suppose a theory predicts that several points will lie on a straight line. To judge the accuracy of this prediction, the reader needs to know the variability of a measure of curvature (or some other measure of non-linearity). Adding vertical error bars to each point is a poor substitute (unless the answer, linear or non-linear, is very clear); the vertical position of the points is not what the theory predicts.

To further illustrate these points, Figure 1 shows four ways a "two-dimensional" prediction–a constraint involving two measures at once–can be compatible with data. Measures A and B in Figure 1 are both derived from measurements of behavior. Either might be quite simple (e.g., trials to criterion) or relatively complex (the quadratic component of a fitted function); it does not matter. The axis of each measure covers the entire range of plausible values of the measure before the experiment is done (e.g., from 0 to 1, if the measure is a probability). The dotted area shows the predictions of the theory, the range of outcomes that are consistent with the theory. In the two upper panels of Figure 1, the theory tightly constrains possible outcomes; in the two lower panels, it does not. In each case there is one data point. In the two left-hand panels, the observations tightly constrain the population value; in the two right-hand panels, they do not. In every case, the data are consistent with the theory (the data point is within the dotted area), which means in every case the theory can closely fit the data. But only the situation in the upper left panel is substantial evidence for the theory.

Figure 1: Four possible relationships between theory and data. (Measures A and B are both measures of behavior. For both measures, the axes cover the whole range of possible values. The dotted areas indicate the range of outcomes that would be consistent with the theory. The error bars indicate standard errors. In every case, the theory can closely fit the data, but only when both theory and data provide substantial constraints does this provide significant evidence for the theory.)

Third, the a-priori likelihood that the theory will fit--the likelihood it will fit whether or not it is true--is ignored. Perhaps the theory could fit any plausible result. It is well-known that a theory gains more support from the correct prediction of an unlikely event than from the correct prediction of something that was expected anyway. Lakatos (1978) made this point vividly: "It is no success for Newtonian theory that stones, when dropped, fall towards the earth, no matter how often this is repeated. . . . What really count are [the confirmation of] dramatic, unexpected, stunning predictions" (p. 6), such as the return of Halley's comet. "All the research programmes [i.e., theories] I admire have one characteristic in common. They all predict novel facts, facts which had been either undreamt of, or have indeed been contradicted [i.e., predicted to not occur] by previous or rival programmes" (p. 5).

Bayes's Theorem, interpreted as a statement about degrees of belief, is a quantitative version of this idea. Bayes's Theorem is

    P(H | E) = [P(E | H) / P(E)] P(H),

where H (hypothesis) is a theory and E (event) is a particular outcome (Howson & Urbach, 1993, p. 28). P(H) is the plausibility of H before data collection, P(E) is the perceived likelihood of E before data collection, P(E | H) is the likelihood of E given that H is true, and P(H | E) is the plausibility of H after data collection–after E has been observed. When E is a prediction of H, P(E | H) = 1. Thus, according to this theorem, when P(E) is large–close to 1–observation of E will have little effect on belief in H. "Strong inference" experiments (Platt, 1964)–where different theories make contradictory predictions–are a practical application of this idea. They embody the notion that the best evidence for a theory is evidence that would be otherwise unlikely. For more discussion of the importance of the a-priori likelihood of a prediction, see Howson and Urbach (1993, especially pp. 123-126).
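A small numerical illustration of this point (the numbers are ours, chosen only for illustration): suppose a theory with prior plausibility P(H) = .5 makes a prediction E that is then confirmed, so P(E | H) = 1.

    def posterior(p_h, p_e, p_e_given_h=1.0):
        # Bayes's Theorem: P(H | E) = P(E | H) * P(H) / P(E)
        return p_e_given_h * p_h / p_e

    print(posterior(p_h=0.5, p_e=0.95))  # expected outcome: belief barely moves (~.53)
    print(posterior(p_h=0.5, p_e=0.55))  # surprising outcome: belief rises sharply (~.91)

Confirming a nearly certain outcome leaves belief almost unchanged; confirming a surprising one nearly doubles it.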

This principle–predictions should be surprising–is relevant to psychology because psychological data is often not surprising. Therefore prediction of such data cannot provide much support for any theory. Quantitative theories are usually fit to functions: a measure of behavior (y) recorded at several values of a procedural variable (x)–for example, probability of correct recall as a function of retention interval. It is never plausible that the points on the function are independent of each other, in the sense that knowing the y values of some of the points does not help you predict the y values of the rest of the points. And the lack of independence is not trivial; inevitably the plausible outcomes are a tiny fraction of the possible outcomes.

The need to make predictions that are at least a little implausible seems to have been overlooked by quantitative theorists. When a theory with three free parameters is used to fit a function with 20 data points–20 (x, y) pairs–it is obvious that the theory must somehow constrain the function; it could not fit all possible functions with 20 points (keeping the x values fixed but allowing the y values to vary). Plainly, some results would contradict the theory. This seems to have been the sort of reasoning, either implicit or explicit, that has convinced theorists and reviewers that the data provide a test of the theory. But whether any plausible results would contradict the theory is not so clear.

An especially simple example of the problem involves asymptotic behavior. Suppose a learning experiment measured percent correct as a function of trial number. Performance improved for several trials but eventually–say, after 15 trials–leveled off at a value less than 100% correct–say, 93%. To fit these data, a theory will presumably need a parameter that somehow corresponds to the asymptotic level of performance (93% correct) and a parameter that corresponds to when this level is reached (after 15 trials). It needs these two adjustable parameters because both aspects of the data, 15 trials and 93%, surely depend on procedural details. Yet once these two parameters are properly set, the theory will accurately predict performance at an unlimited number of trial numbers: It will predict 93% correct on Trial 16, on Trial 17, etc. If the experiment measured asymptotic performance for 50 trials (Trials 16-65), a theory–any theory–could quite accurately predict 50 data points with just two free parameters. Yet this success would add nothing to the theory's credibility.

A defender of the use of good fits as evidence might reply that fits are often judged by the percentage of variance explained, which fitting the same data value (e.g., 93%) many times does not increase very much. However, the problem does not go away when the fitted data vary. The functions used to assess psychological theories are almost always "smooth," in the sense that if you know, for example, the extreme y values and alternate intermediate values (e.g., if x = 1, 2, . . . , 9, the values of y for x = 1, 3, 5, 7, and 9), you can estimate the remaining values of y quite closely by linear interpolation. This means that any theory that does a good job of fitting about half of the data will do a good job of fitting the other half–regardless of the theory's correctness. Suppose, for example, the function consists of 9 points at x = 1, 2, . . ., 9. A theory with five orthogonal parameters is fit to the data for x = 1, 3, 5, 7, and 9, which it will fit perfectly. (The n parameters of a formula or theory are orthogonal if the function can fit exactly n data points. For example, a + bx has two orthogonal parameters, but ax + bx does not.) Then the fitted parameters are used to predict the values of y for x = 2, 4, 6, and 8. The theory–any plausible theory–will do a very good job. Although it is standard practice to say there were four degrees of freedom with which to "test" the theory, this is misleading; the goodness of fit was inevitable, and therefore provides no support for the theory.

The smoothness of almost any psychological function seems inevitable (i.e., is very plausible before measurement) because of both (a) previous data (the number of smooth functions in that area of research is large; the number of jagged functions is zero or near zero) and (b) extant theories (which predict smooth functions). If jagged functions began to be observed, or if plausible theories predicted jagged functions, then–and only then–would the prediction that a function will be smooth be interesting.

But the heart of the problem is not using constant functions or smooth functions to test theories; it is using functions that have simple shapes. Most functions measured by psychologists, and most of the functions to which quantitative theories are fit, are concave up, concave down, or indeterminate between the two (i.e., close to linear). For example, learning curves (performance as a function of number of training trials) and retention functions (memory as a function of time since learning) usually fit this description. With typical amounts of data, we suspect, several equations with three orthogonal parameters, such as a quadratic equation, will fit reasonably well. The residuals may appear systematic, but the remaining structure (the structure in the residuals after the three-parameter fit is removed) will probably be impossible to detect reliably. The number of psychological research reports that have found a reliable cubic component, or reliable structure in the residuals after a three-parameter fit is removed, is very low; we do not know of even one example. For indications of the typical precision of retention functions, see Rubin and Wenzel (1996) and Rubin, Hinton, and Wenzel (in press).
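The inevitability of the interpolation argument above is easy to verify by simulation. The following sketch (our construction, not from the original article) fits a five-parameter polynomial, standing in for any theory with five orthogonal parameters, to the odd-numbered points of a hypothetical smooth learning curve and then "predicts" the even-numbered points:

    import numpy as np

    x = np.arange(1, 10)                      # 9 points, x = 1, 2, ..., 9
    y = 1 - 0.9 * np.exp(-0.4 * x)            # a hypothetical smooth learning curve
    coef = np.polyfit(x[::2], y[::2], deg=4)  # five orthogonal parameters: exact fit
    pred = np.polyval(coef, x[1::2])          # "predict" the held-out points
    print(np.max(np.abs(pred - y[1::2])))     # tiny error, whatever process generated y

Because the function is smooth, the held-out points are predicted almost perfectly no matter what generated the data.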

The practical effect of these considerations is that such functions can usually provide only a little guidance in choosing a theory–regardless of how many points they contain. The "first-degree" structure (overall level) is uninteresting; the sign of the "second-degree" structure (slope) is usually obvious (e.g., memory decays with time, performance improves with practice) and its size is uninteresting (because it will presumably depend on procedural details not covered by theory); and the "fourth-degree" and higher structures cannot be made out. That leaves the "third-degree" structure (curvature) as a source of guidance. If the data were remarkably close to linear on some scale (the original y or x scales, or some transformation, such as logarithmic, of either or both), that would be quite useful because most two-parameter theories would fail to predict it (they would produce only curved functions on that scale); but that is rare. If the data were convincingly concave up (say), and this is not due to floor or ceiling effects, the best one can do is determine what sort of theories do not predict this, i.e., what this finding rules out; perhaps it will cut the number of plausible candidate theories in half. That is progress, of course, but it cannot strongly favor any one theory. (The difficulty of extracting much information from the usual functions suggests that theorists should also look for predictions that relate two measures of behavior–as in Figure 2, described below.)

Figure 2: A prediction of a version of Ashby's (1982) cascade model, and some data. (Each of the many small points is derived from the results of a simulated 2 x 2 experiment. The large points, with standard-error bars, are from actual experiments. From Roberts & Sternberg [1993].)

It matters that the plausible outcomes are a small fraction of the possible outcomes because the plausible theories are crowded into the same small space, in the sense that they can predict the plausible outcomes and no others (e.g., they can only predict smooth functions). In the early days of chemistry, it was repeatedly determined that when hydrogen gas and oxygen gas combined to form water, the volume of oxygen used up was very close to half the volume of the hydrogen used up (Ihde, 1964). After several repetitions of this result, it became the only plausible, in the sense of unsurprising, outcome of those measurements. However, the predictions of plausible theories of the composition of water (HO? HO2? H2O?) remained scattered–that is, predicted a wide range of combining ratios. This is why the actual ratio could be used to choose between them. In contrast, the psychological results we have been discussing–behavior at asymptote, smooth functions, functions with simple shapes–are both (a) likely on the basis of experience and (b) easily explained.


When performance reaches asymptote and stays there–no sudden drops–we are not only not surprised, we are not puzzled. It is easy to think of a theory of learning that predicts that after performance reaches asymptote it will stay there; indeed, it is hard to think of a theory that predicts anything else. When a function turns out to be smooth, this is not only unsurprising but un-mysterious; it is hard to think of a theory that would not produce a smooth function. Likewise for functions with simple shapes: At the level of precision to which they are measured in most experiments, these results are not only unsurprising but could be produced by many different plausible theories.

Clearly, then, showing that a theory fits data is not enough. By itself, it is nearly meaningless. Because of the flexibility of many theories, the variability of measurements, and the simplicity of most psychological data functions, it is often quite possible that the theory could fit any plausible outcome to within the precision of the data. The reader has no way of knowing which panel of Figure 1 the evidence resembles.

Similar Criticisms

Criticisms of the use of good fits as evidence have been made by others, usually in the context of specific models (Coltheart & Coltheart, 1972; Hintzman, 1991; Johnston, van Santen, & Hale, 1985; Massaro, 1988; Roberts & Sternberg, 1993; Roediger & Neely, 1982; Wexler, 1978). When discussing specific models, these critics have often shown, or pointed out, not just that this sort of evidence may be misleading–as we argue–but that it has been misleading. These demonstrations fall into three categories:

1. A theory "fits too much"–it can generate such a wide range of outcomes that the fact that it can generate the actual results means little. Two examples: (a) Massaro (1988) showed that "a single connectionist model can simulate results that imply [i.e., were generated by] mutually exclusive psychological processes" (Massaro, 1988, p. 219). (b) Wexler (1978), reviewing Anderson's ACT theory, noted that "ACT can model not only the Sternberg result, but also its opposite, or anything else of the sort" (Wexler, 1978, p. 338). This flexibility makes the theory "so weak that there is no way to find evidence either for or against it" (Wexler, 1978, p. 346).

2. The same data can be closely fit by a similarly flexible theory making quite different assumptions. This means, of course, that the fits do not meaningfully support the assumptions of the theory. One example: Salasoo, Shiffrin, and Feustel (1985) found that a model with 14 free parameters could fit a variety of word-recognition data. Johnston, van Santen, and Hale (1985), using the Salasoo et al. data, showed that "a large family of rather different models" (Johnston et al., 1985, p. 507) with roughly the same number of free parameters could also fit the data. Johnston et al. (1985, p. 507) conclude, "because our models fit the data [equally well] assuming only one higher level memory representation, there is no support for the assumption [of Salasoo et al.'s model] that two kinds of memories–episodic and permanent–underlie the effects of repetition on identification."

3. Although a theory closely fits data, at least one of its assumptions is wrong. Three examples: (a) As pointed out by Coltheart and Coltheart (1972), the concept-learning model of Bower and Trabasso (1964) "achieved extraordinary correspondences between predicted and obtained results" (Coltheart & Coltheart, 1972, p. 294), yet one of the assumptions of the model (independence of path) turned out to be wrong (Trabasso & Bower, 1966). (b) Coltheart and Coltheart (1972) pointed out that four assumptions of Rumelhart's (1970) model of tachistoscopic recognition were incompatible with experimental evidence, yet the model fit data quite closely. Like us, Coltheart and Coltheart (1972) conclude "it is poor strategy to evaluate a mathematical theory only by assessing how well" (p. 294) it can fit data. (c) According to Hintzman (1991), Bower's (1961) model of paired-associate learning fit numerous data sets with "incredible precision" (p. 50) although its central assumption was evidently quite wrong.

Although each critique (with the exception of Coltheart and Coltheart's) focused on a particular theory, the diversity of the theories raises the possibility that the fundamental problem is not with any one theory or class of theories (e.g., connectionist theories are too flexible) but something broader. We suggest that the fundamental problem, as Coltheart and Coltheart (1972) argued, is a method of theory evaluation (fitting theories to data) so inadequate that serious flaws go undetected.

In the fields of statistics and computer science, a problem related to what we criticize here, called overfitting, has been familiar for many years (e.g., Anscombe, 1967; Leahy, 1994; Schaffer, 1993). The possibility of overfitting arises when a model that does a good job of fitting data performs poorly in other ways. For instance, a neural-network program trained to classify fruit using one sample eventually achieved 90% accuracy, but did much worse with a second sample from the same population (Donlin & Child, 1992). Overfitting occurs when the model is too flexible. Such experiences have taught statisticians and computer scientists that models should not be judged only by how well they fit a data set; there must be assessment of, and penalty for, flexibility (e.g., Hurvich, 1997).
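The phenomenon is easy to reproduce (a minimal sketch of overfitting in general, not of the fruit-classification study): a maximally flexible model fits one sample almost perfectly and does much worse on a second sample from the same population.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 10)
    sample = lambda: 2 * x + rng.normal(0, 0.5, x.size)  # true relation is linear
    train, test = sample(), sample()
    wiggly = np.polyval(np.polyfit(x, train, deg=9), x)  # 10 parameters, 10 points
    print(np.mean((wiggly - train) ** 2))  # essentially 0: fits the noise exactly
    print(np.mean((wiggly - test) ** 2))   # much larger error on the new sample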
Although the arguments against the use of good fits as evidence strike us as overwhelming, we nevertheless try to present the other side, arguments in favor of the practice. In what follows, we consider several ways the use of good fits as evidence might conceivably be justified–by philosophy of science, the history of psychology, and arguments that the practice is acceptable when certain conditions are met.

Does Philosophy of Science Support the Use of Good Fits as Evidence?

Can the use of good fits to support theories be justified by some well-accepted doctrine in the philosophy of science? Philosophers of science do not appear to have considered this particular practice, but of course much has been written about the general question of how to test theories, with considerable consensus (Kitcher, 1993; Howson & Urbach, 1989). Suppose we have a theory that we want to test. According to this consensus, there are essentially two ways to do this.

First, we can test a prediction of the theory, i.e., make an observation that might yield results that would contradict the theory. Karl Popper, probably the most influential philosopher of science (Bondi, 1992), advocated "falsifiability" as the essential feature of scientific inquiry. According to Popper (1959), a theory must specify some possible observations that could falsify it, and a theory is supported by observations only if the observations might have had outcomes inconsistent with the theory.

Second, if there are competing (incompatible) explanations of the facts our theory explains, we can test a prediction of a competing theory. In many cases, alternative theories are incompatible, i.e., if one theory (T0) is correct, other explanations (T1, T2, etc.) of the same facts must be wrong. In these cases elimination of alternatives supports T0. This approach was first sketched by Bacon (1620/1960; Urbach, 1987).

If alternative theories exist and make differing predictions (e.g., one theory says a certain measurement should be zero, another theory says it should be positive), we can combine the two approaches and test a prediction of the theory and a prediction of a competing theory at the same time. When the two predictions are incompatible (non-overlapping), this is what Platt (1964) called strong inference. (Efficient inference might have been a better name. The results will not be decisive–"strong"–unless several other conditions are met.)

When it is claimed that a good fit supports a theory, what sort of test is this? Nothing is said about competing theories, eliminating the second method. Perhaps theorists who support theories with good fits to data believe that they have tested a prediction of the theory (the prediction that the theory will fit the data), a Popperian test. But they have not shown that, given the precision of the data, there were any plausible outcomes that the theory could not have fit.

Thus we do not find any support in the philosophy of science for the use of good fits to support theories.

Does the History of Psychology Support the Use of Good Fits as Evidence?

The use of close fits as evidence might be justified by showing that it has "worked"–led to demonstrable progress. We have searched the history of psychology for theories originally supported mainly or entirely by good fits to data that eventually found support from other sources (e.g., tests of specific assumptions, confirmation of new predictions). We have been unable to find even one example. Although several reviewers of this article have disagreed with our conclusions, they did not provide examples of such a theory.

An early example of the use of close fits by themselves to support a theory is Hecht's (1931) theory of color vision–a theory that is nowadays almost completely forgotten. In contrast, Hering's theory of color vision, based on quite different data, is still important (Hurvich, 1981). Another early example of the practice was Principles of Behavior (Hull, 1943), which may have been cited more often in the experimental-psychology literature of the 1940s and 1950s than any other work. In spite of numerous excellent fits, it seems fair to say that none of Hull's theoretical ideas supported by fitted curves is still influential. Mackintosh (1983), for instance, refers to the "legacy" (p. 2) of Thorndike, Pavlov, Konorski, and Tolman–but not Hull.

Later quantitative learning theories were much simpler than Hull's but still relied on good fits for support. In what Jenkins (1979) called a "ground-breaking paper" (p. 206), Estes (1950) used the following equation, derived from a theory of learning, to fit the mean latencies of some runway learning data, with L the latency to leave the start box and T the trial number:

[equation not reproduced in this copy]

The parameters 2.5, .9648, and -.12 were of course estimated from the data. According to Estes, the fit was "satisfactory" (p. 101). Satisfactory or not, a reader could not know what to make of this. The variability of the data was not shown, so it was unclear if the deviations were reliable. Nor was it clear that any plausible results could have contradicted the theory. Although many theorists seemed to have been impressed at the time–as Jenkins (1979) says, Estes's (1950) work led to many similar theories–later theorists were less impressed. A look at any recent animal-learning text suggests that the mathematical learning theorists of the 1950's and 60's, in spite of many successful fits, discovered nothing that formed the basis for current theories of learning.

The use of good fits as evidence probably received a boost from the advent of cheap and powerful computers, which made it much easier to search a large parameter space for the best fit. Connectionist theorizing, in particular, took advantage of the new flexibility in model-building that seemed to be available. An influential early paper in this area (Anderson, 1973) proposed an explanation of some reaction-time results with short memorized lists. Empirical support for the theory consisted almost entirely of demonstrations that it could fit a variety of data. The fits involved 5-8 free parameters, which changed from one data set to the next. It was unclear what the theory predicted, that is, what it could not fit; because the constraints were unclear, variability on the constrained dimensions was of course unclear. Because the number of data points was much larger than the number of free parameters, the theory surely ruled out many possible outcomes; but whether it ruled out any plausible outcomes was not clear.

An example of later work along these lines is Seidenberg and McClelland's (1989) theory of visual word recognition and pronunciation. Their goal was a connectionist model "that exhibited many of the basic phenomena of word recognition and naming" (p. 529). The evidence for the model consisted of numerous graphs that showed a close fit between two measures: reaction time (observed in experiments) and squared error (produced by the model). What the model could not fit was unclear.

In Hinton and Anderson (1981) and Rumelhart, McClelland, and the PDP Research Group (1986), the first influential books on connectionism, the issue of how to test such flexible theories received almost no attention. In spite of the popularity of connectionist models, and numerous good fits, we have yet to encounter even one such model whose predictions have been determined, much less verified or shown to rule out plausible results. Massaro (1988) made similar points. Without accurate predictions in cases where the prediction could have plausibly been wrong, the claim that connectionist theories have helped us understand the brain seems to rest entirely on belief in the assumptions of these theories.

So we do not find any support in the history of psychology for the use of good fits to support theories.

Defenses of the Use of Good Fits as Evidence

Many psychologists, we suspect, realize that not all good fits provide substantial support for a theory. Yet they believe that their example is sound because it satisfies certain conditions. Although the use of good fits as evidence may in general be flawed, they believe that in certain restricted situations it is helpful. Here we consider the arguments along these lines we have encountered most frequently.

Defense 1. A good fit is impressive when there are more observations in the data set than free parameters in the model. "A standard rule of thumb states that a model has too many [free] parameters to be testable if and only if it has at least as many parameters as empirically observable quantities" (Bamber & Van Santen, 1985, p. 443). For example, if a model has 5 free parameters and there are 20 data points, this supposedly gives 15 degrees of freedom for assessing the fit.

It is a generous rule of thumb. In fact, the number of free parameters in a theory provides an upper bound on its flexibility. If a theory has five orthogonal free parameters, then it will be able to fit exactly any five data points; if the parameters are not orthogonal, however, the number of data points the theory can fit exactly is less (as in the example given earlier, ax + bx, which has two parameters, a and b, but cannot fit any two data points). The more serious distortion, however, is the idea that the number of data points indicates the range of possible outcomes–that if there are ten data points, the possible outcomes could have plausibly been anywhere in a ten-dimensional space. As argued above, this is usually a great overstatement. A more realistic view is that most functions provide only one useful piece of information for testing theories: whether the function is concave up, nearly linear, or concave down (when the data are scaled so that all three possibilities are plausible).
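The contrast between a + bx and ax + bx noted above reduces to a one-line linear-algebra check (our illustration): the first model's design matrix has full rank, so it can match any two data points exactly; the second's columns are proportional, so it cannot.

    import numpy as np

    x = np.array([1.0, 2.0])                 # two arbitrary x values
    A1 = np.column_stack([np.ones(2), x])    # model a + b*x
    A2 = np.column_stack([x, x])             # model a*x + b*x = (a + b)*x
    print(np.linalg.matrix_rank(A1))         # 2: fits any two points exactly
    print(np.linalg.matrix_rank(A2))         # 1: only one effective parameter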

Defense 2. My model fits better than another model. Theorists often compare the fits produced by different models and assume that the best-fitting one deserves belief because it has won a kind of competition (e.g., Ashby & Lee, 1991; Atkinson & Crothers, 1964; Bush & Mosteller, 1959; Nosofsky, Kruschke, & McKinley, 1992). There are several problems with this approach. First, the best-fitting model may merely be the most flexible model rather than the best model (Collyer, 1985)–a lesson that statisticians and computer scientists learned long ago, as discussed above. To equate the flexibility of the theories being compared, psychologists sometimes adjust goodness-of-fit statistics according to a general formula (Akaike, 1974; Takane & Shibayama, 1992). Unfortunately, this is inadequate because the flexibility added by a free parameter depends on the details of the theory (compare ax + bx with ax + b–both have two parameters, but the latter is more flexible). The only accurate way to "allow" for the flexibility of a theory, as far as we know, is to determine what the theory predicts. Second, it takes no account of the variability of the data. Suppose, for example, that Theory X predicts that a certain measurement should be 5 while Theory Y predicts it should be 7. If the actual result is 5.5 ± 10, Theory X will fit better, yet there is no good reason to prefer it.

Fitting several plausible models to learn if any can be ruled out makes sense, especially when combined with an effort to find features of the data that are hard to fit. But this is not what is usually done. For example, Zhuikov, Couvillon, and Bitterman (1994) compared the fit of two models–the favored model and a simpler model–to some of their data. Because the simpler model was a subset of the favored model, with fewer free parameters, it was certain that the favored model would fit better, yet Zhuikov et al. apparently took this result to be meaningful. They did not show that the data ruled out the simpler model. Rodgers and Rowe (1993), in their study of teenage sexual behavior, fit two different models making somewhat different assumptions. Although "both models were consistent with the data according to chi-square tests" (p. 495), Rodgers and Rowe favored one of them.

Comparing the fit of several theories should not be confused with comparing their predictions, which is always worthwhile. In these fit-comparison situations, the predictions–that is, the constraints–of the various theories are not even determined, much less compared, at least in the examples we have seen.

Defense 3. The research and editorial processes protect readers from too-flexible models. During the theory-building process, the argument goes, many models are rejected because they cannot fit the data. When the theorist finally finds a model that can fit the data, he or she hurries to publish it, and does not describe in the publication all the failures. A problem with this argument is that a reader has no way of knowing if it is true; nor can the reader be sure that the published theory is no more flexible than the rejects. A similar argument is that reviewers can supposedly tell when a model is too flexible. Again, a reader has no way of knowing if this is true. The plausibility of contradictory outcomes, outcomes that the theory cannot fit, is crucial information that should be made explicit.

Better Ways to Judge Theories with Free Parameters

The problems described earlier have straightforward solutions.

Problem 1: What the theory predicts is unclear. Solution: Determine the predictions. Determining the predictions of a theory with free parameters requires varying each free parameter over its entire range, in all possible combinations (i.e., surveying the entire parameter space). For each combination of parameters (each point in the parameter space), the theory generates simulated behavior. The prediction of the theory, for any measure (any function of the observations, real or simulated), is the range of outcomes that the theory can produce (Sternberg, 1963, pp. 89-90). For example, suppose a theory has two free parameters: a, which can range from 0 to 10, and b, which can range from 0 to 1. To determine what the theory predicts for, say, trials to criterion, one would vary both a and b over their entire ranges, in all possible combinations (i.e., over the whole two-dimensional parameter space) and determine the predicted trials to criterion for each combination of parameter values (i.e., for each point in the parameter space). The prediction of the theory for this dimension of data would be the entire range of trials to criterion that the theory could produce. Using intuition, experience, and trial and error, the theorist must search among the many predictions of a theory to find those narrow enough to plausibly be falsified.
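In code, such a survey is a grid search whose output is a range rather than a best fit. A minimal sketch (the "theory" below is an invented stand-in with the parameter ranges from the example, not a real model):

    import numpy as np

    def trials_to_criterion(a, b):
        # Hypothetical stand-in for simulating the theory at one
        # point (a, b) in its parameter space.
        return a * (1 - b) + 1

    grid_a = np.linspace(0, 10, 101)  # a ranges from 0 to 10
    grid_b = np.linspace(0, 1, 101)   # b ranges from 0 to 1
    outcomes = [trials_to_criterion(a, b) for a in grid_a for b in grid_b]
    print(min(outcomes), max(outcomes))  # the prediction: the full attainable range

For a real theory, each grid point would require simulating behavior and computing the measure of interest; the prediction is the entire set of attainable values, not any single fitted curve.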

Problem 2: The variability of the data is unclear. Solution: Show the variability of the data. As discussed above, it is variability on the constrained dimensions that is important. This means that Problem 1 (predictions unclear) must be solved first.

Solutions to Problems 1 and 2 are illustrated by Roberts and Sternberg (1993), who tested Ashby's (1982) version of McClelland's (1979) cascade model. The tested version of Ashby's (1982) model had two free parameters–the time constants of two processes. Roberts and Sternberg varied those parameters over all plausible values they could have in a 2 x 2 experiment. Examination of simulated results covering the entire parameter space showed that a certain measure derived from reaction times (a main-effect difference statistic) was constrained by the model, and that this constraint varied with a second measure (a variance-change statistic). Both statistics vaguely resemble measures of interaction. Figure 2, from Roberts and Sternberg (1993), shows this prediction and some data. Each small point represents the results of a simulated 2 x 2 experiment; the area filled by these points is the prediction of the theory. The large points, with standard-error bars based on between-subjects variation, represent data. Some of the points fall within the predicted area, but none firmly; and several points fall firmly outside the predicted area, inconsistent with the model. Thus the model fails the test.

Problem 3: Perhaps the theory could fit any plausible result. Solution: Show that there are plausible results the theory cannot fit. It is not enough to show that there are some results the theory cannot fit; to meaningfully constrain the data, there must be some plausible results the theory cannot fit. Suppose you test a theory and discover that it accurately predicts the results–theory and data are consistent. Which quadrant of Figure 1 does the evidence resemble? To find out, you need to determine the range of plausible alternative results–predictions different from the prediction of the theory being tested. How we decide what is plausible is a big subject (e.g., Hogarth, 1980), but everyone agrees that both theory (beliefs about how the world works) and data (actual observations) are important, that we use both to judge the likelihood of future events. For example, Lakatos (1978), in the statement quoted earlier, mentions both. It is important, he says, that predictions be surprising–differ from "stones falling to earth when dropped" (expectations based on experience) or from expectations based on "rival programmes" (predictions of other theories).

Determining what other theories predict needs no explanation. However, the idea of determining what experience (unexplained by any theory) predicts may be unfamiliar. Earlier measurements similar to the current measurement may have generated a range of outcomes, which would suggest that the current measurement could have a similar range. Or earlier measurements may have suggested empirical generalizations that predict a specific value or range of values in the current case.

The range of plausible outcomes is the union of the predictions based on other plausible theories and expectations based on other data. For example, if other theories suggest the measurement might be 10 to 30, and other data suggest it might be 20 to 50, the plausible range is 10 to 50. For the observed consistency of theory and data to be meaningful, it is necessary only that some of this range fall outside what the tested theory predicts. Of course, the more of this range that the tested theory cannot explain, the more impressive the observed consistency. Because pointing out plausible alternatives is rare, many theorists may not have realized that doing so would strengthen the case for the theory they favor.

To compare plausible alternative outcomes with what the tested theory could explain, it is necessary to combine (a) the flexibility of the tested theory and (b) the variability of the actual results. As Figure 1 illustrates, the evidence will not be convincing if either is large compared to the range of plausible outcomes.

In practice, this comparison requires four steps. First, determine what the theory of interest predicts. For example, suppose it predicts that the measurement will be between 40 and 50. Second, determine the 95% confidence interval based on the data. Suppose the confidence interval is the average ± 10. Third, widen the prediction interval appropriately. In the example, the widened interval is 30 (40 - 10) to 60 (50 + 10). The new interval (30 to 60) is the range of results (i.e., averages) consistent with the theory, given the variability of the data. Unlike familiar intervals, the actual result will probably not be in the middle of the interval. Fourth, compare actual and plausible results to the widened interval. The results should increase belief in the theory only if the actual result is within the widened interval and at least one plausible alternative result is outside the widened interval.

Figure 3 shows two examples. The solid line shows what the theory predicts; the dotted lines extend the prediction to allow for the variability of the data. In both cases, the tested theory could closely fit the result. But only the left-hand pattern of results should increase belief in the theory.
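The four steps reduce to a few lines (a sketch using the example numbers above; the plausible alternative value is our own, hypothetical):

    def supports_theory(pred_lo, pred_hi, half_width, actual, alternatives):
        # Widen the predicted interval by the confidence half-width, then
        # require (1) the actual result inside the widened interval and
        # (2) at least one plausible alternative outside it.
        lo, hi = pred_lo - half_width, pred_hi + half_width
        inside = lo <= actual <= hi
        rules_out_something = any(not (lo <= alt <= hi) for alt in alternatives)
        return inside and rules_out_something

    # Theory predicts 40-50; the data give average +/- 10; suppose the
    # average is 47 and a plausible alternative outcome is 20.
    print(supports_theory(40, 50, 10, actual=47, alternatives=[20]))  # True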


Figure 3: How the plausibility of other results affects the interpretation of the observed results. (The solid lines indicate the prediction of the tested theory. Dotted lines, based on the variability of the data, indicate 95% confidence intervals.)

Sternberg's (1966) memory-scanning data allow a simple real-life illustration. In the varied-set procedure, the subject saw a list of one to six digits. After a brief delay, the subject saw a test digit and indicated as quickly as possible whether it was on the list. The measure of interest was the reaction time to respond "yes" or "no." Mean reaction time increased with list length. An interesting theoretical question is whether the results support a theory of serial memory scanning, a simple version of which implies that the increase should be linear with list length. All possible outcomes were not equally plausible, of course. Based on previous results, it was quite likely, before the experiment was done, that reaction time would change monotonically with list length–for example, that the reaction time with list length 2 would be between the reaction time for list length 1 and the reaction time for list length 3 (within experimental error). This restriction should be taken into account when deciding how impressed to be with observed linearity–or, more precisely, a failure to reject the hypothesis of linearity–because a large fraction of the results that would have rejected that hypothesis were implausible. To not take this into account would give the hypothesis of linearity an undeserved boost.

A realistic assessment of the evidence for linearity thus requires a plausible alternative prediction (or range of predictions). One alternative is provided by the empirical generalization that reaction time is linear with the logarithm of the number of stimulus-response combinations (Hick, 1952; Hyman, 1953). Considering each stimulus-response combination as one item (or two items) to be remembered suggests the empirical generalization that reaction time is linear with the logarithm of the number of items to be remembered. This generalization might be wrong, of course, but before Sternberg collected his data it was plausible, and therefore could be used to generate plausible outcomes. In Sternberg's (1966) experiment–assuming that each digit to be remembered is an item–it implies that reaction time would be linear with the logarithm of list length (the number of digits to be remembered). Certain theories also suggest this relation (Sternberg, 1966).

When at least one plausible alternative to linearity has been identified, it becomes possible to assess how much results consistent with linearity support a theory that predicts linearity. One way to test the prediction of linearity is to use the reaction times with lists of length one and six to predict by interpolation the average reaction time with lists of length three and four. The logarithmic prediction can be tested in a similar way. Figure 4 shows the results of this analysis. The results agree with the linear prediction but reliably differ from the logarithmic prediction. Because the results rule out a plausible alternative, the fact that they are consistent with a prediction of the serial-scanning theory provides real support for that theory.
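A sketch of that interpolation test (the reaction times below are invented for illustration; Sternberg's actual means differ):

    import math

    rt = {1: 430, 3: 505, 4: 545, 6: 620}  # hypothetical mean RTs (ms) by list length

    def interpolate(n, transform):
        # Predict RT at list length n from lengths 1 and 6, assuming RT
        # is linear in transform(length).
        t1, t6, tn = transform(1), transform(6), transform(n)
        return rt[1] + (rt[6] - rt[1]) * (tn - t1) / (t6 - t1)

    for n in (3, 4):
        linear = interpolate(n, lambda k: k)
        logarithmic = interpolate(n, math.log)
        print(n, round(linear), round(logarithmic), rt[n])

With these (invented) numbers, the observed values match the linear interpolation and fall reliably below the logarithmic one, which is the pattern that supports the serial-scanning theory.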

Why Has the Use of Good Fits as Evidence Persisted?

Why has the practice of using good fits to support theories been so popular? Its flaws–it hides the flexibility of the theory and the variability of the data, and it ignores the plausible range of the data–are large and easy to understand. There are several possible reasons:

1. A desire to imitate physics. This may have been important initially. In 1929, Clark Hull "purchased and became deeply familiar with Newton's Principia, a work which strongly influenced his thinking from that time on" (Beach, 1959, pp. 128-9). Presenting a graph with several data points and a line through the points makes it appear that the theory being fit makes narrow quantitative predictions, like many physical theories.

2. Confirmation bias (J. Palmer, personal communication, November 1, 1996). Confirmation bias is a tendency to test beliefs in ways likely to confirm them. To regard a good fit as substantial evidence is of course to adopt a testing strategy that tends to confirm flexible theories. Nickerson (1998) concluded that "a great deal of empirical evidence supports the idea that the confirmation bias is extensive and strong and that it appears in many guises" (p. 177); he described several examples involving scientific practice. In many theoretical publications, the authors test only one theory–a theory that they created and that, naturally, they wish to confirm.

3. Repetition. Once a new result or method has appeared in print a few times, it gathers a certain respect, and a certain momentum, unrelated to merit. Sheer repetition–if it is repetition of a mistake–can be strong enough to push whole scientific fields off track for many years, which is what we claim happened here. A famous example in physics involves the charge on the electron. In 1909, when Millikan measured this quantity for the first time, he used a wrong value for the viscosity of air. Subsequent measurements of the charge on the electron shifted only gradually from Millikan's value to the correct value (Feynman, 1985). Biology provides another example. From the 1930s until 1955, mammalian cytologists were "virtually certain" (Kottler, 1974, p. 465) that human cells contain 48 chromosomes, although the correct number is 46. This conclusion was based on "chromosome counts made during the 1920s and 1930s by a number of esteemed cytologists all over the world" (Kottler, 1974, p. 465). By 1954, the existence of 48 human chromosomes was "an established fact" (Kottler, 1974, p. 466), according to one cytologist. The correct number was discovered only when improved techniques made counting chromosomes much less error-prone (Kottler, 1974). Similarly, the use of good fits as evidence in experimental psychology may have remained popular at least partly due to repetition and inertia.

4. Theory complexity. As theories have grown in complexity, it has become no easy task to determine how they constrain possible outcomes. It is computationally much easier to fit them to data.

5. Neglect of basic principles. The most basic principles of theory testing–the ideas that (a) to test a theory, you must collect data that could plausibly disprove it, and (b) the more plausible the possibility of disproof, the stronger the test–receive little attention in psychology. They are far from obvious; as Lakatos (1978) pointed out, Popper himself failed to appreciate the crucial role of plausibility.

A larger lesson of this article may be that these principles–and the related questions "what would disprove my theory?" and "what theories do these data rule out?"–deserve more emphasis.

Author Note

Seth Roberts, Department of Psychology; Harold Pashler, Department of Psychology.

We thank Jonathan Baron, William Batchelder, Nicholas Christenfeld, Brett Clementz, Max Coltheart, Douglas Hintzman, James Johnston, David Krantz, James McClelland, Craig MacKenzie, Dominic Massaro, Douglas Rohrer, David Rubin, Eric Ruthruff, Saul Sternberg, James Townsend, Ben Williams, and John Wixted for helpful comments. We thank Saul Sternberg for providing unpublished data.

Correspondence concerning this article should be addressed to Seth Roberts, Department of Psychology, University of California, Berkeley, California 94720-1650 or [email protected].

References

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19, 716-723.

Anderson, J. A. (1973). A theory for the recognition of items from short memorized lists. Psychological Review, 80, 417-438.

Anscombe, F. J. (1967). Topics in the investigation of linear relations fitted by the method of least squares (with discussion). Journal of the Royal Statistical Society, Series B, 29, 1-52.

Ashby, F. G. (1982). Deriving exact predictions from the cascade model. Psychological Review, 89, 599-607.

Ashby, F. G., & Lee, W. W. (1991). Predicting similarity and categorization from identification. Journal of Experimental Psychology: General, 120, 150-172.

Atkinson, R. C., & Crothers, E. J. (1964). A comparison of paired-associate learning models that have different acquisition and retention axioms. Journal of Mathematical Psychology, 2, 285-315.

Bamber, D., & Van Santen, J. P. H. (1985). How many parameters can a model have and still be testable? Journal of Mathematical Psychology, 29, 443-473.

Beach, F. A. (1959). Clark Leonard Hull. In Biographical memoirs (Vol. 33) of the National Academy of Sciences of the United States of America. New York: Columbia University Press.

Bondi, H. (1992). The philosopher for science. Nature, 358, 363.

Bower, G. H. (1961). Application of a model to paired-associate learning. Psychometrika, 26, 255-280.

Bower, G., & Trabasso, T. (1964). Concept identification. In R. C. Atkinson (Ed.), Studies in mathematical psychology (pp. 32-94). Stanford, CA: Stanford University Press.

Bush, R. R., & Mosteller, F. (1959). A comparison of eight models. In R. R. Bush & W. K. Estes (Eds.), Studies in mathematical learning theory (pp. 293-307). Stanford, CA: Stanford University Press.

Cohen, J. D., Dunbar, K., & McClelland, J. L. (1990). On the control of automatic processes: A parallel distributed processing account of the Stroop effect. Psychological Review, 97, 332-361.

Collyer, C. E. (1985). Comparing strong and weak models by fitting them to computer-generated data. Perception & Psychophysics, 38, 476-481.

Coltheart, M., & Coltheart, V. (1972). On Rumelhart's model of visual information-processing. Canadian Journal of Psychology, 26, 292-295.

Donlin, M., & Child, J. (1992, October). Is neural computing the key to artificial intelligence? Computer Design, 31(10), 87-100.

Estes, W. K. (1950). Toward a statistical theory of learning. Psychological Review, 57, 94-107.

Feynman, R. P. (1985). Surely you're joking, Mr. Feynman. New York: W. W. Norton.

Hecht, S. (1931). The interrelations of various aspects of color vision. Journal of the Optical Society of America, 21, 615-639.

Hick, W. E. (1952). On the rate of gain of information. Quarterly Journal of Experimental Psychology, 4, 11-26.

Hinton, G. E., & Anderson, J. A. (1981). Parallel models of associative memory. Hillsdale, NJ: Lawrence Erlbaum Associates.

Hintzman, D. L. (1991). Why are formal models useful in psychology? In W. E. Hockley & S. Lewandowsky (Eds.), Relating theory and data: Essays on human memory in honor of Bennet B. Murdock (pp. 39-56). Hillsdale, NJ: Lawrence Erlbaum Associates.

Hogarth, R. (1980). Judgment and choice. New York: John Wiley & Sons.

Howson, C., & Urbach, P. (1989). Scientific reasoning: The Bayesian approach. LaSalle, IL: Open Court Press.

Hull, C. L. (1943). Principles of behavior. New York: D. Appleton-Century Co.

Hurvich, C. M. (1997). Mean square over degrees of freedom: New perspectives on a model selection treasure. In D. R. Brillinger, L. T. Fernholz, & S. Morgenthaler (Eds.), The practice of data analysis: Essays in honor of John W. Tukey (pp. 203-215). Princeton, NJ: Princeton University Press.

Hurvich, L. M. (1981). Color vision. Sunderland, MA: Sinauer Associates.

Hyman, R. (1953). Stimulus information as a determinant of reaction time. Journal of Experimental Psychology, 45, 188-196.

Ihde, A. J. (1964). The development of modern chemistry. New York: Harper & Row.

Jenkins, H. M. (1979). Animal learning and behavior theory. In E. Hearst (Ed.), The first century of experimental psychology (pp. 177-230). Hillsdale, NJ: Lawrence Erlbaum Associates.

Johnston, J. C., van Santen, J. P. H., & Hale, B. L. (1985). Repetition effects in word and pseudoword identification: Comment on Salasoo, Shiffrin, and Feustel. Journal of Experimental Psychology: General, 114, 498-508.

Kitcher, P. (1993). The advancement of science: Science without legend, objectivity without illusions. New York: Oxford University Press.

Kottler, M. J. (1974). From 48 to 46: Cytological technique, preconception, and the counting of human chromosomes. Bulletin of the History of Medicine, 48, 465-502.

Lakatos, I. (1978). The methodology of scientific research programmes. Cambridge, England: Cambridge University Press.

Leahy, K. (1994, October). The overfitting problem in perspective. AI Expert, 9(10), 35-36.

Mackintosh, N. J. (1983). Conditioning and associative learning. Oxford: Oxford University Press.

Massaro, D. W. (1988). Some criticisms of connectionist models of human performance. Journal of Memory & Language, 27, 213-234.

McClelland, J. L. (1979). On the time relations of mental processes: An examination of systems of processes in cascade. Psychological Review, 86, 287-330.

Meehl, P. E. (1997). The problem is epistemology, not statistics: Replace significance tests by confidence intervals and quantify accuracy of risky numerical predictions. In L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What if there were no significance tests? (pp. 393-425). Mahwah, NJ: Lawrence Erlbaum Associates.

Nickerson, R. S. (1998). Confirmation bias: A ubiquitous phenomenon in many guises. Review of General Psychology, 2, 175-220.

Nosofsky, R. M., Kruschke, J. K., & McKinley, S. C. (1992). Combining exemplar-based category representations and connectionist learning rules. Journal of Experimental Psychology: Learning, Memory, & Cognition, 18, 211-233.

Platt, J. R. (1964). Strong inference. Science, 146, 347-353.

Popper, K. (1959). The logic of scientific discovery. London: Hutchinson.

Roberts, S., & Sternberg, S. (1993). The meaning of additive reaction-time effects: Tests of three alternatives. In D. E. Meyer & S. Kornblum (Eds.), Attention and performance XIV: Synergies in experimental psychology, artificial intelligence, and cognitive neuroscience–A silver jubilee (pp. 611-653). Cambridge, MA: MIT Press.

Rodgers, J. L., & Rowe, D. C. (1993). Social contagion and adolescent sexual behavior: A developmental EMOSA model. Psychological Review, 100, 479-510.

Roediger, H. L., & Neely, J. H. (1982). Retrieval blocks in episodic and semantic memory. Canadian Journal of Psychology, 36, 213-242.

Rubin, D. C., Hinton, S., & Wenzel, A. (in press). The precise time course of retention. Journal of Experimental Psychology: Learning, Memory, and Cognition.

Rubin, D. C., & Wenzel, A. E. (1996). One hundred years of forgetting: A quantitative description of retention. Psychological Review, 103, 734-760.

Rumelhart, D. E. (1970). A multicomponent theory of the perception of brief visual displays. Journal of Mathematical Psychology, 7, 191-218.

Rubin, D. C., & Wenzel, A. E. (1996). One Rumelhart, D. E., McClelland, J. L., & the PDP Research Group (1986). Sternberg, S. (1966). High-speed scanning Parallel distributed processing: in human memory. Science, 153, Explorations in the microstructure of 652-654. cognition. Cambridge, MA: MIT Press. Takane, Y., & Shibayama, T. (1992). Structures in stimulus identification Salasoo, A., Shiffrin, R. M., & Feustel, T. C. data. In F. G. Ashby (Ed.), (1985). Building permanent memory Multidimensional models of codes: Codification and repetition perception and cognition (pp. 335- effects in word identification. 362). Hillsdale, NJ: Lawrence Journal of Experimental Psychology: Erlbaum Associates. General, 114, 50-77. Trabasso, T., & Bower, G. (1966). Schaffer, C. (1993). Overfitting avoidance Presolution dimensional shifts in as bias. Machine Learning, 10, 153- concept identification: A test of the 178. sampling with replacement axiom in all-or-none models. Journal of Seidenberg, M. S., & McClelland, J. L. Mathematical Psychology, 3, 163- (1989). A distributed, developmental 173. model of word recognition and naming. Psychological Review, 96, Trabasso, T., & Bower, G. (1968). Attention 523-568. in learning. New York: Wiley. Urbach, P. (1987). Francis Bacon's Sternberg, S. (1963). Stochastic learning philosophy of science. La Salle, theory. In R. D. Luce, R. R. Bush, & E. Galanter (Eds.), Handbook of IL: Open Court. mathematical psychology. Vol. 2 (pp. 1-120). New York: Wiley. Wexler, K. (1978). A review of John R.

Wexler, K. (1978). A review of John R. Anderson's Language, Memory, and Thought. Cognition, 6, 327-351.
Zhuikov, A. Y., Couvillon, P. A., & Bitterman, M. E. (1994). Quantitative two-process analysis of avoidance conditioning in goldfish. Journal of Experimental Psychology: Animal Behavior Processes, 20, 32-43.