The Robust Beauty of Improper Linear Models in Decision Making

ROBYN M. DAWES
University of Oregon

ABSTRACT: Proper linear models are those in which predictor variables are given weights in such a way that the resulting linear composite optimally predicts some criterion of interest; examples of proper linear models are standard regression analysis, discriminant function analysis, and ridge regression analysis. Research summarized in Paul Meehl's book on clinical versus statistical prediction—and a plethora of research stimulated in part by that book—all indicates that when a numerical criterion variable (e.g., graduate grade point average) is to be predicted from numerical predictor variables, proper linear models outperform clinical intuition. Improper linear models are those in which the weights of the predictor variables are obtained by some nonoptimal method; for example, they may be obtained on the basis of intuition, derived from simulating a clinical judge's predictions, or set to be equal. This article presents evidence that even such improper linear models are superior to clinical intuition when predicting a numerical criterion from numerical predictors. In fact, unit (i.e., equal) weighting is quite robust for making such predictions. The article discusses, in some detail, the application of unit weights to decide what bullet the Denver Police Department should use. Finally, the article considers commonly raised technical, psychological, and ethical resistances to using linear models to make important social decisions and presents arguments that could weaken these resistances.

Work on this article was started at the University of Oregon and Decision Research, Inc., Eugene, Oregon; it was completed while I was a James McKeen Cattell Sabbatical Fellow at the Psychology Department at the University of Michigan and at the Research Center for Group Dynamics at the Institute for Social Research there. I thank all these institutions for their assistance, and I especially thank my friends at them who helped. This article is based in part on invited talks given at the American Psychological Association (August 1977), the University of Washington (February 1978), the Aachen Technological Institute (June 1978), the University of Groningen (June 1978), the University of Amsterdam (June 1978), the Institute for Social Research at the University of Michigan (September 1978), Miami University, Oxford, Ohio (November 1978), and the University of Chicago School of Business (January 1979). I received valuable feedback from most of the audiences. Requests for reprints should be sent to Robyn M. Dawes, Department of Psychology, University of Oregon, Eugene, Oregon 97403.

Paul Meehl's (1954) book Clinical Versus Statistical Prediction: A Theoretical Analysis and a Review of the Evidence appeared 25 years ago. It reviewed studies indicating that the prediction of numerical criterion variables of psychological interest (e.g., faculty ratings of graduate students who had just obtained a PhD) from numerical predictor variables (e.g., scores on the Graduate Record Examination, grade point averages, ratings of letters of recommendation) is better done by a proper linear model than by the clinical intuition of people presumably skilled in such prediction. The point of this article is to review evidence that even improper linear models may be superior to clinical predictions.

A proper linear model is one in which the weights given to the predictor variables are chosen in such a way as to optimize the relationship between the prediction and the criterion. Simple regression analysis is the most common example of a proper linear model; the predictor variables are weighted in such a way as to maximize the correlation between the subsequent weighted composite and the actual criterion. Discriminant function analysis is another example of a proper linear model; weights are given to the predictor variables in such a way that the resulting linear composites maximize the discrepancy between two or more groups. Ridge regression analysis, another example (Darlington, 1978; Marquardt & Snee, 1975), attempts to assign weights in such a way that the linear composites correlate maximally with the criterion of interest in a new set of data.

Thus, there are many types of proper linear models, and they have been used in a variety of contexts. One example (Dawes, 1971) was presented in this Journal; it involved the prediction of faculty ratings of graduate students.

All graduate students at the University of Oregon's Psychology Department who had been admitted between the fall of 1964 and the fall of 1967—and who had not dropped out of the program for nonacademic reasons (e.g., psychosis or marriage)—were rated by the faculty in the spring of 1969; faculty members rated only students whom they felt comfortable rating. The following rating scale was used: 5, outstanding; 4, above average; 3, average; 2, below average; 1, dropped out of the program in academic difficulty. Such overall ratings constitute a psychologically interesting criterion because the subjective impressions of faculty members are the main determinants of the job (if any) a student obtains after leaving graduate school. A total of 111 students were in the sample; the number of faculty members rating each of these students ranged from 1 to 20, with the mean number being 5.67 and the median being 5. The ratings were reliable. (To determine the reliability, the ratings were subjected to a one-way analysis of variance in which each student being rated was regarded as a treatment. The resulting between-treatments variance ratio (η²) was .67, and it was significant beyond the .001 level.) These faculty ratings were predicted from a proper linear model based on the student's Graduate Record Examination (GRE) score, the student's undergraduate grade point average (GPA), and a measure of the selectivity of the student's undergraduate institution.¹ The cross-validated multiple correlation between the faculty ratings and predictor variables was .38. Congruent with Meehl's results, the correlation of these latter faculty ratings with the average rating of the people on the admissions committee who selected the students² was .19; that is, it accounted for one fourth as much variance. This example is typical of those found in psychological research in this area in that (a) the correlation with the model's predictions is higher than the correlation with clinical prediction, but (b) both correlations are low. These characteristics often lead psychologists to interpret the findings as meaning that while the low correlation of the model indicates that linear modeling is deficient as a method, the even lower correlation of the judges indicates only that the wrong judges were used.

An improper linear model is one in which the weights are chosen by some nonoptimal method. They may be chosen to be equal, they may be chosen on the basis of the intuition of the person making the prediction, or they may be chosen at random. Nevertheless, improper models may have great utility. When, for example, the standardized GREs, GPAs, and selectivity indices in the previous example were weighted equally, the resulting linear composite correlated .48 with later faculty rating. Not only is the correlation of this linear composite higher than that with the clinical judgment of the admissions committee (.19), it is also higher than that obtained upon cross-validating the weights obtained from half the sample.

¹ This index was based on Cass and Birnbaum's (1968) rating of selectivity given at the end of their book Comparative Guide to American Colleges. The verbal categories of selectivity were given numerical values according to the following rule: most selective, 6; highly selective, 5; very selective (+), 4; very selective, 3; selective, 2; not mentioned, 1.

² Unfortunately, only 23 of the 111 students could be used in this comparison because the rating scale the admissions committee used changed slightly from year to year.
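The equal-weighting procedure just described is short enough to write out directly. The sketch below is a minimal illustration in Python; the data are synthetic stand-ins (the distributions and the data-generating step are assumptions, not the Oregon records), so it illustrates only the procedure, not the .48 reported above.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 111  # sample size borrowed from the Oregon example

    # Synthetic stand-ins for the three predictors and the criterion.
    gre = rng.normal(650, 80, n)
    gpa = rng.normal(3.5, 0.3, n)
    selectivity = rng.integers(1, 7, n).astype(float)   # the 1-6 index of Footnote 1
    rating = (0.3 * (gre - 650) / 80 + 0.3 * (gpa - 3.5) / 0.3
              + 0.2 * (selectivity - 3.5) / 1.7 + rng.normal(0, 1, n))

    def z(x):
        # Standardize first: equal weights are meaningless on raw, incomparable scales.
        return (x - x.mean()) / x.std()

    # The improper model: standardize each predictor and simply add.
    composite = z(gre) + z(gpa) + z(selectivity)
    print(np.corrcoef(composite, rating)[0, 1])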

An example of an improper model that might be of somewhat more interest—at least to the general public—was motivated by a physician who was on a panel with me concerning predictive systems. Afterward, at the bar with his wife and me, he said that my paper might be of some interest to my colleagues, but success in graduate school in psychology was not of much general interest: "Could you, for example, use one of your improper linear models to predict how well my wife and I get along together?" he asked. I realized that I could—or might. At that time, the Psychology Department at the University of Oregon was engaged in sex research, most of which was behavioristically oriented. So the subjects of this research monitored when they made love, when they had fights, when they had social engagements (e.g., with in-laws), and so on. These subjects also made subjective ratings about how happy they were in their marital or coupled situation. I immediately thought of an improper linear model to predict self-ratings of marital happiness: rate of lovemaking minus rate of fighting. My colleague John Howard had collected just such data on couples when he was an undergraduate at the University of Missouri—Kansas City, where he worked with Alexander (1971). After establishing the intercouple reliability of judgments of lovemaking and fighting, Alexander had one partner from each of 42 couples monitor these events. She allowed us to analyze her data, with the following results: "In the thirty happily married couples (as reported by the monitoring partner) only two argued more often than they had intercourse. All twelve of the unhappily married couples argued more often" (Howard & Dawes, 1976, p. 478). We then replicated this finding at the University of Oregon, where 27 monitors rated happiness on a 7-point scale, from "very unhappy" to "very happy," with a neutral midpoint. The correlation of rate of lovemaking minus rate of arguments with these ratings of marital happiness was .40 (p < .05); neither variable alone was significant. The findings were replicated in Missouri by D. D. Edwards and Edwards (1977) and in Texas by Thornton (1977), who found a correlation of .81 (p < .01) between the sex-argument difference and self-rating of marital happiness among 28 new couples. (The reason for this much higher correlation might be that Thornton obtained the ratings of marital happiness after, rather than before, the subjects monitored their lovemaking and fighting; in fact, one subject decided to get a divorce after realizing that she was fighting more than loving; Thornton, Note 1.) The conclusion is that if we love more than we hate, we are happy; if we hate more than we love, we are miserable. This conclusion is not very profound, psychologically or statistically. The point is that this very crude improper linear model predicts a very important variable: judgments about marital happiness.

The bulk (in fact, all) of the literature since the publication of Meehl's (1954) book supports his generalization about proper models versus intuitive clinical judgment. Sawyer (1966) reviewed a plethora of these studies, and some of these studies were quite extensive (cf. Goldberg, 1965). Some 10 years after his book was published, Meehl (1965) was able to conclude, however, that there was only a single example showing clinical judgment to be superior, and this conclusion was immediately disputed by Goldberg (1968) on the grounds that even the one example did not show such superiority. Holt (1970) criticized details of several studies, and he even suggested that prediction as opposed to understanding may not be a very important part of clinical judgment. But a search of the literature fails to reveal any studies in which clinical judgment has been shown to be superior to statistical prediction when both are based on the same codable input variables. And though most nonpositivists would agree that understanding is not synonymous with prediction, few would agree that it doesn't entail some ability to predict.

Why? Because people—especially the experts in a field—are much better at selecting and coding information than they are at integrating it.

But people are important. The statistical model may integrate the information in an optimal manner, but it is always the individual (judge, clinician, subjects) who chooses variables. Moreover, it is the human judge who knows the directional relationship between the predictor variables and the criterion of interest, or who can code the variables in such a way that they have clear directional relationships. And it is in precisely the situation where the predictor variables are good and where they have a conditionally monotone relationship with the criterion that proper linear models work well.³

³ Relationships are conditionally monotone when variables can be scaled in such a way that higher values on each predict higher values on the criterion. This condition is the combination of two more fundamental measurement conditions: (a) independence (the relationship between each variable and the criterion is independent of the values on the remaining variables) and (b) monotonicity (the ordinal relationship is one that is monotone). (See Krantz, 1972; Krantz, Luce, Suppes, & Tversky, 1971.) The true relationships need not be linear for linear models to work; they must merely be approximated by linear models. It is not true that "in order to compute a correlation coefficient between two variables the relationship between them must be linear" (advice found in one introductory statistics text). In the first place, it is always possible to compute something.
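Footnote 3's claim, that linear models require only approximately linear, conditionally monotone relationships, is easy to check numerically. The following sketch is illustrative only; the exponential curve is an arbitrary choice of monotone transform, not anything from the studies discussed here.

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(size=10_000)

    # A distinctly nonlinear but monotone criterion: y rises with x throughout.
    y = np.exp(x / 2) + rng.normal(scale=0.1, size=x.size)

    # A linear model of x nevertheless captures most of the relationship.
    print(np.corrcoef(x, y)[0, 1])  # above .9 despite the curvature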

The linear model cannot replace the expert in deciding such things as "what to look for," but it is precisely this knowledge of what to look for in reaching the decision that is the special expertise people have. Even in as complicated a judgment as making a chess move, it is the ability to code the board in an appropriate way to "see" the proper moves that distinguishes the grand master from the expert from the novice (deGroot, 1965; Simon & Chase, 1973). It is not in the ability to integrate information that people excel (Slovic, Note 2). Again, the chess grand master considers no more moves than does the expert; he just knows which ones to look at. The distinction between knowing what to look for and the ability to integrate information is perhaps best illustrated in a study by Einhorn (1972). Expert doctors coded biopsies of patients with Hodgkin's disease and then made an overall rating of the severity of the process. The overall rating did not predict survival time of the 193 patients, all of whom died. (The correlations of rating with survival time were all virtually 0, some in the wrong direction.) The variables that the doctors coded did, however, predict survival time when they were used in a multiple regression model.

In summary, proper linear models work for a very simple reason. People are good at picking out the right predictor variables and at coding them in such a way that they have a conditionally monotone relationship with the criterion. People are bad at integrating information from diverse and incomparable sources. Proper linear models are good at such integration when the predictions have a conditionally monotone relationship to the criterion.

Consider, for example, the problem of comparing one graduate applicant with GRE scores of 750 and an undergraduate GPA of 3.3 with another with GRE scores of 680 and an undergraduate GPA of 3.7. Most judges would agree that these indicators of aptitude and previous accomplishment should be combined in some compensatory fashion, but the question is how to compensate. Many judges attempting this feat have little knowledge of the distributional characteristics of GREs and GPAs, and most have no knowledge of studies indicating their validity as predictors of graduate success. Moreover, these numbers are inherently incomparable without such knowledge, GREs running from 500 to 800 for viable applicants, and GPAs from 3.0 to 4.0. Is it any wonder that a statistical weighting scheme does better than a human judge in these circumstances?

Suppose now that it is not possible to construct a proper linear model in some situation. One reason we may not be able to do so is that our sample size is inadequate. In multiple regression, for example, b weights are notoriously unstable; the ratio of observations to predictors should be as high as 15 or 20 to 1 before b weights, which are the optimal weights, do better on cross-validation than do simple unit weights. Schmidt (1971), Goldberg (1972), and Claudy (1972) have demonstrated this need empirically through computer simulation, and Einhorn and Hogarth (1975) and Srinivasan (Note 3) have attacked the problem analytically. The general solution depends on a number of parameters such as the multiple correlation in the population and the covariance pattern between predictor variables. But the applied implication is clear. Standard regression analysis cannot be used in situations where there is not a "decent" ratio of observations to predictors.

Another situation in which proper linear models cannot be used is that in which there are no measurable criterion variables. We might, nevertheless, have some idea about what the important predictor variables would be and the direction they would bear to the criterion if we were able to measure the criterion. For example, when deciding which students to admit to graduate school, we would like to predict some future long-term variable that might be termed "professional self-actualization." We have some idea what we mean by this concept, but no good, precise definition as yet. (Even if we had one, it would be impossible to conduct the study using records from current students, because that variable could not be assessed until at least 20 years after the students had completed their doctoral work.) We do, however, know that in all probability this criterion is positively related to intelligence, to past accomplishments, and to ability to snow one's colleagues. In our applicants' files, GRE scores assess the first variable; undergraduate GPA, the second; and letters of recommendation, the third. Might we not, then, wish to form some sort of linear combination of these variables in order to assess our applicants' potentials? Given that we cannot perform a standard regression analysis, is there nothing to do other than fall back on unaided intuitive integration of these variables when we assess our applicants?

One possible way of building an improper linear model is through the use of bootstrapping (Dawes & Corrigan, 1974; Goldberg, 1970). The process is to build a proper linear model of an expert's judgments about an outcome criterion and then to use that linear model in place of the judge. That such linear models can be accurate in predicting experts' judgments has been pointed out in the psychological literature by Hammond (1955) and Hoffman (1960). (This work was anticipated by 32 years by the late Henry Wallace, Vice-President under Roosevelt, in a 1923 agricultural article suggesting the use of linear models to analyze "what is on the corn judge's mind.") In his influential article, Hoffman termed the use of linear models a paramorphic representation of judges, by which he meant that the judges' psychological processes did not involve computing an implicit or explicit weighted average of input variables, but that it could be simulated by such a weighting. Paramorphic representations have been extremely successful (for reviews see Dawes & Corrigan, 1974; Slovic & Lichtenstein, 1971) in contexts in which predictor variables have conditionally monotone relationships to criterion variables.
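The sample-size point above is easy to verify by simulation. In the sketch below (synthetic data; the five positive population weights and the sample sizes are arbitrary choices), b weights are estimated on a small sample and cross-validated on fresh cases, then compared with unit weights.

    import numpy as np

    rng = np.random.default_rng(2)
    m = 5
    beta = np.array([0.6, 0.4, 0.2, 0.1, 0.1])  # arbitrary positive true weights

    def cross_validities(n_fit, reps=500):
        # Average cross-validated correlations for fitted b weights and unit weights.
        r_b, r_unit = [], []
        for _ in range(reps):
            X = rng.normal(size=(n_fit, m))
            y = X @ beta + rng.normal(size=n_fit)
            Xv = rng.normal(size=(2000, m))           # fresh validation sample
            yv = Xv @ beta + rng.normal(size=2000)
            b = np.linalg.lstsq(X, y, rcond=None)[0]  # estimated b weights
            r_b.append(np.corrcoef(Xv @ b, yv)[0, 1])
            r_unit.append(np.corrcoef(Xv.sum(axis=1), yv)[0, 1])
        return np.mean(r_b), np.mean(r_unit)

    for n in (15, 50, 100):  # observation-to-predictor ratios of 3, 10, and 20
        print(n, cross_validities(n))

On runs of this sort the unit weights typically cross-validate better at the lowest ratio, and the fitted b weights overtake them only as the ratio grows; exactly where the crossover falls depends on the population parameters, as noted above.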

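Bootstrapping in this sense is itself a short computation: regress the judge's ratings, not the criterion, on the coded predictors, and use the fitted composite in place of the judge. The sketch below uses an artificial "judge" whose ratings are a noisy weighted sum of the cues; it is a schematic illustration under those assumptions, not a reanalysis of any study cited here.

    import numpy as np

    rng = np.random.default_rng(3)
    n, m = 100, 3
    X = rng.normal(size=(n, m))            # standardized cues
    policy = np.array([0.5, 0.3, 0.2])     # the judge's implicit (valid) policy
    criterion = X @ policy + rng.normal(size=n)

    # The artificial judge applies the policy, but inconsistently.
    judge = X @ policy + rng.normal(scale=1.5, size=n)

    # Paramorphic model: a proper linear model of the judge's RATINGS ...
    w = np.linalg.lstsq(X, judge, rcond=None)[0]
    model_of_judge = X @ w

    # ... which usually predicts the criterion better than the judge does,
    # because it keeps the policy and discards the trial-to-trial noise.
    print("judge:          ", np.corrcoef(judge, criterion)[0, 1])
    print("model of judge: ", np.corrcoef(model_of_judge, criterion)[0, 1])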
The bootstrapping models make use of the weights derived from the judges; because these weights are not derived from the relationship between the predictor and criterion variables themselves, the resulting linear models are improper. Yet these paramorphic representations consistently do better than the judges from which they were derived (at least when the evaluation of goodness is in terms of the correlation between predicted and actual values).

Bootstrapping has turned out to be pervasive. For example, in a study conducted by Wiggins and Kohen (1971), psychology graduate students at the University of Illinois were presented with 10 background, aptitude, and personality measures describing other (real) Illinois graduate students in psychology and were asked to predict these students' first-year graduate GPAs. Linear models of every one of the University of Illinois judges did a better job than did the judges themselves in predicting actual grade point averages. This result was replicated in a study conducted in conjunction with Wiggins, Gregory, and Diller (cited in Dawes & Corrigan, 1974). Goldberg (1970) demonstrated it for 26 of 29 clinical psychology judges predicting psychiatric diagnosis of neurosis or psychosis from Minnesota Multiphasic Personality Inventory (MMPI) profiles, and Dawes (1971) found it in the evaluation of graduate applicants at the University of Oregon. The one published exception to the success of bootstrapping of which I am aware was a study conducted by Libby (1976). He asked 16 loan officers from relatively small banks (located in Champaign-Urbana, Illinois, with assets between $3 million and $56 million) and 27 loan officers from large banks (located in Philadelphia, with assets between $.6 billion and $4.4 billion) to judge which 30 of 60 firms would go bankrupt within three years after their financial statements. The loan officers requested five financial ratios on which to base their judgments (e.g., the ratio of present assets to total assets). On the average, the loan officers correctly categorized 44.4 businesses (74%) as either solvent or future bankruptcies, but on the average, the paramorphic representations of the loan officers could correctly classify only 43.3 (72%). This difference turned out to be statistically significant, and Libby concluded that he had an example of a situation where bootstrapping did not work—perhaps because his judges were highly skilled experts attempting to predict a highly reliable criterion. Goldberg (1976), however, noted that many of the ratios had highly skewed distributions, and he reanalyzed Libby's data, normalizing the ratios before building models of the loan officers. Libby found 77% of his officers to be superior to their paramorphic representations, but Goldberg, using his rescaled predictor variables, found the opposite; 72% of the models were superior to the judges from whom they were derived.⁴

Why does bootstrapping work? Bowman (1963), Goldberg (1970), and Dawes (1971) all maintained that its success arises from the fact that a linear model distills underlying policy (in the implicit weights) from otherwise variable behavior (e.g., judgments affected by context effects or extraneous variables).

Belief in the efficacy of bootstrapping was based on the comparison of the validity of the linear model of the judge with the validity of his or her judgments themselves. This is only one of two logically possible comparisons. The other is the validity of the linear model of the judge versus the validity of linear models in general; that is, to demonstrate that bootstrapping works because the linear model catches the essence of the judge's valid expertise while eliminating unreliability, it is necessary to demonstrate that the weights obtained from an analysis of the judge's behavior are superior to those that might be obtained in other ways, for example, randomly. Because both the model of the judge and the model obtained randomly are perfectly reliable, a comparison of the random model with the judge's model permits an evaluation of the judge's underlying linear representation, or policy. If the random model does equally well, the judge would not be "following valid principles but following them poorly" (Dawes, 1971, p. 182), at least not principles any more valid than any others that weight variables in the appropriate direction.

Table 1 presents five studies summarized by Dawes and Corrigan (1974) in which validities (i.e., correlations) obtained by various methods were compared.

⁴ It should be pointed out that a proper linear model does better than either loan officers or their paramorphic representations. Using the same task, Beaver (1966) and Deacon (1972) found that linear models predicted with about 78% accuracy on cross-validation. But I can't resist pointing out that the simplest possible improper model of them all does best. The ratio of assets to liabilities (!) correctly categorizes 48 (80%) of the cases studied by Libby.

TABLE 1
Correlations Between Predictions and Criterion Values

                                                 Average    Average     Average     Validity     Cross-        Validity
                                                 validity   validity    validity    of equal     validity of   of optimal
Example                                          of judge   of judge    of random   weighting    regression    linear
                                                            model       model       model        analysis      model

Prediction of neurosis vs. psychosis               .28        .31         .30         .34          .46           .46
Illinois students' predictions of GPA              .33        .50         .51         .60          .57           .69
Oregon students' predictions of GPA                .37        .43         .51         .60          .57           .69
Prediction of later faculty ratings at Oregon      .19        .25         .39         .48          .38           .54
Yntema & Torgerson's (1961) experiment             .84        .89         .84         .97           —            .97

Note. GPA = grade point average.

In the first study, a pool of 861 psychiatric patients took the MMPI in various hospitals; they were later categorized as neurotic or psychotic on the basis of more extensive information. The MMPI profiles consist of 11 scores, each of which represents the degree to which the respondent answers questions in a manner similar to patients suffering from a well-defined form of psychopathology. A set of 11 scores is thus associated with each patient, and the problem is to predict whether a later diagnosis will be psychosis (coded 1) or neurosis (coded 0). Twenty-nine clinical psychologists "of varying experience and training" (Goldberg, 1970, p. 425) were asked to make this prediction on an 11-step forced-normal distribution. The second two studies concerned 90 first-year graduate students in the Psychology Department of the University of Illinois who were evaluated on 10 variables that are predictive of academic success. These variables included aptitude test scores, college GPA, various peer ratings (e.g., extraversion), and various self-ratings (e.g., conscientiousness). A first-year GPA was computed for all these students. The problem was to predict the GPA from the 10 variables. In the second study this prediction was made by 80 (other) graduate students at the University of Illinois (Wiggins & Kohen, 1971), and in the third study this prediction was made by 41 graduate students at the University of Oregon. The details of the fourth study have already been covered; it is the one concerned with the prediction of later faculty ratings at Oregon. The final study (Yntema & Torgerson, 1961) was one in which experimenters assigned values to ellipses presented to the subjects, on the basis of the figures' size, eccentricity, and grayness. The formula used was ij + jk + ik, where i, j, and k refer to values on the three dimensions just mentioned. Subjects in this experiment were asked to estimate the value of each ellipse and were presented with outcome feedback at the end of each trial. The problem was to predict the true (i.e., experimenter-assigned) value of each ellipse on the basis of its size, eccentricity, and grayness.

The first column of Table 1 presents the average validity of the judges in these studies, and the second presents the average validity of the paramorphic model of these judges. In all cases, bootstrapping worked. But then what Corrigan and I constructed were random linear models, that is, models in which weights were randomly chosen except for sign and were then applied to standardized variables:⁵

    The sign of each variable was determined on an a priori basis so that it would have a positive relationship to the criterion. Then a normal deviate was selected at random from a normal distribution with unit variance, and the absolute value of this deviate was used as a weight for the variable. Ten thousand such models were constructed for each example. (Dawes & Corrigan, 1974, p. 102)

On the average, these random linear models perform about as well as the paramorphic models of the judges; these averages are presented in the third column of the table. Equal-weighting models, presented in the fourth column, do even better. (There is a mathematical reason why equal-weighting models must outperform the average random model.⁶) Finally, the last two columns present the cross-validated validity of the standard regression model and the validity of the optimal linear model.

⁵ Unfortunately, Dawes and Corrigan did not spell out in detail that these variables must first be standardized and that the result is a standardized dependent variable. Equal or random weighting of incomparable variables—for example, GRE score and GPA—without prior standardization would be nonsensical.

⁶ Consider a set of standardized variables X₁, X₂, ..., Xₘ, each of which is positively correlated with a standardized variable Y. The correlation of the average of the Xs with Y is equal to the correlation of the sum of the Xs with Y. The covariance of this sum with Y is

    (1/n) Σᵢ yᵢ(xᵢ₁ + xᵢ₂ + ... + xᵢₘ) = (1/n) Σᵢ yᵢxᵢ₁ + ... + (1/n) Σᵢ yᵢxᵢₘ = r₁ + r₂ + ... + rₘ

(the sum of the correlations). The variance of Y is 1, and the variance of the sum of the Xs is m + m(m − 1)r̄, where r̄ is the average intercorrelation between the Xs. Hence, the correlation of the average of the Xs with Y is (r₁ + ... + rₘ)/[m + m(m − 1)r̄]^½; since r̄ is less than 1, this is greater than (r₁ + ... + rₘ)/[m + m² − m]^½ = (r₁ + ... + rₘ)/m, the average r. Because each of the random models is positively correlated with the criterion, the correlation of their average, which is the unit-weighted model, is higher than the average of the correlations.
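The construction described in the passage just quoted, together with Footnote 6's inequality, can be reproduced in a few lines. The criterion below is synthetic (its true weights are invented for illustration), so only the ordering of the results, not their size, is of interest.

    import numpy as np

    rng = np.random.default_rng(4)
    n, m = 500, 5
    X = rng.normal(size=(n, m))                 # standardized predictors
    y = X @ np.array([0.5, 0.4, 0.3, 0.2, 0.1]) + rng.normal(size=n)

    def validity(w):
        return np.corrcoef(X @ w, y)[0, 1]

    # Random models: signs fixed a priori (all positive here), with the
    # absolute value of a unit normal deviate as each weight.
    random_rs = [validity(np.abs(rng.normal(size=m))) for _ in range(10_000)]

    print("mean validity of 10,000 random models:", np.mean(random_rs))
    print("validity of the unit-weighted model:  ", validity(np.ones(m)))

The unit-weighted figure comes out higher, as Footnote 6 requires: the unit-weighted model is, in effect, the average of the random models, and the validity of an average exceeds the average of the validities.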

Essentially the same results were obtained when the weights were selected from a rectangular distribution. Why? Because linear models are robust over deviations from optimal weighting. In other words, the bootstrapping finding, at least in these studies, has simply been a reaffirmation of the earlier finding that proper linear models are superior to human judgments—the weights derived from the judges' behavior being sufficiently close to the optimal weights that the outputs of the models are highly similar. The solution to the problem of obtaining optimal weights is one that—in terms of von Winterfeldt and Edwards (Note 4)—has a "flat maximum." Weights that are near the optimal level produce almost the same output as do optimal beta weights. Because the expert judge knows at least something about the direction of the variables, his or her judgments yield weights that are nearly optimal (but note that in all cases equal weighting is superior to models based on judges' behavior).

The fact that different linear composites correlate highly with each other was first pointed out 40 years ago by Wilks (1938). He considered only situations in which there was positive correlation between predictors. This result seems to hold generally as long as these intercorrelations are not negative; for example, the correlation between X + 2Y and 2X + Y is .80 when X and Y are uncorrelated. The ways in which outputs are relatively insensitive to changes in coefficients (provided changes in sign are not involved) have been investigated most recently by Green (1977), Wainer (1976), Wainer and Thissen (1976), W. M. Edwards (1978), and Gardiner and Edwards (1975).

Dawes and Corrigan (1974, p. 105) concluded that "the whole trick is to know what variables to look at and then know how to add." That principle is well illustrated in the following study, conducted since the Dawes and Corrigan article was published. In it, Hammond and Adelman (1976) both investigated and influenced the decision about what type of bullet should be used by the Denver City Police, a decision having much more obvious social impact than most of those discussed above. To quote Hammond and Adelman (1976):

    In 1974, the Denver Police Department (DPD), as well as other police departments throughout the country, decided to change its handgun ammunition. The principal reason offered by the police was that the conventional round-nosed bullet provided insufficient "stopping effectiveness" (that is, the ability to incapacitate and thus to prevent the person shot from firing back at a police officer or others). The DPD chief recommended (as did other police chiefs) that the conventional bullet be replaced by a hollow-point bullet. Such bullets, it was contended, flattened on impact, thus decreasing penetration, increasing stopping effectiveness, and decreasing ricochet potential. The suggested change was challenged by the American Civil Liberties Union, minority groups, and others. Opponents of the change claimed that the new bullets were nothing more than outlawed "dum-dum" bullets, that they created far more injury than the round-nosed bullet, and should, therefore, be barred from use. As is customary, judgments on this matter were formed privately and then defended publicly with enthusiasm and tenacity, and the usual public hearings were held. Both sides turned to ballistics experts for scientific information and support. (p. 392)

The disputants focused on evaluating the merits of specific bullets—confounding the physical effect of the bullets with the implications for social policy; that is, rather than separating questions of what it is the bullet should accomplish (the social policy question) from questions concerning ballistic characteristics of specific bullets, advocates merely argued for one bullet or another. Thus, as Hammond and Adelman pointed out, social policymakers inadvertently adopted the role of (poor) ballistics experts, and vice versa. What Hammond and Adelman did was to discover the important policy dimensions from the policymakers, and then they had the ballistics experts rate the bullets with respect to these dimensions. These dimensions turned out to be stopping effectiveness (the probability that someone hit in the torso could not return fire), probability of serious injury, and probability of harm to bystanders.

When the ballistics experts rated the bullets with respect to these dimensions, it turned out that the last two were almost perfectly confounded, but they were not perfectly confounded with the first. Bullets do not vary along a single dimension that confounds effectiveness with lethalness. The probability of serious injury or harm to bystanders is highly related to the penetration of the bullet, whereas the probability of the bullet's effectively stopping someone from returning fire is highly related to the width of the entry wound. Since policymakers could not agree about the weights given to the three dimensions, Hammond and Adelman suggested that they be weighted equally. Combining the equal weights with the (independent) judgments of the ballistics experts, Hammond and Adelman discovered a bullet that "has greater stopping effectiveness and is less apt to cause injury (and is less apt to threaten bystanders) than the standard bullet then in use by the DPD" (Hammond & Adelman, 1976, p. 395). The bullet was also less apt to cause injury than was the bullet previously recommended by the DPD. That bullet was "accepted by the City Council and all other parties concerned, and is now being used by the DPD" (Hammond & Adelman, 1976, p. 395).⁷ Once again, "the whole trick is to decide what variables to look at and then know how to add" (Dawes & Corrigan, 1974, p. 105).

So why don't people do it more often? I know of four universities (University of Illinois; New York University; University of Oregon; University of California, Santa Barbara—there may be more) that use a linear model for applicant selection, but even these use it as an initial screening device and substitute clinical judgment for the final selection of those above a cut score. Goldberg's (1965) actuarial formula for diagnosing neurosis or psychosis from MMPI profiles has proven superior to clinical judges attempting the same task (no one to my or Goldberg's knowledge has ever produced a judge who does better), yet my one experience with its use (at the Ann Arbor Veterans Administration Hospital) was that it was discontinued on the grounds that it made obvious errors (an interesting reason, discussed at length below). In 1970, I suggested that our fellowship committee at the University of Oregon apportion cutbacks of National Science Foundation and National Defense Education Act fellowships to departments on the basis of a quasi-linear point system based on explicitly defined indices, departmental merit, and need; I was told "you can't systemize human judgment." It was only six months later, after our committee realized the political and ethical impossibility of cutting back fellowships on the basis of intuitive judgment, that such a system was adopted. And so on.

In the past three years, I have written and talked about the utility (and in my view, ethical superiority) of using linear models in socially important decisions. Many of the same objections have been raised repeatedly by different readers and audiences. I would like to conclude this article by cataloging these objections and answering them.

Objections to Using Linear Models

These objections may be placed in three broad categories: technical, psychological, and ethical. Each category is discussed in turn.

TECHNICAL

The most common technical objection is to the use of the correlation coefficient; for example, Remus and Jenicke (1978) wrote:

    It is clear that Dawes and Corrigan's choice of the correlation coefficient to establish the utility of random and unit rules is inappropriate [sic, inappropriate for what?]. A criterion function is also needed in the experiments cited by Dawes and Corrigan. Surely there is a cost function for misclassifying neurotics and psychotics or refusing qualified students admissions to graduate school while admitting marginal students. (p. 221)

Consider the graduate admission problem first. Most schools have k slots and N applicants. The problem is to get the best k (who are in turn willing to accept the school) out of N. What better way is there than to have an appropriate rank? None. Remus and Jenicke write as if the problem were not one of comparative choice but of absolute choice. Most social choices, however, involve selecting the better or best from a set of alternatives: the students that will be better, the bullet that will be best, a possible airport site that will be superior, and so on. The correlation coefficient, because it reflects ranks so well, is clearly appropriate for evaluating such choices.

⁷ It should be pointed out that there were only eight bullets on the Pareto frontier; that is, there were only eight that were not inferior to some particular other bullet in both stopping effectiveness and probability of harm (or inferior on one of the variables and equal on the other). Consequently, any weighting rule whatsoever would have chosen one of these eight.
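Both the equal-weighting step Hammond and Adelman used and the Pareto-frontier point of Footnote 7 are simple computations. The ratings below are fabricated placeholders for hypothetical bullets on two dimensions (collapsing the two confounded ones), not the ballistics experts' actual judgments.

    import numpy as np

    # Fabricated expert ratings (higher = better) on two dimensions:
    # stopping effectiveness, and safety (low injury / low harm to bystanders).
    bullets = np.array([[7, 2], [6, 5], [5, 6], [8, 1],
                        [4, 7], [3, 8], [6, 6], [2, 9]], dtype=float)

    # Equal weighting: standardize each dimension, add, and rank.
    z = (bullets - bullets.mean(axis=0)) / bullets.std(axis=0)
    scores = z.sum(axis=1)
    print("equal-weight ranking, best first:", np.argsort(-scores))

    # Pareto frontier: alternatives not beaten on both dimensions at once.
    dominated = [
        any((bullets[j] >= bullets[i]).all() and (bullets[j] > bullets[i]).any()
            for j in range(len(bullets)))
        for i in range(len(bullets))
    ]
    print("undominated bullets:", [i for i, d in enumerate(dominated) if not d])

Any rule with positive weights will pick its winner from the undominated set, which is Footnote 7's point; the ranking itself requires nothing more than knowing what to look at and how to add.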

The neurosis-psychosis problem is more subtle and even less supportive of their argument. "Surely," they state, "there is a cost function," but they don't specify any candidates. The implication is clear: If they could find it, clinical judgment would be found to be superior to linear models. Why? In the absence of such a discovery on their part, the argument amounts to nothing at all. But this argument from a vacuum can be very compelling to people (for example, to losing generals and losing football coaches, who know that "surely" their plans would work "if"—when the plans are in fact doomed to failure no matter what).

A second, related technical objection is to the comparison of average correlation coefficients of judges with those of linear models. Perhaps by averaging, the performance of some really outstanding judges is obscured. The data indicate otherwise. In the Goldberg (1970) study, for example, only 5 of 29 trained clinicians were better than the unit-weighted model, and none did better than the proper one. In the Wiggins and Kohen (1971) study, no judges were better than the unit-weighted model, and we replicated that effect at Oregon. In the Libby (1976) study, only 9 of 43 judges did better than the ratio of assets to liabilities at predicting bankruptcies (3 did equally well). While it is then conceded that clinicians should be able to predict diagnosis of neurosis or psychosis, that graduate students should be able to predict graduate success, and that bank loan officers should be able to predict bankruptcies, the possibility is raised that perhaps the experts used in the studies weren't the right ones. This again is arguing from a vacuum: If other experts were used, then the results would be different. And once again no such experts are produced, and once again the appropriate response is to ask for a reason why these hypothetical other people should be any different. (As one university vice-president told me, "Your research only proves that you used poor judges; we could surely do better by getting better judges"—apparently not from the psychology department.)

A final technical objection concerns the nature of the criterion variables. They are admittedly short-term and unprofound (e.g., GPAs, diagnoses); otherwise, most studies would be infeasible. The question is then raised of whether the findings would be different if a truly long-range important criterion were to be predicted. The answer is that of course the findings could be different, but we have no reason to suppose that they would be different. First, the distant future is in general less predictable than the immediate future, for the simple reason that more unforeseen, extraneous, or self-augmenting factors influence individual outcomes. (Note that we are not discussing aggregate outcomes, such as an unusually cold winter in the Midwest in general spread out over three months.) Since, then, clinical prediction is poorer than linear to begin with, the hypothesis would hold only if linear prediction got much worse over time than did clinical prediction. There is no a priori reason to believe that this differential deterioration in prediction would occur, and none has ever been suggested to me. There is certainly no evidence. Once again, the objection consists of an argument from a vacuum.

Particularly compelling is the fact that people who argue that different criteria or judges or variables or time frames would produce different results have had 25 years in which to produce examples, and they have failed to do so.

PSYCHOLOGICAL

One psychological resistance to using linear models lies in our selective memory about clinical prediction. Our belief in such prediction is reinforced by the availability (Tversky & Kahneman, 1974) of instances of successful clinical prediction—especially those that are exceptions to some formula: "I knew someone once with ... who ...." (e.g., "I knew of someone with a tested IQ of only 130 who got an advanced degree in psychology.") As Nisbett, Borgida, Crandall, and Reed (1976) showed, such single instances often have greater impact on judgment than do much more valid statistical compilations based on many instances. (A good prophylactic for clinical psychologists basing resistance to actuarial prediction on such instances would be to keep careful records of their own predictions about their own patients—prospective records not subject to hindsight. Such records could make all instances of successful and unsuccessful prediction equally available for impact; in addition, they could serve for another clinical versus statistical study using the best possible judge—the clinician himself or herself.)

Moreover, an illusion of good judgment may be reinforced due to selection (Einhorn & Hogarth, 1978) in those situations in which the prediction of a positive or negative outcome has a self-fulfilling effect.

For example, admissions officers who judge that a candidate is particularly qualified for a graduate program may feel that their judgment is exonerated when that candidate does well, even though the candidate's success is in large part due to the positive effects of the program. (In contrast, a linear model of selection is evaluated by seeing how well it predicts performance within the set of applicants selected.) Or a waiter who believes that particular people at the table are poor tippers may be less attentive than usual and receive a smaller tip, thereby having his clinical judgment exonerated.⁸

A second psychological resistance to the use of linear models stems from their "proven" low validity. Here, there is an implicit (as opposed to explicit) argument from a vacuum because neither changes in evaluation procedures, nor in judges, nor in criteria are proposed. Rather, the unstated assumption is that these criteria of psychological interest are in fact highly predictable, so it follows that if one method of prediction (a linear model) doesn't work too well, another might do better (reasonable), which is then translated into the belief that another will do better (which is not a reasonable inference)—once it is found. This resistance is best expressed by a dean considering graduate admissions who wrote, "The correlation of the linear composite with future faculty ratings is only .4, whereas that of the admissions committee's judgment correlates .2. Twice nothing is nothing." In 1976, I answered as follows (Dawes, 1976, pp. 6-7):

    In response, I can only point out that 16% of the variance is better than 4% of the variance. To me, however, the fascinating part of this argument is the implicit assumption that that other 84% of the variance is predictable and that we can somehow predict it.
    Now what are we dealing with? We are dealing with personality and intellectual characteristics of [uniformly bright] people who are about 20 years old. ... Why are we so convinced that this prediction can be made at all? Surely, it is not necessary to read Ecclesiastes every night to understand the role of chance. ... Moreover, there are clearly positive feedback effects in professional development that exaggerate threshold phenomena. For example, once people are considered sufficiently "outstanding" that they are invited to outstanding institutions, they have outstanding colleagues with whom to interact—and excellence is exacerbated. This same problem occurs for those who do not quite reach such a threshold level. Not only do all these factors mitigate against successful long-range prediction, but studies of the success of such prediction are necessarily limited to those accepted, with the incumbent problems of restriction of range and a negative covariance structure between predictors (Dawes, 1975).
    Finally, there are all sorts of nonintellectual factors in professional success that could not possibly be evaluated before admission to graduate school, for example, success at forming a satisfying or inspiring libidinal relationship, not yet evident genetic tendencies to drug or alcohol addiction, the misfortune to join a research group that "blows up," and so on, and so forth.
    Intellectually, I find it somewhat remarkable that we are able to predict even 16% of the variance. But I believe that my own emotional response is indicative of those of my colleagues who simply assume that the future is more predictable. I want it to be predictable, especially when the aspect of it that I want to predict is important to me. This desire, I suggest, translates itself into an implicit assumption that the future is in fact highly predictable, and it would then logically follow that if something is not a very good predictor, something else might do better (although it is never correct to argue that it necessarily will).

⁸ This example was provided by Einhorn (Note 5).
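The restriction-of-range problem mentioned in the quotation can be made concrete with a simulation. In the sketch below, the .5 applicant-pool validity and the 20% admission rate are arbitrary illustrative choices, not figures from any of the studies discussed.

    import numpy as np

    rng = np.random.default_rng(5)
    n = 100_000
    predictor = rng.normal(size=n)
    # The criterion correlates .5 with the predictor in the full applicant pool.
    criterion = 0.5 * predictor + np.sqrt(1 - 0.25) * rng.normal(size=n)

    admitted = predictor >= np.quantile(predictor, 0.80)  # select the top 20%

    print("validity in the whole pool:   ",
          np.corrcoef(predictor, criterion)[0, 1])
    print("validity among those admitted:",
          np.corrcoef(predictor[admitted], criterion[admitted])[0, 1])

A predictor worth .5 in the pool looks like one worth roughly .26 when evaluated only within the selected group, which is one reason the validities discussed in this article appear so low.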

Statistical prediction, because it includes the specification (usually a low correlation coefficient) of exactly how poorly we can predict, bluntly strikes us with the fact that life is not predictable. Unsystematic clinical prediction (or "postdiction"), in contrast, allows us the comforting illusion that life is in fact predictable and that we can predict it.

ETHICAL

When I was at the Los Angeles Renaissance Fair last summer, I overheard a young woman complain that it was "horribly unfair" that she had been rejected by the Psychology Department at the University of California, Santa Barbara, on the basis of mere numbers, without even an interview. "How can they possibly tell what I'm like?" The answer is that they can't. Nor could they with an interview (Kelly, 1954). Nevertheless, many people maintain that making a crucial social choice without an interview is dehumanizing. I think that the question of whether people are treated in a fair manner has more to do with the question of whether or not they have been dehumanized than does the question of whether the treatment is face to face. (Some of the worst doctors spend a great deal of time conversing with their patients, read no medical journals, order few or no tests, and grieve at the funerals.) A GPA represents 3½ years of behavior on the part of the applicant. (Surely, not all the professors are biased against his or her particular form of creativity.) The GRE is a more carefully devised test. Do we really believe that we can do a better or a fairer job by a 10-minute folder evaluation or a half-hour interview than is done by these two mere numbers? Such cognitive conceit (Dawes, 1976, p. 7) is unethical, especially given the fact of no evidence whatsoever indicating that we do a better job than does the linear equation. (And even making exceptions must be done with extreme care if it is to be ethical, for if we admit someone with a low linear score on the basis that he or she has some special talent, we are automatically rejecting someone with a higher score, who might well have had an equally impressive talent had we taken the trouble to evaluate it.)

No matter how much we would like to see this or that aspect of one or another of the studies reviewed in this article changed, no matter how psychologically uncompelling or distasteful we may find their results to be, no matter how ethically uncomfortable we may feel at "reducing people to mere numbers," the fact remains that our clients are people who deserve to be treated in the best manner possible. If that means—as it appears at present—that selection, diagnosis, and prognosis should be based on nothing more than the addition of a few numbers representing values on important attributes, so be it. To do otherwise is cheating the people we serve.

REFERENCE NOTES

1. Thornton, B. Personal communication, 1977.
2. Slovic, P. Limitations of the mind of man: Implications for decision making in the nuclear age. In H. J. Otway (Ed.), Risk vs. benefit: Solution or dream? (Report LA 4860-MS). Los Alamos, N.M.: Los Alamos Scientific Laboratory, 1972. [Also available as Oregon Research Institute Bulletin, 1971, 11(17).]
3. Srinivasan, V. A theoretical comparison of the predictive power of the multiple regression and equal weighting procedures (Research Paper No. 347). Stanford, Calif.: Stanford University, Graduate School of Business, February 1977.
4. von Winterfeldt, D., & Edwards, W. Costs and payoffs in perceptual research. Unpublished manuscript, University of Michigan, Engineering Psychology Laboratory, 1973.
5. Einhorn, H. J. Personal communication, January 1979.

REFERENCES

Alexander, S. A. H. Sex, arguments, and social engagements in marital and premarital relations. Unpublished master's thesis, University of Missouri—Kansas City, 1971.
Beaver, W. H. Financial ratios as predictors of failure. In Empirical research in accounting: Selected studies. Chicago: University of Chicago, Graduate School of Business, Institute of Professional Accounting, 1966.
Bowman, E. H. Consistency and optimality in managerial decision making. Management Science, 1963, 9, 310-321.
Cass, J., & Birnbaum, M. Comparative guide to American colleges. New York: Harper & Row, 1968.
Claudy, J. G. A comparison of five variable weighting procedures. Educational and Psychological Measurement, 1972, 32, 311-322.
Darlington, R. B. Reduced-variance regression. Psychological Bulletin, 1978, 85, 1238-1255.
Dawes, R. M. A case study of graduate admissions: Application of three principles of human decision making. American Psychologist, 1971, 26, 180-188.
Dawes, R. M. Graduate admissions criteria and future success. Science, 1975, 187, 721-723.
Dawes, R. M. Shallow psychology. In J. Carroll & J. Payne (Eds.), Cognition and social behavior. Hillsdale, N.J.: Erlbaum, 1976.
Dawes, R. M., & Corrigan, B. Linear models in decision making. Psychological Bulletin, 1974, 81, 95-106.
Deacon, E. B. A discriminant analysis of predictors of business failure. Journal of Accounting Research, 1972, 10, 167-179.
deGroot, A. D. Het denken van den schaker [Thought and choice in chess]. The Hague, The Netherlands: Mouton, 1965.
Edwards, D. D., & Edwards, J. S. Marriage: Direct and continuous measurement. Bulletin of the Psychonomic Society, 1977, 10, 187-188.
Edwards, W. M. Technology for director dubious: Evaluation and decision in public contexts. In K. R. Hammond (Ed.), Judgment and decision in public policy formation. Boulder, Colo.: Westview Press, 1978.
Einhorn, H. J. Expert measurement and mechanical combination. Organizational Behavior and Human Performance, 1972, 7, 86-106.
Einhorn, H. J., & Hogarth, R. M. Unit weighting schemes for decision making. Organizational Behavior and Human Performance, 1975, 13, 171-192.
Einhorn, H. J., & Hogarth, R. M. Confidence in judgment: Persistence of the illusion of validity. Psychological Review, 1978, 85, 395-416.
Gardiner, P. C., & Edwards, W. Public values: Multiattribute-utility measurement for social decision making. In M. F. Kaplan & S. Schwartz (Eds.), Human judgment and decision processes. New York: Academic Press, 1975.
Goldberg, L. R. Diagnosticians vs. diagnostic signs: The diagnosis of psychosis vs. neurosis from the MMPI. Psychological Monographs, 1965, 79(9, Whole No. 602).
Goldberg, L. R. Seer over sign: The first "good" example? Journal of Experimental Research in Personality, 1968, 3, 168-171.
Goldberg, L. R. Man versus model of man: A rationale, plus some evidence for a method of improving on clinical inferences. Psychological Bulletin, 1970, 73, 422-432.

Goldberg, L. R. Parameters of personality inventory construction and utilization: A comparison of prediction strategies and tactics. Multivariate Behavioral Research Monographs, 1972, No. 72-2.
Goldberg, L. R. Man versus model of man: Just how conflicting is that evidence? Organizational Behavior and Human Performance, 1976, 16, 13-22.
Green, B. F., Jr. Parameter sensitivity in multivariate methods. Multivariate Behavioral Research, 1977, 12, 263-287.
Hammond, K. R. Probabilistic functioning and the clinical method. Psychological Review, 1955, 62, 255-262.
Hammond, K. R., & Adelman, L. Science, values, and human judgment. Science, 1976, 194, 389-396.
Hoffman, P. J. The paramorphic representation of clinical judgment. Psychological Bulletin, 1960, 57, 116-131.
Holt, R. R. Yet another look at clinical and statistical prediction. American Psychologist, 1970, 25, 337-349.
Howard, J. W., & Dawes, R. M. Linear prediction of marital happiness. Personality and Social Psychology Bulletin, 1976, 2, 478-480.
Kelly, L. Evaluation of the interview as a selection technique. In Proceedings of the 1953 Invitational Conference on Testing Problems. Princeton, N.J.: Educational Testing Service, 1954.
Krantz, D. H. Measurement structures and psychological laws. Science, 1972, 175, 1427-1435.
Krantz, D. H., Luce, R. D., Suppes, P., & Tversky, A. Foundations of measurement (Vol. 1). New York: Academic Press, 1971.
Libby, R. Man versus model of man: Some conflicting evidence. Organizational Behavior and Human Performance, 1976, 16, 1-12.
Marquardt, D. W., & Snee, R. D. Ridge regression in practice. American Statistician, 1975, 29, 3-19.
Meehl, P. E. Clinical versus statistical prediction: A theoretical analysis and a review of the evidence. Minneapolis: University of Minnesota Press, 1954.
Meehl, P. E. Seer over sign: The first good example. Journal of Experimental Research in Personality, 1965, 1, 27-32.
Nisbett, R. E., Borgida, E., Crandall, R., & Reed, H. Popular induction: Information is not necessarily informative. In J. Carroll & J. Payne (Eds.), Cognition and social behavior. Hillsdale, N.J.: Erlbaum, 1976.
Remus, W. E., & Jenicke, L. O. Unit and random linear models in decision making. Multivariate Behavioral Research, 1978, 13, 215-221.
Sawyer, J. Measurement and prediction, clinical and statistical. Psychological Bulletin, 1966, 66, 178-200.
Schmidt, F. L. The relative efficiency of regression and simple unit predictor weights in applied differential psychology. Educational and Psychological Measurement, 1971, 31, 699-714.
Simon, H. A., & Chase, W. G. Skill in chess. American Scientist, 1973, 61, 394-403.
Slovic, P., & Lichtenstein, S. Comparison of Bayesian and regression approaches to the study of information processing in judgment. Organizational Behavior and Human Performance, 1971, 6, 649-744.
Thornton, B. Linear prediction of marital happiness: A replication. Personality and Social Psychology Bulletin, 1977, 3, 674-676.
Tversky, A., & Kahneman, D. Judgment under uncertainty: Heuristics and biases. Science, 1974, 185, 1124-1131.
Wainer, H. Estimating coefficients in linear models: It don't make no nevermind. Psychological Bulletin, 1976, 83, 213-217.
Wainer, H., & Thissen, D. Three steps toward robust regression. Psychometrika, 1976, 41, 9-34.
Wallace, H. A. What is in the corn judge's mind? Journal of the American Society of Agronomy, 1923, 15, 300-304.
Wiggins, N., & Kohen, E. S. Man vs. model of man revisited: The forecasting of graduate school success. Journal of Personality and Social Psychology, 1971, 19, 100-106.
Wilks, S. S. Weighting systems for linear functions of correlated variables when there is no dependent variable. Psychometrika, 1938, 3, 23-40.
Yntema, D. B., & Torgerson, W. S. Man-computer cooperation in decisions requiring common sense. IRE Transactions of the Professional Group on Human Factors in Electronics, 1961, 2(1), 20-26.
