
Observational Studies 1 (2015) 124-125 Submitted 8/15; Published 8/15

Introduction to Observational Studies and the Reprint of Cochran’s paper “Observational Studies” and Comments

Dylan S. Small [email protected]
Department of Statistics, University of Pennsylvania
Philadelphia, PA 19104, USA

In this first issue of Observational Studies, we reprint a review of observational studies by William Cochran, a pioneer of statistical research on observational studies, followed by comments by leading current researchers in observational studies. Cochran (1965, Journal of the Royal Statistical Society, Series A) defined an observational study as

an empiric investigation [in which]...the objective is to elucidate cause-and-effect relationships...[in which] it is not feasible to use controlled experimentation, in the sense of being able to impose the procedures or treatments whose effects it is desired to discover, or to assign subjects at random to different procedures.

Observational Studies is a new peer-reviewed journal that seeks to publish papers on all aspects of observational studies. Researchers from all fields that make use of observational studies are encouraged to submit papers. Topics covered by the journal include, but are not limited to, the following:

• Study protocols for observational studies. The journal seeks to promote the planning and transparency of observational studies. In addition to publishing study protocols, the journal will publish comments on the study protocols and allow the authors of the study to respond to the comments.

• Methodologies for observational studies. This includes statistical methods for all aspects of observational studies and methods for the conduct of observational studies, such as methods for collecting data. In addition to novel methodological articles, the journal welcomes review articles on methodology relevant to observational studies as well as illustrations/explanations of methodologies that may have been developed in a more technical article in another journal.

• Software for observational studies. The journal welcomes articles describing software relevant to observational studies.

• Descriptions of observational study data sets. The journal welcomes descriptions of observational study data sets and how to access them. The goal of the descriptions of observational study data sets is to enable readers to form collaborations, to learn from each other and to maximize use of existing resources. The journal also encourages submission of examples of how a publicly available observational study database can be used.

• Analyses of observational studies. The journal welcomes analyses of observational studies. The journal encourages submissions of analyses that illustrate use of sound methodology and conduct of observational studies.

The paper we reprint of Cochran’s and the comments by leading current researchers in observational studies provide illuminating perspectives on important issues in observational studies that the journal seeks to address. The contents of the rest of this section are as follows:

Author – Title – Pages
William Cochran – Observational Studies – 126-136
Norman Breslow – William G. Cochran and the 1964 Surgeon General’s Report – 137-140
Thomas Cook – The Inheritance bequeathed to William G. Cochran that he willed forward and left for others to will forward again: The Limits of Observational Studies that seek to Mimic Randomized Experiments – 141-164
D.R. Cox & Nanny Wermuth – Design and interpretation of studies: relevant concepts from the past and some extensions – 165-170
Stephen Fienberg – Comment on “Observational Studies” by William G. Cochran – 171-172
Joseph Gastwirth & Barry Graubard – Comment on Cochran’s “Observational Studies” – 173-181
Andrew Gelman – The State of the Art in Causal Inference: Some Changes Since 1972 – 182-183
Ben Hansen & Adam Sales – Comment on Cochran’s “Observational Studies” – 184-193
Miguel Hernán – A good deal of humility: Cochran on observational studies – 194-195
Jennifer Hill – Lessons we are still learning – 196-199
Judea Pearl – Causal Thinking in the Twilight Zone – 200-204
Paul Rosenbaum – Cochran’s Causal Crossword – 205-211
Donald Rubin – Comment on Cochran’s “Observational Studies” – 212-216
Herbert Smith – Comment on Cochran’s “Observational Studies” – 217-219
Mark van der Laan – Comment on “Observational Studies” by Dr. W.G. Cochran (1972) – 220-222
Tyler VanderWeele – Observational Studies and Study Designs: An Epidemiologic Perspective – 223-230
Stephen West – Reflections on “Observational Studies”: Looking Backward and Looking Forward – 231-240

Observational Studies 1 (2015) 126-136 Submitted 1972; Reprinted 8/15

Observational Studies

William G. Cochran

Editor’s Note: William G. Cochran (1909-1980) was Professor of Statistics, Harvard University, Cambridge, Massachusetts. This article was originally published in Statistical Papers in Honor of George W. Snedecor, ed. T.A. Bancroft, 1972, Iowa State University Press, pp. 77-90. The paper is reprinted with permission of the copyright holder, Iowa State University Press. Comments by leading current researchers in observational studies follow.

1. Introduction

OBSERVATIONAL STUDIES are a class of statistical studies that have increased in frequency and importance during the past 20 years. In an observational study the investigator is restricted to taking selected observations or measurements on the process under study. For one reason or another he cannot interfere in the process in the way that one does in a controlled laboratory type of experiment.

Observational studies fall roughly into two broad types. The first is often given the name of “analytical surveys.” The investigator takes a survey of a population of interest and proceeds to conduct statistical analyses of the relations between variables of interest to him. An early example was Kinsey’s study (1948) of the relation between the frequencies of certain types of sexual behavior and variables like the age, sex, social level, religious affiliation, rural-urban background, and direction of social mobility of the person involved. Dr. Kinsey gave much thought to the methodological problems that he would face in planning his study. More recently, in what is called the “midtown Manhattan study” (Srole et al., 1962), a team of psychiatrists studied the relation in Manhattan, New York, between age, sex, parental and own social level, ethnic origin, generation in the United States, and religion and nonhospitalized mental illness.

The second type of observational study is narrower in scope. The investigator has in mind some agents, procedures, or experiences that may produce certain causal effects (good or bad) on people. These agents are like those the statistician would call treatments in a controlled experiment, except that a controlled experiment is not feasible. Examples of this type abound. A simple one structurally is a Cornell study of the effect of wearing a lap seat belt on the amount and type of injury sustained in an automobile collision. This study was done from police and medical records of injuries in automobile accidents. The prospective smoking and health studies (1964) are also a well-known example. These are comparisons of the death rates and causes of death of men and women with different smoking patterns in regard to type and amount. An example known as the “national halothane study” (Bunker et al., 1969) attempted to make a fair comparison of the death rates due to the five leading anesthetics used in hospital operations.


Several factors are probably responsible for the growth in the number of studies of this kind. One is a general increase in funds for research in the social sciences and medicine. A related reason is the growing awareness of social problems. A study known as the “Coleman report” (1966) has attracted much discussion. This was begun because Congress gave the U.S. Office of Education a substantial sum and asked it to conduct a nationwide survey of elementary schools and high schools to discover to what extent minority-group children in the United States (Blacks, Indians, Puerto Ricans, Mexican-Americans, and Orientals) receive a poorer education than the majority whites.

A third reason is the growing area of program evaluation. All over the world, administrative bodies – central, regional, and local – spend the taxpayers’ money on new programs intended to benefit some or all of the population or to combat social evils. Similarly, a business organization may institute changes in its operations in the hope of improving the running of the business. The idea is spreading that it might be wise to devote some resources to trying to measure both the intended and the unintended effects of these programs. Such evaluations are difficult to do well, and they make much use of observational studies. Finally, some studies are undertaken to investigate stray reports of unexpected effects that appear from time to time. The halothane study is an example; others are studies of side effects of the contraceptive pill and of other reported health effects.

This paper is confined mainly to the second, narrower class of observational studies, although some of the problems to be considered are also met in the broader analytical ones. For this paper I naturally sought a topic that would reflect the outlook and research interests of George Snedecor. In his career activity of helping investigators, he developed a strong interest in the design of experiments, a subject on which numerous texts are now available. The planning of observational studies, in which we would like to do an experiment but cannot, is a closely related topic which cries aloud for George’s mature wisdom and the methodological truths that he expounded so clearly. Succeeding sections will consider some of the common issues that arise in planning.

2. The Statement of Objectives

Early in the planning it is helpful to construct and discuss as clear and specific a written statement of the objectives as can be made at that stage. Otherwise it is easy in a study of any complexity to take later decisions that are contrary to the objectives or to find that different team members have conflicting ideas about the purpose of the study. Some investigators prefer a statement in the form of hypotheses to be tested, others in the form of quantities to be estimated or comparisons to be made. An example of the hypothesis type comes from a study (Buck et al., 1968), by a Johns Hopkins team, of the effects of coca-chewing by Peruvian Indians. Their hypotheses were stated as follows.

1. Coca, by diminishing the sensation of hunger, has an unfavorable effect on the nutritional state of the habitual chewer. Malnutrition and conditions in which nutritional deficiencies are important determinants occur more frequently among chewers than among control subjects.


2. Coca chewing leads to a state of relative indifference which can result in inferior personal hygiene.

3. The work performance of coca chewers is lower than that of comparable nonchewers.

One objection sometimes made to this form of statement is its suggestion that the answers are already known, and thus it hints at personal bias. However, these statements could easily have been put in a neutral form, and the three specific hypotheses about coca were suggested by a previous League of Nations commission. The statements perform the valuable purpose of directing attention to the comparisons and measurements that will be needed.

3. The Comparative Structure of the Plan

The statement of objectives should have suggested the type of comparisons on which logical judgments about the effects of treatment would be based. Some of the most common structures are outlined below. First, the study may be restricted to a single group of people, all subject to the same treatment. The timing of the measurements may take several forms.

1. After Only (i.e., after a period during which the treatment should have had time to produce its effects).

2. Before and After (planned comparable measurements both before and after the period of exposure to the agent or treatment).

3. Repeated Before and Repeated After.

In both (1) and (2) there may be a series of After measurements if there is interest in the long-term effects of treatment. Single-group studies are so weak logically that they should be avoided whenever possible, but in the case of a compulsory law or change in business practice, a comparable group not subject to the treatment may not be available. In an After Only study we can perhaps judge whether or not the situation after the period of treatment was satisfactory but have no basis for judging to what extent, if any, the treatment was a cause, except perhaps by an opinion derived from a subjective impression as to the situation before exposure. Supplementary observations might of course teach something useful about the operation of a law – e.g., that it was widely disobeyed through ignorance or unpopularity with the public or that it was unworkable as too complex for the administrative staff.

In the single-group Before and After study we at least have estimates of the changes that took place during the period of treatment. The problem is to judge the role of the treatment in producing these changes. For this step it is helpful to list and judge any other contributors to the change that can be envisaged. Campbell and Stanley (1966) have provided a useful list with particular reference to the field of education.

Consider a Before-After rise. This might be due to what I vaguely call “external” causes. In an economic study a Before-After rise might accompany a wide variety of “treatments,” good or bad, during a period of increasing national employment and prosperity. In educational examinations contributors might be the increasing maturity of the students or familiarity with the tests. In a study of an apparently low group on some variable (e.g., poor at some task) a rise might be due to what is called the regression effect. If a person’s score fluctuates from time to time through human variability or measurement error, the “low” group selected is likely to contain persons who were having an unusually bad day or had a negative error of measurement on that day. In the subsequent After measurement, such persons are likely to show a rise in score even under no treatment – either they are having one of their “up” days or the error of measurement is positive on that day.

After World War I the French government instituted a wage bonus for civil servants with large families to stimulate an increase in the birthrate and the population of France. I have been told the primary effect was an influx of men with large families into French civil service jobs, creating a Before-After rise that might be interpreted as a success of the “treatment.” An English Before-After evaluation of a publicity campaign to encourage people to come into clinics for needed protective shots obtained a Before-After drop in number of shots given. The clinics, who were asked to keep the records, had persuaded patrons to come in at once if they were known to be intending to have shots (Before), so that these people would be out of the way when the presumed big rush from the campaign started.

A time-series study with repeated measurements Before and After presents interesting problems – that of appraising whether the Before-After change during the period of treatment is real in relation to changes that occur from external causes in the Before and After periods and that of deciding what is suggested about the time-response curve to the treatment. Campbell and Ross (1968) give an excellent account of the types of analysis and judgment needed in connection with a study of the Connecticut state law imposing a crackdown on speeding, and Campbell (1969) has discussed the role of this and other techniques in a highly interesting paper on program evaluation.

Single-group studies emphasize a characteristic that is prominent in the analysis of nearly all observational studies – the role of judgment. No matter how well-constructed a mathematical model we have, we cannot expect to plan a statistical analysis that will provide an almost automatic verdict. The statistician who intends to operate in this field must cultivate an ability to judge and weigh the relative importance of different factors whose effects cannot be measured at all accurately.

Reverting to types of structure, we come now to those with more than one group. The simplest is a two-group study of treated and untreated groups (seat-belt wearers and nonwearers). We may also have various treatments or forms of treatment, as in the smoking and health studies (pipes, cigars, cigarettes, different amounts smoked, and ex-smokers who had stopped for different lengths of time and had previously smoked different amounts). Both After Only and Before and After measurements are common. Sometimes both an After Only and a Before-After measurement are recommended for each comparison group if there is interest in studying whether the taking of the Before measurement influenced the After measurement.

Comparison groups bring a great increase in analytical insight. The influence of external causes on both groups will be similar in many types of study and will cancel or be minimized when we compare treatment with no treatment. But such studies raise a new problem – How do we ensure that the groups are comparable?
Some relevant statistical techniques are outlined in section 6. In regard to incomparability of the groups, the Before and After study is less vulnerable than the After Only, since we should be able to judge comparability of the treated and untreated groups on the response variable at a time when they have not been subjected to the difference in treatment. Occasionally, we might even be able to select the two groups by randomization, having a true experiment instead of an observational study; but this is not feasible when the groups are self-selected (as in smokers) or selected by some administrative fiat or outside agent (e.g., illness).
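The regression effect described above is easy to reproduce numerically. The following sketch is an editorial illustration, not part of Cochran’s text; all numbers (an ability mean of 50, the error standard deviations, the cutoff of 40) are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
ability = rng.normal(50, 10, n)            # stable component of each person's score
before = ability + rng.normal(0, 5, n)     # Before score = ability + day-to-day error
after = ability + rng.normal(0, 5, n)      # After score with NO treatment at all

low = before < 40                          # select the apparently "low" group
print(before[low].mean())                  # about 34: partly genuine, partly bad luck
print(after[low].mean())                   # about 37: a spurious "gain" of ~3 points
```

The selected group improves on remeasurement purely because selection favored units with negative measurement errors, exactly the mechanism described above.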

4. Measurements

The statement of objectives will also have suggested the types of measurements needed; their relevance is obviously important. For instance, early British studies by aerial photographs in World War II were reported to show great damage to German industry. Knowing that early British policy was to bomb the town center and that German factories were often concentrated mainly on the outskirts, Yates (1968) confined his study to the factory areas, with quite a different conclusion which was confirmed when postwar studies could be made. The question of what is considered relevant is particularly important in program evaluation. A program may succeed in its main objectives but have undesirable side effects. The verdict on the program may differ depending on whether or not these side effects are counted in the evaluation.

It is also worth reviewing what is known about the accuracy and precision of proposed measurements. This is especially true in social studies, which often deal with people’s attitudes, motivations, opinions, and behavior – factors that are difficult to measure accurately. Since we may have to manage with very imperfect measurements, we need more technical research on the effects of errors of measurement. Three aspects are: (1) more study of the actual distribution of errors of measurement, particularly in multivariate problems, so that we work with realistic models; (2) investigation, from these models, of the effects on the standard types of analysis; (3) study of methods of remedying the situation by different analyses with or without supplementary study of the error distributions. To judge by work to date on the problem of estimating a structural regression, this last problem is formidable.

It is also important to check comparability of measurement in the comparison groups. In a medical study a trained nurse who has worked with one group for years but is a stranger to the other group might elicit different amounts of trustworthy information on sensitive questions. Cancer patients might be better informed about cases of cancer among blood relatives than controls free from cancer. The scale of the operation may also influence the measuring process. The midtown Manhattan study, for instance, at first planned to use trained psychiatrists for obtaining the key measurements, but they found that only enough psychiatrists could be provided to measure a sample of 100. The analytical aims of the study needed a sample of at least 1,000. In numerous instances the choice seems to lie between doing a study much smaller and narrower in scope than desired but with high quality of measurement, or an extensive study with measurements of dubious quality. I am seldom sure what to advise.

In large studies one occasionally sees a mistake in plans for measurement that is perhaps due to inattention. If two laboratories or judges are needed to measure the responses, an administrator sends all the treatment group to laboratory 1 and the untreated to laboratory 2 – it is at least a tidy decision. But any systematic difference between laboratories or judges becomes part of the estimated treatment effect. In such studies there is usually no difficulty in sending half of each group, selected at random, to each judge.

5. Observations and Experiments

In the search for techniques that help to ensure comparability in observational studies, it is worth recalling the techniques used in controlled experiments, where the investigator faces similar problems but has more resources to command. In simple terms these techniques might be described as follows.

Identify the major sources of variation (other than the treatments) that affect the response variable. Conduct the experiment and analysis so that the effects of such sources are removed or balanced out. The two principal devices for this purpose are blocking and the analysis of covariance. Blocking is employed at the planning stage of the experiment. With two treatments, for example, the subjects are first grouped into pairs (blocks of size 2) such that the members of a pair are similar with respect to the major anticipated sources of variation. Covariance is used primarily when the response variable y is quantitative and some of the major extraneous sources of variation can also be represented by quantitative variables x1, x2, . . . From a mathematical model expressing y in terms of the treatment effects and the values of the xi, estimates of the treatment effects are obtained that have been adjusted to remove the effects of the xi. Covariance and blocking may be combined.

For minor and unknown sources of variation, use randomization. Roughly speaking, randomization makes such sources of error equally likely to favor either treatment and ensures that their contribution is included in the standard error of the estimated treatment effect if properly calculated for the plan used.

In general, extraneous sources of variation may influence the estimated treatment effect τ̂ in two ways. They may create a bias B. Instead of estimating the true treatment effect τ, the expected value of τ̂ is (τ + B), where B is usually unknown. They also increase the variance of τ̂. In experiments a result of randomization and other precautions (e.g., blindness in measurement) is that the investigator usually has little worry about bias. Discussions of the effectiveness of blocking and covariance (e.g., Cox, 1957) are confined to their effect on V(τ̂) and on the power of tests of significance.

In observational studies we cannot use randomization of subjects, but we can try to use techniques like blocking and covariance. However, in the absence of randomization these techniques have a double task – to remove or reduce bias and to increase precision by decreasing V(τ̂). The reduction of bias should, I think, be regarded as the primary objective – a highly precise estimate of the wrong quantity is not much help.
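The point that the expected value of τ̂ is (τ + B) can be made concrete with a small simulation, added here for illustration; the self-selection mechanism and all parameter values are invented:

```python
import numpy as np

rng = np.random.default_rng(1)
n, tau = 200_000, 2.0                      # true treatment effect tau = 2
health = rng.normal(0, 1, n)               # extraneous variable that also affects y

# Self-selection: healthier people are more likely to end up treated.
treated = rng.random(n) < 1 / (1 + np.exp(-2 * health))
y = 5 + 3 * health + tau * treated + rng.normal(0, 1, n)
print(y[treated].mean() - y[~treated].mean())    # about 5.7 = tau + B, not 2

# Randomization: the same naive comparison is unbiased.
treated_r = rng.random(n) < 0.5
y_r = 5 + 3 * health + tau * treated_r + rng.normal(0, 1, n)
print(y_r[treated_r].mean() - y_r[~treated_r].mean())    # about 2.0
```

No increase in sample size shrinks the bias B in the first comparison; it only makes the wrong quantity more precise.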

6. Matching and Adjustments

In observational studies as in experiments we start with a list of the most important extraneous sources of variation that affect the response variable. The Cornell study, based on automobile accidents involving seat-belt wearers and nonwearers, listed 12 major variables. The most important was the intensity and direction of the physical force at impact. A head-on collision at 60 mph is a very different matter from a sideswipe at 25 mph. In the smoking–death-rate studies age gradually becomes a predominating variable for men over 55. In the raw data supplied to the Surgeon General’s committee by the British and Canadian studies and in a U.S. study, cigarette smokers and nonsmokers had about the same death rates. The high death rates occurred among the cigar and pipe smokers. If these data had been believed, television warnings might now be advising cigar and pipe smokers to switch to cigarettes. However, cigar and pipe smokers in these studies were found to be markedly older than nonsmokers, while cigarette smokers were, on the whole, younger. All studies regarded age as a major extraneous variable in the analysis. After adjustment for age differences, death rates for cigar and pipe smokers were close to those for nonsmokers; those for cigarette smokers were consistently higher.

In observational studies three methods are in common use in an attempt to remove bias due to extraneous variables.

Blocking, usually known as matching in observational studies. Each member of the treated group has a match or partner in the untreated group. If the x variables are classified, we form the cells created by the multiple classification (e.g., x1 with 3 classes and x2 with 4 classes create 12 cells). A match is a member of the same cell. If x is quantitative (discrete or continuous), a common method is to turn it into a classified variate (e.g., age in 10-year classes). Another method, caliper matching, is to call x11i (in group 1) and x12j (in group 2) matches with respect to x1 if |x11i − x12j| ≤ a.

Standardization (adjustment by subclassification). This is the analogue of covariance when the x’s are classified and we do not match. Arrange the data from the treated and untreated samples in cells, the ith cell containing say n1i, n2i observations with response means ȳ1i, ȳ2i. If the effect τ of the treatment is the same in every cell, this method depends on the result that for any set of weights wi with Σ wi = 1, the quantity τ̂ = Σ wi(ȳ1i − ȳ2i) is an unbiased estimate of τ (apart from any within-cell bias). The weights can therefore be chosen to minimize V(τ̂). If it is clear that τ varies from cell to cell, as often happens, the choice of weights becomes more critical, since it determines the quantity Σ wiτi that is being estimated. In vital statistics a common practice is to take the weights from some standard population to which we wish the comparison to apply.

Covariance (with x’s quantitative), used just as in experiments.

The idea of matching is easy to grasp, and the statistical analysis is simple. On the operational side, matching requires a large reservoir in at least one group (treated or untreated) in which to look for matches. The hunt for matches (particularly with caliper matching) may be slow and frustrating, although computers should be able to help if data about the x’s can be fed into them. Matching is avoided when the planned sample size is large, there are numerous treatments, subjects become available only slowly through time, and it is not feasible to measure the x’s until the samples have already been chosen and y is also being measured.

There has been relatively little study of the effects of these devices on bias and precision, although particular aspects have been discussed by Billewicz (1965), Cochran (1968) and Rubin (1970). If x is classified and two members of the same class are identical in regard to the effect of x on y, matching and standardization remove all the bias, while matching should be somewhat superior in regard to precision.
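As an editorial illustration of two of the devices just described, the sketch below computes the standardization estimate τ̂ = Σ wi(ȳ1i − ȳ2i) over age cells and then forms caliper matches with |x1 − x2| ≤ a. The data are simulated and every value (the age ranges, slope, true effect of 2, caliper a = 1) is invented:

```python
import numpy as np

rng = np.random.default_rng(2)

# Treated subjects are younger on average; y rises with age x; true effect = 2.
x1 = rng.uniform(30, 70, 500); y1 = 0.1 * x1 + 2 + rng.normal(0, 1, 500)  # treated
x2 = rng.uniform(40, 80, 800); y2 = 0.1 * x2 + rng.normal(0, 1, 800)      # untreated

# Standardization (adjustment by subclassification) with 10-year age cells.
tau_hat, total_w = 0.0, 0
for lo in np.arange(30, 90, 10):
    in1 = (x1 >= lo) & (x1 < lo + 10)
    in2 = (x2 >= lo) & (x2 < lo + 10)
    if in1.any() and in2.any():
        w = in1.sum()                          # weight cells by treated-group counts
        tau_hat += w * (y1[in1].mean() - y2[in2].mean())
        total_w += w
print(tau_hat / total_w)                       # close to 2; the raw difference is not

# Caliper matching: x1_i and x2_j are matches if |x1_i - x2_j| <= a.
a, used, diffs = 1.0, np.zeros(len(x2), bool), []
for i in range(len(x1)):
    gaps = np.abs(x2 - x1[i])
    gaps[used] = np.inf                        # each untreated unit used at most once
    j = gaps.argmin()
    if gaps[j] <= a:
        used[j] = True
        diffs.append(y1[i] - y2[j])
print(np.mean(diffs))                          # also close to 2, from matched pairs
```

The raw mean difference here is badly biased because the age distributions differ; both devices recover roughly the true effect within the region where the groups overlap.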
I am not sure, however, how often such ideal classifications actually exist. Many classified variables, especially ordered classifications, have an underlying quantitative x – e.g., for sex with certain types of response there is a whole gradation from very manly men to very womanly women. This is obviously true for quantitative x’s that are deliberately made classified in order to use within-cell matching. In such cases, matching and standardization remove between-cell bias but not within-cell bias. Of an initial bias in means µ1x − µ2x, they remove about 64%, 80%, 87%, 91%, and 93% with 2, 3, 4, 5, and 6 classes, the actual amount varying a little with the choice of class boundaries and the nature of the x distribution (Cochran, 1968). Caliper matching removes about 76%, 84%, 90%, 95%, and 99% with a/σx = 1, 0.8, 0.6, 0.4, and 0.2. These percentages also apply to y under a linear or nearly linear regression of y on x.

With a quantitative x, covariance adjustments remove all the initial bias if the correct model is fitted, and they are superior to within-class matching of x when this assumption holds. In practice, covariance nearly always means linear covariance to most users, and some bias remains after covariance adjustment if the y, x relation is nonlinear and a linear covariance is fitted. If the nonlinearity is of the type that can be approximated by a quadratic curve, results by Rubin (1970) suggest that the residual bias should be small if σ²1x = σ²2x and x is symmetrical or nearly so in distribution. When σ²1x/σ²2x is 1/2 or 2, the adjustment can either overcorrect or undercorrect to a material extent.

Caliper matching, on the other hand, and even within-class matching do not lean on an assumed linear relation between y and x. If σ²1x/σ²2x is near 1 (perhaps between 0.8 and 1.2), the evidence to date suggests, however, that linear covariance is superior to within-class matching in removing bias under a moderately curved y, x relation, although more study of this point is needed. Linear covariance applied to even loosely caliper-matched samples should remove nearly all the initial bias in this situation. Billewicz (1965) compared linear covariance and within-class matching (3 or 4 classes) in regard to precision in a model in which x was distributed as N(0, 1) in both populations. For the curved relations y = 0.4x − 0.1x², y = 0.8x − 0.14x², and y = tanh x, he found covariance superior in precision on samples of size 40.

Larger studies in which matching becomes impractical present difficult problems in analysis. Protection against bias from numerous x variables is not easy. Further, if there are say four x variables, the treatment effect may change with the levels of x2 and x3. For applications of the conclusions it may be important to find this out. The obvious recourse is to model construction and analysis based on the model, which has been greatly developed, particularly in regression. Nevertheless the Coleman report on education (1966) and the national halothane study (Bunker et al., 1969) illustrate difficulties that remain.
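The subclassification percentages quoted above can be checked approximately by simulation. This sketch is ours, not Cochran’s; it assumes x normal in both groups with a small bias of 0.2 in the means, class boundaries at the equal-probability points of N(0, 1), and weights taken from the treated group (NumPy and SciPy assumed available):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
delta, n = 0.2, 1_000_000            # small initial bias in the means of x
x1 = rng.normal(delta, 1, n)         # treated group
x2 = rng.normal(0.0, 1, n)           # untreated group

for k in range(2, 7):
    cuts = norm.ppf(np.linspace(0, 1, k + 1))     # equal-probability boundaries
    c1 = np.digitize(x1, cuts[1:-1])              # class index of each treated unit
    c2 = np.digitize(x2, cuts[1:-1])              # class index of each untreated unit
    w = np.bincount(c1, minlength=k) / n          # weights from the treated group
    left = sum(w[i] * (x1[c1 == i].mean() - x2[c2 == i].mean()) for i in range(k))
    print(k, round(100 * (1 - left / delta)))     # percent of the bias in x removed
# prints values near 64, 80, 87, 90, 92 for k = 2..6 classes
```

The simulated removal fractions agree roughly with the 64%, 80%, 87%, 91%, 93% figures cited from Cochran (1968); the residual is the within-class bias that subclassification cannot touch.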

7. Further Points on Planning

7.1 Sample Size

Statisticians have developed formulas that provide guidance on the sample size needed in a study. The formulas tend to be harder to use in observational studies than in experiments, because less may be known about the likely values of the population parameters that appear in them and because the formulas assume that bias is negligible. Nevertheless there is frequently something useful to be learned – for instance, that the proposed size looks adequate for estimating a single overall effect of the treatment, but does not if the variation in effect with an x is of major interest.
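For a two-group comparison of means, the standard formula behind such guidance is n per group ≈ 2(zα/2 + zβ)²σ²/Δ². A minimal sketch, added editorially with illustrative values only (and, as the text stresses, valid only when bias is negligible):

```python
from scipy.stats import norm

def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate n per group to detect a mean difference delta with a
    two-sided level-alpha test; assumes known sigma and no bias."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return 2 * (z * sigma / delta) ** 2

print(round(n_per_group(delta=0.5, sigma=1.0)))   # about 63 per group
```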


7.2 Nonresponse

Certain administrative bodies may refuse to cooperate in a study; certain people may be unwilling or unable to answer the questions asked or may not be found at home. In modern studies, standards with regard to the nonresponse problem seem to me to be lax. In both the smoking and Coleman studies nonresponse rates of over 30% were common. The main difficulty with nonresponse is not the reduction in sample size but that nonrespondents may be to some extent different types of people from respondents and give different types of answers, so that results from respondents are biased in this sense. Fortunately, nonresponse can often be reduced materially by hard work during the study, but definite plans for this need to be made in advance.
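The arithmetic behind this concern is worth making explicit: if a fraction q of the population would not respond and differs from respondents by d on the outcome, the respondent mean is off by q·d no matter how large the sample. A worked example with invented numbers:

```python
# Suppose 30% nonresponse (rates like those in the smoking and Coleman studies)
# and that nonrespondents would average d = 0.4 less than respondents.
q, d = 0.30, 0.4
mean_respondents = 10.0                      # what the survey observes
mean_population = mean_respondents - q * d   # what it should have estimated
print(mean_respondents - mean_population)    # bias = q * d = 0.12, at ANY sample size
```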

7.3 Pilot Study

The case for starting with a small pilot study should be considered – for instance, to work out the field procedures and check the understanding and acceptability of the questions and the interviewing methods and time taken. When information is wanted on a new problem, the cheapest and quickest method is to base a study on routine records that already exist. However, such records are often incomplete and have numerous gross errors. A law or administrative rule specifying that records shall be kept does not ensure that the records are usable for research purposes. A good pilot study of the records should reveal the state of affairs. It is worth looking at the variances of the measurements; a suspiciously low variance has sometimes led to detection of the practice of copying previous values instead of making an independent determination.

7.4 Critique

When the draft of plans for a study is prepared, it helps to find a colleague willing to play the role of devil’s advocate – to read the plan and to point out any methodological weaknesses that he sees. Since observational studies are vulnerable to such defects, the investigator should of course also be doing this, but it is easy to get in a rut and overlook some aspect. It helps even more if the colleague can suggest ways of removing or reducing these faults. In the end, however, the best plan that investigator and colleague can devise may still be subject to known weaknesses. In the report of the results these should be discussed in a clearly labeled section, with the investigator’s judgment about their impact.

7.5 Sampled and Target Populations

Ideally, the statistician would recommend that a study start with a probability sample of the target population about which the investigator wishes to obtain information. But both in experiments and in observational surveys many factors – feasibility, costs, geography, supply of subjects, opportunity – influence the choice of samples. The population actually sampled may therefore differ in several respects from the target population. In his report the investigator should try to describe the sampled population and relevant target populations and give his opinion as to how any differences might affect the results, although this is admittedly difficult.


One reason why this step is useful is that an administrator in California, say, may want to see the results of a good study on some social issue for policy guidance and may find that the only relevant study was done in Philadelphia or Sweden. He will appreciate help in judging whether to expect the same results in California.

7.6 Judgment about Causality

Techniques of statistical analysis of observational studies have in general employed standard methods and will not be discussed here. When the analysis is completed, there remains the problem of reaching a judgment about causality. On this point I have little to add to a previous discussion (Cochran, 1965). It is well known that evidence of a relationship between x and y is no proof that x causes y. The scientific philosophers to whom we might turn for expert guidance on this tricky issue are a disappointment. Almost unanimously and with evident delight they throw the idea of cause and effect overboard. As the statistical study of relationships has become more sophisticated, the statistician might admit, however, that his point of view is not very different, even if he wishes to retain the terms cause and effect.

The probabilistic approach enables us to discard oversimplified deterministic notions that make the idea look ridiculous. We can conceive of a response y having numerous contributory causes, not just one. To say that x is a cause of y does not imply that x is the only cause. With 0,1 variables we may merely mean that if x is present, the probability that y happens is increased – but not necessarily by much. If x and y are continuous, a causal relation may imply that as x increases, the average value of y increases, or some other feature of its distribution changes. The relation may be affected by the levels of other variables; it may be strengthened or weakened or entirely disappear, depending on these levels. One can see why the idea becomes tortuous. For successful prediction, however, a knowledge of the nature and stability of these relationships is an essential step and this is something that we can try to learn in observational studies.

A claim of proof of cause and effect must carry with it an explanation of the mechanism by which the effect is produced. Except in cases where the mechanism is obvious and undisputed, this may require a completely different type of research from the observational study that is being summarized. Thus in most cases the study ends with an opinion or judgment about causality, not a claim of proof.

Given a specific causal hypothesis that is under investigation, the investigator should think of as many consequences of the hypothesis as he can and in the study try to include response measurements that will verify whether these consequences follow. The cigarette-smoking and death-rate studies are a good example. For causes of death to which smoking is thought to be a leading contributor, we can compare death rates for nonsmokers and for smokers of different amounts, for ex-smokers who have stopped for different lengths of time but used to smoke the same amount, for ex-smokers who have stopped for the same length of time but used to smoke different amounts, and (in later studies) for smokers of filter and nonfilter cigarettes. We can do this separately for men and women and also for causes of death to which, for physiological reasons, smoking should not be a contributor. In each comparison the direction of the difference in death rates and a very rough guess at the relative size can be made from a causal hypothesis and can be put to the test.


The same can be done for any alternative hypotheses that occur to the investigator. It might be possible to include in the study response measurements or supplementary observations for which alternative hypotheses give different predictions. In this way, ingenuity and hard work can produce further relevant data to assist the final judgment. The final report should contain a discussion of the status of the evidence about these alternatives as well as about the main hypothesis under study.

In conclusion, observational studies are an interesting and challenging field which demands a good deal of humility, since we can claim only to be groping toward the truth.

References

Billewicz, W. Z. (1965). The efficiency of matched samples. Biometrics 21: 623-44.
Buck, A. A. et al. (1968). Coca chewing and health. Am. J. Epidemiol. 88: 159-77.
Bunker, J. P. et al., eds. (1969). The national halothane study. Washington, D.C.: USGPO.
Campbell, D. T. (1969). Reforms as experiments. Am. Psychologist 24: 409-29.
Campbell, D. T., and H. L. Ross. (1968). The Connecticut crackdown on speeding: time-series data in quasi-experimental analysis. Law and Society Rev. 3: 33-53.
Campbell, D. T., and J. C. Stanley. (1966). Experimental and quasi-experimental designs in research. Chicago: Rand McNally.
Cochran, W. G. (1965). The planning of observational studies of human populations. J. Roy. Statist. Soc. Ser. A 128: 234-66.
Cochran, W. G. (1968). The effectiveness of adjustment by subclassification in removing bias in observational studies. Biometrics 24: 295-314.
Coleman, J. S. (1966). Equality of educational opportunity. Washington, D.C.: USGPO.
Cox, D. R. (1957). The use of a concomitant variable in selecting an experimental design. Biometrika 44: 150-58.
Kinsey, A. C., W. B. Pomeroy, and C. E. Martin. (1948). Sexual behavior in the human male. Philadelphia: Saunders.
Rubin, D. B. (1970). The use of matched sampling and regression adjustment in observational studies. Ph.D. thesis, Harvard Univ., Cambridge.
Srole, L., T. S. Langner, S. T. Michael, M. K. Opler, and T. A. C. Rennie. (1962). Mental health in the metropolis (The midtown Manhattan study). New York: McGraw-Hill.
U.S. Surgeon-General’s committee (1964). Smoking and health. Washington, D.C.: USGPO.
Yates, F. (1968). Theory and practice in statistics. J. Roy. Statist. Soc. Ser. A 131: 463-77.

Observational Studies 1 (2015) 137-140 Submitted 3/15; Published 8/15

William G. Cochran and the 1964 Surgeon General’s Report

Norman Breslow [email protected]
Department of Biostatistics
University of Washington
Seattle, WA 98195, USA

By the late 1950’s the causal connection between cigarette smoking and lung cancer was well established. Several excellent retrospective (case-control) and prospective (cohort) studies had been published that led the US Surgeon General to declare “excessive smoking is one of the causative factors of lung cancer” (Burney, 1959). The next few years brought new evidence of this and other major health effects of smoking. Although “medical opinion had shifted significantly against smoking” (United States Surgeon General’s Advisory Committee Report, 1964), no concerted action had yet been taken to alert the public to its dangers. The Federal Trade Commission (FTC) was clamoring for guidance on how to regulate the labeling and advertising of tobacco products.

Accordingly, in 1962, Surgeon General Luther Terry selected an advisory committee of ten members to revisit the scientific evidence and produce a technical report on the health hazards of smoking. Representatives of government, medicine and industry, including some from the Tobacco Institute Inc., submitted a list of over 150 candidates for possible appointment to the committee. Each organization reserved the right to veto, without explanation, any name on the list. People who had taken a position on the issue, which included all those who performed the studies under review, were excluded from consideration. The committee on smoking and health ultimately comprised eight physicians, one chemist and one statistician, William Cochran. Reputed to be a “statistician you could talk to,” Cochran was by then well known for prior service on several national advisory committees dealing with prominent policy issues: the effectiveness of the battery additive AD-X2; an evaluation of the Kinsey report on sexual behavior; and the planning of the Salk polio vaccine trial (Meier, 1984). His acceptability to all the organizations responsible for proposing candidates may have been helped by the fact that he was a heavy smoker (Colton, 1981). Indeed, smokers made up half the committee.

Cochran’s influence on the report and its conclusions was enormous. Although none of its chapters were attributed to individual committee members, he was known in particular to have written Chapter 8, Mortality, and its appendices. This chapter reviewed seven large cohort studies of smoking and mortality in men. In his recent bestseller, Siddhartha Mukherjee (2010) stated:

The precise and meticulous Cochran devised a new mathematical insight to judge the trials [studies]. Rather than privilege any particular study, he reasoned, perhaps one could use a method to estimate the relative risk as a composite number through all the trials in the aggregate. (This method, termed meta-analysis, would deeply influence academic epidemiology in the future.)


Table 26 of Chapter 8 contained the key results. Its importance to the overall evaluation of the evidence was apparent from the fact that an abridged version appeared as Table 2 in Chapter 4, Summaries and Conclusions. For each of 25 specific causes of death, and for all causes, the table listed for each of the seven studies the observed numbers of deaths in smokers, the expected numbers and their ratio. Following principles of indirect standardization, the expected numbers were the sum over age categories of the age-specific death rates among non-smokers times the age-specific person-years of observation for smokers. Age adjustment was essential. Since smokers were younger than non-smokers, their crude death rates were less than those for non-smokers.

Cochran’s innovation was to present two summaries of the seven mortality ratios for each cause of death. The first was a summary mortality ratio, where the expected number was obtained by pooling the age-specific data over studies. The second was simply the median of the mortality ratios for the seven studies. These were remarkably consistent: 10.8 vs. 11.7, respectively, for lung cancer; 1.7 vs. 1.7 for coronary artery disease, the most common cause of death; and 1.68 vs. 1.65 for all causes. In the parlance of modern meta-analysis, the first method, the summary mortality ratio, approximates the summary measure from a fixed effects model whereas the second, the median, corresponds more to a random effects model. In a 2014 letter to the editor of the New England Journal of Medicine, Schumacher et al. (2014) used modern software to produce a graphical “forest plot” of the 1964 results that shows study-specific and summary confidence intervals under both models.

In two appendices to Chapter 8, Cochran described the statistical methods he used to estimate the bias and uncertainty of results presented in the main report. The first appendix reports a sensitivity analysis of possible bias in the mortality ratios caused by non-response. The second appendix presents two approximate methods for obtaining a confidence interval for the mortality ratio. The first, derived under the admittedly false assumption that the age-specific ratios of person-years of observation for smokers vs. non-smokers were constant over age, used the fact that the ratio of a Poisson variable to its sum with another, independent Poisson variable is binomial. The second method avoided the person-years assumption, but involved other assumptions including a normal approximation that “are shaky with small numbers of deaths” (United States Surgeon General’s Advisory Committee Report, 1964). Fortunately, the two methods produced comparable results, especially for the lower confidence limit.

Cochran’s mortality ratio would be most compelling as a summary measure if the age-specific death rates for smokers vs. non-smokers were in constant ratio, in which case it consistently estimates the constant. A modern approach would be to fit the model assuming constant age-specific rate ratios using Poisson regression (Greenland and Robins, 1985). The “robust” standard error for the regression estimate of the (constant) log rate ratio, allowing for model misspecification, provides an alternative to the ad-hoc methods proposed by Cochran. The 1964 report makes clear, however, that the rate ratios declined with age, dropping by nearly half from ages 40-49 to 80-89.
Accounting for this systematic decline in the model could clarify the interpretation of the summary measure as pertaining to a specific age, e.g., 65 years, with predictable changes for younger or older men. On the other hand, the simplicity of the observed/expected formulation was likely more persuasive to most readers of the report than a modeling approach would have been.
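Breslow’s description of the expected numbers translates directly into code. The sketch below is an editorial illustration with invented rates and counts, not the report’s data: E is the sum over age categories of non-smoker death rates times smoker person-years, the mortality ratio is O/E, and the two summaries mirror the pooled ratio and the median of the study ratios.

```python
import numpy as np

# Invented illustrative numbers, not the 1964 report's data.
rate_nonsmoker = np.array([0.002, 0.006, 0.015, 0.040])     # deaths per person-year,
                                                            # ages 40-49, ..., 70-79
pys_smoker = np.array([20_000, 15_000, 10_000, 5_000])      # smoker person-years
obs_deaths_smoker = 950                                     # observed smoker deaths

expected = (rate_nonsmoker * pys_smoker).sum()   # indirect standardization: E = 480
print(obs_deaths_smoker / expected)              # mortality ratio O/E, about 1.98

# Two summaries over several (invented) studies, as in Table 26:
O = np.array([120, 340, 95, 210, 400, 150, 180])     # observed deaths by study
E = np.array([70, 200, 55, 130, 240, 85, 100])       # expected deaths by study
print(O.sum() / E.sum())      # pooled summary ratio (fixed-effects flavor), ~1.70
print(np.median(O / E))       # median of study ratios (random-effects flavor), ~1.71
```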


Many other aspects of Chapter 8, and of the 1964 report in general, reflect Cochran’s philosophy regarding observational studies. Section 7.6 of his paper “Observational Studies” (reprinted in this volume), titled Judgment about Causality, expresses a common theme in his writing: “Given a specific causal hypothesis that is under investigation, the investigator should think of as many consequences of the hypothesis as he can and in the study try to include response measurements that will verify whether these consequences follow.” He goes on to illustrate this point with examples drawn from Chapter 8 of the report. This contained sections that dealt with mortality ratios by amount smoked, by age at which smoking started, by duration of smoking, by inhalation of smoke, by current vs. ex-smokers, by causes of death that one might expect to be related to smoking and by other causes one might not. There was a section on non-response and another on confounding (“disturbing”) variables, which were measured and considered in some studies.

Needless to say, the 1964 report had enormous impact (Mukherjee, 2010). The morning after its release, on January 11, 1964, it was front-page news and the subject of widespread media coverage throughout the world. The tobacco industry initially took some refuge in the fact that public reaction was not as strong as feared, and for a time it appeared that they might escape significant regulation. The FTC proposed a strongly worded warning for cigarette packages, but this was watered down by Congress. What eventually led to the voluntary withdrawal of tobacco advertising from radio and television, in 1971, was a 1968 court decision mandating that stations broadcasting tobacco ads had to give equal time to anti-tobacco advertising under the “fairness doctrine” that applied to controversial issues. While Cochran himself remained a heavy smoker until the end of his days, his work on the committee contributed to a dramatic decline in smoking in the US, to the easing of the burden of chronic disease and to demonstrably increased longevity.

References

Burney, L.E. (1959). Smoking and lung cancer – a statement of the Public Health Service. Journal of the American Medical Association, 171, 1829-1837.
Colton, T. (1981). Cochran, Bill – his contributions to medicine and public health and some personal recollections. American Statistician, 35, 167-170.
Greenland, S. and Robins, J.M. (1985). Estimation of a common effect parameter from sparse follow-up data. Biometrics, 41, 55-68.
Meier, P. (1984). William G. Cochran and public health. In: Rao, P.S.R.S. and Sedransk, J., eds. W.G. Cochran’s Impact on Statistics. Wiley, New York: 73-81.
Mukherjee, S. (2010). The Emperor of all Maladies: A Biography of Cancer. Scribner, New York.
Schumacher, M., Rucker, G. and Schwarzer, G. (2014). Meta-analysis and the Surgeon General’s report on smoking and health. New England Journal of Medicine, 370, 186-188.
United States Surgeon General’s Advisory Committee Report (1964). Smoking and Health. U.S. Department of Health, Education and Welfare, Washington, D.C.


A Personal Recollection

To my knowledge I met Bill Cochran only once. The occasion, during Spring of 1962, was a trip East to visit universities where I had applied to graduate school in mathematics. My father, a public health physician, was the principal investigator on two of the seven mortality studies summarized in the 1964 report and later testified before the committee. He knew and admired Cochran and arranged for me to have an interview at Harvard, undoubtedly hoping that I might become interested in statistics. I remember Cochran as a tall and, to me, somewhat formal figure who displayed little interest in my career choice. He may have known that Harvard’s math faculty would reject my application.

My experience at The Johns Hopkins University was different. After a disastrous interview with one of the senior math faculty, a meeting with Alan Kimball that had been similarly arranged by my father led me to withdraw my application on the spot and to re-apply to the statistics department that Kimball was attempting to resurrect. When I returned to my undergraduate college and told my professors what I had done, they said that if I was truly interested in statistics I should apply to UC Berkeley and to Stanford, from which I ultimately graduated. I always wondered how my career might have evolved had the interview with Cochran gone differently and I had applied (and been accepted) by Harvard statistics.

Observational Studies 1 (2015) 141-164 Submitted 7/15; Published 8/15

The Inheritance bequeathed to William G. Cochran that he willed forward and left for others to will forward again: The Limits of Observational Studies that seek to Mimic Randomized Experiments

Thomas D. Cook [email protected]
Northwestern University & Mathematica Policy Research, Inc.

Introduction

Seamus Heaney had the courage to want to use the past to construct a better future from among the many potential futures always available. In “The Settle Bed” of 1991 he wrote:

And now this is “an inheritance” –
Upright, rudimentary, unshiftably planked
In the long ago, yet willable forward
Again and again and again.

To etymologists, “upright” connotes full of rectitude, being correct, while “rudimentary” connotes beginnings and first principles. Together, they signify that Heaney’s concern is with inheritances that address fundamental issues and want to be right about them. That all things are linked to the past is a truism, but that many things are “unshiftably planked” there is not. Firmly rooted inheritances require generation-transcending transmission mechanisms, whether objects like books, or cognitive habits, or social institutions like universities, or subsequent generations like one’s own students and students’ students. Such mechanisms transcend lives, including those of individual scholars. Heaney insists that unshiftably planked inheritances have to be willed forward, not just once or twice, but “again and again and again”. So he rejects ossifying traditions that fail to accommodate new realities and instead recommends using human will to make continuous changes in an inheritance. Presumably this is by implementing possible changes, learning about their successes and failures, and incorporating their results into the original inheritance that is thereby modified. The consequence is an inheritance that is planked in the past, modified in the present, and repeatedly modified in the future by dint of human will.

Cochran was himself the beneficiary of an important inheritance that he improved and passed on. It does not diminish him to note that his writings on causation and the design of experiments and observational studies are inconceivable without Fisher. The two did not work together at Rothamsted, but Fisher often returned there after his move to Cambridge and they talked there as well as at Royal Society meetings (Watson, 1982).


We even have Cochran’s own reports of conversations with Fisher, including the insight that observational studies should make the implications of a single causal hypothesis more elaborated in the data (as reported, for example, in Rosenbaum, 2005). Yates mentored Cochran at Rothamsted and had earlier been a colleague of Fisher there (Yates, 1964). It seems inconceivable that Cochran and Yates did not speak in detail about Fisher’s contributions, given their general intellectual resonance and their clear relevance to their own research agendas and the mission of their research station. Cochran brought detailed knowledge of Fisher to his first stay at Iowa State; and after he emigrated he continued this dissemination process at Iowa State again and then in North Carolina, at Johns Hopkins and at Harvard. His dissemination of Fisher took place face-to-face, in teaching and in writing, perhaps most saliently in his seminal texts with Cox and Snedecor.

Fisher himself did not become an intellectual giant out of nothing, but he nonetheless created most of the planks on which Cochran first stood. Here are what I think are the five major ones.

(1) It is legitimate for statisticians to choose their intellectual problems from among the practical problems faced by those who seek to improve the physical world, be they farmers, health workers, engineers or social welfare professionals. The main alternative to such practice-based problem choice is when research agendas emanate from puzzles in existing theory or from past and emerging issues in mathematics.

(2) Many of the practical problems practitioners face require identifying whether a manipulable action is “causally” related to possible consequences. The inference entailed here moves Statistics beyond its historical concerns with reducing uncertainty and improving prediction (Stigler, 1986). It legitimates research on causal bias and its control, the central point of this paper. But it also legitimates working with what is perhaps a less fundamental theory of causation. Variously labeled by philosophers of science as the activity, manipulability or recipe theory of causation (see Cook and Campbell, 1979), it seeks the valid description of concrete If/Then connections rather than explanations of why things happen in the world, including causal connections.

(3) Causal bias is best controlled through experimental design. Such design requires clear null hypotheses that are then tested through a combination of how units are allocated to the treatments, how and when the study outcome is assessed, and how comparison groups are selected. The hope is that such structural elements will provide a perfect no-treatment counterfactual against which observed performance in the treatment group can be compared. Then, no other interpretation of the relationship between the independent and dependent variable is possible other than that it is causal. Many non-experimental causal methods exist, primarily “causal modeling” methods where substantive theory about a set of interdependent temporal influences is tested, essentially by estimating how well the obtained and predicted data match.

(4) Within experimental design, random assignment is the best tool for warranting unbiased counterfactual estimates. Random assignment balances the study groups on all observed and unobserved variables, entailing a perfect counterfactual in expectation and a probabilistically equivalent counterfactual in individual studies. Random assignment is also demonstrably implementable in real-world settings where its assumptions are often clearly met. There are other cause-probing “experimental” traditions that do not require random assignment.
For instance, the experimental laboratory sciences routinely test causal hypotheses, but they rule out alternative causal interpretations by virtue of (1) closed-system test settings and apparatus that physically exclude most contending hypotheses, (2) substantive theories whose predictions are so numerically or functionally specific that no contending theory can explain them, and (3) implementing procedures that have evolved, and are still evolving, to reduce the confounds from the laboratory researchers’ own hopes, expectancies and interests.


“Quasi-experiments” are also experimental, but by definition they are without random assignment. Instead, they seek to mimic the logic and structure of random assignment experiments, usually without the benefit of closed-system test settings. Causal ambiguities often remain with quasi-experiments, therefore, given how rare it is for the non-random process of selection into treatment to be perfectly known and measured.

(5) When random assignment is not possible, the preference is for observational study designs that mimic random assignment as much as possible. From Fisher’s Latin Square designs on, these mimetic designs test a null hypothesis about a deliberately manipulated treatment versus a comparison group; the comparison group is deliberately selected to minimize pre-intervention group differences; the occasions of measurement are those found in many outdoor and long-lasting random assignment studies that include pretest assessments as well as posttest ones; and these pretest covariates are then “somehow” used to control for any bias remaining after the study groups have been chosen. In more modern language, the goal of the mimetic tradition is to provide the best approximations for those potential outcomes that are missing in the quasi-experiment but are available in the random assignment experiment.

Legitimate debate is possible about these five propositions, and we could add others. But they help describe and demarcate the inheritance Cochran received. Other causal inheritances were available to him at the time. One was the Galton/Pearson tradition, with its emphasis on prediction, substantive modeling, multivariate data analysis and epistemological verification – an evident counterpoint to Fisher’s emphasis on causation, experimentation, random assignment and what later came to be called falsification (Meehl, 1978). Cochran could also have worked on identifying the processes through which laboratory research promotes clear causal conclusions, given the strong interest at the time in how experimental physics, chemistry and biology advanced theory in their respective disciplines. But he did neither of these things, instead representing the third position just described. Each of the three has evolved as an adaptation to different parts of the world of research. For most of the 20th century, the experimental model has been preferred for open-system applications in agriculture, medicine, and some engineering. More recently, it has also been imported into microeconomics, political science, education and sociology. However, it is rarely relevant in chemistry, physics or microbiology, where closed-system experiments are used to describe and explain cause-effect relationships. Nor is the experimental model of much relevance in meteorology, macro-economics, macro-sociology, historical geology and much of population genetics. Cause is still a goal in these fields, but control through deliberate manipulation is difficult, causal modeling is the norm, and local understandings of “cause” stress explanation more than the description of If/Then relationships. Knowing about the alternatives Cochran did not choose helps us distinguish the unique boundaries of the experimental inheritance he was bequeathed, willed forward and passed on. How he made progress is evident in several ways.
One is his role in disseminating the core inheritance through his physical presence, expository writings and transmission of insights not yet committed to print (Cochran, 1965). A second is through his personal research. I am not a statistician or historian of science, but note that he sharpened thinking about how observational studies should seek to mimic the logic of randomized experiments, especially through his work on matching (Cochran, 1983).


He had begun exploring the bias-reduction role of matching much earlier (Cochran, 1950) and was particularly creative in dealing with sub-classification (Cochran, 1968), even determining how different numbers of strata affect the amount of bias reduced (Cochran, 1965). In addition, he discovered that the method of analyzing observational study data makes little practical difference, and that bias control is made more difficult the smaller the initial overlap between groups (Cochran, 1968). He also forwarded his inheritance through his influence on others. He trained 39 doctoral students, and they (and their students) have created an even more systematic theory of causal identification and estimation that assumes the five main planks described earlier and is best embodied by the Rubin Causal Model. Expositions of it explicitly acknowledge Cochran’s influences (e.g., Rosenbaum and Rubin, 1983; Imbens and Rubin, 2015). He also influenced many other scholars of causation, including fellow statisticians like Mosteller, Moses and Tukey as well as admirers from more distant fields, like Donald T. Campbell and his students, whose work often cites Cochran (see the bibliography in Shadish et al., 2002).

A full historical analysis of how Cochran directly and indirectly willed forward his inheritance is beyond the scope of this essay. But the point is clear. Cochran was embedded in an upright and rudimentary inheritance; he willed it forward first in one paper and then again in another and then again in another; and he bequeathed this marginally improved inheritance to others, including his own students, so that they might will it even further forward, again and again and again.

The Present Purpose

As a practitioner of mostly quasi-experimental research in complex field settings, I am denied the protections of random assignment and closed-system laboratories. So I am used to feeling vulnerable and suspect, and even envious of others’ certainties about method choice. I am not formally connected to the intellectual history of causal design in Statistics. Nonetheless, I feel very comfortable with the first four propositions characterizing Cochran’s inheritance, but am ambivalent about my connections to the mimetic conception of observational study design. This is because I am aware of contrary examples I will present and of three issues central to quasi-experimental practice that the Cochran inheritance rarely discusses. I want to present the issues here and ask: (1) whether they help demarcate the inheritance’s boundaries and (2) whether one or more of them might be worth “willing forward” to incorporate into the inheritance, even if only at one of its margins. Of course, no appointed court of statisticians exists to deliberate about exclusion from the causal inheritance or inclusion into it. Scientific agenda-setting and -modification is a much more ad hoc process that almost certainly has chance components. My hope is merely to start a conversation about what should and should not be part of Cochran’s inheritance, not as he found it, but as he and his students, friends and followers have elaborated it.

The first issue from my own work as a quasi-experimental practitioner speaks to the reality that I sometimes cannot construct a single focused null hypothesis test of a causal hypothesis, even though all random assignment studies and all mimetic quasi-experiments aspire to such a test. Archetypically, this test evaluates the difference between two posttest means from two initially identical groups. Fisher himself advised Cochran that observational studies should not take this approach and should instead elaborate the same causal hypothesis until it has multiple implications within the data that are subsequently tested.


“Somehow” multiple sub-hypothesis tests have to be constructed – even from multiple data sources – and the case has to be made that they are collectively sufficient to test the hypothesis. Fisher’s advice probably surprised Cochran, for it seems to be at odds with Fisher’s own writings on null hypothesis testing. Nonetheless, Cochran (1965) provided some brief examples of causal questions that cannot be answered with a single focused test. Other scholars have noted the same, and have worked on concepts that overlap with Fisher’s off-hand remark. Campbell’s (1966) “pattern-matching” requires postulating and testing a complex pattern of differences that might rule out all other alternative interpretations. The “critical multiplism” of Cook (1985) depends on multiple tests that are critically chosen because they collectively rule out all the currently identifiable causal alternatives to the hypothesis under test. Rosenbaum’s (2005; 2009; 2011) “coherence” notion depends on the consistency of results from multiple tests across different datasets that provide a coherent link to a single causal hypothesis. Finally, the idea of Generalized Differences in Differences (GDD) involves testing statistical hypotheses that are of higher order (and thus more “pattern-laden”) than the two-way interactions in standard differences-in-differences approaches (e.g., Imbens and Wooldridge, 2007; Chetty et al., 2009).

In what follows, the national educational reform program No Child Left Behind (NCLB) illustrates a case where no single focused hypothesis test is possible but where elaborated and testable sub-hypotheses are. All the results are consistent with the hypothesis that NCLB raised academic achievement, but they flirt with verifying a predicted pattern of data and fail to falsify all possible alternative interpretations, even if they do falsify all “plausible” ones. So I want to ask: Does the inheritance under discussion want to wash its hands of such inelegant and marginally less successful omnibus tests? If it does not, how might the Fisher strategy be incorporated into an inheritance where its current role is minimal?

The second issue I want to address is that causal statements require more than identifying whether two variables are “causally” related and then estimating the size and statistical significance of the obtained relationship. Also needed is a name for both the cause and effect in general language, given the impossibility of providing a comprehensive description of the cause (or effect) each time it is mentioned. Cochran (1965) briefly mentions this issue, but in the examples he cites he leaves construct validity to psychologists and sociologists, not statisticians. I do not dispute that his inheritance has made some breakthroughs in labeling causal manipulations – e.g., in discussions of comparison group types (e.g., no-treatment versus placebo control groups) and of factorial designs that decompose a complex intervention into parts that are then separately examined. Even so, randomized experiments were designed to optimize construct validity much less than internal validity, and I suspect many people would question the utility of valid cause-effect relationships in which either the cause or effect were wrongly named. So I present the example of an otherwise successful bail bond reform program that was discontinued because of how the manipulation was (mis-)labeled.
Its benefits might have continued, however, if the treatment had been correctly labeled. What intellectual responsibility, if any, does the Cochran inheritance want to take for the construct validity of independent and dependent variables? Is this issue demarcated out, or worth eventually incorporating?

The final issue I bring up concerns the generalization of causal relationships, using the regression discontinuity design (RDD) as an example.


RDD uses a deterministic treatment assignment procedure to identify causal effects. It is deterministic because assignment to treatment or control status depends only on whether a unit scores above or below a single observed score on a continuous measure that is often one of need, merit or age. Since this assignment procedure creates non-overlapping groups on each side of the cutoff, causal identification requires extrapolating the functional form from the untreated regression segment into the treated segment, where it estimates the missing but crucial potential outcome – what would have happened to treated units had they not been treated. Unfortunately, no independent support exists for this crucial extrapolation, and so causal inference is usually limited to the cutoff point, where only a small fraction of those receiving treatment are to be found. The ensuing loss in causal generalization contrasts badly with the randomized experiment, whose treatment and comparison groups totally overlap and so warrant the estimation of an average rather than a local treatment effect. The combination of RDD’s unsupported extrapolation, limited causal generalization, and dependence on modeling may explain why few statisticians have paid much attention to the design (Cook, 2008). However, simple design elements can be added to the basic RDD structure and will provide some support for the extrapolation RDD always needs. These supplementary data elements might be untreated regressions from pretest observations (Wing and Cook, 2013), from other covariates (Angrist and Rokkanen, 2015), or from non-equivalent comparison groups (Tang et al., 2015). When certain assumptions are met, we later show that RDDs with an untreated comparison function (CRD) lead to causal results that are demonstrably valid in the whole treated area beyond the cutoff. Do some versions of RDD, like CRD, deserve willing forward? More generally, should the inheritance pay more attention to causal generalization and raise its profile relative to causal identification and estimation?

Issue 1: When no single focused test of a causal hypothesis is possible

Random assignment experiments create a single focused test of a hypothesis about a single cause and a single effect, usually a test of mean differences. Something very similar happens in most quasi-experiments. The Rubin Causal Model seeks to create treatment and comparison groups that are equivalent conditional on covariates, thus allowing the group mean differences to be examined at posttest. I prefer such focused tests and can usually create them in observational study work. But I cannot always do so, and it is this sometime failure that motivates the issue addressed here.

Cochran was apparently fond of saying something like “Unless you can give me an example that illustrates your statistical problem, I won’t find it important enough to bother with”. In this spirit let me offer the example of NCLB. It sought to improve academic achievement in all public schools nationwide. The program began in 2002, and the law specified that by 2014 children in all schools in all states had to attain passing scores on state achievement tests. States were free to set their own time schedule for reaching the national goal, but they had to implement a system in which sanctions escalated as the number of consecutive years increased over which a school had failed to reach pre-specified annual performance levels.

Random assignment was not possible because NCLB was the product of a national law that was rolled out immediately. Nonetheless, Dee and Jacob (2011) reasoned that the key component of the national program was “consequential accountability”, the system of escalating reforms a school had to undertake as a function of the number of consecutively failed years.


Each state had to have such a system by 2002. But some states already had one, allowing states to be partitioned into those with and without accountability in 2002. Using the national Main NAEP math test for 4th graders over eight time points, some before and some after NCLB, Dee and Jacob constructed Figure 1.

Figure 1: Trends in grade 4 math in Main NAEP, by timing of accountability policy

It shows that for states that already had consequential accountability before 2002, slopes for achievement are steeper; but after 2002 the slopes become parallel, suggesting a post-2002 improvement in those states first getting accountability in 2002. This is consistent with a causal impact, but only if a number of problems are addressed. First, the object being evaluated is not NCLB; it is only one mechanism within it, thus compromising the construct validity of the cause. Second, it is not clear whether other state-level, math-correlated forces might have differentially affected achievement before and after 2002 – an issue of internal validity. And finally, there are few data points during the baseline period, leading to questions about how well baseline functional form differences have been modeled – another internal validity issue.

An alternative design (Wong et al., 2015) pits national public schools that are subject to NCLB against the nation’s private schools, to which NCLB hardly applied. This is not a sharply focused test, though. Prior to NCLB, public and private schools were quite different in achievement levels and maybe even slopes, and they were also subject to different historical forces that might have changed around 2002. Moreover, though there are now more baseline time points, they are still few. Nonetheless, Figures 2 and 3 plot 4th grade differences in math on Main NAEP when public schools are contrasted with Catholic schools and then with non-Catholic private schools.

In both comparisons, a large mean selection difference is evident at baseline, with performance higher in each type of private school. The baseline time trends are less clear, however. Simple visual tests comparing differences in differences suggest that the public schools came to do better after 2002, narrowing the achievement gaps visible before NCLB.

147 Cook

Figure 2: 4th grade math for Main NAEP: Public and Catholic schools
Figure 3: 4th grade math for Main NAEP: Public and non-Catholic private schools

Figure 4: 8th grade math scores for Main NAEP: Public and Catholic schools
Figure 5: 8th grade math for Main NAEP: Public and non-Catholic private schools

Statistical tests used baseline means and linear trends to examine immediate posttest mean differences, posttest linear slope differences, and final mean differences. While every estimate has a sign indicating positive NCLB effects, most are not statistically significant – perhaps due to the small number of degrees of freedom in a national-level analysis. For 8th grade math and 4th grade reading scores, the corresponding data are in Figures 4 through 7. All causal signs are again positive, but few are statistically significant. And effects seem even smaller for reading than for math.

The national Main NAEP results just presented are based on items that vary over time in order to reflect national changes in teaching content. In contrast, Trend NAEP holds test items constant. Figures 8 through 10 plot the corresponding Trend NAEP differences for 4th grade math, 8th grade math and 4th grade reading. (Trend NAEP data for non-Catholic private schools are not available, and only one interpretable posttest time point exists due to a change in sampling design after 2004.) All the results point to greater mean change after 2002 in public schools and to a reduced achievement gap. Now, the two math differences in differences are statistically significant.
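To make the form of these tests concrete, the following minimal sketch fits a comparative interrupted time series by ordinary least squares on fabricated data. It is offered only as an illustration of the kind of model involved; the wave years, effect sizes and variable names are invented and are not the Wong et al. (2015) analysis.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Fabricated public vs. Catholic school means over six waves, with 2002 as
# the interruption. All numbers are invented for illustration only.
waves = [1996, 2000, 2003, 2005, 2007, 2009]
df = pd.DataFrame([(y, g) for y in waves for g in (0, 1)],
                  columns=["year", "public"])
df["post"] = (df["year"] >= 2003).astype(int)
df["t"] = df["year"] - 2002
rng = np.random.default_rng(1)
df["score"] = (235 + 10 * (1 - df["public"]) + 0.8 * df["t"]
               + 3.0 * df["public"] * df["post"]
               + rng.normal(0, 1, len(df)))

# public:post is the immediate difference-in-differences mean shift;
# public:post:t tests the post-2002 slope difference.
fit = smf.ols("score ~ public * post * t", data=df).fit()
print(fit.params[["public:post", "public:post:t"]])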


Figure 6: 4th grade reading scores for Main NAEP: Public and Catholic schools
Figure 7: 4th grade reading for Main NAEP: Public and non-Catholic private schools

State-level tests have advantages over national ones, particularly as regards sample size. NCLB left states free to set their own standards about how quickly to reach the ultimate 2014 national goal, and some states chose to make faster initial changes. So states varied in whether they initially had higher, moderate or lower standards for initial improvement. (This contrast was not correlated with which states adopted consequential accountability before or after 2002.) Figures 11 through 13 give the results for 4th and 8th grade math and 4th grade reading. Immediately after 2002, all estimates of the difference of differences are positive in sign, with states taking NCLB more seriously doing better, and those in the moderate category doing next best.

Adding these state-level results to the national ones in Wong et al. indicates coefficients favoring NCLB in every one of the 39 difference-of-differences tests across 4th or 8th grade, math or English, Main NAEP or Trend NAEP, Catholic or non-Catholic private schools, and state standard differences assessed in categorical or continuous form. Of these 39, 10 are independent tests of immediate mean and slope differences for math (p < .001); and eight are independent tests of differences of differences in slope and final endpoint (p < .01). The statistical significance levels are generally weaker for Main than Trend NAEP, but they are strong for Main NAEP when the strongest tests are conducted, combining both the initial mean and subsequent slope effects in order to examine mean differences at the last time point. Of the 9 such estimates, all have the same sign, and five are statistically significant at the .10 level or lower.

The case is strong, then, that the obtained data match the pattern predicted from the single hypothesis that NCLB raised achievement. Yet none of the single tests warrants a strong causal conclusion; only together do they do what Fisher recommended by rendering the single causal hypothesis more complex and testing how well the predicted and obtained data patterns correspond. However, such verification does not necessarily fulfill what a well-executed random assignment experiment achieves – ruling out all alternative interpretations. Fisher’s advice will almost always reduce the number of alternative interpretations, but it will not necessarily eliminate all of them.
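One plausible reading of where such p-values come from – an assumption on my part, not a reconstruction of the authors’ computation – is a one-sided sign test: if each of k independent estimates were equally likely to take either sign under the null, the chance that all k share the predicted sign is 0.5 to the power k.

# One-sided sign test on k independent estimates that all share the
# predicted sign; plain Python, no data needed.
for k in (10, 8):
    print(f"k = {k}: p = {0.5 ** k:.5f}")
# k = 10 -> p ~ 0.00098 (< .001); k = 8 -> p ~ 0.0039 (< .01),
# matching the significance levels reported above.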


Figure 8: 4th grade math scores for Trend NAEP: Public and Catholic schools
Figure 9: 8th grade math for Trend NAEP: Public and Catholic schools

Figure 10: 4th grade reading for Trend NAEP: Public and Catholic schools

Indeed, Wong et al. (2015) discuss three possible alternatives. Given the coherent pattern of results, each is of low plausibility – but nonetheless each is still possible if a highly unlikely concatenation of alternative forces operates.

One alternative is that students suddenly began to leave Catholic schools for public schools after 2002, when scandals about the sexual behavior of priests began to emerge. This might suddenly affect mean performance in these schools and also in the other schools to which the departing students moved. Moves from Catholic schools were indeed greater just after 2002 – by about one-fifth of one percent relative to prior years. But these rates are too small to account for the public school gains, given that public schools cater to about 90% of all students nationally and Catholic schools to about 5%. Moreover, Wong et al. show that Catholic school movers had no detectable impact on the post-2002 composition of schools on three demographic measures that are usually highly correlated with achievement. In addition, for inter-school changes in student composition to account for the obtained data pattern, it is further required that (a) the students who left Catholic schools in states with higher 2002 standards achieved more than students who exited Catholic schools in states with lower standards, and (b) students in states with high consequential accountability were more likely to leave Catholic schools than their counterparts in states with low consequential accountability.


Figure 11: 4th grade math for Main NAEP: High vs. medium vs. low proficiency standard states
Figure 12: 8th grade math for Main NAEP: High vs. medium vs. low proficiency standard states

Figure 13: 4th grade reading scores on Main NAEP: High vs. medium vs. low proficiency standard states

Another interpretation is that school officials manipulated NAEP achievement test scores after 2002, and that they were more motivated to do so in public than private schools, in states with higher standards, and in states with consequential accountability beginning in 2002. Two popular ways of manipulating test scores are to reduce how many students with disabilities (SWD) or English language learners (ELL) are tested. Table 9 of Wong et al. (2015) shows that the percentages excused from NAEP testing did shift more in public than private schools after 2002, but that no such differential exclusion pattern is evident in the contrast of states varying in standards. Also, school officials have less motivation to manipulate the testing process for NAEP than for state achievement tests, since no consequential decisions depend on a school’s NAEP results.


The only manipulation-related explanation we can envisage is that high-stakes testing at the state level changed the culture of all achievement test taking in and after 2002 – including for NAEP – and that this changing culture was stronger in public than Catholic schools even though both kinds of schools take part in Main NAEP testing. No data exist on this quite circumscribed version of an alternative interpretation based on manipulating test scores.

A third alternative invokes the National Council of Teachers of Mathematics (NCTM). It updated its math standards in 2000 and later claimed that this was responsible for the subsequent national improvements in NAEP math scores (National Council of Teachers of Mathematics, 2008). While this alternative might explain why the math effect was larger than the reading one in 4th grade, it also requires that NCTM standards were adopted more often, or were implemented better, in high-standard versus low-standard states, in states adopting consequential accountability only after 2002, and in public schools more than the two private school types. Also needed are the assumptions that: (a) math standards operate with a three-year causal delay, for the standards changed in 2000 but the first evidence of a possible effect appears in 2003; and (b) issuing new standards is by itself sufficient to change performance at the national level!

The need for discussions about ruling out alternative interpretations leaves a bad taste in the mouth when compared to the simple use of a single focused hypothesis test. But consider the alternatives to this bad taste. One would be to do no evaluation. Would that leave an even worse taste? Another would be to do a randomized experiment under atypical circumstances, perhaps by issuing waivers from NCLB to exempt some school districts. However, this set-aside waiver experiment would be compromised by the kinds of districts or schools eligible for waivers and applying for them. More serious would be the nature of the schools available to parents wanting to leave a school. In the waiver context we envisage, there would be some public schools without NCLB in each school district. Yet this would not be possible in a national program, and it would fundamentally alter the school choices available. When study designs are evaluated by both internal and external validity criteria, a randomized experiment on a likely small and unrepresentative sample of schools, operating in a social context to which one does not want to generalize, might not be superior to an unaesthetic bundle of multiple interrupted time-series tests conducted at both the national and state levels. Construing study quality within a multi-attribute decision framework, rather than today’s single-attribute framework that prioritizes only internal validity, suggests that a sharply focused experimental test from a waiver experiment might not be superior to the messy multiple tests described above.

The epistemological bottom line with elaborating the causal hypothesis is that empirical coherence is only part of the bias control story; the rest depends on how well alternative interpretations are ruled out by the uniqueness of the predictions. In his own work, Fisher wrote about single sharp hypothesis tests, and he did not exemplify his own advice about elaborating a causal hypothesis, though I know little of his later genetics research.
Nor, to my knowledge, did Cochran do anything more with Fisher’s advice than pass it on, all the time advocating for observational studies based on matching procedures designed to mimic single sharp tests. Most of Cochran’s students (and their students) have not heeded Fisher’s advice either, with the salient exception of Rosenbaum. I grant the superiority of single focused tests.


My question is whether the critical multiplist perspective described here should be demarcated out of the inheritance even though it is consonant with Fisher’s dictum, or whether it is worth willing forward when single sharp hypothesis tests are not possible.

Issue 2: Simultaneously Identifying Causal Relationships and Causal Constructs

The central task of the inheritance under discussion is to identify causal relationships through a falsification process that distributes conceptually irrelevant causal forces equally across conditions. In random assignment studies, this is done unconditionally, though in quasi-experiments it is done conditionally on the covariates used. Other meanings of cause are broader. Let us leave aside the distinction between causal explanation and causal description, recognizing that experimentation always deals with causal description but only deals with causal explanation if (1) the independent and dependent variables are deliberately chosen to index explanatory concepts from a clear substantive theory, or (2) all plausible temporally mediating variables in a study are measured and the analysis identifies which of them offer a more plausible causal explanation than others. Let us instead focus on the fact that the cause and effect in each study need to be labeled in abstract language if scientific communication is to ensue. The alternative, providing a complete description of the manipulandum, is impossible because the volume of words needed is impractical and, anyway, full knowledge of causal agency is often elusive. The next example explores the conundrum that a causal relationship can be valid but useless, or even dangerous, when the causal agent is invalidly labeled.

Cook County, Illinois changed its system of bail hearings in 1999, hoping to reduce costs by substituting video hearings for face-to-face hearings with a judge. The reform applied to all charged offenses except the most serious ones – murder and serious sexual offenses, for which face-to-face hearings persisted. All the labels attached to the reform featured its video component, reflecting the universal consensus that this was the core intervention component. Some legal advocates came to dislike the reform, and one reason for this was the suspicion that it may have unintentionally increased bail amounts. By so doing, more individuals could not now afford bail and would have to remain in jail even though presumed innocent; their families and jobs would be endangered; and the county would have to pay increased jail and welfare costs.

A study was conducted using 17 years of data on daily bail amounts, aggregated by quarter (Diamond et al., 2010). About half of the days came from before the video intervention and half after it. The charged offenses were partitioned by seriousness, with the less serious ones receiving the intervention and the most serious ones serving as no-treatment controls. The resulting comparative interrupted time series showed that, after the intervention, bail amounts changed by more for the less serious than the more serious charges. These results were first privately released to the county senior judges with information that they would soon be released to the press. Just before they were released, the 8-year-old video innovation was rescinded. Face-to-face hearings returned for all charges.

There is little doubt that bail amounts increased after the video reform. Figure 14 gives the LOWESS-smoothed daily bail amounts over 17 years from Cook et al. (2014). The date of the reform is set at 0 and the range of days goes from −3,000 to +3,000. The greater mean shift after the intervention is clear.
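For readers who want the mechanics, here is a minimal sketch of the kind of smoothing shown in Figure 14, run on fabricated data; the jump size, noise level and day range are invented, not the Cook County records.

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
days = np.arange(-3000, 3000)  # day 0 = the reform
df = pd.DataFrame({"day": np.tile(days, 2),
                   "serious": np.repeat([0, 1], days.size)})
# Invented bail amounts: a post-reform jump for less serious charges only.
df["bail"] = (20000 + 5000 * df["serious"]
              + 4000 * ((df["serious"] == 0) & (df["day"] > 0))
              + rng.normal(0, 2000, len(df)))

# LOWESS-smooth each charge-seriousness series separately, as in Figure 14.
for label, grp in df.groupby("serious"):
    smooth = sm.nonparametric.lowess(grp["bail"], grp["day"], frac=0.2)
    print(label, smooth[[0, -1], 1].round(0))  # fitted means at the ends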


Figure 14: Serious and less serious offenses over time: LOWESS-smoothed daily aggregated bail amounts from before and after the bail bond reform intervention

Figure 15: LOWESS-smoothed number of judges hearing serious and less serious cases by day before and after the bail bond reform intervention

Figure 15 shows the number of judges who adjudicated bail bond hearings each day. For the less serious cases the number of officiating judges dropped precipitously from about eight per day to three, indicating that the reform was a multi-attribute treatment package. One part was the video, and another was the reduction in the number of judges. Other evidence not presented here shows that the volume of bail bond hearings did not change over the intervention period, suggesting that the post-video judges became more specialized.


Figure 16: Percentage of all felony bail hearings held before and after the bail bond reform by the groups of 3, 13 and 83 judges who held bail hearings both before and after the reform.

Of the 530 judges adjudicating in bail bond hearings over 17 years, 99 heard cases both before and after the video-based reform. We call these the before-after judges. All the other judges heard cases only before or only after the change. Of the 99 before-after judges, three were responsible for almost 100% of all the cases for the first year after the change and for 50% of them for the next two years. In contrast, they heard 40% in the year immediately before the reform and very little before that. So they were heavily overrepresented immediately after the reform when compared to before it. This is revealed in Figure 16, which plots the number of pre- and post-intervention hearings adjudicated by the 3 most specialized judges, the 13 next most specialized, and the 83 who heard very few cases even after the reform.

Key issues are whether these before-after judges were disposed to set higher bail amounts than their pre-intervention colleagues even before the intervention occurred, and whether their bail amounts increased at all. Figure 17 gives the relevant data for the bail amounts set by the three most prolific judges, and we see no change in bail amounts from before to after the reform! Data are more sparse in the immediate post-video period for the other before-after judges, but Figure 18 for the 83 before-after judges shows there is again no evidence that they set higher bail amounts after the intervention than before. The apparent mean increase for less serious offenses is matched by an apparent increase for the more serious offenses that were always adjudicated face-to-face.

If within-judge analyses show that bail amounts did not change from before to after the reform, and if analyses aggregated across all judges (including the before-only and after-only judges) show that bail amounts changed from before to after the video intervention, what explains the seeming discrepancy? The most likely explanation is that the before-only judges set lower bail amounts than their before-after colleagues, and Figure 4 in Cook et al. (2014) confirms this.


Figure 17: Serious and less serious offenses over time: LOWESS-smoothed individual case record data for the three most active before-after judges.

Figure 18: Serious and less serious offenses over time for the 83 before-after judges: Overlap between the LOWESS-smoothed individual case record data and a quadratic regression.

So the higher post-reform bail amounts found with all of the judges are probably due to the before-after change in judge composition rather than to the introduction of the new video system. Even before the video reform, the judges hearing cases both before and after the reform were prone to set higher bail amounts than their colleagues who only adjudicated before it.
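The arithmetic of such a composition effect is easy to reproduce. In the toy example below – fabricated numbers, three judges only – every judge’s bail amounts are flat before and after the reform, yet the aggregate mean rises simply because the low-bail judge hears cases only before it.

import pandas as pd

hearings = pd.DataFrame({
    "judge":  ["A"] * 4 + ["B"] * 4 + ["C"] * 4,
    "period": ["pre", "pre", "post", "post"] * 3,
    # A and B hear cases before and after; C hears cases only before.
    "bail":   [30, 30, 30, 30, 28, 28, 28, 28, 10, 10, None, None],
}).dropna()

print(hearings.groupby(["judge", "period"])["bail"].mean())  # flat per judge
print(hearings.groupby("period")["bail"].mean())  # pre ~ 22.7, post = 29.0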


What would have happened if the true causal agent had been correctly identified? Could the Cook County court system have changed the judges who held bail hearings without sacrificing the video component and the cost savings it generated through mechanisms outlined in Cook et al. (2014)? Could the post-intervention judges have been randomly chosen? Could a different set of judges with different proclivities have been selected? Could the 3 or 99 judges have been instructed to reconsider how they set bail amounts? These are all pertinent questions, but they were never asked at the time because the universal social consensus was that the reform entailed changing only the mode in which hearings were held – from face-to-face to video. The resulting mis-identified causal agent points to a dilemma: What is the value of learning that a causal relationship is true if the cause involved in it is wrongly named and a name has to be provided? Does the Cochran inheritance take no responsibility for mis-identified causal agents, only for mis-identified causal relationships? Many non-statisticians may find it odd to limit the idea of causation to identifying and estimating causal relationships without considering how the cause and effect are named. Is it worthwhile willing forward research on this cause-naming topic?

Issue 3: Generalizing Causal Relationships, including in the Regression Discontinuity Context

Despite research on pooling experimental results and on identifying statistical interactions between treatments and study details, the Achilles heel of experimentation remains what is variously called external validity, heterogeneity, or causal generalization – the capacity to identify the boundary conditions on which a specific causal relationship depends, including total generalization across all boundary conditions within our space-time continuum.

Few experiments are based on samples randomly drawn from a clearly described population; most sampling particulars are products of opportunism. In research syntheses it is rare for the sample of available studies to be formally linked to a meaningful population and, anyway, formal sampling models are not relevant to all the objects of causal generalization – e.g., to different types of settings, times, and ways of operationalizing the cause and effect. Also, studies rarely have samples so large that they permit examining how the treatment interacts with all, or even many, of the possible sources of heterogeneity. Scientific decisions about causal robustness are usually more weakly supported and depend, say, on a predominantly positive causal sign or on whether effect sizes differ across the possible sources of heterogeneity that happen to be examined. When causal results are not robust, it is common to use the pattern of results to begin the process of specifying the boundary conditions that determine variation in effect sizes. The limitation here is the extensiveness of the potential moderator variables available. Longitudinal surveys typically have better sampling bases and more measures of potential moderators, but internal validity is a much bigger problem with them than with experiments. And, if researchers cannot be “reasonably” sure of a causal connection, what is the point of worrying about its generalization?

When it comes to causal generalization, RDD is in even worse shape than the usual experiment. Although the process of selection into treatment and control is perfectly known and can be modeled, the unbiased causal inference that results depends on (1) having specified the correct functional form or bandwidth size (Imbens and Lemieux, 2008); (2) the utility of a local average treatment effect at the cutoff (LATE); and (3) the willingness to tolerate less statistical power than in the experiment. To try to counteract these limitations, the comparative RDD design (CRD) adds an untreated regression to the basic RDD structure, typically from a pretest measure (Wing and Cook, 2013), other covariates (Angrist and Rokkanen, 2015) or a non-equivalent comparison group (Tang et al., 2015). The purposes of this untreated regression are to provide additional support for the extrapolation that RDD always needs, to increase power by adding data, and to test causal generalization not just at the cutoff but also in the entire area of the assignment variable where the treatment is available.
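As a hedged illustration of the basic design – invented data and numbers, not any study cited here – the sketch below assigns treatment deterministically at a cutoff and estimates the effect with a local linear regression, which yields a LATE at the cutoff only.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 2500
df = pd.DataFrame({"iq": rng.normal(100, 15, n)})
cutoff = 95
df["treated"] = (df["iq"] < cutoff).astype(int)  # treatment goes to need
df["outcome"] = (50 + 0.4 * (df["iq"] - cutoff)
                 + 5.0 * df["treated"] + rng.normal(0, 5, n))

# Local linear regression within a bandwidth around the cutoff; the
# coefficient on `treated` is the effect at the cutoff.
df["c"] = df["iq"] - cutoff
local = df[df["c"].abs() < 10]
fit = smf.ols("outcome ~ treated + c + treated:c", data=local).fit()
print(fit.params["treated"], fit.conf_int().loc["treated"].values)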


In Wing and Cook (2013), the key assumption of CRD is that the three untreated regression lines are parallel: the pretest scores on the untreated side of the cutoff and on the treated side as well, plus the posttest observations in the untreated part of the assignment variable. When these slopes are plausibly parallel and the pretest is included in the outcome model together with the assignment variable and binary cutoff score, the relationship of the assignment variable to the outcome will be zero. This is the ignorability condition that Angrist and Rokkanen (2015) believe is necessary for valid causal generalization beyond the cutoff score in CRD.

Tang et al. have just tested these ideas using the national Head Start impact study, a nationally representative study of 2499 children who were randomly assigned to Head Start or control status. As causal benchmarks, experimental estimates were computed both at the RD cutoff and also in the total treated area above it. The CRD was constructed out of the experiment by creating an assignment variable (IQ test scores), defining a cutoff value on it, and then omitting the control cases from the treated side of the assignment variable and the treated cases from the control side. One comparison regression function came from pretest scores on each of these outcomes, and the second from a cohort of non-Head Start children one year older and from the same catchment areas. There were three study outcomes measured before and after entering Head Start – performance in math, literacy and social behavior. In addition, a second assignment variable was also designated to test whether the results achieved with the IQ assignment variable would be replicated. They were, and in most of what follows we present results for the single IQ assignment variable. We use CRD-Pre to designate use of a pretest no-treatment regression function and CRD-CG to designate use of a non-equivalent independent comparison group. We refer to the randomized experiment as an RCT. Analyses not reported here showed that the three untreated regression segments were plausibly parallel for each outcome and each assignment variable.

One question is how similar the experimental, basic RDD and CRD estimates are at the cutoff. This is where theory indicates each should be unbiased, and so the two causal estimates should be similar, not just to each other within the limits of sampling error, but also to the RCT when it is estimated as a LATE. More important is the second question: How similar are the RCT, CRD-Pre and CRD-CG estimates away from the cutoff, given that basic RDD is not estimated there and the internal validity of CRD designs is unknown there? A final question is how similar the standard error estimates are across the three design variants, after adjusting for the inevitably smaller RDD groups due to how they were constructed from the RCT groups.

Figure 19 shows results comparing only the RCT and basic RDD for each outcome and assignment variable. They show the expected results: no bias at the cutoff, and larger confidence intervals with basic RDD. Figure 20 shows the results for CRD-Pre at the cutoff, and Figure 21 for CRD-CG at the cutoff. In each case the estimates seem close to the experimental ones (and hence also to the basic RDD ones). The confidence intervals are narrower than with the basic RDD and are thus closer to those from the RCT, though they are not equal to it. Figure 22 shows the causal estimates away from the cutoff for CRD-Pre, now calculated for more cases than just those at the cutoff. Figure 23 shows the results for CRD-CG. In each case, both the treatment estimates and confidence intervals are similar to those from the RCT.
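The logic of CRD-Pre can be sketched in a few lines on fabricated data; the model below is an illustrative assumption of mine, not the Tang et al. code. Once the pretest is in the model, the assignment variable should add nothing among the untreated, and the treatment effect can then be estimated over the whole treated region rather than at the cutoff alone.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 2000
iq = rng.normal(100, 15, n)
treated = (iq < 95).astype(int)            # assignment by cutoff on IQ
pretest = 40 + 0.5 * iq + rng.normal(0, 3, n)
outcome = 10 + 1.0 * pretest + 4.0 * treated + rng.normal(0, 3, n)
df = pd.DataFrame({"iq": iq, "treated": treated,
                   "pretest": pretest, "outcome": outcome})

# Ignorability check: conditional on the pretest, the assignment variable
# should carry (close to) no additional information among the untreated.
check = smf.ols("outcome ~ pretest + iq", data=df[df.treated == 0]).fit()
print("iq coefficient among untreated:", round(check.params["iq"], 3))

# If that holds, estimate the effect over the whole treated region.
fit = smf.ols("outcome ~ treated + pretest", data=df).fit()
print("CRD-Pre effect estimate:", round(fit.params["treated"], 2))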
Under the conditions tested, the RCT, CRD-Pre and CRD-CG results are virtually interchangeable with respect to both bias away from the cutoff and statistical power.


Figure 19: Basic RD estimates and confidence intervals compared to the corresponding RCT benchmarks at the cutoff over three outcomes and two assignment variables.

Figure 20: CRD-Pre estimates and confidence intervals at the cutoff compared to the corresponding RCT benchmarks for one assignment variable.

Are such findings of any relevance to the inheritance in which Cochran was embedded, that he marginally improved, and that he then passed on for others to improve again and again and again? In Tang et al., as in Wing and Cook, the RCT causal estimates and standard errors are reproduced in the CRDs merely by adding supplementary data to what is otherwise a weak but theoretically unbiased design – the basic RDD.


Figure 21: CRD-CG estimates and confidence intervals at the cutoff compared to the corresponding RCT benchmarks for one assignment variable

Figure 22: CRD-Pre estimates and confidence intervals above the cutoff compared to the corresponding RCT benchmarks for one assignment variable.

Does RDD fall outside of the Cochran inheritance’s bailiwick, or is it worth including in it, even if only in some forms (like CRD) but not others (like basic RDD)? CRD certainly needs willing forward, but will statisticians of causation take on this task?

Conclusion

The three examples presented here support causal conclusions; they may even strongly warrant them. One conclusion is about the achievement consequences of a national law; another is about the consequences of a county-level reform on bail bond amounts and the number of officiating judges; and the third is about the consequences of CRD for the quality of causal inferences relative to an RCT.


Figure 23: CRD-CG estimates and confidence intervals above the cutoff compared to the corresponding RCT benchmarks for one assignment variable.

None of these studies started from the premise of creating an observational study that mimics what would be done in an experiment. The first example involves multiple hypothesis tests, each individually imperfect, rather than a single sharply focused test. The second compares legal charges whose seriousness widely differs and whose bail amounts, unlike in an experiment, hardly overlap at all. The third is about a design (RDD) whose treatment and control groups also do not overlap and so are not matched. Yet each of these designs supported strong causal conclusions without any matching for a single strong test.

Indubitably, there are great conceptual advantages when observational studies are considered as though they were broken-down experiments; and in actual research practice, there are many opportunities to match individual cases where the selection process is mostly known and where relevant and otherwise rich covariates are available. I am not against the utility of the mimetic conception of observational study design that Cochran willed forward. My concern is limited. It is whether the current dominance of this conception threatens to rule out other ways of advancing observational study design that are predicated on different principles than the pursuit of ineffable exact matching. I would like to see more principles than matching implicated in observational study design, to see more arrows in the observational study quiver. I have tried to model some of them here.

No body of statisticians is charged with making decisions about which issues to include in Cochran’s inheritance or to exclude from it. So there is no mechanism for formally answering the questions this paper posed about demarcation: what should and should not be part of the inheritance, and what might be willed forward because it seems worthy of inclusion? The most to hope for is that, over time, a consensus will emerge about (1) supplementing the traditional research agenda on internal validity by increasing the profiles of both construct and external validity, and (2) adding RDD and CRD to the pantheon of acceptable designs, even though they are predicated on minimizing the very group overlap that experiments and mimetic quasi-experiments seek to maximize.


The issues I have described are important in the daily struggles of practitioners of observational studies who, like me, look to statisticians for help. I wish I could talk with Cochran to learn how he would make demarcation decisions. Of course, he might not be interested and might simply reject the assumptions of this paper, perhaps believing I have reified the intrinsically fuzzy concept of an inheritance. But that would still be interesting to learn. Though conversation with him is impossible, it is possible with some of his students and their students. Most of them are acquainted, I would guess, with the distinguished figure and record of Professor William G. Cochran of Harvard University. But it is important not to forget that in the working-class Glasgow of his childhood he would have been Oor Wullie, that he was Bill to his friends on both sides of the Atlantic, and that the few students of his I know remember him with affection and with respect for his work on observational study procedures that seek to approximate the logic and practice of randomized experiments. That his students and his students’ students have willed his work forward so tellingly is a blessing for his memory; but it is not something any scholar should take for granted.

Consider Galileo. He used his lens as a telescope and revealed a heliocentric universe that transformed how Man, God and Nature are related. A decade later, he realized his lens was also a microscope, but he was not interested enough to venture far into the micro worlds that eventually transformed biology and medicine. Fortunately, others were more interested in microscopes than he was, and about 50 years later “the pioneers of microscopy Antonie van Leeuwenhoek and Robert Hooke revealed that a Lilliputian universe existed all around and even inside of us. But neither of them had students (my italics), and their researches ended in another false dawn for microscopy. It was not until the middle of the nineteenth century... that the discovery of the very small began to alter science in fundamental ways” (Flannery, 2015, p. 30). How lucky Cochran’s students and his students’ students are in being so centrally networked within the extraordinarily upright and rudimentary inheritance that Cochran received, improved and passed on to them. Does this very privileged position in history entail any obligation to take seriously the demarcation tasks so amateurishly described in this paper? I guess not, but...

Acknowledgments

This work was supported by NSF grant #DRL-1228866.

References

Angrist, J. D. and Rokkanen, M. (2015). Wanna get away? Regression discontinuity estimation of exam school effects away from the cutoff. Journal of the American Statistical Association. (Just accepted).

Campbell, D. T. (1966). Pattern matching as an essential in distal knowing. In Hammond, K. R., editor, The psychology of Egon Brunswik, pages 81–106. Holt, Rinehart & Winston, Oxford, UK.


Chetty, R., Looney, A., and Kroft, K. (2009). Salience and taxation: Theory and evidence. American Economic Review, 99:1145–1177.

Cochran, W. G. (1950). The comparison of percentages in matched samples. Biometrika, 37(3/4):256–266.

Cochran, W. G. (1965). The planning of observational studies of human populations. Journal of the Royal Statistical Society, Series A (General), 128(2):234–266.

Cochran, W. G. (1968). The effectiveness of adjustment by subclassification in removing bias in observational studies. Biometrics, 24(2):295–313.

Cochran, W. G. (1983). Planning and analysis of observational studies. Wiley, New York.

Cook, T. D. (1985). Post-positivist critical multiplism. In Shotland, R. L. and Mark, M. M., editors, Social Science and Social Policy, pages 21–62. Sage Publications, Beverly Hills, CA.

Cook, T. D. (2008). “Waiting for life to arrive”: A history of the regression-discontinuity design in psychology, statistics and economics. Journal of Econometrics, 142(2):636–654.

Cook, T. D. and Campbell, D. T. (1979). Quasi-Experimentation: Design and Analysis Issues for Field Settings. Houghton Mifflin, Boston.

Cook, T. D., Tang, Y., and Diamond, S. S. (2014). Causally valid relationships that invoke the wrong causal agent: Construct validity of the cause in policy research. Journal of the Society for Social Work & Research, 5(4):379–414.

Dee, T. S. and Jacob, B. (2011). The impact of No Child Left Behind on student achievement. Journal of Policy Analysis and Management, 30:418–446.

Diamond, S. S., Bowman, L. E., Wong, M., and Patton, M. M. (2010). Efficiency and cost: The impact of videoconferenced hearings on bail decisions. Journal of Criminal Law and Criminology, 100:869–902.

Flannery, T. (2015). How you consist of thousands of tiny machines. New York Review of Books, LXII(12):30–32.

Imbens, G. W. and Lemieux, T. (2008). Regression discontinuity designs: A guide to practice. Journal of Econometrics, 142(2):615–635.

Imbens, G. W. and Rubin, D. B. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press, New York, NY.

Imbens, G. W. and Wooldridge, J. (2007). What’s new in econometrics? Lecture notes, NBER Summer Institute.

Meehl, P. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46:806–834.


Rosenbaum, P. R. (2005). Observational studies. In Everitt, B. S. and Howell, D. C., editors, Encyclopedia of statistics in behavioral science, Vol. 3, pages 1451–1462. Wiley & Sons, Chichester, UK.

Rosenbaum, P. R. (2009). Observational Studies. Springer-Verlag, New York, NY, 2nd edition.

Rosenbaum, P. R. (2011). Some approximate evidence factors in observational studies. Journal of the American Statistical Association, 106:285–295.

Rosenbaum, P. R. and Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55.

Shadish, W. R., Cook, T. D., and Campbell, D. T. (2002). Experimental and Quasi- Experimental Designs for Generalized Causal Inference. Houghton Mifflin, Boston.

Stigler, S. M. (1986). The History of Statistics: The Measurement of Uncertainty before 1900. Harvard University Press, Cambridge, MA, and London, England.

Tang, Y., Cook, T. D., and Kisbu-Sakarya, Y. (2015). Reducing bias and increasing precision by adding either a pretest measure of the study outcome or a nonequivalent comparison group to the basic regression discontinuity design. Working paper, Institute for Policy Research at Northwestern University.

Watson, G. S. (1982). William Gemmell Cochran 1909–1980. The Annals of Statistics, 10(1):1–10.

Wing, C. and Cook, T. D. (2013). Strengthening the regression discontinuity design using additional design elements: A within-study comparison. Journal of Policy Analysis and Management, 32(4):853–877.

Wong, M., Cook, T. D., and Steiner, P. M. (2015). Adding design elements to improve time series designs: No Child Left Behind as an example of causal pattern-matching. Journal of Research on Educational Effectiveness, 8(2):245–279.

Yates, F. (1964). Sir Ronald Fisher and the design of experiments. Biometrics, 20(2):307–321.

Observational Studies 1 (2015) 165-170 Submitted 12/14; Published 8/15

Design and interpretation of studies: relevant concepts from the past and some extensions

David R. Cox david.cox@nuffield.ox.ac.uk
Nuffield College, Oxford University, Oxford OX1 1NF, United Kingdom

Nanny Wermuth [email protected]
Mathematical Statistics, Chalmers University of Technology, Gothenburg, Sweden, and Medical Psychology and Medical Sociology, Gutenberg-University Mainz, Germany

We are happy to have the chance of discussing the paper by W.G. (Bill) Cochran, titled “Observational Studies” and reprinted here. It first appeared in 1972; we call it the “present paper” below. We start, however, by describing our personal encounters with Bill.

1. Personal Encounters with Bill Cochran

DRC: I first heard Bill Cochran lecture in 1956 and, about that time, greatly benefited from his pre-publication comments on a draft of a book on experimental design. I recall also a memorable meeting of the Royal Statistical Society at which the precursor (Cochran, 1965) of the present paper was given for discussion.

NW: As a Ph.D. student, I was fortunate to get to know Bill Cochran as an excellent teacher and researcher. His way of teaching was typically most illuminating for me. He was involved in many different types of empirical studies and he shared his experiences openly with the students. He would talk with joy about successes but would also report on disappointing developments that had led to difficult, unsolved problems. I regarded him as the heart of our department. He stressed the positive features of his colleagues, and he remembered the names of all the students as well as what he had discussed with them before, whether statistical questions or personal experiences. He was kind and modest, typically full of energy, and always ready to listen and talk. I learned a lot from him, and not only about statistics.

2. Discussion of Cochran’s “Observational Studies”

The present paper is striking for its relevance even after so many years. Cochran's concepts and ideas are presented with clarity and simplicity. Many of them appear to be ignored in the current inrush of “big data.” This makes many of Cochran's points ever more topical. The discussion of principles of design makes it clear that there are essential differences between experiments and observational studies. In experiments, crucial aspects are under the investigator's control, while in observational studies the features measured will largely


have to be accepted as they happen to arise. Cochran stresses, however, that experiments and observational studies nevertheless have much in common. In particular, for the types of observational study he is discussing, the motivation is a search for causes. Several variables may be viewed as treatments in a broad sense. For instance, stronger positive effects may be expected for a set of new teaching methods, or stronger negative effects after exposure to higher levels of several risk factors for a given disease. When experiments are not feasible, the main aim is still to establish, as firmly as possible, the link with an underlying data generating process. Cochran states this as: “A claim of proof of cause and effect must carry with it an explanation of the mechanism by which this effect is produced.” Thus, an underlying data generating process is to be scientifically explainable in the given subject matter context. Some of the terminology has changed since the paper was written, but several key aspects remain essential for any planned study today:

• stating the main objectives of a study before the data are collected,

• planning for well-defined comparisons and for one or several control groups,

• thinking about the types of measurements needed and how to assure their comparability,

• specifying target populations and being aware of nonresponse as one reason for missing a target.

The relative importance of these aspects may differ in different fields of application. For example, in many areas of physics there is likely to be a secure base of background knowledge and theory, whereas in some types of social science research this may not yet be the case. The broad approach to design must depend also on the time-scale and costs of a single investigation. Whenever new studies can be designed speedily, the data can be collected quickly, and analyses are easily computed and interpreted, then a flexible approach with a sequence of simple studies may be feasible. But when the effort and time involved in any single study is considerable, all the above four points become essential for the study to be successful. A noteworthy example is the prospective study by Doll and Hill (1956) establishing cigarette smoking as a cause of lung cancer. For experiments, R.A. Fisher (1926, 1935) had suggested, as principles of design, the need to avoid systematic distortions in treatment effects and the enhancement of the precision in estimates of effects. He stressed also the value of considering several treatments simultaneously rather than one factor at a time. This gives the chance to see whether effects are substantially modified for particular levels of another factor or for level combinations of several factors, that is, to understand major interactions. More importantly, it may help to establish the stability of an effect under a range of conditions by showing the absence of major interactions. This idea carries over directly to observational studies. However, avoiding systematic distortions, often also called “bias,” is considerably harder in observational studies. In experiments, in addition to creating laboratory-like conditions for obtaining measurements for quantitative variables and observations for categorical variables, the main tools are randomization, that is, random allocation of participants to treatment levels, stratification (also called subclassification or standardisation), the use of


important covariates (in some contexts called concomitant variables) and blocking (which in observational studies turns into matching). Clinical trials with randomized allocation of patients to treatments may ideally be regarded as experiments rather than observational studies. But in reality, distorted estimates of treatment effects can occur even in such clinical trials, for instance, when relevant intermediate variables are overlooked, such as non-compliance of patients with assigned treatments, or when there is a substantial undetected interactive effect of a treatment and a background variable on the response, even though, by successful randomization, this background variable has become independent of the treatment. Thus, Cochran's statement (on page 85) that “in regard to the effect of x on y, matching and standardization remove all bias” cannot hold when one of the above-mentioned sources of distortion for treatment effects is present. When randomization is not an option, the next best approach is to design a prospective cohort study. But it may take a long time to see any results, and these types of study are often expensive. They offer, however, the possibility of deriving and studying data generating processes. This option was not yet available in the 1970s except in the special situation of only linear relations and with responses that are affected one after the other, that is, when path analysis, called recursive systems by Strotz and Wold (1960), is applicable. The importance of such an approach was rarely appreciated at that time; the textbook by Snedecor and Cochran (1966) was a notable exception. The direct generalisation of path analysis, to include other than linear relations and arbitrary types of variables, is to the directed acyclic graph (DAG) models. A more appropriate class of models for data generating processes are the recursive systems in single as well as joint responses, called traceable regressions; see Wermuth (2012), Wermuth and Cox (2013, 2015). In these models, several responses may be affected at the same time, such as, for instance, systolic and diastolic blood pressure, which are two aspects of a single phenomenon, namely the blood pressure wave. Both will, for instance, be influenced at the same time when patients receive a medication to reduce high blood pressure. These sequences of regressions form one subclass of the so-called graphical chain models and they include DAG models as a subclass. They often permit the use of a corresponding graph to trace pathways of development and they may be compatible with causal interpretations. They also take care of a main criticism of DAG models regarding causal interpretations by Lindley (2002): that DAGs do not include joint responses and therefore cannot capture many types of causal processes. In the last section of the present paper, there is a beautiful illustration of the suggestion “make your theories elaborate,” given by R.A. Fisher when asked how to clarify the step from association to causation; see Cochran (1965). We fully agree that this step needs careful planning of studies and good judgement in interpreting statistical evidence. In the meantime, some of our colleagues have derived a “causal calculus” for the challenging process of inferring causality; see Pearl (2015). In our view, it is unlikely that a virtual intervention on a distribution, as specified in this calculus, is an accurate representation of a proper intervention in a given real world situation.
Their virtual intervention on a given distribution just introduces some conditional independence constraints and leaves all other aspects unchanged. This may sometimes happen, but experience from controlled clinical trials suggests that this is a relatively rare situation.


Even before the step to a causal interpretation, it is, as discussed below, less clear that matching or some adjustment will always be beneficial in observational studies. For instance, with pair-matched samples, no clear target population is defined; hence it often remains unclear to which situations the results could be generalized. Blocking in experiments and matching in observational studies clearly make the measurements in different treatment groups more comparable. And it has been demonstrated explicitly how, with more homogeneous groups to compare, both sampling variability and sensitivity to other sources of distortion are reduced; see Rosenbaum (2005). But for data looked at only after pair-matching, it becomes impossible to study dependences among the matching variables, in particular, to recognize an extremely strong dependence among them in a target population that could even lead to a reversal of the dependence of the response on the treatment. In addition, if results for the dependence of a response are computed exclusively for explanatory variables other than the matching variables, then an important interactive effect, of a treatment and a matching variable on the response, may get overlooked; for some examples see McKinlay (1977). The same holds for caliper matching, as defined in the present paper, and for a formal extension of it, called propensity-score matching by Rosenbaum and Rubin (1983). For a careful study and discussion of the large differences in estimated bias that can result with different choices of variables included in the propensity score and with different types of matching methods, see for instance Smith and Todd (2005). Similarly, any adjustment of estimates depends typically on how well the associated model is specified; see for instance Bushway et al. (2007). With poor estimates or with some model misspecifications, adjustments may do harm instead of being beneficial. For approaches that move away from mere adjustments, see for instance Genbäck et al. (2014). In all of these discussions of matching and adjustments in the literature, generating processes are rarely mentioned. But their importance was already stressed in the present paper, even though at that time, more than 40 years ago, the corresponding sequences of regressions, necessary for full discussions, had been studied intensively only for the very special situation of exclusively quantitative responses and linear dependences. Generating processes lead from background variables, such as intrinsic features of the individuals, via treatments and intermediate variables, to the outcomes of main interest. In the corresponding sequences of regressions, the dependence structure among directly and indirectly important explanatory variables is estimated, and different pathways for dependences of the responses are displayed in corresponding regression graphs. Such graphs may be derived from underlying statistical analyses for a given set of data, and they represent hypothetical processes that can be tested in future studies. In addition, consequences of any given regression graph can be derived. Consequences that result after marginalizing over some of the variables, or after conditioning on other variables in such a way that the conditional independences present in the generating process are preserved for the remaining variables, can be collected into a “summary graph” by using, for instance, subroutines in the program environment R; see Sadeghi and Marchetti (2012).
In this way, it will become evident which variables need to be conditioned on and such knowledge may possibly lead to a single measure for conditioning. Generating processes will point directly to situations in which seemingly replicated results in several groups, such as strong positive dependences, change substantially after marginalising over some of the


groups, in some cases even turning positive into negative dependences. This can happen only when some of the grouping variables are strongly dependent. This well-known phenomenon has been named differently in different contexts, for instance as the presence of multicollinearity, as highly unbalanced groupings, or as the Yule-Simpson paradox. Conditions for the absence of such situations have been named and studied as conditions for “transitivity of association signs” by Jiang et al. (2015). With the dissemination of fully directed acyclic graph models, some more recent terminology has become common. For instance, when an outcome has one important explanatory variable and there exists, in addition, an important common explanatory variable for both, the latter is a confounding variable and, when unobserved, it is now named an “unmeasured confounder” that may distort the true dependence substantially. Similarly, when an outcome has one important explanatory variable and another outcome depends strongly on both, then conditioning on this common response introduces a distortion of the first dependence that is named “selection bias.” In the current literature on “causal models” known to us, both these types of distortion are discussed separately. A related phenomenon, for which a first example was given by Robins and Wasserman (1997), is typically overlooked: by a combination of marginalizing over and conditioning on variables in a given generating process, a much stronger distortion, now named “indirect confounding,” may be introduced than by an unmeasured confounder alone or by selection bias alone. Parametric examples for exclusively linear dependences, and graphical criteria for detecting indirect confounding in general, are available. The latter use summary graphs that are derived by marginalizing only; see Wermuth and Cox (2008, 2015). The broad issues so clearly emphasized in the present paper remain central, challenging and relevant. That is to say, not only are firm statistical relations of particular kinds to be established, such as the estimation of treatment effects and of possibly underlying data-generating processes, but the statistical results need to be interpretable in terms of the underlying science.
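To make the Yule-Simpson reversal mentioned above concrete, here is a minimal numerical sketch. The counts are adapted from a well-known textbook illustration (comparing two kidney-stone treatments) and are not data from any study cited in this paper: within each severity group the treatment looks better, but after marginalizing over the strongly treatment-dependent grouping variable the comparison reverses.

import numpy as np  # only used implicitly for style; plain Python suffices

groups = {
    # group: (treated recovered, treated total, control recovered, control total)
    "mild":   (81, 87, 234, 270),
    "severe": (192, 263, 55, 80),
}

totals = [0, 0, 0, 0]
for name, (tr, tn, cr, cn) in groups.items():
    print(f"{name:>7}: treated {tr / tn:.2f} vs control {cr / cn:.2f}")
    totals = [t + x for t, x in zip(totals, (tr, tn, cr, cn))]

tr, tn, cr, cn = totals
# Marginalizing over the groups reverses the within-group comparison because
# group membership is strongly dependent on treatment assignment.
print(f" pooled: treated {tr / tn:.2f} vs control {cr / cn:.2f}")

Running this prints recovery rates of 0.93 vs 0.87 (mild) and 0.73 vs 0.69 (severe), yet 0.78 vs 0.83 pooled: the sign of the dependence flips after marginalising.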

References

Bushway, S., Johnson, B.D. and Slocum, L.A. (2007). Is the magic still there? The use of the Heckman two-step correction for selection bias in criminology. J. Quant. Criminol., 23, 151–178.

Cochran, W. G. (1965). The planning of observational studies of human populations. J. Roy. Statist. Soc. Ser. A, 128, 234–266.

Cochran, W. G. (1972). Observational studies. In T. A. Bancroft (ed.), Statistical Papers in Honor of George W. Snedecor. Iowa State University Press, Ames, Iowa.

Doll, R. and Hill, A. (1956). Lung cancer and other causes of death in relation to smoking. A second report on the mortality of British doctors. British Medical Journal, 2, 1071–1081.

Fisher, R.A. (1926). The arrangement of field experiments. J. Ministry of Agric., 33, 503–513.

Fisher, R.A. (1935). Design of Experiments. Edinburgh: Oliver and Boyd.


Genbäck, M., Stanghellini, E. and de Luna, X. (2014). Uncertainty intervals for regression parameters with non-ignorable missingness in the outcome. Statistical Papers.

Jiang, Z., Ding, P. and Geng, Z. (2015). Qualitative evaluation of associations by the transitivity of the association signs. Statistica Sinica, 25, 1065–1079.

Lindley, D.V. (2002). Seeing and doing: the concept of causation. Int. Statist. Rev., 70, 191–214.

McKinlay, S.M. (1977). Pair-matching – a reappraisal of a popular technique. Biometrics, 33, 725–735.

Pearl, J. (2015). Trygve Haavelmo and the emergence of causal calculus. Econometric Theory, 31, 152–179.

Robins, J. and Wasserman, L. (1997). Estimation of effects of sequential treatments by reparametrizing directed acyclic graphs. In: Proc. 13th Ann. Conf. UAI, D. Geiger and P. Shenoy (eds.), Morgan Kaufmann, San Mateo, 409–420.

Rosenbaum, P.R. and Rubin, D.B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70, 41–55.

Rosenbaum, P.R. (2005). Heterogeneity and causality: Unit heterogeneity and design sensitivity in observational studies. American Statistician, 59, 147–152.

Sadeghi, K. and Marchetti, G.M. (2012). Graphical Markov models with mixed graphs in R. The R Journal, 4, 65–73.

Snedecor, G.W. and Cochran, W.G. (1966). Statistical Methods (5th ed.). Ames: The Iowa State University Press.

Smith, J.A. and Todd, P.E. (2005). Does matching overcome LaLonde's critique of nonexperimental estimators? J. Econometrics, 125, 305–353.

Strotz, R.H. and Wold, H.O.A. (1960). Recursive vs. nonrecursive systems: an attempt at synthesis. Econometrica, 28, 417–427.

Wermuth, N. (2012). Traceable regressions. Intern. Statist. Review, 80, 415–438.

Wermuth, N. and Cox, D.R. (2008). Distortions of effects caused by indirect confounding. Biometrika, 95, 17–33.

Wermuth, N. and Cox, D.R. (2013). Concepts and a case study for a flexible class of graphical Markov models. In: Robustness and Complex Data Structures. Festschrift in Honour of Ursula Gather. Becker, C., Fried, R. and Kuhnt, S. (eds.), Springer, Heidelberg, 331–350; also on ArXiv: 1303.1436.

Wermuth, N. and Cox, D.R. (2015). Graphical Markov models: overview. In: International Encyclopedia of the Social and Behavioral Sciences, 2nd ed., J. Wright (ed.), 341–350; also on ArXiv: 1407.7783.

Observational Studies 1 (2015) 171-172 Submitted 5/15; Published 8/15

Comment on “Observational Studies”, by William G. Cochran

Stephen E. Fienberg [email protected]
Department of Statistics, Carnegie Mellon University, Pittsburgh, PA 15213-3890, USA

It has been a pleasure for me to reread this paper by Wm. Cochran after a number of years and to see it reprinted in the opening issue of the journal, Observational Studies. Although I was a graduate student in the Department of Statistics at Harvard in the 1960s, I never actually took a course from him, but I did sit in on a seminar where he described the work that appeared initially as department technical reports and then ultimately in a series of papers on the topic, e.g., see Cochran (1965, 1968) and Cochran and Rubin (1973), and his lectures included early versions of the ideas that ultimately found their way into this paper and his posthumously published book on the topic (Cochran, 1983). I actually still have copies of those technical reports in my files! As I interacted with Cochran over the ensuing decade or so, I came to appreciate, more and more, the wisdom of his insights into drawing causal inferences from observational studies. Although he is often thought of primarily as a sampling statistician or as a major force in the design of experiments, e.g., see Cochran and Cox (1957), Cochran was clearly heavily influenced in his work on observational studies by his experiences with several very large studies, such as the Kinsey Report (Cochran et al., 1953, 1954), and especially by his work as the only statistician on the 10-member scientific advisory committee for the 1964 U.S. Surgeon General's report (United States Public Health Service, 1964) concluding that cigarette smoking caused lung cancer. In the Kinsey report, Cochran et al. argued that there was no basis for many of the inferences drawn by the authors, although he says this in quite a gentle way in the present paper, whereas in the case of the effects of smoking on cancer the observational evidence was strong, and he describes some of his thinking on the topic in Section 6. Perhaps the most important thing we can take from this paper is Cochran's clear message: if we really want to draw causal inferences from observational studies, then we need to think about their design and how such designs approximate those that we might have developed had we only been able to design a randomized controlled trial. As methods for dealing with observational studies have developed over the ensuing decades, this same philosophy can be found in the books by Rosenbaum (2002, 2010) and the article by Rubin (2008). This leads me to speculate how Cochran would have viewed the current “Big Data” movement. As others have remarked in recent years, drawing causal inferences from large amounts of data gleaned from the WWW is fraught with difficulty, and this involves both internal validity (to which randomization provides the key) and external validity (our ability


to generalize to a relevant population). If only the proponents of big data for causal purposes would take the time to read Cochran’s 1972 paper with care!

Acknowledgments

The preparation of this comment was supported by the Singapore National Research Foundation under its International Research Centre @ Singapore Funding Initiative and administered by the IDM Programme Office.

References

Cochran, W. G. (1965). The planning of observational studies of human populations. Journal of the Royal Statistical Society, Series A (General), pages 234–266.

Cochran, W. G. (1968). The effectiveness of adjustment by subclassification in removing bias in observational studies. Biometrics, pages 295–313.

Cochran, W. G. (1983). Planning and Analysis of Observational Studies. John Wiley & Sons, New York. Lincoln E. Moses and Frederick Mosteller, eds.

Cochran, W. G. and Cox, G. M. (1957). Experimental Designs. John Wiley & Sons, New York, 2nd edition.

Cochran, W. G., Mosteller, F., and Tukey, J. W. (1953). Statistical problems of the Kinsey report. Journal of the American Statistical Association, 48(264):673–716.

Cochran, W. G., Mosteller, F., and Tukey, J. W. (1954). Statistical Problems of the Kinsey Report on Sexual Behavior in the Human Male. American Statistical Association, Washington, DC.

Cochran, W. G. and Rubin, D. B. (1973). Controlling bias in observational studies: A review. Sankhyā: The Indian Journal of Statistics, Series A, pages 417–446.

Rosenbaum, P. R. (2002). Observational Studies. Springer, New York.

Rosenbaum, P. R. (2010). Design of Observational Studies. Springer, New York.

Rubin, D. B. (2008). For objective causal inference, design trumps analysis. Annals of Applied Statistics, 2(3):808–840.

United States Public Health Service (1964). Smoking and Health: Report of the Advisory Committee to the Surgeon General of the Public Health Service. US Department of Health, Education, and Welfare, Washington, DC.

Observational Studies 1 (2015) 173-181 Submitted 4/15; Published 8/15

Comment on Cochran’s “Observational Studies”

Joseph L. Gastwirth [email protected]
Department of Statistics, George Washington University, Washington, DC 20052, USA

Barry I. Graubard [email protected]
Division of Cancer Epidemiology & Genetics, National Cancer Institute, Bethesda, Maryland 20892, USA

1. Introduction

The timelessness of Prof. Cochran's contributions to the planning, design and analysis of both large scale surveys and observational studies is exemplified by his 1972 paper summarizing many years of research. Prof. Small should be thanked by the statistics community for devoting a special section of the new journal Observational Studies to bringing this important work to the attention of the next generation of statisticians and data scientists. Researchers in almost every field will benefit from reading the advice given in the paper at the very start of thinking about a study, whether randomized or observational. Our commentary will focus on the differences between studies designed to make inferences applicable to the general population and studies carried out to understand what occurred in the population studied, i.e., where the study population is the population for which inferences will be made. In the first and most common setting, the importance of Cochran's observation that one needs to consider whether the study population differs in some important ways from the general population will be illustrated by reviewing studies concerning the relationship between obesity, and more generally body weight, and mortality and morbidity. While that issue does not arise in the second setting, which frequently occurs in legal cases dealing with discrimination or violation of occupational safety and health rules, the wisdom of Prof. Cochran's recommendations on analytic techniques and methods for controlling for the potential effect of confounders is very relevant to the proper analysis of statistical evidence.

2. Inference from samples from a population

Cochran points out that the target population should be identified and that a probability sample of the target population be collected. However, he points out that a probability sample may not be possible, e.g., because of cost and operational considerations, so that many studies obtain a sample from a population that differs somewhat from the target population. For example, to study the association of body weight (actually body weight adjusted for height, or body mass index (BMI), i.e., weight in kg divided by the square of height in meters) with all-cause mortality, researchers have used existing cohorts of adults such as the National Institutes of Health (NIH)-AARP Diet and Health Study (Adams et al., 2006), which we will call the NIH-AARP Study. The NIH-AARP Study sent a questionnaire


in 1995-96 to all members of the AARP 50-71 years old who resided in six U.S. states (California, Florida, Louisiana, New Jersey, North Carolina, and Pennsylvania) and two metropolitan areas (Atlanta and Detroit). The questionnaire collected self-reported body weight and height, along with other relevant covariates such as smoking, alcohol consumption and physical activity, that were incorporated in the analysis to remove the potential for confounding in the estimated association of BMI and mortality. The sample studied had information on the 18% of male and female members of AARP (n = 567,169) who returned the questionnaire. The low response rate also raises important statistical concerns about the generalizability of the conclusions to the population of AARP members, much less the entire adult U.S. population 50-71 years old. The Adams et al. article states “Even against the background of advances in the management of obesity-related chronic diseases in the past few decades, our findings suggest that adiposity, including overweight, is associated with an increased risk of death” and compares their results to those reported by Flegal et al. (2005). The earlier study used national probability samples from the National Health and Nutrition Examination Survey (NHANES) to construct US representative cohorts and did not find an increased risk of death in overweight (25 ≤ BMI < 30) individuals. The NIH-AARP Study does not state restrictions on the population to which its conclusions are applicable and implies that they are valid for the entire US population. Investigators (Durazo-Arvizu et al., 1997; Calle et al., 1999) have reported that the BMI-mortality relationship may vary by race, so those groups should be appropriately represented in the study sample. Although membership in the AARP is open to the entire population, it is not a representative sub-population of the over-50 population of the nation, as members must pay annual dues. Further, the low response rate potentially exacerbates the non-representativeness of the NIH-AARP sample. Cochran's recommendation to discuss how differences between a study sample and the target population, such as the US population, may affect the interpretation of the inferences drawn from it is very important. The BMI-mortality relationship found in the NIH-AARP Study may not be generalizable to the entire US population, as its race/ethnic mix differs from that of the entire US. The 1995 US projections for the race/ethnicity distribution of 50-69 year olds (the closest age range to the NIH-AARP available) were 80.7%, 9.5%, 6.5%, 3.3% (see http://www.census.gov/prod/1/pop/p25-1130/p251130.pdf) compared to the NIH-AARP race/ethnicity distribution of 91.6%, 3.7%, 1.8%, 1.6% for white, African-American, Hispanic, and Asian, Pacific Islander or Native American, respectively. Not surprisingly, whites are over-represented in the NIH-AARP study; notice that all three minority groups are under-represented by at least a factor of two. The use of volunteers, i.e., nonrandom samples, from selected populations is typical of many epidemiologic studies; examples are plentiful, e.g., the Harvard University Nurses Health Study http://www.channing.harvard.edu/nhs/, the National Cancer Institute U.S. Radiologic Technologists Cohort http://dceg.cancer.gov/research/who-we-study/cohorts/us-radiologic-technologists, and the City of Hope National Medical Center California Teacher Study https://www.calteachersstudy.org/study_data.html.
Even though the sample size of these cohorts is quite large, the potential bias resulting from having specialized samples of a target population (e.g., the US population) is not ameliorated by the small standard errors of estimates of association obtained from large samples. This important point is


often ignored by epidemiologists, even though statisticians are well aware of the potential magnitude of this type of bias, e.g., from the Literary Digest pre-election poll in 1936. For conclusions concerning a relative risk obtained from a sampled population differing from the target one, the risk model fit to the data needs to be correctly specified, with the correct functional form of the relationship of the response to the covariates along with appropriate interactions, and information on all major covariates should be obtained. Only then will there be a good chance of estimating the risk accurately for a target population. Furthermore, the sample should span the same distribution of covariates as the target population; e.g., if the relative risks differ by age group, the age groups of the target population should be represented in the sample population. From a public health policy point of view, the results from cohort studies or other types of epidemiologic studies are most useful if they are generalizable to the target population of interest. In the Adams et al. paper and many other epidemiologic analyses, population attributable risks (PAR) for an “exposure” are used to estimate how many cases or deaths would be prevented if the exposure were eliminated or reduced by a suitable intervention. If the estimated relative risks from a particular study are not applicable to the target population, then the estimated PARs could be misleading, resulting in a misallocation of resources that might better be directed to more important public health exposures, e.g., preventing smoking, or to warning the public of a risk of a serious disease, e.g., Reye syndrome, from a frequently used product.
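To illustrate how sensitive the PAR is to a relative risk that does not transport to the target population, here is a minimal sketch using Levin's classical formula, PAR = p(RR − 1) / [1 + p(RR − 1)], where p is the exposure prevalence in the target population and RR the relative risk. The numbers are invented for illustration and are not from Adams et al. (2006) or Flegal et al. (2005).

def levin_par(prevalence: float, relative_risk: float) -> float:
    """Population attributable risk via Levin's formula:
    PAR = p(RR - 1) / (1 + p(RR - 1))."""
    excess = prevalence * (relative_risk - 1.0)
    return excess / (1.0 + excess)

# Hypothetical numbers: if a study's estimated RR (here 1.3) does not carry
# over to a target population whose true RR is 1.1, the PAR at 35% exposure
# prevalence is overstated by roughly a factor of three (0.095 vs 0.034).
for rr in (1.1, 1.3):
    print(f"RR = {rr}: PAR = {levin_par(0.35, rr):.3f}")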

3. Inference for the study population

Although the objective of most statistical surveys and studies is to draw inferences from a subset, ideally a random sample, of a population that will apply to the much larger population, in some important applications one is concerned with drawing conclusions that apply only to the study population. In many legal cases, the question addressed by the statistical analysis concerns the appropriateness of the practices of a specific employer or firm. For example, in a fair trade case the question may focus on whether a particular exporter “dumped” or sold goods below cost, which is unfair to the importing nation's producers; in an equal pay case the issue is whether female employees are paid the same as similarly qualified males. In both situations, the conclusions will only apply to the particular firm or employer; i.e., if it turns out that the firm did not dump goods or that the employer underpaid female employees by $2.00 an hour, those conclusions will not be considered in a similar case involving a different firm, nor would they imply that females employed in similar jobs throughout the nation are under-paid by an average of $2.00 an hour. This section will show that several of Prof. Cochran's wise suggestions and guidelines are very useful in analyzing this type of observational study but also have been misinterpreted by “experts” and courts, as they were developed for situations where the ultimate inference will apply to a much larger population than the one studied. In the legal setting, where one has data for the entire finite population for the period under review, statistics calculated from the data are in fact population quantities. Statisticians often impose a probability model to aid in interpreting and understanding the evidentiary strength of a difference in averages or percentages. For example, in an equal employment case concerning the fairness of an employer's promotions, suppose that 2 of


15 (13.3%) eligible female employees and 12 of 23 (52.2%) eligible males were promoted during the relevant time period. There is no sampling error involved; however, as an aid to interpreting the data, statisticians often assume that the promotions are randomly chosen from the pool of eligible employees. Assuming that all other relevant factors, e.g., seniority, are balanced in both groups, the number of females among the 14 promotions follows a hypergeometric distribution, and Fisher's exact test yields a p-value of .02 (two-sided). Notice that the proportion of women promoted is about one-fourth the corresponding proportion of male employees, which is clearly meaningful. The statistical test informs the court that the data are unlikely to occur if promotions were randomly selected from the eligible pool. Thus, we infer that the gender of an eligible employee affected their chance of promotion. Courts then require the employer to justify their promotion process. Notice that the total sample size in the above situation is much smaller than in the applications discussed by Prof. Cochran. Unfortunately, courts often ignore data sets referring to the complete set of eligible employees and simply say the sample is too small. For example, the decision in the age discrimination case, Fallis v. Kerr-McGee Corp., 944 F.2d 743 (10th Cir. 1991), stated that the “sample” of 51 employees was too small.1 A related problem is that expert witnesses have convinced courts that samples of 200-400 may be needed before data comparing the pass rates of minority and majority applicants on a pre-employment exam can be used to assess whether it has a disparate impact on the minority applicants.2 Because courts have not encouraged the analysis of data pertaining to a seemingly small population, they may not fully appreciate the meaning of a simple statistical summary. The case Chappell v. Ayala,3 currently being considered by the U.S. Supreme Court, provides an illuminating example. The case concerns whether the lower courts properly considered a defendant's Batson allegation that the prosecutor discriminated against minorities by removing all seven minority members from the venire of potential jurors through peremptory challenges. Although the main legal issues concern the propriety of the trial judge excluding the defendant's lawyer from part of the proceedings where the prosecutor explained why the minorities were challenged, and the apparent loss of many questionnaires potential jurors filled out, the courts might have benefitted from a formal statistical analysis of the data. Even Judge Callahan, who dissented from the 9th Circuit's opinion granting the defendant a new trial, noted “The only indicia of possible racial bias was the fact seven of the eighteen peremptory challenges exercised by the prosecutor excused African-American and Hispanic jurors.” To properly interpret this information one needs to know the number of non-minorities who were on the venire. The majority opinion noted that in the case, each side could remove 20 members of the venire by peremptory challenge when the jury of 12 was chosen and then had six more peremptory challenges to use when the six alternates were chosen. Thus, there must have been at least seventy individuals on the venire in order for the court to end with a jury of twelve and six alternates. To maximize the proportion of

1. In the case, 3 of 9 employees over 40 were fired, in contrast to 4 of 42 employees under 40. Analyzing the data with Fisher's exact test yields a non-significant result (one-sided p-value = .095), which would support the court's decision and avoid making an “ad hoc” judgement that a sample of 51 is too small.
2. See Lopez v. Lawrence (D. Mass.) 2014 U.S. Dist. LEXIS 124139.
3. The Supreme Court granted certiorari in Chappell v. Ayala, 2014 U.S. LEXIS 7094 (U.S., Oct. 20, 2014) and will review the decision Ayala v. Wong, 756 F.3d 656 (9th Cir. 2014); 2014 U.S. App. LEXIS 3699.


minorities in the pool from which the jury of twelve was chosen, let us assume that the trial court proceeded in two stages: first selecting the jury and then the alternates. Allowing for each side to have twenty peremptory challenges, the minimum size of the panel from which the twelve jurors were chosen is 52, of whom 7 were minority. The prosecution actually removed 18 members of the panel, all seven minorities and eleven whites. Applying Fisher's exact test shows that the probability that a random sample of 18 taken from a pool of 7 minorities and 45 whites would include all 7 minorities is .00024, or about 1 in 4000.4 (A short computational sketch of this and the earlier promotion calculation appears below, after the footnotes.) This is quite a significant result, which suggests that the court should carefully examine the reasons the prosecution offers to justify its challenges when the judge compares the characteristics of the minority members excluded with the majority members who were not excluded to see whether the offered reasons were applied to all members of the venire.5 Prof. Cochran emphasized the usefulness of matching and stratification methods, and they are especially appropriate when the results of one's analysis need to be explained to a non-statistician, who can understand that the factors used in the matching/stratification process are controlled for. In other words, proper matching and stratification can reduce the analysis to examining simple means or proportions when otherwise, for instance, a less intuitive regression method would be needed to adjust for the stratification or matching variables. If the concomitant factor used in the matching process is ordinal, however, one may lose some relevant information. As an example, consider the pay data in Table 1 below from EEOC v. Shelby County Government, a case concerning whether women clerical employees in the county's criminal court were discriminated against in pay and promotion. In the opinion, the judge noted that judges are very familiar with the duties of these clerical workers and found unequal pay after considering the pay data in Table 1, stratified into four seniority levels. Although the data are so clear that a formal statistical test was not required, Gastwirth (1992) applied the Mann-Whitney form of the Wilcoxon test to the data in each stratum and combined the results using the van Elteren procedure. The result was highly significant (p < .001). One feature of the data, however, is ignored in this analysis. Notice that some men are paid more than women with noticeably more seniority. For example, D.V., a male, is paid more than the four females who have higher seniority (i.e., F.R., T.D., P.B., and P.E.), and B.W., another male who has even less seniority than D.V., is paid more than three of those females and the same as the fourth. This phenomenon held true even in 1988, five years after the charge was filed (Gastwirth, 1992).

4. If the trial court started with a panel large enough for it to select twelve jurors and six alternates, then the minimum size would be 70, and the probability that all seven minorities would be in a random sample of 18 from this larger pool would be 2.55 × 10−5, or just over one in 40,000. Unfortunately, none of the opinions reports the full data set or provides a detailed description of the original jury selection procedure.
5. In United States v. Omoruyi, 7 F.3d 880 (9th Cir. 1993), the prosecutor peremptorily challenged the two single minority females and the defendant raised a Batson claim after the second one was removed. The trial judge accepted the prosecutor's claim that he removed them because they were single. The appellate court noted that the prosecutor had not peremptorily challenged single, unmarried men in the jury panel and granted the defendant a new trial. In contrast, in Alviero v. Sam's Warehouse Club Inc., 253 F.3d 933, 940-41 (7th Cir. 2001), the court accepted the explanation of the removal of all three female members of the jury panel on the basis of their limited work experience and level of education, even though some males with similar educational backgrounds but more work experience were not challenged.
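For readers who wish to reproduce the two probability calculations above, here is a minimal sketch using scipy; the counts are exactly those given in the text, and the printed values are the approximate figures reported there.

from scipy.stats import fisher_exact, hypergeom

# Promotion example: 2 of 15 eligible females and 12 of 23 eligible males
# were promoted. Rows: females, males; columns: promoted, not promoted.
oddsratio, p = fisher_exact([[2, 13], [12, 11]], alternative="two-sided")
print(f"promotions: two-sided p = {p:.3f}")  # approximately .02, as in the text

# Jury example: a panel of 52 (7 minority, 45 white); the prosecution struck 18.
# Probability that a random set of 18 strikes contains all 7 minorities:
p_all = hypergeom.pmf(7, 52, 7, 18)  # pmf(k, M, n, N): k successes in N draws
print(f"jury: P(all 7 struck) = {p_all:.5f}")  # approximately .00024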

Table 1. Pay Data for Male and Female Clerical Employees of Shelby County Criminal Court, and Estimated Damages for Female Employees using the Peters-Belson Approach

Initials of   Gender   Hire date   Salary in 1983        Estimated damages for female
employee                           (dollars per month)   employees for 1983 (dollars per month)
F.R.          F        5/73        1474                  203.61
J.P.V.        M        9/73        1666
T.D.          F        1/74        1403                  227.50
C.H.          M        1/74        1666
P.B.          F        5/74        1403                  203.94
L.A.          M        5/74        1548
C.C.          M        5/74        1548
P.E.          F        8/74        1403                  186.27
D.V.          M        9/74        1548
T.P.          F        5/75        1112                  424.27
D.L.          F        1/76        1306                  183.15
S...          F        2/76        1336                  147.26
D.V.          M        3/76        1548
J...          F        9/76        1336                  106.04
B.W.          M        1/78        1474
..D.          F        9/78        1000                  300.69
..P.          F        10/79       1000                  224.12
J.A.          M        10/79       1157
C.D.          M        8/82        1000
P.S.          F        9/82        929                   88.99
M.D.          F        12/82       929                   71.32
V.H.          F        1/83        929                   65.43
S.C.          F        7/83        800                   159.10


The Peters-Belson (Peters, 1941; Belson, 1956) approach to regression, discussed in Cochran and Rubin (1973), was used by Gastwirth and Greenhouse (1995) to analyze this data. First, one fits a model relating the salaries of male employees to their seniority level. Then one predicts the salary a female employee would receive had she been paid according to the male equation. The differences D_i for each female estimate the shortfall (if negative) in their salary, and Z = D̄/√V(D̄), where D̄ is the average of the D_i and V(D̄) is the variance of D̄; Z is approximately normally distributed in large samples and follows a t-distribution in small ones. For the Shelby data the model was a linear regression predicting a worker's salary in 1983 from the number of months they had worked. Table 1 displays the salary, hire date and gender data that we use here; see Table 7 in Gastwirth (1992) for this data and salary data for other years. The observed average shortfall D̄ = $185.12 in the monthly salary of females has a standard error of 35.62, resulting in a two-sided p-value < .001. Another analytic approach, which does not require the assumption that the errors in the regression model follow a normal distribution and logically follows from the idea that one is imposing a probability model on the data, is to apply a permutation test. A complete permutation test would consist of swapping the gender labels and repeatedly applying the Peters-Belson approach to each relabeled data set. As there are 9 males and 14 females, there are (24 choose 9) = 1,307,504 ways to relabel gender in the Shelby data. For computational purposes we randomly selected (without replacement) 1,000 relabeled data sets and found only 3 of the D̄ across the 1,000 to be as large or larger in absolute value as the observed shortfall, yielding a two-sided p-value of 0.003. Table 1 shows that the Peters-Belson estimate of D_i for each female employee is negative, which illustrates the unfairness of the pay system examined and provides an estimate of the amount of money each woman deserves. Other uses of permutation methods are described in Sprent (1998). The Peters-Belson (PB) method is related to the use of counterfactuals in the Neyman-Rubin (Holland, 1986; Morgan and Winship, 2007) approach to causal inference if one considers the PB estimated salary for each female, obtained from the male equation, as the estimated salary of her “male counterfactual.” This predicted salary, however, may not be the salary of any of the male employees; rather it is a “statistical match” in the terminology of Peters (1944). The accuracy of the shortfall obtained from PB regression depends on the appropriateness of the model and the completeness of the information on the covariates. In the context of an “Equal Pay” case, the employer knows the relevant factors used in determining salaries and the relative weight given to each of them and should ensure that accurate information on them is obtained and retained.
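A minimal sketch of the Peters-Belson computation with a permutation test follows. The salary and seniority arrays are placeholders (hypothetical numbers, since the imperfectly reproduced table above cannot be relied on for exact replication); only numpy is assumed.

import numpy as np

def peters_belson_shortfall(months, salary, is_female):
    """Fit salary ~ months on the male group, predict female salaries from the
    male equation, and return the mean difference (actual - predicted) for
    females, i.e. the average shortfall D-bar."""
    males = ~is_female
    slope, intercept = np.polyfit(months[males], salary[males], deg=1)
    predicted = intercept + slope * months[is_female]
    return np.mean(salary[is_female] - predicted)

rng = np.random.default_rng(1)

# Placeholder data (not the Shelby figures): 9 males followed by 14 females.
months = rng.uniform(6, 130, size=23)
is_female = np.array([False] * 9 + [True] * 14)
salary = 900 + 5 * months - 180 * is_female + rng.normal(0, 40, size=23)

observed = peters_belson_shortfall(months, salary, is_female)

# Permutation test: swap gender labels at random and recompute the statistic,
# as described in the text; salaries and seniority stay attached to people.
perm = np.array([
    peters_belson_shortfall(months, salary, rng.permutation(is_female))
    for _ in range(1000)
])
p_two_sided = np.mean(np.abs(perm) >= abs(observed))
print(f"observed shortfall = {observed:.2f}, permutation p = {p_two_sided:.3f}")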

4. Summary and future thoughts

Very few publications remain highly relevant to their subject after forty years have passed. Professor Cochran's 1972 paper, and his earlier work summarized in it, are in that special category. When planning and designing a study, every investigator should review his recommendations on the need for a clear statement of the study's objectives and follow his suggestions at those stages, e.g., to carry out a pilot study. His discussion of the various methods for removing the effect of confounders remains the basis of much current research (Rosenbaum, 2002).


Our comments focused on the value of Cochran's emphasis on the need to consider the effect of differences between the study population and the population to which the inferences drawn from the study will be applied, and on the situation when the study group is the entire population. In the context of examining the complete population, especially in a legal case, Prof. Cochran's concern with the stability of the relationship, presumably over time, is less important than in the usual setting where one desires to draw inferences valid for a much larger population from a sample and to learn about the underlying mechanisms producing the response. In an equal employment case, one's focus is on what happened during the few years in which the employer used the practice (job requirement, pay decision process) under review. Indeed, quite often an employer will change policies in response to a claim, so that the earlier relationship between salary and gender or race and other covariates may well change. In practice, almost no large scale study will be “perfect”: the model will generally be a good “approximation” of the relationship of the response to the predictors, there will be errors of measurement, and a potentially relevant covariate may be omitted. Readers should be aware of the usefulness of sensitivity analysis (Rosenbaum, 2002) and, in particular, the importance of the Cornfield inequality (Cornfield et al., 1959; Gastwirth and Greenhouse, 1995) in assessing whether a possible omitted factor can explain a statistically significant difference between two groups. Briefly, Cornfield gave conditions that the strength and imbalance of the omitted variable must satisfy in order to explain an observed relative risk. The result has been used by Gastwirth (1992) to show that judges who required the party suggesting that an observed difference or relative risk was due to an omitted variable to submit an analysis including that factor were correct. In view of recent interest in the issue of reproducibility of scientific studies, basing inferences that will be applied to the target population on random samples from it has the advantage that investigators using different random samples should arrive at similar results, within sampling variation.
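A minimal sketch of the Cornfield inequality mentioned above, as we understand the classical result from Cornfield et al. (1959): for an omitted binary factor to fully account for an observed relative risk RR, its prevalence among the exposed must exceed its prevalence among the unexposed by at least a factor of RR, and the factor's own relative risk with the outcome must also be at least RR. The numbers below are invented for illustration.

def can_fully_explain(observed_rr: float,
                      prev_exposed: float,
                      prev_unexposed: float,
                      confounder_rr: float) -> bool:
    """Cornfield's necessary conditions: an omitted binary factor can fully
    account for an observed relative risk only if both its prevalence ratio
    across exposure groups and its own relative risk with the outcome are at
    least as large as the observed relative risk."""
    prevalence_ratio = prev_exposed / prev_unexposed
    return prevalence_ratio >= observed_rr and confounder_rr >= observed_rr

# Hypothetical: observed RR of 2.5; a candidate omitted factor is 30% prevalent
# among the exposed and 20% among the unexposed (ratio 1.5), with its own RR 3.
print(can_fully_explain(2.5, 0.30, 0.20, 3.0))  # False: ratio 1.5 < 2.5, so the
# factor is too weakly associated with exposure to explain away the association.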

References

Adams K.F., Schatzkin A., Harris T.B., Kipnis V., Mouw T., Ballard-Barbash R., Hollenbeck A. and Leitzmann M.F. (2006). Overweight, obesity, and mortality in a large prospective cohort of persons 50 to 71 years old. New England Journal of Medicine, 355, 763–778.

Belson W.A. (1956). A technique for studying the effects of a television broadcast. Journal of the Royal Statistical Society, Series C (Applied Statistics), 5, 195–202.

Calle, E.E., Thun, M.J., Petrelli, J.M., Rodriguez, C. and Heath, C.W. Jr. (1999). Body-mass index and mortality in a prospective cohort of U.S. adults. New England Journal of Medicine, 341, 1097–1105.

Cochran, W.G. and Rubin, D.B. (1973). Controlling bias in observational studies, a review. Sankhyā (A), 35, 417–446.

Cornfield, J., Haenszel, W., Hammond, E.C., Lilienfeld, A.M., Shimkin, M.B. and Wynder, E.L. (1959). Smoking and lung cancer: recent evidence and a discussion of some questions. Journal of the National Cancer Institute, 22, 173–203.


Durazo-Arvizu R., Cooper R.S., Luke A., Prewitt T.E., Liao Y. and McGee D.L. (1997). Relative weight and mortality in U.S. blacks and whites: findings from representative national population samples. Annals of Epidemiology, 7, 383–395.

Flegal K.M., Graubard B.I., Williamson D.F. and Gail M.H. (2005). Excess deaths associated with underweight, overweight, and obesity. Journal of the American Medical Association, 293, 1861–1867.

Gastwirth, J.L. (1992). Methods for assessing the sensitivity of statistical comparisons used in title VII cases to omitted variables. Jurimetrics, 33, 19–34.

Gastwirth, J.L. (1992). Statistical reasoning in the legal setting. American Statistician, 46, 55–69.

Gastwirth, J.L. and Greenhouse, S.W. (1995). Biostatistical concepts and methods in the legal setting. Statistics in Medicine, 14, 1641–1653.

Holland P.W. (1986). Statistics and causal inference. Journal of the American Statistical Association, 81, 945–970.

Morgan, S.L. and Winship, C. (2007). Counterfactuals and Causal Inference: Methods and Principles for Social Research. Cambridge University Press, Cambridge.

Peters C.C. (1941). A method of matching groups for experiment with no loss of population. The Journal of Educational Research, 34, 606–612.

Sprent, P. (1998). Data Driven Statistical Methods. Chapman & Hall, London.

Observational Studies 1 (2015) 182-183 Submitted 1/15; Published 8/15

The State of the Art in Causal Inference: Some Changes Since 1972

Andrew Gelman [email protected]
Department of Statistics and Department of Political Science, Columbia University, New York, NY 10027, USA

William Cochran's 1972 article on observational studies is refreshing and includes recommendations and warnings that remain relevant today. Also interesting are the ways in which Cochran's advice differs from recent textbook treatments of causal inference from observational data. Most notable, perhaps, is that Cochran talks about design, and estimation, and the general goals of a study, but says almost nothing about causality, devoting only one page out of ten to the topic. In statistical terms, Cochran spends a lot of time on the estimator (and, more generally, the procedure used to decide what estimator to use) but never defines the estimand in an observational study. He refers to bias but gives no clear sense of what exactly is being estimated (he does not, for example, define any sort of average causal effect). Modern treatments of causal inference are much more direct on this point, with the benefit of the various formal models of causal inference that were developed by Rubin and others starting in the 1970s. Scholars have pointed out the ways in which the potential-outcome formulation derives from earlier work by statisticians and economists, but Cochran's chapter reveals what was missing in this earlier era: there was only a very weak connection between substantive concerns of designs and measurement, and decisions regarding matching, weighting, and regression. In more recent years, the filling in of this gap has been an important research area for Rosenbaum and others; again, seeing Cochran's essay gives us a sense of how much needed to be done. One area that Cochran discusses in detail, and which I think could use more attention in modern textbooks (including those of my collaborators and myself), is measurement. Statistics has been described as living in the intersection of variation, comparison, and measurement, and most textbooks in statistics and econometrics tend to focus on the first two of these, taking measurement for granted. Only in psychometrics do we really see measurement getting its due. So I was happy to see Cochran discuss measurement, even if he did not get to all the relevant issues, in particular external validity, which has been the subject of much recent discussion in the context of laboratory experiments vs. field experiments vs. observational studies for social science and policy. In reading Cochran's chapter, I was struck by his apparent lack of interest in causal identification. Modern textbooks (for example, the econometrics book of Angrist and Pischke) discuss the search for natural experiments, along with the assumptions under which an observational study can yield valid causal inference, and various specific methods, such as instrumental variables and regression discontinuity, that can identify causal effects if defined


carefully enough under specified conditions. In contrast, Cochran discusses generic before-and-after designs and restricts himself to analysis strategies that do basic controlling for pre-treatment covariates by matching and regression. He is not so clear on what variables should be controlled for (which perhaps can be expected, given that he was writing before Rubin codified the concept of ignorability), and this has the practical consequence that he devotes little space to any discussion of the data-generating process. Sure, an experiment is, all else equal, better than an observational study, but we don't get much guidance on how an observational study can be closer to or further from the experimental ideal. Cochran did write, “a claim of proof of cause and effect must carry with it an explanation of the mechanism by which the effect is produced,” which could be taken as an allusion to the substantive assumptions required for causal inference from observational data, but he supplied no specifics, nothing like, for example, the exclusion restriction in instrumental variables analysis. Another topic that has appeared from time to time in the causal inference literature, notably in work by Leamer in the 1970s and in recent years by researchers such as Ioannidis, Button, and Simonsohn in medicine and psychology, is the biases resulting from the search for low p-values and the selective publication of large and surprising results. We are increasingly aware of how the “statistical significance filter” and other sorts of selection bias can distort our causal estimates in a variety of applied settings. Cochran, though, followed the standard statistical tradition of approaching studies one at a time; the terms “selection” and “meta-analysis” do not appear at all in his essay. Just to be clear: in noting this perspective, I am not suggesting that his own analyses were rife with selection bias. It is my impression that, in his work, Cochran was much more interested in improving the highest-quality research around him and was not particularly interested in criticizing the worst stuff. I get the sense, though, that, whatever things may have been like in the 1960s, in recent years selection bias has become a serious problem even in much of the most serious work in social science and medicine, and that careful analysis of individual studies is only part of the picture. Let me conclude by emphasizing that the above discussion is not intended to be exhaustive. The design and analysis of observational studies is a huge topic, and I have merely tried to point to some areas that today are considered central to causal inference but were barely noted at all by a leader in the field in 1972. Much of the research we are doing today can be viewed as a response to the challenges laid down by Cochran in his thought-provoking essay that mixes practical concerns with specific statistical techniques.

Observational Studies 1 (2015) 184-193 Submitted 7/15; Published 8/15

Comment on Cochran’s “Observational Studies”

Ben B. Hansen [email protected] Department of Statistics, University of Michigan Ann Arbor, MI 48109, USA Adam Sales [email protected] Department of Educational Psychology, University of Texas at Austin Austin, TX 78712, USA

1. Cochran’s advice

It is a tribute to Cochran’s wisdom and foresight that the issues he identified as central in 1972 remain so for the methodologists of 2015, with newer subfield journals as well as older, mainstream statistical journals steadily presenting techniques with which to ease “Judgment about Causality.” Among the newer journals, Observational Studies stands out for facilitating the study planning Cochran urged, presenting a forum for the sharing of study protocols, even those that do not require the prior endorsement of a funding panel or regulatory body. Cochran saw protocols as important for the purpose of soliciting and addressing criticism before rather than after completion of the study, and for keeping studies on track once they are underway. This journal’s Aims and Scope statement correctly adds that a study conducted according to a public protocol will generally be more transparent and persuasive.

Both Cochran and the Observational Studies founding document emphasized the utility of planning with regard to choice and measurement of outcomes, as opposed to the choice and implementation of statistical methods. However, in-advance study protocols turn out to be particularly useful for structuring and organizing the many small choices underlying the statistical analysis of almost any comparative study. It is self-evident that a detailed and publicly posted plan may inoculate a study against suspicions of its having been engineered to confirm the preconceptions of the investigators. What is more subtle is that if the study’s end verdict depends directly or indirectly on choices among statistical models – as nearly all studies do – then in a purely statistical sense the chosen model’s integrity is protected by codifying the sequence of modeling choices, with each selection guided by a test with prespecified characteristics. This is particularly so if the stepwise construction of the model proceeds in a “forward” rather than a “backward” direction, starting simple before narrowing or complicating as needed, rather than starting with something elaborate and progressively seeking to simplify it. Then the frequency properties of the overall procedure follow in a particularly simple way from those of the constituent tests, by dint of an insight of R. Berger, the “stepwise intersection-union principle” (SIUP). The term appears in an unpublished manuscript, but Berger et al. (1988) gave an application and Rosenbaum (2008, Proposition 1) noted its

© 2015 Ben B. Hansen and Adam C. Sales.


relevance to observational studies. To our knowledge its special relevance to setting up an observational study has not previously been noted. The concept itself is very simple; the remainder of this comment restates it before discussing its application in observational study designs of increasing complexity.

2. Stepwise intersection-union testing

Let the set A be totally ordered by ≺: for all a ≠ a′ in A, either a ≺ a′ or a′ ≺ a. Suppose that each a ∈ A corresponds to a null hypothesis, Ha, to be tested with size α ∈ (0, 1). For the set of hypotheses {Ha}a∈A assume:

1. There exists a level-α test of Ha for all a ∈ A

2. Either Ha is false for all a ∈ A, or there exists an ã ∈ A such that Hã is true and for all b ≺ ã, Hb is false — that is, Hã is the first true hypothesis.

Let F be a family of tests testing the hypotheses {Ha}a∈A, and let a∗ be the ≺-first element of A whose hypothesis Ha∗ is not rejected by F. Then let F∗ be a modified family of tests, also indexed by A, which rejects each hypothesis Ha with a ≺ a∗ and does not reject any Ha with a∗ ≼ a. Then the family F∗ strongly controls the family-wise error rate. That is, the SIUP states:

Proposition 1 If for each a ∈ A the corresponding test from F incorrectly rejects a true hypothesis with probability at most α, then with probability at least 1 − α, F∗ rejects only hypotheses that are false.

Proof If all hypotheses {Ha}a∈A are false, it is trivially impossible to reject a true hypothesis. Otherwise, following the second condition above, let Hã be the first true hypothesis. Let R be the event that a researcher rejects at least one true hypothesis. Under F∗, R is equivalent to rejecting Hã: since Hã is true, rejecting Hã implies R, and since F∗ can only reject additional true hypotheses after first rejecting Hã, R entails rejecting Hã. Therefore, Pr(R) = Pr(reject Hã) ≤ α.

The SIUP states that if a researcher pre-specifies a sequence of hypotheses and corresponding level-α tests, tests those hypotheses in order, and stops testing after the first non-rejected hypothesis, then the probability of incorrectly rejecting at least one correct hypothesis is at most α. As Rosenbaum (2008) pointed out, inverting a test to form a confidence interval can be thought of as an application of the SIUP. Say a researcher seeks to estimate a one-sided 95% confidence interval for the average height of a population, using data from a random sample. Assume, for the sake of this example, that the distribution of heights in the population is normal, so a t-test yields exact p-values. Then to approximate a 95% interval, she could specify a grid of possible average heights measured in inches, say a = 50, 51, 52, . . . . For each of these, she would test the hypothesis Ha that the population average height is less than or equal to a, and reject those hypotheses with p-values lower than α = 0.05. If H50 is rejected, she then tests H51; if H51 is rejected, she then tests H52, and so on. Eventually, she will test a hypothesis that cannot be rejected — say H60. She may then state, with 1 − α = 95% confidence, that the average height of the target population is


greater than 59 inches. Even though this procedure may result in many hypothesis tests, with no correction for multiplicity, the SIUP states that because the hypotheses are tested in order, the probability of rejecting a true hypothesis is at most α = 0.05.
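As a minimal illustration, a Python sketch of this testing-in-order procedure might look as follows; the function name and the simulated heights are hypothetical, and the one-sided test assumes SciPy 1.6 or later.

    import numpy as np
    from scipy import stats

    def siup_lower_bound(sample, grid, alpha=0.05):
        """Test H_a: mu <= a for a in increasing order, stopping at the
        first non-rejection. By the SIUP, the largest rejected a is a
        valid 1 - alpha lower confidence bound for mu."""
        last_rejected = None
        for a in grid:
            # One-sided t-test of H_a: mu <= a against the alternative mu > a.
            _, p = stats.ttest_1samp(sample, popmean=a, alternative="greater")
            if p >= alpha:
                break              # first non-rejection: stop testing
            last_rejected = a      # H_a rejected at level alpha; continue
        return last_rejected

    rng = np.random.default_rng(0)
    heights = rng.normal(loc=64, scale=3, size=100)  # hypothetical sample
    print(siup_lower_bound(heights, grid=range(50, 80)))

Because the scan stops at the first non-rejection, the many tests in the loop require no multiplicity correction.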

3. Regression discontinuity bandwidth selection

In a Regression Discontinuity Design (RDD), treatment Z is a deterministic function of a numeric “running variable” R: Z = 1 when R > c for some known constant c, and Z = 0 otherwise. For example, Bartlett III and McCrary (2015) (B&M) discuss the effect of a securities trading rule that applies only when the price of the security (R) meets or exceeds one dollar (c). Under this design, they estimate the effect of the rule on a number of market outcomes, such as quotes per second. RDDs are uniquely persuasive because the treatment assignment mechanism is known, and hence there is only one confounder—R—and that confounder is measured. On the other hand, matching on R is impossible, since all subjects with R on either side of c are either treated or untreated. For that reason, statistical modeling is necessary. With statistical models come modeling assumptions, and corresponding specification tests. For instance, B&M model expected outcomes as a function of price, E[Y |R = r], with a local-linear smoother. One test of this specification is a covariate placebo test: researchers typically have access to a set of covariates X, such as prior weekly returns in B&M, that could not have been affected by the treatment. A covariate placebo test estimates treatment effects for X—if the method discovers a treatment effect, something must be wrong.

Covariate placebo tests in RDDs involve two forms of multiple testing, both discussed in Sales and Hansen (2015, Section 2). First, say several covariates are available, so each subject i has a vector Xi = (Xi1, . . . , Xip) of length p. As p increases, so does the probability of falsely rejecting a null hypothesis at a fixed level α for at least one of the covariates. This is not unique to RDDs; in a general context, Hansen and Bowers (2008) suggested combining covariate placebo tests into an omnibus test. In an RDD context, Lee and Lemieux (2010) suggested combining separate RDD analyses, one for each column of X, with the omnibus p-value from seemingly unrelated regressions. B&M used a version of Hotelling’s T² to combine p = 4 covariates.

A second form of multiple testing emerges from a search for specification. For instance, to mitigate the bias caused by model misspecification in RDDs, it is common practice to estimate effects using only data in a window around the cutoff, Wb = {i : Ri ∈ (c − b, c + b)}, for some bandwidth b > 0. One method for choosing b, recommended in Sales and Hansen (2015), Cattaneo et al. (2015), and, in a slightly different context, Angrist and Rokkanen (2015), is based on sequential specification tests: for each candidate b, conduct a covariate specification test using the data in Wb. This is illustrated in Bartlett and McCrary’s (2015) Figure 1, reproduced with permission as Figure 1 here, which displays p-values issuing from covariate placebo tests in windows Wb, b = 0.05, 0.06, ..., 1.5. In principle, there are two ways to conduct this procedure: one is to start with the smallest possible bandwidth b, and expand Wb until a specification test rejects. In the B&M example, for α = 0.1, say, this would result in a very small bandwidth, b = 0.11. Alternatively, selecting forward, an analyst can begin with the largest possible b, and restrict the window


[Figure: Appendix Figure 1 of Bartlett III and McCrary (2015), “Randomization Inference p-value as a Function of Bandwidth”: permutation test p-value (vertical axis, 0 to .8) plotted against bandwidth (horizontal axis, 0 to 1.5).]

Note: For bandwidths over a grid of {0.05, 0.06, ..., 1.5}, the figure shows the randomization inference p-value associated with Hotelling’s T² and the four covariates from Figures 1, 2, 3A, and 3B.

Figure 1: The stepwise intersection-union principle applied to regression discontinuity bandwidth selection. For a sequence of candidate bandwidths b, Bartlett III and McCrary (2015) conducted covariate placebo tests using data in Wb. Figure copied with permission from Bartlett III and McCrary (2015). In practice, the details of B&M’s bandwidth selection procedure differed from what we present here.

Wb until a specification test fails to reject the specification. This would allow much larger bandwidths in the B&M dataset. The SIUP recommends the latter procedure, which strongly controls familywise error rates. Let B ⊂ (0, ∞) be the set of candidate bandwidths, and let F be a family of tests, one for each b ∈ B, testing the null hypothesis Hb that the specification holds in Wb. Then define the modified family of tests F∗ that rejects each hypothesis Hb at level α if and only if F rejects each Hb′ with b′ ≥ b. The testing-in-order principle states that, with probability at least 1 − α, F∗ rejects no true Hb.
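A minimal Python sketch of this forward scan might look as follows; it is an illustration rather than B&M’s actual procedure, the placebo test here is a simple local-linear regression of one covariate, and the function and variable names are hypothetical (statsmodels is assumed).

    import numpy as np
    import statsmodels.api as sm

    def siup_bandwidth(R, X, c, grid, alpha=0.10):
        """Scan candidate bandwidths from largest to smallest. At each b,
        run a covariate placebo test: regress the covariate X on the
        treatment indicator, the centered running variable, and their
        interaction inside the window W_b, and test the treatment term.
        Return the first (largest) b whose placebo test does not reject."""
        for b in sorted(grid, reverse=True):
            in_window = np.abs(R - c) < b
            Z = (R[in_window] > c).astype(float)  # treatment indicator
            r = R[in_window] - c                  # centered running variable
            design = sm.add_constant(np.column_stack([Z, r, Z * r]))
            fit = sm.OLS(X[in_window], design).fit()
            if fit.pvalues[1] >= alpha:           # no placebo "effect" on X
                return b                          # largest surviving bandwidth
        return None  # every candidate bandwidth was rejected

By the SIUP, because the scan stops at the first non-rejection, the familywise error rate of the whole search is no more than α.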

4. Pair matching, nearest-neighbor matching, or something in between?

Suppose that nt members of a treatment group are to be paired to members of a reservoir of nc controls on the basis of some summary of their baseline differences, such as their absolute differences along an estimated propensity score. The simplest and most intuitive matching structures emerge from pair matching without replacement, which generates precisely min(nc, nt) disjoint matched sets, each containing a single representative of either group.


At the outset of an investigation there can be no guarantee that pair matching will achieve a satisfactory reduction in baseline differences — even with the use of an optimal matching algorithm, and even with a large control reservoir. Difficult-to-match members of the treatment group may compete for the same controls; in this case, with-replacement pair matching may better remove bias due to differences between the groups at baseline, as might full matching to nc∗ controls, nc∗ < nc (Rosenbaum, 1991; Hansen and Klopfer, 2006). In the worst case, common support (Heckman et al., 1998) may fail, with a subset of the treatment group possessing combinations of characteristics that could never appear in the control group, even in indefinitely large samples. Then one might instead settle for matching only a subset of the treatment group. A suitable optimal matching algorithm can select a size-nt∗ subset of the treatment group, nt∗ < nt, along with a without-replacement match to nt∗ controls, to achieve the minimum sum of paired matching discrepancies among all possible collections of nt∗ nonoverlapping 1:1 matched pairs (Rosenbaum, 2012). While there can be no knowing in advance which option will be most appropriate, or whether either one will be necessary, a protocol has every reason to anticipate the eventuality that such a choice might be needed.

One approach to such decisions is to structure them with statistical hypothesis tests, the first pertinent null hypothesis being that treated subjects and their matched controls were drawn from the same population, so far as baseline characteristics are concerned. (Or, more precisely, that matched counterparts’ propensity scores are always equal [Rosenbaum, 2010].) The part of a decision that’s guided by a hypothesis test is relatively easy to codify in a protocol, through pre-specification of a hypothesis, a test statistic and an α level. The decision may also have additional components, particularly if the eventuality of rejection would force a choice among contrasting methods: after rejection of an ordinary pair match one might decide between full matching to a fraction of the control reservoir and optimal pair matching of an optimally chosen subset of the treatment group. This decision may be based on some diagnostic of whether the problem appears to be competition for controls or a failure of overlap. Whatever fallback procedure may be selected, it’s natural to subject the matches it generates to a test of the same baseline equivalence hypothesis, proceeding to later stages of analysis only if that hypothesis can be sustained.

This recycling of hypothesis tests leads naturally to misgivings; the repetition would seem to threaten the frequentist error rates associated with the test. But the misgivings are misplaced – this turns out to be stepwise intersection-union testing, because the later matches are constructed only after earlier matches were rejected. If at each step the decision to try another match or to stick with the current one is guided by a test with local level α, then the entire procedure has level α, in the family-wise sense. This is fortunate, because our pair matching alternatives each require the choice of a tuning parameter: for with-replacement matching, a positive integer nc∗ < min(nc, nt); for matching of an optimally chosen subset, an integer nt∗ < min(nc, nt).
Were a multiple-testing penalty to accrue from increasing the number of specification tests, then determining how fine a grid to search would itself constitute a wrenching decision. With stepwise intersection-union testing, however, there is no reason not to start with nc∗ = min(nc, nt) − 1, or with nt∗ = nt − 1, and to reduce them in increments of 1 until precisely the point at which baseline equivalence is no longer rejected.
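The control flow of such a protocol is simple enough to sketch in a few lines of Python; the candidate designs and their balance-test p-values below are placeholders that a real protocol would supply.

    def first_acceptable_design(candidates, alpha=0.15):
        """Walk a prespecified, optimistic-to-pessimistic sequence of
        matching designs, each paired with a function returning the
        p-value of a baseline-equivalence test on the resulting match.
        Adopt the first design whose balance hypothesis is not rejected;
        by the SIUP, the whole walk has family-wise level alpha."""
        for name, balance_pvalue in candidates:
            if balance_pvalue() >= alpha:
                return name      # balance sustained: stop testing here
        return None              # every anticipated design was rejected

    # Hypothetical usage; in practice each function would construct the
    # match and then test baseline equivalence on it.
    plan = [("1:3 matching", lambda: 0.01),
            ("1:2 matching", lambda: 0.04),
            ("pair matching", lambda: 0.31)]
    print(first_acceptable_design(plan))  # pair matching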


Indeed, this perspective suggests beginning the investigation of matching options at a still more optimistic position than pair matching. If nc ≫ nt and good matches are plentiful for all subjects, then 1:2 matched triples, or 1:3 matched quadruples, might be arranged with little increase in bias but significant benefit for the variances of eventual effect estimates. Full matching (Rosenbaum, 1991; Gu and Rosenbaum, 1993) modifies pair matching in both directions at once, permitting difficult-to-match treated subjects to share controls while pairing easier-to-match treatment group subjects to multiple controls, in varying configurations. Bias-variance tradeoffs arise also in full matching, however, when it is combined with structural restrictions to enhance the effective sample size (Hansen, 2004, 2011); hence there continues to be an important role for sequential intersection-union testing to play.

5. Falsification with planned fallbacks

The No Child Left Behind (NCLB) Act of 2002 aimed to have all K-12 public school pupils meet grade-level proficiency standards, in reading and math, and to do so by 2014. It coerced the 50 states to track their schools’ performance and to sanction those found to be making inadequate progress. The year 2014 has since come and gone, with the law’s lofty goal remaining to be achieved. Yet Dee and Jacob’s (2011) authoritative study of it maintains that NCLB generated meaningful gains, particularly in lower-grade mathematics.

Many states had self-imposed NCLB-type accountability measures over the 10 years preceding the law, and the basis of Dee and Jacob’s (D&J’s) analysis is a comparison of trends in those states and in the states, the treatment group, for which NCLB appreciably strengthened school accountability. Between 1992 and 2002, fourth graders in both groups of states improved on the math portion of the National Assessment of Educational Progress (NAEP), with states independently adopting accountability measures during this period starting lower in 1992 but improving more rapidly, surpassing their counterparts by 2000. These states’ improvements continued at much the same rate between 2003 and 2007, while states affected by NCLB jumped sharply after 2002, then continued to improve at a rate paralleling that of the unaffected group. Much of the paper investigates and rules out attributions of the post-2002 jump to factors other than the NCLB law.

Some of the investigations closely resembled the covariate placebo tests discussed above in § 3. D&J fit regression models parallel to their eventual analysis of NCLB’s associations with NAEP scores, but with dependent variables other than, and arguably unrelated to, measures of student achievement. Specifically, they considered whether, net of state fixed effects and time trends, the passage of NCLB covaried with: the states’ NAEP participation rates; fractions of grade-level cohorts eligible to participate in the NAEP that had previously attended full-day kindergarten, or preschool; proportions of school-age children attending public school; public school cohort demographics; or aggregate economic indicators. The results were largely consistent with hypotheses of no difference (on these measures) between states NCLB did and did not immediately affect. However, a small NCLB “effect” on median household income was noted, along with differences in the student cohorts’ racial compositions that would be difficult to attribute to chance, even accounting for multiplicity.

Procedures of this type are sometimes labeled “falsification tests,” implying that rejection of the null would invalidate some meaningful premise of the research. But few working researchers would be so wasteful as to discard a study of a policy because its introduction


was found to coincide with changes in other potentially relevant variables; the obvious remedy is to fold adjustments for those variables into the outcome analysis. D&J do just this, if somewhat apologetically: since their validation strategy relied on models of the same form being used in the eventual outcome analysis, their primary results are based on regressions of state mean NAEP scores on precisely the same independent variables as were used for the validation. That is, they didn’t adjust their primary analyses in light of the mixed result of the validity check; instead the specification adding demographic composition adjustments is presented as a supplement in an appendix. It was one “sensitivity analysis” alongside 13 others altering their analytic choices in various ways. D&J label this validation exercise a “robustness check,” which sits more comfortably with their approach than would “falsification test.”

To avoid rigid falsificationism is to show good judgment. In this instance and others, however, a modicum of Popperian orientation (Popper, 1963, ch. 1; Wong et al., 2015) can be practical as well as idealistic. Dee and Jacob’s robustness checks lead them to present 15 pairs of estimates and standard errors for each of their 4 outcomes. The results are broadly in agreement, fortunately; but what if they had not been? Would it not have been better to somewhat narrow this wide field prior to outcome analysis?

A pre-arranged sequence of robustness checks, with designated adjustments to the analytic framework corresponding to the eventuality of any particular test’s failure, would have streamlined the process. For example, at stage I, D&J might have used their preferred, more parsimonious model specification to estimate policy effects on each of the putatively unrelated variables, combining the estimates into a single test reflecting the multiplicity of the estimations. For instance, a Bonferroni adjustment with the overall size set to αf = .25 entails rejecting this stage I specification if any of the 10 estimates differs from zero at local level αl = .025. The plan should prespecify how the basic model specification is to be changed if this stage I test ends in rejection. It might, for example, say that each of the variables for which the significance criterion was met is to be adjusted for in the specifications tested at later stages, and in the specification eventually used for outcome analysis. Perhaps better yet, rather than attempting to address every shortcoming of the stage I specification at once, the fallback plan for the eventuality of stage I’s ending in rejection might specify that adjustment be made for just one of the unruly variables, or even for a variable with which no problem was detected. In light of the SIUP, one can always plan additional fallback measures if minor adjustments to the model prove insufficient. In the D&J study, the fallback to a rejection at stage I might simply have been to add adjustments for the proportions of Black, Hispanic, and free-lunch-eligible students, whether or not any of these variables were responsible for the stage I null being rejected. Stage II would then be another level-α ensemble test adding these adjustments and omitting the regressions with student demographics as dependent variables. If falsification at stage I led to demographic adjustments, then a second falsification at stage II might, for example, recommend economic indicator adjustments.
Or perhaps it would occasion more comprehensive adjustments for student demographics, with additional adjustment variables being added only if the procedure reaches stage III. Importantly, at each stage the family-wise error rate — accounting for previous stages as well as multiplicities within the current stage — is fully controlled, because of the sequential intersection-union test and the principle that later stages are reached only if each earlier stage ended in rejection.
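A sketch of this staged logic in Python (an illustration only, with made-up p-values; D&J prespecified no such plan): each stage is a Bonferroni ensemble test of its placebo regressions, and the next stage’s prespecified adjustment is triggered only by rejection at the current stage.

    def stage_rejects(pvalues, alpha_family=0.25):
        """Bonferroni ensemble test for one falsification stage: reject
        the stage's specification if any placebo p-value falls below
        the local level alpha_family / len(pvalues)."""
        local_level = alpha_family / len(pvalues)
        return any(p < local_level for p in pvalues)

    # Hypothetical stage I: ten placebo regressions under the
    # parsimonious specification (local level .25 / 10 = .025).
    stage1_pvalues = [0.40, 0.31, 0.018, 0.55, 0.62, 0.09,
                      0.77, 0.21, 0.48, 0.35]
    if stage_rejects(stage1_pvalues):
        # Prespecified fallback: add the demographic adjustments, then
        # run the stage II ensemble test on the remaining placebos.
        print("Stage I rejected: proceed to stage II with adjustments.")
    else:
        print("Stage I sustained: keep the parsimonious specification.")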


While there is nothing to prevent the protocol from detailing eventualities in which every validation variable might be factored into the outcome analysis model as a covariate, it may be advantageous to hold some of them back, even in the worst-case scenario that the last putatively unrelated variable left standing still seems to be affected by the policy. D&J included as validation variables the cohort fractions that had attended kindergarten, and preschool, in order to address the possibility that states that did not begin accountability measures before being forced to by NCLB had allocated resources to early childhood education rather than school accountability, investments which may in turn have increased their students’ later achievement scores relative to other states. Had partial associations of the NCLB law with these variables remained even after adjustment for the remaining validation variables, the research effort might have been better served by stepping back and considering a more major revision to its identification strategy — that is, by declaring the original identification strategy “falsified.” The overall probability of such a falsification occurring falsely is capped at αf, again owing to the SIUP.

6. Summary

The sequential intersection-union principle is well suited to aid the selection of quasi-experimental study designs. Observational studies require models, and hence modeling choices and assumptions, some of which can be tested. When performed without structure, such tests result in a hard-to-interpret proliferation of p-values, and in spuriously rejected model specifications. However, if the researcher maps out in advance a sequence of modeling choices and corresponding specification tests, the SIUP allows her to strongly control the type-I error rate of her process. We have argued by way of example that this process works best when researchers begin with a specification that is likely to be overly broad, or a model that is too simple, and sequentially add restrictions or complexity as broad or simple specifications are rejected. Of course, researchers are free to use even narrower or more complex specifications than the first that is not rejected. The first hypothesis whose specification test fails to reject at α forms, in a way, the boundary of a region of plausible specifications.

It is important to note that the appropriate size for a specification test often exceeds the typical α = 0.05 threshold for outcome analysis. This is because in the search for a plausible specification, the costs of mistakenly accepting an improper specification often greatly outweigh the costs of rejecting a valid specification. Rejecting a valid specification may lead to the researcher having to settle for a more restrictive, but also valid, specification, if by virtue of a type I error the sequence of tests terminates at a later position than it could have. If it occurs that the sequence of modifications anticipated in the protocol is exhausted, the investigator may still opt to proceed, presenting the findings with an acknowledgment of the possibility of design bias. Alternatively, she can go back to the drawing board, mapping out a new sequence of decisions culminating in an analysis with similar but more feasible goals.

Acknowledgments


B.H. and A.S. thank Jake Bowers, Justin McCrary and Dylan Small for helpful comments and encouragement (while retaining full responsibility for any errors or oversights). A.S. is partially supported by the Institute of Education Sciences, U.S. Department of Education, through Grant R305B1000012. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors.

References

Joshua D. Angrist and Miikka Rokkanen. Wanna get away? Regression discontinuity estimation of exam school effects away from the cutoff. Journal of the American Statistical Association, 2015. Advance online publication.

Robert P Bartlett III and Justin McCrary. Dark trading at the midpoint: Pricing rules, order flow and price discovery. Technical report, UC Berkeley School of Law, 2015.

Roger L Berger, Dennis D Boos, and Frank M Guess. Tests and confidence sets for com- paring two mean residual life functions. Biometrics, pages 103–115, 1988.

Matias D Cattaneo, Brigham R Frandsen, and Rocio Titiunik. Randomization inference in the regression discontinuity design: An application to party advantages in the US senate. Journal of Causal Inference, 3(1):1–24, 2015.

Thomas S. Dee and Brian Jacob. The impact of No Child Left Behind on student achievement. Journal of Policy Analysis and Management, 30(3):418–446, 2011.

X.S. Gu and Paul R. Rosenbaum. Comparison of multivariate matching methods: Struc- tures, distances, and algorithms. Journal of Computational and Graphical Statistics, 2 (4):405–420, 1993.

Ben B. Hansen. Full matching in an observational study of coaching for the SAT. Journal of the American Statistical Association, 99(467):609–618, September 2004.

Ben B. Hansen. Propensity score matching to extract latent experiments from nonexperi- mental data: A case study. In Neil Dorans and Sandip Sinharay, editors, Looking Back: Proceedings of a Conference in Honor of Paul W. Holland, chapter 9, pages 149–181. Springer, 2011.

Ben B. Hansen and Jake Bowers. Covariate balance in simple, stratified and clustered comparative studies. Statistical Science, 23(2):219–236, 2008.

Ben B. Hansen and Stephanie Olsen Klopfer. Optimal full matching and related designs via network flows. Journal of Computational and Graphical Statistics, 15(3):609–627, 2006. URL http://www.stat.lsa.umich.edu/%7Ebbh/hansenKlopfer2006.pdf.

James J. Heckman, Hidehiko Ichimura, and Petra E. Todd. Matching as an Econometric Evaluation Estimator. Review of Economic Studies, 65(2):261–294, 1998.

D.S. Lee and T. Lemieux. Regression discontinuity designs in economics. Journal of Eco- nomic Literature, 48:281–355, 2010.


Karl Popper. Conjectures and refutations. London: Routledge and Kegan Paul, 1963.

Paul R. Rosenbaum. A characterization of optimal designs for observational studies. Journal of the Royal Statistical Society, Series B, 53:597–610, 1991.

Paul R Rosenbaum. Testing hypotheses in order. Biometrika, 2008.

Paul R. Rosenbaum. Design of Observational Studies. Springer Verlag, 2010.

Paul R. Rosenbaum. Optimal matching of an optimally chosen subset in observational studies. Journal of Computational and Graphical Statistics, 21(1):57–71, 2012.

Paul R. Rosenbaum and Donald B. Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70:41–55, 1983.

Adam Sales and Ben B Hansen. Limitless regression discontinuity. arXiv preprint arXiv:1403.5478, 2015.

Manyee Wong, Thomas D. Cook, and Peter M. Steiner. Adding design elements to improve time series designs: No Child Left Behind as an example of causal pattern-matching. Journal of research on educational effectiveness, 8(2):245–279, 2015.

Observational Studies 1 (2015) 194-195 Submitted 6/15; Published 8/15

A good deal of humility: Cochran on observational studies

Miguel A. Hernán [email protected] Department of Epidemiology and Department of Biostatistics Harvard T.H. Chan School of Public Health Boston, MA 02115, USA

Cochran’s commentary shows that much of our conceptual framework for observational studies was already in place in the early 1970s. It also illuminates the technical progress that has been achieved since then, and identifies crucial methodologic challenges that still remain and that may remain forever.

In his commentary Cochran classifies observational studies into two groups. The first group – “analytical surveys” in Cochran’s terminology – investigates the “relation between variables of interest” in a sample of a target population. In these studies the goal is prediction, not causality. The second group studies the causal effects of “agents, procedures, or experiences.” Cochran’s commentary is almost exclusively concerned with this second group of observational studies, whose goal is causal inference about comparative effects. Cochran views these observational studies as attempts to answer causal questions in settings “in which we would like to do an experiment but cannot” because it is impractical, unethical, or untimely. The agents being compared in observational studies “are like those the statistician would call treatments in a controlled experiment.” Cochran effectively argues that observational studies for comparative effects can be viewed as attempts to emulate randomized experiments. Cochran, the Harvard statistician, was not alone. Other prominent researchers like Feinstein, the Yale epidemiologist, espoused similar views. The concept of observational studies as an attempt to emulate randomized experiments was central for the next generation of statisticians and epidemiologists.

Cochran argues that, as in randomized experiments, a prerequisite for causal inference from observational data is the statement of objectives or the “description of the quantities to be estimated.” Rubin, Cochran’s former student and future Chair of Statistics at Harvard, championed the use of counterfactual notation to unambiguously express these quantities as contrasts involving potential outcomes. A decade later Robins, also at Harvard, generalized counterfactual theory to time-varying treatments, a generalization that extends the concept of trial emulation to settings in which treatment strategies are sustained over time. These formalizations had profound effects on the field of causal inference from observational data.

A practical consequence of Cochran’s viewpoint is that observational studies can benefit from the basic principles that guide the design and analysis of randomized experiments. His commentary reminds us that causal analyses of observational data, like those of randomized trials, need to specify the “sample and target populations.” When discussing the “comparative structure”, he reminds us that studies with a control group, whether they are observational or randomized, are generally preferred to those without a control group: “Single group studies are so weak logically that they should be avoided whenever possible”. And he identifies the defining problem of causal inference from observational data when

© 2015 Miguel A. Hernán.

two or more groups are compared: “How do we ensure that the groups are comparable?” This is the fundamental problem of confounding or, in Cochran’s terminology, “bias due to extraneous variables”.

Cochran classifies methods for confounding adjustment into three classes: analysis of covariance, matching, and standardization. In the decades after Cochran’s commentary, each of these methods was deeply transformed from a simple technique that could only be used under serious constraints (few covariates, linear models...) into a powerful analytical tool with few restrictions. The analysis of covariance morphed into sophisticated outcome regression models that can easily handle complexities such as repeated measures, random effects, flexible dose-response functions, and failure time data. Matching was further developed for high-dimensional applications and can incorporate propensity scores (co-developed by Rubin). Standardization in its two modern forms – the parametric g-formula and inverse probability weighting of marginal structural models (by Robins) – can now be applied to complex longitudinal data with multiple time points and covariates. In addition to these three classes of methods for confounding adjustment, a fourth class emerged in the early 1990s: g-estimation (Robins again). In many settings, the above methods can be made doubly robust, another technical development that arose at the turn of the century. Finally, a whole suite of econometric methods like instrumental variable estimation are being progressively embraced by statisticians and epidemiologists interested in causal inference.

All these technical and conceptual developments, however, do not alter Cochran’s take-home message: causal inference from observational data “demands a good deal of humility.” Fancy techniques for confounding adjustment will not protect us from bias if the confounders are unknown or if key variables of the analysis are mismeasured. Cochran reminds us that those who aspire to make causal inferences from observational data “must cultivate an ability to judge and weigh the relative importance of different factors whose effects cannot be measured at all accurately.” Because human judgment and subject-matter knowledge are fallible, causal inference from observational data is also fallible in ways that causal inference from ideal randomized experiments is not. A fascinating question is how much machine learning algorithms will be able to replace subject-matter knowledge in the years to come. For the time being, however, expert knowledge continues to be as paramount for the design and analysis of studies based on observational data as it was in Cochran’s time.

Observational Studies 1 (2015) 196-199 Submitted 6/15; Published 8/15

Lessons we are still learning

Jennifer L. Hill [email protected] Department of Humanities and Social Sciences New York University New York, NY 10003, USA

I thoroughly enjoyed re-reading Cochran’s commentary on observational studies. In particular, Cochran captured my feelings towards the topic of observational studies quite aptly in his final sentence, “observational studies are an interesting and challenging field which demands a good deal of humility, since we can claim only to be groping toward the truth.” Scientific inquiry would benefit from greater humility among researchers pursuing causal answers today. In this comment I will briefly highlight some of the knowledge that has been gained about causal inference since that time (apologies in advance for not referencing all the scholars who have contributed – there are too many people to do it equitably). I will then focus on what I feel has been lost in the past decades and point out what I see to be important directions for the future.

1. Causal inference without randomized experiments

Causal inference typically requires satisfying both structural and parametric assumptions. Randomized experiments have the advantage of addressing both of these types of assumptions. The most problematic structural assumption, ignorability (also referred to as all confounders measured, selection on observables, the conditional independence assumption, exchangeability, etc.), is trivially satisfied in a pristine randomized experiment. Randomized experiments also ensure common support across treatment and control groups. Randomized experiments have the added advantage that they do not require conditioning on confounders for unbiased estimation, thus eliminating dependence on parametric assumptions. Moreover, even if we use a model to estimate treatment effects in this setting (for instance with the goal of increasing efficiency), it is likely that our estimates will be robust to violations of the parametric assumptions of the model.

Of course in practice, noncompliance, missing data, measurement issues and other complications can still wreak havoc with treatment effect estimation even in the context of a randomized experiment. Even more problematic, randomized experiments are often not possible due to ethical, financial, or logistical reasons. In the absence of a randomized experiment (or natural experiment) the structural assumptions required to identify a causal effect become more heroic, requiring appropriate conditioning on confounding covariates. Unfortunately our dependence on the parametric assumptions grows as well, since we now must appropriately estimate expectations conditional on the set of proposed confounding covariates.

© 2015 Jennifer Hill.

Much of the work in causal inference methodology in the decades since Cochran’s paper was first published has focused on relaxation of parametric assumptions. Cochran discusses matching, subclassification, and covariance adjustment. Since that time, however, the use of propensity scores for matching and for inverse probability weighting has yielded improvements in our ability to estimate causal effects with less bias, owing to a reduced reliance on parametric assumptions. More recently, more sophisticated matching methods that capitalize on advances in computer science have increased our ability to find good balance targeted to particular balance criteria without undue investment of researcher time. Along another vein, it has been proposed that flexible modeling of the response surface using Bayesian nonparametrics, along with appropriate checks for overlap, may largely obviate the need for such preprocessing methods. All in all, our ability to condition on potential confounders without making extreme parametric assumptions has increased greatly in the past few decades.

Other groundbreaking work has been done since Cochran’s paper on causal inference in longitudinal settings, on the role of double robustness in estimation, on mediation, and on approaches to SUTVA violations. Moreover our awareness about, and ability to exploit, rigorous quasi-experimental designs has grown far stronger. Most of these advances have been facilitated by the creation of a shared formal language for describing both causal estimands and assumptions. Actually, two “languages” currently hold sway: the potential outcome framework, and directed acyclic graphs and their extensions. I will not advocate for one or another of these frameworks but rather suggest that to the extent that researchers interested in pursuing causal estimation understand these languages, we can learn from each other more easily and advance science with greater efficiency and clarity.

2. A broader perspective

One of the most refreshing aspects of the Cochran article is his concern with many different parts of the research process, including: the framing of the research question (how often do we formally talk about that?), study design (too often statisticians are not involved!), measurement (we largely ignore it... that’s for the psychometricians!), non-response (we sometimes address...), power (boring!), and generalizability (there is renewed interest in this topic). While all of these topics are still actively pursued as research, it is rare to see a study that does due diligence to all of these concerns. Perhaps this is due to the pressure in academia to be specialists, and the lack of reward for working in large research teams (in many, though not all, fields). Few academics rigorously address more than two or three of these concerns in their work. Moreover, these issues are typically addressed in separate papers, rather than in ways that grapple with the complexity of the relationships between them. And in applied papers we typically pick our battles. How often do we apply a new method in an applied problem that might eliminate a small percentage of the bias, only to ignore measurement error or missing data issues that might be causing bias that would easily swamp this gain? If we address only some of the statistical issues, what can we expect to achieve overall?

I advocate that we could increase our impact by focusing on being more broad than deep. Rather than trying to act as surgeons working in a specialized field, we would reach


more people by thinking of ourselves as army surgeons at the front trying to extract some meaning from the messy and imperfect world of empirical science.

3. Beyond “correlation is not causation”

Understanding what situations allow for causal inference is non-trivial. This challenge is complicated by the fact that most people intuitively assign causal attribution without thinking too hard about it; it feels natural. Psychological experiments have demonstrated time and again, however, that humans are easily misled into drawing causal conclusions even when none are warranted. Yet understanding the answers to causal questions is critical for making progress in science, assessing policy implications, and even trying to better understand the implications of our actions in our individual lives.

While we have made advances in creating a broader understanding of causal issues, it is mostly captured by the phrase “correlation is not causation”. This is useful and has led to more broadly visible manifestations of this advice. For instance, science writer Gary Taubes wrote a helpful (if slightly imperfect) New York Times Magazine article about the limitations of observational studies (Taubes, 2007). As a sillier example, Tyler Vigen has a website http://www.tylervigen.com/spurious-correlations and a book (Vigen, 2015) that highlight amusing real-world examples of correlations that quite clearly do not reflect causal relationships (for instance, the 94% correlation between per capita cheese consumption and the number of people who died by becoming tangled in their bedsheets).

This is helpful, but we need to do much more. Too many scientific journals promote the practice of using the word association (rather than effect) as a panacea. Consequently, authors dutifully describe relationships between variables as associations (typically not even as conditional associations with a very particular conditioning set!) and then proceed in the discussion section of the paper to make recommendations for policy or practice (clearly interpreting their own results causally). That is hardly less damaging than just using the word cause throughout. Instead, we need to encourage a culture of transparency about causal claims. After all, almost all interesting scientific questions are causal in nature. We’re all trying to do it. So let’s be honest about it, be clear about our assumptions, and explore how far we might stray from those assumptions and what the implications of those violations might be.

In terms of education, we need to move beyond the catch-all “correlation is not causation” admonition and help create a deeper understanding among the broader populace of what it means to make a causal claim and what kinds of research strongly support such claims. That would mean starting to teach about counterfactuals and a wider range of designs even in introductory statistics courses (I would like some of these concepts to be taught in grade school). It would also mean totally rethinking the way we typically teach regression.

4. Embrace your inner scold

Returning to Cochran’s final sentence: humility may be one of the least exhibited traits among academics today. We see a fair amount of hubris with regard to causality, both within the academy and in industries that rely heavily on “data science”. For instance, in a


now-infamous article in Wired magazine, Chris Anderson wrote “There is now a better way. Petabytes allow us to say: ‘Correlation is enough.’ We can stop looking for models. We can analyze the data without hypotheses about what it might show.” (Anderson, 2008). These types of overstatements may seem amusing and nonthreatening, but this type of hype about “big data” or new technologies could set us back decades.

Unfortunately, statisticians are often seen as scolds or Chicken Littles. They are less apt to say “Yes you can!” and more apt to say “Oh my goodness you can’t do that!!!” This does not always make us popular either as cocktail party guests or as collaborators. I don’t suggest that we entirely give up this role – it’s too important and nobody else seems prepared to take it on. However, if we embrace the role of scold, we also need to point out what can be done to fix the problems. A focus on better designs (bring statisticians in to a study from day one!) and on sensitivity analyses that explore how far from the mark our estimates may be if our assumptions are violated is a start to this process.

References

Anderson, C. (2008). The End of Theory: The Data Deluge Makes the Scientific Method Obsolete. Wired Magazine.

Taubes, G. (September 17, 2007). Do We Really Know What Makes Us Healthy? New York Times Magazine.

Vigen, T. (2015). Spurious Correlations. Hachette Books, New York.

Observational Studies 1 (2015) 200-204 Submitted 7/15; Published 8/15

Causal Thinking in the Twilight Zone

Judea Pearl [email protected] Computer Science Department University of California, Los Angeles Los Angeles, CA 90095, USA

To students of causality, the writings of William Cochran provide an excellent and intriguing vantage point for studying how statistics, lacking the necessary mathematical tools, managed nevertheless to cope with increasing demands for policy evaluation from observational studies. Cochran met this challenge in the years 1955-1980, when statistics was preparing for a profound, albeit tortuous, transition from a science of data to a science of data generating processes. The former, governed by Fisher’s dictum (Fisher, 1922) that “the object of statistical methods is the reduction of data,” was served well by the traditional language of probability theory. The latter, on the other hand, seeking causal effects and policy recommendations, required an extension of probability theory to facilitate mathematical representations of generating processes.

No such representation was allowed into respectable statistical circles in the 1950-60s, when Cochran started looking into the social effects of public housing in Baltimore. While data showed improvement in the health and well-being of families that moved from slums to public housing, it soon became obvious that the estimated improvement was strongly biased; Cochran reasoned that, in order to become eligible for public housing, the parents of a family may have had to possess both initiative and some determination in dealing with the bureaucracy, thus making their families more likely to obtain better healthcare than non-eligible families.¹ This led him to suggest “adjustment for covariates” for the explicit purpose of reducing this bias in the estimated causal effect. While there were others before Cochran who applied adjustment for various purposes, Cochran is credited with introducing this technique to statistics (Salsburg, 2002), primarily because he popularized the method and taxonomized it by purpose of usage.

Unlike most of his contemporaries, who considered cause-effect relationships “ill-defined” outside the confines of Fisherian experiments, Cochran had no qualms admitting that he sought such relationships in observational studies. He in fact went as far as defining the objective of an observational study: “to elucidate cause-and-effect relationships” in situations where controlled experiments are infeasible (Cochran, 1965). Indeed, in the paper before us, the word “cause” is used fairly freely, and other causal terms such as “effect,” “influence,” and “explanation” are almost as frequent as “regression” or “variance.” Still, Cochran was well aware that he was dealing with uncharted extra-statistical territory and cautioned us:

“Claim of proof of cause and effect must carry with it an explanation of the mechanism by which this effect is produced.”

1. Narrated in Cochran (1983, p. 24)

© 2015 Judea Pearl.

Today, when an analyst declares that a claim depends on “the mechanism by which an effect is produced,” we expect the analyst to specify what features of the mechanism would make the claim valid. For example, when Rosenbaum and Rubin (1983) claimed that propensity score methods may lead to unbiased estimates of causal effects, they conditioned the claim on a counterfactual assumption named “strong ignorability.” Such identifying assumptions, though cognitively formidable, provided a formal instrument for proving that some adjustments can yield unbiased estimates. Similarly, when a structural analyst claims that an “indirect effect” is estimable from observational studies, the claim must follow from assumptions about the structure of the underlying graph which, again, assure us of zero-bias estimates (see Pearl, 2014b).

Things were quite different in Cochran’s era; an appeal to “a mechanism,” like an appeal to “subject matter information,” stood literally for a confession of helplessness, since “mechanisms” and causal relationships had no representation in statistics. Structural equation models (SEM), the language used by economists to represent mechanisms, were deeply mistrusted by statisticians, who could not bring themselves to distinguish structural from regression models (Guttman, 1977; Freedman, 1987; Cliff, 1983; Wermuth, 1992; Holland, 1995).² Counterfactuals, on the other hand, were still in the embryonic state that Neyman left them in – symbols with no model, no formal connection to realizable variables, and no inferential machinery with which to support or refute claims.³ Fisher’s celebrated advice, “make your theories elaborate,” was no help in this transitional era of pre-formal causation; there is no way to elaborate on a theory that cannot be represented in some language. It is not surprising, therefore, that Cochran’s conclusions are quite gloomy:

“It is well known that evidence of a relationship between x and y is no proof that x causes y. The scientific philosophers to whom we might turn for expert guidance on this tricky issue are a disappointment. Almost unanimously and with evident delight they throw the idea of cause and effect overboard. As the statistical study of relationships has become more sophisticated, the statistician might admit, however, that his point of view is not very different, even if he wishes to retain the terms cause and effect.”

It is likewise not surprising that, in the present article, Cochran does not offer readers any advice on which covariates are likely to reduce bias and which would amplify bias. Any such advice, as we know today, requires a picture of reality, which Cochran understood to be both needed and lacking in his time.⁴ On the positive side, though, he did have the vision to anticipate the emergence of a new type of research paradigm within statistics, a paradigm centered on mechanisms:

“A claim of proof of cause and effect must carry with it an explanation of the mechanism by which the effect is produced. Except in cases where the

2. This mistrust persists to some degree even in our century; see Berk (2004) or Sobel (2008).

3. These had to wait for Rubin (1974), Robins (1986), and the structural semantics of Balke and Pearl (1994).

4. To the best of my knowledge, the only adjustment-related advice in the entire statistics literature prior to 1980 was Cox’s warning that “the concomitant observations be quite unaffected by the treatments” (Cox, 1958, p. 48); it was the first defiance of an unwritten taboo against the use of data-generating models.


mechanism is obvious and undisputed, this may require a completely different type of research from the observational study that is being summarized.”

I believe the type of research we see flourishing today, based on a symbiosis between the graphical and counterfactual languages (Morgan and Winship, 2014; VanderWeele, 2015; Bareinboim and Pearl, 2015), would perfectly meet Cochran’s vision of a “completely different type of research.” This research differs fundamentally from the type of research conducted in Cochran’s generation. First, it commences with a commitment to understanding what reality must be like for a statistical routine to succeed and, second, it represents reality in terms of data-generating models (read: “mechanisms”), rather than probability distributions.

Encoded as nonparametric structural equations, these models have led to a fruitful symbiosis between graphs and counterfactuals and have unified the potential outcome framework of Neyman, Rubin, and Robins with the econometric tradition of Haavelmo, Marschak, and Heckman. In this symbiosis, counterfactuals (potential outcomes) emerge as natural byproducts of structural equations and serve to formally articulate research questions of interest. Graphical models, on the other hand, are used to encode scientific assumptions in a qualitative (i.e., nonparametric) and transparent language and to identify the logical ramifications of these assumptions, in particular their testable implications.⁵

A summary of results emerging from this symbiotic methodology is given in Pearl (2014a) and includes complete solutions⁶ to several long-standing problem areas, ranging from policy evaluation (Tian and Shpitser, 2010) and selection bias (Bareinboim, Tian and Pearl, 2014) to external validity (Bareinboim and Pearl, 2015; Pearl and Bareinboim, 2014) and missing data (Mohan, Pearl and Tian, 2013).

This development has not met with universal acceptance. Cox and Wermuth (2015), for example, are still reluctant to endorse the tools that this symbiosis has spawned, questioning in essence whether interventions can ever be mathematized.⁷ Others regard the symbiosis as unscientific (Rubin, 2008) or less than helpful (Imbens and Rubin, 2015, p. 22), insisting for example that investigators should handle ignorability judgments by unaided intuition. I strongly believe, however, and I say it with a deep sense of responsibility, that future explorations of observational studies will rise above these inertial barriers and take full advantage of the tools that the graphical-counterfactual symbiosis now offers.

2. This mistrust persists to some degree even in our century; see Berk (2004) or Sobel (2008).
3. These had to wait for Rubin (1974), Robins (1986), and the structural semantics of Balke and Pearl (1994).
4. To the best of my knowledge, the only adjustment-related advice in the entire statistics literature prior to 1980 was Cox’s warning that “the concomitant observations be quite unaffected by the treatments” (Cox, 1958, p. 48); it was the first defiance of an unwritten taboo against the use of data-generating models.
5. Note that the potential outcome framework alone does not meet these qualifications. Scientific assumptions must be converted to conditional ignorability statements (Rosenbaum and Rubin, 1983; Imbens and Rubin, 2015) which, being cognitively formidable, escape the scrutiny of plausibility judgment and impede the search for their testable implications.
6. By “complete solution” I mean a method of producing consistent estimates of (causal) parameters of interest, applicable to any hypothesized model, and accompanied by a proof that no other method can do better except by strengthening the model assumptions.
7. Unwittingly, the very calculus that they reject happens to resolve the problem that they pose (“indirect confounding”) in just four steps (Pearl, 2015a; Pearl, 2015b).

References

Bareinboim, E. and Pearl, J. (2015). Causal inference from big data: Theoretical foundations and the data-fusion problem. Tech. Rep. R-450, http://ftp.cs.ucla.edu/pub/statser/r450.pdf, Department of Computer Science, University of California, Los Angeles, CA. Forthcoming, Proceedings of the National Academy of Sciences.
Bareinboim, E., Tian, J. and Pearl, J. (2014). Recovering from selection bias in causal and statistical inference. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence (C. E. Brodley and P. Stone, eds.). AAAI Press, Palo Alto, CA. Best Paper Award, http://ftp.cs.ucla.edu/pub/statser/r425.pdf.
Berk, R. (2004). Regression Analysis: A Constructive Critique. Sage, Thousand Oaks, CA.
Cliff, N. (1983). Some cautions concerning the application of causal modeling methods. Multivariate Behavioral Research, 18, 115-126.
Cochran, W. (1965). The planning of observational studies of human populations. Journal of the Royal Statistical Society, Series A, 128, 234-255.
Cochran, W. G. (1983). Planning and Analysis of Observational Studies. Wiley, New York.
Cox, D. (1958). The Planning of Experiments. John Wiley and Sons, NY.
Cox, D. and Wermuth, N. (2015). Design and interpretation of studies: Relevant concepts from the past and some extensions. Observational Studies, 1, 165-170.
Fisher, R. A. (1922). On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London, Series A, 222, 309-368.
Freedman, D. (1987). As others see us: A case study in path analysis (with discussion). Journal of Educational Statistics, 12, 101-223.
Guttman, L. (1977). What is not what in statistics. The Statistician, 26, 81-107.
Holland, P. (1995). Some reflections on Freedman’s critiques. Foundations of Science, 1, 50-57. URL http://arxiv.org/pdf/1505.02452v1.pdf
Imbens, G. W. and Rubin, D. B. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press, New York.
Mohan, K., Pearl, J. and Tian, J. (2013). Graphical models for inference with missing data. In Advances in Neural Information Processing Systems 26 (C. Burges, L. Bottou, M. Welling, Z. Ghahramani and K. Weinberger, eds.). Curran Associates, Inc., 1277-1285. http://papers.nips.cc/paper/4899-graphical-models-for-inference-with-missing-data.pdf
Morgan, S. L. and Winship, C. (2014). Counterfactuals and Causal Inference: Methods and Principles for Social Research (Analytical Methods for Social Research). 2nd ed. Cambridge University Press, New York.
Pearl, J. (2014a). The deductive approach to causal inference. Journal of Causal Inference, 2, 115-129.
Pearl, J. (2014b). Interpretation and identification of causal mediation. Psychological Methods, 19, 459-481.
Pearl, J. (2015a). Indirect confounding and causal calculus (on three papers by Cox and Wermuth). Blog entry: http://www.mii.ucla.edu/causality/.
Pearl, J. (2015b). Indirect confounding and causal calculus (on three papers by Cox and Wermuth). Tech. Rep. R-457, http://ftp.cs.ucla.edu/pub/statser/r457.pdf, Department of Computer Science, University of California, Los Angeles, CA.
Pearl, J. and Bareinboim, E. (2014). External validity: From do-calculus to transportability across populations. Statistical Science, 29, 579-595.
Robins, J. (1986). A new approach to causal inference in mortality studies with a sustained exposure period – application to control of the healthy worker survivor effect. Mathematical Modelling, 7, 1393-1512.


Rosenbaum, P. and Rubin, D. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70, 41-55.
Rubin, D. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66, 688-701.
Rubin, D. (2008). Author’s reply (to Ian Shrier’s Letter to the Editor). Statistics in Medicine, 27, 2741-2742.
Salsburg, D. (2002). The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century. Henry Holt and Company, LLC, New York.
Sobel, M. (2008). Identification of causal parameters in randomized studies with mediating variables. Journal of Educational and Behavioral Statistics, 33, 230-231.
Tian, J. and Shpitser, I. (2010). On identifying causal effects. In Heuristics, Probability and Causality: A Tribute to Judea Pearl (R. Dechter, H. Geffner and J. Halpern, eds.). College Publications, UK, 415-444.
VanderWeele, T. (2015). Explanation in Causal Inference: Methods for Mediation and Interaction. Oxford University Press, New York.
Wermuth, N. (1992). On block-recursive regression equations (with discussion). Brazilian Journal of Probability and Statistics, 6, 1-56.

Observational Studies 1 (2015) 205-211 Submitted 5/15; Published 8/15

Cochran’s Causal Crossword

Paul R. Rosenbaum [email protected] Department of Statistics, Wharton School University of Pennsylvania Philadelphia, PA 19104-6340 US

Abstract

In discussing the “step from association to causation,” Cochran described a certain “multi-phasic attack” as “one of the most potent weapons in observational studies.” This method emphasized assembling several weak strands of evidence that become stronger through mutual support by virtue of intersecting in appropriate ways.

Keywords: Differential effects; elaborate theories; evidence factors; intersecting strands of evidence; quasi-experiments.

1. Introduction

Cochran’s “Observational Studies” is the less familiar, less readily accessible member of a pair of papers in which Cochran (1965, 1972) outlined the general structure of observational studies as a type of statistical investigation. Studies of this type were not new in 1965, nor was the attempt to think systematically about them (e.g., Campbell and Stanley 1963, Hill 1965), but Cochran was the first person to define the subject abstractly, that is, as a subject applicable to and informed by many academic disciplines. It is fitting that Dylan Small’s new interdisciplinary journal Observational Studies makes Cochran’s (1972) paper easily accessible once again.

Cochran’s papers have many interesting aspects, but I will focus on just one aspect that appears in different forms near the end of both papers. The final sections of Cochran (1965, 1972) are entitled “The step from association to causation,” and “Judgment about causality.” These sections make several distinct and useful points, but I would like to focus on one of these, first in §2 by quoting what Cochran says, then in §3 by adding some interpretation.

2. What Cochran says

In discussing judgments about causality, in going beyond modelling empirical associations to reach causal conclusions, Cochran (1965, 1972) speaks again and again about “many different consequences,” “variety of consequences,” the “mechanism by which the effect is produced,” and “completely different type[s] of research.” How can many individually weak strands of evidence combine to become strong evidence by considering many different and varied consequences of a causal mechanism? In one of the more often quoted remarks about causal inference, Cochran (1965, p. 252) wrote:


First, as regards planning. About 20 years ago, when asked in a meeting what can be done in observational studies to clarify the step from association to causation, Sir Ronald Fisher replied: “Make your theories elaborate.” The reply puzzled me at first, since by Occam’s razor the advice usually given is to make theories as simple as is consistent with the known data. What Sir Ronald meant, as the subsequent discussion showed, was that when constructing a causal hypothesis one should envisage as many different consequences of its truth as possible, and plan observational studies to discover whether each of these consequences is found to hold.

After presenting a few illustrations of “many different consequences,” Cochran (1965, p. 252) continues:

Of course, the number and variety of consequences depends on the nature of the causal hypothesis, but imaginative thinking will sometimes reveal consequences that were not at first realized, and this multi-phasic attack is one of the most potent weapons in observational studies. In particular, the task of deciding between alternative hypotheses is made easier, since they may agree in predicting some consequences but will differ in others.

The second paper repeats similar points in different words and adds (Cochran 1972, p. 89):

A claim of proof of cause and effect must carry with it an explanation of the mechanism by which the effect is produced. Except in cases where the mechanism is obvious and undisputed, this may require a completely different type of research from the observational study that is being summarized.

3. Analogies and methods

3.1 A limited analogy: The cable of many slender fibers

That individually weak strands of evidence may combine to form strong evidence was most famously suggested by Charles Sanders Peirce (1868):

[We should] trust rather to the multitude and variety of . . . arguments than to the conclusiveness of any one. [Our] reasoning should not form a chain which is no stronger than its weakest link, but a cable whose fibers may be ever so slender, provided they are sufficiently numerous and intimately connected.

Although memorable, the cable analogy distorts in a key respect: the many fibers of a cable play identical roles in forming a strong cable, but strands of evidence must exhibit variety, must speak to different consequences of a theory, because, as Cochran says, “deciding between alternative hypotheses is made easier, since they may agree in predicting some consequences but will differ in others.” There is a better analogy.


3.2 A better analogy: a crossword puzzle

“Generalization naturally starts from the simplest, the most transparent particular case,” Georg Polya (1968, p. 60) wrote in discussing heuristic reasoning in mathematics. This simplest, most transparent case is not the important or general case that led us to be concerned with the topic in question; rather, it is the least cluttered, most immediately accessible and surveyable case, the example that perfectly exemplifies one issue in isolation from unneeded complications. Susan Haack (1995) suggests that the simplest, most transparent case of weak strands of evidence becoming stronger by virtue of mutual support is the case of a crossword puzzle. She writes (1995, pp. 81-82):

The model is not . . . how one determines the soundness or otherwise of a mathematical proof; it is, rather, how one determines the reasonableness or otherwise of entries in a crossword puzzle. . . . [T]he crossword model permits pervasive mutual support, rather than, like the model of a mathematical proof, encouraging an essentially one-directional conception. . . . How reasonable one’s confidence is that a certain entry in a crossword is correct depends on: how much support is given to this entry by the clue and any intersecting entries that have already been filled in; how reasonable, independently of the entry in question, one’s confidence is that those other already filled-in entries are correct; and how many of the intersecting entries have been filled in.

Haack is making two points here, the obvious one being that much of the conviction we develop that a crossword puzzle is filled in correctly comes not from the individual clues, but from entries intersecting in appropriate ways. When we first pencil in an entry based on a clue, we may doubt that it is correct, but later, when other entries meet it in an appropriate way, we may be nearly certain it is correct, even though the direct evidence from the clue remains unconvincing on its own.

It is important to recognize that, beyond this obvious point, there is a second point. Haack’s second, subtle, point relates to her phrase above: “independently of the entry in question.” She is concerned to exhibit mutual support without vicious circularity. If I can deduce B from assuming A, and if I can deduce A from assuming B, then the assertion of A-and-B based on these two deductions would be a logical error — vicious circularity — because both deductions are perfectly compatible with both A and B being false. In the crossword, two entries may meet appropriately yet both be incorrect entries. Haack is saying that B provides support for A only to the extent that we are confident about B without employing the support provided by its intersection with A, and A provides support to B only to the extent that we are confident about A without employing the support provided by its intersection with B; but, with this caveat, A and B may each support the other. The appropriate intersection of A and B provides support for both A and B, but we may reflect upon the evidence for B that does not derive from its appropriate intersection with A, and Haack refers to this as the “independent security” of B. Haack (1995, pp. 84-86) continues:

The idea of independent security is easiest to grasp in the context of the crossword analogy . . . How reasonable one’s confidence is that 4 across is correct depends, inter alia, on how reasonable one’s confidence is that 2 down is correct. True, how reasonable one’s confidence is that 2 down is correct in turn


depends, inter alia, on how reasonable one’s confidence is that 4 across is correct. But in judging how reasonable one’s confidence is that 4 across is correct one need not, for fear of getting into a vicious circle, ignore the support given it by 2 down; it is enough that one judge how reasonable one’s confidence is that 2 down is correct leaving aside the support given it by 4 across.

In a crossword puzzle, entries need not intersect to provide mutual support. If 2 down meets both 4 across and 6 across, then an entry in 6 across may support the entry in 2 down, and the entry in 2 down may support the entry in 4 across, so the entry in 6 across supports the entry in 4 across even though 6 across and 4 across do not intersect.

Consider the same ideas in a biological context. A high level of exposure to a toxin, such as cigarette smoke, is associated with a particular disease, say a particular cancer, in a human population, where experimentation with toxins is not ethical. A controlled randomized experiment shows that deliberate exposure to the toxin causes this same cancer in laboratory animals. A DNA-adduct is a chemical derived from the toxin that is covalently bound to DNA, perhaps disrupting or distorting DNA transcription. Exposure to the toxin is observed to be associated with DNA-adducts in lymphocytes in humans; e.g., Phillips (2002). A further controlled experiment shows that the toxin causes these DNA-adducts in cell cultures. A case-control study finds that cases of this cancer have high levels of these DNA-adducts, whereas noncases (so-called “controls”) have low levels. A pathology study finds these DNA-adducts in tumors of the particular cancer under study, but not in most tumors from other types of cancer. Certain genes are involved in repairing DNA, for instance, in removing DNA-adducts; see Goode et al. (2002). In human populations, a rare genetic variant has a reduced ability to repair DNA, in particular a reduced ability to remove adducts, and people with this variant exhibit a higher frequency of this cancer, even without high levels of exposure to the toxin. Each of these entries in the larger puzzle is quite tentative as an indicator that the toxin causes cancer in humans, and some of the entries do not directly intersect; e.g., the rare genetic variant is not directly linked to the toxin. Yet the filled-in puzzle, with its many intersections, may be quite convincing.

Consider the same ideas in an economic context. Economic understanding depends, in part, on mathematical theories that derive predictions of economic actions from behavioral assumptions, and, in part, on empirical studies of how people or institutions do act in particular economic contexts. Taken in isolation, the assumptions in one mathematical theory may be quite speculative. Taken in isolation, the findings in one empirical study may be quite insecure, ambiguous and tentative. However, one mathematical theory may intersect with many empirical studies, and may also intersect with many other mathematical theories. Important economic facts – say, a high level of unemployment among recent high school graduates in a particular region at a particular time – may be compatible with several mutually incompatible economic theories – say, a theory that emphasizes rigidities in the labor market, or another that emphasizes the absence of a mechanism to provide adequate investments in human capital. But each theory intersects many particular facts, many empirical studies, and many other theories. Clarification comes, if it does, when an initially speculative theory has correctly met so many ambiguous facts or tentative empirical findings that the theory is no longer speculative, the facts no longer ambiguous, the findings no longer tentative.


3.3 What would it mean to take Cochran’s advice seriously?

If you took Cochran’s advice seriously, then you would ask of each new study what it contributes to the currently incomplete, partly penciled-in puzzle. You would welcome the completion of a new entry, even a small entry, compatible with the current tentative completion. You would also welcome a compelling new entry that challenged some current entries. You would welcome the suggestion that a particular entry is mistaken and constitutes a barrier to correct completion of the puzzle. A pencil and an eraser would be two tools of equal importance. You would agree with Sunstein (2005) in finding positive value in dissonance and dissent, and you would agree with Rescher (1995) in finding positive value in consensus only to the extent that this consensus has its origins in a rational appraisal of the evidence, whereupon the mere existence of consensus would have little importance beyond its important origins. You would tolerate inconsistency and uncertainty as necessary stepping stones on a path to greater consistency and greater certainty. You would welcome systematic attempts to take stock, to view the tentative completion as a whole, the appraisal of the gaps, the parts that appear secure, the other parts that are uncertain, needing work, in conflict, perhaps mistaken. You would welcome careful, patient, methodical scientific work. You would agree with Kafka (1917): “All human errors are impatience, the premature breaking off of what is methodical.”

To take Cochran’s advice seriously is to be skeptical of investigations that derive stout conclusions from slender evidence. It is to be skeptical of grand studies and grand conclusions, the suggestion that a single proposed entry settles a major issue, that consistent completion of the puzzle is inevitable given this one entry, and hence consistent completion is not needed and not worth the effort.

3.4 Methods

Several statistical methods cultivate varied strands of evidence within a single study, each strand being weak on its own, each strand vulnerable in a different way, but with the several strands gaining in strength if they agree in appropriate ways. Traditional methods are quasi-experimental designs; see Campbell and Stanley (1963), Shadish et al. (2002), West et al. (2008) and Wong et al. (2015). More recent methods include evidence factors (Rosenbaum 2010, 2015; Zhang et al. 2011), differential effects (Rosenbaum 2006, 2013, 2015; Zubizarreta et al. 2014) and attempts to integrate qualitative and quantitative causal inference (Rosenbaum and Silber 2001; Weller and Barnes 2014). VanderWeele (2015) expands on one of Cochran’s themes, the role of mechanisms as evidence. Yang et al. (2014) encourage the tolerance of statistical inferences that terminate in dissonance, that is, inferences that demonstrate unresolved inconsistencies among intersecting strands of evidence.
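As a rough computational illustration of several weak strands gaining strength when they agree: the sketch below combines p-values from approximately independent evidence factors with Fisher's method. It illustrates the general idea only and is not taken from any of the papers cited above.

```python
import numpy as np
from scipy import stats

def combine_evidence_factors(p_values):
    """Fisher's combination of p-values from (approximately) independent
    evidence factors: each strand may be weak alone, but agreement across
    strands yields a stronger joint inference. Sketch only; the formal
    construction of evidence factors is in Rosenbaum (2010)."""
    p = np.asarray(p_values, dtype=float)
    chi2_stat = -2.0 * np.sum(np.log(p))            # Fisher's statistic
    return stats.chi2.sf(chi2_stat, df=2 * len(p))  # combined p-value

# Three individually unconvincing strands can combine into stronger evidence:
print(combine_evidence_factors([0.10, 0.08, 0.12]))  # roughly 0.03
```

The combination is only as trustworthy as the independence of the strands, which is exactly why the methods above go to such lengths to construct factors vulnerable to different biases.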

Acknowledgments

Supported by a grant from the Measurement, Methodology and Statistics Program of the US National Science Foundation.


References

Campbell, D. T. and Stanley, J. C. (1963). Experimental and Quasi-experimental Designs for Research. Chicago: Rand McNally.
Cochran, W. G. (1965). The planning of observational studies of human populations (with Discussion). Journal of the Royal Statistical Society A, 128, 234-266.
Cochran, W. G. (1972). Observational studies. Statistical Papers in Honor of George W. Snedecor. Ames: Iowa State University Press, 70-90.
Goode, E. L., Ulrich, C. M. and Potter, J. D. (2002). Polymorphisms in DNA repair genes and associations with cancer risk. Cancer Epidemiology Biomarkers and Prevention, 11:1513-1530.
Haack, S. (1995). Evidence and Inquiry. Oxford: Blackwell.
Hill, A. B. (1965). The environment and disease: association or causation? Proceedings of the Royal Society of Medicine, 58:295-300.
Kafka, F. (1917). The Blue Octavo Notebooks. Cambridge: Exact Change.
Peirce, C. S. (1868). Some consequences of four incapacities. Journal of Speculative Philosophy, 2, 140-157. Reprinted in: Talisse, R. B. and Aikin, S. F., eds. (2011). The Pragmatism Reader: From Peirce through the Present. Cambridge, MA: Harvard University Press.
Phillips, D. H. (2002). Smoking-related DNA and protein adducts in human tissues. Carcinogenesis, 23:1979-2004.
Polya, G. (1968). Mathematics and Plausible Reasoning, Volume II, 2nd edition. Princeton, NJ: Princeton University Press.
Rescher, N. (1995). Pluralism: Against the Demand for Consensus. New York: Oxford University Press.
Rosenbaum, P. R. and Silber, J. H. (2001). Matching and thick description in an observational study of mortality after surgery. Biostatistics, 2, 217-232.
Rosenbaum, P. R. (2006). Differential effects and generic biases in observational studies. Biometrika, 93:573-586.
Rosenbaum, P. R. (2010). Evidence factors in observational studies. Biometrika, 97:333-345.
Rosenbaum, P. R. (2013). Using differential comparisons in observational studies. Chance, 26:18-25.
Rosenbaum, P. R. (2015). How to see more in observational studies: Some new quasi-experimental devices. Annual Review of Statistics and Its Application, 2:21-48.
Shadish, W. R., Cook, T. D. and Campbell, D. T. (2002). Experimental and Quasi-experimental Designs for Generalized Causal Inference. Boston: Houghton Mifflin.
Sunstein, C. R. (2005). Why Societies Need Dissent. Harvard University Press.
VanderWeele, T. J. (2015). Explanation in Causal Inference. New York: Oxford.
Weller, N. and Barnes, J. (2014). Finding Pathways: Mixed-method Research for Studying Causal Mechanisms. Cambridge University Press.
West, S. G., Duan, N., Pequegnat, W., Gaist, P., Des Jarlais, D. C., Holtgrave, D., Szapocznik, J., Fishbein, M., Rapkin, B., Clatts, M. and Mullen, P. D. (2008). Alternatives to the randomized controlled trial. American Journal of Public Health, 98:1359-66.


Wong, M., Cook, T. D. and Steiner, P. M. (2015). Adding design elements to improve time series designs: No Child Left Behind as an example of causal pattern matching. Journal of Research on Educational Effectiveness, 8:245-279.
Yang, F., Zubizarreta, J. R., Small, D. S., Lorch, S. and Rosenbaum, P. R. (2014). Dissonant conclusions when testing the validity of an instrumental variable. American Statistician, 68:253-263.
Zhang, K., Small, D. S., Lorch, S., Srinivas, S. and Rosenbaum, P. R. (2011). Using split samples and evidence factors in an observational study of neonatal outcomes. Journal of the American Statistical Association, 106:511-524.
Zubizarreta, J. R., Small, D. S. and Rosenbaum, P. R. (2014). Isolation in the construction of natural experiments. Annals of Applied Statistics, 8:2096-2121.

Observational Studies 1 (2015) 212-216 Submitted 3/15; Published 8/15

Comment on Cochran’s “Observational Studies”

Donald B. Rubin [email protected] Department of Statistics Harvard University Cambridge, MA 02138, USA

First, I have to thank Dylan Small for inviting me to contribute comments on the reprinted Cochran (1972) target article. This is, in fact, the third time that I have read this nice, almost colloquial, chapter Bill Cochran wrote in honor of his coauthor, George Snedecor. The first time was about 1970, when I was finishing my PhD under Bill’s direction and he circulated it as a pre-print. The second time was while I was writing my own chapter (Rubin 1984) summarizing Bill’s contributions to observational studies, which appeared in the volume edited by Rao and Sedransk; because my discussion of it is less than three pages and the original is in a relatively massive book, I include it here in the appendix.

But what would I say that’s different today than I said back in 1984? Obviously, I still am awed by Bill’s straightforward and no-nonsense style of communication – that hasn’t changed. But I think that today I would include some points not emphasized in 1984.

One aspect that I think I should have emphasized is the usefulness of the formal idea of an “assignment mechanism” to distinguish randomized experiments from observational studies, and of the formal concept of “potential outcomes” to define causal effects precisely. That is, under the “stable unit treatment value assumption” (SUTVA; Rubin 1980, 1986), Y_i(1) denotes the i-th unit’s values of outcomes under the active treatment and Y_i(0) the i-th unit’s values of outcomes under the control treatment, where the causal effect of the active versus the control treatment for unit i is the comparison of these two potential outcomes. Also, the assignment mechanism is the probability distribution of the vector of treatment indicators, W, given the arrays of potential outcomes and covariates; this perspective was termed “Rubin’s Causal Model” by Holland (1986), but the potential outcomes notation had its roots in the work of Neyman (1923) in the context of randomized experiments, and the term “assignment mechanism” and the use of potential outcomes to define causal effects in general originate with work in the 1970s (Rubin 1974, 1976, 1977, 1978).
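In symbols, a compact restatement of this formulation (standard potential-outcomes notation, added here as a reader's aid rather than quoted from the comment):

```latex
% Under SUTVA, each unit i has two potential outcomes, only one of which is observed:
\begin{align*}
  \tau_i &= Y_i(1) - Y_i(0)
    && \text{unit-level causal effect}\\
  Y_i^{\mathrm{obs}} &= W_i\,Y_i(1) + (1 - W_i)\,Y_i(0)
    && \text{observed outcome, } W_i \in \{0, 1\}
\end{align*}
% The assignment mechanism is the distribution of the vector of treatment
% indicators given covariates and both arrays of potential outcomes:
\begin{align*}
  \Pr\big(\mathbf{W} \mid \mathbf{X},\, \mathbf{Y}(0),\, \mathbf{Y}(1)\big).
\end{align*}
% In a randomized experiment this distribution is known and free of the
% potential outcomes; in an observational study it is unknown, which is what
% makes an outcome-free design stage so important.
```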


In my 1984 discussion of Cochran (1972), I did not emphasize the clarity that this conceptualization brings to causal inference in observational studies. In hindsight, I think that omission was because that conceptualization seemed so obvious to me. It is only in recent years that I have been befuddled by all the confusion created by some writers who eschew this formulation with its attendant clarity. The recent text by Imbens and Rubin (2015) hopefully contributes to rectifying this situation, at least from my perspective.

Another noteworthy omission in my 1984 discussion is my recent focus on the importance of outcome-free design for observational studies (Rubin 2006, 2008). I am not alone in having this current emphasis; see, for example, Yue (2006) and D’Agostino and D’Agostino (2007). In hindsight, I wish that I had emphasized this aspect, although with generous interpretation one could read that theme into parts of Cochran (1972); still, I do not see much distinction made there between things like propensity score design, which is blind to outcome data, and model-based adjustment methods, which require outcome data and so are subject to inapposite manipulation. I think that this desire to correct this omission arises from being repeatedly exposed to more problematic examples in recent years.

A final comment on Cochran (1972) concerns a statement from his concluding section, “Judgement About Causality,” where he fairly blatantly reveals his disappointment in the answers provided by scientific philosophers. Often decisions about interventions must be made, even if based on limited empirical evidence, and we should help decision makers make sensible decisions under clearly stated assumptions, so that “consumers” of the conclusion about the effects of some intervention can honestly weigh the support for that conclusion.

In conclusion: A fine choice for a classic article to reprint.

Appendix

Cochran (1972) is an easy-to-read article that begins with a review of examples of observational studies and notes their increased numbers in recent years. He mentions, as examples of observational studies for treatment effects, the Cornell study of seat belts (Kihlberg and Narragon, 1964), studies of smoking and health (U.S. Surgeon General’s Advisory Committee Report, 1964), the halothane study (Bunker et al., 1969), and the Coleman report (Coleman et al., 1966). In many ways this article is an updated conversational summary of Cochran (1965). Here, even more than before, mature advice to the investigator is the focus of the paper.

First, the investigator should clearly state the objective and hypothesis of the study because such “...statements perform the valuable purpose of directing attention to the comparisons and measurements that will be needed.” Second, the investigator should carefully consider the type of study. Cochran seems to be more obviously negative about studies without control groups than he was earlier:

Single group studies are so weak logically that they should be avoided whenever possible...
Single group studies emphasize a characteristic that is prominent in the analysis of nearly all observational studies – the role of judgment. No matter how well-constructed a mathematical model we have, we cannot expect to plan a statistical analysis that will provide an almost automatic verdict. The statistician who intends to operate in this field must cultivate an ability to judge and weigh the relative importance of different factors whose effects cannot be measured at all accurately.
Comparison groups bring a great increase in analytical insight. The influence of external causes on both groups will be similar in many types of study and will cancel or be minimized when we compare treatment with no treatment. But such studies raise a new problem – How do we ensure that the groups are comparable?

Cochran still emphasizes the importance of effective measurement:


The question of what is considered relevant is particularly important in program evaluation. A program may succeed in its main objectives but have undesirable side effects. The verdict on the program may differ depending on whether or not these side effects are counted in the evaluations... Since we may have to manage with very imperfect measurements, statisticians need more technical research on the effects of errors of measurement.

But the primary statistical advice Cochran has to offer is on controlling bias: “The reduction of bias should, I think, be regarded as the primary objective – a highly precise estimate of the wrong quantity is not much help... In observational studies, three methods are in common use in an attempt to remove bias due to extraneous variables... Blocking, usually known as matching in observational studies... Standardization (adjustment by subclassification)... Covariance (with x’s quantitative) used just as in experiments.”

The de-emphasis of efficiency relative to bias removal was evident when I began my thesis work under Cochran in 1968. The results of this thesis (Rubin, 1970), in large part published in Rubin (1973a, 1973b) and summarized in Cochran and Rubin (1973), led to some new advice on the tradeoff between matching and covariance: in order to guard against nonlinearities in the regression of y on x, the combination of regression and matching appears superior to either method alone. Recent work (Rubin, 1979) extends this conclusion to more than one x. Specifically, based on Monte Carlo results with 24 moderately nonlinear but parallel response surfaces and 12 bivariate normal distributions of x, and using percentage reduction in expected squared bias of the treatment effect as the criterion, it appears quite clear that the combination of matched sampling and regression adjustment is superior to either matching or regression adjustment alone.

Furthermore, Mahalanobis metric matching, which defines the distance between a treatment and control unit using the inverse of the sample covariance matrix of the matching variables and then sequentially finds the closest unmatched control unit for each experimental unit, was found superior to discriminant matching, which forms the best linear discriminant between the groups and sequentially finds the closest unmatched control unit for each experimental unit with respect to this discriminant. Moreover, regression adjustment that estimates the regression coefficient from the regression of the matched-pair y differences on the matched-pair x differences is superior to the standard covariance-adjusted estimator, which estimates the coefficients from the pooled within-group covariance matrix.
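The sequential matching procedure just described is simple enough to sketch in code. Below is a minimal, illustrative rendering of Mahalanobis metric matching (generic variable names; a sketch of the idea, not the code behind the Monte Carlo results cited above):

```python
import numpy as np

def mahalanobis_metric_matching(x_treated, x_control):
    """Sequentially match each treated unit to its nearest still-unmatched
    control, with distance defined by the inverse of the sample covariance
    matrix of the matching variables x, pooled over both groups."""
    pooled = np.vstack([x_treated, x_control])
    v_inv = np.linalg.inv(np.cov(pooled, rowvar=False))
    available = list(range(len(x_control)))
    matches = []
    for x_t in x_treated:
        diffs = x_control[available] - x_t
        dists = np.einsum('ij,jk,ik->i', diffs, v_inv, diffs)
        matches.append(available.pop(int(np.argmin(dists))))
    return matches  # matches[i] indexes the control matched to treated unit i
```

Rubin's finding, as summarized above, is that pairing such matching with regression adjustment on the matched-pair differences outperforms either device alone.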

Cochran goes on to offer advice on sample sizes, handling nonresponse, the use of a pilot study, the desire for a critical colleague in the planning stages, and the relation between sample and target populations. It is not surprising that this article concludes with a short section called “Judgment About Causality.” Cochran’s views are somewhat more bluntly presented here than in previous writing:

It is well known that evidence of a relationship between x and y is no proof that x causes y. The scientific philosophers to whom we might turn for expert guidance on this tricky issue are a disappointment. Almost unanimously and with evident delight they throw the idea of cause and effect overboard... A claim of proof of cause and effect must carry with it an explanation of the mechanism


by which the effect is produced. Except in cases where the mechanism is obvious and undisputed, this may require a completely different type of research from the observational study that is being summarized. Thus in most cases the study ends with an opinion or judgment about causality, not a claim of proof.

Cochran closes with the standard advice to make causal hypotheses complex:

Given a specific causal hypothesis that is under investigation, the investigator should think of as many consequences of the hypothesis as he can and in the study try to include response measurements that will verify whether these consequences follow.

References

Bunker, J. P. et al., eds. (1969). The National Halothane Study. Washington, D.C.: USGPO.
Cochran, W. G. (1965). The planning of observational studies. J. Roy. Statist. Soc. Ser. A, 128: 234-66.
Cochran, W. G. and Rubin, D. B. (1973). Controlling bias in observational studies: a review. Sankhyā, Series A, 35, 417-446.
Coleman, J. S. (1966). Equality of Educational Opportunity. Washington, D.C.: USGPO.
D’Agostino, R. B. Jr. and D’Agostino, R. B. Sr. (2007). Estimating treatment effects using observational data. Journal of the American Medical Association, 297, 314-316.
Holland, P. W. (1986). Statistics and causal inference (with discussion). Journal of the American Statistical Association, 81, 945-970.
Imbens, G. W. and Rubin, D. B. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press, New York.
Neyman, J. (1923). On the Application of Probability Theory to Agricultural Experiments: Essay on Principles, Section 9. Translated in Statistical Science, 5, 465-480, 1990.
Rubin, D. B. (1970). The Use of Matched Sampling and Regression Adjustment in Observational Studies. Ph.D. thesis, Department of Statistics, Harvard University.
Rubin, D. B. (1973a). Matching to remove bias in observational studies. Biometrics, 29, 159-183.
Rubin, D. B. (1973b). The use of matched sampling and regression adjustment to remove bias in observational studies. Biometrics, 29, 184-203.
Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66, 688-701.
Rubin, D. B. (1976). Inference and missing data (with discussion). Biometrika, 63, 581-592.
Rubin, D. B. (1977). Assignment to treatment group on the basis of a covariate. Journal of Educational Statistics, 2, 1-26.
Rubin, D. B. (1978). Bayesian inference for causal effects: The role of randomization. Annals of Statistics, 6, 34-68.
Rubin, D. B. (1979). Using multivariate matched sampling and regression adjustment to control bias in observational studies. Journal of the American Statistical Association, 74, 318-328.


Rubin, D. B. (1980). Discussion of “Randomization analysis of experimental data in the Fisher randomization test” by Basu. Journal of the American Statistical Association, 75, 591-593.
Rubin, D. B. (1984). William G. Cochran’s contribution to the design, analysis and evaluation of observational studies. In W. G. Cochran’s Impact on Statistics, eds. P. S. R. S. Rao and J. Sedransk. Wiley, New York.
Rubin, D. B. (1986). Which ifs have causal answers? (Comment on “Statistics and causal inference” by P. W. Holland). Journal of the American Statistical Association, 81, 961-962.
Rubin, D. B. (2008). For objective causal inference, design trumps analysis. Annals of Applied Statistics, 2, 808-840.
Rubin, D. B. and Waterman, R. L. (2006). Estimating causal effects of marketing interventions using propensity score methodology. Statistical Science, 21, 206-222.
United States Surgeon General’s Advisory Committee Report (1964). Smoking and Health. U.S. Department of Health, Education and Welfare, Washington, D.C.

Observational Studies 1 (2015) 217-219 Submitted 3/15; Published 8/15

Comment on Cochran’s “Observational Studies”

Herbert L. Smith [email protected] Department of Sociology and Population Studies Center University of Pennsylvania Philadelphia, PA 19104, USA

Cochran’s perspective on observational studies (inter alia, Cochran 1965, Cochran 1972) remains a touchstone for population studies and other fields in which observations are plenty and experiments are implausible. It repays careful study and reflection. Cochran (1972, pp. 77-78) starts by distinguishing “analytical studies” from the “type of observational study [that] is narrower in scope[, where t]he investigator has in mind agents, procedures, or experiences that may produce certain causal effects...on people” [my emphasis]. The essay discusses difficulties that arise when treatments cannot be assigned at random, as in an experiment. It also discusses many other features of research design that have nothing to do with random assignment but much to do with our warrant for causal inference, including the need to be explicit regarding what the research is about (the specification of what is to be estimated and where effects are supposed to obtain), trade-offs in measurement between accuracy and bias, sample size, and non-response. Somewhere along the line, “doing causation” became associated with establishing the internal validity of a study and/or form of estimation, and everything else got swept into external validity and measurement – important, to be sure, but somehow not as important as “causation.” To re-read Cochran (1972) is to realize this should not have happened (Smith 2009b, 2013).

In the discussion of sample size, Cochran (1972, p. 87) distinguishes between “estimating a single overall effect of the treatment” and the alternative situation in which “the variation in effect with an x is of major interest.” In the latter circumstance, randomization does not buy one out of the need to specify accurately what is being estimated and for whom (Smith 1990). The definition of treatment effects at the unit level (Rubin 2005) has made clear how contingent all causal inference is on varying qualities of the population. Many of the xs are unobservable and are confounded with selection into treatment, such that familiar statistical estimators must be re-conceptualized so as to pertain to causal effects only in certain slices of the population (Angrist, Imbens, and Rubin 1996; Morgan and Winship 2007, ch. 7).

The renewed emphasis on the population heterogeneity of treatment effects (Xie 2013) has paid great dividends in the social sciences. It was always suspected that “returns to education” may be exaggerated, given that those who obtain college educations may be precisely the type of people who would do well financially in the absence of a degree; notorious counterfactuals such as Bill Gates and Steve Jobs have only burnished the argument for adjusting maintained causal effects downward to account for positive selectivity into higher education. Although this may indeed be the case at the upper end of the social background and ability distribution, there is a concomitant negative selection such that those most likely


to benefit from a college education are those least likely to be able, or to be encouraged, to seek one (Hout 2012).

Cochran (1972, p. 87) wrote that “[i]n modern studies, standards with regard to the nonresponse problem seem to me to be lax.” The same is doubtless true in our post-modern era, but the problem has become so much worse that it is hard to disentangle lax standards in survey research from social change (technologies, household and family structures, trust in institutions, community cohesion, non-stop marketing, and just plain fatigue with what was once an innovation but is now a hovering presence). Moreover, a blind faith in what are essentially arbitrary standards for response rates can distract from serious studies of bias due to non-response (Asch, Jedrziewski, and Christakis 1997).

We are learning interesting things. Early polling failures with respect to predicting presidential election outcomes have long been the basis for introducing students to problems of coverage and non-response bias alike (Freedman, Pisani, and Purves 1978, pp. 301-307). There are now so many polls, executed at varying degrees of departure (typically large) from the ideal of sampling frames isomorphic with target populations, respondent selection with known probabilities, and successful efforts to deter non-response, and so many elections, that we know that non-response, though large in the extreme, is not a problem with respect to bias (Gelman and King 1993, pp. 423-427). Happily, all sorts of biases seem to cancel one another out, to the point where a mass of such polls, undifferentiated with respect to the classic criteria for such observational studies, provides a firm basis for some impressive forecasting of election results (Linzer 2013). Similar results have also been obtained via post-stratification adjustment of polls with no pretense to sampling (Wang et al. 2015). But there are no guarantees that it will ever be thus: the 2015 British parliamentary election appears to be a contemporary example of results defying polls (if not modeling of poll data). Moreover, there are few if any similar domains of survey research in which the extent of observation is so great and the criterion for validity so decisive.

Our hunches and received wisdom may not suffice. In a mail-out, mail-back survey of over 100,000 registered nurses, only one-third could be induced to respond. The potential extent of bias was large, and a careful follow-up survey of over 1,000 non-respondents (where only nine percent proved to be “hard core”) confirmed differential non-response with respect to a number of demographic characteristics, including gender, race, national origin, and education. These are among the characteristics often available in sampling frames, hence the basis of post-stratification weighting schemes. Yet the follow-up survey of non-respondents also revealed that, in spite of the bias with respect to these demographic factors, there was no bias whatsoever regarding the core content items that had motivated the study (Smith 2009a).

I would be hard-pressed to dissent from Cochran’s (1972, p. 89) conclusion, that “observational studies are an interesting and challenging field which demands a great deal of humility, since we can claim only to be groping toward the truth.”

References

Angrist, J.D., Imbens, G.W. and Rubin, D.B. (1996). Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91, 444-455.
Asch, D.A., Jedrziewski, M.K. and Christakis, N.A. (1997). Response rates to mail surveys published in medical journals. Journal of Clinical Epidemiology, 50, 1129-1136.


Cochran, W. G. (1965). The planning of observational studies. J. Roy. Statist. Soc. Ser. A, 128: 234-66.
Cochran, W. G. (1972). Observational studies. In Statistical Papers in Honor of George W. Snedecor, ed. T. A. Bancroft, 70-90. Iowa State University Press, Ames. Reprinted in Observational Studies, 1, 126-136.
Gelman, A. and King, G. (1993). Why are American presidential election campaign polls so variable when votes are so predictable? British Journal of Political Science, 23, 409-451.
Hout, M. (2012). Social and economic returns to college education in the United States. Annual Review of Sociology, 38, 379-400.
Linzer, D.A. (2013). Dynamic Bayesian forecasting of presidential elections in the states. Journal of the American Statistical Association, 108, 124-134.
Rubin, D.B. (2005). Causal inference using potential outcomes. Journal of the American Statistical Association, 100, 322-331.
Smith, H.L. (1990). Specification problems in experimental and nonexperimental social research. Sociological Methodology, 20, 59-91.
Smith, H.L. (2009a). Double sample to minimize bias due to non-response in a mail survey. PSC Working Paper Series, December. http://repository.upenn.edu/psc_working_papers/20.
Smith, H.L. (2009b). Causation and Its Discontents. In Causal Analysis in Population Studies, ed. H. Engelhardt, H.-P. Kohler, and A. Fürnkranz-Prskawetz, 233-42. The Springer Series on Demographic Methods and Population Analysis 23. Springer Netherlands. http://link.springer.com/chapter/10.1007/978-1-4020-9967-0_10.
Smith, H.L. (2013). Research design: Toward a realistic role for causal analysis. In Handbook of Causal Analysis for Social Research, ed. S.L. Morgan, 45-73. Springer Netherlands. http://link.springer.com/chapter/10.1007/978-94-007-6094-3_4.
Wang, W., Rothschild, D., Goel, S. and Gelman, A. (2015). Forecasting elections with non-representative polls. International Journal of Forecasting, 31, 980-991.
Xie, Y. (2013). Population heterogeneity and causal inference. Proceedings of the National Academy of Sciences, 110, 6262-6268.

Observational Studies 1 (2015) 220-222 Submitted 3/15; Published 8/15

Comment on “Observational Studies” by Dr. W.G. Cochran (1972)

Mark J. van der Laan [email protected] Division of Biostatistics, School of Public Health University of California, Berkeley Berkeley, CA 94720, USA

This article “Observational Studies” by Dr. W.G. Cochran (1972) is not only a must-read but also an excellent article motivating the need for the Observational Studies journal that is now having its first issue. Reading this paper by Dr. Cochran, a few points came to my mind.

To start with, his description of the growing importance of observational studies aiming to shed light on causal effects could have been written today with a few minor changes. For example, the importance of governments in driving the need for well-designed observational studies, as he describes, is as timely as ever, with the FDA running large-scale safety analysis programs to determine harmful side effects of FDA-approved drugs, and with the precision medicine initiatives at the national and state level to improve patient care while making it cost-effective, to name just a few. These initiatives demand setting up large observational studies, such as the Sentinel Project, and create collaborations among government, insurance companies, industry and academics. By having a clear goal in mind, these initiatives provide unique opportunities for inspirational multidisciplinary bundles of scientific activities, all having the eye on the ball.

Dr. Cochran’s articles represent the writing of a wise scientist who greatly cares about the real world, the truth, and the impact of these observational studies on society. The ending of his article is also telling about his great care: “In conclusion, observational studies are an interesting and challenging field which demands a good deal of humility, since we can claim only to be groping toward the truth.”

Dr. Cochran provides a wonderful roadmap for planning observational studies, while providing crystal-clear examples to demonstrate the dangers that lurk in the background and can completely destroy the success of an observational study. His demonstrations are as timely as ever, and the mistakes he warns against are not only still commonplace in the current era of Big Data but are, in fact, more prevalent than ever. Even though much of his wisdom appears to be commonplace by now in a typical epidemiology or biostatistics education, he puts his finger on crucial spots that are easily overlooked by most practitioners and data analysts involved in designing observational studies, including their statistical analysis plans. In particular, Dr. Cochran clarifies the importance of defining the causal quantity of interest and of addressing what data need to be measured, and how, in order to be able to learn the value of this causal quantity from the observed data.


In addition, he stresses that once one has the question of interest in mind, the decisions regarding the design can now be targeted towards this question, resulting in what one might call targeted designs. An important part of our research has centered precisely on developing targeted group sequential designs that not only optimally estimate the desired causal quantity, but also adapt the design (e.g., the allocation of treatment) towards its goal based on what one has learned from past data (e.g., Chambaz et al., 2015), without sacrificing the scientific rigor of a controlled design.

Most importantly, time after time Dr. Cochran expresses concern about the underlying assumptions under which the design, combined with the proposed analysis, actually answers the question of interest. For example, he cares about the validity of a regression model even when it involves only a few variables; I can only imagine how concerned he would be when we run a high-dimensional linear regression analysis with hundreds or thousands of variables, as if these models approximated the truth. He emphasizes the need for supplementary studies, including pilot studies, to obtain a better understanding of measurement errors “so that we can work with realistic models”. Clearly, Dr. Cochran does care about using statistical models that are realistic, and, even when the study is an observational study, he wants to control as much about the experiment as possible and incorporate this knowledge about the experiment in the statistical analysis plan for assessing the desired causal effect. Compare this with the common approach that throws one of our standard unrealistic parametric regression models at the data based on the format of the data, and manually tunes its choice to get the desired answer!

In Dr. Cochran’s discussion of the effect of misspecified models on bias and variance, he concludes with “Reduction of bias should be the primary objective”. From this perspective, a recent opinion piece, “Why we need a statistical revolution” (van der Laan, 2015), is very much in the spirit of Dr. Cochran. Our research in the field of Targeted Learning (e.g., van der Laan and Rose, 2011) aims to respond to the enormous challenges our field is confronted with by providing methods that optimally learn a specific target quantity, incorporating only real knowledge about the data-generating experiment and fully utilizing the state of the art in machine learning through Super Learning, while still providing formal statistical inference. I would have loved the opportunity to talk with Dr. Cochran to hear his view on these robust methods based on realistic models, aiming to minimize bias and maximize precision.

After having pointed out the lack of statistical methods that appropriately deal with nonresponse, Dr. Cochran concludes: “Fortunately, nonresponse can often be reduced materially by hard work during the study, but definite plans for this need to be made in advance.” He realizes that observational studies require as much careful planning as a controlled experiment, and that hard work can prevent missingness or provide a fundamental understanding of the missingness mechanism, so that statistical methods can correct accordingly for bias induced by informative missingness.

Beyond the great concern Dr. Cochran expresses regarding statistical bias due to violations of assumptions such as linearity of a regression model or measurement error assumptions, he pays particular attention to the non-testable assumptions required to draw causal conclusions. He states that these non-testable assumptions should not only be clearly presented and discussed in a separate section of a manuscript, but that substantial effort should be invested in additional analyses that can shed some light on these assumptions, and that possible explanations of the statistical findings should be provided.
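One concrete form such an additional analysis can take is the negative-control check mentioned below: estimate the “effect” of the treatment on an outcome it is known not to affect, so that any nonzero adjusted association measures bias rather than causation. A minimal sketch, with hypothetical variable names and ordinary least squares purely for illustration:

```python
import numpy as np
import statsmodels.api as sm

def negative_control_check(treatment, nc_outcome, covariates):
    """Regress an outcome known to be causally unaffected by the treatment
    on the treatment plus measured covariates. An adjusted coefficient far
    from zero flags residual confounding or other bias in the analysis,
    since the true causal effect is zero by assumption."""
    design = sm.add_constant(np.column_stack([treatment, covariates]))
    fit = sm.OLS(nc_outcome, design).fit()
    return fit.params[1], fit.conf_int()[1]  # treatment coefficient and its 95% CI
```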


Cochran strongly recommends the inclusion of a colleague who plays the role of devil’s advocate to hit the weak spots of the statistical analysis plan. He would have liked the use of negative controls (i.e., in the actual data set of interest, one assesses the effect of a variable on an outcome for which it is known that the causal effect equals zero) to showcase possible causal bias in the statistical method.

In light of the political maneuvering taking place in the current Big Data arena, the following remark by Dr. Cochran is very relevant and timely: “In numerous instances the choice seems to lie between doing a study much smaller and narrower in scope than desired but with high quality of measurement, or an extensive study with measurements of dubious quality. I am seldom sure what to advise.”

To conclude this commentary: Dr. Cochran is one of the very important contributors to our discipline, and his spirit is at least as important now as it was in his time. His spirit stands for the advancement of science in pursuit of truth; careful and targeted planning of observational studies; targeting of the statistical approach towards the scientific question of interest while integrating knowledge about the experiment; and hard work combined with humility when it comes to drawing conclusions. I am convinced that this new journal, Observational Studies, will stand for all of this, greatly advance our scientific discipline, and thereby honor Dr. Cochran and the likes of him accordingly.

References

Chambaz, A. and van der Laan, M. J. (2014). Inference in targeted group-sequential covariate-adjusted randomized clinical trials. Scandinavian Journal of Statistics, 41, 104-140.
van der Laan, M. J. and Rose, S. (2011). Targeted Learning: Causal Inference for Observational and Experimental Data. Springer, New York.
van der Laan, M. J. (2015). Why we need a statistical revolution. http://www.stats.org/super-learning-and-the-revolution-in-knowledge/
Zheng, W., Chambaz, A. and van der Laan, M. J. (2015, forthcoming). Group sequential clinical trials with response-adaptive randomization. In Modern Adaptive Randomized Clinical Trials: Statistical, Operational, and Regulatory Aspects, ed. A. Sverdlov. Springer, New York. See also http://biostats.bepress.com/ucbbiostat/paper323

Observational Studies 1 (2015) 223-230 Submitted 3/15; Published 8/15

Observational Studies and Study Designs: An Epidemiologic Perspective

Tyler J. VanderWeele [email protected] Department of Epidemiology and Department of Biostatistics Harvard University Boston, MA 02115, USA

Cochran’s Contribution to Observational Studies

Cochran’s remarkable paper, published over forty years ago, clearly articulates, or touches upon and anticipates, numerous principles of research for observational studies that are still of central relevance today. Amongst the principles and ideas covered in his paper are: the central importance of a control group (Section 3); issues of confounding and confounding adjustment (Sections 3 and 6); the relevance of baseline measurements of the outcome (Section 3); the issue of direction of causation even in what today would be called an interrupted time-series design (Section 3); issues of measurement error, including differential measurement error (Section 4); the importance of examining effects on multiple outcomes in assessing the relevance of a treatment or exposure (Section 4), an issue that is still very much neglected today; randomized experiments as a template for thinking about observational studies (Section 5); matching, blocking, standardization, and covariance adjustment (Section 6); issues of generalizability and external validity (Section 7.5); and of course the challenges, and relevance, of causal inference (Section 7.6).

The discussion in Cochran’s paper moreover anticipates numerous important ideas that were to be developed fully only much later. His discussion of matching and standardization and the difficulties encountered with multiple covariates (Section 6) anticipates the need for the use of propensity scores (Rosenbaum and Rubin, 1983, 1984) in the handling of these issues. His discussion of the role of blocking and covariance adjustment in observational studies to both increase precision and control bias, along with the position that in observational studies reduction of bias ought to be given priority (Section 5), partially anticipates Rosenbaum’s notion of design sensitivity being more important than efficiency in observational studies (Rosenbaum, 2010). His discussion of causation as a change in the treatment variable leading to a change in the outcome (rather than the two merely being associated) anticipates the resurrection of Neyman’s potential outcomes notation (Neyman, 1923) by Rubin (1974, 1978) for use in observational studies just a couple of years after Cochran’s paper was written. His mention of “repeated before and repeated after” studies (Section 3), though not developed, anticipates in some sense Robins’ development of concepts and methodology for the causal effects of time-varying exposures (Robins, 1986, 1997; Robins et al., 2000).


Cochran’s point that assessing the explanation of, or mechanisms for, an effect may require a completely different type of research from that assessing the overall effect (Section 7.6) anticipates in some sense the literature on mediation and the challenges therein (Robins and Greenland, 1992; Pearl, 2001; VanderWeele, 2015). His encouragement of the use of outcomes on which the exposure is known to have no effect (Section 7.6) anticipated the more formal development of the use of “negative controls” for reasoning about causality (Lipsitch et al., 2010; Tchetgen Tchetgen, 2014; Imbens and Rubin, 2015). In many ways, then, Cochran’s paper provided a roadmap for a great deal of research on observational studies and causal inference that was to take place in the decades to follow.

Observational Studies and Study Design

Cochran’s discussion of observational studies is perhaps best understood as a statistician’s perspective. Many of the issues he discussed are still at the forefront of statistical research on observational studies today. Within observational studies in epidemiology, however, while all of Cochran’s principles and discussion are still relevant, the scope of such studies is somewhat broader. Cochran distinguishes two broad types of observational studies: (i) analytic surveys and (ii) studies of the effects of treatments. His focus, he notes, is on the latter. Implicit throughout nearly all of Cochran’s discussion of this second type of study is that the design of the study in question is what we might today call a cohort study. However, many other types of observational study designs are arguably also relevant in assessing the effects of treatments, and many of these designs have arisen within epidemiology. Indeed, as discussed below, one of epidemiology’s central methodological contributions has been to study design and to the development of new types of study designs. Several of these new designs are described below. However, before doing so, it will be good to consider more precisely what it is we mean by terms such as “study” and “study design.”

Cochran describes an observational study as a study in which “the investigator is restricted to taking selected observations or measurements on the process under study [and does not] interfere in the process in the way that one does in a controlled laboratory type of experiment.” Both (i) analytic surveys and (ii) studies of the effects of treatments fall within this description. His distinction between these two might, however, be understood either as one of the data available (e.g. all measurements simultaneous versus some before and some after an exposure or event) or as one of purpose (descriptive versus causal). These two dimensions do not entirely coincide. In a survey in which all variables are measured on a single occasion, it may be possible to retrospectively assess certain exposures or treatments, which can then be used in causal/etiologic research. Likewise, even with data available in a longitudinal cohort study, a simple description of the sample characteristics at baseline may, in some instances, be of considerable interest. We should thus distinguish between the data itself and how it was collected (study as a data resource) versus how the data is to be used (study as a particular empirical inquiry).

The term “study” might thus refer to either (i) “the process of systematically collecting data and the resulting data resource” or (ii) “the use of data to address a specific inquiry or set of related inquiries” (often resulting in a paper or report). Cochran’s use of “study” seems usually, though perhaps not always, to be in the latter sense. Even when he refers to specific studies by naming the data resource (e.g. the “National Halothane Study”), this is often in the context of a specific empirical inquiry.


And indeed many studies (processes of data collection and resulting data resources) are often focused upon addressing a single empirical question or set of closely related questions. This is especially the case with randomized trials, in which the effectiveness of a specific intervention on a specific population is assessed as rigorously as possible. But, as will also be discussed further below, a single study (i.e. a single data resource) can often be used for multiple studies in the second sense, i.e. for multiple empirical inquiries on different topics. When referring to a “study” it is thus good to distinguish between a “study” as the collecting of data resulting in a data resource and a “study” as the steps taken to address a specific empirical inquiry or set of related inquiries.

Likewise, “study design” might be understood either as (i) the design of the data collection process for a specific data resource, or (ii) the design of the analytic process and the use of that data resource to address a specific empirical question. Once again, most of Cochran’s discussion of the design of observational studies concerns study design in this latter sense. When the study (data resource) is put together for a focused, specific inquiry, these two aspects of study design are integrally intertwined. But, in other cases, the study (data resource) is put together for multiple purposes and can be used to address multiple very different inquiries; in these cases the two aspects of study design may often be quite distinct. Even after a study (data collection resulting in a data resource) has been conducted and the data is available, considerable work may be required in determining how that data resource can be used to address a specific empirical question.

The term “study design” also has at least one further, and perhaps even more common, use: that of a template. A specific “study design” might be understood as a template in which the actions undertaken to generate and collect the data fit a certain pattern or mold. A randomized trial might be viewed as one type of template, a closed cohort study as another. Each of these is a type of study design. Thus, with the term “study design”, we have at least three uses: (i) the details of the design of the data collection process for a specific data resource (e.g. the design of the “National Halothane Study” or the “Framingham Heart Study”), (ii) the design of the analytic process used, including the data selected and the analyses conducted, to answer a specific empirical inquiry, and (iii) a template for obtaining data with a particular set of shared characteristics with respect to the data-generating mechanism. This third use of “study design” merits some further discussion, and in the following section several new types of study designs that have arisen in epidemiology are described.

Study Designs in Epidemiology

As indicated above, many different types of study designs (templates for the collection of data) can be used to assess causal/etiologic questions. Randomized trials and cohort-type designs are perhaps the most common, but many other types of designs are available. In Section 4 of his paper, Cochran briefly mentions that cancer patients may be better informed of instances of cancer among blood relatives than are the controls who are free of cancer. Cochran is perhaps here touching upon the case-control design (though this is not entirely clear, and a cohort design with case status ascertained at the end of follow-up, and perhaps certain covariate data ascertained retrospectively at that time, might also be in view).


In any case, such case-control designs collect exposure/treatment data, along with covariate data, on a set of individuals with the outcome, and also on a set of controls (selected either from the non-cases or from the underlying population). Intuitively, if the exposure has no effect on the outcome, the prevalence of the exposure should be no higher among the cases than among the controls. More specifically, in such case-control designs, if adequate control has been made for confounding, then it is possible to assess the effect of the exposure on the outcome at least on the odds ratio or rate ratio scale; one can obtain the same odds ratio or rate ratio from case-control data as one would have obtained in the underlying cohort (Miettinen, 1976; Prentice and Pyke, 1979). If further information is available on the prevalence of the outcome or the exposure, further progress can be made in estimating effects on the difference scale as well (Rothman et al., 2008). Such case-control studies are subject to additional biases, many of which have to do with the selection of controls, but case-control designs can be very efficient ways to study cause-effect relationships, especially when outcomes are rare. Such designs developed out of epidemiology, but there is no reason in principle why these designs could not be used in the social sciences as well, especially for rare outcomes for which larger sample sizes are needed when using cohort designs. Arguably their more extensive use in epidemiology is due to historic, rather than substantive, considerations. See Keogh and Cox (2014) for a recent comprehensive overview of case-control designs.

Epidemiologic study designs have also been extended in a number of other ways, including varieties of case-control designs such as nested and matched case-control designs (cf. Rothman et al., 2008), two-stage sampling case-control designs (White, 1982; Breslow and Cain, 1988), variants of the case-control design such as the case-cohort design (Prentice, 1986; Barlow, 1999), and two-stage designs involving both ecologic and case-control data (Haneuse and Wakefield, 2008). These other study designs could likewise have applicability outside of epidemiology.

Even more creative, specialized, and in some ways surprising designs have arisen within epidemiology that address questions of cause-effect relations. So-called case-crossover designs collect data only on subjects who have the outcome in question and use exposure information in a window of time substantially prior to the outcome (e.g. a week before the outcome), comparing this to exposure immediately preceding the outcome, to examine whether particular exposures triggered the outcome in question (Maclure, 1991; Maclure and Mittleman, 2000). Such designs have been used, for example, to study triggers for myocardial infarction (Mittleman et al., 1993) and the relationship between cell-phone use and car accidents (Redelmeier and Tibshirani, 1997; McEvoy et al., 2005). They are useful for etiologic research but address a different type of causal question (e.g. does the exposure trigger the outcome in a short window of time following the exposure) than is addressed in cohort or case-control studies, which focus on longer term effects (cf. Maclure, 2007). Such distinctions make clear the importance of specifying the object of investigation in etiologic research (Miettinen and Karp, 2012). It is not sufficient simply to specify the exposure and outcome being studied; one must also specify the measure (prevalence, incidence; difference, ratio, rate, etc.) and the time-frame.
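Returning to the case-control odds ratio result above, a small simulation can make it concrete: sampling on the outcome changes the exposure prevalence in the sample but, absent confounding and control-selection bias, leaves the exposure-outcome odds ratio intact. The sketch below is illustrative only; the population parameters and all variable names are invented for the example.

# Sketch: the exposure-outcome odds ratio from a case-control sample
# matches the odds ratio in the underlying cohort (Miettinen, 1976;
# Prentice and Pyke, 1979). Illustrative simulated population.
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000                     # large cohort so sampling noise is small
exposed = rng.random(n) < 0.20    # 20% exposed
p_outcome = np.where(exposed, 0.006, 0.002)   # rare outcome, true OR ~ 3
case = rng.random(n) < p_outcome

def odds_ratio(exp, cas):
    # 2x2 table: a = exposed cases, b = exposed controls,
    # c = unexposed cases, d = unexposed controls.
    a = np.sum(exp & cas);  b = np.sum(exp & ~cas)
    c = np.sum(~exp & cas); d = np.sum(~exp & ~cas)
    return (a * d) / (b * c)

# Odds ratio in the full cohort.
print("cohort OR:", round(odds_ratio(exposed, case), 2))

# Case-control sample: all cases plus an equal number of random controls.
cases_idx = np.flatnonzero(case)
controls_idx = rng.choice(np.flatnonzero(~case), size=cases_idx.size, replace=False)
idx = np.concatenate([cases_idx, controls_idx])
print("case-control OR:", round(odds_ratio(exposed[idx], case[idx]), 2))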
The case-crossover design described above is subject to biases that can arise from temporal trends in the exposure, and various other types of designs such as the self-controlled case-series design (Farrington, 1995; Whitaker et al., 2006), the case-time-control design (Suissa, 1995), and the case-case-time-control design (Wang et al., 2011) have been developed to help address some of these biases.


This case-only type of approach has also been modified to study the effects of spatially relevant features or exposures, e.g. proximity to power lines that are only on one side of a street, by using data only on cases and examining how distances to the feature would be modified if mirror images were taken with respect to some other spatial characteristic, e.g. the center line of a street. Such designs are sometimes called case-specular study designs (Zaffanella et al., 1998). Yet another case-only study design variant arises from the observation, essentially simply a consequence of Bayes’ Theorem, that if a genetic and an environmental exposure are independent in distribution in a population, then the odds ratio relating the genetic and environmental exposures among the cases gives the measure of multiplicative interaction in the effects of the two exposures on the outcome in the population (Piegorsch et al., 1994; Schmidt and Schaid, 1999). As with case-control designs, each of these other designs makes additional assumptions above and beyond those in a cohort study, and is thus subject to a wider range of biases (cf. Greenland, 1996; Rothman et al., 2008). But each of these designs also has the potential to give substantial insight in etiologic research, even though they are not traditional follow-up cohort studies and even though they sometimes do not even have a control group as traditionally conceived.

In some instances, a study (data resource) produced for the purpose of one specific empirical inquiry can be useful in addressing an entirely different empirical study question by making use of one of the aforementioned, rather clever, study designs (templates). It is arguably the decoupling of the notion of a study as a data resource from that of a study as a template and as a particular use of the data that allows for such clever secondary uses of data, and that perhaps led to the development of these new study designs within epidemiology in the first place. Such decoupling has arguably also been important in trying to provide some unification of the different types of epidemiologic study designs (Miettinen and Karp, 2012). Once again, even though these designs were developed within epidemiology, there is generally nothing now restricting their use in other fields. Other disciplines would likely benefit from the use of these alternative observational study designs as well.
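The Bayes'-theorem observation behind the case-only interaction result above can be written out in a few lines; the following is a standard sketch under the stated gene-environment independence assumption, with the notation R_{ge} introduced here for the example. Writing R_{ge} = P(D=1 | G=g, E=e) for the disease risks, Bayes' theorem gives P(G=g, E=e | D=1) proportional to R_{ge} P(G=g) P(E=e) when G and E are independent, so the marginal gene and exposure frequencies cancel in the case-only odds ratio:

\[
\mathrm{OR}_{GE \mid D=1}
= \frac{P(G{=}1,E{=}1 \mid D{=}1)\, P(G{=}0,E{=}0 \mid D{=}1)}
       {P(G{=}1,E{=}0 \mid D{=}1)\, P(G{=}0,E{=}1 \mid D{=}1)}
= \frac{R_{11}\, R_{00}}{R_{10}\, R_{01}},
\]

which is exactly the multiplicative interaction contrast of the disease risks (Piegorsch et al., 1994; Schmidt and Schaid, 1999).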

Large Multi-Purpose Cohort Studies

Even with cohort designs, the practice of epidemiology has arguably allowed for new insights. Cochran discusses the need for clear, specific, focused hypotheses and argues that a study should be designed in light of these (Section 2). It is hard to argue against such advice. However, if a study is conceived of as the creation of a data resource (e.g., the Nurses' Health Study, the Framingham Heart Study, etc.), then such a study can in fact be used to address multiple hypotheses. Hundreds of individual studies, in the form of published empirical inquiries, have come out of the Nurses' Health Study on topics as diverse as nutrition, cancer, depression, and social support. In such a context, the principles of study design (understood as the design of the process by which the data resource is created and the resulting content of that resource) are arguably somewhat different than those for a study with a single narrow hypothesis. In the creation of a study (data collection resulting in a data resource) that is to be used for multiple research questions, careful thought must be given to the possibility of many hypotheses and to which confounding variables are relevant not just for one exposure, or one outcome, but for a whole host of exposure-outcome relationships; extensive questionnaires and detailed follow-up to avoid non-response are often needed to ensure adequate data.


The considerations when a single data resource is to be used for numerous empirical inquiries are challenging; but the undertaking of such studies (i.e., the creation of such data resources) holds tremendous promise for research. Cochran, in his paper, discusses the difficult trade-off between conducting “a study much smaller...than desired but with high quality measurements, or an extensive study with measurements of dubious quality” (Section 4). In one of the rare moments of uncertainty in his paper, Cochran concludes: “I am seldom sure what to advise.” However, with large cohort studies that are to be used for multiple purposes, and which, often as a consequence, have extensive human and financial resources in support of them, there is sometimes no need to choose between the two. Large studies with excellent and extensive measurements are at least sometimes achievable. It is, once again, arguably the conceptual decoupling of “study” as the creation of a data resource and “study” as a specific empirical inquiry that underlies the pursuit of such large multi-purpose cohort studies.

Of course, even with a multi-purpose study (data resource), once the data has been collected, a specific topic of study (empirical inquiry) has been selected, and an appropriate study design (template) chosen, a great deal of work may still remain in determining how best to make use of the data to answer the specific inquiry. And here, irrespective of the design (template) selected, all of Cochran’s principles come, once again, into play. Issues of selecting controls, baseline measurements, confounders and confounder adjustment, measurement error, method of adjustment, generalizability and external validity, and the assessing of the evidence for judging causation must all be carefully evaluated. Cochran’s discussion of observational studies is indeed as relevant today as it was forty years ago.

References

Barlow, W.E., Ichikawa, L., Rosner, D., and Izumi, S. (1999). Analysis of case-cohort designs. Journal of Clinical Epidemiology, 52:1165–1172.

Breslow, N. and Cain, K. (1988). Logistic regression for two-stage case-control data. Biometrika, 75:11–20.

Farrington, C.P. (1995). Relative incidence estimation from case series for vaccine safety evaluation. Biometrics, 51:228–235.

Greenland, S. (1996). Confounding and exposure trends in case-crossover and case-time-control designs. Epidemiology, 7(3):231–239.

Haneuse, S.J.-P.A. and Wakefield, J.C. (2008). The combination of ecological and case-control data. Journal of the Royal Statistical Society: Series B, 70:73–93.

Imbens, G.W., and Rubin, D.B. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press.

Keogh, R.H., and Cox, D.R. (2014). Case-control Studies. Cambridge University Press.

Lipsitch, M., Tchetgen Tchetgen, E. and Cohen, T. (2010). Negative controls: a tool for detecting confounding and bias in observational studies. Epidemiology, 21:383–388.

McEvoy, S.P., Stevenson, M.R., McCartt, A.T., Woodward, M., Haworth, C., Palamara, P. and Cercarelli, R. (2005). Role of mobile phones in motor vehicle crashes resulting in hospital attendance: a case-crossover study. BMJ, 331.

Maclure, M. (1991). The case-crossover design: a method for studying transient effects on the risk of acute events. American Journal of Epidemiology, 133:144–153.


Maclure, M. (2007). “Why me?” versus “why now?” – differences between operational hypotheses in case-control versus case-crossover studies. Pharmacoepidemiology and Drug Safety, 16:850–853.

Maclure, M. and Mittleman, M.A. (2000). Should we use a case-crossover design? Annual Review of Public Health, 21:193–221.

Miettinen, O.S. (1976). Estimability and estimation in case-referent studies. American Journal of Epidemiology, 103:226–235.

Miettinen, O.S. and Karp, I. (2012). Epidemiological Research: An Introduction. Springer.

Mittleman, M.A., Maclure, M., Tofler, G.H., Sherwood, J.B., Goldberg, R.J. and Muller, J.E. (1993). Triggering of acute myocardial infarction by heavy physical exertion. New England Journal of Medicine, 329:1677–1683.

Neyman, J. (1923). Sur les applications de la théorie des probabilités aux expériences agricoles: Essai des principes. Excerpts reprinted (1990) in English (D. Dabrowska and T. Speed, Trans.) in Statistical Science, 5:463–472.

Pearl, J. (2001). Direct and indirect effects. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann, San Francisco, 411–420.

Piegorsch, W.W., Weinberg, C.R., and Taylor, J.A. (1994). Non-hierarchical logistic models and case-only designs for assessing susceptibility in population-based case-control studies. Statistics in Medicine, 13:153–162.

Prentice, R.L. (1986). A case-cohort design for epidemiologic cohort studies and disease prevention trials. Biometrika, 73:1–11.

Prentice, R.L. and Pyke, R. (1979). Logistic disease incidence models and case-control studies. Biometrika, 66:403–411.

Redelmeier, D.A. and Tibshirani, R.J. (1997). Association between cellular telephone calls and motor vehicle collisions. New England Journal of Medicine, 336:453–458.

Robins, J.M. (1986). A new approach to causal inference in mortality studies with sustained exposure period – application to control of the healthy worker survivor effect. Mathematical Modelling, 7:1393–1512.

Robins, J.M. (1997). Causal inference from complex longitudinal data. In: Latent Variable Modeling and Applications to Causality. Lecture Notes in Statistics (120), M. Berkane, Editor. Springer Verlag, New York, 69–117.

Robins, J.M. and Greenland, S. (1992). Identifiability and exchangeability for direct and indirect effects. Epidemiology, 3:143–155.

Robins, J.M., Hernán, M.A. and Brumback, B. (2000). Marginal structural models and causal inference in epidemiology. Epidemiology, 11:550–560.

Rosenbaum, P.R. (2010). Design sensitivity and efficiency in observational studies. Journal of the American Statistical Association, 105:692–702.

Rosenbaum, P.R. and Rubin, D.B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70:41–55.

Rosenbaum, P.R. and Rubin, D.B. (1984). Reducing bias in observational studies using subclassification on the propensity score. Journal of the American Statistical Association, 79:516–524.

Rothman, K.J., Greenland, S. and Lash, T.L. (2008). Modern Epidemiology, 3rd edition. Lippincott Williams and Wilkins, Philadelphia.


Rubin, D. (1974). Estimating causal effects of treatments in randomized and non-randomized studies. Journal of Educational Psychology, 66:688–701.

Rubin, D.B. (1978). Bayesian inference for causal effects: The role of randomization. Annals of Statistics, 6:34–58.

Schmidt, S., and Schaid, D.J. (1999). Potential misinterpretation of the case-only study to assess gene-environment interaction. American Journal of Epidemiology, 150:878–885.

Suissa, S. (1995). The case-time-control design. Epidemiology, 6(3):248–253.

Tchetgen Tchetgen, E.J. (2014). The control outcome calibration approach for causal inference with unobserved confounding. American Journal of Epidemiology, 179:633–640.

VanderWeele, T.J. (2015). Explanation in Causal Inference: Methods for Mediation and Interaction. Oxford University Press, New York.

Wang, S., Linkletter, C., Maclure, M., Dore, D., Mor, V., Buka, S. and Wellenius, G.A. (2011). Future cases as present controls to adjust for exposure-trend bias in case-only studies. Epidemiology, 22(4):568–574.

Whitaker, H.J., Farrington, C.P. and Musonda, P. (2006). Tutorial in Biostatistics: The self-controlled case series method. Statistics in Medicine, 25(10):1768–1797.

White, J. (1982). A two stage design for the study of the relationship between a rare exposure and a rare disease. American Journal of Epidemiology, 115:119–128.

Zaffanella, L.E., Savitz, D.A., Greenland, S., and Ebi, K.L. (1998). The residential case-specular method to study wire codes, magnetic fields, and disease. Epidemiology, 9:16–20.

Observational Studies 1 (2015) 231-240 Submitted 4/15; Published 8/15

Reflections on “Observational Studies”: Looking Backward and Looking Forward

Stephen G. West [email protected]
Arizona State University and Freie Universität Berlin
Tempe, AZ 85287, USA

Abstract

The classic works of William Cochran and Donald Campbell provided an important foundation for the design and analysis of non-randomized studies. From the remarkably similar perspectives of these two early figures, distinct perspectives have developed in statistics and in psychology. The potential outcomes perspective in statistics has focused on the conceptualization and the estimation of causal effects. This perspective has led to important new statistical models that provide appropriate adjustments for problems like missing data on the outcome variable, treatment non-compliance, and pre-treatment differences on baseline covariates in non-randomized studies. The Campbell perspective in psychology has focused on practical design procedures that prevent or minimize the occurrence of problems that potentially confound the interpretation of causal effects. It has also emphasized empirical comparisons of the estimates of causal effects obtained from different designs. Greater interplay between the potential outcomes and Campbell perspectives, together with consideration of applications of a third perspective developed in computer science by Judea Pearl, portends continued improvements in the design, conceptualization, and analysis of non-randomized studies.

1. Reflections on “Observational Studies”: Looking Backward and Looking Forward

William G. Cochran in statistics and Donald T. Campbell in psychology provided much of the foundation for the major approaches currently taken to the design and analysis of non-randomized studies in my field of psychology. The initial similarity of the positions taken by Cochran (1965, 1972, 1983) and Campbell (1957, 1963/1966) in their early writings on this topic is remarkable. Their work helped define the area and raised a number of key issues that have been the focus of methodological work since that time. Truly significant progress has been made in providing solutions to several of these key issues. Going forward to the present, work in statistics by Cochran's students (particularly Donald Rubin and his students) and in psychology by Campbell's colleagues (particularly Thomas Cook and William Shadish) has diverged in its emphases. Reconsideration of the newer work from the foundation of Cochran and Campbell helps identify some persisting issues.

2. Looking Backward

Cochran (1972) defined the domain of observational studies as excluding randomization, but including “some agents, procedures, or experiences...[that] are like those the statistician would call treatments in a controlled experiment...” (p. 1).


This definition reflects a middle ground between experiments and surveys without intervention, two areas in which Cochran had made major contributions (Cochran & Cox, 1950; Cochran, 1953). The goal and the challenge of the observational study is causal inference: Did the treatment cause a change in the outcome?

The domain established by Cochran’s definition can still be considered relevant today. There is continued debate over exactly what quantities should be called “treatments in a controlled experiment” (e.g., Holland, 1986; Rubin, 2010). And some authors (e.g., Cook, Shadish & Wong, 2008; Rosenbaum, 2010; Rubin, 2006) appear to have narrowed Cochran’s more inclusive definition of observational study to focus only on those designs that include baseline measures, non-randomized treatment and control groups, and at least one post-treatment outcome measure. I will restrict the use of observational study to this narrower definition below, using the term non-randomized design to indicate the more inclusive definition.

Cochran (1972, section 3) discusses several designs that might be used to investigate the effects of a treatment in the absence of randomization. He also discusses some potential confounders that may undermine the causal interpretation of any prima facie observed effects of the treatment. Campbell (Campbell & Stanley, 1963/1966) attempted to describe the full list of the non-randomized and randomized designs then available, including some he helped invent (e.g., the regression discontinuity design, Thistlethwaite & Campbell, 1960). Associated with each non-randomized design are specific types of potential confounders that undermine causal inference. Campbell attempted to enumerate a comprehensive list of potential confounders, which he termed threats to internal validity. These threats represented “an accumulation of our field's criticisms of each other’s research” (Campbell, 1988, p. 322). Among these are such threats as history, maturation, instrumentation, testing, statistical regression, and attrition in pretest-posttest designs; selection in designs comparing non-randomized treatment and control groups using only a posttest measure; and interactions of selection with each of the earlier list of threats in observational studies (narrow definition). Both Cochran and Campbell clearly recognized the differential ability of each of the non-randomized designs to account for potential confounders.

Echoing his famous earlier quoting of Fisher to “Make your theories elaborate” (Cochran, 1965, p. 252), Cochran (1972, p. 10) stated that “the investigator should think of as many consequences of the hypothesis as he can and in the study try to include response measurements that will verify whether these consequences follow.” Campbell (1968; Campbell & Stanley, 1963/1966; Cook & Campbell, 1979) emphasized the similar concept of pattern matching, in which the ability of the treatment and each of the plausible confounders to account for the obtained pattern of results is compared. Campbell emphasized that both response variables and additional design features (e.g., multiple control groups having distinct strengths and weaknesses; repeated pre-treatment measurements over time) be included in the design of the study to distinguish between the competing explanations.
Cochran (1972) also considered several of the ways in which measurement could affect the results of observational studies, notably the effects of accuracy and precision of measurement on the results of analyses and the possibility that measurements were non-equivalent in the treatment and comparison groups. Campbell and Stanley (1963/1966) included measurement-related issues prominently among their threats to internal validity, and Campbell and Fiske (1959) offered methods of detecting potential biases (termed “method effects”) associated with different approaches to measurement (e.g., different types of raters; different measurement operations).


Based on his experiences attempting to evaluate compensatory education programs, Campbell particularly emphasized the role of measurement issues, notably unreliability and lack of stability over time of baseline measurements, in producing artifactual results in the analysis of observational studies (Campbell & Boruch, 1975; Campbell & Erlebacher, 1970; Campbell & Kenny, 1999).

3. Looking Forward to the Present

From the initial similarity of the perspectives of Cochran and Campbell on non-randomized studies, their followers have diverged in their emphases. In statistics, Donald Rubin, one of Cochran’s students, has developed the potential outcomes approach to causal inference (Rubin, 1978; 2005; Imbens & Rubin, 2015), which provides a formal mathematical statistical approach for the conceptualization and the analysis of the effects of treatments. In psychology, Campbell’s colleagues have continued to develop aspects of his original approach, focusing on systematizing our understanding of design approaches to ruling out threats to internal validity. I highlight a few of these differences below (see West & Thoemmes, 2010 for a fuller discussion). Table 1 summarizes the typical design and statistical analysis approaches to strengthening causal inference associated with some randomized and non-randomized designs that come out of the Rubin and Campbell perspectives, respectively.

3.1 Rubin’s Potential Outcomes Model

The potential outcomes model has provided a useful mathematical statistical framework for conceptualizing many issues in randomized and nonrandomized designs. This framework starts with the (unattainable) ideal of comparing the response of a single participant under the treatment condition with the response of the same participant under the control condition at the same time and in the same setting. Designs that approximate this ideal to varying degrees can be proposed, including the randomized experiment, the regression discontinuity design, and the observational study. The potential outcomes framework makes explicit the exact assumptions needed to meet the ideal and defines mathematically the precise causal effects that can be estimated if these assumptions are met. The framework draws heavily on Rubin’s (1976; Little & Rubin, 2002) seminal work on developing unbiased estimates of parameters when data are missing. Randomized experiments can be conceived of as designs in which no observations are available for treatment-group participants under the control condition and none for control-group participants under the treatment condition, but the data are missing completely at random.

The potential outcomes framework permits the unbiased estimation of the magnitude of the average causal effect in experiments given that four assumptions are met: (a) successful randomization, (b) full compliance with the assigned treatment condition, (c) full measurement of the outcome variables, and (d) the stable unit treatment value assumption (SUTVA: participants' outcomes are unaffected by the treatment assignments of others; no hidden variations of treatments). In cases when these assumptions break down (“broken randomized experiments”), new approaches requiring additional assumptions have been developed from the foundation of the potential outcomes model.


Table 1: Key Assumptions/Threats to Internal Validity and Example Remedies for Randomized Experiments and Non-randomized Alternative Designs

Randomized Experiment
  Independent units. Design approach: temporal or geographical isolation of units. Statistical approach: multilevel analysis; other statistical adjustment for clustering.
  Stable Unit Treatment Value Assumption (SUTVA): other treatment conditions do not affect participant's outcome; no hidden variation in treatments. Design approach: temporal or geographical isolation of treatment groups. Statistical approach: statistical adjustment for measured exposure to other treatments.
  Full treatment adherence. Design approach: incentive for adherence. Statistical approach: instrumental variable analysis (assume exclusion restriction).
  No attrition. Design approach: sample retention procedures. Statistical approach: missing data analysis (assume missing at random).

Regression Discontinuity Design
  Functional form of relationship between assignment variable and outcome is properly modeled. Design approach: different cutpoint; nonequivalent dependent variables. Statistical approach: sensitivity analysis.

Interrupted Time Series Analysis
  Functional form of the relationship for the time series is properly modeled; another historical event, a change in population (selection), or a change in measures coincides with the introduction of the intervention. Design approach: nonequivalent control series in which intervention is not introduced; switching replication in which intervention is introduced at another time point; nonequivalent dependent measure. Statistical approach: diagnostic plots for the time series (autocorrelogram; spectral density); sensitivity analysis.

Observational Study
  Measured baseline variables equated; unmeasured baseline variables equated; differential maturation. Design approaches: multiple control groups; nonequivalent dependent measures; additional pre- and post-intervention measurements. Statistical approaches: propensity score analysis; sensitivity analysis; subgroup analysis.

Note. The list of assumptions/threats to internal validity identifies issues that commonly occur in each of the designs. The alternative designs may be subject to each of the issues listed for the randomized experiment in addition to the issues listed for the specific design. The examples of statistical and design approaches for mitigating the threat to internal validity illustrate some commonly used approaches and are not exhaustive. For the observational study design, Rubin's and Campbell's perspectives differ, so that the statistical and design approaches do not map 1:1 onto the assumptions/threats to internal validity that are listed. Reprinted from West, S. G. (2009). Alternatives to randomized experiments. Current Directions in Psychological Science, 18, 299–304.

Angrist, Imbens, and Rubin (1996) developed an approach that provides unbiased estimates of the causal effect, in a randomized experiment, for those participants who would take the treatment if assigned to the treatment condition but take the control if assigned to the control condition (a.k.a. the complier average causal effect; see Sagarin, West, Ratnikov, Homan, Ritchie, & Hansen, 2014 for a recent review of approaches to treatment non-compliance).


Little and Rubin (2002) and Yang and Maxwell (2014) offer methods for estimating unbiased causal effects when there is attrition from measurement of the outcome variable.

In the context of observational studies, Rosenbaum and Rubin (1983) developed propensity score analysis as a vehicle to adjust for the effect of a large number of covariates (potential confounders) measured at baseline (see Austin, 2011 and West, Cham, Thoemmes, Renneberg, Schulze, & Weiler, 2014 for recent reviews). Hong and Raudenbush (2006; 2013) extended propensity score analysis to provide proper estimates of the average causal effect when the treatment was delivered to a pre-existing group (e.g., treatments delivered to existing classrooms of students) in group-based observational studies. Thoemmes and West (2011) developed approaches to the analysis of group-based observational studies in which the treatment is delivered to originally independent individuals who are constituted into groups for the purpose of the study. In each case, the potential outcomes perspective provides the foundation for conceptualizing the analysis, helps identify the necessary assumptions, and specifies the exact causal effect that may be estimated.

The potential outcomes approach has been particularly fruitful for the analysis of designs in which there are baseline covariates, outcome measures collected at a single time point, and treatment and control conditions, and in which treatment assignment is assumed to be either independent of potential confounders or independent of all potential confounders after conditioning on covariates (ignorable). It has provided a valuable tool for analyzing randomized experiments, broken randomized experiments, the regression discontinuity design, and the observational study (narrow definition). However, the potential outcomes approach becomes more challenging to apply in designs in which measurement of the outcome variable is extended over time (e.g., interrupted time series designs; see Imbens & Wooldridge, 2009) or in which there are time-varying treatments (see Hong & Raudenbush, 2006). In addition, assumptions underlying the application of the potential outcomes approach (e.g., ignorability: no unmeasured confounders exist) may not be testable, an important limitation.
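To make the propensity score idea discussed above concrete, a minimal sketch of one common use, inverse-probability weighting, follows; matching, stratification, and covariate adjustment on the propensity score are equally standard alternatives. The simulated data, the effect size, and all names are assumptions invented for this illustration, not anything from the papers cited.

# Minimal propensity score sketch (in the spirit of Rosenbaum and
# Rubin, 1983): model treatment assignment from baseline covariates,
# then weight outcomes by the inverse of the estimated assignment
# probability. Illustrative simulated data only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 5000
X = rng.normal(size=(n, 3))                    # baseline covariates
p_treat = 1 / (1 + np.exp(-(0.8 * X[:, 0] - 0.5 * X[:, 2])))
T = rng.random(n) < p_treat                    # confounded assignment
Y = 2.0 * T + X[:, 0] + 0.5 * X[:, 2] + rng.normal(size=n)  # true effect = 2

# Naive difference in means is biased by the baseline confounding.
print("naive estimate:", Y[T].mean() - Y[~T].mean())

# Estimated propensity scores from a logistic model of treatment.
e = LogisticRegression().fit(X, T).predict_proba(X)[:, 1]

# Inverse-probability-weighted (Horvitz-Thompson style) effect estimate.
ate = np.mean(T * Y / e) - np.mean((~T) * Y / (1 - e))
print("IPW estimate:", ate)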

3.2 Campbell’s Practical Working Scientist Approach

In psychology, design-based approaches have been given priority in the Campbell tradition over statistical adjustment of treatment effects: “When it comes to causal inference from quasi-experiments, design rules, not statistics” (Shadish & Cook, 1999, p. 300). The preference is to use the strongest design that can be implemented in the research context (Shadish, Cook, & Campbell, 2002). Advice is given to researchers about justifications that may be given to individual participants, communities, and organizations so that they will permit the strongest possible design to be implemented. Each of the potential threats to internal validity associated with the specific design is then carefully evaluated for plausibility in the research context. Attempts are then made to prevent or minimize the threat. For example, Ribisl et al. (1996) developed a valuable compendium of the then-available strategies for minimizing attrition in longitudinal designs (which needs updating to incorporate new technologies for tracking and keeping in contact with individual participants). Efforts to prevent the threat are supplemented by the identification of design elements that specifically address each identified threat (e.g., multiple pretests; multiple control groups; see Shadish et al., 2002, p. 157).


Then, the pattern of obtained results is compared to the results predicted by the hypothesis and by each of the threats to internal validity (confounders). Shadish et al. present illustrations of the use of this strategy and numerous research applications in which it has been successful. Campbell’s approach does not eschew the elegant statistical analysis approaches offered by the potential outcomes framework (indeed, it welcomes them), but it gives higher priority to design solutions.

There are two key difficulties in applying Campbell’s approach to non-randomized designs. First, the researcher’s decision to rule out or ignore certain threats to internal validity a priori may be incorrect. In Campbell's approach, there are suggestions to use common sense, prior research, and theory to eliminate some potential threats. While often a good guide, common sense, prior research, and theory might be misleading. Second, although the pattern matching strategy can be compelling when the evidence supporting one hypothesis is definitive, more ambiguous partial support for several competing hypotheses can also occur. Rosenbaum (2010) offers some methods of integrating results across multiple design elements, but methods of formally assessing the match of the obtained results to the hypothesis and to each of the plausible threats to validity need further development.

Campbell emphasized a practical approach that mimics the approach of the working scientist. One intriguing development within the Campbell tradition is the empirical comparison of the results of overlapping study designs in which a randomized and a non-randomized design are both used. A series of papers has compared the results of non-randomized designs to those of the randomized experiment under conditions in which the two designs share a common treatment group and ideally sample from the same population of participants. No difference was found between the estimates of the magnitude of the treatment effect from the randomized experiment and the regression discontinuity design (Cook & Wong, 2008; see also Shadish et al., 2011 for a randomized experiment comparing the estimates of treatment effects from the two designs). Similarly, no differences were found between the estimates of the treatment effect from a randomized experiment and an interrupted time series design (St. Clair, Cook, & Hallberg, 2014). In contrast, syntheses of existing comparison studies (Cook, Shadish, & Wong, 2008), as well as randomized experiments comparing the effect sizes for participants randomly assigned to treatment and control conditions or randomly assigned to self-select between the same treatment and control conditions (Shadish, Clark, & Steiner, 2008), identified cases in which the two designs produced comparable and non-comparable results.

Several factors facilitated obtaining comparable results (Cook, Shadish, & Wong, 2008; Cook, Steiner, & Pohl, 2009; Steiner, Cook, Shadish, & Clark, 2010): (a) a known selection rule that determines assignment to treatment and control conditions, (b) inclusion of measures of all relevant confounders in the statistical adjustment model, (c) inclusion of a pretest measure of the outcome variable in the statistical adjustment model, (d) reliable measurement of the covariates, and (e) control participants selected from the same population as the treatment participants (originally descriptively termed a “focal, local” control group). Generalization of the results of these studies was originally limited by the small number of research contexts in the existing studies upon which the conclusions were based. However, in an ongoing project more than 65 studies of this type have been identified and compiled thus far, and this database continues to expand. Synthesis of these studies is increasingly providing a practical basis for the design of non-experimental studies that can help minimize bias in the estimates of causal effects of treatments.


4. Conclusion

Cochran and Campbell helped define the domain of non-randomized studies and the threats to internal validity (potential confounders) that may compromise the interpretation of observed treatment effects. Subsequent work in statistics and psychology has taken complementary paths. Work in the potential outcomes model tradition in statistics has substantially improved the conceptualization of causal effects and has provided improved estimates of their magnitude. Work in psychology following in the tradition of Campbell has emphasized the development of practical methods for researchers to improve the design of non-randomized studies. Some of this work has built on the insights of Cochran and Campbell to make theories complex and to design studies so that pattern matching may be used to distinguish between potential explanations. Some of this work has re-emphasized the importance of some of Cochran and Campbell’s views about the key status of baseline measures of the outcome variable and the importance of highly reliable measurement of baseline measures in non-randomized designs, features that have received less attention in the potential outcomes approach.

Within the potential outcomes framework, less attention has been given to the development of new methods in which the observations are measured repeatedly over time or time-varying treatments are implemented (but see, e.g., Robins & Hernán, 2009). Within the Campbell framework, practical methods for strengthening causal inferences in the interrupted time series have been developed (Shadish et al., 2002), and new work has focused on improving the design and analysis of the single subject design (Shadish, 2014; Shadish, Hedges, Pustejovsky, Rindskopf, Boyajian, & Sullivan, 2014). And within both approaches, the development of formal methods for assessing Cochran’s elaborate theories or Campbell’s pattern matching has received relatively little attention.

This situation may be beginning to change. Judea Pearl’s (2009) approach to causal inference developed in computer science is being applied to systems of variables. Although Pearl’s approach can be seen as having its own limitations (Shadish & Sullivan, 2012; West & Koch, 2014), it has helped sharpen our conceptualization of some causal inference problems (e.g., mediation, in which treatment causes changes in intermediate variables [mediators], which, in turn, produce changes in the outcome variable; which confounders should and should not be controlled in observational studies). It has also provided challenges to the potential outcomes approach given its alternative (but overlapping) approach to the conceptualization and estimation of causal effects. Although important advances have occurred since the foundational work of Cochran and Campbell, greater interplay among the potential outcomes, Campbell, and Pearl perspectives portends continued improvements in our design, conceptualization, and analysis of non-randomized studies.

Acknowledgments

I thank Thomas D. Cook, William R. Shadish, and Dylan Small for their comments on an earlier version of the manuscript.


References

Angrist, J. D., Imbens, G. W., and Rubin, D. B. (1996). Identification of causal effects using instrumental variables (with discussion). Journal of the American Statistical Association, 91, 444–472.

Austin, P. C. (2011). An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behavioral Research, 46(3), 399–424.

Campbell, D. T. (1957). Factors affecting the validity of experiments in social settings. Psychological Bulletin, 54, 297–332.

Campbell, D. T. (1966). Pattern matching as an essential in distal knowing. In K. R. Hammond (Ed.), The psychology of Egon Brunswik (pp. 81–106). Holt, Rinehart, & Winston, New York.

Campbell, D. T. (1988). Can we be scientific in applied social science? In E. S. Overman (Ed.), Methodology and epistemology for social science: Selected papers of Donald T. Campbell. University of Chicago Press, Chicago.

Campbell, D. T., and Boruch, R. F. (1975). Making the case for randomized assignment to treatments by considering the alternatives: Six ways in which quasi-experimental evaluations in compensatory education tend to underestimate effects. In C. A. Bennett & A. A. Lumsdaine (Eds.), Evaluation and experiments: Some critical issues in assessing social programs (pp. 195–296). Academic Press, New York.

Campbell, D. T., and Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81–105.

Campbell, D. T., and Erlebacher, A. E. (1970). How regression artifacts can mistakenly make compensatory education programs look harmful. In J. Hellmuth (Ed.), The disadvantaged child: Vol. 3, Compensatory education: A national debate (pp. 185–210). Brunner/Mazel, New York.

Campbell, D. T., and Kenny, D. A. (1999). A primer on regression artifacts. Guilford, New York.

Campbell, D. T., and Stanley, J. C. (1963/1966). Experimental and quasi-experimental designs for research. Rand McNally, Chicago. Originally published in N. L. Gage (Ed.), Handbook of research on teaching (pp. 171–246). Rand McNally, Chicago.

Cochran, W. G. (1965). The planning of observational studies (with discussion). Journal of the Royal Statistical Society, Series A, 128, 234–266.

Cochran, W. G. (1972). Observational studies. In T. A. Bancroft (Ed.), Statistical papers in honor of George W. Snedecor (pp. 77–90). Iowa State University Press, Ames, IA. Reprinted in Observational Studies, 1, 126–136.

Cochran, W. G. (1983). Planning and analysis of observational studies. Wiley, New York.

Cochran, W. G., and Cox, G. M. (1950). Experimental designs. Wiley, New York.

Cochran, W. G. (1953). Sampling techniques. Wiley, New York.

Cook, T. D., and Wong, V. C. (2008). Empirical tests of the validity of the regression discontinuity design. Annales d’Economie et de Statistique, 91-92, 127–150.

Cook, T. D., Shadish, W. R., and Wong, V. C. (2008). Three conditions under which experiments and observational studies produce comparable causal estimates: New findings from within-study comparisons. Journal of Policy Analysis and Management, 27(4), 724–750.


Cook, T. D., Steiner, P. M., and Pohl, S. (2009). Assessing how bias reduction is influenced by covariate choice, unreliability and data analytic mode: An analysis of different kinds of within-study comparisons in different substantive domains. Multivariate Behavioral Research, 44, 828–847.

Holland, P. W. (1986). Statistics and causal inference. Journal of the American Statistical Association, 81(396), 945–960.

Hong, G., and Raudenbush, S. W. (2006). Evaluating kindergarten retention policy. Journal of the American Statistical Association, 101(475), 901–910.

Hong, G., and Raudenbush, S. W. (2013). Heterogeneous agents, social interactions, and causal inference. In S. L. Morgan (Ed.), Handbook of Causal Analysis for Social Research (pp. 331–352). Springer Netherlands.

Imbens, G. W., and Rubin, D. B. (2015). Causal inference for statistics, social, and biomedical sciences: An introduction. Cambridge University Press, New York.

Imbens, G. W., and Wooldridge, J. M. (2009). Recent developments in the econometrics of program evaluation. Journal of Economic Literature, 47, 5–86.

Little, R. J. A., and Rubin, D. B. (2002). Statistical analysis with missing data (2nd ed.). Wiley, Hoboken, NJ.

Ribisl, K. M., Walton, M. A., Mowbray, C. T., Luke, D. A., Davidson, W. S., and Bootsmiller, B. J. (1996). Minimizing participant attrition in panel studies through the use of effective retention and tracking strategies: Review and recommendations. Evaluation and Program Planning, 19(1), 1–25.

Robins, J. M., and Hernán, M. A. (2009). Estimation of the causal effects of time-varying exposures. In G. Fitzmaurice, M. Davidian, G. Verbeke, & G. Molenberghs (Eds.), Advances in Longitudinal Data Analysis. Chapman and Hall, New York.

Rosenbaum, P. R. (2010). Design of observational studies. Springer, New York.

Rosenbaum, P. R., and Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70, 41–55.

Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3), 581–592.

Rubin, D. B. (1978). Bayesian inference for causal effects: The role of randomization. Annals of Statistics, 6, 34–58.

Rubin, D. B. (2005). Causal inference using potential outcomes: Design, modeling, decisions. Journal of the American Statistical Association, 100, 322–331.

Rubin, D. B. (2006). Matched sampling for causal effects. Cambridge University Press, New York.

Rubin, D. B. (2010). Reflections stimulated by the comments of Shadish (2010) and West and Thoemmes (2010). Psychological Methods, 15(1), 38–46.

Sagarin, B. J., West, S. G., Ratnikov, A., Homan, W. K., Ritchie, T. D., and Hansen, E. J. (2014). Treatment noncompliance in randomized experiments: Statistical approaches and design issues. Psychological Methods, 19(3), 317–333.

Shadish, W. R. (2014). Statistical analyses of single-case designs: The shape of things to come. Current Directions in Psychological Science, 23, 139–146.

Shadish, W. R., Clark, M. H., and Steiner, P. M. (2008). Can nonrandomized experiments yield accurate answers? A randomized experiment comparing random to nonrandom assignment (with commentary). Journal of the American Statistical Association, 103, 1334–1356.


Shadish, W. R., and Cook, T. D. (1999). Design rules: More steps towards a complete theory of quasi-experimentation. Statistical Science, 14, 294–300.

Shadish, W. R., Cook, T. D., and Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Wadsworth, Boston.

Shadish, W. R., Galindo, R., Wong, V. C., Steiner, P. M., and Cook, T. D. (2011). A randomized experiment comparing random to cutoff-based assignment. Psychological Methods, 16(2), 179–191.

Shadish, W. R., Hedges, L. V., Pustejovsky, J., Rindskopf, D. M., Boyajian, J. G., and Sullivan, K. J. (2014). Analyzing single-case designs: d, G, multilevel models, Bayesian estimators, generalized additive models, and the hopes and fears of researchers about analyses. In T. R. Kratochwill and J. R. Levin (Eds.), Single-Case Intervention Research: Methodological and Statistical Advances (pp. 247–281). American Psychological Association, Washington, D.C.

Shadish, W. R., and Sullivan, K. (2012). Theories of causation in psychological science. In H. Cooper (Ed.), APA Handbook of Research Methods in Psychology (Vol. 1, pp. 23–52). American Psychological Association, Washington, D.C.

St. Clair, T., Cook, T. D., and Hallberg, K. (2014). Examining the internal validity and statistical precision of the comparative interrupted time series design by comparison with a randomized experiment. American Journal of Evaluation, 35(3), 311–327.

Thistlethwaite, D. L., and Campbell, D. T. (1960). Regression-discontinuity analysis: An alternative to the ex post facto experiment. Journal of Educational Psychology, 51, 309–317.

Thoemmes, F. J., and West, S. G. (2011). The use of propensity scores for nonrandomized designs with clustered data. Multivariate Behavioral Research, 46(3), 514–543.

West, S. G., and Koch, T. (2014). Restoring causal analysis to structural equation modeling. Review of Judea Pearl, Causality: Models, Reasoning, and Inference (2nd ed.). Structural Equation Modeling, 21, 161–166.

West, S. G., and Thoemmes, F. (2010). Campbell’s and Rubin’s perspectives on causal inference. Psychological Methods, 15, 18–37.

West, S. G., Cham, H., Thoemmes, F., Renneberg, B., Schulze, J., and Weiler, M. (2014). Propensity scores as a basis for equating groups: Basic principles and application in clinical treatment outcome research. Journal of Consulting and Clinical Psychology, 82(5), 906–919.

Yang, M., and Maxwell, S. E. (2014). Treatment effects in randomized longitudinal trials with different types of non-ignorable dropout. Psychological Methods, 19, 188–210.
