
Observational Studies 6 (2020) 1-9 Submitted 1965; Published reprinted, 1/20

The environment and disease: association or causation?

Sir Austin Bradford Hill

Editor’s Note: Sir Austin Bradford Hill was Professor of Medical Statistics, University of London, United Kingdom. This article was originally published in the Proceedings of the Royal Society of Medicine, May 1965, 58, 295-300. This paper is reprinted with permission of the copyright holder, Sage Publications. New comments by the following researchers follow: Peter Armitage; Mike Baiocchi; Samantha Kleinberg; James O’Malley; Chris Phillips and Joel Greenhouse; Kenneth Rothman; Herb Smith; Tyler VanderWeele; Noel Weiss; and William Yeaton.

© 2020 Sage Publications.

Among the objects of this newly founded Section of Occupational Medicine are: first, ‘to provide a means, not readily afforded elsewhere, whereby physicians and surgeons with a special knowledge of the relationship between sickness and injury and conditions of work may discuss their problems, not only with each other, but also with colleagues in other fields, by holding joint meetings with other Sections of the Society’; and second, ‘to make available information about the physical, chemical and psychological hazards of occupation, and in particular about those that are rare or not easily recognized’. At this first meeting of the Section and before, with however laudable intentions, we set about instructing our colleagues in other fields, it will be proper to consider a problem fundamental to our own. How in the first place do we detect these relationships between sickness, injury and conditions of work? How do we determine what are physical, chemical and psychological hazards of occupation, and in particular those that are rare and not easily recognised? There are, of course, instances in which we can reasonably answer these questions from the general body of medical knowledge. A particular, and perhaps extreme, physical environment cannot fail to be harmful; a particular chemical is known to be toxic to man and therefore suspect on the factory floor. Sometimes, alternatively, we may be able to consider what might a particular environment do to man, and then see whether such consequences are indeed to be found. But more often than not we have no such guidance, no such means of proceeding; more often than not we are dependent upon our observation and enumeration of defined events for which we then seek antecedents. In other words, we see that the event B is associated with the environmental feature A, that, to take a specific example, some form of respiratory illness is associated with a dust in the environment.
In what circumstances can we pass from this observed association to a verdict of causation? Upon what basis should we proceed to do so? I have no wish, nor the skill, to embark upon a philosophical discussion of the meaning of ‘causation’. The ‘cause’ of illness may be immediate and direct, it may be remote and indirect underlying the observed association. But with the aims of occupational, and almost synonymously preventive, medicine in mind, the decisive question is whether the frequency of the undesirable event B will be influenced by a change in the environmental feature A. How such a change exerts that influence may call for a great deal of research. However, before deducing ‘causation’ and taking action, we shall not invariably have to sit around awaiting the results of that research. The whole chain may have to be unravelled or a few links may suffice. It will depend upon circumstances. Disregarding then any such problem in semantics we have this situation. Our observations reveal an association between two variables, perfectly clear-cut and beyond what we would care to attribute to the play of chance. What aspects of that association should we especially consider before deciding that the most likely interpretation of it is causation?

1. Strength: First upon my list, I would put the strength of the association. To take a very old example, by comparing the occupations of patients with scrotal cancer with the occupations of patients presenting with other diseases, Percival Pott could reach a correct conclusion because of the enormous increase of scrotal cancer in the chimney sweeps. ‘Even as late as the second decade of the twentieth century’, writes Richard Doll, ‘the mortality of chimney sweeps from scrotal cancer was some 200 times that of workers who were not specially exposed to tar or mineral oils and in the eighteenth century the relative difference is likely to have been much greater’ (Doll, 1964). To take a more modern and more general example upon which I have now reflected for over 15 years, prospective inquiries into smoking have shown that the death rate from cancer of the lung in cigarette smokers is nine to 10 times the rate in non-smokers and the rate in heavy cigarette smokers is 20 to 30 times as great. On the other hand, the death rate from coronary thrombosis in smokers is no more than twice, possibly less, the death rate in non-smokers. Though there is good evidence to support causation it is surely much easier in this case to think of some features of life that may go hand-in-hand with smoking – features that might conceivably be the real underlying cause or, at the least, an important contributor, whether it be lack of exercise, nature of diet or other factors. But to explain the pronounced excess in cancer of the lung in any other environmental terms requires some feature of life so intimately linked with cigarette smoking and with the amount of smoking that such a feature should be easily detectable. If we cannot detect it or reasonably infer a specific one, then in such circumstances, I think we are reasonably entitled to reject the vague contention of the armchair critic ‘you can’t prove it, there may be such a feature’.
Certainly in this situation, I would reject the argument sometimes advanced that what matters is the absolute difference between the death rates of our various groups and not the ratio of one to other. That depends upon what we want to know. If we want to know how many extra deaths from cancer of the lung will take place through smoking (i.e. presuming causation), then obviously we must use the absolute differences between the death rates – 0.07 per 1000 per year in non-smoking doctors, 0.57 in those smoking 1–14 cigarettes daily, 1.39 for 15–24 cigarettes daily and 2.27 for 25 or more daily. But it does not follow here, or in more specifically occupational problems, that this best measure of the effect upon mortality is also the best measure in relation to aetiology. In this respect, the ratios of 8, 20 and 32 to 1 are far more informative. It does not, of course, follow that the differences revealed by ratios are of any practical importance. Maybe they are, maybe they are not; but that is another point altogether. We may recall John Snow’s classic analysis of the opening weeks of the cholera epidemic of 1854 (Snow, 1855). The death rate that he recorded in the customers supplied with the grossly polluted water of the Southwark and Vauxhall Company was in truth quite low – 71 deaths in each 10,000 houses. What stands out vividly is the fact that the small rate is 14 times the figure of five deaths per 10,000 houses supplied with the sewage-free water of the rival Lambeth Company. In thus putting emphasis upon the strength of an association, we must, nevertheless, look at the obverse of the coin. We must not be too ready to dismiss a cause-and-effect hypothesis merely on the grounds that the observed association appears to be slight. There are many occasions in medicine when this is in truth so. Relatively few persons harbouring the meningococcus fall sick of meningococcal meningitis. Relatively few persons occupationally exposed to rat’s urine contract Weil’s disease.
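
The distinction drawn above between absolute differences and ratios can be checked with a few lines of arithmetic. The following is only an illustrative sketch using the rates quoted in the text; the helper name `compare` and the group labels are assumptions introduced here, not part of the lecture.

```python
# Lung cancer death rates per 1000 per year among doctors, as quoted in the text.
rates = {
    "non-smokers": 0.07,
    "1-14 cigarettes daily": 0.57,
    "15-24 cigarettes daily": 1.39,
    "25 or more daily": 2.27,
}
baseline = rates["non-smokers"]


def compare(rate, reference):
    """Return (absolute difference, ratio) of a rate against a reference rate.

    Hypothetical helper for illustration only.
    """
    return rate - reference, rate / reference


for group, rate in rates.items():
    if group == "non-smokers":
        continue
    diff, ratio = compare(rate, baseline)
    # The difference speaks to the burden of extra deaths,
    # the ratio to the strength of the association.
    print(f"{group}: excess {diff:.2f} per 1000 per year, ratio {ratio:.0f} to 1")

# Snow's 1854 figures: deaths per 10,000 houses for the two water companies.
snow_diff, snow_ratio = compare(71, 5)
print(f"Southwark and Vauxhall vs Lambeth: excess {snow_diff}, ratio {snow_ratio:.0f} to 1")
```

The rounded ratios reproduce the 8, 20 and 32 to 1 cited for smoking and the 14 to 1 cited for Snow's data, while the absolute excesses in both cases remain small.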

2. Consistency: Next on my list of features to be specially considered, I would place the consistency of the observed association. Has it been repeatedly observed by different persons, in different places, circumstances and times? This requirement may be of special importance for those rare hazards singled out in the Section’s terms of reference. With many alert minds at work in industry today many an environmental association may be thrown up. Some of them on the customary tests of statistical significance will appear to be unlikely to be due to chance. Nevertheless, whether chance is the explanation or whether a true hazard has been revealed may sometimes be answered only by a repetition of the circumstances and the observations. Returning to my more general example, the Advisory Committee to the Surgeon-General of the United States Public Health Service found the association of smoking with cancer of the lung in 29 retrospective and seven prospective inquiries (US Department of Health, Education, and Welfare, 1964). The lesson here is that broadly the same answer has been reached in quite a wide variety of situations and techniques. In other words, we can justifiably infer that the association is not due to some constant error or fallacy that permeates every inquiry. And we have indeed to be on our guard against that. Take, for instance, an example given by Heady (Heady, 1958). Patients admitted to hospital for operation for peptic ulcer are questioned about recent domestic anxieties or crises that may have precipitated the acute illness. As controls, patients admitted for operation for a simple hernia are similarly quizzed. But, as Heady points out, the two groups may not be in pari materia. If your wife ran off with the lodger last week, you still have to take your perforated ulcer to hospital without delay. But with a hernia you might prefer to stay at home for a while – to mourn (or celebrate) the event.
No number of exact repetitions would remove or necessarily reveal that fallacy. We have, therefore, the somewhat paradoxical position that the different results of a different inquiry certainly cannot be held to refute the original evidence; yet the same results from precisely the same form of inquiry will not invariably greatly strengthen the original evidence. I would myself put a good deal of weight upon similar results reached in quite different ways, e.g. prospectively and retrospectively. Once again looking at the obverse of the coin, there will be occasions when repetition is absent or impossible and yet we should not hesitate to draw conclusions. The experience of the nickel refiners of South Wales is an outstanding example. I quote from the Alfred Watson Memorial Lecture that I gave in 1962 to the Institute of Actuaries:

The population at risk, workers and pensioners, numbered about one thousand. During the ten years 1929 to 1938, sixteen of them had died from cancer of the lung, eleven of them had died from cancer of the nasal sinuses. At the age specific death rates of England and Wales at that time, one might have anticipated one death from cancer of the lung (to compare with the 16), and a fraction of a death from cancer of the nose (to compare with the 11). In all other bodily sites cancer had appeared on the death certificate 11 times and one would have expected it to do so 10-11 times. There had been 67 deaths from all other causes of mortality and over the ten years’ period 72 would have been expected at the national death rates. Finally division of the population at risk in relation to their jobs showed that the excess of cancer of the lung and nose had fallen wholly upon the workers employed in the chemical processes. More recently my colleague, Dr Richard Doll, has brought this story a stage further. In the nine years 1948 to 1956 there had been, he found, 48 deaths from cancer of the lung and 13 deaths from cancer of the nose. He assessed the numbers expected at normal rates of mortality as, respectively, 10 and 0.1. In 1923, long before any special hazard had been recognized, certain changes in the refinery took place. No case of cancer of the nose has been observed in any man who first entered the works after that year, and in these men there has been no excess of cancer of the lung. In other words, the excess in both sites is uniquely a feature in men who entered the refinery in, roughly, the first 23 years of the present century. No causal agent of these neoplasms has been identified. Until recently no animal experimentation had given any clue or any support to this wholly statistical evidence. Yet I wonder if any of us would hesitate to accept it as proof of a grave industrial hazard? (Hill, 1962)

In relation to my present discussion, I know of no parallel investigation. We have (or certainly had) to make up our minds on a unique event; and there is no difficulty in doing so.
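
The comparison of observed with expected deaths in the quoted passage can be summarised as observed-to-expected ratios. The sketch below is illustrative only and uses the 1929 to 1938 figures; the midpoint 10.5 for the expected "10-11" deaths is an assumption made here, and cancer of the nose is omitted because the text gives its expected number only as "a fraction of a death".

```python
# (observed, expected) deaths among the nickel refiners, 1929-1938, as quoted.
# ASSUMPTION: the expected "10-11" deaths for other cancer sites is taken as 10.5.
deaths_1929_1938 = {
    "cancer of the lung": (16, 1.0),
    "cancer of other sites": (11, 10.5),
    "all other causes": (67, 72.0),
}


def observed_to_expected(observed, expected):
    """Ratio of observed deaths to those expected at national rates (hypothetical helper)."""
    return observed / expected


ratios = {
    cause: observed_to_expected(obs, exp)
    for cause, (obs, exp) in deaths_1929_1938.items()
}

for cause, ratio in ratios.items():
    print(f"{cause}: observed/expected = {ratio:.2f}")
```

Under these assumptions the ratio for lung cancer is 16, while the ratios for other cancers and for all other causes sit close to 1, which is the contrast the passage relies upon.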

3. Specificity: One reason, needless to say, is the specificity of the association, the third characteristic which invariably we must consider. If, as here, the association is limited to specific workers and to particular sites and types of disease and there is no association between the work and other modes of dying, then clearly that is a strong argument in favour of causation.


We must not, however, overemphasise the importance of the characteristic. Even in my present example there is a cause-and-effect relationship with two different sites of cancer – the lung and the nose. Milk as a carrier of and, in that sense, the cause of disease can produce such a disparate galaxy as scarlet fever, diphtheria, tuberculosis, undulant fever, sore throat, dysentery and typhoid fever. Before the discovery of the underlying factor, the bacterial origin of disease, harm would have been done by pushing too firmly the need for specificity as a necessary feature before convicting the dairy. Coming to modern times, the prospective investigations of smoking and cancer of the lung have been criticised for not showing specificity – in other words, the death rate of smokers is higher than the death rate of non-smokers from many causes of death (though in fact the results of Doll and Hill (1964) do not show that). But here surely one must return to my first characteristic, the strength of the association. If other causes of death are raised 10, 20 or even 50% in smokers whereas cancer of the lung is raised 900–1000% we have specificity – a specificity in the magnitude of the association. We must also keep in mind that diseases may have more than one cause. It has always been possible to acquire a cancer of the scrotum without sweeping chimneys or taking to mule-spinning in Lancashire. One-to-one relationships are not frequent. Indeed, I believe that multi-causation is generally more likely than single causation though possibly if we knew all the answers we might get back to a single factor. In short, if specificity exists we may be able to draw conclusions without hesitation; if it is not apparent, we are not thereby necessarily left sitting irresolutely on the fence.

4. Temporality: My fourth characteristic is the temporal relationship of the association – which is the cart and which the horse? This is a question which might be particularly relevant with diseases of slow development. Does a particular diet lead to disease or do the early stages of the disease lead to those peculiar dietetic habits? Does a particular occupation or occupational environment promote infection by the tubercle bacillus or are the men and women who select that kind of work more liable to contract tuberculosis whatever the environment – or, indeed, have they already contracted it? This temporal problem may not arise often but it certainly needs to be remembered, particularly with selective factors at work in industry.

5. Biological gradient: Fifth, if the association is one which can reveal a biological gradient, or dose-response curve, then we should look most carefully for such evidence. For instance, the fact that the death rate from cancer of the lung rises linearly with the number of cigarettes smoked daily adds a very great deal to the simpler evidence that cigarette smokers have a higher death rate than non-smokers. That comparison would be weakened, though not necessarily destroyed, if it depended upon, say, a much heavier death rate in light smokers and a lower rate in heavier smokers. We should then need to envisage some much more complex relationship to satisfy the cause-and-effect hypothesis. The clear dose-response curve admits of a simple explanation and obviously puts the case in a clearer light.


The same would clearly be true of an alleged dust hazard in industry. The dustier the environment the greater the incidence of disease we would expect to see. Often the difficulty is to secure some satisfactory quantitative measure of the environment which will permit us to explore this dose-response. But we should invariably seek it.
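
As a rough illustration of what looking for a gradient involves, the lung cancer death rates quoted under Strength can be checked for a steady rise across the ordered smoking categories. This is a minimal sketch; the function name is an assumption introduced here, and a simple ordering check is a much weaker tool than a fitted dose-response curve.

```python
# Lung cancer death rates per 1000 per year, ordered from lowest to highest
# exposure: non-smokers, 1-14, 15-24, and 25 or more cigarettes daily.
ordered_rates = [0.07, 0.57, 1.39, 2.27]


def is_monotone_increasing(values):
    """True if each value is strictly greater than the one before it (hypothetical helper)."""
    return all(a < b for a, b in zip(values, values[1:]))


# Increase in the death rate from each exposure category to the next.
steps = [round(b - a, 2) for a, b in zip(ordered_rates, ordered_rates[1:])]

print("gradient present:", is_monotone_increasing(ordered_rates))
print("step increases per 1000 per year:", steps)
```

A reversal of the kind described above, with a heavier death rate in light smokers than in heavy smokers, would make the ordering check fail and call for a more complex explanation.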

6. Plausibility: It will be helpful if the causation we suspect is biologically plausible. But this is a feature I am convinced we cannot demand. What is biologically plausible depends upon the biological knowledge of the day. To quote again from my Alfred Watson Memorial Lecture (Hill, 1962), there was

...no biological knowledge to support (or to refute) Pott’s observation in the 18th century of the excess of cancer in chimney sweeps. It was lack of biological knowledge in the 19th that led a prize essayist writing on the value and the fallacy of statistics to conclude, amongst other ‘absurd’ associations, that ‘it could be no more ridiculous for the stranger who passed the night in the steerage of an emigrant ship to ascribe the typhus, which he there contracted, to the vermin with which bodies of the sick might be infected’. And coming to nearer times, in the 20th century there was no biological knowledge to support the evidence against rubella.

In short, the association we observe may be one new to science or medicine and we must not dismiss it too light-heartedly as just too odd. As Sherlock Holmes advised Dr Watson, ‘when you have eliminated the impossible, whatever remains, however improbable, must be the truth’.

7. Coherence: On the other hand the cause-and-effect interpretation of our data should not seriously conflict with the generally known facts of the natural history and biology of the disease – in the expression of the Advisory Committee to the Surgeon-General it should have coherence. Thus in the discussion of lung cancer, the Committee finds its association with cigarette smoking coherent with the temporal rise that has taken place in the two variables over the last generation and with the sex difference in mortality – features that might well apply in an occupational problem. The known urban/rural ratio of lung cancer mortality does not detract from coherence, nor the restriction of the effect to the lung. Personally, I regard as greatly contributing to coherence the histopathological evidence from the bronchial epithelium of smokers and the isolation from cigarette smoke of factors carcinogenic for the skin of laboratory animals. Nevertheless, while such laboratory evidence can enormously strengthen the hypothesis and, indeed, may determine the actual causative agent, the lack of such evidence cannot nullify the epidemiological observations in man. Arsenic can undoubtedly cause cancer of the skin in man but it has never been possible to demonstrate such an effect on any other animal. In a wider field, John Snow’s epidemiological observations on the conveyance of cholera by the water from the Broad Street pump would have been put almost beyond dispute if Robert Koch had been then around to isolate the vibrio from the baby’s nappies, the well itself and the gentleman in delicate health from Brighton. Yet the fact that Koch’s work was to be awaited another 30 years did not really weaken the epidemiological case though it made it more difficult to establish against the criticisms of the day – both just and unjust.

8. Experiment: Occasionally, it is possible to appeal to experimental, or semi-experimental, evidence. For example, because of an observed association some preventive action is taken. Does it in fact prevent? The dust in the workshop is reduced, lubricating oils are changed, persons stop smoking cigarettes. Is the frequency of the associated events affected? Here the strongest support for the causation hypothesis may be revealed.

9. Analogy: In some circumstances, it would be fair to judge by analogy. With the effects of thalidomide and rubella before us, we would surely be ready to accept slighter but similar evidence with another drug or another viral disease in pregnancy.

Here, then, are nine different viewpoints from all of which we should study association before we cry causation. What I do not believe – and this has been suggested – is that we can usefully lay down some hard-and-fast rules of evidence that must be obeyed before we accept cause and effect. None of my nine viewpoints can bring indisputable evidence for or against the cause-and-effect hypothesis and none can be required as a sine qua non. What they can do, with greater or less strength, is to help us to make up our minds on the fundamental question – is there any other way of explaining the set of facts before us, is there any other answer equally, or more, likely than cause and effect?

Tests of Significance

No formal tests of significance can answer those questions. Such tests can, and should, remind us of the effects that the play of chance can create, and they will instruct us in the likely magnitude of those effects. Beyond that they contribute nothing to the ‘proof’ of our hypothesis. Nearly 40 years ago, among the studies of occupational health that I made for the Industrial Health Research Board of the Medical Research Council was one that concerned the workers in the cotton-spinning mills of Lancashire (Hill, 1930). The question that I had to answer, by the use of the National Health Insurance records of that time, was this: Do the workers in the cardroom of the spinning mill, who tend the machines that clean the raw cotton, have a sickness experience in any way different from that of other operatives in the same mills who are relatively unexposed to the dust and fibre that were features of the cardroom? The answer was an unqualified ‘Yes’. From age 30 to age 60, the cardroom workers suffered over three times as much from respiratory causes of illness whereas from non-respiratory causes their experience was not different from that of the other workers. This pronounced difference with the respiratory causes was derived not from abnormally long periods of sickness but rather from an excessive number of repeated absences from work of the cardroom workers. All this has rightly passed into the limbo of forgotten things. What interests me today is this: My results were set out for men and women separately and for half a dozen age groups in 36 tables. So there were plenty of sums. Yet I cannot find that anywhere I thought it necessary to use a test of significance. The evidence was so clear-cut, the differences between the groups were mainly so large, the contrast between respiratory and non-respiratory causes of illness so specific, that no formal tests could really contribute anything of value to the argument. So why use them? Would we think or act that way today? I rather doubt it. Between the two world wars there was a strong case for emphasising to the clinician and other research workers the importance of not overlooking the effects of the play of chance upon their data. Perhaps too often generalities were based upon two men and a laboratory dog while the treatment of choice was deduced from a difference between two bedfuls of patients and might easily have no true meaning. It was therefore a useful corrective for statisticians to stress, and to teach the need for, tests of significance merely to serve as guides to caution before drawing a conclusion, before inflating the particular to the general. I wonder whether the pendulum has not swung too far – not only with the attentive pupils but even with the statisticians themselves. To decline to draw conclusions without standard errors can surely be just as silly? Fortunately, I believe we have not yet gone so far as our friends in the USA where, I am told, some editors of journals will return an article because tests of significance have not been applied. Yet there are innumerable situations in which they are totally unnecessary – because the difference is grotesquely obvious, because it is negligible, or because, whether it be formally significant or not, it is too small to be of any practical importance. What is worse the glitter of the t table diverts attention from the inadequacies of the fare. Only a tithe, and an unknown tithe, of the factory personnel volunteer for some procedure or interview, 20% of patients treated in some particular way are lost to sight, 30% of a randomly-drawn sample are never contacted.
The sample may, indeed, be akin to that of the man who, according to Swift, ‘had a mind to sell his house and carried a piece of brick in his pocket, which he showed as a pattern to encourage purchasers’. The writer, the editor and the reader are unmoved. The magic formulae are there. Of course, I exaggerate. Yet too often I suspect we waste a deal of time, we grasp the shadow and lose the substance, we weaken our capacity to interpret data and to take reasonable decisions whatever the value of P. And far too often we deduce ‘no difference’ from ‘no significant difference’. Like fire, the χ² test is an excellent servant and a bad master.

The case for action

Finally, in passing from association to causation I believe in ‘real life’, we shall have to consider what flows from that decision. On scientific grounds, we should do no such thing. The evidence is there to be judged on its merits and the judgment (in that sense) should be utterly independent of what hangs upon it – or who hangs because of it. But in another and more practical sense we may surely ask what is involved in our decision. In occupational medicine our object is usually to take action. If this be operative cause and that be deleterious effect, then we shall wish to intervene to abolish or reduce death or disease. While that is a commendable ambition it almost inevitably leads us to introduce differential standards before we convict. Thus, on relatively slight evidence, we might decide to restrict the use of a drug for early-morning sickness in pregnant women. If we are wrong in deducing causation from association no great harm will be done. The good lady and the pharmaceutical industry will doubtless survive.


On fair evidence, we might take action on what appears to be an occupational hazard, e.g. we might change from a probably carcinogenic oil to a non-carcinogenic oil in a limited environment and without too much injustice if we are wrong. But we should need very strong evidence before we made people burn a fuel in their homes that they do not like or stop smoking the cigarettes and eating the fats and sugar that they do like. In asking for very strong evidence I would, however, repeat emphatically that this does not imply crossing every ‘t’, and swords with every critic, before we act. All scientific work is incomplete – whether it be observational or experimental. All scientific work is liable to be upset or modified by advancing knowledge. That does not confer upon us a freedom to ignore the knowledge we already have, or to postpone the action that it appears to demand at a given time. Who knows, asked Robert Browning, but the world may end tonight? True, but on available evidence most of us make ready to commute on the 8.30 next day.

References

Doll, R. (1964). Cancer. In: Witts, L.J. (ed.) Medical Surveys and Clinical Trials, 2nd ed. London: Oxford University Press, p. 333.
Doll, R. and Hill, A.B. (1964). Mortality in relation to smoking: ten years’ observations of British doctors. British Medical Journal, 1, 1399-1410.
Heady, J.A. (1958). False figuring: statistical method in medicine. Med World Lond, 89, 305.
Hill, A.B. (1930). Sickness Amongst Operatives in Lancashire Spinning Mills. Industrial Health Research Board Report No. 59. London: HMSO.
Hill, A.B. (1962). The statistician in medicine. (Alfred Watson Memorial Lecture). J Inst Actuar, 88, 178–191.
Snow, J. (1855). On the Mode of Communication of Cholera. 2nd edn. London: John Churchill (reprinted 1936, New York).
US Department of Health, Education, and Welfare (1964). Smoking and Health. Public Health Service Publication No. 1103. Washington.

Observational Studies 6 (2020) 10 Submitted 7/19; Published 1/20

Commentary on Lecture by Sir Austin Bradford Hill

Peter Armitage
Wallingford, Oxfordshire, U.K.

This lecture is quoted widely as the source for Bradford Hill’s principles, or criteria, for favouring causation in epidemiological studies as distinct from mere association between an environmental factor and disease. The criteria were also proposed in an edition of his textbook on medical statistics published later (Hill, 1977). A careful analysis of the arguments is provided by Rothman and Greenland (1998). The lecture was given a few years after Hill’s retirement, and perhaps indicates a return to the themes that would have permeated his thoughts in the pre-war period, when he was actively involved in occupational medicine, before his important advocacy of randomized trials which occupied much of his time in the 1950s. The abundance in this paper of examples of contentious issues in occupational medicine suggests that these had never been far from his mind. Is it also, perhaps, an indication that he thought of himself primarily as an epidemiologist rather than a statistician? It is interesting that statistical analysis does not occur as the central theme of any of the nine criteria, but there is an important final section on significance tests, which he regards as being irrelevant to the main issue. This may have been an offshoot of his desire to discuss the problem as one of common sense, avoiding technical detail. One wonders, though, whether significance tests can be wholly ignored. If all the other criteria are satisfied the results of a test may be sufficiently ambiguous as to affect the conclusions and perhaps lead the investigators to extend the study before publicizing the results. Bradford Hill was a persuasive and perceptive advocate, and his writings are a joy to read. I frequently read in the press of large studies which demonstrate that people who have particular habits suffer more from some disease, without any explanation of the thorny distinction between association and causation.
Let us hope that this reissue of Bradford Hill’s classic lecture will achieve some of its author’s intentions.

References

Hill, A.B. (1977). A Short Textbook of Medical Statistics. London: Hodder and Stoughton.
Rothman, K.J. and Greenland, S. (1998). Hill’s Criteria for Causality. In Encyclopedia of Biostatistics (ed. P. Armitage and T. Colton), Vol. 3. London: Wiley.

© 2020 Peter Armitage.

Observational Studies 6 (2020) 11-16 Submitted 12/19; Published 1/20

Following Bradford Hill

Mike Baiocchi
Department of Epidemiology and Population Health
Stanford University
Stanford, CA 94305, USA

Abstract

In 1965, Sir Austin Bradford Hill offered his thoughts on: “What aspects of [an] association should we especially consider before deciding that the most likely interpretation of it is causation?” He proposed nine means for reasoning about the association, which he named as: strength, consistency, specificity, temporality, biological gradient, plausibility, coherence, experiment, and analogy. In this paper, we look at what motivated Bradford Hill to propose we focus on these nine features. We contrast Bradford Hill’s approach with a currently fashionable framework for reasoning about statistical associations – the Common Task Framework. We then suggest why following Bradford Hill, 50+ years on, is still extraordinarily reasonable.

Keywords: Causality, , Common Task Framework

1. Reading with context

It feels odd writing about a paper that is more than 50 years old, particularly inside of a discipline that is currently undergoing extraordinary growth and innovation. But the “Bradford Hill criteria” (Bradford Hill, 1965) occupy a particularly prominent peak for those of us interested in making decisions that hinge on causal claims. To some, Bradford Hill laid out friendly signposts that suggest safer paths to achieving solid inferences about causal connections. To more, the “Bradford Hill criteria” are only stood up in order to be knocked right back down; the nine “criteria” are introduced and logical holes are punched through until students are left with the impression that hemming in causality with rules is a fool’s errand. And yet, this paper persists. In fact, I was discussing this paper the other day with a colleague who told a fascinating story about when she served as an expert witness for a defense team in some legal case or another. The defense attorneys wanted her to work through the plausibility of each of the criteria. By report, it sounded like a deeply interesting (and lucrative) exercise in careful thinking. Her story got me curious so I dug into the legal world – and lo and behold – there are citations, and guides, and warnings about both deploying the Bradford Hill criteria in one’s arguments before the court, as well as detailed guides on how to counter opposing counsel’s expert witness’s use of the criteria. These ideas seem to have sprouted legs and scurried out of our exclusive domain and into others, even while still kicking up heated exchanges in our own academic literature (Phillips and Goodman (2004), Höfler (2005), Phillips and Goodman (2006), Höfler (2006) – as you might be able to guess from the alternation of authors, that’s a fun exchange). So what’s going on here?

© 2020 Mike Baiocchi.

If you encountered Bradford Hill’s ideas in a setting disconnected from the original manuscript then it may help unlock a bit more of his meaning by considering his audience. Bradford Hill first gave his remarks to the newly formed Section of Occupational Medicine. In context, these ideas were offered to medical practitioners charged with making complex decisions about health but with an eye toward bottom line economics, employment dynamics, and (to some degree) consumer tastes. His audience was not the usual data-analysts we think of, concerned with uncertainty intervals and p-values. Bradford Hill tells us this directly as he sets the table: “[Suppose] our observations reveal an association between two variables, perfectly clear-cut and beyond what we would care to attribute to the play of chance. What aspects of that association should we especially consider before deciding that the most likely interpretation of it is causation?” Using modern terminology, we might say Bradford Hill is quite a bit less interested in statistical inference and more interested in study design considerations. Though even that terminology does not get quite at the nub of his line of reasoning. To get closer to the flow of his arguments, let’s jump over all the important bits in the middle and pull from his concluding section: The Case for Action. Again, letting the man speak for himself: “Finally, in passing from association to causation I believe in ‘real life’ we shall have to consider what flows from that decision... In occupational medicine our object is usually to take action. . . While that is a commendable ambition it almost inevitably leads us to introduce differential standards before we convict.” What follows is a discussion of balances – how does the strength of evidence enter into the decision to forbid/compel people to take actions? The size of the effect, levels of certainty, chains of consequences that may arise from our (in)actions – these all need to be weighed out.
If he stopped his reasoning at that level of thought then this manuscript likely would have resolved into some kind of call for a better decision-theoretic framework. But that’s not where Bradford Hill took the argument, and this is why this paper is still fascinating all these years later. As far as I can discern, there are two additional tensions he is tracking. The first is the tension he highlights quite a bit which is the need for action right now, which he contrasts with the academic’s slower, more careful building of evidence toward solid, scientific conclusions. More interesting (at least to me and my contemporary eyes) is his focus on convincing people.

2. Reasoning with context

There are several ways to think about what we – those of us who are interested in making empirically rigorous, positive change in the world – are doing. Perhaps we are mathematizing the scientific method, providing crisp, quantified boundaries on what can be known and how best to empirically know it. Maybe we merely clear away the rubbish others bring into the Ivory Tower. These are recognizable roles we play in academic settings. But Bradford Hill is reminding us of a more fundamental role: we do all this to convince people. We may believe that rigorous evidence will compel, but it won’t. Look at the absurdity of climate-change denial, or the rates of anti-vax. Change does not – exclusively – arise from rigorous empirical conclusions. One way to understand the challenge of convincing people is to unpack why we tend to formalize and create decision rules. The first reason we formalize is to make discovery
more productive. When a new causal discovery emerges it often feels a bit shocking and it’s natural to marvel at its departure from what has come before. For us, data-analysts, we tend to focus on the methods by which the discovery was achieved – which can be even more novel than the underlying causal discovery. But quickly, what used to seem novel and cutting-edge in science gets pulled into the core – what used to be “art” becomes codified and reproducible. This process is wildly productive, allowing many researchers to explore new directions that previously only the cleverest could. The second reason we tend to formalize requires some insight about how humans make decisions and assign responsibility: if we can formalize these kinds of causal-discovery methods, then we shift the burden of responsibility for declaring discovery outside of the individual (e.g., located in the idiosyncrasies of both the situation under investigation and the researcher making the claim) and into the general (e.g., rules that are recognizable across settings but also rules that have buy-in from the researchers or policy makers or stakeholders in our domain who will be impacted by our empirical investigations). Formalized decision rules that standardize discovery and regulate our claims on the strength of conclusions – in a manner that is much like laws – help communities set standards and make planning and settling disagreements easier and less arbitrary. 
In fact, looking at several of Bradford Hill’s “criteria” with modern eyes, you can see how his insights have been formalized with statistical methodologies (to pick a few): (i) “strength of effect” looks quite a bit like the thinking behind modern sensitivity analysis addressing unobserved confounders; (ii) “consistency” looks like meta-analysis, replicability, and transportability of effect; (iii) even the initially surprising appeal to “analogy” that he suggests has found some formalization in the work on “transfer learning” in machine learning. These additions to our formalized tool set are great. But, again, formalization is not what Bradford Hill is interested in. In fact, he takes some wonderfully cheeky shots at formalized decision making. He suggests that using statistical processes allows decision-makers to obscure and shirk their critical responsibilities. The tension that keeps Bradford Hill’s argument fresh is the one that makes many of us excited about doing our jobs: figuring out how to bring new discoveries to the larger community. When a debate is vital and complex, when the stakes really matter then how do we reach solid conclusions that will be strong enough to win over our colleagues and those impacted by our conclusions? If you’re in the position Bradford Hill was, talking to a room full of physicians interested in Occupational Medicine, then you’ll understand the tension between the kinds of formalized rules that rigorous statistical analysis provides and the kinds of arguments used to convince and debate in the larger (less technical) community. A concrete example: given our analysis, we believe we should order a popular agricultural product removed from the market. If we decide to act on this belief then a new rule will come into existence. There will be many “losers” in this new regime, and a number of them will need to change their behaviors. How do we explain this rule? How do we get buy-in for this rule? 
The less familiar the logic used to create the rule is to the people on the receiving end then the less likely they will engage in the required change in behavior.

For a moment, let’s pause this unpacking of Bradford Hill’s manuscript and move forward to approximately contemporary time. One of the dominant modes of reasoning in contemporary data analysis – the Common Task Framework (CTF) – serves as an extraordinary contrast to Bradford Hill’s ideas; it’s worth exploring the CTF to better understand Bradford Hill.

3. Reasoning without context

The productivity, and explosive improvements, in statistical prediction (“machine learning”) have rightly stood out in fields touched by data science. (If we’re being honest then to some degree it has also caused some feelings of anxiety inside those of us less inclined to flashy prediction, and more enamored of the slower accumulation of information in fields interested in .) If, like me, you are less familiar with the field of prediction then you’re likely even less familiar with the epistemological engine that has powered its growth. The Common Task Framework (Liberman (2015); Donoho (2017)) provides a fast, low-barriers-to-entry way for analysts to debate which algorithm performs “best” on a given data set. The CTF is an alternative way of assessing the suitability of an algorithm; it stands in contrast to the more traditional methods like mathematical theorems or simulations from a given data generating function. Even if the CTF name is unfamiliar you’ve likely heard of this dynamic; the Netflix Prize (Bennett and Lanning, 2007) was an excellent example of this framework. The key features of the CTF are (slightly modified from Donoho (2017)):

1. A publicly available training dataset involving, for each observation, a list of (possibly many) feature measurements, and an outcome for that observation.

2. A set of enrolled competitors (analysts) whose common task is to infer a prediction rule from the training data.

3. A scoring referee, to which competitors can submit their prediction rule. The referee runs the prediction rule against a testing dataset which is sequestered behind a Chinese wall. The referee objectively and automatically reports the score (prediction accuracy) achieved by the submitted rule.
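The three features above can be sketched in a few lines of Python. This is only an illustrative toy, not Donoho's or Liberman's specification: the data generator `make_data`, the `Referee` class, and the two competitor rules `fit_threshold` and `fit_majority` are all hypothetical names invented for this sketch, and the score is plain classification accuracy.

```python
import random

def make_data(n, seed):
    """Hypothetical toy data: one feature x and a binary outcome y."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        # Illustrative assumption: y is more likely to be 1 when x is positive.
        y = 1 if x + rng.gauss(0.0, 0.5) > 0 else 0
        rows.append((x, y))
    return rows

class Referee:
    """Feature 3: holds the sequestered test set and reports only a score."""
    def __init__(self, test_rows):
        self._test_rows = test_rows   # never handed to competitors
        self.leaderboard = {}

    def submit(self, name, rule):
        correct = sum(1 for x, y in self._test_rows if rule(x) == y)
        score = correct / len(self._test_rows)
        self.leaderboard[name] = score
        return score

    def ranking(self):
        # Highest prediction accuracy first.
        return sorted(self.leaderboard.items(), key=lambda kv: kv[1], reverse=True)

def fit_threshold(rows):
    """Competitor A: choose the cut-point on x with the best training accuracy."""
    def hits(cut):
        return sum(1 for x, y in rows if (1 if x > cut else 0) == y)
    best_cut = max(sorted(x for x, _ in rows), key=hits)
    return lambda x: 1 if x > best_cut else 0

def fit_majority(rows):
    """Competitor B: always predict the most common training label."""
    ones = sum(y for _, y in rows)
    label = 1 if 2 * ones >= len(rows) else 0
    return lambda x: label

training = make_data(500, seed=1)            # feature 1: public training data
referee = Referee(make_data(500, seed=2))    # feature 3: hidden test data

# Feature 2: each competitor infers a prediction rule from the training data.
referee.submit("threshold", fit_threshold(training))
referee.submit("majority", fit_majority(training))

for name, score in referee.ranking():
    print(f"{name}: accuracy {score:.3f}")
```

Note what the referee does and does not certify: it orders the submitted rules by accuracy on this one held-out data set and says nothing about why a rule works or how it would behave under intervention.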

All competitors share the common task of creating prediction rules which will receive a good score; hence the phrase common task framework. The performance metric provides an ordering that gives analysts permission to claim “this algorithm provides useful insights when used on this data set” – such claims are strongest when framed relative to other algorithms. This is where the “leaderboard” style of algorithmic development came from. Obviously, predictive models are not causal models. But it is not hard to find colleagues who have become a bit too enamored of the predictive power of this or that fantastical black box – believing a bit too much in its ability to accurately describe all possible dynamics of the data. For these folks, it is a small leap of faith to using a model like this to try answering questions about what will happen if we intervene. (Dear reader, I assure you: it pains me too.) How? Perhaps they use something like predictive margins (see Graubard and Korn, 1999) – first, setting every observational unit’s exposure level to unexposed, second, setting every observational unit’s exposure level to exposed, and then contrasting the two hypothetical groups’ outcomes. Hidden behind almost all actions taken after consulting a black box is a confidence in its ability to faithfully describe all potential configurations of the data. But where did their confidence in the model come from? The CTF allows fantastically complex, “black box” algorithms to be developed and (in a particular sense) evaluated. Without the CTF, complex algorithms – so complicated that they cannot be described mathematically – would have much weaker evidence to be trusted and thus deployed. With the CTF, we can see the performance of any algorithm vis-à-vis
any other algorithm on the same data set. The CTF has allowed algorithm developers to be extremely productive, principally by being able to avoid both slow-moving math and the kind of deeper engagement, typical of traditional causal inference analysts, with the nuanced issues that gave rise to the data. Algorithms that grow up in the CTF aren’t required to be accountable to slow-moving, tradition-bound, coherence-seeking people. In fact, there’s a very explicit line of argument inside some data communities that “human experts need to be removed from the decision-making process” – rather, the machines should do the learning because they are more likely to produce optimal results. The thinking goes: Humans are slow. Humans are hard to understand. Let’s remove humans from this process. But here’s the thing, when we stand with Bradford Hill, humans are the point. We’re trying to convince humans to change. Rules that are (in a particular sense) “optimal” are not the same as rules that are useful for effecting change. In fact, it’s easy to imagine that rules generated by “black box” algorithms (i.e., literally inexplicable) are less likely to be complied with than rules reached through consensus building and through reasoning accessible by those who are being asked to have their lives shaped by the rules. Do not mistake what I’m saying as arguing against well-reasoned, formal, quantified rules that come out of statistical analyses. Our rigorous statistical methods are the strong bones of the beast, but they don’t provide the heart, mind, and muscles that animate and make these decisions human. We are better now that we have statistical procedures that formalize many of Bradford Hill’s criteria. But these new statistical procedures do not solve the principal issue Bradford Hill was engaging: how to convince and change.
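For readers unfamiliar with the predictive-margins calculation mentioned in this section, here is a minimal Python sketch. Everything in it is assumed for illustration: `predict` is a hypothetical stand-in for whatever fitted model (black box or otherwise) the analyst trusts, its coefficients are made up, and `units` is an invented sample with a single covariate.

```python
import math

def predict(exposed, age):
    """Hypothetical stand-in for a fitted model's predicted risk of the outcome."""
    logit = -3.0 + 0.8 * exposed + 0.04 * age   # made-up coefficients
    return 1.0 / (1.0 + math.exp(-logit))

# Invented sample: observed exposure status and one covariate per unit.
units = [
    {"exposed": 1, "age": 25},
    {"exposed": 0, "age": 40},
    {"exposed": 1, "age": 55},
    {"exposed": 0, "age": 70},
]

def predictive_margin(units, exposure_level):
    """Average predicted risk if every unit were set to the same exposure level,
    keeping each unit's own covariates."""
    predictions = [predict(exposure_level, u["age"]) for u in units]
    return sum(predictions) / len(predictions)

margin_unexposed = predictive_margin(units, 0)   # step 1: everyone unexposed
margin_exposed = predictive_margin(units, 1)     # step 2: everyone exposed
contrast = margin_exposed - margin_unexposed     # step 3: contrast the two

print(f"unexposed {margin_unexposed:.3f}, exposed {margin_exposed:.3f}, "
      f"difference {contrast:.3f}")
```

The arithmetic is trivial; the leap of faith is in reading `contrast` as the effect of intervening, which holds only if the model faithfully describes outcomes under exposure configurations that may never have been observed.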

4. Following Bradford Hill

I didn’t introduce the CTF to either bury or praise it, but rather because it is a perfect example of how one might reason about data in a way that is about as far removed as possible from the way Bradford Hill advocates we reason about data. The contrast here, I hope, helps illuminate the point that Bradford Hill was making. When I read this manuscript, I see someone making tough, impactful decisions in the presence of uncertainty. He is steeped in the particulars of the situation. While formalizing Bradford Hill’s criteria is useful, and will produce better decision-making, it is also beside the point. (And, in the most extreme, can lead to a type of blindness about the role of experts, stakeholders, and consequences for our analyses.) The criteria are paths of reasoning about causality that resonate and reassure. In the kinds of questions epidemiologists, health policy researchers, economists, criminologists. . . engage, the ultimate audience is a community of people whom our conclusions impact. Statistical reasoning can be like mathematics at times, but in answering these kinds of questions it is better to think of statistical reasoning as a form of rigorous, quantitative argumentation – meant to guide thought and shift beliefs. I have a friend who keeps a copy of Bradford Hill’s criteria pinned to his office wall. He uses it to remind himself of the paths he might take. I like that. If you follow Bradford Hill then I think you’ll have an easier time reaching your audience.

Acknowledgments

I would like to thank Jordan Rodu for conversations about the CTF and the cheese plate.

References

Bennett, J. and Lanning, S. (2007). The Netflix Prize. Proceedings of KDD Cup and Workshop, 2(3):35.

Bradford Hill, A. (1965). The environment and disease: association or causation? Proceedings of the Royal Society of Medicine, 58(2):295–300.

Donoho, D. (2017). 50 years of data science. Journal of Computational and Graphical Statistics, 26(4):745–766.

Graubard, B. and Korn, E. (1999). Predictive margins with survey data. Biometrics, 55(2):652–9.

Höfler, M. (2005). The Bradford Hill considerations on causality: a counterfactual perspective. Emerging Themes in Epidemiology, 2(1).

Höfler, M. (2006). Getting causal considerations back on the right track. Emerging Themes in Epidemiology, 3(1).

Liberman, M. (2015). Reproducible research and the common task method. Simons Foundation Lecture.

Phillips, C. and Goodman, K. (2004). The missed lessons of Sir Austin Bradford Hill. Epidemiologic Perspectives & Innovations, 1(1).

Phillips, C. and Goodman, K. (2006). Causal criteria and counterfactuals; nothing more (or less) than scientific common sense. Emerging Themes in Epidemiology, 3(1).

Observational Studies 6 (2020) 17-19 Submitted 9/19; Published 1/20

On the use and abuse of Hill’s viewpoints on causality

Samantha Kleinberg [email protected] Computer Science Department Stevens Institute of Technology Hoboken, NJ, USA 07030

Here, then, are nine different viewpoints from all of which we should study association before we cry causation. What I do not believe – and this has been suggested – is that we can usefully lay down some hard-and-fast rules of evidence that must be obeyed before we accept cause and effect. None of my nine viewpoints can bring indisputable evidence for or against the cause-and-effect hypothesis and none can be required as a sine qua non. What they can do, with greater or less strength, is to help us to make up our minds on the fundamental question – is there any other way of explaining the set of facts before us, is there any other answer equally, or more, likely than cause and effect? (emphasis added, italics original) Hill (1965)

Not since Fisher1 suggested p < 0.05 is often convenient has such a clear statement by a statistician been so misunderstood. Hill’s sensible advice has been transformed like Samsa in Kafka’s Metamorphosis into what his article warned against: a checklist. Google Scholar returns over 100,000 articles using the phrase “Bradford Hill Criteria,” it has been growing in usage in books since the 1990s (see figure 1),2 and even the Wikipedia page on the topic is titled “Bradford Hill Criteria.”3 And yet Hill wrote that there are no “hard-and-fast rules” for causality. This is not just a marketing problem. How we talk influences how we think (Boroditsky, 2011) and the mutation of considerations into criteria is in fact part of their misuse. Hill referred to the pieces of evidence we may wish to examine as “aspects of [an] association [to] consider before deciding that the most likely interpretation of it is causation” (p. 295) and “viewpoints from [which] to study association before we cry causation” (p. 299). Considerations may influence our decisions, such as whether to believe a causal relationship exists, but they are also things we may evaluate and ignore if they’re not relevant. Criteria, in contrast, are a benchmark against which we test something. In the case of causality, criteria provide a tantalizing yet misleading shortcut: check off these boxes and you can claim causality. Yet, there is no such checklist for causality and Hill’s considerations are neither necessary nor sufficient to establish a causal relationship.4

1. Fisher (1925) said about a p-value threshold of 0.05 that “it is convenient to take this point as a limit in judging whether a deviation is to be considered significant or not.” The message people heard appears to be “p < 0.05 or it didn’t happen.”
2. Variations of the phrase involving viewpoints and considerations are so rare that no Ngrams are found.
3. https://en.wikipedia.org/wiki/Bradford_Hill_criteria
4. See Kleinberg (2015) and (Rothman et al., 2008, p. 26) for a few examples detailing just why this is.

© 2020 Samantha Kleinberg.

Figure 1: Google Ngram results for usage of “Bradford Hill Criteria” in books.

Checklists can be a powerful tool in safety critical domains where cognitive load is high and time is short (Gawande, 2010). However the settings where Hill’s views are most useful are not that. They are cases where experiments are difficult or impossible and we must cobble together piecemeal evidence for causal claims. These are cases where we also must assess whether a consideration is relevant to the topic. For example, Hill along with Richard Doll identified a link between smoking and lung cancer at a time when little was known about the etiology of lung cancer (Doll and Hill, 1950). It is not possible to conduct randomized experiments to test the hypothesis that smoking is responsible for cancer, but it is of great public health significance to know what causes cancer so it can be prevented. From this experience Hill distilled his views on how we can gain such causal knowledge into his famous article. Yet rather than providing a starting point, Hill’s viewpoints have been widely and repeatedly used as a standard of evidence, the same way the majority of researchers use a p-value cutoff of 0.05. The precise danger of conventions is that one need not justify them, whereas a p-value threshold of 0.04 or 0.06 would invite significant scrutiny.5 However, many other factors such as effect size are important to determining whether a result is actually important or not. When Hill’s views are treated as criteria, they similarly become a causal inference figleaf. If these reasonable but still unvalidated pieces of evidence can be provided,6 then congratulations, you can claim causality. So if I’m suggesting researchers quit the causal criteria cold turkey, what will replace them? It is perfectly fine to refer to Hill’s considerations as a starting point when thinking about what evidence one might gather when evaluating an association. 
The part that is not fine is making the leap from these pieces of evidence to a definitive claim of causality – and both failing to consider other types of evidence and forcing these considerations to fit scenarios where they do not apply. These are two critical areas for future research to explore. First, our methods and data have evolved in the years since Hill’s article, yet the considerations remain static. It is worth exploring whether there are other evidence types that may prove useful as well as updating how current methods might support the existing considerations (e.g. how do big data and simulations fit in?). Second, while Hill focused on epidemiology, the considerations have been used more broadly and it is important to examine how the needs for and standards of evidence vary across domains. By allowing them to evolve, Hill’s considerations will hopefully meet a better end than poor Samsa.

5. Editorial guidelines for the journal Cognition explicitly state that only effects with p < 0.05 can be described as statistically significant, stating that “for better or for worse, this is the current convention,” which it seems even journals are powerless to change.
6. This is all leaving aside the question of what it means to satisfy each criterion, which surely requires more nuance than present/absent.

Acknowledgments

Thanks to Dylan Small for providing an outlet for these viewpoints.

References

Boroditsky, L. (2011). How language shapes thought. Scientific American, 304(2):62–65.

Doll, R. and Hill, A. B. (1950). Smoking and carcinoma of the lung. British Medical Journal, 2(4682):739.

Fisher, R. A. (1925). Statistical Methods for Research Workers. Oliver and Boyd, Edinburgh.

Gawande, A. (2010). The Checklist Manifesto. Henry Holt and Company.

Hill, A. (1965). The environment and disease: association or causation? Proceedings of the Royal Society of Medicine, 58(2):295–300.

Kleinberg, S. (2015). Why: A Guide to Finding and Using Causes. O’Reilly Media.

Rothman, K. J., Greenland, S., Lash, T. L., et al. (2008). Modern epidemiology, volume 3. Wolters Kluwer Health/Lippincott Williams & Wilkins Philadelphia.

Observational Studies 6 (2020) 20-23 Submitted 10/19; Published 1/20

Plausibility and the Benefit of Theoretical Reasoning: Comment on Austin Bradford Hill (1965)

A. James O’Malley [email protected] Department of Biomedical Data Science and The Dartmouth Institute of Health Policy and Clinical Practice Geisel School of Medicine, Dartmouth University Lebanon, NH 03756, USA

I greatly enjoyed reading the paper “The Environment and Disease: Association or Causation” by Professor Austin Bradford Hill (ABH). The paper continues to be held in high regard and is massively cited (9,359 citations as of October 20, 2019). The paper is centered around nine aspects of statistical association that ABH suggests be considered before deciding if causation should be claimed. In some fields, these points have erroneously been viewed and taught as causal criteria (Phillips and Goodman 2004). Given that “Association or Causation” discernment is receiving increasing attention in the statistical literature, and that we’re in the midst of the emergence of data science and ever-growing volumes of observational data, reviewing ABH (1965) and assessing whether there are insights that can inform contemporary statistical analysis is timely and valuable. ABH (1965) has been the subject of much review and discussion over many years (e.g., Phillips and Goodman 2004; Thygesen, Andersen and Andersen 2005; Fedak et al 2015). Often reviews have systematically considered each of the nine points and offered discussion and critique about each. Because the appropriateness or not of these points has already received an extensive amount of attention, I have focused my comment on a specific topic that while being inherent to ABH (1965) has not been directly addressed. I focus on the role of theoretical models – often used in economics and sociology to generate causal stories and allied hypotheses prior to analyzing data or even designing a study – in relation to the task of distinguishing cause from association. Is the ability to construct and exploit theoretical models in order to devise ways of testing their legitimacy an under-appreciated skillset that ought to be part of statistical (and data science) practice and education? In the following, I first provide some general comments and then give an illustrative example.
The point in ABH (1965) that is most directly relevant to my comment is Plausibility, one of the least considered of the nine points. ABH states “It will be helpful if the causation we suspect is biologically plausible.” I agree with this statement and think that it can be extended to “It will be helpful if the causation we suspect is plausible in the scientific or other setting being analyzed.” In other words, does the causal story make sense given current knowledge? In order to answer this question, it is imperative that one understand the underlying science or subject area knowledge. If the causal story is plausible then the skill of being able to think theoretically or conceptually about it may allow tests of hypotheses and ways of identifying parameters in the theoretical model to be constructed that data can subsequently adjudicate.

© 2020 A. James O’Malley.

Since 1965, the presence and importance of statistics in science and medicine has grown. However, in many settings, roles of statisticians have been restricted to data analysis as opposed to the art of linking the broader scientific perspective underlying research or other investigation to data. To prepare students for such data-focused careers, the budding statistician is typically supplied with practice problems to solve. This tailor-made approach bypasses a lot of the artfulness of being able to reason theoretically in the application-side of a problem to identify the key aspects of it that can be tested by data. Thus, even if their empirical work is performed with much creativity and skill, in terms of overall impact the role of a traditionally-trained statistician may not be as great as it could be. The statistician is often a “Devil’s advocate”. While being a Devil’s advocate is certainly needed, if the ultimate goal is to get the best answer possible using current resources, the skill of being able to discern what should be tested in the first place is vital. I think of the construction of a theoretical model based on current knowledge as constructive causal thinking. This process can reveal which data will be most useful to acquire in order to strengthen a claim of causality and which identification strategies can be used. I advocate for the involvement of statisticians in the role of constructive causal thinking as well as a Devil’s advocate. In methodological statistics it is common to assume certain complications don’t exist in order to focus on the conditions for which a particular result applies. Yet, from a purist standpoint, observational data analysis is inherently doomed. By its nature, there is always an assumption which, if violated, leads to bias as there is no way of ensuring that all assumptions, particularly those involving unobserved variables, hold.
Therefore, it seems important to pair methodological work on a ready-made or stylized problem with the ability to identify the key problem to work on in the first place. This encompasses being able to justify why a question is important, why assumptions are plausible, recognizing and constructing sanity checks to inform the validity of a model, and so on. Working through these steps may help enable the methodological problem that has the biggest impact to be realized. Technical approximation is another critical but often untaught skill. It is a means of assessing what is a reasonable “ball-park” answer to a numerical problem. The concept can be generalized to the ability to construct a reasonable ball-park theoretical model that can in turn be used to assess whether the stated question of interest is the right one and whether a proposed analysis answers the question of interest. In practice, the best research questions and empirical strategies can be ingenious for their ease of understanding. Exceedingly complicated solutions are seldom universally popular. Explanations involving elegant sanity checks based on a plausible theoretical model can be far more convincing than the output of uninformed but complex empirical analyses. To conclude the first part of my comment, many social science fields (e.g., economics, sociology) place substantial value on theoretical models that develop hypotheses and allied identification strategies. I believe that the skill of being able to theoretically construct relationships between variables and testable constructs is underutilized in statistics. A general example of this skill is seen in the identification of an instrumental variable (IV); because IVs cannot be fully tested by the data, their construction is a skill that relies on broad and artful thinking.

To illustrate theoretical (non-empirical) reasoning and its benefits, consider the tie-directionality identification strategy developed in Christakis and Fowler (2007) to account for unmeasured common causes in the estimation of peer-effects (the effect of one individual on another). This example features the points in ABH (1965) concerning Plausibility, Analogy and Dose-response. Christakis and Fowler (2007) hypothesized that a health behavior or characteristic (e.g., or body-mass-index) of an individual is affected by the same trait in their peers. One form of bias that may afflict the estimator of a peer-effect is from unmeasured common causes – uncontrolled factors that impart similar effects on a group of individuals. Unmeasured common causes may give the appearance that the change in the outcome of one individual tracks that of other individuals, even if there is no actual peer effect. If there exist dyads for which there should be no inter-individual influence, then such dyads can be used as a control group whose estimated peer effect captures any unmeasured common causes (Christakis and Fowler 2007). Under the homogeneity assumption that all types of dyads are equally affected by exposure to an unmeasured common-cause, differencing a peer-effect against that for the control dyad group is able to recover the true peer effect. This is an example of argument by Analogy. Christakis and Fowler (2007) also theorized that if peer effects exist they should increase with the strength of relationship between individuals. This led them to theorize that in the case of mutable relationships such as friendships, the peer association will be greatest in dyads for which the friendship is mutual, second-greatest along network ties when friendship is only from the focal actor to the other actor, third-greatest along network ties when friendship is only from the other actor to the focal actor, and least in dyads with no relationship (null dyads).
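The two steps just described, differencing against control dyads and checking the theorized ordering, can be sketched in Python as follows. The numbers in `estimates` are invented for illustration and are not results from Christakis and Fowler (2007); the dyad-type labels and both helper functions are hypothetical names, and the sketch works on point estimates only, ignoring sampling uncertainty.

```python
# Hypothetical estimated peer associations by dyad type (made-up numbers).
estimates = {
    "mutual": 0.17,        # friendship named in both directions
    "ego_named": 0.10,     # friendship only from the focal actor to the other actor
    "alter_named": 0.05,   # friendship only from the other actor to the focal actor
    "null": 0.02,          # no relationship: the control dyads
}

# Theorized ordering, from strongest to weakest peer association.
ORDER = ["mutual", "ego_named", "alter_named", "null"]

def adjusted_peer_effects(est):
    """Difference each dyad type against the null dyads. Under the homogeneity
    assumption the null-dyad estimate captures only unmeasured common causes."""
    control = est["null"]
    return {kind: est[kind] - control for kind in ORDER if kind != "null"}

def follows_predicted_ordering(est):
    """True only if every step of the theorized ordering holds at once."""
    values = [est[kind] for kind in ORDER]
    return all(a > b for a, b in zip(values, values[1:]))

print(adjusted_peer_effects(estimates))
print(follows_predicted_ordering(estimates))
```

A real analysis would also have to account for the uncertainty in each estimate; the sketch only shows why the joint ordering is a more demanding check than a single comparison of non-null against null dyads.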
(The “focal actor” is the individual providing the dependent variable.) A requirement that peer-effect estimates follow this theoretically-implied ordering of effect-sizes is a more rigorous test of the existence of peer-effects than a single comparison of the estimated peer-effects for non-null and null dyads, as the former encapsulates all hypotheses simultaneously.

Although not all sources of unmeasured confounding are guarded against, which is impossible in an observational study, the theoretical derivation and implied empirical strategy for testing the peer-effects hypothesis is appealing for featuring Plausibility, Analogy and Dose-response. The underlying theoretical model makes sense, is easily understood and can be constructively debated, which is what has transpired. For example, in a subsequent paper, Shalizi and Thomas (2011) introduced a form of confounding, known as latent homophily, which is not accounted for by the Christakis and Fowler approach. They argued that individuals may form relationships because they have similar characteristics that continue to affect their outcomes post tie-formation and thus that latent homophily is indistinguishable from the true peer-effects in the Christakis and Fowler study. They embellished their theory with a compelling example of how latent homophily could occur. By articulating the theoretical model or causal mechanism clearly, both the original paper and subsequent critiques written about it triggered valuable discussion and insights, resulting in further advances in understanding. Plausibility assessment and theoretical reasoning are important skills!

Another important point implied by ABH (1965) is that one should not hold back from publishing results in order to wait for every contingency to be accounted for. It is better to advance knowledge with results once available as long as reports and publications are accompanied by clear descriptions of the theoretical model, assumptions and identification strategy. The final section of ABH (1965), “The Case for Action,” argues that evidence is there to be judged on its merits and should be independent of what hangs on it. This is consistent with the notion that while a great idea isn’t foolproof, it might still be worthwhile to pursue. This progressive approach, allied with full acknowledgement and explanation of assumptions and an adept sensitivity analysis that evaluates the impact of plausible violations of the assumptions (another invocation of the wisdom of ABH (1965)), supports fully-informed and timely decision making.

While the technical correctness of the points made in ABH (1965) has been extensively debated, I believe that ABH (1965) remains a valuable paper to read and that statisticians can benefit from doing so, including by embracing the importance of plausibility and related points. Social scientists have practiced the art of developing and applying theoretical models for many years. I conjecture that being able to evaluate theoretical models for their plausibility and apply them to construct identification strategies will help to increase the prominence of statisticians in science, medicine and policy-formation (and other areas of society).

References

Christakis, N.A., Fowler, J.H. (2007). The spread of obesity in a large social network over 32 years. New England Journal of Medicine, 357:370–379.
Fedak, K.M., Bernal, A., Capshaw, Z.A., Gross, S. (2015). Applying the Bradford Hill criteria in the 21st century: how data integration has changed causal inference in molecular epidemiology. Emerging Themes in Epidemiology, 12:14.
Hill, A.B. (1965). The environment and disease: association or causation? Proceedings of the Royal Society of Medicine, 58:295–300.
Phillips, C.V., Goodman, K.J. (2004). The missed lessons of Sir Austin Bradford Hill. Epidemiologic Perspectives and Innovations, 1(1):3.
Shalizi, C.R., Thomas, A.C. (2011). Homophily and contagion are generically confounded in observational social network studies. Sociological Methods and Research, 40(2):211–239.
Thygesen, L.C., Andersen, G.S., Andersen, H. (2005). A philosophical analysis of the Hill criteria. Journal of Epidemiology and Community Health, 59(6):512–516.

Observational Studies 6 (2020) 24-29 Submitted 8/19; Published 1/20

Comment on “The Environment and Disease: Association or Causation?”

Christopher J. Phillips
Department of History
Joel Greenhouse
Department of Statistics
Carnegie Mellon University
Pittsburgh, PA 15289

A. B. Hill’s 1965 discussion of the relationship between statistical association and causality has become so well known among epidemiologists that it is easy to treat his list of relevant “aspects” as timeless and context-less, as a set of logical postulates. In this comment, we instead want to place Hill’s article in its historical context, both that of the author, and even more so that of epidemiology’s understanding of statistical association and causation.

The most obvious, but often neglected, context is that the text was first presented as the Presidential Address to the Royal Society of Medicine’s Section of Occupational Medicine. Presidential addresses, particularly when given by senior colleagues, are opportunities for reflection beyond a typical research article. There’s nothing necessary about this reflectivity, however, and the following year’s presidential address by R.S.F. Schilling was instead an impassioned account of the dangers of trawler fishing (Schilling, 1966). This was also not the first such address Hill himself had given – his May 1954 address to the Epidemiology and Preventive Medicine section eschewed broad generalization in favor of a more traditional account of the expected versus observed cases of interwar polio in England and Wales, broken down by county and locale (Hill, 1954).

The important contextual difference in October 1964 was that the Occupational Medicine section had just been formed, and Terence Cawthorne’s opening address noted that the section was intended not just for physicians and surgeons but also scientists from a range of disciplines (Cawthorne, 1965). As subsequent speakers at the first meeting made clear, the traditional focus on industrial medicine was still present, but the renaming as “occupational medicine” was a move indicating its broader audience, from sports physicians to education workers.
In this context, Hill set out to address a question that by its very nature was interdisciplinary and of great interest to this larger group: what is the relationship between an environmental agent and disease, and when is it appropriate to identify such a relationship as causal? This was a pressing issue for the new section, Hill wrote, because the characterization of the relationship between occupational conditions and sickness is “fundamental” and yet problematic. When is a respiratory illness among workers, he asks, simply associated with dust in the environment, and when is it caused by it? As Hill well knew, the question of causation cannot be answered solely by any one field. Instead, by drawing on physiology, statistics, labor relations, public health, and pulmonology, Hill sought to portray the question as essentially interdisciplinary and therefore of great interest to the new section.

© 2020 Christopher Phillips and Joel Greenhouse.

Hill himself had long been concerned with causal questions. Though he modestly doesn’t cite his own work, he had written a long report on the prevalence and origins of respiratory illnesses among Lancashire’s cotton cardroom operators in the 1920s for the Medical Research Council. The cleaning of raw cotton prior to spinning was known to cast off dust and fibers, and in 1927 the UK’s home secretary established an investigation into “whether, and if so to what extent, dust in cardrooms in the cotton industry is a cause of ill-health or disease among cardroom operatives” (Hill, 1930). It was known already that there were health effects from the cleaning of the carding machines themselves, and the owners of the mills contended that the mechanical methods installed to remove dust had remedied the problem, while operatives contended that cleaning cotton carried health risks distinct from the cleaning of the machines. Hill was tasked with finding out what kind of ill effects, if any, there were from the cleaning of cotton, whether these were distinct from other respiratory diseases known to occur at different stages of the process, and what if anything could be done about it. Hill’s choice of this same example in 1964 is a clear indication that the question of “environment and disease” was one that he had been contemplating for a long time.

Hill’s role within industrial health efforts of the 1920s and 1930s put him in contact with colleagues who themselves saw an essential role for statistics in making causal claims. Unlike then-contemporary biomedicine’s focus on bacteria and other microscopic agents of infection, industrial and environmental health were areas in which “causal factors” and “causal relationships” were known to be multifactored, complicated, and often hidden.
Work in these areas using statistical rates to make claims about causality, from the relationship of housing and health to that of infant mortality, goes back at least to and Florence Nightingale. Later, G. Udny Yule had used statistical methods, specifically a regression equation, to try to pinpoint causes of pauperism in England at the turn of the century.1 Hill’s work at the Medical Research Council along these lines was initially overseen by Major Greenwood, an influential statistician whose own training, combining statistics (he studied under Karl Pearson) with physiology and preventive medicine, exemplified the growing importance of data for measuring associations between health and environmental agents (Higgs, 2000). Indeed, there’s a good argument that Hill inherited the mantle of statistics in medicine from this earlier generation. Their approaches – emphasizing careful study design and data collection techniques, relying on relatively conservative uses of statistics, avoiding formal inference tests and elaborate models – would later characterize Hill’s own approach throughout his career.

1. Yule (1899) was discussed thoughtfully alongside other examples in Freedman (1999).

With remarkable clarity in his 1964 address, Hill lays out the question of interest:

Our observations reveal an association between two variables, perfectly clear-cut and beyond what we would care to attribute to the play of chance. What aspects of that association should we especially consider before deciding that the most likely interpretation of it is causation? (p. 295)

Contrary to Rothman and Greenland’s claim (1998) that Hill’s criteria were essentially “an expansion of criteria offered previously in the landmark Surgeon General’s report on Smoking and Health,” Hill’s approach for distinguishing causal from non-causal associations was developed based on his longstanding experience in epidemiologic field studies. As an illustration we consider his influential paper with Richard Doll, “Smoking and Carcinoma of the Lung” (1950). This was an early case-control study, conducted in 20 hospitals in the London region, of patients presenting with cancer of the lung, stomach and large bowel. The patients with carcinoma of the stomach and large bowel served as one comparison group, and another comparison group consisted of non-cancer general hospital patients, “chosen so as to be of the same sex and age as the lung-carcinoma patients.” In the Discussion section, Doll and Hill synthesized the results, in language that would both implicitly and explicitly feature the relevant aspects of association, specificity, biological gradient, consistency, coherence, and plausibility:

• “...the comparison of the smoking habits of patients in different groups...revealed no association between smoking and cancer of the other sites (mainly stomach and large bowel). The association therefore seems to be specific to carcinoma of the lung.”
• “The effect of smoking varies, as would be expected, with the amount smoked.”
• “How do these results fit in with other known facts about smoking and carcinoma of the lung? Both the consumption of tobacco and the number of deaths attributed to cancer of the lung are known to have increased, and to have increased largely, in many countries this century.”
• “As to the nature of the carcinogen we have no evidence. The only carcinogenic substance which has been found in tobacco smoke is arsenic, but the evidence that arsenic can produce carcinoma of the lung is suggestive rather than conclusive. Should arsenic prove to be the carcinogen, the possibility arises that it is not the tobacco itself which is dangerous. Insecticides containing arsenic have been used for the protection of the growing crop since the end of the last century and might conceivably be the source of the responsible factor.”

Clearly, Doll and Hill are systematically and logically assessing the body of evidence from their study and the existing epidemiologic literature to make a case for (or against) a causal association.

In the last quotation, Doll and Hill consider an alternative explanation for the observed association between tobacco smoke and carcinoma of the lung: arsenic in tobacco. Curiously, in his 1964 address Hill does not explicitly include elimination of alternative explanations as a relevant aspect. However, Hill clearly considered the elimination of alternative explanations central to the establishment of a causal association, as he indicated in his Watson Memorial Lecture (1962), itself repeatedly cited in the 1964 address. In the section entitled “The Assessment of Evidence” he writes:

We are continuously brought back to the fundamental question - what alternative explanation will fit a set of observations, what other differences between our contrasted groups could equally, or better, account for the observed incidences. That is the crux of the matter - and no χ² test or other application of the Greek alphabet will answer it. It demands an experience, and acumen, in what to collect, how to seek in the data for essentials, how to interpret. (p. 189)

This focus on alternative explanations predated Hill’s work with Doll on smoking and health. Indeed, his experience with the Industrial Health Research Board had already established this pattern in the 1920s. When looking at whether artificial humidification produced diseases among workers in the cotton industry’s sheds, for example, Hill showed how simply by collecting the right data one could carefully rule out possible alternative explanations for the sickness of workers (Hill, 1927). He’d include towns with only one kind of shed, to ensure sicker workers were not selecting the humid sheds; he’d also include towns with both humid and dry sheds to ensure there was not a bias at that level of selection; he’d track the sickness of every weaver, even those who left before the end of the study, to ensure that the departure of the sickest workers was not itself an explanation of the data; he consulted over 20,000 records to be sure that any measured differences were unlikely to be explained by chance. Ruling out alternative explanations didn’t involve formal tests for Hill so much as it did careful data collection and logical thinking.

Counterfactual history is a dangerous game to play but it is likely uncontroversial to say that even if Hill hadn’t published his list of criteria, some set of criteria for moving from statistical association to causation would have become standard by the 1980s. We might instead be talking about the U.S. Surgeon General’s “Criteria of the Epidemiologic Method” since their publication in 1964’s Smoking and Health featured a similar emphasis on consistency, strength, specificity, temporal relationship, and coherence. Throughout the 1950s, a number of eminent epidemiologists and statisticians had tried to specify how exactly statistical methods might be useful for making causal claims. In 1959, Jerome Cornfield and colleagues published a long review article systematically evaluating the question of causation in smoking and health.
The conclusion – that smoking plays a “causal role” in the acquisition of lung cancer – was based on the existing data’s strength, specificity, temporal relationship, consistency, coherence, and a systematic attempt to eliminate alternative explanations (Greenhouse, 2009). These were, of course, exactly the criteria later used in the Surgeon General’s report. Two years earlier, Abraham M. Lilienfeld’s “Epidemiological Methods and Inferences in Studies of Noninfectious Diseases” (1957) also used smoking as the model for thinking about how to elucidate possible etiological factors, settling on a pragmatic approach to analyzing statistical associations which emphasized, e.g., that when the exposure to a factor is diminished, so should the incidence. And two years before that, E. Cuyler Hammond’s chapter on “Cause and Effect” in Ernest Wynder’s The Biologic Effects of Tobacco made it plain that “The eventual aim of epidemiologic research is to discover means by which conditions may be altered in such a way as to lower disease incidence and mortality rates or at least to limit their rise. Thus, it becomes a search for causative factors.” Moreover, for Hammond, causative factors were readily ascertained statistically because they typically acted “quantitively,” namely they “increas[ed] the probability that a specific event will occur” (Hammond, 1955: 173-174). Though with different emphases and subtleties, there was a consistent approach from many epidemiologists and biostatisticians to think systematically and rigorously about how to make causal claims in the years after World War II. Hill’s lecture, while more expansive in its list of criteria than anything that came before, did little to change the overall trajectory of how practitioners were using statistical measures of association.

As a coda, it is worthwhile to remember that Hill’s lecture wasn’t hailed as a masterpiece, or even as particularly important when it was first published. Though now widely cited, it was hardly cited at all through the 1970s (only about twice per year on average in the fifteen years after publication). In fact, Mervyn Susser’s 1973 textbook Causal Thinking in the Health Sciences has its “criteria of judgment” explicitly modeled around the Surgeon General’s report. As Susser was to admit years later, he had not even read Hill’s article when preparing his textbook, surely a sign if there is any that its importance wasn’t immediately apparent (Susser, 1973, 1991). We think it plausible, in fact, that the association of causal criteria with Hill’s speech was ultimately more of an honorary gesture towards Hill’s own importance in the 1980s and 1990s. By that point, there was little doubt of his lasting role in the development of clinical trials and observational studies, in the use of statistics by governmental agencies, and, indeed, in the ways we can make rigorous causal claims.

References

Cawthorne, T. (1965). Opening Address. Proceedings of the Royal Society of Medicine. 58(5):289-294.
Cornfield, J. et al. (1959). Smoking and Lung Cancer: Recent Evidence and a Discussion of Some Questions. Journal of the National Cancer Institute. 22(1):173-203.
Doll, R. and Hill, A.B. (1950 Sep 30). Smoking and Carcinoma of the Lung. British Medical Journal. 2(4682):739-748.
Freedman, D. (1999). From Association to Causation: Some Remarks on the History of Statistics. Statistical Science. 14(3):243-258.
Greenhouse, J.B. (2009). Commentary: Cornfield, Epidemiology, and Causality. International Journal of Epidemiology, 38(5):1199-1201.
Hammond, E.C. (1955). Cause and Effect. The Biologic Effects of Tobacco. E.L. Wynder, ed. Little, Brown, Boston. 171-196.
Higgs, E. (2000). Medical Statistics, Patronage, and the State: The Development of the MRC Statistical Unit, 1911-1948. Medical History, 44:323-340.
Hill, A.B. (1927). Artificial Humidification in the Cotton Weaving Industry. Medical Research Council, Industrial Fatigue Research Board, Report No. 48. London: His Majesty’s Stationery Office.
Hill, A.B. (1930). Sickness Among Operatives in Lancashire Cotton Spinning Mills. Medical Research Council, Industrial Health Research Board, Report No. 59. London: His Majesty’s Stationery Office.
Hill, A.B. (1954). Poliomyelitis in England and Wales Between the Wars. Proceedings of the Royal Society of Medicine. 47(9):795-805.
Hill, A.B. (1962). Alfred Watson Memorial Lecture: The Statistician in Medicine. Journal of the Institute of Actuaries. 88(2):178-191.
Lilienfeld, A.M. (1957). Epidemiological Methods and Inferences in Studies of Noninfectious Diseases. Public Health Reports. 72(1):51-60.
Rothman, K.J. and Greenland, S. (1998). Causation and Causal Inference. In Modern Epidemiology, 2nd edition. Lippincott Williams & Wilkins, Philadelphia.
Schilling, R.S. (1966). Trawler Fishing: An Extreme Occupation. Proceedings of the Royal Society of Medicine. 59(5):405-410.
Susser, M. (1973). Causal Thinking in the Health Sciences. Oxford University Press, New York.
Susser, M. (1991). What is a Cause and How Do We Know One? A Grammar for Pragmatic Epidemiology. American Journal of Epidemiology, 133(7):635-648.
Yule, G.U. (1899). An Investigation into the Causes of Changes in Pauperism in England, Chiefly During the Last Two Intercensal Decades. Journal of the Royal Statistical Society. 62(2):249-295.

Observational Studies 6 (2020) 30-32 Submitted 7/19; Published 1/20

The Wrong Message from the Wrong Talk

Kenneth J. Rothman
Boston University and Research Triangle Institute
3040 East Cornwallis Road
Research Triangle Park, NC USA 27709

Hill’s 1965 address (Hill, 1965) on causal inference to the newly formed section on Occupational Medicine of the Royal Society of Medicine is considered hallowed scripture, not merely revered by researchers, but even used as a template for adjudication of causal questions by courts of law. It certainly was a provocative and thoughtful after-dinner talk, but it does not merit the status of scripture, especially not for the reasons it is so revered. The message for which it is remembered was not the most important advice in the talk, and the talk itself was less important than another lecture, now largely forgotten, that Hill gave some years earlier.

The 1965 talk, which has been cited over 9000 times, gained prominence because it seems to offer a checklist of criteria for causal inference. Scientists, like anyone else, are inclined to follow a recipe if one exists. Apparently, few are bothered by how well the recipe works, or even whether it works. Consequently, numerous papers have attempted to infer causation through the application of the “Hill criteria,” on topics ranging from the Zika virus and microcephaly (Awadh et al., 2017) to neuropsychiatry (van Reekum et al., 2001) to climate change (Science in Society Archive). In one recent review of dietary sodium and cardiovascular risk, the authors divided the paper into sections that corresponded to each of Hill’s nine criteria (Cogswell et al., 2016). In other papers, the authors employed scoring systems to quantify how well the criteria are met and from that derived an overall number which was intended to be a measure of causality (Mente et al., 2009). In one such approach, a discriminant analysis was used to generate weights for each of the nine criteria, which were used to derive an overall score that was interpreted as a probability indicating whether the association is causal (Swaen and Amelsvoort, 2009).
Philosophers of science consider these to be futile efforts, because they agree that there cannot be a logical basis for a checklist approach to scientific inference. It comes as no surprise, then, that critics have steadily questioned the legitimacy of inferential checklists, as well as the origin of Hill’s list, and the applicability of specific points that Hill included in the list (Weed, 1988; Haack, 2014; Ward, 2009; Rothman and Greenland, 2005; Morabia, 1991; Phillips and Goodman, 2004; Blackburn and Labarthe, 2012; Thygesen et al., 2005; Weiss, 2002). But we can hardly criticize Hill for originating the checklist, because he asked to be counted among the critics of a checklist approach to inference. He studiously avoided the word “criteria” in his talk, and he cautioned against using his “viewpoints” as if they were criteria, claiming that none of them was a sine qua non for causal inference.

© 2020 Kenneth Rothman.

In 2004, Phillips and Goodman elegantly laid out the thesis that Hill’s 1965 paper is esteemed for the wrong reason (Phillips and Goodman, 2004): “We will say only that Hill’s list seems to have been a useful contribution to a young science that surely needed systematic thinking, but it long since should have been relegated to part of the historical foundation, as an early rough cut. Yet it is still being recited by many as something like natural law. Appealing in our teaching and epistemology to the untested “criteria” of a great luminary from the past is reminiscent of the “scientific” methods of the Dark Ages.”

In their essay, they suggested two alternative messages that they deemed to be the “missed lessons” of Hill’s talk. One of these was Hill’s warning about reliance on statistical significance testing for inference: “(W)e waste a deal of time, we grasp the shadow and lose the substance, we weaken our capacity to interpret data and to take reasonable decisions whatever the value of P. And far too often we deduce ‘no difference’ from ‘no significant difference’” (Hill, 1965). Hill continued, “I wonder whether the pendulum has not swung too far - not only with the attentive pupils but even with the statisticians themselves.” This issue, which has been mostly submerged for decades, has finally surfaced to spawn serious debate and reflection (Wasserstein and Lazar, 2016; Amrhein et al., 2019). Reliance on significance testing for inference has led to countless misinterpretations of data, and has exacerbated the problem of replication of results. If that message had been the legacy of Hill’s talk instead of the checklist, untold errors might have been avoided.

A dozen years before the 1965 lecture for the Royal Society of Medicine, Hill gave the Cutter Lecture at Harvard. Some background is relevant: in 1951, Prof. Hugh Sinclair of Oxford University was the Cutter Lecturer (Sinclair, 1951).
Sinclair’s theme was to exalt the experimental method, while attacking nonexperimental science, at least regarding nutritional research: “The use of the experimental method has brilliant discoveries to its credit, whereas the method of observation has achieved little...The observer must await the occurrence of the natural succession of events he wishes to study, and he is very apt to be misled by the fallacy of post hoc ergo propter hoc or by the existence of a correlation without causality” (Sinclair, 1951).

When Hill delivered his Cutter Lecture two years later, he pointedly titled it “Observation and Experiment” (Hill, 1953). He made the case for nonexperimental epidemiology, starting with the example of Snow’s work on cholera, moving on to studies of rubella, and then to his own work on smoking and lung cancer. He made it clear that he was not criticizing experimentation; indeed he preferred when possible to get experimental evidence, a point reiterated in his 1965 Royal Society of Medicine talk. But he emphasized that “...my preference in preventive medicine for the experimental approach...does not lead me to repudiate or even, I hope, to underrate the claims of accurate and designed observations” (Hill, 1953).

Hill’s lecture was clearly a response to Sinclair’s attack on nonexperimental science. He repeated Sinclair’s phrases: “The observer may well have to be more patient than the experimenter – awaiting the occurrence of the natural succession of events he desires to study; he may well have to be more imaginative – sensing the correlations that lie below the surface of his observations; and he may well have to be more logical and less dogmatic – avoiding as the evil eye the fallacy of post hoc ergo propter hoc, the mistaking of correlation for causation” (Hill, 1953).

Hill’s inspirational rebuttal to Sinclair, along with his work on smoking and lung cancer, helped to fortify the philosophic foundation for nonexperimental epidemiologic research.
He asked whether we can draw inferences at all without experimentation, and he answered affirmatively. In comparison, the issue that became the focus of his 1965 talk was misconstrued and far less compelling.

References

Amrhein V, Greenland S, McShane B (2019). Scientists rise up against statistical significance. Nature, 567: 305-307. doi: 10.1038/d41586-019-00857-9.
Awadh A, Chughtai AA, Dyda A, Sheikh M, Heslop DJ, MacIntyre CR (2017). Does zika virus cause microcephaly – applying the Bradford Hill viewpoints. PLOS Currents Outbreaks, Feb 22.
Blackburn H, Labarthe D (2012). Stories from the evolution of guidelines for causal inference in epidemiologic associations: 1953–1965. Am J Epidemiol, 176: 1071–1077.
Cogswell ME, Mugavero K, Bowman BA, Frieden TR (2016). Dietary sodium and cardiovascular disease risk – measurement matters. N Engl J Med, 375: 580-586.
Haack S (2014). Evidence matters; science, proof, and truth in the law. Cambridge University Press, pp 258-263.
Hill AB (1953). Observation and experiment. N Engl J Med, 248: 995-1001.
Hill AB (1965). The environment and disease: association or causation? Proc R Soc Med, 58: 295–300.
Mente A, de Koning L, Shannon HS, Anand SS (2009). A systematic review of the evidence supporting a causal link between dietary factors and coronary heart disease. Arch Intern Med, 169(7): 659–669.
Morabia A (1991). On the origin of Hill’s criteria. Epidemiology, 2: 367-369.
Phillips CV, Goodman KJ (2004). The missed lessons of Sir Austin Bradford Hill. Epidemiologic Perspectives & Innovations, 1:3. doi:10.1186/1742-5573-1-3
Rothman KJ, Greenland S (2005). Causation and causal inference in epidemiology. Am J Public Health, 95: S144–S150. doi:10.2105/AJPH.2004.059204
Science in Society Archive: http://www.i-sis.org.uk/TheBradfordHillCriteria.php
Sinclair HM (1951). Nutritional surveys of population groups. N Engl J Med, 145, 39-47.
Swaen G, Amelsvoort L (2009). A weight of evidence approach to causal inference. J Clin Epidemiol, 62: 270-277.
Thygesen LC, Andersen GS, Andersen H (2005). A philosophical analysis of the Hill criteria. J Epidemiol Community Health, 59: 512–516. doi: 10.1136/jech.2004.027524
van Reekum R, Streiner DL, Conn DK (2001). Applying Bradford Hill’s criteria for causation to neuropsychiatry: challenges and opportunities. J Neuropsychiatry Clin Neurosci, 13: 318-325.
Ward AC (2009). The role of causal criteria in causal inferences: Bradford Hill’s “aspects of association.” Epidemiologic Perspectives & Innovations, 6:2. doi:10.1186/1742-5573-6-2
Wasserstein RL, Lazar NA (2016). The ASA’s Statement on p-Values: Context, Process, and Purpose. The American Statistician, 70: 129-133. doi:10.1080/00031305.2016.1154108
Weed DL (1988). Causal criteria and Popperian refutation. In Causal Inference, Rothman KJ, ed., Epidemiology Resources, Inc.
Weiss NS (2002). Can the “specificity” of an association be rehabilitated as a basis for supporting a causal hypothesis? Epidemiology, 13: 6–8.

Observational Studies 6 (2020) 33-46 Submitted 11/19; Published 1/20

Causation in Action: Some Remarks Attendant to Re-reading Hill (1965)

Herbert L. Smith
Department of Sociology and Population Studies Center
University of Pennsylvania
Philadelphia, PA, USA 19104

What a lovely paper! I use a number of the topics and points raised by Hill (1965) to reconsider aspects of the notion of causation in research in the social sciences and especially in sociology, the field with which I am most familiar. One could do more. The paper’s intellectual concision and economy of expression are laudable. Each time I read it, I generate a new set of marginal comments. I first read the paper many years ago. But why? The topical ambit is restricted: association or causation in epidemiology – occupational medicine in particular. It is very English in that it features chimney sweeps, although there is nothing Mary Poppins-y about them: These poor men were dying from scrotal cancer at a rate that was extraordinarily high relative to such deaths among other workers. This implicated as causes of their disease the tars and oils characteristic of their trade (Hill 1965, p. 295). I am a sociologist, not an epidemiologist; have not studied scrotal cancer or any of the other diseases and physical conditions discussed by Hill (1965); and until late in life had never been to England. I would only have known of the paper via Holland (1986, pp. 956-957), where it figures among the canonical disciplinary treatments of causation related to Rubin’s (1974) model for causal inference. At that time I did not pick up much from it, because I was reading it with a whiggish cast of mind, as if it were evident that the diffuse treatments of causation in the past were noble-but-incomplete efforts en route to the precision of the present. Now I wonder, especially where the social sciences are concerned. The mental discipline imposed by the potential outcomes framework (Rubin 2005) is very powerful. When I was first exposed to it (Holland 1986; Rosenbaum 1984), it was as though the scales fell from my eyes. I used this framework to first think (Smith 1990), then re-think (Smith 2013) all manner of studies in sociology, , criminology, and social epidemiology.
Developments in causal thinking in the social sciences have been tremendous (e.g., Morgan 2013). But as I read Hill (1965) in retrospect, I think I see some threads of my own re-thinking of the situation, which is an admix of professional, scientific, and intellectual critique. In brief, and without nuance: We have harnessed ourselves to a “game” in which the objective is to make a world of interconnected, purposive actors bound in historical time and changing social structures look something like a randomized experiment. This feeds into a reductionist, individualist view of social science – and of the world we live in. Causes become embodied in the subjects on whom we make measurements and do causal calculations. Researchers claim priority for the importance of causal analysis because of its importance for action (in polite terms, “policy”), even if the action and active agents are at great remove from the assignment mechanisms (random or as-if-random) that constitute manipulation (action) in the experimental model. We find ourselves at the wrong level of analysis, justifying our claims about how the world works on the basis of a precision that is specious. It’s not that we are stupid. It’s the sociology of the situation. The generalized scientific development of causal analysis for observational studies (because that’s what most empirical social science is) feeds into a status hierarchy not just of ideas, but of individuals within the profession. One or more versions of many of these points are elaborated in a more genteel and considered manner in Smith (2013). I stab further at a few here, drawing on topics suggested by Hill (1965), although I do not mean to implicate him anachronistically, which would be particularly unfair to someone who valued coherence (p. 298).

© 2020 Herbert Smith.

Those Lurking Confounds

Hill’s (1965) first criterion with respect to causation is strength of an association. In addition to giving some trenchant examples of just how much damage certain environmental conditions induce in the humans who are exposed to them (pp. 295-296), he makes the important statistical point that for a strong observed association to be explained by some concomitant or antecedent factor, that factor must be very strongly associated in turn with the variable perhaps being mistaken for a cause. In particular:

...to explain the pronounced excess in cancer of the lung in any other environmental terms requires some feature of life so intimately linked with cigarette smoking and with the amount of smoking that such a feature should be easily detectable. (p. 296)

In sociology, it was once common practice to teach this kind of thinking with reference to cross-classified survey data (Rosenberg 1968). A large zero-order association was one with a large percentage difference, i.e., the percentage with some outcome characteristic under a potential treatment condition less the percentage with that same outcome characteristic conditional on an alternative (control). Before one spent a lot of time re-tabulating the data conditional on one (or two) control variables, it behooved one to create zero-order tables checking the association between a control variable and the original independent (or treatment) variable, and between the control variable and the dependent variable. If these zero-order percentage differences were not at least as large as the original effect (association) observed, then finding that that original association could be explained – in the sense of a substantial reduction in the association treatment/control (independent/dependent variable) conditional on re-categorization by the control variable – was just not going to happen.

This must have been the kind of knowledge people didn’t really want to have, because the rise of high-speed computing and regression models with many variables led to a situation in which the vagueness of high-dimensional space gave license to all manner of seminar-room speculation and criticism of interpretations of strong observed associations as reflecting causal processes. Hope could spring, if not eternal, then at least more provisional than would be warranted given a tighter focus on elementary facts. With controlling and partialling taking place through the calculation and inversion of ever-expanding variance-covariance matrices, it was easy to forget just how big an association a possible confounding or lurking variable needed to have to be the real explanation of an observed association. I thus sympathize with Hill’s (1965) plaint:

If we cannot detect it or reasonably infer a specific [confounding factor], then in such circumstances I think we are reasonably entitled to reject the vague contention of the armchair critic ‘you can’t prove it, there may be such a feature’. (p. 296)
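The arithmetic behind this point can be sketched with correlations in place of percentage differences. This is a swapped-in illustration, not a calculation taken from Hill or from the sources cited here. For three variables, the partial correlation of X and Y net of Z is (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2)(1 - r_yz^2)), so Z can fully explain the association only if r_xz * r_yz = r_xy. Since no correlation exceeds 1 in magnitude, both of Z's associations must then be at least as large as the one being explained. The function names and the correlation values below are hypothetical.

```python
import math


def partial_corr(r_xy, r_xz, r_yz):
    """Correlation of X and Y after partialling out a single control Z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))


def required_r_yz(r_xy, r_xz):
    """Z-Y correlation needed for Z to explain away r_xy entirely.

    A result above 1 means no such Z can exist.
    """
    return r_xy / r_xz


# Hypothetical observed association between treatment X and outcome Y.
r_xy = 0.5

# A weakly linked candidate confounder barely dents the association.
print(partial_corr(r_xy, r_xz=0.3, r_yz=0.3))

# With r_xz = 0.3, the confounder would need an impossible link to Y.
print(required_r_yz(r_xy, r_xz=0.3))

# Only a confounder strongly tied to both X and Y removes the association.
print(partial_corr(r_xy, r_xz=0.8, r_yz=0.625))
```

Checking the two zero-order links before re-tabulating anything is the same discipline, carried out with correlations instead of percentage differences.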

Also, things improved: The statistical methods and vocabulary for addressing hidden bias in observational studies (Rosenbaum 1991) were a bracing antidote to “vague contention,” since the establishment of bounds on estimated treatment effects entails a statement of just how strong selection into a treatment – how far a departure from random assignment, how “intimately linked” the confound and treatment – would need to be to gainsay the treatment effect as estimated conditional on observables and other aspects of study design. These positive developments have been taken up by sociologists (e.g., DiPrete and Gangl 2004), but I would still like to quibble with the current state of scientific and scholarly affairs. Here I piggyback on Hill’s (1965) perspicacious allusion to the idea that a strong confound “should be easily detectable” (p. 296). Our overriding concern with hidden bias in causal inference suggests to me (a) a level of social science so immature as to not yet have recognized the most powerful features of the explanatory environment; and/or (b) the human tendency to imagine that the forces that we cannot see are far larger than those that are in front of our eyes. I suspect that there are non-sociologists and sociologists alike who would plump for the former characterization. I am not among them. Later I shall comment on some aspects of sociological explanation that run counter to the individual-level reductionism intrinsic to statistical and econometric causal analysis. Here I offer an example of the hoary sociological approach to causal analysis, in this case an investigation of whether a theoretically anticipated association is suppressed by observable, plausible confounds. Davis (1982) presents findings “. . . [that] cast considerable doubt on the ‘class culture’ notion that occupational strata have vast and diffuse effects on the texture of our lives” (pp. 580-581). There is no evidence that an effect of occupation cum class culture on various dispositions is being suppressed by associations of occupation with race, age, and/or sex. Pace “‘middle-class values,’ ‘the culture of poverty,’ ‘hard hat mentality,’ ‘working class authoritarianism’” (p. 582) and so on:

...[T]he association between race and [occupational] stratum, net of [e]ducation, is not all that strong... Since test variables must have stronger associations with X and Y than the associations they explain, race is not a promising [suppressor variable]. Sex, on the other hand, does have a healthy association with [o]ccupational stratum, net of [e]ducation..., but at the other end of the line, [s]ex is not, in general, a strong correlate of...attitude... Finally, we consider [a]ge. Very young workers do have lower prestige jobs... [and] there is a decent [association] for [a]ge and [o]ccupation net of [e]ducation, and numerous studies show younger people to be more ‘liberal.’ I suspect further analyses introducing [a]ge would allow more occupational effects to peep through. Nevertheless, [the partial association of age with occupation] is not a whopper vis-à-vis [other associations]. Consequently, age would have to show extraordinary effects on the [outcome] variables for it to suppress any but small effects of occupational stratum. Thus, I doubt that [a]ge has strong enough effects to change the broad picture. (pp. 582-583)

Davis (1982) concludes in a manner that is in line with Hill’s (1965) thinking on confounding variables and causation:

Similar mental exercises with other obvious test variables did not lead to any more promising ones... In sum, my conclusion is that occupational stratum simply does not have the diffuse and strong effects on our nonvocational attitudes and opinions that Sociologists have generally assumed. (p. 583; emphasis mine)

As for our penchant to credit the power of unseen forces for the acts of men: In the Western canon, we can date the imputation of reflexive action – of reasoning, of means-ends orientations consequent to thought and to choice – to the 5th century of the Athenian era (Romilly [1984] 1994). To be sure, prior to then people acted – Homer’s heroes were nothing if not men of action – but this preceded the idea that a cause could and should attach itself to some purposive action, as in action after reflection. Before this, stuff just happens, as in the deterministic, inevitable cycles of revenge detailed by Herodotus (Romilly [1984] 1994, p. 177) or in consequence to exogenous shocks: storms, waves, and other accidental misfortunes (p. 34). But the most interesting causal attributions in the pre-psychological era were the forces that substituted for cognition in the minds of men. It was all about the gods, who hovered everywhere. They were primordial in Homer, for whom “action goes faster than reflection, brushing it away.
So that, as necessary, a god can take it on himself to intervene and decide” (Romilly [1984] 1994, p. 34; my translation).

In consequence divine causality often comes to remove much of the importance of human causation. . . . And above all, even when a man seems to be making a decision alone, one never knows if there isn’t a god leading him along. Aeschylus shows constantly this divine causation that doubles for, and puts the lie to, the free play of human motives. (pp. 36 and 65 [my translation])

Hill’s (1965) skepticism regarding alternative explanations for the much higher rates of lung cancer among cigarette smokers than among non-smokers was restricted to “any other environmental terms” (p. 296; emphasis mine). But what of non-environmental factors? Nobody would now claim that the gods of the ancients still exist, or that they meddle in the affairs of men. Fisher (1958, p. 163), the leading proponent for the view that the smoking-cancer relationship was spurious and not causal, instead emphasized genotype as the hidden variable (“common cause”) behind the association between smoking and lung cancer. With a solicitude that would have been touching were it not so murderous, Fisher (1958) fulminated against telling people that something they were doing could be making them deadly ill, especially if their reasons for smoking were reasonable and, in the event, beyond their control. As with the gods, back in the day.


I’m kidding. Sort of. Genes exist. We can now observe them. They are correlated with all manner of things, many in the sociological realm (Bearman 2008). Let’s set aside for the moment the fraught place of genetics in sociology and other social sciences – the difficulty in conversation between those for whom the science of the situation is so appealing as to make its desirability self-evident and those who smell yet another non-human-agency rationalization for social inequality. Are these the massive, powerful lurking variables, hitherto unobservable but now eminently so, that constitute the hidden bias in the estimation of causal effects in the social world? To date, the answer would seem to be “no,” at least if we think about causation, especially in observational studies, with reference to a “fall from grace” relative to the experimental model. By this I mean genes that are non-randomly assigning folks to one social position or another while simultaneously determining some desirable social achievement, thereby creating the illusion that the social position is in some sense causing the social achievement. When really it is just the genes, doing their thing. Instead, “genetic expression can only reveal itself through social structural change” (Bearman 2008, p. v). This remark derives from a collection of studies that find, inter alia, that all manner of gene-behavior associations are altered if not neutralized as social environments vary (e.g., school atmospherics [Guo, Tong, and Cai 2008], family life [Martin 2008], networks of social support [Pescosolido et al. 2008], and national educational policies [Penner 2008]). The lurking variables, from a causative perspective, are less the gods and alleles – they are what they are – than all the social circumstances staring us in the face. The reference to social structure, as opposed to the socially meaningful attributes of individuals, may seem murky at this point. I address it further below.
An overarching point is that our penchant for reductionism tends to steer us away from causation operating above the level of the individual (Smith 2013, pp. 65-69).

Causes with Many Effects

To help in adducing causation in observational studies, we can elaborate our theories to specify unaffected units, essentially equivalent treatments, and unaffected responses (Rosenbaum 1984, pp. 42-44). We gain confidence that something is a cause of something else if we find its effects where we anticipate them and not where we do not. A now common feature of econometrics, a placebo or falsification test (e.g., Rothstein 2010), is typically concerned with specifying unaffected responses.

Specificity was the third of Hill’s (1965) criteria for establishing causation, although he was quick to add that “[w]e must not, however, overemphasize the importance of the characteristic” (p. 297; see, also, Holland 1986, pp. 956-957). Small wonder. The relationship between cigarette smoking and lung cancer is very strong (p. 296), but “the prospective investigations of smoking and cancer of the lung have been criticized for not showing specificity – in other words the death rate of smokers is higher than the death rate of non-smokers from many causes of death...” (p. 297). In the event, smoking is just flat-out bad for human health, even if its signature effect is on lung cancer. Hill (1965, p. 296) was quick to point out that “it does not follow...that [the] best measure of the effect upon mortality is also the best measure in relation to ætiology...” The case against smoking was best made with reference to differential mortality with respect to lung cancer, although the population prevalence of cardiac disease means that the weaker causal effects of smoking on cardiac disease are nonetheless associated with more excess deaths (e.g., Fenelon and Preston 2012). As Hill (1965, p. 296) noted,

It does not, of course, follow that the differences revealed by ratios are of any practical importance. Maybe they are, maybe they are not; but that is another point altogether.
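The distinction between a ratio and its practical import can be sketched with simple arithmetic. The rates and rate ratios below are invented for illustration; they are not taken from Hill (1965) or from Fenelon and Preston (2012). The only point is that a modest ratio applied to a common cause of death can yield more excess deaths than a large ratio applied to a rare one.

```python
def excess_deaths(baseline_rate_per_100k, rate_ratio, exposed_population):
    """Deaths among the exposed beyond what the unexposed rate would produce."""
    exposed_rate = baseline_rate_per_100k * rate_ratio
    return (exposed_rate - baseline_rate_per_100k) * exposed_population / 100_000


# Hypothetical figures: a rare cause of death with a large rate ratio...
rare_cause = excess_deaths(baseline_rate_per_100k=10, rate_ratio=10.0,
                           exposed_population=1_000_000)

# ...and a common cause of death with a small rate ratio.
common_cause = excess_deaths(baseline_rate_per_100k=300, rate_ratio=1.5,
                             exposed_population=1_000_000)

print(rare_cause, common_cause)
```

The ratio is the better guide to etiology; the difference in rates, scaled by the number exposed, is the better guide to how many deaths are at stake.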

Analogies have their limits. Cigarettes and lung cancer may be an extreme causal relation. But the generalized noxiousness of smoking is not. In the social sciences there are several factors that are associated with all manner of outcomes. Education is the canonical example, and fortunately the effects of education are in general positive. Davis (1982) was combing the General Social Survey for predictors of morale (social life feelings), social attachment, political opinion, values and tastes, and stances on social issues. Again and again, education popped up as a strong predictor (if not a cause), even after netting out education’s effect on occupational attainment, the indicator of the class cultures that Davis (1982) found wanting in explanatory power. Cutler and Lleras-Muney (2010) do a similar survey of the association between education and various health behaviors. Education is again an ever-present factor. It is hard to think of a domain of social life when it is not. When there is a factor whose effects are so ubiquitous, the precision of any one effect is of decreasing interest, in the sense that building an argument for action (or not) on the precise results in one domain may be a bit blind with respect to the sociology of the situation. Studies seeking to estimate the causal effect of education on this or that abound, and of course the estimated causal effects do not always look like the zero-order associations, or even the partial associations net of standard sets of observable antecedent variables. I don’t mean to be a scientific Luddite: I am not arguing that we break our tools for determining what effects are and are not “causal.” I do want to suggest that such results be taken in the larger context, where the larger context includes the mass of effects on all manner of outcomes. 
One counter-argument gets back to the gods and the genes: That once we uncover the hidden factors that are determining both education and everything else we do, we’ll feel silly for having imagined that any of this was within our control—in the provision of education, for example. Against this, I would argue that many social causative factors—especially those with many effects—are operating at a level above the inter-individual variability that tends to dominate our sense of what is causal and what is not. Education is in one sense embedded in the social structure; but changes in education—especially the stock of education—are also changing the social structure. Mass education and fertility is a good example. Education has long been associated with declines in fertility. But how exactly? Caldwell’s (1982) theory of fertility and fertility decline brings into focus the social structural character of the factors first supporting high fertility, then precipitating its decline (Smith 1989, pp. 172-173):

...Patriarchy is a social institution subsuming large numbers of women within families, families within kin groups, and kin groups within communes or villages. The subordination of youth to their elders and of women to men is not a feature of particular households, families, or kin groups, but of the larger social structure. Individual variation (deviance) is of little account when arrayed against the larger forces militating for conformity to essential behaviors, including fertility (Caldwell, 1982:172). When change comes, it comes not through the collective exercise of individual choice, but through the collapse of a larger system that had heretofore constrained all choices of behavior open to individuals. Theories of modernization that concentrate on personality change are criticized, due to their implication that “individuals could always have lived different ways of life by opting to do so, whether or not the needed economic and social institutions for the new way of life had yet come into existence” (Caldwell, 1982:280). A primary source of fertility change is mass education. Mass education creates both educated (expensive, ungrateful, questioning) children and educated wives. Educated wives reduce the net wealth flow from wife to husband (and mother-in-law and father-in-law), strengthen the bonds between husband and wife (undermining the traditional family structure and its morality), and seek to avoid repeated and periods with infants. Education is easily measured at the individual level, and its incorporation into micro models of fertility and fertility-related behaviors is on occasion justified with reference to Caldwell’s (1982) emphasis on education as a source of fertility decline. But Caldwell unambiguously points to the macro properties of education: “[T]he education of only half the community does not have the same effect on that half of the population, nor half the effect on the whole population” (1982:329). When there remain many in a community who have not attended school, strong forces maintaining the traditional family morality still abound. “[T]he evidence suggests that the most potent force for change is the breadth of education (the proportion of the community receiving some schooling) rather than the depth (the average duration of schooling among those who have attended school).”

Causation and Action

If you are reading about John Snow and cholera, you are probably not studying cholera. More likely, you are being instructed regarding what causal inference in good observational science looks like (e.g., Fisher 1958, pp. 156-157; Freedman 1991, pp. 294-299). Ever willing to ape my betters, I have adduced Snow’s admirable studies of cholera in support of a point that strikes me as basic but nevertheless goes against the grain of contemporary research habits: If you have a nice estimate of an effect based on random assignment or credible as-if-by-random assignment (Snow’s case [Smith 2013, p. 50]), do you really need to understand the intervening process to have established causation? I think not (Smith 2013, pp. 60-63), and am buoyed in seeing that Hill (1965, p. 298) had also used Snow as an example in getting to the same point (albeit half a century earlier):

Before deducing ‘causation’ and taking action we shall not invariably have to sit around awaiting the results of that research. The whole chain may have to be unraveled or a few links may suffice. It will depend upon circumstances. (p. 295)

Moreover, the social circumstances linking social conditions to specific outcomes, including health outcomes, can be intransigent, in a way that is not well captured by their elaboration in terms of individual-level intervening health behaviors and biological characteristics. “[F]undamental causes can defy efforts to eliminate their effects when attempts to do so focus solely on the mechanisms that happen to link them to a disease in a particular situation” (Link and Phelan 1995, p. 81). This is reflected in the enduring individual-level association between socioeconomic status, including education, and health. How does this happen?

...[T]he association between a fundamental cause can be preserved through changes either in the mechanisms or in the outcomes...[S]ome causes...“basic causes,” have enduring effects on a dependent variable because, when the effect of one mechanism declines, the effect of another emerges or becomes more prominent. (Link and Phelan 1995, p. 87; my emphasis)

Why does this happen? Social structures are about nothing if not the differential allocation of scarce resources, both for their intrinsic value and for the maintenance of status hierarchies that will allow for differential allocation in the future, as new outcomes of interest arise. In the case of social status and health,

...[T]he essential feature of fundamental social causes...is that they involve access to resources that can be used to avoid risks or to minimize the consequences of disease once it occurs...[R]esources...include money, knowledge, power, prestige, and the kinds of interpersonal resources embodied in the concepts of social support and social networks. Variables like SES, social networks, and stigmatization are used...to directly assess these resources and are therefore especially obvious as potential fundamental causes. However, other variables...such as race/ethnicity and gender...are so closely tied to resources like money, power, prestige, and/or social connectedness that they should be considered as potential fundamental causes of disease as well. (Link and Phelan 1995, p. 87)

Because the social structures that we create have an internal logic that transcends the temporal correlations that they create with so-called intervening variables, action oriented toward those intervening variables alone may not have the anticipated outcomes. Knowing more is better, and elaborating the process leading from some factor to an outcome of interest is a hallmark of science. But it is not necessarily a hallmark of causation, especially inasmuch as causation is understood as what would or will happen if one were actually to do something (e.g. Hill 1965, p. 300). The distinction has been made by Holland (2008, p. 99):

One of the problems of communication between social scientists and policy makers is related to the distinction I make between assessing effects and describing mechanisms. Understanding some aspect of a causal mechanism often advances science (i.e., theory), whereas the needs of public policy often require an answer that assesses the effects of an intervention, rather than reasons or speculations as to how these effects come about. If class size reduction results in better student learning, a policy maker might argue that it does not matter if this effect is due to more time for individualized instruction, fewer classroom disruptions, or something else. On the other hand, the mechanism might matter to the policy maker if other reform policies besides class size reduction are of interest. Knowledge of the causal mechanism could indicate that other policies would be supportive or possibly contraindicated when classes are small. My view is that both positions need to be clearly delineated and not confused with each other.

This confusion of purpose stares at us on the first pages of one of the best texts conveying the modern armamentarium for causal analysis within economics:

A causal relationship is useful for making predictions about the consequences of changing circumstances or policies; it tells us what would happen in alternative (or “counterfactual”) worlds. For example, as part of a research agenda investigating human productivity capacity – what labor economists call human capital – we have both investigated the causal effect of schooling on wages. . . . The causal effect of schooling on wages is the increment to wages an individual would receive if he or she got more schooling. A range of studies suggest the causal effect of a college degree is about 40 percent higher wages on average, quite a payoff. The causal effect of schooling on wages is useful for predicting the earnings consequences of, say, changing the costs of attending college, or strengthening compulsory attendance laws. (Angrist and Pischke 2009, pp. 3-4)

I am not contending that estimates of “the causal effect of schooling on wages” are useless “for predicting the earnings consequences of, say, changing the costs of attending college.” They are definitely useful, and it is possible to integrate formally the two perspectives. Todd and Wolpin (2006) is an admirable example, based on some serious theory (Wolpin 2013). On the ground, I think that we are far closer to the confusion that Holland (2008, p. 99) describes. In seminar rooms, researchers present and debate the fine points of soi-disant causal analysis.
With their authority duly established by dint of having wrangled some micro-level observational data into the as-if-by-random-assignment computational frame, they soon weigh in on actions, which are virtually always policy prescriptions for altering choice sets, not the micro factors involved in the preceding causal estimation demonstration.

The focus on establishing “causation” as a form of legitimation (Holland 2008, p. 101) without keeping the action orientation in mind can lead us to strange places. For example: In the United States, lists of eligible voters are publicly available, so that one can observe who actually voted in a given election (if not for whom they voted, individually). In 2004 a team of political scientists obtained files of the electoral rolls in Illinois – more than seven million names – including information on their addresses, telephone numbers, and demographic characteristics including their sex and age, along with their histories of electoral participation (Arceneaux, Gerber, and Green 2010). They eliminated very large households and those without telephone numbers (talk about a bygone era!). This led to a file of 2.7 million households with at least one eligible voter (and fewer than five). They then took a random sample of 16,000 potential electors (only one per household) and tried to reach them by telephone to encourage them to vote in an upcoming election. Only 41% of the potential voters sampled could be reached by phone, and the researchers were concerned that the sort of folks who still pick up a telephone no matter who is calling are also the most likely to vote following a get-out-the-vote call.


They thus sought to estimate a treatment-on-treated effect, “the causal effect of a phone call among those who are reachable” (Arceneaux et al. 2010, p. 260). They were well aware that a control group – even one carefully selected from among the possible voters who were not called but who might be matched on observables (age, voting history, etc.) to those who were reached and hence had been encouraged to vote – would be mixing up the polite folks who feel obliged to pick up the phone when it rings with those who don’t trust unsolicited phone calls, are rarely at home (remember: a bygone era!), and/or who didn’t specify a home phone number. Therefore, to evaluate the effects of a get-out-the-vote phone call on those who did in fact receive one (they picked up the phone, they got the message), Arceneaux et al. (2010) did an analysis via two-stage least-squares, where the random selection for possible contact served as the instrumental variable in a regression of subsequent vote (went to the polls or did not) on whether contact was actually made (yes or no). It turns out that (a) the probability of going to vote increased by 2% as a result of receiving (implying answering) an encouragement-to-vote phone call; and as suspected, (b) the people who tended to answer such a phone call are also the kind of people who go out to vote, whether they are called or not.

Arceneaux et al. (2010) were interested in demonstrating the insufficiency of a different potential method – matching on observables, as described above – for estimating their preferred parameter, the effect of the treatment on the treated. Agreed: Matching doubtless does not control for an important unobservable, the tendency to pick up the telephone when it rings. In comparison, the two-stage estimator gives an unbiased estimate of the effect of calling among those who pick up the phone. In that sense it is a preferable causal estimator.
But suppose you are a party worker or other campaign operative who is interested in finding more votes: The effect of a phone call on those who answer the phone is not your primary interest. You want to know what is going to happen when you inundate the registered voting population with telephone calls that will, in the main, go unanswered. The effect touted and precisely measured by Arceneaux et al. (2010) does not accord well with the intervention that one could imagine making. In which case I think the tendency to put causal estimation first and action second is problematic, at least in the action-orientation sense of causation: “What would happen if...?”
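The contrast between the two quantities can be made concrete with a small simulation. The sketch below is illustrative only and is not the analysis of Arceneaux et al. (2010): the 41% reach rate and the 2-point call effect echo the figures quoted above, while the baseline turnout rates, the 15-point turnout gap between people who do and do not answer the phone, the 50% assignment share, and the function names are all assumptions made for the example. With a binary instrument, a binary treatment, and no covariates, the two-stage least-squares estimate reduces to the Wald ratio computed here.

```python
import numpy as np


def simulate_encouragement(n=1_000_000, seed=0):
    """Simulate a phone-call encouragement design (all rates are assumptions)."""
    rng = np.random.default_rng(seed)
    # Unobserved trait: does this person pick up the phone?
    answers = rng.random(n) < 0.41
    # Random assignment to the call list; this is the instrument.
    assigned = rng.random(n) < 0.5
    # Contact is made only if the person is on the list AND answers.
    contacted = assigned & answers
    # Answerers vote more anyway (assumed +15 points); the call adds 2 points.
    p_vote = 0.40 + 0.15 * answers + 0.02 * contacted
    voted = rng.random(n) < p_vote
    return assigned, contacted, voted


def wald_estimates(assigned, contacted, voted):
    """Return intent-to-treat, contact rate, Wald (IV) and naive estimates."""
    # Intent-to-treat: effect of being put on the call list at all.
    itt = voted[assigned].mean() - voted[~assigned].mean()
    # First stage: how much assignment moves actual contact.
    contact_rate = contacted[assigned].mean() - contacted[~assigned].mean()
    # Wald ratio: effect of the call among those who were reached.
    tot = itt / contact_rate
    # Naive comparison of contacted vs. everyone else (confounded by 'answers').
    naive = voted[contacted].mean() - voted[~contacted].mean()
    return {"itt": itt, "contact_rate": contact_rate, "tot": tot, "naive": naive}


if __name__ == "__main__":
    for name, value in wald_estimates(*simulate_encouragement()).items():
        print(f"{name}: {value:.4f}")
```

Under these assumed numbers the intent-to-treat effect is, by construction, about 0.02 × 0.41 ≈ 0.008, which is closer to what a campaign that calls everyone on a list would see, whereas the Wald ratio targets the 0.02 effect among those reached. The naive contrast is much larger than either because it absorbs the higher baseline turnout of people who answer the phone.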

Social Structure as a Cause

A social structure is a relational system that exists not just as a function of the individuals present in a society and their characteristics (psychological, genetic, social and otherwise), but of a set of social roles that tend to persist (or change) independent, to a great degree, of the characteristics of the incumbents of roles. It includes power relations, norms, and habits of mind that weigh on opportunities and behavior. This can sound imprecise and unscientific, certainly relative to measures of observables on individuals, such as alleles, levels of education, criminal records, etc. (maybe even people’s sex and race) that figure in most statistics-based treatments of cause and effect. For example:

At the class certification hearing in federal district court, Plaintiffs’ sociological expert witness testified regarding his “social framework analysis” of Wal-Mart’s “culture” and personnel practices, and concluded that the company was “vulnerable” to gender discrimination. The reasoning here was from the general – that of Wal-Mart’s “strong corporate culture” – to the specific – that Wal-Mart discriminated against its women employees as a consequence... (Dawid, Faigman, and Fienberg 2014, p. 380).

The many terms placed inside quotation marks speak volumes: Dawid et al. (2014) were not fans of this form of analysis (nor was the court). I cannot speak for the legal issues in general, nor about this case in particular, but I would not be so quick to dismiss the validity of the concepts, that is, their reality as depictions of essential aspects (yes, in the causative sense) of the organization in which these women were working. Imagine we were to describe the sort of gender system within which we are living, I as the writer, you as the readers. Unless the readership of Observational Studies is substantially different from what I imagine it to be, the rules of engagement, the expectations, the opportunities, the behaviors, and the relationships (in every sense of the word), not just between men and women, but between women and other women, men and other men (not to mention the creation of gender identities not bound to these binary criteria): well, I suspect that they would preclude if not sanction placing in our papers a homily of the form:

If your wife ran off with the lodger last week you still have to take your perforated ulcer to hospital without delay. But with a hernia you might prefer to stay home for a while – to mourn (or celebrate) the event. (Hill 1965, p. 296)

By which I do not mean to impugn Sir Austin Bradford Hill. He, like most of us, was a man of his times with respect to many things, in addition to being farsighted intellectually in the ways we celebrate in revisiting his work. I do, however, mean to illustrate what something along the lines of a “strong [XXXX] culture” might look like, what it might feel like, and how it might be perpetuated, even (especially) by those who mean no harm.
The modern statistical and philosophical literatures on causation have long been stuck on how best to capture the “effects” of race and sex (or gender), what sociologists call the ascriptive aspects of individuals. (Education and income, in contrast, are statuses that one putatively achieves.) I would have thought that scholars would have moved past the idea that racial characteristics ascribed to individuals were in any sense a cause of what might or might not be happening to them (Smith 2003, p. 465), but I would have been wrong (cf. Marcellesi 2013). Yet even reframing the problem so that the subjects who are causing things are the units across whom race inputs are being measured (e.g. Pager 2003; Greiner and Rubin 2011) does not get at the social structure that is in a real sense causing the situation:

[A]fter a society becomes racialized, racialization develops a life of its own. Although it interacts with class and gender structurations in the social system, it becomes an organizing principle of social relations in itself... Race, as most analysts suggest, is a social construct, but that construct, like class and gender, has independent effects in social life. After racial stratification is established, race becomes an independent criterion for vertical hierarchy in society. Therefore different races experience positions of subordination and superordination in society and develop different interests. (Bonilla-Silva 1997, p. 475)

The idea of racism without racists (Bonilla-Silva 2013) undercuts the continual focus on race as a causal property of individuals, because what one is dealing with are the pervasive, interrelated effects of a social structure with any number of taken-for-granted roles and social relations. Perhaps in another social world, the set of experiences would be very different:

If race is not a causal variable, how do we analyze issues of social discrimination in causal terms, if at all? We certainly do think of racial discrimination in causal terms because many of us think racial discrimination is something that could be changed, reduced, or in some way altered. There are those who dream of a day when racial discrimination is a thing of the past and long forgotten. What is it that has to change? Certainly not the color of people’s skin or some other physical characteristic. Clearly discrimination is a social phenomenon, one that is learned; it is taught and fostered by a social system in which it plays a complex part. When we envision a world without racial discrimination we thus envision it as a whole social system that must be different in a variety of ways from what we see before us. One almost has to envision a parallel world, so to speak, in which things are so different that what we recognize in our own world as racial discrimination does not exist in this other parallel world... (Holland 2008, p. 102)

Coda

I may be straying into dogmatism, which would be unfortunate, because this is precisely the point: Our interest in causation could use far less dogmatism, far less sense that there are some principles that are intrinsically more important than others in understanding one aspect of how the world works: i.e., what might we expect to happen if we did something or another? I read Hill (1965) as a reminder of what that sensibility might look like.

References

Angrist, Joshua D., and Jörn-Steffen Pischke (2009). Mostly Harmless Econometrics. Princeton: Princeton University Press.
Arceneaux, Kevin, Alan Gerber, and Donald P. Green (2010). “A Cautionary Note on the Use of Matching to Estimate Causal Effects: An Empirical Example Comparing Matching Estimates to an Experimental Benchmark.” Sociological Methods & Research, 39(2):256-282. https://doi.org/10.1177%2F0049124110378098
Bearman, Peter (2008). “Introduction: Exploring Genetics and Social Structure.” American Journal of Sociology, 114(S1):v-x. https://doi.org/10.1086/596596
Bonilla-Silva, Eduardo (1997). “Rethinking Racism: Toward a Structural Interpretation.” American Sociological Review, 62(3):465-480. https://www.jstor.org/stable/2657316
Bonilla-Silva, Eduardo (2013). Racism without Racists: Color-Blind Racism and the Persistence of Racial Inequality in America (4th ed.). Lanham, MD: Rowman and Littlefield.
Caldwell, John C. (1982). Theory of Fertility Decline. New York: Academic Press.

Cutler, David M., and Adriana Lleras-Muney (2010). “Understanding Differences in Health Behaviors by Education.” Journal of Health Economics, 29(1):1-28. https://doi.org/10.1016/j.jhealeco.2009.10.003
Davis, James A. (1982). “Achievement Variables and Class Cultures: Family, Schooling, Job, and Forty-Nine Dependent Variables in the Cumulative GSS.” American Sociological Review, 47(5):569-586. https://www.jstor.org/stable/2095159
Dawid, Philip A., David L. Faigman, and Stephen E. Fienberg (2014). “Fitting Science into Legal Contexts: Assessing Effects of Causes or Causes of Effects?” Sociological Methods & Research, 43(3):359-390. https://doi.org/10.1177%2F0049124113515188
DiPrete, Thomas A., and Markus Gangl (2004). “Assessing Bias in the Estimation of Causal Effects: Rosenbaum Bounds on Matching Estimators and Instrumental Variables Estimation with Imperfect Instruments.” Sociological Methodology, 34:271-310. https://doi.org/10.1111%2Fj.0081-1750.2004.00154.x
Fenelon, Andrew, and Samuel H. Preston (2012). “Estimating Smoking-Attributable Mortality in the United States.” Demography, 49(3):797-818. https://doi.org/10.1007/s13524-012-0108-x
Fisher, Sir Ronald (1958). “Cigarettes, Cancer, and Statistics.” The Centennial Review of Arts & Science, 2:151-166. www.jstor.org/stable/23737529
Freedman, David A. (1991). “Statistical Models and Shoe Leather.” Sociological Methodology, 21:291-313. http://www.jstor.org/stable/270939
Greiner, James, and Donald Rubin (2011). “Causal Effects of Perceived Immutable Characteristics.” Review of Economics and Statistics, 93(3):775-85. https://doi.org/10.1162/REST_a_00110
Guo, Guang, Yuying Tong, and Tianji Cai (2008). “Gene by Social Context Interactions for Number of Sexual Partners among White Male Youths: Genetics-Informed Sociology.” American Journal of Sociology, 114(S1):S36-S66. https://doi.org/10.1086/592207
Hill, Sir Austin Bradford (1965). “The Environment and Disease: Association or Causation?” Proceedings of the Royal Society of Medicine, 58(5):295-300. https://journals.sagepub.com/doi/pdf/10.1177/003591576505800503
Holland, Paul W. (1986). “Statistics and Causal Inference.” Journal of the American Statistical Association, 81(396):945-960. http://www.jstor.org/stable/2289064
Holland, Paul W. (2008). “Causation and Race.” pp. 93-109 in White Logic, White Methods: Racism and Methodology, edited by Tukufu Zuberi and Eduardo Bonilla-Silva. Lanham, MD: Rowman & Littlefield.
Link, Bruce G., and Jo Phelan (1995). “Social Conditions As Fundamental Causes of Disease.” Journal of Health and Social Behavior, 35(Extra Issue):80-94. http://www.jstor.org/stable/2626958
Marcellesi, Alexandre (2013). “Is Race a Cause?” Philosophy of Science, 80(5):650-659. https://www.jstor.org/stable/10.1086/673721
Martin, Molly A. (2008). “The Intergenerational Correlation in Weight: How Genetic Resemblance Reveals the Social Role of Families.” American Journal of Sociology, 114(S1):S67-S105. https://doi.org/10.1086/592203
Morgan, Stephen L. (ed.) (2013). Handbook of Causal Analysis for Social Research, edited by Stephen L. Morgan. Dordrecht: Springer.

Pager, Devah (2003). “The Mark of a Criminal Record.” American Journal of Sociology, 108(5):937-975. https://doi.org/10.1086/374403
Penner, Andrew M. (2008). “Gender Differences in Extreme Mathematical Achievement: An International Perspective on Biological and Social Factors.” American Journal of Sociology, 114(S1):S138-S170. https://doi.org/10.1086/589252
Pescosolido, Bernice A., Brea L. Perry, J. Scott Long, Jack K. Martin, John I. Nurnberger, Jr., and Victor Hesselbrock (2008). “Under the Influence of Genetics: How Transdisciplinarity Leads Us to Rethink Social Pathways to Illness.” American Journal of Sociology, 114(S1):S171-S201. https://doi.org/10.1086/592209
Romilly, Jacqueline (de). [1984] 1994. “Patience mon coeur !” L’essor de la psychologie dans la littérature grecque classique. Paris: Pocket.
Rosenbaum, Paul R. (1984). “From Association to Causation in Observational Studies: The Role of Tests of Strongly Ignorable Treatment Assignment.” Journal of the American Statistical Association, 79(385):41-48. https://www.jstor.org/stable/2288332
Rosenbaum, Paul R. (1991). “Discussing Hidden Bias in Observational Studies.” Annals of Internal Medicine, 115(11):901-905. https://doi.org/10.7326/0003-4819-115-11-901
Rosenberg, Morris (1968). The Logic of Survey Analysis. New York: Basic Books.
Rothstein, Jesse (2010). “Teacher Quality in Educational Production: Tracking, Decay, and Student Achievement.” The Quarterly Journal of Economics, 125(1):175-214. https://doi.org/10.1162/qjec.2010.125.1.175
Rubin, Donald B. (1974). “Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies.” Journal of Educational Psychology, 66(5):688-701. http://dx.doi.org/10.1037/h0037350
Rubin, Donald B. (2005). “Causal Inference Using Potential Outcomes: Design, Modeling, Decisions.” Journal of the American Statistical Association, 100(469):322-331. https://doi.org/10.1198/016214504000001880
Smith, Herbert L. (1989). “Integrating Theory and Research on the Institutional Determinants of Fertility.” Demography, 26(2):171-184. http://www.jstor.org/page/info/about/policies/terms.jsp
Smith, Herbert L. (1990). “Specification Issues in Experimental and Nonexperimental Social Research.” Sociological Methodology, 20:59-91. https://www.jstor.org/stable/271082
Smith, Herbert L. (2003). “Some Thoughts on Causation as It Relates to Demography and Population Studies.” Population and Development Review, 29(3):459-469. http://www.jstor.org/stable/3115284
Smith, Herbert L. (2013). “Research Design: Toward a Realistic Role for Causal Analysis.” Chapter 4 (pp. 45-73) in Handbook of Causal Analysis for Social Research, edited by Stephen L. Morgan. Dordrecht: Springer.
Todd, Petra E., and Kenneth I. Wolpin (2006). “Assessing the Impact of a School Subsidy Program in Mexico: Using a Social Experiment to Validate a Dynamic Behavioral Model of Child Schooling and Fertility.” American Economic Review, 96(5):1384-1417. https://www.jstor.org/stable/30034980
Wolpin, Kenneth I. (2013). The Limits of Inference without Theory. Cambridge, MA: MIT Press.

Observational Studies 6 (2020) 47-54 Submitted 9/19; Published 1/20

Hill’s Causal Considerations and the Potential Outcomes Framework

Tyler J. VanderWeele
Department of Epidemiology and Department of Harvard University Boston, MA, USA 02115

Hill’s considerations for assessing causality (Hill, 1965) are arguably still important today, as indeed they were at the time of writing. The formalization of causal inference has, however, advanced considerably since Hill’s paper. The potential outcomes or quantitative counterfactual framework (Rubin, 1974; Rosenbaum and Rubin, 1983a; Robins, 1986; Imbens and Rubin, 2015; Hernán and Robins, 2019) has clarified and made more precise our reasoning about causation and the underlying assumptions for much of causal inference. It has also allowed for dramatic expansion of the types of causal questions that can be addressed (Angrist et al., 1995; Robins et al., 2000; Pearl, 2009; Hudgens and Halloran, 2008; Imai et al., 2010; Tchetgen Tchetgen and VanderWeele, 2012; VanderWeele, 2015). And it has moreover, I will argue below, altered how evidence concerning causation is often assessed in practice.

Within the potential outcomes framework, in its most basic form, we can let Y(a) denote the potential outcome or counterfactual outcome that would have been observed for an individual if the exposure A had, possibly contrary to fact, been set to level a. We let Y(1) − Y(0) denote the causal effect for an individual, i.e. the difference in potential outcomes for that individual. We say that the covariates C suffice to control for confounding (or that “exchangeability holds conditional on C” or “ignorability conditional on C”) if the counterfactuals Y(a) are independent of A conditional on C, i.e. if within strata of C, the group that actually had exposure status A = a is representative of what would have occurred had the entire group with C = c been given exposure A = a. We say that the assumption of consistency holds if whenever the actual exposure A equals level a, then Y(a) = Y.
Under these assumptions of exchangeability and consistency we have that the average causal effect conditional on C = c is given by

E[Y(1) | c] − E[Y(0) | c] = E[Y(1) | A = 1, c] − E[Y(0) | A = 0, c] = E[Y | A = 1, c] − E[Y | A = 0, c],

where the first equality holds by exchangeability and the second by consistency. This final expression is just the association between exposure A and outcome Y on a difference scale. Under the assumptions of exchangeability and consistency, this association is equal to the conditional causal effect; under these assumptions, association implies causation.

The potential outcomes framework thus effectively employs deductive logic in its conclusion about causation. In contrast, as will be discussed further below, Hill’s considerations, or viewpoints, approach causation inductively. In this commentary, I will offer some discussion on how the potential outcomes framework has contributed to, or clarified, each of Hill’s considerations, how it contrasts with those considerations by using deductive rather than inductive inference, and how it has thus also altered approaches to causal reasoning in practice.

© 2020 Tyler VanderWeele.
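The identification argument in the introduction, that within strata of C the observed contrast E[Y | A = 1, c] − E[Y | A = 0, c] equals the conditional causal effect, can be checked numerically. The following is a minimal sketch under assumptions chosen purely for illustration: a single binary confounder, exposure probabilities of 0.8 and 0.2, and true conditional effects of 2.0 and 1.0. None of these numbers or function names come from the commentary.

```python
import numpy as np


def simulate(n=500_000, seed=1):
    """Simulate data in which exchangeability holds conditional on C."""
    rng = np.random.default_rng(seed)
    c = rng.random(n) < 0.5                      # measured binary confounder C
    a = rng.random(n) < np.where(c, 0.8, 0.2)    # exposure depends on C only
    noise = rng.normal(size=n)
    y0 = 3.0 * c + noise                         # potential outcome Y(0)
    y1 = y0 + np.where(c, 2.0, 1.0)              # potential outcome Y(1)
    y = np.where(a, y1, y0)                      # consistency: Y = Y(A)
    return c, a, y


def conditional_effects(c, a, y):
    """Observed contrast E[Y | A=1, c] - E[Y | A=0, c] within each stratum of C."""
    effects = {}
    for level in (False, True):
        stratum = c == level
        effects[level] = y[stratum & a].mean() - y[stratum & ~a].mean()
    return effects


def crude_difference(a, y):
    """Unadjusted contrast E[Y | A=1] - E[Y | A=0], ignoring C."""
    return y[a].mean() - y[~a].mean()


if __name__ == "__main__":
    c, a, y = simulate()
    print("within-stratum contrasts:", conditional_effects(c, a, y))
    print("crude contrast:", crude_difference(a, y))
```

In this simulated setting the within-stratum contrasts should sit close to the assumed effects of 1.0 and 2.0, while the crude contrast is inflated because exposed individuals are disproportionately drawn from the stratum with the higher baseline outcome.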

1. Strength
Hill noted that, other things being equal, the greater the strength of association between exposure and outcome, the more plausible that the relationship is causal. The underlying reasoning is that a stronger association is more difficult to entirely attribute to explanations other than causality. This intuitive notion has been formalized through various sensitivity analysis approaches (Rosenbaum and Rubin, 1983b; Imbens, 2003; Hsu and Small, 2013). Indeed sensitivity analysis to assess sensitivity or robustness to unmeasured confounding has been around since before Hill’s paper (Cornfield et al., 1959). However, these approaches to sensitivity analysis have expanded dramatically since the introduction of the potential outcomes framework to observational studies, and have been extended to address other possible biases such as measurement error and selection bias (Greenland, 2005; Rothman et al., 2008; Lash et al., 2009). For some of these techniques, the relevant sensitivity analysis parameters that are required to explain away an association can be expressed as simple monotonic functions of the magnitude of the association between the exposure and the outcome (VanderWeele and Ding, 2017; Smith and VanderWeele, 2019; VanderWeele and Li, 2019). Such techniques reinforce the contention that strength of association is a critical consideration in assessing evidence for causation.
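One example of such a monotonic function is the E-value associated with VanderWeele and Ding (2017). As a hedged sketch: for an observed risk ratio RR ≥ 1 the E-value is RR + sqrt(RR × (RR − 1)), and for RR < 1 the same formula is applied to 1/RR. The function name and the handling of invalid input below are choices made for this illustration, not part of the commentary.

```python
import math


def e_value(rr: float) -> float:
    """E-value for an observed risk ratio.

    The minimum strength of association, on the risk-ratio scale, that an
    unmeasured confounder would need with both exposure and outcome to
    fully explain away the observed association.
    """
    if rr <= 0:
        raise ValueError("risk ratio must be positive")
    if rr < 1:
        rr = 1.0 / rr  # protective associations: use the reciprocal
    return rr + math.sqrt(rr * (rr - 1.0))


if __name__ == "__main__":
    for rr in (1.0, 1.5, 2.0, 4.0, 0.5):
        print(f"RR = {rr}: E-value = {e_value(rr):.2f}")
```

Because the function is increasing in RR above 1, a stronger observed association requires a stronger unmeasured confounder to explain it away, which is the quantitative counterpart of Hill’s first consideration.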

2. Consistency
Hill devoted considerable discussion to consistency, to arriving at the same conclusion repeatedly by different persons, in different places, circumstances, and times, and using different designs. Meta-analysis and systematic review are of course often useful in accumulating evidence over different persons, places, times and circumstances, and much of the statistical development of meta-analysis proceeded independent of the potential outcomes framework. More recently, however, methods have been developed that employ the potential outcomes framework to combine meta-analytic approaches assessing consistency over studies, with the possibility of unmeasured confounding in those studies (Mathur and VanderWeele, 2019). Such approaches effectively tackle both the first and the second of Hill’s considerations simultaneously. In a different context, recent work on evidence factors provides another quantitative approach to assess robustness of evidence from different sources of data (Rosenbaum, 2010, 2017).

Perhaps an even more profound contribution from the potential outcomes framework to Hill’s consideration of consistency arises from his point about consistency of conclusions across different designs. Hill explicitly references prospective and retrospective studies; however he also makes mention, in his discussion of nickel refiners, of what might be referred to today as an interrupted design. Indeed, the number of designs that employ assumptions other than “control for confounding by covariates” to assess causation has grown dramatically, often motivated by, grounded in, or made precise through the potential outcomes framework. These include instrumental variable approaches including Mendelian randomization, regression discontinuity designs, difference-in-difference methods, and interrupted time series designs (Angrist et al., 1995; Didelez and Sheehan, 2007; Angrist and Pischke, 2009; Morgan and Winship, 2015). Each of these employs different assumptions, and consistency of evidence across these different designs will render more plausible the conclusion of causality. There has been recent discussion of this as a process of “triangulation” (Lawlor et al., 2016). However, it should also be noted that each of these designs also often identifies a different causal effect, for different subpopulations, and thus discrepant results across these different study designs do not necessarily imply a contradiction, or that one of them is wrong; in the words of Hill, “different results of a different inquiry certainly cannot be held to refute the original evidence.” Nevertheless, the potential outcomes framework has arguably dramatically expanded the different ways to look for consistency of evidence, and has also clarified when discrepant results are or are not problematic.

3. Specificity
Hill indicated that if the association with an exposure is specific to a particular outcome (or is considerably stronger for that outcome) then this may render a causal conclusion more plausible. Recent work on negative controls has formalized this notion within a potential outcomes framework (Tchetgen Tchetgen, 2013) and such more formal quantitative approaches to leveraging specificity, while perhaps not yet widely used, could potentially contribute a great deal to our causal reasoning. However, as with consistency, so also with specificity, Hill notes, “if it is not apparent, we are not thereby necessarily left sitting irresolutely on the fence.”

4. Temporality
Causes of course must precede effects. However, Hill’s discussion of the temporality consideration in fact concerns ruling out the possibility of reverse causation, i.e. the possibility that the supposed outcome might in fact precede and subsequently alter the variable taken as the exposure.
This temporality or “reverse-causation” consideration is mirrored in calls within the potential outcomes literature to use longitudinal data to assess evidence for causation in observational studies and to control for baseline outcome to rule out the possibility of reverse causation (VanderWeele et al., 2016). Such control will often effectively be needed to ensure the “no-confounding” or “ignorability” assumption within the potential outcome approach does indeed hold.

5. Biological Gradient
Hill argues that in some cases a causal relationship may be more plausible if there is a monotonic dose-response relationship. Some discussion of when such dose-response relationships may render evidence for causality more plausible has been formalized using potential outcomes (Rosenbaum, 2003). However, here too it must be acknowledged that the absence of a monotonic dose-response relationship is not necessarily evidence against causality. Causal relationships can plausibly be U-shaped, for example.

6. Plausibility
Hill writes “It will be helpful if the causation we suspect is biologically plausible.” However, Hill does not place a great deal of weight on this consideration in his discussion because it is entirely possible that the relevant biological knowledge may be lacking. Certainly one consideration with regard to biological plausibility concerns knowledge of the mechanisms. The potential outcomes framework has supplied a set of concepts and methods, and has clarified the underlying assumptions, to quantitatively assess the relative importance of various proposed mechanisms (Robins and Greenland, 1992; Pearl, 2009; Imai et al., 2010; VanderWeele, 2015), or, said another way, to assess the extent to which each potential mechanism might mediate the effect. Unfortunately, the assumptions required by these approaches are in fact generally considerably stronger than those required to establish causation between the exposure and the outcome itself. These strong assumptions are needed for quantification. However, from a more qualitative perspective the logic of mediation may still be of help for contributing biological plausibility: if we have evidence that the exposure alters particular mechanisms, and also separate evidence that the mechanisms alter the outcome, then this renders more plausible that the exposure itself alters the outcome (VanderWeele, 2015).

7. Coherence
Hill places some emphasis on the coherence with existing knowledge, that “the cause-and-effect interpretation of our data should not seriously conflict with the generally known facts.” This is of course an important principle. It is perhaps one that has not as often been explicitly and quantitatively exploited within the counterfactual framework. However, consideration of “triangulation” may come into play here as well (Lawlor et al., 2016).

8. Experiment
Of course experimental evidence from a trial randomizing the exposure can contribute substantially to the case for causality. The potential outcomes framework helps make clear the logic, but more profoundly, the theory and methods that have developed from the potential outcomes framework have been useful in evaluating the strength of evidence from randomized trials when something goes wrong, e.g. when there is non-compliance, or attrition, or censoring due to death (Robins, 1986; Angrist et al., 1995; Scharfstein et al., 1999; Frangakis and Rubin, 2002).

9. Analogy
Hill’s consideration of analogy (e.g. that when we have evidence for a causal effect of one exposure of some class on an outcome of a particular class, it may render causality more plausible between other members of the same exposure class and outcome class) does not have a direct counterpart within the potential outcomes framework of which I am aware. There are perhaps some indirect resonances with certain questions concerning surrogacy. For example, when it is known from randomized trials that a particular drug has beneficial effects on a health outcome, then when is it reasonable to draw conclusions using a randomized trial for the effect of a different drug, but in the same drug class, if data is only available on a surrogate, rather than the primary outcome of interest? Such questions have been formalized using counterfactual approaches (Frangakis and Rubin, 2002; Gilbert and Hudgens, 2008; Joffe and Greene, 2009; VanderWeele, 2013), but there is almost always some reasoning by analogy in such cases when these approaches are employed in practice. Such reasoning by analogy can of course go wrong, in general and with surrogacy as well, and there have been a number of disastrous failures concerning such reasoning by analogy with surrogates (Moore, 1995; Fleming and DeMets, 1996). Caution must be exercised.

Inductive vs. Deductive Reasoning

Hill’s conclusion is that these “are nine different viewpoints from all of which we should study association before we cry causation.” He believes that “None of my nine viewpoints can bring indisputable evidence for or against the cause-and-effect hypothesis and none can be required as a sine qua non.” Hill’s considerations for assessing evidence for causation essentially rely on inductive reasoning. Hill states that the “fundamental question” is whether there is “any other way of explaining the set of facts before us, is there any other answer equally, or more likely than cause and effect?” and he believes the nine viewpoints are helpful in this type of inductive reasoning. As I have argued above, I believe these viewpoints are indeed still helpful from an inductive standpoint.

The potential outcomes framework, in contrast, follows a deductive form of logic to reach the conclusion of causation. Under the potential outcomes framework, under a certain set of assumptions (e.g. sometimes stated as “exchangeability, consistency, and positivity” (Hernán and Robins, 2019) or as “stable unit treatment value assumption and ignorable treatment assignment” (Imbens and Rubin, 2015)), any actual association implies causation.

While the potential outcomes framework in principle employs deductive logic to reach the conclusion of causation, in practice, there are still inductive elements. We are never sure that the premises hold. In fact, we often are sure that they do not exactly hold and must use sensitivity analysis to assess whether our conclusions are robust to violations. We also often cannot definitively conclude that an association is present; rather, we use statistical inference to assess the evidence for an association. We are still left only with evidence, and must evaluate whether, and to what extent, that evidence justifies a conclusion about causation. There is thus an inductive element to the application of the potential outcomes approach in practice.
Nevertheless, the deductive form of logic underlying the potential outcomes approach itself simplifies the inductive process. Because we understand the logic relating premises to conclusions, we can carefully evaluate each premise. Rather than having to rule out all possible alternative explanations to a causal relation that either we, or someone else, may come up with, we instead can focus more narrowly on each premise that is used by the potential outcomes framework to deduce causation and can carefully evaluate these premises. We can often do so quantitatively by means of sensitivity analysis (Rosenbaum and Rubin, 1983b; Imbens, 2003; Hsu and Small, 2013). This can sometimes lend considerable confidence to the conclusions that are drawn, and can also indicate when caution may be needed. However, when such sensitivity analyses for unmeasured confounding, and measurement error, and selection bias (Greenland, 2005; Rothman et al., 2008; Lash et al., 2009) do indicate substantial robustness, the confidence with which we can draw causal conclusions may often be considerably greater than if we had had, in a more purely inductive approach, to consider every other non-causal alternative explanation for an association.

It is my view that it is this move of only needing to evaluate the premises embedded in the potential outcomes framework’s logic to draw causal conclusions that has altered, for many, the form of reasoning and the types of evidence considered when assessing causal claims in practice. The focus has shifted from Hill’s considerations to the premises characteristic of potential outcomes logic. It has simplified our reasoning. This has been, in my view, an advance, and has resulted in real changes in practice with regard to how causal evidence is assessed.

Nevertheless, with simplified reasoning can come also the danger of complacency. Because the deductive logic is relatively straightforward, we may be tempted to no longer even carefully evaluate the premises. Hill’s considerations can still be helpful today in countering that complacency, by looking at the association through a different, more inductive, lens and thinking creatively about what alternative explanations might be. However, if we return to the potential outcomes framework itself, then I think the central tool in confronting complacency over the assumptions is sensitivity analysis. As described above, sensitivity analysis is the principal means by which the premises fundamental to the potential outcomes approach can be evaluated. As I have written elsewhere (VanderWeele and Ding, 2017; VanderWeele et al., 2019), I believe such analyses should be routine in analysis of observational data and should always take place when such data is being used to argue for a conclusion of causality.
In an attempt to facilitate this, I have elsewhere proposed a set of extremely simple, easy-to-use sensitivity analysis techniques that do not require sophisticated software or a technical background in statistics and that are applicable to unmeasured confounding, selection bias, and measurement error (VanderWeele and Ding, 2017; Smith and VanderWeele, 2019; VanderWeele and Li, 2019); they can effectively be implemented by hand and are relatively easy to interpret. There are many other good sensitivity analysis techniques available, but these sometimes require considerably more effort to implement. My hope is that these simple techniques leave empirical researchers without any excuse for not carrying out at least a very simple sensitivity analysis. Such sensitivity analysis, in conjunction with the simple deductive logic of the potential outcomes framework, can, at least sometimes, establish very strong evidence indeed for a causal conclusion.
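As one illustration of how little machinery such a technique needs, the E-value introduced in VanderWeele and Ding (2017) for an observed risk ratio RR greater than 1 is RR + sqrt(RR × (RR − 1)); a risk ratio below 1 is first inverted. The sketch below is only a minimal rendering of that formula for a point estimate. The function name `e_value` and the example risk ratio of 2.0 are illustrative assumptions, not taken from the text above.

```python
import math


def e_value(rr: float) -> float:
    """E-value for an observed risk ratio (point estimate only).

    Hypothetical helper: implements RR + sqrt(RR * (RR - 1)),
    after inverting protective associations (RR < 1).
    """
    if rr <= 0:
        raise ValueError("risk ratio must be positive")
    if rr < 1:
        rr = 1.0 / rr  # work on the scale where the association is >= 1
    return rr + math.sqrt(rr * (rr - 1.0))


# Illustrative input, not a result from any study.
print(round(e_value(2.0), 2))  # prints 3.41
```

Read in the usual way, an observed risk ratio of 2.0 could be fully explained away only by an unmeasured confounder associated with both exposure and outcome by risk ratios of about 3.41 each, conditional on the measured covariates. The same function can be applied to the confidence limit closer to the null to gauge the robustness of the interval.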

References

Angrist, J. D., Imbens, G. W., and Rubin, D. B. (1996). Identification of causal effects using instrumental variables (with discussion). Journal of the American Statistical Association, 91:444–472.

Angrist, J. D., and Pischke, J.-S. (2009). Mostly Harmless Econometrics: An Empiricist’s Companion. Princeton University Press.

Cornfield, J., Haenszel, W., Hammond, E. C., Lilienfeld, A. M., Shimkin, M. B., and Wynder, L. L. (1959). Smoking and lung cancer: Recent evidence and a discussion of some questions. Journal of the National Cancer Institute, 22:173–203.

Didelez, V., and Sheehan, N. (2007). Mendelian randomization as an instrumental variable approach to causal inference. Statistical Methods in Medical Research, 16(4), 309-330.

Fleming, T. R., and DeMets, D. L. (1996). Surrogate end points in clinical trials: Are we being misled? Annals of Internal Medicine, 125, 606–613.

Frangakis, C. E., and Rubin, D. B. (2002). Principal stratification in causal inference. Biometrics, 58(1), 21-29.

Gilbert, P. B., and Hudgens, M. G. (2008). Evaluating candidate principal surrogate endpoints. Biometrics, 64(4), 1146-1154.

Greenland, S. (2005). Multiple-bias modeling for analysis of observational data (with discussion). Journal of the Royal Statistical Society Series A, 168:267–308.

Hernán, M. A., and Robins, J. M. (2019). Causal Inference. Boca Raton: Chapman & Hall/CRC, forthcoming.

Hill, A. B. (1965). The environment and disease: association or causation? Proceedings of the Royal Society of Medicine, 58:295-300.

Hsu, J. Y., and Small, D. S. (2013). Calibrating sensitivity analyses to observed covariates in observational studies. Biometrics, 69(4), 803-811.

Hudgens, M. G., and Halloran, M. E. (2008). Toward causal inference with interference. Journal of the American Statistical Association, 103(482), 832-842.

Imai, K., Keele, L., and Tingley, D. (2010). A general approach to causal mediation analysis. Psychological Methods, 15(4):309–334.

Imbens, G., and Rubin, D. B. (2015). Causal Inference in Statistics, Social, and Biomedical Sciences: An Introduction. New York: Cambridge University Press.

Joffe, M. M., and Greene, T. (2009). Related causal frameworks for surrogate outcomes. Biometrics, 65(2), 530-538.

Lash, T. L., Fox, M. P., and Fink, A. K. (2009). Applying Quantitative Bias Analysis to Epidemiologic Data. New York: Springer.

Lawlor, D. A., Tilling, K., and Davey Smith, G. (2016). Triangulation in aetiological epidemiology. International Journal of Epidemiology, 45(6), 1866-1886.

Mathur, M. B., and VanderWeele, T. J. (2019). Sensitivity analysis for unmeasured confounding in meta-analyses. Journal of the American Statistical Association, in press, https://doi.org/10.1080/01621459.2018.1529598

Moore, T. (1995). Deadly Medicine: Why Tens of Thousands of Patients Died in America’s Worst Drug Disaster. New York: Simon and Schuster.

Morgan, S. L., and Winship, C. (2015). Counterfactuals and Causal Inference. Cambridge, UK: Cambridge University Press, 2nd Edition.

Pearl, J. (2009). Causality: Models, Reasoning, and Inference. 2nd Edition. Cambridge: Cambridge University Press.

Robins, J. M. (1986). A new approach to causal inference in mortality studies with sustained exposure period – Application to control of the healthy worker survivor effect. Mathematical Modelling, 7:1393–1512.

Robins, J. M., and Greenland, S. (1992). Identifiability and exchangeability for direct and indirect effects. Epidemiology, 143-155.

Robins, J. M., Hernán, M. A., and Brumback, B. (2000). Marginal structural models and causal inference in epidemiology. Epidemiology, 11:550-560.

Rosenbaum, P. R. (2003). Does a dose–response relationship reduce sensitivity to hidden bias? Biostatistics, 4(1):1-10.

Rosenbaum, P. R. (2010). Evidence factors in observational studies. Biometrika, 97(2), 333-345.

Rosenbaum, P. R. (2017). The general structure of evidence factors in observational studies. Statistical Science, 32(4), 514-530.

Rosenbaum, P. R., and Rubin, D. B. (1983a). The central role of the propensity score in observational studies for causal effects. Biometrika, 70:41-55.

Rosenbaum, P. R., and Rubin, D. B. (1983b). Assessing sensitivity to an unobserved binary covariate in an observational study with binary outcome. Journal of the Royal Statistical Society: Series B (Methodological), 45(2), 212-218.

Rothman, K. J., Greenland, S., and Lash, T. L. (2008). Modern Epidemiology. 3rd Edition. Lippincott.

Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66:688–701.

Scharfstein, D. O., Rotnitzky, A., and Robins, J. M. (1999). Adjusting for nonignorable drop-out using semiparametric nonresponse models. Journal of the American Statistical Association, 94(448), 1096-1120.

Smith, L., and VanderWeele, T. J. (2019). Bounding bias due to selection. Epidemiology, 30(4), 509-516.

Tchetgen Tchetgen, E. (2013). The control outcome calibration approach for causal inference with unobserved confounding. American Journal of Epidemiology, 179(5), 633-640.

Tchetgen Tchetgen, E. J., and VanderWeele, T. J. (2012). On causal inference in the presence of interference. Statistical Methods in Medical Research – Special Issue on Causal Inference, 21:55–75.

VanderWeele, T. J. (2013). Surrogate measures and consistent surrogates (with Discussion). Biometrics, 69(3), 561-581.

VanderWeele, T. J. (2015). Explanation in Causal Inference: Methods for Mediation and Interaction. New York: Oxford University Press.

VanderWeele, T. J., Jackson, J. W., and Li, S. (2016). Causal inference and longitudinal data: a case study of religion and mental health. Social Psychiatry and Psychiatric Epidemiology, 51:1457-1466.

VanderWeele, T. J., and Ding, P. (2017). Sensitivity analysis in observational research: introducing the E-value. Annals of Internal Medicine, 167:268-274.

VanderWeele, T. J., and Li, Y. (2019). Simple sensitivity analysis for differential measurement error. American Journal of Epidemiology, in press. https://doi.org/10.1093/aje/kwz133

VanderWeele, T. J., Mathur, M. B., and Chen, Y. (2019). Outcome-wide longitudinal designs for causal inference: A new template for empirical studies. Statistical Science, in press.

Observational Studies 6 (2020) 55-56 Submitted 2/19; Published 1/20

Standing on the shoulders of Austin Bradford Hill: The refinement of “specificity” as a consideration in causal inference

Noel S. Weiss
Department of Epidemiology
University of Washington
Seattle, WA, USA 98195

I began my study of epidemiology in 1968, just a few years after Bradford Hill (and his American counterparts) outlined sets of guidelines to assist in inferring the causes of disease. At that time both Bradford Hill and the Americans had as a focus of their thinking the example of cigarette smoking and lung cancer. Both invoked the notion of the specificity of the association: for example, the Americans argued (U.S. Public Health Service, 1964) that “on an intuitive basis...when a given characteristic is found to be associated with one, or at most a few diseases, then the evidence for a causal relationship is more convincing.” However, they also acknowledged that “it would not be surprising to find that the diverse substances in tobacco smoke could produce more than a single disease”, and Bradford Hill (Hill, 1965) said that “we must not overemphasize the importance of the characteristic [i.e. specificity].” So, while the notion of specificity as a consideration when seeking to infer cause and effect was mentioned by Bradford Hill in his 1965 article, that mention came with qualifications.

After completing my academic training in epidemiology, I sought a career in academia, and was fortunate to find a position at the University of Washington where I could immediately (and for the next 45 years!) indulge my passion for the teaching of the principles and methods of epidemiology. For a number of years I would discuss the guidelines for causal inference in class, mentioning specificity, but always downplaying its importance. I would say something like, “the fact that cigarette smoking predisposes to emphysema, bladder cancer, coronary heart disease, etc. should not in any way detract from the inference that cigarette smoking predisposes to the development of lung cancer”.
© 2020 Noel Weiss.

Then, around the year 2000, I was reading a document prepared by one of our doctoral students in which she was addressing a possible causal influence of the presence of bacterial vaginosis in the acquisition of HIV infection. I remember gently chastising her for taking into account the notion of specificity, reminding her that this often was not a useful guideline. No more than two days following this conversation I received an article from my colleague, Victoria Holt, on a topic of mutual interest, the potential etiologic role of ovarian endometriosis in the development of ovarian cancer. That article reported a several-fold increase in the incidence of ovarian cancer associated with the presence of ovarian endometriosis, but no corresponding increased risk associated with endometriosis that affected only pelvic structures other than the ovary. I wrote a message to Victoria thanking her for the paper, and added that because of the strength and specificity of the association, I suspected that a causal link was present. Yes, I used the word “specificity”! Startled that I had done so, I began to search my memory for other instances in which I had invoked the notion without actually using the term. They were not hard to find. Here are two examples:

• In 1992, I had joined Joe Selby and others at Northern California Kaiser Permanente in publishing the results of a case-control study of fatal colorectal cancer in relation to a history of screening sigmoidoscopy (Selby et al., 1992). We reported that a considerably lower proportion of fatal cases of rectal and distal colon cancer than controls had previously undergone screening, but that no similar association was present for fatal cases whose tumor had arisen beyond the reach of the sigmoidoscope. The specificity of this association, as well as its large magnitude, led to the widespread acceptance of the results as indicative of genuine secondary prevention.

• In my teaching on clinical epidemiology, for some years I had featured a study using Medicaid data (Ray et al., 1987), which examined the occurrence of hip fracture among persons using psychoactive drugs. An increased risk of 50 to 80% was seen among current users of antipsychotic, tricyclic antidepressant, and long half-life hypnotic or anxiolytic agents. However, the question of confounding loomed: were people who took these drugs inherently prone to falls and/or fractures? One piece of evidence in support of a causal relation came from the specificity of the association: there was no increased risk of fracture associated with the use of short half-life hypnotic or anxiolytic agents, just as would have been predicted based on the relatively brief impact these agents might have on motor function.

Soon thereafter, I wrote an article (Weiss, 2001) to describe my new appreciation of the concept of specificity as a guide to causal inference, and went on to modify my classroom teaching accordingly. And yes, I offered an apology to the graduate student! On each of the several times we have encountered one another in the nearly 20 years since that time, we have shared a good laugh recalling this episode.

References

U.S. Public Health Service (1964). Smoking and health. Report of the Advisory Committee to the Surgeon General of the Public Health Service. PHS Pub. No. 1103.

Hill, A. B. (1965). The environment and disease: association or causation? Proc R Soc Med, 58:295-300.

Selby JV, Friedman GD, Quesenberry CP, et al. (1992). A case-control study of screening sigmoidoscopy and colorectal cancer mortality. N Engl J Med, 326:653-7.

Ray WA, Griffin MR, Schaffner W, et al. (1987). Psychotropic drug use and the risk of hip fracture. N Engl J Med, 316:363-9.

Weiss NS (2001). Can the “specificity” of an association be rehabilitated as a basis for supporting a causal hypothesis? Epidemiology, 13:6-8.

Observational Studies 6 (2020) 57-65 Submitted 10/19; Published 1/20

Hill and Campbell: Separate or symbiotic paradigms?

William H. Yeaton
Measurement and Statistics program
Florida State University
Tallahassee, FL 32306, U.S.A.

1. Brief Introduction to Hill

At its heart, A.B. Hill’s 1965 Presidential address includes a set of nine “viewpoints” that, if judged to be present, increase the chance that a causal claim is correct. When taken in this light, as Hill cautions, the nine viewpoints are neither necessary nor sufficient for cause. Instead, they might be thought of as attendant features whose presence will enhance causal inference.

What I do not believe – and this has been suggested – is that we can usefully lay down some hard-and-fast rules of evidence that must be observed before we accept cause and effect. None of my nine viewpoints can bring indisputable evidence for or against the cause-and-effect hypothesis and none can be required as a sine qua non. What they can do, with greater or less strength, is to help us to make up our minds on the fundamental question – is there any other way of explaining the set of facts before us, is there any other answer equally, or more, likely than cause and effect? (Hill, 1965, p. 299)

2. Introduction to the Campbellian paradigm: Two symbols and the truth¹

A second system of inquiry, elaborated by Donald Campbell and his colleagues (Shadish, Cook and Campbell, 2002), also informs causal inference but does so from what is essentially an opposing perspective. It asks if one or more of a lengthy list of internal validity threats are present, thereby decreasing the probability of a correct causal claim. To be precise, Shadish, Cook, and Campbell list nine internal validity threats (p. 55), though one of these, “ambiguous temporal precedence,” is the same as “temporality” in the Hillian paradigm. In addition to these threats to causal inference, four “interaction with selection” threats (selection by history, by maturation, by instrumentation, and by regression) are elaborated, along with four threats to construct validity (compensatory equalization, compensatory rivalry, resentful demoralization, and treatment diffusion) which were previously listed as internal validity threats in Cook and Campbell (1979).

Campbell’s primary mechanism for ruling out internal validity threats is an individual research design. While this research design-internal validity threats nexus is rich and detailed, Campbell and his colleagues’ discussion of different design options and their role in eliminating validity threats is beyond the scope of this paper. Here, the central point is that the Campbellian approach looks for negative study features and aims to rule them out, while the Hillian paradigm looks for positive study features and tries to rule them in. For Hill, the researcher’s task is to demonstrate that favorable conditions exist; for Campbell, the task is to demonstrate that unfavorable conditions do not exist.

¹ “Two symbols” refers to Xs and Os, from which many research designs are created.

© 2020 William Yeaton.

3. Combine Hill and Campbell paradigms?

It is difficult to argue that one of these paradigms’ approaches to truth is intrinsically better than the other. Certainly, researchers’ tasks are different, and each approach ultimately requires judgment as neither process provides a “for sure” determination of cause. Classic Campbell verbiage demands that we “render less plausible” a particular validity threat. A Campbellian might ask if selection bias is likely absent in a particular observational study, while a Hillian might ponder whether sufficient specificity is present to infer cause. Journal reviewers might reasonably disagree about the absence of selection bias or the presence of specificity.

I maintain that these alternative approaches, foundationally based upon very different logical and disciplinary orientations, can advance science more than either alone. Below, I provide examples that show how each of Hill’s viewpoints can be amplified using the Campbellian paradigm. The “more is better” argument is not new; as Feyerabend is reported to have said long ago, “Let a thousand methods bloom” (e.g., Heit, 2016). Restricting different methodological principles to particular research orientations (isolation) leads to fewer insights. A hybrid orientation (symbiosis) would spark cross-disciplinary collaboration via methodological pluralism, allowing the Hill and Campbell approaches to be simultaneously implemented and enabling each Hill principle to be empirically embellished using elements of the Campbellian approach.

Unfortunately, mutually beneficial symbiosis is not the usual way of science (Campbell, 2005). There are many precedents in which researchers have subscribed to apparently adversarial approaches and acted as rivals. Friction between Bayesians and frequentists is longstanding. Those who utilize the causal diagrams of Judea Pearl (2018) and those who follow the potential outcomes, statistical modelling approach of Don Rubin (1974) are less likely to read the same studies or to conduct research in the same way.

I assert that few social scientists are familiar with Hill’s paradigm. Consistent with this claim is the fact that neither the prominent, Campbellian-based textbooks (Campbell and Stanley, 1966; Cook and Campbell, 1979; Shadish, Cook, and Campbell, 2002) nor a recently published quasi-experimental design text by a Campbellian disciple (Reichardt, 2019) cite Hill’s Presidential address. In a similar vein, Hill would have been unfamiliar with the Campbellian paradigm. Near the time of Hill’s 1965 Presidential address, few of Campbell’s design-based papers had been published (Campbell, 1957; Campbell and Stanley, 1966). And while Campbell-based methods have increasingly diffused into the medical and health areas, their spread is far from complete.

Consistent with methodological pluralism, I claim that the Campbellian approach, particularly those dimensions that consider the quality of different research designs or implement multiple design elements, can directly inform Hill’s guidelines. The juxtaposition of Campbell’s design thinking as a means to rule out threats to internal validity can serve as a powerful adjunct to the nine guidelines that Hill aims to rule in. In addition, to my knowledge, no one has directly addressed the Hillian approach from the Campbellian perspective using research designs as the bridging process. In short, the explicit methods that Campbell uses to rule out internal validity threats provide an empirical basis for gauging the presence of the Hill viewpoints.

4. Using both single and multiple design orientations to embellish Hill’s viewpoints

To illustrate how the Campbellian approach might inform Hill’s viewpoints, I provide examples that follow either the usual practice of incorporating a single research design or a more novel approach that utilizes multiple research designs (Yeaton, 2019). I argue that this context of single and multiple designs creates a better, value-added basis to evaluate Hill’s viewpoints.

While this multiple design strategy is structurally similar to an approach named within-study comparison (WSC), its purpose is fundamentally different. In most WSCs (e.g., Wong and Steiner, 2018), an RCT estimate is used to validate an estimate from a quasi-experiment. In contrast, the primary aim of the multiple design approach is to produce corroborative evidence in which estimates from different designs are comparable in direction and magnitude. Thus, multiple designs within the same study are used to produce results aimed to closely agree. With similar design quality, one result from one design does not hold more inferential weight than a second result from a second design.

More formally, this design-in-design approach reflects the principle of “critical multiplism” which Reichardt (2019) defines as “Using multiple methods with different strengths and weaknesses so that, to the extent the results of multiple methods agree, the credibility of results is increased.” Here, “multiple methods” can be thought of as multiple designs. If estimates from different designs within the same study are comparable, then causal inference is enhanced.

I searched for a case in health or medicine in which a multiple-design approach had been used to yield this kind of corroborative evidence, but my efforts were unsuccessful. A recent high quality meta-analytic review (Chaplin et al., 2018) reported 15 WSCs in which RCT results were used to validate regression discontinuity (RD) results. In a personal communication with one of the authors (Morris, September 5, 2019), I wondered if the searching procedures these authors had used to identify WSCs might also have identified examples in which more than one design had been used as corroborating evidence. Unfortunately, the Chaplin et al. search strategy did not yield a relevant RCT-RD example in the health or medicine field that had been used in this corroborating way.

5. Examples that relate Campbell’s approach to Hill’s viewpoints

Two fictional but realistic examples apply the Campbellian approach to amplify the nine knowledge probes from Hill. The first example incorporates the regression discontinuity design to assess a drug intervention for hypertension. Imagine a drug intervention aimed to reduce hypertension. Recent rules establish a cut-point for clinically significant systolic blood pressure (SBP) of 120 mm of mercury (e.g., Mohammad and Deshpande, 2016). Certainly, family history and other risk factors enter into a clinical decision to prescribe hypertension medication, but, for the sake of brevity, those factors will be ignored. For the RD design, with initial SBP on the x-axis and subsequent SBP on the y-axis, one looks for a discontinuous SBP drop at the cut-point in the regression line for patients above 120 who received the drug.

The second example embeds an RCT within the RD design (e.g., Chaplin et al., 2018). Though also hypothetical, the example’s logic is borrowed from an application of the design-in-design approach used in the social sciences (Yeaton and Moss, 2020). For medical applications, Kaiser (2015) has extolled the virtues of combining RCT and observational studies but does so by incorporating both kinds of information within the same statistical model. In the current study, the combination design is used to corroborate effect estimates. In the recent past, the line between normal and elevated systolic blood pressure has vacillated between 120 and 140, and best placement of the distinguishing cut-point still bears research scrutiny (e.g., Flint et al., 2019). Thus, one might imagine an RCT specifically aimed at persons in the range of uncertainty between 120 and 140. We would expect an RCT-based reduction to be comparable to the size of the discontinuous drop for the RD (keeping in mind that the RCT measures average treatment effect and the RD measures average treatment effect at the cut-point).

To better articulate the value added by combining the Campbell and Hill paradigms, I first note and briefly describe each of Hill’s nine viewpoints. Then, in the context of the two fictional examples introduced above, I elaborate upon additional knowledge gained by adding data from the Campbellian perspective.

1. Strength (size of effect)
For Hill, a strong environment-disease relationship is one that produces large effects; it is easier to discern causal relationships in the context of large rather than small effects. To best assess the comparability of effects from two different designs, we would compare the size of the RD discontinuity to the distance between the RCT’s treatment and control regression lines at the cut-point. If the average treatment effect in the RCT and the average treatment effect at the cut-point in the RD were similar, that corroboration would provide a more robust strength estimate: one at the cut-point and one near the cut-point.

2. Consistency (similar results across different study units)
Hill seeks findings that are similar across “different persons, places, circumstances, times.” The RD assignment variable runs the gamut of baseline SBPs. The RCT is limited to SBPs for persons in the 120-140 range. If the RCT were based on a random sample of participants in the region near the cut, further consistency of study findings would be established.

3. Specificity (asks if the relationship is one-to-one: does a given disease have one kind of outcome and is that disease created by a single cause)
Hill notes that one-to-one, independent-dependent variable relationships are infrequent. To illustrate, he asks whether cancers related to smoking will increase but cancers and deaths due to other causes (e.g., brain cancer) will not. To probe specificity in the Campbellian approach, both RD and RCT can utilize the so-called “non-equivalent dependent variable” design (e.g., Herman and Walsh, 2011). The idea: the hypertensive drug’s benefit will be reflected by a difference in SBP between the treatment and the control group in both designs, but a second disease outcome, say ulcer prevalence, will not show such a difference. The multiple design approach allows replication of a no-difference finding in diseases unlikely to change from hypertension medication.

4. Temporality (causes precede effects)
RD and RCT designs are both prospective. The cause necessarily precedes the effect.

5. Biological gradient (dose-response)
In the absence of confounders that covary with dose, if increased doses lead to increased effects, cause is buttressed. In the RD design, one could track different dose levels in different patient subsets to determine if there were different degrees of SBP decrease. In the RCT, one could randomly assign patients different doses. In an educational version of the multiple design approach, a random sample of college students who had been placed on academic probation with semester grades below C (2.0) but at or above D (1.0) were first chosen (Yeaton and Moss, 2018). Next, students in this 1.0-2.0 region were randomly allocated to receive either a strongly worded warning letter sent by certified mail (stronger dose) or an email warning letter containing less strong language (weaker dose). In addition, RD results in the higher dose, probation intervention were compared to results for students who scored at or above 2.0. Thus, the dose-response data from the RCT were used in collaboration with the higher dose treatment versus no-treatment results of the RD design.

6. Experiment (random allocation)
A small sample RCT was used in the design-in-design approach. This hybrid design approach with experimental and quasi-experimental evidence has coincident advantages; RD had larger sample size (enhanced statistical power) and enrolled patients with a wider range of pretest scores.

Each of the remaining three Hillian viewpoints addresses the credibility of the independent-dependent variable relationship. In the Campbellian paradigm, these three viewpoints relate to causal attribution but are not direct threats to internal validity. Rather they address construct validity, the representations and labels given to the cause and effect constructs and the causal mechanism itself. Empirical study of these three viewpoints using the Campbellian paradigm would allow researchers to address the nature of the presumed cause. Investigation of these three Hillian viewpoints requires elaborated Campbellian designs (e.g., additional design elements such as comparative groups or supplementary data that establish causal links) to inform each viewpoint.

7. Plausibility (asks if the relationship between the independent and dependent variable “makes sense,” from the standpoint of contemporary knowledge)

8. Coherence (asks if different levels of evidence “tell a similar story” regarding cause and effect. As Hill notes, that evidence may be more expansive; women begin to suffer the same lung cancer fate as men when women begin to smoke more, or the pattern of harm is similar when the biology of smoking is examined in rats and man).

9. Analogy (asks if another drug or disease would have similar biological effects)

To address the plausibility of benefit of a particular hypertensive drug, each design might include measures of the biological markers known to be associated with that hypertensive drug. If the predicted causal links occur within such a mechanism study, then the drug’s link with benefit is enhanced. To address coherence, inference from the RD and RCT designs might be buttressed by using multiple study conditions (e.g., different drug treatment groups in both the RCT and the RD...perhaps those with long and short histories of hypertension). To address analogy, different hypertensive drugs of the same class, say a different beta blocker, might be used, in both designs, to better probe the “logic” of the disease.
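The comparison described under the strength viewpoint can be sketched numerically. The code below is only an illustration on synthetic data: the assumed drug effect of −8 mm, the linear outcome model, the noise level, the sample sizes, and all function names are hypothetical choices, not values from any study. For simplicity it draws the RD and RCT samples separately rather than embedding the trial in the RD sample. It fits a regression line to each group and compares the gap at the 120 cut-point in the RD sample with the gap between the treatment and control lines in an RCT restricted to the 120-140 range.

```python
import random

CUT = 120.0          # cut-point from the text (mm of mercury)
TRUE_EFFECT = -8.0   # assumed drug effect on follow-up SBP; purely illustrative


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def gap_at_cut(treated, control):
    """Treated-minus-control difference between the two fitted lines at CUT."""
    a_t, b_t = fit_line(*zip(*treated))
    a_c, b_c = fit_line(*zip(*control))
    return (a_t + b_t * CUT) - (a_c + b_c * CUT)


def follow_up(baseline, on_drug, rng):
    """Synthetic follow-up SBP: linear in baseline, plus noise, plus drug effect."""
    effect = TRUE_EFFECT if on_drug else 0.0
    return 12.0 + 0.9 * baseline + rng.gauss(0.0, 4.0) + effect


def simulate(seed=0, n_rd=4000, n_rct=1000):
    rng = random.Random(seed)
    # RD sample: everyone at or above the cut-point receives the drug.
    rd_t, rd_c = [], []
    for _ in range(n_rd):
        x = rng.uniform(100.0, 140.0)
        (rd_t if x >= CUT else rd_c).append((x, follow_up(x, x >= CUT, rng)))
    # RCT sample: baseline SBP between 120 and 140, drug assigned by coin flip.
    rct_t, rct_c = [], []
    for _ in range(n_rct):
        x = rng.uniform(120.0, 140.0)
        on_drug = rng.random() < 0.5
        (rct_t if on_drug else rct_c).append((x, follow_up(x, on_drug, rng)))
    return gap_at_cut(rd_t, rd_c), gap_at_cut(rct_t, rct_c)


rd_estimate, rct_estimate = simulate()
print(f"RD discontinuity at the cut-point: {rd_estimate:.1f}")
print(f"RCT gap between arms at the cut-point: {rct_estimate:.1f}")
```

In this synthetic setting the two estimates should land close to the assumed effect and to each other, which is the kind of agreement in direction and magnitude that the multiple design approach looks for. With real data, a bandwidth around the cut-point and a check of the linearity assumption would also be needed.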

6. Cochran’s causal crossword in the context of Hill and Campbell Rosenbaum (2015) reflected upon Cochran’s idea of a “causal crossword” to better articulate an intricately woven argument showing how cigarette smoking might be causally related to lung cancer. Below, Rosenbaum’s words appear in italics but are interrupted to allow relevant Hillian and Campbellian commentary. Each element of Rosenbaum’s argument reflects a Hillian viewpoint. And each Hillian viewpoint can be directly or indirectly related to design thinking via Campbell. A controlled randomized experiment shows that deliberate exposure to the toxin causes this same cancer in laboratory animals. Experiment (with rats, not humans) Consider the same ideas in a biological context. A high level of exposure to a toxin, such as cigarette smoke, is associated with a particular disease, say a particular cancer, in a human population, where experimentation with toxins is not ethical. Consistency (with humans) in an observational design (smokers versus non-smokers) A DNA-adduct is a chemical derived from the toxin that is covalently bound to DNA, perhaps disrupting or distorting DNA transcription. Exposure to the toxin is observed to be associated with DNA-adducts in lymphocytes in humans. Plausibility (addresses mediator/moderator issues by providing data on biological mech- anism) A further controlled experiment shows that the toxin causes these DNA adducts in cell cultures. Coherence (via experiment, using evidence at the cellular level) A case-control study finds cases of this cancer have high levels of these DNA-adducts, whereas non-cases (so-called “controls”) have low levels. Temporality (observational data with humans, but the data are retrospective) A study finds these DNA-adducts in tumors of the particular cancer under study, but not in most tumors from other types of cancer. Specificity (idea is reflected in the non-equivalent dependent variable design)


Certain genes are involved in repairing DNA, for instance, in removing DNA adducts; see Goode et al. (2002). In human populations, a rare genetic variant has a reduced ability to repair DNA, in particular a reduced ability to remove adducts, and people with this variant exhibit a higher frequency of this cancer, even without high levels of exposure to the toxin.
Analogy (idea of an observational study for human cancer rates with and without this variant)

Each of these entries in the larger puzzle is quite tentative as an indicator that the toxin causes cancer in humans, and some of the entries do not directly intersect; e.g., the rare genetic variant is not directly linked to the toxin. Yet, the filled in puzzle with its many intersections may be quite convincing.

This summary statement nicely reflects the Hillian perspective; only the interwoven tapestry argument becomes definitive. Note that Rosenbaum does not include the viewpoints strength or dose-response. However, these two viewpoints could easily be incorporated using relevant design elements via Campbell (by adding one-pack-a-day and three-pack-a-day smokers to an observational study and by reporting the large expected difference in lung cancer rates for these two smoking conditions).
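
As a minimal sketch of how strength and dose-response might be read from such added smoking conditions, the following Python fragment computes rates, rate ratios, and a check for a monotone gradient. The counts, group labels, and function name are hypothetical and chosen only for illustration; they are not drawn from Rosenbaum, Hill, or any study.

```python
# Hypothetical counts for illustration only; not taken from any study.
# Each entry: (smoking condition, lung cancer cases, persons followed)
groups = [
    ("non-smoker", 5, 10_000),
    ("one pack a day", 50, 10_000),
    ("three packs a day", 150, 10_000),
]


def rate_per_1000(cases, persons):
    """Crude rate per 1000 persons."""
    return 1000 * cases / persons


# Rate in each smoking condition, ordered from lowest to highest dose.
rates = [rate_per_1000(cases, persons) for _, cases, persons in groups]

# Strength: rate ratio of each condition relative to non-smokers.
ratios = [rate / rates[0] for rate in rates]

# Dose-response: do rates rise strictly with each step up in dose?
gradient = all(low < high for low, high in zip(rates, rates[1:]))

for (label, _, _), rate, ratio in zip(groups, rates, ratios):
    print(f"{label}: {rate:.1f} per 1000, rate ratio {ratio:.1f}")
print("monotone dose-response:", gradient)
```

A large rate ratio speaks to strength, and a steadily rising rate across ordered conditions speaks to dose-response; neither, taken alone, rules out confounding in an observational comparison.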

7. What is missing from Hill

The Hillian paradigm is incomplete. The Campbellian perspective considers dimensions not directly discussed by Hill that are critical to correct causal inference. For example, differential attrition, treatment compliance, and diffusion from the treatment to the control group are not explicitly addressed in Hill’s viewpoints. The construct validity of cause threats mentioned above (compensatory equalization, compensatory rivalry, and resentful demoralization) relate directly to one of the Stable Unit Treatment Value Assumption requirements of Rubin’s (1974) potential outcomes approach: that administering treatment to intervention group members does not change the likelihood that control group participants receive treatment. In addition, subsequent to Hill’s 1965 address, design advancements such as instrumental variable analysis (Angrist, Imbens, and Rubin, 1996) have been applied to provide high-quality outcome estimates in the face of incomplete treatment compliance.
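
To make the instrumental variable idea concrete, here is a minimal sketch of the simplest such estimator, the Wald ratio, using random assignment as the instrument for treatment actually received. The data and the function name are hypothetical, and the estimate has a causal reading only under the usual instrumental variable assumptions (for example, that assignment affects the outcome only through treatment received, and that no one does the opposite of what was assigned).

```python
from statistics import mean


def wald_iv_estimate(assigned, treated, outcome):
    """Return (intention-to-treat effect, compliance difference, Wald ratio).

    assigned: 1 if randomized to treatment, else 0 (the instrument)
    treated:  1 if treatment was actually received, else 0
    outcome:  observed outcome for each participant
    """
    y1 = mean(y for z, y in zip(assigned, outcome) if z == 1)
    y0 = mean(y for z, y in zip(assigned, outcome) if z == 0)
    d1 = mean(d for z, d in zip(assigned, treated) if z == 1)
    d0 = mean(d for z, d in zip(assigned, treated) if z == 0)

    itt = y1 - y0          # effect of being assigned to treatment
    compliance = d1 - d0   # effect of assignment on treatment received
    if compliance == 0:
        raise ValueError("assignment did not change treatment received")
    return itt, compliance, itt / compliance


# Hypothetical trial: 4 assigned to a drug (3 comply), 4 to control.
# Outcomes are hypothetical systolic blood pressure readings (mm Hg).
assigned = [1, 1, 1, 1, 0, 0, 0, 0]
treated = [1, 1, 1, 0, 0, 0, 0, 0]
outcome = [130, 132, 134, 140, 140, 142, 138, 140]

itt, compliance, late = wald_iv_estimate(assigned, treated, outcome)
print(itt, compliance, late)
```

In this toy example the intention-to-treat difference is diluted by the one participant who did not take the assigned drug; dividing by the compliance difference rescales it to an effect among those whose treatment was changed by assignment.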

8. Value-added considerations via Campbell

Five of Hill’s viewpoints are directly informed by Campbellian, design-based approaches, while three Hill viewpoints can be informed using empirical evidence of a surrogate nature. (Recall that one Hillian viewpoint, temporality, is redundant with a Campbellian internal validity threat.) Multiple research designs used in the same context can better inform the Hillian paradigm, since any single design informs fewer viewpoints. For example, an estimate of effect from a single RD design would not inform consistency. However, one might use a single, prospective design (e.g., RD only) to address temporality, specificity, and dose-response, then add a second design, say an RCT that incorporates a design-in-design approach, to enhance both strength and consistency (to judge both size and similarity of effect sizes). As a bonus, with evidence from multiple designs within the same study, and consistent with Hill’s values, the need to rely upon statistical significance is further reduced.


By itself, Hill’s paradigm provides a rich, principle-driven perspective that contributes important theoretical perspectives. By itself, Campbell’s approach provides design-contingent, empirical evidence allowing one to investigate the pattern of multiple study findings and to elaborate Hill’s nine queries. Together, the empirically and conceptually based probes of causal truth contribute complementary rule-in and rule-out arguments with which to reduce uncertainty and to enhance inference. Though sometimes non-intersecting, the resulting Hill-Campbell methodological combinations allow researchers to form a more complete causal mosaic.

References

Angrist, J., Imbens, G. and Rubin, D. (1996). Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91, 444-472.
Campbell, D.T. (1957). Factors relevant to validity of experiments in field settings. Psychological Bulletin, 54, 297-312.
Campbell, D.T. (2005). Ethnocentrism of disciplines and the fish-scale model of omniscience. In S.J. Derry, C.D. Schunn, and M.A. Gernsbacher (Eds.), Interdisciplinary Collaboration: An Emerging Cognitive Science, pp. 3-21. Lawrence Erlbaum, Mahwah, NJ.
Campbell, D.T. and Stanley, J.C. (1966). Experimental and quasi-experimental designs for research. Rand McNally, Chicago.
Chaplin, D.D., Cook, T.D., Zurovac, J., Coopersmith, J.S., Finucane, M.M., Vollmer, L.N., and Morris, R.E. (2018). The internal and external validity of the regression discontinuity design: A meta-analysis of 15 within-study comparisons. Journal of Policy Analysis and Management, 37, 403-429, https://doi.org/10.1002/pam.22051
Cook, T.D. and Campbell, D.T. (1979). Quasi-experimentation: Design and analysis for field settings. Rand McNally, Chicago.
Flint, A.C. et al. (2019). Effect of systolic and diastolic blood pressure on cardiovascular outcomes. New England Journal of Medicine, 381, 243-251.
Heit, H. (2016). Reasons for relativism: Feyerabend on the ‘rise of rationalism’ in ancient Greece. Studies in History and Philosophy of Science, 57, 70-78.
Herman, P.M. and Walsh, M.E. (2011). Hospital admissions for acute myocardial infarction, angina, , and asthma after implementation of Arizona’s comprehensive statewide smoking ban. American Journal of Public Health, 101, 491-496.
Hill, A.B. (1965). The environment and disease: Association or causation? Proceedings of the Royal Society of Medicine, 58, 295-300.
Kaisar, E.E. (2015). Incorporating both randomized and observational data into a single analysis. Annual Review of Statistics and its Application, 2, 49-72.
Pearl, J. (2018). The book of why: The new science of cause and effect. Basic Books, New York.
Reichardt, C.S. (2019). Quasi-experimentation: A guide to design and analysis. Guilford, New York.
Rosenbaum, P.R. (2015). Cochran’s causal crossword. Observational Studies, 1, 205-211.
Rubin, D. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66, 688-701.


Saklayen, M.G. and Deshpande, N.V. (2016). Timeline of history of hypertension treatment. Frontiers in Cardiovascular Medicine, 3, 1-14.
Shadish, W.R., Cook, T.D., and Campbell, D.T. (2002). Experimental and quasi-experimental design for generalized causal inference. Houghton Mifflin, Boston.
Wong, V.C. and Steiner, P.M. (2018). Designs of empirical evaluations of nonexperimental methods in field settings. Evaluation Review, 42, 176-213.
Yeaton, W.H. and Moss, B.G. (2020). A multiple-design, experimental strategy: Academic probation warning letter’s impact on student achievement. Journal of Experimental Education, 88, 123-144.
Yeaton, W.H. (2019). Experimental and quasi-experimental designs. In P. Brough (Ed.), Advanced Research Methods for Applied Psychology, pp. 107-123. Routledge, New York.
