
UNIVERSITY OF CALIFORNIA, SAN DIEGO

Multimodal Evidence

A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy

in

Philosophy

by

Jacob Stegenga

Committee in charge:

Professor Nancy Cartwright, Chair
Professor William Bechtel
Professor Craig Callender
Professor Naomi Oreskes
Professor Robert Westman

2011

© Jacob Stegenga, 2011 All rights reserved.

The Dissertation of Jacob Stegenga is approved, and it is acceptable in quality and form for publication on microfilm and electronically:

Chair

University of California, San Diego

2011


For Skye, who teaches that there is much about which to argue; For my Mother, whose arguments are sparing, and kind; For Bob, who argues well; For Alexa, who knows that one need not always argue.


It is the profession of philosophers to question platitudes. A dangerous profession, since philosophers are more easily discredited than platitudes.

David Lewis

TABLE OF CONTENTS

Signature Page……………………………………………………………… iii

Dedication………………………………………………………………….. iv

Epigraph……………………………………………………………………. v

Table of Contents…………………………………………………………... vi

List of Tables………………………………………………………………. vii

List of Terminology………………………………………………………... viii

List of Abbreviations and Symbols………………………………………… xii

Acknowledgements………………………………………………………… xiv

Vita…………………………………………………………………………. xviii

Abstract…………………………………………………………………….. xix

Chapter 1: Introduction…………………………………………………….. 1

Chapter 2: Varieties of Evidential Experience……………………………... 14

Chapter 3: Underdetermination of Evidence by Theory…………………… 49

Chapter 4: Independent Evidence………………………………………….. 75

Chapter 5: Robustness, Discordance, and Relevance……………………… 102

Chapter 6: Amalgamating Multimodal Evidence………………………….. 124

Chapter 7: Is Meta-Analysis the Platinum Standard of Evidence?………… 145

Chapter 8: An Impossibility Theorem for Amalgamating Evidence………. 176

References………………………………………………………………….. 214

LIST OF TABLES

Table 1. Features of Evidence……………………………………………… 33

Table 2. Likelihoods of Evidence in La Jolla Murder Mystery……………. 90

Table 3. Binary Outcomes in an Experiment and Control Group………….. 167

Table 4. Analogy Between Amalgamating Preferences and Amalgamating Evidence…………………………………… 182

Table 5. Profiles Constructed in Proof of Impossibility Theorem for Amalgamating Evidence…………………………………… 212

LIST OF TERMINOLOGY

In this dissertation my use of some words departs slightly from standard philosophical or scientific usage, and for some concepts I have had to invent entirely new words or phrases. The following glossary provides informal definitions of such terms; for terms that have a corresponding formal definition I indicate its location in the dissertation. Technical terms used in standard ways (such as ‘meta-analysis’) are defined in the body of the dissertation.

Amalgamation

The combination of multimodal evidence.

Amalgamation Method

A method to amalgamate multimodal evidence into a measure of overall support for a hypothesis.

Amalgamation Function

A type of amalgamation method, the inputs and output of which are limited to ordinal rankings of hypotheses (formal definition on pages 186-187).

Concordance

Consistency or agreement of evidence from multiple modes.

Conditional Probabilistic Independence

Probabilistic independence between multiple modes of evidence, conditional on a hypothesis (formal definition on page 96).

Confirmation Ordering

A confirmation relation, denoted by ≽i, where i is a mode (the confirmation ordering relation is indexed to the mode of evidence), such that H1 ≽i H2 means “evidence from mode i confirmation orders H1 equally to or above H2” (formal definition on pages 183-184).

Constraint

A desideratum of amalgamation methods which stipulates that AMs should constrain intersubjective assessment of hypotheses.

Dictatorship

A mode of evidence which always trumps other modes in an amalgamation function. One of the desiderata for an amalgamation function is ‘non-dictatorship’, which stipulates that no mode should be a dictator (formal definition on page 208).

Discordance

Inconsistency or disagreement of evidence from multiple modes.

Dyssynergistic Evidence

Multimodal evidence which confirms the negation of a hypothesis which is confirmed by evidence from the individual modes constituting the multimodal evidence (formal definition on page 89).

Independence of Irrelevant Alternatives

A desideratum for amalgamation functions which stipulates that a ranking of two hypotheses relative to each other by an amalgamation function should depend only on how the individual modes rank these two hypotheses relative to each other, and not on how the modes rank them relative to other hypotheses (formal definition on pages 207-208).

Mode

A method of producing evidence or a particular way of learning about the world: a technique, apparatus, or experiment. Modes are types of which there can be a plurality of tokens. A full account of modes requires a criterion of individuation of modes, which is the subject of Chapter 4.

Multimodal Evidence

The set of evidence produced by multiple independent modes relevant to a given hypothesis (formal notation on page 183).

Ontic Independence

The form of independence between multiple modes of evidence based on different materials constituting the modes or different assumptions or theories required by the modes.

Robustness

The state in which a hypothesis is supported by concordant multimodal evidence.

Unanimity

A desideratum for amalgamation functions which stipulates that if all modes confirm hypothesis H1 over H2, then the amalgamation function must do the same (formal definition on page 207).

Unrestricted Domain

A desideratum for amalgamation functions which stipulates that the amalgamation function can accept as input all possible confirmation orderings.

LIST OF ABBREVIATIONS AND SYMBOLS

AMM1944 Avery, MacLeod, and McCarty (1944)

AEC Atomic Energy Commission

AF Amalgamation Function

AM Amalgamation Method

B Believability

BT Bayes’ Theorem

C Concordance

CPI Conditional Probabilistic Independence

D Non-Dictatorship

DE Dyssynergistic Evidence

DNA Deoxyribonucleic Acid

e Evidence

H Hypothesis

I Independence of Irrelevant Alternatives

JC Jeffrey Conditionalization

NAS National Academy of Sciences

NRS Non-randomized Study

OI Ontic Independence

P Patterns

p(x) Probability of x

p(x|y) Probability of x conditional on y

Q Quality

R Relevance

RCT Randomized Controlled Trial

RD Risk Difference

RR Risk Ratio

SC Strict Conditionalization

T Transparency

TS Transforming Substance

U Unanimity

UET Underdetermination of Evidence by Theory

UV Ultraviolet

ACKNOWLEDGEMENTS

The time and critical attention which Nancy Cartwright has dedicated to this dissertation is incredible; she has had an enormous influence on the material presented here. Nancy’s influence goes far beyond her copious written feedback on multiple drafts of these chapters. Our rambles through the English countryside and hikes through California deserts were ideal ways to work through material in the following pages. Her own work is my exemplar, as is her work ethic – I am grateful for Nancy’s supervision.

Graduate students are often asked who is on their supervisory committee. My response has always felt like a boast. Bill Bechtel’s leadership of the Philosophy of Biology Research Group allowed discussions which helped to develop nascent ideas into chapters. Craig Callender encouraged me to pursue the eccentric topics that constitute this dissertation, criticized my work when needed, and provided guidance on broader matters of graduate study. I am grateful to Bill and Craig for being proxy advisors. During early discussions, and in her reading of my prospectus, Naomi Oreskes posed questions which have occupied me for two years. Bob Westman has provided thoughtful guidance since my first weeks at UCSD when I took his seminar in history and philosophy of science. Moreover, Bob has facilitated the development of my dissertation, and my ability to share it in a dozen cities, via his directorship of the Science Studies Program. I know too well that this dissertation is not what Bill, Craig, Naomi, or Bob had hoped for; I wish that it could suffice as its own apology.

Several other faculty members in the philosophy department provided feedback on aspects of my dissertation, including Don Rutherford, Sam Rickless, Christian Wüthrich, Rick Grush, David Brink, and Clinton Tolley.

My fellow graduate students at UCSD have been heavily involved with this dissertation. Tarun Menon has encouraged and challenged me from the start, and has read most of this dissertation and spent many hours discussing it with me. Tarun is also a co-author of Chapter 4. I am fortunate to have Tarun as a colleague and friend. Eric Martin’s paper on combining evidence was an early inspiration. Long bike rides through San Diego County with Charlie Kurth were a great way to discuss philosophy. Daniel Schwartz, Ioan Muntean, Cole Macke, Sindhuja Bhakthavatsalam, Nat Jacobs, Marta Halina, Matt Brown, and Joyce Havstad have read parts of this dissertation and provided critical feedback. My fellow students engaged this work when it was in early, painfully inchoate stages, and so their fortitude is admirable.

Nancy occasionally complained that I spent too much time at conferences. It has been, though, a pleasure to discuss some of the ideas in this dissertation with the following people whom I first met at conferences: Patrick Forber (Tufts), John Koolage (Eastern Michigan), Andrew Winters (Colorado), Sharon Berry (Harvard), Léna Soler (Université Poincaré, France), Robert Hudson (Saskatchewan), Kareem Khalifa (Middlebury), Jonathan Tsou (Iowa State), Casey Helgeson (Wisconsin), Branden Fitelson (Rutgers), and Monica Aufrecht (Washington). I am particularly grateful to Eran Tal and Boaz Miller, both of the University of Toronto, who have provided critical feedback on much of this dissertation.

Many other philosophers have provided encouragement and feedback on parts of my dissertation, including Brian Keeley (Pitzer), Helmut Heit (TU Berlin), Douglas Allchin, Heather Douglas (Tennessee), Ian Hacking (Toronto), Ted Porter (UCLA), Miriam Solomon (Temple), James Hawthorne (Oklahoma), and Daniel Steel (Michigan State). Samir Okasha (Bristol) gave detailed feedback on Chapter 8, and his enthusiasm was motivating. I am grateful to the philosophers of science at UC Davis – Paul Teller, Jim Griesemer, and Roberta Millstein, and their students Michael Trestman and Vadim Keyser – for our joint workshops. During my time at the London School of Economics I had valuable discussions with Katie Steele, Christian List, Miklos Redei, and Franz Dietrich. I am especially grateful to Franz Dietrich for his assistance with the proof of the impossibility theorem of Chapter 8. I was also fortunate that Deborah Mayo (Virginia Tech) was at the LSE while I was there, and since then she has continued to provide encouragement and support.

The problem of amalgamating discordant multimodal evidence was first emphasized to me by a scientist, and since then the following scientists have shared their time and expertise answering my questions about evidence: Erin Shields (Oceanography, Scripps Institute), Norman Anderson (UCSD), Dick Zoutman (Epidemiology, Queen’s University), Johannes Martin (Cosmology, Toronto and Bonn), and the epidemiologists at the Public Health Agency of Canada.

The Science Studies Program at UCSD provided superb support for the duration of my degree. I completed this dissertation while living in a beautiful house generously provided by the E.R.R.O.R. foundation; it is hard to imagine a more idyllic setting for writing. Further material support for which I am grateful came in the form of a doctoral fellowship from the Social Sciences and Humanities Research Council of Canada.

Parts of Chapters 5, 7, and 8 have been or will be published as the following papers:

“An Impossibility Theorem for Amalgamating Evidence.” Forthcoming in Synthese.

“Is Meta-Analysis the Platinum Standard of Evidence?” Forthcoming in Studies in History and Philosophy of Biological and Biomedical Sciences.

“The Chemical Characterization of the Gene: Vicissitudes of Evidential Assessment.” 2011. History and Philosophy of the Life Sciences 33(1): 103-126.

“Robustness, Discordance, and Relevance.” 2009. Philosophy of Science 76(5): 650-661.

Many of those acknowledged above were instrumental in helping me to improve the dissertation chapters to the point where they were ready to be published. Moreover, anonymous reviewers for these journals provided valuable critical feedback.

Finally, I am grateful to Alexa for her care, her patience, and, to the extent that she was able, for occasionally dragging me away from this dissertation. Writing a dissertation in philosophy is a lonely livelihood; Alexa made it less so.

One might expect that after receiving such help – from an impressive group of scholars, institutions, and friends – a dissertation would be fully satisfactory. I know that it is not. The work presented in the following pages cannot possibly repay the intellectual debt I have incurred to those listed here. It is, at best, a promise.

VITA

Education
PhD, UC San Diego, Philosophy & Science Studies, 2011
MA, University of Toronto, IHPST, 2005
MSc, University of Toronto, Physiology/Neuroscience, 2003
BA, University of Victoria, Biology and Philosophy, 1999

Appointments
Visiting Scholar, London School of Economics, 2008
Epidemiologist, Public Health Agency of Canada, 1999-2001, 2003-2007

Select Publications
“An Impossibility Theorem for Amalgamating Evidence.” Forthcoming in Synthese.
“Is Meta-Analysis the Platinum Standard of Evidence?” Forthcoming in Studies in History and Philosophy of Biological and Biomedical Sciences.
“The Chemical Characterization of the Gene: Vicissitudes of Evidential Assessment.” Forthcoming in History and Philosophy of the Life Sciences.
“Population is Not a Natural Kind of Kinds.” Biological Theory (2010) 5(2): 154-160.
“A Theory of Evidence for Evidence-Based Policy” (with Nancy Cartwright). Proceedings of the British Academy 171: 289-319.
“Rerum Concordia Discors: Robustness and Discordant Multimodal Evidence.” Forthcoming in The Robustness of Science (ed. Léna Soler), Springer.
“Robustness, Discordance, and Relevance.” Philosophy of Science (2009) 76: 650-661.

Select Presentations
“Pseudorobustness” Error in the Sciences Conference, Leiden, Netherlands, 2011
“Biological Populations” Consortium for History & Philosophy of Biology, Paris, 2011
“The Independence Requirement for Robust Evidence” APA Pacific Division, 2011
“Is Meta-Analysis the Platinum Standard of Evidence?” AAAS, 2011
“Varieties of Evidential Experience” University of Pittsburgh, 2010
“The New New ” Harvard-MIT Graduate Conference, 2009

Select Book Reviews
Bordogna, William James at the Boundaries. In British Journal for the History of Science (2010) 43(1): 130-131
Weber, The Philosophy of Experimental Biology. In Erkenntnis (2009) 71(3): 431-436

Select Awards
SSHRC Doctoral Fellowship, 2008-2010
Hadden Award, Best Graduate Essay, CSHPS, 2010


ABSTRACT OF THE DISSERTATION

Multimodal Evidence

by

Jacob Stegenga

Doctor of Philosophy in Philosophy

University of California, San Diego, 2011

Professor Nancy Cartwright, Chair

We often have a variety of evidence available for a given hypothesis. For example, the efficacy of pharmaceuticals is studied with diverse experiments on animals, humans, and cells. I call evidence like this multimodal; a “mode” is a particular way of learning about the world: a technique, apparatus, or experiment. Philosophers have appealed to multimodal evidence to make robustness claims to advance various forms of scientific realism and to resist skeptical worries. The depth of such arguments, though, has advanced little since Whewell. What are the conditions under which such arguments are compelling? I raise methodological and epistemological arguments, and use examples from biology and medicine, to identify demanding constraints for successful appeals to multimodal evidence.


CHAPTER 1: INTRODUCTION

It is fitting to begin this dissertation with a passage from a recent paper by Nancy Cartwright (2006), in which she introduces the principal problem with which this dissertation is concerned. To make her point Cartwright describes work by the British epidemiologist Michael Marmot (2004), who argues that low socioeconomic status is bad for one’s health in all situations in which such status leads to increased social isolation and stress. To support this general conclusion Marmot cites a variety of evidence. For example, his own work on British civil servants, based on longitudinal studies over twenty years, showed that the highest paid British civil servants have twice the chance of living until the age of sixty as do the lowest paid British civil servants. Marmot also cites evidence from interviews and surveys on job stress, evidence from laboratory experiments showing associations between stress and physiological reactions, and evidence from other disciplines altogether, such as primatology and anthropology. Cartwright’s main interest in this example is how Marmot purports to extrapolate the findings from his best-performed studies to a more general population. However, Cartwright claims that there is a related question.

The related question is about combining evidence. How does Marmot himself support the move from Whitehall civil servants to a far broader population? He does so by marshaling a great deal of evidence of different kinds: for instance, experiments on monkeys that put together the top monkey from a number of different troupes. The monkeys again form a hierarchy, and the ones at the top are by far the healthiest. And he does so by looking at health data across Canadian provinces and at what happened to health in Russia—especially among Russian men— after the change from socialism. And so forth.

Altogether, informally, it is an impressive package. Where did he publish it? That helps to make my point—in one of those high-caliber ‘semipopular’ books. For this is not the kind of thing that goes into a serious journal, and in a sense rightly so. Even review articles in journals tend to cite studies that have a great deal of commonality of language and method—that way they can be adequately policed by the experts in the field. That is just the problem. We have no experts on combining disparate kinds of evidence (apart from some neat metastatistical techniques, which do not stretch very far). But doing so is at the heart of scientific epistemology when that epistemology is directed at establishing results we can use. (987)

This passage exemplifies the problems that occupy the following pages. The worry about combining evidence of different kinds, I argue, is deep.

We often have a variety of evidence available for a given research problem. I call disparate kinds of evidence, such as that marshaled by Marmot, multimodal. A “mode” is a particular way of learning about the world: a technique, apparatus, or experiment. In Chapter 4 I consider ways to make the concept ‘mode’ more precise.

Not only does amalgamating multimodal evidence matter to scientific epistemology when that epistemology is meant to establish results we can use in a practical setting, as is Cartwright’s concern in the above passage, but scientific epistemology has appealed in numerous ways to the epistemic importance of multimodal evidence in the service of various philosophical positions. I bundle arguments based on an appeal to concordant multimodal evidence under the umbrella term ‘robustness’, and in Chapters 4 and 5 I canvass some of the appeals made by philosophers of science to robustness.1 Robustness is said to counter the experimenter’s regress (Culp 1994), resolve worries about underdetermination (Wimsatt 1981), demarcate artifacts from real entities (Franklin 2002), allow us to ‘observe’ unobservable entities (Hacking 1983), aid us in our pursuit of objectivity (Douglas 2004), and provide grounds for various kinds of scientific realism (Cartwright 1983, Salmon 1984, Snyder 2005, Maddy 2007). A canonical example often appealed to is Perrin’s determination of Avogadro’s number, which he did with thirteen distinct methods (described in Nye 1972): if Perrin measured Avogadro’s number using such a variety of methods, such reasoning goes, then atoms must be real.2

1 The earliest use of the term “robustness” as a methodological adage that I have seen is by the statistician George Box in 1953 – a robust statistical analysis is one whose conclusions are consistent despite changes in underlying analytical assumptions.

2 My concern in this dissertation is with robustness understood as concordant multimodal evidence for an empirical hypothesis. There is a growing literature on robustness of models: Levins (1966) and Wimsatt (1981), for example, have argued that robustness is valuable for modeling, since all of our models are idealizations and “our truth is the intersection of independent lies”; Cartwright (1991) has argued that robustness arguments in econometric modeling are crude inductions; and Orzack and Sober (1993) have criticized model robustness on the charge that it is a non-empirical form of confirmation. The debate continues today (e.g. Weisberg 2006). This dissertation is not concerned with robustness of models.

Robustness arguments are pervasive in philosophy of science. The program for the 2010 Philosophy of Science Association conference had at least four talks which relied on robustness. I know of at least five dissertations currently being written that are based one way or another on robustness. Nearly every issue of our journals has an article which purports to solve an arcane problem by appealing to robustness.

Several synonyms have been used for robustness: “multiple derivability” (Nederbragt 2003), “confirmational significance of evidential diversity” (Fitelson 1996), “more varied data” (Kruse 1997), “aligning of research techniques” (Bechtel 2002), “argument from coincidence” (Cartwright 1991), and “triangulation” (Leigh Star 1989). Of course it is not a transient fad. Robustness was famously discussed in the nineteenth century by Whewell (1837), who gave it the name “consilience of inductions” (see also Laudan 1970), and in the mid-twentieth century Hempel (1965) recognized its importance. Perhaps its most interesting synonym comes from fin de siècle psychical research: an objection to William James’s reports of psychic phenomena was that the probability that witnesses of the phenomena were dishonest was higher than the probability that the phenomena truly occurred. The “faggot” argument, made by James’s friend Edmund Gurney, was meant to counter this objection. A faggot is a bundle of woven sticks – the eponymous argument was that as more independent witnesses reported a phenomenon, the chance that they were all dishonest decreased to below the chance that the phenomenon was real. James agreed that “weak sticks make strong faggots.”3 Robustness has long had a prominent place in our epistemology. But today the frequent use of robustness and the variety of epistemic tasks placed on the notion are remarkable. I enjoy swimming against the current: partly because of the sheer popularity of robustness, I have chosen to raise trouble for it.

3 See Bordogna (2008), who describes the disputants’ attitudes towards the faggot argument. James thought it “carried heavy weight” (p. 121): the “various fragments of evidence” regarding a psychic phenomenon “resembled the recording of a multivocal performance” (p. 133). In response, a critic argued that “when we have an enormous number of cases, and cannot find among them all a single one that is quite conclusive, the very number of cases may be interpreted as an index of the weakness of the evidence” (Bordogna 2008 p. 121).

Indeed, this dissertation aims to be a gadfly for robustness. The intuition that concordant multimodal evidence is epistemically important is broadly shared, deeply felt, and historically entrenched, and thus I do not expect to have much of an impact on the intuition. Robustness has become a platitude. My epigram on page v, from Lewis (1969), was chosen with this in mind: I have given myself the task of questioning the platitude of robustness, fully knowing the danger of this task, since I am more easily discredited than robustness. Nevertheless, if the dissertation is a gadfly then I will deem it a success. One of my complaints about robustness is that despite the frequency with which philosophers appeal to robustness, and the epistemic weight that many claim it can carry, the argument form is hardly more sophisticated today than it was in Whewell’s day. We do not know the conditions under which robustness arguments ought to be compelling. My epigram from Lewis continues as follows:

when a good philosopher challenges a platitude, it usually turns out that the platitude was essentially right; but the philosopher has noticed trouble that one who did not think twice could not have met. In the end the challenge is answered and the platitude survives, more often than not. But the philosopher has done the adherents of the platitude a service: he has made them think twice.

I do not claim to be a good philosopher. But I have noticed trouble that many others have not. I hope the trouble will be met, and regardless I expect the platitude will survive.


It is generally assumed that a condition of robustness arguments is that the various pieces of evidence must be independent; indeed, a criterion of individuation is precisely what is needed for some set of evidence to be multimodal.4 I show in Chapter 4 that most philosophers implicitly or explicitly rely on a notion of ontic independence (OI) when making a robustness argument. Evidence is OI when the available modes depend on different physical materials, assumptions, or background theories; thus evidence is multimodal when the modes are individuated based on a particular OI criterion. Usually when robustness arguments are proposed no explicit individuation criterion is mentioned, but when an individuation criterion is made explicit, it is usually an OI criterion.

4 It is an unfortunate accident of intellectual history that the term ‘independence’ appears in two distinct contexts in this dissertation, with entirely different meanings: the ‘independence’ axiom (Chapter 8) is borrowed from social choice theory, in which the use of the term is entrenched, and the ‘independence’ requirement for robustness (Chapter 4) comes from philosophy of science, in which the use of the term is also entrenched.

Contrary to what many philosophers assume, I argue that OI does not justify robustness: I show that multimodal (OI) evidence can confirm a hypothesis to a lower degree than any of the individual pieces of evidence (such evidence is dyssynergistic), and so OI cannot be sufficient for robustness. Indeed, dyssynergistic evidence can confirm the negation of a hypothesis that each of its individual pieces confirms. To use terminology from Chapter 8, ‘unanimity’ ought not be satisfied when evidence is dyssynergistic. With Tarun Menon, I formulate a criterion – conditional probabilistic independence (CPI) – and prove that multimodal evidence that meets this criterion is collectively more confirmatory than its individual pieces, thereby justifying robustness. But although CPI is sufficient to justify robustness, it is not clear that CPI is epistemically accessible to experimentalists. If it is not, then, so I argue, we are left in a quandary: to justify robustness arguments, CPI is sufficient but epistemically inaccessible.
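As a rough probabilistic gloss (this is my informal restatement here; the formal definition is given in Chapter 4, on page 96): evidence e1 and e2 from two different modes satisfy CPI with respect to a hypothesis H just in case

p(e1 & e2 | H) = p(e1 | H) × p(e2 | H),

that is, given the truth of the hypothesis, learning the result of one mode does not change the probability of the result of the other.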

In Chapter 5 I argue that without good methods of combining evidence robustness arguments lack the rigor and quantitative precision often associated with compelling inductive arguments; robustness arguments are often vague, except perhaps in ideal cases in which all or most of the available multimodal evidence is concordant, such as Perrin’s determinations of Avogadro’s number. When the available multimodal evidence is discordant, as is often the case in the biomedical and social sciences, a compelling robustness argument cannot be made. Commenting on a trial testing a new pharmaceutical, the chief executive of the responsible company described the evidence as “a dog’s breakfast” (Harris 2009). In Chapter 5 I give several contemporary and historical examples in which multimodal evidence looks like a dog’s breakfast. I further argue that to determine the support that multimodal evidence provides to a hypothesis the evidence must be amalgamated in a systematic way, yet we lack good amalgamation methods (AMs). In Chapter 6 I survey various possible AMs, and in Chapter 7 I investigate in more detail the most prominent of such methods: meta-analysis.

In the passage with which I began, Cartwright parenthetically mentioned metastatistical techniques, sometimes called ‘meta-analysis’, and suggested that such techniques “do not stretch very far.” I have come to agree with this, and indeed, perhaps pessimistically, I have come to consider it a vast understatement. In Chapter 7 I show that numerous decisions must be made when performing a meta-analysis which allow wide latitude for personal idiosyncrasies to influence the outcome of a meta-analysis. I illustrate this with examples of multiple meta-analyses on the same research question which reach contradictory conclusions. In short, meta-analysis fails to constrain intersubjective assessments of hypotheses. This is worrying and surprising, since many assume that, at least in the biomedical and social sciences, regardless of what we ought to consider to be the gold standard of evidence, meta-analysis is the platinum standard. I argue that this is hubris: meta-analysis cannot be the platinum standard of evidence.

The chapter on meta-analysis proceeds by directly examining scientific practice. If philosophers are interested in arguments based on appeals to concordant multimodal evidence, then why not take a close look at those parts of science in which purportedly rigorous and quantitative methods have been developed to amalgamate a huge volume of evidence? The approach in this chapter can be thought of as ‘bottom-up’, in that I attempt to raise suspicions for a general class of philosophical arguments (robustness arguments) by attending to details of a methodology commonly used by practicing scientists (meta-analysis). But an entirely different approach – an approach which can be described as ‘top-down’ – is also possible. One could investigate methods of amalgamating multimodal evidence by considering what the logical space of possibilities for amalgamating evidence is in the first place. In Chapter 8 I pursue such an approach.

Amalgamating multimodal evidence is, I argue, analogous to amalgamating individuals’ preferences into a group preference (a burgeoning topic in social choice theory). The latter faces well-known paradoxes, such as Arrow’s impossibility theorem. In Chapter 8, the final chapter of the dissertation, I prove an Arrow-like theorem which shows that if an AM satisfies ‘universality’ (the AM can accept as input all possible confirmation orderings of hypotheses), ‘independence’ (a confirmation ordering from an AM should only depend on the confirmation orderings of individual input modes), and ‘unanimity’ (if all input modes order hypothesis H1 over H2, the AM must also order H1 over H2), then the AM must be a ‘dictatorship’ (the confirmation ordering of the AM depends only on the confirmation ordering of a single mode). This surprising theorem is a step toward delimiting the logical space of possibilities for amalgamating multimodal evidence.
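For orientation, here is a rough schematic of the setup in the notation of the List of Terminology (this is my informal paraphrase; the precise definitions appear in Chapter 8). An amalgamation function F takes a profile of confirmation orderings, one for each mode, and returns an overall confirmation ordering:

F : (≽1, …, ≽n) → ≽

Universality: F is defined for every possible profile (≽1, …, ≽n).
Unanimity: if every mode ranks H1 above H2, then the output ordering ranks H1 above H2.
Independence: the output ranking of H1 relative to H2 depends only on how each ≽i ranks H1 relative to H2.
Non-dictatorship: there is no mode i such that the output ordering always coincides with ≽i.

The theorem, roughly, is that no amalgamation function can satisfy all four desiderata at once.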

Most of Chapter 8, though, is occupied with exploring the plausibility of the four desiderata used in the impossibility theorem. I argue that two of the desiderata – universality and non-dictatorship – ought to be considered general norms for an AM. But I argue that unanimity, though seemingly a generally desirable feature of an AM, ought to fail when the relevant multimodal evidence is dyssynergistic; I define this notion in Chapter 4 and show why it is occasionally reasonable to violate unanimity. The intuition undergirding unanimity, however, is strong – it is the basis of robustness arguments, after all – and thus I assume that we normally want to uphold unanimity as a desideratum for an AM. If one denies the desirability of unanimity, one denies robustness, which is fine with me – the gadfly goal of this dissertation would be more than met – but I assume that most will think that, as a way of avoiding the impossibility theorem of Chapter 8, denying unanimity is an extreme measure. It is, instead, the ‘independence’ desideratum which will seem to many to be the least plausible of those required for the impossibility theorem. This will be explained in due course, but allow me some anticipation.

The primary objection that I have heard to the independence desideratum is that there is no reason to limit AMs to ordinal rankings of hypotheses. After all, we have more precise quantitative measures of the support that evidence provides to hypotheses than mere ordinal rankings – why not allow measures which are more informative than mere rankings of hypotheses? If we admit cardinal measures of the support that evidence provides to a hypothesis, as we would if we accepted one of the various probabilistic measures of evidential support currently on offer, then we could avoid the impossibility theorem of Chapter 8. But one conclusion from Chapters 2 and 3 is that conditions for applying probabilistic measures of evidential support are often unmet. In Chapter 2 I follow Bechtel (2006), Franklin (1986), and others to argue that when assessing the epistemic importance of evidence there are a variety of features of the evidence that must be considered – such as the quality and relevance of the evidence, patterns in the evidence, and plausibility of the evidence.5 I proceed in Chapter 3 to argue that a number of contradictory but equally rational ways of weighing and prioritizing these features are possible. This allows me to support Glymour (1980), Earman (1992), and others who have claimed that in many empirical situations, precise and intersubjective determinations of evidential likelihoods are impossible. It follows that the probabilistic measures of evidential support on offer by contemporary ‘confirmation theory’ are not usable in many rich empirical situations. This, in turn, provides reason to think that the independence axiom is plausible. And if so, the impossibility theorem presented in Chapter 8 is truly perplexing, since it shows that four realistic, general desiderata of AMs cannot jointly be satisfied.

5 In Chapter 2 I include two case studies in some detail, which serve dual purposes: first, they illustrate the plurality of the features of evidence discussed in this chapter, and second, I return to the case studies throughout the dissertation to provide illustrations of later arguments. These case studies are the elucidation of the material basis of heredity in the 1940s by Avery and colleagues, and the purported demonstration of ‘water memory’ by Benveniste and colleagues in the 1980s.

Chapters 2 and 3 are concerned with the assessment of evidence simpliciter, while Chapters 4-8 are concerned with multimodal evidence. Chapter 2 situates robustness among several key features of methods of evidence production and of evidence itself which are relevant for assessing the epistemic importance of a given piece of evidence. The context-setting role of this chapter, and its case studies (which I return to later in the dissertation), are the reasons why I place it first. But moreover, the plurality of features of evidence discussed in Chapter 2 supports the underdetermination of evidence by theory argument made in Chapter 3. In turn, I rely on the conclusion of this argument to support my ‘quandary’ regarding the epistemic inaccessibility of CPI in Chapter 4, and to provide plausibility to the ‘independence’ axiom in the impossibility theorem of Chapter 8. Thus, although Chapters 2 and 3 are not directly concerned with multimodal evidence or robustness, they provide groundwork to support arguments made later in the dissertation which are directly concerned with robustness understood as concordant multimodal evidence.

Above I call this dissertation a gadfly for robustness. To sum up, the nature of these gadfly arguments is as follows.


Epistemological: the criterion of independence that most philosophers have used to justify robustness arguments fails to do so, and a criterion of independence that can guarantee robustness is epistemically inaccessible (Chapter 4).

Empirical: multimodal evidence is often a dog’s breakfast, in which case a compelling robustness argument cannot be made (Chapter 5).

Methodological: our most frequently used method for amalgamating evidence – meta-analysis – fails to constrain intersubjective assessments of hypotheses (Chapters 6 and 7).

Logical: methods for amalgamating multimodal evidence face an impossibility theorem which forces us to give up at least one general desideratum for an AM (Chapter 8).

In recent years Cartwright has argued that the rigor with which internal validity is established in various methods – randomized controlled trials, for example – is in stark contrast with the lack of rigor with which results from high quality methods are extrapolated to more general conclusions and ultimately to guidance on policy (see, e.g., Cartwright 2007). This dissertation is written in the same spirit.

Many of our inductive methods are highly sophisticated, well-grounded in epistemological theory, and have clearly delineated, well-known, and broadly respected conditions for success. In contrast, I argue that our methods for combining multimodal evidence fall far short of such rigor. This is in spite of the great importance of multimodal evidence. Multimodal evidence is practically important because it can be a powerful and ubiquitous basis of both belief and uncertainty, and it is philosophically important because concordant multimodal evidence has been celebrated as a tonic for the fallibility of any single kind of evidence, as a counter to skeptical arguments, and so is said to provide grounds for objectivity and various forms of scientific realism. In the pages that follow I offer epistemological, empirical, methodological, and logical arguments, and use examples from biology and medicine, to identify demanding conditions for the successful appeal to multimodal evidence.

CHAPTER 2: VARIETIES OF EVIDENTIAL EXPERIENCE

Abstract

I describe two traditions of philosophical accounts of evidence: one explicates the notion in terms of signs of success, the other characterizes the notion in terms of conditions of success. The former often relies on the probability calculus, and has the virtues of generality and theoretical simplicity. The latter tends to be comprised of the features of evidence which scientists appeal to in practice, which include general features of methods, such as quality and relevance, and general features of evidence, such as patterns in data, concordance with other evidence, and believability of the evidence. Two infamous episodes from biomedical research will help illustrate these features. Philosophical characterization of these latter features – conditions of success – has the virtue of potential relevance to, and descriptive accuracy of, practices of experimental scientists.


2.1 Introduction

Contemporary accounts of evidence, when explicated in terms of signs of success, specify what is achieved once one has good evidence (§2.2). Such accounts are largely unhelpful for those scientists involved in generating and assessing evidence, since a primary concern of experimentalists is to determine whether or not some evidence is indeed good evidence, rather than to determine the precise nature of what is gained, epistemically, by good evidence. The multiple features of evidence which are important for assessing evidence are conditions of success. Scientists assess features of methods, such as quality, relevance, and transparency (§2.3), and features of evidence, such as patterns in data, concordance with other evidence, and believability of the evidence (§2.4). When evidence is judged favorably on these desiderata it is considered truth-conducive.

Examples from biomedical research help illustrate these features of evidence. I describe two cases which demonstrate the appeal to conditions of success: the elucidation of the material basis of heredity in the 1940s by Avery and colleagues (§2.5), and the purported demonstration of ‘water memory’ by Benveniste and colleagues in the 1980s (§2.6). The two cases present a nice contrast, since the first is an episode in which evidence is (today) broadly regarded as compelling or even conclusive, whereas the second is an episode in which evidence is broadly regarded as incredible. The features of evidence are appealed to both when criticizing evidence and when praising evidence.

My primary aim is to compare the signs of success tradition with the conditions of success tradition, and to provide some detail to the description of the conditions of success tradition. I focus on the conditions of success not because it is superior to the signs of success tradition, but because it is in a sense prior to and richer than the signs of success tradition. The signs of success tradition is necessarily post hoc, in that one already must possess and have evaluated the evidence in order to do business in the signs of success tradition. Those who think that philosophy of science ought to attend to details of scientific practice that might matter to scientists themselves should be concerned with a forward-looking attention to methodology over a backward-looking attention to comparing probabilities. Hence my focus is on the conditions of success tradition.

Consider the following analogy. What makes for a good wine? One answer would be that a wine is good if the wine is awarded more than 90 points by a well-known wine critic. Another answer would describe methods of quality wine production by careful vineyards. A third answer would list features of wine itself, such as its bouquet, color, and taste, that one ought to consider when assessing wine. The first answer is relatively uninformative, since it simply restates what we want to know, albeit in more precise and quantitative terms. The second answer lists concrete aspects of the method of production of a particular wine that a critic could appeal to. The third answer lists concrete aspects of the wine itself that a critic could appeal to. Of course, once a critic has appealed to the latter considerations she might come up with a simple numerical score to summarize her oenological investigation.

Characterizing evidence in terms of conditions of success is also a more accurate description of scientific behavior than characterizing evidence in terms of signs of success. To be clear, though, the conditions of success are fully normative.


Strictly speaking the two traditions of evidence are not at odds, since they aim to characterize different aspects of evidence. The conditions of success tradition describes the important features of evidence which ought to be considered when assessing evidence, and the signs of success tradition describes what is achieved once one has good evidence. Perhaps another way to describe the difference between the two traditions of evidence is in terms of process and product. The conditions of success tradition describes the process of generating evidence, whereas the signs of success tradition describes the product.

I argue that evidence that fares well on the conditions of success is more truth-conducive than evidence that fares poorly on these criteria (and is thought so by scientists themselves); in Chapter 3 I show how these criteria are relevant to assessing the truth-conduciveness of evidence. Here I emphasize the plurality and complexity of the conditions: assessing the conditions is complicated and there is no simple or universally agreed-upon algorithm for assessing the particular criteria.

2.2 Signs of Success

Many philosophical accounts of evidence describe what good evidence achieves. For instance, a leading probabilistic account of evidence states that e is evidence for some hypothesis h, given background assumptions b, if and only if p(h|e & b) > p(h|b); in other words, the probability of the hypothesis given e and b must be greater than the probability of the hypothesis without e, if e is to count as evidence for the hypothesis. The more confirming e is, the greater is the inequality between p(h|e & b) and p(h|b). Alternatively, some philosophers require the probability of the hypothesis to be above a certain threshold after receiving e, if e is to count as evidence for the hypothesis; on this account e is evidence for h if and only if p(h|e) > x, where x is some minimum threshold. Achinstein (2001), for example, requires x to be 0.5 for one to consider e as veridical evidence for h (evidence that provides good reason to believe h), and Roush (2005) requires x to be much greater than 0.5 to consider e as good or strong evidence for h.

Another sign of good evidence is what Roush calls ‘discrimination’ – evidence should discriminate between a hypothesis and the negation of that hypothesis – and to measure this Roush argues that the likelihood ratio is appropriate: p(e|h)/p(e|~h). If the likelihood ratio is greater than 1, then e discriminates between the hypothesis and its negation, and so is good evidence for the hypothesis. In other words, e should be more likely conditional on the hypothesis being true than conditional on the hypothesis being false. Roush’s example is a car’s ‘check engine’ light: if it is much more probable that the check engine light is on when there is engine trouble, as compared to the check engine light being on when there is no engine trouble, then the check engine light is discriminating evidence for the hypothesis that there is engine trouble. Roush also claims that good evidence should have a high probability if we are to think that the evidence is credible; that is, a high p(e) indicates that the evidence is believable.
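A quick illustration with hypothetical numbers (the numbers are mine, chosen only for illustration) may make the contrast between these measures concrete. Suppose the check engine light is on in 90 percent of cars with engine trouble and in 5 percent of cars without, so that p(e|h) = 0.9 and p(e|~h) = 0.05. The likelihood ratio is then 0.9/0.05 = 18, well above 1, so on the discrimination measure e counts as good evidence for h. If we also had a prior p(h) = 0.1, Bayes’ theorem would give

p(h|e) = (0.9 × 0.1) / (0.9 × 0.1 + 0.05 × 0.9) = 0.09/0.135 ≈ 0.67,

which exceeds p(h) (and so counts as confirming on the first probabilistic account above) and exceeds Achinstein’s threshold of 0.5, though it may fall short of the much higher threshold Roush requires for good or strong evidence.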

Each of these accounts provides a competing characterization of what evidence achieves.6 Each is compelling in different ways – I will not review the virtues and vices of these accounts of evidence. They all share one feature which renders them relatively useless to the experimenter: they indicate what is achieved once one has good evidence – they are signs of success – but they do not indicate how to generate or identify evidence which can then be granted these signs. They provide post hoc characterizations of good evidence rather than guidance on the production or identification of good evidence. These accounts of evidence are like awards which are used to distinguish compelling evidence from weak evidence: they characterize the nature of the award, but scientists want to know which evidence to give the award to.

6 On these accounts of evidence, see e.g. Carnap (1962), Achinstein (2001), Roush (2005).

Perhaps these philosophers think that there is nothing general to be said about evidential standards, and so instead of concerning themselves with identifying good evidence they concern themselves with characterizing good evidence. Or perhaps these philosophers think that identifying good evidence is a matter best left to scientists, whereas characterizing good evidence is more properly a philosophical topic. However, in what follows I attempt to describe at least several general features of evidence which can be used to generate and identify good evidence.7

7 According to accounts of evidence associated with Williamson (2000) and others, evidence is merely factive, and so it makes little sense to talk about the ‘veracity of evidence’, or ‘good evidence’. There is, on the factive account, simply evidence, the veracity of which is taken for granted. Naturally, any account of evidence must be able to accommodate the difference between (i) evidence generated by a method with rampant systematic errors and (ii) evidence generated by a method which controls for known errors. On my account (ii) is ‘good’ evidence and (i) is ‘weak’ evidence. On the factive account, a proposition expressing an evidential report can accommodate the difference between (i) and (ii) by including the relevant information regarding the methodological differences between (i) and (ii) in the proposition expressing the evidential report itself. Given the methodological complexity of contemporary experiments, many evidential reports in the factive account would have to be complex and cumbersome. The advantage of the conditions of success account is that an evidential report can be stated rather simply while the assessment of such reports can be as complex as need be. This is, moreover, precisely how scientists themselves go about reporting and assessing evidence.


2.3 Conditions of Success: Methodological Features

A method of generating evidence can be assessed in the abstract, independently of any actual evidence generated by the method. That is, prior to the consideration of any evidence from a method, the method itself can be (and almost always is) assessed. Three general features or desiderata of methods are freedom from systematic errors, relevance to our hypothesis of interest, and how ascertainable either of these is. I will call these quality, relevance, and transparency.

2.3.1 Quality

A method is high quality if and only if possible systematic errors are controlled for. If the method controls for systematic errors, then evidence generated by the method is a faithful indicator of the subject of study. The term ‘internal validity’ has often been used for the notion of quality, and is meant to indicate how well a study is designed and performed to avoid systematic error and bias, which can result from flaws in study design, conduct, analysis, interpretation, and reporting. Quality has been a staple subject for statisticians, philosophers of science, and scientists worried about methodology; volumes have been written on the subject. One careful account of evidential quality is what Mayo calls the ‘severity principle’: data x provide good evidence for a hypothesis H to the extent that H severely passes a stringent test with x (see Mayo 1996). On this account a method is high quality if it comprises a stringent test, and a stringent test is one which severely probes for possible errors. Achinstein proposes another way to characterize quality: his notion of ‘evidential flaws’ refers to flaws in the evidence-generating method – on this account quality would be an absence of such flaws (2001).

I will not add anything to the discussion of quality except to note that assessing quality is itself a complex task. The notion of quality of a method is itself comprised of numerous features. Standard elements of experimental design determine quality of evidence – random allocation of subjects, appropriate blinding, and proper use of analytical tools are factors which determine the quality of evidence. Quality can be characterized in terms of the plausibility of the background assumptions – not necessarily shared or even explicitly formulated – required to consider evidence as a truth-conducive indicator of the particular subject under investigation.

To illustrate such complexity, consider the following. Health researchers in Alberta recently provided a concrete example of the difficulty with assessing the quality of medical studies (Hartling et al. 2009). A quality assessment scale known as the ‘risk of bias tool’ was devised by the Cochrane Collaboration to assess the degree to which results of a study “should be believed”. The risk of bias tool has six components: sequence generation, allocation concealment, blinding, incomplete outcome data, selective outcome reporting, and other sources of bias; assessments of the risk of bias are made for each component and then an overall assessment of bias is made. The Alberta researchers distributed 163 manuscripts of RCTs on child health among five reviewers, who assessed the RCTs on risk of bias. The inter-rater agreement of the risk of bias assessments was found to be very low. In other words, even when given a standard quality assessment tool, training on how to use the scale, and a narrow range of methodological diversity (RCTs, in this case), there is a wide variability in assessments of study quality.

In sum, quality of methods is itself a complex notion, comprised of multiple features of methodological design, and given this complexity, shared determinations of quality can be difficult.

2.3.2 Relevance

Tossing a coin several times gives some evidence regarding the fairness of the coin, since there is a clear relationship between the results of a series of coin tosses and the probability that the coin is fair. But tossing a coin several times gives no evidence regarding tomorrow’s weather, since there is no relationship between the results of a coin toss and tomorrow’s weather. In other words, evidence from coin tossing has no relevance to tomorrow’s weather. And this, of course, is true regardless of any actual evidence generated from coin tossing. Relevance to a hypothesis is obviously a crucial feature of methods.

The degree of relevance of a method to a hypothesis depends, of course, on both the method and the hypothesis, as the coin tossing examples show. Suppose our hypothesis is more general than simply the fairness of the single coin which we toss, but is rather about the fairness of all the coins in my pocket. The method – tossing the single coin – then would be less relevant to the hypothesis (but obviously not irrelevant). Considering a more general hypothesis in this case made the same method less relevant. But this depends on the background assumptions which we are willing to entertain. Some methods will be similarly relevant to hypotheses of a range of generality. For instance, dropping a coin once will give some evidence about the tendency of this coin to fall, but it will also give some evidence about the general tendency of coins to fall in such situations, because it is reasonable to suppose that when it comes to falling, there is no relevant difference between the coin which we dropped and most other coins.

There have been a number of attempts to specify what relevance is, including hypothetico-deductivism, inductivism, and conventionalism (see Glymour 1975). Given one of these views, a good way to construe the degree of relevance is the degree of plausibility of the background assumptions required to relate e to H. The following example illustrates.

The detection of solar neutrinos has been of interest to philosophers of science, given the complexity of the inferential steps required to justify detection claims from physicists’ observational set-ups (e.g. Shapere 1982; Pinch 1985). Neutrinos are virtually massless and chargeless, and so there is no straightforward way to observe them. One of the first set-ups designed to detect solar neutrinos was a 100,000 gallon tank of perchloroethylene (C2Cl4) – dry-cleaning fluid – placed one mile deep inside a mine shaft. Neutrinos can interact with an isotope of chlorine, Cl37, which is present in C2Cl4, to produce a radioactive isotope of argon, Ar37, as follows: Cl37 + ν → Ar37 + e-. This reaction occurred less than once per day in the tank of dry-cleaning fluid. The presence of argon in the tank after about one month was measured and taken to be evidence for neutrino detections. The amount of Ar37 was itself measured indirectly: helium gas was used to ‘sweep’ the tank, and the argon was collected on a supercooled charcoal trap. The collected argon then decayed, emitting electrons which were recorded by a Geiger counter. Some Geiger clicks occurred for reasons other than argon decay (‘noise’), and so analytical techniques based on the energy and pulse-time of the clicks were used to separate signal from noise. If e is the number of Geiger counts, and H is the number of neutrinos detected, then the physicists attempting to detect solar neutrinos followed roughly this chain of inferences (with the arrows indicating direction of inference, opposite that of assumed direction of causality):

e: number of Geiger counts

(i) number of Geiger counts  number of ‘signal’ Geiger counts

(ii) number of ‘signal’ Geiger counts  number of electrons emitted

(iii) number of electrons emitted  number of Ar37 captured

(iv) number of Ar37 captured  number of Cl37 captured

(v) number of Cl37 captured  number of neutrinos detected

H: number of neutrinos detected

Is e relevant to H? That depends on how plausible you find the inferences associated with (i) through (v). (i) assumes that electrons emitted from argon decay have a particular range of energy and pulse-time measurable by the Geiger counter. (ii) assumes, first, that the Geiger clicks transformed by the analytical tools used to distinguish signal from noise are successful at identifying Geiger clicks as signals when they are in fact caused by electrons emitted from argon decay (that is, the true positive rate is high or the false positive rate is low), and second, that the Geiger clicks transformed by the analytical tools used to distinguish signal from noise are successful

at identifying Geiger clicks as noise when they are in fact not caused by electrons emitted from argon decay (that is, the true negative rate is high or the false negative rate is low). (iii) assumes that the theoretical understanding of argon decay is correct.

And so on. In his study of this case Pinch (1985) noted that different detection claims made by different groups varied in which of (i) through (v) are taken for granted.

Some groups reported the number of neutrinos detected per unit time, taking all of (i) through (v) for granted. Other groups reported number of argon atoms detected per unit time, taking (i) through (iii) for granted. The complexity of inferences required to assume that e is evidence for H in the case of detecting solar neutrinos indicates how complex the assessment of relevance can be.
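The dependence of the inferred value of H on these background assumptions can be displayed with a deliberately crude numerical sketch of my own; the rates and efficiencies below are invented placeholders, not the physicists' values:

```python
def inferred_neutrinos(geiger_counts, signal_fraction, counts_per_decay,
                       decay_fraction, recovery_efficiency, ar37_per_neutrino):
    """Run the chain of inferences backwards, from e (Geiger counts) toward H
    (neutrinos detected). Each step leans on a background assumption roughly
    corresponding to one of (i) through (v); all values here are invented."""
    signal = geiger_counts * signal_fraction              # (i): separate signal from noise
    electrons = signal / counts_per_decay                 # (ii): Geiger counts registered per decay electron
    ar37_decayed = electrons                               # one electron assumed per Ar37 decay
    ar37_collected = ar37_decayed / decay_fraction         # (iii): fraction of collected Ar37 that decayed
    ar37_produced = ar37_collected / recovery_efficiency   # sweep-and-trap recovery of Ar37
    return ar37_produced / ar37_per_neutrino               # (iv)/(v): Ar37 produced per neutrino capture

# Two assessors who accept the same e but entertain different background
# assumptions arrive at different values for H:
e = 60                                                     # invented number of Geiger counts
print(inferred_neutrinos(e, 0.5, 1.0, 0.8, 0.9, 1.0))      # roughly 42 neutrinos
print(inferred_neutrinos(e, 0.3, 1.0, 0.8, 0.5, 1.0))      # roughly 45 neutrinos
```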

That both quality and relevance are features that we should want our methods to have is, I assume, self-evident: methods with these features are more likely to generate evidence which is truth-conducive. These are standard desiderata of methods, though they are not always weighted equally. For example, in evidence-based medicine, randomized controlled trials (RCTs) are often considered to be high quality because they are said to minimize systematic error, and this is often said to be important even if a particular RCT is less relevant to a general hypothesis of interest than is a study design which has more potential for systematic error (such as observational case-control studies). In other words, in evidence-based medicine, for better or worse, quality has tended to be emphasized over relevance.8

8 Some now argue that this is for the worse; see for example Worrall (2002) and Cartwright (2007).


2.3.3 Transparency

Another general feature of methods, independent of the evidence produced by them, is transparency. For some methods it is easier than for others to know how they produce their evidence, whether they produce their evidence free from bias, and whether the evidence they produce would be relevant to a hypothesis. Unlike quality and relevance, transparency is not, strictly speaking, an epistemic feature of methods – that is, a transparent method is not necessarily truth-conducive, and vice versa – but it is a feature that is nonetheless appealed to when evaluating methods in scientific practice. In order to know if a method is high quality or is relevant to a particular hypothesis, one must know the details of the method’s operation. A method is transparent if we can understand how it works – that is, a method is transparent if we can understand the mechanism of the method or the ‘nomological machine’ undergirding the method (Cartwright 1999). We want to be able to make transparent judgments regarding quality and relevance – judgments which can be communicated and shared such that some agreement regarding quality and relevance is achieved.

Collins famously argues that such judgments involve an ‘experimenter’s regress’: good evidence is generated from properly functioning techniques, but properly functioning techniques are just those that give good evidence (1985). Of course, one often knows enough about the nomological machine undergirding a method to make informed judgments regarding its quality and relevance. But Collins was worried about novel methods at the cutting edge of science, in which one often cannot make judgments regarding the quality (or relevance) of a method simply because we do not know enough about the inner workings of the method to make such

judgments. In other words, in many cases, especially those in which new methods are introduced into scientific practice, the method is not transparent. If scientists cannot directly assess the quality and relevance of a method, then they need something else to help determine if the evidence produced by the method is truth-conducive or else if it is inductively risky.

2.4 Conditions of Success: Evidential Features

Using examples from microscopy and neuroimaging studies, Bechtel has shown that when scientists assess evidence produced by novel methods with low transparency regarding quality and relevance, rather than attempt to assess the quality or relevance of the methods, scientists tend to assess multiple features of the evidence produced by the methods (2000). In other words, rather than simply assessing features of the method, scientists also assess features of the evidence itself. Bechtel’s examples are of visual evidence, but here I generalize his argument to be more broadly applicable.

Evidence should be considered compelling (or not) based on a variety of features of the evidence produced by a method. Some of the features often appealed to include: patterns within the evidence, concordance with evidence from other methods, and believability of the evidence. I will discuss each in turn.

2.4.1 Patterns

If data have a determinate pattern or structure, then that is suggestive that the evidence is tracking something real. For example, Bechtel (2000) discusses the

strategies that neuroscientists use to assess images of the brain generated by positron emission tomography (PET), one of which was to appeal to determinate structures in the images generated by PET scans. The PET images were not collections of randomly distributed colors, but rather the colors (which themselves are transformations of numerical data) were arranged in a salient structure. This structure, independent of any interpretation of the structure’s relation to brain activity, was taken to indicate that the images were somehow meaningful. Assessing the relevance of PET was not straightforward since the method was not transparent, but the sheer existence of structure in PET data suggested that the images were not merely artifacts.

Similarly, the epidemiologist Sir Austin Bradford Hill provided a list of nine criteria with which he assessed causal hypotheses, and one of these was the presence of a

‘dose-response relationship’ between the purported cause and the purported effect, which is another example of the presence of a suggestive pattern in data. If the lung cancer rate is higher amongst those that smoke more cigarettes, lower amongst those that smoke fewer cigarettes, and the relation is roughly linear between the ‘dose’

(smoking) and the ‘response’ (lung cancer), then this was considered better evidence than a simple correlation between smoking and getting lung cancer for the hypothesis that smoking is a cause of lung cancer. A simple correlation could be due to a common cause of smoking and lung cancer, but the presence of a dose-response relationship – a highly structured pattern of correlations – between smoking and lung cancer is a more plausible indication of a true causal relationship.


Bogen and Woodward have emphasized the importance of patterns in data: for them, patterns are precisely what scientists look for when examining data (1988).9 In response to Bogen and Woodward, McAllister (1997) agrees that patterns are important, but argues that for any set of data an infinite number of patterns can be discerned, expressible as the sum of a relatively simple function and an error term, such as F(x) = ax + R(x) . For instance, any set of data could be described by the following patterns:

Pattern A + noise at m percent

Pattern B + noise at n percent

Pattern C + noise at 0 percent

Pattern C would be the (perhaps complex) pattern which exactly fits the data points.

Though this is a version of the standard curve-fitting problem (itself a version of

Duhemian underdetermination), McAllister suggests that the choice of which pattern is the salient one for a given experiment is a complex judgment on the part of the investigators.

In contrast to the reliance on judgment for choosing a pattern, numerous algorithms have been proposed to aid in choosing between pattern descriptions (or models of data), such as the Akaike Information Criterion (AIC) and the Bayesian

Information Criterion (BIC).10 These algorithms attempt to balance two desiderata of

9 “The problem of detecting a phenomenon is the problem of […] identifying a relatively stable and invariant pattern of some simplicity and generality with recurrent features – a pattern which is not just an artefact of the particular detection techniques we employ or the local environment in which we operate” (Woodward 1989 pp. 396-397). 10 See, e.g. Sober (2007).

patterns or data models which trade off against one another: simplicity and accuracy.

The algorithms reward patterns which have fewer terms (i.e. patterns which are simpler) and less departure from observed data (i.e. patterns which are accurate).

However, there is no meta-methodological algorithm for choosing between the model choice algorithms; for instance, there is no principled way to choose AIC over BIC.

Trouble arises when AIC and BIC (or any other model selection criterion) select different models as achieving the optimal balance of simplicity and accuracy. In my toy example above, AIC might choose “Pattern A + noise at m percent” as the best model of the data, while BIC might choose “Pattern B + noise at n percent” as the best model of the data. Without a methodological meta-standard, it is unclear what the superior model is.
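To make the trade-off concrete, the following toy sketch of my own fits three candidate 'patterns' to the same invented data and scores them with AIC and BIC, computed here as n·ln(RSS/n) + 2k and n·ln(RSS/n) + k·ln(n) respectively (each up to an additive constant). The two criteria reward fit identically but penalize extra terms on different schedules, and nothing inside either formula adjudicates between those schedules:

```python
import numpy as np

def gaussian_ic(y, y_hat, k, n):
    """AIC and BIC for a least-squares fit with k estimated parameters,
    using the Gaussian log-likelihood up to an additive constant."""
    rss = np.sum((y - y_hat) ** 2)
    neg2_loglik = n * np.log(rss / n)
    return neg2_loglik + 2 * k, neg2_loglik + k * np.log(n)   # (AIC, BIC)

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 25)
y = 1.0 + 2.0 * x + 0.5 * x**2 + rng.normal(scale=0.1, size=x.size)   # invented data

for degree in (1, 2, 3):                     # candidate patterns: line, quadratic, cubic
    coeffs = np.polyfit(x, y, degree)
    y_hat = np.polyval(coeffs, x)
    aic, bic = gaussian_ic(y, y_hat, k=degree + 1, n=x.size)
    print(degree, round(float(aic), 1), round(float(bic), 1))
# Both criteria reward fit (low RSS) and penalize extra terms, but on
# different schedules (2k versus k*ln(n)); neither formula says which
# penalty schedule is the right one.
```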

In short, a distinctive pattern or a suggestive structure in data is an important but complex consideration when assessing evidence.

2.4.2 Concordance

If a particular piece of evidence displays notable patterns, this might be due to a feature of the experimental intervention or an artifact of the observational apparatus, rather than a sign of a true feature of the object under investigation. Thus, a common practice is to compare the evidence from one method to evidence from other methods.

Concordant evidence from multiple methods is taken to be epistemically valuable (the term ‘robustness’ has been used to describe situations in which evidence from multiple methods is concordant). For example, when Hacking (1983) asked ‘do we see through a microscope?’ – meaning, do we ‘observe’ unobservable entities with such

instruments, to which van Fraassen’s constructive empiricist answer was ‘no’ –

Hacking answered ‘yes’, because the fact that we can observe the same microscopic entities with multiple kinds of microscopes (optical, ultraviolet, electron…) gives us good reason to think that such entities are real. In Part II of this dissertation I more fully characterize this feature of assessing evidence, and argue that the conditions under which such arguments are successful are demanding. However, it is undeniably a common practice for scientists to appeal to the epistemic value of concordant evidence, and many philosophers of science interested in evidence and inference have, at least in passing, extolled the virtues of experimental robustness.11

2.4.3 Believability

A necessary condition for evidence to guide belief is that the evidence must be believable. Evidence can be believable or not for several reasons. I will focus on two grounds on which the believability of evidence is assessed: theoretical and mechanistic.

Evidence is theoretically believable if and only if the fact that the evidence occurred can be explained by, or is at least consistent with, broadly accepted theories. Evidence is mechanistically believable if and only if how the evidence occurred can be explained by, or is at least consistent with, plausible or broadly accepted mechanisms of both

11 Including, for examples, Horwich (1982), Cartwright (1983), Franklin (1984), Sober (1984), Howson and Urbach (1989), Achinstein (2001), Staley (2004) and Bechtel (2006). Several papers were devoted to robustness at a workshop in June 2008 in Nancy, France.

the target system and the method.12 Often only one of these kinds of plausibility is applicable to a given piece of evidence.

A metaphysician’s example will help. If an astronomer reported observing a gold sphere one mile in diameter, this would hardly be mechanistically believable, since we have no plausible explanation for how such a sphere could have originated.

Nevertheless, a gold sphere one mile in diameter would at least be theoretically possible, since it would not contradict broadly accepted physical theories or laws. On the other hand, if an astronomer reported observing a sphere of uranium one mile in diameter, this would contradict broadly accepted theories, since such a sphere would be far larger than the critical mass of uranium (which is reached by a sphere of uranium of roughly 15 centimeters).13

Hill’s nine criteria for judging causal hypotheses in epidemiology included

‘plausibility’ – that is, if epidemiological data were plausible on independent theoretical grounds, this was further reason to consider the data as truth-conducive. To continue the smoking example, if we know that cigarette smoke contains certain toxins, and similar toxins are otherwise known to cause adverse health effects in laboratory animals, then the finding that smoking is correlated with adverse health effects is more indicative of a true causal relationship than if there were no

12 Such mechanisms include, but are not limited to, mechanisms of the method (since mechanisms of the method could be assessed under the rubric of quality). 13 The criterion of believability is similar to Quine’s (1951) claim that one can be justified in rejecting a certain observation if that observation strongly conflicts with one’s background theories, while in the absence of such theories the same observation might be more plausible.

independent grounds for thinking that the toxins in cigarette smoke could possibly cause adverse health effects.

I have discussed the conditions of success which scientists appeal to when assessing evidence. When general features of methods are assessed, the question asked is “Would any evidence from this method be truth-conducive?” When general features of evidence are assessed, the question asked is “Is the evidence which was actually produced truth-conducive?” Table 1 below is a list of the evidential features discussed above. I have given each an abbreviation for quick reference in the illustrations that follow.

Table 1. Features of Evidence.

General features of a method

(Q) Quality

(R) Relevance

(T) Transparency

General features of evidence

(P) Patterns

(C) Concordance

(B) Believability

This account of evidence – in terms of conditions of success, rather than signs of success – has the advantage of reflecting more closely the assessment of evidence by

scientists. The following two cases will illustrate the ubiquitous practice of assessing evidence in terms of conditions of success rather than signs of success. When generating, assessing, and criticizing evidence, scientists do not ask “Is p(H|e) > p(H)?” or “Is p(e) > 0.5?”, but rather “What is the (T), (R), and (Q) of this method, and does the evidence have features (P), (C), and (B)?” But these features of evidence are also obviously normative – as I have tried to briefly indicate in the last two sections, philosophers and scientists argue for (and occasionally challenge) the importance of each of these features as conditions for truth-conduciveness.

2.5 Illustration: Material Basis of Heredity

Determining the material basis of heredity occupied the interest of some scientists in the 1930s through 1950s. I will use the assessment of evidence presented in the famous 1944 paper by Avery, Macleod, and McCarty (hereafter AMM1944) as an illustration of the appeal to what I have identified as conditions of evidence.14

First, some history. In 1928 Griffith published his work on the ‘transformation’ of pneumococcal types. He had injected heat-killed, virulent, “smooth” pneumococci and live, non-virulent, “rough” pneumococci into mice. The mice died, and from their blood Griffith isolated live, virulent, “smooth” pneumococci. The live bacteria had changed virulence and morphology (from non-virulent to virulent, and from rough to smooth). Transformation of pneumococcal types was quickly replicated at the Koch

Institute in Berlin (Neufeld and Levinthal 1928). Avery’s own anecdotal assessment of

Griffith’s results was critical based on (Q): “For many months, Avery refused to

14 I examine this episode in richer historical detail in Stegenga (forthcoming).

accept the validity of this claim [transformation] and was inclined to regard the finding as due to inadequate experimental controls.”15 The phenomenon of transformation was surprising, and if it was real it was possibly a kind of hereditary phenomenon.

Soon Avery’s colleagues at the Rockefeller Institute had replicated Griffith’s results, and had isolated the substance responsible for transformation (e.g. Dawson

1928). Alloway (1933) provided an early clue to the chemical identity of the

“transforming substance” (TS): when he added the TS to alcohol, “a thick syrupy precipitate formed.” Commenting on this, Avery said that “the transforming agent could hardly be carbohydrate, did not match very well with protein,” and so Avery is reported to have “wistfully suggested that it might be a nucleic acid” (Hotchkiss

1965). However, it was assumed by most that the TS was a protein – proteins were known to be highly variable, whereas nucleic acids were thought to be a repetitive structural molecule, like collagen. This was partly the legacy of one of Avery’s colleagues, Levene, who had proposed the tetranucleotide hypothesis for the structure of nucleotides (Levene 1921). The structure of TS was assumed to be complex, because the phenotypic features transferred between pneumococcal types were complex: a complex function, it was thought, must be caused by a complex structure.

By late 1940 Avery and MacLeod were attempting to improve the isolation and preservation of the TS, and in 1941 they had begun experiments to determine its chemical identity.16 In February 1944 their paper was published providing evidence

15 Quoted in Dubos (1956) 16 See e.g. MacLeod and Avery, 22 October 1940. “Laboratory Notes”; MacLeod and Avery, 28 January 1941. “Laboratory Notes”

that the TS was DNA. This evidence was ‘multimodal’ – that is, the paper reported evidence from multiple methods. A reconstruction of the hypothesis and evidence of

AMM1944 is as follows:

Hypothesis H1: the molecule which causes transformation (the TS) is DNA.

Evidence e:

method 1: chemical analysis of TS

e1: the amounts of carbon, hydrogen, nitrogen, and phosphorous were

close to the theoretical values for DNA

method 2: effect of trypsin, chymotrypsin, and ribonuclease – protein and

ribonucleic acid degrading enzymes – on activity of TS

e2: protein and ribonucleic acid degrading enzymes had no effect on TS

method 3: effect of DNA-degrading enzyme on activity of TS

e3: DNA-degrading enzyme inactivated the TS

method 4: ultraviolet absorption of TS

e4: ultraviolet absorption of TS was characteristic of DNA

method 5: electrophoretic movement of TS

e5: electrophoretic movement of TS was characteristic of DNA

method 6: molecular weight analysis of TS

e6: molecular weight of TS was characteristic of DNA


The final sentence of the discussion in AMM1944 reads: “If the results of the present study on the chemical nature of the transforming principle are confirmed, then nucleic acids must be regarded as possessing biological specificity the chemical basis of which is as yet undetermined.” In a letter to his brother (1943) Avery asked “Who could have guessed it?” That the evidence supported H1 was surprising.

Assessments of the evidence presented in AMM1944 were based on both the general features of methods – (Q), (R), and (T) – and the general features of evidence

– (P), (C), and (B). The main experimental concern was that the TS was likely impure, and could have had trace amounts of protein in it which caused the transformation.

The chemical tests available at the time were not sensitive enough to detect the presence of up to 5% protein, and the enzymatic experiments could conceivably have been ineffective in degrading an active protein, especially if it was covered by structural nucleic acids. This criticism, directed at (Q), was voiced by Mirsky, one of

Avery’s colleagues, “frequently in personal conversations”17 and later in print: “…it is not yet known which the transforming agent is—a nucleic acid or a nucleoprotein. To claim more, would be going beyond the experimental evidence” (Mirsky and Pollister

1946). But AMM1944 did not claim more: the final paragraph of AMM1944 itself suggested that (Q) might be problematic: “It is, of course, possible that the biological activity of the substance described is not an inherent property of the nucleic acid but is due to minute amounts of some other substance.” The last three sentences begin with

“If…,” “Assuming…,” and, again, “If.” This cautious rhetoric suggests that, publicly

17 From McCarty (1985).

at least, Avery, MacLeod, and McCarty were not “going beyond the experimental evidence.”

Although the hypothesis tested in AMM1944 was specific to the chemical identity of the TS, some considered the phenomenon of transformation as an exemplar of heredity more generally, and thus some thought that the chemical identity of the TS could be the chemical identity of hereditary phenomena more generally (that is, the gene). Avery himself considered this possibility. If this were the case, then the evidence in AMM1944 could be taken to support hypothesis H2:

Hypothesis H2: the class of molecules responsible for heredity is DNA.

Against this, critics noted that transformation had only been demonstrated in bacteria, and it was not clear that bacteria had genes comparable to eukaryotic (non-bacterial) organisms. Even if the TS was DNA, such criticism went, this would not mean that H2 was true. Thus (R) was an important factor in assessing the evidence in AMM1944 with respect to H2: the chemical identity of TS was thought irrelevant to H2, since there was little reason to believe a necessary assumption (bacterial genetics) to relate e to H2.

Transformation was a hot topic at the 1946 Cold Spring Harbor Symposium, but the interpretations of transformation by some of those involved were non- committal. One author wrote that the TS is “difficult if not impossible to distinguish from viruses” (Anderson 1946). Hershey, when referring to AMM1944, defined transformation as “transmission of genetic material,” without mentioning the molecule

responsible as either DNA or protein (1946). Another referred to the TS as a

“nucleoprotein” (Spiegelman 1946). Although the term “nucleoprotein” seemed to be the most appropriate for the transforming factor, given the concerns regarding (Q), in the two years following publication of AMM1944 Avery’s group had become more confident in their identification of the TS: “accumulated evidence … has established beyond reasonable doubt that the active substance responsible for transformation is a specific nucleic acid of the desoxyribose type” (McCarty et al. 1946). Such confidence came from an appeal to (C).

The sheer plausibility of the evidence was also considered. For decades it had been assumed that proteins were the hereditary molecule (genes), given their complex structure, and DNA was thought to be a repetitive molecule supporting the transmission of genetic protein. DNA was considered too regular a molecule, with no informational content to be able to provide genetic changes, whereas proteins were known to be diverse in structure and function. It was thought that any phenomena that resembled heredity must be due to complex molecules like protein. That is, it was thought that H2 is false and instead H2’ was thought to be much more plausible:

Hypothesis H2’: the class of molecules responsible for heredity is protein.

Commenting on AMM1944 in retrospect, Stanley (1970) wrote “Perhaps of major importance was the fact that the discovery was quite contrary to the dominant thinking of many years.” In other words, (B) was a relevant consideration when assessing the evidence presented in AMM1944, since H2’ was assumed true and H2 assumed false.


The variety of methods used by Avery and his colleagues was meant to allow a favorable assessment based on (C). However, at the time of their publication there was no other evidence for H1 with which the evidence in AMM1944 could be concordant.

But in 1952 Hershey and Chase (hereafter HC1952) provided evidence concordant with the evidence in AMM1944 using completely different methods. They labeled bacteriophages (viruses of bacteria) with S35 (which labeled only protein) and P32

(which labeled only DNA), and found that when the bacteriophage infected bacteria,

P32 entered the bacteria while most of the S35 remained outside the cell. Given that viruses replicate inside the cells of hosts, and apparently only the DNA of viruses entered the host cells, HC1952 provided independent evidence for H2. However, the primary methodological criticism that was directed at AMM1944 could have been directed at HC1952: the potential for protein contamination in the portion of the virus that entered the cell in Hershey and Chase’s experiments was as great as the potential for protein contamination in Avery’s TS. Such criticisms against HC1952 were not as pronounced as they were against AMM1944 – the evidence in HC1952 was rapidly accepted, and Hershey went on to win a Nobel Prize (notably, the Nobel Committee did not award the prize to Martha Chase, Hershey’s co-author). At least one way to understand this is that given AMM1944, scientists could then assess HC1952 favorably by (C). And conversely, once the evidence in HC1952 was available, the evidence in

AMM1944 could also be reconsidered on the grounds of (C).

This point can be made more generally: in the years following the publication of AMM1944, its context of evidential assessment shifted; (P), (C), and (B) with respect to the evidence in AMM1944 became more favorable, and consequently the

evidential assessments of AMM1944 changed, often by those same people who earlier were critics. Mirsky’s criticism of the evidence in AMM1944 was based on (Q); his concern was that protein had contaminated the TS. But results from experiments with

DNase (DNA-degrading enzyme) after AMM1944 further strengthened H1 (e.g.

McCarty 1945). The transformation of bacillus by Boivin provided further confirmation of H1 using a different organism (1947). Transformation was shown on multiple genetic markers (e.g. Hotchkiss 1951). Genetic recombination in bacteria was demonstrated in 1946 by Lederberg and Tatum – thereby proving that bacteria had genes, an obvious condition to consider the evidence in AMM1944 as a general hereditary phenomenon. Chargaff challenged Levene’s tetranucleotide hypothesis by showing phylogenetic differences in base composition and demonstrating A:T and

C:G ratios, making it at least conceivable that DNA could have the complexity necessary for the molecule causally responsible for heredity (1950, 1951). After these developments the evidence in AMM1944 could be assessed in light of other evidence generated with a variety of methods, showing consistent patterns of results, and based on new considerations of believability (bacterial genetics, DNA composition): assessment of the AMM1944 evidence in terms of conditions of success (P), (C) and

(B) had changed. Consequently, the overall assessment of H2 itself changed. Mirsky himself, once the strongest critic of AMM1944, exemplified this change: “that intact nucleic acids have a high degree of specificity in biological systems is evident both from the role of DNA in bacterial transformation (Avery et al. 1944) …” (Mirsky et al. 1956). In an even more striking change of retrospective assessment, Mirsky (1968) wrote “25 years ago [that is, in 1943], [DNA] was conclusively shown to be the

genetic material.” It is foremost the evidence in AMM1944 that Mirsky re-evaluated; because of this re-evaluation of e in light of re-evaluations of (P), (C), and (B), Mirsky came to accept not only H1, but more importantly, he claimed that H2 had been

“conclusively shown”. Conclusively, perhaps, but only in hindsight and a context in which (P), (C), and (B) had changed.

One of the strongest retrospective supporters of AMM1944, Joshua Lederberg, also changed his assessment of AMM1944 after (P), (C), and (B) had changed.

Lederberg used cautious rhetoric when discussing Avery’s work in the mid-1950s; he claimed that the TS is only “intimately connected with the stuff of heredity” (1956) – intimately, perhaps structurally, but not necessarily causally or functionally connected.

Until transformation studies were “broadened about 1951 with experiments on drug resistance and other markers, a variety of opinions were forwarded” regarding the TS.

In this and another genetics review published around the same time, Lederberg warned the reader to take note of the valid criticisms, by Mirsky and others, against over-interpreting transformation studies. But in Lederberg’s Nobel speech (1958) he claimed that “by 1943, Avery and his colleagues had shown that this inherited trait was transmitted from one pneumococcal strain to another by DNA. The general transmission of other traits by the same mechanism can only mean that DNA comprises the genes.” Thus by 1958, in the prestigious forum of the Nobel speech,

Lederberg was retrospectively claiming that the evidence in AMM1944 supported both

H1 and H2.


In sum, when assessing the evidence presented in AMM1944, in diverse ways scientists appealed to general features of evidence (P), (C), and (B), and general features of methods (Q) and (R).

2.6 Illustration: Water Memory

That the conditions of success are appealed to when assessing evidence is apparent when considering cases of extreme criticism. Evidence may be criticized on the grounds of (P), (C), and (B), and when evidence suggests something truly surprising, the evidence can be (and often is) criticized on the grounds of (Q) or (R).

The following example is illustrative.

When human basophils (white blood cells involved in immune defense) are exposed to a certain antibody (anti-IgE antibodies), they become “degranulated” (they look differently under a microscope). In 1988 Nature published a now infamous paper from a research group led by the well-known French immunologist Jacques

Benveniste, demonstrating that such degranulation occurs after anti-IgE is diluted by a factor of 10^120 in water (Davenas et al. 1988). At this dilution no antibody remains in the solution. This was taken to support the following hypothesis:

Hypothesis Hw: water can retain a ‘memory’ of molecules dissolved to near infinite dilution.

Evidence:

method: degranulation of basophils by solutions of high dilution anti-IgE


e: degranulation occurs by anti-IgE diluted up to a factor of 10^120 in water

Hw, if true, would provide theoretical support to homeopathy, since it is a common practice in homeopathy to treat patients with extreme dilutions of substances, under the assumption that the solvent retains a memory of the dissolved substance, which can stimulate one’s immune system. The paper was accompanied by an editorial written by John Maddox, then editor of Nature, titled “When to believe the unbelievable”

(1988), in which Maddox made the following remarks that relate to (P), (C), and (B) respectively:

(P) “there is a surprising rhythmic fluctuation in the activity of the solution”

(C) “there is no evidence of any other kind to suggest that such behaviour may be within the bounds of possibility … when told … that the experiments should be repeated at an independent laboratory, he [Benveniste] arranged for this to be done”

(B) “there is no physical basis for such an activity”; the findings “strike at the roots of two centuries of observation and rationalization of physical phenomenon”

Prior to publishing the paper, Maddox had requested independent replications of the results from Benveniste, and Benveniste had complied: other laboratories from around the world had confirmed his results. Thus a skeptical, but not negative, editorial assessment focused on (P), (C), and (B).


Nature’s readers were critical of the methodology: for them (Q) was an important factor. The authors of one letter to Nature wrote that “an important control experiment has been overlooked […] one might wonder to what extent this observation can be accounted for by contaminating” (Lasters and Bardiaux 1988). Another reiterated this concern: “I am puzzled by the fact that there has been no control of impurities”

(Danchin 1988). The variability of the data itself, that is (P), was assessed: “one obvious flaw can be seen when looking at the standard errors …” (Fierz 1988). One resorted to ridicule: “the paper demonstrating that dilutions of anti-IgE must be vortexed rather than stirred in order to retain an imprint of the antibody on the solvent elucidates another long-standing question: how James Bond could distinguish Martinis that had been shaken or stirred” (Nisonoff 1988).

Subsequent research has been mixed. Some attempts at replicating similar protocols to the Benveniste lab have succeeded and others have failed. In 2005 a paper showed that “liquid water essentially loses the memory of persistent correlations in its structure” within 50 femtoseconds (50 millionths of a nanosecond) (Cowan et al.).

Thus subsequent research has allowed critiques of e to appeal to (C) and (B).

The paper had been published under the agreement that a team put together by

Nature could visit the lab. The three-person team included a science journalist and a famous magician (both known as debunkers of pseudoscientific claims), in addition to

Maddox himself. There was no professional biologist or immunologist in their group


(Maddox was a physicist). Their report (1988) criticized the evidence from

Benveniste’s lab almost entirely based on (Q):18

(Q) “our investigation concentrated exclusively on the experimental system”

(Q) “the extensive series of experiments … are statistically ill-controlled, from which no substantial effort has been made to exclude systematic error, including observer bias”

(Q) “the design of the experiments … is inadequate”

(Q) “the experimental data have been uncritically assessed”

Maddox noted that in the lab’s original protocol the experimenters knew which test tubes contained antibody and which test tubes were the controls containing no antibody, and so the team speculated that e could be explained by observer bias

(though this itself would need explanation, since the hypothesis that the experimenters influenced basophil degranulation is just as fanciful as Hw). The report concluded “that there is no substantial basis for the claim that anti-IgE at high dilution (by factors as great as 10^120) retains its biological effectiveness.” That is, e is false and so Hw is not justified.

Anticipating methodological criticism, in the original paper the authors affirmed that their evidence was “established under stringent experimental conditions, such as blind double-coded procedures involving six laboratories from four countries.”

18 Fraud would be an extreme type of criticism based on (Q). The Nature team was less than subtle in such a suggestion: “we were dismayed to learn that the salaries of two of Dr Benveniste’s coauthors of the published article are paid for under a contract between INSERM 200 and the French company Boiron et Cie., a supplier of pharmaceuticals and homeopathic medicines, as were our hotel bills.”


Benveniste’s subsequent defense against the charges of the Nature team was also based on (Q) (1988) – he claimed that the Nature visit was “a mockery of scientific inquiry” and that “the judgment is based on one dilution tested on two bloods in awful technical and psychological conditions.” As discussed above, (Q) is not straightforward to assess. In a subsequent interview Benveniste complained about the stressful conditions of the visit by the Nature team, and he claimed that his original experiments were not replicated properly, given the lack of collegiality during the

Nature visit.

In sum, the evidence presented in the infamous paper by Benveniste and his colleagues was assessed by features of evidence (P), (C), and (B), and by features of methods (Q) and (R).

2.7 Conclusion

The signs of success tradition of evidence describes what one achieves once one has good evidence, whereas the conditions of success tradition of evidence describes the normative strategies scientists use to generate compelling evidence.19

The conditions under which something is considered good evidence have been often discussed by both philosophers and scientists, but this tradition of theorizing about evidence has often lacked the aim of generality of scope that the signs of success

19 The contrast between the two traditions of accounts of evidence that I have called signs of success tradition and the conditions of success is analogous to Musgrave’s contrast between what he calls logical accounts of confirmation and historical accounts of confirmation (1974), and is also similar to Mayo’s contrast between what she calls evidential-relationship accounts of inference and testing accounts of inference (1996).

tradition has had. In epidemiology, for instance, some claim that evidence from randomized controlled trials (RCTs) is superior to evidence from case-control studies, while others dispute this claim. Some suggest that computer simulations can provide evidence for a hypothesis, while others dispute this. Many have claimed that concordant multimodal evidence is epistemically valuable. And so on. In this chapter I have attempted to gather such considerations under the umbrella I am calling

“conditions of success”. Although the set of features I have identified is likely incomplete, these features are primary considerations when assessing evidence. I have highlighted both the plurality of the conditions of success and the complexity of assessing each of the individual conditions.

General features of methods include quality, relevance, and transparency, and general features of evidence include concordance, patterns, and believability. This account of evidence is meant to be both general and couched in terms of conditions of success, unlike most previous philosophical accounts, which are either highly particular or couched in terms of signs of success.

This characterization of evidence describes some of the principal aspects of evidence which matter to scientists. I have emphasized the normative aspects of the conditions of success; that is, I have discussed why these features are conditions of success. The two traditions of evidence compared here are not necessarily at odds with each other, since they describe different aspects of evidence: the signs of success tradition describes what is achieved once one has good evidence, whereas the conditions of success tradition describes methods of generating good evidence.

CHAPTER 3: UNDERDETERMINATION OF EVIDENCE BY THEORY

Abstract

The plurality of features of methods and of evidence can be assessed and prioritized in multiple ways – there is ‘no-best-ordering’ of the features of evidence which determine the epistemic import of that evidence. It follows that for any piece of evidence multiple rational determinations of the epistemic import of that evidence are possible. I explicate this argument in terms of (and ultimately direct my argument against) Bayesian accounts of evidence.


3.1 Introduction

It is now a platitude, albeit a somewhat controversial one, that evidence underdetermines theories or hypotheses. It is also now a platitude, again somewhat controversial, that evidence is theory-laden. Another platitude is that there is a plurality of theoretical virtues, such as simplicity, fruitfulness, and explanatory power, that can be appealed to when choosing between theories. But let us put those platitudes aside for a moment, or, if you would rather, let our evidence be laden with the same theories, fix our background hypotheses and our views about the relative importance of theoretical virtues. Still, after all this, there could be rational disagreement about the epistemic import of a given piece of evidence. This is because there is a plurality of features of evidence that must be considered when assessing the degree of support that evidence provides to a hypothesis – these are the features discussed in Chapter 2 – and these evidential features can be variably assessed and ordered.

Disagreement about the support that a piece of evidence provides to a hypothesis has been frequently documented, both in the study of the history of science and in the psychological study of human reasoning. The argument presented here intends to show that such disagreements can be ‘rational’ because multiple assessments of the same evidence can differ based on different orderings of the features relevant to the assessment of evidence.

Regardless of whether or not one is a card-carrying Bayesian, it can sometimes increase the precision of one’s argument to think like one. Moreover, many contemporary accounts of evidence and inference are based on Bayes’ theorem. For these reasons I explicate my argument in terms of (and ultimately direct my argument

against) Bayesian accounts of evidence. For a Bayesian, the impact that evidence has on a hypothesis is represented by two probabilities: the probability of the evidence conditional on the hypothesis, that is, the ‘likelihood’: p(e|H), and the ‘prior’ probability of the evidence: p(e). I argue that determining these probabilities in a precise and intersubjective way is, in many real empirical situations, impossible. A common criticism of Bayesianism is that it relies on subjective determinations of the prior probability of the hypothesis, p(H). The argument below extends this charge of excessive subjectivity to the assessment of the evidence itself. Another commonly stated (but rarely argued for) assumption is that the likelihood and the prior probability of the evidence are easier to assess than the probability of the hypothesis conditional on the evidence (that is, the ‘posterior’ probability of the hypothesis, p(H|e)). The argument below is meant to show that this assumption is wrong: none of these probabilities are generally any easier to assess than others. Similar claims have been urged by others (noted in §3.2), but rarely with any detailed argument. This chapter aims to fill that lacuna.

For the main argument I need three premises. First, the features of evidence I discuss in Chapter 2 are important determinants of the likelihood and the prior probability of the evidence; I argue for this in §3.3. Second, some of the evidential features might indicate that a given piece of evidence provides strong support to a hypothesis while other evidential features might indicate that the evidence provides weak support to the hypothesis or even support to a competitor hypothesis, and yet there is no single scheme for ordering or weighing the relative importance of these evidential features. I give such a no-best-ordering argument in §3.4. Third, each of the

evidential features is itself an amalgam of a complex set of considerations. The various aspects of a specific empirical situation which can influence an assessment of one of these evidential features are numerous, often difficult to identify and articulate, and emphasized by different scientists to varying degrees; this was suggested in

Chapter 2. With these premises I argue for what I am calling the ‘underdetermination of evidence by theory’: assessing the epistemic import of a piece of evidence in a precise and intersubjective way is, in many empirical situations, impossible.

3.2 Measuring Evidential Support

There has long been hope among some philosophers of science for a general quantitative measure of the support that evidence provides to a hypothesis. Most attempts to provide such a measure today are (broadly) Bayesian, and this sub-discipline is sometimes referred to as ‘confirmation theory’. Learning from new evidence is thought by Bayesians to be represented (normatively, at least) by updating one’s belief in a hypothesis using Bayes’ theorem. One’s belief in a hypothesis after receiving new evidence, pnew(H), ought to equal one’s original belief in the hypothesis conditional on the evidence, pold(H|e); this is sometimes called ‘strict conditionalization’ (SC):

(SC) pnew(H) = pold(H|e)

Since pold(H|e) is a conditional probability, we can apply Bayes’ theorem (BT) as follows:

(BT) pold(H|e) = p(e|H)pold(H)/p(e)

Combining (BT) with (SC) we get:


(SC’) pnew(H) = p(e|H)pold(H)/p(e)

The prior probability of the hypothesis, pold(H) represents one’s belief in the hypothesis prior to the new evidence. By SC’, upon learning new evidence e, pold(H) is

‘updated’ by the ratio of the likelihood of the evidence over the prior probability of the evidence: p(e|H)/p(e).
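Purely for concreteness, here is a minimal sketch of an update by SC', with invented numbers; nothing in it bears on whether such inputs can actually be determined intersubjectively, which is the question pressed later in this chapter:

```python
def update_by_sc_prime(prior_h, likelihood, prior_e):
    """(SC'): p_new(H) = p(e|H) * p_old(H) / p(e)."""
    return likelihood * prior_h / prior_e

# Invented inputs; the later argument of this chapter is precisely that the
# likelihood p(e|H) and the prior p(e) are often not intersubjectively
# determinable in real empirical situations.
p_old_H = 0.2       # p_old(H)
p_e_given_H = 0.9   # the likelihood, p(e|H)
p_e = 0.3           # the prior probability of the evidence, p(e)

print(update_by_sc_prime(p_old_H, p_e_given_H, p_e))   # p_new(H) = 0.6
```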

Bayesians distinguish between the final amount of confirmation that some evidence provides to a hypothesis and the change in confirmation that some evidence provides to a hypothesis. The former is simply measured by the ‘posterior’ probability of the hypothesis (the probability of the hypothesis after learning new evidence): pnew(H). The latter can be represented in a number of possible ways; recent philosophical literature has included defenses of the following confirmation measures:20

(i) the difference measure: cd(H, e) = p(H|e) - p(H)

(ii) the ratio measure: cr(H, e) = p(H|e)/p(H)

(iii) the likelihood-ratio measure: cl(H, e) = p(e|H)/p(e|~H)

(iv) the log ratio measure: clog-r(H, e) = log[p(H|e)/p(H)]

(v) the log likelihood measure: clog-l(H, e) = log[p(e|H)/p(e|~H)]

The likelihood, p(e|H), is a critical quantity for assessing the support that evidence provides to a hypothesis in some empirical situations. Typically the posterior probability is not known prior to observing the evidence, since (according to

Bayesians) the whole point of gathering data is to update one’s prior probability in the hypothesis by SC’. As Sober (2009) puts it, “For Bayesians, observational evidence can modify one’s degrees of belief in various hypotheses only by way of likelihoods.”

20 On these measures, see e.g. Fitelson (1999).


If one wanted to measure the absolute degree of confirmation that evidence provides to a hypothesis – that is, if one wanted to know p(H|e) – then one would need to know the likelihood, the prior probability of the evidence, and the prior probability of the hypothesis (that is, all the quantities in BT). If one wanted to know the change in confirmation that some piece of evidence provided to a hypothesis, then one would also need to know the likelihood, since of the plausible confirmation measures listed above, (i), (ii), and (iv) require knowing p(e|H) (because the posterior, p(H|e), is a term in the measures and, assuming one does not know the posterior without knowing the impact of the evidence on the prior, according to BT the likelihood is required), and (iii) and (v) require knowing p(e|H) since the term appears directly in the measures. For those classes of evidence-hypothesis relations in which a likelihood can be determined, a necessary condition for determining the support that the evidence provides to the hypothesis (either absolute or increase in support) is met. Other necessary conditions must be met for each of the confirmation measures; for example, to determine the support that the evidence provides to the hypothesis using (i), (ii), and

(iv), we would need to know p(H), a notorious headache. To determine the support that the evidence provides to the hypothesis using (iii) and (v), we would need to know p(e|~H). In any case, for Bayesians, evidence modifies assessment of hypotheses via likelihoods.
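The reliance of every measure on a likelihood can be displayed directly. The sketch below, my own illustration with invented probabilities, computes measures (i) through (v) from p(H), p(e|H), and p(e|~H), obtaining p(e) by total probability and p(H|e) by Bayes' theorem:

```python
from math import log

def confirmation_measures(p_H, p_e_given_H, p_e_given_notH):
    """Measures (i)-(v); note that every one of them consumes a likelihood."""
    p_e = p_e_given_H * p_H + p_e_given_notH * (1 - p_H)   # total probability
    p_H_given_e = p_e_given_H * p_H / p_e                   # Bayes' theorem
    return {
        "(i) difference":         p_H_given_e - p_H,
        "(ii) ratio":             p_H_given_e / p_H,
        "(iii) likelihood-ratio": p_e_given_H / p_e_given_notH,
        "(iv) log ratio":         log(p_H_given_e / p_H),
        "(v) log likelihood":     log(p_e_given_H / p_e_given_notH),
    }

# Invented probabilities, for illustration only:
for name, value in confirmation_measures(p_H=0.2, p_e_given_H=0.9, p_e_given_notH=0.3).items():
    print(name, round(value, 3))
```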

It will help to specify those situations in which one can justify a determinate and intersubjective likelihood. There are three classes of evidence-hypothesis relations such that justifying a precise measure of support that the evidence provides to the hypothesis is possible.


(A) When evidence e is deductively entailed by hypothesis H.

If this condition is satisfied, then p(e|H) can be trivially determined to be 1. If we assume H to be true, then p(H) = 1, and if H entails e, then the probability of e conditional on H is also 1.

(B) When the negation of e (that is, ~e) is deductively entailed by H.

If this condition is satisfied, then p(e|H) can be trivially determined to be 0 in the same manner as (A). If we assume H to be true, then p(H) = 1, and if H entails ~e, then the probability of e conditional on H is 0.

(C) When H specifies a specific chance set-up with objective probability parameters (as in classic examples such as coin tosses or colored balls in an urn), and e is a particular outcome of this chance set-up.

If this condition is satisfied, then p(e|H) can be determined straightforwardly by the laws of probability. In cases like (A) - (C), determinate likelihoods can be justified.

However, there is a class of evidence-hypothesis relations such that determining the measure of support that a piece of evidence provides to a hypothesis in an intersubjective way is impossible. This class of evidence-hypothesis relations is large, at least in the empirical sciences. As argued for above, all plausible measures of confirmation depend on knowing the likelihood, p(e|H). However, given that there are a variety of features of evidence that must be weighed and variably prioritized when assessing evidence, and given that there are numerous equally rational yet contradictory ways to weigh and prioritize these features of evidence, the probability of the evidence under the assumption that a particular hypothesis is true – that is, the likelihood – is, at least in many cases in the empirical sciences, indeterminate.


Reversing the Duhem-Quine locution, I call this the underdetermination of evidence by theory (UET). Another way to word the conclusion of UET is: the probability of the evidence given the truth of some hypothesis is underdetermined. In punchier terms: the probability of the evidence is underdetermined by the hypothesis (or theory) under consideration. In still punchier terms: evidence is underdetermined by theory.

This argument is directed against those who have claimed that determining the likelihood is a straightforward task. Sometimes the endorsement of determinate likelihoods is implicit, as when philosophers and statisticians give examples of chance set-ups based on urns or coins, and proceed to calculate what p(e|H) is in these scenarios. These examples are taken from classes of evidence-hypothesis relations in which determining the probability of the evidence given the hypothesis is possible; in particular, cases like class (C) above are often described. Sober (2007) gives another implicit endorsement of determinate likelihoods when he argues that the likelihood ratio is the superior measure of support – thereby assuming that the likelihoods are easier to know than the priors. Fitelson (2007) explicitly argues for the superiority of absolute measures of confirmation over relative measures of confirmation – thereby assuming that likelihoods are determinate, since likelihoods are required for all the absolute measures of confirmation which Fitelson considers. But sometimes the endorsement of determinate likelihoods is explicit. Witness Gillies (1991): he writes that the likelihood “can usually be calculated in a quite unproblematic manner” and that “these probabilities are not a matter of taste, and do not vary in an arbitrary manner from scientist to scientist. On the contrary there is usually a high degree of consensus in the scientific community as to whether a given theory T is well-confirmed or badly-confirmed by the available evidence” (525).21 And witness

Hawthorne (forthcoming):

the likelihoods that relate hypotheses to evidence in scientific contexts will often have widely recognized objective or intersubjectively agreed values. For, likelihoods represent the empirical content of hypotheses – what hypotheses say about the observationally accessible parts of the world. So the empirical objectivity of a science relies on a high degree of agreement among scientists on their values.

Hawthorne’s claim regarding widespread agreement on the values of likelihoods does not follow from the latter two sentences, which are presumably meant as an argument for the claim. The latter two sentences could be true (many would argue that they are) but the first sentence regarding likelihoods could still be false. Indeed, I argue here that it is. I suspect the view that is expressed in the above proclamations is widespread.

However, I am not, of course, the first to note that likelihoods are difficult to determine. And of course others have noted the multifaceted complexity of evidential assessment. As far as I am aware, though, the former has only been stated as a self-evident truism by some philosophers, and the latter has not been used in an argument to support the former. I am not aware of any argument that has been mustered for the former, let alone arguments that have justified the former based on the latter.

For instance, in John Earman’s critical examination of a Bayesian account of the Duhem-Quine problem (1992), he writes:

21 Gillies also writes that p(e) is “something on which most scientists could agree irrespective of the research programme on which they happen to be working” (530). But by the principle of total probability, for possible hypotheses {H1, H2, …, Hn}, p(e) = p(e|H1)p(H1) + p(e|H2)p(H2) + … + p(e|Hn)p(Hn), which entails that p(e) is necessarily more difficult to determine than p(e|Hi) for some i, since knowing p(e|Hi) for all i requires knowing more than knowing p(e|Hi) for some i. In short, for Gillies to be correct about the ease of intersubjectively determining p(e), p(e|Hi) must be intersubjectively determinable for every i.


I will note in advance that while much of the attention on the Bayesian version of the problem has focused on the assignments of prior probabilities, the assignments of likelihoods involves equally daunting difficulties. (p. 84).

But no argument is provided for this worry regarding likelihoods. Earman roughly agrees with the conclusion of UET, but does not give his own argument for it. His next mention of the difficulty with determining likelihoods is more detailed:

There are, of course, cases where the likelihoods do have an objective status. The HD [hypothetico-deductive] case, where for each i either Hi & K ⊨ e or Hi & K ⊨ ~e, is one such. Another obtains when all the Bayesian agents agree on a statistical model for a chance experiment, e reports outcomes of the experiment, the Hi are alternative hypotheses about the objective-probability parameters of the chance setup, and Lewis’s principal-principle applies. But these cases hardly exhaust the domain that would have to be covered by an adequate theory of scientific inference. Consider, for example, as astronomers of the seventeenth century were forced to, what probability should be assigned to stellar parallax of various magnitudes on the assumptions of a Copernican cosmology and the then accepted background knowledge. (p. 140-1).22

The cases in which Earman claims that likelihoods have an ‘objective status’ are my cases (A) - (C) above: when the likelihood is deducible as certain or impossible, it is justifiably determinable, as it is in those cases when all relevant scientists agree on the relevant parameters, but in all other cases the likelihood is not determinable. Earman’s example of seventeenth century astronomy is a case in which quantifying an intersubjectively determinate likelihood is impossible. Beyond this case, however, he provides no further argument for the general UET argument.

22 I have modified Earman's notation slightly to match mine.

Marcel Weber has also suggested that the likelihood merely "expresses the confidence that an individual scientist has that a hypothesis accounts for some evidence" (2005 p. 112). He describes the oxidative phosphorylation (ox-phos) controversy in biochemistry, in which there were two competing hypotheses to explain how cells can utilize energy generated by cellular respiration: the chemiosmotic theory (proposed by Peter Mitchell in 1961, who won the Nobel Prize for this theory in 1978) and the chemical theory. Weber argues that

For a supporter of Mitchell’s theory, the evidence from the mode of action of uncouplers indeed seemed to be strong evidence, since a central prediction of Mitchell’s theory was confirmed. However, for the opponents of chemiosmotic theory, this evidence was not considered to be very strong, since explanations for this effect within the framework of the chemical theory could be given. Thus how well a hypothesis accounts for some evidence seems to be a matter of belief. (p. 112)23

Again, this passage states the conclusion of UET, but besides the mention of two biochemical theories, does not give an argument for the conclusion.

In a recent intellectual biography of William James, Francesca Bordogna describes James’s thoughts on evidence (2008). She writes that if one does not believe in ‘crucial experiments’, as James did not, then when multimodal evidence is available, “one was left with a set of details whose degree of evidential value was measured in different ways by different people. James found that a person’s feelings might inform his or her conclusions in that regard” (p. 127).

An early criticism of Bayesianism came from Clark Glymour (1980). One of his worries was precisely the issue that I raise here. Glymour argued that there are cases in which a likelihood is clear:

If one flips a coin three times and it turns up heads twice and tails once, in using this evidence to confirm hypotheses (e.g., of the fairness of the coin) one does not take the probability of two heads and one tail to be what it is after flipping – namely, unity – but what it was before the flipping. In this case there is an immediate and natural counterfactual degree of belief that is used in conditionalizing by Bayes' rule. (p. 87-88)

23 I return to this case in Chapter 5.

This is an instance of the class (C) of evidence-hypothesis relations, discussed above.

However, Glymour went on to claim that such clarity in likelihoods was possible only in rather artificial circumstances: “the trouble with the scientific cases is that no such immediate and natural alternative distribution of degree of belief is available” (p. 88).

Glymour gives as an example the difficulty of determining the degree to which Einstein's derivation of the perihelion advance confirmed general relativity: numerous scientists, including Leverrier, Newcomb, and Doolittle, had calculated the anomaly of the perihelion with different methods and parameter assumptions. "For actual historical cases, unlike the coin-flipping case, there is no single counterfactual degree of belief in the evidence ready to hand, for belief in the evidence sentence may have grown gradually – in some cases it may have even waxed, waned, and waxed again" (p. 88). Glymour does not explain such vicissitudes of evidential assessment. The UET argument presented here is meant to show why Glymour, James, Weber, and Earman are right to claim that intersubjectively determinate likelihoods are often impossible to come by in scientific practice.

3.3 Epistemology of Evidential Features

In this section I show how the evidential features discussed in Chapter 2 influence the likelihood and prior probability of the evidence. More precisely, I argue that favorable assessments of the features of evidence can be represented by a high p(e|H) and a low p(e), and conversely unfavorable assessments of the features of evidence can be represented by a low p(e|H) and a high p(e).

Quality

A well-controlled experiment or observation is meant to minimize the possibility that the evidence generated by the method could have occurred for reasons other than those supposed by the hypothesis. Since the quality of a method involves controlling for systematic errors, quality is meant to ensure that observations generated by the method would be otherwise unlikely were it not for a particular hypothesis being true. Putting this in terms of probabilities, the higher the quality of the method, the higher is p(e|H) and the lower is p(e).24 Sometimes such evidence is called 'novel' or 'surprising': the evidence has a low probability except when we condition on the truth of H, in which case the evidence has a higher probability (perhaps because the hypothesis 'explains' the evidence).

An example from Chapter 2 will help. Avery's group was presenting evidence (e) for the hypothesis (H1) that the transforming substance is composed of DNA. In order to think that the method which generated e was of high quality, one had to assume that the sample containing the transforming substance was not contaminated with protein, because if it was, then it could have been the protein which was responsible for the phenomenon of transformation. This was precisely the criticism that Mirsky raised against AMM1944. Since the available tests could not detect contaminating protein at levels up to 5%, p(e) was relatively high: the evidence could be explained by a number of hypotheses alternative to H1.

24 By the principle of total probability, p(e) = p(e|H1)p(H1) + p(e|H2)p(H2) + … + p(e|Hn)p(Hn). If evidence is highly probable regardless of which hypothesis we conditionalize on, such evidence is uninformative about the hypotheses under consideration. But if evidence is probable conditional on H1 and improbable conditional on other hypotheses, then p(e|H1) > p(e) and so by Bayes' Theorem H1 receives high confirmation.
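The probabilistic representation of quality can be made concrete with a minimal numerical sketch. The probabilities below are invented for illustration (they are not estimates drawn from the Avery case); the sketch simply shows that a high-quality method, represented as one whose evidence is probable under H1 and improbable otherwise, yields a low p(e) and strong confirmation, whereas a low-quality method leaves p(e) high and the evidence nearly undiscriminating.

# A minimal numerical sketch with invented probabilities: how quality is
# represented as a high p(e|H1) together with a low p(e).

def total_probability(likelihoods, priors):
    """p(e) = sum over i of p(e|Hi) * p(Hi), by the principle of total probability."""
    return sum(l * p for l, p in zip(likelihoods, priors))

def posterior(likelihood, prior, p_e):
    """Bayes' Theorem: p(H|e) = p(e|H) * p(H) / p(e)."""
    return likelihood * prior / p_e

priors = [0.5, 0.5]                # p(H1), p(H2): two exhaustive hypotheses

# High-quality method: e is expected under H1 but unlikely under the alternative.
high_quality = [0.9, 0.1]          # p(e|H1), p(e|H2)
p_e = total_probability(high_quality, priors)        # 0.5
print(posterior(high_quality[0], priors[0], p_e))    # 0.9  -> strong confirmation of H1

# Low-quality method (e.g., possible contamination): e is probable under
# several hypotheses, so p(e) is high and e barely confirms H1.
low_quality = [0.9, 0.8]           # p(e|H1), p(e|H2)
p_e = total_probability(low_quality, priors)         # 0.85
print(posterior(low_quality[0], priors[0], p_e))     # ~0.53 -> weak confirmation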

Relevance

There are many competing views about what exactly relevant evidence is, associated with familiar philosophical accounts of scientific inference: hypothetico-deductivism, inductivism, falsificationism. Sidestepping the question of what a good characterization of relevance is, a good way to characterize the degree of relevance is by considering the plausibility of the background assumptions necessary to relate e to H (as discussed in Chapter 2). Take, for instance, an inductivist view of relevant evidence: H is confirmed by e in conjunction with necessary auxiliary assumptions, and e is more relevant to H to the extent that the auxiliary assumptions required for the e-H relation are plausible. The plausibility of such assumptions depends on the degree of abstraction of H. I will continue to illustrate using the Avery example. The hypothesis H1 was specific to the chemical identity of the transforming substance. But since the phenomenon of transformation could be related to that of general heredity, the chemical responsible for transformation could be the chemical responsible for hereditary phenomena more generally: the transforming substance could be the gene.

Thus the evidence e in AMM1944 could be taken to support H2: the class of molecules responsible for heredity is DNA. In Chapter 2 I noted several auxiliary assumptions which are required to think that e is relevant to H2: the phenomenon of transformation had to be a genetic phenomenon (rather than, say, an exotic infectious phenomenon), and bacteria had to have genes comparable to non-bacterial organisms. Thus even if e were relevant to H1 it might not be relevant to H2. Letting kn represent the plausible auxiliary assumptions, these considerations can be put in terms of likelihoods: p(e|H1 & kn) > p(e|H2 & kn).

Transparency

Transparency is defined in terms of the degree to which one can assess quality and relevance. In Chapter 2 I noted that for some methods assessing their quality and relevance is relatively straightforward, whereas for other methods it is not. Importantly for my argument below, many episodes in the history of science suggest that scientists often have differing views about the degree of transparency of a given method.

Patterns

Patterns in nature tend to be surprising, unless we come to understand the pattern with a hypothetical explanation. For instance, without a theory of gravity to explain the ubiquity of falling objects, the patterned evidence for the motion of objects when released would be remarkably surprising, but with a theory of gravity the patterned evidence is to be expected. Evidence of a distinctive pattern in nature is prima facie less probable than evidence of a distinctive pattern which is explained by a hypothesis. To formalize this, more distinctive patterns have a lower p(e) and a higher p(e|H).


Concordance

When evidence from multiple methods (multimodal evidence) confirms a hypothesis, that is said to provide greater support to the hypothesis than does the same volume of evidence from fewer methods. Philosophers have sometimes construed such reasoning in terms of a no-miracles argument: it would be highly unlikely if multiple methods all supported H and H were not true. Alternatively: ceteris paribus, H is more likely to be true the more diverse are the methods from which confirming evidence is generated. Since all of Part II of this dissertation is, in one way or another, focused on this argument, I do not dwell on it here. But for present purposes the no-miracles intuition can be formulated in the following way. Suppose that n methods generate evidence en which confirms H, and m is a proper subset of the n methods. Then:

p(en|H) / p(en) > p(em|H) / p(em)

This is because the likelihoods p(en|H) and p(em|H) are roughly equivalent if conditioning on H renders en as probable as em (which is to be expected if all the evidence from the n methods confirms H), while p(en) is less than p(em) (because of the no-miracles intuition: it is more surprising the greater the diversity of methods providing evidence which confirms H).
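A small numerical sketch can illustrate the inequality; the probabilities below are stipulated for illustration only and are not drawn from any case discussed in this dissertation.

# Stipulated probabilities illustrating the concordance inequality:
# e_m is concordant evidence from a subset of methods, e_n is concordant
# evidence from the full, more diverse set of n methods.

p_H = 0.5

# Conditioning on H renders both bodies of evidence comparably probable ...
p_em_given_H = 0.8
p_en_given_H = 0.8
# ... but agreement across more (and more diverse) methods is more surprising
# if H is false, so e_n is less probable unconditionally than e_m.
p_em_given_notH = 0.4
p_en_given_notH = 0.1

p_em = p_em_given_H * p_H + p_em_given_notH * (1 - p_H)   # 0.6
p_en = p_en_given_H * p_H + p_en_given_notH * (1 - p_H)   # 0.45

print(p_em_given_H / p_em)   # ~1.33
print(p_en_given_H / p_en)   # ~1.78 -> the wider suite of methods confirms H more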

Believability

Evidence must be believable to be compelling. There is a prima facie contrast with the intuitive desideratum that evidence be novel or surprising, relied on in my discussion of patterns, concordance, and quality. One might think that believability and surprise are directly traded off against each other: the more believable some piece of evidence is, the less surprising it is, and vice versa. But the two desiderata are not directly at odds. Believability can be represented by p(e): the probability of evidence e. By the principle of total probability, p(e) = p(e|H1)p(H1) + p(e|H2)p(H2) + … + p(e|Hn)p(Hn). The more believable e is, the higher p(e) is. Surprising evidence is only a good thing if we can make sense of it based on background knowledge, a particular hypothesis of interest, and knowledge that the method that generated the surprising evidence was high quality (and thus the surprising evidence is not an artifact). We do not want evidence which is so surprising as to be totally unbelievable, because this is to say that there is no hypothesis H which makes e probable: in this case p(e) is low but p(e|H) is also low (and so by BT there is no confirmation of any particular H). But we also do not want evidence which is so believable as to be unsurprising, because this is to say that there are many hypotheses Hn which make e probable, and then our evidence is undiscriminating (and so by BT there is no confirmation of any particular H). Instead, what we want is evidence which is i) so surprising as to be unbelievable assuming that H1 through Hn are true, and yet ii) not surprising and very believable if we assume that Hn+1 is true. In this case p(e) is sufficiently lower than p(e|Hn+1) (and so by BT there is confirmation of Hn+1). That is, we want evidence to be surprising/unbelievable except for the fact that there is a unique hypothesis which renders the evidence less surprising and more believable.

3.4 Balancing Evidential Features

Given that several evidential desiderata are, and should be, appealed to when assessing evidence, there are multiple equally rational ways of satisfying these evidential desiderata. Ideal evidence is absolutely free from systematic errors, completely relevant to our hypothesis of interest, totally transparent, perfectly concordant with evidence from other modes and our background theories, and highly surprising except in light of our hypothesis of interest. This is Plato's evidence. I assume that there is no such thing as Plato's evidence. All evidence has some systematic error, or some relevance-distance from our hypothesis, or some discordance with other modes of evidence, or some implausibility with respect to a body of accepted theory, or some other flaw vis-à-vis the evidential features discussed in Chapter 2.

You and I might reasonably have different views about which of these features of evidence is most important. We could both agree that quality, relevance, and concordance are all important, and we could both agree that if we could have Plato’s evidence then we would be happy. But which of these features is most important? This is like asking, given that an ideal spouse is intelligent, beautiful, and pleasant, which of these three aspects is most important? You might say that pleasance is more important than intelligence and intelligence is more important than beauty, and alternatively, I might say that beauty is more important than pleasance and pleasance is more important than intelligence. Similarly, you might say that quality is more important than relevance and relevance is more important than concordance, and I might say that concordance is more important than quality and quality is more important than relevance. Suppose we do. Our assessments of evidence would be equally rational, though we would be concerned with different features of evidence.


Consider our differing views of spousal features. You think pleasance > intelligence > beauty and I think beauty > pleasance > intelligence. Along comes a possible spouse – a real person, not Plato’s spouse – with varying degrees of pleasance, intelligence, and beauty. Our assessments of this person as a potential spouse will differ. Suppose this person is extremely pleasant, moderately intelligent, and only somewhat beautiful. You will find this person a good match, given your rank-ordering of spousal desiderata. I, on the other hand, will not find this person a good match given the relative importance I place on beauty compared to pleasance and intelligence.

To continue my wine example from Chapter 2: one oenophile might consider the oaky bouquets of certain varietals to be more important than the 'legs' of a wine, and might consider pleasant taste to be relatively unimportant to judging a wine's quality, whereas another oenophile might consider taste to trump any other feature of wine.

The no-best-ordering argument for the features of evidence is more difficult than this. The various features of evidence described in Chapter 2 are themselves relatively abstract notions, themselves determined by multiple considerations. Quality of evidence – that is, freedom from systematic error – is comprised of numerous features which determine the degree to which evidence is free from systematic error.

All the standard elements of experimental design – random allocation of subjects, appropriate blinding, and proper use of analytical tools – are factors which determine the quality of evidence. In the words of others, "Quality is a multidimensional concept, which could relate to the design, conduct, and analysis of a trial, its clinical relevance, or quality of reporting" (Jüni et al. 2001). Same with relevance: numerous considerations bear on the notion. Cartwright's 2010 presidential address to the Philosophy of Science Association argued for the importance of four such considerations which are relevant to relevance (and each of these is a highly complex consideration). To continue my spouse example: elements of beauty might include facial structure, physical fitness, skin complexion, and hairstyle. Likewise, intelligence can be comprised of many different things – for example, Howard Gardner analyzed intelligence into multiple components: logical, linguistic, spatial, musical, kinesthetic, naturalist, intrapersonal, and interpersonal (1983) (though his analysis of intelligence is controversial). The rank ordering of the coarse-grained spousal features – beauty, pleasance, and intelligence – depends on a weighting and combination of finer-grained features. This is also the case for coarse-grained features of evidence.

An empirical demonstration of different evidence assessment schemes giving contradictory results, based on emphasizing different features that determine the quality of evidence, was given by Jüni and his colleagues (1999). They analyzed data from 17 studies which tested a particular medical intervention, using 25 different scales to assess "trial quality". These quality assessment scales, first described in Moher et al. (1995), varied in the number of assessed study attributes, from a low of three study attributes to a high of 34, and varied in the weight given to the various features of evidence. The results were troubling. The outcome of interest was measured by what epidemiologists call 'relative risk reduction', and depending on which quality assessment scale was used, the relative risk reduction ranged from 0.63 to 0.90 for studies categorized as 'high quality' and from 0.52 to 1.13 for studies categorized as 'low quality'. That is, meta-analyses of the studies which were considered high quality, as assessed by the various quality assessment scales, had outputs that differed by up to 43% ((0.90-0.63)/0.63); meta-analyses of the studies which were considered low quality had outputs that differed by up to 117%. In the authors' words, "the type of scale used to assess trial quality can dramatically influence the interpretation of meta-analytic studies" (1058).
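The percentage spreads can be checked with a line of arithmetic; the endpoints below are the relative-risk figures quoted in the preceding paragraph, and each spread is computed as (high end - low end) / low end.

# Check of the spreads reported for the Jüni et al. (1999) reanalysis.
high_quality = (0.63, 0.90)
low_quality = (0.52, 1.13)

spread = lambda lo, hi: (hi - lo) / lo
print(round(spread(*high_quality) * 100))   # 43  -> up to ~43% difference
print(round(spread(*low_quality) * 100))    # 117 -> up to ~117% difference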

One might be tempted to respond to the above no-best-ordering argument by claiming that I have merely noted that there exists a variety of ways of ordering desiderata – for spouses or for evidence. But it does not follow from this that there is no best or most rational way of ordering desiderata. One particular ordering of the spousal desiderata could be best, at least given a particular goal. If one's goal were to have a happy relationship, for example, there might be a rationally best way of ordering the spousal desiderata (even if no one has yet found that way). Similarly, a particular ordering of the evidential desiderata might in fact be more truth-conducive, and the mere fact that we have not found that ordering does not mean it does not exist.

Some orderings do seem rationally best conditional on specific goals. The bicycle parts designer Keith Bontrager made a quip about the desiderata for choosing components of a bicycle: "Strong. Light. Cheap. Pick two." Given specific cycling goals, a particular ordering of these desiderata would be rational. A professional cyclist specializing in hill climbing (such as Lance Armstrong) should pick light over strong over cheap. An impoverished commuter (such as myself) should pick cheap over strong over light. An amateur mountain-biker should pick strong over light over cheap. In these cases, given some goal, a particular ordering of desiderata can be argued to be most rational. But this works because we can make general arguments about the relative importance of these desiderata with respect to these particular goals.

The same is not true for more general goals. If a customer walks into a bicycle shop and asks for the best bike, a good salesperson will question the customer about her intended use. Similarly for truth: truth is an abstract goal, and as such no general argument can be made about the importance of the specific evidential desiderata relative to each other with respect to attaining the goal of truth.

My list of desiderata for assessing evidence, and the associated no-best-ordering argument, is similar to the lists of desiderata and associated no-best-ordering arguments that bear on the assessment of theories. Such features are sometimes called 'epistemic virtues' or 'epistemic values'. Hempel gave the following criteria of theory assessment: simplicity, support by more general theories, prediction of novel phenomena, and plausibility with respect to background knowledge (1966). Kuhn's notorious criteria were accuracy, consistency, scope, simplicity, and fruitfulness (1977). Van Fraassen includes elegance, simplicity, completeness, unifying power, and explanatory power, and urges that these are merely 'pragmatic' rather than truth-conducive. Lycan lists simplicity, testability, fertility, neatness, conservativeness, and generality (1998). Longino (1994) provides a list of 'feminist theoretical virtues', which include ontological heterogeneity, mutuality of interaction, applicability to human needs, accessibility of ideas, and novelty (see also Wylie 1995). Chang argues that scientific progress can be understood as the enrichment of these epistemic virtues in a particular research program (2004). In a manner similar to the argument presented here, some philosophers with an inclination toward relativism with respect to theory choice urge a no-best-ordering argument with respect to such epistemic virtues. Kuhn (1962), for instance, is sometimes read as arguing that the choice between paradigms is not based on best empirical fit, because all paradigms up for grabs in any historical episode are well supported by evidence, but instead the choice between paradigms is a result of balancing the theoretical virtues. Kuhn urged that there is no single universal algorithm for balancing the theoretical virtues, and that the evidence-theory relation cannot be quantified: "There is no more precise answer to the question whether or how well an individual theory fits the facts" (p. 147).25

The focus of these no-best-ordering arguments is on theory choice, or sometimes on assessing the support that evidence provides to a particular theory. My focus is, instead, on the assessment of the evidence itself. What distinguishes the desiderata presented here is that past lists of epistemic virtues were comprised of features of theories rather than features of evidence or of modes. Thus, even prior to the appeal to epistemic virtues to aid in theory choice, multiple evidential features must be appealed to when assessing evidence. And just as no-best-ordering arguments have been made for the epistemic virtues of theories, I have given a no-best-ordering argument with respect to the features of evidence.26

25 However, Kuhn did stress that comparative confirmation was possible. The quoted passage continues: "But questions much like that can be asked when theories are taken collectively or even in pairs. It makes a great deal of sense to ask which of two actual and competing theories fits the facts better. Though neither Priestley's nor Lavoisier's theory, for example, agreed precisely with existing observations, few contemporaries hesitated more than a decade in concluding that Lavoisier's theory provided the better fit of the two." 26 This view is, of course, not without precedent. For instance, here is Neurath (1932): "Science is ambiguous – and is so on each level … Poincaré, Duhem, and others have adequately shown that even if we have agreed on the protocol statements, there is an unlimited number of equally applicable, possible systems of hypotheses. We have extended this tenet of the uncertainty of systems of hypotheses to all statements, including protocol statements that are alterable in principle" (modified translation cited in Howard 2002); see also Cartwright et al. (1996).


Consider Benveniste’s evidence for ‘water memory’ (Chapter 2). How compelling was the evidence? How ought one determine the prior probability of observing degranulation with infinitely diluted antibody? What about the probability of this observation conditional on the water memory hypothesis? The mere disagreement between Benveniste’s group and the editor and readership of Nature is not conclusive evidence that these probabilities are impossible to determine, but it does indicate the difficulty. The no-best-ordering argument is meant to show that, in principle, such difficulty is to be expected.

In sum, given that there are multiple equally rational ways of ordering the evidential desiderata, when given some piece of evidence e, multiple rational determinations of p(e|H) and p(e) are possible.

3.5 Bayesian Inference

The argument presented in this chapter – the underdetermination of evidence by theory (UET) – has implications for a leading theory of scientific inference: Bayesianism. I have argued that it is impossible to determine the likelihood of the evidence in many real scientific contexts because a plurality of features of the evidence must be assessed, and there are numerous equally rational yet contradictory ways to balance such an assessment of plural evidential features. This is the conclusion of the argument that I am calling the underdetermination of evidence by theory.27

27 This conclusion is similar to that reached by John Norton's 'material theory of induction'. Norton (2003, 2005, 2007) has "urged that there is no single logic of induction, but many varieties of logic each adapted to particular contexts" (2007 p. 142).

A standard reference text of Bayesian philosophy of science dismisses this worry: in a direct response to Glymour (1980), Howson and Urbach (1989) write "whether there is as much epistemic warrant for the data in 1915 about the magnitude of Mercury's perihelion advance as there is about the number of heads we have just observed in a sample of a hundred tosses of a coin is beside the point" (p. 272). They agree that "about some data we may be more tentative, about other data less" (p. 272).

But this is not supposed to be a problem, because

the Bayesian theory we are proposing is a theory of inference from data; we say nothing about whether it is correct to accept the data, or even whether your commitment to the data is absolute … the Bayesian theory of support is a theory of how the acceptance as true of some evidential statement affects your belief in some hypothesis. How you came to accept the truth of the evidence, and whether you are correct in accepting it as true, are matters which, from the point of view of the theory, are simply irrelevant. (p. 272)

This might be correct, but one might agree with the response of Alan Chalmers (1999) to this passage: "this is a totally unacceptable position for those who purport to be writing a book on scientific reasoning. For is it not the case that we seek an account of what counts as appropriate evidence in science?" (p. 191). Moreover, given a Bayesian theory of inference, one might agree with Glymour that "we need some rule for determining a counterfactual degree of belief in e and a counterfactual likelihood of e on T" (Glymour 1980 p. 87). If there does not exist such a rule, then a Bayesian theory of inference is benign.28 The UET argument has attempted to support the conclusion that there does not exist a general rule to determine these probabilities. Howson and Urbach claim that they can "ignore this discussion" because "the sort of Bayesianism we are advocating is not regarded as a source of rules for computing all the probabilities in Bayes's Theorem" (p. 273). Nevertheless, without a determinate likelihood, measures of confirmation cannot be used in scientific inference. The retort from Howson and Urbach is that "people are capable in many cases of determining, possibly only very roughly, to what extent they think a piece of data likely relative to a stock of residual background information" (p. 273). Perhaps. But the UET argument shows that other people are capable of determining – differently, yet rationally – to what extent they think a piece of data likely relative to background assumptions. The impact that evidence has on a hypothesis is underdetermined by the truth of that hypothesis.29

28 'Swamping' solutions to the problem of subjective priors assume determinate intersubjective likelihoods. UET thus raises trouble for these swamping solutions. I thank Christian Wüthrich and Matthew Brown for emphasizing this point. 29 UET also renders less plausible the thesis known as epistemic uniqueness, which claims that particular evidence e uniquely justifies a particular belief b (e.g., White 2005). Epistemic permissivism responds that only with respect to an 'epistemic system' does e uniquely justify b, where 'epistemic system' is a coarse-grained, large-scale aspect of culture such as Science or Religion (see Boghossian 2006). UET supports permissivism by appealing to the fine-grained features of evidence: e justifies b only given a particular weighing of the relative importance of the features of evidence. Similarly, UET bears on the 'epistemology of disagreement', which asks: when one disagrees with an 'epistemic peer' (someone who shares e and reasoning ability) should one maintain one's belief or adjust it toward that of the peer? The 'equal weight view' supports the latter response. But given UET, disagreeing peers might both be rational, and thus the norm of adjusting one's belief towards one's peer is mitigated.

CHAPTER 4: INDEPENDENT EVIDENCE

Abstract

Robustness arguments hold that a belief is better supported when multiple kinds of independent evidence support the belief. We investigate the notion of independence required for robustness. We identify two general kinds of independence: ontic independence (OI) – when the multiple lines of evidence depend on different materials, assumptions, or theories – and probabilistic independence. We show that OI evidence can collectively confirm a hypothesis to a lower degree than any of the individual pieces: this is dyssynergistic evidence (DE). It follows that OI alone cannot be sufficient for robustness arguments. We formulate a probabilistic criterion of independence – what we call conditional probabilistic independence (CPI) – and prove that evidence that meets this criterion is collectively more confirmatory than its individual pieces. CPI, however, is often not epistemically accessible to an experimentalist.


4.1 Introduction

There is a family of arguments the members of which rely on the same basic intuition: a belief is better supported when one has multiple kinds of independent evidence supporting the belief. We call this family of arguments robustness arguments. Robustness is often appealed to as an argument for various kinds of scientific realism. There are subtle differences in the way such arguments have been instantiated and justified, but it is commonly presupposed that robustness arguments are epistemically valuable only when the multiple kinds of evidence used to construct the robustness argument are independent. We begin by describing several prominent examples of robustness arguments to show their reliance on a notion of independent evidence (§4.2).

Usually those who make robustness arguments rely on a notion of ontic independence (OI). Evidence is OI when the available multiple lines of evidence depend on different materials, background assumptions, theories, or 'chunks of physics' (§4.3). We refer to such evidence as multimodal. OI focuses solely on the independence of the techniques by which the evidence is gathered rather than the relationship between the multimodal evidence and a hypothesis. We argue that OI leaves open the possibility of multimodal evidence that collectively confirms a hypothesis to a lower degree than any of the individual lines of evidence (§4.4). We call such evidence dyssynergistic. The existence of dyssynergistic evidence shows that OI does not suffice as a criterion of independence required for robustness.

We formulate a probabilistic criterion of independence and argue that evidence that fits this criterion undergirds the "more is better" principle underlying robustness arguments (§4.5). We label this criterion conditional probabilistic independence (CPI). Evidence that is CPI is collectively more confirmatory than its individual pieces (we prove this in the Appendix, §4.7).

4.2 Robustness & Independent Evidence

A canonical example of a robustness argument is Jean Perrin’s determination of Avogadro’s number. Perrin estimated Avogadro’s number by using thirteen different experimental and observational methods, including observations of Brownian motion, radioactivity, blackbody radiation, and ion motion in liquids (Nye 1972).

Philosophers have appealed to this case as support for various forms of realism, including causal realism (Salmon 1984), entity realism (Cartwright 1983), natural kind realism (Snyder 2005), and theory realism (Maddy 2007; Mayo 1986). Mayo (1996) calls Perrin's measurements an exemplar of 'severe testing', while van Fraassen (2009) argues that Perrin used a 'bootstrapping' method, and describes the case in terms less favorable to realism (but not necessarily favorable to his constructive empiricism either).30

As an argument for one of these forms of scientific realism, robustness is often formulated as a no-miracles argument, or an argument from coincidence. Here is Salmon (1997) on the Perrin episode: "such agreement would be miraculous if matter were not composed of molecules and atoms." Similarly, witness Cartwright (1983): "Would it not be a coincidence if each of the observations was an artefact and yet all agreed so closely about Avogadro's number?" Such reasoning goes back at least as far as Whewell (1857): "no accident could give rise to such an extraordinary coincidence." In other words, it would be a miracle if concordant multimodal evidence supported a hypothesis and the hypothesis were not true; we do not accept miracles as compelling explanations; thus, when concordant multimodal evidence supports a hypothesis, we have strong grounds to believe that it is true.

30 Van Fraassen's reinterpretation of the Perrin case suggests that, rather than a paradigm of a no-miracles argument for one version of realism or another, the case is an embarrassing example of the severe underdetermination of philosophical theory by historical evidence.

Such a no-miracles argument is said to be compelling, however, only if the multiple kinds of available evidence are sufficiently independent. Regarding Perrin's multiple methods of measuring Avogadro's number, Salmon (1984) urges his reader to "notice what a wide variety of substances are involved and how diverse are the phenomena being observed"; he also notes that "the experiments on Brownian motion are physically separate and distinct from those on alpha radiation and helium production. The most famous ones were done in different countries." Similarly, Kosso writes "the reason that Perrin's results are so believable, and that they provide good reason to believe in the actual existence of molecules, is that he used a variety of independent theories and techniques and got them to agree on the answer" (1989). Likewise, Mayo (1996) emphasizes the importance of the fact that Perrin's methods were of "very different phenomena."

The independence condition for robustness arguments is not, of course, limited to the Perrin episode. Culp (1995) considers robustness compelling if the following condition is met: "the techniques must not all use the same theoretical presuppositions in making raw data interpretations." Similarly, when Hacking (1983) asked 'Do we see through a microscope?', he was interested in the strategies that a scientist could use to be confident that features of an object observed under a microscope are real features of the object rather than artifacts of the observational apparatus. Confidence in the reality of a feature of an object observed under a microscope can be gained, according to Hacking, if different kinds of microscopes – his were electron transmission microscopes and fluorescent re-emission microscopes – are used to observe the features. Such confidence is justified because: "These processes have virtually nothing in common between them. They are essentially unrelated chunks of physics. It would be a preposterous coincidence if, time and again, two completely different physical processes produced identical visual configurations which were, however, artifacts of the physical processes rather than real structures in the cell."

Thus Hacking’s ‘coincidence’ argument requires a notion of independence between the multiple kinds of evidence used in such an argument. Likewise, Douglas (2004) writes “The strength of the claims concerning the reliability of [robustness] rests on the independence of the techniques used.” Again, we can find mention of the independence condition for robustness as far back as Whewell (1857): “That rules springing from remote and unconnected quarters should thus leap to the same point can only arise from that being the point where the truth resides” (emphasis added).

In short, it is widely assumed that robustness arguments rely on a notion of independent evidence. The independence condition for robustness is meant to ensure that the concordant results of multiple methods are due to the object of investigation rather than some feature shared by the methods, allowing the data to be construed as evidence about the studied object – "the point where the truth resides" – rather than an artifact caused by the feature shared between the various methods. Surprisingly little has been said, however, about the condition of independent evidence. Our understanding of this notion has progressed little since Whewell introduced his notion of consilience of inductions. It is independence of apparatus or experimental technique, what we call ontic independence, that most philosophers of science assume to be the relevant kind of independence required for robustness arguments.

4.3 Ontic Independence

We saw in §4.2 that robustness arguments often rely on a notion of ontic independence. Judgments of ontic independence, or independence between multiple lines of evidence, require a criterion for individuating different modes of evidence. At first glance, knowing when multiple lines of evidence are independent seems straightforward. Consider the following:

We have an intuitive grasp on the idea of diversity among experiments. For instance, measuring the melting point of oxygen on a Monday and on a Tuesday would be the same experiment, but would be different from determining the rate at which oxygen and hydrogen react to form water. (Howson and Urbach 1989 p. 84).

While we share this "intuitive grasp" of what diversity of evidence is, it is surprisingly difficult to specify more clearly the conditions under which multiple modes of evidence are ontically independent (OI). This difficulty stems from the challenge of determining what the proper form of independence should be between modes of evidence. What criteria should we use to individuate modes of evidence? We are here interested in the possibility of a general condition for OI which undergirds such commonly expressed intuitions regarding independent evidence. Our conclusion is primarily negative: no such general solution to the problem of individuating OI evidence is forthcoming.

The individuation problem can be motivated by considering the following simple case. When testing the efficacy of a drug, one might use chemical assays, animal studies, and human trials, each of which one would intuitively describe as a different mode of evidence, and so this would be a case of multimodal evidence, or evidence which is OI. In contrast, performing a particular animal experiment on one day, and then performing the same experiment with all the same parameters again on another day, would not thereby generate two modes of evidence, and so this would not be a case of multimodal evidence (intuitively, the evidence from both days is not OI). Why does the former set of experiments generate OI evidence while the latter set of experiments does not? If we had a criterion for the individuation of modes of evidence then we could determine if the available evidence was OI.

Usually when building a robustness argument no attempt is made to explain what is meant by the independence requirement. At most, breezy stipulations of OI are made, based on the loose accounts of OI introduced in §4.2. For instance, when Franklin (1986) claims that the bubble chamber and the spark chamber are independent kinds of evidence, he claims that one works by "bubble formation in superheated liquids" and the other works by "electrical discharge in ionized gases." While Franklin might be correct that the bubble chamber and the spark chamber are independent kinds of evidence, a general condition for OI would provide a justification of Franklin's claim.


One suggestion for an OI condition is due to Culp (1994): a necessary condition for robustness-style arguments is that modes of evidence should rely on different background theories. It is a commonplace view that evidence is theory-laden, and Culp’s suggestion is that the different modes of evidence in a robustness argument must be laden with different theories. OI, on this view, is based on the difference in the background theories which underlie the multiple modes of evidence. This account of independent evidence shifts the required criterion of individuation from evidence to theories, for which Kosso (1989) suggested the following:

Given two scientific theories T1 and T2, T1 is independent of T2 in a way that makes T1 a possible source of objective test of T2, if our acceptance of T1 as true (or rejection of T1 as false) does not force us to accept T2 as true (nor to reject T2 as false). If the truth or falsity of T1 is insulated from the truth or falsity of T2, then the two theories are independent in the relevant way. (Kosso 1989)

The construal of independent evidence in terms of independence of ladening theories is potentially valuable in some circumstances. But not all evidence is theory-laden in the same way or to the same degree. And sometimes knowing what theory the data is laden with is difficult or impossible. Further, one can imagine two pieces of evidence which depend on the same theory for the production of data and interpretation of evidence, and yet which one would still call different modes. Consider, for example, all the possible study designs in epidemiology (case-control studies, cohort studies, randomized controlled trials, and so on). Although each of these modes requires particular background assumptions to relate evidence from the mode to a hypothesis, such background assumptions are not necessarily theories, if one pedantically reserves this term for high-level scientific abstractions; perhaps some theory is used in interpreting the evidence from these designs, but they are not necessarily different theories which laden the evidence from different epidemiological study designs; and yet, these study designs are considered to be different modes of evidence by epidemiologists (though of course they do not use our terminology).

Moreover, it is easy to imagine a robustness argument based on evidence from multiple epidemiological studies of different designs. The unit of theory is too coarse-grained to serve as a basis of individuation. Individuation of modes in terms of OI needs a finer-grained criterion than theory independence.

We have already briefly noted a similar proposal for conditions of independence by Hacking (1983). He claimed that his 'coincidence' argument regarding the convergence of multiple lines of evidence (in Hacking's case, microscopes) was compelling because the multiple lines of evidence depended on "unrelated chunks of physics." Here is another example from Hacking that suggests a condition for independent evidence: "Light microscopes, trivially, all use light, but interference, polarizing, phase contrast, direct transmission, fluorescence and so forth exploit essentially unrelated phenomenological aspects of light" (p. 204). But this suggestion is also too coarse-grained for the same reason as Culp's proposal. Perhaps it might be reasonable to suppose that independence based on different theories or different chunks of physics is at least sufficient to guarantee OI, but it is not necessary.

Moreover, it pushes the determination of OI back a level to determining independence of ‘phenomenological aspects’ of one’s methods of investigation.

It is a familiar thesis that all data is only evidence for a hypothesis when conjoined with certain background assumptions. Thus another way to consider the individuation of modes of evidence is by the independence of background assumptions between the modes, relative to a given hypothesis. To individuate two modes, it might be sufficient if the modes share all the same background assumptions except one. This, though, is not restrictive enough. To consider Tuesday's animal experiment the same mode as Thursday's animal experiment, besides assuming that the animal experiments followed the same protocol, we must hold several background assumptions on Thursday that we didn't on Tuesday – that the bit of nature under investigation has retained its causal structure since Tuesday, that the different socks which the scientist is wearing on Thursday do not influence the results of the experiment, and so on – and yet intuitively these animal experiments are not OI. Thus it is necessary to have at least a few unshared background assumptions between even tokens of the same mode, let alone between multiple modes.

At the other end of the spectrum of degree of independence of background assumptions would be when two modes are individuated based on a total exclusivity of background assumptions; that is, when the evidential modes do not share a single background assumption. This might also be too restrictive, since one might think that at bottom all modes of evidence, at least when related to the same hypothesis, must share at least some background assumptions. Think of the sensory modalities: vision and touch, though seemingly very distinct modes of sensation, rely on much of the same cognitive apparatus. In Hacking’s coincidence argument, evidence from each microscope requires an assumption that random solar flares (for example) do not create artifacts in the microscopic image. OI cannot require individuation based on total exclusivity of background assumptions.


Since knowledge of many background assumptions can be far less than certain, interpretation of almost any data as evidence for a hypothesis might be an artifactual interpretation based on false background assumptions. A robustness argument based on evidence from different modes, with different background assumptions, might be compelling if the problematic assumptions for each mode of evidence – those assumptions which we are uncertain about – were different between modes. Consider a situation in which evidence from a case-control study with high external validity and low internal validity is concordant with evidence from a randomized controlled trial (RCT) with high internal validity and low external validity. To think that both modes of evidence are truth-conducive for a general hypothesis of interest (that is, that both modes of evidence give evidence that is true and general, or internally and externally valid), it is necessary to hold certain background assumptions for each mode. For the case-control study, a required assumption is that there is no selection bias. For the RCT, a required assumption is that the results are exportable to our population of interest. These modes are individuated rather weakly. They are both human studies at a population level, and as such they share many assumptions, and the statistical analysis of the data from the two modes relies on similar assumptions about population structure. However, the particularly problematic assumptions are the unshared ones.

Given that they are unshared, if the two kinds of studies give concordant evidence, then that is a reason to think that the unshared background assumptions are not as problematic (in this particular situation) as one would otherwise expect, and so the evidence can be considered truth-conducive. So problematic-auxiliary independence is a good candidate for individuating modes of evidence for arguments based on robustness. The robustness argument for this example would then go as follows. If there was a positive result in the RCT, we might be wrong in assuming that we can generalize its results to a broader population, because of the RCT's low external validity. If there was a positive result in the case-control study, we might be wrong in assuming that the positive result was a true finding, because of the case-control study's low internal validity. But since the possible errors are different, the probability that both studies are in error is less than the probability that either study alone is in error. And so the concordance between the two kinds of studies gives good reason to think their results are true.
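The probabilistic step in this argument can be illustrated with stipulated numbers; the error probabilities below are invented for illustration, and the two studies' errors are treated as probabilistically independent because they arise from different, unshared assumptions.

# Stipulated error probabilities (not estimates of any real studies).
p_error_rct = 0.2           # chance the RCT's result does not generalize
p_error_case_control = 0.3  # chance the case-control result is a spurious finding

# If the two failure modes are independent, the chance that both are in error:
p_both_err = p_error_rct * p_error_case_control   # 0.06
print(p_both_err < p_error_rct)                    # True
print(p_both_err < p_error_case_control)           # True
# Concordant results from the two designs are much less likely to be jointly
# mistaken than either result is to be mistaken on its own.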

Thus we can say: it is the background assumptions which we are uncertain about that matter for OI. We could then account for the value of robustness arguments in the following way. A hypothesis is more likely to be true when two or more modes of evidence provide concordant multimodal evidence and the worrisome or problematic auxiliary assumptions for all modes of evidence are different.

Unfortunately, there are several problems with attempting to individuate modes based on problematic-auxiliary independence: we must assume that we can individuate assumptions, we must set a threshold for determining which assumptions are problematic, and we must determine which assumptions do not meet this threshold. How can one do this for all the assumptions required for a given mode of evidence?

We could describe the "causal structure" or the "mechanism" of a mode of evidence – that is, we could list all the entities and relations involved in the production of the evidence – and then say that if the causal nexus contains an entity or a relation about which we are uncertain in some respect, or which is somehow unreliable, then it is the assumptions about that entity or relation which are problematic. This is just pushing the individuation problem back a level: now we have to identify those worrying entities, for which we doubt there is any general criterion of identification.

Consider a comparison between electron microscopes and witnesses: on any account of individuation, evidence from an electron microscope should be construed as being of a different mode than evidence from personal testimony. Two common assumptions thought to be problematic for evidence from personal testimony are a witness's capability and a witness's honesty. But a person, the microscope operator, is also involved in the generation of evidence from an electron microscope, and yet one does not normally worry about the capability or the honesty of the microscope operator. It is almost always safe to assume that the microscope operator is honest and capable.

Both modes of evidence have, in their causal structure, the same type of entity and its associated activity: a person who relays their experience of the world. Despite this similarity, in one mode of evidence the entity has associated problematic assumptions and in the other mode of evidence the entity does not have associated problematic assumptions. Of course, various stories could be told to explain this. The point is that as a criterion of individuation of modes, appealing to problematic background assumptions shifts the burden from specifying a satisfactory and general criterion of OI to specifying a satisfactory and general criterion of identifying problematic background assumptions. This is a burden unlikely to be met.

Proposals to identify conditions for OI such as those canvassed here have a lineage going back to Whewell (1857), who spoke of individuating evidence based on 'natural classes of events'. A consilience of inductions was good grounds for belief (for Whewell such grounds were infallible) when "an induction, obtained from one class of facts, coincides with an induction, obtained from a different class." Snyder (2004) explains this as follows: a 'consilience of inductions' (i.e. a robustness argument) requires the individual inductions to be based on different kinds of entities or processes. Thus, attempts to justify robustness arguments based on a plurality of theories (or on 'different chunks of physics') which laden one's multimodal evidence are part of an old tradition. Unfortunately, identifying general conditions for OI, based on a criterion of individuation between modes, is more difficult than one might at first have thought. This does not entail that, in fact, there are no modes, or that OI is illusory. It merely suggests that a general criterion for identifying OI might be impossible. Nor does this mitigate the epistemic importance of multimodal evidence or of robustness arguments based on OI. After all, there does not exist a compelling universal criterion to individuate sensory modalities, and yet we assume that there are multiple sensory modalities and that having multiple sensory modalities is epistemically important (Keeley 2002). Same with OI: we might not be able to come up with a compelling account of OI based on a general criterion of individuation for modes, but one might think that OI is nonetheless epistemically important. In the following section we argue that at least one standard way of construing the epistemic importance of OI is more constrained than many have thought.

Nevertheless, although identifying problematic background assumptions with a general criterion of individuation is perhaps impossible, identifying problematic background assumptions is regularly done on a case-by-case (or mode-by-mode) basis.


4.4 Dyssynergistic Evidence

OI evidence, or multimodal evidence, can be dyssynergistic. Multimodal evidence is dyssynergistic when the conjunction of all the evidence is less confirming than the conjunction of any proper subset of the modes of evidence.

Dyssynergistic Evidence (henceforth DE)

Two pieces of evidence, E1 & E2, are dyssynergistic if and only if E1 and E2 are ontically independent, both E1 and E2 confirm H, and c(H, E1 & E2) < c(H, E1) or c(H, E1 & E2) < c(H, E2).

When OI evidence is DE, a robustness argument based on the OI evidence cannot be made. In what follows we provide three cases to illustrate DE.31

La Jolla Murder Mystery

A detective investigating a murder has an initial list of suspects, 50% of whom are from La Jolla. The detective acquires two pieces of evidence at the crime scene. A blond hair is found on the victim's body that must be the murderer's, and a bit of the murderer's blood is found at the crime scene, and determined to be type A. These pieces of evidence are plausibly OI. There need be no substantive background theoretical assumptions in common between the tools used to analyze the blood and the tools used to analyze the hair.

31 Our cases are constructed. Admittedly, our discussion of DE would be helped by detailed historical illustrations of real scientific examples of DE. Such illustrations would have the following structure: for some particular scientific hypothesis multiple pieces of OI evidence individually supported hypothesis X, but jointly the multiple pieces of OI evidence disconfirmed X, and we have come to believe this perhaps because we now believe that X is false. We are unsure if it is possible, for any particular historical episode in science, to distinguish such cases of DE from quotidian experimental error.

The detective studies the suspects and discovers the following:

60% of the suspects from La Jolla have blond hair.

40% of the suspects from outside La Jolla have blond hair.

60% of the suspects from La Jolla have type A blood.

40% of the suspects from outside La Jolla have type A blood.

20% of the suspects from La Jolla have both blond hair and type A blood.

40% of the suspects from outside La Jolla have both blond hair and type A blood.

These conditional probabilities are summarized in Table 2.

Table 2. Likelihoods of Evidence in La Jolla Murder Mystery.

Pr(-- | La Jolla):
                 type A +    type A -
    blond +        0.2         0.4
    blond -        0.4         0

Pr(-- | ~La Jolla):
                 type A +    type A -
    blond +        0.4         0
    blond -        0           0.6


Let the hypothesis H be that the murderer is from La Jolla, and let evidence E1 be the blond hair and E2 be the type A blood. It is evident from the data that Pr(E1|H) >

Pr(E1|~H) and Pr(E2|H) > Pr(E2|~H). It follows that Pr(H|E1) > Pr(H) and Pr(H|E2) >

Pr(H).32 Each piece of evidence is individually confirmatory of the hypothesis H. It is also clear from the data that Pr(E1 & E2|H) < Pr(E1 & E2|~H), which means Pr(H|E1 &

E2) < Pr(H). The combined evidence is disconfirmatory. This, then, is a case of DE.
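For concreteness, here is a minimal numerical check of the case (a Python sketch, not part of the argument), using the 50% prior given by the initial suspect list and the likelihoods from Table 2:

```python
# A numerical check of the La Jolla case, using the 50% prior from the suspect
# list and the likelihoods in Table 2.
pr_h = 0.5                          # Pr(H): the murderer is from La Jolla

pr_e1_h, pr_e1_noth = 0.6, 0.4      # blond hair: Pr(E1|H), Pr(E1|~H)
pr_e2_h, pr_e2_noth = 0.6, 0.4      # type A blood: Pr(E2|H), Pr(E2|~H)
pr_e12_h, pr_e12_noth = 0.2, 0.4    # both: Pr(E1 & E2|H), Pr(E1 & E2|~H)

def posterior(lik_h, lik_noth, prior=pr_h):
    """Pr(H | evidence) by Bayes' theorem."""
    return lik_h * prior / (lik_h * prior + lik_noth * (1 - prior))

print(posterior(pr_e1_h, pr_e1_noth))      # 0.6   > 0.5: E1 alone confirms H
print(posterior(pr_e2_h, pr_e2_noth))      # 0.6   > 0.5: E2 alone confirms H
print(posterior(pr_e12_h, pr_e12_noth))    # 0.333 < 0.5: E1 & E2 together disconfirm H
```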

Bank Robbery

A detective is investigating a bank robbery. She is entertaining the hypothesis that the perpetrator is a man who is known to always wear a very peculiar jacket. One side of this jacket is white, and the other side is an unusual shade of green not often used for clothing, but the detective does not know which side of the jacket is white and which is green. She has a witness, Waldo, who says he saw a man running away from the bank wearing a jacket of that particular shade of green (although he only saw the man in profile). This strongly confirms the detective’s hypothesis. But she has another witness, Wendy, who says the exact same thing: she saw a man in profile wearing a green jacket. Unfortunately for the detective, Wendy was on the opposite side of the street from Waldo.

Either piece of evidence on its own would be strongly confirmatory, but the conjunction of the evidence is not, since it suggests that the criminal was wearing a jacket that is entirely green. This is, then, an example of dyssynergistic evidence. But is there a failure of OI? There is no reason to think so, no matter which of the specific

32 Proof omitted.

criteria from §4.3 we adopt. It is entirely possible that Wendy and Waldo are completely independent observers. They never met to collude about their testimony, and they are not prone to similar visual hallucinations. To make the ontic independence of the sources of evidence even more stark, suppose one of them is actually a camera. The point is that the independence (or lack thereof) in the background theories of the two modes of evidence has nothing to do with the source of dyssynergy. Rather, the dyssynergy stems from a peculiar feature of the hypothesis under test.

Interacting Drugs

Suppose you are a hospital physician, pondering a hypothesis about a patient’s survival (let H be “the patient will live”), and you have available two modes of evidence: the report from life-support machine 1 and the report from life-support machine 2. Machine 1 informs you that the patient is receiving drug X (call this E1).

Drug X is known to help such patients, and so Pr(H|E1) > Pr(H). Machine 2 informs you that the patient is receiving drug Y (call this E2). Drug Y is known to help such patients, and so Pr(H|E2) > Pr(H). There is no causal interaction between Machine 1 and Machine 2, and moreover they are based on different physical principles and background theories, and so the two modes of evidence are OI by any of the criteria in

§4.3. Each piece of evidence on its own confirms H. However, you recently read a paper which showed that drug X binds to drug Y, creating a lethal toxin which causes severe brain damage. Therefore you decide that, although Pr(H|E1) > Pr(H) and Pr(H|E2) >


Pr(H), Pr(H|E1 & E2) < Pr(H). The evidence from the two modes is thus dyssynergistic.

4.5 Conditional Probabilistic Independence

Bayesian approaches to scientific reasoning offer the promise of a rigorous explication of the value of independent evidence. Fitelson (2001) offers a confirmation theoretic approach to independent evidence. Let c(H, E) be a measure of confirmation, the degree to which a piece of evidence E confirms a hypothesis H. Two pieces of evidence E1 and E2 are confirmationally independent relative to a hypothesis H iff c(H, E1|E2) = c(H, E1) and c(H, E2|E1) = c(H, E2). That is, the degree to which either piece of evidence confirms the hypothesis should not be affected by whether or not we conditionalize on the other piece of evidence. Fitelson proves that, for the three most popular measures of confirmation, if E1 and E2 are confirmationally independent relative to H and each piece of evidence independently confirms H, then c(H, E1 & E2)

> c(H, E1) and c(H, E1 & E2) > c(H, E2).33 That is, the conjunction of confirmationally independent pieces of evidence is more confirmatory than the individual pieces of evidence. This is not generally the case for evidence that is not confirmationally independent. When evidence is not confirmationally independent, there is the possibility of dyssynergistic evidence, where the combined evidence is less

33 The relevant confirmation measures are the difference measure, cd(H, E) = Pr(H|E) - Pr(H), the ratio measure, cr(H, E) = Pr(H|E)/Pr(H), and the likelihood-ratio measure, cl(H, E) = Pr(E|H)/Pr(E|~H). See Fitelson (1999) for a description of these measures and a discussion of their relative merits.

confirmatory than the individual pieces. Confirmational independence rules out the possibility of dyssynergy.

We seem to have a nice explanation of the value of independent evidence.

Independent evidence is epistemically valuable because it guarantees that more evidence is better.34 If we collect multiple pieces of confirmatory evidence that are not independent we cannot say if the hypothesis is gaining credence unless we can ascertain that the sorts of interaction effects that produce dyssynergy are not taking place. With independent evidence we are guaranteed that this condition is met.

However, this explanation is a plausible candidate only if the formal criterion of confirmational independence accurately characterizes cases in which scientists would judge that evidence is independent.

Fitelson himself shows that, for some measures of confirmation, confirmational independence does not in fact track what we mean by independence

(Fitelson 2001). Consider the following experiment: An urn is selected at random from a large collection of urns containing black and white balls. Some of these urns contain x white balls for every black ball, and other urns have y white balls for every black ball, with 0 < x, y < 1. The proportion of the first kind of urn in the collection is z. The hypothesis, H, is that the selected urn is of the first kind. We draw a ball from the urn at

34 Novack (2007) argues that Fitelson’s measure is problematic on the grounds that it depends on the description of the evidence. Novack shows that for any set S of evidence that is not confirmationally independent, one can construct a set S’ that is independent such that the conjunction of the evidence in S is identical to the conjunction of the evidence in S’. We do not consider this a significant problem with the account. Confirmational independence is a relation between pieces of evidence, not a property of conjunctions of evidential statements. The individual pieces of evidence in S and S’ will be different, even though their conjunction is not, so the differing judgments of independence do not tell against Fitelson’s measure.


random, replace it and draw again. This gives us two pieces of evidence E1 and E2, say that both draws produced white balls. Fitelson claims that it is intuitive that these two pieces of evidence are independent relative to the hypothesis, since the draws are causally unrelated.35 However, according to two of the three measures under consideration, the difference measure and the ratio measure, these pieces of evidence are not in general confirmationally independent.

Fitelson considers this an argument in favor of the third measure, the likelihood-ratio measure. However, partisans of the other measures may well consider this a reductio of the notion of confirmational independence. If in fact Fitelson is right that the evidence in the urn case is intuitively independent and if it is not confirmationally independent according to plausible measures of confirmation, then confirmational independence must not be the appropriate mathematical characterization of independence. Even though confirmational independence can be shown to rule out dyssynergistic evidence on all three measures, the fact that it does not capture the intuitive notion of independence for at least two of those measures limits the explanatory value of Fitelson’s notion of confirmational independence.
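To see the point numerically, here is a small check in Python (the particular values of x, y, and z are arbitrary illustrative choices): for these values the difference measure assigns different degrees of confirmation to E1 depending on whether we conditionalize on E2, even though the draws are causally unrelated, while the likelihood-ratio measure is unaffected.

```python
# A numerical check of the urn case. The values of x, y, and z below are
# arbitrary illustrative choices, not Fitelson's.
x, y, z = 0.9, 0.1, 0.5
p, q = x / (1 + x), y / (1 + y)      # Pr(white draw | first kind), Pr(white draw | second kind)

pr_h = z                             # H: the selected urn is of the first kind
pr_e1 = z * p + (1 - z) * q          # one white draw
pr_e1e2 = z * p**2 + (1 - z) * q**2  # two white draws (with replacement)

pr_h_e1 = z * p / pr_e1
pr_h_e1e2 = z * p**2 / pr_e1e2

# Difference measure: c_d(H, E1) versus c_d(H, E1 | E2).
print(pr_h_e1 - pr_h)                # ~0.34
print(pr_h_e1e2 - pr_h_e1)           # ~0.13 -- unequal, so not confirmationally independent

# Likelihood-ratio measure: the draws are independent given the urn type, so
# conditioning on E2 changes nothing; c_l(H, E1 | E2) = c_l(H, E1) = p/q.
print(p / q)                         # ~5.2
```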

However, we can formulate a condition which preserves the spirit of his argument without discriminating between the three popular measures. Instead of confirmational independence, we rely on a straightforward notion of probabilistic

35 ‘Independent’ here means ontically or experimentally independent, discussed in §4.2. The thought experiment must stipulate no causal interaction between the two draws to fully pump one’s intuition that the two draws are independent. They are obviously not ‘background theory’ independent, since presumably both draws require the same background theories. But as we argue in §4, background-theory independence is too coarse-grained to serve as a full analysis of independent evidence, as the urn example illustrates.

independence. We want to articulate a condition that will rule out dyssynergistic evidence, defined above.

In probabilistic terms, a condition that rules out DE will be one that ensures that Pr(H|E1 & E2) > Pr(H|E1) and mutatis mutandis for E2. The following condition is sufficient to rule out dyssynergy:

Conditional Probabilistic Independence (henceforth CPI)

Two pieces of evidence, E1 & E2, are CPI if and only if Pr(E1 & E2|H) =

Pr(E1|H) × Pr(E2|H) and Pr(E1 & E2|~H) = Pr(E1|~H) × Pr(E2|~H).

In other words, if both E1 and E2 confirm H and they are probabilistically independent conditional on H, then the conjunction must be more confirmatory than each individual conjunct. We prove this in the appendix (§4.7).
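One compressed way to see why (the appendix gives a full reductio): writing the result in odds form, CPI makes the likelihood ratios of the two pieces of evidence multiply,

Pr(H|E1 & E2)/Pr(~H|E1 & E2) = [Pr(H)/Pr(~H)] × [Pr(E1|H)/Pr(E1|~H)] × [Pr(E2|H)/Pr(E2|~H)] = [Pr(H|E1)/Pr(~H|E1)] × [Pr(E2|H)/Pr(E2|~H)].

Since E2 confirms H, Pr(E2|H)/Pr(E2|~H) > 1 (assuming 0 < Pr(H) < 1), so the posterior odds on H given both pieces of evidence exceed the odds given E1 alone; by symmetry the same holds with E2 in place of E1.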

CPI correctly characterizes the evidence in Fitelson’s urn case as independent, and this works for all three prominent confirmation measures.36 This is not sufficient to show that it is an adequate mathematization of the intuitive notion of OI used in the robustness arguments canvassed in §4.2. In order to gauge whether CPI adequately captures this notion, thus providing some explanation for its epistemic significance, we need a fuller notion of what sort of independence these arguments appeal to.

36 This is because, for all three measures under consideration, c(H, E1) > c(H, E2) if and only if Pr(H|E1) > Pr(H|E2), and CPI guarantees that Pr(H|Ei & Ej) > Pr(H|Ei).


4.6 Conclusion

We have shown that OI is insufficient for robustness, given DE, and we have also shown that CPI is sufficient for robustness.37 This mismatch between OI and CPI should not be surprising.38 CPI is a hypothesis-relative notion. Its epistemic significance is that it rules out DE, which is also a hypothesis-relative notion. The notions of independence used by philosophers in robustness arguments – the various versions of OI – are not hypothesis-relative. The independence they rely on is a relation between modes of evidence, irrespective of the particular hypothesis being tested. A light microscope and an electron microscope are ontically independent sources of evidence simpliciter, not independent relative to some hypothesis.39

37 One might respond by suggesting that OI is valuable not because OI is sufficient for robustness, but because OI tends to achieve robustness (perhaps because OI tends to be CPI). That is, one might think OI is a valuable heuristic for determining if the independence condition for robustness arguments is met because OI tracks CPI with some satisfactorily high probability. Fair enough. But the burden for such a view would then be to determine how frequently, and in what kinds of epistemic situations, OI tracks CPI. 38 OI is not sufficient for robustness, but is it necessary? Here the fine details of the particular account of OI matter. There are some considerations that tell in favor of the necessity of OI. The conjunctive fork structure embodied by CPI, if represented as a Bayesian network, requires that two pieces of evidence E1 and E2, if they are CPI relative to H, must have no common ancestors that are not either ancestors or descendants of H. (This is assuming that the Markov condition holds.) That is, there must be no other path that correlates E1 and E2. This could be seen as embodying the requirement that the causal paths governing the two modes be distinct, or that there is an independence of background theories underlying the modes. However, merely showing this necessity does not license attempts to explain the value of OI using CPI. For this, OI must be sufficient for CPI. 39 Howson and Urbach (1989) propose a probabilistic measure for the diversity of a set of evidence that is not hypothesis-relative. The measure tracks the degree to which the pieces of evidence are unconditionally correlated. This measure is criticized in Wayne (1995), Myrvold (1996) and Fitelson (2001). If all the evidence is supposed to be evidence for a single hypothesis, then unconditional independence is an inappropriate condition. As Fitelson points out, newspaper reports and radio reports of the same


If, as we have said, the mismatch between OI and CPI is unsurprising, why waste all this ink pointing it out? Because philosophers of science have not always been careful about this distinction. There is a tendency to take formal results akin to

CPI as explanations of the value of OI. For instance, Lloyd (2009) appeals to

Fitelson’s notion of independence as an explanation of why it is valuable to test global climate models using distinct data sets: “We have fulfilled these conditions [of confirmational independence] when, for example, the ocean heat variable is tested against an ocean temperature observational data set and the pressure variable at given locations is tested against the observed pressures.” The pressure data and the ocean temperature data are plausibly ontically independent, but why think they are confirmationally independent? As we have seen, the first condition does not ensure the second. Further argument is needed to secure the link, but Lloyd does not provide it.

Fitelson himself, in his discussion of independent evidence, characterizes it as capturing the intuition that evidence from celestial and terrestrial domains of a theory provides stronger confirmation than evidence from just one domain (Fitelson 2001 f.n.

20).40 Again, the relevant intuition here seems to be about the independence of the causal structure undergirding the modes by which the evidence is gathered, and the link to confirmational independence is not clearly made.

baseball game are plausibly independent modes of evidence about the game, but they are certainly not unconditionally independent. Myrvold patches the correlation view to emphasize independence conditional on a hypothesis, but this returns us to a hypothesis-relative conception of independence and hence cannot characterize OI. 40 To be fair, however, Fitelson is one of the only philosophers to note that “it is not evidence of different ‘kinds’ per se that boost confirmational power. Rather, it is data whose confirmational power is maximal, given the evidence we already have that are confirmationally advantageous.”


More broadly, as we saw in §4.2, most explications of robustness arguments are made in terms of OI, but since we have shown that OI is insufficient to rule out dyssynergistic evidence, it follows that OI is insufficient as a basis for explaining the value of robustness. As we have seen, a number of authors move from the independence of modes of evidence in some scientific episode to a claim that the hypothesis is therefore more strongly confirmed; the assumption seems to be that OI evidence justifies a robustness argument. But as long as DE is possible, the former does not justify the latter. Some additional condition beyond OI must be appealed to – some condition that rules out dyssynergy.

We have shown that CPI provides a justification for robustness arguments based on the concordance of multimodal evidence, because CPI guarantees the avoidance of DE. CPI requires discussion of the nature of the hypothesis, the relevant alternative hypotheses, and the connection of these hypotheses to the evidence, rather than merely depending on the relationship between the different modes of evidence.

Although CPI is sufficient to justify robustness, it is not clear that CPI is epistemically accessible to experimentalists. For CPI to be epistemically accessible, one must be able to determine whether the probabilistic equalities that constitute CPI are true. That is, one would have to know that Pr(E1 & E2|H) = Pr(E1|H)×Pr(E2|H), and that Pr(E1 & E2|~H) = Pr(E1|~H)×Pr(E2|~H), in addition to knowing that both E1 and E2 confirm H. If the probabilities that constitute CPI are not epistemically accessible, then we are left in a quandary: to justify robustness arguments, CPI is sufficient but epistemically inaccessible.


4.7 Appendix

Assume:

(1) p(H|E1) > p(H)

(2) p(H|E2) > p(H)

(3) p(E1 & E2|H) = p(E1|H)×p(E2|H)

(4) p(E1 & E2|~H) = p(E1|~H)×p(E2|~H)

We want to show: p(H|E1 & E2) > p(H|E1)

Assume for reductio: (5) p(H|E1 & E2) ≤ p(H|E1)

(6) p(E1 & E2|H)×p(H)/p(E1 & E2) ≤ p(E1|H)×p(H)/p(E1) (from 5, Bayes' Theorem (BT))

(7) p(E1 & E2|H)/p(E1 & E2) ≤ p(E1|H)/p(E1) (from 6)

(8) p(E1 & E2|H)/p(E1|H) ≤ p(E1 & E2)/p(E1) (from 7)

(9) p(E2|H) ≤ p(E1 & E2)/p(E1) (from 3, 8)

(10) p(E2|H)×p(H)/p(E2) > p(H) (from 2, BT)

(11) p(E2|H) > p(E2) (from 10)

(12) p(E1 & E2)/p(E1) > p(E2) (from 9, 11)

(13) 1 - p(H|E1 & E2) ≥ 1 - p(H|E1) (from 5)

(14) p(~H|E1 & E2) ≥ p(~H|E1) (from 13)

(15) p(E1 & E2|~H)×p(~H)/p(E1 & E2) ≥ p(E1|~H)×p(~H)/p(E1) (from 14, BT)

(16) p(E1 & E2|~H)/p(E1 & E2) ≥ p(E1|~H)/p(E1) (from 15)

(17) p(E1 & E2|~H)/p(E1|~H) ≥ p(E1 & E2)/p(E1) (from 16)


(18) p(E2|~H) ≥ p(E1 & E2)/p(E1) (from 4, 17)

(19) p(E2|~H) > p(E2) (from 12, 18)

(20) p(~H|E2)×p(E2)/p(~H) > p(E2) (from 19, BT)

(21) p(~H|E2) > p(~H) (from 20)

(22) 1 - p(H|E2) > 1 - p(H) (from 21)

(23) p(H|E2) < p(H) (from 22)

Contradiction between lines 2 and 23, so we infer the falsity of line 5. Our four assumptions commit us to the claim that p(H|E1 & E2) > p(H|E1). A similar argument could easily be constructed to show that the assumptions also commit us to the claim that p(H|E1 & E2) > p(H|E2).
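As a brute-force sanity check on this result (a Python sketch, not a substitute for the proof), one can sample joint distributions that satisfy assumptions (1)-(4) by construction and confirm the conclusion in each case; the sampling ranges below are arbitrary.

```python
import random

# Sample distributions satisfying (1)-(4) and check p(H|E1 & E2) > p(H|E1).
random.seed(0)
for _ in range(100_000):
    pr_h = random.uniform(0.01, 0.99)
    a1, a2 = random.uniform(0.01, 0.99), random.uniform(0.01, 0.99)  # p(E1|H), p(E2|H)
    b1, b2 = random.uniform(0.01, 0.99), random.uniform(0.01, 0.99)  # p(E1|~H), p(E2|~H)
    if not (a1 > b1 and a2 > b2):
        continue  # with 0 < p(H) < 1, a_i > b_i is equivalent to assumptions (1) and (2)
    # Assumptions (3) and (4), i.e. CPI, hold by construction below.
    pr_e1 = a1 * pr_h + b1 * (1 - pr_h)
    pr_e1e2 = a1 * a2 * pr_h + b1 * b2 * (1 - pr_h)
    assert a1 * a2 * pr_h / pr_e1e2 > a1 * pr_h / pr_e1  # p(H|E1 & E2) > p(H|E1)
print("no counterexample found")
```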

Acknowledgment

This chapter was co-written with Tarun Menon.

CHAPTER 5: ROBUSTNESS, DISCORDANCE AND RELEVANCE

Abstract

Robustness is a common platitude: hypotheses are better supported with evidence generated by multiple techniques that rely on different background assumptions.

Robustness has been put to numerous epistemic tasks, including the demarcation of artifacts from real entities, countering the “experimenter’s regress,” ensuring appropriate data selection, and resolving evidential discordance. Despite the frequency of appeals to robustness, the notion itself has received little philosophical explication.

Although robustness may be valuable in ideal evidential circumstances (that is, when evidence is concordant), often when a variety of evidence is available from multiple modes, the evidence is discordant. Further, if the multimodal evidence relies on bottleneck techniques then one might be presented with pseudorobustness.


5.1 Introduction

Robustness is a recent term for a common platitude: hypotheses are better supported with evidence generated by multiple techniques that rely on different background assumptions. Robustness is often presented as an epistemic virtue that ensures objectivity.41 Champions of robustness claim that robust evidence can demarcate artifacts from real entities, counter the "experimenter's regress," ensure appropriate data selection, and resolve evidential discordance. Consider the worry about artifacts: if a new technique shows x (an entity, process, relation, etc.), this might be due to a quirky aspect of the technique rather than to x being a real feature of the world.

Response: if x is observed with multiple methods it is extremely unlikely that x is an artifact (Hacking 1983). Consider the “experimenter’s regress”: good evidence is generated from properly functioning techniques, but properly functioning techniques are just those that give good evidence (Collins 1985). Response: this vicious experimental circle is broken if we get the same result from multiple techniques (Culp

1994). Consider the concern about data selection: scientists use only some of their data, selected in various ways for various reasons, and the rest is ignored – but how do we know that the selection process gives true results? Response: vary the selection criteria, and invariant results are more likely to be true (Franklin 2002). Finally, consider discordant data: multiple experimental results are not always coherent –

41 Presumably robustness admits of degrees, but as I will argue, one of the key challenges facing robustness is the difficulty (or impossibility) of specifying the degree of robustness for any hypothesis.

which results should we believe? Response: conduct more experiments until they yield convergent results.

Robustness is valuable in ideal evidential circumstances, when all available evidence is concordant. One difficulty for robustness is that often when multimodal evidence is available for a given hypothesis, the evidence is discordant (§5.3). The value of robustness is mitigated by the problem of discordant evidence (§5.4). Given the vicissitudes of evidence, scientists must choose sets of evidence which they deem most relevant to their given tasks (§5.5). Evidence of varying degrees of quality is more or less confirming of and more or less relevant to a particular hypothesis.

Further, we lack systematic methods for assessing and combining multimodal evidence, and without such methods, robustness is limited to a qualitative or intuitive notion. Finally, robustness is fallible, as demonstrated by cases of ‘pseudorobustness’

(§5.6).

5.2 Two Preliminary Problems

Generating concordant multimodal evidence is difficult. Scientists need different kinds of evidence generated by independent techniques, but they do not always have multiple techniques to study the same subject. New techniques are introduced into scientific practice for a good reason: they give evidence on a new subject, or on a smaller or larger scale, or in a different context, than existing techniques. Even if multiple techniques do exist, it is not always clear that the techniques are independent. Bechtel (2006) argued that often new techniques are

calibrated to existing techniques, and so even when both techniques provide concordant results, the techniques cannot be said to be independent.

Furthermore, as argued in Chapter 4, specifying the criteria that should be used to determine independence between modes is a difficult problem. Simply put, having independent modes of evidence, and knowing that they are properly independent, is difficult; since robustness requires multiple modes of evidence, an incomplete or vague individuation of evidential modes will leave robustness as an incomplete or vague notion, and hence robustness-style arguments will be vague or inconclusive.

This is not to claim that robustness is a useless methodological strategy –

Perrin’s arguments for the existence of molecules, the canonical example based on concordant multimodal evidence, was convincing – it is simply to state what scientists already know: generating multimodal evidence is difficult. If the two challenges discussed here are met, then multimodal evidence is available. Trouble looms for robustness if multimodal evidence is discordant.

5.3 Discordance

If concordant multimodal evidence provides greater epistemic support to a hypothesis, it is unclear what support is provided to a hypothesis in the more common situation in which multiple techniques give results that are inconsistent or incongruent.

Franklin recently raised the problem of discordance, and suggested that it can be readily solved by various methodological strategies, which prominently include the strategy of generating more data from independent techniques (2002). While I think

Franklin is correct to identify discordance as a problem for what he calls the


“epistemology of evidence”, and his appeal to a plurality of reasoning strategies is valuable, the trouble is that often the sheer plurality of modes is precisely what is responsible for discordance.

Discordance is really two separate problems of evidence: inconsistency and incongruity. Inconsistency is straightforward: Petri dishes suggest x and test tubes suggest ¬x. In the absence of a methodological meta-standard, there is no obvious way to reconcile various kinds of inconsistent data. Incongruity is even more troublesome.

How is it even possible for evidence from different types of experiments to cohere?

Evidence from different types of experiments is often written in different ‘languages’.

Petri dishes suggest x, test tubes suggest y, mice suggest z, monkeys suggest 0.8z, mathematical models suggest 2z, clinical experience suggests that sometimes y occurs, and human case-control studies suggest y while randomized control trials suggest ¬y.

To consider multimodal evidence as evidence for the same hypothesis requires a greater or lesser number of inferences between evidential modes. The various 'languages' of different modes of evidence can be translated into other languages with the right background assumptions. If techniques actually have independent background assumptions, they might simply generate incommensurable data. The background assumptions necessary for such translations have varying degrees of plausibility.

Another dimension of discordance is the degree of “intensity” or salience of results. Consider Galison’s distinction between golden-event experiments and statistical experiments within the particle physics community (1987). Golden-event experiments give “intense” evidence for particularly rare events. In contrast, statistical experiments measure more frequent but less striking events. If different kinds of

evidence with different intensities support opposite conclusions, there is no obvious way to compare or combine such evidence in an orderly or quantifiable way.

Philosophers have long wished to quantify the degree of support that evidence provides to a hypothesis. At best, the problem of discordance implies that robustness is limited to a qualitative notion. But if robustness is a qualitative notion, how do we demarcate robust from non-robust evidence? At worst, the problem of discordance implies that evidence of different kinds cannot be combined in any coherent way.

That multiple independent techniques often display discordant evidence is an empirical claim. Some might think this a weakness of the above argument. However, the opposite is, of course, also an empirical claim – that multiple independent techniques often display concordance – and this is an empirical claim which seems false. History of science might occasionally provide examples of apparent concordance, but concordance is easier to see in retrospect, with a selective filter for reconstructions of scientific success. Much history of science tends to focus on the peaks of scientific achievement rather than the winding paths in the valleys of scientific effort – at least, the history of science that philosophers often appreciate, like

Nye’s account of Perrin’s arguments for atoms, is history of scientific success.

Philosophers have focused on the peaks of scientific success, but the lovely paths of truth in the valleys of scientific struggle are often discordant.

Philosophy of science often considers idealizations of evidence.42 In ideal evidential circumstances, robustness is a valuable epistemic guide. Real science is

42 Carnap, for example, developed confirmation theory “given a body of evidence e”, without worrying about what constitutes a “body” (1950).

almost never in ideal evidential circumstances; recent historical and sociological accounts of science have reminded philosophers of this messy detail. The following examples illustrate discordance, though they should hardly be needed: discordance is ubiquitous.

5.4 Dog’s Breakfast Evidence

Epidemiologists do not know how the influenza virus is transmitted from one person to another. The mode of infectious disease transmission has been traditionally categorized as either “airborne” or “contact”.43 A causative organism is classified as airborne if it travels on aerosolized particles through the air, often over long distances, from an infected individual to the recipient. A causative organism is classified as contact if it travels on large particles or droplets over short distances and can survive on surfaces for some time. Clinicians tend to believe that influenza is spread only by contact transmission. Years of experience caring for influenza patients and observing the patterns of influenza outbreaks has convinced them that the influenza virus is not spread through the air. If influenza is an airborne virus, then patterns of influenza transmission during outbreaks should show dispersion over large distances, similar to other viruses known to be spread by airborne transmission. Virtually no influenza outbreaks have had such a dispersed pattern of transmission. Moreover, nurses and physicians almost never contract influenza from patients, unless they have provided close care of a patient with influenza.

43 I simplify for purposes of exposition.


Conversely, some scientists, usually occupational health experts and academic virologists, believe that influenza could be an airborne virus. Several animal studies have been performed, with mixed conclusions. One prominent case study often referred to is based on an airplane that was grounded for several hours, in which a passenger with influenza spread the virus to numerous other passengers. Based on seating information and laboratory results, investigators were able to map the spread of the virus; this map was evidence that the influenza virus was transmitted through the air. More carefully controlled experiments are difficult. No carefully controlled human experiments can be performed for ethical reasons. However, in the 1960s researchers had prisoner ‘volunteers’ breathe influenza through filters of varying porosity; again, interpretations of results from these experiments were varied, but suggested that influenza could be airborne. Mathematical models of influenza transmission have been constructed, using parameters such as the number of virus particles emitted during a sneeze, the size of sneeze droplets upon emission, the shrinking of droplet size in the air, the distance of transmission of particles of various size, and the number of virus particles likely to reach a ‘target’ site on recipients. The probability of airborne influenza transmission is considered to be relatively high given reasonable estimates for these parameters.

Even when described at a coarse grain, the various types of evidence regarding the mode of influenza transmission illustrate the problem of discordance. Some scientists argue (using mathematical models and animal experiments) that influenza is transmitted via an airborne route, whereas others argue (based on clinical experience and observational studies) that influenza is transmitted via a contact route. Such

discordance demonstrates the poverty of robustness: multiple experimental techniques and reasoning strategies have been used by different scientists, but the results remain inconclusive. A single case-study does not, of course, demonstrate the ubiquity of discordance; rather, the case-study is merely meant as an illustration of what is meant by discordance.

It is easy to find examples of discordance in epidemiology, because they are common. Indeed, the examples of contradictory meta-analyses on the same hypotheses discussed in Chapter 7 are all examples in which the primary-level evidence is discordant. There is even some empirical work beginning to be published on the phenomenon of discordance. For instance, recently an epidemiologist examined all original studies published in leading medical journals between 1990 and 2003 which had been cited over 1000 times (Ioannidis 2005). There were 49 such studies – these are some of the most highly regarded medical studies of our time – and 45 of them gave positive evidence for the efficacy of a medical intervention, such as hormone replacement therapy for menopausal women, daily aspirin to mitigate blood pressure, or vitamin E to reduce heart disease. Of these 45 studies, 14 were contradicted in various ways by subsequent studies, and 11 had not been replicated. In other words

31% of the most highly regarded medical studies had, only a short time later, been contradicted by discordant evidence. Since these were high impact studies, an a fortiori argument could be made regarding the discordance of studies with less impact.

We are used to discordance in epidemiology since we often receive vacillating guidance on health in our daily papers. Discordance, though, is ubiquitous in all disciplines of science. I expect this to be obvious to historians of science, since a

stock-in-trade technique of some historians is to study controversy, which is often fueled by discordance. Nevertheless, the examples illustrate, I hope, that robustness arguments could be more broadly applied, and applied more rigorously, if a method of amalgamating evidence were available for such cases.

The ‘oxidative phosphorylation controversy’ was a dispute in the 1960s and

1970s about how cells generate energy via metabolism. This was described by Weber

(2005), who suggested that the resolution of the ox-phos controversy (as it is sometimes called) was possible only when there was a “combination of all the reconstitution experiments done in Racker’s laboratory that provided the crucial evidence” (p. 108). Weber includes a table which lists the various modes of evidence in favor of the chemiosmotic hypothesis, one of the contender hypotheses, and the various modes of evidence in favor of the other primary contender, the ‘chemical hypothesis’ (p. 104). The evidence from the various modes was discordant, with several modes favoring one hypothesis and several modes favoring the other. In the end, strictly speaking, there was no “combination” of the multimodal evidence, since scientists, then as now, do not know how to systematically combine evidence from such different modes. Rather, it was a consideration of evidence from different kinds of experiments that compelled most biochemists to accept the chemiosmotic hypothesis. Weber calls these reconstitution experiments “crucial”, but he notes that there was not a single crucial experiment. In an earlier discussion of the ox-phos case,

Allchin (1992) describes the eventual evidence as an "ensemble of empirical demonstrations" rather than crucial. Weber does not further pursue the question of how multimodal evidence can be assessed concomitantly to decide between competing

112 hypotheses; he “does not think that there exist sound methodological principles that would allow this” (p. 105), though he does not argue for this. My view is that, even if

Weber is correct, given the ubiquity and epistemic importance of discordant multimodal evidence in contemporary science, policy, and law, the task of developing and justifying methodological principles to assess and amalgamate multimodal evidence should be a priority for theoretical scientists and philosophers of science.

The history of earth sciences also presents us with examples of discordance. At the end of the nineteenth century there were two competing views of the earth’s evolution. One view – what Oreskes (1999) calls the European view – proposed that the earth was in a state of flux with total interchangeability of its parts: oceans could elevate into continents and continents could collapse into oceans. The American view, on the other hand, proposed that geological change was confined to discrete zones between continents. As Oreskes argues:

The two theories also differentially weighed the available facts. The American perspective emphasized the physical properties of minerals, the contrasting compositions of continental rocks and the ocean floor, and the asymmetry of folding in the Appalachians. The European view emphasized the biogeographical patterns, the stratigraphic evidence of interchangeability of land and sea, and the diverse patterns of folding in European and African mountain belts. (p. 19)

Then early in the twentieth century evidence mustered to support Wegener’s theory of continental drift included paleontological parallels between continents, stratigraphic parallels between continents, and the jigsaw-puzzle fit of continents (see Oreskes 1999

p. 273).44 But there was also evidence against continental drift: for example, some data indicated that the Earth's mantle is rigid.

For another example of discordant multimodal evidence, consider research on the relationship between the biodiversity of an ecosystem and the bioproductivity of the ecosystem. It has long been observed that there is a correlation between the biodiversity of an ecosystem, measured by species counts, and the bioproductivity of an ecosystem, measured in various ways but typically by biomass. By the 1980s a

‘hump-shaped’ relation between biodiversity and bioproductivity was thought by many to be the ‘true’ relation: at low to mid levels of biodiversity, as biodiversity increased bioproductivity increased, but then the correlation reaches a maximum of bioproductivity after which as biodiversity increases bioproductivity decreases.

Evidence from many different modes is available on the biodiversity-bioproductivity relation, including controlled laboratory experiments (in which biodiversity is manipulated by proxy parameters such as water or fertilizer), field experiments, observational studies, and models. Recently a group of ecologists attempted to survey all the available evidence on the biodiversity-bioproductivity relation (Mittelbach et al.

2001). They included 172 studies which varied significantly with respect to how biodiversity and bioproductivity were measured, the scale of measurements, and the

44 Based on this multimodal evidence Wegener himself made a robustness argument: the “totality of these points of correspondence constitutes an almost incontrovertible proof” (cited in Oreskes 1999 p. 57-8). But it was this method of proof that critics noted when rejecting his theory: “My principal objection to the Wegener hypothesis rests on the author’s method. This, in my opinion, is not scientific, but takes the familiar course of an initial idea, a selective search through the literature for corroborative evidence, ignoring most of the facts that are opposed to the idea…” (cited in Oreskes 1999 p. 126).

kinds of ecosystems studied. The observed biodiversity-bioproductivity relations in the 172 studies were discordant: 39 of the studies confirmed the standard hump-shaped relation, 10 studies supported the opposite u-shaped relation, 45 showed a straightforward positive linear relation, 23 showed a negative linear relation, and 39 showed no relation at all. The available evidence on the biodiversity-bioproductivity relation is truly a dog's breakfast.45

Sometimes the metaphor of ‘weighing’ evidence is used when discordant multimodal evidence is available. The metaphor of ‘weight of evidence’ is old. For instance, the frontispiece of the New Almagest, by the Jesuit Giovanni Battista Riccioli

(1598-1671), centers around an image of a scale, weighing a geocentric model against a heliocentric model. The book “weighed” forty-nine arguments for a Copernican system against seventy-seven arguments against the earth’s motion; the arguments

“are so many of such a kind that with some ability, the intellect can be inclined towards one or the other hypothesis”46 – though the scale in Riccioli’s frontispiece is clearly tipping towards the geocentric model. The decisive arguments, for Riccioli, were scriptural (Westman, forthcoming).

An appeal to the weight of evidence is often made in cases of evidential discordance, but there is little understanding about what weight of evidence is. A recent review identified three ways in which the notion of weight of evidence is used in epidemiology: metaphorical, in which weight of evidence means a collection of evidence with an unspecified method of combining the evidence; methodological, in

45 Incidentally, given the discordance, this group decided not to do a meta-analysis and stuck with a simple vote-counting method of summarizing the multimodal evidence. 46 Cited in Westman (forthcoming).

which weight of evidence means using some established method for combining or comparing evidence (for example, using systematic reviews, meta-analyses, or causal criteria); or theoretical, where evidence is combined by something like a Kuhnian paradigm or a label for a conceptual framework (Weed 2005). This schema seems roughly right. Given discordant multimodal evidence, one can 'weigh' the epistemic import of the evidence metaphorically, as did the epidemiologist Marmot mentioned in

Chapter 1; one can weigh the epistemic import of the evidence more rigorously with a method of combining the multimodal evidence; or one can diminish the apparent diversity of the evidence by unifying the evidence into a common theoretical structure.

But in the absence of this latter option, if one wants more rigor and constraint than an appeal to metaphor, one needs a method of amalgamating multimodal evidence.

This section has described several examples of discordance. One might respond: discordance is not a problem for robustness, since by definition robust evidence is generated when multiple modes give concordant evidence for the same hypothesis. To appeal to discordant evidence as a challenge for robustness simply misses the point – robustness just is concordance of multiple kinds of evidence, so no one would say that evidence which is discordant could also be robust. Fair enough – the problem of discordance is not a knockdown argument against the value of robustness per se. Rather, discordance demonstrates an important constraint on the value of robustness. Robustness and its corresponding methodological prescription – get more data! (of different kinds) – are trivially valuable. However, the prescription to get more data, from different kinds of experiments, is not something that scientists need to be told – they already follow this common-sense maxim. I share the intuition

that multimodal evidence does (often) provide greater epistemic support to a hypothesis than does monomodal evidence – at least when all independent techniques are concordant. Unfortunately, multimodal evidence, when available, is often discordant. Further, robustness-style arguments presuppose a principled and systematic method of assessing and amalgamating multimodal evidence, and without such methods of combining evidence, robustness arguments are merely intuitive or qualitative.

Franklin suggests that robustness helps resolve discordant data, but I have argued the converse: discordant data diminishes the epistemic value of robustness.

Epistemic guidance is needed most in difficult cases, when multiple independent techniques produce discordant evidence. In such cases robustness is worse than useless, since the fact of multiple modes of evidence is the source of the problem. Real science is almost always confronted with the problem of discordance; in such circumstances scientists must decide which evidence is most relevant. We could call this the "new-new problem of induction" – are there principled, compelling methods to amalgamate discordant multimodal evidence? I canvass several possibilities for amalgamating evidence in Chapter 6.

5.5 Relevance

If evidence for a particular hypothesis from all types of techniques is concordant, then scientists do not need to choose which mode of evidence is more or less relevant to the hypothesis. At least, they are not faced with contradictory evidence. But given discordant evidence, scientists are faced with choices: data from

some techniques support a hypothesis, while other data do not (inconsistency), or worse, data from some techniques simply require too many implausible assumptions to consider them as evidence for the same hypothesis as evidence from other techniques (incongruity). Indeed, the basis of many scientific controversies is the problem of relevance: one group of scientists believes that evidence from some techniques is relevant to a hypothesis while another group of scientists believes that evidence from other techniques is more relevant.

Cartwright has argued that modes of evidence are of varying quality and are of varying relevance to a given policy; but of course, the problem of differential quality and relevance of evidence also applies to other kinds of hypotheses. There is often a trade-off between quality and relevance. Evidence of high quality could be generated from experiments with low relevance; such experiments Cartwright calls ‘clinchers’.

Evidence of high relevance could be generated from experiments that are of low quality; Cartwright calls these experiments ‘vouchers’. The challenge facing policy makers is as much a challenge for any scientist considering a hypothesis of relatively general scope: some evidence must be selected as relevant from discordant data generated by multiple kinds of techniques. Which kinds of data are most relevant to the hypothesis? Which kinds of data are high quality? Scientists and policy makers can consider data from all kinds of experiments, or only data from high quality experiments, or only data from one particular kind of experiment. How should they choose?

Scientists lack universal criteria for making decisions regarding relevance, though particular disciplines do have criteria for determining what counts as high

quality evidence. As Galison argues, one tradition in physics considers an image of a

“golden event” to be high quality evidence. The evidence based medicine movement rank-orders various kinds of experiments, with the randomized control trial (RCT) considered the highest quality of evidence; prospective cohort studies, case-control studies, observational studies, and case reports normally follow RCTs in descending order of quality. However, since high quality evidence is not necessarily the most relevant to a given policy (or hypothesis), Cartwright has argued that multiple kinds of evidence must be considered (and not just ‘clinchers’).47 Deliberation about a policy or a general hypothesis should use evidence which is ideally both high quality and relevant, but this is not often available. To illustrate, I will continue the example based on the mode of influenza transmission; though again, such an example should hardly be necessary, since the problem of relevance is ubiquitous.

Despite ignorance about the mode of influenza transmission, policy makers in public health jurisdictions around the world have been expected to develop guidelines regarding the type of mask that should be provided to healthcare workers in the case of an influenza pandemic. If influenza is spread through the air, then active filtration masks are necessary; these masks are cumbersome, must be custom-fitted to the face of healthcare workers, and are relatively expensive. If influenza is spread by contact transmission, then surgical masks are sufficient; surgical masks are cheap, readily accessible, and do not require custom fitting. The guideline written by the U.K.

Department of Health is exemplary: the authors claim that the “balance of evidence points to large droplet and direct and indirect contact as the most important routes of

47 See also Worrall (2002).

transmission" of influenza.48 The trouble is that what "balance of evidence" means in this guideline is unspecified. The policy makers had no principled method to amalgamate the discordant multimodal evidence regarding the mode of influenza transmission, nor did the policy makers have criteria to determine which particular kinds of evidence were most relevant to the question of what kinds of masks should be recommended to healthcare workers during an influenza pandemic. Vague appeals to balancing evidence are perhaps all that can be done, given discordance. Unfortunately, as argued in Sections 2 and 3, arguments that appeal to a balance of evidence can support incorrect conclusions.

Determining what evidence is relevant could mitigate the problem of discordance. A discordant evidential situation could be rendered more concordant if some of the discordant evidence were deemed less relevant. At the beginning of this section I suggested that perhaps many scientific controversies are controversies just because different scientists consider different kinds of evidence relevant. Conversely, scientific controversies could be closed by determining what evidence is relevant to the undecided hypothesis. However, this just shifts the difficulty from the problem of discordance to the problem of relevance.

5.6 Pseudorobustness

In Wimsatt’s canonical discussion of robustness, he raised the problem of

‘pseudorobustness’ (1981). To put this concept in my terms, pseudorobustness occurs

48 Hedging their bets, or worried about massive lawsuits, the writers of this guideline also claim: “Airborne, or fine droplet transmission, may also occur.”

when evidence which seems to be multimodal is concordant for some hypothesis, but the evidence in question is later shown to not be sufficiently independent (and hence not truly multimodal). Here I describe two cases of pseudorobustness: one about a hypothesis later deemed false and one about a hypothesis later deemed true.

A detailed case-study by Rasmussen provided an instance of this problem: multiple methods of preparing samples for electron microscopy demonstrated the existence of what is now considered an artifact (1993). The fact that such evidential diversity was used as an argument for the reality of an artifact mitigates the epistemic value of robustness. The problem demonstrated by Rasmussen can be generalized: concordant multimodal evidence can support an incorrect conclusion.49 Defenders of robustness, though, have an easy response: the hypothesis of the existence of mesosomes was not robust, but rather was pseudorobust. The mesosome case, then, merely illustrates the fallibility of pseudorobustness. But before considering the response it will help to have more details of the case.

During the 1950s and 1960s, cell biologists claimed that they had discovered a new structure within Bacillus cells. Fitz-James (1960) called such structures

‘mesosomes’ and suggested that such structures were organelles of many bacteria.

Fitz-James used multiple fixation techniques to preserve the bacteria, and observed mesosomes using both electron microscopes and light microscopes. Fitz-James also

49 One might think that multiple invalid arguments that reach the same conclusion give no better reason to believe this conclusion than a single invalid argument reaching the same conclusion. Similarly, multiple methodologically mediocre experiments, or multiple epistemically unrelated experiments, or multiple experiments with implausible background assumptions, give no better reason to believe a hypothesis than does a single experiment (let alone a single well-performed experiment with plausible background assumptions).

claimed to observe mesosomes in living bacteria by using a special staining technique.

Thus, prima facie, it appears that a robustness argument could have been made for the existence of mesosomes.50 This is precisely what Fitz-James did.51 The ground for such a robustness argument continued to get stronger. A large body of work, which continued into the 1970s, was concerned with determining the function of the mesosome. Biochemists purified mesosomes and subjected them to numerous tests, and this was done with multiple species of bacteria.

Ultimately most scientists came to think that mesosomes are artifacts, and this shift was partially due to new methods of fixing bacteria. Hudson (1999) argued that this suggests that the relevant scientists were less swayed by the concordance of multimodal evidence than they were by evidence from a single mode which was deemed more reliable than other modes (regardless of the discordance between evidence from the later mode and the earlier multimodal evidence). Hudson calls this

‘reliable process reasoning.’ In Chapter 4 I suggested that this case is an illustration of the fallibility of robustness. This seems now to be a hasty conclusion. Rather, as above, the case illustrates what Wimsatt called pseudorobustness. The modes of evidence presented by Fitz-James relied on ‘bottleneck’ techniques associated with

50 Though as Rasmussen (1993) notes: “it is hard to map between light and electron microscope observations because images are so different. [sic]” 51 Fitz-James made a robustness argument despite discordance; Rasmussen (1993) argued that concordance was used as an argument for robustness but discordance could be explained away by appealing to differences in experimental set-up: “similarities between prior and new observations are evidence that mesosomes really exist, but differences in mesosome appearance are due to differences in observation conditions.”

methods of preparing the bacteria.52 The reliance on bottleneck techniques implies that the available evidence was not sufficiently independent for a robustness argument to be made. It is reasonable to ask whether or not the reliance on bottleneck techniques can be identified prospectively. If it is possible to identify bottleneck techniques, then a natural methodological prescription would be: if one wants to make a robustness argument, one must avoid bottleneck techniques. If it is impossible to identify bottleneck techniques (for epistemic reasons, rather than for the reason that bottleneck techniques are not used), then a natural methodological prescription would be: do not make robustness arguments. The mesosome case is one in which the actors themselves did not worry about the reliance on bottleneck techniques. The following case is another example of pseudorobustness in which it was possible to identify bottleneck techniques.

Not all situations of pseudorobustness support false hypotheses. The evidence presented in AMM1944 (discussed in Chapter 2) was apparently multimodal: it was generated by multiple qualitative and quantitative chemical tests. This evidence was concordant: the various experiments yielded concordant evidence which suggested that the TS was DNA. However, all the experiments depended on bottleneck techniques. One type of organism was used, one genetic marker analyzed, and one method of extracting, purifying, and administering the transforming factor was developed; not surprisingly, the reliance on these bottleneck techniques was subject

52 With the exception of the observations on live cells. However, Rasmussen notes that “in the case of Fitz-James’ Janus green experiment, the specimen was alive which by light microscopy standards assures freedom from artifact, but the identity of the stained invaginations and mesosomes seen in thin section is only presumptive [sic].”

to criticism: for instance, this was precisely the focus of Mirsky's attacks. Mirsky emphasized that Avery's diverse evidence relied on bottleneck techniques. Thus, the concordant multimodal evidence presented in AMM1944 should be considered a case of pseudorobustness (recognized as such by critics like Mirsky). But in contrast to the mesosome case, this is an example in which scientists achieved pseudorobustness about a finding that we now consider (roughly) true. Pseudorobustness need not necessarily lead one astray.

One way to avoid pseudorobustness is to avoid bottleneck techniques.

However, in Chapter 4 I presented a stronger version of pseudorobustness: what I called dyssynergistic evidence (DE). In contrast to cases of pseudorobustness based on bottleneck techniques, there is no way to prospectively identify cases of DE.

Parts of this chapter appeared as “Robustness, Discordance, and Relevance.” 2009.

Philosophy of Science 76(5):650-661.

CHAPTER 6: AMALGAMATING EVIDENCE

Abstract

Given a diverse set of evidence, how ought we combine it in order to provide systematic constraint on the assessment of a hypothesis? Different conceptions of evidence suggest different methods for amalgamating multimodal evidence. This chapter describes several possible methods for amalgamating multimodal evidence.


6.1 Amalgamating What?!

A long tradition in philosophy considers evidence to be something that an individual has.53 Another tradition in philosophy considers evidence to be something that is publicly accessible.54 A passage from Carnap (1963) nicely illustrates the respective virtues of the two views of evidence:

Under the influence of some philosophers, especially Mach and Russell, I regarded in the Logischer Aufbau a phenomenalistic language as the best for a philosophical analysis of knowledge. I believed that the task of philosophy consists in reducing all knowledge to a basis of certainty. Since the most certain knowledge is that of the immediately given, whereas knowledge of material things is derivative and less certain, it seemed that the philosopher must employ a language which uses sense-data as a basis. In the Vienna discussions my attitude changed gradually toward a preference for the physicalistic language … Neurath in particular urged the development toward a physicalistic attitude … one of the most important advantages of the physicalistic language is its intersubjectivity, i.e., the fact that the events described in this language are in principle observable by all users of the language (1963: 50-52)

I mention these two ways of understanding evidence not because I intend to referee between them, but rather because doing so helps to focus my motivating question.

When I pose the question of amalgamating evidence, what is supposed to be amalgamated? By amalgamation of multimodal evidence, one might mean the

53 Russell and early logical positivists thought that evidence came in the form of sense data; Quine claimed that evidence was the stimulation of nerve endings; subjective Bayesians consider evidence to be one’s certain beliefs; so-called ‘evidentialists’ hold that true beliefs are justified only if they are based on one’s evidence; Williamson (2000) considers evidence to be the set of propositions that one knows. 54 Such a view of evidence preserves the common sense intuition that evidence is in the world, is an indicator of other less accessible things in the world (as smoke indicates fire), and can thus be appealed to by multiple people to achieve a shared understanding of the world.

amalgamation of the data, the likelihoods or error probabilities,55 other types of facts accepted as evidence, the outputs of an inference rule, or the scientists’ judgments based on the evidence. I consider each possibility.

6.2 Amalgamating Data

If data are quantitative then they can be summarized in numerous ways. Three common summaries of data are the mean (and variance), effect size, and odds ratio.

When evidence is available from multiple modes, and if such evidence can be neatly summarized by one of these measures, then the measure could be amalgamated by calculating a weighted average of the measure. Which of these measures (or others) should be chosen depends on the hypothesis of interest and the form of the data from the modes – for example, a hypothesis about a single parameter measurement might be best determined by the mean of the data measuring the parameter, and in this case it is meaningless to talk about an effect size or an odds ratio. But if a causal hypothesis is tested by a controlled experiment, then the effect size could be calculated by subtracting the value of the effect variable in the control group from the value of the effect variable in the experimental group. The weight is a numerical assessment of the various features of the mode and the evidence which are epistemically important to the scientist (discussed in Chapter 2).
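To make the procedure concrete, here is a minimal sketch of such a weighted average; the effect estimates and the weights are hypothetical placeholders rather than values from any actual study, and the weighting scheme is only illustrative:

# Sketch of amalgamating one effect measure across modes by a weighted average.
# The weights stand in for whatever numerical assessment of the epistemically
# important features of each mode a scientist settles on (hypothetical values).

def weighted_average(effects, weights):
    """Combine per-mode effect estimates into a single amalgamated estimate."""
    if len(effects) != len(weights) or not effects:
        raise ValueError("need one weight per effect estimate")
    return sum(e * w for e, w in zip(effects, weights)) / sum(weights)

effects = [0.42, 0.35, 0.51]   # hypothetical effect sizes from three modes
weights = [3.0, 1.5, 1.0]      # hypothetical weights reflecting quality and relevance
print(weighted_average(effects, weights))   # roughly 0.417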

55 The two leading theories of scientific inference are confirmation theory (most prominent in this class are Likelihoodism and Bayesianism) and classical statistics (most prominently represented by the theories of Fisher, Neyman and Pearson). For a Bayesian or likelihoodist, the epistemic import of evidence is represented by the likelihood, and for a frequentist the import of the evidence is represented by error probabilities.


This procedure is fine for situations in which a plurality of evidence is available from the same mode, and the evidence from each study or publication can be summarized with a chosen measure, weighted accordingly, and then an average effect measure calculated. An example of this methodology is described in detail in Chapter 7. Such situations are frequent in medicine: numerous RCTs are often available for a particular hypothesis, and evidence from them is amalgamated by meta-analysis.

However, if data are generated by the same method, then such evidence is not multimodal. The purported epistemic value of multimodal evidence is not manifest with data generated by the same method. Recall the multimodal evidence presented in

AMM1944 (Chapter 2). One method used in Avery’s laboratory was chemical analysis of the transforming substance, which showed that the amounts of carbon, hydrogen, nitrogen, and phosphorus were close to the theoretical values for DNA; another method tested the effect of trypsin, chymotrypsin, and ribonuclease (protein and ribonucleic acid degrading enzymes) on the activity of the transforming substance – protein and ribonucleic acid degrading enzymes had no effect on the transforming substance; still another method measured the ultraviolet absorption of the transforming substance, and such measures were characteristic of DNA. Several other independent methods also supported the hypothesis that the transforming substance was DNA.

Taken together the multimodal evidence had the element of surprising concordance associated with experimental robustness arguments (“who could have guessed it?” wrote Avery to his brother). But such evidence is incommensurable. This word has plenty of baggage thanks to Kuhn, who used it to describe relations between

paradigms. I mean the word literally: the multimodal evidence presented in AMM1944 had no common standard of measurement.

One way to render evidence measured on different metrics commensurable would be to have a translation principle that converts the measure from one mode into a measure with the same meaning as a measure from another mode. For example, suppose we are testing the ability of drug A to lower blood sugar levels in people with diabetes: we perform one experiment testing the ability of drug A to lower blood sugar levels in mice, and another experiment testing the ability of drug A to lower blood sugar levels in people with diabetes. The result of the first experiment is that A lowers blood sugar levels in mice by m mg/dL and the result of the second experiment is that A lowers blood sugar levels in patients with diabetes by p mg/dL. If we want to amalgamate the evidence from these two experiments by the procedure described in this section, then we need a translation principle like the following: p mg/dL = xm mg/dL, where x is a multiplication factor representing the ratio of the effect of drug A in people to the effect of drug A in mice. If x = 1, then an experiment showing a decrease in mouse blood sugar levels of m mg/dL would lead us to expect a decrease of m mg/dL in people with diabetes.

Common methodological guidance for doing a meta-analysis is to avoid mixing evidence from different modes in the way suggested by the above paragraph.


But with translation principles between modes, there is no reason in principle why a meta-analysis should not include evidence from multiple modes.56
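A minimal sketch of how such a translation principle might be applied before averaging; the factor x, the weights, and all numbers below are hypothetical placeholders, and x itself would require independent empirical justification:

# Sketch: converting an effect measured in one mode (mice) into the metric of
# another mode (people) via a translation principle p = x * m, then averaging.
# The factor x and all numbers are hypothetical.

x = 0.6                      # assumed ratio of the human effect to the mouse effect
mouse_effect_mg_dl = 30.0    # observed decrease in blood sugar in mice (mg/dL)
human_effect_mg_dl = 18.0    # observed decrease in people with diabetes (mg/dL)

# Translate the mouse result into its human-equivalent value.
translated_mouse_effect = x * mouse_effect_mg_dl   # 18.0 mg/dL

# Once on a common metric, the two results can be averaged (weights hypothetical).
effects = {"mouse (translated)": translated_mouse_effect, "human": human_effect_mg_dl}
weights = {"mouse (translated)": 1.0, "human": 2.0}
combined = sum(effects[k] * weights[k] for k in effects) / sum(weights.values())
print(combined)   # 18.0 mg/dL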

6.3 Amalgamating Error Probabilities

Statistical tests originally developed by Fisher, Neyman and Pearson and their students, sometimes called frequentist statistics, are ubiquitous in science. Such tests are applied to a set of data to estimate the value of a parameter, to compare the mean and variance of multiple sets of data, and to determine the probability that such tests commit an error. More specifically, a common class of methods in classical statistics is known as ‘hypothesis testing’, in which a ‘null hypothesis’ (H0) is formulated as a statement about the absence of a difference in the values of a parameter between two data sets, and tests are performed to calculate the probability of falsely rejecting H0 (a

Type I error) and the probability of falsely accepting H0 (a Type II error). The terms

‘error statistics’ and ‘error probabilities’ come from this latter feature of classical statistics.57

It would be a mistake to think that the probabilities associated with classical statistical tests can be meaningfully amalgamated. These probabilities are properties of tests, rather than frequencies of the object of study or probabilities measuring someone’s subjective belief. Such probabilities cannot be directly thought of as

56 Of course, it makes sense to inquire about the justification for these translation principles. Such translation principles are justified to the extent that extrapolating the results from one mode to predictions about results from another mode is justified, which is entirely an empirical question. 57 One of the outstanding disputes in philosophy of science is about the foundations of statistics, with Bayesians on one side and frequentist statisticians on the other; this, though, is not the place to lay out the features of this debate.

evidence for or against a hypothesis of interest. The probabilities associated with classical statistical tests are probabilities that a particular test will avoid Type I or

Type II errors. Although multiple tests each have associated error probabilities, this is a superficial similarity and not one that has much meaning between tests. For this reason, error probabilities from statistical tests on data from multiple modes cannot meaningfully be amalgamated. To do so would be akin to amalgamating the probability that the next flip of a coin will be heads with the probability that the next draw from this deck will be the jack of diamonds as a way of estimating my success in the casino tonight.58

However, Mayo (1996) argues for an ‘evidential’ interpretation of classical statistical tests: the output of a classical statistical test can be thought of as evidence for or against one’s hypothesis of interest, but only qualitatively. If a statistical test suggests that we should ‘reject the null hypothesis’, with low probabilities for associated Type I and Type II errors (in Mayo’s terms, if the hypothesis of interest passes a ‘severe’ test), then this is evidence in support of our hypothesis of interest. On this evidential interpretation of classical statistical tests, it is important to remember that the support provided by the output of the statistical test to the hypothesis of interest is qualitative. The associated error probabilities are not probabilities that the hypothesis of interest is true, regardless of whether such a probability is construed as an objective fact about the world or as a subjective degree of belief. Although the test outputs are quantitative, the evidential interpretation of such outputs is not. The

58 Although some early statisticians such as Fisher and Pearson proposed combining p-values across studies, as I discuss in Chapter 7, meta-analyses today are performed by combining effect sizes rather than p-values.

evidential interpretation of the output of classical statistical tests supports qualitative comparisons of best-supported hypotheses, but does not support quantitative determinations of the absolute measure of support for any hypothesis. That is, one can conclude that a particular set of data provides good evidence, or no evidence, for or against H, but one cannot conclude that a particular set of data provides evidence that

H is confirmed or disconfirmed by a particular amount.

If one were to amalgamate the output of a classical statistical test, one would be amalgamating qualitative assessments of the support that evidence from various modes provides to a hypothesis. These qualitative assessments could take the form of simple claims of evidential support – “this data is good evidence for H” – or contrastive claims of evidential support – “dataset A is better evidence for H1 than dataset B is evidence for H2” (where A could be the same as B or H1 could be the same as H2). These assessments would generate an ordinal ranking of hypotheses, and so when presented with multimodal evidence, the assessments of evidence from each mode could be amalgamated by any method suited for amalgamating sets of ordinal rankings. Social choice theorists have studied such methods in technical detail. In

Chapter 8 I consider a framework of ordinal rankings of hypotheses by multimodal evidence, and I import results from social choice theory.
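As an illustration of the kind of method social choice theorists have studied, here is a minimal sketch of a Borda count applied to hypothetical rankings of hypotheses by three modes; it is offered only as one familiar example of amalgamating ordinal rankings, not as the framework defended in Chapter 8:

# Sketch: amalgamating ordinal rankings of hypotheses from several modes with a
# Borda count. The hypotheses, modes, and rankings are all hypothetical.

from collections import defaultdict

rankings_by_mode = [
    ["H1", "H2", "H3"],   # mode 1 ranks H1 best, H3 worst
    ["H2", "H1", "H3"],   # mode 2
    ["H1", "H3", "H2"],   # mode 3
]

scores = defaultdict(int)
for ranking in rankings_by_mode:
    n = len(ranking)
    for position, hypothesis in enumerate(ranking):
        scores[hypothesis] += n - 1 - position   # top place earns n-1 points

amalgamated = sorted(scores, key=scores.get, reverse=True)
print(amalgamated, dict(scores))   # ['H1', 'H2', 'H3'] {'H1': 5, 'H2': 3, 'H3': 1}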

6.4 Bayesian Amalgamation

This section considers how one might amalgamate evidence using a Bayesian approach. Bayesian conditionalization is a rule for revising one’s probability of a hypothesis upon receiving evidence. If a scientist learns e, and pold(H) is the scientist’s

assessment of the probability of a hypothesis before receiving the evidence, then pnew(H) – the scientist’s assessment of the probability of a hypothesis after receiving the evidence – should equal pold(H|e). Since this latter term is a conditional probability, it can be calculated using Bayes’ Theorem (BT):

(BT) p(H|e) = p(e|H)p(H)/p(e)

This suggests a possible way to amalgamate multimodal evidence, based on what is sometimes called ‘strict conditionalization’ (SC): we could update the probability of the hypothesis by sequentially conditionalizing with Bayes’ Theorem for each mode of evidence.59

(SC) pnew(H) = pold(H|e) = p(e|H)pold(H)/p(e)

One would arbitrarily order available modes from 1 to n, and then use Bayes’

Theorem to update the probability of the hypothesis sequentially for each mode: the posterior probability of the hypothesis after updating on evidence from mode i would become the prior probability of the hypothesis for updating on evidence from mode i+1. The probability of the hypothesis after conditionalizing on the evidence from the first mode would be as above, substituting numerical subscripts for evidence from each mode in place of ‘old’ and ‘new’:

p(H|e1) = p(e1|H)p(H)/p(e1)

The posterior probability, p(H|e1), would then be the ‘new’ prior probability, p(H) for updating by evidence from the next mode, e2:

p(H|e2) = p(e2|H)p(H|e1)/p(e2)

59 Dutch Book arguments are meant to show that one is rationally required to use SC to learn from evidence.


This sequential updating would continue until the evidence from the final mode, en, was used to update the penultimate probability of the hypothesis, p(Hf-1), to determine the final probability of the hypothesis, p(Hf):

p(Hf) = p(en|H)p(Hf-1)/p(en)
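A minimal sketch of this sequential updating, assuming (contrary to what I argue below is typically possible) that the relevant likelihoods are known for each mode; all numbers are hypothetical, and p(ei) is computed here by the law of total probability:

# Sketch: sequential strict conditionalization across modes. Each mode supplies
# hypothetical likelihoods p(e_i|H) and p(e_i|not-H); the posterior after one
# mode becomes the prior for the next.

def bayes_update(prior, lik_h, lik_not_h):
    """One application of Bayes' Theorem: returns p(H|e)."""
    marginal = lik_h * prior + lik_not_h * (1 - prior)
    return lik_h * prior / marginal

prior = 0.2   # hypothetical initial p(H)
modes = [
    (0.8, 0.3),   # mode 1: p(e1|H), p(e1|not-H)
    (0.7, 0.4),   # mode 2
    (0.9, 0.2),   # mode 3
]

for lik_h, lik_not_h in modes:
    prior = bayes_update(prior, lik_h, lik_not_h)

print(round(prior, 3))   # final probability of H after all three modes (0.84 here)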

Some Bayesians might consider this approach to be the best way to amalgamate multimodal evidence. Several conditions must be met for this method of sequential conditionalization. For all modes of evidence, all terms in Bayes’ Theorem must be known: that is, for all modes i, p(ei|H) must be known (discussed in Chapter

3); the initial p(H) must be known (this condition has generated much worry, known as the ‘problem of the priors’); and for all modes i, p(ei) must be known. I argued in

Chapter 3 that determining these terms in practice is often impossible. Consider again the evidence presented in AMM1944. What was the probability that the particular UV absorption values that Avery’s group measured would be the case assuming that the transforming substance was DNA? What was the prior probability that the transforming substance was DNA? What was the prior probability that the particular UV absorption values that Avery’s group measured would be the case? Now repeat these questions for the other five modes of evidence presented in AMM1944 that I describe in Chapter 2.

Also troubling is that Bayes’ Theorem requires the scientist using the theorem to know e to be true once e has been observed. In most scientific contexts this is unrealistic. Consider an example given by Skyrms (1986): suppose I see a bird at dusk, and I identify it as a black raven, but because of the evening light, I do not hold the proposition “the bird is a black raven” as my evidence e with perfect confidence


(that is, with a probability of 1). Rather, I might believe e to be true with probability

0.95. Jeffrey (1965) proposed a modification of Bayesian conditionalization to deal with cases in which evidence is uncertain (which, it is reasonable to suppose, is wholly ubiquitous in science). Jeffrey conditionalization (JC), sometimes referred to as

‘probability kinematics’, is as follows: given multimodal evidence ei one’s updated probability in H, pnew(H), should be:

(JC) pnew(H) = ∑i=1…n pold(H|ei)pnew(ei)

In other words, this is a weighted average of strict conditionalization. To use JC for amalgamating multimodal evidence, one would sequentially update the probability of the hypothesis using JC, similar to the sequential procedure used with SC.
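Here is a minimal sketch of a single Jeffrey update on one uncertain piece of evidence, using the two-member partition {e, not-e} as in the raven example; all probabilities are hypothetical:

# Sketch: one step of Jeffrey conditionalization on uncertain evidence e,
# over the partition {e, not-e}. All numbers are hypothetical.

def jeffrey_update(prior_h, lik_e_given_h, lik_e_given_not_h, new_prob_e):
    """Return p_new(H) = p_old(H|e)*p_new(e) + p_old(H|not-e)*p_new(not-e)."""
    prob_e = lik_e_given_h * prior_h + lik_e_given_not_h * (1 - prior_h)
    post_h_given_e = lik_e_given_h * prior_h / prob_e
    post_h_given_not_e = (1 - lik_e_given_h) * prior_h / (1 - prob_e)
    return post_h_given_e * new_prob_e + post_h_given_not_e * (1 - new_prob_e)

# Hypothetical update on evidence held with confidence 0.95, as in the
# raven-at-dusk example; the same step could be repeated for further modes.
print(round(jeffrey_update(prior_h=0.3, lik_e_given_h=0.8,
                           lik_e_given_not_h=0.2, new_prob_e=0.95), 3))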

Bayesianism is beset with many well-known problems. This is not the place to rehearse such problems. But are there any problems with Bayesianism that arise specifically with respect to amalgamating multimodal evidence? A condition of the particular method described above was an arbitrary ordering of the modes. Whatever ordering is chosen should not affect the final probability of the hypothesis.

Unfortunately, it is a common complaint against JC that it is ‘non-commutative’ – the order in which particular pieces of evidence are used to update the probability of the hypothesis makes a difference to the final probability of the hypothesis (see van

Fraassen 1989). This problem could be mitigated if there were a way of ordering modes which was superior to others. One might think that if we ordered modes by quality, and used JC on the highest quality mode first and subsequently conditionalized on modes in decreasing order of quality, then the non-commutative property of JC would at least be minimized, because evidence from lower quality

modes ought to have a lower impact on the hypothesis anyway. The trouble with this approach is that, despite what some have claimed in particular domains such as evidence-based medicine, there is no general, decontextualized way to order modes of evidence according to a unitary desideratum such as quality (see, e.g., Cartwright

2007b). As I urged in Chapter 2, there are multiple evidential features that determine the import of a mode or a particular piece of evidence, and in Chapter 3 I argued that there is no best ordering of these evidential features. Thus, one cannot resolve the non-commutativity of JC by ordering modes based on a ranking of superiority.

6.5 Amalgamating Judgments

Different scientists reach different conclusions regarding the support that particular evidence provides to a hypothesis. The history of science is full of such examples, and in Chapter 3 I gave an argument for why we ought to consider such variable assessments of evidence to be reasonable. One might think that a good way to amalgamate multimodal evidence would be to amalgamate the judgments of a group of scientists. Such judgments could have been reached by scientists assessing evidence from all available modes (these assessments would depend on the individual scientists’ disciplinary proclivities), and making inferences on this evidence using any preferred inductive rules of the respective scientists.

The form that these judgments take (votes, probabilities, intuitions…) depends on one’s theory of scientific inference, which in turn constrains the possible ways of amalgamating such judgments. In other words, the forms that such judgments can be expressed in are just the forms of the output of the inductive methods discussed above.


Probabilists such as Bayesians think that the appropriate form that a scientist’s judgment should take is a probability; the amalgamation of judgments would then be the amalgamation of such probabilities.60 Conversely, others think that the appropriate form that a scientist’s judgment should take is an ordinal ranking of hypotheses; the amalgamation of judgments would then be the amalgamation of such rankings (this is the subject of Chapter 8). Since I have already briefly canvassed various possible technical methods for amalgamating evidence in these forms, in what follows I discuss an important non-technical method of amalgamating judgments: deliberation.

A deliberative approach to multimodal evidence amalgamation is a social process of bringing together representatives of multiple perspectives, or individuals who might evaluate different modes of evidence differently, rather than bringing together the multimodal evidence itself.61 A ‘consensus conference’ is an example of a deliberative process.62 Evidence from multiple disciplines or professions is considered, and these professions often have conflicting standards of evidence and policy

60 Lehrer and Wagner (1981) give technical results on the amalgamation of individuals’ beliefs. 61 Not all deliberative approaches are designed to be cooperative; deliberative processes can be adversarial. Amalgamating legal evidence exemplifies an adversarial method of deliberation, and some have advocated for an adversarial approach to amalgamating multimodal scientific evidence (e.g. the ‘Science Court’ proposed by Kantrowitz in the 1960s). Moreover, there is no common structure to consensus conferences: variable features include who or what the participants represent, the methods of deliberation, the variety of evidence considered, the length of deliberations, and the format of communicating the conclusions. 62 Denmark was one of the first countries to implement contemporary consensus conferences; the Danish model invites randomly selected citizens to engage with experts on scientific questions that have a technological or policy implication, such as genetic engineering for agriculture, air pollution, or genetic testing for health insurance.

interests.63 Moreover, features of the dispute other than evidence from high quality modes are considered, including questions regarding the relevance of evidence to a general hypothesis of interest, social values that relate to policies based on the hypothesis, such as cost, safety, and equity, and other pragmatic issues such as operational feasibility of policies based on the hypothesis and the importance of public perception of the dispute or agreement.

At least some consensus conferences are primarily attempts to resolve disputes regarding scientific hypotheses, and such disputes are often fueled by discordant multimodal evidence. Contemporary examples of consensus conferences are common in the biomedical and social sciences. Another early contemporary model of consensus conferences is the U.S. National Institutes of Health (NIH) model; NIH consensus conferences differ from the Danish model in that they are usually convened to settle on factual conclusions rather than policy guidance. It is this epistemic goal of consensus conferences which is my interest.

Given that consensus conferences sometimes have other aims besides settling on an agreement about a matter of fact, non-epistemic standards can sometimes be used to assess their success. For example, since consensus conferences often involve the participation of citizens, and since the political legitimacy of a decision is an aim of some consensus conferences, Douglas (2005) poses the following evaluative

63 Naturally, the process of bringing together experts in an attempt to resolve disagreement and settle on a fact of the matter is probably as old as organized humanity. One of the more infamous examples of a consensus conference is the 1616 meeting of the commission of theologians, or Qualifiers, who came to a formal consensus that the hypothesis of a motionless sun is “foolish and absurd in philosophy.”

question: “Has citizen involvement helped to bring citizen values into the heart of technical judgment?” Although this is an important question, it is not my aim to evaluate AMs with Douglas’ criterion. Rather, I want to know if consensus conferences constrain intersubjective assessments of hypotheses; this is my primary standard for evaluating meta-analysis in Chapter 7.

The standard of ability to constrain intersubjective assessments of hypotheses is more modest than ‘ability to get at the truth’.64 Solomon (2007) claims specifically about NIH consensus conferences that they have “never been assessed for the accuracy of outcomes. No-one has investigated, for example, whether the outcomes are better – more ‘true’ or whatever – than those achieved by other methods such as non-neutral panels or formal meta-analysis of evidence.” The trouble with holding an

AM to this standard is that the point of gathering multimodal evidence, or multiple experts’ judgments, is to gather our best indicators of the truth. We cannot assess the accuracy of the outcomes of an AM such as consensus conferences, nor their ability to get at the truth, unless we have an independent indicator of the truth, in which case we would have no need for the AM in the first place. This consideration applies even when the independent indicator of truth is itself another AM (like meta-analysis): judging the veracity of consensus conferences based on their agreement with meta-analyses assumes that the meta-analyses are themselves veracious. Thus AMs should

64 Some argue that knowledge is what an ideal epistemic community would, in the long run, eventually agree on (for instance, this is one interpretation of Peirce’s notion of convergence to the truth). Moreover, some argue that knowledge is just what an actual epistemic community settles on (see, for example, Kusch 2003), and so if intersubjective assessment of hypotheses were tightly constrained, then knowledge would be achieved. Though I will not argue the point here, since many others have done so, the conflation between consensus and knowledge should be rejected.

be assessed not based on whether they are true or accurate, but at the very least we ought to hope that an AM can constrain intersubjective assessments of hypotheses.

Whether or not consensus conferences can achieve constraint is an empirical question that has primarily been addressed by descriptions of a handful of case studies; there have been few, if any, systematic empirical assessments of consensus conferences.65 However, a cursory glance at several examples suggests that the function of consensus conferences appears to be: uncertainty in, uncertainty out.

A consensus conference with an indeterminate outcome was convened in 2006 by the Public Health Agency of Canada to address how the influenza virus is transmitted from person to person. The Contact Hypothesis is that influenza is transmitted by direct or indirect touching, and the Airborne Hypothesis is that influenza can be transmitted on fine droplets over long distances through the air; the trouble is that available evidence on influenza virus transmission is discordant and multimodal: some modes of evidence support the Contact Hypothesis and other modes support the Airborne Hypothesis.66 The consensus conference involved representatives from relevant disciplines, including virology, clinical infectious diseases, occupational health, and epidemiology. The disputants did not come to a settled view regarding how influenza is transmitted from person to person. Moreover, similar consensus conferences on the same topic were held around the world in various jurisdictions,

65 Though Solomon (2007) notes that consensus conferences have been assessed on their freedom from bias, including Rand Corporation in 1983, a group at University of Michigan in 1987, and the NIH in 1999. 66 I describe this case in more detail in Chapter 5.

including the United States, the United Kingdom, and the World Health Organization, none of which had a settled outcome.

An older example of a deliberative approach to amalgamating multimodal evidence was given by John Beatty (2006). He describes a group of geneticists in the

1950s who had the task of estimating the minimum threshold of radiation that humans could safely be exposed to before undergoing genetic mutation. Hermann Muller was one of the geneticists most worried about mutation, and had won the Nobel Prize in

1946 for related work. Other prominent geneticists had argued that genetic variation was valuable for evolution, and so (they argued) it was difficult to know just how detrimental new radiation-induced variation would be. But the disagreement between the geneticists as a group and the Atomic Energy Commission (AEC) was even greater; the stakes for the AEC were high:

AEC officials sometimes claimed that the biological (including genetic) effects of radiation exposure from bomb testing and other sources were negligible … they most chose to rebut Muller by emphasizing the lack of consensus among geneticists... (Beatty 2006 p. 56)

To respond to the AEC, the U.S. National Academy of Sciences (NAS) organized a panel of geneticists, chaired by the Rockefeller official (and non-geneticist) Warren

Weaver, to develop guidance on an acceptable radiation level. The meeting began with

Weaver encouraging the geneticists to communicate a sense of certainty and agreement to the public. However, during and after the meeting, given the discordant multimodal evidence which was available for this question, the NAS panel disagreed by over three orders of magnitude on estimates of radiation danger. But the final report, published in the New York Times, claimed that there was “no disagreement as

to fundamental conclusions” (emphasis in original). The primary goal of the NAS panel was epistemic, but Beatty argues that a secondary goal was to mask disagreement (which they succeeded at), and to develop guidelines before another group, especially the AEC, did so. Regardless, this is another example of a consensus conference which failed to achieve constraint (despite the panel’s attempt to mask such failure).

Solomon (2007) reaches a similar conclusion regarding the inability of consensus conferences to achieve constraint, at least in situations of discordant multimodal evidence. Her argument can be construed as a dilemma: consensus conferences occur either i) before scientific consensus has been achieved, or ii) after scientific consensus has been achieved (through means other than consensus conferences). If (ii), then the consensus conference did not achieve constraint, because constraint was already achieved; if (i), then the consensus conference cannot achieve constraint, because the evidence is still too contentious for consensus to be reached. So either way consensus conferences do not achieve constraint. Her conclusion is more modest than this reconstruction: “the window for usefulness [of consensus conferences] is small – after there is enough evidence to reach a conclusion but before

the research community itself has reached consensus” (169).67

Deliberative approaches may use formal or quantitative methods as part of the process of assessing multimodal evidence. The NAS geneticists described by Beatty

67 Elsewhere Solomon argues that what I am calling constraint might not be desirable, since if constraint were met, then insights of dissenting opinions and data could be lost, perhaps due to ‘groupthink’ (2006). I agree that this is a worry, but only if constraint were achieved cheaply, by ignoring the complexity and diversity of evidence.


(2006), for example, calculated numerical averages of their estimates of minimum acceptable radiation levels and of the number of genetic defects expected at given levels of radiation. Of course, quantitative or formal methods can be more sophisticated than this, and above I described several such possible approaches.

However, some advocates of deliberative approaches have been critical of formal approaches; one criticism is that technical algorithms for assessing and amalgamating multimodal evidence “bury under a series of assumptions many value judgments”

(Lomas et al. 2003). In situations of discordant multimodal evidence, critics of formal approaches to amalgamation have claimed that discordance “cannot be resolved by an appeal to science,” and when faced with discordance, “the search for some formula or set of principles designed to provide decision-making rules will always prove elusive”

(Klein and Williams 2000). But conversely, others argue that formal approaches are more objective, since these approaches counter the subjectivity and uncertainty in deliberative, qualitative amalgamation methods.

Solomon, for instance, suggests that rather than rely on consensus conferences to amalgamate evidence, it would be “quicker, more timely, and at least as good to do a meta-analysis of the available evidence” (169). Moreover, Solomon, along with many others (especially statisticians and those associated with groups like the

Cochrane Collaboration), suggests that formal AMs such as meta-analysis are more objective, since they counter the subjective biases and uncertainty in qualitative AMs such as consensus conferences. Chapter 7 scrutinizes this appeal to the objectivity of formal AMs such as meta-analysis.


6.6 Summary

I have discussed several possible approaches to amalgamating multimodal evidence. However, the possibilities that I consider for amalgamating multimodal evidence are limited in two ways. First, I only consider methods of amalgamation of multimodal evidence for a single hypothesis; I do not consider methods of amalgamation of multimodal evidence for a complex of interrelated hypotheses. A common practice to amalgamate evidence in such cases is to build a causal model or a mechanism which describes a network of causal relations. We might have evidence from mode 1 that A causes B, evidence from mode 2 that B causes C, and evidence from mode 3 that C causes D, and a common way to amalgamate this multimodal evidence is to construct a hypothetical model of these causes, such as A→B→C→D.

Similarly, phylogenetic trees are constructed on the basis of multimodal evidence, and inasmuch as a single hypothesis regarding species relations is based on multimodal evidence (morphological similarity, genetic similarity, ecological niche patterns, etc.), the considerations in this chapter apply. However, the broader aim of a phylogenetic tree is to relate multiple hypotheses regarding descent into an interrelated nexus of hypotheses regarding common descent. While each of these hypotheses might be supported by evidence from different modes, and so one might construe phylogenetic trees as a method for amalgamating evidence, phylogenetic trees are in this way like mechanistic models: they are methods for amalgamating multimodal evidence on multiple interrelated hypotheses rather than on a single hypothesis.


The second limitation to my discussion of the possibilities for amalgamating multimodal evidence is that I do not entertain an ‘explanationist’ account of amalgamating multimodal evidence. Such an approach to amalgamating multimodal evidence might be useful in some contexts, especially if the available multimodal evidence were concordant. To consider Hacking’s (1983) example, suppose a certain cellular structure is observed with a light microscope, an electron microscope, and an ultraviolet microscope. The best explanation for this concordant multimodal evidence is that this cellular structure is a real part of the cell (rather than an artifact of any particular microscope). For situations in which the multimodal evidence is discordant, however, there is often no obvious ‘best explanation’ available. Moreover, explanationism does not address the worries regarding rigor of amalgamation that the above methods are meant to address – to any purported ‘best explanation’ of multimodal evidence one can ask why this explanation is indeed the best, and to properly answer that question a more rigorous method of amalgamation is needed.

Despite these two limitations, this chapter has attempted to canvass various possible methods for amalgamating multimodal evidence. How to amalgamate multimodal evidence depends on what one means by evidence: data, likelihoods or error probabilities, accepted facts, inference outputs, or scientists’ judgments. In the next chapter I investigate in more detail a method for amalgamating data: meta-analysis.

CHAPTER 7: IS META-ANALYSIS THE PLATINUM STANDARD OF EVIDENCE?

Abstract

An astonishing volume and diversity of evidence is available for many hypotheses in the biomedical and social sciences. Some of this evidence – usually from randomized controlled trials (RCTs) – is amalgamated by meta-analysis. The status of RCTs as the

‘gold-standard’ of evidence in medicine is debated. It is usually meta-analyses, though, rather than RCTs, which are considered the best source of evidence: meta-analysis is thought to be the platinum standard of evidence. However, I argue that meta-analyses fall far short of that standard. Different meta-analyses of the same primary evidence can reach contradictory conclusions. Meta-analysis fails to provide objective grounds for guiding belief because numerous decisions must be made when performing a meta-analysis, which allow wide latitude for subjective idiosyncrasies to influence the results. I end by discussing an older tradition of evidence in medicine: the plurality of reasoning strategies appealed to by the epidemiologist Sir Bradford

Hill (1897 - 1991).


7.1 Introduction

Biomedical and social scientists are faced with a daunting volume and diversity of evidence for many hypotheses of interest. For example, by 1985 there had been over 700 studies on the relationship between class size and academic achievement, over 800 studies on the effectiveness of psychotherapy, and 120 studies testing if the phase of the moon affects human behavior (Smith & Glass 1977; Glass &

Smith 1979; Rotton and Kelly 1985). This avalanche of evidence contributed to the formation of groups like the Cochrane Collaboration, to journals dedicated to publishing reviews rather than primary research (e.g. Annual Review of Genetics or

Epidemiologic Reviews), and to methods of amalgamating a large volume and diversity of evidence, such as meta-analysis.

Here is the United Kingdom National Health Service definition:

Meta-analysis: a mathematical technique that combines the results of individual studies to arrive at one overall measure of the effect of a treatment.

Meta-analysis is done on studies that produce effect sizes as outcomes (or at least studies that have outcomes in forms from which one can determine an effect size).

Meta-analysis is performed by i) selecting which primary studies are to be included in the meta-analysis, ii) calculating the magnitude of the effect due to a purported cause for each study, iii) assigning a weight to each study, which is often determined by the size and the quality of the study, and then iv) calculating a weighted average of the

effect magnitudes.68 A frequent goal of using meta-analysis is to discover true causal relationships and to determine the magnitude of an effect for a particular magnitude of a purported cause.

I begin by discussing the history of meta-analysis and the aims which analysts seek to achieve by using meta-analysis (§7.2). These aims are constraint – the use of meta-analysis should constrain intersubjective assessment of hypotheses – and objectivity – meta-analysis should be performed in a way which limits the influence of subjective biases and idiosyncrasies of particular researchers. I raise several cases to show that the use of meta-analysis can fail to achieve constraint (§7.3). Meta-analysis fails to constrain belief in hypotheses because numerous decisions must be made when performing a meta-analysis which allow wide latitude for subjective idiosyncrasies to influence the results of a meta-analysis. Some of these decisions are particular to technical meta-analyses and others are relevant to any situation in which a scientist is presented with diverse evidence. The bulk of my argument involves a close examination of the methodological details of meta-analysis (§7.4). Although meta-analysis is often used in the biological, human, and social sciences, my focus will be on medical research. I draw on the published guidance of the Cochrane Collaboration, one of the primary organizations of evidence-based medicine which commissions a large number of meta-analyses, to help describe the methodology of meta-analysis.

Finally, I end by discussing an alternative, older, and arguably better approach to

68 This is a typical approach to meta-analysis, exemplified, for example, by Cooper (2010), a recent meta-analysis textbook written by a leading social scientist.

evidence in medicine (§7.5), associated with the epidemiologist Sir Bradford Hill

(1897 - 1991).

Many arguments have been proposed debating whether or not randomized controlled trials (RCTs) provide the best evidence for causal hypotheses in medicine and the social sciences (see Borgenson 2008; Worrall 2002, 2007). Nancy Cartwright

(2007), for instance, asks “Are RCTs the gold standard?” to which she answers ‘no’.

However, despite the debates surrounding the gold-standard status of RCTs, it is in fact meta-analysis which is at the top of the most prominent evidence hierarchies in medicine and social policy (e.g. the evidence ranking schemes of the Oxford Centre for Evidence-Based Medicine, the Scottish Intercollegiate Guidelines Network, and the Australian National Health and Medical Research Council).69 It is widely thought that meta-analysis is the platinum standard of evidence. In this chapter I criticize the platinum standard status of meta-analysis.

7.2 Constraint and Objectivity

The first comprehensive meta-analysis performed on a single hypothesis with evidence from multiple sources was about extra-sensory perception (Rhine et al.

1940), which is a nice historical accident because Ian Hacking (1988) showed that the practice of randomizing subjects into different groups also began in psychical research

– thus both our gold standard of evidence and our platinum standard of evidence come

69 As I discuss below, however, those meta-analyses which are usually considered to be the best are those which include only RCTs.

from research in paranormal psychology.70 Meta-analysis later became the platinum standard of evidence in medicine and the social sciences for several reasons. The sheer volume of available evidence meant that most users of evidence (e.g. physicians or policy-makers) could not be aware of all relevant evidence; a proposed solution was to produce systematic reviews of the available evidence. By the 1990s, hundreds of meta-analyses were being published every year, and recently the number of published meta-analyses has exceeded two thousand per year (Sutton and Higgins 2008).

Meta-analysis became a prominent method in part due to the purported rigor of meta-analyses compared with qualitative methods of amalgamating evidence: in contrast to unsystematic literature reviews or consensus conferences, for example, meta-analyses have both quantitative inputs and outputs. The importance of a systematic way of amalgamating evidence became apparent by the 1970s, when scientists began to review a plethora of evidence with what some took to be personal idiosyncrasies: “A common method for integrating several studies with inconsistent findings is to carp on the design or analysis deficiencies of all but a few studies – those remaining frequently being one’s own work or that of one’s students or friends” (Glass

1976). An example of such a review is Pauling (1986), in which the Nobel Laureate cited dozens of his own studies supporting his pet hypothesis that large doses of vitamin C can reduce the risk of catching a cold, and yet he did not cite any studies contradicting this hypothesis, though several had been published (Knipschild 1994). A recent textbook on meta-analysis worries that unsystematic reviews (sometimes called

70 Although Ignaz Semmelweis (1818 - 1865) randomly allocated his patients into one of two clinics during his infamous inquiry in the 1840s into the cause of childbed fever (Persson 2009).


‘narrative reviews’) can fail to constrain belief in hypotheses: “there are examples in the literature where two narrative reviews come to opposite conclusions, with one reporting that a treatment is effective while the other reports that it is not” (Borenstein et al. 2009). A recent statistics textbook emphasizes the worry regarding reviewers’ idiosyncrasies: “the conclusions of one reviewer are often partly subjective, perhaps weighing studies that support the author’s preferences more heavily than studies with opposing views” – and so the authors suggest that “it is extremely difficult to balance multiple studies by intuition alone without quantitative tools” (Whitlock and Schluter

2009). The quantitative tool most often used to achieve such a ‘balance’ of multiple studies in medicine is meta-analysis.

The best justification or account of the scientific value of meta-analysis is rather simpler than one might suppose. One might think that an aim of meta-analysis is to satisfy a principle stipulating the consideration of all available evidence for a hypothesis (such as Carnap’s “Principle of Total Evidence”). However, as I argue below, meta-analyses violate such a principle because they normally include only a small fraction of available evidence. Alternatively, one might think that an aim of meta-analysis is to satisfy a principle of robustness: hypotheses are often said to be more likely to be true if they are supported by evidence from multiple independent sources (see e.g. Trout 1995; Thagard 1998; Stegenga 2009). However, because meta-analyses usually include only evidence from homogeneous methods (such as RCTs), such evidence fails to be methodologically independent, which is often said to be a requirement of robustness arguments. One proposal to amalgamate diverse evidence is to use the evidence to build causal models, or models of a network of interconnected

causal relations (Danks 2005; Cartwright 2007a; Cartwright and Stegenga forthcoming). Accordingly, one might think that an aim of meta-analysis is to construct causal models. But meta-analyses amalgamate evidence on a single causal relation, not on a network of interconnected causal relations.

The best justification or explanation of the value of meta-analysis is statistical: many purported causes in medicine and the human sciences have a small observable effect, and so when analyzing data from a single trial on an intervention with a small effect, there might be no statistically significant difference between the experimental group and the control group. But by pooling data from multiple studies the sample size of the analysis increases, which tends to decrease the width of confidence intervals, thereby potentially rendering estimates of the magnitude of an intervention effect more accurate, and perhaps statistically significant. One aim of meta-analysis, then, is quantitative accuracy. More generally, such quantitative accuracy can serve to constrain our belief in a hypothesis.
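A minimal sketch of this statistical rationale, using inverse-variance (fixed-effect) pooling on hypothetical study results; the point is only that the pooled 95% confidence interval is narrower than that of any individual study:

# Sketch: inverse-variance (fixed-effect) pooling of hypothetical study results.
# Each study reports an effect estimate and a standard error; pooling yields a
# narrower 95% confidence interval than any individual study.

studies = [   # (effect estimate, standard error) -- hypothetical numbers
    (0.30, 0.20),
    (0.10, 0.25),
    (0.25, 0.15),
]

weights = [1 / se**2 for _, se in studies]
pooled_effect = sum(w * e for (e, _), w in zip(studies, weights)) / sum(weights)
pooled_se = (1 / sum(weights)) ** 0.5   # smaller than any single study's SE

ci = (pooled_effect - 1.96 * pooled_se, pooled_effect + 1.96 * pooled_se)
print(round(pooled_effect, 3), tuple(round(x, 3) for x in ci))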

In short, meta-analysis was introduced as a method to assess and combine evidence from multiple studies, with the aim of constraining belief in a hypothesis, and doing so in a way which is not infused with the subjective idiosyncrasies that can be present in informal literature reviews or qualitative methods of amalgamating evidence. Meta-analysis is said to have the virtues of rigor, transparency, quantitative accuracy, and freedom from personal bias. To put these aims in terms of recent historiography of science, meta-analysis is a method which attempts to use

‘mechanical objectivity’ to amalgamate evidence from multiple studies, with the aim of providing a quantitative summary of these multiple studies to satisfy our ‘trust in

numbers’ (Daston and Galison 2007; Porter 1995). The purported virtues and aims of meta-analysis can be summarized by these two broad norms:

Constraint: A meta-analysis should constrain our belief in a hypothesis.

Objectivity: A meta-analysis should be performed in a way which counters the analyst’s idiosyncratic or personal biases.

Meta-analyses, unfortunately, often fall short of these aims (§7.3); the reasons for such failure are explored in §7.4.

7.3 Failure of Constraint

Epidemiologists have recently noted that multiple meta-analyses of the same

(or at least similar) primary evidence, performed by different analysts, can reach contradictory conclusions. For example, there have been numerous inconsistent studies on the benefits and harms of a newer synthetic dialysis membrane versus an older cellulose membrane for patients with acute renal failure: one meta-analysis of these studies found that the use of synthetic dialysis membranes improved the survival of such patients compared with the use of cellulose membranes (Subramanian et al.

2002) while another meta-analysis reached the opposite conclusion (Jaber et al. 2002).

Similarly, contradictory conclusions have been reached from meta-analyses on the benefits of acupuncture and homeopathy, mammography for women under fifty, and the use of antibiotics to treat otitis (see e.g. Linde and Willich 2003).


Barnes and Bero (1998) performed a quantitative analysis of multiple meta-analyses which reached contradictory conclusions regarding the same hypothesis, and found a correlation between the outcomes of the meta-analyses and the analysts’ relationships to industry. They analyzed 106 review papers on the health effects of passive smoking: thirty-nine of these reviews concluded that passive smoking is not harmful to health, and the remaining 67 concluded that there is at least some adverse health effect associated with passive smoking. Of the variables investigated, the only significant difference between the analyses that showed adverse health effects versus those that did not was the analysts’ relationship to the tobacco industry: analysts who had received funding from the tobacco industry were 88 times more likely to conclude that passive smoking has no adverse health effects compared with analysts who had not received tobacco funding.

Here is a similar example. Antihypertensive drugs have been tested by hundreds of studies, and as of 2007 there had been 124 meta-analyses on such drugs.

Meta-analyses of these drugs were five times more likely to reach positive conclusions regarding the drugs if the reviewer had financial ties to a drug company (Yank et al.

2007). Or consider the meta-meta review of meta-analyses of studies on spinal manipulation as a treatment for lower back pain: some meta-analyses of this intervention have reached positive conclusions regarding the intervention while other meta-analyses have reached negative conclusions, and a factor associated with positive meta-analyses was the presence of a spinal manipulator on the review team

(Assendelft et al. 1995).


These examples are merely meant to show that multiple meta-analyses of the same primary set of evidence can reach contradictory conclusions, not that they must, or even often do, reach contradictory conclusions. However, the features of meta-analysis which explain its occasional failure to tightly constrain intersubjective assessments of hypotheses are shared by all meta-analyses. That is, the conditions under which multiple meta-analyses of the same primary evidence can reach contradictory conclusions are inherent features of the methodology common to all meta-analyses. I now turn to a detailed examination of the methodology of meta-analysis.

7.4 Inherent Subjectivity

The failure of constraint in the above cases is at least partially a consequence of the failure of objectivity: constraint on justified credence is not met by the meta-analyses in §7.3 because meta-analyses are not sufficiently objective. Subjectivity is infused at many levels of a meta-analysis: when designing and performing a meta-analysis, decisions must be made – based on judgment, expertise, and personal preferences – at each step of a meta-analysis, which most importantly include the following:

i. Choice of primary evidence

ii. Choice of effect measure

iii. Choice of quality assessment scale

iv. Choice of averaging technique

Some of these choices are not specific to meta-analysis (i and perhaps iii), but are nevertheless relevant to explaining the shortcomings of meta-analysis, while others are

particular to the technicalities of meta-analysis (ii and perhaps iv). The general principles of meta-analysis are simple and are not unique to the biomedical or social sciences. For example, a common method of combining multiple expert probability forecasts (say, for sunshine in three days, or for a stock price increase in the next fiscal quarter, or for a victory for a presidential candidate) is to calculate a statistical average: a standard way to combine multiple forecasts into a single forecast is simply to average the probabilities. However straightforward a weighted average may seem, the subtleties of meta-analysis are complex. In what follows I consider each class of choices required in the steps of a meta-analysis.

7.4.1 Choice of Primary Evidence

Multiple decisions must be made regarding what primary evidence to include in a meta-analysis. I survey some of these decisions, and critically evaluate arguments for particular strategies for making these decisions.

7.4.1.1 Methodological Quality

The dominant view in evidence-based medicine is to include only evidence from RCTs in a meta-analysis; according to a statement of leaders in evidence-based medicine, in a meta-analysis “researchers should consider including only controlled trials with proper randomisation” (Egger, Smith, and Phillips 1997). Such a view excludes other common kinds of statistical evidence, including that from cohort studies and case-control studies, as well as non-statistical evidence which is not in the

domain of usual technical meta-analyses, such as pathophysiological evidence, and evidence from animal experiments, mathematical models, and clinical expertise.

In contrast, others argue that an evidence amalgamation method such as meta-analysis should use all available evidence. Glass (1976) gives a convincing argument along the following lines: an effect size of 2.0 x from 3 RCTs testing a purported causal relation should have a different impact on our belief in the causal hypothesis when considered in the light of (i) 50 matched case-control studies, purportedly testing the same causal relation as the RCTs, that show an effect size of 2.2 x, versus (ii) 50 matched case-control studies, purportedly testing the same causal relation as the

RCTs, that show an effect size of -0.8 x. If our belief in the causal hypothesis were not different in the two scenarios, we would effectively be committing the base-rate fallacy: our belief in the hypothesis after observing new evidence should also be guided by all of our previous evidence, and if it is not we are liable to make an ill-formed judgment of the probability that the hypothesis is true in light of the new evidence. In (i) there is concordance between the new evidence (from RCTs) and the previous evidence (from case-control studies), which might suggest that the two kinds of studies are converging on a true effect size (but of course such concordance can occur for other reasons). In (ii) there is discordance between the new evidence (from

RCTs) and the previous evidence (from case-control studies), which might suggest (a) that there is a systematic problem with the case-control studies, given the known potential biases with case-control studies compared with RCTs (this is a typical response in the evidence-based medicine community when faced with discordance between RCTs and case-control studies), (b) that there is a systematic problem with

the RCTs, given the low number of them compared with the large number of case-control studies, (c) that the two kinds of studies were not similar enough in all important parameters, including the causal structure of the study populations, (d) that the purported cause is spurious, (e) that a highly unlikely series of events has occurred.

In other words, in (ii) there is no general reason to assume (a) as an explanation of the discordance, and if we blindly do assume (a) as an explanation we are liable to be wrong.
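Some simple, purely illustrative arithmetic makes the contrast vivid; the naive weighting by study counts below is not how a real meta-analysis would weight studies, but it shows how much the background evidence matters:

# Sketch: why background evidence matters in Glass's schematic scenario.
# 3 RCTs with effect 2.0 are pooled with 50 case-control studies, weighting
# naively by study counts; the weighting scheme is only illustrative.

def pooled(rct_effect, rct_n, cc_effect, cc_n):
    return (rct_effect * rct_n + cc_effect * cc_n) / (rct_n + cc_n)

scenario_i  = pooled(2.0, 3, 2.2, 50)    # concordant case-control evidence
scenario_ii = pooled(2.0, 3, -0.8, 50)   # discordant case-control evidence

print(round(scenario_i, 2), round(scenario_ii, 2))   # roughly 2.19 versus -0.64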

Another way to put this is: even if RCTs were justifiably the gold standard of evidence, that would not mean that evidence from non-randomized studies is negligible. Indeed, some of our most believable causal hypotheses were first supported by evidence from case-control studies, and for many hypotheses we only have evidence from non-randomized studies. A joke in such discussions is that there has never been a carefully performed RCT which has tested the causal efficacy of parachutes.

The exclusive use of a narrow range of evidence is purportedly justified on the grounds that the methods of meta-analysis are only valid for homogeneous evidence (I discuss this below), and by the “garbage-in-garbage-out” argument: if low quality evidence is included in a meta-analysis, then the output of the meta-analysis will also be low quality, and so rather than including all available evidence, meta-analyses should only include the ‘best evidence’ (e.g. Slavin (1995), who argues that meta- analysis should be limited to ‘best evidence synthesis’). The trouble with this argument is outlined above: if we ignore some evidence, even if it comes from a method deemed to be of low quality, we effectively commit the base-rate fallacy.


Moreover, there is no reason why an analyst cannot assess lower-quality evidence appropriately, simply by assigning a lower weight to such evidence when calculating the weighted average. Finally, the veiled premise of the garbage-in-garbage-out argument – that all and only non-randomized studies require problematic background assumptions in order for evidence from such studies to be truth-conducive – is false.

All methods presuppose background assumptions that must be met for the evidence from such methods to be considered truth-conducive, and such assumptions may or may not be problematic, but this depends on specific features of the study design in relation to one’s hypothesis of interest, and not on non-relational and abstract features solely of the study design. In short, although all evidence is inductively risky, there are good reasons for including as much evidence as possible in a meta-analysis.

Regardless, when performing a meta-analysis one must make a decision regarding the breadth of methodological quality to include, and there is no reason to think that such a decision will be made consistently by different analysts.

7.4.1.2 Methodological Diversity

Another justification for only including evidence from select methods is the possibility of variable treatment effects among different subjects or different experimental circumstances. Consider the following guidance from the Cochrane Collaboration:

you have to be confident that clinical and methodological diversity is not so great that we should not be combining studies at all. This is a judgement, based on evidence, about how we think the treatment effect might vary in different circumstances.71

For the Cochrane Collaboration, the standard for what counts as methodological diversity is low; these meta-analyses only include a narrow range of study designs in any given review. Some limitation to the diversity of primary evidence which gets included in a meta-analysis is justifiable. The Cochrane group gives the following proviso: “Meta-analysis should only be considered when a group of studies is sufficiently homogeneous in terms of participants, interventions and outcomes”

(Cochrane Handbook 9.5.1). Including only studies with homogenous outcomes is fine if by ‘outcome’ they mean kind of outcome; for example, if one study tests the effect of a drug on lowering blood pressure, and another study tests the effect of the same drug on the rate of heart attacks, then there is no shared outcome on which to calculate an average. More generally, a meta-analysis is only meaningful if the data from multiple studies is generated from a single causal relation.72 But even when multiple studies are purported to measure the same causal relation, the only evidence that analysts have to assess this (besides the substantive features of the study designs) is by the statistical variability between the data from the studies. As the Cochrane group rightly states, this is a ‘judgement’ regarding whether or not a meta-analysis is even meaningful in the first place.

71 Cochrane website http://www.cochrane-net.org/openlearning/html/mod13-4.htm, accessed Oct 20, 2009. 72 Drawing on the distinction enunciated by Bogen and Woodward (1988), a causal relation can be thought of as a ‘phenomenon’, about which the primary studies under discussion here are used to collect ‘data’ on; such data can be used as evidence for hypotheses regarding these causal relations, and meta-analysis can be used to amalgamate such evidence from multiple primary studies.


Homogeneity of participants and interventions might be similarly justifiable. If we are interested in the effect of a given intervention, we must be consistent with what that intervention is – although a narrow range of intervention diversity (say, using a single dose of an experimental drug) will narrow the range of conclusions one can draw about the intervention. Likewise for the use of a narrow range of participants – before we can know if an intervention works in a broad demographic, it is reasonable to try to determine if it works in a narrower demographic.73 (But of course, if we already have evidence from a broader population of subjects, including non-human subjects, then we should not ignore such evidence.) Moreover, some interventions only have a specific effect in a narrow range of subject diversity. Thus, there can be good reasons for limiting the diversity of participants, interventions, and kinds of outcomes to be included in a meta-analysis. Nevertheless, such parameters of meta-analyses are decision points which can influence the outcomes of a meta-analysis.

Other limitations to the primary evidence included in a meta-analysis are more troublesome. Consider the following Cochrane guidance: “we strongly recommend that review authors should not make any attempt to combine evidence from randomized trials and NRS [non-randomized studies]” (13.2.1.1). No justification is provided for this limitation; not only is evidence from non-randomized studies not to be amalgamated with evidence from RCTs, but neither is evidence from

73 I give short shrift to a growing debate: Epstein (2007) argues that our knowledge of the safety and efficacy of many biomedical interventions is limited because for too many years these interventions were tested on a narrow demographic range of subjects.

pathophysiological knowledge, background considerations of underlying mechanisms, animal experiments, and results from mathematical models. Such a practice could limit the external validity of a meta-analysis, since RCTs on humans are typically performed with relatively narrow study parameters while other kinds of evidence – including evidence from non-randomized human studies, studies on animals, and experiments designed to elucidate causal mechanisms which are often performed on tissue and cell cultures – can have diverse study parameters at lower cost. Moreover, as discussed above, this practice violates a principle of total evidence, which comes with possibly significant epistemic risk: neglecting other kinds of evidence risks making an uninformed judgment (or, the base-rate fallacy) on a hypothesis.

Methods that amalgamate evidence from multiple studies but systematically exclude all evidence except that from a single kind of study are not limited to medicine. A non-medical example is in ‘driving under the influence’ cases. In most jurisdictions in the United States there are at least three kinds of evidence that can be used to detect intoxication of drivers: (1) a police officer’s subjective assessment of the driver;74 (2) the driver’s blood alcohol concentration as extrapolated from a portable breath test machine in the officer’s car; (3) the driver’s blood alcohol concentration as extrapolated from a more reliable breath test machine in a police station (Mnookin 2008). The use of breath test machines is meant to mitigate officers’ subjective assessments; to use a term of Daston and Galison (2007), the ‘mechanical objectivity’ of breath test machines is thought to counter the subjectivity of officers.

74 This subjective assessment is itself comprised of various kinds of evidence, including the driver’s ability to perform behavioral tasks, the driver’s conversational capability, and the driver’s outward appearance and smell.


In many jurisdictions, evidence from (3) trumps evidence from (2) or (1): if a driver is suspected of being intoxicated according to (1), and fails the breath test in (2), but gets to the station and then passes the breath test in (3), the driver is released with no charges. In short, in such cases a single kind of evidence trumps other available kinds of evidence.

Decisions regarding the methodological diversity to be included in a meta-analysis can vary between analysts, and such differences might affect the outcome of a meta-analysis.

7.4.1.3 Discordance

Another choice that must be made regarding which primary evidence to include in a meta-analysis is the degree of discordance – that is, the degree to which the primary studies disagree with or contradict one another – that the analyst is willing to accept amongst the primary set of evidence.

The Cochrane Collaboration Handbook has a section which discusses strategies for dealing with discordant primary evidence (9.5.3). An examination of these strategies is revealing. One strategy is to “explore” the discordance: discordance might be due to systematic differences between studies, and so a post-hoc meta-study can be done to determine if systematic differences between studies are related to systematic differences in outcomes. Another strategy is to exclude studies from the meta-analysis: the Handbook claims that discordance might be a result of several outlying studies, and if some factor can be found that might explain the discordance between these outlying studies and the remainder of the studies, then those outliers can

be excluded. The Handbook notes, however, that “Since usually at least one characteristic can be found for any study in any meta-analysis which makes it different from the others, this criterion is unreliable because it is all too easy to fulfill.” Indeed, a study can be similar or dissimilar to another study on an infinite number of features, and so if one had sufficient data and resources, one could always find a potential difference-maker about a study that would purportedly justify its exclusion.

Finally, when faced with discordant primary evidence, the Cochrane group suggests that a meta-analysis may not be meaningful – “If you have clinical, methodological or statistical heterogeneity it may be better to present your review as a systematic review using a more qualitative approach to combining results...”75 This is because, as discussed above, the primary evidence might be discordant not because of random variations of measures from a single causal relation, but rather because the multiple primary studies were measuring multiple causal relations.

Each of these strategies for dealing with discordance can be pursued in a multitude of ways, with varying amounts of time and energy devoted to the particular strategies. There is no reason to think that different analysts will follow these strategies in the same way. Different approaches to discordance could affect the outcomes of a meta-analysis.

75 http://www.cochrane-net.org/openlearning/html/mod13-4.htm Accessed May 13, 2009.


7.4.1.4 Data Access

Decisions regarding what primary evidence to include in a meta-analysis are constrained by what primary evidence is available. The internet has improved access to primary evidence. Nevertheless, a well-known problem in medical research is publication bias: papers which show statistically significant positive findings are more likely to be published than papers that have null or negative findings (especially when the research is funded by private companies – see Brown (2008)). A corollary of publication bias has its own name: the File Drawer Problem. Reviewers performing a meta-analysis have less access to null or negative evidence (because it is sitting in file drawers) than they do to published positive evidence. A related problem is faced by analysts who want to do a meta-analysis with patient-level outcomes (which has several advantages over published study-level outcomes which I do not discuss here): often patient-level data is confidential or is protected by corporate interests.

Other practical problems regarding access to primary evidence include studies published in languages foreign to the analyst, and evidence available only in the ‘gray literature’ of conference proceedings and dissertations. Interestingly, publication bias appears to be worse in some countries, so if one wanted positive results in a meta-analysis, one might consider including studies published in those countries. How intensely an analyst grapples with these practical problems of data access can influence the results of a meta-analysis.


7.4.1.5 Summary

A number of decisions must be made regarding which studies to include in a meta-analysis, including the acceptable range of methodological quality of studies, the acceptable range of study parameter diversity, whether or not to exclude studies with outlying data, how hard to look through the gray literature, whether the File Drawer Problem is severe, and whether or not a meta-analysis is even feasible in the first place.

In the words of a critic of meta-analysis: “It is precisely in those areas where there is most disagreement that these methods [meta-analysis] are least applicable” (Eysenck 1984). Regardless of how justified the decisions regarding choice of primary evidence are for any particular case, they must be based on expertise and judgment, thereby inviting idiosyncrasy, and allowing a degree of latitude in the possible results of a meta-analysis. Such decisions threaten objectivity and thereby constraint.

7.4.2 Choice of Effect Measure

Data from primary studies must be summarized quantitatively by a standardized measure, usually referred to as an ‘effect measure’, before being amalgamated into a weighted average. An effect measure (sometimes also called an outcome measure) is used to summarize data into an ‘effect size’, which is an estimate of the magnitude of the purported strength of the cause-effect relationship under investigation. Multiple effect measures can be used for this – frequent choices include the odds ratio, the risk difference, and the correlation coefficient (I give examples of these below). The choice of effect measure can influence the degree to which the primary evidence appears concordant or discordant, and so ultimately the choice of

effect measure influences the results of meta-analysis, and can even influence whether or not an analyst thinks a meta-analysis is worth doing in the first place. The guidance from the Cochrane group will again help me to explain this.

As discussed above, the Cochrane group gives several strategies for dealing with discordant primary evidence. One of these strategies, not discussed above, is to

“change the effect measure”, because discordance “may be an artificial consequence of an inappropriate choice of effect measure.” The Cochrane Handbook is correct to claim that “when control group risks vary, homogeneous odds ratios or risk ratios will necessarily lead to heterogeneous risk differences, and vice versa.” This is simply due to the mathematical relationship between ratios and differences. However, although it may be true that evidence from multiple studies appears discordant only because one effect measure is used rather than another, it might not be true: heterogeneity might simply be due to a lack of systematic effect by the intervention. A hypothetical case will help me illustrate the trouble with choosing between effect measures based on discordance between primary studies.

Consider two studies (1 and 2), each with an experimental group (E) and a control group (C), and each with a binary outcome (Y or N). Table 3 below indicates the possible outcomes for each study, where the letters (a-d) are the numbers of subjects with each outcome in each group.


Table 3. Binary Outcomes in an Experiment and Control Group.

            Outcome
Group       Y       N
E           a       b
C           c       d

The risk ratio (RR) is defined as:

(RR) = [a/(a+b)] / [c/(c+d)]

The risk difference (RD) is defined as:

(RD) = a/(a+b) - c/(c+d)

Now, suppose for Study 1 the numbers for the two outcomes in each group are a=1, b=1, c=1, d=3 and for Study 2 they are a=6, b=2, c=3, d=5. This would give the following effect sizes for the two studies:

RR1 = 2; RR2 = 2; RD1 = 0.25; RD2 = 0.375
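These effect sizes can be checked mechanically; the following minimal sketch simply applies the two definitions above to the hypothetical counts of the two studies.

```python
# Minimal check of the effect sizes above, applying the two definitions to the
# hypothetical counts for Study 1 and Study 2.
def risk_ratio(a, b, c, d):
    return (a / (a + b)) / (c / (c + d))

def risk_difference(a, b, c, d):
    return a / (a + b) - c / (c + d)

study1 = (1, 1, 1, 3)
study2 = (6, 2, 3, 5)

print(risk_ratio(*study1), risk_ratio(*study2))            # 2.0 2.0    (concordant)
print(risk_difference(*study1), risk_difference(*study2))  # 0.25 0.375 (discordant)
```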

Thus a meta-analysis on just these two studies, using RD as the effect measure, would have discordant primary effect sizes to amalgamate (0.25 and 0.375); but by switching the effect measure to RR the meta-analysis would have concordant primary results to amalgamate (2 and 2). Although the Cochrane Collaboration advises changing the effect measure if the primary studies have discordant results, choosing between effect measures solely on the basis of trying to avoid discordance is ad hoc. More to the point, the choice of effect measure is another decision in which personal judgment is

required, and the fact that there are multiple effect measures allows a range of outputs for any meta-analysis. Again, this threatens objectivity and thereby constraint.

7.4.3 Choice of Quality Assessment Scale

Analysts often attempt to account for differences in the size and methodological quality of primary studies included in a meta-analysis by weighing the primary studies with a formalized quality assessment scale. The conclusion of a meta-analysis depends on how the primary evidence is weighed, because the weights are used as a multiplier when the primary effect sizes are averaged. There are many features of evidence that should influence how primary evidence is weighed, including multiple features that influence the internal validity of a study (e.g. freedom from numerous potential biases) and the external validity of a study (i.e. the relevance of the evidence to one’s general hypothesis of interest). Scientists lack principles to precisely determine how these numerous features should be weighed relative to each other.
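To see in miniature why this matters, consider a toy sketch (the effect sizes and weighting schemes below are invented for illustration, not taken from any study): the same three primary effect sizes yield quite different pooled estimates under two different quality-weighting schemes.

```python
# Toy sketch: the same three primary effect sizes pooled under two invented
# quality-weighting schemes give noticeably different amalgamated estimates.
effect_sizes = [2.0, 1.5, 0.2]      # hypothetical effect sizes from three studies

weights_scheme_a = [3.0, 2.0, 1.0]  # e.g. a scale that rewards one set of study attributes
weights_scheme_b = [1.0, 1.0, 4.0]  # e.g. a scale that rewards a different set

def pooled_estimate(effects, weights):
    return sum(w * e for w, e in zip(weights, effects)) / sum(weights)

print(pooled_estimate(effect_sizes, weights_scheme_a))  # ~1.53
print(pooled_estimate(effect_sizes, weights_scheme_b))  # ~0.72
```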

Different weighing schemes can give contradictory results when evidence is amalgamated. An empirical demonstration of this was given by Jüni and his colleagues (1999). They amalgamated data from 17 trials testing a particular medical intervention, using 25 different scales to assess study quality (thereby effectively performing 25 meta-analyses). These quality assessment scales varied in the number of assessed study attributes, from a low of three attributes to a high of 34, and varied in the weight given to the various study attributes; however, Jüni and his colleagues note that “most of these scoring systems lack a focused theoretical basis.” Their results were troubling: the amalgamated effect sizes between these 25 meta-analyses differed

by up to 117% – using exactly the same primary evidence. The authors concluded that

“the type of scale used to assess trial quality can dramatically influence the interpretation of meta-analytic studies.”

Not only does the choice of quality assessment scale dramatically influence the results of meta-analysis, but so does the choice of analyst. A quality assessment scale known as the ‘risk of bias tool’ was devised by the Cochrane group to assess the degree to which the results of a study “should be believed.” Alberta researchers distributed 163 manuscripts of RCTs among five reviewers, who assessed the RCTs with this tool, and they found the inter-rater agreement of the quality assessments to be very low (Hartling et al. 2009). In other words, even when given a single quality assessment tool, and training on how to use it, and a narrow range of methodological diversity, there was a wide variability in assessments of study quality.

Thus, when performing a meta-analysis, the choice of a quality assessment scale, and variations in the assessments of quality using a particular scale by different analysts, threaten objectivity and constraint.

7.4.4 Choice of Averaging Technique

Once effect measures are calculated for each primary study, two common ways to determine the average effect measure are possible: sub-group averages and pooled averages. In a pooled average, all subjects from included studies are merged in the analysis as if they were part of one large study with no distinct demographic sub- groups. One problem with the pooled average approach is Simpson’s paradox: the comparative success rate of two groups can be reversed in all of their respective sub-

groups, so if a meta-analysis simply pooled all participants into an analysis of overall groups then the calculated effect of the intervention could be the opposite of what one would find in every sub-group. Another problem with the pooled average approach is that different demographic groups might respond differently to an intervention. For example, a drug might, on average, have a large benefit to males and a small harm to females, and if data from these groups were combined in a pooled average we would erroneously conclude that the drug has, on average, a small benefit to all people, including females.
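A small numerical sketch (the counts are invented for illustration) makes the pooling problem vivid: the treatment does better in every sub-group yet appears worse overall once the sub-groups are pooled.

```python
# Invented counts (successes, total) illustrating Simpson's paradox: the treated
# group does better in each sub-group, yet worse when the sub-groups are pooled.
treated = {"sub-group 1": (81, 87),  "sub-group 2": (192, 263)}
control = {"sub-group 1": (234, 270), "sub-group 2": (55, 80)}

def rate(successes, total):
    return successes / total

for g in treated:
    print(g, rate(*treated[g]) > rate(*control[g]))  # True, True: treatment better in each sub-group

pooled_treated = rate(sum(s for s, _ in treated.values()), sum(n for _, n in treated.values()))
pooled_control = rate(sum(s for s, _ in control.values()), sum(n for _, n in control.values()))
print(pooled_treated > pooled_control)  # False: pooling reverses the comparison
```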

Maintaining distinct sub-groups in a meta-analysis, which the Cochrane group rightly advises, is an attempt to avoid such problems. However, maintaining sub- groups does not avoid Simpson’s paradox unless there is a principled way to demarcate sub-groups such that the ‘true’ result one is interested in is relative to those sub-groups and these exact sub-groups were used in the primary analyses. Moreover, to determine a sub-group average, either the sub-groups must be consistently demarcated amongst primary studies, or the patient-level data necessary to demarcate sub-groups, such as age and gender, must be available to the analyst. The former is often not the case and the latter is often not available. However, if patient-level demographic data is available to the analyst, then the analyst can re-group individual sub-groups any way she wishes until she finds something interesting, but of course such retrospective data-dredging can be ad hoc, depending on the way it is interpreted.

More to the point, once again: the choice of average type – pooled or sub-group (and if the latter, the choice of appropriate sub-groups) – is another decision point in the methodology of meta-analysis which threatens objectivity and constraint.


7.5 The Hill Alternative

One of the criticisms I raised against meta-analysis is the limitation of including only evidence from RCTs in meta-analyses. A long-time critic of meta-analysis has argued that subjective knowledge is necessary to properly assess a large volume and diversity of evidence:

A good review is based on intimate personal knowledge of the field, the participants, the problems that arise, the reputation of different laboratories, the likely trustworthiness of individual scientists, and other partly subjective but extremely relevant considerations. Meta-analysis rules out any such subjective factors. (Eysenck 1994)

Others claim that formal methods of amalgamating evidence “bury under a series of assumptions many value judgments” (Lomas et al. 2003), and that in situations of discordant primary evidence, we do not have (and will not find) a good “formula or set of principles designed to provide decision-making rules” (Klein and Williams 2000), especially when the discordant primary evidence comes from very different kinds of experiments. There is, then, at least at first glance, a tension between the purported objectivity and quantificational simplicity of meta-analyses and the subjectivity and qualitative complexity required to assess and interpret the relevant aspects of all available evidence.

An older tradition of evidence in medicine, associated with the epidemiologist

Sir Bradford Hill (1897 - 1991), might go some way toward resolving this tension.

Hill was one of the leading epidemiologists involved in the first large-scale case-control studies during the 1950s which showed a correlation between smoking and lung cancer (Doll and Hill 1950, 1954). Hill’s statistician nemesis Ronald Fisher


(1890 - 1962) noted the absence of controlled experimental evidence required to prove that the smoking-cancer association was indeed causal. Fisher’s now infamous criticism was that the smoking-cancer correlation could be explained by a confounding variable, or common cause of the smoking and cancer: he postulated that a genetic predisposition could be the common cause of both smoking and cancer. The only way to determine a true causal relation, according to Fisher, was to perform a controlled experiment; of course, no such experiment could be performed. Hill, at the time an epidemiologist at the London School of Hygiene and Tropical Medicine, responded by appealing to a plurality of reasoning strategies which, he claimed, when taken together make a compelling case that the link was truly causal (Hill 1965).

These reasoning strategies, what Hill called ‘causal criteria’, have recently been investigated by Howick et al. (2009). I list them here merely to illustrate the heterogeneity of kinds of evidence Hill appealed to (in my terms: Hill appealed to multimodal evidence):

1. strength of associations between variables: strong associations between variables are more likely to be causal than weak associations

2. consistency of results between studies: an association between variables which is observed in different populations under different circumstances is more likely to be causal

3. specificity of variables: a single specific cause has a single specific effect; correlations between coarse-grained or non-specific variables are less compelling evidence for a true causal relation

4. temporality: a cause must precede its effect

5. biological gradient: a dose-response pattern of associations between variables suggests a true causal relation

6. plausibility: a plausible biological mechanism which can explain a correlation suggests that the association is a true causal relation76

7. coherence: a causal interpretation of an association should not conflict with other knowledge of the disease, and epidemiological evidence should cohere with evidence from laboratory experiments

8. experimental evidence: despite criticisms from Fisher, Hill of course recognized the value, when available, of evidence from controlled experiments

9. analogy: analogies with other known causal relations can aid in causal inference; that is, if the purported cause and purported effect are similar in important respects to a known cause and its effect, then there is at least some reason to think that the purported causal relation is real

Hill considered these only as guidelines rather than necessary or sufficient conditions (except, perhaps, for temporality, which might be construed as necessary) (Doll 2003).

Although Hill granted that no single criterion was necessary or sufficient to demonstrate causality, he claimed that jointly the criteria could make for a good argument for the presence of a causal relation.77 Each particular criterion could use

76 For an interesting study of an eighteenth century case in which the search for a causal mechanism led the researchers astray, see De Vreese (2008). 77 The Hill strategy could, perhaps, be understood as part of a shift in epidemiological concepts of cause and disease from a monocausal model to a multifactorial model; for an insightful discussion of concepts of cause and disease in epidemiology, see Broadbent (2009).

philosophical critique (as Howick et al. (2009) have begun to do), but the important point for the purpose of contrast with meta-analysis is the sheer plurality of reasons and sources of evidence that Hill appealed to.

The reasoning strategies appealed to by Hill depend on diverse kinds of evidence, which lack a shared quantitative measure – like that of evidence solely from

RCTs – such that the evidence can be combined by a simple weighted average. The four specific problems I raised for meta-analysis – the choice of primary evidence to include, the choice of a metric or effect size to quantify the evidence, the choice of a quality assessment scale to assess or weigh the evidence, and the kind of averaging technique used – are even more troublesome for the Hill strategy. Thus one might think: meta-analysis has the virtue of amalgamating evidence with objectivity and quantitative simplicity, yet has the vice of amalgamating only a narrow range of evidence, while the Hill strategy has the virtue of considering all available evidence, yet has the vice of qualitative subjectivity. But given my arguments in §3 and §4, the purported virtues of meta-analysis – objectivity and constraint – are less apparent than many have thought. Some epidemiologists now argue that criteria such as those used by Hill should be employed more often (Weed 1997), whereas others argue that such criteria should not be used to assess causal relations (Charlton 1996, Rothman and

Greenland 2005). At the very least, the Hill strategy of dealing with a huge volume and diversity of evidence might, given the problems with meta-analysis discussed in

§7.3 and §7.4, be more virtuous than meta-analysis.


7.6 Conclusion

I have argued that meta-analyses fail to constrain belief in hypotheses. This is because the decisions that must be made when designing and performing a meta-analysis require personal judgment and expertise, and thereby allow personal biases and idiosyncrasies of reviewers to influence the outcome of the meta-analysis. The failure of objectivity at least partly explains the failure of constraint: that is, the subjectivity required for meta-analysis explains how multiple meta-analyses of the same primary evidence can reach contradictory conclusions regarding the same hypothesis.

One of the main criticisms I raised against meta-analysis is its reliance on a narrow range of evidential diversity. An older tradition of evidence in medicine, associated with the epidemiologist Sir Bradford Hill, is in this respect superior.

However, there is no formal method for assessing, quantifying, and amalgamating the very disparate kinds of evidence that Hill considered. Thus the Hill strategy lacks the apparent objectivity and quantificational simplicity of meta-analysis. But given the argument of this chapter, the fact that the Hill strategy lacks a simple method of objectively amalgamating diverse evidence is not a strike against it relative to meta- analysis, since I have argued that the simplicity and objectivity of the latter is a chimera. Despite the ubiquitous view that meta-analysis is the platinum standard of evidence in medicine, meta-analysis is not, in the end, very shiny.

A version of this chapter will appear as “Is Meta-Analysis the Platinum Standard of Evidence?” in Studies in History and Philosophy of Biological and Biomedical Sciences.

CHAPTER 8: AN IMPOSSIBILITY THEOREM FOR AMALGAMATING EVIDENCE

Abstract

Amalgamating evidence of different kinds for the same hypothesis into an overall confirmation is analogous, I argue, to amalgamating individuals’ preferences into a group preference. The latter faces well-known impossibility theorems, most famously

“Arrow’s Theorem”. Once the analogy between amalgamating evidence and amalgamating preferences is drawn tightly, it becomes clear that amalgamating evidence might face a theorem similar to Arrow’s. I prove that this is so, and end by discussing the plausibility of the axioms required for the theorem.


8.1 Introduction

Often our hypotheses are confirmed or disconfirmed by evidence from multiple modes. For example, consider the competing hypotheses about how the influenza virus is spread from person to person discussed in Chapter 4: the Contact hypothesis is that influenza is spread by direct contact between people, the Droplet hypothesis is that influenza is spread on large droplets expelled by coughs and sneezes, and the Airborne hypothesis is that influenza is spread on tiny airborne particles over large distances. To determine which of these hypotheses is best supported we have multimodal evidence: the epidemiological patterns of influenza spread, evidence from controlled animal experiments using various ingenious designs and different kinds of animals, results from mathematical models, and clinical experience. Some modes of evidence support the Contact hypothesis while other modes of evidence support one of its competitor hypotheses.

Once evidence is thought of in this way it suggests an analogy between amalgamating preferences from multiple people – a burgeoning topic in social choice theory – and amalgamating evidence from multiple modes. Amalgamating individuals’ preferences into a group decision faces several well-known impossibility theorems, including Condorcet’s voting paradox and Arrow’s impossibility theorem. I describe Arrow’s theorem in §8.2, and in §8.3 I attempt to draw the analogy between amalgamating preferences and amalgamating multimodal evidence as tightly as possible. The analogy suggests that amalgamating multimodal evidence might face an impossibility theorem similar to Arrow’s theorem (§8.4). The primary contribution of this chapter is to demonstrate that this is so – amalgamating multimodal evidence

faces an Arrow-like impossibility theorem (proven in §8.8). This chapter makes small steps toward delimiting the logical space of possibilities for how multimodal evidence can be amalgamated. More promising, perhaps, is the demonstration that the analogy between preference amalgamation and evidence amalgamation allows for a substantial import of results from the rich literature on amalgamating preferences.78

The theorem presented here is limited to amalgamation functions which accept as input only the confirmation ordering of hypotheses by evidence; if an amalgamation function accepted as input the absolute degree of confirmation of a hypothesis, then a key axiom would be violated and so the impossibility result would not apply. This will be explained in due course, but I mention it now since some might think this focus on confirmation ordering to be unorthodox, and will suspect that the impossibility result will have a narrow range of application given the ubiquity of absolute measures of confirmation. That would be hasty. Many evidence-hypothesis relations make the determination of precise values for likelihoods impossible. In such cases no precise absolute measures of confirmation are possible.79 In Chapter 3 I argue that evidence for scientific hypothesis is often like this – yielding at best rather vague, imprecise values for likelihoods. If determinations of absolute confirmations are not possible, at least comparative confirmations may be; comparative confirmation can be understood as statements of the form “evidence i supports hypothesis H1 more than/equally/less than hypothesis H2.” Confirmation orderings like this are the most that can be justified

78 Seminal results in social choice theory include Arrow (1951), Black (1958), and Sen (1970). See also Dietrich and List (2007) and references therein for recent work on the amalgamation of judgments. 79 Vague likelihoods have received little attention, but see Hawthorne (forthcoming).

in much of science. Such orderings, explicated in §8.4, satisfy the axioms of the impossibility theorem for confirmation.

The primary result presented here is perhaps best thought of as a no-go theorem which directs attention to the general plausibility of its axioms. I discuss the plausibility of each of the axioms in §8.5 and §8.6, and argue that two of the axioms are necessary requirements of evidence amalgamation functions, and the remaining two axioms, while not exceptionless requirements of rationality, are generally desirable features of evidence amalgamation functions.

8.2 Amalgamating Preferences

We often wish to amalgamate the preferences of a set of individuals into an overall group preference. Contemporary social choice theorists call a “social welfare function” any aggregation device which takes as its input the preferences of individuals and generates as its output a group preference. The amalgamation of a set of preferences, given certain minimal conditions, can lead to paradoxes. Here I briefly introduce an infamous example.

8.2.1 Arrow’s Theorem

In 1950 the economist Kenneth Arrow published part of his doctoral dissertation as a groundbreaking paper in social choice theory. In one of its strongest forms, Arrow’s impossibility theorem shows that if a society has at least two decision

makers and three options to choose from, then no social welfare function (SWF) can jointly meet the following desiderata, stated informally:

Non-Dictatorship: The SWF cannot have as its output the preference orderings of a single decision maker, for all possible preference orderings of that decision maker.

Independence of Irrelevant Alternatives: The ordering of any set of choices by the SWF should only depend on the individuals’ orderings of those choices.

Unrestricted Domain: The SWF must be able to accept as input all preference orderings from all decision makers.

Unanimity: If all individuals prefer A to B, then the SWF must rank A over B.

This is a troubling result, since it is reasonable to want a SWF that meets all of the above desiderata (I further discuss Arrow’s axioms in §8.5). Here is another way of putting the theorem: any SWF which satisfies Independence of Irrelevant Alternatives,

Unrestricted Domain, and Unanimity must be a Dictatorship (that is, must have as its output only the preference orderings of a single individual). It is at first glance a surprising conclusion, since the Independence of Irrelevant Alternatives, Unrestricted

Domain, and Unanimity axioms seem to have little to do with one another, but are jointly sufficient to require a SWF to be a Dictatorship. Arrow’s theorem has generated an explosion of literature interpreting the theorem and demonstrating other theorems by relaxing, removing, strengthening, or adding axioms.


8.3 Analogy

Preference : Choice :: Evidence : Confirmation

An individual’s preferences are to a group’s choice as evidence from a single mode is to a confirmation of a hypothesis by multimodal evidence. Individuals prefer one choice over another choice just as evidence supports one hypothesis over another hypothesis. Multimodal evidence (the set of evidence from all relevant modes) supports one hypothesis over another hypothesis, just as a group’s aggregated preferences support one choice over another choice.80 Finally, just as a set of individual preferences must be combined by an amalgamation function to determine a group’s ordering of preferences, so too must multimodal evidence be combined by an amalgamation function to determine an overall ordering of confirmations of hypotheses.

Social welfare function : preference-choice relation :: Multimodal evidence amalgamation function : evidence-confirmation relation

A social welfare function takes as input preference orderings of multiple individuals and delivers as output a group ordering; similarly a multimodal evidence amalgamation function takes as input confirmation orderings from multiple modes and delivers as output an overall confirmation ordering. A social welfare function mediates the preference-choice relation just as a multimodal evidence amalgamation function mediates the evidence-confirmation relation.

80 In §8.4 I give a more precise exposition of confirmation ordering.


The following table illustrates the analogy between amalgamating preferences in social choice theory and amalgamating multimodal evidence in confirmation.

Table 4. Analogy Between Amalgamating Preferences and Amalgamating Evidence.

Social Choice                            Confirmation
Individual voter                         Single mode of evidence
Set of voters                            Set of modes
Preference                               Evidence
Preference orderings (input)             Confirmation orderings (input)
Social Welfare Function (operation)      Amalgamation Function (operation)
Preference ordering (output)             Confirmation ordering (output)


8.4 Impossibility Theorem for Confirmation

8.4.1 Notation and Definitions

I rely on the following notation.

Multimodal Evidence

A mode of evidence i generates evidence ei. The evidence from all available modes relevant to a set of competing hypotheses H = {H1, …, Hm} I will call multimodal evidence {e1, e2, …, en}.

This account of multimodal evidence is not committed to any particular notion of evidence, or of individuation of modes.81 A confirmation relation is a binary relation over H. A confirmation order is a confirmation relation, denoted by ≽i where i is a mode (the confirmation ordering relation is indexed to the mode of evidence). Thus H1 ≽i H2 means “evidence from mode i confirmation orders H1 equally to or above H2.” A confirmation order is reflexive, transitive and connected (but not necessarily anti-symmetric), denoted as follows.

81 Conceptualizing how modes of evidence should be individuated is surprisingly difficult; see Chapter 4. The argument in the present chapter does not depend on particular answers to such questions. Even in what appears to be the clearest domain in which we might be able to individuate modes – sensation – there is little agreement regarding how to individuate sensory modalities (see Keeley 2002).


Transitivity

if H1 ≽i H2 and H2 ≽i H3 then H1 ≽i H3, for all H1, H2, H3 in H and for all i.

Reflexivity

H1 ≽i H1 for all H1 in H and for all i.

Connected82

H1 ≽i H2 or H2 ≽i H1, for all distinct H1, H2 in H and for all i.

I have not yet said what confirmation ordering means in terms familiar to philosophers of science. Confirmation ordering of multiple hypotheses by evidence from a particular mode can be understood by any account of the evidence-hypothesis relation.

A likelihoodist, for example, could understand confirmation ordering as follows:

H1 ≽i H2 if and only if p(ei|H1) ≥ p(ei|H2)

82 Every confirmation relation ≽i induces corresponding relations ≻i (‘is more confirmed than’) and ~i (‘is equally confirmed as’). They are defined as usual: H1 ≻i H2 if and only if H1 ≽i H2 and not H2 ≽i H1 H1 ~i H2 if and only if H1 ≽i H2 and H2 ≽i H1


A Bayesian could understand confirmation ordering with whatever her preferred confirmation measure happened to be; here is confirmation ordering put in terms of the difference measure of confirmation, for example:

H1 ≽i H2 if and only if p(H1|ei) - p(H1) ≥ p(H2|ei) - p(H2)

An error-statistical approach could understand confirmation ordering as follows:

H1 ≽i H2 if and only if, given ei, the probability that H1 is false despite concluding that H1 is true is lower than the probability that H2 is false despite concluding that H2 is true.

On these accounts of the evidence-hypothesis relation, see (respectively) Sober (2008), Fitelson (1999), and Mayo (1996).
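As a small illustration, on the likelihoodist reading above a single mode’s confirmation ordering can simply be read off from its likelihoods (the values here are hypothetical):

```python
# Hypothetical likelihoods p(e_i | H) for one mode of evidence i; on the
# likelihoodist reading above, these induce that mode's confirmation ordering.
likelihoods = {"H1": 0.40, "H2": 0.30, "H3": 0.05}

# Order the hypotheses from most to least confirmed by this mode.
ordering = sorted(likelihoods, key=likelihoods.get, reverse=True)
print(ordering)  # ['H1', 'H2', 'H3'], i.e. H1 ≻ H2 ≻ H3 for this mode
```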

Confirmation orderings are inherently less informative than absolute measures of evidential support (such as likelihood ratios, likelihood differences, posterior ratios, and posterior differences). Absolute measures of evidential support have been the primary focus of confirmation theory, but comparative measures of evidential support of hypotheses have received at least some attention in modern confirmation theory.83

Sober (2008) argues that when comparing two hypotheses, if we have little information about the prior probabilities of the hypotheses, given evidence from a

83 See, e.g., Hacking (1965), Royall (1997) and Sober (2007).

single mode we should use the “law of likelihood” to compare their relative support, rather than attempt to compare their posterior probabilities. Comparative measures of support can be derived from absolute measures of support, but not vice versa. In §6 I sketch an argument which shows that often in scientific practice absolute measures of confirmation are not possible, and in such cases comparative measures of confirmation should be preferred.

I introduce two further notions not broadly used in philosophy of science.

A profile of confirmation orders, or simply profile, is a vector (≽1, …, ≽n) of confirmation orders. An amalgamation function (AF) is a rule or function which aggregates multimodal evidence. To know how well a hypothesis is supported by multimodal evidence, evidence from particular modes must be combined by an AF, just as multiple individuals’ preferences must be amalgamated by a social welfare function in order to determine a group’s preference. It is likely that different sciences should have different amalgamation functions. There are functions that combine quantitative evidence from different modes and have quantitative outputs, including

Dempster-Shafer Theory, Bayesian conditionalization, and statistical meta-analysis, and there are functions that combine evidence from different modes but have qualitative outputs, such as the evidence hierarchy schemes in evidence-based medicine or consensus conferences in medicine and social policy. Many disciplines currently have AFs, but since the notion has not been studied in depth we have no principles to systematically assess and compare the various AFs currently in use. The notion is perfectly general: an AF is simply any way of considering the support that

diverse evidence provides to a hypothesis. However, the theorem presented here concerns only AFs which map each possible profile of confirmation orders to an output confirmation order.84 This limitation will be defended in §8.5 and §8.6.

Given a profile (≽1, …, ≽n), the corresponding aggregate relation is denoted by ≽ (= AF(≽1, …, ≽n)). Similarly, given a profile (≽’1, …, ≽’n), the corresponding aggregate relation is denoted by ≽’. In short, AF output orders are denoted by dropping the indexes of the input orders.

The goal of much of the literature on preference amalgamation is to specify the logical boundaries on what social welfare functions can do. Just as impossibility theorems serve to delimit the logical space of possibility for preference amalgamation functions, similar impossibility theorems for multimodal evidence amalgamation functions can be constructed, with the same goal. To demonstrate that this is so, I now provide an analogue to Condorcet’s voting paradox and an analogue to Arrow’s impossibility theorem.

8.4.2 Impossibility Theorem: Arrow Analogue

There are several desiderata one might want a multimodal evidence AF to meet; these are analogues to Arrow’s desiderata for social welfare functions. I here informally state the desiderata; I state the desiderata in formal terms in the Appendix

(§8.8).

84 This builds a ‘universal domain’ criterion into the definition of an aggregation rule; i.e. all profiles of orders are permissible inputs of the AF.


Unanimity (U): If all modes prefer one hypothesis over another, then the AF must do the same.

Non-Dictatorship (D): No mode is dictatorial (i.e. no mode of evidence always trumps all other modes).

Independence of Irrelevant Alternatives (I): The way two hypotheses are ranked relative to each other by an AF depends only on how the individual modes rank these two hypotheses relative to each other, and not on how the modes rank them relative to other hypotheses.

Ordered Output (O): An AF generates a confirmation order for every profile.85

Now it should be clear that the amalgamation of multimodal evidence can be structured in a way perfectly analogous to the amalgamation of preferences. Thus, one should expect an impossibility theorem for confirmation analogous to Arrow’s impossibility theorem for preference amalgamation. The final condition that must be met for the theorem to hold is that there must be at least two modes of evidence available, and there must be at least three hypotheses in H.

85 Ordered Output is one way in which the Constraint desideratum for AMs, discussed in §7.2, could be satisfied. In my view, Ordered Output is a modest and realistic goal for both confirmation functions and amalgamation functions (as argued for in a roundabout way in Chapters 2 and 3).


Theorem: No AF can jointly satisfy U, I, O and D.

Proof: See Appendix (§8.8).

Here is another way to put the theorem: any amalgamation function which satisfies Unanimity, Independence of Irrelevant Alternatives, and Ordered Output must be a Dictatorship.
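One way to get a feel for the result is to consider a seemingly natural AF that ranks one hypothesis above another whenever a majority of modes does so. Such a rule satisfies Unanimity, Independence, and Non-Dictatorship, yet it can fail Ordered Output, as the following sketch with three hypothetical modes and a Condorcet-style profile shows:

```python
# Sketch: a 'pairwise majority' AF over three hypothetical modes. It satisfies
# Unanimity, Independence, and Non-Dictatorship, but on this profile its output
# is cyclic, so it fails Ordered Output.
orderings = {
    "mode i": ["H1", "H2", "H3"],   # H1 above H2 above H3
    "mode j": ["H2", "H3", "H1"],
    "mode k": ["H3", "H1", "H2"],
}

def majority_ranks_above(h_a, h_b):
    votes = sum(order.index(h_a) < order.index(h_b) for order in orderings.values())
    return votes > len(orderings) / 2

print(majority_ranks_above("H1", "H2"))  # True
print(majority_ranks_above("H2", "H3"))  # True
print(majority_ranks_above("H3", "H1"))  # True; a cycle, so no confirmation order results
```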

8.5 Arrow’s Axioms

The impossibility theorem presented here is a surprising result, just as Arrow’s theorem was. At first glance, Unrestricted Domain, Independence, and Unanimity have little to do with each other, but this theorem shows that the only AF which can jointly satisfy these three desiderata is a Dictatorship. In short, the theorem is surprising and the proof is valid, and so it is reasonable to critically evaluate the axioms. Before considering the axioms in the case of amalgamating multimodal evidence it will help to consider the axioms in the case of amalgamating preferences.

Much research after the publication of Arrow’s theorem was directed at the axioms that Arrow used, and arguments were proposed to relax some of the axioms, and other theorems were demonstrated by strengthening, relaxing, removing, or

adding assumptions.86 In short, though, Non-Dictatorship, Unanimity, and

Unrestricted Domain are intuitively plausible principles for a SWF, especially if one is committed to minimally democratic norms. The axiom most often thought reasonable to relax is Independence of Irrelevant Alternatives. Indeed, some have considered this axiom to be too strong a requirement for a SWF.

Recall what the Independence of Irrelevant Alternatives axiom in Arrow’s

Theorem requires: the ordering of any set of choices by the SWF should only depend on the individuals’ orderings of those choices. The axiom limits the information regarding individuals’ preferences available to a SWF in at least two ways. First, it disallows information regarding individuals’ preferences about options outside the choice set from influencing the output ordering of the SWF. Second, it only allows ordinal information regarding individuals’ preferences to be amalgamated by the SWF. I will call the first aspect of the axiom simply ‘Independence’, and the second aspect ‘Ordinality’. I address each in turn, since both aspects of the axiom are important to the confirmation analogue.

8.5.1 Independence

Independence is an intuitively desirable feature of an SWF. For example, if I prefer apples (A) over bananas (B), and bananas over cherries (C), my preference ordering of these fruit is

86 See, for example, Black (1958), Sen (1970), and Arrow’s 1963 edition of Social Choice and Individual Values. For a philosophical exposition and defense of Non- Dictatorship, Unanimity, and Unrestricted Domain, see MacKay (1980).


(1) A ≻ B ≻ C

If I then include strawberries (S) in my preference ordering of fruits, then (S) might be more or less preferable to (A) and/or (B) and/or (C), so that one possible ordering might be

(2) A ≻ S ≻ B ≻ C

but the orderings of (A) and (B) and (C) relative to each other should not change upon consideration of (S), so that the following ordering, for example, is prohibited:

(3) B ≻ S ≻ C ≻ A

The mere inclusion of (S) in my appreciation of fruit should not change my relative orderings of (A) and (B) and (C). In (3), the mere inclusion of (S) in my rank-ordering of fruit has shifted (A) from my most-preferred fruit to my least-preferred fruit, which seems irrational. In (2) Independence is satisfied whereas in (3) Independence is violated. A natural question to ask is: what does ‘irrelevance’ mean? Suppose I have an odd condition which makes strawberries react with apples, thereby causing digestive discomfort. Then, supposing my rank-ordering of fruit was for the purpose of making a fruit salad, in which the three highest ranked fruit would be consumed together, then strawberries would be relevant to my rank-ordering of apples with respect to other fruit, and so (3) might then be a reasonable ordering of my fruit preferences.

Despite the intuitive plausibility of Independence, voting systems exist which fail to satisfy it. For example, the Borda count method is an election method in which voters rank candidates in order of their preferences: if there are n candidates, then a

candidate receives n points for a voter’s first preference, n - 1 points for a second preference, and so on; the points are then summed and the candidate with the most points wins. That Borda count does not satisfy Independence is a commonly recognized fact amongst social choice theorists. My purpose in raising the matter is to urge that the axioms (or at least the Independence axiom) used in Arrow’s theorem should be thought of as desiderata rather than as necessary criteria for a SWF. Despite the fact that Borda count fails at least one of the axioms, it is occasionally used in real voting systems: the Pacific Island nation of Kiribati, for example, uses Borda count to elect its politicians. The violation of a desideratum in actual cases of preference amalgamation diminishes the desirability of the amalgamation function, but some desiderata are worse to violate than others, and many have thought that Independence is the most acceptable desideratum to violate. This will be important when considering the evidence analogues of the desiderata.
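A small sketch (the ballots are invented) shows both how the Borda count assigns points and how it can violate Independence: dropping a third candidate from the ballots reverses the relative ranking of the remaining two.

```python
# Sketch of the Borda count and an Independence violation: the relative ranking
# of A and B depends on whether C appears on the (invented) ballots at all.
def borda(ballots):
    # With n candidates, a first preference earns n points, a second n - 1, and so on.
    n = len(ballots[0])
    scores = {}
    for ballot in ballots:
        for position, candidate in enumerate(ballot):
            scores[candidate] = scores.get(candidate, 0) + (n - position)
    return scores

with_c = [["A", "B", "C"]] * 3 + [["B", "C", "A"]] * 2          # five hypothetical voters
without_c = [[x for x in ballot if x != "C"] for ballot in with_c]

print(borda(with_c))     # {'A': 11, 'B': 12, 'C': 7} -> B ranked above A
print(borda(without_c))  # {'A': 8, 'B': 7}           -> A ranked above B
```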

8.5.2 Ordinality

The ordinality aspect of Arrow’s Independence of Irrelevant Alternatives axiom might at first glance seem unrealistically constraining, since it prohibits information about the intensity of individuals’ preferences for the available choices. If we include preference intensity information in a SWF, and we assume we can make meaningful interpersonal comparisons of such information, then we can avoid Arrow’s theorem (Sen 1970). One might think that we can elicit preference intensities from individuals. The intensity of individuals’ preferences could be measured on an interval scale (or an absolute scale), in which the meaning between two equally-sized intervals

on the scale is the same across the scale and for all individuals. Such preference intensity measures, if measured in a non-arbitrary and objective way, might allow inter-personal comparisons of preference intensity. This information would be richer than ordinal information: ordinal rankings can be inferred from measures on an interval scale, but interval measures cannot be inferred from ordinal rankings. The

Independence axiom in Arrow’s theorem explicitly stipulates the exclusion of measures on an interval or absolute scale.

Is this limitation to ordinal information of preferences for a SWF justified?

There are at least two reasons to think so. First, it is unlikely that preferences can be meaningfully measured on an interval scale. Preferences are just words used to summarize poorly understood mental phenomena. There is no standard, intersubjective scale with which to measure preferences (at least one which is not arbitrary in important respects). Since a plausible guiding principle is to limit ourselves to what is both meaningful and possible, and since ordinal rankings of preferences are both meaningful and possible, and since interpersonal interval or absolute measures of preferences are not both meaningful and possible, a SWF should be limited to ordinal measures of preferences. Second, even supposing it were possible to elicit meaningful preference intensity measures, it is not obvious that we would want to include such information when amalgamating individuals’ preferences, since including preference intensity information in a SWF would benefit fanatics at the expense of moderates: the preferences of those individuals with higher preference intensities would count more toward the group choice than would the preferences of individuals with less-intense

preferences.87 Thus, it is both possible and desirable to limit the input of an SWF to ordinal rankings of preferences, as opposed to interval or cardinal measures of preference intensity.

8.6 Axioms of Present Theorem

How plausible are the axioms for the impossibility theorem for confirmation?

For the theorem to be broadly applicable, the axioms would have to be seen as broadly desirable features of an AF. In what follows I give reasons for thinking that, although these desiderata are not exceptionless, inviolable principles of rationality, they are generally desirable features of an AF. In other words, although these desiderata are neither necessary nor sufficient conditions for an AF to be truth-conducive, AFs which satisfy them are expected to be more truth-conducive, on average, than those which do not satisfy them. Just as with Arrow’s theorem, by far the most complicated and controversial axiom is Independence of Irrelevant Alternatives, so I leave it for last and devote the majority of space to it.

8.6.1 Unrestricted Domain (Universality)

Although not explicitly stated as an axiom in the theorem, the definition of an amalgamation function assumed what social choice theory calls ‘Unrestricted

Domain’, and what I refer to as ‘universality.’ The Unrestricted Domain axiom in the case of amalgamating confirmation orderings is nearly a dictate of reason. If an AF

87 The considerations in this paragraph skirt over decades of controversy. See MacKay (1980) for a philosophical discussion and defense of the ordinality limitation.

could accept as its input only a limited range of possible confirmation orderings, then the AF could be faced with some confirmation ordering by some modes of evidence which it would not be able to amalgamate. Such AFs would be more constrained than they otherwise need be. As with Arrow's Theorem, restrictions to the scope of orderings available to the aggregation function can be devised which allow the impossibility result to be avoided (for example, as Black (1958) showed, limiting the scope of preferences which are made available to a SWF to 'single-peaked' preferences thereby avoids Arrow's theorem). However, it is hard to imagine a reason to restrict the domain of a multimodal evidence AF which is independent of the wish to avoid the impossibility result presented here.

8.6.2 Non-Dictatorship

The Non-Dictatorship axiom is, like Unrestricted Domain, nearly a requirement of rationality. Non-Dictatorship demands that no single mode of evidence determine the confirmation ordering of an AF, for any of the confirmation orderings of the mode. This desideratum is weaker than, but follows logically from, a requirement that an AF consider all evidence from all relevant modes. Carnap’s “Principle of Total

Evidence” is the similar requirement that a person must consider all available evidence when estimating a probability. If one does not consider all evidence from all available modes, then one is liable to unnecessary inductive risk. In the case of preference amalgamation, the Non-Dictatorship desideratum is a corollary of basic democratic commitments. In the case of evidence amalgamation, the Non-Dictatorship desideratum is a corollary of basic empiricist commitments. One of the purposes of a


SWF is to take into account the preferences of all decision-makers – if we did not have the goal of accommodating the preferences of all (or at least of most), then there would be no need for a SWF in the first place. Similarly, one of the purposes of an AF is to take into account all available evidence – if we did not have the goal of considering all available evidence (or at least most of it), then there would be no need for an AF in the first place. Thus, the desideratum of Non-Dictatorship is a necessary feature of an AF.

8.6.3 Unanimity

An intuition undergirding robustness arguments is that if all modes of evidence confirmation order one hypothesis over another, then the AF should do the same, and if it does not, the AF is flawed. The following toy example illustrates the Unanimity desideratum. Suppose our three hypotheses are:

H1: The global climate is warming.

H2: The global climate is neither warming nor cooling.

H3: The global climate is cooling.

And suppose we have three modes of evidence:

i: Atlantic ocean temperature measurements

j: Arctic ice mass measurements

k: Atmospheric CO2 concentration measurements

Further suppose that all three modes of evidence confirmation order H1 over H2 and

H2 over H3. More precisely, using the notation introduced in §8.4, suppose we have the following confirmation orderings:


H1 ≻i H2 ≻i H3

H1 ≻j H2 ≻j H3

H1 ≻k H2 ≻k H3

In such a situation it is intuitively compelling to demand of any AF that upon amalgamation of evidence from i, j, and k, the AF must confirmation order H1 over H2 and H2 over H3. That is, given the above confirmation orderings it is reasonable to demand of the AF that

H1 ≻ H2 ≻ H3

Unanimity, then, is a general (but not exceptionless) desideratum of AFs.
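The Unanimity requirement in this toy example can be written down as a simple check; the sketch below (in Python) is purely illustrative and is not part of the chapter's formal apparatus.

    def unanimously_ordered_over(orderings, h1, h2):
        """True if every mode's confirmation ordering (a list from most to
        least confirmed) ranks h1 strictly above h2."""
        return all(o.index(h1) < o.index(h2) for o in orderings)

    # Confirmation orderings of modes i (ocean temperatures), j (Arctic ice
    # mass), and k (CO2 concentrations) in the global-warming example:
    orderings = [["H1", "H2", "H3"], ["H1", "H2", "H3"], ["H1", "H2", "H3"]]

    # Unanimity demands that the AF output preserve these shared rankings:
    print(unanimously_ordered_over(orderings, "H1", "H2"))  # True
    print(unanimously_ordered_over(orderings, "H2", "H3"))  # True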

Despite the intuitive appeal of Unanimity, epistemic modesty requires us to recognize that Unanimity is fallible; cases of dyssynergistic evidence and pseudorobustness (discussed in Chapters 4 and 5, respectively) are situations in which it fails. To illustrate the fallibility of Unanimity, consider another example. Suppose you are a hospital's chief of medicine, pondering a patient's survival, and you have available two modes of evidence: verbal reports from Dr. Blue, and verbal reports from Dr. Green. Let your hypotheses be:

H1: the patient will live

H2: the patient will die

Let your modes of evidence be:

i: verbal reports from Dr. Blue

j: verbal reports from Dr. Green


Dr. Blue tells you that she is giving the patient drug X, because drug X is known to help such patients; for Dr. Blue, as for you, p(H1|ei) > p(H1), and suppose that p(H1|ei) - p(H1) > p(H2|ei) - p(H2), so in the above notation: H1 ≻i H2. Dr. Green tells you that he is giving the patient drug Y, because drug Y is known to help such patients; for Dr. Green, as for you, p(H1|ej) > p(H1), and suppose that p(H1|ej) - p(H1) > p(H2|ej) - p(H2), so in the above notation: H1 ≻j H2. However, you know that administering drug X and drug Y together will be lethally damaging to the patient's kidney; you decide that, although p(H1|ei) - p(H1) > p(H2|ei) - p(H2) and p(H1|ej) - p(H1) > p(H2|ej) - p(H2), p(H1|ei & ej) - p(H1) < p(H2|ei & ej) - p(H2), and in the above notation: H2 ≻ H1. In other words, the amalgamated evidence has the opposite rank-ordering of hypotheses than do both individual modes of evidence. Unanimity fails in this case, for a seemingly good reason.
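The structure of this case can be displayed with illustrative numbers; the probability values below are entirely hypothetical and are chosen only to exhibit how the individual and amalgamated confirmation differences can come apart.

    # Hypothetical priors and posteriors for the drug-interaction example.
    p_H1, p_H2 = 0.5, 0.5                  # priors: patient lives (H1) or dies (H2)
    p_H1_ei, p_H2_ei = 0.7, 0.3            # after Dr. Blue's report (drug X)
    p_H1_ej, p_H2_ej = 0.7, 0.3            # after Dr. Green's report (drug Y)
    p_H1_eij, p_H2_eij = 0.1, 0.9          # after both reports (lethal interaction)

    # Each mode on its own confirms H1 more than H2 ...
    print(p_H1_ei - p_H1 > p_H2_ei - p_H2)    # True: H1 ordered above H2 by mode i
    print(p_H1_ej - p_H1 > p_H2_ej - p_H2)    # True: H1 ordered above H2 by mode j
    # ... but the amalgamated evidence reverses the ordering.
    print(p_H1_eij - p_H1 < p_H2_eij - p_H2)  # True: H2 ordered above H1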

Thus, Unanimity should not be construed as a necessary requirement of rationality, since it is occasionally violated by seemingly truth-conducive AFs.

Nevertheless, although it is not exceptionless, the ‘robustness’ intuition illustrated by the global-warming example has been frequently defended as a useful and relatively general heuristic for scientific reasoning. The same reasons which justify the robustness intuition support the status of Unanimity as a generally desirable feature of an AF.

8.6.4 Independence of Irrelevant Alternatives


The Independence of Irrelevant Alternatives axiom is perhaps the most troublesome of the axioms used in the impossibility theorem for confirmation. Recall what the axiom stipulates: an AF confirmation ordering of hypotheses {H1 … Hn} should depend only on the confirmation orderings, by individual modes, of {H1 … Hn}. Thus, the axiom has both the Independence and

Ordinality features as it does in the case of amalgamating preferences. I will discuss each in turn.

8.6.4.1 Independence

The Independence aspect of the axiom in the case of evidence amalgamation is formally analogous to the case of preference amalgamation, and it is just as intuitively compelling, as the following example illustrates. Suppose an AF has confirmation ordered a heliocentric model of the solar system (Hh) over an epicyclic model (He), and an epicyclic model over an eccentric model (Hc); the confirmation ordering would then be:

(1’) Hh ≻ He ≻ Hc

We might then include another hypothesis in our rank-ordering – the blue cheese (Hb) model of the solar system, for example – and determine that the AF confirmation orders the blue cheese model above the epicyclic model of the solar system, and so the following confirmation ordering is possible:

(2’) Hh ≻ Hb ≻ He ≻ Hc


But merely including the blue cheese model hypothesis in the ordering of astronomical hypotheses should not alter the relative confirmation orderings of the heliocentric model, the epicyclic model, and the eccentric model; thus the following confirmation ordering would be prohibited:

(3’) Hc ≻ Hb ≻ He ≻ Hh

In (2’), but not in (3’), Independence is satisfied.

It is possible to construct cases in which it seems reasonable to relax

Independence. Consider the following example (which, for simple exposition, is more fiction than fact). The biodiversity of an ecosystem has long been known to be correlated with the bioproductivity of the ecosystem (this was briefly discussed in

Chapter 5). The direction of causality, however, is not known. Suppose our two hypotheses are:

H1: Biodiversity causes bioproductivity

H2: Bioproductivity causes biodiversity

The variables ‘biodiversity’ and ‘bioproductivity’ are coarse-grained, but suppose that for each macro-level hypothesis we hypothesize one micro-level causal mechanism, which I will call, respectively, H1’ and H2’. Given all available evidence, H1’ ≻ H2’

(e.g. suppose p(H1’) = 0.6 and p(H2’) = 0.4). But then suppose a second micro-level causal mechanism is proposed for H1, which I will call H1’’, and suppose that in the absence of any new evidence, the rank-orderings of the macro-level hypotheses do not change. Whatever credence is devoted to H1’’ must come from H1’. If H1’ were equally plausible as H1’’ (e.g. if p(H1’) = 0.3 and p(H1’’) = 0.3), then the plausibility of H1’


should go down, and so it follows that H2’ ≻ H1’. In other words the AF confirmation

ordering of the micro-level causal hypotheses H1' and H2' was reversed, merely by introducing a competitor micro-level causal hypothesis, while the rank-ordering of the macro-level hypotheses did not change. Such situations are possible to imagine; how frequent they are is an empirical question, but at first glance there is little reason to think they are common. The Independence aspect of the axiom is thus a generally compelling desideratum for an amalgamation function.
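The arithmetic behind this example can be laid out explicitly; the numbers are just those supposed above.

    # Before H1'' is proposed: one micro-level mechanism per macro-level hypothesis.
    credence = {"H1'": 0.6, "H2'": 0.4}
    print(credence["H1'"] > credence["H2'"])   # True: H1' is confirmation ordered above H2'

    # After H1'' is proposed, with no new evidence, the macro-level credences stay
    # fixed (p(H1) = 0.6, p(H2) = 0.4), but H1's credence is now split between two
    # equally plausible micro-level mechanisms.
    credence = {"H1'": 0.3, "H1''": 0.3, "H2'": 0.4}
    print(credence["H2'"] > credence["H1'"])   # True: the H1'-H2' ordering has reversed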

8.6.4.2 Ordinality

The Ordinality aspect of the axiom might strike some as odd. Ordinality limits the information made available to the AF regarding the support that evidence provides to a hypothesis, by ruling out information stronger than the comparative confirmation ordering of multiple hypotheses. More specifically, Ordinality rules out cardinal measures of confirmation. The same is true of the Ordered Output axiom, which states that the output of an AF is itself a confirmation order. These might seem unduly restrictive, since at first glance we do have cardinal measures of confirmation for hypotheses: probabilities are measured on a cardinal scale. If we were to include such information in an AF, then the above impossibility theorem simply would not apply, since the Independence axiom would not be satisfied (and perhaps neither would the Ordered Output axiom). Thus, such reasoning might go, the impossibility theorem for confirmation presented here is based on an artificial limitation of information. We can and should relax this limitation for real cases of confirmation of multiple hypotheses by multimodal evidence, since we often have absolute measures of confirmation; it would follow that the theorem presented here is of limited scope, since its domain of application is limited to those situations in which absolute measures of confirmation are not available. So one might think.

It is true, of course, that there are classes of evidence-hypothesis relations such that determining a precise measure of the support that the evidence provides to the hypothesis is possible. If evidence e is deductively entailed by hypothesis H, then the likelihood of the evidence, p(e|H), can be trivially determined: it is 1. If the negation of evidence e is deductively entailed by hypothesis H, then p(e|H) can likewise be trivially determined: it is 0. If H specifies a particular chance set-up (as in classic examples such as colored balls in an urn), and e is a particular outcome of this chance set-up, then p(e|H) can be determined.88 Ordinality, in these classes of evidence-hypothesis relations, is, for good reasons, not satisfied.
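In these special cases the likelihood is fixed by the hypothesis itself. A minimal illustration of the chance set-up case (the urn's composition is invented):

    from fractions import Fraction

    # If H deductively entails e, p(e|H) = 1; if H entails not-e, p(e|H) = 0.
    # If H specifies a chance set-up, p(e|H) is fixed by that set-up. For example,
    # let H say that an urn holds 3 red and 7 blue balls, and let e be "a red ball
    # is drawn" on a single random draw:
    red, blue = 3, 7
    p_e_given_H = Fraction(red, red + blue)
    print(p_e_given_H)  # 3/10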

However, there is a class of evidence-hypothesis relations such that determining a cardinal measure of support that the evidence provides to the hypothesis is impossible. In Chapter 3 I argue that this class of evidence-hypothesis relations is large, at least in the empirical sciences. In a nutshell, the argument is as follows. There are numerous features of evidence (such as the quality, relevance, and transparency of the evidence; the theoretical plausibility of the evidence; patterns in the evidence; and concordance with other evidence) that must be weighed and variably prioritized when assessing evidence, and there are numerous equally rational yet contradictory ways to do so; thus the probability of the evidence under the assumption that a particular

88 These are discussed in §3.2

hypothesis is true – that is, the likelihood – is, at least in many cases in the empirical sciences, indeterminate. In this class of evidence-hypothesis relations, then, absolute measures of confirmation are impossible to determine. But for this class of evidence-hypothesis relations (or at least a significant subset), ordinal rankings of confirmation

(what I above call confirmation orderings) may still be possible.

As discussed in Chapter 3, I am not the first to note the difficulty with determining likelihoods. Earman, for instance, wrote: “…while much of the attention on the Bayesian version of the problem has focused on the assignments of prior probabilities, the assignments of likelihoods involves equally daunting difficulties”

(1992). Similarly, Glymour claims that determinate likelihoods are possible only in rare circumstances, and that in most empirical science "no such immediate and natural alternative distribution of degree of belief is available" (1980). For much of empirical science we cannot determine precise, absolute measures of confirmation on an interval scale. At least in such cases it is reasonable to limit an AF to ordinal, comparative confirmation orderings among multiple competing hypotheses.

One might worry that the considerations raised against the possibility of cardinal measures of confirmation are equally problematic for determining the ordinality of confirmation among multiple competing hypotheses. After all, in §8.4 confirmation ordering was explicated in terms of cardinal measures of confirmation

(likelihoodist and Bayesian measures). However, cardinal measures are inherently richer in information than ordinal measures: one can derive ordinal measures from cardinal measures, but not vice versa. So there is 'something more' needed to justify a cardinal measure over an ordinal measure, and the arguments above are directed at that 'something more'. Another way of putting this is: despite the complexities of evidential assessment, when we have evidence for competing hypotheses, we often can at least know that H1 is confirmed more than H2, even though we usually cannot know the precise values of these confirmations. Comparative confirmation is not necessarily derived from (or reducible to) respective cardinal measures of confirmation. My argument justifies skepticism about cardinal measures of confirmation for a class of evidence-hypothesis relations, but not wholesale skepticism about whether evidence can sort better confirmed hypotheses from worse confirmed ones.

For many empirical situations it is reasonable to limit an AF to ordinal rankings of hypotheses, since often anything beyond ordinal rankings of hypotheses – for example, measures of confirmation on cardinal scales – is impossible, and it is reasonable to limit our methods to what is possible. In these cases, the Ordinality aspect of the Independence of Irrelevant Alternatives axiom is a reasonable restriction, as is the Ordered Output axiom. And, though arguing for the point here would take me far afield, such cases are ubiquitous in science.

8.6.5 Summary

In this section I have discussed the four axioms required for the impossibility theorem for confirmation. Although there are cases in which a truth-conducive AF does not satisfy one or more of the axioms, nonetheless the axioms should be considered generally desirable features of an AF. Moreover, the occasional mismatch between the truth-conduciveness of an AF and its satisfaction of the desiderata further

strengthens the analogy between amalgamating evidence and amalgamating preferences, because there are similar occasional mismatches in the preference case.

8.7 Conclusion

I have argued that there is an analogy between amalgamating individuals’ preferences into a group decision – a subject which has been developing technically for the past several decades – and amalgamating multimodal evidence into a confirmation ordering of competing hypotheses. Just as the former faces Arrow’s famous impossibility theorem, I here prove that amalgamating multimodal evidence into a confirmation ordering of competing hypotheses faces an analogous impossibility theorem. The axioms of the theorem are generally desirable features of an AF, although they are not exceptionless requirements of rationality – this too is another parallel in the analogy between amalgamating preferences and amalgamating evidence. Besides proving the impossibility theorem itself, this chapter more generally shows that at least some of the technical results in social choice theory have plausible analogues for confirmation theory. The potential to utilize these richly-developed results from social choice theory is, I hope, promising.

The general conclusion of the argument presented here should not be seen as overly pessimistic. The argument is simply based on a piece of logic best construed as a no-go theorem which directs our attention to assessing the general plausibility of its axioms. In any situation in which the axioms are satisfied, the theorem applies. I have argued for the frequent (but not universal) applicability of the axioms.


Another way to describe the theorem is as an inconsistent quaternity.

Philosophers are used to inconsistent triads – the problem of evil, for example, can be described as the claim that, given the existence of evil, no entity can be omniscient, omnipotent, and omnibenevolent. The theorem presented here states that no amalgamation function can jointly satisfy unanimity, non-dictatorship, independence of irrelevant alternatives, and universality.

8.8 Appendix: Proof

I include a formal proof of the theorem to make the formal analogy with preference amalgamation explicit. I rely on the notation and definitions introduced in

§8.4, and a few additional pieces of terminology are used in the proof: A hypothesis H is 'top' in a confirmation relation if it is confirmation ordered (strictly) above all other hypotheses. A hypothesis H is 'bottom' in a confirmation relation if it is confirmation ordered (strictly) below all other hypotheses. A hypothesis H is 'extreme' in a confirmation relation if it is top or bottom. Finally, the formal statements of the AF desiderata are as follows (with informal statements repeated in parentheses for ease of reference):

Unanimity (U)

For all profiles (≽1, …, ≽n) and every pair of hypotheses H1, H2 in H, if H1 ≻i H2 for all modes i, then H1 ≻ H2. (Informally: If all modes confirmation order one hypothesis over another, then the AF must do the same.)


Independence of Irrelevant Alternatives (I)

For every pair of hypotheses H1, H2 in H and every pair of profiles (≽1, …, ≽n), (≽'1, …, ≽'n), if [for all modes i the relations ≽i and ≽'i coincide on {H1, H2}] then ≽ and ≽' coincide on {H1, H2}. Two relations coincide on {H1, H2} when they either both rank H1 strictly over H2 or both rank H2 strictly over H1 or both rank H1 and H2 equally. (Informally: The way two hypotheses are ranked relative to each other by an AF depends only on how the individual modes rank these two hypotheses relative to each other, and not on how they rank them relative to other hypotheses.)

Ordered Output (O)

AF generates an order (i.e., a transitive, reflexive and connected relation) for every profile. (Informally: The output of the AF is itself a confirmation ordering.)

Non-Dictatorship (D)

There is no mode i such that AF(≽1, …, ≽n) = ≽i for all profiles (≽1, …, ≽n). (Informally: No mode is dictatorial.)

Theorem: There exists no amalgamation function satisfying U, I, O, and D.
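To see how the four conditions pull against one another, consider the natural 'pairwise majority' rule, which orders one hypothesis over another whenever a majority of modes do so. This rule satisfies analogues of Unanimity, Independence, and Non-Dictatorship, but the sketch below (with an invented, Condorcet-style profile) shows that its output need not be an ordering, so Ordered Output fails.

    from itertools import combinations

    def majority_verdicts(profile):
        """Pairwise-majority amalgamation: record (h1, h2) whenever more modes
        rank h1 above h2 than rank h2 above h1."""
        hypotheses = profile[0]
        verdicts = set()
        for h1, h2 in combinations(hypotheses, 2):
            margin = sum(1 if o.index(h1) < o.index(h2) else -1 for o in profile)
            if margin > 0:
                verdicts.add((h1, h2))
            elif margin < 0:
                verdicts.add((h2, h1))
        return verdicts

    # A Condorcet-style profile of confirmation orderings from three modes:
    profile = [["H1", "H2", "H3"],
               ["H2", "H3", "H1"],
               ["H3", "H1", "H2"]]
    print(majority_verdicts(profile))
    # The verdicts are H1 over H2, H2 over H3, and H3 over H1: a cycle, so the
    # output violates transitivity and hence the Ordered Output condition.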


The proof of the impossibility theorem for confirmation follows the strategy of a proof of Arrow's Theorem due to Geanakoplos (2005), and proceeds in four steps. The strategy is to assume U, I, and O, and then to prove the violation of D, i.e., the existence of a dictator.

Step 1

If a hypothesis H1 is extreme in every confirmation order of a profile, then it is also extreme in the AF output.

Proof

Assume the contrary: suppose hypothesis H1 is extreme in all orders of profile (≽1, …, ≽n) but not extreme in the output relation ≽. Then there exist hypotheses H2 and H3 such that H2 ≻ H1 and H1 ≻ H3. So, by transitivity, H2 ≻ H3. We may assume without loss of generality that all modes i have H3 ≻i H2. Indeed, if this were not the case we could modify the profile so that it becomes true while retaining how each mode ranks H1 relative to any other hypothesis; by (I), this modification would not affect how H1 is ranked compared to any other hypothesis in the AF output. By (U), H3 ≻ H2, in contradiction to H2 ≻ H3.


Step 2

For every hypothesis H there exists a mode i which is ‘pivotal’ for H; that is, i is the earliest mode such that H is top in the AF output relation for every profile in which all modes up to i have H top and all following modes have H bottom.

Proof

Let H be an arbitrary hypothesis. Consider any fixed profile (≽1, …, ≽n) in which H is bottom in every confirmation order. I will call this "profile 0": please see Table 5 following the proof for a graphical representation of the profiles. By (U), H must be bottom in AF(≽1, …, ≽n). For every mode j let ≽'j be the confirmation order in which H is top and any two other hypotheses are ranked just as in ≽j. Define i as the earliest mode with the property that H is not bottom in AF(≽'1, …, ≽'i, ≽i+1, …, ≽n). There exists such an i because, by (U), H is top in AF(≽'1, …, ≽'n). Call the profile (≽'1, …, ≽'i-1, ≽i, ≽i+1, …, ≽n) "profile I", and the profile (≽'1, …, ≽'i, ≽i+1, …, ≽n) "profile II" (see table). By Step 1, since H is extreme in every order of profile II and H is not bottom in AF(≽'1, …, ≽'i, ≽i+1, …, ≽n), H is top in AF(≽'1, …, ≽'i, ≽i+1, …, ≽n). Finally, i is pivotal for H because, by (I), H is top in the AF relation for every profile which shares with (≽'1, …, ≽'i, ≽i+1, …, ≽n) the property that all modes up to i have H top and all following modes have H bottom.


Step 3

For every hypothesis H, there is a mode i which is dictatorial over any pair of hypotheses not containing H.

Proof

Consider any hypothesis H, and let i be the mode which is pivotal for H (i must exist, by Step 2). Consider any pair of distinct hypotheses H1 and H2, each distinct from H, and let (≽1, …, ≽n) be an arbitrary profile. Without loss of generality suppose H1 ≻i H2 (if H2 ≻i H1 then simply exchange the roles of H1 and H2 in the argument). By (I) we may assume that H1 ≻i H ≻i H2 (since the way the output relation ranks H1 and H2 relative to each other is not affected by how these hypotheses compare to H by mode i). We may also assume without loss of generality that H is top in all modes earlier than i and bottom in all modes following i, again because by (I) we can change the rank ordering of H as long as the respective ordering of H1 and H2 does not change. The profile we are considering I will call "profile III". I have already shown that H is bottom in the AF output of profile I and that H is top in the AF output of profile II (since that was how we defined i, which we use in the current step). In profile III all H-H1 pairs are ordered as they were in profile I, so, since H is bottom in AF(profile I), in profile III it must be that H1 ≻ H, by (I). In profile III all H-H2 pairs are ordered as they were in profile II, so, since H is top in AF(profile II), it must be that in profile III, H ≻ H2, by (I). So, by transitivity, H1 ≻ H2. To summarize this step: we assumed only that H1 ≻i H2, made assumptions about how H compares to the other hypotheses which by (I) do not affect the AF ordering of the H1-H2 pair, and showed that H1 ≻ H2.

Step 4

Some mode is dictatorial over every pair of distinct hypotheses H1 and H2.

Proof

By Step 3, for every hypothesis H there is a mode which is dictatorial over all pairs of hypotheses not containing H; call this mode iH. I must show that iH is the same for all H. Suppose for a contradiction that H and H' are hypotheses such that iH and iH' are distinct. Pick any hypothesis H'' distinct from H and from H'. Consider a profile (≽1, …, ≽n) in which H' ≻iH H'' and H'' ≻iH' H and H ≻iH'' H' (this is possible because iH, iH' and iH'' are not all the same mode). Then, by the local dictatorship properties of iH, iH' and iH'' demonstrated in Step 3, H' ≻ H'' and H'' ≻ H and H ≻ H'. This cycle violates transitivity: a contradiction. So iH is the same mode for every H; and since every pair of distinct hypotheses excludes at least one hypothesis, this single mode is dictatorial over every pair, in violation of (D). This completes Step 4, and Step 4 completes the proof.

The following table (Table 5) visually depicts the Profiles constructed in the proof.


Table 5. Profiles Constructed in Proof of Impossibility Theorem for Amalgamating Evidence.

Mode | Profile 0 | Profile I | Profile II | Profile III
1    | … ≻1 H    | H ≻1 …    | H ≻1 …     | H ≻1 …
2    | … ≻2 H    | H ≻2 …    | H ≻2 …     | H ≻2 …
…    |           |           |            |
i    | … ≻i H    | … ≻i H    | H ≻i …     | … H1 ≻i … H … ≻i H2 …
i+1  | … ≻i+1 H  | … ≻i+1 H  | … ≻i+1 H   | … ≻i+1 H
…    |           |           |            |
n    | … ≻n H    | … ≻n H    | … ≻n H     | … ≻n H
AF   | … ≻ H     | … ≻ H     | H ≻ …      | H1 ≻ H and H ≻ H2, so H1 ≻ H2

Each cell in the table indicates the confirmation ordering of the hypotheses listed in the cell, by the mode listed in the left-most column, represented with the indexed ordering symbol ≻i, the hypothesis variables, and ellipses. Thus in the first cell of Profile 0, "… ≻1 H" means "hypothesis H is the least confirmed hypothesis ('bottom') by mode 1", and in the cell for mode i in Profile III, "… H1 ≻i … H … ≻i H2 …" means "for mode i an indeterminate number of hypotheses may be confirmation ordered above H1, H1 is confirmation ordered above H, and H is confirmation ordered above H2, which may itself be confirmation ordered above an indeterminate number of hypotheses".


This chapter will appear as "An Impossibility Theorem for Amalgamating Evidence" in Synthese.

REFERENCES

Achinstein, Peter (2001) The Book of Evidence. Oxford University Press.

Allchin, Douglas (1992) “How Do You Falsify a Question?: Crucial Tests v. Crucial Demonstrations” PSA 1992 (1): 74-88.

Alloway, J.L. (1933) “Further observations on the use of pneumococcus extracts in effecting transformation of type in vitro” Journal of Experimental Medicine 57: 265-278.

Anderson, T.F. (1946) “Morphological and chemical relations in viruses and bacteriophages” Cold Spring Harbor Symposium on Quantitative Biology. 11: 1- 13.

Arrow, Kenneth (1951) Social Choice and Individual Values. John Wiley and Sons.

Arrow, Kenneth (1950) “A Difficulty in the Concept of Social Welfare” Journal of Political Economy 58(4): 328-346.

Assendelft W. J., Koes B. W., Knipschild P. G., Bouter, L. M. (1995) “The relationship between methodological quality and conclusions in reviews of spinal manipulation” The Journal of the American Medical Association 274: 1942-1948.

Avery, O.T., MacLeod, C.M., and McCarty, M (1944) “Studies on the chemical nature of the substance inducing transformation of pneumococcal types: induction of transformation by a desoxyribonucleic acid fraction isolated from pneumococcus type III” Journal of Experimental Medicine 79: 137-158.

Barnes, D., & Bero, L. (1998) “Why review articles on the health effects of passive smoking reach different conclusions” The Journal of the American Medical Association 279(19): 1566-70.

Beatty, John (2006) “Masking Disagreement among Scientific Experts” Episteme 3: 52-67.


Bechtel, William (2006) Discovering Cell Mechanisms. Cambridge University Press.

Bechtel, William (2002) “Aligning Multiple Research Techniques in Cognitive Neuroscience: Why is it Important?” Philosophy of Science 69: S48-S58.

Bechtel, William (2000) "From imaging to believing: Epistemic issues in generating biological data" In R. Creath and J. Maienschein (eds.), Epistemology and Biology, pp. 138-163. Cambridge University Press.

Benveniste, Jacques (1988) “Dr Jacques Benveniste replies” Nature 334: 291.

Black, Duncan (1958) The Theory of Committees and Elections. Cambridge University Press.

Bogen, Jim and Woodward, James (1988) “Saving the Phenomena” Philosophical Review 97: 303-352.

Boghossian, Paul (2006) Fear of Knowledge: Against Relativism and Constructivism. Oxford University Press.

Boivin, A. (1947) “Directed mutation in colon bacilli, by an inducing principle of desoxyribonucleic nature: Its meaning for the general biochemistry of heredity” Cold Spring Harbor Symposia on Quantitative Biology 12: 7-17.

Bordogna, Francesca (2008) William James at the Boundaries: Philosophy, Science, and the Geography of Knowledge. University of Chicago Press.

Borenstein, M., Hedges, L.V., Higgins, J.P.T., Rothstein, H.R. (2009) Introduction to Meta-analysis. Chichester: John Wiley and Sons.

Borgenson, K. (2008) Valuing and Evaluating Evidence in Medicine. PhD diss., University of Toronto.

Box, George (1953) "Non-Normality and Tests on Variances" Biometrika XL: 318-335.

Broadbent, A. (2009) “Causation and Models of Disease in Epidemiology” Studies in History and Philosophy of Biological and Biomedical Sciences 40: 302-311.

Brown, James (2008) “The community of Science” In (Eds.) Carrier, Howard, and Kourany, The Challenge of the Social and the Pressure of Practice: Science and Values Revisited. Pittsburgh: University of Pittsburgh Press.

Carnap, Rudolf (1963) The Philosophy of Rudolf Carnap. (Library of Living Philosophers, Vol. 11). Edited by Paul A. Schilpp. (Open Court: IL).

Carnap, Rudolf (1962) Logical Foundations of Probability.

216

Cartwright, Nancy, and Stegenga, Jacob (forthcoming). “A Theory of Evidence for Evidence-Based Policy” In (Eds.) Dawid, Twining, and Vasilaki, Evidence, inference and enquiry. British Academy Publications.

Cartwright, Nancy (2007a) Hunting Causes and Using Them. Cambridge University Press.

Cartwright, Nancy (2007b) “Are RCTs the Gold Standard?” Biosocieties 2: 11-20.

Cartwright, Nancy (2006) “Well-Ordered Science: Evidence for Use” Philosophy of Science 73: 981-990.

Cartwright, Nancy (1999) The Dappled World. Cambridge University Press.

Cartwright, Nancy, Jordi Cat, Lola Fleck and Thomas Uebel (1996) Otto Neurath: Philosophy Between Science and Politics. Cambridge University Press.

Cartwright, Nancy (1991) "Replicability, Reproducibility, and Robustness: Comments on Collins" History of Political Economy 23: 143-155.

Cartwright, Nancy (1983) How the Laws of Physics Lie. Oxford: Clarendon Press.

Chalmers, Alan (1999) What is This Thing Called Science? Hackett Publishing Company.

Chang, Hasok (2004) Inventing Temperature. Oxford University Press.

Chargaff, E. (1950) “Chemical specificity of nucleic acids and mechanism of their enzymic degradation” Experientia 6: 201-209.

Chargaff, E. (1951) “Structure and function of nucleic acids as cell constituents” Fed. Proc. 10: 654-659.

Charlton, B. G. (1996) “Attribution of Causation in Epidemiology: Chain or Mosaic?” Journal of Clinical Epidemiology 49: 105-107.

Cochrane Handbook. Available online at http://www.cochrane.org/resources/handbook

Collins, Harry (1985) Changing Order: Replication and Induction in Scientific Practice. University of Chicago Press.

Cooper, Harris (2010) Research Synthesis and Meta-Analysis: A Step-by-Step Approach. Sage Publications.


Cowan ML, Bruner BD, Huse N, Dwyer JR, Chugh B, Nibbering ETJ, Elsaesser, T, Miller RJD (2005) “Ultrafast memory loss and energy redistribution in the hydrogen bond network of liquid H2O” Nature 434: 199-202.

Culp, Sylvia (1995) “Objectivity in Experimental Inquiry: Breaking Data-Technique Circles” Philosophy of Science 62: 438-458.

Culp, Sylvia (1994) “Defending Robustness: The Bacterial Mesosome as a Test Case” PSA 1994 1: 46-57.

Danchin A (1988) “Explanation of Benveniste” Nature 334: 286.

Danks, D. (2005) “Scientific Coherence and the Fusion of Experimental Results” The British Journal for the Philosophy of Science 56: 791-807.

Daston, Lorraine, and Galison, Peter (2007) Objectivity. Cambridge: Zone Books.

Davenas E, Beauvais F, Amara J, Oberbaum M, Robinzon B, Miadonna A, Tedeschi A, Pomeranz B, Fortner P, Belon P, Sainte-Laudy J, Poitevin B, Benveniste J. (1988) “Human basophil degranulation triggered by very dilute antiserum against IgE” Nature 333: 816-818.

Dawson, M. H. (1928) "The Interconvertibility of 'R' and 'S' Forms of Pneumococcus" Journal of Experimental Medicine 47(4): 577-591.

De Vreese, L. (2008) “Causal (mis)understanding and the Search for Scientific Explanations: a Case Study from the History of Medicine” Studies in History and Philosophy of Biological and Biomedical Sciences 39: 14-24.

Dietrich, Franz and List, Christian (2007) “Arrow’s Theorem in Judgment Aggregation” Social Choice and Welfare 29: 19-33.

Doll, R., & Hill, A. B. (1950) “Smoking and Carcinoma of the Lung: Preliminary Report” British Medical Journal 2(4682): 739-748.

Doll, R., & Hill, A. B. (1954) “The Mortality of Doctors in Relation to Their Smoking Habits” British Medical Journal 1(4877): 1451-5.

Doll, R. (2003) “Fisher and Bradford Hill: Their Personal Impact” International Journal of Epidemiology 32: 929-931.

Douglas, Heather (2005) “Inserting the Public Into Science” in Democratization of Expertise? Exploring Novel Forms of Scientific Advice in Political Decision- Making, Maasen and Weingart (eds.) Springer Netherlands.

218

Douglas, Heather (2004) “The Irreducible Complexity of Objectivity” Synthese 138(3): 453-473.

Dubos, R. J. (1956) "Obituary of O. T. Avery 1877-1955" Biographical Memoirs of Fellows of the Royal Society 2: 35-48.

Earman, John (1992) Bayes or Bust? A Critical Examination of Bayesian Confirmation Theory. MIT Press/Bradford Books.

Egger, M., Smith, G. D., Phillips, A. N. (1997) “Meta-analysis: Principles and Procedures” British Medical Journal 315: 1533-37.

Epstein, Steven (2007). Inclusion: The Politics of Difference in Medical Research. Chicago: Chicago University Press.

Eysenck, H (1994) “Systematic Reviews: Meta-analysis and its Problems” British Medical Journal 309: 789-792.

Eysenck, H. (1984) “Meta-analysis: an Abuse of Research Integration” Journal of Special Education 18(1): 41-59.

Fierz, W. (1988) “Explanation of Benveniste” Nature 334: 286.

Fitelson, Branden (2007) “Likelihoodism, Bayesianism, and Relational Confirmation” Synthese 156: 473-489.

Fitelson, Branden (2001) “A Bayesian Account of Independent Evidence with Applications” Philosophy of Science 68: S123-140.

Fitelson, Branden (1999) “The Plurality of Bayesian Measures of Confirmation and the Problem of Measure Sensitivity” Philosophy of Science 66: S362-378.

Fitelson, Branden (1996) “Wayne, Horwich, and Evidential Diversity” Philosophy of Science 63: 652-660.

Fitz-James, P (1960) “Participation of the cytoplasmic Membrane in the Growth and Spore Formation of Bacilli” Journal of Biophysical and Biochemical Cytology 8:507-528.

Franklin, Allan (2002) Selectivity and Discord: Two Problems of Experiment. University of Pittsburgh Press.

Franklin, Allan (1986) The Neglect of Experiment. Cambridge University Press.

Franklin, Allan and Colin Howson (1984) “Why Do Scientists Prefer to Vary Their Experiments?” Studies in the History and Philosophy of Science 15:51-62.


Galison, Peter (1987) How Experiments End. University of Chicago Press.

Gardner, Howard (1983) Frames of Mind: The Theory of Multiple Intelligences. New York: Basic Books.

Geanakoplos, John (2005) “Three Brief Proofs of Arrow’s Impossibility Theorem” Economic Theory 26: 211-215.

Glass, G. V. (1976) "Primary, Secondary and Meta-analysis of Research" Educational Researcher 10: 3-8.

Glass, G. V. & Smith, M. L. (1979) “Meta-analysis of Research on Class Size and Achievement” Educational Evaluation and Policy Analysis 1(1): 2-16.

Glymour, Clark (1980) Theory and Evidence. Princeton University Press.

Glymour, Clark (1975) “Relevant Evidence” Journal of Philosophy 72: 403-426.

Griffith, F. (1928) “The Significance of Pneumococcal Types” Journal of Hygiene 27: 113-159.

Hacking, Ian (1988) “Telepathy: Origins of Randomization in Experimental Design” Isis 79(3): 427-51.

Hacking, Ian (1983) Representing and Intervening. Cambridge University Press.

Hacking, Ian (1965) The Logic of Statistical Inference. Cambridge University Press.

Harris, Gardiner (2009) “Where Cancer Progress is Rare, One Man Says No” New York Times September 15 2009.

Hartling, L. Ospina, M., Liang, Y., Dryden, D., Hooten, N., Seida, J., Klassen, T. (2009) “Risk of bias versus quality assessment of randomised controlled trials: cross sectional study” British Medical Journal 339: b4012.

Hawthorne, James (forthcoming) “Bayesian Confirmation Theory” in Continuum Companion to Philosophy.

Hempel, Carl (1966) Philosophy of Natural Science. Englewood Cliffs, N.J.: Prentice-Hall.

Hershey, A.D. (1946) “Spontaneous mutations in bacterial viruses” Cold Spring Harbor Symposium on Quantitative Biology 11: 67-77.

Hershey, A.D. and Chase, M. (1952) “Independent functions of viral proteins and nucleic acid in growth of bacteriophage” Journal of General Physiology 36: 39-56.


Hill, B. (1965) “The environment and disease: association or causation?” Proceedings of the Royal Society of Medicine 58: 295-300.

Horwich, Paul (1982) Probability and Evidence. Cambridge: Cambridge University Press.

Hotchkiss, R.D. (1951) “Transfer of penicillin resistance in pneumococci by the desoxyribonucleate fractions from resistant cultures” Cold Spring Harbor Symposia on Quantitative Biology 16: 457-461.

Hotchkiss, R.D. (1965) "Oswald T. Avery." Genetics 51: 1-10.

Howard, Don (2002) “Lost Wanderers in the Forest of Knowledge: Some Thoughts on the Discovery-Justification Distinction” In Revisiting Discovery and Justification. Max-Planck-Institut für Wissenschaftsgeschichte, Preprint 211. Berlin: 2002, pp. 41-58.

Howick J, Glasziou P, Aronson JK. (2009) “The evolution of evidence hierarchies: what can Bradford Hill's 'guidelines for causation' contribute?” J R Soc Med 102:186-194.

Howson, Colin and Peter Urbach (1989) Scientific Reasoning: The Bayesian Approach. La Salle, IL: Open Court.

Hudson, Robert G. (1999) “Mesosomes: A Study in the Nature of Experimental Reasoning” Philosophy of Science 66(2): 289-309.

Ioannidis, John (2005) “Contradicted and Initially Stronger Effects in Highly Cited Clinical Research” JAMA 294: 218-228.

Jaber, B. L., Lau, J., Schmid, C. H., Karsou, S. A., Levey, A. S., Pereira, B. J. (2002) “Effect of biocompatibility of hemodialysis membranes on mortality in acute renal failure: a meta-analysis” Clinical Nephrology 57(4): 274-82.

Jeffrey, Richard (1965) The Logic of Decision University Of Chicago Press.

Jüni, P., Witschi, A., Bloch, R., Egger, M. (1999). “The hazards of scoring the quality of clinical trials for meta-analysis” The Journal of the American Medical Association 282(11): 1054-60.

Keeley, Brian (2002) “Making Sense of the Senses: Individuating Modalities in Humans and Other Animals” The Journal of Philosophy 99(1): 5-28.

Klein R, Williams A. (2000) “Setting priorities: what is holding us back—inadequate information or inadequate institutions?” In The global challenge of health care rationing (Coulter & Ham, eds). Open University Press.


Knipschild, P. (1994) "Systematic reviews: Some examples" British Medical Journal 309: 719-721.

Kosso, Peter (1989) "Science and Objectivity" The Journal of Philosophy 86(5): 245-257.

Kruse, Michael (1997) “Variation and the Accuracy of Predictions” The British Journal for the Philosophy of Science 48: 181-193.

Kuhn, Thomas (1977) “Objectivity, Value Judgment, and Theory Choice” In The Essential Tension: Selected Studies in Scientific Tradition and Change. University of Chicago Press.

Kuhn, Thomas (1962) The Structure of Scientific Revolutions. University of Chicago Press.

Kusch, Martin (2003) Knowledge By Agreement: The Programme Of Communitarian Epistemology. Clarendon.

Lasters, I and Bardiaux, M. (1988) “Explanation of Benveniste” Nature 334: 285-286.

Laudan, Larry (1970) “William Whewell's Theory of Scientific Method” The British Journal for the Philosophy of Science 21(3): 311-312.

Lederberg, J. and E. L. Tatum (1946) "Gene recombination in Escherichia coli" Nature 158: 558.

Lederberg, Joshua (1956) “Genetic transduction” American Scientist 44: 264-280.

Lederberg, Joshua (1958) “Nobel Lecture” Available at www.nobel.se

Lehrer, Keith and Wagner, Carl (1981) Rational Consensus in Science and Society D. Reidel and Co.

Leigh Star, Susan (1989) Regions of the Mind: Brain Research and the Quest for Scientific Certainty Stanford University Press.

Levene, P.A. (1921) “On the structure of thymus nucleic acid and on its possible bearing on the structure of plant nucleic acid” J Biol Chem 48: 119-125.

Levins, Richard (1966) “The Strategy of Model Building in Population Biology” American Scientist 54: 421-431.

Lewis, David (1969) Convention: A Philosophical Study. Harvard University Press.


Linde, K. & Willich, S. (2003) “How Objective are Systematic Reviews? Differences Between Reviews on Complementary Medicine” Journal of the Royal Society of Medicine 96: 17-22.

Lloyd, Elisabeth (2009) “Varieties of Support and Confirmation of Climate Models” Aristotelian Society Supplementary Volume 83 (1):213-232.

Lomas J, Fulop N, Gagnon D, Allen P. (2003) “On being a good listener: setting priorities for applied health services research” Milbank Q 81(3): 363-88.

Longino, Helen (1994) "In Search of Feminist Epistemology" The Monist 77: 472-485.

Lycan, William (1998) “Theoretical (Epistemic) Virtues” In Edward Craig, ed,. Routledge Encyclopedia of Philosophy 9: 340-343. London: Routledge.

MacKay, Alfred F. (1980) Arrow’s Theorem: The Paradox of Social Choice. Yale University Press.

MacLeod, C. M. and Avery, O. T. 22 October 1940. “Laboratory Notes: Exp. 1 (T[ransforming]. P[rinciple].) Effect of Fluoride on Autolysis of Pneumococcus Type III and on Preservation of the Transforming Principle.”

MacLeod, C. M. and Avery, O. T. 28 January 1941. “Laboratory Notes: Effect of Ribonuclease on Deproteinized Extract 5-40.”

Maddox J, Randi J, Stewart WW (1988) “‘High-dilution’ Experiments a Delusion” Nature 334: 287-290.

Maddy, Penelope (2007) Second Philosophy: A Naturalistic Method. Oxford University Press.

Magnus, P.D., and Craig Callender (2004) “Realist Ennui and Base Rates” Philosophy of Science 71(3): 320-338.

Marmot, Michael (2004) Status Syndrome: How Your Social Standing Directly Affects Your Health and Life Expectancy. London: Bloomsbury.

Mayo, Deborah (1996) Error and the Growth of Experimental Knowledge. University of Chicago Press.

Mayo, Deborah (1986) “Cartwright, Causality, and Coincidence” PSA 1986 1: 42-58.

McAllister, James (1997) "Phenomena and Patterns in Data Sets" Erkenntnis 47: 217-228.


McCarty, Maclyn (1986) The Transforming Principle: Discovering that Genes are Made of DNA.

McCarty, M., Taylor, H.E., Avery, O.T. (1946) “Biochemical studies of environmental factors essential in transformation of pneumococcal types.” Cold Spring Harbor Symposium on Quantitative Biology. 11: 177-183.

McCarty, M. (1945) “Reversible inactivation of the substance inducing transformation of pneumococcal types.” Journal of Experimental Medicine 81:501-514.

Mirsky, A.E. and Pollister, W.W. (1946) “Chromosin, a desoxyribose nucleoprotein complex of the cell nucleus.” Journal of General Physiology 30: 117-148.

Mirsky, A., Osawa, S., Allfrey, V. (1956) “The nucleus as a site of biochemical activity” Cold Spring Harbor Symposium on Quantitative Biology. 21: 49-74.

Mirsky, A.E. (1968) “The Discovery of DNA” Scientific American 218: 78-88.

Mittelbach, G.G., C.F. Steiner, S.M. Scheiner, K.L. Gross, H.L. Reynolds, R.B. Waide, M.R. Willig, S.I. Dodson, and L. Gough (2001) "What is the Observed Relationship between Species Richness and Productivity?" Ecology 82: 2381-2396.

Mnookin, J. (2008) “Under the Influence of Technology: DUI and the Legal Production of Objectivity” UCSD Science Studies Colloquium, April 21 2008.

Moher, D., Jadad, A. R., Nichol, G., Penman, M., Tugwell, P., Walsh, S. (1995) “Assessing the quality of randomized controlled trials: An annotated bibliography of scales and checklists” Controlled Clinical Trials 16: 62-73.

Musgrave, Alan. (1974) “Logical Versus Historical Theories of Confirmation” British Journal for the Philosophy of Science 25: 1-23.

Myrvold, Wayne (1996) “Bayesianism and Diverse Evidence: A Reply to Andrew Wayne” Philosophy of Science 63: 661-665.

Nederbragt, Hubertus (2003) “Strategies to improve the reliability of a theory: the experiment of bacterial invasion into cultured epithelial cells” Studies in History and Philosophy of Biological and Biomedical Sciences 34: 593-614.

Neufeld, F. and Levinthal, W. (1928) “Beiträge zur variabilität der pneumokokken” Z Immunitätsforschung und exp. Therapie 55: 324-340.

Nisonoff A. (1988) “Explanation of Benveniste” Nature 334: 286.


Norton, John (2007) “Probability Disassembled” The British Journal for the Philosophy of Science 58: 141-171.

Norton, John (2005) “A Little Survey of Induction” in P. Achinstein, ed., Scientific Evidence: Philosophical Theories and Applications. Johns Hopkins University Press. pp. 9-34.

Norton, John (2003) "A Material Theory of Induction" Philosophy of Science 70: 647-670.

Novack, Greg (2007) “Does Evidential Variety Depend on How the Evidence is Described?” Philosophy of Science 74: 701-711.

Nye, Mary Jo (1972) Molecular Reality: A Perspective on the Scientific Work of Jean Perrin. New York: Elsevier, and London: Macdonald.

Olby, Robert (1974) The Path to the Double Helix. University of Washington Press, Seattle.

Oreskes, Naomi (1999) The Rejection of Continental Drift. Oxford University Press.

Pauling, Linus (1986) How to Live Longer and Feel Better. New York: W.H. Freeman.

Persson, J. (2009) “Semmelweis’s methodology from the modern stand-point: Intervention studies and causal ontology” Studies in History and Philosophy of Biological and Biomedical Sciences 40: 204-209.

Pinch, Trevor (1985) “Towards an analysis of scientific observation: the externality and evidential significance of observational reports in physics” Social Studies of Science 15: 3-36.

Porter, Theodore (1996) Trust in Numbers: The Pursuit of Objectivity in Science and Public Life. Princeton: Princeton University Press.

Quine, W. V. O. (1951) "Two Dogmas of Empiricism" Philosophical Review 60: 20-43.

Rasmussen, Nicolas (2001) "Evolving Scientific Epistemologies and the Artifacts of Empirical Philosophy of Science: A Reply Concerning Mesosomes" Biology and Philosophy 16: 629-654.

Rasmussen, Nicolas (1993) “Facts, Artifacts, and Mesosomes: Practicing Epistemology with the Electron Microscope” Studies in History and Philosophy of Science 24(2): 221-265.


Rhine, J. B., Pratt, J. G., Stuart, C. E., Smith, B. M., and Greenwood, J. A. (1940) Extrasensory Perception after Sixty Years. New York: Holt.

Rothman K. J., & Greenland S. (2005) “Causation and causal inference in epidemiology” American Journal of Public Health 95: S144-S150.

Rotton, J. & Kelly, I. W. (1985) “Much ado about the full moon: A meta-analysis of lunar-lunacy research” Psychological Bulletin 97: 286-306.

Roush, Sherrilyn (2005) Tracking Truth. Oxford University Press.

Royall, Richard (1997) Statistical Evidence: A Likelihood Paradigm. Chapman & Hall.

Salmon, Wesley (1997) Causality and Explanation. Oxford University Press.

Salmon, Wesley (1984) Scientific Explanation and the Causal Structure of the World. Princeton: Princeton University Press.

Sellars, Wilfrid (1956) “Empiricism and the Philosophy of Mind” in Feigl and Scriven, eds., Minnesota Studies in the Philosophy of Science, Volume I: The Foundations of Science and the Concepts of Psychology and Psychoanalysis. University of Minnesota Press.

Sen, Amartya (1970) Collective Choice and Social Welfare. Holden-Day.

Shapere, Dudley (1982) “The Concept of Observation in Science and Philosophy” Philosophy of Science 49(4): 485-525.

Slavin, R. (1995) “Best evidence synthesis: An intelligent alternative to meta- analysis” Journal of Clinical Epidemiology 48(1): 9-18.

Smith, M. L., & Glass, G. V. (1977) “Meta-analysis of psychotherapy outcome studies” American Psychologist 32: 752-60.

Snyder, Laura (2005) “Consilience, Confirmation, and Realism” in Achinstein (ed.) Scientific Evidence: Philosophical Theories and Applications. The Johns Hopkins University Press.

Sober, Elliott (2009) “Absence of Evidence and Evidence of Absence: Evidential Transitivity in Connection with Fossils, Fishing, Fine-Tuning, and Firing Squads” Philosophical Studies 143: 63-90.

Sober, Elliot (2008) Evidence and Evolution: The Logic Behind the Science. Cambridge University Press.


Solomon, Miriam (2007) “The Social Epistemology of NIH Consensus Conferences” in Establishing Medical Reality: Essays In The Metaphysics And Epistemology Of Biomedical Science, Kincaid and McKitrick (eds.) Springer.

Solomon, Miriam (2006) “Groupthink versus The Wisdom of Crowds: The Social Epistemology of Deliberation and Dissent” The Southern Journal of Philosophy 44: 28-42.

Spiegelman, S. (1946) “Nuclear and cytoplasmic factors controlling enzymatic constitution” Cold Spring Harbor Symp. on Quantitative Biology 11: 256-277.

Staley, Kent (2004) “Robust Evidence and Secure Evidence Claims” Philosophy of Science 71: 467-488.

Stanley, W M. (1970) “The ‘Undiscovered’ Discovery” Archives of Environmental Health 2: 256-262.

Stegenga, Jacob (2009) “Robustness, Discordance, and Relevance” Philosophy of Science 76: 650-661.

Stegenga, Jacob (forthcoming) “The Chemical Characterization of the Gene” History and Philosophy of the Life Sciences.

Subramanian, S., Venkataraman, R., Kellum, J. A. (2002) “Influence of dialysis membranes on outcomes in acute renal failure: A meta-analysis” Kidney International 62: 1819-23.

Sutton, A. J. & Higgins, J. P. T. (2008) “Recent developments in meta- analysis” Statistics in Medicine 27: 625-50.

Thagard, Paul (1998) “Ulcers and bacteria I: Discovery and acceptance” Studies in History and Philosophy of Biological and Biomedical Sciences 29: 107-136.

Trout, J. D. (1995) "Diverse tests on an independent world" Studies in History and Philosophy of Science 26(3): 407-29.

van Fraassen, Bas (2009) "The Perils of Perrin, in the Hands of Philosophers" Philosophical Studies 143: 5-24.

van Fraassen, Bas (1989) Laws and Symmetry. Oxford University Press.

van Fraassen, Bas (1980) The Scientific Image. Oxford: Clarendon Press.

Wayne, Andrew (1995) “Bayesianism and Diverse Evidence” Philosophy of Science 62: 111-121.


Weber, Marcel (2005) Philosophy of Experimental Biology. Cambridge University Press.

Weed, Douglas (2005) “Weight of Evidence: A Review of Concepts and Methods” Risk Analysis 25: 1545-1557.

Weed, Douglas (1997) “On the Use of Causal Criteria” International Journal of Epidemiology 26: 1137-1141.

Weisberg, Michael (2006) “Robustness Analysis” Philosophy of Science 73: 730-742.

Westman, Robert (forthcoming 2011) The Copernican Question: Prognostication, Scepticism and Celestial Order. University of California Press.

Whewell, William (1837/1857) The History of the Inductive Sciences, from the Earliest to the Present Time. New York: D. Appleton and Co.

White, Roger (2005) "Epistemic Permissiveness" Philosophical Perspectives 19: 445-459.

Whitlock, M. & Schluter, D. (2009) The Analysis of Biological Data. Greenwood Village: Roberts and Company Publishers.

Williamson, Timothy (2000) Knowledge and Its Limits. Oxford: Oxford University Press.

Wimsatt, William (1981) “Robustness, Reliability, and Overdetermination” in Brewer and Collins (Eds.) Scientific Inquiry and the Social Sciences. Jossey-Bass

Woodward, James (1989) “Data and Phenomena” Synthese 79: 393-472.

Worrall, John (2007) “Why there’s no cause to randomize” The British Journal for the Philosophy of Science 58: 451-88.

Worrall, John (2002) “What evidence in evidence-based medicine?” Philosophy of Science 69: S316-30.

Wylie, Alison (1995) “Doing Philosophy as a Feminist: Longino on the Search for a Feminist Epistemology” Philosophical Topics 23: 345-358.

Yank, V., Rennie, D., Bero, L. A. (2007) “Financial ties and concordance between results and conclusions in meta-analyses: A retrospective cohort study” British Medical Journal 335: 1202-5.