
The Epistemology of Measurement:

A Model-Based Account

by

Eran Tal

A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy

Graduate Department of Philosophy University of Toronto

© Copyright by Eran Tal 2012

The Epistemology of Measurement: A Model-Based Account

Eran Tal, Doctor of Philosophy

Department of Philosophy, University of Toronto, 2012

Thesis abstract

Measurement is an indispensable part of physical science as well as of commerce, industry, and daily life. Measuring activities appear unproblematic when performed with familiar instruments such as thermometers and clocks, but a closer examination reveals a host of epistemological questions, including:

1. How is it possible to tell whether an instrument measures the quantity it is intended to?

2. What do claims to measurement accuracy amount to, and how might such claims be justified?

3. When is disagreement among instruments a sign of error, and when does it imply that instruments measure different quantities?

Currently, these questions are almost completely ignored by philosophers of science, who view them as methodological concerns to be settled by scientists. This dissertation shows that these questions are not only philosophically worthy, but that their exploration has the potential to challenge fundamental assumptions in philosophy of science, including the distinction between measurement and prediction.

The thesis outlines a model-based epistemology of physical measurement and uses it to address the questions above. To measure, I argue, is to estimate the value of a parameter in an idealized model of a physical process. Such estimation involves inference from the final state (‘indication’) of a process to the value range of a parameter (‘outcome’) in light of theoretical and statistical assumptions. Idealizations are necessary preconditions for the possibility of justifying such inferences. Similarly, claims to accuracy, error and quantity individuation can only be adjudicated against the background of an idealized representation of the measurement process.

Chapters 1-3 develop this framework and use it to analyze the inferential structure of standardization procedures performed by contemporary standardization bureaus.

Standardizing time, for example, is a matter of constructing idealized models of multiple atomic clocks in a way that allows consistent estimates of duration to be inferred from clock indications. Chapter 4 shows that calibration is a special sort of modeling activity, i.e. the activity of constructing and testing models of measurement processes. Contrary to contemporary philosophical views, the accuracy of measurement outcomes is properly evaluated by comparing model predictions to each other, rather than by comparing observations.

Acknowledgements

In the course of writing this dissertation I have benefited time and again from the knowledge, advice and support of teachers, colleagues and friends. I am deeply indebted to Margie Morrison for being everything a supervisor should be and more: generous with her time and precise in her feedback, unfailingly responsive and relentlessly committed to my success. I thank Ian Hacking for his constant encouragement, for never ceasing to challenge me, and for teaching me to respect the science and scientists of whom I write. I owe many thanks to Anjan Chakravartty, who commented on several early proposals and many sketchy drafts; this thesis owes its clarity to his meticulous feedback. My teaching mentor, Jim Brown, has been a constant source of friendly advice on all academic matters since my very first day in Toronto, for which I am very grateful.

In addition to my formal advisors, I have been fortunate enough to meet faculty members in other institutions who have taken an active interest in my work. I am grateful to Stephan Hartmann for the three wonderful months I spent as a visiting researcher at Tilburg University; to Allan Franklin for ongoing feedback and assistance during my visit to the University of Colorado; to Paul Teller for insightful and detailed comments on virtually the entire dissertation; and to Marcel Boumans, Wendy Parker, Léna Soler, Alfred Nordmann and Leah McClimans for informal mentorship and fruitful research collaborations.

Many other colleagues and friends provided useful comments on this thesis at various stages of writing, of which I can only mention a few. I owe thanks to Giora Hon, Paul Humphreys, Michela Massimi, Luca Mari, Carlo Martini, Ave Mets, Boaz Miller, Mary Morgan, Thomas Müller, John Norton, Isaac Record, Jan Sprenger, Jacob Stegenga, Jonathan Weisberg, Michael Weisberg, Eric Winsberg, and Jim Woodward, among many others.

I am especially thankful to Hasok Chang for writing a thoughtful and detailed appraisal of this dissertation, and to Joseph Berkovitz and Denis Walsh for serving on my examination committee.

The work presented here depended on numerous physicists who were kind enough to meet with me, show me around their labs and answer my often naive questions. I am grateful to members of the Time and Frequency Division at the US National Institute of Standards and Technology (NIST) and JILA labs in Boulder, Colorado for their helpful cooperation. The long hours I spent in conversation with Judah Levine introduced me to the fascinating world of atomic clocks and ultimately gave rise to the central case studies reported in this thesis. David Wineland’s invitation to visit the laboratories of the Ion Storage Group at NIST in summer 2009 resulted in a wealth of materials for this dissertation. I am also indebted to Eric Cornell, Till Rosenband, Scott Diddams, Tom Parker and Tom Heavner for their time and patience in answering my questions. Special thanks go to Chris Ellenor and Rockson Chang, who, as graduate students in Aephraim Steinberg’s laboratory in Toronto, spent countless hours explaining to me the technicalities of Bose-Einstein Condensation.

My research for this dissertation was supported by several grants, including three Ontario Graduate Scholarships, a Chancellor Jackman Graduate Fellowship in the Humanities, a School of Graduate Studies Travel Grant (the latter two from the University of Toronto), and a Junior Visiting Fellowship at Tilburg University.

I am indebted to Gideon Freudenthal, my MA thesis supervisor, whose enthusiasm for teaching and attention to detail inspired me to pursue a career in philosophy.

My mother, Ruth Tal, has been extremely supportive and encouraging throughout my graduate studies. I deeply thank her for enduring my infrequent visits home and the occasional cold Toronto winter.

Finally, to my partner, Cheryl Dipede, for suffering through my long hours of study with only support and love, and for obligingly jumping into the unknown with me, thanks for being you.

Table of Contents

Introduction
  1. Measurement and knowledge
  2. The epistemology of measurement
  3. Three epistemological problems
    The problem of coordination
    The problem of accuracy
    The problem of quantity individuation
    Epistemic entanglement
  4. The challenge from practice
  5. The model-based account
  6. Methodology
  7. Plan of thesis

1. How Accurate is the Standard Second?
  1.1. Introduction
  1.2. Five notions of measurement accuracy
  1.3. The multiple realizability of unit definitions
  1.4. Uncertainty and de-idealization
  1.5. A robustness condition for accuracy
  1.6. Future definitions of the second
  1.7. Implications and conclusions

2. Systematic Error and the Problem of Quantity Individuation
  2.1. Introduction
  2.2. The problem of quantity individuation
    2.2.1. Agreement and error
    2.2.2. The model-relativity of systematic error
    2.2.3. Establishing agreement: a threefold condition
    2.2.4. Underdetermination
    2.2.5. Conceptual vs. practical consequences
  2.3. The shortcomings of foundationalism
    2.3.1. Bridgman’s operationalism
    2.3.2. Ellis’ conventionalism
    2.3.3. Representational Theory of Measurement
  2.4. A model-based account of measurement
    2.4.1. General outline
    2.4.2. Conceptual quantity individuation
    2.4.3. Practical quantity individuation
  2.5. Conclusion: error as a conceptual tool

3. Making Time: A Study in the Epistemology of Standardization
  3.1. Introduction
  3.2. Making time universal
    3.2.1. Stability and accuracy
    3.2.2. A plethora of clocks
    3.2.3. Bootstrapping reliability
    3.2.4. Divergent standards
    3.2.5. The leap second
  3.3. The two faces of stability
    3.3.1. An explanatory challenge
    3.3.2. Conventionalist explanations
    3.3.3. Constructivist explanations
  3.4. Models and coordination
    3.4.1. A third alternative
    3.4.2. Mediation, legislation, and models
    3.4.3. Coordinative freedom
  3.5. Conclusions

4. Calibration: Modeling the Measurement Process
  4.1. Introduction
  4.2. The products of calibration
    4.2.1. Metrological
    4.2.2. Indications vs. outcomes
    4.2.3. Forward and backward calibration functions
  4.3. Black-box calibration
  4.4. White-box calibration
    4.4.1. Model construction
    4.4.2. Uncertainty estimation
    4.4.3. Projection
    4.4.4. Predictability, not just correlation
  4.5. The role of standards in calibration
    4.5.1. Why standards?
    4.5.2. Two-way white-box calibration
    4.5.3. Calibration without metrological standards
    4.5.4. A global perspective
  4.6. From predictive uncertainty to measurement accuracy
  4.7. Conclusions

Epilogue

Bibliography


List of Tables

Table 1.1: Comparison of uncertainty budgets of aluminum (Al) and mercury (Hg) optical atomic clocks
Table 3.1: Excerpt from Circular-T
Table 4.1: Uncertainty budget for a torsion pendulum measurement of G, the Newtonian gravitational constant
Table 4.2: Type-B uncertainty budget for NIST-F1, the US primary frequency standard

List of Figures

Figure 3.1: A simplified hierarchy of approximations among model parameters in contemporary timekeeping
Figure 4.1: Modules and parameters involved in a white-box calibration of a simple caliper
Figure 4.2: A simplified diagram of a round-robin calibration scheme

The Epistemology of Measurement: A Model-Based Account

Introduction

I often say that when you can measure what you are speaking about and express it in numbers you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory kind […].

– William Thomson, Lord Kelvin (1891, 80)

1. Measurement and knowledge

Measurement is commonly seen as a privileged source of scientific knowledge. Unlike qualitative observation, measurement enables the expression of empirical claims in mathematical form and hence makes possible an exact description of nature. Lord Kelvin’s famous remark expresses high esteem for measurement for this same reason. Today, in an age when thermometers and ammeters produce stable measurement outcomes on familiar scales, Kelvin’s remark may seem superfluous. How else could one gain reliable knowledge of temperature and electric current other than through measurement? But the quantities called ‘temperature’ and ‘current’ as well as the instruments that measure them have long histories during which it was far from clear what was being measured and how – histories in which Kelvin himself played important roles1.

These early struggles to find principled relations between the indications of material instruments and values of abstract quantities illustrate the dual nature of measurement. On the one hand, measurement involves the design, execution and observation of a concrete physical process. On the other hand, the outcome of a measurement is a knowledge claim formulated in terms of some abstract and universal concept – e.g. mass, current, length or duration. How, and under what conditions, are such knowledge claims warranted on the basis of material operations?

Answering this last question is crucial to understanding how measurement produces knowledge. And yet contemporary philosophy of measurement offers little by way of an answer. Epistemological concerns about measurement were briefly popular in the 1920s (Campbell 1920, Bridgman 1927, Reichenbach [1927] 1958) and again in the 1960s (Carnap [1966] 1995, Ellis 1966), but have otherwise remained in the background of philosophical discussion. Until less than a decade ago, the philosophical literature on measurement focused on either the metaphysics of quantities (Swoyer 1987, Michell 1994) or the mathematical structure of measurement scales. The Representational Theory of Measurement (Krantz et al 1971), for example, confined itself to a discussion of structural mappings between empirical and quantitative domains and neglected the possibility of telling what, and how accurately, such mappings measure. It is only in the last several years that a new wave of philosophical writings about the epistemology of measurement has appeared (most notably Chang 2004, Boumans 2006, 2007 and van Fraassen 2008, Ch. 5-7). Partly drawing on these recent achievements, this thesis will offer a novel systematic account of the ways in which measurement produces knowledge.

1 See Chang (2004, 173-186) and Gooday (2004, 2-9).

2. The epistemology of measurement

The epistemology of measurement, as envisioned in this dissertation, is a subfield of philosophy concerned with the relationships between measurement and knowledge. Central topics that fall under its purview are the conditions under which measurement produces knowledge; the content, scope, justification and limits of such knowledge; the reasons why particular methodologies of measurement and standardization succeed or fail in supporting particular knowledge claims; and the relationships between measurement and other knowledge-producing activities such as observation, theorizing, experimentation, modeling and calculation. The pursuit of research into these topics is motivated not only by the need to clarify the epistemic functions of measurement, but also by the prospects of contributing to other areas of philosophical discussion concerning e.g. reliability, evidence, causality, objectivity, representation and information.

As measurement is not exclusively a scientific activity – it plays vital roles in engineering, medicine, commerce, public policy and everyday life – the epistemology of measurement is not simply a specialized branch of philosophy of science. Instead, the epistemology of measurement is a subfield of philosophy that draws on the tools and concepts of traditional epistemology, philosophy of science, philosophy of language, philosophy of technology and philosophy of mind, among other subfields. It is also a multi-disciplinary subfield, ultimately engaging with measurement techniques from a variety of disciplines as well as with the histories and sociologies of those disciplines.

The goal of providing a comprehensive epistemological theory of measurement is beyond the scope of a single doctoral dissertation. This thesis is cautiously titled ‘account’ rather than ‘theory’ in order to signal a more modest intention: to argue for the plausibility of a particular approach to the epistemology of measurement by demonstrating its strengths in a specific domain. I call my approach ‘model-based’ because it tackles epistemological challenges by appealing to abstract and idealized models of measurement processes. As I will explain below, this thesis constitutes the first systematic attempt to bring insights from the burgeoning literature on the philosophy of scientific modeling to bear on traditional problems in the philosophy of measurement. The specific domain I will focus on is physical metrology, officially defined as “the science of measurement and its application”2. Metrologists are the physicists and engineers who design and standardize measuring instruments for use in scientific and commercial applications, and often work at standardization bureaus or specially accredited laboratories.

The immediate aim of this dissertation, then, is to show that a model-based approach to measurement successfully solves certain epistemological challenges in the domain of physical metrology. By achieving this aim, a more far-reaching goal will also be accomplished, namely, a demonstration of the importance of research into the epistemology of measurement and of the promise held by model-based approaches for further research in this area.

2 JCGM 2008, 2.2.

The epistemological challenges addressed in this thesis may be divided into two kinds. The first kind consists of abstract and general epistemological problems that pertain to any sort of measurement, whether physical or nonphysical (e.g. of social or mental quantities). I will address three such problems: the problem of coordination, the problem of accuracy, and the problem of quantity individuation. These problems will be introduced in the next section. The second kind of epistemological challenge consists of problems that are specific to physical metrology. These problems arise from the need to explain the efficacy of metrological methods for solving problems of the first sort – for example, the efficacy of metrological uncertainty evaluations in overcoming the problem of accuracy. After discussing these ‘challenges from practice’, I will introduce the model-based account, explicate my methodology and outline the plan of this thesis.

3. Three epistemological problems

This thesis will address three general epistemological problems related to measurement, which arise when one attempts to answer the following three questions:

1. Given a procedure P and a quantity Q, how is it possible to tell whether P measures Q?

2. Assuming that procedure P measures quantity Q, how is it possible to tell how accurately P measures Q?

3. Assuming that P and P′ are two measuring procedures, how is it possible to tell whether P and P′ measure the same quantity?

Each of these three questions pertains to the possibility of obtaining knowledge of some sort about the relationship between measuring procedures and the quantities they measure. The sort of possibility I am interested in is not a general metaphysical or epistemic one – I do not consider the existence of the world or the veridical character of perception as relevant answers to the questions above. Rather, I will be interested in possibility in the practical, technological sense. What is technologically possible is what humans can do with the limited cognitive and material resources they have at their disposal and within reasonable time3. Hence to qualify as an adequate answer to the questions above, a condition of possibility must be cognitively accessible through one or more empirical tests that humans may reasonably be expected to perform. For example, an adequate answer to the first question would specify the sort of evidence scientists are required to collect in order to test whether an instrument is a thermometer – i.e. whether or not it measures temperature – as well as general considerations that apply to the analysis of this evidence.

3 For an elaboration of the notion of technological possibility see Record (2011, Ch. 2).

An obvious worry is that such conditions are too specific and can only be supplied on a case-by-case basis. This worry would no doubt be justified if one were to seek particular test specifications or ‘experimental recipes’ in response to the questions above. No single test, nor even a small set of tests, exists that can be applied universally to any measuring procedure and any quantity to yield satisfactory answers to the questions above. But this worry is founded on an overly narrow interpretation of the questions’ scope. The conditions of possibility sought by the questions above are not empirical test specifications but only general formal constraints on such specifications. These formal constraints, as we shall see, pertain to the structure of inferences involved in such tests and to general representational preconditions for performing them. Of course, it is not guaranteed in advance that even general constraints of this sort exist. If they do not, knowledge claims about measurement, accuracy and quantity individuation would have no unifying grounds. Yet at least in the case of physical quantities, I will show that a shared inferential and representational structure indeed underlies the possibility of knowing what, and how accurately, one is measuring.

Another, sceptical sort of worry is that the questions above may have no answer at all, because it may in fact be impossible to know whether and how accurately any given procedure measures any quantity. I take this worry to be indicative of a failure in philosophical methodology rather than an expression of a cautious approach to the limitations of human knowledge. The terms “measurement”, “quantity” and “accuracy” already have stable (though not necessarily unique) meanings set by their usage in scientific practice. Claims to measurement, accuracy and quantity individuation are commonly made in the sciences based on these stable meanings. The job of epistemologists of measurement, as envisioned in this thesis, is to clarify these meanings and make sense of scientific claims made in light of such meanings. In some cases the epistemologist may conclude that a particular scientific claim is unfounded or that a particular scientific method is unreliable.

But the conclusion that all claims to measurement are unfounded is only possible if philosophers create perverse new meanings for these terms. For example, the idea that measurement accuracy is unknowable in principle cannot be seriously entertained unless the meaning of “accuracy” is detached from the way practicing metrologists use this term, as will be shown in Chapter 1. I will elaborate further on the interplay between descriptive and normative aspects of the epistemology of measurement when I discuss my methodology below.

As mentioned, the attempt to answer the three questions above gives rise to three epistemological problems: the problem of coordination, the problem of accuracy and the problem of quantity individuation, respectively. The next three subsections will introduce these problems, and the fourth subsection will discuss their mutual entanglement.

The problem of coordination

How can one tell whether a given empirical procedure measures a given quantity? For example, how can one tell that an instrument is a thermometer, i.e. that the procedure of its use results in estimates of temperature? The answer is clear enough if one is allowed to presuppose, as scientists do today, an accepted theory of temperature along with accepted standards for measuring temperature. The epistemological conundrum arises when one attempts to explain the possibility of establishing such theories and standards in the first place. To establish a theory of temperature one has to be able to test its predictions empirically, a task which requires a reliable method of measuring temperature; but establishing such a method requires prior knowledge of how temperature is related to other quantities, e.g. volume or pressure, and this can only be settled by an empirically tested theory of temperature. It appears to be impossible to coordinate the abstract notion of temperature to any concrete method of measuring temperature without begging the question.

The problem of coordination was discussed by Mach ([1896] 1966) in his analysis of temperature measurement and by Poincaré ([1898] 1958) in relation to the measurement of space and time. Both authors took the view that the choice of coordinative principles is arbitrary and motivated by considerations of simplicity. Which substance is taken to expand uniformly with temperature, and which kind of clock is taken to ‘tick’ at equal time intervals, are choices based on convenience rather than observation. The conventionalist solution was later generalized by Reichenbach ([1927] 1958), Carnap ([1966] 1995) and Ellis (1966), who understood such coordinative principles (or ‘correspondence rules’) as a priori definitions that are in no need of empirical verification. Rather than statements of fact, such principles of coordination were viewed as semantic preconditions for the possibility of measurement.

However, conventionalists maintained that, unlike ‘ordinary’ conceptual definitions, coordinative definitions do not fully determine the meaning of a quantity concept but only regulate its use. For example, what counts as an accurate measurement of time depends on which type of clock is chosen to regulate the application of the notion of temporal uniformity. But the extension of the notion of uniformity is not limited to that particular type of clock. Other types of clock may be used to measure time, and their accuracy is evaluated by empirical comparison to the conventionally chosen standard4.

4 See, for example, Carnap on the periodicity of clocks ([1966] 1995, 84). For a discussion of the differences between operationalism and conventionalism see Chang and Cartwright (2008, 368).

Another approach to the problem of coordination, closely aligned with but distinct from conventionalism, was defended by Bridgman (1927). Bridgman’s initial proposal was to define a quantity concept directly by the operation of its measurement, so that strictly speaking two different types of operation necessarily measure different quantities. The operationalist solution is more radical than conventionalism, as it reduces the meaning of a quantity concept to its measurement operations. Bridgman motivated this approach by the need to exercise caution when applying what appears to be the same quantity concept across different domains. Bridgman later modified his view in response to various criticisms and no longer viewed operationalism as a comprehensive theory of meaning (Bridgman 1959, Chang 2009, 2.1).

A new strand of writing on the problem of coordination has emerged in the last decade, consisting most notably of the works of Chang (2004) and van Fraassen (2008, Ch. 5). These works take a historical-contextual and coherentist approach to the problem. Rather than attempt a solution from first principles, these writers appeal to considerations of coherence and consistency among different elements of scientific practice. The process of theory-construction and standardization is seen as mutual and iterative, with each iteration respecting existing traditions while at the same time correcting them. At each such iteration the quantity concept is re-coordinated to a more robust set of standards, which in turn allows theoretical predictions to be tested more accurately, etc. The challenge for these writers is not to find a vantage point from which coordination is deemed rational a priori, but to trace the inferential and material apparatuses responsible for the mutual refinement of theory and measurement in any specific case. Hence they reject the traditional question: ‘what is the general solution to the problem of coordination?’ in favour of historically situated, local investigations.

As will become clear, my approach to the problem of coordination continues the historical-contextual and coherentist trend in recent scholarship, but at the same time seeks to specify general formal features common to successful solutions to this problem. Rather than abandon traditional approaches to the problem altogether, my aim will be to shed new light on, and ultimately improve upon, conventionalist and operationalist attempts to solve the problem of coordination. To this end I will provide a novel account of what it means to coordinate quantity concepts to physical operations – an account in which coordination is understood as a process rather than a static definition – and clarify the conventional and empirical aspects of this process.

The problem of accuracy

Even if one can safely assume that a given procedure measures the quantity it is intended to, a second problem arises when one tries to evaluate the accuracy of that procedure. Quantities such as length, duration and temperature, insofar as they are represented by non-integer (e.g. rational or real) numbers, cannot be measured with complete accuracy. Even measurements of integer-valued quantities, such as the number of alpha-particles discharged in a radioactive reaction, often involve uncertainties. The accuracy of measurements of such quantities cannot, therefore, be evaluated by reference to exact values but only by comparing uncertain estimates to each other. Such comparisons by their very nature cannot determine the extent of error associated with any single estimate but only overall mutual compatibility among estimates. Hence multiple ways of distributing errors among estimates are possible that are all consistent with the evidence gathered through comparisons. It seems that claims to accuracy are intrinsically underdetermined by any possible evidence5.

5 See also Kyburg (1984, 183).
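To make this underdetermination vivid, here is a schematic illustration of my own (it does not appear in the thesis): a comparison between two estimates of the same duration constrains only the difference between their errors, never the errors themselves.

```latex
% Schematic illustration (mine, not the thesis's). Two estimates t_A, t_B
% of the same duration t, with unknown errors e_A, e_B:
\begin{align*}
  t_A &= t + e_A, \qquad t_B = t + e_B, \\
  t_A - t_B &= e_A - e_B = \delta.
\end{align*}
% For any constant c, the alternative assignment (e_A + c, e_B + c)
% reproduces the same observed difference \delta, so the comparison
% alone cannot determine the error of either estimate.
```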

Many of the authors who have discussed the problem of coordination appear to have also identified the problem of accuracy, although they have not always distinguished the two very clearly. Often, as in the cases of Mach, Ellis and Carnap, they naively believed that fixing a measurement standard in an arbitrary manner is sufficient to solve both problems at once. However, measurement standards are physical instruments whose construction, maintenance, operation and comparison suffer from uncertainties just like those of other instruments. As I will show, the absolute accuracy of measurement standards is nothing but a myth that obscures the complexity behind the problem of accuracy. Indeed, I will argue that the role played by standards in the evaluation of measurement accuracy has so far been grossly misunderstood by philosophers. Once the epistemic role of standards is clarified, new and important insights emerge not only with respect to the proper solution to the problem of accuracy but also with respect to the other two problems.

The problem of quantity individuation

When discussing the previous two problems I implicitly assumed that it is possible to tell whether multiple measuring procedures, compared to each other either synchronically or diachronically, measure the same quantity. But this assumption quickly leads to another underdetermination problem, which I call the ‘problem of quantity individuation.’ Even when two different procedures are thought to measure the same quantity, their outcomes rarely exactly coincide under similar conditions. Therefore when the outcomes of two procedures appear to disagree with each other two kinds of explanation are open to scientists: either one (or both) of the procedures are inaccurate, or the two procedures measure different quantities6. But any empirical test that may be brought to bear on this dilemma necessarily presupposes additional facts about agreement or disagreement among measurement outcomes and merely duplicates the problem. Much like claims about accuracy, claims about quantity individuation are underdetermined by any possible evidence.

As Chapter 2 will make clear, existing philosophical accounts of quantity individuation do not fully acknowledge the import of the problem. Bridgman and Ellis, for example, both acknowledge that claims to quantity individuation are underdetermined by facts about agreement and disagreement among measuring instruments. And yet they fail to notice that facts about agreement and disagreement among measuring instruments are themselves underdetermined by the indications of those instruments. Once this additional level of underdetermination is properly appreciated, Bridgman and Ellis’ proposed criteria of quantity individuation are exposed as question-begging. A proper solution to the problem of quantity individuation, I will argue, is possible only if one takes into account its entanglement with the first two problems.

6 This second option may be further subdivided into sub-options. The two procedures may be measuring different quantity tokens of the same type, e.g. lengths of different objects, or two different types of quantity altogether, e.g. length and area.

Epistemic entanglement

Though conceptually distinct, I will argue that the three problems just mentioned are epistemically entangled, i.e. that they cannot be solved independently of one another.

Specifically, I will show that (i) it is impossible to test whether a given procedure P measures a given quantity Q without at the same time testing how accurately procedure P would measure quantity Q; (ii) it is impossible to test how accurately procedure P would measure quantity Q without comparing it to some other procedure P′ that is supposed to measure Q; and (iii) it is impossible to test whether P and P′ measure the same quantity without at the same time testing whether they measure some given quantity, e.g. Q. Note that these ‘impossibility theses’ are epistemic rather than logical. For example, it is logically possible to know that two procedures measure the same quantity without knowing which quantity they measure7. Nevertheless, it is epistemically impossible to test whether two procedures measure the same quantity without making substantive assumptions about the quantity they are supposed to measure.

The extent and consequences of this epistemic entanglement have hitherto remained unrecognized by philosophers, despite the fact that some of the problems themselves have been widely acknowledged for a long time. The model-based account presented here is the first epistemology of measurement to clarify how it is possible in general to solve all three problems simultaneously without getting caught in a vicious circle.

7 The opposite is not the case, of course: one cannot (logically speaking) know which quantities two procedures measure without knowing whether they measure the same quantity. Questions 1 and 3 are therefore logically related, but not logically equivalent.

4. The challenge from practice

Apart from solving abstract and general problems like those discussed in the previous section, a central challenge for the epistemology of measurement is to make sense of specific measurement methods employed in particular disciplines. Indeed, it would be of little value to suggest a solution to the abstract problems that has no bearing on scientific practice, as such a solution would not be able to clarify whether and how accepted measurement methods actually overcome these problems. The ‘challenge from practice’, then, is to shed light on the epistemic efficacy of concrete methodologies of measurement and standardization. How do such methods overcome the three general epistemological problems discussed above? As already mentioned, this thesis will focus on the standardization of physical measuring instruments. Physical metrology involves a variety of methods for instrument comparison, error detection and correction, uncertainty evaluation and calibration. These methods employ theoretical and statistical tools as well as techniques of experimental manipulation and control. A central desideratum for the plausibility of the model-based account will be its ability to explain how, and under what conditions, these methods support knowledge claims about measurement, accuracy and quantity individuation.

As my focus will be on physical metrology, I will pay special attention to the methodological guidelines developed by practitioners in that field. In particular, I will frequently refer to two documents published in 2008 by the Joint Committee for Guides in Metrology (JCGM), a committee that represents eight leading international standardization bodies8. The first document is the International Vocabulary of Metrology – Basic and General Concepts and Associated Terms (VIM), 3rd edition (JCGM 2008)9. This document contains definitions and clarificatory remarks for dozens of key concepts in metrology such as calibration, measurement accuracy, measurement precision and measurement standard. These definitions shed light on the way practitioners understand these concepts and on their underlying (and sometimes conflicting) epistemic and metaphysical commitments. The second document is titled Evaluation of Measurement Data — Guide to the Expression of Uncertainty in Measurement (GUM), 1st edition (JCGM 2008a). This document provides detailed guidelines for evaluating measurement uncertainties and for comparing the results of different measurements. Together these two documents portray a methodological picture of metrology in which abstract and idealized representations of measurement processes play a central role. However, being geared towards regulating practice, these documents do not explicitly analyze the presuppositions underlying this methodological picture nor its efficacy for overcoming general epistemological conundrums that are of interest to philosophers. It is this gap between methodology and epistemology that the model-based account of measurement is intended to fill.
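As a rough indication of the kind of calculation the GUM regulates, here is a minimal sketch of my own, with hypothetical numbers (not an excerpt from the GUM or from this thesis): standard uncertainty components attributed to uncorrelated sources are combined in quadrature, and an expanded uncertainty is reported using a coverage factor k.

```python
import math

# Minimal sketch of a GUM-style uncertainty evaluation (hypothetical
# uncertainty budget for a length measurement, values in micrometres).
components = {
    "reference standard calibration": 0.12,   # Type B component
    "thermal expansion correction": 0.08,     # Type B component
    "repeatability of indications": 0.05,     # Type A component
}

# For uncorrelated inputs (and unit sensitivity coefficients), standard
# uncertainties combine in quadrature.
u_combined = math.sqrt(sum(u ** 2 for u in components.values()))

# Expanded uncertainty with coverage factor k = 2 (roughly 95% coverage).
k = 2.0
U = k * u_combined

print(f"combined standard uncertainty u_c = {u_combined:.3f} um")
print(f"expanded uncertainty U (k = {k:g}) = {U:.3f} um")
```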

8 The JCGM is composed of representatives from the International Bureau of Weights and Measures (BIPM), the International Electrotechnical Commission (IEC), the International Federation of Clinical Chemistry and Laboratory Medicine (IFCC), the International Laboratory Accreditation Cooperation (ILAC), the International Organization for Standardization (ISO), the International Union of Pure and Applied Chemistry (IUPAC), the International Union of Pure and Applied Physics (IUPAP) and the International Organization of Legal Metrology (OIML).

9 A new version of the 3rd edition of the VIM with minor changes was published in early 2012. My discussion in this thesis applies equally to this new version.

5. The model-based account

According to the model-based account, a necessary precondition for the possibility of measuring is the specification of an abstract and idealized model of the measurement process. To measure a physical quantity is to make coherent and consistent inferences from the final state(s) of a physical process to value(s) of a parameter in the model. Prior to the subsumption of a process under some idealized assumptions, it is impossible to ground such inferences and hence impossible to obtain a measurement outcome. Rather than be given by observation, measurement outcomes are sensitive to the assumptions with which a measurement process is modelled and may change when these assumptions change. The same holds true for estimates of measurement uncertainty, accuracy and error, as well as for judgements about agreement and disagreement among measurement outcomes – all are relative to the assumptions under which the relevant measurement processes are modelled.

My conception of the nature and functions of models follows closely the views expressed in Morrison and Morgan (1999), Morrison (1999), Cartwright et al. (1995) and Cartwright (1999). I take a scientific model to be an abstract representation of some local phenomenon, a representation that is used to predict and explain aspects of that phenomenon. A model is constructed out of assumptions about the ‘target’ phenomenon being represented. These assumptions may include laws and principles from one or more theories, empirical generalizations from available data, statistical assumptions about the data, and other local (and sometimes ad hoc) simplifying assumptions about the phenomenon of interest. The specialized character of models allows them to function autonomously from the theories that contributed to their construction, and to mediate between the highly abstract assumptions of theory and concrete phenomena. I view models as instruments that are more or less useful for purposes of prediction, explanation, experimental design and intervention, rather than as descriptions that are true or false.

Though not committed to any particular view on how models represent the world, the model-based account does not require models to mirror the structure of their target systems in order to be successful representational instruments. My framework therefore differs from the ‘semantic’ view, which takes models to be set-theoretical relational structures that are isomorphic to relations among objects in the target domain (Suppes 1960, van Fraassen 1980, 41-6). The model-based account is also permissive with respect to the ontology of models, and apart from assuming that models are abstract constructs I do not presuppose any particular view concerning their nature (e.g. abstract entities, mathematical objects, fictions). I do, however, take models to be non-linguistic entities and hence different from the equations used to express their assumptions and consequences10.

10 On this last point my terminology is at odds with that of the VIM, which defines a measurement model as a set of equations. Cf. JCGM 2008, 2.48 “Measurement Model”, p. 32.

The epistemic functions of models have received far less attention in the context of measurement than in other contexts where models are used to produce knowledge, e.g. theory construction, prediction, explanation, experimentation and simulation. An exception to this general neglect is the use of models for measurement in economics, a topic about which philosophers have gained valuable insights in recent years (Boumans 2005, 2006, 2007; Morgan 2007). The Representational Theory of Measurement (Krantz et al 1971) appeals to models in the set-theoretical sense to elucidate the adequacy of different types of scales, but completely neglects epistemic questions concerning coordination, accuracy and quantity individuation. This thesis will focus on the epistemic functions of models in physical measurement, a topic on which relatively little has been written, and to date no systematic account has been offered11.

The models I will discuss represent measurement processes. Such processes have physical and symbolic aspects. The physical aspect of a measurement process, broadly construed, includes interactions between a measuring instrument, one or more measured samples, the environment and human operators. The symbolic aspect includes data processing operations such as averaging, data reduction and error correction. The primary function of models of measurement processes is to represent the final states – or ‘indications’ – of the process in terms of values of the measured quantity. For example, the primary function of a model of a cesium fountain clock is to represent the output frequency of the clock (the frequency of its ‘ticks’) in terms of the ideal frequency associated with a specific hyperfine transition in cesium-133. To do this, the model of the clock must incorporate theoretical and statistical assumptions about the working of the clock and its interactions with the cesium sample and the environment, as well as about the processing of the output frequency signal.
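To illustrate the kind of inference such a model licenses, here is a deliberately simplified sketch of my own (the shift values are hypothetical and the structure is far cruder than any actual NIST clock model): the indicated frequency is referred back to the unperturbed cesium transition by correcting for modelled systematic shifts, and the uncertainties of those corrections are combined into an estimate of the model’s uncertainty.

```python
import math

# Toy model-based inference from a clock's indication to a measurement
# outcome. Not NIST's model; all shift values below are hypothetical.

F_CS = 9_192_631_770.0  # defined frequency of the cesium-133 hyperfine transition, Hz

indicated_freq = 9_192_631_770.0041  # hypothetical raw ('indicated') output frequency, Hz

# Modelled systematic shifts as fractional frequency offsets, each with a
# standard uncertainty (hypothetical values).
shifts = {
    "second-order Zeeman": (+4.2e-13, 3e-16),
    "blackbody radiation": (-1.7e-14, 2e-16),
    "gravitational redshift": (+1.8e-16, 3e-17),
}

total_shift = sum(s for s, _ in shifts.values())
u_model = math.sqrt(sum(u ** 2 for _, u in shifts.values()))

# Outcome: the indication referred back to the idealized, unperturbed
# transition frequency assumed by the model.
corrected_freq = indicated_freq / (1.0 + total_shift)
fractional_offset = (corrected_freq - F_CS) / F_CS

print(f"corrected frequency: {corrected_freq:.4f} Hz")
print(f"fractional offset:   {fractional_offset:.2e}")
print(f"model uncertainty:   {u_model:.1e} (fractional)")
```

The point of the sketch is simply that the outcome is read off the model, not the raw indication: change the modelling assumptions and the same indication yields a different outcome.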

A measurement procedure is a measurement process as represented under a particular set of modeling assumptions. Hence multiple procedures may be instantiated on the basis of the same measurement process when the latter is represented with different models12. For example, the same interactions among various parts of a cesium fountain clock and its environment may instantiate different procedures for measuring time when modelled with different assumptions.

11 But see important contributions to this topic by Morrison (2009) and Frigerio, Giordani and Mari (2010).

12 Here too I have chosen to slightly deviate from the terminology of the VIM, which defines a measurement procedure as a description of a measurement process that is based on a measurement model (JCGM 2008, 2.6, p. 18). I use the term, by contrast, to denote a measurement process as represented by a measurement model. The difference is that in the VIM definition a procedure does not itself measure but only provides instructions on how to measure, whereas in my definition a procedure measures.

According to the model-based account, knowledge claims about coordination, accuracy and quantity individuation are properly ascribable to measurement procedures rather than to measurement processes. That is, such knowledge claims presuppose that the measurement process in question is already subsumed under specific idealized assumptions, and may therefore be judged as true or false only relative to those assumptions. The central reason for this model-relativity is that prior to the subsumption of a measurement process under a model it is impossible to warrant objective claims about the outcomes of measurement, that is, claims that reasonably ascribe the outcome to the object being measured rather than to idiosyncrasies of the procedure. This will be explained in detail in Chapter 2.

As I will argue, the model-based account meets both the abstract and practice-based challenges I have discussed. Once the inferential grounds of measurement claims are relativized to a representational context, it becomes clear how all three epistemic problems mentioned above may be solved simultaneously. Moreover, it becomes clear how contemporary metrological methods of standardization, calibration and uncertainty evaluation are able to solve these problems, and what practical considerations and trade-offs are involved in the application of such methods. Finally, it becomes clear why measurement outcomes retain their objective validity outside the representational context in which they are obtained, thereby avoiding problems of incommensurability across different measuring procedures.

In providing a model-based epistemology of measurement, I intend to offer neither a critique nor an endorsement of metaphysical realism with respect to measurable quantities. My account remains agnostic with respect to metaphysics and pertains to measurement solely as an epistemic activity, i.e. to the inferences and assumptions that make it possible to warrant knowledge claims by operating measuring instruments. For example, nothing in my account depends on whether or not ratios of mass (or length, or duration) exist mind-independently. Indeed, in Chapters 1 and 4 I show that the problem of accuracy is solved in exactly the same way regardless of whether one interprets measurement uncertainties as deviations from true quantity values or as estimates of the degree of mutual consistency among the consequences of different models. The question of realism with respect to measurable quantities is therefore independent of the epistemology of measurement and underdetermined by any evidence one can gather from the practice of measuring.

6. Methodology

As I have mentioned, the model-based account is designed to meet both general epistemological challenges and challenges from practice. These two sorts of challenge may be distinguished along the lines of a normative-descriptive divide and formulated as two questions:

1. Normative question: what are some of the formal desiderata for an adequate solution to the problems of coordination, accuracy and quantity individuation?

2. Descriptive question: do the methods employed in physical metrology satisfy these desiderata?

It is tempting to try to answer these questions separately – first by analyzing the abstract problems and arriving at formal desiderata for their solution, and then by surveying metrological methods for compatibility with these desiderata. But on a closer look it becomes clear that these two questions cannot be answered completely independently of each other. Much like the first-order problems, these questions are entangled. On the one hand, overly strict normative desiderata would lead to the absurdity that no method can resolve the problems (why this is an absurdity was discussed above). An example of an overly strict desideratum is the requirement that measurement processes be perfectly repeatable, a demand that is unattainable in practice. On the other hand, overly lenient desiderata would run the risk of vindicating methods that practitioners regard as flawed.

While not necessarily absurd, if such cases abounded they would eventually raise the worry that one’s normative account fails to capture the problems that practitioners are trying to solve. To avoid these two extremes, the epistemologist must be able to learn from practice what counts as a good solution to an epistemological problem, yet do so without relinquishing the normativity of her account.

These seemingly conflicting needs are fulfilled by a method I call ‘normative analysis of exemplary cases’. I provide original and detailed case studies of metrological methods that practitioners consider exemplary solutions to the general epistemological problems posed above. Being exemplary solutions, they must also come out as successful solutions in my own epistemological account, for otherwise I have failed to capture the problems that metrologists are trying to solve. Note that this is not a license to believe everything practitioners say, but merely a reasonable starting point for a normative analysis of practice.

In other words, this method reflects a commitment to learn from practitioners what their problems are and assess their success in solving these problems rather than the preconceived problems of philosophers.

For my main case studies I have chosen to concentrate on the standardization of time and frequency, the most accurately and stably realized physical quantities in contemporary metrology. In addition to a study of the metrological literature, I spent several weeks at the laboratories of the Time and Frequency Division at the US National Institute of Standards and Technology (NIST) in Boulder, Colorado. I conducted interviews with ten of the Division’s scientists as well as with several other specialists at the University of Colorado’s JILA labs. In these interviews I invited metrologists to reflect on the reasons why they make certain knowledge claims about atomic clocks (e.g. about their accuracy, errors and agreement), on the methods they use to validate such claims, and on problems or limitations they encounter in applying these methods.

These materials then served as the basis for abstracting common presuppositions and inference patterns that characterize metrological methods more generally. At the same time, my superficial ‘enculturation’ into metrological life allowed me to reconceptualise the general epistemological problems and assess their relevance to the concrete challenges of the laboratory. These ongoing iterations of abstraction and concretization eventually led to a stable set of desiderata that fit the exemplars and at the same time were general enough to extend beyond them.

7. Plan of thesis

This dissertation consists of four autonomous essays, each dedicated to a different aspect of the epistemic and methodological challenges mentioned above. Rather than advance a single argument, each essay contains a self-standing argument in favour of the model-based account from different but interlocking perspectives.

Chapter 1 is dedicated to primary measurement standards, and debunks the myth according to which such standards are perfectly accurate. I clarify how the uncertainty associated with primary standards is evaluated and how the subsumption of standards under idealized models justifies inferences from uncertainty to accuracy.

Chapter 2 introduces the problem of quantity individuation, and shows that this problem cannot be solved independently of the problems of coordination and accuracy. The model-based account is then presented and shown to dissolve all three problems at once.

Chapter 3 expands on the problem of coordination through a discussion of the construction and maintenance of Coordinated Universal Time (UTC). As I argue, abstract quantity concepts such as terrestrial time are not coordinated directly to any concrete clock, but only indirectly through a hierarchy of models. This mediation explains how seemingly ad hoc error corrections can stabilize the way an abstract quantity concept is applied to particulars.

Finally, Chapter 4 extends the scope of discussion from standards to measurement procedures in general by focusing on calibration. I show that calibration is a special sort of modeling activity, and that measurement uncertainty is a special sort of predictive uncertainty associated with this activity. The role of standards in calibration is clarified and a general solution to the problem of accuracy is provided in terms of a robustness test among predictions of multiple models.


1. How Accurate is the Standard Second?

Abstract: Contrary to the claim that measurement standards are absolutely accurate by definition, I argue that unit definitions do not completely fix the referents of unit terms. Instead, idealized models play a crucial semantic role in coordinating the theoretical definition of a unit with its multiple concrete realizations. The accuracy of realizations is evaluated by comparing them to each other in light of their respective models. The epistemic credentials of this method are examined and illustrated through an analysis of the contemporary standardization of time. I distinguish among five senses of ‘measurement accuracy’ and clarify how idealizations enable the assessment of accuracy in each sense.13

1.1. Introduction

A common philosophical myth states that the meter bar in Paris is exactly one meter long – that is, if any determinate length can be ascribed to it in the metric system. One variant of the myth comes from Wittgenstein, who tells us that the meter bar is the one thing “of which one can say neither that it is one metre long, nor that it is not one metre long” (1953 §50). Kripke famously disagrees, but develops a variant of the same myth by stating that the length of the bar at a specified time is rigidly designated by the phrase ‘one meter’ (1980, 56). Neither of these pronouncements is easily reconciled with the 1960 declaration of the General Conference on Weights and Measures, according to which “the international Prototype does not define the metre with an accuracy adequate for the present needs of metrology” and is therefore replaced by an atomic standard (CGPM 1961). There is, of course, nothing problematic with replacing one definition with another. But how can the accuracy of the meter bar be evaluated against anything other than itself, let alone be found lacking?

13 This chapter was published with minor modifications as Tal (2011).

Wittgenstein and Kripke almost certainly did not subscribe to the myth they helped disseminate. There are good reasons to believe that their examples were meant merely as hypothetical illustrations of their views on meaning and reference14. This chapter does not take issue with their accounts of language, but with the myth of the absolute accuracy of measurement standards, which has remained unchallenged by philosophers of science. The meter is not the only case where the myth clashes with scientific practice. The second and the kilogram, which are currently used to define all other units in the International System (i.e. the ‘metric’ system), are associated with primary standards that undergo routine accuracy evaluations and are occasionally improved or replaced with more accurate ones. In the case of the second, for example, the accuracy of primary standards has increased more than a thousand-fold over the past four decades (Lombardi et al 2007).

14 Wittgenstein mentions the meter bar only in passing as an analogy to color language-games. Kripke carefully notes that the uniqueness of the meter bar’s role in standardizing length is no more than a hypothetical supposition (1980, 55, fn. 20).

This chapter will analyze the methodology of these evaluations, and argue that they indeed provide estimates of accuracy in the same senses of ‘accuracy’ normally presupposed by scientific and philosophical discussions of measurement. My main examples will come from the standardization of time. I will focus on the methods by which time and frequency standards are evaluated and improved at the US National Institute of Standards and Technology (NIST). These tasks are carried out by metrologists, experts in highly reliable measurement. The methods and tools of metrology – a live discipline with its own journals and controversies – have received little attention from philosophers15. Recent philosophical literature on measurement has mostly been concerned either with the metaphysics of quantity and number (Swoyer 1987, Michell 1994) or with the mathematical structures underlying measurement scales (Krantz et al. 1971). These ‘abstract’ approaches treat the topics of uncertainty, accuracy and error as extrinsic to the theory of measurement and as arising merely from imperfections in its application. Though they do not deny that measurement operations involve interactions with imperfect instruments in noisy environments, authors in this tradition analyze measurement operations as if these imperfections have already been corrected or controlled for.

By contrast, the current study is meant as a step towards a practice-oriented epistemology of physical measurement. The management of uncertainty and error will be viewed as intrinsic to measurement and as a precondition for the possibility of gaining knowledge from the operation of measuring instruments. At the heart of this view lies the recognition that a theory of measurement cannot be neatly separated into fundamental and applied parts. The methods employed in practice to correct errors and evaluate uncertainties crucially influence which answers are given to so-called ‘fundamental’ questions about

15 Notable exceptions are Chang (2004) and Boumans (2007). Metrology has been studied by historians and sociologists of science, e.g. Latour (1987, ch. 6), Schaffer (1992), Galison (2003) and Gooday (2004). 28 quantity individuation and the appropriateness of measurement scales. This will be argued in detail in Chapter 2.

In this chapter I will use insights into metrological practices to outline a novel account of the underexplored relationship between uncertainty and accuracy. Scientists often include uncertainty estimates in their reports of measurement results, but whether such estimates warrant claims about the accuracy of results is an epistemological question that philosophers have overlooked. Based on an analysis of time standardization, I will argue that inferences from uncertainty to accuracy are justified when a doubly robust fit – among instruments as well as among idealized models of these instruments – is demonstrated. My account will shed light on metrologists’ claims that the accuracy of standards is being continually improved and on the role played by idealized models in these improvements.

1.2. Five notions of measurement accuracy

Accuracy is often ascribed to scientific theories, instruments, models, calculations and data, although the meaning of the term varies greatly with context. Even within the limited context of physical measurement the term carries multiple senses. For the sake of the current discussion I offer a preliminary distinction among five notions of measurement accuracy. These are intended to capture different senses of the term as it is used by physicists and engineers as well as by philosophers of science. The five notions are neither co-extensive nor mutually exclusive but instead partially overlap in their extensions. As I will argue below, the sort of robustness test metrologists employ to evaluate the uncertainty of measurement standards provides sufficient evidence for the accuracy of those standards under all five senses of ‘measurement accuracy’.

1. Metaphysical measurement accuracy: closeness of agreement between a measured value of a quantity and its true value16
(correlate concept: truth)

2. Epistemic measurement accuracy: closeness of agreement among values reasonably attributed to a quantity based on its measurement17
(correlate concept: uncertainty)

3. Operational measurement accuracy: closeness of agreement between a measured value of a quantity and a value of that quantity obtained by reference to a measurement standard
(correlate concept: standardization)

4. Comparative measurement accuracy: closeness of agreement among measured values of a quantity obtained by using different measuring systems, or by varying extraneous conditions in a controlled manner
(correlate concept: reproducibility)

5. Pragmatic measurement accuracy (‘accuracy for’): measurement accuracy (in any of the above four senses) sufficient for meeting the requirements of a specified application.

16 cf. “Measurement Accuracy” in the International Vocabulary of Metrology (VIM) (JCGM 2008, 2.13). My own definitions for measurement-related terms are inspired by, but in some cases diverge from, those of the VIM.
17 cf. JCGM 2008, 2.13, Note 3.

Let us briefly clarify each of these five notions. First, the metaphysical notion takes truth to be the standard of accuracy. For example, a thermometer is metaphysically accurate if its outcomes are close to true ratios between measured temperature intervals and the chosen unit interval. If one assumes a traditional understanding of truth as correspondence with a mind-independent reality, the notion of metaphysical accuracy presupposes some form of realism about quantities. The argument advanced in this chapter is nevertheless independent of such realist assumptions, as it neither endorses nor rejects metaphysical conceptions of measurement accuracy.

Second, a thermometer is epistemically accurate if its design and use warrant the attribution of a narrow range of temperature values to objects. The dispersion of reasonably attributed values is called measurement uncertainty and is commonly expressed as a value range18. Epistemic accuracy should not be confused with precision, which constitutes only one aspect of epistemic accuracy. Measurement precision is the closeness of agreement among measured values obtained by repeated measurements of the same (or relevantly similar) objects using the same measuring system19. Imprecision is therefore caused by uncontrolled variations to the equipment, operation or environment when measurements are repeated. This sort of variation is a ubiquitous but not exclusive source of measurement uncertainty. As will be explained below, some measurement uncertainty stems from other sources, including imperfect corrections to systematic errors. The notion of epistemic accuracy is therefore broader than that of precision.

18 cf. “Measurement Uncertainty” (JCGM 2008, 2.26.) Note that this term does not refer to a degree of confidence or belief but to a dispersion of values whose attribution to a quantity reasonably satisfies a specified degree of confidence or belief.

Third, operational measurement accuracy is determined relative to an established measurement standard. For example, a thermometer is operationally accurate if its outcomes are close to those of a standard thermometer when the two measure relevantly similar samples. The most common way of evaluating operational accuracy is by calibration, i.e. by modeling an instrument in a manner that establishes a relation between its indications and standard quantity values20.

Fourth, comparative accuracy is the closeness of agreement among measurement outcomes when the same quantity is measured in different ways. The notion of comparative accuracy is closely linked with that of reproducibility. To say that a measurement outcome is comparatively accurate is to say that it is closely reproducible under controlled variations to measurement conditions and methods.21 For example, thermometers in a given set are comparatively accurate if their outcomes closely agree with one another’s when applied to relevantly similar samples.

Finally, pragmatic measurement accuracy is accuracy sufficient for a specific use, such as a solution to an engineering problem. There are four sub-senses of pragmatic accuracy, corresponding to the first four senses of measurement accuracy. For example, a thermometer is pragmatically accurate in an epistemic sense if the overall uncertainty of its outcomes is low enough to reliably achieve a specified goal, e.g. keeping an engine from over-heating. Of course, whether or not a measuring system (or a measured value) is pragmatically accurate depends on its intended use22.

19 cf. “Measurement Precision” (JCGM 2008, 2.15) and “Measurement Repeatability” (ibid, 2.21). My concept of precision is narrower than that of the VIM (see also fn. 21.)
20 cf. “Calibration” (JCGM 2008, 2.39)
21 Unlike precision, reproducibility concerns controlled variations to measurement conditions. I deviate slightly from the VIM on this point to reflect general scientific usage of these terms (cf. “Measurement Reproducibility”, JCGM 2008, 2.25).

In the physical sciences quantitative expressions of measurement accuracy are typically cast in epistemic terms, namely in terms of uncertainty. This does not mean that scientific estimates of accuracy are always and only estimates of epistemic accuracy. What matters to the classification of accuracy is not the form of its expression, but the kind of evidence on which estimates of accuracy are based. As I will argue below, metrological evaluations provide evidence of the right sort for estimating accuracy under all five notions. Before delving into the argument, the next section will provide some background on the concepts, methods and problems involved in the standardization of time.

1.3. The multiple realizability of unit definitions

A key distinction in the standardization of physical units is that between definition and realization. Since 1967 the second has been defined as the duration of exactly 9,192,631,770 periods of the radiation corresponding to a hyperfine transition of cesium-133 in the ground state (BIPM 2006). This definition pertains to an unperturbed cesium atom at a temperature of absolute zero. Being an idealized description of a kind of atomic system, no actual cesium atom ever satisfies this definition. Hence a question arises as to how the reference of ‘second’ is fixed. The traditional philosophical approach would be to propose some ‘semantic machinery’ through which the definition succeeds in picking out a definite duration, e.g. a possible-world semantics of counterfactuals. However, this sort of approach is hard pressed to explain how metrologists are able to experimentally access the extension of ‘second’ given the fact that it is physically impossible to instantiate the conditions specified by the definition. Consequently, it becomes unclear how metrologists are able to tell whether the actual durations they label ‘second’ satisfy the definition. By contrast, the approach adopted in this chapter takes the definition to fix a reference only indirectly and approximately by virtue of its role in guiding the construction of atomic clocks. Rather than picking out any definite duration on its own, the definition functions as an ideal specification for a class of atomic clocks. These clocks approximately satisfy – or in the metrological jargon, ‘realize’ – the conditions specified by the definition23. The activities of constructing and modeling cesium clocks are therefore taken to fulfill a semantic function, i.e. that of approximately fixing the reference of ‘second’, rather than simply measuring an already linguistically fixed time interval.

22 Pragmatic accuracy may be understood as a threshold (pass/fail) concept. Alternatively, pragmatic accuracy may be represented continuously, for example as the likelihood of achieving the specified goal. Both analyses of the concept are compatible with the argument presented here.

The construction of an accurate primary realization of the second – a ‘meter stick’ of time – must make highly sophisticated use of theory, apparatus and data analysis in order to approximate as much as possible the ideal conditions specified by the definition. But multiple kinds of physical processes can be constructed that would realize the second, each departing from the ideal definition in different respects and degrees. In other words, different clock designs and environments correspond to different ways of de-idealizing the definition. As of 2009, thirteen atomic clocks around the globe are used as primary realizations of the second. There are also hundreds of official secondary realizations of the second, i.e. atomic clocks that are traced to primary realizations. Like any collection of physical instruments, different realizations of the second disagree with one another, i.e. ‘tick’ at slightly different rates. The definition of the second is thus multiply realizable in the sense that multiple real durations approximately satisfy the definition, and no method can completely rid us of the approximations.

23 The verb ‘realize’ has various meanings in philosophical discussions. Here I follow the metrological use of this term and take it to be synonymous with ‘approximately satisfy’ (pertaining to a definition.)

That the definition of the second is multiply realizable does not mean that there are as many ‘seconds’ as there are clocks. What it does mean is that metrologists are faced with the task of continually evaluating the accuracy of each realization relative to the ideal cesium transition frequency and correcting its results accordingly. But the ideal frequency is experimentally inaccessible, and primary standards have no higher standard against which they can be compared. The challenge, then, is to forge a unified second out of disparately ‘ticking’ clocks. This is an instance of a general problem that I will call the problem of multiple realizability of unit definitions24. This problem is semantic, epistemological and methodological all at once. To solve it is to specify experimentally testable satisfaction criteria for the idealized definition of ‘second’, a task which is equivalent to that of specifying grounds for making accuracy claims about cesium atomic clocks, which is in turn equivalent to the task of specifying a method for reconciling discrepancies among such clocks. The conceptual distinction among three axes of the problem should not obstruct its pragmatic unity, for as we shall see below, metrologists are able to resolve all three aspects of the problem simultaneously.

24 Chang’s (2004, 59) ‘problem of nomic measurement’ is a closely related, though distinct, problem concerning the standardization of instruments. Both problems instantiate the entanglement between claims to coordination, accuracy and quantity individuation mentioned in the introduction to this thesis. This entanglement and its consequences will be discussed in detail in Chapter 2.

Prima facie, the problem can be solved by arbitrarily choosing one realization as the ultimate standard. Yet this solution would bind all measurement to the idiosyncrasies of a specific artifact, thereby causing measurement outcomes to diverge unnecessarily. Imagine that all clocks were calibrated against the ‘ticks’ of a single apparatus: the instabilities of that apparatus would cause clocks to run faster or slower relative to each other depending on the time of their calibration, and the discrepancy would be revealed when these clocks were compared to each other. A similar scenario has recently unfolded with respect to the International Prototype of the kilogram when its mass was discovered to systematically ‘drift’ from those of its official copies (Girard 1994). Hence a stipulative approach to unit definitions exacerbates rather than removes the challenge of reconciling discrepancies among multiple standards.

The latter point is helpful in elucidating the misunderstanding behind the myth of absolute accuracy. Once it is acknowledged that unit definitions are multiply realizable, it becomes clear that no single physical object can be used in practice to completely fix the reference of a unit term. Rather, this reference must be determined by an ongoing comparison among multiple realizations. Because these comparisons involve some uncertainty, the references of unit terms remain vague to some extent. Nevertheless, as the next sections will make clear, comparisons among standards allow metrologists to minimize this vagueness, thereby providing an optimal solution to the problem of multiple realizability.

1.4. Uncertainty and de-idealization

The clock design currently implemented in most primary realizations of the second is known as the cesium ‘fountain’, so called because cesium atoms are tossed up in a vacuum cylinder and fall back due to gravity. The best cesium fountains are said to measure the relevant cesium transition frequency with a fractional uncertainty25 of less than 5 parts in 10^16 (Jefferts et al. 2007). It is worthwhile to examine how this number is determined. To start off, it is tempting to interpret this number naively as the standard deviation of clock outcomes from the ideally defined duration of the second. However, because the definition pertains to atoms in physically unattainable conditions, the aforementioned uncertainty could not have been evaluated by direct reference to the ideal second. Nor is this number the standard deviation of a sample of readings taken from multiple cesium fountain clocks. If a purely statistical approach of this sort were taken, metrologists would have little insight into the causes of the distribution and would be unable to tell which clocks ‘tick’ closer to the defined frequency.
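To convey a rough sense of the magnitude involved, the following sketch converts a fractional uncertainty of this order into an equivalent accumulated time offset. The arithmetic is added here purely for illustration and is not part of the metrological analysis discussed in the text; the 5 parts in 10^16 figure is the one cited above.

```python
# A rough, illustrative conversion of the fractional uncertainty cited above
# (about 5 parts in 10^16) into an equivalent accumulated time offset.
# This arithmetic is added for illustration only.
fractional_uncertainty = 5e-16

seconds_per_day = 86_400
seconds_per_year = 365.25 * seconds_per_day

print(f"offset per day  ~ {fractional_uncertainty * seconds_per_day:.1e} s")
print(f"offset per year ~ {fractional_uncertainty * seconds_per_year:.1e} s")
print(f"years to accumulate one second ~ {1 / (fractional_uncertainty * seconds_per_year):.1e}")
```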

The accepted metrological solution to these difficulties is to de-idealize the theoretical definition of the second in discrete ‘steps’, and estimate the uncertainty that each ‘step’ contributes to the outcomes of a given clock. The uncertainty associated with a specific primary frequency standard is then taken to be the total uncertainty contributed to its outcomes by a sufficient de-idealization of the definition of the second as it applies to that particular clock. The rest of the present section will describe how this uncertainty is evaluated, and the next section will describe what kind of evidence is taken to establish the ‘sufficiency’ of de-idealization.

25 ‘Fractional’ or ‘relative’ uncertainty is the ratio between measurement uncertainty and the best estimate of the value being measured (usually the mean.)

Two kinds of de-idealization of the definition are involved in evaluating the uncertainty of frequency standards. These correspond to two different methods of evaluating measurement uncertainty that metrologists label ‘type-A’ and ‘type-B’26. First, the definition of the second is idealized in the sense that it presupposes that the relevant frequency of cesium is a single-valued number. By contrast, the frequency of any real oscillation converges to a single value only if averaged over an infinite duration, due to so-called ‘random’ fluctuations. De-idealizing the definition in this respect means specifying a set of finite run times, and evaluating the width of the distribution of frequencies for each run time. Uncertainties evaluated in this manner, i.e. by the statistical analysis of a series of observations, fall under ‘type-A’.
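The following is a minimal sketch, in Python, of a type-A evaluation in the spirit just described: frequency readings are grouped into runs of a given length and the dispersion of the run averages is taken as the uncertainty for that run time. The simulated readings and the simple standard-deviation estimator are illustrative assumptions; actual evaluations of frequency standards rely on more sophisticated statistics.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical fractional-frequency readings of a clock, one per second,
# scattered around zero by white noise (purely illustrative values).
readings = rng.normal(loc=0.0, scale=1e-15, size=86_400)

def type_a_uncertainty(y, run_length):
    """Dispersion of run-averaged frequencies for a given finite run time.

    Groups the readings into consecutive runs of `run_length` samples,
    averages each run, and reports the standard deviation of those
    averages -- a simple stand-in for a type-A evaluation.
    """
    n_runs = len(y) // run_length
    runs = y[: n_runs * run_length].reshape(n_runs, run_length)
    return np.std(runs.mean(axis=1), ddof=1)

for tau in (10, 100, 1_000):
    print(f"run time {tau:>5d} s: type-A uncertainty ~ {type_a_uncertainty(readings, tau):.2e}")
```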

The second kind of de-idealization of the definition has to do with systematic effects. For example, one way in which the definition of the second is idealized is that it presupposes that the cesium atom resides in a completely flat spacetime, i.e. a gravitational potential of zero. When measured in real conditions on earth, general relativity predicts that the cesium frequency would be red-shifted depending on the altitude of the laboratory housing the clock. The magnitude of this ‘bias’ is calculated based on a theoretical model of the earth’s gravitational field and an altitude measurement27. The measurement of altitude itself involves some uncertainty, which propagates to the estimate of the shift and therefore to the corrected outcomes of the clock. This sort of uncertainty, i.e. uncertainty associated with corrections to systematic errors, falls under ‘type-B’.

26 See JCGM (2008a) for a comprehensive discussion. The distinction between type-A and type-B uncertainty is unrelated to that of type I vs. type II error. Nor should it be confused with the distinction between random and systematic error.
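As a minimal sketch of this kind of type-B correction, the standard first-order approximation of the gravitational frequency shift, Δν/ν ≈ gh/c², can illustrate how an uncertain altitude measurement propagates into the correction. The altitude and its uncertainty below are hypothetical, and the first-order formula is a simplification of the geoid-based calculation mentioned in footnote 27.

```python
# Illustrative first-order gravitational frequency correction and the type-B
# uncertainty contributed by an imperfect altitude measurement. The altitude
# value and its uncertainty are assumed for the sake of the example.
g = 9.81           # local gravitational acceleration, m/s^2
c = 299_792_458.0  # speed of light, m/s

h = 1_650.0        # measured altitude above the reference surface, m (assumed)
u_h = 0.3          # standard uncertainty of the altitude measurement, m (assumed)

shift = g * h / c**2       # fractional frequency correction to be applied
u_shift = g * u_h / c**2   # type-B uncertainty contributed by the correction

print(f"fractional shift   ~ {shift:.3e}")
print(f"type-B uncertainty ~ {u_shift:.3e}")
```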

In addition to gravitational effects, numerous other effects must be estimated and corrected for a cesium fountain. With every such de-idealization and correction, some type-B uncertainty is added to the final outcome, i.e. to the number of ‘ticks’ the clock is said to have generated in a given period. The overall type-B uncertainty associated with the clock is then taken to be equal to the root sum of squares of these individual uncertainties28. In other words, the type-B uncertainty of a primary standard is determined by the accumulated uncertainty associated with corrections applied to its readings. The general method of evaluating the overall accuracy of measuring systems in this way is known as uncertainty budgeting.
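A minimal sketch of such a budget follows: individual type-B contributions are combined as a root sum of squares, on the assumption (see footnote 28) that they are uncorrelated. The entries and magnitudes are hypothetical and do not reproduce any actual NIST budget.

```python
import math

# Hypothetical type-B contributions to a frequency standard's uncertainty budget,
# combined as a root sum of squares under the assumption that they are uncorrelated.
budget = {
    "gravitational shift": 0.3e-16,
    "blackbody radiation": 2.6e-16,
    "collisional shift":   3.4e-16,
    "microwave leakage":   1.5e-16,
}

total_type_b = math.sqrt(sum(u**2 for u in budget.values()))
print(f"combined type-B uncertainty ~ {total_type_b:.1e}")
```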

Metrologists draw up tables with the contribution of each correction and a ‘bottom line’ that expresses the total type-B uncertainty (an example will be given in Section 1.6). Such tables make explicit the fact that ‘raw ticks’ generated by a clock are by themselves insufficient to determine the uncertainty associated with that clock. Uncertainties crucially depend not only on the apparatus, but also on how the apparatus is modeled, and on the level of detail with which such models capture the idiosyncrasies of a particular apparatus.

27 The calculation of this shift involves the postulation of an imaginary rotating sphere of equal local gravitational potential called a geoid, which roughly corresponds to the earth’s sea level. Normalization to the geoid is intended to transform the proper time of each clock to the coordinate time on the geoid. See for example Jefferts et al (2002, 328).
28 This method of adding uncertainties is only allowed when it is safe to assume that uncertainties are uncorrelated.

1.5. A robustness condition for accuracy

We saw that metrologists successively de-idealize the definition of the second until it describes the specific apparatus at hand. The type-A and type-B uncertainties accumulated in this process are combined to produce an overall uncertainty estimate for a given clock. This is how, for example, metrologists arrived at the estimate of fractional frequency uncertainty cited in the previous section.

A question nevertheless remains as to how metrologists determine the point at which de-idealization is ‘sufficient’. After all, a complete de-idealization of any physical system is itself an unattainable ideal. Indeed, the most difficult challenges that metrologists face involve building confidence in descriptions of their apparatus. Such confidence is achieved by pursuing two interlocking lines of inquiry: on the one hand, metrologists work to increase the level of detail with which they model clocks. On the other hand, clocks are continually compared to each other in light of their most recent theoretical and statistical models. The uncertainty budget associated with a standard is then considered sufficiently detailed if and only if these two lines of inquiry yield consistent results. The upshot of this method is that the uncertainty ascribed to a standard clock is deemed adequate if and only if the outcomes of that clock converge to those of other clocks within the uncertainties ascribed to each clock by appropriate models, where appropriateness is determined by the best currently available theoretical knowledge and data-analysis methods. This kind of convergence is routinely tested for all active cesium fountains (Parker et al 2001, Li et al 2004, Gerginov 2010) as well as for candidate future standards, as will be shown below.

The requirement for convergence under appropriate models embeds a double robustness condition, which may be generalized in the following way:

(RC) Given multiple, sufficiently diverse realizations of the same unit, the uncertainties ascribed to these realizations are adequate if and only if

(i) discrepancies among realizations fall within their ascribed uncertainties; and

(ii) the ascribed uncertainties are derived from appropriate models of each realization.

These two conditions loosely correspond to what Woodward (2006) calls ‘measurement robustness’ and ‘derivational robustness’. The first kind of robustness concerns the stability of a measured value under varying measurement procedures, while the second concerns the stability of a prediction under varying modeling assumptions. Note, however, that in the present case we are not dealing with two independently satisfiable conditions, but with two sides of a single, composite robustness condition. Recall that the discrepancies mentioned in sub-condition (i) already incorporate corrections to the quantity being compared, corrections that were calculated in light of detailed models of the relevant apparatuses. Conversely, the ‘appropriateness’ of models in (ii) is considered sufficiently established only once it is shown that these models correctly predict the range of discrepancies among realizations.
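A minimal sketch of how sub-condition (i) might be checked numerically is given below. The corrected frequency offsets, the ascribed uncertainties, and the simple combination rule (root sum of squares with a coverage factor) are all illustrative assumptions rather than the actual comparison procedure used by standardization bureaus.

```python
import itertools
import math

# Hypothetical corrected fractional-frequency offsets and ascribed uncertainties
# for three clocks. Every pairwise discrepancy is checked against the combined
# uncertainty of the pair, as a simplified stand-in for sub-condition (i) of (RC).
clocks = {
    "fountain A": (+1.2e-16, 4.0e-16),   # (corrected offset, ascribed uncertainty)
    "fountain B": (-2.8e-16, 5.0e-16),
    "fountain C": (+0.5e-16, 3.5e-16),
}

k = 2  # coverage factor (assumed)

for (a, (ya, ua)), (b, (yb, ub)) in itertools.combinations(clocks.items(), 2):
    discrepancy = abs(ya - yb)
    bound = k * math.sqrt(ua**2 + ub**2)
    verdict = "within" if discrepancy <= bound else "OUTSIDE"
    print(f"{a} vs {b}: |delta| = {discrepancy:.1e} is {verdict} {bound:.1e}")
```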

Metrology teaches us that (RC) is indeed satisfied in many cases, sometimes with stunningly small uncertainties. However, the question remains as to why one should take uncertainties that satisfy this condition to be measures of the accuracy of standards. This question can be answered by considering each of the five variants of accuracy outlined above.

To start with the most straightforward case, the comparative accuracy of realizations is simply the closeness of agreement among them, e.g. the relative closeness of the frequencies of different cesium fountains. Clearly, uncertainties that fulfill sub-condition (i) are (inverse) estimates of accuracy in this sense.

Second, from an operational point of view, the accuracy of a standard is the closeness of its agreement with other standards of the same quantity. This is again explicitly guaranteed by the fulfillment of (RC) under sub-condition (i). That sub-condition (i) guarantees two types of accuracy is hardly surprising, since in the special case of comparisons among standards the notions of comparative and operational accuracy are coextensive.

Third, the epistemic conception of accuracy identifies the accuracy of a standard with the narrowness of spread of values reasonably attributed to the quantity realized by that standard. The evaluation of type-A and type-B uncertainties in light of current theories, models and data-analysis tools is plausibly the most rigorous way of estimating the range of durations that reasonably satisfy the definition of ‘second’. The appropriateness requirement in sub-condition (ii) guarantees that uncertainties are evaluated in this way whenever possible.

Fourth, according to the metaphysical conception of accuracy, the accuracy of a standard is the degree of closeness between the estimated and true values of the realized quantity. Here one may adopt a skeptical position and claim that the true values of physical quantities are generally unknowable. The skeptic is in principle correct: it may be the case that despite their diversity, all the measurement standards that metrologists have compared are plagued by a common systematic effect that equally influences the realized quantity and thus remains undetected. But for a non-skeptical realist who believes (for whichever reason) that current theories are true or approximately true, condition (RC) provides a powerful test for metaphysical accuracy because it relies on the successive de-idealization of the theoretical definition of the relevant unit. Estimating the metaphysical accuracy of a cesium clock, for example, amounts to determining the conceptual ‘distance’ of that clock from the ideal conditions specified by the definition of the second. As mentioned, the uncertainties that go into (RC) are consequences of precisely those respects in which the realization of a unit falls short of the definition. It is therefore plausible to consider cross-checked uncertainty budgets of multiple primary standards as supplying good estimates of metaphysical accuracy.

Nevertheless, it is important to note that condition (RC) and the method of uncertainty budgeting do not presuppose anything about the truth of our current theories or the reality of quantities. That is, (RC) is compatible with a non-skeptical realist notion of accuracy without requiring commitment to its underlying metaphysics.

Finally, from a pragmatic point of view the accuracy of a standard is its capacity to meet the accuracy needs of a certain application. Here the notion of ‘accuracy needs’ is cashed out in terms of one of the first four notions of accuracy. As the uncertainties vindicated by (RC) have already been shown to be adequate estimates of accuracy under the first four notions, they are ipso facto adequate for the estimation of pragmatic accuracy.

1.6. Future definitions of the second

The methodological requirement to maximize robustness to the limit of what is practically possible is one of the main reasons why unit definitions are not chosen arbitrarily. If the definitions of units were determined arbitrarily, their replacement would be arbitrary as well. But, as metrologists know only too well, changes to unit definitions involve a complex web of theoretical, technological and economic considerations. Before the metrological community accepts a new definition, it must be convinced that the relevant unit can be realized more accurately with the new definition than with the old one. Here again ‘accuracy’ is cashed out in terms of robustness. In the case of the second, for example, a new generation of ‘optical’ atomic clocks is already claimed to have achieved “an accuracy that exceeds current realizations of the SI unit of time” (Rosenband et al. 2008, 1809). To demonstrate accuracy that surpasses the current cesium standard, optical clocks are compared to each other in light of the most detailed models available. Table 1.1 presents a comparative uncertainty budget for aluminum and mercury optical clocks recently evaluated at NIST. The theoretical description of each atomic system is de-idealized successively, and the uncertainties contributed by each component add up to the ‘bottom line’ type-B uncertainty for each clock. These uncertainties are roughly an order of magnitude lower than those ascribed to cesium fountain clocks.


Table 1.1: Comparison of uncertainty budgets of aluminum (Al) and mercury (Hg) optical atomic clocks. This table was used to support the claim that both clocks are more accurate than the current cesium standard. ∆ν stands for fractional frequency bias and σ stands for uncertainty, both expressed in units of 10^-18. (source: Rosenband et al 2008, 1809. Reprinted with permission from AAAS)

The experimenters showed that successive comparisons of the frequencies of these clocks indeed yield outcomes that fall within the ascribed bounds of uncertainty, thereby applying the robustness condition above. The fact that these clocks involve two different kinds of atoms was taken to strengthen the robustness of the results. Nevertheless, it is unlikely that the second will be redefined in the near future in terms of an optical transition.

More optical clocks must be built and compared before metrologists are convinced that such clocks are modeled with sufficient detail. Meanwhile the accuracy of current cesium standards is still being improved by employing new methods of controlling and correcting for errors. In the long run, however, increasing technological challenges involved in improving the accuracy of cesium fountains are expected to lead to the adoption of new sorts of primary realizations of the second such as optical clocks.

1.7. Implications and conclusions

As the foregoing discussion has made clear, measurement standards are not absolutely accurate, nor are they chosen arbitrarily. Moreover, unit definitions do not completely fix the reference of unit terms, unless ‘fixing’ is understood in a manner that is utterly divorced from practice. Instead, choices of unit definition, as well as choices of realization for a given unit definition, are informed by intricate considerations from theory, technology and data analysis.

The study of these considerations reveals the ongoing nature of standardization projects. Theoretically, quantities such as mass, length and time are represented by real numbers on continuous scales. The mathematical treatment of these quantities is indifferent to the accuracy with which they are measured. But in practice, we saw that the procedures required to measure duration in seconds change with the degree of accuracy demanded.

Consequently, a necessary condition for the possibility of increasing measurement accuracy is that unit-concepts are continually re-coordinated with new measuring procedures29. Metrologists are responsible for performing such acts of re-coordination in the most seamless manner possible, so that for all practical purposes the second, meter and kilogram appear to remain unchanged. This is achieved by constructing and improving primary and secondary realizations, and (less frequently) by redefinition. The dynamic coordination of quantity concepts with increasingly robust networks of instruments allows measurement results to retain their validity even when standards are improved or replaced. Moreover, increasing robustness minimizes vagueness surrounding the reference of unit terms, thereby providing an optimal solution to the problem of multiple realizability of unit definitions.

29 See van Fraassen’s discussion of the problem of coordination (2008, ch.5). I take my own robustness condition (RC) to be a methodological explication of van Fraassen’s ‘coherence constraint’ on acceptable solutions to this problem.


2. Systematic Error and the Problem of Quantity Individuation

Abstract: When discrepancies are discovered between outcomes of different measuring instruments two sorts of explanation are open to scientists. Either (i) some of the outcomes are inaccurate or (ii) the instruments measure different quantities. Here I argue that, due to the possibility of systematic error, the choice between (i) and (ii) is in principle underdetermined by the evidence. This poses a problem for several contemporary philosophical accounts of measurement, which attempt to analyze ‘foundational’ concepts like quantity independently of ‘applied’ concepts like error. I propose an alternative, model-based account of measurement that challenges the distinction between foundations and application, and show that this account dissolves the problem of quantity individuation.

2.1. Introduction

Physical quantities – the , the melting point of gold, the earth’s diameter – can often be measured in more than one way. Instruments that measure a given quantity may differ markedly in the physical principles they utilize, and it is difficult to imagine scientific inquiry proceeding were this not the case. The possibility of measuring the same quantity in different ways is crucial to the detection of experimental errors and the development of general scientific theories. An important question for any epistemology of measurement is therefore: ‘how are scientists able to know whether or not different instruments measure the same quantity?’

However straightforward this question may seem, an adequate account of quantity individuation across measurement procedures has so far eluded philosophical accounts of measurement. Contemporary measurement theories either completely neglect this question or provide overly simplistic answers. As this chapter will show, the question of quantity individuation is of central concern to theories of measurement. Not only is the question more difficult than previously thought, but when properly appreciated the challenge posed by this question undermines a widespread presupposition in contemporary philosophy of measurement. This presupposition will be referred to here as conceptual foundationalism. Prevalent in the titles of key works such as Ellis’ Basic Concepts of Measurement (1966) and Krantz et al’s Foundations of Measurement (1971), conceptual foundationalism is the thesis that measurement concepts are rigidly divided into ‘fundamental’ and ‘applied’ types, the former but not the latter being the legitimate domain of philosophical analysis. Fundamental measurement concepts – particularly, the notions of quantity and scale – are supposed to have universal criteria of existence and identity. Such criteria apply to any measurement regardless of its specific features. For example, whether or not two procedures measure the same quantity is determined by applying a universal criterion of quantity identity to their results, regardless of which quantity they happen to measure or how accurately they happen to measure it. By contrast, ‘applied’ concepts like accuracy and error are seen as experimental in nature. Discussion of the ‘applied’ portion of measurement theory is accordingly left to laboratory manuals or other forms of discipline-specific technical literature.

As I will argue in this chapter, conceptual foundationalist approaches do not, and cannot, provide an adequate analysis of the notion of measurable quantity. This is because the epistemic individuation of measurable quantities essentially depends on considerations of error distribution across measurement procedures. Questions of the form ‘what quantity does procedure P measure?’ cannot be answered independently of questions about the accuracy of P. Deep conceptual and epistemic links tie together the so-called ‘fundamental’ and ‘applied’ parts of measurement theory and prevent identity criteria from being specified for measurable quantities independently of the specific circumstances of their measurement.

The main reason that these links have been ignored thus far is a misunderstanding of the notion of measurement error, and particularly systematic measurement error. The possibility of systematic error – if it is acknowledged at all in philosophical discussions of measurement – is usually brought up merely to clarify its irrelevance to the discussion30. The next section of this chapter will therefore be dedicated to an explication of the idea of systematic error and its relation to theoretical and statistical assumptions about the specific measurement process. These insights will be used to generate a challenge for the conceptual foundationalist that I will call ‘the problem of quantity individuation.’ The following section will discuss the ramifications of this problem for several conceptual foundationalist theories of measurement, including the Representational Theory of Measurement (Krantz et al 1971.)

Finally, Section 2.4 will present an alternative, non-foundationalist account of quantity individuation. I will argue that claims to quantity individuation are adequately tested by establishing coherence and consistency among models of different measuring instruments. The account will serve to elucidate the model-based approach to measurement and to demonstrate its ability to avoid conceptual problems associated with foundationalism. Moreover, the model-based approach will provide a novel understanding of the epistemic functions of systematic errors. Instead of being conceived merely as obstacles to the reliability of experiments, systematic errors will be shown to constitute indispensable tools for unifying quantity concepts in the face of seemingly inconsistent evidence.

30 See for example Campbell (1920, pp. 471-3)

2.2. The problem of quantity individuation

2.2.1. Agreement and error

How can one tell whether two different instruments measure the same quantity? This question poses a fundamental challenge to the epistemology of measurement. For any attempt to test whether two instruments measure the same quantity, either by direct comparison or by reference to other instruments, involves testing for agreement among measurement outcomes; but any test of agreement among measurement outcomes must already presuppose that those outcomes pertain to the same quantity.

To clarify the difficulty, let us first consider the sort of evidence required to establish agreement among the outcomes of different measurements. We can imagine two instruments that are intended to measure the same quantity, such as two thermometers. For the sake of simplicity, we may assume that the instruments operate on a common set of measured samples. Now suppose that we are asked to devise a test that would determine whether the two instruments agree in their outcomes when they are applied to samples in the set.

Naively, one may propose that the instruments agree if and only if their indications exactly coincide when presented with the same samples under the same conditions. But variations in operational or environmental conditions cause indications to diverge between successive measurements, and this should not count as evidence against the claim that the outcomes of the two instruments are compatible.

A more sophisticated proposal would be to repeat the comparison between the instruments several times under controlled conditions and to use one of any number of statistical tests to determine whether the difference in readings is consonant with the hypothesis of agreement between the instruments. This procedure would determine whether the readings of the two instruments coincide within type-A (or ‘statistical’) uncertainty, the component of measurement uncertainty typically associated with random error31.
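A minimal sketch of such a test follows. A Welch t-test is just one of the ‘any number of statistical tests’ mentioned above, the readings are simulated, and nothing in the argument depends on this particular choice.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical repeated readings of two thermometers applied to the same
# samples under nominally identical conditions (values in degrees Celsius).
readings_a = rng.normal(loc=25.02, scale=0.05, size=30)
readings_b = rng.normal(loc=25.00, scale=0.07, size=30)

# A Welch t-test asks whether the observed difference in mean readings is
# consonant with the hypothesis of agreement, given only the scatter of
# repeated readings (i.e. type-A uncertainty).
t_stat, p_value = stats.ttest_ind(readings_a, readings_b, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```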

However, due to the possibility of systematic error, coincidence within type-A uncertainty is neither a necessary nor sufficient criterion for agreement among measuring instruments, regardless of which statistical test is used. Mathematically speaking, an error is ‘systematic’ if its expected value after many repeated measurements is nonzero32. In most cases, the existence of such errors cannot be inferred from the distribution of repeated readings but must involve some external standard of accuracy33. Once systematic errors are corrected, seemingly disparate readings may turn out to stand for compatible outcomes while apparently convergent readings can prove to mask disagreement. Consequently, it is impossible to adjudicate questions concerning agreement among measuring instruments before systematic errors have been corrected.

31 My terminology follows the official vocabulary of the International Bureau of Weights and Measures as published by the Joint Committee for Guides in Metrology. For definitions and discussion of type-A and type-B uncertainties see JCGM (2008, 2008a). Note that the distinction between type-A and type-B uncertainty is unrelated to that of type I vs. type II error. Nor should it be confused with the distinction between random and systematic error.
32 cf. JCGM (2008, 2.17)
33 Some systematic errors can be evaluated purely statistically, such as random walk noise (aka Brownian noise) in the frequency of electric signals.

A well-known example34 concerns glass containers filled with different thermometric fluids – e.g. mercury, alcohol and air. If one examines the volume indications of these thermometers when applied to various samples, one discovers that temperature intervals that are deemed equal by one instrument are deemed unequal by the others. These discrepancies are stable over many trials and therefore not eliminable through statistical analysis of repeated measurements. Moreover, because the ratio between corresponding volume intervals measured by different thermometers is not constant, it is impossible to eliminate the discrepancy by linear scale transformations such as from Celsius to Fahrenheit.

Nevertheless, from the point of view of scientific methodology these thermometers may still turn out to be in good agreement once an appropriate nonlinear correction is applied to their readings. Such numerical correction is often made transparent to users by manipulating the output of the instrument, e.g. by incorporating the correction into the gradations on the display. For example, if the thermometers appear to disagree on the location of the midpoint between the temperatures of freezing and boiling water, the ‘50 Celsius’ marks on their displays may simply be moved so as to restore agreement. Corrective procedures of this sort are commonplace during calibration and are viewed by scientists as enhancing the accuracy of measuring instruments. Indeed, in discussions concerning agreement among measurement outcomes scientists almost never compare ‘raw’, pre-calibrated indications of instruments directly to each other, a comparison that is thought to be uninformative and potentially misleading.

What sort of evidence should one look for to decide whether and how much to correct the indications of measuring instruments? Background assumptions about what the instrument is measuring play an important role here. When in 1887 Michelson and Morley measured the velocity of light beams propagating parallel and perpendicular to the supposed ether wind they observed little or no significant discrepancy35. Whether this result stands for agreement between the two values of velocity nevertheless depends on how one represents the apparatus and its interaction with its environment. Fitzgerald and Lorentz hypothesized that the arms of the interferometer contracted in dependence on their orientation relative to the ether wind, an effect that would result in a systematic error that exactly cancels out the expected difference of light speeds. According to this representation of what the apparatus was measuring, the seeming convergence in velocities merely masked disagreement. By contrast, under Special Relativity length contraction is considered not an extraneous disturbance to the measurement of the velocity of light but a fundamental consequence of its invariance, and the results are taken to indicate genuine agreement. Hence an effect that requires systematic correction under one representation of the apparatus is deemed part of the correct operation of the apparatus under another.

34 See Mach (1966 [1896]), Ellis (1966, 90-110), Chang (2004, Ch. 2) and van Fraassen (2008, 125-30).

A similar point is illustrated, though under very different theoretical circumstances, by the development of thermometry in the eighteenth and nineteenth centuries. As noted by Chang (2004, Ch. 2), by the mid-1700s it was well known that thermometers employing different sorts of fluids exhibit nonlinear discrepancies. This discovery prompted the rejection of the naive assumption that the volume indications of all thermometers were linearly correlated with temperature. Eventually, comparisons among thermometers (culminating in the work of Henri Regnault in the 1840s) gave rise to the adoption of air thermometers as standards. But the adoption of air as a standard thermometric fluid did not cause other thermometers, such as mercury thermometers, to be viewed as categorically less accurate. Instead, the adoption of the air standard led to the recalibration of mercury thermometers under the assumption that their indications are nonlinearly correlated with temperature. What matters to the accuracy of mercury thermometers under the new assumption is no longer their linearity but the predictability of their deviation from air thermometers. The indications of a mercury thermometer could now deviate from linearity without any loss of accuracy as long as they were predictably correlated with corresponding indications of a standard. Once again, what is taken to be an error under one representation of the apparatus is deemed an accurate result under another.

35 Here the comparison is not between different instruments but different operations that involve the same instrument. I take my argument to apply equally to both cases.

2.2.2. The model-relativity of systematic error

The examples of thermometry and interferometry highlight an important feature of systematic error: what counts as a systematic error depends on a set of assumptions concerning what and how the instrument is measuring. These assumptions serve as a basis for constructing a model of the measurement process, that is, an abstract quantitative representation of the instrument’s behavior including its interactions with the sample and environment. The main function of such models is to allow inferences to be made from indications (or ‘readings’) of an instrument to values of the quantity being measured. While various types of models are involved in interpreting measurement results, for the sake of the current discussion it is sufficient to distinguish between models of the data generated by a measurement process and theoretical models representing the dynamics of a measurement process. Both sorts of models involve assumptions about the measuring instrument, the sample and environment, but the kinds of assumptions differ in each case.

Models of data (or ‘data models’) are constructed out of assumptions about the relationship between possible values of the quantity being measured, possible indications of the instrument, and values of extraneous variables, including time36. These assumptions are used to predict a functional relation between the input and output of an instrument known as the ‘calibration curve’. We already saw the centrality of data models to the detection of systematic error in the thermometry example. The initial assumption of linear expansion of fluids provided a rudimentary calibration curve that allowed inferring temperature from volume. The linear data model nevertheless proved to be of limited accuracy, beyond which its predictions came into conflict with the assumption that different instruments measure the same single-valued quantity, temperature37. Hence a systematic error was detected based on linear data models and later corrected by constructing more complex data models that incorporate nonlinearities. In the course of this modeling activity the thermometers in question are viewed largely as ‘black-boxes’, and very little is assumed about the mechanisms that cause fluids and gases to expand when heated38.
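A minimal sketch of such a data model is given below: a calibration curve relating the indications of a hypothetical mercury thermometer to those of an air-thermometer standard, fitted first linearly and then with a nonlinearity. The readings are fabricated for illustration, and the polynomial form is an assumption rather than a claim about actual thermometer behavior.

```python
import numpy as np

# Illustrative calibration data: indications of a hypothetical mercury
# thermometer paired with the corresponding readings of an air-thermometer
# standard (both in degrees Celsius). The numbers are fabricated.
standard_temp = np.array([0.0, 20.0, 40.0, 60.0, 80.0, 100.0])    # air standard
mercury_reading = np.array([0.0, 20.4, 40.6, 60.5, 80.3, 100.0])  # mercury indications

# A linear data model versus a nonlinear (cubic) one; the residuals show how
# far each calibration curve lets us trust the inference from indication to
# temperature.
linear_model = np.polynomial.Polynomial.fit(mercury_reading, standard_temp, deg=1)
cubic_model = np.polynomial.Polynomial.fit(mercury_reading, standard_temp, deg=3)

for name, model in (("linear", linear_model), ("cubic", cubic_model)):
    residuals = standard_temp - model(mercury_reading)
    print(f"{name} calibration curve: max residual = {np.abs(residuals).max():.3f} C")
```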

In the Michelson-Morley example, by contrast, model selection was informed by a theoretical account of how the apparatus worked. Generally, a theoretical model of a measuring instrument represents the internal dynamics of the instrument as well as its interactions with the environment (e.g. ether) and the measured sample (e.g. light beams.)

The accepted theoretical model of an instrument is crucial in specifying what the instrument is measuring. The model also determines which behaviors of the instrument count as evidence for a systematic error. Both of these epistemic functions are clearly illustrated by the Michelson-Morley example, where the classical model of what the instrument is measuring (light speed relative to the ether) was replaced with a relativistic model of what the instrument is measuring (universally constant light speed in any inertial frame.) As part of this change in the accepted theoretical model of the instrument, the dynamical explanation of length contraction was replaced with a kinematic explanation. Rather than correct the effects of length contraction, the new theoretical model of the apparatus conceptually ‘absorbed’ these effects into the value of the quantity being measured.

36 For a general account of models of data see Suppes (1962).
37 See Chang (2004, pp. 89-92)
38 The distinction between data models and theoretical models is closely related to the distinction between ‘black-box’ and ‘white-box’ calibration tests. For detailed discussion of this distinction see Chapter 4, Sections 4.3 and 4.4, as well as Boumans (2006).

Despite vast differences between the two examples, they both illustrate the sensitivity of systematic error to a representational context. That is, in both the thermometry and interferometry cases the attribution of systematic errors to the indications of the instrument depends on what the instrument is taken to measure under a given representation. Furthermore, in both cases the error is corrected (or conceptually ‘eliminated’) merely by modifying the model of the apparatus and without any physical change to its operation.

The following ‘methodological definition’ of systematic error makes explicit the model-relativity of the concept:

• Systematic error: a discrepancy whose expected value is nonzero between the anticipated or standard value of a quantity and an estimate of that value based on a model of a measurement process.

This definition is ‘methodological’ in the sense that it pertains to the method by which systematic errors are detected and estimated. This way of defining systematic error differs from ‘metaphysical’ definitions of error, which characterize measurement error in relation to a quantity’s true value. The methodological definition has the advantage of being straightforwardly applicable to scientific practice, because in most cases of physical measurement the exact true value of a quantity is unknowable and thus cannot be used to estimate the magnitude of errors.

Apart from its applicability to scientific practice, the methodological definition of systematic error has the advantage of capturing all three sorts of ways in which systematic errors may be corrected, namely (i) by physically modifying the measurement process – for example, shielding the apparatus from causes of error; (ii) by modifying the theoretical or data model of the apparatus; or (iii) by modifying the anticipated value of the quantity being measured. In everyday practice (i) and (ii) are usually used in combination, whereas (iii) is much rarer and may occur due to a revision to the ‘recommended value’ of a constant or due to changes in accepted theory39. I will discuss the first two sorts of correction in detail below.

The methodological definition of systematic error is still too broad for the purpose of the current discussion, because it includes errors that can be eliminated simply by changing the scale of measurement, e.g. by modifying the zero point of the indicator or by converting from, say, Celsius to Fahrenheit. By contrast, a subset of systematic errors that I will call ‘genuine’ cannot be eliminated in this fashion:

• A genuine systematic error: a systematic error that cannot be eliminated merely by a permissible transformation of measurement scale.

39 The Michelson-Morley example illustrates a combination of (ii) and (iii), as both the theoretical model of the interferometer and the expected outcome of measurement are modified.

The possibility of genuine systematic error will prove crucial to the individuation of quantities across different measuring instruments. Unless otherwise mentioned, the term ‘systematic error’ henceforth denotes genuine systematic errors.

2.2.3. Establishing agreement: a threefold condition

To recapitulate the trajectory of the discussion so far, the need to test whether different instruments measure the same quantity has led us to look for a test for agreement among measurement outcomes. We saw that agreement can only be established once systematic errors have been corrected, and that what counts as a systematic error depends on how instruments are modeled. Consequently, any test for agreement among measuring instruments is itself model-relative. The model-relativity of agreement is a direct consequence of the fact that a change in modeling assumptions may result in a different attribution of systematic errors to the indications of instruments. For this reason, the results of agreement tests between measuring instruments may be modified without any physical change to the apparatus, merely by adopting different modeling assumptions with respect to the behavior of instruments.

Agreement is therefore established by the convergence of outcomes under specified models of measuring instruments. Specifically, detecting agreement requires that:

(R1) the instruments are modeled as measuring the same quantity, e.g. temperature or the velocity of light40;

(R2) the indications of each instrument are corrected for systematic errors in light of their respective models; and

(R3) the corrected indications converge within the bounds of measurement uncertainty associated with each instrument.

As I have shown in the previous chapter, these requirements are implemented in practice by measurement experts (or ‘metrologists’) to establish the compatibility of measurement outcomes41.

Before we examine the epistemological ramifications of these three requirements, a clarification is in order with respect to requirement (R3). This is the requirement that convergence be demonstrated within the bounds of measurement uncertainty. The term ‘measurement uncertainty’ is here taken to refer to the overall uncertainty of a given measurement, which includes not only type-A uncertainty calculated from the distribution of repeated readings but also type-B uncertainty, the uncertainty associated with estimates of the magnitudes of systematic errors42. Theoretical models of the apparatus play an important role in evaluating type-B uncertainties and consequently in deciding what counts as appropriate bounds for agreement between instruments. For example, the theoretical model of cesium fountain clocks (the atomic clocks currently used to standardize the unit of time defined as one second) predicts that the output frequency of the clock will be affected by collisions among cesium atoms. The higher the density of cesium atoms housed by the clock, the larger the systematic error with respect to the quantity being measured, in this case the ideal cesium frequency in the absence of such collisions. To estimate the magnitude of the error, scientists manipulate the density of atoms and then extrapolate their data to the limit of zero density. The estimated magnitude of the error is then used to correct the raw output frequency of the clock, and the uncertainty associated with the extrapolation is added to the overall measurement uncertainty associated with the clock43. This latter uncertainty is classified as ‘type-B’ because it is derived from a secondary experiment on the apparatus rather than from a statistical analysis of repeated clock readings.

40 This requirement may be specified either in terms of a quantity type (e.g. velocity) or in terms of a quantity token (e.g. velocity of light). Both formulations amount to the same criterion, for in both cases measurement outcomes are expressed in terms of some quantity token, e.g. a velocity of some thing.
41 See Chapter 1, Section 1.5, as well as the VIM definition of “Compatibility of Measurement Results” (JCGM 2008, 2.47)
42 See Chapter 1, Section 1.4 as well as fn. 31 above.
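To make the zero-density extrapolation just described concrete, here is a minimal sketch using invented numbers; the linear dependence of the collisional shift on density and the uncertainty bookkeeping are assumptions of the sketch rather than details taken from any actual clock evaluation.

# Hypothetical sketch of the zero-density extrapolation described above.
# The linear model and all numbers are illustrative assumptions.
import numpy as np

# Fractional frequency offsets observed at several relative atomic
# densities (both invented for illustration).
densities = np.array([0.5, 1.0, 1.5, 2.0])
offsets = np.array([-1.1e-15, -2.0e-15, -3.2e-15, -4.1e-15])

# Fit a straight line and extrapolate to zero density; the intercept
# estimates the collision-free frequency offset.
coeffs, cov = np.polyfit(densities, offsets, deg=1, cov=True)
slope, intercept = coeffs
u_intercept = np.sqrt(cov[1, 1])  # standard uncertainty of the extrapolation

# The correction applied at the operating density is minus the collisional
# shift predicted by the fit; u_intercept enters the uncertainty budget as
# a type-B component.
operating_density = 1.0
correction = -slope * operating_density
print(correction, intercept, u_intercept)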

The conditions under which requirement (R3) is fulfilled are therefore model-relative in two ways. First, they depend on a theoretical or data model of the measurement process to establish what counts as a systematic error and therefore what the corrected readings are. Second, as just noted, these conditions depend on how type-B uncertainties are evaluated, which again depends on theoretical and statistical assumptions about the apparatus. As the first two requirements (R1) and (R2) are already explicitly tied to models, the upshot is that each of the three requirements that together establish agreement among measuring instruments is model-relative in some respect.

43 See Jefferts et al (2002) for a detailed discussion of this evaluation method.

2.2.4. Underdetermination

When the threefold condition above is used as a test for agreement among measuring instruments, the corrected readings may turn out not to converge within the expected bounds of uncertainty. In such a case, disagreement (or incompatibility) is detected between measurement outcomes. There are accordingly three possible sorts of explanation for such disagreement:

(H1) the instruments are not measuring the same quantity44;

(H2) systematic errors have been inappropriately evaluated; or

(H3) measurement uncertainty has been underestimated.

How does one determine which is the culprit (or culprits)? Prima facie, one should attempt to test each of these three hypotheses independently. To test (H1) scientists may attempt to calibrate the instruments in question against other instruments that are thought to measure the desired quantity. But this sort of calibration is again a test of agreement. For calibration to succeed, one must already presuppose under requirement (R1) that the calibrated and calibrating instruments measure the same quantity. The success of calibration therefore cannot be taken as evidence for this presupposition. Alternatively, if calibration fails, scientists are faced with the very same problem, now multiplied.

Scientists may attempt to test (H2) or (H3) independently of (H1). But this is again impossible, because the attribution of systematic error involved in claim (H2) is model-relative and can only be tested by making assumptions about what the instruments are measuring. Similarly, the evaluation of measurement uncertainty involved in testing (H3) includes type-B evaluations that are relative to a theoretical model of the measurement process. Moreover, as (H3) applies only to readings that have already been corrected for systematic error, it cannot be tested independently of (H2) and ipso facto of (H1).

44 Cf. fn. 40. As before, it makes no conceptual difference whether this hypothesis is formulated in terms of a quantity type or a quantity token. The choice of formulation does, however, make a practical difference to the strategies scientists are likely to employ to resolve discrepancies. See Section 2.4.3 for discussion.

We are therefore confronted with an underdetermination problem. In the face of disagreement among measurement outcomes, no amount of empirical evidence can alone determine whether the procedures in question are inaccurate [(H2 or H3) is true] or whether they are measuring different quantities (H1 is true). Any attempt to settle the issue by collecting more evidence merely multiplies the same conundrum. I call this the problem of quantity individuation. Like other cases of Duhemian underdetermination, it is only a problem if one believes that there is a disciplined way of deciding which hypothesis to accept (or reject) based on empirical evidence alone. As we shall see immediately below, several contemporary philosophical theories of measurement indeed subscribe to this mistaken belief. That is, they assume that questions of the form ‘what does procedure P measure?’ can be answered decisively based on nothing more than the results of empirical tests, and independently of any prior assumptions as to the accuracy of P. This belief lies at the heart of the foundationalist approach to the notion of measurable quantity, a notion that is viewed as epistemologically prior to the ‘applied’ challenges involved in making concrete measurements.

A direct upshot of the problem of quantity individuation is that the individuation of measurable quantities and the distribution of systematic error are but two sides of the same epistemic coin. Specifically, the possibility of attributing genuine systematic errors to measurement outcomes (along with relevant type-B uncertainties) is a necessary precondition for the possibility of establishing the unity of quantities across the various instruments that measure them. Unless genuine systematic errors are admitted as a possibility when analyzing experimental results, instruments exhibiting nonscalable discrepancies cannot be taken to measure the same quantity. Concepts such as temperature and the velocity of light therefore owe their unity to the possibility of such errors, as do the laws in which such quantities feature. The notion of measurement error, in other words, has a constructive function in the elaboration of quantity concepts, a function that has so far remained unnoticed by theories of measurement.

2.2.5. Conceptual vs. practical consequences

The problem of quantity individuation may strike one as counter-intuitive. Do not scientists already know that their thermometers measure temperature before they set out to detect systematic errors? The answer is that scientists often do know, but that their knowledge is relative to background theoretical assumptions concerning temperature and to certain traditions of interpreting empirical evidence. Such traditions serve, among other things, to constrain the range of trade-offs between simplicity and explanatory power that a scientific community would deem acceptable. Theoretical assumptions and interpretive traditions inform the choices scientists make among the three hypotheses above. In ‘trivial’ cases of quantity individuation, namely in cases where previous agreement tests have already been performed among similar instruments under similar conditions with a similar or higher degree of accuracy, an appeal to background theories and traditions is usually sufficient for determining which of the three hypotheses will be accepted.

As we shall see below, the foundationalist fallacy is to think of such choices as justified in an absolute sense, that is, outside of the context of any particular theory or interpretive tradition. Van Fraassen calls this sort of absolutism the ‘view from nowhere’ (2008, 122) and rightly points out that there can be no way of answering questions of the form ‘what does procedure P measure?’ independently of some established tradition of theorizing and experimenting. He distinguishes between two sorts of contexts in which such questions may be answered: ‘from within’, i.e. given the historically available theories and instrumental practices at the time, or ‘from above’, i.e. retrospectively in light of contemporary theories.

Although van Fraassen does not discuss the problem of quantity individuation, his terminology is useful for distinguishing between two different consequences of this problem.

The first, conceptual consequence has already been mentioned: there can be no theory-free test of quantity individuation. This consequence is not a problem for practicing scientists but only for conceptual foundationalist accounts of measurement. It stems from the attempt to devise a test for quantity individuation that would view measurement ‘from nowhere’, prior to any theoretical assumptions about what is being measured and regardless of any particular tradition of interpreting empirical evidence.

The other, practical consequence of the problem of quantity individuation is a challenge for scientists engaged in ‘nontrivial’ measurement endeavors, ones that involve new kinds of instruments, novel operating conditions or higher accuracy levels than previously achieved for a given quantity. Exemplary procedures of calibration and error correction may not yet exist for such measurements. In the face of incompatible outcomes from novel instruments, then, researchers may not have at their disposal established methods for restoring agreement. Nor can they settle the issue based on empirical evidence from comparison tests alone, for as the problem of quantity individuation teaches us, such evidence is insufficient for deciding which one (or more) of the three hypotheses above to accept. The practical challenge is to devise new methods of adjudicating agreement and error ‘from within’, i.e. by extending existing theoretical presuppositions and interpretive traditions to a new domain.

As we shall see below, multiple strategies are open to scientists confronted with disagreement among novel measurements.

Historically, the process of extension has almost always been conservative. Scientists engaged in cutting-edge measurement projects usually start off by dogmatically supposing that their instruments will measure a given quantity in a new regime of accuracy or operating conditions. This conservative approach is extremely fruitful as it leads to the discovery of new systematic errors and to novel attempts to explain such errors. But such dogmatic supposition should not be confused with empirical knowledge, because novel measurements may lead to the discovery of new laws and to the postulation of quantities that are different from those initially supposed. Instead, this sort of dogmatic supposition can be regarded as a manifestation of a regulative ideal, an ideal that strives to keep the number of quantity concepts small and underlying theories simple.

Due to their marked differences, I will consider the two consequences of the problem of quantity individuation as two distinct problems. The next section will discuss the conceptual problem and its consequences for foundationalist theories of measurement. The following section will explain how the conceptual problem is dissolved by adopting a model-based approach to measurement. I will then return to the practical problem – the problem of deciding which hypotheses to accept in real, context-rich cases of disagreement – at the end of Section 2.4. If not otherwise mentioned, the ‘problem of quantity individuation’ henceforth refers to the conceptual problem.

2.3. The shortcomings of foundationalism

The conceptual problem of quantity individuation should not come as a surprise to philosophers of science. It is, after all, a special case of a well-known problem named after Duhem45. Nevertheless, a look at contemporary works on the philosophy of measurement reveals that the problem of quantity individuation has so far remained unrecognized. Worse still, the consequences of this problem are in conflict with several existing accounts of physical measurement. This section is dedicated to a discussion of the repercussions of the conceptual problem of quantity individuation for three philosophical theories of measurement. A by-product of explicating these repercussions is that the problem itself will be further clarified.

All three philosophical accounts discussed here are empiricist, in the sense that they attempt to reduce questions about the individuation of quantities to questions about relations holding among observable results of empirical procedures. These accounts are also foundationalist insofar as they take universal criteria pertaining to the configuration of observable evidence to be sufficient for the individuation of quantities, regardless of theoretical assumptions about what is being measured or local traditions of interpreting evidence. Hence for a foundational empiricist the result of an individuation test must not depend on any background assumption unless that assumption can be tested empirically. As I will argue, foundational empiricist criteria individuate quantities far too finely, leading to a fruitless multiplication of natural laws. Such accounts of measurement are unhelpful in shedding light on the way quantities are individuated by successful scientific theories.

45 Duhem (1991 [1914], 187)

2.3.1. Bridgman’s operationalism

The first account of quantity individuation I will consider is operationalism as expounded by Bridgman (1927). Bridgman proposes to define quantity concepts in physics such as length and temperature by the operation of their measurement. This proposal leads Bridgman to claim that currently accepted quantity concepts have ‘joints’ where different operations overlap in their value range or object domain. He warns against dogmatic faith in the unity of quantity concepts across these ‘joints’, urging instead that unity be checked against experiments. Bridgman nevertheless concedes that it is pragmatically justified to retain the same name for two quantities if “within our present experimental limits a numerical difference between the results of the two sorts of operations has not been detected” (ibid, 16.)

Bridgman can be said to advance two distinct criteria of quantity individuation, the first substantive and semantic, the other nominal and empirical. The first criterion is a direct consequence of the operationalist thesis: quantities are individuated by the operations that define them. Hence a difference in measurement operation is a sufficient condition for a difference in the quantity being measured. But even if we grant Bridgman the existence of a clear criterion for individuating operations, the operationalist approach generates an absurd multiplicity of quantities and laws. Unless ‘operation’ is defined in a question-begging manner, there is no reason to think that operating a ruler and operating an interferometer (both used for measuring length) are instances of a single sort of operation. Bridgman, of course, welcomed the multiplicity of quantity concepts in the spirit of empiricist caution.

Nevertheless, it is doubtful whether the sort of caution Bridgman advised is being served by his operational analysis of quantity. As long as quantities are defined by operations, no two operations can measure the same quantity; as a result, it is impossible to distinguish between results that are ascribable to the objects being measured and those that are ascribable to some feature of the operation itself, the environment, or the human operator. An operational analysis, in other words, denies the possibility of testing the objective validity of measurement claims – their validity as claims about measured objects. This denial stands in stark contrast to Bridgman’s own cautionary methodological attitude.

Bridgman’s second, empirical criterion of individuation is meant to save physical theory from conceptual fragmentation. According to the second criterion, quantities are nominally individuated by the presence of agreement among the results of operations that measure them. The same ‘nominal quantity’, such as length, is said to be measured by several different operations as long as no significant discrepancy is detected among the results of these operations. But this criterion is naive, because different operations that are thought to measure the same quantity rarely agree with each other before being deliberately corrected for systematic errors. Such corrections are required, as we have seen, even after one averages indications over many repeated operations and ‘normalizes’ their scale. The empirical criterion of individuation therefore fails to avoid the unnecessary multiplicity of quantities.

Alternatively, if by ‘numerical difference’ above Bridgman refers to measurement results that have already been corrected for systematic errors, such numerical difference can only be evaluated under the presupposition that the two operations measure the same quantity. This presupposition is nevertheless the very claim that Bridgman needs to establish. This last reading of Bridgman’s individuation criterion is therefore circular46.

2.3.2. Ellis’ conventionalism

A second and seemingly more promising candidate for an empiricist criterion of quantity individuation is provided by Ellis in his Basic Concepts of Measurement (1966). Instead of defining quantity concepts in terms of particular operations, Ellis views quantity concepts as ‘cluster concepts’ that may be “identified by any one of a large number of ordering relationships” (ibid, 35). Different instruments and procedures may therefore measure the same quantity. What is common to all and only those procedures that measure the same quantity is that they all produce the same linear order among the objects being measured: “If two sets of ordering relationships, logically independent of each other, always generate the same order under the same conditions, then it seems clear that we should suppose that they are ordering relationships for the same quantity” (ibid, 34).

Ellis’ individuation criterion appears at first to capture the examples examined so far. The thermometers discussed above preserve the order of samples regardless of the thermometric fluid used: a sample that is deemed warmer than another by one thermometer will also be deemed warmer by the others. Similarly, two atomic clocks whose frequencies are unstable relative to each other still preserve the order of events they are used to record, barring relativistic effects.

46 Note that my grounds for criticizing Bridgman differ significantly from the familiar line of criticism expressed by Hempel (1966, 88-100). Hempel rejects the proliferation of operational quantity-concepts insofar as it makes the systematization of scientific knowledge impossible. In this respect I am in full agreement with Hempel. But Hempel fails to see that Bridgman’s nominal criterion of quantity individuation is not only opposed to the systematic aims of science but also blatantly circular. Like Bridgman, Hempel wrongly believed that agreement and disagreement among measuring instruments are adjudicated by a comparison of indications, or ‘readings’ (ibid, 92). The circularity of Bridgman’s criterion is exposed only once the focus shifts from instrument indications to measurement outcomes, which already incorporate error corrections. I will elaborate on this distinction below.

Nevertheless, Ellis’ criterion fails to capture the case of intervals and ratios of measurable quantities. Quantity intervals and quantity ratios are themselves quantities, and feature prominently in natural laws. Indeed, Ellis himself mentions the measurement of time-intervals and temperature-intervals47 and treats them as examples of quantities. As we have seen, when genuine systematic errors occur, measurement procedures do not preserve the order of intervals and ratios. Two temperature intervals deemed equal by one thermometer are deemed unequal by another depending on the thermometric fluid used. Note that this discrepancy persists far above the sensitivity thresholds of the instruments and cannot be attributed to resolution limitations.

A similar situation occurs with clocks. Consider two clocks, one whose ‘ticks’ are slowly increasing in frequency relative to a standard, the other slowly decreasing. Now imagine that each of these clocks is used to measure the frequency of the standard. Relativistic effects aside, the speeding clock will indicate that the time intervals marked by the standard are slowly increasing while the slowing clock will indicate that they are decreasing – a complete reversal of the order of time intervals. Ellis’ criterion is therefore insufficient to decide whether or not the two clocks measure intervals of the same quantity (i.e. time.) Considered in light of this criterion alone, the clocks may just as well be deemed to measure intervals of two different and anti-correlated quantities, time-A and time-B. But this is absurd, and again invites the unnecessary multiplication of quantities and laws encountered in Bridgman’s case.

47 Ibid, 44 and 100.
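A minimal numerical sketch of this reversal, with invented drift rates, may help: each clock expresses the standard’s (in fact equal) intervals in its own ticks, and the two clocks order those intervals in opposite ways.

# Hypothetical illustration of the reversal described above. The standard
# emits intervals of constant duration; clock A's own tick interval shrinks
# over time (it speeds up) while clock B's grows (it slows down). All drift
# rates are invented.
STANDARD_INTERVAL = 1.0  # duration of each standard interval, arbitrary units

def intervals_as_seen_by_clock(initial_tick, drift_per_interval, n_intervals):
    """Duration of successive standard intervals expressed in the clock's own ticks."""
    seen = []
    tick = initial_tick
    for _ in range(n_intervals):
        seen.append(STANDARD_INTERVAL / tick)
        tick += drift_per_interval
    return seen

print(intervals_as_seen_by_clock(1.0, -0.01, 5))  # speeding clock: intervals appear to lengthen
print(intervals_as_seen_by_clock(1.0, +0.01, 5))  # slowing clock: intervals appear to shorten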

As with Bridgman, Ellis cannot defend his criterion by claiming that it applies only to ordering relationships that have already been appropriately corrected for systematic errors. For as we have seen, such corrections can only be made in light of the presupposition that the relevant procedures measure the same quantity, and this is the very claim Ellis’ criterion is supposed to establish.

Ellis may retort by claiming that his criterion is intended to provide only necessary, but not sufficient, conditions for quantity individuation. This would be of some consolation if the condition specified by Ellis’ criterion – namely, the convergence of linear order – were commonly fulfilled whenever scientists compare measuring instruments to each other. But almost all comparisons among measuring instruments in the physical sciences are expressed in terms of intervals or ratios of outcomes, and we saw that Ellis’ criterion is not generally fulfilled for intervals and ratios. Moreover, virtually all known laws of physics are expressed in terms of quantity intervals and ratios. The discovery and confirmation of nomic relations, which are among the primary aims of physics, require individuation criteria that are applicable to intervals and ratios of quantities, but these are not covered by Ellis’ criterion.

2.3.3. Representational Theory of Measurement

Perhaps the best known contemporary philosophical account of measurement is the Representational Theory of Measurement (RTM)48. Unlike the two previous accounts, RTM does not explicitly discuss the individuation of quantities. Nevertheless, RTM discusses at length the individuation of types of measurement scales. A scale type is individuated by the transformations it can undergo. For example, the Celsius and Fahrenheit scales belong to the same type (‘interval’ scales) because they have the same set of permissible transformations, i.e. linear transformations with an arbitrary zero point49. The set of permissible transformations for a given scale is established by proving a ‘uniqueness theorem’ for that scale, a proof that rests on axioms concerning empirical relations among the objects measured on that scale50.
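As a minimal numerical gloss (mine, not part of RTM’s formal apparatus), the following sketch shows the sense in which Celsius and Fahrenheit are related by a permissible transformation of the interval type: ratios of temperature intervals are preserved, while ratios of the values themselves are not.

# Hypothetical illustration of interval-scale structure. Celsius and
# Fahrenheit are related by a positive affine transformation, which
# preserves ratios of intervals but not ratios of values.
def to_fahrenheit(celsius):
    return 9.0 / 5.0 * celsius + 32.0

a, b, c = 10.0, 20.0, 40.0  # three temperatures in degrees Celsius
print((c - b) / (b - a))  # ratio of intervals on the Celsius scale: 2.0
print((to_fahrenheit(c) - to_fahrenheit(b)) / (to_fahrenheit(b) - to_fahrenheit(a)))  # same ratio: 2.0
print(c / a, to_fahrenheit(c) / to_fahrenheit(a))  # ratios of values differ: 4.0 vs 2.08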

RTM can be used to generate an objection to my analysis of systematic error. According to this objection, the discrepancies I call ‘genuine systematic errors’ are simply cases where the same quantity is measured on different scales. For example, the discrepancies between mercury and alcohol thermometers arise because these instruments represent temperature on different scales (one may call them the ‘mercury temperature scale’ and ‘alcohol temperature scale’.) RTM shows that these scales belong to the same type – namely, interval scales. Moreover, RTM supposedly provides us with a conversion factor that transforms temperature estimates from one scale to the other, and this conversion eliminates the discrepancies. ‘Genuine systematic errors’, according to this objection, are not errors at all but merely byproducts of a subtle scale difference. RTM eliminates these byproducts before the underdetermination problem I mention has a chance to arise.

48 Krantz et al, 1971.
49 Ibid, 10.
50 Ibid.

Like the proposals by Bridgman and Ellis, this objection is circular. It purports to eliminate genuine systematic errors by appealing to differences in measurement scale, but any test for identifying differences in measurement scale must already presuppose that genuine systematic errors have been corrected.

This is best illustrated by considering a variant of the problem of quantity individuation. As before, we may assume that scientists are faced with apparent disagreement between the outcomes of different measurements. However, in this variant scientists are entertaining four possible explanations instead of just three:

(H1) the instruments are not measuring the same quantity;

(H1S) measurement outcomes are represented on different scales;

(H2) systematic errors have been inappropriately evaluated; or

(H3) measurement uncertainty has been underestimated.

According to the objection, hypothesis (H1S) can be tested independently of the other three hypotheses. In other words, facts about the appropriateness and uniqueness of a scale employed in measurement can be tested independently of questions about what, and how accurately, the instrument is measuring. This is yet another conceptual foundationalist claim, i.e. the claim that the concept of measurement scale is fundamental and therefore has universal criteria of existence and identity.

If taken literally, conceptual foundationalism about measurement scales leads to the same absurd multiplication of quantities already encountered above. This is because genuine systematic errors by definition cannot be transformed away through alterations of measurement scale. In the thermometry case, for example, the nonlinear discrepancy between mercury and alcohol thermometers cannot be eliminated by transformations of the interval scale, as the latter only admits of linear transformations. One is forced to conclude that the thermometers are measuring temperature on different types of scales – a ‘mercury scale type’ and an ‘alcohol scale type’ – with no permissible transformation between them. But this conclusion is inconsistent with RTM, according to which both scales are interval scales and hence belong to the same type. How can temperature be measured on two different interval scales without there being a permissible transformation between them? The only way to avoid inconsistency is to admit that the so-called ‘thermometers’ are not measuring the same quantity after all, but two different and nonlinearly related quantities. Hence strict conceptual foundationalism about measurement scales leads to the same sort of fragmentation of quantity concepts already familiar from Bridgman and Ellis’ accounts. If RTM is interpreted along such strict empiricist lines, it can provide very little insight into the way measurement scales are employed in successful cases of scientific practice.
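The point can also be illustrated numerically. In the following sketch (with invented readings and an invented nonlinear relation between them), even the best affine rescaling of one set of readings onto the other leaves a residual discrepancy, which is precisely the kind of discrepancy I have been calling a genuine systematic error.

# Hypothetical illustration: a nonlinear discrepancy between two sets of
# thermometer readings survives any affine (interval-scale) rescaling.
# The readings and the nonlinear relation are invented.
import numpy as np

mercury = np.linspace(0.0, 100.0, 11)
alcohol = mercury + 0.05 * (mercury - 50.0) ** 2 / 50.0  # invented nonlinearity

# Best affine map from alcohol readings onto mercury readings.
slope, offset = np.polyfit(alcohol, mercury, deg=1)
residuals = mercury - (slope * alcohol + offset)
print(np.max(np.abs(residuals)))  # nonzero: the discrepancy cannot be rescaled away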

A second and supposedly more charitable option is to interpret RTM as applying to indications in the idealized sense, already taking into account error corrections. This is compatible with the views expressed by authors of RTM, who state that their notion of ‘empirical structure’ should be understood as an idealized model of the data that already abstracts away from biases51. But on this reading the objection becomes circular. RTM’s proofs of uniqueness theorems, which according to the objection are supposed to make corrections to genuine systematic errors redundant, presuppose that these corrections have already been applied.

51 See Luce and Suppes (2002, 2).

Not only the objection, but RTM itself becomes circular under this reading. RTM, recall, aims to provide necessary and sufficient conditions for the appropriateness and uniqueness of measurement scales. According to the so-called ‘charitable’ reading just discussed, these conditions are specified under the assumption that measurement errors have already been corrected. In other words, any test of (H1S) can only be performed under the assumption that (H2) and (H3) have already been rejected. But any test of (H2) or (H3) must already represent measurement outcomes on some measurement scale, for otherwise quantitative error correction and uncertainty evaluation are impossible. In other words, the representational appropriateness of a scale type must already be presupposed in the process of obtaining idealized empirical relations among measured objects.52 Consequently these empirical relations cannot be used to test the representational appropriateness of the scale type being used. Instead, (H1S) is epistemically entangled with (H2) and (H3) and ipso facto with (H1). The project of establishing the appropriateness and uniqueness of measurement scales based on nothing but observable evidence is caught in a vicious circle.

The so-called ‘charitable’ reading of RTM fails at being charitable enough because it takes RTM to be an epistemological theory of measurement. Those who read RTM in this light expect it to provide insight into the way claims to the appropriateness and uniqueness of measurement scales may be tested by empirical evidence. The authors of RTM occasionally make comments that encourage this expectation from their theory53. But this expectation is unfounded. As we have just seen, the justification for one’s choice of measurement scale cannot be abstracted away from considerations relating to the acquisition and correction of empirical data. Any test for the appropriateness of scales that does not take into account considerations of this sort is bound to be circular or otherwise multiply quantities unnecessarily. Given that RTM remains silent on considerations relating to the acquisition and processing of empirical evidence, it cannot be reasonably expected to function as an epistemological theory of measurement.

52 This last point has also been noted by Mari (2000), who claims that “the [correct] characterization of measurement is intensional, being based on the knowledge available about the measurand before the accomplishment of the evaluation. Such a knowledge is independent of the availability of any extensional information on the relations in [the empirical relational structure] RE” (ibid, 74-5, emphases in the original).

Under a third, truly charitable reading, RTM is merely meant to elucidate the mathematical presuppositions underlying measurement scales. It is not concerned with grounding empirical knowledge claims but with the axiomatization of a part of the mathematical apparatus employed in measurement. Stripped of its epistemological guise, RTM avoids the problem of quantity individuation. But the cost is substantial: RTM can no longer be considered a theory of measurement proper, for measurement is a knowledge-producing activity, and RTM does not elucidate the structure of inferences involved in making knowledge claims on the basis of measurement operations. In other words, RTM explicates the presuppositions involved in choosing a measurement scale but not the empirical criteria for the adequacy of these presuppositions. RTM’s role with respect to measurement theory is therefore akin to that of axiomatic probability theory with respect to quantum mechanics: both accounts supply rigorous analyses of indispensable concepts (scale, probability) but not the conditions of their empirical application.

53 For example, the authors of RTM seem to suggest that empirical evidence justifies or confirms the axioms: “One demand is for the axioms to have a direct and easily understood meaning in terms of empirical operations, so simple that either they are evidently empirically true on intuitive grounds or it is evident how systematically to test them.” (Krantz et al 1971, 25)

To summarize this section, the foundational empiricist attempt to specify a test of quantity individuation (or scale type individuation) in terms of nothing more than relations among observable indications of measuring instruments fails. And fail it must, because indications themselves are insufficient to determine whether instruments measure different (but correlated) quantities or the same quantity with some inaccuracy. The next section will outline a novel epistemology of measurement, one that rejects foundationalism and dissolves the problem of quantity individuation.

2.4. A model-based account of measurement

2.4.1. General outline

According to the account I will now propose, physical measurement is the coherent and consistent attribution of values to a quantity in an idealized model of a physical process. Such models embody theoretical assumptions concerning relevant processes as well as statistical assumptions concerning the data generated by these processes. The physical process itself includes all actual interactions among measured samples, instrument, operators and environment, but the models used to represent such processes neglect or simplify many of these interactions. It is only in light of some idealized model of the measuring process that measurement outcomes can be assessed for accuracy and meaningfully compared to each other. Indeed, it is only against the background of such a simplified and approximate representation of the measuring process that measurement outcomes can even be considered candidates for objective knowledge.

To appreciate this last point in full, it is useful to distinguish between the indications (or ‘readings’) of an instrument and the outcomes of measurements performed with that instrument. This distinction has already been implicit in the discussion above, but the model-based view makes it explicit. Examples of indications are the height of a mercury column in a barometer, the position of a pointer relative to the dial of an ammeter, and the number of cycles (‘ticks’) generated by a clock during a given sampling period. More generally, an indication is a property of an instrument in its final state after the measuring process has been completed. The indications of instruments do not constitute measurement outcomes, and in themselves are no different than the final states of any other physical process54. What gives indications special epistemic significance is the fact that they are used for inferring values of a quantity based on a model of the measurement process, a model that relates possible indications to possible values of a quantity of interest. These inferred estimates of quantity values are measurement outcomes. Examples are estimates of atmospheric pressure, electric current and duration inferred from the abovementioned indications. Measurement outcomes are expressed on a determinate scale and include associated uncertainties, although sometimes only implicitly.
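A minimal sketch may make the indication/outcome distinction concrete. The linear calibration function, the correction term and the uncertainty below are invented placeholders standing in for an idealized model of a mercury thermometer; nothing in them is drawn from an actual calibration.

# Hypothetical sketch of inferring a measurement outcome from an indication
# via an idealized model of the instrument. All coefficients, corrections
# and uncertainties are invented placeholders.
def temperature_outcome(column_height_mm):
    """Map a mercury-column indication (mm) to a temperature estimate (deg C)."""
    sensitivity = 0.52        # assumed response of the column to temperature
    zero_offset = -4.1        # assumed zero-point term of the model
    systematic_corr = 0.3     # assumed correction for a known systematic effect
    estimate = sensitivity * column_height_mm + zero_offset + systematic_corr
    u_estimate = 0.2          # assumed combined standard uncertainty
    return estimate, u_estimate

indication = 48.7             # the indication: column height in millimetres
outcome, uncertainty = temperature_outcome(indication)
print(f"{outcome:.2f} +/- {uncertainty:.2f} deg C")  # the measurement outcome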

A hallmark of the model-based approach to measurement is that models are viewed as preconditions for obtaining an objective ordering relation among measured objects. We already saw, for example, that the ordering of time intervals or temperature intervals obtained by operating a clock or a thermometer depends on how scientists represent the relationship between indications and values of the quantity being measured. Such ordering is a consequence of modeling the instrument in a particular way and assigning systematic corrections to its indications accordingly. Contrary to empiricist theories of measurement, then, the ordering of objects with respect to the quantity being measured is never simply given through observation but must be inferred based on a model of the measuring process. Prior to such model-based inference, the ‘raw’ ordering of objects by the indications of an empirical operation is nothing more than a local regularity that may just as plausibly be ascribed to an idiosyncrasy of the instrument, the environment or the human operator as to the objects being ordered.

54 Indications may be divided into ‘raw’ and ‘processed’, the latter being numerical representations of the former. Neither processed nor raw indications constitute measurement outcomes. For further discussion see Chapter 4, Section 4.2.2.

This last claim is not meant as a denial of the existence of theory-free operations for ordering objects, e.g. placing pairs of objects on the pans of an equal-arms balance. However, such operations on their own do not yet measure anything, nor is measurement simply a matter of mapping the results of such operations onto numbers. Measurement claims, recall, are claims to objective knowledge – meaning that order is ascribed to measured objects rather than to artifacts of the specific operation being used. Grounding such a claim to objectivity involves differentiating operation-specific features from those that are due to a pertinent difference among measured samples. As we already saw, different procedures that supposedly measure the same quantity often produce inconsistent, and in some cases even completely reversed, ‘raw’ orderings among objects. Such orderings must therefore be considered operation-specific and cannot be taken as measurement outcomes.

To obtain a measurement outcome from an indication, a distinction must be drawn between pertinent aspects of the measured objects and procedural artifacts. This involves the development of what is sometimes called a ‘theory of the instrument’, or more exactly an idealized model of the measurement process, from theoretical and statistical assumptions. Such models allow scientists to account for the effects of local idiosyncrasies and correct the outcomes accordingly. Unlike the ‘raw’ order indicated by an operation, the order resulting from a model-based inference has the proper epistemic credentials to ground objective claims to measurement, because it is based on coherent assumptions about the object (or process, or event) being measured.

Not every estimation of a quantity value in an idealized model of a physical process counts as a measurement. Rather, a measurement is based on a model that coheres with background theoretical assumptions, and is consistent with other measurements of the same or related quantities performed under different conditions. As a result, what counts as an instance of measurement may change when assumptions about relevant quantities, instruments or modeling practices are modified.

All measurement outcomes are relative to an abstract and idealized representation of the procedure by which they were obtained. This explains how the outcomes of a measurement procedure can change without any physical modification to that procedure, merely by changing the way the instrument is represented. Similarly, the model-based approach explains how the accuracy of a measuring instrument can be improved merely by adding correction terms to the model representing the instrument. Thirdly, the model-relativity of measurement outcomes explains how the same set of operations, again without physical change, can be used to measure different quantities on different occasions depending on the interests of researchers. An example is the use of the same pendulum to measure either duration or gravitational potential without any physical change to the pendulum or to the procedures of its operation and observation. The change is effected merely by a modification to the mathematical manipulation of quantities in the model. For measuring duration, researchers plug in known values for gravitational potential in their model of the pendulum and use the indications of the pendulum (i.e. number of swings) to tell the time, whereas measuring gravitational potential involves the opposite mathematical procedure.
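A minimal sketch of the pendulum example follows, using the familiar small-angle period formula T = 2π√(L/g) as a stand-in for the pendulum model and local gravitational acceleration g as a stand-in for the gravitational quantity; the numbers are invented.

# Hypothetical sketch of using one pendulum model in two directions. The
# small-angle formula T = 2*pi*sqrt(L/g) stands in for the model; all
# numbers are invented.
import math

PENDULUM_LENGTH = 0.994  # metres, assumed known

def elapsed_time(num_swings, g=9.81):
    """Measure duration: plug in a known g and count swings."""
    period = 2 * math.pi * math.sqrt(PENDULUM_LENGTH / g)
    return num_swings * period

def local_gravity(num_swings, timed_duration):
    """Measure gravity: plug in a timed duration and invert the same model."""
    period = timed_duration / num_swings
    return PENDULUM_LENGTH * (2 * math.pi / period) ** 2

print(elapsed_time(100))          # ~200 s of elapsed time from counted swings
print(local_gravity(100, 200.0))  # ~9.8 m/s^2 from the same relation inverted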

The notions of accuracy and error are similarly elucidated in relation to models. The accuracy of a measurement procedure is determined by the accuracy of model-based predictions regarding that procedure’s outcomes. That is, a measurement procedure is accurate relative to a given model if and only if the model accurately predicts the outcomes of that procedure under a given set of circumstances55. Similarly, measurement error is evaluated as the discrepancy between such model-based predictions and standard values of the quantity in question. Such errors include, but are not limited to, discrepancies that can be estimated by statistical analysis of repeated measurements. In attributing claims concerning accuracy and error to predictions about instruments, rather than directly to instruments themselves, the model-based account makes explicit the inferential nature of accuracy and error (see also Hon 2009).56
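A minimal sketch of the uncertainty propagation mentioned in note 55, assuming a first-order (linearized) model with uncorrelated inputs; the model y = x1/x2 and all numbers are invented rather than taken from any procedure discussed here.

# Hypothetical sketch of first-order propagation of uncertainty from 'input'
# to 'output' quantities in a measurement model, assuming uncorrelated inputs.
import math

def propagate(sensitivities, uncertainties):
    """Combine standard uncertainties in quadrature, weighted by sensitivities."""
    return math.sqrt(sum((s * u) ** 2 for s, u in zip(sensitivities, uncertainties)))

# Invented example: output y = x1 / x2.
x1, u1 = 12.0, 0.1
x2, u2 = 3.0, 0.05
sensitivities = (1.0 / x2, -x1 / x2 ** 2)  # dy/dx1, dy/dx2
print(x1 / x2, propagate(sensitivities, (u1, u2)))  # predicted outcome and its uncertainty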

55 The accuracy of model-based predictions is evaluated by propagating uncertainties from ‘input’ quantities to ‘output’ quantities in the model, as will be clarified in Chapter 4.
56 Commenting on Hertz’ 1883 cathode ray experiments, Hon writes: “The error we discern in Hertz’ experiment cannot be associated with the physical process itself […]. Rather, errors indicate claims to knowledge. An error reflects the existence of an argument into which the physical process of the experiment is cast.” (2009, 21)

2.4.2. Conceptual quantity individuation

According to the model-based approach, a physical quantity is a parameter in a theory of a kind of physical system. Specifically, a measurable physical quantity is a theoretical parameter whose values can be related in a predictable manner to the final states of one or more physical processes. A measurable quantity is therefore defined by a background theory (or theories), which in turn inform the construction of models of particular processes intended to measure that quantity.

The model-based approach is not committed to a particular metaphysical standpoint on the reality of quantities. Whether or not quantities correspond to mind-independent properties is seen as irrelevant to the epistemology of measurement, that is, to an analysis of the evidential conditions under which measurement claims are justified. This is not meant to deny that scientists often think of the quantities they measure as representing mind-independent properties and that this way of thinking is fruitful for the development of accurate measurement procedures. But whether or not the quantities scientists end up measuring in fact correspond to mind-independent properties makes no difference to the kinds of tests scientists perform or the inferences they draw from evidence, for scientists have no access to such putative mind-independent properties other than through empirical evidence57. As will become clear below, the model-based approach allows one to talk coherently about accuracy, error and objectivity as properties of measurement claims without committing to any particular metaphysical standpoint concerning the truth conditions of such claims. The model-based approach does, however, make a distinction among quantities in terms of their epistemic status. The epistemic status of physical quantities varies from merely putative to deeply entrenched depending on the demonstrated degree of success in measuring them. As mentioned, to successfully measure a quantity is to estimate its values in a consistent and coherent manner based on models of physical processes.

57 My agnosticism with respect to the existence of mind-independent properties does not, of course, imply agnosticism with respect to the existence of objects of knowledge and the properties they possess qua objects of knowledge. A column of mercury has volume insofar as it can be reliably perceived to occupy space. I therefore accept a modest form of epistemic (e.g. Kantian) realism.

The model-based view provides a straightforward account of quantity individuation that dissolves the underdetermination problem discussed above. In order to individuate quantities across measuring procedures, one has to determine whether the outcomes of different procedures can be consistently modeled in terms of the same parameter in the background theory. If the answer is ‘yes’, then these procedures measure the same quantity relative to those models.

A few clarifications are in order. First, by ‘consistently modeled’ I mean that outcomes of different procedures converge within the uncertainties predicted by their respective models. A detailed example of this sort of test was discussed in Chapter 1. Second, the phrase ‘same parameter in the background theory’ requires clarification. A precondition for even testing whether two instruments provide consistent outcomes is that the outcomes of each instrument are represented in terms of the same theoretical parameter. By ‘same theoretical parameter’ I mean a parameter that enters into approximately the same relations with other theoretical parameters.58 The requirement to model outcomes in terms of the same theoretical quantity therefore amounts to a weak requirement for nomic coherence among models specified in terms of that quantity, rather than to a strong requirement for identity of extension or intension among quantity terms59.

58 This definition is recursive, but as long as the model has a finite number of parameters the recursion bottoms out. A more general definition is required for models with infinitely many parameters.

The emphasis on theoretical models may raise worries as to the status of pre-theoretical measurements. After all, measurements were performed long before the rise of modern physics. However, even when a full-fledged theory of the measured quantity is missing or untrustworthy, some pre-theoretical background assumptions are still necessary for comparing the outcomes of measurements. When in the 1840s Regnault made his comparisons among thermometers he eschewed all assumptions concerning the nature of caloric and the conservation of heat (Chang 2004, 77) but he still had to presuppose that temperature is a single-valued quantity, that it increases when an object is exposed to a heat source, and that an increase of temperature under constant pressure is usually correlated with expansion. These background assumptions informed the way Regnault modeled his instruments. Indeed, independently of these minimal assumptions the claim that Regnault’s instruments measured the same quantity cannot be tested.

To summarize the individuation criterion offered by the model-based approach, two procedures measure the same quantity only relative to some way of modeling those procedures, and if and only if their outcomes are shown to be modeled consistently and coherently in terms of the same theoretical parameter.

It is now time to clarify how this criterion deals with the problem of individuation outlined earlier in this chapter. As mentioned, the problem of quantity individuation has two distinct consequences that raise different sorts of challenges: one conceptual and the other practical. On the conceptual level it is only a problem for foundational accounts of measurement, namely those that attempt to specify theory-free individuation criteria for measurable quantities. The model-based approach dissolves the conceptual problem by resisting the temptation to offer foundational criteria of quantity individuation. The identity of quantities across measurement procedures is relative to background assumptions, either theoretical or pre-theoretical, concerning what those procedures are meant to measure.

59 For a recent proposal to individuate quantity concepts in this way see Diez (2002, 25-9).

Genuinely discrepant thermometers, for example, measure the same quantity only relative to a theory of temperature, or, before such a theory is available, relative to pre-theoretical beliefs about temperature. Similarly, different cesium atomic clocks measure the same quantity only relative to some theory of time such as Newtonian mechanics or general relativity, or otherwise relative to some pre-theoretical conception of time.

Even relative to a given theory metrologists sometimes have a choice as to whether or not they represent instruments as measuring the same quantity. Relative to general relativity, for example, atomic clocks placed at different heights above sea level measure the same coordinate time but different proper times. Quantity individuation therefore depends on which of these two quantities the clocks are modeled as measuring. The choice among different ways of modeling a given instrument involves a difference in the systematic correction applied to its indications. The latter point has already been illustrated in the case of the Michelson-Morley apparatus, but it holds even in more mundane cases that do not involve theory change. To return to the clock example, cesium fountain clocks that are represented as measuring proper time do not require correction for gravitational red-shifts. The discrepancy among their results is attributed to the fact that they occupy different reference frames and therefore measure different proper times relative to those frames. On the other hand, when the same clocks are represented as measuring the same coordinate time on the geoid (an imaginary surface of equal gravitational potential that roughly corresponds to the earth’s sea level) their indications need to be corrected for a gravitational red-shift.
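For a sense of the magnitudes involved, the first-order fractional frequency shift between a clock raised by Δh above the reference surface and a clock on that surface is approximately g·Δh/c²; the height below is invented.

# Hypothetical order-of-magnitude sketch of the gravitational red-shift
# correction mentioned above (first-order approximation; height invented).
g = 9.81            # local gravitational acceleration, m/s^2
c = 299792458.0     # speed of light, m/s
delta_h = 30.0      # assumed height of the clock above the reference surface, m

fractional_shift = g * delta_h / c ** 2
print(fractional_shift)  # ~3.3e-15: the size of the correction applied when the
                         # clock is modeled as realizing coordinate time on the geoid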

As already noted, the distribution of systematic errors among measurement procedures and the individuation of quantities measured by those procedures are but two sides of the same epistemic coin. Which side of the coin the scientific community will focus on when resolving the next discrepancy depends on the particular history of its theoretical and practical development. This conclusion stands in direct opposition to foundational approaches, which attempt to provide sufficient conditions for establishing the identity of measurable quantities independently of any particular scientific theory or experiment. The model-based approach, by contrast, treats criteria for the individuation of quantities as already embedded in some theoretical and material setting from the start. Claims concerning the individuation of quantities are underdetermined by the evidence only in principle, when such claims are viewed ‘from nowhere.’ But to view such claims independently of their particular theoretical and material context is to misunderstand how measurement produces knowledge. Measurement outcomes are the results of model-based inferences, and owe their objective validity to the idealizing assumptions that ground such inferences. In the absence of such idealizations, there is no principled way of telling whether discrepancies should be attributed to the objects being measured or to extrinsic factors.

The search for theory-free criteria of quantity individuation is therefore opposed to the very supposition that measurement provides objective knowledge. Such foundational pursuits sprout from a conflation between instrument indications, which constitute the empirical evidence for making measurement claims, and measurement outcomes, which are value estimates that constitute the content of these claims. Once the conflation is pointed out, it becomes clear that the background assumptions involved in inferring outcomes from indications play a necessary and legitimate role in grounding claims about quantity individuation, whereas the ‘raw’ evidence alone cannot and should not be expected to do so.

2.4.3. Practical quantity individuation

In addition to dissolving the conceptual problem of quantity individuation, the model-based approach to measurement also sheds light on possible solutions to the practical problem of quantity individuation, a task that is beyond the purview of other philosophical theories of measurement. The practical problem, recall, is that of selecting which of the three hypotheses (H1) – (H3) above to accept when faced with genuinely discrepant measurement outcomes. Laboratory scientists are habitually confronted with this sort of challenge, especially if they work in the forefront of accurate measurement where existing standards cannot settle the issue.

A common solution to the practical problem of quantity individuation is to accept only (H3), the hypothesis that measurement uncertainty has been underestimated, and enhance the stated uncertainties so as to achieve compatibility among results. This is equivalent to extending uncertainty bounds (sometimes mistakenly called ‘error bars’) associated with different outcomes until the outcomes are statistically compatible. It is common to use formal measures of statistical compatibility such as the Birge ratio60 to assess the success of adjustments to stated uncertainties. Agreement is restored either by re-evaluating type-B uncertainties associated with measuring procedures, by modifying statistical models of noise, or by increasing the stated uncertainty ad hoc based on ‘educated guesses’ as to which procedures are less accurate. Regardless of the technique of adjustment, the disadvantage of accepting only (H3) is that the increase in uncertainty required to recover agreement is similar in magnitude to the discrepancy among the outcomes. If the discrepancy is large relative to the initially stated uncertainty, this strategy results in a large increase of stated uncertainties.

60 See Birge (1932). Henrion & Fischhoff (1986, 792) provide a concise introduction to the Birge ratio.
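For concreteness, here is a minimal sketch of the Birge-ratio check with invented outcomes and uncertainties; the inflation step at the end is one simple version of the strategy described above, not the only technique in use.

# Hypothetical sketch of the Birge-ratio check. The ratio compares the
# observed scatter of the outcomes with the scatter expected from their
# stated uncertainties; all numbers are invented.
import numpy as np

outcomes = np.array([9.46, 9.51, 9.39, 9.55])
uncertainties = np.array([0.02, 0.03, 0.02, 0.04])

weights = 1.0 / uncertainties ** 2
weighted_mean = np.sum(weights * outcomes) / np.sum(weights)
chi_squared = np.sum(weights * (outcomes - weighted_mean) ** 2)
birge_ratio = np.sqrt(chi_squared / (len(outcomes) - 1))
print(birge_ratio)  # values well above 1 signal statistically incompatible outcomes

# One simple adjustment: inflate all stated uncertainties by the Birge ratio
# so that the adjusted set becomes compatible.
adjusted = uncertainties * birge_ratio
print(adjusted)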

Another option is to accept only (H2), the hypothesis that a systematic bias influences the outcomes of some of the measurements. In some cases such bias may be corrected by physically controlling its source, e.g. by shielding against background effects. This strategy is nevertheless limited by the fact that not all sources of systematic bias are controllable (such as the presence of nonzero gravitational potential on earth) and that others can only be controlled to a limited extent. Moreover, for older measurements the apparatus may no longer be available and attempts to recreate the apparatus may not succeed in reproducing its idiosyncrasies. For these reasons, systematic biases are often corrected only numerically, i.e. by modifying the theoretical model of the instrument with a correction factor that reflects the best estimate of the magnitude of the bias. Because accuracy is ascribable to measurement outcomes rather than to instrument indications, a model-based correction that modifies the outcome is a perfectly legitimate tool for enhancing accuracy, even if it has no effect on the indications of the instrument.

A third strategy for handling the practical challenge of quantity individuation, one that the model-based approach is especially useful in elucidating, involves accepting all three hypotheses (H1), (H2) and (H3) – namely, accepting that the instruments (as initially modeled) measure different quantities, that a systematic error is present that has not been appropriately corrected and that measurement uncertainties have been underestimated. Agreement is then restored by a method I call ‘unity through idealization’, a method that is central to the work of metrologists because it restores agreement with a relatively small loss of accuracy and without necessarily involving physical interventions.

The core idea behind this method is known as Galilean idealization (McMullin 1985). Galileo’s famous measurements of free-fall acceleration were performed on objects rolling down inclined planes. This replacement of experimental object was made possible by an idealization: acceleration on an inclined plane is an imperfect version of free-fall acceleration in a vacuum. To measure free-fall acceleration, one does not have to experiment on a free-falling object in a vacuum but merely to conceptually remove the effects of impediments such as the plane, air resistance etc. from an abstract representation of the rolling object.

More generally, the principle of unity through idealization is this: the same quantity can be measured in different concrete circumstances so long as these circumstances are represented as approximations of the same ideal circumstances.

This principle is utilized to restore agreement among seemingly divergent measurement outcomes. For example, when cesium fountain clocks are found to systematically disagree in their outcomes, it is occasionally possible to resolve the discrepancy by further idealizing the theoretical models representing these clocks. The discrepancy is attributed to the fact that the clocks were not measuring the same quantity in the less idealized representation. For example, the discrepancy is attributed to the fact that clocks were measuring different frequencies of cesium, the difference being caused by the presence of different levels of background thermal radiation. Instead of physically equalizing the levels of background radiation across clocks, the clocks are conceptually re-modeled so

as to measure the ideal cesium frequency in the absence of thermal background, i.e. at a temperature of absolute zero61. Under this new and further idealized representation of the clocks, metrologists are justified in applying a correction factor to the model of each clock that reflects their best estimate of the effect of thermal radiation on the indications of that clock. This correction involves a type-B uncertainty that is added to the total uncertainty of each clock, but this new uncertainty is typically much smaller than the discrepancy being corrected. When successful, this strategy leads to the elimination of discrepancies with only a small loss of accuracy, and with no physical modification to the apparatus.
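The arithmetic of such a correction is simple, and a schematic sketch may make the trade-off explicit. The numbers below are invented for illustration, and the actual evaluation of the thermal (blackbody) shift in cesium fountains is far more involved; the point is only that the correction moves the outcome toward the idealized zero-temperature condition while the uncertainty of the correction enters the budget as an additional type-B component.

    import math

    def correct_for_thermal_shift(outcome, shift_estimate, shift_u, other_u):
        # Subtract the best estimate of the thermal shift from the
        # fractional-frequency outcome, and combine the uncertainty of that
        # estimate in quadrature with the clock's other uncertainty components.
        corrected_outcome = outcome - shift_estimate
        total_u = math.sqrt(other_u ** 2 + shift_u ** 2)
        return corrected_outcome, total_u

    # Invented figures: a clock whose outcome includes an estimated shift of
    # -2.0e-14 known to 1.0e-16, on top of other uncertainties of 5.0e-16.
    corrected, u_total = correct_for_thermal_shift(
        outcome=1.3e-14, shift_estimate=-2.0e-14, shift_u=1.0e-16, other_u=5.0e-16)

Here the added type-B term (1.0e-16) is two orders of magnitude smaller than the discrepancy being removed (2.0e-14), which is what makes the strategy attractive.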

2.5. Conclusion: error as a conceptual tool

Philosophers of science have traditionally sought to analyze what they took to be basic concepts of measurement independently of any particular scientific theory, experimental tradition or instrument. This approach has proved fruitful for the axiomatization of measurement scales, but as an approach to the epistemology of measurement, i.e. to the study of the conditions under which measurement claims are justified in light of possible evidence, conceptual foundationalism encounters severe limitations. This chapter was dedicated to the discussion of one such limitation of conceptual foundationalism, namely to its attempt to answer questions of the form ‘what does procedure P measure?’ independently of questions of the form ‘how accurate is P?’ As I have shown, the two sorts of questions are epistemically entangled, such that no empirical test can be devised that would answer one

61 For a detailed discussion of the modeling of cesium fountain clocks see Chapter 1. without at the same time answering the other. Moreover, the choice of answers to both questions depends on background theories and on traditions of interpreting evidence that are accepted by the scientific community. Independently of such theories and traditions the indications of measuring instruments are devoid of epistemic significance, i.e. cannot be used to ground claims about the objects being measured.

The model-based approach offered here acknowledges the context-dependence of measurement claims and dissolves the worries of underdetermination associated with conceptual foundationalism. More importantly, the model-based approach clarifies how the use of idealizations allows scientists to ground claims to the unity of quantity concepts. The unity of quantity concepts across different measurement procedures rests on scientists’ success in consistently and coherently modeling these procedures in terms of the same theoretical parameter. This treatment of quantity individuation clarifies several aspects of physical measurement that have hitherto been neglected or poorly understood by philosophers of science, most notably the notion of systematic error. Far from merely being a technical concern for laboratory scientists, the possibility of systematic error is a central conceptual tool in coordinating theory and experiment. Genuine systematic errors constitute the conceptual ‘glue’ that allows scientists to model different instruments in terms of a single quantity despite nonscalable discrepancies among their indications. The applicability of quantity concepts across different domains, and hence the generality of physical theory, owe their existence to the possibility of distributing systematic errors among the indications of measuring instruments.


3. Making Time: A Study in the Epistemology of Standardization

Abstract: Contemporary timekeeping is an extremely successful standardization project, with most national time signals agreeing well within a microsecond. But a close look at methods of clock synchronization reveals a patchwork of ad hoc corrections, arbitrary rules and seemingly circular inferences. This chapter offers an account of standardization that makes sense of the stabilizing role of such mechanisms. According to the model-based account proposed here, to standardize a quantity is to legislate the proper mode of application of a quantity-concept to a collection of exemplary artifacts. This legislation is performed by specifying a hierarchy of models of these artifacts at different levels of abstraction. I show that this account overcomes limitations associated with conventionalist and constructivist explanations for the stability of networks of standards.

3.1. Introduction

The reproducibility of quantitative results in the physical sciences depends on the availability of stable measurement standards. The maintenance, dissemination and improvement of standards are central tasks in metrology, the science of reliable measurement.

With the guidance of the International Bureau of Weights and Measures (Bureau International des Poids et Mesures or BIPM) near Paris, a network of metrological institutions around the globe is responsible for the ongoing comparison and adjustment of standards.

Among the various standardization projects in which metrologists are engaged, contemporary timekeeping is arguably the most successful, with the vast majority of national time signals agreeing well within a microsecond and stable to within a few nanoseconds a month62. The standard measure of time currently used in almost every context of civil and scientific life is known as Coordinated Universal Time or UTC63. UTC is the product of an international cooperative effort by time centers that themselves rely on state-of-the-art atomic clocks spread throughout the globe. These clocks are designed to measure the frequencies associated with specific atomic transitions, including the cesium transition, which has defined the second since 1967.

What accounts for the overwhelming stability of contemporary timekeeping standards?

Or, to phrase the question somewhat differently, what factors enable a variety of standardization laboratories around the world to so closely reproduce Coordinated Universal

Time? The various explanans one could offer in response to this question may be divided into two broad kinds. First, one could appeal to the natural stability, or regularity, of the atomic clocks that contribute to world time. Second, one could appeal to the practices by which metrological institutions synchronize these atomic clocks. The adequate combination of these two sorts of explanans and the limits of their respective contribution to stability are contested issues among philosophers and sociologists of science. This chapter will discuss three accounts of standardization along with the explanations they offer for the stability of

62 Barring time zone and daylight saving adjustments. See BIPM (2011) for a sample comparison of national approximations to UTC. 63 UTC replaced Greenwich Mean Time as the global timekeeping reference in 1972. The acronym ‘UTC’ was chosen as a compromise to avoid favoring the order of initials in either English (CUT) or French (TUC). UTC. Each account will assign different explanatory roles to the social and natural factors involved in stabilizing timekeeping standards.

The first kind of explanation is inspired by conventionalism as expounded by Poincaré

([1898] 1958), Reichenbach ([1927] 1958) and Carnap ([1966] 1995). According to conventionalists, metrologists are free to choose which natural processes they use to define uniformity, namely, to define criteria of equality among time intervals. Prior to this choice, which is in principle arbitrary, there is no fact of the matter as to which of two given clocks

‘ticks’ more uniformly. The choice of natural process (e.g. solar day, pendulum cycle, or atomic transition) depends on considerations of convenience and simplicity in the description of empirical data. Once a ‘coordinative definition’ of uniformity is given, the truth or falsity of empirical claims to uniformity is completely fixed: how uniformly a given clock ‘ticks’ relative to currently defined criteria is a matter of empirical fact. In Carnap’s own words:

If we find that a certain number of periods of process P always match a certain number of periods of process P’, we say that the two periodicities are equivalent. It is a fact of nature that there is a very large class of periodic processes that are equivalent to each other in this sense. (Carnap [1966] 1995, 82-3, my emphasis)

We find that if we choose the pendulum as our basis of time, the resulting system of physical laws will be enormously simpler than if we choose my pulse beat. […] Once we make the choice, we can say that the process we have chosen is periodic in the strong sense. This is, of course, merely a matter of definition. But now the other processes that are equivalent to it are strongly periodic in a way that is not trivial, not merely a matter of definition. We make empirical tests and find by observation that they are strongly periodic in the sense that they exhibit great uniformity in their time intervals. (ibid, 84-5, my emphases)

Of course, some uncertainty is always involved in determining facts about uniformity experimentally. But for a conventionalist this uncertainty arises solely from the limited precision of measurement procedures and not from a lack of specificity in the definition.

Accordingly, the stability of contemporary timekeeping is explained by a combination of two factors: on the social side, the worldwide agreement to define uniformity on the basis of the frequency of the cesium transition; and on the natural side, the fact that all cesium atoms under specified conditions have the same frequency associated with that particular transition.

The universality of the cesium transition frequency is, according to conventionalists, a mind- independent empirical regularity that metrologists cannot influence but may only describe more or less simply.

The second, constructivist sort of explanation affords standardization institutions greater agency in the process of stabilization. Standardizing time is not simply a matter of choosing which pre-existing natural regularity to exploit; rather, it is a matter of constructing regularities from otherwise irregular instruments and human practices. Bruno Latour and

Simon Schaffer have expressed this position in the following ways:

Time is not universal; every day it is made slightly more so by the extension of an international network that ties together, through visible and tangible linkages, each of all the reference clocks of the world and then organizes secondary and tertiary chains of references all the way to this rather imprecise watch I have on my wrist. There is a continuous trail of readings, checklists, paper forms, telephone lines, that tie all the clocks together. As soon as you leave this trail, you start to be uncertain about what time it is, and the only way to regain certainty is to get in touch again with the metrological chains. (Latour 1987, 251, emphasis in the original)

Recent studies of the laboratory workplace have indicated that institutions’ local cultures are crucial for the emergence of facts, and instruments, from fragile experiments. […] But if facts depend so much on these local features, how do they work elsewhere? Practices must be distributed beyond the laboratory locale and the context of knowledge multiplied. Thus networks are constructed to distribute instruments and values which make the world fit for science. Metrology, the establishment of standard units for natural quantities, is the principal enterprise which allows the domination of this world. (Schaffer 1992, 23)

According to Latour and Schaffer, the metrological enterprise makes a part of the noisy and irregular world outside of the laboratory “fit for science” by forcing it to replicate

an order otherwise exhibited only under controlled laboratory conditions. Metrologists achieve this aim by extending networks of instruments throughout the globe along with protocols for interpreting, adjusting and comparing these instruments. The fact, then, that metrologists succeed in stabilizing their networks should not be taken as evidence for pre-existing regularities in the operation of instruments. On the contrary, the stability of metrological networks explains why scientists discover regularities outside the laboratory: these regularities have already been incorporated into their measuring instruments in the process of their standardization.

This chapter will argue that both conventionalist and constructivist accounts of standardization offer only partial and unsatisfactory explanations for the stability of networks of standards. These accounts focus too narrowly on either natural or social explanans, but any comprehensive picture of stabilization must incorporate both. I will propose a third, ‘model-based’ alternative to the conventionalist and constructivist views of standardization, which combines the strengths of the first two accounts and explains how both natural and social elements are mobilized through metrological practice.

This third approach views standardization as an ongoing activity aimed at legislating the proper mode of application of a theoretical concept to certain exemplary artifacts. By

‘legislation’ I mean the specification of rules for deciding which concrete particulars fall under a concept. In the case of timekeeping, metrologists legislate the proper mode of application of the concept of uniformity of time to an ensemble of atomic clocks. That is, metrologists specify algorithms for deciding which of the clocks in the ensemble approximate the theoretical ideal of uniformity more closely. Contrary to the views of conventionalists, this legislation is not a matter of arbitrary, one-time stipulation. Instead, I will argue that legislation is an ongoing, empirically-informed activity. This activity is

required because theoretical definitions by themselves do not completely determine how the defined concept is to be applied to particulars. Moreover, I will show that such acts of legislation are partly constitutive of the regularities metrologists discover in the behavior of their instruments. Which clocks count as ‘ticking’ more uniformly relative to each other depends – though only partially – on how metrologists legislate the mode of application of the concept of uniformity.

A crucial part of legislation is the construction of idealized models of measuring instruments. As I will argue, legislation proceeds by constructing a hierarchy of idealized models that mediate between the theoretical definition of the concept and concrete artifacts.

These models are iteratively modified in light of empirical data so as to maximize the regularity with which concrete instruments are represented under the theoretical concept.

Additionally, instruments themselves are modified in light of the most recent models so as to maximize regularity further. In this reciprocal exchange between abstract and concrete modifications, regular behavior is iteratively imposed on the network ‘from above’ and discovered ‘from below’, leaving genuine room for both natural and social explanans in an account of stabilization. Acts of legislation are therefore conceived both as constitutive of the regularities exhibited by instruments and as preconditions for the empirical discovery of new regularities (or irregularities) in the behaviors of those instruments64.

The first of this chapter’s three sections presents the central methods and challenges involved in contemporary timekeeping. The second section discusses the strengths and

64 In this respect the model-based account continues the analysis of measurement offered by Kuhn ([1961] 1977). Kuhn took scientific theories to be both constitutive of the correct application of measurement procedures and preconditions for the discovery of anomalies. The model-based account extends Kuhn’s insights to the maintenance of metrological standards, where local models play a role analogous to theories in Kuhn’s account. limits of conventionalist and constructivist explanations for the stability of metrological networks, while the third and final section develops the model-based account of standardization and demonstrates why it provides a more complete and satisfactory explanation for the stability of UTC than the first two.

3.2. Making time universal

3.2.1. Stability and accuracy

The measurement of time relies predominantly on counting the periods of cyclical processes, namely clocks. Until the late 1960s, time was standardized by recurrent astronomical phenomena such as the apparent solar noon, and artificial clocks served only as secondary standards. Contemporary time standardization relies on atomic clocks, i.e. instruments that produce an electromagnetic signal that tracks the frequency of a particular atomic resonance. The two central desiderata for a reliable clock are known in the metrological jargon as frequency stability and frequency accuracy. The frequency of a clock is said to be stable if it ticks at a uniform rate, that is, if its cycles mark equal time intervals.

The frequency of a clock is said to be accurate if it ticks at the desired rate, e.g. one cycle per second.

Frequency stability is, in principle, sufficient for reproducible timekeeping. A collection of clocks with perfectly stable frequencies would tick at constant rates relative to each other, and so the readings of any such clock would be sufficient to reproduce the readings of any

of the others by simple linear conversion65. A collection of frequency-stable clocks is therefore also ‘stable’ in the broader sense of the term, i.e. supports the reproducibility of measurement outcomes. For this reason I will use the term ‘stability’ insofar as it pertains to collections of clocks without distinguishing between its restricted (frequency-stability) and broader (reproducibility) senses unless the context requires otherwise.
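A minimal sketch of the point about linear conversion, in notation of my own choosing rather than the thesis’s: if clock B runs at a constant rate relative to clock A, then B’s reading is fully determined by A’s reading together with two fixed constants, a rate ratio and an initial offset.

    def reading_of_b(reading_of_a, rate_ratio, offset_at_zero):
        # With perfectly stable frequencies, clock B's reading is a linear
        # function of clock A's reading; the two constants never change.
        return offset_at_zero + rate_ratio * reading_of_a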

In practice, no clock has a perfectly stable frequency. The very notion of a stable frequency is an idealized one, derived from the theoretical definition of the standard second.

Since 1967 the second has been defined as the duration of exactly 9,192,631,770 periods of the radiation corresponding to a hyperfine transition of cesium-133 in the ground state66. As far as the definition is concerned, the cesium atom in question is at rest at zero degrees

Kelvin with no background fields influencing the energy associated with the transition.

Under these ideal conditions a cesium atom would constitute a perfectly stable clock. There are several different ways to build clocks that would approximate – or ‘realize’ – the conditions specified by the definition. Different clock designs result in different trade-offs between frequency accuracy, frequency stability and other desiderata, such as ease of maintenance and ease of comparison.

Primary realizations of the second are designed for optimal accuracy, i.e. minimal uncertainty with respect to the rate at which they ‘tick’. As of 2009 thirteen primary realizations are maintained by leading national metrological laboratories worldwide67. These clocks are special by virtue of the fact that every known influence on their output frequency

65 barring relativistic effects. 66 BIPM (2006), 113 67 As of 2009, active primary frequency standards were maintained by laboratories in France, Germany, Italy, Japan, the UK, and the US (BIPM 2010, 33) is controlled and rigorously modelled, resulting in detailed ‘uncertainty budgets.’ The clock design implemented in most primary standards is the ‘cesium fountain’, so called because it

‘tosses’ cesium atoms up in a vacuum which then fall down due to gravity. This design allows for a higher signal-to-noise ratio and therefore decreases measurement uncertainty.

The complexity of cesium fountains, however, and the need to routinely monitor their performance and environment prevents them from running continuously. Instead, each cesium fountain clock operates for a few weeks at a time, about five times a year. The intermittent operation of cesium fountain clocks means that they cannot be used directly for timekeeping. Instead, they are used to calibrate secondary standards, i.e. atomic clocks that are less accurate but run continuously for years. About 350 such secondary standards are employed to keep world time68. These clocks are highly stable in the short run, meaning that the ratios between the frequencies of their ‘ticks’ remain very nearly constant over weeks and months. But over longer periods the frequencies of secondary standards exhibit drifts, both relative to each other and to the frequencies of primary standards.

Because neither primary nor secondary standards ‘tick’ at exactly the same rate, metrologists are faced with a variety of real durations that can all be said to fit the definition of the second with some degree of uncertainty. Metrologists are therefore faced with the task of realizing the second based on indications from multiple, and often divergent, clocks.

In tackling this challenge, metrologists cannot simply appeal to the definition of the second to tell them which clocks are more accurate as it is too idealized to serve as the basis for an evaluation of concrete instruments. In Chapter 1 I have called this the problem of multiple

68 Panfilo and Arias (2009) realizability of unit definitions and discussed the way this problem is solved in the case of primary frequency standards.

This chapter focuses on the ways metrologists solve the problem of multiple realizability in the context of international timekeeping, where the goal is not merely to produce a good approximation of the second but also to maintain an ongoing measure of time and synchronize clocks worldwide in accordance with this measure. Timekeeping is an elaborate task that extends well beyond the evaluation of a handful of carefully maintained primary standards. It encompasses the global transmission of time signals that enable coordination in every aspect of civil and scientific life. From communication satellites, to financial exchanges, to the dating of astronomical observations, Coordinated Universal Time is meant to guarantee that all of our clocks tell the same time, and it must manage to do so despite the fact that every clock that maintains UTC ‘ticks’ with a slightly different ‘second’.

From the point of view of relativity theory, UTC is an approximation of terrestrial time, a theoretically defined coordinate time scale on the earth’s surface69. Ideally, one can imagine all of the atomic clocks that participate in the production of UTC as located on a rotating surface of equal gravitational potential that approximates the earth’s sea level. Such a surface is called a ‘geoid’, and terrestrial time is the time a perfectly stable clock on that surface would tell when viewed by a distant observer. However, much like the definition of the second, the definition of terrestrial time is highly idealized and does not specify the desired properties of any concrete clock ensemble. Here again, metrologists cannot determine how well UTC

69 More exactly, it is International Atomic Time (TAI), identical to UTC except for leap seconds, that constitutes a realization of Terrestrial Time. approximates terrestrial time based merely on the latter’s definition, and must compare UTC to other realizations of terrestrial time.

3.2.2. A plethora of clocks

Let us now turn to the method by which metrologists create a universal measure of time. At the BIPM near Paris, around 350 secondary standard indications from over sixty national laboratories are processed. The BIPM receives a reading from each clock every five days and uses these indications to produce UTC. Coordinated Universal Time is a measure of time whose scale interval is intended to remain as close as is practically possible to a standard second. Yet UTC is not a clock; it does not actually ‘tick’, and cannot be continuously read off the display of any instrument. Instead, UTC is an abstract measure of time: a set of numbers calculated monthly in retrospect, based on the readings of participating clocks70. These numbers indicate how late or early each nation’s ‘master time’, its local approximation of UTC, has been running in the past month. Typically ranging from a few nanoseconds to a few microseconds, these numbers allow national metrological institutes to then tune their clocks to internationally accepted time. Table 3.1 is an excerpt from the monthly publication issued by the BIPM in which deviations from UTC are reported for each national laboratory.

70 There are many clocks that approximate UTC, of course. As will be mentioned below, the BIPM and national laboratories produce continuous time signals that are considered realizations of UTC. However, UTC itself is an abstract measure and should not be confused with its many realizations.

Table 3.1: Excerpt from Circular-T (BIPM 2011), a monthly report through which the International Bureau of Weights and Measures disseminates Coordinated Universal Time (UTC) to national standardization institutes. The numbers in the first seven columns indicate differences in nanoseconds between UTC and each of its local approximations. The last three columns indicate type-A, type-B and total uncertainties for each comparison. (Only data associated with the first twenty laboratories is shown.)

In calculating UTC, metrologists face multiple challenges. First, among the clocks contributing to UTC almost none are primary standards. As previously mentioned, most primary standards do not run continuously. Consequently, UTC is maintained by a free-running ensemble of secondary standards – stable atomic clocks that run continuously for years but undergo less rigorous uncertainty evaluations than primary standards. Today the majority of these clocks are commercially manufactured by Hewlett-Packard or one of its offshoot companies, Agilent and Symmetricom. These clocks have proven to be exceptionally stable relative to each other, and the number of HP clocks that participate in

UTC has been steadily increasing since their introduction into world timekeeping in the early

1990s. As of 2010 HP clocks constitute over 70 percent of contributing clocks71.

Comparing clocks in different locations around the globe requires a reliable method of fixing the interval of comparison. This is another major challenge to globalising time. Were the clocks located in the same room, they could be connected by optical fibres to a counter that would indicate the difference, in nanoseconds, among their readings every five days.

Over large distances, time signals are transmitted via satellite. In most cases Global

Positioning System (GPS) satellites are used, thereby ‘linking’ the readings of participating clocks to GPS time. But satellite transmissions are subject to delays, which fluctuate depending on atmospheric conditions. Moreover, GPS time is itself a relatively unstable derivative of UTC. These factors introduce uncertainties to clock comparison data known as time transfer noise. Increasing with its distance from Paris, transfer noise is often much

71 Petit (2004, 208), BIPM (2010, 52-67). A smaller portion of continuously-running clocks are hydrogen masers, i.e. atomic clocks that probe a transition in hydrogen rather than in cesium. larger than the local instabilities of contributing clocks. This means that the stability of UTC is in effect limited by satellite transmission quality.

3.2.3. Bootstrapping reliability

The first step in calculating UTC involves processing data from hundreds of continually operating atomic clocks and producing a free-running time scale, EAL (Échelle

Atomique Libre). EAL is an average of clock indications weighted by clock stability. Finding out which clocks are more stable than others requires some higher standard of stability against which clocks would be compared, but arriving at such a standard is the very goal of the calculation. For this reason EAL itself is used as the standard of stability for the clocks contributing to it. Every month, the BIPM rates the weight of each clock depending on how well it predicted the weighted average of the EAL clock ensemble in the past twelve months.

The updated weight is then used to average clock data in the next cycle of calculation. This method promotes clocks that are stable relative to each other, while clocks whose stability relative to the overall average falls below a fixed threshold are given a weight of zero, i.e. removed from that month’s calculation. The average is then recalculated based on the remaining clocks. The process of removing offending clocks and recalculating is repeated exactly four times in each monthly cycle of calculation72.

Though effective in weeding out ‘noisy’ clocks, the weight updating algorithm introduces new perils to the stability of world time. First, there is the danger of a positive

72 Audoin and Guinot 2001, 249. feedback effect, i.e. a case in which a few clocks become increasingly influential in the calculation simply because they have been dominant in the past. In this scenario, EAL would become tied to the idiosyncrasies of a handful of clocks, thereby increasing the likelihood that the remaining clocks would drift further away from EAL. For this reason, the BIPM limits the weight allowed to any clock to a maximum of about 0.7 percent73. The method of fixing this maximum weight is itself occasionally modified to optimize stability.

Other than positive feedback, another source of potential instability is the abruptness with which new clock weights are modified every month. Because different clocks ‘tick’ at slightly different rates, a sudden change in weights results in a sudden change of frequency.

To avoid frequency jumps, the BIPM adds ‘cushion’ terms to the weighted average based on a prediction of that month’s jump74. A third precautionary measure taken by the BIPM assigns a zero weight to new clocks for a four month test interval before authorizing them to exert influence on international time.
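The weighting scheme described in the last few paragraphs can be summarized in a schematic sketch. This is a simplified reconstruction for illustration only, not the BIPM’s actual algorithm: the prediction scores, the stability threshold, and the sample readings are invented, and the 2.5/N cap mentioned in note 73 is used as an example of a maximum weight.

    def eal_style_average(readings, prediction_error, rounds=4, threshold=3.0):
        # Toy EAL-style average: weight each clock by how well it predicted the
        # past ensemble average (smaller error, larger weight), cap relative
        # weights at 2.5/N, zero clocks that stray too far from the average,
        # and recalculate; the text reports four such passes per month.
        n = len(readings)
        max_rel_weight = 2.5 / n
        weights = [1.0 / e ** 2 for e in prediction_error]
        average = None
        for _ in range(rounds):
            total = sum(weights)
            rel = [min(w / total, max_rel_weight) for w in weights]
            average = sum(r * x for r, x in zip(rel, readings)) / sum(rel)
            for i, x in enumerate(readings):
                if abs(x - average) > threshold * prediction_error[i]:
                    weights[i] = 0.0   # drop the offending clock this month
        return average

    # Invented clock offsets in nanoseconds; the fourth clock is an outlier.
    print(eal_style_average([12.0, 11.8, 12.3, 14.5, 11.9],
                            [0.5, 0.4, 0.6, 0.5, 0.45]))

With these made-up numbers the outlier is zeroed after the first pass and the remaining clocks determine the average; in the real calculation the same logic is applied to some 350 clocks, and the updated weights also feed into the following month’s computation.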

The results of averaging depend not only on the choice of clock manufacturer, transmission method and averaging algorithm, but also on the selection of particular participating clocks. Only laboratories in nations among the eighty members and associates of BIPM are eligible for participation in the determination of EAL. Funded by membership fees, the BIPM aims to balance the threshold requirements of metrological quality with the financial benefits of inclusiveness. Membership requires national diplomatic relations with

France, the depositary of the intergovernmental treaty known as the Metre Convention

(Convention du Mètre). This treaty authorizes BIPM to standardize industrial and scientific

73 Since 2002, the maximal weight of each clock is limited to 2.5 / N, where N is the number of contributing clocks (Petit 2004, 308). 74 Audoin and Guinot 2001, 243-5. measurement. The BIPM encourages participation in the Metre Convention by highlighting the advantages of recognized metrological competence in the domain of global trade, and by offering reduced fees to smaller states and developing countries75. Economic trends and political considerations thus influence which countries contribute to world time, and indirectly which atomic clocks are included in the calculation of UTC.

3.2.4. Divergent standards

Despite the multiple means employed to stabilize the weighted average of clock readings, additional steps are necessary to guarantee stability, due to the fact that the frequencies of continuously operating clocks tend to drift away from those of primary standards. In the late 1950s, when atomic time scales were first calculated, they were based solely on free-running clocks. Over the course of the following two decades, technological advances revealed that universal time was running too fast: the primary standards that realized the second were beating slightly slower than the clocks that kept time. To align the two frequencies, in 1977 the second of UTC was artificially lengthened by one part in 10^13.

At this time it was decided that the BIPM would make regular small corrections that would

‘steer’ the atomic second toward its officially realized duration, in an attempt to avoid future shocks76. This decision effectively split atomic time into two separate scales, each ‘ticking’ with a slightly different second: on the one hand, the weighted average of free-running

75 Quinn (2003) 76 Audoin and Guinot 2001, 250 clocks (EAL), and on the other the continually corrected (or ‘steered’) International Atomic

Time, TAI (Temps Atomique International).

The monthly calculation of steering corrections is a remarkable algorithmic feat, relying upon intermittent calibrations against the world’s ten cesium fountains. These calibrations differ significantly from one another in quality and duration. Some primary standards run for longer periods than others, resulting in a better signal; some calibrations suffer from higher transfer noise; and some of the primary standards involved are more accurate than others77. For this reason the BIPM assigns weights, or ‘filters’, to each calibration episode depending on its quality. These checks are still not sufficient. Primary standards do not agree with one another completely, giving rise to the concern that the duration of the UTC second could fluctuate depending on which primary standard contributed the latest calibration. To circumvent this, the steering algorithm is endowed with

‘memory’, i.e. it extrapolates data from past calibration episodes into times in which primary standards are offline. This extrapolation must itself be time-dependent, as noise limits the capacity of free-running clocks to ‘remember’ the frequency to which they were calibrated.

The BIPM therefore constructs statistical models for the relevant noise factors and uses them to derive a temporal coefficient, which is then incorporated into the calculation of

‘filters’78.

This steering algorithm allows metrologists to track the difference in frequency between free-running clocks and primary standards. Ideally, the difference in frequency would remain stable, i.e. there would be a constant ratio between the ‘seconds’ of the two

77 See Chapter 1 for a detailed discussion of how the accuracy of primary standards is evaluated. 78 Azoubib et al (1977), Arias and Petit (2005) measures. In this ideal case, requirements for both accuracy and stability would be fulfilled, and a simple linear transformation of EAL would provide metrologists with a continuous timescale as accurate as a cesium fountain. In practice, however, EAL continues to drift. Its second has lengthened in the past decade by a yearly average of 4 parts in 10^16 relative to primary standards79. This presents metrologists with a twofold problem: first, they have to decide how fast they want to ‘steer’ world time away from the drifting average. Overly aggressive steering would destabilize UTC, while too small a correction would cause clocks the world over to slowly diverge from the official (primary) second. Indeed, the BIPM has made several modifications to its steering policy in the past three decades in an attempt to optimize both smoothness and accuracy80. The second aspect of the problem is the need to stabilize the frequency of EAL. One solution to this aspect of the problem is to replace clocks in the ensemble with others that ‘drift’ to a lesser extent. This task has largely been accomplished in the past two decades with the proliferation of HP clocks, but some instability remains. Elimination or reduction of the remaining instability is likely to require new algorithmic ‘tricks’. The BIPM is currently considering a change to the EAL weighting method that would involve a more sophisticated prediction of the behaviour of clocks, a change that is expected to further reduce frequency drifts81.
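The trade-off between smoothness and accuracy can be pictured with a toy steering rule. This is my own illustration rather than the BIPM’s actual procedure, and the cap of one part in 10^15 per month is an invented figure chosen only to show the logic of bounded monthly corrections.

    def monthly_steering_step(frequency_offset, max_step=1.0e-15):
        # frequency_offset: estimated fractional frequency offset of the
        # free-running average relative to the primary standards.
        # The correction is capped, so a large offset is removed gradually
        # over many months instead of in one destabilizing jump.
        if abs(frequency_offset) <= max_step:
            return -frequency_offset
        return -max_step if frequency_offset > 0 else max_step

    # With this cap, a sudden shift of 2 parts in 10^14 in the realized second
    # (of the size discussed in the next paragraph) takes on the order of
    # twenty monthly steps to absorb.
    offset, months = 2.0e-14, 0
    while abs(offset) > 1e-18:
        offset += monthly_steering_step(offset)
        months += 1

That roughly matches the report, below, that a correction of this size took more than a year of monthly steering to work through.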

Disagreements among standards are not the sole condition requiring frequency steering. Abrupt changes in the ‘official’ duration of the second as realized by primary standards may also trigger steering corrections. These abrupt changes can occur when metrologists modify the way in which they model their instruments. For example, in 1996

79 Panfilo and Arias (2009) 80 Audoin and Guinot 2001, 251 81 Panfilo and Arias (2009) the metrological community achieved consensus around the effects of thermal background radiation on cesium fountains, previously a much debated topic. A new systematic correction was subsequently applied to primary standards that shortened the second by approximately 2 parts in 10^14. While this difference may seem minute, it took more than a year of monthly steering corrections for UTC to ‘catch up’ with the suddenly shortened second82.

3.2.5. The leap second

With the calculation of TAI the task of realizing the definition of the standard second is complete. TAI is considered to be a realization of terrestrial time, that is, an approximation of general-relativistic coordinate time on the earth’s sea level. However, a third and last step is required to keep UTC in step with traditional time as measured by the duration of the solar day. The mean solar day is slowly increasing in duration relative to atomic time due to gravitational interaction between the earth and the moon. To keep ‘noon

UTC’ closely aligned with the apparent passage of the sun over the Greenwich meridian, a leap second is occasionally added to UTC based on astronomical observations. By contrast,

TAI remains free of the constraint to match astronomical phenomena, and runs ahead of

UTC by an integer number of seconds83.

82 Audoin and Guinot 2001, 251 83 In January 2009 the difference between TAI and UTC was 34 seconds.

3.3. The two faces of stability

3.3.1. An explanatory challenge

The global synchronization of clocks in accordance with atomic time is a remarkable technological feat. Coordinated Universal Time is disseminated to all corners of civil life, from commerce and aviation to telecommunication, in a manner that is seamless to the vast majority of its users. This achievement is better appreciated when one contrasts it to the state of time coordination less than a century-and-a-half ago, when the transmission of time signals by telegraphic cables first became available. Peter Galison (2003) provides a detailed history of the efforts involved in extending a unified ‘geography of simultaneity’ across the globe during the 1870s and 1880s, when railroad companies, national observatories, and municipalities kept separate and conflicting timescales. Today, the magnitude of discrepancies among timekeeping standards is far smaller than the accuracy required by almost all practical applications, with the exception of a few highly precise astronomical measurements.

The task of the remainder of this chapter is to explain how metrologists succeed in synchronizing clocks worldwide to Coordinated Universal Time. What are the sources of this measure’s efficacy in maintaining global consensus among time centers? An adequate answer must account for the way in which the various ingredients that make up UTC contribute to its success. In particular, the function of ad hoc corrections, rules of thumb and seemingly circular inferences prevalent in the production of UTC requires explanation. What role do these mechanisms play in stabilizing UTC, and is their use justified from an epistemic point of view? The importance of this question extends beyond the measurement

of time. Answering it will require an account of the goals of standardization projects, the sort of knowledge such projects produce, and the reasons they succeed or fail. I will begin by considering two such accounts, namely conventionalism and constructivism, and argue that they provide only partial and unsatisfactory explanations for the stability of contemporary timekeeping standards. I will follow this by combining elements of both accounts in the development of a third, model-based account of standardization that overcomes the explanatory limitations of the first two.

3.3.2. Conventionalist explanations

Any plausible account of metrological knowledge must attend to the fact that metrologists enjoy some freedom in determining the correct application of the concepts they standardize. In order to properly understand the goals of standardization projects one must first clarify the sources and scope of this freedom. Traditionally, philosophers of science have taken standardization to consist in arbitrary acts of definition. Conventionalists like

Poincaré and Reichenbach stressed the arbitrary nature of the choice of congruence conditions, that is, the conditions under which magnitudes of certain quantities such as length and duration are deemed equal to one another. In his essay on “The Measure of

Time” ([1898] 1958), Poincaré argued against the existence of a mind-independent criterion of equality among time intervals. Instead, he claimed that the choice of a standard measure of time is “the fruit of an unconscious opportunism” that leads scientists to select the simplest system of laws (ibid, 36). Reichenbach called these arbitrary choices of congruence conditions ‘coordinative definitions’ because they coordinate between the abstract concepts

113 employed by a theory and the physical relations represented by these concepts (Reichenbach

1927, 14). In the case of time, the choice of congruence conditions amounts to a coordinative definition of uniformity in the flow of time. Coordinative definitions are required because theories by themselves do not specify the application conditions for the concepts they define. A theory can only link concepts to one another, e.g. postulate that the concept of uniformity of time is tied to the concept of uniform motion, but it cannot tell us which real motions or frequencies count as uniform. For this, Reichenbach claimed, a coordinative definition is needed that would link the abstract concept of uniformity with some concrete method of time measurement. Prior to such coordinative definition there is no fact of the matter as to whether or not two given time intervals are equal (ibid, 116).

The standardization of time, according to classical conventionalists, involves a free choice of a coordinative definition for uniformity. It is worth highlighting three features of this definitional sort of freedom as conceived by classical conventionalists. First, it is an a priori freedom in the sense that its exercise is independent of experience. One may choose any uniformity criterion as long as the consequences of that criterion do not contradict one another. Second, it is a freedom only in principle and not in practice. For pragmatic reasons, scientists select uniformity criteria that make their descriptions of nature as simple as possible. The actual selection of coordinative definition is therefore strongly, if not uniquely, constrained by the results of empirical procedures. Third, definitional freedom is singular in the sense that it is completely exhausted by a single act of exercising it. Though a definition can be replaced by another, each such replacement annuls the previous definition. In this respect acts of definition are essentially ahistorical.

In the case of contemporary timekeeping, the definition of the second functions as a coordinative definition of uniformity. Recall that the definition of the second specifies that

the period associated with a particular transition of the cesium atom is constant, namely, that the cycles of the electromagnetic radiation associated with this transition are equal to each other in duration. The definition of the second, in other words, fixes not only a unit of time but also a criterion for the congruence of time intervals. In order to make this uniformity criterion consistent across different relativistic reference frames, the cesium atom is said to lie on the earth’s approximate sea level. The resulting coordinate timescale, terrestrial time, provides a universal definition of uniformity while conveniently allowing earth-bound clocks to approximate it.

According to conventionalists, once a coordinative definition of uniformity is chosen the equality or inequality of durations is a matter of empirical fact. As the passage quoted above from Carnap makes clear, the remaining task for metrologists is only to discover which clocks ‘tick’ at a more stable rate relative to the chosen definition of uniformity and to improve those clocks that were found to be less stable. Conventionalists, in other words, explain the stability of networks of standards in naturalistic terms. A naturalistic explanation for the stability of a network of standards is one that ultimately appeals to an underlying natural regularity in the properties or behaviors of those standards. In the case of time measurement, a conventionalist would claim that standardization is successful because the operation of atomic clocks relies on an empirical regularity, namely the fact that the frequency associated with the relevant transition is roughly the same for all cesium-133 atoms. This regularity may be described in ways that are more or less simple depending on one’s choice of coordinative definition, but the empirical facts underlying it are independent of human choice. Accordingly, a conventionalist explanation for the success of the stabilizing mechanisms employed in the calculation of UTC is that these mechanisms make

UTC a reliable indicator of an underlying regularity, namely the constancy of the frequency

associated with different concrete cesium atoms used by different clocks84. Supposedly, metrologists are successful in synchronizing clocks to UTC because the algorithm that calculates UTC detects those clocks that ‘tick’ closer to the ideal cesium frequency and distributes time adjustments accordingly.

The idea that UTC is a reliable indicator of a natural regularity gains credence from the fact that UTC is gradually ‘steered’ towards the frequency of primary standards. As previously mentioned, primary frequency standards are rigorously evaluated for uncertainties and compared to each other in light of these evaluations. The fact that the frequencies of different primary standards are consistent with each other within uncertainty bounds can be taken as an indication for the regularity of the cesium frequency. Assuming, as metrologists do85, that the long-term stability of UTC over years is due mostly to ‘steering’, one can plausibly make the case that the algorithm that produces UTC is a reliable detector of a natural regularity in the behavior of cesium atoms.

This nevertheless leaves unexplained the success of the mechanisms that keep UTC stable in the short-term, i.e. when UTC is averaged over weeks and months. These mechanisms include, among others, the ongoing redistribution of clock weights, the limiting of maximum weight, the ‘slicing’ of steering corrections into small monthly increments and the increasingly exclusive reliance on Hewlett-Packard clocks.

One way of accounting for these short-term stabilizing mechanisms is to treat them as tools for facilitating consensus among metrological institutions. I will discuss this approach

84 This is a slight over-simplification, because not all the clocks that contribute to UTC are cesium clocks. As mentioned, some are hydrogen masers. The ‘regularity’ in question can therefore be taken more generally to be the constancy of frequency associated with any given atomic transition in some predefined set. 85 Audoin and Guinot 2001, 251 in the next subsection. Another option would be to look for a genuine epistemic function that these mechanisms serve. To a conventionalist (as to any other naturalist), this means finding a way of vindicating these self-stabilizing mechanisms as reliable indicators of an underlying natural regularity. Because a reliable indicator is one that is sensitive to the property being indicated, one should expect the relevant stabilizing mechanisms to do less well when such regularity is not strongly supported by the data. In practice, however, no such degradation in stability occurs. On the contrary, short-term stabilization mechanisms are designed to be as insensitive to frequency drifts or gaps in the data as is practically possible.

It is rather the data that is continually adjusted to stabilize the outcome of the calculation. As already mentioned, whenever a discrepancy among the frequencies of different secondary standards persists for too long it is eliminated ad hoc, either by ignoring individual clocks or by eventually replacing them with others that are more favorable to the stability of the average. Frequency ‘shocks’ introduced by new clocks are numerically cushioned. Even corrections towards primary standards, which are supposed to increase accuracy, are spread over a long period by slicing them into incremental steering adjustments or by embedding them in a ‘memory-based’ calculation.

The constancy of the cesium period in the short-term is therefore not tested by the algorithm that produces UTC. For a test implies the possibility of failure, whereas the stabilizing mechanisms employed by the BIPM in the short-term are fail-safe and intended to guard UTC against instabilities in the data. Indeed, there is no sign that metrologists even attempt to test the ‘goodness of fit’ of UTC to the individual data points that serve as the input for the calculation, let alone that they are prepared to reject UTC if it does not fit the data well enough. Rather than a hypothesis to be tested, the stability of the cesium period is a presupposition that is written into the calculation from the beginning and imposed on the

data that serves as its input. This seemingly question-begging practice of data analysis suggests either that metrological methods are fundamentally flawed or that the conventionalist explanation overlooks some important aspect of the way UTC is supposed to function. In Section 3.4 I will argue that the latter is the case, and that the seeming circularity in the calculation of UTC dissolves once the normative role of models in metrology is acknowledged.

3.3.3. Constructivist explanations

As we learned previously, UTC owes its short-term stability not to the detection of regularities in underlying clock data, but rather to the imposition of a preconceived regularity on that data. This regularity, i.e. the frequency stability of participating clocks relative to

UTC, is imposed on the data through weighting adjustments, time steps and frequency corrections implemented in the various stages of calculation. Constructivist explanations for the success of standardization projects make such regulatory practices their central explanans. According to Latour and Schaffer (quoted above), the stability of global timekeeping is explained by the ongoing efforts of metrological institutions to harness clocks into synchronicity. Particularly, standard clocks agree about the time because metrologists maintain a stable consensus as to which clocks to use and how the readings of these clocks should be corrected. The stability of consensus is in turn explained by an international bureaucratic cooperation among standardization institutes. To use Latour’s language, the stability of the network of clocks depends on an ongoing flux of paper forms issued by a network of calculation centers. When we look for the sources of regularity by which these

forms are circulated we do not find universal laws of nature but international treaties, trade agreements and protocols of meetings among clock manufacturers, theoretical physicists, astronomers and communication engineers. Without the efforts and resources continuously poured into the metrological enterprise, atomic clocks would not be able to tell the same time for very long.

From a constructivist perspective, the algorithm that produces UTC is a particularly efficient mechanism for generating consensus among metrologists. Recall that Coordinated

Universal Time is nothing over and above a list of corrections that the BIPM prescribes to the time signals maintained by local standardization institutes. By administering the corrections published in the monthly reports of the BIPM, metrologists from different countries are able to reach agreement despite the fact that their clocks ‘tick’ at different rates.

This agreement is not arbitrary but constrained by the need to balance the central authority of the International Bureau with the autonomy of national institutes. The need for a trade-off between centralism and autonomy accounts for the complexity of the algorithm that produces UTC, which is carefully crafted to achieve a socially optimal compromise among metrologists. A socially optimal compromise is one that achieves consensus with minimal cost to local metrological authorities, making it worthwhile for them to comply with the regulatory strictures imposed by the BIPM. Indeed, the algorithm is designed to distribute the smallest adjustments possible among as many clocks as possible. Consequently, the overall adjustment required to approximate UTC at any given local laboratory is kept to a minimum.

In stressing the importance of ongoing negotiations among metrological institutions, constructivists do not yet diverge from conventionalists, who similarly view the comparison and adjustment of standards as prerequisites for the reproducibility of measurement results.

But constructivists go a step further and, unlike conventionalists, refuse to invoke the presence of an underlying natural regularity in order to explain the stability of timekeeping standards86. On the contrary, they remind us that regularity is imposed on otherwise discrepant clocks for the sake of achieving commercial and economic goals. Only after the fact does this socially-imposed regularity assume the appearance of a natural phenomenon.

Latour expresses this view by saying that “[t]ime is not universal; every day it is made slightly more so by the extension of an international network [of standards]” (1987, 251). Schaffer similarly claims that facts only “work” outside of the laboratory because metrologists have already made the world outside of the laboratory “fit for science”(1992, 23). According to these statements, if they are taken literally, quantitative scientific claims attain universal validity not by virtue of any preexisting state of the world, but by virtue of the continued efforts of metrologists who transform parts of the world until they reproduce desired quantitative relations87. In what follows I will call this the reification thesis.

The reification thesis is a claim about the sources of regularity exhibited by measurement outcomes outside of the carefully controlled conditions of a scientific laboratory. This sort of regularity, constructivists hold, is constituted by the stabilizing practices carried out by metrologists rather than simply discovered in the course of carrying out such practices. In other words, metrologists do not simply detect those instruments and methods that issue reproducible outcomes; rather, they enforce a preconceived order on otherwise irregular instruments and methods until they issue sufficiently reproducible outcomes. Note that the reification thesis entails an inversion of explanans and explanandum relative to the conventionalist account. It is the successful stabilization of metrological networks that, according to Latour and Schaffer, explains universal regularities in the operation of instruments rather than the other way around.

86 Ian Hacking identifies explanations of stability as one of three ‘sticking points’ in the debate between social constructivists and their intellectual opponents (1999, 84-92).

87 These claims echo Thomas Kuhn’s in his essay “The Function of Measurement in Modern Physical Science” ([1961] 1977).

How plausible is this explanatory inversion in the case of contemporary timekeeping?

As already hinted at above, the constructivist account fits well with the details of the case insofar as the short-term stability of standards is involved. In the short run, the UTC algorithm does not detect frequency stability in the behavior of secondary standards but imposes stability on their behavior. Whenever a discrepancy arises among different clocks it is eliminated by ad hoc correction or by replacing some of the clocks with others. The ad hoc nature of these adjustments guarantees that any instability, no matter how large, can be eliminated in the short run simply by redistributing instruments and ‘paper forms’ throughout the metrological network.

The constructivist account is nevertheless hard pressed to explain the fact that the corrections involved in maintaining networks of standards remain small in the long run. An integral part of what makes a network of metrological standards stable is the fact that its maintenance requires only small and occasional adjustments rather than large and frequent ones.

A network that reverted to irregularity too quickly after its last recalibration would demand constant tweaking, making its maintenance ineffective. This long-term aspect of stability is an essential part of what constitutes a successful network of standards, and is therefore in need of explanation no less than its short-term counterpart. After all, nothing guarantees that metrologists will always succeed in diminishing the magnitude and frequency of corrections they apply to networks of instruments. How should one explain their success, then, in those cases when they so succeed? Recall that the conventionalist appealed to underlying regularities in nature to explain long-term stability: metrologists succeed in stabilizing networks because they choose naturally stable instruments. But this explanatory move is blocked for those who, like Latour and Schaffer, hold to the reification thesis with its requirement of explanatory inversion.

To illustrate this point, imagine that metrologists decided to keep the same algorithm they currently use for calculating UTC, but implemented it on the human heart as a standard clock instead of the atomic standard88. As different hearts beat at different rates depending on the particular person and circumstances, the time difference between these organic standards would grow rapidly from the time of their latest correction. Institutionally imposed adjustments would only be able to bring universal time into agreement for a short while before discrepancies among different heart-clocks exploded once more. The same algorithm that produces UTC would be able to minimize adjustments to a few hours per month at best, instead of a few nanoseconds when implemented with atomic standards. In the long run, then, the same mechanism of social compromise would generate either a highly stable, or a highly unstable, network depending on nothing but the kind of physical process used as a standard. Constructivists who work under the assumption of the reification thesis cannot appeal to natural regularities in the behavior of hearts or cesium atoms as primitive explanans, and would therefore be unable to explain the difference in stability.
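
A rough back-of-the-envelope sketch makes the contrast vivid (the fractional instabilities below are schematic round numbers assumed only for illustration, not measured values):

    # Schematic comparison of accumulated time dispersion over one month for
    # two hypothetical kinds of standard. The instability figures are assumed
    # orders of magnitude chosen only to illustrate the contrast.
    SECONDS_PER_MONTH = 30 * 24 * 3600

    def dispersion_after_one_month(fractional_instability):
        """Crude bound on the time offset a standard accumulates in a month."""
        return fractional_instability * SECONDS_PER_MONTH

    cesium_like = dispersion_after_one_month(1e-14)  # cesium-like stability
    heart_like = dispersion_after_one_month(0.05)    # ~5% rate variation

    print(f"cesium-like standard: ~{cesium_like * 1e9:.0f} ns per month")
    print(f"heart-like standard:  ~{heart_like / 3600:.0f} hours per month")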

Constructivists may respond by claiming that, for contingent historical reasons, metrologists have not (yet) mastered reliable control over human hearts as they have over cesium atoms. This is a historical fact about humans, not about hearts or cesium atoms.

However, even if this claim is granted, it offers no explanation for the difference in long-term stability but only admits the lack of such an explanation. Another possibility is for constructivists to relax the reification thesis, and claim that metrologists do detect preexisting regularities in the behavior of their instruments, but that such regularities do not sufficiently explain how networks of standards are stabilized. Under this ‘moderate’ reification thesis, constructivists admit that a combination of natural and socio-technological explanans is required for the stability of metrological networks. The question then arises as to how the two sorts of explanans should be combined into a single explanatory account. The following section will provide such an account.

88 A similar imaginary exercise is proposed by Carnap ([1966] 1995), pp. 80-84.

3.4. Models and coordination

3.4.1. A third alternative

As we have seen, conventionalists and constructivists agree that claims concerning frequency stability are neither true nor false independent of human agency, but disagree about the scope and limits of this agency. Conventionalists believe that human agency is limited to an a priori freedom to define standards of uniformity. For example, the statement: ‘under specified conditions, the cesium transition frequency is constant’ is a definition of frequency constancy. Once a choice of definition is made, stabilization is a matter of discovering which clocks agree more closely with the chosen definition and improving those clocks that do not agree closely enough. Hence the claim: ‘during period T1…T2, clock X ticked at a constant frequency relative to the current definition of uniformity’ is understood as an empirical claim whose truth or falsity cannot be modified by metrologists.

Constructivists argue instead that judgments about frequency stability cannot be abstracted away from the concrete context in which they are made. Claims to frequency stability are true or false only relative to a particular act of comparison among clocks, made at a particular time and location in an ever changing network of instruments, protocols and calculations. As evidenced in detail above, the metrological network of timekeeping standards is continually rebalanced in light of considerations that have little or nothing to do with the theoretical definition of uniformity. Quite apart from any ideal definition, de facto notions of uniformity are multiple and in flux, being constantly modified through the actions of standardization institutions. If claims to frequency stability appear universal and context-free, it is only because they rely on metrological networks that have already been successfully stabilized and ‘black-boxed’ so as to conceal their historicity.

In an attempt to reconcile the two views, one may be tempted to simply juxtapose their explanans. One would adopt a conventionalist viewpoint to explain the long-term stability of networks of standards and a constructivist viewpoint to explain short-term stability. But such juxtaposition would be incoherent, because the two viewpoints make contradictory claims. As already mentioned, constructivists like Latour and Schaffer reject the very idea of pre-existing natural regularity, an idea that lies at the heart of conventionalist explanations of stability. Any attempt to use elements of both views without reconciling their fundamental tension can only provide an illusion of explanation.

The philosophical challenge, then, is to clarify exactly how constructivism can be ‘naturalized’ and conventionalism ‘socialized’ in a manner that explains both long- and short-term stability. Meeting this challenge requires developing a subtler notion of natural regularity than either view offers.

The model-based account of standardization that I will now propose does exactly that. It borrows elements from both conventionalism and constructivism while modifying their assumptions about the sources of regularity in both nature and society. As I will argue, this account successfully explains both the long- and short-term stability of metrological networks without involving contradictory suppositions.

The model-based account may be summarized by the following four claims:

(i) The proper way to apply a theoretical concept (e.g. the concept of uniformity of time) depends not only on its definition but also on the way concrete instruments are modeled in terms of that concept both theoretically and statistically;

(ii) Metrologists are to some extent free to influence the proper mode of application of the concepts they standardize, not only through acts of definition, but also by adjusting networks of instruments and by modifying their models of these instruments;

(iii) Metrologists exercise this freedom by continually shifting the proper mode of application of the concepts they standardize so as to maximize the stability of their networks of standards;

(iv) In the process of maximizing stability, metrologists discover and exploit empirical regularities in the behavior of their instruments.

In what follows I shall argue for each of these four claims and illustrate them in the special case of contemporary timekeeping. In so doing I will show that the model-based approach does a better job than the previous two alternatives at explaining the stability of metrological standards.

3.4.2. Mediation, legislation, and models

The central goal of standardizing a theoretical concept, according to the model-based approach, is to regulate the application of the concept to concrete particulars. A standardization project is successful when the application of the concept is universally consistent and independent of factors that are deemed local or irrelevant. In conventionalist jargon, standardization projects ‘coordinate’ a theoretical concept to exemplary particulars.

But in the model-based approach such coordination is not exhausted by arbitrary acts of definition. If coordination amounted to a kind of stipulative act as Reichenbach believed, the correct way to apply theoretical concepts to concrete particulars would be completely determinate once this stipulation is given. This is clearly not the case. Consider the application of the concept of terrestrial time to a concrete cesium clock: the former is a highly abstract concept, namely the timescale defined by the ‘ticks’ of a perfectly accurate cesium clock on the ideal surface of the rotating geoid; the latter is a machine exhibiting a myriad of imperfections relative to the theoretical ideal. How is one to apply the notion of terrestrial time to the concrete clock, namely, decide how closely the concrete clock ‘ticks’ relative to the ideal terrestrial timescale? The definition of terrestrial time offers a useful starting point, but on its own is far too abstract to specify a method for evaluating the accuracy of any clock. Considerable detail concerning the design and environment of the concrete clock must be added to the definition before the abstract concept can be determinately applied to evaluate the accuracy of that clock89.

This adding of detail amounts, in effect, to the construction of a hierarchy of models of concrete clocks at differing levels of abstraction. At the highest level of this hierarchy we find the theoretical model of an unperturbed cesium atom on the geoid. As mentioned, this model defines the notion of terrestrial time, the theoretical timescale that is realized by Coordinated Universal Time.

At the very bottom of this hierarchy lie the most detailed and specific models metrologists construct of their apparatus. These models typically represent the various systematic effects and statistical fluctuations influencing a particular ensemble of atomic clocks housed in one standardization laboratory. These models are used for the calculation of local approximations to UTC.

Mediating between these levels is a third model, perhaps more aptly termed a cluster of theoretical and statistical models, grounding the calculation of UTC itself. The models in this cluster are abstract and idealized representations of various aspects of the clocks that contribute to UTC and their environments. Among these models, for example, are several statistical models of noise (e.g. white noise, flicker noise and Brownian noise) as well as simplified representations of the properties of individual clocks (weights, ‘filters’) and properties of the ensemble as a whole (‘cushion’ terms, ‘memory’ terms.) Values of the parameter called ‘Coordinated Universal Time’ are determined by analyzing clock data from the past month in light of the assumptions of models in this cluster.

89 As I have shown in Chapter 1, the accuracy of measurement standards can only be evaluated once the definition of the concept being standardized is sufficiently de-idealized.

It is to this parameter, ‘Coordinated Universal Time’, that the concept of terrestrial time is directly coordinated, rather than to any concrete clock90. Like Reichenbach, I am using the term ‘coordination’ here to denote an act that specifies the mode of application of an abstract theoretical concept. But the form that coordination takes in the model-based approach is quite different than what classical conventionalists have envisioned. Instead of directly linking concepts with objects (or operations), coordination consists in the specification of a hierarchy among parameters in different models. In our case, the hierarchy links a parameter (terrestrial time) in a highly abstract and simplified theoretical model of the earth’s spacetime to a parameter (UTC) in a less abstract, theoretical-statistical cluster of models of certain atomic clocks. UTC is in turn coordinated to a myriad of parameters (UTC(k)) representing local approximations of UTC by even more detailed, lower-level models.

Finally, the particular clocks that standardize terrestrial time are subsumed under the lowest-level models in the hierarchy. I am using the term ‘subsumed under’ rather than ‘described by’ because the accuracy of a concrete clock is evaluated against the relevant low-level model and not the other way around. This is an inversion of the usual way of thinking about approximation relations. In most types of scientific inquiry abstract models are meant to approximate their concrete target systems. But the models constructed during standardization projects have a special normative function, that of legislating the mode of application of concepts to concrete particulars. Indeed, standardization is precisely the legislation of a proper mode of application for a concept through the specification of a hierarchy of models. At each level of abstraction, the models specify what counts as an accurate application of the standardized concept at the level below.

90 More exactly, the concept of terrestrial time is directly coordinated to TAI, i.e. to UTC prior to the addition of ‘leap seconds’ (see the discussion on ‘leap second’ in Section 3.2.5.)

Figure 3.1: A simplified hierarchy of approximations among model parameters in contemporary timekeeping. Vertical position on the diagram denotes level of abstraction and arrows denote approximation relations. Note that concrete levels approximate abstract ones.

Consequently, the chain of approximations (or ‘realizations’) runs upwards in the hierarchy rather than downwards: concrete clocks approximate local estimates of UTC, which in turn approximate UTC as calculated by the International Bureau, which in turn approximates the ideal timescale known as terrestrial time. Figure 3.1 summarizes the various levels of abstraction and relations of approximation involved in contemporary atomic timekeeping.
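
A minimal sketch of this architecture (in Python, with invented level names and offsets; the numbers carry no metrological significance) may help fix the direction of evaluation: each level is assessed as an approximation of the level above it, never the reverse.

    # Toy representation of the timekeeping hierarchy; offsets are invented
    # illustrative values (nanoseconds) of each level from the level above.
    hierarchy = [
        ("terrestrial time (ideal definition)", None),
        ("UTC (BIPM model cluster)", 30.0),
        ("UTC(k) (local laboratory model)", 8.0),
        ("concrete cesium clock", 15.0),
    ]

    for (name, offset), (level_above, _) in zip(hierarchy[1:], hierarchy[:-1]):
        print(f"{name} is evaluated as an approximation of {level_above}: "
              f"assumed offset {offset} ns")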

The inversion of approximation relations explains why metrologists deal with discrepancies in the short run by adjusting clocks rather than by modifying the algorithm that calculates UTC. If UTC were an experimental best-fit to clock indications, the practice of correcting and excluding clocks would be suspected of question-begging. However, the goal of the calculation is not to approximate clock readings, but to legislate the way in which those readings should be corrected relative to the concept being standardized, namely uniform time on the geoid (i.e. terrestrial time). The next subsection will clarify why metrologists are free to perform such legislation.

3.4.3. Coordinative freedom

Equipped with a more nuanced account of coordination than that offered by conventionalists, we can now proceed to examine how metrological practices influence the mode of application of concepts. Conventionalists, recall, took the freedom involved in coordination to be a priori, in principle and singular. According to the model-based account, metrologists who standardize concepts enjoy a different sort of freedom, one that is empirically constrained and practically exercised in an ongoing manner. Specifically, metrologists are to some extent free to decide not only how they define an ideal measurement of the quantity they are standardizing, but also what counts as an accurate concrete approximation (or ‘realization’) of this ideal.

The freedom to choose what counts as an accurate approximation of a theoretical ideal is special to metrology. It stems from the fact that, in the context of a standardization project, the distribution of errors among different realizations of the quantity being standardized is not completely determinate. Until metrologists standardize a quantity-concept, its mode of application remains partially vague, i.e. some ambiguity surrounds the proper way of evaluating errors associated with measurements of that quantity. Indeed, in the absence of such ambiguity standardization projects would be not only unnecessary but impossible. Nevertheless, ambiguity of this sort cannot be dissolved simply by making more measurements, as a determinate standard for judging what would count as a measurement error is the very thing metrologists are trying to establish. This problem of indeterminacy is illustrated most clearly in the case of systematic error91.

The inherent ambiguity surrounding the distribution of errors in the context of standardization projects leaves metrologists with some freedom to decide how to distribute errors among multiple realizations of the same quantity. Consequently, metrologists enjoy some freedom in deciding how to construct models that specify what counts as an ideal measurement of the quantity they are standardizing in some local context. Concrete instruments are then subsumed under these idealized models, and errors are evaluated relative to the chosen ideal.

Metrologists make use of this freedom to fit the mode of application of the concept to the goals of the particular standardization project at hand. In some cases such goals may be ‘properly’ cognitive, e.g. the reduction of uncertainty, a goal which dominates choices of primary frequency realizations. But in general there is no restriction on the sort of goals that may inform choices of realization, and they may include economic, technological and political considerations.

91 For a detailed argument to this effect see Chapter 2 of this thesis, “Systematic Error and the Problem of Quantity Individuation.”

The freedom to represent and distribute errors in accordance with local and pragmatic goals explains why metrologists allow themselves to introduce seemingly self-fulfilling mechanisms to stabilize UTC. Rather than ask: ‘how well does this clock approximate terrestrial time?’ metrologists are, to a limited extent, free to ask: ‘which models should we use to apply the concept of terrestrial time to this clock?’ In answering the second question metrologists enjoy some interpretive leeway, which they use to maximize the short-term stability of their clock ensemble. This is precisely the role of the algorithmic mechanisms discussed above. These self-stabilizing mechanisms do not require justification for their ability to approximate terrestrial time because they are legislative with respect to the application of the concept of terrestrial time to begin with. UTC is successfully stabilized in the short run not because its calculation correctly applies the concept of terrestrial time to secondary standards; rather, UTC is chosen to determine what counts as a correct application of the concept of terrestrial time to secondary standards because this choice results in greater short-term stability. Contrary to conventionalist explanations of stability, then, the short-term stability of UTC cannot be fully explained by the presence of an independently detectable regularity in the data from individual clocks. Instead, a complete explanation must non-reducibly appeal to stabilizing policies adopted by metrological institutions. These policies are designed in part to promote a socially optimal compromise among those institutions.

Coordination is nonetheless not arbitrary. The sort of freedom metrologists exercise in standardizing quantity concepts is quite different than the sort of freedom typically associated with arbitrary definition. As the recurring qualification ‘to some extent’ in the discussion above hints, the freedom exercised by metrologists in practice is severely, though not completely, constrained by empirical considerations. First, the quantity concepts being standardized are not ‘free-floating’ concepts but are already embedded in a web of assumptions. Terrestrial time, for example, is a notion that is already deeply saturated with assumptions from general relativity, atomic theory, electromagnetic theory and quantum mechanics. The task of standardizing terrestrial time in a consistent manner is therefore constrained by the need to maintain compatibility with established standards for other quantities that feature in these theories. Second, terrestrial time may be approximated in more than one way. The question ‘how well does clock X approximate terrestrial time?’ is therefore still largely an empirical question even in the context of a standardization project. It can be answered to a good degree of accuracy by comparing the outcomes of clock X with other approximations of terrestrial time. Such approximations rely on post-processed data from primary cesium standards or on astronomical time measurements derived from the observation of pulsars. But these approximations of terrestrial time do not completely agree with one another. More generally, different applications of the same concept to different domains, or in light of a different trade-off between goals, often end up being somewhat discrepant in their results. Standardization institutes continually manage a delicate balance between the extent of legislative freedom they allow themselves in applying concepts and the inevitable gaps discovered among multiple applications of the same concept. Nothing exemplifies better the shifting attitudes of the BIPM towards this trade-off than the history of ‘steering’ corrections, which have been dispensed aggressively or smoothly over the past decades depending on whether accuracy or stability was preferred.

The gaps discovered between different applications of the same quantity-concept are among the most important (though by no means the only) pieces of empirical knowledge amassed by standardization projects. Such gaps constitute empirical discoveries concerning the existence or absence of regularities in the behavior of instruments, and not merely discoveries about the way metrologists use their concepts. This is a crucial point, as failing to appreciate it risks mistaking standardization projects for exercises in the social regulation of data-analysis practices. Even if metrologists reached perfect consensus as to how they apply a given quantity concept, there is no guarantee that the application they have chosen will lead to consistent results. Success and failure in applying a quantity concept consistently are to be investigated empirically, and the discovery of gaps (or their absence) is accordingly a matter of obtaining genuine empirical knowledge about regularities in nature.

The discovery of gaps explains the possibility of stabilizing networks of standards in the long run. Metrologists choose to use as standards those instruments to which they have managed to apply the relevant concept most consistently, i.e. with the smallest gaps. To return to the example above, metrologists have succeeded in applying the concept of temporal uniformity to different cesium atoms with much smaller gaps than to different heart rates. This is not only a fact about the way metrologists apply the concept of uniformity, but also about a natural regularity in the behavior of cesium atoms, a regularity that is discovered when cesium clocks are subsumed under the concept of uniformity through the mediation of relevant models. Metrologists rely on such regularities for their choices of physical standards, i.e. they tend to select those instruments whose behavior requires the smallest and least frequent ad hoc corrections. Moreover, as standardization projects progress, metrologists often find new theoretical and statistical means of predicting some of the gaps that remain, thereby discovering ever ‘tighter’ regularities in the behavior of their instruments.

The notion of empirical regularity employed by the model-based account differs from the empiricist one adopted by classical conventionalists. Conventionalists equated regularity with a repeatable relation among observations. Carnap, for example, identified regularity in the behavior of pendulums with the constancy of the ratio between the number of swings they produce ([1966] 1995, 82). This naive empiricist notion of regularity pertains to the indications of instruments. By contrast, my notion of regularity pertains to measurement outcomes, i.e. to estimates that have already been corrected in light of theoretical and statistical assumptions92. The behavior of measuring instruments is deemed regular relative to some set of modeling assumptions insofar as their outcomes are predictable under those assumptions.

Prior to the specification of modeling assumptions there can be no talk of regularities, because such assumptions are necessary for forming expectations about which configuration of indications would count as regular. Hence modeling assumptions are strongly constitutive of empirical regularities in my sense of the term. At the same time, regularities are still empirical, as their existence depends on which indications instruments actually produce. Empirical regularities, in other words, are co-produced by observations as well as the assumptions with which a scientific community interprets those observations.
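
The assumption-relativity of regularity can be illustrated with a small computational sketch (the indication series and the two candidate models are invented for illustration): the very same indications count as regular under one set of modeling assumptions and as irregular under another, because ‘regular’ here means ‘predictable given the model’.

    import numpy as np

    # Invented indication series: a clock whose readings drift linearly in time,
    # plus small scatter. Nothing here is real data.
    t = np.arange(10.0)
    indications = 5.0 + 0.3 * t + np.array(
        [0.01, -0.02, 0.00, 0.02, -0.01, 0.01, 0.00, -0.02, 0.01, 0.00])

    def residual_spread(predictions):
        """Spread of the indications around a model's predictions."""
        return np.std(indications - predictions)

    # Under a 'constant value' model the behaviour looks highly irregular...
    constant_model = np.full_like(t, indications.mean())
    # ...while under a 'linear drift' model the same data look highly regular.
    drift_model = np.polyval(np.polyfit(t, indications, 1), t)

    print("spread under constant-value assumptions:", residual_spread(constant_model))
    print("spread under linear-drift assumptions:  ", residual_spread(drift_model))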

This Kantian-flavored, dual-source conception of regularity explains the possibility of legislating to nature the conditions under which time intervals are deemed equal. Recall that acts of legislation determine not only how concepts are applied, but also which configurations of observations count as regular. For example, which clocks ‘tick’ closer to the natural frequency of the cesium transition depends on which rules metrologists choose to follow in applying the concept of natural uniformity93. This is not meant to deny that there may be mind-independent facts about the frequency stability of clocks, but merely to acknowledge that such mind-independent facts, if they exist, play no role in grounding knowledge claims about frequency stability. Indeed, the standardization of terrestrial time would be impossible were metrologists required to obtain such facts, which pertain to ideal and experimentally inaccessible conditions. From the point of view of the model-based account, by contrast, there is nothing problematic about this inaccessibility, as the application of a concept does not require satisfying its theoretical definition verbatim. Instead, metrologists have a limited but genuine authority to legislate empirical regularities to their observations, and hence to decide which approximations of the definition are closer than others, despite not having experimental access to the theoretical ideal.

92 My analysis of the notion of empirical regularity is therefore similar to my analysis of the notion of agreement discussed in Chapter 2.

93 Kant would have disagreed with this last statement, as he took time to be a universal form of intuition and the synthesis of temporal relations to be governed by universal schemata regardless of one’s theoretical suppositions. The inspiration I draw from Kant does not imply a wholesale adoption of his philosophy.

3.5. Conclusions

This chapter has argued that the stability of the worldwide consensus around Coordinated Universal Time cannot be fully explained by reduction to either the natural regularity of atomic clocks or the consensus-building policies enforced by standardization institutes. Instead, both sorts of explanans dovetail through an ongoing modeling activity performed by metrologists. Standardization projects involve an iterative exchange between ‘top-down’ adjustment to the mode of application of concepts and ‘bottom-up’ discovery of inconsistencies in light of this application94.

94 This double-sided methodological configuration is an example of Hasok Chang’s (2004, 224-8) ‘epistemic iterations.’ It is also reminiscent of Andrew Pickering’s (1995, 22) patterns of ‘resistance and accommodation’, with the important difference that Pickering does not seem to ascribe his ‘resistances’ to underlying natural regularities.

This bidirectional exchange results in greater stability as it allows metrologists to latch onto underlying regularities in the behavior of their instruments while redistributing errors in a socially optimal manner. When modeling the behavior of their clocks, metrologists are to some extent free to decide which behaviors count as naturally regular, a freedom which they use to maximize the efficiency of a social compromise among standardizing institutions. The need for effective social compromise is therefore one of the factors that determine the empirical content of the concept of a uniformly ‘ticking’ clock. On the other hand, the need for consistent application of this concept is one of the factors that determine which social compromise is most effective. The model-based account therefore combines the conventionalist claim that congruity is a description-relative notion with the constructivist emphases on the local, material and historical contexts of scientific knowledge.

The Epistemology of Measurement: A Model-Based Account

4. Calibration: Modeling the Measurement Process

Abstract: I argue that calibration is a special sort of modeling activity, namely the activity of constructing, testing and deriving predictions from theoretical and statistical models of a measurement process. Measurement uncertainty is accordingly a special sort of predictive uncertainty, namely the uncertainty involved in predicting the outcomes of a measurement process based on such models. I clarify how calibration establishes the accuracy of measurement outcomes and the role played by measurement standards in this procedure. Contrary to currently held views, I show that establishing a correlation between instrument indications and standard quantity values is neither necessary nor sufficient for successful calibration.

4.1. Introduction

A central part of measuring is evaluating accuracy. A measurement outcome that is not accompanied by an estimate of accuracy is uninformative and hence useless. Even when a value range or standard uncertainty is not explicitly reported with a measurement outcome, a rough accuracy estimate is implied by the practice of recording only ‘meaningful digits’. And yet the requirement to evaluate accuracy gives rise to an epistemological conundrum, which I have called ‘the problem of accuracy’ in the introduction to this thesis. The problem arises because the exact values of most physical quantities are unknowable. Quantities such as length, duration and temperature, insofar as they are represented by non-integer (e.g. rational or real) numbers, are impossible to measure with certainty. The accuracy of measurements of such quantities cannot, therefore, be evaluated by reference to exact values but only by comparing uncertain estimates to each other. When comparing two uncertain estimates of the same quantity it is impossible to tell exactly how much of the difference between them is due to the inaccuracy of either estimate. Multiple ways of distributing errors among the two estimates are consistent with the data. The problem of accuracy, then, is an underdetermination problem: the available evidence is insufficient for grounding claims about the accuracy of any measurement outcome in isolation, independently of the accuracies of other measurements95.
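
A toy numerical illustration of this underdetermination (the values are invented): two estimates of the same length differ by 0.30 mm, and the comparison alone cannot say how that discrepancy splits into the two errors.

    # Two hypothetical estimates of the same quantity, in millimetres.
    estimate_a, estimate_b = 10.15, 10.45
    discrepancy = estimate_b - estimate_a  # 0.30 mm

    # If the (unknown) true value is x, then estimate_a = x + error_a and
    # estimate_b = x + error_b, so only error_b - error_a is fixed by the data.
    for error_a in (-0.30, -0.15, 0.00, 0.10):
        error_b = error_a + discrepancy
        print(f"error of A = {error_a:+.2f} mm, error of B = {error_b:+.2f} mm: "
              f"equally consistent with the observed difference")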

One attempt to solve this problem which I have already discussed is to adopt a conventionalist approach to accuracy. Mach ([1896] 1966) and later Carnap ([1966] 1995) and Ellis (1966) thought that the problem of accuracy could be solved by arbitrarily selecting a measuring procedure as a standard. The accuracies of other measuring procedures are then evaluated against the standard, which is considered completely accurate. The disadvantages of the conventionalist approach to accuracy have already been explored at length in the previous chapters. As I have shown, measurement standards are necessarily inaccurate to some extent, because the definitions of the quantities they standardize necessarily involve some idealization96. Moreover, the inaccuracies associated with measurement standards are themselves evaluated by mutual comparisons among standards, a fact that further accentuates the problem of accuracy.

95 The problem of accuracy can be formulated in other ways, e.g. as a regress or circularity problem rather than an underdetermination problem. In the regress formulation, the accuracy of a set of estimates is established by appealing to the accuracy of yet another estimate, etc. In the circularity formulation, the accuracy of one estimate is established by appealing to the accuracy of a second estimate, whose accuracy is in turn established by appeal to the accuracy of the first. All of these formulations point to the same underlying problem, namely the insufficiency of comparisons among uncertain estimates for determining accuracy. I prefer the underdetermination formulation because it makes it easiest to see why auxiliary assumptions about the measuring process can help solve the problem.

In Chapter 1 I provided a solution to the problem of accuracy in the special case of primary measurement standards. I showed that a robustness test performed among the uncertainties ascribed to multiple standards provides sufficient grounds for making accuracy claims about those standards. The task of the current chapter is to generalize this solution to any measuring procedure, and to explain how the methods actually employed in physical metrology accomplish this solution. Specifically, my aim will be to clarify how the various activities that fall under the title ‘calibration’ support claims to measurement accuracy.

At first glance this task may appear simple. It is commonly thought that calibration is the activity of establishing a correlation between the indications of a measuring instrument and a standard. Marcel Boumans, for example, states that “A measuring instrument is validated if it has been shown to yield numerical values that correspond to those of some numerical assignments under certain standard conditions. This is also called calibration […].” (2007, 236). I have already shown that there is good reason to think that primary measurement standards are accurate up to their stated uncertainties. Is it not obvious that calibration, which establishes a correlation with standard values, thereby also establishes the accuracy of measuring instruments?

96 Even when the definition of a unit refers to a concrete object such as the Prototype Kilogram, the specification of a standard measuring procedure still involves implicit idealizations, such as the possibility of creating perfect copies of the Prototype and the possibility of constructing perfect balances to compare the mass of the Prototype to those of other objects.

This seemingly straightforward way of thinking about calibration neglects a more fundamental epistemological challenge, namely the challenge of clarifying the importance of standards for calibration in the first place. Given that the procedures called ‘standards’ are to some extent inaccurate, and given that some measuring procedures are more accurate than the current standard (as shown in Chapter 1), why should one calibrate instruments against metrological standards rather than against any other sufficiently accurate measuring procedure?

In what follows I will show that establishing a correlation between instrument indications and standard values is neither necessary nor sufficient in general for successful calibration. The ultimate goal of calibration is not to establish a correlation with a standard, but to accurately predict the outcomes of a measuring procedure. Comparison to a standard is but one method for generating such predictions, a method that is not always required and is often inaccurate by itself. Indeed, only in the simplest and most inaccurate case of calibration (‘black-box’ calibration) is predictability achieved simply by establishing empirical correlations between instrument indications and standard values. A common source of misconceptions about calibration is that this simplest form of calibration is mistakenly thought to be representative of the general case. The opposite is true: ‘black-box’ calibration is but a special case of a much more complex way of representing measuring instruments that involves detailed theoretical and statistical considerations.

As I will argue, calibration is a special sort of modeling activity, one in which the system being modeled is a measurement process. I propose to view calibration as a modeling activity in the full-blown sense of the term ‘modeling’, i.e. constructing an abstract and idealized representation of a system from theoretical and statistical assumptions and using this representation to explain and predict that system’s behaviour97.

I will begin by surveying the products of calibration as explicated in the metrological literature (Section 4.2) and distinguish between two calibration methodologies that, following Boumans (2006), I call ‘black-box’ and ‘white-box’ calibration (Sections 4.3 and 4.4). I will show that white-box calibration is the more general of the two, and that it is aimed at predicting measurement outcomes rather than mapping indications to standard values. Section 4.5 will then discuss the role of metrological standards in calibration and clarify the conditions under which their use contributes to the accurate prediction of measurement outcomes. Finally, Section 4.6 will explain how the accuracy of measurement outcomes is evaluated on the basis of the model-based predictions produced during calibration.

4.2. The products of calibration

4.2.1. Metrological definition

The International Vocabulary of Metrology (VIM) defines calibration in the following way:

Calibration: operation that, under specified conditions, in a first step, establishes a relation between the quantity values with measurement uncertainties provided by measurement standards and corresponding indications with associated measurement uncertainties and, in a second step, uses this information to establish a relation for obtaining a measurement result from an indication. (JCGM 2008, 2.39)

97 See also Mari 2005.

This definition is functional, that is, it characterizes calibration through its products. Two products are mentioned in the definition, one intermediary and one final. The final product of calibration operations is “a relation for obtaining a measurement result from an indication”, whereas the intermediary product is “a relation between the quantity values […] provided by measurement standards and corresponding indications”. Calibration therefore produces knowledge about certain relations. My aim in this section will be to explicate these relations and their relata. The following three sections will then provide a methodological characterization of calibration, namely, a description of several common strategies by which metrologists establish these relations. In each case I will show that the final product of calibration – a relation for obtaining a measurement result from an indication – is established by making model-based predictions about the measurement process. This methodological characterization will in turn set the stage for the epistemological analysis of calibration in the last section.

4.2.2. Indications vs. outcomes

The first step in elucidating the products of calibration is to distinguish between measurement outcomes (or ‘results’) and instrument indications, a distinction previously discussed in Chapter 2. To recapitulate, an indication is a property of the measuring instrument in its final state after the measurement process is complete. Examples of indications are the numerals appearing on the display of a digital clock, the position of an ammeter pointer relative to a dial, and the pattern of diffraction produced in x-ray crystallography. Note that the term ‘indication’ in the context of the current discussion carries no normative connotation. It does not presuppose reliability or success in indicating anything, but only an intention to use such outputs for reliable indication of some property of the sample being measured. Note also that indications are not numbers: they may be symbols, visual patterns, acoustic signals, relative spatial or temporal positions, or any other sort of instrument output. However, indications are often represented by mapping them onto numbers, e.g. the number of ‘ticks’ the clock generated in a given period, the displacement of the pointer relative to the ammeter dial, or the spatial density of diffraction fringes. These numbers, which may be called ‘processed indications’, are convenient representations of indications in mathematical form98. A processed indication is not yet an estimate of any physical quantity of the sample being measured, but only a mathematical description of a state of the measuring apparatus.

A measurement outcome, by contrast, is an estimate of a quantity value associated with the object being measured, an estimate that is inferred from one or more indications. Outcomes are expressed in terms of a particular unit on a particular scale and include, either implicitly or explicitly, an estimate of uncertainty. Respective examples of measurement outcomes are an estimate of duration in seconds, an estimate of electric current in Ampere, and an estimate of distance between crystal layers in nanometers. Very often measurement outcomes are recorded in the form of a mean value and a standard deviation that represents the uncertainty around the mean, but other forms are commonly used, e.g. a min-max value range.

98 The difference between numbers and numerals is important here. Before processing, an indication is never a number, though it may be a numeral (i.e. a symbol representing a number).

To attain the status of a measurement outcome, an estimate must be abstracted away from its concrete method of production and pertain to some quantity objectively, namely, be attributable to the measured object rather than the idiosyncrasies of the measuring instrument, environment and human operators. Consider the ammeter: the outcome of measuring with an ammeter is an estimate of the electric current running through the input wire. The position of the ammeter pointer relative to the dial is a property of the ammeter rather than the wire, and is therefore not a candidate for a measurement outcome. This is the case whether or not the position of the pointer is represented on a numerical scale. It is only once theoretical and statistical background assumptions are made and tested about the behaviour of the ammeter and its relationship with the wire (and other elements in its environment) that one can infer estimates of electric current from the position of the pointer. The ultimate aim of calibration is to validate such inferences and characterize their uncertainty.
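
The distinction can be summarized in a small schematic sketch (the class names, the linear dial-to-current coefficient, and the uncertainty figure are hypothetical, chosen to echo the ammeter example): an indication describes the instrument's final state, while an outcome is an estimate attributed to the measured object, inferred from the indication together with tested background assumptions.

    from dataclasses import dataclass

    @dataclass
    class Indication:
        """A (processed) description of the instrument's final state."""
        description: str        # e.g. 'pointer between the 0.40 and 0.41 marks'
        processed_value: float  # a number describing the instrument, not the wire

    @dataclass
    class Outcome:
        """An estimate attributed to the measured object, with uncertainty."""
        value: float
        uncertainty: float
        unit: str

    def infer_outcome(indication: Indication) -> Outcome:
        # Hypothetical background assumptions: the dial reading maps linearly to
        # current, and the residual scatter was characterized during calibration.
        DIAL_TO_AMPERE = 1.0             # assumed calibration coefficient
        CHARACTERIZED_UNCERTAINTY = 0.005
        return Outcome(indication.processed_value * DIAL_TO_AMPERE,
                       CHARACTERIZED_UNCERTAINTY, "A")

    reading = Indication("pointer between the 0.40 and 0.41 marks", 0.405)
    print(infer_outcome(reading))  # Outcome(value=0.405, uncertainty=0.005, unit='A')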

Processed indications are easily confused with measurement outcomes partly because many instruments are intentionally designed to conceal their difference. Direct-reading instruments, e.g. household mercury thermometers, are designed so that the numeral that appears on their display already represents the best estimate of the quantity of interest on a familiar scale. The complex inferences involved in arriving at a measurement outcome from an indication are ‘black-boxed’ into such instruments, making it unnecessary for users to infer the outcome themselves99. Regardless of whether or not users are aware of them, such inferences form an essential part of measuring. They link claims such as ‘the pointer is between the 0.40 and 0.41 marks on the dial’ to claims like ‘the current in the wire is 0.405±0.005 Ampere’. If such inferences are to be deemed reliable, they must be grounded in tested assumptions about the behaviour of the instrument and its interactions with the sample and the environment.

99 Somewhat confusingly, the process of ‘black-boxing’ is itself sometimes called ‘calibration’. For example, the setting of the null indication of a household scale to the zero mark is sometimes referred to as ‘calibration’. From a metrological viewpoint, this terminological confusion is to be avoided: “Adjustment of a measuring system should not be confused with calibration, which is a prerequisite for adjustment” (JCGM 2008, 3.11, Note 2). Calibration operations establish a relation between indications and outcomes, and this relation may later be expressed in a simpler manner by adjusting the display of the instrument.

4.2.3. Forward and backward calibration functions

The distinction between indications and outcomes allows us to clarify the two products of calibration mentioned in the definition above. The intermediary product, recall, is “a relation between the quantity values with measurement uncertainties provided by measurement standards and corresponding indications with associated measurement uncertainties”. This relation may be expressed in the form of a function, which I will call the ‘forward calibration function’:

I0 = fFC (O , I1 , … In) (4.1)

The forward calibration function maps values of the quantity to be measured – e.g. the current in the wire – to instrument indications, e.g. the position of the ammeter pointer100. The forward calibration function may include input variables representing additional quantities that may influence the indication of the instrument – for example, the intensity of background magnetic fields in the vicinity of the ammeter. The goal of the first step of calibration is to arrive at a forward calibration function and characterize the uncertainties associated with its outputs, i.e. the instrument’s indications. This involves making theoretical and statistical assumptions about the measurement process and empirically testing the consequences of these assumptions, as we shall see below.

100 The term ‘calibration function’ (also ‘calibration curve’, see JCGM 2008, 4.31) is commonly used in metrological literature, whereas the designations ‘forward’ and ‘backward’ are my own. I call this a ‘forward’ function because its input values are normally understood as already having a determinate value prior to measurement and as determining its output value through a causal process. Nevertheless, my account of calibration does not presuppose this classical picture of measurement, and is compatible with the possibility that the quantity being measured does not have a determinate value prior to its measurement.

The second and final step of calibration is aimed at establishing “a relation for obtaining a measurement result from an indication.” This relation may again be expressed in the form of a function, which may be called the ‘backward calibration function’ or simply ‘calibration function’:

O = fC (I0 , I1 , … In) (4.2)

A calibration function maps instrument indications to values of the quantity being measured, i.e. to measurement outcomes. Like the forward function, the calibration function may include additional input variables whose values affect the relation between indications and outcomes. In the simplest (‘black-box’) calibration procedures additional input parameters are neglected, and a calibration function is obtained by simply inverting the forward function. Other (‘white-box’) calibration procedures represent the measurement process in more detail, and the derivation of the calibration function becomes more complex. Once a calibration function is established, metrologists use it to associate values of the quantity being measured with indications of the instrument.

So far I have discussed the products of calibration without explaining how they are produced. My methodological analysis of calibration will proceed in three stages, starting with the simplest method of calibration and gradually increasing in complexity. The cases of calibration I will consider are:

1. Black-box calibration against a standard whose uncertainty is negligible

2. White-box calibration against a standard whose uncertainty is negligible

3. White-box calibration against a standard whose uncertainty is non-negligible (‘two-way white-box’ calibration)

In each case I will show that the products of calibration are obtained by constructing models of the measurement process, testing the consequences of these models and deriving predictions from them. Viewing calibration as a modeling activity will in turn provide the key to understanding how calibration establishes the accuracy of measurement outcomes.

4.3. Black-box calibration

In the most rudimentary case of calibration, the measuring instrument is treated as a ‘black-box’, i.e. as a simple input-output unit. The inner workings of the instrument and the various ways it interacts with the sample, environment and human operators are either neglected or drastically simplified. Establishing a calibration function is then a matter of establishing a correlation between the instrument’s indications and corresponding quantity values associated with a measurement standard.

For example, a simple caliper may be represented as a ‘black-box’ that converts the diameter of an object placed between its legs to a numerical reading. The caliper is calibrated by concatenating gauge blocks – metallic bars of known length – between the legs of the caliper. We can start by assuming, for the time being, that the uncertainties associated with the length of these standard blocks are negligible relative to those associated with the outcomes of the caliper measurement. Calibration then amounts to a behavioural test of the instrument under variations to the standard sample. The indications of the caliper are recorded for different known lengths and a curve is fitted to the data points based on background assumptions about how the caliper is expected to behave. The resulting forward calibration function is of the form:

I0 = fFC (O) (4.3)

This function maps the lengths (O) associated with a combination of gauge blocks to the indications of the caliper (I0). Notice that despite the simplicity of this operation, some basic theoretical and statistical assumptions are involved. First, the shape chosen for fFC depends on assumptions about the way the caliper converts lengths to indications. Second, the use of gauge blocks implicitly assumes that length is additive under concatenation operations. These assumptions are theoretical, i.e. they suppose that length enters into certain nomic relations with other quantities or qualities. Third, associating uncertainties with the indications of the caliper requires making one or more statistical assumptions, for example, that the distribution of residual errors is normal. All of these assumptions are idealizations: the response of any real caliper is not exactly linear, the concatenation of imperfect rods is not exactly additive, and the distribution of errors is never exactly normal.

The first step of calibration is meant to test how well these idealizations, when taken together, approximate the actual behaviour of the caliper. If they fit the data closely enough for the needs of the application at hand, these idealized assumptions are then presumed to continue to hold beyond the calibration stage, when the caliper is used to estimate the diameter of non-standard objects. Under a purely behavioural, black-box representation of the caliper, its calibration function is obtained by simple inversion of the forward function:

O = fC (I0) = fFC⁻¹ (I0) (4.4)

The calibration function expresses a hypothetical nomic relation between indications and outcomes, a relation that is derived from a rudimentary theoretical-statistical model of the instrument. This function may now be used to generate predictions concerning the outcomes of caliper measurements. Whenever the caliper produces indication i the diameter of the object between the caliper’s legs is predicted to be o = fC (i). Under this simple representation of the measurement process, the uncertainty associated with measurement outcomes arises wholly from uncontrolled variations to the indications of the instrument.

These variations are usually represented mathematically by applying statistical measures of variation (such as standard deviation) to a series of observed indications. This projection is based on an inductive argument and its precision therefore depends on the number of indications observed during the first step of calibration.
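
A minimal numerical sketch of this procedure (in Python with numpy; the gauge-block lengths, caliper readings, and the assumption of a linear response are all invented for illustration): fit a forward function to standard lengths, invert it, and take the residual scatter as a crude estimate of the outcome uncertainty.

    import numpy as np

    # Invented calibration data: known gauge-block lengths (mm) and the caliper's
    # processed indications for each combination of blocks.
    standard_lengths = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
    indications = np.array([10.02, 20.01, 30.05, 40.03, 50.06])

    # First step: forward calibration function I0 = fFC(O), fitted as a line
    # (the linear shape is itself an idealizing assumption about the caliper).
    slope, intercept = np.polyfit(standard_lengths, indications, 1)
    predicted = slope * standard_lengths + intercept
    residual_sd = np.std(indications - predicted, ddof=2)

    # Second step: backward (calibration) function O = fC(I0) by inversion.
    def calibration_function(indication):
        return (indication - intercept) / slope

    # Using the calibrated caliper on a non-standard object:
    new_indication = 27.44
    outcome = calibration_function(new_indication)
    outcome_uncertainty = residual_sd / slope  # crude projection of the scatter
    print(f"outcome: {outcome:.2f} mm +/- {outcome_uncertainty:.2f} mm")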

Black-box calibration is useful when the behaviour of the device is already well-understood and when the required accuracy is not too high. Because the calibration function takes only one argument, namely the instrument indication (I0), the resulting quantity-value estimate (O) is insensitive to other parameters that may influence the behaviour of the instrument. Such parameters may have to do with interactions among parts of the instrument, the sample being measured, and the environment. They may also have to do with the operation and reading of the instrument by humans, and with the way indications are recorded and processed.

The neglect of these additional factors limits the ability to tell whether, and under what conditions, a black-box calibration function can be expected to yield reliable predictions. As long as the operating conditions of the instrument are sufficiently similar to calibration conditions, one can expect the uncertainties associated with its calibration function to be good estimates of the uncertainty of measurement outcomes. However, black-box calibration represents the instrument too crudely to specify which conditions count as ‘sufficiently similar’. As a result, measurement outcomes generated through black-box calibration are exposed to systematic errors that arise when measurement conditions change.

4.4. White-box calibration

4.4.1. Model construction

White-box calibration procedures represent the measurement process as a collection of modules. This differs from the black-box approach to calibration, which treats the measurement process as a single input/output unit. Each module is characterized by one or more state parameters, laws of temporal evolution, and laws of interaction with other modules. The collection of modules and laws constitutes a more detailed (but still idealized) model of the measurement process than a black-box model.

Typically, a white-box model of a measuring process involves assumptions concerning:

(i) components of the measuring instrument and their mutual interactions;

(ii) the measured sample, including its preparation and interaction with the instrument;

(iii) elements in the environment (‘background effects’) and their interactions with both sample and instrument;

(iv) variability among human operators; and

(v) data recording and processing procedures.

Each of these five aspects may be represented by one or more modules, though not every aspect is represented in every case of white-box calibration.

A white-box representation of a simple caliper measurement is found in Schwenke et al. (2000, 396). Figure 4.1 illustrates the modules and parameters involved. The measuring instrument is represented by the component modules ‘leg’ and ‘scale’; the sample and its interaction with the instrument by the modules ‘workpiece’ and ‘contact’; and the data by the module ‘readout’. The environment is represented only indirectly by its influence on the temperatures of the workpiece and scale, and variability among human operators is completely neglected. Of course, one can easily imagine more or less detailed breakdowns of a caliper into modules than the one offered here. The term ‘white-box’ should be understood as referring to a wide variety of modular representations of the measurement process with differing degrees of complexity, rather than a unique mode of representation¹⁰¹.

101 Simple modular representations are sometimes referred to as ‘grey-box’ models. See Boumans (2006, 121-2).

Figure 4.1: Modules and parameters involved in a white-box calibration of a simple caliper (Source: Schwenke et al 2000)

The multiplicity of modules in white-box representations means that additional parameters are included in the forward and backward calibration functions, parameters that mediate the relation between outcomes and indications. In the caliper example, these parameters include the temperatures and thermal expansion coefficients of the workpiece and scale, the roughness of contact between the workpiece and caliper legs, the Abbe-error (‘wiggle room’) of the legs relative to each other, and the resolution of the readout. These parameters are assumed to enter into various dependencies with each other as well as with the quantity being measured and the indications of the instrument. Such dependencies are specified in light of background theories and tested through secondary experiments on the apparatus.

Engineers who design, construct and test precision measuring instruments typically express these dependencies in mathematical form, i.e. as equations. Such equations represent the laws of evolution and interaction among different modules in a manner that is amenable to algebraic manipulation. The forward and backward calibration functions are then obtained by solving this set of equations and arriving at a general dependency relation among model parameters¹⁰². The general form of a white-box forward calibration function is:

I0 = fFC (O , I1 , I2 , I3 , … In) (4.5)

where I0 is the model’s prediction concerning the processed indication of the instrument, O the quantity being measured, and I1,… In additional parameters. As before, O is obtained by reference to a measurement standard whose associated uncertainties may for the time being be neglected. The additional parameter values in the forward function are estimated by performing additional measurements on the instrument, sample and environment, e.g. by measuring the temperatures of the caliper and the workpiece, the roughness of the contact etc.
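As a rough illustration of what a forward function with additional parameters might look like, the following toy model folds thermal expansion of the workpiece and scale, a contact bias, and the readout resolution into the predicted indication. The functional form, parameter names and coefficient values are my own simplifications for illustration; they are not the model of Schwenke et al. (2000).

```python
# A toy white-box forward function for the caliper, in the spirit of eq. (4.5).
# Functional form, parameter names and coefficients are illustrative assumptions,
# not the actual model of Schwenke et al. (2000).

def forward_function(O, T_work, T_scale, alpha_work=23e-6, alpha_scale=11.5e-6,
                     contact_bias=0.002, resolution=0.01, T_ref=20.0):
    """Predict the processed indication I0 (mm) from the diameter O (mm) and
    additional parameters: temperatures (deg C), expansion coefficients (1/K),
    a contact bias (mm) and the readout resolution (mm)."""
    expanded_workpiece = O * (1 + alpha_work * (T_work - T_ref))    # workpiece expands
    scale_stretch = 1 + alpha_scale * (T_scale - T_ref)             # so does the scale
    raw = expanded_workpiece / scale_stretch + contact_bias
    return round(raw / resolution) * resolution                     # quantize to the readout

# Example: predicted indication for a 50 mm standard with a warm workpiece.
print(forward_function(50.0, T_work=23.0, T_scale=21.0))
```

The additional arguments play the role of I1,… In: each must itself be estimated by a secondary measurement before the forward function can be tested against observed indications.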

4.4.2. Uncertainty estimation

In the first step of white-box calibration, the forward function is derived from model equations and tested against the actual behaviour of the instrument. Much like the black-box case, testing involves recording the indications produced by the instrument in response to a set of standard samples, and comparing these indications with the indications I0 predicted by the forward function. But the analysis of residual errors is more complex in the white-box case, because the instrument is represented in a more detailed way. On the one hand, deviations between actual and predicted indications may be treated as uncontrolled (so-called ‘random’) variations to the measurement process. Just like the black-box case, such deviations are accounted for by modeling the residual errors statistically and arriving at a measure of their probability distribution. Uncertainties evaluated in this way are labelled ‘type-A’ in the metrological literature¹⁰³. On the other hand, observed indications may also deviate from predicted indications because these predictions are based on erroneous estimates of additional parameters I1,… In. A common source of uncertainty when predicting indications is the fact that additional parameters I1,… In are estimated by performing secondary measurements, and these measurements suffer from uncertainties of their own. The effects of these ‘type-B’ uncertainties are evaluated by propagating them through the model’s equations to the predicted indication I0. This alternative way of evaluating uncertainty is not available under a black-box representation of the instrument because such representation neglects the influence of additional parameters. In white-box calibration, by contrast, both type-A and type-B methods are available and can be used in combination to explain the total deviation between observed and predicted indications.

102 For the set of equations representing the caliper measurement process and a derivation of its forward calibration function, see Schwenke et al. (2000, 396), eq. (3) and (4).

An example of the propagation of type-B uncertainties has already been discussed in

Chapter 1, namely the method of uncertainty budgeting. Individual uncertainty contributions are evaluated separately and then summed up in quadrature (that is, as a root sum of squares). A crucial assumption of this method is that the uncertainty contributions are independent of each other.
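A minimal sketch of the quadrature rule is given below; the contribution names and magnitudes are placeholders of mine rather than entries from any actual budget, and the rule is valid only under the independence assumption just mentioned.

```python
# Combining independent uncertainty contributions in quadrature (root sum of squares).
# Contribution names and magnitudes are placeholders, not entries from Table 4.1.
import math

contributions = {
    "thermal expansion of workpiece": 3.0,
    "thermal expansion of scale": 2.0,
    "contact roughness": 4.0,
    "statistical (type-A)": 5.0,
}  # e.g. in micrometres, or in parts in 10^6 for a relative budget

combined = math.sqrt(sum(u ** 2 for u in contributions.values()))
print(f"combined standard uncertainty: {combined:.2f}")
```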

Table 4.1 is an example of an uncertainty budget drawn for a measurement of the Newtonian gravitational constant G with a torsion pendulum (Luo et al. 2009). In a contemporary variation on the 1798 Cavendish experiment, the pendulum is suspended in a vacuum between two masses, and G is measured by determining the difference in torque exerted on the pendulum at different mass-pendulum alignments. The white-box representation of the apparatus is composed of several modules (pendulum, masses, fibre etc.) and sub-modules, each associated with one or more quantities whose estimation contributes uncertainty to the measured value of G. The last item in the budget is the ‘statistical’, namely type-A uncertainty arising from uncontrolled variations. The total uncertainty associated with the measurement is then calculated as the quadratic sum of individual uncertainty contributions.

103 For details see JCGM (2008a).

Table 4.1: Uncertainty budget for a torsion pendulum measurement of G, the Newtonian gravitational constant. Values are expressed in 10⁻⁶. A diagram of the apparatus appears on the right. (Source: Luo et al. 2009, 3)

The method of uncertainty budgeting is computationally simple. As long as the uncertainties of different input quantities are assumed to be independent of each other, their propagation to the measurement outcome can be calculated analytically. A more computationally challenging case occurs when model parameters depend on each other in nonlinear ways, thereby making it difficult or impossible to propagate the uncertainties analytically. In such cases uncertainty estimates can sometimes be derived through computer simulation. This is the case when metrologists attempt to calibrate coordinate measuring machines (CMMs), i.e. instruments that measure the shape and texture of three-dimensional objects by recording a series of coordinates along their surface. These instruments are calibrated by constructing idealized models that represent aspects of the instrument (amplifier linearity, probe tip radius), the sample (roughness, thermal expansion), the environment (frame vibration) and the data acquisition mechanism (sampling algorithm). Each such input parameter has a probability density function associated with it. The model, along with the probability density functions, then serves to construct a Monte Carlo simulation that samples the input distributions and propagates the uncertainty to measurement outcomes (Schwenke et al 2000, Trenk et al 2004). This sort of computer-simulated calibration has become so prevalent that in 2008 the International Bureau of Weights and Measures (BIPM) published a ninety-page supplement to its “Guide to the Expression of Uncertainty in Measurement” dealing solely with Monte Carlo methods (JCGM 2008b)¹⁰⁴.
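The following sketch shows the bare logic of such a Monte Carlo propagation, applied to a toy caliper model of the kind sketched above rather than to a CMM model; the assumed probability distributions and their widths are illustrative only.

```python
# Monte Carlo propagation of type-B uncertainties through a toy caliper forward model.
# The assumed distributions and widths are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

O = 50.0                                   # reference value supplied by the standard (mm)
T_work = rng.normal(23.0, 0.5, N)          # secondary thermometer, u = 0.5 K
T_scale = rng.normal(21.0, 0.5, N)
contact = rng.uniform(-0.003, 0.003, N)    # rectangular distribution for the contact error (mm)

alpha_work, alpha_scale, T_ref = 23e-6, 11.5e-6, 20.0
I0 = O * (1 + alpha_work * (T_work - T_ref)) / (1 + alpha_scale * (T_scale - T_ref)) + contact

print("predicted indication:", I0.mean(), "+/-", I0.std())
```

Sampling the input distributions and reading off the spread of the simulated indications replaces the analytic quadrature rule when the model is too entangled to solve in closed form.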

104 The topic of uncertainty propagation is, of course, much more complex than the discussion here is able to cover. Apart from the methods of uncertainty budgeting and Monte Carlo, several other methods of uncertainty propagation are commonly applied to physical measurement, including the Taylor method (Taylor 1997), probability bounds analysis, and Bayesian analysis (Draper 1995).

During uncertainty evaluation, the forward calibration function is iteratively tested for compatibility with observed indications. Deviations from the predicted indications that fail to be accounted for by either type-A or type-B methods are usually a sign that the white-box model is misrepresenting the measurement process. Sources of potential misrepresentation include, for example, a neglected or insufficiently controlled background effect, an inadequate statistical model of the variability in indications, an error in measuring an additional parameter, or an overly simplified representation of the interaction between certain modules. Much like other cases of scientific modeling and experimenting, white-box calibration involves iterative modifications to the model of the apparatus as well as to the apparatus itself in an attempt to account for remaining deviations. The stage at which this iterative process is deemed complete depends on the degree of measurement accuracy required and on the ability to physically control the values of additional parameters.

4.4.3. Projection

Once sufficiently improved to account for deviations, the white-box model is projected beyond the circumstances of calibration onto the circumstances that are presumed to obtain during measurement. This is the second step of calibration, which involves the derivation of a backward function from model equations. The general form of a white-box backward calibration function is:

O = fC (I0 , I1 , I2 , I3 , … In) (4.6)


In general, a white-box calibration function cannot be obtained by inverting the forward function, but requires a separate derivation. Nevertheless, the additional parameters I1…In are often presumed to be constant and equal (within uncertainty) to the values they had during the first step of calibration. For example, metrologists assume that the caliper will be used to measure objects whose temperature and roughness are the same (within uncertainty) as those of the workpieces that were used to calibrate it. This assumption of constancy has a double role. First, it allows metrologists to easily obtain a calibration function by inverting the forward function. Second, the assumption of constancy is epistemically important, as it specifies the scope of projectability of the calibration function. The function is expected to predict measurement outcomes correctly only when additional parameters I1…In fall within the value ranges specified. If circumstances differ from this narrow specification, it is necessary to derive a new calibration function for these new circumstances prior to obtaining measurement outcomes.
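Continuing the toy caliper model from above, the sketch below shows how a backward function can be obtained by inverting the forward function once the additional parameters are frozen at their calibration-time values; all names and numbers remain illustrative assumptions rather than a derivation from the text.

```python
# Inverting the toy forward model under the constancy assumption: the additional
# parameters keep the values they had during calibration (all values illustrative).

def backward_function(I0, T_work=23.0, T_scale=21.0, alpha_work=23e-6,
                      alpha_scale=11.5e-6, contact_bias=0.002, T_ref=20.0):
    """Estimate the diameter O (mm) from a processed indication I0 (mm)."""
    scale_stretch = 1 + alpha_scale * (T_scale - T_ref)
    return (I0 - contact_bias) * scale_stretch / (1 + alpha_work * (T_work - T_ref))

# Trusted only while the additional parameters stay within the ranges specified at
# calibration; outside them a new calibration function has to be derived.
print(backward_function(50.04))
```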

This last point sheds light on an important difference between black-box and white-box calibration: they involve different trade-offs between predictive generality and predictive accuracy. Black-box calibration models predict the outcomes of measuring procedures under a wide variety of circumstances, but with relatively low accuracy, as they fail to take into account local factors that intervene on the relation between indications and outcomes. White-box calibration operations, on the other hand, specify such local factors narrowly, but their predictions are projectable only within that narrow scope. Of course, a continuum lies between these two extremes. A white-box calibration function can be made more general by widening the specified value range of its additional parameters or by considering fewer such additional parameters. In so doing, the uncertainty associated with its predictions generally increases.

4.4.4. Predictability, not just correlation

Another epistemically important difference between black-box and white-box calibration is the role played by measurement standards in each case. In black-box calibration, one attempts to obtain a stable correlation between the processed indications of the measuring instrument and standard values of the quantity to be measured. By ‘stable’ I mean repeatable over many runs; by ‘correlation’ I mean a mapping between two variables that is unique (i.e. bijective) up to some stated uncertainty. For example, the black-box calibration of a caliper is considered successful if a stable correlation is obtained between its readout and the number of 1-millimetre standard blocks concatenated between the caliper’s legs.

This may lead one to hastily conclude that obtaining such correlations is necessary and sufficient for successful calibration. But this last claim does not generalize to white-box calibration, where one attempts to obtain a stable correlation between the processed indications of the measuring instrument and the predictions of an idealized model of the measurement process¹⁰⁵. A correlation of the first sort does not imply a correlation of the second sort.

105 A different way of phrasing this claim would be to say that the idealized model itself functions as a measurement standard, though this way of talking deviates from the way metrologists usually use the term ‘measurement standard’.

To see the point, recall that during the first step of white-box calibration one accounts for deviations between observed indications and the indications predicted by the forward calibration function. The forward function is derived from equations specified by an idealized model of the measurement process. The total uncertainty associated with these deviations is accordingly a measure of the predictability of the behaviour of the measurement process by the model. Recall further that the indications I0 predicted by a white-box model depend not only on standard quantity values O but also on a host of additional parameters I1…In, as well as on laws of evolution and interaction among modules. Consequently, a mere correlation between observed indications and standard quantity values is insufficient for successful white-box calibration. To be deemed predictable under a given white-box model, indications should also exhibit the expected dependencies on the values of additional parameters.

As an example, consider the caliper once more. If the standard gauge blocks are gradually heated as they are concatenated, theory predicts that the indications of the caliper will deviate from linear dependence on the total length of the blocks due to the uneven expansion rates of the blocks and the caliper. Now suppose that this nonlinearity fails to be detected empirically – that is, the caliper’s indications do not display the sensitivity to temperature predicted by its white-box model but instead remain linearly correlated to the total length of the gauge blocks. It is tempting to conclude from this that the caliper is more accurate than previously thought. This would be a mistake, however, for accuracy is a property of an inference from indications to outcomes, and this inference has proved inaccurate in our case. Instead, the right conclusion from such an empirical finding in the context of white-box calibration is that the model of the caliper is in need of correction. It may be that the dependency of indications on temperature has a different coefficient than presumed, or that a hidden background effect cancels out the effects of thermal expansion, etc. Unless an adequate correction is made to the model of the caliper, the uncertainty associated with its predictions – and hence with the outcomes of caliper measurements – remains high despite the linear correlation between indications and standard quantity values.
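To first order in the expansion coefficients, the deviation from linearity that theory predicts in this scenario can be written as follows (a standard linear-expansion approximation in my own notation, not a formula taken from the text):

```latex
\Delta I \;\approx\; O\left(\alpha_{\text{block}}\,\Delta T_{\text{block}}
                            - \alpha_{\text{scale}}\,\Delta T_{\text{scale}}\right)
```

Because the blocks are heated progressively as they are concatenated, the term ΔT_block grows with the total length O, so the predicted indications bend away from a straight line; a caliper that nonetheless reads linearly is failing to behave as its model says it should.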

It is the overall predictive uncertainty of the model, rather than the correlation of indications with standard values, that determines the uncertainty of measurement outcomes.

We already saw that in the second step of calibration model assumptions are projected beyond the calibration phase and used to predict measurement outcomes. The total uncertainty associated with measurement outcomes then expresses the likelihood that the measured quantity value will fall in a given range when the indications of the instrument are such-and-such. In other words, measurement uncertainty is a measure of the predictability of measurement outcomes under an idealized model of the measurement process, rather than a measure of closeness of correlation between the observed behaviours of the instrument and values supplied by standards.

This conclusion may be generalized to black-box calibration. Black-box calibration is, after all, a special case of white-box calibration where additional parameters are neglected.

All sources of uncertainty are represented as uncontrolled deviations from the expected correlation between indications and standard values, and evaluated through type-A (‘statistical’) methods. A black-box model, in other words, is a coarse-grained representation of the measuring process under which measurement uncertainty and closeness of correlation with a standard happen to coincide. Nevertheless, in both black- and white-box cases theoretical and statistical considerations enter into one’s choice of model assumptions, and in both cases total measurement uncertainty is a measure of the predictability of the outcome under those assumptions. Black-box calibration is simply one way to ground such predictions, by making data-driven empirical generalizations about the behaviour of an instrument. Such generalizations suffer from higher uncertainties and a fuzzier scope than the predictions of white-box models, but have the same underlying inferential structure.

The emphasis on predictability distinguishes the model-based account from narrower conceptions of calibration that view it as a kind of reproducibility test. Allan Franklin, for example, defines calibration as “the use of a surrogate signal to standardize an instrument” (1997, 31). Though he admits that calibration sometimes involves complex inferences, in his view the ultimate goal of such inferences is to ascertain the ability of the apparatus to reproduce known results associated with standard samples (‘surrogate signals’). A similar view of calibration is expressed by Woodward (1989, 416-8). These restrictive views treat calibration as an experimental investigation of the measuring apparatus itself, rather than an investigation of the empirical consequences of modelling the apparatus under certain assumptions. Hence Franklin seems to claim that, at least in simple cases, the success or failure of calibration procedures is evident through observation. The calibration of a spectrometer, for example, is understood by Franklin as a test for the reproducibility of known spectral lines as seen on the equipment’s readout (Franklin 1997, 34). Such views fail to recognize that even in the simplest cases of calibration one still needs to make idealized assumptions about the measurement process. Indeed, unless the instrument is already represented under such assumptions, reproducibility tests are useless, as there are no grounds for telling whether a similarity of indications should be taken as evidence for a similarity in outcomes, and whether the behaviour of the apparatus can be safely projected beyond the test stage. Despite this, restrictive views neglect the representational aspect of calibration and only admit the existence of an inferential dimension to calibration in special and highly complex cases (ibid., 75).

4.5. The role of standards in calibration

4.5.1. Why standards?

As I have argued so far, the ultimate goal of calibration is to predict the outcomes of a measuring procedure under a specified set of circumstances. This goal is only partially served by establishing correlations with standard values, and must be complemented with a detailed representation of the measurement process whenever high accuracy is required. This line of argument raises two questions concerning the role of measurement standards. First, how does the use of standards contribute to the accurate prediction of measurement outcomes?

Second, is establishing a correlation between instrument indications and standard values necessary for successful calibration?

The simple answer to the first question is that standards supply reference values of the quantity to be measured. That is, they supply values of the variable O that are plugged into the forward calibration function, thereby allowing predictions concerning the instrument’s indications to be tested empirically. But this answer is not very informative by itself, for it does not explain why one ought to treat the values supplied by standards as accurate. In the previous sections we simply assumed that standards provide accurate values, despite the fact that (as already shown in Chapter 1) even the most accurate measurement standards have nonzero uncertainties. Given that the procedures metrologists call ‘standards’ are not absolutely accurate, is there any reason to use them for estimating values of O rather than any other procedure that measures the same quantity?

As I am about to show, the answer depends on whether the question is understood in a local or global context. Locally – for any given instance of calibration – it makes no epistemic difference whether one calibrates against a metrological standard or against some other measuring procedure, provided that the uncertainty associated with its outcomes is sufficiently low. By contrast, from a global perspective – when the web of inter-procedural comparisons is considered as a whole – the inclusion of metrological standards is crucial, as it ensures that the procedures being compared are measuring the quantity they are intended to.

4.5.2. Two-way white-box calibration

Let us begin with the local context, and consider any particular pair of procedures - call them the ‘calibrated’ and ‘reference’ procedures. During calibration, the reference procedure is used to measure values of O associated with certain samples; these values are plugged into the forward function of the calibrated instrument and used to predict its indications; and these predictions are then compared to the actual indications produced by the calibrated instrument in response to the same (or similar) samples. For the sake of carrying out this procedure, it makes no difference whether the reference procedure is a metrologically sanctioned standard or not, because the accuracy of metrologically sanctioned standards is evaluated in exactly the same way as the accuracy of any other measurement procedure. The uncertainties associated with standard measuring procedures are evaluated by constructing white-box models of those standards, deriving a forward calibration function, propagating uncertainties through the model, and testing model predictions for compatibility with other standards.


Table 4.2: Type-B uncertainty budget for NIST-F1, the US primary frequency standard. The clock is deemed highly accurate despite the large discrepancy between its indications and the cesium clock frequency, because the correction factor is accurately predictable. (Source: Jefferts et al, 2007, 766)

This has already been shown in Chapter 1. For example, the cesium fountain clock NIST-F1, which serves as the primary frequency standard in the US, has a fractional frequency uncertainty of less than 5 parts in 10¹⁶ (Jefferts et al. 2007). This uncertainty is evaluated by modelling the clock theoretically and statistically, drawing an uncertainty budget for the clock, and testing these uncertainty estimates for compatibility with other cesium fountain clocks¹⁰⁶. Table 4.2 is a recent uncertainty budget for NIST-F1 (including only type-B evaluations). Note that the systematic corrections applied to the clock’s indications (the total frequency ‘bias’) far exceed the total type-B uncertainty associated with the outcome. In other words, the clock ‘ticks’ considerably faster – by hundreds of standard deviations – than the cesium frequency it is supposed to measure. The clock is nevertheless deemed highly accurate, because the cesium frequency is predictable from clock indications (‘ticks’) with a very low uncertainty.

106 By ‘modeling the clock statistically’ I mean making statistical assumptions about the variation of its indications over time. These assumptions are used to construct models of noise, such as white noise, flicker noise and Brownian noise (see also Section 3.4.2).
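The role of the bias correction can be caricatured in a few lines: the outcome is the corrected value, and its uncertainty is the uncertainty of the correction, not the size of the raw offset. The numbers below are placeholders chosen only to mimic the ‘hundreds of standard deviations’ situation described above; they are not NIST-F1’s actual budget.

```python
# Why a large but predictable bias does not spoil accuracy: the outcome is the corrected
# value, and its uncertainty comes from the correction model. Placeholder numbers only.
raw_offset     = 2.5e-13   # the clock 'ticks' fast by this fraction of the cesium frequency
predicted_bias = 2.5e-13   # total systematic bias predicted by the white-box model
u_correction   = 4e-16     # type-B uncertainty of that prediction

corrected_offset = raw_offset - predicted_bias
print(f"corrected fractional offset: {corrected_offset:.1e} +/- {u_correction:.0e}")
print(f"raw bias is {raw_offset / u_correction:.0f} standard deviations wide")
```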

Now consider a scenario in which a cesium fountain clock is used to calibrate a hydrogen spectrometer, i.e. a device measuring the frequency associated with atomic transitions in hydrogen. Such calibration is described in Niering et al. (2000). The accuracy expected of the spectrometer is close to that of the standard, so that one cannot neglect the inaccuracies associated with the standard during calibration. In this case metrologists must consider two white-box models – one for the calibrated instrument and one for the standard – and compare the measurement outcomes predicted by each model. These predicted measurement outcomes already incorporate bias corrections and are associated with estimates of total uncertainty propagated through each model. The calibration is then considered successful if and only if the outcomes of the two devices, as predicted by their respective models, coincide within their respective uncertainties:

fC (I0 , I1 , I2 , I3 , … In) ≈ f′C (I′0 , I′1 , I′2 , I′3 , … I′m) (4.7)

where fC is the calibration function associated with the hydrogen spectrometer, f′C is the calibration function associated with the cesium standard, and ≈ stands for ‘compatible up to stated uncertainty’. Notice the complete symmetry between the calibrated instrument and the standard as far as the formal requirement for successful calibration goes. This symmetry expresses the fact that a calibrated procedure and a reference procedure differ only in their degree of uncertainty and not in the way uncertainty is evaluated.

This ‘two-way white-box’ procedure exemplifies calibration in its full generality, where both the calibrated instrument and the reference are represented in a detailed manner. As equation (4.7) makes clear, calibration is successful when it establishes a predictable correlation between the outcomes of measuring procedures under their respective models, rather than between their observed indications. Such correlation amounts to an empirical confirmation that the predictions of different calibration functions are mutually compatible.
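A minimal rendering of the success criterion in equation (4.7) is a pairwise compatibility test on the two predicted outcomes; the coverage factor k and the example values are assumptions of mine, and the test treats the two uncertainties as independent.

```python
# A pairwise rendering of the success criterion in eq. (4.7): the two predicted outcomes
# must agree within their combined uncertainty. k and the example values are assumptions.
import math

def compatible(outcome_1, u_1, outcome_2, u_2, k=2.0):
    """True if the outcomes agree within k times the combined standard uncertainty
    (the two uncertainties are assumed independent)."""
    return abs(outcome_1 - outcome_2) <= k * math.hypot(u_1, u_2)

# e.g. two fractional-frequency outcomes with their propagated uncertainties
print(compatible(0.9999999998, 3e-10, 1.0000000001, 4e-10))
```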

4.5.3. Calibration without metrological standards

The importance of reference procedures in calibration, then, lies in the fact that they are modelled more accurately – namely, with lower predictive uncertainties – than the procedures they are used to calibrate. It is now easy to see that a reference procedure does not have to be a metrological standard, but instead may be any measurement procedure whose uncertainties are sufficiently low to make the comparison informative. We already saw an example of successful calibration without any reference to a metrological standard in Chapter 1, where two optical clocks were calibrated against each other. In that case both clocks had significantly lower measurement uncertainties than the most accurate metrologically sanctioned frequency standard. More mundane examples of calibration without metrological standards are found in areas where an institutional consensus has not yet formed around the proper application of the measured concept. Paper quality is an example of a complex vector of quantities (including fibre length, width, shape and bendability) for which an international reference standard does not yet exist (Wirandi and Lauber 2006). The instruments that measure these quantities are calibrated against each other in a ‘round-robin’ ring, without a central reference standard (see Figure 4.2). This procedure suffices to ensure that the outcomes of measurements taken in one laboratory reliably predict the outcomes obtained for a similar sample at any other participating laboratory.

Figure 4.2: A simplified diagram of a round-robin calibration scheme. Authorized laboratories calibrate their measuring instruments against each other’s, without reference to a central standard. (source: Wirandi and Lauber 2006, 616)

These examples of ‘symmetrical’ calibration all share a common inferential structure, namely, they establish the accuracy of measuring procedures through a robustness argument.

The predictions derived from models of multiple measuring procedures are tested for compatibility with each other, and those that pass the test are taken to be accurate up to their respective uncertainties. As I will argue in the next section, this inferential structure is essential to calibration and present even in seemingly asymmetrical cases where the inaccuracy of standards is neglected.

Before discussing robustness, let me return to the role of metrological standards in calibration. From a local perspective, as we have already seen, metrological standards are not qualitatively different from other measuring procedures in their ability to produce reference values for calibration. Metrological standards are useful only insofar as the uncertainty associated with their values is low enough for calibration tests to be informative. Of course, the values provided by metrological standards are usually associated with very low uncertainties. But it is not by virtue of their role as standards that their uncertainties are deemed low. The opposite is the case: the ability to model certain procedures in a way that facilitates accurate predictions of their outcomes motivates metrologists to adopt such procedures as standards, as already discussed in Chapter 1.

Above I posed the question: is establishing correlation with standard values necessary for successful calibration? A partial answer can now be given. Insofar as local instances of calibration are concerned, it is not necessary to appeal to metrologically sanctioned standards in order to evaluate uncertainty. Establishing correlations among outcomes of different measuring procedures is sufficient for this goal. One may, of course, call some of the procedures being compared ‘standards’ insofar as they are modelled more accurately than others. But this designation does not mark any qualitative epistemic difference between standard and non-standard procedures.

4.5.4. A global perspective

The above is not meant to deny that choices of metrological standards carry with them a special normative force. There is still an important difference between calibrating an instrument against a non-standard reference procedure and calibrating it against a metrological standard, even when both reference procedures are equally accurate. The difference is that in the first case, if significant discrepancies are detected between the outcomes of the two procedures, either procedure is in principle equally amenable to correction. All other things being equal, the models from which a calibration function is derived for either procedure are equally revisable. This is not the case if the reference procedure is a metrological standard, because a model representing a standard procedure has a legislative function with respect to the application of the quantity concept in question.

This legislative (or ‘coordinative’) function of metrological models has been discussed at length in Chapter 3. The theoretical and statistical assumptions with which a metrological standard is represented serve a dual, descriptive and normative role. On the one hand, they predict the actual behaviour of the process that serves as a standard, and on the other, they prescribe how the concept being standardized is to be applied to that process. Metrological standards can fulfill this legislative function because they are modelled in terms of the theoretical definition of the relevant concept, that is, they constitute realizations of that concept. For this reason, in the face of systematic discrepancies between the outcomes of standard and non-standard procedures, there is a good reason to prefer a correction to the outcomes of non-standard procedures over a correction to the outcomes of standard procedures.

Note that this preference does not imply that metrological standards are categorically more accurate, or accurate for different reasons, than other measuring procedures. The total uncertainty associated with a measuring procedure is still evaluated in exactly the same way whether or not that procedure is a metrological standard. But the second-order uncertainty associated with metrological standards – that is, the uncertainty associated with evaluations of their uncertainty – is especially low. This is the case because metrological standards are modelled in terms of the theoretical definition of the quantity they realize, and their uncertainties are accordingly estimates of the degree to which the realization succeeds in approximately satisfying the definition. Such uncertainty estimates enjoy a higher degree of confidence than those associated with non-standard measuring procedures, because the latter are not directly derived from the theoretical definition of the measured quantity and cannot be considered equally safe estimates of the degree of its approximate satisfaction. For this reason, the assumptions under which non-standard measuring procedures are modelled are usually deemed more amenable to revision than the assumptions informing the modeling of metrological standards. This is the case even when the non-standard procedure is thought to be more accurate, i.e. to have lower first-order uncertainty, than the standard. For example, if an optical atomic clock were to systematically disagree with a cesium standard, the model of the former would be more amenable to revision despite it supposedly being the more accurate clock.

The importance of the normative function of metrological standards is revealed from a global perspective on calibration, when one views the web of inter-procedural comparisons as a whole. Here metrological standards form the backbone that holds the web together by providing a stable reference for correcting systematic errors. The consistent distribution of systematic errors across the web makes possible its subsumption under a single quantity concept, as explained in Chapter 2. In the absence of a unified policy for distributing errors, nothing prevents a large web from breaking into ‘islands’ of intra-comparable but mutually incompatible procedures. By legislating how an abstract quantity concept is to be realized, models of metrological standards serve as a kind of ‘semantic glue’ that ties together distant parts of the web.

As an example, consider all the clocks calibrated against Coordinated Universal Time either directly or indirectly, e.g. through national time signals. What justifies the claim that all these clocks are measuring, with varying degrees of accuracy, the same quantity – namely, time on a particular atomic scale? The answer is that all these clocks produce consistent outcomes when modelled in terms of the relevant quantity, i.e. UTC. But to test whether they do, one must first determine what counts as an adequate way of applying the concept of

Coordinated Universal Time to any particular clock. This is where metrological standards come into play: they fix a semantic link between the definition of the quantity being standardized and each of its multiple realizations. In the case of UTC, this legislation is performed by modeling a handful of primary frequency standards and several hundred secondary standards in a manner that minimizes their mutual discrepancies, as described in

Chapter 3. It then becomes possible to represent non-standard empirical procedures such as quartz clocks in terms of the standardized quantity by correcting their systematic errors relative to the network’s backbone. In the absence of this ongoing practice of correction, the web of clocks would quickly devolve into clusters that measure mutually incompatible timescales.

From a global perspective, then, metrological standards still play an indispensable epistemic role in calibration whenever (i) the web of instruments is sufficiently large and (ii) the quantity being measured is defined theoretically. This explains why metrological rigour is necessary for standardizing quantities that have reached a certain degree of theoretical maturity. At the same time, the analysis above explains why metrological standards are unnecessary for successful calibration in the case of ‘nascent’ quantities such as paper quality.

4.6. From predictive uncertainty to measurement accuracy

We saw above (Section 4.4) that measurement uncertainty is a kind of predictive uncertainty. That is, measurement uncertainty is the uncertainty associated with predictions of the form: “when the measuring instrument produces indication i the value of the measured quantity will be o.” Such predictions are derived during calibration from statistical and theoretical assumptions about the measurement process. Calibration tests proceed by comparing the outcomes predicted by a model of one measuring procedure (the ‘calibrated’ procedure) to the outcomes predicted by a model of another measuring procedure (the ‘reference’ procedure). When the predicted outcomes agree within their stated uncertainties, calibration is deemed successful. This success criterion is expressed by equation (4.7).

At first glance it seems that calibration should only be able to provide estimates of consistency among predictions of measurement outcomes. And yet metrologists routinely use calibration tests to estimate the accuracy of outcomes themselves. That is, they infer from the mutual consistency among predicted outcomes that the outcomes are accurate up to their stated uncertainties. The question arises: why should estimates of consistency among outcomes predicted for different measuring procedures be taken as good estimates of the accuracy of those outcomes?

The general outline of the answer should already be familiar from my discussion of robustness in Chapter 1. There I showed how robustness tests of the form (RC), performed among multiple realizations of the same measurement unit, ground claims to the accuracy of those realizations. I further showed that this conclusion holds regardless of the particular meaning of ‘accuracy’ employed – be it metaphysical, epistemic, operational, comparative or pragmatic. The final move, then, is to expand the scope of (RC) to include measuring procedures in general. The resulting ‘generalized robustness condition’ may be formulated in the following way:

(GRC) Given multiple, sufficiently diverse processes that are used to measure the same quantity, the uncertainties ascribed to their outcomes are adequate if and only if (i) discrepancies among measurement outcomes fall within their ascribed uncertainties; and (ii) the ascribed uncertainties are derived from appropriate models of each measurement process.

Uncertainties that satisfy (GRC) are reliable measures of the accuracies of measurement outcomes under all five senses of ‘measurement accuracy’, for the same reasons that applied to (RC)¹⁰⁷.

107 See Chapter 1, Section 1.5: “A robustness condition for accuracy”.

What remains to be clarified is how calibration operations test the satisfaction of (GRC). Recall that calibration is deemed successful – that is, good at predicting the outcomes of a measuring procedure to within the stated uncertainty – when the predicted outcomes are shown to be consistent with those associated with a reference procedure. Now consider an entire web of such successful calibration operations. Each ‘link’ in the web stands for an instance of pairwise calibration, and is associated with some uncertainty that is a combination of uncertainties from both calibrated and reference procedures. Assuming that there are no cumulative systematic biases across the web, the relation of compatibility within uncertainty ≈ can be assumed to be transitive¹⁰⁸. Consequently, measurement uncertainties that are vindicated by one pairwise calibration are traceable throughout the web. The outcomes of any two measurement procedures in the web are predicted to agree within their ascribed uncertainties even if they are never directly compared to each other.

The web of calibrations for a given quantity may therefore be considered an indirect robustness test for the uncertainties associated with each individual measuring procedure. Each additional calibration that successfully attaches its uncertainty estimates to the web indirectly tests those estimates for compatibility with many other estimates made for a variety of other measuring procedures. In other words, each additional calibration constitutes an indirect test as to whether (GRC) is satisfied when the web of comparisons is appended with a putative new member.
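A sketch of how clause (i) of (GRC) might be checked over such a web is given below: every pair of outcomes is tested for agreement within its combined uncertainty. The procedure names, values and uncertainties are hypothetical, and the check ignores the correlations that a real comparison would have to model.

```python
# Checking clause (i) of (GRC) over a web of procedures: every pair of outcomes must
# agree within its combined uncertainty. Names, values and uncertainties are hypothetical,
# and correlations between procedures are ignored.
import itertools
import math

outcomes = {                       # procedure: (measured value, standard uncertainty)
    "clock A": (1.000000002, 2e-9),
    "clock B": (0.999999999, 3e-9),
    "clock C": (1.000000004, 2e-9),
}

def grc_clause_i(outcomes, k=2.0):
    for (a, (xa, ua)), (b, (xb, ub)) in itertools.combinations(outcomes.items(), 2):
        if abs(xa - xb) > k * math.hypot(ua, ub):
            return False, (a, b)   # this link's discrepancy exceeds its ascribed uncertainty
    return True, None

print(grc_clause_i(outcomes))
```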

This conclusion holds equally well for black-box and white-box calibration, which are but special cases of the fully general, two-way white-box case. To be sure, in these special cases some of the complexities involved in deriving and testing model-based predictions remain implicit. In the one-way white-box case one makes the simplifying assumption that the behaviour of the standard is perfectly predictable. In the black-box case one additionally makes the simplifying assumption that changes in extrinsic circumstances will not influence the relation between indications and outcomes. These varying levels of idealization affect the accuracy and generality with which measurement outcomes are predicted, but not the general methodological principle according to which compatibility among predictions is the ultimate test for measurement accuracy.

108 Note that this last assumption is only adequate when the web is small (i.e. small maximal distance among nodes) or when metrological standards are included in strategic junctions, as already discussed above.

4.7. Conclusions

This chapter has argued that calibration is a special sort of modelling activity. Viewed locally, calibration is the complex activity of constructing, testing, deriving predictions from, and propagating uncertainties through models of a measurement process. Viewed globally, calibration is a test of robustness for model-based predictions of multiple measuring processes. This model-based account of calibration solves the problem of accuracy posed in the introduction to this thesis. As I have shown, uncertainty estimates that pass the robustness test are reliable estimates of measurement accuracy despite the fact that the accuracy of any single measuring procedure cannot be evaluated in isolation.

The key to the solution was to show that, from an epistemological point of view, measurement accuracy is but a special case of predictive accuracy. As far as it is knowable, the accuracy of a measurement outcome is the accuracy with which that outcome can be predicted on the basis of a theoretical and statistical model of the measurement process. A similar conclusion holds for measurement outcomes themselves, which are the results of predictive inferences from model assumptions mediated through the derivation of a calibration function. The intimate inferential link between measurement and prediction has so far been ignored in the philosophical literature, and has potentially important consequences for the relationship between theory and measurement.

The Epistemology of Measurement: A Model-Based Account

Epilogue

In the introduction to this thesis I outlined three epistemological problems concerning measurement: the problems of coordination, accuracy and quantity individuation. In each of the chapters that followed I argued that these problems are solved (or dissolved) by recognizing the roles models play in measurement. A precondition for measuring is the coherent subsumption of measurement processes under idealized models. Such subsumption is a necessary condition for obtaining objective measurement outcomes from local and idiosyncratic instrument indications. In addition, I have shown that contemporary methods employed in the standardization of measuring instruments indeed achieve the goal of coherent subsumption. Hence the model-based account meets both the general and the practice-based epistemological challenges set forth in the introduction.

A general evidential condition for testing measurement claims has emerged from my studies, which may be called ‘convergence under representations’. Claims to measurement, accuracy and quantity individuation are settled by testing whether idealized models representing different measuring processes converge to each other. This convergence requirement is two-pronged. First, the assumptions with which models are constructed have to cohere with each other and with background theory. Second, the consequences of representing concrete processes under these assumptions must converge in accordance with their associated uncertainties. When this dual-aspect convergence is shown to be sufficiently robust under alterations to the instrument, sample and environment, all three problems are solved simultaneously. That is, a robust convergence among models of multiple instruments is sufficient to warrant claims about (i) whether the instruments measure the same quantity, (ii) which quantity the instruments measure and (iii) how accurately each of them measures this quantity. Of course, such knowledge claims are never warranted with complete certainty.

The ‘sufficiency’ of robustness tests may always be challenged by a new perturbation that destroys convergence and forces metrologists to revise their models. As a result, some second-order uncertainty is always present in the characterization of measurement procedures.

Claims about coordination, accuracy and quantity individuation are contextual, i.e. pertain to instruments only as they are represented by specified models. This context-sensitivity is a consequence of recognizing the correct scope of knowledge claims made on the basis of measurements. As I have shown, measurement outcomes are themselves contextual and relative to the assumptions with which measurement processes are modeled.

Similarly, the notions of agreement, systematic error and measurement uncertainty all become clear once their sensitivity to representational context is acknowledged. This, however, does not mean that measurement outcomes lose their validity outside of the laboratory where they were produced. On the contrary, the condition of convergence under representations explains why measurement outcomes are able to ‘travel’ outside of the context of their production and remain valid across a network of inter-calibrated instruments. The fact that these instruments converge under their respective models ensures that measurement outcomes produced by using one instrument would be reproducible across the network, thereby securing the validity of measurement outcomes throughout the network’s scope.

The model-based account has potentially important consequences for several ongoing debates in the philosophy of science, consequences which are beyond the purview of this thesis. One such consequence, already noted at the end of Chapter 4, is the centrality of prediction to measurement, a discovery which calls for subtler accounts of the relationship between theory and measurement. Another important consequence concerns the possibility of a clear distinction between hypotheses and evidence. As we saw above, measurement outcomes are inferred by projection from hypotheses about the measurement process. Just like any other projective estimate, the validity of a measurement outcome depends on the validity of underlying hypotheses. Hence the question arises whether and why measurement outcomes are better suited to serve as evidence than other projective estimates, e.g. the outputs of predictive computer simulations. Finally, the very idea that scientific representation is a two-place relation – connecting abstract theories or models with concrete objects and events – is significantly undermined by the model-based account. Under my analysis, whether or not an idealized model adequately represents a measurement process is a question whose answer is relative to the representational adequacy of other models with respect to other measurement processes. Hence the model-based account implies a kind of representational coherentism, i.e. a diffusion of representational adequacy conditions across the entire web of instruments and knowledge claims. These implications of the model-based account must nevertheless await elaboration elsewhere.

Bibliography

Arias, Elisa F., and Gérard Petit. 2005. “Estimation of the duration of the scale unit of TAI

with primary frequency standards.” Proceedings of the IEEE International Frequency Control

Symposium 244-6.

Audoin, Claude, and Bernard Guinot. 2001. The Measurement of Time. Cambridge: Cambridge

University Press.

Azoubib, J., Granveaud, M. and Guinot, B. 1977. “Estimation of the Scale Unit of Time

Scales.” Metrologia 13: 87-93.

BIPM (Bureau International des Poids et Measures). 2006. The International System of Units

(SI). 8th ed. Sèvres: BIPM, http://www.bipm.org/en/si/si_brochure/

———. 2010. BIPM Annual Report on Time Activities. Vol. 5. Sèvres: BIPM,

http://www.bipm.org/utils/en/pdf/time_ann_rep/Time_annual_report_2010.pdf

———. 2011. Circular-T 282. Sèvres: BIPM,

ftp://ftp2.bipm.org/pub/tai/publication/cirt.282

Birge, Raymond T. 1932. “The Calculation of Errors by the Method of Least Squares.”

Physical Review 40: 207-27.

Boumans, Marcel. 2005. How Economists Model the World into Numbers. London: Routledge.

———. 2006. “The difference between answering a ‘why’ question and answering a ‘how

much’ question.” In Simulation: Pragmatic Construction of Reality, edited by Johannes

Lenhard, Günter Küppers, and Terry Shinn, 107-124. Dordrecht: Springer.

———. 2007. “Invariance and Calibration.” In Measurement in Economics: A Handbook, edited

by Marcel Boumans, 231-248. London: Elsevier.

Bridgman, Percy W. 1927. The logic of modern physics. New York: MacMillan.

———. 1959. “P. W. Bridgman's "The Logic of Modern Physics" after Thirty Years”,

Daedalus 88 (3): 518-526.

Campbell, Norman R. 1920. Physics: the Elements. London: Cambridge University Press.

Carnap, Rudolf. (1966) 1995. An Introduction to the Philosophy of Science. Edited by Martin

Gardner. NY: Dover.

Cartwright, Nancy. 1999. The Dappled World: A Study of the Boundaries of Science. Cambridge:

Cambridge University Press.

Cartwright, Nancy, Towfic Shomar, and Mauricio Suárez. 1995. “The Tool Box of Science.”

Poznan Studies in the Philosophy of the Sciences and the Humanities 44: 137-49.

Chang, Hasok. 2004. Inventing Temperature: Measurement and Scientific Progress. Oxford University

Press.

———. 2009. "Operationalism." In The Stanford Encyclopedia of Philosophy, edited by E.N.

Zalta, http://plato.stanford.edu/archives/fall2009/entries/operationalism/

Chang, Hasok, and Nancy Cartwright. 2008. “Measurement.” In The Routledge Companion to

Philosophy of Science, edited by Psillos, S. and Curd, M., 367-375. NY: Routledge.

Chakravartty, Anjan. 2007. A metaphysics for scientific realism: knowing the unobservable. Cambridge

University Press.

CGPM (Conférence Générale des Poids et Mesures). 1961. Proceedings of the 11th CGPM.

http://www.bipm.org/en/CGPM/db/11/6/

Diez, Jose A. 2002. “A Program for the Individuation of Scientific Concepts.” Synthese 130:

13-48.

Duhem, Pierre M. M. (1914) 1991. The aim and structure of physical theory. Princeton University

Press.

Draper, David. 1995. “Assessment and Propagation of Model Uncertainty.” Journal of the

Royal Statistical Society. Series B (Methodological) 57 (1): 45-97.

Ellis, Brian. 1966. Basic Concepts of Measurement. Cambridge University Press.

Franklin, Allan. 1997. “Calibration.” Perspectives on Science 5 (1): 31-80.

Frigerio, Aldo, Alessandro Giordani, and Luca Mari. 2010. “Outline of a general model of

measurement.” Synthese 175: 123-149.

Galison, Peter. 2003. Einstein’s Clocks, Poincaré’s Maps: Empires of Time. W.W. Norton.

Gerginov, Vladislav, N. Nemitz, S. Weyers, R. Schröder, D. Griebsch, and R. Wynands.

2010. “Uncertainty evaluation of the caesium fountain clock PTB-CSF2.” Metrologia 47:

65-79.

Girard, G. 1994. “The Third Periodic Verification of National Prototypes of the Kilogram

(1988- 1992).” Metrologia 31: 317-36.

Gooday, Graeme J. N. 2004. The Morals of Measurement: Accuracy, Irony, and Trust in Late

Victorian Electrical Practice. Cambridge: Cambridge University Press.

Hacking, Ian. 1999. The Social Construction of What? Harvard University Press.

Henrion, Max and Baruch Fischhoff. 1986. “Assessing Uncertainty in Physical Constants.”

American Journal of Physics 54 (9): 791-8.

Heavner, T.P., S.R. Jefferts, E.A. Donley, J.H. Shirley and T.E. Parker. 2005. “NIST-F1:

recent improvements and accuracy evaluations.” Metrologia 42: 411-422.

Hempel, Carl G. 1966. Philosophy of Natural Science. NJ: Prentice-Hall.

Hon, Giora. 2009. “Error: The Long Neglect, the One-Sided View, and a Typology.” In

Going Amiss in Experimental Research, edited by G. Hon, J. Schickore and F. Steinle. Vol.

267 of Boston Studies in the Philosophy of Science, 11-26. Springer.

JCGM (Joint Committee for Guides in Metrology). 2008. International Vocabulary of Metrology.

3rd edition. Sèvres: JCGM, http://www.bipm.org/en/publications/guides/vim.html

———. 2008a. Guide to the Expression of Uncertainty in Measurement. Sèvres: JCGM,

http://www.bipm.org/en/publications/guides/gum.html

———. 2008b. Evaluation of measurement data — Supplement 1 to the ‘Guide to the expression of

uncertainty in measurement’— Propagation of distributions using a Monte Carlo method. Sèvres:

JCGM, http://www.bipm.org/en/publications/guides/gum.html

Jefferts, S.R., J. Shirley, T. E. Parker, T. P. Heavner, D. M. Meekhof, C. Nelson, F. Levi, G.

Costanzo, A. De Marchi, R. Drullinger, L. Hollberg, W. D. Lee and F. L. Walls. 2002.

“Accuracy evaluation of NIST-F1.” Metrologia 39: 321-36.

Jefferts, S.R., T. P. Heavner, T. E. Parker and J.H. Shirley. 2007. “NIST Cesium Fountains –

Current Status and Future Prospects.” Acta Physica Polonica A 112 (5): 759-67.

Krantz, D. H., P. Suppes, R. D. Luce, and A. Tversky. 1971. Foundations of measurement:

Additive and polynomial representations. Dover Publications.

Kripke, Saul A. 1980. Naming and Necessity. Harvard University Press.

Kuhn, Thomas S. (1961) 1977. “The Function of Measurement in Modern Physical

Sciences.” In The Essential Tension: Selected Studies in Scientific Tradition and Change, 178-

224. Chicago: University of Chicago Press.

Kyburg, Henry E. 1984. Theory and Measurement. Cambridge University Press.

Latour, Bruno. 1987. Science in Action. Harvard University Press.

Li, Tianchu et al. 2004. “NIM4 cesium fountain primary frequency standard: performance

and evaluation.” IEEE International Ultrasonics, Ferroelectrics, and Frequency Control, 702-5.

Lombardi, Michael A., Thomas P. Heavner and Steven R. Jefferts. 2007. “NIST Primary

Frequency Standards and the Realization of the SI Second.” Measure: The Journal of

Measurement Science 2 (4): 74-89.

Luce, R.D. and Suppes, P. 2002. “Representational Measurement Theory.” In Stevens'

Handbook of Experimental Psychology, 3rd Edition, edited by J. Wixted and H. Pashler,

Vol. 4: Methodology in Experimental Psychology, 1-41. New York: Wiley.

Luo, J. et al. 2009. “Determination of the Newtonian Gravitational Constant G with Time-

of-Swing Method.” Physical Review Letters 102 (24): 240801.

Mach, Ernst. (1896) 1966. “Critique of the Concept of Temperature.” In: Brian Ellis, Basic

Concepts of Measurement, 183-96. Cambridge University Press.

Mari, Luca. 2000. “Beyond the representational viewpoint: a new formalization of

measurement.” Measurement 27: 71-84.

———. 2005. “Models of the Measurement Process.” In Handbook of Measuring Systems

Design, edited by P. Sydenman and R. Thorn, Vol. 2, Ch. 104. Wiley.

McMullin, Ernan. 1985. “Galilean Idealization.” Studies in History and Philosophy of Science 16 (3): 247-73.

Michell, Joel. 1994. “Numbers as Quantitative Relations and the Traditional Theory of Measurement.” British Journal for the Philosophy of Science 45: 389-406.

Morgan, Mary. 2007. “An Analytical History of Measuring Practices: The Case of Velocities of Money.” In Measurement in Economics: A Handbook, edited by Marcel Boumans, 105-132. London: Elsevier.

Morrison, Margaret. 1999. “Models as Autonomous Agents.” In Models as Mediators: Perspectives on Natural and Social Science, edited by Mary Morgan and Margaret Morrison, 38-65. Cambridge: Cambridge University Press.

———. 2009. “Models, measurement and computer simulation: the changing face of experimentation.” Philosophical Studies 143 (1): 33-57.

Morrison, Margaret, and Mary Morgan. 1999. “Models as Mediating Instruments.” In Models as Mediators: Perspectives on Natural and Social Science, edited by Mary Morgan and Margaret Morrison, 10-37. Cambridge: Cambridge University Press.

Niering, M. et al. 2000. “Measurement of the Hydrogen 1S-2S Transition Frequency by Phase Coherent Comparison with a Microwave Cesium Fountain Clock.” Physical Review Letters 84 (24): 5496.

Panfilo, G. and E.F. Arias. 2009. “Studies and possible improvements on EAL algorithm.” IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control (UFFC-57), 154-160.

Parker, Thomas. 1999. “Hydrogen maser ensemble performance and characterization of frequency standards.” Joint meeting of the European Frequency and Time Forum and the IEEE International Frequency Control Symposium, 173-6.

Parker, T., P. Hetzel, S. Jefferts, S. Weyers, L. Nelson, A. Bauch, and J. Levine. 2001. “First comparison of remote cesium fountains.” 2001 IEEE International Frequency Control Symposium, 63-68.

Petit, G. 2004. “A new realization of terrestrial time.” 35th Annual Precise Time and Time Interval (PTTI) Meeting, 307-16.

Pickering, Andrew. 1995. The Mangle of Practice: Time, Agency and Science. Chicago and London: University of Chicago Press.

Poincaré, Henri. (1898) 1958. “The Measure of Time.” In: The Value of Science, 26-36. New York: Dover.

Quinn, T.J. 2003. “Open letter concerning the growing importance of metrology and the benefits of participation in the Metre Convention, notably the CIPM MRA.” http://www.bipm.org/utils/en/pdf/importance.pdf

Record, Isaac. 2011. “Knowing Instruments: Design, Reliability and Scientific Practice.” PhD diss., University of Toronto.

Reichenbach, Hans. (1927) 1958. The Philosophy of Space and Time. Courier Dover Publications.

Schaffer, Simon. 1992. “Late Victorian metrology and its instrumentation: a manufactory of Ohms.” In Invisible Connections: Instruments, Institutions, and Science, edited by Robert Bud and Susan E. Cozzens, 23-56. Cardiff: SPIE Optical Engineering.

Schwenke, H., B.R.L. Siebert, F. Waldele, and H. Kunzmann. 2000. “Assessment of Uncertainties in Dimensional Metrology by Monte Carlo Simulation: Proposal for a Modular and Visual Software.” CIRP Annals - Manufacturing Technology 49 (1): 395-8.

Suppes, Patrick. 1960. “A Comparison of the Meaning and Uses of Models in Mathematics and the Empirical Sciences.” Synthese 12: 287-301.

———. 1962. “Models of Data.” In Logic, Methodology and Philosophy of Science: Proceedings of the 1960 International Congress, edited by Ernest Nagel, 252-261. Stanford University Press.

Swoyer, Chris. 1987. “The Metaphysics of Measurement.” In Measurement, Realism and Objectivity, edited by John Forge, 235-290. Reidel.

Tal, Eran. 2011. “How Accurate Is the Standard Second?” Philosophy of Science 78 (5): 1082-96.

Taylor, John R. 1997. An Introduction to Error Analysis: The Study of Uncertainties in Physical Measurements. University Science Books.

Thomson, William. 1891. “Electrical Units of Measurement.” In: Popular Lectures and Addresses, vol. 1, 25-88. London: Macmillan.

Trenk, Michael, Matthias Franke and Heinrich Schwenke. 2004. “The ‘Virtual CMM’ a software tool for uncertainty evaluation – practical application in an accredited calibration lab.” Summer Proceedings of the American Society for Precision Engineering.

Tsai, M.J. and Hung, C.C. 2005. “Development of a high-precision surface metrology system using structured light projection.” Measurement 38: 236-47.

van Fraassen, Bas C. 1980. The Scientific Image. Oxford: Clarendon Press.

———. 2008. Scientific Representation: Paradoxes of Perspective. Oxford University Press.

Wirandi, J. and Lauber, A. 2006. “Uncertainty and traceable calibration – how modern measurement concepts improve product quality in process industry.” Measurement 39: 612-20.

Woodward, Jim. 1989. “Data and Phenomena.” Synthese 79: 393-472.
