Measurement and Meaning in Biology
Author(s): David Houle, Christophe Pélabon, Günter P. Wagner, Thomas F. Hansen
Source: The Quarterly Review of Biology, Vol. 86, No. 1 (March 2011), pp. 3-34
Published by: The University of Chicago Press
Stable URL: http://www.jstor.org/stable/10.1086/658408


THE QUARTERLY REVIEW of Biology, Volume 86, No. 1, March 2011

MEASUREMENT AND MEANING IN BIOLOGY

David Houle* Centre for Ecological & Evolutionary Synthesis, University of Oslo, 0316 Oslo, Norway and Department of Biological Science, Florida State University Tallahassee, Florida 32306-4295 USA e-mail: [email protected]

Christophe Pélabon Department of Biology, Centre for Conservation Biology, NTNU N-7491 Trondheim, Norway

Günter P. Wagner Department of Ecology & Evolutionary Biology, Yale University New Haven, Connecticut 06405 USA

Thomas F. Hansen Centre for Ecological & Evolutionary Synthesis, Department of Biology, University of Oslo 0316 Oslo, Norway

keywords measurement theory, meaningfulness, dimensional analysis, scale type, scaling, significance testing, modeling

abstract Measurement, the assignment of numbers to attributes of the natural world, is central to all scientific inference. Measurement theory concerns the relationship between measurements and reality; its goal is ensuring that inferences about measurements reflect the underlying reality we intend to represent. The key principle of measurement theory is that theoretical context, the rationale for collecting measurements, is essential to defining appropriate measurements and interpreting their values. Theoretical context determines the scale type of measurements and which transformations of those measurements can be made without compromising their meaningfulness. Despite this central role, measurement theory is almost unknown in biology, and its principles are frequently violated. In this review, we present the basic ideas of measurement theory and show how it applies to theoretical as well as empirical work. We then consider examples of empirical and theoretical evolutionary studies whose meaningfulness has been compromised by violations of measurement-theoretic principles. Common errors include not paying attention to theoretical context, inappropriate transformations of data, and inadequate reporting of units, effect sizes, or estimation error. The frequency of such violations reveals the importance of raising awareness of measurement theory among biologists.

*Authorship is on a nominal scale.

The Quarterly Review of Biology, March 2011, Vol. 86, No. 1 Copyright © 2011 by The University of Chicago Press. All rights reserved. 0033-5770/2011/8601-0001$15.00

Introduction

Progress in science often involves quantification. Ideas progress from loose verbal accounts to become rigorous mathematical models. At the same time, concepts and entities progress from incomplete verbal definitions to become variables and parameters that derive their meaning from a precise theoretical context. For example, ecology developed from an unconnected set of verbal ideas in 1920 to a unified science with a rigorous foundation in mathematical population ecology by about 1970 (Kingsland 1985), and similar changes took place during the modern synthesis in evolutionary biology. Arguably, empirical progress results mainly from better measurement. Measurements improve either because of more rigorous theory, which defines what is important to measure, or better instruments, which allow more accurate measurements of familiar quantities or make the previously unknown measurable. The high status of quantification in science is therefore no surprise, nor is the desire of researchers to support their work with quantitative measurements.

For measurements to be meaningful, however, they must retain their connection to the theoretical and instrumental context from which they were derived. Measurement theory concerns the relationship between measurements and reality; its goal is ensuring that inferences about measurements reflect the underlying reality we intend to represent. Unfortunately, in biology, the connection between concepts and measurements is often lost during the measurement process. Quantitative measurements flourish, but they are often used and manipulated in ways that render the conclusions drawn from them meaningless. This conclusion is familiar to any participant in journal clubs or discussion groups.

Consider some brief examples: Dunn et al. (1999) devised an index of concern for the conservation of Canadian land birds by averaging ordinal indices for abundance, breadth of range, and evidence of population decline. Wolman (2006) pointed out that these averages are meaningless because ordinal indices do not reflect the magnitudes of the attributes they measure. Post and Forchhammer (2002) reported a negative correlation between climate and variation in abundance of caribou and musk ox on Greenland. Vik et al. (2004) showed that this result is meaningless because changing the units used to measure abundance could reverse this correlation. Diaz and Rützler (2001) reviewed studies of the importance of organisms on coral reefs, all of which used the area covered to measure abundance. Wulff (2001) pointed out that the most relevant attribute is biomass, and measuring area dramatically underestimates the biomass of sponges relative to encrusting organisms because sponge biomass scales with volume. Harvey and Clutton-Brock (1985) compiled a widely used source of primate body-size data. They presented means without standard errors, sample sizes, or references to the source of the data. Smith and Jungers (1997) traced the source of these data and found so many errors and inaccuracies that they concluded that “the data table ... should never have been acceptable as a source” (p. 549). We will make the case that such errors are not isolated mistakes or differences of opinion, but systemic errors in the scientific process that result from the absence of a theory of measurement and meaning in biology.

Scientific representation of biological reality also extends to the use of mathematical models to capture biological relations. Mathematical models are representations of hypotheses about reality, and their manipulation must be consistent with the reality they are meant to represent. Measurement in the usual sense assigns numbers to aspects of reality, whereas mathematical modeling assigns functions to aspects of reality. This representational aspect of modeling is often forgotten when specific models are used to represent general hypotheses. A typical example is Maynard Smith’s (1976) claim, based on a highly specific population-genetic model, that Zahavi’s handicap principle, in which individuals display costly traits to enforce honest signals of quality, cannot work. This claim was not a mathematical error but a representational error, because the specific functional forms assumed by Maynard Smith were not a complete representation of the universe of cases inherent in the handicap hypothesis. Pomiankowski (1987, 1988) and Grafen (1990a,b) showed that handicaps can evolve under reasonable conditions on the basis of models with more general functional forms.

Our purpose here is to bring a broad measurement theory framework for understanding the relationship between meaning and measurement to the attention of biologists. We believe that awareness of measurement theory helps us to do better science by providing tools to ensure the meaningfulness of our work. Our starting point is the field of representational measurement theory, a well-established mathematical approach that enables us to determine whether the relationships among numerical measurements are consistent with the relationships among the attributes they are meant to represent. Representational measurement theory is virtually unknown in biology, so our education in the theory has come mostly from nonbiological sources (Stevens 1946, 1968; Krantz et al. 1971; Suppes et al. 1989; Luce et al. 1990; Hand 1996, 2004; Sarle 1997; Michell 1999). In a few exceptional cases, biologists have taken an explicitly measurement-theoretic approach (e.g., Stahl 1962; Rosen 1978b; Wolman 2006), but these seem to have had little general impact.

We also argue that measurement theory can be used more broadly as the basis of a theory of scientific meaning that extends beyond the mathematical theory of representational measurement to include explicit consideration of what to measure and what conclusions can be drawn from those measurements. We suggest the term “conceptual measurement theory” for this broader approach to the relationship between measurements and meaning. This broader aspect of extracting meaning is more familiar to biologists, as we are relatively well-versed in how to think about hypotheses and their testing. Our claim is that thinking about these as part of the measurement process is tremendously useful. We have found that this broader conception of measurement has followed naturally as we have begun to incorporate explicit representational measurement theory into our own work (Wagner et al. 1998; Hansen and Wagner 2001a,b; Wagner and Laubichler 2001; Hansen and Houle 2008; Wagner 2010). Measurement theory provides a language for discussion of a part of scientific practice that is both critical to good science and frequently done poorly. Consequently, we will use the term measurement theory to refer to all aspects of extracting meaning from observations, experiments, and models.

In this review, we first summarize the basic themes of formal, representational measurement theory. We then lay out broader conceptual measurement principles. We drive home the importance of these points by making the case that violations of these basic measurement principles are widespread in biology, as in the examples briefly touched on above. Finally, we discuss what can be done to incorporate measurement theory into the education and day-to-day thinking of biologists.

Measurement

We naturally identify entities that exist in the world (with varying degrees of distinctness) and conceive of attributes of those entities that seem important to us. The idea that some attributes are more deserving of attention than others immediately makes clear that a priori ideas about what is important and interesting underlie all measurement. For example, if our entities are individual organisms, we might be interested in the attribute “size,” which might predict which individuals will leave more offspring. Implicit in this simple statement are at least five complex concepts. First is the theoretical context that leads us to care about the number of offspring individuals produce, which might be, among many other possibilities, an evolutionary, an ecological, or an economic context. Second is at least a hypothesis and possibly prior evidence that size might be an important predictor of number of offspring. Third is an implicit definition of size. Fourth is a notion of what constitutes a countable offspring. Finally, we must circumscribe the set of organisms to which our hypothesis is applied.

Measurement consists of an assignment of numbers to attributes of entities so that the relations between the numbers can capture empirical relations among the attributes (Krantz et al. 1971; Luce et al. 1990; Hand 1996, 2004). When what is true about the relations of the numbers is true about the relations of the attributes, the conclusions we draw from the numbers are meaningful conclusions about nature. Many numerical relations could help to capture empirical realities, including order, differences, ratios, and equivalence. Which of these is appropriate depends on our hypothesis about what might actually matter, the actual empirical relations, and how measurement is done.

A simple example of the measurement process is shown in Figure 1. The theoretical context of sexual selection has led to a hypothesis that the length of a male guppy predicts his attractiveness to potential mates. The empirical relationship of lining up two fish side-by-side and recording which is longer is captured instead by measurement of length in standard units. The conclusions about length that can be drawn from the pair-wise empirical comparisons can also be drawn from numbers properly assigned to lengths. Drawing conclusions about the relationship between male lengths and the larger theoretical context of male attractiveness to females then proceeds by the usual scientific process, incorporating additional measurements and experiments. Note that whether the original hypothesis is true or not is not important to the measurement process. What is important is that each of the steps from hypothesis to identification of entities and attributes to measurements and conclusions be well motivated and consistent. False hypotheses will tend to be rejected when all these connections hold.

Figure 1. The Measurement Process
We imagine a study where the theoretical context is sexual selection. Within this context, we focus on attractiveness of males and then on the hypothesis that size influences attractiveness in the guppy. Size can be measured in many ways, so the concept of “size” was referred to the attribute “overall length of the fish” under the hypothesis that females prefer large males and treat the tail as part of the body. The middle third of the figure represents the empirical relations that are at the root of measurement. For empirical comparison of length, fish could be aligned with their noses against a flat surface, and the identity of the fish that extends farther noted. For a pair of fish A and B, the possible results are that A extends farther than B, which we can represent as A ≻ B; that we cannot decide whether A extends farther than B, A ∼ B; and that A extends less far than B, A ≺ B. Representational measurement theory proves that the conclusions that can be drawn about length on the basis of the pairwise empirical comparisons can also be drawn from a numerical system consisting of numbers (a and b) assigned to lengths of A and B plus a mapping of the empirical relations ≻, ∼, ≺ among fish to the relations >, =, and < among the numbers.

Representational Measurement Theory

Representational measurement theory is a mathematical system for determining when the relations among the numerical measurements assigned to attributes reflect the corresponding empirical reality (Krantz et al. 1971; Suppes et al. 1989; Luce et al. 1990; Hand 1996, 2004; Sarle 1997). The clearest introduction to representational measurement theory is Chapter 1 of Krantz et al. (1971), and we adopt their terminology here. The relationship of attributes in the world is an empirical relational structure and consists of a set of entities (e.g., organisms, genotypes, genes, or proteins), the empirical operations that can be made with those entities (e.g., comparing, combining, mutating, or placing in an environment), the attributes that arise from those operations (e.g., contest outcome, trait value, or fitness), and the inferences that can be made from comparison of those attributes. The empirical relational structure is then represented by a numerical relational structure in which attributes are mapped to numbers and empirical operations and comparisons are mapped to mathematical relations (greater or less than) and operators (addition or multiplication). A numerical relational structure has the property of meaningfulness when inferences about numbers can be translated into inferences about the original entities (Weitzenhoffer 1951; Luce et al. 1990). Which kinds of numerical relationships are meaningful in this sense is a natural way to define scale type. Thus, scale type encapsulates what properties of a set of measurements could be used to draw empirically meaningful conclusions.

Assignment of numbers to attributes, the act of measurement, usually depends on just a few empirical relations (reviewed in Krantz et al. 1971, Chapter 1; Hand 2004). Primary among these is a procedure by which the order of attributes can be established: that an elephant is bigger than a mouse, that 17 May 1814 is later than 4 July 1776, or that animal A is socially dominant to animal B. For example, in Figure 1, the empirical procedure is to place fish next to each other in a standard orientation, with their snouts against a flat object, then see which fish’s tail extends farther. This simple operation gives us both the order of lengths and operational equivalence, when we judge that we cannot tell which of the two extends farther. A second empirical operation critical to many measurements is combining entities so that their attributes are also combined in a meaningful way. This is termed concatenation. For the fish in Figure 1, we could cut rods equivalent to the length of each fish, then lay the rods end to end to ask questions such as: “Is twice the length of fish A greater or less than the length of fish B?” When concatenation operations are possible, we can also construct a standard sequence of concatenated entities (a ruler, for example) that permits counting of attributes in standard units of measurement. A second example is measurement of mass. The empirical operation of placing an elephant and a mouse on a balance will tell us that the elephant is heavier because the scale tips toward the elephant. We can then “concatenate” mice by placing more than one mouse (or other more convenient objects that are each equivalent in weight to a mouse) on the scale, to determine how many mouse equivalents are necessary to make something as heavy as an elephant, that is, to measure the mass of the elephant in mouse units. The empirical relational structure includes the animals we wish to compare and the operations of concatenation (e.g., placing more than one mouse on a scale) and comparison (does the scale tip toward the elephant or the mice?).

Measurement then proceeds by assignment of numbers to attributes and mathematical operations (such as addition) to empirical operations (such as placing objects on a scale); together the numbers and the operations define a numerical relational structure. When both order determination and concatenation are empirically relevant, these operations can be mapped to a numerical relational structure that uses positive real numbers to represent the attribute, addition to represent concatenation, and the > operator for comparisons. Measurement in these cases is termed extensive measurement; both lengths and weights are therefore extensively measurable attributes.

A second important type of empirical relational structure arises when measurement depends on paired comparisons because no natural concatenation operation is possible; this is termed intensive measurement. This approach was developed in psychology for measurement of attributes such as attractiveness or utility of objects to test subjects (see, e.g., Luce 1959). Subjects are confronted with pairs of objects, such as faces of humans, and asked to rank them. With repeated trials and under certain well-defined conditions, an overall judgment of the quality of the individual can be made, for example, the “intrinsic attractiveness” of a face. Recently, a similar approach has proven useful for defining a measure of fitness based on pairwise competition experiments, for example, among a set of bacterial strains (Wagner 2010 and see below).

Representational measurement theory proves which numerical relational structures can correspond precisely to the empirical results one would obtain using the actual entities, such as fish or mice and elephants. Narens (1981, 1985) and others (Luce et al. 1990, Chapter 20) have shown that only a small number of such numerical relational structures can preserve meaningfulness given these types of simple empirical relations. More familiarly, these structures define the scale types first identified by Stevens (1946, 1959, 1968). The scale types relevant to biological systems are listed in Table 1 with examples of each. Figure 2 shows examples of scale types resulting from measurements on a set of fish. We emphasize that scale types can be characterized in several different ways: most fundamentally, they are determined by the empirical operations that the scientist wishes to represent using numerical operations. It is very important to note that the theoretical context is an irreducible part of this formulation. The same data (for example, lengths) may reside on a different scale type depending on the question asked. We give specific examples of this relationship between scale type and hypothesis below. A second convenient characterization of scale type refers to the kinds of mathematical relationships among the measurements that are potentially meaningful, and this has led to the names of the scale types: nominal, ordinal, interval, log-interval, difference, ratio, signed ratio, and absolute. Third, the scale types can be formally characterized by three related properties: the permissible transformations that preserve the relevant relationships among measurements, the number of arbitrary parameters that must be adopted to establish the numerical system, and the domain of numbers (or symbols) to which they apply.

TABLE 1
Classification of scale types (after Stevens 1946, 1959, 1968; Luce et al. 1990:113)

Scale type     Permissible transformations            Domain                 Arbitrary parameters  Meaningful comparisons      Biological examples
Nominal        Any one-to-one mapping                 Any set of symbols     Countable             Equivalence                 Species, genes
Ordinal        Any monotonically increasing function  Ordered symbols        Countable             Order                       Social dominance
Interval       x -> ax + b                            Real numbers           2                     Order, differences          Dates, Malthusian fitness
Log-interval   x -> ax^b, a, b > 0                    Positive real numbers  2                     Order, ratios               Body size
Difference     x -> x + a                             Real numbers           1                     Order, differences          Log-transformed ratio-scale variables
Ratio          x -> ax                                Positive real numbers  1                     Order, ratios, differences  Length, mass, duration
Signed ratio*  x -> ax                                Real numbers           1                     Order, ratios, differences  Signed asymmetry, intrinsic growth rate (r)
Absolute       None                                   Defined                0                     Any                         Probability

* Luce et al. (1990) defined this ratio scale but did not discuss or name it. Stevens did not consider this scale.

Figure 2. Representational Measurement Theory
A set of entities, individual fish, can be associated with different empirical relational structures depending on what attributes we focus on and the context we are interested in. Each attribute induces more or less strict order and equivalence relations among the fish. Representational measurement consists of mapping these relations to a numerical system such that the relations among the attributes are preserved in the relations among the numbers assigned to them. The scale types refer to mappings that preserve different relations. The attribute of mass illustrates a ratio scale type. In many conceptual contexts, for example, studies of metabolic rate, a compelling logical reason favors adoption of mass as the appropriate attribute to measure. In this case, we require a mapping that preserves the order of all fish according to mass (as illustrated) and also the order of differences and ratios between their masses. We can use a standard unit (say, the gram) that we could add up to describe the mass of any of the fishes. The only operation we could apply to our measurements that would preserve these properties would be multiplying by a constant (for example, changing the unit from gram to pound). If we were interested in size, but had no reason to choose mass as a measure of size over other possible measures such as length or area, then we would have to admit raising to a power to the permissible transformations, yielding the log-interval scale. On this scale type, the order of differences in fish size is meaningless because the order of differences can change depending on the choice of exponent, but the order of their ratios is meaningful because the order of ratios does not change with the choice of exponent. Note that whether each of these measures of body size is on a ratio or log-interval scale depends on the conceptual context, not on the attribute itself. The residuals of a regression of tail length on body length measure relative tail size. This example illustrates an interval scale type, where differences between relative tail sizes are meaningful (i.e., a natural unit exists), but their ratios are meaningless. The differences themselves reside on a signed ratio scale, so we can say that the relative tail area of the left fish differs from that of the second fish (0.39) by 11% more than that of the last fish differs from that of the third fish (0.35). If we seek only to represent order, as can be illustrated by contrast in the context of male attractiveness judged by females, we have an ordinal scale type. In that case, statements such as “the difference between fish A and fish B is larger than the difference between fish B and fish C” are meaningless. Finally, a nominal scale type, as illustrated by imaginary throat morphs, preserves only equivalence relations; order is meaningless.

An understanding of scale types is most easily developed from specific examples. For the examples of length and weight developed above, note that the empirical relations can be used to establish the order of attributes (fish A is longer than fish B) and their differences (elephant A is heavier than elephant B by 20 mouse units) and ratios (the concatenation of two fishes the length of A is equivalent to the length of fish B). Any numerical relational system that reflects all of those relationships is by definition on a ratio scale. Positive real numbers are the domain of the measurements, as a negative weight or length has no physical equivalent. To map attributes to numbers, we had to specify one arbitrary parameter in these cases: a unit of length or mass. Permissible transformations of the numerical data are defined as those transformations that preserve the correspondence of the numerical relationships to all the empirical relationships that could be measured. For ratio scales the only permissible transformation is multiplication by a constant. To see this, assume that we have four entities and that the empirical attributes, say lengths, are represented by the letters A, B, C, and D. We then measure the four lengths and map the nonnumerical attribute A to a number a, the attribute B to b, and so forth. If we can show, with an empirical operation, that the number of As that must be concatenated to be equivalent to B is less than the number of Cs that must be concatenated to be equivalent to D, we want our numerical measures to have the properties a/b > c/d and a - b > c - d. Multiplying all of the numerical measurements by the same constant obviously preserves the truth of these statements; any other type of transformation, such as exponentiation or adding a constant to each number, can readily lead to violation of one or both statements.

Log-transformation of ratio-scale variables places the data on a new scale type, the difference scale, with the domain of the real numbers. On this scale type, differences are equivalent to ratios of the corresponding exponentially transformed values. The only permissible transformation is addition of a constant.

Body size is fundamental to much of biology, and a considerable literature addresses allometry and scaling relationships between size and a wide range of other phenotypes (e.g., Schmidt-Nielsen 1984). Inspection of this literature, however, reveals two features. First, in most cases, measures of size are log-transformed before analysis. Second, measures of size can be measured on a linear, area, or volume scale, and these are treated interchangeably (Solow and Wang 2008). For example, to investigate Cope’s Rule that body size tends to increase within lineages, Alroy (1998) studied the log body mass of fossil mammals predicted from linear tooth dimensions, whereas Hone et al. (2005) used log bone length. The interchangeability of these scales in practice suggests that biologists consider raising measures of size to a power to be a permissible transformation. Choice of (for example) linear over volume measures is seen as a matter of convenience or convention. A key implication of this practice is that ratios are meaningful (the order of ratios, say a/b > c/d, is preserved when one raises all the measurements to a power) but that the order of differences is not. For example, say that a = 5, b = 1, c = 10, and d = 7. On this scale, a/b > c/d (5 > 1.43) and a - b > c - d (4 > 3). When the measurements are raised to the power of 2, the ratio a^2/b^2 is still greater than c^2/d^2 (25 > 2.04), but the order of the differences is reversed: a^2 - b^2 < c^2 - d^2 (24 < 51). The scale type with these permissible transformations is the log-interval, so named because once the data are log transformed the ratios that are meaningful on the original scale become differences residing on an interval scale type.

A key example of the context dependence of scale type now presents itself: above, we claimed that body lengths and body mass are on a ratio scale, and now we say that they are on a log-interval scale. The difference is that different theoretical contexts are applied in the two cases. When a measure of length is treated as a measure of length, it is on a ratio scale. For example, if, as in Figure 1, we hypothesize (or have evidence) that female guppies really choose their mates on the basis of body length, rather than volume, then the choice is dictated by this hypothesis. In an allometric context, we choose to regard powers of body size as potentially relevant to the questions. This distinction is a key part of our argument that measurement theory must incorporate theoretical context and is therefore relevant to the wider issues of scientific reasoning.

If our data consisted of dates of occurrences, the interval scale type would be appropriate. The key difference between interval and ratio scales is that the empirical relation of concatenation does not apply: one cannot directly concatenate dates. How would you decide how many 4 July 1776s add up to give you a 17 May 1814? Without this operation, the magnitudes of dates cannot be directly judged; the ratio of dates does not correspond to an empirical operation and is therefore not meaningful. We can, however, use the difference between two dates, say January 1 and January 2, as our concatenatable unit, that is, how long we wait until the date changes. Then the difference between 4 July 1776 and 17 May 1814 can be expressed as the number of days on top of 4 July 1776 that are needed to reach 17 May 1814. The permissible transformations therefore include addition of a constant, for example, adopting the Judaic or the Islamic calendar instead of the Gregorian calendar, as well as multiplication by a constant (adopting the Julian year, rather than the Gregorian one). To arrive at this system, we must specify two arbitrary parameters, the starting and stopping dates we use as a standard unit of time. The use of concatenation in measuring the difference between two dates or temperatures means that the differences between dates are extensively measurable and therefore on a ratio scale.

The case of an interval scale allows us to illustrate the important distinction between scale and scale type. Other familiar examples of interval scale type are the Celsius and Fahrenheit temperature scales. All temperature scales based on arbitrary numerical assignments to two states, such as freezing and boiling points of water, are on the interval scale type regardless of the unit of temperature used, but once a unit of temperature has been chosen, we have a scale on which units are essential to the interpretation of the numbers. When we say, on the basis of scale type, that differences can be meaningfully compared, what we mean is that the relationships of the differences are preserved under the permissible transformations and not that the numerical values of the differences are preserved. For example, if temperature changes in one hour from 8 to 10 degrees Celsius, and then in the second hour to 11 degrees C, we can conclude that the temperature rose twice as fast in the first hour as in the second hour. If we convert to the Fahrenheit scale, the temperatures (46.4 to 50 to 51.8 degrees F) and the differences (3.6 and 1.8 degrees F) change, but temperature can still be inferred to have risen twice as fast in hour one as in hour two.

Now consider the case of social dominance, in which animals can be placed on a simple linear hierarchy. The single relevant empirical operation is to place two animals together and observe which one is dominant. The outcome of such a contest does not directly tell us anything about the magnitude of the difference between animals. No empirical operation corresponds to concatenation, either of the animals themselves or of their differences. Even if A and B are both dominant to C, this does not necessarily tell us anything about the relation between A and B. Therefore, any numerical system that preserves the order is permissible, and the scale type is ordinal. The assignment of numbers to submissive and dominant individuals could as well be 1 and 2 as 1 and 1000 or 999 and 1000.

The absolute scale type differs from the others in that no transformation is permissible. For example, the axiomatic definition of a probability constrains its domain to the interval between 0 and 1 and specifies the precise meaning of any value on that range. An arcsine square-root transformed probability is not a probability.

In Table 1, we also consider two examples of a relatively unknown scale type, the signed ratio scale. This type had previously been identified on mathematical grounds (Luce et al. 1990), but to our knowledge no real examples have been discussed. This scale differs from the ratio scale in that the domain includes all real numbers, so ratios have a sign that can be negative or positive. For example, for a measurement of signed asymmetry (length of structure on the left side of the body minus length of the same structure on the right side), both the ratio of the asymmetry and whether the asymmetry is in the same direction are meaningful. Similarly, the intrinsic rate of population increase has both a magnitude and a sign.

Pragmatic Measurement Theory

In many situations, the relationship between attributes and the numerical values we use to represent them is not entirely clear. At one extreme, these uncertainties stem from quantifiable factors such as measurement error that need not challenge our understanding of the empirical relational system. At the other extreme, we may be studying an attribute about which we cannot be sure what measurements can actually represent it or even whether a hypothesized attribute actually exists.
Numerous examples we could duplicate animal A and put two of such attributes exist in behavioral re- As with one B, the result would no longer search, and these have been much discussed tell us anything about pairwise dominance. in the application of measurement to psy- Similarly, knowing that both A and B are chological research (see, e.g., Michell 1999). March 2011 MEASUREMENT AND MEANING IN BIOLOGY 13
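The numerical claims in the scale-type examples above can be checked with a few lines of Python. The sketch below is illustrative only: the function and variable names are assumptions introduced here and do not come from the original text. It checks that squaring the measurements preserves the order of the ratios while reversing the order of the differences, and that converting Celsius to Fahrenheit changes the differences but not their ratio.

```python
import math


def ratio_and_difference_orders(a, b, c, d, power=1):
    """Return (a/b > c/d, a - b > c - d) after raising each value to `power`.

    Hypothetical helper for illustration; a power transformation is the
    permissible transformation of a log-interval scale.
    """
    a, b, c, d = (value ** power for value in (a, b, c, d))
    return (a / b > c / d, a - b > c - d)


def celsius_to_fahrenheit(t_celsius):
    """Affine map between two scales of the interval scale type."""
    return t_celsius * 9.0 / 5.0 + 32.0


# Values taken from the text: a = 5, b = 1, c = 10, d = 7.
original_orders = ratio_and_difference_orders(5, 1, 10, 7)
squared_orders = ratio_and_difference_orders(5, 1, 10, 7, power=2)

# Temperatures from the text: 8, 10, and 11 degrees Celsius.
temps_c = [8.0, 10.0, 11.0]
temps_f = [celsius_to_fahrenheit(t) for t in temps_c]

# Ratio of the first-hour rise to the second-hour rise on each scale.
rise_ratio_c = (temps_c[1] - temps_c[0]) / (temps_c[2] - temps_c[1])
rise_ratio_f = (temps_f[1] - temps_f[0]) / (temps_f[2] - temps_f[1])

print("orders on original scale (ratio, difference):", original_orders)
print("orders after squaring (ratio, difference):   ", squared_orders)
print("rise ratios agree:", math.isclose(rise_ratio_c, rise_ratio_f))
```

The comparison of the two rise ratios uses `math.isclose` rather than `==` because the Fahrenheit values are computed in floating point.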

For example, it is still controversial whether the perception of sensory stimuli leads to a quantitative attribute called sensation, or whether intellectual ability is really a quantitative trait. These uncertainties arise particularly when the goal of measurement is to predict some future outcome, without necessarily having any underlying model of the relationship between the attributes that are measured and the outcome (Breiman 2000).

These considerations have occasioned considerable debate about the usefulness of representational measurement theory. In the extreme, some claim that representation is an illusion and that measurements capture only what the measurement procedure measures. This view leads to the philosophical position of operationalism, in which scientific concepts and variables are defined solely in terms of their measurement procedures (Bridgman 1927). In this view, intelligence is nothing more than what is measured by intelligence tests (Boring 1945). Sneath and Sokal (1973), for example, used operationalism to argue for quantification and algorithmic decision-making in biological classification because it would be repeatable and objective, consciously giving up the goal that such classifications would reflect evolutionary history. Such positions contrast sharply with representational measurement theory, which assumes that the point of measurement is to ensure meaningful statements about an empirical relational structure that may exist whether we are there to measure it or not.

Today, pure operationalism is largely discounted by most scientists, but many authors acknowledge that many measurements are not purely representational, even when representation of an empirical system is the goal (Hand 1996, 2004). Hand (2004) called these nonrepresentational aspects of measurement pragmatic. The more pragmatic the measurements are, the greater the uncertainty about whether the measurements actually represent the attribute we want to characterize and, therefore, about the potential meaning of the results. Pragmatic considerations often arise when researchers must choose between several alternative measurements of the same underlying entity. For example, we might wish to measure the attribute “sexual attractiveness to females” of a sample of males in a population, but unless we fully understand what goes on in the brain and nervous system of those females, we cannot know how best to assess attractiveness.

McGhee et al. (2007) measured two aspects of male-female mating interactions in the bluefin killifish (Lucania goodei): the time a female spent associating with each of two confined males and which of two males was successful when two males and one female were placed in a single tank with no barriers. Association and mating success were poorly correlated, and which of them is a better measure of attractiveness is unclear. One interpretation of McGhee et al.’s finding is that female interest in the confined males measures attractiveness, which then gets overwhelmed by male-male interactions. A second is that male-male interactions furnish additional information on mate suitability to females, changing their preferences (Wiley and Poston 1996). A measurement-theoretical response would suggest additional investigation of the link between the measurement and the underlying entity (e.g., Michell 1999), but this is often a very difficult task. Most studies make the pragmatic choice of one measure of an attribute such as attractiveness, without verifying that it is the best such measure. This pragmatic approach may be the best way forward because a standard assay may at least be comparable across different experiments. The possible shortcomings of one’s measure should be borne in mind and reevaluated when the opportunity arises. Humility and common sense are necessary complements to representational measurement theory.

Discussion of these issues can be facilitated by the concept of validity, which “describes how well the measured variable represents the attribute being measured” (Hand 2004:129). Validity is widely discussed in the social and behavioral sciences, and to some extent in medicine, particularly with respect to the representation of mental states. It can be usefully applied in evolutionary biology as well.

taking a broad view of measurement theory

Although the formal results of representational measurement theory are not familiar to most biologists, the fundamental issues that they address are the concern of every working scientist: how we can best understand reality. All measurement stems from a theoretical or conceptual context; good science depends on preserving the connection of measurement, data handling, analysis, and interpretation to that conceptual context. Our provisional ideas about the study entities specify the relevant attribute and, thus, the hypothesized empirical relational structure we should study. Once this step is taken, representational measurement theory ensures that our numbers reflect the necessary empirical relations. Specifying an empirical relation system to study and drawing conclusions from the observed empirical relations back to the concepts (i.e., “conceptual measurement theory”) is part of the measurement process.

The importance of concepts and hypotheses for measurement can be illustrated with the question: Why are giraffes (Giraffa camelopardalis) so tall? The obvious explanation proposed by many (e.g., Darwin 1871, Chapter 7) is that giraffes are tall so that they can browse on tall trees. When this hypothesis is investigated, height is clearly the relevant attribute. Alternatively, Simmons and Scheepers (1996) proposed that giraffe necks are an adaptation for male-male competition. Male giraffes compete for dominance by “necking,” combat in which males swing their armored heads at one another. This hypothesis suggests that height is not the most informative aspect of form to measure; instead measurements might focus on the amount of force that can be delivered by the swinging head, which must be a nonlinear function of the mass of the head and length of the neck. The cause of the elongated shape of giraffes has not been resolved (Cameron and du Toit 2007), but different hypotheses clearly assign different meanings to the same phenomenon.

The importance of multiple hypotheses in this example suggests a measurement-theoretic explanation for the effectiveness of “strong inference” (Platt 1964) and the “method of multiple working hypotheses” (Chamberlin 1890, 1965). The common element to these approaches is that the researcher should seek to consider all reasonable hypotheses simultaneously rather than to focus on only one. In Chamberlin’s evocative terms, “the investigator thus becomes the parent of a family of hypotheses: and, by his parental relation to all, he is forbidden to fasten his affections unduly upon any one” (Chamberlin 1890, 1965:756). In the measurement context, the investigator considers that the same phenomenon may reflect a variety of underlying empirical relational systems and, thus, that a variety of measures may be desirable. The same data may have a different meaning under each hypothesis.

Measurement Theory as an Aid to Modeling and Statistics

The premise of measurement theory is that we make measurements to learn about an empirical relational structure. Hypothesized empirical relational structures are also the basis for theoretical models, so an important but underappreciated role for measurement theory is as a guide to the definition and identification of parameters in models. A good definition must identify forms and parameters that are operational in that they represent the relevant empirical relations and can also be related to obtainable data. This principle is linked to Lewontin’s (1974) conceptions of a dynamically sufficient system as one that would allow prediction if its parameters were known and an empirically sufficient system as a dynamically sufficient system whose parameters are estimable with sufficient accuracy to allow prediction. Dynamically sufficient parameters are meaningful from a measurement-theory perspective. Empirically sufficient ones lead to actual measurements that are meaningful.

The quantitative genetic concepts of additive effects and additive genetic variance

are good examples of variables that are both dynamically meaningful and empirically operational. Fisher (1918) developed an abstract model of the genotype-phenotype map and proposed population-level measures of the effect of allele substitutions, the average excess, and the additive effect (Fisher 1941). The additive genetic variance is the population variance of these effects (summed over loci) and, because the response to selection depends on variance (Fisher 1930; Price 1970), the additive variance emerges as a key parameter for prediction of the short-term evolvability, the capacity of a population to evolve (Houle 1992; Wagner and Altenberg 1996). Fisher’s measures are approximately dynamically sufficient in that they capture that part of the effect of alleles on the phenotype that determines how their frequencies change under an episode of selection, although they are not dynamically sufficient for long-term evolution (Frank 1995). In addition, Fisher also showed that these parameters are all estimable from data on the phenotypes of relatives and thus empirically sufficient for the task of predicting short-term response to selection.

dimensional analysis

Dimensional analysis is perhaps the most important practical application of measurement theory in theoretical physics and should be integral to all model building. In its simplest form, it requires that mathematical models be dimensionally consistent (i.e., that the units on the left- and the right-hand sides of an equation be equal; apples cannot equal oranges), and it provides a technique for reducing the model to a minimum number of dimensionless essential parameters. Despite sporadic applications (e.g., by Stahl 1961, 1962; Rosen 1962, 1978a,b; Gunther 1975; Heusner 1982, 1983, 1984; McMahon and Bonner 1983; Prothero 1986, 2002; Stephens and Dunbar 1993; Gunther and Morgado 2003; Frank 2009), more formal dimensional analysis has played little role in biological theory or modeling. This is reflected in the arbitrary assumptions about functional forms that are commonly found in biological models.

The key result in formal dimensional analysis is Buckingham’s $\pi$ theorem (Bridgman 1922, Chapter 4; Krantz et al. 1971, Theorem 10.4). This theorem considers a putative law or model that relates a set of ratio-scale variables, $x_i$, as $f(x_1, \ldots, x_n) = 0$, where $f$ is an arbitrary function. The units of these variables can be used to identify a set of fundamental variables (and scales) that span the $n$ variables $x_i$ (for the technical meaning of “span” see Krantz et al. 1971). If exactly $m$ fundamental scales exist, the $\pi$ theorem guarantees that the law can be reformulated as a function of $n - m$ new variables that all are products of powers of the original variables as $g(\pi_1, \ldots, \pi_{n-m}) = 0$, where $\pi_i = x_1^{a_1} \cdot x_2^{a_2} \cdots x_n^{a_n}$ and each $\pi_i$ has a different combination of exponents, $a_j$.

If it is known what variables enter a problem, this theorem can be surprisingly powerful in simplifying and solving models. As an example, it can be used to derive the basic exponential growth equation of population ecology from minimal assumptions. Assume that all we know is that population growth involves the three ratio-scale variables population number at time $t$, $N(t)$; population number at time zero, $N(0)$; time, $t$; and a signed ratio-scale variable, the rate parameter, $r$. Our model or “law” of population growth then takes the form $f(N(t), N(0), r, t) = 0$, for an unspecified function $f$, where we also assume that $N(t)$ can be uniquely expressed by the rest of the variables. The units of $N(t)$ and $N(0)$ are individuals (or individuals per area), the unit of time is a time interval (e.g., seconds or generations), and the unit of $r$ is the inverse of the time interval. We therefore have four variables and two fundamental scales, which are individuals and time. The $\pi$ theorem implies we can write the model in terms of two dimensionless variables that will have to involve powers of $N(t)/N(0)$ and powers of $rt$. The result is $g(N(t)/N(0), rt) = 0$ for some function $g$. By the assumption that $N(t)$ is a function of the other variables, this function can then be rewritten as $N(t) = N(0)h(rt)$ for some function $h$. Because $h$ does not depend on any parameters beyond $rt$, this implies that $N(t) = N(s)h(r(t - s))$ for any time $s < t$. Hence, we must have $N(0)h(rt) = N(s)h(r(t - s))$, and that implies

$$h(rt) = \frac{N(s)}{N(0)}\,h(rt - rs) = h(rs)\,h(rt - rs),$$

which gives us the functional equation $h(rt) = h(rs)h(rt - rs)$. Let $k(x) = \ln(h(x))$ and write the equation as $k(rt) = k(rs) + k(rt - rs)$. This shows that $k(x) = ax$ for some constant, $a$, and therefore that $h(x) = e^{ax}$. By subsuming the constant $a$ into $r$, we have derived the exponential growth equation:

$$N(t) = N(0)e^{rt}$$

for $r \geq 0$ (a similar argument based on $r < 0$ completes the derivation). Note that we did this without specifying any model of population growth. The law of exponential growth follows entirely from knowing the relevant variables and their scales. Such results are, at first, very surprising, but they signal the general power of dimensional analysis. Knowing which variables are relevant to a problem conveys a huge amount of information. Of course, the law of exponential growth may not hold if other variables or parameters were involved. If, for example, we add a carrying capacity, $K$, with the same units as $N$, the $\pi$ theorem would stipulate three dimensionless variables in $g$, and additional assumptions would be needed to produce a specific solution. The approach will give misleading results if some relevant variable or parameter is omitted.

Among earlier applications of dimensional analysis in biology, we particularly note Robert Rosen’s program for theoretical biology, in which dimensional analysis played a central role (Rosen 1978b). For example, Rosen (1962) derived D’Arcy Thompson’s (1917) theory of transformations of biological form from dimensional arguments based on fitness optimization. His development was, however, highly abstract and, as far as we know, it has not led to empirical research. Dimensional analysis has also played a minor role in discussion of physiological scaling relationships (Stahl 1961, 1962; Heusner 1982, 1983, 1984; but see Butler et al. 1987 for a devastating critique of Heusner’s use of dimensional arguments), and such arguments are also central to Charnov’s (1993) theory of life-history invariants, although Charnov does not make the connection to measurement theory explicit. Stephens and Dunbar (1993) developed applications of dimensional analysis in behavioral ecology with admirable clarity. They showed how the marginal-value theorem (Charnov 1976) can be partially derived and illuminated by formal dimensional analysis, and they showed that certain models of optimal territory size from the literature are in fact dimensionally inconsistent.

We believe that dimensional analysis has been underused in biology. It has the potential to clarify the necessary and sufficient assumptions for fundamental models and concepts in biology along the lines we have briefly illustrated for the exponential growth law.

meaningful modeling

The formal derivation of the exponential growth law from dimensional considerations shows that it has a status similar to the familiar “laws” of physics, such as Newton’s law of gravitation. To see the analogy one must appreciate that Newton’s law is also an abstraction for the most symmetrical case. The gravitational law applies only to a rotationally symmetrical gravitational field, which does not exist in the solar system because of variations in the density of matter in the planets and the presence of other bodies. A similar argument can be made for the general validity of Wright’s selection equation, which can also be directly derived from measurement-theoretical considerations (Wagner 2010). Biology includes rather few such laws, however, and we must accept that most models in biology are representations of qualitative relationships that are too complex to capture in simple law-like relations. The complexity and high dimensionality of the relevant empirical relational structures require that some aspects of representation must necessarily be given up.

Levins (1966) described three modeling strategies based on which aspects of reality are left out. In type I models, generality is sacrificed in favor of precision and realism, as in highly complex fisheries models in which as many specific aspects as possible of the stock in question are included. In type II models, realism is sacrificed to generality and precision, as in the Lotka-Volterra models of population ecology. In type III models, precision is sacrificed to generality and realism, as in Levins’s own models of selection in heterogeneous environments based on general qualitative assumptions about the fitness function (e.g., Levins 1962).

Levins himself favored a research strategy in which type III models or multiple type II models are used to reach robust qualitative insights and predictions. This approach contrasts sharply with that taken in most modeling papers in the fields with which we are familiar. Typically, a question is investigated by analysis of a single specific model of type I or II. In many cases, results are obtained solely by simulations based on algorithms including numerous auxiliary specifications, so the results necessarily have limited applicability. Type I models can only make predictions for highly specific circumstances, and type II models can only demonstrate the possibility of a phenomenon, not its plausibility. These limitations are often forgotten in theoretical biology, where general conclusions are routinely drawn from specific models. In the introduction, we mentioned Maynard Smith’s (1976) claim, based on a type II model, that the handicap principle could not work. Maynard Smith’s claim was later shown to be incorrect (Pomiankowski 1987, 1988; Grafen 1990a,b) and the handicap principle is now a cornerstone of many sexual-selection models. Using measurement theory to diagnose the problem, we can see that arbitrary specific model assumptions, such as linearity, put constraints on the empirical relational structure that prevent it from capturing the full range of possibilities in the idea being modeled.

This dynamic of overly broad claims from narrow models being overturned by type III models is quite common. For example, Broom et al. (2005) claimed on the basis of specific families of cost and benefit functions that automimicry, a situation in which some defenseless (e.g., nonpoisonous) individuals exist within a generally aposematic species, cannot be maintained as a stable genetic polymorphism. Svennungsen and Holen (2007) considered a wider variety of functions and showed that automimicry can indeed be evolutionarily stable under a number of realistic conditions. A second example is the large literature on whether plasticity, learning, and variation increase or decrease a response to selection, known as the Baldwin effect (Baldwin 1896). Multiple models have been published, some predicting the acceleration of evolution associated with the Baldwin effect (e.g., Hinton and Nowlan 1987) and others favoring the opposite result of slower evolution (e.g., Borenstein et al. 2006). Paenke et al. (2007) demonstrated that the conclusions of these models were dictated by the change in the slope of the relationship between fitness and the focal trait. When plasticity increases this slope, the variance in fitness increases, accelerating evolution; the converse occurs when the slope decreases. The general model focuses our attention on a key biological aspect of plastic systems that was not previously considered explicitly.

meaningful statistics

Measurement theory requires that numbers only be manipulated in ways that retain their representation of the empirical relational structure of interest. Statistical models, however, only make assumptions about the distributional properties of the data and can be applied to any set of numbers that fulfill the distributional criteria. Consequently, statistical education and practice often recommend transformations of the data that make the numbers fit the statistical model. This practice often brings measurement principles and statistical practice into conflict. Measurement theory puts severe constraints on the statistical manipulations that can be done without loss of some or all of the meaning present in the data.

This fundamental truth has not had an important place in the biostatistical literature, but has been the subject of considerable debate in psychology (reviewed by

Hand 1996, 2004; Michell 1999). Some pert rankings of, for example, species vul- hold that measurement theory is irrelevant nerabilities or conservation values. to statistics because the “numbers do not Once an impermissible transformation remember where they came from” (Lord is used during the analysis of the data, the 1953:751), and much statistical practice in aspect of reality to which the measure- biology seems based on this viewpoint. For ments and the associated statistical results example, the discussion of transformations apply is, by definition, changed. Similarly, non- in the context of ANOVA in the widely parametric approaches usually test a hypothe- used biostatistics textbook by Sokal and sis about the sample median rather than the Rohlf (1995) focuses on convincing the means. This point is often ignored in the reader that “the scale of measurement is interpretation of the results of statistical arbitrary, you simply have to look at the tests, potentially resulting in meaningless distributions of transformed variates to de- conclusions. Sometimes transformations of cide which transformation most closely sat- data lead to understandable changes in isfies the assumptions of the analysis of meaning, as when a log transformation variance” (p. 412), a sentiment echoed by maps ratio relations into difference rela- others (Dytham 2003; Logan 2010). If that tions and changes a log-interval scale type is statistics, we want no part of it, as science into an interval-scale type, but transformations is about nature, not numbers. We follow commonly lead to fundamental changes in Adams et al. (1965) and define a meaning- the meaning of the numbers. More often ful statistical statement as one whose truth is than not, such changes are not communi- invariant to permissible scale transforma- cated by the authors of the study. We pre- tions (in the measurement-theoretic sense) sent two such examples later in this paper. of the underlying data. 
The sad results of Fortunately, the generalization of statistical models to encompass distributions other ignoring scale during statistical analysis are than the normal and the rise of numerical legion. For example, if we have ordinal- methods of analysis usually obviate the need scale variables, such as ranks of items, the for unprincipled transformations of data. In statement that the arithmetic mean of one many cases, statistical tests are surprisingly sample is larger than that of another is robust to violations of assumptions (see e.g., not meaningful, because rankings reflect Whitlock and Schluter 2009:403). Neverthe- perceived order only, whereas the act of less, difficult cases will remain in which the averaging assumes that the values reflect assumptions of the available statistical analy- magnitude. Permissible transformations ses are not met, but the transformations that of ordinal data will usually exist that alter would correct the problem would divorce the order of the means. For example, if the measurements from the empirical rela- we have two entities in category B, and tions. Collaborations between biologists and three in category A, and the Bs are statisticians may be necessary in such cases. ranked 2 and 3 out of five, then mean The divorce of statistics from meaning is rank of Bs is 2.5 while the mean rank of perhaps most apparent when only the the As is 3.3. Now assume that more en- qualitative results of statistical tests are tities were included in the ranking, so given or interpreted, ignoring the numer- that the As are now ranked as 1, 9, and ical values of the estimated parameters. In 10, and the Bs as 7 and 8, yielding a lower such cases, a nominal or ordinal conclu- mean score for category A (6.7) than for sion is drawn from estimates that are on a category B (7.5). Despite the reversal of much stronger scale type. For example, the the order of the means, both rankings are conclusion of Takahashi et al.’s (2008) study equally valid. 
A mean is a meaningless statistic for ordinal variables. Wolman (2006) pointed out that conservation biologists make precisely this error when making recommendations based on the average of ex-

of sexual selection in peacocks, which is summed up in their paper's title, "Peahens Do Not Prefer Peacocks with More Elaborate Trains," was based on a failure to reject the null hypothesis of no selection. Their best estimate of the strength of selection was that a peacock gained 3% more matings for every additional eyespot in his tail. This is extremely strong selection, three times the strength of selection on fitness itself (Hereford et al. 2004). The range in the number of eyespots among males was 42, suggesting that the peacock with the most eyespots had an expected mating success $1.03^{41} \approx 3.4$ times that of the one with the fewest. Consideration of the estimate of the strength of selection, rather than its lack of statistical significance, would properly lead to the conclusion that this study lacks the power to detect even extremely strong female preference rather than the authors' misleading conclusion that peahens are indifferent to the plumage of peacocks.

The confusion of biological with statistical significance is exceedingly common in ecology and evolution (see, e.g., Yoccoz 1991; Anderson et al. 2000). The seriousness of the problem is underscored by the more than 300 criticisms of this practice that Anderson et al. found in the scientific literature through the year 2000; many more have surely accumulated by now. The prevalence of such errors in the face of so many efforts to prevent them indicates a systemic problem; we believe the problem is the lack of awareness of what measurement theory tells us about the meaning of numbers. Biologists accept P-values as measures of effects for the same reason that they commonly neglect to report and consider units and transformations. The skill of interpreting numbers is neither taught nor practiced: too often quantification is window dressing for qualitative arguments. Measurement theory is a description of what meaningful quantification entails. It is precisely the medicine needed to restore meaning to statistical practice. Fulfilling the assumptions of the statistical model is important, but it is by itself useless if the statistical model does not respect the empirical content of the data.

Measurement in Biological Practice

The principles we have outlined above are so near to platitudes that it may seem odd that we think them worth emphasizing. There are, however, numerous examples of errors arising from violations of these principles, sometimes for representational errors, but just as frequently for simple conceptual errors that lead to irrelevant measurements. Even more troubling is that the perceived best practice in a particular field may incorporate a measurement error that renders an entire body of studies meaningless. We present a sampling of such errors here.

Example 1: Remember Context

The actual theoretical context and the measurements performed can sometimes be mismatched. The most spectacular errors of this type occur when the investigator loses track of the theoretical context that is the ostensible purpose of a study. A striking example of this can be found in much recent work on the evolution of allometry. Julian Huxley (1924, 1932) introduced allometry as a simple scaling relationship between a trait, $Y$, and overall body size, $X$, as a power law, $Y = aX^b$, which is usually and more conveniently expressed as the linear relationship $\log(Y) = \log(a) + b \log(X)$. Empirically, such linear log-log relations are common both within species (static allometry) and between species (evolutionary allometry). Huxley showed that static allometries between two traits will result when they are under common growth regulation, and the idea that evolution may be constrained to follow static allometries has received considerable attention (e.g., from Gould 1977). On the basis of his model, Huxley (1932:5–6) felt that the intercept $\log(a)$ was "of no particular biological significance," but the exponent, $b$, "has an important meaning." Indeed, in physiology, biomechanics, and life-history theory much theoretical and empirical work has gone into establishing the exact value of the allometric exponent for different traits (e.g., Schmidt-Nielsen 1984; Charnov 1993). When the overall relationship between size and another trait varies, the intercept is usually far more variable than the slope (Teissier 1936; White and Gould 1965; Jerison 1969; Greenewalt 1975; Dudley 2000).
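The distinction between the allometric slope and the intercept can be made concrete with a small sketch. The helper `fit_allometry` and the synthetic values (a = 2, b = 0.75) are hypothetical and chosen only for illustration; they do not come from any of the studies cited. The sketch fits $\log(Y) = \log(a) + b \log(X)$ by ordinary least squares and shows that rescaling the trait changes only the intercept.

```python
import math

def fit_allometry(sizes, traits):
    """Least-squares fit of log(Y) = log(a) + b*log(X); returns (a, b)."""
    lx = [math.log(x) for x in sizes]
    ly = [math.log(y) for y in traits]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    sxx = sum((u - mean_x) ** 2 for u in lx)
    b = sxy / sxx                        # allometric exponent (slope)
    a = math.exp(mean_y - b * mean_x)    # back-transformed intercept
    return a, b

# Hypothetical noiseless data following Y = 2 * X**0.75 (assumed values).
sizes = [1.0, 2.0, 4.0, 8.0, 16.0]
traits = [2.0 * x ** 0.75 for x in sizes]

a_hat, b_hat = fit_allometry(sizes, traits)

# Doubling every trait value shifts log(a) by log(2); the slope b is unchanged.
a_shift, b_shift = fit_allometry(sizes, [2.0 * y for y in traits])

print(f"original: a={a_hat:.3f}, b={b_hat:.3f}")
print(f"rescaled: a={a_shift:.3f}, b={b_shift:.3f}")
```

On the arithmetic scale the two data sets look like different relationships, which is why a change in the relationship between two traits says nothing by itself about whether the exponent has changed.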

Although many authors follow Huxley's conception of allometry as "the slope of a...log-log regression of the size of a structure on body size" (Eberhard 2009:48), some have adopted a "broad" sense of allometry as "[c]hanges in shape which accompany change in size" (Mosimann 1970:930). In this usage, all kinds of nonlinear and even sigmoid and discontinuous (threshold) relationships are referred to as allometric (Frankino et al. 2010). Thus, broad-sense allometry is essentially synonymous with shape and, therefore, without precise connection to any of the theoretical models that have motivated the interest in the allometric slope as an important evolutionary constraint, whether Huxley's hypothesis of common growth regulation or more sophisticated models from physiology, biomechanics, or life-history theory.

Several investigators over the last 20 years have claimed to alter allometry by artificial selection on trait indices (Weber 1990, 1992; Wilkinson 1993; Frankino et al. 2005, 2007). Unfortunately, they make only passing reference to the theoretical concepts that have defined the interest of allometry to biology. None, for example, log-transformed their data. The study by Frankino et al. (2005) is illustrative. The title of the study, the abstract, and the text formulate the problem as explaining the conservatism of "allometry" or "scaling relationships," but include no discussion of specific allometric models. The implication, for those familiar with the usual conception of allometry as slope on a log-log scale, is that the study will be about the evolution of this slope. In their experiment, Frankino et al. (2005) sought to alter wing loading, the relationship between wing area and body mass, in the butterfly Bicyclus anynana. To do so, they selected for increased or decreased individual deviations from the major axis of variation between forewing area and body mass. They achieved statistically significant responses in this index.

What do these results mean for allometry, the stated subject of the study? The answer is entirely unclear, because the type of selection used by Frankino et al. will favor changes in both intercept ($\log(a)$) and allometric coefficient $b$. The only window through which the reader can assess the nature of the responses is in a figure (Frankino et al. 2005, Figure 2A) that presents the distributions at the end of the experiment on an arithmetic scale. No analyses are made on the log scale, however, and no attempt is made to determine whether the differences are in the slope or the intercept. The theoretical context of allometry is not incorporated into the design or analysis of this study, despite the authors' invocation of the concept of allometry.

All of these experiments (Weber 1990, 1992; Wilkinson 1993; Frankino et al. 2005, 2007) alter something about the relationship between traits and clearly demonstrate that body shape shows genetic variation, a valuable conclusion. All of them invoke the concept of allometry, but how to interpret their results in terms of allometry is perfectly ambiguous. One extreme interpretation is that all of the response is in the parameter $a$, suggesting that, contrary to the conclusions of these papers, the allometric slope is constrained by a lack of genetic variation. A second interpretation is that some part of the response is in the exponent $b$ and that natural selection can indeed reshape allometric relationships easily.

Meaning is lost in this class of studies because a theoretical context is invoked but then ignored. The design of these experiments precluded testing the claim that allometry can readily be altered, because the measurements and analyses did not reflect hypotheses about allometry. This is an error in specifying the attribute of interest as any aspect of the relationship between pairs of traits rather than as the slope of their relationship measured from log-transformed data.

Example 2: Do Not Use a Rubber Ruler

Consider an animal foraging in an environment containing two equally common types of resource patches, where one type of patch has twice the payoff of the other. The animal chooses one patch at random and finds it has payoff $x$, but it cannot tell whether it is a good or a bad patch. Should it then switch patches? If it switches, it has a 50% chance of encountering the same payoff $x$ but also a 50% chance of encountering a different payoff. In the case in which the payoff is different, the chances are equal that the new patch will have twice the value ($2x$) or half the value ($x/2$) of the original patch. From optimal foraging theory, we might then argue that the animal should switch patches, because the expected payoff for the new patch would be $2x \cdot \tfrac{1}{2} + \tfrac{x}{2} \cdot \tfrac{1}{2} = \tfrac{5x}{4} > x$. Notice that this conclusion would be the same whether the animal chose the good patch or the poor patch on the first attempt. The grass is always greener on the other side!

This conclusion is clearly absurd, but identifying the error is not trivial. The cause is failure to adopt a common scale. To see the problem, first define the payoff in the poor patch as $z$ and that in the better patch as $2z$. In terms of this fixed scale, we easily see that the payoff of the first patch is $x_1 = 2z \cdot \tfrac{1}{2} + z \cdot \tfrac{1}{2} = 3z/2$, whereas the payoff of the second is $x_2 = z \cdot \tfrac{1}{2} + 2z \cdot \tfrac{1}{2} = 3z/2$, giving the obviously correct answer that $x_1 = x_2$. With this definition in mind, we can return to the first equation and diagnose the problem: $x$ is used to equal both $z$ and $2z$ in the same equation.

This example, derived from the envelope paradox in probability theory (Hand 2004), illustrates the dangers of a lack of awareness of scale. In fact, scales that depend on the entity to be measured are common in evolutionary biology. The use of heritability to measure evolutionary potential is one important example. We can define short-term evolvability as the expected response to a given selection gradient (Houle 1992; Hansen et al. 2003; Hansen and Houle 2008), and under certain assumptions it equals the additive genetic variance (Lande 1979). To compare additive variances across traits and populations, we need a common scale. The standard practice has been to use the trait's population variance as the scaling unit. Dividing the additive variance by the population variance yields the heritability, $h^2$, which is a dimensionless number. Note, however, that the population variance contains the additive variance as a component and, furthermore, that its other components, such as epistatic and environmental variances, tend to be correlated with the additive genetic variance (Houle 1992). Therefore, when heritabilities are used for comparison of evolvability across traits and populations, we use a scale that depends strongly on the entity to be measured. The naive use of $h^2$ as a measure of evolutionary potential has led to some highly dubious generalizations, such as the idea that life-history traits and fitness components are less evolvable than morphological traits (see, e.g., Roff and Mousseau 1987). Houle (1992, 1998; Houle et al. 1996) proposed using a mean-standardized scale (such as a coefficient of variation or its square $I_A$) because it is not necessarily a function of the attribute to be measured; on this scale, the lower $h^2$ of life-history traits is clearly due to high levels of environmental variance. The use of the variance-standardized scale gives a misleading conclusion for reasons strikingly similar to the one that generates the patch paradox. The naive expectation that evolvability can be measured equally well by either $h^2$ or mean-standardized additive variance is wrong; in fact they are almost uncorrelated (Houle 1992).

In this case, the measurement error is in selecting an attribute to measure, $h^2$, that is unsuited to the theoretical context. Heritability is unsuitable because it standardizes a quantity of interest with something to which it is autocorrelated. This example shows that meaning is drastically altered by seemingly innocent choices of scale. The choice of scale is a fundamental and nontrivial part of both model building and statistical estimation. A lack of awareness of such issues is a major source of misreporting and misinterpretation of biological results.

Example 3: Interpret Your Numbers

Yet another way of robbing a study of meaning is to fail to specify a theoretical context. A good example is Kingsolver et al.'s (2001) paper, "The Strength of Phenotypic Selection in Natural Populations." One of their conclusions was that "directional selection on most traits and in most systems is quite weak" (p. 253), yet Kingsolver et al. never specified a theoretical context that would help to determine the strength of selection. As pointed out by Conner (2001), this omission leaves readers wondering whether selection is really strong or weak.

Kingsolver et al. reviewed a large sample of the available estimates of linear selection gradients obtained in wild populations (Lande and Arnold 1983; Endler 1986). The linear selection gradient measures the change in relative fitness for a unit change in the value of a trait. They followed Lande and Arnold (1983) in adopting the unit of trait standard deviation as the basis for comparison across different traits and populations. Despite the claim about weak selection in the abstract, the text of the paper contained just two brief statements about selection strength, on pages 250–251 and 253. The median strength of selection of 0.16 is termed "rather modest," and values greater than 0.5 are "very strong," but no discussion is included of criteria for deciding what constitutes strong or weak selection. Their argument for the rarity of strong selection is instead based on the approximately exponential distribution of the absolute values of the gradients with a mode at 0. This change of emphasis from that implied by the title and abstract is made explicit in the Methods section, where Kingsolver et al. state "the overall 'average' strength of selection is unlikely to be very informative. We focus our analyses on the distributions of selection strengths" (p. 248). The conclusion drawn from this analysis would be more accurately stated as "most selection estimates are much smaller than the extreme estimates."

With respect to the strength of selection, therefore, Kingsolver et al. (2001) imply that they have a theoretical context, but then decline to apply any theory or concepts related to the units of measurement. Symptomatic of this problem is that the units of measurement, although clearly stated in the Methods section, are never attached to any of the actual measurements in the paper. If the units are stated, Kingsolver et al.'s (2001) statements that the median strength of selection, a 16% change in relative fitness with a one-standard-deviation change in the trait, is modest, whereas a change of 50% with a one-standard-deviation change is very strong selection, sound rather contradictory.

Clearly, for comparison of the strengths of selection on different traits in different organisms, some standard scale must be adopted, but which scale? Hereford et al. (2004) proposed two criteria for judging the strength of selection and showed that each is naturally addressed on a different scale. Under frequency-independent selection, the strength of selection is naturally viewed as a function of the adaptive landscape, which is external to the population subject to selection. Keeping this theoretical context in mind makes clear that standardizing the selection gradient by the trait standard deviation has problems similar to those of basing a measure of evolvability on heritability. To get at this concept of the steepness of the adaptive landscape in the neighborhood of the population, the variance-standardized measure immediately requires a second measure of the variation in the trait and in fitness itself to reveal whether a large value of the standardized gradient is due to a steep landscape in a population with small variation or to a population with a large amount of variation on a relatively flat landscape.

Hereford et al. (2004) instead standardized the selection gradient by the trait mean, which gives the change in relative fitness for a proportional change in trait. On a mean-standardized scale, the strength of selection on fitness itself is one, providing a benchmark for strong selection (Hansen et al. 2003). Hereford et al. (2004) showed that the median mean-standardized selection gradient was 0.31, or 31% of the strength of selection on fitness, and on this basis concluded that the results were "notable for the extremely strong selection observed" (p. 2141).

As noted above, however, an alternative idea of strength of selection could be based on the change of fitness within the range of the population; that is, strong selection occurs when the expected fitnesses of individuals within the population are typically very different (Hereford et al. 2004). In this theoretical context, a variance-standardized ruler is no longer rubber, but tells us just what we need to know. For example, in a population with a range of phenotypes of four standard deviations (readily found within a normally distributed population of modest population size), the median variance-standardized selection gradient from Kingsolver et al. (2001) of 0.16 predicts that a typical least-fit individual two standard deviations below the mean has a fitness only 52% that of the typical most-fit individual, two standard deviations above the mean. This again sounds like strong selection, contrary to the conclusions of Kingsolver et al. (2001), but for completely different reasons than in the mean-standardized case. Which scale is used really does matter: if the fitness landscapes were left unchanged, but the traits studied had much smaller coefficients of variation, the differences in fitness on a variance-standardized scale would also have been much smaller.

Example 4: Respect Scale Type

Fitness is one of the most fundamental quantitative concepts in biology. It predicts the evolutionary outcome of competition among genotypes or phenotypes for representation in the next generation. Two types of fitness measures are commonly used in biology, Wrightian fitness $w$ and Malthusian fitness $m$. Wrightian fitness naturally arises in models where changes in gene or genotype frequencies are predicted by the equation

$$p' = \frac{pw}{\bar{w}},$$

where $p$ is the relative frequency of a genotype before selection, $p'$ the frequency after selection, and $\bar{w}$ the mean fitness of the population. The biological meaning of Wrightian fitness is contained in the ratio of fitness measures, as reflected in the use of "relative fitness" in population genetic theory. Wagner (2010) showed that Wrightian fitness, $w$, is a ratio-scale measure under the assumption that the time scale is fixed. We relax this assumption here because conclusions about fitness can be drawn on any time scale. Fitness, therefore, becomes a log-interval-scale variable, as the conclusions are invariant to power transformations of fitness, so we are equally entitled to draw conclusions about two or more episodes of selection. In this case, the exponent is determined by the time scale. The result of selection is therefore invariant to a transformation $w \to aw^b$, $a, b > 0$, but not to the transformation $w \to w + c$.

The Malthusian parameter $m$ measures fitness in continuous time. If the age structure of the population is stable, the instantaneous rate of change in genotype frequency is given by the Crow-Kimura differential equation:

$$\dot{p} = p(m - \bar{m}).$$

The Malthusian fitness measure $m$ is approximately related to Wrightian fitness (under weak selection) by $m = \ln w$. In this case, the biological meaning of the fitness measures is contained in the differences of the fitness values, rather than the ratios, and thus $m$ is an interval-scale variable. The permissible scale transformations are therefore of the form $m \to am + b$, where $a$ determines the change in time scale. This conclusion is of course unsurprising as $m$ is the log of $w$ and $w$ is a log-interval-scale variable. A somewhat counterintuitive consequence of this mathematical fact is that only the differences between $m$s have evolutionary meaning; neither the sign nor the absolute magnitude is meaningful for evolution, although they have ecological meaning as growth rates.

The difference in scale types of fitness measures has important consequences in experimental practice that have not always been respected. An example of the problem arises in Remold and Lenski's (2001) study of the causes of genotype-by-environment interaction in Escherichia coli. Remold and Lenski studied the effects of single mutations on fitness when bacteria were grown in different environments. They used a fitness assay in which a mutant strain, M, competed against a standard strain, S, starting at low density and equal frequencies of the two genotypes (Lenski 1988). They estimated the initial and final frequencies by spreading samples on agar plates and recording the number of colonies of each type. These counts are used to estimate the Wrightian fitness of genotype $x$ as $N'_x / N_x = w_x$, where $N_x$ is the initial count and $N'_x$ is the final count. To estimate the Malthusian fitness, $m$, Remold and Lenski treated the changes in bacterial density as the result of an exponential growth process

$$N_x(t) = N_x(0)\,e^{mt}$$

(Lenski 1988). One can therefore use the counts of bacterial cultures $N_x$ and $N'_x$ to estimate $m$ for each genotype as

$$\ln\left[\frac{N'_M}{N_M}\right] = \ln[w_M] = m_M, \qquad \ln\left[\frac{N'_S}{N_S}\right] = \ln[w_S] = m_S.$$

Recall that the biological meaning is in the magnitude of the difference, $m_M - m_S$, rather than in their ratios. Remold and Lenski (Lenski et al. 1991; Remold and Lenski 2001), however, took their ratio to define a "relative fitness":

$$f_{M,S} = \frac{m_M}{m_S}.$$

The measure $f_{M,S}$ is still a measure of relative fitness in the sense that, if genotype M has higher fitness than S, then $f_{M,S} > 1$, and if a third genotype R has higher fitness than M, then $f_{R,S} > f_{M,S}$. This measure preserves the order of fitness values between genotypes and can therefore be used as an ordinal-scale variable, but the magnitude of these $f_{M,S}$ values has no biological meaning. This problem is largely innocuous if $f_{M,S}$ values are used only to test for the existence of differences in fitness in a single environment, say between an ancestral and a derived genotype, but $f_{M,S}$ does not measure the magnitude of the fitness difference and is therefore meaningless when we compare fitness in different environments, as Remold and Lenski (2001) did.

To make the problem concrete, imagine experiments where genotypic fitnesses are measured in only two environments and the fitnesses of the two genotypes in the environments are different, $m_{M\cdot1} \neq m_{M\cdot2}$ and $m_{S\cdot1} \neq m_{S\cdot2}$. First assume that the differences between the fitness values in the two environments are the same, $m_{M\cdot1} - m_{S\cdot1} = m_{M\cdot2} - m_{S\cdot2}$, so that the allele frequencies follow exactly the same trajectory in both environments. If we compare the alternative relative fitness measures based on ratios of $m$s, we find that $f_{M,S\cdot1} \neq f_{M,S\cdot2}$, erroneously suggesting the presence of genotype-by-environment interaction. For example, if we assume that the Malthusian fitnesses are $m_{M\cdot1} = 0.02$, $m_{S\cdot1} = 0.03$, $m_{M\cdot2} = 0.01$, and $m_{S\cdot2} = 0.02$, the fitness difference within each environment is 0.01 in magnitude, and genotype and environment do not interact because selection proceeds exactly the same in each environment. Using these numbers, however, $f_{M,S\cdot1} = 2/3$ and $f_{M,S\cdot2} = 1/2$, which leads us to conclude that a genotype-by-environment interaction is present. If, on the other hand, $m_{M\cdot1} = 0.015$, $m_{S\cdot1} = 0.03$, $m_{M\cdot2} = 0.01$, and $m_{S\cdot2} = 0.02$, the ratios are constant, leading one to infer an absence of interactions, when the differences show a true genotype-by-environment interaction. Using this approach, Remold and Lenski (2001) concluded that no evidence supported genotype-by-environment interactions involving temperature, but that interactions involving the type of carbon resource offered were abundant. Neither of these conclusions is justified.

Remold and Lenski (2001) came to meaningless conclusions because they used ratios to interpret data on an interval scale type. Respect for scale type is essential for drawing inferences about an empirical system from the measurements obtained by experiment. In this case, the actual measurements were appropriate to the question, and the measurement error came from a mismatch between the summary statistic chosen and the scale type.
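The two numerical scenarios in Example 4 can be checked directly. The following is a minimal sketch using the Malthusian fitness values quoted in the text; the function name `fitness_contrasts` and the variable names are hypothetical and introduced only for illustration.

```python
import math

def fitness_contrasts(m_mutant, m_standard):
    """Return (difference, ratio) of two Malthusian fitness values."""
    return m_mutant - m_standard, m_mutant / m_standard

# Scenario A: equal differences in both environments (no true interaction).
diff_a1, ratio_a1 = fitness_contrasts(0.02, 0.03)   # environment 1
diff_a2, ratio_a2 = fitness_contrasts(0.01, 0.02)   # environment 2

# Scenario B: equal ratios, but the differences disagree (true interaction).
diff_b1, ratio_b1 = fitness_contrasts(0.015, 0.03)  # environment 1
diff_b2, ratio_b2 = fitness_contrasts(0.01, 0.02)   # environment 2

# The interval-scale comparison uses differences; the ratio is shown only
# to reproduce the misleading statistic.
for label, d1, d2, r1, r2 in [
    ("A", diff_a1, diff_a2, ratio_a1, ratio_a2),
    ("B", diff_b1, diff_b2, ratio_b1, ratio_b2),
]:
    print(
        f"scenario {label}: "
        f"differences equal = {math.isclose(d1, d2)}, "
        f"ratios equal = {math.isclose(r1, r2)}"
    )
```

In scenario A the differences agree while the ratios (2/3 and 1/2) do not; in scenario B the ratios agree while the differences do not. Because $m$ is an interval-scale variable, only the comparison of differences carries evolutionary meaning.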

example 5: do not let statistics algorithms based on the Gaussian likeli- overrule meaning hood, and the transformations therefore An example of the conflict between sta- have a good statistical justification given tistics and meaning is a study of the rela- that the method of analysis is tailored to tionship between morphological diver- normally distributed data. gence and divergence in genetic variance On the other hand, transformations also matrices performed by Podolsky et al. alter the relationships between means and (1997). The theoretical context for this variances, which were the subject of Podol- study is the prediction of longer-term evo- sky et al.’s (1997) analysis. Equalization of lutionary potential from estimates of the variances among groups in an ANOVA is within-population genetic variation. That an alternative reason for choosing particu- genetic variation is summarized in a ma- lar transformations and one that most stat- trix, G, which contains the additive genetic isticians regard as more important than variances for each traits and the covari- transformation to normality. For example, ances between them. The Lande model size, such as the lengths of body parts used (Lande 1979) predicts how variation af- in this study, are often approximately log- fects divergence over many generations if normally distributed, so variance and cova- G remains sufficiently stable over the rele- riances scale with the mean on the linear vant time scale. The question of whether scale, but would be independent on the stability is expected or observed is still log scale. By choosing a variety of transfor- largely unresolved (Steppan et al. 2002; mations of the variables with regard to nor- Arnold et al. 2008), as G matrices are com- mality of residuals, Podolsky et al. altered plex objects that are difficult to estimate the mean-variance relationships that they precisely. The Podolsky et al. 
study was a wished to study in different ways for differ- relatively early attempt to address this ques- ent traits. These transformations affected tion in a maximum-likelihood framework both the distance used to judge divergence and had a positive impact by highlighting of means and the estimates of G whose the problems of statistical power in such disparity was predicted. Calculated from studies. the transformed data, D2 combined data Podolsky et al. (1997) compared popu- from many incompatible scales, rendering lation means and G matrices for lengths of any test of mean-variance relationships nine structures in 11 populations of the meaningless; if different transformations plant Clarkia dudleyana. They judged dis- had been applied, the results could very parity between matrices using the average well have been different (Adams et al. squared difference between their ele- 1965). ments. Divergence between populations We do not mean to suggest that no trans- was measured as Mahalanobis distance, D2, formations should ever be used. A com- which is the distance in variance units in mon and principled type of transformation the direction between the two samples in would be to log transform all of the data, multivariate space. The problematic aspect converting to a different but equally valid of their analysis is that they used a wide scale type (interval or difference), where variety of transformations of the underly- differences play the role of ratios on the ing data before estimating the G matrices; original scale. In other theoretical con- five of the nine lengths were untransformed, texts, transformations to still other scales three were square-root transformed, and one would be appropriate. For example, if the was transformed as y ϭ ln(ln(x) ϩ 1). 
Podol- investigator had valid reasons to believe sky et al.’s (1997) stated goal in choosing that natural selection had a simple rela- transformations was to transform “to nor- tionship with the square root of length, mality as feasible” (p. 1787). Departures square-root transformation would be justi- from normality can cause serious estima- fied in a study of natural selection. Rarely, tion problems for maximum-likelihood however, are such empirically based argu- 26 THE QUARTERLY REVIEW OF BIOLOGY Volume 86 ments offered for the use of anything other sion of the Lande-Arnold model has been than a logarithmic transformation. widely used. Kingsolver et al. (2001) found Transformations very frequently change 573 estimates of ␥ in a sample of 63 studies, the scale type and, therefore, the meaning and Stinchcombe et al.’s (2008) much of measurements. In many cases, the result more limited survey of one journal from is meaningless tests of hypotheses. In this 2002 through 2007 turned up 32 addi- case, the error was a general one of com- tional studies with 673 estimates of ␥. paring quantities that were incommensu- Surprisingly, Stinchcombe et al. (2008) rate because transformation placed them discovered that the majority of published es- on different scales. timates of univariate ␥ were off by a factor of 2. The source of this error is that two differ- example 6: treat measurements as ent models are used in the Lande-Arnold measurements approach: conceptually, Lande and Arnold (1983) preferred to think of the average cur- An exceedingly common measurement vature of the selective surface around the error in biology is the neglect of units, population mean and so defined the param- which changes the carefully gathered mea- eter ␥ as the second derivative of the fitness surements into a mere set of numbers. 
This landscape; fitness is a function of (1⁄2)␥.On error is obvious to the prepared mind the practical side, Lande and Arnold showed when a table of naked numbers is shame- that the multiple regression of trait values on lessly paraded in front of us. Sometimes relative fitness can be used to estimate the the units of each measurement can be di- parameters ␤ and ␥ in the above model. vined, by laborious shuffling from methods Regression models are parameterized to appendices to tables to results, but this is a chore no reader should be subjected to. w ϭ ␣ ϩ bz ϩ qz2 ϩ␧, A deeper symptom of the insidious neglect so that, although bϭ␤,qϭ1/2␥. Stinch- of units is evident from the curious case of combe et al. (2008) estimated the pro- quadratic selection gradients. portion of studies that reported q as ␥ by Lande and Arnold (1983) showed that asking the authors of a sample of papers the relationship between fitness and trait how they calculated ␥. A stunning 78% of values could be approximated from linear the authors reported using q as ␥. and quadratic regression coefficients. If z is How is it that hundreds of papers con- the deviation of an individual trait from taining thousands of estimates can be pub- the population mean, then the Lande- lished over 27 years before anyone notices Arnold model rewrites relative fitness as that the majority of the estimates are 1 wrong? The answer seems to be a systemic w ϭ ␣ ϩ ␤z ϩ ␥z2 ϩ␧ 2 lack of respect for measurement and mod- els in biology. Despite the wide use of the where ␣ is mean fitness and ␧ is residual Lande-Arnold method, the interest of most variation. Extension to the multivariate users has been in determining whether lin- case is straightforward. The linear gradi- ear or quadratic selection can be detected ent, ␤, is necessary for prediction of the by a statistical test for selection, not in the short-term response of the mean to natural actual strength or shape of selection. 
selection, whereas the more complete description of the selection surface including γ is important for understanding how variance is changed by selection (Lande 1980), for visualizing the selective surface (Phillips and Arnold 1989), and for understanding whether the one-generation prediction of the response to selection is likely to hold over longer time scales. The quadratic ver-

Exceedingly few authors have used the Lande-Arnold parameters β or γ to make predictions about evolution or variation, and even fewer have tried to test those predictions (for exceptions see Postma et al. 2007). The numerical estimates have served a mainly decorative purpose.

Biology is in general poised between being a descriptive, qualitative science and a quantitative one. Many questions have only qualitative answers: Is this organism known to science or undescribed? Which piece of land should be purchased to further conservation goals? Other attributes of nature may productively be treated as qualitative, even though underlain by quantitative reality. For example, knowing whether we can reject the idea that a particular bit of DNA is evolving at the neutral rate is useful. Consequently, we sometimes forget that there are cases where quantity matters, such as quadratic selection gradients. The lack of attention to the definition of γ has damaged the meaning of 20 years of published estimates. This problem matters. Stinchcombe et al. demonstrated that this numerical error can suggest that multivariate fitness landscapes have a shape qualitatively different from their real one. Many hypotheses about the maintenance of genetic variation depend on the actual value of γ (see, e.g., Turelli 1984).

Example 7: Know What Your Parameters Mean

Our final two examples are cases where parameters of models are assigned a meaning that they do not actually have. A good example is the idea that negative genetic correlations were the necessary consequence of an evolutionary theory of limits on life history, which somehow became widespread during the 1980s (see, e.g., Bell and Koufopanou 1986). This misconception occasioned considerable debate and additional experiments when the data did not conform to this expectation (see, e.g., Rose 1984; Reznick 1985). The entire controversy rests on a misinterpretation of what a genetic correlation, and the genetic covariance it standardizes, means. Genetic covariance quantifies the degree of dependence of one trait on another. No discontinuous change in the response to selection accompanies a change in sign of a genetic correlation (Via and Lande 1985; Charlesworth 1990; Houle 1991; Fry 1993). When the covariance goes from negative to positive, the correlated effects of selection simply go from slightly negative to slightly positive. Instead, the conclusion that tradeoffs are not perfect (because correlations are > −1) should have been apparent from the beginning. Half a generation of biologists working on the genetics of life histories spent too much of their time worrying that their estimates of correlations were not negative enough because some referred a reasonable hypothesis (that tradeoffs are important) to an irrelevant attribute (the sign of the genetic correlation). The hypothesis of constraint directly predicts a boundary beyond which fitness cannot evolve. It does not predict genetic correlations without subsidiary assumptions, such as a lack of deleterious mutations.

Example 8: Make Meaningful Measures

One of our roads to discovering measurement theory came through a desire to represent the evolutionary effects of epistasis (Wagner et al. 1998; Hansen and Wagner 2001b). Fisher’s definition of additive effects has proven extremely fruitful because it was defined to quantify the importance of parent-offspring resemblance in the response to selection. In contrast, the statistical measures of gene interaction (Fisher 1918; Cockerham 1954; Kempthorne 1954), namely dominance variance, additive-by-additive variance, and additive-by-dominance variance, among others, have not proven useful for understanding the dynamical role of gene interaction in evolution because they represent a statistical measure of departures from simpler models in an ANOVA framework rather than a measure of the dynamical consequences of epistasis. In spite of this, the use of statistical measures of epistasis has persisted over many years because these were for a long time the only measures of epistasis that had been suggested. As a result, when biologists wished to study the interesting phenomenon of epistasis, for example, inspired by Wright’s shifting balance theory, they naturally turned to the statistical measures, even in the absence of explicit theory to justify their relevance. Consequently, the statistical measures of epistasis have been assigned various meanings that they do not possess, with effects similar to the case of the sign of genetic correlations. For example, the “statistical” conceptualization excludes systematic directional effects
by definition and has led to the widespread but incorrect notion that the influence of epistasis on the dynamics of quantitative characters under selection can be ignored (see Carter et al. 2005; Hansen et al. 2006; Pavlicev et al. 2010). The only explicit dynamical role suggested for statistical epistatic variance was that it could be converted to additive variance by genetic drift during population bottlenecks and thus boost evolvability during founder events (see, e.g., Goodnight 1987; Cheverud and Routman 1996). This interpretation is misleading because, although epistasis causes additive effects to change when allele frequencies change, the increase in additive variance is not proportional to any decrease in epistatic variance components (Barton and Turelli 2004). The change in additive variance under genetic drift depends strongly on the pattern of epistasis, and the additive variance may even decrease in expectation under some kinds of epistasis (Hansen and Wagner 2001b).

Seeing this, however, required defining new measures of epistasis that capture dynamically important properties. A non-statistical representation of epistasis had been proposed by Cheverud and Routman (1995), and Wagner et al. (1998) followed this proposition by defining epistasis in terms of changes in the effect of allele substitutions with changes in the genetic background (i.e., substitutions at other loci). Hansen and Wagner (2001b) developed the multilinear representation of the genotype-phenotype map along these lines and showed how the new representation of epistasis can be related to the statistical representation used in quantitative genetics (for further development see Álvarez-Castro and Carlborg 2007). Using this model, we could show how different patterns of epistasis affect evolutionary dynamics (Hansen and Wagner 2001a,b; Hermisson et al. 2003; Carter et al. 2005; Hansen et al. 2006; Fierst and Hansen 2010). An important result is that epistasis can accelerate a response to selection in the presence of a systematic pattern of positive interactions between the effects of substitutions at different loci. Conversely, a systematic pattern of negative interactions leads to canalization. Carter et al. (2005) identified a “directional” epistatic parameter that measures these effects. This directional parameter is dynamically meaningful in a way similar to that of Fisher’s additive effect and additive variance. It captures the second-order effect of evolutionary changes in the additive effects, as shown in Figure 3.

Figure 3. Effect of Pairwise Epistasis on Response to Directional Selection
Each trajectory is the mean of 100 replicate individual-based simulations; the bars show ± 1 S.E. The error bars are smaller than the symbols for most of the cases. Each simulation models populations of 1000 individuals with 20 loci influencing a single trait with identical allele frequencies and allelic effects that furnish an initial additive genetic variance of 1.0. Mutation is absent. Populations were subjected to 100 generations of positive directional selection of strength 1% change in relative fitness per trait unit. Directional epistasis is defined as in Carter et al. (2005), where μ_ε is the mean strength of directional epistasis, and σ_ε is the standard deviation of the directional epistasis between pairs of loci. The four cases shown are: No epistasis μ_ε = 0, σ_ε = 0; Nondirectional epistasis μ_ε = 0, σ_ε = 1.41; Positive epistasis μ_ε = 1, σ_ε = 1; and Negative epistasis μ_ε = −1, σ_ε = 1.

What Can We Do?

The problems that we have described are widespread in biology. We have chosen examples mostly from the evolutionary biology literature because that is our own area of expertise, and those areas are the ones in which we can more confidently diagnose the issues raised by each example. We strongly suspect that ecologists, for example, will find similar problems with the measurements meant to instantiate such concepts as population size, density dependence, interaction strength, and productivity (see, e.g., Wulff 2001; Vik et al. 2004).

Some readers may feel that our examples are simply unfortunate isolated errors by individual researchers. Such errors are common, but ultimately inconsequential because of the self-correcting nature of science. We emphasize that the examples we discuss are symptomatic of systemic errors that reflect widely held beliefs and attitudes. Many of our examples involve highly competent, indeed leading, researchers, and the problems are repeated in many other studies. For example, the much-criticized errors in Harvey and Clutton-Brock’s (1985) compilation of primate body-size data might be treated as their responsibility alone, but Smith and Jungers (1997) traced the way these data were transferred from source to source and often lost their meaning and context along the way. Eventually, in their final destination, the result was absurdities such as assigning six “means” to males and females of three species of Ateles on the basis of what were originally measurements from a total of three individuals. The shortcomings and errors by Harvey and Clutton-Brock (1985) arose because the “means” were repeatedly presented without standard errors, sample sizes, or references to original sources, and the errors were thus allowed to propagate (Smith and Jungers 1997). This case is a scandal because of the long chain of authors who succumbed to the temptation to use data without checking original sources, and all of the reviewers and editors who did not object. Similarly, the ideas that the sign of genetic correlations was an important attribute and that heritability predicts evolvability became problematic not when the first author made this claim, but as the number of those who accepted the incorrect premise grew; these are systemic problems.

A related objection is that our extension of measurement theory to incorporate the theoretical context and hypothesis generation is not helpful because we are discussing “good science, clear thinking, or an appreciation for the subtlety of an argument” (things that good scientists already know and value). We agree that this is precisely what we are doing, but disagree that taking a measurement-theory stance is not helpful. We have known for years about many of the examples we discuss above, but we find that the terminology and mindset of measurement-theoretic thinking allow us to diagnose and discuss common features of these cases that combine to make them examples of flawed or useless science. For example, we find clarifying the ability to declare that using the sign of a genetic correlation to indicate whether a tradeoff exists is an attribute error. Our claim is not that conceptual measurement theory is revolutionary, but that it aids clear thinking about good science.

The most practical counterweight to the many possible pitfalls that accompany the task of measurement is awareness. If our experience is any indication, as soon as one becomes explicitly aware of measurement as a discrete topic, much unsatisfactory science is readily diagnosed in those terms. In this way, measurement theory can become a routine background to reading and reviewing the literature, discussing it with colleagues, and designing research strategies. Once a culture of thinking about meaning and measurement is in place, many of the types of errors that we have documented will be more readily avoided during the design and analysis of experiments, caught by reviewers before they are published, or quickly noticed when they are.

A second way to address the measurement problem is through education. Most graduate programs include a required background of at least one course in statistics, but we have never come across a course in meaningfulness or even measurement theory; it is curious that so little effort is spent learning what our conclusions might mean, but substantial effort is spent on how to quantify those conclusions. We hope that someday the theory of measurement and meaning will have its place in scientific education alongside, and as a partial counterweight to, statistics. Looking farther down the road, if explicit measurement theory takes root in biology, we can look forward to a time when a backlog of well-studied biological examples will be available as a familiar entrée to the study of measurement and meaning.

In the meantime, we offer a modest list of measurement principles as an aid to thinking about both good and bad measurement:

1. Keep theoretical context in mind.
2. Honor your family of hypotheses.
3. Make meaningful definitions.
4. Know what the numbers mean.
5. Remember where the numbers come from.
6. Respect scale type.
7. Know the limits of your model.
8. Never substitute a test for an estimate.
9. Clothe estimates in the modest raiment of uncertainty.
10. Never separate a number from its unit.

Acknowledgments

Our understanding of the ideas in this paper was developed in discussion with members of the Hansen and Houle laboratories and with numerous visitors to the Centre for Ecological and Evolutionary Synthesis (CEES) at the University of Oslo, Norway, and the National Evolutionary Synthesis Center (NESCENT) at Duke University, U.S. We thank Stevan J. Arnold, Donald Griffin, Øistein Holen, Mihaela Pavlicev, Jura Pintar, Claus Rueffler, Tore Schweder, Patrick Phillips, Michael Wade, and two anonymous reviewers for their comments on the manuscript. Arnaud LeRouzic performed the simulations in Figure 3. David Houle was supported in this research by a visiting professorship at CEES and a visiting scientist appointment at NESCENT.
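
Illustrative note on Example 7. The claim that nothing discontinuous happens when a genetic covariance changes sign can be checked with a few lines of arithmetic. The sketch below is not taken from the text above; it assumes the standard multivariate response equation of quantitative genetics (response = G × β, as in Lande 1979), with two traits and direct selection on the first trait only. The function names and every numerical value (unit genetic variances, a selection gradient of 0.1, the grid of covariances) are hypothetical and chosen only for illustration.

```python
def response_to_selection(G, beta):
    """Predicted change in trait means, delta_z = G * beta.

    G    : genetic variance-covariance matrix as a list of rows
    beta : vector of directional selection gradients
    (Assumed model: the standard multivariate breeder's equation.)
    """
    return [sum(g_ij * b_j for g_ij, b_j in zip(row, beta)) for row in G]


def correlated_response(g12, g11=1.0, g22=1.0, beta1=0.1):
    """Response of trait 2 when only trait 1 is under direct selection.

    g12 is the genetic covariance between the traits; the default
    variances and gradient are hypothetical illustration values.
    """
    G = [[g11, g12],
         [g12, g22]]
    beta = [beta1, 0.0]  # no direct selection on trait 2
    return response_to_selection(G, beta)[1]


if __name__ == "__main__":
    # Sweep the covariance through zero: the correlated response is
    # proportional to g12, so it passes smoothly from slightly negative
    # to slightly positive with no jump at the sign change.
    for g12 in (-0.2, -0.1, 0.0, 0.1, 0.2):
        print(f"G12 = {g12:+.1f}  correlated response = {correlated_response(g12):+.3f}")
```

Under these assumptions the correlated response is simply the covariance multiplied by the selection gradient on the other trait, so the sign of the covariance carries no special threshold; only its magnitude matters for the predicted response.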

REFERENCES

Adams E. W., Fagot R. F., Robinson R. E. 1965. A theory of appropriate statistics. Psychometrika 30:99–127.
Alroy J. 1998. Cope’s rule and the dynamics of body mass evolution in North American fossil mammals. Science 280:731–734.
Álvarez-Castro J. M., Carlborg Ö. 2007. A unified model for functional and statistical epistasis and its application in quantitative trait loci analysis. Genetics 176:1151–1167.
Anderson D. R., Burnham K. P., Thompson W. L. 2000. Null hypothesis testing: problems, prevalence, and an alternative. Journal of Wildlife Management 64:912–923.
Arnold S. J., Bürger R., Hohenlohe P. A., Ajie B. C., Jones A. G. 2008. Understanding the evolution and stability of the G-matrix. Evolution 62:2451–2461.
Baldwin J. M. 1896. A new factor in evolution. American Naturalist 30:441–451, 536–553.
Barton N. H., Turelli M. 2004. Effects of genetic drift on variance components under a general model of epistasis. Evolution 58:2111–2132.
Bell G., Koufopanou V. 1986. The cost of reproduction. Pages 83–131 in Oxford Surveys in Evolutionary Biology, Volume 3, edited by R. Dawkins and M. Ridley. Oxford (UK): Oxford University Press.
Borenstein E., Meilijson I., Ruppin E. 2006. The effect of phenotypic plasticity on evolution in multipeaked fitness landscapes. Journal of Evolutionary Biology 19:1555–1570.
Boring E. G. 1945. The use of operational definitions in science. Psychological Review 52:243–245.
Breiman L. 2000. Statistical modeling: the two cultures. Statistical Science 16:199–215.
Bridgman P. W. 1922. Dimensional Analysis. New Haven (CT): Yale University Press.
Bridgman P. W. 1927. The Logic of Modern Physics. New York: MacMillan Company.
Broom M., Speed M. P., Ruxton G. D. 2005. Evolutionarily stable investment in secondary defences. Functional Ecology 19:836–843.
Butler J. P., Feldman H. A., Fredberg J. J. 1987. Dimensional analysis does not determine a mass exponent for metabolic scaling. American Journal of Physiology-Regulatory, Integrative and Comparative Physiology 253:R195–R199.
Cameron E. Z., du Toit J. T. 2007. Winning by a neck: tall giraffes avoid competing with shorter browsers. American Naturalist 169:130–135.
Carter A. J. R., Hermisson J., Hansen T. F. 2005. The role of epistatic gene interactions in the response to selection and the evolution of evolvability. Theoretical Population Biology 68:179–196.
Chamberlin T. C. 1890. The method of multiple working hypotheses. Science 15:92–96.
Chamberlin T. C. 1965. The method of multiple working hypotheses: with this method the dangers of parental affection for a favorite theory can be circumvented. Science 148:754–759.
Charlesworth B. 1990. Optimization models, quantitative genetics, and mutation. Evolution 44:520–538.
Charnov E. L. 1976. Optimal foraging, the marginal value theorem. Theoretical Population Biology 9:129–136.
Charnov E. L. 1993. Life History Invariants: Some Explorations of Symmetry in Evolutionary Ecology. Oxford (UK): Oxford University Press.
Cheverud J. M., Routman E. J. 1995. Epistasis and its contribution to genetic variance components. Genetics 139:1455–1461.
Cheverud J. M., Routman E. J. 1996. Epistasis as a source of increased additive genetic variance at population bottlenecks. Evolution 50:1042–1051.
Cockerham C. C. 1954. An extension of the concept of partitioning hereditary variance for analysis of covariances among relatives when epistasis is present. Genetics 39:859–882.
Conner J. K. 2001. How strong is natural selection? Trends in Ecology and Evolution 16:215–217.
Darwin C. 1871. The Descent of Man, and Selection in Relation to Sex. London: John Murray and Sons.
Diaz M. C., Rützler K. 2001. Sponges: an essential component of Caribbean coral reefs. Bulletin of Marine Science 69:535–546.
Dudley R. 2000. The Biomechanics of Insect Flight: Form, Function, Evolution. Princeton (NJ): Princeton University Press.
Dunn E. H., Hussell D. J. T., Welsh D. 1999. Priority-setting tool applied to Canada’s landbirds based on concern and responsibility for species. Conservation Biology 13:1404–1415.
Dytham C. 2003. Choosing and Using Statistics: A Biologist’s Guide. Second Edition. Malden (MA): Blackwell Publishing.
Eberhard W. G. 2009. Static allometry and animal genitalia. Evolution 63:48–66.
Endler J. A. 1986. Natural Selection in the Wild. Princeton (NJ): Princeton University Press.
Fierst J. L., Hansen T. F. 2010. Genetic architecture and postzygotic reproductive isolation: evolution of Bateson-Dobzhansky-Muller incompatibilities in a polygenic model. Evolution 64:675–693.
Fisher R. A. 1918. The correlation between relatives on the supposition of Mendelian inheritance. Transactions of the Royal Society of Edinburgh 52:399–433.
Fisher R. A. 1930. The Genetical Theory of Natural Selection. Oxford (UK): Clarendon Press.
Fisher R. A. 1941. Average excess and average effect of a gene substitution. Annals of Eugenics 11:53–63.
Frank S. A. 1995. George Price’s contributions to evolutionary genetics. Journal of Theoretical Biology 175:373–388.
Frank S. A. 2009. The common patterns of nature. Journal of Evolutionary Biology 22:1563–1585.
Frankino W. A., Emlen D. J., Shingleton A. W. 2010. Experimental approaches to studying the evolution of animal form: the shape of things to come. Pages 419–478 in Experimental Evolution: Concepts, Methods, and Applications of Selection Experiments, edited by T. Garland, Jr. and M. R. Rose. Berkeley (CA): University of California Press.
Frankino W. A., Zwaan B. J., Stern D. L., Brakefield P. M. 2005. Natural selection and developmental constraints in the evolution of allometries. Science 307:718–720.
Frankino W. A., Zwaan B. J., Stern D. L., Brakefield P. M. 2007. Internal and external constraints in the evolution of morphological allometries in a butterfly. Evolution 61:2958–2970.
Fry J. D. 1993. The “general vigor” problem: can antagonistic pleiotropy be detected when genetic covariances are positive? Evolution 47:327–333.
Goodnight C. J. 1987. On the effect of founder events on epistatic genetic variance. Evolution 41:80–91.
Gould S. J. 1977. Ontogeny and Phylogeny. Cambridge (MA): Belknap Press.
Grafen A. 1990a. Biological signals as handicaps. Journal of Theoretical Biology 144:517–546.
Grafen A. 1990b. Sexual selection unhandicapped by the Fisher process. Journal of Theoretical Biology 144:473–516.
Greenewalt C. H. 1975. The flight of birds: the significant dimensions, their departure from the requirements for dimensional similarity, and the effect on flight aerodynamics of that departure. Transactions of the American Philosophical Society 65:1–67.
Gunther B. 1975. Dimensional analysis and theory of biological similarity. Physiological Reviews 55:659–699.
Gunther B., Morgado E. 2003. Dimensional analysis revisited. Biological Research 36:405–410.
Hand D. J. 1996. Statistics and the theory of measurement. Journal of the Royal Statistical Society: A (Statistics in Society) 159:445–473.
Hand D. J. 2004. Measurement Theory and Practice: The World Through Quantification. London (UK): Arnold.
Hansen T. F., Álvarez-Castro J. M., Carter A. J. R., Hermisson J., Wagner G. P. 2006. Evolution of genetic architecture under directional selection. Evolution 60:1523–1536.
Hansen T. F., Houle D. 2008. Measuring and comparing evolvability and constraint in multivariate characters. Journal of Evolutionary Biology 21:1201–1219.
Hansen T. F., Pélabon C., Armbruster W. S., Carlson M. L. 2003. Evolvability and genetic constraint in Dalechampia blossoms: components of variance and measures of evolvability. Journal of Evolutionary Biology 16:754–766.
Hansen T. F., Wagner G. P. 2001a. Epistasis and the mutation load: a measurement-theoretical approach. Genetics 158:477–485.
Hansen T. F., Wagner G. P. 2001b. Modeling genetic architecture: a multilinear theory of gene interaction. Theoretical Population Biology 59:61–86.
Harvey P. H., Clutton-Brock T. H. 1985. Life history variation in primates. Evolution 39:559–581.
Hereford J., Hansen T. F., Houle D. 2004. Comparing strengths of directional selection: how strong is strong? Evolution 58:2133–2143.
Hermisson J., Hansen T. F., Wagner G. P. 2003. Epistasis in polygenic traits and the evolution of genetic architecture under stabilizing selection. American Naturalist 161:708–734.
Heusner A. A. 1982. Energy-metabolism and body size. II. Dimensional analysis and energetic non-similarity. Respiration Physiology 48:13–25.
Heusner A. A. 1983. Body size, energy-metabolism, and the lungs. Journal of Applied Physiology 54:867–873.
Heusner A. A. 1984. Biological similitude: statistical and functional relationships in comparative physiology. American Journal of Physiology-Regulatory, Integrative and Comparative Physiology 246:R839–R846.
Hinton G. E., Nowlan S. J. 1987. How learning can guide evolution. Complex Systems 1:495–502.
Hone D. W. E., Keesey T. M., Pisani D., Purvis A. 2005. Macroevolutionary trends in the dinosauria: Cope’s rule. Journal of Evolutionary Biology 18:587–595.
Houle D. 1991. Genetic covariance of fitness correlates: what genetic correlations are made of and why it matters. Evolution 45:630–648.
Houle D. 1992. Comparing evolvability and variability of quantitative traits. Genetics 130:195–204.
Houle D. 1998. How should we explain variation in the genetic variance of traits? Genetica 102–103:241–253.
Houle D., Morikawa B., Lynch M. 1996. Comparing mutational variabilities. Genetics 143:1467–1483.
Huxley J. S. 1924. Constant differential growth-ratios and their significance. Nature 114:895–896.
Huxley J. S. 1932. Problems of Relative Growth. New York: L. MacVeagh, The Dial Press.
Jerison H. J. 1969. Brain evolution and dinosaur brains. American Naturalist 103:575–588.
Kempthorne O. 1954. The correlation between relatives in a random mating population. Proceedings of the Royal Society of London, Series B: Biological Sciences 143:103–113.
Kingsland S. E. 1985. Modeling Nature: Episodes in the History of Population Ecology. Chicago (IL): University of Chicago Press.
Kingsolver J. G., Hoekstra H. E., Hoekstra J. M., Berrigan D., Vignieri S. N., Hill C. E., Hoang A., Gibert P., Beerli P. 2001. The strength of phenotypic selection in natural populations. American Naturalist 157:245–261.
Krantz D. H., Luce R. D., Suppes P., Tversky A. 1971. Foundations of Measurement, Volume I: Additive and Polynomial Representations. New York: Academic Press.
Lande R. 1979. Quantitative genetic analysis of multivariate evolution, applied to brain:body size allometry. Evolution 33:402–416.
Lande R. 1980. The genetic covariance between characters maintained by pleiotropic mutations. Genetics 94:203–215.
Lande R., Arnold S. J. 1983. The measurement of selection on correlated characters. Evolution 37:1210–1226.
Lenski R. E. 1988. Experimental studies of pleiotropy and epistasis in Escherichia coli. I. Variation in competitive fitness among mutants resistant to virus T4. Evolution 42:425–432.
Lenski R. E., Rose M. R., Simpson S. C., Tadler S. C. 1991. Long-term experimental evolution in Escherichia coli. I. Adaptation and divergence during 2,000 generations. American Naturalist 138:1315–1341.
Levins R. 1962. Theory of fitness in a heterogeneous environment. I. Fitness set and adaptive function. American Naturalist 96:361–373.
Levins R. 1966. Strategy of model building in population biology. American Scientist 54:421–431.
Lewontin R. C. 1974. The Genetic Basis of Evolutionary Change. New York: Columbia University Press.
Logan M. 2010. Biostatistical Design and Analysis Using R: A Practical Guide. Hoboken (NJ): Wiley-Blackwell.
Lord F. M. 1953. On the statistical treatment of football numbers. American Psychologist 8:750–751.
Luce R. D. 1959. Individual Choice Behavior: A Theoretical Analysis. New York: Wiley and Sons.
Luce R. D., Krantz D. H., Suppes P., Tversky A. 1990. Foundations of Measurement, Volume III: Representation, Axiomatization, and Invariance. New York: Academic Press.
Maynard Smith J. 1976. Sexual selection and the handicap principle. Journal of Theoretical Biology 57:239–242.
McGhee K. E., Fuller R. C., Travis J. 2007. Male competition and female choice interact to determine mating success in the bluefin killifish. Behavioral Ecology 18:822–830.
McMahon T. A., Bonner J. T. 1983. On Size and Life. New York: W. H. Freeman.
Michell J. 1999. Measurement in Psychology: A Critical History of a Methodological Concept. Cambridge (UK): Cambridge University Press.
Mosimann J. E. 1970. Size allometry: size and shape variables with characterizations of the lognormal and generalized gamma distributions. Journal of the American Statistical Association 65:930–945.
Narens L. 1981. On the scales of measurement. Journal of Mathematical Psychology 24:249–275.
Narens L. 1985. Abstract Measurement Theory. Cambridge (MA): MIT Press.
Paenke I., Sendhoff B., Kawecki T. J. 2007. Influence of plasticity and learning on evolution under directional selection. American Naturalist 170:E47–E58.
Pavlicev M., Le Rouzic A., Cheverud J. M., Wagner G. P., Hansen T. F. 2010. Directionality of epistasis in a murine intercross population. Genetics 185:1489–1505.
Phillips P. C., Arnold S. J. 1989. Visualizing multivariate selection. Evolution 43:1209–1222.
Platt J. R. 1964. Strong inference: certain systematic methods of scientific thinking may produce much more rapid progress than others. Science 146:347–353.
Podolsky R. H., Shaw R. G., Shaw F. H. 1997. Population structure of morphological traits in Clarkia dudleyana. II. Constancy of within-population genetic variance. Evolution 51:1785–1796.
Pomiankowski A. 1987. Sexual selection: the handicap principle does work-sometimes. Proceedings of the Royal Society of London, Series B: Biological Sciences 231:123–145.
Pomiankowski A. N. 1988. The evolution of female mate preferences for male genetic quality. Pages 136–184 in Oxford Surveys in Evolutionary Biology, Volume 5, edited by P. H. Harvey and L. Partridge. Oxford (UK): Oxford University Press.
Post E., Forchhammer M. C. 2002. Synchronization of animal population dynamics by large-scale climate. Nature 420:168–171.
Postma E., Visser J., van Noordwijk A. J. 2007. Strong artificial selection in the wild results in predicted small evolutionary change. Journal of Evolutionary Biology 20:1823–1832.
Price G. R. 1970. Selection and covariance. Nature 227:520–521.
Prothero J. 1986. Methodological aspects of scaling in biology. Journal of Theoretical Biology 118:259–286.
Prothero J. 2002. Perspectives on dimensional analysis in scaling studies. Perspectives in Biology and Medicine 45:175–189.
Remold S. K., Lenski R. E. 2001. Contribution of individual random mutations to genotype-by-environment interactions in Escherichia coli. Proceedings of the National Academy of Sciences USA 98:11388–11393.
Reznick D. 1985. Costs of reproduction: an evaluation of the empirical evidence. Oikos 44:257–267.
Roff D. A., Mousseau T. A. 1987. Quantitative genetics and fitness: lessons from Drosophila. Heredity 58:103–118.
Rose M. R. 1984. Genetic covariation in Drosophila life history: untangling the data. American Naturalist 123:565–569.
Rosen R. 1962. The derivation of D’Arcy Thompson’s theory of transformations from theory of optimal design. Bulletin of Mathematical Biophysics 24:279–290.
Rosen R. 1978a. Dynamical similarity and theory of biological transformations. Bulletin of Mathematical Biology 40:549–579.
Rosen R. 1978b. Fundamentals of Measurement and Representation of Natural Systems. New York: Elsevier North-Holland.
Sarle W. S. 14 September 1997. Measurement Theory: Frequently Asked Questions. ftp://ftp.sas.com/pub/neural/measurement.html. (Accessed 10 January 2008).
Schmidt-Nielsen K. 1984. Scaling: Why is Animal Size So Important? Cambridge (UK): Cambridge University Press.
Simmons R. E., Scheepers L. 1996. Winning by a neck: sexual selection in the evolution of giraffe. American Naturalist 148:771–786.
Smith R. J., Jungers W. L. 1997. Body mass in comparative primatology. Journal of Human Evolution 32:523–559.
Sneath P. H. A., Sokal R. R. 1973. Numerical Taxonomy: The Principles and Practice of Numerical Classification. San Francisco (CA): W. H. Freeman.
Sokal R. R., Rohlf F. J. 1995. Biometry: The Principles and Practice of Statistics in Biological Research. Third Edition. New York: W. H. Freeman.
Solow A. R., Wang S. C. 2008. Some problems with assessing Cope’s Rule. Evolution 62:2092–2096.
Stahl W. R. 1961. Dimensional analysis in mathematical biology. I. General discussion. Bulletin of Mathematical Biology 23:355–376.
Stahl W. R. 1962. Similarity and dimensional methods in biology. Science 137:205–212.
Stephens D. W., Dunbar S. R. 1993. Dimensional analysis in behavioral ecology. Behavioral Ecology 4:172–183.
Steppan S. J., Phillips P. C., Houle D. 2002. Comparative quantitative genetics: evolution of the G matrix. Trends in Ecology and Evolution 17:320–327.
Stevens S. S. 1946. On the theory of scales of measurement. Science 103:677–680.
Stevens S. S. 1959. Measurement, psychophysics and utility. Pages 18–63 in Measurement: Definitions and Theories, edited by C. W. Churchman and P. Ratoosh. New York: Wiley.
Stevens S. S. 1968. Measurement, statistics, and the schemapiric view. Science 161:849–856.
Stinchcombe J. R., Agrawal A. F., Hohenlohe P. A., Arnold S. J., Blows M. W. 2008. Estimating nonlinear selection gradients using quadratic regression coefficients: double or nothing? Evolution 62:2435–2440.
Suppes P., Krantz D. H., Luce R. D., Tversky A. 1989. Foundations of Measurement, Volume II: Geometrical, Threshold, and Probabilistic Representations. New York: Academic Press.
Svennungsen T. O., Holen Ø. H. 2007. The evolutionary stability of automimicry. Proceedings of the Royal Society of London, Series B: Biological Sciences 274:2055–2063.
Takahashi M., Arita H., Hiraiwa-Hasegawa M., Hasegawa T. 2008. Peahens do not prefer peacocks with more elaborate trains. Animal Behaviour 75:1209–1219.
Teissier G. 1936. Croissance comparée des formes locales d’une même espèce. Mémoires du Musée Royal D’Histoire Naturelle de Belgique 3:627–634.
Thompson D. W. 1917. On Growth and Form. Cambridge (UK): Cambridge University Press.
Turelli M. 1984. Heritable genetic variation via mutation-selection balance: Lerch’s zeta meets the abdominal bristle. Theoretical Population Biology 25:138–193.
Via S., Lande R. 1985. Genotype-environment interaction and the evolution of phenotypic plasticity. Evolution 39:505–522.
Vik J. O., Stenseth N. C., Tavecchia G., Mysterud A., Lingjærde O. C. 2004. Ecology (communication arising): living in synchrony on Greenland coasts? Nature 427:697–698.
Wagner G. P. 2010. The measurement theory of fitness. Evolution 64:1358–1376.
Wagner G. P., Altenberg L. 1996. Perspective: complex adaptations and the evolution of evolvability. Evolution 50:967–976.
Wagner G. P., Laubichler M. D. 2001. Character identification: the role of the organism. Pages 141–164 in The Character Concept in Evolutionary Biology, edited by G. P. Wagner. San Diego (CA): Academic Press.
Wagner G. P., Laubichler M. D., Bagheri-Chaichian H. 1998. Genetic measurement theory of epistatic effects. Genetica 102–103:569–580.
Weber K. E. 1990. Selection on wing allometry in Drosophila melanogaster. Genetics 126:975–989.
Weber K. E. 1992. How small are the smallest selectable domains of form? Genetics 130:345–353.
Weitzenhoffer A. M. 1951. Mathematical structures and psychological measurements. Psychometrika 16:387–406.
White J. F., Gould S. J. 1965. Interpretation of the coefficient in the allometric equation. American Naturalist 99:5–18.
Whitlock M. C., Schluter D. 2009. The Analysis of Biological Data. Greenwood Village (CO): Roberts and Company.
Wiley R. H., Poston J. 1996. Perspective: indirect mate choice, competition for mates, and coevolution of the sexes. Evolution 50:1371–1381.
Wilkinson G. S. 1993. Artificial sexual selection alters allometry in the stalk-eyed fly Cyrtodiopsis dalmanni (Diptera: Diopsidae). Genetical Research 62:213–222.
Wolman A. G. 2006. Measurement and meaningfulness in conservation science. Conservation Biology 20:1626–1634.
Wulff J. 2001. Assessing and monitoring coral reef sponges: why and how? Bulletin of Marine Science 69:831–846.
Yoccoz N. G. 1991. Use, overuse, and misuse of significance tests in evolutionary biology and ecology. Bulletin of the Ecological Society of America 72:106–111.