Cladistics 13, 225–245 (1997) WWW http://www.apnet.com

Self-Weighted Optimization: Tree Searches and Character State Reconstructions under Implied Transformation Costs

Pablo A. Goloboff Consejo Nacional de Investigaciones Científicas y Técnicas, Instituto “Miguel Lillo,” Miguel Lillo 205, 4000 S.M. de Tucumán, Argentina

Accepted 17 June 1997

A method to assess the cost of character state transformations based on their congruence is proposed. Measuring the distortion of different transformations with a convex increasing function of the number of transformations, and choosing those reconstructions which minimize the distortion for all transformations, may provide a better optimality criterion than the linear functions implemented in currently used methods for optimization. If trees are optimized using such a measure, transformation costs are dynamically determined during reconstructions; this leads to selecting trees implying that the possible state transformations are as reliable as possible. The present method is not iterative (thus avoiding the concern of different final results for different starting points), and it has an explicit optimality criterion. It has a high computational cost; algorithms to lessen the computations required for optimizations and searches are described. © 1997 The Willi Hennig Society

0748-3007/97/030225+21/$25.00/0/cl970043 Copyright © 1997 by The Willi Hennig Society. All rights of reproduction in any form reserved.

INTRODUCTION

In cladistics, the fit of trees to data is measured as a function of the number of independent originations of character states required, found with a process known as “optimization” (Farris, 1970). During optimization, minimizing some character state transformations may be preferable to minimizing others; the relative costs of different character state transformations determine the character state reconstructions, the tree costs, and which trees are chosen.

The optimization methods for additive (Farris, 1970) and non-additive (Fitch, 1971) characters are the two most commonly used; in non-additive characters all changes are considered equally informative, but in additive characters (lineal or branched), the importance of transformations between different pairs of states is different. Usually, there is no empirical basis for more complex models of transformation, such as those allowed by the “generalized parsimony” approach of Sankoff and Rousseau (1975), implemented in PAUP (Swofford, 1993) and SPA (Goloboff, 1996a). According to some authors (e.g. Wheeler, 1992), this is the case even for DNA data (where it is possible to establish equivalencies between states of different characters, allowing in principle extrapolation from one character or data set to another). For morphology (where establishing such equivalencies between character states is impossible), this is much more obviously so. Therefore, the consequence of using “generalized parsimony” is often that the results
are determined more by the assumptions than by the evidence itself (Carpenter, 1994), and cladists tend to avoid that approach.

Two main kinds of arguments have been suggested to decide which transformations it is preferable to minimize. One argument proposes to use the direct observation of the morphological relationships between the states themselves; transformations between more similar states should be less costly than transformations between radically different states. This often allows defining relative degrees of apparent homology, and considering a character as additive (Lipscomb, 1992). By its very nature, this approach is difficult to formalize, and it is not always applicable. Nonetheless, it seems obvious that, whenever the morphology clearly indicates degrees of homology, nothing can be gained by discarding that information, and Lipscomb’s approach should be followed.

The other possible argument for assigning costs to transformations is their congruence: transformations which are more incongruent with a tree are implied to be less reliable. This is additional information that can be used, not instead of, but rather combined with, the information provided by the degrees of similarity among the states. Furthermore, the relative congruence of a transformation in the two possible directions (i.e. 0→1 or 1→0) may give information on asymmetries in costs. Observation of morphology alone cannot provide that information, unless coupled with specific assumptions (such as “complex organs are less likely to evolve twice”). Phylogenetic conclusions, however, often imply without ambiguity that groups characterized by a transformation in one direction are more reliably defined than groups characterized by the transformation in the opposite direction. A clear example is the presence of thoracic wings, defining the pterygote insects (and occurring in no other group), as opposed to wing loss, observed in numerous unrelated insect groups, characterized by independent wing losses. That a given insect has thoracic wings is enough to make us certain that it belongs to the Pterygota, but the lack of wings tells us very little of its possible affinities with other insects: we know of secondarily apterous roaches, mantids, grasshoppers, flies, bugs, beetles, etc. An alternative formulation is that sharing some of the states seems more likely due to homology (i.e. traceable to a common node) than is sharing other states; if current hypotheses of insect relationships are even roughly approximate, homologizing a wing absence at the expense of considering some wing presences as non-homologous is clearly counterindicated.

As already pointed out, those two possible arguments to determine costs are complementary, and using one does not preclude using the other. Lipscomb’s (1992) treatment of her own homology analysis and congruence-based methods such as TSA as alternatives would suggest the opposite, but the two approaches focus on different, logically independent components of the transformation costs. In principle, evaluating costs of transformations according to congruence could be more easily formalized than evaluating them according to degrees of similarity. A proper analytical method should accomplish this automatically, once the data set (including prior costs, or additivities, for each character) is given. However, none of the methods proposed to date has been generally accepted.

The aim of this paper is to propose a way to evaluate the relative informativeness of different transformations dynamically, during the optimization process (and therefore during tree searches). Unlike other methods proposed to evaluate informativeness, the present method is not iterative, and therefore does not depend on initial hypotheses of implied costs. The method is an extension of Goloboff’s (1993a) method for comparing trees under the weights they imply (implemented in the Ms-Dos program Pee-Wee; Goloboff, 1993b), but applied to character state transformations. I will first present a description of the method and its basic rationale, followed by a discussion of specific fitting functions and algorithms for optimization and tree searches. Finally, I will briefly discuss the basic properties and implications of the method, as compared to other methods proposed to evaluate the relative informativeness of character state transformations.

SELF-WEIGHTED OPTIMIZATION

The algorithms developed by Sankoff and collaborators (Sankoff and Rousseau, 1975; Sankoff and Cedergreen, 1983) allow finding most parsimonious reconstructions for any tree, under fixed transformation costs. However, if one is interested in assessing
costs based on congruence, the costs should be tree-dependent, because different trees have different implications in terms of costs. On first thought, it seems that one cannot choose among phylogenetic hypotheses unless the costs have been previously defined, and the costs cannot be chosen in the absence of a hypothesis. However, every topology evaluated during a search should be evaluated according to its own implications of reliability. But the costs, even for a particular tree, are not determined automatically: many different reconstructions (parsimonious or not) are usually possible, and each implies a certain number of steps, and hence a cost, for the different transformations. Many of the possible reconstructions, however, can be rejected on the grounds that they are not optimal under the weights they themselves imply. Those reconstructions, that is, postulate additional transformations for the types of change which (according to what happens in other parts of the tree) appear to be more reliable, and are therefore self-contradictory. Other reconstructions will instead minimize the total change according to the implied costs, being internally consistent.

Different internally consistent reconstructions will usually be possible, however, and some of them may be preferable. Those consistent reconstructions which imply that the possible character state transformations are most reliable (i.e. have the highest cost, on average) are clearly preferable to reconstructions which imply that most character state transformations are poorer indicators of relationship. As in the method of weighting of whole characters of Goloboff (1993a), if the function to determine fit from numbers of transformations has the proper shape, those reconstructions that imply the maximum fit will necessarily be internally consistent. For this to be the case, the difference in fit for a fixed difference in the number of transformations must decrease with the absolute numbers of transformations (see Goloboff, 1993a; Farris, 1981: 12 presented a similar reasoning in a different context). That dependency is seen in concave decreasing functions of the number of transformations s, such as 1/(s+1). Therefore, preferring the reconstructions where such a function is maximized leads to postulation of additional steps in those transformations occurring often elsewhere in the tree. The summed value of the function for all transformations is directly proportional to their average cost or reliability, as reliability is usually measured with concave functions (see Farris, 1969; Goloboff, 1993a). The maximum possible value of that function will vary for different trees, and those trees which allow attribution of higher reliability to the character(s) are to be preferred. The most important aspect of cladistics is considering the evidence as the only factor determining the choice of a tree; as noted by Goloboff (1993a), choosing those trees which imply that evidence is most reliable is in direct agreement with that basic tenet.

Optimizations are usually described in terms of minimizations. Maximizing a concave function such as $1/(s_{ij}+1)$ (where $s_{ij}$ = number of i→j transformations; note that $s_{ij} \geq 0$) produces the same results as minimizing its complement:

$$d_{ij} = 1 - \frac{1}{s_{ij}+1} = \frac{s_{ij}}{s_{ij}+1}. \qquad (1)$$

Equation (1) increases with increasing steps and so can be seen as measuring the “distortion” that the reconstruction imposes on the character state transformation i→j. The “total distortion” for the reconstruction is obtained summing the individual distortions for each possible transformation:

$$D = \sum_{i,j} d_{ij}.$$

Among possible reconstructions for a character, those that imply the minimum total distortion, MD, are to be preferred. Different trees will imply different values of $\sum MD$; among possible trees, those with lowest $\sum MD$ are to be chosen.

Consider as example Fig. 1A, which shows a Fitch optimization taken from a real data set (of embiid insects, with state 0 indicating presence of wings, and 1, absence; Szumik, pers. comm.). That reconstruction, of four wing losses and one regain, is not internally consistent. Using the complement of equation (1), the implied costs are in the ratio 2:5 for 0→1 and 1→0, a cost under which a different reconstruction is obtained (Fig. 1B; note assignments for nodes a–c). Transformations 0→1 have a much poorer correlation with the groups in the tree than 1→0 (even according to the reconstruction which presupposed that they were equally reliable), and it is therefore preferable to

FIG. 1. Two alternative reconstructions, (A) and (B), for a binary character.

postulate additional 0→1 and fewer 1→0 changes. If one simply chooses reconstructions such that equation (1) is minimized, the choice of reconstruction 1B is direct. For the Fitch reconstruction, the distortion will be:

$$D = \frac{s_{01}}{s_{01}+1} + \frac{s_{10}}{s_{10}+1} = \frac{4}{5} + \frac{1}{2} = 1.300$$

while for reconstruction 1B it will be:

$$D = \frac{6}{7} + \frac{0}{1} = 0.857.$$

By checking all possible reconstructions it can be seen that reconstruction 1B has the minimum possible value for $\sum_{i,j} s_{ij}/(s_{ij}+1)$.

The conventional optimization of a character seeks to minimize the distortion as measured by a linear function of the number of steps. It is then easily seen why an optimization under fixed costs may be self-inconsistent: the difference in distortion for a given difference of steps i→j is the same, regardless of whether or not many other i→j changes are postulated somewhere else in the tree. That will be the case for Fitch, Farris, and Sankoff optimization.

Prior costs can be easily used in combination with the above approach. Then, if $p_{ij}$ is the (prior) cost of i→j, MD can be calculated as the minimum $\sum_{i,j} (p_{ij} \cdot d_{ij})$ among possible reconstructions for the tree.

FUNCTIONS

The function used in the example above was used for illustrative purposes only. Several improvements on that function are possible.

The states assigned to the internal tree nodes and the trees selected to minimize the required distortion may depend on how strongly homoplastic transformations are downweighted. The function shown above downweights transformations by a large factor, even for a single instance of homoplasy. One may consider, however, that only postulating numerous instances of homoplasy lowers reliability to such a degree. This can be accomplished by introducing a constant of weighting strength (or concavity), K, in the formula; then, $d_{ij} = s_{ij}/(s_{ij}+K)$; note that $K \geq 1$. When K increases,
homoplastic transformations are downweighted more mildly, producing results more similar to a linear optimization under prior costs. In procedures like successive weighting, it is common to assign to characters with a single instance of homoplasy as little as 50% of the weight of perfect characters (see, for example, the documentation for Hennig86; Farris, 1988). In the test program mentioned below, I have implemented a strength constant of six as default. Under that strength, adding one step to transformations with two, three, four, and five steps costs 76, 58, 47, and 38% of the cost of adding one step to a non-homoplastic transformation. This weights much more mildly than the function in equation (1) and should produce results significantly different from the ones obtained under linear optimization only in those cases in which some transformations are highly homoplastic.

Even with the correction for weighting strength, the function in equation (1) still produces undesired results in some cases because it starts downweighting from the first step. Consider the tree (0 (1 (1 0))). With equation (1), under K=6, $d_{ij}$ = 0, 0.143, and 0.250 for $s_{ij}$ = 0, 1, and 2, respectively. Then, assigning state 0 to the internal tree nodes has a D of 0.286, while assigning state 1 has a D of 0.333. The first step 0→1 postulated when assigning 0 to one of the internal nodes makes the second step 0→1 required by assigning 0 to the other internal node cost less; the function then leads to preferring to postulate two 0→1 changes instead of one 0→1 and one 1→0. This is solved if the first part of the function is linear. Then, one can consider

$$d_{ij} = \begin{cases} \dfrac{s_{ij}}{s_{ij}+K} & \text{if } s_{ij} \leq 1 \\[1ex] \dfrac{1}{K+1} + \dfrac{s_{ij}-1}{s_{ij}-1+K} & \text{if } s_{ij} > 1 \end{cases} \qquad (2)$$

In that case (and for any value of K), the assignments of 0 or 1 to the internal nodes are seen as producing the same distortion […]

[…] homologized, i.e. how many times each state originates, regardless of what other state it comes from. For binary characters (as in the example of wing presence/absence) this makes no difference, but for multistate characters it may produce results different from those in equation (2). The equation to be minimized is then¹

$$\sum_{j} \frac{\sum_{i} s_{ij}}{\sum_{i} s_{ij} + K}.$$

In this case, all transformations leading to a given state will be downweighted, which may require giving a low weight to otherwise very unlikely transformations. It is reasonable to suppose that certain structures are more likely to originate from some conditions than from others, and a method potentially capable of detecting this seems desirable. Consider having a given structure as either absent, or small, or big. One may consider that postulating additional transformations to “small” (regardless of which state “small” comes from) costs less to the extent that more of those transformations are being postulated by the reconstruction. This will often not produce a reasonable assignment of states (and, therefore, not provide a good measure of fit). If there are many reductions from big to small, and no changes from absent to small, this approach will lead to considering absent→small changes as very cheap, even when the trend suggested by the many big→small changes seems to be the opposite. This alternative seems, therefore, to produce an inferior optimality criterion to the procedure outlined above.

Another modification of the present method may involve counting the transformations between states regardless of their direction. The required formulation is almost the same; in equation (1) above, simply replace the value of $s_{ij}$ by $s_{ij}+s_{ji}$ (calculating distortions only for the cases where i < j).
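
As a hedged illustration of the fitting functions discussed in this section, the following Python sketch evaluates equation (1) with a weighting strength K, equation (2), the prior-cost weighted total distortion, and the direction-independent variant. All function and variable names are hypothetical (they are not taken from the paper's Appendix 1 or test program), and exact fractions are used only to make the arithmetic easy to check.

```python
from fractions import Fraction


def distortion_eq1(s, k=1):
    """Equation (1) with weighting strength K: d = s / (s + K)."""
    return Fraction(s, s + k)


def distortion_eq2(s, k=1):
    """Equation (2): the first step is linear, later steps are downweighted."""
    if s <= 1:
        return Fraction(s, s + k)
    return Fraction(1, k + 1) + Fraction(s - 1, s - 1 + k)


def total_distortion(steps, d=distortion_eq2, k=1, prior=None):
    """Sum p_ij * d_ij over all transformation types.

    steps maps (i, j) to s_ij; prior optionally maps (i, j) to p_ij
    (taken as 1 when no prior costs are given).
    """
    total = Fraction(0)
    for change, s in steps.items():
        p = 1 if prior is None else prior[change]
        total += p * d(s, k)
    return total


def undirected(steps):
    """Variant counting i->j and j->i together (s_ij + s_ji, stored for i < j)."""
    merged = {}
    for (i, j), s in steps.items():
        key = (min(i, j), max(i, j))
        merged[key] = merged.get(key, 0) + s
    return merged


# Fig. 1 under equation (1) with K = 1 (state 0 = wings present, 1 = absent).
fitch_1a = {(0, 1): 4, (1, 0): 1}   # four losses, one regain
recon_1b = {(0, 1): 6, (1, 0): 0}   # six losses, no regain
print(total_distortion(fitch_1a, d=distortion_eq1))  # 13/10, i.e. 1.300
print(total_distortion(recon_1b, d=distortion_eq1))  # 6/7, i.e. about 0.857

# Two 0->1 steps versus one 0->1 step plus one 1->0 step.
two_same = {(0, 1): 2, (1, 0): 0}
one_each = {(0, 1): 1, (1, 0): 1}
for k in (1, 3, 6):
    # Equation (1) favours repeating the same change; equation (2) is neutral.
    assert (total_distortion(two_same, distortion_eq1, k)
            < total_distortion(one_each, distortion_eq1, k))
    assert (total_distortion(two_same, distortion_eq2, k)
            == total_distortion(one_each, distortion_eq2, k))
```

The loop at the end checks the motivation for the linear first step: under equation (1) a second change of the same type is cheaper than a first change of another type, whereas under equation (2) the two reconstructions score the same for each K tried.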



[…] downweight the transformations between states if they occur more often in the tree. For binary characters, this is equivalent to the method of Goloboff (1993a) for weighting whole characters. In cases such as Fig. 1, this modification would ignore the fact that the state distribution itself suggests that changes 0→1 and 1→0 should not be counted as equivalent, and for that reason the criterion is also considered inferior to the one implemented with equation (2).

LIMITING POSSIBLE RECONSTRUCTIONS

Other undesirable assignments for multistate characters may occur when the implied costs violate the triangle inequality. In those cases, postulating that the common ancestor of two terminals with identical state had that same state may appear less “parsimonious” than postulating a different state for the ancestor². Although “optimal”, such assignments of a different state are clearly illogical and unparsimonious. When the triangle inequality is violated, it is also possible that addition of a taxon with a missing (unobserved) entry increases the apparent fit of the tree³. A possible remedy for that problem is using a fitting function where the decrease in fit by adding a step in a transformation is never less than half the decrease for non-homoplastic transformations. This would require making the function linear beyond a certain number of steps (thus eliminating the correspondence between maximum fit and maximum weight), and would allow only very mild weighting of the transformation costs. It seems preferable to use a convex function of the number of steps for the optimization process, as defined in equation (2), and select from among reconstructions that do not violate the triangle inequality, instead of from all possible reconstructions. This restriction in fact makes optimization easier, and is used by the algorithms described below.

Another restriction of the possible reconstructions comes from rooting considerations. Under linear optimization for symmetrical transformation costs, the states for the two nodes in the first split will always appear in the set of possible states for the root of the tree. For the present method, this may not be the case; the root of the tree may be assigned states not occurring in the first splitting taxon. But if successive sister groups of the group under study, not included in the matrix, have the same state as the outgroup taxon included in the matrix, assigning a different state to the root would imply many other, uncounted, transformations to the outgroup state outside the study group. In such cases, it is preferable to restrict the possible state assignments for the root to those observed in the outgroup taxon. Minimal modifications of the algorithms described below accomplish this.

² Example: two sister taxa with state 1, and several successive sister groups with state 0. If there are many 0→2 and 2→1 transformations, and no 0→1, elsewhere in the tree, the common ancestor of the two 1s may be assigned state 2, not 1.

³ Example: a single terminal with state 1 and several successive sister groups with state 0, with many 0→2 and 2→1 transformations, and no 0→1, occurring elsewhere in the tree. Postulating a change 0→1 between the terminal with 1 and its ancestor is unavoidable. However, if a taxon with a missing entry is added as sister group of the taxon with state 1, a new node will be created, and it will now be possible to postulate additional 0→2 and 2→1 changes, and no 0→1, with the consequence that the “fit” will increase.

CALCULATION OF MD AND OPTIMAL TREES

Finding Optimal Reconstructions

Finding the states that occur in optimal reconstructions under fixed transformation costs is relatively simple, requiring two passes, down and up the tree. Such a two-pass procedure is apparently not possible here, since the local distortion within subtrees is not independent of transformations outside. Therefore, the (preliminary) cost of assigning a given state to an internal node cannot be determined unless all other nodes have been assigned their states.

The search for optimal reconstructions cannot be done, either, by trial-and-error methods. Changing the assignments of one or a few nodes at a time will become very easily trapped in local optima. Fig. 1A and B provide examples. Reconstruction 1B is the global optimum, but 1A is a local optimum: changing the assignment for any single node in the reconstruction in Fig. 1A produces a higher D. The only way to lower D, given the reconstruction 1A, is to change three nodes at the same time (indicated as a, b, and c). Another local
optimum is the reconstruction in Fig. 2, which provides a more radical example: the only way to lower D in that case is changing assignments for 12 nodes at the same time, to produce reconstruction 1A (just another local optimum)⁴.

FIG. 2. A local optimum for the same data as in Fig. 1 (see text for details).

The optimal reconstructions, therefore, must be found in a global way. The simplest method is exhaustively enumerating and evaluating reconstructions, but that is too slow to be practical. The search for optimal reconstructions can be made more efficient in two ways: (1) using partial order solutions (the same principle used for branch-and-bound; Hendy and Penny, 1982), and (2) restricting the number of nodes for which different state assignments have to be tried. Appendix 1 shows a recursive function (in the C programming language) that finds optimal state assignments (references to line numbers, below, refer to that example code).

⁴ The problem of local optima of reconstructions poses an interesting question. Figs 1A, B, and 2 show the only three optima for those data and that tree. Fig. 2 is a local optimum; there are reconstructions with a much lower distortion but which are not local optima (an example is reconstruction 1A with node d having state 0 instead of 1). The three optima in Figs 1A, B, and 2 correspond to reconstructions which are optimal under some possible prior costs (optimized under fixed prior costs). Perhaps a general property of the optima of reconstructions is that there must be a set of prior costs for which the reconstruction is optimal under those fixed costs (and vice versa).

Application of the principle used in branch-and-bound requires that partial values of d be calculated for each type of change (the initial ones are all zero). The states should be assigned in such a way that all descendants are assigned a state before their ancestors; in this way, partial state assignments can be rejected earlier, as state differences will often be implied between internal nodes and the two (or more) descendants of the node. Every time a state is assigned to a node such that a change is implied between the node and either (or both) of the descendants, the corresponding $s_{ij}$ are increased, and the partial D is calculated (lines 16–20, 23–27, for left and right descendant). If the partial D exceeds the value of D for the best reconstruction found so far (line 28), it is unnecessary to check the rest of the nodes (line 29, where the function calls itself). A reconstruction under linear parsimony (i.e. under the corresponding additive, non-additive, or Sankoff prior costs) usually provides a reasonable D as initial bound.

FIG. 3. Determination of “fixed” internal nodes. Nodes with a state indicated are fixed at that state; those with a missing entry are not fixed at any state.

The restriction of the number of nodes for which different state assignments must be tried is more involved. It is based on determining:

1. nodes with a “fixed” state. If all terminals belonging to a node have the same, single state, the node must have that same state. In Fig. 3, node a is “fixed” at state 1. A state set including for each node the states of all its descendants can be calculated easily in a down-pass, as the union of the state sets of its descendants. Nodes leading to one missing entry and one fixed node can be
considered fixed if they are the sister group of a clade fixed at that same state. In Fig. 3, nodes b and c are fixed at state 1, but node d is not. To take missing entries into account, it is necessary to consider them as empty sets in the down-pass, and then do an up-pass. In the up-pass, for those internal nodes leading to one fixed node and one missing entry (or an internal node leading only to such), add to the union set determined in the down-pass the union set of its ancestor. If the final (up-pass) union set of the node is a single state, the node is fixed at that state;

2. “linked” nodes. Some nodes in the tree will always have the same assignment as other nodes, in any optimal reconstruction:

(a) internal nodes leading only to missing entries: they will always have the same state assignments as their ancestor; they never require any steps (lines 3 and 5; for that to work properly, tree nodes must be numbered with descendants smaller than their ancestor);

(b) repeats: when several successive sister groups of a clade are terminals having the same (single) state (or are internal nodes “fixed” at that state), it can be assumed that all the nodes leading to those successive sister groups will be given (in any optimal reconstruction) the same state assignment. In cases where this does not hold for linear optimizations under asymmetric prior costs⁵, this restriction may prevent finding some particular reconstructions (since the first part of the fitting function is linear); the state sets for each node, however, will still be correctly identified. For symmetric prior costs, I have not found exceptions to the rule of repeats. This is one of the factors that most reduces the time needed to find optimal reconstructions. An example is provided in Fig. 4. In any optimal reconstruction, nodes b and c will always be given the same state as node a. Nodes b and c are “linked” to node a. Node a has two nodes, b and c, linked to it (more specifically, linked to its right descendant). During the search for reconstructions, one need only assign states to node a, and if steps are implied between the node and the right descendant, the step increase in that type of transformation is taken to be one plus the number of nodes that descendant has linked to it (lines 18, 25). Nodes b and c need not be assigned states at all, until the search is completed (in which case, the entire state sets for a can be copied onto b and c; lines 46, 47). In the example of Fig. 4, when checking whether assigning a given state to node d adds steps between d and c, the states for c have to be considered as those for the node to which c is linked (node a; lines 2–7; obviously, nodes which are not linked to other nodes must be defined as “linked” with themselves);

(c) internal nodes leading to one missing and one non-missing entry: non-fixed nodes leading to a missing entry can be disregarded during the search of reconstructions. The assignments for the rest of the nodes must be the same, whether or not the taxon with the missing entry is included; the assignments for the node leading to one missing and one non-missing entry can be calculated subsequently, using the state sets for the linked nodes. A node leading to one missing and one non-missing entry must be considered as linked to the same node to which its descendant node with non-missing entry is linked. In an up-pass, its final state set can be calculated as the union of the state sets for the first ancestor not leading to a missing entry, and the state set for the node to which the node is linked (lines 47–54). Note that two or more successive missing entries are taken into account by the linkages. Eliminating the nodes leading to a missing and a non-missing entry from the search may prevent some optimal states (for those nodes) from being found. A simple example is the tree (((2 ?) 0) 0), under additive prior costs; the common node of (2 ?) could be assigned states 0, 1, or 2 equally parsimoniously, but this shortcut will prevent state 1 from being found. But this will not affect the measure of total distortion, nor state assignments for other nodes. Furthermore, if the shortcut is not included, it is possible that some state assignments (resulting from violations of the triangle inequality) will cause taxa with missing entries to decrease the apparent distortion required by the tree;

3. forbidden changes. If some transformations do not occur in a linear optimization, they are likely to be given the highest weight in a non-linear optimization: assigning a lower weight to the other transformations may require that more of those other transformations are postulated, but will rarely require postulating the transformations that were not necessary under prior costs only. This requires a complete linear optimization (finding all state sets for each node; one must be done, anyway, to provide an initial bound for the partial order) before the search for optimal reconstructions starts. Then, during the search for reconstructions, all those state assignments which would imply forbidden changes are skipped (lines 15 and 22). Some exceptions to this rule are possible, particularly when there is much homoplasy and the weighting function is strong. For large numbers of states, however, optimizations can be achieved in reasonable times only if the shortcut is used, even if the resulting measures of distortion may not be exact;

4. impossible state assignments. During the search for reconstructions, if both descendants of a node have been assigned the same state, the node must also be assigned that state in any reconstruction. The assignment of any other state need not be checked (line 14).

FIG. 4. Determination of linkages between nodes. Node e is linked to node d, and nodes c and b are linked to node a. Node d has one node linked to it, node a has two.

⁵ Example: tree (0 (2 (2 1))), cost of 10 for 2→1, 5 for 0→1, and unity for all other transformations. The reconstruction assigning 0 and 1 to the ancestors of terminals with state 2 is optimal.

Note that illogical state assignments resulting from violations of the triangle inequality are ruled out in points 1, 2c, and 4. The function in Appendix 1 requires that fixed and linked nodes, and forbidden changes, have been identified. Searches for optimal assignments can be enormously speeded up when all those factors are used. For S nodes fixed or linked to other nodes, the number of reconstructions that has to be actually evaluated (for N states) is diminished by a factor $N^S$; for 30 or more taxa, this often gives $10^{10}$ times fewer reconstructions to try for binary characters, and $10^{15}$ for multistate characters. The use of shortcut 3 alone may increase speed by a factor of 100. Shortcut 4 saves an additional 10–30% of the time.

Using all the shortcuts described, data sets of 40–60 taxa can be optimized rather quickly, as long as no character has more than five or six states. With increasing numbers of states, taxa, and lack of congruence between the tree and the character, optimization times increase very rapidly. The present algorithms can, hopefully, be improved to eliminate those limitations.

The algorithms described and the example code can be adapted to allow optimization of a character with multistate (=polymorphic) terminals. Adding the multistate terminals to the set of nodes for which different state assignments must be tried will accomplish this, although it increases the time required for finding the optimal reconstructions.

Finding the Optimal Trees

(A) General Strategy

Searches for optimal trees can be done using branch-swapping. The general strategy for estimating MD for the rearranged candidates is much like that proposed by Goloboff (1994, 1996b) for linear optimization of additive and non-additive characters. It is based on calculating state assignments and character fits for the tree clipped in two and then trying to derive what the decrease in fit would be if the clipped clade were joined to a given destination. This is much faster than optimizing each rearrangement anew, but it is not exact. The lack of exactitude can be taken into account, first, by trying to favour possible errors that overestimate the fit of trees (then making it less likely that an optimal tree will be missed by erroneously considering it as worse than it actually is), and second, by doing an exact check for those trees which are not rejected by the approximate calculations. The shortcuts described work reasonably well for additive and non-additive characters, but appear to produce much error (at steps C and D, below) for more complex prior transformation costs. The higher error rate means that, for complex prior costs, the shortcuts must be used in conjunction with a larger error margin, which slows down searches (see E, below).

The shortcuts described work directly for subtree pruning regrafting branch-swapping (SPR; Swofford and Olsen, 1990); tree bisection reconnection (TBR)


would require that the MD for the clipped tree be recalculated each time the tree is rerooted. Unless a way to derive root states for the clipped clade that does not require a complete reoptimization is found, searches will be more efficient if using SPR with multiple addition sequences and retention of suboptimal trees.

In the discussion below, let $M_{ij}$ be the maximum number of transformations from state i to state j, in different optimal reconstructions (found easily during optimization), C(i) the value of the fitting equation (2) for i transformations, E(i, n) the extra cost or increase in D if state i is assigned to node n (see below; E(i, n) is zero for optimal states, positive otherwise), and $P_{ij}$ the prior cost of transformations from state i to j.

(B) Clipping the Tree

When the tree is clipped in two, it is necessary to calculate the optimal state assignments for all internal nodes (and the values of $M_{ij}$ for all i, j). The algorithms described above can be used for this. Additionally, some characters will not change state assignments after clipping the tree. These can be identified with the (exact) shortcut B of Goloboff (1996b; based on calculating union state-sets for internal nodes of the whole tree), which, on average, reduces the reoptimization time to one-half. The conditions to avoid reoptimizing a character under shortcut B of Goloboff (1996b) could even be loosened, so that it is not necessary for the clipped node’s ancestor’s ancestor (Nz in Goloboff, 1996b) to have the same single state as the clipped node (Nm); if the sister node of the clipped clade is an internal node and one of its descendants has a missing entry, the state set for the sister node may change, but all that is required is to add all the states present to the set for Nz.

Shortcut A of Goloboff (1996b) (based on the final state sets for the whole tree) is probably too inexact for the present method. Shortcut C of Goloboff (1996b) cannot be applied to the present problem, since it is based on performing down- and up-passes in restricted parts of the clipped tree.

The calculations in this stage of the search are exact, except for nodes leading to a missing and a non-missing entry (for which state sets may be incomplete, and therefore could lead to underestimating fit).

(C) Estimating the Costs of Suboptimal State Assignments

To derive the decrease in fit resulting from joining the clipped clade to a given destination (step D, below) it is necessary to use the extra cost of suboptimal state assignments. The extra cost is, for every possible state at each node, the difference between MD and the smallest D given that the state is assigned to the node. The extra costs must be calculated before the rearrangement phase starts. Calculating them exactly would be extremely time-consuming, achievable only by (almost) exhaustive enumeration of reconstructions. None of the shortcuts to find optimal reconstructions more quickly can be directly applied to calculate extra costs of suboptimal state assignments. Those extra costs, however, can be (roughly) estimated from the optimal state assignments in two steps:

1. estimate the maximum decrease in D by not assigning to the node any of the optimal states. For every possible optimal state, x, assignable to the node, calculate a value z for every possible state, y, assignable as optimal to the descendants and different from x, as:

$z = P_{xy}\,(C(M_{xy}) - C(M_{xy} - 1))$.

If there is no state y which is both optimal for the descendant and different from x, consider z = 0. Calculate z in the same way for every state y assignable to the ancestor of the node but taking into account that now y transforms into x:

$z = P_{yx}\,(C(M_{yx}) - C(M_{yx} - 1))$.

For all different simultaneous combinations of x and y (for node, ancestor, and descendants) estimate the decrease as the maximum sum of the three values of z.

2. estimate, in a down-pass, the minimum increase in D by assigning to the node any of the suboptimal states. For every possible state, x, not assignable to the node (n) in any optimal reconstruction, calculate a value w for every possible state, y, in one of the descendants (d) as:

$w = E(y, d) + P_{xy}\,(C(M_{xy} + 1) - C(M_{xy}))$

(when y = x, the expression simplifies to w = E(y, d)). Note that if d is a terminal, E(i, d) = 0 if state i occurs in the set of observed states, E(i, d) = ∞ otherwise. The extra costs for the ancestor will not yet have been determined.


Therefore, only optimal steps can be considered for the ancestor, and w is simply:

$w = P_{yx}\,(C(M_{yx} + 1) - C(M_{yx}))$

(when y = x, w = 0). After the w values for the three branches have been calculated, set the value of E(x, n) as the sum of the three minimum values of w minus the maximum decrease calculated in step 1. If E(x, n) so estimated is negative, consider E(x, n) = 0.

At steps 1 and 2, using the maximum and minimum estimated values decreases the likelihood of overestimating the total MD of a rearrangement. Note also that the estimation of extra costs can be skipped for those characters which (according to the modified shortcut B of Goloboff, 1996b) were not reoptimized when clipping the tree.

(D) Deriving MD During the Rearrangement Phase

The basis for deriving the increase in MD produced by joining the clipped clade to a given destination is taking into account that a new, medial node is then to be created. Assigning different states to that medial node will produce different increases in D for the tree, according to how many and which additional transformations are postulated; choosing those which would produce the minimum increase in D allows estimation of the MD for the rearrangement. It is obvious that whenever the basal node of the clipped clade and either the descendant or the ancestor node of the destination branch have some optimal state(s) in common, the increase in D will be zero; no additional calculations are required in that case. When that is not the case, the increase in D produced by assigning a state x to the medial node can be calculated as the increase (z) in D implied along the branch destination–ancestor, plus the increase (w) in D implied along the branch medial node–clipped clade. The increase, z, along the branch destination–ancestor (d–a) is simply the minimum of [E(x, d), E(x, a)]. Estimating the increase, w, along the branch medial node–clipped clade requires finding, for every possible state y at the root (r) of the clipped clade, the minimum of the expression:

$w = E(y, r) + P_{xy}\,(C(M_{xy} + 1) - C(M_{xy}))$.

Among all possible assignments (x) for the medial node, consider the increase in D as the minimum sum of z + w. The estimated MD for the rearranged tree is then the MD for the clipped tree (calculated exactly) plus the increase in D (estimated).

(E) Checking Distortion and Collapsing Trees

The method just described is designed in such a way that MD will be underestimated more often than overestimated (i.e. the rearranged tree will usually be considered, by that method, as better than it actually is). Therefore, trees which according to steps B–D appear to be worse than the best tree(s) found so far, can be quite safely rejected in most cases. However, a tree which appears to be optimal may actually be worse, and it is then necessary to recalculate the MD for those trees exactly before storing them in the memory buffer. This, however, only has to be done for a small fraction of trees, as most trees can be rejected despite the underestimation of MD. In tests using small to medium-sized data sets, with prior costs set as in either additive or non-additive characters, from 0 to 4% of the rearrangements tried had to be discarded after re-evaluation.

In some cases, the MD for the rearranged tree will be estimated as higher than it actually is, and this makes it possible that some trees of lower MD are found during swapping and incorrectly rejected. This problem is easily solved by introducing an error margin, such that only those trees for which the estimated MD exceeds the best MD found so far by that predetermined difference are rejected. This requires that the length be calculated exactly for more trees, slowing down the calculations, but making them less prone to error. Adequate error margins for different data sets can be found by comparing the results of applying the shortcut to those of exact calculations.

When the MD for the rearranged tree is to be recalculated exactly, this need not be done de novo for all characters. All those characters for which some states were shared between the clipped clade and either the ancestor or descendant nodes of the destination branch require no recalculation, as no error is possible. This will often require as little as 10% of the characters optimized for the MD check.

If the MD check shows that the tree is indeed optimal, and zero-length branches are to be collapsed, it is necessary to calculate all optimal state sets. Here, the optimization of many characters can also be skipped or


simplified (as in Goloboff, 1996b: 214): whenever the state set of the clipped clade is identical to the state set of either the ancestor or descendant node of the destination branch, the state assignments will remain identical after joining the clipped clade to its destination. Those characters for which some states were shared between the clipped clade and either the ancestor or descendant nodes of the destination branch, but where the state sets were not identical, may change some state assignments. As discussed in Goloboff (1996b), for linear optimization of additive and non-additive characters it seems better to optimize the character, but for non-linear methods a complete optimization is too costly. As in that case it is certain that, for each node, the final state set for the rearranged tree will be a subset of the final state set for the clipped tree, enumerating and evaluating reconstructions by choosing the possible distinct combinations of states for the nodes with ambiguous state sets will be faster than completely reoptimizing the character. Additionally, the characters for which no state was shared between the clipped clade and either the descendant or ancestor nodes of the destination branch will have been reoptimized during the (exact) re-evaluation of MD, so that those state assignments can be used directly for tree collapsing.

Implementation

The algorithms described have been implemented and tested in a prototype MS-DOS computer program, SLFWT, available on request from the author. The program searches for trees with minimum MD, using multiple addition sequence Wagner trees followed by SPR, with the possibility of retaining suboptimal trees.

EQUAL AND DIFFERENTIAL WEIGHTING

The method of parsimony analysis under implied weights has recently been considered in conflict with the parsimony criterion (Turner and Zandee, 1995); the use of character weighting in general had often been considered previously as incompatible with parsimony (see Goloboff, 1995a, for discussion). Differential weighting, of either whole characters or character state transformations, is actually not opposed to parsimony. Parsimony seeks to minimize homoplasy, but if not postulating one instance of homoplasy (for either a character or a transformation) requires postulating another, the relative importance of those ad hoc hypotheses of homoplasy is a concern. Postulating the shared state for the most recent common ancestor of a set of terminals with that state allows considering their similarity as explained by common ancestry. However, in cases of non-perfect fit, that may have to be done at the expense of requiring that similarities in other states are non-homologous. The present method, in those cases of conflict, attempts to minimize homoplasy for those transformations which appear less homoplastic, but the primary aim of the method is still to minimize homoplasy, i.e. parsimony. All the arguments for preferring parsimony on the basis of informativeness and descriptive ability (Farris, 1979, 1982), and explanatory power (Farris, 1983, 1986), apply to this method. It is true that the total number of transformations postulated by an optimal reconstruction may be more than the minimum postulated by a (linear) Fitch optimization. Considering the present method in conflict with parsimony on those grounds alone, however, would require considering “unparsimonious” any method besides Fitch optimization. The present criterion of optimality is therefore not offered as an alternative to parsimony, but rather as a more refined way to measure the parsimony of trees.

Equation (2) is not intended as an exact measure of the weights. As noted by Goloboff (1995a), the estimation of costs from homoplasy is approximate. The new procedure seems better, not because it is “more exact”, but rather because it takes more information into consideration. Variations in the weighting strength may lead to different results, just as in both implied and successive weighting. That does not mean that the results under weighting methods are more “arbitrary” than the results under equal weights for all transformations (or characters). The very character state distribution often indicates that some transformations (or characters) are more reliable than others. Just deciding beforehand to ignore this does not make the method any more “objective”. It is preferable to use weighting functions that take this into account, even if approximately.

Whether well-justified weighting methods produce better results can also be assessed empirically. As I

have suggested previously (Goloboff, 1993a: 84), “characters which have failed repeatedly to adjust to the expectation of hierarchic correlation…are less likely to predict accurately the distribution of as yet unobserved characters”. Therefore, weighting methods should be better able to recover the “right” groupings on the basis of limited evidence, and results established using weighting should be more stable to the addition of new evidence. In practice, conclusions are always based on a limited amount of evidence, and “new” evidence can only be gathered with actual taxonomic work. The effect of adding new evidence, however, can be estimated by comparing the results of analysing complete (real) data sets and derived data sets in which part of the evidence is discarded. Table 1 shows the results of such a comparison, for parsimony analyses under equal weights, successive weighting (Farris, 1969), implied weighting (Goloboff, 1993a), and self-weighted optimization (sources for data sets and other details in Appendix 2). For the four methods of analysis, each complete data set was analysed, finding the optimal trees and their strict consensus; then, 25% of the characters were eliminated at random and the reduced data set was reanalysed. For the reduced data, the numbers of groups for the complete data that were monophyletic in all and in some of the optimal trees were recorded. Checking whether a given group is monophyletic in some trees for the reduced data (even if the group does not appear in the consensus of trees optimal for the reduced data) allows for cases where the monophyly of the group is neither supported nor disputed by the reduced data. The elimination of 25% of the characters was repeated 10 times for equal weights, successive weighting, and implied weighting, and five times for self-weighted optimization; the figures shown in Table 1 are the average percentages of recovered nodes (for T terminals and R nodes actually retrieved, 100 × R/(T − 2)). The proportion of nodes recovered in the strict consensus for the reduced data was much higher for the three weighting methods. Compared to equal weighting, the three weighting methods were able to recover, on average, 16.5–19.2% more of the nodes that could ideally have been recovered. That difference is less for small or very congruent data sets;

TABLE 1
Results of Comparative Runs (see text for details) for Analyses under Equal Weights and under Different Weighting Methods

Data   Taxa ×      Equal            Successive       PIWE             Auto-weighted
set    chars       weights          weighting        (k=3)            optimization
                    a    b    c      a    b    c      a    b    c      a    b    c

1      14×22       56   77   83     60   76   83     64   74   83     58   73   83
2      16×28       61   69   71     64   70   71     67   71   71     61   71   71
3      17×19       45   62   80     47   65   93     55   66   73     51   65   93
4      19×38       25   51   58     48   51   88     50   54   88     53   60   76
5      25×54       16   17   17     46   47   95     56   59   95     44   45   95
6      28×67       52   63   73     54   61   77     55   67   76     56   65   73
7      29×80       51   64   67     69   71   85     63   69   78     69   73   89
8A     34×43       21   33   47     46   55   75     42   56   75     37   44   75
8B     35×43       28   43   55     51   63   79     54   65   76     52   65   85
9      36×75       39   49   53     56   59   94     62   67   97     61   63   97
10     41×35       26   56   74     35   42   77     38   49   79     35   53   84
11     42×71       45   65   80     65   71   92     61   67   90     69   74   97
12     43×69       55   72   78     70   75   95     70   75   95     72   77   97
13     44×62       30   47   50     38   54   69     37   54   90     40   49   69
14     47×87       21   28   31     34   54   71     29   37   51     35   44   62
15     54×256      56   65   69     70   73   92     72   76   88     79   80   98

Mean               37   55   63     54   62   84     56   63   82     56   64   85

For each method of analysis, the values (a–c) are: (a) average percentage of groups monophyletic in all trees for both the entire and reduced data sets; (b) average percentage of groups monophyletic in all the trees for the entire data set which are monophyletic in some trees for the reduced data set; (c) percentage of nodes in the strict consensus for the entire data set. Note that (c) places an upper bound on values (a) and (b), and (b) on (a). Values in bold for column (a) indicate the highest averages.
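As a small worked illustration of the percentages reported in Table 1, the sketch below computes 100 × R/(T − 2) for T terminals and R recovered groups. The function name `percent_recovered` and the sample values are hypothetical, and reading the expression as R divided by (T − 2) is an assumption (a fully resolved rooted tree of T terminals has T − 2 groups besides the root).

```python
def percent_recovered(recovered_groups: int, terminals: int) -> float:
    """Percentage of recoverable groups actually retrieved: 100 * R / (T - 2).

    Assumption: T - 2 is the largest number of groups a tree of T terminals
    can contain, so the result always lies between 0 and 100.
    """
    if terminals < 3:
        raise ValueError("at least three terminals are needed")
    max_groups = terminals - 2
    if not 0 <= recovered_groups <= max_groups:
        raise ValueError("recovered_groups must lie between 0 and T - 2")
    return 100.0 * recovered_groups / max_groups


# Hypothetical case: 14 terminals, 9 of the 12 possible groups recovered.
print(percent_recovered(9, 14))  # 75.0
```

Averaging this value over the replicate deletions of characters would give figures of the kind shown in columns (a) and (b).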


for larger, less congruent data sets, the difference may be over 25%; in no case did the analysis under equal weights recover more groups than any of the weighting methods.

The low number of shared consensus nodes for equal weights is in part due to ambiguity, as the number of compatible nodes is much higher than the number of strict consensus nodes. For equal weighting analyses of partial evidence, many of the groups for all the evidence were not directly contradicted but simply unsupported. The percentage of uncontradicted groups for weighted analyses, however, is 7–9% higher than for equal weights; using weighting, the “right” groupings were compatible slightly more often and much more often preferred.

In this test, there are no significant differences in performance between the three weighting methods. The superiority of the weighting methods over equal weighting, and the method of self-weighted optimization in particular, seems clearer for larger numbers of taxa (as expected: the estimation of unreliability, based on homoplasy, should then be more accurate).

The difference between data sets 8A and 8B is of interest. These two data sets differ almost exclusively in the location of the root (see Appendix 2 for details). Because the measures of distortion for self-weighted optimizations are root-dependent, rooting so as to have a paraphyletic outgroup and a monophyletic ingroup (harmless when using Hennig86, NONA, or Pee-Wee) may lead to an improper selection of trees. Data set 8B has a more proper rooting than 8A; note that the addition of a proper root improves self-weighted optimization more than it does the other methods.

The method of “three-taxon statements” (Nelson and Platnick, 1991) is not, strictly speaking, a method for character weighting (Platnick, 1993), and it has been criticized on theoretical grounds (Farris et al., 1995; Farris, 1997). A comparison similar to the one described above (Table 2) suggests that the groups based on all the available evidence are supported no more often by three-taxon analysis of partial evidence than under equal weights. The percentage of compatible nodes was lower for three-taxon statements than for equal weights; the groupings suggested by the complete data sets were positively contradicted (as opposed to simply not supported) by analyses based on incomplete evidence more often than under equal weights. Defenders of three-taxon statements have claimed as an advantage the extra resolution often allowed by the method. For the data sets analysed here, however, that extra resolution seems artificial and undesirable.

TABLE 2
Results of Comparative Runs (see text for details) for Parsimony Analysis under Equal Weights and Three-Taxon Statements (TTS)

Data   Taxa ×      Equal weights    TTS
set    chars        a    b    c      a    b    c

1      14×22       56   77   83     53   68   83
2      16×28       61   69   71     52   66   79
3      16×19       39   81   86     43   49   79
16     12×14       48   65   70     48   51   60

Mean               51   73   78     49   59   75

For each method of analysis, the three values reported (a–c) correspond to the values reported in Table 1. For three-taxon statements, the reduced data sets were produced by eliminating 25% of the original characters, subsequently converting them to three-taxon statements.

DISCUSSION

The method of self-weighted optimization allows consideration of some character state transformations as more reliable as a consequence of the analysis, not as an assumption. However, when there is more than a single reconstruction that minimizes D, different reconstructions may imply different costs. In those cases, it is not possible to construct a unique matrix of costs that reflects those weights at the same time; rather, each reconstruction would imply a different step matrix. Attempting to use either the maximum, minimum, or average implied cost for each transformation to produce a unique step matrix is misguided, since those costs could not logically be implied at the same time, i.e. by a single reconstruction. This applies even more when multiple equally parsimonious trees are considered. This seems more a virtue than a defect. We cannot know what the “real” costs of the different character state transformations are; for a certain tree, accepting a given reconstruction fixes those costs, but other reconstructions may fix them at other values.


Williams and Fitch (1990), and Sharkey (1993), proposed to use the numbers of transformations implied in a phylogenetic analysis to reassign costs, and run a new analysis⁶, as done in successive weighting. This has been called “dynamic weighting” (by Williams and Fitch) or “exact weighting” (by Sharkey). It is essentially an iterative process, which depends on the starting point. In fact, this should also be performed iteratively at the level of optimizations. Consider again the reconstruction for the tree in Fig. 1A. As already shown, the costs of 0→1 and 1→0 transformations reassigned according to that optimization were in the ratio 2:5. However, with that cost ratio, the state assignments shown in Fig. 1B are preferable, and that reconstruction implies (using the same weighting function of that example, $1/(s_{ij}+1)$) a cost ratio of 14:100. Those costs produce the (stable) reconstruction in Fig. 1B. Defending reconstruction 1B on the grounds that reconstruction 1A led to it, however, seems illogical; after all, both reconstructions 1A and 1B tell us that changes 0→1 and 1→0 should not have the same cost, as reconstruction 1A had assumed. Given that the “equal costs” assumption is not supported by this tree, one could equally well start to optimize the tree from any other starting point. Here the problem arises that different initial costs for an optimization may lead to different final stable reconstructions. Suppose one assumes before optimizing that the relative costs 0→1 and 1→0 are not in the ratio 2:5, but instead in the ratio 3:1. This leads to the reconstruction shown in Fig. 2. That reconstruction is internally consistent; it implies that changes 1→0 are very unreliable and, accordingly, postulates more of those transformations. The costs for 0→1 and 1→0 according to that reconstruction are in the ratio 100:9, a cost ratio under which the reconstruction is optimal. Reassigning costs as in successive weighting, although allowing rejection of reconstruction 1A on the grounds of inconsistency, does not offer a clear criterion by which to choose between the optimizations in Figs. 1B and 2. The final result of optimizations and searches will therefore be determined by the initial transformation costs chosen, and the method may therefore produce results which are not completely independent of assumptions on costs.

⁶Or select among the trees found before (Sharkey, 1993), which does not seem a sensible suggestion. If the new set of costs can be meaningfully used to choose from among some trees, it should be used to choose from among every possible tree, not just those that happened to be optimal under the previous set of costs. Also unjustified seems Sharkey’s suggestion to use a weighting function with different values of the “minimum” number of transformations between two given states according to the root state in the current reconstruction. The minimum number of such transformations should be the minimum among all possible reconstructions, and this is always zero.

Additionally, the Williams–Fitch–Sharkey method requires assigning a unique, fixed cost to each transformation. In the cases of ambiguous optimizations and/or multiple trees, one is then forced to choose among the maximum, minimum or average implied costs. There seems to be little justification in evaluating one tree according to the transformation costs implied by another, perhaps very different, tree. Also, endless loops (reconstruction A implies costs that produce reconstruction B, and reconstruction B costs that produce A) seem more common under this method than under Farris’ successive weighting. Computer programs may include checks to interrupt that situation if it arises, but the problem is theoretical more than practical. The problem is in determining which step matrix should be used in that case: the step matrices implied by either reconstruction A or B seem illogical, but so does any other step matrix.

Another method that has been proposed to determine costs of transformations is “transformation series analysis” or TSA (Mickevich, 1982). More than implementing a single criterion, TSA is a set of rules or criteria. TSA has not been completely formalized and, to my knowledge, no available computer program implements it; therefore, this type of analysis can be performed only manually. Like dynamic weighting, TSA is iterative; it assumes a given set of initial costs between transformations (=character state tree), finds the shortest tree(s) for that set of costs, and derives a new set of costs based on optimizing the resulting trees and applying rules (like the nearest neighbor rule) to derive new character state trees. This is repeated until the results remain stable. It is well known that, for complex data sets, the final results obtained with TSA will depend (just as in the preceding method) on the initial transformation series chosen (Mickevich, 1982: 469–470). Further, the validity and general applicability of some of the rules (used to select among possible character state trees when there are ambiguous optimizations) remain dubious (see Buckup, 1991). For example, the nearest neighbor rule attributes more importance to the states in terminal taxa than to those postulated in the internal nodes of the tree, on the


grounds that those states are “directly observed”. However, whenever a “terminal” included in an analysis is a higher taxon, that distinction vanishes; stating that the common ancestor of Mammalia had mammary glands and hair is not based on a more “direct observation” than is stating that any internal node of a cladogram “has” a given state; both conclusions are presumably reached on the basis of parsimony, but neither of them rests on direct observation⁷. That conceptual problem aside, adding identical taxa may affect the neighborhoods and therefore, if using TSA, the relationships between the taxa originally included may change. In contrast, the present method is affected neither by considerations of whether the taxa in an analysis are “higher” or “lower”, nor by addition of identical taxa. Another important difference between both methods is their behaviour in the absence of homoplasy. For ambiguous optimizations, TSA may well prefer a unique transformation series when there is no homoplasy in the data (Lipscomb, 1992, fig. 4, is an example). In contrast, whenever there is no need to postulate homoplasy (i.e. when there is no “scattering”; Mickevich and Lipscomb, 1990), the same reconstructions will be considered as optimal by a linear optimization and by the present method. When there is homoplasy, the procedure of TSA appears less well defined; a single reconstruction may indicate different character state trees in different areas of the tree. A reasonable criterion for choosing among those character state trees has not been offered; running a different analysis for each one of them is a possibility, but analyses could multiply enormously. In essence, this is a consequence of seeing as necessary the determination of a unique set of costs (i.e. a character state tree) before a tree is optimized. The alternative, proposed here, is using an optimality criterion sensitive to the congruence of different transformations and regarding each optimal reconstruction as determining a plausible “character state tree”.

Lipscomb (1992) proposed “testing by congruence” the initial character state tree(s) hypothesized on the basis of morphology alone. The “initial character state tree” of Lipscomb refers to the prior costs. If those costs are based on actual observations, using the conclusions to change them may well be undesirable. Regardless of the conclusion, the observed degrees of similarity will always remain the same (unless, of course, an actual re-examination of specimens is made) and then there seems to be no reason to modify the prior costs. This potential problem aside, Lipscomb’s “congruence test” has not yet been completely formalized and is thus not generally applicable.

A possible criticism of the method comes from arguments on metricity. By analogy with the argument of Farris (1981, 1985) against the use of non-metrical distance data in phylogenetic analysis, Wheeler (1993) has suggested that asymmetrical costs of transformations cannot be logically analysed, as they violate the triangle inequality. His argument does not apply to the present method; the prior costs for all transformations may well be a priori, equal and symmetrical. The method does not consider asymmetries in costs as intrinsic to the data; rather, it evaluates different reconstructions taking into account the direction of the transformations; the asymmetry is a conclusion, not a premise. The reconstructions judged as optimal and the magnitude of the distortion will, however, depend on the location of the root⁸. Additionally, it is not clear whether Wheeler’s (1993) analogy between taxon and state distances is appropriate. Farris (1981, 1985) referred only to analyses with the distances between taxa as the data. In that case, if the triangle inequality is violated for a triplet of taxa, there is no distance space where a hypothetical ancestor linking two of those taxa can be placed, such that the branch lengths are both minimum and non-negative. Non-minimum branches decrease the fit of the tree to the data, while negative lengths cannot be interpreted as amounts of evolutionary change. However, the parsimony criterion considers transformations between actual states assigned to adjacent nodes, not distances between the terminal taxa, so this problem cannot arise: postulating internal nodes linking the states of a character is not necessary. The usual argument (e.g. Swofford, 1993: 17) against costs violating the triangle inequality does not consider distances between terminals, but rather

⁷ Some authors even claim that a “species” is not a terminal unit, but simply a taxon like any other, so that not even in the case of a “species” could their states be “directly” observed (see Nelson, 1989; Vrana and Wheeler, 1992; an alternative view is held by Frost and Kluge, 1995).

⁸ This root dependency is also observed in Wheeler’s (1996) “optimization alignment”, even when no distinction is made a priori between transformations in one or the other direction; Wheeler (pers. comm.) considers that method as not subject to his 1993 criticism, for exactly the same reason.

that for some transformations between two states it becomes necessary to pass through some other state instead of having a direct transformation, a possibility precluded in the present method. That argument does not exclude in itself asymmetries in costs, but Wheeler suggests that asymmetrical costs violate the inequality as one of the states involving asymmetrical costs could be seen as two different states, with transformation cost between them equal to zero. However, as recognized by Wheeler (1993: 710), the only way to distinguish those identical conditions as two different states would be “unfortunately, on a cladogram” (more properly: given a certain character state reconstruction); this means that the distances for any triplet of taxa may well fulfill the triangle inequality. In conclusion, it is not clear whether the analogy with distances really disqualifies all asymmetrical prior costs.

CONCLUSIONS

It would be naive, of course, to claim that the present method is the final answer to all the problems associated with determining the costs of character state transformations. It is very likely that further improvements are possible, or that an entirely different approach will prove better justified. Perhaps the least well-justified aspect of the present method is that, in order to avoid illogical state assignments resulting from violations of the triangle inequality in the implied costs, the fitting function is used to select among some, not all, possible reconstructions. Why should a measure be seen as meaningful when comparing some reconstructions, but not others? This may be obviated if the measure is seen as applicable after some basic parsimony considerations, such as “the common ancestor of two nodes with identical state must be assigned the same state, regardless of prior or implied costs”. One would prefer a measure defined in such a way that those illogical character state reconstructions can never be considered optimal, but that ideal may be too complex or impossible.

The present approach may not always be necessary in small or very clean data sets, as it will often produce the same results as linear optimization. However, in those cases in which the numbers of transformations in different directions are very different (at least for some characters) it seems necessary to apply a method that can use such information. The present method does so, inferring the reliabilities for different transformations from the data themselves.

ACKNOWLEDGEMENTS

I thank V. Albert, J. Carpenter, J. Farris, D. Lipscomb, N. Platnick, M. Ramírez, R. Schuh, C. Szumik, W. Wheeler, and an anonymous reviewer for their critical comments and encouragement. I also thank M. Paolini for computer time.

REFERENCES

Buckup, P. (1991). Cladogram characters: Predictions, not observations. Cladistics 7, 191–195.
Carpenter, J. (1994). Successive weighting, reliability and evidence. Cladistics 10, 215–220.
Farris, J. S. (1969). A successive approximations approach to character weighting. Syst. Zool. 18, 374–385.
Farris, J. (1970). Methods for computing Wagner trees. Syst. Zool. 34, 21–24.
Farris, J. (1979). On the naturalness of phylogenetic classification. Syst. Zool. 28, 200–214.
Farris, J. (1981). Distance data in phylogenetic analysis. In “Proceedings of the Second Meeting of the Willi Hennig Society” (V. Funk, and D. Brooks, Eds), pp. 3–22. New York Botanical Garden, New York.
Farris, J. (1982). Simplicity and informativeness in systematics and phylogeny. Syst. Zool. 31, 413–444.
Farris, J. S. (1983). The logical basis of phylogenetic analysis. In “Proceedings of the Second Meeting of the Willi Hennig Society. Advances in Cladistics 2” (N. Platnick, and V. Funk, Eds), pp. 7–36. Columbia University Press, New York.
Farris, J. (1985). Distance data revisited. Cladistics 1, 67–85.
Farris, J. (1986). On the boundaries of phylogenetic systematics. Cladistics 2, 14–27.
Farris, J. (1988). Hennig86, ver. 1.5. Computer program and documentation, New York.
Farris, J. (1997). Cycles. Cladistics 13, 131–143.
Farris, J., Källersjö, M., Albert, V., Allard, M., Anderberg, A., Bowditch, B., Bult, C., Carpenter, J., Crowe, T., De Laet, J., Fitzhugh, K., Frost, D., Goloboff, P., Humphries, C., Jondelius, U., Judd, D., Karis, P., Lipscomb, D., Luckow, M., Mindell, D., Muona, J., Nixon, K., Presch, W., Seberg, O., Siddall, M., Struwe, L., Tehler, A., Wenzel, J., Wheeler, Q., and Wheeler, W. (1995). Explanation. Cladistics 11, 211–218.
Fitch, W. (1971). Toward defining the course of evolution: Minimal change for a specific tree topology. Syst. Zool. 20, 406–416.
Frost, D., and Kluge, A. (1995). A consideration of epistemology in systematic biology, with special reference to species. Cladistics 10, 259–294.


Goloboff, P. A. (1993a). Estimating character weights during tree search. Cladistics 9, 83–91.
Goloboff, P. (1993b). Pee-Wee and NONA. Computer programs and documentation. New York.
Goloboff, P. (1994). Character optimization and calculation of tree lengths. Cladistics 9, 433–436.
Goloboff, P. (1995a). Parsimony and weighting: A reply to Turner and Zandee. Cladistics 11, 91–104.
Goloboff, P. (1996a). S.P.A.: Sankoff Parsimony Analysis. Computer program and documentation, San Miguel de Tucumán.
Goloboff, P. (1996b). Methods for faster parsimony analysis. Cladistics 12, 199–220.
Hendy, M., and Penny, D. (1982). Branch and bound algorithms to determine minimal evolutionary trees. Math. Biosc. 59, 277–290.
Lipscomb, D. (1992). Parsimony, homology, and the analysis of multistate characters. Cladistics 8, 45–65.
Mickevich, M. (1982). Transformation series analysis. Syst. Biol. 31, 461–478.
Mickevich, M., and Lipscomb, D. (1990). Parsimony and the choice between different transformations for the same character set. Cladistics 7, 111–139.
Nelson, G. (1989). Species and taxa: Systematics and evolution. In “Speciation and its Consequences” (D. Otte, and J. Endler, Eds), pp. 60–81. Sinauer Associates, Sunderland, Massachusetts.
Nelson, G., and Platnick, N. (1991). Three-taxon statements: A more precise use of parsimony? Cladistics 7, 351–366.
Platnick, N. (1993). Character optimization and weighting: Differences between the standard and three-taxon approaches to phylogenetic inference. Cladistics 9, 267–272.
Sankoff, D., and Rousseau, P. (1975). Locating the vertices of a Steiner tree in an arbitrary space. Math. Program. 9, 240–246.
Sankoff, D., and Cedergren, R. (1983). Simultaneous comparison of three or more sequences related by a tree. In “Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison” (D. Sankoff, and J. Kruskal, Eds), pp. 253–264. Addison-Wesley, Reading, Massachusetts.
Sharkey, M. (1993). Exact indices, criteria to select from minimum length trees. Cladistics 9, 211–222.
Swofford, D. (1993). PAUP: Phylogenetic Analysis Using Parsimony, version 3.1. Program and documentation, Laboratory of Molecular Systematics, Smithsonian Institution, Washington.
Swofford, D., and Olsen, G. (1990). Phylogeny reconstruction. In “Molecular Systematics” (D. Hillis, and C. Moritz, Eds), pp. 411–501.
Szumik, C. (1997). Cladistics 12, 41–64.
Turner, H., and Zandee, R. (1995). The behaviour of Goloboff’s tree fitness measure F. Cladistics 11, 57–72.
Vrana, P., and Wheeler, W. (1992). Individual organisms as terminal entities: Laying the species problem to rest. Cladistics 8, 67–72.
Wheeler, W. (1992). Quo vadis? Cladistics 8, 85–86.
Wheeler, W. (1993). The triangle inequality and character analysis. Mol. Biol. Evol. 10, 707–712.
Wheeler, W. (1996). Optimization alignment: The end of multiple alignment in phylogenetics? Cladistics 12, 1–9.
Williams, P., and Fitch, W. (1990). Phylogeny determination using dynamically weighted parsimony method. In “Molecular Evolution: Computer Analysis of Protein and Nucleic Acid Sequences” (R. Doolittle, Ed.), Methods in Enzymology, Vol. 183, pp. 615–626. New York, Academic Press.

APPENDIX 1

Recursive Function to Determine Optimal Reconstructions

This function assumes that all the fixed and linked nodes, as well as the types of change that never occur in optimal reconstructions, have been determined before the function assign_states() is called for the first time. As usual in parsimony computer programs, the state sets are stored as sets of on/off bits. Only the variables declared within the function must be local. All others are global:

left_desc [i], right_desc [i] : the left and right descendants of node i. Both together determine a dichotomous tree. Note that for this function to work properly, tree nodes must be numbered in such a way that no ancestor is smaller than any of its descendants.
largest_state : the largest state of the character being optimized.
cosfunction [i] : the value of the cost function for i transformations.
prior_cost [i] [j] : the prior cost of transforming from i to j.
nodes_to_assign : the number of nodes for which state assignments must be examined.
node_order [i] : the ith node for which state assignments must be checked.
assigned_nodes : the number of nodes for which a state has already been assigned. It must be set to 0 before calling assign_states().

state [i] : a buffer to store the state assignments. It stores single state assignments (i.e. single reconstructions, partial or complete). Fixed nodes must be set to the corresponding state before calling assign_states().
state_buffer [i] : a buffer to store the state assignments so far considered as optimal (as state sets).
state_set [i] : the final state sets.
nodes_linked_to [i] : the number of nodes linked to node i.
linkside [i] : a variable that indicates to which descendant of node i some nodes are linked. Takes the value of 0 if it is the left descendant, 1 otherwise.
linked_to [i] : the node to which node i is linked. For repeats, all nodes must be linked to the descendant node (i.e. the node lowest in number). Nodes leading only to terminals with missing entries (and only in that case) are linked to the first ancestor leading to some non-missing entries (i.e. only these are linked to a node with a larger number).
leads_to_missing [i] : a variable indicating whether node i leads to an all-missing node.
allowchg [i] [j] : a variable indicating whether any i→j changes can occur in optimal reconstructions. If change i→j is forbidden, allowchg [i] [j]=0; otherwise, allowchg [i] [j]=1.
partial_D : the partial value of D; it must be set to 0 before calling assign_states().
BEST_D : the best value of D found so far. It must be set to an initial value before calling assign_states() (best done with the D for a Fitch, Farris, or Sankoff optimization).
tmpsteps [i] [j] : the number of transformations i→j implied so far by the reconstruction. Before the first call to assign_states(), tmpsteps must be initialized as 0 for all i, j.
MISSING : a value that represents missing entries, with every bit on.
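The storage convention described above (state sets as on/off bits, with MISSING having every bit on) can be illustrated with a small sketch. It is not part of the original listing: the type state_set_t and the helpers single_state(), has_state(), smallest_state() and fitch_combine() are hypothetical names introduced only for this illustration, and the combination rule shown is the standard non-additive rule of Fitch (1971), not the self-weighted procedure itself.

```c
#include <assert.h>

/* Hypothetical type for this sketch: one bit per character state. */
typedef unsigned int state_set_t;

/* MISSING has every bit on, as stated in the variable list above. */
#define MISSING (~0u)

/* Bit set containing only state s (states numbered from 0). */
state_set_t single_state(int s)
{
    return 1u << s;
}

/* Nonzero if state s belongs to the set. */
int has_state(state_set_t set, int s)
{
    return (int)((set >> s) & 1u);
}

/* Smallest state present in a non-empty set, found by scanning from bit 0. */
int smallest_state(state_set_t set)
{
    int s = 0;
    assert(set != 0u);
    while (!((1u << s) & set))
        ++s;
    return s;
}

/* Fitch (1971) down-pass rule for one node: take the intersection of the
   descendant sets if it is non-empty; otherwise take their union and
   count one extra step. */
state_set_t fitch_combine(state_set_t left, state_set_t right, int *steps)
{
    state_set_t common = left & right;
    if (common != 0u)
        return common;
    ++*steps;
    return left | right;
}
```

Because MISSING has every bit on, intersecting it with any other set returns that other set unchanged, which is why a missing entry never forces a step under this rule.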

void assign_states()
{ int a, b, c, left_state, right_state, this_state, x, y, z, trans_was, D_was, trans_is;
/* 1 */ a=node_order [assigned_nodes++];
/* 2 */ left_state=linked_to [left_desc [a]];
/* 3 */ if (left_state>left_desc [a]) left_state=MISSING;
/* 4 */ else left_state=state[left_state];
/* 5 */ right_state=linked_to [right_desc [a]];
/* 6 */ if(right_state>right_desc[a]) right_state=MISSING;
/* 7 */ else right_state=state [right_state];
/* 8 */ y=0; while(!( (1<


/* 25 */ if (linkside [a]) tmpsteps [x] [z] +=nodes_linked_to [a];
/* 26 */ trans_is=tmpsteps [x] [z];
/* 27 */ partial_D= (cosfunction[trans_is] - cosfunction [trans_was]) * prior_cost [x] [z]; }
/* 28 */ if(partial_D <=BEST_D)
/* 29 */ if(assigned_nodes

After assign_states() returns to the function that had originally called it, it is necessary to copy the states to their final buffer:

/* 46 */ for(a = largest_HTU; a >= smallest_HTU; --a)
/* 47 */ { state_set[a] = statebuf[linked_to[a]];
/* 48 */ if(leads_to_missing[a]) {
/* 49 */ if(linked_to[a]
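The bookkeeping around tmpsteps, cosfunction and prior_cost can be illustrated in isolation. The sketch below assumes, for illustration only, that the total distortion is the sum over ordered state pairs of the prior cost times the cost-function value for the number of transformations of that type; the expression in add_steps() mirrors the difference computed at line 27 of the listing. The names total_D() and add_steps(), the constants NSTATES and MAXSTEPS, and any cost table passed in are hypothetical and are not taken from the original program.

```c
#include <assert.h>

#define NSTATES 3   /* hypothetical number of states in the character */
#define MAXSTEPS 8  /* hypothetical upper bound on steps of one type */

/* Total distortion under the assumed definition: for every ordered pair
   of distinct states, prior cost times the cost-function value for the
   number of transformations of that type. */
double total_D(int steps[NSTATES][NSTATES],
               double prior[NSTATES][NSTATES],
               const double *costfn)
{
    double d = 0.0;
    for (int i = 0; i < NSTATES; ++i)
        for (int j = 0; j < NSTATES; ++j)
            if (i != j) {
                assert(steps[i][j] >= 0 && steps[i][j] <= MAXSTEPS);
                d += prior[i][j] * costfn[steps[i][j]];
            }
    return d;
}

/* Record `extra` additional x->z transformations and return the resulting
   change in distortion, computed from the old and new counts only. */
double add_steps(int steps[NSTATES][NSTATES],
                 double prior[NSTATES][NSTATES],
                 const double *costfn,
                 int x, int z, int extra)
{
    int trans_was = steps[x][z];
    steps[x][z] += extra;
    int trans_is = steps[x][z];
    return (costfn[trans_is] - costfn[trans_was]) * prior[x][z];
}
```

Under these assumptions the value returned by add_steps() always equals the difference between total_D() evaluated after and before the update, so a search can compare a partial reconstruction against the best value found so far without rescanning the whole table.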

APPENDIX 2

Data Sets Analysed

Unless otherwise stated, all data sets used are morphological. All uninformative characters were eliminated from the data set. For runs with Pee-Wee and equal weights the trees for the reduced data sets were found in two steps as follows: (a) the optimal (collapsed) trees for each of the reduced data sets were searched with six replications of a random addition sequence Wagner tree followed by TBR keeping up to 20 trees per replication; and (b) the set of resulting trees was subjected to SPR keeping up to 1000 trees distinct as dichotomous. Under successive weighting (using the consistency index as weighting function, scaled between 0 and 10, using NONA), the search for collapsed trees was performed, repeating step (a) until the results were stable, then subjecting the resulting trees to step (b). These three methods required each that (1+10)*16=176 different data sets be analysed. For runs under self-weighted optimization step (a) used five replications of SPR keeping up to 10 trees per replication, and step (b) saved no more than 150 trees distinct as dichotomous; this was performed using earlier versions of SLFWT, which had K=5 as default. This method required analysis of (1+5)*16=96 data sets.

The total averages were calculated with the data set 8B. Data set 8A had been taken from the matrix of Goloboff (1995), too large to be analysed under self-weighted optimization in reasonable times. The Acanthogonatus, plus some species of Rachias, Stenoterommata, Hermacha, Pycnothele and Stanwellia were included in 8A; in the entire analysis of Goloboff (1995), these genera appeared as ((Hermacha, Stanwellia, ((Pycnothele, Rachias), Stenoterommata)), Acanthogonatus). Data set 8A has Hermacha as outgroup; for that selection of terminals, no single terminal can be chosen as the outgroup and respect the original rooting (Acanthogonatus has 27 species, and only terminals can be used for rooting in the test program used here). Then, a taxon representing the ancestral states for the closest common ancestor of the above mentioned genera was added to the data set 8A, and this constitutes data set 8B.

The comparisons between three-taxon statements and equal weights were done using the program TTS (written by Goloboff and Nixon; Nelson and Platnick, 1991) to recode both the original and the reduced data sets. The recoded data sets were then submitted to NONA.

1. Genera of Filistatidae (Arachnida, Araneae); Grismado and Ramírez, pers. comm.
2. Species of the genus Monapia, Anyphaenidae (Arachnida, Araneae); an earlier version of the matrix in Ramírez, M. (1995) Bol. Soc. Biol. Concepcion 71, 75.
3. Genera of Theraphosoidina (Arachnida, Araneae); R. Raven, pers. comm. For the analysis of three-taxon statements, the last taxon in the matrix was excluded, as otherwise more than 1000 statements (the maximum number allowed by TTS) were generated for the complete data set.
4. Species of Trichomycterus and related genera (Pisces: Trichomycteridae); Luis Fernández, pers. comm.; presented at the 1997 Simposio Internacional de Peces Neotropicales, Porto Alegre, August 1997.
5. Species of Lystrophis plus related genera (Serpentes, Colubridae); G. Scrocchi and F. Cruz, pers. comm.
6. Species of Tropidurus (Squamata, Iguania); Frost, D. (1992) Am. Mus. Novit. 3033, 1–68.
7. Hemiptera (Insecta), DNA sequence data; Wheeler et al. (1993) Ent. Scand. 24, 121.
8. Nemesiid , (Arachnida, Araneae), extracted from Goloboff (1995b) Bull. Am. Mus. Nat. Hist. 224, 28. Data set 8A includes clade 146 of Goloboff (1995b: 36), except Rachias n. sp., clade 137, Pycnothele n. sp., P. perdita, and Stenoterommata crassistylum; data set 8B also has the HTU 146 for rooting.
9. Bipectinate Mygalomorph spiders (Arachnida, Araneae), extracted from Goloboff, 1995b. It is the same data set of Goloboff (1995b: 36), but excluding clade 146 (except Pycnothele modesta and Stenoterommata platense).
10. Families of Embioptera (Insecta), from Szumik, C. (1997) Cladistics 12, 41–64.
11. Mygalomorph families (Arachnida, Araneae); Goloboff (1993c) Am. Mus. Novit. 3056, 1–32.
12. Haplogine spiders (Arachnida, Araneae); Platnick et al. (1991) Am. Mus. Novit. 3016, 1.
13. Genera of Atalophlebiinae (Ephemeroptera, Leptophlebiidae); E. Dominguez, pers. comm.
14. Phyxelidine Amaurobiid spiders (Arachnida, Araneae); an earlier version of the matrix in Griswold, C. (1990) Bull. Am. Mus. Nat. Hist. 196, 204–206 (used in Platnick, 1989 Cladistics 5, 145–161).
15. Asphodelaceae, rbcl DNA data set; A. Cox, pers. comm.
16. Subfamilies of Anyphaenidae (Arachnida, Araneae); an earlier version of the matrix in Ramírez, M. (1996) Ent. Scand. 26, 361–384.
