<<

Syst. Biol. 64(6):936–952, 2015 © The Author(s) 2015. Published by Oxford University Press, on behalf of the Society of Systematic Biologists. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. DOI:10.1093/sysbio/syv031 Advance Access publication May 26, 2015 Toward Synthesizing Our Knowledge of Morphology: Using Ontologies and Machine Reasoning to Extract Presence/Absence Evolutionary Phenotypes across Studies

,∗ , , ,∗ T. A LEXANDER DECECCHI1 ,JAMES P. B ALHOFF2 3,HILMAR LAPP2 4, AND PAULA M. MABEE1 1Department of Biology, University of South Dakota, Vermillion, SD 57069, USA; 2National Evolutionary Synthesis Center, Durham, NC 27705, USA; 3University of North Carolina, Chapel Hill, NC 27599, USA; 4Center for Genomics and Computational Biology, Duke University, Durham, NC 27708, USA ∗ Correspondence to be sent to: Department of Biology, University of South Dakota, Vermillion, SD, 57069. E-mail: [email protected], [email protected].

Received 24 November 2014; reviews returned 10 March 2015; accepted 20 May 2015 Associate Editor: Thomas Near

Abstract.—The reality of larger and larger molecular databases and the need to integrate data scalably have presented a major challenge for the use of phenotypic data. Morphology is currently primarily described in discrete publications, entrenched in noncomputer readable text, and requires enormous investments of time and resources to integrate across large numbers of taxa and studies. Here we present a new methodology, using ontology-based reasoning systems working with the Phenoscape Knowledgebase (KB; kb.phenoscape.org), to automatically integrate large amounts of evolutionary character state descriptions into a synthetic character matrix of neomorphic (presence/absence) data. Using the KB, which includes more than 55 studies of sarcopterygian taxa, we generated a synthetic supermatrix of 639 variable characters scored for 1051 taxa, resulting in over 145,000 populated cells. Of these characters, over 76% were made variable through the addition of inferred presence/absence states derived by machine reasoning over the formal semantics of the source ontologies. Inferred data reduced the missing data in the variable character-subset from 98.5% to 78.2%. Machine reasoning also enables the isolation of conflicts in the data, that is, cells where both presence and absence are indicated; reports regarding conflicting data provenance can be generated automatically. Further, reasoning enables quantification and new visualizations of the data, here for example, allowing identification of character space that has been undersampled across the fin-to-limb transition. The approach and methods demonstrated here to compute synthetic presence/absence supermatrices are applicable to any taxonomic and phenotypic slice across the tree of life, providing the data are semantically annotated. Because such data can also be linked to model organism genetics through computational scoring of phenotypic similarity, they open a rich set of future research questions into phenotype-to-genome relationships. [character conflict; evolutionary mapping; inference; missing data; morphological character; ontology; phenotype; supermatrix.]

The analysis of phenotypic traits in a phylogenetic from the description “posterior flap of adipose fin, free framework is key to addressing the evolutionary from back and caudal fin” (Lundberg 1992), the adipose questions posed by an increasingly diverse set of fin would be assumed present. Such detailed data, domains. For example, understanding the evolution of originally collected for phylogenetic reconstruction or pharyngeal jaw mechanics in fishes (Price et al. 2010), taxonomic identification, are desirable for re-use at the identifying phenotype-associated genes and regulators more general level of presence and absence where they in forward genomics approaches (Hiller et al. 2012), pertain to broader questions concerning, for example, exploring the key factors in land plant evolution (Rudall homoplasy, rates, and correlations of phenotype with et al. 2013), or discovering the role of phenotypic traits in environment, geography, and genes. colonization ability (Van Bocxlaer et al. 2010), all rely on Here we show that the presence and absence the mapping of phenotypic data to phylogeny. Although of phenotypes can be extracted automatically from robust molecular phylogenies have become easier to published detailed phenotype descriptions that are generate, more broadly available, and increasingly annotated using ontologies. For example, if an author comprehensive, the phenotypic data on which these asserts that a fin ray is branched in a particular studies rely have not. fish species, we can use the logic inherent in the Unlike molecular data, phenotypic data are corresponding ontology-based expression to infer that notoriously time-consuming and complex to observe, the fin ray is present. The power of inference across classify, and code (Burleigh et al. 2013). Moreover, ontology-based phenotypes (Balhoff et al. 2010; Dahdul they are described in a highly detailed free-text et al. 2010a; Mabee et al. 2012) from multiple species format in a distributed literature and have not been and multiple studies enables a substantial reduction available in a computable format (Deans et al. 2012). in the proportion of missing data in a matrix. We Researchers seeking to aggregate even the seemingly here demonstrate that the logical inferences enabled simple information about the presence and absence by ontologies significantly expand the coverage of the of phenotypes across a set of species are faced with a data, revealing gaps in phenotype and taxon sampling, substantial manual extraction and abstraction task (e.g., and revealing data conflict across studies. The methods Stewart et al. 2014). Although assertions of the presence described here not only allow aggregation of phenotypic and absence of phenotypes abound in the literature, data into synthetic supermatrices, but also show the so do descriptions of the variation in other qualities need to more broadly adopt the use of ontology such as shape, size, position, color, etc. In the case of annotation in the morphological literature to facilitate these qualities, presence and absence must be inferred; linking and integration with other data such as genetic

936 2015 DECECCHI ET AL.—EXTRACTING PRESENCE/ABSENCE PHENOTYPES 937 and developmental data from model organisms (Mabee 2007). The Uberon ontology includes explicit, expert et al. 2012). community-vetted, and sourced statements about homology, and splits anatomical structures named homonymously between clades but known to be METHODS nonhomologous into separate classes for each clade The methods described here rely on the use of (see Haendel et al. 2014). For example, the ontologies. Because the application of ontologies to parietal bone and the actinopterygian parietal bone systematics is recent, we provide a glossary to aid the form distinct classes in the ontology (and each is a reader (Box 1). subtype of neurocranium bone), preventing inference The Phenoscape Knowledgebase (KB; methods from inadvertently treating structures as the kb.phenoscape.org) contains ontology-annotated “same” that are known not to be. Phenotypic qualities phenotype data derived from published character (presence/absence, size, shape, composition, color, etc.) state matrices from phylogenetic treatments (Fig. 1). are drawn from the Phenotype and Trait Ontology Annotations are ontological expressions composed (PATO) (Gkoutos et al. 2005). Terms for vertebrate taxa according to the Entity–Quality (EQ) formalism are taken from the Vertebrate Taxonomy Ontology (Mungall et al. 2007, 2010), using the Phenex software (VTO) (Midford et al. 2013). Every morphological (Balhoff et al. 2010) as described previously (Dahdul matrix annotated in this way is associated with a single et al. 2010a) (Fig. 1). Anatomical entities are represented publication in the KB. by terms from the comprehensive Uberon anatomy At the time of this analysis, the KB (Fig. 1) ontology for metazoan (Mungall et al. 2012; contained a total of 19,024 morphological character states Haendel et al. 2014), which was derived in part from corresponding to 651,660 EQ phenotype annotations for independently developed vertebrate multispecies 4399 extant and fossil vertebrates from 139 comparative ontologies (Dahdul et al. 2010b; 2012; Maglia et al. studies. It is particularly enriched in the comparative

BOX 1. Glossary of key terms used in the description of anatomy data synthesis Annotation—The application of one or more ontology terms to a text expression such as a character state. Assertion—A direct author statement about a fact or observation, which here is usually an organismal phenotype, typically deduced from observations about a specimen, including whether an entity is present or absent in an organism. Character state—One of a subset of pre-defined values used in the composition of characters to differentiate between taxa. Character states are published in natural language text that is not computable. Entity—An entity is a feature of an organism, such an anatomical element, a molecular process, or a behavior. It is represented in an ontology by a class, for example, an anatomy ontology class such as “pectoral fin.” EQ—Entity–Quality, the combination of an entity class with a quality class. An EQ is an ontology-based representation of a phenotype (e.g., pectoral fin, presence). Inference—A statement about a fact, such as the presence or absence of an entity, that is implied by one or more asserted facts by means of the logical relations of an ontology. Inheres in—The relation between a dependent continuant (in this case a quality) and an entity. For example, the triangular shape that inheres in the deltopectoral crest of a tyrannosaur. Ontology—Ontologies are hierarchical vocabularies with well-defined relationships between terms that can be used by computers to integrate data. OWL—Web Ontology Language, a language standardized by the World Wide Web Consortium (WC3) for defining first-order logic ontologies. Phenotype—A feature of an organism and its quality. May be represented with ontologies using Entity–Quality (EQ) syntax. Quality—A quality describes the particular way that an entity varies in presence, size, shape, composition, etc. It is represented in an ontology (a quality ontology like PATO) by a class, for example, a quality ontology class such as “present” or “curved.” Reasoning—The use of logic, here embodied in an ontology,to reach a conclusion. Inference is a type of reasoning. Subclass—A derived class that inherits properties from its parent classes, and usually has some of its own. For example, a “radius bone” is a subclass of “bone” and thus inherits its properties, such as being composed of bone tissue. Synthetic character—A character whose states are assembled from multiple studies using the logical relationships specified in an ontology. Synthetic supermatrix— A character matrix consisting of synthetic characters. 938 SYSTEMATIC BIOLOGY VOL. 64

Evolutionary phenotype Ontologies data anatomy quality Query:

+ "Sarcopterygii" taxonomy

OntoTrace NeXML Semantic annotation presence or absence Character State from KB synthetic presence/ ethmoid plate form rounded matrix absence supermatrix Entity (Uberon) Quality (PATO) states with matrix cell ethmoid cartilage round values

data query Phenex view of matrix and source data

phenoscape-kb-owl-tools Phenoscape uses OWL reasoning to Knowledgebase precompute inferred relationships

absence logic used by OntoTrace

FIGURE 1. Flow chart showing computational steps used to extract synthetic presence/absence supermatrices from ontology-annotated evolutionary phenotype data. Phenotypic character states of taxa from the evolutionary literature are semantically annotated using anatomy, quality, and taxon ontologies. Using the phenoscape-kb-owl-tools data processing pipeline (https://github.com/phenoscape/phenoscape-owl- tools), these phenotypes are reasoned across and deposited into the Phenoscape Knowledgebase. The OntoTrace tool enables a user to generate synthetic presence/absence matrices for specific taxa (here “Sarcopterygii”) and particular anatomical entities (here “parts of fin or limb”). These matrices, including provenance for each cell, can be viewed in Phenex. skeletal anatomy for fins, limbs, and their support the OntoTrace tool (Fig. 1). OntoTrace accepts as input (i) structures (girdles) of sarcopterygian vertebrates (Table the targeted anatomical elements (or regions) in the form 1 available as Supplementary Materials on Dryad of a pertinent ontology class or expression (specifically, at http://dx.doi.org/10.5061/dryad.rm907), the clade an OWL class expression), and (ii) the taxonomic in which the “fin-to-limb” transition occurred (see group(s) (also in the form of an OWL class expression) Shubin et al. 2014 for a recent discussion). Sarcopterygii for which a supermatrix is to be synthesized. OntoTrace comprise slightly greater than half of all vertebrates first generates a matrix column, and thus a character, (Barnosky et al. 2011) and include lobe-finned fishes such for each anatomy ontology class subsumed by the as lungfish and coelacanths, and including input class expression. Then, for each anatomical amphibians and amniotes. These richly annotated taxa character so generated, OntoTrace queries the KB for and phenotypes served as the source data for this character states whose EQ annotations logically entail investigation. the presence or absence of the respective anatomical To automate synthesis of supermatrices from the element, given the subclass relationships, part–whole phenotype-by-taxon knowledge in the KB, we created relationships, developmental origin, and other axioms 2015 DECECCHI ET AL.—EXTRACTING PRESENCE/ABSENCE PHENOTYPES 939

FIGURE 2. Screenshot from Phenex, showing a portion of synthetic supermatrix in Matrix panel (left), synthetic characters in Characters panel (right), and provenance in the new Supporting State Sources panel (below). Here the Supporting State Sources panel displays the sources of the character states for the synthetic character 318 “humerus” in Ichthyostega stensioei. provided by the requisite ontologies (see below). (Vos et al. 2012) (Fig. 1). OntoTrace is implemented The taxa that are associated with those character states in the Scala programming language, and its source and that fall within the input taxonomic group (i.e., code is freely available under the MIT license on are subsumed by the input class expression designating GitHub at https://github.com/phenoscape/ontotrace. the taxa of interest) are then added to the matrix as The source code also contains several ancillary rows, and they are given a state of present, absent, or reporting scripts we used to review properties both (as a polymorphism) for the character, as entailed of the matrix (described below). The version of by their respective character states (more precisely, by OntoTrace described here has been archived at the EQ annotations for those states). To document the http://dx.doi.org/10.5281/zenodo.12705. provenance of each synthetic state, all combinations To allow manual review of the provenance of of taxon and published character state supporting the the cells in the generated synthetic presence/absence synthetic state value(s), along with references to the supermatrices, we developed a new interface panel for respective source matrices (and thus publications), are Phenex, the EQ annotation tool (Balhoff et al. 2010) recorded as metadata for each cell in the synthetic (Figs. 1 and 2). Upon selection of a matrix cell in Phenex, matrix. In addition, OntoTrace determines whether any the new Supporting State Sources panel displays the of the published states supporting a synthetic state list of originally published character states that support directly assert, in the form of their EQ annotations, the presence/absence state value assignment(s) for the the presence or absence of the anatomical element, or respective taxon, and Phenex highlights in bold those whether the synthetic state is solely based on logical that are considered supporting by direct assertion rather inference from the supporting states’ EQ annotations. than by inference (Fig. 2). Direct assertion of presence/absence here means that To illustrate the properties and value of synthetic the curated EQ annotation(s) for the respective state morphological supermatrices, we aimed to generate uses the respective character’s anatomical structure a synthetic presence/absence supermatrix of any as the entity (E), and one of the terms “present” anatomical elements that are part of the paired (PATO:0000467) or “absent” (PATO:0000462) as the limb, paired fin, and/or the girdle skeletons for any quality (Q). OntoTrace outputs the generated matrix sarcopterygian taxa. We also included elements that are and all metadata in a single file in the NeXML format connected to these structures, such as the sternum. To 940 SYSTEMATIC BIOLOGY VOL. 64 achieve this, we selected anatomical structures using the which the presence of a structure can be inferred (see following OWL class expression, shown below with term Balhoff et al. 2014 for details). For example, a quality that labels rather than identifiers for readability: inheres_in a structure implies_presence_of that structure (Fig. 3). Presence is also inferred for any structures of part_of some (‘paired limb/fin’ or ‘girdle which that structure is a part or from which it develops. skeleton’) or connected_to some (‘paired The presence of a “humerus” implies the presence of a limb/fin’ or ‘girdle skeleton’) “forelimb” and a “forelimb skeleton,” of which Uberon The properties part_of (BFO:0000050) and asserts it to be a part. The presence of a “forelimb” connected_to (RO:0002170) are from the OBO Relations also implies the presence of a “forelimb bud,” because Ontology (Smith et al. 2005), and the classes “paired Uberon asserts that the former develops from the latter limb/fin” (UBERON:0004708) and “girdle skeleton” (Fig. 3). (UBERON:0010719) are from the Uberon anatomy ontology (Mungall et al. 2012; Haendel et al. 2014). The Absence.—To query character states that denote absence taxa were selected using the VTO (Midford et al. 2013) of a given structure, OntoTrace retrieves from the term “Sarcopterygii” (VTO:001464) as input, which KB those phenotypes that are subsumed by the permitted us to generate data for taxa annotated to expression “lacks_all_parts_of_type and inheres_in some Sarcopterygii or any of its subclasses in VTO. We ran multicellular_organism and towards value .” OntoTrace on a Linux-based compute node, using Similar to “presence,” the KB makes use of chains of 60 GB RAM, within Duke University’s shared high- ontological relationships to infer which other structures performance computing cluster, with a build of the must be absent as the consequence of the absence of a Phenoscape KB generated on 23 June 2014. given structure (Fig. 3) (see Balhoff et al. 2014 for details). For example, the absence of a “forelimb” entails the absence of a “humerus.” Entailment of Presence and Absence The ontologies from which we draw our terms provide a rich context with a community-vetted set of definitions Identifying Conflicts and relationships (structural, developmental) for each When a taxon is inferred to exhibit both presence entity. The semantics of the OWL ontology language and absence for a particular structure, it indicates used by Phenoscape, Uberon, and the major model either a polymorphic condition in the taxon, or the fact organism ontology communities permits a rich set of that the supporting original character states, or more inferences to be derived from EQ annotations either precisely, the EQ annotations made for them, conflict in a developmental phenotypic context (as used by with each other. Polymorphic synthetic state values were model organism databases) or, as seen here (Fig. 3), considered as reflecting actual polymorphism, and thus in an evolutionary phenotypic context. For example, excluded from further review, if both presence and a simple EQ annotation may assert that a character absence are directly asserted by supporting character state describes an entity “humerus” bearing a quality states associated with a single source matrix (and thus “L-shaped.” A state assignment to a taxon implies the same publication). To aid manual review of the that the taxon has a member organism that exhibits remaining conflicts, we created a script that reported for a phenotype, that is, has an instance of the ontology each conflict the taxon, the entity that was polymorphic class “L-shaped” that inheres in an instance of the (i.e., had conflicting values), and whether the presence class “humerus.” In the EQ model (Mungall et al. 2007, and absence values were supported by direct assertion 2010), the relation inheres_in (RO:0000052) from the or inference. This reporting script can be found in OBO Relations Ontology (Smith et al. 2005) connects the OntoTrace source code repository. Provenance of phenotypic qualities to the anatomical entities that bear conflicting data can be viewed in Phenex (Fig. 2). them. Based on the assertion that there is an organism exhibiting a phenotypic quality inhering in a structure, we can trivially infer that this structure must be present Identifying Isomorphic Synthetic Characters in the organism. Using an OWL reasoner and additional To examine possible dependence across characters axioms provided by the Phenoscape KB, more indirect in the synthetic matrix as a consequence of assertions inferences of presence or absence can be made as well, in the ontologies, we used a script to report each which essentially result from the anatomical knowledge cluster of characters (i.e., anatomical entities) that expressed within the Uberon ontology (Balhoff et al. were identical in their taxonomic distribution of 2014). values. In other words, all identical character columns were collected into clusters; we term these clusters Presence.—Toquery character states that denote presence “isomorphic characters.” Further, for each of the of a given structure, OntoTrace retrieves phenotypes anatomical entities comprising each cluster, the script from the KB that are subsumed by the expression reported whether the matrix contained any direct “implies_presence_of some .” Implies_presence_of presence/absence assertions for that character, or if it is an OWL property that unifies the various means by was included in the matrix solely through inference. 2015 DECECCHI ET AL.—EXTRACTING PRESENCE/ABSENCE PHENOTYPES 941

forelimb bud is present forelimb is present forelimb skeleton is present

humerus is epicondyle of rod-shaped humerus is present

humerus is present

forelimb bud is absent forelimb is absent forelimb skeleton is absent

epicondyle of humerus is absent

humerus is absent

FIGURE 3. Ontology-based inference of presence and absence. Direction of arrows indicate the reasoning pathway. Top: the presence of a structure (humerus) is inferred from an assertion to its shape (humerus L-shaped) or a part (entepicondyle of humerus is present). The presence of humerus implies the presence of forelimb skeleton (humerus is part of a forelimb skeleton), a forelimb (forelimb skeleton is part of a forelimb), and thus a forelimb bud (forelimb develops from a forelimb bud). Bottom: in contrast, an assertion to the absence of a humerus does not entail the absence of a forelimb bud, forelimb, or forelimb skeleton; it does entail the absence of its parts (entepicondyle). However, the absence of a forelimb bud entails the absence of a forelimb, thus a forelimb skeleton and thus the humerus.

Code for this report can be found in the OntoTrace Uberon ontology, we generated a corresponding class source repository. To aid in characterization of the with the logical definition “implies_presence_of some ontological dependence of isomorphic characters, we X.” We classified these expressions using the ELK used a script to generate an ontology of “presence reasoner (Kazakov et al. 2012, 2013) within the Protégé classes”: For each anatomical entity “X” from the ontology editor, and used the Protégé DL Query panel 942 SYSTEMATIC BIOLOGY VOL. 64 to check for inferred equivalency between presence direct assertions about presence or absence of any expressions. fin/limb entity can nonetheless be included in the synthetic presence/absence matrix if they have EQ phenotype annotations that imply presence or absence Other Reporting Queries of a fin/limb entity. For example, the theropod dinosaur taxon Sinosauropteryx prima in our data is derived from The number of published character states that entail a single source (Sereno 2009), where it was not scored the presence or the absence for selected sets of entities for any presence/absence characters. Its inclusion in and taxa was reported using queries to the Phenoscape the synthetic supermatrix comes solely from character KB implemented as a script included within the states such as “increased scapular blade width” and OntoTrace source code. Specifically, for each taxon and “poorly differentiated humeral head form,” because entity, we queried for states that were annotated with these imply the presence of a scapula and humerus, phenotypes that entailed either the presence or the respectively. absence of the entity. After excluding polymorphisms directly asserted We queried (using SPARQL, the query language for within a single source (see “Identifying conflicts” RDF datastores) the Phenoscape KB to count the number section), we identified 774 cells (of the 146,451 populated of published matrices in which each sarcopterygian ones) as stating both presence and absence (0/1) of taxon in the KB is included. An additional query was a character, for 99 synthetic characters and 297 taxa. used to report, for each published matrix in the KB, These included 135 conflicts between direct assertions the number of taxa, characters, states, and phenotypes (that were made in different source publications), associated with annotations relevant to structures of the 565 conflicts between direct assertions and inferred fin or limb. These queries are included in SPARQLformat states, and 74 conflicts between inferred states (Table within the OntoTrace code repository. 3 available as Supplementary Materials on Dryad at http://dx.doi.org/10.5061/dryad.rm907). We also identified 93 clusters of characters in RESULTS the synthetic supermatrix that were isomorphic, that OntoTrace aggregated, as described above, the KB’s is, identical in their distribution across taxa and morphological phenotype data on paired limb, paired to one another, but variable. These correspond to fin, and/or the girdle skeletons for all sarcopterygian 85,813 cells in the synthetic supermatrix (Table 4 taxa into an entity-by-taxon matrix of 1759 synthetic available as Supplementary Materials on Dryad at presence/absence characters by 1052 taxa, in the form http://dx.doi.org/10.5061/dryad.rm907), almost 59% of an XML file in NeXML format (Vos et al. 2012) made of the (populated) cells. To better characterize these available in the Dryad data repository (see sarcop- clusters as to their ontological basis, we examined presence-absence-all.xml in Supplementary Materials which of them fall into equivalence chains of implied on Dryad at http://dx.doi.org/10.5061/dryad.rm907). presence and absence. More specifically, two synthetic The 55 studies from which data were synthesized characters with anatomical entities X and Y, respectively, in this manner are summarized in Table 1 for which a reasoner infers equivalence between the available as Supplementary Materials on Dryad at logical definitions “implies_presence_of some X” and http://dx.doi.org/10.5061/dryad.rm907. The data “implies_presence_of some Y” will necessarily be found from these papers that relate to fin, limb, girdle and isomorphic in their distribution of presence and absence. their parts total 2588 text-based character states from For example, “presence of pedal digit 2” is inferred as 1195 individual published characters, scorable for 1052 equivalent to “presence of pedal digit 2 digitopodial sarcopterygian taxa. skeleton.” Twenty-one of the 93 clusters (23%) were of Out of the 1759 generated synthetic characters, 639 this kind. Another 6 (6%) were found to be clusters were variable, that is, included both presence and of anatomical parts and the entities that contain them. absence states. Of these, 488 characters (76%) were Clusters can also arise from co-asserted entities (e.g., variable only due to the use of inference: 442 variable when a single character state includes multiple entities, characters were composed of inferred data alone; 12 such as “pedal digits 6, 7, and 8 present,” which will were made variable by inferred absence, and 34 by result in three EQ annotations, one for each of the three inferred presence. In the matrix subset comprising the digits). There were three such clusters (3%). Most of the variable characters there are 146,451 populated cells, clusters were composed of inferred data only. Of these, which constitute 21.8% of all cells. Directly asserted 63 (68%) resulted from chains of inference from multiple data accounted for only 9948 (6.8%) of the populated entities that were different for each cluster. cells, or 1.5 % of all cells in the subset; in contrast, The number of source matrices from which a inferred data represent 93.2% of the populated cells particular taxon was sampled ranged from 1 to 16 (Fig. 4). Of the 1051 taxa in the subset, 13% (136 taxa, (Table 5 available as Supplementary Materials on Dryad see Table 2 available as Supplementary Materials on at http://dx.doi.org/10.5061/dryad.rm907). In all, 813 Dryad at http://dx.doi.org/10.5061/dryad.rm907)are (77.4%) of the sampled taxa are at the species rank, included in the matrix solely on the basis of inferred with the remainder distributed across higher-level ranks data. Taxa for which the source matrices contain no (Table 5). 2015 DECECCHI ET AL.—EXTRACTING PRESENCE/ABSENCE PHENOTYPES 943

A

s i

s

n

e i

k

c

R o t s h R s u i h i t n n R R i r i e e n p h R h e i R R l g R e i si i l c Rhinella h n a n l R h h R l yi Anaxyrus woodhousiih R c l i e R e a ni i ae Anaxyrus terres h mR h o i n n s Anaxyrus quercicu n h l R l i e i k Rhinella arenaru R h l l h i e i Anaxyrus americanus Ana R h e r a i e i a n e n i g h l n i u R R h r s Rhaebo caeruleos i l l l g s n s s Melanophryniscus rubriventri Rhaebo haematiticu h l e l e n i a n l a s s Anaxyrus cognatu l e m b v s i n a y e a s s ee u A a h h l i n l i s o n e e r Amie A n l l r l s l e n Anaxyrus boreas l l s d Am i i a a a a d h s n xyr e o s p u s Melanophr Amietophrynus pardali R A a l d l p r e i i e a n n i s y i hu n i nen n l l c s s a a t i c r p a s s e A a l l a p e r c c t e a o i n r e v h o e Nannophryne variegat h a l l h g a h l s h u tacamensis p r n l g g p r t i y a i Na Rhaebo nasicus e a i e m x l u i e ie R a a l l r e e hodu c a x u u n n o eru r i a o u y d l l r o e e e h s i u h y q m i rr Nannophryne cophoti a a m n r l t a s d t e y x s c a b p u l y d I I h m r c e m c o t i l c c c h ophrynus r a e i t w i n e I u a n pt t s n oph n n i c a yc o i y h i r y r n e p r r s b a e i l t n a u a c e t o o s m n r s e I u pun u d o s olepip n t c i b t u l i a o a c r i r n l o s c Incilius alvariu p u i a t o r na n o e h r i l a s o c s r n f a r l r i r r u l g s d s o s o i s c r e o i o Guiyu oneiroP p p Phrynoidis jux a u e i a t e a c l i l e a l p l n d Osornophryne bufoniformi p b s n o t S L O i i y n g i e b i l i s i o i a i Truebella tothasteph rynu b f l r P Powichthy t u r a s n s n i u i d b n n f t i S YoungolepiDiplocercides y o p s Melanophryniscu o o v d i i c h l u a i e a m Schismaderma carenT l ct e s s a a D D i i s riphogna o a a n Duttaphrynus melanostictu s o e e i s c i S Diabolepi a a y r u w tri t o chthy d i r i Pseudepidalea viridiPhrynoidisr asper ry s l n r a H s e y g i o i Gl D e o s ni m b x atu i s c d e i Ingerophrynus macrotis u s c v s m s G Uranolophu e d s y i u l s ChirodiDipno ni ble ne corynete n l s p r odu a u e s m o u a i e e n u g t s b l s DipterusN h u c t a t s i k s Hyperolius viri b l i u r PhaneropleuroS s t s y cus regulari n e t t s n a H o l e i z c l Peltophryne lemur s i ic a l s UronemuKe a h r s rfax e i w c t c r M a s Hyperoliusy phantasticu a f k t s A n c syme a i l x e e g t GooloogongiRhizodopsi o i l ai p l c u s t i Hyperolius picturatus P e u Rhizodu o l o c a c e i Sauripterus taylor n s e e i r p s Screbinodu r p f i s e s ulat f u n s g a Hyperolius marmoratuH r s StrepBarame a u Hyperolius parallelur s t e Barameda decipiens g o u r o d o s i Beelarongi o r Hyperolius montanusy k elzner r i C a ria o Bufo bufo s Ectosteorhachie r e e eae B p l t s G r henodo t i asperao s i opomu ca Hyperolius u s u s Medoevia e t s i H e p M Osteolepidat p H t a Osteolepi n st o e s r s i s Clada i o i i ypero o b t s S i y e s T u h l t e Cabonnichthys burns ae Hyperoliuspe balfour i u Buf s s s E c t Kassina senegalensi s s i Eusthenopteron foord i u b a Glyp s diflavu Gyroptychiuandage t a r e h s dele H oliu i aalik r li p r o s s o M r i Kassinay us k i PlatycephalichthySpodichthys buetler e m h per u li s T kt apoda Kass K s c n i Elpistostegi s i Panderichthystr rhombolepis m Kassa nasu i g s T u a Kassina lamotteiHyperoliu l Densignathus t A s olius argusivuens l u s Elginerpeton pancheni tsoni Afrixalus quadrivittatu a u Te f K s i lizzia rixalusKassinaKassina ina kuvangensi st s s s Acanthostega gunnarcheeria ur a n Hynerpeton bassettcheeriida wa idisc in a aneu t janensc s Kassina fusc u s s IchthyostegaVentastega stensioe curon s a m mer Ossinodushat puer ss s i m s s Pederpeshat finneya onon c n u AfrixalusAfr fulvovittatun i W hracosauroide t r i s a a s W t o u s s Phlyctimantis verrucosu sep a c tens s Gephyrostegus bohemicua cras u Heterixalus madagascariensi i Afrixalus osoriocassinoide s i pet au sis c Afrixalusx fornasini c decora cuu An ia s s Afr alu o l SilvanerpetonBruktererpeton miripede fiebig ia i Phlyctimantis leonard c l if s Gephyrostegida en r s t atae i Proterogyrinus scheele absit t s en Kassinh her s TachycnemisP He seychellensi i r r Tuler s u s Paracassina obscurax nige a Eoherpe ylorn oen a alu t is Archeri tes a Ch O r rionali n t a Arc pis a terixalus a a Eogyrinu s ensiisc rys c s e Pholiderpet ia u n S a riens a s Pholiderpeton scutigeru r ia ba c ex c mi t e ss PhlyctimantiAfrixaludo Solenodon u s oba ho Tseajaia camp o ri t s Cryptothylaxm greshoffiiina k s Diadectes ri i t rs s Diadec m u hylaxnoda i a n tr ali s i LimnosceliLimnoscelisy paludis agili me o i ac ru e ymouro s Acanthixalus spinosu ounhiens s s KotlassiaS e primac o n i m K t s dolesoru hu a cty enberg S Ariekanerpetois s r ra immacula er u b XenopusCallixalus tropicalis cupss pictu lu s D eophallodont r i inula s i UtegeniaSynapsidarichasaurusr s s Xenopus epitropicaliXenopus witte s T e au a u Xenopus weali s h nu Xenopus mueller r is StCaseopsis c s eoni wi s i romyc n Ennatosaurusa tecto y s XenopusXenopus fraser s OAngelosaurus dolanh n tt tu i ngelo r mexica e ten e s A lo o d Xenopus boreali i CaseCasea ybroili s i romer t s s Pipa snethlageaeXenopus clivi s Caseao ruten ov tu i s n b s C s o laevi s i CotylorhynchusCotylorhynchus hancockromer u i ur s Archaeothyris florensisa ru tonens Xenopu s i Echinerpeton intermedius u e il SingidellaP latecostatPipaPipa parva i OphiacodonVaranosaurus acutirostrisa i ipaPipa c myers s s P Ianthasauruspho hardestiorumto n ipa i Edaphosaurusa n s i pip s Edaphosaurusd o boanerge s a s E d Shelania pascuali a rv a Cutleriao wilmarthitrodo Shelania laurentrrabalalho Haptodusc baylei Ll Pachycentrata taquet Haptoduse me garnettensi s m a S S Pip i S n al ingidell i Di s Hymenochiruskiba boettgeritenia i Sphenacodon a Sphenacodontina i Vulcanobatrachus mandelaitrachuPachycentrat Archaeovenator ham agili i EoxenopoidesLl reuningSal ibane a a Heleosaurus scholtz Pseudhymenochirus amerlinn Pyozia mesenensi Hymenochiruk tenia Ruthiromia elcobriensiswelles ibas trueba z i Mesenosaurus romer s trachu i Mycterosaurus longiceps Eoxenopoide Varanodon LeptopelisLept spiritusnocti a i Varanops brevirostria s e Watongia meier u Leptopelis viridi s Aerosaurus greenleeoru in s opelis v Aerosaurus a a rh i S s Biarmosuchia s o Leptopelis ilu li Lep r i Biarmosuchu n s e Leptopelis natalensisermiPipidaan s Dicynodontia o ta LeptopelisLeptopelis karissimbensistopelis modestu a Dinocephali od Lep Titanophoneusx potenn e aro s Leptopelis kivuensisculat e Gorgonopsia e m t mossambicus s i Therocephalirin opelis macrono us Procynosuchuh tida LeptopelisLeptopelis brevirostri christy ta T m gui tu s Diademodon tetragonusriu Leptopelis boulenger s Exaeretodoylodon he Morganucodonrit t watson a s Leptopelis bocagi T elo ArthroleptisAr stenodactylu s Haldanodonk exspectatu throlep tis Morganucodontida Arthroleptis sylvaticus Hen andinusia s ArthroleptisArthroleptis poecilonotus schubotz Zhangheotheriumys quinquecuspidenis tis Leptope i Ornithorhynchus anatinus s e variabili s Tachyglossus aculeatusansor i ArthroleptisArthroleptis adolfifriederici phryno Didelphisadelph virginian Ar i Dromiciopsc aia sc gliroidegobien s t lis i Pu es Cardioglossahrolept leucomysta s Eom st la Leptictisaele dakotensi ctys is M s CardioglossaCardioglossaArthroleptis escalerae gracili s Mesonyx obtusidenada s lameerei i Protungulatum donnatr AstylosternusCardioglossa diadematus cyaneospil ides Ukhaatherium tenessov Zalambdalestesops elongatu lechei Dasypusandua novemcinctu s Trichobatrachus robustu Hapalm i Leptodactylodon ventrimarmoratuCardiogloss Ta ra Scotobleps gabonicus s x Didolodus multicuspia NyctibatesAs corrugatu Hyopsodus paulus Leptodactylus silvanimbutylos Phenacodus intermediui t a Carodnia viei fers Leptodactylus pentadactyluernu a Protolipterna a s LeptodactylusLeptodactylus melanonotu rivero s Thomashuxleyopus us Amblysomuser capensi hottentotui Lep s Echinops telfairsidios todac Oryct in cirnea s Procavia us on si LeptodactylustLeptodactylusylus latinasu latran sc cy fricanyon s s s Trichechuscho manatu a l stu Leptodactyluslep knudsen i Apheliyn ium Lep todac Rh er ariega Leptodac t s rith s v i s Leptoda yloide s Loxodontae eru todatylus s Mo t Leptodactyluscty cty diedruinsularu s Cynocephalus volan Leptodactyluslus d chaquensilu s Galeop glisa Leptodactylusiscodactyl bufonius f Ptilocercus lowi ta usc mi Tupaia s us LemurNotharctus catt syrich tenebrosus PPleurodema cinereum u ma leurodemaLeptodactylus bufoni s s Tarsius ke Homo sapiens el Physalaemus cuvier s Saimiripho sciureus Phy s Gom salaemusPleu Oryctolagus cuniculuscatus rode num Rhombomyludeli s m Cavia porcellus s EngystomopsLithodytesPhysalaemu pustulosubiligonigeru lineatua amys s i Cal i Par rton ypto Tribosphenomys minutuso Adenomeracephalella marmorata s s Castor canadensirium mla s Rattus norvegicutheadacty Scinax nasicus ga Spermophiluspent tridecemlineatu Sc yis Archaeonis s s inax f Ma i Hypsiboas lanciformiusc Canis lupu s Hypsiboas Socvarius Felis silvestri s inax riojanu EquusMesohippus caballus baird PseudisHypsiboa paradox s s Erinaceus europaeus Ist s Solenodon paradoxu hm Pseudi Dendropsophusohyla nanu a Sorex araneu rivula s Talpa europaea Isthmohylris Metacheiromysatu marshisctus TripDendropsophusa Sinopavu rapaxs ov ofe i Osteopilusrion septentrionali s Vulpa us pr finneys petasa Vulpav teris tus Onychonyc x PhyllomedusaCyclor vaillantiHyl s Pteropus giganteu Phyllomedusaana ausauvagia Icaronycterislucifugus indea i stralis Myotis parnelli ii Phyllomedus i Nycteris tthebaicus rdwicka Agalychnis lemu Pterono a habilineat Litoria lesueuria i Rhinopomteryx s Saccop CacosternumLi nanumr s amphibiu toria infr i Bos tauru Cacosternum boettgerafrenata Hippopotamusa Cacosternum Lama glama capense Sus scrof a Amietia fuscigul i Basilosaurus cetoidestus Amietia angolensia Caperea marginattrunca s Amieti s Tursiops s TomopTomopterna tandya Artiocetus clavi terna marmora i Rodhocetuss balochistanensi AnhydrophryneAnhydrophryne rattray hewta Testudine i Procolophonidae archeri Str itti Protorothyrise Poyntoniaongylopus paludicol grayii Captorhinida i NatalobatrachusNothophryne bonebergbroadleyia Captorhinus Captorhinus agut a Microbatrachella i Paleothyris acadiana capensis Araeoscelidi sensis ArthroleptellaEricabatrachus landdrosia baleensis Petrolacosaurus kan Choristodera a PyxiPyxicephalusce edulis Lepidosauromorph phalus adspersus erta broomi Aubria subsigillata ProtorosauriProlac a Aubria Trilophosauru browns i HyHylodeslodes ph sazimaiyllodes Mesosuchus Hylodes perplicatus RhynchosauriaEuparkeria capensis Hylodes ornatus Vancleavea campi Hylodes nasus Chanaresuchus bonaparte s Tropidosuchus romeri HylodesHylodes lateristrigatus meridionali i s EryErythrosuchusthrosuchida eafricanus HylodesHylodes dactylocinu asper Proterosuchida tylus schmidti Proterosuchuse fergusi CrossodactylusCrossodac gaudichaudii Ornithosuchia ctylus dantei Parasuchus hislopi CrossodactylusCrossoda caramaschii Pseudopalatus pristinus goeldii Smilosuchus gregorii Megaelosia r Luperosaurus kay fureogularis i PhrynobaPhrynobatrachustrachus sul versicolo i Alligator mississippiensi trachus plicatus Proterochampsidae s Phrynoba s EudimorphodonDimorphodon macrony ranzi Phrynobatrachustrachus sanderson natalensis i Phrynoba fftii Dinosauromorph x Phrynobatrachus petropedetoidetrachus kre s Lagerpetidae a PhrynobatrachusPhrynoba dendrobate r DromomeronLagerpeton chanarensi gregori s i s stris Dromomeron PhrynobatrachusPhrynobatrachusus cricogastea cafricanuutiro Marasuchus lilloens romer nobatrach ss Asilis iis Phry trachu Eucoelophysisaurus kongw Phrynoba r Lewisuchus e PhrynobatrachusSpea acridoide neutea Pseudolagosuchus admixtubaldwin Spea multiplicati Sacisaurus s i mmondi Silesauridae agudoensi SpeaSpea intermontana ha s Silesaurus opolensi majos r Spea i Pseudosuchi Spea bombifron i Ornithosuchus s i Rioja a Scaphiopus hurteri Gracilisuchussuchu stipanicicoru longidens Scaphiopus holbrooki Ticinosuchuss t enuifero Scaphiopus T scep Scaphiopus couchii urfano s anwell Revueltosaurussuchu callenderix m tophrys cr Aetos s dabanen CerCeratophrysa ornatas Longosuchusaurus f meadei sis Stagonolepis erobertsonrratus CeratophrysCeratophry cornuta Lotosaurus adentu s Poposaurus gracilis trachus laevisi Q Lepidobatrachus llanensisgerholdttii Arizonasaurusianos babbitt i LepidobaLepidobatrachu rocei uchus s Waweliarys pies pri Xilousuchus sapingensi coph chu Effigia okeeffeaemixt Cha batra ctalishus us auru occipiatra s SillosuchusShuvosaurus longicervi inexpectatusi B oplob i rachusH s Batrachotomus kupferzellensi si Fasolasuchus tena s Hoplobat lonen s Pres Nanoranae yparker Saurosuchus x Sphaerothecarys cbrevicep s Rauisuchustosuchus tiradente chiniquensi Limnonectes blythiilaevi s Postosuchus alisonax Nannoph r Postosuchus kirkpatrick Euphlyctis idozygacyanophlycti galilei s Occ i Dibothrosuchus elaphro hristyae Dromicosuchus s s Ptychadena uzungwensis c a Kayentasuchus walke e anchiet a Lit Ptychadenahadena chrysogastenat s Sphenosuchusargo acutus i PtychadenaPtyc mascareniensiPtychadenia or Terrestrisuchussu gracili gralla tychadenat tis H chu t s P brand m esperosuchus lept or e ynoman Hesperosuchus agilio r Hild i Orthosuchus stormbergrhyi Phr Pro nchus Phrynomantis bifasciatu i Protosuchustos richardsons Callulops doriae Eocursoru cparvuhu s as G s Lesothosaurusena diagnosticuhaught s CophixalusRamanella cryptotympanu variegata Pisanosauruss merti Dyscophus antongili auri on a Sc a s i i s Heterodonut DermatonotusBombina variegat orientali muellerbombina Heterodontosaurusello tucki i Bombin a Efraasia minorsau Bombina maxima sis Plateosaurus r en Sauropod tosauridaus lawlei Bombina m as Barbourul gala an Panphagia proto s lar aria Saturnalia tupiniqui ri rana albolabriHy s r Eodromaeusiforme e yla por ato Eoraptor lunensisengelhard BarbourulaH busuangensi tem pipient tti Tawa halla s s na a Her HylaranaRanaRana roi rickeans i Staurikosaurusrera price ti au ti Allosaurus murphs St selm Ce saurue m Amolopsrys mont has Dilophosaurusra wetherill i Megophryrachiu s i to s is ob CoelophysCoelophysissau chigualast Megoph Coeloph r fragili Lept us n s Compsognathus longipea s m Ornitholestes hermannsi i en LeptobrachiumXenophrys aceras S ys is k co sis Mixophyes s Compsognathidaeino is baurayent rnis ScutigerMixophyes mammatu schevill s Huaxiagnathussau orientalisrhodesiensi Orni i a i Shenzhousaurusrop orientali ka Heleioporu s i S t te tae tz i A inor hom ryx s Limnodynastes peroniimiliari lu G lbe nit p LimnodynastesPlatyplectrum dorsaliornatu s Tyrorgortos imo ri s Daspletosaurushomimu torosuss m i s horopa i Tarbosaurusannosauau bataa au a T dorsalisnu Dilong paradoxur rus s ri Thoropa Eotyrannus lengs us s a Cycloramphus fuliginosu Guanlong wucaiiau lib arc dong Cycloramphus semipalmatus s s Raptorex kriegsteinru r Cycloramphus boraceiensis s r atuop i ProceratophrysRhinophry boie St s h s us a X o ex agu t c Manirapiongguake Rhinophrynus emoralira ia s Caudip s s Chelomophrynuslus opipari f a t siis Citipati osmolskao c an n s s Archaeopteryxsau lithographici s r RhinophrynusChelomophrynus canadensi baycal r ie J ylus au ar s s i tornlongr Rhadinosteust bi parvus c S xi teryx z us c s s i Voronahenzhan berivotrensia ellagapapuen D g le i tidacGuibemantisantist a is liber hers Rahonr o Mantidactym an d t formit S om r ou bai veland M a s Velociraptorino ou mongoliensini Man m oni s aeo s i m uibe s man f guen Jeholornisr ra o oen G hi ty s tes ss Sapeornisnit chaoyangensiav sau p ri e i p Songlingornis i tor sinen o Pla hu i Z ho s t sis o c Aly r Zhongornishon haoas o rida a B Platymantisra corrugatu to i Gallus gallu austr lis t Discoglossu a i e en Batrachylodes elegan t Ana g prim ruom Batrachylodes vertebrali s Vegavis iaajian s s tobaDiscoglossus pictu na i Chang s mi is a Discodelesra bu s Eoconfuciusorniss zhengo a Eodiscoglossutes Jinzhouornis zhangjiyingir illeni Ce Alytes obstetrican s Conf nis yalinghensi s a AmbiortusCon c s i Apsaravis hengo fuscus Chao u i e n s tropede s fuc c g Petropedetes parker ti Gansus yumenensiiu i s He iu s r s CallobatrachusPe sanyanensi Pelobateescens o s Jianchangornis microdonto ni an u Limenaviss ypatagonics Pelobates varaldiruf m t O pe angia ornirnis du s Petropedetes johnston o s Y d hengdao Pelobates syriacuxerampelinis r ra i to ukhaanementj Petropedetes martienssenPelobates s Ichthyornisix disparro s s tis Enaliorniianog r Pelobates cultripeChi P o nis bei s s Baptornisa advenurn anctusi Petropedetes cameronensi enest i Bap r rgeria japonica f i a r is s a ev i ziens i ArchaeorhynchusYanornis hmartin spathulni hanen Philautusis surdus i s e g a Chiromantue h Hollandatornis s s e i Chiroman B HongshanornisPatagopteryx deferrariis longicrestp s gr n s i Flectonotu s e g s jelsk a Longicrusavis r abauih s Pristimantistimant huicund E o i is Orobates pabst s s Gobipteryxoenantiornis minut buhler varnerr s s Iberomesornis romeral n i a Oreobates quixensiHemiphractu Liaoningornis longidigitri is Pris lectonotus fitzgeraldi Pengornis hou luceri a F obiu i Shenqiornis meng a t Conrau i Ca le a in Eocathayornis walker i s x Gastrotheca gracili s Boluochia i i BatrachophrynumConraua goliat w s Longipteryxt chaoyangensi a r e s s Longirostravisha hani Telmatobiusel scrocchi a t i Rapaxavis pan T d hu s Shanweiniao cooperorumy hou Telmatobius schreiter Elsornis ken o a c anu Eoalulavis hoyasi Conraua crassipe a Protopteryx fengningensirni EupsophusAlsodes roseus gargolam tr ic s Ves i Telmatobufor r s Concornis lacustri s a e Pelody Neuquenornis volans i a Rhinoderma s i Eucritta melanolimnete zheng d i Capetus palustric i i Eupsophus ocalcaratu ame purcelli s Crassigyrinus ornisscoticu a n rue Baphe i e Baphete i Telmatobufoh venustualaeoba t ven s s Baphetes kirkbyi i Leiopelma h i R PelodytesP punctatusnus t i IchthyophiidaE i i Odontophrynu s Rhinatrematidao hebeiensi ry AlbanerpetontidaTyphlonectes compressicaud s ru e Czatkobatrachusc polonicust i a s Pros a ida Ascaphus d i Triadobatrachus massinote s toph inguinaliparvulu i Notobatrachu Allophryne sis i Notobatrachus degiusto c e Cordicephaluyne h Cauda Leiopelma hochstetterAscaphus alamanca riceps s I il c t Liaoridot i don Heleophryne hus a s i Urodel a a s s Palaeobatrachus occidentali t r rachu k s V l t e Dicamptodo ir m s O ees t t te c Dicamptodon aterrimu s t ros Karaurusa u Cordicephalusllophr gracili a o s a Marmorerpe i b alcinae r Necturus beyeril x rit s b c A ipiaoen i d ta b Pro i e r s o R o tri o e a s s R a on Hadromophryne natalensi achaenus d s r e Ambys t i p s Vieraella herbstiPliobab i A h r t ti e s ColosZ a Cratia Am h o Allobat i u z s Amph y t i o s r s Amphiumam y eida t h nu a i s Amph a o n hech d T ne l s P a s e i HymenochirinEopelobac A b e Aneides flavipunctatus b n a o s b c a ry a l e s Desmognal c b r l Salamandr e y o r t u i Notophthalmus viridescens yst o sharovi g t a r Pleurodeles walt s t e s a e s Sirenida t t t n r a hu e a S h t omat r t Sooglossus sechellensi Discoglossida h u a HynobiidaS i o r i a i u s c s i O o um i t YizhoubatrachusThoraciliacus macilentu b a i P i i o m t o to c soph s a Salamandrella keyserlingii ren uma tridactyl o p u n r d o Thoosuchu s n s H n ma i A H a e n n l e k u o s s a n HemisusEurycephalella marmoratu s n hida o e Chunerpeton tianyiensi y o i i l r u s s C y r s y t s i n n d M Macropelobates osborn p c h i J Cryptobranchusy alleganiensi a c n idae E u o c l i Pangerpeton sinensis n c k ho s n a tigrinu o h e c i l Regaler e r d n interm t means d t i A A d h l a e i i a Siderops kehli en y o a i e a y c p h e o A h r o d s su e iu s LysorophiBrachydec a o e l Trematosaurida o t t d i p b t z Craugastor occidentali Pelorocephalua r s s i PantylusMi cordatu delogyrinuc i c a i h t ct y t r i i s o a b hus a f a r en d l stovall c d e e n a s St S h t i e o c s m p e r BatrachosuchuPlagiosaurida i o a e e l s r E Utaherpeton franklin a l o s i u t e b s a r s Batrachyla antartandic B s Eryop l n A Euryodus pr e o u r r m a d r o c i e i a cr y e BrevicepsArariphrynus mossambicu placido a s k e a n e A A p s i l i h i e s b t L Ais u c n r o e i s s ego d l u y a i i r t s l v x s c r h P P O d i ochrophaeu p c i a s s e e x s o e d i h M r r n s D B d d r o r r t n i BuettneriaA maleriensin rs i i ob b x i e l u i a m i r n e e b s g n i N K D y RhytidosteidaRhinePeltobatrachu e c a u e i o o p e y o a e m d u u a h h n d d d L C U S e anne e t i a u e m n i e e e n t e e l d s t pe o r s r s t i a P a v t i P P D n h a u o l s r o i l t g h i e t n e l o n o n e m B B O H M P o Mastodonsaurida m t e i c l h topoda e u i s s l s l t o r n e a i g l l o t u r o c s i u a n b S S S r i a X LapillopsisPlatyoposauru nan T n u o n t r p t l s m m s t e e u h d h e i o c t t e o i s t l a d n c d a C t i r e c n i t o l e c e i re a t b s a u e n p w o r s topida s o r a t p i t s m o a h r s n l r y y y p g t e a d r o v e i u a l g g s e on c r a g a a a n n a i s a y a s h h u a o u d r on i s r e c t r r r t c b o o n o r chis u b i o c o t d o n v e c i t l o o t a p u e y e a a r l c a p m j o o c w r p r i l e e s u t t Lydekkerina huxley a u u c a p o t o i p o p t r a i i p h w c r e c c s y i u i t r o a s p o u Doleserpeton g e e c o d o i l g e n e us e a d n i c r r e n l h heida u s a t r o K n o n o h a o a a o d t t e r r r t o d s r h n h e o o e b p t h nl o t u p Konzhukovia vetust d u c r e i p i a i l w p s e n s r r h h s pe p s s i r o d a u c E o o i wei p o s n r a r s o l m p e p s s t s s c p r i n e n t e y u i o u s i r c s i u p d p n simorh s e p c l p o l a o a a n o r a i u r h i r o u e d o o a o n e s u u s s i p p e t o r p i c u r e a a s u m e s t e e d e agyru e t s a a d t l h i s e a n o p s o e b c s u y e h s d d peli n c P o r v ra d m r t h a h i e p r a n n l n t e l l s e l a e y l l s s d u Tersomius texensi Micropholis stow Trema t k e t u i r o o u j t o t y s n c a e s y e e b o h l t i o e o s e p t t a y a m l l u a l mu c o c c r e c o r e r o r o o u u c s t t l D p o i m r o l A p u o r o a n r r s t m r n p l c r o a p Sparodus e u o m l g s SclerocephalusA Eryops haeuser megacephalu E d i i r e u hangen u e p l e x r a s a e e e n l u u u n i s s u m u i n r aa u m o o u s c m t r s n n u e s e n e e Za C i p e a i h a d h r r e r n a u k u s r t e e r E B u b h d a s s s s c o o b a a a p s e k a r e a s a s O i c o m o a e r f o t h w i an e r ll Branchiosauridae T P D t r d s s G P c d o c m a g t s o b t s s r r oi i n Dole T E r y i l p o n r h l h s c o o n b w s u n d n L s p n i i o a n o s l a r s e p a A r A a a b i n t n i h m r b p e i a o a v n A I u i o a e t i i o o d a o s h m l n b f Llistrofus pricei a l i e r e n a c h p a t a chu d l g z t v e r i g v i l s r h c r a r a u t a n i b m n g s e e c d e i r i i o i T t i k i a p s e n o m r e o s D e h e r s e a s s e e s i t i v A M B a e d e o n n i h r p c i l Dissorophus multicinctus C i i s e s r i u t s n i a o N h n c i s l s r s i i i i n a C i r i d o i s c T scalaris u h g o t a m h f o e c C i n s u o Tuditanus punctulatus Pelodosotis elongatum u c r r m c t u s

t m a Micromelerpeton credner A e a n r l e Micraroter erythrogeios S Hapsidopareion lepton l a t o a d i a i r D n e c t s i t i r r u s i u u o

e Cardiocephalus peabodyi r m m m G

P

FIGURE 4. A) Bird’s Eye View in Mesquite (Maddison and Maddison 2014) showing inferred (green), asserted (blue), and missing (white) data in the synthetic supermatrix for the first 48 taxa (of 1051) and all 639 characters. B) Phylogeny of sarcopterygian vertebrates (Tetrapoda in grey) represented in the synthetic supermatrix, showing the distribution of data for one character, “skeleton of digitopodium.” Tip labels in black denote the absence of data, blue denotes taxa with asserted data, and green denotes taxa with inferred data. 944 SYSTEMATIC BIOLOGY VOL. 64

The number of published character states that entail digits, i.e., the metacarpals/tarsals and phalanges), the the presence or absence for selected parts of the fin number of inferred assertions (7751 annotations for 718 and limb was used to generate a figure showing their taxa) is 38 times higher and spread over 7 times more taxa distribution across taxa along the fin-to-limb transition than direct assertions (201 annotations for 103 taxa). An (Fig. 5, Table 6 available as Supplementary Materials on individual taxon can have multiple sources of inference Dryad at http://dx.doi.org/10.5061/dryad.rm907). for an individual entity, depending on the number and nature of the annotated characters that relate to that taxon and entity. For example, Acanthostega has 5 DISCUSSION directly asserted and 30 inferred sources with character states that entail presence or absence of “skeleton of The first step in scaling up the exploration of digitopodium.” phenotypic patterns in an evolutionary context is At the most basic level, aggregation of and inference to render phenotypic descriptions of species in a on phenotype data allows users to supplement large form amenable to large-scale computational integration, amounts of missing data computationally, without linking, and mining. How this is possible has recently extensive manual literature research. As Hiller et al. been shown in a series of papers from the Phenoscape (2012) show, simply knowing in which taxa a phenotypic group (Dahdul et al. 2010a; Mabee et al. 2012). Here we trait is present or absent across a taxonomic range in demonstrate that, using computable phenotypes from a which the trait underwent evolutionary change can datastore representing the cumulative effort of experts enable entirely new insights into the developmental across a broad taxonomic scale, synthetic supermatrices genetics of the trait. Although the overall goal of for presence/absence phenotypes can be automatically our method, filling in data that is not directly assembled for user-designated slices of the taxonomic asserted, is similar to imputation, our method differs and anatomical corpus. substantially from this technique. Regression-based Bringing together phenotypic data from across imputation practices for finding the “invisible fraction” multiple studies manually, and synthesizing them in a (Grafen 1988) use probabilistic models to reconstruct form amenable to computational analysis, is a nontrivial the “unknown knowns,” whereas our method bases exercise. Manual concatenation of phylogenetic its reconstructions on predefined logical axioms in matrices, for example, necessarily involves the time- the ontology. Imputation methods can be effective at consuming process of identifying and eliminating reducing gaps in quantitative data sets (Nakagawa character redundancy (e.g., Gatesy et al. 2002; Gatesy and Freckleton 2008, 2011; Swenson 2014); however, and Springer 2004; Hill 2005). Ascertaining the presence they are not applicable to qualitative matrix data. or the absence of a morphological feature requires an Our use of inference to extract unstated knowledge additional effort to reason from text that may only about the presence and the absence of traits allows incidentally describe an aspect of it. As a consequence, reconstruction of missing values without resorting to the ability of scientists to hand-assemble data across statistical parameters that may change across phylogeny. studies is severely hampered by the difficulty to compute Though we restrict ourselves to a simple set of relational on taxa and morphological data. Our work shows that rules for entities and their parts that are uniform across computable phenotypes not only enable automatic metazoans (i.e., the humerus, when present, is always consolidation of character states into nonredundant part_of the forelimb), the results of the logical reasoning presence/absence assertions, but they enable inference methods used here are very powerful. of presence/absence generalizations. Our method The supermatrix technique is a total evidence makes data reuse by nonexperts not only more efficient, approach in systematics (Kluge 1989), where different but also reduces the risk for error and expands the data sets and types are combined into a single phenotypic and taxonomic coverage of the original “supermatrix” of unique taxa (Sanderson 1998; De data. In so doing, it can open new possibilities for data Queiroz and Gatesy 2007). Inevitably, component data analysis, in particular if phenotypes are linked with sets overlap, but incompletely, resulting in many genes and other data through their shared ontological taxa lacking data for many characters. In the realm context (Mabee et al. 2012). of molecular data matrices, there have been two approaches to deal with the missing data problem: (i) leave all taxa separate and code the unavailable Inference Expands Data: Filling in the “Unknown Knowns” characters as missing; or (ii) reduce missing data by Our results demonstrate that inference can play making composite taxa at a level for which monophyly is a profound role in supplementing the taxonomically assumed a priori. The former of these may lead to loss of sparse phenotype assertions across taxa (Fig. 4). We resolution but not necessarily misleading relationships found that 76% of the variable characters in the synthetic (Kearney 2002; Kearney and Clark 2003; Wiens and supermatrix were made that way through inference, Morrill 2011; Wiens and Tiu 2012). The latter, composite meaning that at least one of their two states (presence taxa, may lead to misleading phylogenetic results (Malia or absence) is based solely on inferred data. For an et al. 2003). Whether supermatrices are sequence or individual feature such as the skeleton of digitopodium morphology based, they typically involve a lot of missing (the collection of skeletal elements encompassing the data. Molecular supermatrices may include over 70% 2015 DECECCHI ET AL.—EXTRACTING PRESENCE/ABSENCE PHENOTYPES 945

FIGURE 5. The level of anatomical data available for different parts of the fin and limb can be visualized for taxa along the fin-to-limb transition. Taxa included in this analysis encompass all major clades from the base of Sarcopterygii to the basal amphibians Baphetes and Westlothiana (see Ruta 2011 and Clack et al. 2012 for source topology). All taxa in this analysis are extinct with exception of the lungfish Neoceratodus. Taxa lacking all data for fin or limb were excluded. Cell color reflects the number of character states that entail the presence or the absence of that entity for each taxon (row). 946 SYSTEMATIC BIOLOGY VOL. 64 missing data (De Queiroz and Gatesy 2007; Fabre et al. involve different genes and networks relevant to a 2009; Hejnol et al. 2009; Hedtke et al. 2013), and a developmental biologist. One can also envision research morphological supermatrix assembled by Ramírez et al. questions where the inferred data hold value that (2007) had 94% missing data. By comparison, missing the asserted data do not. For example, the inferred data in the variable character-subset of the synthetic presence and absence of “hindlimbs” would be valuable supermatrix we created amounted to 98.5% without for studies examining correlations of habitat and applying inference, and applying logical inference locomotion, whereas the phenotype assertions that reduced this fraction to 78.2%. entailed them may not be directly related to locomotion (e.g., toenail color). Further, the knowledge structure that is laid out in sets of isomorphic characters may be of benefit as approaches to computationally dissecting out Characteristics and Application of Isomorphic Synthetic the expression of genes and their regulators (Hiller et al. Characters 2012) are scaled up. A considerable fraction (nearly 60%) of the populated For researchers interested in using these matrices for cells in the variable subset of the synthetic supermatrix phylogenetic reconstruction, caution must be exercised. are characters that are part of one of 93 isomorphic The character dependency implied by a significant clusters. That is, they corresponded to entities whose fraction of isomorphic (even if otherwise variable) presence/absence distributions are identical across taxa characters suggests that synthetic matrices, at least (Table 4 available as Supplementary Materials on Dryad in the form of presence/absence-only data, are not at http://dx.doi.org/10.5061/dryad.rm907). That our immediately suitable for phylogenetic reconstruction. matrix synthesis method generates isomorphic clusters As discussed above, observed isomorphism does is expected, because presence/absence reasoning uses not necessarily imply logical equivalence, and hence axioms in the Uberon anatomy ontology about part– whether characters should be merged or not due whole and developmental precursor relationships to putative dependency would need to be carefully (Balhoff et al. 2014). This will necessarily result in clusters examined for each case. For example, “nail” and composed of an asserted entity and its containing (for “dorsal skin of digit,” though isomorphic in their presence) or contained (for absence) classes, and/or taxonomic distribution in this data set, have different developmental precursors (for presence) or derivatives developmental bases, and can thus be argued to not be (for absence). For example, for a character of “femur dependent. bone,” a state value of present will induce the same A related issue is the potential for overweighting state value for characters “hindlimb,” “limb,” “femur of groups of characters connected by inference chains. cartilage element,” “femur pre-cartilage condensation,” For example, the absence of a structure with many and so on (see Fig. 3). We found that 10% (9 of 93) of parts will necessarily result in all the parts inferred as the isomorphic clusters were of this kind. Most clusters absent, and the presence of a structure will necessarily were composed of only inferred data, indicating that cause recursively all its containing structures to be the underlying original character states did not directly inferred present. The impact of this, and whether assert presence or absence. Although some of these absence-driven or presence-driven overweighting of (21 clusters) could be identified as the consequence characters predominates, is beyond the scope of this of logic equivalence chains for implied presence or study. Importantly, character overweighting already absence (see “Results” section), the majority (63 clusters) exists in the literature, and must thus be taken into resulted from various chains of inference from multiple account when composing supermatrices. In contrast to entities with no obvious repeating patterns. For example, other approaches, our method allows precisely tracing the taxonomic distributions of presence for “nail,” back the inference chains to identify the sources and “dorsal skin of digit,” “distal limb integumentary magnitude of overweighting. appendage,” and “digit skin” are identical, even though More generally, the degree to which phylogeny can the ontological presences of these entities are not inferred be recovered from binary presence/absence data alone to be equivalent. has, to our knowledge, not been investigated. Certainly, Whether and what value or impact these isomorphic presence/absence data are common in morphological clusters have will depend on the goals of the researcher data sets; Sereno (2009) gives a figure of 25%. However, using these data. Perhaps the broadest and most the phylogenetic resolution attained in these studies forward-looking applications for phenomic data involve require variation in other qualities (size, shape, texture, understanding trait evolution in relation to other color, etc.). The ontological methods used here reduce phenotypic traits, environmental factors, and aspects of data from these other qualities to presence/absence, thus development and related genomic features. For research changing the phylogenetic level at which the information questions such as these, the biological knowledge is relevant. For example, if variation in vertebral shape revealed by a cluster of correlated characters may have across a set of taxa is reduced to the inference that substantial value in supplementing the input data. For vertebrae are present in these same taxa, it no longer example, the presence of developmental precursors contains information to resolve them. However, the (femur cartilage, femur condensations) entailed by presence of vertebrae is informative for resolving taxa at an entity’s presence (e.g., the femur bone) may a higher level (i.e., as members of the clade Vertebrata). 2015 DECECCHI ET AL.—EXTRACTING PRESENCE/ABSENCE PHENOTYPES 947

Though this issue will require further examination, it is features (Shubin et al. 2014), and the ability to synthesize likely that the inferred presence/absence data will only knowledge on a large scale can focus future studies on support the monophyly of more inclusive clades than filling in gaps. the original assertions.

Quantification of Taxon Scoring Distribution of Data across Taxa and Anatomical Regions As a consequence of the obstacles to integrating morphological character data, it has been nearly The paired appendages are ostensibly one of the most impossible to assess quantitatively the differential intensely studied aspects of anatomy in vertebrates, sampling of taxa and anatomy across studies. This, and yet quantifying the data available for them has too, is readily enabled by the methods described not previously been possible. The methods presented here. As an example, we examined how frequently here readily enable this, including visualizing how our individual taxa had been scored for fin and limb knowledge of morphology, whether expressly stated or phenotypes in the generated synthetic supermatrix. implied, is distributed over taxonomic and anatomical Because of the logistic efforts necessarily involved in space (Fig. 5). This can then be used to pinpoint the morphological data collection (specimen preparation, taxonomic groups and the parts of the anatomy for museum collection visits, etc.), the taxonomic sample which data are sparse or lacking, allowing potential of species that an investigator can examine is limited, reasons and remedies to be considered. One should and some taxa are more readily available for study than note in this context that availability and lack of data others. In our data set, 70% of the taxa in the synthetic for an anatomical feature in a taxonomic group should supermatrix were connected to only a single publication not be expected to coincide with presence and absence, record (Table 5 available as Supplementary Materials respectively, of the feature in said group. Figure 5 on Dryad at http://dx.doi.org/10.5061/dryad.rm907). illustrates this, for example, for digits in the lungfish For taxa having more than one source publication, the Dipterus. Lungfishes do not have digits, yet due to proportions drop rapidly: 12% and 7% are found in two assertions about their absence in this taxon (Zhu et al. and three publications, respectively, and fewer than 2% 1999; Swartz 2012) data about digits in lungfishes are of the taxa are scored in seven or more publications. available. A single taxon, Acanthostega, a well-preserved exemplar For the matrix we synthesized for the evolution taxon in the fin-to-limb transition, holds the maximum of vertebrate fin/limb morphology, the gaps in the number of 16. data may be primarily attributable to the following However, this distribution, and in particular the high two factors. One, most taxa studied in the fin-to-limb proportion of single-source publication taxa, is unlikely transition are fossils and thus restricted to a few, often to be representative of the vertebrate comparative partial, specimens. These taxa may also be unscorable fin/limb morphology literature as a whole. This is for certain entities due to primitive absence (e.g., the because the publications we chose for phenotype ilium, ischium, and pubis of the pelvis are not present annotation treat mostly nonoverlapping sections of the in basal nontetrapod taxa). Two, the taxa and anatomical vertebrate phylogeny, and thus a high fraction of taxa elements used for study are unequally sampled. As is with a single publication source is a consequence of evident in Figure 5, there are much less data about our experimental design. If we consider only the data the hindlimb relative to the forelimb in basal tetrapods, for basal sarcopterygians relating to the fin-to-limb which cannot be explained by hindlimb specimens being transition (Table 1 available as Supplementary Materials unavailable or far less preserved in the fossil record for on Dryad at http://dx.doi.org/10.5061/dryad.rm907), the respective taxa than their forelimbs. Hence, other the proportion of taxa with only a single publication explanations are needed. Perhaps the differences could source drops to 34%. However, when considering the be due to more variability, and therefore more character fraction of taxa whose morphological features have been data, in the forelimb than the hindlimb during the fin- scored by only a single research group, this figure is to-limb transition, which would be consistent with the likely an underestimate. Some of the publications in “front wheel drive” hypothesis, which posits that the fin- this subset of the supermatrix share co-authors, and to-limb transition was driven primarily by changes in the many characters are recycled. A more thorough study forelimb (Shubin et al. 2014). Alternatively, the difference of independence and depth of evidence across the data could be a result of sampling bias caused by the larger set was beyond the scope of this study, but our results size of the ancestral forefin and the interconnectedness illustrate how our methods would readily enable such of the girdle skeleton with dermal skull elements. an analysis. Regardless of what is really behind the difference, our results illustrate how the ability to visualize the uneven distribution of knowledge can reveal far more than simply the existence of bias. Gaps in morphological Conflicting Data Revealed knowledge, such as regarding the phenotypic evolution When authors reuse characters from previous works, of the hindlimb, can present major challenges for encountering, and resolving coding conflicts is an understanding the origins and evolution of novel important part of the process to ensure phylogenetic 948 SYSTEMATIC BIOLOGY VOL. 64 relationships are as accurate as possible (Harris a relatively small proportion of the conflicts (17%). Some et al. 2007). Character conflicts are often difficult to of these discrepancies arise as new observations are spot by hand, yet the protocols authors follow for made, for example, from new specimens that reveal identifying, adjudicating, and resolving conflicts are formerly poorly known anatomy. For example, Zhu and rarely reported beyond a high-level summary. The colleagues (1999) scored the humerus of Strepsodus,a presented supermatrix synthesis approach immediately rhizodontid fish, for the presence of distinct supinator reveals conflicting phenotypes, here in the form of an and deltoid processes. Based on new fossil material anatomical feature having state values of both present for rhizodontids, the humerus morphology was re- and absent (0/1) for the same taxon. We found 774 such evaluated by Jeffery (2001) who concluded that in cells (0.5%) among the 146,451 populated cells, excluding Strepsodus and other rhizodontids distinct supinator and directly asserted polymorphisms, which we defined as deltoid processes are absent, thus generating the conflict those that trace back to direct assertions of both states observed in our data set. in the same source matrix (see “Methods” section). How Sometimes, the basis of conflict between original this level of character conflict compares with what has author assertions is not as readily traceable. For example, been observed previously is difficult to assess, because in in the fossil literature it is not uncommon that not previous studies in which morphological matrices have all of the specimens examined in relation to each been concatenated manually (see O’Leary et al. 2013; operational taxonomic unit (OTU) are reported. Even Sigurdsen and Green 2011), the resolution of conflicts when specimens are listed comprehensively, the reasons is not reported in a quantitative manner. However, in a for conflicts are sometimes difficult or even impossible to consensus morphological matrix for turtles, Harris et al. deduce from the published literature alone. For example, (2007) reported <2% cells with conflict (out of 4872 total Ruta et al. (2003) state that accessory foramina (passages cells), which is similar in magnitude to our finding. for blood vessels) are absent in the humerus of the One of the major advantages of the synthesis approach fossil amphibian Sauropleura, but later Ruta (2011) scored we present is not only that the extent of character conflict these foramina as “present.” As it does not appear can be quantified quickly, but also that detailed reports from his documentation that different specimens were about the provenance of all conflicting data can be examined, this leaves re-examining the specimens or generated automatically. This greatly aids review, and communicating with the authors as the only resort to where possible, resolution of these data by experts. There resolving the conflict. Such differences in scoring are a are a number of different causes for an observed conflict, challenge for both manual and machine concatenation only some of which are correctable errors; determining of these data, but they are to be expected, as authors the cause for a given character conflict requires careful not only have access to different materials over time, examination. A trivial case stems from the fact that but will also sometimes vary in their interpretation most matrices are not yet archived in digital repositories of structures. The presented matrix synthesis method (Stoltzfus et al. 2012; Drew et al. 2013), and errors cannot reduce or eliminate them, but it is able to readily could be a result of their required manual digitization. pinpoint candidates for investigation, including by way These will be reduced by the increasing push for of computationally (and thus automatically) generated digital archival of matrices upon publication. More reports. substantive conflicts, however, result from differing author assertions that may stem from observations Conflicts between Asserted and Inferred Character States.— of different (and differing) specimens or different The most frequent conflicts (73%) occur between asserted interpretations of the same material. Additionally, the and inferred data. These are arguably much less obvious conceptualization of the character by the original author, from manual analysis than the detection of conflicting and the terminology used for its description, may have assertions. An example comes from a recent large- consequences beyond the confines of the original state scale examination of tetrapod limb evolution, focused structure when annotated with ontology terms that on the transitional fossil Tiktaalik roseae, which is have logical implications, leading to conflicting results. described as having a “poorly developed” scapula A discussion of conflicts in relation to their bases blade (Ruta 2011). This assertion results in its inferred in assertion and/or inference follows, with examples presence in the synthetic supermatrix (Fig. 1). The from the data set we generated. It is worth noting scapula blade, however, is directly asserted to be that for conflicts due to correctable errors, our fully absent in Tiktaalik roseae by Swartz (2012) and Clack computational approach to matrix synthesis has the et al. (2012). Regardless of what is at the root of this advantage that once the errors are addressed in the conflict (different specimens, different interpretations KB, the corresponding conflicts are eliminated from any of morphology, polymorphism, etc.), the value of our supermatrix subsequently generated from it. method is that it makes the discrepancies in the literature evident. Conflicts between Asserted Character States.—The conflicts that are most readily traced to their cause are those Conflicts between Inferred Character States.—The fewest between authors who differently assert the presence and conflicts (10%) are generated between data based on the absence of an anatomical structure. These comprise inference alone. For instance in the frog Bombina 2015 DECECCHI ET AL.—EXTRACTING PRESENCE/ABSENCE PHENOTYPES 949 variegata, the ilial protuberance is inferred absent Finally, inattention to the semantics of anatomical based on the assertion that the ilial shaft is absent terminology can lead to incorrect and conflicting (Fabrezi 2006), of which the ilial protuberance is a assertions. For example, although there is clearly a part. The presence of the ilial protuberance is inferred deep homology across Sarcopterygii between distal fin from two assertions regarding its shape, that is, “not (paired fin radials) and distal limb (digits) skeletal knobbed distally” and “broad and low rounded” elements (Johanson et al. 2007), they are generally (Cannatella 1985), thus generating a conflict. Identifying considered distinct. Yet in the synthetic matrix, some the condition(s) in this species is beyond the scope of this limbed tetrapods are inferred to possess both radials article, but it would likely require the user to analyze the and digits. This inference was generated from several supporting specimens from the original sources. Again, limbed, and potentially terrestrial tetrapod taxa such the value of our method is that it reveals the conflicts, as Acanthostega, Dendrerpeton, and Silvanerpeton, scored here from inference alone, which particularly in this case as possessing “jointed radials” (Swartz 2012). Thus the would be difficult to ascertain manually. presence of radials is inferred for these taxa, while simultaneously digits were directly asserted for them (Ruta 2011). Perhaps Swartz (2012) used “radial” to Conflicts from Author Character Structure and Scoring.— encompass all acropodial elements because there is Some “false” conflicts resulted from the idiosyncratic simply not a more encompassing anatomical term that character construction and scoring practices by authors, applies to these distal skeletal structures across the and also limitations of the KB. For example, a conflict taxonomic breadth of vertebrates. This use of “radial,” is automatically generated when an author creates a however, conflicts with its general usage in the literature character state that is a disjunction of absence and one or (Ruta 2011) as well as genetic data concerning the more other qualities that entail presence. For example, distinctness of digits (Davis 2013; Woltering et al. 2014). the character, “ectepicondyle” with the state: “low, Referencing and applying a standardized vocabulary in indistinct or absent” (Laurin and Reisz 1997) is intended character descriptions would resolve this type of conflict to reflect the variability present across the taxa. Yet (Seltmann et al. 2012). this wording does not allow the reader to differentiate As described above, author-generated conflicts pose a whether this represents polymorphism within species problem to the effort of automatic integration of these (i.e., different states in different individuals of a single manually annotated character data. Because they are species), or whether the set of species to which the idiosyncratic and difficult to detect until integration, description applies has multiple states (i.e., one state there is little possibility to create filters that automatically in one species, a different one in another). In this case, correct for these types of errors. We suggest it is better because “low” implies the presence of an ectepicondyle, to work to amend character construction practices, it is automatically shown as in conflict with the same and working toward a future in which characters are authors’ assertion of absence. This illustrates how constructed in computable form a priori, than trying ambiguity in how an author constructs character states to address them post hoc. This will likely become even can limit or even preclude the utility of their data in other more important as text markup of phenotypes and other contexts. concepts is automated, leaving little margin for human Another source of error stems from character curator correction of inconsistencies. constructions that involve phenotypes of anatomical elements that are more complex than simple presence/absence, but are applied to taxa to which strictly speaking they do not apply. For example, the Improved Annotation and Curation Standards frog Rhinophrynus dorsalis is asserted to lack a sternum The annotation practices that guided the EQ (Cannatella 1985). Yet in the same study the epicoracoid assignments to the character data in Phenoscape bone is scored in this taxon as “not fused to sternum,” were designed to capture the rich anatomical from which a machine reasoner, and arguably also a detail and differences among taxa, as described by human reader, would infer that a sternum is present. If taxonomic experts. Combining and reasoning across an author scored this as “not applicable” in the original the annotations in this study, however, cast these data matrix, the logical error (inferred presence) would be in different relief, in some cases revealing conflicts avoided. that were the result of inappropriate annotation of Inferences depend on the data asserted by the author. author statements. Resolving these, and generalizing For example, the loss of digit I (thumb) in amniotes the issues where possible, enabled us to improve might be denoted as “digit I absent” or “four digits and expand the anatomy and quality ontologies, present” (assuming the plesiomorphic presence of five). the annotations, and the phenotype curation guideli- However, if the identity of the absent digit is not specified nes (http://phenoscape.org/wiki/Guide_to_Character_ by the author, the annotation and reasoning methods Annotation). For instance, it appeared that data from described will not be able to infer it. Thus we strongly a single paper conflicted in whether or not the fish urge authors, when dealing with the presence/absence Onychodus possessed a postcleithrum (Cloutier and of serially homologous structures in particular, to be Arratia 2004). Investigation revealed that the authors explicit regarding the identity of the entity in question. directly asserted the absence of this bone; an inferred but 950 SYSTEMATIC BIOLOGY VOL. 64 incorrect presence resulted from a mistake in annotating character-subset from 98.5% to 78.2%. Moreover, 76% “presence of a postcleithral scale” as “‘dermal scale’ and of the variable characters were in fact made variable part_of some postcleithrum, present.” The postcleithral through the addition of inferred presence/absence scale, however, is a separate type of scale and not a part states. Equally important, this automated method of the postcleithrum bone. In this case, we added a new results in immediate isolation of character conflicts entity “postcleithral scale” to the Uberon ontology as a and detailed reports about their provenance. This type of “scale,” the feature was re-annotated, and the capability, if available broadly, will greatly aid experts conflict thus removed. in data review, and where possible, conflict resolution. Finally, machine reasoning enables quantification and new visualizations of the data, as demonstrated here, FUTURE DIRECTIONS allowing the identification of character space that is undersampled across the fin-to-limb transition. The approach and methods demonstrated here to compute synthetic presence/absence supermatrices are applicable to any taxonomic and phenotypic slice across the tree of life, provided these data SUPPLEMENTARY MATERIAL are semantically annotated. Scaling up annotation to Data available from the Dryad Digital Repository: this level, however, will require significant effort, http://dx.doi.org/10.5061/dryad.rm907. including the development of semiautomated methods for marking up free-text descriptions (e.g., Cui 2012; Arighi et al. 2013; Dahdul et al. 2015); provisioning of ACKNOWLEDGMENTS community phenotype ontologies to accommodate the We thank Phenoscape collaborators, including D. diversity of taxa and evolved anatomies and qualities Blackburn, W.M. Dahdul, P. Manda, C.J. Mungall, and (Gkoutos et al. 2005; Haendel et al. 2014); and faster T.J. Vision for comments and advice that improved and more efficient methods for reasoning across these this work. We also acknowledge N. Ibrahim and W.M. substantially larger data in knowledgebases. Another Dahdul for annotation of some of the data analyzed here, challenge lies in developing methods to aggregate as well as M.A. Haendel and C. Mungall for guidance “non” presence/absence phenotypes, that is, those and help with ontology development. We’d like to thank features varying in qualities such as size, shape, color, M.J. Ramirez and an anonymous reviewer for their texture, etc., into a matrix format, which will require helpful and constructive comments. sophisticated algorithms for automating consolidation of synthetic character states. Additionally, new methods are required to integrate taxonomically-heterogeneous FUNDING supermatrix data with user-specified trees. Because the phenotype data are asserted at multiple taxonomic This work was supported by National Science levels (i.e., to species, genera, families, etc.), current Foundation collaborative grants (DBI-1062404, DBI- methods for their optimization and visualization along 1062542) and the National Science Foundation National a tree are limited. The power of leveraging ontology- Evolutionary Synthesis Center (NESCent) (EF-0905606). based reasoning to propagate anatomical knowledge over phylogenetic trees has been recently demonstrated (Ramírez and Michalik 2014). Using this approach, in REFERENCES conjunction with the synthetic matrices demonstrated Arighi C.N., Carterette B., Cohen K.B., Krallinger M., Wilbur W.J., Fey here, indicates a necessary/critical role for inference in P., Dodson R., Cooper L., Van Slyke C.E., Dahdul W., Mabee P., Li D., understanding patterns of phenotypic evolution across Harris B., Gillespie M., Jimenez S., Roberts P., Matthews L., Becker large data and trees. K., Drabkin H., Bello S., Licata L., Chatr-aryamontri A., Schaeffer M.L., Park J., Haendel M., Van Auken K., Li Y., Chan J., Muller H.-M., Cui H., Balhoff J.P., Chi-Yang Wu J., Lu Z., Wei C.-H., Tudor C.O., Raja K., Subramani S., Natarajan J., Cejuela J.M., Dubey P., Wu CONCLUSIONS C. 2013. An overview of the BioCreative 2012 Workshop Track III: interactive text mining task. Database 2013:bas056. The phenotypic features that characterize and define Balhoff J.P.,Dahdul W.M., Kothari C.R., Lapp H., Lundberg J.G., Mabee evolutionary groups are currently scattered across the P., Midford P.E.,Westerfield M., Vision T.J. 2010. Phenex: ontological dispersed literature of comparative biology, often in annotation of phenotypic diversity. PLoS One 5:e10500. character-by-taxon matrices for small sets of taxa. The Balhoff J.P., Dececchi T.A., Mabee P.M., Lapp H. 2014. Presence- difficult and time-consuming manual aggregation of absence reasoning for evolutionary phenotypes. Available from: phenoday2014.bio-lark.org. these data reduces their reuse. Here we demonstrate Barnosky A.D., Matzke N., Tomiya S., Wogan G.O.U., Swartz B., that when phenotypes are ontology-annotated, their Quental T.B., Marshall C., McGuire J.L., Lindsey E.L., Maguire K.C., presence and absence can be automatically integrated Mersey B., Ferrer E.A. 2011. Has the Earth’s sixth mass extinction into synthetic character matrices. We found that already arrived? Nature 471:51–57. inference plays a profound role in supplementing the Burleigh J.G., Alphonse K., Alverson A.J., Bik H.M., Blank C., Cirranello A.L., Cui H., Daly M., Dietterich T.G., Gasparich taxonomically sparse phenotype assertions across taxa, G., Irvine J., Julius M., Kaufman S., Law E., Liu J., Moore in our case reducing the missing data in the variable L., O’Leary M.A., Passarotti M., Ranade S., Simmons N.B., 2015 DECECCHI ET AL.—EXTRACTING PRESENCE/ABSENCE PHENOTYPES 951

Stevenson D.W., Thacker R.W., Theriot E.C., Todorovic S., Velazco Harris S.R., Pisani D., Gower D.J., Wilkinson M. 2007. Investigating P.M., Walls R.L., Wolfe J.M., Yu M. Next-generation phenomics stagnation in morphological phylogenetics using consensus data. for the Tree of Life. PLoS Curr. 2013 Jun 26. Edition 1. doi: Syst. Biol. 56:125–129. 10.1371/currents.tol.085c713acafc8711b2ff7010a4b03733. Hedtke S., Patiny S., Danforth B. 2013. The bee tree of life: a Cannatella D.C. 1985. A phylogeny of primitive frogs supermatrix approach to apoid phylogeny and biogeography. BMC (archaeobatrachians) [Ph.D. Thesis]. The University of Kansas. Evol. Biol.13:138. Clack J.A., Ahlberg P.E., Blom H., Finney S.M. 2012. A new genus Hejnol A., Obst M., Stamatakis A., Ott M., Rouse G.W., Edgecombe of Devonian tetrapod from North-East Greenland, with new G.D., Martinez P., Baguñà J., Bailly X., Jondelius U., Wiens M., Müller information on the lower jaw of Ichthyostega. Palaeontology 55: W.E.G., Seaver E., Wheeler W.C., Martindale M.Q., Giribet G., Dunn 73–86. C.W. 2009. Assessing the root of bilaterian animals with scalable Cloutier R., Arratia G. 2004. Early diversification of actinopterygians. phylogenomic methods. Proc. R. Soc. B. 276:4261–4270. In: Arratia G., Wilson M.V.H., Cloutier R., editors. The origin Hill R.V. 2005. Integration of morphological data sets for phylogenetic and early radiation of vertebrates: honoring Hans-Peter Schultze. analysis of Amniota: the importance of integumentary characters Munich: Verlag Dr. Friedrich Pfeil. p. 217–270. and increased taxonomic sampling. Syst. Biol. 54:530–547. Cui H. 2012. CharaParser for fine-grained semantic annotation of Hiller M., Schaar B.T., Indjeian V.B., Kingsley D.M., Hagey L.R., organism morphological descriptions. J. Am. Soc. Inf. Sci. Technol. Bejerano G. 2012. A ‘forward genomics’ approach links genotype 63:738–754. to phenotype using independent phenotypic losses among related Dahdul W.M., Balhoff J.P., Blackburn D.C., Diehl A.D., Haendel M.A., species. Cell Rep. 2:817–823. Hall B.K., Lapp H., Lundberg J.G., Mungall C.J., Ringwald M., Jeffery J. 2001. Pectoral fins of rhizodontids and the evolution of pectoral Segerdell E., Van Slyke C.E., Vickaryous M.K., Westerfield M., appendages in the tetrapod stem-group. Biol. J. Linn. Soc. Lond. Mabee P.M. 2012. A unified anatomy ontology of the vertebrate 74:217–236. skeletal system. PLoS One 7:e51070. Johanson Z., Joss J., Boisvert C.A. 2007. Fish fingers: digit homologues Dahdul W.M., Balhoff J.P., Engeman J., Grande T., Hilton E.J., Kothari in sarcopterygian fish fins. J. Exp. Zool. (Mol Dev Evol). 308B: C., Lapp H., Lundberg J.G., Midford P.E., Vision T.J., Westerfield 757–768. M., Mabee P.M. 2010a. Evolutionary characters, phenotypes and Kazakov Y., Krötzsch M., Simanˇcík F. 2012. ELK reasoner: architecture ontologies: curating data from the systematic biology literature. and evaluation. Proceedings of the 1st International Workshop on PLoS One 5:e10708. OWL Reasoner Evaluation (ORE-2012). Dahdul W.M., Lundberg J.G., Midford P.E.,Balhoff J.P.,Lapp H., Vision Kazakov Y., Krötzsch M., Simanˇcík F. 2013. The Incredible ELK. J. T.J., Haendel M.A., Westerfield M., Mabee P.M. 2010b. The teleost Automat. Reason. 3:1–61. anatomy ontology: anatomical representation for the genomics age. Kearney M., Clark J.M. 2003. Problems due to missing data in Syst. Biol. 59:369–383. phylogenetic analyses including fossils: a critical review. J. Vert. Dahdul W., Dececchi T.A., Ibrahim N., Lapp H., Mabee P. 2015. Paleontol. 23:263–274. Moving the mountain: analysis of the effort required to transform Kearney M. 2002. Fragmentary taxa, missing data, and ambiguity: comparative anatomy into computable anatomy. Database. doi: mistaken assumptions and conclusions. Syst. Biol. 51:369–381. 10.1093/database/bav040. Kluge, A.G. 1989. A concern for evidence and a phylogenetic hypothesis Davis M.C. 2013. The deep homology of the autopod: insights from of relationships among Epicrates (Bovidae, Serpentes). Syst. Zool. hox gene regulation. Integr. Comp. Biol. 53:224–232. 38:7–25. Deans A.R., Yoder M.J., Balhoff J.P. 2012. Time to change how we Laurin M., Reisz R. 1997. A new perspective on tetrapod phylogeny. describe biodiversity. Trends Ecol. Evol. 27:78–84. In: Sumida S.S., Martin K.L.M., editors. Amniote origins: De Queiroz A., Gatesy J. 2007. The supermatrix approach to completing the transition to land. Waltham (MA): Academic Press. systematics. Trends Ecol. Evol. 22:34–41. p. 9–60. Drew B.T., Gazis R., Cabezas P., Swithers K.S., Deng J., Rodriguez Lundberg J.G. 1992. The phylogeny of ictalurid catfishes: a synthesis of R., Katz L.A., Crandall K.A., Hibbett D.S., Soltis D.E. 2013. Lost recent work. In: Mayden R.L., editor. Systematics, historical ecology, branches on the tree of life. PLoS Biol. 11:e1001636. & North American freshwater fishes. Stanford: Stanford University Fabre P.H., Rodrigues A., Douzery E.J.P. 2009. Patterns of Press. p. 392–420. macroevolution among Primates inferred from a supermatrix Mabee P., Balhoff J.P., Dahdul W.M., Lapp H., Midford P.E., Vision T.J., of mitochondrial and nuclear DNA. Mol. Phylogenet. Evol. 53: Westerfield M. 2012. 500,000 fish phenotypes: the new informatics 808–825. landscape for evolutionary and developmental biology of the Fabrezi M. 2006. Evolución morfológica en Ceratophryinae (Anura, vertebrate skeleton. J. Appl. Ichthyol. 28:300–305. Neobatrachia). J. Zoolog. Syst. Evol. Res. 44:153–166. Maddison, W.P., D.R. Maddison. 2014. Mesquite: a modular system for Gatesy J., Matthee C., DeSalle R., Hayashi C. 2002. Resolution of a evolutionary analysis. Version 3.0. http://mesquiteproject.org. supertree/supermatrix paradox. Syst. Biol. 51:652–664. Maglia A.M., Leopold J.L., Pugener L.A., Gauch S. 2007. An Gatesy J., Springer M.S. 2004. A critique of matrix representation anatomical ontology of amphibians. Proc. Pac. Symp. Biocomput. 12: with parsimony supertrees. In: Binida-Emonds O.R.P., editor. 367–378. Phylogenetic supertrees: combining information to reveal the tree Malia M.J. Jr, Lipscomb D.L., Allard M.W. 2003. The misleading effects of life. Netherlands: Springer. p. 369–388. of composite taxa in supermatrices. Mol. Phylogenet. Evol. 27: Gkoutos G.V., Green E.C.J., Mallon A.-M., Hancock J.M., Davidson D. 522–527. 2005. Using ontologies to describe mouse phenotypes. Genome Biol. Midford P., Dececchi T., Balhoff J., Dahdul W., Ibrahim N., Lapp 6:R8. H., Lundberg J., Mabee P., Sereno P., Westerfield M., Vision T., Grafen A. 1988. On the uses of data on lifetime reproductive success. Blackburn D. 2013. The vertebrate taxonomy ontology: a framework In: Clutton-Brock T.H. editor. Reproductive success: studies of for reasoning across model organism and species phenotypes. individual variation in contrasting breeding systems. Chicago: J. Biomed. Semantics 4:34. University of Chicago Press. p. 454–471. Mungall C.J., Gkoutos G.V., Smith C.L., Haendel M.A., Lewis S.E., Haendel M., Balhoff J., Bastian F., Blackburn D., Blake J., Bradford Ashburner M. 2010. Integrating phenotype ontologies across Y., Comte A., Dahdul W., Dececchi T., Druzinsky R., Hayamizu T., multiple species. Genome Biol. 11:R2. Ibrahim N., Lewis S., Mabee P., Niknejad A., Robinson-Rechavi M., Mungall C.J, Gkoutos G., Washington N., Lewis S. 2007. Representing Sereno P., Mungall C. 2014. Unification of multi-species vertebrate phenotypes in OWL. Innsbruck, Austria: OWL: Experiences and anatomy ontologies for comparative biology in Uberon. J. Biomed. Directions (OWLED 2007. Semantics. 5:21. Mungall C.J., Torniai C., Gkoutos G.V., Lewis S.E., Haendel M.A. 2012. Harmon L.J., Baumes J., Hughes C., Soberon J., Specht C.D., Turner Uberon, an integrative multi-species anatomy ontology. Genome W., Lisle C., Thacker, R.W. 2013. Arbor: comparative analysis Biol. 13:R5. workflows for the tree of life. PLoS Curr. 2013 Jun 21. Edition 1. Nakagawa S., Freckleton R.P. 2008. Missing inaction: the dangers of doi: 10.1371/currents.tol.099161de5eabdee073fd3d21a44518dc. ignoring missing data. Trends Ecol. Evol. 23:592–596. 952 SYSTEMATIC BIOLOGY VOL. 64

Nakagawa S., Freckleton R.P. 2011. Model averaging, missing data and Shubin N.H., Daeschler E.B., Jenkins F.A. Jr. 2014. Pelvic girdle and fin multiple imputation: a case study for behavioural ecology. Behav. of Tiktaalik roseae. Proc. Natl. Acad. Sci. USA 111:893–899. Ecol. Sociobiol. 65:103–116. Sigurdsen T., Green D.M. 2011. The origin of modern amphibians: a O’Leary M.A., Bloch J.I., Flynn J.J., Gaudin T.J., Giallombardo A., re-evaluation. Zool. J. Linn. Soc. 162:457–469. Giannini N.P., Goldberg S.L., Kraatz B.P., Luo Z.-X., Meng J., Ni Smith B., Ceusters W., Klagges B., Köhler J., Kumar A., Lomax J., X., Novacek M.J., Perini F.A., Randall Z.S., Rougier G.W., Sargis E.J., Mungall C., Neuhaus F., Rector A.L., Rosse C. 2005. Relations in Silcox M.T., Simmons N.B., Spaulding M., Velazco P.M., Weksler M., biomedical ontologies. Genome Biol. 6:R46. Wible J.R., Cirranello A.L. 2013. The placental mammal ancestor and Stewart T.A., Smith W.L., Coates M.I. 2014. The origins of adipose fins: the post-K-Pg radiation of placentals. Science 339:662–667. an analysis of homoplasy and the serial homology of vertebrate Price S.A., Wainwright P.C., Bellwood D.R., Kazancioglu E., Collar appendages. Proc. R. Soc. B. 281:20133120. D.C., Near T.J. 2010. Functional innovations and morphological Stoltzfus A., O’Meara B., Whitacre J., Mounce R., Gillespie E.L., Kumar diversification in parrotfish. Evolution 64:3057–3068. S., Rosauer D.F., Vos R.A. 2012. Sharing and re-use of phylogenetic Ramírez M.J., Coddington J.A., Maddison W.P., Midford P.E., Prendini trees (and associated data) to facilitate synthesis. BMC Res. Notes L., Miller J., Griswold C.E., Hormiga G., Sierwald P., Scharff N., 5:574. Benjamin S.P., Wheeler W.C. 2007. Linking of digital images to Swartz B. 2012. A marine stem-tetrapod from the Devonian of western phylogenetic data matrices using a morphological ontology. Syst. North America. PLoS One 7:e33683. Biol. 56:283–294. Swenson N.G. 2014. Phylogenetic imputation of plant functional trait Ramírez M.J., Michalik P. 2014. Calculating structural complexity in databases. Ecography 37:105–110. phylogenies using ancestral ontologies. Cladistics 30:635–649. Van Bocxlaer I., Loader S.P., Roelants K., Biju S.D., Menegon M., Rudall P.J., Hilton J., Bateman R.M. 2013. Several developmental and Bossuyt F. 2010. Gradual adaptation toward a range-expansion morphogenetic factors govern the evolution of stomatal patterning phenotype initiated the global radiation of toads. Science 327: in land plants. New Phytol. 200:598–614. 679–682. Ruta M., Coates M.I., Quicke D.L.J. 2003. Early tetrapod relationships Vos R.A., Balhoff J.P., Caravas J.A., Holder M.T., Lapp H., Maddison revisited. Biol. Rev. Camb. Philos. Soc. 78:251–345. W.P., Midford P.E., Priyam A., Sukumaran J., Xia X., Stoltzfus A. Ruta M. 2011. Phylogenetic signal and character compatibility in 2012. NeXML: rich, extensible, and verifiable representation of the appendicular skeleton of early tetrapods. Studies on Fossil comparative data and metadata. Syst. Biol. 61:675–689. Tetrapods. Oxford: Wiley-Blackwell. p. 31–43. Wiens J.J., Morrill M.C. 2011. Missing data in phylogenetic analysis: Sanderson, M.J. 1998. Phylogenetic supertrees: assembling the trees of reconciling results from simulations and empirical data. Syst. Biol. life. Trends Ecol. Evol. 13:105–109. 60:719–731. Seltmann K.C., Yoder M.J., Mikó I., Forshage M. 2012. A Wiens J.J., Tiu J. 2012. Highly incomplete taxa can rescue phylogenetic hymenopterists’ guide to the Hymenoptera Anatomy Ontology: analyses from the negative impacts of limited taxon sampling. PLoS utility, clarification, and future directions. J. Hymenoptera Res. One 7:e42925. 27:67–88. Woltering J.M., Noordermeer D., Leleu M., Duboule D. 2014. Sereno P.C. 2009. Comparative cladistics. Cladistics 25:624–659. Conservation and divergence of regulatory strategies at Hox loci Sereno P.C., Tan L., Brusatte S.L., Kriegstein H.J., Zhao X., Cloward K. and the origin of tetrapod digits. PLoS Biol. 12:e1001773. 2009. Tyrannosaurid skeletal design first evolved at small body size. Zhu M., Yu X., Janvier P. 1999. A primitive fossil fish sheds light on the Science 326:418–422. origin of bony fishes. Nature 397:607–610.