Using Natural Language Processing to Identify and Explore the Characters
G Model JCZ-25325; No. of Pages 7
ARTICLE IN PRESS
Zoologischer Anzeiger xxx (2015) xxx–xxx
Contents lists available at ScienceDirect
Zoologischer Anzeiger
journal homepage: www.elsevier.com/locate/jcz

Peeking behind the page: using natural language processing to identify and explore the characters used to classify sea anemones

Marymegan Daly a,∗, Lorena A. Endara b, John Gordon Burleigh b

a Department of Evolution, Ecology, and Organismal Biology, The Ohio State University, Aronoff Lab, 318 West 12th Avenue, Columbus, OH 43210, USA
b Department of Biology, The University of Florida, Bartram Hall, 876 Newell Dr., Gainesville, FL 32611, USA

Article info

Article history:
Received 20 October 2014
Received in revised form 16 March 2015
Accepted 17 March 2015
Available online xxx

Keywords: Actiniaria; Cnidaria; Systematics; Matrix

Abstract

Although most phylogenetic investigations are motivated by questions about the evolution of morphological attributes, morphological data are increasingly rare as a source of characters for reconstructing phylogeny, in part because these attributes are time consuming to collect. Here we describe methods to mine the information contained in classifications as a source of phylogenetic characters, using the classification of actiniarian sea anemones (Cnidaria: Anthozoa) as our exemplar system. Our natural language processing pipeline recovers more than 400 characters in the most widely-used classification of sea anemones. However, the majority of these are problematic, reflecting semantic or logical inconsistencies or being scored for only a single taxon and thus inappropriate for phylogenetic reconstruction. Although the classification cannot be directly translated into a phylogenetic matrix, the exposure of the characters that underlie a classification provides important perspective into the basis and limits of a classification system and offers a valuable starting point for the creation of a phylogenetic matrix.
© 2015 Published by Elsevier GmbH.

∗ Corresponding author. Tel.: +1 614 247 8412; fax: +1 6142927774.
E-mail addresses: [email protected] (M. Daly), lendara@flmnh.ufl.edu (L.A. Endara), gburleigh@ufl.edu (J.G. Burleigh).
http://dx.doi.org/10.1016/j.jcz.2015.03.004
0044-5231/© 2015 Published by Elsevier GmbH.

1. Introduction

Actiniarian sea anemones are conspicuous members of marine habitats, dominating some shallow water and polar communities and playing significant roles in reef, hydrothermal, and shelf systems (Fautin, 1989; Fautin et al., 2013). Because the actiniarian communities of most habitats and on most continents comprise diverse lineages, this ecological breadth is probably not the result of in situ radiations, but instead reflects ancient diversity, a pattern also seen in their close relatives, scleractinian corals (Barbeitos et al., 2010; Stolarski et al., 2011).

Like most cnidarians, actiniarians have relatively simple bodies: an actiniarian is a tubular, tentaculate polyp whose body consists of highly folded and extruded sheets of one-to-three cell layers of tissue. Although simple in anatomy compared to triploblastic animals, actiniarians show the greatest polyp-level diversity of Cnidaria, with complex interior anatomy, several unique anatomical structures, and diversity in the morphology of the column and tentacles (Daly et al., 2007).

The synthesis of this diversity into a coherent framework posed a challenge for 19th and early 20th century systematists, who disagreed about whether the group was monophyletic and about how to interpret and link the taxa within the order (reviewed in Daly et al., 2007; Rodríguez et al., 2014). A stable classification arose through collaboration between the two most prolific scholars of actiniarian biology, Oskar Carlgren (Swedish, 1865–1954) and Thomas A. Stephenson (British, 1898–1961), when Carlgren (1949) codified and revised the system initially proposed by Stephenson (1920, 1921, 1922). Stephenson's (1949) contribution of the preface to Carlgren's classification highlights this as a consensus system largely agreed upon by both of them. Carlgren (1949) divides the ∼1200 species of Actiniaria known into three suborders; the largest encompasses the vast majority of species and is further subdivided into superfamilies (mistakenly referred to as "tribes"). Carlgren (1949) also synthesized the diversity of Ptychodactiaria, a group he recognized as an order but that is now classified as a family within Actiniaria (reviewed by Rodríguez et al., 2014).

Carlgren's (1949) classification has been challenged by the discovery of new taxa (e.g., Fautin and Hessler, 1989; Daly and Goodwill, 2009; Rodríguez et al., 2009), consideration of new character systems (e.g., Schmidt, 1969, 1974), reexamination of characters in detail (e.g., Cappola and Fautin, 2000), and phylogenetic analyses (e.g., Daly et al., 2002, 2008; Gusmão and Daly, 2010; Rodríguez et al., 2008, 2012, 2014). Although these challenges have empirical backing, they are more limited in taxonomic scope and in the breadth of morphological features considered than is the classification of Carlgren (1949). However broad it is in terms of data and scope, Carlgren's (1949) system is replete with contradictions and arbitrariness in terms of the implied hierarchy of characters (reviewed in Daly et al., 2008; Rodríguez et al., 2012, 2014). Carlgren's system is not phylogenetic in the modern sense, although there are indications that he viewed some of the higher taxa as having phylogenetic cohesiveness (Carlgren, 1942), and he explicitly recognized (Carlgren, 1949: 7) that the system was built upon imperfect information. Resolving the conflict between Carlgren's classification, phylogenetic analyses of anemones, and the information embodied in character systems not considered by Carlgren (1949) requires careful study of diverse character systems in a phylogenetic context.

This seemingly daunting task is made easier by new tools that facilitate the extraction of character information from monographs. Semi-automated text mining and natural language processing (NLP) programs designed for the concise and technical format of formal taxonomic descriptions enable extraction of the information embodied in monographs and text-based descriptions of species (e.g., Cui, 2012; reviewed by Burleigh et al., 2013). These tools render accessible centuries' worth of biodiversity information, allowing for explicit examination of characters and data. The parsing of descriptive narratives into characters exposes data that underlie classifications, proposals of synonymy, or hypotheses of relationship and thus allows these individual studies to be synthesized and compared.

We parse Carlgren's (1949) classification using a series of NLP and phylogenetic character discovery tools that were developed as part of the AVAToL Next Generation Phenomics project (Burleigh et al., 2013). The functions of these programs are all implemented online as part of the ETC website (http://etc-dev.cs.umb.edu/etcsite/). The parsing pipeline first uses CharaParser (Cui, 2012) to identify the characters

for each taxon. In its default setting, which we used, MatrixGenerator uses the hierarchical classification to fill in character states for taxa of the lower taxonomic ranks. For example, if a family description contained a specific character state, then all of the genera within that family would also be coded for that character state. Currently, the ETC website, specifically the "Text Capture" option (Fig. 1B), includes software that enables users to perform all of these steps to obtain character datasets from taxonomic descriptions (http://etc-dev.cs.umb.edu/etcsite/).

Finally, we used "MatrixConverter" software to evaluate the character data output by MatrixGenerator. MatrixConverter is a freely available, platform-independent software program designed to facilitate the transformation of raw phenomic character data into discrete character matrices (Liu et al., 2015). It takes as input the tab-delimited character files output from MatrixGenerator (or the "Text Capture" option of the ETC website) and provides an easy-to-use interface that enables users to evaluate the characters and ultimately code them as discrete character states for evolutionary inference. For example, in our analysis the initial list of characters contained duplicate and nonsense characters. Duplicates are those features clearly referring to the same structure or character concept but using different words (e.g., shape of "base," "basal disc," or "pedal disc"); nonsense characters are those features that are logically inconsistent (e.g., "count" and "presence" as two independent characters for a structure that can only occur singly, like the column, or a position-based attribute of a feature defined by its position, such as "position of the proximal end"). Identification of duplicates and nonsensical features requires expertise with the organismal system. We identified these by grouping the putative characters based on the system of origin (e.g., tentacle, actinopharynx) and then scrutinizing each feature. We also made comparisons across character systems for characters to identify logically or structurally