Synthesis of Phylogeny and Taxonomy Into a Comprehensive Tree of Life

Cody E. Hinchliff (a,1), Stephen A. Smith (a,1,2), James F. Allman (b), J. Gordon Burleigh (c), Ruchi Chaudhary (c), Lyndon M. Coghill (d), Keith A. Crandall (e), Jiabin Deng (c), Bryan T. Drew (f), Romina Gazis (g), Karl Gude (h), David S. Hibbett (g), Laura A. Katz (i), H. Dail Laughinghouse IV (i), Emily Jane McTavish (j), Peter E. Midford (d), Christopher L. Owen (c), Richard H. Ree (d), Jonathan A. Rees (k), Douglas E. Soltis (c,l), Tiffani Williams (m), and Karen A. Cranston (k,2)

(a) Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI 48109; (b) Interrobang Corporation, Wake Forest, NC 27587; (c) Department of Biology, University of Florida, Gainesville, FL 32611; (d) Field Museum of Natural History, Chicago, IL 60605; (e) Computational Biology Institute, George Washington University, Ashburn, VA 20147; (f) Department of Biology, University of Nebraska-Kearney, Kearney, NE 68849; (g) Department of Biology, Clark University, Worcester, MA 01610; (h) School of Journalism, Michigan State University, East Lansing, MI 48824; (i) Biological Science, Clark Science Center, Smith College, Northampton, MA 01063; (j) Department of Ecology and Evolutionary Biology, University of Kansas, Lawrence, KS 66045; (k) National Evolutionary Synthesis Center, Duke University, Durham, NC 27705; (l) Florida Museum of Natural History, University of Florida, Gainesville, FL 32611; and (m) Computer Science and Engineering, Texas A&M University, College Station, TX 77843

Edited by David M. Hillis, The University of Texas at Austin, Austin, TX, and approved July 28, 2015 (received for review December 3, 2014)

Reconstructing the phylogenetic relationships that unite all lineages (the tree of life) is a grand challenge. The paucity of homologous character data across disparately related lineages currently renders direct phylogenetic inference untenable. To reconstruct a comprehensive tree of life, we therefore synthesized published phylogenies, together with taxonomic classifications for taxa never incorporated into a phylogeny. We present a draft tree containing 2.3 million tips: the Open Tree of Life. Realization of this tree required the assembly of two additional community resources: (i) a comprehensive global reference taxonomy and (ii) a database of published phylogenetic trees mapped to this taxonomy. Our open source framework facilitates community comment and contribution, enabling the tree to be continuously updated when new phylogenetic and taxonomic data become digitally available. Although data coverage and phylogenetic conflict across the Open Tree of Life illuminate gaps in both the underlying data available for phylogenetic reconstruction and the publication of trees as digital objects, the tree provides a compelling starting point for community contribution. This comprehensive tree will fuel fundamental research on the nature of biological diversity, ultimately providing up-to-date phylogenies for downstream applications in comparative biology, ecology, conservation biology, climate change, agriculture, and genomics.

phylogeny | taxonomy | tree of life | biodiversity | synthesis

Significance

Scientists have used gene sequences and morphological data to construct tens of thousands of evolutionary trees that describe the evolutionary history of animals, plants, and microbes. This study is the first, to our knowledge, to apply an efficient and automated process for assembling published trees into a complete tree of life. This tree and the underlying data are available to browse and download from the Internet, facilitating subsequent analyses that require evolutionary trees. The tree can be easily updated with newly published data. Our analysis of coverage not only reveals gaps in sampling and naming biodiversity but also further demonstrates that most published phylogenies are not available in digital formats that can be summarized into a tree of life.

Author contributions: C.E.H., S.A.S., J.G.B., R.C., K. A. Crandall, K.G., D.S.H., L.A.K., R.H.R., D.E.S., T.W., and K. A. Cranston designed research; C.E.H., S.A.S., J.G.B., R.C., L.M.C., K. A. Crandall, J.D., B.T.D., R.G., D.S.H., H.D.L., E.J.M., P.E.M., C.L.O., R.H.R., J.A.R., D.E.S., and K. A. Cranston performed research; C.E.H., S.A.S., J.F.A., J.G.B., R.C., L.M.C., E.J.M., P.E.M., R.H.R., and J.A.R. contributed new reagents/analytic tools; C.E.H., S.A.S., J.G.B., R.C., L.M.C., B.T.D., R.G., D.S.H., H.D.L., C.L.O., J.A.R., and D.E.S. analyzed data; C.E.H., S.A.S., J.F.A., J.G.B., R.C., K. A. Crandall, L.A.K., H.D.L., E.J.M., J.A.R., D.E.S., and K. A. Cranston wrote the paper; J.F.A. conducted user interface development; and K.G. provided graphic design.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

Freely available online through the PNAS open access option.

Data deposition: The Open Tree of Life taxonomy, the synthetic tree, and processed inputs are available from the Dryad database, dx.doi.org/10.5061/dryad.8j60q.

1 C.E.H. and S.A.S. contributed equally to this work.

2 To whom correspondence may be addressed. Email: [email protected] or [email protected].

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1423041112/-/DCSupplemental.

12764–12769 | PNAS | October 13, 2015 | vol. 112 | no. 41 | www.pnas.org/cgi/doi/10.1073/pnas.1423041112

The realization that all organisms on Earth are related by common descent (1) was one of the most profound insights in scientific history. The goal of reconstructing the tree of life is one of the most daunting challenges in biology. The scope of the problem is immense: there are ∼1.8 million named species, and most species have yet to be described (2–4). Despite decades of effort and thousands of phylogenetic studies on diverse clades, we lack a comprehensive tree of life, or even a summary of our current knowledge. One reason for this shortcoming is lack of data. GenBank contains DNA sequences for ∼411,000 species, only 22% of estimated named species. Although some gene regions (e.g., rbcL, 16S, COI) have been widely sequenced across some lineages, they are insufficient for resolving relationships across the entire tree (5). Most recognized species have never been included in a phylogenetic analysis because no appropriate molecular or morphological data have been collected.

There is extensive publication of new phylogenies, data, and inference methods, but little attention to synthesis. We therefore focus on constructing, to our knowledge, the first comprehensive tree of life through the integration of published phylogenies with taxonomic information. Phylogenies by systematists with expertise in particular taxa likely represent the best estimates of relationships for individual clades. By focusing on trees instead of raw data, we avoid issues of dataset assembly (6). However, most published phylogenies are available only as journal figures, rather than in electronic formats that can be integrated into databases and synthesis methods (7–9). Although there are efforts to digitize trees from figures (10), we focus instead on synthesis of published, digitally available phylogenies.

When source phylogenies are absent or sparsely sampled, taxonomic hierarchies provide structure and completeness (11, 12). Given the limits of data availability, synthesizing phylogeny and taxonomic classification is the only way to construct a tree of life that includes all recognized species. One obstacle has been the absence of a complete, phylogenetically informed taxonomy that spans traditional taxonomic codes (13). We therefore assembled a comprehensive global reference taxonomy via alignment and merging of multiple openly available taxonomic resources. The Open Tree Taxonomy (OTT) is open, extensible, and updatable, and reflects the overall phylogeny of life. With the continued updating of phylogenetic information from published studies, this framework is poised to update taxonomy in a phylogenetically informed manner far more rapidly than has occurred historically (see Fig. S1 for workflow).

We used recently developed graph methods (14) to synthesize a tree of life of over 2.3 million operational taxonomic units (OTUs) from the reference taxonomy and curated phylogenies. Taxonomies contribute to the structure only where we do not have phylogenetic trees. Advantages of graph methods include easy storage of topological conflict among underlying source trees in a single database, the construction of alternative synthetic trees, and the ability to continuously update the tree with new phylogenetic and/or taxonomic information. Importantly, our methodology also highlights the current state of knowledge for any given clade and reveals those portions of the tree that ...

Fig. 1. Phylogenies representing the synthetic tree. The depicted tree is limited to lineages containing at least 500 descendants. (A) Colors represent proportion of lineages represented in NCBI databases. (B) Colors represent the amount of diversity measured by number of descendant tips. (C) Dark lineages have at least one representative in an input source tree. Clades labeled in each panel: Metazoa, Archaea, Bacteria, Excavata, Amoebozoa, SAR, Archaeplastida, Fungi. Color-scale values shown: 0.0 to 1.0; 500 to 2,000,000.

... not improve the synthetic estimate. In each major clade, expert curators selected and ranked input trees for inclusion based on date of publication, underlying data, and methods of inference (see Materials and Methods for details). These rankings generally reflect community consensus about phylogenetic hypotheses. Second, including trees that merely confirm, or are subsets of, other analyses only increases computational difficulty without significantly improving the synthetic tree. For example, although we have many framework phylogenies spanning angiosperms, we did not include older trees where a newer tree extends the same underlying data. Third, inclusion of trees requires a minimum level of curation; where most OTU labels have been mapped to the taxonomic database, the root is correctly identified, and an ingroup clade has been identified.
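
The curation requirements and curator rankings described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the `SourceTree` record, its field names, the helper functions, and the 0.5 cutoff standing in for "most OTU labels" are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class SourceTree:
    """Curation metadata for one published tree (all field names are hypothetical)."""
    study_id: str
    n_otus: int            # total tip labels in the tree
    n_mapped: int          # tip labels mapped to the reference taxonomy
    root_confirmed: bool   # a curator has verified the root
    ingroup_marked: bool   # a curator has identified an ingroup clade
    rank: int              # curator-assigned priority, 1 = most preferred


def is_eligible(tree: SourceTree, min_mapped: float = 0.5) -> bool:
    """Apply the three curation criteria; min_mapped is an assumed cutoff for 'most'."""
    if tree.n_otus == 0:
        return False
    mapped_fraction = tree.n_mapped / tree.n_otus
    return mapped_fraction > min_mapped and tree.root_confirmed and tree.ingroup_marked


def select_inputs(trees: list[SourceTree]) -> list[SourceTree]:
    """Keep eligible trees, ordered from most to least preferred."""
    return sorted((t for t in trees if is_eligible(t)), key=lambda t: t.rank)


# Made-up candidates, purely to exercise the filter.
candidates = [
    SourceTree("study_A", 100, 95, root_confirmed=True, ingroup_marked=True, rank=2),
    SourceTree("study_B", 80, 30, root_confirmed=True, ingroup_marked=True, rank=1),
    SourceTree("study_C", 50, 50, root_confirmed=True, ingroup_marked=True, rank=1),
    SourceTree("study_D", 60, 60, root_confirmed=False, ingroup_marked=True, rank=3),
]
selected = [t.study_id for t in select_inputs(candidates)]
print(selected)
```

In this toy input, `study_B` is dropped because fewer than half of its labels are mapped and `study_D` because its root is unverified; the remaining trees are returned in rank order. A real workflow would draw these fields from the curated tree database rather than hard-coded records.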
