CUED Phd and Mphil Thesis Classes
Total Page:16
File Type:pdf, Size:1020Kb
Support for Taxonomic Data in Systematics Nadia Anwar A thesis submitted for the degree of Doctor of Philosophy to the Division of Ecology and Evolutionary Biology Faculty of Biomedical and Life Sciences University of Glasgow November 2008 Abstract The Systematics community works to increase our understanding of biological di- versity through identifying and classifying organisms and using phylogenies to un- derstand the relationships between those organisms. It has made great progress in the building of phylogenies and in the development of algorithms. However, it has insufficient provision for the preservation of research outcomes and making those widely accessible and queriable, and this is where database technologies can help. This thesis makes a contribution in the area of database usability, by addressing the query needs present in the community, as supported by the analysis of query logs. It formulates clearly the user requirements in the area of phylogeny and classification queries. It then reports on the use of warehousing techniques in the integration of data from many sources, to satisfy those requirements. It shows how to perform query expansion with synonyms and vernacular names, and how to implement hier- archical query expansion effectively. A detailed analysis of the improvements offered by those query expansion techniques is presented. This is supported by the expo- sition of the database techniques underlying this development, and of the user and programming interfaces (web services) which make this novel development available to both end-users and programs. This thesis is dedicated to my family. Acknowledgements This PhD was funded through a scholarship awarded by the University of Glasgow and supervised by R.D.M. Page and Ela Hunt. Contents 1 Introduction 1 1.1 Background . .1 1.2 Taxonomy . .2 1.3 Systematics . .4 1.4 TreeBASE . .5 1.5 Motivation . .6 1.6 Thesis Structure . .7 2 Background 9 2.1 Taxonomy . .9 2.1.1 Naming . 10 2.1.2 Scientific names . 10 2.1.3 Synonyms and Homonyms . 13 2.1.4 Classification . 14 2.1.5 PhyloCode . 15 2.2 Delivery of Taxonomic Knowledge . 16 2.3 Taxonomic Databases on the Web . 17 2.3.1 ITIS . 17 2.3.2 NCBI . 17 2.3.3 GRIN . 18 2.3.4 Sp2000 . 18 2.3.5 Online Checklists . 18 2.3.6 Other sources used to develop the TCl-Db data model . 19 2.3.6.1 Taxonomer . 19 iv CONTENTS 2.3.6.2 IPNI . 19 2.3.6.3 uBio . 20 2.3.7 Coordination across resources . 21 2.3.7.1 TDWG . 21 2.3.7.2 GBIF . 23 2.4 Taxonomy and systematics . 24 2.4.1 TreeBASE . 24 2.4.2 TreeBASE Database versus Information Retrieval . 25 2.4.3 Requirements of a \Taxonomically Aware" Database . 26 2.4.3.1 Support for Hierarchical Query Expansion . 26 2.4.3.2 Query Expansion with Synonyms and Vernaculars . 27 2.4.3.3 Consistency . 27 2.4.3.4 Data Coverage . 28 2.5 Integration approaches and database technologies . 30 2.5.1 Database technologies . 30 2.5.2 Data Integration . 34 2.5.3 Steps to Integration in Data Warehouses . 40 2.5.3.1 Schema Matching . 42 2.5.3.2 Data Transformation . 42 2.5.4 Data Warehouses - Update and Maintenance . 44 2.5.5 Warehousing Taxonomic Data . 45 2.6 Summary . 46 3 Database Design and Implementation 48 3.1 Introduction . 48 3.2 Requirements . 49 3.2.1 Data Retrieval . 49 3.3 Data Warehouse . 53 3.4 Modelling hierarchical data . 54 3.4.1 Materialized Paths . 55 3.4.2 Nested Sets . 55 3.4.3 Performance Benchmarking . 57 3.5 TCl-Db Structure . 59 v CONTENTS 3.5.1 Modelling rank and kingdom . 62 3.5.2 Modelling synonyms and vernaculars . 63 3.5.3 Trees and Nodes . 63 3.6 Application Tables and Query performance . 65 3.7 Procedures and Views . 67 3.8 Summary . 68 4 Populating the Database 69 4.1 Introduction . 69 4.2 ITIS . 71 4.2.1 Schema mapping and Integration . 71 4.3 NCBI . 79 4.3.1 Schema mapping and Integration . 79 4.4 GRIN . 85 4.4.1 Schema mapping and Integration . 85 4.5 Sp2000 . 90 4.5.1 Schema mapping and Integration . 90 4.6 MSOW . 92 4.6.1 Aves, early bird data . 93 4.7 Adding New Data Sources . 95 4.7.1 Data Availability . 96 4.8 Summary and Preview . 96 5 Database Utility and Web Tools 97 5.1 Overview . 97 5.2 Supported Queries . 97 5.2.1 Hierarchical queries . 98 5.3 Data analysis Queries . 99 5.3.1 Comparing classifications. 104 5.4 TCl-Db Web Tools . 107 5.4.1 TCl-Db Search Tool . 107 5.4.2 Visualisation Tools: Linked Names and Classifications with Webdot . 111 5.4.3 Supporting vernacular search terms . 113 5.4.4 A TreeBASE Wrapper . 118 vi CONTENTS 5.4.5 SOAP Tools . 122 5.5 Summary . 130 6 TCl-Db Reconciling data sources 131 6.1 Summary . 131 6.2 Background . 131 6.3 Aves Checklist - Data Transformation . 133 6.4 Results . 133 6.4.1 Classification Differences . 137 6.4.2 Spelling . 140 6.5 Data Cleaning and Presentation . 146 6.6 Conclusions . 148 7 Using TCl-Db to Improve the Querying of TreeBASE 149 7.1 Summary . ..