Relational Database Design and Implementation for Biodiversity Informatics

PhyloInformatics 7: 1-66 - 2005 Relational Database Design and Implementation for Biodiversity Informatics Paul J. Morris The Academy of Natural Sciences 1900 Ben Franklin Parkway, Philadelphia, PA 19103 USA Received: 28 October 2004 - Accepted: 19 January 2005 Abstract The complexity of natural history collection information and similar information within the scope of biodiversity informatics poses significant challenges for effective long term stewardship of that information in electronic form. This paper discusses the principles of good relational database design, how to apply those principles in the practical implementation of databases, and examines how good database design is essential for long term stewardship of biodiversity information. Good design and implementation principles are illustrated with examples from the realm of biodiversity information, including an examination of the costs and benefits of different ways of storing hierarchical information in relational databases. This paper also discusses typical problems present in legacy data, how they are characteristic of efforts to handle complex information in simple databases, and methods for handling those data during data migration. Introduction history collection information into database management systems of increasing The data associated with natural history sophistication and complexity. collection materials are inherently complex. Management of these data in paper form In this document, I discuss some of the has produced a variety of documents such issues involved in handling complex as catalogs, specimen labels, accession biodiversity information, approaches to the books, stations books, map files, field note stewardship of such information in electronic files, and card indices. The simple form, and some of the tradeoffs between appearance of the data found in any one of different approaches. I focus on the very these documents (such as the columns for well understood concepts of relational identification, collection locality, date database design and implementation. 1 collected, and donor in a handwritten Relational databases have a strong catalog ledger book) mask the inherent (mathematical) theoretical foundation complexity of the information. The 1 appearance of simplicity overlying highly Object theory offers the possibility of handling much complex information provides significant of the complexity of biodiversity information in object oriented databases in a much more effective manner challenges for the management of natural than in relational databases, but object oriented and history collection information (and other object-relational database software is much less systematic and biodiversity information) in mature and much less standard than relational electronic form. These challenges include database software. Data stored in a relational DBMS management of legacy data produced are currently much less likely to become trapped in a dead end with no possibility of support than data in an during the history of capture of natural object oriented DBMS. 1 PhyloInformatics 7: 2-66 - 2005 (Codd, 1970; Chen, 1976), and a wide usually driven by changes imposed upon us range of database software products by the rapidly changing landscape of available for implementing relational operating systems and software. databases. Maintaining a long obsolete computer running a long unsupported operating system as the only means we have to work with data that reside in a long unsupported database program with a custom front end written in a language that nobody writes code for anymore is not a desirable situation. Rewriting an entire collections database system from scratch every few years is also not a desirable situation. The computer science folks who think about Figure 1. Typical paths followed by biodiversity databases have developed a conceptual information. The cylinder represents storage of approach to avoiding getting stuck in such information in electronic form in a database. unpleasant situations – the database life cycle (Elmasri and Navathe, 1994). The The effective management of biodiversity database life cycle recognizes that database information involves many competing management systems change over time and priorities (Figure 1). The most important that accumulated data and user interfaces priorities include long term data for accessing those data need to be stewardship, efficient data capture (e.g. migrated into new systems over time. Beccaloni et al., 2003), creating high quality Inherent in the database life cycle is the information, and effective use of limited insight that steps taken in the process of resources. Biodiversity information storage developing a database substantially impact systems are usually created and maintained the ease of future migrations. in a setting of limited resources. The most appropriate design for a database to support A textbook list (e.g. Connoly et al., 1996) of long term stewardship of biodiversity stages in the database life cycle runs information may not be a complex highly something like this: Plan, design, normalized database well fitted to the implement, load legacy data, test, complexity of the information, but rather operational maintenance, repeat. In slightly may be a simpler design that focuses on the more detail, these steps are: most important information. This is not to 1. Plan (planning, analysis, requirements say that database design is not important. collection). Good database design is vitally important 2. Design (Conceptual database design, for stewardship of biodiversity information. leading to information model, physical In the context of limited resources, good database design [including system design includes a careful focus on what architecture], user interface design). information is most important, allowing 3. Implement (Database implementation, programming and database administration user interface implementation). to best support that information. 4. Load legacy data (Clean legacy data, transform legacy data, load legacy Database Life Cycle data). 5. Test (test implementation). As natural history collections data have 6. Put the database into production use been captured from paper sources (such as and perform operational maintenance. century old handwritten ledgers) and have 7. Repeat this cycle (probably every ten accumulated in electronic databases, the years or so). natural history museum community has observed that electronic data need much Being a visual animal, I have drawn a more upkeep than paper records (e.g. diagram to represent the database life cycle National Research Council, 2002 p.62-63). (Figure 2). Our expectation of databases Every few years we find that we need to should not be that we capture a large move our electronic data to some new quantity of data and are done, but rather database system. These migrations are that we will need to cycle those data through 2 PhyloInformatics 7: 3-66 - 2005 Figure 2. The Database Life Cycle the stages of the database life cycle many of collections information. In some cases, a times. database for a collection running on a single workstation accessed by a single user In this paper, I will focus on a few parts of provides a perfectly adequate solution for the database life cycle: the conceptual and the needs of a collection, provided that the logical design of a database, physical workstation is treated as a server with an design, implementation of the database uninterruptible power supply, backup design, implementation of the user interface devices and other means to maintain the for the database, and some issues for the integrity of the database. Any computer migration of data from an existing legacy running a database should be treated as a database to a new design. I will provide server, with all the supporting infrastructure examples from the context of natural history not needed for the average workstation. In collections information. Plan ahead. Good other cases, multiple users are capturing design involves not just solving the task at and retrieving data at once (either locally or hand, but planning for long term globally), and a database system capable of stewardship of your data. running on a server and being accessed by multiple clients over a network is necessary Levels and architecture to support the needs of a collection or project. A requirements analysis for a database system often considers the network It is, however, more helpful for an architecture of the system. The difference understanding of database design to think between software that runs on a single about the software architecture. That is, to workstation and software that runs on a think of the functional layers involved in a server and is accessed by clients across a database system. At the bottom level is the network is a familiar concept to most users DBMS (database management system [see 3 PhyloInformatics 7: 4-66 - 2005 glossary, p.64]), the software that runs the both building better database systems and database and stores the data (layered for understanding some of the problems that below this is the operating system and its exist in legacy data, especially those filesystem, but we can ignore these for entered into older database systems. now). Layered above the DBMS is your Museum databases that began actual database table or schema layer. development in the 1970s and early 1980s Above this may be various code and prior to the proliferation of effective software network transport layers, and finally, at the for building relational databases

Relational Database Design and Implementation for Biodiversity Informatics

Form Follows Function Model-Driven Engineering for Clinical Trials

Jsfa [email protected]

XXI. Database Design

Database Performance Study

Data Quality Requirements Analysis and Modeling December 1992 TDQM-92-03 Richard Y

Chapter 9 – Designing the Database

Database-Design-1375466250.Pdf

Designing High-Performance Data Structures for the Couchbase Data Platform

Database Design Document

A Knowledge-Based System for Physical Database Design

Development of Concept Plan for the Route 1 North Area

Applying Tssl in Database Schema Modeling: Visualizing the Syntax of Gradual Transitions