University of Southampton
FACULTY OF MEDICINE, HEALTH AND LIFE SCIENCES
School of Biological Sciences

Data Quality Concepts and Techniques Applied to Taxonomic Databases
by Eduardo Couto Dalcin

Thesis for the degree of Doctor of Philosophy
February 2005

Abstract

The thesis investigates the application of concepts and techniques of data quality in taxonomic databases to enhance the quality of information services and systems in taxonomy. Taxonomic data are arranged and introduced in Taxonomic Data Domains in order to establish a standard and a working framework to support the proposed Taxonomic Data Quality Dimensions, as a specialised application of conventional Data Quality Dimensions in the Taxonomic Data Quality Domains. The thesis presents a discussion about improving data quality in taxonomic databases, considering conventional Data Cleansing techniques and applying generic data content error patterns to taxonomic data. Techniques of taxonomic error detection are explored, with special attention to scientific name spelling errors. The spelling error problem is scrutinized through spelling error detecting techniques and algorithms. Spelling error detection algorithms are described and analysed. In order to evaluate the applicability and efficiency of different spelling error detection algorithms, a suite of experimental spelling error detection tools was developed and a set of experiments was performed, using a sample of five different taxonomic databases. The results of the experiments are analysed from the algorithm and from the database point of view.
Database quality assessment procedures and metrics are discussed in the context of taxonomic databases and the previously introduced concepts of Taxonomic Data Domains and Taxonomic Data Quality Dimensions. Four questions related to Taxonomic Database Quality are discussed, followed by conclusions and recommendations involving information system design and implementation and the processes involved in taxonomic data management and information flow.

CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF APPENDICES
LIST OF ACRONYMS
1. GENERAL INTRODUCTION
  1.1. PROBLEM ELUCIDATION AND DELIMITATION
  1.2. AIMS
  1.3. THESIS PLAN
2. TAXONOMIC DATABASES
  2.1. OVERVIEW
3. TAXONOMIC DATA DOMAINS
  3.1. NOMENCLATURAL DATA DOMAIN
  3.2. CLASSIFICATION DATA DOMAIN
  3.3. FIELD DATA DOMAIN
    3.3.1. Spatial Data Sub-domain
    3.3.2. Curatorial Data Sub-domain
    3.3.3. Specimen Descriptive Data Sub-domain
  3.4. SPECIES DESCRIPTIVE DATA DOMAIN
4. DATA QUALITY
  4.1. DATA QUALITY DEFINITIONS AND PRINCIPLES
  4.2. DATA QUALITY DIMENSIONS
5. TAXONOMIC DATA QUALITY DIMENSIONS
  5.1. ACCURACY
  5.2. BELIEVABILITY
  5.3. COMPLETENESS
  5.4. CONSISTENCY
  5.5. FLEXIBILITY
  5.6. RELEVANCE
  5.7. TIMELINESS
6. IMPROVING DATA ACCURACY IN TAXONOMIC DATABASES
  6.1. DATA CLEANSING
  6.2. TAXONOMIC ERROR PATTERNS
    6.2.1. Domain Value Redundancy
    6.2.2. Missing Data Values
    6.2.3. Incorrect Data Values
    6.2.4. Nonatomic Data Values
    6.2.5. Domain Schizophrenia
    6.2.6. Duplicate Occurrences
    6.2.7. Inconsistent Data Values
    6.2.8. Information Quality Contamination
  6.3. TAXONOMIC ERROR DETECTION APPROACHES AND TECHNIQUES
    6.3.1. Nomenclatural structural errors
      Levels of validity
        Structural level
        Nomenclatural level
    6.3.2. Database domain exception
      Experiment in spatial domain exception detection
    6.3.3. Scientific name spelling errors
7. EXPERIMENTS IN DETECTING SPELLING ERRORS IN SCIENTIFIC NAMES
  7.1. INTRODUCTION
  7.2. PURPOSE OF THE TESTS
  7.3. THE DATABASES
    7.3.1. The selection