Tilburg University

Accessing natural history
van Erp, M.G.J.

Publication date: 2010

Document Version Publisher's PDF, also known as Version of record

Link to publication in Tilburg University Research Portal

Citation for published version (APA): van Erp, M. G. J. (2010). Accessing natural history: Discoveries in data cleaning, structuring, and retrieval. TICC Dissertation Series 13.

Accessing Natural History: Discoveries in Data Cleaning, Structuring, and Retrieval

Dissertation

to obtain the degree of doctor at Tilburg University, on the authority of the rector magnificus, prof. dr. Ph. Eijlander, to be defended in public before a committee appointed by the doctorate board, in the auditorium of the University on Wednesday 30 June 2010 at 14:15

by

Maria Godefrida Jacoba van Erp

born on 18 November 1982 in Breda

Promotor: Prof. dr. A. P. J. van den Bosch
Copromotor: Dr. P. K. Lendvai

Promotiecommissie: Prof. dr. H. J. van den Herik
Prof. dr. W. M. P. Daelemans
Dr. R. W. R. J. Dekker
Dr. E. Hovy

The research reported in this thesis was funded by the Netherlands Organization for Scientific Research (NWO) in the project Mining for Information in Texts from the Cultural Heritage (MITCH), grant number 640.002.403. The MITCH project is part of the Continuous Access to Cultural Heritage Research (CATCH) Programme.

SIKS Dissertation Series No. 2010-30 The research reported in this thesis was carried out under the auspices of SIKS, the Dutch Research School for Information and Knowledge Systems.

TiCC Dissertation Series No. 13
ISBN 978-90-8559-027-9
© Marieke van Erp 2010
Cover photo © NCB Naturalis
All rights reserved. No part of this publication may be reproduced in any form by any electronic or mechanical means (including photocopying, recording or information storage and retrieval) without permission from the author, except for reading and browsing via the World Wide Web. Users are not permitted to mount this file on any network servers.

For grandpa

Preface

In the spring of 2005, while I was finishing up my Master's degree at Tilburg University, I knew I wasn't quite done with academia and wanted to pursue a Ph.D. Unexpectedly (to me at least) I landed a Ph.D. position at Tilburg University in collaboration with the National Museum of Natural History Naturalis in Leiden. So, in the autumn of 2005, I moved to Leiden: to new adventures, and a fair amount of time on the train to Tilburg for the weekly meetings with my supervisor and promotor Antal van den Bosch and to hang out and learn from the members of the ILK group. I am very grateful to Antal for giving me the chance to do this project and start exploring the wondrous world of interdisciplinary research in such a 'gezellige' environment and to the ILK members for making it such a fabulous place. At my other office, Naturalis, I want to thank the researchers and database managers for educating me and my fellow MITCH colleagues about the basics of natural history research and showing us around in their databases, in particular Pim Arntzen, Eric van Nieukerken and Dirk Houtgraaf during the early days of MITCH, and later on Marian van der Meij and René Dekker. Somewhere halfway during my project, Antal and I realised that my work was moving away from core ILK business towards knowledge representation and ontologies. Antal thought this would be an excellent opportunity to gain knowledge elsewhere and see how another research group runs from the inside. He also knew a perfect mentor to help me explore these topics: dr. Eduard Hovy at the

Information Sciences Institute in Marina del Rey, California, USA. Again I was given an amazing opportunity, as Ed was willing to have me at ISI for a couple of months. So in September 2007, I moved to sunny California. This (ultimately) 7-month stint meant a huge turnaround for my research (and my private life, but more on that later) as ISI's bustling community provided me with new inspiration and research directions, as well as giving me an incredibly good time exploring Los Angeles bars and clubs, yoga, the beach, and American traditions. Thank you Ed, for giving me this opportunity, guidance, and for finding the time to be on my thesis committee. I also wish to thank the other members of my thesis committee for their advice and comments on my text: prof. dr. Jaap van den Herik, prof. dr. Walter Daelemans, and dr. René Dekker. Work hard, play hard. Luckily, my friends and family provided enough distractions from deadlines and the frustrations of working with computers in the form of concerts, surf trips, dance lessons, guitar sessions, dinner parties, trips to North End, and British comedy. Special thanks to Judith and Arthur for keeping me awake far past my bedtime in Leiden and to Steve for being a great programmer and friend. Without you this thesis would have looked very different and my guitar skills would have been even worse than they are now. I would never have gotten anywhere near a Ph.D. if it weren't for my family. I have amazing parents who raised me to be curious and open to the world, and who have supported me with every step. It was also a true joy to grow up with my brothers Frank and Hans, and my sister Sara. I hope that you all forgive me for moving away from Etten-Leur, but there's always space for you to crash in Amsterdam. Fresh inspiration for my research wasn't the only thing I found in LA. I also found Paul, my soon-to-be husband. Thank you for your advice, distraction, bread-crusted Ahi tuna, coffee breaks, trips to exciting places far and near, music, smiles, and love.

On to new adventures!

Marieke van Erp
Amsterdam, 11 May 2010

Contents

Preface

Contents

1 Introduction
1.1 Research Motivation
1.2 Problem Statement and Research Questions
1.3 Research Methodology
1.4 Thesis Outline

2 Background and Resources
2.1 Natural History
2.1.1 Biogeography
2.1.2 Systematics
2.1.3 Biodiversity Informatics
2.2 The Naturalis Reptiles and Amphibians Data Sets
2.2.1 The R&A Database
2.2.2 Field Logbooks and Registers
2.3 Additional Resources
2.3.1 GeoNames
2.3.2 Taxonomic Resources
2.4 Discussion

3 Preparatory Work
3.1 Automatic R&A Database Population
3.1.1 Overlap between Field Logbooks and Registers and R&A Database
3.1.2 Annotated Training and Test Data
3.1.3 Experiments and Results
3.2 A Manually Constructed Ontology
3.2.1 Ontology Basics
3.2.2 Resources Involved
3.2.3 Ontology Building Principles
3.2.4 Ontology Construction Methodology
3.2.5 Description of the Ontology
3.3 Chapter Summary

4 Data Cleaning
4.1 The Essence of Data Cleaning
4.2 The Importance of Automatic Data Cleaning
4.2.1 Data Cleaning Steps
4.2.2 Types of Errors
4.2.3 R&A Database Error Analysis
4.3 Normalisation
4.4 Data-driven Data Cleaning
4.4.1 Timpute
4.4.2 Experiments and Results
4.5 Ontology-driven Data Cleaning
4.5.1 Related Work
4.5.2 Validato
4.5.3 Experiments and Results
4.6 Discussion and Conclusions

5 Automatic Data Structuring
5.1 Automatic Ontology Construction
5.2 Twibio
5.2.1 Wikipedia
5.2.2 Related Work
5.2.3 Data Preparation
5.2.4 Twibio System Setup
5.2.5 Evaluation Methodology
5.2.6 Twibio Results
5.3 Ontology Comparison
5.3.1 Manual Ontology Comparison
5.3.2 Automatic Ontology Comparison
5.3.3 Manual vs. Twibio Ontology
5.4 Discussion and Conclusions

6 Data Retrieval
6.1 Information Retrieval
6.2 The Naturalis Birds Databases
6.3 Queries
6.3.1 Reptiles and Amphibians
6.3.2 Birds
6.4 Mira Modules
6.4.1 Query Interpretation
6.4.2 Query Expansion
6.4.3 Result Ranking
6.5 Evaluation Metrics
6.6 Experiments and Results
6.6.1 Results for Reptiles and Amphibians
6.6.2 Results for Birds
6.7 Discussion and Conclusions

7 Conclusions
7.1 Thesis Contributions
7.2 Answers to Research Questions
7.3 Answer to Problem Statement
7.4 Future Work

References

A The Reptiles and Amphibians Database

B The Birds Database

Summary

Samenvatting

List of Publications

Curriculum Vitae

SIKS Dissertation Series

TiCC Dissertation Series

1 Introduction

Data is a precious thing and will last longer than the systems themselves.

Tim Berners-Lee, Interview with the British Computer Society, 2006

An immense treasure of historical information is preserved in cultural institutions such as archives, museums, and libraries. Yet, it is often difficult to access. This thesis is about improving information access to the data in cultural heritage collections. Some cultural heritage institutions have been collecting historically relevant artefacts for several centuries and, as a result of the long existence of such collections, methods for storage and access of collection information have not always been kept up to date. This impairs information access. When collection information is difficult to access, users of the collection must spend a relatively large amount of time tracking down information before they can begin to use it. Storing information in a digital resource is a first step to improving information access. When information storage is digitised, it can be enriched: all information about one object can be linked so that in a single request all information on the object is retrieved, saving users time gathering information. This thesis is built around three main themes that each lead to more structure for, and knowledge in, the digital information collections at the Dutch National Museum of Natural History Naturalis1, which provided the case study for the research described. The first theme is data cleaning. The second theme is data structuring. The third theme is retrieving the information contained in the data.

1 http://www.naturalis.nl/

The three themes build on each other: it would not make sense to start using dirty data (i.e., data that contains errors) before one has attempted to correct it and to enrich it with structure that aids its retrieval. In the remainder of this chapter, the motivation for this research is given in Section 1.1. The main issues concerning information access in the cultural heritage domain, together with the problem statement and research questions, follow in Section 1.2. In Section 1.3, the research methodology used in this work is explained. This chapter is concluded by an outline of the thesis in Section 1.4.

1.1 Research Motivation

The work carried out for this thesis is motivated by the need from museums, archives, and libraries to have continuous access to their collections. Continuous access to an information collection means that, regardless of physical boundaries (such as storage conditions, closing hours, etc.), users now and in the future can find what they are looking for in the collection, either within a few mouse clicks or by entering a simple query in the user interface created to access the collection. The above motivation is the driving force behind the work that is presented in the thesis. In the remainder of this section, a brief introduction of the collaborating institution is given, as well as the starting point of the research that has been carried out to facilitate the transition from an analogue collection of information resources to a cleaned and enriched digitised collection information resource. The data used to test the approaches presented in this thesis is provided by the Dutch National Museum of Natural History Naturalis. Natural history is not always considered part of the definition of cultural heritage, as the goal of research on natural history collections is primarily to further biological research. However, natural history collections belong to the domain of scientific heritage, a subdomain of cultural heritage. These collections are important from a heritage perspective as they provide insights into the evolution of biology research, as well as insights into general history, because many specimens within natural history collections were gathered in former colonies or in other culturally and historically relevant areas and circumstances. The data provided by Naturalis is available in the form of structured textual databases and semi-structured text. The details of the data sets are given in Chapter 2. For the work carried out here, researchers at Naturalis provided input on desired improvements in information access. The process of transforming an analogue resource to an enriched digital resource for advanced access consists of three steps, each providing a layer of greater utility of the data. The museum harbours more than twelve million objects consisting of animal, fossil, and rock specimens, as well as samples of specimens, each with their own provenance. As the collection dates back to the 18th century, the majority of the information on the specimens is still in its original paper form. The first step in the transformation from an analogue resource to an enriched digital resource is the conversion from an analogue to a digital representation. The second step is directing the information in the digital representation into a format that is more suited for digital access, such as a database. Researchers at Naturalis have begun to enter data into databases straight from their analogue sources, effectively carrying out the conversion and reformatting task in a single action. However, manual data entry is a time-consuming and monotonous task. Machines are particularly suited to carry out monotonous tasks; therefore, this is typically a task in which machines could support experts to enable swift and correct data entry. In order to do so, the data entry process is split up again into two steps: (1) the conversion of an analogue resource to a digital representation and (2) the organisation of the information in the digital representation into a database.
Current state-of-the-art technology does not permit automatic conversion of handwritten text by multiple writers to a digital text format, so this step was carried out manually for this project2. However, as the human transcribers only needed to type in the text and not interpret it, the conversion step is much faster than interpreting data and formatting it for the right database field3. A machine may then be employed to carry out the second step of the resource transformation task, namely automatically populating a database from the digital text. For this project, the transformation task was carried out by employing a machine learning algorithm using a small number of manually annotated examples. The experiments and results of this task are described in Chapter 3. This way, the human expert can focus on interpretation of the data and on correcting the machine where it fails.

Two subsequent steps provide layers that offer flexibility in representing the information and integration with other resources. (1) Information in an analogue resource such as a book is tied to the format of the book (i.e., the ordering of the chapters), whereas information in a digital resource can be presented differently to different users (e.g., a more restricted view of the data for some users compared to others). (2) Digital resources can also be linked or integrated with other resources. For example, when a database containing geographic information is linked to a dedicated geographic resource containing synonymous geographic terms or terms in different languages, the database is transformed into a richer and more accessible resource.

2 Advances in the field of Optical Character Recognition may in the future enable machines to carry out this conversion task.
3 As an added bonus, it can be done by laypersons, lifting an immense strain off of the domain experts' time.

1.2 Problem Statement and Research Questions

For a successful transformation from an analogue resource into an enriched digital resource at least the following three issues need to be resolved.

1. Data quality The presence of errors in data hampers data retrieval. From erroneously retrieved results an incorrect inference or conclusion may follow. For example, data in natural history institutions are used to assess whether a species is threatened. If this assessment is based on incorrect or incomplete data, the conclusions inferred from the data may be incorrect and thus a species could falsely be assumed not to be threatened. This conclusion could mean that no action is undertaken to protect the species, in the worst case causing it to go extinct.

2. Integration When data resources are not structured to enable integration, it is difficult for institutions to share their information with each other, limiting possibilities for more thorough and more complete information access. For example, when data on a particular species from all natural history institutions is used in an analysis, it is likely that the analysis provides more complete and possibly more conclusive results than when only data from one institution is included. It is also possible to link databases to taxonomic resources, in order to update the database automatically when the taxonomy changes. For this it is essential that data resources are integrated.

3. Access When data resources are not easily accessible, time is lost in tracking down vital pieces of information. For example, researchers at Naturalis have stated that precious time is lost locating books from which

information for their research is needed. Also, for users from outside the physical building, such as researchers on expedition or from other institutions, it would be beneficial to be able to consult the institute's resources without having to go there or burden a person who is present with looking up the desired information. Digital resources can easily be copied and updated, providing the possibility of access to the information for many users at the same time, unlike their analogue counterparts. Naturally, access should be controlled. Some data is sensitive and should thus not be shared carelessly. An example of this is geographic information on occurrences of rare species of ferns. Making such information public may cause poachers to visit that location and take the remaining specimens.

Resolving these three issues ensures that information will not be easily lost, as it becomes less fragile. It will also help users do their work, as less time is spent on searching for information and more time can be spent on interpretation. To help resolve the issues, this thesis uses techniques from the Natural Language Processing (NLP) field. Most NLP techniques are developed for unstructured, or free, text, such as found in newspaper articles and journals. However, textual information comes in different forms. The data sets used for this work are made up of two other types of text: structured and semi-structured. The first, structured textual data, is found in databases, where the type of information is known by the database column it occurs in. The second, semi-structured text, is found in field logbooks and registers, where it is known which types of information an entry can contain but not which part of the text contains which type of information. The types of information contained in these resources are, for example, Latin names (indicating to what species a specimen belongs), geographic names (indicating a location that is connected with the specimen), dates (positioning the specimen in time), and descriptions of circumstances of events linked to the specimen. Analysing and mining free text, such as that found in journal articles, of which it is not known in advance exactly what types of information it contains, is outside the scope of this thesis. The approaches that contribute to improving information accessibility presented in this thesis can be classified as either soft-reasoning or hard-reasoning approaches. Soft-reasoning (or data-driven) approaches rely on implicit information that is contained in the data, such as conditional dependencies. Hard-reasoning (or knowledge-driven) approaches rely on explicit information about the domain, such as manually constructed rules. The three issues presented above are progressively addressed in Chapters 3 to 6. Resolving these issues is key to answering the problem statement of this thesis, which is as follows.

Problem Statement: To what extent can manual and automatic soft- and hard-reasoning approaches improve the data quality, structure, and access to information in an analogue cultural heritage collection of natural history?

In order to answer the problem statement, three research questions (RQs) are investigated in this thesis. RQ1 consists of two parts.

RQ1a: Can data-driven and knowledge-driven approaches provide improvements to the data quality of structured textual resources describing collection objects?
RQ1b: To what extent are the data-driven and knowledge-driven approaches complementary?

RQ2: Do automatic methods for building ontologies provide structure for the data that is not achieved by manual ontology building?

RQ3: Can access to information be aided through a retrieval system enriched with domain knowledge?

To answer the research questions, methods have been developed that resulted in the three thesis contributions (TCs) that are listed below, along with the chapter they are presented in.

TC1: an automatic ontology-driven error detection and correction method for structured data (Chapter 4).

TC2: an instance-driven ontology construction method (Chapter 5).

TC3: an ontology-driven information retrieval system that automatically expands queries and ranks the results to improve access (Chapter 6).

Each of the thesis contributions addresses a different aspect of the process of data accessibility improvement from a raw text resource to a digital resource for continuous access. In the following section, the general research methodology is outlined.

1.3 Research Methodology

The research methodology used in this thesis is empirical. For each research question, the methodology consists of (1) collecting the relevant literature about the task at hand, (2) analysing the findings, and (3) reviewing the most promising techniques mentioned in the literature for their suitability for the Naturalis data. This is followed by (4) testing their application and (5) analysing and evaluating the results of the experiments according to standard practice for each of the different tasks. In every chapter, the evaluation metrics used are detailed. Where possible, a comparison between different techniques is made.

1.4 Thesis Outline

The remainder of this thesis is structured as follows. In the first part of this thesis (Chapters 2 and 3), background about the domain and data and the groundwork done in order to facilitate the experiments in the rest of the thesis are described. The second part of the thesis (Chapters 4, 5, and 6) presents the main thesis contributions and the answers to the research questions. Chapter 7 concludes the thesis. In the remainder of this section, the contents of each chapter are detailed. In Chapter 2, the field of natural history is introduced and some necessary background is given. Moreover, the resources involved in this work are described. The chapter includes data that are used in more than one chapter of this thesis; additional chapter-specific data are discussed in the chapters they are used in. In Chapter 3, a manually constructed ontology for the natural history domain is presented, as well as experiments and results on the automatic population of a database from semi-structured text. The work done in Chapter 3 is not part of the major thesis contributions but describes tasks necessary to carry out the experiments for the main thesis contributions. In Chapter 4, general issues regarding data that contains errors are discussed, followed by a discussion of data quality issues particular to the field of natural history. Two methods for the cleanup of databases are presented. The first method, named Timpute, is a statistical data-driven method. The second method, named Validato, is a hard-reasoning method that involves the construction of rules as well as domain knowledge from an ontology and external resources. Analyses of the impact on data quality of Timpute and Validato provide the answer to

RQ1. In Chapter 5, an automatic ontology construction method is presented. Ontologies may enhance a data resource by providing a meta structure consisting of domain knowledge. However, manual construction of an ontology requires significant time and attention from domain experts. Automatic ontology construction methods can remedy this. The approach presented in Chapter 5, called Twibio, utilises implicit domain information present in the various data resources that describe objects in the domain as well as explicit domain information described in the online encyclopaedia Wikipedia. In the ontology construction process, this implicit information is made explicit by linking the data resources to Wikipedia and extracting relations between different data points from the resources. Comparison to the ontology constructed in Chapter 3 shows that the automatically constructed ontology provides a different view on the domain, and presents an argument for the co-existence of different ontological representations of one domain. The work presented in Chapter 5 provides the answer to RQ2.

In Chapter 6, improvements for data access are presented. Here, a system, named Mira, is described in which the performance of retrieval of database records is improved by applying query interpretation, query expansion, and result ranking. Analyses of the results of the Mira experiments yield the answer to RQ3. Chapter 7 summarises to what extent the problem statement and each of the research questions are answered and provides conclusions and pointers for future work.

2 Background and Resources

In all things of nature there is something of the marvellous.

Aristotle, On the Parts of Animals, ca. 350 b.c.

In this chapter, the natural history domain and the data sets that are used throughout the thesis are described. Section 2.1 provides a brief overview of the natural history domain. In Section 2.2, the reptiles and amphibians data sets provided by Naturalis are detailed. External resources used for this thesis are described in Section 2.3. The chapter is concluded by a discussion of some limitations and future work on the Naturalis collections in Section 2.4.

2.1 Natural History

Natural history is concerned with the study of the earth and living things on it. In the narrowest sense it contains the subfields botany and zoology; in the broader sense, it also includes geology, palaeontology, climatology, ecology, and biochemistry. The main goal of natural history is to gain a deeper understanding of the natural world by describing its structure (i.e., how the natural world is organised into different species and how these evolved over time) and characteristics of its various species such as diet, reproduction, and social behaviour.

At the Naturalis research department, over twelve million animal and geological specimens are kept that have been gathered during more than two centuries of research expeditions. The zoologists, entomologists, palaeontologists, and geologists at Naturalis have been focussing mainly on collection-based research, describing the morphology (physical features), anatomy (functioning of the organs), systematics (investigating evolutionary history), and biogeography (distribution of biodiversity over space and time) of the species present in the collection. To facilitate collection-based research, it is important that the provenance of each specimen is known and accessible. For biogeographical research, it is important to know where a specimen was collected and when. Therefore, to this research field within natural history, the written and digital resources that store such facts are as valuable as the object itself. These resources describe the history of the object in three stages. The first stage describes under what circumstances the object was collected (by whom? where? when? what were the climatological circumstances? in case of an animal specimen, was it still alive or already dead?). The second stage is concerned with the object's introduction into the museum collection (when did it arrive? how is it preserved? what is its position in the taxonomy? who determined this? on what shelf is it stored?). In the third stage, everything that happened to the specimen after it came into the museum collection is recorded (was it lent to another institution? is there photographic material? are there publications written about the specimen? who entered data about the specimen into a database?).

The collection information was hitherto generally only available on paper, which has the limitations of not being easily reproducible, being only accessible to one person at a time, and being fragile. Digitisation will ensure that the information contained in the paper resources can be more easily reproduced and thus preserved. Through more sophisticated information management, accessibility of the collection information can be increased. Improved access will help researchers find information about objects more quickly and possibly even help them discover new information in the data that would otherwise remain hidden. Some examples will be given of biodiversity research that has been made possible through increased digital access to natural history collections.

The digitisation efforts carried out in connection with this thesis will primarily aid researchers in the fields of systematics and biogeography, as these chiefly require access to the information about the specimen and its collection rather than to the specimen itself. However, morphological and anatomical research may also be aided by the outcomes of the work done for this thesis.

In the following subsections, an introduction to biogeography (Subsection 2.1.1), systematics (Subsection 2.1.2), and biodiversity informatics (Subsection 2.1.3) is given. These subsections serve as background information to the domain and its data.

2.1.1 Biogeography

Biogeography is concerned with the study of the distribution of biodiversity over space and time. Traditionally, the field studied the geographic distribution of species over time. Since the late 1980s, biogeographers also try to answer questions regarding physiological, morphological, and genetic variation among individual specimens and populations, as well as differences in the diversity and composition of plant and animal life within localities. A species' or family's distribution pattern is outlined through information on the collection of specimens by biologists, historical facts (e.g., continental drift and glaciation), and geographic constraints (e.g., rivers and mountains which can prevent a species from dispersing). Genetic and statistical analysis methods aid researchers in answering questions such as 'Why are there more species of butterfly in Austria than in Norway?' and in discovering the underlying principles that account for such observations. These principles can, for instance, be put to use in predicting the influence of human interventions on animal and plant life in a certain location. Biogeography research often focusses on one species or family of animals or plants. The classification of organisms is studied in the field of systematics, which is introduced next.

2.1.2 Systematics

Systematics is defined by Michener et al. (1970) as follows:

“Systematic biology (hereafter called simply systematics) is the field that (a) provides scientific names for organisms, (b) describes them, (c) preserves collections of them, (d) provides classifications for the organisms, keys for their identification, and data on their distributions, (e) investigates their evolutionary histories, and (f) considers their environmental adaptations. This is a field with a long history that in recent years has experienced a notable renaissance, principally

with respect to theoretical content. Part of the theoretical material has to do with evolutionary areas (topics e and f above), the rest relates especially to the problem of classification. Taxonomy is that part of systematics concerned with topics (a) to (d) above.”

The data used in this thesis is mainly of a taxonomic nature; taxonomy is a subfield of systematics. Taxonomy is the practice and science of classification. Originally, taxonomy only concerned the biological domain, but currently taxonomy can also refer to any type of categorisation, and thus the term 'biological taxonomy' is used to refer to a taxonomy in the biology domain. Biological taxonomy is based on the classification system that the Swedish botanist, physician and zoologist Carl Linnaeus (1707-1778) developed and described in Systema Naturae in 1735 and in subsequent material. For the sake of brevity, throughout this thesis the term taxonomy will be used to denote the biological taxonomy. In the traditional Linnaean taxonomy, seven major levels are discerned that are each more specific than the previous one1; a small sketch encoding this hierarchy is given after the list below.

1. Kingdom (e.g., Animalia, meaning organisms with eukaryotic cells having a cell membrane but lacking cell walls)

2. Phylum (for animals), Division (for plants and fungi) (e.g., Chordata, meaning animals with a notochord, a dorsal nerve chord, and pharyngeal gill slits)

3. Class (e.g., Aves, meaning birds)

4. Order (e.g., Passeriformes, the order of perching birds or songbirds)

5. Family (e.g., Passeridae, small passerine birds)

6. Genus (e.g., Passer, old world sparrows)

7. Species (e.g., domesticus, in combination with Passer, domesticus denotes the house sparrow)
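To make the seven ranks concrete, the following minimal sketch (hypothetical Python written for this overview, not part of the thesis software) encodes the house sparrow example from the list above and derives its binomial name.

    from collections import OrderedDict

    # The seven major Linnaean ranks for the house sparrow example above.
    house_sparrow = OrderedDict([
        ("kingdom", "Animalia"),
        ("phylum", "Chordata"),
        ("class", "Aves"),
        ("order", "Passeriformes"),
        ("family", "Passeridae"),
        ("genus", "Passer"),
        ("species", "domesticus"),
    ])

    # The two-part (binomial) name combines only the genus and the species epithet.
    binomial = f"{house_sparrow['genus']} {house_sparrow['species']}"
    print(binomial)  # Passer domesticus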

In addition, each level may have several super- or sub-groupings, such as Superorder for the level Order. Often the super- and sub-groupings will not be mentioned, as the Linnaean seven-level taxonomy suffices to distinguish a specimen. Often, when the subdomain in which the specimen falls is known, for example in an article on only birds, the specification will be even further abbreviated to only its two-part, or binomial, name consisting of only the genus and species.

Within a species there can be variation (e.g., some specimens may be larger than others, or have a slightly different colour). Therefore, one or more type specimens are defined as the reference specimens from which a species is described. The description of one or more type specimens is published and referred to whenever another specimen of the same species is described in the literature or in a collection. When there is more than one type specimen, the set of type specimens is called a type series. The most important specimen from the type series is the species' holotype (or its name-bearing specimen). Additional specimens that were described along with the holotype are called paratypes. If there is no holotype specified for a species, then two or more specimens are defined as syntypes that make up the species-defining specimen. Later, one of the syntypes can be selected to serve as the name-bearing specimen, but as the species was not originally described by this specimen, it is not called a holotype but a lectotype. If the holotype is lost or destroyed, a new specimen can be selected as the name-bearing specimen, which is then called a neotype.

Although the foundations of taxonomy were laid in the 18th century, there is still a continuous debate over species' positions in the zoological taxonomy (Schuh, 2000). This provides interesting challenges for digitisation in this field. For instance, due to new methods such as DNA research, biologists find that certain classifications need to be revised (Stoeckle, 2003). Such was the case with toads in the genus Bombina. Prior to 1985 these were classified under the family Discoglossidae together with other disc-shaped tongue frogs. In 1985, biologists found that the species in genus Bombina were so different from the other species in the Discoglossidae family that they were classified in their own family called Bombinatoridae (Cannatella, 1985). An example of a specimen from the Discoglossidae family is shown in Figure 2.1(a) and an example of a specimen from the Bombinatoridae family is shown in Figure 2.1(b). A researcher who wants to know about all specimens of family Discoglossidae in a natural history collection might not want to retrieve specimens of genus Bombina, but cannot be sure that she will not if the collection contains items from before 1985. In a properly enriched digital resource, it is possible to provide a link between specimens of family Bombinatoridae and specimens of genus Bombina from before 1985 so that they can both be easily retrieved in a search for Discoglossidae, and in reverse, that they can be hidden when only specimens of true Bombinatoridae should be retrieved.

1 In zoological taxonomy, only the genus and species names are italicised.

(a) Mallorcan Midwife Toad (Alytes muletensis). Image by tuurio and wallie from http://en.wikipedia.org/wiki/Majorcan_midwife_toad. Published under the Creative Commons Attribution ShareAlike 3.0 License.
(b) European Fire-bellied Toad (Bombina bombina). Picture by Marek Szczepanek. Image from http://en.wikipedia.org/wiki/European_Fire-bellied_Toad. Published under the GNU Free Documentation License, Version 1.2.

Figure 2.1: Example of a specimen belonging to the Discoglossidae family (Alytes muletensis) and of a specimen belonging to the Bombinatoridae family (Bombina bombina).

In addition to the traditional disciplines within natural history research, such as biogeography and systematics, new disciplines are emerging by virtue of new possibilities brought on by digitisation. In the next subsection, an introduction to biodiversity informatics is given, the discipline to which this thesis aims to contribute.

2.1.3 Biodiversity Informatics

With the rising awareness in the biodiversity research community of the benefits of information technology and information sharing, a new field has emerged that was coined biodiversity informatics (Soberón and Peterson, 2004). Biodiversity informatics "includes the application of information technologies to the management, algorithmic exploration, analysis and interpretation of primary data regarding life, particularly at the species level of organization" (Soberón and Peterson, 2004). The aim of biodiversity informatics is to create a global biodiversity map, including the approximately 1.8 million species currently known and the information about their genes, proteins, behaviour, and morphology (Wilson, 2000).

Biodiversity informatics is necessary to be able to analyse a species' distribution in full. Such an analysis can lead to understanding why a species is, for instance, in decline, what factors play a role in this decline, and most importantly, how to reverse it in order to prevent extinction. Guralnick and Hill (2009) present a framework and prototype of a unified biodiversity resource but also state that there is much work to be done as, for instance, obtaining data is a large bottleneck. This is partly due to the negative attitude towards data sharing that is still prevalent in some parts of the field (Krishtalka and Humphrey, 2000). Krishtalka and Humphrey's report on policies within some institutions that prevent data sharing is alarming, as data sharing and information integration are imperative to further biodiversity research. Krishtalka and Humphrey recommend that institutions become actively involved in constructing a "biodiversity informatics resource in an open, collaborative and community-based manner." Between the publication of Krishtalka and Humphrey (2000) and the time of writing this thesis much work has been done. For example, the Biodiversity Information Standards group has been created2. This group was previously known as the Taxonomic Database Working Group (TDWG), and was founded in 1985. It was primarily concerned with data standards for plant taxonomic databases, but this has gradually grown to encompass taxonomic databases in general and now has members in geology, zoology, and microbiology. As the focus of the work of the TDWG shifted towards developing standards for sharing biodiversity data, the name was changed to Biodiversity Information Standards Group in 2006 to cover the focus of the group better. The work of the Biodiversity Information Standards Group has, for example, resulted in the Access to Biological Collections Data (ABCD) Schema3, a standard for the access to and exchange of primary biodiversity data. Currently, Naturalis is working on making their database systems compliant with this standard.

The integration of species presence data and geographical resources has made it possible, for example, to predict the geographical distribution of related species in similar climatological environments, as research by Peterson and Vieglais (2001) shows. In this work, geographical regions were identified that shared similarities in precipitation, temperature, elevation, and vegetation. On the basis of these regions it was possible to predict the distribution of a species by knowing the distribution patterns of a related species. Guralnick and Hill (2009) present a case study in which the Barcode of Life Data System (a database containing DNA sequences that serve as identification for species) was combined with the International Union for Conservation of Nature4 red list ratings (indicating at what risk of extinction a species is). By doing so, it was possible to automatically prioritise the conservation need of different species. This would not have been possible without the availability of digitised resources containing information on species' DNA and extinction risks.

It must be noted that the field of biodiversity informatics is quite young and therefore much of the work is still in its early stages. Collaboration with the field of informatics could help speed up developments in biodiversity informatics. Biodiversity informatics could, for example, benefit from the use of machine learning algorithms to process and analyse large quantities of data.

In the next section, the main data sets on which the majority of the experiments for this thesis were done are introduced.

2 http://www.tdwg.org/ Last visited: 6 June 2009
3 http://www.tdwg.org/standards/115/ Last visited: 4 September 2009
4 http://www.iucn.org Last visited: 4 September 2009

2.2 The Naturalis Reptiles and Amphibians Data Sets

Over one hundred manually created collection databases exist in the research laboratories at Naturalis. The choice was made to focus on the reptiles and amphibians (herpetological) data because of its accompanying collection of semi-structured text that partly overlaps with an existing database.

2.2.1 The R&A Database

The Reptiles and Amphibians (R&A) database is a flat database with manually assembled contents. It is pieced together by manual editing by researchers at Naturalis’ research laboratories from field logbooks, registers, taxonomies, and their own knowledge about the animal specimens in the collection. As of February 2006, it contains 16,870 records with 47 columns. Of these 47 columns, 5 columns always contain a value, 10 columns are unused, and an additional 12 columns are sparsely used (less than 30% filled). The most important information is found in the remaining twenty columns, namely in the taxonomic columns (12 columns)

and geographical columns (8 columns). The database contains information in a variety of languages, among which Dutch and English are the most common, but German and Portuguese are also encountered. In Table 2.1, general statistics about the data in the Reptiles and Amphibians database are summarised. A data sample is presented in Table 2.2. See Appendix A for a more detailed account of the database.

# Columns                47
# Records                16,870
% Filled Cells           57.5
Collection Dates         1880 - 1998
Collection Coverage      Approximately 1/3 of specimens in collection
Geographical Coverage    Asia, Oceania, Amazonia, Southern Europe, Netherlands

Table 2.1: Statistics on reptiles and amphibians database

Column name            Example
Class                  Reptilia
Genus                  Gymnophthalmus
Author                 Hoogmoed, Cole & Ayarzaguena, 1992
Town                   Canaripo
Location               Riverine forest
Number of specimens    1
Country                Venezuela
Biotope                between dry leaves on edge granite plate
Special remarks        Slides 1978-10-13/15

Table 2.2: Example cells from a typical record in the reptiles and amphibians database
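The fill statistics reported in Table 2.1 are straightforward to recompute for any export of the database. The sketch below is a minimal, hypothetical Python example (the file name and delimiter are assumptions, not the actual export format used at Naturalis) that reports how full each column is and flags unused and sparsely used (less than 30% filled) columns.

    import csv

    # Hypothetical CSV export of the R&A database; file name and delimiter are assumptions.
    with open("reptiles_amphibians_export.csv", newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f, delimiter=";"))

    for column in rows[0]:
        filled = sum(1 for row in rows if (row[column] or "").strip())
        share = filled / len(rows)
        status = "unused" if filled == 0 else "sparse" if share < 0.30 else "in use"
        print(f"{column:25s} {share:6.1%} filled ({status})")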

2.2.2 Field Logbooks and Registers

The field logbooks and registers used in this study are a set of 80 handwritten logs describing the acquisition of a reptile or amphibian specimen and its registration in the museum collection, respectively. There are 47 field logbooks in which biologists have recorded information about the collection of specimens whilst in the field. Most entries contain the following

• write as [male], write as [female]; ♂ ♀ • insert two empty lines between consecutive entries;

• mark unreadable text as [unreadable] to avoid spending too much time on trying to decipher an illegible entry.

5Only if the animal species is known 6The complete instruction for the typists can be found on: http://ilk.uvt.nl/mitch/ typistguidelines Background and Resources 19

To test the performance of the typist guidelines several test pages were tran- scribed by more than one typist. The differences between the resulting texts were small and few enough to judge the guidelines clear and sufficient. Furthermore, the typists were encouraged not to correct any spelling errors they might spot as to obtain a parallel corpus that could serve as training data for handwriting recognition in the future, as well as to gain insight into the quality of the texts (i.e., whether it contains many errors).

2.3 Additional Resources

For some experiments described in this thesis, in addition to the core data from Naturalis, other resources are used. These additional resources are described in Subsections 2.3.1 and 2.3.2.

2.3.1 GeoNames

GeoNames7 is an aggregated geographical database that is available and access- ible through various Web services. It is released under a Creative Commons attribution license8 which makes it free to use as long as it is cited. The GeoNames database is an integrated collection of smaller databases such as the National Geospatial-Intelligence Agency’s and the U.S. Board on Geo- graphic Names, the U.S. Geological Survey Geographic Names Information Sys- tem for geographic entities. On top of that it has information about eleva- tion through resources such as srtm39 and gtopo3010, population data through worldgazetteer11, alternative names for locations through GNS12 etc. In June 2009, these resources contained over eight million geographical names, of which 6.5 million are unique entities, making it the most complete free resource available at the time of writing this thesis and it is still growing. It is highly structured and contains alternative names in different languages. Especially the latter is a major advantage over, for instance, Google Maps13 for usage in conjunction with

7http://www.geonames.org, Last queried: 15 July 2009 8http://creativecommons.org/licenses/by/3.0/, Last visited 4 June 2009 9http://www2.jpl.nasa.gov/srtm/ Last visited: 15 November 2009 10http://eros.usgs.gov/#/Find_Data/Products_and_Data_Available/gtopo30_info Last visited: 15 November 2009 11http://world-gazetteer.com/ Last visited: 15 November 2009 12http://earth-info.nga.mil/gns/html/index.html Last visited: 15 November 2009 13http://maps.google.com/, Last visited: 4 June 2009 20 2.3 Additional Resources

Figure 2.2: Photographic image of pages 8 and 9 of the field logbook describing collections of animal specimens in Suriname between 24 April and 5 August 1968. The right-hand page shows the initial comments recorded during the find, the left hand page shows additional comments added later. Background and Resources 21

Figure 2.3: Sample of page 12 and 13 of the second register herpetological register, describing museum acquisitions between 29 January 1932 and 5 July 1935 (register numbers 5442-6406). The orange stickers indicate that the described specimen is a type specimen. 22 2.3 Additional Resources the Naturalis data that is made up of several languages. As the core of the GeoNames database comes from official public sources, quality may vary. Users can edit and improve the database through a Wiki, utilising user-contributed knowledge to ensure data quality and integrity. How- ever, no extensive quality assessment has been done on GeoNames, as was, for instance, carried out on Wikipedia (Giles, 2005). Even without such an evaluation several research projects have been using GeoNames and have not reported any data quality problems (Auer and Lehmann, 2007; Jaffri, 2007; Halb et al., 2008). One drawback of GeoNames is that it does not contain many archaic names for locations that do sometimes occur in the data from the Naturalis research laborat- ories. However, no other resource was found that covered this problem. Therefore, and because of the fact that GeoNames is easily accessible through various APIs have led to the choice of this resource to check the geographic information in the databases from Naturalis against.

2.3.2 Taxonomic Resources

To enrich and validate the taxonomic information in the reptiles and amphibi- ans database, a reptile and an amphibian taxonomy were chosen as additional resources. Although the foundation for the zoological taxonomy that was laid by Linnaeus has not changed, there is a lively debate on the classification in the lower levels of the taxonomy. So, for each of the domains (amphibians and rep- tiles) there are several taxonomies in use, and some are used by one institution and some by another. Since the database at Naturalis was created following two particular taxonomic resources (one for reptiles and one for amphibians), the taxo- nomic resources used by Naturalis were chosen here to minimise inconsistencies arising from incompatibility with another resource.

Amphibians

For the amphibians the Frost taxonomy is used, as published online (Frost, 2009). For every species it contains:

• current scientific name and author and year of publication;

• original name, authorship, location of the type specimens and where they were collected; Background and Resources 23

• synonymous terms and location of type specimens known under these syn- onymous terms;

• published English vernacular names;

• known or inferred geographic distributions;

• comments on controversies or relevant taxonomic literature.

The version used in this work (version 5.3) contains descriptions of 6,433 am- phibian specimens with references to the literature and synonyms. Although the Frost taxonomy is an accepted resource there is much ongoing debate on choices made by its author, in particular on the latest version of the taxonomy that is also used for this thesis (personal correspondence with a re- searcher at Naturalis). However, this is inherent to the nature or systematics and the Frost taxonomy is currently the best to work with. The resource is not as freely accessible as GeoNames, therefore it was only possible to use a hierarchical list based on the amphibian taxonomy and their accompanying synonyms and not the information on type specimen locality and geographic distribution.

Reptiles

The TIGR Reptile Database (Uetz et al., 2008) is compiled from books, checklists, monographs, journals, and other peer-reviewed publications from the domain of reptile taxonomy. It is currently maintained by the Systematics working group of the German Herpetological Society (DGHT). It lists all species and their position in the taxonomy. For each of the 8,600 species the following information is stored:

• current scientific name and author and year of publication;

• synonymous terms and if relevant location of type specimens;

• known or inferred geographic distributions;

• additional comments or links to related information.

For about 3,300 species pictures are available, for 4,000 species type informa- tion is available and for 6,300 species their vernacular names are available too. 24 2.4 Discussion

As with the Frost database, access to the online database is limited and thus only the taxonomic hierarchy including synonyms and vernacular names is used in the experiments.

2.4 Discussion

The natural history domain entails far more than only the herpetological collec- tion. This collection was chosen because in Naturalis the most resources were available for it in digital form. Currently, the research laboratories at Naturalis maintain about one hundred databases containing information about their ver- tebrates, invertebrates, fossils, and insects collections. The database systems used for the registration of these collections do not comply with current demands on aspects of user friendliness, scalability, and data models. Within Naturalis, a project is underway to standardise these database systems and transport them to a single collection registration system (CRS). The CRS will help users enter data correctly by limiting what type of data can be entered (as Chapter4 shows, the lack of this limitation is a common cause for errors in the data) and it will enable the analysis of data from all collections at the same time or quickly select sub- sets of collections. However, until this conversion is completed in 2010, the data selected for this work illustrates what problems and possibilities legacy specimen databases present to information access. 3 Preparatory Work

In omnibus autem negotiis priusquam adgrediare, adhibenda est praeparatio diligens∗

Marcus Tullius Cicero, De Officiis, 44 b.c.

In this chapter, work is described that enables the experiments necessary to answer the research questions. The studies presented in this chapter are not part of the main thesis contributions, but they do provide novel insights into the R&A data. The chapter has two parts: (1) automatic segmentation and labelling of segments of the R&A field logbooks and registers to resolve the data digitisation gap, and (2) work done to provide more structure to the flat R&A database by developing an ontology for it.

This chapter is structured as follows. In Section 3.1, the work carried out on the automatic segmentation and labelling of the field logbooks and registers, with which the R&A database can be populated automatically, is described. In Section 3.2, the design process of an ontology for the natural history domain is described, after which the ontology is presented. Section 3.3 provides a chapter summary.

∗Before undertaking any enterprise, careful preparation must be made

3.1 Automatic R&A Database Population

Structured data, as found in databases, is a far more effective resource for accessing information than the unstructured data typically found on the Web or in a collection of books (Fayyad et al., 1996). To transform the field logbooks and registers into database entries with which a database can be populated automatically, they need to be segmented and labelled. The groundwork for the results presented in this chapter was laid by Lendvai and Hunt (2008). The experiments presented in this chapter include a more thorough analysis and some adjustments to the original work.

Text segmenting and labelling is a procedure that covers various levels of granularity in Natural Language Processing (NLP) tasks. At the crudest level, it denotes the identification of different topics within a text, for example, topic boundary identification (Hearst, 1997; Beeferman et al., 1999). At a finer level, there is a large body of work concerned with segmenting texts into smaller parts, to identify, for instance, different parts of bibliographical references (McCallum et al., 2000; Han et al., 2003) or postal addresses (Borkar et al., 2001). The task in this work is closer to the splitting of bibliographical references than to topic boundary identification, as the segments to be identified are short and it is known in advance what types of information may occur in an entry.

Prior to Lendvai and Hunt (2008), another approach had been investigated to segment and label the R&A field logbooks and registers automatically and populate the database. Canisius and Sporleder (2007) utilised the fact that the database is based on the field logbooks and registers, and tried to train a machine learning algorithm on data from the field logbooks and registers with which to further populate the database. Canisius and Sporleder presented two approaches: one in which field log and register entries are restructured from the database values as training data for TiMBL (Daelemans et al., 2004), and one that uses a Hidden Markov Model (HMM) (Rabiner and Juang, 1989). TiMBL is an implementation of the k-Nearest Neighbour (k-NN) algorithm that was introduced by Cover and Hart in 1967. The highest results they achieve are F = 17.6 for the TiMBL approach and F = 60.3 for the HMM approach.

The approach that Lendvai and Hunt (2008) took relies on annotated training data alone. Two different machine learning algorithms were investigated to segment the R&A field logbooks and registers, namely Conditional Random Fields (CRFs) (Lafferty et al., 2001) and the Memory Based Tagger (MBT) (Daelemans et al., 2007). The classifiers were trained on 300 manually annotated training examples and tested on 200 held-out entries. Lendvai and Hunt (2008) achieve an F-score of 69 for the CRF approach and an F-score of 84 for the MBT approach. Although the F-score for most fields is quite high, there is still room for improvement. In the next subsection, the overlap between the field logbooks and registers and the R&A database is analysed in order to investigate whether it is possible to automatically generate training data for a classifier. In Subsection 3.1.2, the preprocessing of the data on which the classifier is trained is described, followed by the experiments that were carried out in Subsection 3.1.3.
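As a reminder of how such scores are computed, precision, recall, and the balanced F-score are used in their standard definitions over the predicted segments; the formulas below are only included for reference.

\[
P = \frac{\text{correctly predicted segments}}{\text{predicted segments}}, \qquad
R = \frac{\text{correctly predicted segments}}{\text{segments in the gold standard}}, \qquad
F_{\beta=1} = \frac{2PR}{P+R}
\]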

3.1.1 Overlap between Field Logbooks and Registers and R&A Database

As the database is derived from information in the field logbooks and registers, an attempt was made to create more training data for MBT by utilising the overlap between the registers and field logbooks on the one hand and the database on the other. In the field logbooks, 1,842 entries could be identified from which parts match the 'genus', 'species', and 'registration number' fields of an existing R&A database record. This results in very few matches, but such exact matching is necessary, as registration numbers are not unique. Researchers at the Naturalis research laboratories estimate that at least 2/3 of the database records are directly derived from the field logbooks or registers and should thus be linkable. Other methods of matching, such as computing the cosine similarity (Salton and McGill, 1983) between pairs of database and log entries, did not yield satisfying results. This is possibly due to the fact that a database record often contains more information coming from publications or other additional resources, making it a more complete account of a specimen's history, but also removing it too far from the field logbooks and registers to use as training data. Upon manual inspection of the entries obtained through the strict matching method, it was found that these were indeed all field log and register entries from which the database record was derived. However, they did not prove to be useful for automatic labelling from the database entries as, in the transfer of the information from the field logbooks and registers, parts were often rephrased or even translated. An example database entry and its corresponding field log entry are given below1.

Reg #         13879
Number        1
Author        Ruibal, 1952
Determinator  Arthur Dent
Town          Airstrip Sipaliwini, 4 km E
Genus         Leposoma
Species       Guianense
Sex           m
Collector     Dent, A.
Coll #        1968-MSH8787
Coll. Date    05-09-1968

Accompanying fieldbook entry:

Leposoma Guianense, Sipaliwini, 4 km o. van vliegveld, omgeving basiskamp, bosgrond tussen bladeren, 28-VIII-1968, 12.45 u. reg. nr. 13879

1To preserve the anonymity of the people in the data, the names 'Arthur Dent' and 'Ford Prefect' are used. The data reflect the different naming conventions used in the original records, for example Firstname Lastname and Lastname, Initial.

In the given example, the registration number cannot be completely labelled through the database because the indication 'reg. nr.' was deleted in the database cell. This could be solved by normalisation, for instance, by adding it to the database cell. More problematic, however, is the fact that not all parts of the entry have been typed into the database and some parts were even translated ("Sipaliwini, 4 km o. van vliegveld" from the field logbook is changed into "Airstrip Sipaliwini, 4 km E" in the database). This problem is known as the record linkage problem, and it greatly affects the usability of databases as secondary resources in the tagging of more material from primary resources (McCallum and Wellner, 2003; Bellare and McCallum, 2007; Snyder and Barzilay, 2007). The likely cause for such changes is that the field logbooks were first (partly) transcribed into the registers, which were then entered into the database, where missing information was looked up again in the field logbooks. These two translation steps caused the database to lack sufficient overlap with the field logbooks and registers to be utilised as training data. Experiments with adding the parts of the field logbooks and registers that could be labelled via the database to the training set did not yield any significant improvements over the F-score of 84 reported in Lendvai and Hunt (2008). The observation that did improve the results of the segmenting and labelling was the acknowledgement of the major differences between the field logbooks and registers. Therefore, in the remainder of this section, segmenting and labelling experiments following Lendvai and Hunt (2008) are presented in which separate experiments are carried out for the registers.
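To make the strict linking described above concrete, a minimal sketch is given below of matching a database record to a field logbook entry on the 'genus', 'species', and 'registration number' fields; the record layout and field names are illustrative assumptions rather than the actual database schema.

    def matches_logbook_entry(record, entry_text):
        """Return True if the genus, species, and registration number of a
        database record all occur literally in a logbook entry (strict match)."""
        text = entry_text.lower()
        return (record["genus"].lower() in text
                and record["species"].lower() in text
                and str(record["registration_number"]) in text)

    # Example with the (anonymised) entry shown above.
    record = {"genus": "Leposoma", "species": "Guianense", "registration_number": 13879}
    entry = ("Leposoma Guianense, Sipaliwini, 4 km o. van vliegveld, omgeving "
             "basiskamp, bosgrond tussen bladeren, 28-VIII-1968, 12.45 u. reg. nr. 13879")
    print(matches_logbook_entry(record, entry))  # True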

3.1.2 Annotated Training and Test Data

Both Canisius and Sporleder (2007) and Lendvai and Hunt (2008) use the same annotated training and test sets. The training set consists of 300 entries and the test set consists of 200 entries. These annotated sets contain only entries from the field logbooks. The registers are sufficiently different to warrant the creation of a parallel data set. Therefore, 500 entries from the registers were selected and annotated in similar fashion to the data sets from the field logbooks. This set of 500 annotated register entries was split into a set of 300 entries for training and 200 entries for testing. The main differences between the entries from the registers and the field logbooks are the length and number of segments. On average, the register entries contain 12 segments with a standard deviation of 6.05, and the average number of tokens is 34 with a standard deviation of 22.94. The field logbook entries contain 61 tokens (σ = 35.58) on average, divided over 16 segments (σ = 5.12). The distribution of tags in both resources differs as well; the registers, for instance, will often contain a segment indicating a donator, as they more often deal with donated and purchased specimens, while the field logbooks exclusively describe finds and thus do not contain information about donations. A collection of statistics on the different data sets is given in Table 3.1.

3.1.3 Experiments and Results

Two sets of experiments were carried out that each consist of two parts. The first set of experiments consists of one experiment in which MBT is trained on the 300 annotated register entries and tested on the 200 annotated register entries, and one experiment in which MBT is trained on the 300 annotated field logbook entries and tested on the 200 annotated field logbook entries. All scores presented in this subsection are computed on complete segments. The results of these two experiments are shown in Table 3.2.

                                    Field logbook    Register
  Number of different segments           22             30
  Avg. # tokens per entry (σ)        61 (32.58)     33 (22.93)
  Avg. # segments per entry (σ)      16 (5.12)      12 (6.05)
  Avg. # tokens per segment (σ)       4 (6.84)       3 (5.16)
  Max. # tokens per entry               212            206
  Min. # tokens per entry                3              2
  Max. # segments per entry              47             29
  Min. # segments per entry              3              2
  Max. # tokens per segment             133            201
  Min. # tokens per segment              1              1

Table 3.1: Statistics on entries and segments in registers and field logbooks

                                                          Precision   Recall   Fβ=1
  Register entries trained on register entries              75.96%    74.45%   75.20
  Field logbook entries trained on field logbook entries    82.98%    86.23%   84.57

Table 3.2: Results of experiments for register entries tested with MBT trained on register entries and for field logbook entries tested with MBT trained on field logbook entries

The second set of experiments consists of two experiments in which MBT is tested on a type of data set different from the one it was trained on: in the first experiment MBT is trained on the 300 annotated register entries and tested on the 200 field logbook entries; in the second experiment MBT is trained on the 300 annotated field logbook entries and tested on the 200 annotated register entries. The results presented in Table 3.3 show that the differences between the registers and field logbooks negatively affect the performance of the tagger and illustrate the need to train a separate tagger for each data set. The differences in F-score between the experiments in which the test data was of the same type as the training data and the experiments in which the test data was of a different type were significant at a 95% confidence interval for both the registers and the field logbooks. The significance levels were computed using bootstrap resampling (Noreen, 1989).

                                                      Precision   Recall   Fβ=1
  Field logbook entries trained on register entries     47.11%    34.05%   39.53
  Register entries trained on field logbook entries     34.96%    33.15%   34.03

Table 3.3: Results of experiments for field logbook entries tested with MBT trained on register entries and for register entries tested with MBT trained on field logbook entries

The scores reported in Tables 3.2 and 3.3 are averaged over all segments. However, some segments are easier to segment and label than others. Similar to the results in Lendvai and Hunt (2008), the classifier performs better on the shorter types of information (such as 'genus' and 'species') than on the longer segments (such as 'special remarks'). This is expected, as there is much variety in the latter type of fields. It is also not necessarily a bad thing that the classifier cannot predict the longer segments so well, as the information in the 'genus' and 'species' fields of the database is the most important. Most information requests from researchers on the data inquire after one or both of these fields. For 'genus' and 'species', MBT achieves F-scores of 90.53 and 91.37, respectively, on data from the registers, and 94.36 and 88.13, respectively, on the data from the field logbooks.
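As an illustration of the significance test, a minimal sketch of bootstrap resampling over per-entry outcomes is given below; it is a generic rendering of the idea rather than the exact procedure of Noreen (1989), and the data structures are hypothetical.

    import random

    def bootstrap_f_difference(outcomes_a, outcomes_b, iterations=10000, seed=1):
        """Estimate how often system A does not outperform system B when the test
        entries are resampled with replacement. Each outcome is a
        (correct, predicted, gold) segment count for one test entry; both lists
        are assumed to be aligned on the same test entries."""
        rng = random.Random(seed)

        def f_score(outcomes):
            correct = sum(o[0] for o in outcomes)
            predicted = sum(o[1] for o in outcomes)
            gold = sum(o[2] for o in outcomes)
            p = correct / predicted if predicted else 0.0
            r = correct / gold if gold else 0.0
            return 2 * p * r / (p + r) if p + r else 0.0

        n = len(outcomes_a)
        worse = 0
        for _ in range(iterations):
            sample = [rng.randrange(n) for _ in range(n)]
            if f_score([outcomes_a[i] for i in sample]) <= f_score([outcomes_b[i] for i in sample]):
                worse += 1
        return worse / iterations  # small values indicate a reliable difference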

3.2 A Manually Constructed Ontology

An ontology is an explicit conceptual specification of a domain (Gruber, 1993). An ontology represents knowledge about a domain by describing the classes in that domain and the relations between them. The term ontology was first used in philosophy, denoting the systematic account of existence. It is derived from the Greek words ὄντος (being) and λογία (science, study, or theory) and was used to denote the study of being. In the context of artificial intelligence or computer science, ontologies still carry the meaning of describing the world, but they are usually focussed on a specific domain. In the remainder of this section, basic characteristics of ontologies are laid out in Subsection 3.2.1. Then, in Subsection 3.2.2, the resources involved in building the manual ontology for the R&A database are described. In Subsection 3.2.3, the principles of ontology building are discussed, followed by the methodology used for the R&A ontology in Subsection 3.2.4. In Subsection 3.2.5, the ontology created for the R&A domain is presented.

3.2.1 Ontology Basics

There are different levels of formality for ontologies, depending on the application for which they are used. The ontologies developed for this thesis are not meant for direct automatic information integration with other resources. For that to be possible, there is not sufficient consensus on the data resources and structure at Naturalis, although this should change with the introduction of the new CRS. The main purpose of the ontologies developed here is to conceptualise the natural history domain and provide basic knowledge about it in order to facilitate several data cleanup tasks and aid information retrieval. Also, the data resources from which the ontology is derived are rather simple, i.e., they do not contain the thousands of concepts that one would encounter in some other domains, and there are not many sub- and superconcepts distinguishable.

An ontological class is an entity that encompasses all individual objects sharing particular properties that differentiate them from other classes. The notion of which properties are discriminative for the differentiation of the classes may differ per domain and per intended use of the ontology. For example, for the R&A domain, a class describing provinces can be defined, which is used to indicate a region larger than a city but smaller than a country. To provide an indication of the locality of a specimen find for biogeographical research, the class of provinces is an acceptable level of granularity, as the province is merely used to disambiguate the city name. However, for a dedicated geographical resource such as GeoNames, such a class is too imprecise, as GeoNames aims at providing detailed and fine-grained information about localities. In the data model for this resource, the types of information that are captured in the R&A class for province are split up into sixteen subclasses with a distinct hierarchy to describe whether a particular region constitutes an administrative division or a military buffer zone.

Ontologies can serve as a lingua franca between various parties operating within one domain. A layperson can, for example, use the word 'snake' to denote a certain concept in the domain, the domain expert will refer to it as 'ophidia', and in the online registration system this concept can be denoted as 'X5087'. The aim of an ontology is to develop a framework of the domain in which each of the users, whether human or machine, can use the extension of the ontology in such a way that there is common ground between their usages. By using a conceptual model such as an ontology within a domain, the data in the domain will become more transparent, and thus more inviting for other parties to contribute to and participate in (cf. Chandrasekaran, Josephson, and Benjamins (1999)). As the natural history domain is changing (e.g., there are changes in taxonomy, research practice, and its role in society), the ontology presented in this chapter must be seen as a snapshot of the natural history domain at a point in time, because it is based on the resources available to the MITCH project. Ontologies cannot offer a complete specification of a domain for every user, but they can offer a consistent specification for every user to work with. Through the use of a commonly used conceptual reference model (see Subsection 3.2.2) and by basing the ontology on data constructed by the researchers at Naturalis, an attempt is made to construct a specification that is useful and acceptable to all parties.
A distinction is made between the type of knowledge that the ontology provides about the ontological classes and about the individual objects. The generic information on the domain that is expressed by the description of the classes and their properties makes up the intension of the ontological knowledge. The objects, in the natural history domain the specimens, make up the extension of the ontological knowledge.
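To illustrate the distinction, a small OWL fragment built with the Python rdflib library is given below; the namespace and the class, property, and instance names are invented for illustration and do not correspond to the actual R&A ontology.

    from rdflib import Graph, Namespace, RDF, RDFS
    from rdflib.namespace import OWL

    EX = Namespace("http://example.org/ra-ontology#")   # hypothetical namespace
    g = Graph()

    # Intension: class definitions and a relation between classes.
    g.add((EX.Specimen, RDF.type, OWL.Class))
    g.add((EX.Collector, RDF.type, OWL.Class))
    g.add((EX.collectedBy, RDF.type, OWL.ObjectProperty))
    g.add((EX.collectedBy, RDFS.domain, EX.Specimen))
    g.add((EX.collectedBy, RDFS.range, EX.Collector))

    # Extension: individual objects (database records) instantiating the classes.
    g.add((EX.specimen_13879, RDF.type, EX.Specimen))
    g.add((EX.arthur_dent, RDF.type, EX.Collector))
    g.add((EX.specimen_13879, EX.collectedBy, EX.arthur_dent))

    print(g.serialize(format="turtle"))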

3.2.2 Resources Involved

The domain knowledge that was used to construct the ontology was derived from the reptiles and amphibians database at Naturalis, as described in Subsection 2.2.1. Missing pieces of information, such as the relation between the 'author' column in the database and a specimen, stem from correspondence and interviews with domain experts at Naturalis (e.g., the 'author' column, as mentioned in Subsection 2.2.1, describes the publication in which the species was first defined; this cannot be deduced from the short database label without knowledge about the domain). The information on the herpetology domain was mapped to a standardised reference model for museum collections, namely the CIDOC-CRM reference model.

CIDOC-CRM

For the construction of the R&A ontology, the choice was made to conform to the guidelines of the CIDOC Conceptual Reference Model (Crofts et al., 2008). The advantage of using a standard such as CIDOC-CRM is that the ontology can easily be integrated with other resources that use the same standard. Using a reference model also simplifies the ontology construction process somewhat, as it provides constraints on the ontology to be constructed. The CIDOC-CRM is developed by the International Committee for Documentation (CIDOC) of the International Council of Museums (ICOM) and accepted as ISO standard ISO 21127:2006. The goal of CIDOC-CRM is to provide the underlying semantics of database schemata and document structures used in cultural heritage in the form of a formal ontology. As such, it does not define terminology but rather the characteristics of the relationships between different objects in a domain. The CIDOC-CRM aims to be cross-disciplinary, but it was designed from the perspective of art and archaeological artefact collections. Although there are differences between natural history and art collections, it is definitely possible to map animal specimen data to the CIDOC-CRM, as Subsection 3.2.5 will show. This is backed up by the fact that other natural history institutions are also looking at CIDOC-CRM to represent their domain (see for example Lampe et al. 2008). The CIDOC-CRM reference model provides for the representation of far more types of concepts than those used for the R&A domain. Version 4.5.2 contains 86 entity classes (identified by labels E1 to E90) and 137 properties that define relations between classes (labelled P1 to P148). CIDOC-CRM contains, for instance, concepts to denote birth dates and end-of-existence events, which may be used to represent biographical information about persons or objects involved in the domain. Since this type of information is currently not recorded in the data sets and researchers did not state that it would be useful to them in the near future, such classes were not described in the R&A ontology. However, when necessary they can be added.

Protégé

To model the ontology, the open source ontology editor Protégé version 3.4 is used2. Internally, Protégé uses OWL, but it can also export ontologies to several formats such as RDF3 and XML Schema4. OWL is a markup language for publishing and sharing data using ontologies on the Internet and is recommended by the World Wide Web Consortium5. OWL represents the meanings of terms in vocabularies and the relationships between those terms in a way that is suitable for processing by software. Protégé allows for defining concepts, relations between concepts, and restrictions within the domain through a graphical user interface. Among its most important features are its prevention of the definition of duplicate concepts and its automatic encoding of inheritance, so that relations between concepts also hold for subconcepts. For a complete overview of its features, the reader is referred to the Protégé manual (Horridge et al., 2004).

2http://protege.stanford.edu Last visited: 3 July 2009
3http://www.w3.org/RDF/ Last visited: 10 January 2010
4http://www.w3.org/XML/Schema Last visited: 10 January 2010
5http://www.w3.org/ Last visited: 14 July 2009

3.2.3 Ontology Building Principles

Ontology construction has largely been a manual task and is thus dealt with differently by different parties. Chi (2007) has stated that ontology construction is an art or craft rather than a science, as an ontology also reflects the perspectives of the ontology constructors on the domain. With the growing interest in using ontologies for data sharing and integration, the use of standards and design principles has become necessary. Gruber (1993) defined the following five design principles, which are widely used.

Clarity and objectivity: the ontology should effectively communicate the intended meaning of the defined terms. Definitions should be objective.

Coherence: an ontology should be coherent.

Extendibility: new general or specialised classes should be included in the ontology in such a way that existing definitions do not need to be revised.

Minimal encoding bias: the conceptualisation should be specified at the knowledge level without depending on a particular symbol-level encoding. This principle has become obsolete with the endorsement and acceptance of OWL as ontology description language in 2004.

Minimal ontological commitment: the ontology should specify as little as possible about the meaning of its classes, giving the parties committed to the ontology freedom to specialise and instantiate the ontology as required. As has been noted by Gómez-Pérez (1998), this seems contradictory to the restriction that definitions should be as complete as possible, but it sets a challenge for finding a balance between generalisation and overspecification of a domain.

Some of Gruber’s design principles are difficult to implement. Objectivity and coherence are, for example, hard to measure. Extendibility is also difficult to take into account when it cannot be foreseen what kind of new classes need to be added to the ontology in the future. Overall, the principles provide guidance and aid deliberate decision making in the ontology construction process. Borgo et al. (1996) added a sixth design principle to ensure that hierarchies are clean and untangled. Their principle is as follows.

Ontological distinction: classes in an ontology should be disjoint.

3.2.4 Ontology Construction Methodology

Ontology construction is often carried out as a sequential bottom-up process in which several steps can be discerned. Every step provides a new level of deeper understanding and generalisation over the domain. The literature presents several variations (Uschold and Gruninger, 1996; Gómez-Pérez, 1998). For the R&A ontology, the approach that was used is taken from Noy and McGuinness (2001). This approach was chosen as it provides a clear step-by-step methodology for ontology building, it is commonly used, and it is connected to the Protégé editor. The approach consists of seven steps:

Step 1: Determine the domain and scope of the ontology. The domain of the R&A ontology is that of natural history, the scope is limited to reptiles and amphibians data, and pertains to the collection and storage of specimens.

Step 2: Consider reusing existing ontologies. When development of the R&A ontology started, there was no existing ontology that satisfied the information need of the researchers at Naturalis6; therefore, the CIDOC-CRM framework was adopted to ensure the ontology's easy integration with other resources and ontologies.

Step 3: Enumerate important terms in the ontology. The important terms in the ontology are taken from the R&A database, as well as from input from herpetologists. In the database, every column name is considered a term in the ontology. This does mean that some terms encompass rather broad topics and may be candidates for further splitting.

Step 4: Define the classes and the class hierarchy. The terms selected for inclusion in the ontology are matched to terms in the CIDOC-CRM reference model. Relations between the classes are defined after consulting a herpetologist who is familiar with the reptiles and amphibians collection and database.

Step 5: Define the properties of classes. The definition of properties of classes depends on the level of granularity of the previously defined classes. CIDOC-CRM defines classes up to a very fine level, where even the information given by a label connected to an artefact is considered a class instead of a property belonging to a class. The CIDOC-CRM view was adopted for this reptiles and amphibians ontology. Therefore, in some cases, what Noy and McGuinness (2001) consider a property of a class is defined as a class in CIDOC-CRM.

Step 6: Define the facets of the properties. Facets of properties describe what values properties can hold. If a 'Preservation Method' property were to be defined for the R&A ontology, two facets could be 'alcohol' and 'dry', as these are common methods to preserve animal specimens. As the CIDOC-CRM guidelines are followed, 'Preservation Method' is considered a class; thus, Step 6 in the approach of Noy and McGuinness (2001) is not applied here.

Step 7: Create instances. The R&A database provides the instances with which to populate the ontology. Therefore, Step 7 is straightforward for the R&A ontology. Since the ontology defines the classes important in this domain, instances from other databases can easily be imported into the ontology. By populating an ontology with instances, both the intension and the extension of the domain are specified and a knowledge base is formed.

6At the same time, a data model was being developed that will be used in the new CRS, which could have been adopted had the timing been different

3.2.5 Description of the Ontology

In this section, the ontology for natural history that was designed according to the CIDOC-CRM reference model is presented. First the seven types of classes of the R&A ontology are described, followed by the eight types of relations. The ontology is presented as a graph in Figure 3.1. The OWL source file can be found at http://ilk.uvt.nl/~merp/NaturalisRAOntology.owl.
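Should the OWL file still be available at that address, it can be inspected programmatically; the snippet below is a minimal example using the Python rdflib library and assumes only that the file is valid RDF/OWL.

    from rdflib import Graph, RDF
    from rdflib.namespace import OWL

    g = Graph()
    # Load the published ontology (from the URL above or from a local copy).
    g.parse("http://ilk.uvt.nl/~merp/NaturalisRAOntology.owl", format="xml")

    # List every class declared in the ontology.
    for cls in g.subjects(RDF.type, OWL.Class):
        print(cls)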

Classes:

• Specimen class: The most important class in the domain is the animal specimen. Each database record describes all relevant facets of this object. According to CIDOC-CRM guidelines it is denoted as a class of type E20 Biological object. In Figure 3.1 it is located in the centre of the figure.

• Person classes: In the collection, acquisition, preparation, and description of the animal specimen, several human actors are involved. One problem in representing these classes properly in the ontology is that the accompanying actions may have been carried out by the same person, but they could also have been carried out by several persons. To resolve this, the choice was made to define the actors by the roles that need to be fulfilled for the specimen preservation rather than the persons as objects. These roles are defined through classes of type E39 Actor and describe the collector, recorder, determinator, donator, and author roles. Whether these roles are carried out by the same person is left to the extension of the ontology through the database instances. CIDOC-CRM does not cater for the modelling of this type of information. This is probably due to the fact that it is based on collection objects, and persons involved in the acquisition and curation of the objects are of lesser importance.

• Geographic classes: In the R&A ontology, geographical classes pertain to the geographic locations that are connected to the specimen's find. These are denoted by E53 Place classes in the CIDOC-CRM hierarchy and, in the natural history ontology, pertain to the location, town, province, and country that define the finding location of the specimen.

Figure 3.1: Manually Created Ontology for Natural History

• Events: The processes involved in the acquisition and curation of a specimen are denoted as events in the CIDOC-CRM reference model. There are several types of events, such as acquisition events (E8 Acquisition Event for a donation or purchase) and E17 Type Assignment events for an act such as the determination of the species of the specimen.

• Dates: Events are anchored in time by dates, which are denoted by E52 Time-span classes in the ontology.

• Names and IDs: Objects are identified by names and IDs. E53 Place: Country can, for instance, be identified by several names described by E48 Place Name identifiers. Here, one can think of several synonymous terms linked to the country object, such as country names in different languages or a country code; these are linked to the preferred identifier by a P139: Has alternative form relation.

• Attributes: Additional information pertaining to the classes is described by various attributes. One such attribute is, for instance, E26 Physical feature: Preservation method. Documents, labels and any other classes such as E62 String: Special remarks are also considered attributes.

Relations

In CIDOC-CRM, relations are denoted as properties with a range and a scope. In this thesis, the term 'relation' is used instead of 'property'. The range is the class from which the relation originates. The scope is the class to which a relation can go. For example, the range of the relation P14: carried out by needs to be a concept of type E7: Activity and its scope of type E39: Actor to describe a person involved in a certain activity. In CIDOC-CRM, nearly every relation also has an inverse relation. The relation P4: has time-span, which has a range of type E2: Temporal entity and a scope of type E52: Time-span, for example, has the inverse relation P4: is time-span of, with the exact inverse range of type E52: Time-span and scope of type E2: Temporal entity. For the sake of readability, these inverses are not included in Figure 3.1, but they can be inferred from the relations shown.

• Specimen - Event: A specimen can be related to an event in several ways. CIDOC-CRM specifies different types of relations to link a specimen class to, for instance, an acquisition event or a move event. An E9 Move Event such as a specimen collection event is, for example, related to the specimen by a P25: Moved relation, as this event constituted the removal of the specimen from its habitat to the museum collection. An E8 Acquisition Event such as a donation has a different relation, denoted by P24: Transferred title of, as this event pertains more to the legal aspects of ownership. In most cases, prior to a donation the specimen was indeed collected somewhere (although some specimens were donated by, for example, zoos), but this is unrelated to the donation event, as the specimen may have been part of a different collection prior to its entry into the Naturalis collection.

• Specimen - Names and IDs: A specimen is related to an ID such as a registration number by a P1: Is identified by relation.

• Date - Event: Dates and events are linked through a P4: Has time span relation.

• Event - Event: Events are related chronologically through a P124: Occurs before relation.

• Person - Event: Person classes often have an active role in the events described in the ontology for R&A. Hence their relation to an event is almost always of type P14: Carried out by. An exception is the relation between E8 Acquisition Event: Donation/Purchase and E39 Actor: Donator/Seller class which is a P23: Transferred title from.

• Attribute - Event: It is not possible to link a specimen directly to an attribute; this is done through an E13 Attribute Assignment object which on the one side is linked to a specimen through the P140 Assigned attribute to relation and on the other side is linked to the attribute through a P141: Assigned relation.

• Geographic class - Event: Events and geographic classes can be related through the P7: Took place at relation, but also through more specific relations such as P27: Moved from, which relates a move event to a location.

• Geographic class - Names and IDs: Similar to the P1: is identified by relation, there is a P87: Is identified by relation especially for the identification of geographic classes. This distinction is made because the E44 Place appellation class can have attributes (e.g., spatial coordinates) that are different from those of the E42 Object ID.
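To illustrate how these classes and relations translate into concrete statements, a small rdflib sketch is given below that encodes a few of the links described above; the namespaces, identifiers, and property names are simplified assumptions and do not follow the official RDF encoding of CIDOC-CRM.

    from rdflib import Graph, Namespace, Literal

    CRM = Namespace("http://example.org/cidoc-crm#")   # hypothetical CRM namespace
    RA = Namespace("http://example.org/ra-data#")      # hypothetical data namespace
    g = Graph()

    # E20 Biological Object: the specimen, identified by its registration number.
    g.add((RA.specimen_13879, CRM.P1_is_identified_by, Literal("13879")))

    # E9 Move Event: the collection event, carried out by a collector (E39 Actor).
    g.add((RA.collection_event_13879, CRM.P25_moved, RA.specimen_13879))
    g.add((RA.collection_event_13879, CRM.P14_carried_out_by, RA.arthur_dent))

    # E53 Place: the event took place at a town, which falls within a country.
    g.add((RA.collection_event_13879, CRM.P7_took_place_at, RA.sipaliwini))
    g.add((RA.sipaliwini, CRM.P89_falls_within, RA.suriname))

    print(g.serialize(format="turtle"))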

The creation of the ontology via the CIDOC-CRM guidelines concludes the preparatory work. In the following chapters, the main contributions of this work will be presented. Before doing so, this chapter is summarised and future improvements are discussed in the next section.

3.3 Chapter Summary

In this chapter, preparatory work was presented that makes the experiments in the following chapters possible. In the first part of this chapter, the automatic segmenting and labelling of the field logbooks and registers was described. The segmenting and labelling task uncovered peculiarities regarding the method of manual data entry for the reptiles and amphibians database, such as interpretation of the data and choice of language. However, this barred utilising the overlap between the database and the resources, as entries about the same object could only be identified by registration number and taxonomic information. Most other information had been reformatted and/or interpreted and translated in such a way that automatic annotation of the field logbooks and registers with information from their corresponding database entries was impossible. Yet, the Memory Based Tagger achieves high scores on the most important fields from the field logbooks and registers for the automatic segmenting and labelling task, so that these results, with known levels of errors, could be added to the database. In particular, on the 'special remarks' and 'biotope' fields MBT's performance is low (with F-scores around 50), but as Chapter 6 will show, these are of secondary importance for indexing and retrieving information about the specimen collection. Although the approach taken to automatic segmenting and labelling in this chapter is not new, there is a significant improvement in results from acknowledging the difference between the field logbooks and the registers. It is a known problem in computational linguistics that results deteriorate when there is a discrepancy between the training and testing material used for an approach (Nothman et al., 2009). So far, this problem has not been overcome, but it is gaining attention from the research community. However, as the results in Subsection 3.1.3 have shown, annotating a few hundred instances for data sets such as the ones used here is a small effort with a fair pay-off. Contrary to the work performed by Canisius and Sporleder (2007) and Lendvai and Hunt (2008), different training sets were used for the field logbooks and the registers. This accounts for a significant improvement in the scores (from F = 34.03 to F = 75.20 for the registers and from F = 39.53 to F = 84.57 for the field logbooks).

In the second part of this chapter, a manually constructed ontology for the natural history domain was presented. The ontology is based on the CIDOC-CRM guidelines and provides information on how the concepts in the domain are related to each other. As the CIDOC-CRM guidelines are designed from an archive and collection management perspective, the resulting ontology focusses on these aspects as well. In the ontology, several clusters of classes are discerned that are more related to each other than to others, such as a grouping of classes concerned with taxonomy and a grouping of classes related to geography. This grouping expresses a strong bias towards a hierarchical structure, where each group of classes in the ontology describes which classes denote objects of the same type. The organisational perspective ignores the fact that the different types of information available in the domain can also be related in other ways. A particular genus of snakes occurs, for example, only in a certain geographical region. Such information can be useful to researchers in the domain.
In Chapter 5, an automatic approach to ontology building is presented that does define such relations for the domain. In the next chapter, improving the data quality of the R&A database will be addressed.

4 Data Cleaning

To kill an error is as good a service as, and sometimes even better than, the establishing of a new truth or fact.

Charles Darwin, 1880

In this chapter, two data cleaning methods are presented to improve the quality of specimen databases. The first data cleaning method, called Timpute, is a statistical method in which inconsistencies in data are identified through applying a machine learning algorithm. The second method, called Validato, is a rule-based method in which the rules are inferred from the ontology presented in Section 3.2. Analyses of the impact on data quality of the two methods, separately and jointly, provide the answer to Research Question 1.

RQ1a: Can data-driven and knowledge-driven methods provide improvements to the data quality of structured textual resources describing collection objects?

RQ1b: To what extent are the data-driven and knowledge-driven methods complementary?

The course of the chapter is as follows. Previous work on data cleaning is described in Section 4.1. In Section 4.2, the motivation for data cleaning is given, as well as an overview of errors that can occur in databases. Section 4.2 is concluded by the results of a manual error analysis of the R&A database. It is followed by a description of the normalisation steps carried out on the R&A database (Section 4.3). Section 4.4 presents the data-driven data cleaning method Timpute. In Section 4.5, the ontology-driven data cleaning method Validato is described. The chapter is concluded by discussion and conclusions in Section 4.6.

This chapter is based on the following publications:

• Sporleder, C., Van Erp, M., Porcelijn, T., and Van den Bosch, A. (2006d). Spotting the ‘odd-one-out’: Data-driven error detection and correction in textual databases. In Proceedings of the EACL 2006 Workshop on Adaptive Text Extraction and Mining (ATEM-06), pages 41–48, Trento, Italy. ACL.

• Van den Bosch, A., Van Erp, M., and Sporleder, C. (2009a). Making a clean sweep of cultural heritage. IEEE Intelligent Systems, 24(2), 54–63. Special Issue on Cultural Heritage.

4.1 The Essence of Data Cleaning

No matter how concentrated and precise the manual labour, humans make mistakes. Redman (1997), Orr (1998), and Maletic and Marcus (2000) estimate that about 5% or more of the information in manually created databases is incorrect. The errors are often small, but if not caught, they can harm computations and conclusions based on the data. The data mining community is aware that errors in data can be harmful and acknowledges the problem. However, data cleanup is often not the main topic of interest and is therefore brushed over as a quick preprocessing step. Database errors are introduced for a variety of reasons. An insufficiently clear database schema which confuses users can cause errors in data. This leads to different types of information being inserted in a single column because it is unclear where it should be entered. In the R&A database, wrong column errors indicate possible problems in understanding the database schema. Error prevention is a topic to which much attention has been paid recently, in particular in the medical domain, where prevention of errors can save lives (Cole et al., 2006). However, there are also many data resources around for which error prevention methods were not in place at the time of entering the data, such as the databases used for this thesis. Therefore, in this thesis, the focus is on error detection and not error prevention, although in Section 4.6 some pointers to improving the process of data entry into databases will be given. In addition to identifying errors caused by the database schema, this chapter will also deal with formatting inconsistencies, content errors, and spelling errors. The main cause for errors is the fact that the human mind is not suited for tasks that are repetitive and yet demand focus and concentration, such as data entry (Amar, 2002). In the literature on data cleaning, many different definitions are found, some more complete than others. In this thesis, the definition for data cleaning given by Chapman (2005) is adopted as it is geared towards the natural history domain.

Data Cleaning: “A process used to determine inaccurate, incomplete, or unreasonable data and then improving the quality through correction of detected errors and omissions. The process may include format checks, completeness checks, reasonableness checks, limit checks, review of the data to identify outliers (geographic, statistical, temporal or environmental) or other errors, and assessment of data by subject area experts (e.g., taxonomic specialists). These processes usually result in flagging, documenting and subsequent checking and correction of suspect records. Validation checks may also involve checking for compliance against applicable standards, rules, and conventions.”

There is a fair body of work devoted to automatic cleanup of data (Rahm and Do, 2000; Wang et al., 2005). Much of this work has primarily been concerned with the detection and elimination of duplicate records in databases (Bitton and DeWitt, 1983; Bobbarjung et al., 2006). This is not a trivial problem, especially not in cases in which databases are merged (Hernández and Stolfo, 1998). There is an increasing body of research that addresses the problem of erroneous data within records (Knorr and Ng, 1998; Lee et al., 2000; Jiang et al., 2001; Kubica and Moore, 2003; Zhu et al., 2004). Most approaches revolve around the detection of outliers, i.e., data points that do not conform to general patterns observed in the majority of the data, which may indicate that the value is erroneous. The research on automatic error detection and correction within database records is divided into four different types of approaches. First, statistical methods identify outliers through record value means, standard deviations, ranges, etc. Such an approach is used, for instance, by Maletic and Marcus (2000), who select outliers as data points of which the values are too many standard deviations from the mean. The second approach uses clustering methods to group records based on a distance metric: any record that cannot be added to a cluster is flagged as an outlier. An example is found in the work by Jiang et al. (2001), where the k-means algorithm is applied to identify data points that are at too great a distance from any of the cluster centres. These clusters are analysed and the data points contained in the smallest clusters are returned as outliers and marked as possible errors. A drawback of this method is that clustering algorithms have high computational complexity, and thus long runtimes. The third type of approach concerns pattern-based approaches. Such approaches identify outliers through modelling the data according to patterns, and then analysing which data points do not comply with these patterns. The fourth type of approach is rule-based. Rule-based approaches work in a similar fashion to pattern-based approaches, as rules are specified to which the data must adhere (Bruni and Sassano, 2001). Pattern- and association-rule-based error checking consider outliers to be data points that do not comply with a certain pattern. Patterns are different from association rules in the sense that patterns can be identified through partitioning, classification, and clustering methods (Aggarwal and Yu, 2001), whereas association rules often need to be hand crafted (Maletic and Marcus, 2006). The reader is referred to Pyle (1999) for a further analysis of possible problems present in data, which may harm a data mining approach.
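As a minimal sketch of the first, statistical type of approach, the snippet below flags the values of a numeric column that lie more than a chosen number of standard deviations from the column mean; the threshold and the example column are illustrative choices.

    from statistics import mean, stdev

    def flag_outliers(values, threshold=3.0):
        """Return the indices of values lying more than `threshold` standard
        deviations from the mean of the column."""
        mu = mean(values)
        sigma = stdev(values)
        if sigma == 0:
            return []
        return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]

    # Example: a column of collection years with one suspicious entry.
    years = [1968, 1971, 1969, 1970, 1868, 1972, 1969]
    print(flag_outliers(years, threshold=2.0))  # [4] -> the year 1868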

4.2 The Importance of Automatic Data Cleaning

As data cleaning is not a well-researched problem, it is hard to estimate the impact of erroneous or inconsistent data. What is known is that incorrect data can contain crucial data points that can greatly affect the outcomes of further usage. Errors can have an immense impact on modern economy (Redman, 1998), government (Karr et al., 2006), and research (Veaux and Hand, 2005). Redman (1998) estimates that errors in data quality can cost a company about 10% of its revenue. Loshin (2001) even estimates that dirty data can cost an organisation 20-25% of its budget. Karr et al. (2006) regard issues arising from erroneous data from a government perspective, and present two case studies in which governmental data is analysed and it is shown how errors in data can impact policy or decision making.

In the natural history domain, erroneous data can lead to analyses based on incorrect information, which can, for example, cause an incorrect decision to be made regarding the preservation of a species. In the remainder of this section, the different steps in data cleaning are discussed in Subsection 4.2.1, followed by the different error types that are encountered in data in Subsection 4.2.2. In Subsection 4.2.3, the error analysis of the Reptiles and Amphibians database is presented.

4.2.1 Data Cleaning Steps

The process of data cleaning can be broken down into a five-step workflow. This workflow is based on Chapman (2005) and Maletic and Marcus (2000).

1. Define and determine error types. Different data sets contain different types of errors. Manual inspection of a selection of the data is the main method to identify error types present in a particular data set (Dasu and Johnson, 2003).

2. Search and identify error instances. Current data sets often exceed thousands or even millions of instances (Ramaswamy et al., 2000). Hence, automatic error detection methods are imperative. For different types of errors, different detection methods have been developed.

3. Correct the errors. Once errors have been identified, correction is carried out where possible.

4. Document error instances and error types. It is important to track which instances have been corrected and of what type. This provides insight into the error correction process and the proportion of error types in the data. Error documentation can also be used to recheck whether the error correction process was carried out properly and whether the adjustments made to the data were correct.

5. Error prevention to reduce future errors. Preventing the introduction of errors is better than needing to detect and correct errors after data entry. Documenting error types can help the development of error prevention methods to put in place before more data is entered. To prevent formatting errors, one can think of restricting the entry field to only the accepted format (e.g., for dates). Spelling errors and some types of inconsistencies can be prevented by the use of accepted lists that restrict the possible values of a cell, as sketched below.
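A minimal sketch of such entry-time checks is given below, combining a format check for dates with an accepted list for a categorical field; the accepted values and field names are illustrative assumptions, not the actual CRS configuration.

    import re

    DATE_PATTERN = re.compile(r"^\d{2}\.\d{2}\.\d{4}$")   # dd.mm.yyyy
    ACCEPTED_PRESERVATION = {"alcohol", "dry"}             # illustrative accepted list

    def validate_entry(collection_date, preservation_method):
        """Return a list of problems found in a new database entry."""
        problems = []
        if not DATE_PATTERN.match(collection_date):
            problems.append(f"date not in dd.mm.yyyy format: {collection_date!r}")
        if preservation_method.lower() not in ACCEPTED_PRESERVATION:
            problems.append(f"unknown preservation method: {preservation_method!r}")
        return problems

    print(validate_entry("05.09.1968", "alcohol"))   # []
    print(validate_entry("1968-09-05", "formalin"))  # two problems reported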

The data cleaning methods that are presented in this chapter are not actively concerned with the last two steps in the data cleaning process, as these are more a software and client-side business. It is stressed, however, that any change to the data should be documented; documentation of errors and error types to maximise prevention is therefore imperative.

4.2.2 Types of Errors

Different types of errors can occur in data. In this section, different error types are described according to the classification proposed by Müller and Freytag (2005). This classification of error types is more specific than that given by, for instance, Han and Kamber (2001), who only discern between missing values, noisy data, and inconsistent data. It is important to note that there are different types of noisy data and inconsistencies, which should not all be treated equally. There are three main types of errors that can occur in data, each of which has a subdivision of error types. Müller and Freytag's error types are described below.

A. Syntactic Errors

1. Wrong column errors are values that correctly belong to a particular database record, but they occur in the wrong cell. This is, for instance, the case if the information on the genus of a specimen is recorded in the column that is supposed to contain the province where it was collected.

2. Domain format errors are values that are not formatted properly. This is the case where the date format is specified as dd-mm-yyyy and the given value is formatted as yyyy-mm-dd.

3. Irregularities are cases in which values, units and abbreviations are not used uniformly. An example is a column where a part of the values is expressed in metres and another part in feet.

B. Semantic Errors

1. Content Errors are cases in which the value is simply incorrect. This is for instance the case if a record on a snake specimen contains the value amphibia in its ‘taxonomic class’ cell whereas it should be reptilia.

2. Dependency Violations are values that violate a dependency between different database cells. An example of this type of anomaly is the case where the value of the entry date precedes the value of the collection date, as a specimen cannot have been collected after it entered the collection; a check of this kind is sketched after this classification.

3. Duplicates are two or more records representing the same object.

4. Invalid tuples are records that do not represent valid entities in the domain. This type of pollution could, for example, be a database record on a reptile specimen that is entered in the bird database. This type of error does not occur in the data sets used in this work, because the different collections at Naturalis are treated by different departments and thus recorded in different databases.

C. Coverage Errors

1. Missing values are values that should have been recorded in the database but were not. Cases like this are found in the R&A database where the value for country of a specimen find is missing.

2. Missing tuples are database records that are completely missing. As mentioned in Chapter 2, the R&A database covers only 1/3 of the R&A collection of Naturalis, meaning 2/3 of the tuples are missing.
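As referred to above under dependency violations, such a check compares two date cells of the same record; the sketch below assumes the dd.mm.yyyy format introduced in Section 4.3 and hypothetical field names.

    from datetime import datetime

    def violates_date_dependency(record):
        """Flag records whose collection date lies after their entry date,
        which is impossible for a specimen."""
        collected = datetime.strptime(record["collection_date"], "%d.%m.%Y")
        entered = datetime.strptime(record["entry_date"], "%d.%m.%Y")
        return collected > entered

    print(violates_date_dependency(
        {"collection_date": "05.09.1968", "entry_date": "28.08.1968"}))  # True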

Most errors that are reported in natural history data are content errors that occur in the taxonomic, geographic or person name columns (Chapman, 2005). Errors in the taxonomic information regarding a specimen can be caused by an incorrect determination of the specimen. It can, for example, be the case that a specimen was determined quickly and imprecisely in the field, straight after collection. Sometimes errors in the taxonomic fields can be detected automatically as they are misspellings or inconsistencies with an accepted taxonomic resource. Some errors can only be detected through double-checking or revisiting the determination decision as part of collection maintenance.

Geographic errors are mostly induced by imprecise recording of a location in the field (e.g., 'Meyer's farm, 5 km South of Sipaliwini'). There are geographic inconsistencies that can be detected automatically in a database: those that pertain to changes in the naming of locations (e.g., Ceylon vs. Sri Lanka, Bombay vs. Mumbai) or inconsistencies in the geographic hierarchy (e.g., Alaska, Canada). Modern technology such as GPS units has made it easier for collectors to record the precise location of their find. Errors in person names are less frequent than errors in the taxonomic or geographic information about a specimen. The main error encountered is not of a semantic but of a syntactic order, namely inconsistent formatting. Person names are, for instance, given with or without initials and, if given, initials are found both before and after the last name. Citations in the R&A database are often incomplete, e.g., only an author is given (e.g., 'Kopstein') and the author name is sometimes even abbreviated (e.g., 'L.' for 'Linnaeus, 1758'). One could argue that experts know to which publication such an abbreviation refers, but for laypersons it is unintelligible and, due to its random nature, automatic indexing and linking to these publications is hampered.

4.2.3 R&A Database Error Analysis

To assess the quality of the databases at Naturalis, an error analysis was performed on the R&A database. For this study, a random sample of the database was manually checked for errors of three types: content errors, irregularities, and wrong column errors. To estimate the proportion of content errors in the R&A database, all cells containing taxonomic or geographical information for 257 randomly chosen records were checked manually. The taxonomic and geographical information is spread over 10 columns, totalling 2,570 database cells to check. The choice was made to restrict the content error checking to taxonomic and geographical information because the majority of this type of information can be checked against published resources. For other types of content errors, such as preservation method, access to the specimen is necessary, which would slow down the inspection process. Checking of the 2,570 cells took nine person hours and yielded 894 content errors (34.8%). 145 of these were caused by the use of a non-standard synonym and can for a large part be ascribed to an insufficient database structure. For instance, in the ‘taxonomic order’ column the value Sauria occurs frequently, but the correct value should be Squamata, as Sauria is a suborder of the reptile order Squamata. However, there is no suborder field, and therefore suborder information is systematically entered into the ‘taxonomic order’ column. This type of inconsistency will very likely not be a problem for the researchers and curators who entered the data and use the database on a daily basis, but it can cause problems for external researchers and for integrating the database with another data resource. For spelling errors and wrong column errors, 10,000 random non-numeric filled database cells were selected and checked manually. This amounts to 3% of the database. The manual inspection took approximately three hours for one person and yielded 123 spelling errors (1.23%) and 435 wrong column errors (4.35%).

4.3 Normalisation

Normalisation provides a means to clean up data through standardisation of formatting. Normalisation is necessary to resolve ambiguity and to improve the structure of the information. Normalisation is mostly a rule-based step that occurs before other data cleaning steps. Five types of normalisation were applied to the R&A database:

N1. Whitespace normalisation Trailing and leading whitespace within cells is removed and sequences of more than one whitespace character are replaced by a single space.

N2. Diacritics removal Diacritics were mapped to their ASCII equivalents, converting for instance Müller to Mueller. As diacritics are not used consistently throughout the databases, the problem of several forms for one name, such as Mueller and Muller, remains. Simple normalisation methods cannot tackle this problem as the alternative spelling variants may or may not refer to the same person.

N3. Date normalisation In the R&A database, dates occur in a variety of formats such as 18 Nov. 1982, 18-XI-1982, 18-11-1982, and in rare cases as 11-18-1982 (American style). Roman numerals indicating months and fully written-out month names are converted to their Arabic counterparts, and all dates are converted to dd.mm.yyyy format. In the case of British vs. American dates, a date is only converted to the British format when its validity can be determined (e.g., when most data is assumed to be in British format and a date such as 05.23.1969 is encountered). In ambiguous cases, such as 03.04.1938, automatic normalisation is not possible without more information or consultation of the original sources. (A sketch of normalisation steps N1–N3 is given after this list.)

N4. Person name normalisation Person names were normalised to ‘last- name, firstname’ or ‘lastname, initials’.

N5. Tokenisation Punctuation characters are separated from adjacent char- acters by whitespace.
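The normalisation steps N1–N3 can be illustrated with a minimal sketch. The month table, patterns, and helper names below are simplified stand-ins; they do not reproduce the original normalisation scripts.

    # Illustrative sketch of whitespace, diacritics, and date normalisation (N1-N3).
    import re
    import unicodedata

    ROMAN_MONTHS = {"I": 1, "II": 2, "III": 3, "IV": 4, "V": 5, "VI": 6,
                    "VII": 7, "VIII": 8, "IX": 9, "X": 10, "XI": 11, "XII": 12}
    MONTH_NAMES = {"jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
                   "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12}

    def normalise_whitespace(value):                        # N1
        return re.sub(r"\s+", " ", value).strip()

    def remove_diacritics(value):                           # N2: Müller -> Muller (the
        decomposed = unicodedata.normalize("NFKD", value)   # ue-transliteration described
        return decomposed.encode("ascii", "ignore").decode("ascii")  # above is omitted here)

    def normalise_date(value):                              # N3: convert to dd.mm.yyyy
        parts = re.split(r"[-.\s]+", normalise_whitespace(value))
        if len(parts) != 3 or not parts[0].isdigit():
            return value                                    # leave unparseable dates as-is
        day, month, year = parts
        if month.upper() in ROMAN_MONTHS:
            month = str(ROMAN_MONTHS[month.upper()])
        elif month[:3].lower() in MONTH_NAMES:
            month = str(MONTH_NAMES[month[:3].lower()])
        elif not month.isdigit():
            return value
        elif int(month) > 12 and int(day) <= 12:            # unambiguously American date
            day, month = month, day
        return f"{int(day):02d}.{int(month):02d}.{year}"

    # normalise_date("18 Nov. 1982") and normalise_date("18-XI-1982") both yield "18.11.1982";
    # the ambiguous "03.04.1938" is returned unchanged.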

Normalisation is not feasible or useful on database columns such as the ‘special remarks’ column, as such columns do not contain data in a predefined format. For the columns that could be normalised, the number of cells that were changed was recorded; the number and proportion of corrections provided by normalisation are given in Table 4.1. The scripts for the normalisation process were created by Caroline Sporleder and previously run as a preprocessing step. For the work done for this thesis, the normalisation process was revisited and the extent to which each of the database columns selected for normalisation was affected by the normalisation procedure is reported. Regarding Table 4.1, it must be noted that 100% of the ‘recorder date’ cells were normalised because, although these contained a consistent date format (yyyy-mm-dd), it was a format that was not used extensively elsewhere in the database. The amount of format consistency varies greatly; for some columns, such as Determinator and Donator, 0.10% of the cell values do not comply with the preferred format, whereas for others, such as the Determination Date, almost half (47.28%) of the cells need to be reformatted to fit the preferred format. This strengthens the claim that normalisation, seemingly trivial, is a necessary step in data cleanup that corrects a fair amount of the noise present.

4.4 Data-driven Data Cleaning

The main assumptions behind data-driven data cleaning are that (a) data elements in a database are conditionally dependent and (b) the majority of the data in a database is correct.

Type of Normalisation   Column               # Filled (%)      # Corr. (%)
Diacritics              Author               15,043 (89.17)    1,342 (8.92)
                        Collector            14,954 (88.64)    449 (3.00)
                        Determinator         10,036 (59.49)    4 (0.04)
                        Donator              4,395 (26.05)     50 (0.11)
                        Recorder             16,870 (100)      4,209 (24.95)
Date                    Collection date      14,288 (84.69)    4,789 (33.52)
                        Determination date   2,432 (14.42)     1,150 (47.28)
                        Entry date           9,144 (54.20)     497 (5.44)
                        Recorder date        16,852 (99.89)    16,852 (100)
Names                   Collector            14,954 (88.64)    1,674 (11.19)
                        Determinator         10,036 (59.49)    10 (0.10)
                        Donator              4,395 (26.05)     578 (13.15)
                        Recorder             16,870 (100)      4,209 (24.95)

Table 4.1: Statistics on corrections provided by normalisation. The table shows the number and percentage of filled cells per database column (Filled) and how many of these were affected by the normalisation process (# Corr., given in numbers and percentages)

In a specimen database such as the R&A database one can infer the information on the country in which a specimen was collected from the column containing information on the city in which a specimen was collected. A machine learning algorithm can be employed to make such inferences about data cells on the basis of the values in the other data cells in the database. Whenever the classifier predicts a value that differs from the actual value in the cell, this cell is flagged as a possible error and needs to be checked by an expert. In Subsection 4.4.1, the data-driven data cleaning approach that was developed in the Mitch project, named Timpute, is described. In Subsection 4.4.2 the experiments and results of the error correction process on the R&A database with Timpute are presented and analysed.

4.4.1 Timpute

Following the main assumptions stated above, in the approach presented here a machine learning algorithm is applied to predict a value in a database cell, given the values in the other database cells. Any predicted value that deviates from the original value is flagged as suspicious and presented to a user to check whether the original cell value is erroneous. This approach is inspired by Knorr and Ng (1998), who also utilise algorithms that use a distance function to assess a data point’s similarity to other data points. To check whether a particular database cell contains an incorrect value, the data cleaning problem is recast as a classification task. In this classification task, the training data for a machine learning algorithm is formed by all database records, except for the one in which a cell is to be predicted. The different database columns serve as features and the class that is to be predicted is the value of the target column in the target record (Sporleder et al., 2006d). For instance, to predict the value of the ‘country’ cell for Object 3047 in the R&A database, a classifier is trained to predict this value based on the values of all other cells of Object 3047. All database records except the one for Object 3047 serve as training data. The prediction of the ‘country’ value for Object 3047 is based on the similarity of the values of other cells in the record for Object 3047 and the other records in the database. In Figure 4.1, a schematic overview of this approach is presented.

Figure 4.1: Schematic overview of data-driven error detection. (Each cell value is in turn blanked out and tested against the contents and dependencies of all other cells; the machine learning algorithm compares the database records to predict a value for the cell under scrutiny, and a prediction that deviates from the original value indicates a possible error together with a suggested correction.)

This process is repeated for every database cell. In order to do this efficiently, the TiMBL implementation of the k-NN algorithm is chosen (see Chapter 3, Daelemans et al. (2004)). The main feature that makes this classifier particularly suitable for this task is that it does not abstract a model from the training data; instead it stores all training records in memory. In the classification phase it compares the database record to all stored records and assigns, according to some similarity metric, the class of the k most similar records to the record to be predicted. Because the k-NN classifier does not have to abstract a new model every time the training set changes (which is every time a new cell is to be predicted), the algorithm can perform this error detection task much faster than algorithms that do abstract a model from the training data. As in the experiments described in Section 3.1, the TiMBL implementation is used with standard settings.
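To make the leave-one-out set-up concrete, the sketch below uses a hand-rolled overlap-metric k-NN as a stand-in for TiMBL; the column names, the value of k, and the record representation (one dictionary per record) are illustrative and are not taken from the actual Timpute implementation.

    # Sketch of data-driven error detection: predict each cell from the other cells
    # of its record using overlap-based k-NN over all other records, and flag
    # disagreements. A stand-in for the TiMBL-based set-up, not the original code.
    from collections import Counter

    def overlap(record_a, record_b, skip_column):
        """Number of matching cells, ignoring the column that is being predicted."""
        return sum(1 for col in record_a
                   if col != skip_column and record_a[col] == record_b.get(col))

    def predict_cell(records, target_index, target_column, k=1):
        """Predict the value of one cell from the k most similar other records."""
        target = records[target_index]
        scored = [(overlap(target, other, target_column), other[target_column])
                  for i, other in enumerate(records) if i != target_index]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        neighbours = [value for _, value in scored[:k]]
        return Counter(neighbours).most_common(1)[0][0]

    def flag_suspicious_cells(records, columns, k=1):
        """Return (record index, column, original value, suggested value) for every
        cell in which the prediction deviates from the stored value."""
        flagged = []
        for i, record in enumerate(records):
            for column in columns:
                suggestion = predict_cell(records, i, column, k)
                if suggestion != record[column]:
                    flagged.append((i, column, record[column], suggestion))
        return flagged

    # Hypothetical usage on a list of record dictionaries:
    # flagged = flag_suspicious_cells(records, ["class", "order", "country"], k=3)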

It is not possible to check every type of data via this method, as some database cells may not depend on other database cells. In the R&A database this is, for example, the case with the data that describes the database record, such as the time the database record was created and its ID. The independent columns are called extrinsic columns. Their opposites are intrinsic columns, which contain information that is closely linked to the specimen, such as its place in the taxonomy and its finding location.

After every database cell has been predicted by the classifier, cells in which the predicted value deviates from the original database value are presented (together with the value the classifier suggests) to a human expert, for two reasons. First, it cannot be assumed that all deviations are actual errors, as some database values might simply be extreme values (Chambers et al., 2004). The second reason for keeping the human expert involved is that it is known that there are synonymous values in the data, which should be preserved and linked to each other rather than corrected. The expert has to judge the cases and determine (1) whether the predicted value is correct (the database value was erroneous), (2) whether the original database value is correct and the classifier made an error, (3) whether both were incorrect (this is the case where the algorithm detects an error but fails to predict the correct value), or (4) whether the classifier predicted a synonymous term.

4.4.2 Experiments and Results

In this subsection, the experiments and results of the Timpute system on the R&A database are described. First, the experiments measuring precision are presented, followed by the experiments that were carried out to estimate recall.

Predictability and Precision

In this subsection, the results of the Timpute experiments on the R&A database are presented. For completeness’ sake, Table 4.2 also shows the classifier’s performance on the extrinsic columns, confirming that most of these columns are problematic for the classifier. It is also not desirable to attempt to predict columns with a maximal entropy, such as the ‘record id’ column, as this contains as many classes as instances and is thus perfectly unpredictable. When using the Timpute system it is therefore important to pre-select columns for the classifier on the basis of their entropy. As can be seen in Table 4.2, dependent intrinsic columns such as ‘class’ and ‘order’ can be predicted with high accuracy. Independent and extrinsic columns such as ‘collection #’ and ‘recorder time’ are more difficult to predict for Timpute. In Figure 4.2, the accuracy of Timpute is plotted against the entropy of the database column. As Figure 4.2 shows, the accuracy decreases as the entropy increases, meaning that columns with a higher entropy are harder to predict for Timpute than columns with a lower entropy. In order to check whether the deviations that were detected by Timpute are true errors or mistakes made by Timpute, the deviations were presented to a human judge. As checking these is time-consuming and for certain types of data requires access to the actual specimen (such as for the column ‘preservation method’), the precision checking was limited to only three columns, namely ‘taxonomic order’, ‘taxonomic class’, and ‘country’, as checking these columns does not require access to the animal specimen and fewer than 100 deviations were flagged for each, limiting the time needed for the evaluation. The correction results are shown in Table 4.3. As can be seen in Table 4.3, the algorithm only predicts the correct value in a minority of the cases (50% for Class, 12.5% for Order and 2.9% for Country); however, in the majority of the taxonomic cases it does detect errors and synonyms (60% for Class, 52% for Order).
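The entropy-based pre-selection mentioned above can be sketched as follows; the threshold is illustrative, and the table is assumed to be a mapping from column names to lists of cell values.

    # Sketch of entropy-based column pre-selection: columns whose value distribution
    # approaches maximal entropy (such as a record ID) are excluded from prediction.
    import math
    from collections import Counter

    def column_entropy(values):
        """Shannon entropy (in bits) of the value distribution of one column."""
        counts = Counter(v for v in values if v not in ("", None))
        total = sum(counts.values())
        if total == 0:
            return 0.0
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    def select_predictable_columns(table, max_entropy=8.0):
        """Keep the columns whose entropy stays below an (illustrative) threshold."""
        return [column for column, values in table.items()
                if column_entropy(values) <= max_entropy]

    # Example: a 'record id' column with 16,870 distinct values has entropy
    # log2(16870) ≈ 14.04 bits and would be excluded.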

Column                # Inst    # Classes   Type   Entropy   # Flagged   Pred.
Class                 15,753    3           I      0.99      10          99.93
Order                 9,367     13          I      1.75      88          99.06
Family                15,689    83          I      4.42      512         96.74
Genus                 16,680    649         I      6.85      1,376       91.75
Species               9,841     780         I      7.56      887         90.98
Subspecies            2,435     176         I      5.22      97          96.02
Author                15,043    1,042       I      7.56      1,602       89.35
Determination date    2,432     257         E      5.72      284         88.32
Determinator          10,036    151         I      3.93      240         97.60
Type-name             374       122         I      5.06      94          74.87
Type                  513       28          I      3.24      194         76.22
Country               1,529     71          I      3.72      68          95.55
Country id            16,831    125         E      3.88      876         94.80
Province              9,016     513         I      5.96      1,051       88.34
Town/city             15,108    3,564       I      10.02     4,109       72.80
Location              1,554     652         I      7.53      480         69.11
Coordinates           568       117         I      3.41      81          85.73
Altitude              2,428     333         I      6.92      442         81.80
Biotope               699       193         E      6.03      244         65.09
Collector             14,954    1,055       I      6.02      2,069       86.16
Collection date       14,288    3,239       I      10.14     4,686       67.20
Collection #          8,227     4,095       E      11.03     3,507       57.37
Number                9,071     34          E      0.43      576         93.65
Preservation method   15,370    42          E      0.28      193         98.74
Donator               4,395     508         I      4.86      644         85.35
Entry date            9,144     811         E      6.08      1,070       88.30
Label information     8,795     1,813       E      9.99      412         95.31
Printed               16,805    6           E      0.29      67          99.60
Sex                   2,947     47          I      1.75      1,026       65.18
Publication           2,258     86          I      2.69      57          97.48
Special Remarks       9,611     2,537       E      9.53      2,157       77.55
Inventory #           15,937    3           E      0.01      2           99.99
Reference #           15,937    2           E      0.00      1           99.99
Registration #        16,870    16,769      E      14.03     16,827      00.25
Record ID             16,870    16,870      E      14.04     16,870      00.00
Recorder              10,039    8           E      1.15      87          99.13
Recorder date         16,852    534         E      7.83      1,954       88.40
Recorder time         9,997     7,554       E      12.62     9,512       04.85

Table 4.2: Measures of predictability for each database column in the R&A database. The column name is given, as well as the number of records that is filled (# Inst), the number of different values (# Classes), the type of the column (whether it gives intrinsic or extrinsic information about the specimen), the entropy of the class, the number of disagreements flagged by the classifier (# Flagged), and the predictability of the column as given by the accuracy of the classifier (Pred.)

Figure 4.2: Accuracy of Timpute plotted against the entropy for the reptiles and amphibians database columns (scatter plot; accuracy on the y-axis, entropy on the x-axis).

                 Class     Order       Country
# Flagged        10        88          68
Corrected (%)    5 (50)    11 (12.5)   2 (2.9)
Detected         1 (10)    17 (19.3)   6 (8.8)
Synonyms         -         18 (20.5)   18 (26.5)
Timpute Errors   4 (40)    39 (44.3)   37 (54.4)
Unassessable     -         3 (3.4)     5 (7.4)

Table 4.3: Error correction precision, showing per column the number of disagreements flagged and presented to the expert (# Flagged), the number of cases in which the classifier predicted the right value (Corrected), the number of cases in which the classifier detected an error but did not manage to provide the correct suggestion, on top of the corrected ones (Detected), the number of synonymous terms suggested by the classifier (Synonyms), the number of cases in which the classifier erroneously disagreed with the database (Timpute Errors), and the number of cases that were unassessable without access to the object (Unassessable)

Input from domain experts is of vital necessity to link these synonyms properly and to provide corrections for detected errors in order to arrive at a cleaned-up and enriched database. The results presented in Tables 4.2 and 4.3 do not indicate how many of the errors present in the database are caught by the approach. As the manual error analysis in Subsection 4.2.3 showed, it is not desirable to detect all errors manually in order to measure recall. Therefore, a series of experiments was carried out in which the recall is estimated; these experiments are presented below.

Recall

Although the precision results of the experiments show that it is possible to improve the data, it is also important to know what portion of the data is improved, i.e., what proportion of the actual errors is caught. However, in order to measure this it would be necessary to know in advance what the errors are, which would require that the entire database be checked manually. As an alternative, the recall is estimated by introducing errors artificially into the data. Timpute is then run on the data with artificially introduced errors, and the percentage of artificial errors detected by Timpute is taken as an estimate of the recall performance. The number of errors that was introduced artificially for the recall estimation experiments is based on the manual error evaluation in Subsection 4.2.3. For content errors, 29.14% of the values of the taxonomic and geographic columns were swapped within a column. For wrong column errors, 4.35% of the values were swapped within a record for the taxonomic and geographic columns. For spelling errors, lists were compiled that contain possible options for spelling errors within a column. According to Reynaert (2005), approximately 80-87% of typos are found within Levenshtein distance 1. The Levenshtein distance is a metric that indicates how many mutations are needed to transform one sequence into the other (Levenshtein, 1966). However, certain spelling errors occur more often than others, for example because certain keys are closer together on a keyboard. It is therefore not realistic to simply add, delete or swap a character in terms when artificially introducing errors for the recall experiment. Therefore, for every taxonomic and geographic database column a list of possible spelling errors was compiled that contained all terms as well as the terms occurring in the same column at Levenshtein distance 1. This list was then used to automatically insert a spelling error into 1.23% of the cells of the taxonomic and geographic columns. It must be noted that the recall evaluation does not investigate the possible prediction of synonyms. In Table 4.4, the numbers and percentages of spelling errors and wrong column errors detected by Timpute are given. In Table 4.5, the numbers and percentages of content errors detected by Timpute are given.
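A minimal sketch of the artificial spelling-error introduction is given below, assuming per-column value lists; the distance-1 confusion lists and the 1.23% rate follow the description above, while the helper names and the random seed are illustrative.

    # Sketch of artificial spelling-error injection for the recall estimate: errors
    # are drawn from terms of the same column at Levenshtein distance 1.
    import random

    def levenshtein(a, b):
        """Standard dynamic-programming edit distance."""
        previous = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            current = [i]
            for j, cb in enumerate(b, 1):
                current.append(min(previous[j] + 1,                # deletion
                                   current[j - 1] + 1,             # insertion
                                   previous[j - 1] + (ca != cb)))  # substitution
            previous = current
        return previous[-1]

    def confusion_list(column_values):
        """Map each term to the other terms of the same column at distance 1."""
        terms = sorted(set(column_values))
        return {t: [u for u in terms if u != t and levenshtein(t, u) == 1]
                for t in terms}

    def inject_spelling_errors(column_values, rate=0.0123, seed=1):
        """Replace a proportion of cells by a distance-1 neighbour, recording where."""
        rng = random.Random(seed)
        confusions = confusion_list(column_values)
        corrupted, error_positions = list(column_values), []
        for i, value in enumerate(column_values):
            if confusions[value] and rng.random() < rate:
                corrupted[i] = rng.choice(confusions[value])
                error_positions.append(i)
        return corrupted, error_positions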

Column                 Pred.    # Typos   # Det. (%)     WC    # Det. (%)
Class                  80.21    193       193 (100)      745   745 (100)
Family                 62.34    192       191 (99.48)    742   742 (100)
Order                  67.01    193       193 (100)      744   741 (99.60)
Genus                  59.02    205       199 (97.07)    788   788 (100)
Species                56.70    204       195 (95.59)    787   786 (99.87)
Author                 57.38    185       172 (92.97)    711   710 (99.86)
Country                59.84    18        18 (100)       72    72 (100)
Province               55.77    110       104 (94.55)    426   426 (100)
Town/city              51.19    185       183 (98.92)    714   713 (99.86)
Location               67.05    19        19 (100)       73    73 (100)
Preservation method    93.12    189       189 (100)      727   720 (99.03)
Special Remarks        73.90    118       105 (88.98)    454   447 (98.46)
Biotope                62.96    25        20 (80)        92    91 (98.91)
Collector              82.61    184       181 (98.37)    707   703 (99.43)
Determinator           91.79    124       124 (100)      474   468 (98.73)
Donator                80.02    54        54 (100)       207   206 (99.52)
Type                   71.71    6         2 (33.33)      24    24 (100)
Publication            32.94    27        27 (100)       106   106 (100)
Recorder               60.93    208       208 (100)      797   786 (98.62)

Table 4.4: Results of Timpute on recall estimate for typos and wrong column errors (WC) per column in number of detected errors (# Det.) and percentages (%)

As Tables 4.4 and 4.5 show, almost 100% of the artificially introduced errors are caught. However, the predictability of the columns deteriorates in comparison to the experiments presented in Subsection 4.4.2, so more possible errors are flagged. The decrease in predictability is probably due to the fact that the artificial errors were added on top of the existing errors, which changed the distribution of values over the columns in such a way that the classifier had more trouble discerning patterns. The error correction is quite low when the classifier predictions are compared to the original cell value (i.e., before the artificial error was inserted). The scores for error correction lie between 0 and 50% for the correction of spelling errors, 0 and 41% for wrong column errors, and 0 and 42% for content errors.

Column        # Content errors   # Detected (%)
Class         4590               4589 (99.98)
Family        4571               4571 (100)
Order         4589               4342 (94.62)
Genus         4860               4860 (100)
Species       4852               4852 (100)
Author        4383               4383 (100)
Country       445                445 (100)
Province      2627               2627 (100)
Town/city     4402               4402 (100)
Publication   657                657 (100)

Table 4.5: Results of Timpute on recall estimate experiments for content errors

This is not surprising, as the manual inspection of errors in Subsection 4.4.2 also showed that error correction percentages are low. The fact that almost 100% of the artificially introduced errors are detected indicates that Timpute can provide significant improvements in data quality by reliably detecting database cells that may contain an error. This effectively reduces the workload for the human expert, as she only needs to check a small number of cells per database column instead of the entire column.

4.5 Ontology-driven Data Cleaning

In addition to applying a data-driven, or soft-reasoning, data cleaning approach, a rule-based, or hard-reasoning, data cleaning approach was tested. The approach that will be presented in this section, called Validato, also utilises the fact that the values in some columns of the R&A database are dependent on each other, which poses constraints on the content of such columns. If the value in the ‘taxonomic order’ field of a record is Anura, for example, then the value for ‘taxonomic class’ should be Amphibia according to the zoological taxonomy. If this is not the case, then either the value for ‘taxonomic order’ or for ‘taxonomic class’ is incorrect, or both are incorrect. For the construction of the rules to test whether values in the database comply with constraints given by knowledge about the domain, the manually constructed ontology that was presented in Subsection 3.2.2 is taken as the source of domain knowledge. Constructing rules from an ontology is a widely used approach in the Semantic Web community (Berners-Lee et al., 2001).

Similar to manual ontology construction, the construction of rules to assess the consistency of a data set is a process that is not easily ported to other domains. Therefore, it is not applied to all classes in the natural history domain, but only to the concepts that were deemed the most common to the natural history domain (i.e., the temporal, geographical and taxonomic classes in the R&A database). In addition to this, the class had to be related to another class or other classes through a dependency relation. An example of such a relation is the Has broader term relation that exists between the Species and Genus classes in the ontology presented in Chapter 3, as it is possible to check whether the value of the E55 Type: Genus of a particular instance is indeed a broader term of the value of the Species class. A relation that does not facilitate such checking is, for instance, the Is identified by relation that exists between Specimen and Registration ID; it is not possible to infer the value of Registration ID from any of the other classes. The remainder of this section is organised as follows. In Subsection 4.5.1, related work in rule-based error correction is described. In Subsection 4.5.2 the Validato approach is described, followed by the experiments and results in Subsection 4.5.3.

4.5.1 Related Work

An approach that utilises the dependency between database columns is that of Kalashnikov and Mehrotra (2006), who exploit relationships between database columns to identify and disambiguate different references to one object, thus addressing the record linkage problem mentioned in Chapter 3. The work by Kalashnikov and Mehrotra can only take on one particular type of inconsistency; this is inherent to the method, as rules are very precise but can only have a small scope (Kalashnikov and Mehrotra, 2006). Other work on ontology-based data cleaning has been carried out by Milano et al.; Kedad and Métais; Gong and Mu, but with a different focus to the work done for this thesis. Milano et al. (2005) utilise an ontology to check a database structure and not its content. Ontologies are also used to facilitate merging or linking of different data sets, as described by Kedad and Métais (2002). More relevant to the work done for this thesis is the work by Gong and Mu (2000), who check the data values in a spatial database through rules based on relationships between objects that are checked against geographical data from other resources. The ontology-driven data cleaning method that is presented in the next subsection also relies on external resources, but in addition to geographic resources, taxonomic resources are employed as well.

4.5.2 Validato

Knowledge about a domain can be used to identify inconsistencies in data from that domain. This is the main assumption behind the ontology-driven data cleaning method presented here. As mentioned in Chapter 1, general knowledge about a domain can be captured in an ontology. This was put into practice by the development of an ontology for the natural history domain in Chapter 3. Constraints that hold for the classes in the ontology should also hold for the individual objects in the domain that are described by the records in the database. By making constraints from the knowledge explicit, database instances can be identified that do not satisfy these constraints, indicating possible errors in the data. This approach is different from normalisation in that it addresses content restrictions, whereas normalisation is mainly concerned with formatting restrictions and also does not exploit the interdependencies that exist between different database cells. The ontology-driven data cleaning approach is complementary to the data-driven approach presented in the previous section as, as Subsection 4.5.3 will show, it selects different database values for an expert to check. In addition to that, as it is a hard-reasoning approach, it ensures that cases that do not comply with the rules are flagged, whereas the data-driven approach may let some inconsistencies slip that are not sufficiently significant to the classifier. The knowledge used in the Validato experiments comes from the natural history ontology as described in Subsection 3.2.2 and the geographic and taxonomic resources described in Subsections 2.3.1 and 2.3.2, respectively. A schematic overview of the system is given in Figure 4.3. The ontology is represented on the left-hand side. In this figure, the operators >, <, == and != are used to express possible relations in a domain. In the schematic overview of the system there is a > relation between classes A and B and an == relation between classes B and C. These relations are imposed on the database, which is represented on the right-hand side of the figure. The classes are translated to the database columns and the ontological relations to relations between the database columns. If values in the database do not comply with the relations or rules that hold between the database columns, such as those between a2 and b2 and between b1 and c1, they are flagged as possibly erroneous and returned to the user to validate the system’s decision.

Figure 4.3: Schematic overview of the ontology-based error correction approach. (The ontological relations between classes A, B and C are shown on the left; the database columns with values a1–a3, b1–b3 and c1–c3, on which these relations are imposed, are shown on the right.)

In the remainder of this section, the experiments and results of the data cleaning study with Validato are presented.

4.5.3 Experiments and Results

In the Validato experiments, the consistency of two values from two different database columns was checked against the restrictions from the ontology and the resources. This means that instead of testing whether every value in a database record concerning a specimen’s position in the taxonomy is consistent with the resource (i.e., ‘subspecies’, ‘species’, ‘genus’, ‘family’, ‘order’, and ‘class’), one part of this chain at a time was investigated in a bottom-up fashion (e.g., ‘subspecies’ - ‘species’, ‘species’ - ‘genus’, ‘genus’ - ‘family’, etc.). This choice was made for two reasons. First, it made it possible to zoom in on the fields of the database records for which the data did not comply with the knowledge about the domain, instead of having to check all fields in the record. The second reason is that it is simpler to check from the bottom up than from the top down; it is easier to check whether a particular city lies in a particular country by looking up the city and then its country in an atlas than by looking up the country and then going through all cities in it. The experiments were divided into three groups: temporal, geographical, and taxonomic experiments, which are presented separately in this subsection. The choice was made to limit the approach to these columns as these contain the most interdependent data in the database and these types of information are more prevalent; therefore the results from the experiments presented here can more easily be reused than experiments on, for example, the concepts describing a publication related to a specimen.

Temporal

To identify inconsistencies in the temporally related information in the database, the database columns containing date information were selected for inspection. The four columns, ‘collection date’, ‘entry date’, ‘determination date’, and ‘recorder date’, correspond to E22 Temporal Entity classes in the manually constructed ontology and are interrelated by P120: Occurs before relations. The chronological order of the events related to these dates is summarised in Table 4.6. Inferred relations are also listed: as the database is not complete, in some cases a value cannot be checked against its direct neighbour in the chronology because that neighbour is missing in the database, but it can be checked against the next neighbour. Such a relation is, for instance, present between the Collection and Creation of Database Record concepts. In the chronological course of the animal collection and registration process the Collection occurs first, after which the Entry in Collection takes place. The Entry in Collection is followed by the Determination event and then the Creation of Database Record takes place. The events in the chronological chain also inherit the order from the later or earlier events, and thus it can be inferred that the Collection takes place before the Creation of Database Record event. This is indicated by the inferred relations in Table 4.6.

Event                  Relation (→)     Event
Collection             Occurs before    Entry in Collection
Entry in Collection    Occurs before    Determination
Determination          Occurs before    Creation of Database Record
Inferred
Collection             Occurs before    Determination
Collection             Occurs before    Creation of Database Record
Entry in Collection    Occurs before    Creation of Database Record

Table 4.6: Summary of chronological relations present in specimen data

The temporal information regarding Collection is described by the column ‘collection date’. The temporal information pertaining to Entry in Collection is described by ‘Entry date’. For Determination the temporal information is described by ‘Determination date’ and for Creation of Database Record the temporal information is described by ‘Recorder date’. The results of the consistency check on dates are presented in Table 4.7. The overlap with the results from the data-driven error detection experiments is also reported. When a constraint violation is detected by the rule, it is not possible to determine which date is the one containing an erroneous value. In order to assess the overlap between Timpute and Validato, for both dates that were flagged as suspicious a look-up was performed to check whether Timpute also regarded this value as suspicious. The results are reported as overlap with Timpute in column 1 (OTC1), overlap with Timpute in column 2 (OTC2), and overlap with Timpute in both columns (OTCB) in Table 4.7. Here, OTC1 reports the number of cases in which Timpute flagged the same inconsistencies as Validato for the first column, OTC2 reports the same but for the second column, and OTCB presents the intersection of these, where Timpute flags a possible error in both columns. For the ‘collection date’-‘entry date’ pair this means that OTC1 reports the number of overlapping cases for the ‘collection date’ from Timpute, OTC2 reports the number of overlapping cases for the ‘entry date’, and OTCB reports the cases in which Timpute flagged both the value for ‘collection date’ and ‘entry date’. In total, Timpute flagged 4,686 possible errors for the ‘collection date’ column, 1,070 for the ‘entry date’ column, 284 for the ‘determination date’ column and 1,954 for the ‘recorder date’ column, considerably more than Validato.
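A minimal sketch of the temporal constraint check is given below, assuming the dates have already been normalised to dd.mm.yyyy; the column names follow the mapping above and the rule list mirrors Table 4.6, while the helper names are illustrative.

    # Sketch of the ontology-driven temporal check: flag records in which a later
    # event in the chronological chain carries an earlier date.
    from datetime import datetime

    # Direct and inferred 'Occurs before' relations from Table 4.6, as column pairs.
    OCCURS_BEFORE = [
        ("collection date", "entry date"),
        ("entry date", "determination date"),
        ("determination date", "recorder date"),
        ("collection date", "determination date"),
        ("collection date", "recorder date"),
        ("entry date", "recorder date"),
    ]

    def parse_date(value):
        try:
            return datetime.strptime(value, "%d.%m.%Y")
        except (TypeError, ValueError):
            return None            # missing or unparseable dates cannot be checked

    def temporal_violations(record):
        """Return the column pairs of one record that violate the chronology."""
        violations = []
        for earlier_col, later_col in OCCURS_BEFORE:
            earlier = parse_date(record.get(earlier_col))
            later = parse_date(record.get(later_col))
            if earlier and later and earlier > later:
                violations.append((earlier_col, later_col))
        return violations

    # Example: a record collected in 1982 but with an entry date in 1979 is flagged
    # for the ('collection date', 'entry date') pair.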

Columns                      Flagged by Validato   OTC1   OTC2   OTCB
Collection - Entry           64                    31     13     13
Entry - Determination        7                     7      7      7
Determination - Recorder     26                    5      8      2
Collection - Determination   5                     1      5      1
Collection - Recorder        5                     3      1      1
Entry - Recorder             2                     0      0      0

Table 4.7: Results of ontology-based error detection experiments on temporal data

Validato cannot suggest a correct date if a constraint violation is encountered, as the domain has no rules about how much time there should be between the different events. This has the additional effect that only cases are flagged in which a value clearly violates the constraints imposed by the ontology, whereas cases in which the value is incorrect but does not violate the constraints remain undetected. The data-driven approach presented in Section 4.4 may also not remedy these cases, as they may be so inconspicuous that more information is needed to detect them. Such information could come from resources that describe the period during which expeditions took place or when a determinator was employed at the institute, from which constraints with a finer granularity can be defined.

Geographical

To detect inconsistencies in the geographical information, such as a record that contains a value for a city and an incorrect country (e.g., city: Paris, country: Italy), the Falls within relations are translated to rules that flag pairs of database cells that do not comply with this restriction. In order to do so, the values from the different cells are looked up in the GeoNames database, and if no containment relation is found in the returned records the database entry is flagged as containing possibly inconsistent geographic information. The relations that hold between selected geographic classes in the specimen database are summarised in Table 4.8.

Class      Relation       Class
City       Falls within   Province
Province   Falls within   Country
Inferred
City       Falls within   Country

Table 4.8: Summary of relations holding between the different geographic classes in the specimen data

Due to the multilingual nature of the data, the rules left room for variation. If, for instance, the city-country pair ‘swamp ca . 10 km E . of Parga’-‘Griekenland’1 was encountered, the city value was first stripped of all non-capitalised and numeric tokens and tokens shorter than 2 letters. This resulted in the value ‘Parga’, which was then queried against the GeoNames database, resulting in 20 records returned (a part of these records is shown in Figure 4.4).

1 The Dutch name for Greece.

Figure 4.4: Screenshot of the first eight results returned by GeoNames for query “Parga”
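The look-up just described can be sketched as follows. The sketch assumes the GeoNames searchJSON web service and the countryName field of its JSON response; the username, the synonym table, and the helper names are illustrative.

    # Sketch of the city-country consistency check against GeoNames.
    import json
    import urllib.parse
    import urllib.request

    GEONAMES_URL = "http://api.geonames.org/searchJSON"
    USERNAME = "demo"   # hypothetical; a registered GeoNames account is required

    def clean_city_value(raw):
        """Strip non-capitalised, numeric, and one-letter tokens, as in the
        'swamp ca . 10 km E . of Parga' -> 'Parga' example."""
        kept = [t for t in raw.split()
                if t[:1].isupper() and not t.isdigit() and len(t) >= 2]
        return " ".join(kept)

    def geonames_countries(place, max_rows=20):
        """Return the set of country names GeoNames associates with a place name."""
        query = urllib.parse.urlencode(
            {"q": place, "maxRows": max_rows, "username": USERNAME})
        with urllib.request.urlopen(f"{GEONAMES_URL}?{query}") as response:
            records = json.load(response).get("geonames", [])
        return {record.get("countryName", "") for record in records}

    def check_city_country(city_value, country_value, country_synonyms):
        """True means: flag the pair as possibly inconsistent. country_synonyms maps
        database spellings (e.g. 'Griekenland') to the names used by GeoNames."""
        city = clean_city_value(city_value)
        normalised_country = country_synonyms.get(country_value, country_value)
        return normalised_country not in geonames_countries(city)

    # Example: the Parga record is not flagged once 'Griekenland' is mapped to 'Greece'.
    # check_city_country("swamp ca . 10 km E . of Parga", "Griekenland",
    #                    {"Griekenland": "Greece"})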

The approach was quite lenient: for every returned result, all alternatives for the country name in the languages present in the R&A database were looked up and compared to the original country value ‘Griekenland’. In this case, there is a positive match, as ‘Griekenland’ is the Dutch word for ‘Greece’, and thus the record is not flagged as containing inconsistent geographic information. Ideally, the database contains information in a single language, but this is not the case for the R&A database. It is imaginable that information in GeoNames is used to fill an extra database column with the name of a geographical entity in one particular language chosen for the database. For the ‘Griekenland’ example this could result in an extra column that contains the term ‘Greece’, so that when a user searches for objects found in ‘Greece’, the objects that originally had ‘Griekenland’ in the ‘country’ field are also retrieved. This is outside the scope of the work done for this thesis, but was looked into for the implementation of the CRS at the Naturalis research laboratories. In cases where no match between the city and country values is found in GeoNames, the database entry is flagged as containing an inconsistency and the name of the country for which most hits were found is returned as a suggestion. A similar process is carried out for all ‘province’ - ‘country’ and ‘city’ - ‘province’ value pairs. In Table 4.9, the number of instances flagged per column is presented, as well as the overlap with the Timpute results presented in Section 4.4. The small overlap indicates that the data cleaning methods are complementary.

Column     # Flagged by Validato   # Flagged by Timpute   Overlap
City       38                      4,109                  15
Province   23                      1,051                  5
Country    47                      68                     6

Table 4.9: Number of disagreements flagged per column in all geographic experiments in total (Flagged) and number of instances flagged by both the ontology-driven and the data-driven data cleaning approaches (Overlap)

The results of the experiments are presented per pair of columns in Table 4.10. The disagreements flagged by the ontology-driven data cleaning approach were analysed and classified as either being cases in which one of the terms could not be found in the geographic resource (NF), cases in which the value was correct but in the wrong column (wrong column errors, denoted by ‘WC’ in the table), cases in which a content error is detected (CE) and cases in which the database uses a synonym that is not found in GeoNames (SYN). The numbers in brackets indicate how many of the cases were unique errors.

Columns              # Flagged   NF       WC       CE       SYN
city - province      51          30 (6)   1        20 (4)   0
province - country   1           0        1        0        0
Inferred relations
city - country       55          8 (6)    15 (4)   1        31 (4)

Table 4.10: Results of ontology-based error detection experiments on geographical information

The most prevalent cause for the system to flag a possible error is the non-standard usage of the ‘city’, ‘province’ and ‘country’ columns. In the ‘city’ column, values are found such as ‘4 km W. van airstrip Tafelberg’ and ‘Rechter kabalebo rivier, kamp keyzer, voet K. valle’. It is indeed a dilemma for the person entering the data, as sometimes a specimen is not collected in a city, but somewhere near it. However, entering data in a ‘5km NW of’ or ‘near’ format makes it very hard to disambiguate the location of a specimen find. Modern technology can aid in such cases, as a location can be unambiguously defined by the use of a GPS device. To remedy this in older data such as the data at hand, the column should perhaps be renamed to ‘city or nearest city’ and a separate column could be devised in which additional information such as ‘4km W. of airstrip’ could be entered.

Most errors in the city-province experiments were a systematic mix-up of the two Surinam districts Nickerie and Sipaliwini; this is probably caused by many specimens being collected there and the erroneous value proliferating. In the cases where the value in the city field could not be matched properly, there was often a very common city name involved (such as St. Jean) and a province value that could not be matched (Dep. Guyane) because it was, for example, abbreviated in a non-standard way and a non-English term was involved. This touches upon the limitations of GeoNames, as Departement Guyane2 would have matched. The fact that only one error is found for the province-country combination is due to the country field fairly often being empty. The entry that is flagged as erroneous contains the continent value ‘Zuid-Amerika’ (South America) in the country field and the value ‘South America’ in the province field. The majority of the errors in the ‘city’ - ‘country’ experiments is caused by the fact that a term is used for the country name that is not present in GeoNames (e.g., U.S.A. for United States). In nine cases, the city name cannot be disambiguated properly by GeoNames, for instance because of a typo. It occurs that the database contains La Rochette - Luxemburg, and the system suggests La Rochette - Belgium, whereas the value could also be Larochette - Luxemburg. For such cases, it is of vital importance that an expert checks the suggestions of the system. The approach also detects wrong column errors, such as the value ‘Kaapverdische eilanden’ (‘Cape Verde Archipelago’) that is present in the ‘country’ field where it should be Brazil for a record with the value Ilha de Santa Luzia in the ‘city’ field.

Taxonomic

Taxonomic inconsistencies in the data are detected through a similar process as the detection of geographic inconsistencies. The taxonomic hierarchy in CIDOC-CRM is defined through the Has broader term relation. This relation applies to ‘species’, ‘genus’, ‘family’, ‘order’ and ‘class’ consecutively, as shown in Table 4.11. The ‘subspecies’ level could not be queried as the data formatting of the resources prevented reliable identification of the ‘subspecies’ values. As Table 4.12 shows, there is hardly any overlap between the disagreements flagged by the data-driven cleaning approach and the ontology-driven data cleaning approach. Only for the species column do the approaches agree on the majority of the disagreements.

2 Although the full term is Département de la Guyane.

Taxonomic Level   Relation           Taxonomic Level
Species           Has broader term   Genus
Genus             Has broader term   Family
Family            Has broader term   Order
Order             Has broader term   Class
Inferred
Species           Has broader term   Family
Species           Has broader term   Order
Species           Has broader term   Class
Genus             Has broader term   Order
Genus             Has broader term   Class
Family            Has broader term   Class

Table 4.11: Summary of hierarchical taxonomic relations holding between the different taxonomic levels

To investigate the cause of the difference between the approaches, the flagged instances were analysed and classified as either cases in which one of the terms could not be found in the taxonomic resources (NF), cases in which the information was correct but did not belong in that column (LE), or cases in which the system identified a content error (CE). The results are presented in Table 4.13.
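A minimal sketch of this pairwise, bottom-up taxonomic check is given below; the lookup dictionaries stand in for the external taxonomic resources and list only a few illustrative entries, and the NF/CE labels follow the categories used in Table 4.13.

    # Sketch of the pairwise taxonomic consistency check against child-to-parent
    # lookup tables derived from taxonomic resources (illustrative entries only).
    GENUS_TO_FAMILY = {"Mannophryne": "Aromobatidae", "Typhlops": "Typhlopidae"}
    FAMILY_TO_ORDER = {"Aromobatidae": "Anura", "Typhlopidae": "Squamata"}
    ORDER_TO_CLASS = {"Anura": "Amphibia", "Squamata": "Reptilia"}

    def check_pair(child_value, parent_value, child_to_parent):
        """Return ('NF', None) if the child term is not found, ('CE', suggestion)
        if the recorded parent disagrees with the resource, or None if consistent."""
        expected_parent = child_to_parent.get(child_value)
        if expected_parent is None:
            return ("NF", None)
        if expected_parent != parent_value:
            return ("CE", expected_parent)    # suggestion for the correct value
        return None

    def check_record(record):
        """Check the genus-family, family-order, and order-class pairs of one record."""
        pairs = [("genus", "family", GENUS_TO_FAMILY),
                 ("family", "order", FAMILY_TO_ORDER),
                 ("order", "class", ORDER_TO_CLASS)]
        flags = {}
        for child_col, parent_col, lookup in pairs:
            result = check_pair(record.get(child_col), record.get(parent_col), lookup)
            if result:
                flags[(child_col, parent_col)] = result
        return flags

    # Example: a record with genus 'Mannophryne' and family 'Dendrobatidae' yields
    # {('genus', 'family'): ('CE', 'Aromobatidae')}.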

Column    # Flagged by Validato   # Flagged by Timpute   Overlap
Species   220                     887                    206
Genus     3,261                   1,376                  286
Family    3,651                   512                    111
Order     8,514                   88                     79
Class     220                     10                     4

Table 4.12: Number of disagreements flagged per column in all taxonomic experiments in total (Flagged) and number of instances flagged by both the ontology-driven and the data-driven data cleaning approaches (Overlap)

The most peculiar result from the taxonomic ontology-driven data cleaning experiments is the extraordinary number of wrong column errors found for the order column. Some 5,600 of these cases can be ascribed to the value Sauria being present in the ‘order’ column, whereas it denotes a suborder of reptiles of the Squamata order. This error is rarely caught by the data-driven approach, as Sauria is used so systematically for Squamata that it is not an outlier in this database.

Columns            # Flagged   NF            LE        CE
Species - Genus    4,122       3,035 (300)   0         1,087 (142)
Genus - Family     3,341       514 (81)      14 (1)    2,813 (124)
Family - Order     8,641       1,017 (23)    0         7,624 (66)
Order - Class      8,460       2,643 (6)     213 (6)   5,604 (2)
Inferred relations
Species - Family   4,311       2,890 (215)   0         1,421 (91)
Species - Order    6,097       2,909 (202)   0         3,188 (362)
Species - Class    251         64 (7)        0         187 (3)
Genus - Order      8,583       515 (84)      0         8,068 (440)
Genus - Class      562         518 (81)      14 (1)    30 (11)
Family - Class     675         645 (21)      0         30 (9)

Table 4.13: Results of ontology-based error detection experiments on taxonomic information

Incompleteness of the resources accounts for the majority of the cases in which the taxonomic name could not be found in the resource (e.g., the genus Astylosternidae is not described in Frost 2009, but it is listed in, for example, the Encyclopaedia of Life3 and the Global Biodiversity Information Facility4). In a few cases, a value cannot be matched because of a spelling error, such as Alligatoridaer instead of Alligatoridae, or abbreviations, such as sp. in the species field to indicate that the species has not been identified and that it could be any species in the genus indicated (in this case genus Typhlops). In some cases, the ontology-driven cleanup uncovers an update to the taxonomy, such as for the genus-family pair Dendrobatidae-Mannophryne. Here the approach suggests Aromobatidae as value for family, which can be explained by a change in the taxonomy: in 2006 Aromobates was removed from the Dendrobatidae family into its own family, Aromobatidae (Grant et al., 2006). Overall, the approach detects a variety of error types and, except for the cases in which the term is not present in the resource, all cases it flags are real errors. As the prevalence of values entered at the wrong taxonomic level (e.g., suborder vs. order) is quite high, the addition of a suborder column in the database might be considered.

3 http://www.eol.org/ Last visited: 16 July 2009
4 http://www.gbif.org/ Last visited: 16 July 2009

4.6 Discussion and Conclusions

This chapter has given an overview and analysis of data quality problems in collection databases. The context of quality problems will be different when data from other domains is investigated, but the methodology and tools can be ported to other domains, as Van den Bosch et al. (2009a) have shown. In this article, Timpute was applied to object databases from different cultural heritage institutions. Through the experiments presented in this chapter and their analyses, RQ1 can now be answered.

RQ1a: Can data-driven and knowledge-driven methods provide improvements to the data quality of structured textual resources describing collection objects?

The answer to RQ1a is positive, as both Timpute and Validato have been shown to be able to detect errors in a collection database. For Timpute it was possible to estimate the recall of the method through an experiment in which errors were introduced artificially. The number of detected errors was high, in the 90% range; hence Timpute can be considered a usable tool for error detection. Error correction proves to be a more difficult problem, as in many cases where Timpute identifies an error it is unable to provide the correct answer. For the temporal Validato experiments this was also the case, but because external resources were used to check the taxonomic and geographic information in the R&A database with Validato, for these types of information it was in most cases possible to provide a correction.

RQ1b: To what extent are the data-driven and knowledge-driven methods complementary?

Timpute and Validato are complementary in the sense that each detects a different type of error; this is inherent to the underlying approaches. Timpute focusses on outliers, whereas Validato focusses on general patterns. It is for this reason that Timpute cannot detect structural problems with the database, such as data that is consistently entered into a wrong column. Validato can detect such problems and thus implicitly propose alteration of the database schema. Hence, the answer to RQ1b is positive.

A limitation of both approaches is that they utilise the dependencies of data points within the object database. However, there are data sets which do not contain interdependent data. For these data sets, error detection and correction is limited to statistical and manual methods.

Naturally, there are disadvantages to every approach. For Validato, the first disadvantage lies in its lack of generalisability, i.e., for every new domain an ontology needs to be available, from which rules must be inferred, and possibly the data should be linked to additional resources such as GeoNames and the taxonomic resources that were used here. A second disadvantage of Validato is that the rules cannot always suggest corrections (such as the temporal rules) and can be heavily dependent on external knowledge, which should (a) be available and (b) be of high quality. When such resources are available, the ontology-driven data cleaning can zoom in on the inconsistency and provide a suggestion for what the value should be. This could even be extended to correct missing values in some cases, such as filling in missing country names if the city and province are known. Again the system is not perfect, as sometimes a value matches an incorrect country name in GeoNames or cannot be disambiguated; therefore a setup in which an expert checks the suggestions of the data cleaning system is recommended.

When high-quality taxonomic resources are available, Validato may also utilise the species locality information from the accepted taxonomic resource to check whether the geographical location given is compliant with the normal location of the species. If this is not the case, then it could mean that there is an error in either the taxonomic information or the geographic information in the database record.

One area that deserves great attention is the prevention of errors, which was not explicitly investigated here as the data was already entered into the databases. However, the results from the work done in this chapter will be used to formulate recommendations for error prevention for future data entry projects. The recommendations will include the following topics:

Limit formatting variation When the entry field is constrained to a limited set of options (e.g., the altitude field should always have a value in metres, dates should always be in dd.mm.yyyy format, the language should always be Dutch), formatting inconsistencies will be reduced drastically.

Spell check A typo is easily made; therefore, a domain-specific spelling checker running in the background can prevent some errors.

Consistency check on taxonomic and geographical information The taxonomic and geographic database columns are among the most dependent chains of information found in the specimen databases. They are also the most important. Limiting the options for data entry from the highest level down can help prevent inconsistent taxonomic and geographic descriptions. When, for instance, the value reptile is entered for the taxonomic class (or chosen from a list), the system can prevent the user from entering a value in the order field that denotes an amphibian. To achieve this it is imperative that taxonomic resources be made available.

Timpute check It is conceivable that the data-driven error detection approach is run in the background during data entry. As soon as a value is entered that the classifier finds suspicious on the basis of the other information in the database, it can prompt the user to check the value again. If the user decides to accept the value, Timpute is retrained to also accept this value.

5 Automatic Data Structuring

No house should ever be on a hill or on anything. It should be of the hill. Belonging to it. Hill and house should live together each the happier for the other.

Frank Lloyd Wright, An Autobiography (1932)

In this chapter, the Wikipedia Instance Based Induction of Ontologies (Twibio) system, which was developed in the framework of the Mitch project, is described. This automatic ontology induction technique links the R&A database, which contains implicit domain information, to Wikipedia, an external encyclopaedic resource that contains explicit domain information, to discover relations that hold between concepts in the R&A domain. The ontology created by Twibio is then compared to the manually constructed ontology presented in Chapter 3. This chapter provides the answer to Research Question 2.

RQ2 Do automatic methods for building ontologies provide different structure for data that is not achieved by manual ontology building?

The chapter is set up as follows. In Section 5.1, common methods in automatic ontology construction are described. In Section 5.2, the instance-based automatic ontology induction method that forms the second thesis contribution is presented. Section 5.3 presents a discussion of the differences between the automatically constructed ontology presented in Section 5.2 and the manually constructed ontology presented in Chapter 3. The chapter is concluded by a chapter summary and discussion (Section 5.4).

The chapter is based on the following publications:

• Van Erp, M., Van den Bosch, A., Wubben, S., and Hunt, S. (2009b). Instance-driven discovery of ontological relation labels. In Proceedings of the EACL 2009 Workshop on Language Technology and Resources for Cultural Heritage, Social Sciences, Humanities, and Education (LaTeCH - SHELT&R), pages 60–68, Athens, Greece. ACL

• Van Erp, M., Lendvai, P., and van den Bosch, A. (2009a). Comparing alternative data-driven ontological vistas of natural history. In Proceedings of the eighth International Conference on Computational Semantics (IWCS-8), pages 282–285, Tilburg, The Netherlands

5.1 Automatic Ontology Construction

In the past decade, an increasing amount of work has been invested in the development of support systems for automatic or semi-automatic ontology construction, with tracks and workshops devoted to this topic at several AI conferences such as ECAI and IJCAI (Buitelaar et al., 2005). The task of automatically constructing an ontology from textual resources is also known as ontology learning, a term that was introduced by Maedche and Staab (2001). Most approaches in automatic ontology construction follow a bottom-up approach, similar to the ontology construction guidelines presented in Subsection 3.2.4, i.e., starting by identifying specific terms and then generalising them into classes and defining the hierarchy and relations between them. This approach is visualised by the ontology layer cake in Figure 5.1 (Cimiano, 2006). When taking the R&A database as a basis for the approach that is presented in this chapter (called Twibio), the database columns can be considered the classes in the ontology again. Therefore, Twibio will only focus on discovering relations between concepts. This collapses the task of identifying class hierarchies, relations, and relation hierarchies into one task. Due to their difficulty, the higher level tasks in ontology learning (identifying axiom schemata and general axioms) have not been addressed extensively in previous research yet.

Terms: biotope, frog, anura, creek, genus, ...
Synonyms: {frog, anura}
Classes: c := genus := ⟨i(c), ||c||, Ref_c(c)⟩
Class Hierarchies: genus ≤_c class, class ≤_c kingdom
Relations: lives_around(dom: genus, range: biotope)
Relation Hierarchy: lives_around ≤_r preferred_biotope
Axiom Schemata: disjoint(biotope, location)
General Axioms: ∀x (genus(x) → ∃y class_of(y, x) ∧ ∀z (class_of(z, x) → y = z))

Figure 5.1: Ontology construction layer cake (layers listed from bottom to top), based on (Cimiano, 2006)

have not been addressed extensively in previous research yet. These topics are also out of the scope of this thesis. In the remainder of this section, the three tasks in automatic ontology construction that are addressed by Twibio, namely the identification of class hierarchies, relations, and relation hierarchies, are discussed along with pointers to relevant literature.

Class Hierarchies

A class hierarchy provides the structure that represents which classes are parents of a class (i.e., which classes are more general; for example, a ‘taxonomic order’ class is a parent of a ‘genus’ class) or which classes are siblings of each other (i.e., which classes share the same parental class; ‘vernacular name’ and ‘taxonomic name’ are both children of an ‘identifier’ class). Three approaches are prevalent for the induction of class hierarchies from text: (1) application of lexico-syntactic patterns, (2) hierarchical clustering, and (3) analysis of co-occurrence of terms in the same sentence, paragraph or document. Lexico-syntactic patterns are mostly derived from the work done by Hearst (1992), who defined a set of lexico-syntactic patterns with which hyponymous terms can be identified in a corpus. Hierarchical clustering is, like synonym identification, based on Harris’ hypothesis that related words are often found in similar contexts, and therefore classes derived from word contexts are also found in similar contexts. An example of such an approach can be found in the work by Faure and Nédellec (1998). They present clusters of classes that are ordered by generality in the hierarchy by their co-occurrence with a particular preposition that indicates a hyponymous relation. The third approach, analysis of co-occurrence of terms within a sentence, paragraph or document, was introduced by Sanderson and Croft in 1999. In their approach, Sanderson and Croft define hierarchical relations automatically by term subsumption relations, which are derived from the occurrence patterns of terms in particular document sets. They assume that a term x subsumes a term y if the documents in which term y occurs are a subset of the documents in which term x occurs.
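The subsumption test of Sanderson and Croft can be made concrete with a few lines of code. The sketch below assumes an inverted index mapping each term to the set of documents in which it occurs; the names and the relaxation threshold are illustrative only.

def subsumes(index, x, y, threshold=1.0):
    """Term x subsumes term y if the documents containing y are (nearly) a subset of those containing x."""
    docs_x, docs_y = index[x], index[y]
    if not docs_y or len(docs_y) >= len(docs_x):
        return False  # the subsumed term should be the rarer of the two
    overlap = len(docs_y & docs_x) / len(docs_y)
    return overlap >= threshold  # a threshold below 1.0 tolerates noisy co-occurrence

index = {"amphibian": {1, 2, 3, 4, 5}, "frog": {2, 3, 5}}
print(subsumes(index, "amphibian", "frog"))  # True: 'amphibian' subsumes 'frog'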

Relations

Relations define a further, non-hierarchical structure over the classes in a given domain. In the natural history domain, relations such as ‘genus is a broader class than species’ or ‘genus lives in a particular geographical area’ can be found. The task of relation learning has been addressed by various approaches such as syntactic dependencies (Kavalec and Svátek, 2005), association rules (Maedche and Staab, 2000), and pattern identification (Pantel and Pennacchiotti, 2008). In the syntactic dependency approach used by, for example, Kavalec and Svátek (2005) and Ciaramita et al. (2005), syntactic patterns are identified that may indicate the presence of an ontological relation. The significance of the dependencies is tested by ranking them by frequency (Kavalec and Svátek, 2005) or by a significance test to investigate whether their occurrence is more frequent than what one would expect by chance (Ciaramita et al., 2005). Association rule approaches, such as Maedche and Staab (2000)’s, start with identifying syntactic dependencies that may indicate an ontological relation, after which a generalised association rules algorithm (Agrawal and Srikant, 1995) is employed to extract conceptual relations from the selected syntactic dependencies. Pattern-based approaches, such as the ones employed by Pantel and Pennacchiotti (2008) and Ravichandran and Hovy (2002), use the web to identify useful patterns. Ravichandran and Hovy start off with a small list of seed patterns that are sent to a search engine to retrieve more evidence for a pattern as a precise indication of a relation. Pantel and Pennacchiotti’s approach is based on Hearst patterns (Hearst, 1992). Their approach takes a small number of seed instances of a relation that are sent to a web search engine to extract more examples of this relation. From the instances of class-relation pairs, patterns are identified. These pairs are then ranked by the number of instances they cover to identify the most general patterns for a particular relation. The approach taken in Twibio is an example of a syntactic dependency approach and aims to combine the induction of hierarchical and non-hierarchical relations. It differs from the syntactic dependency methods mentioned above in that it utilises a specialised corpus and imposes more restrictions on the initial selection of data from which relations are to be extracted.
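To illustrate the pattern-based family of approaches, a single Hearst-style lexico-syntactic pattern (‘X such as Y and Z’) can be matched with a regular expression, as in the sketch below. It is only meant to make the idea of pattern matching concrete; it is not part of Twibio, and real systems use many more patterns and a parser.

import re

# One Hearst-style pattern: "<hypernym>, such as <hyponym>, <hyponym> and <hyponym>".
PATTERN = re.compile(r"(\w+),? such as ((?:\w+(?:, | and )?)+)")

def hearst_pairs(sentence):
    pairs = []
    for match in PATTERN.finditer(sentence):
        hypernym = match.group(1)
        for hyponym in re.split(r", | and ", match.group(2)):
            if hyponym:
                pairs.append((hypernym, hyponym.strip()))
    return pairs

print(hearst_pairs("The collection holds amphibians, such as frogs and salamanders."))
# [('amphibians', 'frogs'), ('amphibians', 'salamanders')]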

Relation Hierarchies

In a sense, relation hierarchies are already addressed by the ordering of relations between classes; therefore they are often not addressed separately. Some relations are subrelations of others, just as some classes are subclasses of others. An example of a subrelation-superrelation pair is the ‘preferred biotope’-‘found in’ relation pair. A preferred biotope for an animal species entails all climatological circumstances, whereas ‘found in’ can be defined as only pertaining to particular geographical features. Some work on relation hierarchy discovery for verbs (verbs and verb phrases are often used to express relations) has been done by Schulte im Walde (2000). In this work, Schulte im Walde clustered verbs according to their alternation behaviour (i.e., the way verbs can be used in different subcategorisation frames in which their semantic meaning changes slightly) to find relation hierarchies.

5.2 Twibio

The automatic ontology construction approach presented in this chapter, called Twibio1, exploits the content and structure of the reptiles and amphibians specimen database. Each database column contains up to 16,870 examples2 of each class that is to be defined in the ontology. Similar to the approach in Chapter 3,

every database column is assumed to denote a class in the ontology. Through looking up the examples of each class in an encyclopaedic resource and finding phrases that may indicate a type of relation between two examples in the text of this encyclopaedic resource, relations can be defined between examples from two different database columns. Twibio generalises over the phrases that are found for each pair of examples from two database columns to define a relation label for a relation between two database columns, and thus two ontological classes. In the remainder of this section, Wikipedia, the resource from which Twibio extracts relations, is described in Subsection 5.2.1. Previous work on relation extraction from Wikipedia is described in Subsection 5.2.2. In Subsection 5.2.3, the preprocessing of the Wikipedia corpus and the query pairs for Twibio is detailed, followed by the description of the Twibio system architecture in Subsection 5.2.4. In Subsection 5.2.5, the evaluation methodology used in Twibio is described, followed by the results on the R&A data in Subsection 5.2.6.

1The Wikipedia Instance-Based Induction of Ontologies system
2Depending on how many cells are filled

5.2.1 Wikipedia

Wikipedia is the world’s largest online collaboratively edited encyclopaedic resource. It has grown significantly since its beginning in 2001 (Voss, 2005). At the time of writing this thesis, there are versions of Wikipedia available for 267 languages, of which 80 contain over 10,000 articles. The most popular version is the English Wikipedia, which contains 2,915,505 articles3. A key property of Wikipedia is that it is for the greater part unstructured text. A Wikipedia article can also contain three different types of meta-information objects: (1) categories, (2) wikilinks, and (3) infoboxes. Editors of Wikipedia articles are encouraged to supply their articles with categories. Categories can be subsumed by broader categories, creating a hierarchical structure. Editors can also link to any other article in Wikipedia through a wikilink, no matter whether it is part of the same category, or of any category at all. Some information in Wikipedia is presented in a more structured format, for instance through infoboxes, which have a table-like structure. The infobox format is not standardised across Wikipedia articles, making it difficult to extract information from automatically. Figure 5.2 shows a screenshot of a Wikipedia article marked up with the three meta-information objects mentioned in this paragraph.

3http://en.wikipedia.org/wiki/Wikipedia:Statistics Last visited: 15 June 2009
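Since wikilinks and category assignments are written inline in the article source as [[...]] markup, they can be recovered from a raw wikitext dump with a regular expression. The sketch below is illustrative only and is not the preprocessing code used for Twibio; the markup conventions shown are the standard Wikipedia ones.

import re

# [[Target]] or [[Target|anchor text]]; the link target is the part before '|'.
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:\|[^\]]*)?\]\]")

def links_and_categories(wikitext):
    """Split the [[...]] targets in a wikitext string into plain wikilinks and categories."""
    links, categories = [], []
    for target in WIKILINK.findall(wikitext):
        target = target.strip()
        if target.lower().startswith("category:"):
            categories.append(target.split(":", 1)[1])
        else:
            links.append(target)
    return links, categories

text = "The frog is an [[amphibian]] in the order [[Anura]]. [[Category:Frogs]]"
print(links_and_categories(text))  # (['amphibian', 'Anura'], ['Frogs'])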

Figure 5.2: Screenshot of a Wikipedia article (on the Motagua River), marked up with the wikilinks relating the article to other Wikipedia articles, the infobox enumerating facts about the river, and the categories to which the article belongs

Although there are many resources available for the natural history domain that contain general encyclopaedic information, such as the Encyclopedia of Life4, Wikipedia5 is chosen as the external resource to extract relations from. The three main reasons for this choice are as follows.

1. Wikipedia contains information on a large variety of categories. This will be particularly useful when attempting to find relations between the geographic and taxonomic classes in the ontology. A strictly biological resource does not contain as much information on the geographic classes.

2. Wikipedia is a freely linked resource, meaning that articles on Wikipedia are not only organised via a hierarchical category structure but also through hyperlinks to other Wikipedia articles. The link structure provides far more flexibility for expressing relations between classes than the more self-contained pages found on, for instance, Amphibiaweb6, in which relations can only be expressed in text or through mark-up. The wikilinks will prove to be of crucial importance to Twibio, as will be explained in Subsection 5.2.3.

4http://www.eol.org Last visited: 16 July 2009
5http://www.wikipedia.org/ Last visited: 15 June 2009
6http://www.amphibiaweb.org Last visited: 15 June 2009

3. Wikipedia is more accessible than the taxonomic resources considered (e.g., Encyclopedia of Life, Amphibiaweb, Reptile Database), allowing for downloading of a copy for research purposes.

The status of Wikipedia as a dependable research resource is debated. This is partly due to its dynamic nature. However, research has provided evidence for the claim that Wikipedia is as reliable as a source that is maintained exclusively by experts (Giles, 2005). The quality of Wikipedia is maintained through the discussion between experts and non-experts worldwide. This global approach also explains Wikipedia’s rapid growth and breadth, as any expert on any domain can contribute to the resource. This does mean that some categories are very well represented in Wikipedia whereas others are not. Wikipedia is a popular resource in NLP research to extract relations from. Key examples of previous work on relation extraction from Wikipedia are presented in Subsection 5.2.2.

5.2.2 Related Work

Through its structure and breadth, Wikipedia is a potentially powerful resource for information extraction, which has not gone unnoticed in the NLP community (Medelyan et al., 2009).

Approaches that Utilise the Wikipedia Category Structure and Infoboxes

Preprocessing of Wikipedia content in order to extract non-trivial relations has been addressed in a number of studies. Syed et al. (2008) utilise the category structure in Wikipedia as an upper ontology to predict concepts common to a set of documents. In Suchanek et al. (2006) an ontology is constructed by combining entities and relations that are extracted through Wikipedia’s category structure in combination with WordNet. This results in a large ‘is-a’ hierarchy, drawing on the basis of WordNet, while further relation enrichments come from Wikipedia’s category structure. Chernov et al. (2006) also exploit the Wikipedia category structure to extract relations. Most studies to date have focussed on the structured information that is explicitly available in Wikipedia, such as its infoboxes and categories (Auer and Lehmann, 2007). Although infoboxes provide rich structured information, their templates are not yet standardised, and their use has not permeated throughout the whole of Wikipedia.

Approaches that Utilise the Unstructured Text in Wikipedia

Although the category and infobox structures in Wikipedia already provide a larger coverage of factual knowledge than resources such as WordNet, these structures do not express all semantic relations that are possibly relevant. In specific domains in particular, relations occur that are difficult to capture in infoboxes and categories. Therefore, an approach that exploits more of the linguistic content of Wikipedia is desirable. Such approaches can be found in Nguyen et al. (2007) and Nakayama et al. (2008). In both works, sections of Wikipedia articles are parsed, entities are identified, and the verb between the entities is selected as the relation label. Through such methods, relations that are not backed by a wikilink are also extracted, resulting in common-sense factoids such as ‘Brescia is a city’. For a domain-specific application this method lacks precision, and here precision is more important than recall for finding relations. The difference between previous work and Twibio is that the focus in this thesis is on a specific domain. The relevant Wikipedia articles for this domain are selected automatically by querying instances from the specimen database. In this approach, relations are extracted not from the structured content of Wikipedia, but from its textual content.

5.2.3 Data Preparation

The terms and the sentences in which they are contained are extracted from an offline copy of the English Wikipedia version of July 27, 2008. This version of Wikipedia contains about 2.5 million articles, including a vast amount of domain-specific articles that one would typically not find in general encyclopaedias. To speed up the look-up, an index was built of a subset of the link structure present in Wikipedia. The subset of links included in the index is constrained to those links occurring in sentences in which the main topic of the Wikipedia article (as taken from the title name) occurs. For example, from the Wikipedia article on Anura the following sentence would be included in the experiments:

“The frog is an [[amphibian]]7 in the order Anura (meaning “tail-less”, from Greek an-, without + oura, tail), formerly referred to as Salientia (Latin saltare, to jump).”

whereas this set would exclude the sentence:

“An exception is the [[fire-bellied toad]] (Bombina bombina): while its skin is slightly warty, it prefers a watery habitat.”

as it does not include a reference to the topic of the article. This approach limits the relations to only those between pages that are likely to be semantically strongly connected to each other. It is based on the assumption that the strongest and most reliable lexical relations are those expressed by wikilinks. Wikipedia author guidelines state that wikilinks should only be added when relevant to the topic of the article. Because most users tend to adhere to the guidelines for editing Wikipedia pages and because articles are under continuous scrutiny of their readers and editors, most links in Wikipedia are indeed deemed to represent semantic relatedness between the topics of the linked pages (Blohm and Cimiano, 2007; Kamps and Koolen, 2008). Besides removing less relevant content, an advantage of pre-selecting content is that it reduces the amount of text to analyse, speeding up the relation extraction process. In order to find meaningful relations between two database columns, query pairs are generated by combining two values occurring together in a record. This approach limits the number of queries applied to Wikipedia, as no attempt is made to find relations between values that are not attested in the R&A database. This approach yields a query pair such as Reptilia Crocodylia from the taxonomic class and order columns, but not Amphibia Crocodylia. Because not every database field is filled, and some combinations occur more than once, this procedure results in 186,141 unique query pairs. To assess whether the two terms that form a query pair are related, the semantic relatedness of those two terms is computed. Semantic relatedness can denote every possible relation between two terms, unlike semantic similarity, which typically denotes hierarchical relations such as hypernymy and meronymy and is

often computed using hierarchical networks like WordNet (Budanitsky and Hirst, 2006). A simple and effective way of computing the semantic relatedness between two terms t1 and t2 is measuring their distance in a semantic network. The wikilinks between different Wikipedia articles form the semantic network used by Twibio. The distance measures between terms result in a semantic distance metric, which can be inverted to yield a semantic relatedness metric. Computing the shortest path between terms t1 and t2 can be done using Formula 5.1, where P is the set of paths connecting t1 to t2 and Np is the number of nodes in path p.

7The double brackets indicate wikilinks.

rel_path(t1, t2) = argmax_{p ∈ P} 1/N_p        (5.1)

In the Twibio corpus, each wikilink is one step in the path from one instance (represented by a Wikipedia article) to another instance (also represented by a Wikipedia article). Only the shortest paths in the semantic network formed by the wikilinks are selected, indicating that there is a strong relation between two terms. By indexing both incoming and outgoing wikilinks for each article, a bidirectional breadth-first search can be used to find shortest paths between concepts (Wubben, 2008). The main language of the R&A database is Dutch; still, the English version of Wikipedia is chosen for this relation finding study. The consequence of this choice is that the relation labels that are discovered will be English phrases. The low coverage of the Dutch version of Wikipedia prevented using that resource, as it did not yield satisfying results for this method. The results will show that the approach does not suffer from using English instead of Dutch. Furthermore, articles in the English Wikipedia on animal taxonomy are far more elaborate than those contained in the Dutch Wikipedia, if available at all. Since the database contains Latin-based nomenclature, using the wider-coverage English Wikipedia yields a much higher recall than the Dutch Wikipedia. The values of the other columns mainly contain proper names, such as person names, geographic locations and dates, which are often the same or similar in Dutch and English. Here, the approach benefits from Dutch and English being closely related languages. In some cases, names that differ between the two languages are still matched because the database also contains entries partially or fully in English, as well as some in German and Portuguese. In other cases, Twibio fails to match the database entry to the correct Wikipedia article. Here recall suffers slightly, but this is not a problem, as for each column there are thousands of chances to match to a Wikipedia article.
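The path computation behind Formula 5.1 over the wikilink index can be sketched as follows. The graph below is a toy stand-in for the real index; it maps each article title to the set of titles it links to or is linked from, so the search can traverse links in both directions. For clarity, a plain breadth-first search is shown; Twibio itself used the bidirectional variant mentioned above (Wubben, 2008) for efficiency.

from collections import deque

def shortest_link_path(graph, source, target):
    """Number of wikilinks on the shortest path between two articles, or None if unconnected."""
    if source == target:
        return 0
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == target:
                return distance[node] + 1
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return None

# Toy wikilink index: Megophrys -> Anura -> Amphibian is the shortest path
# (two links, three nodes, so rel_path = 1/3 under Formula 5.1).
graph = {
    "Megophrys": {"Anura", "Megophryidae"},
    "Anura": {"Megophrys", "Amphibian"},
    "Megophryidae": {"Megophrys"},
    "Amphibian": {"Anura"},
}
print(shortest_link_path(graph, "Megophrys", "Amphibian"))  # 2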

5.2.4 Twibio System Setup

In this subsection, the setup of the Twibio system is presented. As input to the system a term pair is selected. Each part of the term pair contains a value from one of two different database columns. Twibio processes each term pair in six steps. A schematic overview of Twibio is given in Figure 5.3.

Figure 5.3: Schematic overview of Twibio

Step 1 For each term, the most relevant Wikipedia article is found by looking up the term in the titles of Wikipedia articles. As Wikipedia formatting requires the article title to be an informative and concise description of the article’s main topic, it is assumed that querying only for article titles yields reliable results.
Step 2 The system finds the shortest link path between the two selected Wikipedia articles. If the path distance is 1, the two terms are linked directly to each other via wikilinks in their Wikipedia articles. This is for example the case for Megophrys from the genus column and Anura from the order column. In the Wikipedia article on Megophrys, a link is found that leads to the Wikipedia article on Anura. There is no reverse link from Anura to Megophrys; hierarchical relationships in the zoological taxonomy such as this one are often unidirectional in Wikipedia so as not to overcrowd the parent article with links to its children.
Step 3 The sentence containing both target terms is selected from the articles (if available). For example, from the Megophrys article the sentence “Megophrys is a genus of frogs, order [[Anura]], in the [[Megophryidae]] family.” is selected.
Step 4 If the shortest path length between two Wikipedia articles is 2, the two terms are linked via one intermediate article. In that case, the system checks whether the title of the intermediate article occurs as a value in a database column other than the two database columns in focus for the query. If this is indeed the case, the relations between the first term and the intermediate article, and between the intermediate article and the second term, are also investigated. Such a bridging relation pair is found for the query pair Hylidae from the taxonomic order column and Brazil from the country column. Here, the initial path that is found is Hylidae ↔ Sphaenorhynchys → Brazil8. The article-in-the-middle value (Sphaenorhynchys) is found to occur in the R&A database, namely in the taxonomic genus column. This link is taken as evidence for co-occurrence. Thus, the relevant sentences from the Wikipedia articles on Hylidae and Sphaenorhynchys, and from the articles on Sphaenorhynchys and Brazil, are added to the possible relations between ‘order’ – ‘genus’ and ‘genus’ – ‘country’, respectively.
Step 5 The selected sentences are POS-tagged and parsed using the Memory Based Shallow Parser (Daelemans et al., 1999). Like MBT (see Chapter 3), this parser is based on TiMBL. The parser provides tokenisation, POS-tagging, chunking, and grammatical relations such as subjects and direct objects between verbs and phrases.
Step 6 The five most frequently occurring phrases that are found between instances of a column pair are presented to four human judges. It is hard, if not impossible, to automatically evaluate semantic relations, as the same relation can be expressed in many ways. An automatic approach would also require a gold standard of some sort, which for this domain (as well as for many other domains) is

not available. The cut-off of five relation candidates per column pair was chosen to prevent the judges from having to evaluate too many relations and to only present those that occur more often, and are therefore less likely to be false hits. False hits can, for example, be induced by ambiguous person names that also match location names.

8The direction of the arrows indicates from which article the wikilink originates.
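The core of Steps 2–4 for a single query pair can be condensed into a short, self-contained sketch. Tiny in-memory dictionaries stand in for the wikilink index, the sentence index, and the mapping from Wikipedia article titles to R&A database columns; the data and names are purely illustrative and mirror the Hylidae example above.

# Toy stand-ins for the wikilink index, the sentence index and the database columns.
wikilinks = {"Hylidae": {"Sphaenorhynchys"}, "Sphaenorhynchys": {"Brazil"}, "Brazil": set()}
sentences = {("Hylidae", "Sphaenorhynchys"): "Sphaenorhynchys is a genus of the family Hylidae.",
             ("Sphaenorhynchys", "Brazil"): "Sphaenorhynchys occurs in Brazil."}
column_of = {"Hylidae": "order", "Sphaenorhynchys": "genus", "Brazil": "country"}

def relation_candidates(term_a, term_b):
    """Return (column pair, sentence) candidates for one query pair (Steps 2-4)."""
    if term_b in wikilinks.get(term_a, set()):                    # path length 1: direct link
        return [((column_of[term_a], column_of[term_b]), sentences[(term_a, term_b)])]
    candidates = []
    for bridge in wikilinks.get(term_a, set()):                   # path length 2: one article in between
        if term_b in wikilinks.get(bridge, set()) and bridge in column_of:
            candidates.append(((column_of[term_a], column_of[bridge]), sentences[(term_a, bridge)]))
            candidates.append(((column_of[bridge], column_of[term_b]), sentences[(bridge, term_b)]))
    return candidates

print(relation_candidates("Hylidae", "Brazil"))
# [(('order', 'genus'), '...'), (('genus', 'country'), '...')]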

5.2.5 Evaluation Methodology

The four judges were presented with the five highest-ranked candidate labels per column pair, as well as a longer example of a sentence found in Wikipedia that contains the candidate label, to resolve possible ambiguity. The items in each list were scored according to the total reciprocal rank (TRR) (Radev et al., 2002). For every correct answer 1/n points are given, where n denotes the position of the answer in the ranked list. If there is more than one correct answer, the points are added up. For example, if in a list of five, two correct answers occur at positions 2 and 4, the TRR would be calculated as (1/2 + 1/4) = .75. The TRR scores were normalised for the number of relation candidates that was retrieved, as for some column pairs fewer than five relation candidates were retrieved. The normalisation, dividing every TRR score by the number of candidates of which it was made up, ensured that the number of relation candidates did not affect the TRR scores. As an example, for the column pair ‘Province’ and ‘Genus’, the judges were presented with the relation candidates shown in Table 5.1. The direction arrow in the first column denotes that the ‘Genus’ value occurred before the ‘Province’ value. The human judges were sufficiently familiar with the domain to evaluate the relations, and had the possibility to gain extra knowledge about the class pairs through access to the full Wikipedia articles from which the relations were extracted. Inter-annotator agreement was measured using Fleiss’s Kappa coefficient (Fleiss, 1971). In the next section, the results of the experiments are presented.
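The normalised TRR score for one list of judged relation candidates can be computed directly, as in the sketch below, where judgements is a list of booleans marking whether the candidate at each rank was assessed as correct (the names are illustrative).

def normalised_trr(judgements):
    """Total reciprocal rank of a judged candidate list, normalised by the number of candidates."""
    trr = sum(1.0 / rank for rank, correct in enumerate(judgements, start=1) if correct)
    return trr / len(judgements) if judgements else 0.0

# Correct answers at ranks 2 and 4 in a list of five: TRR = 1/2 + 1/4 = 0.75,
# which normalised over the five retrieved candidates gives 0.15.
print(normalised_trr([False, True, False, True, False]))  # 0.15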

5.2.6 Twibio Results

The Twibio experiment shows that between certain columns more relations are found than between others. In total, 140 (32%) relation candidates were retrieved directly.

Direction  Label          Snippet
→          is found in    is a genus of venomous pitvipers found in Asia from Pakistan, through India,
→          is endemic to  Cross Frogs) is a genus of microhylid frogs endemic to Southern Philippine,
→          are native to  are native to only two countries: the United States and
→          is known as    is a genus of pond turtles also known as Cooter Turtles, especially in the state of

Table 5.1: Relation candidates for the ‘Province’ and ‘Genus’ column pair

A further 303 (68%) relation label candidates were retrieved via an intermediate Wikipedia article (Step 4 in the system). It is assumed that the column pairs between which a larger number of relations is discovered are more strongly related than the column pairs for which only a few relations are discovered. From each column pair, the highest rated relation was selected, with which the ontology displayed in Figure 5.4 was constructed. As the figure shows, the relations that are discovered are not only ‘is a’-relations of the kind typically found in strictly hierarchical resources such as a zoological taxonomy or geographical resource. For some database columns, no relations were retrieved, such as for the ‘collection date’ column. This is not surprising: although Wikipedia contains articles about dates (so-called ‘what happened on this day’ articles), it is unlikely that such an article would link to a domain-specific event such as an animal specimen find. Relations between instances denoting persons and other concepts in the domain are also not discovered through this method. The reason for this is that the majority of the biologists named in the database do not have a Wikipedia page dedicated to them, indicating the boundaries of Wikipedia’s domain-specific content. Occasionally, a Wikipedia article for a value from a person name column is retrieved, but in most cases this mistakenly matches a Wikipedia article on a location, as last names in Dutch are often derived from place names. A second problem induced by incorrect data is the incorrect match of Wikipedia pages for certain values from the ‘Town’ and ‘Province’ columns. Incorrect relation candidates are retrieved because, for instance, the value China occurs in both the ‘Town’ and the ‘Province’ columns. Cleaner data and an extra disambiguation step, in which it is checked whether the Wikipedia article is about a person or a geographical location, could solve these problems. The numbers below the relation labels in Figure 5.4 denote the average TRR scores given by the four judges on all relation label candidates that the judges were presented with for that column pair. The scores for the relations between the taxonomic classes were particularly high, meaning that in many cases all relation candidates presented to the judges were assessed as correct. The inter-annotator agreement was κ = 0.63. This is not perfect, but reasonable. Most disagreements between the judges are due to vague relation labels, such as ‘may refer to’ as found between ‘Province’ and ‘Country’. If a relation that occurred fewer than 5 times was judged incorrect by the majority of the judges, the relation was not included in Figure 5.4. This was the case for the labels discovered between the ‘Location’ and ‘Species’ columns, for example.

Figure 5.4: Graph of relations between columns (Species, Order, Town, Type, Location, Family, Genus, Type Name, Class, Country, and Province). The arrows denote the direction of the relation; the TRR scores are shown in parentheses under the relation labels

Post-processing of the results could filter out synonyms such as those found for relations between ‘Town’ and other classes in the domain. This would, for example, define one particular relation label for the relations ‘is a town in’ and ‘is a municipality in’ that the system discovered between ‘Town’ and ‘Province’ and between ‘Town’ and ‘Country’, respectively. A duplication of this approach for the birds, crustaceans and shark databases from Naturalis yielded no satisfactory results, as the Wikipedia articles for these subdomains were yet to be added or expanded at the time of performing these experiments. This work has shown that it is possible to extract labelled ontological relations for domain-specific data from Wikipedia. However, the discovered relations that result in the ontology graph shown in Figure 5.4 present a view on the domain that differs from the one in the ontology presented in Chapter 3. In Section 5.3, the differences between the two ontologies and the consequences of these differences are discussed.

5.3 Ontology Comparison

The ontology that is constructed via the R&A database and Wikipedia is quite different from the manual ontology that was presented in Chapter 3. This is illustrated in Figure 5.5, where the two ontologies are overlaid. In Figure 5.5, the uninterrupted lines denote the relationships defined in the manually constructed ontology. The dashed lines indicate the relationships between the different natural history classes as defined by the Wikipedia ontology. Of the 25 relations present in total in Figure 5.5, there are only six cases in which the two ontologies define an overlapping relationship. The reason for this small amount of overlap is that the ontologies were constructed from different points of view. The CIDOC-CRM framework is constructed from an object-centred point of view with a focus on collection management and organisation, whereas Wikipedia is constructed with the aim of providing as much relevant information as possible on a given topic. This is reflected in the Twibio ontology containing more relations than the ontology constructed through the CIDOC-CRM model. Here, it is argued that this is not problematic and that each ontology provides a valid conceptualisation of the domain, but this does not mean they are interchangeable: each has its own uses. The remainder of this section is organised as follows. In Subsections 5.3.1 and 5.3.2, overviews of manual and automatic ontology comparison approaches are given, respectively, along with difficulties in comparing the manual and the Twibio ontologies. In Subsection 5.3.3 an argument for the coexistence of the two ontologies for one domain is given.

Figure 5.5: Overlay of the manually constructed and Twibio ontologies over the classes Species, Order, City, Type, Location, Family, Genus, Type Name, Class, Country, and Province. The relations in the manual ontology are denoted by the unbroken lines, the relations in the Twibio ontology by the dashed lines. The arrow heads point to the object of the relation.

5.3.1 Manual Ontology Comparison

Ontologies can be compared in different ways (Hartmann et al., 2004). The most complete set of points on which to evaluate the similarity of ontologies is provided by Noy and Hafner (1997), who discern eight facets for ontology comparison (see Table 5.2). Noy and Hafner’s approach can be seen as reverse engineering the ontology development process, as it also takes into account decisions made in the design and development of an ontology. However, it does not provide a precise comparison of each concept and relation in two or more ontologies. Noy and Hafner acknowledge that it is not always possible to retrieve the reasoning behind the design choices made. Therefore, this approach cannot be used to compare the manual and Twibio ontologies for the R&A domain.

5.3.2 Automatic Ontology Comparison

Ontology comparison metrics that only look at the content and structure of ontologies were developed by Maedche and Staab (2002).

1. General
· The purpose the ontology was created for
· General or domain specific
· Domain (if domain specific)
· Easy integration possible into a more general ontology
· Size: number of concepts, rules, links, and so on
· Formalism used
· Implementation platform and language, if done
· Publication, if done

2. Design process
· How was the ontology built?
· Was there a formal evaluation?

3. Taxonomy
· What is the general taxonomy organization?
· Are there several taxonomies, or is everything in the same one?
· What is in the ontology: things, processes, relations, properties?
· What is the treatment of time?
· What is the top-level division?
· How tangled or dense is the taxonomy?

4. Internal concept structure and relations between concepts
· Do concepts have internal structure?
· Are there properties and roles?
· Are there other kinds of relation between concepts?
· How are part-whole relations represented?

5. Axioms
· Are there explicit axioms?
· How are the axioms expressed?

6. Inference mechanism
· How is reasoning done (if any)?
· What are some instances of going beyond first-order logic?

7. Applications
· Retrieval mechanism
· User interface
· Application in which the ontology was used

8. Contributions
· Major strengths and contributions
· Weaknesses

Table 5.2: Facets for ontology comparison as defined by Noy and Hafner (1997)

Figure 5.6: Comparison of two ontology hierarchies

In this work, Maedche and Staab present a set of similarity measures to automatically compare two or more ontologies to each other. The similarity measures address two aspects of ontologies and are presented as two different ontology layers. The first layer is concerned with naming conventions, i.e., how classes are named in each ontology. In the comparison of the ontology induced from Wikipedia and the manually constructed ontology for natural history, the terms used to identify the classes are identical; therefore this part of the evaluation can be discarded. The second layer Maedche and Staab discern is the conceptual layer, which pertains to the relations, i.e., the semantic structure, in the ontologies. For this second layer two measures are developed: one measure that compares the hierarchies of the ontologies and one that compares relations within the ontologies. Both conceptual layer measures are based on the comparison of sub- and superclasses within an ontology. An example of the type of ontologies the measure was designed for is given in Figure 5.6. This figure is based on the example given in Cimiano (2006). In the example ontologies in Figure 5.6, the difference in structure under ‘object info’ in the ontology on the left and under ‘root’ in the structure on the right can be quantified. However, the Twibio ontology is not organised via a hierarchical structure; it more resembles a directed graph structure, and thus it is not possible to compare it to another ontology via these measures. Even for the more hierarchical parts of the Twibio ontology, such as the taxonomic classes, it is not possible to use these metrics, as the hierarchy is not preserved. As Figure 5.4 shows, the ‘Family’ class is related directly to the ‘Species’ class, skipping the

‘Genus’ class, which defines the level between them. This is a consequence of the choices made in the approach for constructing the Twibio ontology. It also uncovers a fundamental difference in what type of information each ontology expresses about the domain. As the manually constructed ontology is designed from an archive organisation perspective, it defines mainly hierarchical relations between classes. An encyclopaedia is designed to provide relevant definitions and context to a topic. By extension, an ontology derived from such a resource will present not only hierarchical relations, as these are often part of a definition or context description of a topic, but also any other possibly relevant relation.

5.3.3 Manual vs. Twibio Ontology

In this thesis, it is argued that it is conceivable to have different ontologies for a single domain that each present a different perspective on the domain. For certain tasks, a more formal and elaborate ontology is required, whereas for other tasks a simpler conceptualisation of the domain that only contains the most important classes and relations may suffice. This does not exempt the ontologies from needing to be evaluated. As the existence of different ontologies for one domain is defended by their utility in applications, their usefulness is assessed in the retrieval experiments presented in Chapter 6. In these experiments, the influence of an ontological structure on the ranking of results is investigated by comparing the results of the experiments with and without ranking through the two ontologies, as well as by comparing the results of the rankings through the different ontologies. In the next section, this chapter is summarised and the answer to Research Question 2 is given.

5.4 Discussion and Conclusions

In this chapter, an automatic approach to ontology construction was presented. The novelty of the approach is that it utilises the fact that the reptiles and amphibians database provides a large number of instances for each ontological class, whereby the ontology construction method only needs to focus on identifying relations between the ontological concepts. To retrieve ontological relations, the online encyclopaedia Wikipedia was used, because it contains extensive and broad information on herpetological taxonomy. The resulting ontology is quite different from the manually constructed ontology that was presented in Chapter 3. The comparison of the manually constructed ontology and the Twibio ontology provides the answer to RQ2.

RQ2 Do automatic methods for building ontologies provide different structure for data that is not achieved by manual ontology building?

RQ2 is answered positively. As Figure 5.5 shows, the two ontologies are quite different from each other. However, it is not possible to compare the two ontologies quantitatively through the presented manual or automatic ontology comparison approaches. An external comparison, through using both ontologies in an application, will be carried out in the next chapter. Duplication of the experiments for other databases from Naturalis, such as those on crustaceans and birds, was barred by the unavailability of Wikipedia articles for these subdomains. However, articles are continuously added to Wikipedia, likely alleviating this bottleneck in the future. One issue that is not solved by this is the fact that for some types of information from the specimen databases, no information is, and most likely never will be, available in Wikipedia, namely the dates and persons involved. To remedy this, the usage of external resources, such as publications about the specimens described in the database, could be investigated. An important feature of the approach is that it takes advantage of the implicit expression of relatedness between Wikipedia articles through wikilinks, as only relations between database instances are retrieved that are connected through a wikilink and a reference to the Wikipedia article title. This limits the search space for relations but ensures a higher precision. A second feature of Wikipedia is that, as an encyclopaedia, its articles are meant to explain concepts and their context, thus increasing the chances of there being a relevant relation described. Experiments that attempted to retrieve ontological relations through a search engine on the web showed that the data found there was too varied in language and mark-up to be used for this approach to automatic ontology construction. As the individual relations of the instance-driven approach are rated by human judges, only relations they deemed relevant are included. In this way, not the entire ontology is evaluated at once, but the individual relations are evaluated. A comparison to a gold standard ontology, such as the manually constructed ontology presented in Chapter 3, is difficult, as ontology comparison techniques are designed for hierarchical structures, which are not available in the

Wikipedia ontology. This is because the Twibio ontology is derived from a network-based resource, which more closely resembles a directed graph than a hierarchy. However, it need not be a problem that there are two different ontological viewpoints on one domain, as both express different features of the domain. In the next chapter, both ontologies are applied to rank results from an information retrieval experiment.

6 Data Retrieval

The good of a book lies in its being read. A book is made up of signs that speak of other signs, which in their turn speak of things. Without an eye to read them, a book contains signs that produce no concepts; therefore it is dumb.

Umberto Eco, The Name of the Rose, 1980

Access to information hinges on two conditions: (1) the availability of information and (2) the ease with which the information can be retrieved. Due to the backlog of digitisation in many domains, the first condition is often not fulfilled. As institutions have limited manpower and an enormous amount of data, the fact that data often only exists on paper is still a major bottleneck. For the herpetological collections at Naturalis, this bottleneck was alleviated in Section 3.1 by automatically segmenting and labelling semi-structured data from the reptile and amphibians field logbooks and registers for inclusion in the database. By automatically growing the database, a significant amount of precious expert time is saved and the recall of information retrieval is boosted considerably. Access to this digital resource is provided by the Mitch Information Retrieval Appliance (Mira1), which aids users in accessing information by interpreting and expanding queries automatically and ranking results by relevance. To test the query expansion and ranking modules developed for Mira, a series of experiments was

carried out in which the performance of the different modules was evaluated on one hundred real-world, domain-specific queries. These experiments were not only run on the herpetological data, but also on a birds database, with another set of one hundred queries, to test whether the approaches would also apply to a similar but different data set. The results and analyses of the Mira experiments provide the answer to Research Question 3.

1Meaning ‘Look!’ in Spanish

RQ3: Can access to information be aided through a retrieval system enriched with domain knowledge?

This chapter is set up as follows. In Section 6.1, an introduction to the field of Information Retrieval is given. In Section 6.2, the birds databases are described. In Section 6.3, the queries used for testing Mira are discussed, followed by the different search modules and metrics implemented to improve and evaluate the retrieval experiments in Section 6.4. In Section 6.5, the evaluation metrics used in this study are described. The chapter is concluded by a presentation of the results in Section 6.6 and a chapter summary and discussion in Section 6.7. The chapter is based on the following publication:

• Van Erp, M. and Hunt, S. (2010). Knowledge-driven information retrieval for natural history. In Proceedings of the 10th Dutch-Belgian Information Retrieval Workshop (DIR 2010), pages 31–38, Nijmegen, The Netherlands

6.1 Information Retrieval

For as long as people have created large collections of data, they have been searching for means to access the information in these collections (Paijmans, 1999). As many digital data collections grow more rapidly than analogue collections, many researchers have investigated means to tackle the problem of finding relevant pieces of information in large collections (where relevance denotes how well a retrieved set of documents, or a single document, meets the information need of the user). The field that has emerged from this is called Information Retrieval (IR), a term coined by Mooers in 1948 (Mooers, 1948; Gupta and Jain, 1997). IR is concerned with identifying documents from a collection that are relevant to a particular query. The work presented in this thesis does not deal with search on full-text documents but with structured data from a database; therefore, in the strict sense of IR this would need to be described as data retrieval (DR) (van Rijsbergen, 1979). However, in the work done for this thesis, the database, field logbooks and registers will be linked to several external resources, and approaches from the IR domain are investigated to improve the retrieval of database records relevant to a query, putting this work somewhere between IR and DR and closely related to XML retrieval or XML information retrieval. The database systems used at Naturalis do not aid the user in retrieving records that contain terms synonymous to the query term entered, nor do they rank database records that are more relevant to the query before records that are less relevant. The current database search functionality also requires that the user knows the database structure in order to fully access it. To make it easier for users to retrieve the relevant entries from the databases for a particular information request, three different types of improvements over simple database search are investigated: query interpretation, query expansion, and result ranking. Query interpretation facilitates a more precise formulation of the query to improve retrieval performance; this interpretation can be done automatically or manually, as Section 6.3 will show. Query expansion facilitates the retrieval of database records that not only contain the terms present in the original query but also synonymous terms. Ranking aims to present more relevant results first, followed by the less relevant results.

6.2 The Naturalis Birds Databases

The birds database that is used in this work is assembled from three databases on birds that were created by researchers at Naturalis. The first two databases were compiled within the Building the databases of life2 project at Naturalis, which was aimed at storing information on the collection in databases. The first database holds 117,649 records pertaining to the passerines (songbirds) that are preserved as study skins in the collection. The second database was created between 2002 and 2004 and contains 68,341 records describing the non-passerines (non-songbirds). The third database was created later by researchers and contains 34,028 records that describe specimens in the collection that were not included in the previous two databases (such as the mounted passerine specimens). The combined database (henceforth referred to as the Birds database) describes the circumstances

under which the bird specimens were found and how they are stored in the collection. Twenty columns are used by all three databases. In addition, each database has some extra columns that contain information specific to that part of the bird collection. The birds databases are cleaner than the reptiles and amphibians database, presumably because they were created over a shorter time span and fewer people were involved in the data entry process. The main language is Dutch, but in addition to the taxonomic names in Latin, the database also contains English, Indonesian, German, and some French and Portuguese. In Table 6.1, some general statistics about the data in the Birds database are summarised. A data sample is presented in Table 6.2. In Appendix B, a more detailed account of the birds database is given.

2NWO Groot project number: 175.010.2003.010

# Columns               32
# Records               220,018
Collection dates        1779-2006
Collection Coverage     100%
Geographical Coverage   Europe, Africa, Asia, North-America, South-America, Oceania

Table 6.1: Statistics on birds database

Column          Value
Registration #  103146
Family          Accipitridae
Genus           Buteo
Species         buteo
Author          (Linnaeus)
Country         Netherlands
Locality        Limburg
Remarks         Found dead in nest after storm

Table 6.2: Data sample from birds database

6.3 Queries

External researchers regularly request access to Naturalis’ extensive specimen collection or to the meta-data found in the databases describing the collections. As the databases are not publicly available, requests regarding the collection and collection information are usually received by collection managers at Naturalis. For the work done in this thesis, collection managers provided questions from researchers regarding the bird and herpetology collections. These questions give a good idea of what kind of information researchers are initially looking for and form the basis of the evaluation of Mira. Both the birds and the herpetology questions were extracted from longer (often email) messages. The questions have been summarised manually into only the information request, leaving out the introduction explaining why the information is requested. A full information request may look as follows.

I recently learned that there is a specimen of Cuculus poliocephalus at Leiden Museum found dead at Anse aux Pins, Mahe in October 1979 (Dr. Dent). Mr. Prefect kindly provided me with your contact details to follow up. Would it be possible to inspect the specimen, confirm the identity and provide any other details if possible (age of bird, exact date found)?

Which is summarised manually as:

Cuculus poliocephalus Anse aux Pins Mahe 1979 for use in Mira.

6.3.1 Reptile and Amphibians

The 100 reptile and amphibians queries were gathered from requests about the herpetology collection at Naturalis sent between September 2003 and December 2008. Some example queries are given below.

• What type specimens of New Guinee skink do you have in your collection?

• Do you have male specimens of Hypsilurus godeffroyi?

• Are there Dipsas species other than D. catesbyi and D. variegata from the Guianas and Eastern Venezuela in the collection?

• How many species of Rana palmipes as defined by Spix in 1824 are in the collection?

The types of queries can roughly be divided into (1) queries that only contain a genus and a species name (37), (2) queries that contain a genus and a species name, plus a geographical name (18), (3) queries that contain a genus and a species name, plus require one or more other types of information (33), and (4) queries that do not contain a genus and a species name (12).

Of the 33 queries that enquire after specimens of a particular genus and species with one or more restrictions, three queries enquire after a genus and species plus a particular size, two queries after a specimen of a particular genus, species, and sex, and one after a specimen of a particular genus, species, and sex found in a particular location. Two queries only mention a literature reference in addition to the genus and species, five queries mention, besides genus, species, and literature, a geographical location, and four queries are even more specific, asking for a particular genus, species, literature reference, geographical location and registration number. Six queries pertain to a genus and species name and a registration number, one query asks, in addition to genus, species and registration number, for a specific location, and another query asks for a type specimen of a particular genus, species and registration number. There are four queries that enquire after a type specimen of a particular genus and species, and in one additional case the literature reference is also given. Finally, one query enquires after one or more specimens of a particular genus and species that was found in a particular geographical location during a particular year.

Of the 12 queries that do not contain a genus and a species name, nine inquire after a genus and either no restrictions (1), a restriction on the location where it was collected (6) or on the type (4). Two queries only contain a registration number and one query is made up of a registration number and a geographical restriction.

The cases in which a query contains a registration number (17 in total) would facilitate the most precise querying, if it were not for the fact that registration numbers are not unique.

For each of these queries, the relevant database records were identified manually. For 16 queries, no specimens were present in the database. For the remaining 84 questions, the number of returned records varies greatly. The number of relevant records per query is given in Table 6.3, in which the numbers of results relevant to a query are binned in bins that increase in size on a base-10 logarithmic scale.

# Relevant Records   # Queries
1                    21
2–10                 37
11–100               20
101–1000             6
0                    16

Table 6.3: Number of relevant records for reptiles and amphibians queries

6.3.2 Birds

The 100 queries for the birds experiments were gathered from requests about the bird collection that were received between 1992 and 2006. Some example queries are given below.

• Are there any specimens Nipponia nippon in the Naturalis collection?

• Are there any striped crakes (Aenigmatolimnas marginalis) from Africa col- lected by Andersson in 1867 in the collection?

• Is there a juvenile female specimen of Hypotaenidia celebensis (now Gallirallus torquatus celebensis) in the collection?

• Are there any skins of Leclanchers Bunting (P. leclancherii) in the collection?

The birds queries can be divided into queries that (1) contain only a genus and a species name (49), (2) contain a genus and a species name plus a type of material indication (25), (3) contain a genus and a species name plus one or more other restrictions (18), and (4) do not contain a genus and a species name (8). Contrary to the reptiles and amphibians questions, there are no mentions of registration numbers; this is due to the fact that in ornithological publications it is not customary to mention the registration number, whereas in herpetological publications it is. The requests for information on the birds collection do specify what type of material is needed for the research, as 30 of the queries mention skin or skeleton. The reason that this is not relevant for most herpetological research is that those specimens are most often kept in alcohol, whereas birds are mostly kept partly or completely dried and mounted. Of the 18 queries that contain a genus and a species name plus one or more restrictions other than only the type of material, four queries pertain to a particular genus and species and give a literature reference, four queries enquire after specimens of a particular genus and species that were found at a particular location, three queries enquire after a type specimen of a particular genus and species, and three queries after a particular sex. Furthermore, three queries enquire after specimens of a particular genus, species, sex, and age, and one query is directed at specimens of a particular genus, species, size, and material, found at a particular location. Of the eight queries that do not contain a genus and a species name, three enquire after a genus and a particular material, one inquires after all specimens from a particular location, three after only a genus, and one after all specimens of a particular type of material from a particular location. As with the reptiles and amphibians, the gold standard against which Mira was tested was compiled by manually identifying the records relevant to each query in the database. Again, for 16 questions no relevant specimens were available in the databases. The number of relevant records present in the database for the other 84 queries is given in Table 6.4.

# Relevant Records    # Queries
1                     9
2–10                  25
11–100                42
101–1000              8
0                     16

Table 6.4: Number of relevant records for birds queries in bins increasing in size

6.4 Mira Modules

In this section, the eight modules that provide enhanced access to the specimen database are described. Mira contains eight modules, split into three types (query interpretation, query expansion, and result ranking), that aim to improve retrieval performance for natural history data by using domain knowledge. Mira is built on eXist³, an open source XML database management system that features efficient, index-based XQuery processing⁴.

³ http://exist.sourceforge.net/ Last visited: 3 July 2009
⁴ http://www.w3.org/TR/xquery/ Last visited: 29 December 2009

6.4.1 Query Interpretation

Most of the queries in the test sets require more precise formulation than simple and/or queries. Consider for example the query Are there any specimens of species Dendrophis pictus (=Dendrelaphis inornatus) in the collection?. Here, the user is looking for database records that contain either the terms “Dendrophis” and “pictus” or the terms “Dendrelaphis” and “inornatus”. To handle such queries, a query language is needed that can encode that for part of the query any query term should match (which can also be expressed by the boolean or operator) and for part of the query all query terms should match (which can also be expressed by the boolean and operator). The query terms that are extracted from the example query are: dendrophis, pictus, dendrelaphis, and inornatus. To express that specimens of genus Dendrophis and species pictus or of genus Dendrelaphis and species inornatus are to be retrieved, the query is rewritten to any(all(dendrophis,pictus),all(dendrelaphis,inornatus)). Users can be taught this query format, but due to the availability of taxonomic resources Mira can also automatically translate basic query term enumerations such as dendrophis, pictus, dendrelaphis, and inornatus into the interpreted genus and species pairs for the reptiles and amphibians.

A substantial body of work on query interpretation and reformulation exists. Tata and Lohman (2008), for example, developed a system that translates keyword queries into SQL queries to query databases. Their approach utilises a parser to match query terms to database schema elements and a set of rules to generate and rank possible query trees. A similar approach is taken in Zhou et al. (2007) in order to construct SPARQL queries from keyword queries. Background knowledge from ontologies is used to translate keyword queries to description logic conjunctive queries in Tran et al. (2007). Tran et al. match query terms to knowledge base entities, after which connections between the selected knowledge base entities are identified and listed in a graph. This graph is translated to the final query.

The Mira query interpretation module is rule-based and uses domain knowledge from external resources. First, the module looks up each query term in the taxonomic and geographic resources to classify it as either a genus, species, or geographic name. The module can also recognise registration numbers as terms that contain numbers and possibly letters. After each term is classified, the module builds up the query according to the following procedure, which restricts the possible combinations of types of terms. Initially, every query is translated to an and query as all(...,...), denoting that all terms between the brackets should occur in a document for it to be retrieved, as one wants as many terms as possible to match. Then, the system tries to classify each term as indicating a genus, species, geographic, or registration number value. If two terms of the same type are identified within the same query, this could indicate a synonym, and this part of the query is then relaxed so that records that include either one of the synonymous terms can also be retrieved. The synonymous terms are thus embedded in an or query segment by any(...,...). Cases in which two genus/species pairs are encountered are translated to any(all(genus1,species1),all(genus2,species2)) and are expanded if more than two pairs are found. The automatic translation module is checked against a gold standard of manually rewritten queries. For the reptiles and amphibians, it translates 77% of the questions correctly. In all cases, translation failure is caused by a term not matching in the external taxonomy. As was mentioned in Subsection 4.5.3, the accepted taxonomic resources are not complete, due to the continuous debates in the taxonomy domain, which severely limits automatic query interpretation. Also, for the birds domain, no accepted taxonomy was freely available for querying. Therefore, in the experiments presented in Section 6.6, the manually rewritten queries are used.
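To make the procedure concrete, the sketch below illustrates how such a rule-based interpretation could be implemented. It is a minimal illustration under simplifying assumptions, not the Mira code: the lookup sets GENERA, SPECIES, and PLACES stand in for the external taxonomic and geographic resources, and the pairing heuristic (a genus followed by a species forms an all(...) group; multiple groups are joined by any(...)) is a simplified reading of the procedure described above.

```python
# Illustrative sketch of rule-based query interpretation: classify each query
# term via (hypothetical) lookup sets and build the nested any(all(...)) query
# string used by Mira. Lookup sets and the pairing heuristic are assumptions.
import re

GENERA = {"dendrophis", "dendrelaphis"}           # assumed taxonomic resource
SPECIES = {"pictus", "inornatus"}
PLACES = {"holland", "nederland", "netherlands"}  # assumed geographic resource

def classify(term):
    t = term.lower()
    if t in GENERA:
        return "genus"
    if t in SPECIES:
        return "species"
    if t in PLACES:
        return "geo"
    if re.fullmatch(r"[A-Za-z]*\d+[A-Za-z0-9.-]*", term):
        return "regnum"                           # registration numbers contain digits
    return "other"

def interpret(terms):
    """Pair consecutive genus/species terms; multiple pairs become an any(...)."""
    typed = [(t, classify(t)) for t in terms]
    pairs, rest, i = [], [], 0
    while i < len(typed):
        if (typed[i][1] == "genus" and i + 1 < len(typed)
                and typed[i + 1][1] == "species"):
            pairs.append(f"all({typed[i][0]},{typed[i + 1][0]})")
            i += 2
        else:
            rest.append(typed[i][0])
            i += 1
    if not pairs:                                 # fall back to a plain and-query
        return f"all({','.join(rest)})"
    core = pairs[0] if len(pairs) == 1 else f"any({','.join(pairs)})"
    return f"all({core},{','.join(rest)})" if rest else core

print(interpret(["dendrophis", "pictus", "dendrelaphis", "inornatus"]))
# -> any(all(dendrophis,pictus),all(dendrelaphis,inornatus))
```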

6.4.2 Query expansion

Query expansion is a popular topic in information retrieval. It is often used to add synonymous terms to a query to make sure pages or documents using a different but synonymous term are also retrieved (e.g., a query containing the word “forest” could be expanded with the term “wood”). Query expansion to aid retrieval was first introduced in Spärck Jones (1971). In this work, a clustering technique was used to discover related terms based on co-occurrence in documents, in order to expand queries with semantically similar terms. In Guarino et al. (1999), query expansion is performed by searching WordNet for synonyms of a query term. Milne et al. (2007) use Wikipedia instead of WordNet to identify synonymous terms to expand queries with. In addition, results are clustered by topics which are browsable by users, thus providing an extra means to filter out irrelevant results.

Domain-specific resources to identify terms to expand queries with are used by Bodner and Song (1996) and Büttcher et al. (2004). Bodner and Song use domain-specific knowledge bases (as well as WordNet) to search for synonymous, as well as hypernymous and hyponymous, terms to expand queries with. In Büttcher et al. (2004), queries are automatically expanded with terms taken from domain-specific databases. The authors achieve mixed results, some of the databases they used proving more useful than others. The reader is referred to Andreou (2005) for an overview of previous work on query expansion. In Mira, three query expansion modules are implemented that aim to increase recall by providing additional synonymous keywords or by remedying the influence of language variation on the retrieval of relevant results. The first expansion module expands to taxonomic synonyms, the second expands to geographic terms, and the third attempts to expand both taxonomic and geographic terms.

Taxonomic term expansion

In Chapter 2, certain peculiarities of the natural history domain were mentioned, such as the continuous changes to the taxonomy. When inspecting an official taxonomic resource such as Amphibian Species of the World for amphibians and The Reptile Database for reptiles, one will often find synonyms for particular species names. If one wanted, for example, to retrieve all snakes present in the collection, one could query for all records describing a specimen of suborder ‘Serpentes’. However, the suborder ‘Serpentes’ is also known as ‘Ophidiae’. Another problem with this query is that the reptiles and amphibians database does not contain a suborder column (although sometimes the suborder value is entered in the order field); hence, in order to retrieve all snakes in the collection, one would have to query the database for all 18 snake families, each of which may be known by synonyms as well. To relieve users from needing to formulate a query that contains each of the 18 snake families along with their possible synonyms, a query expansion approach is employed. To perform automatic query expansion, Mira uses the taxonomic resources described in Chapter 2. For each query, not only the name that occurs in the query is searched for in the database, but also occurrences of its synonymous terms (Latin and vernacular), as sketched below.
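A minimal sketch of this kind of synonym expansion is given below. The SYNONYMS table is a toy stand-in for the external taxonomic resources, and the any(...) output format follows the query syntax introduced in Subsection 6.4.1; it is not the actual Mira implementation.

```python
# Illustrative sketch of taxonomic query expansion: a query term is replaced by
# an any(...) group over its known synonyms (Latin and vernacular). The synonym
# table is an assumed, simplified stand-in for the external taxonomic resources.
SYNONYMS = {
    "serpentes": ["serpentes", "ophidiae", "snakes"],
    "rana catesbeiana": ["rana catesbeiana", "lithobates catesbeianus",
                         "american bullfrog"],
}

def expand_taxonomic(term):
    """Return an any(...) group over the known synonyms of a term, if any."""
    syns = SYNONYMS.get(term.lower())
    if not syns:
        return term          # unknown terms are left untouched
    return "any(" + ",".join(syns) + ")"

print(expand_taxonomic("Serpentes"))
# -> any(serpentes,ophidiae,snakes)
```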

Geographic term expansion

Similar to the taxonomic resource, but slightly different in operation, is the enrichment of the knowledge base with a geographic resource. In order to process the geographical component of a query such as “What specimens of Rana catesbeiana collected in Holland are present in the collection?”, it is necessary to be able to search the database for synonyms of the term ‘Holland’, as the database contains information in different languages. Also, by ‘Holland’, most likely ‘the Netherlands’ is meant. Three other flavours of geographical expansion modules were investigated in the Mira experiments: (1) hypernym expansion, (2) hyponym expansion, and (3) both hypernym and hyponym expansion. Expansion to hypernyms or hyponyms follows the idea of Voorhees (1994). In the hypernym module, if the query contained the term ‘Nebraska’, the query was expanded with ‘United States of America’ to remedy the negative influence of missing values in the ‘province/state’ column. Although hypernym and hyponym expansion are popular approaches that work for other systems (see Navigli and Velardi (2003) for an overview), they did not aid object retrieval for the herpetological and birds collections in these experiments. This is probably because the retrieval process suffers most from the language variation in the database and not so much from missing values. Therefore the geographical expansion was limited to expanding to synonymous terms found in GeoNames.

Expand via both taxonomic and geographic resources

In Mira, besides investigating the influence of taxonomic and geographic query expansion separately, also the influence of expanding queries both taxonomically and geographically is investigated.

6.4.3 Result Ranking

As strict querying in which all query terms ought to match (the All query mode) often leads to low recall, two less strict search options are employed to boost recall, namely the Any and Complex query modes. As the results in Section 6.6 will show, the broadest search mode, Any, works best to achieve the highest recall. However, as by default eXist retrieves results in the order in which they were indexed, the more relevant results may not be presented first. Therefore, it pays off to rank the results according to relevance, so that users will be presented with the more relevant results first. In Mira, four ranking methods are investigated. The first two are the simplest and most straightforward. The third and fourth ranking methods utilise the ontologies presented in Chapters 3 and 5.

Order by number of matches

This method of ranking is the most straightforward and is not relevant for querying in the All query mode. However, as by default eXist does not rank results and returns records in the order in which they were added to the database, records that match only one query term may be presented before records that match several query terms in the Any query mode. For these cases, a ranking is investigated that presents the records matching the highest number of query terms first, as in the sketch below.
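A tiny sketch of this ranking criterion, under the simplifying assumption that each record is reduced to the set of terms occurring in it:

```python
# Sketch of number-of-matches ranking for Any queries: records matching more
# query terms are presented before records matching only one of them.
def rank_by_num_matches(records, query_terms):
    terms = {t.lower() for t in query_terms}
    return sorted(records, key=lambda rec: len(terms & rec), reverse=True)

records = [{"bufo", "typhonius"}, {"bufo", "marinus", "suriname"}]
print(rank_by_num_matches(records, ["Bufo", "marinus"]))
# the record matching both query terms is listed first
```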

Order by knowledge of column importance

Analysis of the queries has shown that queries usually do not pertain to information in some of the longer database cells such as special remarks. Hence, when giving each column equal importance, a query such as Bufo marinus will return results such as:

RMNH 34003 Bufo marinus Lely Range, airstrip, distr. Marowijne, Surinam, 11-05-1975, 15.50h, on airstrip, near tall forest, 650m, l + d. X.X. Xxxxxxxxx. RMNH 34003 as well as:

RMNH 20761 TANK NO Slide 1980-10-37 (fell) Paleosuchus trigonatus 1 ex. km 110, 19-09-1980, 20.45 h, in swamp, flooded part of forest with many dead trees and low bushes, near jeep trail through tall forest, 100 m. length 1.445 m, skin and carcass to create skeleton. Stomach contents kept separately: crab + Bufo marinus + grit. Observed this specimen already on 16-09-1980 (see p.89).

After analysis of the queries, it was clear that most queries request information from the genus and species columns and never from the special remarks column, in which one might, for example, find information on a specimen's stomach contents. Matches found in the ‘genus’ and ‘species’ columns, as well as in the ‘registration number’ column, are thus ranked higher than matches found in other columns; a small sketch of this weighting follows.
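The sketch below illustrates the column-weighting idea. The column names, weights, and whitespace tokenisation are illustrative assumptions rather than the exact choices made in Mira.

```python
# Hedged sketch of column-importance ranking: records whose matches fall in the
# genus, species, or registration-number columns are ranked above records that
# only match in long free-text columns such as special remarks.
COLUMN_WEIGHT = {"genus": 3, "species": 3, "registration_number": 3}  # default 1

def rank_by_column(records, query_terms):
    """records: list of dicts mapping column name -> cell text."""
    terms = {t.lower() for t in query_terms}

    def score(record):
        s = 0
        for column, value in record.items():
            cell_terms = set(str(value).lower().split())
            if terms & cell_terms:
                s += COLUMN_WEIGHT.get(column, 1)
        return s

    return sorted(records, key=score, reverse=True)

records = [
    {"registration_number": "RMNH 20761", "genus": "Paleosuchus",
     "species": "trigonatus", "special_remarks": "stomach contents: Bufo marinus"},
    {"registration_number": "RMNH 34003", "genus": "Bufo",
     "species": "marinus", "special_remarks": "on airstrip, near tall forest"},
]
for r in rank_by_column(records, ["Bufo", "marinus"]):
    print(r["registration_number"])
# RMNH 34003 (genus/species match) is ranked above RMNH 20761 (remarks match only)
```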

RecordRank

RecordRank is a simplified version of the original PageRank algorithm for ranking results by relevance (Brin and Page, 1998). The main assumption behind PageRank is that some webpages are more authoritative than others and should rank higher than pages that are deemed less authoritative. The idea to rank the Mira retrieval results by some measure of authority is motivated by the hypothesis that researchers might pose questions more often about the specimens or species Naturalis is known for. For example, Naturalis has a large collection of reptile and amphibian specimens from the Amazon. Naturalis may therefore be more of an authority in that field than on reptiles and amphibians from other regions, and researchers might ask more often about specimens from the Amazon, so perhaps these should be presented first. Authority in PageRank is measured by the number of incoming links to a page. Furthermore, PageRank does not consider links from all pages equally; links from pages with a higher PageRank are considered more important than links from pages with a lower PageRank. The PageRank of a set of webpages is calculated iteratively, and the PageRank scores for the total set of webpages form a probability distribution expressing how likely it is that a random user will visit a page. PageRank is summarised below:

Consider a set of pages W consisting of pages w_1, ..., w_n. For every link from page w_i to page w_j, the rank of page w_j increases. The amount by which the rank of w_j increases depends on the rank of w_i.

The PageRank algorithm has sparked interest in applications other than search engines, such as ranking results for entity relation graphs (Chakrabarti, 2007) and Word Sense Disambiguation (Agirre and Soroa, 2009). Similar to the aim in Mira, the PageRank algorithm has also been translated to a relational database setting by Balmin et al. (2004). In Balmin et al.'s work, databases are modelled as graphs in which objects are nodes and their semantic connections the edges. Although the databases used for Mira were originally flat, the ontologies presented in Chapters 3 and 5 provide the databases with the necessary structure to consider them as a relational data resource.

Once the databases are relational, it would seem straightforward to follow Balmin et al. (2004), convert the database to a graph, and apply their ObjectRank algorithm to it. Because all relations in the ontology are bi-directional, the approach is even simpler: the PageRank or ObjectRank scores can be approximated by taking a shortcut and ranking the objects by degree. The notion of degree comes from graph theory and denotes the number of edges (relations) linked to a node (object) (Diestel, 2005). To compute the rank score of an object, the number of relations is counted; objects with a high number of relations receive a higher score and thus a higher rank. All unique database values are objects in the Mira RecordRank module. To go from a ranking of objects in the domain to a ranking of records in the database, the scores of all objects that occur in a database record are added up and normalised over the number of objects present in the database record (as database cells can be empty). Consider for example the following four fragments of database records.

Record    Genus    Species       Country
1         Bufo     Marinus       Suriname
2         Hyla     Triangulum    Equador
3         Hyla     Minuscula     Venezuela
4         Bufo     Marinus       Venezuela

In the Twibio ontology, there are relations between the ‘genus’ and ‘species’ classes and between the ‘genus’ and ‘country’ classes; thus in these records there are also relations between Bufo and Marinus (twice, in Record 1 and Record 4), Bufo and Suriname, Hyla and Triangulum, Hyla and Equador, Hyla and Minuscula, Hyla and Venezuela, and Bufo and Venezuela. Then, the scores of the objects are computed by counting the number of times they occur in the data in some relation to another object. This results in the following scores.

Object        Score    Object       Score
Bufo          4        Minuscula    1
Hyla          4        Suriname     1
Marinus       2        Equador      1
Triangulum    1        Venezuela    2

In order to rank the records, the scores of the objects occurring in each record are added up.

Record    Genus       Species          Country          Score
1         Bufo : 4    Marinus : 2      Suriname : 1     = 7
2         Hyla : 4    Triangulum : 1   Equador : 1      = 6
3         Hyla : 4    Minuscula : 1    Venezuela : 2    = 7
4         Bufo : 4    Marinus : 2      Venezuela : 2    = 8

This results in Record 4 achieving the highest score and thus rank 1, followed by Record 1 and Record 3, and finally Record 2. Two flavours of the RecordRank result ranking module are tested in Mira: (1) one driven by the manually constructed ontology presented in Chapter 3 and (2) one driven by the Twibio ontology presented in Chapter 5.
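The degree-based approximation can be summarised in a few lines of code. The sketch below reproduces the worked example above; the RELATIONS list is a simplified stand-in for the relations defined in the ontologies, and the normalisation over the number of filled cells that Mira applies is omitted for brevity.

```python
# Sketch of the degree-based RecordRank approximation: count, for every unique
# database value, the number of ontology-mediated relations it takes part in,
# then score each record by summing the scores of the values it contains.
from collections import Counter

RECORDS = [
    {"genus": "Bufo", "species": "Marinus", "country": "Suriname"},
    {"genus": "Hyla", "species": "Triangulum", "country": "Equador"},
    {"genus": "Hyla", "species": "Minuscula", "country": "Venezuela"},
    {"genus": "Bufo", "species": "Marinus", "country": "Venezuela"},
]
RELATIONS = [("genus", "species"), ("genus", "country")]   # assumed ontology relations

def object_scores(records):
    degree = Counter()
    for rec in records:
        for col_a, col_b in RELATIONS:
            degree[rec[col_a]] += 1   # each relation instance adds one edge
            degree[rec[col_b]] += 1   # to both participating objects
    return degree

def record_rank(records):
    degree = object_scores(records)
    scored = [(sum(degree[v] for v in rec.values()), rec) for rec in records]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

for score, rec in record_rank(RECORDS):
    print(score, rec["genus"], rec["species"], rec["country"])
# 8 Bufo Marinus Venezuela, 7 Bufo Marinus Suriname, 7 Hyla Minuscula Venezuela, 6 ...
```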

Query-sensitive RecordRank

As PageRank was designed for very large-scale web search, it is biased towards the most important webpages, or in this case, database records. The advantage for the natural history domain is that it reflects the distribution of topics the domain is concerned with, ranking database entries that pertain to the core business first. A drawback is that for broader queries in a smaller domain, the same set of database entries is always ranked on top. It may therefore be more useful to present a ranking of importance relative to a query. This idea was explored in Haveliwala (2002), in which a topic-sensitive PageRank approach is presented. The idea of computing the rank only over the retrieved results is also used in the HITS algorithm, another link analysis algorithm used to rank web pages according to authority (Kleinberg, 1999), which was developed around the same time as PageRank. In Haveliwala's approach, a set of topic-specific PageRank vectors is computed only from pages relevant to the query, which are then used to retrieve results for a query on a particular subject. Since the natural history databases constitute a smaller domain which cannot easily be broken up into subdomains, the Mira query-sensitive RecordRank module does not use precomputed vectors. Instead, for each query the RecordRank scores are computed at run-time, but only over the retrieved results. As with the general RecordRank module in Mira, there are two flavours of the query-sensitive RecordRank module: (1) one driven by the manually constructed ontology (presented in Chapter 3) and (2) one driven by the Twibio ontology (presented in Chapter 5).

6.5 Evaluation Metrics

There is a vast number of evaluation metrics available for IR, each addressing a different dimension of the results. The basic precision and recall measures (van Rijsbergen, 1979) do not evaluate the ranking of results, only how many of the retrieved records were indeed relevant and how many of the total number of relevant records were retrieved. In order to measure whether relevant results are ranked on top, mean average precision and the rank of the first relevant result for each series of queries are also reported. Mean average precision (MAP) is a standard metric in IR evaluation which, besides precision, also takes ranking and relevance into account (Manning et al., 2008). It is computed over all queries, where the average precision of the results returned for a query is computed by truncating the list of returned results after each of the relevant results in turn. The formula is shown in Equation 6.1, in which r is the rank, N the number of retrieved results, rel(r) a binary function on the relevance of a given rank, and P(r) the precision at a given rank.

\[
\mathrm{AvgPrec} = \frac{\sum_{r=1}^{N} \bigl( P(r) \times rel(r) \bigr)}{\text{number of relevant entries}} \tag{6.1}
\]

A disadvantage of MAP is that it provides a single score for one run of experiments (100 queries in this work), weighing queries for which one record is to be retrieved equally to queries for which more records are to be retrieved. Therefore, for each of the series of Mira experiments, a finer-grained analysis of the results is also presented. All measures were computed using the evaluation script used in the Text REtrieval Conferences (TREC)⁵.

⁵ http://trec.nist.gov/ Last visited: 3 July 2009
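For illustration, a small sketch of how average precision following Equation 6.1, and MAP over a set of queries, can be computed; the TREC evaluation script mentioned above implements this measure among many others.

```python
# Sketch of Equation 6.1: average precision for one ranked result list, and
# mean average precision over a set of queries, dividing by the total number
# of relevant entries for the query.
def average_precision(ranked_ids, relevant_ids):
    relevant_ids = set(relevant_ids)
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):   # rank r = 1..N
        if doc_id in relevant_ids:                         # rel(r) = 1
            hits += 1
            precision_sum += hits / rank                   # P(r) at this rank
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# e.g. relevant records retrieved at ranks 1 and 3 out of four returned results:
print(average_precision(["a", "b", "c", "d"], {"a", "c"}))   # (1/1 + 2/3) / 2 = 0.833...
```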

6.6 Experiments and Results

Separate series of experiments were run with the reptiles and amphibians queries (Subsection 6.3.1) and the birds queries (Subsection 6.3.2) for all Mira modules as well as for combinations of modules.

First, a keyword search of terms from all queries is performed on the initial database of 16,870 entries. Precision for a query in which every keyword ought to match (All query mode) is 28.06% and where any keyword ought to match (Any query mode) is 22.27%. Recall is 18.33% and 55.45% respectively. Results for both the All and Any query modes are reported as baseline, as some queries require that all terms match (e.g., Is there a syntype of Megapodius rubripes present in the collection?), whereas others require that some terms match (e.g., Are there any specimens of genus Centrolenella (=Cochranella) present in the collection?). For the birds, a keyword search on the databases yields a precision and recall of 13.18% and 77.01% respectively for an Any search, and a precision and recall of 46.33% and 46.83% for a restrictive All search. As can be seen from the results in Table 6.5, the precision and mean average precision are higher for both data sets when the Complex query mode (e.g., all(all(Hyla,minuta),any(Guyana,Brazil,Trinidad,Columbia,Ecuador,Peru))) is used, but recall still lags behind.

When the non-overlapping field logbooks and registers are added for the reptiles and amphibians, recall is boosted to 31.67% for All and 84.37% for Any. Precision stays low at 33.07% and 21.62%, due to the imprecise nature of simple keyword search. Although the field logbooks and registers make up 2/3 of the data on the reptiles and amphibians, they do not help proportionally as much in these experiments. The reason for this is probably that the collection managers entered the information on the most important specimens into the database first, for example those specimens that are cited or enquired after most often. The results of the unexpanded and unranked All and Any experiments on the full reptiles and amphibians data set (i.e., the manually constructed database and the automatically segmented and labelled field logbooks and registers) are taken as the baseline results over which the Mira modules should provide improvements. The overall baselines are summarised in Table 6.5.

In the remainder of this section, the results for the reptiles and amphibians are presented in Subsection 6.6.1, followed by the results for the birds in Subsection 6.6.2. In all tables, UnExp denotes the unexpanded query mode (also presented in Table 6.5), All denotes the most restricted And search mode, Any denotes the most unrestricted Or search mode, and Cx denotes the interpreted complex query mode. Furthermore, the Mira expansion modules are abbreviated to ‘TaxExp’ for taxonomic expansion, ‘GeoExp’ for geographical expansion, and the combination of the two as ‘TaxGeoExp’. The ranking modules are denoted by ‘NumMatch’ for ranking via number of matches, ‘GenSpec’ for ranking according to genus and species columns first, ‘GlobRRMan’ and ‘GlobRRWiki’ for ranking via global RecordRank through the manual and Wikipedia ontologies, respectively, and ‘QSRRMan’ and ‘QSRRWiki’ for ranking via query-sensitive RecordRank through the manual and Wikipedia ontologies, respectively. All significance scores are computed at the p=0.05 level using a paired t-test (Box et al., 1978) in R version 2.8.0.

            All      (DB)     (FB)     Any      (DB)     (FB)     Cx       (DB)     (FB)
R&A
Precision   33.07    (28.06)  (16.53)  21.62    (22.71)  (9.38)   40.13    (29.22)  (21.47)
Recall      31.67    (18.33)  (12.51)  84.37    (55.45)  (36.65)  37.59    (18.59)  (13.99)
MAP         30.04    (18.10)  (11.39)  28.87    (21.47)  (8.49)   35.87    (18.35)  (12.78)
Birds
Precision   46.33                      13.18                      46.97
Recall      46.83                      77.01                      46.70
MAP         46.17                      14.15                      46.61

Table 6.5: Baseline scores for unranked All, Any and Complex (Cx) queries on reptiles and amphibians and birds data. Results for reptiles and amphibians are split out for database (DB) and field logs and registers (FB)

6.6.1 Results for Reptiles and Amphibians

In this subsection, the results of the experiments on the reptiles and amphibians queries are presented, starting with the results of the query expansion modules.

Query Expansion

In Table 6.6, the overall precision, recall and mean average precision scores of the expansion modules are presented. The expansion modules aid retrieval significantly; in particular, the geographical expansion module accounts for a doubling of the recall in comparison to the baseline in the All and Complex query modes (from 31.67 to 83.30 and from 37.59 to 85.85, respectively). However, precision goes down, as more irrelevant records are also retrieved. The bold scores denote that the result is statistically significant with respect to the results of the UnExpanded experiments. For the All query mode this means that all expansion modules provide significant improvements. For the Any query mode this is not the case, because the recall of the baseline system is already quite high and the expansion modules broaden the search even more, resulting in more results that are less precise. Although the precision decreases overall, for the Complex query mode the mean average precision improves, as the recall is improved by the expansion modules (the MAP goes from 35.87 to 44.29 for the taxonomic expansion module and even to 51.61 for the geographical expansion module). Although both expansion modules separately improve the results, this effect is not observed when both are used at the same time. This is likely because the combination expands the queries so much that the results become too broad. The expansion modules are also not precise in the sense that they do not know in advance which query terms are taxonomic and which are geographic in nature; it could therefore happen that a query term such as marinus (as part of the species name Chaunus marinus) is expanded in the geographical module, where it finds a match with Marinus Canyon and Kavakli (which has Marinus as an alternative name). Thus, all records that contain either of these terms will also be returned. When the modules are used separately, this effect is cancelled out by the beneficial effects of the expansion.

All          UnExp    TaxExp     GeoExp     TaxGeoExp
Precision    33.07    22.84 O    20.92 O    32.88 O
Recall       31.67    68.66 N    83.30 N    61.82 N
MAP          30.04    41.45 N    47.61 N    44.78 N

Any          UnExp    TaxExp     GeoExp     TaxGeoExp
Precision    21.62    15.88 O    21.56 O    21.62 •
Recall       84.37    84.37 •    84.37 •    84.37 •
MAP          28.28    28.87 N    28.87 N    28.87 N

Cx           UnExp    TaxExp     GeoExp     TaxGeoExp
Precision    40.13    22.86 O    20.95 O    30.38 O
Recall       37.59    69.18 N    85.85 N    54.18 N
MAP          35.87    44.29 N    51.61 N    41.14 N

Table 6.6: Precision, recall and mean average precision scores for baseline and expansion modules. Numbers in bold face denote significant increase or decrease of results compared to the UnExp column

In Table 6.7, the number of queries for which a relevant record is returned is presented. As Table 6.7 shows, the expansion modules help smooth out the discrepancies between the queries and the data (e.g., taxonomic and language variations), as for the majority of the queries relevant results are returned. What is striking here is that through the geographical expansion module for the All query mode, 78 queries can be answered, leaving only 22 queries unanswered (of which 16 are meant to be unanswered; thus Mira only fails for 6 queries for which a relevant record is to be found). Although the Any query mode already achieves this number without query expansion, the precision scores for the Any query mode are considerably lower.

            UnExp    TaxExp    GeoExp    TaxGeoExp
All         32       66 N      78 N      63 N
Any         78       78 •      78 •      78 •
Complex     38       66 N      78 N      54 N

Table 6.7: Number of reptile and amphibian queries for which a relevant result is returned

The expansion modules ensure that for the majority of the queries relevant results are returned, but as the expansion modules also cause more results to be returned, it is important to present the more relevant results first. The experiments with the various ranking modules in Mira are used to investigate the best approach to do this. In the majority of the queries that we received from the collection managers, the request is for “any specimens of type X”. Hence, in many cases, simply finding one relevant record answers the question. Although the Any query mode also finds at least one relevant result for 78 of the R&A queries, the Complex query mode provides a higher precision, and thus fewer irrelevant returned results. The results shown in Tables 6.6 and 6.7 indicate that the Mira system provides a significant improvement in access to the data for herpetologists.

Result Ranking

In Table 6.8, the mean average precision scores of the different ranking modules are presented. Table 6.8 shows that only the NumMatch and GenSpec ranking approaches provide a noteworthy improvement in the mean average precision scores, and this effect is only observed for the Any query mode (from 28.28 to 42.57 for NumMatch and to 42.38 for GenSpec). This is not surprising, as fewer irrelevant entries are retrieved for the Complex and All query modes due to their strictness. Although the ontology-based ranking approaches do not provide significant improvements, the manual ontology and the Wikipedia ontology obtain similar scores. This indicates that although the ontologies are quite different from each other, they both express at least the relations that are important to the ranking modules.

              All        Any        Cx
NumMatch      29.54 O    42.57 N    35.42 O
GenSpec       30.40 N    42.38 N    36.23 N
GlobRRMan     30.27 N    29.47 N    36.15 N
GlobRRWiki    30.81 N    28.70 N    36.68 N
QSRRMan       30.24 N    29.17 O    36.11 N
QSRRWiki      30.26 N    28.17 O    36.13 N
Unranked      30.04      28.28      35.87

Table 6.8: Mean average precision results of the ranked results without query expansion for the reptile and amphibian queries

Result Ranking and Query Expansion

As was already shown in Table 6.8, ranking only improves the results for the unexpanded Any query mode. For the All and Complex query modes, ranking even decreases the performance in most cases. To investigate whether this effect persists when the queries are expanded, the best performing ranking methods are also tested in combination with the expansion modules. The results of these experiments are shown in Table 6.9. Again, only the Any query mode benefits from ranking.

All          UnExp      TaxExp     GeoExp
NumMatch     29.54 O    45.12 N    46.88 O
GenSpec      30.40 N    39.77 O    41.68 O
Unranked     30.04      41.45      47.61

Any          UnExp      TaxExp     GeoExp
NumMatch     42.57 N    42.57 N    42.56 N
GenSpec      42.38 N    39.89 N    39.86 N
Unranked     28.28      28.87      28.87

Cx           UnExp      TaxExp     GeoExp
NumMatch     35.42 O    44.48 N    45.98 O
GenSpec      36.23 N    39.75 O    41.60 O
Unranked     35.87      44.29      51.61

Table 6.9: Mean average precision results for expanded and ranked reptile and amphibian queries

However, as most of the information requests are aimed at only one relevant result, the overall mean average precision scores do not tell the whole story. Table 6.7 already showed that relevant results are found for as many queries in either query mode. Although ranking may not work for all relevant results, it is still possible that at least one or two relevant results show up at the top of the list. Whether this is the case is investigated through the results presented in Table 6.10. In Table 6.10, queries are grouped by the rank at which the first relevant result occurs in the list of returned results. The numbers of results are binned in bins of size increasing on a base-10 logarithmic scale. Since users typically do not look further than the first 10 results after an information request (Silverstein et al., 1999), the focus is on whether a relevant record is presented within the first 10 results. For the Any query mode, one can see that ranking helps: the number of queries for which a relevant entry is presented at rank one goes from 23 up to 45 for ranking by a match of the query terms in the genus or species column (the ‘GenSpec’ ranking module). For the All query mode, both the ‘NumMatch’ and ‘GenSpec’ ranking modules ensure that at least one relevant result is present in the first 10 results in 50% or more of the cases when used in combination with query expansion. For the Complex query mode, the number of queries for which a relevant result is returned within the first 10 results is slightly higher when the ranking modules are used, but this effect is minimal.

6.6.2 Results for Birds

In this subsection, the results for the birds experiments are presented. As 30% of the questions mention a type of material for the birds, such as skin or skeleton, the Any query mode caused the number of results to explode: the mention of ‘skin’ alone already causes 184,983 results to be returned (more than 6/7 of all records in the birds database), making the number of results unmanageable. Therefore, for the birds only the All and Complex search modes are investigated. For the birds experiments, no taxonomic expansion module was available, as taxonomic resources for birds are not freely available and usable. The ontology-based ranking method could only be run with the manually constructed ontology, as the construction of a Wikipedia-based ontology was impaired by the incompleteness of Wikipedia articles on the different bird species. These are both limitations that should be resolved in the near future for the advancement of data access.

All
             Unranked                    NumMatch                    GenSpec
             UnExp  TaxExp  GeoExp       UnExp  TaxExp  GeoExp       UnExp  TaxExp  GeoExp
1            29     37      41           28     39      35           30     41      41
2–10         3      7       10           4      15      23           2      9       12
11–100       0      13      16           0      8       16           0      11      15
101–1000     0      9       11           0      4       7            0      5       13
0            68     34      22           68     34      19           68     34      19

Any
             Unranked                    NumMatch                    GenSpec
             UnExp  TaxExp  GeoExp       UnExp  TaxExp  GeoExp       UnExp  TaxExp  GeoExp
1            23     23      23           32     32      32           45     44      44
2–10         15     15      15           22     22      22           10     10      10
11–100       25     25      25           15     15      15           17     15      15
101–1000     15     15      15           9      9       9            6      9       9
0            22     22      22           22     22      22           22     22      22

Complex
             Unranked                    NumMatch                    GenSpec
             UnExp  TaxExp  GeoExp       UnExp  TaxExp  GeoExp       UnExp  TaxExp  GeoExp
1            35     40      45           34     39      35           36     42      42
2–10         3      6       8            4      17      25           2      9       12
11–100       0      12      17           0      6       14           0      10      14
101–1000     0      8       8            0      5       7            0      5       13
0            62     34      22           62     34      19           62     34      19

Table 6.10: Rank of the first relevant result for the reptile and amphibian queries

Query Expansion

In Table 6.11, the precision, recall and mean average precision of the experiments on the birds queries without ranking are presented. Similar to the experiments for the reptiles and amphibians, the geographical expansion module provides a major increase in recall. The drops in precision and mean average precision, however, are significant.

             All                     Cx
             UnExp     GeoExp        UnExp     GeoExp
Precision    46.33     08.66 O       46.97     12.05 O
Recall       46.83     65.42 N       47.70     87.22 N
MAP          46.17     12.06 O       46.61     16.36 O

Table 6.11: Precision, recall and mean average precision for unranked expanded bird queries

As the results in Table 6.12 show, query expansion in combination with the Complex query mode nearly doubles the number of queries for which a relevant record is returned, leaving only 3 of the queries for which a relevant record is present in the database unanswered (16 queries are meant to remain unanswered).

             UnExp    GeoExp
All          34       61 N
Complex      35       81 N

Table 6.12: Number of birds queries for which one or more relevant results are retrieved

Result Ranking and Query Expansion

The results for the ranking modules are presented in Table 6.13. The ranking modules do not significantly improve the mean average precision scores, but this was also not the case for the All and Complex query modes for the reptiles and amphibians. Both query modes already perform quite well without query expansion or ranking for the queries for which an answer is found; therefore, neither query expansion nor ranking aids the mean average precision scores. This is probably because the birds queries are often simpler than the reptiles and amphibians queries, as there are fewer queries that enquire after more than a genus and a species name.

              All                       CX
              UnExp      GeoExp         UnExp      GeoExp
NumMatch      46.17 •    14.07 N        46.61 •    16.36 •
GenSpec       46.17 •    19.94 N        46.61 •    21.16 N
GlobRRMan     46.17 •    13.37 N        46.71 N    14.94 O
QSRRMan       46.17 •    14.15 N        46.61 •    20.03 N
Unranked      46.17      12.06          46.61      16.36

Table 6.13: Mean average precision results for ranked results on unexpanded and expanded birds queries

As Table 6.14 shows, there is some variation in the ranking results. Although the unexpanded queries leave many queries unanswered, for the queries for which Mira does find an answer, it places it on top in the majority of cases. However, when the query expansion module is employed, a relevant record is found within the top 10 in only a minority of cases. This is not surprising: for the reptiles and amphibians, the ranking modules also did not improve the results of the All and Complex query modes, so the same holds for the birds.

All
             Unranked          NumMatch          GenSpec           GlobalRR          LocalRR
             UnExp  Geo        UnExp  Geo        UnExp  Geo        UnExp  Geo        UnExp  Geo
1            33     9          32     9          32     15         33     13         31     11
2–10         1      7          2      7          2      5          1      7          3      4
11–100       0      21         0      15         0      13         0      11         0      8
101–1000     0      24         0      25         0      31         0      21         0      24
0            66     39         66     44         66     36         66     48         66     53

CX
             Unranked          NumMatch          GenSpec           GlobalRR          LocalRR
             UnExp  Geo        UnExp  Geo        UnExp  Geo        UnExp  Geo        UnExp  Geo
1            34     12         33     12         33     15         33     13         32     15
2–10         1      9          2      9          2      7          2      10         3      6
11–100       0      24         0      24         0      19         0      14         0      17
101–1000     0      36         0      36         0      40         0      40         0      38
0            65     19         65     19         65     19         65     23         65     24

Table 6.14: Rank of the first relevant result for the birds queries

6.7 Discussion and Conclusions

In this chapter, a retrieval system was presented that improves retrieval results through query interpretation, query expansion, and result ranking. The novelty of the approach presented in the Mira system is that domain knowledge is utilised in three stages of the retrieval process. While the mean average precision scores stay just below 50, in the majority of the queries that we received from the collection managers the request is for “any specimens of type X”. Hence, in many cases, simply finding one relevant record answers the question. When this is taken into account, the fact that for the reptiles and amphibians only 6 queries remain unanswered by using the Mira interpretation and expansion modules, compared to 54 in the baseline system, and that for the birds 3 queries remain unanswered, compared to 52 in the baseline system, is a very useful result. Therefore, it may be concluded that Mira provides a significant improvement in access to natural history data. Also, the modules that significantly improve the retrieval results on a set of 100 test queries for the R&A domain and 100 test queries for the birds domain all utilise some sort of domain knowledge. Hence, research question 3 can be answered:

RQ3: Can access to information be aided through a retrieval system enriched with domain knowledge?

The answer to RQ3 is that access to information is indeed aided by a retrieval system that is based on domain knowledge. However, not all means of incorporating domain knowledge yield the same results. The experiments presented in this chapter show that query interpretation and expansion provide the greatest increases in performance. The ranking modules tested in Mira do not provide significant improvements in result ranking, in particular for the experiments on the birds data. When queries are not interpreted, the Any query mode can attain similar performance, but only when combined with ranking in addition to the geographical expansion. However, the Any query mode is highly imprecise and for some queries retrieves too many results to process and present to a user, as the experiments on the birds database have shown; thus query interpretation is a better option. The ontology-based ranking modules did not perform as well as expected, but an interesting outcome is that the manually constructed ontology and the Twibio ontology did yield similar performances. This may indicate that both ontological perspectives on the data provide at least the relations that are most important for the domain. Completely automatic query interpretation was not possible, due to the unavailability of taxonomic resources for the birds and the incompleteness of taxonomic resources for the reptiles and amphibians. This is a limitation of any system that utilises external knowledge. However, if users are willing to invest a couple of minutes in familiarising themselves with the query interpretation format, the conversion to the Complex query mode can easily be done manually. There is room for improvement. In some cases, the geographical expansion module also expands taxonomic query terms. Pre-classifying which terms are to be expanded (for example, by a language filter, as taxonomic names are mostly in Latin) or having the user specify this could resolve this problem. The former would possibly slow down the retrieval process, and the latter is more demanding on the user. The implications of the chosen solution, and which option would suit users better, should be investigated in tests with users. This is beyond the scope of this work.

7 Conclusions

The machine does not isolate us from the great problems of nature but plunges us more deeply into them.

Antoine de Saint-Exupéry, Terre des Hommes, 1939

The work presented in the preceding chapters has provided discoveries and observations regarding the improvement of access to natural history collection information. In this chapter, the thesis contributions, research questions, and problem statement are revisited, after which directions for future work are presented. The chapter is set up as follows. In Section 7.1, the thesis contributions are presented. Section 7.2 provides the answers to the research questions, followed by Section 7.3 in which the problem statement is addressed. The chapter is concluded by Section 7.4, which provides recommendations for future work.

7.1 Thesis Contributions

In this thesis, studies were presented that are aimed at improving access to natural history collection information. The three main difficulties in access to collection information are (a) data quality, (b) data structure, and (c) data access. To address these problems, hard and soft reasoning approaches were applied to data provided by the Dutch National Museum for Natural History Naturalis.

In Chapter 3, two minor thesis contributions were presented that laid the groundwork for the main thesis contributions. The first contribution presented in Chapter 3 addresses the backlog in digitisation for the R&A collection by automatically populating the R&A database with nearly 40,000 entries from the R&A field logbooks and registers. No novel approach yielded a significant improvement for the segmenting and labelling of the field logbooks over the results achieved by Lendvai and Hunt (2008). However, analysis of the data in the field logbooks and the registers uncovered large discrepancies between the two resources. Therefore, for the registers a separate training data set was created, following the example of Lendvai and Hunt (2008) for the field logbooks. MBT was retrained on the new annotated data set for the registers. Retraining MBT on a data set of the same type resulted in a significant improvement of the results for the labelling task on the registers (from F=34.03 to F=75.20). This result shows again that although both the field logbooks and the registers are texts from the same domain describing the same objects, the differences between the two resources are too big for them to be treated as a single resource. MBT can be successfully trained to deal with the resources separately by annotating a small number of examples of each. The second contribution, presented in Chapter 3, is a manually constructed ontology that describes the relations between the concepts in the natural history domain. The ontology is based on the CIDOC-CRM reference model, which facilitates sharing collection information with other institutions and integration with other resources. When the ontology is linked to a natural history database, a knowledge base is created that provides more structure, and thus a richer representation of the domain, than a flat database does. Three technical thesis contributions were presented that allow the research questions to be answered.

TC1: an automatic ontology-driven error detection and correction method for structured data

In Chapter 4, TC1, an ontology-driven data cleaning method named Validato, was presented. Validato provides a means to clean up data by making explicit the constraints on the data that are imposed by an ontological structure. The ontological structure used in the Validato experiments is the manually constructed ontology that was presented in Chapter 3. In addition to constraints derived from the ontology, Validato utilises domain knowledge from external resources such as zoological taxonomies and geographical resources. One drawback of this approach is that the domain knowledge needs to be available. An advantage is that Validato can easily be plugged in to an updated version of a domain-specific resource for fast cleanup and updating of a database.

TC2: an instance-driven ontology construction method

In Chapter 5, TC2, an ontology construction method named Twibio, was presented. Twibio is a system that makes relations present in the R&A database explicit by linking it to an external resource that contains explicit information about the domain. The external resource used here is Wikipedia, the online community-generated encyclopaedia that contains rich information about a large variety of topics, including the R&A domain. Twibio utilises the fact that Wikipedia articles are linked to each other, and assumes that the existence of a wikilink between two Wikipedia articles indicates a semantic relation between the topics of the two articles. If two Wikipedia articles are linked in one sentence, Twibio attempts to extract a label for the relation from the text. For each concept in the domain (i.e., each database column) all label candidates are aggregated, and the highest rated label as evaluated by four human judges is included in the Twibio ontology. Inherent to approaches that utilise external knowledge is the fact that the approach is limited by the availability of information. Therefore, an extension to Twibio that not only uses Wikipedia but also institute-intrinsic knowledge (e.g., biographies of persons mentioned as actors in the specimen collection, preservation and description in registration and publications) is to be investigated.

TC3: an ontology-driven information retrieval system that automatically expands queries and ranks the results to improve access

In Chapter 6, TC3, a retrieval system named Mira, was presented. Mira is a system that utilises domain knowledge to aid retrieval of relevant objects and to rank more relevant objects above those that are less relevant. The domain knowledge used comes from three different sources: (1) knowledge from external resources, (2) knowledge from the manual and Twibio ontologies, and (3) knowledge about the domain from analyses of typical queries in the domain. The external knowledge and the knowledge from the query analysis are used to interpret and expand queries. The ontologies are used to guide Mira's ranking modules. The knowledge-driven access that Mira provides to the R&A and birds databases performs significantly better than simple database access in the experiments that were run. Mira concludes the pipeline of digitisation, data cleaning, data structuring, and retrieval presented in this thesis. The research questions are revisited in the next section.

7.2 Answers to Research Questions

In this section, the research questions as presented in Chapter1 are answered.

RQ1a: Can data-driven and knowledge-driven methods provide improvements to the data quality of structured textual resources describing collection ob- jects?

The data cleaning methods Timpute and Validato have been shown to detect a large number of errors in the R&A database. Experiments to estimate the percentage of errors caught with Timpute showed that Timpute is able to capture 80–100% of the artificially introduced spelling errors, 98–100% of the artificially introduced lexical errors, and 94–100% of the artificially introduced content errors. Recall was not measured separately for Validato, as the nature of the approach prevented a similar experiment. However, as Validato detects and corrects a large number of errors, it can be concluded that Validato also contributes significantly to the improvement of the data quality. Hence, RQ1a is answered positively.

RQ1b: To what extent are the data-driven and knowledge-driven methods com- plementary?

Timpute and Validato are complementary in the results they yield. The coverage of the rules used in Validato is small, but precise. Validato's small coverage has the disadvantage that it does not generalise easily to other domains. This is in contrast with Timpute, which provides a one-size-fits-all approach that is identical for each database cell and even for each new database. Therefore, if many similar databases are to be checked, adapting Validato to the domain is an option. When there is not much overlap between the databases, Timpute provides an easier option. The results of Timpute and Validato are also different. Timpute is limited to detecting inconsistent cell values in data. Validato, however, besides detecting individual inconsistencies in the database records, also reveals inconsistencies or problems in the database schema. For example, for some fields the same inconsistency (with respect to the accepted resource) occurs many times. As it is consistent within the database, Timpute will not identify this as an error or problem, whereas Validato will. These two complementary characteristics of Timpute and Validato support a positive answer to RQ1b.

RQ2: Do automatic methods for building ontologies provide different structure for data not achieved by manual ontology building?

The ontology constructed automatically by Twibio provides a different structure for the R&A domain than the manually constructed ontology does. This result supports a positive answer to RQ2. Further research should clarify whether the insights from the Twibio ontology are as useful as those from the manually constructed ontology (or vice versa), which depends on the application the ontology is to be used in. The two ontologies do both reveal information about the perspectives from which they are built; the manually constructed ontology is created from an organisational point of view and is thus more hierarchical, whereas the Twibio ontology reflects the aim of an encyclopaedia, namely expressing all relevant relations, leading to a different structure than the one provided by the manually constructed ontology. This insight is valuable in itself, as it illustrates that it is possible to have two different ontological views of one domain.

RQ3: Can access to information be aided through a retrieval system enriched with domain knowledge?

As results of retrieval experiments with Mira show, access to information can be aided through domain knowledge in the retrieval system, thus RQ3 can be answered positively. However, not all Mira modules that utilise domain knowledge provide the same increase in performance of the retrieval results. The experiments presented in Chapter 6 show that query interpretation and expansion provide the greatest increases in performance.

7.3 Answer to Problem Statement

With the answers to the research questions, the problem statement can now be answered.

Problem Statement To what extent can manual and automatic soft- and hard-reasoning approaches improve the data quality, structure, and access to information in an analogue cultural heritage collection of natural history?

In this thesis, three different aspects of collection information accessibility are addressed: (1) data quality, (2) data structure, and (3) data access. The hard and soft reasoning techniques presented in this thesis have been shown to greatly diminish the problem of data accessibility by turning analogue data resources with limited access into more accessible digital data resources. Each of the main topics is discussed in turn. Firstly, Timpute and Validato were able to significantly reduce the number of errors, as the experiments in Chapter 4 have shown. Moreover, it was possible to detect individual errors as well as errors resulting from a problem with the database schema, providing users with pointers to improve different aspects of the database. Secondly, to provide structure to the data, two orthogonal ontology construction methods were presented. The first is a manual approach to ontology construction, the second an automatic one (Twibio). The resulting ontologies provide different, yet worthwhile, perspectives on the data. The manual ontology provides a hierarchical and object-centred perspective, the Twibio ontology a semantic relatedness perspective. Both ontologies can provide structure for the data that offers new information to applications or users of the data. Finally, Mira, a novel knowledge-driven information retrieval system, was presented that enhances the retrieval process through the use of domain knowledge from ontologies and external resources. Mira provides significant improvements in retrieval performance through query interpretation, query expansion, and ranking of results. Through these modules, Mira provides users with significantly better access to the collection information than before. To conclude, before the application and development of the techniques and systems discussed in this thesis, researchers at Naturalis had to deal with partly corrupt, unorganised, and inaccessible data resources. While the extent of improvement of the techniques and systems presented in this thesis was measured using objective measures for each technique and system separately, the total change in the way researchers can access data goes beyond increases in recall or precision scores. The work presented in this thesis brings the reality of transforming existing analogue information collections into large-scale, high-quality, and enriched information resources for informatics-supported natural history research one step closer.

7.4 Future Work

The future work that can be done based on the work presented in this thesis divides into two categories: improvements on the presented approaches and broader applications. First, future work should address the open issues of the approaches presented in this thesis and test their generalisability to other domains. Experiments with Timpute on databases from other domains have shown that this approach can be applied successfully there as well (Van den Bosch et al., 2009a). However, the exact applications and limitations of Timpute need to be investigated in further research. A second avenue of research for data cleaning could be dealing with missing values. Validato could be applied to the taxonomic and geographic columns to suggest missing values. For the correction of temporal information, a more sophisticated approach should be developed that, for example, can deduce dates from external resources on the objects described in the data. Second, the main issue that is not solved by Twibio is its dependence on the availability of a high-quality external resource to extract relations from. Currently, it is not possible to relate concepts that do not occur in the external resource to other concepts, and this could mean (a) they are not meant to be related to other concepts or (more likely) (b) the resource is incomplete. Third, research on improving Mira could investigate more precise query interpretation and query expansion, for example by pre-selecting terms that ought to be included in the expansion. In order to do this, the automatic interpretation module could be used, or an interface could be designed in which users could easily do this themselves. Finally, although some collection managers at Naturalis are already using some of the prototypes developed during the work done for this thesis, such as the reptiles and amphibians database that also contains the automatically segmented and labelled field logs and registers, much work remains to be done. The approaches presented in Chapters 4–6 were limited in some respects because such resources were not or only partly available. Thus, perhaps the greatest challenge that lies ahead for the biodiversity community is data sharing. With improved data sharing, the approaches presented in this work will contribute to large-scale, high-quality resources that provide researchers with easy access to biodiversity data.

References

Aggarwal, C. C. and Yu, P. S. (2001). Outlier detection for high dimensional data. ACM SIGMOD Record, 31(2), 37–46.

Agirre, E. and Soroa, A. (2009). Personalizing pagerank for word sense disambiguation. In Proceedings of the 12th conference of the European chapter of the Association for Computational Linguistics (EACL-2009).

Agrawal, R. and Srikant, R. (1995). Mining sequential patterns. Technical report, IBM Research Division.

Amar, A. D. (2002). Managing knowledge workers: unleashing innovation and productivity. Greenwood Publishing Group, Santa Barbara, CA, USA.

Andreou, A. (2005). Ontologies and Query expansion. Master’s thesis, School of Informatics, University of Edinburgh.

Auer, S. and Lehmann, J. (2007). What have Innsbruck and Leipzig in common? Extracting Semantics from Wiki Content. In E. Franconi et al., editors, Proceedings of the European Semantic Web Conference (ESWC'07), volume 4519 of Lecture Notes in Computer Science, pages 503–517, Innsbruck, Austria. Springer.

Balmin, A., Hristidis, V., and Papakonstantinou, Y. (2004). ObjectRank: Authority-based keyword search in databases. In M. A. Nascimento, M. T. Özsu, D. Kossmann, R. J. Miller, J. A. Blakeley, and K. B. Schiefer, editors, Proceedings of the Thirtieth International Conference on Very Large Data Bases (VLDB 2004), pages 564–575, Toronto, Canada. Morgan Kaufmann.

Beeferman, D., Berger, A., and Lafferty, J. (1999). Statistical models for text segmentation. Machine Learning, 34(1-3), 177–210.

Bellare, K. and McCallum, A. (2007). Learning extractors from unlabeled text using relevant databases. In Proceedings of Sixth International Workshop on Information Integration on the Web (IIWeb-07), in conjunction with AAAI-07 , pages 10–16, Vancouver, Canada. AAAI Press.

Berners-Lee, T., Hendler, J., and Lassila, O. (2001). The semantic web. Scientific American, 284(5), 28–37.

Bitton, D. and DeWitt, D. J. (1983). Duplicate record elimination in large data files. ACM Transactions on Database Systems, 8(2), 255–265.

Blohm, S. and Cimiano, P. (2007). Using the web to reduce data sparseness in pattern-based information extraction. In Proceedings of the 11th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), pages 18–29, Warsaw, Poland. Springer.

Bobbarjung, D. R., Jagannathan, S., and Dubnicki, C. (2006). Improving duplicate elimination in storage systems. ACM Transactions on Storage (TOS), 2, 424–448.

Bodner, R. C. and Song, F. (1996). Knowledge-based approaches to query expansion in information retrieval. In Proceedings of the 11th Biennial Conference of the Canadian Society for Computational Studies of Intelligence on Advances in Artificial Intelligence, pages 146–158. Springer-Verlag, London, UK.

Borgo, S., Guarino, N., and Masolo, C. (1996). Stratified ontologies: The case of physical objects. In Workshop on Ontological Engineering (ECAI’96), pages 5–16, Budapest, Hungary.

Borkar, V., Deshmukh, K., and Sarawagi, S. (2001). Automatic segmentation of text into structured records. In Proceedings of ACM SIGMOD 2001, pages 175–186, Santa Barbara, CA, USA. ACM Press.

Box, G. E. P., Hunter, W. G., and Hunter, J. S. (1978). Statistics for Experimenters. John Wiley & Sons, Hoboken, NJ, USA.

Brin, S. and Page, L. (1998). The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, 30(1-7), 107–117.

Bruni, R. and Sassano, A. (2001). Errors detection and correction in large scale data collecting. In F. Hoffmann, D. J. Hand, N. Adams, D. Fisher, and G. Guimaraes, editors, Advances in Intelligent Data Analysis, 4th International Conference (IDA 2001), pages 84–94, Cascais, Portugal.

Budanitsky, A. and Hirst, G. (2006). Evaluating WordNet-based measures of lexical semantic relatedness. Computational Linguistics, 32(1), 13–47.

Buitelaar, P., Cimiano, P., and Magnini, B., editors (2005). Ontology Learning from Text: Methods, Evaluation and Applications. IOS Press, Amsterdam, The Netherlands.

Büttcher, S., Clarke, C. L. A., and Cormack, G. V. (2004). Domain-specific synonym expansion and validation for biomedical information extraction (MultiText Experiments for TREC 2004). In Proceedings of the 13th Text Retrieval Conference.

Canisius, S. and Sporleder, C. (2007). Bootstrapping information extraction from field books. In Proceedings of the 2007 Joint Meeting of the Conference on Empirical Methods on Natural Language Processing (EMNLP) and the Conference on Natural Language Learning (CoNLL), pages 827–836, Prague, Czech Republic. ACL.

Cannatella, D. C. (1985). A phylogeny of primitive frogs (archaeobatrachians). Ph.D. thesis, The University of Kansas, Lawrence, USA.

Chakrabarti, S. (2007). Dynamic personalized pagerank in entity-relation graphs. In Proceedings of the 16th international conference on World Wide Web, pages 571–580, Banff, Alberta, Canada. ACM Press.

Chambers, R., Hentges, A., and Zhao, X. (2004). Robust automatic methods for outlier and error detection. Journal of the Royal Statistical Society. Series A (Statistics in Society), 167(2), 323–339.

Chandrasekaran, B., Josephson, J. R., and Benjamins, V. R. (1999). What are ontologies and why do we need them? IEEE Intelligent Systems, 12(5), 20–26.

Chapman, A. D. (2005). Principles and methods of data cleaning. Technical report, Global Biodiversity Information Facility (GBIF), Copenhagen, Denmark.

Chernov, S., Iofciu, T., Nejdl, W., and Zhou, X. (2006). Extracting Semantic Relationships between Wikipedia Categories. In Proceedings of the First Workshop on Semantic Wikis - From Wiki to Semantics [SemWiki2006] - at the third European Semantic Web Conference (ESWC 2006), pages 153–163, Karlsruhe, Germany. Springer Science+Business Media.

Chi, Y.-L. (2007). Elicitation synergy of extracting conceptual tags and hierarchies in textual document. Expert Systems with Applications, 32(2), 349–357.

Ciaramita, M., Gangemi, A., Ratsch, E., Šarić, J., and Rojas, I. (2005). Unsupervised learning of semantic relations between concepts of a molecular biology ontology. In Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI 2005), pages 659–664, Edinburgh, Scotland, UK.

Cimiano, P. (2006). Ontology Learning and Population from Text: Algorithms, Evaluation and Applications. Springer Science+Business Media, New York, NY, USA.

Cole, E., Pisano, E. D., Clary, G. J., Zeng, D., Koomen, M., Kuzmiak, C. M., Seo, B. K., Lee, Y., and Pavic, D. (2006). A comparative study of mobile electronic data entry systems for clinical trials data collection. International Journal of Medical Informatics, 75(10-11), 722–729.

Cover, T. M. and Hart, P. E. (1967). Nearest neighbor pattern classification. Institute of Electrical and Electronics Engineers Transactions on Information Theory, 13(1), 21–27.

Crofts, N., Doerr, M., Gill, T., Stead, S., and Stiff, M. (2008). Definition of the CIDOC Conceptual Reference Model. Technical report, ICOM/CIDOC CRM Special Interest Group. version 4.2.5.

Daelemans, W., Buchholz, S., and Veenstra, J. (1999). Memory-based shallow parsing. In Proceedings of CoNLL’99 , pages 53–60, Bergen, Norway.

Daelemans, W., Zavrel, J., Van der Sloot, K., and Van den Bosch, A. (2004). TiMBL: Tilburg Memory Based Learner, version 5.1, Reference Guide. ILK Research Group Technical Report Series 04-02, Induction of Linguistic Knowledge, Tilburg University, Tilburg, The Netherlands.

Daelemans, W., Zavrel, J., Van den Bosch, A., and Van der Sloot, K. (2007). MBT: Memory-Based Tagger, version 3.1, Reference Guide. ILK Research Group Technical Report Series 07-08, Induction of Linguistic Knowledge, Tilburg University, Tilburg, The Netherlands.

Dasu, T. and Johnson, T. (2003). Exploratory data mining and data cleaning. Wiley-IEEE, Hoboken, NJ, USA.

Diestel, R. (2005). Graph Theory. Graduate texts in mathematics. Springer-Verlag, New York, NY, USA, 3rd edition.

Faure, D. and Nédellec, C. (1998). A corpus-based conceptual clustering method for verb frames and ontology acquisition. In P. Velardi, editor, Proceedings of the LREC workshop on Adapting lexical and corpus resources to sublanguages and applications, pages 5–12, Granada, Spain.

Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. (1996). From data mining to knowledge discovery in databases. AI Magazine, 17(3), 37–54.

Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5), 378–382.

Frost, D. R. (2009). Amphibian species of the world: an online reference. version 5.3. Electronic Database accessible at http://research.amnh.org/herpetology/amphibia/. American Museum of Natural History, New York, NY, USA.

Giles, J. (2005). Internet encyclopaedias go head to head. Nature, 438(7070), 900–901.

Gómez-Pérez, A. (1998). The Handbook of Applied Expert Systems, chapter Knowledge Sharing and Reuse, pages 10-1–10-36. CRC Press LLC, Boca Raton, FL, USA.

Gong, P. and Mu, L. (2000). Error detection through consistency checking. Journal of Geographic Information Sciences, 6(2), 188–193.

Grant, T., Frost, D. R., Caldwell, J. P., Gagliardo, R., Haddad, C. F. B., Kok, P. J. R., Means, D. B., Noonan, B. P., Schargel, W. E., and Wheeler, W. (2006). Phylogenetic systematics of dart-poison frogs and their relatives (amphibia, athesphatanura, dendrobatidae). Bulletin of the American Museum of Natural History, 299, 1–262.

Gruber, T. R. (1993). A translation approach to portable ontology specifications. Knowledge Acquisition, 5(2), 199–220.

Guarino, N., Masolo, C., and Vetere, G. (1999). OntoSeek: Content-based access to the web. IEEE Intelligent Systems, 14(3), 70–80.

Gupta, A. and Jain, R. (1997). Visual information retrieval. Communications of the ACM , 40(5), 70–79.

Guralnick, R. and Hill, A. (2009). Biodiversity informatics: automated approaches for documenting global biodiversity patterns and processes. Bioinformatics, 25(4), 421–428.

Halb, W., Raimond, Y., and Hausenblas, M. (2008). Building linked data for both humans and machines. In WWW 2008 Workshop: Linked Data on the Web (LDOW2008), Beijing, China. ACM Press.

Han, H., Giles, C. L., Manavoglu, E., Zha, H., Zhang, Z., and Fox, E. A. (2003). Automatic document metadata extraction using support vector machines. In Proceedings of the Joint Conference on Digital Libraries (JCDL 2003), pages 37–48, Houston, TX, USA. ACM Press.

Han, J. and Kamber, M. (2001). Data Mining: Concepts and Techniques. Morgan Kaufmann, San Francisco, CA, USA.

Hartmann, J., Sure, Y., Giboin, A., Maynard, D., del Carmen Suárez-Figueroa, M., and Cuel, R. (2004). Methods for ontology evaluation. KWeb Deliverable D1.2.3, University of Karlsruhe, Institut für Angewandte Informatik und Formale Beschreibungsverfahren (AIFB).

Haveliwala, T. H. (2002). Topic-Sensitive PageRank. In Proceedings of WWW2002 , Honolulu, Hawaii, USA. ACM Press.

Hearst, M. A. (1992). Automatic acquisition of hyponyms from large text corpora. In Proceedings of COLING-92, pages 539–545, Nantes, France. ACL.

Hearst, M. A. (1997). TextTiling: segmenting text into multi-paragraph subtopic passages. Computational Linguistics, 23(1), 33–64.

Hernández, M. A. and Stolfo, S. J. (1998). Real-world data is dirty: Data cleansing and the merge/purge problem. Data Mining and Knowledge Discovery, 2(1), 9–37.

Horridge, M., Knublauch, H., Rector, A., Stevens, R., and Wroe, C. (2004). A Practical Guide To Building OWL Ontologies Using The Protégé-OWL Plugin and CO-ODE Tools, Edition 1.0. The University of Manchester, Manchester, England, UK.

Jaffri, A. (2007). Knowledge enhanced searching on the web. In Proceedings 6th International Semantic Web Conference (ISWC 2007), volume 4825 of Lecture Notes in Computer Science, pages 921–925, Busan, Korea. Springer-Verlag.

Jiang, M. F., Tseng, S. S., and Su, C. M. (2001). Two-phase clustering process for outliers detection. Pattern Recognition Letters, 22(6-7), 691–700.

Kalashnikov, D. V. and Mehrotra, S. (2006). Domain-independent data cleaning via analysis of entity-relationship graph. ACM Transactions on Database Systems, 31(2), 716–767.

Kamps, J. and Koolen, M. (2008). The importance of link evidence in Wikipedia. In C. Macdonald, I. Ounis, V. Plachouras, I. Ruthven, and R. W. White, editors, Advances in Information Retrieval: 30th European Conference on IR Research (ECIR 2008), volume 4956 of Lecture Notes in Computer Science, pages 270–282, Glasgow, Scotland.

Karr, A. F., Sanil, A. P., and Banks, D. L. (2006). Data quality: A statistical perspective. Statistical Methodology, 3(2), 137–173.

Kavalec, M. and Svátek, V. (2005). Ontology Learning from Text, chapter A Study on Automated Relation Labelling in Ontology Learning, pages 44–58. IOS Press, Amsterdam, The Netherlands.

Kedad, Z. and Métais, E. (2002). Natural Language Processing and Information Systems, volume 2553 of Lecture Notes in Computer Science, chapter Ontology-based Data Cleaning, pages 137–149. Springer-Verlag, Berlin/Heidelberg, Germany.

Kleinberg, J. (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM , 46(5), 604–632.

Knorr, E. M. and Ng, R. T. (1998). Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases (VLDB’98), New York, NY, USA.

Krishtalka, L. and Humphrey, P. S. (2000). Can natural history museums capture the future? BioScience, 50(7), 611–617.

Kubica, J. and Moore, A. (2003). Probabilistic noise identification and data cleaning. In Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM-03), pages 131–138, Melbourne, FL, USA. IEEE Press.

Lafferty, J., McCallum, A., and Pereira, F. C. N. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), pages 282–289, Williamstown, MA, USA. Morgan Kaufmann.

Lampe, K.-H., Riede, K., and Doerr, M. (2008). Research between natural and cultural history information: Benefits and IT-requirements for transdisciplinarity. ACM Journal on Computing and Cultural Heritage, 1(1), 1–22.

Lee, M. L., Ling, T. W., and Low, W. L. (2000). Intelliclean: A knowledge-based intelligent data cleaner. In Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 290 – 294, Boston, MA, USA. ACM Press.

Lendvai, P. and Hunt, S. (2008). From field notes towards a knowledge base. In Proceedings of the Sixth International Language Resources and Evaluation (LREC'08), pages 644–649, Marrakech, Morocco. European Language Resources Association (ELRA).

Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10(8), 707–710. Translated from Doklady Akademii Nauk SSSR, 163(4):845–848, 1965 (Russian).

Linnaeus, C. (1735). Systema naturae.

Loshin, D. (2001). Enterprise Knowledge Management: The Data Quality Approach. Morgan Kaufmann, San Francisco, CA, USA.

Maedche, A. and Staab, S. (2000). Discovering conceptual relations from text. In Proceedings of the 14th European Conference on Artificial Intelligence (ECAI 2000), pages 321–325, Berlin, Germany. IOS Press.

Maedche, A. and Staab, S. (2001). Ontology learning for the semantic web. IEEE Intelligent Systems, 16(2), 72–79.

Maedche, A. and Staab, S. (2002). Measuring similarity between ontologies. In Proceedings of the 13th European Conference on Knowledge Acquisition and Management (EKAW 2002), pages 251–263, Siguenza, Spain. Springer.

Maletic, J. and Marcus, A. (2000). Data cleansing: Beyond integrity analysis. In Proceedings of the Conference on Information Quality (IQ 2000), pages 200– 209.

Maletic, J. I. and Marcus, A. (2006). Data Mining and Knowledge Discovery Handbook, chapter Data Cleansing, pages 21–36. Springer, New York, NY, USA.

Manning, C. D., Raghavan, P., and Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press, Cambridge, England, UK.

McCallum, A. and Wellner, B. (2003). Object consolidation by graph partitioning with a conditionally-trained distance metric. In Proceedings of the ACM SIGKDD Workshop on Data Cleaning, Record Linkage and Object Consolidation, pages 19–24, Washington, DC, USA. ACM Press.

McCallum, A., Nigam, K., Rennie, J., and Seymore, K. (2000). Automating the construction of internet portals with machine learning. Information Retrieval Journal, 3(2), 127–163.

Medelyan, O., Milne, D., Legg, C., and Witten, I. H. (2009). Mining meaning from Wikipedia. International Journal of Human-Computer Studies, 67(9), 716–754.

Michener, C. D., Corliss, J. O., Cowan, R. S., Raven, P. H., Sabrosky, C. W., Squires, D. S., and Wharton, G. W. (1970). Systematics in support of biological research. Technical report, Division of Biology and Agriculture, National Research Council, Washington, DC, USA.

Milano, D., Scannapieco, M., and Catarci, T. (2005). Using Ontologies for XML Data Cleaning. In R. Meersman, Z. Tari, and P. Herrero, editors, On the Move to Meaningful Internet Systems 2005: OTM Workshops, pages 562–571. Springer Berlin/Heidelberg, Germany.

Milne, D., Witten, I. H., and Nichols, D. M. (2007). A Knowledge-Based Search Engine Powered by Wikipedia. In Proceedings of CIKM’07 , pages 445–454, Lisbon, Portugal. ACM Press.

Mooers, C. N. (1948). Application of Random Codes to the Gathering of Statistical Information. Master’s thesis, Massachusetts Institute of Technology.

Müller, H. and Freytag, J.-C. (2005). Problems, methods, and challenges in comprehensive data cleansing. Technical report, Humboldt-Universität zu Berlin, Germany.

Nakayama, K., Hara, T., and Nishio, S. (2008). Wikipedia Link Structure and Text Mining for Semantic Relation Extraction Towards a Huge Scale Global Web Ontology. In Proceedings of SemSearch 2008 CEUR Workshop, pages 59–73, Tenerife, Spain.

Navigli, R. and Velardi, P. (2003). An analysis of ontology-based query expansion strategies. In Proceedings of 2003 Workshop on Adaptive Text Extraction and Mining (ATEM’03), pages 42–49, Cavtat-Dubrovnik, Croatia.

Nguyen, D. P. T., Matsuo, Y., and Ishizuka, M. (2007). Exploiting Syntactic and Semantic Information for Relation Extraction from Wikipedia. In Proceedings of Workshop on Text-Mining & Link-Analysis (TextLink 2007) at IJCAI 2007 , pages 1414–1420, Hyderabad, India.

Noreen, E. W. (1989). Computer-Intensive Methods for Testing Hypotheses. John Wiley & Sons, Hoboken, NJ, USA.

Nothman, J., Murphy, T. R., and Curran, J. R. (2009). Analysing Wikipedia and Gold Standard Corpora for NER Training. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics (EACL-09), pages 612–620, Athens, Greece. ACL.

Noy, N. F. and Hafner, C. D. (1997). The state of the art in ontology design: A survey and comparative review. AI Magazine, 18(3), 53–74.

Noy, N. F. and McGuinness, D. L. (2001). Ontology development 101: A guide to creating your first ontology. Technical Report SMI-2001-0880, Stanford University, Stanford, CA, USA.

Orr, K. (1998). Data quality and systems. Communications of the ACM , 41(2), 66–71.

Paijmans, H. (1999). Explorations in the Document Vector Model of Information Retrieval. Ph.D. thesis, Tilburg University, Tilburg, The Netherlands.

Pantel, P. and Pennacchiotti, M. (2008). Ontology Learning and Population: Bridging the Gap between Text and Knowledge, chapter Automatically Harvesting and Ontologizing Semantic Relations, pages 171–198. Frontiers in Artificial Intelligence and Applications. IOS Press, Amsterdam, The Netherlands.

Peterson, A. T. and Vieglais, D. (2001). Predicting species invasions using ecological niche modeling: New approaches from bioinformatics attack a pressing problem. BioScience, 51, 363–371.

Pyle, D. (1999). Data Preparation for Data Mining. Morgan Kaufmann, San Francisco, CA, USA.

Rabiner, L. R. and Juang, B. H. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), 257–286.

Radev, D. R., Qi, H., Wu, H., and Fan, W. (2002). Evaluating web-based question answering systems. In Proceedings of the Third International Language Resources and Evaluation (LREC 2002), Las Palmas, Spain.

Rahm, E. and Do, H. H. (2000). Data cleaning: Problems and current approaches. IEEE Data Engineering Bulletin, 23(4), 3–13.

Ramaswamy, S., Rastogi, R., and Shim, K. (2000). Efficient algorithms for mining outliers from large data sets. ACM SIGMOD Record, 29(2), 427–438.

Ravichandran, D. and Hovy, E. (2002). Learning surface text patterns for a question answering system. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics (ACL-2002), Philadelphia, PA, USA. ACL.

Redman, T. C. (1997). Data Quality For The Information Age. Artech House Publishers, Boston, MA, USA.

Redman, T. C. (1998). The impact of poor data quality on the typical enterprise. Communications of the ACM , 41(2), 79–82.

Reynaert, M. (2005). Text-induced spelling correction. Ph.D. thesis, Tilburg University, Tilburg, The Netherlands.

Salton, G. and McGill, M. J. (1983). Introduction to Modern Information Retrieval. McGraw Hill, New York, NY, USA.

Sanderson, M. and Croft, B. (1999). Deriving concept hierarchies from text. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pages 206–213, Berkeley, CA. ACM.

Schomaker, L. (1998). From handwriting analysis to pen-computer applications. IEEE Electronics Communication Engineering Journal, 10(3), 93–102.

Schuh, R. T. (2000). Biological Systematics: principles and applications. Cornell University Press, Ithaca, NY, USA.

Schulte im Walde, S. (2000). Clustering verbs semantically according to their alternation behaviour. In Proceedings of the 18th International Conference on Computational Linguistics (COLING 2000), pages 747–753, Saarbrücken, Germany. DFKI, Morgan Kaufmann.

Silverstein, C., Marais, H., Henzinger, M., and Moricz, M. (1999). Analysis of a very large web search engine query log. ACM SIGIR Forum, 33(1), 6–12.

Snyder, B. and Barzilay, R. (2007). Database-Text Alignment via Structured Multilabel Classification. In Proceedings of the Twentieth International Joint Conference on Artificial Intelligence (IJCAI-07), pages 1713–1718, Hyderabad, India.

Soberón, J. and Peterson, A. T. (2004). Biodiversity informatics: managing and applying primary biodiversity data. The Philosophical Transactions of the Royal Society, 359, 689–698. Published online 18 March 2004.

Spärck-Jones, K. (1971). Automatic Keyword Classification for Information Retrieval. Butterworths, Oxford, England, UK.

Sporleder, C., Van Erp, M., Porcelijn, T., Van den Bosch, A., Arntzen, P., and Van Nieukerken, E. (2006a). Cleaning and enriching research data on reptiles and amphibians. The MITCH pilot project and “nulmeting”. Technical Report ILK 06-01, Tilburg University.

Sporleder, C., Van Erp, M., Porcelijn, T., and Van den Bosch, A. (2006b). Correcting 'wrong-column' errors in text databases. In Proceedings of the Annual Machine Learning Conference of Belgium and The Netherlands (Benelearn-06), pages 49–56, Ghent, Belgium.

Sporleder, C., Van Erp, M., Porcelijn, T., Van den Bosch, A., and Arntzen, P. (2006c). Identifying named entities in text databases from the natural history domain. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC-06), pages 1742–1745, Genoa, Italy.

Sporleder, C., Van Erp, M., Porcelijn, T., and Van den Bosch, A. (2006d). Spotting the 'odd-one-out': Data-driven error detection and correction in textual databases. In Proceedings of the EACL 2006 Workshop on Adaptive Text Extraction and Mining (ATEM-06), pages 41–48, Trento, Italy. ACL.

Stoeckle, M. (2003). Taxonomy, DNA, and the Bar Code of Life. BioScience, 53(9), 2–3.

Suchanek, F. M., Ifrim, G., and Weikum, G. (2006). Leila: Learning to extract information by linguistic analysis. In Proceedings of the ACL-06 Workshop on Ontology Learning and Population, pages 18–25, Sydney, Australia. ACL.

Syed, Z. S., Finin, T., and Joshi, A. (2008). Wikitology: Using Wikipedia as an Ontology. Technical report, University of Maryland, Baltimore County, Baltimore, MD, USA.

Tata, S. and Lohman, G. M. (2008). SQAK: doing more with keywords. In Proceedings of SIGMOD 2008 , pages 889–902, Vancouver, BC, Canada. ACM.

Tran, T., Cimiano, P., Rudolph, S., and Studer, R. (2007). Ontology-based interpretation of keywords for semantic search. In Proceedings 6th International Semantic Web Conference (ISWC 2007), volume 4825 of Lecture Notes in Computer Science, pages 523–536, Busan, Korea. Springer.

Uetz, P., Goll, J., and Hallermann, J. (2008). The reptile database. http://www.reptile-database.org. Last visited: June 4, 2009.

Uschold, M. and Gruninger, M. (1996). Ontologies: Principles, methods and applications. Knowledge Engineering Review, 11(2), 93–136.

Van den Bosch, A., Van Erp, M., and Sporleder, C. (2009a). Making a clean sweep of cultural heritage. IEEE Intelligent Systems, 24(2), 54–63. Special Issue on Cultural Heritage.

Van den Bosch, A., Lendvai, P., Van Erp, M., Hunt, S., Van der Meij, M., and Dekker, R. (2009b). Weaving a new fabric of natural history. Interdisciplinary Science Reviews, 34(2-3), 206–223.

Van Erp, M. (2006). Bootstrapping multilingual geographical gazetteers from corpora. In Proceedings of the 11th ESSLLI Student Session, pages 192–202, Málaga, Spain.

Van Erp, M. (2007). Retrieving lost information from textual databases: Rediscovering expeditions from an animal specimen database. In Proceedings of the ACL 2007 Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2007), pages 17–24, Prague, Czech Republic. ACL.

Van Erp, M. and Hunt, S. (2010). Knowledge-driven information retrieval for natural history. In Proceedings of the 10th Dutch-Belgian Information Retrieval Workshop (DIR 2010), pages 31–38, Nijmegen, The Netherlands.

Van Erp, M., Lendvai, P., and van den Bosch, A. (2009a). Comparing alternative data-driven ontological vistas of natural history. In Proceedings of the eighth International Conference on Computational Semantics (IWCS-8), pages 282– 285, Tilburg, The Netherlands.

Van Erp, M., Van den Bosch, A., Wubben, S., and Hunt, S. (2009b). Instance-driven discovery of ontological relation labels. In Proceedings of the EACL 2009 Workshop on Language Technology and Resources for Cultural Heritage, Social Sciences, Humanities, and Education (LaTeCH - SHELT&R), pages 60–68, Athens, Greece. ACL.

van Rijsbergen, K. (1979). Information Retrieval. Butterworths, Oxford, England, UK.

Veaux, R. D. D. and Hand, D. J. (2005). How to lie with bad data. Statistical Science, 20(3), 231–238.

Voorhees, E. M. (1994). Query expansion using lexical-semantic relations. In Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval, pages 61–69, Dublin, Ireland. ACM Press.

Voss, J. (2005). Measuring Wikipedia. In Proceedings 10th International Conference of the International Society for Scientometrics and Informetrics, pages 221–231, Stockholm, Sweden.

Wang, X., Hamilton, H. J., and Bither, Y. (2005). An ontology-based approach to data cleaning. Technical report, Department of Computer Science, University of Regina, Regina, SK, Canada.

Wilson, E. O. (2000). A global biodiversity map. Science, 289(5488), 2279.

Wubben, S. (2008). Using free link structure to calculate semantic relatedness. ILK Research Group Technical Report Series 08-01, Induction of Linguistic Knowledge, Tilburg University, Tilburg, The Netherlands.

Zhou, Q., Wang, C., Xiong, M., Wang, H., and Yu, Y. (2007). SPARK: Adapting keyword query to semantic search. In Proceedings 6th International Semantic Web Conference (ISWC 2007), volume 4825 of Lecture Notes in Computer Science, pages 694–707, Busan, Korea. Springer.

Zhu, X., Wu, X., and Yang, Y. (2004). Error detection and impact sensitive instance ranking in noisy datasets. In Proceedings of the 19th National Conference on Artificial Intelligence (AAAI-04), pages 278–383, San Jose, CA, USA. AAAI Press.

A The Reptiles and Amphibians Database

Taxonomic information

Class Taxonomic name of class, 93.38% filled; 4 values, but only Amphibia and Reptilia are valid

Order Taxonomic name of the order, 93.35% filled; 14 different values of which 4 appear to be duplicates due to spelling variations

Family Taxonomic name of family, 93.00% filled; 84 different values

Genus Taxonomic name of genus, 98.87% filled; 650 different values

Species Taxonomic name for species, 98.71% filled; 1351 different values with spelling variations and abbreviations possibly forced by database limitations

Sub species Taxonomic name for subspecies, 19.54% filled; 286 different values including spelling variations

Author Name(s) and year of publication in taxonomy, 89.17% filled; 1038 values

Determination Date Date of determination, 14.42% filled; 249 different values, mostly DD-MM-YYYY format

Determinator Name of determinator, sometimes also date of determination, 59.49% filled; 152 different values

Type-name Type exemplar name, 2.22% filled; 123 different values including spelling variations; may include name of collector and year of collection

Type Type characterisation of specimen, 3.04% filled; 29 different values including spelling variations

Information on the collection of the specimen

Country Name of country where specimen was found, 9.06% filled; 71 different values; different languages

Country id Unique number referring to country where specimen was found, 99.77% filled; 126 different values; also holds unknown or invalid values

Province Province or state of finding place, 53.44% filled; 507 different values including spelling variations

Town/city Nearest town or description of location related to nearest town of specimen find, 89.56% filled; 3554 different values including spelling variations

Location Free text field elaborating on finding place and circumstances, 9.21% filled; 653 different values including spelling variations; English and Dutch

Coordinates Coordinates of finding place, 3.37% filled; 118 different values

Altitude Altitude level of finding place, 14.39% filled; 334 different values; different units of measurement

Biotope Description of environmental conditions at finding place, 11.65% filled; 700 different values including spelling and mark-up variations; English and Dutch

Collector Name(s) of collector(s), 88.64% filled; 1056 different values including spelling variations

Collection Date Date the specimen was collected, 84.66% filled; 2997 different values including mark-up variations; mostly numeric and in DD-MM-YYYY format

Collection date (old format) Collection date in old format, 2.89% filled; 81 different values including spelling and mark-up variations

Collection # Non-unique identifier used by collector, 48.77% filled; 4096 different values including mark-up variations

Information on acceptance of the specimen in the collection and its preservation

Number Number of specimens the record refers to, 90.93% filled; 41 different values, 86% of the records refer to 1 specimen

Preservation method Description of conservation method, 91.10% filled; 43 different values including spelling variations

Donator Name of donator, sometimes also year of donation, 26.05% filled; 508 different values including spelling and mark-up variations

Entry Date Date of acceptance by the museum, 54.20% filled; 772 different values

Label information Serial number on label, 52.13% filled; 1814 different values including spelling and mark-up variations; different formats

Printed Auxiliary field indicating whether the specimen label is printed or handwritten, 99.62% filled; 7 different values

Database information

Record id Unique identifier automatically generated by Microsoft Access, 100% filled; 16,870 different values

Registration # Numerical registry index, 100% filled; 16,769 different values

Recorder Name of person that entered the record, 100% filled; 10 different values including mark-up variations

Recorder date Date the record was entered, 99.89% filled; 535 different values

Recorder time Time the record was entered, 59.26% filled; 7555 different values

Characteristics of the specimen

Publication Free text field referring to/elaborating on publication in English and Dutch, 13.38% filled; 87 values including spelling variations

Sex Description of sex, including juvenile, 17.47% filled; 48 different values including spelling variations

Special Remarks Free text field with comments on anything that does not fit into the other fields, 56.97% filled; 2538 different values including spelling variations; mostly Dutch

B The Birds Database

Taxonomic Information

Family Family value as indicated on the specimen label, 14.30% filled; 147 different values

Label Name Genus Genus value as originally indicated on the specimen label, 86.84% filled; 3901 different values

Taxon Name.Genus Corrected genus value, 54.05% filled; 822 different values

Label Name Species Species value as originally indicated on the specimen label, 86.56% filled; 6,689 different values

Taxon Name Species Corrected species value, 54.34% filled; 1,972 different values

Label Name Subspecies Subspecies value as originally indicated on the specimen label, 35.50% filled; 4,371 different values

Taxon Name Subspecies Corrected subspecies value, 47.92% filled; 3,002 different values

Author Reference to the work in which the species was first defined, 72.58% filled; 4,294 different values

Type-name Reference to the type specimen of the species, 0.01% filled; 18 different values

Information on the Collection of a Specimen

Country Country where the specimen was collected, 48.67% filled; 920 different values, including spelling and language variation

Provincie Province or state where the specimen was collected, 1.72% filled; 14 different values

Region Region where the specimen was collected, 60.49% filled, 5,895 different values

Locality Additional location information on where the specimen was found, 72.79% filled; 15,074 different values

Collector Name(s) of collector(s), 63.76% filled; 4,187 different values

Collection Date Date the specimen was collected, 78.62% filled; 32,911 different values

Information on acceptance of the specimen in the collection and its preservation

Number Number of specimens the record refers to, 45.93% filled; 44 different values

Preservation method Description of conservation method, 68.29% filled; 7 different values

Donator Name of donator, sometimes also year of donation, 55.43% filled; 2,219 different values

Acquisitiondate Date of acceptance by the museum, 54.25% filled; 6750 different values

Printed Auxiliary field indicating whether the specimen label is printed or hand- written, 14.30% filled; 3 different values

Captured Auxiliary field indicating whether the specimen died in captivity, 31.63% filled; 22 different values

Collection Information on the collection the specimen belongs to, 16.66% filled; 72 different values

Catalogue number Information on the catalogue in which the specimen is described, 73.27% filled; 779 different values

Database Information

Record id Unique identifier automatically generated by Microsoft Access, 100% filled; 215,119 different values

Registration number Numerical registry index, 84.66% filled; 11925 different values

Registration number (old format) Old numerical registry index, 31.63% filled; 66,124 different values

Recorder Name of person that entered the record, 99.35% filled; 131 different values

Record date Date the record was entered, 68.36% filled; 1,245 different values

Characteristics of the Specimen

Sexe Description of sex, 36.88% filled; 100 different values, including spelling, markup and language variations and information that belongs in a different database column

Age Age of the specimen when it was collected (e.g., adult or juvenile), 3.64% filled; 43 different values

Type of Material Part of the specimen that is preserved (e.g., full specimen, skin, skull, wing), 58.51% filled; 66 different values

Remarks Free text field with comments on anything that does not fit into the other fields, 33.03% filled; 9,878 different values
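The per-column figures listed in Appendices A and B (the percentage of records filled and the number of distinct values) can be reproduced from a database export in a few lines of code. The sketch below is illustrative only; the CSV export and its file name are assumptions, not part of the original databases.

```python
# Illustrative sketch: compute the fill rate and the number of distinct
# values per column, as reported in Appendices A and B.
# (The CSV export and its file name are assumptions for illustration.)
import pandas as pd

df = pd.read_csv("birds_export.csv", dtype=str)

profile = pd.DataFrame({
    "filled_%": (df.notna().mean() * 100).round(2),   # share of non-empty cells
    "distinct_values": df.nunique(),                   # unique non-empty values
})
print(profile.sort_values("filled_%", ascending=False))
```

Such a profile makes it easy to track how fill rates and value diversity change as cleaning and enrichment progress.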

Summary

Cultural heritage institutions harbour a vast treasure of information. However, this treasure of information is often confined to the walls of the archive, museum, or library. This thesis is about improving access to cultural heritage collections through digitisation and enrichment.

In this thesis, three themes that improve information access in a digital information collection from the Dutch National Museum of Natural History Naturalis were investigated: data cleaning, information structuring, and object retrieval. The problem statement that guides the research of this thesis is as follows.

Problem Statement: To what extent can manual and automatic soft- and hard-reasoning approaches improve the data quality, structure, and access to information in an analogue cultural heritage collection of natural history?

The novelty in the work done for this thesis is that techniques from the Natural Language Processing field are applied to data from the natural history domain, which had not been done so far. Also, the interaction between soft-reasoning, or data-driven, and hard-reasoning, or knowledge-driven, approaches is investigated.

In Chapter 2, the field of natural history is introduced and some necessary background is given. Moreover, the resources involved in this work are described.

In Chapter 3, experiments and results on the automatic population of a database from semi-structured text are presented, as well as a manually constructed ontology for the natural history domain.

In Chapter 4, the issue of data quality is addressed. The chapter starts with an overview of issues regarding data that contains errors and an analysis of errors in data from the natural history domain. Then, two methods for automatic cleanup of databases are presented: Timpute and Validato. Timpute is a data-driven method that checks the database for inconsistent values by predicting database values on the basis of all other values in the database. Validato is a hard-reasoning method that utilises domain knowledge from the ontology presented in Chapter 3, as well as from external resources, to check database values. Both Timpute and Validato detect a large number of inconsistencies in the data. The two approaches yield complementary results, as they detect different types of errors.

In Chapter 5, an automatic ontology construction method is presented. The chapter starts with a discussion of automatic ontology construction approaches. Then the approach developed in Mitch, called Twibio, is described. Twibio makes the implicit domain information present in the R&A database explicit by linking it to the online encyclopaedia Wikipedia. From Wikipedia, Twibio extracts relations between different database cells, which are then aggregated to find relations between the different database columns. The ontology constructed by Twibio provides a different structure for the R&A domain than the manually constructed ontology, which is a reflection of the point of view of the underlying resources used in building the ontologies. The manually constructed ontology is created from an organisational point of view and is thus more hierarchical; the Twibio ontology reflects the aim of an encyclopaedia, namely expressing all relevant information, leading to a less organised structure. This insight is valuable in itself, as it illustrates that it is possible to have two different ontological views of one domain.

In Chapter 6, improvements for data retrieval are presented. Here, the Mitch Information Retrieval Appliance, or Mira, is presented. The chapter starts with a short introduction to the field of information retrieval; then the resources used for the Mira experiments are discussed, after which the Mira system is presented. Mira is novel in that it utilises three different types of domain knowledge in three different stages of the retrieval process. It utilises knowledge from external resources and rules to interpret the queries and formulate more precise queries. It utilises the same types of knowledge to expand queries with synonyms to increase recall (a minimal sketch of this expansion step is given at the end of this summary). To rank results by relevance, Mira utilises knowledge from the domain ontologies and query analysis. Mira provides a significant improvement in data access as it decreases the number of unanswered queries. However, not all Mira modules that utilise domain knowledge provide the same increase in performance of the retrieval results. The experiments show that query interpretation and query expansion provide the greatest increases in performance.

Chapter 7 summarises to what extent the problem statement and each of the research questions are answered and provides conclusions and recommendations for future work.
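The query expansion step described above can be illustrated with a minimal sketch. The toy synonym dictionary below stands in for the domain ontologies and external resources that Mira actually consults, and the function is an illustrative assumption rather than Mira's implementation.

```python
# Illustrative sketch of knowledge-driven query expansion: every query term
# is supplemented with known domain synonyms before retrieval, trading some
# precision for higher recall. The dictionary is a toy stand-in for the
# domain ontologies and external resources used by Mira.
SYNONYMS = {
    "frog": ["anura", "kikker"],
    "snake": ["serpentes", "slang"],
    "borneo": ["kalimantan"],
}

def expand_query(query):
    """Return the original query terms plus any known domain synonyms."""
    expanded = []
    for term in query.lower().split():
        expanded.append(term)
        expanded.extend(SYNONYMS.get(term, []))
    return expanded

print(expand_query("frog Borneo"))
# -> ['frog', 'anura', 'kikker', 'borneo', 'kalimantan']
```

Because the collection data mixes English, Dutch, and scientific names, expansion of this kind is one way a single query term can reach records registered under a different vocabulary.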

Samenvatting

Erfgoedinstellingen bezitten een rijke schat aan informatie. Deze informatieschat is echter vaak niet toegankelijk buiten de muren van het archief, het museum, of de bibliotheek. Dit proefschrift behandelt het verbeteren van de toegankelijkheid van cultureel erfgoedcollecties door digitalisatie en verrijking van de informatie. In dit proefschrift worden drie thema’s behandeld ten behoeve van de toegang tot een digitale informatiecollectie van het Nationaal Natuurhistorisch Museum Naturalis: data opschoning, informatie structurering, en object retrieval. De probleemstelling die het onderzoek leidt luidt als volgt.

Probleemstelling: In hoeverre kunnen handmatige en automatische data- en kennis-gedreven technieken de kwaliteit en de structuur van data en toegang tot informatie in een analoge cultureel erfgoed collectie van natuurlijke historie verbeteren?

Het vernieuwende van dit onderzoek is dat technieken uit het vakgebied van de natuurlijke taalverwerking zijn toegepast op data uit het natuurhistorisch domein. Bovendien is de interactie tussen verscheidene soft-reasoning, of data-gedreven, en hard-reasoning, of kennis-gedreven, methoden onderzocht.

In Hoofdstuk 2 wordt het natuurhistorisch domein geïntroduceerd en worden de achtergronden van het onderzoek gegeven. Ook wordt de data die gebruikt is in het onderzoek beschreven.

In Hoofdstuk 3 worden experimenten en resultaten gepresenteerd met betrekking tot het automatisch vullen van een database met semi-gestructureerde teksten. Eveneens wordt een handmatig gemaakte ontologie voor het natuurhistorisch domein beschreven.

In Hoofdstuk 4 wordt het probleem van datakwaliteit behandeld. Het hoofdstuk begint met een beschrijving van problemen die kunnen ontstaan door het gebruik van data die fouten bevatten. Vervolgens wordt een analyse van fouten in natuurhistorische data gepresenteerd. Daarna worden twee methoden voor de automatische opschoning van databases gepresenteerd: Timpute en Validato. Timpute is een data-gedreven methode die de database opschoont door de waarden voor de databasecellen te voorspellen aan de hand van waarden in andere databasecellen. Validato is een kennisgedreven methode die gebruik maakt van domeinkennis uit de ontologie die beschreven is in Hoofdstuk 3 en uit externe kennisbronnen. Voor zowel Timpute als Validato geldt dat ze een groot aantal inconsistente waarden in de data opsporen. Bovendien zijn de resultaten van beide methoden complementair, omdat ze verschillende soorten fouten opsporen.

In Hoofdstuk 5 wordt een methode voor automatische ontologieconstructie gepresenteerd. Het hoofdstuk begint met een discussie over automatische ontologieconstructiemethoden. Daarna wordt de methode die is ontwikkeld binnen het Mitch-project, genaamd Twibio, beschreven. Twibio maakt de domeinkennis die impliciet aanwezig is in de R&A-database expliciet door deze te verbinden met de online encyclopedie Wikipedia. Wikipedia wordt gebruikt door Twibio om relaties te vinden tussen twee databasecellen, waarna deze relaties geaggregeerd worden om relaties tussen de verschillende databasekolommen te beschrijven. De ontologie die op deze manier geconstrueerd wordt door Twibio verschaft een andere structuur voor het R&A-domein dan de handmatig geconstrueerde ontologie. In deze verschillende structuren worden de verschillende perspectieven gereflecteerd die ten grondslag liggen aan de verschillende bronnen die gebruikt zijn bij het creëren van deze ontologieën. De handmatig geconstrueerde ontologie is gecreëerd vanuit een organisatorisch perspectief, wat zich vertaalt in een meer hiërarchische structuur. De Twibio-ontologie laat het doel van een encyclopedie zien, namelijk alle relevante informatie weergeven, wat leidt tot een vrijere structuur. Dit inzicht is op zichzelf al waardevol, omdat het illustreert dat het mogelijk is om twee ontologische perspectieven op een domein te hebben.

In Hoofdstuk 6 worden verbeteringen in het terugvinden van informatie gepresenteerd. In dit hoofdstuk wordt het Mitch-zoeksysteem, genaamd Mira, gepresenteerd. Het hoofdstuk begint met een korte introductie van het vakgebied van information retrieval, waarna de bronnen die gebruikt worden in de Mira-experimenten worden beschreven. Daarna wordt het Mira-systeem gepresenteerd. Mira is vernieuwend omdat het drie verschillende soorten domeinkennis gebruikt tijdens de drie verschillende stappen in het retrieval-proces. Mira gebruikt kennis van externe bronnen en regels om de zoekopdrachten preciezer te kunnen formuleren. Dezelfde soorten domeinkennis worden gebruikt om zoekopdrachten te expanderen met synoniemen om de recall te verbeteren. Om de resultaten te ordenen naar relevantie gebruikt Mira kennis uit domeinontologieën en analyse van de zoekopdrachten. Mira laat een significante verbetering zien in het terugvinden van relevante informatie: het aantal onbeantwoorde zoekopdrachten daalt van rond de 50% naar minder dan 10%. Niet alle vormen van domeinkennis zorgen er echter voor dat Mira beter werkt dan een standaard zoeksysteem; de experimenten laten zien dat vooral interpretatie en het expanderen van de zoekopdracht verantwoordelijk zijn voor de verbeteringen.

Hoofdstuk 7 vat samen in welke mate de probleemstelling en de onderzoeksvragen zijn beantwoord en presenteert conclusies en aanbevelingen voor toekomstig werk.

List of Publications

Journal

• Van den Bosch, A., Lendvai, P., Van Erp, M., Hunt, S., Van der Meij, M., and Dekker, R. (2009b). Weaving a new fabric of natural history. Interdisciplinary Science Reviews, 34(2-3), 206–223

• Van den Bosch, A., Van Erp, M., and Sporleder, C. (2009a). Making a clean sweep of cultural heritage. IEEE Intelligent Systems, 24(2), 54–63. Special Issue on Cultural Heritage

Conference & Workshop

• Van Erp, M. and Hunt, S. (2010). Knowledge-driven information retrieval for natural history. In Proceedings of the 10th Dutch-Belgian Information Retrieval Workshop (DIR 2010), pages 31–38, Nijmegen, The Netherlands

• Van Erp, M., Van den Bosch, A., Wubben, S., and Hunt, S. (2009b). Instance-driven discovery of ontological relation labels. In Proceedings of the EACL 2009 Workshop on Language Technology and Resources for Cultural Heritage, Social Sciences, Humanities, and Education (LaTeCH - SHELT&R), pages 60–68, Athens, Greece. ACL

• Van Erp, M., Lendvai, P., and van den Bosch, A. (2009a). Comparing alternative data-driven ontological vistas of natural history. In Proceedings of the eighth International Conference on Computational Semantics (IWCS-8), pages 282–285, Tilburg, The Netherlands

• Van Erp, M. (2007). Retrieving lost information from textual databases: Re- discovering expeditions from an animal specimen database. In Proceedings of the ACL 2007 Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2007), pages 17–24, Prague, Czech Republic. ACL

• Sporleder, C., Van Erp, M., Porcelijn, T., and Van den Bosch, A. (2006b). Correcting 'wrong-column' errors in text databases. In Proceedings of the Annual Machine Learning Conference of Belgium and The Netherlands (Benelearn-06), pages 49–56, Ghent, Belgium

• Sporleder, C., Van Erp, M., Porcelijn, T., and Van den Bosch, A. (2006d). Spotting the ‘odd-one-out’: Data-driven error detection and correction in textual databases. In Proceedings of the EACL 2006 Workshop on Adaptive Text Extraction and Mining (ATEM-06), pages 41–48, Trento, Italy. ACL

• Van Erp, M. (2006). Bootstrapping multilingual geographical gazetteers from corpora. In Proceedings of the 11th ESSLLI Student Session, pages 192–202, Málaga, Spain

• Sporleder, C., Van Erp, M., Porcelijn, T., Van den Bosch, A., and Arntzen, P. (2006c). Identifying named entities in text databases from the natural history domain. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC-06), pages 1742–1745, Genoa, Italy

Technical Report

• Sporleder, C., Van Erp, M., Porcelijn, T., Van den Bosch, A., Arntzen, P., and Van Nieukerken, E. (2006a). Cleaning and enriching research data on reptiles and amphibians. The MITCH pilot project and “nulmeting”. Technical Report ILK 06-01, Tilburg University

Curriculum Vitae

Marieke van Erp was born in Breda, the Netherlands, on 18 November 1982. She studied Language and Artificial Intelligence at Tilburg University, where she received her Bachelor's and Master's degrees in 2005 (both 2.1 Honours).

After her graduation, she stayed at Tilburg University as a Ph.D. student, where she worked on the intersection of text analytics and the cultural heritage domain in the Mining for Information in Texts from the Cultural Heritage (MITCH) project. The MITCH project was funded by the Netherlands Organization for Scientific Research and, as it was a collaboration between Tilburg University and the Dutch National Museum of Natural History Naturalis, she split her time between both project partners. From October 2007 until May 2008, she worked at the USC/ISI Natural Language Processing group to deepen her knowledge of knowledge representation.

She published on her research in two international refereed journals and presented her work at numerous conferences and workshops (see the list of publications). As of October 2009, she is a researcher at the Vrije Universiteit Amsterdam, where she continues to improve access to cultural heritage collections.


2005-3 Franc Grootjen (RUN) A Pragmatic Ap- 2006 proach to the Conceptualisation of Lan- guage 2006-1 Samuil Angelov (TU/e) Foundations of B2B Electronic Contracting 2005-4 Nirvana Meratnia (UT) Towards Database Support for Moving Object data 2006-2 Cristina Chisalita (VU) Contextual issues in the design and use of information tech- 2005-5 Gabriel Infante-Lopez (UvA) Two-Level nology in organizations Probabilistic Grammars for Natural Lan- guage Parsing 2006-3 Noor Christoph (UvA) The role of meta- cognitive skills in learning to solve prob- 2005-6 Pieter Spronck (UM) Adaptive Game AI lems 2005-7 Flavius Frasincar (TU/e) Hypermedia Pre- 2006-4 Marta Sabou (VU) Building Web Service sentation Generation for Semantic Web In- Ontologies formation Systems 2006-5 Cees Pierik (UU) Validation Techniques 2005-8 Richard Vdovjak (TU/e) A Model-driven for Object-Oriented Proof Outlines Approach for Building Distributed Onto- logy-based Web Applications 2006-6 Ziv Baida (VU) Software-aided Service Bun- 2005-9 Jeen Broekstra (VU) Storage, Querying dling - Intelligent Methods & Tools for and Inferencing for Semantic Web Languages Graphical Service Modeling

2005-10 Anders Bouwer (UvA) Explaining Beha- 2006-7 Marko Smiljanic (UT) XML schema match- viour: Using Qualitative Simulation in In- ing balancing efciency and effectiveness teractive Learning Environments by means of clustering

2005-11 Elth Ogston (VU) Agent Based Matchmak- 2006-8 Eelco Herder (UT) Forward, Back and ing and Clustering - A Decentralized Ap- Home Again - Analyzing User Behavior on proach to Search the Web

2005-12 Csaba Boer (EUR) Distributed Simulation 2006-9 Mohamed Wahdan (UM) Automatic For- in Industry mulation of the Auditors Opinion

2005-13 Fred Hamburg (UL) Een Computermodel 2006-10 Ronny Siebes (VU) Semantic Routing in voor het Ondersteunen van Euthanasiebe- Peer-to-Peer Systems slissingen 2006-11 Joeri van Ruth (UT) Flattening Queries 2005-14 Borys Omelayenko (VU) Web-Service con- over Nested Data Types figuration on the Semantic Web; Exploring how semantics meets pragmatics 2006-12 Bert Bongers (VU) Interactivation - To- wards an e-cology of people, our technolo- 2005-15 Tibor Bosse (VU) Analysis of the Dynam- gical environment, and the arts ics of Cognitive Processes 2006-13 Henk-Jan Lebbink (UU) Dialogue and De- 2005-16 Joris Graaumans (UU) Usability of XML cision Games for Information Exchanging Query Languages Agents SIKS Dissertation Series 179

2006-14 Johan Hoorn (VU) Software Requirements: 2007 Update, Upgrade, Redesign - towards a Theory of Requirements Change 2007-1 Kees Leune (UvT) Access Control and Ser- vice-Oriented Architectures

2006-15 Rainer Malik (UU) CONAN: Text Mining 2007-2 Wouter Teepe (RUG) Reconciling Inform- in the Biomedical Domain ation Exchange and Condentiality: A For- mal Approach

2006-16 Carsten Riggelsen (UU) Approximation 2007-3 Peter Mika (VU) Social Networks and the Methods for Efcient Learning of Bayesian Semantic Web Networks 2007-4 Jurriaan van Diggelen (UU) Achieving Se- mantic Interoperability in Multi-agent Sys- 2006-17 Stacey Nagata (UU) User Assistance for tems: a dialogue-based approach Multitasking with Interruptions on a Mo- bile Device 2007-5 Bart Schermer (UL) Software Agents, Sur- veillance, and the Right to Privacy: a Le- gislative Framework for Agent-enabled Sur- 2006-18 Valentin Zhizhkun (UvA) Graph transform- veillance ation for Natural Language Processing 2007-6 Gilad Mishne (UvA) Applied Text Analyt- ics for Blogs 2006-19 Birna van Riemsdijk (UU) Cognitive Agent Programming: A Semantic Approach 2007-7 Natasa Jovanovic (UT) To Whom It May Concern - Addressee Identication in Face- to-Face Meetings 2006-20 Marina Velikova (UvT) Monotone models for prediction in data mining 2007-8 Mark Hoogendoorn (VU) Modeling of Change in Multi-Agent Organizations

2006-21 Bas van Gils (RUN) Aptness on the Web 2007-9 David Mobach (VU) Agent-Based Medi- ated Service Negotiation

2006-22 Paul de Vrieze (RUN) Fundaments of Ad- 2007-10 Huib Aldewereld (UU) Autonomy vs. Con- aptive Personalisation formity: an Institutional Perspective on Norms and Protocols

2006-23 Ion Juvina (UU) Development of a Cog- 2007-11 Natalia Stash (TU/e) Incorporating Cog- nitive Model for Navigating on the Web nitive/Learning Styles in a General-Pur- pose Adaptive Hypermedia System

2006-24 Laura Hollink (VU) Semantic Annotation 2007-12 Marcel van Gerven (RUN) Bayesian Net- for Retrieval of Visual Resources works for Clinical Decision Support: A Ra- tional Approach to Dynamic Decision- Making under Uncertainty 2006-25 Madalina Drugan (UU) Conditional log- likelihood MDL and Evolutionary MCMC 2007-13 Rutger Rienks (UT) Meetings in Smart En- vironments; Implications of Progressing Technology 2006-26 Vojkan Mihajlovic (UT) Score Region Al- gebra: A Flexible Framework for Struc- 2007-14 Niek Bergboer (UM) Context-Based Im- tured Information Retrieval age Analysis

2007-15 Joyca Lacroix (UM) NIM: a Situated Com- 2006-27 Stefano Bocconi (CWI) Vox Populi: gener- putational Memory Model ating video documentaries from semantic- ally annotated media repositories 2007-16 Davide Grossi (UU) Designing Invisible Handcuffs. Formal investigations in Insti- tutions and Organizations for Multi-agent 2006-28 Borkur Sigurbjornsson (UvA) Focused In- Systems formation Access using XML Element Re- trieval 2007-17 Theodore Charitos (UU) Reasoning with Dynamic Networks in Practice 180 SIKS Dissertation Series

2007-18 Bart Orriens (UvT) On the development 2008-10 Wauter Bosma (UT) Discourse oriented Sum- and management of adaptive business col- marization laborations 2008-11 Vera Kartseva (VU) Designing Controls for 2007-19 David Levy (UM) Intimate relationships Network Organizations: a Value-Based Ap- with articial partners proach

2007-20 Slinger Jansen (UU) Customer Congura- 2008-12 Jozsef Farkas (RUN) A Semiotically ori- tion Updating in a Software Supply Net- ented Cognitive Model of Knowledge Rep- work resentation

2007-21 Karianne Vermaas (UU) Fast diffusion and 2008-13 Caterina Carraciolo (UvA) Topic Driven broadening use: A research on residential Access to Scientic Handbooks adoption and usage of broadband internet in the Netherlands between 2001 and 2005 2008-14 Arthur van Bunningen (UT) Context- Aware Querying; Better Answers with 2007-22 Zlatko Zlatev (UT) Goal-oriented design Less Effort of value and process models from patterns 2008-15 Martijn van Otterlo (UT) The Logic of 2007-23 Peter Barna (TU/e) Specication of Applic- Adaptive Behavior: Knowledge Represent- ation Logic in Web Information Systems ation and Algorithms for the Markov De- cision Process Framework in First-Order 2007-24 Georgina Ram´ırezCamps (CWI) Structu- Domains ral Features in XML Retrieval 2008-16 Henriette van Vugt (VU) Embodied Agents 2007-25 Joost Schalken (VU) Empirical Investigations from a Users Perspective in Software Process Improvement 2008-17 Martin Opt Land (TUD) Applying Archi- 2008 tecture and Ontology to the Splitting and Allying of Enterprises 2008-1 Katalin Boer-Sorb´an(EUR) Agent-Based Simulation of Financial Markets: A mod- 2008-18 Guido de Croon (UM) Adaptive Active Vis- ular, continuous- time approach ion

2008-2 Alexei Sharpanskykh (VU) On Computer- 2008-19 Henning Rode (UT) From document to en- Aided Methods for Modeling and Analysis tity retrieval: improving precision and per- of Organizations formance of focused text search

2008-3 Vera Hollink (UvA) Optimizing hierarch- 2008-20 Rex Arendsen (UvA) Geen bericht, goed ical menus: a usage-based approach bericht. Een onderzoek naar de effecten van de introductie van elektronisch berich- 2008-4 Ander de Keijzer (UT) Management of Un- tenverkeer met een overheid op de admin- certain Data - towards unattended integ- istratieve lasten van bedrijven ration 2008-21 Krisztian Balog (UvA) People search in 2008-5 Bela Mutschler (UT) Modeling and simu- the enterprise lating causal dependencies on process- aware information systems from a cost per- 2008-22 Henk Koning (UU) Communication of IT- spective architecture

2008-6 Arjen Hommersom (RUN) On the Applic- 2008-23 Stefan Visscher (UU) Bayesian network mo- ation of Formal Methods to Clinical Guide- dels for the management of ventilator-asso- lines, an Articial Intelligence Perspective ciated pneumonia

2008-7 Peter van Rosmalen (OU) Supporting the 2008-24 Zharko Aleksovski (VU) Using background tutor in the design and support of adaptive knowledge in ontology matching e-learning 2008-25 Geert Jonker (UU) Efcient and Equitable 2008-8 Janneke Bolt (UU) Bayesian Networks: As- exchange in air trafc management plan re- pects of Approximate Inference pair using spender-signed currency

2008-9 Christof van Nimwegen (UU) The para- 2008-26 Marijn Huijbregts (UT) Segmentation, di- dox of the guided user: assistance can be arization and speech transcription: sur- counter-effective prise data unraveled SIKS Dissertation Series 181

2008-27 Hubert Vogten (OU) Design and imple- 2009-10 Jan Wielemaker (UvA) Logic programming mentation strategies for IMS learning for knowledge-intensive interactive applic- design ations

2008-28 Ildik´oFlesh (RUN) On the use of inde- 2009-11 Alexander Boer (UvA) Legal Theory, Sour- pendence relations in Bayesian networks ces of Law and the Semantic Web

2008-29 Dennis Reidsma (UT) Annotations and sub-2009-12 Peter Massuthe (TU/e) Operating Guide- jective machines - Of annotators, embod- lines for Services ied agents, users, and other humans 2009-13 Steven de Jong (UM) Fairness in Multi- 2008-30 Wouter van Atteveldt (VU) Semantic net- Agent Systems work analysis: techniques for extracting, representing and querying media content 2009-14 Maksym Korotkiy (VU) From ontology- enabled services to service-enabled ontolo- 2008-31 Loes Braun (UM) Pro-active medical in- gies (making ontologies work in e-science formation retrieval with ONTO-SOA)

2008-32 Trung Hui (UT) Toward affective dialogue 2009-15 Rinke Hoekstra (UvA) Ontology Repres- management using partially observable entation - Design Patterns and Ontologies Markov decision processes that Make Sense

2008-33 Frank Terpstra (UvA) Scientic worflow 2009-16 Fritz Reul (UvT) New Architectures in design; theoretical and practical issues Computer Chess

2008-34 Jeroen De Knijf (UU) Studies in Frequent 2009-17 Laurens van der Maaten (UvT) Feature Tree Mining Extraction from Visual Data

2008-35 Benjamin Torben-Nielsen (UvT) Dendritic 2009-18 Fabian Groffen (CWI) Armada, An Evol- morphology: function shapes structure ving Database System

2009 2009-19 Valentin Robu (CWI) Modeling Preferen- ces, Strategic Reasoning and Collabora- 2009-1 Rasa Jurgenelaite (RUN) Symmetric tion in Agent-Mediated Electronic Markets Causal Independence Models 2009-20 Bob van der Vecht (UU) Adjustable Auto- 2009-2 Willem Robert van Hage (VU) Evaluating nomy: Controling Influences on Decision Ontology-Alignment Techniques Making

2009-3 Hans Stol (UvT) A Framework for Evi- 2009-21 Stijn Vanderlooy (UM) Ranking and Reli- dence-based Policy Making Using IT able Classification

2009-4 Josephine Nabukenya (RUN) Improving 2009-22 Pavel Serdyukov (UT) Search For Expert- the Quality of Organisational Policy Mak- ise: Going beyond direct evidence ing using Collaboration Engineering 2009-23 Peter Hofgesang (VU) Modelling Web Us- 2009-5 Sietse Overbeek (RUN) Bridging Supply age in a Changing Environment and Demand for Knowledge Intensive Tasks - Based on Knowledge, Cognition, 2009-24 Annerieke Heuvelink (VU) Cognitive Mod- and Quality els for Training Simulations

2009-6 Muhammad Subianto (UU) Understand- 2009-25 Alex van Ballegooij (CWI) RAM: Array ing Classication Database Management through Relational Mapping” 2009-7 Ronald Poppe (UT) Discriminative Vision- Based Recovery and Recognition of Hu- 2009-26 Fernando Koch (UU) An Agent-Based Mo- man Motion del for the Development of Intelligent Mo- bile Services 2009-8 Volker Nannen (VU) Evolutionary Agent- Based Policy Analysis in Dynamic Envir- 2009-27 Christian Glahn (OU) Contextual Support onments of social Engagement and Reflection on the Web 2009-9 Benjamin Kanagwa (RUN) Design, Dis- covery and Construction of Service-orien- 2009-28 Sander Evers (UT) Sensor Data Manage- ted Systems ment with Probabilistic Models 182 SIKS Dissertation Series

2009-29 Stanislav Pokraev (UT) Model-Driven Se- 2010-5 Claudia Hauff (UT) Predicting the Effect- mantic Integration of Service-Oriented Ap- iveness of Queries and Retrieval Systems plications 2010-6 Sander Bakkes (UvT) Rapid Adaptation 2009-30 Marcin Zukowski (CWI) Balancing vector- of Video Game AI ized query execution with bandwidth-optimized storage 2010-7 Wim Fikkert (UT) A Gesture interaction at a Distance 2009-31 Sofiya Katrenko (UvA) A Closer Look at Learning Relations from Text 2010-8 Krzysztof Siewicz (UL) Towards an Im- proved Regulatory Framework of Free Soft- 2009-32 Rik Farenhorst and Remco de Boer (VU) ware. Protecting user freedoms in a world Architectural Knowledge Management: of software communities and eGovern- Supporting Architects and Auditors ments

2009-33 Khiet Truong (UT) How Does Real Affect 2010-9 Hugo Kielman (UL) A Politi¨elegegevens- Affect Affect Recognition In Speech? verwerking en Privacy, Naar een effectieve waarborging 2009-34 Inge van de Weerd (UU) Advancing in Soft- ware Product Management: An Incremental2010-10 Rebecca Ong (UL) Mobile Communication Method Engineering Approach and Protection of Children

2009-35 Wouter Koelewijn (UL) Privacy en Politie- 2010-11 Adriaan Ter Mors (TUD) The world ac- gegevens; Over geautomatiseerde norma- cording to MARP: Multi-Agent Route tieve informatie-uitwisseling Planning

2009-36 Marco Kalz (OUN) Placement Support for 2010-12 Susan van den Braak (UU) Sensemaking Learners in Learning Networks software for crime analysis 2009-37 Hendrik Drachsler (OUN) Navigation Sup- 2010-13 Gianluigi Folino (RUN) High Performance port for Learners in Informal Learning Net- Data Mining using Bio-inspired techniques works 2010-14 Sander van Splunter (VU) Automated Web 2009-38 Riina Vuorikari (OU) Tags and self-organi- Service Reconfiguration sation: a metadata ecology for learning re- sources in a multi- lingual context 2010-15 Lianne Bodenstaff (UT) Managing Depend- ency Relations in Inter-Organizational Mo- 2009-39 Christian Stahl (TUE, Humboldt-Universi- dels t¨atzu Berlin) Service Substitution A Be- havioral Approach Based on Petri Nets 2010-16 Sicco Verwer (TUD) Efficient Identifica- 2009-40 Stephan Raaijmakers (UvT) Multinomial tion of Timed Automata, theory and prac- Language Learning: Investigations into the tice Geometry of Language 2010-17 Spyros Kotoulas (VU) Scalable Discovery 2009-41 Igor Berezhnyy (UvT) Digital Analysis of of Networked Resources: Algorithms, In- Paintings frastructure, Applications

2009-42 Toine Bogers (UvT) Recommender Sys- 2010-18 Charlotte Gerritsen (VU) Caught in the tems for Social Bookmarking Act: Investigating Crime by Agent-Based Simulation 2010 2010-19 Henriette Cramer (UvA) People’s Respon- 2010-1 Matthijs van Leeuwen (UU) Patterns that ses to Autonomous and Adaptive Systems Matter 2010-20 Ivo Swartjes (UT) Whose Story Is It Any- 2010-2 Ingo Wassink (UT) Work flows in Life Sci- way? How Improv Informs Agency and ence Authorship of Emergent Narrative

2010-3 Joost Geurts (CWI) A Document Engin- 2010-21 Harold van Heerde (UT) Privacy-aware eering Model and Processing Framework data management by means of data de- for Multimedia documents gradation

2010-4 Olga Kulyk (UT) Do You Know What I 2010-22 Michiel Hildebrand (CWI) End-user Sup- Know? Situational Awareness of Co-loca- port for Access to Heterogeneous Linked ted Teams in Multidisplay Environments Data SIKS Dissertation Series 183

2010-23 Bas Steunebrink (UU) The Logical Struc- ture of Emotions

2010-24 DmytroTykhonov Designing Generic and Efficient Negotiation Strategies

2010-25 Zulfiqar Ali Memon (VU) Modelling Hu- man-Awareness for Ambient Agents: A Human Mindreading Perspective

2010-26 Ying Zhang (CWI) XRPC: Efficient Dis- tributed Query Processing on Heterogen- eous XQuery Engines

2010-27 Marten Voulon (UL) Automatisch contrac- teren

2010-28 Arne Koopman (UU) Characteristic Rela- tional Patterns

2010-29 Stratos Idreos(CWI) Database Cracking: Towards Auto-tuning Database Kernels

2010-30 Marieke van Erp (UvT) Accessing Natu- ral History: Discoveries in Data Cleaning, Structuring, and Retrieval

TiCC Dissertation Series

1. Pashiera Barkhuysen
Audiovisual prosody in interaction
Promotores: M.G.J. Swerts, E.J. Krahmer
Tilburg, 3 October 2008

2. Ben Torben-Nielsen
Dendritic morphology: function shapes structure
Promotores: H.J. van den Herik, E.O. Postma
Co-promotor: K.P. Tuyls
Tilburg, 3 December 2008

3. Hans Stol
A framework for evidence-based policy making using IT
Promotor: H.J. van den Herik
Tilburg, 21 January 2009

4. Jeroen Geertzen
Act recognition and prediction. Explorations in computational dialogue modelling
Promotor: H.C. Bunt
Co-promotor: J.M.B. Terken
Tilburg, 11 February 2009

5. Sander Canisius
Structural prediction for natural language processing: a constraint satisfaction approach
Promotores: A.P.J. van den Bosch, W.M.P. Daelemans
Tilburg, 13 February 2009

6. Fritz Reul
New Architectures in Computer Chess
Promotor: H.J. van den Herik
Co-promotor: J. Uiterwijk
Tilburg, 17 June 2009

7. Laurens van der Maaten
Feature Extraction from Visual Data
Promotores: E.O. Postma, H.J. van den Herik
Co-promotor: A.G. Lange
Tilburg, 23 June 2009

8. Stephan Raaijmakers
Multinomial Language Learning: Investigations into the Geometry of Language
Promotores: W. Daelemans, A.P.J. van den Bosch
Tilburg, 1 December 2009

9. Igor Berezhnyy
Digital Analysis of Paintings
Promotores: E.O. Postma, H.J. van den Herik
Tilburg, 7 December 2009

10. Toine Bogers
Recommender Systems for Social Bookmarking
Promotor: A.P.J. van den Bosch
Tilburg, 8 December 2009

11. Sander Bakkes
Rapid Adaptation of Video Game AI
Promotor: H.J. van den Herik
Tilburg, 3 March 2010

12. Maria Mos
Complex Lexical Items
Promotor: A.P.J. van den Bosch
Co-promotores: A. Vermeer, A. Backus
Tilburg, 12 May 2010

13. Marieke van Erp
Accessing Natural History: Discoveries in Data Cleaning, Structuring and Retrieval
Promotor: A.P.J. van den Bosch
Co-promotor: P. Lendvai
Tilburg, 30 June 2010