Applying the CIDOC-CRM to Archaeological Grey Literature
Total Page:16
File Type:pdf, Size:1020Kb
Semantic Indexing via Knowledge Organization Systems: Applying the CIDOC-CRM to Archaeological Grey Literature Andreas Vlachidis A thesis submitted in partial fulfilment of the requirements of the University of Glamorgan / Prifysgol Morgannwg for the degree of a Doctor of Philosophy. July 2012 University of Glamorgan Faculty of Advanced Technology Certificate of Research This is to certify that, except where specific reference is made, the work presented in this thesis is the result of the investigation undertaken by the candidate. Candidate: ................................................................... Director of Studies: ..................................................... Declaration This is to certify that neither this thesis nor any part of it has been presented or is being currently submitted in candidature for any other degree other than the degree of Doctor of Philosophy of the University of Glamorgan. Candidate: ........................................................ Andreas Vlachidis PhD Thesis University of Glamorgan Abstract The volume of archaeological reports being produced since the introduction of PG161 has significantly increased, as a result of the increased volume of archaeological investigations conducted by academic and commercial archaeology. It is highly desirable to be able to search effectively within and across such reports in order to find information that promotes quality research. A potential dissemination of information via semantic technologies offers the opportunity to improve archaeological practice, not only by enabling access to information but also by changing how information is structured and the way research is conducted. This thesis presents a method for automatic semantic indexing of archaeological grey- literature reports using rule-based Information Extraction techniques in combination with domain-specific ontological and terminological resources. This semantic annotation of contextual abstractions from archaeological grey-literature is driven by Natural Language Processing (NLP) techniques which are used to identify “rich” meaningful pieces of text, thus overcoming barriers in document indexing and retrieval imposed by the use of natural language. The semantic annotation system (OPTIMA) performs the NLP tasks of Named Entity Recognition, Relation Extraction, Negation Detection and Word Sense disambiguation using hand-crafted rules and terminological resources for associating contextual abstractions with classes of the ISO Standard (ISO 21127:2006) CIDOC Conceptual Reference Model (CRM) for cultural heritage and its archaeological extension, CRM-EH, together with concepts from English Heritage thesauri and glossaries. The results demonstrate that the techniques can deliver semantic annotations of archaeological grey literature documents with respect to the domain conceptual models. Such semantic annotations have proven capable of supporting semantic query, document study and cross-searching via web based applications. The research outcomes have provided semantic annotations for the Semantic Technologies for Archaeological Resources (STAR) project, which explored the potential of semantic technologies in the integration of archaeological digital resources. The thesis represents the first discussion on the employment of CIDOC CRM and CRM-EH in semantic annotation of grey-literature documents using rule-based Information Extraction techniques driven by a supplementary exploitation of domain-specific ontological and terminological resources. It is anticipated that the methods can be generalised in the future to the broader field of Digital Humanities. 1 The Department of the Environment 1990 Planning Policy Guidance Note 16 (PPG16 (DoE 2010) iv Andreas Vlachidis PhD Thesis University of Glamorgan Acknowledgements I would like to wholeheartedly thank my supervisor Professor Douglas Tudhope for his true kindness, invaluable support and inspiring guidance. Thanks are also due to the members of the Hypermedia Research Unit, Dr. Daniel Cunliffe, Ceri Binding and Dr. Renato Souza for their feedback and encouragement which helped me pursue and complete this study. I would like also to thank Keith May (English Heritage) for his expert input that has greatly helped to understand issues relating to archaeology practice and for his quality feedback and work as the Super Annotator of the evaluation process. Phil Carlisle (English Heritage) for providing domain thesauri Paul Cripps and Tom Brughmans for their manual annotation input during the pilot evaluation phase The Archaeology Data Service for provision of the OASIS corpus. In particular I would like to thank Professor Julian Richards, Dr. Stuart Jeffrey and all the ADS staff members and University of York postgraduate archaeology students who provided manual annotations for the main evaluation phase. The external examiners of the thesis, Prof. Anders Ardö (Lund University, Sweden) and Dr.Antony Beck (University of Leeds, UK) for their kind, valuable and constructive feedback that has helped to improve parts of this work. Finally, I would like to thank all my family members for their love and unwavering support not only during the period of this study but all my life so far but most specially I would like to thank my companion and partner in life Ms Paraskevi Vougessi for her unconditional love and patience that helped to make this work possible. v Andreas Vlachidis PhD Thesis University of Glamorgan To the music of : Markos Vamvakaris, 1905 – 1972 Manolis Rasoulis, 1945 – 2011 Nikos Papazoglou, 1948 – 2011 To my father, Spyros Vlachidis, 1946 – 1986 vi Andreas Vlachidis PhD Thesis University of Glamorgan Table of Contents ABSTRACT ........................................................................................................................................... IV ACKNOWLEDGEMENTS ........................................................................................................................ V TABLE OF CONTENTS ......................................................................................................................... VII LIST OF FIGURES ................................................................................................................................ XV LIST OF TABLES ................................................................................................................................ XVII PUBLISHED WORK .......................................................................................................................... XVIII INTRODUCTION TO THESIS ................................................................................................................... 1 1.1 PRELUDE ......................................................................................................................................... 1 1.2 CONTEXT AND MOTIVATION........................................................................................................... 2 1.3 THESIS LAYOUT ............................................................................................................................... 5 LITERATURE REVIEW ............................................................................................................................. 8 2.1 INTRODUCTION............................................................................................................................... 8 2.2 NATURAL LANGUAGE PROCESSING ................................................................................................. 8 2.2.1 LEVELS OF LINGUISTIC ANALYSIS............................................................................................................. 9 2.2.2 NLP SCHOOLS OF THOUGHT ............................................................................................................... 11 2.2.3 NLP POTENTIAL IN INFORMATION RETRIEVAL ......................................................................................... 11 2.2.3.1 Information Retrieval ........................................................................................................... 11 2.2.3.2 Language Ambiguities and NLP for Information Retrieval ................................................... 13 2.2.3.3 Indexing and Classification with Terminological Resources ................................................. 14 2.3 INFORMATION EXTRACTION ......................................................................................................... 16 2.3.1 THE ROLE OF THE MACHINE UNDERSTANDING CONFERENCE (MUC) .......................................................... 17 2.3.2 TYPES OF INFORMATION EXTRACTION SYSTEMS ...................................................................................... 18 2.3.2.1 Rule-based Information Extraction Systems ........................................................................ 18 2.3.2.2 Machine Learning Information Extraction Systems ............................................................. 19 2.4 ONTOLOGY ................................................................................................................................... 20 2.4.1 CONCEPTUALIZATION ......................................................................................................................... 21 2.4.2 ONTOLOGY TYPES ............................................................................................................................. 22 2.4.3 THE CULTURAL HERITAGE ONTOLOGY CIDOC – CRM ............................................................................