Automated Knowledge Enrichment for Semantic Web Data
Total Page:16
File Type:pdf, Size:1020Kb
AUTOMATED KNOWLEDGE ENRICHMENT FOR SEMANTIC WEB DATA A THESIS SUBMITTED TO AUCKLAND UNIVERSITY OF TECHNOLOGY IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY Supervisors: Dr. Quan Bai Dr. Jian Yu Dr. Qing Liu November 2018 By Molood Barati School of Engineering, Computer and Mathematical Sciences Copyright Copyright in text of this thesis rests with the Author. Copies (by any process) either in full, or of extracts, may be made only in accordance with instructions given by the Author and lodged in the library, Auckland University of Technology. Details may be obtained from the Librarian. This page must form part of any such copies made. Further copies (by any process) of copies made in accordance with such instructions may not be made without the permission (in writing) of the Author. The ownership of any intellectual property rights which may be described in this thesis is vested in the Auckland University of Technology, subject to any prior agreement to the contrary, and may not be made available for use by third parties without the written permission of the University, which will prescribe the terms and conditions of any such agreement. Further information on the conditions under which disclosures and exploitation may take place is available from the Librarian. © Copyright 2019. Molood Barati ii Declaration I hereby declare that this submission is my own work and that, to the best of my knowledge and belief, it contains no material previously published or written by another person nor material which to a substantial extent has been accepted for the qualification of any other degree or diploma of a university or other institution of higher learning. Signature of candidate iii Acknowledgements First and foremost, I would like to express my special gratitude to my primary supervisor Dr. Quan Bai, who has been a professional advisor for me. His ethics, sincerity, friendship, and motivation have profoundly inspired me. He has taught me to present my ideas as clearly and simply as possible. Apart from academic achievements, he gifted me the most important lesson of my life that is "patience". Without patience, nothing great can be achieved. I would also like to acknowledge my mentor supervisor, Dr. Qing Liu, for her con- structive advice. The completion of this thesis would not have been possible without her scientific supports and suggestions. Special thanks to my colleagues, within the School of Engineering, Computer and Mathematical Sciences at Auckland University of Technology, who I have had the pleasure to work with them during this time, Dr. Jian Yu, Prof, Ajit Narayanan, Helena Bahrami, etc. Words can not express how grateful I am to my family in the pursuit of this project. I would like to appreciate my parents and sister, whose endless love supported me during this journey. Special thanks are due to my beloved husband, Jamal, for his continuous assist and understanding in the completion of this thesis. iv Abstract The Semantic Web is an effort to interchange unstructured data over the Web into a structured format that is processable not only by human beings but also computers. The Semantic Web creates a distributed framework to publish, query, and reuse information. The key backbones of Semantic Web are ontologies and annotations that provide semantics for raw data known as RDF data. Although there exist many Semantic Web applications, sophisticated analytical infrastructures are still lacking, preventing users from extracting the semantics attached to RDF data. Additionally, the Semantic Web data face with a wide range of data quality issues due to the distributed nature of the Semantic Web. This thesis presents three approaches based on the following purposes: (I) to express the semantics behind discovered patterns, (II) to deal with a Semantic Web data quality issue, and (III) to enrich knowledge in the Semantic Web ontologies. The following contributions have been made in this thesis. Firstly, the thesis shows the influence of relations and ontological knowledge in the process of mining hidden patterns and proposes Semantic Web Association Rule Mining (SWARM), an automated mining approach that attaches semantics to the discovered patterns. Secondly, the thesis concentrates on a data quality issue in the Semantic Web field which indicates incorrect assignments between instances and classes in the ontology. To this end, Class Assignment Detector (CAD) approach has been designed to tackle the data quality issue. Thirdly, the thesis enhances the process of ontology enrichment by generating v new classes by mining instance-level and schema-level knowledge. Since ontologies are often designed before actual usage, Class Enricher (CEn) approach is developed to extract new classes which are not defined in the ontologies. All the proposed approaches have been tested over real datasets to validate their effectiveness. vi Publications 1. Barati, M., Bai, Q. & Liu, Q. (2016). Swarm: an approach for mining semantic association rules from semantic web data. In Pacific rim international conference on artificial intelligence (pp. 30–43). Springer. 2. Barati, M., Bai, Q. & Liu, Q. (2017). Mining semantic association rules from rdf data. Knowledge-Based Systems, 133, 183–196. Elsevier. 3. Barati, M., Bai, Q. & Liu, Q. (2018). An entropy-based class assignment detection approach for rdf data. In Pacific rim international conference on artificial intelligence (pp. 412–420). Springer. vii Contents Copyright ii Declaration iii Acknowledgements iv Abstract v Publications vii 1 Introduction 1 1.1 Preliminaries . .2 1.1.1 RDF data model . .4 1.1.2 RDF/S . .4 1.1.3 OWL . .4 1.1.4 SPARQL . .5 1.1.5 LOD . .5 1.2 Challenging research issues in the SW . .6 1.3 Motivation . .7 1.4 Research questions and objectives . .9 1.5 Contributions . 10 1.6 Research methodology . 11 1.7 Thesis structure . 13 2 Literature Review 14 2.1 Frequent pattern mining . 14 2.1.1 Frequent subgraph mining . 15 2.1.2 Logical rule mining . 17 2.1.3 Association rule mining . 19 2.2 SW data quality improvement . 21 2.2.1 Missing type properties . 22 2.2.2 Incorrect or incomplete statements . 23 2.2.3 Invalid links to external resources . 24 2.2.4 Missing predicates . 25 2.3 Ontology enrichment . 26 viii 2.3.1 Schema learning . 26 2.3.2 Property learning . 28 2.3.3 Class learning . 29 2.4 Summary . 30 3 Mining Semantic Association Rules From RDF Data 32 3.1 Motivation . 33 3.2 The Semantic Web Association Rule Mining (SWARM) approach . 35 3.2.1 Pre-processing module . 36 3.2.2 Mining module . 41 3.3 Experiments and analysis . 46 3.3.1 Experiment 1: Semantic Association Rules from DrugBank dataset . 47 3.3.2 Experiment 2: Semantic Association Rules from DBpedia dataset 49 3.3.3 Experiment 3: SWARM vs. Mining Configurations . 62 3.4 Discussion . 66 3.5 Summary . 67 4 An Entropy-Based Class Assignment Detection Approach For RDF Data 69 4.1 Motivation . 70 4.2 Class Assignment Detector (CAD) approach . 71 4.2.1 Class Features Extraction module . 71 4.2.2 Instance-Class Relationship Analysis module . 80 4.3 Experiments and analysis . 83 4.3.1 Experiment 1: Class Feature extraction from DBpedia dataset 84 4.3.2 Experiment 2: Accuracy of CAD approach . 86 4.4 Summary . 89 5 Automated New Classes Extraction From RDF Data 90 5.1 Motivation . 91 5.1.1 New class placement . 93 5.2 Class Enricher (CEn) approach . 94 5.2.1 Information Gain Analysis module . 94 5.2.2 Class Creation module . 95 5.3 Experiments and analysis . 96 5.3.1 Generating new classes for DBpedia ontology . 96 5.4 Summary . 99 6 Conclusions 100 6.1 Research contributions . 100 6.2 Future directions . 103 References 106 ix Appendices 114 x List of Tables 3.1 Some RDF triples from the DBpedia dataset . 34 3.2 Some examples of Semantic Items . 39 3.3 Common Behaviour Sets . 41 3.4 Some examples of Semantic Association Rules . 42 3.5 Examples of Semantic Association Rules from DrugBank (SimTh=80%) 47 3.6 The quality measures of generated rules form DrugBank dataset . 48 3.7 Examples of Semantic Association Rules discovered from Person class (SimTh=60%)................................. 52 3.8 Examples of Semantic Association Rules discovered from Person class (SimTh=80%)................................. 53 3.9 Examples of Semantic Association Rules discovered from Organization class (SimTh=80%)............................. 54 3.10 Examples of Semantic Association Rules discovered from Place class (SimTh=80%)................................. 55 3.11 Examples of Semantic Association Rules discovered from Work class (SimTh=80%)................................. 56 3.12 Examples of Semantic Association Rules discovered from Event class (SimTh=80%)................................. 57 3.13 Examples of Semantic Association Rules discovered from Species class (SimTh=80%)................................. 58 3.14 The quality measures of generated rules form different classes of DB- pedia ontology . 60 3.15 Six configurations of context and target . 63 3.16 Semantic Association Rules generated by the SWARM (SimTh=80%) 64 3.17 Mining Configurations vs. SWARM (0.6≤Min conf.≤0.8)........ 65 4.1 Some RDF triples from the DBpedia dataset . 72 4.2 An example of a Class Random Space . 77 4.3 Examples of common feature spaces . 78 4.4 Some details of Hotel, Food, Beverage, Holiday, and BritishRoyalty classes (NGTh≥0.85, MinIN≥15) . 84 5.1 Some new classes generated by the CEn approach (NGTh<0.85, MinIN≥15) 97 xi List of Figures 1.1 The SW layer cake . .3 1.2 Research methodology . 12 3.1 A fragment of the DBpedia ontology . 35 3.2 The SWARM framework . 36 3.3 Different hierarchical structures of an ontology . 44 3.4 Number of strong rules in different minimum Similarity Threshold from DrugBank dataset .