DEPARTMENT OF INFORMATION ENGINEERING AND COMPUTER SCIENCE
ICT International Doctoral School

Automatic Population of Structured Knowledge Bases via Natural Language Processing

Marco Fossati

Advisor

Giovanni Tummarello

Fondazione Bruno Kessler

Abstract

The Web has evolved into a huge mine of knowledge carved in different forms, the predominant one still being the free-text document. This motivates the need for Intelligent Web-reading Agents: hypothetically, they would skim through disparate Web source corpora and generate meaningful structured assertions to fuel Knowledge Bases (KBs). Ultimately, comprehensive KBs, like Wikidata and DBpedia, play a fundamental role in coping with the issue of information overload. In light of this vision, this thesis presents a set of systems based on Natural Language Processing (NLP), which take as input unstructured or semi-structured information sources and produce machine-readable statements for a target KB. We implement four main research contributions: (1) a one-step methodology for crowdsourcing Frame Semantics annotation; (2) an NLP technique implementing the above contribution to perform N-ary Relation Extraction from Wikipedia, thus enriching the target KB with properties; (3) a taxonomy learning strategy to produce an intuitive and exhaustive class hierarchy from the Wikipedia category graph, thus augmenting the target KB with classes; (4) a recommender system that leverages a KB network to yield atypical suggestions with detailed explanations, serving as a proof of concept for real-world end users. The outcomes are incorporated into the Italian DBpedia chapter, can be queried through its public endpoint, and/or downloaded as standalone data dumps.

Keywords Natural Language Processing, Information Extraction, Machine Learning, Frame Semantics, Crowdsourcing, Recommender Systems, Wikipedia.

Contents

1 Introduction
  1.1 The Vision
  1.2 The Problem
  1.3 The Solution and its Innovative Aspects
    1.3.1 Contributions
  1.4 Structure of the Thesis

2 State of the Art
  2.1 Entity Linking
  2.2 Frame Semantics
  2.3 Crowdsourcing
    2.3.1 Games with a Purpose
    2.3.2 Micro-tasks
    2.3.3 Frame Semantics Annotation
  2.4 Information Extraction for KB Population
    2.4.1 Information Extraction
    2.4.2 Knowledge Base Construction
    2.4.3 Open Information Semantification
    2.4.4 Further Approaches
  2.5 Taxonomy Learning
    2.5.1 Wikipedia-powered Knowledge Bases
    2.5.2 Type Inference
  2.6 Recommender Systems
    2.6.1 CF and CB systems
    2.6.2 Similarity, diversity, coherence
    2.6.3 Linked Open Data for recommendation
    2.6.4 Use of semantic networks for news recommendation
    2.6.5 Other approaches for news recommendation
    2.6.6 Evaluation guidelines

3 Crowdsourcing Frame Annotation
  3.1 Introduction
  3.2 Experiments
    3.2.1 Assessing Task Reproducibility and Worker Behavior Change
    3.2.2 General Task Setting
    3.2.3 2-step Approach
    3.2.4 1-step Approach
  3.3 Improving FEs Annotation with DBpedia
    3.3.1 Annotation Workflow
  3.4 Experiments
  3.5 Results
  3.6 Conclusion

4 Properties: N-ary Relation Extraction from Free Text
  4.1 Introduction
    4.1.1 Contributions
    4.1.2 Problem and Solution
  4.2 Use Case
  4.3 System Description
  4.4 Corpus Analysis
    4.4.1 Lexical Units Extraction
    4.4.2 Lexical Units Selection
  4.5 Use Case Frame Repository
  4.6 Supervised Relation Extraction
    4.6.1 Sentence Selection
    4.6.2 Training Set Creation
    4.6.3 Frame Classification: Features
  4.7 Numerical Expressions Normalization
  4.8 Dataset Production
  4.9 Baseline Classifier
  4.10 Evaluation
    4.10.1 Classification Performance
    4.10.2 T-Box Enrichment
    4.10.3 A-Box Population
    4.10.4 Final Fact Correctness
  4.11 Observations
    4.11.1 LU Ambiguity
    4.11.2 Manual Intervention Costs
    4.11.3 NLP Pipeline Design
    4.11.4 Simultaneous T-Box and A-Box Augmentation
    4.11.5 Confidence Scores Distribution
    4.11.6 Scaling Up
    4.11.7 Crowdsourcing Generalization
    4.11.8 Miscellanea
    4.11.9 Technical Future Work
  4.12 Conclusion

5 Classes: Unsupervised Taxonomy Learning
  5.1 Introduction
  5.2 Prominent Nodes
  5.3 Generating DBTax
    5.3.1 Stage 1: Leaf Nodes Extraction
    5.3.2 Stage 2: Prominent Node Discovery
    5.3.3 Stage 3: Class Taxonomy Generation
    5.3.4 Stage 4: Pages Type Assignment
  5.4 Results
  5.5 Evaluation
    5.5.1 Coverage
    5.5.2 T-Box Evaluation
    5.5.3 A-Box Evaluation
  5.6 Access and Sustainability
  5.7 Conclusion

6 Application: Knowledge Base-driven Recommender Systems
  6.1 Introduction
  6.2 Approach
  6.3 System Architecture
    6.3.1 Querying the Dataspace
    6.3.2 Ranking the Recommendation Sets
  6.4 Evaluation
    6.4.1 General Setting
    6.4.2 Experiments
    6.4.3 Results
    6.4.4 Discussion
  6.5 Conclusion

7 Conclusion
  7.1 The Italian DBpedia Chapter
  7.2 Contribution 1: Crowdsourced Frame Annotation
  7.3 Contribution 2: Properties Population via Relation Extraction
  7.4 Contribution 3: Classes Population via Taxonomy Learning
  7.5 Contribution 4: Application to Recommender Systems

8 Appendix: the StrepHit Project
  8.1 Project Idea
    8.1.1 The Problem
    8.1.2 The Solution
    8.1.3 Use Case
  8.2 Project Goals
  8.3 Project Plan
    8.3.1 Implementation Details
    8.3.2 Contributions to the Wikidata Development Plan
    8.3.3 Work Package
  8.4 Community Engagement
  8.5 Methods and activities
    8.5.1 Technical Setup
    8.5.2 Project Management
    8.5.3 Dissemination
  8.6 Outcomes
    8.6.1 Software
    8.6.2 Bonus Outcomes
    8.6.3 Web Sources Corpus
    8.6.4 Candidate Relations Set
    8.6.5 Semi-structured Development Dataset
  8.7 Evaluation
    8.7.1 Sample Statements
    8.7.2 Final Claim Correctness
  8.8 Resources
  8.9 Challenges
  8.10 Side Projects

Bibliography

List of Tables

1.1 Research contributions and associated publications
2.1 Overview of Wikipedia-powered knowledge bases (Categories, Pages, Multilingual, 3rd-party data). ♦ indicates a caveat
3.1 Comparison of the reproduced frame discrimination task as per [64]
3.2 Overview of the experimental results. FD stands for Frame Discrimination, FER for FEs Recognition
3.3 FrameNet data processing details
3.4 Experimental settings
3.5 Overview of the experimental results
4.1 Extraction examples on the national football team article
4.2 Training set crowdsourcing task outcomes. Cf. Section 4.6.2 for explanations of CrowdFlower-specific terms
4.3 Frame Elements (FEs) classification performance evaluation over a gold standard of 500 random sentences from the Italian Wikipedia corpus. The average crowd agreement score on the gold standard amounts to .916
4.4 Frame classification performance evaluation over a gold standard of 500 random sentences from the Italian Wikipedia corpus. The average crowd agreement score on the gold standard amounts to .916
4.5 Lexicographical analysis of the Italian Wikipedia soccer player sub-corpus
4.6 Relative A-Box population gain compared to pre-existing T-Box property assertions in the Italian DBpedia chapter
4.7 Overlap with pre-existing assertions in the Italian DBpedia chapter and relative gain in A-Box population
4.8 Fact correctness evaluation over 132 triples randomly sampled from the supervised output dataset. Results indicate the ratio of correct data for the whole fact (All) and for triple elements produced by the main components of the system, namely: Classifier, as per Figure 4.2, part 2(c), and Section 4.6; Normalizer, as per Figure 4.2, part 2(d), and Section 4.7; Linker, external component, as per Section 4.6
4.9 Comparative results of the Syntactic sentence extraction strategy against the Sentence Splitter one, over a uniform sample of a corpus gathered from 53 Web sources, with estimates over the full corpus
4.10 Cumulative confidence scores distribution over the gold standard
5.1 Type coverage of Wikipedia articles
5.2 T-Box evaluation results. C is the ratio of classes in the taxonomy and !Bre the ratio of classes that cannot be broken into other classes. V is the ratio of valid hierarchy paths, !S the ratio of paths that are not too specific, and !Bro the ratio of paths that are not too broad
5.3 Comparative A-Box evaluation on 500 randomly selected entities with no type coverage in DBpedia. ♠ indicates a statistically significant difference with p < .0005 using the χ2 test, between DBTax and the marked resources
6.1 TMZ-to-Freebase mapping
6.2 Experiments overview
6.3 Absolute results per experiment. ♦, ♠ and ♣ respectively indicate statistically significant differences between baseline and semantic methods, with p < 0.05, p < 0.01 and p < 0.001
8.1 Items and biographies across Web domains
8.2 Items and biographies Wikisource breakdown
8.3 Semi-structured development dataset references count across Web domains
8.4 Statistics of referenced Wikidata claims across Web sources and StrepHit datasets
8.5 Empirical claim correctness assessment

List of Figures

1.1 Thesis vision
3.1 1-step approach worker interface
3.2 Linking example with confidence score
3.3 Worker interface unit screenshot
4.1 Screenshot of the Wikidata primary sources gadget activated in Roberto Baggio's page. The statement highlighted with a green vertical line already exists in the KB. Automatic suggestions are displayed with a blue background: these statements require validation and are highlighted with a red vertical line. They can be either approved or rejected by editors, via the buttons highlighted with black circles
4.2 High level overview of the Relation Extractor system
4.3 Worker interface example
4.4 Worker interface example translated in English
4.5 Supervised FE classification normalized confusion matrix, lenient evaluation setting. The color scale corresponds to the ratio of predicted versus actual classes. Normalization means that the sum of elements in the same row must be 1.0
4.6 Supervised FE classification precision and recall breakdown, lenient evaluation setting
4.7 Supervised frame classification normalized confusion matrix. The color scale corresponds to the ratio of predicted versus actual classes. Normalization means that the sum of elements in the same row must be 1.0
4.8 Supervised frame classification precision and recall breakdown
4.9 Italian DBpedia soccer property statistics
5.1 Example of a Wikipedia category phrase structure parsing tree
6.1 High level system workflow
6.2 Web interface of an evaluation job unit
8.1 StrepHit workflow
8.2 Distribution of StrepHit IEG Web Sources Corpus Biographies according to the length in characters
8.3 Distribution of StrepHit IEG Web Sources Corpus Items with Biographies and without Biographies
8.4 Pie Chart of StrepHit IEG Web Sources Corpus Items across Source Domains
8.5 Pie Chart of StrepHit IEG Web Sources Corpus Biographies across Source Domains
8.6 Amount of (1) sentences extracted from the input corpus, (2) classified sentences, and (3) generated Wikidata claims, with respect to confidence scores of linked entities
8.7 Performance values of the supervised classifier among a random sample of lexical units: (1) F1 scores via 10-fold cross validation, compared to a dummy classifier; (2) accuracy scores against a gold standard of 249 annotated sentences

List of Algorithms

1 Rule-based baseline classifier
2 Prominent Node Discovery
3 Cycle Removal

Chapter 1

Introduction

The World Wide Web (WWW) is nowadays one of the most prominent sources of information and knowledge. Since its birth, the amount of publicly available data has dramatically increased and has led to the problem of information overload. Users are no longer able to handle such a huge volume of data and need to spend time finding the right piece of information which is relevant to their interests. Furthermore, a major portion of the WWW content is represented as unstructured data, namely free-text documents, together with multimedia data such as images, audio and video. Understanding its meaning is a complex task for machines and still relies on subjective human interpretations. The Web of Data envisions its evolution as a repository of machine-readable structured data. This would enable an automatic and unambiguous content analysis and its direct delivery to end users. The idea has not only engaged a long strand of research, but has also been absorbed by the biggest web industry players. Companies such as Google, Facebook and Microsoft have already adopted large-scale semantics-driven systems, namely Google's Knowledge Graph,1 Facebook's Graph

1https://www.google.com/intl/en_us/insidesearch/features/search/knowledge.html

Search,2 and Microsoft's Satori.3 Moreover, the WWW Consortium, together with the Linked Data4 (LD) initiative, has provided a standardized technology stack to publish freely accessible interconnected datasets. LD is becoming an increasingly popular paradigm to disseminate Open Data (OD) produced by all kinds of public organizations. The international Linked Open Data (LOD) Cloud5 counts today several billion records from hundreds of sources. It is worth noting that the phenomenon is not limited to public organizations: over recent years, a number of game-changing announcements have been made by private companies, thus potentially contributing to the growth of the Linked Data ecosystem. First, Google's Knowledge Graph stems from the acquisition of one of the most important nodes of the LOD cloud, namely Freebase;6 secondly, the coalition among the largest search engines Google, Bing, Yahoo! and Yandex has led to the introduction of schema.org,7 a combination of a vocabulary and a set of incentives for web publishers to annotate their content with metadata markup; finally, large private organizations are approaching LD, by evolving their business models or by modifying their production processes to comply with the openness of the LOD cloud. In this scenario, a Knowledge Base (KB) is a repository that encodes areas of human intelligence in a graph structure, where real-world and abstract entities are bound together through relationships, and classified according to a formal description of the world, i.e., an ontology. KBs bear a considerable impact on everyday life, since they power a steadily growing

2http://www.facebook.com/about/graphsearch
3https://blogs.bing.com/search/2015/08/20/bing-announces-availability-of-the-knowledge-and-action-graph-api-for-developers/
4http://linkeddata.org
5http://lod-cloud.net/
6http://www.freebase.com/
7http://schema.org/

number of applications, from Web search engines to question answering platforms, all the way to digital library archives and data visualization facilities, just to name a few. Under this perspective, Wikipedia is the result of a crowdsourced effort and stands as the best digital materialization of encyclopedic human knowledge. Therefore, the general-purpose nature of its content plays a vital role for powering a KB. Furthermore, it is released under the Creative Commons BY-SA license,8 which enables free reuse and redistribution. Hence, it is not surprising that its data has been attracting both research and industry interests, and has driven the creation of several KBs, the most prominent being BabelNet [87], DBpedia [73], Freebase [14], YAGO [63], Wikidata [125], and WikiNet [85], among others. In particular, the main contribution of DBpedia9 is to automatically extract structured data from semi-structured Wikipedia content, typically infoboxes.10 DBpedia acts as the central component of the growing LOD cloud and benefits from a steadily increasing multilingual community of users and developers. Its stakeholders range from journalists [55] to governmental institutions [38], up to digital libraries [57]. Its international version was first conceived as a multilingual resource, assembling information coming from diverse Wikipedia localizations. On one hand, multilingualism would naturally be of universal impact to the society, as it can nurture users at a worldwide scale. On the other hand, it does not only represent an enormous cultural challenge, but also a technological one, as it requires merging radically different views of the world into one single classification schema (i.e., the ontology). As such, the focal point of DBpedia was initially set to the English chapter, since it is the richest one with respect to the total number of articles. Therefore, multilingual data was restricted to those

8http://creativecommons.org/licenses/by-sa/3.0/legalcode
9http://dbpedia.org
10http://en.wikipedia.org/wiki/Help:Infobox


items that have a counterpart (i.e., an interlanguage link) in the English branch. Recently, an internationalization effort [70] has been conducted to tackle the problem and has led to the birth of several language-specific deployments: among them, the author of this thesis has developed and maintains the Italian DBpedia chapter.11 Besides the interest of the KB itself, the Italian DBpedia is a publicly available resource of critical importance for the national LD initiative. Thanks to its encyclopedic cross-domain nature, it may serve as a hub to which other datasets can link, in the same fashion as the international chapter. Consequently, this would cater for the integration of freely accessible data coming from third-party sources in order to ensure textual content augmentation. In addition, the Italian Wikipedia is the seventh most impactful chapter worldwide in terms of content (if we exclude automatically built ones), with more than 1.23 million articles,12 and the eighth with respect to usage.13

1.1 The Vision

This thesis acts as a seed that would burgeon into a country-centric KB with large amounts of real-world entities of national and local interest. The KB would empower a broad spectrum of applications, from data-driven journalism to public library archives enhancement, not to forget data visualization amenities. The language-specific Wikipedia chapter will serve as its core. Such a resource will allow the deployment of a central data hub acting as a reference access point for the user community. Moreover, it will foster the amalgamation of publicly available external resources, resulting in a rich content enhancement. Governmental and research OD

11http://it.dbpedia.org
12https://meta.wikimedia.org/wiki/List_of_Wikipedias
13http://stats.wikimedia.org/EN/Sitemap.htm


will be interlinked to the KB as well. Ultimately, data consumers such as journalists, digital libraries, software developers or web users in general will be able to leverage the KB as input for writing articles, enriching a catalogue, building applications or simply satisfying their information retrieval needs. Figure 1.1 depicts this vision.

Figure 1.1: Thesis vision

Given the above premises, we foresee the following main outcomes as starting points for further development:

1. deployment of a high-quality Italian DBpedia acting as the backbone of a healthy LOD environment. The Italian case will then serve as a best practice for full internationalization;

2. completely data-driven approaches for KB enrichment. More specifically:

• a linguistically-oriented relation extraction methodology for property population;
• a taxonomy learning strategy for classes population.


3. liaison with civil society organizations in order to make the KB the gold-standard nucleus for the interlinkage of national OD initiatives.

1.2 The Problem

The de facto model underpinning the classification of all the multilingual encyclopedic entries, namely the DBpedia ontology (DBPO),14 is exceedingly unbalanced. This is attributable to the collaborative nature of its development and maintenance: any registered contributor can edit it by adding, deleting or modifying its content. At the time of writing this thesis (July 2016), the latest DBPO release15 contains 739 classes and 2,827 properties, which are highly heterogeneous in terms of granularity (cf. for instance the classes Band versus SambaSchool, both direct subclasses of Organisation) and are supposed to encapsulate the entire encyclopedic world. This indicates there is ample room to improve the quality of DBPO. Furthermore, a clear problem of class and property coverage has been recently pointed out [94, 5, 99, 51]: each Wikipedia entry should have a 1-to-1 mapping to a DBpedia entity. However, this is not reflected in the current state of affairs: for instance, although the English Wikipedia contains more than 5 million articles, DBpedia has only classified 2.8 million of them into DBPO. One of the major reasons is that a significant amount of Wikipedia entries does not contain an infobox, which is a valuable piece of information to infer a meaningful description of an entry. This results in a large number of DBpedia entities with poor or no data, thus restraining the exploitation of the KB, as well as limiting its usability potential. The current classification paradigm described in [73] heavily depends on Wikipedia infobox names and attributes in order to enable a manual mapping to DBPO classes and properties. The availability and

14http://mappings.dbpedia.org/server/ontology/classes/
15http://wiki.dbpedia.org/dbpedia-dataset-version-2015-10

homogeneity of such semi-structured data in Wikipedia pages is unstable for two reasons, namely (a) the community-based, manually curated nature of the project, and (b) the linguistic and cultural discrepancies among all the language chapters. This triggers several shortcomings, as highlighted in [51]. Furthermore, resources can be wrongly classified or defectively described, as a result of (a) the misuse of infoboxes by Wikipedia contributors, (b) overlaps among the four most populated DBPO classes, namely Place, Person, Organisation and Work,16 and, most importantly, (c) the lack of suitable mappings from Wikipedia infoboxes to DBPO. Consequently, the extension of the DBpedia data coverage is a crucial step towards the release of richly structured and high quality data. From a socio-political perspective, the Italian initiative is flourishing in the global OD landscape, with 15,000 datasets notified by public administrations,17 not counting other initiatives from organizations such as digital libraries. However, from a technical outlook, the vast majority of such data is extracted from databases and made available in flat tabular formats (e.g., CSV), which are not always adequate to fully express the complex structure and semantics of the original data. The star model18 foresees the publication of OD according to a 5-level quality scale: (1) use an open license, (2) expose semi-structured tabular data, (3) use non-proprietary formats, (4) mint URIs for data representation, and (5) connect to other datasets that are already exposed as LOD. This deployment model suggests going beyond tabular data (3 stars) and adopting the principles of LD. Various national public administrations have already started to publish their 5-star OD: for instance, the recent efforts of the province of Trento19

16http://wiki.dbpedia.org/Datasets39/DatasetStatistics
17https://joinup.ec.europa.eu/sites/default/files/ckeditor_files/files/eGov%20in%20Italy%20-%20April%202015%20-%20v_17_1.pdf
18http://5stardata.info
19http://dati.trentino.it/


and the municipality of Florence20 are among the most mature examples of governmental OD in Italy. Interestingly, all of them are already linking their datasets to DBpedia. As this phenomenon continues to grow, there is an ever-growing need for a central hub which can be used to disambiguate entity references. We believe that the encyclopedic general-purpose nature of the Italian DBpedia makes it the ideal candidate for becoming a national semantic entity hub, very much like the international DBpedia project naturally became the nucleus for the international LOD Cloud. Likewise, the Italian DBpedia would meet the needs of the Digital Agenda for Europe initiative, which argues in action 2621 that member states should align their national interoperability frameworks to the European one (EIF). The National Interoperability Framework Observatory has recently analyzed the Italian case,22 highlighting a weak alignment to EIF with regard to interoperability levels. On that account, the national governmental institution Agency for Digital Italy23 has published a set of guidelines concentrating on semantic interoperability.24

1.3 The Solution and its Innovative Aspects

We investigate the problems of DBPO heterogeneity and lack of coverage by means of a practical outcome, namely the deployment of a high-quality structured KB extracted from the Italian Wikipedia. This has been carried out under the umbrella of the DBpedia open source

20http://opendata.comune.fi.it/
21http://ec.europa.eu/digital-agenda/en/pillar-ii-interoperability-standards/action-26-ms-implement-european-interoperability-framework
22https://joinup.ec.europa.eu/sites/default/files/3b/66/1d/NIFO%20-%20Factsheet%20Italy%2005-2013.pdf
23http://www.agid.gov.it/
24http://www.agid.gov.it/sites/default/files/documentazione_trasparenza/cdc-spc-gdl6-interoperabilitasemopendata_v2.0_0.pdf

organization.25 The author has founded and maintains the Italian DBpedia chapter26 and is a member of the DBpedia Association board of trustees.27

1.3.1 Contributions

The outcomes that constitute the main research contributions of this dissertation and have brought the Italian DBpedia resource to its current status are broken down as follows.

Contribution 1: a one-step methodology for crowdsourcing a complex linguistic task to the layman, namely full Frame Semantics annotation;

Contribution 2: an NLP technique implementing the above methodology to automatically perform N-ary Relation Extraction from free text, applied to enrich the KB with properties;

Contribution 3: a Taxonomy Learning strategy to automatically generate and populate an intuitive wide-coverage class hierarchy from the Wikipedia category28 graph, applied to enrich the KB with classes;

Contribution 4: a novel Recommender System that leverages an external KB to provide unusual recommendations and exhaustive explanations: while the implemented use case does not directly exploit the Italian DBpedia, this contribution represents a direct application of our main efforts and demonstrates the potential for real-world end users.

It should be highlighted that part of this thesis has already been assessed by the scientific community via standard peer-review procedures. We list below the publications and connect them to the aforementioned contributions in Table 1.1:

25http://dbpedia.org
26http://it.dbpedia.org
27https://docs.google.com/document/d/1pchRPLtQwO3GH49cF7GB33srRn1p2M3QW8Jtu5b5ZwE/edit
28https://en.wikipedia.org/wiki/Help:Categories


1. Marco Fossati, Claudio Giuliano, and Sara Tonelli. Outsourcing FrameNet to the Crowd. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, 2013 (acceptance rate: 24%);

2. Marco Fossati, Sara Tonelli, and Claudio Giuliano. Frame Semantics Annotation Made Easy With DBpedia. In Proceedings of the 1st International Workshop on Crowdsourcing the Semantic Web at ISWC, 2013;

3. Marco Fossati, Emilio Dorigatti, and Claudio Giuliano. N-ary Relation Extraction for Simultaneous T-Box and A-Box Knowledge Base Augmentation. Semantic Web Journal (under review);

4. Marco Fossati, Dimitris Kontokostas, and Jens Lehmann. Unsupervised Learning of an Extensive and Usable Taxonomy for DBpedia. In Proceedings of the 11th International Conference on Semantic Systems, 2015. (Acceptance rate: 26%) Best paper nominee;

5. Marco Fossati, Claudio Giuliano, and Giovanni Tummarello. Semantic Network-driven News Recommender Systems: a Celebrity Gossip Use Case. In Proceedings of the International Workshop on Semantic Technologies meet Recommender Systems & Big Data at ISWC, 2012.

Table 1.1: Research contributions and associated publications

Contribution                                      Publications

Crowdsourced frame annotation                     #1, #2
Properties population via Relation Extraction     #3
Classes population via Taxonomy Learning          #4
Application to Recommender Systems                #5


StrepHit: Contribution 2 Got Funded by the Wikimedia Foundation

It is worth pointing out that we won the largest Wikimedia Foundation Individual Engagement Grant, 2015 round 2 call, to pursue our research based on Contribution 2, under the umbrella of Wikidata. The selected project proposal stems from the lessons learnt in article #3 and aims at developing a Web-reading agent to corroborate Wikidata content with external references. Full details are available in Chapter 8.

1.4 Structure of the Thesis

The remainder of this work is structured as follows.

Chapter 2 provides an overview of related efforts, spanning the different research areas. The reader may then find more details in each specific chapter;

Chapter 3 illustrates the crowdsourcing methodology to perform a complete annotation of frame semantics in natural language utterances. This chapter corresponds to papers #1 and #2;

Chapter 4 describes the NLP pipeline that aims at populating DBpedia with properties. It implements the above crowdsourcing methodology and achieves N-ary Relation Extraction given a Wikipedia free text corpus. This chapter embeds article #3;

Chapter 5 contains the taxonomy learning system that enriches DBpedia with classes. It processes the Wikipedia category graph, generates a class hierarchy and populates it with instance assertions. This chapter corresponds to paper #4;

Chapter 6 outlines an end-user application that leverages a target KB to deliver news article recommendations, coupled with informative


explanations. This chapter encompasses paper #5;

Chapter 7 sums up the results of this thesis and points out specific open issues to be further developed;

Chapter 8 embeds the StrepHit technical reports.

Chapter 2

State of the Art

Due to its interdisciplinary nature, our work embraces different research areas, which are nevertheless tightly interconnected. The common thread that binds them is Natural Language Processing (NLP), i.e., a set of practices allowing machines to understand human language. More specifically, most of our contributions leverage off-the-shelf Entity Linking (EL) techniques. In this chapter, we provide a high-level overview of the technical background, with pointers to the most relevant related work. The reader may then dive into the specific chapters for more detailed comparisons.

2.1 Entity Linking

EL is the task of matching free-text chunks to entities of a target KB. This is formulated as a word sense disambiguation (WSD) problem: the meaning of an input set of words (i.e., an n-gram) is resolved through an unambiguous link to the KB. Several efforts have adopted Wikipedia to build WSD systems, with seminal work in [30, 15]. It should be mentioned that linking to Wikipedia implies linking to DBpedia, as the only difference lies in a part of the URI (i.e., wikipedia.org/wiki versus dbpedia.org/resource). A considerable number of full EL tools have stemmed from both research and


industry, such as Babelfy1 [84], DBpedia Spotlight2 [31, 81], The Wiki Machine3 [54] and Alchemy4, Cogito,5 Dandelion,6 Open Calais7 respectively. A comparative performance evaluation is detailed in [82].
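To illustrate how such off-the-shelf tools are typically consumed, the minimal sketch below queries the public DBpedia Spotlight REST service from Python. The endpoint, parameters and JSON response fields reflect the publicly documented demo service and should be treated as assumptions of this example, not as components of the systems described in this thesis.

```python
# Minimal Entity Linking sketch against the public DBpedia Spotlight demo
# service. Endpoint, parameters and JSON fields are assumptions based on the
# service's public documentation, not part of this thesis' pipeline.
import requests

def link_entities(text, confidence=0.5, lang="en"):
    """Return (surface form, DBpedia URI) pairs detected in `text`."""
    response = requests.get(
        f"https://api.dbpedia-spotlight.org/{lang}/annotate",
        params={"text": text, "confidence": confidence},
        headers={"Accept": "application/json"},
    )
    response.raise_for_status()
    resources = response.json().get("Resources", [])
    return [(r["@surfaceForm"], r["@URI"]) for r in resources]

if __name__ == "__main__":
    sentence = "Roberto Baggio played for the Italy national football team."
    for surface, uri in link_entities(sentence):
        print(surface, "->", uri)
```

Each returned URI unambiguously resolves the corresponding n-gram against the target KB, which is exactly the disambiguation facility exploited throughout this thesis.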

2.2 Frame Semantics

Frame semantics [46] is one of the theories that originate from the long strand of linguistic research in artificial intelligence. A frame can be informally defined as an event triggered by some term in a text and embedding a set of participants. For instance, the sentence Goofy has murdered Mickey Mouse evokes the Killing frame (triggered by murdered) together with the Killer and Victim participants (respectively Goofy and Mickey Mouse). This theory has led to the creation of FrameNet [8], namely an English lexical database containing manually annotated textual examples of frame usage. Currently, FrameNet development follows a strict protocol for data annotation and quality control. The entire procedure is known to be both time-consuming and costly, thus representing a burden for the extension of the resource [7]. Furthermore, deep linguistic knowledge is needed to tackle this annotation task, and the resource developed so far would not have come to light without the contribution of skilled linguists and lexicographers. On one hand, the task complexity depends on the inherently complex theory behind frame semantics, with a repository of thousands of roles available for the assignment. On the other hand, these roles are defined for expert

1http://babelfy.org/
2http://spotlight.dbpedia.org/
3http://thewikimachine.fbk.eu/
4http://www.alchemyapi.com/
5http://www.cogitoapi.com/
6https://dandelion.eu/
7http://www.opencalais.com/

annotators, and their descriptions are often obscure to common readers. We report three examples below, followed by a minimal sketch of how such an annotation can be represented:

• Support: Support is a fact that lends epistemic support to a claim, or that provides a reason for a course of action. Typically it is expressed as an External Argument. (Evidence frame)

• Protagonist: A person or self-directed entity whose actions may potentially change the mind of the Cognizer (Influence of Event on Cognizer frame)

• Locale: A stable bounded area. It is typically the designation of the nouns of Locale-derived frames. (Locale by Use frame)
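The following sketch (our own illustrative representation, not FrameNet's data model) captures a full frame annotation for the Killing example above as a simple data structure: the evoked frame, the lexical unit that triggers it, and the frame elements filled by sentence spans.

```python
# Illustrative representation (not FrameNet's own data model) of a full frame
# annotation: the evoked frame, the lexical unit that triggers it, and the
# frame elements (FEs) filled by spans of the sentence.
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class FrameAnnotation:
    sentence: str
    frame: str                       # e.g. "Killing"
    lexical_unit: str                # the trigger term, e.g. "murdered"
    frame_elements: Dict[str, str] = field(default_factory=dict)  # FE name -> filler

annotation = FrameAnnotation(
    sentence="Goofy has murdered Mickey Mouse",
    frame="Killing",
    lexical_unit="murdered",
    frame_elements={"Killer": "Goofy", "Victim": "Mickey Mouse"},
)
print(annotation.frame, annotation.frame_elements)
```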

2.3 Crowdsourcing

In computer science, the term crowdsourcing covers all the activities that are difficult for machines to solve, but easier for humans, and are therefore cast to a non-specialized crowd. The construction of annotation datasets for NLP tasks via non-expert contributors has been approached in different ways, the most prominent being games with a purpose (GWAP) and micro-tasks. While the former technique leverages fun as the motivation for attracting participants, the latter mainly relies on a monetary reward. The effects of such factors on a contributor's behavior have been analyzed in the motivation theory literature, but are beyond the scope of this thesis. The reader may refer to [68] for an overview focusing on a specific platform, namely Amazon's Mechanical Turk.8

8https://www.mturk.com/mturk/welcome


2.3.1 Games with a Purpose

Verbosity [124] was one of the first attempts at gathering annotations with a GWAP. Phrase Detectives [24, 23] was meant to harvest a corpus with co-reference resolution annotations. The game included a validation mode, where participants could assess the quality of previous contributions. A data unit, namely a resolved coreference for a given entity, is judged complete only if the agreement is unanimous. Disagreement between experts and the crowd appeared to be a potential indicator of ambiguous input data. Indeed, it has been shown that in most cases disagreement did not represent a poor annotation, but rather a valid alternative.

2.3.2 Micro-tasks

[116] described design and evaluation guidelines for five natural language micro-tasks. Similarly to our approach, the authors compared crowdsourced annotations with expert ones for quality estimation. Moreover, they used the collected annotations as training sets for machine learning classifiers and measured their performance. However, they explicitly chose a set of tasks that could be easily understood by non-expert contributors, thus leaving the recruitment and training issues open. [88] built a multilingual textual entailment dataset for statistical machine translation systems.

2.3.3 Frame Semantics Annotation

The more specific Frame Semantics annotation problem has been recently addressed via crowdsourcing by [64]. Furthermore, [7] highlighted the crucial role of recruiting people from the crowd in order to bypass the need for annotations by linguistics experts. In line with our contribution, the task described in [64] was modeled in a multiple-choice answers fashion. Nevertheless, the focus is narrowed to the frame discrimination task,

namely selecting the correct frame evoked by a given lemma. Such a task is comparable to the word sense disambiguation one as per [116], although the difficulty seems higher, due to lower inter-annotator agreement values. We believe the frame elements recognition we are attempting to achieve is a more straightforward solution, thus yielding better results. The authors experienced issues that are related to our work with respect to the quality check mechanism in the CrowdFlower platform,9 as well as the complexity of the frame names and definitions. Outsourcing the task to the CrowdFlower platform has two major drawbacks, namely the proprietary nature of the aggregated inter-annotator agreement value provided in the response data and the need to manually simplify frame element definitions that generated low inter-annotator agreement. We aim at applying standard measures such as Cohen's κ [28].
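For reference, Cohen's κ corrects raw agreement for the agreement expected by chance; the standard two-annotator formulation is recalled below as a generic reminder, not as the proprietary aggregation computed by CrowdFlower:

\kappa = \frac{p_o - p_e}{1 - p_e}

where p_o is the observed agreement between two annotators and p_e the chance agreement estimated from their label distributions. For instance, if two annotators agree on 80 out of 100 frame labels (p_o = 0.8) and chance agreement amounts to p_e = 0.5, then κ = (0.8 - 0.5)/(1 - 0.5) = 0.6.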

2.4 Information Extraction for KB Population

We locate the contribution detailed in Chapter 4 at the intersection of the following research areas:

• Information Extraction;

• Knowledge Base Construction;

• Open Information Semantification, also known as Open Knowledge Extraction.

2.4.1 Information Extraction

Although the borders are blurred, nowadays we can distinguish two principal Information Extraction paradigms that focus on the discovery of relations

9https://crowdflower.com


holding between entities: Relation Extraction (RE) and Open Information Extraction (OIE). While they both share the same purpose, their difference lies in the size of the relation set, either fixed or potentially infinite. It is commonly argued that the main OIE drawback is the generation of noisy data [39, 126], while RE is usually more accurate, but requires expensive supervision in terms of language resources [3, 119, 126].

Relation Extraction

RE traditionally takes as input a finite set R of relations and a document d, and induces assertions in the form rel(subj, obj), where rel represents a binary relation between a subject entity subj and an object entity obj mentioned in d. Hence, it may be viewed as a closed-domain paradigm. Recent efforts [6, 3, 119] have focused on alleviating the cost of full supervision via distant supervision. Distant supervision leverages available KBs to automatically annotate training data in the input documents. This is in contrast to our work, since we aim at enriching the target KB with external data, rather than using it as a source. Furthermore, our relatively cheap crowdsourcing technique serves as a substitute for distant supervision, while ensuring full supervision. Other approaches such as [11, 127] instead leverage text that is not covered by the target KB, like we do.

Open Information Extraction

OIE is defined as a function f(d) over a document d, yielding a set of triples

(np1, rel, np2), where nps are noun phrases and rel is a relation between them. Known complete systems include Ollie [79], ReVerb [43], and NELL [22]. Recently, it has been discussed that cross-utterance processing can improve the performance through logical entailments [2]. This paradigm is called “open” since it is not constrained by any schemata, but rather attempts to learn them from unstructured data. In addition, it takes as

input heterogeneous sources of information, typically from the Web. In general, most efforts have focused on English, due to the high availability of language resources. Approaches such as [44] explore multilingual directions, by leveraging English as a source and applying statistical machine translation (SMT) for scaling up to target languages. Although the authors claim that their system does not directly depend on language resources, we argue that SMT still heavily relies on them. Furthermore, all the above efforts concentrate on binary relations, while we generate n-ary ones: under this perspective, Exemplar [35] is a rule-based system which is closely related to ours.
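To make the contrast with our n-ary setting concrete, the following sketch compares the two output shapes on a single invented sentence; the sentence, the frame label and the role names are hypothetical illustrations and are not taken from the actual system output.

```python
# Hypothetical contrast between extraction paradigms on one invented sentence;
# the frame label and role names are illustrative, not actual system output.
sentence = "Baggio played 56 matches for Italy between 1988 and 2004."

# Binary paradigms: a single relation between two arguments. OIE keeps raw
# noun phrases, while RE constrains the relation to a fixed set.
oie_triple = ("Baggio", "played for", "Italy")

# N-ary, frame-based paradigm: one event groups all participants, so the
# match count and the time span are preserved instead of being dropped.
nary_relation = {
    "frame": "Membership",   # hypothetical frame evoked by "played"
    "member": "Baggio",
    "team": "Italy",
    "matches": 56,
    "start": 1988,
    "end": 2004,
}
```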

2.4.2 Knowledge Base Construction

DBpedia [73], Freebase [14] and YAGO [63] represent the most mature approaches for automatically building KBs from Wikipedia. Despite its crowdsourced nature (i.e., mostly manual), Wikidata [125] benefits from a rapidly growing community of active users, who have developed several robots for automatic imports of Wikipedia and third-party data. The Knowledge Vault [39] is an example of KB construction combining Web-scale textual corpora, as well as additional semi-structured Web data such as HTML tables. Although our system may potentially create a KB from scratch from an input corpus, we prefer to improve the quality of existing resources and integrate into them, rather than developing a standalone one. Under a different perspective, [90] builds on [29] and illustrates a general-purpose methodology to translate FrameNet into a fully compliant Linked Open Data KB via the Semion tool [91]. The scope of such work diverges from ours, since we do not target a complete conversion of the frame repository we leverage. On the other hand, we share some transformation patterns in the dataset generation step (Section 4.8), namely we both link


FEs to their frame by means of RDF predicates. Likewise, FrameBase [109, 108] is a data integration effort, proposing a single model based on Frame Semantics to assemble heterogeneous KB schemata. This would overcome the knowledge soup issue [52], i.e., the blend of disparate ways in which structured datasets are published. Similarly to us, it utilizes Neo-Davidsonian representations to encode n-ary relations in RDF. Further options are reviewed but discarded by the authors, including singleton properties [89] and schema.org roles.10 In contrast to our work, FrameBase also provides automatic facilities which reduce the n-ary relations to binary ones for easier queries. The key purpose is to amalgamate different datasets in a unified fashion, thus essentially differing from our KB augmentation objective.
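The reification pattern mentioned above (a frame occurrence as a node, with FEs attached via predicates, Neo-Davidsonian style) can be sketched as follows with the rdflib package; the namespaces, class and property names are hypothetical placeholders rather than the vocabulary actually produced in Chapter 4.

```python
# Sketch of the n-ary design pattern discussed above: the frame occurrence is
# reified as a node, and each FE becomes a predicate linking that node to its
# filler. Namespaces, class and property names are hypothetical placeholders.
# Requires the rdflib package.
from rdflib import Graph, Literal, Namespace, RDF

ONTO = Namespace("http://example.org/ontology/")
RES = Namespace("http://example.org/resource/")

g = Graph()
event = RES["Membership_event_1"]              # reified frame instance
g.add((event, RDF.type, ONTO.Membership))      # the evoked frame as a class
g.add((event, ONTO.member, RES["Roberto_Baggio"]))
g.add((event, ONTO.team, RES["Italy_national_football_team"]))
g.add((event, ONTO.matches, Literal(56)))

print(g.serialize(format="turtle"))
```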

2.4.3 Open Information Semantification

OIE output can indeed be considered structured data compared to free text, but it still lacks a disambiguation facility: extracted facts generally do not employ unique identifiers (i.e., URIs), thus suffering from intrinsic natural language polysemy (e.g., Jaguar may correspond to the animal or a known car brand). To tackle the issue, [40] propose a framework that clusters OIE facts and maps them to elements of a target KB. Similarly to us, they leverage EL techniques for disambiguation and choose DBpedia as the target KB. Nevertheless, the authors focus on A-Box population, while we also cater for the T-Box part. Moreover, OIE systems are used as black boxes, in contrast to our full implementation of the extraction pipeline. Finally, relations are still binary, instead of our n-ary ones. The main intuition behind Legalo [106, 104] resides in the exploitation of hyperlinks, serving as pragmatic traces of relations between entities, which

10https://www.w3.org/wiki/WebSchemas/RolesPattern

are finally induced via NLP. The first version [104] focuses on Wikipedia articles, like we do. In addition, it leverages page links that are manually curated by editors, while we consume Entity Linking output. Ultimately, its property matcher module can be leveraged for KB enrichment purposes. Most recently, a new release [106] expands the approach by (a) taking into account hyperlinks from Entity Linking tools, and (b) handling generic free text input. On account of such features, both Legalo and the Fact Extractor are proceeding towards closely related directions. This paves the way to a novel paradigm called Open Knowledge Extraction by the authors, which is naturally bound to the Open Information Semantification one introduced in [40]. The only difference again lies in the binary nature of Legalo's extracted relations, which are generated upon FRED [53, 105].

FRED is a machine reader that harnesses several NLP techniques to produce RDF graphs out of free text input. It is conceived as a domain-independent middleware enabling the implementation of specific applications. As such, its scope diverges from ours: we instead deliver datasets that are directly integrated into a target KB. In a fashion similar to our work, it encodes knowledge based on Frame Semantics and employs Entity Linking to mint unambiguous URIs for entities and properties. Furthermore, it relies on the same design pattern for expressing n-ary relations in RDF [62]. As opposed to us, it also encodes NLP tools output via standard formats, i.e., Earmark [97] and NIF [60]. Additionally, it uses a different natural language representation (i.e., Discourse Representation Structures), which requires a deeper layer of NLP technology, namely syntactic parsing, while we stop at shallow processing via grammatical analysis.


Frame Semantics Classification

Supervised machine learning11 algorithms have been consistently exploited for automatic frame and FEs classification. While they may mitigate the manual annotation effort by reason of their automatic nature, they still require pre-annotated training data. In addition, state-of-the-art systems still suffer from performance issues. Recently, an approach proposed in [27] encapsulates frame semantics at the whole discourse level. However, it reached a relatively low precision value, namely .41 in an optimal evaluation scenario. In the SemEval-2010 event identification task [110], the system that performed best achieved a precision of .65. Its most recent implementation [33] gained a slight improvement (.70), but we argue that it is still not sufficiently accurate to substitute manual annotation.

2.4.4 Further Approaches

Distributional Methods

An additional strand of research encompasses distributional methods: these originate from Lexical Semantics and can be put to use for Information Extraction tasks. Prominent efforts, e.g., [1, 9, 96], aim at processing corpora to infer features for terms based on their distribution in the text. In a nutshell, similarities between terms can be computed on account of their co-occurrences. This is closely connected to our supervised classifier, which is modeled in a vector space and takes into account both bag of terms and contextual windows (cf. Section 4.6.3), in a fashion similar to [1].

Matrix Factorization

Matrix factorization strategies applied to text categorization, e.g., [128], are shown to increase the performance of SVM classifiers, which we exploit:

11https://en.wikipedia.org/wiki/Supervised_learning

the key idea involves the construction of latent feature spaces, thus being closely related to Latent Semantic Indexing [36] techniques. While this line of work differs from ours, we believe it could be useful to optimize the features we use in the supervised classification setting.

Semantic Role Labeling

In broad terms, the Semantic Role Labeling (SRL) NLP task targets the identification of arguments attached to a given predicate in natural language utterances. From a Frame Semantics perspective, such activity translates into the assignment of FEs. This applies to efforts such as [67], and tools like MATE [13], while we perform full frame classification. On the other hand, systems like SEMAFOR [71, 32] also serve the frame disambiguation part, in line with our method. Hence, SEMAFOR could be regarded as a baseline system. Nonetheless, it was not possible to actually perform a comparative evaluation of our use case in Italian, since the parser exclusively supports the English language. All the work mentioned above (and SRL in general) builds upon preceding layers of NLP machinery, i.e., POS-tagging and syntactic parsing: the importance of the latter is especially stressed in [107], thus being in strong contrast to our approach, where we propose a full bypass of the expensive syntactic step.

2.5 Taxonomy Learning

Taxonomy learning is the process of automatically inducing a hierarchy of concepts from unstructured or semi-structured data. The long thread of research focusing on taxonomy learning from digital documents dates back to the 1970s [17]. It is out of scope for this thesis to present an exhaustive literature review of such an extensive field of study. Instead,

we concentrate on Wikipedia-related work. Ponzetto and Strube [102, 103] have pioneered the stream of Wikipedia category system taxonomization efforts, providing a method for the extraction of a class hierarchy out of the category graph. While they integrate rule-based and lexico-syntactic approaches to infer intra-category is-a relations, they do not distinguish between actual instances and classes.

2.5.1 Wikipedia-powered Knowledge Bases

Large-scale knowledge bases are experiencing a steadily growing commitment of both research and industry communities. A plethora of resources have been released in recent years. Table 2.1 reports an alphabetically ordered summary of the most influential examples, which all attempt to extract structured data from Wikipedia, although with different aims.

Table 2.1: Overview of Wikipedia-powered knowledge bases (Categories, Pages, Multilingual, 3rd-party data). ♦ indicates a caveat

Resource                   C   P   M   3
BabelNet [87]
DBpedia [73, 12]
Freebase [14]                      ♦
MENTA [34]
WiBi [49]
Wikidata [125]
WikiNet [85, 86]
WikiTaxonomy [102, 103]
YAGO [118, 63]

BabelNet [87] is a multilingual lexico-semantic network, which recently moved towards a Linked Data compliant representation [41]. It provides wide-coverage lexicographic knowledge in 50 languages, where common concepts and real-world entities are linked together via semantic relations.


Under this perspective, BabelNet emanates from the lexical databases community, with WordNet [45] being the most mature approach. In contrast to our work, priority is given to fine-grained conceptual completeness, rather than cognitively intuitive knowledge representation. DBpedia [73, 12] leads current approaches based on the automatic extraction of unstructured and semi-structured content from all the Wikipedia language chapters. It serves as the kernel of the Linked Data cloud, gathering a huge amount of research efforts in the Web of Data and Natural Language Processing. The underlying framework is strengthened by a vibrant open source community of users and developers. However, the current paradigm employed for the ontology weakens the data consumption capabilities. Freebase [14] is the result of a crowdsourced effort, bearing a fine-grained schema thanks to its contributors. Nevertheless, no type hierarchy exists: the collaborative paradigm has actually been privileged over logical consistency. Furthermore, multilingualism is biased towards English (cf. the ♦ symbol in Table 2.1), since information in other languages only appears when a Wikipedia page has an English counterpart. MENTA [34] is a massive lexical knowledge base, with data coming from 271 languages. The taxonomy extraction is carried out via supervised techniques, based on a manually annotated training phase, which diminishes the replicability potential, as opposed to our fully unsupervised method. Wikidata [125] stems from the Wikimedia Foundation and is the official Wikipedia sister project. Its data model differs from all the reviewed resources, since it favors plurality over authority, in a completely collaborative fashion. It builds upon claims instead of assertions, encapsulating both temporal and provenance aspects of a given fact. The schema is crowdsourced as in Freebase. WiBi [49] attempts to produce a double taxonomy by taking into account Wikipedia knowledge encoded both at the category and at the page layers. This is in clear contrast with our work, which concentrates on the category layer to construct a


classification backbone for the page layer. Similarly to us, it does not leverage third party resources and is implemented under an unsupervised paradigm. WikiNet [85, 86] is built on top of heuristics formulated upon the analysis of Wikipedia content to deliver a multilingual semantic network. Besides is-a relations, which we also learn, it additionally learns other kinds of relations. While it seems to attain wide coverage, a comparative evaluation performed in [49] highlights very low precision. The approach that most influenced our work is YAGO [63, 118]. Its main purpose is to provide a linkage facility between categories and WordNet terms. Conceptual categories (e.g. Personal weapons) serve as class candidates and are separated from administrative (e.g. Categories requiring diffusion), relational (e.g. 1944 deaths) and topical (e.g. Medicine) ones. Similarly to us, linguistic-based processing is applied to isolate conceptual categories.

2.5.2 Type Inference

On the other hand, the recently proposed automatic methods for type inference [94, 5, 99, 51] have yielded resources that may enrich, cleanse or be aligned to DBPO's class hierarchy. Moreover, they can serve as an assisting tool to prevent redundancy, namely to alert a human contributor when he or she is trying to add some new class that already exists or has a similar name. Hence, these efforts represent alternative solutions compared to our work, with Tìpalo [51] being the most related one.

2.6 Recommender Systems

Given a set of input items, a recommender system is a tool that suggests additional relevant ones to an end user. Current approaches merge different algorithms: the most widely exploited ones are collaborative filtering (CF),

and content-based recommendation (CB). The former typically computes suggestions based on user profile mining, while the latter leverages bag-of-words content analysis. Most of the work in the recommender system research community operates in a space where both item and user profiles are taken into account. We are aware that our approach must implement user profiling algorithms in order to be compared to state-of-the-art systems. So far, we have derived the following assumptions from empirical observations on our use case. (a) Post-click news recommendation generally relies on scarce user data, namely an implicit single click which can be difficult to interpret as a preference. (b) News content experiences a regular update flow, where items are not likely to be already judged by users. In such a scenario, it is known that collaborative filtering (CF) algorithms are not suitable [65, 114, 75, 21]. Instead, content-based ones (CB) apply to unstructured text, thus fitting news articles. Document representation with bag-of-words vector space models and the cosine similarity function still represents a valid starting point to suggest topic-related documents [95]. Nevertheless, CB suffers from the overspecialization problem, which may frustrate users [80, 76] because of recommendation sets with too similar items. Moreover, both CF and CB strategies are affected by cold-start [65, 114, 75, 21, 74], namely when new users with no profile data are recommended new items. Hence, we currently concentrate our research on investigating the role of large-scale structured knowledge bases in the CB recommendation process.
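As a concrete reminder of the bag-of-words baseline mentioned above, the sketch below ranks candidate news items against a clicked article using TF-IDF vectors and cosine similarity; scikit-learn is assumed to be available and the example texts are made up, so this is an illustration rather than the system described in Chapter 6.

```python
# Baseline content-based similarity sketch: TF-IDF bag-of-words vectors ranked
# by cosine similarity against a clicked article. Requires scikit-learn; the
# example texts are made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

candidates = [
    "Celebrity couple spotted at a charity gala in Rome",
    "Midfielder signs a new three-year contract with the club",
    "Gossip magazine reports on the couple's secret wedding",
]
clicked = "Rumors swirl around the celebrity couple after the gala"

vectorizer = TfidfVectorizer(stop_words="english")
candidate_vectors = vectorizer.fit_transform(candidates)   # candidate items
clicked_vector = vectorizer.transform([clicked])           # the clicked item

scores = cosine_similarity(clicked_vector, candidate_vectors)[0]
for score, title in sorted(zip(scores, candidates), reverse=True):
    print(f"{score:.2f}  {title}")
```

Such a purely lexical ranking illustrates why topically related but overly similar items dominate the recommendation set, which is precisely the overspecialization issue discussed in the following subsections.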

2.6.1 CF and CB systems

Although CF performs effectively when enough user data is available [18], it is affected by known limitations [65, 75] including (a) data sparsity, (b) the new item and (c) the new user problems, and (d) the lack of recommendation explanation. Content analysis techniques allow CB to tackle typical CF problems. The active user profile is sufficient to compute recommendations and does not require neighbors, namely users with similar interests who have provided rating data (a). New items that are not rated yet can be recommended (b). Features extracted from item descriptions enable the construction of explanation systems (d). However, the overspecialization [75, 18] and portfolio [18] effects are key issues for CB, as they lead to recommendations that are too similar to a user’s long term preferences (history) or to one another, thus creating a “more of the same” problem. In addition, user preference analysis is still required, therefore (c) is not resolved. Ultimately, content analysis is inherently limited by the amount of information included in each item description. Keyword-driven algorithms usually do not consider the semantics hidden in natural language discourse. Consequently, external knowledge is often needed to improve both the interpretation of user tastes and the representation of items. Semantic-boosted approaches linking raw text documents to ontologies such as WordNet12 or large-scale knowledge repositories such as Wikipedia have recently emerged. A literature review in this area is out of the scope of this thesis.

2.6.2 Similarity, diversity, coherence

The insertion of diverse recommendations may overcome the overspecialization problem, thus improving the quality of the system. Diversity includes novelty, namely an unknown item that a user might discover by him or herself, and serendipity, namely a completely unexpected but interesting item. While generic diversity can be assessed in terms of dissimilar items within the recommendation set via standard experimental measures, serendipity evaluation requires real user feedback, due to its subjective nature [80]. The experiments described in [122] show that serendipitous information filtering, namely the dynamic generation of suggestion lists, enhances the attractive power of an information retrieval system. Nevertheless, rendering such an intuition into a concrete implementation remains an open problem. Randomness, user profiling, unrelatedness and reasoning by analogy are proposed starting points. Coherence in a chain of documents is another factor contributing to the quality of recommendations [76, 113].

12http://wordnet.princeton.edu

2.6.3 Linked Open Data for recommendation

Knowledge extraction from structured data for recommendation enhancement is an attested strategy. LOD datasets, e.g., DBpedia and Freebase, are queried to enrich the entities extracted from news articles with properties [72], to collect movie information for movie recommendations [37, 121], or to suggest music for photo albums [26]. Structured data may also be mined in order to compute similarities between items, then between users and items [65].

2.6.4 Use of semantic networks for news recommendation

Formal conceptual models (ontologies) are known to improve user and item profiling, as they alleviate the problems of keyword-based approaches by injecting semantics [75]. [21, 19, 20] leverage semantic relations within the user/item space for a news recommender system. Annotations extracted from news articles (the semantic context) enrich a pre-existing ontology to achieve more complex and disambiguated item/user representations. However, such annotations only originate from news titles and summaries, not from the whole textual content. Moreover, the natural language processing techniques used for text annotation do not take into account state-of-the-art entity linking tools [82], based on machine learning and word sense disambiguation algorithms, e.g., [?]. Recommendations are finally ranked via a cosine similarity score between user preferences and item annotation vectors.


[72] proposes a hybrid news article recommendation system, which merges content processing techniques and data enrichment via LOD. This approach is similar to ours with respect to the article processing: offline corpus gathering, named entity extraction and LOD exploitation. A document is modeled with traditional information retrieval measures such as TF-IDF weights for terms and an adaptation of the formula for named entities, which basically substitutes the term frequency with a normalized entity frequency. Natural language subject-verb-object sentences, e.g., Microsoft recommends reinstalling Windows, are also taken into account. [114] exploit an ontology classifying both user and item profiles for a personalized newspaper. A common conceptual representation improves the computation of relevant items for a given user. Similarly to our entity linking and schema inspection steps13, an item profile is described by a set of representative ontology terms. A user profile is initially constructed via explicitly selected interests from the ontology terms and is maintained by implicit feedback. When a user has clicked on a new item, the associated terms are added to his or her profile as new interests. The authors assume here that a click on a news item corresponds to a positive preference, which may bias the profile. The similarity between an item and a user is based on the weighted number of perfect or partial matches between the terms describing that item and the terms describing that user, and yields a ranking score for the final recommendations.
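One possible reading of such a weighted matching scheme is sketched below. The weights, the prefix-based partial-match rule and the ontology terms are illustrative assumptions, not the exact formula or vocabulary of [114].

```python
def profile_similarity(item_terms, user_terms,
                       perfect_weight=1.0, partial_weight=0.5):
    """Score an item against a user profile by counting term matches.

    A perfect match is an identical ontology term; a partial match is a
    term sharing a common prefix (e.g. a broader concept). Both the
    weights and the prefix rule are illustrative assumptions.
    """
    score = 0.0
    for term in item_terms:
        if term in user_terms:
            score += perfect_weight
        elif any(term.startswith(u) or u.startswith(term) for u in user_terms):
            score += partial_weight
    return score

# Hypothetical ontology terms describing an item and a user profile.
item = {"Sport/Football", "Europe/Germany"}
user = {"Sport", "Europe/Italy"}
print(profile_similarity(item, user))  # 0.5: one partial match on "Sport"
```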

2.6.5 Other approaches for news recommendation

[113] describe a method for producing a coherent path (story) between two news articles. The authors list the drawbacks of keyword-driven approaches, namely the creation of weak links based on word overlap, the loss of potentially significant features due to the absence of certain words in a given article, and the non-consideration of word importance. The proposed solution incorporates the influence score of a document on another through an activated concept, namely word patterns that are shared among documents. While this is comparable to our relation discovery between entities of a document, the exploitation of external knowledge to establish the connections is not taken into account, since the triggering patterns remain in the document space. The presented algorithm suffers from scaling issues. A possible solution could be the pre-selection of both document and concept subsets. Finally, the tradeoff between relevance and redundancy is pointed out. Overall relevance can be improved by injecting more similar documents into the chain, although this increases redundancy.

13Detailed in Section 6.3.

[74] represent the problem of news recommendation as a contextual multi-armed bandit problem. As mentioned above, a news recommender system must cope with constant data updates and a cold-start scenario. Hence, it should be able to rapidly select interesting articles for upcoming users. Different multi-armed bandit techniques attempt to handle cold-start. (a) Context-free algorithms disregard both user and item features. (b) Warm start algorithms infer personalization offline from overall click-through rates (CTR). (c) Contextual algorithms dynamically learn from user-centric CTRs. Given a set of arms, namely the candidate articles, the proposed bandit algorithm tries to guess the best arm based on previously gathered payoffs, namely the users’ CTR on that article. In each trial, contextual information is represented as a vector of features containing both the current user and the arm profiles. When an arm is selected (i.e., an article is shown), payoffs of 1 or 0 are collected (i.e., whether the article is clicked or not). Ultimately, the algorithm refines its arm choice thanks to the acquired payoffs. Since the maximum payoff of an arm corresponds to the maximum CTR of an article, the strategy is able to promptly recognize potentially attractive articles for an unseen user. However, the authors narrow their focus to the algorithm’s computational efficiency. Moreover, user/item feature vectors are constructed upon explicit user profiles, reading histories and manually annotated article categories. They claim that user information is commonly available in web services and can be consumed to build user feature vectors. Besides the fact that such a claim does not necessarily apply to all news portals, the approach still depends on explicit user data in order to generalize CTR and compute recommendations. Finally, the correct interpretation of implicit click feedback remains an open problem.
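To make the arm/context/payoff terminology concrete, the following sketch implements a generic epsilon-greedy contextual bandit with per-arm linear payoff estimates. It is not the specific algorithm evaluated in [74]; the feature vector, learning rate and exploration rate are illustrative.

```python
import random
import numpy as np

class EpsilonGreedyBandit:
    """Toy contextual bandit: one linear CTR estimator per arm (article).

    This is a generic epsilon-greedy sketch, not the algorithm of the
    cited work; contexts and payoffs are illustrative.
    """

    def __init__(self, n_arms, n_features, epsilon=0.1, lr=0.05):
        self.epsilon = epsilon
        self.lr = lr
        self.weights = np.zeros((n_arms, n_features))

    def select(self, context):
        # Explore with probability epsilon, otherwise pick the arm with
        # the highest estimated payoff (expected CTR) for this context.
        if random.random() < self.epsilon:
            return random.randrange(len(self.weights))
        return int(np.argmax(self.weights @ context))

    def update(self, arm, context, payoff):
        # Online update towards the observed payoff (1 = click, 0 = no click).
        error = payoff - self.weights[arm] @ context
        self.weights[arm] += self.lr * error * context

bandit = EpsilonGreedyBandit(n_arms=3, n_features=4)
context = np.array([1.0, 0.0, 1.0, 0.5])   # user + article features
arm = bandit.select(context)                # article shown to the user
bandit.update(arm, context, payoff=1)       # the user clicked
```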

2.6.6 Evaluation guidelines

Recommender system evaluation frameworks boil down to two main approaches [65], namely (a) offline and (b) online. (a) leverages gold-standard datasets and aims at estimating the performance of a recommendation algorithm via statistical measures. (b) relies on real user studies. [129] adopt both approaches. [74] demonstrate how to evaluate all bandit algorithms offline with web logs. While [18] used ad-hoc created datasets, in [21] the authors claim that their algorithms need to be shaped on such data, thus restraining the evaluation capabilities. Therefore, they performed an online evaluation. The difficulty of providing explicit ratings for some items and the persistence of the overspecialization issue only emerged thanks to a set of evaluators’ comments. [80] highlight that the priority accorded to offline accuracy measures has negatively biased system evaluations with respect to the end users’ perspective. [58] argue that user satisfaction corresponds to the actual use of a system and can be effectively measured only via online evaluation. The interest in exploiting crowdsourcing services for dataset building and online evaluation has recently grown, especially with respect to natural language processing tasks [88] and behavioral research [78].

Chapter 3

Crowdsourcing Frame Annotation

3.1 Introduction

Annotating Frame Semantics1 information is a complex task, usually modeled in two steps: first annotators are asked to choose the situation (or frame) evoked by a given predicate (the lexical unit, LU) in a sentence, and then they assign the semantic roles (or frame elements, FEs) that describe the participants typically involved in the chosen frame. For instance, the sentence Karen threw her arms round my neck, spilling champagne everywhere contains the LU throw.v evoking the frame Body movement. However, throw.v is ambiguous and may also evoke Cause motion. Existing frame annotation tools, such as Salto [16] and the Berkeley system [48], foresee this two-step approach, in which annotators first select a frame from a large repository of possible frames (1,162 frames are currently listed in the online version of the resource), and then assign the FE labels constrained by the chosen frame to LU dependents. In this chapter, we argue that such a workflow shows some redundancy, which can be addressed by radically changing the annotation methodology and performing it in one single step. Our novel annotation approach is also more compliant with the definition of frames proposed in [47]: in his seminal

1The reader may refer to Section 2.2 for a detailed description of the theory.

work, Fillmore postulated that the meanings of words can be understood on the basis of a semantic frame, i.e., a description of a type of event or entity and the participants in it. This implies that frames can be distinguished one from another on the basis of the participants involved; thus, it seems more cognitively plausible to start from the FE annotation to identify the frame expressed in a sentence, and not the contrary.

The goal of our methodology is to provide full frame annotation in a single step and in a bottom-up fashion. Instead of choosing the frame first, we focus on FEs and let the frame emerge based on the chosen FEs. We believe this approach complies better with the cognitive activity performed by annotators, while the 2-step methodology is more artificial and introduces some redundancy because part of the annotators’ choices are replicated in the two steps (i.e. in order to assign a frame, annotators implicitly identify the participants also in the first step, even if they are annotated later).

Another issue we investigate in this work is how semantic roles should be annotated in a crowdsourcing framework. This task is particularly complex, therefore it is usually performed by expert annotators under the supervision of linguistic experts and lexicographers, as in the case of FrameNet. In NLP, different annotation efforts for encoding semantic roles have been carried out, each applying its own methodology and annotation guidelines (see for instance [111] for FrameNet and [93] for PropBank). In this work, we present a pilot study in which we assess to what extent role descriptions meant for ‘linguistics experts’ are also suitable for annotators from the crowd. Moreover, we show how a simplified version of these descriptions, less bound to a specific linguistic theory, improves the annotation quality.


3.2 Experiments

In this section, we describe the anatomy and discuss the results of the tasks we outsourced to the crowd via CrowdFlower. Before diving into them, we report a set of critical aspects underpinning the platform.

Golden Data. Quality control of the collected judgments is a key factor for the success of the experiments. The essential drawback of crowdsourcing services lies in the cheating risk. Workers are generally paid a few cents for tasks which may only need a single click to be completed. Hence, it is highly probable to collect data coming from random choices that can heavily pollute the results. The issue is addressed by adding gold units, namely data for which the requester already knows the answer. If a worker misses too many gold answers with respect to a given threshold, he or she will be flagged as untrusted and his or her judgments will be automatically discarded.
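As a rough illustration of this quality-control mechanism, the following sketch flags a worker as untrusted when their accuracy on gold units falls below a threshold. The threshold value and the toy data are illustrative, not the platform's actual settings.

```python
def is_trusted(gold_answers, worker_answers, threshold=0.7):
    """Return True if the worker's accuracy on gold units meets the threshold.

    gold_answers maps gold unit ids to the expected answers; worker_answers
    maps unit ids to the answers the worker actually gave. The 0.7 threshold
    is an illustrative value.
    """
    hits = sum(worker_answers.get(unit) == answer
               for unit, answer in gold_answers.items())
    return hits / len(gold_answers) >= threshold

gold = {"s1": "Win_prize", "s2": "Getting"}
worker = {"s1": "Win_prize", "s2": "Finish_game"}
print(is_trusted(gold, worker))  # False: only 1 of 2 gold units correct
```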

Worker Switching Effect. Depending on their accuracy in providing answers to gold units, workers may switch from a trusted to an untrusted status and vice versa. In practice, a worker submits his or her responses via a web page. Each page contains one gold unit and a variable number of regular units that can be set by the requester during the calibration phase. If a worker becomes untrusted, the platform collects another judgment to fill the gap. If a worker moves back to the trusted status, his or her previous contribution is added to the results as free extra judgments. Such a phenomenon typically occurs when the complexity of gold units is high enough to induce low agreement in workers’ answers. Thus, the requester is constrained to review gold units and possibly to forgive workers who missed them.


Cost Calibration. The total cost of a crowdsourced task is naturally bound to a data unit. This represents an issue in our experiments, as the number of questions per unit (i.e., a sentence) varies according to the number of frames and FEs evoked by the LU contained in a sentence. Therefore, we need to use the average number of questions per sentence as a multiplier to a constant cost per sentence. We set the payment per working page to 5 $ cents and the number of sentences per page to 3, resulting in 1.83 $ cents per sentence.

3.2.1 Assessing Task Reproducibility and Worker Behavior Change

Since our overall goal is to compare the performance of FrameNet annotation using our novel workflow to the performance of the standard, 2-step approach, we first take into account past related works and try to reproduce them. To our knowledge, the only attempt to annotate frame information through crowdsourcing is the one presented in [64], which however did not include FE annotation.

Modeling. The task is designed as follows. (a) Workers are invited to read a sentence where a LU is bolded; (b) the question “Which is the correct sense?” is combined with the set of frames evoked by the given LU, as well as the None choice; finally, (c) workers must select the correct frame. A set of example sentences corresponding to each possible frame is provided in the instructions to assist workers. For instance, the sentence “Leonardo Di Caprio won the Oscar in 2016” is displayed with the set of frames Win prize, Finish competition, Getting, Finish game triggered by the LU win.v, together with None. The worker should pick Win prize. As a preliminary study, we wanted to assess to what extent the proposed task could be reproduced and if workers reacted in a comparable way over time. [64] did not publish the input datasets, thus we do not know which sentences


Table 3.1: Comparison of the reproduced frame discrimination task as per [64]

LU          Sentences (Gold)    2013 Accuracy    2011 Accuracy
high.a      68 (9)              91.8             92
history.n   72 (9)              84.6             86
range.n     65 (8)              95               93
rip.v       88 (12)             81.9             92
thirst.n    29 (4)              90.4             95
top.a       36 (5)              98.7             96

were used. Besides, the authors computed accuracy values directly from the results upon a majority vote ground truth. Therefore, we decided to consider the same LUs used in Hong and Baker’s experiments, i.e., high.a, history.n, range.n, rip.v, thirst.n and top.a, but we leveraged the complete sets of FrameNet 1.5 expert-annotated sentences as gold-standard data for immediate accuracy computation.

Discussion. Table 3.1 displays the results we achieved, jointly with those of the experiments by [64]. For the latter, we only show accuracy values, as the number of sentences was set to a constant value of 18, 2 of which were gold. If we assume that the crowd-based ground truth in the 2011 experiments is approximately equivalent to the expert one, workers seem to have reacted in a similar manner compared to Hong and Baker’s values, except for rip.v.

3.2.2 General Task Setting

We randomly chose the following LUs among the set of all verbal LUs in FrameNet evoking 2 frames each: disappear.v [Ceasing to be, Departing], guide.v [Cotheme, Influence of event on cognizer], heap.v [Filling, Placing], throw.v [Body movement, Cause motion]. We considered verbal LUs as they usually have more overt arguments in a sentence, so that we were sure to provide workers with enough candidate FEs to annotate. Linguistic tasks in crowdsourcing frameworks are usually decomposed to make them accessible to the crowd. Hence, we set the polysemy of LUs to 2 to ensure that all experiments are executed using the smallest-scale subtask. More frames can then be handled by just replicating the experiments.

3.2.3 2-step Approach

After observing that we were able to achieve similar results on the frame discrimination task as in previous work, we focused on the comparison between the 2-step and the 1-step frame annotation approaches. We first set up experiments that emulate the former approach in both frame discrimination and FEs annotation. This will serve as the baseline against our methodology. Given the pipeline nature of the approach, errors in the frame discrimination step will affect FE recognition, thus impacting the final accuracy. The magnitude of such an effect strictly depends on the number of FEs associated with the wrongly detected frame.

Frame Discrimination. Frame discrimination is the first phase of the 2-step annotation procedure. Hence, we need to leverage its output as the input for the next step.

Modeling. The task is modeled as per Section 3.2.1.

Discussion. Table 3.2 gives an insight into the results, which confirm the overall good accuracy as per the experiments discussed in Section 3.2.1.


Frame Elements Recognition. We consider all sentences annotated in the previous subtask with the frame assigned by the workers, even if it is not correct.

Modeling. The task is presented as follows. (a) Workers are invited to read a sentence where a LU is bolded and the frame that was identified in the first step is provided as a title. (b) A list of FE definitions is then shown together with the FE text chunks. Finally, (c) workers must match each definition with the proper text chunk.

                         2-step           1-step
                         FD       FER
Accuracy                 .900     .687    .792
Answers                  100      160     416
Trusted                  100      100     84
Untrusted                21       36      217
Time (h)                 102      69      130
Cost/question ($ cents)  1.83     2.74    8.41

Table 3.2: Overview of the experimental results. FD stands for Frame Discrimination, FER for FEs Recognition

Simplification. Since FEs annotation is a very challenging task, and FE definitions are usually meant for experts in linguistics, we experimented with three different types of FE definitions: the original ones from FrameNet, a manually simplified version, and an automatically simplified one, using the tool by [59]. The latter simplifies complex sentences at the syntactic level and generates a question for each of the extracted clauses. As an example, we report below three versions obtained for the Agent definition in the Damaging frame:


Original: The conscious entity, generally a person, that performs the intentional action that results in the damage to the Patient.

Manually simplified: This element describes the person that performs the intentional action resulting in the damage to another person or object.

Automatic system: What that performs the intentional action that results in the damage to the Patient?

Simplification was performed by a linguistic expert, and followed a set of straightforward guidelines, which can be summarized as follows:

• When the semantic type associated with the FE is a common concept (e.g. Location), replace the FE name with the semantic type.

• Make syntactically complex definitions as simple as possible.

• Avoid variability in FE definitions, try to make them homogeneous (e.g. they should all start with “This element describes...” or similar).

• Replace technical concepts such as Artifact or Sentient with com- mon words such as Object and Person respectively.

Although these changes (especially the last item) may make FE definitions less precise from a lexicographic point of view (for instance, sentient entities are not necessarily persons), annotation became more intuitive and had a positive impact on the overall quality. After a few pilot annotations with the three types of FE definitions, we noticed that the manually simplified one achieved a better accuracy and a lower number of untrusted annotators compared to the others. Therefore, we use the manually simplified definitions in both the 2-step and the 1-step approach (Section 3.2.4).

Discussion. Table 3.2 provides an overview of the results we gathered. The total number of answers differs from the total number of trusted judgments, since the average value of questions per sentence amounts to 1.5.2 First of all, we notice an increase in the number of untrusted judgments. This is caused by a generally low inter-worker agreement on gold sentences due to FE definitions, which still present a certain degree of complexity, even after simplification. We inspected the full reports sentence by sentence and observed a propagation of incorrect judgments when a sentence involves an unclear FE definition. As FE definitions may mutually include mentions of other FEs from the same frame, we believe this circularity generated confusion.

Figure 3.1: 1-step approach worker interface

2Cf. Section 3.2 for more details


3.2.4 1-step Approach

Having set the LU polysemy to 2, in our case a sentence S always contains a LU with 2 possible frames (f1, f2), but only conveys one, e.g., f1. We formulate the approach as follows. S is replicated into 2 data units (Sa, Sb). Then, Sa is associated with the set E1 of f1 FE definitions, namely the correct ones for that sentence. Instead, Sb is associated with the set E2 of f2 FE definitions. We call Sb a cross-frame unit. Furthermore, we allow workers to select the None answer. In practice, we ask a total of |E1 ∪ E2| + 2 questions per sentence S. In this way, we let the frame directly emerge from the FEs. If workers correctly answer None to a FE definition d ∈ E2, the probability that S evokes f1 increases.
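To make the unit construction concrete, the following minimal sketch replicates a sentence into one data unit per candidate frame. The function and field names are illustrative, not the actual CrowdFlower data model, and the FE sets are reduced examples rather than the full FrameNet inventories.

```python
def build_one_step_units(sentence, frame_defs):
    """Replicate a sentence into one data unit per candidate frame.

    frame_defs maps each candidate frame (f1, f2) to its FE definitions
    (E1, E2). The unit built on the frame the sentence does not evoke is
    the cross-frame unit; 'None' is always offered as an answer, so
    workers can reject FE definitions that do not apply.
    """
    units = []
    for frame, fe_definitions in frame_defs.items():
        units.append({
            "sentence": sentence,
            "frame": frame,
            "fe_definitions": list(fe_definitions),
            "allow_none": True,
        })
    return units

# Illustrative FE sets for the running throw.v example.
defs = {
    "Body_movement": ["Agent", "Body_part"],
    "Cause_motion":  ["Agent", "Theme", "Goal"],
}
units = build_one_step_units(
    "Karen threw her arms round my neck, spilling champagne everywhere", defs)
print(len(units))  # 2 data units (Sa, Sb), the second one being cross-frame
```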

Modeling. Figure 3.1 displays a screenshot of the worker interface. The task is designed as per Section 3.2.3, but with major differences with respect to its content. For instance, given the running example introduced in Section 3.1, we ask to annotate both the Body movement and the Cause motion core FEs, respectively as regular and cross-frame units.

Discussion. We do not interpret the None choice as an abstention from judgment, since it is a correct answer for cross-frame units. Instead of precision and recall, we are thus able to directly compute workers’ accuracy upon a majority vote. We envision an improvement with respect to the 2-step methodology, as we avoid the proven risk of error propagation originating from wrongly annotated frames in the first step. Table 3.2 illustrates the results we collected. As expected, accuracy improved substantially. This supports the hypothesis we stated in Section 3.1 on the cognitive plausibility of a bottom-up approach for frame annotation. Furthermore, the execution time decreases compared to the sum of the 2 steps, namely 130 hours against 171. Nevertheless, the cost is considerably higher due to the higher number of questions that need to be addressed, on average 4.6 against 1.5. The number of untrusted judgments grows markedly, mainly because of the complexity of the cross-frame gold units. Workers seem puzzled by the presence of None, which is a required answer for such units. If we consider the English FrameNet annotation agreement values between experts reported by [92] as the upper bound (i.e., .897 for frame discrimination and .949 for FEs recognition), we believe our experimental setting can be reused as a valid alternative.

3.3 Improving FEs Annotation with DBpedia

Since we aim at investigating whether such an activity can be cast to a crowd of non-expert contributors, we need to reduce its complexity by intervening on the FE descriptions. In particular, we want to assess to what extent more information on the role semantics coming from external knowledge sources such as DBpedia can improve non-expert annotators’ performance. We claim that providing annotators with information on the semantic types typically associated with FEs will enable faster and cheaper annotations, while maintaining an equivalent accuracy. The additional information is extracted in a completely automatic way, and the workflow we present can be potentially applied to any crowdsourced annotation task in which semantic typing is relevant.

3.3.1 Annotation Workflow

Our goal is to determine if crowdsourced annotation of semantic roles can be improved by providing non-expert annotators with information from DBpedia on the roles they are supposed to label. Specifically, instead of displaying the lexicographic definition for each possible role to be labeled, annotators are shown a set of semantic types associated with each role coming from FrameNet. Based on this, annotators should better recognize such roles in an unseen sentence. Evaluation is performed by comparing this annotation framework with a baseline, where standard FE definitions substitute DBpedia information. Before performing the annotation task, we need to compile the list of semantic types that best characterizes each FE in a frame. We extract these statistics by connecting the FrameNet database 1.5 [111] to DBpedia, after isolating a set of sentences to be used as test data (cf. Section 3.4). The workflow to prepare the input for the crowdsourced task is based on the following steps.

Linking to Wikipedia

For each annotated sentence in the FrameNet database, we first link each textual span labeled as FE to a Wikipedia page W. We employ The Wiki Machine, a kernel-based linking system (details on the implementation are reported in [123, 54]), which was trained on the Wikipedia dump of March 2010.3 Since FEs can be expressed by both common nouns and real-world entities, we needed a linking system that satisfactorily processes both nominal types. A comparison with the state-of-the-art system Wikipedia Miner [83] on the ACE05-WIKI dataset [10] showed that The Wiki Machine achieved a suitable performance on both types (.76 F1 on real-world entities and .63 on common nouns), while Wikipedia Miner had a poorer performance on the second noun type (respectively .76 and .40 F1). These results were also confirmed in a more recent evaluation [82], in which The Wiki Machine achieved the highest F1 compared with an ensemble of academic and commercial systems, such as DBpedia Spotlight, Zemanta, Open Calais, Alchemy API, and Ontos. The system applies an all-word linking strategy, in that it tries to connect each word (or multiword) in a given sentence to a Wikipedia page. In case a linked textual span (partially) matches a string corresponding to a FE, we assume that one possible sense of the FE is represented in Wikipedia through W. The Wiki Machine also assigns a confidence score to each linked term. This confidence is higher when the words occurring in the same context as the linked term show high similarity, since the linking is then likely to be more accurate.

3http://download.wikimedia.org/enwiki/20100312

We illustrate in Figure 3.2 the Wikipedia pages (and confidence scores) that the Wiki Machine system associates with the sentence Sardar Patel was assisting Gandhiji in the Salt Satyagraha with great wisdom, an example sentence for the Assistance frame originally annotated with four FEs, namely Helper, Benefited party, Goal and Manner. Since Wikipedia is a repository of concepts, which are usually expressed by nouns, we are able to link only nominal fillers.

Figure 3.2: Linking example with confidence scores. The sentence Sardar Patel was assisting Gandhiji in the Salt Satyagraha with great wisdom is annotated with the FEs Helper, Benefited party, Goal and Manner; the linked Wikipedia pages are Vallabhbhai_Patel (154.51), Mohandas_Karamchand_Gandhi (139.16), Salt_Satyagraha (197.54) and Wisdom (186.30).


Linking to DBpedia

In order to obtain the semantic types that are typical for each FE, linking to Wikipedia is not enough. In fact, too many different pages would be connected to a FE, making it difficult to generalize over the Wikipedia pages (i.e., concepts). This emerges also from the example above, where the pages linked to Sardar Patel, Gandhiji and Salt Satyagraha do not provide information on the typical fillers of Helper, Benefited party and Goal respectively. One possible option could be to resort to Wikipedia categories, which however are not homogeneous enough to allow for a consistent extraction of FE semantic types. We tackle this problem by using Wikipedia pages as a bridge to DBpedia. In fact, Wikipedia page URLs directly map to DBpedia resource URIs. Hence, for each linked FE, we query DBpedia for rdf:type objects. In this way, we are able to rank the most frequent semantic types associated with a given FE from a given frame. For instance, the FE Victim from the frame Killing would link to the DBpedia type Animal, which ranks first in our input data, with 38 occurrences (cf. Section 3.4). We aim at investigating whether such top-occurring types represent both valid generalizations and simplifications of a standard FE definition, and may thus substitute it. At the end of this pre-processing step, we create a repository where, for each FE, a set of DBpedia types is listed and ranked by frequency.
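A minimal sketch of this step is given below, assuming the SPARQLWrapper library and the public DBpedia endpoint; the linked FE fillers and the cut-off values are illustrative, not the actual data or thresholds of the pre-processing pipeline.

```python
from collections import Counter
from SPARQLWrapper import SPARQLWrapper, JSON

def rdf_types(resource_uri, endpoint="http://dbpedia.org/sparql"):
    """Return the rdf:type objects of a DBpedia resource."""
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(f"SELECT ?type WHERE {{ <{resource_uri}> a ?type }}")
    sparql.setReturnFormat(JSON)
    bindings = sparql.query().convert()["results"]["bindings"]
    return [b["type"]["value"] for b in bindings]

# Illustrative linked fillers of one FE; in the real workflow these come
# from the entity-linked FrameNet sentences.
victim_pages = [
    "http://dbpedia.org/resource/Dog",
    "http://dbpedia.org/resource/Julius_Caesar",
]
type_counts = Counter(t for page in victim_pages for t in rdf_types(page))

# Keep the most frequent, non-trivial types as the FE's semantic profile.
top_types = [t for t, n in type_counts.most_common(5)
             if n > 1 and not t.endswith("owl#Thing")]
print(top_types)
```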

Posting the Annotation Task on CrowdFlower

We finally set up a crowdsourced experiment where, in each test sentence, annotators have to choose the most appropriate FE given the most frequent DBpedia types (proper task) or the standard FE definition (baseline). Details are reported in the following section.


Table 3.3: FrameNet data processing details

Workflow step                FE instances
Raw FrameNet                 148,440
Linking to Wikipedia         114,242
DBpedia types extraction     47,732

3.4 Experiments

We first provide an overview of critical aspects underpinning a generic crowdsourced experiment. Subsequently, we describe the anatomy and the modeling of the tasks we outsourced to the CrowdFlower platform. The worker switching effect has not been a blocking issue in our experiments, since we assessed a relatively low average percentage of missed judgments for gold units, namely 28%. We set the payment per working page to 3 $ cents and the number of sentences per page to 3.

Pre-processing of FrameNet Data for DBpedia Types Extraction

Table 3.3 provides some statistics of the processed FrameNet data that were leveraged to extract DBpedia types (cf. Section 3.3.1). More specifically:

1. From the FrameNet 1.5 database, the Wiki Machine managed to link 77% of the total number of FE instances. Hence, unlinked data is skipped for the next step.

2. DBpedia provided type information for 42% of the total number of linked FE instances. Types occurring once are ignored, as they reflect the content of a single sentence and are likely to convey misleading suggestions. The overly generic owl#Thing type is filtered out as well.


Table 3.4: Experimental settings

Sentences                          43
Gold                               6
Frames                             24
Lexical Units                      41
Average FEs per sentence           3.07
Average cost per FE ($ cents)      .325
Average DBpedia types per FE       4.66
Workers nationality                United States

Test Data Preparation

Before linking the FrameNet database to DBpedia, we isolate a subset to be used as test data. From 500 randomly chosen sentences, we select those in which the number of FEs per frame is between 3 and 4. This small dataset serves as input for our experiments. Table 3.4 details the final settings. We hand-pick six sentences and for each of them we mark one question as gold for quality check. Almost all sentences contain three FEs with few exceptions (cf. the average value in Table 3.4). We extract the five most frequent DBpedia types from the statistics and assign them to the corresponding FEs in our input. Since not all FEs have exactly five associated types (cf. the average value in Table 3.4), we provide workers with variable suggestion sets. Finally, we ensure all workers are native English speakers.

Modeling

Data units are delivered to workers via a web interface. Our task is illustrated in Figure 3.3 and is presented as follows:

(a) Workers are invited to read a sentence and to focus on the bolded word appearing as a title above the sentence (e.g. taste in the screenshot).


(b) A question concerning each FE is then shown together with a set of answers corresponding to the sentence chunks that may express the given FE. For instance, in Figure 3.3, the question Which is the Perceiver Passive? is coupled with multiple choices taken from the given sentence.

(c) For each question, a suggestion box displays the top types retrieved from DBpedia and connected to the given FE (cf. Section 3.3.1 for details). This should help annotators in choosing the text chunk that better fits the given FE.

(d) Finally, workers match each question with the proper text chunk.

On the other hand, the baseline differs from our strategy in that (i) it does not display the suggestion box and (ii) questions are replaced with the FE definition extracted from FrameNet. For instance, in Figure 3.3, the question about the Perceiver Passive would be replaced with This FE is the being who has a perceptual experience, not necessarily on purpose. The baseline is more compliant with the standard approach adopted to annotate FEs in the FrameNet project.

Figure 3.3: Worker interface unit screenshot


3.5 Results

Our main purpose is to evaluate the validity of the proposed approach against the conventional FrameNet annotation procedure. We leverage expert-annotated sentences and are thus able to directly measure workers’ accuracy. Specifically, we compute 2 values:

• Majority vote. An answer is considered correct only if the majority of judgments are correct.

• Absolute. The total number of correct judgments divided by the total number of collected judgments (a minimal sketch of both measures follows).
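The sketch below computes both measures on toy judgment data; the question identifiers and answers are illustrative, not taken from the actual experiments.

```python
from collections import Counter

def accuracies(judgments, gold):
    """Compute majority-vote and absolute accuracy over crowd judgments.

    judgments maps each question id to the list of collected answers;
    gold maps each question id to the expert answer.
    """
    correct_majority = correct_total = total = 0
    for qid, answers in judgments.items():
        majority, _ = Counter(answers).most_common(1)[0]
        correct_majority += majority == gold[qid]
        correct_total += sum(a == gold[qid] for a in answers)
        total += len(answers)
    return correct_majority / len(judgments), correct_total / total

judgments = {"q1": ["Perceiver", "Perceiver", "Phenomenon"],
             "q2": ["Phenomenon", "Manner", "Manner"]}
gold = {"q1": "Perceiver", "q2": "Phenomenon"}
print(accuracies(judgments, gold))  # (0.5, 0.5): 1 majority hit, 3/6 judgments correct
```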

The results of our experiments are detailed in Table 3.5. The number of untrusted judgments may be considered as a shallow indicator of the overall task complexity. In fact, we tried to maximize objectivity and simplicity when choosing gold units. Moreover, the input dataset (and gold units as well) is identical in both experiments. Therefore, we can infer that the number of workers who missed gold is directly influenced by the question model, which is the only variable parameter. We compute the execution time as the interval between the first and the last judged unit.

Table 3.5: Overview of the experimental results

Measure                    Baseline    DBpedia
Majority vote accuracy     .763        .803
Absolute accuracy          .646        .720
Untrusted judgments        90          82
Time (minutes)             160         106

Our approach outperformed the baseline both in terms of accuracy and time. While majority vote accuracy values differ slightly, absolute accuracy clearly favors our strategy. Such a measure can be seen as a further indicator of the task complexity. A higher score implies a higher number of correct judgments, which may indicate better inter-worker agreement, thus a more straightforward task. This claim is not only supported by the moderate decrease of untrusted judgments, but also by the dramatic reduction of the execution time. Consequently, the results we obtained demonstrate that entity linking techniques combined with DBpedia types simplify FEs annotation.

3.6 Conclusion

In this chapter, we first presented an approach to perform full frame annotation with crowdsourcing techniques, based on a single annotation step and on manually simplified FE definitions. Since the results of such a baseline seem promising, we developed an additional method leveraging information extracted from DBpedia. The task is simplified for non-expert annotators by replacing FE definitions, usually meant for linguistic experts, with semantic types obtained from DBpedia. This is accomplished without manual simplification, in a completely automatic fashion. Results prove that such a method improves on the previous annotation workflow, both in terms of accuracy and of time consumption. Although the interconnection between FEs and DBpedia is semantically not perfect, extracting frequency statistics from the whole FrameNet database and considering only the most frequent types from DBpedia makes the procedure quite robust to wrong links.


Chapter 4

Properties: N-ary Relation Extraction from Free Text

4.1 Introduction

Intelligent Web-reading Agents are Artificial Intelligence systems that can read and comprehend human language in documents across the Web. Ideally, these agents should be robust enough to switch between heterogeneous sources with agility, while maintaining equivalent reading capabilities. More specifically, given a set of input corpora (where an item corresponds to the textual content of a Web source), they should be able to navigate from corpus to corpus and to extract comparable structured assertions out of each one. Ultimately, the collected data would feed a target Knowledge Base (KB). In this scenario, the encyclopedia Wikipedia contains a huge amount of data, which may represent the best digital approximation of human knowledge. As an anecdotal yet remarkable proof, Google acquired Freebase, a Wikipedia-driven KB [14], in 2010,1 embedded it in its Knowledge Graph,2 and has lately opted to shut it down to the public.3 Currently, it

1https://googleblog.blogspot.it/2010/07/deeper-understanding-with-metaweb.html 2https://www.google.com/intl/en_us/insidesearch/features/search/knowledge.html 3https://plus.google.com/109936836907132434202/posts/bu3z2wVqcQc


is foreseen [120] that Freebase data will eventually migrate to Wikidata4 via the primary sources tool,5 which aims at standardizing the flow of data donations. The trustworthiness of a general-purpose KB like Wikidata is an essential requirement to ensure reliable (thus high-quality) content: as a support for their plausibility, data should be validated against third-party resources. Even though the Wikidata community strongly agrees on the concern,6 few efforts have been made in this direction. The addition of references to external (i.e., non-Wikimedia), authoritative Web sources can be viewed as a form of validation. Consequently, such a real-world setting further consolidates the need for an intelligent agent that harvests structured data from raw text and produces, e.g., Wikidata statements with reference URLs. Besides the prospective impact on KB augmentation and quality, the agent would also dramatically shift the burden of manual data addition and curation, by pushing the (intended) fully human-driven flow towards an assisted paradigm, where automatic suggestions of pre-packaged statements just need to be approved or rejected. Figure 4.1 depicts the current state of the primary sources tool interface for Wikidata editors, which is in active development yet illustrates such future technological directions. Our system already takes part in the process, as it feeds the tool back-end. On the other hand, the DBpedia Extraction Framework7 is rather mature when dealing with Wikipedia semi-structured content like infoboxes, links and categories. Nevertheless, unstructured content (typically text) plays the most crucial role, due to the potential amount of extra knowledge it can deliver: to the best of our understanding, no efforts have

4https://www.wikidata.org/wiki/Wikidata:WikiProject_Freebase 5https://www.wikidata.org/wiki/Wikidata:Primary_sources_tool 6https://www.wikidata.org/wiki/Wikidata:Referencing_improvements_input, http: //blog.wikimedia.de/2015/01/03/scaling-wikidata-success-means-making-the-pie-bigger/ 7https://github.com/dbpedia/extraction-framework


Figure 4.1: Screenshot of the Wikidata primary sources gadget activated in Roberto Baggio’s page. The statement highlighted with a green vertical line already exists in the KB. Automatic suggestions are displayed with a blue background: these statements require validation and are highlighted with a red vertical line. They can be either approved or rejected by editors, via the buttons highlighted with black circles.


Table 4.1: Extraction examples on the Germany national football team article

Sentence                                                        Extracted statements
The first manager of the Germany national team was Otto Nerz   (Germany, roster, Roster 01), (Roster 01, team manager, Otto Nerz)
Germany has won the World Cup four times                        (Germany, trophy, Trophy 01), (Trophy 01, competition, World Cup), (Trophy 01, count, 4)
In the 70s, Germany wore Erima kits                             (Germany, wearing, Wearing 01), (Wearing 01, garment, Erima), (Wearing 01, period, 1970)

been carried out to integrate an unstructured data extractor into the framework. For instance, given the Germany national football team article,8 we aim at extracting a set of meaningful facts and structuring them as machine-readable statements. The sentence In Euro 1992, Germany reached the final, but lost 0–2 to Denmark would produce a list of triples, such as:

(Germany, defeat, Defeat 01)

(Defeat 01, winner, Denmark)

(Defeat 01, loser, Germany)

(Defeat 01, score, 0–2)

(Defeat 01, competition, Euro 1992)

To fulfill both the Wikidata and DBpedia duties, we aim at investigating to what extent Frame Semantics [46, 47] can be leveraged to perform Information Extraction over Web documents. We foresee to exploit our novel annotation approach (cf. Chapter 3), which provides full frame annotation in a single step and in a bottom-up fashion (i.e., from FEs up to frames).

8http://en.wikipedia.org/wiki/Germany_national_football_team


4.1.1 Contributions

In this chapter, we focus on Wikipedia as the source corpus and on DBpedia as the target KB. We propose to apply NLP techniques to Wikipedia text in order to harvest structured facts that can be used to automatically add novel statements to DBpedia. Our Relation Extractor is set apart from related state of the art thanks to the combination of the following contributions:

1. N-ary relation extraction, as opposed to binary standard approaches, e.g., [44, 6, 3, 119, 43, 22], and in line with the notion of knowledge pattern [52];

2. simultaneous T-Box and A-Box population of the target KB, in contrast to, e.g., [40];

3. shallow NLP machinery, only requiring the grammatical analysis (i.e., part-of-speech tagging) layer, with no need for syntactic parsing (e.g., [79]) nor semantic role labeling (e.g., [67, 66, 71, 32, 13]);

4. low-cost yet supervised machine learning paradigm, via training set crowdsourcing, which ensures full supervision without the need for expert annotators.

4.1.2 Problem and Solution

The main research challenge is formulated as a KB population problem: specifically, we tackle how to automatically enrich DBpedia resources with novel statements extracted from the text of Wikipedia articles. We conceive the solution as a machine learning task implementing the Frame Semantics linguistic theory [46, 47]: we investigate how to recognize meaningful factual parts given a natural language sentence as input. We cast this as a classification activity falling into the supervised learning paradigm.


In particular, we focus on the construction of a new extractor, to be integrated into the current DBpedia infrastructure. Frame Semantics will enable the discovery of relations that hold between entities in raw text. Its implementation takes as input a collection of documents from Wikipedia (i.e., the corpus) and outputs a structured dataset composed of machine-readable statements. The remainder of this chapter is structured as follows. We introduce a use case in Section 4.2, which will drive the implementation of our system. Its high-level architecture is then described in Section 4.3, which outlines the core modules detailed in Sections 4.4, 4.5, 4.6, 4.7, and 4.8. A baseline system is reported in Section 4.9: this enables the comparative evaluation presented in Section 8.7, along with an assessment of the T-Box and A-Box enrichment capabilities. In Section 4.11, we gather a list of research and technical considerations to pave the way for future work, before our conclusions are drawn in Section 5.7.

4.2 Use Case

Soccer is a widely attested domain in Wikipedia: according to the Italian DBpedia,9 the Italian Wikipedia counts a total of 59,517 articles describing soccer-related entities, namely 2.63% of the whole chapter. Moreover, infoboxes on those articles are generally very rich (cf. for instance the Germany national football team article). On account of these observations, the soccer domain properly fits the main challenge of this effort. Table 4.1 displays three examples of candidate statements from the Germany national football team article text, which do not exist in the corresponding DBpedia resource. In order to facilitate readability, the examples stem from the English chapter, but also apply to Italian.10

9As per the 2015 release, based on the Wikipedia dumps from January 2015.
10https://it.wikipedia.org/wiki/Nazionale_di_calcio_della_Germania


Figure 4.2: High level overview of the Relation Extractor system. The pipeline comprises (1) corpus analysis (lexical units extraction and ranking against a frame repository), (2) supervised fact extraction (sentence selection, training set creation via crowdsourcing, frame classification, numerical expressions normalization), and (3) dataset production (data model conversion and confidence scores) towards the target Knowledge Base.


4.3 System Description

The implementation workflow, depicted in Figure 4.2 and applied to the Italian-language use case, is intended as follows:

1. Corpus Analysis

(a) Lexical Units (LUs) Extraction via text tokenization, lemmatization, and part-of-speech (POS) tagging. LUs serve as frame triggers;

(b) LUs Ranking through lexicographical and statistical analysis of the input corpus. The selection of the top-N meaningful LUs is produced via a combination of term weighting measures (i.e., TF-IDF) and purely statistical ones (i.e., standard deviation);

(c) each selected LU will trigger one or more frames together with their FEs, depending on the definitions contained in a given frame repository. The repository also holds the input labels for two automatic classifiers (the former handling FEs, the latter frames) based on Support Vector Machines (SVM).

2. Supervised Relation Extraction

(a) Sentence Selection: two sets of sentences are gathered upon the candidate LUs, one for training examples and the other for the actual classification;

(b) Training Set Creation: construction of a fully annotated training set via crowdsourcing;

(c) Frame Classification: massive frame and FEs extraction on the input corpus seed sentences, via the classifiers trained with the result of the previous step.


3. Dataset Production: structuring the extraction results to fit the target KB (i.e., DBpedia) data model (i.e., RDF). A frame would map to a property, while participants would either map to subjects or to objects, depending on their role (a minimal sketch of such a mapping follows).
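The sketch below illustrates this frame-to-RDF mapping with rdflib, using the triples of the Euro 1992 example in reified form. The property namespace and the frame-instance URIs are illustrative assumptions, not the exact URI scheme adopted by the Italian DBpedia chapter.

```python
from rdflib import Graph, Literal, Namespace

# Illustrative namespaces; the actual URI scheme of the target KB may differ.
RES = Namespace("http://it.dbpedia.org/resource/")
PROP = Namespace("http://it.dbpedia.org/property/")

g = Graph()
defeat = RES["Defeat_01"]  # reified n-ary frame instance

# Frame -> property linking the subject entity to the frame instance;
# frame elements -> properties of the frame instance.
g.add((RES["Germany_national_football_team"], PROP["defeat"], defeat))
g.add((defeat, PROP["winner"], RES["Denmark_national_football_team"]))
g.add((defeat, PROP["loser"], RES["Germany_national_football_team"]))
g.add((defeat, PROP["score"], Literal("0-2")))
g.add((defeat, PROP["competition"], RES["UEFA_Euro_1992"]))

print(g.serialize(format="turtle"))
```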

We proceed with a simplification of the original Frame Semantics theory with respect to two aspects: (a) frames may be evoked by LUs with other POS (e.g., nouns), but we focus on verbs, since we assume that they are more likely to trigger factual information; (b) depending on the frame repository, full lexical coverage may not be guaranteed (i.e., some LUs may not trigger any frames), but we expect that ours will, otherwise LU candidates would not generate any fact.

4.4 Corpus Analysis

Since Wikipedia also contains semi-structured data, such as formatting templates, tables, references, images, etc., a pre-processing step is required to obtain the raw text representation only. To achieve this, we leverage a third-party tool, namely the WikiExtractor.11 From the entire Italian Wikipedia corpus, we slice the use case subset by querying the Italian DBpedia chapter12 for the Wikipedia article IDs of relevant entities.

4.4.1 Lexical Units Extraction

Given the use case corpus, we first extract the complete set of verbs through a standard NLP pipeline: tokenization, lemmatization and POS tagging. POS information is required to identify verbs, while lemmas are needed to build the ranking. TreeTagger13 is exploited to fulfill these tasks.
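A minimal sketch of this step is shown below: it counts verb lemmas from TreeTagger's standard tab-separated output (token, POS tag, lemma per line). The "VER" prefix used to detect verbal tags, and the sample tags themselves, are assumptions about the Italian tagset in use.

```python
from collections import Counter

def verb_lemmas(treetagger_output):
    """Extract verb lemmas from TreeTagger's tab-separated output.

    Each output line is 'token<TAB>POS<TAB>lemma'; the 'VER' prefix for
    verbal tags is an assumption about the Italian tagset in use.
    """
    lemmas = []
    for line in treetagger_output.splitlines():
        parts = line.split("\t")
        if len(parts) == 3 and parts[1].startswith("VER"):
            lemmas.append(parts[2])
    return Counter(lemmas)

# Illustrative output for a short Italian sentence (tags are examples only).
sample = "La\tDET:def\tla\nGermania\tNOM\tGermania\nvinse\tVER:fin\tvincere"
print(verb_lemmas(sample))  # Counter({'vincere': 1})
```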

11https://github.com/attardi/wikiextractor
12http://it.dbpedia.org/sparql
13http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/


4.4.2 Lexical Units Selection

The unordered set of extracted verbs needs to undergo a further analysis, which aims at discovering the most representative verbs with respect to the corpus. As a matter of fact, the lexicon (LUs) in text is typically distributed according to Zipf's law,14 where few highly occurring terms cater for a vast portion of the corpus. Of course, grammatical words (stopwords) are the top-occurring ones, although they do not bear any meaning, and must be filtered out. We can then focus on the most frequent LUs and benefit from two advantages: first, we ensure a wide coverage of the corpus with few terms; second, we minimize the annotation cost. To achieve this, we need to frame the selection as a ranking problem, where we catch a frequency signal in order to calculate a score for each LU. It is clear that processing the long tail of rarely occurring LUs would be very expensive and not particularly fruitful. Two measures are leveraged to generate a score for each verb lemma. We first compute the term frequency–inverse document frequency (TF-IDF) of each verb lexicalization t belonging to the set of occurring tokens T over each document d in the corpus C: this weighting measure α_{t,d} is intended to capture the lexicographical relevance of a given verb, namely how important it is with respect to other terms in the whole corpus. Then, we determine the standard deviation of the TF-IDF scores set A_t: this statistical measure β_t is meant to catch heterogeneously distributed verbs, in the sense that the higher the standard deviation is, the more variably the verb is used, thus helping to understand its overall usage signal over the corpus. Ultimately, we produce the final score s and assign it to a verb lemma by averaging the scores B of all its lexicalizations. To clarify how the two measures

14https://en.wikipedia.org/wiki/Zipf%27s_law

are combined, we formalize the LU selection problem as follows.

∀t ∈ T, ∀d ∈ C: α_{t,d} = tfidf(t, d)

A_t = ⋃_{d ∈ C} {α_{t,d}},   β_t = stdev(A_t)

B = ⋃_{t ∈ T} {β_t},   s = avg(B)

The ranking is publicly available in the code repository.15 The top-N lemmas serve as candidate LUs, each evoking one or more frames according to the definitions of a given frame repository.
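A compact sketch of this ranking is given below, using scikit-learn for TF-IDF and NumPy for the standard deviation. The toy documents and the token-to-lemma map stand in for the actual corpus analysis output, so the resulting numbers are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def rank_lexical_units(documents, lemma_of):
    """Score verb lemmas by the standard deviation of their TF-IDF weights.

    documents are raw-text articles; lemma_of maps each verb token
    (lexicalization) to its lemma, e.g. {"vinse": "vincere"}.
    """
    vectorizer = TfidfVectorizer(vocabulary=sorted(lemma_of))
    tfidf = vectorizer.fit_transform(documents).toarray()    # alpha_{t,d}

    # beta_t: standard deviation of each token's TF-IDF scores over the corpus.
    beta = dict(zip(vectorizer.get_feature_names_out(), tfidf.std(axis=0)))

    # s: average the betas of all lexicalizations sharing the same lemma.
    scores = {}
    for token, lemma in lemma_of.items():
        scores.setdefault(lemma, []).append(beta[token])
    return {lemma: float(np.mean(vals)) for lemma, vals in scores.items()}

docs = ["la germania vinse il mondiale",
        "baggio gioca e vince con la juventus"]
print(rank_lexical_units(docs, {"vinse": "vincere", "vince": "vincere",
                                "gioca": "giocare"}))
```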

4.5 Use Case Frame Repository

Among the top 50 LUs that emerged from the corpus analysis phase, we manually selected a subset of 5 items to facilitate the full implementation of our pipeline. Once the approach has been tested and evaluated, it can scale up to the whole ranking (cf. Section 4.11 for more observations). The selected LUs comply with two criteria: first, they are picked from both the best and the worst ranked ones, with the purpose of assessing the validity of the corpus analysis as a whole; second, they fit the use case domain, instead of being generic. Consequently, we proceed with the following LUs: esordire (to start out), giocare (to play), perdere (to lose), rimanere (to stay, remain), and vincere (to win). The next step consists of finding a language resource (i.e., a frame repository) to suitably represent the use case domain. Given a resource, we first need to define a relevant subset, then verify that both its frame and FE definitions are a relevant fit. After an investigation of FrameNet and Kicktionary [112], we notice that:

15https://github.com/dbpedia/fact-extractor/blob/master/resources/stdevs-by- lemma.json


• to the best of our knowledge, no suitable domain-specific Italian FrameNet or Kicktionary are publicly available, in the sense that neither LU sets nor annotated sentences for the Italian language match our purposes;

• FrameNet is too coarse-grained to encode our domain knowledge. For instance, the Finish competition frame may seem a relevant candidate at a first glimpse, but does not make the distinction between a victory and a defeat (as it can be triggered by both to win and to lose LUs), thus rather fitting as a super-frame (but no sub-frames exist);

• Kicktionary is too specific, since it is built to model the speech transcriptions of football matches. While it indeed contains some in-scope frames such as Victory (evoked by to win), most LUs are linked to frames that are not likely to appear in our input corpus, e.g., to play with Pass (occurring in sentences like Ronaldinho played the ball in for Deco).

Therefore, we adopted a custom frame repository, maximizing the reuse of the available ones as much as possible, thus serving as a hybrid between FrameNet and Kicktionary. Moreover, we tried to provide a challenging model for the classification task, prioritizing FE overlap among frames and LU ambiguity (i.e., focusing on very fine-grained semantics with subtle sense differences). We believe this does not only apply to machines, but also to humans: we can view it as a stress test both for the machine learning and the crowdsourcing parts. A total of 6 frames and 15 FEs are modeled with Italian labels as follows:

• Attività (activity), FEs Agente (agent), Competizione (competition), Durata (duration), Luogo (place), Squadra (team), Tempo (time). Evoked by esordire (to start out), giocare (to play), rimanere (to stay, remain), as in Roberto Baggio played with Juventus in Serie A between 1990 and 1995. Frame label translated from FrameNet Activity, FEs from a subset of FrameNet Activity;

• Partita (match), FEs Squadra 1 (team 1), Squadra 2 (team 2), Competizione, Luogo, Tempo, Punteggio (score), Classifica (ranking). Evoked by giocare, vincere (to win), perdere (to lose), as in Juventus played Milan at the UEFA cup final (2-0). Frame label translated from Kicktionary Match, FEs from a subset of FrameNet Competition, LU shared by both;

• Sconfitta (defeat), FEs Perdente (loser), Vincitore (winner), Competizione, Luogo, Tempo, Punteggio, Classifica. Sub-frame of Partita, evoked by perdere, as in Milan lost 0-2 against Juventus at the UEFA cup final. Frame label translated from Kicktionary Defeat, FEs from a subset of FrameNet Beat opponent, LU from Kicktionary;

• Stato (status), FEs Entità (entity), Stato (status), Durata, Luogo, Squadra, Tempo. Evoked by rimanere, as in Roberto Baggio remained faithful to Juventus until 1995. Custom frame and FEs derived from corpus evidence, to augment the rimanere LU ambiguity;

• Trofeo (trophy), FEs Agente, Competizione, Squadra, Premio (prize), Luogo, Tempo, Punteggio, Classifica. Sub-frame of Partita, evoked by vincere, as in Roberto Baggio won a UEFA cup with Juventus in 1992. Custom frame label, FEs from a subset of FrameNet Win prize, LU from FrameNet;

• Vittoria (victory), FEs Vincitore, Perdente, Competizione, Luogo, Tempo, Punteggio, Classifica. Evoked by vincere, as


in Juventus won 2-0 against Milan at the UEFA cup final. Frame label translated from Kicktionary Victory, FEs from a subset of FrameNet Beat opponent, LU from Kicktionary.

4.6 Supervised Relation Extraction

The first stage involves the creation of the training set: we leverage the crowdsourcing platform CrowdFlower16 and a one-step frame annotation method, which we briefly illustrate in Section 4.6.2. The training set has a double outcome, as it feeds two classifiers: one identifies FEs, and the other is responsible for frames. Both frame and FEs recognition are cast as multi-class classification tasks: while the former can be related to text categorization, the latter should answer questions such as "can this entity be this FE?" or "is this entity this FE in this context?". Such activity boils down to semantic role labeling (cf. [77] for an introduction), and usually requires a more fine-grained text analysis. Previous work in the area exploits deeper NLP layers, such as syntactic parsing (e.g., [79]). We alleviate this through Entity Linking (EL) techniques, which perform word sense disambiguation by linking relevant parts of a source sentence to URIs of a target KB. We leverage The Wiki Machine as our EL approach (cf. Chapter 2, Section 2.1). EL results are part of the FE classifier feature set. We claim that EL enables the automatic addition of features based on existing entity attributes within the target KB (notably, the class of an entity, which represents its semantic type).

Given as input an unknown sentence, the full frame classification workflow involves the following tasks: tokenization, POS tagging, EL, FE classification, and frame classification.

16http://www.crowdflower.com/
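As an illustration of this workflow, the sketch below chains the five tasks on a single unseen sentence. The tagger, linker, and the two classifiers are passed in as callables; their names and signatures are illustrative assumptions, not the actual modules of the pipeline.

def classify_sentence(sentence, tagger, linker, fe_clf, frame_clf):
    tokens = sentence.split()                 # tokenization (simplified)
    pos_tags = tagger(tokens)                 # POS tagging: the only grammatical layer used
    entities = linker(sentence)               # EL: text chunks -> KB URIs and ontology classes
    # FE classification: one prediction per candidate chunk, i.e. per linked entity
    frame_elements = [fe_clf(tokens, pos_tags, entity) for entity in entities]
    # frame classification: bottom-up, fed with the predicted FE labels as features
    frame = frame_clf(tokens, pos_tags, frame_elements)
    return frame, frame_elements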


4.6.1 Sentence Selection

The sentence selection procedure allows us to harvest meaningful sentences from the input corpus and to feed the classifier. Therefore, its outcome is two-fold: to build a representative training set and to extract relevant sentences for classification. We experimented with multiple strategies, listed below. They all share the same base constraint, i.e., each seed must contain a LU lexicalization.

• Baseline: the seed must be comprised in a given interval of length in words;

• Sentence splitter: the seed forms a complete sentence extracted with a sentence splitter. This strategy requires training data for the splitter;

• Chunker grammar: the seed must match a pattern expressed via a context-free chunker grammar. This strategy requires a POS tagger and engineering effort for defining the grammar (e.g., a noun phrase, followed by a verb phrase, followed by a noun phrase);

• Syntactic: the seed is extracted from a parse tree obtained through immediate constituent analysis, the idea being to split long and complex sentences into shorter ones. This strategy requires a suitable grammar and a parser;

• Lexical: the seed must match a pattern based on lexicalizations of candidate entities. This strategy requires querying a KB for instances of relevant classes (e.g., soccer-related ones as per the use case).

First, we note that all the strategies but the baseline entail an evident cost overhead in terms of language resources availability and engineering. Furthermore, given the soccer use case input corpus of circa 52,000 articles, all strategies but the syntactic one dramatically reduce the number


of seeds, while the baseline performed an extraction with a .95 article/seed ratio (despite some noise). Compared to the sentence splitter strategy, the syntactic one brought an increase of roughly 4x in the number of seeds, at a cost of 375x in processing time, which we deemed not worthwhile. These numbers arise from an experiment carried out for Wikidata, with a larger corpus composed of circa 500,000 documents from heterogeneous Web sources (cf. Section 4.11.3). Consequently, we decided to leverage the baseline for the sake of simplicity and for compliance with our contribution claims. We set the interval to 5 < w < 25, where w is the number of words (a minimal sketch of this filter follows the list below). The selection of relatively concise sentences is motivated by empirical and conceptual reasons:

(a) it is known that crowdsourced NLP tasks should be as simple as possible [116]. Hence, it is vital to maximize accessibility, otherwise the job would be too confusing and frustrating, with a considerable impact on quality and execution time;

(b) frame annotation is a particularly complex task [7], even for expert linguists. Therefore, the inter-annotator agreement is expected to be fairly low. Compact sentences minimize disagreement, as corroborated by the average score we obtained in the gold standard (cf. Section 4.10.1, Table 4.3 and 4.4);

(c) since we aim at populating a KB, we prioritize precise statements instead of recall, for the sake of data quality. As a result, we focus on atomic factual information to reduce the risk of noise;

(d) in light of the above points, Entity Linking acts as a surrogate of syntactic parsing, thus complying with our initial claim.
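The following is a minimal sketch of the baseline seed selection mentioned above: keep sentences that contain a LU lexicalization and whose length in words w satisfies 5 < w < 25. The inputs (sentences, lu_forms) are illustrative placeholders rather than the actual pipeline objects.

def select_seeds(sentences, lu_forms, min_words=5, max_words=25):
    seeds = []
    for sentence in sentences:
        words = sentence.split()
        if not (min_words < len(words) < max_words):            # 5 < w < 25
            continue
        lowered = {w.lower().strip('.,;:()') for w in words}
        if any(form.lower() in lowered for form in lu_forms):    # LU lexicalization match
            seeds.append(sentence)
    return seeds

# e.g. select_seeds(corpus_sentences, {'gioca', 'giocato', 'vinse', 'perse'})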

We still foresee further investigation of the other strategies for scaling beyond the use case. Specifically, we believe that the refinement of the

chunker grammar would be the most beneficial approach: POS tagging is already involved in the system architecture, thus allowing us to concentrate the engineering costs on the grammar only.

4.6.2 Training Set Creation

We apply a one-step, bottom-up approach to let the crowd perform a full frame annotation over a set of training sentences. In Frame Semantics, lexical ambiguity is represented by the number of frames that a LU may trigger. For instance, vincere (to win) conveys Trofeo (trophy) and Vittoria (victory), thus having an ambiguity value of 2. The idea is to directly elicit the detection of core FEs, which are the essential items allowing us to discriminate between frames. In this way, we are able to both annotate the FEs and let the correct frame emerge, thus also disambiguating the LU. The objective is achieved as follows: given a sentence s holding a LU with frame set F and set cardinality (i.e., ambiguity value) n, we solicit n annotations of s, and associate each one to the core FEs of each frame f ∈ F. We allow workers to select the None answer, and infer the correct frame based on the amount of None. The training set is randomly sampled from the input corpus and contains 3,055 items. The outcome is the same amount of frame examples and 55,385 FE examples. The task is sent to the CrowdFlower platform.
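As a toy illustration of this None-based inference (the data layout is an assumption, not the actual job output format), the sketch below selects, for an ambiguous LU, the candidate frame whose annotation round received the fewest None answers.

def infer_frame(annotations_per_frame):
    """annotations_per_frame: dict mapping each candidate frame of the LU to the
    list of crowd answers collected for its core FEs ('None' or an FE label)."""
    none_ratios = {
        frame: answers.count('None') / len(answers)
        for frame, answers in annotations_per_frame.items()
    }
    # the frame whose core FEs were actually found in the sentence wins
    return min(none_ratios, key=none_ratios.get)

# e.g. for vincere (Trofeo vs. Vittoria):
# infer_frame({'Trofeo': ['Premio', 'None', 'Agente'],
#              'Vittoria': ['None', 'None', 'None']})   # -> 'Trofeo'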

Crowdsourcing Caveats

Cheating is a widespread pitfall of crowdsourcing services: workers are usually rewarded a very low monetary amount (i.e., a few cents) for jobs that can be finalized with a single mouse click. Therefore, the results are likely to be excessively contaminated by random answers. CrowdFlower


Figure 4.3: Worker interface example

Figure 4.4: Worker interface example translated into English

tackles the problem via test questions,17 namely data units which are pre-marked with the correct response. If a worker fails to meet a given minimum accuracy threshold,18 he or she will be labeled as untrusted and his or her contribution will be automatically rejected.

Task Design

We ask the crowd to (a) read the given sentence, (b) focus on the “topic” (i.e., the potential frame that disambiguates the LU) written above it, and (c) assign the correct “label” (i.e., the FE) to each “word” (i.e., unigram) or “group of words” (i.e., n-grams) from the multiple choices provided below each n-gram. Figure 4.3 displays the front-end interface of a sample sentence, with Figure 4.4 being its English translation.

17https://success.crowdflower.com/hc/en-us/articles/202703305-Getting-Started-Glossary-of-Terms#test_question
18https://success.crowdflower.com/hc/en-us/articles/202702975-Job-Settings-Guide-To-Test-Question-Settings-Quality-Control


During the preparation phase of the task input data, the main challenge is to automatically provide the crowd with relevant candidate FE text chunks, while minimizing the production of noisy ones. To tackle this, we experimented with the following chunking strategies:

• third-party full-stack NLP pipeline, namely TextPro [98] for Italian, by extracting nominal chunks with the ChunkPro module;19

• custom noun phrase chunker via a context-free grammar;

• EL surface forms.

Surprisingly, we observed that the full-stack pipeline outputs a large amount of noisy chunks, besides being the slowest strategy. On the other hand, the custom chunker was the fastest one, but still too noisy to be crowdsourced. EL resulted in the best trade-off, and we adopted it for the final task. The task parameters are as follows:

• we set 3 judgments per sentence to enable the computation of an agreement based on majority vote;

• the pay amounts to 5 $ cents per page, where one page contains 5 sentences;

• we limit the task to Italian native speakers only by targeting the Italian country and setting the required language skills to Italian;

• the minimum worker accuracy is set to 70% in quiz mode (i.e., the warm-up phase where workers are only shown gold units and are recruited according to their accuracy) and relaxed to 65% in work mode (i.e., the actual annotation phase) to avoid extra cost in terms of time and expenses to collect judgments;

19http://textpro.fbk.eu/


Table 4.2: Training set crowdsourcing task outcomes. Cf. Section 4.6.2 for explanations of CrowdFlower-specific terms

Sentences              3,111
Test questions         56
Trusted judgments      9,198
Untrusted judgments    972
Total cost             152.46 $

• on account of a personal calibration, the minimum time per page threshold is set to 30 seconds: a contributor who falls below it is automatically discarded;

• we set the maximum number of judgments per contributor to 280, in order to prevent each contributor from answering more than once on a given sentence, while avoiding the removal of proficient contributors from the task.

The outcomes are summarized in Table 4.2. Finally, the crowdsourced annotation results are processed and translated into a suitable format to serve as input training data for the classifier.

4.6.3 Frame Classification: Features

We train our classifiers with the following linguistic features, in the form of bag-of-features vectors:

1. both classifiers: for each input word token, both the token itself (bag of terms) and the lemma (bag of lemmas);

2. FE classifier: contextual sliding window of width 5 (i.e., 5-gram, for each token, consider the 2 previous and the 2 following ones);


3. frame classifier: we implement our bottom-up frame annotation ap- proach, thus including the set of FE labels (bag of roles) to help this classifier induce the frame;

4. gazetteer: defined as a map of key-value pairs, where each key is a feature and its value is a list of n-grams, we automatically build a wide-coverage gazetteer with relevant DBpedia ontology (DBPO) classes as keys (e.g., SoccerClub) and instances as values (e.g., Juventus), by way of a query to the target KB. A sketch of how such features can be assembled for a single token follows below.
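The following sketch shows how the features listed above could be combined into a bag-of-features map for one token position; the feature naming, the inputs, and the binary values are illustrative assumptions rather than the exact implementation.

def token_features(tokens, lemmas, index, linked_classes, gazetteer):
    features = {
        'token=' + tokens[index]: 1,        # bag of terms
        'lemma=' + lemmas[index]: 1,        # bag of lemmas
    }
    # contextual sliding window of width 5: the 2 previous and the 2 following tokens
    for offset in (-2, -1, 1, 2):
        j = index + offset
        if 0 <= j < len(tokens):
            features['window[%d]=%s' % (offset, tokens[j])] = 1
    # ontology class attached by EL at this position, if any (e.g. SoccerClub)
    if index in linked_classes:
        features['class=' + linked_classes[index]] = 1
    # gazetteer lookup: DBPO class -> list of instance n-grams built by querying the KB
    for dbpo_class, instances in gazetteer.items():
        if tokens[index] in instances:
            features['gazetteer=' + dbpo_class] = 1
    return features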

4.7 Numerical Expressions Normalization

During the pilot crowdsourcing annotation experiments, we noticed a low agreement on numerical FEs. This is likely to stem from the interpretation of the FE labels: workers got particularly confused by Time and Duration. Moreover, asking the crowd to label such frequently occurring FEs would represent a considerable overhead, resulting in a higher temporal cost (i.e., more annotations per sentence) and lower overall annotation accuracy. Hence, we opted for the implementation of a rule-based system to detect and normalize numerical expressions. The normalization process takes as input a numerical expression such as a date, a duration, or a score, and outputs a transformation into a standard format suitable for later inclusion into the target KB. The task is not formulated as a classification one: rather, it is carried out via pairs of matching and transformation rules; we argue it is nonetheless relevant for the completeness of the extracted facts. Given for instance the input expression tra il 1920 e il 1925 (between 1920 and 1925), our normalizer first matches it through a regular expression rule, then applies a transformation rule complying with the XML Schema Datatypes20 (typically

20http://www.w3.org/TR/xmlschema-2/


dates and times) standard, and finally produces the following output:21

duration: "P5Y"^^xsd:duration
start: "1920"^^xsd:gYear
end: "1925"^^xsd:gYear

All rule pairs are defined with the programming language-agnostic YAML22 syntax. The pair for the above example is as follows. Regular Expression:

tra il (?P<y1>\d{{2,4}}) e il (?P<y2>\d{{2,4}})

Transformation:

{
    'duration': '"P{}Y"^^<{}>'.format(
        int(match.group('y2')) - int(match.group('y1')),
        schema['duration']
    ),
    'start': '"{}"^^<{}>'.format(
        abs_year(match.group('y1')), schema['year']
    ),
    'end': '"{}"^^<{}>'.format(
        abs_year(match.group('y2')), schema['year']
    )
}

In total, we have identified 21 rules, which are publicly available for consultation.23
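For illustration, the sketch below inlines one such rule pair as plain Python and applies it to the running example; the schema URIs are the standard XML Schema ones, while the 2-digit year resolution performed by abs_year in the actual rules is omitted for brevity.

import re

# xsd datatype URIs used to type the normalized values
schema = {'duration': 'http://www.w3.org/2001/XMLSchema#duration',
          'year': 'http://www.w3.org/2001/XMLSchema#gYear'}

# matching rule for expressions like "tra il 1920 e il 1925"
rule = re.compile(r'tra il (?P<y1>\d{2,4}) e il (?P<y2>\d{2,4})')

def normalize(expression):
    match = rule.search(expression)
    if not match:
        return None
    y1, y2 = match.group('y1'), match.group('y2')
    # transformation rule: duration in years plus start/end boundaries
    return {
        'duration': '"P{}Y"^^<{}>'.format(int(y2) - int(y1), schema['duration']),
        'start': '"{}"^^<{}>'.format(y1, schema['year']),
        'end': '"{}"^^<{}>'.format(y2, schema['year']),
    }

# normalize('tra il 1920 e il 1925')
# -> {'duration': '"P5Y"^^<...#duration>', 'start': '"1920"^^<...#gYear>', 'end': '"1925"^^<...#gYear>'}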

4.8 Dataset Production

21We use the xsd prefix as a short form for the full URI http://www.w3.org/2001/XMLSchema#
22http://www.yaml.org/spec/1.2/spec.html
23https://github.com/dbpedia/fact-extractor/blob/master/date_normalizer/regexes.yml

The integration of the extraction results into DBpedia requires their conversion to a suitable data model, i.e., RDF. Frames intrinsically bear N-ary relations through FEs, while RDF naturally represents binary relations. Hence, we need a method to express FEs relations in RDF, namely reification. This can be achieved in multiple ways:


• standard reification;24

• N-ary relations,25 an application of Neo-Davidsonian representations [109, 108], with similar efforts [42, 62];

• named graphs.26

A recent overview [61] highlighted that all the mentioned strategies are similar with respect to query performance. Given as input n frames and m FEs, we argue that:

• standard reification is too verbose, since it would require 3(n + m) triples;

• applying Pattern 1 of the aforementioned W3C Working Group note to N-ary relations would allow us to build n + m triples;

• named graphs can be used to encode provenance or context metadata, e.g., the article URI from where a fact was extracted. In our case however, the fourth element of the quad would be the frame (which represents the context), thus boiling down to minting n + m quads instead of triples.

We opted for the least verbose strategy, namely N-ary relations. Given the running example sentence In Euro 1992, Germany reached the final, but lost 0–2 to Denmark, classified as a Defeat frame and embedding the FEs Winner, Loser, Competition, Score, we generate RDF as per the following Turtle serialization:

24http://www.w3.org/TR/2004/REC-rdf-primer-20040210/#reification 25http://www.w3.org/TR/swbp-n-aryRelations/ 26http://www.w3.org/TR/rdf11-concepts/


:Germany :defeat :Defeat_01 .

:Defeat_01 :winner :Denmark ;
    :loser :Germany ;
    :competition :Euro_1992 ;
    :score "0-2" .

We add an extra instance type triple to assign an ontology class to the reified frame, as well as a provenance triple to indicate the original sentence:

:Defeat_01 a :Defeat ;
    :extractedFrom "In Euro 1992, Germany reached the final, but lost 0–2 to Denmark"@it .

In this way, the generated statements amount to n + m + 2. It is not trivial to decide on the subject of the main frame statement, since not all frames are meant to have exactly one core FE that would serve as a plausible logical subject candidate: most have many, e.g., Finish competition has Competition, Competitor and Opponent as core FEs in FrameNet. Therefore, we tackle this as per the following assumption: given the encyclopedic nature of our input corpus, both the logical and the topical subjects correspond in each document. Hence, each candidate sentence inherits the document subject. We acknowledge that such assumption strongly depends on the corpus: it applies to entity-centric documents, but will not perform well for general-purpose ones such as news articles. However, we believe it is still a valid in-scope solution fitting our scenario.
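To make the triple count explicit, the following is a minimal sketch (with illustrative naming and no prefix handling) that reifies one classified frame into the main statement, one triple per FE, plus the type and provenance triples.

def reify_frame(subject, frame, frame_elements, sentence, instance_id):
    node = ':%s_%02d' % (frame, instance_id)                # e.g. :Defeat_01
    triples = [
        '%s :%s %s .' % (subject, frame.lower(), node),     # main frame statement
        '%s a :%s .' % (node, frame),                       # instance type triple
        '%s :extractedFrom "%s" .' % (node, sentence),      # provenance triple
    ]
    for fe_label, value in frame_elements.items():          # one triple per FE
        triples.append('%s :%s %s .' % (node, fe_label.lower(), value))
    return '\n'.join(triples)

# reify_frame(':Germany', 'Defeat',
#             {'winner': ':Denmark', 'loser': ':Germany',
#              'competition': ':Euro_1992', 'score': '"0-2"'},
#             'In Euro 1992, Germany reached the final, but lost 0-2 to Denmark', 1)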


Confidence Scores

Besides the fact datasets, we also keep track of confidence scores and generate additional datasets accordingly. Therefore, it is possible to filter out facts that are not considered confident by setting a suitable threshold. When processing a sentence, our pipeline outputs two different scores for each FE, stemming from the entity linker and the supervised classifier. We merge both signals by calculating the F-score between them, as if they represented precision and recall, in a fashion similar to the standard classification metrics. The global fact score can then be produced via an aggregation of the single FE scores in multiple ways, namely: (a) arithmetic mean; (b) weighted mean based on core FEs (i.e., they have a higher weight than extra ones); (c) harmonic mean, weighted on core FEs as well. The reader may refer to Section 4.11.5 for a distributional analysis of these scores over the output dataset.
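A minimal sketch of this scoring scheme follows; the core FE weight is an illustrative assumption, since the actual weighting value is not spelled out here.

def fe_score(linker_score, classifier_score):
    """F-score-style merge of the two per-FE signals (harmonic mean)."""
    if linker_score + classifier_score == 0:
        return 0.0
    return 2 * linker_score * classifier_score / (linker_score + classifier_score)

def fact_score(fe_scores, core_flags, strategy='arithmetic', core_weight=2.0):
    """Aggregate the per-FE scores into a global fact confidence."""
    weights = [core_weight if core else 1.0 for core in core_flags]
    if strategy == 'arithmetic':
        return sum(fe_scores) / len(fe_scores)
    if strategy == 'weighted':
        return sum(w * s for w, s in zip(weights, fe_scores)) / sum(weights)
    if strategy == 'harmonic':                               # weighted harmonic mean
        return sum(weights) / sum(w / s for w, s in zip(weights, fe_scores))
    raise ValueError('unknown strategy: %s' % strategy)

# fact_score([fe_score(0.9, 0.8), fe_score(0.7, 0.95)], [True, False], 'weighted')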

4.9 Baseline Classifier

To enable a performance evaluation comparison with the supervised method, we developed a rule-based algorithm that handles the full frame and FEs annotation. The main intuition is to map FEs defined in the frame repository to ontology classes of the target KB: such mapping serves as a set of rule pairs (FE, class), e.g., (Winner, SoccerClub). In the FrameNet terminology, this is homologous to the assignment of semantic types to FEs: for instance, in the Activity frame, the Agent is typed with the generic class Sentient. The idea would allow the implementation of the bottom-up one-step annotation flow described in [50]: to achieve this, we run EL over the input sentences and check whether the attached ontology class metadata appear in the frame repository, thus fulfilling the FE classification task. Besides that, we exploit the notion of core FEs: this would cater for the


frame disambiguation part. Since a frame may contain at least one core FE, we proceed with a relaxed assignment, namely we set the frame if a given input sentence contains at least one entity whose ontology class maps to a core FE of that frame. The implementation workflow is illustrated in Algorithm 1: it takes as input the set S of sentences, the frame repository F embedding frame and FEs labels, core/non-core annotations and rule pairs, and the set L of trigger LU tokens. It is expected that the relaxed assignment strategy will not handle the overlap of FEs across competing frames that are evoked by a single LU. Therefore, if at least one core FE is detected in multiple frames, the baseline makes a random assignment for the frame. Furthermore, the method is not able to perform FE classification in case different FEs share the ontology class (e.g., both Winner and Loser map to SoccerClub): we opt for a random FE guess as well.

4.10 Evaluation

We assess our main research contributions through the analysis of the following aspects:

• Classification performance;

• T-Box property coverage extension;

• A-Box statements addition;

• Final fact correctness.

4.10.1 Classification Performance

We assess the overall performance of the baseline and the supervised systems over a gold standard dataset. We randomly sampled 500 sentences

containing at least one occurrence of our use case LU set from the input corpus. We first outsourced the annotation to the crowd as per the training set construction, and the results were further manually validated twice by the authors. CrowdFlower provides a report including an agreement score for each answer, computed via majority vote weighted by worker trust: we calculated the average across the whole evaluation set, obtaining a value of .916. With respect to the FEs classification task, we proceed with 2 evaluation settings, depending on how FE text chunks are treated, namely:

• lenient, where the predicted ones at least partially match the expected ones;

• strict, where the predicted ones must perfectly match the expected ones.

Table 4.3 illustrates the outcomes. FE measures are computed as follows: (1) a true positive is triggered if the predicted label is correct and the predicted text chunk matches the expected one (according to each setting); chunks that should not be labeled are marked with an "O": (2) they are not counted as true positives when predicted correctly, but (3) they are indeed counted as false positives in the opposite case. The high frequency of "O" occurrences (circa 80% of the total) in the gold standard actually penalizes the system, thus providing a more challenging evaluation playground. On the other hand, the frame classification task does not need to undergo chunk assessment, since it copes with the whole input sentence. Therefore, the lenient and strict settings are not applicable, and we proceed with a standard evaluation. The results are reported in Table 4.4.
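For clarity, a simplified sketch of the lenient versus strict chunk matching criterion follows; the representation of spans as token offsets is an assumption made for illustration, not the exact evaluation script.

def chunk_matches(predicted, expected, lenient=True):
    """predicted/expected: ((start, end), label) pairs with token offsets."""
    (p_start, p_end), p_label = predicted
    (e_start, e_end), e_label = expected
    if p_label != e_label:
        return False
    if lenient:
        # lenient setting: at least a partial overlap between the two spans
        return p_start < e_end and e_start < p_end
    # strict setting: the spans must coincide exactly
    return (p_start, p_end) == (e_start, e_end)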


Table 4.3: Frame Elements (FEs) classification performance evaluation over a gold standard of 500 random sentences from the Italian Wikipedia corpus. The average crowd agreement score on the gold standard amounts to .916

              Lenient                  Strict
Approach      P      R      F1        P      R      F1
Baseline      73.48  65.83  69.45     67.68  63.79  65.68
Supervised    83.33  75.00  78.94     73.59  66.66  69.96

Supervised Classification Performance Breakdown

Figure 4.5 and Figure 4.7 respectively display the FE and frame classification confusion matrices: they are normalized such that the sum of elements in the same row is 1. Since we highlight the cells through a color scale, the normalization is needed to avoid too similar color nuances that would originate from absolute results.

FEs. We observe that Competizione is frequently mistaken for Premio and Entità, while rarely for Tempo and Durata, or just missed. On the other hand, Tempo is mistaken for Competizione: our hypothesis is that competition mentions, such as World Cup 2014, are disambiguated as a whole entity by the linker, since a specific target Wikipedia article exists. However, it overlaps with a temporal expression, thus confusing the classifier. Agente is often mistaken for Entità, due to their equivalent semantic type, which is always a person.

Table 4.4: Frame classification performance evaluation over a gold standard of 500 random sentences from the Italian Wikipedia corpus. The average crowd agreement score on the gold standard amounts to .916

Approach      P      R      F1
Baseline      74.25  62.50  67.87
Supervised    84.35  82.86  83.60


Table 4.5: Lexicographical analysis of the Italian Wikipedia soccer player sub-corpus

Stems (frequency %)                                            Candidate frames (FrameNet)
gioc (47), partit (39), campionat (34), stagion (36),          Competition
  presen (30), disput (20), serie (14), nazional (13),
  titolar (13), competizion (5), scend (5), torne (5)
pass (24), trasfer (19), prest (15), contratt (11)             Activity start, Employment start
termin (12), contratt, ced (10), lasc (6), vend (2)            Activity finish, Employment end
gioc, disput (20), scend                                       Finish game
campionat, stagion, serie, nazional, competizion, torne        Finish competition
vins/vinc (18), pers/perd (11), sconfi (8)                     Beat opponent, Finish game
vins/vinc, conquis (8), otten (7), raggiun (6), aggiud (2)     Win prize, Personal success

Frames. We note that Attività is often mistaken for Stato or not classified at all: in fact, the difference between these two frames is quite subtle with respect to their sense. The former is more generic and could also be labeled as Career: if we viewed it in a frame hierarchy, it would serve as a super-frame of the latter. The latter instead encodes the development modality of a soccer player's career, e.g., when he remains unbound from some team due to contracting issues. Hence, we may conclude that distinguishing between these frames is a challenge even for humans. Furthermore, frames with no FEs are classified as "O", thus considered wrong despite the correct prediction. Vittoria is almost never mistaken for Trofeo: this is positively surprising, since the FE Competizione (frame Vittoria) is often mistaken for Premio (frame Trofeo), but those FEs do not seem to affect the frame classification. Again, such FE distinction must take into account a delicate sense nuance, which is hard for humans as well. Figure 4.6 and Figure 4.8 respectively plot the FE and frame classification performance, broken down to each label.



Figure 4.5: Supervised FE classification normalized confusion matrix, lenient evaluation setting. The color scale corresponds to the ratio of predicted versus actual classes. Normalization means that the sum of elements in the same row must be 1.0


Figure 4.6: Supervised FE classification precision and recall breakdown, lenient evaluation setting


Figure 4.7: Supervised frame classification normalized confusion matrix. The color scale corresponds to the ratio of predicted versus actual classes. Normalization means that the sum of elements in the same row must be 1.0


Figure 4.8: Supervised frame classification precision and recall breakdown


4.10.2 T-Box Enrichment

One of our main objectives is to extend the target KB ontology with new properties on existing classes. We focus on the use case and argue that our approach will have a remarkable impact if we manage to identify non-existing properties. This would serve as a proof of concept which can ideally scale up to all kinds of input. In order to assess such potential impact in discovering new relations, we need to address the following question: "which extractable relations are not already mapped in DBPO or do not even exist in the raw infobox properties datasets?". Table 4.5 illustrates an empirical lexicographical study gathered from the Italian Wikipedia soccer player sub-corpus (circa 52,000 articles). It contains occurrence frequency percentages of word stems (in descending order) that are likely to trigger domain-relevant frames, thus providing a rough overview of the extraction potential. The corpus analysis phase (cf. Section 4.4) yielded a ranking of LUs evoking the frames Activity, Defeat, Match, Trophy, Status, and Victory: these frames would serve as ontology property candidates, together with their embedded FEs. DBPO already has most of the classes that are needed to represent the main entities involved in the use case: SoccerPlayer, SoccerClub, SoccerManager, SoccerLeague, SoccerTournament, SoccerClubSeason, SoccerLeagueSeason, although some of them lack an exhaustive description (cf. SoccerClubSeason27 and SoccerLeagueSeason).28 For each of the 7 aforementioned DBPO classes, we computed the amount and frequency of ontology and raw infobox properties by querying the Italian DBpedia endpoint. Results (in ascending order of frequency) are publicly available,29 and Figure 4.9 illustrates their distribution.

27http://mappings.dbpedia.org/server/ontology/classes/SoccerClubSeason 28http://mappings.dbpedia.org/server/ontology/classes/SoccerLeagueSeason



Figure 4.9: Italian DBpedia soccer property statistics

The horizontal axis stands for the normalized (log scale) frequency, encoding the current usage of properties in the target KB; the vertical axis represents the ratio (which we call coverage) between the position of the property in the ordered result set of the query and the total amount of distinct properties (i.e., the size of the result set). Properties with a null frequency are ignored. First, we observe a lack of ontology property usage in 4 out of 7 DBPO classes, probably due to missing mappings between Wikipedia template attributes and DBPO. On the other hand, the ontology properties have a more homogeneous distribution compared to the raw ones: this serves as an expected proof of concept, since the main purpose of DBPO and the ontology mappings is to merge heterogeneous and multilingual Wikipedia template attributes into a unique representation. On average, most raw

29http://it.dbpedia.org/downloads/fact-extraction/soccer_statistics/


properties are concentrated below coverage and frequency threshold values of 0.8 and 4 respectively: this means that roughly 80% are rarely used, and the log scale further highlights the evidence. While ontology properties are better distributed, most still do not reach a high coverage/frequency trade-off, except for SoccerPlayer, which benefits from both rich data (cf. Section 4.2) and mappings.30 In light of the two analyses discussed above, it is clear that our approach would result in a larger variety and finer granularity of facts than those encoded into Wikipedia infoboxes and DBPO classes. Moreover, we believe the lack of dependence on infoboxes would enable more flexibility for future generalization to sources beyond Wikipedia. Subsequent to the use case implementation, we manually identified the following mappings from frames and FEs to DBPO properties:

• Frames: (Activity, careerStation), (Award, award), (Status, playerStatus);

• FEs: (Team, team), (Score, score), (Duration, [duration, startYear, endYear]).

Our system would undeniably benefit from a property matching facility to discover more potential mappings, although a research contribution in ontology alignment is out of scope for this work. In conclusion, we claim that 3 out of 6 frames and 12 out of 15 FEs represent novel T-Box properties.

4.10.3 A-Box Population

Our methodology enables a simultaneous T-Box and A-Box augmentation: while frames and FEs serve as T-Box properties, the extracted facts feed the A-Box part. Out of 49,063 input sentences, we generated a total of 213,479 and 216,451 triples (i.e., with a 4.35 and 4.41 ratio per sentence) from the supervised and the baseline classifiers respectively. Circa 52% and 55% are considered confident, namely facts with confidence scores (cf. Section 4.8) above the dataset average threshold.

30http://mappings.dbpedia.org/index.php/Mapping_it:Sportivo


Table 4.6: Relative A-Box population gain compared to pre-existing T-Box property assertions in the Italian DBpedia chapter

Property        Dataset          Assertions (#)   Gain (%)
careerStation   DBpedia          2,073            N.A.
                Baseline all     20,430           89.8
                Supervised all   26,316           92.12
award           DBpedia          7,755            N.A.
                Baseline all     4,953            -56.57
                Supervised all   10,433           25.66
playerStatus    DBpedia          0                N.A.
                Baseline all     0                0
                Supervised all   26               100

To assess the domain coverage gain, we can exploit two signals: (a) the amount of produced novel data with respect to pre-existing T-Box properties and (b) the overlap with already extracted assertions, regardless of their origin (i.e., whether they stem from the raw infobox or the ontology-based extractors). Given the same Italian Wikipedia dump input dating 21 January 2015, we ran both the baseline and the supervised relation extraction, as well as the DBpedia extraction framework to produce an Italian DBpedia chapter release, thus enabling the coverage comparison. Table 4.6 describes the analysis of signal (a) over the 3 frames that are mapped to DBPO properties. For each property and dataset, we computed the amount of available assertions and reported the gain relative to the relation extraction datasets. Although we considered the whole Italian DBpedia KB in these calculations, we observe that it has a generally low


coverage with respect to the analyzed properties, probably due to missing ontology mappings. For instance, the amount of assertions is always zero if we analyze the use case subset only, as no specific relevant mappings (e.g., Carriera sportivo31 to careerStation) currently exist. We view this as a major achievement, since our automatic approach also serves as a substitute for the manual mapping procedure. Table 4.7 shows the results for signal (b). To obtain them, we proceeded as follows.

1. slice the use case DBpedia subset;

2. gather the subject-object patterns from all datasets. Properties are not included, as they are not comparable;

3. compute the patterns overlap between DBpedia and each of the relation extraction datasets (including the confident subsets);

4. compute the gain in terms of novel assertions relative to the relation extraction datasets.

The A-Box enrichment is clearly visible from the results, given the low overlap and high gain in all approaches, despite the rather large size of the DBpedia use case subset, namely 6,167,678 assertions.
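For reference, a rough sketch of the overlap computation described in the steps above is given below, assuming each dataset is held as a set of (subject, predicate, object) triples; it is an illustration of the comparison, not the actual evaluation script.

def overlap_and_gain(dbpedia_triples, extracted_triples):
    # step 2: gather subject-object patterns, ignoring the (non-comparable) properties
    dbpedia_patterns = {(s, o) for s, _, o in dbpedia_triples}
    extracted_patterns = {(s, o) for s, _, o in extracted_triples}
    # step 3: overlap between DBpedia and the relation extraction dataset
    overlap = len(dbpedia_patterns & extracted_patterns)
    # step 4: gain, i.e. share of extracted patterns that are novel with respect to DBpedia
    gain = 100.0 * (len(extracted_patterns) - overlap) / len(extracted_patterns)
    return overlap, gain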

4.10.4 Final Fact Correctness

We estimate the overall correctness of the generated statements via an empirical evaluation over a sample of the output dataset. In this way, we are able to conduct a more comprehensive error analysis, thus isolating the performance of those components that play a key role in the extraction of facts: the Frame Semantics classifier, the numerical expression normalizer, and an external yet crucial element, i.e., the entity linker.

31https://it.wikipedia.org/wiki/Template:Carriera_sportivo


To do so, we randomly selected 10 instances for each frame from the supervised dataset and retrieved all the related triples. We excluded instance type triples (cf. Section 4.8), which are directly derived from the reified frame ones. Then, we manually assessed the validity of each triple element and assigned it to the component responsible for its generation. Finally, we checked the correctness of the whole triple. More formally, given the evaluation set of triples E, the frame predicates set F, the non-numerical FE predicates set N̄, and the numerical FE predicates set N (cf. Section 4.5), relevant triple elements are added to the classifier C, the normalizer N, the linker L, and to the set of all facts A as follows.

E ⊆ S × P × O;    P = F ∪ N̄ ∪ N;    F ∩ N̄ ∩ N = ∅;
p_c ∈ F ∪ N̄;    p_n ∈ N;
O = O_c ∪ O_n;    O_c ∩ O_n = ∅;
o_c ∈ O_c;    o_n ∈ O_n;

∀(s, p, o) ∈ E let
    C ← C ∪ {(p_c, o_c)};    N ← N ∪ {(p_n, o_n)};
    L ← L ∪ {o_c};    A ← A ∪ {(s, p, o)}

Table 4.8 summarizes the outcomes.

Table 4.7: Overlap with pre-existing assertions in the Italian DBpedia chapter and relative gain in A-Box population

Dataset                Overlap (#)   Gain (%)
Baseline all           3,341         98.2
Supervised all         4,546         97.4
Baseline confident     2,387         97.6
Supervised confident   2,841         96.8


Table 4.8: Fact correctness evaluation over 132 triples randomly sampled from the supervised output dataset. Results indicate the ratio of correct data for the whole fact (All) and for triple elements produced by the main components of the system, namely: Classifier, as per Figure 4.2, part 2(c), and Section 4.6; Normalizer, as per Figure 4.2, part 2(d), and Section 4.7; Linker, external component, as per Section 4.6.

Classifier   Normalizer   Linker   All
.763         .820         .430     .727

Discussion

First, we observe that all the results but the linker are in line with our classification performance assessments detailed in Section 4.10.1. Accordingly, we notice that most of the errors involve the linker. More specifically, we summarize below an informal error analysis:

• generic dates appearing without years (as in the 13th of August) are resolved to their Wikipedia page.32 These occurrences are then wrongly classified as Competizione, consistently with what we remarked in Section 4.10.1;

• country names, e.g., Sweden are often linked to their national soccer team or to the major national soccer competition. This seems to mislead the classifier, which assigns a wrong role to the entity, instead of Place;

• the generic adjective Nazionale (national) is always linked to the Italian national soccer team, even though the sentence often contains enough elements to understand the correct country;

• some yearly intervals, e.g., 2010-2011 are linked to the corresponding season of the major Italian national soccer competition.

32https://en.wikipedia.org/wiki/August_13


Unfortunately, the linker tends to assign a fairly high confidence to these matches and so does the classifier, which assumes correct linking of entities. This leads to many assertions with undeserved high scores and underlines how important Entity Linking is in our pipeline.

4.11 Observations

We pinpoint and discuss here a list of notable aspects of this work.

4.11.1 LU Ambiguity

We acknowledge that the number of frames per LU in our use case repository may not be exhaustive enough to cover the potentially higher LU ambiguity. For instance, giocare (to play) may trigger an additional frame depending on the context (as in the sentence to play as a defender); esordire (to start out) may also trigger the frame Partita (match). Nevertheless, our one-step annotation approach is agnostic to the frame repository. Consequently, we expect that the LU ambiguity would not be an issue. Of course, the more ambiguous a LU is, the more expensive the crowdsourcing job becomes (cf. Section 4.6.2).

4.11.2 Manual Intervention Costs

Despite its low cost, we admit that crowdsourcing does not conceptually bypass the manual effort needed to create the training set: workers are indeed human annotators. However, we argue that the price can decrease even further by virtue of an automatic communication with the CrowdFlower API. This is already accomplished in the ongoing StrepHit project, where we programmatically create jobs, post them, and pull their results. Hence, we may regard crowdsourcing as an activity that does not imply any direct


manual intervention by whoever runs the pipeline, if we exclude a minor quantity of test annotations, which are essential to reject cheaters. Even though we recognize that the use case frame repository is hand-curated, we would like to emphasize that (a) it is intended as a test bed to assess the validity of our approach, and (b) its generalization should instead maximize the reuse of available resources. This is currently implemented in the StrepHit project, where we fully leverage FrameNet to look up relevant frames given a set of LUs.

4.11.3 NLP Pipeline Design

On account of our initial claim on the use of a shallow NLP machinery, we motivate below the choice of stopping at the grammatical layer. The decision essentially emanates from (1) the sentence selection phase, where we investigated several strategies, and (2) the construction of the crowdsourcing jobs, where we concurrently (2a) maximized the simplicity to smooth the way for lay workers, and (2b) automatically generated the candidate annotation chunks.

• Chunking is substituted by Entity Linking, as explored in Section 4.6.2;

• Syntactic parsing dramatically affects the computational costs, as shown in Table 4.9 and discussed in Section 4.6.1. Yet, we suppose that it could probably improve the performance in terms of recall. Given the KB population task, we still argue that precision should be made a priority, in order to produce high quality datasets;

• Semantic Role Labeling is not a requirement, since our system replaces this layer, as described in Section 4.6.


Table 4.9: Comparative results of the Syntactic sentence extraction strategy against the Sentence Splitter one, over a uniform sample of a corpus gathered from 53 Web sources, with estimates over the full corpus.

Strategy    # Documents   # Extracted   Cost
Splitter    7,929         13,846        1m 13s
Syntactic   7,929         41,205        6h 15m 49s
Splitter    504,189       899,159       1h 19m
Syntactic   504,189       2,675,853     16d 22h 45m 32s

4.11.4 Simultaneous T-Box and A-Box Augmentation

The Relation Extractor is conceived to extract factual information from text: as such, its primary output is a set of assertions that naturally feed the target KB A-Box. The T-Box enrichment is an intrinsic consequence of the A-Box one, since the latter provides evidence of new properties for the former. In other words, we adopt a data-driven method, which implies a bottom-up direction for populating the target KB. It is the duty of the corpus analysis module (Section 4.4) to understand the most meaningful relations between entities from the very bottom, i.e., the corpus. After that, the system proceeds upwards and translates the classification results into A-Box statements. These are already structured to ultimately carry the properties into the top layer of the KB, i.e., the T-Box.

4.11.5 Confidence Scores Distribution

Table 4.10 presents the cumulative (i.e., all FEs and frames aggregated) statistical distribution of confidence scores as observed in the gold standard. If we dig into single scores, we notice that the classifier usually outputs very high values for O and LU chunks, while average scores for other FEs range from .821 for Competition to .594 for Winner, down to .488 for


Loser. On the other hand, EL scores have a relatively high average and a standard deviation of 0.273. In other words, the EL component is prone to set rather optimistic values, which are likely to have an impact on the global score. Overall, due to the high presence of O chunks (circa 80% of the total), the EL and the classifier scores roughly match for each FE, and so do the final ones computed with the strategies introduced in Section 4.8. Assigning different weights to core and extra FEs has little impact on the global scores as well, varying their value by only 1 or 2% in both the weighted and the harmonic means. The arithmetic and weighted means yield the most optimistic global scores, averaging at .83 over the output dataset, while the harmonic mean settles at .75.

4.11.6 Scaling Up

Our approach has been tested on the Italian language, a specific domain, and with a small frame repository. Hence, we may consider the use case implementation as a monolingual closed-domain information extraction system. We outline below the points that need to be addressed for scaling up to multilingual open information extraction:

1. Language: training data availability for POS tagging and lemmatization. The LUs automatically extracted through the corpus analysis

Table 4.10: Cumulative confidence scores distribution over the gold standard

Type                Min    Max    Avg    Stdev
Classifier FEs      .181   .999   .945   .124
Classifier Frames   .412   .999   .954   .093
Links               .202   1.0    .697   .273
Global              .227   1.0    .838   .151


phase should be projected to a suitable frame repository;

2. Domain:

• Baseline: mapping between FEs and target KB ontology classes;

• Supervised:

– financial resources for the crowdsourced training set construction, on average 4.79 $ cents per annotated sentence;
– adapt the query to generate the gazetteer.

4.11.7 Crowdsourcing Generalization

With the Wikidata commitment in mind (Section 4.1), we aim at expanding our approach towards a corpus of non-Wikimedia Web sources and a broader domain. This entails the generalization of the crowdsourcing step. Overall, it has been proven that laymen execute natural language tasks with reasonable performance [116]. Specifically, crowdsourcing Frame Semantics annotation has been recently shown to be feasible by [64]. Furthermore, [7] stressed the importance of eliciting non-expert annotators to avoid the high recruitment cost of linguistics experts. In [50], we further validated the results obtained by [64], and reported satisfactory accuracy as well. Finally, [25] proposed an approach to successfully scale up frame disambiguation. In light of the above references, we argue that the requirement can be indeed satisfied: as a proof of concept, we are working in this direction with StrepHit, where we have switched to a more extensive and heterogeneous input corpus. Here, we focus on a larger set L of LUs, thus |L| × n frames, where n is the average LU ambiguity. At the time of writing, we are in the process of building the training set.


4.11.8 Miscellanea

First, if a sentence is not in the gold standard, the supervised classifier should discard it (abstention). Second, the baseline approach may contain rules that are more harmful than beneficial, depending on the target KB reliability: for instance, the SportsEvent DBPO class leads to wrongly typed instances, due to the misuse of the template by Wikipedia editors. Finally, both the input corpus and the target KB originate from a relatively small Wikipedia chapter (i.e., Italian, with 1.23 million articles) if compared to the largest one (i.e., English, with almost 5 million articles). Therefore, we recognize that the T-Box and A-Box evaluation results may be proportionally different if obtained with English data.

4.11.9 Technical Future Work

We report below a list of technical improvements left for planned implementation:

• LUs are handled as unigrams, but n-grams should be considered too;

• tagging n-grams with ontology classes retrieved at the EL step may be an impactful additional feature;

• the gazetteer is currently being matched at the token level, but it may be more useful if run over the whole input (sentence);

• in order to reduce the noise in the training set, we foresee leveraging a sentence splitter and extracting 1-sentence examples only;

• further evaluation experiments will also count EL surface forms instead of links;

• the inclusion of the frame confidence would further refine the final confidence score.


4.12 Conclusion

In a Web where the profusion of unstructured data limits its automatic interpretation, the necessity of Intelligent Web-reading Agents becomes more and more evident. These agents should preferably be conceived to browse an extensive and variegated amount of Web sources corpora, harvest structured assertions out of them, and finally cater for target KBs, which can attenuate the problem of information overload. As a support to such vision, we have outlined two real-world scenarios involving general-purpose KBs:

(a) Wikidata would benefit from a system that reads reliable third-party resources, extracts statements complying with the KB data model, and leverages them to validate existing data with reference URLs, or to recommend new items for inclusion. This would both improve the overall data quality and, most importantly, underpin the costly manual data insertion and curation flow;

(b) DBpedia would naturally evolve towards the extraction of unstructured Wikipedia content. Since Wikidata is designed to be the hub for serving structured data across Wikimedia projects, it will let DBpedia focus on content besides infoboxes, categories and links.

In this chapter, we presented a system that puts into practice our fourfold research contribution: first, we perform (1) N-ary relation extraction thanks to the implementation of Frame Semantics, in contrast to traditional binary approaches; second, we (2) simultaneously enrich both the T-Box and the A-Box parts of our target KB, through the discovery of candidate relations and the extraction of facts respectively. We achieve this with a (3) shallow layer of NLP technology, namely grammatical analysis, instead of more sophisticated ones, such as syntactic parsing. Finally, we ensure a (4) fully supervised learning paradigm via an affordable crowdsourcing methodology.


Our work concurrently bears the advantages and leaves out the weaknesses of RE and OIE: although we assess it in a closed-domain fashion via a use case (Section 4.2), the corpus analysis module (Section 4.4) allows us to discover an exhaustive set of relations in an open-domain way. In addition, we overcome the supervision cost bottleneck through crowdsourcing. Therefore, we believe our approach can represent a trade-off between open-domain high noise and closed-domain high cost. The Relation Extractor is a full-fledged Information Extraction NLP pipeline that analyses a natural language textual corpus and generates structured machine-readable assertions. Such assertions are disambiguated by linking text fragments to entity URIs of the target KB, namely DBpedia, and are assigned a confidence score. For instance, given the sentence Buffon plays for Serie A club Juventus since 2001, our system produces the following dataset:

@prefix dbpedia: <…> .
@prefix dbpo: <…> .
@prefix fact: <…> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

dbpedia:Gianluigi_Buffon dbpo:careerStation dbpedia:CareerStation_01 .

dbpedia:CareerStation_01 dbpo:team dbpedia:Juventus_Football_Club ;
    fact:competition dbpedia:Serie_A ;
    dbpo:startYear "2001"^^xsd:gYear ;
    fact:confidence "0.906549"^^xsd:float .

We estimate the validity of our approach by means of a use case in a specific domain and language, i.e., soccer and Italian. Out of roughly 52,000 Italian Wikipedia articles describing soccer players, we output more

than 213,000 triples with an estimated average 81.27% F1. Since our focus is the improvement of existing resources rather than the development of a standalone one, we integrated these results into the Italian DBpedia

chapter33 and made them accessible through its SPARQL endpoint. Moreover, the codebase is publicly available as part of the DBpedia Association repository.34 We have started to expand our approach under the Wikidata umbrella, where we feed the primary sources tool. The community is currently concerned by the trustworthiness of Wikidata assertions: in order to authenticate them, they should be validated against references to external Web sources. Under this perspective, we are leading the StrepHit Wikimedia IEG project,35 which builds upon the Relation Extractor and aims at serving as a reference suggestion mechanism for statement validation. To achieve this, we have successfully managed to switch the input corpus from Wikipedia to third-party corpora and translated our output to fit the Wikidata data model. The soccer use case has already been partially implemented: we have run the baseline classifier and generated a small demonstrative dataset, named FBK-strephit-soccer, which has been uploaded to the primary sources tool back-end. We invite the reader to play with it, by following the instructions in the project page.36 At the time of writing, we are scaling up to (a) a larger input in (b) the English language, with (c) a bigger set of relations, and (d) a different domain. The Web Sources corpus contains more than 500,000 English documents gathered from 53 sources; the corpus analysis yielded 50 relations, which are connected to an already available frame repository, i.e., FrameNet. For future work, we foresee progressing towards multilingual open information extraction, thus paving the way to (a) its full deployment into the DBpedia Extraction Framework, and to (b) a thorough referencing system

33http://it.dbpedia.org/2015/09/meno-chiacchiere-piu-fatti-una-marea-di-nuovi-dati-estratti-dal-testo-di-wikipedia/?lang=en
34https://github.com/dbpedia/fact-extractor
35https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References
36https://www.wikidata.org/wiki/Wikidata:Primary_sources_tool#How_to_use


for Wikidata.

Acknowledgments. The Relation Extractor has been developed within the DBpedia Association and was partially funded by Google under the Summer of Code 2015 program. The StrepHit project is undergoing active development and is fully funded by the Wikimedia Foundation via the Individual Engagement Grants program.


Algorithm 1 Rule-based baseline classifier
Input: S; F; L
Output: C
 1: C ← ∅
 2: for all s ∈ S do
 3:   E ← entityLinking(s)
 4:   T ← tokenize(s)
 5:   for all t ∈ T do
 6:     if t ∈ L then  # Check whether a sentence token matches a LU token
 7:       for all f ∈ F do
 8:         core ← false
 9:         O ← getLinkedEntityClasses(E)
10:         for all o ∈ O do
11:           fe ← lookup(f)  # Get the FE that maps to the current linked entity class
12:           core ← checkIsCore(fe)
13:         end for
14:         if core then  # Relaxed classification
15:           c ← [s, f, fe]
16:           C ← C ∪ {c}
17:         else
18:           continue  # Skip to the next frame
19:         end if
20:       end for
21:     end if
22:   end for
23: end for
24: return C


Chapter 5

Classes: Unsupervised Taxonomy Learning

5.1 Introduction

The Wikipedia category system is a fine-grained topical classification of Wikipedia articles, thus being natively suitable for encoding Wikipedia knowledge. Besides its ontology, DBpedia uses the category hierarchy as a supplementary classification system, while several taxonomization efforts, such as [102, 103, 34, 49, 85, 86, 63], aim at mapping categories into types. However, their granularity is often very high, resulting in an arguably overly large set of items. From a practical perspective, it is vital to cluster resources into classes with intuitive labels, in order to simplify the end user's cognitive effort needed when querying the knowledge base. Hence, identifying a taxonomy based on a prominent subset of Wikipedia categories is a critical step to both extend and homogenize the DBpedia ontology (DBPO). Despite the number of similar initiatives, we argue that there is a need for a dataset with broad coverage and satisfactory intuitiveness. In this chapter, we present DBTax, a completely data-driven methodology to automatically construct a comprehensive classification of DBpedia resources.


Four features set DBTax apart from related approaches and constitute the main contributions of this chapter:

1. Exhaustive type coverage over the whole knowledge base;

2. Focus on the actual usability of the schema from an end user’s per- spective;

3. Possibility of replication across different Wikipedia language chapters;

4. Fully unsupervised implementation, not requiring manual efforts for building annotated corpora.

The remainder of this chapter is structured as follows. We first outline in section 5.2 a high-level overview of the approach, with a definition of the key intuition. Section 5.3 contains our core contribution and illustrates in detail its major implementation phases. We corroborate our methodology with a report of its outcomes (section 5.4), coverage comparisons with related resources, as well as an evaluation of both the taxonomy structure and the type assignment correctness (section 5.5). In section 5.6, we describe the policies to ensure access and sustainability of the output datasets, before drawing our conclusions in section 5.7.

5.2 Prominent Nodes

We propose to automatically derive a taxonomy for the classification of DBpedia resources from a prominent subset of the Wikipedia category system, which provides a more reliable and almost complete knowledge backbone compared to infoboxes. We report below a high-level overview of our prominent node identification core algorithm, with the help of an example. A detailed description is provided in subsection 5.3.2. The category with label Media in Traverse City, Michigan has 2 subcategories, namely


(a) Radio stations in Traverse City, Michigan (mentioned in 8 pages), and (b) Television stations in Traverse City, Michigan (mentioned in 4 pages). Both subcategories are leaf nodes. Thus, we make the parent category a prominent node and organize the 12 pages into a single cluster. Since this algorithm solely considers the category system structure, we incorporate linguistic processing and a usage-based technique. The former aims at simplifying the cluster label, which is renamed to Media in our example. The latter weights the cluster depending on how often it is employed across all the Wikipedia language chapters.

5.3 Generating DBTax

We envision the construction and the population of DBTax in four major stages:

1. Leaf node extraction;

2. Prominent node discovery;

3. Class taxonomy generation (T-Box);

4. Pages type assignment (A-Box).

First, we describe in subsection 5.3.1 a method to identify initial leaf node candidates. In subsection 5.3.2, we provide an overview of the prominent node discovery procedure step by step. The algorithms used to generate the class hierarchy are illustrated in subsection 5.3.3. Finally, we assign types to Wikipedia pages (subsection 5.3.4).

5.3.1 Stage 1: Leaf Nodes Extraction

The Wikipedia category system is organized in a cyclic graph data structure, which is of little use from a taxonomical perspective, due to its noisy nature.

105 5.3. GENERATING DBTAX

In fact, a class hierarchy best fits into a directed acyclic graph (DAG) data structure, and we adopt a bottom-up approach to build it, starting from the leaves up to the root. Hence, the first stage takes as input the Wikipedia public database dumps1 and outputs a set of leaf nodes, i.e., categories with no subcategories, which we store in a database table (node). Specifically, we use the Wikipedia tables encoding the links between the categories themselves, as well as between the categories and the pages. The procedure is implemented as follows: (a) we retrieve the full set of article pages, (b) we extract those categories that are linked to actual articles only, by looking up the outgoing links for each page, and out of them (c) we determine the set of categories with no subcategories.
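To make Stage 1 concrete, the following minimal sketch reproduces the three steps above over in-memory sets. The tuple layout (page id, namespace, title; source page id, target category title) mirrors the public Wikipedia dump tables, but all names and the data-access style are illustrative assumptions, not the actual implementation.

# A minimal sketch of Stage 1, assuming the 'page' and 'categorylinks'
# Wikipedia dump tables have been loaded as iterables of tuples;
# table and column layout are assumptions for illustration only.

def extract_leaf_categories(pages, category_links):
    """pages: iterable of (page_id, namespace, title);
    category_links: iterable of (source_page_id, target_category_title)."""
    ARTICLE_NS, CATEGORY_NS = 0, 14

    articles = {pid for pid, ns, _ in pages if ns == ARTICLE_NS}
    categories = {title for _, ns, title in pages if ns == CATEGORY_NS}
    category_ids = {pid for pid, ns, _ in pages if ns == CATEGORY_NS}

    # Categories that have at least one subcategory, i.e. a category page
    # linking to them.
    with_subcats = {target for source, target in category_links
                    if source in category_ids}

    # Categories linked from at least one actual article page.
    linked_from_articles = {target for source, target in category_links
                            if source in articles}

    # Leaf nodes: categories used by articles and having no subcategories.
    return (categories & linked_from_articles) - with_subcats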

5.3.2 Stage 2: Prominent Node Discovery

The following techniques are combined to identify the set of prominent category nodes:

1. Algorithmic, programmatically traversing the Wikipedia category sys- tem;

2. Linguistic, identifying categories yielding is-a relations via Natural Language Processing;

3. Multilingual, leveraging interlanguage links.

The algorithmic technique is launched first and its output serves the other two, which run in parallel. We store their outcomes as attributes of the node database table, where each category is represented as a record.
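For illustration, a node record could be sketched as follows; the attribute names are assumptions chosen to mirror the prose above, not the actual database schema of the implementation.

# A sketch of a node table record as used throughout Stage 2;
# attribute names are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Node:
    label: str                     # full category label, e.g. "Past presidents of Italy"
    is_leaf: bool = False          # set in Stage 1
    is_prominent: bool = False     # set by the algorithmic technique (Algorithm 2)
    head: str = ""                 # head chunk from the linguistic technique, e.g. "presidents"
    has_plural_head: bool = False  # set after the Pling-Stemmer lookup
    langlinks_weight: int = 0      # number of interlanguage links (usage-based technique)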

1https://dumps.wikimedia.org


Algorithm 2 Prominent Node Discovery
Input: L
Output: PN ≠ ∅
  PN ← ∅
  for all l ∈ L do
    isProminent ← true; P ← getTransitiveParents(l)
    for all p ∈ P do
      C ← getChildren(p); areAllLeaves ← true
      for all c ∈ C do
        if c ∉ L then areAllLeaves ← false; break
      end for
      if areAllLeaves then
        PN ← PN ∪ {p}; isProminent ← false
      end if
    end for
    if isProminent then PN ← PN ∪ {l}
  end for
  return PN

Traversing the Leaf Graph

We now illustrate the procedure to programmatically process the Wikipedia category graph, starting from the set of leaf nodes produced in subsection 5.3.1 and yielding a set of prominent node candidates. Its pseudocode is provided in Algorithm 2. The approach can be summarized as follows. Given as input a set of leaf nodes L, for each leaf l, we transitively traverse back to its set of parents P. For each such parent p, we check whether its set of children C is exclusively composed of leaves. If so, we consider p a prominent node and add it to the output set PN. Otherwise, we make l a prominent node. We use a boolean attribute to mark PN elements in the node table.

NLP for is-a Relations

We adopt the approach applied in YAGO [63, 118] to identify prominent node candidates holding is-a relations. It relies on a straightforward yet


powerful observation: since any Wikipedia category linguistically corresponds to a noun phrase, if its head appears in plural form, then that category is likely to be a conceptual one, and may serve as a class (cf. the paragraph on YAGO in Section 2.5.1). Specifically, we perform shallow syntactic parsing by means of the Noun Group Parser [117]. Categories are represented via link grammars [115], which are simple implementations of phrase structure grammars, the most complex being HPSG [101, 100]. For instance, Figure 5.1 explains how to parse the noun phrase (NP) Past presidents of Italy, which yields 3 chunks, namely a pre-modifier (PRE) Past, a head presidents and a post-modifier (POST) of Italy. We populate a new attribute of the node table with the head chunk. Afterwards, we exploit the Pling-Stemmer2 to automatically mark prominent nodes having a plural head with a boolean attribute. The replicability of such a method across multilingual Wikipedia deployments can be achieved via the following two strategies, each bearing its own cost: (a) exploitation of category interlanguage links (published by Wikipedia), at the cost of excluding categories with no English counterpart, and (b) language-specific implementations of the noun phrase parser and the stemmer, both at an intrinsic development expense and depending on the availability of language resources.
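A simplified, self-contained sketch of the linguistic technique follows: it cuts the category label at the first post-modifier marker, treats the last remaining token as the head, and keeps the category if the head looks plural. The thesis relies on the Noun Group Parser and the Pling-Stemmer; the naive splitting and plural test below are illustrative assumptions.

POST_MODIFIER_MARKERS = {"of", "in", "by", "from", "for"}

def head_of(category_label):
    tokens = category_label.split()
    # Cut the label at the first post-modifier marker, if any.
    for i, token in enumerate(tokens):
        if token.lower() in POST_MODIFIER_MARKERS:
            tokens = tokens[:i]
            break
    return tokens[-1] if tokens else ""

def looks_plural(noun):
    # Crude stand-in for a proper stemmer: plural if it ends in "s"
    # but not in "ss" (e.g. "presidents" yes, "chess" no).
    return noun.lower().endswith("s") and not noun.lower().endswith("ss")

if __name__ == "__main__":
    label = "Past presidents of Italy"
    head = head_of(label)            # -> "presidents"
    print(head, looks_plural(head))  # -> presidents True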

Interlanguage Links as a Weight

We leverage the langlinks table of the Wikipedia database dumps to retrieve the number of interlanguage links for each prominent node candidate. This enables the implementation of a usage-driven weighting system, since we are able to induce a score assessing the usage of a given category among all the Wikipedia language editions. We populate a further attribute of the

2http://resources.mpi-inf.mpg.de/yago-naga/javatools/doc/javatools/parsers/PlingStemmer.html


node table with the interlanguage links weight, and use it as a threshold to filter out underutilized items.

Figure 5.1: Example of a Wikipedia category phrase structure parsing tree (the noun phrase Past presidents of Italy, with pre-modifier Past, head presidents, and post-modifier of Italy)

5.3.3 Stage 3: Class Taxonomy Generation

We reconstruct the full hierarchy of parent-child relations by recursively obtaining the set of parents for each leaf category, following a bottom-up direction.

Cycle Removal

The Wikipedia category graph contains cycles, and so did the output of our first reconstruction attempt. In order to remove them and ensure a strict hierarchy, we apply Algorithm 3 in our processing pipeline. In brief, the algorithm traverses the graph in a breadth-first top-down fashion, starting from the root node (i.e., Contents), and returns a tree T. For each node we encounter, we add it to T only if it has not been introduced yet. The set E keeps the already introduced nodes, while sets P and N keep the nodes of a specific tree level. The breadth-first approach for cycle removal favors shorter hierarchy paths: if a category exists at multiple levels of the graph, it will be attached at the lowest depth, i.e., with a low distance to


the root. However, we believe this choice both satisfies the goals of DBTax and complies with the philosophy of DBPO, namely to provide a high-level and general-purpose classification.
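The following is a Python rendering of this breadth-first traversal (Algorithm 3), assuming the category graph is given as an adjacency dict {category: [children]} and rendering T as parent–child edges; both are assumptions about the data layout, not the thesis implementation.

from collections import deque

def remove_cycles(graph, root="Contents"):
    tree_edges = []            # the resulting tree T, as (parent, child) edges
    seen = {root}              # E: nodes already introduced
    frontier = deque([root])   # P: nodes of the current level

    while frontier:
        next_level = deque()   # N: nodes of the next level
        for parent in frontier:
            for child in graph.get(parent, []):
                if child not in seen:
                    seen.add(child)
                    next_level.append(child)
                    tree_edges.append((parent, child))
        frontier = next_level
    return tree_edges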

Pruning Instances

The taxonomy we have obtained from the methods applied so far does not distinguish between classes and instances. Thus, we need further post-processing to prune instances and to produce a consumable resource. We opt for the name analysis approach proposed in [130], which assumes that instances are real-world entities. We leverage the DBpedia 3.9 release to filter out non-classes. Specifically, we combine the datasets containing labels, redirects and instances, and generate a list of labels for all DBpedia instances. By joining this list with the taxonomy, we managed to exclude 1,562 entries. Even though the pruning step cleaned DBTax of instances, it additionally removed many nodes from the hierarchy. This unavoidable side effect partially decreased the quality of the T-Box. The reason is that nodes with pruned parents got attached directly to the root, thus resulting in broad paths (cf. section 5.4).
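A minimal sketch of this pruning step is given below, assuming the DBpedia 3.9 labels, redirects and instance datasets have already been parsed into Python sets of labels; names and loading code are assumptions.

def prune_instances(taxonomy_nodes, instance_labels, redirect_labels):
    """taxonomy_nodes: set of candidate class labels;
    instance_labels / redirect_labels: labels of known DBpedia instances."""
    known_instances = instance_labels | redirect_labels
    # Keep only those taxonomy nodes that do not name a known instance;
    # children of pruned nodes are later re-attached to the root.
    return {node for node in taxonomy_nodes if node not in known_instances}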

5.3.4 Stage 4: Pages Type Assignment

We populate the taxonomy built in stage 3 by taking as input the heads of the prominent nodes returned in stage 2 and by leveraging the links between categories and Wikipedia article pages. In this way, we are able to assert an instance-of relation between a given page and the head of a category linked to that page. Once the type is assigned, its super and subtypes can be automatically inferred on account of the T-Box. We informally report below the foreseen procedure, which is applied to each prominent node head h.


Algorithm 3 Cycle Removal
Input: G
Output: T ≠ ∅
  T ← ∅; P ← getRootNode(G); E ← P
  while P ≠ ∅ do
    N ← ∅
    for all p ∈ P do
      C ← getChildren(p)
      for all c ∈ C do
        if c ∉ E then
          E ← E ∪ {c}; N ← N ∪ {c}; T ← T ∪ {p, c}
        end if
      end for
    end for
    P ← N
  end while
  return T

1. Extract the set S of those categories having head = h;

2. Extract the pages linked to each category in S;

3. For each page p:

(a) If it is an article page, then produce an assertion in the form of a triple <p, instance-of, h>;
(b) If it is a category, recursively repeat from point 2 until the condition in point 3(a) is satisfied.
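The sketch below renders this procedure for one prominent head h, assuming category-to-page links are available as callables and that article pages can be told apart from category pages; all data-access names are assumptions.

def assign_types(head, categories_with_head, pages_of, is_article):
    """Yield <page, instance-of, head> assertions.
    categories_with_head: categories whose head chunk equals `head`;
    pages_of(category): pages linked to that category;
    is_article(page): True for article pages, False for category pages."""
    triples = []
    visited = set()

    def visit(category):
        if category in visited:      # guard against residual category cycles
            return
        visited.add(category)
        for page in pages_of(category):
            if is_article(page):
                triples.append((page, "instance-of", head))
            else:
                visit(page)          # recurse into the sub-category

    for category in categories_with_head:
        visit(category)
    return triples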

5.4 Results

In order to enable the comparison across related resources, we process the same April 2013 English Wikipedia dumps as the DBpedia 3.9 release.3 The outcomes of DBTax are three-fold, namely:

3http://wiki.dbpedia.org/services-resources/datasets/data-set-39/dump-dates-39


• The taxonomy (T-Box) automatically generated according to stage 3 (subsection 5.3.3) is composed of 1,902 classes;

• 10,729,507 instance-of assertions (A-Box) are produced as output of stage 4 (subsection 5.3.4). They are serialized into triples, according to the RDF data model.4 We use the Turtle5 syntax, which supports UTF-8-encoded International Resource Identifiers (IRIs), making it a good fit for multilingual Wikipedia pages with no need for escaping special characters. An example follows:

dbpedia:Combat_Rock a dbtax:Album .

• A total of 4,260,530 unique resources are assigned a type, 2,325,506 of which do not have one in the DBpedia 3.9 release.

5.5 Evaluation

We use the following versions of the resources we compare to: (a) DBPO version 3.9;6 (b) MENTA's underlying Wikipedia dumps date back to 2010; (c) SDType as per DBpedia version 3.9;7 (d) YAGO types dataset as per DBpedia version 3.9;8 (e) WiBi consumes the October 2012 English Wikipedia dump;9 (f) Wikipedia categories from the same April 2013 English Wikipedia dumps; (g) Wikidata RDF exports from April 2014.10 We decided to include both MENTA and WiBi in our comparative evaluation anyway, since the former leverages knowledge from 271 languages, and the latter stands as the most recently published (2014) related approach. However, we

4http://www.w3.org/TR/rdf11-concepts/
5http://www.w3.org/TR/turtle/
6http://downloads.dbpedia.org/3.9/dbpedia_3.9.owl.bz2
7http://downloads.dbpedia.org/3.9/en/instance_types_heuristic_en.ttl.bz2
8http://downloads.dbpedia.org/3.9/links/yago_types.ttl.bz2
9http://wibitaxonomy.org/wibi-ver1.0.tar.gz
10http://tools.wmflabs.org/wikidata-exports/rdf/exports/20140420/

recognize their performance might be relatively different on the April 2013 dump. Furthermore, the closest Wikidata dump we could access is one year newer. Hence, we expect a performance variation there as well. Finally, we could not retrieve the T-Box from MENTA and SDType, thus limiting their evaluation to the A-Box only. We could not build our experiments with Tìpalo [51], since the only available dataset11 contains 547 unique entities, and has no overlap with our evaluation sets (cf. Sections 5.5.2 and 5.5.3).

5.5.1 Coverage

Exhaustive type coverage over the whole knowledge base is a crucial objective in our contribution. We compute coverage as the number of resources for which at least one type is assigned, divided by the number of actual Wikipedia article pages in the dump we process, excluding redirect pages. We report the values in Table 5.1. DBTax clearly outperforms all the compared resources. Since our approach depends on the Wikipedia categories, one may object that articles with no assigned categories cannot be covered. However, at the time of writing (August 2015), merely 2,263 English Wikipedia articles are uncategorized12 (exclusively considering content categories, not administrative ones).13 This corresponds to circa 0.045% of the total 4,934,195 articles.14 Hence, the results we obtained for DBTax are in line with the statistics reported by the English Wikipedia. Moreover, DBTax identified 20.6% of the manually curated DBPO classes, ranging from top-level (e.g., Work) to deeply nested (e.g., Biomolecule) ones. This finding enables a natural mapping to DBPO.

11http://ontologydesignpatterns.org/ont/wikipedia/instance.rdf
12http://en.wikipedia.org/wiki/Category:All_uncategorized_pages
13http://en.wikipedia.org/wiki/Wikipedia:Categorization#Non-article_and_maintenance_categories
14http://meta.wikimedia.org/wiki/List_of_Wikipedias


5.5.2 T-Box Evaluation

We compare our results against DBPO, YAGO, WiBi, and Wikidata class hierarchies, as well as the Wikipedia category system itself, treating the Wikipedia categories as classes for the purpose of this evaluation only. We focus on (1) distinguishing classes from instances, and (2) hierarchy paths.

Task Anatomy

We pick a random sample of 50 classes from each resource and ask the evaluators the following questions: (a) “Is this a class or an instance?” (Class), and (b) “Can this class be broken down into more than one class?” (Breakable). For the hierarchy path evaluation, we pick a random sample of 50 leaf classes from each resource and generate the hierarchy path up to the root node (i.e., Thing). We ask the evaluators the following questions: (a) “Is this a valid class hierarchy path?” (Valid), (b) “Is this hierarchy too specific?” (Specific), and (c) “Is this hierarchy too broad?” (Broad). The Valid question is meant to catch wrong hierarchies (e.g., Thing → City → Place). The Specific and Broad questions aim at capturing such taxonomy design issues, although we recognize that they can be subjective and may depend on the use case. In fact, we expect a low agreement score, as we are assessing general-purpose taxonomies, with a high probability of

Table 5.1: Type coverage of Wikipedia articles

Resource   Coverage
DBPO       .513
DBTax      .994
MENTA      .537
SDType     .147
YAGO       .673
WiBi       .794

cross-domain knowledge in our evaluation set. The Breakable and Specific questions involve leaf nodes only, while Valid is formulated with a path from a leaf node to the root. In total, 10 evaluators participated and each question was evaluated twice. The namespaces were hidden to avoid bias and the questions were globally randomized.

Discussion

Table 5.2 shows the overall results. Out of the four taxonomies, DBPO performs slightly better on average. However, this is expected, since it is a relatively small and manually curated ontology, compared to YAGO and DBTax. YAGO yields results similar to DBTax with respect to the Valid question. DBTax provides better non-breakable classes, as it solely consists of prominent nodes, and does not create too specific hierarchies (cf. !S), as opposed to YAGO. Finally, DBTax stands last when it comes to broad hierarchies (cf. !Bro). This is due both to the cycle removal algorithm and especially to the instance pruning step (cf. subsection 5.3.3), where several nodes were removed and leaf nodes got attached to the root. The main cause is the massive presence of instances in Wikipedia categories. The way we propose to overcome this is to outsource DBTax to the DBpedia ontology community and let the community perform the alignment. Although the !Specific and !Broad questions seem complementary, our intention is to additionally identify average hierarchy paths, suitable for a general-purpose taxonomy.

5.5.3 A-Box Evaluation

Assessing the actual usability of our knowledge base has the highest priority in our work. Moreover, estimating the quality of the assigned types must cope with subjectivity issues, as emphasized in [118]. Therefore, we decided to adopt an online evaluation approach with common users. Under this


Table 5.2: T-Box evaluation results. C is the ratio of classes in the taxonomy and !Bre the ratio of classes that cannot be broken into other classes. V is the ratio of valid hierarchy paths, !S the ratio of paths that are not too specific, and !Bro the ratio of paths that are not too broad

Resource     C     !Bre    V     !S    !Bro
DBPO        .66    .67    .89    .97   .84
DBTax       .65    .76    .77    .98   .40
YAGO        .90    .38    .81    .55   .93
WiBi        .75    .38    .73    .41   .85
Wikidata    .19    .48    .85    .66   .88
Wikipedia   .81    .29    .66    .77   .78
Fleiss’ κ   .32    .23    .23    .06   .30

perspective, the major issue consists of gathering a sufficiently heterogeneous set of judgments. Micro-payment services represent a suitable solution, since they allow us to outsource the evaluation task to a worldwide massive community of paid workers. We leverage the CrowdFlower platform,15 which serves as a bridge to a plethora of crowdsourcing channels. In this way, we are able to simultaneously determine (a) the cognitive correctness of the assertions, and (b) the intuitiveness of the underlying semantics.

Task Anatomy

We randomly isolate 500 entities from those that do not have a type counterpart in DBpedia. Hence, we consider our evaluation set to be representative of the problem we are trying to tackle, namely to provide extensive classification coverage for DBpedia. While building our task, we aim at maximizing ease and atomicity. Workers are shown (1) a link to a Wikipedia page (i.e., the entity itself), labeled with the word this in the question “What is this?”, and (2) a type (i.e., the object of the instance-of relation, such as Band), rendered in the form “Is it a {type}?”. Then, they

15http://www.crowdflower.com

are asked to (1) visit the page, and (2) judge whether the type is correct, by answering a Yes/No question. For each entity, we elicit 5 judgments, thus gathering a total of 2,500. We prevent each worker from answering a question more than once by setting 500 maximum judgments per contributor and per IP. Finally, we ensure that all countries are allowed to work on our task and set the payment per page to $.03, where a page contains 5 entities. A cheating check mechanism is implemented via test questions, for which we supply the correct answer in advance. If a worker's accuracy on test questions falls below a given threshold (80% in our case), he or she is banned and his or her untrusted judgments are automatically discarded.

Table 5.3: Comparative A-Box evaluation on 500 randomly selected entities with no type coverage in DBpedia. ♠ indicates statistically significant difference with p < .0005 using χ2 test, between DBTax and the marked resources

Resource     P      R      F1     Agr    Untrusted
DBTax       .744   1      .853   .857      518
MENTA       .793   .589♠  .675   .826    1,093
SDType      .924   .098♠  .178   .899    1,723
YAGO        .461♠  .727   .565   .868    1,358
WiBi        .858   .597   .704♠  .924    2,075
Wikidata    .808   .982   .886   .913    1,847

Discussion

CrowdFlower provides a full report with detailed information for every single judgment made on the platform. For each question, an agreement score computed via majority vote weighted by worker trust is also included, and we calculate the average among the whole evaluation set. Table 5.3 displays the results obtained by processing the report. We compute precision as the ratio between positive answers and the total amount of answers, and


recall as the ratio between positive answers and the sum of positive answers with the untyped entities (multiplied by 5 missing judgments). First, we notice that all resources are affected by recall issues, since they lack type information, while our approach is always able to assign a type. This corroborates our findings on type coverage as per Table 5.1, where our system almost achieves 100%, in strong contrast to the other resources. To our surprise, DBTax also remarkably outperforms YAGO in terms of precision (validated by a statistical significance test), while the other resources generally behave better, although at a high recall cost. In a nutshell, DBTax scores satisfactorily high precision while reaching full recall. Via this trade-off, it achieves the best F1 value compared to the automatically generated resources. Wikidata obtains the highest F1 overall, but we believe this might be due to the heavy manual curation efforts of millions of human contributors.16
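As a minimal illustration of these definitions, the sketch below computes the three scores from per-resource counts of the report; the function and argument names are assumptions, not part of the evaluation code. The factor 5 reflects the five judgments elicited per entity.

def precision_recall_f1(positive, total_answers, untyped_entities):
    precision = positive / total_answers
    # Each untyped entity contributes 5 "missing" judgments to the denominator.
    recall = positive / (positive + 5 * untyped_entities)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1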

Given similar agreement values (cf. the Agr column), the number of untrusted judgments may be viewed as a further indicator of the overall question ambiguity. In fact, we tried to maximize objectivity and simplicity when choosing the test questions. However, it is known that the choice of taxonomical terms is always controversial, even for handcrafted taxonomies. Since the entities are identical in all the experiments, we can infer that the number of workers who missed the tests is directly influenced by the type ambiguity, which is the only variable parameter. In light of the tangible discrepancy between the untrusted judgment values, we claim that DBTax is much more intuitive from a cognitive ergonomics perspective, even for common worldwide end users.

16https://www.wikidata.org/wiki/Special:Statistics


5.6 Access and Sustainability

DBTax datasets will be included in the next and all subsequent official DBpedia releases. Within the release, it will serve as a complementary set of A- and T-Box statements to structure DBpedia resources. Thanks to the natural mapping to DBPO, an A-Box subset containing DBPO type assertions only is made available as well.17 The first DBpedia release (v. 2015A) that will include this dataset is due in mid-2015. Since DBpedia is a pioneer in adopting and creating best practices for RDF publishing, being incorporated into its workflow guarantees regular updates. Long-term availability will be ensured through the DBpedia Association and the Leipzig Computing Data Center. Until DBTax is served by the regular DBpedia releases, the dataset is hosted at the Italian DBpedia chapter.18 Moreover, it is registered on DataHub19 and VOID metadata20 is provided. Once DBTax is part of the official DBpedia releases, it will benefit from the same user and developer communities, as well as from the same support infrastructure.

5.7 Conclusion

DBTax is the outcome of a completely data-driven approach to convert the chaotic Wikipedia category system into an extensive general-purpose taxonomy. As a result of our four-step processing pipeline, we generated a hierarchy of 1,902 classes and automatically assigned types to roughly 4.2 million DBpedia resources. Thus, we provide a significant coverage leap, as opposed to DBpedia (with only 2.2 million typed resources) and to related automatic approaches. Moreover, online evaluations in a crowdsourcing

17http://it.dbpedia.org/downloads/dbtax/A-Box-dbpo.nt.bz2
18http://it.dbpedia.org/downloads/dbtax/
19http://datahub.io/dataset/dbpedia-dbtax
20http://it.dbpedia.org/downloads/dbtax/void.ttl


environment demonstrate that DBTax is not only comparable to the manually curated DBpedia ontology (DBPO) in terms of taxonomical structure, but is also outstandingly intuitive for common end users, while achieving the best precision and recall trade-off. DBTax is currently deployed in the Italian DBpedia chapter SPARQL endpoint21 and will be included in all future DBpedia releases. We envision DBTax to serve as a middle ground between DBPO and YAGO, as we argue that DBPO is rather limiting and YAGO far too large for real-world use cases.

21http://it.dbpedia.org/2015/02/dbpedia-italiana-release-3-4-wikidata-e-dbtax/?lang=en

Chapter 6

Application: Knowledge Base-driven Recommender Systems

6.1 Introduction

Recommender systems try to tackle the problem of information overload by offering personalized suggestions. Today they play a crucial role in several applications, ranging from e-commerce to news portals, all the way to enterprise information management systems. While their effectiveness is well established and their use is widespread, we aim at investigating the role of large-scale, richly structured knowledge bases in the recommendation process. News recommendation is a real-world application of such systems and is growing as fast as the online news reading practice: it is estimated that, in May 2010, 57% of U.S. Internet users consumed online news by visiting news portals [76]. Recently, online news consumers seem to have changed the way they access news portals: “just a few years ago, most people arrived at our site by typing in the website address. (...) Today the picture is very different. Fewer than 50% of the 8 million+ visitors to the News website every day see our front page and the rest arrive directly at a story”, a product manager of the BBC News website affirms,1 indicating

1http://www.bbc.co.uk/blogs/bbcinternet/2012/03/bbc_news_facebook_app.html


the need for news information filtering tools. The online reading practice leads to the so-called post-click news recommendation problem: when a user has clicked on a news link and is reading an article, he or she is likely to be interested in other related articles. This is still a typical editor's task: an expert manually looks for relevant content and builds a recommendation set of links, which will be displayed below or next to the current article. The primary aim is to keep users navigating on the visited portal. News recommender systems attempt to automate such a task. Current strategies can be clustered into 3 main categories [65], namely (a) collaborative filtering, (b) content-based recommendation, and (c) knowledge-based recommendation. (a) focuses on the similarities between users of a service, thus relying on user profile data. (b) leverages term-driven information retrieval techniques to compute similarities between items. (c) mines external data to enrich item descriptions. In this chapter, we propose a novel news recommendation strategy, which leverages both NLP techniques and semantically structured data. We show that entity linking tools can be coupled with existing knowledge bases in order to compute unexpected suggestions. Such knowledge bases are used to discover meaningful relations between entities. As a preliminary work to assess the validity of our approach, we focus on a celebrity gossip use case and consume data from the TMZ news portal and Freebase.2 For instance, given a TMZ article on Michael Jackson, our strategy is able to detect from Freebase that Michael Jackson (a) is a dead celebrity who had drug problems and (b) dated Brooke Shields, thus suggesting other TMZ articles on Amy Winehouse, Kurt Cobain (other dead celebrities who had drug problems) and Brooke Shields. We investigate whether user attention can be attracted via specific explanations, which clarify why a

2http://www.tmz.com, http://www.freebase.com/

given recommendation set is proposed. Such explanations are built on top of the entity relations. Finally, we conducted an online evaluation with real users. We outsourced a set of experiments to the community of paid workers from Amazon's Mechanical Turk (AMT) crowdsourcing service.3 The collected results confirm the effectiveness of our approach. Our primary aim is to attract the attention of a generic user, since post-click news recommendation can generally rely on a single-click user profile only. Therefore, we are set apart from most traditional recommender systems with respect to three main features:

1. User agnosticity: in traditional systems, user interests are deduced from user profile data and contribute to the quality of recommendations. However, collecting explicit feedback is a costly task, as it requires motivated users. Our approach gives low priority to user profiles.

2. Unexpectedness: similarity, novelty and coherence are key components for satisfactory news recommendations [76]. Content-based strategies tend to propose too similar items and create an ‘already seen’ sensation. We believe entity relations discovery can augment both novelty and coherence, thus leading to unexpected suggestions.

3. Specific explanation: in news web portals, generic sentences such as Related stories or See also are typically shown together with the recommendation set. We expect that more specific sentences can improve the trustworthiness of the system.

6.2 Approach

Our strategy merges content-based and knowledge-based approaches and is defined as a hybrid entity-oriented recommendation strategy enhanced by

3https://www.mturk.com/mturk/welcome


human-readable explanations. Given a source article from a news portal, we recommend other articles from the portal archive, namely the corpus, by leveraging both entity linking techniques and knowledge extraction from semantically structured knowledge bases. Specifically, we gathered a celebrity gossip corpus from TMZ and chose Freebase as the knowledge base.

We consider both the corpus and the knowledge base as a unique object, namely a dataspace, which results from the integration of heterogeneous data sources. Each data source is converted into an RDF graph and becomes an element of the dataspace. The dataspace can then be queried in order to retrieve sets of recommendations. A semantic recommender exploits SPARQL graph navigation capabilities to output recommendation sets. Each recommender is built on top of a concept, e.g., substance abuse.

The entity linking step in the corpus processing phase enables the detection of both real-world entities and encyclopedic concepts. We compute concept statistics on the whole corpus and assume that the most frequent ones are likely to generate interesting recommendations. A mapping between corpus concepts and meaningful relations of the knowledge base allows the creation of recommenders. Table 6.1 shows the TMZ-to-Freebase n-ary concept mapping we manually built. Each Freebase value represents the starting point for the construction of a recommender, while the string after the last dot becomes the name of the recommender, e.g., parents.
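A small sketch of how recommenders could be derived from the manual mapping of Table 6.1 follows; the dictionary below reproduces only a few rows and the naming rule (last dot-separated segment), while everything else is an illustrative assumption.

TMZ_TO_FREEBASE = {
    "Family": ["people.person.parents", "people.person.sibling_s",
               "people.person.children", "people.person.spouse_s"],
    "Intimate relationship": ["celebrities.celebrity.sexual_relationships"],
    "Substance abuse": ["celebrities.celebrity.substance_abuse_problems"],
}

def recommender_name(freebase_property):
    # e.g. "people.person.parents" -> "parents"
    return freebase_property.split(".")[-1]

recommenders = {recommender_name(prop): prop
                for props in TMZ_TO_FREEBASE.values() for prop in props}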

Given an entity of the source article, a name of a recommender and an entity contained in the recommendation sets, we are able to construct a specific explanation. Ultimately, a ranking of all the recommendation sets produces the final top-N suggestions output.


Table 6.1: TMZ-to-Freebase mapping

TMZ                      Freebase
Family                   people.person.{parents, sibling_s, children, spouse_s}
Intimate relationship    celebrities.celebrity.sexual_relationships
Dating                   base.popstra.celebrity.dated
Ex (relationship)        base.popstra.celebrity.breakup
Net worth                celebrities.celebrity.net_worth
Substance abuse          celebrities.celebrity.substance_abuse_problems
Conviction               base.crime.convicted_criminal
Court                    law.court.legal_cases
Arrest                   base.popstra.celebrity.{arrest, prison_time}
Legal case               law.legal_case.subject
Criminal charge          celebrities.celebrity.legal_entanglements
Judge                    law.judge
Death                    people.deceased_person
Television program       tv.tv_program

6.3 System Architecture

Figure 6.1 describes the general system workflow. The major phases are (a) corpus processing, (b) knowledge base processing, (c) dataspace querying and (d) recommendation ranking.

TMZ Processing Pipeline.

Given as input a set of TMZ articles, we output an RDF graph and load it into the dataspace. Corpus documents are harvested via a subscription to the TMZ RSS feed. The RSS feed returns semi-structured XML documents. A cleansing script extracts raw text from each XML document. The entity linking step exploits The Wiki Machine,4 a state-of-the-art [82] machine learning system designed for linking text to Wikipedia, based on a word sense disambiguation algorithm [54]. For each raw text document, real-world entities such as persons, locations and organizations are recognized, as well as encyclopedic concepts. This enables (a) the assignment of a

4http://thewikimachine.fbk.eu


Figure 6.1: High level system workflow

unique identifier, namely a DBpedia URI, to each annotation and (b) the choice of the top corpus concepts for recommender building purposes. The Wiki Machine takes a plain text as input and produces an RDFa document.5 The extracted terms are assigned an rdf:type, namely NAM for real-world entities or NOM for encyclopedic concepts. The hasLink property connects the terms to the article URL they belong to, thus enabling the computation of the recommendation set. Other metadata, such as the link to the corresponding Wikipedia page and the annotation confidence score, are also expressed. RDFa documents are converted into RDF data via the Any23

5The full corpus of TMZ RDFa documents is available at http://bit.ly/QLph9B

library.6 RDF data is loaded into a Virtuoso7 triple store instance, which serves the dataspace for querying.

Freebase Processing Pipeline.

Freebase provides exhaustive granularity for several domains, especially for celebrities. Given the size of the knowledge base, we avoid loading it in full, because of the severe performance issues we encountered. Consequently, we select meaningful slices corresponding to the corpus domains, e.g., celebrities and people. A domain-dependent subset is produced via a filter written in Java, converted into RDF, and finally loaded into a Virtuoso triple store instance.

6.3.1 Querying the Dataspace

A recommender performs a join between an entity belonging to the TMZ graph and the corresponding entity belonging to the Freebase graph. TMZ entities are identified by a DBpedia URI, which differs from the Freebase one. Therefore, we exploit sameAs links between DBpedia and Freebase URIs. Recommenders are divided into two categories, namely (a) entity-driven and (b) property-driven.8 For each detected entity of the source article, we run Freebase schema inspection queries9 and retrieve its types and properties. Thus, we are able to recognize which recommenders can be triggered for a given entity. Building a recommender requires (a) knowledge of the relevant Freebase schema parts in order to properly browse its graph and (b) a sufficiently expressive RDFa model for named entity and link retrieval. The NAM type and the hasLink property provide such expressivity.

6http://incubator.apache.org/any23/
7http://virtuoso.openlinksw.com/
8The full sets are available at http://bit.ly/MWGu06 and http://bit.ly/MWGsW3
9Available at http://bit.ly/MVGVtE


Entity-Driven Recommenders.

The queries behind entity-driven recommenders contain an %entity% parameter that must be programmatically filled by an entity belonging to the source article. For instance, given an article in which Jessica Simpson is detected and triggers the sexual relationships recommender, we are able to return all the corpus articles (if any) that mention entities who had sexual relationships with her, e.g., John Mayer. To avoid running empty-result recommenders, we built a set of ASK queries,10 which check if recommendation data exists for a given entity. The sexual relationships query follows:

PREFIX fb:
PREFIX twm:

SELECT DISTINCT ?had_relationship_with ?link
WHERE {
  owl:sameAs ?fb_entity .
  ?fb_entity fb:celebrities.celebrity.sexual_relationships ?fb_sexual_rel .
  ?fb_sexual_rel fb:celebrities.romantic_relationship.celebrity ?fb_celeb .
  ?fb_celeb fb:type.object.name ?had_relationship_with .
  ?dbp_celeb owl:sameAs ?fb_celeb ;
             a twm:NAM ;
             twm:hasLink ?link ;
             twm:hasConfidence ?conf .
  FILTER (?fb_entity != ?fb_celeb) .
  FILTER (lang(?had_relationship_with) = 'en') .
}
ORDER BY DESC (?conf)

Property-Driven Recommenders.

After the schema inspection step, an entity of the source article can directly trigger one of these recommenders if it contains the corresponding property. Property-driven queries return articles that mention entities who share the same property. Hence, they do not require a parameter to be filled. For instance, given an article in which Lindsay Lohan is detected and the property legal entanglements is identified during the schema inspection step, we can suggest other articles on people who had legal entanglements, e.g., Britney Spears.

10Available at http://bit.ly/NDNORH


Building Explanations.

Specific explanations are handcrafted from (s, r, o) triples, where s is a subject entity extracted from the source article, r is the relation expressed by the triggered recommender, and o is an object entity for which the recommendation set is computed. Therefore, we are able to construct different explanations depending on the elements we use. For instance, (a) s, r, o yields: Jessica Simpson had sexual relationships with John Mayer. Read more about him. (b) s, r yields: Read more about Jessica Simpson's sexual relationships. (c) r, o yields: Read more about her sexual relationships with John Mayer.
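A sketch of the three explanation templates follows, assuming the recommender name maps to a human-readable relation phrase; the phrasing dictionary and the fixed pronouns are simplifying assumptions, not the actual generation code.

RELATION_PHRASES = {"sexual_relationships": "had sexual relationships with"}

def build_explanation(subject, recommender, obj, mode="sro"):
    relation = RELATION_PHRASES[recommender]
    topic = recommender.replace("_", " ")
    if mode == "sro":
        return f"{subject} {relation} {obj}. Read more about him."
    if mode == "sr":
        return f"Read more about {subject}'s {topic}."
    if mode == "ro":
        return f"Read more about her {topic} with {obj}."
    raise ValueError(mode)

# e.g. build_explanation("Jessica Simpson", "sexual_relationships", "John Mayer")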

6.3.2 Ranking the Recommendation Sets

Since recommendations originate from database queries, they are unranked and in some cases too numerous. To overcome the problem, we implemented an information retrieval ranking algorithm, which allows us to provide top-N recommendations. The bag-of-words (BOW) cosine similarity function is known to perform effectively for topic-related suggestions [95]. However, it does not take into account language variability. Consequently, we also leverage a latent semantic analysis (LSA) algorithm.11 The final score of each corpus article is the sum of the BOW and LSA scores and is assigned to the article URL. Afterwards, we run all the recommenders and intersect their result sets with the BOW+LSA ranking of the whole corpus, thus producing a so-called semantic ranking. This represents our final output, which consists of a ranked set of article URLs associated with the corresponding recommender names.
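A minimal sketch of this semantic ranking step is given below, assuming BOW and LSA similarity scores against the source article have already been computed per corpus article URL; the score combination and the intersection mirror the description above, everything else is an assumption.

def semantic_ranking(bow_scores, lsa_scores, recommender_results, top_n=5):
    """bow_scores / lsa_scores: dict url -> float;
    recommender_results: dict recommender_name -> set of urls."""
    combined = {url: bow_scores.get(url, 0.0) + lsa_scores.get(url, 0.0)
                for url in set(bow_scores) | set(lsa_scores)}

    ranked = []
    for name, urls in recommender_results.items():
        # Keep only the articles returned by the recommender, scored by BOW+LSA.
        for url in urls & set(combined):
            ranked.append((url, name, combined[url]))

    # Final output: top-N article URLs with their recommender names.
    ranked.sort(key=lambda item: item[2], reverse=True)
    return ranked[:top_n]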

11http://hlt.fbk.eu/en/technology/jlsi


6.4 Evaluation

The assessment of end user satisfaction has high priority in our work. Following Hayes et al. [58], we consequently decided to adopt an online evaluation approach with real users. In this scenario, the major issue consists of gathering a sufficiently large group of people who are willing to evaluate our systems. Crowdsourcing services provide a solution to the problem, as they allow us to outsource the evaluation task to an already available massive community of paid workers. To the best of our knowledge, no news recommender systems have been evaluated with crowdsourcing services so far. We set up an experimental evaluation framework for AMT, via the CrowdFlower platform.12 A description of the mechanisms that regulate AMT is beyond the scope of this chapter: the reader may refer to [78] for a detailed analysis. Our primary aim is to demonstrate that evaluators generally prefer our recommendations. Thus, we need to put our strategy in competition with a baseline. We leveraged the already implemented BOW+LSA information retrieval ranking algorithm. In addition, we set two specific objectives, related to the specific explanation and unexpectedness assumptions outlined in Section 6.1: (a) confirm that a specific explanation attracts user attention better than a generic one; (b) check if the recommended items are interesting, even though they may appear unrelated and regardless of the kind of explanation provided. Quality control of the collected judgements is a key factor for the success of the experiments. The essential drawback of crowdsourcing services lies in the cheating risk: workers (from now on called turkers) are generally paid a few cents for tasks which may only need a single click to be completed. Hence, it is highly likely to collect data coming from random choices

12http://crowdflower.com/

130 6.4. EVALUATION that can heavily pollute the results. The issue is resolved by adding gold units, namely data for which the requester already knows the answer. If a turker misses too many gold answers within a given threshold, he or she will be flagged as untrusted and his or her judgments will be automatically discarded.

6.4.1 General Setting

Our evaluation framework is designed as follows: (a) the turker is invited to read a complete news article. (b) A set of recommender systems is displayed below the article. Each system consists of a natural language explanation and a news title recommendation. (c) The turker is asked to express a preference for the most attractive recommendation, namely the one he or she would click on in order to read the suggested article. A single experiment (or job) is composed of multiple data units. A unit contains the text of the article and the set of explanation-recommendation pairs. Figure 6.2 shows a unit fragment of the actual web page that is given to a turker who accepted one of our evaluation jobs. Both the instructions and the question texts need to be carefully modeled, as they must mirror the main objective of the task and should not bias turkers' reaction. Since we aim at evaluating user attention attraction, we formulated them as per Figure 6.2.

6.4.2 Experiments

Table 6.2 provides an overview of our experimental environment. The parameters we have isolated for a single experiment are presented in Table 6.2a. On top of the possible variations, we built a set of nine experiments, which are described in Table 6.2b. We modeled two Q values, namely direct (as per Figure 6.2) and indirect (Which recommendation do you consider to be more trustworthy?), to monitor a possible alteration of turkers' reaction.


Figure 6.2: Web interface of an evaluation job unit


Table 6.2: Experiments overview

(a) Parameters

Parameter   Values
Q           D, I
A           2, 5
Exp         GS, G
SExp        SRO, SR, RO, R
Rec         B, S, F

(b) Configuration

Name                     Q   A   Exp   SExp   Rec
Pilot                    D   2   GS    SRO    B, S
Same explanation         D   2   G     None   B, S
4 generic + 1 specific   D   5   GS    SRO    B, S, F
5 generic                D   5   G     None   B, S, F
Same recommendation      D   2   GS    SRO    S
Relation only            D   5   GS    R      B, S, F
Subject + relation       D   5   GS    SR     B, S, F
Object + relation        D   5   GS    RO     B, S, F
Indirect                 I   2   GS    SRO    B, S

Legend: Q = Question (D = Direct, I = Indirect); A = Answer (2 = Binary, 5 = 5 choices); Exp = Explanation (GS = Generic + specific, G = Generic only); SExp = Specific explanation (SRO = Subject + relation + object, SR = Subject + relation, RO = Relation + object, R = Relation only); Rec = Recommendation (B = Baseline, S = Semantic, F = Fake)

Experiments having A = 5 aim at decreasing the probability that a turker gets trusted by chance, because he or she accidentally selected correct gold answers. They have an additional F value in the Rec parameter, as we randomly extracted 3 fake recommendations per unit from a file with more than 2 million news titles. However, such an architectural choice generated noisy results, since it occurred that some fake titles were selected.13 Exp is a key parameter, which allows us to check whether the presence or the absence of a specific explanation represents a discriminating factor. SExp is intended to measure the effectiveness of a specific explanation while reducing its complexity. Each job contains 8 regular + 2 gold units, namely 5 articles proposed twice, in combination with 2 significant (and possibly 3 fake) explanation-recommendation pairs.

13See Table 6.3 for further details.


The recommendation titles of the regular units are extracted from the top-2 links of the baseline and the semantic rankings. Gold is created by extracting the title from the last, i.e., least related, link of the baseline ranking and the top link of the semantic ranking, and assigning the correct answer to the latter. We collected a minimum of 10 valid judgments per unit and set the number of units per page to 3. Once the results were obtained, it frequently occurred that the actual number of judgments was higher than expected: depending on their accuracy in answering gold units, turkers switched from untrusted to trusted, thus adding free extra judgments. The proposed articles come from the TMZ website, which is well known in the United States. Therefore, we decided to gather evaluation data only from American turkers. The total cost of each experiment was $3.66. After visiting some news web portals, we chose the following generic explanations and randomly assigned them to both the baseline and the fake recommendations: (a) The most related story selected for you; (b) If you liked this article, you may also like; (c) Here for you the hottest story from a similar topic; (d) More on this story; (e) People who read this article, also read. Two regular units were removed from the relation only and the object + relation experiments: it was impossible to build specific explanations with an implicit subject or object, since the entities that triggered the recommendations differed from the main entity of the source article.

6.4.3 Results

Table 6.3 provides an aggregated view of the results obtained from the CrowdFlower platform.14 With respect to the absolute percentage values, we

14The complete set of full reports is available at http://bit.ly/MOrN30


Table 6.3: Absolute results per experiment. ♦, ♠ and ♣ respectively indicate statistical significance differences between baseline and semantic methods, with p < 0.05, p < 0.01 and p < 0.001

Experiment               Judgments   Fake %   Baseline %   Semantic %
Pilot 1                      82        0        40.24        59.76♦
Pilot 2                      80        0        32.5         67.5♠
Same explanation             80        0        48.75        51.25
4 generic + 1 specific       90        3.33     23.33        73.33♣
5 generic                    88       13.63     37.5         48.86
Same recommendation          86        0        36.04        63.96♠
Relation only                68       13.23     41.17        45.58
Indirect                     82        0        37.8         62.2♠
Subject + relation           86        8.13     41.86        50
Object + relation            68        5.88     41.17        52.94

first observe that our approach always outperformed the baseline. Furthermore, statistically significant differences emerge when a complete specific explanation is given. We ran the pilot experiment twice, on two separate days, and noticed an improvement. The indirect experiment only differs from the pilot in the question parameter and yielded similar results. The 4 generic + 1 specific experiment has the highest semantic percentage: this is expected behavior, since the presence of a single specific explanation against four generic ones is likely to bias turkers' reaction towards our approach. As the complexity of the specific explanation decreases, i.e., in the subject + relation, object + relation and relation only experiments, or when only generic explanations are presented, namely in the 5 generic and same explanation experiments, judgments towards our approach tend to decrease too. Hence, we infer the importance of providing specific explanations in order to attract user attention.


6.4.4 Discussion

Experiments containing a specific explanation aim at assessing its attractive power (assumption 3). If we compare experiments which only differ in the Exp parameter, namely 4 generic + 1 specific versus 5 generic, and pilot 1-2 versus Same explanation, in the former ones turkers prefer our strategy with a statistically significant difference. Therefore, specific explanations are shown to enhance the trustworthiness of the system. The evaluation of the unexpectedness factor (assumption 2) boils down to checking whether turkers privilege the novelty of a recommendation or its similarity to the source article. In experiments including only generic explanations, namely Same explanation and 5 generic, we noticed the following: (a) no statistically significant differences exist between the strategies; (b) when the baseline returns articles that are unrelated to the topic or the entity of the source article, turkers prefer our strategy, and vice versa. Hence, we argue that users tend to privilege similarity if they are given a generic explanation. On the other hand, when the baseline strategy suggests a clearly related article and a specific explanation is provided, turkers tend to choose our strategy even if it suggests an apparently unrelated article. This is a first proof of the unexpectedness factor: users are attracted by the specific explanation and are eager to read an unexpected article rather than another article on the same topic/entity.

6.5 Conclusion

In this chapter, we presented a novel recommendation strategy leveraging entity linking techniques over unstructured text and knowledge extraction from structured knowledge bases. On top of it, we built hybrid entity-oriented recommender systems for news filtering and post-click news recommendation. We argued that entity relation discovery leads to unexpected suggestions

and specific explanations, thus attracting user attention. The adopted online evaluation approach via crowdsourcing services assessed the validity of our systems. A demo prototype consumes Freebase data to recommend TMZ celebrity gossip articles.


Chapter 7

Conclusion

In a continuously evolving World Wide Web (WWW), where machines stand beside humans as content consumers, there is an ever-growing need for shaping information in such a way that it can be understood by both. In this context, Knowledge Bases (KBs) play a critical role: they supply structured facts about diverse domains and attempt to consistently store them in formal classification schemata, or ontologies. In a world burdened by information overload, this paradigm would allow the construction of intelligent automated agents, which can interpret WWW content and directly satisfy the needs of human users. This is already becoming a reality, as the largest Web companies have adopted KB-driven solutions to power their platforms, the most renowned being Google's Knowledge Graph,1 Facebook's Graph Search,2 and Microsoft's Satori.3 Furthermore, KBs dispense the fuel to run a wide range of applications, from cognitive computing systems like IBM Watson4 to knowledge engines such as Wolfram Alpha,5 all the way to personal assistants, e.g., Apple's Siri.6

1https://www.google.com/intl/en_us/insidesearch/features/search/knowledge.html
2http://www.facebook.com/about/graphsearch
3https://blogs.bing.com/search/2015/08/20/bing-announces-availability-of-the-knowledge-and-action-graph-api-for-developers/
4http://www.ibm.com/smarterplanet/us/en/ibmwatson/index.html
5http://www.wolframalpha.com/
6http://www.apple.com/ios/siri/

However, all the aforementioned systems require high-quality data in order to deliver the best answers and to avoid wrong ones. State-of-the-art KBs are known to rely on a massive amount of (probably tedious) manual curation labor: for instance, Freebase has built an entire human computation engine [69] to augment the value of its data; Wikidata [125] is itself conceived as a fully collaboratively edited resource, following the Wiki fashion. To a certain ironic extent, this sounds like “Artificial artificial intelligence”, just to cite the motto of a notable crowdsourcing platform, i.e., Amazon's Mechanical Turk.7 While we argue that human intervention cannot be completely eliminated, we concentrate on minimizing its necessity. Therefore, we focus on DBpedia [73], a KB which is devoted to the development of an automatic extraction framework over Wikipedia content. The core purpose of this thesis is to improve the DBpedia data quality by means of Natural Language Processing (NLP) techniques, which aim at reducing manual curation. For that reason, we investigated issues related to the DBpedia classification system, i.e., the ontology (DBPO). More specifically, we addressed the problems highlighted in Section 1.2, namely its unbalanced nature and its lack of coverage. The results of our research are incorporated as a tangible proof of work in the Italian DBpedia chapter and are summarized in the following sections, in which we include specific prospects for future work.

StrepHit: a Project Funded by the Wikimedia Foundation Besides DBpedia, we have also concentrated our latest efforts on improving the data quality of Wikidata. We would like to recall that our StrepHit project proposal won the largest Wikimedia Foundation Individual Engagement Grant of the 2015 round 2 call. This allowed us to conduct further

7https://www.mturk.com/mturk/welcome

research based on Contribution 2 (cf. Section 7.3), under the umbrella of one of the most wide-reaching non-governmental organizations in the world. StrepHit originates from the efforts described in Chapter 4 and pursues the vision of Intelligent Agents: specifically, it targets the creation of a Web-reading Agent that would validate Wikidata content via facts extracted from third-party Web sources. The selected project proposal is available at https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References.

7.1 The Italian DBpedia Chapter

The core practical outcome of this work is the foundation, the development, and the maintenance of the Italian DBpedia chapter, which is online at http://it.dbpedia.org. We underline once again the role of the author as (1) a member of the DBpedia Association board of trustees,8 and (2) an organization administrator for the Google Summer of Code program, thus also providing financial resources to the Association. Going back to its foundation in 2012, which was also made possible thanks to the collaboration with the startup SpazioDati,9 we detail below the principal achievements of the current Italian DBpedia chapter deployment.

1. Best Dataset Award at Apps4Italy: we won the first prize in the ‘datasets’ category at the Apps4Italy competition. This achievement was reported by the newspaper La Repubblica10 and informally summarized in the following blog post: http://it.dbpedia.org/2012/05/dbpedia-italiana-premiata-apps4italy/ (in Italian);

8https://docs.google.com/document/d/1pchRPLtQwO3GH49cF7GB33srRn1p2M3QW8Jtu5b5ZwE/edit
9https://spaziodati.eu
10http://www.repubblica.it/tecnologia/2012/05/19/news/premio_app4italy-35459551/


2. the Mapping Sprint: we first conducted a teaching activity via the FBK Junior program11 and trained a team of high school students in the DBpedia ontology (DBPO) manual mapping workflow.12 Then, we organized and held a hackathon at the Bertrand Russell high school,13 where the team members served as mentors. This achievement was informally summarized in the following blog post: http://it.dbpedia.org/2014/04/grande-successo-per-il-primo-mapping-sprint/?lang=en;

3. Airpedia classes integration: the automatic DBPO classes mapping approach presented in [4] is integrated. This achievement was informally summarized in the following blog post: http://it.dbpedia.org/2013/06/nuova-release-dbpedia-3-2-airpedia-wikidata/?lang=en;

4. LodView data visualization: LodView14 becomes the official data visualization tool. This achievement was informally summarized in the following blog post: http://it.dbpedia.org/2015/06/lodview-nuova-veste-grafica-per-i-dati-della-dbpedia-italiana-2/?lang=en;

5. the Italian soccer dataset: the results of Contribution 2 (Chapter 4), part of the Google Summer of Code 2015 project “Fact Extraction from Wikipedia”, are integrated. This achievement was informally summarized in the following blog post: http://it.dbpedia.org/2015/09/meno-chiacchiere-piu-fatti-una-marea-di-nuovi-dati-estratti-dal-testo-di-wikipedia/?lang=en;

6. DBTax integration: the results of Contribution 3 (Chapter 5), part of the Google Summer of Code 2013 project "Type Inference to Extend Coverage", are integrated. This achievement was informally summarized in the following blog post: http://it.dbpedia.org/2015/02/dbpedia-italiana-release-3-4-wikidata-e-dbtax/?lang=en;

11 http://airt.fbk.eu/it/relazioni-con-le-giovani-generazioni
12 http://mappings.dbpedia.org
13 http://www.liceorussell.eu
14 http://lodview.it/

7. stakeholders: we list below those entities that have explicitly stated their use of the Italian DBpedia chapter.

• Digital libraries:
  – University of Urbino digital library;15
  – National Central Library of Florence, Nuovo Soggettario Thesaurus.16
• Data-driven journalism:
  – Focus magazine, article "Le misure del calcio";17
  – Inchiesta journal, dossier on Expo "Milano, oggi domani dopodomani".18

7.2 Contribution 1: Crowdsourced Frame Annotation

In Chapter 3, we proposed a methodology to perform full Frame Semantics annotation in a crowdsourcing environment. Our core research contribution lies in the reduction to a single-step, bottom-up task, as opposed to the usual workflow consisting of two steps, one for frame disambiguation, and one for Frame Elements (FEs) assignment. The former is intrinsically bound to the latter, since it can be fulfilled only if annotators decide on the

15 http://opac.uniurb.it/SebinaOpac/.do#0
16 http://thes.bncf.firenze.sbn.it/
17 http://www.focus.it/temi/le-misure-del-calcio
18 http://www.inchiestaonline.it/archivio/e-uscito-il-numero-188-di-inchiesta-aprile-giugno-2015/, http://milano-odd.it/?p=594


frame by implicitly matching its FEs with the participants represented in the sentence. This implies that the cognitive process already involves the identification of FEs in the first step, even if they are explicitly labeled only in the second step. Consequently, we also claim that our novel annotation approach better adheres to the original linguistic theory illustrated in [47]. We carried out a set of experiments via the CrowdFlower platform, following two strategies: one with manually simplified FE definitions based on the FrameNet resource, and one with automatically derived suggestions leveraging DBPO class labels. The collected results first demonstrate that our 1-step approach is not only cheaper than the 2-step one in terms of execution time (-24%), but also yields more accurate annotations (+15%), although at a higher financial cost (+84%), due to the higher number of questions that need to be asked. Moreover, we completely substitute the confusion-prone FE definitions with automatic hints extracted from DBpedia, and further improve the overall annotation accuracy (+11.4%) while dramatically decreasing the execution time (-51%). Future work will include the refinement of the frame assignment strategy. In fact, we do not take into account the case of conflicting FE annotations in cross-frame units. Hence, we need a confidence score to determine which frame emerges if workers selected contradictory answers in a subset of cross-frame FE definitions. Secondly, an ad-hoc strategy for the extraction of semantic types needs to be evaluated, in order to provide workers with suggestions that are dynamically derived from each given sentence rather than from statistics. Furthermore, we plan to cluster similar semantic types with respect to the meaning they convey and to their frequency, e.g., Place and Location Underspecified. Finally, the overall effectiveness of our approach depends both on the performance of the entity linking system and on the coverage of the knowledge base. Hence, long-term research will focus on enhancing The Wiki Machine precision and recall, and on extending DBpedia type coverage.
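As a purely illustrative sketch of how such a confidence score could be derived (the aggregation rule and the function name are our own assumptions, not an implemented component), a relative-majority vote over the workers' cross-frame FE answers would already yield a usable signal:

```python
from collections import Counter

def frame_confidence(worker_answers):
    """Aggregate crowd answers over cross-frame FE questions for one sentence.

    worker_answers: list of frame labels, one per answered FE question.
    Returns the emerging frame and its agreement ratio in [0, 1].
    """
    counts = Counter(worker_answers)
    frame, votes = counts.most_common(1)[0]
    return frame, votes / len(worker_answers)

# Contradictory answers on a cross-frame unit: Activity emerges with confidence 0.67
print(frame_confidence(['Activity', 'Activity', 'Motion']))
```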

7.3 Contribution 2: Properties Population via Relation Extraction

In Chapter 4, we designed a Relation Extraction pipeline that implements the crowdsourcing methodology of Contribution 1 and enriches our target KB, DBpedia, with properties extracted from Wikipedia free text. Four features set us apart from related state-of-the-art systems:

1. N-ary Relation Extraction enabled by Frame Semantics, whereas standard approaches are binary;

2. simultaneous T-Box (i.e., new properties) and A-Box (i.e., new assertions) DBpedia augmentation;

3. economical NLP technology, requiring POS-tagging only, instead of more complex layers, e.g., constituency parsing;

4. completely supervised yet low-cost learning paradigm thanks to crowdsourcing, in contrast with noisy unsupervised or distantly supervised techniques.

We assessed the effectiveness of our system through the soccer use case in Italian. Given as input circa 52,000 Italian Wikipedia articles describing soccer players, we produced a dataset of more than 210,000 triples with an average performance of 78.5% F1. We estimated the target KB (i.e., the Italian DBpedia chapter) coverage improvement in two ways: first, we calculated a relative gain of +96.8% new confident A-Box assertions with respect to pre-existing ones. In addition, 50% of the frames and 80% of the FEs in the use case frame set represent novel T-Box properties. These T-Box results are promising, although we recognize that the size of the analysed frame set is still too small for a statistical significance validation.
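For clarity, and under the assumption (ours) that the relative gain is computed as the ratio between the newly produced confident assertions and the pre-existing ones, the figure above corresponds to:

\[ \mathit{relative\ gain} = \frac{|\,\text{new confident A-Box assertions}\,|}{|\,\text{pre-existing A-Box assertions}\,|} \times 100\% \approx +96.8\% \]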

The project has been carried out via the Google Summer of Code 2015 program,19 under the umbrella of the DBpedia Association: its codebase is available at https://github.com/dbpedia/fact-extractor and has attracted considerable interest in the open source landscape.

For general future work, we plan to scale up the implementation towards multilingual open information extraction, thus paving the way to (a) its full deployment into the DBpedia Extraction Framework, and to (b) a thorough referencing system for Wikidata.

We summarize below some fine-grained technical aspects that can be considered as improvements, inviting the reader to refer to Chapter 4 for a detailed description. We handled Lexical Units (LUs) consisting of one single token (unigrams), but we could attain more recall if we considered n-grams. Tagging n-grams with DBPO classes retrieved during the entity linking step may be an impactful additional feature to train the FE classifier. The gazetteer is currently run at the token level, but it may be more useful to run it over the whole input (i.e., the sentence). In order to reduce the noise in the training set, we plan to leverage a sentence splitter and extract 1-sentence examples only. Further, less strict evaluation experiments will take into account the classified n-grams instead of disambiguated links. Finally, the frame confidence could be included to further refine the final confidence score.
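As an illustration of the n-gram extension mentioned above (a sketch under our own assumptions, not part of the released codebase), multi-token LU candidates could be enumerated as follows:

```python
def ngram_candidates(tokens, max_n=3):
    """Enumerate contiguous n-gram LU candidates (n = 1..max_n) from a token list."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield ' '.join(tokens[i:i + n])

# Unigram and bigram candidates for a short Italian soccer sentence
print(list(ngram_candidates(['ha', 'esordito', 'in', 'Serie', 'A'], max_n=2)))
```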

19http://www.google-melange.com/gsoc/homepage/google/gsoc2015


7.4 Contribution 3: Classes Population via Taxonomy Learning

In Chapter 5, we illustrated a fully data-driven procedure to learn a wide-coverage general-purpose taxonomy from the Wikipedia category system, and to employ it for jointly enriching the DBpedia T-Box and A-Box with respect to classes and instance-of assertions. We estimated a remarkable coverage improvement (+93.7%) compared to the current DBpedia main classification system (i.e., DBPO), with 4.2 million versus only 2.2 million typed resources respectively. While we acknowledge that a considerable amount of related work has been conducted prior to ours, we argue that no focus has been accorded so far to (a) the actual usability of the resource, and (b) its integration into a well-established framework such as DBpedia. Therefore, we executed online crowdsourced evaluations with real-world non-expert users, which prove that our system is not only equivalent to the handcrafted DBPO with regard to its structure, but is also distinctively intuitive, while still outperforming analogous automatic efforts in terms of precision and recall trade-off. For future work, we plan to merge the T-Box into the DBpedia mappings wiki20 and allow the DBpedia community to further curate and organize it. We believe this will also cater for the broad hierarchy paths that resulted from the pruning steps. Furthermore, a word sense disambiguation technique is scheduled for implementation, in order to distinguish between homonymous classes. Since the A-Box may state multiple heterogeneous types for a resource (e.g., Elvis Presley is both a Singer and a Protestant), we plan to rank types according to their statistical relevance, such as the absolute frequency of instances. Finally, we expect to additionally

20http://mappings.dbpedia.org


exploit the Wikipedia category interlanguage links, in order to (a) produce multilingual labels for DBTax, (b) pinpoint additional classes that our process did not extract in English, and (c) deploy the approach to DBpedia language chapters besides English and Italian, at the price of excluding categories with no English counterpart.

7.5 Contribution 4: Application to Recommender Systems

In Chapter 6, we developed an innovative recommendation method that cooperatively exploits a KB and entity linking in order to deliver unusual, hence serendipitous, suggestions. A news recommender system is constructed upon it, which we believe serves as an end-user application demonstrating the potential use of KBs in a real-world setting. The engine is a hybrid between content-based and knowledge-based approaches: the former transforms an input corpus of documents into a structured dataset that integrates into the target KB as a fused queryable dataspace. Therefore, the discovery of relations between its entities enables both the delivery of unexpected recommendations and the generation of detailed explanations, thus being attractive to end users. We performed several online crowdsourced evaluation experiments that demonstrate the benefits of our strategy compared to a baseline, and are further supported by statistical significance. A use case prototype consumes data from Freebase and recommends TMZ21 celebrity gossip news articles. For our future work, we have set the following milestones. (a) Comparison with DBpedia: our use case leverages an off-the-shelf KB, which is heavily curated by hand. Hence, we plan to perform a comparative analysis by switching to a version of DBpedia that is automatically populated by

21http://www.tmz.com

our techniques. (b) Ecological evaluation: for the online evaluation, we used the CrowdFlower platform, which allowed us to build fast and cheap experiments. However, the collected judgments may be biased by the politeness effect of the economic reward and by the turkers' awareness of performing a question-answering task. Therefore, we intend to set up an ecological evaluation scenario, which simulates a fully real-world usage of our recommender systems and enables natural user reactions. We plan to adopt the Google AdWords22 approach proposed in [56]. (c) Methodology for building recommenders: currently, we have manually implemented a domain-specific list of recommenders, based on the most frequent corpus concepts. We plan to automate this process by extracting generic relations from Freebase via data analytics techniques. (d) Methodology for building specific explanations: explanations are naively mapped to the relations and the corresponding subject/object entities. How to automatically build linguistically correct sentences remains an open problem. (e) User profile construction: acquiring explicit and implicit user preferences can improve the quality of the recommendations.
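To make point (d) concrete, the current naive mapping can be pictured as follows; the toy KB fragment, the entity sets, and the function name are illustrative assumptions of ours rather than the deployed engine:

```python
def explain_recommendation(read_entities, candidate_entities, kb_edges):
    """Return (subject, relation, object) triples connecting the entities of a read
    article to those of a candidate article; each triple doubles as a naive
    textual explanation of why the candidate is recommended."""
    return [(s, r, o) for (s, r, o) in kb_edges
            if s in read_entities and o in candidate_entities]

# Toy celebrity-gossip KB fragment
kb = [('Kim', 'sibling of', 'Khloe'), ('Kim', 'spouse of', 'Kanye')]
print(explain_recommendation({'Kim'}, {'Kanye'}, kb))
# -> [('Kim', 'spouse of', 'Kanye')], i.e., "recommended because Kim is the spouse of Kanye"
```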

22http://adwords.google.com/


Chapter 8

Appendix: the StrepHit Project

This appendix contains the technical reports of StrepHit, the project funded by the Wikimedia Foundation through an Individual Engagement Grant (IEG).

8.1 Project Idea

StrepHit (pronounced "strep hit", meaning "Statement? repherence it!") is a Natural Language Processing pipeline that harvests structured data from raw text and produces Wikidata statements with reference URLs. Its datasets will feed the Primary Sources tool.1 In this way, we believe StrepHit will dramatically improve the data quality of Wikidata through a reference suggestion mechanism for statement validation, and will help Wikidata become the gold-standard hub of the Open Data landscape.

8.1.1 The Problem

The trustworthiness of Wikidata assertions plays the most crucial role in delivering a high-quality, reliable Knowledge Base: in order to assess their truth, assertions should be validated against third-party resources, and few

1https://www.wikidata.org/wiki/Wikidata:Primary_sources_tool


efforts have been carried out under this perspective. One form of validation can be achieved via references to external (i.e., non-wiki), authoritative sources. This has motivated the development of the primary sources tool: it will serve as a platform for users to either accept or reject new references and/or assertions coming from third-party datasets. We argue that there is a need for datasets which guarantee at least one reference for each assertion, and StrepHit is conceived to do so.

8.1.2 The Solution

StrepHit applies Natural Language Processing techniques to a selected corpus of authoritative Web sources in order to harvest structured facts. These will serve two purposes: to authenticate existing Wikidata statements, and ultimately to enrich them with references to such sources. More specifically, the solution is based on the following main steps:

1. Corpus-based relation discovery, as a completely data-driven approach to knowledge harvesting;

2. Linguistically-oriented fact extraction from reliable third-party Web sources.

The solution details are best explained through the use case shown below.

8.1.3 Use Case

Soccer is a widely attested domain in Wikidata: it counts a total of 188,085 items describing soccer-related entities,2 which is a significant portion

2 According to the following query: http://tools.wmflabs.org/autolist/autolist1.html?q=claim[31:(tree[1478437][][279])]%20or%20claim[31:(tree[15991303][][279])]%20or%20claim[31:(tree[18543742][][279])]%20or%20claim[106:628099]%20or%20claim[106:937857]


(around 1.27%) of the whole knowledge base. Moreover, those Items are generally very rich in terms of statements (cf. for instance the Germany national football team).3 On account of such observations, the soccer domain properly fits the main challenge of this proposal, namely to automatically validate Wikidata statements against a knowledge base built upon the text of third-party Web sources (from now on, the Web Sources Knowledge Base). The following list displays four example statements with no reference from the Germany national football team Wikidata Item, which can be validated by candidate statements extracted from the given references.

1. (Germany, participant of, Miracle of Cordoba)

• The Telegraph4
• "(...) The Miracle of Cordoba, when they eliminated Germany from the 1978 World Cup"
• (Germany, eliminated in, Miracle of Cordoba)

2. (Germany, team manager, )

• Encyclopædia Britannica5
• "In 1984 Beckenbauer was appointed manager of the West German team"
• (West German team, manager, Beckenbauer)

3. (Germany, inception, 1908)

• DFB6

3 https://www.wikidata.org/wiki/Q43310
4 http://www.telegraph.co.uk/sport/football/international/2304101/Euro-2008-Germany-end-Turkeys-fairytale.html
5 http://www.britannica.com/biography/Franz-Beckenbauer
6 http://www.dfb.de/en/national-teams/


• "The story of the DFB's national team began (...) on April 5th 1908"
• (DFB's national team, start, 1908)

4. (Germany, captain, Michael Ballack)

• Spiegel7
• "Michael Ballack, the captain of the German national football team"
• (German national football team, captain, Michael Ballack)

Proof of Work

The soccer use case has already been partially implemented: the prototype has yielded a small demonstrative dataset, namely strephit-soccer, which has been uploaded to the primary sources tool.

8.2 Project Goals

The technical goals of this project are as follows:

1. to identify a set of authoritative third-party Web sources and to harvest the Web Sources Corpus;

2. to recognize important relations between entities in the corpus via lexicographical and statistical analysis;

3. to implement the StrepHit Natural Language Processing pipeline, serving in all respects as an open source framework that maximizes reusability;

7http://www.spiegel.de/international/germany/ankle-injury-german-team-captain- michael-ballack-ruled-out-of-world-cup-a-695164.html


4. to build the Web Sources Knowledge Base for the validation and enrichment of Wikidata statements;

5. to deploy a stable system that automatically suggests references given a Wikidata statement.

Community Outreach

The target audience is represented by several communities: each one will play a key role at different phases of the project, and will be attracted accordingly. We list them below, in descending order of specificity:

• Wikidata users, involved as data curators;

• Wikipedia users and librarians, involved as consultants for the identification of reliable Web sources;8

• technical contributors (i.e., Natural Language Processing developers and researchers), involved through standard open source and social coding practices;

• data donors, encouraged by the availability of a unified platform to push their datasets into Wikidata.

8.3 Project Plan

8.3.1 Implementation Details

We scale up the approach described in Chapter 4: we take as input a collection of documents from a set of Web sources (i.e., the corpus) and

8https://en.wikipedia.org/wiki/Wikipedia:Identifying_reliable_sources


output a structured knowledge base composed of machine-readable statements (according to the Wikibase data model terminology).9 The workflow is depicted in Figure 8.1.
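Purely as an orientation aid, the sketch below mirrors the high-level flow of Figure 8.1; the gazetteer, the lexical database, the identifiers, and all function names are toy placeholders of ours, not the actual StrepHit API:

```python
import re

# Toy resources standing in for the entity linker and the supervised classifier.
GAZETTEER = {'Beckenbauer': 'Q_beckenbauer', 'West Germany': 'Q_west_germany'}  # fake QIDs
LEXICAL_DB = {'appointed': 'P_manager'}  # verbal LU -> property (fake identifier)

def split_sentences(text):
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def link_entities(sentence):
    return {name: qid for name, qid in GAZETTEER.items() if name in sentence}

def classify(sentence, entities):
    """Trigger a relation when a known verbal LU and at least two entities co-occur."""
    for verb, prop in LEXICAL_DB.items():
        if verb in sentence and len(entities) >= 2:
            subject, obj = list(entities.values())[:2]
            return (subject, prop, obj)
    return None

def run_pipeline(corpus):
    """Free text in, referenced claims out, as (item, property, value, source URL) tuples."""
    claims = []
    for doc in corpus:
        for sentence in split_sentences(doc['text']):
            entities = link_entities(sentence)
            claim = classify(sentence, entities)
            if claim:
                claims.append(claim + (doc['url'],))
    return claims

corpus = [{'text': 'In 1984 Beckenbauer was appointed manager of West Germany.',
           'url': 'http://www.britannica.com/biography/Franz-Beckenbauer'}]
print(run_pipeline(corpus))
```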

8.3.2 Contributions to the Wikidata Development Plan

In general, this project is intended to play a central role in the primary sources tool. A list of specific open issues follows.

1. Framework for source checking, T90881:10 StrepHit seems like a perfect match for this issue;

2. Nudge editors to add a reference when adding a new claim, T76231:11 Automatically suggesting references would encourage editors to fulfill these duties;

3. Nudge when editing a statement to check reference, T76232:12 same as above.

8.3.3 Work Package

The work package consists of the following tasks:

1. Development corpus: gather 200,000 documents from 40 authoritative Web sources;

2. State of the art review: investigate reusable implementations for the StrepHit pipeline;

3. Corpus analysis: select the top 50 verbal lexical units that emerge from the corpus;

9 https://www.mediawiki.org/wiki/Wikibase/DataModel
10 https://phabricator.wikimedia.org/T90881
11 https://phabricator.wikimedia.org/T76231
12 https://phabricator.wikimedia.org/T76232


Figure 8.1: StrepHit workflow


4. Production corpus: regularly harvest 50,000 new documents from the selected sources;

5. Training set: construct the training data via crowdsourcing;

6. Classifier testing: train and evaluate the supervised classifier to achieve reasonable performance;

7. Frame extraction: transform candidate sentences of the input corpus into structured data via frame classification;

8. Web Sources Knowledge Base: produce the final 2.25 million statements dataset and upload it to the primary sources tool;

9. Stable primary sources tool: fix critical issues in the codebase;

10. Community dissemination: promote the project and engage its key stakeholders.

8.4 Community Engagement

All the following target communities have been notified before the start of the project and will be involved according to the different phases:

• Wikidatans;

• Wikipedians;

• Librarians (and GLAM-related13 communities);

• Natural Language Processing developers and researchers;

• Open Data organizations.

13https://en.wikipedia.org/wiki/Wikipedia:GLAM


The engagement process will mainly be based on a constant presence on community endpoints and social media, as well as on the physical presence of the project leader at key events.

Phase 0: Testing the Prototype. The strephit-soccer demonstrative dataset contains references extracted from sources in Italian. Hence, we have invited the relevant Italian communities to test it.

Phase 1: Corpus Collection. The Wikipedia community has defined comprehensive guidelines for source verifiability.14 Therefore, it will be crucial to the early stage of the project, as it can discover and/or review the set of authoritative Web sources that will form the input corpus. Librarians are also naturally vital to this phase, due to the relatedness of their work activity.

Phase 2: Multilingual StrepHit. Besides the Italian demo dataset, the first StrepHit release will support the English language. We aim at attracting Natural Language Processing experts to implement further language modules, since Wikidata publishes multilingual content and benefits from a multilingual community. We believe that references from sources in multiple languages will have a huge impact in improving the overall data quality.

Phase 3: Further Data Donation. The project outcomes will serve as an encouragement for third-party Open Data organizations to donate their data to Wikidata through a standard workflow, leveraging the primary sources tool.

14https://en.wikipedia.org/wiki/Wikipedia:Verifiability


8.5 Methods and Activities

8.5.1 Technical Setup

We requested the credentials and created a GitHub repository within the Wikidata organization.15 The official documentation page is hosted at mediawiki.org.16 Besides the planned work package, special development efforts have been devoted to:

• a modular architecture;

• parallel processing;

• caching;

• usability of StrepHit both as a library and as a set of command line tools;

• an easy-to-use command line interface to run all the pipeline steps;

• a flexible logging facility.

8.5.2 Project Management

The project has undergone the following activities:

• Monday face-to-face meetings for brainstorming ideas and weekly planning;

• daily scrums, especially for unexpected technical issues, but also for brainstorming;

• whiteboard for crystallized ideas;

• yellow stickers on the whiteboard for ideas to be investigated;

15 https://github.com/Wikidata/StrepHit
16 https://www.mediawiki.org/wiki/StrepHit


• regular interaction with relevant mailing lists and key people to discuss potential impacts and to gather suggestions;

• project dissemination in the form of seminars and talks.

8.5.3 Dissemination

We conducted the following dissemination efforts.

• Kick-off seminar

– Video: https://www.youtube.com/watch?v=uvfd_HmPOrc
– Slides: http://www.slideshare.net/MarcoFossati/strephit-ieg-kickoff-seminar

• Event at Lugano: http://www.ated.ch/manifestazioni/7/web-30-il-potenziale-del-web-semantico-e-dei-dati-strutturati_3194.html (in Italian)

• HackAtoka hackathon: http://blog.atoka.io/hackatoka-open-innovation-al-lavoro-per-testare-le-nuove-atoka-api/ (in Italian)

• Spaghetti Open Data Reunion hackathon: http://www.spaghettiopendata.org/content/wikidata-la-banca-di-conoscenza-libera-casa-wikimedia

• WikiCite 2016:

– Main page: https://meta.wikimedia.org/wiki/WikiCite_2016
– Proposal: https://meta.wikimedia.org/wiki/WikiCite_2016/Proposals/Generation_of_referenced_Wikidata_statements_with_StrepHit
– Work group: https://meta.wikimedia.org/wiki/WikiCite_2016/Report/Group_4


• Wikimania 2016 poster: https://wikimania2016.wikimedia.org/wiki/Posters#StrepHit

• Request for comment: https://www.wikidata.org/wiki/Wikidata:Requests_for_comment/Semi-automatic_Addition_of_References_to_Wikidata_Statements

8.6 Outcomes

The key planned outcomes of StrepHit are:

• the Web Sources Corpus, composed of circa 1.6 M items gathered from 53 reliable Web sources;

• the Natural Language Processing pipeline to extract Wikidata claims from free text;

• the Web Sources Knowledge Base, composed of circa 2.6 M Wikidata claims.

The following list illustrates the output produced by the StrepHit project.

1. Web Sources Corpus: 1,623,381 items, 504,189 documents, 53 sources;

2. Candidate Relations Set: 49 frames, 229 total frame elements, 133 unique frame elements, 69 unique Wikidata relations;

3. StrepHit Pipeline Beta: version 1.0 beta and 1.1 beta released;

4. Web Sources Knowledge Base: 842,208 confident, 958,491 supervised, 808,708 rule-based, 2,609,407 total Wikidata claims;

5. primary sources tool: 5 merged pull requests, active community discussion.


8.6.1 Software

The following modules have reached a mature state from a software development perspective:

• web sources corpus,17 i.e., a set of Web spiders that harvest data from the selected biographical authoritative sources;18

• corpus analysis,19 i.e., a set of scripts to process the corpus and to generate a ranking of the candidate relations;

• commons,20 i.e., several facilities to ensure a scalable and reusable codebase. On the general-purpose side, these include parallel processing, fine-grained logging, and caching. On the Natural Language Processing (NLP) side, special attention is paid to fostering future multilingual implementations, thanks to the modularity of the NLP components, such as tokenization,21 sentence splitting,22 and part-of-speech tagging23 (a minimal illustration of these steps follows this list).

• extraction,24 i.e., the logic needed to extract different sets of sentences, to be used for training and testing the classifier, as well as for the actual production of Wikidata content;

• annotation,25 i.e., a set of scripts to interact with the CrowdFlower crowdsourcing platform APIs, in order to create and post annotation jobs, and to pull results.
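A minimal illustration of the three NLP steps named in the commons item above, using NLTK as a stand-in (an assumption of ours; the actual StrepHit commons wraps its own components):

```python
import nltk

# One-time model downloads for the stand-in tokenizer and tagger
nltk.download('punkt', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)

text = "Henry Ashby worked at Guy's Hospital. He was born in 1846."
sentences = nltk.sent_tokenize(text)                  # sentence splitting
tokens = [nltk.word_tokenize(s) for s in sentences]   # tokenization
tagged = [nltk.pos_tag(t) for t in tokens]            # part-of-speech tagging
print(tagged[0][:4])
```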

17 https://github.com/Wikidata/StrepHit/tree/master/strephit/web_sources_corpus
18 https://github.com/Wikidata/StrepHit/issues/13
19 https://github.com/Wikidata/StrepHit/tree/master/strephit/corpus_analysis
20 https://github.com/Wikidata/StrepHit/tree/master/strephit/commons
21 https://github.com/Wikidata/StrepHit/blob/master/strephit/commons/tokenize.py
22 https://github.com/Wikidata/StrepHit/blob/master/strephit/commons/split_sentences.py
23 https://github.com/Wikidata/StrepHit/blob/master/strephit/commons/pos_tag.py
24 https://github.com/Wikidata/StrepHit/tree/master/strephit/extraction
25 https://github.com/Wikidata/StrepHit/tree/master/strephit/annotation


8.6.2 Bonus Outcomes

Besides the planned goals, we reached the following bonus outcomes, in order of relevance to the Wikidata community:

1. the unresolved entities dataset. When generating the Web Sources Knowledge Base, a (rather large) set of entities could not be resolved to Wikidata QIDs. They may serve as candidates for new Wikidata Items;

2. the Wiki Loves Monuments for Wikidata prototype dataset. We were contacted by Wikimedia Italy to implement a very first integration of a WLM Italy dataset into Wikidata;

3. a rule-based statement extraction technique, which does not require any training set, although it may yield less accurate extractions. It can be thought of as a trade-off between the text annotation and the statement validation costs;

4. the Italian companies dataset, as a result of the HackAtoka hackathon. It is a proof of scalability for the StrepHit pipeline: the rule-based technique has been successfully applied to another domain (companies), in another language (Italian).

8.6.3 Web Sources Corpus

Table 8.1 displays raw counts of scraped items and biographies grouped by Web domains. Together they constitute the input corpus of this project. Since en.wikisource.org actually embeds several sources, Table 8.2 breaks them down. A considerable slice of the corpus does not contain any free text document, but rather semi-structured data that should be exploited in parallel to the NLP pipeline. This is reflected in Figure 8.3, showing the distribution of items with biographies and without biographies. From a document length perspective, we observe a high density of short biographies, as depicted in Figure 8.2, which plots the distribution of biographies according to their length in characters. Figure 8.5 and Figure 8.4 respectively detail how items and biographies are distributed across sources.

Source domain                     items       biographies
www.genealogics.org               447,045     10,621
www.metal-archives.com            355,784     7,988
rkd.nl                            206,993
vocab.getty.edu                   199,502     199,496
collection.britishmuseum.org      118,883     101,117
en.wikisource.org                 60,403      60,355
www.nndb.com                      40,331      40,331
www.bbc.co.uk                     38,018      1,321
www.catholic-hierarchy.org        37,313
www.daao.org.au                   19,696      9,848
adb.anu.edu.au                    19,086      19,086
gameo.org                         13,858      13,850
www.uni-stuttgart.de              10,679
archive.org                       8,721       8,719
cesar.org.uk                      7,044
munksroll.rcplondon.ac.uk         6,959       6,921
sculpture.gla.ac.uk               6,378       5,631
structurae.net                    6,340
yba.llgc.org.uk                   4,470       4,470
www.wga.hu                        3,952       3,927
collection.cooperhewitt.org       3,407       3,407
dictionaryofarthistorians.org     2,442       2,259


Figure 8.2: Distribution of StrepHit IEG Web Sources Corpus Biographies according to the length in characters

Source domain                     items       biographies
www.newulsterbiography.co.uk      2,060       2,060
royalsociety.org                  1,596       1,580
www.parliament.uk                 650
www.museothyssen.org              627         585
www.brown.edu                     601         601
www.academia-net.org              525
Total                             1,623,381   504,189

Table 8.1: Items and biographies across Web domains

Source                    items   biographies
DNB                       28001   27997
Catholic Encyclopedia     11466   11462
Naval Bio                 4692    4688


Source                    items    biographies
Indian Bio                2440     2427
American Bio              2209     2207
National Bio 1912         1631     1631
Australasian Bio          1590     1590
Irish Officers            1530     1524
Bio English Lit           1346     1340
Men at the Bar            1115     1115
National Bio 1901         1033     1033
Christian Bio             921      921
Musicians                 702      702
Freethinkers              546      546
Men of Time               432      431
Chinese Bio               245      245
English Artists           223      223
Medical Bio               109      109
Portraits and Sketches    50       50
Who is who in China       47       47
Greek Roman bio Myth      37       37
Modern English Bio        11       11
Who is who America        10       10
Total                     60,403   60,355

Table 8.2: Items and biographies Wikisource breakdown


Figure 8.3: Distribution of StrepHit IEG Web Sources Corpus Items with Biographies and without Biographies

Figure 8.4: Pie Chart of StrepHit IEG Web Sources Corpus Items across Source Domains


Figure 8.5: Pie Chart of StrepHit IEG Web Sources Corpus Biographies across Source Domains

8.6.4 Candidate Relations Set

The ranking is composed of verbs discovered via the corpus analysis module.26 Each of them will trigger a set of Wikidata properties, depending on the number of FEs.
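A minimal sketch of how such a verb ranking could be obtained from a POS-tagged corpus (the tag set, the raw-frequency scoring, and the function name are simplifying assumptions of ours):

```python
from collections import Counter

def rank_verbs(tagged_corpus, top_k=50):
    """Rank verb lemmas by corpus frequency.

    tagged_corpus: iterable of sentences, each a list of (lemma, pos) pairs.
    Returns the top_k most frequent verb lemmas, i.e. the candidate lexical units.
    """
    counts = Counter(lemma for sentence in tagged_corpus
                     for lemma, pos in sentence if pos.startswith('V'))
    return [lemma for lemma, _ in counts.most_common(top_k)]

corpus = [[('Beckenbauer', 'NP'), ('play', 'VBD'), ('for', 'IN'), ('Bayern', 'NP')],
          [('he', 'PP'), ('manage', 'VBD'), ('Germany', 'NP')],
          [('he', 'PP'), ('play', 'VBD'), ('in', 'IN'), ('1974', 'CD')]]
print(rank_verbs(corpus, top_k=2))  # ['play', 'manage']
```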

8.6.5 Semi-structured Development Dataset

During the corpus collection phase, we were asked to include sources with semi-structured data, typically names and dates. The result is a dataset that caters for the following Wikidata properties:

• birth name;27

• given name;28

26 https://github.com/Wikidata/StrepHit/tree/master/strephit/corpus_analysis
27 https://www.wikidata.org/wiki/Property:P1477
28 https://www.wikidata.org/wiki/Property:P735


• family name;29

• pseudonym;30

• honorific suffix;31

• date of birth;32

• date of death;33

• sex or gender.34

Table 8.3 displays the amounts of references generated in the dataset, grouped by Web domains.

Domain                            references
adb.anu.edu.au                    6,262
collection.britishmuseum.org      17,456
gameo.org                         238
munksroll.rcplondon.ac.uk         418
archive.org                       1,166
collection.cooperhewitt.org       366
sculpture.gla.ac.uk               247
dictionaryofarthistorians.org     103
en.wikisource.org                 5,923
rkd.nl                            2,416
structurae.net                    254
viaf.org                          387

29 https://www.wikidata.org/wiki/Property:P734
30 https://www.wikidata.org/wiki/Property:P742
31 https://www.wikidata.org/wiki/Property:P1035
32 https://www.wikidata.org/wiki/Property:P569
33 https://www.wikidata.org/wiki/Property:P570
34 https://www.wikidata.org/wiki/Property:P21


Domain                            references
vocab.getty.edu                   33,452
www.bbc.co.uk                     9,847
www.museothyssen.org              240
www.newulsterbiography.co.uk      501
www.nndb.com                      17,296
www.uni-stuttgart.de              2,465
www.wga.hu                        1,577
yba.llgc.org.uk                   39
Total                             100,266

Table 8.3: Semi-structured development dataset references count across Web domains

8.7 Evaluation

Table 8.4 reports the amount of references generated by StrepHit on its datasets and across Web sources. This gives a raw overview of the main goal of the project, namely to produce referenced Wikidata claims. Figure 8.6 displays the extraction outputs with respect to the confidence scores of linked entities: it is intended to highlight critical thresholds that should be used to achieve reasonable precision and recall trade-offs. We plot in Figure 8.7 standard performance values of the supervised classifier, computed on a random sample of lexical units. We observe different behaviors, depending on the lexical unit.

Domain                            Confident   Supervised   Rule-based
adb.anu.edu.au                    52,419      154,979      119,239
collection.britishmuseum.org      238,308     20,912       29,046
gameo.org                         2,113       6,544        7,334


Domain                            Confident   Supervised   Rule-based
munksroll.rcplondon.ac.uk         4,114       18,438       12,649
archive.org                       8,103       39,062       30,146
collection.cooperhewitt.org       2,383       11,550       13,677
sculpture.gla.ac.uk               1,663       1,474        1,182
dictionaryofarthistorians.org     1,358       3,620        4,969
en.wikisource.org                 51,232      227,346      209,411
rkd.nl                            44,690      N.A.         N.A.
structurae.net                    1,851       N.A.         N.A.
vocab.getty.edu                   213,436     6,137        4,052
www.bbc.co.uk                     54,070      2,109        2,254
www.brown.edu                     N.A.        1,200        1,144
www.daao.org.au                   N.A.        26,848       21,256
www.genealogics.org               19,870      10,186       14,536
www.metal-archives.com            N.A.        760          1,796
www.museothyssen.org              1,468       1,498        2,096
www.newulsterbiography.co.uk      3,284       3,438        5,379
www.nndb.com                      106,782     26,402       30,101
www.uni-stuttgart.de              20,627      N.A.         N.A.
www.wga.hu                        9,762       5,088        5,944
yba.llgc.org.uk                   4,645       6,912        9,599
Total                             842,191     574,503      525,811
Grand total                       1,942,505

Table 8.4: Statistics of referenced Wikidata claims across Web sources and StrepHit datasets


Figure 8.6: Amount of (1) sentences extracted from the input corpus, (2) classified sentences, and (3) generated Wikidata claims, with respect to confidence scores of linked entities

Figure 8.7: Performance values of the supervised classifier among a random sample of lexical units: (1) F1 scores via 10-fold cross validation, compared to a dummy classifier; (2) accuracy scores against a gold standard of 249 annotated sentences


8.7.1 Sample Statements

Machine-readable statements are expressed in the QuickStatements syntax.35 The following list includes a random sample of correct statements that may serve as candidates for inclusion into Wikidata; a minimal serialization sketch follows the samples.

1. • Q389547 P570 +00000001837-01-01T00:00:00Z/9 S854 "http://www.bbc.co.uk/arts/yourpaintings/artists/hodges-charles-howard-17641837"
• According to BBC Your Paintings, Charles Howard Hodges died in 1837

2. • Q17355708 P1477 "emma nicol" S854 "https://en.wikisource.org/wiki/Nicol,_Emma_(DNB00)"
• According to the Dictionary of National Biography, Emma Nicol's birth name is "emma nicol"

3. • Q594729 P21 Q6581097 S854 "http://vocab.getty.edu/ulan/500110819"
• According to the Union List of Artist Names, Anton Teichlein is a male

4. • Q215502 P742 "Morgan, Henry" S854 "http://collection.britishmuseum.org/id/person-institution/156902"
• According to the British Museum, Henry Morgan's pseudonym is "Morgan, Henry"

5. • Q1562861 P569 +00000001939-08-21T00:00:00Z/11 S854 "http://www.nndb.com/people/103/000024031/"
• According to the Notable Names Database, Clarence Williams III was born on August 21, 1939

35 https://tools.wmflabs.org/wikidata-todo/quick_statements.php


6. • Q18526540 P569 +00000001815-02-24T00:00:00Z/11 S854 "http://adb.anu.edu.au/biography/barkly-sir-henry-2936"
• According to the Australian Dictionary of Biography, Arthur Barkly was born on February 24, 1815

7. • Q16058737 P106 Q80687 S854 "https://ia902707.us.archive.org/1/items/biographicaldict08johnuoft/biographicaldict08johnuoft_djvu.txt"
• According to The Biographical Dictionary of America, Charles Millard Pratt has been a secretary

8. • Q515632 P69 Q1068752 S854 "http://www.nndb.com/people/215/000042089/"
• According to the Notable Names Database, Ossie Davis was educated at Howard University

9. • Q18922309 P937 Q777039 S854 "http://munksroll.rcplondon.ac.uk/Biography/Details/140"
• According to the Royal College of Physicians, Henry Ashby has worked at Guy's Hospital

10. • Q4861627 P19 Q739700 S854 "http://www.bbc.co.uk/arts/yourpaintings/artists/barnett-freedman"
• According to BBC Your Paintings (now Art UK), Barnett Freedman was born in the East End of London

On the other hand, the following list shows a glimpse of wrong statements.

1. • Q3770981 P1477 "giusepe melani" S854 "http://vocab.getty.edu/ulan/500051662"
• According to the Union List of Artist Names, Giuseppe Melani's birth name is "giusepe melani"


• the source is wrong (possible typo)

2. • Q598060 P742 "Martyr Vermigli, Peter" S854 "http://collection.britishmuseum.org/id/person-institution/112005"
• According to the British Museum, Peter Martyr Vermigli's pseudonym is "Martyr Vermigli, Peter"
• debatable source assertion and Wikidata property label

3. • Q57297 P742 "E.W.L.T.; Ernesto Guglielmo Temple ; http://viaf.org/viaf/45102696" S854 "http://www.uni-stuttgart.de/hi/gnt/dsi2/index.php?table_name=dsi&function=details&where_field=id&where_value=5752"
• According to the Database of Scientific Illustrators, Wilhelm Tempel's pseudonym is "E.W.L.T.; Ernesto Guglielmo Temple ; http://viaf.org/viaf/45102696"
• incorrect parsing of the source

4. • Q21454578 P463 Q42482 S854 "http://www.metal-archives.com/artists/Hugh_Gilmour/84280"
• According to Encyclopædia Metallum, Hugh Gilmour was a member of Iron Maiden
• possibly homonymous subject (incorrect resolution), incorrect classification

5. • Q28144 P101 Q1193470 S854 "http://www.museothyssen.org/en/thyssen/ficha_artista/301"
• According to the Thyssen-Bornemisza Museum, Willem Kalf's field of work is theme music
• incorrect entity linking, incorrect classification


6. • Q3437676 P170 Q3908516 S854 "https://www.daao.org.au/bio/david-granger/"
• According to Design & Art Australia Online, David Granger is the creator of entrepreneurship
• homonymous subject (incorrect resolution), incorrect classification
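The statements above follow a simple field-per-column layout; as a minimal, illustrative serializer (the helper name, the field separator, and the date handling are assumptions of ours, not the project's actual serializer):

```python
def to_quickstatement(item, prop, value, source_url):
    """Serialize one referenced claim into a QuickStatements-style line:
    subject item, property, value, then S854 (reference URL) with the source."""
    return '{}\t{}\t{}\tS854\t"{}"'.format(item, prop, value, source_url)

# Dates carry a precision suffix (/11 = day, /9 = year); string values are quoted
print(to_quickstatement('Q1562861', 'P569', '+00000001939-08-21T00:00:00Z/11',
                        'http://www.nndb.com/people/103/000024031/'))
```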

8.7.2 Final Claim Correctness

We carried out an empirical evaluation over the final output results, by randomly sampling 48 claims from the supervised and the rule-based datasets. Since StrepHit is a pipeline with several components, we computed the accuracy of those responsible for the actual generation of claims. Results are presented in Table 8.5 and indicate the ratio of correct data for each of them, as well as the overall claim correctness.

Dataset      Claims   Linker   Classifier   Normalizer   Resolver   Overall
supervised   48       0.8125   0.781        1            0.285      0.638
rule-based   48       0.709    0.607        1            0.5        0.588

Table 8.5: Empirical claim correctness assessment

8.8 Resources

We provide below links to the project output.

• Codebase: https://github.com/Wikidata/StrepHit

• Documentation: https://www.mediawiki.org/wiki/StrepHit

• Web Sources Corpus


– Development: http://it.dbpedia.org/downloads/strephit/web_sources_corpus/development_corpus.tar.gz
– Production: http://it.dbpedia.org/downloads/strephit/web_sources_corpus/production_corpus.tar.gz

• Lexical database: http://it.dbpedia.org/downloads/strephit/lexical_db.json

• Web Sources Knowledge Base

– Confident dataset: http://it.dbpedia.org/downloads/strephit/web_sources_knowledge_base/confident_dataset.qs.gz
– Supervised dataset: http://it.dbpedia.org/downloads/strephit/web_sources_knowledge_base/supervised_dataset.qs.gz
– Rule-based dataset: http://it.dbpedia.org/downloads/strephit/web_sources_knowledge_base/rule-based_dataset.qs.gz

• Unresolved entities

– Confident: http://it.dbpedia.org/downloads/strephit/unresolved_entities/confident_unresolved.jsonl.gz
– Supervised: http://it.dbpedia.org/downloads/strephit/unresolved_entities/supervised_unresolved.jsonl.gz
– Rule-based: http://it.dbpedia.org/downloads/strephit/unresolved_entities/rule-based_unresolved.jsonl.gz

• Wiki Loves Monuments Italy prototype: http://it.dbpedia.org/downloads/strephit/wlm_italy_prototype/

• Italian Companies


– Corpus: http://it.dbpedia.org/downloads/strephit/italian_companies_dataset/hackatoka_corpus.jsonl.gz
– Lexical database: http://it.dbpedia.org/downloads/strephit/italian_companies_dataset/hackatoka_lexical_db.json
– Dataset (not resolved to Wikidata): http://it.dbpedia.org/downloads/strephit/italian_companies_dataset/hackatoka_dataset.jsonl.gz

• All other resources at: http://it.dbpedia.org/downloads/strephit/

8.9 Challenges

Almost every challenge is technical, and most of them stem from NLP. We list them in order of decreasing impact. In general, scalability should always be taken into account during software development.

Input Corpus. A relatively big input corpus from several sources introduces the need to cope with high language variability. Certain documents are written in old English, others stem from the OCR output of a paper scan, etc.

Target Lexical Database. It is unlikely that FrameNet would be a perfect fit for the data we aim at generating. This especially applies to the crowdsourcing part, since labels and definitions are minted by expert linguists, but presented to non-expert laymen. Hence, the major unplanned task (which affected the overall schedule of the project) was the construction of a suitable lexical database, since FrameNet failed to meet our needs. This had a negative impact on the most delicate planned task, namely building the crowdsourced training set.


Primary Sources Tool. Contributing to the maintenance of a third-party resource with generally low development activity can be time-consuming: it entails various tasks, from understanding possibly undocumented source code, to nudging the maintainers to address issues, all the way to accessing the machine that hosts the tool.

Crowdsourced Training Set. We had to face extra issues related to the crowdsourcing platform and the nature of the input corpus. Respectively:

• high execution time for certain lexical units that are not trivial to annotate (at the time of writing this report, some jobs are still running);

• high percentage of sentence chunks that cannot be labeled with any frame element (more than 50% on average), which resulted in a relatively large amount of empty sentences even after the annotation.

This prevented us from reaching a sufficient amount of training samples, thus causing a generally low performance of the supervised classifier, depending on the lexical unit.

Dataset Serialization. Finding a general-purpose method to serialize the classification results into Wikidata assertions was impossible, since we needed to understand the intended meaning of each Wikidata property, i.e., how it is used to represent the Wikidata world.

8.10 Side Projects

Besides StrepHit, we have been contributing to the following projects:

• primary sources tool, with 5 merged pull requests 36

36 https://github.com/Wikidata/primarysources/pull/86, https://github.com/Wikidata/primarysources/pull/87, https://github.com/Wikidata/primarysources/pull/97, https://github.com/Wikidata/primarysources/pull/100, https://github.com/Wikidata/primarysources/pull/102


• Prototype import of Wiki Loves Monuments Italy37 into Wikidata38

• Sphinx39 Python documentation builder40

37 http://wikilovesmonuments.wikimedia.it/
38 http://it.dbpedia.org/downloads/strephit/wlm_italy_prototype/, https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2016/06#Importing_Wiki_Loves_Monuments_lists_into_Wikidata
39 http://www.sphinx-doc.org/
40 https://github.com/sphinx-doc/sphinx/pull/2444, https://github.com/Wikidata/StrepHit/tree/master/strephit/sphinx_wikisyntax


Bibliography

[1] Eneko Agirre, Enrique Alfonseca, Keith B. Hall, Jana Kravalova, Marius Pasca, and Aitor Soroa. A study on similarity and relatedness using distributional and wordnet-based approaches. In Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, May 31 - June 5, 2009, Boulder, Colorado, USA, pages 19–27. The Association for Computational Linguistics, 2009.

[2] Gabor Angeli, Melvin Jose Johnson Premkumar, and Christopher D. Manning. Leveraging linguistic structure for open domain information extraction. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers, pages 344–354. The Association for Computer Linguistics, 2015.

[3] Gabor Angeli, Julie Tibshirani, Jean Wu, and Christopher D. Manning. Combining distant and partial supervision for relation extraction. In Alessandro Moschitti, Bo Pang, and Walter Daelemans, editors, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1556–1567. ACL, 2014.

[4] Alessio Palmero Aprosio, Claudio Giuliano, and Alberto Lavelli. Automatic expansion of dbpedia exploiting wikipedia cross-language information. In The Semantic Web: Semantics and Big Data, pages 397–411. Springer, 2013.

[5] Alessio Palmero Aprosio, Claudio Giuliano, and Alberto Lavelli. Towards an automatic creation of localized versions of dbpedia. In The Semantic Web–ISWC 2013, pages 494–509. Springer, 2013.

[6] Isabelle Augenstein, Diana Maynard, and Fabio Ciravegna. Relation extraction from the web using distant supervision. In Krzysztof Janowicz, Stefan Schlobach, Patrick Lambrix, and Eero Hyvönen, editors, Knowledge Engineering and Knowledge Management - 19th International Conference, EKAW 2014, Linköping, Sweden, November 24-28, 2014. Proceedings, volume 8876 of Lecture Notes in Computer Science, pages 26–41. Springer, 2014.

[7] Collin F. Baker. Framenet, current collaborations and future goals. Language Resources and Evaluation, 46(2):269–286, 2012.

[8] Collin F. Baker, Charles J. Fillmore, and John B. Lowe. The berkeley framenet project. In Christian Boitet and Pete Whitelock, editors, 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, COLING-ACL '98, August 10-14, 1998, Université de Montréal, Montréal, Quebec, Canada. Proceedings of the Conference., pages 86–90. Morgan Kaufmann Publishers / ACL, 1998.

[9] L. Douglas Baker and Andrew Kachites McCallum. Distributional clustering of words for text classification. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 96–103. The Association for Computing Machinery, 1998.

[10] L. Bentivogli, P. Forner, C. Giuliano, A. Marchetti, E. Pianta, and K. Tymoshenko. Extending English ACE 2005 Corpus Annotation with Ground-truth Links to Wikipedia. In 23rd International Conference on Computational Linguistics, pages 19–26, 2010.

[11] Jonathan Berant and Percy Liang. Semantic parsing via paraphrasing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014, June 22-27, 2014, Baltimore, MD, USA, Volume 1: Long Papers, pages 1415–1425. The Association for Computer Linguistics, 2014.

[12] Christian Bizer, Jens Lehmann, Georgi Kobilarov, Sören Auer, Christian Becker, Richard Cyganiak, and Sebastian Hellmann. Dbpedia - a crystallization point for the web of data. Web Semantics: Science, Services and Agents on the World Wide Web, 7(3):154–165, 2009.

[13] Anders Björkelund, Love Hafdell, and Pierre Nugues. Multilingual semantic role labeling. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning: Shared Task, pages 43–48. Association for Computational Linguistics, 2009.

[14] Kurt D. Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In Jason Tsong-Li Wang, editor, Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2008, Vancouver, BC, Canada, June 10-12, 2008, pages 1247–1250. ACM, 2008.


[15] Razvan C. Bunescu and Marius Pasca. Using encyclopedic knowledge for named entity disambiguation. In Diana McCarthy and Shuly Wintner, editors, EACL 2006, 11th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference, April 3-7, 2006, Trento, Italy. The Association for Computer Linguistics, 2006.

[16] Aljoscha Burchardt, Katrin Erk, Anette Frank, Andrea Kowalski, Sebastian Pado, and Manfred Pinkal. Salto – a versatile multi-level annotation tool. In Proceedings of LREC 2006, pages 517–520. Citeseer, 2006.

[17] Nicoletta Calzolari, Laura Pecchia, and Antonio Zampolli. Working on the italian machine dictionary: a semantic approach. In Proceedings of the 5th Conference on Computational Linguistics, volume 2, pages 49–52. Association for Computational Linguistics, 1973.

[18] I. Cantador, A. Bellogín, and P. Castells. A multilayer ontology-based hybrid recommendation model. AI Communications, 21(2):203–210, 2008.

[19] I. Cantador, A. Bellogín, and P. Castells. News@ hand: A semantic web approach to recommending news. In Adaptive Hypermedia and Adaptive Web-Based Systems, pages 279–283. Springer, 2008.

[20] I. Cantador, A. Bellogín, and P. Castells. Ontology-based personalised and context-aware recommendations of news items. In Web Intelligence and Intelligent Agent Technology, 2008. WI-IAT'08. IEEE/WIC/ACM International Conference on, volume 1, pages 562–565. IEEE, 2008.

[21] I. Cantador, P. Castells, and A. Bellogín. An enhanced semantic layer for hybrid recommender systems: Application to news recommendation. International Journal on Semantic Web and Information Systems (IJSWIS), 7(1):44–78, 2011.

[22] Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R. Hruschka Jr., and Tom M. Mitchell. Toward an architecture for never-ending language learning. In Maria Fox and David Poole, editors, Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2010, Atlanta, Georgia, USA, July 11-15, 2010. AAAI Press, 2010.

[23] Jon Chamberlain, Udo Kruschwitz, and Massimo Poesio. Constructing an anaphorically annotated corpus with non-experts: Assessing the quality of collaborative annotations. In Proceedings of the 2009 Workshop on The People's Web Meets NLP: Collaboratively Constructed Semantic Resources, pages 57–62. Association for Computational Linguistics, 2009.

[24] Jon Chamberlain, Massimo Poesio, and Udo Kruschwitz. Phrase detectives: A web-based collaborative annotation game. Proceedings of I-Semantics, Graz, 2008.

[25] Nancy Chang, Praveen Paritosh, David Huynh, and Collin Baker. Scaling semantic frame annotation. In Proceedings of The 9th Linguistic Annotation Workshop LAW IX, June 5, 2015, Denver, Colorado, USA, pages 1–10. ACL, 2015.

[26] J. Chao, H. Wang, W. Zhou, W. Zhang, and Y. Yu. Tunesensor: A semantic-driven music recommendation service for digital photo albums. In Proceedings of the 10th International Semantic Web Conference, ISWC2011, October 2011.

[27] Jackie Chi Kit Cheung, Hoifung Poon, and Lucy Vanderwende. Probabilistic frame induction. arXiv preprint arXiv:1302.4813, 2013.

[28] Jacob Cohen. A coefficient of agreement for nominal scales. Educational and psychological measurement, 20(1):37–46, 1960.

[29] Bonaventura Coppola, Aldo Gangemi, Alfio Massimiliano Gliozzo, Davide Picca, and Valentina Presutti. Frame detection over the semantic web. In Lora Aroyo, Paolo Traverso, Fabio Ciravegna, Philipp Cimiano, Tom Heath, Eero Hyvönen, Riichiro Mizoguchi, Eyal Oren, Marta Sabou, and Elena Paslaru Bontas Simperl, editors, The Semantic Web: Research and Applications, 6th European Semantic Web Conference, ESWC 2009, Heraklion, Crete, Greece, May 31-June 4, 2009, Proceedings, volume 5554 of Lecture Notes in Computer Science, pages 126–142. Springer, 2009.

[30] Silviu Cucerzan. Large-scale named entity disambiguation based on wikipedia data. In Jason Eisner, editor, EMNLP-CoNLL 2007, Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, June 28-30, 2007, Prague, Czech Republic, pages 708–716. ACL, 2007.

[31] Joachim Daiber, Max Jakob, Chris Hokamp, and Pablo N. Mendes. Improving efficiency and accuracy in multilingual entity extraction. In Marta Sabou, Eva Blomqvist, Tommaso Di Noia, Harald Sack, and Tassilo Pellegrini, editors, I-SEMANTICS 2013 - 9th International Conference on Semantic Systems, ISEM ’13, Graz, Austria, September 4-6, 2013, pages 121–124. ACM, 2013.

[32] Dipanjan Das, Desai Chen, André F. T. Martins, Nathan Schneider, and Noah A. Smith. Frame-semantic parsing. Computational Linguistics, 40(1):9–56, 2014.


[33] Dipanjan Das, André F. T. Martins, and Noah A. Smith. An exact dual decomposition algorithm for shallow semantic parsing with constraints. In Proceedings of the First Joint Conference on Lexical and Computational Semantics - Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, pages 209–217. Association for Computational Linguistics, 2012.

[34] Gerard de Melo and Gerhard Weikum. Menta: Inducing multilingual taxonomies from wikipedia. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, pages 1099–1108. ACM, 2010.

[35] Filipe de Sá Mesquita, Jordan Schmidek, and Denilson Barbosa. Effectiveness and efficiency of open relation extraction. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013, 18-21 October 2013, Grand Hyatt Seattle, Seattle, Washington, USA, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 447–457. ACL, 2013.

[36] Scott C. Deerwester, Susan T. Dumais, Thomas K. Landauer, George W. Furnas, and Richard A. Harshman. Indexing by latent semantic analysis. JASIS, 41(6):391–407, 1990.

[37] T. Di Noia, R. Mirizzi, V.C. Ostuni, D. Romito, and M. Zanker. Linked open data to support content-based recommender systems. In Proceedings of the 8th International Conference on Semantic Systems, pages 1–8. ACM, 2012.

[38] Li Ding, Timothy Lebo, John S Erickson, Dominic DiFranzo, Gregory Todd Williams, Xian Li, James Michaelis, Alvaro Graves, Jin Guang Zheng, Zhenning Shangguan, et al. Twc logd: A portal for linked open government data ecosystems. Web Semantics: Science, Services and Agents on the World Wide Web, 9(3):325–333, 2011.

[39] Xin Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Ni Lao, Kevin Murphy, Thomas Strohmann, Shaohua Sun, and Wei Zhang. Knowledge vault: a web-scale approach to probabilistic knowledge fusion. In Sofus A. Macskassy, Claudia Perlich, Jure Leskovec, Wei Wang, and Rayid Ghani, editors, The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '14, New York, NY, USA - August 24 - 27, 2014, pages 601–610. ACM, 2014.

[40] Arnab Dutta, Christian Meilicke, and Heiner Stuckenschmidt. Enriching structured knowledge with open information. In Proceedings of the 24th International Conference on World Wide Web, WWW 2015, Florence, Italy, May 18-22, 2015, pages 267–277, 2015.

[41] Maud Ehrmann, Francesco Cecconi, Daniele Vannella, John McCrae, Philipp Cimiano, and Roberto Navigli. Representing multilingual data as linked data: the case of babelnet 2.0. In Proceedings of the 9th Language Resources and Evaluation Conference, 2014.

[42] Fredo Erxleben, Michael Günther, Markus Krötzsch, Julian Mendez, and Denny Vrandecic. Introducing wikidata to the linked data web. In The Semantic Web - ISWC 2014 - 13th International Semantic Web Conference, Riva del Garda, Italy, October 19-23, 2014. Proceedings, Part I, pages 50–65, 2014.

[43] Anthony Fader, Stephen Soderland, and Oren Etzioni. Identifying relations for open information extraction. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, EMNLP 2011, 27-31 July 2011, John McIntyre Conference Centre, Edinburgh, UK, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1535–1545. ACL, 2011.

[44] Manaal Faruqui and Shankar Kumar. Multilingual open relation extraction using cross-lingual projection. In NAACL HLT 2015, The 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, Colorado, USA, May 31 - June 5, 2015, pages 1351–1356, 2015.

[45] Christiane Fellbaum. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA, 1998.

[46] Charles Fillmore. Frame semantics. Linguistics in the morning calm, pages 111–137, 1982.

[47] Charles J. Fillmore. Frame Semantics and the nature of language. In Annals of the New York Academy of Sciences: Conference on the Origin and Development of Language, pages 20–32. Blackwell Publishing, 1976.

[48] Charles J. Fillmore, Collin F. Baker, and Hiroaki Sato. The FrameNet Database and Software Tools. In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 2002), pages 1157–1160, Las Palmas, Spain, 2002.

[49] Tiziano Flati, Daniele Vannella, Tommaso Pasini, and Roberto Navigli. Two is bigger (and better) than one: the wikipedia bitaxonomy project. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, 2014.

[50] Marco Fossati, Claudio Giuliano, and Sara Tonelli. Outsourcing framenet to the crowd. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, ACL 2013, 4-9 August 2013, Sofia, Bulgaria, Volume 2: Short Papers, pages 742–747. The Association for Computer Linguistics, 2013.

[51] Aldo Gangemi, Andrea Giovanni Nuzzolese, Valentina Presutti, Francesco Draicchio, Alberto Musetti, and Paolo Ciancarini. Automatic typing of dbpedia entities. In The Semantic Web–ISWC 2012, pages 65–81. Springer, 2012.

[52] Aldo Gangemi and Valentina Presutti. Towards a pattern science for the semantic web. Semantic Web, 1(1-2):61–68, 2010.

[53] Aldo Gangemi, Valentina Presutti, Diego Reforgiato Recupero, Andrea Giovanni Nuzzolese, Francesco Draicchio, and Misael Mongiovì. Semantic web machine reading with fred. Semantic Web, 2016. Under Review.

[54] Claudio Giuliano, Alfio Massimiliano Gliozzo, and Carlo Strapparava. Kernel methods for minimally supervised WSD. Computational Linguistics, 35(4):513–528, 2009.

[55] Jonathan Gray, Lucy Chambers, and Liliana Bounegru. The Data Journalism Handbook. O’Reilly Media, Inc., 2012.

[56] Marco Guerini, Carlo Strapparava, and Oliviero Stock. Ecological evaluation of persuasive messages using google adwords. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, volume abs/1204.5369 of ACL2012, July 2012.

[57] Bernhard Haslhofer, Elaheh Momeni, Manuel Gay, and Rainer Simon. Augmenting europeana content with linked data resources. In Proceedings of the 6th International Conference on Semantic Systems, page 40. ACM, 2010.

[58] Conor Hayes, Pádraig Cunningham, and Paolo Massa. An on-line evaluation framework for recommender systems. Technical Report TCD-CS-2002-19, Trinity College Dublin, Department of Computer Science, 2002.

[59] Michael Heilman and Noah A. Smith. Extracting Simplified Statements for Factual Question Generation. In Proceedings of QG2010: The Third Workshop on Question Generation, Pittsburgh, PA, USA, 2010.

[60] Sebastian Hellmann, Jens Lehmann, Sören Auer, and Martin Brümmer. Integrating NLP using linked data. In The Semantic Web - ISWC 2013 - 12th International Semantic Web Conference, Sydney, NSW, Australia, October 21-25, 2013, Proceedings, Part II, pages 98–113, 2013.

[61] Daniel Hernández, Aidan Hogan, and Markus Krötzsch. Reifying RDF: what works well with wikidata? In Proceedings of the 11th International Workshop on Scalable Semantic Web Knowledge Base Systems co-located with 14th International Semantic Web Conference (ISWC 2015), Bethlehem, PA, USA, October 11, 2015, pages 32–47, 2015.

[62] Rinke Hoekstra. Ontology Representation - Design Patterns and Ontologies that Make Sense, volume 197 of Frontiers in Artificial Intelligence and Applications. IOS Press, 2009.

[63] Johannes Hoffart, Fabian M. Suchanek, Klaus Berberich, and Gerhard Weikum. YAGO2: A spatially and temporally enhanced knowledge base from wikipedia. Artificial Intelligence, 194:28–61, 2013.

[64] Jisup Hong and Collin F. Baker. How Good is the Crowd at “real” WSD? In Proceedings of the 5th Linguistic Annotation Workshop, pages 30–37, 2011.

[65] D. Jannach, M. Zanker, A. Felfernig, and G. Friedrich. Recommender Systems: An Introduction. Cambridge University Press, 2011.

[66] Richard Johansson and Pierre Nugues. Lth: Semantic structure extraction using nonprojective dependency trees. In Proceedings of the 4th international workshop on semantic evaluations, pages 227–230. The Association for Computational Linguistics, 2007.

[67] Richard Johansson and Pierre Nugues. Dependency-based semantic role labeling of propbank. In 2008 Conference on Empirical Methods in Natural Language Processing, EMNLP 2008, Proceedings of the Conference, 25-27 October 2008, Honolulu, Hawaii, USA, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 69–78. ACL, 2008.

[68] Nicolas Kaufmann, Thimo Schulze, and Daniel Veit. More than fun and money. Worker motivation in crowdsourcing – a study on Mechanical Turk. In Proceedings of the Seventeenth Americas Conference on Information Systems, Detroit, MI, 2011.

[69] Shailesh Kochhar, Stefano Mazzocchi, and Praveen Paritosh. The anatomy of a large-scale human computation engine. In Proceedings of the ACM SIGKDD Workshop on Human Computation, pages 10–17. ACM, 2010.

[70] Dimitris Kontokostas, Charalampos Bratsas, Sören Auer, Sebastian Hellmann, Ioannis Antoniou, and George Metakides. Internationalization of linked data: The case of the Greek dbpedia edition. Journal of Web Semantics, 15:51–61, 2012.

[71] Meghana Kshirsagar, Sam Thomson, Nathan Schneider, Jaime G. Carbonell, Noah A. Smith, and Chris Dyer. Frame-semantic role labeling with heterogeneous annotations. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 2: Short Papers, pages 218–224. The Association for Computational Linguistics, 2015.

[72] Ivo Lašek. Dc proposal: Model for news filtering with named entities. In Lora Aroyo, Chris Welty, Harith Alani, Jamie Taylor, Abraham Bernstein, Lalana Kagal, Natasha Noy, and Eva Blomqvist, editors, The Semantic Web – ISWC 2011, volume 7032 of Lecture Notes in Computer Science, pages 309–316. Springer / Heidelberg, 2011.

[73] Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N. Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick van Kleef, Sören Auer, and Christian Bizer. DBpedia - a large-scale, multilingual knowledge base extracted from wikipedia. Semantic Web Journal, 6(2):167–195, 2015.

[74] L. Li, W. Chu, J. Langford, and R.E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pages 661–670. ACM, 2010.

[75] P. Lops, M. Gemmis, and G. Semeraro. Content-based recommender systems: State of the art and trends. Recommender Systems Handbook, pages 73–105, 2011.

[76] Yuanhua Lv, Taesup Moon, Pranam Kolari, Zhaohui Zheng, Xuanhui Wang, and Yi Chang. Learning to model relatedness for news recommendation. In Proceedings of the 20th international conference on World wide web, WWW '11, pages 57–66, New York, NY, USA, 2011. ACM.

[77] Lluís Màrquez, Xavier Carreras, Kenneth C. Litkowski, and Suzanne Stevenson. Semantic role labeling: An introduction to the special issue. Computational Linguistics, 34(2):145–159, 2008.

[78] Winter Mason and Siddharth Suri. Conducting behavioral research on amazon’s mechanical turk. Behavior Research Methods, 44:1–23, 2012.

[79] Mausam, Michael Schmitz, Stephen Soderland, Robert Bart, and Oren Etzioni. Open language learning for information extraction. In Jun'ichi Tsujii, James Henderson, and Marius Pasca, editors, Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL 2012, July 12-14, 2012, Jeju Island, Korea, pages 523–534. ACL, 2012.

[80] S.M. McNee, J. Riedl, and J.A. Konstan. Being accurate is not enough: how accuracy metrics have hurt recommender systems. In CHI’06 extended abstracts on Human factors in computing systems, pages 1097–1101. ACM, 2006.

[81] Pablo N. Mendes, Max Jakob, Andrés García-Silva, and Christian Bizer. Dbpedia spotlight: shedding light on the web of documents. In Chiara Ghidini, Axel-Cyrille Ngonga Ngomo, Stefanie N. Lindstaedt, and Tassilo Pellegrini, editors, Proceedings of the 7th International Conference on Semantic Systems, I-SEMANTICS 2011, Graz, Austria, September 7-9, 2011, ACM International Conference Proceeding Series, pages 1–8. ACM, 2011.

[82] Pablo N. Mendes, Max Jakob, Andrés García-Silva, and Christian Bizer. Dbpedia spotlight: shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems, I-Semantics '11, pages 1–8, New York, NY, USA, 2011. ACM.

[83] David Milne and Ian H. Witten. Learning to link with Wikipedia. In CIKM ’08: Proceedings of the 17th ACM conference on Information and knowledge management, pages 509–518, NY, USA, 2008. ACM.

[84] Andrea Moro, Alessandro Raganato, and Roberto Navigli. Entity linking meets word sense disambiguation: a unified approach. TACL, 2:231–244, 2014.

[85] Vivi Nastase and Michael Strube. Transforming wikipedia into a large scale multilingual concept network. Artificial Intelligence, 194:62–85, 2013.

[86] Vivi Nastase, Michael Strube, Benjamin Boerschinger, C¨acilia Zirn, and Anas Elghafari. Wikinet: A very large scale multi-lingual concept network. In Proceedings of the 5th Language Resources and Evaluation Conference, 2010.

[87] Roberto Navigli and Simone Paolo Ponzetto. Babelnet: the automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence, 193:217–250, 2012.

[88] Matteo Negri, Luisa Bentivogli, Yashar Mehdad, Danilo Giampiccolo, and Alessandro Marchetti. Divide and conquer: crowdsourcing the creation of cross-lingual textual entailment corpora. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP ’11, pages 670–679, Stroudsburg, PA, USA, 2011. Association for Computational Linguistics.

[89] Vinh Nguyen, Olivier Bodenreider, and Amit Sheth. Don’t like rdf reification?: making statements about statements using singleton property. In Proceedings of the 23rd international conference on World wide web, pages 759–770. ACM, 2014.

[90] Andrea Giovanni Nuzzolese, Aldo Gangemi, and Valentina Presutti. Gathering lexical linked data and knowledge patterns from framenet. In Mark A. Musen and Óscar Corcho, editors, Proceedings of the 6th International Conference on Knowledge Capture (K-CAP 2011), June 26-29, 2011, Banff, Alberta, Canada, pages 41–48. ACM, 2011.

[91] Andrea Giovanni Nuzzolese, Aldo Gangemi, Valentina Presutti, and Paolo Ciancarini. Fine-tuning triplification with semion. In EKAW workshop on Knowledge Injection into and Extraction from Linked Data (KIELD2010), pages 2–14, 2010.

[92] Sebastian Padó and Mirella Lapata. Cross-lingual annotation projection for semantic roles. Journal of Artificial Intelligence Research, 36(1):307–340, 2009.

[93] Martha Palmer, Dan Gildea, and Paul Kingsbury. The Proposition Bank: A Corpus Annotated with Semantic Roles. Computational Linguistics, 31(1), 2005.

[94] Heiko Paulheim and Christian Bizer. Type inference on noisy rdf data. In The Semantic Web–ISWC 2013, pages 510–525. Springer, 2013.

[95] Michael Pazzani and Daniel Billsus. Content-based recommendation systems. In Peter Brusilovsky, Alfred Kobsa, and Wolfgang Nejdl, editors, The Adaptive Web, volume 4321 of Lecture Notes in Computer Science, pages 325–341. Springer Berlin / Heidelberg, 2007.

[96] Fernando C. N. Pereira, Naftali Tishby, and Lillian Lee. Distributional clustering of english words. CoRR, abs/cmp-lg/9408011, 1994.

[97] Silvio Peroni, Aldo Gangemi, and Fabio Vitali. Dealing with markup semantics. In Proceedings of the 7th International Conference on Semantic Systems, I-SEMANTICS 2011, Graz, Austria, September 7-9, 2011, pages 111–118, 2011.

[98] Emanuele Pianta, Christian Girardi, and Roberto Zanoli. The textpro tool suite. In Proceedings of the International Conference on Language Resources and Evaluation, LREC 2008, 26 May - 1 June 2008, Marrakech, Morocco, 2008.

[99] Aleksander Pohl. Classifying the wikipedia articles into the opencyc taxonomy. In Proceedings of the Web of Linked Entities Workshop in conjunction with the 11th International Semantic Web Conference, 2012.

[100] Carl Pollard and Ivan A. Sag. Information-based Syntax and Semantics: Vol. 1: Fundamentals. Center for the Study of Language and Information, Stanford, CA, USA, 1988.

[101] Carl Pollard and Ivan A. Sag. Head-driven Phrase Structure Grammar. University of Chicago Press, 1994.

[102] Simone Paolo Ponzetto and Michael Strube. Deriving a large scale taxonomy from wikipedia. In Proceedings of AAAI, volume 7, pages 1440–1445, 2007.

[103] Simone Paolo Ponzetto and Michael Strube. Taxonomy induction based on a collaboratively built knowledge repository. Artificial Intelligence, 175(9-10):1737–1756, 2011.

[104] Valentina Presutti, Sergio Consoli, Andrea Giovanni Nuzzolese, Diego Reforgiato Recupero, Aldo Gangemi, Ines Bannour, and Haïfa Zargayouna. Uncovering the semantics of wikipedia pagelinks. In Krzysztof Janowicz, Stefan Schlobach, Patrick Lambrix, and Eero Hyvönen, editors, Knowledge Engineering and Knowledge Management - 19th International Conference, EKAW 2014, Linköping, Sweden, November 24-28, 2014. Proceedings, volume 8876 of Lecture Notes in Computer Science, pages 413–428. Springer, 2014.

[105] Valentina Presutti, Francesco Draicchio, and Aldo Gangemi. Knowledge extraction based on discourse representation theory and linguistic frames. In Knowledge Engineering and Knowledge Management, pages 114–129. Springer, 2012.

[106] Valentina Presutti, Andrea Giovanni Nuzzolese, Sergio Consoli, Diego Reforgiato Recupero, and Aldo Gangemi. From hyperlinks to semantic web properties using open knowledge extraction. Semantic Web, Preprint(Preprint):1–28, 2016.

[107] Vasin Punyakanok, Dan Roth, and Wen-tau Yih. The importance of syntactic parsing and inference in semantic role labeling. Computational Linguistics, 34(2):257–287, 2008.

[108] Jacobo Rouces, Gerard de Melo, and Katja Hose. Framebase: Representing n-ary relations using semantic frames. In Fabien Gandon, Marta Sabou, Harald Sack, Claudia d'Amato, Philippe Cudré-Mauroux, and Antoine Zimmermann, editors, The Semantic Web. Latest Advances and New Domains - 12th European Semantic Web Conference, ESWC 2015, Portoroz, Slovenia, May 31 - June 4, 2015. Proceedings, volume 9088 of Lecture Notes in Computer Science, pages 505–521. Springer, 2015.

[109] Jacobo Rouces, Gerard de Melo, and Katja Hose. Integrating heterogeneous knowledge with framebase. Semantic Web, 2016. Under Review.

[110] J. Ruppenhofer, C. Sporleder, R. Morante, C. Baker, and M. Palmer. Semeval-2010 task 10: Linking events and their participants in discourse. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 45–50. Association for Computational Linguistics, 2010.

[111] Josef Ruppenhofer, Michael Ellsworth, Miriam R.L. Petruck, Christopher R. Johnson, and Jan Scheffczyk. FrameNet II: Extended Theory and Practice. Available at http://framenet.icsi.berkeley.edu/book/book.html, 2006.

[112] Thomas Schmidt. The kicktionary revisited. In Angelika Storrer, Alexander Geyken, Alexander Siebert, and Kay-Michael Würzner, editors, Text Resources and Lexical Knowledge. Selected Papers from the 9th Conference on Natural Language Processing, KONVENS 2008, Berlin, Germany, pages 239–251. Mouton de Gruyter, 2008.

[113] D. Shahaf and C. Guestrin. Connecting the dots between news articles. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 623–632. ACM, 2010.

[114] P. Shoval, V. Maidel, and B. Shapira. An ontology-content-based filtering method. International Journal on Information Theories and Applications, 15(4):303–314, 2008.

[115] Daniel Dominic Sleator and David Temperley. Parsing english with a link grammar. CoRR, abs/cmp-lg/9508004, 1995.

[116] Rion Snow, Brendan O’Connor, Daniel Jurafsky, and Andrew Y Ng. Cheap and fast—but is it good?: evaluating non-expert annotations for natural language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 254–263. Association for Computational Linguistics, 2008.

[117] Fabian M. Suchanek, Georgiana Ifrim, and Gerhard Weikum. Leila: Learning to extract information by linguistic analysis. In Proceedings of the 2nd Workshop on Ontology Learning and Population: Bridging the Gap between Text and Knowledge, pages 18–25, 2006.

[118] Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. Yago: a core of semantic knowledge. In Proceedings of the 16th International Conference on World Wide Web, pages 697–706. ACM, 2007.

[119] Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, and Christopher D. Manning. Multi-instance multi-label learning for relation extraction. In Jun'ichi Tsujii, James Henderson, and Marius Pasca, editors, Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL 2012, July 12-14, 2012, Jeju Island, Korea, pages 455–465. ACL, 2012.

[120] Thomas Pellissier Tanon, Denny Vrandecic, Sebastian Schaffert, Thomas Steiner, and Lydia Pintscher. From freebase to wikidata: The great migration. In Jacqueline Bourdeau, Jim Hendler, Roger Nkambou, Ian Horrocks, and Ben Y. Zhao, editors, Proceedings of the 25th International Conference on World Wide Web, WWW 2016, Montreal, Canada, April 11 - 15, 2016, pages 1419–1428. ACM, 2016.

[121] A. Thalhammer, T. Ermilov, K. Nyberg, A. Santoso, and J. Domingue. Moviegoer - semantic social recommendations and personalized location-based offers. In Proceedings of the 10th International Semantic Web Conference, ISWC2011, October 2011.

[122] E.G. Toms. Serendipitous information retrieval. In Proceedings of the First DELOS Network of Excellence Workshop on Information Seeking, Searching and Querying in Digital Libraries, pages 11–12, 2000.

[123] Sara Tonelli, Claudio Giuliano, and Kateryna Tymoshenko. Wikipedia- based WSD for multilingual frame annotation. Artificial Intelligence, 194:203–221, 2013.

[124] Luis Von Ahn, Mihir Kedia, and Manuel Blum. Verbosity: a game for collecting common-sense facts. In Proceedings of the SIGCHI conference on Human Factors in computing systems, pages 75–78. ACM, 2006.

[125] Denny Vrandecic and Markus Krötzsch. Wikidata: a free collaborative knowledgebase. Commun. ACM, 57(10):78–85, 2014.

[126] Fei Wu and Daniel S. Weld. Open information extraction using wikipedia. In Jan Hajic, Sandra Carberry, and Stephen Clark, editors, ACL 2010, Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, July 11-16, 2010, Uppsala, Sweden, pages 118–127. The Association for Computer Linguistics, 2010.

[127] Xuchen Yao and Benjamin Van Durme. Information extraction over structured data: Question answering with freebase. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014, June 22-27, 2014, Baltimore, MD, USA, Volume 1: Long Papers, pages 956–966. The Association for Computer Linguistics, 2014.

[128] Shenghuo Zhu, Kai Yu, Yun Chi, and Yihong Gong. Combining content and link for classification using matrix factorization. In Wessel Kraaij, Arjen P. de Vries, Charles L. A. Clarke, Norbert Fuhr, and Noriko Kando, editors, SIGIR 2007: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Amsterdam, The Netherlands, July 23-27, 2007, pages 487–494. ACM, 2007.

[129] Cai-Nicolas Ziegler, Georg Lausen, and Lars Schmidt-Thieme. Taxonomy-driven computation of product recommendations. In Proceedings of the thirteenth ACM international conference on Information and knowledge management, CIKM '04, pages 406–415, New York, NY, USA, 2004. ACM.

[130] Cäcilia Zirn, Vivi Nastase, and Michael Strube. Distinguishing between instances and classes in the wikipedia taxonomy. In Proceedings of the 5th European Semantic Web Conference, pages 376–387, 2008.
