Submitted to Foundations and Trends in Databases

Machine Knowledge: Creation and Curation of Comprehensive Knowledge Bases

Gerhard Weikum, Max Planck Institute for Informatics
Xin Luna Dong, Amazon
Simon Razniewski, Max Planck Institute for Informatics
Fabian Suchanek, Telecom Paris University

August 23, 2021

arXiv:2009.11564v2 [cs.AI] 22 Mar 2021

Abstract

Equipping machines with comprehensive knowledge of the world’s entities and their relationships has been a long-standing goal of AI. Over the last decade, large-scale knowledge bases, also known as knowledge graphs, have been automatically constructed from web contents and text sources, and have become a key asset for search engines. This machine knowledge can be harnessed to semantically interpret textual phrases in news, social media and web tables, and contributes to question answering, natural language processing and data analytics.

This article surveys fundamental concepts and practical methods for creating and curating large knowledge bases. It covers models and methods for discovering and canonicalizing entities and their semantic types and organizing them into clean taxonomies. On top of this, the article discusses the automatic extraction of entity-centric properties. To support the long-term life-cycle and the quality assurance of machine knowledge, the article presents methods for constructing open schemas and for knowledge curation. Case studies on academic projects and industrial knowledge graphs complement the survey of concepts and methods.

Contents

1 What Is This All About
    1.1 Motivation
    1.2 History of Large-Scale Knowledge Bases
    1.3 Application Use Cases
    1.4 Scope and Tasks
        1.4.1 What is a Knowledge Base
        1.4.2 Tasks for KB Creation and Curation
    1.5 Audience
    1.6 Outline

2 Foundations and Architecture
    2.1 Knowledge Representation
        2.1.1 Entities
        2.1.2 Classes (aka Types)
        2.1.3 Properties of Entities: Attributes and Relations
            2.1.3.1 Attributes of Entities
            2.1.3.2 Relations between Entities
            2.1.3.3 Higher-Arity Relations
        2.1.4 Canonicalization
        2.1.5 Logical Invariants
    2.2 Design Space for Knowledge Base Construction
        2.2.1 Requirements
        2.2.2 Input Sources
        2.2.3 Output Scope and Quality
        2.2.4 Methodological Repertoire

3 Knowledge Integration from Premium Sources
    3.1 The Case for Premium Sources
    3.2 Category Cleaning
    3.3 Alignment between Knowledge Sources
    3.4 Graph-based Alignment
    3.5 Discussion
    3.6 Take-Home Lessons

4 KB Construction: Entity Discovery and Typing
    4.1 Problem and Design Space
    4.2 Dictionary-based Entity Spotting
    4.3 Pattern-based Methods
        4.3.1 Patterns for Entity-Type Pairs
        4.3.2 Pattern Learning
    4.4 Machine Learning for Sequence Labeling
        4.4.1 Probabilistic Graphical Models
        4.4.2 Deep Neural Networks
    4.5 Word and Entity Embeddings
    4.6 Ab-Initio Taxonomy Construction
    4.7 Take-Home Lessons

5 Entity Canonicalization
    5.1 Problem and Design Space
        5.1.1 Entity Linking (EL)
        5.1.2 EL with Out-of-KB Entities
        5.1.3 Coreference Resolution (CR)
        5.1.4 Entity Matching (EM)
    5.2 Popularity, Similarity and Coherence Measures
        5.2.1 Mention-Entity Popularity
        5.2.2 Mention-Entity Context Similarity
        5.2.3 Entity-Entity Coherence
    5.3 Optimization and Learning-to-Rank Methods
    5.4 Collective Methods based on Graphs
        5.4.1 Edge-based EL Methods
        5.4.2 Subgraph-based EL Methods
        5.4.3 EL Methods based on Random Walks
        5.4.4 EL Methods based on Probabilistic Graphical Models
    5.5 Methods based on Supervised Learning
        5.5.1 Neural EL
    5.6 Methods for Semi-Structured Data
    5.7 EL Variations and Extensions
    5.8 Entity Matching (EM)
        5.8.1 Blocking
        5.8.2 Matching Rules
        5.8.3 EM Workflow
    5.9 Take-Home Lessons

6 KB Construction: Attributes and Relationships
    6.1 Problem and Design Space
    6.2 Pattern-based and Rule-based Extraction
        6.2.1 Specified Patterns and Rules
            6.2.1.1 Regex Patterns
            6.2.1.2 Inducing Regex Patterns from Examples
            6.2.1.3 Type Checking
            6.2.1.4 Operator-based Extraction Plans
            6.2.1.5 Patterns for Text, Lists and Trees
        6.2.2 Distantly Supervised Pattern Learning
            6.2.2.1 Statement-Pattern Duality
            6.2.2.2 Quality Measures
            6.2.2.3 Scale and Scope
            6.2.2.4 Feature-based Classifiers
            6.2.2.5 Source Discovery
    6.3 Extraction from Semi-Structured Contents
        6.3.1 Distantly Supervised Extraction from DOM Trees
        6.3.2 Web Tables and Microformats
    6.4 Neural Extraction
        6.4.1 Classifiers and Taggers
        6.4.2 Training by Distant Supervision
        6.4.3 Transformer Networks
    6.5 Take-Home Lessons

7 Open Schema Construction
    7.1 Problems and Design Space
    7.2 Open Information Extraction for Predicate Discovery
        7.2.1 Pattern-based Open IE
        7.2.2 Extensions of Pattern-based Open IE
        7.2.3 Neural Learning for Open IE
    7.3 Seed-based Discovery of Properties
        7.3.1 Distant Supervision for Open IE from Text
        7.3.2 Open IE from Semi-Structured Web Sites
        7.3.3 View Discovery in the KB
    7.4 Property Canonicalization
        7.4.1 Paraphrase Dictionaries
            7.4.1.1 Property Synonymy
            7.4.1.2 Property Subsumption
        7.4.2 Clustering of Property Names
        7.4.3 Latent Organization of Properties
    7.5 Take-Home Lessons

8 Knowledge Base Curation
    8.1 KB Quality Assessment
        8.1.1 Quality Metrics
        8.1.2 Quality Evaluation
        8.1.3 Quality Prediction
    8.2 Knowledge Base Completeness
        8.2.1 Local Completeness
        8.2.2 Negative Statements
    8.3 Rules and Constraints