Building Domain-Specific Search Engines for Investigative Decision Support

Knowledge Graph Completion

Introduction and motivation
We have our ‘constructed’ knowledge graph, now what?
[Figure label: freebase: Seattle]

Introduction and motivation
Problem 1: Wrong/missing triples

Introduction and motivation
Problem 2: Many nodes refer to the same underlying entity

For Web extractions, noise is inevitable
• Thousands of web domains
• Many page formats
• Distracting & irrelevant content
• Purposeful obfuscation
• Poor grammar & spelling
• Tables
To reach its potential, a constructed KG must be completed or identified

Noise Analysis
• Extractors found to offer a collective tradeoff between multiple dimensions
• Noise is rarely ‘random’!

                  Glossary   Regex   Landmark     CRF    NER
  Easy to define  4          2       4            4      4
  Site coverage   All        All     Short Tail   All    All
  Precision       2-3        3-4     4            2-3    3
  Recall          3-4        2       1            2      1

ENTITY RESOLUTION

Definitions and alternate names
• Common sense:
  – Which entities refer to the same thing?
• Slightly more formal:
  – Which mentions (aka records, instances, nodes, surface strings…) refer to the same underlying entity?
• Rigorous mathematical/logical definition
  – Doesn’t exist, or unknown! Just like other hard AI problems...
• Why try to solve the problem aka why is it a problem?

Applications: A Web of Linked ‘Data’

Applications: Schema.org
▪ Schema.org is an RDF ontology from which triples (with Web-dereferenceable URIs) can be embedded in HTML pages
http://schema.org/

Applications: Google Knowledge Graph
https://developers.google.com/knowledge-graph/

SUB-COMMUNITIES

Entity Linking/Canonicalization
• Name of an entity (such as a city or location) not enough to resolve ambiguity
• Use Geonames knowledge base to canonicalize entity using machine learning and text features

Co-reference Resolution

Entity Resolution (what we’ll be covering)
• Itself has many sub-communities and approaches
• Because of flexible representations (compared to databases or strict models like OWL), KG-ER systems tend to be community-agnostic

STANDARD ER ARCHITECTURE

Entity Resolution is fundamentally non-linear
• Theoretically quadratic in the number of nodes, even if ‘resolution rule’ was known
• In practice, number of ‘duplicates’ tends to grow linearly, and duplicates overlap in non-trivial ways
• How to devise efficient algorithms? 50 years of research has agreed on a two-step solution
[Figure: Knowledge graph → Execute blocking → Candidate set → Execute similarity → Resolved entities]

Blocking
• Key idea is to use a cheap heuristic that efficiently clusters approximately similar entities into (possibly overlapping) blocks
[Figure: Apply blocking key, e.g. Tokens(LastName) → Blocks → Generate candidate set (12 pairs), apply similarity function on each pair. ‘Exhaustive’ set: C(10, 2) = 45 pairs]

Aside: some blocks have skewed size...
• Property of real-world data (Zipf distribution, power laws...)
• How to address data skew?
• Apply blocking methods with guarantees
• May lose some recall in the process
Example: Sorted Neighborhood aka merge-purge
  – Use blocking key as ‘sorting’ key
  – Slide a window of constant size (w) over sorted nodes
  – Only pairs of nodes within window are paired, added to candidate set
Other methods: block purging, canopies...
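
The two candidate-generation ideas on the last two slides, token blocking on a key such as Tokens(LastName) and Sorted Neighborhood with a window of size w, can be sketched as below. This is a minimal illustration, not a reference implementation: the toy records, the `last_name` field, and the function names are all hypothetical, and the window is taken to contain w records, so each record is paired with the w - 1 records that follow it in sorted order.

```python
from collections import defaultdict
from itertools import combinations


def token_blocking(records, key):
    """Put each record id into one block per token of its blocking key (blocks may overlap)."""
    blocks = defaultdict(set)
    for rec_id, rec in records.items():
        for token in key(rec).lower().split():
            blocks[token].add(rec_id)
    return blocks


def candidate_pairs(blocks):
    """Pair up records that share at least one block; each pair is kept once."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def sorted_neighborhood(records, key, w):
    """Sort ids by blocking key, slide a window of w records, pair ids inside the window."""
    ordered = sorted(records, key=lambda rec_id: key(records[rec_id]).lower())
    pairs = set()
    for i in range(len(ordered)):
        for j in range(i + 1, min(i + w, len(ordered))):
            pairs.add(tuple(sorted((ordered[i], ordered[j]))))
    return pairs


# Hypothetical toy data, invented for illustration only.
records = {
    1: {"last_name": "Smith"},
    2: {"last_name": "smith"},
    3: {"last_name": "Smith Jones"},
    4: {"last_name": "Jones"},
    5: {"last_name": "Nguyen"},
    6: {"last_name": "Nguyen"},
}


def last_name(rec):
    return rec["last_name"]


exhaustive = len(records) * (len(records) - 1) // 2  # n choose 2
blocked = candidate_pairs(token_blocking(records, last_name))
windowed = sorted_neighborhood(records, last_name, w=2)

print("exhaustive pairs:", exhaustive)
print("token blocking  :", sorted(blocked))
print("sorted nbhd w=2 :", sorted(windowed))
```

On this toy input, token blocking keeps the pair (3, 4) because both records share the token "jones", while Sorted Neighborhood with w=2 drops it, since "smith jones" sorts far from "jones". This is the recall loss the slide mentions; in exchange, the window bounds the candidate set at roughly (w - 1) pairs per record regardless of block skew.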

Similarity/link specification
• Over 50 years of research on what makes for a good ‘similarity’ function
• Current approach: apply ‘typical’ machine learning workflow to candidate set
• Important to remember that features are extracted from ‘mention pairs’... leads to non-trivial alignment issues
  – Some form of schema-matching almost always attempted in practical systems
  – Some (but not much) work on so-called schema-free similarity
[Figure labels: General; Semantic Web]

Aside: why schema matching?

Feature engineering
...
Open question: how much can representation learning contribute to Entity Resolution?

Similarity: putting it together
• ML model can be supervised, semi-supervised or unsupervised
[Figure: Candidate set → Schema alignment / extract useful information sets → Machine Learning (ML) model → Probability that pair is duplicate]

OUTPUT REPRESENTATION AND HANDLING

From links to clusters
• For perfect links, transitive closure/connected components works
• With imperfect links, effect can be severe
  – One weak link is all it takes to form a giant component
  – Not uncommon in the real world
• More robust clustering methods have to be applied
  – Community detection literature
  – Spectral clustering
  – Many more!
• Some recent work has proposed to explore ER as a micro-clustering problem

From (possibly noisy) clusters to…???
• Surprisingly under-studied problem!
• Should the entities be fused into a single entity? How?
  – Entity linking has a conceptually elegant solution to this problem…
  – …but how to deal with NIL clusters?
• Semantic Web approach
  – Represent individual links as KG triples and leave it at that
  – Entity Name Systems for advanced search/reasoning

BEYOND ENTITY RESOLUTION

By itself, generic ER is unlikely to be enough to sufficiently boost KG quality
• Other things explored in the literature:
• Domain knowledge
  – Collective ER methods have tried to exploit these systematically
• Multi-type Entity Resolution
  – Extremely useful for knowledge graphs, lots more work to be done
• Entity Resolution + Ontologies + IE Confidences:
  – Probabilistic Graphical Models like Probabilistic Soft Logic
• Knowledge graph embeddings
  – Useful for link prediction and triples classification
  – Recall the Microsoft-founded_in-Seattle example earlier

Knowledge graph embeddings/representation learning
• Useful for link prediction/missing relationships/triples classification
• Not clear if it is really better than PSL on noisy KGs
• Not clear how to combine KGEs with domain engineering

Concluding notes
• Entity Resolution (ER) is a hard problem for machines, may be AI-complete
  – It’s ‘easy’ for us because we’re so good at it
  – Not clear what will achieve the next breakthrough in ER
• Essential to attempt a solution if KGs are semi-automatically constructed from Web data
  – Quality doesn’t have to be perfect, as we showed earlier with KG search
• Wealth of solutions but can be broken down into standard components
  – Blocking, to make ER efficient
  – Similarity, to make ER automatic/adaptive
• Many open questions, especially in relation to new ML models
• More broadly, lots of opportunities for KG completion
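
As a recap of the ‘From links to clusters’ slide, the sketch below turns scored pairs into clusters by transitive closure (connected components via union-find) and shows how a single weak link can merge two otherwise separate clusters. It is a minimal illustration under stated assumptions: the node names, the pair scores (standing in for an ML model's duplicate probabilities), and both thresholds are invented for the example.

```python
def connected_components(nodes, links):
    """Cluster nodes by transitive closure of pairwise links, using union-find."""
    parent = {n: n for n in nodes}

    def find(x):
        # Follow parent pointers to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for n in nodes:
        clusters.setdefault(find(n), set()).add(n)
    return sorted(clusters.values(), key=min)


def links_above(scored_pairs, threshold):
    """Keep the pairs whose (hypothetical) duplicate probability meets the threshold."""
    return [pair for pair, score in scored_pairs.items() if score >= threshold]


# Hypothetical scores, invented for illustration only.
nodes = ["a", "b", "c", "d", "e", "f"]
scored_pairs = {
    ("a", "b"): 0.95,
    ("b", "c"): 0.90,
    ("d", "e"): 0.92,
    ("c", "d"): 0.55,  # the weak link
}

strict = connected_components(nodes, links_above(scored_pairs, 0.9))
loose = connected_components(nodes, links_above(scored_pairs, 0.5))

print("threshold 0.9:", strict)
print("threshold 0.5:", loose)
```

With the stricter threshold the nodes fall into three clusters; admitting the one 0.55 link collapses the first two into a single component. This is why the slide points to more robust clustering methods when links are imperfect.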
