Knowledge on the Web: Towards Robust and Scalable Harvesting of Entity-Relationship Facts
Gerhard Weikum
Max Planck Institute for Informatics
http://www.mpi-inf.mpg.de/~weikum/

Acknowledgements

Vision: Turn Web into Knowledge Base
comprehensive DB of human knowledge
• everything that Wikipedia knows
• machine-readable
• capturing entities, classes, relationships
Diagram labels: knowledge assets (Semantic Web), fact extraction (Statistical Web), communities (Social Web)
Source: DB & IR methods for knowledge discovery. Communications of the ACM 52(4), 2009

Knowledge as Enabling Technology
• entity recognition & disambiguation
• understanding natural language & speech
• knowledge services & reasoning for semantic apps
• semantic search: precise answers to advanced queries (by scientists, students, journalists, analysts, etc.)
Example queries:
German chancellor when Angela Merkel was born?
Japanese computer science institutes?
Politicians who are also scientists?
Enzymes that inhibit HIV?
Influenza drugs for pregnant women?
...

Knowledge Search on the Web (1)
Query: sushi ingredients?
Results: Nori seaweed, Ginger, Tuna, Sashimi, ..., Unagi
http://www.google.com/squared/

Knowledge Search on the Web (1)
Query: Japanese computer science institutes?
http://www.google.com/squared/

Knowledge Search on the Web (2)
Query: politicians who are also scientists?
?x isa politician . ?x isa scientist
Results: Benjamin Franklin, Zbigniew Brzezinski, Alan Greenspan, Angela Merkel, …
http://www.mpi-inf.mpg.de/yago-naga/

Knowledge Search on the Web (2)
Query: politicians who are married to scientists?
?x isa politician . ?x isMarriedTo ?y . ?y isa scientist
Results (3):
[ Adrienne Clarkson, Stephen Clarkson ],
[ Raúl Castro, Vilma Espín ],
[ Jeannemarie Devolites Davis, Thomas M. Davis ]
http://www.mpi-inf.mpg.de/yago-naga/

Knowledge Search on the Web (3)
http://www-tsujii.is.s.u-tokyo.ac.jp/medie/

Take-Home Message
"Information is not Knowledge. Knowledge is not Wisdom. Wisdom is not Truth. Truth is not Beauty. Beauty is not Music. Music is the best."
(Frank Zappa, jazz & rock musician, 1940 – 1993)
"If music was invented 20 years ago [when the Web was created], we'd all be playing one-string instruments."
(Udi Manber, VP Engineering, Google)
→ extract facts from Web sources
→ organize them in an automatically built knowledge base
→ answer questions in terms of entities and relations

Related Work
Areas named on the slide: ontologies, fact extraction, entity search, statistical ranking, online communities, question answering, spanning the Semantic Web, the Statistical Web, and the Social Web.
Systems: Yago-Naga, Text2Onto, Kylin, Powerset, ReadTheWeb, KOG, Hakia, Avatar, Cyc, UIMA, kosmix, KnowItAll, TextRunner, WolframAlpha, SWSE, StatSnowball, sig.ma, EntityCube, DBpedia, Cimple, TrueKnowledge, DBlife, Freebase, GoogleSquared, Answers, START

Outline
✓ What and Why
Building a Large Knowledge Base
Consistent Growth of the Knowledge Base
Adding Multimodal Knowledge
Challenges: Scope, Scale, Robustness
...

Information Extraction (IE): Text to Relations

Source text:
Max Karl Ernst Ludwig Planck was born in Kiel, Germany, on April 23, 1858, the son of Julius Wilhelm and Emma (née Patzig) Planck. Planck studied at the Universities of Munich and Berlin, where his teachers included Kirchhoff and Helmholtz, and received his doctorate of philosophy at Munich in 1879. He was Privatdozent in Munich from 1880 to 1885, then Associate Professor of Theoretical Physics at Kiel until 1889, in which year he succeeded Kirchhoff as Professor at Berlin University, where he remained until his retirement in 1926. Afterwards he became President of the Kaiser Wilhelm Society for the Promotion of Science, a post he held until 1937. He was also a gifted pianist and is said to have at one time considered music as a career. Planck was twice married. Upon his appointment, in 1885, to Associate Professor in his native town Kiel he married a friend of his childhood, Marie Merck, who died in 1909. He remarried her cousin Marga von Hösslin. Three of his children died young, leaving him with two sons.

Extracted facts (with confidence):
bornOn (Max Planck, 23 April 1858) [0.99]
bornIn (Max Planck, Kiel) [0.9]
type (Max Planck, physicist) [0.9]
advisor (Max Planck, Kirchhoff) [0.6]
advisor (Max Planck, Helmholtz) [0.6]
AlmaMater (Max Planck, TU Munich) [0.5]
plays (Max Planck, piano) [0.7]
spouse (Max Planck, Marie Merck) [0.9]
spouse (Max Planck, Marga Hösslin) [0.8]

Resulting relations:
Person            BirthDate     BirthPlace   ...
Max Planck        4/23, 1858    Kiel
Albert Einstein   3/14, 1879    Ulm
Mahatma Gandhi    10/2, 1869    Porbandar

Person            Award
Max Planck        Nobel Prize in Physics
Marie Curie       Nobel Prize in Physics
Marie Curie       Nobel Prize in Chemistry

IE builds data space (with uncertain data)
• confidence < 1 (sometimes << 1)
• knowledge base from many sources
• high computational cost

IE: combine NLP, pattern matching, statistical learning

IE for Knowledge Harvesting
• YAGO knowledge base from Wikipedia infoboxes & categories and integration with WordNet taxonomy
• NAGA search on RDF graph with entity-relationship LM for ranking

{{Infobox scientist
| name = Max Planck
| birth_date = {{birth date|1858|4|23|mf=y}}
| birth_place = [[Kiel]], [[Holstein]]
| death_date = {{death date and age|mf=yes|1947|10|4}}
| death_place = [[Göttingen]], [[West Germany]]
| nationality = [[Germany|German]]
| field = [[Physics]]
| alma_mater = [[Ludwig-Maximilians-Universität München]]
| work_institutions = [[University of Kiel]]<br /> [[Humboldt University of Berlin|University of Berlin]]<br /> [[University of Göttingen]]<br /> [[Kaiser-Wilhelm-Gesellschaft]]<br />
| doctoral_advisor = [[Alexander von Brill]]
| doctoral_students = [[Gustav Ludwig Hertz]]<br /> …
| known_for = [[Planck constant]]<br /> [[Planck postulate]]<br />

YAGO Knowledge Base (F. Suchanek et al.: WWW'07)
YAGO: 40 million RDF triples (entity1-relation-entity2, subject-predicate-object), accuracy ≈ 95%

System       Entities      Facts
KnowItAll    30 000 (single value in the source; column unclear)
SUMO         20 000        60 000
WordNet      120 000       80 000
Cyc          300 000       5 million
TextRunner   n/a           8 million
YAGO         2 million     19 million
DBpedia      2 million     103 million
Freebase     ???           156 million
Wolfram α    ???           > 1 trillion

Figure (class level): excerpt of the YAGO graph. Classes Entity, Organization, Person, Location, Scientist, Politician, Country, State, City, Biologist, Physicist are connected by subclass edges; instances attach to classes by instanceOf edges.
Figure (instance level): excerpt of the YAGO graph. Entities: Max_Planck, Erwin_Planck, Angela Merkel, Max_Planck Society, Kiel, Schleswig-Holstein, Germany, Nobel Prize. Literals: Apr 23, 1858; Oct 4, 1947; Oct 23, 1944. Edge labels: instanceOf, locatedIn, bornIn, bornOn, diedOn, FatherOf, hasWon, citizenOf. Surface strings attached by means edges (weights 0.9 and 0.1 appear on two of them): "Max Planck", "Max Karl Ernst Ludwig Planck", "Angela Merkel", "Angela Dorothea Merkel".

Leveraging YAGO for Entity Extraction
Existing knowledge base boosts entity detection & disambiguation (similarity of string-in-context to target entity-in-context)

Outline
✓ What and Why
✓ Building a Large Knowledge Base
Consistent Growth of the Knowledge Base
Adding Multimodal Knowledge
Challenges: Scope, Scale, Robustness
...

Growing the Knowledge Base
Diagram labels: Wikipedia + WordNet, Web sources, YAGO Core Extractors, YAGO Core Checker, YAGO Core, YAGO Gatherer, Hypotheses, YAGO Scrutinizer, Growing YAGO
Annotations: knows ≈ all entities; focus on facts

Pattern-Based Harvesting (Dipre, Snowball, Text2Onto, Leila, StatSnowball, etc.)
Facts & Fact Candidates:
(Hillary, Bill)
(Carla, Nicolas)
(Angelina, Brad)
(Victoria, David)
(Hillary, Bill)
(Carla, Nicolas)
(Angelina, Brad)
(Yoko, John)
(Kate, Pete)
(Carla, Benjamin)
(Larry, Google)
Patterns:
X and her husband Y
X and Y on their honeymoon
X and Y and their children
X has been dating with Y
X loves Y
…
• good for recall
• noisy, drifting
• not robust enough

SOFIE: Self-Organizing Framework for IE (F. Suchanek et al.: WWW'09)
Integrate methods:
• textual/linguistic pattern-based IE with statistics
  seeds → patterns → facts → patterns → ...
  (Hillary, Bill) → X and her husband Y → (Carla, Nicolas), (Carla, Mick) →
• declarative rule-based IE with constraints
  functional dependencies: marriedTo is a function
  inclusion dependencies: presidentOf ⊆ citizenOf
Address problems:
• pattern selection ("and her husband", "has been dating", ...)
• reasoning on mutual consistency of facts
• entity disambiguation ("Merkel" → AngelaMerkel, MaxMerkel, ...; "MPI" → MaxPlanckInstitute, MessagePassingInterface)
Unified solution by Weighted Max-Sat solver (high accuracy and much faster than MCMC for probabilistic graphical models)

SOFIE Example
Pattern occurrences (with counts):
occurs (X and her husband Y, Hillary, Bill) [100]
occurs (X Y and their children, Hillary, Bill) [40]
occurs (X and her husband Y, Victoria, David) [60]
occurs (X dating with Y, Rebecca, David) [20]
occurs (X dating with Y, Victoria, Tom) [10]

Facts and hypotheses:
Spouse (HillaryClinton, BillClinton)
Spouse (CarlaBruni, NicolasSarkozy)
Spouse (Victoria, David)
Spouse (Rebecca, David)
Spouse (Victoria, Tom)
expresses (and her husband, Spouse)
expresses (and their children, Spouse)
expresses (dating with, Spouse)

Rules:
∀ x,y,z,w: R(x,y) ∧ R(x,z) ⇒ y=z (alt., for distinct y and z: ¬R(x,y) ∨ ¬R(x,z))
∀ x,y,z,w: R(x,y) ∧ R(w,y) ⇒ x=w (alt., for distinct x and w: ¬R(x,y) ∨ ¬R(w,y))
∀ x,y: R(x,y) ⇒ R(y,x)
∀ p,x,y: occurs (p, x, y) ∧ expresses (p, R) ⇒ R (x, y)
∀ p,x,y: occurs (p, x, y) ∧ R (x, y) ⇒ expresses (p, R)

Grounded rule instances:
Spouse (Victoria, David) ⇒ ¬Spouse (Victoria, Tom)
Spouse (Victoria, David) ⇒ ¬Spouse (Rebecca, David)
...
occurs (husband, Victoria, David) ∧ expresses (husband, Spouse) ⇒ Spouse (Victoria, David)
occurs (dating, Rebecca, David) ∧ expresses (dating, Spouse) ⇒ Spouse (Rebecca, David)
…
occurs (husband, Victoria, David) ∧ Spouse (Victoria, David) ⇒ expresses (husband, Spouse)
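
Sketch: the SOFIE example as a weighted Max-Sat instance
The grounded rules from the example can be read as weighted clauses over truth values for the Spouse and expresses hypotheses. The following is a minimal sketch under stated assumptions, not SOFIE's actual solver: it enumerates all assignments by brute force, which only works for a toy instance. The weighting scheme (each soft clause weighted by the occurrence count of its pattern), the treatment of Spouse (HillaryClinton, BillClinton) as an already known fact, and all Python names are assumptions made for illustration.

```python
from itertools import product

# Hypotheses whose truth value the solver must decide (names are illustrative).
S_VD = "Spouse(Victoria,David)"
S_RD = "Spouse(Rebecca,David)"
S_VT = "Spouse(Victoria,Tom)"
E_H = "expresses(and her husband,Spouse)"
E_D = "expresses(dating with,Spouse)"
VARIABLES = [S_VD, S_RD, S_VT, E_H, E_D]


def implies(premise, conclusion):
    """Clause form of premise => conclusion: (not premise) or conclusion."""
    return [(premise, False), (conclusion, True)]


def not_both(a, b):
    """Clause form of the functional constraint: (not a) or (not b)."""
    return [(a, False), (b, False)]


# Soft clauses, weighted by occurrence counts (an assumed weighting scheme).
SOFT = [
    # Spouse(Hillary, Bill) is assumed known and the pattern occurs 100 times,
    # so "occurs and R(x,y) => expresses" reduces to a unit clause.
    (100, [(E_H, True)]),
    (60, implies(E_H, S_VD)),  # occurs and expresses => fact
    (60, implies(S_VD, E_H)),  # occurs and fact => expresses
    (20, implies(E_D, S_RD)),
    (20, implies(S_RD, E_D)),
    (10, implies(E_D, S_VT)),
    (10, implies(S_VT, E_D)),
]

# Hard clauses: Spouse is functional in both arguments.
HARD = [not_both(S_VD, S_RD), not_both(S_VD, S_VT)]


def satisfied(clause, assignment):
    """A clause holds if at least one of its literals matches the assignment."""
    return any(assignment[var] == wanted for var, wanted in clause)


def solve():
    """Exhaustive weighted Max-Sat: maximise satisfied soft weight, obey hard clauses."""
    best_weight, best = -1, None
    for values in product([False, True], repeat=len(VARIABLES)):
        assignment = dict(zip(VARIABLES, values))
        if not all(satisfied(clause, assignment) for clause in HARD):
            continue
        weight = sum(w for w, clause in SOFT if satisfied(clause, assignment))
        if weight > best_weight:
            best_weight, best = weight, assignment
    return best_weight, best


best_weight, best = solve()
accepted = sorted(var for var, value in best.items() if value)
print(best_weight, accepted)
```

Under these assumptions the best assignment accepts Spouse (Victoria, David) and the "and her husband" pattern, and rejects the two hypotheses supported only by "dating with". Enumeration is exponential in the number of hypotheses, so a realistic instance needs an approximate Max-Sat algorithm instead.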