Harvesting Knowledge from Web Data and Text
Total Page:16
File Type:pdf, Size:1020Kb
Semantic Knowledge Bases from Web Sources IJCAI 2011 Tutorial Hady W. Lauw1, Ralf Schenkel2, Fabian Suchanek3, Martin Theobald4, and Gerhard Weikum4 1Institute for Infocomm Research, Singapore 2Saarland University, Saarbruecken 3INRIA Saclay, Paris 4Max Planck Institute for Informatics, Saarbruecken All slides for download… http://www.mpi-inf.mpg.de/yago- naga/IJCAI11-tutorial/ Outline • Part I – Machine Knowledge & Intelligent Applications • Part II – Knowledge Representation & Public Knowledge Bases • Part III – Extracting Knowledge • Part IV – Ranking and Searching • Part V – Linked Data • Part VI – Conclusion and Outlook Goal: Turn Web into Knowledge Base Source: G. Weikum, G., Kasneci, M. Ramanath, F. Suchanek: DB & IR methods for knowledge discovery. Communications of the ACM 52(4), 2009 comprehensive DB of human knowledge • everything that Wikipedia knows • everything machine-readable • capturing entities, classes, relationships Approach: Knowledge Harvesting Automatically Constructed Knowledge Bases: Politician Political Party Angela Merkel CDU • Mio‘s of individual entities Karl-Theodor zu Guttenberg CDU PoliticalParty SpokespersonChristoph Hartmann FDP • 100 000‘s of classes/types CDU Philipp …Wachholz Die Grünen Claudia Roth • 100 Mio‘s of facts Politician Position Facebook FriendFeed Angela Merkel Chancellor Germany Software AG IDS Scheer • 100‘s of relation types Karl-Theodor zu Guttenberg Minister of Defense Germany … Christoph Hartmann Minister of Economy Saarland … Company AcquiredCompany Google YouTube Company CEO Yahoo Overture Google Eric Schmidt Facebook FriendFeed Movie YahooReportedRevenueOverture Software AG IDS Scheer Avatar Facebook$ 2,718,444,933FriendFeed … The Reader Software$ 108,709,522 AG IDS Scheer Actor FacebookAward … FriendFeed Christoph Waltz Oscar Software AG IDS Scheer Sandra Bullock Oscar … Sandra Bullock Golden Raspberry … SUMO WikiNet IWP Cyc YAGO-NAGA TextRunner WikiTaxonomy ReadTheWeb Knowledge for Intelligence • entity recognition & disambiguation • understanding natural language & speech • knowledge services & reasoning for semantic apps (e.g. deep question answering) • semantic search: precise answers to advanced queries (by scientists, students, journalists, analysts, etc.) Swedish king‘s wife when Greta Garbo died? FIFA 2010 finalists who played in a Champions League final? Politicians who are also scientists? Relationships between Max Planck, Angela Merkel, Jim Gray, and the Dalai Lama? Drugs for treating Alzheimer? Influenza... vaccines for teens with high blood pressure? Application 1: Semantic Queries on Web www.google.com/squared/ Application 1: Semantic Queries on Web www.google.com/squared/ Application 1: Faceted Search H. Bast et al.: ESTER: Efficient Search on Text, Entities, and relations, SIGIR’07 Application 2: Question Answering (QA) • KB from Wikipedia and user edits • KB of curated, structured data • translation of natural-language • not just facts, but also questions into KB queries algorithms & models Application 2: Deep QA in NL William Wilkinson's "An Account of the Principalities of Wallachia and Moldavia" inspired this author's most famous novel This town is known as "Sin City" & its downtown is "Glitter Gulch" As of 2010, this is the only former Yugoslav republic in the EU 99 cents got me a 4-pack of Ytterlig coasters from this Swedish chain question knowledge classification & back-ends decomposition YAGO D. Ferrucci et al.: Building Watson: An Overview of the DeepQA Project. AI Magazine, Fall 2010. www.ibm.com/innovation/us/watson/index.htm Application 3: Machine Reading It’s about the disappearance forty years ago of Harriet Vanger, a young scion of one of the wealthiest families in Sweden, and about her uncle, determined to know the truth about what he believes was her murder. Blomkvist visits Henrik Vanger at hissame estate on the sametiny island of Hedeby. The old man draws Blomkvist in by promising solid evidence against Wennerström. same Blomkvist agrees to spend a year writing the Vanger family history as a cover for the real assignment: the disappearance of Vanger'sowns niece Harriet some 40 years earlier. Hedeby is home to several generations of Vangers, all part owners in Vanger Enterprises. Blomkvist becomesuncleOf acquainted with the membershires of the extended Vanger family, most of whom resent his presence. He does, however, start a short lived affair with Cecilia, the niece of Henrik. enemyOf same affairWith After discovering that Salander has hacked into his computer, he persuadessame her to assist him with research. They eventuallyaffairWith become lovers, but Blomkvist has trouble getting close to Lisbeth who treats virtually everyone she meets with hostility. Ultimately the two discover that Harriet's brother Martin, CEO of Vanger Industries, is secretly a serial killer. A 24-year-old computer hacker sporting an assortment of tattoos and body piercings supports herself by doing deep backgroundheadOf investigations for Dragan Armansky, who, in turn,same worries that Lisbeth Salander is “the perfect victim for anyone who wished her ill." O. Etzioni, M. Banko, M.J. Cafarella: Machine Reading, AAAI’06 T. Mitchell et al.: Populating the Semantic Web by Macro-Reading Internet Text, ISWC’09 Named-Entity Disambiguation Harry fought with you know who. He defeats the dark lord. Dirty Harry Prince Harry The Who Lord Harry Potter of England (band) Voldemort Three NLP tasks: 1) named-entity detection: segment & label by HMM or CRF (e.g. Stanford NER tagger) 2) co-reference resolution: link to preceding NP (trained classifier over linguistic features) 3) named-entity disambiguation: map each mention (name) to canonical entity (entry in KB) Mentions, Meanings, Mappings Agnetha Qvarnström Agnetha, Björn, Agnetha Fältskog Benny, Benny Goodman and Anni-Frid Benny Andersson were Sweden‘s ? most successful Battle of Waterloo pop music group. KB Waterloo Station Their greatest hits were Waterloo Waterloo (song) Agnetha means Agnetha Fältskog and Mamma Mia. Agnetha means Agnetha Munther Agnetha means Agnetha Qvarnström Entities Mentions Björn means Björn Borg (meanings) Björn means Björn Ulvaeus (surface names) Björn means Björn the Viking Benny means Benny Goodman Benny means Benny Andersson Waterloo means Battle of Waterloo Waterloo means Waterloo (Ontario) Waterloo means Waterloo Station Waterloo means Waterloo (song) … D5… Overview May… 30, 2011 Mention-Entity Graph weighted undirected graph with two types of nodes Agnetha, Agnetha Q. Björn, Benny, Agnetha F. and Anni-Frid Benny G. were Sweden‘s Benny A. most successful pop music group. B. Waterloo Their greatest hits Waterloo St. were Waterloo and Mamma Mia. Waterloo (s) Popularity Similarity KB+Stats (m,e): (m,e): • freq(m,e|m) • cos/Dice/KL • length(e) (context(m), • #links(e) context(e)) Mention-Entity Graph weighted undirected graph with two types of nodes Agnetha, Agnetha Q. Björn, Benny, Agnetha F. and Anni-Frid Benny G. were Sweden‘s Benny A. most successful pop music group. B. Waterloo Their greatest hits Waterloo St. were Waterloo and Mamma Mia. Waterloo (s) Popularity Similarity KB+Stats Coherence (m,e): (m,e): (e,e‘): • freq(m,e|m) • cos/Dice/KL • dist(types) • length(e) (context(m), • overlap(links) • #links(e) context(e)) • overlap (anchor words)16 / 20 Mention-Entity Graph weighted undirected graph with two types of nodes Agnetha, Swedish female singers Agnetha Q. people from Jönköping Björn, singers Benny, Agnetha F. musicians and Anni-Frid Benny G. Swedish songwriters were Sweden‘s people from Stockholm Benny A. composers most successful musicians pop music group. B. Waterloo Their greatest hits ABBA songs Waterloo St. #1 chart singles were Waterloo songs and Mamma Mia. Waterloo (s) artifacts Popularity Similarity KB+Stats Coherence (m,e): (m,e): (e,e‘): • freq(m,e|m) • cos/Dice/KL • dist(types) • length(e) (context(m), • overlap(links) • #links(e) context(e)) • overlap (anchor words)17 / 20 Mention-Entity Graph weighted undirected graph with two types of nodes Agnetha, http://.../wiki/ABBA Agnetha Q. http://.../wiki/Anni-Frid_Lyngstad Björn, http://.../wiki/Jönköping Benny, Agnetha F. http://.../wiki/Eurovision_Song_Contest and Anni-Frid Benny G. http://.../wiki/ABBA were Sweden‘s http://.../wiki/Anni-Frid_Lyngstad Benny A. http://.../wiki/Mamma_Mia! most successful http://.../wiki/Agnetha_Fältskog pop music group. B. Waterloo Their greatest hits http://.../wiki/ABBA Waterloo St. http://.../wiki/Eurovision_Song_Contest were Waterloo http://.../wiki/Mamma_Mia! and Mamma Mia. Waterloo (s) Popularity Similarity KB+Stats Coherence (m,e): (m,e): (e,e‘): • freq(m,e|m) • cos/Dice/KL • dist(types) • length(e) (context(m), • overlap(links) • #links(e) context(e)) • overlap (anchor words)18 / 20 Mention-Entity Graph weighted undirected graph with two types of nodes Agnetha, pop group ABBA Agnetha Q. best-selling music artist in history Björn, Melodifestivalen Benny, Agnetha F. The Winner Takes It All and Anni-Frid Benny G. pop group ABBA were Sweden‘s Grammy Award nomination Benny A. Melodifestivalen most successful Mamma Mia! pop music group. B. Waterloo Their greatest hits Agnetha Fältskog Waterloo St. Benny Andersson were Waterloo number-one single in Norway and Mamma Mia. Waterloo (s) Mamma Mia! Popularity Similarity KB+Stats Coherence (m,e): (m,e): (e,e‘): • freq(m,e|m) • cos/Dice/KL • dist(types) • length(e) (context(m), • overlap(links) • #links(e) context(e)) • overlap (anchor words)19 / 20 Joint Mapping 50 30 50 20 30 10 10 90 100 30 20 90 80 100 90 5 30 • Build mention-entity graph