<<

YAGO2: A spaally and temporally enhanced from (J. Hoffart et al, 2012) presented by Gabe Radovsky YAGO2: overview

• temporally/spaally anchored knowledge base • built automacally from Wikipedia, GeoNames, and WordNet • contains 447 million facts about 9.8 million enes • human evaluators judged 97.8% of facts correct The (original) YAGO knowledge base

• introduced in 2007 • automacally constructed from Wikipedia – each arcle in Wikipedia became an enty • about 100 manually defined relaons – e.g. wasBornOnDate, locatedIn, hasPopulaon • used SPO (subject, predicate, object) triples to represent facts – reificaon: every fact given an idenfier, e.g. wasFoundIn(fact, Wikipedia) YAGO2: movaon

• WordNet/other lexical resources: – manually compiled – knows that ‘musician’ is a hyponym of ‘human’; doesn’t know that Leonard Cohen is a musician • Wikipedia/GeoNames – very large collecons of (semi-)structured data – advances in informaon extracon make them easier to mine YAGO2: movaon

• “...current state-of-the-art knowledge bases are mostly blind to the temporal dimension” (29). • e.g. (knowing that Abraham Lincoln was born in 1809 and died in 1865) != (knowing that Abraham Lincoln was alive in 1850) www.wolframalpha.com www.wolframalpha.com YAGO2: contribuon

• top-down ontology “with the goal of integrang enty-relaonship-oriented facts with the spaal and temporal dimensions” (29). • new representaon model: SPOTL tuples – (SPO [subject, predicate, object] + me + locaon) • frameworks for extracng knowledge from structured or unstructured text Extracon architecture for YAGO2

• factual rules – “declarave translaons of all the manually defined excepons and facts that the previous YAGO code contained” (30) • implicaon rules – e.g. if relaon b is a sub-property of relaon a, all instances of b are also instances of a • replacement rules – |“\{\{USA\}\}” replace “[[United States]]” – eliminate Wikipedia administrave categories, e.g. “Arcles to be cleaned up” Giving YAGO a temporal dimension

• YYYY-MM-DD format for dates – YYYY-##-## if only year is known • enes – given a me span • facts – me point for instantaneous events, me span for events with extended duraon • not all enes/facts could be temporally annotated Enes and me

• people – wasBornOnDate, diedOnDate • groups, arfacts – wasCreatedOnDate, wasDestroyedOnDate – some have unbounded end points, e.g. pieces of music, scienfic theories • events – startedOnDate, endedOnDate, happenedOnDate (for punctual events) • enes w/o defined start or end point – e.g. numbers, mythological figures, virus strains – not assigned temporal informaon Facts and me

• facts with an extracted me – ElvisPresley diedOnDate 1977-08-16 • facts with a deduced me – ([ElvisPresley diedIn Memphis] 1977-08-16) • extracon me of facts is also included – e.g. extractedFrom Wikipedia on YYYY-MM-DD Giving YAGO a spaal dimension

• YAGO2 “concerned with enes that have a permanent spaal extent on Earth” (34) – e.g. countries, cies, mountains, rivers – original YAGO, WordNet have no geographical super- class • new class: yagoGeoEnty – type yagoGeoCoordinates stores latude/longitude pair • only coordinates, no polygons – city center, not exhausve boundaries Harvesng geo-enes

• harvested from Wikipedia and GeoNames • assigned only one class – Berlin = “capital of a polical enty” • hierarchical – Berlin is located in Germany is located in Europe Assigning a locaon

• given to both enes and facts when “ontologically reasonable” (36) • locaons are themselves geo-enes Enes and locaon

• events – if specific locaon, e.g. bales and sports compeons – happenedIn relaon • groups – company headquarters, university campus – isLocatedIn relaon • arfacts – Mona Lisa in the Louvre – isLocatedIn relaon (Con-)textual data in YAGO2

• non-ontological informaon from Wikipedia (take strings as arguments) – hasWikipediaAnchorText (visible text in hyperlink) – hasWikipediaCategory – hasCitaonTitle (from references list) • mullingual informaon – extracted from inter-language links in arcles – e.g. [BaleAtWaterloo isCalled SchlachtBeiWaterloo] with associated fact [inLanguage German] YAGO2: evaluaon

• formed one pool for each relaon – e.g. wasBornOnDate, hasGDP – randomly selected test data from each pool • used 26 human judges – judge presented with fact, along with original Wikipedia arcle to assess its accuracy • accuracy of Wikipedia not assessed – connued evaluang each pool unl confidence interval was smaller than ±5%, to assure stascal significance • 97.8% of facts were judged correct YAGO2: evaluaon

p. 42 Task-based evaluaon: Jeopardy

p. 46 Task-based evaluaon: Jeopardy

p. 55 Task-based evaluaon: Jeopardy

p. 56 Our project

• were originally planning to aempt hierarchical ontology based on Wikipedia • new project: hierarchical classificaon of social science journal arcles • mine text of arcles with Python NLTK – plain text ngrams for n=(1-5) – stemmed/POS tagged unigrams – possibly named enes • run different clustering algorithms on enes (arcle tles with features mined from text) • aempt to automacally generate reasonable names for clusters