Discerning Intelligence from Text (at the UofA)
Denilson Barbosa
[email protected]

Web search is changing
• … from IR-style document retrieval
• … to question-answering over entities extracted from the web
  which team did Lou Saban coach last?
  which was Lou Saban’s last team?
• The answer is the Chowan Braves, and it can be found in Lou Saban’s Wikipedia page (ranked #1), obituary (ranked #4), and so on…

in good company…
An explicit answer

it is not all bad news…

Structured knowledge (harnessed from the Web)

Surface-level relation extraction
Documents: "After his departure from Buffalo, Saban returned to coach college football teams including Miami, Army, and UCF."
Pipeline: Recognize Entities → Resolve Coreferences → Split Sentences → Find Relations → Triple store
<“Lou Saban”, departed from, “Buffalo Bills”>
<“Lou Saban”, coach, “Miami Hurricanes”>
<“Lou Saban”, coach, “Army Black Knights football”>
<“Lou Saban”, coach, “University of Central Florida”>

From triples to a KB… ?????
• There is a very, very, very long way…
  § Predicate disambiguation into semantic “relations”…
  § Named entity disambiguation…
  § Assigning entities to classes…
  § Grouping classes into a hierarchy…
  § Ordering facts temporally…
• It would have been virtually impossible without Wikipedia

In this talk…
• Work on entity linking with random walks … [CIKM’2014]
• A bit of the work on open relation extraction (less on disambiguation)
  § SONEX (clustering-based) [TWEB ‘2012]
  § EXEMPLAR (dependency-based) [EMNLP’2013]
  § With Tree Kernels [NAACL’2013]
  § EFFICIENS (cost-constrained)
• A bit of our work on understanding disputes in Wikipedia [Hypertext2012] [ACM TIST’2015]

In this talk…
• Work on entity linking … [CIKM’2014]

Entity Linking

The entity graph
• We perform disambiguation over a graph whose nodes carry the ids of entities in the KB together with their respective context (i.e., text!)
• The Entity Graph has text about the entities ≠ the KB has facts and assertions
[Figure: the example sentence "After his departure from Buffalo, Saban returned to coach college football teams including Miami, Army, and UCF." next to candidate entities for its mentions: Buffalo Bills, Buffalo Bulls, Nick Saban, Lou Saban, Miami Heat, Miami Hurricanes, Miami Dolphins, Army Black Knights football, US Army, University of Central Florida, UCF Knights football]

The entity graph
• Typically built from Wikipedia
• Nodes are Wikipedia articles
  § All known names
  § Context: whole article
  § Metadata: types, keyphrases, type compatibility…
• Edges: E1 - E2 iff:
  § There is a wikilink from E1 to E2
  § There is an article E3 that mentions E1 and E2 close to each other
• Alias dictionary:
  § Mapping from names to ids

Entity linking: main steps
• Candidate selection: find a small set of good candidates for each mention → using the alias dictionary
• Mention disambiguation: assign each mention to one of its candidates
[Figure: the example sentence with the candidate entities for each mention]
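The alias dictionary and the candidate selection step can be sketched roughly as follows. This is only an illustration, not the implementation behind the talk: the function names `build_alias_dictionary` and `select_candidates`, the lower-casing of names, and the toy lists of known names are all assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def build_alias_dictionary(known_names: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Invert 'entity id -> known names' into 'name -> entity ids'.

    Names are lower-cased so lookups are case-insensitive (an assumption).
    """
    alias: Dict[str, Set[str]] = defaultdict(set)
    for entity_id, names in known_names.items():
        for name in names:
            alias[name.lower()].add(entity_id)
    return alias


def select_candidates(mentions: Iterable[str],
                      alias: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Return the (sorted) candidate entity ids for each mention."""
    return {m: sorted(alias.get(m.lower(), ())) for m in mentions}


# Hypothetical toy data: each entity id with some of its known names.
KNOWN_NAMES = {
    "Lou Saban": ["Lou Saban", "Saban"],
    "Nick Saban": ["Nick Saban", "Saban"],
    "Saban Capital Group": ["Saban Capital Group", "Saban"],
    "Miami Hurricanes football": ["Miami Hurricanes", "Miami"],
    "Miami Dolphins": ["Miami Dolphins", "Miami"],
}

alias_dictionary = build_alias_dictionary(KNOWN_NAMES)
candidates = select_candidates(["Saban", "Miami", "Chowan"], alias_dictionary)
for mention, entities in candidates.items():
    print(mention, "->", entities)
```

A mention with no entry in the dictionary (here "Chowan") simply gets an empty candidate list; how such mentions are handled is outside this sketch.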
Candidate selection
• On the KB: alias-dictionary expansion
  § Saban: {Nick Saban, Lou Saban, Saban Capital Group, …}
• On the document:
  § Lookups: alias dictionary / Wikipedia disambiguation pages
  § Co-reference resolution [Cucerzan’07]
  § Acronym expansion [Zhang et al.’10, Zhang et al.’11] (ABC -> Australian Broadcasting Corporation)

Local mention disambiguation, e.g., [Cucerzan’2007]
• Disambiguate each mention in isolation
  \mathrm{ent}(m) = \arg\max_{e \in \mathrm{candidates}(m)} \bigl( \alpha \cdot \mathrm{prior}(m, e) + \beta \cdot \mathrm{sim}(m, e) \bigr)
  § prior(m, e): freq(e|m), indegree(e), length(context(e))
  § sim(m, e): cosine/Dice/KL(context(m), context(e))

Local mention disambiguation
• Problematic assumption: mentions are independent of each other
  Saban = Nick Saban ⇒ Miami = Miami Dolphins
  Saban = Lou Saban ⇒ Miami = Miami Hurricanes
[Figure: the example sentence with the candidate entities for each mention]

Global mention disambiguation, e.g., [Hoffart’2011]
• Disambiguate all mentions at once
  \mathrm{ent}(m) = \arg\max_{e \in \mathrm{candidates}(m)} \bigl( \alpha \cdot \mathrm{prior}(m, e) + \beta \cdot \mathrm{sim}(m, e) + \gamma \cdot \mathrm{coherence}(G_{\mathrm{ent}}) \bigr)
[Figure: the example sentence with the candidate entities for each mention]
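The local scoring rule can be sketched in a few lines. The prior and context-similarity values below are the PriorProb and CtxSim entries listed for the mention "Saban" in the score table shown later in the talk; the weights α and β and the function name `local_link` are assumptions chosen only for illustration.

```python
from typing import Dict, Tuple

# (prior, context similarity) for each candidate of the mention "Saban".
SABAN: Dict[str, Tuple[float, float]] = {
    "Lou Saban": (0.009, 0.28),
    "Nick Saban": (0.009, 0.15),
    "Saban Capital Group": (0.545, 0.13),
}


def local_link(candidates: Dict[str, Tuple[float, float]],
               alpha: float, beta: float) -> str:
    """Pick the candidate maximising alpha * prior + beta * sim.

    The mention is scored in isolation: no other mention is consulted.
    """
    def score(entity: str) -> float:
        prior, sim = candidates[entity]
        return alpha * prior + beta * sim

    return max(candidates, key=score)


# Assumed weights: equal weighting lets the popularity prior dominate.
print(local_link(SABAN, alpha=1.0, beta=1.0))
# Assumed weights: down-weighting the prior lets the context decide.
print(local_link(SABAN, alpha=0.2, beta=1.0))
```

With equal weights the popular "Saban Capital Group" wins on its prior alone; with the prior down-weighted the context favours "Lou Saban". Either way each mention is decided independently, which is the limitation that the coherence term in the global formulation targets.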
Global mention disambiguation
• Coherence captures the assumption that the input document has a single theme or topic
  § E.g., rock music, or the world cup final match
  § NP-hard optimization in general
[Figure: the example sentence with the candidate entities for each mention]

Global mention disambiguation
• [Hoffart et al. 2011]: dense sub-graph problem
• Greedy algorithm: remove non-taboo entities until a minimal subgraph with highest weight is found
  § entity-entity: overlap of anchor words, overlap of links, type similarity
  § mention-entity: sim(m, e), keyphraseness(m, e)
• Post-processing

Global mention disambiguation
• Rel-RW: Robust entity linking with Random Walks [CIKM2014]
• Global notion of entity-entity similarity
• Greedy algorithm: iteratively disambiguate mentions; start with the easiest ones
[Figure: the example sentence with the candidate entities for each mention]

Random walks as context representation
• Random walks capture indirect relatedness between nodes in the graph
[Figure: k candidates among n nodes in total, shown for the example sentence and its candidate entities]
Random walks as context representation
• Relatedness between entities: Entity Semantic Signature, Document Semantic Signature
• One vector for each entity, and another for the whole document
• Similarity is measured using Zero-KL Divergence

Semantic signatures of entities
• Restart from the entity with probability α (e.g. 0.15)
  § Until convergence
• Repeat for the candidate mentions only
[Figure: the candidate entities from the example]

Semantic signatures of documents
• (From [Milne & Witten 2008]): If there are unambiguous mentions, use only their entities to find the signature of the document
• Otherwise, use all candidate entities
[Figure: the example sentence with the candidate entities for each mention; one mention is marked "unambiguous"]
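A semantic signature of this kind can be approximated with a random walk with restart computed by power iteration. The sketch below is not the Rel-RW implementation: the toy graph and its edges are hypothetical, edges are unweighted, the restart mass is spread uniformly over the seed entities, and the names `semantic_signature` and `document_seeds` are made up for illustration. Comparing signatures (the talk uses Zero-KL divergence) is left out.

```python
from typing import Dict, Iterable, List, Set


def semantic_signature(graph: Dict[str, List[str]], seeds: Set[str],
                       alpha: float = 0.15, tol: float = 1e-10,
                       max_iter: int = 1000) -> Dict[str, float]:
    """Stationary distribution of a walk that restarts at `seeds` w.p. alpha."""
    if not seeds or not seeds <= set(graph):
        raise ValueError("seeds must be a non-empty subset of the graph nodes")
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in graph}
    p = dict(restart)
    for _ in range(max_iter):
        nxt = {n: alpha * restart[n] for n in graph}
        for u, neighbours in graph.items():
            if neighbours:
                share = (1.0 - alpha) * p[u] / len(neighbours)
                for v in neighbours:
                    nxt[v] += share
            else:
                # Dangling node: return its mass to the seeds (an assumption).
                for n in graph:
                    nxt[n] += (1.0 - alpha) * p[u] * restart[n]
        delta = sum(abs(nxt[n] - p[n]) for n in graph)
        p = nxt
        if delta < tol:  # "until convergence"
            break
    return p


def document_seeds(candidates: Dict[str, Iterable[str]]) -> Set[str]:
    """Entities of unambiguous mentions if any exist, else all candidates."""
    lists = [list(c) for c in candidates.values()]
    unambiguous = {c[0] for c in lists if len(c) == 1}
    return unambiguous or {e for c in lists for e in c}


# Hypothetical toy entity graph (adjacency lists, undirected).
GRAPH = {
    "Lou Saban": ["Buffalo Bills", "Miami Hurricanes football"],
    "Nick Saban": ["Miami Dolphins"],
    "Buffalo Bills": ["Lou Saban", "Miami Dolphins"],
    "Miami Hurricanes football": ["Lou Saban"],
    "Miami Dolphins": ["Nick Saban", "Buffalo Bills"],
}

signature = semantic_signature(GRAPH, {"Lou Saban"})
for entity, mass in sorted(signature.items(), key=lambda kv: -kv[1]):
    print(f"{entity}: {mass:.3f}")
```

An entity signature uses a single seed; a document signature would call the same function with `document_seeds(...)` as the seed set, following the unambiguous-mentions rule on the slide.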
Algorithm
• Find candidate entities for each mention
• Compute prior(m,e) and the context(m,e)
• Sort mentions by ambiguity (i.e., number of candidates)
• Go through each mention m in ascending order:
  § SSd = semantic signature of document
  § Assign to m the candidate e with highest combined score prior(m,e) * context(m,e) + sim(SSe, SSd)
  § Update the set of entities for the document

Algorithm
Mentions [ambiguity]  Candidates                     [PriorProb, CtxSim, SemSim]
UCF [33]              UCF Knights football           [0.133, 0.18, 0.50]
                      University of Central Florida  [0.167, 0.13, 0.52]
                      UCF Knights basketball         [0.041, 0.13, 0.34]

Saban [45]            Lou Saban                      [0.009, 0.28, 0.41]
                      Nick Saban                     [0.009, 0.15, 0.54]
                      Saban Capital Group            [0.545, 0.13, 0.20]

Buffalo [317]         Buffalo, New York              [0.467, 0.07, 0.54]
                      Buffalo Bulls football         [0.024, 0.11, 0.50]
                      Buffalo Bills                  [0.021, 0.09, 0.58]

Miami [343]           Miami                          [0.632, 0.07, 0.61]
                      Miami Hurricanes football      [0.029, 0.12, 0.58]
                      Miami Dolphins                 [0.011, 0.10, 0.56]

Army [402]            Army Black Knights football    [0.062, 0.09, 0.52]
                      Maryland Terrapins football    [0.001, 0.07, 0.56]
                      Army                           [0.155, 0.04, 0.34]
Note: Use all candidates for SSd

Algorithm
Ed = {UCF Knights football}
Assigned: UCF = UCF Knights football [0.133, 0.18, 0.50]
Mentions [ambiguity]  Candidates                     [PriorProb, CtxSim, SemSim]
UCF [33]              UCF Knights football           [0.133, 0.18, 0.50]
                      University of Central Florida  [0.167, 0.13, 0.52]
                      UCF Knights basketball         [0.041, 0.13, 0.34]

Saban [45]            Lou Saban                      [0.009, 0.28, 0.41]
                      Nick Saban                     [0.009, 0.15, 0.54]
                      Saban Capital Group            [0.545, 0.13, 0.20]

Buffalo [317]         Buffalo, New York              [0.467, 0.07, 0.54]
                      Buffalo Bulls football         [0.024, 0.11, 0.50]
                      Buffalo Bills                  [0.021, 0.09, 0.58]

Miami [343]           Miami                          [0.632, 0.07, 0.61]
                      Miami Hurricanes football      [0.029, 0.12, 0.58]
                      Miami Dolphins                 [0.011, 0.10, 0.56]

                      Army Black Knights football    [0.062, 0.09, 0.52]
                      Maryland Terrapins