Leveraging Knowledge Graphs for Web Search

Part 2 - Named Entity Recognition and Linking to Knowledge Graphs
Gianluca Demartini, University of Sheffield, gianlucademartini.net

Entity Management and Search

• Entity Extraction: recognize entity mentions in text, e.g. “Paris has been changing over time”
• Entity Linking: assign URIs to entities

• Indexing: Database vs Inverted Index

URI                                  Label
http://dbpedia.org/resource/Paris    Paris, France
http://dbpedia.org/resource/Rome     Rome, Italy
http://dbpedia.org/resource/Berlin   Berlin, Germany

• Search: ranking entities given a query

“Capital cities in Europe”

Outline

• An NLP Pipeline:
  – Named Entity Recognition
  – Entity Linking
  – Ranking entity types
• NER and disambiguation in scientific documents

• Slides: gianlucademartini.net/kg

Information extraction: entities

• Entity extraction / Named Entity Recognition
  – “Slovenia borders Italy”
• Entity resolution
  – “Apple released a new Mac”
  – From “Apple”, “Mac”
  – To Apple_Inc., Macintosh_(computer)
• Entity classification
  – Into a set of predefined categories of interest
  – Person, location, organization, date/time, e-mail address, phone number, etc.
  – E.g. <“Slovenia”, type, Country>

Steps

• Tokenization
• Sentence splitting
• Part-of-speech (POS) tagging
• Named Entity Recognition (NER) and linking
• Co-reference resolution
• Relation extraction

Example: “The cat is in London. It is nice.” (two sentences; tokens annotated with POS tags such as NN, NN, ADJ)
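For concreteness, here is a minimal sketch of these steps using spaCy; this is an illustrative choice (it assumes the en_core_web_sm model is installed), not a toolkit prescribed by the slides.

```python
# Minimal sketch: tokenization, sentence splitting, POS tagging and NER with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")        # small English pipeline (assumed installed)
doc = nlp("The cat is in London. It is nice.")

for sent in doc.sents:                    # sentence splitting
    for token in sent:
        print(token.text, token.pos_)     # tokens with part-of-speech tags

for ent in doc.ents:
    print(ent.text, ent.label_)           # named entity mentions, e.g. London -> GPE
```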

Entity Extraction

• Employed by most modern approaches:
  – Part-of-speech tagging
  – Noun phrase chunking, used for entity extraction
• Abstraction of text
  – From: “Slovenia borders Italy”
  – To: “noun – verb – noun”

• Approaches to Entity Extraction:
  – Dictionaries
  – Patterns
  – Learning Models

NER methods

• Rule based (see the sketch below)
  – Regular expressions, e.g. a capitalized word followed by {street, boulevard, avenue} indicates a location
  – Engineered vs. learned rules
• NER can be formulated as a classification task
  – NE extraction: assign word mentions to tags (B begins an entity, I continues the entity, O marks a word outside any entity)
  – NE classification: assign entity mentions to categories (Person, Organization, etc.)
  – Use ML methods for classification: decision trees, SVM, AdaBoost
  – Standard classification assumes cases are independent (i.i.d.)
• Probabilistic sequence models: HMM, CRF
  – Each token in a sequence is assigned a label
  – Labels of tokens depend on the labels of other tokens in the sequence, particularly their neighbours (not i.i.d.)
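A minimal sketch of the rule-based idea and the B/I/O tagging scheme described above; the regular expression and the whitespace tokenization are simplified illustrations, not the rules of any particular system mentioned in the slides.

```python
# Hypothetical rule: a capitalized word followed by a street-related trigger word
# is tagged as a location; matched tokens are emitted with B-I-O tags.
import re

LOCATION_RULE = re.compile(r"\b[A-Z][a-z]+ (?:Street|Boulevard|Avenue)\b")

def bio_tags(text):
    spans = [m.span() for m in LOCATION_RULE.finditer(text)]
    tags = []
    for m in re.finditer(r"\S+", text):                 # naive whitespace tokenization
        inside = [s for s in spans if s[0] <= m.start() < s[1]]
        if not inside:
            tags.append((m.group(), "O"))               # outside any entity
        elif m.start() == inside[0][0]:
            tags.append((m.group(), "B-LOC"))           # beginning of an entity
        else:
            tags.append((m.group(), "I-LOC"))           # continuation of an entity
    return tags

print(bio_tags("Turn left on Baker Street near Lexington Avenue."))
```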

Classification

[Diagram: a generative model of the joint distribution p(y, x), e.g. Naïve Bayes, vs. its conditional/discriminative counterpart p(y|x), e.g. Logistic Regression, over a class label Y and features X1 … Xn]

Sequence Labeling

[Diagram: a generative sequence model of p(y, x), e.g. an HMM, vs. its conditional/discriminative counterpart p(y|x), e.g. a linear-chain CRF, over a label sequence Y1 … YT and observations X1 … XT]

Sunita Sarawagi and William W. Cohen. Semi-Markov Conditional Random Fields for Information Extraction. In NIPS, 2005.

NER features

• Gazetteers (background knowledge)
  – location names, first names, surnames, company names
• Word
  – Orthographic
    • initial-caps, all-caps, all-digits, contains-hyphen, contains-dots, roman-number, punctuation-mark, URL, acronym
  – Word type
    • capitalized, quoted, lowercased
  – Part-of-speech tag
    • NP, noun, nominal, VP, verb, adjective
• Context
  – Text window: words, tags, predictions
  – Trigger words
    • Mr, Miss, Dr, PhD for person; city, street for location
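A minimal sketch of how token-level features of the kinds listed above could be computed; the helper name, the tiny gazetteer and the trigger-word list are illustrative assumptions, not taken from any specific NER system.

```python
# Hypothetical per-token feature extractor mixing orthographic, gazetteer,
# context-window and trigger-word features.
def token_features(tokens, i, gazetteer=frozenset({"London", "Paris"})):
    w = tokens[i]
    return {
        "initial-caps": w[:1].isupper(),
        "all-caps": w.isupper(),
        "all-digits": w.isdigit(),
        "contains-hyphen": "-" in w,
        "in-gazetteer": w in gazetteer,
        "prev-word": tokens[i - 1].lower() if i > 0 else "<BOS>",   # context window
        "prev-is-trigger": i > 0 and tokens[i - 1] in {"Mr", "Dr", "Miss"},
    }

print(token_features("Dr Smith lives in London".split(), 1))
```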

Some NER tools

• Java
  – Stanford Named Entity Recognizer
    • http://nlp.stanford.edu/software/CRF-NER.shtml
  – GATE
    • http://gate.ac.uk/ and http://services.gate.ac.uk/annie/
  – LingPipe: http://alias-i.com/lingpipe/
• C
  – SuperSense Tagger
    • http://sourceforge.net/projects/supersensetag/
• Python
  – NLTK: http://www.nltk.org
  – spaCy: http://spacy.io/
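As a quick usage example for one of the Python tools listed above, a minimal NLTK sketch; it assumes the punkt, averaged_perceptron_tagger, maxent_ne_chunker and words resources have already been fetched with nltk.download().

```python
# Minimal NLTK sketch: tokenize, POS-tag, then chunk named entities.
import nltk

text = "Slovenia borders Italy"
tokens = nltk.word_tokenize(text)
tagged = nltk.pos_tag(tokens)      # part-of-speech tags
tree = nltk.ne_chunk(tagged)       # named entity chunks (GPE, PERSON, ...)
print(tree)
```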

Entity Resolution / Linking

Basic situation

Pipeline

1. Identify named entity mentions in the source text using a named entity recognizer
2. Given the mentions, gather candidate KB entities that have that mention as a label
3. Rank the KB entities
4. Select the best KB entity for each mention
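A minimal sketch of the four steps, assuming spaCy for mention recognition and a toy in-memory label index standing in for a real KB lookup; the label table and the prior scores are made up for illustration.

```python
# Hypothetical entity-linking pipeline: recognize, generate candidates, rank, select.
import spacy

LABEL_INDEX = {  # mention label -> candidate (URI, prior score); toy data
    "Paris": [("http://dbpedia.org/resource/Paris", 0.9),
              ("http://dbpedia.org/resource/Paris,_Texas", 0.1)],
}

nlp = spacy.load("en_core_web_sm")

def link(text):
    doc = nlp(text)                                   # 1. recognize mentions
    links = {}
    for ent in doc.ents:
        candidates = LABEL_INDEX.get(ent.text, [])    # 2. gather candidates by label
        ranked = sorted(candidates, key=lambda c: c[1], reverse=True)  # 3. rank
        if ranked:
            links[ent.text] = ranked[0][0]            # 4. select the best entity
    return links

print(link("Paris has been changing over time"))
```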

Relatedness

• Intuition: entities that co-occur in the same context tend to be more related
• How can we express the relatedness of two entities numerically?
  – Statistical co-occurrence
  – Similarity of the entities’ descriptions
  – Relationships in the ontology

Semantic relatedness

• If entities have an explicit assertion connecting them (or have common neighbours), they tend to be related

[Diagram: a small knowledge graph in which “Elvis” may refer to St. Elvis or Elvis Presley (both of type Person) and “Memphis” may refer to Memphis, Egypt or Memphis, TN (both of type Location); a typed relationship such as origin connects Elvis Presley to Memphis, TN]

Co-occurrence as relatedness

• If distinct entities occur together more often than by chance, they tend to be related (a PMI sketch follows the diagram below)

[Diagram: document-level co-occurrence counts of mentions such as “FC Barcelona”, “Bavaria”, “Bayern” and “Bayern München”, from which the mutual information between pairs of entities is computed]
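One common way to turn document-level co-occurrence into a relatedness score is pointwise mutual information (PMI); the sketch below uses a toy document collection, and the documents and entity sets are made up for illustration.

```python
# PMI over document-level co-occurrence of entity mentions (toy data).
import math

docs = [
    {"Bayern_München", "Bavaria"},
    {"Bayern_München", "FC_Barcelona"},
    {"Bavaria", "Bayern_München"},
    {"FC_Barcelona"},
]

def pmi(e1, e2, docs):
    n = len(docs)
    p1 = sum(e1 in d for d in docs) / n            # P(e1)
    p2 = sum(e2 in d for d in docs) / n            # P(e2)
    p12 = sum(e1 in d and e2 in d for d in docs) / n  # P(e1, e2)
    return math.log(p12 / (p1 * p2)) if p12 > 0 else float("-inf")

print(pmi("Bayern_München", "Bavaria", docs))       # higher -> more related
print(pmi("Bayern_München", "FC_Barcelona", docs))
```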

Where do entities appear?

• Documents
  – Text in general
    • For example, news articles
    • Exploiting natural language structure and semantic coherence
  – Specific to the Web
    • Exploiting the structure of web pages, e.g. annotation of web tables
    • Hazem Elmeleegy, Jayant Madhavan, Alon Y. Halevy: Harvesting relational tables from lists on the web. VLDB J. 20(2): 209-226 (2011)
• Queries
  – Short text and no structure

Entities in web search queries

– ~70% of queries contain a named entity (entity mention queries)
  • brad pitt height
– ~50% of queries have an entity focus (entity seeking queries)
  • brad pitt attacked by fans
– ~10% of queries are looking for a class of entities
  • brad pitt movies
• [Pound et al., WWW 2010], [Lin et al., WWW 2012]

Entities in web search queries

• Entity mention query = {entity + intent}
  – The intent is typically an additional word or phrase to
    • Disambiguate, most often by type, e.g. brad pitt actor
    • Specify an action or aspect, e.g. brad pitt net worth, toy story trailer
• Approaches for NER in queries
  – Matching keywords [Blanco et al., ISWC 2013]
    • https://github.com/yahoo/Glimmer/
  – Matching aliases, i.e., look up entity names in the KG
    • Roi Blanco, Giuseppe Ottaviano and Edgar Meij. Fast and space-efficient entity linking in queries. WSDM 2015

Exercises

• 1) Write a piece of code that
  – Calls NER APIs over some text (e.g., http://nerd.eurecom.fr/documentation or https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki)
  – Creates an inverted index of the entities appearing in the documents
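A minimal sketch of exercise 1, assuming spaCy for the extraction step (a remote NER/linking API such as DBpedia Spotlight could be substituted) and a simple surface-form inverted index; the documents are made up.

```python
# Build an inverted index: entity surface form -> set of document ids mentioning it.
from collections import defaultdict
import spacy

nlp = spacy.load("en_core_web_sm")

docs = {
    "d1": "Paris has been changing over time.",
    "d2": "Berlin and Paris are European capitals.",
}

inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for ent in nlp(text).ents:
        inverted_index[ent.text].add(doc_id)

print(dict(inverted_index))   # e.g. {'Paris': {'d1', 'd2'}, 'Berlin': {'d2'}}
```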

• 2) Use the ERD dataset and try your own disambiguation idea (using the step 1 results)
  – http://web-ngram.research.microsoft.com/ERD2014/

Entity Recognition and Disambiguation Challenge (at SIGIR 2014)

• Sample of the Freebase KG
• Short text: web search queries from past TREC competitions
  – Winning approach: extract entities from the search results for the query
• Long text: ClueWeb pages
  – Winning approach: supervised …, training on …

Entity Types

Alberto Tonon, Michele Catasta, Gianluca Demartini, Philippe Cudré-Mauroux, and Karl Aberer. TRank: Ranking Entity Types Using the Web of Data. In: The 12th International Semantic Web Conference (ISWC 2013)

…and Why Types?

• “Summarization” of texts

  Article title: Bin Laden Relative Pleads Not Guilty in Terrorism Case
  Entities and their types:
    Osama Bin Laden → Al-Qaeda Propagandists
    Abu Ghaith → Kuwaiti Al-Qaeda members
    Lewis Kaplan → Judge
    Manhattan → Borough (New York City)

• Contextual entity summaries in Web pages
  – e.g. “Sulaiman Abu Ghaith, a son-in-law of Osama bin Laden who once served as a spokesman for Al Qaeda”, annotated with types such as Kuwaiti Al-Qaeda members, Propagandist, Jihadist Organizations
• Disambiguation of other entities

• Diversification of search results

Entities May Have Many Types

[Diagram: an example entity carrying many types, e.g. Thing, Agent, Person, People, American Billionaires, American Philanthropists, People from King County, People from Seattle, Harvard University People, American People of Scottish Descent, Living People]

TRank Pipeline

[Pipeline diagram: text extraction (BoilerPipe) → entity recognition (Stanford NER) → list of entity labels → entity linking, for each label, via an inverted index (DBpedia labels ⟹ resource URIs) → list of entity URIs → type retrieval via an inverted index (resource URIs ⟹ type URIs) → list of entity types → type ranking → ranked list of type URIs]

Alberto Tonon, Michele Catasta, Gianluca Demartini, Philippe Cudré-Mauroux, and Karl Aberer. TRank: Ranking Entity Types Using the Web of Data. In: The 12th International Semantic Web Conference (ISWC 2013). Sydney, Australia, October 2013.

Type Hierarchy

Mappings YAGO/DBpedia (PARIS)

[Diagram legend: type statements come from DBpedia, schema.org and YAGO; subClassOf relationships are either explicit in the ontologies, inferred from the PARIS mapping, or manually added]

Ranking Algorithms

• Entity-centric
• Hierarchy-based
• Context-aware (featuring the type hierarchy)
• Learning to Rank

Hierarchy-Based Approaches (An Example)

[Diagram: a small type hierarchy, e.g. Thing → Person → Actor and Thing → Organization → Foundation → Humanitarian Foundation]

• ANCESTORS: Score(e, t) = number of t’s ancestors in the type hierarchy that are contained in Te, the set of types assigned to entity e (a code sketch combining this with SAMETYPE follows the next slide)

Note: Te often doesn’t contain all super-types of a specific type

Context-Aware Ranking Approaches (An Example)

• SAMETYPE: Score(e, t, cT) = number of times t appears among the types of every other entity in the context cT

[Diagram: a context containing entities e, e', e'' with types such as Actor, AmericanActor, Person, Organization, Thing]
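A minimal sketch of the ANCESTORS score from the previous slide and the SAMETYPE score above, over a toy type hierarchy; the hierarchy, the entity types and the context are made up, and the function names are illustrative rather than TRank's own.

```python
# Toy implementations of two type-ranking scores.
TYPE_PARENT = {                      # child type -> parent type (toy hierarchy)
    "Actor": "Person", "Person": "Thing",
    "Foundation": "Organization", "Organization": "Thing",
}

def ancestors(t):
    out = set()
    while t in TYPE_PARENT:
        t = TYPE_PARENT[t]
        out.add(t)
    return out

def ancestors_score(t, entity_types):
    # ANCESTORS: how many of t's ancestors appear among the entity's own types
    return len(ancestors(t) & entity_types)

def sametype_score(t, context_types):
    # SAMETYPE: how often t appears among the types of the other context entities
    return sum(t in types for types in context_types)

entity_types = {"Actor", "Person", "Thing"}
context_types = [{"Actor", "Person"}, {"Organization"}]   # types of e', e''
print(ancestors_score("Actor", entity_types))   # 2 (Person and Thing)
print(sametype_score("Actor", context_types))   # 1
```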

Learning to Rank Entity Types

Determine an optimal combination of all our approaches:

• Decision trees
• Linear regression models
• 10-fold cross-validation

Datasets

• 128 recent NYTimes articles split to create:
  – Entity Collection
  – Sentence Collection
  – Paragraph Collection
  – 3-Paragraphs Collection
• Ground truth obtained by using:
  – 3 workers per entity/context
  – 4 levels of relevance for each type
  – Overall cost: $190


Effectiveness Evaluation

Check our paper for a complete description of all the approaches we evaluated

Avoiding SPARQL Queries with Inverted Indices and Map/Reduce

• TRank is implemented with Hadoop and Map/Reduce
• All computations are done using inverted indices:
  – Entity linking, path index, depth index
• CommonCrawl sample of 1 TB, 1.3M web pages
  – Map/Reduce on a cluster of 8 machines with 12 cores, 32 GB of RAM and 3 SATA disks each
  – 25 min. processing time (> 100 docs per node per second)

Processing time breakdown: Text Extraction 18.9%, NER 35.6%, Entity Linking 29.5%, Type Retrieval 9.8%, Type Ranking 6.2%

• The inverted indices are publicly available at exascale.info/TRank

Using TRank

• Open source (Scala)
  – https://github.com/MEM0R1ES/TRank
• Web service (JSON)
  – http://trank.exascale.info

Extracting Scientific Concepts in Publications

Roman Prokofyev, Gianluca Demartini, and Philippe Cudré-Mauroux. Effective Named Entity Recognition for Idiosyncratic Web Collections. In: 23rd International Conference on World Wide Web (WWW 2014)

Problem Definition

• web search engine
• navigational query
• user intent
• information need
• web content
• …

Entity type: scientific concept

Traditional NER

Types:
• Maximum Entropy (Mallet, NLTK)
• Conditional Random Fields (Stanford NER, Mallet)

Properties:
• Require extensive training
• Usually domain-specific: different collections require training on their own domain
• Very good at detecting types such as Location, Person, Organization

Proposed Approach

Our problem is defined as a classification task.

Two-step classification:
• Extract candidate named entities using a frequency filtering algorithm.
• Classify the candidate named entities using a supervised classifier.

Candidate selection should greatly reduce the number of n-grams to classify, ideally without a significant loss in recall.

Pipeline

[Pipeline diagram: text extraction (Apache Tika) → lemmatization → list of extracted n-grams → indexing → POS tagging and frequency reweighting → candidate selection → list of selected n-grams → feature extraction → supervised classifier → (n+1)-gram merging → ranked list of n-grams]

Candidate Selection: Part I

Consider all n-grams with frequency > k (k = 2):

[Figure: bigrams and their counts, e.g. named entity: 101, entity recognition: 12, candidate named: 5, entity candidate: 3, together with noisy bigrams such as entity in: 18, of named: 10, the named: 4, that named: 3, entity are: 4, which an NLTK-based filter removes]


Candidate Selection: Part II

The frequency of each candidate is looked up from the n-gram index.

[Figure: adjacent bigrams are merged into trigrams (candidate named entity: 5, named entity candidate: 3, named entity recognition: 12) and the counts of the covered bigrams are reduced accordingly (named entity: 101 → 81, candidate named: 5 → 0, entity candidate: 3 → 0, entity recognition: 12 → 0)]
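A rough sketch in the spirit of the candidate-selection steps above: count n-grams, keep the frequent ones, and let longer n-grams absorb the counts of the shorter n-grams they contain. The helper names, thresholds and toy input are illustrative assumptions, not the exact algorithm from the paper.

```python
# Frequency-based candidate selection with simple (n+1)-gram merging (toy sketch).
from collections import Counter

def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def candidates(tokens, k=2, max_n=3):
    counts = Counter()
    for n in range(2, max_n + 1):
        counts.update(ngrams(tokens, n))
    selected = {g: c for g, c in counts.items() if c > k}   # frequency filter
    # longer n-grams absorb the counts of the shorter n-grams they contain
    for g in sorted(selected, key=lambda g: -len(g.split())):
        for sub in ngrams(g.split(), len(g.split()) - 1):
            if sub in selected:
                selected[sub] -= selected[g]
    return {g: c for g, c in selected.items() if c > 0}

tokens = ("named entity recognition " * 5).split()
print(candidates(tokens))
```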


Candidate Selection: Discussion

It is possible to extract n-grams (n > 2) with frequency ≤ k

[Chart: counts of valid vs. invalid n-grams by their frequency in the document (frequencies 1 to 5)]

After Candidate Selection

TwiNER: named entity recognition in targeted Twitter stream. SIGIR 2012

Classifier: Overview

Machine learning algorithm: Decision Trees from the scikit-learn package.

Feature types:
• POS tags and their derivatives
• External knowledge bases (DBLP, DBpedia)
• DBpedia relation graphs
• Syntactic features
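A minimal sketch of the classification step with a scikit-learn decision tree; the feature names mirror the feature types listed above, but the values and labels are made up for illustration.

```python
# Train a decision tree on toy candidate features and classify a new candidate.
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

train_features = [
    {"pos_starts_NN": 1, "in_dblp": 1, "ngram_len": 2},   # e.g. "named entity"
    {"pos_starts_NN": 0, "in_dblp": 0, "ngram_len": 2},   # e.g. "of named"
]
train_labels = [1, 0]                                     # valid / invalid entity

vec = DictVectorizer()
X = vec.fit_transform(train_features)
clf = DecisionTreeClassifier().fit(X, train_labels)

test = vec.transform([{"pos_starts_NN": 1, "in_dblp": 1, "ngram_len": 3}])
print(clf.predict(test))                                  # [1] -> predicted valid
```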

Datasets

Two collections:
• CS Collection (SIGIR 2012 Research Track): 100 papers
• Physics Collection: 100 papers randomly selected from the arXiv.org High Energy Physics category

                      CS Collection   Physics Collection
# Candidate N-grams   21,531          18,129
# Judged N-grams      15,057          11,421
# Valid Entities       8,145           5,747
# Invalid N-grams      6,912           5,674

Available at: github.com/XI-lab/scientific_NER_dataset

Features: POS Tags, part I

[Chart: counts of valid vs. invalid candidates per POS tag pattern (e.g. NN NN, JJ NN, NNP NN, NN NNS); there are 100+ different tag patterns]

Features: POS Tags, part II

Two feature schemes:
• Raw POS tag patterns: each tag pattern is a binary feature
• Regex POS tag patterns:
  – First-tag match, e.g. JJ NNS, JJ NN NN, JJ NN → JJ* …
  – Last-tag match, e.g. NN VB, NN NN VB, JJ NN VB → *VB …

Features: External Knowledge Graphs

Domain-specific knowledge graphs:

• DBLP (Computer Science): contains author-assigned keywords for the papers
• ScienceWISE: high-quality scientific concepts (mostly for the Physics domain), http://sciencewise.info

We perform exact string matching with these KGs.


Features: DBpedia, part I

DBpedia pages essentially represent valid entities

But there are a few problems when:
• The n-gram is not an entity
• The n-gram is not a scientific concept (e.g. “Tom Cruise” in an IR paper)

                          CS Collection          Physics Collection
                          Precision   Recall     Precision   Recall
Exact string matching     0.9045      0.2394     0.7063      0.0155
Matching with redirects   0.8457      0.4229     0.7768      0.5843

Features: Syntactic

Set of common syntactic features:
• N-gram length in words
• Whether the n-gram is uppercased
• The number of other n-grams a given n-gram is part of

Experiments: Overview

1. Regex POS patterns vs. normal POS tags
2. Redirects vs. non-redirects
3. Feature importance scores
4. MaxEntropy comparison

All results are averages obtained with 10-fold cross-validation.

Experiments: Comparison I

CS Collection

                                    Precision   Recall    F1 score   Accuracy   # features
Normal POS + Components             0.8794      0.8058*   0.8409*    0.8429*    54
Regex POS + Components              0.8475*     0.8524*   0.8499*    0.8448*    9
Normal POS + Components-Redirects   0.8678*     0.8305*   0.8487*    0.8473     50
Regex POS + Components-Redirects    0.8406*     0.8769    0.8584     0.8509     7

The symbol * indicates a statistically significant difference as compared to the approach in bold.

Experiments: Feature Importance

CS Collection (7 features)        Importance
NN STARTS                         0.3091
DBLP                              0.1442
Components + DBLP                 0.1125
Wikipedia redirect                0.1104
Components                        0.1093
Wikilinks                         0.0439
Participation count               0.0370

Physics Collection (6 features)   Importance
ScienceWISE                       0.2870
Component + ScienceWISE           0.1948
Components                        0.0789
VB ENDS                           0.0386
NN ENDS                           0.0380
JJ STARTS                         0.0364

Experiments: MaxEntropy

The MaxEnt classifier receives the full text as input (we used the classifier from the NLTK package).

Comparison experiment: 80% of the CS Collection as training data, 20% as test data.

                   Precision   Recall    F1 score
Maximum Entropy    0.6566      0.7196    0.6867
Decision Trees     0.8121      0.8742    0.8420

Lessons Learned

Classic NER approaches are not good enough for Idiosyncratic Web Collections

Leveraging the graph of scientific concepts is a key feature

Domain-specific KBs and POS patterns work well

Experimental results show up to 85% accuracy over different scientific collections

Entity Disambiguation in Scientific Literature

• Using a background concept graph

Roman Prokofyev, Gianluca Demartini, Philippe Cudré-Mauroux, Alexey Boyarsky, and Oleg Ruchayskiy. Ontology-Based Word Sense Disambiguation in the Scientific Domain. In: 35th European Conference on Information Retrieval (ECIR 2013).

Summary

• NLP Pipeline:
  – Named Entity Recognition
  – Entity Linking
  – Ranking Entity Types
• NER and disambiguation in scientific documents
• Tomorrow
  – Searching for entities
  – Human Computation for better effectiveness
