Knowledge Graph Construction from Text AAAI 2017 JAY PUJARA, SAMEER SINGH, BHAVANA DALVI Tutorial Overview

Part 1: Knowledge Graphs

Part 2: Part 3: Knowledge Graph Extraction Construction

Part 4: Critical Analysis

2 Tutorial Outline 1. Knowledge Graph Primer [Jay] 2. Knowledge Extraction from Text a. NLP Fundamentals [Sameer] b. Information Extraction [Bhavana] Coffee Break 3. Knowledge Graph Construction a. Probabilistic Models [Jay] b. Embedding Techniques [Sameer] 4. Critical Overview and Conclusion [Bhavana]

3 What is Knowledge Extraction?

Text was born in , to Julia and Alfred .

Literal Facts

Alfred Lennon childOf birthplace Liverpool childOf

4 NLP Fundamentals EXTRACTING STRUCTURES FROM LANGUAGE What is NLP?

NLP “Knowledge”

Unstructured Structured Ambiguous Precise, Actionable Lots and lots of it! Specific to the task

Humans can read them, but Can be used for downstream … very slowly applications, such as creating … can’t remember all Knowledge Graphs! … can’t answer questions

6 What is NLP?

John was born in Liverpool, to Julia and Alfred Lennon.

Natural Language Processing

Lennon.. Mrs. Lennon.. his father John Lennon... the Pool .. his mother .. he Alfred Person Location Person Person John was born in Liverpool, to Julia and Alfred Lennon. NNP VBD VBD IN NNP TO NNP CC NNP NNP

7 What is Information Extraction?

Lennon.. Mrs. Lennon.. his father John Lennon... the Pool .. his mother .. he Alfred Person Location Person Person John was born in Liverpool, to Julia and Alfred Lennon. NNP VBD VBD IN NNP TO NNP CC NNP NNP

Information Extraction

Alfred Lennon childOf birthplace spouse Liverpool John Lennon childOf Julia Lennon

8 Breaking it Down

Alfred Lennon Entity resolution, childOf birthplace spouse Entity linking, Liverpool John Lennon Julia Relation extraction… childOf

Extraction Lennon Information

Lennon.. Mrs. Lennon.. his father John Lennon... the Pool Coreference Resolution... .. his mother .. he Alfred Person Location Person Person Document John was born in Liverpool, to Julia and Alfred Lennon.

Dependency Parsing, Part of speech tagging, Named entity recognition… Sentence NNP VBD VBD IN NNP TO NNP CC NNP NNP John was born in Liverpool, to Julia and Alfred Lennon.

9 Tokenization & Sentence Splitting

“Mr. Bob Dobolina is thinkin' of a master plan. Why doesn't he quit?”

[Mr.] [Bob] [Dobolina] [is] [thinkin’] [of] [a] [master] [plan] [.] [Why] [doesn't] [he] [quit] [?]

How it is done: Uses in KG Construction: • Regular expressions, but not trivial • Strictly constrains other NLP tasks • Mr., Yahoo!, lower-case • Parts of Speech • For non-English, incredibly difficult! • Dependency Parsing • Chinese: no “space” character • Directly effects KG nodes/edges • Non-trivial for some domains… • Mention boundaries • What is a “token” in BioNLP? • Relations within sentences

10 Tagging the Parts of Speech

NNP VBD VBD IN NNP TO NNP CC NNP NNP John was born in Liverpool, to Julia and Alfred Lennon.

How it is done: Uses in KG Construction: • Context is important! • Entities appear as nouns • run, table, bar, … • Verbs are very useful • Label whole sentence together • For identifying relations • “Structured prediction” • For identifying entity types • Conditional Random Fields, .. • Important for downstream NLP • Now: CNNs, LSTMs, … • NER, Dependency Parsing, …

11 Detecting Named Entities

Person Location Person Person John was born in Liverpool, to Julia and Alfred Lennon.

How it is done: Uses in KG Construction: • Context is important! • Mentions describes the nodes • Georgia, Washington, … • Types are incredibly important! • John Deere, Thomas Cook, … • Often restrict relations • Princeton, Amazon, … • Fine-grained types are informative! • Label whole sentence together • Brooklyn: city • Structured prediction again • Sanders: politician, senator

12 NER: Entity Types

3 class: Location, Person, Organization

Stanford CoreNLP 4 class: Location, Person, Organization, Misc

7 class: Location, Person, Organization, Money, Percent, Date, Time

PERSON People, including fictional. NORP Nationalities or religious or political groups. FACILITY Buildings, airports, highways, bridges, etc. ORG Companies, agencies, institutions, etc. GPE Countries, cities, states. spaCy.io LOC Non-GPE locations, mountain ranges, bodies of water. PRODUCT Objects, vehicles, foods, etc. (Not services.) EVENT Named hurricanes, battles, wars, sports events, etc. WORK_OF_ART Titles of books, songs, etc. LANGUAGE Any named language.

From Stanford CoreNLP (http://nlp.stanford.edu/software/CRF-NER.shtml) 13 NER: Entity Types

Fine-grained Types

• More on this later…

From Ling & Weld. AAAI 2012 (http://aiweb.cs.washington.edu/ai/pubs/ling-aaai12.pdf) 14 Dependency Parsing

How it is done: Uses in KG Construction: • Model: score trees using features • Incredibly useful for relations! • Lexical: words, POS, … • What verb is attached? • Structure: distance, … • Relation to which mention? • Prediction: Search over trees • Incredibly useful for attributes! • greedy, spanning tree, belief • Appositives: “X, the CEO, …” propagation, dynamic prog, … • Paths are used as surface relations

Using http://nlp.stanford.edu:8080/corenlp/process 15 Dependency Paths

nmod:to

nmod:in John was born in Liverpool, to Julia and Alfred Lennon. case case

Text Patterns Dependency Paths

John, Liverpool “was born in” “was born in”

John, Julia “was born in Liverpool, to” “was born to”

John, Alfred Lennon “was born in Liverpool, to Julia and” “was born to”

16 Within-document Coreference

Mrs. Lennon.. He… Alfred .. his mother .. Lennon.. the Pool his father John Lennon... he John was born in Liverpool, to Julia and Alfred Lennon.

How it is done: Uses in KG Construction: • Mo`del: score pairwise links • More context for each entity! • dep path, similarity, types, … • Many relations occur on pronouns • “representative mention” • “He is married to her” • Prediction: Search over clusterings • Coref can be used for types • greedy (left to right), ILP, • Nominals: The president, … belief propagation, MCMC, … • Difficult, so often ignored

17 Information Extraction

Lennon.. Mrs. Lennon.. his father John Lennon... the Pool .. his mother .. he Alfred Person Location Person Person John was born in Liverpool, to Julia and Alfred Lennon. NNP VBD VBD IN NNP TO NNP CC NNP NNP

Information Extraction

Alfred Lennon childOf birthplace spouse Liverpool John Lennon childOf Julia Lennon

18 Surface Patterns

Combine tokens, dependency paths, and entity types to define rules.

appos nmod

det case

Argument 1 , DT CEO of Argument 2 Person Organization

Bill Gates, the CEO of Microsoft, said … Mr. Jobs, the brilliant and charming CEO of Apple Inc., said … … announced by Steve Jobs, the CEO of Apple. … announced by Bill Gates, the director and CEO of Microsoft. … mused Bill, a former CEO of Microsoft. and many other possible instantiations…

19 Rule-Based Extraction

appos nmod headOf det case Implies Argument 1 Argument 2

Argument 1 , DT CEO of Argument 2 Person Organization Use a collection of rules as the system itself

Variations High precision: when it fires, it’s correct Source: Easy to explain predictions • Manually specified Easy to fix mistakes • Learned from Data Multiple Rules: However… • Attach priorities/precedence Only work when the rules fire • Attach probabilities (more later) Poor recall: Do not generalize!

20 Supervised Extraction

P(birthplace) = 0.75 Machine Learning: hopefully, generalizes the labels in the right way Classifier Use all of NLP as features: words, POS, NER, dependencies, embeddings

POS NER Dep Path Text in b/w embeddings … However Feature Engineering Usually, a lot of labeled data is needed, which is expensive & time consuming. John was born in Liverpool, to Julia and Alfred Lennon. Requires a lot of feature engineering!

21 Entity Resolution & Linking

...during the late 60's and early 70's, Kevin Smith worked with several local...

...the term hip-hop is attributed to Lovebug Starski. What does it actually mean...

Like Back in 2008, the Lions drafted Kevin Smith, even though Smith was badly...

... backfield in the wake of Kevin Smith's knee injury, and the addition of Haynesworth...

The filmmaker Kevin Smith returns to the role of Silent Bob...

Nothing could be more irrelevant to Kevin Smith's audacious ''Dogma'' than ticking off...

... The Physiological Basis of Politics,” by Kevin Smith, Douglas Oxley, Matthew Hibbing...

22 Entity Names: Two Main Problems

Entities with Same Name Different Names for Entities

Same type of entities share names Nick Names Kevin Smith, John Smith, Bam Bam, Drumpf, … Springfield, …

Things named after each other Typos/Misspellings Clinton, Washington, Paris, Baarak, Barak, Barrack, … Amazon, Princeton, Kingston, …

Partial Reference Inconsistent References First names of people, Location MSFT, APPL, GOOG… instead of team name, Nick names

23 Entity Linking Approach

Washington drops 10 points after game with UCLA Bruins.

Washington DC, George Washington, Washington state, Candidate Generation Lake Washington, Washington Huskies, Denzel Washington, University of Washington, Washington High School, …

Washington DC, George Washington, Washington state, Entity Types LOC/ORG Lake Washington, Washington Huskies, Denzel Washington, University of Washington, Washington High School, …

Washington DC, George Washington, Washington state, Coreference UWashington, Lake Washington, Washington Huskies, Denzel Washington, Huskies University of Washington, Washington High School, …

Washington DC, George Washington, Washington state, UCLA Bruins, Coherence Lake Washington, Washington Huskies, Denzel Washington, USC Trojans University of Washington, Washington High School, …

Vinculum, Ling, Singh, Weld, TACL (2015) 24 Information Extraction

Lennon.. Mrs. Lennon.. his father John Lennon... the Pool .. his mother .. he Alfred Person Location Person Person John was born in Liverpool, to Julia and Alfred Lennon. NNP VBD VBD IN NNP TO NNP CC NNP NNP

Information Extraction

Alfred Lennon childOf birthplace spouse Liverpool John Lennon childOf Julia Lennon

25