Information Extraction

Prof. Sameer Singh CS 295: STATISTICAL NLP WINTER 2017

February 21, 2017

Based on slides from Dan Jurafski, Chris Manning, Jay Pujara, and everyone else they copied from. Outline

What is Information Extraction

Named Entity Recognition

Homework 3

CS 295: STATISTICAL NLP (WINTER 2017) 2 Outline

What is Information Extraction

Named Entity Recognition

Homework 3

CS 295: STATISTICAL NLP (WINTER 2017) 3 Making Sense of Text

Query ?

Search Search Query (IR) DB Query

Database or Graph Documents Documents Documents Documents Information Documents Documents Documents Extraction

Structured Massive Corpus of Representation Unstructured Text

4 News Articles

Query Which AI startups have been acquired by Tech companies?

acquired

founded Company

Structured People Representation employee belongsTo

Industry expertIn

Information Extraction

Massive Corpus of News Articles

5 Fiction

Query Which two characters are not related by blood?

Structured Representation

Information Extraction

Collection of Books

6 Academic Research

Query What is the interaction pathway between YY1 and TIP60?

Structured Representation

Massive Corpus of Information Scientific Papers Extraction Applications ?

Question Answering 250 Documents 200 Documents Documents Information 150 Documents Documents 100 Documents Extraction 50 Documents 0 April June Database or Graph Visualization & Statistics

Downstream AI applications

8 Low-level Info. Extraction

CS 295: STATISTICAL NLP (WINTER 2017) 9 Slightly better…

CS 295: STATISTICAL NLP (WINTER 2017) 10 Slightly better?

The headquarters of BHP Billiton Limited, and the global headquarters of the combined BHP Billiton Group, are located in Melbourne, Australia.

headquarters(“BHP Biliton Limited”, “Melbourne, Australia”)

CS 295: STATISTICAL NLP (WINTER 2017) 11 In the industry… Google Knowledge Graph ◦ Google Knowledge Vault Amazon Product Graph Facebook Graph API IBM Watson Microsoft Satori ◦ Project Hanover/Literome LinkedIn Knowledge Graph Yandex Object Answer Diffbot, GraphIQ, Maana, ParseHub, Reactor Labs, SpazioDati

CS 295: STATISTICAL NLP (WINTER 2017) 12 Knowledge Extraction

John was born in , to Julia and Alfred . Text

Literal Facts

Alfred Lennon childOf birthplace Liverpool Lennon childOf

13 Role of NLP?

John was born in Liverpool, to Julia and Alfred Lennon.

Natural Language Processing

Lennon.. Mrs. Lennon.. his father ... the Pool .. his mother .. he Alfred Person Location Person Person John was born in Liverpool, to Julia and Alfred Lennon. NNP VBD VBD IN NNP TO NNP CC NNP NNP

14 Information Extraction

Lennon.. Mrs. Lennon.. his father John Lennon... the Pool .. his mother .. he Alfred Person Location Person Person John was born in Liverpool, to Julia and Alfred Lennon. NNP VBD VBD IN NNP TO NNP CC NNP NNP

Information Extraction

Alfred Lennon childOf birthplace spouse Liverpool John Lennon childOf Julia Lennon

15 Breaking it Down

Alfred Lennon Entity resolution, childOf birthplace spouse Entity linking, Liverpool John Lennon Relation extraction… Julia

Extraction childOf

Information Lennon

Lennon.. Mrs. Lennon.. his father John Lennon... the Pool .. his mother .. he Alfred Coreference Resolution... Person Location Person Person John was born in Liverpool, to Julia and Alfred Lennon. Document

Dependency Parsing, Part of speech tagging, Named entity recognition… Sentence NNP VBD VBD IN NNP TO NNP CC NNP NNP John was born in Liverpool, to Julia and Alfred Lennon.

16 Outline

What is Information Extraction

Named Entity Recognition

Homework 3

Relation Extraction

CS 295: STATISTICAL NLP (WINTER 2017) 17 Named Entity Recognition

An important sub-task: find and classify names in text, for example:

◦ The decision by the independent MP Andrew Wilkie to withdraw his support for the minority Labor government sounded dramatic but it should not further threaten its stability. When, after the 2010 election, Wilkie, Rob Oakeshott, Tony Windsor and the Greens agreed to support Labor, they gave just two guarantees: confidence and supply. Named Entity Recognition

An important sub-task: find and classify names in text, for example:

◦ The decision by the independent MP Andrew Wilkie to withdraw his support for the minority Labor government sounded dramatic but it should not further threaten its stability. When, after the 2010 election, Wilkie, Rob Oakeshott, Tony Windsor and the Greens agreed to support Labor, they gave just two guarantees: confidence and supply. Named Entity Recognition

An important sub-task: find and classify names in text, for example:

◦ The decision by the independent MP Andrew Wilkie to withdraw his Person support for the minority Labor government sounded dramatic but it should not further threaten its stability. When, after the 2010 election, Date Wilkie, Rob Oakeshott, Tony Windsor and the Greens agreed to support Location Labor, they gave just two guarantees: confidence and supply. Organi- zation Detecting Named Entities

Person Location Person Person John was born in Liverpool, to Julia and Alfred Lennon.

How it is done: Uses in Knowledge Extraction: • Context is important! • Mentions describes the nodes • Georgia, Washington, … • Types are incredibly important! • John Deere, Thomas Cook, … • Often restrict relations • Princeton, Amazon, … • Fine-grained types are informative! • Label whole sentence together • Brooklyn: city • Structured prediction again • Sanders: politician, senator

21 NER: Entity Types

3 class: Location, Person, Organization

Stanford CoreNLP 4 class: Location, Person, Organization, Misc

7 class: Location, Person, Organization, Money, Percent, Date, Time

PERSON People, including fictional. NORP Nationalities or religious or political groups. FACILITY Buildings, airports, highways, bridges, etc. ORG Companies, agencies, institutions, etc. GPE Countries, cities, states. spaCy.io LOC Non-GPE locations, mountain ranges, bodies of water. PRODUCT Objects, vehicles, foods, etc. (Not services.) EVENT Named hurricanes, battles, wars, sports events, etc. WORK_OF_ART Titles of books, songs, etc. LANGUAGE Any named language.

From Stanford CoreNLP (http://nlp.stanford.edu/software/CRF-NER.shtml) 22 NER: Entity Types

Fine-grained Types

From Ling & Weld. AAAI 2012 (http://aiweb.cs.washington.edu/ai/pubs/ling-aaai12.pdf) 23 CS 295: STATISTICAL NLP (WINTER 2017) 24 Sequence Labeling for NER

CS 295: STATISTICAL NLP (WINTER 2017) 25 Features: Words and Lexicons

Words

Lexicons

CS 295: STATISTICAL NLP (WINTER 2017) 26 Features: Prefixes/Suffixes

CS 295: STATISTICAL NLP (WINTER 2017) 27 4 17 14 4 Features: Substrings of Words

drug oxa : field company 0 6 0 0 8 movie 0 0 0 00 14 6 place person

Cotrimoxazole

708 68 241 18 Wethersfield

Alien Fury: Countdown to Invasion

CS 295: STATISTICAL NLP (WINTER 2017) 28 Features: Word Shapes

if A-Z X

if a-z x Shape(c)= if 0-9 d

o.w. c John DC-100 CamelCase

Word shapes Xxxx XX-ddd XxxxxXxxx

Short shapes Xx X-d XxXx

CS 295: STATISTICAL NLP (WINTER 2017) 29 Features: Surrounding Context

John Deere announced

i-1 i i+1

NEXT_BIAS BIAS PREV_BIAS NEXT_WORD=Deere WORD=Deere PREV_WORD=Deere NEXT_LWORD=deere LWORD=deere PREV_LWORD=deere NEXT_FIRSTCAP=True FIRSTCAP=True PREV_FIRSTCAP=True NEXT_SSHAPE=Xx SSHAPE=Xx PREV_SSHAPE=Xx NEXT_LEXICON=company LEXICON=company PREV_LEXICON=company … … …

CS 295: STATISTICAL NLP (WINTER 2017) 30 Outline

What is Information Extraction

Named Entity Recognition

Homework 3

Relation Extraction

CS 295: STATISTICAL NLP (WINTER 2017) 31 Sequence Tagging on Twitter

Parts of Speech

What a productive day . Not .

PRON DET ADJ NOUN . ADV .

Named Entity Recognition

‘ Breaking Dawn ’ Returns to Vancouver on January 11th

O B-MOVIE I-MOVIE O O O B-GEO-LOC O O O

Steedman, 2000 CS 295: STATISTICAL NLP (WINTER 2017) 32 Sequence Tagging Models

Logistic Regression

Conditional Random Fields

Steedman, 2000 CS 295: STATISTICAL NLP (WINTER 2017) 33 What do you have to do?

Feature Engineering

Test data will be released very close Viterbi Algorithm to the deadline!

Steedman, 2000 CS 295: STATISTICAL NLP (WINTER 2017) 34 Upcoming…

• Homework 3 is due on February 27 Homework • Write-up and data has been released.

• Status report due in 1.5 weeks: March 2, 2017 Project • Instructions coming soon • Only 5 pages

• Paper summaries: February 28, March 14 Summaries • Only 1 page each

CS 295: STATISTICAL NLP (WINTER 2017) 35