Information Extraction
Prof. Sameer Singh CS 295: STATISTICAL NLP WINTER 2017
February 21, 2017
Based on slides from Dan Jurafski, Chris Manning, Jay Pujara, and everyone else they copied from. Outline
What is Information Extraction
Named Entity Recognition
Homework 3
CS 295: STATISTICAL NLP (WINTER 2017) 2 Outline
What is Information Extraction
Named Entity Recognition
Homework 3
CS 295: STATISTICAL NLP (WINTER 2017) 3 Making Sense of Text
Query ?
Search Search Query (IR) DB Query
Database or Graph Documents Documents Documents Documents Information Documents Documents Documents Extraction
Structured Massive Corpus of Representation Unstructured Text
4 News Articles
Query Which AI startups have been acquired by Tech companies?
acquired
founded Company
Structured People Representation employee belongsTo
Industry expertIn
Information Extraction
Massive Corpus of News Articles
5 Fiction
Query Which two characters are not related by blood?
Structured Representation
Information Extraction
Collection of Books
6 Academic Research
Query What is the interaction pathway between YY1 and TIP60?
Structured Representation
Massive Corpus of Information Scientific Papers Extraction Applications ?
Question Answering 250 Documents 200 Documents Documents Information 150 Documents Documents 100 Documents Extraction 50 Documents 0 April June Database or Graph Visualization & Statistics
Downstream AI applications
8 Low-level Info. Extraction
CS 295: STATISTICAL NLP (WINTER 2017) 9 Slightly better…
CS 295: STATISTICAL NLP (WINTER 2017) 10 Slightly better?
The headquarters of BHP Billiton Limited, and the global headquarters of the combined BHP Billiton Group, are located in Melbourne, Australia.
headquarters(“BHP Biliton Limited”, “Melbourne, Australia”)
CS 295: STATISTICAL NLP (WINTER 2017) 11 In the industry… Google Knowledge Graph ◦ Google Knowledge Vault Amazon Product Graph Facebook Graph API IBM Watson Microsoft Satori ◦ Project Hanover/Literome LinkedIn Knowledge Graph Yandex Object Answer Diffbot, GraphIQ, Maana, ParseHub, Reactor Labs, SpazioDati
CS 295: STATISTICAL NLP (WINTER 2017) 12 Knowledge Extraction
John was born in Liverpool, to Julia and Alfred Lennon. Text
Literal Facts
Alfred Lennon childOf birthplace Liverpool John Lennon childOf Julia Lennon
13 Role of NLP?
John was born in Liverpool, to Julia and Alfred Lennon.
Natural Language Processing
Lennon.. Mrs. Lennon.. his father John Lennon... the Pool .. his mother .. he Alfred Person Location Person Person John was born in Liverpool, to Julia and Alfred Lennon. NNP VBD VBD IN NNP TO NNP CC NNP NNP
14 Information Extraction
Lennon.. Mrs. Lennon.. his father John Lennon... the Pool .. his mother .. he Alfred Person Location Person Person John was born in Liverpool, to Julia and Alfred Lennon. NNP VBD VBD IN NNP TO NNP CC NNP NNP
Information Extraction
Alfred Lennon childOf birthplace spouse Liverpool John Lennon childOf Julia Lennon
15 Breaking it Down
Alfred Lennon Entity resolution, childOf birthplace spouse Entity linking, Liverpool John Lennon Relation extraction… Julia
Extraction childOf
Information Lennon
Lennon.. Mrs. Lennon.. his father John Lennon... the Pool .. his mother .. he Alfred Coreference Resolution... Person Location Person Person John was born in Liverpool, to Julia and Alfred Lennon. Document
Dependency Parsing, Part of speech tagging, Named entity recognition… Sentence NNP VBD VBD IN NNP TO NNP CC NNP NNP John was born in Liverpool, to Julia and Alfred Lennon.
16 Outline
What is Information Extraction
Named Entity Recognition
Homework 3
Relation Extraction
CS 295: STATISTICAL NLP (WINTER 2017) 17 Named Entity Recognition
An important sub-task: find and classify names in text, for example:
◦ The decision by the independent MP Andrew Wilkie to withdraw his support for the minority Labor government sounded dramatic but it should not further threaten its stability. When, after the 2010 election, Wilkie, Rob Oakeshott, Tony Windsor and the Greens agreed to support Labor, they gave just two guarantees: confidence and supply. Named Entity Recognition
An important sub-task: find and classify names in text, for example:
◦ The decision by the independent MP Andrew Wilkie to withdraw his support for the minority Labor government sounded dramatic but it should not further threaten its stability. When, after the 2010 election, Wilkie, Rob Oakeshott, Tony Windsor and the Greens agreed to support Labor, they gave just two guarantees: confidence and supply. Named Entity Recognition
An important sub-task: find and classify names in text, for example:
◦ The decision by the independent MP Andrew Wilkie to withdraw his Person support for the minority Labor government sounded dramatic but it should not further threaten its stability. When, after the 2010 election, Date Wilkie, Rob Oakeshott, Tony Windsor and the Greens agreed to support Location Labor, they gave just two guarantees: confidence and supply. Organi- zation Detecting Named Entities
Person Location Person Person John was born in Liverpool, to Julia and Alfred Lennon.
How it is done: Uses in Knowledge Extraction: • Context is important! • Mentions describes the nodes • Georgia, Washington, … • Types are incredibly important! • John Deere, Thomas Cook, … • Often restrict relations • Princeton, Amazon, … • Fine-grained types are informative! • Label whole sentence together • Brooklyn: city • Structured prediction again • Sanders: politician, senator
21 NER: Entity Types
3 class: Location, Person, Organization
Stanford CoreNLP 4 class: Location, Person, Organization, Misc
7 class: Location, Person, Organization, Money, Percent, Date, Time
PERSON People, including fictional. NORP Nationalities or religious or political groups. FACILITY Buildings, airports, highways, bridges, etc. ORG Companies, agencies, institutions, etc. GPE Countries, cities, states. spaCy.io LOC Non-GPE locations, mountain ranges, bodies of water. PRODUCT Objects, vehicles, foods, etc. (Not services.) EVENT Named hurricanes, battles, wars, sports events, etc. WORK_OF_ART Titles of books, songs, etc. LANGUAGE Any named language.
From Stanford CoreNLP (http://nlp.stanford.edu/software/CRF-NER.shtml) 22 NER: Entity Types
Fine-grained Types
From Ling & Weld. AAAI 2012 (http://aiweb.cs.washington.edu/ai/pubs/ling-aaai12.pdf) 23 CS 295: STATISTICAL NLP (WINTER 2017) 24 Sequence Labeling for NER
CS 295: STATISTICAL NLP (WINTER 2017) 25 Features: Words and Lexicons
Words
Lexicons
CS 295: STATISTICAL NLP (WINTER 2017) 26 Features: Prefixes/Suffixes
CS 295: STATISTICAL NLP (WINTER 2017) 27 4 17 14 4 Features: Substrings of Words
drug oxa : field company 0 6 0 0 8 movie 0 0 0 00 14 6 place person
Cotrimoxazole
708 68 241 18 Wethersfield
Alien Fury: Countdown to Invasion
CS 295: STATISTICAL NLP (WINTER 2017) 28 Features: Word Shapes
if A-Z X
if a-z x Shape(c)= if 0-9 d
o.w. c John DC-100 CamelCase
Word shapes Xxxx XX-ddd XxxxxXxxx
Short shapes Xx X-d XxXx
CS 295: STATISTICAL NLP (WINTER 2017) 29 Features: Surrounding Context
John Deere announced
i-1 i i+1
NEXT_BIAS BIAS PREV_BIAS NEXT_WORD=Deere WORD=Deere PREV_WORD=Deere NEXT_LWORD=deere LWORD=deere PREV_LWORD=deere NEXT_FIRSTCAP=True FIRSTCAP=True PREV_FIRSTCAP=True NEXT_SSHAPE=Xx SSHAPE=Xx PREV_SSHAPE=Xx NEXT_LEXICON=company LEXICON=company PREV_LEXICON=company … … …
CS 295: STATISTICAL NLP (WINTER 2017) 30 Outline
What is Information Extraction
Named Entity Recognition
Homework 3
Relation Extraction
CS 295: STATISTICAL NLP (WINTER 2017) 31 Sequence Tagging on Twitter
Parts of Speech
What a productive day . Not .
PRON DET ADJ NOUN . ADV .
Named Entity Recognition
‘ Breaking Dawn ’ Returns to Vancouver on January 11th
O B-MOVIE I-MOVIE O O O B-GEO-LOC O O O
Steedman, 2000 CS 295: STATISTICAL NLP (WINTER 2017) 32 Sequence Tagging Models
Logistic Regression
Conditional Random Fields
Steedman, 2000 CS 295: STATISTICAL NLP (WINTER 2017) 33 What do you have to do?
Feature Engineering
Test data will be released very close Viterbi Algorithm to the deadline!
Steedman, 2000 CS 295: STATISTICAL NLP (WINTER 2017) 34 Upcoming…
• Homework 3 is due on February 27 Homework • Write-up and data has been released.
• Status report due in 1.5 weeks: March 2, 2017 Project • Instructions coming soon • Only 5 pages
• Paper summaries: February 28, March 14 Summaries • Only 1 page each
CS 295: STATISTICAL NLP (WINTER 2017) 35