Information Extraction
Total Page:16
File Type:pdf, Size:1020Kb
Information Extraction Prof. Sameer Singh CS 295: STATISTICAL NLP WINTER 2017 February 21, 2017 Based on slides from Dan Jurafski, Chris Manning, Jay Pujara, and everyone else they copied from. Outline What is Information Extraction Named Entity Recognition Homework 3 CS 295: STATISTICAL NLP (WINTER 2017) 2 Outline What is Information Extraction Named Entity Recognition Homework 3 CS 295: STATISTICAL NLP (WINTER 2017) 3 Making Sense of Text Query ? Search Search Query (IR) DB Query Database or Graph Documents Documents Documents Documents Information Documents Documents Documents Extraction Structured Massive Corpus of Representation Unstructured Text 4 News Articles Query Which AI startups have been acquired by Tech companies? acquired founded Company Structured People Representation employee belongsTo Industry expertIn Information Extraction Massive Corpus of News Articles 5 Fiction Query Which two characters are not related by blood? Structured Representation Information Extraction Collection of Books 6 Academic Research Query What is the interaction pathway between YY1 and TIP60? Structured Representation Massive Corpus of Information Scientific Papers Extraction Applications ? Question Answering 250 Documents 200 Documents Documents Information 150 Documents Documents 100 Documents Extraction 50 Documents 0 April June Database or Graph Visualization & Statistics Downstream AI applications 8 Low-level Info. Extraction CS 295: STATISTICAL NLP (WINTER 2017) 9 Slightly better… CS 295: STATISTICAL NLP (WINTER 2017) 10 Slightly better? The headquarters of BHP Billiton Limited, and the global headquarters of the combined BHP Billiton Group, are located in Melbourne, Australia. headquarters(“BHP Biliton Limited”, “Melbourne, Australia”) CS 295: STATISTICAL NLP (WINTER 2017) 11 In the industry… Google Knowledge Graph ◦ Google Knowledge Vault Amazon Product Graph Facebook Graph API IBM Watson Microsoft Satori ◦ Project Hanover/Literome LinkedIn Knowledge Graph Yandex Object Answer Diffbot, GraphIQ, Maana, ParseHub, Reactor Labs, SpazioDati CS 295: STATISTICAL NLP (WINTER 2017) 12 Knowledge Extraction John was born in Liverpool, to Julia and Alfred Lennon. Text Literal Facts Alfred Lennon childOf birthplace Liverpool John Lennon childOf Julia Lennon 13 Role of NLP? John was born in Liverpool, to Julia and Alfred Lennon. Natural Language Processing Lennon.. Mrs. Lennon.. his father John Lennon... the Pool .. his mother .. he Alfred Person Location Person Person John was born in Liverpool, to Julia and Alfred Lennon. NNP VBD VBD IN NNP TO NNP CC NNP NNP 14 Information Extraction Lennon.. Mrs. Lennon.. his father John Lennon... the Pool .. his mother .. he Alfred Person Location Person Person John was born in Liverpool, to Julia and Alfred Lennon. NNP VBD VBD IN NNP TO NNP CC NNP NNP Information Extraction Alfred Lennon childOf birthplace spouse Liverpool John Lennon childOf Julia Lennon 15 Breaking it Down Alfred Lennon Entity resolution, childOf birthplace spouse Entity linking, Liverpool John Lennon Relation extraction… Julia Extraction childOf Information Lennon Lennon.. Mrs. Lennon.. his father John Lennon... the Pool .. his mother .. he Alfred Coreference Resolution... Person Location Person Person John was born in Liverpool, to Julia and Alfred Lennon. Document Dependency Parsing, Part of speech tagging, Named entity recognition… Sentence NNP VBD VBD IN NNP TO NNP CC NNP NNP John was born in Liverpool, to Julia and Alfred Lennon. 16 Outline What is Information Extraction Named Entity Recognition Homework 3 Relation Extraction CS 295: STATISTICAL NLP (WINTER 2017) 17 Named Entity Recognition An important sub-task: find and classify names in text, for example: ◦ The decision by the independent MP Andrew Wilkie to withdraw his support for the minority Labor government sounded dramatic but it should not further threaten its stability. When, after the 2010 election, Wilkie, Rob Oakeshott, Tony Windsor and the Greens agreed to support Labor, they gave just two guarantees: confidence and supply. Named Entity Recognition An important sub-task: find and classify names in text, for example: ◦ The decision by the independent MP Andrew Wilkie to withdraw his support for the minority Labor government sounded dramatic but it should not further threaten its stability. When, after the 2010 election, Wilkie, Rob Oakeshott, Tony Windsor and the Greens agreed to support Labor, they gave just two guarantees: confidence and supply. Named Entity Recognition An important sub-task: find and classify names in text, for example: ◦ The decision by the independent MP Andrew Wilkie to withdraw his Person support for the minority Labor government sounded dramatic but it should not further threaten its stability. When, after the 2010 election, Date Wilkie, Rob Oakeshott, Tony Windsor and the Greens agreed to support Location Labor, they gave just two guarantees: confidence and supply. Organi- zation Detecting Named Entities Person Location Person Person John was born in Liverpool, to Julia and Alfred Lennon. How it is done: Uses in Knowledge Extraction: • Context is important! • Mentions describes the nodes • Georgia, Washington, … • Types are incredibly important! • John Deere, Thomas Cook, … • Often restrict relations • Princeton, Amazon, … • Fine-grained types are informative! • Label whole sentence together • Brooklyn: city • Structured prediction again • Sanders: politician, senator 21 NER: Entity Types 3 class: Location, Person, Organization Stanford CoreNLP 4 class: Location, Person, Organization, Misc 7 class: Location, Person, Organization, Money, Percent, Date, Time PERSON People, including fictional. NORP Nationalities or religious or political groups. FACILITY Buildings, airports, highways, bridges, etc. ORG Companies, agencies, institutions, etc. GPE Countries, cities, states. spaCy.io LOC Non-GPE locations, mountain ranges, bodies of water. PRODUCT Objects, vehicles, foods, etc. (Not services.) EVENT Named hurricanes, battles, wars, sports events, etc. WORK_OF_ART Titles of books, songs, etc. LANGUAGE Any named language. From Stanford CoreNLP (http://nlp.stanford.edu/software/CRF-NER.shtml) 22 NER: Entity Types Fine-grained Types From Ling & Weld. AAAI 2012 (http://aiweb.cs.washington.edu/ai/pubs/ling-aaai12.pdf) 23 CS 295: STATISTICAL NLP (WINTER 2017) 24 Sequence Labeling for NER CS 295: STATISTICAL NLP (WINTER 2017) 25 Features: Words and Lexicons Words Lexicons CS 295: STATISTICAL NLP (WINTER 2017) 26 Features: Prefixes/Suffixes CS 295: STATISTICAL NLP (WINTER 2017) 27 4 17 14 4 Features: Substrings of Words drug oxa : field company 0 6 0 0 8 movie 0 0 0 00 14 6 place person Cotrimoxazole 708 68 241 18 Wethersfield Alien Fury: Countdown to Invasion CS 295: STATISTICAL NLP (WINTER 2017) 28 Features: Word Shapes if A-Z X if a-z x Shape(c)= if 0-9 d o.w. c John DC-100 CamelCase Word shapes Xxxx XX-ddd XxxxxXxxx Short shapes Xx X-d XxXx CS 295: STATISTICAL NLP (WINTER 2017) 29 Features: Surrounding Context John Deere announced i-1 i i+1 NEXT_BIAS BIAS PREV_BIAS NEXT_WORD=Deere WORD=Deere PREV_WORD=Deere NEXT_LWORD=deere LWORD=deere PREV_LWORD=deere NEXT_FIRSTCAP=True FIRSTCAP=True PREV_FIRSTCAP=True NEXT_SSHAPE=Xx SSHAPE=Xx PREV_SSHAPE=Xx NEXT_LEXICON=company LEXICON=company PREV_LEXICON=company … … … CS 295: STATISTICAL NLP (WINTER 2017) 30 Outline What is Information Extraction Named Entity Recognition Homework 3 Relation Extraction CS 295: STATISTICAL NLP (WINTER 2017) 31 Sequence Tagging on Twitter Parts of Speech What a productive day . Not . PRON DET ADJ NOUN . ADV . Named Entity Recognition ‘ Breaking Dawn ’ Returns to Vancouver on January 11th O B-MOVIE I-MOVIE O O O B-GEO-LOC O O O Steedman, 2000 CS 295: STATISTICAL NLP (WINTER 2017) 32 Sequence Tagging Models Logistic Regression Conditional Random Fields Steedman, 2000 CS 295: STATISTICAL NLP (WINTER 2017) 33 What do you have to do? Feature Engineering Test data will be released very close Viterbi Algorithm to the deadline! Steedman, 2000 CS 295: STATISTICAL NLP (WINTER 2017) 34 Upcoming… • Homework 3 is due on February 27 Homework • Write-up and data has been released. • Status report due in 1.5 weeks: March 2, 2017 Project • Instructions coming soon • Only 5 pages • Paper summaries: February 28, March 14 Summaries • Only 1 page each CS 295: STATISTICAL NLP (WINTER 2017) 35.