Information Extraction: Techniques, Advances and Challenges

Heng Ji Computer Science Department and Linguistics Department Queens College and Graduate Center City University of New York [email protected]

June 12, 2012

Outline

 Introduction
 Basic Information Extraction (IE)
 Advanced IE
   Enhance Quality
   Enhance Portability
 Popular Research Directions
   Cross-source IE
   IE for Noisy Data
 Resources

Introduction

 What is IE
 Why IE is Useful
 IE History

What is IE

 (In this talk) Information Extraction (IE) = identifying instances of facts (names/entities, relations and events) in semi-structured or unstructured text, and converting them into structured representations (e.g. databases)

Barry Diller on Wednesday quit as chief of Vivendi Universal Entertainment.

Trigger: quit (a "Personnel/End-Position" event)
Arguments:
  Role = Person: Barry Diller
  Role = Organization: Vivendi Universal Entertainment
  Role = Position: chief
  Role = Time-within: Wednesday (2003-03-04)

Why IE is Useful

 IE can build a database with the information on a given relation or event from news, financial, bio-medical domains…
   Attack/arrest events
   People’s jobs
   People’s whereabouts
   Merger and acquisition activity
   Disease outbreaks
   Patient records
   Experiment chains in scientific papers
 Component technology for other areas
   Question Answering (QA)
   Summarization
   Automatic translation
   Document indexing
   Structured Search: “who are the top employees of IBM from 2002-2012?”
   Opinion Mining/Sentiment Extraction
   Text Data Mining over Extracted Relationships

Application Example: Dynamic Event Tracking

(Chen and Ji, 2009)

http://nlp.cs.qc.cuny.edu/demo/personvisual.html

IE for Scientific Literature

For sequestration, the CO2 captured from a fossil fuel plant is first compressed until the combined heat and pressure make it "supercritical" — a state in which it displays both gas and fluid properties. At 3 kilometers, you needed only 10 wells because the increased temperature lowered the viscosity of the CO2, allowed it to slide more easily into the reservoir. Supercritical CO2 is buoyant and will rise above the other fluids. If it rises high enough (above a depth of 2,600 feet), it will return to a gaseous state.

[Figure: event structures extracted from the passage, centered on "CO2 Sequestration": capture (Object: CO2; Place: fossil fuel plant), compress (Object: CO2; State: supercritical) and rise (Object: CO2; State: gaseous; Depth: 2,600 feet) linked as a subsequence, plus lower (Agent: increased temperature; Target: viscosity) and slide (Object: CO2; Place: reservoir; Volume: 10 wells; Depth: 3 kilometers) linked by a causal relation.]

Real Application: Terrorism Networks Extraction

Demo URL: http://blender2.cs.qc.cuny.edu/BlenderGraph
Demo Video: http://nlp.cs.qc.cuny.edu/terrorism.m4v

IE History: Early Projects

 Knowledge-based, rule-based
 FRUMP – 1979
   Newswire
 LSP (Language String Project) – 1981
   Naomi Sager et al.
   AMA – American Medical Association
   Patient summaries

IE History: MUC

 MUC – Message Understanding Conferences (1987-1998)
   DARPA, NRAD
   MUC-6: Named entity, coreference and template element
   MUC-7: template relation
   Standardization, Evaluation, Dissemination
 DARPA’s TIPSTER Program: Document Detection, Summarization and Information Extraction – until 1998
 TREC (Text Retrieval Conferences)

Year   Conference   Domain
1987   MUC-I        Navy messages
1989   MUC-II       Navy messages
1991   MUC-3        News about terrorist attacks
1992   MUC-4        News about terrorist attacks
1993   MUC-5        Company news (joint ventures, micro-electronics production)
1995   MUC-6        Company news (management succession)
1998   MUC-7        Airline company orders

IE History: ACE/CONLL

 HUB-4 and ACE (Automatic Content Extraction)

 NIST (National Institute of Standards and Technology)

 Spoken and printed text

 ACE defined 7 types of entities, 17 types of relations, 33 different types of events (2002-2008)

 Multilingual (English, Chinese, Arabic)

 The top systems obtained mention values in the range of 70-85, entity values in the range of 60-70, relation values in the range of 35-45, and event values in the range of 15-30.

 CoNLL (Conference on Natural Language Learning)

 Since 1997

 Name tagging in the 2002 and 2003 editions

 Multilingual tagging of person (PER), location (LOC), organization (ORG) and other (O) classes

IE History: Knowledge Base Population (KBP, 2009-)

 General Goal
   Promote research in discovering facts about entities and expanding a knowledge source automatically
   Conducted as part of the NIST Text Analysis Conference

 What’s New
   Extraction at large scale (1.3 million documents)
   Using a representative collection (not selected for relevance)
   Cross-document entity resolution (extending the limited effort in ACE)
   Linking the facts in text to a knowledge base
   Distant (and noisy) supervision through Infoboxes
   Rapid adaptation to new relations
   Support multi-lingual information fusion (cross-lingual KBP)
   Capture temporal information (temporal KBP)
   Automatic KB construction (cold-start KBP)

Outline

 Introduction
 Basic Information Extraction (IE)
 Advanced IE
   Enhance Quality
   Enhance Portability
 Popular Research Directions
   Cross-source IE
   IE for Noisy Data
 Resources

Basic IE

 Methods

 Rule-based

 Pattern Learning

 Supervised Learning
 IE Components and State-of-the-Art

 Name Tagging

 Entity Coreference Resolution

 Relation Extraction

 Event Mention Extraction

 Event Coreference Resolution

Traditional IE Methods

 Handcrafted systems
   Knowledge (rule) based
   Hand-written Patterns
   Gazetteers
   Rule-based approaches: FASTUS (SRI, 1996), Proteus (NYU, 1996), LaSIE-II (U-Sheffield, 1998)
   Example-based learning: AutoSlog (UMASS, 1993), CRYSTAL (UMASS, 1996)
   Statistical models: Collins et al. (1998), Miller et al. (2000)
   Advantages: simple, fast, language independent, easy to retarget
   Disadvantages: collection and maintenance of lists, cannot deal with fact variants, cannot resolve ambiguity, poor portability (across domains and languages)

 Automatic systems
   Pattern Learning
   Supervised Learning

Pattern Learning based IE

 Pattern Examples
   Name Tagging:
     CapWord + {City, Forest, Center}   e.g. Sherwood Forest
     CapWord + {Street, Boulevard, Avenue, Crescent, Road}   e.g. Portobello Street
     “to the” COMPASS “of” CapWord   e.g. to the south of Loitokitok
     “based in” CapWord   e.g. based in Loitokitok
     CapWord “is a” (ADJ)? GeoWord   e.g. Loitokitok is a friendly city
   Event Extraction: [Person] quit as [Position] of [Organization]
 Manually writing and editing such patterns requires some skill and considerable time (a small code sketch of this style of pattern follows)
 These patterns cannot be easily adapted to new domains
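As a minimal sketch, the name-tagging patterns above can be encoded as regular expressions; the lexicon words and the LOCATION label below are illustrative assumptions, not part of the tutorial.

```python
import re

# Illustrative hand-written patterns in the spirit of the examples above
PATTERNS = [
    # CapWord + {City, Forest, Center}   e.g. "Sherwood Forest"
    (re.compile(r"\b[A-Z][a-z]+ (?:City|Forest|Center)\b"), "LOCATION"),
    # CapWord + {Street, Boulevard, Avenue, Crescent, Road}   e.g. "Portobello Street"
    (re.compile(r"\b[A-Z][a-z]+ (?:Street|Boulevard|Avenue|Crescent|Road)\b"), "LOCATION"),
    # "based in" CapWord   e.g. "based in Loitokitok"
    (re.compile(r"\bbased in ([A-Z][a-z]+)\b"), "LOCATION"),
]

def tag_names(text):
    """Return (mention string, type, character span) for every pattern match."""
    mentions = []
    for regex, label in PATTERNS:
        for m in regex.finditer(text):
            # Use the capturing group if the pattern has one, otherwise the whole match
            span = m.span(1) if regex.groups else m.span()
            mentions.append((text[span[0]:span[1]], label, span))
    return mentions

print(tag_names("The charity, based in Loitokitok, also works near Sherwood Forest."))
```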

 Learn these patterns automatically based on an annotated corpus pre-processed by syntactic and semantic analyzers (Muslea, 1999); details later

Supervised Learning based IE

 ‘Pipeline’ style IE
   Split the task into several components
   Prepare data annotation for each component
   Apply supervised machine learning methods to address each component separately
 Most state-of-the-art ACE IE systems were developed in this way
 Provides a great opportunity to apply a wide range of learning models and to incorporate diverse levels of linguistic features to improve each component
 Substantial progress has been achieved on some of these components, such as name tagging and relation extraction

Major IE Components

Name/Nominal Extraction: “Barry Diller”, “chief”

Entity Coreference Resolution: “Barry Diller” = “chief”

Time Identification and Normalization: Wednesday (2003-03-04)

Relation Extraction: “Vivendi Universal Entertainment” is located in “France”

Event Mention Extraction and Event Coreference Resolution: “Barry Diller” is the Person argument of the End-Position event triggered by “quit”

Name Tagging: Task

 Person (PER): named person or family

 Organization (ORG): named corporate, governmental, or other organizational entity

 Geo-political entity (GPE): name of politically or geographically defined location (cities, provinces, countries, international regions, bodies of water, mountains, etc.)

Example: George W. Bush discussed Iraq

 But also: Location, Artifact, Facility, Vehicle, Weapon, Product, etc.

 Extended name hierarchy, 150 types, domain-dependent (Sekine and Nobata, 2004)

 Convert it into a sequence labeling problem – “BIO” tagging:

George   W.      Bush    discussed   Iraq
B-PER    I-PER   I-PER   O           B-GPE
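As a minimal sketch, the conversion from span-level name annotations to BIO labels can be done as follows; the (start, end, type) annotation format is an assumption for illustration.

```python
def to_bio(tokens, annotations):
    """Convert (start, end, type) token-span annotations into BIO labels.

    tokens:      list of tokens, e.g. ["George", "W.", "Bush", "discussed", "Iraq"]
    annotations: list of (start, end, type) with token offsets, end exclusive,
                 e.g. [(0, 3, "PER"), (4, 5, "GPE")]
    """
    labels = ["O"] * len(tokens)
    for start, end, etype in annotations:
        labels[start] = "B-" + etype          # first token of the name
        for i in range(start + 1, end):
            labels[i] = "I-" + etype          # remaining tokens of the name
    return labels

tokens = ["George", "W.", "Bush", "discussed", "Iraq"]
print(to_bio(tokens, [(0, 3, "PER"), (4, 5, "GPE")]))
# ['B-PER', 'I-PER', 'I-PER', 'O', 'B-GPE']
```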

Supervised Learning for Name Tagging

 Maximum Entropy Models (Borthwick, 1999; Chieu and Ng, 2002; Florian et al., 2007)
 Decision Trees (Sekine et al., 1998)
 Class-based Language Model (Sun et al., 2002; Ratinov and Roth, 2009)
 Agent-based Approach (Ye et al., 2002)
 Support Vector Machines (Takeuchi and Collier, 2002)
 Sequence Labeling Models
   Hidden Markov Models (HMMs) (Bikel et al., 1997; Ji and Grishman, 2005)
   Maximum Entropy Markov Models (MEMMs) (McCallum and Freitag, 2000)
   Conditional Random Fields (CRFs) (McCallum and Li, 2003)

Markov Chain for a Simple Name Tagger

[Figure: a toy Markov chain name tagger with states START, PER, LOC, X and END, with transition probabilities on the arcs and emission probabilities on the states (e.g. PER emits George, W. and Bush with probability 0.3 each and Iraq with 0.1; LOC emits Iraq with 0.8 and George with 0.2; X emits discussed with 0.7 and W. with 0.3; END emits $ with 1.0).]

Viterbi Decoding of Name Tagger

[Figure: Viterbi decoding trellis for "George W. Bush discussed Iraq $" over time steps t=0..6, tracking the probability of the best path ending in each state at each step, e.g. PER at t=1 scores 1 * 0.3 * 0.3 = 0.09.]

Current = Previous * Transition * Emission
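A minimal Viterbi decoder over a toy HMM of this kind is sketched below; the probability tables are illustrative stand-ins and do not reproduce the exact values in the figure.

```python
def viterbi(tokens, states, start_p, trans_p, emit_p):
    """Find the most probable state sequence for `tokens` under a toy HMM."""
    # V[t][s] = probability of the best path that ends in state s at time t
    V = [{s: start_p.get(s, 0.0) * emit_p[s].get(tokens[0], 0.0) for s in states}]
    back = [{}]
    for t in range(1, len(tokens)):
        V.append({})
        back.append({})
        for s in states:
            # Current = Previous * Transition * Emission, maximized over previous states
            prev, score = max(
                ((p, V[t - 1][p] * trans_p[p].get(s, 0.0) * emit_p[s].get(tokens[t], 0.0))
                 for p in states),
                key=lambda pair: pair[1])
            V[t][s], back[t][s] = score, prev
    # Follow the back-pointers from the best final state
    state = max(V[-1], key=V[-1].get)
    path = [state]
    for t in range(len(tokens) - 1, 0, -1):
        state = back[t][state]
        path.append(state)
    return list(reversed(path))

# Toy model loosely based on the figure above; the exact probabilities are illustrative
states = ["PER", "LOC", "X"]
start_p = {"PER": 0.3, "LOC": 0.2, "X": 0.5}
trans_p = {"PER": {"PER": 0.6, "LOC": 0.1, "X": 0.3},
           "LOC": {"PER": 0.1, "LOC": 0.3, "X": 0.6},
           "X":   {"PER": 0.2, "LOC": 0.3, "X": 0.5}}
emit_p = {"PER": {"George": 0.3, "W.": 0.3, "Bush": 0.3, "Iraq": 0.1},
          "LOC": {"George": 0.2, "Iraq": 0.8},
          "X":   {"W.": 0.3, "discussed": 0.7}}
print(viterbi(["George", "W.", "Bush", "discussed", "Iraq"],
              states, start_p, trans_p, emit_p))
# -> ['PER', 'PER', 'PER', 'X', 'LOC']
```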

Limitations of HMMs

 Joint probability distribution p(y, x)
 Features are assumed independent: cannot represent overlapping features or long-range dependencies between observed elements

 Need to enumerate all possible observation sequences

 Very strict independence assumptions on the observations

 Toward discriminative/conditional models

 Conditional probability P(label sequence y | observation sequence x) rather than joint probability P(y, x)

 Allow arbitrary, non-independent features on the observation sequence X

 The probability of a transition between labels may depend on past and future observations

 Relax strong independence assumptions in generative models

Maximum Entropy Markov Models (MEMMs)

 A conditional model that represents the probability of reaching a state given an observation and the previous state
 Observation sequences are treated as events to be conditioned upon

p(s | x) = p(s_1 | x_1) * ∏_{i=2}^{n} p(s_i | s_{i-1}, x_i)

 Has all the advantages of conditional models
   No longer assumes that features are independent
   Does not take future observations into account (no forward-backward)
 Subject to the Label Bias Problem: bias toward states with fewer outgoing transitions

Conditional Random Fields (CRFs)

 Conceptual Overview
   Each attribute of the data fits into a feature function that associates the attribute and a possible label
     A positive value if the attribute appears in the data
     A zero value if the attribute is not in the data
   Each feature function carries a weight that gives the strength of that feature function for the proposed label
     High positive weights: a good association between the feature and the proposed label
     High negative weights: a negative association between the feature and the proposed label
     Weights close to zero: the feature has little or no impact on the identity of the label
 CRFs have all the advantages of MEMMs without the label bias problem
   An MEMM uses a per-state exponential model for the conditional probabilities of next states given the current state
   A CRF has a single exponential model for the joint probability of the entire sequence of labels given the observation sequence
   Weights of different features at different states can be traded off against each other
 CRFs provide the benefits of discriminative models
 Details in the lab session led by Veselin Stoyanov

Sequential Model Trade-offs

Model   Speed             Discriminative vs. Generative   Normalization
HMM     very fast         generative                      local
MEMM    mid-range         discriminative                  local
CRF     relatively slow   discriminative                  global

Typical Name Tagging Features

 N-gram: unigram and token sequences in the context window of the current token
 Part-of-Speech: POS tags of the context
 Gazetteers: person names, organizations, countries and cities, titles, idioms, etc.
 Word clusters: to reduce sparsity, using word clusters such as Brown clusters (Brown et al., 1992)
 Case and Shape: capitalization and morphology analysis based features
 Chunking: NP and VP chunking tags
 Global feature: sentence-level and document-level features, for example whether the token is in the first sentence of a document
 Conjunction: conjunctions of various features
(a small feature-extraction sketch follows)
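A minimal sketch of per-token feature extraction in the style typically fed to a CRF or MEMM; the feature names, the gazetteer, and the placeholder document-level feature are illustrative assumptions.

```python
def token_features(tokens, pos_tags, i, gazetteer=frozenset()):
    """Illustrative feature dictionary for the i-th token."""
    w = tokens[i]
    feats = {
        "word": w.lower(),                        # unigram
        "pos": pos_tags[i],                       # part-of-speech
        "is_capitalized": w[:1].isupper(),        # case feature
        "shape": "".join("X" if c.isupper() else "x" if c.islower()
                         else "d" if c.isdigit() else c for c in w),
        "in_gazetteer": w.lower() in gazetteer,   # gazetteer lookup
        "prev_word": tokens[i - 1].lower() if i > 0 else "<S>",           # context n-gram
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "</S>",
        "first_sentence": False,                  # placeholder for a document-level feature
    }
    # Conjunction feature: previous word combined with capitalization of the current word
    feats["prev_word+cap"] = feats["prev_word"] + "|" + str(feats["is_capitalized"])
    return feats

tokens = ["George", "W.", "Bush", "discussed", "Iraq"]
pos = ["NNP", "NNP", "NNP", "VBD", "NNP"]
print(token_features(tokens, pos, 0, gazetteer={"george", "iraq"}))
```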

State-of-the-art and Remaining Challenges

 State-of-the-art Performance
   On ACE data sets: about 89% F-measure (Florian et al., 2006; Ji and Grishman, 2006; Nguyen et al., 2010; Zitouni and Florian, 2008)
   On CONLL data sets: about 91% F-measure (Lin and Wu, 2009; Ratinov and Roth, 2009)

 Remaining Challenges
   Identification, especially on organizations
     Boundary: “Asian Pulp and Paper Joint Stock Company, Lt. of Singapore”
     Need coreference resolution or context event features: “FAW has also utilized the capital market to directly finance, and now owns three domestic listed companies” (FAW = First Automotive Works)
   Classification
     “Caribbean Union”: ORG or GPE?

Coreference Resolution: From Mentions to Entities

 But the little prince could not restrain admiration:

 "Oh! How beautiful you are!"

 "Am I not?" the flower responded, sweetly. "And I was born at the same moment as the sun . . ."

 The little prince could guess easily enough that she was not any too modest--but how moving--and exciting--she was!

 "I think it is time for breakfast," she added an instant later. "If you would have the kindness to think of my needs--"

 And the little prince, completely abashed, went to look for a sprinkling-can of fresh water. So, he tended the flower.

Modeling Coreference Resolution

 Detailed Survey and Comparison in (Ng, 2010)

 Mention-Pair model: classify whether two mentions are coreferential or not, then cluster (Soon et al., 2001; Ng and Cardie, 2002; Ji et al., 2005; McCallum & Wellner, 2004; Nicolae & Nicolae, 2006). Advantages: easy to encode features. Disadvantages: greedy clustering algorithm; each candidate antecedent is considered independently of the others.
 Entity-Mention model: classify whether a mention and a preceding, possibly partially formed cluster are coreferential or not (Pasula et al., 2003; Luo et al., 2004; Yang et al., 2004, 2008; Daume & Marcu, 2005; Culotta et al., 2007). Advantages: improved expressiveness, allows cluster-level features. Disadvantages: each candidate cluster is considered independently of the others.
 Mention-Ranking model: imposes a ranking on a set of candidate antecedents (Denis & Baldridge, 2007, 2008). Advantages: considers all the candidate antecedents simultaneously. Disadvantages: insufficient information to make an informed coreference decision; still needs to do clustering.
 Cluster-Ranking model: ranks all the preceding clusters for each mention; creates instances with the entity-mention model and ranks the instances with the mention-ranking model (Rahman and Ng, 2009). Advantages: combines the strengths of previous models; achieved the best performance.

Typical Coreference Resolution Features

 Basic Features
   Lexical: exact match, partial match, acronym, edit distance, head match
   Distance: word and sentence distance
   Count: how many times a phrase appears in the document
   Mention information: spelling, level, type, …
   Pronoun Attributes: gender, number, possessiveness, reflexivity
   Synonymy
   Generic
   Modifier matching
   Quantifiers
 Entity-level
   Gender
   Number
   Humanity
   Entity type, mention type
 Syntactic features
   Apposition, POS tags
   Definiteness
   Same-NP test
   Functional tag
   Hobbs distance
   C-command/Governing category
   Dependency structure
 Conjunction features
(a small mention-pair feature sketch follows)
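A minimal sketch of a few of these features computed for one mention pair; the mention dictionary keys (text, head, sent_id, type, gender, number) are assumptions for illustration.

```python
def mention_pair_features(antecedent, anaphor):
    """Illustrative mention-pair features for coreference classification."""
    head1, head2 = antecedent["head"].lower(), anaphor["head"].lower()
    return {
        "exact_match": antecedent["text"].lower() == anaphor["text"].lower(),   # lexical
        "head_match": head1 == head2,
        "acronym": "".join(w[0] for w in antecedent["text"].split()).lower() == head2,
        "sentence_dist": abs(antecedent["sent_id"] - anaphor["sent_id"]),        # distance
        "same_entity_type": antecedent["type"] == anaphor["type"],               # entity-level
        "gender_agree": antecedent.get("gender") == anaphor.get("gender"),       # pronoun attributes
        "number_agree": antecedent.get("number") == anaphor.get("number"),
    }

antecedent = {"text": "Barry Diller", "head": "Diller", "sent_id": 0, "type": "PER",
              "gender": "male", "number": "sg"}
anaphor = {"text": "chief", "head": "chief", "sent_id": 0, "type": "PER",
           "gender": "male", "number": "sg"}
print(mention_pair_features(antecedent, anaphor))
```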

State-of-the-art and Remaining Challenges

 On ACE Data: 88.4 ACE value (Luo, 2007)
 On MUC-6 Data: 71.3 MUC F-score (Yang et al., 2003)
 On MUC-7 Data: 63.4 MUC F-score (Ng and Cardie, 2002)
 Challenges
   Name Coreference: “R” = “Republican Party”, “Brooklyn Dodgers” = “Brooklyn”
   Nominal Coreference
     Almost overnight, he became fabulously rich, with a $3-million book deal, a $100,000 speech making fee, and a lucrative multifaceted consulting business, Giuliani Partners. As a celebrity rainmaker and lawyer, his income last year exceeded $17 million. His consulting partners included seven of those who were with him on 9/11, and in 2002 Alan Placa, his boyhood pal, went to work at the firm.
     After a successful karting career in Europe, Perera became part of the Toyota F1 Young Drivers Development Program and was a Formula One test driver for the Japanese company in 2006.
     “Alexandra Burke is out with the video for her second single … taken from the British artist’s debut album”
     “a woman charged with running a prostitution ring … her business, Pamela Martin and Associates”

 Pronoun Coreference
   Meteorologist Kelly Cass became an On-Camera Meteorologist at The Weather Channel, after David Kenny was named the chairman and chief executive. She first appeared on air at The Weather Channel in January 2000.

Relation Extraction: Task

relation: a semantic relationship between two entities

ACE relation type        Example
Agent-Artifact           Rubin Military Design, the makers of the Kursk
Discourse                each of whom
Employment/Membership    Mr. Smith, a senior programmer at Microsoft
Place-Affiliation        Salzburg Red Cross officials
Person-Social            relatives of the dead
Physical                 a town some 50 miles south of Salzburg
Other-Affiliation        Republican senators

Typical Relation Extraction Features

 Lexical

 Heads of the mentions and their context words, POS tags

 Entity

 Entity and mention type of the heads of the mentions

 Entity Positional Structure

 Entity Context

 Syntactic

 Chunking

 Premodifier, Possessive, Preposition, Formulaic

 The sequence of the heads of the constituents, chunks between the two mentions

 The syntactic relation path between the two mentions

 Dependent words of the mentions

 Semantic Gazetteers

 Synonyms in WordNet

 Name Gazetteers

 Personal Relative Trigger Word List

 Wikipedia

 If the head extent of a mention is found (via simple string matching) in the predicted Wikipedia article of another mention
 References: Kambhatla, 2004; Zhou et al., 2005; Jiang and Zhai, 2007; Chan and Roth, 2010, 2011

A Simple Baseline with K-Nearest-Neighbor (KNN)

[Figure: a test sample surrounded by five training samples; with K=3, the label is determined by the three nearest training samples.]

Relation Extraction with KNN

[Figure: the test sample "the president of the United States" and its distances (0, 26, 36, 46, 46) to training samples "the previous president of the United States" (Employment), "the secretary of NIST" (Employment), "Connecticut's governor" (Employment), "his ranch in Texas" (Physical) and "US forces in Bahrain" (Physical).]

Distance function:
1. If the heads of the mentions don’t match: +8
2. If the entity types of the heads of the mentions don’t match: +20
3. If the intervening words don’t match: +10
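A minimal KNN relation classifier using the three distance rules above; the training examples and their feature encoding are illustrative, loosely following the slide's example.

```python
def distance(a, b):
    """Distance between two relation examples, following the rules listed above."""
    d = 0
    if a["heads"] != b["heads"]:
        d += 8     # heads of the two mentions don't match
    if a["types"] != b["types"]:
        d += 20    # entity types of the heads don't match
    if a["between"] != b["between"]:
        d += 10    # intervening words don't match
    return d

def knn_classify(test, train, k=3):
    """Label a test example with the majority relation type of its k nearest neighbors."""
    neighbors = sorted(train, key=lambda ex: distance(test, ex))[:k]
    labels = [ex["label"] for ex in neighbors]
    return max(set(labels), key=labels.count)

train = [
    {"heads": ("president", "States"), "types": ("PER", "GPE"), "between": ("of", "the", "United"), "label": "Employment"},
    {"heads": ("secretary", "NIST"),   "types": ("PER", "ORG"), "between": ("of",),                 "label": "Employment"},
    {"heads": ("ranch", "Texas"),      "types": ("FAC", "GPE"), "between": ("in",),                 "label": "Physical"},
    {"heads": ("forces", "Bahrain"),   "types": ("ORG", "GPE"), "between": ("in",),                 "label": "Physical"},
]
test = {"heads": ("president", "States"), "types": ("PER", "GPE"), "between": ("of", "the", "United")}
print(knn_classify(test, train))   # -> Employment
```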

Most Successful Learning Methods: Kernel-based

 Consider different levels of syntactic information
   Deep processing of text produces structural but less reliable results
   Simple surface information is less structural, but more reliable

 Generalization of feature-based solutions  A kernel (kernel function) defines a similarity metric Ψ(x, y) on objects  No need for enumeration of features

 Efficient extension of normal features into high-order spaces  Possible to solve linearly non-separable problem in a higher order space

 Nice combination properties  Closed under linear combination  Closed under polynomial extension  Closed under direct sum/product on different domains

 References: Zelenko et al., 2002, 2003; Culotta and Sorensen, 2004; Bunescu and Mooney, 2005; Zhao and Grishman, 2005; Che et al., 2005; Zhang et al., 2006; Qian et al., 2007; Zhou et al., 2007; Khayyamian et al., 2009; Reichartz et al., 2009

Kernel Examples for Relation Extraction

1) Argument kernel:
   Ψ_1(R1, R2) = Σ_{i=1,2} K_E(R1.arg_i, R2.arg_i)
   K_E(E1, E2) = K_T(E1.tk, E2.tk) + I(E1.type, E2.type) + I(E1.subtype, E2.subtype) + I(E1.role, E2.role)
   where K_T is a token kernel defined as:
   K_T(T1, T2) = I(T1.word, T2.word) + I(T1.pos, T2.pos) + I(T1.base, T2.base)

2) Local dependency kernel:
   Ψ_2(R1, R2) = Σ_{i=1,2} K_D(R1.arg_i.dseq, R2.arg_i.dseq)
   K_D(dseq, dseq') = Σ_{i,j} ( I(arc_i.label, arc'_j.label) + K_T(arc_i.dw, arc'_j.dw) )
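A minimal sketch of the argument kernel Ψ_1 as reconstructed above; the dictionary-based representation of tokens and arguments is an assumption for illustration.

```python
def I(x, y):
    """Matching function: 1 if the two values are equal, 0 otherwise."""
    return 1.0 if x == y else 0.0

def K_T(t1, t2):
    """Token kernel: I(word) + I(pos) + I(base)."""
    return I(t1["word"], t2["word"]) + I(t1["pos"], t2["pos"]) + I(t1["base"], t2["base"])

def K_E(e1, e2):
    """Entity (argument) kernel: token kernel on the head token plus type/subtype/role matches."""
    return (K_T(e1["tk"], e2["tk"]) + I(e1["type"], e2["type"])
            + I(e1["subtype"], e2["subtype"]) + I(e1["role"], e2["role"]))

def psi_1(r1, r2):
    """Argument kernel over a relation pair: sum of entity kernels for arg1 and arg2."""
    return sum(K_E(r1["args"][i], r2["args"][i]) for i in (0, 1))

def arg(word, pos, base, etype, subtype, role):
    """Helper to build an illustrative argument representation."""
    return {"tk": {"word": word, "pos": pos, "base": base},
            "type": etype, "subtype": subtype, "role": role}

r1 = {"args": [arg("Smith", "NNP", "smith", "PER", "Individual", "arg1"),
               arg("Microsoft", "NNP", "microsoft", "ORG", "Commercial", "arg2")]}
r2 = {"args": [arg("Diller", "NNP", "diller", "PER", "Individual", "arg1"),
               arg("Vivendi", "NNP", "vivendi", "ORG", "Commercial", "arg2")]}
print(psi_1(r1, r2))   # counts how many of the argument attributes match
```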

 State-of-the-art: about 71% F-score on perfect mentions, and 50% F-score on system mentions

 Single human annotator: 84% F-score on perfect mentions

 Remaining Challenges
   Context generalization to reduce data sparsity
     Test: “ABC's Sam Donaldson has recently been to Mexico to see him”
     Training: PHY relation (“arrived in”, “was traveling to”, …)
   Long context
     Davies is leaving to become chairman of the London School of Economics, one of the best-known parts of the University of London
   Disambiguate fine-grained types
     “U.S. citizens” and “U.S. businessman” indicate a “GPE-AFF” relation while “U.S. president” indicates an “EMP-ORG” relation
   Parsing errors

Event Mention Extraction: Task

 An event is a specific occurrence that implies a change of states
 event trigger: the main word which most clearly expresses an event occurrence

 event arguments: the mentions that are involved in an event (participants)  event mention: a phrase or sentence within which an event is described, including trigger and arguments  ACE defined 8 types of events, with 33 subtypes


ACE event type/subtype      Event Mention Example
Life/Die                    Kurt Schork died in Sierra Leone yesterday
Transaction/Transfer        GM sold the company in Nov 1998 to LLC
Movement/Transport          Homeless people have been moved to schools
Business/Start-Org          Schweitzer founded a hospital in 1913
Conflict/Attack             the attack on Gaza killed 13
Contact/Meet                Arafat’s cabinet met for 4 hours
Personnel/Start-Position    She later recruited the nursing student
Justice/Arrest              Faison was wrongly arrested on suspicion of murder

Supervised Event Mention Extraction: Methods

 Staged classifiers

 Trigger Classifier

 to distinguish event instances from non-events, to classify event instances by type

 Argument Classifier

 to distinguish arguments from non-arguments

 Role Classifier

 to classify arguments by argument role

 Reportable-Event Classifier

 to determine whether there is a reportable event instance

 Can choose any supervised learning methods such as MaxEnt and SVMs

(Ji and Grishman, 2008)
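A minimal sketch of how such a staged pipeline fits together; the trigger lexicon, the distance-based argument test and the role rule below are trivial stand-ins for trained classifiers, not the classifiers of (Ji and Grishman, 2008).

```python
TRIGGER_LEXICON = {"quit": "Personnel/End-Position", "died": "Life/Die"}   # illustrative

def extract_event_mentions(tokens, entity_mentions):
    """Run the staged pipeline: trigger -> argument -> role -> reportable-event filter."""
    events = []
    for i, tok in enumerate(tokens):
        etype = TRIGGER_LEXICON.get(tok.lower())              # stage 1: trigger classifier
        if etype is None:
            continue
        args = []
        for ent in entity_mentions:                           # stage 2: argument classifier
            if abs(ent["index"] - i) <= 6:                    # crude "is an argument" test
                role = "Person" if ent["type"] == "PER" else "Organization"   # stage 3: role classifier
                args.append((ent["text"], role))
        if args:                                              # stage 4: reportable-event filter
            events.append({"trigger": tok, "type": etype, "arguments": args})
    return events

tokens = "Barry Diller on Wednesday quit as chief of Vivendi Universal Entertainment".split()
entities = [{"text": "Barry Diller", "type": "PER", "index": 1},
            {"text": "Vivendi Universal Entertainment", "type": "ORG", "index": 8}]
print(extract_event_mentions(tokens, entities))
```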

Typical Event Mention Extraction Features

 Trigger Labeling
   Lexical: trigger tokens; tokens and POS tags of the candidate trigger and context words
   Dictionaries: trigger list, synonym gazetteers
   Syntactic: the depth of the trigger in the parse tree; the path from the node of the trigger to the root in the parse tree; the phrase structure expanded by the parent node of the trigger; the phrase type of the trigger
   Entity: the entity type of the syntactically nearest entity to the trigger in the parse tree; the entity type of the physically nearest entity to the trigger in the sentence
 Argument Labeling
   Event type and trigger: event type and subtype
   Entity: entity type and subtype; head word of the entity mention
   Context: context words of the argument candidate
   Syntactic: the phrase structure expanding the parent of the trigger; the relative position of the entity with regard to the trigger (before or after); the minimal path from the entity to the trigger; the shortest length from the entity to the trigger in the parse tree

(Chen and Ji, 2009)

State-of-the-art and Remaining Challenges

 State-of-the-art Performance (F-score)

 English: Trigger 70%, Argument 45%

 Chinese: Trigger 68%, Argument 52%

 Single human annotator: Trigger 72%, Argument 62%  Remaining Challenges

 Trigger Identification  Generic verbs  Support verbs such as “take” and “get” which can only represent an event mention together with other verbs or nouns  Nouns and adjectives based triggers

 Trigger Classification  “named” represents a “Personnel_Nominate” or “Personnel_Start-Position”?  “hacked to death” represents a “Life_Die” or “Conflict_Attack”?

 Argument Identification  Capture long contexts

 Argument Classification
   Capture long contexts
   Temporal roles (Ji, 2009; Li et al., 2011)

Event Coreference Resolution: Task

1. An explosion in a cafe at one of the capital's busiest intersections killed one woman and injured another Tuesday
2. Police were investigating the cause of the explosion in the restroom of the multistory Crocodile Cafe in the commercial district of Kizilay during the morning rush hour
3. The blast shattered walls and windows in the building
4. Ankara police chief Ercument Yilmaz visited the site of the morning blast
5. The explosion comes a month after
6. a bomb exploded at a McDonald's restaurant in Istanbul, causing damage but no injuries
7. Radical leftist, Kurdish and Islamic groups are active in the country and have carried out the bombing in the past

Category     Feature         Description
Event type   type_subtype    pair of event type and subtype
Trigger      trigger_pair    trigger pairs
             pos_pair        part-of-speech pair of triggers
             nominal         if the trigger of EM2 is nominal
             exact_match     if the triggers exactly match
             stem_match      if the stems of triggers match
             trigger_sim     trigger similarity based on WordNet
Distance     token_dist      the number of tokens between triggers
             sentence_dist   the number of sentences between event mentions
             event_dist      the number of event mentions between EM1 and EM2
Argument     overlap_arg     the number of arguments with entity and role match
             unique_arg      the number of arguments only in one event mention
             diffrole_arg    the number of coreferential arguments but role mismatch

Incorporating Event Attributes as Features

 Modality
   Other: Toyota Motor Corp. said Tuesday it will promote Akio Toyoda, a grandson of the company's founder who is widely viewed as a candidate to some day head Japan's largest automaker.
   Asserted: Managing director Toyoda, 46, grandson of Kiichiro Toyoda and the eldest son of Toyota honorary chairman Shoichiro Toyoda, became one of 14 senior managing directors under a streamlined management system set to be…
 Polarity
   Positive: At least 19 people were killed in the first blast
   Negative: There were no reports of deaths in the blast
 Genericity
   Specific: An explosion in a cafe at one of the capital's busiest intersections killed one woman and injured another Tuesday
   Generic: Roh has said any pre-emptive strike against the North's nuclear facilities could prove disastrous
 Tense
   Past: Israel holds the Palestinian leader responsible for the latest violence, even though the recent attacks were carried out by Islamic militants
   Future: We are warning Israel not to exploit this war against Iraq to carry out more attacks against the Palestinian people in the Gaza Strip and destroy the Palestinian Authority and the peace process.

 Attribute values as features: whether the attributes of an event mention and its candidate antecedent event conflict or not; 6% absolute gain (Chen et al., 2009)

Clustering Method 1: Agglomerative Clustering

Basic idea:

 Start with singleton event mentions, sort them according to the occurrence in the document

 Traverse through each event mention (from left to right), iteratively merging the active event mention into the prior event with the largest coreference probability above some threshold, or starting a new event with it (a minimal sketch follows)
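A minimal sketch of this left-to-right agglomerative procedure; the coreference scorer here is a trivial trigger-stem heuristic used only for illustration, not a trained model.

```python
def agglomerative_event_coref(event_mentions, coref_prob, threshold=0.5):
    """Left-to-right agglomerative clustering of event mentions.

    event_mentions: mentions sorted by their occurrence in the document
    coref_prob(mention, cluster): probability that `mention` corefers with `cluster`
    """
    clusters = []
    for mention in event_mentions:
        best, best_p = None, threshold
        for cluster in clusters:
            p = coref_prob(mention, cluster)
            if p > best_p:                       # most probable prior event above threshold
                best, best_p = cluster, p
        if best is not None:
            best.append(mention)                 # merge into a prior event
        else:
            clusters.append([mention])           # start a new event
    return clusters

# Toy scorer (assumption): corefer if the triggers share a 4-character prefix
def toy_prob(mention, cluster):
    return 1.0 if any(mention["trigger"][:4] == m["trigger"][:4] for m in cluster) else 0.0

mentions = [{"trigger": "explosion"}, {"trigger": "blast"}, {"trigger": "exploded"}]
print(agglomerative_event_coref(mentions, toy_prob))
```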

Clustering Method 2: Spectral Graph Clustering

[Figure: the event mentions from the example above shown as nodes with their triggers (explosion, blast, explosion, explosion, exploded, explosion, bombing) and arguments (e.g. Place = a cafe, Time = Tuesday; Place = site, Time = morning; Place = restroom, Time = morning rush hour; Time = a month after; Place = restaurant; Place = building; Attacker = groups), connected into a graph for clustering.]

 (Chen and Ji, 2009)

Spectral Graph Clustering

[Figure: a weighted graph partitioned into groups A and B; edge weights within each group are high (0.6-0.9) while edges crossing the partition have low weights (0.1-0.3).]

cut(A,B) = 0.1 + 0.2 + 0.2 + 0.3 = 0.8

Spectral Graph Clustering (Cont')

 Start with a fully connected graph, where each edge is weighted by the coreference value

 Optimize the normalized-cut criterion (Shi and Malik, 2000):

   NCut(A, B) = cut(A, B) / vol(A) + cut(A, B) / vol(B)

 vol(A): the total weight of the edges from group A
 Maximize the weight of within-group coreference links
 Minimize the weight of between-group coreference links
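A minimal sketch of computing the normalized-cut criterion for a given partition; the toy graph loosely matches the figure above and the node names are illustrative.

```python
def ncut(weights, A, B):
    """NCut(A,B) = cut(A,B)/vol(A) + cut(A,B)/vol(B).

    weights: dict mapping a frozenset({u, v}) edge to its coreference weight
    A, B:    disjoint sets of nodes
    """
    cut = sum(w for e, w in weights.items() if len(e & A) == 1 and len(e & B) == 1)

    def vol(S):
        # total weight of edges incident to the group S
        return sum(w for e, w in weights.items() if e & S)

    return cut / vol(A) + cut / vol(B)

# Toy graph: within-group edges are heavy, cross-partition edges sum to 0.8 as above
w = {frozenset({"a1", "a2"}): 0.9, frozenset({"a1", "a3"}): 0.8,
     frozenset({"b1", "b2"}): 0.7,
     frozenset({"a1", "b1"}): 0.1, frozenset({"a2", "b1"}): 0.2,
     frozenset({"a2", "b2"}): 0.2, frozenset({"a3", "b2"}): 0.3}
print(ncut(w, A={"a1", "a2", "a3"}, B={"b1", "b2"}))
```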

State-of-the-art Performance

 The MUC metric does not prefer clustering results with many singleton event mentions (Chen and Ji, 2009)

Remaining Challenges

The performance bottleneck of event coreference resolution comes from the poor performance of event mention labeling

Outline

 Introduction
 Basic Information Extraction (IE)
 Advanced IE
   Enhance Quality
   Enhance Portability
 Popular Research Directions
   Cross-source IE
   IE for Noisy Data
 Resources

Enhance Quality

 Enhance Quality

 Incorporating Redundancy

 Global Inference

 Joint Inference

Common IE Bottleneck

 One of the initial goals for IE was to create a knowledge base (KB) from the entire input corpus, such as a profile or a series of activities about any entity, and allow further logical reasoning on the KB

 Such information may be scattered among a variety of sources (large-scale documents, languages, genres and data modalities)

 Problem: the KB constructed from a typical IE pipeline often contains lots of erroneous and conflicting facts  Single-document event extraction < 70%; Cross-document slot filling < 30%; worse for non-newswire genres, languages, multimedia data

 Improve Quality of IE: identify topically-related documents and integrate facts, possibly redundant, possibly complementary, possibly in conflict, coming from these documents → improve IE results with low cost

IE in Rich Contexts

[Figure: texts, authors, venues, and time/location/cost constraints feed into IE, which produces information networks that support human collaborative learning.]

Capture Information Redundancy

 When the data grows beyond a certain size, the IE task is naturally embedded in rich contexts and the extracted facts become inter-dependent
 Leverage Information Redundancy from:
   Large Scale Data (Chen and Ji, 2011)
   Background Knowledge (Chan and Roth, 2010; Rahman and Ng, 2011)
   Inter-connected facts (Li and Ji, 2011; Li et al., 2011; e.g. Roth and Yih, 2004; Gupta and Ji, 2009; Liao and Grishman, 2010; Hong et al., 2011)
   Diverse Documents (Downey et al., 2005; Yangarber, 2006; Patwardhan and Riloff, 2009; Mann, 2007; Ji and Grishman, 2008)
   Diverse Systems (Tamang and Ji, 2011)
   Diverse Languages (Snover et al., 2011)
   Diverse Data Modalities (text, image, speech, video…)

 But how? Such knowledge might be overwhelming…

Global Knowledge based Inference for Event Extraction

 Cross-document inference (Ji and Grishman, 2008)
 Cross-event inference (Liao and Grishman, 2010)
 Cross-entity inference (Hong et al., 2011)
 All-together (Li et al., 2011)

Leveraging Redundancy with Topic Modeling

 Within a cluster of topically-related documents, the distribution of facts is much more convergent: it is closer to the distribution in the collection of topically related documents than in the uniform training corpora
   e.g. in the overall information networks only 7% of “fire” instances indicate “End-Position” events, while all instances of “fire” in one topic cluster are “End-Position” events
   e.g. “Putin” appeared in different roles, including “meeting/entity”, “movement/person”, “transaction/recipient” and “election/person”, but only played an “election/person” role in one topic cluster
 Topic Modeling can enhance information network construction by grouping similar objects, event types and roles together

Global Inference Results

 Topic-cluster wide cross-document inference to enhance event and role mining  One trigger sense per topic cluster / One argument role per topic cluster  Remove events and roles with low local and cluster-wide confidence  Adjust event and role labeling to achieve cluster-wide consistency  Results: Precision (P), Recall (R), F-Measure (F)

Language   System                     Trigger Labeling (P / R / F)   Argument Labeling (P / R / F)
English    Baseline                   74.1 / 49.6 / 59.4             50.4 / 28.7 / 36.6
           Cross-doc Inference (IR)   66.5 / 67.4 / 66.9             60.8 / 32.2 / 42.1
           Topic Modeling             73.3 / 66.3 / 69.6             59.4 / 36.5 / 45.2
Chinese    Baseline                   78.8 / 48.3 / 60.0             60.6 / 34.3 / 43.8
           Cross-doc Inference (IR)   69.9 / 62.3 / 65.9             67.5 / 38.3 / 48.9
           Topic Modeling             76.5 / 61.9 / 68.4             66.4 / 42.4 / 51.8

(Li et al., 2011)

Facts are often Inter-dependent: Joint Inference

13.7%-24.4% error reduction using Integer Linear Programming (Li et al., 2011)

Pairwise (Li, Lj): Person A founded Organization B / Organization B hired Person A
Triangle-Entity (Li, Lj): Organization A is involved in a Justice/Conflict/Transaction event with Organization B / Person C is affiliated with or a member of both Organization A and Organization B
Triangle-Link (Li, Lj, Lk): Entity A is involved in a Transport event originating from Location B / Person C is affiliated with or a member of Entity A / Person C is located in Location B

Enhance Portability

 Solutions for expensive data annotation

 Semi-supervised Learning (Self-training, Bootstrapping, Co- training)

 Unsupervised Learning

 Distant Supervision  Solutions for domain-dependent restriction

 Open IE

 Domain Adaptation

Self-training for Name Tagging

[Figure: self-training loop. The test set is clustered into document clusters T_1 … T_n. Starting from a baseline tagger NameM, each cluster T_i is tagged with the current model to give T_i'; about 5% of the sentences, selected with an adjustable confidence threshold, form T_i'' and are added to the training corpus; NameM is retrained and the loop moves to the next cluster, finally producing the system output.]

(Ji and Grishman, 2006)
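A minimal sketch of the self-training loop shown in the figure; the model interface (a .tag() method returning sentences with a confidence score) and the retrain function are assumptions for illustration.

```python
def self_train(baseline_model, retrain, test_clusters, keep_fraction=0.05):
    """Sketch of the self-training loop for name tagging described above.

    baseline_model: object with a .tag(sentences) method returning tagged sentences,
                    each with a .confidence attribute (assumed interface)
    retrain:        function(labeled_sentences) -> a new model (assumed interface)
    test_clusters:  the test set split into document clusters T_1 ... T_n
    """
    model, bootstrapped = baseline_model, []
    for cluster in test_clusters:
        tagged = model.tag(cluster)                              # T_i' = T_i tagged with NameM
        tagged.sort(key=lambda s: s.confidence, reverse=True)    # rank by confidence
        selected = tagged[:max(1, int(keep_fraction * len(tagged)))]   # T_i'' (top ~5%)
        bootstrapped.extend(selected)                            # add T_i'' to training corpus
        model = retrain(bootstrapped)                            # retrain NameM
    return model                                                 # used to produce system output
```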

Bootstrapping for Relation Extraction

Initial Seed Tuples:
ORGANIZATION   LOCATION
MICROSOFT      REDMOND
IBM            ARMONK
BOEING         SEATTLE
INTEL          SANTA CLARA

[Figure: the bootstrapping cycle: initial seed tuples → occurrences of seed tuples → generate extraction patterns → generate new seed tuples → augment the table, and repeat.]

Bootstrapping for Relation Extraction (Cont')

Occurrences of seed tuples:
 Computer servers at Microsoft’s headquarters in Redmond…
 In mid-afternoon trading, shares of Redmond-based Microsoft fell…
 The Armonk-based IBM introduced a new line…
 The combined company will operate from Boeing’s headquarters in Seattle.
 Intel, Santa Clara, cut prices of its Pentium processor.

Bootstrapping for Relation Extraction (Cont')

Learned Patterns:
• <ORGANIZATION>’s headquarters in <LOCATION>
• <LOCATION>-based <ORGANIZATION>

Bootstrapping for Relation Extraction (Cont')

Generate new seed tuples; start a new iteration:
ORGANIZATION   LOCATION
AG EDWARDS     ST LUIS
157TH STREET   MANHATTAN
7TH LEVEL      RICHARDSON
3COM CORP      SANTA CLARA
3DO            REDWOOD CITY
JELLIES        APPLE
MACWEEK        SAN FRANCISCO
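A very small sketch of this bootstrapping cycle; the pattern representation (the literal text between the two seed names) is a simplification of what systems such as Snowball/DIPRE actually use, and the corpus sentences are illustrative.

```python
import re

def bootstrap(corpus, seeds, iterations=2):
    """Seed-based bootstrapping for (ORGANIZATION, LOCATION) tuples."""
    tuples = set(seeds)
    for _ in range(iterations):
        # 1. Occurrences of known tuples -> extraction patterns (middle context)
        patterns = set()
        for org, loc in tuples:
            for sent in corpus:
                i, j = sent.find(org), sent.find(loc)
                if i != -1 and j != -1 and i < j:
                    patterns.add(sent[i + len(org):j])      # e.g. "'s headquarters in "
        # 2. Apply patterns to the corpus to generate new tuples and augment the table
        name = r"([A-Z][A-Za-z]+(?: [A-Z][A-Za-z]+)*)"
        for pat in patterns:
            for sent in corpus:
                m = re.search(name + re.escape(pat) + name, sent)
                if m:
                    tuples.add((m.group(1), m.group(2)))
        # 3. Start a new iteration with the augmented table
    return tuples

corpus = ["Computer servers at Microsoft's headquarters in Redmond were down.",
          "Boeing's headquarters in Seattle will move.",
          "Intel's headquarters in Santa Clara cut prices."]
print(bootstrap(corpus, {("Microsoft", "Redmond")}))
# -> also learns ("Boeing", "Seattle") and ("Intel", "Santa Clara")
```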

Cross-lingual Co-Training for Event Extraction (Chen and Ji, 2009)

[Figure: cross-lingual co-training. Labeled samples in language A and language B train an event extraction system for each language; unlabeled bitexts are sampled at random into a bilingual pool of constant size; each system labels its side of the pool, and its high-confidence samples are projected across the bitext into the other language and added to that language's training data.]

 Bootstrapping: n=1, trust yourself and teach yourself
 Co-training: n=2 (Blum and Mitchell, 1998)
   • the two views are individually sufficient for classification
   • the two views are conditionally independent given the class

Unsupervised Mention Detection with Google N-grams

 Gender patterns:
   Conjunction-Possessive: target = noun [292,212] | capitalized [162,426]; context = conjunction; pronoun = his|her|its|their; e.g. "writer and his"
   Nominative-Predicate: target = noun [53,587]; context = am|is|are|was|were|be; pronoun = he|she|it|they; e.g. "he is a writer"
   Verb-Nominative: target = noun [116,607]; context = verb; pronoun = he|she|it|they; e.g. "writer thought he"
   Verb-Possessive: target = noun [88,577] | capitalized [52,036]; context = verb; pronoun = his|her|its|their; e.g. "writer bought his"
   Verb-Reflexive: target = noun [18,725]; context = verb; pronoun = himself|herself|itself|themselves; e.g. "writer explained himself"
 Animacy patterns:
   Relative-Pronoun: target = (noun|adjective) & not after (preposition|noun|adjective) [664,673]; context = comma|empty; pronoun = who|which|where|when; e.g. "writer, who"

Gender and Animacy Discovery Examples

 If a mention indicates male/female/animacy with high confidence, it’s likely to be a person mention

Patterns for candidate mentions    male   female   neutral   plural
John Joseph bought/… his/…         32     0        0         0
Haifa and its/…                    21     19       92        15
screenwriter published/… his/…     144    27       0         0
it/… is/… fish                     22     41       1741      1186

Patterns for                       Animate   Non-Animate
candidate mentions                 who       when    where   which
supremo                            24        0       0       0
shepherd                           807       24      0       56
prophet                            7372      1066    63      1141
imam                               910       76      0       57
oligarchs                          299       13      0       28
sheikh                             338       11      0       0

 Comparable performance with supervised models (Ji and Lin, 2009)

Distant Supervision

 Motivation  It’s expensive to conduct annotations from unstructured data  Structured knowledge bases are widely available (Freebase, DBPedia,…)  Exploits known relations (usually obtained from an existing database) to extract contexts from a large document collection and automatically label them accordingly

 Basic Idea  Whenever two entities that are known to participate in a relation appear in the same context, this context is likely to express this relation in some way  By extracting many such contexts, different ways of expressing the same relation will be captured and a general model may be abstracted by applying machine learning methods to the annotated data
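A minimal sketch of this labeling assumption; the KB triples and corpus sentences are illustrative, and the second sentence shows the kind of noise the assumption can introduce.

```python
def distant_label(corpus, kb):
    """Any sentence containing both entities of a known KB relation instance is
    (noisily) labeled as expressing that relation."""
    labeled = []
    for sentence in corpus:
        for (e1, relation, e2) in kb:
            if e1 in sentence and e2 in sentence:
                labeled.append((sentence, e1, e2, relation))
    return labeled

kb = [("Barry Diller", "employee_of", "Vivendi Universal Entertainment")]
corpus = ["Barry Diller on Wednesday quit as chief of Vivendi Universal Entertainment.",
          "Barry Diller criticized Vivendi Universal Entertainment in an interview."]
for example in distant_label(corpus, kb):
    print(example)   # the second sentence does not actually express employment
```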

 Applications
   Relation Extraction (Mintz et al., 2009; Nguyen and Moschitti, 2011; Hoffmann et al., 2011; Bobic et al., 2012)
   (Marchetti-Bowick and Chambers, 2012)
   Emotion Classification (Purver and Battersby, 2012)

Distant Supervision Example

Distant Supervision

 Advantages  Save annotation cost: create a labeled training set when no human annotated dataset is available  Classifiers trained with distant supervision present the benefit that they are less prone to over-fitting than those learned from manual annotations

 Disadvantages
   The underlying training assumption of distant supervision can introduce noise that results in a loss of precision for relation extraction (Riedel et al., 2010)
   Example errors on Temporal IE:
     Coreference results match the wrong named entities in a document.
     Temporal expressions are normalized incorrectly.
     Temporal information with different granularities has to be compared: the KB states that John married Mary in 1997, but not the exact day and month. Should we consider a temporal expression such as September 3, 1997 as a START?
     Information offered by the KB is incorrect or contradicts information found in the Web documents (yet web search is the most effective way to collect positive samples)
   Possible solutions: logistic regression based re-labeling (Tamang and Ji, 2012)

Domain-independent IE

 Traditional IE assumes the scenario and event types are known in advance so that the corresponding training data and seeds can be prepared

 Open IE (Banko et al., 2007)  learn a general model of how relations are expressed (in a particular language), based on unlexicalized features such as part-of-speech tags and domain- independent regular expressions; e.g. “E1 verb E2 (X established Y) “  the identities of the relations to be extracted are unknown and the billions of documents found on the Web necessitate highly scalable processing  On-demand IE (Sekine, 2006):  Pre-emptive IE (Shinyama et al., 2006): hierarchical pattern clustering  Advantages  Can extract unknown relations and events from heterogeneous corpora  Disadvantages  Low recall, cannot incorporate complicated long distance patterns

 Automatic event type and template discovery for new scenarios
   Using clustering and techniques (Li et al., 2010)
   Template discovery (Chambers and Jurafsky, 2011)

Summary of IE Methods

 Supervised Learning: learn rules or a supervised model from labeled data; requires large labeled data; high precision and high recall; poor portability and scalability; e.g. (McCallum, 2003; Ahn, 2006; Hardy et al., 2006; Ji and Grishman, 2008)
 Bootstrapping: send seeds to extract common patterns from unlabeled data; requires small seeds; moderate precision, recall difficult to measure; moderate portability and scalability; e.g. (Riloff, 1996; Brin, 1999; Agichtein and Gravano, 2000; Etzioni et al., 2004; Yangarber, 2000)
 Distant Supervision: project large database entries into unlabeled data to obtain annotations; requires large seeds; low precision, moderate recall; moderate portability and scalability; e.g. (Mintz et al., 2009; Wu and Weld, 2010)
 Open IE: open-domain IE based on syntactic patterns; requires small amounts of unstructured labeled data; moderate precision, low recall; good portability and scalability; e.g. (Sekine, 2006; Shinyama et al., 2006; Banko et al., 2007)
 Template Discovery: automatically discover scenarios, event types and templates; requires little unstructured labeled data; moderate precision and recall; good portability and scalability; e.g. (Li et al., 2010; Chambers and Jurafsky, 2011)

Cross-genre Name Tagging Results

Cross-domain IE Results

 From News to Chemical Engineering

Adaptive Online Control for Cascading Blackout Mitigation

Natural Gas Transportation Network Expansion and LNG Terminal Location Planning Models

Quasi-Monte Carlo Methods for a Multi-stage Stochastic Program for Energy Planning

 A Novel Stochastic Programming Algorithm for Minimization of Fresh Water Consumption in Power Plants

 Resource Reservation for Allocation Model with Randomness on the Right Hand Side

 Drinking Water Supply Network Planning: A Game-Theoretic Model with Embodied-Energy Target

 No relations or events were identified.
 (Jiang and Zhai, 2006)

Name Tagging

Task                                              Train → Test     F1
to find PER, LOC, ORG from news text              NYT → NYT        0.855
                                                  Reuters → NYT    0.641
to find gene/protein from biomedical literature   mouse → mouse    0.541
                                                  fly → mouse      0.281

Domain Adaptation Methods for IE

References: Finkel and Manning, 2009; Arnold et al., 2008; Finkel and Manning, 2009; Daume III, 2007; Wu et al., 2009; Jiang and Zhai, 2007

Outline

 Introduction
 Basic Information Extraction (IE)
 Advanced IE
   Enhance Quality
   Enhance Portability
 Popular Research Directions
   Cross-source IE
   IE for Noisy Data
 Resources

Popular Research Directions

 Cross-source IE

 Cross-document IE: Knowledge Base Population

 Cross-lingual IE

 Cross-media IE  IE for Noisy Data

 MT output

 ASR output

Knowledge Base Population (KBP): Break Document Boundary

[Figure: documents from a source collection are used to create or expand entries in a reference KB.]

http://www.nist.gov/tac/2012/KBP/index.html (Coordinators: James Mayfield and Javier Artiles)

Overview of KBP Tasks

 2009,2010: Monolingual (English to English)  2011: Cross-lingual (Chinese to English)  2012: Cross-lingual (Spanish to English)  2010: Surprise Slot filling  2011: Temporal Slot filling  2012: Cold-Start KBP

 (Ji and Grishman, 2011)

Entity Linking

[Figure: entity linking example: the query "James Parsons" and its source document are linked to the matching KB node, or to NIL if the entity is not in the KB.]

Successful Entity Linking Methods

 Approaches are converging
 The best systems are approaching human performance
 Cross-lingual Entity Linking performance is only slightly lower than mono-lingual

[Figure: a typical entity linking pipeline: query expansion (Wiki hyperlink mining, source-document coreference resolution, collaborator mentions); KB node candidate generation (document semantic analysis, IR over the Wiki KB and texts); KB node candidate ranking (unsupervised similarity computation, supervised classification, graph-based ranking, IR, rules); and NIL clustering (hierarchical agglomerative clustering, name match, graph-based clustering, coreference, topic modeling, linking to a larger KB and mapping down, handling polysemy and synonymy).]

Slot Filling

Example query: Jim Parsons (PER, KB node E0300113, source document eng-WL-11-174592-12943233); slots to fill include per:date_of_birth, per:age, per:country_of_birth, per:city_of_birth, …

Example answer: School Attended: University of Houston

Slot Types

Person: per:alternate_names, per:date_of_birth, per:age, per:country_of_birth, per:stateorprovince_of_birth, per:city_of_birth, per:origin, per:date_of_death, per:country_of_death, per:stateorprovince_of_death, per:city_of_death, per:cause_of_death, per:countries_of_residence, per:stateorprovinces_of_residence, per:cities_of_residence, per:schools_attended, per:title, per:member_of, per:employee_of, per:religion, per:spouse, per:children, per:parents, per:siblings, per:other_family, per:charges

Organization: org:alternate_names, org:political/religious_affiliation, org:top_members/employees, org:number_of_employees/members, org:members, org:member_of, org:subsidiaries, org:parents, org:founded_by, org:founded, org:dissolved, org:country_of_headquarters, org:stateorprovince_of_headquarters, org:city_of_headquarters, org:shareholders, org:website

How much Inference is Needed?

 Difficulty to push above F = 0.30

 High entry cost for competitive performance; needs good performance at each IE component

Why KBP is more difficult than ACE

 Cross-slot Inference (per:children)
   People Magazine has confirmed that actress Julia Roberts has given birth to her third child a boy named Henry Daniel Moder. Henry was born Monday in Los Angeles and weighed 8 lbs. Roberts, 39, and husband Danny Moder, 38, are already parents to twins Hazel and Phinnaeus who were born in November 2006.
   son-of (Julia Roberts, Henry Moder) & spouse-of (Julia Roberts, Danny Moder) usually → son-of (Danny Moder, Henry Moder)
 25% of the examples involve coreference which is beyond current system capabilities, such as nominal anaphors
   “Alexandra Burke is out with the video for her second single … taken from the British artist’s debut album”
   “a woman charged with running a prostitution ring … her business, Pamela Martin and Associates”
 Systems would benefit from specialists which are able to reason about times, locations, family relationships, and employment relationships.

 It places more emphasis on cross-document entity resolution, which received limited effort in ACE
 It forces systems to deal with redundant and conflicting answers across large corpora

Cross-media IE

 (Lee et al., 2010; Qi et al., 2011)

Fact Type Examples in Cross-Media IE

Cross-lingual IE

Example query: Ang Lee (PER, KB node E0300112, source document XIN20030616.0130.0053); slots include per:date_of_birth, per:spouse, per:children

Parent: Li Sheng

Birth-place: Taiwan Pindong City

Residence: Hua Lian

Attended-School: NYU

(Snover et al., 2011) Alternative Cross-lingual IE Pipelines

 References:

 Riloff et al., 2002;

 Sudo et al., 2004;

 Hakkani-Tür et al., 2007;

 Snover et al., 2011

Impact of MT Errors on Cross-lingual IE

 Re-training extraction components directly from MT output did not help  MT errors were too diverse to generalize  59% of the missing errors were due to text, query or answer translation errors; 20% were due to slot filling errors; Name translation is a bottleneck

 Source Text
   俄塔社援引紧急情况部莫斯科市总局新闻处处长 (Bo Bei Lie Fu) 的话...
 Reference Translation
   The Russian news agency Tass, quoting Director Bobylev of the news office of the Moscow city headquarters of the Emergency Situation Department...
 Various MT System Translations
   Russia 's Tass news agency quoted the ministry for emergency situations of the Moscow city , Director of Information Services , German Gref...
   Itar-Tass quoted the Emergency Situations Ministry 博贝列夫 in Moscow City Administration Director Bo , yakovlev...
   Russia 's Tass news agency of the Ministry of Emergency Situations Moscow city administration of Addis Ababa , Director of Information Services...
   Russian news agency quoted the ministry of emergency situations in Moscow city administration of the Director of Information Services , A. Kozyrev...
   Itar-Tass quoted the Emergencies Ministry in Moscow , the Director of information in 1988 lev...

Cross-lingual Validation

[Table: 17 cross-lingual validation features f1-f17, each characterized by scope (local vs. global: cross-system, within-document, or cross-document over comparable corpora), depth (shallow statistics vs. deep, based on IE or fact InfoNets) and language (English, Chinese, or both).]

 f1: frequency of answer a appearing in all baseline outputs
 f2: number of conflicting slot types in which answer a appears in all baseline outputs
 f3: conjunction of slot type t and whether a is a year answer
 f4: conjunction of t and whether a includes numbers or letters
 f5: conjunction of place t and whether a is a country name
 f6: conjunction of per:origin t and whether a is a nationality
 f7: if t = per:title, whether a is an acceptable title
 f8: if t requires a name answer, whether a is a name
 f9: whether a has an appropriate semantic type
 f10: conjunction of org:top_members/employees and whether there is a high-level title in the context sentence s
 f11: conjunction of alternative name and whether a is an acronym of q
 f12: conditional probability of q/q' and a/a' appearing in the same document
 f13: conditional probability of q/q' and a/a' appearing in the same sentence
 f14: co-occurrence of q/q' and a/a' in coreference links
 f15: co-occurrence of q/q' and a/a' in relation/event links
 f16: conditional probability of q/q' and a/a' appearing in relation/event links
 f17: mutual information of q/q' and a/a' appearing in relation/event links

 Achieved 11% absolute F-measure gain (Snover et al., 2011)

IE for ASR Output

 Problems

 Using an IE system trained from newswire, the performance degrades notably, 15% relative, when the system is tested on broadcast news transcriptions and 27% relative, when ASR output is used instead of reference transcriptions (Makhoul et al., 2005)

 Optimizing based on downstream applications (IE) is better than optimizing F- measure (Favre et al., 2008)

 Need better pronoun resolution for speech conversation

 Possible Solutions

 Optimize downstream applications for ASR and speech segmentation (JHU2012 Summer Workshop on “Complementary Evaluation Measures for Speech Transcription”)

 Use n-best hypotheses, ASR lattices, word confusion networks, or graphemes for IE

 Improve pronoun resolution by incorporating automatic speaker role identification techniques

 Apply Modality, Polarity, Genericity analysis to reduce uncertainty

Segmenting Speech for IE

(Favre et al., 2008)

Outline

 Introduction
 Basic Information Extraction (IE)
 Advanced IE
   Enhance Quality
   Enhance Portability
 Popular Research Directions
   Cross-source IE
   IE for Noisy Data
 Resources

Resources: Data Sets

• ACE IE: http://projects.ldc.upenn.edu/ace/data/ (IE training data for English/Chinese/Arabic/Spanish)
• CONLL 2002: http://www.cnts.ua.ac.be/conll2002/ner.tgz (Name tagging training data for Dutch and Spanish)
• CONLL 2003: http://www.cnts.ua.ac.be/conll2003/ner.tgz (Name tagging training data for English and German)
• KBP 2009-2012: http://www.nist.gov/tac/2012/KBP/data.html , http://nlp.cs.qc.cuny.edu/kbp/2011/ , http://nlp.cs.qc.cuny.edu/kbp/2010/ (Knowledge Base Population for English, Chinese and Spanish)

Resources: Publicly Available IE Toolkits

NYU IE: http://www.cs.nyu.edu/cs/faculty/grishman/jet/license.html
University of Sheffield IE: http://gate.ac.uk/download/index.html
NLTK: http://nltk.sourceforge.net/
CUNY KBP system: http://nlp.cs.qc.cuny.edu/kbptoolkit-1.5.0.tar.gz , http://nlp.cs.qc.cuny.edu/Temporal_Slot_Filling_1.0.1.tar.gz

Name Taggers: Stanford Name Tagger: http://nlp.stanford.edu/ner/index.shtml UIUC Name Tagger: http://cogcomp.cs.illinois.edu/page/software_view/NETagger CUNY Name Taggers: Chinese tagger: http://nlp.cs.qc.cuny.edu/ChineseNameTagger.tar.gz English tagger: http://nlp.cs.qc.cuny.edu/en_nametagging_release.tar.gz

Coreference Resolvers:
JavaRAP: http://aye.comp.nus.edu.sg/~qiu/NLPTools/JavaRAP.html
BART: http://bart-coref.org/
The Illinois Coreference Package: http://cogcomp.cs.illinois.edu/page/software_view/18
Reconcile: http://www.cs.utah.edu/nlp/reconcile/
CherryPicker: http://www.hlt.utdallas.edu/~altaf/cherrypicker.html

Thank You and Join the IE World!