Information Extraction: Techniques, Advances and Challenges

Heng Ji Computer Science Department and Linguistics Department Queens College and Graduate Center City University of New York [email protected]

June 12, 2012

Outline

 Introduction
 Basic Information Extraction (IE)
 Advanced IE
   Enhance Quality
   Enhance Portability
 Popular Research Directions
   Cross-source IE
   IE for Noisy Data
 Resources

Introduction

 What is IE
 Why IE is Useful
 IE History

What is IE

 (In this talk) Information Extraction (IE) = identifying instances of facts (names/entities, relations and events) in semi-structured or unstructured text, and converting them into structured representations (e.g. databases)

Barry Diller on Wednesday quit as chief of Vivendi Universal Entertainment.

Trigger: quit (a "Personnel/End-Position" event)
Arguments:
  Role = Person: Barry Diller
  Role = Organization: Vivendi Universal Entertainment
  Role = Position: chief
  Role = Time-within: Wednesday (2003-03-04)

Why IE is Useful

 IE can build a database with the information on a given relation or event from news, financial, bio-medical domains…
   Attack/arrest events
   People’s jobs
   People’s whereabouts
   Merger and acquisition activity
   Disease outbreaks
   Patient records
   Experiment chains in scientific papers
 Component technology for other areas
   Question Answering (QA)
   Summarization
   Automatic translation
   Document indexing
   Structured Search: “who are the top employees of IBM from 2002-2012?”
   Opinion Mining/Sentiment Extraction
   Text Data Mining over Extracted Relationships

Application Example: Dynamic Event Tracking

(Chen and Ji, 2009)

http://nlp.cs.qc.cuny.edu/demo/personvisual.html

IE for Scientific Literature

For sequestration, the CO2 captured from a fossil fuel plant is first compressed until the combined heat and pressure make it "supercritical" — a state in which it displays both gas and fluid properties. At 3 kilometers, you needed only 10 wells because the increased temperature lowered the viscosity of the CO2, allowed it to slide more easily into the reservoir. Supercritical CO2 is buoyant and will rise above the other fluids. If it rises high enough (above a depth of 2,600 feet), it will return to a gaseous state.

[Figure: event structures extracted from the passage, centered on "CO2 Sequestration": capture (Object: CO2; Place: fossil fuel plant), compress (Object: CO2; State: supercritical) and rise (Object: CO2; State: gaseous; Depth: 2,600 feet) linked as a subsequence, plus lower (Agent: increased temperature; Target: viscosity) and slide (Object: CO2; Place: reservoir; Volume: 10 wells; Depth: 3 kilometers) linked by a causal relation.]

Real Application: Terrorism Networks Extraction

Demo URL: http://blender2.cs.qc.cuny.edu/BlenderGraph
Demo Video: http://nlp.cs.qc.cuny.edu/terrorism.m4v

IE History: Early Projects

 Knowledge-based, rule-based
 FRUMP – 1979
   Newswire
 LSP (Language String Project) – 1981
   Naomi Sager et al.
   AMA – American Medical Association
   Patient summaries

IE History: MUC

 MUC – Message Understanding Conferences (1987-1998)
   DARPA, NRAD
   MUC-6: Named entity, coreference and template element
   MUC-7: template relation
   Standardization, Evaluation, Dissemination
 DARPA’s TIPSTER Program: Document Detection, Summarization and Information Extraction – until 1998
 TREC (Text Retrieval Conferences)

Year   Conference   Domain
1987   MUC-I        Navy messages
1989   MUC-II       Navy messages
1991   MUC-3        News about terrorist attacks
1992   MUC-4        News about terrorist attacks
1993   MUC-5        Company news (joint ventures, micro-electronics production)
1995   MUC-6        Company news (management succession)
1998   MUC-7        Airline company orders

IE History: ACE/CONLL

 HUB-4 and ACE (Automatic Content Extraction)

 NIST (National Institute of Standards and Technology)

 Spoken and printed text

 ACE defined 7 types of entities, 17 types of relations, 33 different types of events (2002-2008)

 Multilingual (English, Chinese, Arabic)

 The top systems obtained mention values in the range of 70-85, entity values in the range of 60-70, relation values in the range of 35-45, and event values in the range of 15-30.

 CoNLL (Conference on Natural Language Learning)

 Since 1997

 Name tagging in the 2002 and 2003 editions

 Multilingual tagging of person (PER), location (LOC), organization (ORG) and other (O) classes

IE History: Knowledge Base Population (KBP, 2009-)

 General Goal
   Promote research in discovering facts about entities and expanding a knowledge source automatically
   Conducted as part of the NIST Text Analysis Conference

 What’s New
   Extraction at large scale (1.3 million documents)
   Using a representative collection (not selected for relevance)
   Cross-document entity resolution (extending the limited effort in ACE)
   Linking the facts in text to a knowledge base
   Distant (and noisy) supervision through Infoboxes
   Rapid adaptation to new relations
   Support multi-lingual information fusion (cross-lingual KBP)
   Capture temporal information (temporal KBP)
   Automatic KB construction (cold-start KBP)

Outline

 Introduction
 Basic Information Extraction (IE)
 Advanced IE
   Enhance Quality
   Enhance Portability
 Popular Research Directions
   Cross-source IE
   IE for Noisy Data
 Resources

Basic IE

 Methods

 Rule-based

 Pattern Learning

 Supervised Learning
 IE Components and State-of-the-Art

 Name Tagging

 Entity Coreference Resolution

 Relation Extraction

 Event Mention Extraction

 Event Coreference Resolution

Traditional IE Methods

 Handcrafted systems
   Knowledge (rule) based
   Hand-written Patterns
   Gazetteers
   Rule-based approaches: FASTUS (SRI, 1996), Proteus (NYU, 1996), LaSIE-II (U-Sheffield, 1998)
   Example-based learning: AutoSlog (UMASS, 1993), CRYSTAL (UMASS, 1996)
   Statistical models: Collins et al. (1998), Miller et al. (2000)
   Advantages: simple, fast, language independent, easy to retarget
   Disadvantages: collection and maintenance of lists, cannot deal with fact variants, cannot resolve ambiguity, poor portability (across domains and languages)

 Automatic systems
   Pattern Learning
   Supervised Learning

Pattern Learning based IE

 Pattern Examples
   Name Tagging:
     CapWord + {City, Forest, Center}   e.g. Sherwood Forest
     CapWord + {Street, Boulevard, Avenue, Crescent, Road}   e.g. Portobello Street
     “to the” COMPASS “of” CapWord   e.g. to the south of Loitokitok
     “based in” CapWord   e.g. based in Loitokitok
     CapWord “is a” (ADJ)? GeoWord   e.g. Loitokitok is a friendly city
   Event Extraction: [Person] quit as [Position] of [Organization]
 Manually writing and editing such patterns requires some skill and considerable time (a small code sketch of this style of pattern follows)
 These patterns cannot be easily adapted to new domains
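As a minimal sketch, the name-tagging patterns above can be encoded as regular expressions; the lexicon words and the LOCATION label below are illustrative assumptions, not part of the tutorial.

```python
import re

# Illustrative hand-written patterns in the spirit of the examples above
PATTERNS = [
    # CapWord + {City, Forest, Center}   e.g. "Sherwood Forest"
    (re.compile(r"\b[A-Z][a-z]+ (?:City|Forest|Center)\b"), "LOCATION"),
    # CapWord + {Street, Boulevard, Avenue, Crescent, Road}   e.g. "Portobello Street"
    (re.compile(r"\b[A-Z][a-z]+ (?:Street|Boulevard|Avenue|Crescent|Road)\b"), "LOCATION"),
    # "based in" CapWord   e.g. "based in Loitokitok"
    (re.compile(r"\bbased in ([A-Z][a-z]+)\b"), "LOCATION"),
]

def tag_names(text):
    """Return (mention string, type, character span) for every pattern match."""
    mentions = []
    for regex, label in PATTERNS:
        for m in regex.finditer(text):
            # Use the capturing group if the pattern has one, otherwise the whole match
            span = m.span(1) if regex.groups else m.span()
            mentions.append((text[span[0]:span[1]], label, span))
    return mentions

print(tag_names("The charity, based in Loitokitok, also works near Sherwood Forest."))
```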

 Learn these patterns automatically based on an annotated corpus pre-processed by syntactic and semantic analyzers (Muslea, 1999); details later

Supervised Learning based IE

 ‘Pipeline’ style IE
   Split the task into several components
   Prepare data annotation for each component
   Apply supervised machine learning methods to address each component separately
 Most state-of-the-art ACE IE systems were developed in this way
 Provides a great opportunity to apply a wide range of learning models and to incorporate diverse levels of linguistic features to improve each component
 Substantial progress has been achieved on some of these components, such as name tagging and relation extraction

Major IE Components

Name/Nominal Extraction: “Barry Diller”, “chief”

Entity Coreference Resolution: “Barry Diller” = “chief”

Time Identification and Normalization: Wednesday (2003-03-04)

Relation Extraction: “Vivendi Universal Entertainment” is located in “France”

Event Mention Extraction and Event Coreference Resolution: “Barry Diller” is the Person argument of the End-Position event triggered by “quit”

Name Tagging: Task

 Person (PER): named person or family

 Organization (ORG): named corporate, governmental, or other organizational entity

 Geo-political entity (GPE): name of politically or geographically defined location (cities, provinces, countries, international regions, bodies of water, mountains, etc.)

Example: George W. Bush discussed Iraq

 But also: Location, Artifact, Facility, Vehicle, Weapon, Product, etc.

 Extended name hierarchy, 150 types, domain-dependent (Sekine and Nobata, 2004)

 Convert it into a sequence labeling problem – “BIO” tagging:

George   W.      Bush    discussed   Iraq
B-PER    I-PER   I-PER   O           B-GPE
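As a minimal sketch, the conversion from span-level name annotations to BIO labels can be done as follows; the (start, end, type) annotation format is an assumption for illustration.

```python
def to_bio(tokens, annotations):
    """Convert (start, end, type) token-span annotations into BIO labels.

    tokens:      list of tokens, e.g. ["George", "W.", "Bush", "discussed", "Iraq"]
    annotations: list of (start, end, type) with token offsets, end exclusive,
                 e.g. [(0, 3, "PER"), (4, 5, "GPE")]
    """
    labels = ["O"] * len(tokens)
    for start, end, etype in annotations:
        labels[start] = "B-" + etype          # first token of the name
        for i in range(start + 1, end):
            labels[i] = "I-" + etype          # remaining tokens of the name
    return labels

tokens = ["George", "W.", "Bush", "discussed", "Iraq"]
print(to_bio(tokens, [(0, 3, "PER"), (4, 5, "GPE")]))
# ['B-PER', 'I-PER', 'I-PER', 'O', 'B-GPE']
```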

Supervised Learning for Name Tagging

 Maximum Entropy Models (Borthwick, 1999; Chieu and Ng, 2002; Florian et al., 2007)
 Decision Trees (Sekine et al., 1998)
 Class-based Language Model (Sun et al., 2002; Ratinov and Roth, 2009)
 Agent-based Approach (Ye et al., 2002)
 Support Vector Machines (Takeuchi and Collier, 2002)
 Sequence Labeling Models
   Hidden Markov Models (HMMs) (Bikel et al., 1997; Ji and Grishman, 2005)
   Maximum Entropy Markov Models (MEMMs) (McCallum and Freitag, 2000)
   Conditional Random Fields (CRFs) (McCallum and Li, 2003)

Markov Chain for a Simple Name Tagger

[Figure: a toy Markov chain name tagger with states START, PER, LOC, X and END, with transition probabilities on the arcs and emission probabilities on the states (e.g. PER emits George, W. and Bush with probability 0.3 each and Iraq with 0.1; LOC emits Iraq with 0.8 and George with 0.2; X emits discussed with 0.7 and W. with 0.3; END emits $ with 1.0).]

Viterbi Decoding of Name Tagger

[Figure: Viterbi decoding trellis for "George W. Bush discussed Iraq $" over time steps t=0..6, tracking the probability of the best path ending in each state at each step, e.g. PER at t=1 scores 1 * 0.3 * 0.3 = 0.09.]

Current = Previous * Transition * Emission
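A minimal Viterbi decoder over a toy HMM of this kind is sketched below; the probability tables are illustrative stand-ins and do not reproduce the exact values in the figure.

```python
def viterbi(tokens, states, start_p, trans_p, emit_p):
    """Find the most probable state sequence for `tokens` under a toy HMM."""
    # V[t][s] = probability of the best path that ends in state s at time t
    V = [{s: start_p.get(s, 0.0) * emit_p[s].get(tokens[0], 0.0) for s in states}]
    back = [{}]
    for t in range(1, len(tokens)):
        V.append({})
        back.append({})
        for s in states:
            # Current = Previous * Transition * Emission, maximized over previous states
            prev, score = max(
                ((p, V[t - 1][p] * trans_p[p].get(s, 0.0) * emit_p[s].get(tokens[t], 0.0))
                 for p in states),
                key=lambda pair: pair[1])
            V[t][s], back[t][s] = score, prev
    # Follow the back-pointers from the best final state
    state = max(V[-1], key=V[-1].get)
    path = [state]
    for t in range(len(tokens) - 1, 0, -1):
        state = back[t][state]
        path.append(state)
    return list(reversed(path))

# Toy model loosely based on the figure above; the exact probabilities are illustrative
states = ["PER", "LOC", "X"]
start_p = {"PER": 0.3, "LOC": 0.2, "X": 0.5}
trans_p = {"PER": {"PER": 0.6, "LOC": 0.1, "X": 0.3},
           "LOC": {"PER": 0.1, "LOC": 0.3, "X": 0.6},
           "X":   {"PER": 0.2, "LOC": 0.3, "X": 0.5}}
emit_p = {"PER": {"George": 0.3, "W.": 0.3, "Bush": 0.3, "Iraq": 0.1},
          "LOC": {"George": 0.2, "Iraq": 0.8},
          "X":   {"W.": 0.3, "discussed": 0.7}}
print(viterbi(["George", "W.", "Bush", "discussed", "Iraq"],
              states, start_p, trans_p, emit_p))
# -> ['PER', 'PER', 'PER', 'X', 'LOC']
```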

Limitations of HMMs

 Joint probability distribution p(y, x)
 Features are assumed independent: cannot represent overlapping features or long-range dependencies between observed elements

 Need to enumerate all possible observation sequences

 Very strict independence assumptions on the observations

 Toward discriminative/conditional models

 Conditional probability P(label sequence y | observation sequence x) rather than joint probability P(y, x)

 Allow arbitrary, non-independent features on the observation sequence X

 The probability of a transition between labels may depend on past and future observations

 Relax strong independence assumptions in generative models

Maximum Entropy Markov Models (MEMMs)

 A conditional model that represents the probability of reaching a state given an observation and the previous state
 Observation sequences are treated as events to be conditioned upon

p(s | x) = p(s_1 | x_1) * ∏_{i=2}^{n} p(s_i | s_{i-1}, x_i)

 Has all the advantages of conditional models
   No longer assumes that features are independent
   Does not take future observations into account (no forward-backward)
 Subject to the Label Bias Problem: bias toward states with fewer outgoing transitions

Conditional Random Fields (CRFs)

 Conceptual Overview
   Each attribute of the data fits into a feature function that associates the attribute and a possible label
     A positive value if the attribute appears in the data
     A zero value if the attribute is not in the data
   Each feature function carries a weight that gives the strength of that feature function for the proposed label
     High positive weights: a good association between the feature and the proposed label
     High negative weights: a negative association between the feature and the proposed label
     Weights close to zero: the feature has little or no impact on the identity of the label
 CRFs have all the advantages of MEMMs without the label bias problem
   An MEMM uses a per-state exponential model for the conditional probabilities of next states given the current state
   A CRF has a single exponential model for the joint probability of the entire sequence of labels given the observation sequence
   Weights of different features at different states can be traded off against each other
 CRFs provide the benefits of discriminative models
 Details in the lab session led by Veselin Stoyanov

Sequential Model Trade-offs

Model   Speed             Discriminative vs. Generative   Normalization
HMM     very fast         generative                      local
MEMM    mid-range         discriminative                  local
CRF     relatively slow   discriminative                  global

Typical Name Tagging Features

 N-gram: unigram and token sequences in the context window of the current token
 Part-of-Speech: POS tags of the context
 Gazetteers: person names, organizations, countries and cities, titles, idioms, etc.
 Word clusters: to reduce sparsity, using word clusters such as Brown clusters (Brown et al., 1992)
 Case and Shape: capitalization and morphology analysis based features
 Chunking: NP and VP chunking tags
 Global feature: sentence-level and document-level features, for example whether the token is in the first sentence of a document
 Conjunction: conjunctions of various features
(a small feature-extraction sketch follows)
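A minimal sketch of per-token feature extraction in the style typically fed to a CRF or MEMM; the feature names, the gazetteer, and the placeholder document-level feature are illustrative assumptions.

```python
def token_features(tokens, pos_tags, i, gazetteer=frozenset()):
    """Illustrative feature dictionary for the i-th token."""
    w = tokens[i]
    feats = {
        "word": w.lower(),                        # unigram
        "pos": pos_tags[i],                       # part-of-speech
        "is_capitalized": w[:1].isupper(),        # case feature
        "shape": "".join("X" if c.isupper() else "x" if c.islower()
                         else "d" if c.isdigit() else c for c in w),
        "in_gazetteer": w.lower() in gazetteer,   # gazetteer lookup
        "prev_word": tokens[i - 1].lower() if i > 0 else "<S>",           # context n-gram
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "</S>",
        "first_sentence": False,                  # placeholder for a document-level feature
    }
    # Conjunction feature: previous word combined with capitalization of the current word
    feats["prev_word+cap"] = feats["prev_word"] + "|" + str(feats["is_capitalized"])
    return feats

tokens = ["George", "W.", "Bush", "discussed", "Iraq"]
pos = ["NNP", "NNP", "NNP", "VBD", "NNP"]
print(token_features(tokens, pos, 0, gazetteer={"george", "iraq"}))
```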

State-of-the-art and Remaining Challenges

 State-of-the-art Performance
   On ACE data sets: about 89% F-measure (Florian et al., 2006; Ji and Grishman, 2006; Nguyen et al., 2010; Zitouni and Florian, 2008)
   On CONLL data sets: about 91% F-measure (Lin and Wu, 2009; Ratinov and Roth, 2009)

 Remaining Challenges
   Identification, especially on organizations
     Boundary: “Asian Pulp and Paper Joint Stock Company, Lt. of Singapore”
     Need coreference resolution or context event features: “FAW has also utilized the capital market to directly finance, and now owns three domestic listed companies” (FAW = First Automotive Works)
   Classification
     “Caribbean Union”: ORG or GPE?

Coreference Resolution: From Mentions to Entities

 But the little prince could not restrain admiration:

 "Oh! How beautiful you are!"

 "Am I not?" the flower responded, sweetly. "And I was born at the same moment as the sun . . ."

 The little prince could guess easily enough that she was not any too modest--but how moving--and exciting--she was!

 "I think it is time for breakfast," she added an instant later. "If you would have the kindness to think of my needs--"

 And the little prince, completely abashed, went to look for a sprinkling-can of fresh water. So, he tended the flower.

Modeling Coreference Resolution

 Detailed Survey and Comparison in (Ng, 2010)

 Mention-Pair model: classify whether two mentions are coreferential or not, then cluster (Soon et al., 2001; Ng and Cardie, 2002; Ji et al., 2005; McCallum & Wellner, 2004; Nicolae & Nicolae, 2006). Advantages: easy to encode features. Disadvantages: greedy clustering algorithm; each candidate antecedent is considered independently of the others.
 Entity-Mention model: classify whether a mention and a preceding, possibly partially formed cluster are coreferential or not (Pasula et al., 2003; Luo et al., 2004; Yang et al., 2004, 2008; Daume & Marcu, 2005; Culotta et al., 2007). Advantages: improved expressiveness, allows cluster-level features. Disadvantages: each candidate cluster is considered independently of the others.
 Mention-Ranking model: imposes a ranking on a set of candidate antecedents (Denis & Baldridge, 2007, 2008). Advantages: considers all the candidate antecedents simultaneously. Disadvantages: insufficient information to make an informed coreference decision; still needs to do clustering.
 Cluster-Ranking model: ranks all the preceding clusters for each mention; creates instances with the entity-mention model and ranks the instances with the mention-ranking model (Rahman and Ng, 2009). Advantages: combines the strengths of previous models; achieved the best performance.

Typical Coreference Resolution Features

 Basic Features
   Lexical: exact match, partial match, acronym, edit distance, head match
   Distance: word and sentence distance
   Count: how many times a phrase appears in the document
   Mention information: spelling, level, type, …
   Pronoun Attributes: gender, number, possessiveness, reflexivity
   Synonymy
   Generic
   Modifier matching
   Quantifiers
 Entity-level
   Gender
   Number
   Humanity
   Entity type, mention type
 Syntactic features
   Apposition, POS tags
   Definiteness
   Same-NP test
   Functional tag
   Hobbs distance
   C-command/Governing category
   Dependency structure
 Conjunction features
(a small mention-pair feature sketch follows)
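A minimal sketch of a few of these features computed for one mention pair; the mention dictionary keys (text, head, sent_id, type, gender, number) are assumptions for illustration.

```python
def mention_pair_features(antecedent, anaphor):
    """Illustrative mention-pair features for coreference classification."""
    head1, head2 = antecedent["head"].lower(), anaphor["head"].lower()
    return {
        "exact_match": antecedent["text"].lower() == anaphor["text"].lower(),   # lexical
        "head_match": head1 == head2,
        "acronym": "".join(w[0] for w in antecedent["text"].split()).lower() == head2,
        "sentence_dist": abs(antecedent["sent_id"] - anaphor["sent_id"]),        # distance
        "same_entity_type": antecedent["type"] == anaphor["type"],               # entity-level
        "gender_agree": antecedent.get("gender") == anaphor.get("gender"),       # pronoun attributes
        "number_agree": antecedent.get("number") == anaphor.get("number"),
    }

antecedent = {"text": "Barry Diller", "head": "Diller", "sent_id": 0, "type": "PER",
              "gender": "male", "number": "sg"}
anaphor = {"text": "chief", "head": "chief", "sent_id": 0, "type": "PER",
           "gender": "male", "number": "sg"}
print(mention_pair_features(antecedent, anaphor))
```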

State-of-the-art and Remaining Challenges

 On ACE Data: 88.4 ACE value (Luo, 2007)
 On MUC-6 Data: 71.3 MUC F-score (Yang et al., 2003)
 On MUC-7 Data: 63.4 MUC F-score (Ng and Cardie, 2002)
 Challenges
   Name Coreference: “R” = “Republican Party”, “Brooklyn Dodgers” = “Brooklyn”
   Nominal Coreference
     Almost overnight, he became fabulously rich, with a $3-million book deal, a $100,000 speech making fee, and a lucrative multifaceted consulting business, Giuliani Partners. As a celebrity rainmaker and lawyer, his income last year exceeded $17 million. His consulting partners included seven of those who were with him on 9/11, and in 2002 Alan Placa, his boyhood pal, went to work at the firm.
     After a successful karting career in Europe, Perera became part of the Toyota F1 Young Drivers Development Program and was a Formula One test driver for the Japanese company in 2006.
     “Alexandra Burke is out with the video for her second single … taken from the British artist’s debut album”
     “a woman charged with running a prostitution ring … her business, Pamela Martin and Associates”

 Pronoun Coreference
   Meteorologist Kelly Cass became an On-Camera Meteorologist at The Weather Channel, after David Kenny was named the chairman and chief executive. She first appeared on air at The Weather Channel in January 2000.

Relation Extraction: Task

relation: a semantic relationship between two entities

ACE relation type        Example
Agent-Artifact           Rubin Military Design, the makers of the Kursk
Discourse                each of whom
Employment/Membership    Mr. Smith, a senior programmer at Microsoft
Place-Affiliation        Salzburg Red Cross officials
Person-Social            relatives of the dead
Physical                 a town some 50 miles south of Salzburg
Other-Affiliation        Republican senators

Typical Relation Extraction Features

 Lexical

 Heads of the mentions and their context words, POS tags

 Entity

 Entity and mention type of the heads of the mentions

 Entity Positional Structure

 Entity Context

 Syntactic

 Chunking

 Premodifier, Possessive, Preposition, Formulaic

 The sequence of the heads of the constituents, chunks between the two mentions

 The syntactic relation path between the two mentions

 Dependent words of the mentions

 Semantic Gazetteers

 Synonyms in WordNet

 Name Gazetteers

 Personal Relative Trigger Word List

 Wikipedia

 If the head extent of a mention is found (via simple string matching) in the predicted Wikipedia article of another mention
 References: Kambhatla, 2004; Zhou et al., 2005; Jiang and Zhai, 2007; Chan and Roth, 2010, 2011

A Simple Baseline with K-Nearest-Neighbor (KNN)

[Figure: a test sample surrounded by five training samples; with K=3, the label is determined by the three nearest training samples.]

Relation Extraction with KNN

[Figure: the test sample "the president of the United States" and its distances (0, 26, 36, 46, 46) to training samples "the previous president of the United States" (Employment), "the secretary of NIST" (Employment), "Connecticut's governor" (Employment), "his ranch in Texas" (Physical) and "US forces in Bahrain" (Physical).]

Distance function:
1. If the heads of the mentions don’t match: +8
2. If the entity types of the heads of the mentions don’t match: +20
3. If the intervening words don’t match: +10
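A minimal KNN relation classifier using the three distance rules above; the training examples and their feature encoding are illustrative, loosely following the slide's example.

```python
def distance(a, b):
    """Distance between two relation examples, following the rules listed above."""
    d = 0
    if a["heads"] != b["heads"]:
        d += 8     # heads of the two mentions don't match
    if a["types"] != b["types"]:
        d += 20    # entity types of the heads don't match
    if a["between"] != b["between"]:
        d += 10    # intervening words don't match
    return d

def knn_classify(test, train, k=3):
    """Label a test example with the majority relation type of its k nearest neighbors."""
    neighbors = sorted(train, key=lambda ex: distance(test, ex))[:k]
    labels = [ex["label"] for ex in neighbors]
    return max(set(labels), key=labels.count)

train = [
    {"heads": ("president", "States"), "types": ("PER", "GPE"), "between": ("of", "the", "United"), "label": "Employment"},
    {"heads": ("secretary", "NIST"),   "types": ("PER", "ORG"), "between": ("of",),                 "label": "Employment"},
    {"heads": ("ranch", "Texas"),      "types": ("FAC", "GPE"), "between": ("in",),                 "label": "Physical"},
    {"heads": ("forces", "Bahrain"),   "types": ("ORG", "GPE"), "between": ("in",),                 "label": "Physical"},
]
test = {"heads": ("president", "States"), "types": ("PER", "GPE"), "between": ("of", "the", "United")}
print(knn_classify(test, train))   # -> Employment
```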

Most Successful Learning Methods: Kernel-based

 Consider different levels of syntactic information
   Deep processing of text produces structural but less reliable results
   Simple surface information is less structural, but more reliable

 Generalization of feature-based solutions  A kernel (kernel function) defines a similarity metric Ψ(x, y) on objects  No need for enumeration of features

 Efficient extension of normal features into high-order spaces  Possible to solve linearly non-separable problem in a higher order space

 Nice combination properties  Closed under linear combination  Closed under polynomial extension  Closed under direct sum/product on different domains

 References: Zelenko et al., 2002, 2003; Culotta and Sorensen, 2004; Bunescu and Mooney, 2005; Zhao and Grishman, 2005; Che et al., 2005; Zhang et al., 2006; Qian et al., 2007; Zhou et al., 2007; Khayyamian et al., 2009; Reichartz et al., 2009

Kernel Examples for Relation Extraction

1) Argument kernel:
   Ψ_1(R1, R2) = Σ_{i=1,2} K_E(R1.arg_i, R2.arg_i)
   K_E(E1, E2) = K_T(E1.tk, E2.tk) + I(E1.type, E2.type) + I(E1.subtype, E2.subtype) + I(E1.role, E2.role)
   where K_T is a token kernel defined as:
   K_T(T1, T2) = I(T1.word, T2.word) + I(T1.pos, T2.pos) + I(T1.base, T2.base)

2) Local dependency kernel:
   Ψ_2(R1, R2) = Σ_{i=1,2} K_D(R1.arg_i.dseq, R2.arg_i.dseq)
   K_D(dseq, dseq') = Σ_{i,j} ( I(arc_i.label, arc'_j.label) + K_T(arc_i.dw, arc'_j.dw) )
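A minimal sketch of the argument kernel Ψ_1 as reconstructed above; the dictionary-based representation of tokens and arguments is an assumption for illustration.

```python
def I(x, y):
    """Matching function: 1 if the two values are equal, 0 otherwise."""
    return 1.0 if x == y else 0.0

def K_T(t1, t2):
    """Token kernel: I(word) + I(pos) + I(base)."""
    return I(t1["word"], t2["word"]) + I(t1["pos"], t2["pos"]) + I(t1["base"], t2["base"])

def K_E(e1, e2):
    """Entity (argument) kernel: token kernel on the head token plus type/subtype/role matches."""
    return (K_T(e1["tk"], e2["tk"]) + I(e1["type"], e2["type"])
            + I(e1["subtype"], e2["subtype"]) + I(e1["role"], e2["role"]))

def psi_1(r1, r2):
    """Argument kernel over a relation pair: sum of entity kernels for arg1 and arg2."""
    return sum(K_E(r1["args"][i], r2["args"][i]) for i in (0, 1))

def arg(word, pos, base, etype, subtype, role):
    """Helper to build an illustrative argument representation."""
    return {"tk": {"word": word, "pos": pos, "base": base},
            "type": etype, "subtype": subtype, "role": role}

r1 = {"args": [arg("Smith", "NNP", "smith", "PER", "Individual", "arg1"),
               arg("Microsoft", "NNP", "microsoft", "ORG", "Commercial", "arg2")]}
r2 = {"args": [arg("Diller", "NNP", "diller", "PER", "Individual", "arg1"),
               arg("Vivendi", "NNP", "vivendi", "ORG", "Commercial", "arg2")]}
print(psi_1(r1, r2))   # counts how many of the argument attributes match
```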

 State-of-the-art: about 71% F-score on perfect mentions, and 50% F-score on system mentions

 Single human annotator: 84% F-score on perfect mentions

 Remaining Challenges
   Context generalization to reduce data sparsity
     Test: “ABC's Sam Donaldson has recently been to Mexico to see him”
     Training: PHY relation (“arrived in”, “was traveling to”, …)
   Long context
     Davies is leaving to become chairman of the London School of Economics, one of the best-known parts of the University of London
   Disambiguate fine-grained types
     “U.S. citizens” and “U.S. businessman” indicate a “GPE-AFF” relation while “U.S. president” indicates an “EMP-ORG” relation
   Parsing errors

Event Mention Extraction: Task

 An event is a specific occurrence that implies a change of states
 event trigger: the main word which most clearly expresses an event occurrence

 event arguments: the mentions that are involved in an event (participants)  event mention: a phrase or sentence within which an event is described, including trigger and arguments  ACE defined 8 types of events, with 33 subtypes


ACE event type/subtype      Event Mention Example
Life/Die                    Kurt Schork died in Sierra Leone yesterday
Transaction/Transfer        GM sold the company in Nov 1998 to LLC
Movement/Transport          Homeless people have been moved to schools
Business/Start-Org          Schweitzer founded a hospital in 1913
Conflict/Attack             the attack on Gaza killed 13
Contact/Meet                Arafat’s cabinet met for 4 hours
Personnel/Start-Position    She later recruited the nursing student
Justice/Arrest              Faison was wrongly arrested on suspicion of murder

Supervised Event Mention Extraction: Methods

 Staged classifiers

 Trigger Classifier

 to distinguish event instances from non-events, to classify event instances by type

 Argument Classifier

 to distinguish arguments from non-arguments

 Role Classifier

 to classify arguments by argument role

 Reportable-Event Classifier

 to determine whether there is a reportable event instance

 Can choose any supervised learning methods such as MaxEnt and SVMs

(Ji and Grishman, 2008)
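A minimal sketch of how such a staged pipeline fits together; the trigger lexicon, the distance-based argument test and the role rule below are trivial stand-ins for trained classifiers, not the classifiers of (Ji and Grishman, 2008).

```python
TRIGGER_LEXICON = {"quit": "Personnel/End-Position", "died": "Life/Die"}   # illustrative

def extract_event_mentions(tokens, entity_mentions):
    """Run the staged pipeline: trigger -> argument -> role -> reportable-event filter."""
    events = []
    for i, tok in enumerate(tokens):
        etype = TRIGGER_LEXICON.get(tok.lower())              # stage 1: trigger classifier
        if etype is None:
            continue
        args = []
        for ent in entity_mentions:                           # stage 2: argument classifier
            if abs(ent["index"] - i) <= 6:                    # crude "is an argument" test
                role = "Person" if ent["type"] == "PER" else "Organization"   # stage 3: role classifier
                args.append((ent["text"], role))
        if args:                                              # stage 4: reportable-event filter
            events.append({"trigger": tok, "type": etype, "arguments": args})
    return events

tokens = "Barry Diller on Wednesday quit as chief of Vivendi Universal Entertainment".split()
entities = [{"text": "Barry Diller", "type": "PER", "index": 1},
            {"text": "Vivendi Universal Entertainment", "type": "ORG", "index": 8}]
print(extract_event_mentions(tokens, entities))
```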

Typical Event Mention Extraction Features

 Trigger Labeling
   Lexical: trigger tokens; tokens and POS tags of the candidate trigger and context words
   Dictionaries: trigger list, synonym gazetteers
   Syntactic: the depth of the trigger in the parse tree; the path from the node of the trigger to the root in the parse tree; the phrase structure expanded by the parent node of the trigger; the phrase type of the trigger
   Entity: the entity type of the syntactically nearest entity to the trigger in the parse tree; the entity type of the physically nearest entity to the trigger in the sentence
 Argument Labeling
   Event type and trigger: event type and subtype
   Entity: entity type and subtype; head word of the entity mention
   Context: context words of the argument candidate
   Syntactic: the phrase structure expanding the parent of the trigger; the relative position of the entity with regard to the trigger (before or after); the minimal path from the entity to the trigger; the shortest length from the entity to the trigger in the parse tree

(Chen and Ji, 2009)

State-of-the-art and Remaining Challenges

 State-of-the-art Performance (F-score)

 English: Trigger 70%, Argument 45%

 Chinese: Trigger 68%, Argument 52%

 Single human annotator: Trigger 72%, Argument 62%  Remaining Challenges

 Trigger Identification  Generic verbs  Support verbs such as “take” and “get” which can only represent an event mention together with other verbs or nouns  Nouns and adjectives based triggers

 Trigger Classification  “named” represents a “Personnel_Nominate” or “Personnel_Start-Position”?  “hacked to death” represents a “Life_Die” or “Conflict_Attack”?

 Argument Identification  Capture long contexts

 Argument Classification
   Capture long contexts
   Temporal roles (Ji, 2009; Li et al., 2011)

Event Coreference Resolution: Task

1. An explosion in a cafe at one of the capital's busiest intersections killed one woman and injured another Tuesday
2. Police were investigating the cause of the explosion in the restroom of the multistory Crocodile Cafe in the commercial district of Kizilay during the morning rush hour
3. The blast shattered walls and windows in the building
4. Ankara police chief Ercument Yilmaz visited the site of the morning blast
5. The explosion comes a month after
6. a bomb exploded at a McDonald's restaurant in Istanbul, causing damage but no injuries
7. Radical leftist, Kurdish and Islamic groups are active in the country and have carried out the bombing in the past

Category     Feature         Description
Event type   type_subtype    pair of event type and subtype
Trigger      trigger_pair    trigger pairs
             pos_pair        part-of-speech pair of triggers
             nominal         if the trigger of EM2 is nominal
             exact_match     if the triggers exactly match
             stem_match      if the stems of triggers match
             trigger_sim     trigger similarity based on WordNet
Distance     token_dist      the number of tokens between triggers
             sentence_dist   the number of sentences between event mentions
             event_dist      the number of event mentions between EM1 and EM2
Argument     overlap_arg     the number of arguments with entity and role match
             unique_arg      the number of arguments only in one event mention
             diffrole_arg    the number of coreferential arguments but role mismatch

Incorporating Event Attributes as Features

 Modality
   Other: Toyota Motor Corp. said Tuesday it will promote Akio Toyoda, a grandson of the company's founder who is widely viewed as a candidate to some day head Japan's largest automaker.
   Asserted: Managing director Toyoda, 46, grandson of Kiichiro Toyoda and the eldest son of Toyota honorary chairman Shoichiro Toyoda, became one of 14 senior managing directors under a streamlined management system set to be…
 Polarity
   Positive: At least 19 people were killed in the first blast
   Negative: There were no reports of deaths in the blast
 Genericity
   Specific: An explosion in a cafe at one of the capital's busiest intersections killed one woman and injured another Tuesday
   Generic: Roh has said any pre-emptive strike against the North's nuclear facilities could prove disastrous
 Tense
   Past: Israel holds the Palestinian leader responsible for the latest violence, even though the recent attacks were carried out by Islamic militants
   Future: We are warning Israel not to exploit this war against Iraq to carry out more attacks against the Palestinian people in the Gaza Strip and destroy the Palestinian Authority and the peace process.

 Attribute values as features: whether the attributes of an event mention and its candidate antecedent event conflict or not; 6% absolute gain (Chen et al., 2009)

Clustering Method 1: Agglomerative Clustering

Basic idea:

 Start with singleton event mentions, sort them according to the occurrence in the document

 Traverse through each event mention (from left to right), iteratively merging the active event mention into the prior event with the largest coreference probability above some threshold, or starting a new event with it (a minimal sketch follows)
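A minimal sketch of this left-to-right agglomerative procedure; the coreference scorer here is a trivial trigger-stem heuristic used only for illustration, not a trained model.

```python
def agglomerative_event_coref(event_mentions, coref_prob, threshold=0.5):
    """Left-to-right agglomerative clustering of event mentions.

    event_mentions: mentions sorted by their occurrence in the document
    coref_prob(mention, cluster): probability that `mention` corefers with `cluster`
    """
    clusters = []
    for mention in event_mentions:
        best, best_p = None, threshold
        for cluster in clusters:
            p = coref_prob(mention, cluster)
            if p > best_p:                       # most probable prior event above threshold
                best, best_p = cluster, p
        if best is not None:
            best.append(mention)                 # merge into a prior event
        else:
            clusters.append([mention])           # start a new event
    return clusters

# Toy scorer (assumption): corefer if the triggers share a 4-character prefix
def toy_prob(mention, cluster):
    return 1.0 if any(mention["trigger"][:4] == m["trigger"][:4] for m in cluster) else 0.0

mentions = [{"trigger": "explosion"}, {"trigger": "blast"}, {"trigger": "exploded"}]
print(agglomerative_event_coref(mentions, toy_prob))
```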

Clustering Method 2: Spectral Graph Clustering

[Figure: the event mentions from the example above shown as nodes with their triggers (explosion, blast, explosion, explosion, exploded, explosion, bombing) and arguments (e.g. Place = a cafe, Time = Tuesday; Place = site, Time = morning; Place = restroom, Time = morning rush hour; Time = a month after; Place = restaurant; Place = building; Attacker = groups), connected into a graph for clustering.]

 (Chen and Ji, 2009)

Spectral Graph Clustering

[Figure: a weighted graph partitioned into groups A and B; edge weights within each group are high (0.6-0.9) while edges crossing the partition have low weights (0.1-0.3).]

cut(A,B) = 0.1 + 0.2 + 0.2 + 0.3 = 0.8

Spectral Graph Clustering (Cont')

 Start with a fully connected graph, where each edge is weighted by the coreference value

 Optimize the normalized-cut criterion (Shi and Malik, 2000):

   NCut(A, B) = cut(A, B) / vol(A) + cut(A, B) / vol(B)

 vol(A): the total weight of the edges from group A
 Maximize the weight of within-group coreference links
 Minimize the weight of between-group coreference links
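A minimal sketch of computing the normalized-cut criterion for a given partition; the toy graph loosely matches the figure above and the node names are illustrative.

```python
def ncut(weights, A, B):
    """NCut(A,B) = cut(A,B)/vol(A) + cut(A,B)/vol(B).

    weights: dict mapping a frozenset({u, v}) edge to its coreference weight
    A, B:    disjoint sets of nodes
    """
    cut = sum(w for e, w in weights.items() if len(e & A) == 1 and len(e & B) == 1)

    def vol(S):
        # total weight of edges incident to the group S
        return sum(w for e, w in weights.items() if e & S)

    return cut / vol(A) + cut / vol(B)

# Toy graph: within-group edges are heavy, cross-partition edges sum to 0.8 as above
w = {frozenset({"a1", "a2"}): 0.9, frozenset({"a1", "a3"}): 0.8,
     frozenset({"b1", "b2"}): 0.7,
     frozenset({"a1", "b1"}): 0.1, frozenset({"a2", "b1"}): 0.2,
     frozenset({"a2", "b2"}): 0.2, frozenset({"a3", "b2"}): 0.3}
print(ncut(w, A={"a1", "a2", "a3"}, B={"b1", "b2"}))
```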

State-of-the-art Performance

 The MUC metric does not prefer clustering results with many singleton event mentions (Chen and Ji, 2009)

Remaining Challenges

The performance bottleneck of event coreference resolution comes from the poor performance of event mention labeling

Outline

 Introduction
 Basic Information Extraction (IE)
 Advanced IE
   Enhance Quality
   Enhance Portability
 Popular Research Directions
   Cross-source IE
   IE for Noisy Data
 Resources

Enhance Quality

 Enhance Quality

 Incorporating Redundancy

 Global Inference

 Joint Inference

Common IE Bottleneck

 One of the initial goals for IE was to create a knowledge base (KB) from the entire input corpus, such as a profile or a series of activities about any entity, and allow further logical reasoning on the KB

 Such information may be scattered among a variety of sources (large-scale documents, languages, genres and data modalities)

 Problem: the KB constructed from a typical IE pipeline often contains lots of erroneous and conflicting facts  Single-document event extraction < 70%; Cross-document slot filling < 30%; worse for non-newswire genres, languages, multimedia data

 Improve Quality of IE: identify topically-related documents and integrate facts, possibly redundant, possibly complementary, possibly in conflict, coming from these documents → improve IE results with low cost

IE in Rich Contexts

[Figure: texts, authors, venues, and time/location/cost constraints feed into IE, which produces information networks that support human collaborative learning.]

Capture Information Redundancy

 When the data grows beyond a certain size, the IE task is naturally embedded in rich contexts and the extracted facts become inter-dependent
 Leverage Information Redundancy from:
   Large Scale Data (Chen and Ji, 2011)
   Background Knowledge (Chan and Roth, 2010; Rahman and Ng, 2011)
   Inter-connected facts (Li and Ji, 2011; Li et al., 2011; e.g. Roth and Yih, 2004; Gupta and Ji, 2009; Liao and Grishman, 2010; Hong et al., 2011)
   Diverse Documents (Downey et al., 2005; Yangarber, 2006; Patwardhan and Riloff, 2009; Mann, 2007; Ji and Grishman, 2008)
   Diverse Systems (Tamang and Ji, 2011)
   Diverse Languages (Snover et al., 2011)
   Diverse Data Modalities (text, image, speech, video…)

 But how? Such knowledge might be overwhelming…

Global Knowledge based Inference for Event Extraction

 Cross-document inference (Ji and Grishman, 2008)
 Cross-event inference (Liao and Grishman, 2010)
 Cross-entity inference (Hong et al., 2011)
 All-together (Li et al., 2011)

Leveraging Redundancy with Topic Modeling

 Within a cluster of topically-related documents, the distribution of facts is much more convergent: it is closer to the distribution in the collection of topically related documents than in the uniform training corpora
   e.g. in the overall information networks only 7% of “fire” instances indicate “End-Position” events, while all instances of “fire” in one topic cluster are “End-Position” events
   e.g. “Putin” appeared in different roles, including “meeting/entity”, “movement/person”, “transaction/recipient” and “election/person”, but only played an “election/person” role in one topic cluster
 Topic Modeling can enhance information network construction by grouping similar objects, event types and roles together

Global Inference Results

 Topic-cluster wide cross-document inference to enhance event and role mining  One trigger sense per topic cluster / One argument role per topic cluster  Remove events and roles with low local and cluster-wide confidence  Adjust event and role labeling to achieve cluster-wide consistency  Results: Precision (P), Recall (R), F-Measure (F)

Language   System                     Trigger Labeling (P / R / F)   Argument Labeling (P / R / F)
English    Baseline                   74.1 / 49.6 / 59.4             50.4 / 28.7 / 36.6
           Cross-doc Inference (IR)   66.5 / 67.4 / 66.9             60.8 / 32.2 / 42.1
           Topic Modeling             73.3 / 66.3 / 69.6             59.4 / 36.5 / 45.2
Chinese    Baseline                   78.8 / 48.3 / 60.0             60.6 / 34.3 / 43.8
           Cross-doc Inference (IR)   69.9 / 62.3 / 65.9             67.5 / 38.3 / 48.9
           Topic Modeling             76.5 / 61.9 / 68.4             66.4 / 42.4 / 51.8

(Li et al., 2011)

Facts are often Inter-dependent: Joint Inference

13.7%-24.4% error reduction using Integer Linear Programming (Li et al., 2011)

Pairwise (Li, Lj): Person A founded Organization B / Organization B hired Person A
Triangle-Entity (Li, Lj): Organization A is involved in a Justice/Conflict/Transaction event with Organization B / Person C is affiliated with or a member of both Organization A and Organization B
Triangle-Link (Li, Lj, Lk): Entity A is involved in a Transport event originating from Location B / Person C is affiliated with or a member of Entity A / Person C is located in Location B

Enhance Portability

 Solutions for expensive data annotation

 Semi-supervised Learning (Self-training, Bootstrapping, Co- training)

 Unsupervised Learning

 Distant Supervision  Solutions for domain-dependent restriction

 Open IE

 Domain Adaptation

Self-training for Name Tagging

[Figure: self-training loop. The test set is clustered into document clusters T_1 … T_n. Starting from a baseline tagger NameM, each cluster T_i is tagged with the current model to give T_i'; about 5% of the sentences, selected with an adjustable confidence threshold, form T_i'' and are added to the training corpus; NameM is retrained and the loop moves to the next cluster, finally producing the system output.]

(Ji and Grishman, 2006)
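A minimal sketch of the self-training loop shown in the figure; the model interface (a .tag() method returning sentences with a confidence score) and the retrain function are assumptions for illustration.

```python
def self_train(baseline_model, retrain, test_clusters, keep_fraction=0.05):
    """Sketch of the self-training loop for name tagging described above.

    baseline_model: object with a .tag(sentences) method returning tagged sentences,
                    each with a .confidence attribute (assumed interface)
    retrain:        function(labeled_sentences) -> a new model (assumed interface)
    test_clusters:  the test set split into document clusters T_1 ... T_n
    """
    model, bootstrapped = baseline_model, []
    for cluster in test_clusters:
        tagged = model.tag(cluster)                              # T_i' = T_i tagged with NameM
        tagged.sort(key=lambda s: s.confidence, reverse=True)    # rank by confidence
        selected = tagged[:max(1, int(keep_fraction * len(tagged)))]   # T_i'' (top ~5%)
        bootstrapped.extend(selected)                            # add T_i'' to training corpus
        model = retrain(bootstrapped)                            # retrain NameM
    return model                                                 # used to produce system output
```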

Bootstrapping for Relation Extraction

Initial Seed Tuples:
ORGANIZATION   LOCATION
MICROSOFT      REDMOND
IBM            ARMONK
BOEING         SEATTLE
INTEL          SANTA CLARA

[Figure: the bootstrapping cycle: initial seed tuples → occurrences of seed tuples → generate extraction patterns → generate new seed tuples → augment the table, and repeat.]

Bootstrapping for Relation Extraction (Cont')

Occurrences of seed tuples:
 Computer servers at Microsoft’s headquarters in Redmond…
 In mid-afternoon trading, shares of Redmond-based Microsoft fell…
 The Armonk-based IBM introduced a new line…
 The combined company will operate from Boeing’s headquarters in Seattle.
 Intel, Santa Clara, cut prices of its Pentium processor.

Bootstrapping for Relation Extraction (Cont')

Learned Patterns:
• <ORGANIZATION>’s headquarters in <LOCATION>
• <LOCATION>-based <ORGANIZATION>

Bootstrapping for Relation Extraction (Cont')

Generate new seed tuples; start a new iteration:
ORGANIZATION   LOCATION
AG EDWARDS     ST LUIS
157TH STREET   MANHATTAN
7TH LEVEL      RICHARDSON
3COM CORP      SANTA CLARA
3DO            REDWOOD CITY
JELLIES        APPLE
MACWEEK        SAN FRANCISCO
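A very small sketch of this bootstrapping cycle; the pattern representation (the literal text between the two seed names) is a simplification of what systems such as Snowball/DIPRE actually use, and the corpus sentences are illustrative.

```python
import re

def bootstrap(corpus, seeds, iterations=2):
    """Seed-based bootstrapping for (ORGANIZATION, LOCATION) tuples."""
    tuples = set(seeds)
    for _ in range(iterations):
        # 1. Occurrences of known tuples -> extraction patterns (middle context)
        patterns = set()
        for org, loc in tuples:
            for sent in corpus:
                i, j = sent.find(org), sent.find(loc)
                if i != -1 and j != -1 and i < j:
                    patterns.add(sent[i + len(org):j])      # e.g. "'s headquarters in "
        # 2. Apply patterns to the corpus to generate new tuples and augment the table
        name = r"([A-Z][A-Za-z]+(?: [A-Z][A-Za-z]+)*)"
        for pat in patterns:
            for sent in corpus:
                m = re.search(name + re.escape(pat) + name, sent)
                if m:
                    tuples.add((m.group(1), m.group(2)))
        # 3. Start a new iteration with the augmented table
    return tuples

corpus = ["Computer servers at Microsoft's headquarters in Redmond were down.",
          "Boeing's headquarters in Seattle will move.",
          "Intel's headquarters in Santa Clara cut prices."]
print(bootstrap(corpus, {("Microsoft", "Redmond")}))
# -> also learns ("Boeing", "Seattle") and ("Intel", "Santa Clara")
```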

Cross-lingual Co-Training for Event Extraction (Chen and Ji, 2009)

[Figure: cross-lingual co-training. Labeled samples in language A and language B train an event extraction system for each language; unlabeled bitexts are sampled at random into a bilingual pool of constant size; each system labels its side of the pool, and its high-confidence samples are projected across the bitext into the other language and added to that language's training data.]

 Bootstrapping: n=1, trust yourself and teach yourself
 Co-training: n=2 (Blum and Mitchell, 1998)
   • the two views are individually sufficient for classification
   • the two views are conditionally independent given the class

Unsupervised Mention Detection with Google N-grams

 Gender patterns:
   Conjunction-Possessive: target = noun [292,212] | capitalized [162,426]; context = conjunction; pronoun = his|her|its|their; e.g. "writer and his"
   Nominative-Predicate: target = noun [53,587]; context = am|is|are|was|were|be; pronoun = he|she|it|they; e.g. "he is a writer"
   Verb-Nominative: target = noun [116,607]; context = verb; pronoun = he|she|it|they; e.g. "writer thought he"
   Verb-Possessive: target = noun [88,577] | capitalized [52,036]; context = verb; pronoun = his|her|its|their; e.g. "writer bought his"
   Verb-Reflexive: target = noun [18,725]; context = verb; pronoun = himself|herself|itself|themselves; e.g. "writer explained himself"
 Animacy patterns:
   Relative-Pronoun: target = (noun|adjective) & not after (preposition|noun|adjective) [664,673]; context = comma|empty; pronoun = who|which|where|when; e.g. "writer, who"

Gender and Animacy Discovery Examples

 If a mention indicates male/female/animacy with high confidence, it’s likely to be a person mention

Patterns for candidate mentions    male   female   neutral   plural
John Joseph bought/… his/…         32     0        0         0
Haifa and its/…                    21     19       92        15
screenwriter published/… his/…     144    27       0         0
it/… is/… fish                     22     41       1741      1186

Patterns for                       Animate   Non-Animate
candidate mentions                 who       when    where   which
supremo                            24        0       0       0
shepherd                           807       24      0       56
prophet                            7372      1066    63      1141
imam                               910       76      0       57
oligarchs                          299       13      0       28
sheikh                             338       11      0       0

 Comparable performance with supervised models (Ji and Lin, 2009)

Distant Supervision

 Motivation  It’s expensive to conduct annotations from unstructured data  Structured knowledge bases are widely available (Freebase, DBPedia,…)  Exploits known relations (usually obtained from an existing database) to extract contexts from a large document collection and automatically label them accordingly

 Basic Idea  Whenever two entities that are known to participate in a relation appear in the same context, this context is likely to express this relation in some way  By extracting many such contexts, different ways of expressing the same relation will be captured and a general model may be abstracted by applying machine learning methods to the annotated data
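A minimal sketch of this labeling assumption; the KB triples and corpus sentences are illustrative, and the second sentence shows the kind of noise the assumption can introduce.

```python
def distant_label(corpus, kb):
    """Any sentence containing both entities of a known KB relation instance is
    (noisily) labeled as expressing that relation."""
    labeled = []
    for sentence in corpus:
        for (e1, relation, e2) in kb:
            if e1 in sentence and e2 in sentence:
                labeled.append((sentence, e1, e2, relation))
    return labeled

kb = [("Barry Diller", "employee_of", "Vivendi Universal Entertainment")]
corpus = ["Barry Diller on Wednesday quit as chief of Vivendi Universal Entertainment.",
          "Barry Diller criticized Vivendi Universal Entertainment in an interview."]
for example in distant_label(corpus, kb):
    print(example)   # the second sentence does not actually express employment
```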

 Applications
   Relation Extraction (Mintz et al., 2009; Nguyen and Moschitti, 2011; Hoffmann et al., 2011; Bobic et al., 2012)
   (Marchetti-Bowick and Chambers, 2012)
   Emotion Classification (Purver and Battersby, 2012)

Distant Supervision Example

Distant Supervision

 Advantages  Save annotation cost: create a labeled training set when no human annotated dataset is available  Classifiers trained with distant supervision present the benefit that they are less prone to over-fitting than those learned from manual annotations

 Disadvantages
   The underlying training assumption of distant supervision can introduce noise that results in a loss of precision for relation extraction (Riedel et al., 2010)
   Example errors on Temporal IE:
     Coreference results match the wrong named entities in a document.
     Temporal expressions are normalized incorrectly.
     Temporal information with different granularities has to be compared: the KB states that John married Mary in 1997, but not the exact day and month. Should we consider a temporal expression such as September 3, 1997 as a START?
     Information offered by the KB is incorrect or contradicts information found in the Web documents (yet web search is the most effective way to collect positive samples)
   Possible solutions: logistic regression based re-labeling (Tamang and Ji, 2012)

Domain-independent IE

 Traditional IE assumes the scenario and event types are known in advance so that the corresponding training data and seeds can be prepared

 Open IE (Banko et al., 2007)  learn a general model of how relations are expressed (in a particular language), based on unlexicalized features such as part-of-speech tags and domain- independent regular expressions; e.g. “E1 verb E2 (X established Y) “  the identities of the relations to be extracted are unknown and the billions of documents found on the Web necessitate highly scalable processing  On-demand IE (Sekine, 2006):  Pre-emptive IE (Shinyama et al., 2006): hierarchical pattern clustering  Advantages  Can extract unknown relations and events from heterogeneous corpora  Disadvantages  Low recall, cannot incorporate complicated long distance patterns

 Automatic event type and template discovery for new scenarios
   Using clustering and techniques (Li et al., 2010)
   Template discovery (Chambers and Jurafsky, 2011)

Summary of IE Methods

 Supervised Learning: learn rules or a supervised model from labeled data; requires large labeled data; high precision and high recall; poor portability and scalability; e.g. (McCallum, 2003; Ahn, 2006; Hardy et al., 2006; Ji and Grishman, 2008)
 Bootstrapping: send seeds to extract common patterns from unlabeled data; requires small seeds; moderate precision, recall difficult to measure; moderate portability and scalability; e.g. (Riloff, 1996; Brin, 1999; Agichtein and Gravano, 2000; Etzioni et al., 2004; Yangarber, 2000)
 Distant Supervision: project large database entries into unlabeled data to obtain annotations; requires large seeds; low precision, moderate recall; moderate portability and scalability; e.g. (Mintz et al., 2009; Wu and Weld, 2010)
 Open IE: open-domain IE based on syntactic patterns; requires small amounts of unstructured labeled data; moderate precision, low recall; good portability and scalability; e.g. (Sekine, 2006; Shinyama et al., 2006; Banko et al., 2007)
 Template Discovery: automatically discover scenarios, event types and templates; requires little unstructured labeled data; moderate precision and recall; good portability and scalability; e.g. (Li et al., 2010; Chambers and Jurafsky, 2011)

Cross-genre Name Tagging Results

Cross-domain IE Results

 From News to Chemical Engineering

Adaptive Online Control for Cascading Blackout Mitigation

Natural Gas Transportation Network Expansion and LNG Terminal Location Planning Models

Quasi-Monte Carlo Methods for a Multi-stage Stochastic Program for Energy Planning

 A Novel Stochastic Programming Algorithm for Minimization of Fresh Water Consumption in Power Plants

 Resource Reservation for Allocation Model with Randomness on the Right Hand Side

 Drinking Water Supply Network Planning: A Game-Theoretic Model with Embodied-Energy Target

 No relations or events were identified.
 (Jiang and Zhai, 2006)

Name Tagging

Task                                              Train → Test     F1
to find PER, LOC, ORG from news text              NYT → NYT        0.855
                                                  Reuters → NYT    0.641
to find gene/protein from biomedical literature   mouse → mouse    0.541
                                                  fly → mouse      0.281

Domain Adaptation Methods for IE

References: Finkel and Manning, 2009; Arnold et al., 2008; Finkel and Manning, 2009; Daume III, 2007; Wu et al., 2009; Jiang and Zhai, 2007

Outline

 Introduction
 Basic Information Extraction (IE)
 Advanced IE
   Enhance Quality
   Enhance Portability
 Popular Research Directions
   Cross-source IE
   IE for Noisy Data
 Resources

Popular Research Directions

 Cross-source IE

 Cross-document IE: Knowledge Base Population

 Cross-lingual IE

 Cross-media IE  IE for Noisy Data

 MT output

 ASR output

Knowledge Base Population (KBP): Break Document Boundary

[Figure: documents from a source collection are used to create or expand entries in a reference KB.]

http://www.nist.gov/tac/2012/KBP/index.html (Coordinators: James Mayfield and Javier Artiles)

Overview of KBP Tasks

 2009,2010: Monolingual (English to English)  2011: Cross-lingual (Chinese to English)  2012: Cross-lingual (Spanish to English)  2010: Surprise Slot filling  2011: Temporal Slot filling  2012: Cold-Start KBP

 (Ji and Grishman, 2011)

Entity Linking

[Figure: entity linking example: the query "James Parsons" and its source document are linked to the matching KB node, or to NIL if the entity is not in the KB.]

Successful Entity Linking Methods

 Approaches are converging
 The best systems are approaching human performance
 Cross-lingual Entity Linking performance is only slightly lower than mono-lingual

[Figure: a typical entity linking pipeline: query expansion (Wiki hyperlink mining, source-document coreference resolution, collaborator mentions); KB node candidate generation (document semantic analysis, IR over the Wiki KB and texts); KB node candidate ranking (unsupervised similarity computation, supervised classification, graph-based ranking, IR, rules); and NIL clustering (hierarchical agglomerative clustering, name match, graph-based clustering, coreference, topic modeling, linking to a larger KB and mapping down, handling polysemy and synonymy).]

Slot Filling

Example query: Jim Parsons (PER, KB node E0300113, source document eng-WL-11-174592-12943233); slots to fill include per:date_of_birth, per:age, per:country_of_birth, per:city_of_birth, …

Example answer: School Attended: University of Houston

Slot Types

Person: per:alternate_names, per:date_of_birth, per:age, per:country_of_birth, per:stateorprovince_of_birth, per:city_of_birth, per:origin, per:date_of_death, per:country_of_death, per:stateorprovince_of_death, per:city_of_death, per:cause_of_death, per:countries_of_residence, per:stateorprovinces_of_residence, per:cities_of_residence, per:schools_attended, per:title, per:member_of, per:employee_of, per:religion, per:spouse, per:children, per:parents, per:siblings, per:other_family, per:charges

Organization: org:alternate_names, org:political/religious_affiliation, org:top_members/employees, org:number_of_employees/members, org:members, org:member_of, org:subsidiaries, org:parents, org:founded_by, org:founded, org:dissolved, org:country_of_headquarters, org:stateorprovince_of_headquarters, org:city_of_headquarters, org:shareholders, org:website

How much Inference is Needed?

 Difficulty to push above F = 0.30

 High entry cost for competitive performance; needs good performance at each IE component

Why KBP is more difficult than ACE

 Cross-slot Inference (per:children)
   People Magazine has confirmed that actress Julia Roberts has given birth to her third child a boy named Henry Daniel Moder. Henry was born Monday in Los Angeles and weighed 8 lbs. Roberts, 39, and husband Danny Moder, 38, are already parents to twins Hazel and Phinnaeus who were born in November 2006.
   son-of (Julia Roberts, Henry Moder) & spouse-of (Julia Roberts, Danny Moder) usually → son-of (Danny Moder, Henry Moder)
 25% of the examples involve coreference which is beyond current system capabilities, such as nominal anaphors
   “Alexandra Burke is out with the video for her second single … taken from the British artist’s debut album”
   “a woman charged with running a prostitution ring … her business, Pamela Martin and Associates”
 Systems would benefit from specialists which are able to reason about times, locations, family relationships, and employment relationships.

 It places more emphasis on cross-document entity resolution, which received limited effort in ACE
 It forces systems to deal with redundant and conflicting answers across large corpora

Cross-media IE

 (Lee et al., 2010; Qi et al., 2011)

Fact Type Examples in Cross-Media IE

Cross-lingual IE

Example query: Ang Lee (PER, KB node E0300112, source document XIN20030616.0130.0053); slots include per:date_of_birth, per:spouse, per:children

Parent: Li Sheng

Birth-place: Taiwan Pindong City

Residence: Hua Lian

Attended-School: NYU

(Snover et al., 2011) Alternative Cross-lingual IE Pipelines

 References:

 Riloff et al., 2002;

 Sudo et al., 2004;

 Hakkani-Tür et al., 2007;

 Snover et al., 2011

Impact of MT Errors on Cross-lingual IE

 Re-training extraction components directly from MT output did not help  MT errors were too diverse to generalize  59% of the missing errors were due to text, query or answer translation errors; 20% were due to slot filling errors; Name translation is a bottleneck

 Source Text
   俄塔社援引紧急情况部莫斯科市总局新闻处处长 (Bo Bei Lie Fu) 的话...
 Reference Translation
   The Russian news agency Tass, quoting Director Bobylev of the news office of the Moscow city headquarters of the Emergency Situation Department...
 Various MT System Translations
   Russia 's Tass news agency quoted the ministry for emergency situations of the Moscow city , Director of Information Services , German Gref...
   Itar-Tass quoted the Emergency Situations Ministry 博贝列夫 in Moscow City Administration Director Bo , yakovlev...
   Russia 's Tass news agency of the Ministry of Emergency Situations Moscow city administration of Addis Ababa , Director of Information Services...
   Russian news agency quoted the ministry of emergency situations in Moscow city administration of the Director of Information Services , A. Kozyrev...
   Itar-Tass quoted the Emergencies Ministry in Moscow , the Director of information in 1988 lev...

Cross-lingual Validation

[Table: 17 cross-lingual validation features f1-f17, each characterized by scope (local vs. global: cross-system, within-document, or cross-document over comparable corpora), depth (shallow statistics vs. deep, based on IE or fact InfoNets) and language (English, Chinese, or both).]

 f1: frequency of answer a appearing in all baseline outputs
 f2: number of conflicting slot types in which answer a appears in all baseline outputs
 f3: conjunction of slot type t and whether a is a year answer
 f4: conjunction of t and whether a includes numbers or letters
 f5: conjunction of place t and whether a is a country name
 f6: conjunction of per:origin t and whether a is a nationality
 f7: if t = per:title, whether a is an acceptable title
 f8: if t requires a name answer, whether a is a name
 f9: whether a has an appropriate semantic type
 f10: conjunction of org:top_members/employees and whether there is a high-level title in the context sentence s
 f11: conjunction of alternative name and whether a is an acronym of q
 f12: conditional probability of q/q' and a/a' appearing in the same document
 f13: conditional probability of q/q' and a/a' appearing in the same sentence
 f14: co-occurrence of q/q' and a/a' in coreference links
 f15: co-occurrence of q/q' and a/a' in relation/event links
 f16: conditional probability of q/q' and a/a' appearing in relation/event links
 f17: mutual information of q/q' and a/a' appearing in relation/event links

 Achieved 11% absolute F-measure gain (Snover et al., 2011)

IE for ASR Output

 Problems

 Using an IE system trained from newswire, the performance degrades notably, 15% relative, when the system is tested on broadcast news transcriptions and 27% relative, when ASR output is used instead of reference transcriptions (Makhoul et al., 2005)

 Optimizing based on downstream applications (IE) is better than optimizing F- measure (Favre et al., 2008)

 Need better pronoun resolution for speech conversation

 Possible Solutions

 Optimize downstream applications for ASR and speech segmentation (JHU2012 Summer Workshop on “Complementary Evaluation Measures for Speech Transcription”)

 Use n-best hypotheses, ASR lattices, word confusion networks, or graphemes for IE

 Improve pronoun resolution by incorporating automatic speaker role identification techniques

 Apply Modality, Polarity, Genericity analysis to reduce uncertainty

Segmenting Speech for IE

(Favre et al., 2008)

Outline

 Introduction
 Basic Information Extraction (IE)
 Advanced IE
   Enhance Quality
   Enhance Portability
 Popular Research Directions
   Cross-source IE
   IE for Noisy Data
 Resources

Resources: Data Sets

• ACE IE: http://projects.ldc.upenn.edu/ace/data/ (IE training data for English/Chinese/Arabic/Spanish)
• CONLL 2002: http://www.cnts.ua.ac.be/conll2002/ner.tgz (Name tagging training data for Dutch and Spanish)
• CONLL 2003: http://www.cnts.ua.ac.be/conll2003/ner.tgz (Name tagging training data for English and German)
• KBP 2009-2012: http://www.nist.gov/tac/2012/KBP/data.html , http://nlp.cs.qc.cuny.edu/kbp/2011/ , http://nlp.cs.qc.cuny.edu/kbp/2010/ (Knowledge Base Population for English, Chinese and Spanish)

Resources: Publicly Available IE Toolkits

NYU IE: http://www.cs.nyu.edu/cs/faculty/grishman/jet/license.html
University of Sheffield IE: http://gate.ac.uk/download/index.html
NLTK: http://nltk.sourceforge.net/
CUNY KBP system: http://nlp.cs.qc.cuny.edu/kbptoolkit-1.5.0.tar.gz , http://nlp.cs.qc.cuny.edu/Temporal_Slot_Filling_1.0.1.tar.gz

Name Taggers: Stanford Name Tagger: http://nlp.stanford.edu/ner/index.shtml UIUC Name Tagger: http://cogcomp.cs.illinois.edu/page/software_view/NETagger CUNY Name Taggers: Chinese tagger: http://nlp.cs.qc.cuny.edu/ChineseNameTagger.tar.gz English tagger: http://nlp.cs.qc.cuny.edu/en_nametagging_release.tar.gz

Coreference Resolvers:
JavaRAP: http://aye.comp.nus.edu.sg/~qiu/NLPTools/JavaRAP.html
BART: http://bart-coref.org/
The Illinois Coreference Package: http://cogcomp.cs.illinois.edu/page/software_view/18
Reconcile: http://www.cs.utah.edu/nlp/reconcile/
CherryPicker: http://www.hlt.utdallas.edu/~altaf/cherrypicker.html

Thank You and Join the IE World!