Enterprise Information Extraction
Enterprise Information Extraction
SIGMOD 2010 Tutorial
Frederick Reiss, Yunyao Li, Laura Chiticariu, and Sriram Raghavan
IBM Almaden Research Center
© 2009 IBM Corporation

Who we are
Researchers from the Search and Analytics group at IBM Almaden Research Center
  – Frederick Reiss
  – Yunyao Li
  – Laura Chiticariu
  – Sriram Raghavan (virtual)
Working on information extraction since 2006-08
  – SystemT project
  – Code shipping with 8 IBM products

Road Map
What is Information Extraction? (Fred Reiss) (you are here)
Declarative Information Extraction (Fred Reiss)
What the Declarative Approach Enables
  – Scalable Infrastructure (Yunyao Li)
  – Development Support (Laura Chiticariu)
Conclusion / Q&A (Fred Reiss)

Obligatory “What is Information Extraction?” Slide
Distill structured data from unstructured and semi-structured text
Exploit the extracted data in your applications

Example text (from Cohen’s IE tutorial, 2003):
  For years, Microsoft Corporation CEO Bill Gates was against open source. But today he appears to have changed his mind. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying…

Annotations:
  Name              Title    Organization
  Bill Gates        CEO      Microsoft
  Bill Veghte       VP       Microsoft
  Richard Stallman  Founder  Free Soft..

SIGMOD 2006 Tutorial [Doan06] in One Slide
(Bibliography at the end of the slide deck.)
Information extraction has been an area of study in Natural Language Processing and AI for years
Core ideas from database research not a part of existing work in this area
  – Declarative languages
  – Well-defined semantics
  – Cost-based optimization
The challenge: Can we build a “System R” for information extraction?
Survey of early-stage projects attacking this problem

What’s new?
New enterprise-focused applications…
…driving new requirements…
…leading to declarative approaches

Enterprise Applications of Information Extraction
Previous tutorial showed research prototypes
  – Avatar: Semantic search on personal emails
  – DBLife: Use IE to build a knowledge base about database researchers
  – AliBaba: IE over medical research papers
Since then, IE has gone mainstream
  – Enterprise Semantic Search
  – Enterprise Data as a Service
  – Business Intelligence
  – Data-driven Enterprise Mashups

Enterprise Semantic Search
Use information extraction to improve accuracy and presentation of search results
Extract geographical information
Extract acronyms and their meanings
Identify pages in different parts of the intranet that are about the same topic
Gumshoe (IBM) [Zhu07,Li06]

Enterprise Data as a Service
Extract and clean useful information hidden in publicly available documents
Rent the extracted information over the Internet
DBLife [1]
Midas (IBM) (Demo today!)
...
<issuer>
  <issuerCik>0000070858</issuerCik>
  <issuerName>BANK OF AMERICA CORP /DE/</issuerName>
  <issuerTradingSymbol>BAC</issuerTradingSymbol>
</issuer>
<reportingOwner>
  <reportingOwnerId>
    <rptOwnerCik>0001090355</rptOwnerCik>
    <rptOwnerName>THAIN JOHN A</rptOwnerName>
  </reportingOwnerId>
  <reportingOwnerAddress>
    <rptOwnerStreet1>C/O GOLDMAN SACHS GROUP</rptOwnerStreet1>
    <rptOwnerStreet2>85 BROAD STREET</rptOwnerStreet2>
    <rptOwnerCity>NEW YORK</rptOwnerCity>
    ...
  </reportingOwnerAddress>
  <reportingOwnerRelationship>
    <isOfficer>1</isOfficer>
    <officerTitle>Pres Glbl Bkg Sec & Wlth Mgmt</officerTitle>
  </reportingOwnerRelationship>
</reportingOwner>
...

Business Intelligence
[Diagram labels: Public Data (social networks, blogs, government data); Enterprise Data (emails, call center records, legacy data); Information Extraction; Data Warehouse; Traditional BI Tools; New BI Tools]
Important applications
  – Marketing: Customer sentiment, brand management
  – Legal: Electronic legal discovery, identifying product pipeline problems
  – Strategy: Important economic events, monitoring competitors
IBM eDiscovery Analyzer

Data-Driven Mashups
Extract structured information from unstructured feeds
Join extracted information with other structured enterprise data
IBM Lotus Notes Live Text
IBM InfoSphere MashupHub [Simmen09]

Enterprise Information Extraction
IE has become increasingly important to emerging
enterprise applications
Set of requirements driven by enterprise apps that use information extraction
  – Scalability
      • Large data volumes, often orders of magnitude larger than classical NLP corpora
  – Accuracy
      • Garbage-in garbage-out: Usefulness of application is often tied to quality of extraction
  – Usability
      • Building an accurate IE system is labor-intensive
      • Professional programmers are much more expensive than grad students!

A Canonical IE System
[Diagram: Text -> Feature Selection -> Features -> Entity Identification -> Entities and Relationships -> Entity Resolution -> Structured Information]
Boundaries between these stages are not clear-cut
This diagram shows a simplified logical data flow
  – Traditionally, physical data flow the same as logical
  – But the systems we’ll talk about take a very different approach to the actual order of execution

Feature Selection
Identify features
  – Very simple, “atomic” entities
  – Inputs for other stages
Examples of features
  – Dictionary match
  – Regular expression match
  – Part of speech
Typical components used
  – Off-the-shelf morphology package
  – Many simple rules
Very time-consuming and underappreciated

Entity Identification
Use basic features to build more complex features
  – Example: …was done by Mr.
Jack Gurbingal at the…
      Dictionary match: Common first name
      + Regular expr match: Capitalized word
      = Complex feature: Potential person name
Use other features to determine which of the complex features are instances of entities and relationships
Most information extraction research focuses on this stage
  – Variety of different techniques

Entity Resolution
Perform complex analyses over entities and relationships
Examples
  – Identify entities that refer to the same person or thing
  – Join extracted information with external structured data
Not the main focus of this tutorial
  – But interacts with other parts of information extraction

Obligatory Person-Phone Example
Call John Merker at 555-1212. John also has a cell #: 555-1234

Person-Phone Example: Input
[Diagram: Text -> Feature Selection -> Features -> Entity Identification -> Entities, Rels. -> Entity Resolution -> Structured Information]
Call John Merker at 555-1212. John also has a cell #: 555-1234

Person-Phone Example: Features
Call John Merker at 555-1212. John also has a cell #: 555-1234

Person-Phone Example: Entities and Relationships
(Entity Identification stage)
Call John Merker at 555-1212. John also has a cell #: 555-1234
  – First sentence: Person (John Merker), Phone (555-1212)
  – Second sentence: Person (John), NumType (cell), Phone (555-1234)

Person-Phone Example: Entities and Relationships
Call John Merker at 555-1212. John also has a cell #: 555-1234
  – First sentence: Person (John Merker), Phone (555-1212)
  – Second sentence: Person (John), NumType (cell), Phone (555-1234)
  – Same Person
  – Join with office phone directory

Road Map
What is Information Extraction?
Declarative Information Extraction (you are here)
What the Declarative Approach Enables
  – Scalable Infrastructure (Yunyao Li)
  – Development Support (Laura Chiticariu)