Enterprise Information Extraction

SIGMOD 2010 Tutorial Frederick Reiss, Yunyao Li, Laura Chiticariu, and Sriram Raghavan IBM Almaden Research Center

© 2009 IBM Corporation Who we are

Researchers from the Search and Analytics group at IBM Almaden Research Center
– Frederick Reiss
– Yunyao Li
– Laura Chiticariu
– Sriram Raghavan (virtual)
Working on information extraction since 2006-08
– SystemT project
– Code shipping with 8 IBM products

Road Map

What is Information Extraction? (Fred Reiss) ← You are here
Declarative Information Extraction (Fred Reiss)
What the Declarative Approach Enables
– Scalable Infrastructure (Yunyao Li)
– Development Support (Laura Chiticariu)
Conclusion / Q&A (Fred Reiss)

Obligatory “What is Information Extraction?” Slide

Distill structured data from unstructured and semi-structured text
Exploit the extracted data in your applications

For years, Microsoft Corporation CEO Bill Gates was against open source. But today he appears to have changed his mind. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."

Richard Stallman, founder of the Free Software Foundation, countered saying…

Annotations:

Name              Title     Organization
Bill Gates        CEO       Microsoft
Bill Veghte       VP        Microsoft
Richard Stallman  Founder   Free Soft..

(from Cohen’s IE tutorial, 2003)

Bibliography at the end of the slide deck.

SIGMOD 2006 Tutorial [Doan06] in One Slide

Information extraction has been an area of study in Natural Language Processing and AI for years
Core ideas from database research not a part of existing work in this area
– Declarative languages
– Well-defined semantics
– Cost-based optimization
The challenge: Can we build a “System R” for information extraction?
Survey of early-stage projects attacking this problem

What’s new?

New enterprise-focused applications…
…driving new requirements…
…leading to declarative approaches

Enterprise Applications of Information Extraction

Previous tutorial showed research prototypes
– Avatar: Semantic search on personal emails
– DBLife: Use IE to build a knowledge base about database researchers
– AliBaba: IE over medical research papers
Since then, IE has gone mainstream
– Enterprise Semantic Search
– Enterprise Data as a Service
– Business Intelligence
– Data-driven Enterprise Mashups

Enterprise Semantic Search

Use information extraction to improve accuracy and presentation of search results

Extract geographical information

Extract acronyms and their meanings

Gumshoe (IBM) [Zhu07,Li06]

Identify pages in different parts of the intranet that are about the same topic

Enterprise Data as a Service

Extract and clean useful information hidden in publicly available documents
Rent the extracted information over the Internet

DBLife [1]

Midas (IBM) (Demo today!)

[Figure: sample fields from a source document: 0000070858, BANK OF AMERICA CORP /DE/, BAC; 0001090355, THAIN JOHN A, C/O GOLDMAN SACHS GROUP, 85 BROAD STREET, NEW YORK; Pres Glbl Bkg Sec & Wlth Mgmt]

Business Intelligence

[Figure: Public Data and Enterprise Data (social networks, blogs, government data, emails, call center records, legacy data); Information Extraction; Data Warehouse; Traditional BI Tools; New BI Tools]

Important applications
– Marketing: Customer sentiment, brand management
– Legal: Electronic legal discovery, identifying product pipeline problems
– Strategy: Important economic events, monitoring competitors

IBM eDiscovery Analyzer

Data-Driven Mashups

Extract structured information from unstructured feeds
Join extracted information with other structured enterprise data

IBM Lotus Notes Live Text

IBM InfoSphere MashupHub [Simmen09]

Enterprise Information Extraction

IE has become increasingly important to emerging enterprise applications
Set of requirements driven by enterprise apps that use information extraction
– Scalability
  • Large data volumes, often orders of magnitude larger than classical NLP corpora
– Accuracy
  • Garbage-in garbage-out: Usefulness of application is often tied to quality of extraction
– Usability
  • Building an accurate IE system is labor-intensive
  • Professional programmers are much more expensive than grad students!

A Canonical IE System

[Figure: Text → Feature Selection → Features → Entity Identification → Entities and Relationships → Entity Resolution → Structured Information]

Boundaries between these stages are not clear-cut
This diagram shows a simplified logical data flow
– Traditionally, the physical data flow is the same as the logical one
– But the systems we’ll talk about take a very different approach to the actual order of execution

Feature Selection

Identify features
– Very simple, “atomic” entities
– Inputs for other stages
Examples of features
– Dictionary match
– Regular expression match
– Part of speech
Typical components used
– Off-the-shelf morphology package
– Many simple rules

Very time-consuming and underappreciated

Entity Identification

Use basic features to build more complex features
– Example:

…was done by Mr. Jack Gurbingal at the…

Dictionary match: Common first name
Regular expr match: Capitalized word
Complex feature: Common first name + Capitalized word = Potential person name

Use other features to determine which of the complex features are instances of entities and relationships
Most information extraction research focuses on this stage
– Variety of different techniques
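The "common first name + capitalized word" combination above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary contents, the regular expression, and the function names are hypothetical and are not taken from any system described in this tutorial.

```python
import re

# Hypothetical dictionary of common first names (illustrative only).
COMMON_FIRST_NAMES = {"Jack", "John", "Laura"}

# Basic feature: a capitalized word.
CAPS_WORD = re.compile(r"\b[A-Z][a-z]+\b")
WORD = re.compile(r"\b\w+\b")


def dictionary_matches(text):
    """Basic feature: spans of tokens that appear in the first-name dictionary."""
    return [(m.start(), m.end()) for m in WORD.finditer(text)
            if m.group() in COMMON_FIRST_NAMES]


def caps_word_matches(text):
    """Basic feature: spans of capitalized words."""
    return [(m.start(), m.end()) for m in CAPS_WORD.finditer(text)]


def potential_person_names(text):
    """Complex feature: a first name followed, after whitespace only, by a capitalized word."""
    caps = caps_word_matches(text)
    names = []
    for first_start, first_end in dictionary_matches(text):
        for caps_start, caps_end in caps:
            if caps_start > first_end and text[first_end:caps_start].isspace():
                names.append(text[first_start:caps_end])
    return names


print(potential_person_names("…was done by Mr. Jack Gurbingal at the…"))
```

A real system would then use further features (for example, surrounding context) to decide which of these candidates are actual person entities.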

Entity Resolution

Perform complex analyses over entities and relationships
Examples
– Identify entities that refer to the same person or thing
– Join extracted information with external structured data

Not the main focus of this tutorial
– But interacts with other parts of information extraction

Obligatory Person-Phone Example

Call John Merker at 555-1212. John also has a cell #: 555-1234

Person-Phone Example: Input

[Figure: Text → Feature Selection → Features → Entity Identification → Entities, Rels. → Entity Resolution → Structured Information]

Call John Merker at 555-1212. John also has a cell #: 555-1234

Person-Phone Example: Features

Call John Merker at 555-1212. John also has a cell #: 555-1234

Person-Phone Example: Entities and Relationships

Call John Merker at 555-1212. John also has a cell #: 555-1234
Annotations: Person, Phone (first sentence); Person, NumType, Phone (second sentence)

Person-Phone Example: Entities and Relationships

Call John Merker at 555-1212. John also has a cell #: 555-1234
Annotations: Person, Phone (first sentence); Person, NumType, Phone (second sentence)
– Same Person
– Join with office phone directory
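The three stages of the person-phone example can be sketched end to end in Python. This is a toy sketch, not the method of any system in this tutorial: the regular expressions, the `max_gap` threshold, and the prefix-based "same person" heuristic are all assumptions made for illustration.

```python
import re

# Feature selection (hypothetical patterns): phone numbers and person mentions.
PHONE = re.compile(r"\b\d{3}-\d{4}\b")
PERSON = re.compile(r"\bJohn(?: [A-Z][a-z]+)?\b")


def spans(pattern, text):
    """Return (start, end, matched text) for every match of the pattern."""
    return [(m.start(), m.end(), m.group()) for m in pattern.finditer(text)]


def person_phone(text, max_gap=30):
    """Entity identification: pair each phone with the closest preceding person mention."""
    persons = spans(PERSON, text)
    pairs = []
    for phone_start, _, number in spans(PHONE, text):
        before = [p for p in persons
                  if p[1] <= phone_start and phone_start - p[1] <= max_gap]
        if before:
            pairs.append((before[-1][2], number))
    return pairs


def resolve(pairs):
    """Entity resolution (naive): map a short mention to the longest mention it prefixes."""
    names = sorted({person for person, _ in pairs}, key=len, reverse=True)
    canonical = {n: next(m for m in names if m.startswith(n)) for n in names}
    return [(canonical[person], number) for person, number in pairs]


text = "Call John Merker at 555-1212. John also has a cell #: 555-1234"
pairs = person_phone(text)
print(pairs)
print(resolve(pairs))
```

The last step stands in for the "Same Person" link on the slide; joining with an office phone directory would be an ordinary lookup on the resolved name.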

Road Map

What is Information Extraction?
Declarative Information Extraction ← You are here
What the Declarative Approach Enables
– Scalable Infrastructure (Yunyao Li)
– Development Support (Laura Chiticariu)
Conclusion / Q&A (Fred Reiss)

Declarative Information Extraction

Overview of traditional approaches to information extraction
Practical issues with applying traditional approaches
How recent work has used declarative approaches to address these issues
Different types of declarative approaches

Traditional Approaches to Information Extraction

Two dominant types:
– Rule-Based
– Machine Learning-Based
Distinction is based on how Entity Identification is performed

[Figure: Text → Feature Selection → Features → Entity Identification → Entities and Relationships → Entity Resolution → Structured Information]

Anatomy of a Rule-Based System

[Figure: Example Documents; Feature Selection Rules; Entity Identification Rules; pipeline Text → Feature Selection → Features → Entity Identification → Entities, Rels. → Entity Resolution → Structured Information]

Anatomy of a Machine Learning-Based System

[Figure: Example Documents; Labeled Documents; Feature Selection; Features and Labels; Training; Feature Selection Rules; Model; pipeline Text → Feature Selection → Features → Entity Identification → Entities, Rels. → Entity Resolution → Structured Information]

A Brief History of IE in the NLP Community

Rule-Based
1978-1997: MUC (Message Understanding Conference), DARPA competition 1987 to 1997
– FRUMP [DeJong82]
– FASTUS [Appelt93]
– TextPro, PROTEUS
1998: Common Pattern Specification Language (CPSL) standard [Appelt98]
– Standard for subsequent rule-based systems
1999-2010: Commercial products, GATE

Machine Learning
At first: Simple techniques like
– Naive Bayes
1990’s: Learning Rules
– AUTOSLOG [Riloff93]
– CRYSTAL [Soderland98]
– SRV [Freitag98]
2000’s: More specialized models
– Hidden Markov Models [Leek97]
– Maximum Entropy Markov Models [McCallum00]
– Conditional Random Fields [Lafferty01]
– Automatic feature expansion

For further reading: Sunita Sarawagi’s Survey [Sarawagi08], Claire Cardie’s Survey [Cardie97]

Tying the System Together: Traditional IE Frameworks

Traditional approach: Workflow system
– Sequence of discrete steps
– Data only flows forward
GATE (footnote 1) and UIMA (footnote 2) are the most popular frameworks
– Type systems and standard data formats
Web services and Hadoop also in common use
– No standard data format

[Figure: Workflow for the ANNIE system [Cunningham09]]

1. GATE (General Architecture for Text Engineering) official web site: http://gate.ac.uk/
2. Apache UIMA (Unstructured Information Management Architecture) official web site: http://uima.apache.org/

Sequential Execution in CPSL Rules

[Figure: a cascade of CPSL rules applied to a sample document, one level at a time]
Level 0 (Feature Selection)
Level 1: 〈Digits〉 〈Token〉[~ “-”] 〈Digits〉 → 〈Phone〉
         〈FirstName〉 〈CapsWord〉 → 〈Person〉
Level 2: 〈Person〉 〈Token〉[~ “at”] 〈Phone〉 → 〈PersonPhone〉

Problems with Traditional IE Approaches

Complex, fixed pipelines and rule sets
Semantics tied to order of execution

Scalability: Data only flows forward, leading to wasted work in early stages.
Accuracy: Lots of custom procedural code.
Usability: Hard to understand why the system produces a particular result.

Declarative to the Rescue!

Define the logical constraints between rules/components
System determines order of execution

Scalability: Optimizer avoids wasted work
Accuracy: More expressive rule languages; combine different tools easily
Usability: Describe what to extract, instead of how to extract it

What do we mean by “declarative”?

Common vision:
– Separate semantics from order of execution
– Build the system around a language like SQL or Datalog
Different systems have different interpretations
Three main categories
– High-Level Declarative
  • Most common approach
– Completely Declarative
– Mixed Declarative

High-Level Declarative

Replace the overall IE framework with a declarative language
Each individual extraction component is still a “black box”
Example 1: SQoUT [Jain08]

[Figure: SQL query; Optimizer; Catalog of Extraction Modules. Query plan combines extraction modules with scan and index access to data.]

Example 2: PSOX [Bohannon08]
Advantages:
– Allows use of many existing “black box” packages
– High-level performance optimizations possible
– Clear semantics for using different packages for the same task
Drawbacks:
– Doesn’t address issues that occur within a given “black box”
– Limited opportunities for optimization, unless “black boxes” can provide hints

Completely Declarative

One declarative language covers all stages of extraction
Example 1: AQL language in SystemT [Chiticariu10]

Feature Selection:

    -- Find all matches
    -- of a dictionary
    create view Name as
    extract dictionary CommonFirstName
    on D.text as name
    from Document D;

Entity Identification:

    -- Match people with their
    -- phone numbers
    create view PersonPhone as
    select P.name as person, N.num as phone
    from Person P, PhoneNum N
    where …

Entity Resolution:

    -- Find pairs of references
    -- to the same person
    create view SamePerson as
    select P1.name as name1, P2.name as name2
    from Person P1, Person P2
    where …

[Figure: Text → Feature Selection → Features → Entity Identification → Entities, Rels. → Entity Resolution → Structured Information]

Sequential Execution in CPSL Rules

[Figure: the cascade of CPSL rules, Level 0 (Feature Selection) through Level 2, repeated from the earlier slide of the same title]

Declarative Semantics Example: Identifying Musician-Instrument Relationships

Dictionary (pipe | guitar | hammond organ |…) → Instrument
(Person Annotator) → Person
〈Person〉 〈0-5 tokens〉 〈Instrument〉 → PersonPlaysInstrument

Input: John Pipe plays the guitar

Labelings shown in the figure:
〈Person〉 〈Person〉 〈Token〉 〈Token〉 〈Instrument〉
〈Person〉 〈Instrument〉 〈Token〉 〈Token〉 〈Instrument〉
〈Person〉 〈Token〉 〈Token〉 〈Instrument〉 (annotations: Person, Instrument)

Completely Declarative

One declarative language covers all stages of extraction
Example 1: AQL language in SystemT [Chiticariu10]
Example 2: Conditional Random Fields in SQL [Wang10]
Advantages:
– Unified language with clear semantics from top to bottom
– Optimizer has full control over low-level operations
– Can incorporate existing packages using user-defined functions
Drawbacks:
– Code inside UDFs doesn’t benefit from declarativeness

Mixed Declarative

Language provides declarativeness at the level of some, but not all, of the extraction operations, both at the individual and pipeline level
Example: Xlog (CIMPLE) [Shen07]

[Figure: Extraction program for talk extracts, from [1]. Callouts: “This Datalog predicate represents a large, opaque block of extraction code.” and “This predicate is defined in Datalog, using low-level operations.”]

Advantages:
– Ability to reuse existing “black box” packages
– Optimizer gets some flexibility to reorder low-level operations
Drawbacks:
– Challenging to build an optimizer that does both “high-level” and “low-level” optimizations

Declarative to the Rescue!

Different notions of declarativeness in different systems
All kinds address the major issues in enterprise IE, but in different ways

Scalability: Optimizer avoids wasted work
Accuracy: More expressive rule languages; combine different tools easily
Usability: Describe what to extract, instead of how to extract it

Road Map

What is Information Extraction? (Fred Reiss)
Declarative Information Extraction (Fred Reiss)
What the Declarative Approach Enables
– Scalable Infrastructure (Yunyao Li) ← You are here
– Development Support (Laura Chiticariu)
Conclusion/Questions

Scalable Infrastructure

Yunyao Li IBM Almaden Research Center

Conventional vs. Declarative IE Infrastructure

Conventional:
– Operational semantics and implementation are hard-coded and interconnected
Declarative:
– Separate semantics from implementation.
– Database-style design: Optimizer + Runtime

[Figure: Conventional: Extraction Pipeline, Runtime Environment. Declarative: Declarative Language, Optimizer, Plan, Runtime Environment]

Why Declarative IE for Scalability

An informal experimental study [Reiss08]
– Collection of 4.5 million web logs
– Band Review Annotator: identify informal reviews of concerts

[Figure: chart comparing the CPSL-based implementation with the declarative implementation; callout: 20x faster]

Different Aspects of Design for Scalability

Optimization
– Granularity
  • High-level: annotator composition
  • Low-level: basic extraction operators
– Strategy:
  • Rewrite-based
  • Cost-based
Runtime Model
– Document-Centric vs. Collection-Centric

Optimization Granularity for Declarative IE

Annotator Composition
– Each annotator extracts one or more entities or relationships
  • E.g. Person annotator
– Black box assumption on how an annotator works
– Optimizing composition of extraction pipeline

Basic Extraction Operator
– Each operator represents an atomic extraction operation
  • E.g. dictionary matching, regular expression, join,…
– System is fully aware of how each extraction operator works
– Optimizing each basic extraction operator

[Figure: spectrum from annotator composition to basic extraction operator: High-level declarative, Mixed declarative, Completely declarative]

Optimization Strategies for Declarative IE

Rewrite-based
– Applying rewrite rules to transform the declarative form of the annotators into an equivalent form that is more efficient

Cost-Based
– Enumerating all possible physical execution plans, estimating their cost, and choosing the one with the minimum expected cost

Systems may mix these two approaches
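The cost-based strategy (enumerate plans, estimate cost, pick the minimum) can be illustrated with a deliberately small Python sketch. The cost model here is an assumption made for the example: each operator acts as a filter with a per-document cost and a selectivity, and a plan is simply an ordering of the operators. The operator names and statistics are made up; real optimizers use much richer plan spaces and statistics.

```python
from itertools import permutations


def plan_cost(order, cost_per_doc, selectivity, n_docs):
    """Estimated cost of running the operators in this order.

    Each operator only processes the documents that survived the earlier ones.
    """
    total, surviving = 0.0, float(n_docs)
    for op in order:
        total += surviving * cost_per_doc[op]
        surviving *= selectivity[op]
    return total


def cheapest_plan(ops, cost_per_doc, selectivity, n_docs):
    """Enumerate every ordering and return the one with the lowest estimated cost."""
    return min(permutations(ops),
               key=lambda order: plan_cost(order, cost_per_doc, selectivity, n_docs))


# Hypothetical statistics for three extraction operators.
cost_per_doc = {"regex": 5.0, "dictionary": 1.0, "pos_tagger": 20.0}
selectivity = {"regex": 0.1, "dictionary": 0.5, "pos_tagger": 0.9}

best = cheapest_plan(cost_per_doc.keys(), cost_per_doc, selectivity, n_docs=1000)
print(best, plan_cost(best, cost_per_doc, selectivity, 1000))
```

A rewrite-based optimizer would instead apply a fixed rule (for example, "always run the cheapest filter first") without estimating the cost of alternatives.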

Runtime Model for Declarative IE

Document-Centric
[Figure: Input Document Stream → Runtime Environment → Annotated Document Stream]

Collection-Centric
[Figure: Document Collection, Auxiliary index and Annotations → Runtime Environment → Annotations]

Systems

– CIMPLE
– RAD
– SQoUT
– SystemT
– BayesStore

Cimple

Rewrite-based optimization [Shen07]
– Inverted-index based simple pattern matching
  • Shared document scan

Simple patterns:
P1 = “(Jeff|Jeffery)\s\s*Ullman”
P2 = “(Jeff|Jeffery)\s\s*Naughton”
P3 = “Laura\s\s*Haas”
P4 = “Peter\s\s*Haas”

Inverted index (term → patterns):
Ullman → P1
Naughton → P2
Laura\s → P3
Peter\s → P4
Haas → P3, P4

[Figure: parse trees of patterns P1 to P4, built from AND, OR and * nodes]

Cimple

Pushing down text properties [Shen07]
– Eg: To find an all-capitalized line
  • Plan a: σallcaps(x) applied to lines(d,x,n)
  • Plan b: σallcaps(x) applied to lines(d,x,n), after first filtering documents with σcontainCaps(d)

Scoping [Shen07]
– Imposing location conditions on where to extract spans
  • Eg: Check for names only within two lines of the occurrence of titles

Incorporating cost-model to decide how to apply the rewrite.
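The all-capitalized-line example can be sketched in Python to show why the pushed-down filter is safe: a document with no capital letter at all cannot contain an all-capitalized line, so it can be skipped before it is split into lines. The function names mirror the operators on the slide but are otherwise hypothetical; this is not Cimple's implementation.

```python
def lines(doc):
    """Split a document into lines (the lines(d, x, n) operator, minus line numbers)."""
    return doc.splitlines()


def all_caps(line):
    """True if the line has at least one letter and every letter is uppercase."""
    letters = [c for c in line if c.isalpha()]
    return bool(letters) and all(c.isupper() for c in letters)


def contains_caps(doc):
    """Cheap document-level property: does the document contain any uppercase letter?"""
    return any(c.isupper() for c in doc)


def plan_a(docs):
    """Split every document into lines, then select the all-caps lines."""
    return [x for d in docs for x in lines(d) if all_caps(x)]


def plan_b(docs):
    """Same result, but documents without any capital letter are skipped up front."""
    return [x for d in docs if contains_caps(d) for x in lines(d) if all_caps(x)]


docs = ["TITLE LINE\nbody text here", "all lowercase\nstill lowercase", "Mixed Case\nEND"]
print(plan_a(docs))
print(plan_b(docs))
```

Whether Plan b is actually faster depends on how many documents the cheap filter eliminates, which is where a cost model comes in.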

Cimple

Collection-centric runtime model
– Document collection (or snapshots of document collection)
– Previous extraction results

Reusing previous extraction results [Chen08][Chen09]
• Similar to maintaining materialized views
• Cyclex: IE program viewed as one big blackbox [Chen08]
• Delex: IE program viewed as a workflow of blackboxes [Chen09]

RAD [Khaitan09]

Query language: a declarative subset of CPSL specification
– Regular expressions over features and existing annotations

Offline process:
– Sentence chunking, tokenization
– Generating indexed features
  • Dictionary lookup (Eg. First name)
  • Part of speech lookup (Eg. Noun, verb)
  • Regular expression on tokens (E.g. CapsWord, Alphanum)
– Generating derived entities over the index using series of join operators (E.g. Person, Organization)

[Figure: Document Collection → offline process → Inverted index + Annotations; Query → Optimizer → Inverted index]

RAD

Cost-based Optimization based on Posting-list Statistics
• E.g. matching ANYWORD @ ANYWORD . com for Email

[Figure: two alternative plans (Plan a, Plan b) that combine the posting lists for ANYWORD, @, ANYWORD, “.” and “com” through a sequence of zig-zag joins over the inverted index, with intermediate results R1 to R4]

RAD

Rewrite-based Optimization
– Share sub-expression evaluation
  • Evaluate the same sub-expression only once

SQoUT [Ipeirotis07][Jain07,08,09]

Focus on composition of extraction systems

[Figure: SQL Query (entities/relations to extract); Extraction System Repository; for each Extraction System E0 … Em, a Retrieval Strategy over the Document Collection; Extraction results; Data Cleaning; Extracted View; Query results]

SQoUT

Cost-based Query Optimization
New Plan Enumeration Strategies
– Document retrieval strategies
  • Eg: filtered scan: running the annotator only over potentially relevant docs
– Join execution
  • Independent join, outer/inner join, zig-zag join: extraction results of one relation can determine the docs retrieved for another relation.
Efficiency vs. Quality Cost Model

[Figure: cost model terms: Weight, Goodness, Quality, Efficiency]

SystemT [Reiss08] [Krishnamurthy08] [Chiticariu10]

[Figure: Rules → Preprocessor → Blocks → Planner (Plan Enumerator, Cost Model) → Block Plans → Postprocessor → Final Plan]

Preprocessor
• Divide rules into compilation blocks.
• Rewrite-based optimization within each block

Planner
• System R style cost-based optimization within each block.

Postprocessor
• Merge block plans into a single operator graph.
• Rewrite-based optimization across blocks.

Example: Restricted Span Evaluation (RSE)

Leverage the sequential nature of text
– Join predicates on character or token distance
Only evaluate the inner on the relevant portions of the document
Limited applicability
– Need to guarantee exact same results

[Figure: RSEJoin over a Regex input (555-1212) and a Dictionary input (John Smith) on the text “…John Smith at 555-1212…”. Callout: Only look for dictionary matches in the vicinity of a phone number.]
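The idea behind restricted span evaluation can be sketched in Python: instead of running the dictionary over the whole document and then joining, run it only over the window of text just before each phone match. This is a sketch under assumptions, not SystemT's operator: the dictionary, the phone pattern, the character-distance predicate, and the `window` size are all hypothetical.

```python
import re

PHONE = re.compile(r"\b\d{3}-\d{4}\b")
WORD = re.compile(r"\b\w+\b")
FIRST_NAMES = {"John", "Laura"}  # hypothetical dictionary


def dictionary_matches(text, start=0, end=None):
    """Spans of dictionary entries, scanning only text[start:end]."""
    end = len(text) if end is None else end
    return [(m.start(), m.end()) for m in WORD.finditer(text, start, end)
            if m.group() in FIRST_NAMES]


def naive_join(text, window=20):
    """Run the dictionary over the whole document, then apply the distance predicate."""
    names = dictionary_matches(text)
    return [(text[s:e], m.group()) for m in PHONE.finditer(text)
            for s, e in names
            if e <= m.start() and m.start() - s <= window]


def rse_join(text, window=20):
    """Run the dictionary only on the window of text preceding each phone match."""
    out = []
    for m in PHONE.finditer(text):
        lo = max(0, m.start() - window)
        for s, e in dictionary_matches(text, lo, m.start()):
            out.append((text[s:e], m.group()))
    return out


text = "Call John Smith at 555-1212. Laura is out."
print(naive_join(text))
print(rse_join(text))
```

Both functions return the same pairs on this input; the restricted version simply never looks at "Laura", which lies outside every window. The "limited applicability" caveat on the slide corresponds to having to prove that this equivalence holds for the join predicate in question.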

Example: Shared Dictionary Matching (SDM)

Rewrite-based optimization
– Applied to the algebraic plan during postprocessing
Evaluate multiple dictionaries in a single pass

[Figure: two Dictionary operators (Dict D1, Dict D2), each feeding its own subplan, are rewritten into a single SDM Dictionary Operator (SDMDict over D1 and D2) feeding both subplans]
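A minimal Python sketch of the single-pass idea: merge all dictionaries into one lookup table that records which dictionaries each entry belongs to, then scan the tokens once. The function names and the single-token dictionaries are assumptions for illustration and do not reflect how SystemT implements its SDM operator.

```python
def match_separately(tokens, dictionaries):
    """Baseline: one pass over the tokens per dictionary."""
    return {name: [i for i, tok in enumerate(tokens) if tok in entries]
            for name, entries in dictionaries.items()}


def match_shared(tokens, dictionaries):
    """Shared matching: one pass over the tokens for all dictionaries."""
    # Merged index: entry -> names of the dictionaries that contain it.
    merged = {}
    for name, entries in dictionaries.items():
        for entry in entries:
            merged.setdefault(entry, []).append(name)

    result = {name: [] for name in dictionaries}
    for i, tok in enumerate(tokens):
        for name in merged.get(tok, ()):
            result[name].append(i)
    return result


# Hypothetical dictionaries D1 and D2.
dictionaries = {"D1": {"John", "Laura"}, "D2": {"guitar", "pipe", "John"}}
tokens = "John Pipe plays the guitar with Laura".split()

print(match_separately(tokens, dictionaries))
print(match_shared(tokens, dictionaries))
```

Both functions return token positions per dictionary, so downstream subplans see the same input either way.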

SystemT

Document-centric Runtime Model:
– One document at a time
– Entities extracted are associated with their source document

[Figure: Input Document Stream → Runtime Environment → Annotated Document Stream]

Why one document at a time?

Scaling SystemT: From Laptop to Cluster

In Lotus Notes Live Text
[Figure: Lotus Notes Client: Email Message → Input Adapter → SystemT Runtime → Output Adapter → Display Annotated Email]

In Cognos Toro Text Analytics
[Figure: Hadoop Cluster: Documents → Jaql Runtime on Hadoop Map-Reduce, with several Jaql Function Wrappers, each containing Input Adapter → SystemT Runtime → Output Adapter]

BayesStore [Wang10]

Probabilistic declarative IE
– In-database machine learning for efficiency and scalability
Text Data and Conditional Random Fields (CRF) Model

[Figure: the document is stored as a Token table; the CRF model is stored as a Factor table]

BayesStore

Viterbi Inference SQL Implementation
– Implementing dynamic programming algorithm using recursive queries

Rewrite-based optimization.
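For readers unfamiliar with the dynamic program in question, here is a small in-memory Python sketch of Viterbi inference over additive scores. It is not BayesStore's recursive-SQL implementation; the label set, the scoring function, and its weights are invented for the example, standing in for the factor table of a trained CRF.

```python
def viterbi(tokens, labels, score):
    """Return (best score, best label sequence).

    score(prev_label, label, token) gives the additive score of assigning
    `label` to `token` after `prev_label` (None for the first token).
    """
    # best[label] = (score of the best path ending in label, that path)
    best = {y: (score(None, y, tokens[0]), [y]) for y in labels}
    for tok in tokens[1:]:
        new_best = {}
        for y in labels:
            prev, (s, path) = max(
                best.items(),
                key=lambda kv: kv[1][0] + score(kv[0], y, tok))
            new_best[y] = (s + score(prev, y, tok), path + [y])
        best = new_best
    return max(best.values(), key=lambda sp: sp[0])


FIRST_NAMES = {"John"}  # hypothetical dictionary feature


def toy_score(prev, label, token):
    """Made-up factor weights: emission features plus one transition feature."""
    s = 0.0
    if label == "O":
        s += 1.0
    if label == "PER":
        if token in FIRST_NAMES:
            s += 2.0
        if token.istitle():
            s += 0.5
        if prev == "PER":
            s += 0.75
    return s


print(viterbi(["call", "John", "Merker", "today"], ["O", "PER"], toy_score))
```

Each loop iteration depends only on the table from the previous token, which is the structure that makes the algorithm expressible as a recursive query over a token table and a factor table.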

Summary

[A table here shows design choices of the systems]
Columns:
– Optimization Granularity: Basic operator, Annotator composition
– Optimization Strategy: Rewrite-based, Cost-based
– Runtime Model: Document level, Collection level
Rows (systems): Cimple, RAD, SQoUT, SystemT, BayesStore

Road Map

What is Information Extraction? (Fred Reiss) Declarative Information Extraction (Fred Reiss) What the Declarative Approach Enables

– Scalable Infrastructure (Yunyao Li) – Development Support (Laura Chiticariu) [You are here]

Development Support (Tooling)

Laura Chiticariu IBM Almaden Research Center

Declarative to the Rescue!

Define the logical constraints between rules/components
System determines order of execution

– Scalability: Optimizer avoids wasted work
– Accuracy: More expressive rule languages; combine different tools easily
– Usability: Describe what to extract, instead of how to extract it
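As a small illustration of "system determines order of execution": if the developer only declares which component reads which, a valid execution order can be derived by topological sorting. The sketch below uses Python's standard `graphlib` module (Python 3.9+). The view names are illustrative, and this is not how any particular IE system schedules its rules.

```python
from graphlib import TopologicalSorter

# Hypothetical rule set: each view lists the views it reads from.
DEPENDS_ON = {
    "Person": {"FirstName", "CapsPerson"},
    "FirstName": {"Doc"},
    "CapsPerson": {"Doc"},
}


def execution_order(depends_on):
    """Return one order in which every view runs after all of its inputs."""
    return list(TopologicalSorter(depends_on).static_order())
```

The developer never states that `Doc` runs first or `Person` last; that follows from the declared constraints, and the system is free to choose among the valid orders.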

A Canonical IE System

[Figure: Text → Feature Selection → Features → Entity Identification → Entities and Relationships → Entity Resolution → Structured Information]

Developing IE systems is an extremely time-consuming, error-prone process

The Life Cycle of an IE System

[Figure: two cycles.
Development: the Developer cycles through Develop (1. Features, 2. Rules / labeled data), Test, Analyze.
Usage / Maintenance: the User cycles through Use, Test, Refine.]

Example 1: Explaining Extraction Results

[Figure: SystemT’s Person extractor, ~250 AQL rules, shown as a wall of overlapping AQL source. Legible fragments include dictionary definitions (FilterPersonDict, GreetingsDict, InitialDict, PersonSuffixDict, StrongPhoneVariantDictionary) and views (CapsPerson, StrictFirstName, StrictLastName, FirstName, LastName, Person1 through Person6, PersonStrongSingleToken, PersonBase). Overlaid on the rules: the example text “Global financial services firm Morgan Stanley announced … ” and the label Person.]
player], D.match); 'ds', 'di' select CombineSpans(IW.word, P.name) as person union all -- If we can have large negative dictionary to from InitialWord IW, (select P.person as person from eliminate--'Dear' such(Yunyao: mismatches, comments out to avoid create view NameDict4 as mismatches such as Dear Member), CapsPerson CP, PersonWeak2WithNewLine P); -- then this may be recovered select D.match as name PersonDict P where FollowsTok(IW.word, CP.name, 0, 0) and FollowsTok(CP.name, P.name, 0, 0); output view PersonBase; 81 © 2009 IBM Corporation Example 2: Correcting Extraction Results

[Figure: named-entity annotations from the CoNLL 2003 Named Entity Recognition Competition, listed as character offsets, entity types (PER, ORG, LOC), and text, e.g. Jimi Hendrix (PER), European Commission (ORG), Libya (LOC). Training data: ~150 of 20,000 labels.]

Development Support in Conventional IE Systems

Grammar-based Systems – GATE’s CPSL debugger, AnnotationDiff/Benchmark Tools [Cunningham02]

Machine Learning Systems – Active learning of training examples [Thompson99] – Interfaces to verify/correct extraction results [Kristjansson04]

Lack of transparency makes it difficult for humans to understand/debug/refine the system

[Das Sarma et al. SIGMOD 2010. I4E: Interactive Investigation of Iterative IE]

Different Aspects of Tooling for Declarative IE

1. Explaining: Trace Provenance
– Type: Answers vs. Non-answers
– Granularity: Coarse-grained vs. Fine-grained

2. Refining: Leverage Provenance + User Feedback
– Scope:
• Feedback on input → forward propagation
• Feedback on output → backward propagation

3. Using: Novel User Interfaces
– Enable building/maintaining
– Enable collaboration

Provenance in DB and Scientific Communities

Explains output data in terms of the input data, the intermediate data, and the transformation
– Transformation = query, ETL, workflow
– Recent surveys: [Davidson08] [Cheney09]

Approaches for computing provenance:
– Eager (Q is rewritten into Q’): the transformation is changed to carry over additional information; provenance is readily available in the output
– Lazy (Q is left as is): no changes to the transformation; requires knowledge of Q’s semantics, or re-executing Q
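
The eager and lazy strategies can be contrasted on a toy transformation. The sketch below is only illustrative: the transformation `q`, its rewritten form `q_eager`, and the helper `lazy_provenance` are hypothetical names that do not come from any of the surveyed systems, and the lazy variant works only because we assume knowledge of Q's semantics (here, that `q` handles each record independently).

```python
from typing import List, Set, Tuple

Record = Tuple[int, str]  # (record id, text); the id scheme is an assumption


def has_digit(text: str) -> bool:
    return any(c.isdigit() for c in text)


def q(records: List[Record]) -> List[str]:
    """Original transformation Q: upper-case the records that contain a digit."""
    return [text.upper() for _, text in records if has_digit(text)]


def q_eager(records: List[Record]) -> List[Tuple[str, Set[int]]]:
    """Rewritten transformation Q': same outputs, each carrying its input ids."""
    return [(text.upper(), {rid}) for rid, text in records if has_digit(text)]


def lazy_provenance(records: List[Record], output: str) -> Set[int]:
    """Lazy: Q is untouched; re-execute it per record to find what yields `output`."""
    return {rid for rid, text in records if output in q([(rid, text)])}


records = [(1, "call 555-1212"), (2, "hello"), (3, "at 11:30")]
print(q_eager(records))                      # provenance is part of the output
print(lazy_provenance(records, "AT 11:30"))  # provenance recomputed on demand
```

The eager version pays the bookkeeping cost on every run, while the lazy version pays only when someone asks about a specific output.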

Types of Provenance for Declarative IE

Provenance of answers
– Why is this in the result?
– Useful to explain wrong results
– Helps in removing false positives
– Benefits from previous work on provenance

Provenance of non-answers
– Why is this not in the result?
– Useful to explain missing results
– Helps in removing false negatives

Granularity of Provenance for Declarative IE

Coarse grained
– Enables understanding of the IE system at a higher level
– Individual components remain “black boxes”
– More amenable to lazy computation

Fine grained
– Enables understanding the entire IE system down to basic operator level
– Can use either eager or lazy computation

[Figure: systems range from high-level declarative, through mixed declarative, to completely declarative.]

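Scope of User Feedback for Declarative IE

Feedback on the output of the IE program propagates backwards
– Goal is to refine the program

Feedback on intermediate results of the IE program propagates forward
– Goal is to repair specific problems with the intermediate data

[Figure: two diagrams of a program P with input I and output O. In the first, feedback on O leads to a refined program P’ with output O’. In the second, feedback on the intermediate data leads to corrected data I’ and output O’.]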

Goals of User Interfaces for Declarative IE

Build & Maintain IE Systems
– GUI builders
– Wiki interfaces

How to do it collaboratively?
– Share & Reuse
– Reconcile feedback

Tooling in Various Declarative IE Systems

SystemT CIMPLE PSOX

SystemT: Provenance of Answers

Fine-grained provenance

Example document (one Person span and two Phone spans):
    Call John now (555-1212). He’s busy after 11:30 AM.
– Person: John
– Phone: 555-1212, 11:30

Simple PersonPhone extractor:
    create view Person as …
    create view Phone as extract regex /(\d+\W)+\d+/ …
    create view PersonPhone as
      select P.name, Ph.number
      from Person P, Phone Ph
      where Follows(P.name, Ph.number, 0, 40);

PersonPhone
    name    number
    John    555-1212
    John    11:30
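
To see why the extractor returns both tuples, the same logic can be mimicked outside of AQL. This is a rough Python sketch, not SystemT code: `FIRST_NAMES` stands in for the Person dictionary, and `follows` is my approximation of the `Follows` predicate, assumed here to measure the character distance between the end of the first span and the start of the second.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int]  # (begin, end) character offsets

FIRST_NAMES = {"John"}                  # stand-in for a first-name dictionary
PHONE_RE = re.compile(r"(\d+\W)+\d+")   # the regex from the Phone view


def person_spans(text: str) -> List[Span]:
    """Dictionary match: every token that appears in FIRST_NAMES."""
    return [m.span() for m in re.finditer(r"\b\w+\b", text) if m.group() in FIRST_NAMES]


def phone_spans(text: str) -> List[Span]:
    """Regex match: every maximal match of the phone pattern."""
    return [m.span() for m in PHONE_RE.finditer(text)]


def follows(a: Span, b: Span, min_chars: int, max_chars: int) -> bool:
    """True if b begins between min_chars and max_chars after a ends."""
    return min_chars <= b[0] - a[1] <= max_chars


def person_phone(text: str) -> List[Tuple[str, str]]:
    """Join Person and Phone spans on the follows condition."""
    return [(text[p[0]:p[1]], text[n[0]:n[1]])
            for p in person_spans(text)
            for n in phone_spans(text)
            if follows(p, n, 0, 40)]


doc = "Call John now (555-1212). He's busy after 11:30 AM."
print(person_phone(doc))
```

Both "555-1212" and "11:30" satisfy the regex and start within 40 characters of "John", which is why the wrong pair shows up next to the right one.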

SystemT: Provenance of Answers

Eager approach: Rewrite AQL to maintain one-step derivation of each tuple
– Extends work on how-provenance for relational queries [Green07] [Glavic09]
– Core subset of AQL: SPJUD, Regex, Dictionary, Consolidate

Input spans carry IDs: John (Person, ID: 1), 555-1212 (Phone, ID: 2), 11:30 (Phone, ID: 3) in
    Call John now (555-1212). He’s busy after 11:30 AM.

Rewritten PersonPhone:
    create view PersonPhone as
      select GenerateID() as ID, P.name, Ph.number,
             P.id as nameProv, Ph.id as numberProv,
             ‘AND’ as how
      from Person P, Phone Ph
      where Follows(P.name, Ph.number, 0, 40);

PersonPhone with how-provenance
    name    number      how-provenance
    John    555-1212    1 ∧ 2
    John    11:30       1 ∧ 3
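
The effect of the rewrite can be imitated in a few lines of Python. This is a sketch under assumptions, not the SystemT rewrite itself: `next_id` plays the role of `GenerateID()`, the Person view is reduced to a regex for "John", and the join condition approximates `Follows` by character distance.

```python
import itertools
import re
from typing import Dict, List

doc = "Call John now (555-1212). He's busy after 11:30 AM."
next_id = itertools.count(1)   # plays the role of GenerateID()


def tuples(pattern: str, attr: str) -> List[Dict]:
    """Base extractor: one tuple per regex match, each tagged with a fresh id."""
    return [{"id": next(next_id), attr: m.group(), "begin": m.start(), "end": m.end()}
            for m in re.finditer(pattern, doc)]


person = tuples(r"\bJohn\b", "name")        # receives id 1
phone = tuples(r"(\d+\W)+\d+", "number")    # receives ids 2 and 3

# The join copies the ids of its inputs into the output tuple (one-step derivation).
person_phone = [
    {"id": next(next_id), "name": p["name"], "number": ph["number"],
     "nameProv": p["id"], "numberProv": ph["id"], "how": "AND"}
    for p in person for ph in phone
    if 0 <= ph["begin"] - p["end"] <= 40
]

for t in person_phone:
    print(t["name"], t["number"], f'{t["nameProv"]} {t["how"]} {t["numberProv"]}')
```

Each output tuple now names the exact Person and Phone tuples it was joined from, so a wrong result can be traced back one operator at a time.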

SystemT: Provenance of Answers

[Figure: provenance of the wrong result (John, 11:30). The PersonPhone tuple is produced by a Join with predicate Follows(name,phone,0,40) over John (Person, from a Dictionary operator on FirstNames.dict) and 11:30 (Phone, from a Regex operator with (\d+\W)+\d+), both extracted from Doc. The developer alternates between Test and Refine.]

Can we automate this?

SystemT: Automatic Rule Refinement using Provenance [Liu10]

Input: User feedback = labels on the output of AQL program P
Goal: Automatically refine P to remove false positives
Idea: cut any provenance link → wrong tuple disappears

[Figure: provenance of the wrong result (John, 11:30): Doc → Dictionary (FirstNames.dict) → Person: John; Doc → Regex (\d+\W)+\d+ → Phone: 11:30; both feed Join Follows(name,phone,0,40) → PersonPhone: (John, 11:30).]

SystemT: Automatic Rule Refinement using Provenance [Liu10]

Input: User feedback = labels on the output of AQL program P
Goal: Automatically refine P to remove false positives

1. High-level changes (HLC): What operator to modify?
– Leverages provenance
– HLC 1: Remove (John, 11:30) from output of PersonPhone
– HLC 2: Remove John from output of Person
– HLC 3: Remove 11:30 from output of Phone

[Figure: provenance graph of the wrong result (John, 11:30), with each HLC attached to the operator whose output it changes.]

SystemT: Automatic Rule Refinement using Provenance [Liu10]

Input: User feedback = labels on the output of AQL program P
Goal: Automatically refine P to remove false positives

1. High-level changes (HLC): What operator to modify?
2. Low-level changes (LLC): How to modify it?
– LLC 1: Change join predicate to Follows(name,phone,0,30)
– LLC 2: Remove ‘John’ from FirstNames.dict
– LLC 3: Change regex to (\d{3}\W)+\d+
3. Evaluate and rank LLCs

Leverage provenance

[Figure: provenance graph of the wrong result (John, 11:30), with each LLC attached to the operator it modifies: the Join, the Dictionary on FirstNames.dict, and the Regex (\d+\W)+\d+.]
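
Step 3 can be illustrated by applying each low-level change to a toy version of the extractor and scoring the result against the user's labels. This is a simplified sketch: the `run` function, the parameter names, and the scoring rule (labeled-correct tuples kept minus labeled-wrong tuples kept) are my own assumptions, not the ranking method of [Liu10].

```python
import re
from typing import Set, Tuple

doc = "Call John now (555-1212). He's busy after 11:30 AM."


def run(first_names: Set[str], phone_regex: str, max_gap: int) -> Set[Tuple[str, str]]:
    """Toy PersonPhone: dictionary match, regex match, then a Follows-style join."""
    persons = [m for m in re.finditer(r"\b\w+\b", doc) if m.group() in first_names]
    phones = list(re.finditer(phone_regex, doc))
    return {(p.group(), ph.group()) for p in persons for ph in phones
            if 0 <= ph.start() - p.end() <= max_gap}


base = dict(first_names={"John"}, phone_regex=r"(\d+\W)+\d+", max_gap=40)
candidates = {
    "LLC 1: Follows(name,phone,0,30)": {**base, "max_gap": 30},
    "LLC 2: remove 'John' from dictionary": {**base, "first_names": set()},
    "LLC 3: regex (\\d{3}\\W)+\\d+": {**base, "phone_regex": r"(\d{3}\W)+\d+"},
}
good, bad = {("John", "555-1212")}, {("John", "11:30")}   # user labels on the output


def score(result: Set[Tuple[str, str]]) -> int:
    """Labeled-correct tuples kept minus labeled-wrong tuples kept."""
    return len(result & good) - len(result & bad)


ranking = sorted(candidates, key=lambda name: score(run(**candidates[name])), reverse=True)
for name in ranking:
    print(score(run(**candidates[name])), name)
```

On this one document, LLC 1 and LLC 3 both remove the wrong tuple while keeping the correct one, whereas LLC 2 removes both; a real system would evaluate over many labeled documents before ranking.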

A Simple Phone Pattern

Blocks of digits separated by a non-word character:

R0 = (\d+\W)+\d+

– Identifies valid phone numbers (e.g. 555-1212, 800-865-1125)
– Produces invalid matches (e.g. 11:30, 10/19/2002, 1.25 …)
– Misses valid phone numbers (e.g. (800) 865-CARE)
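
The three behaviors are easy to reproduce. The snippet below is a small check using Python's `re` module (an assumption of convenience; the tutorial's regexes run inside AQL), with `R0` compiled exactly as written on the slide.

```python
import re

R0 = re.compile(r"(\d+\W)+\d+")

samples = ["555-1212", "800-865-1125",      # valid phone numbers
           "11:30", "10/19/2002", "1.25",   # not phone numbers
           "(800) 865-CARE"]                # valid phone number

for text in samples:
    verdict = "match" if R0.fullmatch(text) else "no match"
    print(f"{text!r}: {verdict}")
```

The time, the date, and the decimal all have the same "digits, separator, digits" shape as a phone number, so the pattern cannot tell them apart, while the vanity number fails because it contains letters and a parenthesis.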

Conventional Regex Writing Process for IE

[Figure: manual loop. Start with Regex 0, (\d+\W)+\d+; run it over sample documents; inspect the matches (800-865-1125, 555-1212, 11:30, 10/19/2002, 1.25, …); if the result is not good enough, edit the expression into Regex 1, e.g. (\d{3}\W)+\d+, and repeat; otherwise the current expression becomes Regex final.]

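Learning Regex final automatically [Li08]

[Figure: Regex 0 (R0) is run over sample documents; its matches (800-865-1125, 555-1212, 11:30, 10/19/2002, 1.25, …) are labeled as positive (PosMatch 1 … PosMatch n0) or negative (NegMatch 1 … NegMatch m0); the labeled matches for R0 and R0 itself are the input from which Regex final is learned. An overlay places the same flow in the rule-refinement setting: AQL rules with R0, labeled outputs such as (John, 555-1212) and (John, 11:30), an HLC Enumerator, and a Regex LLC Module.]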

Intuition [Li08]

[Figure: search tree of candidate regular expressions rooted at R0 = ([A-Z][a-zA-Z]{1,10}\s){1,5}\s*(\w{0,2}\d[\.]?){1,4}. Each candidate modifies the current expression in one place and is scored with a “goodness” measure (F1 … F48). Candidates shown include:
– ([A-Z][a-z]{1,10}\s){1,5}\s*(\w{0,2}\d[\.]?){1,4}
– ([A-Z][a-zA-Z]{1,10}\s){1,2}\s*(\w{0,2}\d[\.]?){1,4}
– ([A-Z][a-zA-Z]{1,10}\s){2,4}\s*(\w{0,2}\d[\.]?){1,4}
– ((?!(Copyright |Page |Physics |Question | · · · |Article |Issue))[A-Z][a-zA-Z]{1,10}\s){1,5}\s*(\w{0,2}\d[\.]?){1,4}
– ([A-Z][a-zA-Z]{1,10}\s){1,5}\s*(?!(201|…|330))(\w{0,2}\d[\.]?){1,4}
The best candidate becomes R’, and the search continues from R’.]

• Generate candidate regular expressions by modifying the current regular expression
• Select the “best candidate” R’
• If R’ is better than the current regular expression, repeat the process
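
The generate, select, repeat loop above is a greedy hill climb. The following sketch shows that loop on the phone pattern from the earlier slides; it is not the algorithm of [Li08]. The labeled examples, the three string-rewriting edits in `EDITS`, and the use of F1 as the goodness measure are all assumptions chosen to keep the example small.

```python
import re

# Labeled matches of the starting regex: (text, is_phone_number).
LABELED = [("555-1212", True), ("800-865-1125", True),
           ("11:30", False), ("10/19/2002", False), ("1.25", False)]

# Hypothetical single-step edits, each making the regex more restrictive.
EDITS = [
    lambda r: r.replace(r"\W", "-", 1),        # restrict the separator class
    lambda r: r.replace(r"\d+", r"\d{3}", 1),  # restrict a quantifier
    lambda r: r.replace(r"\d+", r"\d{4}", 1),
]


def f1(regex: str) -> float:
    """F1 of the regex on LABELED, counting full matches as predictions."""
    predicted = [label for text, label in LABELED if re.fullmatch(regex, text)]
    tp = sum(predicted)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / sum(label for _, label in LABELED)
    return 2 * precision * recall / (precision + recall)


def refine(regex: str) -> str:
    """Greedy hill climbing: move to the best candidate while it improves F1."""
    while True:
        candidates = [c for c in (edit(regex) for edit in EDITS) if c != regex]
        if not candidates:
            return regex
        best = max(candidates, key=f1)
        if f1(best) <= f1(regex):
            return regex
        regex = best


start = r"(\d+\W)+\d+"
print(start, round(f1(start), 3))
print(refine(start), f1(refine(start)))
```

With so few examples the learned expression overfits, which is why the candidate space and the labeled sample both matter in practice.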

SystemT: Named-Entity Recognition (NER) Interface

Interface for exposing complex NER annotators – Common NER operations exposed in a compact language – Simplifies maintenance

Enabled by declarativity – Seamless translation to AQL, since semantics are separate from implementation

SystemT: Tagger UI [Kandogan06] [Kandogan07]

Simple language for combining existing complex annotators
– Dictionary, regex, sequence, union
– Similar in spirit to SQL Query Builders
Focus on collaboration within a community of users

Tooling in Various Declarative IE Systems

SystemT CIMPLE PSOX

Cimple: Provenance for Non-answers [Huang08]

Why is ‘Berkeley’ *not* in the result?

To explain missing answers
– Updates to extracted relations that would cause the missing answer to appear in the result
Focus on SPJ queries

TopJobs(s,r) :- Openings(s,’CA’,’yes’), Rankings(s,r), r<4

TopJobs
    School      Rank
    Stanford    1

Openings
    School      State    Opening
    Stanford    CA       yes
    MIT         MA       no
    CMU         PA       yes

Rankings
    School      Rank
    Stanford    1
    MIT         2
    Berkeley    3
    CMU         4

Example from [Huang08]

Cimple: Provenance for Non-answers [Huang08]

‘Berkeley’ is a potential answer

To explain missing answers
– Updates to extracted relations that would cause the missing answer to appear in the result
Focus on SPJ queries
Lazy approach

TopJobs(s,r) :- Openings(s,’CA’,’yes’), Rankings(s,r), r<4

TopJobs (with the update applied)
    School      Rank
    Stanford    1
    Berkeley    3

Openings (with the hypothetical tuple added)
    School      State    Opening
    Stanford    CA       yes
    MIT         MA       no
    CMU         PA       yes
    Berkeley    CA       yes

Rankings
    School      Rank
    Stanford    1
    MIT         2
    Berkeley    3
    CMU         4

Example from [Huang08]
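
The idea of explaining a non-answer through hypothetical updates can be tried by brute force on this example. The sketch below is not the method of [Huang08]: it considers only single-tuple insertions into Openings, draws candidate values from those already present in the relation, and re-runs the query for each candidate. The function names are made up for illustration.

```python
from itertools import product

openings = {("Stanford", "CA", "yes"), ("MIT", "MA", "no"), ("CMU", "PA", "yes")}
rankings = {("Stanford", 1), ("MIT", 2), ("Berkeley", 3), ("CMU", 4)}


def top_jobs(openings, rankings):
    """TopJobs(s,r) :- Openings(s,'CA','yes'), Rankings(s,r), r<4"""
    return {(s, r)
            for (s, state, opening) in openings
            for (s2, r) in rankings
            if s == s2 and state == "CA" and opening == "yes" and r < 4}


def explain_missing(school, openings, rankings):
    """Single-tuple insertions into Openings that make `school` appear in TopJobs."""
    states = {state for _, state, _ in openings}
    flags = {opening for _, _, opening in openings}
    fixes = []
    for state, flag in sorted(product(states, flags)):
        candidate = (school, state, flag)
        if candidate in openings:
            continue
        result = top_jobs(openings | {candidate}, rankings)
        if any(s == school for s, _ in result):
            fixes.append(candidate)
    return fixes


print(top_jobs(openings, rankings))
print(explain_missing("Berkeley", openings, rankings))
```

The only insertion that works is the tuple (Berkeley, CA, yes), which points the developer at the Openings extractor as the likely source of the missing answer.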

Cimple: Best-Effort IE [Shen08]

Scope: feedback on output data → refine IE program
Based on Alog = “skeletal” xlog + predicate description rules
Example: extraction predicates partially specified → approximate program

R1: houses(x,price,area,hs) :− housePages(x), extractHouses(x,price,area,hs)
R2: schools(school) :− schoolPages(y), extractSchools(y,school)
R3: Q(x,p,a,h) :− houses(x,p,a,h), schools(s), p>500000, a>4500, approxMatch(h,s)

Next-effort assistant:
– Is price bold? → Is-bold(price)
– Is area numerical? → Is-numeric(area)
– What is the maximum value of price? → Max-val(price, 750000)
Ordering strategy: sequential or simulation

Example from [Shen08]

Cimple: Incorporate User-Feedback in IE Programs [Chai09]

Scope: feedback on intermediate data → refine IE results
– Feedback = data corrections
Based on hlog = xlog + declarative user feedback rules
Example: dataSources tuples with date after 1/1/09 can be edited through a spreadsheet interface

xlog rules:
R1: webPages(p) :− dataSources(url, date), crawl(url, p)
R2: titles(title, p) :− webPages(p), extractTitle(p, title)
R3: abstracts(abstract, p) :− webPages(p), extractAbstract(p, abstract)
R4: talks(title, abstract) :− titles(title, p), abstracts(abstract, p), immBefore(title, abstract)

Feedback rules:
R5: dataSourcesForUserFeedback(url) #spreadsheet-UI :− dataSources(url, date), date >= “01/01/2009”

Example from [Chai09]

Cimple: Incorporate User-Feedback in IE Programs [Chai09]

Scope: feedback on intermediate data → refine IE results
– Feedback = data corrections
Based on hlog = xlog + declarative user feedback rules
Challenges:
– How to incorporate user feedback?
• Leverage provenance to resolve conflicting updates
• Provenance is eagerly computed
– How to efficiently execute an hlog program?
• Leverage provenance to incrementally propagate user edits upward in the pipeline

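Cimple: Building Community Wikipedias [DeRose08]

Madwiki = Cimple + Wiki interface

[Figure: data sources feed an IE program that populates a structured DB. Views of the DB (View 1, View 2, View 3) are rendered as wiki pages (Wiki 1, Wiki 2, Wiki 3). An edited page Wiki 3’ corresponds to View 3’ on the structured side and to Text 3’ in a text DB.]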

Tooling in Various Declarative IE Systems

SystemT CIMPLE PSOX

PSOX: Tracing Provenance [Bohannon08]

Coarse grained, eager approach
– Record operator, time and environment of each execution
– Record the entities and relationships that lead to each result

PSOX: Incorporating Social Feedback [Bohannon08]

User modeled as operator
– Feedback = confidence for relationships inferred by the system
– Flexible scoring model for combining confidence scores
Score changes propagate forward (leveraging provenance)

Provenance-Based Debugger (PROBER) [Das Sarma10]

Focus on IE pipelines with monotonic “black box” operators
– 1-1, 1-m, m-1, arbitrary

Coarse grained provenance (operator level)
– MISet: Minimum set of input records sufficient for deriving an output record

Lazy approach: repeatedly re-execute the operator

Trade-offs between provenance completeness and efficiency
– Any MISet (P-any): efficient for arbitrary operators
– All MISets (P-all): efficient for 1-1, 1-m operators

Composition of operators:
– P-*(Op1 ° Op2) can always be computed from P-all(Op1) and P-all(Op2)
– Can be done efficiently for certain combinations of operators
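
One simple way to obtain a single MISet by re-execution is to shrink the input greedily. This sketch is my own illustration, not the PROBER algorithm: `any_miset` and the example operator `frequent_names` are hypothetical, and the result is minimal (no record can be dropped) only under the assumption that the operator is monotonic.

```python
from typing import Callable, List, Set


def any_miset(op: Callable[[List[str]], Set[str]], inputs: List[str], target: str) -> List[str]:
    """Drop every input record whose removal still lets the black box derive target."""
    if target not in op(inputs):
        raise ValueError("target is not derived from the given inputs")
    kept = list(inputs)
    i = 0
    while i < len(kept):
        trial = kept[:i] + kept[i + 1:]
        if target in op(trial):
            kept = trial      # record i was not needed
        else:
            i += 1            # record i is needed; keep it and move on
    return kept


def frequent_names(sentences: List[str]) -> Set[str]:
    """Example m-1 operator: names mentioned in at least two sentences (monotonic)."""
    counts = {}
    for sentence in sentences:
        for name in {"John", "Mary"} & set(sentence.split()):
            counts[name] = counts.get(name, 0) + 1
    return {name for name, count in counts.items() if count >= 2}


sentences = ["John called", "Mary left", "John emailed Mary", "John slept"]
print(any_miset(frequent_names, sentences, "John"))
```

The operator is re-executed once per input record, so the cost grows with the input size; enumerating all MISets would require exploring many more subsets, which matches the completeness versus efficiency trade-off on the slide.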

Summary

[Table: SystemT, Cimple, PSOX, and PROBER compared along four groups of columns: Provenance Granularity (Fine, Coarse), Provenance Type (Answers, Non-answers), Automatic refinement (Type of Feedback, Forward propagation, Backward propagation), and User Interfaces (Maintain, Collaborate). Type of Feedback: SystemT, output labels; Cimple, Q&A and data edits; PSOX, confidence scores.]

Road Map

What is Information Extraction? (Fred Reiss)
Declarative Information Extraction (Fred Reiss)
What the Declarative Approach Enables
– Scalable Infrastructure (Yunyao Li)
– Development Support (Laura Chiticariu)
Conclusion/Questions ← You are here

Open Challenges

Scalability: Better cost and selectivity models for IE

Accuracy: Using provenance of non-answers for automatic extractor refinement

Usability: More declarative approaches to machine learning

Thank you!

Our web page:
– http://www.almaden.ibm.com/cs/projects/systemt/
Our software:
– http://www.alphaworks.ibm.com/tech/systemt
– Updated release coming soon!

References

[Appelt93] D. Appelt et al. “FASTUS: A Finite-State Processor for Information Extraction.” IJCAI 1993.
[Appelt98] D. Appelt and B. Onyshkevych. “The Common Pattern Specification Language.” ACL 1998.
[Bohannon08] Philip Bohannon, Srujana Merugu, Cong Yu, Vipul Agarwal, Pedro DeRose, Arun Iyer, Ankur Jain, Vinay Kakade, Mridul Muralidharan, Raghu Ramakrishnan, Warren. 2008. Purple SOX Extraction Management System. SIGMOD Record 37(4): 21-27.
[Cardie97] C. Cardie. “Empirical Methods in Information Extraction.” AI Magazine 18:4, 1997. http://www.cs.cornell.edu/home/cardie/
[Chai09] X. Chai, B. Vuong, A. Doan, J. Naughton. Efficiently Incorporating User Feedback into Information Extraction and Integration Programs. SIGMOD 2009.
[Chen08] Fei Chen, AnHai Doan, Jun Yang and Raghu Ramakrishnan. 2008. Efficient Information Extraction over Evolving Text Data. In Proceedings of the 24th International Conference on Data Engineering (ICDE 2008).
[Chen09] F. Chen et al. “Optimizing Complex Extraction Programs Over Evolving Text Data.” SIGMOD 2009.
[Cheney09] J. Cheney, L. Chiticariu, W. Tan. Provenance in Databases: Why, How, and Where. Foundations and Trends in Databases 2009.
[Chiticariu10] Laura Chiticariu et al. “SystemT: An Algebraic Approach to Declarative Information Extraction.” To appear, ACL 2010.

References (continued)

[Cohen03] W. Cohen. “Information Extraction and Integration: an Overview.” KDD (Tutorial) 2003.
[Cunningham02] H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan. GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications. ACL 2002.
[Cunningham09] H. Cunningham et al. “ANNIE: A Nearly-New Information Extraction System.” In Developing Language Processing Components with GATE Version 5 (a User Guide); September 4, 2009. http://gate.ac.uk/sale/tao/split.html
[Das Sarma10] A. Das Sarma, A. Jain, P. Bohannon. PROBER: Ad-Hoc Debugging of Extraction and Integration Pipelines. CoRR abs/1004.1614, 2010.
[Davidson08] S. Davidson, J. Freire. Provenance and Scientific Workflows: Challenges and Opportunities. SIGMOD 2008.
[DeJong82] G. DeJong. “An overview of the FRUMP system.” In Lehnert, W.G. and Ringle, M.H. (eds.), Strategies for natural language processing (Hillsdale: Erlbaum, 1982), p. 149-176.
[DeRose08] P. DeRose, X. Chai, B. J. Gao, W. Shen, A. Doan, P. Bohannon, X. Zhu. Building Community Wikipedias: A Machine-Human Partnership Approach. ICDE 2008.
[Doan06] Doan et al. “Managing Information Extraction” (Tutorial). SIGMOD 2006.
[Freitag98] “Toward General-Purpose Learning for Information Extraction.” COLING-ACL 1998.
[Green07] T. J. Green, G. Karvounarakis, V. Tannen. Provenance Semirings. PODS 2007.
[Glavic09] B. Glavic, G. Alonso. Perm: Processing provenance and data on the same data model through query rewriting. ICDE 2009.

References (continued)

[Huang08] J. Huang et al. On the Provenance of Non-answers to Queries over Extracted Data. PVLDB 2008.
[Ipeirotis07] Panagiotis G. Ipeirotis, Eugene Agichtein, Pranay Jain and Luis Gravano. 2007. Towards a Query Optimizer for Text-Centric Tasks. ACM Transactions on Database Systems (TODS), 32(4), article 21, November 2007.
[Jain08] A. Jain, AnHai Doan, and Luis Gravano. “Optimizing SQL Queries over Text Databases.” ICDE 2008.
[Jain09] Alpa Jain, Panagiotis Ipeirotis and Luis Gravano. 2009. Building Query Optimizers for Information Extraction: The SQoUT Project. SIGMOD Record 37(4): 28-34.
[Jain09b] Alpa Jain, Panagiotis Ipeirotis, AnHai Doan, and Luis Gravano. Join Optimization of Information Extraction Output: Quality Matters! In Proceedings of International Conference on Data Engineering (ICDE 2009).
[Khaitan09] Sanjeet Khaitan, Ganesh Ramakrishnan, Sachindra Joshi, Anup Chalamalla. 2008. RAD: A Scalable Framework for Annotator Development. In Proceedings of the 2008 IEEE 24th International Conference on Data Engineering (ICDE 2008).
[Kandogan06] E. Kandogan, R. Krishnamurthy, S. Raghavan, S. Vaithyanathan, H. Zhu. Avatar semantic search: a database approach to information retrieval. SIGMOD 2006 (Demo).
[Kandogan07] Kandogan et al. Avatar: Beyond Keywords – Collaborative Information Extraction and Search. ACM CHI Workshop on Exploratory Search and HCI 2007 (Poster).
[Krishnamurthy07] R. Krishnamurthy, S. Raghavan, S. Vaithyanathan, H. Zhu. Using Structured Queries for Keyword Information Retrieval. IBM Research Report RJ 10413. 2007.
[Krishnamurthy08] Rajasekar Krishnamurthy, Yunyao Li, Sriram Raghavan, Frederick Reiss, Shivakumar Vaithyanathan, Huaiyu Zhu. 2008. SystemT: a system for declarative information extraction. SIGMOD Record 37(4): 7-13.

References (continued)

[Kristjansson04] T. Kristjansson, A. Culotta, P. Viola, A. McCallum. Interactive Information Extraction with Constrained Conditional Random Fields. AAAI 2004.
[Lafferty01] J. Lafferty, A. McCallum, and F. Pereira. “Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data.” ICML 2001.
[Leek97] T. Leek. Information extraction using hidden Markov models. Master’s thesis. UC San Diego.
[Li06] Y. Li et al. Getting Work Done on the Web: Supporting Transactional Queries. In SIGIR, 2006.
[Li08] Y. Li, R. Krishnamurthy, S. Raghavan, S. Vaithyanathan, H.V. Jagadish. Regular Expression Learning for Information Extraction. EMNLP 2008.
[Liu10] B. Liu, L. Chiticariu, V. Chu, H.V. Jagadish, F. Reiss. “Automatic Rule Refinement for Information Extraction.” To appear, VLDB 2010.
[McCallum00] A. McCallum, D. Freitag, and F. Pereira. “Maximum Entropy Markov Models for Information Extraction and Segmentation.” ICML 2000.
[Reiss08] Frederick Reiss, Sriram Raghavan, Rajasekar Krishnamurthy, Huaiyu Zhu, and Shivakumar Vaithyanathan. 2008. An Algebraic Approach to Rule-Based Information Extraction. In Proceedings of the 2008 IEEE 24th International Conference on Data Engineering (ICDE 2008).
[Riloff93] E. Riloff. “Automatically Constructing a Dictionary for Information Extraction Tasks.” Proceedings of the Eleventh National Conference on Artificial Intelligence, 1993.
[Sang03] E. Sang, F. De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition.

References (continued)

[Sarawagi08] Sunita Sarawagi. “Information Extraction.” In Foundations and Trends in Databases, 2008.
[Shen07] Warren Shen, AnHai Doan, Jeffrey F. Naughton, Raghu Ramakrishnan. 2007. Declarative Information Extraction Using Datalog with Embedded Extraction Predicates. In Proceedings of the 33rd Very Large Data Bases Conference (VLDB 2007).
[Shen08] W. Shen, P. DeRose, R. McCann, A. Doan, R. Ramakrishnan. Toward Best-effort Information Extraction. SIGMOD 2008.
[Simmen09] D. E. Simmen et al. “Enabling Enterprise Mashups over Unstructured Text Feeds with InfoSphere MashupHub and SystemT.” In SIGMOD (Demo), 2009.
[Soderland98] S. Soderland et al. “CRYSTAL: Inducing a Conceptual Dictionary.” IJCAI 1995.
[Thompson99] C. A. Thompson, M. E. Califf and R. J. Mooney. Active Learning for Natural Language Parsing and Information Extraction. ICML 1999.
[Wang10] Daisy Zhe Wang, Eirinaios Michelakis, Michael J. Franklin, Minos Garofalakis, and Joseph M. Hellerstein. 2010. Probabilistic Declarative Information Extraction. In Proceedings of the 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010).
[Zhu07] H. Zhu et al. Navigating the Intranet with High Precision. In WWW, 2007.
