Wrapper Induction for Information Extraction

by

Nicholas Kushmerick

A dissertation submitted in partial fulllment

of the requirements for the degree of

Do ctor of Philosophy

University of Washington

Approved by

Chairp erson of Sup ervisory Committee

Program Authorized

to Oer Degree Date

In presenting this dissertation in partial fulllment of the requirements for the Do c

toral degree at the I agree that the Library shall make

its copies freely available for insp ection I further agree that extensive copying of

this dissertation is allowable only for scholarly purp oses consistent with fair use

as prescrib ed in the US Copyright Law Requests for copying or repro duction of

this dissertation may b e referred to University Microlms Eisenhower Place

PO Box Ann Arb or MI to whom the author has granted the right

to repro duce and sell a copies of the manuscript in microform andor b printed

copies of the manuscript made from microform

Signature Date

University of Washington

Abstract

Wrapper Induction for Information Extraction

by Nicholas Kushmerick

Chairp erson of Sup ervisory Committee Professor Daniel S Weld

Department of Computer Science

and Engineering

The Internet presents numerous sources of useful informationtelephone directories

pro duct catalogs sto ck quotes weather forecasts etc Recently many systems have

b een built that automatically gather and manipulate such information on a users

b ehalf However these resources are usually formatted for use by p eople eg the

relevant content is embedded in HTML pages so extracting their content is dicult

Wrappers are often used for this purp ose A wrapp er is a pro cedure for extracting

a particular resources content Unfortunately handco ding wrapp ers is tedious We

introduce wrapper induction a technique for automatically constructing wrapp ers

Our techniques can b e describ ed in terms of three main contributions

First we p ose the problem of wrapp er construction as one of inductive learning

Our algorithm learns a resources wrapp er by reasoning ab out a sample of the re

sources pages In our formulation of the learning problem instances corresp ond to

the resources pages a pages label corresp onds to its relevant content and hypotheses

corresp ond to wrapp ers

Second we identify several classes of wrapp ers which are reasonably useful yet

eciently learnable To assess usefulness we measured the fraction of Internet re

sources that can b e handled by our techniques We nd that our system can learn

wrapp ers for of the surveyed sites Learnability is assessed by the asymptotic

complexity of our systems running time most of our wrapp er classes can b e learned

in time that grows as a smalldegree p olynomial

Third we describ e noisetolerant techniques for automatically labeling the exam

ples Our system takes as input a library of recognizers domainsp ecic heuristics

for identifying a pages content We have developed an algorithm for automatically

corroborating the recognizers evidence Our algorithm p erform well even when the

recognizers exhibit high levels of noise

Our learning algorithm has b een fully implemented We have evaluated our system

b oth analytically with the PAC learning mo del and empirically Our system requires

to examples for eective learning and takes ab out ten seconds of CPU time for

most sites We conclude that wrapp er induction is a feasible solution to the scaling

problems inherent in the use of wrapp ers by informationintegration systems

TABLE OF CONTENTS

List of Figures vi

Chapter Introduction

Motivation

Background Information resources and software agents

Our fo cus Semistructured resources

An imp erfect strategy Handco ded wrapp ers

Overview

Our solution Automatic wrapp er construction

Our technique Inductive learning

Evaluation

Contributions

Organization

Chapter A formal mo del of information extraction

Introduction

The basic idea

The formalism

Summary

Chapter Wrapper construction as inductive learning

Introduction

Inductive learning

The formalism

The Induce algorithm

PAC analysis

Departures from the standard presentation

Wrapper construction as inductive learning

Summary

Chapter The HLRT wrapp er class

Introduction

hlrt wrapp ers

The Generalize algorithm

hlrt

The hlrt consistency constraints

Generalize

hlrt

Example

Formal prop erties

Eciency The Generalize algorithm

hlrt

Complexity analysis of Generalize

hlrt

Generalize

hlrt

Formal prop erties

Complexity analysis of Generalize

hlrt

Heuristic complexity analysis

PAC analysis

Summary

Chapter Beyond HLRT Alternative wrapp er classes

Introduction

Tabular resources ii

The lr oclr and hoclrt wrapp er classes

Segue

Relative expressiveness

Complexity of learning

Nested resources

The nlr and nhlrt wrapp er classes

Relative expressiveness

Complexity of learning

Summary

Chapter Corrob oration

Introduction

The issues

A formal mo del of corrob oration

The Corrob algorithm

Explanation of Corrob

Formal prop erties

Extending Generalize and the PAC mo del

hlrt

noisy

The Generalize algorithm

hlrt

Extending the PAC mo del

Complexity analysis and the Corrob algorithm

Additional input Attribute ordering

Greedy heuristic Stronglyambiguous instances

Domainsp ecic heuristic Proximity ordering

Performance of Corrob

Recognizers

Summary iii

Chapter Empirical Evaluation

Introduction

Are the six wrapp er classes useful

Can hlrt b e learned quickly

Evaluating the PAC mo del

Verifying Assumption Short page fragments

Verifying Assumption Few attributes plentiful data

Measuring The PAC mo del noise parameter Equation

The WIEN application

Chapter Related work

Introduction

Motivation

Software agents and heterogeneous information sources

Legacy systems

Standards

Applications

Systems that learn wrapp ers

Information extraction

Recognizers

Do cument analysis

Formal issues

Grammar induction

PAC mo del

Chapter Future work and conclusions

Thesis summary iv

Future work

Shorttomedium term ideas

Mediumtolong term ideas

Theoretical directions

Conclusions

App endix A An example resource and its wrapp ers

App endix B Pro ofs

B Pro of of Theorem

B Pro of of Lemma

B Pro of of Theorem

B Pro of of Theorem

B Pro of of Theorem Details

B Pro of of Theorem Details

B Pro of of Theorem

App endix C String algebra

Bibliography v

LIST OF FIGURES

The showtimeshol lywoodcom Internet site is an example of a semi

structured information resource

a A ctitious Internet site providing information about countries and

their telephone country codes b an example query response and c

the html text from which b was rendered

The ExtractCCs procedure a wrapper for the countrycode resource

shown in Figure

The Induce generic inductive learning algorithm preliminary version

see Figure

Two parameters and are needed to hand le the two types of di

culties that may occur while gathering the examples E

A revised version of Induce see Figure

The hlrt wrapper procedure template a pseudocode and b details

A label partitions a page into the attribute values the head the tail

and the inter and intratuple separators For brevity parts of the

page are omitted

The hlrt consistency constraint C is dened in terms of three

hlrt

predicates CC

The Generalize algorithm

hlrt

The space searched by Generalize for a very simple example

hlrt vi

The Generalize algorithm is an improved version of

hlrt

Generalize Figure

hlrt

Surfaces showing the condence that a learned wrapper has error at

most as a function of N the total number of examples pages and

M

tot

M the average number of tuples per example for a

ave

N

and b

The relative expressiveness of the lr hlrt oclr and hoclrt wrap

per classes

An example of a nested documents information content

The relative expressivenes of the lr hlrt nlr and nhlrt wrapper

classes

The Corrob algorithm

noisy

The Generalize algorithm a modication of Generalize Fig

hlrt

hlrt

ure which can hand le noisily labeled examples

A slightly modied version of the Generalize algorithm compare

hlrt

with Figure

The information resources that we surveyed to measure wrapper class

coverage

The results of our coverage survey

A summary of Figure

Average number of example pages needed to learn an hlrt wrapper

that performs perfectly on a suite of test pages for actual Internet

resources vii

Number of examples needed to learn a wrapper that performs perfectly

on test set as a function of the recognizer noise rate for the a okra

b bigbook c corel and d altavista sites

Average time to learn a wrapper for four Internet resources

Predicted number of examples needed to learn a wrapper that satises

the PAC termination criterion as a function of the recognizer noise

rate for the a okra b bigbook c corel and d altavista

sites

A scatterplot of the observed partition fragment lengths F versus page

p

lengths R as wel l as two models of this relationship F R the

p

model demanded by Assumption and F R the bestt model

The measured values of the ratio dened in Equation

The wien application being used to learn a wrapper for lycos

A Site in the survey Figure a the query interface and b an

example query response

B A sux tree showing that the common proper suxes of the ve exam

ples can be represented as the interval L U

B An example of learning the integer I from lower bounds fL L g

and upper bounds fU U g

B One point in each of the regions in Figure

B One point in each of the regions in Figure viii

ACKNOWLEDGMENTS

First of all I thank my family without your inspiration and supp ort none of this

would have b een p ossible Thanks for not asking to o many questions or to o few

The fabulous Marilyn McCune demands sp ecial mention Marilyn did not merely

put up with far to o much during the past eight months Her wit and condence in

the face of my glo om her aection in the face of my six oclo ck mornings her verse

and laughter in the face of my equations and her culinary ingenuity in the face of

my hungerin these and a thousand other ways Marilyn carried the day

Simply put Dan Weld has b een a fantastic advisor I am grateful for the b oundless

faith in me he showed even as I insisted on exploring yet another patently ludicrous

culdesac And though it was sometimes intimidating Dans comprehensive grasp

of the several areas in which we worked was always a helpful source of new ideas

My colleagues at the University of Washington have provided a fertile environment

for exploring articial intelligence and computer science not to mention life Let me

thank in particular Tony Barrett Paul BartonDavis Adam Carlson Bob Do orenbos

Denise Drap er Oren Etzioni Marc Friedman Keith Golden Steve Hanks Anna

Karlin Neal Lesh Mike Perkowitz Ted Romer Eric Selb erg Stephen So derland and

Mike Williamson

Brett Grace provided invaluable assistance in transforming my crude vision into

the wien application Finally Boris Baks assistance with the searchcom survey

is much appreciated

This dissertation is dedicated with love to my grandmothers Mary Nowicki

Knol l and Marie Naglak Kushmerick who got everything started ix

Chapter

INTRODUCTION

Motivation

Background Information resources and software agents

The muchheralded information technology age has delivered a stunning variety of

online information resources telephone directories airline schedules retail pro duct

catalogs weather forecasts sto ck market quotations job listings event schedules

scientic data rep ositories recip e collections and many more With widespread

adoption of op en standards such as http and extensive distribution of inexp ensive

software such as Netscap e Navigator these resources are b ecoming ever more widely

available

As originally envisioned this infrastructure was intended for use directly by people

For example Internet resources often use query mechanisms eg html forms and

output standards eg htmls formatting constructs that are reasonably well suited

to direct manual interaction

An alternative to manual manipulation is automatic manipulation the use of com

puter programs rather than p eople to interact with information resources For many

the information feast has b ecome an information glut There is a widelyrecognized

need for systems that automate the pro cess of managing collating collecting nding

ltering and redistributing information from the many resources that are available

Over the last several years the articial intelligence community has resp onded to

this challenge by developing systems that are lo osely allied under the term software

agents In this thesis we are mainly motivated by a sp ecic variety of such agents

those which use an array of existing information resources as tools much as a house

cleaning rob ot might use vacuum cleaners and mops The idea is that the user

sp ecies what is to b e accomplished the system gures out how to use its to ols to

accomplish the desired task Following the University of Washington terminology we

will call such systems softbots soft ware rob ots

To make this discussion concrete we will fo cus on one particular system The ra

zor nee occam system Kwok Weld Friedman Weld accepts queries

that describ e a particular information need eg Find reviews of movies show

ing this week in by Fellini razor then computes and executes a se

quence of informationgathering actions that will satisfy the query In the ex

ample the actions might involve querying various movieinformation sites eg

imdbcom to obtain a list of Fellinis movies then asking theaters sites eg

showtimeshollywoo dcom which of these movies are now showing and nally

passing these movies on to various sites containing reviews eg siskelebertcom

Softb ots are attractive b ecause they promise to relieve users of the tedium of

manually carrying out such op erations For example for each of the three steps

ab ove there are many p otentially relevant resources Moreover information gleaned

at each step must b e manually passed on ie read from the browser and typed into

an html form to each of the resources consulted in the next step

The software agent literature is vast see Etzioni et al Wooldridge Jennings

Bradshaw or wwwagentsorg for surveys The softb ot paradigm is discussed in detail

in Etzioni Weld Examples of the systems we have in mind include Etzioni et al

Chawathe et al Kirk et al Carey et al Krulwich Kwok Weld Arens et al

Shakes et al Do orenbos et al Selb erg Etzioni Decker et al as well as commer

cial pro ducts such as Jango wwwjangocom Junglee wwwjungleecom Computer ESP oracle

uvisioncomshop and AlphaCONNECT wwwalphamicrocom This short list certainly neglects

many imp ortant pro jects In Chapter we discuss related work in detail

One might argue that this ideathe automatic manipulation of resources intended

for p eopleis somewhat misguided Havent the distributed databases and software

agents communities developed elab orate proto cols for exchanging information across

heterogeneous environments Why not reengineer the information resources so that

they provide interfaces that are more conducive to automatic interaction

This is a reasonable ob jection However organizations might have many reasons

for not wanting to op en up their information resources to software agents An online

business might prefer to b e visited manually rather than mechanically Search engines

such as Yahoo for example are in the business of delivering users to advertisers not

servicing queries as an end in itself Similarly a retail store might not want to simplify

the pro cess of automatically comparing prices b etween vendors And of course the

cost of reengineering existing resources might b e prohibitive

For these reasons if we want to build software agents that access a wide variety

of information resources the only option might b e to build systems that make use

of the existing interfaces that were intended originally for use by p eople At the

highest level this challengebuilding systems that can use humancentered

interfacesconstitutes the core motivation of this thesis

Of course this challenge is exceedingly dicult For example machines to day

cant fully understand the unrestricted natural language text used in online newspa

p er or magazine articles Thus an entirely generalpurp ose solution to this challenge

is a long way o Instead following standard practice our approach is to isolate a

relevant yet realistic sp ecial case of this dicult problem

There are many such prop osals each developed under somewhat dierent motivations A

representative sample includes corba wwwomgorg odbc wwwmicrosoftcomdatao db c

xml wwwworgTRWDxml kif logicstanfordedukif z lcweblocgovz widl

wwwwebmethodscom shoe Luke et al kqml Finin et al and the Metacontent Format mcfresearchapplecom

+ information extraction

h In Out PG Metro Cinemas i

h Sunday NR Broadway Market Cinema i

h Eye of God NR Broadway Market Cinema i

Figure The showtimeshol lywoodcom Internet site is an example of a semi

structured information resource

Our focus Semistructured resources

Fortunately many of the information resources we want our software agents to use

do not exhibit the full p otential complexity suggested by the challenge just outlined

For instance Figure illustrates that the resource showtimeshollywoo dcom do es

not contain unrestricted natural language text Rather it presents information in a

highly regular and structured fashion

Sp ecically showtimeshollywoo dcom structures its output in the form of a

table The table contains four columns time movie rating and theater and one

row for each quadruple of information in the do cument Now if we are to build a

software agent that can make use of this resource then we must provide it with a

pro cedure which when invoked on such a do cument extracts the do cuments con

tent In the literature such sp ecialized pro cedures are commonly called wrappers

Papakonstantinou et al Chidlovskii et al Roth Schwartz

This thesis is concerned with such semistructured information resourcesthose

that exhibit regularity of this nature This fo cus in motivated by three concerns

Semistructured resources generally do not employ unrestricted natural language

text but rather exhibit a fair degree of structure Thus we are optimistic that

handling semistructured resources is a realistic goal

On the other hand semistructured resources contain extraneous elements that

must b e ignored such as advertisements and html formatting constructs Most

imp ortantly there is no machinereadable standard explaining how to inter

act with this site or how it renders information Thus extraction from semi

structured do cuments is not entirely trivial

As discussed in Do orenbos et al Perkowitz et al we hop e that a rela

tively large fraction of actual Internet information resources are semistructured

in this way so that our results are reasonably useful

Besides semistructured resources the discussion so far has hinted at our second

main fo cus information extraction Roughly by the phrase information extraction

we refer to the pro cess of identifying and organizing relevant fragments in a do cument

while discarding extraneous text Once a resources information is extracted in this

manner it can b e manipulated in many dierent ways Thus while we are strongly

Of course this emphasis of extraction ignores many imp ortant issues such as how software agents

come to discover resources in the rst place Bowman et al Zaiane Jiawei how they can

learn to query these resources Do orenbos et al and which services b eyond extraction wrapp ers

should provide Papakonstantinou et al Roth Schwartz Also while the essence is simi

lar our use of the phrase information extraction diers from other uses Cowie Lehnert In

Chapter we describ e this and other related work in detail

motivated by the software agent paradigm this thesis is not concerned with sp ecic

systems or architectures Rather we have developed an enabling technologynamely

techniques for automatically constructing wrapp ersthat we exp ect to b e relevant

to a wide sp ectrum of software agent work

An imperfect strategy Handcoded wrappers

To summarize we have introduced the notion of semistructured information re

sources The hop e is that we can construct relatively simple information extraction

pro cedures which well call wrapp ers for semistructured resources

For example it turns out that do cuments from showtimeshollywoo dcom can

b e parsed using an extremely simple mechanism For now we omit all the details

but the basic idea is that the text fragments to b e extracted showtime movie title

etc are always surrounded by certain sp ecic strings Note for example that the

showtimes are consistently rendered in b old face insp ection of the html do cument

reveals that each showtime is surrounded by the tags B B

This observation leads to a very simple parsing pro cedure the two strings B

and B can b e used to identify and extract the relevant fragments of the do cument

To b e sure this simple mechanism cant handle some imp ortant subtleties For now

the p oint is simply that a very simple mechanism can p erform this sort of information

extraction task since the wrapp ers can exploit fortuitous regularities in a collection

of semistructured do cuments

Given that such wrapp ers exists the question b ecomes how should designers

of software agents build them One obvious p ossibility is to construct an agents

wrapp ers by hand indeed nearly all wrapp ers to day are constructed by hand

Unfortunately while straightforward in principle handco ding wrapp ers is time

consuming and errorprone mainly b ecause accurately determining the appropriate

delimiter strings eg B and B is tedious Note also that a sp ecialized wrapp er

must b e written for each of the resources in a software agents arsenal Moreover

actual Internet sites change their format o ccasionally and each such mo dication

might require that the wrapp er b e rewritten To summarize handco ding results in

a serious knowledgeengineering b ottleneck and software agents that rely on hand

co ded wrapp ers face serious scaling problems

Set against this background our primary motivation can now b e stated quite

succinctly To alleviate this engineering b ottleneck we seek to automate

the pro cess of constructing wrapp ers for semistructured resources

In the remainder of this chapter we rst provide an overview of our approach to

automating the wrapp er construction pro cess Section We then summarize our

ma jor contributions Section and describ e how the chapters of this thesis are

related and organized Section

Overview

Our solution Automatic wrapper construction

Our goal is to automatically construct wrapp ers Since a wrapp er is simply a com

puter program we are essentially trying to do automatic programming Of course in

general automatic programming is very dicult So as suggested earlier we follow

standard practice and pro ceed by isolating particular classes of programs for which

eective automatic techniques can b e developed

For example the B B technique introduced earlier for extracting the show

times from showtimeshollywoo dcom suggests one simple class of wrapp ers which

we call lr l eftright lr wrapp ers op erate by scanning a do cument for such delim

iters one pair for each attribute to b e extracted An lr wrapp er starts by scanning

forward to the rst o ccurrence of the lefthand delimiter for the rst attribute then

to the righthand delimiter for the rst attribute then to the lefthand delimiter for

the second attribute then to the righthand delimiter for the second attribute and

so forth until all the columns have b een extracted from the tables rst row

The wrapp er then starts over again with the next row execution halts when all rows

have b een extracted

Admittedly lr is an extremely simple class of wrapp ers Can suitable delimiters

b e found for real information resources According to our recent survey of actual

Internet resources lr wrapp ers are suciently expressive to handle of the kinds

of resources to which we would like our software agents to have access see Section

for details Moreover as we discuss next we have developed ecient algorithms for

automatically co ding lr wrapp ers The lr class therefore represents a reasonably

useful strategy for alleviating the wrapp er construction b ottleneck

While lr is reasonably expressive by insp ecting the of resources that it

can not handle we have developed a more general wrapp er class which we call hlrt

eadleftrighttail hlrt wrapp ers work like lr wrapp ers except that they ignore h

distracting material in a do cuments head its topmost p ortion and tail b ottom

hlrt is more complicated than lr but this additional machinery pays mo dest

dividends ab out of the resources we surveyed can b e handled by hlrt More

over this extra expressiveness costs little in terms of the complexity of automatic

wrapp er construction just as with lr we have developed ecient algorithms for

automatically co ding hlrt wrapp ers

Since hlrt is more complicated than lr it exp oses more of the subtleties of au

tomatic construction Therefore in this thesis we fo cus mainly on hlrt see Chapter

However in Chapter we describ e lr as well as four other wrapp er classes All

six classes are based on the idea of using delimiters such as B B to extract the

attributes The classes dier in two ways First we developed various techniques to

avoid getting confused by distractions eg advertisements Second we developed

wrapp er classes for extracting information that is laid out not as a table but rather

as a hierarchically nested structure eg a b o oks table of contents

The main result of this analysis is a comparison of the six wrapp er classes on two

bases relative expressiveness a measure of the extent to which one wrapp er class

can mimic another and the computational complexity of automatically constructing

wrapp ers in each class

Our technique Inductive learning

We have developed techniques to automatically construct various types of wrapp ers

How do es our system work Our approach is based on inductive learning a well

studied paradigm in machine learning see for example Dietterich Michalski

Michalski and Mitchell Chapters Induction is the pro cess of reasoning

from a set of examples to an hypothesis that in some applicationsp ecic sense

generalizes or explains the examples A inductive learning algorithm then takes as

input a set of examples and pro duces as output an hypothesis For example if told

that Thai food is spicy Korean food is spicy and German food is not spicy an

inductive learner might output Asian food is spicy

Induction is in principle very dicult The fundamental problem is that many

hypotheses are typically consistent with a set of examples but the learner has no

basis on which to choose For example while we might judge Asian food is spicy as

a reasonable generalization on what basis do we reject the trivial generalization Thai

or Korean food is spicy formed by simply disjoining the examples The standard

practice in machine learning is to bias Mitchell the learning algorithm so that

it considers only hypothesis that meet certain criteria For example the learning

algorithm could b e biased so that is do es not consider disjunctive hypotheses

Wrapper induction This thesis was strongly inuenced by two related Uni

versity of Washington pro jects ila Perkowitz Etzioni and shopbot

Do orenbos et al see also Etzioni b Perkowitz et al This seminal work

prop osed a framework in which softb ots use machine learning techniques to learn

ab out the to ols they use The idea is that during an oline learning phase the

agents interact with these to ols generalizing from the observed b ehavior The results

of this learning phase are then used online to satisfy users queries

We can summarize this framework in terms of the following observation an

eective way to learn ab out an information resource is to reason ab out a

sample of its b ehavior Our work can b e p osed in terms of this observation as

follows we are interested in the following wrapper induction problem

input the examples corresp ond to samples of the inputoutput b ehav

ior of the wrapp er to b e constructed

output a hypothesis corresp onds to a wrapp er and hypothesis biases

corresp ond to classes of wrapp ers such as lr and hlrt

Under this formulation an eective wrapp er induction system is one that rapidly

computes a wrapp er that b ehaves as determined by the inputoutput sample

Like all induction algorithms our system assumes that there exists some target

wrapp er which works correctly for the information resource under consideration The

input to our system is simply a sample of this target wrapp ers inputoutput b ehavior

In the showtimeshollywoo dcom example this sample might include the input

output pair shown in Figure

The desired output for showtimeshollywoo dcom is the target wrapp er How

do es our learning algorithm reconstruct the target from the inputoutput samples it

takes as input The basic idea is that our system considers the set of al l wrapp ers

rejecting those that are inconsistent with the observed inputoutput samples

Initially any wrapp er is p otentially consistent Then by examining the examples

such as Figure the system eliminates incompatible wrapp ers Guided by the

sample inputoutput b ehavior of the target our system examines the text extracted

from the example do cuments in order to nd incompatibilities For instance when the

system observes that the extracted showtimes are always surrounded by B B

it can discard all wrapp ers that do not use these delimiters

Of course the set of all wrapp ers is enormous in fact it is innite Thus a key

to eective wrapp er learning is to reason eciently over these sets As describ ed in

Chapters and we have built and analyzed the eciency of learning algorithms

for six dierent wrapp er classes

Automatically generating the examples As the wrapp er induction problem

was stated our learning algorithm takes as input a sample of the b ehavior of the

very wrapp er to b e learned Although this assumption is standard in the inductive

learning community we must ask whether it is appropriate for our application

First note that the apparent contradictionto learn a wrapp er we have to rst

sample its b ehavioris easily resolved The input to our learning system need only

b e a sample of how the target wrapp er would b ehave if it were given an example

do cument from the resource under consideration

This observation suggests one simple way to gather the required input ask a

p erson Under this approach we have reduced the task of handco ding a wrapp er to

the task of handlab eling a set of example do cuments In some cases this reduction

may b e quite eective at simplifying the p ersons cognitive load For instance less

exp ertise might b e required since the user can fo cus on the attributes to b e extracted

showtimes movie titles etc rather than on implementation details verifying

that B B is a satisfactory pair of delimiters etc

Nevertheless we seek to automate wrapp er construction as much as p ossible To

that end we have developed a set of techniques for automatically lab eling example

do cuments see Chapter Our lab eling algorithm takes as input domainsp ecic

heuristics for recognizing instances of the attributes to b e extracted In the show

timeshollywoo dcom example our lab eling system would take as input a pro cedure

for recognizing all of the instances of times text fragments such as and

The required recognition heuristics might b e very primitiveeg using the reg

ular expression to identify showtimes At the other extreme

recognition might require natural language pro cessing or the querying of other in

formation resourceseg asking an alreadywrapp ed resource to determine whether

a particular text fragment is a movie title While such recognition heuristics are cer

tainly very imp ortant this thesis is not concerned with either the theory or practice

of developing these heuristics Rather our system simply requires that these heuris

tics b e provided as input and then treats them as black b oxes when determining

a resources structure

Once the instances of each attribute have b een identied our lab eling system

combines the results for the entire page If the recognition knowledge is p erfect then

this integration step is trivial However note that p erfect recognition heuristics do not

obviate the need for wrapp er induction b ecause while the heuristic might b e p erfect

they might also b e very slow and thus b e unable to deliver the fast p erformance

demanded by a software agents online information extraction subsystem

An imp ortant feature of our system is that it can tolerate quite high rates of

noise in the recognizer heuristics For example the time regular expression

ab ove might nd some text fragments that are not in fact showtimes or it might

mistakenly ignore some of a do cuments showtimes Our automatic page lab eling

algorithm can make use of recognizers even when they make many such mistakes for

example we have tested our system using recognizer heuristics that are wrong up to

of the time and found only a mo dest p erformance degradation see Section

The intuition for this capability is that while the recognition heuristics might fail for

any particular showtime they are unlikely to fail rep eatedly across several example

do cuments

How many examples are enough We presented a framework inductive learn

ing with which to understand our approach to automatic wrapp er construction We

then went on to describ e our technique for solving one problematic asp ect of this

approach the need for samples from the very wrapp er to b e learned The basic

inductive learning mo del also has little to say regarding a second imp ortant issue

how many examples must our learning system examine in order to b e condent that

it will output a satisfactory wrapp er

Computational learning theory a subeld of the machine learning and theoretical

computer science communities provides a rigorous foundation for investigating the

exp ected p erformance of learning systems see Angluin for a survey We have

applied these techniques to our wrapp er induction application

Our results see Section are statistical in nature We have developed a mo del

which predicts how many examples our learning algorithm must observe to ensure

that with high probability the algorithms output wrapp er makes a mistake ie

fails when parsing an unseen do cument only rarely This investigation is formalized

in terms of usersp ecied parameters that dene high probability and rarely

Based on the structure of the wrapp er learning task we use these parameters to

derive a b ound on the number of examples needed to satisfy the satisfy the stated

criterion Thus our wrapp er induction system usually outputs a wrapp er that rarely

fails in computational learning theory terminology our system outputs wrapp ers

robably approximately correct PAC that are p

Evaluation

Perhaps the most imp ortant issues in any research pro ject concern evaluation how

can one know the extent to which ones results are relevant and imp ortant In

Chapter we take a thoroughly empirical approach to evaluation measuring the

b ehavior of our system against actual information resources on the Internet

Our rst exp eriment takes the form of a survey Section We are interested

in whether the wrapp er classes we have developed are useful for handling actual

Internet resources We examined a large p o ol of such resources and determined

for each whether it could b e handled by one or more of the six wrapp er classes

The PAC mo del was prop osed rst in Valiant see Martin Biggs Kearns Vazirani

for extensive treatments

developed in this thesis The results are quite encouraging of the resources

can b e handled by the lr wrapp er class while the hlrt class can handle the

remaining four wrapp er classes have a similar degree of expressiveness We conclude

that we have identied wrapp er classes that are useful for numerous interesting real

world information resources

A second set of exp eriments tests whether our induction algorithm is to o exp ensive

to b e used in practice Section One imp ortant resource is time averaged across

several real domains our system requires ab out one minute to learn an hlrt wrapp er

While CPU time is an imp ortant measure recall from our discussion of the PAC

mo del that another imp ortant resource is the required set of examples Each example

do cument must b e fetched from the resource and then lab eled either by a p erson or

with our automatic lab eling techniques

Given these costs we would prefer that our induction system use few examples

We nd that averaging across several actual domains our system requires

examples in order to learn a wrapp er that works p erfectly against a large suite of test

do cuments As a p oint of comparison we can relate this result with our theoretical

PAC b ound Our mo del predicts that examples are required Thus our

PAC b ound is to o lo ose by one to two orders of magnitude improving this b ound is

a challenging direction for future research see Chapter

Finally we have developed the w rapp er induction environment wien application

Section Using a standard Internet browser a user shows wien an example do c

ument and then uses the mouse to indicate the p ortions of the page to b e extracted

In addition the user can supply recognizer heuristics which are automatically ap

plied to the do cument wien then tries to learn a wrapp er for the resource When

shown a second example wien uses the learned wrapp er to automatically lab el the

new example The user then corrects any mistakes and wien generalizes from b oth

examples This pro cess rep eats until the user is satised

Contributions

As discussed earlier the primary motivation of this thesis involves enabling software

agents to interact with the humancentric interfaces found at most Internet informa

tion resources Since this is a very hard problem we have fo cused on a more mo dest

challenge the development of techniques for automatically constructing wrapp ers for

semistructured resources

Let us summarize this chapter by explicitly stating what we b elieve to b e our

ma jor contributions towards meeting this challenge

We crisply p ose the automatic wrapp er construction problem as one of inductive

ie exampledriven learning Chapters and

We identify several classes of wrapp ers which are expressive enough to han

dle numerous actual Internet resources and we develop ecient algorithms for

automatically learning these classes Chapters and

Finally we develop a metho d for automatically lab eling the examples required

by our induction algorithm An imp ortant feature of our metho d is that it is

robust in the face of noise Chapters

Organization

The remainder of this thesis is organized as follows

Chapter A formal mo del of information extraction

We b egin by stating the kind of information extraction tasks that

concern us

Chapter Wrapper construction as inductive learning

We then show how to frame the problem of automatically construct

ing wrapp ers as one of inductive learning

Chapter The HLRT wrapp er class

We then illustrate our approach with one particular wrapp er class

hlrt our results include an ecient and formally correct learning

algorithm and a PACtheoretic analysis of the number of training

examples needed to learn a satisfactory wrapp er

Chapter Beyond HLRT Alternative wrapp er classes

hlrt is but one of many wrapp er classes in this chapter we explore

ve more including classes for extracting information from hierar

chically nested rather than tabular do cuments

Chapter Corrob oration

The learning algorithm developed in Chapter requires a set of la

b eled examples as input In Chapter we describ e techniques for

automating this lab eling pro cess

Chapter Empirical evaluation

We describ e several exp eriments which demonstrate the feasibility of

our approach on actual Internet information resources and describ e

wien an application which embo dies many of the ideas developed

in this thesis

Chapter Related work

Our work is motivated by and draws insights from a variety of dif

ferent research areas

Chapter Future work and conclusions

We suggest avenues for future research and summarize our contri

butions and conclusions

App endix A An example resource and its wrapp ers

We provide the complete html source from an example Internet

information resource and show a wrapp er for this resource for each

of the wrapp er classes describ ed in Chapter

App endix B Pro ofs

We include pro ofs of the theorems and lemmas asserted in this thesis

App endix C String algebra

Finally we provide details of the character string notation used

throughout this thesis Enjoy

Chapter

A FORMAL MODEL OF INFORMATION EXTRACTION

Introduction

This thesis is concerned with learning to extract information from semistructured

information resources such as Internet sites In this chapter we temp orarily ignore

issues related to learning and develop a formal mo del of the extraction pro cess itself

This preliminary step is imp ortant b ecause it allows us to state precisely what we

want our system to learn

We start with a highlevel overview of our mo del Section We then formalize

and provide a precise notation for these intuitions and ideas Section

The basic idea

Figure provides an example of the sort of information resource with which we

are concerned The gure shows a ctitious Internet site that provides information

ab out countries and their telephone country co des When the form shown in a

is submitted the resource resp onds as shown in b which was rendered from the

html do cument shown in c

More generally an information resource is a system that resp onds to queries

yielding a query response that is presumably appropriate to the query Usually

the resource contains a database incoming queries are p osed to the database and

the results are used to comp ose the resp onse However from the p ersp ective of this

thesis the information resource is simply a black b ox and we will have very little to

say ab out either the queries or the resources internal architecture

a b

HTMLTITLESome Country CodesTITLEBODY

BSome Country CodesBP

BCongoB IIBR

c

BEgyptB IIBR

BBelizeB IIBR

BSpainB IIBR

HRBEndBBODYH TML

Figure a A ctitious Internet site providing information about countries and

their telephone country codes b an example query response and c the html text

from which b was rendered

In the countrycode example the resp onse is presented as an Internet page to

use the standard jargon We intend that the results of this thesis apply more broadly

than just to html do cuments on the Internet Nevertheless for the sake of simplicity

we will use the terms page and query response interchangeably

The example information resource provides information ab out two attributes the

names of the countries and their telephone country co des In this thesis we adopt

a standard relational data mo del query resp onses are treated as providing one or

more tuples of relevant information Thus the example resp onse page provides four

hcountry co de i pairs

fhCongo i hEgypt i hBelize i hSpain ig

Informally then the information extraction task here is to extract this set of tuples

from the example query resp onse

We use the phrase semistructured to describ e the countrycode resource While

the information to b e extracted is a set of fragments that have a regular structure

the page also contains irrelevant text Furthermore this regularity ie displaying

countries in b old face and co des in italic is not available as a machinereadable

sp ecication but rather is a fortuitous but idiosyncratic asp ect of the countrycode

resource The phrase semistructured is meant merely to guide ones intuitions

ab out the kinds of information resource in which we are interested and so we will

not provide a precise denition

To extract the tuples listed in Equation the query resp onse must b e parsed to

extract the relevant content while discarding the irrelevant text That is the tuples

in Equation corresp ond to the text fragments that are outlined

HTMLTITLESome Country CodesTITLEBODY

BSome Country CodesBP

B Congo B I IBR

Egypt B I IBR B

Belize B I IBR B

B Spain B I IBR

HRBEndBBODYH TML

Before pro ceeding note that this framework is not fully general The content of

some query resp onses might not corresp ond simply to a set of fragments of the text

For example if the query resp onses are natural language texts eg newswire feed

then dep ending on the task the true content may simply not b e describable using

a simple relational mo del The natural language pro cessing community typically

uses the phrase information extraction in this richer sense Cowie Lehnert

Alternatively the extracted strings may need to b e p ostpro cessed eg to remove

extraneous punctuation While these problems are imp ortant we fo cus on informa

tion resources for which the content can b e captured exactly as a set of fragments of

the raw query resp onse

In this thesis we are concerned mainly with tabular information resources

Roughly a resource is tabular if the attributes and tuples never overlap and if the

attributes o ccur in a consistent order within each tuple As shown the countrycode

resource is tabular While we emphasize tabular layouts note that in Chapter

we discuss information extraction from pages with a hierarchically nested structure

such as a b o oks table of contents

Finally a wrapper is a pro cedure for extracting information from a particular

resource Formally a wrapp er takes as input a query resp onse and returns as output

the set of tuples describing the resp onses information content

For example Figure shows the ExtractCCs pro cedure a wrapp er for the coun

tryco de information resource The wrapp er op erates by rst skipping over the re

sp onses head indicated by the string P and then using the strings B B I

and I to delimit the left and right sides of the country and co de values Sp eci

cally the tuples are extracted starting at the top of the page within each tuple the

country is extracted rst followed by the co de Extraction stops when the tail of the

query resp onse is encountered indicated by HR

In Chapter we describ e ExtractCCs in detail We explain why it works and

argue that it is correct sp ecically we discuss why the head and tail of the page

must b e handled carefully For now the p oint is simply that the ExtractCCs wrapp er

can b e used to extract the information content of resp onses to queries p osed to the

countrycode resource

To summarize we have informally describ ed our mo del of information extraction

When queried an information resource returns a response The information content

of a resp onse is comprised of sp ecic literal fragments of the resp onse the remaining

irrelevant text is discarded during extraction We adopt a relational mo del of a

resp onses information content the content is taken to b e a set of one or more tuples

where each tuple consists of a value for each of a xed set of attributes The ob jects

in a relational database can b e thought of as rows in a table and so we require

that a pages information content b e embedded in the resp onses in a tabular layout

ExtractCCs page P

skip past rst o ccurrence of P in P

while the next o ccurrence of B is b efore the next o ccurrence of HR in P

for each h r i fhB Bi hI I ig

k k

th

attribute extract from P the value of the next instance of the k

b etween the next o ccurrence of and the subsequent o ccurrence of r

k k

return all extracted tuples

Figure The ExtractCCs procedure a wrapper for the countrycode resource shown

in Figure

Finally a wrapper is a pro cedure for extracting the relational content from a page

while discarding the irrelevant text

The formalism

Resources queries and resp onses An information resource S is a function

from a query Q to a response P

The query Q describ es the desired information by means of an expression in

some query language Q For example Q might b e sql ANSI or kqml

Finin et al For typical Internet resources the query is represented by the

arguments to a cgi script ho oho oncsauiuceducgi

The query resp onse P is the resources answer to the query We assume that

the resp onse is enco ded as a string over some alphab et Typically is the

asci i character set and the resp onses are html pages or unstructured natural

language text alternatively the resp onses might ob ey a standard such as kif

logicstanfordedukif or xml wwwworgTRWDxml

Information resource S can thus b e describ ed formally as a function of the form

S Q

query

resp onse

information resource S

Q Q

P

Given our fo cus on information extraction in this thesis we are concerned pri

marily with resp onses rather than the queries The intent is that our information

extraction techniques can b e largely decoupled from issues related to queries For this

reason in the remainder of this thesis we ignore the query language Q Under this

simplication resource S is equivalent to the set of resp onses it gives to all queries

Attributes and tuples We adopt a standard relational data mo del Asso ciated

with every information resource is a set of K distinct attributes each representing

a column in the relational mo del In the countrycode example there are K

attributes

A tuple is a vector hA A i of K strings A for each k The string A

K

k k

th

is the value of k attribute Whereas attributes represent columns in the relational

mo del tuples represent rows

Content and lab els The content of a page is the set of tuples it contains For

example Equation lists the content of the countrycode example page This

notation is adequate but since pages have unbounded length we use instead a cleaner

and more concise representation of a pages content The idea is that rather than

listing the attribute value fragments explicitly a pages label represents the content

in terms of a set of indices into the page

For example the lab el for the countrycode page in Figure c is

D E

h i h i

D E

h i h i

D E

h i h i

D E

h i h i

To understand this lab el compare it to Equation Equation contains four

tuples each tuple consists of K attributes values and each such value is repre

sented by a pair of integers Consider the rst pair h i These integers indicate

that the rst attribute of the rst tuple is the substring b etween p ositions and

ie the string Congo insp ection of Figure c reveals that these integers are

correct Similarly the last pair h i indicates that the last attributes country

co de o ccurs b etween p ositions and ie the string

More generally the content of page P is represented as the lab el

D E

hb e i hb e i hb e i

K K

k k

D E

L

hb e i hb e i hb e i

m m

mK mK

mk mk

D E

hb e i hb e i hb e i

M M M K M K

M k M k

Lab el L enco des the content of page P The page contains M tuples each of

which has K attributes The integers k K are the attributes indices while

the integers m M index tuples within the page Each pair hb e i enco des

mk mk

th

eginning of the k a single attribute value The value b is the index in P of the b

mk

th th

attribute value in the m tuple Similarly e is end index of the k attribute value

mk

th th th

in the m tuple Thus the k attribute of the m tuple o ccurs b etween p ositions

b and e of page P For example the pair hb e i h i in Equation

mk mk

enco des the third tuples second country co de attribute in the example page

Note that missing values are not p ermitted each tuple must assign exactly one

value to each attribute

Tabular layout We are concerned with tabular information resources This re

quirement amounts to a set of constraints over the b and e indices of legal lab els

mk mk

To b e valid the indices must extract from the page a set of strings that corresp ond

to a tabular layout of the content within the page For example if e b for

mk mk

some m and k then the content of the query resp onse simply can not b e arranged

in a tabular manner b ecause the two values o ccur out of order The full details are

conceptually straightforward but uninteresting

For a lab el to b e tabular the following four conditions must hold The individual frag

ments are legitimate b e The content consists of disjoint fragments

mk mk mk

The symbol L refers to the set of all lab els Again the details are tedious

Wrappers Finally we are in a p osition to give a precise denition of a wrapp er

Formally a wrapp er eg the ExtractCCs pro cedure is a function from a query

resp onse to a lab el

query resp onse

lab el

wrapp er

P

L L

At this level of abstraction a wrapp er is simply an arbitrary pro cedure Of course

in the remainder of this thesis we devote considerable attention to particular classes

of wrapp ers

Summary

We have presented a simple mo del of information extraction Information resources

return pages whose tabular contents can b e captured in terms of a label a structure

using a pages indices to represent the text fragments to b e extracted A wrapper is

simply a pro cedure that computes a lab el for a given page

With this background in place we can b egin our investigation of wrapp er induc

tion We start in Chapter with a discussion of the inductive learning framework and

show how it can b e applied to the problem of automatically constructing wrapp ers

b b e b e e All attributes for one tu

mk mk mk mk mk

m k m k m k

ple precede all attributes for the next e b Attributes o ccur sequentially

m

mM mK

within each tuple e b

mk K mk mk

L L where L is the set of lab els containing exactly m tuples Informally L fLj

m m m

m

lab el L contains m tuples and resp ects the tabularity conditions in Footnote g

Chapter

WRAPPER CONSTRUCTION AS INDUCTIVE

LEARNING

Introduction

Our goal is to automatically construct wrapp ers In the previous chapter we de

scrib ed a mo del of information extraction which included the sp ecication of a wrap

p er Our technique for generating such wrapp ers is based on inductive learning We

rst describ e the inductive learning framework Section We then show how to

treat wrapp er construction as an induction task Section

As we will see the input to our wrapp er construction system is essentially a

sample of the b ehavior of the wrapp er to b e learned Under this formulation wrapp er

construction b ecomes a pro cess of reconstructing a wrapp er based on a sample of its

b ehavior

Inductive learning

Inductive learning has received considerable attention in the machine learning com

munity see Angluin Smith or Mitchell Chapters for surveys At the

highest level inductive learning is the task of computing from a set of examples of

some unknown target concept a generalization that in some domainsp ecic sense

explains the observations The idea is that a generalization is go o d if it explains the

observed examples and more imp ortantly makes accurate predictions when addi

tional previouslyunseen examples are encountered

For example supp ose an inductive learning system is told that Thatcher lied

Mao lied and Einstein didnt lie The learner might then hypothesize that the

general rule underlying the observations is Politicians lie This assertion is reason

able b ecause it is consistent with the examples seen so far If asked Did Nixon lie

the learner would then presumably resp ond Yes

We pro ceed by presenting a formal framework for discussing inductive learning

Section We then describ e Induce a generic inductive learning algorithm that

can b e customized to learn a particular hypothesis class Section We go on

to describ e the PAC mo del a statistical technique for predicting when the learner

has seen enough examples to generate a reliable hypothesis Section Finally

we describ e the few minor ways in which our treatment of induction is somewhat

unconventional Section

The formalism

An inductive learning task consists of three elements

a set I f I g of instances

a set L f L g of labels and

a set H f H g of hypotheses

Each hypothesis H H is a function from I to L with the notation

H I L

indicating that hypothesis H H assigns lab el L L to instance I I

In the p olitician example the set of instances I captures the set of p eople ab out

whom assertions are made and the lab el set L indicates whether a particular p erson

is a liar

I fThatcher Mao Einstein Nixon g

L fliar truthfulg

Since hypotheses are functions from I to L the hypothesis Men lie corresp onds to

the function that classies any male as liar

The idea is that the learner wants to identify some unknown target hypothesis

T H Of course the learner do es not have access to T Instead the learner observes

I L

a set E of examples E f hI T I i g is a sample of the instances

each paired with its lab el according to T The learners job is to reconstruct the

target T from the sample E

This notion of a supply of examples of T is formalized in terms of an oracle that

supplies lab eled instances In this simple mo del of induction we assume that the

learner has no control over which examples are returned by the oracle Formally

an inductive learning algorithm is given access to a function Oracle Oracle is

T T

a function taking no arguments and returning a pair hI T I i where I I and

T I L The subscript T on the oracle function emphasizes that the oracle dep ends

on the target T The inductive learner calls Oracle rep eatedly in order to accumulate

T

the set E of examples

The Induce algorithm

An inductive learning algorithm can b e characterized as follows the algorithm takes

as input the oracle function Oracle and outputs an hypothesis H H However in

T

the remainder of this thesis it will b e helpful to op en up this black b ox sp ecication

in order to concentrate on learning algorithms that op erate in a sp ecic manner

In particular Figure presents the Induce generic inductive learning algorithm

Induce takes two inputs

the oracle function Oracle and

T

a generalization function Generalize sp ecic to the hypothesis class H b eing

H

learned

Given these two inputs Induce must output an hypothesis H H

Note that Induce is a generic inductive learning algorithm in the sense that it

can b e customized to learn a target T in any hypothesis class H by supplying the al

gorithm with the appropriate oracle and generalization functions as input The oracle

E fg

Oracle

T

rep eat

hypothesis

hI Li Oracle

T

H H

E E fhI Lig

until termination condition is satised

Generalize

H

return Generalize E

H

Figure The Induce generic inductive learning algorithm preliminary version see

Figure

function Oracle was just discussed in the remainder of this section we describ e how

T

Induce works which will lead to a description of the second input the Generalize

H

function

The Induce algorithm op erates as follows First Induce uses the Oracle function

T

to accumulate a set E f hI Li g of examples Unsp ecied in Figure is the

termination condition describing when enough examples have b een gathered As will

b e describ ed in Section this termination condition is imp ortant for ensuring

that Induces output is satisfactory

After gathering the examples E Induce then passes E to the Generalize function

H

Recall that Generalize is a domainsp ecic function sp ecialized to the hypothe

H

sis class H under consideration Let us emphasize again that Induce is a generic

learner with the domainsp ecic heavy lifting b eing done by Generalize For

H

mally Generalize takes as input a set of examples E and returns an hypothesis

H

I L

H H thus the generalization function has the form Generalize H

H

We now illustrate Induce using a generalization of the p olitician example intro

duced earlier Consider learning the hypothesis class corresp onding to all conjunctions

of Bo olean literals cbl comp osed from N binary features x x x The fea

N

The literals of prop osition x are x and its complement x

N

tures dene the set of instances I ftrue falseg Instances are classied in just

two ways L ftrue falseg The hypothesis class H consists of conjunctions

cbl

of the N literals H fv v j each v is a literalg In the p oliticians

n

cbl

example x might indicate that a p erson is a male x that they have red hair and

x that they are a p olitician Each p erson could then represented by a Bo olean as

signment to each variable and the lab el true might indicate that the p erson do esnt

lie while false indicates the p erson is a liar

To use the Induce algorithm to learn H we must provide it with two inputs

cbl

Oracle and Generalize For now we assume that the oracle function is built

T

cbl

using whatever means are appropriate to the particular learning task for example a

p erson might act as the oracle

The second input is the generalization function Generalize

cbl

Generalize examples E

cbl

H x x x x

N N

for each example hI truei E

for each feature x

n

if x true in instance I then remove x from H

n n

else remove x from H

n

return H

Generalize starts with an hypothesis containing every literal and then eliminates

cbl

the literals that are inconsistent with the true instances false instances are simply

ignored

Is Generalize a go o d generalization function As well see next the answer

cbl

to this question is tightly coupled to Induces termination condition left unsp ecied

in Figure

PAC analysis

The Generalize function demonstrates how the Induce generic induction algorithm

cbl

is customized to a particular hypothesis class H in this case But so far we

cbl

have neglected a crucial question What reason is there to b elieve that the stated

Generalize function ouputs go o d hypotheses Why should we b elieve it will p er

cbl

form b etter than for example simply returning the hypothesis that classies every

instance as true

Note that the answer is directly relevant to Induces termination condition left

unsp ecied in Figure This termination condition governs how many exam

ples Induce accumulates into the set E b efore invoking the generalization function

Generalize In this section I describ e one particular termination criterion based on

H

robably approximately correct PAC mo del Valiant Kearns Vazirani the p

Martin Biggs

Ideally we want the Induce learning algorithm to output the target hypothesis

ie the function actually used to lab el the examples However since the target

hypothesis is unknown this intuition do es not yield a satisfactory evaluation criterion

A more sophisticated alternative involves estimating the p erformance of an hy

p othesis on instances other than the examples By denition the target will p erform

p erfectly on future instances we want Induce to output an hypothesis that is exp ected

to p erform well on future instances to o PAC analysis is a technique for estimating the

exp ected future p erformance of an hypothesis this estimate is based on the examples

from which the hypothesis was generated

The PAC mo del adopts a statistical approach to evaluating a learner PAC analy

sis starts with the observation that in many learning tasks it makes sense to assume

that b oth the testing and training instances are distributed according to a xed but

unknown and arbitrary probability distribution Thus the extent to which an hy

p othesis p erforms well with resp ect to this distribution can b e estimated from its

p erformance on training data drawn from the same distribution Sp ecically the

PAC mo del uses as a p erformance measure the probability that the learned hypoth

esis is correct

The PAC mo del is formalized as follows

Let the instances I b e distributed according to D a xed but unknown and

arbitrary probability distribution over I We will b e interested in the chance of

drawing an instance with particular prop erties the probability of selecting from

an instance I with prop erty q is written D I jq I

The example oracle Oracle is mo died so that it returns instances distributed

T

according to D We use the notation Oracle to emphasize this dep endancy

T D

The error E H of hypothesis H H is the probability with resp ect to D

T D

that H and T disagree ab out a single instance drawn randomly from D

E H D I jH I T I

T D

Finally let b e a parameter describ e the desired accuracy and

b e a second parameter the desired reliability The meaning of these two

parameters will b e describ ed shortly

The idea of the PAC mo del is that the learner Induce invokes Oracle as many

T D

times as needed to b e reasonably sure that its hypothesis will have low error In

general we exp ect that the error of the hypotheses generated by Induce will decreases

with the number of training examples This intuition is formalized statistically as

follows

PAC termination criterion An inductive learning algorithm for hy

p othesis class H should examine as many examples as are needed in order

to ensure that for any target T H distribution D and

the algorithm will output an hypothesis H H satisfying E H

T D

with probability at least

Under this formalization the parameters and dene what low error and rea

sonably sure resp ectively mean the learned hypothesis must have error b ounded

by and this must happ en with probability at least Note that the learner

must p erform increasingly well as and approach zero

As illustrated in Figure the two parameters and are needed to handle the

two kinds of diculties Induce may encounter The reliability parameter is required

b ecause the PAC termination criterion do es not involve the particular examples in E

instance space instances needed to discriminate target from low-error alternatives A B

instances need to produce

a low-error hypothesis

Figure Two parameters and are needed to hand le the two types of diculties

that may occur while gathering the examples E

and thus Induce could unluckily receive from Oracle an unrepresentative sample of

T D

the instances In Figure the learner may b e unlucky and see no instances from

the region marked A In summary there is always some chance that the error of the

learned hypothesis will b e large

On the other hand the accuracy parameter is required b ecause D might b e

structured so that there is only a small chance under D of encountering an example

needed to discriminate b etween the target T and alternative lowerror hypotheses In

Figure there is only a small chance of seeing the instances in the region marked

B To summarize it is unlikely that the error will b e exactly zero

Figure illustrates how to incorp orate the PAC mo del into the Induce algorithm

This revised version of Induce takes as input parameters and and the algorithm

stops gatherings examples when they satisfy the PAC termination criterion

In general whether the PAC termination criterion can always b e satised de

p ends on the hypothesis class H If such a guarantee can b e made then following

Kearns Vazirani Denition we say that hypothesis class H is PAClearnable

Denition PAClearnable Hypothesis class H is PAClearnable

i there exists a generalization function Generalize with the fol lowing

H

property for every target hypothesis T H for every distribution D over

Oracle

E fg

T D

rep eat

hI Li Oracle

T D

hypothesis

Generalize

H

E E fhI Lig

H H

until with probability at least

E Generalize E

PAC parameters

T D H

return Generalize E

H

Figure A revised version of Induce see Figure

the instances I and for al l and if Induce is given

as input Generalize Oracle and then with probability at least

H T D

Induce wil l output an hypothesis H H satisfying E H

T D

If in addition Generalize runs in time polynomial in and the size

H

of the instances then we say that H is eciently PAClearnable

The standard technique for establishing that class H is PAClearnable is to de

termine given values of and but with no knowledge of T or D the least value

B such that if jE j B then the PAC termination criterion will hold B is a lower

b ound on the number of examples needed to satisfy the PAC termination criterion

In general B dep ends on and we write B B to emphasize this relation

ship The Induce learning algorithm uses this b ound B to decide when it has

gathered enough examples B serves that the termination conditions unsp ecied

We are sweeping a subtlety of the PAC mo del under the rug here Recall the class H over

cbl

instances dened by N binary features We ought to give a learning algorithm more time as N

increases For example N is clearly a lower b ound on the time to read a single instance as well

as the time to write out the nal hypothesis Thus N captures the complexity of learning H

cbl

More generally for any particular learning task we want to allow the learner algorithm additional

time as the tasks natural complexity measure increases We will return to this issue in Section

when we apply the PAC mo del to the task of learning hlrt wrapp ers See Kearns Vazirani

Section for a thorough discussion

in Figure

For example recall the p olitician example and H hypothesis class introduced

cbl

earlier It is straightforward to show Kearns Vazirani Theorem that for

all D and T if Induce gathers a set of examples E such that

N

jE j ln N ln

where the instances are dened by N binary features then the PAC termination

criterion is satised We conclude that H is PAClearnable b ecause Induce can

cbl

N

use the value B ln N ln as its termination condition Furthermore

and and the Generalize function note that B is p olynomial in N

cbl

runs in time p olynomial in jE j and N therefore we conclude that H is eciently

cbl

PAClearnable

Departures from the standard presentation

The formal framework for inductive learning presented in this chapter diers some

what from the standard presentation usually found in the machine learning lit

erature eg Mitchell Formally sp eaking these discrepancies are relatively

unimp ortant Throughout our ob jective is to simplify as much as p ossible our no

tation and formalism In some cases this ob jective has lead to describing inductive

learning in more general terms than usual in other cases we have presented a rela

tively restricted notion of induction In this section we briey highlight the dierences

b etween the standard presentation and ours

First we present the instances simply as a set I rather than the usual practice

of describing I as the space induced by a given language for representing instances

Similarly we dene H as a set of functions rather than as the space induced by an

hypothesis representation language Of course in order to actually apply the induc

tive framework to a particular learning task one must select representation languages

for the instances and hypotheses and the key to successful learning is invariably to

bias Mitchell the learning algorithm so that it considers only hypotheses in a

carefully crafted space

For example the cbl example was describ ed by sp ecifying that I

N

ftrue falseg for some xed N and H fv v j each v is a literalg

n

cbl

we have seen that cbl is eciently learnable However this thesis is concerned with

information extraction and wrapp er induction and it turns out that the representa

tion languages that have b een examined are not appropriate for our application In

Chapters and we describ e wrapp er induction by precisely sp ecifying how to bias

the Induce inductive learning algorithm in order to make wrapp er induction feasible

Second the standard presentation assumes that the set L of lab els has xed car

dinality often L is simply assumed to b e the set of Bo olean values ftrue falseg

For example concept learning is a wellstudied sp ecial case of induction Mitchell

Chapter a concept is equated with the set of instances in its extension or equiva

lently the function which maps instances of the concept to true and other instances

to false In contrast we p ermit L to have arbitrary including innite cardinal

ity As well see allowing L to have arbitrary cardinality is essential to formalizing

wrapp er construction as an induction task

Third we assumed that the learner gains knowledge of the target hypothesis

T only by means of the oracle Oracle In this thesis we ignore other kinds

T D

of queries to the oracle such as equivalence and membership queries Angluin

Angluin Laird Furthermore we have considered only p erfect noisefree

oraclesie invoking Oracle returns exactly hI T I i for some instance I

T D

rather than sometimes rep orting T I incorrectly We will continue to assume noise

free oracles until Chapter when we consider extensions to this simple mo del in the

context of wrapp er induction

Fourth even assuming simple noisefree oracles the Induce algorithm is highly

sp ecialized to the task at hand Induce simply gathers a set E of examples and

calls the generalization function Generalize once with input E In the context of H

wrapp er induction or learning cbl this simple algorithm is sucient But other

strategies might b e useful in other learning tasks For example boosting algorithms

Kearns Vazirani Chapter invoke Generalize several times with dierent sets

H

of gathered examples and output the hypothesis that p erforms b est against additional

examples from Oracle

T D

Note also that unlike most descriptions of induction we explicitly decomp ose the

learning algorithm into generic learner Induce and a domainsp ecic generaliza

tion function Generalize We do this b ecause it will b e convenient to b e able to

H

simply plug in generalization functions for dierent wrapp er classes

Finally the denition of PAC learnability Denition is stated in terms of the

Induce algorithms Generalize input Usually PAC learnability is dened by saying

H

that there must exist a learning algorithm with a particular prop erty namely that

the PAC termination criterion is satised for any T and D Denition

is simply a restricted version of this more general denition sp ecialized to the case

when the learning algorithm is Induce Clearly if H is PAClearnable according to

our Denition then it is PAClearnable in the more general sense

Wrapper construction as inductive learning

This chapter has b een concerned with inductive learning in general With this back

ground in place we now show how wrapp er induction can b e viewed as a problem of

induction

Recall that an induction task is comprised of a a set of instances I b a

set of lab els L and c a set of hypotheses H In our framework to learn H we

must provide the Induce generic learning algorithm with two inputs d the oracle

function Oracle and e the function Generalize Finally need to develop f a

T D H

PAC mo del of learning H

The corresp ondence b etween inductive learning and wrapp er induction is as fol

lows

a The instances I corresp ond to query resp onses from the information resource

under consideration In the countrycode example describ ed in Chapter I

would b e a set of html strings similar to Figure c

b The lab els L corresp ond to query resp onse lab els For example the lab el of the

html do cument in Figure c is shown in Equation

Note that in typical inductive learning tasks lab els are simply category assign

ments such as liar or truthful while our lab els are complex multidimensional

structures However a single such structure corresp onds to a pages lab el just

as the single lab el liar is assigned to Nixon Let us explicitly state that even

though our lab els are structured we do not assign multiple lab els to any in

stance

c Hyp otheses corresp ond to wrapp ers and an hypothesis bias corresp onds to a

class of wrapp ers For example the ExtractCCs wrapp er shown in Figure is

one candidate hypothesis Note that ExtractCCs satises the formal denition

of an hypothesis b ecause it takes as input a query resp onse and outputs a lab el

As hinted at in Chapter and describ ed in detail in Chapter ExtractCCs is a

member of the hlrt wrapp er class In Chapter we identify several additional

classes of wrapp ers which are learnable under this framework

d The Oracle function pro duces a lab eled example query resp onse for a partic

T D

ular information resource

The idea is that asso ciated with each resource is a target wrapp er T Of course

in our wrapp er construction application T usually do es not exist until our

system learns it Nevertheless we can treat the oracle as b eing dep endent on

T The function Oracle need only return an example of how the target would

T D

b ehave the oracle do es not require access to the target itself

As a simple example a p erson might play the role of the oracle Alternatively

we describ e in Chapter techniques for automatically lab eling pages These

techniques can b e used to implement Oracle without requiring access to T

T D

e In Chapter we discuss how to implement the Generalize the generalization

hlrt

function for the hlrt wrapp er class Chapter then go es on to dene this

function for several more classes

f In Chapter we develop a PAC mo del of our wrapp er learning task

Summary

Inductive learning is a wellstudied mo del for analyzing and building systems that

improve over time or generalize from their exp erience The framework provides a rich

variety of analytical techniques and algorithmic ideas

In this chapter we showed how our wrapp er construction task can b e viewed as

one of induction To summarize query responses corresp ond to instances a pages

information content is its label and wrappers corresp ond to hypotheses

Of course so far we have only sketched out the mapping b etween wrapp er con

struction and induction In Chapter we esh out the details for one particular

wrapp er class hlrt and in Chapter we consider ve additional wrapp er classes

Chapter

THE HLRT WRAPPER CLASS

Introduction

In Chapter we discussed information extraction in general and using wrapp ers

for information extraction in particular We then discussed inductive learning our

approach to automatically constructing wrapp ers We noted that the key to a suc

cessful induction algorithm is invariably to bias the algorithm so that it considers

a restricted class of hypotheses Mitchell In this chapter we describ e such a

restricted class for our application the hlrt wrapp er class is a generalization of the

ExtractCCs pro cedure describ ed in Chapter

We pro ceed as follows First we describ e the hlrt class Section We then

show how to learn hlrt wrapp ers by describing the hlrt sp ecic generalization

function required by the Induce generic learning algorithm In Section we present

a straightforward though inecient implementation of this function We then an

alyze the computational complexity of our algorithm and use this analysis to build

a more ecient algorithm Section We then analyze our algorithm heuristically

in order to understand theoretically why our algorithm runs so quickly in practice

Section Finally in Section we develop a PAC mo del for the hlrt wrapp er

class

hlrt wrapp ers

Recall the wrapp er ExtractCCs Figure for the example countrycode resource

Figure The ExtractCCs wrapp er op erates by searching for the strings B and

B in order to lo cate and extract the countries and for the strings I and I

to extract the country co des

ExtractCCs is somewhat more complicated b ecause this simple l eftright lr

strategy fails the top and the b ottom of the page are formatted in such a way that

not all o ccurrences of B B indicate a country However the string P can b e

used to distinguish the head of the page from its b o dy Similarly HR separates the

eadleftrighttail hlrt approach last tuple from the tail This more sophisticated h

can b e used to extract the content from the countrycode resource

The ExecHLRT pro cedure In this thesis we fo cus on hlrt wrapp ersinformally

hlrt wrapp ers are those that are structurally similar to ExtractCCs The idea is

that ExtractCCs is one instance of a particular idiomatic programming style that

is useful when writing wrapp ers Figure lists ExecHLRT a template for writing

wrapp ers according to this idiomatic style ExecHLRT generalizes the ExtractCCs

wrapp er by substituting variables in place of the constant strings that are sp ecic

to the countrycode example and by allowing K attributes p er tuple instead of

exactly two Note that the hlrt delimiters must b e constant strings rather than

for example regular expressions

Sp ecically the variable h in ExecHLRT represents the head delimeter h P

for ExtractCCs The variable t represents the tail delimiter t HR in the example

There are K attributes in the example The variable indicates the string mark

ing the lefthand side of the rst attribute the country B in the example r

marks the righthand side of the rst attribute r B in the example Finally

and r mark the left and righthand resp ectively sides of the second attribute

the co de I and r I in the example

ExecHLRT op erates by skipping over the pages head marked by the value h Next

the tuples are extracted one by one stopping when the pages tail indicated by t

is encountered The algorithms outer lo op terminates when the tail is encountered

ExecHLRTwrapp er hh t r r i page P

K K

skip past the rst o ccurrence of h in P

while the next o ccurrence of is b efore the next o ccurrence of t in P

a

for each h r i fh r i h r ig

k k K K

th

extract from P the value of the next instance of the k attribute

b etween the next o ccurrence of and subsequent o ccurrence of r

k k

return all extracted tuples

ExecHLRTwrapp er hh t r r i page P

K K

i P h a

m

while jP itj jP i j b

m m

b

for each h r i fh r i h r ig

K K

k k

i i P i j j c

k k

b i d

mk

i i P ir e

k

e i f

mk

return lab el f h hb e i i g g

mk mk

Figure The hlrt wrapper procedure template a pseudocode and b details

The inner lo op extracts each of a tuples attributes using the left and righthand

k

r delimiter for each attribute in turn

k

ExecHLRT is essentially a template for writing hlrt wrapp ers For instance if

we instantiate the ExecHLRT template with the values K h P t HR

B r B I r I the result is the ExtractCCs wrapp er

The ExecHLRT pro cedure Details Figure presents ExecHLRT at two levels

of detail In part a of the gure we provide a highlevel pseudoco de description

of the algorithm In App endix B we provide pro ofs for the theorems and lemmas

stated in this thesis To do so involves reasoning formally ab out the b ehavior of

the ExecHLRT algorithm Such reasoning involves sp ecifying the algorithm in more

detail than is supplied in Figure a part b provides the additional details

Before pro ceeding we briey discuss these details

ExecHLRT op erates by incrementing an index i over the input page P The

variable i is increased by searching for the hlrt delimiters in the page The attributes

b eginning indices the b and ending indices the e are computed based on the

mk mk

values taken by i as the page is pro cessed

Figure b makes use of the and string op erators In App endix C

we describ e our string algebra here we provide a brief summary If s and s are

strings then ss is the sux of s starting at the rst o ccurrence of s with ss

indicating that s do es not o ccur in s For example abcdecdf cd cdecdf while

abc xyz While is a string search op erator is a string index op erator

ss is the p osition of s in s for example abcdef cd

ExecHLRT pro ceeds as follows First line a the index i p oints to the head

delimiter h in the input page P For each iteration of the outer while lo op line

b i p oints to a p osition upstream of the start of the next tuple For each

iteration of the inner for each lo op i p oints rst line c at the b eginning of

th th

tuples k attribute The variable i then line e p oints to one character the m

th th

tuples k attribute The values of b and e are b eyond the end of the m

mk mk

set using these two indices lines d and f The outer lo op terminates line

b when the next o ccurrence t o ccurs b efore the next o ccurrence of if there

is one indicating that P s tail has b een encountered

Formal considerations Note that we delib erately ignore failure in ExecHLRT

For example what happ ens if the head delimiter h do es not o ccur in page P From

a practical p ersp ective these are imp ortant issues though it is trivial to add co de to

detect these situations However for now this complication need not concern us As

well see we can ignore failure b ecause our induction system do es not actually invoke

wrapp ers Rather it reasons ab out what would happ en if they were to b e invoked

the consistency constraints developed in Section are a precise characterization

of the conditions under which ExecHLRT will fail

The notation presented in Chapter can b e used to formally describ e the b ehavior

of ExecHLRT Let w b e an hlrt wrapp er P b e a page and L b e a lab el In Chapter

we used the notation w P L to indicate that the L is the result of applying wrapp er

w to page P ExecHLRT is simply a pro cedural description of this relationship

w P L if and only if L is output as a result of invoking ExecHLRT on w and P

That is the notation w P L is an abbreviation for ExecHLRTw P L

Note that given the ExecHLRT pro cedure an hlrt wrapp ers b ehavior can b e

entirely captured in terms of K strings h t r and r where each

K K

tuple consists of K attributes For example as describ ed earlier the six strings

hP HR B B I I i exactly sp ecify the ExtractCCs wrapp er For this

reason in the remainder of this thesis we treat a wrapp er as simply a vector of

strings the ExecHLRT pro cedure itself plays a relatively minor role Implicit in the

use of the notation hh t r r i is the fact that the ExecHLRT pro cedure

K K

denes such a wrapp ers b ehavior

Denition hlrt An hlrt wrapp er is a vector

hh t r r i consisting of K strings where the pa

K K

rameter K indicates the number of attributes per tuple The hlrt

wrapp er class H is the set consisting of al l hlrt wrappers

hlrt

This chapter concerns the hlrt wrapp er class when not stated explicitly any use

of the generic term wrapp er refers to hlrt wrapp ers only

Strictly sp eaking each integer K induces a dierent hypothesis set H and thus H is

hlrt hlrt

in fact a function of K However the value of K will always b e clear from context and thus for

simplicity we do not explicitly note this dep endency

The Generalize algorithm

hlrt

We want to use the Induce generic induction algorithm to learn hlrt wrapp ers Recall

that Induce takes as input a function Generalize which is customized to the particular

H

hypothesis class H b eing learned In this section we describ e Generalize the

hlrt

generalization function for the H hypothesis class

hlrt

Generalize takes as input a set E fhP L i hP L ig of examples

hlrt N N

Each example is a pair hP L i where P is a page and L is its lab el according to the

n n n n

target wrapp er T L T P When invoked on a set of examples Generalize

n n

hlrt

returns a wrapp er hh t r r i H

K K hlrt

At a theoretical level what formal prop erties should Generalize exhibit In

hlrt

an inductive learning setting we generally can not insist that the learner b e perfect

ie that it always output the target hypothesis The reason is simply that by

denition induction involves drawing inferences that may b e invalid Of course

the hop e is that the inferences will b e appropriate if based on a signicant amount

of training examples Rather than p erfect we want our learner to b e consistent

ie it always outputs an hypothesis that is correct with resp ect to the training

examples As well see consistency is an imp ortant formal prop erty not only b ecause

it is intuitively reasonable but also b ecause our PAC analysis of hlrt requires that

Generalize E b e consistent

hlrt

We therefore b egin our description of Generalize by describing the conditions

hlrt

under which an hlrt wrapp er is consistent with a set of examples These conditions

which we call the hlrt consistency constraints are used throughout this thesis to

understand the nature of hlrt wrapp er induction

The section is organized as follows

In Section we present the hlrt consistency constraints

In Section we present Generalize a sp ecialpurp ose constraint

hlrt

satisfaction engine customized to the hlrt consistency constraints

In Section we illustrate Generalize by walking through an example hlrt

Finally in Section we prove that Generalize is consistent

hlrt

The hlrt consistency constraints

We b egin with a denition of consistency Though more general denitions are p ossi

ble we will sp ecialize our denition to the use of the Induce generic learning algorithm

Denition Consistency Let H be an hypothesis class

Generalize is consistent i for every set of examples E we have

H

that H I L for every hI Li E where H Generalize E is the

H

hypothesis returned by Generalize when given input E

H

To reason ab out the consistency of Generalize we must apply this denition

hlrt

to the H wrapp er class What conditions must hold for wrapp er w to b e con

hlrt

sistent with a particular example hP Li The hlrt consistency constraints provide

the answer to this question By examining the ExecHLRT pro cedure in detail we can

dene a predicate C w hP Li which ensures that ExecHLRT computes L given

hlrt

w and P That is C w hP Li provides the necessary and sucient conditions

hlrt

under which w P L

A precise description of C requires a signicant amount of notation Before

hlrt

tackling these details well illustrate the basic idea using the countrycode example

Figure

Example Why do es the ExtractCCs wrapp er work Consider a relatively simple

asp ect of this question Why is B a satisfactory value for r in the hlrt enco ding

of the ExtractCCs wrapp er To answer this question we must examine the ExecHLRT

pro cedure Figure

The variable r takes on tihe value r B in the rst iteration of the inner for

k

each lo op ExecHLRT scans the input page extracting each attribute value in turn

In particular for each tuple ExecHLRT lo oks for r immediately after the lo cation

where it found the rst attribute is extracted b etween these two indices Thus

r must satisfy two prop erties r must o ccur immediately after each instance of the

rst attribute in each tuple and r must not o ccur in any of the instances of the

rst attribute itself otherwise r would o ccur upstream b efore the true end of

the attribute We can see that B satises b oth prop erties B is a prex of the

strings o ccurring after the countries in this case each is the string B I and

B is also not a substring of any of the attribute values Congo Egypt Belize

Spain

On the other hand consider the delimiter r go This delimiter clearly do esnt

work but why not exactly The reason is that the string go satises neither of the

two prop erties just describ ed go is not a prex of any of the B I strings that

o ccur after the countries and it also is a substring of one of the countries Congo

This simple example illustrates the constraints that the delimiter r must ob ey

We can concisely state the constraints on r by introducing some auxiliary ter

minology The idea is that we can simplify the sp ecication of the hlrt consistency

constraints by assigning names to particular pieces of the input page Figure

illustrates that the example page b egins with a head each tuple is comp osed of a set

of attribute values values within a single tuple are separated by the intratuple sep

arators tuples are separated from one another by the intertuple separators nally

the page ends with a tail Note that these various pieces constitute a partition of

the page exactly how a page is partitioned dep ends on its lab el

Using this terminology we can say that B is a satisfactory value for r b ecause

B is a prex of the intratuple separators b etween the rst and second attribute

lab eled S in Figure yet B do es not o ccur in any rst attributes instances

m

lab eled A

m

The r constraint just describ ed involves only the intratuple separators and the

In the gure the symbol indicates a carriage return character see App endix C

Congo B I IBR B Egypt B I B Spain B I IBR HR HTML HTML P B

z z z z z z z z z z

head value intratuple value intertuple value value intratuple value tail

S

of of

of of of

S

K

st

st th th nd

separator separator separator

tuples

tuples

tuples tuples

tuples

st nd

st nd

st

b etween

b etween b etween

attribute

attribute

attribute attribute

attribute

st st th

A

A

A A

tuples

A

tuples

nd

tuples

st

st

nd

S

nd

attributes

attributes

S

S

Figure A label partitions a page into the attribute values the head the tail and

the inter and intratuple separators For brevity parts of the page are omitted

attribute values As well see the other parts of the pages partition are required for

the full sp ecication of the hlrt consistency constraints

The denition of C Denition b elow states the predicate C As well

hlrt hlrt

see in Section computing Generalize E is a matter of nding a wrapp er w

hlrt

such that C w hP Li for every hP Li E

hlrt

Denition hlrt consistency constraints Let w

hh t r r i be an hlrt wrapper and hP Li be an ex

K K

ample w and hP Li satisfy the hlrt consistency constraintswritten

C w hP Liif and only if the predicates CC dened in Figure

hlrt

hold

Cr hP Li C w hP Li

hlrt

k

k K

C hP Li

k

k K

Ch t hP Li

C constraints on the r Every r must i b e a prex of the subsequent

k k

intratuple separators and ii must not o ccur within any of the corresp onding

attribute values

Cr hP Li S r S i

mM

k mk k mk

A r ii

mk k

C constraints on the Every except must b e a prop er sux of the

k k

preceding intratuple separators

C hP Li S

mM

k mk k k

C constraints on h t and i must b e a prop er sux of the substring

of the pages head following h ii t must not o ccur following h but b efore in

the pages head iii must not precede t in the pages tail iv must b e a

prop er sux of each of the intertuple separators and v t must never precede

in any intertuple separator

Ch t hP Li S h i

K

jP htj jP h j ii

jS j jS tj iii

M K M K

S iv

mK mM

j v tj jS jS

mK mK

Figure The hlrt consistency constraint C is dened in terms of three pred

hlrt

icates CC

The notation C w E is a shorthand for C w hP Li

hlrt hlrt

hP L iE

Figure lists the three predicates CC in terms of which C is dened

hlrt

CC are a convenient way to decomp ose the predicate C into more easily

hlrt

digested parts Each predicate governs a particular asp ect of consistency Sp ecically

C ensures that each delimiter r is correct C governs the for k and

k k

C that h t and are correct As well see in Section the fact that C

C each constrain dierent hlrt comp onents is the key to designing an ecient

Generalize algorithm hlrt

In the remainder of this section we describ e the predicates CC The casual

reader might prefer to skip directly to Section on page

String algebra The sp ecication of CC in Figure makes use of the string

op erator dened in App endix C

The partition variables A and S Figure refers to the variables S

mk mk mk

and A Earlier in this section Figure illustrated how a page is partitioned by

mk

its lab el The variables S and A o ccur in the denitions of predicates CC

mk mk

in order to precisely refer to the parts of this partition

For example we saw that the value of r is constrained by the values of the rst

attribute referred to with the variables A for each m M as well as the

m

text o ccurring b etween these attributes values and the next the S From Figure

m

we can recognize this constraint on r as the predicate C

More formally supp ose P s lab el L indicates that P contains M jLj tuples

each consisting of K attributes

D E

hb e i hb e i hb e i

K K

k k

D E

L

hb e i hb e i hb e i

m m

mK mK

mk mk

D E

hb e i hb e i hb e i

M M M K M K

M k M k

Observe that P is partitioned by L into MK substrings

S A S A S S A S

K K K K

P

A S A S S A S

m m m m

mK mK mK

A S A S S A S

M M M M M K M K M K

This partition of P is dened in terms of MK values of A and MK values of

mk

S These variables are dened as follows

mk

ttribute in each of page P s tuples Sp ecically The A are the values of each a

mk

th th

A is the value of the k attribute of the m tuple on page P In terms of

mk

the indices in lab el L

i h

A P b e

mk mk mk

The A of course are simply the text fragments to b e extracted from page

mk

P

The S are the s eparators b etween these attribute values There are four

mk

kinds of separators

Page P s head denoted S is the substring of the page prior to the rst

K

attribute of the rst tuple

S P b

K

Page P s tail denoted by S is the substring of the page following the

M K

last attribute of the last tuple

h i

S P e jP j

M K M K

The intratuple separators separate attributes within a single tuple Sp ecif

st th

and k ically S for all k K is the separator b etween the k

mk

th

attribute of the m tuple in page P

i h

S P e b

mk mk mk

The intertuple separators separate consecutive tuples Sp ecically S

mK

th st

is the separator b etween the m and m tuple on page P for

m M

h i

S P e b

m

mK mK

Recall from App endix C that the notation sb e is the substring of s from p osition b to p osition

e

Note that the subscripts on the head S and tail S generalize the

K M K

intertuple separators the head separates the zeroth and rst tuples

and the tail separates the last and p ostultimate tuples

Note that the A and S do in fact partition P b ecause collectively they cover

mk mk

the entire string P

Finally the notation S refers to the concatenation of S with al l subsequent

mk

mk

partition elements More precisely

S S A S A S A S A S

m m

mK mK M K M K

mk mk mk

mk

b ecause a For example under this notation page P can b e written as P S

K

page consists of its head S followed by the rest of the page Note also that

K

b ecause nothing follows a pages tail S S

M K

M K

Before pro ceeding let us clarify that the notation A and S obscures an

mk mk

imp ortant p oint the value of each partition variable obviously dep ends on the page

b eing partitioned For example given a set of pages P P it is ambiguous to

N

which page the variables A or S refer Rather than further complicate the

mk mk

n

notation eg by noting the relationship with a sup erscript A we will always

mk

explicitly mention the page under consideration

Explanation of Civv The predicates CC are tersely presented in Table

The details are required only for the pro ofs in App endix B The casual reader

need not comprehend CC at such depth the Englishlanguage descriptions given

in Figure should suce Nevertheless let us now explain part of the notation in

order to motivate and explain CC Predicate C is relatively complex and so

exploring just Civv in detail will illuminate the rest

Predicate Civv ensures that and t are satisfactory with regard to a pages

b o dy The quantier states that Civv must hold for each tuple m

mM

except the rst and last which are handled by Ciii and Ciii resp ectively

Civ S ensures that is a prop er prex of every S To see

mK mK

this note that for any strings s and s ss s i s is a prop er sux of s

j ensures that the tail delimiter t must always o ccur tj jS Cv jS

mK mK

th

intertuple separator To see this note that jss j is the p osition after in the m

of s in s therefore jss j jss j i s o ccurs after s in s

C is correct We have dened a predicate C and claimed that it ex

hlrt hlrt

actly captures the conditions under which a wrapp er is consistent with a particular

example We now formalize this claim

Theorem C is correct For every hlrt wrapper w page P

hlrt

and label L C w hP Li ExecHLRTw P L

hlrt

See App endix B for the pro of

Summary In this section we have dened the hlrt consistency constraints the

conditions under which a wrapp er is consistent with a given example Sp ecically we

have introduced the predicate C w hP Li which holds if and only if w P

hlrt

Lie just in case hlrt wrapp er w generates lab el L for page P

C is sp ecied in terms of a complicated set of notation C is dened

hlrt hlrt

in terms of a conjunction of more primitive predicates CC In turn CC

are dened in terms of the variables A and S which indicate the way that L

mk mk

partitions P into a set of attribute values and the separators o ccurring b etween them

Generalize

hlrt

We introduced the hlrt consistency constraint predicate C b ecause it pro

hlrt

vides a straightfoward way to describ e the Generalize algorithm Generalize

hlrt hlrt

is a sp ecialpurp ose constraintsatisfaction engine that takes as input a set E

hhP L i hP L ii and that solves problems of the following form

N N

Generalize examples E fhP L i hP L ig

hlrt N N

for r each prex of P s intratuple separator for S a

for r each prex of P s intratuple separator S b

K K

for each sux of P s head S c

K

for each sux of P s intratuple separator S d

for each sux of P s intratuple separator S e

K K

for h each substring of P s head S f

K

for t each substring of P s tail S g

M K

w hh t r r i h

K K

if C w E then i

hlrt

return w j

Figure The Generalize algorithm

hlrt

Variables h t r r

K K

Domains each variable can b e an arbitrary character string

Constraints C w E where w hh t r r i

hlrt K K

The algorithm Figure presents Generalize an algorithm that solves

hlrt

constraintsatisfaction problems of this form

Generalize is a simple generateandtest algorithm Given input E

hlrt

Generalize op erates by searching the space of hlrt wrapp ers for a wrapp er w

hlrt

that satises C with resp ect to each example Generalize employs a depth

hlrt hlrt

rst search strategy the algorithm considers all candidate values for r for each

such r it then considers all candidate values for r for each such r and r it then

considers all candidate values for r and so on The result is a lo op control structure

nested K levels deep

What candidates values should b e considered for each of the K hlrt comp o

nents An implementation of Generalize that do es not restrict these candidate

hlrt

sets is infeasible b ecause the hlrt space is innite and thus a depthrst search

might never terminate

r1a r1b Candidates for r1

l1a l1b l1a l1b Candidates for l1

ha hbha hb ha hb ha hb Candidates for h

tatb ta tb ta tb ta tb ta tb ta tb ta tb ta tb Candidates for t

XXXXXXX X

Figure The space searched by Generalize for a very simple example

hlrt

However on the basis of just a single example the number of candidates for each

of the K comp onents b ecomes nite For example the head delimiter h must

b e a substring of the head of each page P Thus line f uses as candidates for

n

h not all p ossible strings but rather only those that are subtrings of page P s head

S Similar constraints apply to each hlrt comp onent

K

Wrapper induction as search In Section we will p erform a detailed com

plexity analysis of Generalize the b ottom line is that Generalize examines

hlrt hlrt

a nite search space This means that we can describ e the algorithm in very simple

terms Generalize searches the nite space of p otentiallyconsistent hlrt wrap

hlrt

p ers in a depthrst fashion stopping when it encounters a wrapp er w that ob eys

C

hlrt

Figure illustrates the space searched by Generalize for a very simple

hlrt

example in which there is just K attribute p er tuple and where there are exactly

two candidates for each of the K wrapp er comp onents In the Figure the

two candidates for h are denoted ha and hb Similarly for t the candidates are

ta and tb for la and lb and for r ra and rb Figure shows

the space as a binary tree In general the tree is not binary the branching factor at

each level captures the number of candidates considered by the corresp onding line of

Generalize in this case line a for the rst level c for the second f

hlrt

for the third and g for the fourth

The search space is a tree Leaves represent wrapp ers and interior no des indicate

the selection of a candidate for a sp ecic wrapp er comp onent Some of the leaves are

lab eled X indicating that the corresp onding wrapp er do es not satify C these

hlrt

leaves are chosen arbitrarily merely for the sake of illustration For example the

rightmost leaf in the tree represents the wrapp er hhb tb lb rbi

Using these ideas we can now succinctly describ e our algorithm Generalize

hlrt

do es an exhaustive depthrst search of the tree for a leaf not lab eled X Since the

outdegree of each interior no de and the trees depth are nite such a search pro cess

is guaranteed to terminate

Example

Before describing the formal prop erties of Generalize we illustrate how the al

hlrt

gorithm op erates using the countrycode example

Supp ose Generalize is given just a single example the html page in Figure

hlrt

c call it page P together with its lab el Equation call it L The following

cc cc

partial evaluation of Generalize illustrates how this example is pro cessed

hlrt

Generalize example fhP L ig

cc cc

hlrt

for r each prex of B I y

for r each prex of IB

for each sux of HTMLTITLE PB

for each sux of B I

for h each substring of HTMLTITLE PB

for t each substring of IBRHR HTML

w hh t r r i

if C w hP Li then return w

hlrt

Consider the rst line of this co de fragment marked y As sp ecied in Figure

this line lists the candidates for hlrt comp onent r the prexes of B I To

see that these are the candidates note that B I is the string o ccuring after the

rst attribute of the rst tuple in the example hP L i That is S B I

cc cc

for page P Similarly the candidates for r are the prexes of IB b ecause

cc

S IB and so forth for the remaining hlrt comp onents

Generalize op erates by iterating over all p ossible combinations of the can

hlrt

didates Each such combination corresp onds to a wrapp er hh t r r i Each

wrapp er is tested to see whether it satises C if so Generalize terminates

hlrt hlrt

and the wrapp er is returned In this particular example eventually the wrapp er

hP HR B B I I i is encountered and Generalize halts

hlrt

Formal properties

As describ ed in Denition we want the function Generalize to b e consistent

hlrt

ie Generalize should output only wrapp ers w that are correct for the input

hlrt

examples In this section we formalize this claim

Theorem Generalize is consistent

hlrt

Pro of of Theorem From lines hj we know that if Generalize re

hlrt

turns a wrapp er w then w will satisfy C for every example Theorem states

hlrt

C w hP Li w P L Thus establishing Lemma completes the pro of

hlrt

2 Pro of of Theorem

Lemma Generalize is complete For any set E of examples

hlrt

if there exists an hlrt wrapper obeying C for each member of E then

hlrt

Generalize E wil l return such a wrapper

hlrt

Actually exhaustive enumeration reveals that there are more than million hlrt wrapp ers

that are consistent with hP L i the wrapp er hSome HR B B I B I Ii is

cc cc

one example Since we have not sp ecied the order in which the candidates are examined we can

not b e certain which wrapp er will b e returned These subtle searchcontrol issues are in interesting

direction of future research see Chapter

The pro of of Lemma essentially involves establishing that Generalize con

hlrt

siders enough candidates for each hlrt comp onent For example when choosing a

value for h we need to show that satisfactory values for h are always substrings of

page P s head and thus Generalize will always nd a satisfactory value even

hlrt

though it restricts its search to such substrings See App endix B for the details

algorithm Eciency The Generalize

hlrt

Though very simple and exhibiting nice formal prop erties the Generalize al

hlrt

gorithm as describ ed in Section is very inecient In this section we analyze

the algorithms computational complexity and describ e several eciency improve

whichlike Generalize is ments The result is the algorithm Generalize

hlrt

hlrt

consistent but which is much faster

Complexity analysis of Generalize

hlrt

What is the computational complexity of Generalize The algorithm consists

hlrt

of K nested lo op We can b ound the total running time of Generalize by

hlrt

multiplying the number of times line i is executed ie the total number of

iterations of the K nested lo ops by the time to execute line i once

How many times do es each nested lo op iterate Each substring of page P s head

is a candidate value for h How many such substrings are there Without loss of

generality and to obtain the tightest b ound we can assume that Generalize

hlrt

enumerates the substrings of the shortest pages headie Figure assumes that

P is the shortest page Note that page P s jS j grows with jP j in the worst case

K

Thus we can use R jP j to b ound the number of candidates for h Sp ecically

R R

since there are substrings of a string of length R Generalize must con

hlrt

candidates for h A similar argument applies to the tail delimiter t sider O R

candidates for t Generalize must examine O R hlrt

The number of candidates for the r and l are constrained more tightly For

k k

example r must b e a prex of each S the intratuple separators b etween the rst

m

and second attributes on each page Thus the candidates for r can b e enumerated

by simply considering all prexes of the shortest value of S At this p oint the

mk

analysis pro ceeds as with the page heads and tails jS j is at most R and there

mk

are R prexes of a string of length R and therefore Generalize must consider

hlrt

O R candidates for r A similiar argument applies to and the same b ound can b e

derived for each r and

k k

Multiplying these b ounds together we have that Generalize must execute

hlrt

line i for

K

OR OR O R O R O R

z

K terms

for dierent wrapp ers

How long do es it take to evaluate C for a single wrapp erie how long do es

hlrt

it take to execute line i Computing C essentially consists of a sequence of

hlrt

stringsearch op erations written as ss over pairs of strings that in the worst case

each have length R Using ecient techniques eg Knuth et al each such

op eration takes time O R Thus b ounding the time to compute C amounts to

hlrt

multiplying the total number of invocations of the stringsearch op erator by the time

p er invocation O R

The total number of such invocations can b e computed by summing the num

b er of invocations required by each of the three predicates CC We must test

each predicate against each of the N examples Predicates C and C are in

voked O K times once p er attribute Each such evaluation requires examining

each of the tuples of each page We can b ound the number of tuples on a single

page as M max M where M jL j is the number of tuples on page P

max n n n n n

Thus the time to evaluate C and C for a single page is O KM Evaluat

max

ing C involves a constant number of string search op erations for each tuple across

all of the pages Thus testing C on a single page involves O M primitive

max

stringsearch op erations Therefore the time to test CC on a single example

is O KM O KM O M O KM Therefore the time to test

max max max max

CC against all the examples is O KM O N O KM N Since the

max max

time to p erform a single stringsearch op erator is O R the total time to compute

C for a single wrapp er is

hlrt

O KM N O R O KM NR

max max

Multiplying this b ound by the number of wrapp ers for which C must b e

hlrt

evaluated Equation reveals that the total running time of Generalize is

hlrt

K K

O M NR O KM NR O R

max max

To summarize we have established the following theorem

Theorem Generalize complexity The Generalize algo

hlrt hlrt

K

where K is the number of attributes rithm runs in time O M NR

max

per tuple N jE j is the number of examples M max M is the

max n n

maximum number of tuples on any single page M jL j is the num

n n

ber of tuples on page P and R min jP j is the length of the shortest

n n n

example page

Generalize

hlrt

As the preceding complexity analysis indicates Generalize is a relatively nave al

hlrt

gorithm for solving the constraintsatisfaction problem of nding a wrapp er consistent

with a set of examples In this section we improve the p erformance of Generalize

hlrt

by describing several optimizations

The basic idea is that Generalize can b e improved by decomp osing the prob

hlrt

lem of nding the entire wrapp er w into the problem of nding each of w s K

comp onent delimiters h t r etc The key insight is that for the most part

the predicates CC do not interact and thus we can nd each delimiter with in

most cases no regard for the others

For example consider the variable r Insp ecting predicate C for k

A r S r S Cr P L

mM

mk mk mk

it is apparent that constraint C governs only r Furthermore the other predicates

C do not mention r Therefore we can assign a value to r without regard to

assignments to any of the other K variables This observation suggests that the

complexity of nding r is considerably simpler than indicated by the b ound derived

for Generalize

hlrt

To generalize rather than solving the overall constraintsatisfaction problem of

nding a wrapp er w ob eying C we can instead solve for the K comp onents

hlrt

of w as follows

Each variable r for k K is indep endent of the remaining K

k

variables since each r is governed only by predicate C

k

Similarly each variable for k K is indep endent of the remaining

k

K variables since each is governed only by predicate C

k

The remaining three variables h t and mutually constrain each other but

assignments to these three variables are indep endent of assignments to the other

K variables To see this note that predicate C mentions h t and but

neither of the other two predicates mention either of these variables

The algorithm Figure lists Generalize an improved version of

hlrt

op erates by Generalize that makes use of these improvements Generalize

hlrt

hlrt

solving for each r then for each k and nally for the three values

k k

is much faster than Generalize b ecause h and t In a nutshell Generalize

hlrt

hlrt

Generalize s iteration constructs are in series rather than nested

hlrt

Fast wrapp er induction as search Recall Figure which illustrates the space

searched by the Generalize algorithm Notice that the choice of r is represented

hlrt

Generalize examples E fhP L i hP L ig

N N

hlrt

for each k K a

for r each prex of P s intratuple separator for S b

k k

accept r if C holds of r and every hP L i E c

n n

k k

for each k K d

for each sux of P s intratuple separator S e

k k

accept if C holds of and every hP L i E f

n n

k k

for each sux of P s head S g

K

for h each substring of P s head S h

K

for t each substring of P s tail S i

M K

accept h and t if C holds of h t and every hP L i E j

n n

return hh t r r i k

K K

algorithm is an improved version of Generalize Figure The Generalize

hlrt

hlrt

Figure

as the trees ro ot Similarly the choice for is represented by the rst layer of no des

the choice of h the second layer and the choice for t the third layer This ordering

derives directly from the fact that in the original Generalize Figure the

hlrt

outermost lo op iterates over candidates for r the next lo op iterates over candidates

for r and so forth ending with the innermost lo op iterating over candidates for t

Given this observation we can describ e the improved Generalize algorithm in

hlrt

searches the tree in Figure very simple terms Like Generalize Generalize

hlrt

hlrt

in a depthrst fashion However Generalize backtracks only over the b ottom

hlrt

three no de layersie the no des representing choices for h and t The algorithm

do es not backtrack over the no des from the ro ot down to the fourth layer from the

b ottomie the no des representing choices for r r

K K

Generalize s consistency pro of presented next mainly involves showing that

hlrt

this greedy search never fails to nd a consistent wrapp er if one exists The basic

idea is that while Generalize uses the global evaluation function C to

hlrt hlrt

determine whether a leaf is lab eled X ie the corresp onding wrapp er is consistent

Generalize uses the lo cal evaluation functions CC hlrt

Formal properties

In attempting to improve Generalize have we sacriced any formal prop erties

hlrt

First of all note that if there are several hlrt wrapp ers consistent with a given set

of examples then the two algorithms might return dierent wrapp ers So in general

we can not say that Generalize E Generalize E for a set of examples E

hlrt

hlrt

However like Generalize Theorem Generalize is consistent

hlrt

hlrt

is consistent Theorem Generalize

hlrt

See App endix B for the pro of

Complexity analysis of Generalize

hlrt

What is the complexity of Generalize To p erform this analysis we determine a

hlrt

b ound on the running time of each of the three outer lo ops lines ac df

and gj As b efore we assume that each primitive string search op eration takes

time O R where R min jP j is the length of the shortest example page

n n

The rst lo op lines ac iterates K times and each iteration requires

O M N stringsearch op erations where K is the number of attributes p er tuple

max

N is the number of examples and M is the total number of tuples in the exam

tot

ples Thus the total running time of lines ac is O K O M N O R

max

O KM NR A similar analysis reveals that lines df also run in time

max

O KM NR

max

The nal lo op construct lines gj is more interesting This triplynested

lo op construct considers all combinations of values for h and t As b efore there

each for h and t for a total of O R are O R candidates for and O R

evaluations of predicate C Each such evaluation involves O M N invocations of

max

the stringsearch op erator Thus lines gj run in time

O M N O R O M NR O R

max max

Summing these three b ounds we have that Generalize is b ounded by

hlrt

O KM NR O KM NR O KM NR O M NR

max max max max

which provides a pro of of the following theorem

Theorem Generalize complexity The Generalize algo

hlrt hlrt

rithm runs in time O KM NR

max

Generalize is therefore signicantly more ecient than Generalize which

hlrt

hlrt

K

Theorem This improvement derives from the runs in time O M NR

max

observation that each variable r and except can b e learned indep endently of

k k

the remaining three variables

Since Generalize is demonstrably sup erior to Generalize we will hence

hlrt

hlrt

forth drop the annotation unless explicitly noted the notation Generalize

hlrt

will refer to the more ecient generalization algorithm

Heuristic complexity analysis

worstcase running time b ound derived for Generalize The O KM NR

max

hlrt

K

b ound for represents a substantial improvement over the O M NR

max

Generalize Unfortunately even if R were relatively small a degreesix p oly

hlrt

nomial is unlikely to b e useful in practice Moreover R is not relatively small for

the Internet resources examined in Chapter R ranges from characters

Fortunately it turns out that the worstcase assumptions are extremely p essimistic

By b eing only slightly more optimistic we can obtain a much b etter b ound

We make the worstcase analysis less p essimistic on the basis of the following ob

servation Recall that for each wrapp er comp onent we try all candidate substrings of

some particular partition fragments of the rst page P S for h S for r etc

K

algorithm for details Our complexity see lines begi of the Generalize

hlrt

b ound was derived in part by counting the number of such candidates for each hlrt

comp onent And these counts dep end on the length of these partition fragments So

far we assumed that these lengths are b ounded by the length of the shortest page

R min jP j But in practice any individual partition fragment constitutes only a

n n

small fraction of page P s total length jS j jP j and jA j jP j for every

n n n

mk mk

n k and m

Why should the individual A and S b e much shorter than the pages them

mk mk

selves First of all since the A and S form a partition of the page on average

mk mk

the partition elements simply cant have length approximately R And from a more

empirical p ersp ective we have observed that the bulk of an information resources

query resp onses consist of formatting commands advertisements and other irrelevant

text the extracted content of the page is usually quite small

The idea of the heuristic complexity analysis is to formalize and exploit this in

tuition Sp ecically our heuristic analysis rests on the following assumption

Assumption Short page fragments On average page P s par

n

p

tition elements S and A al l have length approximately R rather

mk mk

than R

p

The sp ecic function R was chosen to simplify the following analysis However

as we will see in Section this function is very close to what is observed empirically

at actual Internet sites

What do es Assumption buy us In the remainder of this Section we p erform

a heuristiccase analysis of Generalize s running time Like an averagecase

hlrt

analysis this heuristiccase analysis provides a b etter estimate of the runningtime

of our algorithm While an averagecase analysis assumes a sp ecic probability dis

tribution over the algorithms inputs our heuristiccase analysis relies on particular

prop erties of these inputs namely our analysis rests on Assumption

As indicated in Equation the triplynested lo op structure lines gj

dominates Generalize s running time which runs in time O M NR

max hlrt

for each sux of P s head S g

K

for h each substring of P s head S h

K

for t each substring of P s tail S i

M K

accept h and t if C holds of h t and every hP L i E j

n n

Walking through lines gj we can compute a new tighter b ound on the

p

suxes of a string of basis of Assumption as follows Since there are O R

p p

length candidates for instead of OR as R line g enumerates O R

p p

in the worstcase analysis Similarly lines hi enumerate the O R R

O R candidates for each of h and t Thus there are a total of

p

O O R O R R O R

combinations of candidates for which C must b e evaluated As b efore evaluating

C involves O M N invocations of the stringsearch op erator But each such

max

p

invocation now involves pairs of strings with lengths b ounded by R and thus the

p

Combining these results we obtain total time to evaluate C once is O M N R

max

the following heuristiccase b ound for the running time lines gj

p

O R O M N R O M NR

max max

As in the worstcase analysis the rest of Generalize runs in time b ounded by

hlrt

lowerorder p olynomials of R and so these terms can b e largely ignored However

the rest of the algorithm introduces a linear dep endency on K Thus we have proved

the following theorem

Theorem Generalize heuristiccase complexity

hlrt

algorithm runs in time Under Assumption the Generalize

hlrt

O KM NR

max

Of course the validity of this b ound rests on Assumption In Section

we demonstrate empirically that Assumption do es in fact hold in practice We

examined several actual Internet information resources and found that the b estt

p

R Thus Assumption is predictor of the partition element lengths is R

validated by empirical evidence

PAC analysis

In this section we apply the PAC mo del introduced in Section to the problem

of learning hlrt wrapp ers

Following the PAC mo del we assume that each example page P is drawn from

a xed but arbitrary and unknown probability distribution D Wrapper induction

involves nding an approximation to the target hlrt wrapp er T The Induce learn

ing algorithm sees only T s b ehavior on the examples the learner gathers a set

E f hP T P i g of examples Induce might not b e able to isolate the tar

get exactly and therefore we require only that with high probability the learning

algorithm nd a go o d approximation to T The quality of an approximation to T

is measured in terms of its error E w is the chance with resp ect to D that

T D

wrapp er w makes a mistake with resp ect to T on a single instance

E w D P jw P T P

T D

The user supplies two numeric parameters an accuracy parameter and

a reliability parameter The basic question we want to answer is How

many examples do es the Generalize induction algorithm need to see so that with

hlrt

probability at least it outputs a wrapp er w that ob eys E w

T D

Formal result This question is answered by the following theorem

Theorem hlrt is PAClearnable Provided that Assumption

holds the fol lowing property holds for any target wrapper

T H and page distribution D Suppose Generalize is given

hlrt hlrt

as input a set E fhP L i hP L ig of example pages each drawn

N N

independently according to distribution D and then labeled according to

target T If

M

tot

N

K R

K

then E w with probability at least The parameters are as

T D

P

M is the total number fol lows E consists of N jE j examples M

n tot

n

of tuples in E page P contains M jL j tuples each tuple consists of

n n n

K attributes the shortest example page has length R min jP j

n n

K K

and

R R R R

The pro of of Theorem app ears in App endix B In the remainder of this section

we interpret and evaluate this result

Interpretation We b egin by applying Theorem to an example Consider an

information resource presenting K attributes p er tuple an average of ve tuples

M

tot

and where the shortest page is R characters p er example page ie

N

long For the parameters Theorem states that Generalize

hlrt

must examine at least N example pages to satisfy the PAC criteria while for

the b ound grows to N

Equation can b e understo o d as follows The lefthand side is an upp er b ound

on the chance that the examples will yield a wrapp er with excessive error while the

righthand side is the desired reliability level Thus requiring that the righthand

side b e less than the lefthand side ensures that the PAC termination criterion is

satised

In standard PAC analyses such an inequality is usually solved for the variable

over which the learner has control the number of examples N jE j In Equation

there are three such variables N M and R To understand why our result has

tot

this form recall the fundamental p oint of PAC analysis The goal is to examine the

examples E and determineon the basis of just N whether the PAC termination

criteria is satised As the pro of of Theorem shows it turns out that we need to

examine E in more detail That is we simply cant b e sure that the PAC termination

criteria will b e satised merely on the basis of N Rather we must also take into

account the total number of tuples M and the length of shortest example R

tot

In more detail the rst term of the lefthand side of Equation

M

tot

K

K

is a b ound on the chance that the K left and right delimiters the and r

k k

except for l have b een learned incorrectly The second term

N

R

is a b ound on the chance that the remaining three delimiters h t and have b een

learned incorrectly The overall chance that the wrapp er is wrong is just the chance

of the disjunction of these two events and so the chance of learning the wrapp er

incorrectly is at most the sum of these two terms The pro of of Theorem mainly

involves deriving these two terms

How the mo del b ehaves To help visualize Equation Figure shows two

pro jections of the multidimensional surface it describ es We have plotted the chance

of successful learning as a function of the number of examples More precisely the

graphs display the condence that the learned wrapp er has error at most computed

as one minus the lefthand side of Equation as a function of two variables N

Actually recall that the lefthand side of Equation call this quantity cN M is only an

tot

approximation to the true condence As N and M approach zero the approximation degrades so

tot

that eventually the probability cN M exceeds one Therefore rather than plot cN M

tot tot

we have plotted min cN M tot

M

tot

the number of example pages and M the average number of tuples p er

ave

N

example The surface is plotted using M rather than M b ecause in practice the

ave tot

average number of tuples p er page is more meaningful than the total The remaining

domain characteristics are set as in the earlier example K is held constant at four

attributes p er tuple and R is held constant at characters

Figure a illustrates the surface for and b illustrates the surface for

In each case as N and M increase the PAC condence asymptotically

ave

approaches one We have also indicated the condence level p oints ab ove this

threshold corresp ond to learning trials in which the PAC termination criterion is

satised for the reliability parameter Earlier we rep orted that a learner must

examine at least N examples for and N examples for

these graphs show that these values are correct

Evaluation The PAC mo del is just thata modeland thus an imp ortant issue

is whether Theorem provides an eective termination condition Informally we

want to know whether the PAC mo del terminates the learning pro cess to o so on

meaning that more examples ought to b e considered b efore stopping or to o late

fewer examples are sucient for a highquality wrapp er

More precisely the PAC mo del would terminate the learning algorithm to o so on

if its condence were to o high This could happ en b ecause the pro of of Theorem

relies on the assumption that the examples are drawn independently from the

distribution D If this assumption do es not hold then the mo del is overcondent

b ecause the observed examples are not in fact representative of D

On the other hand the PAC mo del would terminate the learning algorithm to o

late if its condence were to o low This could happ en b ecause Theorem makes no

assumptions ab out the instance distribution D Indeed the PAC mo del is essentially

a worstcase analysis of D It is assumed for example that the chance of learning

r in no way eects the chance of learning r But it might b e the case that two

confidence

1

0.9

a

4 0.5 3

2 ave tuples/page 0 100 200 1 300 400 number of pages 500

confidence

1

0.9

b

4 0.5 3

2 ave tuples/page 0 1000 2000 1 3000 4000

number of pages 5000

Figure Surfaces showing the condence that a learned wrapper has error at most

M

tot

as a function of N the total number of examples pages and M the

ave

N

average number of tuples per example for a and b

events are in fact probabilistically dep endent under D For example there may b e

a set of p edagogically useful instances such that the learner will succeed if it sees

any one In summary it may b e the case that the distributions actually encountered

in practice are substantially easier to learn from than the worst case

Though analytic techniques could in principle b e applied to determine how well

the PAC mo del ts the actual learning task we have taken an empirical approach to

validating the mo del Section demonstrates that for several real hlrt learning

tasks the PAC mo del stops the induction pro cess substantially later than is actually

necessary even with relatively weak reliability and accuracy parameters levels of

Tightening this prediction is a challenging direction for future research

A simpler PAC mo del As describ ed earlier the lefthand side of the Equation

is the sum of two terms listed ab ove as Equations and

M

tot

N

K R

K

We now show that this sum tends to b e dominated by the second term so that the

PAC mo del reduces to a much simpler form which in turn leads to an interesting

theoretical result

Note that typically K R since K R for instance if K and

R then K while R Moreover the dierence b etween

is usually small for example with K and these and

K

terms dier by ab out Finally typically each example page has many tuples and

M N

tot

therefore M N so that for any

tot

Putting these observations together we can reason informally as follows

M

tot

N

K R

K

M

tot

N

small R

K

M N

tot

R small

N

small small R

N

R

To summarize if K R and N M then our PAC mo del Equation tot

simplies to

N

R

For example using K R M and N the original

ave

PAC condence and this approximation dier by less

This simplied mo del yields an interesting theoretical result and thus we state

the assumptions b ehind it explicitly

Assumption Few attributes plentiful data For any set of ex

amples E f hP L i g we have that

n n

M

tot

N

R K

K

where N jE j each tuple has K attributes R min jP j is the length

n n

of the shortest page the examples together have M tuples and K

tot

and R are as dened in Theorem

We can now solve Equation for N Doing so gives us a b ound on the number

of examples needed to ensure that a the learned wrapp er is PAC Using the inequality

x

x e we have that

R

N log

and in Equation and Generalize runs Since N is p olynomial in

hlrt

in quadratic time Theorem we have therefore established the following result

Theorem Wrapper class H is eciently PAClearnable under

hlrt

Assumption

This simpler PAC mo del makes intuitive sense The b ottleneck when learning

hlrt wrapp ers is an adequate supply of page heads and tails each example provides

just one of each In contrast the b o dy of a page typically contains several tuples and

Recall Footnote on page Here R plays the role of the natural complexity measure of the

hlrt hypothesis class notice that N is p olynomial in R as well

thus several opp ortunities for learning the r and delimiters Therefore if we can

k k

count on each page to provide several tuples then collectively the examples provide

many more tuples than page heads or tails Thus the part of the PAC mo del related

to learning from the page b o dies Equation is insignicant compared to the part

of the mo del related to learning from the page heads and tails Equation

Of course Theorem holds only if Assumption is realistic In Section

we take an empirical approach to verifying Assumption We demonstrate that

for several actual exp eriments the deviation b etween the simplied and original PAC

mo dels is small

Finally notice that the variable K do es not o ccur in Equation Therefore

under Assumption the number of examples required to satisfy the PAC criteria

is indep endent of K This is an interesting theoretical result since it suggests that

the techniques developed in this thesis can b e applied to information resources that

contain arbitrarily many attributes

Summary

In this chapter we have seen our rst complete instance of how to learn a wrapp er

class Conceptually this was quite straightforward we dened the hlrt wrapp er

class developed Generalize the generalization function input to the Induce al

hlrt

gorithm and designed a PACtheoretic mo del of the hlrt wrapp er class

Along the way we examined many technical details Our straightforward imple

mentation of Generalize is extremely slow so we developed a substantially more

hlrt

ecient variant Generalize We were able to reason ab out these algorithms b e

hlrt

cause we explicitly wrote down the constraint C governing when a wrapp er is

hlrt

consistent with an example We then went on to p erform an heuristiccase analysis

of our learning algorithm which reveals that our algorithm runs in time quadratic in

the relevant parameters

We then developed a PAC mo del Since hlrt wrapp ers have quite a complex

structure the PAC b ound is necessarily rather complex as well Interestingly our

mo del reduces to a very simple form under conditions that are easily satised by

actual Internet resources Sp ecically we have found that under reasonable assump

tions the number of examples required to learn an hlrt wrapp er do es not dep end

on the number of attributes K and that our induction algorithm requires only a

p olynomialsized collection of examples in order to learn eectively

hlrt is just one of many p ossible wrapp er classes For example we have already

mentioned a simplication of hlrt which we call lr In Chapter we discuss the

automatic induction of lr and four other wrapp er classes

Chapter

BEYOND HLRT ALTERNATIVE WRAPPER CLASSES

Introduction

So far we have b een concerned exclusively with the hlrt wrapp er class This suggests

a natural question what ab out other wrapp er classes In the extreme wrapp ers

can b e unrestricted programs in a fully general programming language Of course

automatically learning such wrapp ers would b e dicult

In this chapter we describ e ve additional wrapp er classes for which we have found

ecient learning algorithms Like hlrt all are based on the idea of using delimiters

that mark the left and righthand side of the text fragments to b e extracted The

classes dier from hlrt in two ways First we developed various techniques to avoid

getting confused by distractions such as advertisements Section Second we

developed wrapp er classes for extracting information that is laid out not as a table

the structure assumed by hlrt but rather as a hierarchically nested structure eg

a b o oks table of contents Section

Tabular resources

We rst consider alternatives to hlrt As discussed in Chapter the hlrt wrapp er

class corresp onds to one particular programming idiom for writing wrapp ers In

this section we consider alternatives

The lr oclr and hoclrt wrapper classes

The hlrt wrapp er class We b egin by quickly reviewing the hlrt wrapp er class

hlrt wrapp ers use one comp onent h to skip over a pages head and a second t

indicates a pages tail In the pages b o dy a pair of left and righthand delimiters

and r is used to extract each of the K attribute values for each tuple

k k

We have seen that we can encapsulate the b ehavior of an hlrt wrapp er for a

domain with K attributes as a vector hh t r r i of K strings The

K K

meaning of these strings is dened by the ExecHLRT pro cedure which executes an

hlrt wrapp er on a given page see Figure

The lr wrapp er class The lr wrapp er class is a simplication of hlrt In a

nutshell lr is simply hlrt without the h or t In pseudoco de lr execution

pro ceeds as follows

ExecLRwrapp er h r r i page P

K K

while there is a next o ccurrence of in P

for each h r i fh r i h r ig

K K

k k

th

attribute extract from P the value of the next instance of the k

b etween the next o ccurrence of and the subsequent o ccurrence of r

k k

return all extracted tuples

More precisely we dene the execution of an lr wrapp er as follows

ExecLRwrapp er h r i page P

K K K

i

m

while P i a

m m

for each h r i fh r i h r ig

k k K K

i i P i j j b

k k

b i

mk

i i P ir

k

e i

mk

return lab el f h hb e i i g

mk mk

Just as an hlrt wrapp er can b e describ ed exactly as a vector of K strings the

b ehavior of an lr wrapp er can b e encapsulated entirely as a vector of K strings

h r r i

K K

For example consider the following page articially pro duced for the sake of

illustration

hoAAcoAA coA A ct

The task is to lab el this page so as to extract the three tuples of two attributes

fhA Ai hA A i hA Aig

The hlrt wrapp er hh t i as well as the lr wrapp er h i b oth are

consistent withie extract the given information fromthis example page

The oclr wrapp er class hlrt employs the delimiters h and t to prevent in

correct extraction from a pages head and tail The oclr wrapp er class provides a

dierent mechanism for handling a similar problem

Rather than treating a page as head plus b o dy plus tail the oclr wrapp er

treats the page as a sequence of tuples separated by irrelevant text oclr is de

signed to avoid distracting irrelevant text in these intertuple regions just as hlrt

is designed to avoid distracting text in the head and tail

oclr uses two strings one denoted o that marks the b eginning or o pening of

each tuple and a second denoted c that marks then end or c losing of each tuple

oclr uses o and c to indicate the op ening and closing of each tuple much as hlrt

th

and lr use and r mark the b eginning and end of the k attribute

k k

As with lr and hlrt we can describ e the b ehavior of an oclr wrapp er using

a vector ho c r r i of K strings To make the meaning of these

K K

strings precise we must describ e the ExecOCLR pro cedure ExecOCLR prescrib es the

meaning of an oclr wrapp er ho c r r i just as ExecHLRT denes hlrt

K K

wrapp ers and ExecLR denes lr wrapp ers In pseudoco de

Note that we use a fixedwidth font to indicate the actual fragments of a page such as h while

italics indicates hlrt wrapp er comp onents such as h

ExecOCLRwrapp er ho c r r i page P

K K

while there is a next o ccurrence of o in P

skip to the next o ccurrence of o in P

for each h r i fh r i h r ig

K K

k k

th

attribute extract from P the value of the next instance of the k

b etween the next o ccurrence of and the subsequent o ccurrence of r

k k

skip past the next o ccurrence of c in P

return all extracted tuples

More precisely we dene the execution of an oclr wrapp er as follows

ExecOCLR wrapp er ho c r r i page P

K K

i

m

while P io a

m m

i i P io

for each h r i fh r i h r ig

k k K K

i i P i j j b

k k

b i

mk

i i P ir

k

e i

mk

i i P ic c

return lab el f h hb e i i g

mk mk

To illustrate oclr supp ose that the following html page was pro duced by a

resource similar to the original countrycode example Figure

HTMLTITLESome Country CodesTITLEBODY

BB BCongoB IIBR

BB BEgyptB IIBR

BB BBelizeB IIBR

BB BSpainB IIBR

BODYHTML

In the original example every lr wrapp er will get confused b ecause distracting text

in the pages head and tail is incorrectly extracted as a country Here the problem

is that text between the tuples is marked up using the same tags as are used for

countries One way to avoid this problem is to use an oclr wrapp er with o B

and c BR When the oclr wrapp er hB BR B B I Ii is given

to ExecOCLR these c and o tags force the irrelevant parts of the page to b e skipp ed

It is true that there exists an lr wrapp er that can handle this mo dication to the countrycode

As a second illustration of an oclr wrapp er consider again the example page

hoAAcoAAc oA A c t Along with the lr and hlrt wrap

p ers mentioned earlier the oclr wrapp er ho c i is consistent with this page

The hoclrt wrapp er class As the ExecHOCLRT pro cedure pseudoco de illus

trates the hoclrt wrapp er class combines the functionality of oclr and hlrt

ExecHOCLRTwrapp er hh t o c r r i page P

K K

skip past the rst o ccurrence of h in P

while the next o ccurrence of o is b efore the next o ccurrence of t in P

skip to the next o ccurrence of o in P

for each h r i fh r i h r ig

K K

k k

th

extract from P the value of the next instance of the k attribute

b etween the next o ccurrence of and the subsequent o ccurrence of r

k k

skip past the next o ccurrence of c in P

return all extracted tuples

More precisely

ExecHOCLRTwrapp er hh t o c r r i page P

K K

i P h

m

while jP itj jP ioj a

m m

i i P io

for each h r i fh r i h r ig

K K

k k

i i P i j j b

k k

b i

mk

i i P ir

k

e i

mk

i i P ic c

return lab el f h hb e i i g

mk mk

We can use the example page hoAAcoAAc oA A ct men

tioned earlier to illustrate hoclrt wrapp er hh t o c i is consistent with

this page

example namely hB B B I Ii However as well see in Section there are pages

that can b e wrapp ed by oclr but not lr The purp ose of this example is simply to illustrate oclr

not to establish its sup eriority to lr

Segue

Having introduced four wrapp er classes a natural question arises on what basis can

they b e compared In the next two sections we provide two such bases relative

expressiveness which measures the extent to which the funtionality of one wrap

p er class can b e mimicked by another and learning complexity which measures the

computational complexity of learning wrapp ers within each class

Relative expressiveness

A key issue when comparing lr hlrt oclr and hoclrt concerns their relative ex

pressiveness which information resources can b e handled by certain wrapp er classes

but not by others For example we have observed that the page

hoAAcoAA co A A ct

can b e handled by all four wrapp er classes On the other hand the countrycode

example Figure can b e wrapp ed by hlrt but not by lr

To formalize this investigation let f hP Li g b e the resource space the

set of all pagelab el pairs hP Li Conceptually includes pages from all information

resources whether regularly structured or unstructured tabular or not and so forth

For each such information resource contains all of the resources pages and each

page P included in is paired with its lab el L Note that a particular information

resource is equivalent to a subset of namely the particular pagelab el pairs the

resource contains

More imp ortantly from the p ersp ective of relative expressiveness note that a

wrapp er class can b e identied with a subset of a class corresp onds to those

pagelab el pairs for which a consistent wrapp er exists in the class If W is a wrapp er

class then we use the notation W to indicate the subset of which can b e handled

by W

Denition W The resource space is dened to be the set

of al l hP Li pagelabel pairs

The subset of that is wrappable by a particular wrapper class W writ

ten W is dened as the set of pairs hP Li such that a wrapper

consistent with hP Li exists in class W

W fhP Li j w P Lg

w H

W

Note that W provides a natural way to compare the relative expressiveness of

wrapp er classes For example let W and W b e two wrapp er classes If W

W then W is more expressive than W in the sense that any page that can b e

wrapp ed by W can also b e wrapp ed by W

Figure and Theorem capture our results concerning the relative expres

siveness of the four wrapp er classes discussed in this section

Theorem Relative expressiveness of lr hlrt oclr and hoclrt

The relationships between lr hlrt oclr and hoclrt

are as indicated in Figure

Pro of of Theorem Sketch To establish these relationships it suces to

show that

There exists at least one pair hP Li in each of the regions marked A B

C I in Figure

oclr subsumes lr lr oclr and

hoclrt subsumes hlrt hlrt hoclrt

Note that these three assertions jointly imply that the four wrapp er classes are related

as claimed

Consider each assertion in turn

In App endix B we identify one hP Li pair in each of the regions A B etc

Π Π(HOCLRT) Π(OCLR)

Π(HLRT) Π(LR)

(A) (B)(C) (D) (E) (F) (G) (H)

(I)

Figure The relative expressiveness of the lr hlrt oclr and hoclrt wrapper

classes

The idea is that an oclr wrapp er can always b e constructed from an lr wrapp er

for any particular page Sp ecically supp ose there exists a pair hP Li lr

ie there exists a wrapp er w h r r i H such that w P L

K K lr

Then oclr wrapp er w h r r i satises w P L and

K K

therefore hP Li oclr

The same idea applies to the classes hlrt and hoclrt Supp ose there

exists a pair hP Li hlrt ie there exists a wrapp er w

hh t r r i H such that w P L Then hoclrt wrap

K K

hlrt

p er w hh t r r i satises w P L and therefore

K K

hP Li hoclrt

See App endix B for complete details

2 Pro of of Theorem Sketch

As describ ed in App endix C the symbol denotes the empty string

One p ossibly counterintuitive implication of Theorem is that the lr class is

not subsumed by the hlrt class One might exp ect that an hlrt wrapp er can always

b e constructed to mimic the b ehavior of any given lr wrapp er To do so the head

delimiter can simply b e set to the empty string h However the tail delimiter

t must b e set some nonempty page fragment In general such a delimiter might not

exist For similar reasons the oclr wrapp er class is not subsumed by the hoclrt

class

Complexity of learning

The previous section illustrated that the four wrapp er classes under consideration

cover dierent parts of the resource space For example lr covers strictly less of

than oclr Thus a natural basis for comparing wrapp er classes is the computational

tradeos of these coverage dierences Since this thesis concerns learning wrapp ers

we are interested in the computational complexity of learning wrapp ers from the

wrapp er classes

We will concentrate on the following scenario Supp ose we are given a set E

f hP L i g of examples Let N jE j b e the number of examples Supp ose

n n

each tuple consists of K attributes and let M max M b e the maximum

max n n

number of tuples on any single page where M jL j is the number of tuples on

n n

page P Finally supp ose that the shortest example has length R min jP j For

n n n

each wrapp er class W recall that Generalize function is the input to the Induce

W

learning algorithm when learning wrapp er class W To measure the complexity of

learning a particular class W we will b e interested in the heuristiccase dened by

Assumption running time of the function call Generalize E

W

The hlrt wrapp er class We have already established Theorem that under

Assumption the Generalize function runs in time O KM NR

max hlrt

The lr wrapp er class The lr generalization function Generalize is similar

lr

to Generalize except that a dierent consistency constraintC instead of

hlrt lr

C must b e satised

hlrt

The consistency constraints for the lr wrapp er class is the predicate C Let

lr

w h r r i b e an lr wrapp er and hP Li b e a pairlab el pair We dene

K K

C as follows

lr

Cr hP Li C w hP Li

lr

k

k K

C hP Li

k

k K

The dierence b etween hlrt and lr is mirrored in the dierence b etween C

hlrt

and C While hlrt wrapp ers must satisfy constraint C for the h t and

lr

comp onents lr wrapp ers can entirely ignore this constraint Instead lr requires

simply that like the other K comp onents must satisfy C In particular

k

notice that the k K quantier in the C conjunct of C is replaced by

hlrt

the quantier k K in C

lr

As with hlrt we pro ceed by establishing that C is correct

lr

Theorem C is correct For every lr wrapper w page P and

lr

label L C w hP Li ExecLRw P L

lr

The pro of is similar though simpler than that of Theorem The details are

omitted

Given C the generalization function Generalize is as follows

lr lr

Generalize examples E fhP L i hP L ig

lr N N

for each k K

for r each prex of P s intratuple separator for S

k k

accept r if C holds of r and every hP L i E i

n n

k k

for each k K

for each sux of P s intratuple separator S S when k

K

k k

accept if C holds of and every hP L i E ii

n n

k k

return h r r i

K K

To see that Generalize is correct recall that predicates C and C are indep en

lr

dent in that each r is constrained only by C while each dep ends only on C

k k

Thus Generalize is complete even though it do es not consider all p ossible combi

lr

nations for all values of the K r and comp onents Thus we have given a pro of

k k

sketch for the following

Theorem Generalize is consistent

lr

This result is analogous to Theorem for the hlrt wrapp er class

What is the complexity of Generalize The analysis is similar to that for The

lr

p

orem Lines i and ii of Generalize are each executed O K R times

lr

p

Predicates C and C each require the same amount of time O M N R Thus

max

the total time to execute Generalize is

lr

p p

R O M N R O KM NR O K

max max

Thus we have established the following theorem

Theorem Generalize heuristiccase complexity Under As

lr

sumption the Generalize algorithm runs in time O KM NR

max

lr

The oclr wrapp er class In the hlrt wrapp er class the three comp onents h t

and interact in that they are jointly governed by the predicate C Similarly in

oclr the three comp onents o c and interact

The oclr consistency constraint predicate for wrapp er w

ho c r r i and example hP Li is dened as follows

K K

Cr hP Li C w hP Li

oclr

k

k K

C hP Li

k

k K

Co c hP Li

Notice that like hlrt but unlike lr the C conjunct of C is quantied by

oclr

k K since is governed by C rather than C

Predicate C governs the interaction b etween o c and for the oclr wrapp er

class just as C governs h t and for hlrt For o c and to satisfy C for

a particular example it must b e the case that i in the pages head o must skip

over any p otentially confusing material so that indicates the b eginning of the rst

attribute of the rst tuple ii in the pages tail c must mark the end of the last

tuple and there must b e no o present indicating a subsequent tuple and iii b etween

each tuple c and o must together skip over any p otentially confusing material so that

indicates the b eginning of the rst attribute of the next tuple More formally

Co c hP Li S o i

K

S c S co ii

M K M K

S co iii

mM mK

As with hlrt and lr b efore pro ceeding we must establish the following theorem

Theorem C is correct For every oclr wrapper w page P

oclr

and label L C w hP Li ExecOCLRw P L

oclr

The pro of is similar to that of Theorems and we show that under no cir

cumstances can ExecOCLR pro duce the wrong answer if C is satised and fur

oclr

thermore that if the wrapp er is correct then C must hold The details are

oclr

omitted

To analyze the complexity of learning oclr we must state its generalization function

Generalize examples E fhP L i hP L ig

oclr N N

for each k K

for r each prex of P s intratuple separator for S

k k

accept r if C holds of r and every hP L i E i

n n

k k

for each k K

for each sux of P s intratuple separator S S when k

k k K

accept if C holds of and every hP L i E ii

n n

k k

for each sux of P s head S

K

for o each substring of P s head S

K

for c each substring of P s tail S

M K

accept o c and if C holds of o c and and every hP L i E iii

n n

return ho c r r i

K K

To see that Generalize is consistent note that the algorithm is correct with

oclr

resp ect to the r and for k b ecause C and C are indep endent To see

k k

that the algorithm is correct with resp ect to o c and note that the algorithm

eventually considers all combinations of sensible values for each comp onent just

as Generalize considers all combinations of sensible values for h t and Thus

hlrt

we have given a pro of sketch for the following

Theorem Generalize is consistent

oclr

We are now in a p osition to analyze the complexity of Generalize As with

oclr

Generalize the b ottleneck of the algorithm is line iii Line iii is executed

hlrt

O R times once for each combination of o c and and thus predicate C

must b e evaluated O NR times once for each page Each such evaluation involves

p

Therefore O M primitive stringsearch op erations over strings of length O R

max

we must execute line iii

p

O NR O M O R O M NR

max max

times Furthermore lines i and ii of the algorithm requires time O K Since

line iii dominates the running time we can safely neglect the dep endency on the

other parameters Thus the total execution time of Generalize is b ounded by

oclr

To summarize O KM NR max

Theorem Generalize heuristiccase complexity

oclr

Under Assumption the Generalize algorithm runs in time

oclr

O KM NR

max

The hoclrt wrapp er class Learning the lr wrapp er class is very simple b ecause

all K comp onents can b e learned indep endently However in hlrt and oclr the

choice of dep ends on two other comp onents h and t for hlrt or o and c for

oclr As we saw these dep endencies result in the fact that learning hlrt or oclr

is computationally more intensive than learning lr Since hoclrt combines the

functionality of hlrt and oclr learning hoclrt is in turn computationally more

exp ensive than b oth hlrt and oclr

As usual we b egin with the hoclrt consistency constraints hoclrt wrapp er

w hh t o c r r i is consistent with example hP Li if and only if pred

K K

icates C C and C all hold

C w hP Li Cr hP Li

hoclrt

k

k K

C hP Li

k

k K

Ch t o c hP Li

Predicate C governs the ve comp onents h t o c and C requires that i

ii h t o and are satisfactory for the pages head iii o c and t are satisfactory

for the pages tail and ivv o c t and are satisfactory for the pages b o dy More

formally

Ch t o c hP Li

S ho i

K

jP hotj jP ho j ii

S c jS coj jS ctj iii

M K M K M K

S co iv

mK mM

jS cotj jS co j v

mK mK

As usual we demand that C b e correct

hoclrt

Theorem C is correct For every hoclrt wrapper w

hoclrt

page P and label L C w hP Li ExecHOCLRTw P L

hoclrt

The pro of is omitted it is similar to that of Theorems and

We are now in a p osition to state the hoclrt generalization function

Generalize examples E fhP L i hP L ig

hoclrt N N

for each k K

for r each prex of P s intratuple separator for S

k k

accept r if C holds of r and every hP L i E

n n

k k

for each k K

for each sux of P s intratuple separator S or S when k

K

k k

accept if C holds of and every hP L i E

n n

k k

for each sux of P s head S

K

for o each substring of P s head S

K

for c each substring of P s tail S

M K

for h each substring of P s head S

K

for t each substring of P s tail S

M K

accept h t o c and if C holds of them and every hP L i E i

n n

return hh t o c r r i

K K

The algorithm Generalize follows the same pattern as established by hlrt

hoclrt

and oclr the values of r and for k can b e learned indep endently but line

K

k

i indicates that values for the remaining comp onents h t o c and in this case

can b e selected only by enumerating all combinations of candidates for each

As with hlrt and oclr a complexity analysis of Generalize reveals that

hoclrt

p

R line i is the b ottleneck Sp ecically we must test predicate C against O

dierent combinations of h t O R O R O R O R O R

o c and Each evaluation of predicate C involves O NM invocations of the

max

p

primitive string search op eration which has cost O R Furthermore the rest of

the algorithm requires time O K Combining these b ounds we have that the total

running time is

p

R O KNM R O O K O NM O R

max max

To summarize

Theorem Generalize heuristiccase complexity

hoclrt

Under Assumption the Generalize algorithm runs in time

hoclrt

O KNM R

max

Nested resources

The preceding section was concerned with alternatives to hlrt for information re

sources that render data in a tabular format In this section we explore one particular

style of nontabular formatting which we call nested structure

While a rectangular table is the prototypical example of a do cument exhibiting

tabular structure a table of contents is the prototype of nested structure

Part I Introduction to CMOS Technology

Chapter Introduction to CMOS Circuits

A Brief History

MOS Transistors

CMOS Logic

The Inverter

Combinatorial Logic

Circuit and System Representation

Behavioral Representation

Structural Representation

Summary

Chapter MOS Transistor Theory

Introduction

nMOS Enhancement Transistor

pMOS Enhancement Transistor

MOS Device Design Equations

Basic DC Equations

Second Order Eects

Threshold VoltageBody Eect

Subthreshold Region

Channel Length Mo dulation

MOS Mo dels

The Complementary CMOS InverterDC Characteristics

Part I I System Design and Design Metho ds

Chapter CMOS Design Metho ds

Introduction

In a do cument with nested structure values for a set of K attributes are presented

with the information organized hierarchically The information residing b elow at

tribute k can b e thought of as details ab out attribute k Moreover for each attribute

there may b e any number zero one or more of values for the given attribute The

only constraint is that values can b e provided for attribute k only if values are also

provided for attributes to k

In the earlier tableofcontents example there are K attributes the titles of

the b o oks parts chapters sections subsections and subsubsections The constraint

that each attribute can have any number of values corresp onds to the idea that a b o ok

can have any number of parts a part can have any number of chapters a chapter can

have any number of sections and so on The constraint that attribute k can have a

value only if attributes to k have values means that there can b e no oating

subsubsection without an enclosing subsection no oating subsections without an

enclosing section and so forth

With a tabular format there is a natural denition of a pages information con

tent namely the substrings of the source do cument corresp onding to the attribute

values for each tuple For nested do cuments we extend this idea in a straightforward

manner The information content of a nested do cument is a tree of depth of most K

The no des of the tree group together related attribute values while the edges enco de

the attribute values themselves

For example the information content of the tableofcontents example is repre

Introduction to CMOS Technology Introduction to CMOS Circuits A Brief History MOS Transistors CMOS Logic The Inverter Combinatorial Logic Circuit and System Representation Behavioral Representation Structural Representation Summary MOS Transistor Theory Introduction nMOS Enhancement Transistor pMOS Enhancement Transistor MOS Device Design Equations Basic DC Equations Second Order Effects Threshold Voltage—Body Effect Subthreshold Region Channel Length Modulation MOS Models The Complementary CMOS Inverter—DC Characteristics .... System Design and Design Methods CMOS Design Methods Introduction

...

Figure An example of a nested documents information content

sented as following tree shown in Figure The ellipses indicate parts of the tree

that were omitted in the example naturally the information content of an actual

do cument do es not contain ellipses

Note that tabular formatting is actually a sp ecial case of a nested formatting in

which the trees ro ot has zero or more children all interior no des have exactly one

child and every leaf is at depth exactly K

For tabular formatting we formalized the notion of information content in terms

of a lab el L of the form

D E

hb e i hb e i hb e i

k k K K

L

D E

hb e i hb e i hb e i

M M M K M K

M k M k

To precisely discuss wrapp ers for do cuments with nested structure we must generalize

this formalization of a lab el

Since we must represent trees with arbitrary depth we employ a recursive deni

tion to describ e a nestedstructure lab el L

L lab el

lab el f no de g

no de hhb ei lab eli

That is L is a lab el structure where a lab el is a set of zero or more no de structures

and a no de is a pair consisting of an interval hb ei and a lab el structure A no des

interval hb ei indicates the indices of the string used to lab el the edge b etween the

no de and its parent A no des lab el represents its children

For example consider the following example page

hAiAiiAA A A A t

The task is to extract the following nested structure

a

a

e

a

a

e

a

a

a

e

a

a

Ai Aii A A

e

a

a

e

a

a

e a

A A A

The following lab el structure L represents this tree

hh i fgi

hh i fhh i fgigi

L

hh i fhh i fgigi

hh i fhh i fgigi

For instance A is substring of the example page Since A is the value

of the second attribute under the third value of the rst attribute the pair

o ccupies the corresp onding lo cation in the lab el

Recall that when we dened a tabular lab el we placed constraints on indices

which ensure that the layout is in fact tabular see Footnote on page A similar

requirement must b e satised by the indices in a nestedstructure lab el Sp eci

cally the indices must in fact corresp ond to a treelike decomp osition of the original

do cument

Finally note that since tabular formatting is a sp ecial case of nested format

ting we will overload the symbol L to mean either a tabular or a nested lab el the

interpretation will b e clear from context

The nlr and nhlrt wrapper classes

In this section we elab orate on the idea of nested structure by developing wrapp er

classes nlr and nhlrt that can b e used to extract such structure Both wrapp er

classes are similar to the tabular wrapp er classes asso ciated with each of the K

attributes is a lefthand delimiter and a righthand delimiter r In the tabular

k k

th

attribute for some particular tuple wrapp ers after extracting the value of the k

st

then the wrapp er extracts the k attribute value or the rst in the case of

th

k K Nestedstructure wrapp ers generalize this pro cedure after extracting the k

attribute the next extracted value could b elong to attributes k k k k

or The nlr and nhlrt wrapp ers use the p osition of the delimiters

k

to indicate how the page should b e interpreted the next value to b e extracted is

indicated by which delimiters o ccurs next For example if we have currently

k

extracted a pages content through index and there is a subsection title starting

at p osition a chapter title starting at p osition and a subsubsection title

starting at p osition then we will extract the chapter title next

The nlr wrapp er class The nlr wrapp er class is very similar to the lr wrapp er

class except that the execution of the wrapp er searchers for the delimiters in a

th

dierent way Sp ecically after extracting a value for the k attribute the ExecNLR

determines which attribute o ccurs next As just mentioned after attribute k the

next attribute could b e at attributes to k Of course at attribute K the next

attribute can not b e at k K and the rst extracted attribute must b e at k

To determine which attribute o ccurs next the p ositions of the left delimiter are

k

compared for each p ossible next attributes k

ExecNLRwrapp er h r r i page P

K K

i

k

lo op

k arg min P i i

k mink K k

exit lo op if no such k exists ii

i i P i j j

k k

b i

i i P ir

k

e i

th

insert hb ei as indices of next value of k attribute

return lab el structure

Line i is the key to the op eration of ExecNLR The algorithm determines which

among the k p ossibilities attribute k o ccurs next by nding the minimum

value of P i If line i were replaced by k if k K then else k then

k

ExecNLR would reduce to ExecLR The termination condition line ii is satised

whenever none of the o ccurie when the wrapp er has reached the end of the

k

page

We illustrate nlr with two simple examples First recall the articial exam

ple hoAAcoAA co A A ct The nlr wrapp er h i is

consistent with this example page

Of course this example is tabular rather than nested This example illustrates

that tabular formatting is a sp ecial case of nested formatting To illustrate the

additional functionality of nlr recall the example page introduced earlier

hAiAiiAA A A A t

The nlr wrapp er h i correctly extracts this structure from the example page

The nhlrt wrapp er class The nhlrt wrapp er class combines the functionality

of nlr and hlrt The execution of an nhlrt wrapp er hh t r r i is

K K

dened as follows

ExecNHLRTwrapp er hh t r r i page P

K K

i P h

k

lo op

k arg min P i

k min k K k

exit lo op if no such k exists or jP i j jP itj

k

i i P i j j

k k

b i

i i P ir

k

e i

th

insert hb ei as indices of next value of k attribute

return lab el structure

This pro cedure op erates just like ExecNLR except that additional mechanisms

similar to those in ExecHLRT are needed to handle the h and t comp onents

As exp ected the nhlrt wrapp er hh t i is consistent with the example

pages mentioned earlier

hoAAcoAA co A A ct

and

hAiAiiAA A A A t

Relative expressiveness

As with the four tabular wrapp er classes we are interested in comparing the relative

expressiveness of nlr and nhlrt With six wrapp er classes there are p otentially

regions that must b e examined in a Venn diagram such as Figure illustrating

the relative expressiveness of the classes Therefore for brevity we will compare only

lr hlrt nlr and nhlrt

Figure and Theorem capture our results concerning the relative expres

siveness of these four wrapp er classes

Π Π(HLRT)

(J) (K) Π(N-HLRT) Π(LR) (L)(M) (N) (O) (P)

(Q) (R) (S) (T) (U)

Π(N-LR)

Figure The relative expressivenes of the lr hlrt nlr and nhlrt wrapper

classes

Theorem Relative expressiveness of lr hlrt nlr and nhlrt

The relationships between lr hlrt nlr and nhlrt are

as indicated in Figure

Pro of of Theorem Sketch To see that these relationships hold it suces

to show that

There exists at least one pair hP Li in each of the regions marked J K

L U in Figure

Every pair in nlr and hlrt is also in lr nlr hlrt lr

Every pair in nhlrt and lr is also in hlrt nhlrt lr hlrt

Note that these three assertions jointly imply that the four wrapp er classes are related

as claimed

Consider each assertion in turn

In App endix B we identify one hP Li pair in each of the regions J K etc

Consider a pair hP Li claimed to b e a member of nlr and hlrt but not lr

Since hP Li hlrt hP Li must have a tabular structure It is straightfor

ward to show that if hP Li nlr and hP Li is tabular then hP Li lr

But this contradicts the assumption that hP Li lr

Consider a pair hP Li claimed to b e a member of lr and nhlrt but not hlrt

Since hP Li lr hP Li must have a tabular structure It is straightforward

to show that if hP Li nhlrt and hP Li is tabular then hP Li hlrt

But this contradicts the assumption that hP Li hlrt

See App endix B for details

2 Pro of of Theorem Sketch

Complexity of learning

To p erform a complexity analysis of learning lr hlrt oclr and hoclrt wrapp ers

we stated the consistency constraints that must hold if a wrapp er is to b e consistent

with a particular example We then describ ed the Induce generalization function for a

particular wrapp er class as a sp ecialpurp ose constraintsatiser for the corresp onding

consistency constraints More precisely when generalizing from a set E of examples

the consistency constraint C for wrapp er class W tells us whether some wrapp er

W

w H ob eys w P L for each hP Li E

W

An alternative is to use the wrapp er execution pro cedure ExecW directly For

instance rather than testing whether some wrapp er w H ob eys C we

hlrt hlrt

could have used the ExecHLRT pro cedure to verify that w is consistent with each

example

The reason we did not pursue this path is computational by explicitly stating

the constraint C we can then decomp ose the constraint into CC show that

hlrt

these predicates are indep endent and thereby search for consistent wrapp ers much

more eciently For the hlrt class the original inecient Generalize algorithm

hlrt

tries to satisfy C directly while the improved Generalize algorithm exploits

hlrt hlrt

the indep endence of CC A similar strategy was applied to the lr oclr and

hoclrt classes

Two problems arise when applying this metho dology to the nlr and nhlrt

wrapp er classes First of all the consistency constraints C and C would

nlr nhlrt

b e quite complicated requiring substantial additional notation b eyond the variables

S and A

mK mk

Further consideration of this rst diculty leads to the second problem To un

derstand why these constraints are so complicated consider attribute k What values

th

are satisfactory for the lefthand delimiter for the k attribute Recall that after

k

th

attribute ExecNLR and ExecNHLRT might extract a extracting a value for the k

value for attribute k k k or dep ending on which value o ccurs

k

rst amongst the k k Thus must occur rst when attribute k should b e

k

extracted but it must not occur rst when some other attribute k k k

or are to b e extracted To summarize whether a candidate value for is

k

satisfactory dep ends on the choices of up to k other lefthand delimiters

This discussion reveals the second problem with trying to decomp ose the consis

tency constraints C or C we conjecture that these constraints simply can

nlr nhlrt

not b e decomp osed in a way that simplies learning wrapp ers from these classes Since

we can see no advantage in explicitly stating the constraints C and C we

nlr nhlrt

will not do so Fortunately once weve committed to this decision the analysis of

the complexity of learning nlr and nhlrt is greatly simplied

The nlr wrapp er class The previous discussion suggests the following nlr

generalization function

Throughout this discussion we ignore the fact that attribute K is sp ecial case b ecause this com

plication do es not greatly change the computational burden of learning nlr and nhlrt wrapp ers

Generalize examples E fhP L i hP L ig

nlr N N

for r each prex of the text following some instance of attribute

for r each prex of the text following some instance of attribute K

K

for each sux of the text preceding some instance of attribute

for each sux of the text preceding some instance of attribute K

K

w h r r i

K K

if ExecNLRw P L for every hP L i E then i

n n n n

return w

As indicated note that line i of the algorithm simply uses the ExecNLR pro ce

dure to determine whether a candidate wrapp er is consistent The following theorem

follows immediately

Theorem Generalize is consistent

nlr

Before pro ceeding with the complexity analysis let us describ e Generalize in

nlr

more detail Note that the candidates for the r and are sp ecied dierently from

k k

Generalize than for the four wrapp er classes discussed earlier In Generalize for

nlr lr

example the candidates for r are the prexes of the intratuple separator for S for

k k

page P However we have not yet dened the notation S for nlr The reason

mk

for not dening S for nlr was just mentioned there is no advantageeither

mk

computational or to simplify the presentationto explicitly stating the consistency

constraint C and thus there was no need to dene S and A for nested

nlr

mk mk

resources Rather than invent a formal notation we will describ e the candidates

informally For example the candidates for r in the nlr and lr classes are similar

k

th

we lo ok for some o ccurrence of the k attribute and then the candidates for r are

k

the prexes of the text b etween this attribute and the next It do es not matter which

such o ccurrence is considered though for eciency we assume that the algorithm

considers the o ccurrence that minimizes the number of candidates for r k

We now provide a complexity analysis for the Generalize algorithm As usual

nlr

p

R candidates for each of r and and thus we must evaluate line there are O

k k

K

i O R times Each execution of line i required N invocations of the ExecN

LR function ExecNLR runs in time linear in the length of the page Let R

max

max jP j b e the length of the longest example page Combining these b ounds

n n

K

we have that the total running time of Generalize is b ounded by O R

nlr

K

O N O R O R NR To simplify this expression at the cost of

max max

a somewhat lo oser b ound we can make use of the fact that R R and thus

max

K K K

O R NR O NR O NR Thus we have a pro of for the

max

max max

following theorem

Theorem Generalize heuristiccase complexity

nlr

Under Assumption the Generalize algorithm runs in time

nlr

K

O NR

max

The nhlrt wrapp er class We just observed that a discussion of the disad

vantages of explicitly stating C resulted in the straightforward Generalize

nlr nlr

algorithm Similarly there is no p oint in explicitly stating the consistency con

straint C and thus the generalization function Generalize is also rather

nhlrt nhlrt straightforward

Generalize examples E fhP L i hP L ig

nhlrt N N

for r each prex of the text following some instance of attribute

for r each prex of the text following some instance of attribute K

K

for each sux of the text preceding some instance of attribute

for each sux of the text preceding some instance of attribute K

K

for h each substring of the text preceding the rst attribute

for t each substring of the text following the last attribute

w hh t r r i

K K

if ExecNHLRTw P L for every hP L i E then i

n n n n

return w

The dierence b etween Generalize and Generalize is similar to the

nhlrt hlrt

dierence b etween Generalize and Generalize The main dierence is that line

lr nlr

i of the algorithm uses ExecNHLRT rather than C to verify wrapp ers

nhlrt

As usual we b egin by observing that Generalize b ehaves correctly like

nhlrt

that of Theorem the pro of follows immediately from line i

Theorem Generalize is consistent

nlr

The complexity analysis of Generalize follows that of Generalize The

nhlrt nlr

K

dierence is that we must invoke line i of the algorithm O R times rather

K

than O R times The extra in the exp onent arises from the fact that

we must examine O R candidates for each of h and t From this p oint the

analysis pro ceeds as b efore and we have that the complexity of Generalize

nhlrt

K K K

NR O NR O NR We can summarize these O R

max

max max

observations with the following theorem

Theorem Generalize heuristiccase complexity

nhlrt

Under Assumption the Generalize algorithm runs in time

nhlrt

K

O NR max

Summary

In this chapter we have describ ed several wrapp er classes First we examined the lr

oclr and hoclrt wrapp er classes Like hlrt these three classes are designed for

extracting data from resources exhibiting a tabular structure Second we examined

two wrapp er classesnlr and nhlrtthat are designed for extracting information

from resources that exhibite a more general nested structure

Our formal results are summarized in the following table

heuristiccase strictly more

wrapp er class learning complexity expressive than

lr O KM NR

max

hlrt O KM NR

max

lr oclr O KM NR

max

hoclrt O KNM R hlrt

max

K

nlr O NR

max

K

nhlrt O NR

max

The rst column indicates the six wrapp er classes we have investigated

lr the simplest wrapp er class for tabular resources consist

ing of just left and righthand attribute delimiters

hlrt the wrapp er class discussed extensively in Chapter

oclr a wrapp er class consisting of left and righthand at

tribute delimiters as well as an op ening and closing de

limiter for each tuple

hoclrt a wrapp er class that combined the functionality of hlrt

and oclr

nlr the simplest wrapp er class for nested resources consist

ing of just left and righthand attribute delimiters and

nhlrt the straightforward extension of hlrt to nested re sources

The second column in the table provides a way to compare the diculty of au

tomatically earning wrapp ers within each class we have analyzed the heuristiccase

dened by Assumption runningtime of the function Generalize for each wrap

W

p er class W The analysis in provided in terms of the following parameters which

characterize the set of examples

K the number of attributes p er tuple

N the number of examples

M the total number of tuples in the example pages

tot

R the length of the shortest example page and

R the length of the longest example page

max

The third column in the table provides a brief summary of the relative expres

siveness of the six wrapp er classes Relative expressiveness captures the extent to

which wrapp ers from one class can mimic the b ehavior of wrapp ers from another As

Theorems and and Figures and indicate this summary is very crude

indeed the full analysis reveals that the classes are interrelated in a substantially

richer manner

These theoretical results are imp ortant and interesting but of course the b ottom

line is the extent to which these six wrapp er classes are useful for real information

resources In Section we demonstrate that indeed these wrapp er classes are

quite useful collectively they can handle of a recentlysurveyed sample of actual

Internet resources with the individual classes handling b etween and

Chapter

CORROBORATION

Introduction

Our discussion of automatic wrapp er construction has fo cused on inductive learning

One input to the Induce generic learning algorithm is the oracle function Oracle

T D

See Figure The oracle supplies Induce with a stream of lab eled examples from

the resource under consideration So far we have taken Oracle to b e provided as

T D

input And indeed this supervised approach is standard practice in many applications

of machine learning

Unfortunately assuming Oracle as an input usually means that a person must

T D

play the role of the oracle painstakingly gathering and examining a collection of

example query resp onses Since our goal is automatic wrapp er construction requiring

a p erson to b e in the lo op in this way is regrettable

Therefore in this chapter we fo cus on automating the Oracle function

T D

Throughout we discuss only the hlrt wrapp er class analogous results could b e

obtained for the other wrapp er classes Conceptually oracles p erform two functions

gathering pages and lab eling them Our eort is directed only at the second step see

Do orenbos et al for work on the rst We introduce corroboration a technique

for automatically lab eling a page Though in this thesis we do not reach our goal

of entirely automatic wrapp er construction we think that corrob oration represents a

signicant step in that direction

Corrob oration assumes a library of domainsp ecic attribute recognizers Each

recognizer rep orts the instances present in the page for a particular attribute For

example one recognizer might tell the corrob oration system the lo cation of all the

countries while a second recognizer lo cates the country co des Corrob oration is the

pro cess of sifting through the evidence provided by the recognizers

If the recognizers do not make mistakes then corrob oration is very simple Most

of this chapter is concerned with the complications that arise due to the recogniz

ers mistakes However we note that even if the recognizers were p erfect wrapp er

induction would still b e imp ortant b ecause the recognizers might run slowly while

wrapp ers are used online and therefore should b e very fast

We b egin with a highlevel description of the issues related to corrob oration Sec

tion We then provide a formal sp ecication of the corrob oration pro cess Section

Third we describ e an algorithm Corrob that solves an interesting sp ecial case

of the general problem Section Fourth we show how to use the output of

Corrob to implement the Oracle function there turn out to b e several wrinkles

T D

Section While clean and elegant Corrob is extremely slow and so in Section

we describ e several heuristics and simplications that result in a usable algorithm

called Corrob Finally we step back and discuss the concept of recognizers asking

whether they represent a realistic approach to lab eling pages Section

The issues

We b egin with an example that illustrates at a fairly high level the basic issues that

are involved Consider a simple extension of the countrycode example Figure

that has three rather than two attributes country country co de and capital city

In this section we will abbreviate these attributes ctry code and cap resp ectively

Recognizers The labeling problem under consideration is to take as input a page

from the countrycodecapital resource and to pro duce as output a lab el for the

page In this chapter we explore one strategy for automatically solving the lab eling

problem We assume we have a library of recognizers one for each attribute A

recognizer examines the page to b e lab eled and identies all instances of a particular

attribute For example a recognizer for the ctry attribute would scan a page and

return a list of the page fragments that the recognizer b elieves to b e values of the

ctry attribute

More formally a recognizer is a function taking as input a page and pro ducing

as output a set of subsequences of the page Each such subsequence is called an in

stance of the recognized attribute The symbol R indicates a recognizer for example

R is the country recognizer

ctry

Note that the recognizer library do es not sp ecify the order in which the attributes

are rendered by the information resource under consideration Our intent is to make

the wrapp er construction system as general as p ossible For example ideally we

would like to provide our system with example pages from any site that provides

countrycodecapital information we want the system to learn that at one resource

the tuples are formatted as hctry code capi while at another they are formatted

as hcode cap ctryi

Corrob orating p erfect recognizers As mentioned R is a recognizer for

ctry

countries R recognizes country co des and so forth Ideally each recognizer will

code

b ehave perfectly every instance returned by the recognizer is in fact a true instance

of the attribute and the recognizer never fails to output an instance of an attribute

that is present in the input

If every recognizer is p erfect then the lab eling problem is trivial we can simply

compute the unique attribute ordering consistent with the recognized instances

and then gather the instances into an overall lab el for the page

For example supp ose that all three recognizers R R and R are

ctry code cap

p erfect And supp ose that when we apply these recognizes to some page we obtain

the following output

R R R

cap ctry code

perfect perfect perfect

h i h i h i

h i h i h i

h i h i h i

First of all note that the indices in this table are completely articial they b ear no

resemblance to the html co de displayed in Figure

Each table contains a set of indices each of the form hb ei indicating the b eginning

and end indices resp ectively of the instance The second table ab ove for example

indicates that there are country attributes in the input page at p ositions

and

Since each recognizer is p erfect note that there is a unique ordering consistent

with the recognized instances In this case the ordering is hctry code capi

Given this ordering it is now trivial to comp ose the sets of recognized instances

into a lab el for the overall page In this example we obtain the following lab el

hh i h i h ii

hh i h i h ii

hh i h i h ii

In the remainder of this chapter to simplify the notation and to allow for complica

tions that will arise we will use the following equivalent notation

ctry code cap

h i h i h i

h i h i h i

h i h i h i

Before pro ceeding let us note that even if one has p erfect attribute recognizers

automatic wrapp er construction is still imp ortant The reason is that while p erfect

the recognizers may run slowly However a software agent needs its information

extraction subsystem to run quickly ideally it should resp ond in real time to users

commands Wrapper construction oers the prosp ect of oline caching of the smart

butslow recognizers output in a form that can b e used quickly online

Imp erfect recognizers This example was so simple b ecause each recognizer is

p erfect Of course we want to accomo date recognizers that make mistakes There

are two kinds of mistakes false positives and false negatives An instance returned

by a recognizer is a false p ositive if it do es not corresp ond to an actual value of the

attribute An actual value of the attribute is a false negative if it is not included

among the instances returned by the recognizer In general an imp erfect recognizer

might either exhibit false p ositives false negatives or b oth Thus with resp ect to

mistakes there are four p ossible kinds of recognizers

perfect neither false p ositives nor false negatives

incomplete false negatives but no false p ositives

unsound false p ositives but no false negatives and

unreliable b oth false p ositives and false negatives

We now return to the example to see how these dierent kinds of recognizers eect

the lab eling problem

Corrob orating incomplete recognizers Consider rst incomplete recognizers

those that exhibit false negatives but not false p ositives Supp ose R and R

ctry cap

are incomplete while as b efore R is p erfect Supp ose further that when given

code

the example page the recognizers resp ond as follows

R

code

R

cap

R perfect

ctry

incomplete

incomplete

h i

h i

h i h i

h i

h i

Note that several of the instances that were originally rep orted are now missing

These missing instances are the false negatives that were incorrectly ignored

The lab eling problem is now more complicated We must corroborate the output

of the recognizers in an attempt to comp ensate for the imp erfect information we

receive ab out the lab el This corrob oration pro cess pro duces the following lab el

ctry code cap

h i h i

h i h i

h i h i

We rst explain this lab el and then describ e corrob oration Compare this lab el with

the lab el listed at Equation generated from three p erfect recognizers Note that

the only dierence is the symbols As b efore each cell in the lab el indicates a

single attribute value The symbol indicates that the lab el conveys no information

ab out the instance in the given cell For example the second row of the lab el shows

that the ctry attribute o ccurs at p osition and the code attribute o ccurs at

p osition but no information is known ab out the cap attribute

Earlier we saw that corrob oration of p erfect recognizers is very simple Corrob

oration of incomplete recognizers is nearly as easy so long as at least one recognizer

is perfect We will formalize motivate and exploit this assumption throughout this

chapter The basic idea is that this p erfectlyrecognized attribute can b e used to

anchor the other instances

Thus in the example given the recognized instances we know that as b efore the

attribute ordering must b e hctry code capi Therefore by comparing the indices

it is a simple matter to align each ctry and cap instance with the correct code

instance For example with which code instance do es the ctry instance h i

b elong To see why it is aligned with code instance h i we need only note that

and therefore ctry h i b elongs to either the second or third code

instances and and therefore ctry h i must b elong to either the rst

or second code instances Performing such analysis for each ctry and cap instance

yields the lab el listed ab ove

These p esudolab els that contain symbols will b e referred to as partial

labels Section discusses how to extend the induction system to handle partially

lab eled pages

Multiple orderings We have neglected an imp ortant complication with imp er

fect recognizers there might b e more than one attribute ordering that is consistent

with the recognized instances For example supp ose that a library of two incomplete

and one p erfect recognizer output the following instances

R

code

R R perfect

cap ctry

incomplete incomplete

h i

h i h i h i

h i

In this situation the original ordering hctry code capi as well as

hcap code ctryi are b oth consistent with the recognized instances To see

this note that each of the following lab els would constitute a valid corrob oration of

the recognized instances

ctry code cap cap code ctry

h i h i h i

h i h i h i h i

h i h i h i

Without additional information or constraints the corrob oration pro cess has no

basis for preferring one of these two lab els over the other On the other hand the

wrapp er induction pro cess that invokes the corrob oration pro cess might well b e able

to decide which is correct Sp ecically well see that it is extremely unlikely that there

is a wrapp er consistent with more than one of the orderings Therefore we will treat

corrob oration as a pro cess that takes as input the recognizer library and pro duces as

output a set of partial lab els one for each consistent ordering Corrob oration returns

all consistent partial lab els the induction system then tries to select the correct lab el

see Section for details

In this example we want corrob oration to compute the following set

ctry code cap cap code ctry

h i h i h i

h i h i h i h i

h i h i h i

Corrob orating unsound recognizers The corrob oration of incomplete recog

nizers required introducing the notion of partial lab els and the symbol as

well as machinery for handling multiple consistent attribute orderings Unsound

recognizersthose that exhibit false p ositives but not false negativesinvolve addi

tional mo dications to the basic corrob oration pro cess

Supp ose R is unsound while R and R are p erfect Supp ose further

ctry code cap

that when given the example page the recognizers resp ond as follows

R

ctry

R unsound R

cap code

perfect perfect

h i

h i h i h i

h i h i h i

h i h i h i

h i

Note that the cap and code instances are unchanged from the rst p erfect example

while there are two additional ctry instances h i and h i

Corrob oration pro ceeds as follows As b efore there is just one consistent ordering

hctry code capi Given this constraint we can immediately insert the p erfectly

recognized code and cap instances into the output lab el We can thus write down

the following p ortion of the lab el

code cap

h i h i

h i h i

h i h i

We now discuss what to ll in for the ctry column

How are the remaining ve ctry instances handled As b efore we can conclude

that the fourth and fth instances h i and h i b elong to the second and

third resp ectively code and cap instances

ctry code cap

h i h i

h i h i h i

h i h i h i

All that remains is to ll in the topleft cell of the lab el The remaining three

ctry instancesh i h i and h iare all and the only candidates for

this cell We know that exactly one of them is correct but we dont yet know which

We can make some progress by discarding h i this instance overlaps with the

code instance h i since R is p erfect we know that the code instances

code

are correct so h i defers to h i

Unfortunately there is a no way for corrob oration to decide b etween these two

choices We handle this situation by splitting the lab el into two one for each p ossi

bility

ctry code cap ctry code cap

h i h i h i h i h i h i

h i h i h i h i h i h i

h i h i h i h i h i h i

Note that these two lab els are identical except for the contents of the topleft cell

To summarize some of the instances returned by an unsound recognizer might b e

false p ositives In general the corrob oration system has no basis on which to decide

which are false and which are true p ositives Therefore corrob oration pro duces one

lab el for each way to make a choice ab out which of the instances are false p ositives

and which are true p ositives In this light corrob oration amounts to guaranteeing

that one of the returned lab els corresp onds to the correct choice

Recall that in the earlier discussion of multiple consistent orderings we mentioned

that the induction system is resp onsible for deciding which ordering is correct Simi

larly the result of corrob orating unsound recognizers is a set of lab els the induction

system is resp onsible for deciding which is correct We discuss these details in Section

Corrob orating p erfect unsound and incomplete recognizers We have seen

examples of corrob orating p erfect and unsound recognizers and p erfect and incom

plete recognizers We now discuss corrob oration when the recognizer library contains

recognizers of all three types As might b e exp ected in the general case there might

b e

multiple consistent attribute orderings due to false negatives

partial lab els due to false negatives and

multiple lab els for any given attribute ordering due to ambiguous false p ositives

For example supp ose that R is p erfect R is incomplete and R

code ctry cap

is unsound and that these recognizers output the following instances

R

cap

unsound

R

code

h i

R perfect

ctry

h i

incomplete

h i

h i

h i h i

h i

h i

h i

h i

First consider the ctry and codethe attribute that are recognized by p erfect and

incomplete recognizers These following two lab els are consistent with these instances

ctry code code ctry

h i h i h i

h i h i h i

h i h i

Now consider the cap attribute When we combine the cap instances with the

rst lab el we see that cap must o ccur as the third column in this p osition there

are two subsets of the cap instances that are consistent with the rest of the lab el

ctry code cap ctry code cap

h i h i h i h i

h i h i h i h i h i h i

h i h i h i h i

Note that these lab els are the same except for the middleright cell Consider the rst

lab el Note that all of the other cap instancesh i h i and h ieither

overlap with instances already installed in the lab el which must b e true p ositives since

they were rep orted by incomplete or p erfect recognizers or are inconsistent with cap

b eing the third attribute The second lab el is derived by similar reasoning

Next we expand the second lab el in Equation which pro duces the following

two lab els

cap code ctry cap code ctry

h i h i h i h i h i h i

h i h i h i h i

h i h i h i h i

Thus the nal result of corrob oration is the following set of four partial lab els

ctry code cap ctry code cap

h i h i h i h i

h i h i h i h i h i h i

h i h i h i h i

cap code ctry cap code ctry

h i h i h i h i h i h i

h i h i h i h i

h i h i h i h i

Unreliable recognizers So far we have ignored unreliable recognizersthose

that exhibit b oth false p ositives and false negatives The corrob oration pro cess is

similar to what we have seen so far However recall that the induction system must

iterate over all partial lab els consistent with the recognized instances see Section

for details An imp ortant consideration therefore is the number of such lab els

pro duced by the corrob oration pro cess

Unfortunately we have observed in practice that this number b ecomes very large

with unreliable recognizers even when the rate at which mistakes are made is small

Therefore in this chapter we largely ignore unreliable recognizers

A formal mo del of corrob oration

In the preceding section we describ e the ma jor issues that must b e tackled when

using noisy attribute recognizers to solve the lab eling problem In this section we

provide a formal mo del of these ideas and intuitions

We assume a set of attribute identiers fF F g where there are K at

K

tributes p er tuple In the example discussed in the previous section K and

the attribute identiers were fctry code capg

An instance is a pair of nonnegative integers hb ei such that b e When not

clear from context the alternative notation hb e F i emphasizes that hb ei is

k

an instance of attribute F

k

A recognizer R is a function from a page to a set of instances

R

where is the set of all strings see App endix C The notation RP

f hb ei g indicates that the result of applying recognizer R to page P is

the indicated set of instances

R g exactly one The recognizer library is a set of recognizers fR

F F

K

p er attribute

For a given page P and attribute F let R P b e the set of true instances

k

F

k

of F in P To b e sure neither the induction system nor the recognizers have

k

access to the true instances However we introduce this notion to make precise

the kinds of mistakes a recognizer makes Note that in this chapter we are

concerned only with pages that are formatted according to a tabular layout

Thus for example we have that jR P j jR P j jR P j for

F F F

K

details see Footnote on page

P We abbre is perfect if for every page P R P R Recognizer R

F F

F

k k

k

is p erfect as perfR viate the assertion that R

F F

k k

P R P The is incomplete if for every page P R Recognizer R

F F

F

k k

k

is incomplete means that R notation incom R

F F

k k

P The is unsound if for every page P R P R Recognizer R

F F

F

k k

k

is unsound means that R notation unsndR

F F

k k

is unreliable if it is not p erfect incomplete or unsound As Recognizer R

F

k

mentioned earlier unreliable recognizers make corrob oration more complicated

and so we largely ignore them in this chapter

An attribute ordering is a total order over a set of attribute identiers

A label is an array with K columns and M rows with each cell in the lab el con

taining an instance The headings in the ruledtable notationeg see Equa

tion emphasize the ordering over the attributeseg hctry code capi

in Equation Note that we use this tabular notation rather than the set

notation introduced in Equation Chapter just for convenience the two

notations are equivalent

The true label for a given page P is simply the result of app ending the true

instances for each attribute together into a lab el with the attributes selected

according to the unique valid ordering For instance in the following example

the three sets of true instances on the left give rise to the true lab el on the right

R R R ctry code cap

code ctry cap

h i h i h i h i h i h i

h i h i h i h i h i h i

h i h i h i h i h i h i

A partial label is a lab el except that each cell contains either an instance or

the sp ecial symbol

The labeling problem is the following given as input a page P and a library

R g output a set fL L g of p ossiblypartial of recognizers fR

F F

K

lab els The notation hP i indicates a particular lab eling problem

Let L b e the true lab el for page P A set of partial lab els is a solution to a

given lab eling problem if there exists an L fL L g such that L matches

j j

L

Informally partial lab el L matches true lab el L if L is identical to L for

j j

attributes recognized p erfectly or unsoundly but holes indicated by

can o ccur in the incompletelyrecognized columns so long as none of the

incompletelyrecognized instances are discarded For example consider the

lab eling problem describ ed by Equation and assume that the true lab el is

as shown in Equation The partial lab el shown in Equation matches

the true lab el b ecause they are identical except for the three missing instances

and none of the symbols were introduced spuriously

More precisely partial lab el L matches true lab el L i the attributes

j

in L are ordered the same way as the attributes in L and L and L

j j

are identical except that a cell in column F can b e the sp ecial symbol i

k

the set of non and for each F such that incom R incom R

F F

k

k k

P instances in column F must equal R

F

k

k

The Corrob algorithm

In the previous section we formally sp ecied the lab eling problem In this section

we describ e Corrob an algorithm that correctly solves the lab eling problem for the

following sp ecial case

Assumption Reasonable recognizers The recognizer library

contains at least one perfect recognizer and no unreliable recognizers

We defer until Section further discussion and justication of Assumption

Figure lists the Corrob algorithm In the remainder of this section we rst

describ e the algorithm in more detail by walking through an example and then

describing its formal prop erties

Explanation of Corrob

As motivated in Section the Corrob corrob oration algorithm involves exploring a

combinatorial space of all p ossible ways to combine the evidence from the recognizer

library

Perfect and incomplete recognizers are straightforward since we allow partial

lab els and since such recognizers never include false p ositives we can simply incor

p orate the instances provided by these recognizers directly into every lab el output by

Corrob One imp ortant wrinkle is that incomplete recognizers give rise to an ambigu

ity over the ordering of attributes with the lab el and so each p ossible ordering must

b e considered

Unsound recognizers are somewhat more complicated since any particular in

stance they rep ort might b e a false p ositive we must in general allow for throwing

away an arbitrary p ortion of the recognized instances However since unsound rec

ognizers never exhibit false negatives we know that the instances recognized by an

unsound recognizer must contain all the true p ositives Therefore we must consider

all ways of selecting exactly one recognized instance for each tuple for each attribute

recognized by an unsound recognizer

The example To describ e the Corrob algorithm we show its op eration on the last

example in Section To rep eat there are three attributes and the recognizer

library fR R R g resp onds to an example page P as follows

ctry cap code

R

cap

unsound

R

code

h i

R perfect

ctry

h i

incomplete

h i

h i

h i h i

h i

h i

h i

h i

Corrobpage P recognizer library fR R g

F F

K

L fg

NTPSet NTPSet P a

for each PTPSet PTPSets P b

for each Orders fF F g c

K

if Consistent NTPSet PTPSet then d

L L MakeLabel NTPSet PTPSet e

return L

NTPSet page P recognizer library

Return the annotated instances from the p erfect or incomplete recognizers

S

P InvokeP R

F

R unsndR

k

F F

k k

page P recognizer library PTPSets

Let M b e the number of tuples in P s true lab el

such that perfR P j for some R M jR

F F F

k k k

Let s n b e the set of all subsets of set s having of size n

s

s n fs j js j ng

Return the set of ways to choose M instances from each unsound recognizer

Q

M InvokeP R

F

R unsndR

k

F F

k k

page P recognizer R Invoke

F

k

Return the set of recognized instances annotated with the attribute identifer

o n

P hb e F i j hb ei R

F

k

k

Orders attribute identiers fF F g

K

Return the set of all total orders over the set fF F g

K

Consistent ordering annotated instances f hb e F i g

k

Return true i there exists some lab el with attribute ordering

that contains every instances hb e F i

k

ordering annotated instances f hb e F i g MakeLabel

k

Assuming that Consistent f hb e F i g true return

k

the lab el to which these inputs corresp ond

Figure The Corrob algorithm

We now step through the execution of Corrob P

Line a Necessarily truep ositive instances Corrob starts by invoking the

p erfect and incomplete recognizersR and R on page P The NTPSet

ctry code

subroutine computes this set of necessarily true positive NTP instances The in

stances output by R and R or more generally all instances except those

ctry code

recognized unsoundlyare necessarily true p ositive instances b ecause incomplete

and p erfect recognizers never pro duce false p ositives

For b o okkeeping purp oses that will b ecome imp ortant at lines de we must

keep track of the attribute to which each instance b elongs In the example line a

computes the following set of NTP instances

NTPSet fh ctryi h codei h codei h codeig

Line b Possibly truep ositive instances Corrob next collects and reasons

ab out the instances recognized by the unsound recognizersin this case R We

cap

know that some subset of the instances R P are true p ositives and the remainder

cap

are false p ositives Moreover we know that there must b e exactly three true p ositives

and therefore remaining jR P j instances are false p ositives More generally

cap

for each unsound recognizer we know that M of the recognized instances are true

p ositives and the remainder are false p ositives where M is the number of instances

in P s true lab el note that we can determine M by consulting one of the p erfect

recognizers

However as discussed in Section Corrob do es not know which are correct

and which are incorrect and so it must try each way to select M choices from each

unsound recognizers set of instances In this example we must consider each of the

following subsets

fh i h i h i g fh i h i h ig fh i h i h ig

fh i h i h i g fh i h i h ig fh i h i h ig

fh i h i h i g fh i h i h ig fh i h i h ig

fh i h i h i g fh i h i h ig fh i h i h ig

fh i h i h ig fh i h i h ig fh i h i h ig

fh i h i h ig fh i h i h ig fh i h i h ig

fh i h i h ig fh i h i h i g

ways to select M elements Notice that we have listed here the

from the jR P j recognized cap instances

cap

Each of these sets represents a collection of possibly truepositive PTP

instancesie each set corresp onds to a collection of instances that might b e though

are not necessarily true p ositives The function PTPSets is invoked by line b to

compute these PTP sets

Sp ecically PTPSets uses the function to generate all ways to choose M elements

from each unsound instance set and then takes the cross pro duct over all such sets

ofsets In this example PTPSets returns the ab ove set of twenty sets though for

brevity we did not include the attribute annotations p ostp ended to each instance by

the Invoke function

digression This example is fairly simple b ecause there is only one unsound rec

ognizer and so the crosspro duct is trivial However in general there might b e

multiple unsound recognizers For example consider the following example involving

and R two unsound recognizers R

F F

R R

F F

unsound unsound

h i h i

h i h i

h i h i

h i h i

Supp ose that there are M tuples in the true lab el In this case we have that

PTPSets returns the thirtysix member set

fh F i h F i g fh F i h F ig

fh F i h F i g fh F i h F ig

fh F i h F i g fh F i h F i g



fh F i h F ig fh F i h F ig

fh F i h F ig fh F i h F ig

fh F i h F ig fh F i h F i g

Note that each such PTP set contains two instances for F and two instances for F

as required end of digression

Line c Attribute orderings Corrob iterates over all choices for p ossibly

true p ositive instances as computed by PTPSets Now Corrob must also determine

how the attributes are ordered The function Orders returns the set of all K total

orders over the attribute identiers In the example we have

hctry code capi hctry cap codei

Orders fctry code capg

hcode ctry capi hcode cap ctryi

hcap ctry codei hcap code ctryi

In Figure we use the notation to refer to some element of this set We

write F F to mean that F precedes F in the output of Orders to which

k k

k k

corresp onds For example if hcap code ctryi then cap ctry b ecause

cap precedes ctry in hcap code ctryi

Line d Enumerating all combinations As this p oint the Corrob algo

rithm has gathered

the set NTPSet of necessarily truep ositive instances

the set PTPSets each member of which is a set of p ossible truep ositive instances and

the set Orders fctry code capg all p ossible attribute orderings

The nested lo op at lines bc cause Corrob to iterate over the cross pro duct of

PTPSets and Orders fctry code capg When coupled with NTPSet each such

combination of PTPSet PTPSets and Orders fctry code capg corre

sp onds to a complete guess for P s true lab el In the example there are

such combinations

To verify each such guess Corrob invokes the Consistent function Consistent

determines whether a given set of instances and a given attribute ordering can in

fact b e arranged in a tabular format The basic idea is that an instance hb e F i

k

conicts with the remaining instances if it can not b e inserted into a lab el with the

given attribute ordering

In the example we must evaluate Consistent against the PTPordering pairs

For example consider

h ctryi

C B

C B

h capi

C B

h codei

C B

C B

Consistent hctry code capi

h capi

C B

C B

h codei

C B

A

h capi

h codei

z z z

st st

NTPSet elt of Orders elt of PTPSets

In this case Consistent returns false To see this note that instance h capi

must o ccur in the same tuple as h codei b ecause there are no earlier code

instances and therefore cap code b ecause h i precedes h i However

the ordering hctry code capi requires that code cap

We will not illustrate Consistent for the remaining PTPorderings combina

tions Instead we simply assert only the following four pairs satisfy Consistent

hhctry code capi fh capi h capi h capigi

hhctry code capi fh capi h capi h capigi

hhcap code ctryi fh capi h capi h capigi

hhcap code ctryi fh capi h capi h capigi

Line e Building the lab els Finally the MakeLabel function is invoked for

every PTPordering combination that satises Consistent In the example we invoke

MakeLabel four times

h ctryi h codei

C B

C B

C B

h codei h codei

C B

C B

hctry code capi MakeLabel

C B

C B

h capi h capi

C B

A

h capi

h ctryi h codei

C B

C B

C B

h codei h codei

C B

C B

hctry code capi MakeLabel

C B

C B

h capi h capi

C B

A

h capi

h ctryi h codei

C B

C B

C B

h codei h codei

C B

C B

MakeLabel hcap code ctryi

C B

C B

h capi h capi

C B

A

h capi

and

h ctryi h codei

C B

C B

C B

h codei h codei

C B

C B

hcap code ctryi MakeLabel

C B

C B

h capi h capi

C B

A

h capi

Each such invocation is straightforward MakeLabel simply use the given attribute

ordering and instances to construct an M K array Since MakeLabel is called only

if the arguments satisfy Consistent we know that such an array is well dened In

the example the invocations of MakeLabel result in the following four partial lab els

ctry code cap ctry code cap

h i h i h i h i

h i h i h i h i h i h i

h i h i h i h i

cap code ctry cap code ctry

h i h i h i h i h i h i

h i h i h i h i

h i h i h i h i

This set is then returned by the Corrob algorithm as the output given inputs P and

Formal properties

In Section we formally dened the lab eling problem as well as what constitutes

a solution In App endix B we prove that Corrob always generates solutions

Theorem Corrob is correct For every recognizer library and

page P if Corrob P L then L is a solution to the labeling prob

lem hP i provided that Assumption holds of

Extending Generalize and the PAC mo del

hlrt

Ideally the Induce inductive learning system would make use of the Corrob algorithm

to lab el a set of example pages and then provide these lab eled pages to the general

ization function Generalize Unfortunately Generalize wants and the PAC

hlrt hlrt

mo del assumes p erfectly lab eled examples In contrast

In general Corrob pro duces multiple lab els p er example page and guarantees

only that one is correct

Moreover incomplete recognizers mean that the resulting lab els might b e partial

These dierences mean that Corrob can not b e used directly to lab el example pages

for Generalize and the PAC mo del of wrapp er induction is not immediately

hlrt

applicable Fortunately only relatively minor extensions to the Generalize algo

hlrt

rithm and the PAC mo del are needed to accomo date Corrobs noisy lab els In this

section we describ e these enhancements

To illustrate the basic idea recall the countrycode example in Figure Sup

p ose that instead of pro ducing the correct lab el

HTMLTITLESome Country CodesTITLEBODY

BSome Country CodesBP

Congo B I IBR B

B Egypt B I IBR

B Belize B I IBR

Spain B I IBR B

HRBEndBBODYH TML

the corrob oration algorithm Corrob pro duces the following incorrect lab el

HTMLTITLESome Country CodesTITLEBODY

BSome Country CodesBP

BCon goB I IBR

B Egypt B I IBR

B Belize B I IBR

B Spain B I IBR

HRBEndBBODYH TML

Note that two of the countries are lab eled incorrectly For example this lab el might

result from using an unsound country attribute recognizer

What happ ens when we invoke Generalize on this example Generalize

hlrt hlrt

certainly can not return some wrapp er w hh t r r i b ecause no such wrap

p er exists For instance there is no valid value for b ecause the rst country

instance BCon is preceded by the string while the other three country instances

are preceded by B and and B share no common sux Moreover note

that due to the fourth country instance Spain there is no acceptable value for r

In short the given lab eled page simply can not b e wrapp ed by hlrt

This example illustrates that even though Corrob alone can not decide which of

its outputs lab els are correct a mo died version of Generalize algorithm can

hlrt

decide by checking to see whether a wrapp er exists If no wrapp er exists then

the lab el must b e incorrect and Generalize can try the next lab el suggested by

hlrt

Corrob Since Corrob is correct Theorem this simple strategy works b ecause

eventually Generalize will encounter the correct lab el

hlrt

However note that there is always some chance that this strategy will fail it might

b e the case the some of the lab els are incorrect yet a consistent wrapp er do es indeed

exists This would cause the learning algorithm system to think it has succeeded

when in fact it has learned the wrong wrapp er Later in this section we describ e how

to mo dify PAC mo del to account for the ideally very small chance that a wrapp er

can b e found that is consistent with an set of examples that are incorrectly lab eled

Note that we have go o d reason to b e optimistic as the example illustrates in the

context of wrapp er induction mislab eling a page even slightly will very likely result

in an unwrappable page

So far we have discussed the diculties that result from unsound recognizers

Supp ose that instead of b eing unsound the country recognizer were incomplete This

might result in the following partial lab el

HTMLTITLESome Country CodesTITLEBODY

BSome Country CodesBP

B Congo B I IBR

BEgyptB I IBR

BBelizeB I IBR

B Spain B I IBR

HRBEndBBODY HTML

In this case so long as Generalize is somewhat careful in the way it pro

hlrt

cesses this example the induction algorithm can still pro duce the wrapp er w

hP HR B B I I i The only problem is that the algorithm is general

izing on the basis of fewer examples of and r but this is essentially the same

scenario as if the original example contained only two tuples instead of four

This second example illustrates that partial lab elsthe symbols in a partial

lab elare easily handled by Generalize Induction must simply rely on fewer

hlrt

examples of each hlrt comp onent Similarly the PAC mo del must b e extended so

that the chance of learning each individual comp onent is tailored on the basis of the

evidence available ab out that sp ecic comp onent

In the remainder of this section we describ e these extensions to Generalize

hlrt

and then discuss the mo died PAC mo del

noisy

algorithm The Generalize

hlrt

noisy

Generalize is the name of our extension to Generalize While Generalize

hlrt hlrt

hlrt

takes as input a set

fhP L i hP L ig

N N

noisy

of p erfectlylab eled examples Generalize takes as input a set

hlrt

n o

hP fL L gi hP fL L gi

N

N N

j

are pro duced by Corrob for of noisilylab eled examples The intent is that the L

n

each n

fL L g Corrob P

n

n n

noisy

The algorithm The Generalize algorithm is shown in Figure

hlrt

noisy

Generalize op erates by rep eatedly invoking the original Generalize algo

hlrt

hlrt

rithm once for each way to select one of the p ossible lab els for each of the pages

The algorithm terminates when such a lab eling yields a consistent wrapp er

noisy

gig L Generalize noisy examples E fhP fL L gi hP fL

N

N N

hlrt

j

j

N

for each vector hL i fL L g fL L g a L

N N

N

j

j

N

w Generalize fhP L i hP L ig

hlrt N

N

if w false then return w

Note that the Generalize algorithm has b een slightly mo died from Figure

hlrt

see the text and Figure for details

noisy

Figure The Generalize algorithm a modication of Generalize Figure

hlrt

hlrt

which can hand le noisily labeled examples

noisy

Mo dications to Generalize As indicated in Figure Generalize relies

hlrt

hlrt

on two minor mo dications to the original Generalize algorithm see Figure

hlrt

First as describ ed in Figure Generalize assumes that there exists a wrap

hlrt

p er consistent with the input examples and so the only termination condition ex

noisy

re plicitly sp ecied involves nding such a wrapp er In contrast Generalize

hlrt

quires that Generalize b e mo died so that it terminates and returns false if no

hlrt

such consistent wrapp er can b e found Implementing this change is very simple we

need only insert the three lines marked y in Figure These lines ensure that

the lo ops at lines a d and g terminated successfully If any of these

lo ops terminate without nding a legitimate value for the hlrt comp onent then

Generalize simply stops immediately and returns false Note that this mo di

hlrt

cation to Generalize has no b earing on the analysis and discussion of its b ehavior

hlrt

that we have presented

Recall that in Chapter we actually dened two dierent hlrt generalization functions the

slower Generalize algorithm Figure and the faster Generalize algorithm Figure

hlrt

hlrt

dominates Generalize we dropp ed the annotation and used Then since Generalize

hlrt

hlrt

the symbol Generalize to refer only to faster algorithm The mo dications to Generalize

hlrt hlrt

prop osed in this section apply in principle to either algorithm However we provide details only for

the faster algorithm since there is no p oint in worrying ab out the slower algorithm

Generalize examples E fhP L i hP L ig

N N

hlrt

for each k K a

for r each prex of some P s intratuple separator for S b

n

k mk

accept r if C holds of r and every hP L i E c

n n

k k

return false if no candidate satises C y

for each k K d

for each sux of some P s intratuple separator S e

n

k mk

accept if C holds of and every hP L i E f

n n

k k

return false if no candidate satises C y

for each sux of some P s head S g

n

K

for h each substring of some P s head S h

n

K

for t each substring of some P s tail S i

n

M K

accept h and t if C holds of h t and every hP L i E j

n n

return false if no triplet of candidates satises C y

return hh t r r i k

K K

Figure A slightly modied version of the Generalize algorithm compare with

hlrt

Figure

Second we must mo dify Generalize so that it correctly pro cesses lab els con

hlrt

taining symbols The discussion earlier gave the basic idea rather than assum

ing that each lab el provides exactly one value for each attribute within each tuple

Generalize must reason only ab out those parts of the page ab out which the lab els

hlrt

provides information

Consider an example noisilylab eled page

HTMLTITLESome Country CodesTITLEBODY

BSome Country CodesBP

BCongoB I IBR

Egypt B I IBR B

BBelizeB I IBR

B Spain B I IBR

HRBEndBBODYH TML

This pages lab el contains two symbols one for each missing country Now

recall lines ac of the Generalize algorithm Figure At this stage the

hlrt

algorithm is nding a value for each r for k K To do so the algorithm k

gathers a suitable set of candidates for r namely the prexes of page P s

k

intratuple separator for S and then

k

tests each such candidate to see whether is satises predicate C

To extend Generalize so that it can handle partial lab els these two steps are

hlrt

mo died as follows

The candidate set for r contains the prexes of some page P s intratuple sep

n

k

arator for S If as is the case in the example ab ove page P s lab el is missing

mk

the rst attribute of the rst tuple then algorithm simply uses any other page P

n

instead of always using P or any other tuple m instead of always the rst tuple

Note that these choices do not matterany such S provides a satisfactory

mk

set of candidates for r In the example if we are trying to learn r ie k

k

then we can use as the candidates for r all prexes of the pages S value

B I b ecause S is undened since hb e i is missing from the pages

lab el

Recall predicate C from Figure

S r S F r Cr hP Li

mM

mk k mk mk k k

Rather than checking al l m M we must check only those m such that

instance hb e i is present in the lab el In the example when learning r

mk mk

the mo died version of Generalize tests only m and m m and

hlrt

m are ignored b ecause instances hb e i and hb e i are missing in the

pages lab el

This second mo dication to Generalize can b e summarized as follows As

hlrt

weve seen Generalize assumes that lab els contained no symbols and thus

hlrt

the algorithm was describ ed in very simple terms But we have shown how for

the learning of r it is very simple to generalize the sp ecication of the algorithm

k

so that symbols are handled prop erly A similar set of changes are made for

Generalize s other steps

hlrt

Summary In Chapter and this section we have describ ed several dierent ver

sions of the hlrt generalization algorithm As Footnote suggests these dierent

versions can b e somewhat confusing Therefore we summarize this discussion by

listing the four algorithms

Generalize Figure the original hlrt generalization function While con

hlrt

ceptually elegant this algorithm is extremely inecient

Figure a reimplementation of Generalize that is sub Generalize

hlrt

hlrt

stantially more ecient Since Generalize so clearly dominates

hlrt

Generalize we largely ignore the strawman algorithm and use the symbol

hlrt

Generalize to refer to the improved version

hlrt

noisy

Generalize Figure the hlrt generalization function that handles the

hlrt

noisy lab els generated by Corrob by calling a mo died version of

Generalize as a subroutine

hlrt

Mo died Generalize Figure two small changes to the Generalize

hlrt hlrt

to b e precise that were just describ ed Since these algorithm Generalize

hlrt

mo dications are so minor we did not dignify them by introducing a new symbol

Extending the PAC model

noisy

We have just describ ed Generalize which rep eatedly invokes the Generalize

hlrt

hlrt

algorithm describ ed in Chapter thereby enabling wrapp er induction in the face of

noisy example lab els We now show how to extend the PAC developed in Section

to handle noisy lab els

PAC analysis provides an answer to a fundamental question in inductive learning

how many examples must a learner see b efore its output hypothesis is to b e trusted

As discussed earlier there are two main problems with directly applying the PAC

mo del developed earlier b oth of which derive from the fact that the mo del assumes

correct lab els

First the fact that the noisy lab els might b e partial ie include symbols

means that we must b e careful when calculating the chance that a wrapp er has low

error since these symbols eectively mask parts of the page from the learners

attempt to determine a wrapp er Without sp ecial care the PAC mo del will overes

timate the reliability of a learned wrapp er

The second problem is that unsound recognizers might return falsep ositive in

stances which might then b e incorrectly inserted into a pages lab el What if by

chance a wrapp er consistent with these incorrectly lab eled pages can b e found The

result is an incorrect wrapp erfor the examples and p ossibly other pages as well

the wrapp er extracts the wrong information content namely the text indicated by

the incorrect lab els To deal with this situation we must extend the PAC mo del so

that it do es not overestimate the reliability of a learned wrapp er by ignoring the fact

that the wrapp er might have b een learned on the basis of noisy lab els

Handling partial lab els The solution to the problem of partial lab els is straight

forward We must do some additional b o okkeeping to ensure that we count the actual

number of examples of each delimiter Recall the original PAC mo del Equation

M

tot

N

R K

K

where the set of examples E f hP L i g consists of N jE j examples

n n

P

M is the total number of tuples in E page P contains M jL j M

n n n n tot

n

tuples each tuple consists of K attributes and the shortest example page has length

R min jP j and K and R are dened by Equations and on page

n n

In this original mo del the parameter N indicates the number of observations

from which the head h tail t and hlrt comp onents are learned Similarly

M counts the number of observations from which the remaining comp onents

tot

r r are learned Accommo dating partial lab els is a matter of rening

K K

these counts to reect the partial lab els incomplete information

We rst walk through an example and then provide the details Consider the

following noisilylab eled page

HTMLTITLESome Country CodesTITLEBODY

BSome Country CodesBP

IBR BCongoB I

B Egypt B I IBR

BBelizeB I IBR

B Spain B I IBR

HRBEndBBODYH TML

What can we learn from this example The four intact co de instances provide evi

dence for and r the left and righthand delimiters for the second attribute the

country co de in this example Sp ecically for each of these two delimiters we have

four example strings from which to generalize

However note that we have less evidence ab out and r the left and righthand

delimiters for the country attribute Since just two of the instances are present the

example provides only two example strings from which to generalize and r

Finally consider the head delimiter h and the tail delimiter t Since the nal

in this example second attribute of the last tuple is not missing this example

provides us with a complete example of a pages tail region which provides evidence

for determining t In contrast since the rst attribute of the rst tuple is missing

the example do es not provide evidence for a pages head and so we can learn nothing

ab out the value of h

We can summarize these observation as follows This example illustrates that it

is not correct to assume that the examples provide exactly N observations of page

heads N observations of page tails and M observations of each attribute

tot

Instead we pro ceed as follows In the new PAC analysis we are interested in a

set

j

j

N

E hP L i hP L i

N

N

j

n

of examples where each L is a p ossibly partial lab el n

F F F

K

hb e i hb e i hb e i

K K

hb e i hb e i hb e i

M M M M M K M K

n n n n n n

j

n

might b e partial Imp ortantly note that though not shown explicitly since each L

n

some of the hb e i might b e missing with the symbol o ccurring instead

mk mk

The extended PAC mo del is stated in terms of the following parameters of E

k

the number of nonmissing instances of attribute k on page P M

n

n

X

k k

M M

n tot

n

if hb e i or hb e i is missing on page P

n

M K M K

n n

N

n

otherwise

X

N N

n

n

These new parameters are interpreted as follows

k k

M is the number of instances of attribute F on page P The K variables M

n

k

n n

thus generalize the role of the single variable M in the original mo del If lab el

n

j k K

n

L is not partial then all the M s are equal M M M

n

n n n n

k

The K values of M generalize the role of M If none of the lab els are partial

tot

tot

K

then M M M

tot

tot tot

N indicates whether page P s head and tail can b oth b e identied With no

n

n

partial lab els the head and tail can always b e identied N

n

N generalize the role of N in the original mo del With no partial lab els

N N

In the countrycode example mentioned earlier if we call the example page P

then we have M while M and M and N If the set of examples

consists only of page P then we have M while M and M and

tot

tot tot

N while N

These new parameters are then incorp orated into the PAC mo del as follows First

the rst term in the lefthand side of Equation changes as follows

k

M M M

tot

tot tot

K

X

B C

K

A

K K K

k

recall from Equation that K K

This change implements the more sophisticated b o okkeeping that is needed to

handle partial lab els Sp ecically if we examine the pro of of Theorem Section

K

B we nd that Theorem assumed that M M M for each

n

n n

n the describ ed change generalizes the PAC mo del Note that as exp ected if the

K

recognizers are p erfect then M M M and the change has no eect

n

n n

Next the second term in the lefthand side of Equation is changed as follows

N N

R R

Again if the recognizers are p erfect then N N and this change has no eect

Putting these two mo dications together we arrive at the following equation

k

M M

tot tot

K

X

B C

B C

A

B C

B C

K K

k

B C

B C

B C

B C

B C

A

N

R

To summarize the new PAC mo del Equation generalizes the original mo del

Equation so that the PAC analysis correctly handles partially lab eled pages

Handling false p ositives Recall that from a PACtheoretic p ersp ective the dif

culty with false p ositives is that there is no guarantee that there do es not exist some

wrapp er that happ ens to b e consistent with a set of incorrectlylab eled examples

However supp ose that we have a b ound on the chance that there exists a

wrapp er consistent with a set of incorrectly lab eled examples More formally supp ose

j

j

N

i hP L ig of examples if E contains at least that for any set E fhP L

N

N

one incorrectly lab eled example then

j

n

w P L Pr

n j

w H

n

n

hlrt

i hP L

n

n

The b ound provides the PAC mo del with some measure of how dangerous

false p ositives are in a particular domain In this Chapter we have provided examples

meant to motivate the claim that is relatively small in the html applications with

which this thesis is concerned In the Section we return to this claim and nd

that is extremely close to zero for now we simply assume that is provided as an

additional parameter for the PAC mo del

Before pro ceeding note that is not the chance that a lab el is wrong As we

noisy

have seen Generalize assumes that a consistent wrapp er can b e found only for

hlrt

correctly lab eled pages is a measure of how risky this assumption is For example

our recognizers library might make many mistakes so that nearly all the lab els

returned by Corrob contain errors Nevertheless can b e close to zero if the infor

mation resource under consideration is suciently structured so that incorrect lab els

are usually discovered

Incorp orating into the mo del is straightforward As discussed in Section

the terms on the lefthand side of Equations and account for dierent ways

in which the induction pro cess might fail to generate a highquality wrapp er We

simply must introduce as an additional term

k

M M

tot tot

K

X

B C

B C

B C

A

B C

K K

B C

k

B C

B C

B C

B C

B C

B C

N

B C

R

B C

B C

B C

B C

B C

A

In Chapter we describ e how this mo del of noise is related to others in the PAC literature

Note that if there is no chance that an incorrect lab el will yield a consistent wrapp er

then and the new PAC mo del Equation reduces to the previous mo del

Equation

In summary Equation denes the generalized PAC mo del which correctly

accounts for b oth incomplete and unsound recognizers

Complexity analysis and the Corrob algorithm

noisy

The Generalize algorithm describ ed ab ove is a relatively clean extension of the

hlrt

original Generalize algorithm so that it handles pages that are incorrectly lab eled

hlrt

However a simple analysis reveals that it is computationally very exp ensive In this

section we p oint out these computational costs and describ e a mo died version of

Corrob called Corrob which overcomes these costs

noisy

The Generalize takes as input a set

hlrt

J

J

N

hP fL L gi gi hP fL L

N

N

N

of noisily lab eled pages generated by the Corrob algorithm and invokes the original

Q

J ways to choose one of the candi Generalize algorithm once for each of the

n

hlrt

n

date lab els for each page where as indicated ab ove page Corrob returns J p ossible

n

lab els for page P

n

How large can each J b e Examination of the Corrob algorithm Figure

n

provides the following b ounds Line c indicates that each of the J lab els for

n

page P p otentially corresp onds to a dierent ordering of the K attributes and

n

there are K such orderings Furthermore supp ose that there are U K unsound

recognizers in the library provided to Corrob Supp ose further that each unsound

recognizer has a falsep ositive rate of ie on average each unsound recognizer

outputs ab out M falsep ositives in addition to the M truep ositives instances

n n

M

n

In this case there are ways to choose M instances from the M

n n

M M

n n

that were recognized by each of the U unsound recognizers Therefore across all

recognizers the PTPSets function call at line b pro duces a set containing ab out

U

M

n

members Multiplying these two b ounds together we have that

M M

n n

U

M

n

J K

n

M M

n n

Q

J is p otentially enormous For example if each In summary J and hence

n n

n

recognizer has a false p ositive rate of ie there are tuples p er

Q

J page M and unsoundlyrecognized attributes U then

n

n

p ossible lab els must b e considered

We conclude that a straightforward implementation of Corrob is impractical b e

cause its output contains so many lab els Therefore we implemented a mo died

version of Corrob which for clarity we will refer to as Corrob The two algorithms

dier in three ma jor ways which can b e characterized as follows

The rst mo dication is to provide additional input to the algorithm to make

the problem easier

The second mo dication involves using greedy heuristics yielding outputs that

are strictly sp eaking incorrect though as our exp eriments demonstrate useful

anyway

The third mo dication involves using domainspecic heuristics

Additional input Attribute ordering

Recall that Corrob determines which attribute orderings are consistent with

the recognized instances and returns lab els corresp onding to each such consistent

ordering

Our rst simplication is to provide Corrob with the correct attribute ordering

as an additional input As the complexity analysis ab ove indicates enumerating

all K p ossible orderings is exp ensive see Corrobs line c Therefore we decided

to eliminate this admittedly interesting asp ect of Corrobs functionality In terms

Note that the parameter is introduced only to simplify this complexity analysis plays no role

in the Corrob algorithm

of the analysis ab ove this simplication removes the K term from the estimate of

each J value

n

Greedy heuristic Stronglyambiguous instances

Corrobs enumeration of all attribute orderings is one reason it p erforms p o orly The

second source of diculty is that the algorithms PTPSets subroutine enumerates all

p ossible subsets of the instances recognized by unsound recognizers see Corrobs line

b

However we can improve this asp ect of Corrob by categorizing the unsoundly

recognized instances into three groups those that are unambiguous those that are

weakly ambiguous and those that are strongly ambiguous For example consider a

library of three recognizers which resp ond as follows

R

code

R

cap

R

unsound

ctry

unsound

perfect

h i U

h i U

h i W

h i

h i U

h i W

h i

h i S

h i S

h i

h i S

h i S

We have annotated the unsoundlyrecognized instances ie the code and cap in

stances as follows U indicates that the instance is unambiguous W weakly

ambiguous and S strongly ambiguous

To understand these annotations consider the output of Corrob

ctry code cap ctry code cap

h i h i h i h i h i h i

h i h i h i h i h i h i

h i h i h i h i h i h i

ctry code cap ctry code cap

h i h i h i h i h i h i

h i h i h i h i h i h i

h i h i h i h i h i h i

ctry code cap ctry code cap

h i h i h i h i h i h i

h i h i h i h i h i h i

h i h i h i h i h i h i

We b egin by observing that these six lab els are in fact the output of Corrob b ecause

they are the only lab els that satisfy Consistent among the

z z

jR P j jR P j

code cap

p ossibilities considered ie among the ways to choose M instances each from

the sets R P and R P We omit the details

code cap

These annotations can b e understo o d as follows

Note that the code instance h i o ccurs in every lab el in Equation

Instance h i is unambiguous b ecause though it was recognized by an un

sound recognizer h i is necessarily a truep ositive by virtue of the other

instances Similarly the cap instances h i and h i are also unambigu

ous

In summary an unsoundlyrecognized instance is unambiguous if and only if it

o ccurs in every lab el output by Corrob

Now consider the code instance h i It o ccurs in some but not all of the

lab els in Equation so h i is not unambiguous But if we lo ok more

carefully we discover a relationship b etween the code attribute h i and

instances of the cap attribute h i o ccurs in a given lab el only when the

cap instance h i o ccurs in the same lab el We say that h i is strongly

ambiguous b ecause whether it is included in a lab el dep ends on the instances

are selected for other attributes The word strongly is meant to stress that

resolving the ambiguity requires examining a p otentially large p ortion of the

lab el

Of course since h codei dep ends on h capi then h capi

dep ends on h codei and so caps instance h i is strongly ambiguous

as well

Moreover note that h codei comp etes with instance h codei in

that b oth b elong to the same cell of the lab el In the true lab el there is ex

actly one instance p er cell and thus these two instances are mutually exclusive

And since h codei is strongly ambiguous we say that h codei is

strongly ambiguous to o By similar reasoning we have that h capi is

strongly ambiguous b ecause it is mutually exclusive with the strongly ambigu

ous instance h capi

More precisely an instance is strongly ambiguous if it interacts with another

unsoundlyrecognized attribute in the sense just discussed or if it mutually

excludes a strongly ambiguous instance

Finally note that the fact that strong ambiguity propagates from one instance to

another means that strong ambiguity induces a partition on a lab els strongly

ambiguous instances two strongly ambiguous instances b elong to the same

Recall that to emphasize the attribute to which an instance b elongs we use the notation hb e F i

k

to indicate that hb ei is an instance of attribute F k

partition element i they are connected by a chain of strongly ambiguous rela

tionships We call each such partition element a cluster of strongly ambiguous

instances In the example the clustering is trivial with all of the strongly

ambiguous instances b elonging to the single cluster

fh codei h codei h capi h capig

In general each cluster contains one or more instances from each of several

attributes

Finally note that h codei and h codei are neither unambiguous

nor strongly ambiguous These two instances are mutually exclusive but the

choice of including one or the other in a particular lab el can b e made in

dep endently from choices for the other attributes Therefore we say that

h codei and h codei are weakly ambiguous

To summarize an instance is weakly ambiguous if it is neither unambiguous

nor strongly ambiguous

We are now nally in a p osition to describ e Corrob s greedy heuristic Whereas

Corrob includes al l unsoundlyrecognized instances Corrob outputs all unambiguous

and weakly ambiguous instances but only some of the strongly ambiguous instances

Sp ecically for each cluster of strongly ambiguous instances Corrob arbitrarily

selects one attribute from the clusters instances and then discards all instances in

the cluster except those of the selected attribute Note that by discarding all but

one attributes instances that attributes instances are rendered weakly rather than

strongly ambiguous

In the example Corrob would create a set of lab els based on the following in stances

necessarily truep ositives h ctryi h ctryi h ctryi

unambiguous instances h codei h capi h capi

weakly ambiguous instances h codei h codei

strongly ambiguous instances either h codei h codei y

or h capi h capi

The rst three items listed have already b een discussed the fourth item illustrates

Corrob s greediness In the rst case marked y the instances h capi and

h capi are simply discarded from further consideration In the second case

h codei h codei are discarded Let us emphasize that Corrob s choice

of which attributes instances to select is made arbitrarily on no principled basis

and greedily once the choice is made it is never reconsidered

Supp ose Corrob chooses the rst set of strongly ambiguous instances marked y

ab ove In this case Corrob pro duces the following four lab els as output

ctry code cap ctry code cap

h i h i h i h i h i h i

h i h i h i h i h i h i

h i h i h i h i

ctry code cap ctry code cap

h i h i h i h i h i h i

h i h i h i h i h i h i

h i h i h i h i

Note that since the instances that were to have o ccupied the b ottomleft cell have

b een discarded Corrob must ll this cell with in each lab el This symbol

demonstrates that Corrob s greedy discarding of strongly ambiguous instances means

that it loses information ab out a pages true lab el

However this loss is repaid with a substantial computational gain Sp ecically

note that we can compactly represent the entire set in Equation by the following

structure

ctry code cap

h i h i h i

h i h i h i h i

h i h i h i

The exclusiveor symbols indicate that exactly one instance in these cells can

o ccur in any given lab el This structure is a compact representation of Equation

b ecause there are four ways to choose one of the two elements in each of the two

cells each combination corresp onds to a member of the set in Equation

Note that Corrob can always represent its set of output lab els using such a com

pact representation This prop erty holds b ecause with no strongly ambiguous in

stances decisions ab out how to resolve any remaining ambiguity ie how to choose

one instance from each weakly ambiguous set can always b e made regardless of how

decisions are made in unrelated parts of the lab el

We can summarize this second simplication in the faster implementation Corrob

of the slow Corrob algorithm as follows First we showed how to categorize unsoundly

recognized instances as either unambiguous weakly ambiguous or strongly ambigu

ous We then suggested a simple techniquediscarding all but one attributes in

stances from each cluster of strongly ambiguous instanceswhich always p ermits a

compact representation of the entire set of Corrob s output lab els In essence Corrob

treats unsound recognizers as if they were incomplete whenever the instances they

output are strongly ambiguous by pretending it never observed the instances in

the rst place

Of course this mo dication to Corrob means that Corrob is not formally correct

according to the initial denition of the lab eling problem Sp ecically Corrob s

output might contain symbols in cells that ought to b e o ccupied However just as

noisy

Generalize can learn in the face of incomplete recognizers so to o can it learn when

hlrt

some of the strongly ambiguous instances have b een discarded While not formally

noisy

correct Corrob s output is go o d enough that it can b e used by Generalize

hlrt

anyway

To review the rst simplication to Corrob is to provide the ordering in

addition to the page P and the recognizer library This second mo dication means

that the output of the function call Corrob P is a compact lab el structure

noisy

such as in Equation As b efore Generalize now must iterate over all the

hlrt

ways to resolve the remaining ambiguitiesie all ways to choose one instance from

each group The advantage of the compact representation now b ecomes clear

noisy

we can save time by enumerating lab els only as they are needed by Generalize

hlrt

noisy

However in which order should Generalize examine these lab els see line a

hlrt

of the algorithm The third mo dication to Corrob addresses this question

Domainspecic heuristic Proximity ordering

noisy

In the example developed in this section Generalize s line a eventually might

hlrt

examine all four lab els to which the compact representation in Equation is equiv

noisy

alent Ideally we want the true lab el to b e examined rst so that Generalize

hlrt

noisy

can ignore the rest How should Generalize select among the p ossible orders

hlrt

in which to examine these four lab els

To answer this question Corrob employs a simple heuristic which we call prox

imity ordering Consider the second row of the compact representation listed in

Equation We need to choose one instance from h i h i Note that

instance h i is closer to the interval h i than h iie

Proximity ordering simply involves trying h i b efore h i on the basis of this

distance relationship

Similarly in the choice of one instance from h i h i proximity ordering

tries h i b efore h i b ecause

noisy

Of course when Generalize queries this compact representation for a lab el

hlrt

the result must involve a selection of one instance from each of the groups

Proximity ordering in this more general sense states that the lab els are to b e visited

in order of decreasing total proximity

For example when applied to the representation in Equation the proximity

To b e more accurate we should refer to this heuristic as leftproximity ordering b ecause it ignores

proximity to instances on the right

ordering heuristic suggests visiting in the following order the four lab els listed in

Equation

ctry code cap ctry code cap

h i h i h i h i h i h i

h i h i h i h i h i h i

h i h i h i h i

ctry code cap ctry code cap

h i h i h i h i h i h i

h i h i h i h i h i h i

h i h i h i h i

Note that proximity is a relative rather than metric the proximity ordering heuristic

is indierent to the order b etween the second and third lab els

Proximity ordering is clearly a domainsp ecic heuristic As such it might b ehave

well ie force early consideration of a lab el matching the true lab el for some lab eling

problems and p o orly for others In Chapter we demonstrate that this heuristic

works surprisingly well for several actual Internet information resources

Performance of Corrob

At this p oint the natural questions to ask are what is the p erformance of the Corrob

algorithm and how large is the improvement over Corrob

Analytical answers eg asymptotic complexity b ound to these questions would

b e very interesting Unfortunately such b ounds are dicult to derive The problem

is that the number of lab els consistent with any given example the main parameter

governing the algorithms running time dep ends on the sp ecic mistakes made by

the recognizer library

Therefore we have taken an empirical approach to verifying whether Corrob

runs quickly Section rep orts that for several actual Internet resources Corrob

represents a small fraction of the overall running time of our wrapp er construction system

Recognizers

This chapter has made extensive use of the notion of recognizers We have assumed

that a library of p ossiblynoisy recognizers is provided as input discussed techniques

for corrobating the evidence provided by such recognizers and shown how to extend

the algorithms and analysis from Chapter to robustly learn hlrt wrapp ers in the

face of the incorrectly lab eled pages

However we have not actually discussed recognizers per se The reason is simply

that this thesis is concerned with wrapp er induction rather than with the details

of any particular class of recognizer Nevertheless in this section we address several

imp ortant issues related to recognizers

Multiple recognizers First recall that the recognizer library is assumed to con

tain exactly one recognizer p er attribute In fact this is an unimp ortant restriction

The algorithms analysis and technqiues discussed in this chapter can all b e extended

to handle multiple recognizers p er attribute Imp ortantly this holds even when the

recognizers make dierent kinds of mistakes For example the Corrob algorithm can

easily b e extended so that two country recognizers can b e provided one unsound and

the other incomplete

Perfect recognizers Assumption s requirement that the recognizer library

contains at least one p erfect recognizer might seem to o strong However we b elieve

there are two reasons to b e optimistic

First many of the attributes which we want to b e able to extract can b e recognized

p erfectly For example regular expressions can b e used to p erfectly match attributes

such as electronic mail addresses URLs ISBN numbers Internet IP addresses credit

card numbers and US telephone numbers ZIP co des and states It is certainly true

that an adversary can usually defeat such a simple mechanism However note that

a recognizer claiming to b e p erfect need only b ehave p erfectly with resp ect to the

information resources to which it is applied and such resources are unlikely to engage

in a conspiracy with the recognizers adversary

Second though this thesis has largely concentrated on automatic wrapp er con

struction the techniques developed are compatible with a related though more mo d

est goal semiautomatic wrapp er construction Therefore if a p erfect recognizer is

needed for a particular attribute then there is always the p ossibility of asking a

p erson who presumably always give the correct answer

Unreliable recognizers Assumption also prohibits unreliable recognizers

Corrob requires this stipulation for two reasons First sp ecifying the op eration of Cor

rob for unrealiable recognizers is p ossible but somewhat more complicated Therefore

in the interest of simplicity we ignored unrealiable recognizers

However unreliable recognizers also introduce a tremendous complexity blowup

into the corrob oration pro cess The basic problem is that every instance returned by

an unsound recognizer is susp ect In contrast none of an incomplete recognizers

instances and al l but M of an unsound recognizers instances are susp ect In

n

concrete terms this means that with unsound recognizers the PTPSets subroutine

in Figure returns vastly more sets of p ossibly truep ositive instances compared

to libraries without unsound recognizers To use the notation introduced earlier the

problem with unreliable recognizers is that they greatly increase J the number of

n

lab els Corrob outputs for page P see Equation

n

Of course we have seen that even without unreliable recognizers the Corrob al

gorithm has a high computational complexity However the techniques mentioned

in Section to alleviate these problems rely on the assumption of no unreliable

recognizers We leave as op en the issues related to ecient corrob oration algorithms

that do not require Assumption

Onesided error Since we essentially disallow unreliable recognizers our recog

nizer noise mo del requires that recognizers make onesided errorie either false

p ositives or false negatives but not b oth How realistic is this restriction Let us

mention several reasons for b eing optimistic

First of all many techniques for implementing recognizers see b elow for more de

tails naturally give rise to onesided errors For example if one needs a recognizer for

company names then an index of companies from the Fortune list for example

can b e used to implement an incomplete companyname recognizer On the other

hand consider a regularexpressionbased US ZIP recognizer that searches for the

pattern This recognizer is

most likely unsound since it might recognize nonZIP instances eg fragments

of creditcard numbers On the other hand this recognizer will never incorrectly

miss a ZIP co de

Second we note that in some cases it might b e feasible to decomp ose a recognizer

that gives twosided errorsie an unreliable recognizerinto two recognizers each

having onesided error one that is incomplete and another that is unsound If

this decomp osition is p ossible then we will have eectively included an unreliable

recognizer into the library without suering the exp ected computational cost

Otheshelf recognition technology Finally let us briey describ e several ex

isting technologies that might b e useful for implementing recognizers

First simple regularexpression mechanisms can b e surpisingly p owerful for im

plementing recognizers even for relatively structured and complicated attributes For

example using lists of abbreviations obtained from the US Postal Service and related

knowledge we have designed a regular expression that do es a very go o d job of recog

nizing US street addresses the patten is ab out characters long While regular

expressions are certainly simple they should not b e understated

Second there is a large b o dy of research in the naturallanguage pro cessing com

munity concerned with identifying classes of entities in free text For example the

Named Entity task at the MUC conference ARPA involves lo cating p eo

ples names company names dates times lo cations and so forth This research

has matured to the p oint that highquality commercial p eoplename recognizers

are now availableexamples include the Carnegie Groups NameFinder system

wwwcgicom and SSANAME wwwsearchsoftwarecom

Third many dierent entitiesgovernment agencies companies etcare in

volved with compiling sp ecialized databases indices encylop edias dictionaries cata

logs surveys and so forth examples include Bo oks in Print and wwwimdbcom

Consider the following simple example a telephone directory is essentially a list of

p eople and businesses and their addresses and telephone numbers an incomplete

address recognizer or name recognizer can b e readily built by consulting such a list

Of course extracting the content from such resources for the purp ose of build

ing recognizers leads to a meta wrapp er induction problem This idea leads to a

fourth strategy for building recognizers Once a single information resource has b een

successfully wrapp ed the resource can b e mined Etzioni b for additional in

formation that can subsequently b e used to build recognizers for other resources For

example once one telephone directory has b een wrapp ed with assistance from a p er

son p erhaps then the resource can b e queried when learning a wrapp er for a second

telephone directory This b o otstrapping approach might prove useful as more and

more sites providing information related to any particular topic come online

Summary

In this chapter we introduced corroboration a technique for lab eling the example

pages need by the Induce generic inductive learning algorithm The main results of

this chapter are a precise statement of the lab eling problem an algorithm Corrob for

solving an interesting sp ecial case of this problem

Chapter

EMPIRICAL EVALUATION

Introduction

In this thesis we have prop osed a collection of related techniqueswrapper induction

for several wrapp er classes PAC analysis and corrob orationdesigned to address

the problem of automatically constructing wrapp ers for semistructured information

resources How well do our techniques work Our approach to evaluation is thor

oughly empirical In this chapter we describ e several exp eriments that measure our

systems p erformance on actual Internet resources

Are the six wrapp er classes useful

In Chapters and we identied six wrapp er classes lr hlrt oclr hoclrt

nlr and nhlrt Learnability aside an imp ortant issue is whether these classes are

useful for handling the actual information resources we would like our software agents

to use That is b efore addressing how to learn wrapp ers we must ask whether we

should bother to do so Our approach to answering this question was to conduct a

survey In a nutshell we examined a large collection of resources and found that the

ma jority in total of the resources can b e covered by our six wrapp er classes

To maintain ob jectivity we selected the resources from an indep endently col

lected index The Internet site wwwsearchcom maintains an index of Inter

net resources A wide variety of topics are included from the Ab ele Owners Net

work wwwownercom over prop erties nationwide the national resource for

homes sold by owner to Zipp er wwwvoxpoporgzipp er nd the name of your

representative or senator along with the address phone number email and Web

page While the Internet obviously contains more than sites we exp ect that

this index is representative of sites that software agents might use

To p erform the survey we rst randomly selected resources ie

from wwwsearchcoms index Figure lists the surveyed sites The gure also

shows the number of attributes K extracted from each K ranges from two to

eighteen

Next for each of the thirty sites we gathered the resp onses to ten sample queries

The queries were chosen by hand We choose queries that were appropriate to the

resource in question For example for the rst site in Figure a computer hardware

vendor the sample queries were p entium pro newton hard disk cache

memory macintosh server mainframe zip backup and monitor Our

intent was to solicit normal resp onses rather than unusual resp onses eg error

resp onses pages containing no data etc

To complete the survey we determined how to ll in the thirtybysix matrix

indicating for each resource whether it can b e handled by each of the wrapp er classes

To ll in this matrix we lab eled the examples by hand and then used the Generalize

W

algorithm to try to learn a wrapp er in class W that is consistent with the resources

ten examples Note that as discussed in Section for the hlrt class it is trivial

to mo dify Generalize so that it detects if no consistent wrapp er exists in class W

W

p

Our results are listed in Figure indicates that the given resource can b e

wrapp ed by the given wrapp er class while indicates that there do es not exist a

wrapp er in the class consistent with the collected examples

Figure shows a table summarizing Figure Each line in the table indicates

The site wwwsearchcom is constantly up dating its index Our survey was conducted in July

naturally some of the sites might have disapp eared or changed signicantly since then

While learning to handle such exceptional situations is imp ortant our work do es not address this

problem see Do orenbos et al for some interesting progress in this area K coverage class wrapper measure to surveyed we httpwwwcomputerespcom URL httpwwwfilmcomadminsearchhtm httpallpoliticscom httpwwwyahoocomsearchpeople httpwwwultranetcombizinnssearchformhtml httpwwwitnnet httppathfindercomthriveindexhtml httpwwwnewscom httpwwwpharmwebnet httppathfindercomtime httpwwwmonstercom httpgortucsdedunewjour httpwwwiafnet httpwwwvoxpoporgzipper httpwwwjobsjobsjobscom httpwwwexpediacompubgenftsdll httpwwwdemocratsorg httpwwwfourmilabchustaxustaxhtml httpetextvirginiaedursvbrowsehtml httpthetechmiteduShakespeareworkshtml httpublcom httpwwwtravlangcom httpwwwcinemachinecom httpwwwcdclubscom httpshopsnet httppathfindercomvg httpwwwexpediacompubcurcnvtddll httpblueridgeinfomktibmcombikes httpqssecaplcom httpvotercqalertcomcqjobhtm that Schools Engine resources Search Guide Job Search Pharmacy Newsletters Search Site Job of The Shakespeare information Site On List Review Version Search Travelers TelephoneAddress Breakfast The Journals Server Electronic Wide William for Movie WWW Converter Online and Finder of Guide Network List OnLine Standard Server Quarterlys The Search Quote World Bed Newsgroups Wide Party AllPolitics ESP Cycling Electronic Works Search Classieds Web Band Co de Figure Job World Currency APL Travel Address Languages Garden People World Revised Tax Club resource Computer CNNTime Filmcom Yahoo thrivepathnder NEWSCOM Time Monster Internet TravelDatas Internet NewJour Cinemachine PharmWebs Zipp er Exp edia Co olware Ultimate Complete CD Bible Virtual ShopsNet US Exp edia Demo cratic Foreign Cyb erider Security Congressional resource total lr p p p p p p p p p p p p p p p p Figure hlrt p p p p p p p p p p p p p p p p p oclr p p p p p p p p p p p p p p p p The results hoclrt p p p p p p p p p p p p p p p p p of our nlr p p p p coverage nhlrt survey p p p p p p p p p p p p p p p region Fig G A A A A A A A A A A A A C C E E E E E E E E E E E E E E E in region Fig M M M O O N N N N N N N T T R T P P P S S J J J J J J J J J in

wrapp er classes coverage

lr hlrt oclr hoclrt nlr nhlrt

lr hlrt oclr hoclrt

lr oclr

lr

oclr

hlrt hoclrt

hlrt

hoclrt

nlr nhlrt

nlr

nhlrt

nlr nhlrt when not lr hlrt oclr hoclrt

Figure A summary of Figure

the coverage of one or more wrapp er classes For example the rst line indicates that

of the surveyed sites can b e handled by one or more of the six wrapp er

classes

Figure rep orts the coverage for several groups of wrapp er classes The groups

are organized hierarchically rst we distinguish b etween wrapp ers for tabular and

nested resources we then partition the tabular wrapp er classes according to the

relative expressiveness results describ ed by Theorem Section We conclude

that with the exception of nlr the wrapp er classes we have identied are indeed

useful for handling actual Internet resources

Notice that the worst p erforming wrapp er classes are nlr and nhlrt Recall

that we introduced the nlr and nhlrt wrapp er classes in order to handle the

resources whose content exhibited a nested rather than tabular structure The last

line of the table ab ove measures how successful we were we nd that nlr and

nhlrt cover of the resources that the other four classes can not handle We

conclude that despite their relatively p o or showing overall nlr and nhlrt do

indeed provide expressiveness not available with the other four classes

Finally recall from Chapter the discussion of relative expressiveness the extent

to which wrapp ers in one class can mimic those in another We describ ed the ways

that the six classes are related Figure and Theorem for lr hlrt oclr and

hoclrt and Figure and Theorem for lr hlrt nlr and nhlrt

Figure indicates the regions in Figures and to which each of the sur

veyed resources b elongs For the tabular classes for lr hlrt oclr and hoclrt

the surveyed sites are lo cated in several of the regions in Figure though the

surveyed sites do not show the expressiveness dierences b etween lr and oclr or

b etween hlrt and hoclrt For the nested classes nlr and nhlrt the surveyed

sites are distributed in most of the regions in Figure We conclude that our theo

retical results concerning relative expressiveness reect reasonably well the dierences

observed among actual Internet resources

Can hlrt b e learned quickly

In the previous section we saw that it would b e useful to learn wrapp ers in the six

classes describ ed in this thesis b ecause these six classes can in fact wrap numerous

actual Internet resources We now go on to ask how well do es our automatic learning

system work To answer this question we measured the p erformance of our system

when learning the hlrt wrapp er class

We tested our hlrt learning algorithm on actual Internet resources Seventeen

of the surveyed resources listed in Figure were chosen namely those resources

p

that can b e handled by hlrt ie whose second column is marked in Figure

sites and

Four additional resources were examined

okra okraucredu an email address lo cator service The query is a p ersons name

K attributes are extracted name email address database entry date and

condence score Resp onses to queries were collected the queries were ran

domly selected from a collection of p eoples last names gathered from newsgroup

p osts eg Bruss Melese Bhavanasi Kulo e Izab el Beaume

Lib erop oulos

bigbook wwwbigb o okcom a yellow pages telephone directory The query is a

yellow pages category K attributes are extracted company name street

address city state area co de and lo cal phone number Resp onses to queries

were collected the queries were standard yellow pages categories eg Automo

tive Business Services Communications Computers Entertainment

and Hobbies Fashion and Personal Care

corel coreldigitalrivercom a searchable sto ck photography archive They query

is a series of keywords K attributes are extracted image URL image name

and image category Resp onses to queries were collected the queries were

handselected from an English dictionary egartichoke africa basket

industry instruments israel japan juice

altavista wwwaltavistadigitalcom a search engine The query is a series of key

words K attributes are extracted URL do cument title and do cument

summary Resp onses to queries were collected the queries were randomly

selected from an English dictionary eg usurer farm budgetary won

derland p eanut

The okra resource was used to test and debug our implementation and thus it is

p ossible that we inadvertently tuned our system to this resource To ensure impar

tiality we therefore made no changes to the system when running it with the other

resources

We generated the lab els for each example page by using a wrapp er constructed

The okra service was discontinued after this exp eriment was conducted

by hand These wrapp ers one p er resource were used only to lab el the pages the

induction algorithm did not have access to them

Figure provides one measure of the p erformance of our system the number of

examples required for the induction algorithm to learn a wrapp er that works p erfectly

on a suite of test problems On each trial we randomly split the collected examples

into two halves a training set and a test set We ran our system by providing it

with the training set Our metho dology was to give the system just one training

example then two then three and so forth until our system learned a wrapp er that

p erformed p erfectly on the test set Each such trial was rep eated times Our

results demonstrate that for the sites we examined only a handful of examples are

needed to learn a highquality wrapp er

Of course these results assume a p erfect recognizer library Therefore for four of

the sites okra bigbook corel and altavista we p erformed a more detailed

analysis involving imp erfect recognizers

Sp ecically after generating the correct lab els the lab els were sub jected to a

mutation pro cess in order to simulate imp erfect recognizers We tested our system

by varying which attributes were imp erfectly recognized as well as the rate at which

these mistakes were made The result was a set of trials each corresp onding to a

dierent level of mistakes made in the recognition of dierent attributes

Cho osing the number of attributes to mutate Sp ecically to generate the tri

als we rst varied the number G of such imp erfect recognizers from zero up to

K Recall that our Corrob algorithm requires that at least one attribute

must b e recognized p erfectly see Assumption For example with bigbook

we mutated b etween zero and ve of the six attributes

K

Cho osing the attributes to mutate Next we enumerated the ways to

GK G

select which G attributes to mutate from the K total For example when

ways to mutating two G of bigbooks attributes there are

choose two attributes out of six

Assigning unsound and incomplete recognizers Next for a given set of G at

resource examples needed

okra

bigbook

corel

altavista

survey site

survey site

survey site

survey site

survey site

survey site

survey site

survey site

survey site

survey site

survey site

survey site

survey site

survey site

survey site

survey site

survey site

average

Figure Average number of example pages needed to learn an hlrt wrapper that

performs perfectly on a suite of test pages for actual Internet resources

tributes to b e mutated we enumerated all combinations of ways to introduce

either false p ositives simulating an unsound recognizer or false negatives sim

ulating an incomplete recognizer Recall that our Corrob algorithm can not

handle recognizers that exhibit b oth false p ositives and false negatives see As

sumption For example if we decide to mutate G attributes and then

select the URL and title attributes then there are four p ossibilities b oth have

false p ositives b oth have false negatives title has false p ositives and URL has

false negatives and URL has false p ositives and title has false negatives

Varying the mutation rate Once the sp ecic kinds of mutations are chosen we

then varied the rate at which these errors are introduced We use the ve rates

f g Of course corresp onds to a p erfect recognizer

Performing the mutations Finally the selected attributes were mutated in the

selected way at the selected rate

To introduce a false negative rate of a fraction of the correct instances

were simply discarded from the set of recognized instances For example with

if there are ve correct instances then we randomly select and discard

one of them

To introduce a false p ositive rate of we introduced an additional W ran

domly constructed instances where W is the number of correct instances of the

selected attribute These false p ositives were constructed by randomly selecting

an integer b jP j where jP j is the length of the page whose lab el is b eing

mutated We then randomly select a second integer l Finally we

add the pair hb b l i into the set recognized instances As desired after adding

these extra instances a fraction of the entire collection are false p ositives In

the exp eriments we used and so that the inserted instances have

length b etween and characters typical values for these resources

One iteration of this vestep pro cess corresp onds to a single trial Note that the

number of trials is a function of K and thus there are dierent numbers of trials for

the dierent resources in this exp eriment Since randomness is introduced at several

p oints each trial was rep eated several times There are trials for okra and we

ran each trial times there are trials for bigbook and we ran iterations of

each and there are trials for corel and altavista and we ran iterations of

each

As b efore our metho dology was to randomly split the collected examples into

a training set and a test set and then determine the minimum number of training

examples required to learn a wrapp er that p erforms p erfectly on the test set

We use this exp erimental setup to measure two statistics the number of examples

and the CPU time needed to p erform p erfectly on the test set as a function of the

level of recognizer errors ie the number of imp erfectly recognized attributes and

the rate at which mistakes were made

9

3 imperfect recognizers 7 2 imperfect recognizers

1 imperfect recognizer a 5

pages needed for 100% accuracy 3 perfect 10% 20% 30% 40% 45 5 imperfect recognizers 4 imperfect recognizers 3 imperfect recognizers 35 2 imperfect recognizers

1 imperfect recognizer b 25

pages needed for 100% accuracy 15 perfect 10% 20% 30% 40% 4.5

4 2 imperfect recognizers 1 imperfect recognizer 3.5

c 3

2.5

pages needed for 100% accuracy 2 perfect 10% 20% 30% 40%

5 2 imperfect recognizers 1 imperfect recognizer 4

d 3

2 pages needed for 100% accuracy perfect 10% 20% 30% 40%

error rate of each recognizer

Figure Number of examples needed to learn a wrapper that performs perfectly on

test set as a function of the recognizer noise rate for the a okra b bigbook

c corel and d altavista sites

Figure shows our results for the rst statistic the number of examples needed

to achieve p erfect p erformance on the test set Figure a shows our results for the

okra resource Figure b for bigbook Figure c for corel and Figure

d for altavista

Within each graph the dierent curves indicate dierent numbers of imp erfect

attribute recognizers dierent values for G in the ab ove discussion For simplicity

each curve is the average across the various trials for the given value of G The

abscissa measures the noise rate ab ove for each curve

For example in Figure a the topmost curve marked imp erfect recogniz

ers represents all trials in which three of okras attributes were recognized with a

certain level of noise The rightmost p oint on this curve marked error rate of each

recognizer indicates that in these trials the imp erfect recognizers made errors

with noise rate Notice that this toprightmost p oint represents distinct

trials there are ways to choose three out of four attributes to mutate times

ways to assign unsound or incomplete recognizers to the mutated attributes

From Figure we can see that our hlrt induction system requires b etween

and example pages to learn a wrapp er that p erforms p erfectly on the test set

dep ending on the domain and the level of noise made by the recognizers We conclude

that wrapp er induction requires a relatively mo dest sample of training examples to

p erform well

Of course the total running time is at least as imp ortant as the required number

of examples Averaging across all the trials for each resource our system p erforms as

shown in Figure We conclude that our induction system do es indeed run quite

quicklyunder ten seconds for three domains and well under two minutes for the

fourth

All exp eriments were run on SPARC and SGI Indy workstations our system is implemented

in Allegro Common Lisp

Let us emphasize that in contrast to wrapp er execution wrapp er induction is an oline pro cess

Internet total CPU CPU time p er

resource time sec example sec

okra

bigbook

corel

altavista

Figure Average time to learn a wrapper for four Internet resources

The rep orted CPU time is consumed by the two main algorithms developed in

this thesis Generalize Chapter and Corrob Chapter Of the two Corrob

hlrt

is much faster averaging across all resources and all conditions Corrob takes only

ab out of the CPU time the remaining is consumed by Generalize

hlrt

Why do es bigbook take so much longer than the other resources Insp ection of

the query resp onses reveals that bigbooks resp onses are much longer ab out

characters p er resp onse compared to ab out As our complexity analysis

shows Theorem our induction algorithm runs in time that is p olynomial in the

length of the examples Thus the dierence in time p er example is explained by the

dierence in resp onse length

However pro cessing time p er example do es not completely explain the dierence

bigbook also requires many more examples than the other resources For instance

Figure shows that bigbook requires ab out examples with p erfect recogniz

ers while the other resources require only examples The diculty is due to

advertisements The heads of the bigbook resp onses contain one of several dierent

advertisements To learn the head delimiter h examples with dierent advertisements

must b e observed But since there are relatively few advertisements the chance of

p erhaps fresh wrapp ers are learned each Sunday evening for an agents online use during the week

and thus we consider even several minutes to b e quite reasonable

observing pages with the same advertisement is relatively high

Evaluating the PAC mo del

We now turn to an evaluation of the PAC mo del As discussed at the end of Section

our approach to evaluating the PAC mo del is enpirical rather than analytical

The PAC mo del predicts how many examples are required to learn a highquality

wrapp er in this section we compare this prediction with the numbers measured in

the previous section

We use the same metho dology and resources as in Section The PAC param

eters were set rather lo osely we used the value for the accuracy parameter

for the reliability parameter Our results are shown in Figure

As Figure shows we found the PAC mo del predicts that b etween and

examples are needed to satisfy the PAC termination condition dep ending on

the domain and the level of noise made by the recognizers

In constrast earlier we saw that in fact only examples are needed to learn

a satisfactory wrapp er We conclude that our PAC mo del is to o lo ose by one to two

orders of magnitude Moreover note that the exp eriment in Section involved

learning wrapp ers that p erform p erfectly on large test suites which presumably

corresp onds to tighter and values than the relatively lo ose parameters chosen

here

Verifying Assumption Short page fragments

In Section and throughout Chapter we app ealed to Assumption as a basis

for a heuristiccase analysis of the running time of the Generalize function for each

W

wrapp er class W In this section we empirically validate this assumption

We b egin by reviewing Assumption Our Generalize algorithms p erform

W

many string op erations searching for one string in another and enumerating various

1100

900 3 imperfect recognizers 2 imperfect recognizers 1 imperfect recognizer

700 a

500

pages needed for PAC bound 300 perfect 10% 20% 30% 40% 1100

5 imperfect recognizers 900 4 imperfect recognizers 3 imperfect recognizers 2 imperfect recognizers

700 1 imperfect recognizer b

500 pages needed for PAC bound 300 perfect 10% 20% 30% 40%

800

700 2 imperfect recognizers 1 imperfect recognizer 600

c 500

400

pages needed for PAC bound 300

perfect 10% 20% 30% 40%

800

700 2 imperfect recognizers 1 imperfect recognizer 600

d 500

400

pages needed for PAC bound 300

perfect 10% 20% 30% 40%

error rate of each recognizer

Figure Predicted number of examples needed to learn a wrapper that satises

the PAC termination criterion as a function of the recognizer noise rate for the a

okra b bigbook c corel and d altavista sites

strings substrings suxes and prexes Our complexity analysis requires b ounding

the running time of these op erations To do so we need to b ound the length F of

the strings over which these op erations o ccur

In our worstcase analysis we b ounded F by using the length R min jP j the

n n

length of the shortest example page Sp ecically we simply used the b ound F R

It is dicult to improve this b ound using only analytical techniques We there

fore p erformed an heuristiccase analysis by asssuming that the relevant strings have

p

R rather than F R In Section we co died this lengths b ounded by F

approach as Assumption We then used this assumption to derive tighter b ounds

on the running time of our functions Generalize eg Theorem in Section

W

for the hlrt wrapp er class

Of course the validity of these tighter b ounds dep ends on the validity of Assump

tion In order to verify Assumption we simply compared the actual lengths

with the assumed lengths

Sp ecically we examined the example pages collected for the four example re

sources describ ed earlier in this chapter We compared the length R of these examples

pages with the actual length F of the A and S partition elements Note that

mk mk

these partition elements are exactly the strings discussed ab ove whose lengths we

want to b ound Figure shows a scatterplot of the hF Ri pairs collected

p

R from the example do cument as well as the function F

Since the data are so noisy it is dicult to tell the extent to which Assumption

holds We measured this supp ort in three ways First we t the data to a

p

mo del of the form F R We found that the value minimizes the least

squarederror b etween the mo del and the data Figure also shows the b estt

p

mo del F R Note that if then Assumption is conrmed b ecause it

overestimates the page fragment lengths

As a second technique for demonstrating that the assumption holds we measured

p

R We nd that more the fraction of the data that lies b elow the curve F of 30000 models two as model wel l as 25000 bestt R the R lengths p page 20000 F versus and F lengths 15000 page length R Assumption fragment by 10000 partition demanded Best-fit: F = R^(1/3.1) observed model the Assumption 4.1: F = R^(1/3) the of 5000 R p F scatterplot A 0 5

35 25 15 fragment length F length fragment relationship this Figure

than of the data lie b elow this curve Thus while the data contain p oints that

violate Assumption the ma jority of the data is within a small constant factor of

assumptions predictions

Finally we randomly selected of the hF R i pairs and measured the ra

p

jF R j

which indicates the closeness of the predicated and observed lengths tio

R

Assumption is validated to the extent that this ratio is zero We nd that the

average value of this ratio is

Based on these three ways of comparing the predicted and observed page fragment

lengths we conclude that the data drawn from our actual Internet resources validate

Assumption

Verifying Assumption Few attributes plentiful data

In Chapter we developed a PAC mo del for learning the hlrt wrapp er class

M

tot

N

K R

K

see Section for details Assumption states that the sum on the left side of

this inequality is dominated by its second term

M

tot

N

K R

K

In this section we empirically verify whether this assumption do es in fact hold

Sp ecically for each of the four resources introduced earlier in this chapter we mea

sured the ratio

M

tot

K

K

N

R

As b efore for this exp eriment we set Clearly Assumption is validated

to the extent that this ratio is equal to zero

Our exp eriment was designed as follows For each of the four resources we iterated

the following pro cess times First we selected a random integer N b etween

Internet average maximum

resource ratio ratio

okra

bigbook

corel

altavista

Figure The measured values of the ratio dened in Equation

and We then randomly selected N examples pages and computed the ratio

listed in Equation

Figure summarizes our results For each resource we rep ort the average

and maximum values of the ratio dened in Equation over the trials We

conclude that Assumption is strongly validated for the four resources we examined

Measuring The PAC mo del noise parameter Equation

Recall from Chapter that if the recognizer library contains imp erfect recognizers

then the Corrob algorithm might output more than one lab el for a given page b ecause

the lab els might all b e consistent with the evidence provided by the recognizers Of

course only one such lab el is correct and if a page is incorrectly lab eled then the

induction system might output the wrong wrapp er

Without additional input Corrob has no basis for preferring one lab el over an

other As describ ed in Chapter our strategy for dealing with this problem is very

noisy

simple Generalize tries to generalize from the noisy lab els we can only hop e

hlrt

Note that Figure indicates that with p erfect recognizer the four resources require ab out

pages to satisfy the PAC termination criterion Therefore evaluating Assumption for either

very few or very many examples is unnecessary in b oth case b oth the original PAC mo del and the

approximation agree Therefore the most signicant test for Assumption involves N near

noisy

that no consistent wrapp er will exist so that Generalize s line a will try the

hlrt

noisy

next combination of lab els Since Generalize tries all such combinations eventu

hlrt

noisy

ally the correct combination of lab els will b e tried and Generalize will nd and

hlrt

return the correct wrapp er

Unfortunately as describ ed in Chapter this strategy ruins our PAC anal

ysis b ecause there is always chance that we will inadvertently encounter a set of

examples that are incorrectly lab eled yet for which there exists a consistent wrapp er

We extended our PAC mo del by introducing a noise rate which quanties this

chance Equation We want to b e small b ecause as Equation shows as

approaches zero the PAC mo del requires fewer examples

So far we have given only an informal argument that is small If our system is

trying to learn how to extract the country from a page containing BCongoB

but the pages lab el extracts the string BCon instead of Congo then it is quite likely

that no consistent wrapp er will b e found

We now provide empirical evidence that is indeed very small We rep eated the

following pro cess times for each of the four information resources introduced

earlier in this chapter First we randomly selected example pages We then

mutated a small fraction of these pages true lab els Sp ecically we examined each

pair hb ei One p ercent of the time we replaced this pair with a dierent pair either

hb ei or hb e i The choice of whether to change b or e is random In each

case is randomly selected from the range The eect is that of the hb ei

pairs are slightly jiggled in an attempt to simulate lab els that are correct except

for a few minor errors We then passed these mutated lab els to Generalize which

hlrt

as usual tries to nd a wrapp er that is consistent with the examples

Our results are as follows Across all twenty thousand trials we observed no

cases for which there exists a consistent hlrt wrapp er Recall that the noise rate

measures the chance of such a case and so we conclude that for the Internet resources

we examined is extremely close to zero

The WIEN application

As an additional attempt to validate the techniques prop osed in this thesis we have

developed wien a wrapper induction environment application see Figure Us

ing a standard Internet browser a user shows wien an example do cument a page

from lycos in this example Then with a mouse the user indicates the fragments of

the page to b e extracted eg the do cument URL title and summary wien then

tries to learn a wrapp er for the resource When shown a second example wien uses

the learned wrapp er to automatically lab el the new example The user then corrects

any mistakes and wien generalizes from b oth examples This pro cess rep eats until

the user is satised

In terms of the techniques developed in this thesis wien provides a complete

implementation of the Generalize algorithm with the user playing the role of

hlrt

the example oracle Oracle

T D

The current version of wien do es not implement the Corrob algorithm However

as shown in Figure a we have provided an extensible facility for applying rec

ognizers to the pages When wien starts it dynamically loads from a rep ository a

library of recognizers The user selects which recognizers are to b e applied and then

manually corrects any mistakes Thus while the user plays the role of the Corrob

algorithm its input recognizer library is applied automatically

In addition to this basic functionality wien provides several facilities designed to

simplify the pro cess of automatic wrapp er construction Figure b shows wiens

mechanism for editing the set of attributes to b e extracted Each attribute is assigned

a color which is used to highlight the extracted text in the browser window d

Since it is easy to make minor mistakes when marking up the pages a facility is

provided for reviewing and editing the extracted fragments c Finally since

some extracted attributes might not app ear when the page is rendered eg the URL

wien is pronounced like a d Figure The wien b application being e used to learn a wrapper c for lycos

attribute we have provided a facility for examining and marking up the raw html

e

Finally we have extended wien in several ways not discussed elsewhere in this

thesis The most interesting extension is aggressive learning If the user is willing to

correct its mistakes then he can invoke the learning algorithm b efore the example

pages are completely lab eled and Generalize will do its b est Sp ecically when

hlrt

invoked with incompletely lab eled pages Generalize might b e able to nd a

hlrt

wrapp er that is consistent with all the examples If such a wrapp er can b e found it

is then applied to the examples and any new extracted text fragments are identied

The user can then correct any mistakes and invoke the regular learning algorithm

on the resulting p erfect lab els As Section shows relatively few examples are

usually sucient for eective learning and so this aggressive learning mechanism

considerably simplies the wrapp er construction pro cess

We have not conducted user studies with wien although anecdotally it app ears

to b e a useful to ol for helping to automate the wrapp er construction pro cess

Chapter

RELATED WORK

Introduction

Our approach to automatic wrapp er construction draws on ideas from many dierent

research areas In this chapter we review this related literature by considering three

kinds of work First we describ e work related to the motivation and background

of this thesis Section Second we describ e several pro jects that are concerned

with similar applications Section Finally we describ e work that is related at a

formal or theoretical level Section

Motivation

The systems that motivate our fo cus on wrapp er induction are mainly concerned with

the integration of heterogeneous information resources Section In addition

research on supp orting legacy systems is also relevant Section as is work on

the development of standard information exchange proto cols Section

Software agents and heterogeneous information sources

As describ ed in Section our concern with wrapp er induction can b e traced to the

pressing need for systems that interact with information resources that were designed

to b e used by people rather than machines The hop e is that such capabilities will

relieve users of the tedium of directly using the growing plethora of online resources

There are two distinct subcommunities working in this area The rst involves

articial intelligence researchers interested in software agents see Etzioni et al

Wooldridge Jennings Bradshaw for surveys The second consists of

database and information systems researchers working on the integration of hetero

geneous databases see Gupta for a survey

We were strongly inuenced by the University of Washington softb ot

pro jects Etzioni et al Etzioni Etzioni Weld Perkowitz Etzioni

Selb erg Etzioni Etzioni a Kwok Weld Selb erg Etzioni

Do orenbos et al Friedman Weld Shakes et al Related pro jects

at other institutions include carnot Collet et al disco Florescu et al

garlic Carey et al hermes Adali et al the Information Manifold

Levy et al sims Arens et al tsimmis Chawathe et al fusion

Smeaton Crimmins BargainFinder Krulwich and the Knowledge Broker

Andreoli et al Chidlovskii et al

While these systems are primarily research prototypes there is substantial com

mercial interest in software agents and heterogeneous database integration pro ducts

Examples include Jango wwwjangocom Junglee wwwjungleecom AlphaCON

NECT wwwalphamicrocom BidFind wwwvsnnetaf LiveAgent wwwagent

softcom Computer ESP oracleuvisioncomshop the Shopping Explorer shop

pingexplorercom and Texis thunderstonecom

The details of these pro jects vary widely but all share a common need for a layer

of wrapp ers b etween the integration system and the information resources it accesses

For example the Ahoy system Shakes et al searches for p eoples home pages

by querying email address lo cator services search engines etc while Junglee www

jungleecom integrates across multiple sources of apartment or job listings Currently

each relies on handcrafted wrapp ers to interact with the resources it needs

Finally let us mention that wrapp er construction touches on subtle issues re

lated to copyright protection in the information age Do es the owner of an online

newspap er or magazine have the right to dictate that its pages can not b e automat

ically parsed and reassembled eg to remove advertisements without p ermission

Clearly these socalled digital law issues are well b eyond the scop e of this thesis

see infolawalertcom for further information

Legacy systems

While we are primarily motivated by the task of integrating heterogeneous infor

mation resources our work is also related to the management of socalled legacy

systems Aiken Bro die Stonebraker For various economic and engineering

reasons many information rep ositories are buried under layers of outdated complex

or p o orlydo cumented co de

Of particular relevance are legacy databases One common strategy is to en

capsulate the entire legacy database in a software layer that provides a standard

interface There is substantial eort are building such wrapp ers see for example

Shklar et al Shklar et al Roth Schwartz

It remains an op en question whether the techniques we have developed in this

thesis can b e applied to legacy systems Certainly few of these systems use html

formatting But we conjecture that the formatting conventions used do exhibit suf

cient regularity that our techniques can b e applied For example a system was

recently developed to automatically execute queries against the University of Wash

ington legacy telnet interface to the library catalog Drap er The system uses

vt escap e sequences to format the requested content Like the wrapp er classes

describ ed in this thesis the system uses constant strings such as ESCj or Title

to identify and extract the relevant content

Standards

Finally in Section we describ ed one ob jection to the very idea of wrapp ers why

not simply adopt standard proto cols to facilitate communication b etween the informa

tion providers and consumers A wide variety of such standards have b een developed

for various purp oses examples include corba wwwomgorg odbc wwwmicro

softcomdatao db c xml wwwworgTRWDxml kif logicstanfordedukif

z lcweblocgovz widl wwwwebmethodscom shoe Luke et al

and kqml Finin et al

The main problem with all such standards is that they must b e enforced Un

fortunately despite the clear b enets there are many reasonseconomic technical

and so cialwhy adherance is not always universal For example online businesses

might prefer to b e visited manually rather than mechanically and refusing to adhere

to a standard is one way to discourage automatic browsing Search engines such as

Yahoo for example are in the business of delivering users to advertisers not servic

ing queries as an end in itself Similarly a retail store might not want to simplify the

pro cess of automatically comparing prices b etween vendors And of course the cost

of reengineering existing resources might b e prohibitive

As with the earlier discussion of digital law these issues are complex and b eyond

the scop e of this thesis The b ottom line though is that certainly automatic wrapp er

construction will b ecome less comp elling if standards b ecome entrenched

Applications

We now turn to research that is fo cused on applications that are similar to our work

on wrapp er construction and information extraction

Systems that learn wrappers

We know of three systems that are involved with learning wrapp ers ila shopbot

and ariadne

ila The ila system Perkowitz Etzioni Perkowitz et al was prop osed as

an instance of the idea of a softb ot learning how to use its to ols ila takes as

input a set of preparsed query resp onses its task is to determine the semantics of

the extracted text fragments For example when learning ab out the countrycode

resource ila is given the information content

fhCongo i hEgypt i hBelize i hSpain ig

Note that ila would not directly manipulate the raw html eg Figure c

ila then compares these values with a collection of facts such as

namec Congo countrycode c capitalc Brazzaville

namec Ireland countrycode c capitalc

namec Spain countrycode c capitalc

On the basis of this background knowledge ila hypothesizes that the rst attribute is

the name of the country while the second is its telephone country co de rather than

for example its capital city Thus ila has learned that the countrycode resource

returns pairs of the form hnamec countrycode ci

This example is very simple For more complicated resources ila searches the

space of functional relationships b etween the attribute values For example ila might

discover that some resource returns pairs consisting of a country and the address of

the mayor of its capital city hnamec address mayorcapitalci Moreover if

necessary ila asks fo cused queries to disambiguate the observed examples

ila and our wrapp er construction techniques thus provide complementary func

tionality ila requires that the input examples b e preparsed but uses knowledge

based techniques for determining a resources semantics In contrast our system uses

a relatively primitive mo del of semantics recognizers and the corrob oration pro cess

but learns how to parse do cuments Clearly integrating the two approaches is an

interesting direction for future work

shopbot A second system that learns wrapp ers is shopbot Do orenbos et al

Perkowitz et al In many ways shopbot is much more ambitious than our work

As a fulledged autonomous shopping agent shopbot not only learns how to extract

a vendors content but also how to query the vendors and how to identify unusual

conditions eg a vendor not sto cking some particular pro duct In this comparison

we fo cus exclusively on shopbots techniques for learning to parse a vendors query

resp onses

shopbot op erates by lo oking for patterns in the html source of the example

do cument shopbot rst partitions the example pages into a sequence of logical

records the system assumes that these records are separated by visually salient html

constructs such as HR LI or BR Each record is then abstracted by removing

nonhtml characters generating a signature for each record For example when

examining the countrycode resp onse in Figure c shopbot would generate the

following six signatures

HTMLTITLEtextTITLEBOD Y Btext B P

BtextB ItextIBR

BtextB ItextIBR

BtextB ItextIBR

BtextB ItextIBR

HRBtextBBODYHTML

shopbot then ranks these signatures by the fraction of the pages each accounts for

In the example the rst and last signatures each account for onesixth of the page

while the middle four identical signatures together account for foursixths of the page

On this basis shopbot decides to ignore the parts of the page corresp onding to the

rst and last signatures and extracts from the countrycode resource only those

records that match the signature BtextB ItextIBR

shopbot also uses domainsp ecic heuristics to rank these signatures For ex

ample in shopping tasks the presence of a price suggests that a signature is correct

Moreover these heuristics are used to extract some information from each record For

example shopbot attempts to extract prices from each record so that its results

can b e sorted

At this p oint we can compare shopbot and our wrapp er induction system shop

bot uses htmlsp ecic heuristics to identify records In contrast our system learns

to exploitbut do es not dep end onhtml or any other particular formatting con

vention Moreover although shopbot do es partially extract the content of the indi

vidual records using domainsp ecic heuristics it do es not learn to fully parse each

record Thus shopbot uses wrapp ers from a class that we might call hoct shop

bot learns to ignore pages heads and tails and extract each tuple as a whole but it

do es not learn to extract the individual attributes within each tuple

ariadne Finally ariadne is a semiautomatic system for constructing wrapp ers

Ashish Knoblo ck a Ashish Knoblo ck b ariadne is targeted at hierar

chically structured do cuments similar to those we discussed in Section

ariadne employs p owerful though html and domainsp ecic heuristics for

guessing the structure of pages For instance relative font size is a usually go o d

clue for determining when an attribute ends and sub ordinate attributes b egin For

example given the following page

Introduction to CMOS Circuits

FONT SIZECMOS LogicFONT

FONT SIZEThe InverterFONT

MOS Transistor Theory

ariadne would hypothesize that Introduction Circuits and MOS Theory

are instances of the highestlevel headings while CMOS Logic is a subheading and

The Inverter is a subsubheading In addition to font sizes ariadne attends to

the use of b old or italics fonts sequences of alphanumeric characters ending with a

colon eg Title relative indentation and several other heuristics

Once the hierarchically nested structure of such a do cument is determined ari

adne generates a grammar consistent with the observed structure Finally a parser

is generated that accepts sentences ie pages in the learned grammar Of course

since the heuristics used to guide these pro cesses are imp erfect the user must correct

ariadnes guesses Ashish and Knoblo ck rep ort that only a handful of corrections

are needed in practice

ariadnes grammars are similar to the nlr wrapp ers dened in Section

The dierence is that in nlr the and r delimiters must b e constant strings

k k

while they can b e regular expressions in ariadnes wrapp ers Note though that

ariadne do es not learn these regular expressions but rather imp orts them from the

heuristics used to nd the pages structure

Information extraction

At the highest level this thesis is concerned with information extraction IE This

eld has a rich literature see Hobbs Cowie Lehnert for surveys Although

similar in spirittraditional IE is the task of identifying literal fragments of an input

text that instantiate some relation or conceptour use of the phrase information

extraction diers in ve ways

The rst dierence concerns our fo cus on extralinguistic regularity to guide ex

traction In contrast with its ro ots in natural language pro cessing NLP much

other IE work is designed to exploit the rich linguistic structure of the sources do c

uments While such approaches are useful in many applications we have found that

the Internet resources which motivate us often do not exhibit this linguistic structure

A second dierence is that many knowledgeintensive approaches to IE are slow

and thus b est suited to oline extraction while a software agents wrapp ers must

execute online and therefore quickly

A third dierence is that we take a rather rigid approach to information extraction

demanding that the text to b e extracted is reliably delimited by constant strings the

r and in our various wrapp er classes In contrast NLPbased IE systems b egin

k k

by segmenting the text in eect inserting invisible delimiters around noun phrases

direct ob jects etc Our approach could in principle op erate on unstructured text once

it has b een annotated with such syntactic delimiters eg adding NP and NP

around noun phrases DOBJ and DOBJ around the direct ob jects etc

A fourth dierence is that most IE systems employ rules that extract relevant

text fragments A separate p ostpro cessing phase is usually employed to group these

fragments together into a coherent summarization of the do cument For instance

this p ostpro cessing phase is resp onsible for handling anaphora eg resolving pro

noun referents or merging fragments that were extracted twice by dierent rules In

contrast our approach to information extraction is to identify the entire content of

the page at once Of course this approach is successful only b ecause the do cuments

in which we are interested have certain kinds of regular structure

Finally given our goal of eliminating the engineering b ottleneck caused by con

structing wrapp ers by hand we are interested in automatic learning techniques In

contrast many IE systems are handcrafted The two notable exceptions are au

toslog and crystal

autoslog Rilo learns information extraction rules The system uses heuris

tics to identify sp ecic words that trigger extraction For example when learning to

extract medical symptoms not exp erienced by patients autoslog would learn that

the word denies indicates such symptoms based on example sentences such

as The patient denies any episo des of nausea While there are many imp ortant

dierences it is apparent that this learned rule is similar to our wrapp ers in that a

sp ecic literal string is used to delimit the desired text

Like autoslog crystal So derland et al So derland b and its descendant

webfoot So derland c learn information extraction rules crystal takes as input

a set of lab eled example do cuments and a set of features describing these do cuments

As originally envisioned crystal used linguistic features such as partofsp eech tags

and it learned rules that are triggered by these linguistic features webfoot extends

crystal to handle nonlinguistic Internetbased do cuments webfoot provides the

crystal rulelearning algorithm with a set of features that are useful for informa

tion extraction from Internetbased rather than free natural language do cuments

Sp ecically a set of htmlsp ecic heuristics are used to identify the text fragments

to b e extracted the heuristics are similar to those used by shopbot and ariadne

Similar ideas are explored in Freitag Freitag Using as a test domain

relatively unstructured departmental talk announcements Freitag demonstrates that

dierent machine learning techniques can b e combined to improve the precision with

which various information extraction tasks can b e p erformed

These three systemsautoslog crystalwebfoot and Freitags work

suggest promising directions for future research Drawing on techniques from b oth

NLP and machine learning they p oint to an integration of NLPbased techniques

and our extralinguistic techniques see Section for details

Recognizers

Chapter describ es techniques for automating the pro cess of lab eling the example

pages Central to these techniques is a library of recognizers domainsp ecic pro ce

dures for identifying instances of particular attributes

Recognizers for sp ecic attributes have received much attention in the text pro

cessing communities For example the Sixth Message Understanding Conferences

Named Entity task ARPA ftpmucsaiccompubMUCMUCguidelinesne

taskdefvpsZ involves identifying particular kinds of information such as p eople

and company names dates lo cations and so forth

Certain highly valuable attributes such as company and p eoples names

have received substantial attention Rau Borgman Siegfried Paik et al

Hayes This research has matured to the p oint that highquality commercial name

recognizers are now availableexamples include the Carnegie Groups NameFinder

system wwwcgicom and SSANAME wwwsearchsoftwarecom Our recogniz

ers are also similar to Apple Computers Data Detectors applescriptapplecomda

ta detectors

Others have taken machine learning approaches to constructing recognizers for

particular kinds of attributes The wil system Goan et al uses novel grammar

induction techniques to learn regular expressions from examples of the attributes

values they demonstrate that their techniques are eective for learning attributes

such as telephone numbers and US Library of Congress Call Numbers Freitag

presents similar results for the domain of academic talk announcements

Finally the eld matching problem Monge Elkan is relevant to building

recognizers Field matching involves determining whether two character strings such

as Dept Comput Sci Eng and Department of Computer Science and Engi

neering do in fact designate the same entity Monge and Elkan prop ose heuristics

that are eective at solving the eld matching problem in the domain of aligning

academic department names and addresses In Section we suggested that one

technique for building recognizers is to exploit existing indiceseg constructing a

company name recognizer from the Fortune list Such recognizers will probably

p erform p o orly unless they solve the eld matching problem

Document analysis

Our approach to information extraction exploits the structure of the information

resources query resp onses For example the lr hlrt oclr and hoclrt wrapp er

classes exploit the fact that the page is formatted with a tabular layout while the

nlr and nhlrt classes assume a hierarchically nested layout

There is a wide variety of research concerned with recovering a do cuments struc

ture Of particular relevance to our work is the recovery of structured informa

tion such as tables of tablesofcontents Douglas et al Douglas Hurst

discuss techniques for identifying the tabular structured in plain text do cuments

Green Krishnamo orthy solve a similar problem except that their system takes

as input scanned images of do cuments

More ambitiously Rus Subramanian provide a theoretical characterization

of information capture and access a novel approach to the development of systems

that integrate heterogeneous information sources The idea is to formalize the notion

of a do cument segmenter which identies p ossiblyrelevant fragments of the do cu

ment These candidates are then examined by structure detectors which lo ok for

patterns among the segments This work is interesting b ecause many dierent kinds

of heuristic techniques for identifying do cument structure can b e mo deled using their

formalism For example when trying to identify a do cuments tabular structure a

segmenter might lo ok for rectangular areas of whitespace and then a structure de

tector would try to lo cate rep etitive arrangements of these regions that indicate a

table

Rus and Subramanian demonstrate that their techniques eectively identify the

tabular structure of plain text do cuments such as retail pro duct catalogs They

also demonstrate techniques for automatically extracting and combining informa

tion from heterogeneous sources For example they have developed an agent that

scans various sources of sto ck price listings to generate graphs of p erformance over

time Unfortunately this system relies on handco ded heuristics to help it determine

the semantics of the extracted information Integrating these techniques with ila

Perkowitz Etzioni Perkowitz et al would b e an interesting direction for

future work

Formal issues

So far we have describ ed the related pro jects at a relatively shallow level of detail

The reason is simply that while our wrapp er induction work is motivated by or ad

dresses similar issues as other work the nutsandbolts technical details are actually

very dierent However there are two research areas which call for a more detailed

comparison First we rst briey describ e the connection b etween wrapp er induction

and grammar induction learning Section Second we describ e the relationship

b etween our PAC mo del and others in the literature Section

Grammar induction

For each wrapp er class W the ExecW pro cedure uses a nite amount of state to scan

and parse the input page augmented with additional b o okkeeping state for keeping

track of the extracted information Although unbounded this b o okkeeping state is

entirely unrelated to the state used for parsing and thus our wrapp ers are formally

equivalent to regular grammars It is thus natural to compare our inductive learning

algorithms with the rich literature in regular grammar induction see for example

Biermann Feldman Gold Angluin Smith Angluin

We did not employ otheshelf grammar induction algorithms To understand

why recall the purp ose of wrapp ers they are used for parsing not simply classi

cation That is our wrapp ers can not simply examine a query resp onse and conrm

that it came from a particular information resource Rather a sp ecic sort of exam

ination must o ccur namely one that involves scanning the page in such a way as to

identify the text fragments to b e extracted

Therefore we require that the nitestate automaton to which the learned gram

mar corresp onds have a sp ecic state top ology Ecient induction algorithms have

b een developed for several classes of regular grammars eg reversible Angluin

and strictly regular Tanida Yokomori grammars The diculty is simply that

we do not know of any such results that deliver the particular state top ology we re

quire Therefore we have developed novel induction algorithms targeted sp ecically

at our wrapp er classes

PAC model

Our PAC mo del Equation in Section app ears to b e more complicated than

most results in the literature

The reason for this complexity is simply that we are learning relatively compli

cated hypotheses Our pro ofs rely on elab orate constructions such as our use of a

disjunction of K terms to capture the ways that a wrapp ers error can exceed

pages or our use of an interval L U to represent the set of prop er suxes

common to a given set of strings pages

However once these constructions are in place we make use of wellknown pro of

techniques Our pro of of Lemma Bs parts is a simplied version of the well

known pro of that axisaligned rectangles are PAClearnable Kearns Vazirani

pp and our pro of of part Lemma B part is so basic as to b e published in

AI textb o oks eg Russell Norvig pp

However there are two areas of related work which must b e addressed a com

parison of our results with the use of the VapnikChervonenkis VC dimension and

our mo del of lab el noise

The VC dimension The VC dimension of a hypothesis class is a combi

natorial measure of the inherent diculty of learning hypotheses in the class

Vapnik Chervonenkis The VC dimension of hypothesis class H is dened

as the cardinality of the largest set I of instances for which there exists hypotheses

jI j

in H that can classify I s members in all p ossible ways

For example supp ose we treat the intervals of the real number line as a hypothesis

class where an interval classies as true those p oints enclosed in the interval No

matter how we pick any two real numbers we can always nd an interval that

contains b oth p oints contains neither p oint contains one p oint but not the other

and vice versa Therefore the VC dimension of the real interval hypothesis space is

at least two However this prop erty do es not hold for any set of three real numbers

there is no interval that includes the two extreme p oints yet excludes the middle

p oint Therefore the VC dimension of the real interval hypothesis space is exactly

two

The VC dimension is useful b ecause PAC b ounds for hypothesis classes with nite

VC dimension have b een developed Haussler Blumer et al

Analyzing the VC dimension of the hlrt wrapp er class is somewhat complicated

We can easily analyze the VC dimension of the r and k delimiters in

k k

isolation Since we reduce the problem of learning the r and the for k to

k k

the problem of learning an interval over the integers see the pro of of Lemma B

parts is Section B we know that the VC dimension of each of these K

comp onents is two However what is the VC dimension of the h t subspace

of the hlrt class We conjecture that this VC dimension is innite since pages

and the hlrt delimiters are all strings of unbounded length there are essentially an

unbounded number of degrees of freedom in the system If this conjecture holds then

the b ounds rep orted in Haussler Blumer et al are inapplicable In summary

we sp eculate that the PAC b ounds we rep ort for hlrt are tighter than obtainable

using VCtheoretic analysis

Lab el noise Finally let us elab orate on the mo del of noise assumed in our exten

sions of the PAC mo del to handle corrob oration Section

Recall that we use an extremely simple mo del of noise Equation denes

the parameter which measures the danger asso ciated with lab els noise The

noisy

algorithm simply assumes that wrapp ers can b e found only for pages Generalize

hlrt

that are lab eled correctly measures how risky this assumption is

Note that do es not measure the chance that any particular lab el is wrong that

any particular set of lab els are wrong etc Instead plays the following role The

noisy

Generalize algorithm Figure assumes that a consistent wrapp er exists only

hlrt

noisy

for correctly lab eled pages is a measure of how likely Generalize is to make

hlrt

a mistake by following this pro cedure For example a recognizers library might

make many mistakes so that nearly all the lab els returned by Corrob and therefore

noisy

tried by Generalize contain errors Nevertheless might b e close to zero if the

hlrt

information resource under consideration is structured so that incorrect lab els can b e

discovered by the Generalize algorithm hlrt

This approach to handling noise app ears to b e novel The PAC literature eg

Angluin Laird Kearns Decatur Gennaro covers a wide variety of

noise mo dels and strategies for coping with this noise

The basic idea is that when the learning algorithm asks the Oracle for any

T D

example the oracle returns the correct example hI T I i with probability and

returns a defective example hI L i with probability where L is a mutated version

of T I eg if the lab els are binary then L T I and is the chance

that the oracle is wrong Under these circumstances reliable learning is a matter of

obtaining enough examples how many dep ends on to comp ensate for the fact that

any particular lab el might b e wrong

In summary standard approaches to learning from noise examples involve extra

sampling in the hop e that the noise will wash out In contrast we assume an oracle

that provides not one but several lab els for each example and we use selfcorrection

mechanism to verify that the observed lab els are correct Our exp eriments in Section

conrm that this selfcorrection mechanism is highly reliable ie for the

Internet resources we have examined

Chapter

FUTURE WORK AND CONCLUSIONS

Thesis summary

Let us review this thesis by discussing our three main contributions

Contribution Automatic wrapp er construction as inductive learning

We b egan with a description of the kinds of information extraction tasks in which

we are interested Chapter an information resource resp onds to queries with a

response eg an html page containing the content we are lo oking for We fo cus on

pages that are semistructured in that while they might contain advertisements and

other extraneous information the content itself ob eys rigid though arbitrary and

unknown formatting conventions Sp ecically we fo cus on resources whose content

is formatted in either a tabular or hierarchically nested manner A wrapper is simply a

pro cedure customized to a particular information resource for extracting the content

from such pages

We then describ ed inductive learning our approach to automatically learning

wrapp ers Chapter Induction is the pro cess of reasoning from a set of examples

to some hypothesis that in some applicationsp ecic sense explains or generalizes

over the examples For example if told that Thatcher lied Mao lied and Einstein

didnt lie an inductive learner might hypothesize that Politicians lie Each example

consists of an instance and its label according to the target hypothesis Induction is

thus a matter of reconstructing the target from a sample of its b ehavior on a number

of example instances We dened the Induce generic learning algorithm To learn a

particular hypothesis is some particular class H Induce is provided with an oracle

which provides a stream of examples as well as a generalization function Generalize

H

that is sp ecialized to the class H

In our application examples corresp ond to the query resp onses and hypotheses

corresp ond to wrapp ers The example pages are lab eled with the content that is to b e

extracted from them Describing how to learn wrapp ers thus mainly involves dening

a class W of wrapp ers and the implementing the function Generalize

W

Contribution Reasonably expressive yet eciently learnable wrapp er

classes In Chapter we describ e one such class hlrt As discussed in Section

h eadleftrighttail wrapp ers formalize a simple but common programming idiom

hlrt wrapp ers op erate by scanning the page for sp ecic constraint strings that sepa

rate the b o dy of the page from its head and tail which might contain extraneous but

confusing text as well as strings that delimit each text fragment to b e extracted

Though simple our empirical results indicate that hlrt wrapp ers can handle

of a recently surveyed list of actual Internet resources Section

In Section we pro ceeded by dening the Generalize function To do so

hlrt

we sp ecied a set of constraints denoted C that must hold if an hlrt wrapp er

hlrt

is to op erate correctly on a given set of examples Thus Generalize is a sp ecial

hlrt

purp ose constraint satisfaction engine customized to solving the constraint C

hlrt

Our nave generateandtest implementation of Generalize is relatively simple

hlrt

Unfortunately this algorithm runs very slowly our worstcase complexity analysis

shows that it runs in time that is exponential in the number of attributes Theo

rem But a careful examination of C reveals that it can b e decomp osed

hlrt

into several distinct subproblems In Section we use this analysis to develop

the Generalize algorithm which runs in time that is linear in the number of

hlrt

attributes Theorem

Our exp eriments in Section demonstrate that Generalize is quite fast in hlrt

practice Indeed this p erformance is somewhat at o dds with Theorem s com

plexity b ound while the b ound is linear in the number of attributes it is actually a

degreesix p olynomial in the size of the examples Our heuristiccase analysis Theo

is so fast in practice rem in Section explains analytically why Generalize

hlrt

In Section we empirically validated the assumptions underlying this analysis

In any inductive learning setting running time is only one imp ortant resource

The number of examples required for eective learning is often even more signi

cant For instance in our wrapp erconstruction application each example consumes

substantial network bandwidth The p robably approximately correct PAC mo del

of inductive learning an approach to obtaining theoretical b ounds on the number of

examples needed to satisfy a usersp ecied level of p erformance In Section we

describ ed the PAC mo del in general and in Section we developed a PAC mo del for

the hlrt wrapp er class The main result is thatunder assumptions that are empir

ically validated Section the PAC mo dels prediction of the number of training

examples required for the hlrt class is independent of the number of attributes and

polynomial in all relevant parameters Theorem

Finally hlrt is just one of many p ossible wrapp er classes In Chapter we dene

ve more classes lr oclr hoclrt nlr and nhlrt Adopting the framework

developed for the hlrt class we describ e the Generalize function for each of these

W

ve wrapp er classes We then compare all six classes on two grounds First Theorems

and we analyze the relative expressiveness of the six wrapp er classes which

characterize the extent to which wrapp ers in one class can mimic the b ehavior of

another Second Theorems and we analyzed the heuristic

case running time of each Generalize function

W

Contribution Noisetolerant techniques for automatically lab eling ex

amples Our inductive approach to constructing wrapp ers requires a supply of la

beled example pages In Chapter we describ e techniques to automate this lab eling

pro cess The basic idea is to take as input a library of recognizers A recognizer is a

pro cedure that examines a page for instances of a particular attribute For example

when learning a wrapp er for the countrycode resource the system takes as input

a recognizer for the countries and another recognizer for the co des The results of

these recognizers are then integrated to generate a lab el for the entire page

Clearly this corroboration pro cess is trivial if each of the recognizers is perfect

Assuming p erfect recognizers is unrealistic though corrob oration is very dicult

unless the library contains at least one p erfect recognizer which can b e used to

anchor the evidence provided by the others Corrob oration is also dicult if any of

the recognizers are unreliable ie might rep ort b oth spurious false p ositive instances

as well as miss some true p ositives We developed the Corrob algorithm Section

which can handle recognizer libraries that contain at least one p erfect recognizer but

no unreliable recognizers Note that though the recognizers can not b e unreliable

they can b e unsound exhibit false p ositives but no false negatives or incomplete

false negatives but no false p ositives

The idea then is to use Corrob to lab el the example pages But in general

recognizers give ambiguous or incomplete information ab out a pages lab el Thus

we face an interesting problem of learning from noisilylabeled pages Fortunately as

we demonstrate for the hlrt class in Section it is relatively straightforward to

mo dify the Generalize functions and PAC mo del to handle this noise Our approach

W

to learning from noisy data is somewhat unusual our wrapp er induction application

is selfcorrecting in that it can usually determine which lab els are correct

Unfortunately Corrob is very slow unless all the recognizers are p erfect We thus

developed Corrob Section which heuristically solves a simplied version of the

corrob oration problem

Future work

While this summary describ es the progress we have made in many ways our tech

niques for automatic wrapp er construction just scratch the surface We now suggest

promising directions for future research

Shorttomedium term ideas

The p oint of our eort is to build to ols to make it easier to write wrapp ers for actual

Internet information resources While our exp eriments and analysis demonstrate that

our techniques are eective for such resources there are numerous ways to make our

system more immediately useful and practical

First there are many subtle userinterface issues that must b e handled when de

veloping a useful wrapp erconstruction to ol Our wien application Section

application could b e improved in several ways First we have many ideas for

how to change the user interface to simplify the pro cess of wrapp er construction

For example we are exp erimenting with aggressive learning in which wien starts

learning as so on as parts of the page are lab eled Of course the results of such

learning might b e wrong and thus we must provide a mechanism for easily cor

recting wiens mistakes Second our six wrapp er classes are rather general pur

p ose but they are unknown outside our research group it would b e interest

ing to select a standard wrapp er language eg the wrapp er languages discussed

in Ashish Knoblo ck a Roth Schwartz Chidlovskii et al and extend

wien so that it builds wrapp ers that adhere to an emerging standard

Second as discussed in Section we have not exp ended much eort developing

recognizers To make our techniques more useful it would b e helpful to develop a

library of common recognizers While many recognizers would no doubt b e simply

idiosyncratic hacks it would also b e worthwhile trying to develop a more general framework

A third imp ortant direction related to the second is the elab oration of our work

on corrob oration We developed the Corrob algorithm which works very well in

the domains we have investigated But recall that Corrob makes several simplica

tions and assumptions Section We are a long way from ecient corrob oration

algorithms that work well with unrestricted recognizer libraries

A fourth extension would b e to investigate ways to make the generalization func

tions Generalize for each class W run faster One idea would b e to investigate

W

the search control used by these functions Figure on page shows the space

searched by the Generalize function We did not sp ecify exactly how this space

hlrt

is traversed b eyond stating that it is searched in a depthrst manner Recall that

Generalize tries each candidate for each hlrt comp onent eg line b tries

hlrt

all candidates for r Dierent orderings of these candidates corresp ond to dierent

k

ways to traverse the space of hlrt wrapp ers As our exp eriments in Section show

the default search order is reasonably fast However more advanced search control

would probably sp eed up the search

A nal direction for shorttomedium term future work involves the wrapp er

classes themselves We have identied six wrapp er classes they are reasonably useful

for actual Internet resources and yet can b e learned reasonably quickly But obvi

ously there are many other classes that can b e considered We have two sp ecic

suggestions

Our coverage survey Section suggests that the nlr and nhlrt wrapp er

classes are less useful than anticipated nlr for example can handle just

of the surveyed sites The problem is that these classes oer tremendous

exibility they allow for arbitrarily nested pages But this exibility requires

that the pages b e highly structured as well there must exist a set of delimiters

that reliably extract this structure and the constraints on these delimiters are

much more stringent than for the tabular wrapp ers Unfortunately the data

suggest that relatively few sites have such structure Thus one fruitful avenue

for future work would b e a more careful investigation of the issues related to

building wrapp ers for hierarchically nested resources

Our results in Section indicate that ab out of the surveyed information

resources can not b e handled by any of the six wrapp er classes we discuss It

would b e helpful to develop wrapp er classes for these resources Insp ection

of these sites reveals that many contain missing attributes eg in the coun

tryco de resource p erhaps the country co de is o ccasionally missing None of

the six wrapp er classes we dened can handle missing attributes Our pre

liminary investigation suggests that ab out ie ab out twothirds of the

unwrappable sites could b enet from wrapp ers that can handle missing values

Our delimiterbased wrapp er framework can b e extended to handle missing

attributes Consider mlr a simple extension to the lr wrapp er class Like

lr mlr uses two delimiters and r to indicate the b eginning and end

k k

of each attribute mlr wrapp ers also use a third delimiter p er attribute x

k

which indicates that the attribute is missing For example if countries are

formatted as BCongoB or as BLINKMissingBLINK if missing then

an mlr wrapp er could B r B and x BLINK While we

k k k

have not investigated the coverage of the mlr class a la Section it is

nonetheless usfseful to ask how quickly mlr wrapp ers can b e learned

Note that for each k delimiters and x interact just as h t and interact

k k

for the hlrt class Therefore learning mlr is slower than learning lr since

the Generalize function must enumerate the combinations of candidates

mlr

for and x As with lr the r delimiters can b e learned in isolation To

k k k

Actually the nlr and nhlrt wrapp er class can handle a sp ecial case of missing attributes if

K attributes are to b e extracted then attributes K K K ie the b ottom several

layers of no des in a tree such as Figure on page can b e missing

summarize we conjecture that learning mlr is harder than learning lr though

easier than oclr hlrt hoclrt nlr and nhlrt

Mediumtolong term ideas

There are several directions in which to push our work that concern the bigger pic

ture While our work on wrapp er construction is an imp ortant enabling technology

for a variety of information integration applications there are still many problems

left unsolved

Consider rst the task of information extraction from semistructured resources

We emphasized one simple version of this task information is to b e extracted from

a single do cument containing tuples from one relation As a concrete example the

information from a query to the countrycode resource is contained in a single page

and there is just one table of data on this page namely the countrycode table

But industrialstrength wrapp er applications require more sophisticated functionality

Do orenbos such as

extracting and merging data from multiple do cumentseg clicking on a table

of hyperlinked pro duct names in order to get their prices and

extracting information from a single page that contains more than one collection

of dataeg a page might contain one table listing pro ducts and prices and

another table listing manufacturers and their telephone numbers

It is straightforward to comp ose our wrapp ers so that such issues can b e handled a

metawrapper can invoke a wrapp er on the pages obtained by clicking on hyperlinks

or invokes several wrapp ers on the same page However automatically constructing

such wrapp ers is a challenging direction for future work

A second imp ortant longterm research direction involves closing informa

tion integration lo op We have examined information extraction in isola

tion but our techniques must b e integrated with work on resource discovery

Bowman et al Zaiane Jiawei learning to query information resources

Cohen Singer Do orenbos et al and learning semantic models of informa

tion resources Perkowitz Etzioni Tejada et al

Third our work has fo cused on resources whose content is formatted by html

tags Let us emphasize that our techniques do not dep end on html or any other

particular formatting convention Nevertheless we have not yet demonstrated that

our techniques work for other kinds of semistructured information resources For

example in Section we mentioned that it would b e interesting to apply our

techniques to legacy information systems

Another p ossibility is to apply our techniques to various formats used to enco de

tabular information such as csv or dif two spreadsheet standards or various

A

do cument standards such as Rich Text Format rtf or L T X For instance the

E

following co de samples

standard example

TABLE

html

TRTDATDTDBTD TDC T D TR

TRTDDTDTDETD TDF T D TR

TABLE

ABC

csv

DEF

DATA

dif

BOT A B C

BOT D E F

EOD

rtf

trowd A cell B cell C cell row

trowd D cell E cell F cell row

A

begintabularccc

L T X

E

A B C

D E F endtabular

all enco de the following table

A B C

D E F

Clearly most if not all of these formats can b e handled by the sort of delimiterbased

wrapp ers discussed in this thesis

Finally as a fourth mediumtolong term direction for future research we prop ose

integrating our techniques with traditional natural language pro cessing approaches to

information extraction IE Eective IE systems should leverage whatever structure

is available in the text whether it b e linguistic or extralinguistic Ideally inductive

learning techniques can b e used to discover such regularities automatically

This thesis demonstrates that such an approach is feasible for the kinds of semi

structured do cuments found on the Internet At the other end of the sp ectrum

autoslog Rilo and crystalwebfoot So derland et al So derland b

So derland c have demonstrated the feasibility of this approach for natural language

text More recently So derland has developed whisk So derland a which learns

information extraction rules in highly structured but nongrammatical domains such

as apartment listings whisks rules are essentially regular expressions sp ecifying

sp ecic delimiters quite similar to our delimiterbased wrapp ers But truly general

purp ose IE systems are still a long way o

Theoretical directions

Finally let us mention two directions in which our theoretical analysis can b e ex

tended

First as discussed in Section our PAC mo del is to o lo ose by one to two

orders of magnitude Thus we can not realistically use this PAC mo del to auto

matically terminate the learning algorithm Tightening this mo del would b e an

interesting direction for future work Note that the PAC mo del makes worstcase

assumptions ab out the learning task Sp ecically it assumes that the distribution D

over examples is arbitrary A standard technique for tightening a PAC mo del is to

assume that D has certain prop erties Benedek Itai Bartlett Williamson

Schuurmans Greiner suggests another strategy by replacing the batch

mo del on inductive learning with a sequential mo del in which the PACtheoretic

analysis is rep eated as each example is observed many fewer examples are predicted

It would b e interesting to apply each of these approaches to our wrapp er construction

task

Second although we are interested in completely automating the wrapp er con

struction pro cess realism demands that we fo cus on semiautomatic systems for some

time to come But our techniques do not take into account the cost of asking a p er

son for assistance For example if a p erson plays the role of a recognizer for country

names then our system requires that the p erson lo cate all the countries in the do c

uments even though it is p ossible that p ointing out just one or two countries could

b e sucient for the system to guess the rest Our investigation of aggressive learning

Section is a step in the right direction But it would b e interesting to develop a

mo del of learning that explicitly reasons ab out the utility of asking the oracle various

questions

Conclusions

The Internet presents a dizzying array of information resources and more are coming

online every day Many of these sitesthose whose goal is purely entertainment

or the growing collection of online magazines and newspap ersare p erhaps suited

only for direct manual browsing But many other sitesairline schedules retail

pro duct catalogs weather forecasts sto ck market quotationsprovide structured

data As the number and variety of such resources has grown many have argued for

the development of systems that automatically interact with such resources in order

to extract this structured content on a users b ehalf

There are many technical challenges to building such systems but one in particular

is immediately apparent as so on as one starts to design such a system There is a

huge amount of information available on the Internet but machines to day understand

very little of it While standards such as xml or z promise to ameliorate this

situation standards require co op eration or enforcement b oth of which are in short

supply on the Internet to day

In this thesis we have investigated techniqueswrapper learning algorithms and

corrob oration algorithms for generating lab eled examplesthat promise to signif

icantly increase the amount of online information to which software agents have

access Although our techniques make relatively strong assumptions we have at

tempted to empirically validate these assumptions against actual Internet resources

As describ ed earlier in this chapter there are many op en issues But we think

that this thesis demonstrates that automatic wrapp er construction can facilitate the

construction of a variety of software agents and information integration to ols

App endix A

AN EXAMPLE RESOURCE AND ITS WRAPPERS

In this App endix we provide a concrete instance of each of the six wrapp er

classes discussed in this thesis lr hlrt oclr hoclrt nlr and nhlrt We

provide examples of the html source for site in the wwwsearchcom survey

describ ed in Section Yahoo People Search TelephoneAddress available at

httpwwwyahoocomsear chp eop le Site can b e wrapp ed by all six wrapp er

classes and so we present a wrapp er in each class that is consistent with pages drawn

from this resource

Resource To b egin Figure Aa shows the query interface for resource

while Figure Ab provides an example of the resources query resp onses The site

is treated as yielding tuples containing K attributes name address area co de

and phone number

Example We now list the html source for two of the example query resp onses

The html is shown exactly as supplied from the resource except that whitespace

has b een altered in unimp ortant ways to simplify the presentation The information

content of each page is indicated with b oxes around the text to b e extracted

The rst example is the html source of the resp onse to the query Last Name

kushmerick depicted in Figure Ab

head

title

Yahoo People Search Telephone Results

title

head body

a b

Figure A Site in the survey Figure a the query interface and b an

example query response

Yahoo Subcategory Banner

map namemenu

area shaperect coords hrefhttpwwwyahoocom

area shaperect coords

hrefhttpwwwyahoocomdocsinfohelphtml

area shaperect coords hrefhttpaddyahoocombinadd

area shaperect coords

hrefhttpwwwyahoocomdocsfamilymorehtml

map

img srchttpwwwyahoocomimagescatgif border usemapmenu ismap

p Default Fail IMG SRCgrlogogif ALTFour Corp HEIGHT

WIDTH

P

If you have questions regarding privacy please

a hrefhttpwwwyahoocomdocsinfopeople privacyhtmlread thisap

a hrefhttpwwwyahoocomsearchpeoplebSearch Againbap

center

tabletr

tdbbtd

td

bDisplaying matches of b

td

td stronga hrefcgibinFourYahooPhoneResultsOffsetId

NameCountLastNamekushmerickFirstNameCityState Next astrongtd

tdbbtd

trtable p

table border cellpadding cellspacing

WIDTHtrthNamethAddressthPhonetr

tr

tdstrong James Kushmerick strongtd

td W Curtin St BellefontePA td

td td

tr

tr

Jim Kushmerick strongtd tdstrong

td Clark Av Ocean GroveNJ td

td td

tr

tr

John M Kushmerick

strongtd tdstrong

td Delaware Av OlyphantPA td

td td

tr

tr

John P Kushmerick strongtd tdstrong

td Powell Av JessupPA td

td td

tr

tr

K Kushmerick strongtd tdstrong

Center Pl ChathamNJ td td

td td

tr

tr

tdstrong L J Kushmerick strongtd

td Maplewood Av MaplewoodNJ td

td td

tr

tr

M Kushmerick strongtd tdstrong

N Main St SherbornMA td td

td td

tr

tr

tdstrong Marie Kushmerick strongtd

td Sherwood Oaks MarsPA td

td td

tr

tr

Marie J Kushmerick strongtd tdstrong

Summit Pt ScrantonPA td td

td td

tr

tr

Martin J Kushmerick strongtd tdstrong

td Riviera Pl Ne SeattleWA td

td td tr

table

br

center

table bordertr

tdbbtd

td

bDisplaying matches of b

td

td stronga hrefcgibinFourYahooPhoneResultsOffsetId

NameCountLastNamekushmerickFirstNameCityState Next astrongtd

tdbbtd

trtable

HR

center

img srchttpimagesFourcomgYahooresultsbyfourgif

altresults by Four width heightBR

iCopyright copy Yahoo All Rights Reservedbr

FONT SIZE Copyright MetroMail Corp copy FONT

center

body

Example The second example is the html source of the resp onse to the query

First Name weld

head

title

Yahoo People Search Telephone Results

title

head

body

Yahoo Subcategory Banner

map namemenu

area shaperect coords hrefhttpwwwyahoocom

area shaperect coords

hrefhttpwwwyahoocomdocsinfohelphtml

area shaperect coords hrefhttpaddyahoocombinadd

area shaperect coords

hrefhttpwwwyahoocomdocsfamilymorehtml

map

img srchttpwwwyahoocomimagescatgif border usemapmenu ismap

p Default Fail IMG SRCgrlogogif ALTFour Corp HEIGHT

WIDTH

P

If you have questions regarding privacy please a

hrefhttpwwwyahoocomdocsinfopeople privacyhtmlread thisap

a hrefhttpwwwyahoocomsearchpeoplebSearch Againbap

There are bover b names that meet your search criteriabr

Try narrowing your search by specifying more parameters

Here are the bfirst b namesP

center

tabletr

tdbbtd

td

bDisplaying matches of b

td

td stronga hrefcgibinFourYahooPhoneResultsOffsetId

NameCountLastNameFirstNameweldCityState Next astrongtd

tdbbtd

trtable

p

table border cellpadding cellspacing

WIDTHtrthNamethAddressthPhonetr

tr

tdstrong Weld strongtd

nd Av Nw BirminghamAL td td

td td

tr

tr

Weld Butler strongtd tdstrong

td Cider Hl Rd YorkME td

td td

tr

tr

Weld Mary Butler strongtd tdstrong

td Kings EliotME td

td td

tr

tr

Weld S Jessie Carter strongtd tdstrong

td Cresthill Ct Fox LakeIL td

td td

tr

tr

Weld S Ruth Carter strongtd tdstrong

td Lorimer Rd BelmontMA td

td td

tr

tr

Robert Weld Conley strongtd tdstrong

S Beckley Sta Rd LouisvilleKY td td

td td

tr

tr

Weld Conley strongtd tdstrong

td Corona Ct LouisvilleKY td

td td

tr

tr

Weld Coxe strongtd tdstrong

td Concord Av CambridgeMA td

td td

tr

tr

tdstrong Weld H Fickel strongtd

td Kernville KernvilleCA td

td td

tr

tr

Weld Field strongtd tdstrong

td ColumbiaSC td

td td

tr

table

br

center

table bordertr

tdbbtd

td

bDisplaying matches of b

td

td stronga hrefcgibinFourYahooPhoneResultsOffsetId

NameCountLastNameFirstNameweldCityState Next astrongtd

tdbbtd

trtable

HR

center

img srchttpimagesFourcomgYahooresultsbyfourgif

altresults by Four width heightBR

iCopyright copy Yahoo All Rights Reservedbr

FONT SIZE Copyright MetroMail Corp copy FONT

center

body

The wrapp ers The following wrapp ers are consistent with these examples as well

as eight more collected from this resource

class wrapp er

lr and nlr h r r r r i

hlrt and nhlrt hh t r r r r i

oclr ho c r r r r i

hoclrt hh t o c r r r r i

where

h body

t body

o tdstrong

c

tdstrong

tdtd

r strongtd

r td

r

r td

Recall from App endix C that denotes the empty string and is the carriage return

character

Note that these particular wrapp ers are one of many p ossible for each class For

example exhaustive enumeration reveals that there are more than twentyve million

consistent lr wrapp ers

App endix B

PROOFS

B Pro of of Theorem

Pro of of Theorem We must prove that for any wrapp er

w hh t r r i and example hP Li C w hP Li

K k hlrt

ExecHLRTw P L As usual let M jLj b e the number of tuples in the

example and let K b e the number of attributes p er tuple

Part I We demonstrate that if C w hP Li then

hlrt

b the rst tuples rst attributes b eginning index is calculated correctly

ie the computed b equals the corresp onding value in P s lab el L

For each k K and m M if index b is calculated correctly then

mk

the end index e is calculated correctly

mk

For each k K and m M if e is calculated correctly then b

mk mk

is calculated correctly

For each m M if e is calculated correctly then b is calculated

m

mK

correctly

If e is calculated correctly then ExecHLRT will immediately halt

M K

Note that establishing each of these claims suces to establish Part I

ExecHLRT calculates b by setting i to the p osition of h in P line a

and then setting b to the rst character following the next o ccurrence of

lines cd Since C w hP Li holds we know that Ch t hP Li

hlrt

holds In particular we know that Ciii b oth hold Ci guarantees that

this pro cedure works prop erly since is a prop er sux of the part of page

P s head following h searching for h and then will in fact nd the end of

the head Moreover Cii guarantees that the test at line b succeeds so

that line c is actually reached

ExecHLRT calculates e by starting with index i equal to b line d

mk mk

and then searching for the next o ccurrence of r Since C w hP Li holds

hlrt

k

we know that Cr hP Li holds Therefore r will not b e found in the

k k

attribute value itself Cii but wil l b e found immediately following the

attribute value Ci Therefore searching for string r at p osition b do es

k mk

in fact nd index e

mk

ExecHLRT calculates b k K by starting with index i equal to the

mk

p osition of r line e and then setting b equal to the next o ccur

k mk

rence of line cd Since C w hP Li holds we know that

k hlrt

C hP Li holds Therefore is a prop er sux of the page frag

k k

ment b etween i and the b eginning of the next tuple which has index b

mk

Thus ExecHLRT will indeed nd b prop erly

mk

ExecHLRT calculates b m M by starting with the index i p ointing

m

at the end of the previous tuple which has index e line e during

mK

th

the K iteration of the inner for each lo op Then b is calculated by

m

scanning forward for the next o ccurrence of Since C w hP Li holds we

hlrt

know that Ch t hP Li holds In particular we know that Civ which

guarantees that this pro cedure works prop erly and Cv which guarantees

that the test at line b succeeds so that line c is actually reached b oth

hold

After calculating e correctly the index i p oint to the rst character of

M K

page P s tail and ExecHLRT invokes the termination condition test at line

b Since C w hP Li holds we know that Ch t hP Li holds

hlrt

In particular we know that Ciii holds which ensures that this termination

condition fails so ExecHLRT halts

To summarize we have shown that lab el Ls b and e values are all computed

mk mk

correctly with no content b eing incorrectly extracted no content b eing skipp ed and

no extra content b eing extracted

Part II We demonstrate a contradiction Supp ose there exists an exam

ple hP Li such that ExecHLRTw P L yet C w hP Li Then one of

hlrt

the three constraints CC must b e incorrectie it must b e the case that

ExecHLRTw P L even though one of CC do esnt hold

C For each k if Cr hP Li then ExecHLRTs line f would have cal

k

culated e incorrectly for at least one value of m But we assumed that

mk

ExecHLRTw P L and thus the e must b e correct

mk

C For each k if C hP Li then S for some particular tuple m

k mk k k

But then ExecHLRTs line d would have calculated b incorrectly But we

mk

assumed that ExecHLRTw P L and thus b must b e correct

mk

C Constraint C has ve parts

Ci If this predicate do es not hold then the execution of line a would

set i The b ehavior of ExecHLRT under these circumstances is un

sp ecied but we can assume that ExecHLRTw P L

Cii If this predicate fails then line c would fail to calculate b correctly

or the outer lo op line b would iterate to o few times

Ciii If this predicate fails then the outer lo op line b would iterate to o

many times

Throughout this thesis we ignore the issue of applying a wrapp er to an inappropriate page

These circumstances are of little theoretical signicance for example we could easily have sp ecied

formally what happ ens if i but that would have simply cluttered the algorithms while providing

little b enet Moreover since these mo dications are so simple we do not need to b e concerned

ab out them from a practical p ersp ective either

th

tuple then line c would fail to Civ If this predicate fails for the m

calculate b correctly

m

th

Cv If this predicate fails for the m tuple then the outer lo op line b

would stop after m iterations instead of M

2 Pro of of Theorem

B Pro of of Lemma

Pro of of Lemma Completeness requires that for every example set E if

there exists a wrapp er satisfying C for each example in E then Generalize

hlrt hlrt

will return such a wrapp er Note that by lines hj Generalize returns only

hlrt

wrapp ers that satisfy C Thus to prove the lemma we need to prove that if a

hlrt

consistent wrapp er exists then line j is eventually reached

We prove that Generalize has this prop erty by contradiction Let E

hlrt

fhP L i hP L ig b e a set of examples and w b e a wrapp er such that

N N

C w E Supp ose that Generalize E never reaches line jie sup

hlrt hlrt

p ose that the K nested lo ops iterate completely with C never satised

hlrt

The fact that Generalize never reaches j implies that it must have ne

hlrt

glected to consider wrapp er w In particular Generalize must have neglected to

hlrt

consider some of the values for one or more of w s hlrt comp onents h t r

etc We now show that on the contrary Generalize considers enough candi

hlrt

dates for each hlrt comp onent Sp ecically we show that for each hlrt comp onent

w s value for the comp onent is among the set of candidates Establishing this fact

suces to prove that w is considered at line j b ecause the algorithms nested

lo op control structure lines ag eventually considers every combination of the

candidates

the r Supp ose Generalize failed to consider all p ossible strings for the right

hlrt

k

hand delimiter r Generalize do es not consider every string as a candidate

hlrt

k

for each r Rather lines ab indicate that the candidates for r are the pre

k k

th

xes of the rst pages S the k intratuple separators for the rst pages rst

k

tuple But note that this restriction is required in order to satisfy constraint

Ci Therefore if the r value of wrapp er w do es not satisfy this restriction

k

then it will not satisfy C and therefore w will not satisfy C w E But

hlrt

this is a contradiction since we assumed that C w E holds

hlrt

the k A similar argument applies to each lefthand delimiter for k

k k

Lines de indicate that the candidates for are the suxes of the rst

k

th

pages S the k intratuple separators of the rst pages rst tuple

k

But constraint C requires this restriction Therefore if the value of wrapp er

k

w do es not satisfy this restriction then it will not satisfy C and therefore w

will not satisfy C w E

hlrt

A similar argument applies to Line c indicates that the candidates for

must b e suxes of P s head S which is required to satisfy constraint Ci

K

h A similar argument applies to the head delimiter h Line f indicates that

candidates for h must by substrings of page P s head S which is required

K

to satisfy constraint Ci

t A similar argument applies to the tail delimiter t Line g indicates that

candidates for t must by substrings of page P s tail S which is required to

M K

satisfy constraint Ciii

In summary we have shown that wrapp er w is eventually considered in line hi

b ecause sucient candidates are always considered for each of the K comp onents

of w

2 Pro of of Lemma

B Pro of of Theorem

Pro of of Theorem Recall the discussion of Figure on pages and

pages We saw that Generalize and Generalize search the same space

hlrt

hlrt

they simply traverse it dierently Sp ecically Generalize searches the tree com

hlrt

searches the tree greedily backtracking only over the pletely while Generalize

hlrt

See Figure and Equation for a refresher on the meaning of the partition variables S mk

b ottom three layers of no des If we can show that this greedy search never skips a

is consistent Note consistent wrapp er then we will have shown that Generalize

hlrt

that since the search space is the same for b oth algorithms we dont need to reprove

that all returned wrapp ers satisfy C

hlrt

Let E b e a set of examples and supp ose that w is an hlrt wrapp er that is

consistent with E Without loss of generality supp ose that w is the only consis

tent wrapp erinformally w s leaf in Figure is the only one not marked X By

Theorem we know that Generalize will return w We need to show that

hlrt

Generalize will to o

hlrt

The assumption that w is unique causes no loss of generality b ecause if there

were multiple such wrapp ers then we could apply this pro of technique to each The

only diculty would b e that we cant guarantee which particular wrapp er is returned

by Generalize since we havent sp ecied the order in which candidates for each

hlrt

comp onent are considered But no matter which ordering is used Generalize

hlrt

would still b e consistent b ecause consistency requires that any rather than some

particular consistent wrapp er b e returned

Generalize returns wrapp er w i at each no de in the search tree

hlrt

Generalize chooses w s value for the corresp onding wrapp er comp onent There

hlrt

fore Generalize will fail to return w if and only if some of w s values are rejected

hlrt

the r Generalize will not reject w s value for any of the r To see this note

k k

hlrt

that we assumed w satises C and therefore w s value for r satises C

hlrt

k

and therefore Generalize s line c will not reject the value

hlrt

the k Generalize will not reject w s value for any of the for k

k k

hlrt

To see this note that we assumed w satises C and therefore w s value

hlrt

for satises C and therefore Generalize s line f will not reject the

k

hlrt

value

h t and Generalize will not reject w s value for h t and To see this

hlrt

note that we assumed w satises C and therefore w s value for h t and hlrt

satises C and therefore Generalize s line jwhich examines all

hlrt

combinations of these three comp onentswill not reject these values

2 Pro of of Theorem

B Pro of of Theorem

Pro of of Theorem Supp ose E fhP L i hP L ig is a set of N pages

N N

drawn indep endently according to distribution D and then lab eled according to the

target wrapp er T Let hlrt wrapp er w Generalize E We must show that if

hlrt

Equation holds then w is PACie E w with probability at least

T D

E w measures the chance that the hypothesis wrapp er w and the target wrap

T D

p er T disagree When do w and T disagree That is under what circumstances is

w P T P for some particular page P The consistency constraints CC

capture these circumstances Sp ecically by Denition and Theorem if any

of the constraints CC fail to hold b etween w and hP T P i then w P T P

and thus w and T disagree ab out P

Consider how predicates CC apply to each of w s comp onents As indicated

in Denition these three predicates op erate on the dierent comp onents of w

C constrains each of the K r comp onents C constrains each of the K

k k

comp onents for k and C constrains h t and Wrapper w disagrees with

target T on page P if any of these K particular constraints are violated for P

Therefore the chance of w disagreeing with T is equal to the chance that one or

more of the K separate constraints are violated We shall pro ceed by establishing

that if Equation holds then with high reliability w and T disagree only rarely

Observe that by the probabilistic union b ound we have that the chance of w

disagreeing with T is at most the sum of the chances of the K individual constraints

b eing violated

The probabilistic union b ound states that for any events A and B PrA B PrA PrB

Supp ose we can guarantee to a particular level of reliability that the K C

constraints and K C constraints each are violated with probability at most

and that the C constraint is violated with probability at most If we meet

K

this guarantee then we will have shown that with b ounded reliability w and T

disagree with probability at most K K ie we will have

K K

shown that w is PAC

Now consider the following Lemma

Lemma B Sample complexity of individual delimiters The

fol lowing statements hold for any target T distribution D and

As usual E is a set of N examples comprising a total of M tuples

tot

and the shortest example has length R

For any k K if C holds of r and every example in E then

k

M

tot

with probability at most C is violated by a fraction

or more with respect to D of the instances

For any k K if C holds of and every example in E then

k

M

tot

with probability at most C is violated by a fraction

or more with respect to D of the instances

If C holds between h t and every example in E then with

N

probability at most R C is violated by a fraction

or more with respect to D of the instances where R

R R R

Invoking Lemma B with the value for the K r and delimiters

k k

K

and for h t and we have that the chance that E w is at most

T D

M M N

tot tot

K K R

K K

and therefore ensuring that

M N

tot

R K

K

suces to ensure that w is PAC as claimed For clarity we can rewrite this equation

as

M

tot

N

R K

K

where K K To complete the pro of of Theorem it suces to establish

Lemma B

2 Pro of of Theorem

Pro of of Lemma B We derive each part of the Lemma in turn

Part Wrapper comp onent is learned by examining the M tuples in E

tot

k k

We are lo oking for some string that o ccurs just to the left of the instances of the

k

th

k attribute

More abstractly we can treat the task of learning as one of identifying a common

k

g where the s are the values of the prop er sux of a set of strings fs s

i

M

tot

S partition variables For example when learning using the example shown

mk

in Figure c the set of examples would b e the four strings

fB I B I B I B I g

Note that these four strings are the values of S S S and S which in this

example happ en to b e identical

Given this description of the learning task we need to determine the chance that

is not a prop er sux of at least a fraction with resp ect to D of the instances

k

We can treat D as a distribution over the s even though D is in fact a distribution

i

over pages b ecause from the p ersp ective of learning a single an example page is

k

equivalent to the examples of it provides

k

A prop er sux is a sux that o ccurs only as a sux see App endix C for a description of our

string algebra

Let s b e a string Notice that each sux of s can b e represented as an integer I

where I is the index in s of the suxs rst character For example the sux abc of

the strings abcpqrabc is identied with the integer I b ecause abcpqrabc

abc Since a sux and its integral representation are equivalent we use the two terms

interchangeably in this pro of

The set of all proper suxes of a string s can b e represented as an interval U

where U indicates ss shortest prop er sux For example the set of prop er suxes

of the string abcpqrabc can b e represented as the interval b ecause abc which

starts at p osition is the strings longest improp er sux Note that every integer

in U corresp onds to a prop er sux and every prop er sux corresp onds to an

integer in U every string is a prop er sux of itself and if some particular sux

is improp er then all shorter suxes are also improp er

In addition to b eing a prop er sux must b e a common prop er sux of each

k

example The set of common prop er suxes of a set of strings fs g is captured

by the interval L U L is the index into s indicating the longest common sux

while U is the index into s indicating the shortest common prop er sux Notice

that implicitly these indices use s as a reference string

Observe that every integer in L U corresp onds to a common prop er sux and

every common prop er sux corresp onds to an integer in L U by construction

suxes corresp onding to integers L are common suxes while those corresp onding

to integers U are prop er suxes Therefore the intersection of these two intervals

corresp onds exactly to the set of common prop er suxes

Figure B uses a suxtree representation to illustrate that represents the

common prop er suxes of the strings s rqpabcd s dzqpabcd s dqzabcd

s pqzabcd and s qpabcd That is bcd and cd are the only prop er suxes that

are common to all ve examples

Figure B illustrates the p oint of the preceding discussion learning a common

prop er sux corresp onds to learning the integer I based on a set of observed upp er

s2 = "dzqpabcd" s1 = "rqpabcd" d s3 = "dqzabcd" r 1 z d s4 = "pqzabcd" 2 q p q 3 p z a s5 = "qpbcd" 4 q p L = 5 b

U = 6 c

7 d

Figure B A sux tree showing that the common proper suxes of the ve examples

can be represented as the interval L U

L* I U*

L1 L3 L5 L4 L2 U3 U4 U1 U2 U5

LU

Figure B An example of learning the integer I from lower bounds fL L g and

upper bounds fU U g

and lower b ounds on I where I is the index into the reference string s representing

the target value for Each example s provides a lower b ound L for I as well as an

i i

k

upp er b ound U for I More formally L is the index in s of the longest sux shared

i i

by s and s and U is the index into s of the shortest prop er sux of s Given a

i i i

set of examples our learner outputs an hypothesis corresp onding to integers in the

interval L U where L max L is the greatest lower b ound and U min U is

i i i i

the least upp er b ound

At this p oint we can analyze the learning task in terms of a PAC mo del As

shown in Figure B dene L I to b e the integer such that exactly a fraction

of the probability mass of D corresp onds to strings providing a lower b ound in the

range L I Conceptually L is determined by starting at I and moving to the

left until the strings dep ositing their lowerbound L within L I collectively have

i

Similarly dene U to b e the integer such that I U probability mass exactly

has mass exactly

The learner will output some integer from L U what is the chance that a ran

domly selected string will disagree with this choice By construction if L L or

U U then this chance will b e at least What is the chance that L L or

U U By the union b ound we have that the chance of this disjunctive event is

at most the sum of the probabilities of each disjunct Therefore by calculating this

sum we can obtain the probability that the learned hypothesis has error more than

What is the chance that L L L L if amongst the M example strings

tot

we have seen none such that L L since if we had seen such a string then we

i

would have L L By the construction of L the chance that a single strings

lower b ound is less than L is So the chance that we see no such examples is

M

tot

A symmetrical argument applies to the chance that U U the chance that

M

tot

U U is exactly

Summing these two b ounds we have that the chance that disagrees with a

k

M

tot

fraction or more of the instances is at most

Part r A similar pro of technique can b e used to show that the same b ound

k

applies to learning r The dierence is that the observed lower b ounds and upp er

k

b ounds are generated from the examples somewhat dierently

Since the interval is taken to b e integral it may b e that no such L or U exists For example

while I has mass then L is not dened In this case if the interval I has mass

we simply choose L More generally we allow L and U to b e nonintegral if needed This

technicality do es not aect the rest of the pro of

Sp ecically r is learned from a set of M pairs of examples strings

tot

k

a b a b a b

fhs s i hs s i hs s ig

i i M M

tot tot

a b

The s are A partition variables and the s are the S partition variables For

mk mk

i i

example when learning r the countrycode example page provides the four pairs of

strings

fhCongo B Ii hEgypt B Ii hBelize B Ii hSpain B Iig

b

and that r must not Constraint C sp ecies that r must b e a prex of every s

k k

i

a

o ccur in any s We need to determine the chance that r do es not satisfy C for a

k

i

fraction or more of the instances

An hypothesis is represented as an index I into the rst example For example

the prex abc of the example hpqrs abcdef i would b e represented by the integer

I since abcdef abc

The interval L U represents all the hypothesis satisfying constraint C As with

part each example corresp onds to a value L and U and the learning task involves

i i

returning any hypothesis in L U where L max L and U min U

i i i i

a b a

Sp ecically L for example hs s i is the shortest prex of s satisfying constraint

i

i i

C U is the longest such prex For instance if we are trying to learn r from the

i

k

following pairs of strings

n o

a b a b

hs abc s abcdefi hs pqr s abcdexi

then we would translate these examples into L U L and U so

that L and U

The key observation ab out this construction is the interval L U is equivalent to

the set of hypothesis satisfying C G L U G satises C Pro of of

Supp ose G did not satisfy C for example i thus L G or U G therefore L G

i i

or U G but G L U contradiction Pro of of Supp ose G L U then

there exists an example i such that L G or U G therefore G do es not satisfy

i i

C for example icontradiction

From this p oint the pro of of part is identical to the pro of of part the chance

M

tot

that r disagrees with a fraction or more of the instances is at most

k

Part h t We can think of nding values for h t and as a problem of

learning within a restricted hypothesis space H Given the rst example page

ht

P there are only a nite number of choices for h t and thus H has nite

ht

cardinality Sp ecically Equation states that the number of combinations of h

t and is b ounded by OR where R min jP j is the length of the shortest

n n

example page

How many hypotheses exactly In the analysis leading to Equation we were

concerned only with asymptotic complexity We can reexamine the relevant parts

to determine the actual number of h t combinations jH j of Generalize

ht

hlrt

Note that jH j is a function only of R let jH j R Then lines gi

ht ht

reveal that

p p p p

p

R R R R

R R

z

z z

is a

t is a

h is a

sux of a

substring of

substring of

string of

a string of

a string of

p

p

length

p

length R

R length

R

R R R

To summarize we have determined R which is the number of hypotheses jH j

ht

examined by the Generalize algorithm

hlrt

PAC b ounds for nite hypothesis spaces are well known in the literature eg

Russell Norvig pp Sp ecically it is straightforward to show that if

This pro of requires Assumption only to obtain this b ound if the assumption do esnt hold

then we get a lo oser though still correct mo del

some hypothesis from class H is consistent with N examples then the chance that it

N

disagrees with at least a fraction of the instances is at most jH j

Applying this b ound to the problem of learning H we have that the chance

ht

that h t and jointly disagree with at least a fraction of the instances is at most

N

R

2 Pro of of Lemma B

B Pro of of Theorem Details

Pro of of Theorem Details

Assertion Figure B lists nine pages For each page the lab el not explicitly

listed extracts the information fhA Ai hA A i hA Aig from the corre

sp onding page For example the page marked A pAqArAsAtAuA v is

lab eled fhh i h ii hh i h ii hh i h iig

Figure B lists each of the wrapp er classes lr hlrt oclr and hoclrt The

symbol in a particular pages row and a particular classs column indicates that

no wrapp er in the class is consistent with the given page For example the symbol

at page A under column hlrt indicates that no hlrt wrapp er exists which

can extract the given content from page A On the other hand the hlrt wrapp er

hh t i for page C indicates that the given wrapp er can indeed extracted

the desired content from the page

We must verify each cell of the ninebyfour matrix The cells not marked by

can b e easily veried by simply running the corresp onding wrapp er execution

For instance to verify that the cell Chlrt should b e marked hh t i we

simply run ExecHLRT with arguments hAAAAA A t and

hh t i and then verify that the output is correct

Recall our warning in Footnote on page when examining Figures B and B

Verifying the cells marked is more dicult Each requires a detailed argument

sp ecic to the particular page and wrapp er class in question In this pro of we provide

such arguments for just two cells

Consider row I The page hoAAcoxAA co A A c is

claimed to b e wrappable by oclr and hoclrt but not lr or hlrt To see

that no consistent lr wrapp er exists note that must b e a common prop er

sux of the three strings ho cox and co Clearly is the only such

string But using this candidate in an lr wrapp er causes ExecLR to get

confused by the pages head ho Sp ecically when attempting to extract the

pages rst attribute A ExecLR will extract hoA instead b ecause

o ccurs rst at p osition instead of p osition

To see that hoAAcoxAAc oA A c can not b e wrapp ed

by hlrt consider the tail delimiter t t must b e a substring of the tail c

However we can not use any of the three substrings c and c b ecause all

three o ccur b etween the tuples and therefore confuse ExecHLRT

As a second example consider row C

hAAAAA A t As with I there is no consis

tent lr wrapp er b ecause all candidates for confuse ExecLR

To see that no oclr wrapp er can handle

hAAAAA A t notice that all candidates for

the op ening delimiter o confuse ExecOCLR

Finally notice that row A in Figure B corresp onds to region A in Figure

B in Figure B corresp onds to region B in Figure and so on Thus we

We also implemented a computer program to enumerate and rule out every p ossible wrapp er in

the given class The program is complete in that it will always nd a wrapp er if one exists and it

will always terminate so that if no wrapp er exists then we can in fact use the program to verify

this fact

have exhibited one pair hP Li in each of the regions A I which establishes

the rst assertion in the pro of of Theorem

Assertions and Lemmas B and B b elow establish the second and third

assertions resp ectively

2 Pro of of Theorem Details

Lemma B lr oclr

Pro of of Lemma B We need to show that for every pair hP Li lr if

lr wrapp er w h r r i satises w P L then oclr wrapp er w

K K

h r r i satises w P L If we establish this assertion then we

K K

will have established that lr oclr

To see that this assertion holds note that oclr wrapp er w s o comp onent has the

value while c is the empty string But under these circumstances the ExecOCLR

pro cedure is equivalent to the ExecLR pro cedure Sp ecically when o lines a

b of ExecOCLR are equivalent to line ab of ExecLR Furthermore when c

line c of ExecOCLR is a noop

Since we know ExecLRw P L we conclude ExecOCLR w P L

2 Pro of of Lemma B

Lemma B hlrt hoclrt

Pro of of Lemma B Substantially the same pro of technique applies to Lemma

B and Lemma B In this case when o then lines ab of ExecHOCLRT

are equivalent to lines bc of ExecHLRT and when c then line c of

ExecHOCLRT is a noop

2 Pro of of Lemma B

Recall that the empty string is dened in App endix C region G D A H C B E F I ohoAAcoxAAcoAAc hoAAcoxAAcoAAc hoAAcoxAAcoAAc hoAAcoAAcoAAct hAAhAAhAAt xoAAoAAoxAA hAAAAAAt AAtAAAAt pAqArAsAtAuAv example Figure B One point in h h h h each r lr of r i i i the i h t t h h h h h h h regions t t t hlrt r in r Figure i i i i h h c c o h h h h h o o o h o c c c oclr r r i i i i i i i h h h h h h h t c c o t h h h h h h h t t t hoclrt o o o x o c c c c r r i i i i i i i

B Pro of of Theorem Details

Pro of of Theorem Details

Assertion The pro of technique used for Theorem can b e applied to the

present theorem Figure B lists one p oint in each of the regions marked J through

U in Figure

Assertions and Lemmas B and B b elow establish the second and third

assertions resp ectively

2 Pro of of Theorem Details

Lemma B nlr hlrt lr

Pro of of Lemma B Let hP Li b e some page wrappable by b oth nlr and hlrt

hP Li nlr hlrt Since hP Li is wrappable by hlrt it must have a

tabular as opp osed to nested structure Let w b e an nlr wrapp er consistent with

hP Li Since P has a tabular structure then ExecLRw P L ie when treated as

an lr wrapp er the nlr wrapp er w is consistent with hP Li To see this note that

P is tabular and so line i of ExecNLR always nds attribute k after attribute k

Therefore the ExecLR routine op erates prop erly when using the same attribute

k

Since there exists a w such ExecLRw P L we have that hP Li lr

2 Pro of of Lemma B

Lemma B nhlrt lr hlrt

Pro of of Lemma B Essentially the same pro of applies at for Lemma B The

dierence is that instead of treating an nlr wrapp er as an lr wrapp er the technique

used in the previous pro of we rely on the fact that if a given page is tabular then

an nhlrt wrapp er op erates correctly when treated as an hlrt wrapp er region M Q K O U N T R P L S J hAiAiiAAAAAt hAiAiiAAAAAt AiAiiAtAAAAt hAtAAAAAt hAAAAAAt hAAAAAAt AAtAAAAt hAAAAAAt hAAAAAAt hAAAAAAt AAtAAAAt pAqArAsAtAuAv example Figure B One point in h h h h h h h each r lr of r the i i i i i i i h t t h h h h h h h regions h h h h h h t t t t t t hlrt r in r Figure i i i i i i i h h h h h h r nlr r i i i i i i h h t t h h h h h h h h h h t nhlrt t t t t r r i i i i i i

2 Pro of of Lemma B

B Pro of of Theorem

Pro of of Theorem Let b e a recognizer library satisfying Assumption

let P b e a page let L b e P s true lab el and let L CorrobP Correctness

requires that there exists some member of L that matches L To establish this

assertion we rst show how to construct a lab el L that matches L and then show

that L L

Constructing L from L Ls attribute ordering is identical to that of L L also

contains exactly the same instances as L except that any instance hb e F i from L

k

P is replaced by if hb ei was not recognized ie hb ei R

F

k

For example consider the following example recognizer library output

R R

cap code

perfect R perfect

ctry

incomplete

h i h i

h i h i h i

h i h i

In this case we construct L from L as follows

true lab el L constructed lab el L

ctry code cap ctry code cap

h i h i h i h i h i

h i h i h i h i h i h i

h i h i h i h i h i

Note that L is identical to L except that two ctry instancesh i and h i

are omitted b ecause neither is a member of R P

ctry

Showing that L L It is clear from the previous discussion that L matches L

in the sense required for the set L to qualify as a solution to the lab eling problem

hP i Thus to complete the pro of we need to show that L L

Let b e the attribute ordering of L and L note that the ordering is

the same for b oth Let f hb e F i g b e the instances in L Now par

k

g tition f hb e F i g into two parts I fhb e F i j unsndR

unsnd F

k k

k

the set of instances recognized by the unsound recognizers and I

unsnd

g the set of instances recognized by the incomplete or p er fhb e F i j unsndR

F

k

k

fect recognizers

To show that L L we must demonstrate that the following ve assertions hold

I is the set of instances returned by the call to NTPSet at line a

unsnd

so that NTPSet I

unsnd

I is one of the sets of instances returned by the call to PTPSets at line

unsnd

b so that eventually PTPSet I

unsnd

is one of the orderings returned by the call to Orders at line c

When invoked at line d with NTPSet I PTPSet I

unsnd unsnd

and attribute ordering the function invocation Consistent NTPSet

PTPSet returns true thereby ensuring that

MakeLabel NTPSet PTPSet L

Finally when invoked at line e with NTPSet I PTPSet

unsnd

I and attribute ordering lab el L is returned by the function in

unsnd

vocation

MakeLabel NTPSet PTPSet

We now establish each item on this list

The function NTPSet simply collect the instances recognized by each of the

p erfect or incomplete recognizers So we have that hb e F i I

k unsnd

hb e F i NTPSetP and thus I NTPSetP

unsnd k

Consider partitioning I according to the attribute F to which they b e

unsnd

k

F

k

fhb ei j hb e F i I g Note that unsound recognizers long I

k unsnd

unsnd

F

k

P Furthermore R never pro duce false negatives and therefore I

F

unsnd

k

F

k

j M where M is the number of rows in L Now since note that jI

unsnd

the PTPSets subroutine returns the cross pro duct of al l subsets of size M for

P of instances recognized by unsound recognizers we know that al l sets R

F

k

the particular set I will b e among those returned

unsnd

Orders returns the set of all p ossible attribute orderings and therefore it must

include the ordering

To review we know now that Ls attribute ordering NTP instances

I and PTP instances I will eventually b e considered at lines

unsnd unsnd

de Thus we know that Corrob is prepared to add L to L So the only re

maining questions are Do es Consistent return true and if so do es MakeLabel

correctly pro duce L from these inputs

To see that Consistent returns true for these inputs note that Consistent

veries that a lab el exists with the given attribute ordering and all the given

instances Since L is such a lab el we know that Consistent must return true

Moreover note that there is exactly one such lab elie L is the only lab el that

can b e constructed from and I I To see this note that

unsnd unsnd

I I contain a subset of L s instances and since the attribute

unsnd unsnd

ordering is xed each instance recognized by an incomplete recognizer b elongs

to exactly one row in L Therefore MakeLabel returns L for the given inputs

2 Pro of of Theorem

App endix C

STRING ALGEBRA

The consistency predicates CC and the ExecW execution functions for each

wrapp er class W are dened in terms of the following concise string algebra

We assume an alphab et We dont restrict the alphab et though we have in

mind the asci i characters The symbol refers to the newline character

In this App endix s s etc are strings over The length of string s is written

jsj The empty string is denoted Adjacency indicates concatenation s s

denotes the string s followed by s

is the string search op erator string ss is the sux of s starting from the

rst o ccurrence of s For example abcdecdf cd cdecdf The empty string

o ccurs at the b eginning of every s s

If s do es not contain s then we write ss For example abc xyz We

stipulate that prop ogates failure s s Also for convenience

we dene jj

One common use of the search op erator is to determine if one string is a prop er

sux of another We say that s is a proper sux of string s if s is a sux of

s and moreover s o ccurs only as a sux of s Note that s is a prop er sux of

s if and only if ss s

A second common use of the string search op erator is to determine if one string

o ccurs b efore another in some third string Note that s o ccurs after s in s if

and only if jss j jss j

is the string index op erator ss jsj jss j As with ss ss

if s do es not contain s For example abcdefg cde while abcdefg xyz

The substring op erator sb e extracts from s characters b though e where b

and e are onebased integral indices into s For example abcdef cd sb

is an abbreviation for sb jsj For example abcdef cdef

BIBLIOGRAPHY

Adali et al Adali S Candan K Papakonstantinou Y and Subrahmanian

V Query caching and optimization in distributed mediator systems In

Procceedings of SIGMOD

Aiken Aiken P Data Reverse Engineering Staying the Legacy Dragon McGraw

Hill

Andreoli et al Andreoli J Borgho U and Pareschi R The Constraint

Based Knowledge Broker Mo de Semantics Implementation and Analysis

J Symbolic Computation

Angluin Laird Angluin D and Laird P Learning from noisy examples Ma

chine Learning

Angluin Smith Angluin D and Smith C Inductive inference Theory and

metho ds ACM Computing Surveys

Angluin Angluin D Inference of reversible languages J ACM

Angluin Angluin D Learning regular sets from queries and counterexamples

Information and Computation

Angluin Angluin D Computational learning theory survey and selected bibli

ography In Proc th ACM Symp Theory Comp pages

ANSI ANSI Database Language SQL Standard X

Arens et al Arens Y Knoblo ck C Chee C and Hsu C SIMS Single inter

face to multiple sources TR RLTR USC Rome Labs

ARPA ARPA Proc th Message Understanding Conf Morgan Kaufmann

Ashish Knoblo ck a Ashish N and Knoblo ck C Semiautomatic wrapp er gen

eration for Internet information sources In Proc Cooperative Information

Systems

Ashish Knoblo ck b Ashish N and Knoblo ck C Wrapper Generation for

Semistructured Information Sources In Proc ACM SIGMOD Workshop

on Management of Semistructured Data

Bartlett Williamson Bartlett P and Williamson R Investigating the distri

butional assumptions of the PAC learning mo del In Proc th Workshop

Computational Learning Theory pages

Benedek Itai Benedek G and Itai A Learnability by xed distributions In

Proc st Workshop Computational Learning Theory pages

Biermann Feldman Biermann A and Feldman J On the synthesis of nite

state machines from samples of their b ehavior IEEE Trans on Computers

C

Blumer et al Blumer A Ehrenfeucht A Haussler D and Warmuth M

Learnability and the VapnikChervonenkis dimension J ACM

Borgman Siegfried Borgman C and Siegfried S Gettys Synoname and its

cousin A survey of applications of p ersonal namematching algorithms J

Amer Soc of Information Science

Bowman et al Bowman M Danzig P Manber U and Schwartz F Scalable

Internet discovery Research problems and approaches C ACM

Bradshaw Bradshaw J editor Intel ligent Agents MIT Press

Bro die Stonebraker Bro die M and Stonebraker M Migrating Legacy Sys

tems Gateways Interfaces the Incremental Approach Morgan Kauf

mann

Carey et al Carey M Haas L Schwarz P Arya M Co dy W Fagin

R Flickner M Luniewski A Niblack W Petkovic D Thomas J

Williams J and Wimmers E Towards heterogeneous multimedia infor

mation systems The Garlic approach In Proc th Int Workshop of Re

search Issues in Data Engineering Distributed Object Management pages

Chawathe et al Chawathe S GarciaMolina H Hammer J Ireland K Pa

pakonstantinou Y Ullman J and Widom J The TSIMMIS pro ject

Integration of heterogeneous information sources In Proc th Meeting of

the Information Processing Soc of Japan pages

Chidlovskii et al Chidlovskii B Borgho U and Chevalier P Towards So

phisticated Wrapping of Webbased Information Rep ositories In Proc

Conf ComputerAssisted Information Retrieval pages

Cohen Singer Cohen W and Singer W Learning to Query the Web In Proc

Workshop Internetbased Information Systems th Nat Conf Articial

Intel ligence pages

Collet et al Collet C Huhns M and Shen W Resource integration using a

large knowledge base in CARNOT IEEE Computer

Cowie Lehnert Cowie J and Lehnert W Information extraction C ACM

Decatur Gennaro Decatur S and Gennaro R On Learning from Noisy and

Incomplete Examples In Proc th Annual ACM Conf Computational

Learning Theory pages

Decker et al Decker K Sycara K and Williamson M Middleagents for the

internet In Proc th Int Joint Conf AI pages

Dietterich Michalski Dietterich T and Michalski R A comparative review

of selected metho ds for learning from examples In Michalski et al

chapter pages

Do orenbos Do orenbos R Octob er Personal communication

Do orenbos et al Do orenbos R Etzioni O and Weld D A scalable

comparisonshopping agent for the WorldWide Web In Proc Autonomous

Agents pages

Douglas Hurst Douglas S and Hurst M Layout and language Lists and

tables in tehcnical do cuments In Proc SIGPARSE pages

Douglas et al Douglas S Hurst M and Quinn D Using Natural Language

Pro cessing for Identifying and Intepreting Tables in Plain Text In Proc th

Symp Document Analysis and Information Retrieval pages

Drap er Drap er D Octob er Personal communication

Etzioni Weld Etzioni O and Weld D A softb otbased interface to the Inter

net C ACM

Etzioni Etzioni O Intelligence without rob ots A reply to Bro oks AI Magazine

Etzioni a Etzioni O Moving up the information fo o d chain softb ots as infor

mation carnivores In Proc th Nat Conf AI

Etzioni b Etzioni O The World Wide Web quagmire or gold mine C ACM

Etzioni et al Etzioni O Lesh N and Segal R Building softb ots for UNIX

preliminary rep ort Technical Rep ort University of Washington

Etzioni et al Etzioni O Maes P Mitchell T and Shoham Y editors Work

ing Notes of the AAAI Spring Symposium on Software Agents Menlo Park

CA AAAI Press

Finin et al Finin T Fritzson R McKay D and McEntire R KQML A lan

guage and proto col for knowledge and information exchange In Know ledge

Building and Know ledge Sharing Ohmsha and IOS Press

Florescu et al Florescu D Rashid L and Valduriez P Using heterogeneous

equivalences for query rewriting in multidatabase systems In Proc Coop

erative Information Systems

Freitag Freitag D Machine Learning for Information Extraction from Online

Do cuments A Preliminary Exp eriment Unpublished manuscript

Available at wwwcscmueduafscscmueduuserdayneww wprelimps

Freitag Freitag D Using grammatical inference to improve precision in infor

mation extraction In Working Papers of the Workshop on Automata In

duction Grammatical Inference and Language Acquisition th Int Conf

Machine Learning Available at httpwwwcscmuedupdup ont

mlpml GI wkshptar

Friedman Weld Friedman M and Weld D Eciently executing information

gathering plans In Proc th Int Joint Conf AI pages

Goan et al Goan T Benson N and Etzioni O A grammar inference algo

rithm for the world wide web In Proc AAAI Spring Symposium on Machine

Learning in Information Access

Gold Gold E Complexity of automaton identication from given data Infor

mation and Control

Green Krishnamo orthy Green E and Krishnamo orthy M Mo delBased

Analysis of Printed Tables In Proc rd Int Conf Document Analysis

and Recognition

Gupta Gupta A editor Integration of Information Systems Bridging Hetero

geneous Databases IEEE Press

Haussler Haussler D Quantifying inductive bias J Articial Intel ligence

Hayes Hayes P NameFinder Softwre that nds names in text In Proc Conf

ComputerAssisted Information Retrieval pages

Hobbs Hobbs J The generic information extraction system In Proc th Mes

sage Understanding Conf

Kearns Vazirani Kearns M and Vazirani U An introduction to computa

tional learning theory MIT

Kearns Kearns M Ecient NoiseTolerant Learning from Statistical Queries

In Proc th Annual ACM Symp Theory of Computing pages

Kirk et al Kirk T Levy A Sagiv Y and Srivastava D The Information

Manifold In AAAI Spring Symposium Information Gathering from Het

erogeneous Distributed Environments pages

Knuth et al Knuth D Morris J and Pratt V Fast pattern matching in

strings SIAM J Computing

Krulwich Krulwich B The BargainFinder agent Comparison price shopping

on the Internet In Williams J editor Bots and Other Internet Beasties

chapter SAMSNET

Kwok Weld Kwok C and Weld D Planning to gather information In Proc

th Nat Conf AI

Levy et al Levy A Ra jaraman A and Ordille J Queryanswering algorithms

for information agents In Proc th Nat Conf AI

Luke et al Luke S Sp ector L Rager D and Hendler J Ontologybased

web agents In Proc First Int Conf Autonomous Agents

Martin Biggs Martin A and Biggs N Computational learning theory An

introduction Cambridge University Press

Michalski Michalski R A theory and metho dology of inductive learning In

Michalski et al chapter pages

Michalski et al Michalski R Carb onell J and Mitchell T editors Machine

Learning An Articial Intel ligence Approach Morgan Kaufman

Mitchell Mitchell T The need for biases in learning generalizations Technical

Rep ort CBMTR Dept of Computer Science Rutgers Univ

Mitchell Mitchell T Machine Learning McGraw Hill

Monge Elkan Monge A and Elkan C The eld matching problem Algo

rithms and applications In Proc nd Int Conf Know ledge Discovery and

Data Mining

Paik et al Paik W Liddy E Yu E and McKenna M Categorizing and

standardizing prop er nouns for ecient retrieval In Proc Assoc for Com

putational Linguistics Workshop on the Aquisition of Lexical Know ledge

from Text

Papakonstantinou et al Papakonstantinou Y GarciaMonlina H and

Widom J Ob ject exchange across heterogeneous information sources In

Proc th Int Conf Data Engineering pages

Perkowitz Etzioni Perkowitz M and Etzioni O Category translation Learn

ing to understand information on the Internet In Proc th Int Joint Conf

AI pages

Perkowitz et al Perkowitz M Do orenbos R Etzioni O and Weld D Learn

ing to understand information on the Internet An examplebased approach

J Intel ligent Information Systems

Rau Rau L Extracting company names from text In Proc th Nat Conf AI

Rilo Rilo E Automatically Constructing a Dictionary for Information Ex

traction Tasks In Proc th Nat Conf AI pages

Roth Schwartz Roth M and Schwartz P Dont scrap it wrap it A wrapp er

architecture for legacy data sources In Proc nd VLDB Conf pages

Rus Subramanian Rus D and Subramanian D Customizing information

capture and access ACM Trans Information Systems

Russell Norvig Russell S and Norvig P Articial Intel ligence A Modern

Approach Prentice Hall

Schuurmans Greiner Schuurmans D and Greiner R Practical PAC Learn

ing In Proc th Int Joint Conf AI pages

Selb erg Etzioni Selb erg E and Etzioni O Multiservice search and compar

ison using the metacrawler In Proc th World Wide Web Conf pages

MA USA

Selb erg Etzioni Selb erg E and Etzioni O The metacrawler architecture for

resource aggregation on the web IEEE Expert January

Shakes et al Shakes J Langheinrich M and Etzioni O Dynamic reference

sifting a case study in the homepage domain In Proc th World Wide

Web Conf See httpwwwcswashingto ned ure sear ch ahoy

Shklar et al Shklar L Thatte S Marcus H and Sheth A The InfoHarness

Information Integration Platform In Proc nd Int WWW Conf

Shklar et al Shklar L Shah K and Basu C Putting Legacy Data on the

Web A Rep ository Denition Language In Proc rd Int WWW Conf

Smeaton Crimmins Smeaton A and Crimmins F Relevance Feedback and

Query Expansion for Searching the Web A Mo del for Searching a Digital

Library In Proc st European Conf Digital Libraries pages

So derland a So derland S Octob er Personal communication

So derland b So derland S Learning Text Analysis Rules for DomainSpecic Nat

ural Language Processing PhD dissertation University of Massachusetts

So derland c So derland S Learning to Extract Textbased Information from the

World Web In Proc rd Int Conf Know ledge Discovery and Data Mining

So derland et al So derland S Fisher D Aseltine J and Lehnert W CRYS

TAL Inducing a Conceptual Dictionary In Proc th Int Joint Conf AI

pages

Tanida Yokomori Tanida N and Yokomori T Polynomialtime identication

of strictly regular languages in the limit IEICE Tran Information and

Systems ED

Tejada et al Tejada S Knoblo ck C and Minton S Learning mo dels for

multisource integration In Proc AAAI Spring Symposium on Machine

Learning in Information Access

Valiant Valiant L A theory of the learnable C ACM

Vapnik Chervonenkis Vapnik V and Chervonenkis A On the uniform con

vergence of relative frequencies of events to their probabilities Theory of

Probability and its Applications

Wooldridge Jennings Wooldridge M and Jennings N editors Intel ligent

Agents Springer Verlag

Zaiane Jiawei Zaiane O R and Jiawei H Resource and knowledge discovery

in global information systems A preliminary design and exp eriment In

Proc KDD pages

VITA

Nicholas Kushmerick received his BS degree in Computer Engineering with Uni

versity Honors from Carnegie Mellon University in From he worked

as a research programmer in CMUs Psychology Department From he

was a graduate student in the University of Washingtons Department of Computer

Science and Engineering He received his MS degree in and his PhD de

gree in In he was a visiting researcher at the Center for Theoretical

Study and the Institute for Information Theory and Automation

Czech Academy of Science in Prague In Nick will join the Department of

Computer Applications at Dublin City University as a Lecturer