Wrapper Induction for Information Extraction
Wrapper Induction for Information Extraction
by Nicholas Kushmerick

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy

University of Washington

Approved by (Chairperson of Supervisory Committee)
Program Authorized to Offer Degree
Date

In presenting this dissertation in partial fulfillment of the requirements for the Doctoral degree at the University of Washington, I agree that the Library shall make its copies freely available for inspection. I further agree that extensive copying of this dissertation is allowable only for scholarly purposes, consistent with "fair use" as prescribed in the U.S. Copyright Law. Requests for copying or reproduction of this dissertation may be referred to University Microfilms, Eisenhower Place, P.O. Box, Ann Arbor, MI, to whom the author has granted "the right to reproduce and sell (a) copies of the manuscript in microform and/or (b) printed copies of the manuscript made from microform."

Signature
Date

University of Washington

Abstract

Wrapper Induction for Information Extraction
by Nicholas Kushmerick

Chairperson of Supervisory Committee: Professor Daniel S. Weld, Department of Computer Science and Engineering

The Internet presents numerous sources of useful information: telephone directories, product catalogs, stock quotes, weather forecasts, etc. Recently, many systems have been built that automatically gather and manipulate such information on a user's behalf. However, these resources are usually formatted for use by people (e.g., the relevant content is embedded in HTML pages), so extracting their content is difficult. Wrappers are often used for this purpose. A wrapper is a procedure for extracting a particular resource's content. Unfortunately, hand-coding wrappers is tedious. We introduce wrapper induction, a technique for automatically constructing wrappers.

Our techniques can be described in terms of three main contributions. First, we pose the problem of wrapper construction as one of inductive learning.
Our algorithm learns a resource's wrapper by reasoning about a sample of the resource's pages. In our formulation of the learning problem, instances correspond to the resource's pages, a page's label corresponds to its relevant content, and hypotheses correspond to wrappers.

Second, we identify several classes of wrappers which are reasonably useful, yet efficiently learnable. To assess usefulness, we measured the fraction of Internet resources that can be handled by our techniques. We find that our system can learn wrappers for [...] of the surveyed sites. Learnability is assessed by the asymptotic complexity of our system's running time; most of our wrapper classes can be learned in time that grows as a small-degree polynomial.

Third, we describe noise-tolerant techniques for automatically labeling the examples. Our system takes as input a library of recognizers, domain-specific heuristics for identifying a page's content. We have developed an algorithm for automatically corroborating the recognizers' evidence. Our algorithm performs well, even when the recognizers exhibit high levels of noise.

Our learning algorithm has been fully implemented. We have evaluated our system both analytically (with the PAC learning model) and empirically. Our system requires [...] to [...] examples for effective learning, and takes about ten seconds of CPU time for most sites. We conclude that wrapper induction is a feasible solution to the scaling problems inherent in the use of wrappers by information-integration systems.

TABLE OF CONTENTS

List of Figures

Chapter 1: Introduction
  Motivation
  Background: Information resources and software agents
  Our focus: Semi-structured resources
  An imperfect strategy: Hand-coded wrappers
  Overview
  Our solution: Automatic wrapper construction
  Our technique: Inductive learning
  Evaluation
  Contributions
  Organization

Chapter 2: A formal model of information extraction
  Introduction
  The basic idea
  The formalism
  Summary

Chapter 3: Wrapper construction as inductive learning
  Introduction
  Inductive learning
  The formalism
  The Induce algorithm
  PAC analysis
  Departures from the standard presentation
  Wrapper construction as inductive learning
  Summary

Chapter 4: The HLRT wrapper class
  Introduction
  HLRT wrappers
  The Generalize_hlrt algorithm
  The HLRT consistency constraints
  Generalize_hlrt
  Example
  Formal properties
  Efficiency: The Generalize_hlrt algorithm
  Complexity analysis of Generalize_hlrt
  Generalize_hlrt
  Formal properties
  Complexity analysis of Generalize_hlrt
  Heuristic complexity analysis
  PAC analysis
  Summary

Chapter 5: Beyond HLRT: Alternative wrapper classes
  Introduction
  Tabular resources
  The LR, OCLR and HOCLRT wrapper classes
  Segue
  Relative expressiveness
  Complexity of learning
  Nested resources
  The N-LR and N-HLRT wrapper classes
  Relative expressiveness
  Complexity of learning
  Summary

Chapter 6: Corroboration
  Introduction
  The issues
  A formal model of corroboration
  The Corrob algorithm
  Explanation of Corrob
  Formal properties
  Extending Generalize_hlrt and the PAC model
  The Generalize^noisy_hlrt algorithm
  Extending the PAC model
  Complexity analysis and the Corrob algorithm
  Additional input: Attribute ordering
  Greedy heuristic: Strongly-ambiguous instances
  Domain-specific heuristic: Proximity ordering
  Performance of Corrob
  Recognizers
  Summary

Chapter 7: Empirical Evaluation
  Introduction
  Are the six wrapper classes useful?
  Can HLRT be learned quickly?
  Evaluating the PAC model
  Verifying Assumption [...]: Short page fragments
  Verifying Assumption [...]: Few attributes, plentiful data
  Measuring [...]: The PAC model noise parameter
  Equation [...]
  The WIEN application

Chapter 8: Related work
  Introduction
  Motivation
  Software agents and heterogeneous information sources
  Legacy systems
  Standards
  Applications
  Systems that learn wrappers
  Information extraction
  Recognizers
  Document analysis
  Formal issues
  Grammar induction
  PAC model

Chapter 9: Future work and conclusions
  Thesis summary
  Future work
  Short-to-medium term ideas
  Medium-to-long term ideas
  Theoretical directions
  Conclusions

Appendix A: An example resource and its wrappers

Appendix B: Proofs
  Proof of Theorem [...]
  Proof of Lemma [...]
  Proof of Theorem [...]
  Proof of Theorem [...]
  Proof of Theorem [...] (Details)
  Proof of Theorem [...] (Details)
  Proof of Theorem [...]

Appendix C: String algebra

Bibliography

LIST OF FIGURES

The showtimes.hollywood.com Internet site is an example of a semi-structured information resource.
(a) A fictitious Internet site providing information about countries and their telephone country codes; (b) an example query response; and (c) the HTML text from which (b) was rendered.