Combining Substructures to Uncover The Relational Web

B. Cenk Gazen
Computer Science Department
Carnegie Mellon University
Pittsburgh, PA 15213
[email protected]

June 25th, 2004

Thesis Committee:
Jaime Carbonell, Carnegie Mellon University (chair)
William Cohen, Carnegie Mellon University
John Lafferty, Carnegie Mellon University
Steven Minton, Fetch Technologies

Abstract

I describe an approach to automatically convert web sites into relational form. The approach relies on the existence of multiple types of substructure within a collection of pages from a web site. Corresponding to each substructure is an expert that generates a set of simple hints for the particular collection. Each hint describes the alignment of some tokens within relations. An optimization algorithm then finds the relational representation of the given web site that maximizes the likelihood of observing the hints.

The contributions of the thesis will be a new approach for combining heterogeneous substructures in document collections, an implemented system that will make massive amounts of web data available to applications that use only structured data, and new search techniques in probabilistic constraint satisfaction.

1 Introduction

1.1 Motivation

Even though the amount of information on the web has been growing at an incredible rate, software applications have been able to make use of it only in limited ways, such as spidering and indexing the words on a page, activities that do not require a deep understanding of the content. This is mainly because, as data is transformed into web format, its inherent structure is replaced with formatting structure that makes the data easier for humans to absorb but harder for computers to process.