Database Techniques for the World-Wide Web: a Survey
Total Page:16
File Type:pdf, Size:1020Kb
Database Techniques for the World-Wide Web: A Survey Daniela Florescu Alon Levy Alb erto Mendelzon Inria Ro quencourt Univ. of Washington Univ. of Toronto dana@ro din.inria.fr [email protected] [email protected] Information extraction and integration: Certain web 1 Intro duction sites can b e viewed at a ner granularity level than pages, as containers of structured data e.g., sets of tuples, or The p opularity of the World-Wide Web WWW has made sets of ob jects. For example, the Internet Movie Database it a prime vehicle for disseminating information. The rel- http://www.imdb.com can b e viewed as a front end in- evance of database concepts to the problems of managing terface to a database ab out movies. Given the rise in the and querying this information has led to a signi cant body numb er of such sites, there are two tasks we consider. The of recent research addressing these problems. Even though rst task is to actually extract a structured representation the underlying challenge is the one that has b een tradition- of the data e.g., a set of tuples from the HTML pages ally addressed by the database community{how to manage containing them. This task is p erformed by a set of wrap- large volumes of data { the novel context of the WWW forces per programs, whose creation and maintenance raises several us to signi cantl y extend previous techniques. The primary challenges. Once we view these sites as autonomous hetero- goal of this survey is to classify the di erent tasks to which geneous databases, we can address the second task of p osing database concepts have b een applied, and to emphasize the queries that require the integration of data. The second task technical innovations that were required to do so. is addressed by mediator or data integration systems. We do not claim that database technology is the magic bullet Web site construction and restructuring: A di er- that will solve all web information management problems; ent asp ect in which database concepts and technology can other technologies, such as Information Retrieval, Arti cial b e applied is that of buildin g, restructuring and managing Intelligence, and Hyp ertext/Hyp ermedia, are likely to b e web-sites. In contrast to the previous two classes which ap- just as imp ortant. However, surveying all the work going ply on existing web sites, here we consider the pro cess of on in these areas, and the interactions b etween them and creating sites. Web sites can b e constructed either by start- database ideas, would b e far b eyond our scop e. ing with some raw data stored in databases or structured We fo cus on three classes of tasks related to information les or by restructuring existing web sites. Peforming this management on the WWW. task requires metho ds for modeling the structure of web site and languages for restructing data to conform to a desired Mo deling and querying the web: Supp ose we view the structure. web as a directed graph whose no des are web pages and whose edges are the links b etween pages. A rst task we Before we b egin, we note that there are several topics con- consider is that of formulating queries for retrieving certain cerning the application of database concepts to the WWW pages on the web. The queries can b e based on the content which are not covered in this survey, such as caching and of the desired pages and on the link structure connecting the replication see [WWW98 , GRC97 ] for recentworks, trans- pages. The simplest instance of this task, which is provided action pro cessing and securityinweb environments see by search engines on the web is to lo cate pages based on the e.g. [Bil98 ], p erformance, availabil ity and scalability issues words they contain. A simple generalizatio n of such a query for web servers e.g. [CS98], or indexing techniques and is to apply more complex predicates on the contents of a crawler technology e.g. [CGMP98]. Furthermore, this page e.g., nd the pages that contain the word \Clinton" is not meant to b e a survey on existing pro ducts even in next to a link to an image. Finally, as an example of a query the areas on whichwe do fo cus. Finally, there are several that involves the structure of the pages, consider the query tangential areas whose results are applicabl e to the systems asking for all images reachable from the ro ot of the CNN web we discuss, but we do not cover them here. Examples of site within 5 links. The last typ e of queries are esp ecially such elds include systems for managing do cument collec- useful when detecting violations of integrity constraints on + tions and ranking of do cuments e.g., Harvest [BDH 95], aweb site or a collection of web sites. Gloss [GGMT99] and exible query answering systems [BT98]. Finally, the eld of web/db is a very dynamic one; hence, there are undoubtedly some omissions in our coverage, for whichwe ap ologize in advance. The survey is organized as follows. We b egin in Section 2 by discussing the main issues that arise in designing data mo d- els for web/db applications . The following three sections consider each of the aforementioned tasks. Section 6 con- attributes. Because of the characteristics of semistructured cludes with p ersp ectives and directions for future research. data mentioned ab ove, an additional feature that b ecomes imp ortant in this context is the ability to query the schema i.e., the lab els on the arcs in the graph. This feature is 2 Data Representation for Web/DB Tasks supp orted in languages for querying semistructured data by arc variables which get b ound to lab els on arcs, rather than Building systems for solving any of the previous tasks re- no des in the graph. quires that wecho ose a metho d for mo deling the underlying domain. In particular, in these tasks, we need to mo del the In addition to developing mo dels and query languages for web itself, structure of web sites, internal structure of web semistructured data, there has b een considerable recentwork pages, and nally, contents of web sites in ner granulariti es. on issues concerning the management of semistructured data, In this section we discuss the main distinguishi ng factors of such as the extraction of structure from semistructured data + the data mo dels used in web applicatio ns. [NAM98], view maintenance [ZGM98 , AMR 98], summa- rization of semistructured data [BDFS97, GW97 ], and rea- Graph data mo dels: As noted ab ove, several of the ap- soning ab out semistructured data [CGL98, FFLS98 ]. Aside plications we discuss require to mo del the set of web pages from the relevance of these works to the tasks mentioned in and the links b etween them. These pages can either b e on this survey, the systems based on these metho ds will b e of several sites or within a single site. Hence, a natural way sp ecial imp ortance for the task of managing large volumes to mo del this data is based on a labeledgraph data mo del. of XML data [XML98]. Sp eci call y, in this mo del, no des representweb pages or internal comp onents of web pages, and arcs represent links Other characteristics of web data mo dels: Another between pages. The lab els on the arcs can b e viewed as at- distinguish in g characteristic of mo dels used in web/db ap- tribute names. Along with the lab eled graph mo del, several plications is the presence of web-sp eci c constructs in the query languages have b een develop ed. One central feature data representation. For example, some mo dels distinguish that is common to these query languages is the abilityto a unary relation identifying pages and a binary relation for formulate regular path expression queries over the graph. links b etween pages. Furthermore, wemay distinguish b e- Regular path expressions enable p osing navigational queries tween links within a web site and external links. An im- over the graph structure. p ortant reason to distingui sh a link relation is that it can generally only b e traversed in the forward direction. Addi- Semistructured data mo dels: The second asp ect of mo d- tional second order dimensions along which the data mo dels eling data for web applicatio ns is that in many cases the we discuss di er are 1 the ability to mo del order among structure of the data is irregular. Sp eci call y, when mo del- elements in the database, 2 mo deling nested data struc- ing the structure of a web site, we don't have a xed schema tures, and 3 supp ort for collection typ es sets, bags, ar- whichisgiven in advance. When mo deling data coming from rays. An example of a data mo del that incorp orates explicit multiple sources, the representation of some attributes e.g., web-sp eci c constructs pages and page schemes, nesting, addresses may di er from source to source. Hence, several and collection typ es is ADM, the data mo del of the Ara- pro jects have considered mo dels of semistructured data. The neus pro ject [AMM97b]. We remark that all the mo dels we initial motivation for this work was the existence and rela- mention in this pap er represent only static structures.