Database Techniques for the World-Wide Web: A Survey

Daniela Florescu Alon Levy Alb erto Mendelzon

Inria Ro quencourt Univ. of Washington Univ. of Toronto

dana@ro din.inria.fr [email protected] [email protected]

Information extraction and integration: Certain web 1 Intro duction

sites can b e viewed at a ner granularity level than pages,

as containers of structured data e.g., sets of tuples, or The p opularity of the World-Wide Web WWW has made

sets of ob jects. For example, the Internet Movie it a prime vehicle for disseminating information. The rel-

http://www.imdb.com can b e viewed as a front end in- evance of database concepts to the problems of managing

terface to a database ab out movies. Given the rise in the and querying this information has led to a signi cant body

numb er of such sites, there are two tasks we consider. The of recent research addressing these problems. Even though

rst task is to actually extract a structured representation the underlying challenge is the one that has b een tradition-

of the data e.g., a set of tuples from the HTML pages ally addressed by the database community{how to manage

containing them. This task is p erformed by a set of wrap- large volumes of data { the novel context of the WWW forces

per programs, whose creation and maintenance raises several us to signi cantl y extend previous techniques. The primary

challenges. Once we view these sites as autonomous hetero- goal of this survey is to classify the di erent tasks to which

geneous , we can address the second task of p osing database concepts have b een applied, and to emphasize the

queries that require the integration of data. The second task technical innovations that were required to do so.

is addressed by mediator or data integration systems.

We do not claim that database technology is the magic bullet

Web site construction and restructuring: A di er- that will solve all web information management problems;

ent asp ect in which database concepts and technology can other technologies, such as Information Retrieval, Arti cial

b e applied is that of buildin g, restructuring and managing Intelligence, and Hyp ertext/Hyp ermedia, are likely to b e

web-sites. In contrast to the previous two classes which ap- just as imp ortant. However, surveying all the work going

ply on existing web sites, here we consider the pro cess of on in these areas, and the interactions b etween them and

creating sites. Web sites can b e constructed either by start- database ideas, would b e far b eyond our scop e.

ing with some raw data stored in databases or structured

We fo cus on three classes of tasks related to information

les or by restructuring existing web sites. Peforming this

management on the WWW.

task requires metho ds for modeling the structure of web site

and languages for restructing data to conform to a desired Mo deling and querying the web: Supp ose we view the

structure. web as a directed graph whose no des are web pages and

whose edges are the links b etween pages. A rst task we

Before we b egin, we note that there are several topics con-

consider is that of formulating queries for retrieving certain

cerning the application of database concepts to the WWW

pages on the web. The queries can b e based on the content

which are not covered in this survey, such as caching and

of the desired pages and on the link structure connecting the

replication see [WWW98 , GRC97 ] for recentworks, trans-

pages. The simplest instance of this task, which is provided

action pro cessing and securityinweb environments see

by search engines on the web is to lo cate pages based on the

e.g. [Bil98 ], p erformance, availabil ity and scalability issues

words they contain. A simple generalizatio n of such a query

for web servers e.g. [CS98], or indexing techniques and

is to apply more complex predicates on the contents of a

crawler technology e.g. [CGMP98].  Furthermore, this

page e.g., nd the pages that contain the word \Clinton"

is not meant to b e a survey on existing pro ducts even in

next to a link to an image. Finally, as an example of a query

the areas on whichwe do fo cus. Finally, there are several

that involves the structure of the pages, consider the query

tangential areas whose results are applicabl e to the systems

asking for all images reachable from the ro ot of the CNN web

we discuss, but we do not cover them here. Examples of

site within 5 links. The last typ e of queries are esp ecially

such elds include systems for managing do cument collec-

useful when detecting violations of integrity constraints on

+

tions and ranking of do cuments e.g., Harvest [BDH 95],

aweb site or a collection of web sites.

Gloss [GGMT99] and exible query answering systems [BT98].

Finally, the eld of web/db is a very dynamic one; hence,

there are undoubtedly some omissions in our coverage, for

whichwe ap ologize in advance.

The survey is organized as follows. We b egin in Section 2 by

discussing the main issues that arise in designing data mo d-

els for web/db applications . The following three sections

consider each of the aforementioned tasks. Section 6 con- attributes. Because of the characteristics of semistructured

cludes with p ersp ectives and directions for future research. data mentioned ab ove, an additional feature that b ecomes

imp ortant in this context is the ability to query the schema

i.e., the lab els on the arcs in the graph. This feature is

2 Data Representation for Web/DB Tasks

supp orted in languages for querying semistructured data by

arc variables which get b ound to lab els on arcs, rather than

Building systems for solving any of the previous tasks re-

no des in the graph.

quires that wecho ose a metho d for mo deling the underlying

domain. In particular, in these tasks, we need to mo del the

In addition to developing mo dels and query languages for

web itself, structure of web sites, internal structure of web

semistructured data, there has b een considerable recentwork

pages, and nally, contents of web sites in ner granulariti es.

on issues concerning the management of semistructured data,

In this section we discuss the main distinguishi ng factors of

such as the extraction of structure from semistructured data

+

the data mo dels used in web applicatio ns.

[NAM98], view maintenance [ZGM98 , AMR 98], summa-

rization of semistructured data [BDFS97, GW97 ], and rea-

Graph data mo dels: As noted ab ove, several of the ap-

soning ab out semistructured data [CGL98, FFLS98 ]. Aside

plications we discuss require to mo del the set of web pages

from the relevance of these works to the tasks mentioned in

and the links b etween them. These pages can either b e on

this survey, the systems based on these metho ds will b e of

several sites or within a single site. Hence, a natural way

sp ecial imp ortance for the task of managing large volumes

to mo del this data is based on a labeledgraph data mo del.

of XML data [XML98].

Sp eci call y, in this mo del, no des representweb pages or

internal comp onents of web pages, and arcs represent links

Other characteristics of web data mo dels: Another

between pages. The lab els on the arcs can b e viewed as at-

distinguish in g characteristic of mo dels used in web/db ap-

tribute names. Along with the lab eled graph mo del, several

plications is the presence of web-sp eci c constructs in the

query languages have b een develop ed. One central feature

data representation. For example, some mo dels distinguish

that is common to these query languages is the abilityto

a unary relation identifying pages and a binary relation for

formulate regular path expression queries over the graph.

links b etween pages. Furthermore, wemay distinguish b e-

Regular path expressions enable p osing navigational queries

tween links within a web site and external links. An im-

over the graph structure.

p ortant reason to distingui sh a link relation is that it can

generally only b e traversed in the forward direction. Addi-

Semistructured data mo dels: The second asp ect of mo d-

tional second order dimensions along which the data mo dels

eling data for web applicatio ns is that in many cases the

we discuss di er are 1 the ability to mo del order among

structure of the data is irregular. Sp eci call y, when mo del-

elements in the database, 2 mo deling nested data struc-

ing the structure of a web site, we don't have a xed schema

tures, and 3 supp ort for collection typ es sets, bags, ar-

whichisgiven in advance. When mo deling data coming from

rays. An example of a data mo del that incorp orates explicit

multiple sources, the representation of some attributes e.g.,

web-sp eci c constructs pages and page schemes, nesting,

addresses may di er from source to source. Hence, several

and collection typ es is ADM, the data mo del of the Ara-

pro jects have considered mo dels of semistructured data. The

neus pro ject [AMM97b]. We remark that all the mo dels we

initial motivation for this work was the existence and rela-

mention in this pap er represent only static structures. For

tive success of p ermissive data mo dels such as [TMD92 ] in

example, the work on mo deling the structure of web sites do

the scienti c community, the need for exchanging ob jects

not consider dynamic web pages created as a result of user

across heterogeneous sources [PGMW95 ], and the task of

inputs.

managing do cument collections [MP96].

An imp ortant asp ect of languages for querying data in web

Broadly sp eaking, semistructured data refers to data with

application s is the need to create complex structures as a

some of the following characteristics:

result of a query. For example, the result of a query in

aweb site management system is the graph mo deling the

 the schema is not given in advance and may b e implicit

web site. Hence, a fundamental characteristic of manyof

in the data,

the languages we discuss in this pap er is that their query

expressions contain a structuring comp onent in addition to

 the schema is relatively large w.r.t. the size of the

the traditional data ltering comp onent.

data and maybechanging frequently,

Table 1 summarizes some of the web query systems cov-

 the schema is descriptive rather than prescriptive, i.e.,

ered in this pap er. A more detailed version of this table,

it describ es the current state of the data, but violations

http://www.cs.washington.edu/homes/alon/webdb.html in-

of the schema are still tolerated,

cludes URLs for the systems where available. In subsequent

sections we will illustrate in detail languages for querying

 the data is not strongly typ ed, i.e., for di erent ob-

data represented in these mo dels.

jects, the values of the same attribute may b e of dif-

fering typ es.

3 Mo deling and Querying the Web

Mo dels for semistructured data have b een based on lab eled

1

If the web is viewed as a large, graph-like database, it is

directed graphs [Abi97 , Bun97 ]. In a semistructured data

natural to p ose queries that go b eyond the basic informa-

mo del, there is no restriction on the set of arcs that emanate

tion retrieval paradigm supp orted bytoday's search engines

from a given no de in a graph, or on the typ es of the values of

and take structure into account; b oth the internal structure

1

It should b e noted that there is no inherent diculty in trans-

of web pages and the external structure of links that inter-

lating these mo dels into relational or ob ject-oriented terms. In fact,

connect them. In an often-cited pap er on the limitations of

the languages underlying Description Logics e.g., Classic [BBMR89]

hyp ertext systems, Halasz says: [Hal88 ]

and FLORID [HLLS97]have some of the features mentioned ab ove, and are describ ed in non-graph mo dels.

System Data Mo del Language Style Path Expressions Graph Creation

WebSQL [MMM97] relational SQL Yes No

W3QS [KS95] lab eled multigraphs SQL Yes No

WebLog [LSS96 ] relational Datalog No No

+

Lorel [AQM 97 ] lab eled graphs OQL Yes No

WebOQL [AM98] hyp ertrees OQL Yes Yes

UnQL [BDHS96 ] lab eled graphs structural recursion Yes Yes

+

Strudel [FFK 98, FFLS97] lab eled graphs Datalog Yes Yes

Araneus Ulixes [AMM97b ] page schemes SQL Yes Yes

Florid [HLLS97] F-logic Datalog Yes No

Table 1: Comparison of query systems

Content search ignores the structure of a hy- ables. Guting et al.[GZC89 ] mo del do cuments using nested

p ermedia network. In contrast, structure search ordered relations and use a generalization of nested rela-

sp eci cally examines the hyp ermedia structure tional algebra as a query language. Beeri and Kornatzky

for subnetworks that match a given pattern. [BK90] prop ose a logic whose formulas sp ecify patterns over

the hyp ertext graph.

and go es on to give examples where such queries are useful.

Graph query languages: Work in using graphs to mo del

databases, motivated by applications such as software engi-

3.1 Structural Information Retrieval

neering and computer network management, led to the G,

G+ and GraphLog graph-based languages [CMW87 , CMW88 ,

The rst to ols develop ed for querying the web were the well-

CM90]. In particular, G and G+ are based on lab eled

known search engines which are now widely deployed and

graphs; they supp ort regular path expressions and graph

used. These are based on searching indices of words and

construction in queries. GraphLog, whose semantics is based

phrases app earing in do cuments discovered byweb \crawlers."

on Datalog, was applied to Hyp ertext queries in [CM89].

More recently, there have b een e orts to overcome the lim-

+

Paredaens et al [PdBA 92 ] develop ed a graph query lan-

itations of this paradigm by exploiting link structure in

guage for ob ject-oriented databases.

queries. For example, [Kle98 ], [BH98] and [CDRR98], pro-

p ose to use the web structure to analyze the many sites

Languages for querying semistructured data: Query

+

returned by a search engine as relevant to a topic in order

languages for semistructured data such as Lorel [AQM 97],

to extract those that are likely to b e authoritative sources

UnQL [BDHS96 ] and StruQL [FFLS97] also use lab eled

on the topic. To supp ort connectivity analysis for this and

graphs as a exible data mo del. In contrast to graph query

other applications, such as ecient implementations of the

languages, they emphasize the ability to query the schema

query languages describ ed b elow the Connectivity Server

of the data, and the ability to accommo date irregularities

+

[BBH 98 ] provides fast access to structural information.

in the data, such as missing or rep eated elds, heteroge-

Go ogle [BP98], a prototyp e next-generation web search en-

neous records. Related work in the OO community [Har94]

gine, makes heavy use of web structure to improve crawling

prop oses \schema-shy" mo dels and queries to handle infor-

and indexing p erformance. Other metho ds for exploiting

mation ab out software engineering artifacts.

link structure are presented in [PPR96 , CK98]. In these

These languages were not develop ed sp eci cally for the web,

works, structural information is mostly used b ehind the scenes,

and do not distinguis h, for example, b etween graph edges

to improve the answers to purely content-oriented queries.

that represent the connection b etween a do cument and one

Sp ertus [Sp e97 ] p oints out many useful application s of queries

of its parts and edges that representahyp erlink from one

that take link structure into account explicitly.

web do cument to another. Their data mo dels, while elegant,

are not very rich, lacking such basic comforts as records and

3.2 Related query paradigms

ordered collections.

In this section we brie y describ e several families of query

3.3 First generation web query languages

languages that were not develop ed sp eci cally for querying

the web. However, since the concepts on which they are

A rst generation of web query languages aimed to combine

based are similar in spirit to the web query languages we

the content-based queries of search engines with structure-

discuss, these languages can also b e useful for web applica-

based queries similar to what one would nd in a database

tions.

system. These languages, which include W3QL [KS95], Web-

SQL [MMM97 , AMM97a], and WebLog [LSS96], combine

Hyp ertext/do cument query languages: Anumber of

conditions on text patterns app earing within do cuments with

mo dels and languages for querying structured do cuments

graph patterns describing link structure. We use WebSQL

and hyp ertexts were prop osed in the pre-web era. For exam-

as an example of the kinds of queries that can b e asked.

ple, Abiteb oul et al.[ACM93] and Christophides et al.[CACS94]

map do cuments to ob ject oriented database instances by

WebSQL WebSQL prop oses to mo del the web as a rela-

means of semantic actions attached to a grammar. Then the

tional database comp osed of two virtual relations: Do cu-

database representation can b e queried using the query lan-

ment and Anchor. The Do cument relation has one tuple for

guage of the database. A novel asp ect of this approach is the

each do cument in the web and the Anchor relation has one

p ossibili ty of querying the structure by means of path vari-

tuple for each anchor in each do cument in the web. This ture, placing it closer to the class of languages discussed in

relational abstraction of the web allows us to use a query the next subsection.

language similar to SQL to p ose the queries.

3.4 Second generation: Web Data Manip-

If Do cument and Anchor were actual relations, we could

ulation Languages

simply use SQL to write queries on them. But since the

Do cument and Anchor relations are completely virtual and

The languages ab ove treat web pages as atomic ob jects with

there is no waytoenumerate them, we cannot op erate on

two prop erties: they contain or do not contain certain text

them directly. The WebSQL semantics dep ends instead on

patterns, and they p oint to other ob jects. Exp erience with

materializin g p ortions of them by sp ecifying the do cuments

their use suggests there are two main areas of application

of interest in the FROM clause of a query. The basic way

that they can b e useful for: data wrapping, transformation,

of materializing a p ortion of the web is bynavigating from

and restructuring, as describ ed in Section 4; and web site

known URL's. Path regular expressions are used to describ e

construction and restructuring, as describ ed in Section 5. In

this navigation. An atom of such a regular expression can

b oth applicati on areas, it is often essential to have access to

b e of the form d1 => d2, meaning do cumentd1points to

the internal structure of web pages from the query language,

d2 and d2 is stored on a di erent server from d1; or d1 ->

if wewant declarative queries to capture a large part of the

d2, meaning d1 p oints to d2 and d2 is stored on the same

task at hand. For example, the task of extracting a set of

server as d1.

tuples from the HTML pages of the Internet Movie Database

For example, supp ose wewant to nd a list of triples of the

requires parsing the HTML and selectively accessing certain

form d1,d2,label, where d1 is a do cument stored on our

subtrees in the parse tree.

lo cal site, d2 is a do cument stored somewhere else, and d1

In this section we describ e the second-generation of web

p oints to d2 by a link lab eled label. Assume all our lo cal

query languages that we call \Web data manipulati on lan-

do cuments are reachable from www.mysite.start.

guages." These languages go b eyond the rst generation

SELECT d.url,e.url,a.label

languages in two signi cantways. First, they provide access

FROM Document d SUCH THAT

to the structure of the web ob jects that they manipulate.

"www.mysite.start" ->* d,

Unlike the rst-generation languages, they mo del internal

Document e SUCH THAT d => e,

structure of web do cuments as well as the external links

Anchor a SUCH THAT a.base = d.url

that connect them. They supp ort references to mo del hyp er-

WHERE a.href = e.url

links, and some supp ort ordered collections and records for

more natural data representation. Second, these languages

The FROM clause instantiates two Do cumentvariables, d

provide the ability to create new complex structures as a

and e, and one Anchor variable a. The variable d is b ound

result of a query. Since the data on the web is commonly

in turn to each lo cal do cument reachable from the starting

semistructured or worse, these languages still emphasize

do cument, and e is b ound to each non-lo cal do cument reach-

the ability to supp ort semistructured features. We brie y

able directly from d. The anchor variable a is instantiated

describ e three languages in this class: WebOQL [AM98],

to each link that originates in do cument d; the extra condi-

StruQL[FFLS97] and Florid [HLLS97].

tion that the target of link a b e do cument e is given in the

WHERE clause. Another way of materializing part of the

WebOQL

Do cument and Anchor relations is by content conditions:

for example, if wewere only interested in do cuments that

contains the string "database" we could have added to the

The main data structure provided byWebOQL is the hy-

FROM clause the condition d MENTIONS "database". The

p ertree. Hyp ertrees are ordered arc-lab eled trees with two

implementation uses search engines to generate candidate

typ es of arcs, internal and external. Internal arcs are used

do cuments that satisfy the MENTION conditions.

to represent structured ob jects and external arcs are used

to represent references typically hyp erlinks among ob jects.

Other Languages W3QL [KS95] is similar in avour to

Arcs are lab eled with records. Figure 1, from [Aro97], shows

WebSQL, with some notable di erences: it uses external

ahyp ertree containing descriptions of publications from sev-

programs similar to user de ned functions in ob ject-relational

eral research groups. Such a tree could easily b e built, for

languages for sp ecifying content conditions on les rather

example, from an HTML le, using a generic HTML wrap-

than building conditions into the language syntax, and it

p er.

provides mechanisms for handling forms encountered dur-

Sets of related hyp ertrees are collected into webs. Both hy-

ing navigation. In [KS98], Konopnicki and Shmueli describ e

p ertrees and webs can b e manipulated using WebOQL and

planned extensions to move W3QL into what we call the

created as the result of a query.

second generation. These include mo deling internal do c-

ument structure, hierarchical web mo deling that captures

WebOQL is a functional language, but queries are couched

the notion of web site explicitly, and replacing the exter-

in the familiar select-from-where form. For example, sup-

nal program metho d of sp ecifying conditions with a general

p ose that the name csPap ers denotes the pap ers database

extensible metho d based on the MIME standard.

in Figure 1, and that wewant to extract from it the title

and URL of the full version of pap ers authored by \Smith".

WebLog [LSS96 ] di ers from the ab ove languages in using

deductive rules instead of SQL-like syntax see the descrip-

select [y.Title,y'.Url]

tion of Florid b elow.

from x in csPapers, y in x'

where y.Authors ~ "Smith"

WQL, the query language of the WebDB pro ject [LSCH98],

is similar to WebSQL but it supp orts more comprehensive

In this query, x iterates over the simple trees of csPap ers SQL functionality such as aggregation and grouping, and

i.e., over the research groups and, given a value for x, y provides limited supp ort for querying intra-do cument struc- [Group: Databases] [Group: Card Punching] ... [Group: Programming Languages]

[Title: Recent Discoveries in Card Punching, [Title: Are Magnetic Media Better?, [Title: Cobol in AI, [Title: Assembly for the Masses, Authors: Peter Smith, John Brown, Authors:Peter Smith, John Brown, Tom Wood, Authors: Sam James, John Brown] Authors:John Brown, Tom Wood, Publication: Technical Report TR015] Publication:ACM TOCP Vol. 3 No. (1942) pp 23-37] Publication:ACM 2 POPL Proceedings. (1943)]

[label: Abstract, [label: Abstract, [label: Abstract, [label: Abstract, url: www.../abstr1.html] url: www.../abstr2.html] url: www.../abstr13.html] url: www.../abstr17.html]

[label: Full version, [label: Full version, url: www.../paper1.ps.Z] url: www.../paper2.ps.Z] [label: Full version, [label: Full version,

url: www.../paper13.ps.Z] url: www.../paper17.ps.Z]

Figure 1: Example of a hyp ertree

iterates over the simple trees of x'. The primed variable x' hyp ertext, but it is p ossible to browse them sequentially by

denotes the result of applying to tree x the Prime op era- following links having the string \Next" as lab el. For build-

tor, which returns the rst subtree of its argument. The ing the full-text index we need to feed the indexer with the

same op erator is used to extract from tree y its rst subtree text and the URL of each do cument. We can obtain this

in y'.Url. The square brackets denote the Hang op erator, information using the following query:

which builds an arc lab eled with a record formed with the ar-

select [ x.Url, x.Text ]

guments in this example, the eld names are inferred. Fi-

from x in browse"root.html"

nally, the tilde represents the string pattern matching pred-

via ^*[Text ~ "Next"]>*

icate: its left argument is a string and its right argumentis

a pattern.

StruQL

Web Creation The query ab ove maps a hyp ertree into

another hyp ertree; more generally, a query is a function that

StruQL is the query language of the Strudel web site

maps a web into another. For example, the following query

management system, describ ed b elow in Section 5. Even

creates a new page for each research group using the group

though StruQL was develop ed in the context of a sp eci c

name as URL. Each page contains the publication s of the

web application , it is a general purp ose query language for

corresp onding group.

semistructured data, based on a data mo del of lab eled di-

select x' as x.Group

rected graphs. In addition, the Strudel data mo del in-

from x in csPapers

cludes named collections, and supp orts several atomic typ es

that commonly app ear in web pages, such as URLs, and

In general, the select clause has the form ` select q as s ,

Postscript, text, image, and HTML les. The result of a

1 1

q as s , ... , q as s ', where the q 's are queries and each

StruQL query is a graph in the same data mo del as the

2 2 m m i

of the s 's is either a string query or the keyword schema.

input graphs. In Strudel, StruQL was used for two tasks:

i

The \as" clauses create the URL's s , s , ::: , s , which

querying heterogeneous sources to integrate them into a site

1 2 m

are assigned to the new pages resulting from each query q .

data graph, and for querying this data graph to pro duce a

i

site graph.

Navigation Patterns Navigation patterns are regular ex-

A StruQL query is a set of p ossibly nested blo cks, eachof

pressions over an alphab et of record predicates; they allow

the form:

us to sp ecify the structure of the paths that must b e followed

in order to nd the instances for variables.

[where C1,...,Ck]

[create N1,...,Nn]

Navigation patterns are mainly useful for two purp oses. The

[link L1,...,Lp]

rst reason is for extracting subtrees from trees whose struc-

[collect G1,...,Gq].

ture we do not know in detail or whose structure presents

irregulariti es, and the second is for iterating over trees con-

nected by external arcs. In fact, the distinction b etween in- The where clause can include either memb ership conditions

ternal and external arcs in hyp ertrees b ecomes really useful or conditions on pairs of no des expressed using regular path

when we use navigation patterns that traverse external arcs. expressions. The where clause pro duces all bindings of no de

Supp ose that wehave a software pro duct whose do cumen- and arc variables to values in the input graph, and the re-

tation is provided in HTML format and wewant to build maining clauses use Skolem functions to construct a new

a full-text index for it. These do cuments form a complex graph from these bindings.

webdoc::string[url => url; author => string; We illustrate StruQL with a query de ning a web site,

modif => string; starting with a Bibtex bibliograp hy le, mo deled as a lab eled

type => string; hrefs@string =>> url; graph. The web site will consist of three kinds of pages: a

error =>> string]. Pap erPresentation page for each bibliography entry,a Year

page for eachyear, p ointing to all pap ers published in that

year, and a Ro ot page p ointing to all the Year pages. After

The rst declaration intro duces a class url, sub class of string,

showing the query in StruQL,we showitinWebOQL to

with the only metho d get. The notation get => webdoc

give a feel for the di erences b etween the languages.

means that get is a single-valued metho d that returns an

ob ject of typ e webdoc. The metho d get is system-de ned;

// Create Root

the e ect of invoking u.get for a url u in the head of a de-

create RootPage

ductive rule is to retrieve from the web the do cument with

// Create a presentation for every publication x

that URL and cache it in the lo cal Florid database as a

where Publicationsx, x->l->v

webdoc ob ject with ob ject identi er u.get.

create PaperPresentationx

The class webdoc with metho ds self, author, modif, type, link PaperPresentationx->l->v

hrefs and error mo dels the basic information common to

{ // Create a page for every year

all web do cuments. The notation hrefs@string =>> url where l = "year"

means that the multi-valued metho d hrefs takes a string

create YearPagev

as argument and returns a set of ob jects of typ e url. The link

idea is that, if d is a webdoc, then d.hrefs@aLabel returns YearPagev -> "Year" -> v

all URL's of do cuments p ointed to by links lab eled aLabel YearPagev->"Paper"->PaperPresentationx,

within do cument d. // Link root page to each year page

RootPage -> "YearPage" -> YearPagev

Sub classes of do cuments can b e declared as needed using

}

F-logic inheritance, e.g.:

htmldoc::webdoc[title => string; text => string].

In the where clause, the notation Publicationsx means

that x b elongs to the collection Publications, and the atom

x ! l ! v denotes that there is a link in the graph from

Computation in Florid is expressed by sets of deductive

x to v and the lab el on the arc is l. The same notation is

rules. For example, the program b elow extracts from the

used in the link clause to sp ecify the newly created edges in

web the set of all do cuments reachable directly or indirectly

the resulting graph. After creating the Ro ot page, the rst

from the URL www.cs.toronto.edu by links whose lab els

CREATE generates a page for each publicatio n denoted

contain the string \database."

by the Skolem function, Pap erPresentation. The second

CREATE, nested within the outer query, generates a Year

"www.cs.toronto.edu":url.get.

page for eachyear, and links it to the Ro ot page and to

Y:url.get <-

the Pap erPresentation pages of the publications published

X:url.get[hrefs@L=>>{Y}],

in that year. Note the Skolem function YearPage ensures

substr"database",L.

that a Year page for a particular year is only created once,

no matter how many pap ers were published in that year.

Florid provides a p owerful formalism for manipulating semi-

structured data in a web context. However, it do es not cur-

Below is the same query in WebOQL.

rently supp ort the construction of new webs as results of

select unique [Url: x.year, Label:"YearPage"]

computation; the result is always a set of F-logic ob jects in

as "RootPage",

the lo cal store.

[ label: "Paper" / x ] as x.year

from x in browse"bibtex: myfile.bib"

Ulixes and Penelop e

|

In the Araneus pro ject [AMM97b ], the query and restruc-

select [year: y.url] + y as y.url

turing pro cess is split into two phases. In the rst phase,the

from y in "browseRootPage"

Ulixes language is used to build relational views over the

web. These views can then b e analyzed and integrated using

standard database techniques. Ulixes queries extract rela-

The WebOQL query consists of two sub queries, with the

tional data from instances of page schemes de ned in the

web resulting from the rst one \pip ed" into the second one

ADM mo del, making heavy use of star-free path expres-

using the \j" op erator. The rst sub query builds the Ro ot,

sions. The second phase consists of generating hyp ertextual

Pap er, and Year pages, and the second one rede nes each

views of the data using the Penelope language. Query op-

Year page by adding the \year" eld to it.

timization for relational views over sets of web pages, such

as those constructed by Ulixes, is discussed in [MMM98 ].

Florid

+

Interactive query interfaces

Florid [HLLS97 , LHL 98] is a prototyp e implementation of

the deductive and ob ject-oriented formalism F-logic [KLW95].

All the languages in the previous two subsections are to o

To use Florid as a web query engine, a web do cumentis

complex to b e used directly byinteractive users, just as SQL

mo deled by the following two classes:

is; like SQL, they are meant to b e used mostly as program-

ming to ols. There has however b een work in the design of url::string[get => webdoc].

interactive query interfaces suitable for casual users. For ex- autonomy.

ample, Dataguides [GW97 ] is an interactive query to ol for

An imp ortant distinction in building data integration sys-

semistructured databases based on hierarchical summaries

tems, and therefore in building web data integration sys-

of the data graph; extensions to supp ort querying single

tems, is whether to takeawarehousing or a virtual approach

complex web sites are describ ed in [GW98 ]. The system de-

see [HZ96, Hul97 ] for a comparison. In the warehousing

+

scrib ed in [HML 98] supp orts queries that combine multi-

approach, data from multiple web sources is loaded into a

media features, such as similarity to a given sketch or image,

warehouse, and all queries are applied to the warehoused

textual features suchaskeywords, and domain semantics.

data; this requires that the warehouse b e up dated when

data changes, but the advantage is that adequate p erfor-

mance can b e guaranteed at query time. In the virtual ap-

Theory of web queries

proach, the data remains in the web sources, and queries

to the data integration system are decomp osed at run time

In de ning the semantics of rst-generation web query lan-

into queries on the sources. In this approach, data is not

guages, it was immediately observed that certain easily stated

replicated, and is guaranteed to b e fresh at query time. On

queries, such as \list all web do cuments that no other do c-

the other hand, b ecause the web sources are autonomous,

ument p oints to," could b e rather hard to execute. This

more sophisticated query optimization and execution meth-

leads naturally to questions of query computability in this

o ds are needed to guarantee adequate p erformance. The

context. Abiteb oul and Vianu[AV97a] and Mendelzon and

virtual approach is more appropriate for building systems

Milo [MM97 ] prop ose formal ways of categorizing web queries

where the numb er of sources is large, the data is changing

according to whether they can in principle b e computed or

frequently, and there is little control over the web sources.

not; the key idea b eing that essentially, the only p ossible way

For these reasons, most of the recent research has fo cused

to access the web is to navigate links from known starting

on the virtual approach, and therefore, so will our discus-

p oints. Note this includes a sp ecial case navigating links

sion. We emphasize that many of the issues that arise in

from the large collections of starting p oints known as index

the virtual approach also arise in the warehousing approach

servers or search engines. Abiteb oul and Vianu[AV97b]

often in a slightly di erent form, and hence our discussion

also discuss fundamental issues p osed by query optimiza-

is relevant to b oth cases. Finally,we refer the reader to two

tion in path traversal queries. Mihaila, Milo and Mendel-

commercial application s of web data integration, one that

zon [MMM97] showhow to analyze WebSQL queries in

takes the warehousing approach [Jun98 ] and the other that

terms of the maximum number of web sites. Florescu, Levy

takes the virtual approach [Jan98].

and Suciu [FLS98 ] describ e an algorithm for query contain-

A prototypical architecture of a virtual data integration sys-

ment for queries with regular path expressions, which is then

tem is shown in Figure 3. There are two main features distin-

used for verifying integrity constraints on the structure of

guishing such a system from a traditional database system:

web sites [FFLS98].

 As stated earlier, the system do es not communicate di-

4 Information Integration

rectly with a lo cal storage manager. Instead, in order

to obtain data, the query execution engine communi-

As stated earlier, the WWW contains a growing number of

cates with a set of wrapp ers. A wrapp er is a program

information sources that can b e viewed as containers of sets

which is sp eci c to every web site, and whose task is

of tuples. These \tuples" can either b e emb edded in HTML

to translate the data in the web site to a form that can

pages, or b e hidden b ehind form interfaces. By writing sp e-

b e further pro cessed by the data integration system.

cialized programs called wrappers, one can give the illusion

For example, the wrapp er may extract from an HTML

that the web site is serving sets of tuples. We refer to the

le a set of tuples. It should b e emphasized that the

combination of the underlying web site and the wrapp er as-

wrapp er provides only an interface to the data served

so ciated with it as a web source.

by the web site, and hence, if the web site provides

The task of a web information integration system is to an-

only limited access to the data e.g., through a form

swer queries that may require extracting and combining data

interface that requires certain inputs, then the wrap-

from multiple web sources. As an example, consider the do-

p er can only supp ort limited access patterns to the

main of movies. The Internet Movie Database contains com-

data.

prehensive data ab out movies, their casts, genres and direc-

 The second di erence from traditional systems is that

tors. Reviews of movies can b e found in multiple other web

the user do es not p ose queries directly in the schema in

sources e.g., web sites of ma jor newspap ers, and several

which the data is stored. The reason for this is that one

web sources provide schedules of movie showings. By com-

of the principal goals of a data integration system is to

bining the data from these sources we can answer queries

free the user from having to know ab out the sp eci c

such as: give me a movie, playing time and a review of

data sources and interact with each one. Instead, the

movies starring Frank Sinatra, playing tonight in .

user p oses queries on a mediated schema. A mediated

Several systems have b een built with the goal of answering

schema is a set of virtual relations, which are designed

+

queries using a multitude of web sources [GMPQ 97, EW94 ,

for a particular data integration applicati on. The re-

+ +

WBJ 95, LRO96 , FW97 , DG97b , AKS96, Coh98, AAB 98 ,

lations in the mediated schema are not actually stored

+

BEM 98]. Many of the problems encountered in build-

anywhere. As a consequence, the data integration sys-

ing these systems are similar to those addressed in building

tem must rst reformulate a user query into a query

+

heterogeneous database systems [ACPS96, WAC 93, HZ96 ,

that refers directly to the schemas in the sources. In

TRV98, FRV96, Bla96 , HKWY97]. Web data integration

order to p erform the reformulation step, the data in-

systems have, in addition, to deal with 1 large and evolv-

tegration system requires a set of source descriptions.

ing number of web sources, 2 little meta-data ab out the

A description of an information source sp eci es the

characteristics of the source, and 3 larger degree of source

contents of the source e.g., contains movies, the at- User interface Answers (WWW-based) World view Query

Relevance reasoning Source descriptions Plan Logical planner Contents Generator Capabilties Execution planner ..

Execution plan

Execution engine

Select, project, join, union...

Interface program Interface program Interface program Interface program

INTERNET

Structured WWW Form Relational Object-oriented

Files interface Database database

Figure 2: Architecture of a data integration system

tributes that can b e found in the source e.g., genre, the domain they are covering. For example, a bibliograp hy

cast, constraints on the contents of the source e.g., source is unlikely to b e complete for the eld of Computer

contains only American movies, completeness and re- Science. However, in some cases, we can assert complete-

liability of the source, and nally, the query pro cessing ness statements ab out sources. For example, the DB&LP

2

capabilities of the source e.g., can p erform selections, Database has the complete set of pap ers published in most

or can answer arbitrary SQL queries. ma jor database conferences. Knowledge of completeness of

aweb source can help a data integration system in several

ways. Most imp ortantly, since a negative answer from a

The following are the main issues addressed in the work on

complete source is meaningful, the data integration system

building web data integration systems.

can prune access to other sources. The problem of describ-

Sp eci cation of mediated schema and reformulation:

ing completeness of web sources and using this information

The mediated schema in a data integration system is the

for query pro cessing is addressed in [Mot89 , EGW94 , Lev96 ,

set of collection and attribute names that are used to for-

Dus97 , AD98 , FW97 ]. The work describ ed in [FKL97] de-

mulate queries. Toevaluate a query, the data integration

scrib es a probabilis tic formalism for describing the contents

system must translate the query on the mediated schema

and overlaps among information sources, and presents algo-

into a query on the data sources, that have their own lo-

rithms for cho osing optimally b etween sources.

cal schemas. In order to do so, the system requires a set of

Di ering query pro cessing capabilities: From the p er-

source descriptions. Several recent researchworks addressed

sp ective of the web data integration system, the web sources

the problem of how to sp ecify source descriptions and how

app ear to havevastly di ering query pro cessing capabilitie s.

to use them for query reformulation. Broadly sp eaking,

The main reasons for the di erent app earance are 1 the

two general approaches have b een prop osed: Global as view

+

underlying data may actually b e stored in a structured le

GAV [GMPQ 97 , PAGM96, ACPS96, HKWY97 , FRV96,

or legacy system and in this case the interface to this data

TRV98] and Local as view LAV [LRO96 , KW96 , DG97a ,

is naturally limited, and 2 even if the data is stored in a

DG97b , FW97 ] see [Ull97 ] for a detailed comparison.

traditional database system, the web site may provide only

In the GAV approach, for each relation R in the medi-

limited access capabili ties for reasons of security or p erfor-

ated schema, we write a query over the source relations

mance.

sp ecifying how to obtain R's tuples from the sources. The

To build an e ective data integration system, these capabili -

LAV approach takes the opp osite approach. For every in-

ties need to b e explicitly describ ed to the system, adhered to,

formation source S ,we write a query over the relations in

and exploited as much as p ossible to improve p erformance.

the mediated schema that describ es which tuples are found

We distinguis h twotyp es of capabiliti es: negative capabili -

in S . The main advantage of the GAV approach is that

ties that limit the access patterns to the data, and p ositive

query reformulation is very simple, b ecause it reduces to

capabiliti es, where a source is able to p erform additional

view unfolding. In contrast, in the LAV approach it is sim-

algebraic op erations in addition to simple data fetches.

pler to add or delete sources b ecause the descriptions of

the sources do not havetotakeinto account the p ossible

The main form of negative capabiliti es is limitations on the

interactions with other sources, as in the GAV approach,

binding patterns that can b e used in queries senttothe

and it is also easier to describ e constraints on the con-

source. For example, it is not p ossible to send a query to

tents of sources. The problem of reformulation b ecomes a

the Internet Movie Database asking for al l the movies in the

variant on the of the problem of answering queries using

database and their casts. Instead, it is only p ossible to ask

views [YL87, TSI96, LMSS95 , CKPS95, RSU95, DG97b ].

for the cast of given movie, or to ask for the set of movies in

which a particular actor app ears. Several works have con-

Completeness of data in web sources: In general, sources

2

that we nd on the WWW are not necessarily complete for http://ww w.i nf or mat ik .u ni -tr ie r. de/ ley/db/

sidered the problem of answering queries in the presence of with a set of HTML pages where the data in the page is la-

binding pattern limitations [RSU95 , KW96 , LRO96, FW97 ]. beled. The algorithm uses the lab eled examples to automati-

cally output a grammar by which the data can b e extracted

Positive capabiliti es p ose another challenge to a data inte-

from subsequent pages. Naturally, the more examples we

gration system. If a data source has the ability to p erform

give the system, the more accurate the resulting grammar

op erations such as selections and joins, wewould like to push

can b e, and the challenge is to discover wrapp er languages

as much as p ossible of the pro cessing to the source, thereby

that can b e learned with a small numb er of examples. The

hop efully reducing the amount of lo cal pro cessing and the

rst formulation of wrapp er construction as inductive learn-

amount of data transmitted over the network. The problem

ing and a set of algorithms for learning simple classes of

of describing the computing capabilities of data sources and

wrapp ers are given in [KDW97 ]. The algorithm describ ed

exploiting them to create query execution plans is consid-

in [AK97] exploits heuristics sp eci c to the common uses of

ered in [PGGMU95 , TRV98, LRU96, HKWY97 , VP97a ]

HTML in order to obtain faster learning. It should b e noted

that Machine Learning metho ds have also b een used to learn

Query optimization: Manyworks on web data integra-

the mapping b etween the source schemas and the mediated

tion systems have fo cused on the problem of selecting a min-

+

schemas [PE95, DEW97 ]. The work describ ed [CDF 98 ] is

imal set of web sources to access, and on determining the

a rst step in bridging the gap b etween the approaches of

minimal query that needs to b e sent to each one. However,

Machine Learning and of Natural Language Pro cessing to

the issue of cho osing an optimal query execution plan to ac-

the problem of wrapp er construction. Finally,we note that

cess the web sources has received relatively little attention

the emergence of XML may lead web site builders to exp ort

in the data integration literature [HKWY97], and remains

the data underlying their sites in a machine readable form,

an active area of research. The additional challenges that

thereby greatly simplifying the construction of wrapp ers.

are faced in query optimization over sources on the WWW

is that wehave few statistics on the data in the sources, and

Matching ob jects across sources: One of the hardest

hence little information to evaluate the cost of query exe-

problems in answering queries over a multitude of sources is

cution plans. The work in [NGT98] considers the problem

deciding that two ob jects mentioned in two di erent sources

of calibrating the cost mo del for query execution plans in

refer to the same entity in the world. This problem arises

this context. The work in [YPAGM98] discusses the prob-

b ecause each source employs its own naming conventions

lem of query optimization for fusion queries, which are a

and shorthands. Most systems deal with this problem using

sp ecial class of integration queries that fo cus on retrieving

domain sp eci c heuristics as in [FHM94]. In the WHIRL

various attributes of a given ob ject from multiple sources.

system [Coh98], matching of ob jects across sources is done

Moreover, we b elieve that query pro cessing in data integra-

by using techniques from Information Retrieval. Further-

tion systems is one area whichwould b ene t from ideas such

more, the matching of the ob jects is elegantly integrated in

as interleaving of planning and execution and of computing

anovel query execution algorithm.

conditional plans [GC94, KD98 ].

5 Web site construction and restructuring

Query execution engines: Even less attention has b een

paid to the problem of building query execution engines tar-

The previous two sections discussed tasks that concerned

geted for web data integration. The challenges in building

querying existing web sites and their content. However,

such engines are caused by the autonomy of the data sources

given the fact that web sites essentially provide access to

and the unpredictabil i ty of the p erformance of the network.

complex structures of information, it is natural to apply

In particular, when accessing web sources wemay exp erience

techniques from Database Systems to the pro cess of building

initial delays b efore data is transmitted, and even when it is,

and maintaining web sites. One can distinguish two general

the arrival of the data may b e bursty. The work describ ed

classes of web site building tasks: one in whichweb sites are

in [AFT98, UFA98] has considered the problem of adapting

created from a collection of underlying data sources, and

a query execution plans to initial delays in the arrival of the

another in which they are created by restructuring existing

data.

web sites. As it turns out, the same techniques are required

Wrapp er construction: Recall that the role of a wrapp er

for b oth of these classes. Furthermore, we note that the task

is to extract the data out of a web site into a form that can

of providing a web interface to data that exists in a single

b e manipulated by the data integration system. For exam-

database system [NS96] is a simple instance of the problem

3

ple, the task of a wrapp er could b e to p ose a query to a

of creating web sites.

web source using a form interface, and to extract a set of

To understand the problem of building web sites and the

answer tuples out of the resulting HTML page. The di-

p ossible imp ort of database technology, consider the tasks

culty in building wrapp ers is that the HTML page is usu-

that a web site builder must address: 1 cho osing and ac-

ally designed for human viewing, rather than for extracting

cessing the data that will b e displayed at the site, 2 design-

data by programs. Hence, the data is often emb edded in

ing the site's structure, i.e., sp ecifying the data contained

natural language text or hidden within graphical presenta-

within each page and the links b etween pages, and 3 de-

tion primitives. Moreover, the form of the HTML pages

signing the graphical presentation of pages. In existing web

changes frequently, making it hard to maintain the wrap-

site management to ols, these tasks are, for the most part, in-

p ers. Several works have considered the problem of build-

terdep endent. Without any site-creation to ols, a site builder

ing to ols for rapid creation of wrapp ers. One class of to ols

+

writes HTML les by hand or writes programs to pro duce

e.g., [HGMN 98 , GRVB98 ] is based on developing sp e-

them and must fo cus simultaneously on a page's content, its

cialized grammars for sp ecifying how the data is laid out in

relationship to other pages, and its graphical presentation.

an HTML page, and therefore how to extract the required

As a result, several imp ortant tasks, such as automatically

data. A second class of techniques is based on developing

inductive learning techniques for automatically learning a 3

Most database vendors were quick to provide commercial to ols

wrapp er. Using these algorithms, we provide the system for p erforming this task.

tegrity

up dating a site, restructuring a site, or enforcing in Browsable Web Site

constraints on a site's structure, are tedious to p erform.

Graphical eb sites as declaratively de ned structures: Sev- W HTML Generator

presentation

eral systems have b een develop ed with the goal of apply-

specification

hniques to the problem of web site cre-

ing database tec Logical representation of the web site

+ +

98, AMM98, AM98 , CDSS98 , PF98 , JB97, LSB 98, ation [FFK Declarative web site

TN98]. The common theme to these systems is that they structure specification

vide an explicit declarative representation of the struc-

pro Uniform view of the underlying data

ture of a web site. The structure of the web site is de ned

view over existing data. However, we emphasize that

as a Mediator

the languages used to create these views result in graphs of

web pages with hyp ertext links, rather than simple tables. The systems di er on the data mo del they use, the query

Wrappers

language they use, and whether they haveanintermediate

logical representation of the web site, rather than having

only a representation of the nal HTML. DB Structured HTML

Files Pages

Building a web site using a declarative representation of the

structure of the site has several signi cant advantages. Since

aweb site's structure and content are de ned declaratively

by a query, not pro cedurally by a program, it is easy to cre-

Figure 3: Architecture for Web Site Management Systems

ate multiple versions of a site. For example, it is p ossible

to easily build internal and external views of an organiza-

tion's site or to build sites tailored to di erent classes of

Some of the salientcharacteristics of the di erent systems

+

users. Currently, creating multiple versions requires writ-

are the following. Strudel [FFK 98] uses a semistructured

ing multiple sets of programs or manually creating di erent

data mo del of lab eled directed graphs for mo deling b oth the

sets of HTML les. Buildin g multiple versions of a site can

underlying data and for mo deling the web site. It uses a sin-

b e done by either writing di erent site de nition queries, or

gle query language, StruQL, throughout the system, b oth

bychanging the graphical representation indep endently of

for integrating the raw data and for de ning the structure of

the underlying structure. Furthermore, a declarative rep-

the web site. Araneus [AMM97b ] uses a more structured

resentation of the web site's structure also supp orts easy

data mo del, ADM, and provides a language for transform-

evolution of a web site's structure. For example, to reorga-

ing data into ADM and a language for creating web sites

nize pages based on frequent usage patterns [PE97], or to

from data mo deled in ADM. In addition, Araneus uses web-

extend the site's content, simply rewrite the site-de nition

sp eci c constructs in the data mo del. The Autoweb [PF98 ]

query, as opp osed to rewriting a set of programs or a set of

system is based on the hyp ermedia design mo del HDM, a

HTML les. Declarative sp eci cation of web sites can o er

design to ol for hyp ermedia applications . Its data mo del is

other advantages. For example, it b ecomes p ossible to ex-

based on the entity-relationshi p mo del; the \access schema"

press and enforce integrity constraints on the site [FFLS98],

sp eci es how the hyp erbase is navigated and accessed in

and to up date a site incrementally when changes o ccur in

a browsable site; and the \presentation schema" sp eci es

the underlying data. Moreover, a declarative sp eci cation

how ob jects and paths in the hyp erbase and access schemas

provides a platform for developing optimization algorithms

are rendered. All three systems mentioned ab ove provide a

for run-time management of data intensiveweb sites. The

clear separation b etween the creation of the logical struc-

challenge in run-time managementofaweb site is to auto-

ture of the web site and the sp eci cation of the graphical

matically nd an optimal tradeo b etween precomputation

presentation of the site. The YAT system [CDSS98 ] is an

of parts of the web site and click-time computation. Finally,

application of a data conversion language to the problem of

we remark that buildin g web sites using this paradigm will

building web sites. Using YAT, a web site designer writes

also facilitate the tasks of querying web sites and integrating

a set of rules converting from the raw data into an abstract

data from multiple web sources.

syntax tree of the resulting HTML, without going through

an intermediate logical representation phase. In fact, in

A prototypical architecture of such a system is shown in

a similar way, other languages for data conversation such

Figure 3. At the b ottom level, the system accesses a set of

+

as [MZ98, MPP 93 , PMSL94 ] can also b e used to build

data sources containing the data that will b e served on the

web sites. The WIRM system [JB97] is similar in spirit to

web site. The data may b e stored in databases, in structured

the ab ove systems in that it enables users to build web sites

les, or in existing web sites. The data is represented in

in which in which the pages can b e viewed context-sensitive

the system in some data mo del, and the system provides

views of the underlying data. The ma jor fo cus of WIRM is

a uniform interface to these data sources using techniques

on integrating medical research data for the national Human

similar to the ones describ ed in the previous section. The

Brain Pro ject.

main step in buildin g a web site is to write an expression that

declaratively represents the structure of the web site. The

expression is written in a sp eci c query language provided

6 Conclusions, Persp ectives and Future Work

by the system. The result of applying this query to the

underlying data is the logical representation of the web site

An overarching question regarding the topic of this survey

in the data mo del of the system e.g., a lab eled directed

is whether the World-Wide Web presents novel problems

graph. Finally, to actually create a browsable web site,

to the database community. In manyways, the WWW is

the system contains a metho d e.g., HTML templates for

not similar to a database. For example, there is no uni-

translating the logical structure into a set of HTML les.

form structure, no integrity constraints, no transactions, no

Acknowledgements standard query language or data mo del. And yet, as the

survey has showed, the p owerful abstractions develop ed in

The authors are grateful to the following colleagues who pro- the database communitymay provetobekey in taming the

vided suggestions and comments on earlier versions of this web's complexity and providing valuable services.

pap er: Serge Abiteb oul, Gustavo Aro cena, Paolo Atzeni,

Of particular imp ortance is the view of a large web site as

Jos e Blakeley, Nick Kushmerick, Bertram Ludascher, C.

b eing not just a database, but an information system built

Mohan, Yannis Papakonstantino u, Oded Shmueli, Anthony

around one or more databases with an accompanying com-

Tomasic and Dan Weld.

plex navigation structure. In that view, a web site has many

similariti es to non-web information systems. Designing such

References

aweb site requires extending information systems design

metho dologies [AMM98, PF98]. Using these principles to

+

[AAB 98] Jos Luis Ambite, Naveen Ashish, Greg Barish,

build web sites will also impact the waywe query the web

Craig A. Knoblo ck, Steven Minton, Pragnesh J.

and the wayweintegrate data from multiple web sources.

Mo di, Ion Muslea, Andrew Philp ot, and Sheila

Several trends will have signi cant impact on the use of

Tejada. ARIADNE: A system for constructing

database technology for web applications . The rst is, of

mediators for internet sources system demon-

course, XML. The considerable momentum b ehind XML

stration. In Proc. of ACM SIGMOD Conf. on

and related metadata initiatives can only help the appli-

Management of Data, Seattle, WA, 1998.

cability of database concepts to the web by providing the

[Abi97] Serge Abiteb oul. Querying semi-structured

much needed structure in a widely accepted format. While

data. In Proc. of the Int. Conf. on Database

the availabili ty of data in XML format will reduce the need

Theory ICDT, Delphi, Greece, 1997.

to fo cus on wrapp ers converting human readable data to ma-

chine readable data, the challenges of semantic integration

[ACM93] Serge Abiteb oul, Sophie Cluet, and Tova Milo.

of data from web sources still remains. Building on our ex-

Querying and up dating the le. In Proc. of the

p erience in developing metho ds for manipulatin g semistruc-

Int. Conf. on Very Large Data Bases VLDB,

tured data, our community is in a unique p osition to develop

Dublin, Ireland, 1993.

to ols for manipulati ng data in XML format. In fact, some

of the concepts develop ed in this community are already

[ACPS96] S. Adali, K. Candan, Y. Papakonstantinou , and

+

b eing adapted to the XML context [DFF 98 , GMW98 ].

V.S. Subrahmanian . Query caching and opti-

Other pro jects under way in the database communityinthe

mization in distributed mediator systems. In

area of metadata architectures and languages e.g. [MRT98,

Proc. of ACM SIGMOD Conf. on Management

KMSS98 ] are likely to take advantage of and merge with

of Data, Montreal, Canada, 1996.

the XML framework.

[AD98] S. Abiteb oul and O. Duschka. Complexityof

A second trend that will a ect the applicabi li ty of database

answering queries using materialized views. In

techniques for querying the web is the growth of the so-called

Proc. of the ACM SIGACT-SIGMOD-SIGART

hidden web. The hidden web refers to the web pages that

Symposium on Principles of Database Systems

are generated by programs given user inputs, and are there-

PODS, Seattle, WA, 1998.

fore not accessible to web crawlers for indexing. A recent

article [LG98] claims that close to 80 of the web is already

[AFT98] Laurent Amsaleg, Michael Franklin, and An-

in the hidden web. If our to ols are to b e able to b ene t

thony Tomasic. Dynamic query op erator

from data in the hidden web, wemust develop techniques

scheduling for wide-area remote access. Dis-

for identifying sites that generate web pages, classify them

tributed and Paral lel Databases, 63:217{246,

and automatically create query interfaces to them.

July 1998.

There is no shortage in p ossible directions for future research

[AI I98] Proceedings of the AAAI Workshop on Intel li-

in this area. In the past, the bulk of the work has fo cused on

gent Data Integration, Madison, Wisconsin, July

the logical level, developing appropriate data mo dels, query

1998.

languages and metho ds for describing di erent asp ects of

web sources. In contrast, problems of query optimization

[AK97] Naveen Ashish and Craig A. Knoblo ck. Wrapp er

and query execution have received relatively little attention

generation for semi-structured internet sources.

in the database community, and p ose some of the more im-

SIGMOD Record, 264:8{15, 1997.

p ortantchallenges for future work. Some of the imp ortant

[AKS96] Yigal Arens, Craig A. Knoblo ck, and Wei-Min

directions in which to enrich our data mo dels and query lan-

Shen. Query reformulation for dynamic informa-

guages include the incorp oration of various forms of meta

tion integration. International Journal on Intel-

data ab out sources e.g., probabilisti c information and the

ligent and Cooperative Information Systems, 6

principled combination of querying structured and unstruc-

2/3:99{130, June 1996.

tured data sources on the WWW.

Finally, in this article we tried to provide a representative

[AM98] Gustavo Aro cena and Alb erto Mendelzon. We-

list of references on the topic of web and databases. In

bOQL: Restructuring do cuments, databases and

addition to these references, readers can get a more detailed

webs. In Proc. of Int. Conf. on Data Engineering

account from recentworkshops related to the topic of the

ICDE, Orlando, Florida, 1998.

survey [SSD97 , Web98 , AI I98].

[AMM97a] Gustavo O. Aro cena, Alb erto O. Mendelzon,

and George A. Mihaila. Applications of a web

query language. In Proc. of the Int. WWW

Conf., April 1997.

suite for harnessing web data. In Proceedings [AMM97b] Paolo Atzeni, Giansalvatore Mecca, and Paolo

of the International Workshop on the Web and Merialdo. Toweave the web. In Proc. of the

Databases,Valencia, Spain, 1998. Int. Conf. on Very Large Data Bases VLDB,

1997.

[BH98] Krishna Bharat and Monika Henzinger. Im-

proved algorithms for topic distillatio n in hyp er- [AMM98] Paolo Atzeni, Giansalvatore Mecca, and Paolo

linked environments. In Proc. 21st Int'l ACM Merialdo. Design and maintenance of data-

SIGIR Conference, 1998. intensiveweb sites. In Proc. of the Conf. on Ex-

tending Database Technology EDBT,Valencia,

[Bil98] David Billard. Transactional services for the

Spain, 1998.

web. In Proceedings of the International Work-

+

shop on the Web and Databases, pages 11{17, [AMR 98] Serge Abiteb oul, Jason McHugh, Michael Rys,

Valencia, Spain, 1998. Vasilis Vassalos, and Janet Weiner. Incre-

mental maintenance for materialized views over

[BK90] C. Beeri and Y. Kornatzky. A logical query lan-

semistructured data. In Proc. of the Int. Conf.

guage for hyp ertext systems. In Proc. of the

on Very Large Data Bases VLDB, New York

European Conference on Hypertext, pages 67{80.

City, USA, 1998.

Cambridge University Press, 1990.

+

[AQM 97] Serge Abiteb oul, Dallan Quass, Jason McHugh,

[Bla96] Jose A. Blakeley. Data access for the masses

Jennifer Widom, and Janet Wiener. The Lorel

through OLE DB. In Proc. of ACM SIGMOD

query language for semistructured data. Inter-

Conf. on Management of Data, pages 161{172,

national Journal on Digital Libraries, 11:68{

Montreal, Canada, 1996.

88, April 1997.

[BP98] Sergey Brin and Lawrence Page. The anatomy

[Aro97] Gustavo Aro cena. WebOQL: Exploiting do cu-

of a large-scale hyp ertextual web search engine.

ment structure in web queries. Master's thesis,

In Proc. of the Int. WWW Conf., April 1998.

UniversityofToronto, 1997.

[BT98] Philipp e Bonnet and AnthonyTomasic. Partial

[AV97a] S. Abiteb oul and V. Vianu. Queries and compu-

answers for unavailabl e data sources. In Proceed-

tation on the Web. In Proc. of the Int. Conf. on

ings of the 1998 Workshop on Flexible Query-

Database Theory ICDT, Delphi, Greece, 1997.

Answering Systems FQAS'98, pages 43{54.

Department of , Roskilde Uni- [AV97b] Serge Abiteb oul and Victor Vianu. Regular

versity, 1998. path queries with constraints. In Proc. of the

ACM SIGACT-SIGMOD-SIGART Symposium

[Bun97] . Semistructured data. In

on Principles of Database Systems PODS,

Proc. of the ACM SIGACT-SIGMOD-SIGART

Tucson, Arizona, May 1997.

Symposium on Principles of Database Systems

+

PODS, pages 117{121, Tucson, Arizona, 1997. [BBH 98] Krisha Bharat, Andrei Bro der, Monika Hen-

zinger, Puneet Kumar, and Suresh Venkatasub-

[CACS94] V. Christophides, S. Abiteb oul, S. Cluet, and

ramanian. The connectivity server: fast access

M. Scholl. From structured do cuments to novel

to linkage information on the web. In Proc. of

query facilities. In Proc. of ACM SIGMOD

the Int. WWW Conf., April 1998.

Conf. on Management of Data, pages 313{324,

Minneap oli s, Minnesota, 1994. [BBMR89] Alex Borgida, Ronald Brachman, Deb orah

McGuinness, and Lori Resnick. CLASSIC: A

+

[CDF 98] Mark Craven, Dan DiPasquo, Dayne Fre-

structural data mo del for ob jects. In Proc. of

itag, Andrew McCallum, Tom Mitchell, Kamal

ACM SIGMOD Conf. on Management of Data,

Nigam, and Sean Slattery. Learning to extract

pages 59{67, Portland, Oregon, 1989.

symb olic knowledge from the world-wide web.

In Proceedings of the AAAI Fifteenth National [BDFS97] Peter Buneman, Susan Davidson, Mary Fernan-

ConferenceonArti cial Intel ligence, 1998. dez, and . Adding structure to un-

structured data. In Proc. of the Int. Conf. on

[CDRR98] Soumen Chakrabarti, Byron Dom, Prabhakar

Database Theory ICDT, Delphi, Greece, 1997.

Raghavan, and Sridhar Ra jagopalan. Auto-

+

matic resource compilation by analyzing hyp er- [BDH 95] C. M. Bowman, Peter B. Danzig, Darren R.

link structure and asso ciated text. In Proc. of Hardy, Udi Manb er, and Michael F. Schwartz.

the Int. WWW Conf., April 1998. The harvest information discovery and access

system. Computer Networks and ISDN Systems,

[CDSS98] Sophie Cluet, Claude Delob el, Jerome Simeon,

281-2:119{125, Decemb er 1995.

and Katarzyna Smaga. Your mediators need

data conversion. In Proc. of ACM SIGMOD [BDHS96] P. Buneman, S. Davidson, G. Hillebrand, and

Conf. on Management of Data, Seattle, WA, D. Suciu. A query language and optimization

1998. techniques for unstructured data. In Proc. of

ACM SIGMOD Conf. on Management of Data,

[CGL98] Diego Calvanese, Giusepp e De Giacomo, and

pages 505{516, Montreal, Canada, 1996.

Maurizio Lenzerini. What can knowledge repre-

+

[BEM 98] C. Beeri, G. Elb er, T. Milo, Y. Sagiv, sentation do for semi-structured data? In Pro-

O.Shmueli, N.Tishby, Y.Kogan, D.Konopnicki, ceedings of the AAAI Fifteenth National Confer-

P. Mogilevski, and N.Slonim. Websuite-a to ol enceonArti cial Intel ligence, 1998.

[CGMP98] Jungho o Cho, Hector Garcia-Molina, and [DG97b] Oliver M. Duschka and Michael R. Genesereth.

Lawrence Page. Ecient crawling through url Query planning in infomaster. In Proceedings

ordering. In Proc. of the Int. WWW Conf., April of the ACM Symposium on Applied Computing,

1998. San Jose, CA, 1997.

[Dus97] Oliver Duschka. Query optimization using lo cal

[CK98] J. Carriere and R. Kazman. Web query: Search-

completeness. In Proceedings of the AAAI Four-

ing and visualizi ng the web through connectiv-

teenth National ConferenceonArti cial Intel li-

ity.InProc. of the Int. WWW Conf., April 1998.

gence, 1997.

[CKPS95] Sura jit Chaudhuri, Ravi Krishnamurthy,Spy-

[EGW94] Oren Etzioni, Keith Golden, and Daniel Weld.

ros Potamianos, and Kyuseok Shim. Optimiz-

Tractable closed world reasoning with up dates.

ing queries with materialized views. In Proc. of

In Proceedings of the Conference on Princi-

Int. Conf. on Data Engineering ICDE,Taip ei,

ples of Know ledge Representation and Reason-

Taiwan, 1995.

ing, KR-94., 1994. Extended version to app ear

[CM89] Mariano P. Consens and Alb erto O. Mendel-

in Arti cial Intel ligence.

zon. Expressing structural hyp ertext queries in

[EW94] Oren Etzioni and Dan Weld. A softb ot-based

graphlog. In Proc. 2nd. ACM Conferenceon

interface to the internet. CACM, 377:72{76,

Hypertext, pages 269{292, Pittsburgh, Novem-

1994.

b er 1989.

+

[FFK 98] Mary Fernandez, Daniela Florescu, Jaewoo

[CM90] Mariano Consens and Alb erto Mendelzon.

Kang, Alon Levy, and Dan Suciu. Catching the

GraphLog: a visual formalism for real life recur-

b oat with Strudel: Exp eriences with a web-site

sion. In Proc. of the ACM SIGACT-SIGMOD-

management system. In Proc. of ACM SIG-

SIGART Symposium on Principles of Database

MOD Conf. on Management of Data, Seattle,

Systems PODS, pages 404{416, Atlantic City,

WA, 1998.

NJ, 1990.

[FFLS97] Mary Fernandez, Daniela Florescu, Alon Levy,

[CMW87] Isab el F. Cruz, Alb erto O. Mendelzon, and Pe-

and Dan Suciu. A query language for a web-site

ter T. Wo o d. A graphical query language sup-

management system. SIGMOD Record, 263:4{

p orting recursion. In Proc. of ACM SIGMOD

11, Septemb er 1997.

Conf. on Management of Data, pages 323{330,

San Francisco, CA, 1987. [FFLS98] Mary Fernandez, Daniela Florescu, Alon Levy,

and Dan Suciu. Reasoning ab out web-sites. In

[CMW88] I.F. Cruz, A.O. Mendelzon, and P.T. Wood. G+:

Working notes of the AAAI-98 Workshop on Ar-

Recursive queries without recursion. In Proceed-

ti cial Intel ligence and Data Integration. Amer-

ings of the Second International Conferenceon

ican Association of Arti cial Intel ligence., 1998.

Exper t Database Systems, pages 355{368, 1988.

[FHM94] Douglas Fang, Joachim Hammer, and Dennis

[Coh98] William Cohen. Integration of heteroge-

McLeo d. The identi cation and resolution of se-

neous databases without common domains using

mantic heterogeneityinmultidatabase systems.

queries based on textual similarity.InProc. of

In Multidatabase Systems: AnAdvanced Solu-

ACM SIGMOD Conf. on Management of Data,

tion for Global Information Sharing, pages 52{

Seattle, WA, 1998.

60. IEEE Computer So ciety Press, Los Alami-

tos, California, 1994.

[CS98] Pei Cao and Sekhar Sarukkai, editors. Proceed-

ings of Workshop on Inter-

[FKL97] Daniela Florescu, Daphne Koller, and Alon

net Server Performance, Madison, Wisconsin,

Levy. Using probabili sti c information in data

1998. http://www.cs.wisc.edu/ cao/WISP98-

integration. In Proc. of the Int. Conf. on

program.html.

Very Large Data Bases VLDB, pages 216{225,

Athens, Greece, 1997.

[DEW97] B. Do orenb os, O. Etzioni, and D. Weld. Scalable

[FLS98] Daniela Florescu, Alon Levy, and Dan Su-

comparison-shoppi ng agent for the world-wide

ciu. Query containment for conjunctive queries

web. In Proceedings of the International Con-

with regular expressions. In Proc. of the

ferenceonAutonomous Agents,February 1997.

ACM SIGACT-SIGMOD-SIGART Symposium

+

[DFF 98] Alin Deutsch,

on Principles of Database Systems PODS,

Mary Fernandez, Daniela Florescu, Alon Levy,

Seattle,WA, 1998.

and Dan Suciu. A query language for XML.

[FRV96] Daniela Florescu, Louiqa Raschid, and Patrick

http://www.research.att.com/ m /xml/w3c-

Valduriez. A metho dology for query reformu-

note.html, 1998.

lation in cis using semantic knowledge. Int.

[DG97a] Oliver M. Duschka and Michael R. Genesereth.

Journal of Intel ligent & Cooperative Informa-

Answering recursive queries using views. In

tion Systems, special issue on Formal Methods

Proc. of the ACM SIGACT-SIGMOD-SIGART

in Cooperative Information Systems, 54, 1996.

Symposium on Principles of Database Systems

[FW97] M. Friedman and D. Weld. Ecient execution of

PODS,Tucson, Arizona., 1997.

information gathering plans. In Proceedings of

the International Joint ConferenceonArti cial

Intel ligence, Nagoya, Japan, 1997.

[GC94] G. Graefe and R. Cole. Optimization of dy- [HLLS97] Rainer Himmeroder, Georg Lausen, Bertram

namic query evaluation plans. In Proc. of ACM Ludascher, and Christian Schlepphorst. On a

SIGMOD Conf. on Management of Data, Min- declarative semantics for web queries. In Proc. of

neap olis, Minnesota, 1994. the Int. Conf. on Deductive and Object-Oriented

Databases DOOD, pages 386{398, Singap ore,

[GGMT99] Luis Gravano, Hector Garcia-Molina, and An-

Decemb er 1997. Springer.

thonyTomasic. GlOSS:Text-source discov-

+

ery over the internet. ACM Transactions on

[HML 98] Kyo ji Hirata, Sougata Mukherjea, Wen-Syan Li,

Database Systems to appear, 1999.

Yusaku Okamura, and Yoshinori Hara. Facilitat-

ing ob ject-based navigation through multimedia

+

[GMPQ 97] H. Garcia-Molin a, Y. Papakonstantinou ,

web databases. TAPOS, 44, 1998. In press.

D. Quass, A. Ra jaraman, Y. Sagiv, J. Ullman,

and J. Widom. The TSIMMIS pro ject: Integra-

[Hul97] Richard Hull. Managing semantic heterogene-

tion of heterogenous information sources, March

ity in databases: A theoretical p ersp ective. In

1997.

Proc. of the ACM SIGACT-SIGMOD-SIGART

Symposium on Principles of Database Systems

[GMW98] R. Goldman, J. McHugh, and J. Widom. Lore:

PODS, pages 51{61, Tucson, Arizona, 1997.

A database management system for XML. Pre-

sentation, Stanford University Database Group,

[HZ96] Richard Hull and Gang Zhou. A framework for

1998.

supp orting data integration using the material-

ized and virtual approaches. In Proc. of ACM

[GRC97] S. Gadde, M. Rabinovich, and J. Chase. Reduce,

SIGMOD Conf. on Management of Data, pages

reuse, recycle: An approach to building large

481{492, Montreal, Canada, 1996.

internet caches. In Proceedings of the Workshop

on Hot Topics in Operating Systems HotOS,

[Jan98] http://www.jango.excite.com, 1998.

May 1997.

[JB97] R. Jakob ovits and J. F. Brinkley. Managing

[GRVB98] Jean-Rob ert Gruser, Louiqa Raschid, Mar a Es-

medical research data with a web-interfacing

ther Vidal, and Laura Bright. Wrapp er genera-

rep ository manager. In American Medical Infor-

tion for web accessible data sources. In Proceed-

matics AssociationFal l Symposium, pages 454{

ings of the CoopIS, 1998.

458, Nashville, Oct 1997.

[GW97] Roy Goldman and Jennifer Widom. Dataguides:

[Jun98] http://www.junglee.com, 1998.

Enabling query formulation and optimization

in semistructured databases. In Proc. of the

[KD98] Navin Kabra and David J. DeWitt. Ecient

Int. Conf. on Very Large Data Bases VLDB,

mid-query re-optimization of sub-optimal query

Athens, Greece, 1997.

execution plans. In Proc. of ACM SIGMOD

Conf. on Management of Data, pages 106{117,

[GW98] Roy Goldman and Jennifer Widom. Interactive

Seattle, WA, 1998.

query and search in semistructured databases.

In Proceedings of the International Workshop on

[KDW97] N. Kushmerick, R. Do orenb os, and D. Weld.

the Web and Databases, pages 42{48, Valencia,

Wrapp er induction for information extraction.

Spain, 1998.

In Proceedings of the 15th International Joint

ConferenceonArti cial Intel ligence, 1997.

[GZC89] Ralf Hartmut Guting,  Rob erto Zicari, and

David M. Choy. An algebra for structured oce

[Kle98] Jon Kleinb erg. Authoritative sources in a hyp er-

do cuments. ACM TOIS, 72:123{157, 1989.

linked environment. In Proceedings of 9th ACM-

SIAM Symposium on Discrete Algorithms, 1998.

[Hal88] Frank G. Halasz. Re ections on Notecards:

Seven issues for the next generation of hyp erme-

[KLW95] M. Kifer, G. Lausen, and J. Wu. Logical foun-

dia systems. Comm. of the ACM, 317:836{852,

dations of ob ject-oriented and frame-based lan-

1988.

guages. J. ACM, 424:741{843, 1995.

[Har94] Coleman Harrison. Aql: An adaptive query lan-

[KMSS98] Y. Kogan, D. Michaeli, Y. Sagiv, and

guage. Technical Rep ort NU-CCS-94-19, North-

O. Shmueli. Utilizing the multiple facets of

eastern University, Octob er 1994. Master's The-

www contents. Data and Know ledge Engineer-

sis.

ing, 1998. In press.

+

[HGMN 98] Joachim Hammer, Hec-

[KS95] D. Konopnicki and O. Shmueli. W3QS: A query

tor Garcia-Molina, Svetlozar Nestorov, Ramana

system for the World Wide Web. In Proc. of the

Yerneni, Markus M. Breunig, and Vasilis Vas-

Int. Conf. on Very Large Data Bases VLDB,

salos. Template-based wrapp ers in the TSIM-

pages 54{65, Zurich, Switzerland, 1995.

MIS system system demonstration. In Proc. of

ACM SIGMOD Conf. on Management of Data,

[KS98] David Konopnicki and Oded Shmueli. Bringing

Tucson, Arizona, 1998.

database functionality to the WWW. In Pro-

ceedings of the International Workshop on the

[HKWY97] Laura Haas, Donald Kossmann, Edward Wim-

Web and Databases, Valencia, Spain, pages 49{

mers, and Jun Yang. Optimizing queries across

55, 1998.

diverse data sources. In Proc. of the Int. Conf.

on Very Large Data Bases VLDB,Athens, Greece, 1997.

[KW96] Chung T. Kwok and Daniel S. Weld. Planning to [MMM98] Giansalvatore Mecca, Alb erto O. Mendelzon,

gather information. In Proceedings of the AAAI and Paolo Merialdo. Ecient queries over web

Thirteenth National ConferenceonArti cial In- views. In Proc. of the Conf. on Extending

tel ligence, 1996. Database Technology EDBT,Valencia, Spain,

1998.

[Lev96] Alon Y. Levy. Obtaining complete answers from

incomplete databases. In Proc. of the Int. Conf. [Mot89] Amihai Motro. Integrity=validity + complete-

on Very Large Data Bases VLDB, Bombay, In- ness. ACM Transactions on Database Systems,

dia, 1996. 144:480{502, Decemb er 1989.

[LG98] S. Lawrence and C.L. Giles. Searching the world [MP96] Kenneth Mo ore and Michelle Peterson. A group-

wide web. Science, 2804:98{100, 1998. ware b enchmark based on lotus notes. In Proc.

of Int. Conf. on Data Engineering ICDE, 1996.

+

[LHL 98] Bertram Ludascher, Rainer Himmeroder, Georg

+

Lausen, Wolfgang May, and Christian Schlep- [MPP 93] Bernhard Mitschang, Hamid Pirahesh, Peter

phorst. Managing semistructured data with Pistor, Bruce G. Lindsay, and Norb ert Sdkamp.

FLORID : A deductive ob ject-oriented p ersp ec- Sql/xnf - pro cessing comp osite ob jects as ab-

tive. Information Systems, 238, 1998. to ap- stractions over relational data. In Proc. of Int.

p ear. Conf. on Data Engineering ICDE, pages 272{

282, Vienna, Austria, 1993.

[LMSS95] Alon Y. Levy, Alb erto O. Mendelzon, Yehoshua

Sagiv, and Divesh Srivastava. Answering queries [MRT98] George A. Mihaila, Louiqa Raschid, and An-

using views. In Proc. of the ACM SIGACT- thonyTomasic. Equal time for data on the in-

SIGMOD-SIGART Symposium on Principles of ternet with websemantics. In Proc. of the Conf.

Database Systems PODS, San Jose, CA, 1995. on Extending Database Technology EDBT,Va-

lencia, Spain, 1998.

[LRO96] Alon Y. Levy, Anand Ra jaraman, and Joann J.

Ordille. Querying heterogeneous information [MZ98] Tova Milo and Sagit Zohar. Using schema

sources using source descriptions. In Proc. of the matching to simplify heterogeneous data trans-

Int. Conf. on Very Large Data Bases VLDB, lation. In Proc. of the Int. Conf. on Very Large

Bombay, India, 1996. Data Bases VLDB, New York City, USA,

1998.

[LRU96] Alon Y. Levy, Anand Ra jaraman, and Je rey D.

Ullman. Answering queries using limited exter- [NAM98] Svetlozar Nestorov, Serge Abiteb oul, and Ra-

nal pro cessors. In Proc. of the ACM SIGACT- jeev Motwani. Extracting schema from

SIGMOD-SIGART Symposium on Principles of semistructured data. In Proc. of ACM SIG-

Database Systems PODS, Montreal, Canada, MOD Conf. on Management of Data, Seattle,

1996. WA, 1998.

+

[LSB 98] T. Lee, L. Sheng, T. Bozkaya, N.H. Balkir, Z.M. [NGT98] Hub ert Naacke, Georges Gardarin, and Anthony

Ozsoyoglu, and G. Ozsoyoglu. Querying multi- Tomasic:. Leveraging mediator cost mo dels with

media presentations based on content. to appear heterogeneous data sources. In Proc. of Int.

in IEEE Trans. on Know. and Data Engineer- Conf. on Data Engineering ICDE, Orlando,

ing, 1998. Florida, 1998.

[LSCH98] Wen-Syan Li, Junho Shim, K. Selcuk Canadan, [NS96] Tam Nguyen and V. Srinivasan. Accessing re-

and Yoshinori Hara. WebDB: A web query sys- lational databases from the world wide web. In

tem and its mo deling, language, and implemen- Proc. of ACM SIGMOD Conf. on Management

tation. In Proc. IEEE Advances in Digital Li- of Data, Montreal, Canada, 1996.

braries '98, 1998.

[PAGM96] Y. Papakonstantinou, S. Abite-

[LSS96] Laks V. S. Lakshmanan, Fereido on Sadri, and b oul, and H. Garcia-Molina. Ob ject fusion in

Iyer N. Subramanian. A declarative language for mediator systems. In Proc. of the Int. Conf. on

querying and restructuring the Web. In Proc. of Very Large Data Bases VLDB, Bombay, India,

6th. International Workshop on Research Issues 1996.

in Data Engineering, RIDE '96, New Orleans,

+

[PdBA 92] Jan Paredaens, Jan Van den Bussche, Marc

February 1996.

Andries, Marc Gemis a nd Marc Gyssens, Inge

[MM97] Alb erto O. Mendelzon and Tova Milo. For- Thyssens, Dirk Van Gucht, Vijay Sarathy, and

mal mo dels of web queries. In Proc. of the Lawre nce V. Saxton. An overview of GOOD.

ACM SIGACT-SIGMOD-SIGART Symposium SIGMOD Record, 211:25{31, 1992.

on Principles of Database Systems PODS,

[PE95] MikePerkowitz and Oren Etzioni. Category

pages 134{143, Tucson, Arizona, May 1997.

translation: Learning to understand informa-

[MMM97] A. Mendelzon, G. Mihaila, and T. Milo. Query- tion on the internet. In Working Notes of the

ing the world wide web. International Journal AAAI Spring Symposium on Information Gath-

on Digital Libraries, 11:54{67, April 1997. ering from Heterogeneous Distributed Environ-

ments. American Asso ciation for Arti cial In- telligence., 1995.

[UFA98] Tolga Urhan, Michael J. Franklin, and Laurent [PE97] MikePerkowitz and Oren Etzioni. Adaptiveweb

Amsaleg. Cost based query scrambling for initial sites: an AI challenge. In Proceedings of the 15th

delays. In Proc. of ACM SIGMOD Conf. on International Joint ConferenceonArti cial In-

Management of Data, pages 130{141, Seattle, tel ligence, 1997.

WA, 1998.

[PF98] P.Paolini and P.Fraternali. A conceptual mo del

[Ull97] Je rey D. Ullman. Information integration us- and a to ol environment for developing more

scalable, dynamic, and customizable web ap-

ing logical views. In Proc. of the Int. Conf. on

Database Theory ICDT, Delphi, Greece, 1997. plications. In Proc. of the Conf. on Extending

Database Technology EDBT,Valencia, Spain,

[VP97a] Vasilis Vassalos and Yannis Papakonstantinou .

1998.

Describing and using the query capabilities of

[PGGMU95] Yannis Papakonstantinou, Ashish Gupta, Hec-

heterogeneous sources. In Proc. of the Int. Conf.

on Very Large Data Bases VLDB,Athens, tor Garcia-Molina, and Je rey Ullman. A query

translation scheme for rapid implementation of

Greece, 1997.

wrapp ers. In Proc. of the Int. Conf. on Deduc-

[VP97b] Anne-Marie Vercoustre and Francois Paradis.

tive and Object-Oriented Databases DOOD,

A descriptive language for information ob ject

1995.

reuse through virtual do cuments. In Proceedings

[PGMW95] Yannis Papakonstantinou , Hector Garcia-

of the 4th International Conference on Object-

Oriented Information Systems OOIS'97, Bris- Molina, and Jennifer Widom. Ob ject exchange

across heterogeneous information sources. In

bane, Australia, Novemb er 1997.

Proc. of Int. Conf. on Data Engineering

+

93] Darrel Wo elk, Paul Attie, Phil Cannata, Greg [WAC

ICDE, pages 251{260, Taip ei, Taiwan, 1995.

Meredith, Amit Seth, Munindar Sing, and

[PMSL94] Hamid Pirahesh, Bernhard Mitschang, Norb ert

Christine Tomlinson. Task scheduling using in-

tertask dep endencies in Carnot. In Proc. of Sdkamp, and Bruce G. Lindsay. Comp osite-

ob ject views in relational dbms: an implemen-

ACM SIGMOD Conf. on Management of Data,

pages 491{494, 1993. tation p ersp ective. Information Systems,IS

191:69{88, 1994.

+

[WBJ 95] Darrell Wo elk, Bill Bohrer, Nigel Jacobs,

K. Ong, Christine Tomlinson, and C. Unnikr- [PPR96] P. Pirolli, J. Pitkow, and R. Rao. Silk from a

sow's ear: Extracting usable structures from the

ishnan. Carnot and infosleuth: Database tech-

nology and the world wide web. In Proc. of ACM web. In Proc. ACM SIGCHI Conference, 1996.

SIGMOD Conf. on Management of Data, pages

[RSU95] Anand Ra jaraman, Yehoshua Sagiv, and Jef-

443{444, San Jose, CA, 1995.

frey D. Ullman. Answering queries using tem-

plates with binding patterns. In Proc. of the

[Web98] Proceedings of the International Workshop on

the Web and Databases,Valencia, Spain, March ACM SIGACT-SIGMOD-SIGART Symposium

on Principles of Database Systems PODS, San

1998.

http://p oincare.dia.un iroma 3.it:8080 /web db 98/pa p ers.html . Jose, CA, 1995.

[Sp e97] Ellen Sp ertus. ParaSite: Mining structural in-

[WWW98] Proceedings of 3d WWW Caching Workshop

formation on the web. In Proc. of the Int. WWW

Manchester, England, June 1998.

Conf., April 1997.

[XML98] http://w3c.org/XML/, 1998.

[SSD97] Proceedings of Workshop on Management of

[YL87] H. Z. Yang and P. A. Larson. Query transforma-

Semistructured Data,Tucson, Arizona, May

tion for PSJ-queries. In Proc. of the Int. Conf.

1997.

on Very Large Data Bases VLDB, pages 245{

[TMD92] J. Thierry-Mieg and R. Durbin. A C.elegans

254, Brighton, England, 1987.

database: syntactic de nitions for the ACEDB

[YPAGM98] Ramana Yerneni, Yannis Papakonstantinou ,

data base manager, 1992.

Serge Abiteb oul, and Hector Garcia-Molina . Fu-

[TN98] Motomichi Toyama and T. Nagafuji. Dynamic

sion queries over internet databases. In Proc.

and structured presentation of database con-

of the Conf. on Extending Database Technology

tents on the web. In Proc. of the Conf. on Ex-

EDBT, pages 57{71, Valencia, Spain, 1998.

tending Database Technology EDBT,Valencia,

[ZGM98] Yue Zhuge and Hector Garcia-Molina. Graph

Spain, 1998.

structured views and their incremental mainte-

[TRV98] A. Tomasic, L. Raschid, and P.Valduriez. Scal-

nance. In Proc. of Int. Conf. on Data Engineer-

ing access to distributed heterogeneous data

ing ICDE, Orlando, Florida, February 1998.

sources with Disco. IEEE Transactions On

Knowledge and Data Engineering to appear,

1998.

[TSI96] Odysseas G. Tsatalos, Marvin H. Solomon, and

Yannis E. Ioannidis. The GMAP: A versatile

to ol for physical data indep endence. VLDB

Journal, 52:101{118, 1996.