Database Techniques for the World-Wide Web: A Survey
Daniela Florescu, Inria Rocquencourt, dana@rodin.inria.fr
Alon Levy, Univ. of Washington, [email protected]
Alberto Mendelzon, Univ. of Toronto, [email protected]
1 Introduction

The popularity of the World-Wide Web (WWW) has made it a prime vehicle
for disseminating information. The relevance of database concepts to the
problems of managing and querying this information has led to a
significant body of recent research addressing these problems. Even
though the underlying challenge is the one that has been traditionally
addressed by the database community (how to manage large volumes of
data), the novel context of the WWW forces us to significantly extend
previous techniques. The primary goal of this survey is to classify the
different tasks to which database concepts have been applied, and to
emphasize the technical innovations that were required to do so.

We do not claim that database technology is the magic bullet that will
solve all web information management problems; other technologies, such
as Information Retrieval, Artificial Intelligence, and
Hypertext/Hypermedia, are likely to be just as important. However,
surveying all the work going on in these areas, and the interactions
between them and database ideas, would be far beyond our scope.

We focus on three classes of tasks related to information management on
the WWW.

Modeling and querying the web: Suppose we view the web as a directed
graph whose nodes are web pages and whose edges are the links between
pages. A first task we consider is that of formulating queries for
retrieving certain pages on the web. The queries can be based on the
content of the desired pages and on the link structure connecting the
pages. The simplest instance of this task, which is provided by search
engines on the web, is to locate pages based on the words they contain.
A simple generalization of such a query is to apply more complex
predicates on the contents of a page (e.g., find the pages that contain
the word "Clinton" next to a link to an image). Finally, as an example
of a query that involves the structure of the pages, consider the query
asking for all images reachable from the root of the CNN web site within
5 links. Queries of this last type are especially useful when detecting
violations of integrity constraints on a web site or a collection of web
sites.

Information extraction and integration: Certain web sites can be viewed
at a finer granularity level than pages, as containers of structured
data (e.g., sets of tuples, or sets of objects). For example, the
Internet Movie Database (http://www.imdb.com) can be viewed as a front
end interface to a database about movies. Given the rise in the number
of such sites, there are two tasks we consider. The first task is to
actually extract a structured representation of the data (e.g., a set of
tuples) from the HTML pages containing them. This task is performed by a
set of wrapper programs, whose creation and maintenance raises several
challenges. Once we view these sites as autonomous heterogeneous
databases, we can address the second task of posing queries that require
the integration of data. The second task is addressed by mediator (or
data integration) systems.

Web site construction and restructuring: A different aspect in which
database concepts and technology can be applied is that of building,
restructuring and managing web sites. In contrast to the previous two
classes, which apply to existing web sites, here we consider the process
of creating sites. Web sites can be constructed either by starting with
some raw data (stored in databases or structured files) or by
restructuring existing web sites. Performing this task requires methods
for modeling the structure of web sites and languages for restructuring
data to conform to a desired structure.

Before we begin, we note that there are several topics concerning the
application of database concepts to the WWW which are not covered in
this survey, such as caching and replication (see [WWW98, GRC97] for
recent works), transaction processing and security in web environments
(see e.g. [Bil98]), performance, availability and scalability issues for
web servers (e.g. [CS98]), or indexing techniques and crawler technology
(e.g. [CGMP98]). Furthermore, this is not meant to be a survey on
existing products even in the areas on which we do focus. Finally, there
are several tangential areas whose results are applicable to the systems
we discuss, but we do not cover them here. Examples of such fields
include systems for managing document collections and ranking of
documents (e.g., Harvest [BDH+95], Gloss [GGMT99]) and flexible query
answering systems [BT98]. Finally, the field of web/db is a very dynamic
one; hence, there are undoubtedly some omissions in our coverage, for
which we apologize in advance.
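
As a minimal sketch of the structural query mentioned in the first task (all images reachable from the root of a site within 5 links), the web can be held as an adjacency map and explored breadth-first with a depth bound. The page names, the `LINKS` map, the `is_image` test and the function name `images_within` are hypothetical and chosen only for illustration; nothing here crawls a real site.

```python
from collections import deque

# Hypothetical toy web: each page maps to the pages it links to.
LINKS = {
    "root.html": ["news.html", "logo.gif"],
    "news.html": ["story.html", "photo.jpg"],
    "story.html": ["deep.html"],
    "deep.html": ["far.png"],
}


def is_image(url):
    """Assumed convention: images are recognized by file extension."""
    return url.endswith((".gif", ".jpg", ".png"))


def images_within(root, max_links, links=LINKS):
    """Return the images reachable from root by following at most max_links links."""
    seen = {root}
    queue = deque([(root, 0)])
    found = []
    while queue:
        page, depth = queue.popleft()
        if is_image(page):
            found.append(page)
        if depth == max_links:
            continue  # do not follow links beyond the bound
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return sorted(found)
```

On this toy data, `images_within("root.html", 2)` finds only the two images at distance at most two, and `far.png` appears once the bound reaches 4. Breadth-first order matters here: each page is first reached along a shortest path, so the depth bound is applied to the true link distance.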
The survey is organized as follows. We begin in Section 2 by discussing
the main issues that arise in designing data models for web/db
applications. The following three sections consider each of the
aforementioned tasks. Section 6 concludes with perspectives and
directions for future research.

2 Data Representation for Web/DB Tasks

Building systems for solving any of the previous tasks requires that we
choose a method for modeling the underlying domain. In particular, in
these tasks, we need to model the web itself, structure of web sites,
internal structure of web pages, and finally, contents of web sites in
finer granularities. In this section we discuss the main distinguishing
factors of the data models used in web applications.

Graph data models: As noted above, several of the applications we
discuss require modeling the set of web pages and the links between
them. These pages can either be on several sites or within a single
site. Hence, a natural way to model this data is based on a labeled
graph data model. Specifically, in this model, nodes represent web pages
(or internal components of web pages), and arcs represent links between
pages. The labels on the arcs can be viewed as attribute names. Along
with the labeled graph model, several query languages have been
developed. One central feature that is common to these query languages
is the ability to formulate regular path expression queries over the
graph. Regular path expressions enable posing navigational queries over
the graph structure.

Semistructured data models: The second aspect of modeling data for web
applications is that in many cases the structure of the data is
irregular. Specifically, when modeling the structure of a web site, we
don't have a fixed schema which is given in advance. When modeling data
coming from multiple sources, the representation of some attributes
(e.g., addresses) may differ from source to source. Hence, several
projects have considered models of semistructured data. The initial
motivation for this work was the existence and relative success of
permissive data models such as [TMD92] in the scientific community, the
need for exchanging objects across heterogeneous sources [PGMW95], and
the task of managing document collections [MP96].

Broadly speaking, semistructured data refers to data with some of the
following characteristics:

- the schema is not given in advance and may be implicit in the data,

- the schema is relatively large (w.r.t. the size of the data) and may
  be changing frequently,

- the schema is descriptive rather than prescriptive, i.e., it describes
  the current state of the data, but violations of the schema are still
  tolerated,

- the data is not strongly typed, i.e., for different objects, the
  values of the same attribute may be of differing types.

Models for semistructured data have been based on labeled directed
graphs [Abi97, Bun97] (see footnote 1). In a semistructured data model,
there is no restriction on the set of arcs that emanate from a given
node in a graph, or on the types of the values of attributes. Because of
the characteristics of semistructured data mentioned above, an
additional feature that becomes important in this context is the ability
to query the schema (i.e., the labels on the arcs in the graph). This
feature is supported in languages for querying semistructured data by
arc variables which get bound to labels on arcs, rather than nodes in
the graph.

Footnote 1: It should be noted that there is no inherent difficulty in
translating these models into relational or object-oriented terms. In
fact, the languages underlying Description Logics (e.g., Classic
[BBMR89]) and FLORID [HLLS97] have some of the features mentioned above,
and are described in non-graph models.

In addition to developing models and query languages for semistructured
data, there has been considerable recent work on issues concerning the
management of semistructured data, such as the extraction of structure
from semistructured data [NAM98], view maintenance [ZGM98, AMR+98],
summarization of semistructured data [BDFS97, GW97], and reasoning about
semistructured data [CGL98, FFLS98]. Aside from the relevance of these
works to the tasks mentioned in this survey, the systems based on these
methods will be of special importance for the task of managing large
volumes of XML data [XML98].

Other characteristics of web data models: Another distinguishing
characteristic of models used in web/db applications is the presence of
web-specific constructs in the data representation. For example, some
models distinguish a unary relation identifying pages and a binary
relation for links between pages. Furthermore, we may distinguish
between links within a web site and external links. An important reason
to distinguish a link relation is that it can generally only be
traversed in the forward direction. Additional second order dimensions
along which the data models we discuss differ are (1) the ability to
model order among elements in the database, (2) modeling nested data
structures, and (3) support for collection types (sets, bags, arrays).
An example of a data model that incorporates explicit web-specific
constructs (pages and page schemes), nesting, and collection types is
ADM, the data model of the Araneus project [AMM97b]. We remark that all
the models we mention in this paper represent only static structures.
For example, the work on modeling the structure of web sites does not
consider dynamic web pages created as a result of user inputs.

An important aspect of languages for querying data in web applications
is the need to create complex structures as a result of a query. For
example, the result of a query in a web site management system is the
graph modeling the web site. Hence, a fundamental characteristic of many
of the languages we discuss in this paper is that their query
expressions contain a structuring component in addition to the
traditional data filtering component.

Table 1 summarizes some of the web query systems covered in this paper.
A more detailed version of this table,
http://www.cs.washington.edu/homes/alon/webdb.html, includes URLs for
the systems where available. In subsequent sections we will illustrate
in detail languages for querying data represented in these models.

System                      Data Model           Language Style        Path Expressions  Graph Creation
WebSQL [MMM97]              relational           SQL                   Yes               No
W3QS [KS95]                 labeled multigraphs  SQL                   Yes               No
WebLog [LSS96]              relational           Datalog               No                No
Lorel [AQM+97]              labeled graphs       OQL                   Yes               No
WebOQL [AM98]               hypertrees           OQL                   Yes               Yes
UnQL [BDHS96]               labeled graphs       structural recursion  Yes               Yes
Strudel [FFK+98, FFLS97]    labeled graphs       Datalog               Yes               Yes
Araneus (Ulixes) [AMM97b]   page schemes         SQL                   Yes               Yes
Florid [HLLS97]             F-logic              Datalog               Yes               No

Table 1: Comparison of query systems

3 Modeling and Querying the Web

If the web is viewed as a large, graph-like database, it is natural to
pose queries that go beyond the basic information retrieval paradigm
supported by today's search engines and take structure into account;
both the internal structure of web pages and the external structure of
links that interconnect them. In an often-cited paper on the limitations
of hypertext systems, Halasz says [Hal88]:
    Content search ignores the structure of a hypermedia network. In
    contrast, structure search specifically examines the hypermedia
    structure for subnetworks that match a given pattern.

and goes on to give examples where such queries are useful.

3.1 Structural Information Retrieval

The first tools developed for querying the web were the well-known
search engines which are now widely deployed and used. These are based
on searching indices of words and phrases appearing in documents
discovered by web "crawlers." More recently, there have been efforts to
overcome the limitations of this paradigm by exploiting link structure
in queries. For example, [Kle98], [BH98] and [CDRR98] propose to use the
web structure to analyze the many sites returned by a search engine as
relevant to a topic in order to extract those that are likely to be
authoritative sources on the topic. To support connectivity analysis for
this and other applications (such as efficient implementations of the
query languages described below), the Connectivity Server [BBH+98]
provides fast access to structural information. Google [BP98], a
prototype next-generation web search engine, makes heavy use of web
structure to improve crawling and indexing performance. Other methods
for exploiting link structure are presented in [PPR96, CK98]. In these
works, structural information is mostly used behind the scenes, to
improve the answers to purely content-oriented queries. Spertus [Spe97]
points out many useful applications of queries that take link structure
into account explicitly.

3.2 Related query paradigms

In this section we briefly describe several families of query languages
that were not developed specifically for querying the web. However,
since the concepts on which they are based are similar in spirit to the
web query languages we discuss, these languages can also be useful for
web applications.

Hypertext/document query languages: A number of models and languages for
querying structured documents and hypertexts were proposed in the
pre-web era. For example, Abiteboul et al. [ACM93] and Christophides et
al. [CACS94] map documents to object oriented database instances by
means of semantic actions attached to a grammar. Then the database
representation can be queried using the query language of the database.
A novel aspect of this approach is the possibility of querying the
structure by means of path variables. Güting et al. [GZC89] model
documents using nested ordered relations and use a generalization of
nested relational algebra as a query language. Beeri and Kornatzky
[BK90] propose a logic whose formulas specify patterns over the
hypertext graph.

Graph query languages: Work in using graphs to model databases,
motivated by applications such as software engineering and computer
network management, led to the G, G+ and GraphLog graph-based languages
[CMW87, CMW88, CM90]. In particular, G and G+ are based on labeled
graphs; they support regular path expressions and graph construction in
queries. GraphLog, whose semantics is based on Datalog, was applied to
Hypertext queries in [CM89]. Paredaens et al. [PdBA+92] developed a
graph query language for object-oriented databases.

Languages for querying semistructured data: Query languages for
semistructured data such as Lorel [AQM+97], UnQL [BDHS96] and StruQL
[FFLS97] also use labeled graphs as a flexible data model. In contrast
to graph query languages, they emphasize the ability to query the schema
of the data, and the ability to accommodate irregularities in the data,
such as missing or repeated fields, heterogeneous records. Related work
in the OO community [Har94] proposes "schema-shy" models and queries to
handle information about software engineering artifacts.

These languages were not developed specifically for the web, and do not
distinguish, for example, between graph edges that represent the
connection between a document and one of its parts and edges that
represent a hyperlink from one web document to another. Their data
models, while elegant, are not very rich, lacking such basic comforts as
records and ordered collections.

3.3 First generation web query languages

A first generation of web query languages aimed to combine the
content-based queries of search engines with structure-based queries
similar to what one would find in a database system. These languages,
which include W3QL [KS95], WebSQL [MMM97, AMM97a], and WebLog [LSS96],
combine conditions on text patterns appearing within documents with
graph patterns describing link structure. We use WebSQL as an example of
the kinds of queries that can be asked.

WebSQL

WebSQL proposes to model the web as a relational database composed of
two virtual relations: Document and Anchor. The Document relation has
one tuple for each document in the web and the Anchor relation has one
tuple for each anchor in each document in the web. This relational
abstraction of the web allows us to use a query language similar to SQL
to pose the queries.

If Document and Anchor were actual relations, we could simply use SQL to
write queries on them. But since the Document and Anchor relations are
completely virtual and there is no way to enumerate them, we cannot
operate on them directly. The WebSQL semantics depends instead on
materializing portions of them by specifying the documents of interest
in the FROM clause of a query. The basic way of materializing a portion
of the web is by navigating from known URL's. Path regular expressions
are used to describe this navigation. An atom of such a regular
expression can be of the form d1 => d2, meaning document d1 points to d2
and d2 is stored on a different server from d1; or d1 -> d2, meaning d1
points to d2 and d2 is stored on the same server as d1.

For example, suppose we want to find a list of triples of the form
(d1,d2,label), where d1 is a document stored on our local site, d2 is a
document stored somewhere else, and d1 points to d2 by a link labeled
label. Assume all our local documents are reachable from
www.mysite.start.

    SELECT d.url,e.url,a.label
    FROM Document d SUCH THAT
           "www.mysite.start" ->* d,
         Document e SUCH THAT d => e,
         Anchor a SUCH THAT a.base = d.url
    WHERE a.href = e.url

The FROM clause instantiates two Document variables, d and e, and one
Anchor variable a. The variable d is bound in turn to each local
document reachable from the starting document, and e is bound to each
non-local document reachable directly from d. The anchor variable a is
instantiated to each link that originates in document d; the extra
condition that the target of link a be document e is given in the WHERE
clause. Another way of materializing part of the Document and Anchor
relations is by content conditions: for example, if we were only
interested in documents that contain the string "database" we could have
added to the FROM clause the condition d MENTIONS "database". The
implementation uses search engines to generate candidate documents that
satisfy the MENTIONS conditions.

Other Languages

W3QL [KS95] is similar in flavour to WebSQL, with some notable
differences: it uses external programs (similar to user defined
functions in object-relational languages) for specifying content
conditions on files rather than building conditions into the language
syntax, and it provides mechanisms for handling forms encountered during
navigation. In [KS98], Konopnicki and Shmueli describe planned
extensions to move W3QL into what we call the second generation. These
include modeling internal document structure, hierarchical web modeling
that captures the notion of web site explicitly, and replacing the
external program method of specifying conditions with a general
extensible method based on the MIME standard.

WebLog [LSS96] differs from the above languages in using deductive rules
instead of SQL-like syntax (see the description of Florid below).

WQL, the query language of the WebDB project [LSCH98], is similar to
WebSQL but it supports more comprehensive SQL functionality such as
aggregation and grouping, and provides limited support for querying
intra-document structure, placing it closer to the class of languages
discussed in the next subsection.

3.4 Second generation: Web Data Manipulation Languages

The languages above treat web pages as atomic objects with two
properties: they contain or do not contain certain text patterns, and
they point to other objects. Experience with their use suggests there
are two main areas of application that they can be useful for: data
wrapping, transformation, and restructuring, as described in Section 4;
and web site construction and restructuring, as described in Section 5.
In both application areas, it is often essential to have access to the
internal structure of web pages from the query language, if we want
declarative queries to capture a large part of the task at hand. For
example, the task of extracting a set of tuples from the HTML pages of
the Internet Movie Database requires parsing the HTML and selectively
accessing certain subtrees in the parse tree.

In this section we describe the second generation of web query languages
that we call "Web data manipulation languages." These languages go
beyond the first generation languages in two significant ways. First,
they provide access to the structure of the web objects that they
manipulate. Unlike the first-generation languages, they model internal
structure of web documents as well as the external links that connect
them. They support references to model hyperlinks, and some support
ordered collections and records for more natural data representation.
Second, these languages provide the ability to create new complex
structures as a result of a query. Since the data on the web is commonly
semistructured (or worse), these languages still emphasize the ability
to support semistructured features. We briefly describe three languages
in this class: WebOQL [AM98], StruQL [FFLS97] and Florid [HLLS97].

WebOQL

The main data structure provided by WebOQL is the hypertree. Hypertrees
are ordered arc-labeled trees with two types of arcs, internal and
external. Internal arcs are used to represent structured objects and
external arcs are used to represent references (typically hyperlinks)
among objects. Arcs are labeled with records. Figure 1, from [Aro97],
shows a hypertree containing descriptions of publications from several
research groups. Such a tree could easily be built, for example, from an
HTML file, using a generic HTML wrapper.

    [Group: Databases]
    [Group: Card Punching]
        [Title: Recent Discoveries in Card Punching,
         Authors: Peter Smith, John Brown,
         Publication: Technical Report TR015]
            [label: Abstract, url: www.../abstr1.html]
            [label: Full version, url: www.../paper1.ps.Z]
        [Title: Are Magnetic Media Better?,
         Authors: Peter Smith, John Brown, Tom Wood,
         Publication: ACM TOCP Vol. 3 No. (1942) pp 23-37]
            [label: Abstract, url: www.../abstr2.html]
            [label: Full version, url: www.../paper2.ps.Z]
    ...
    [Group: Programming Languages]
        [Title: Cobol in AI,
         Authors: Sam James, John Brown]
            [label: Abstract, url: www.../abstr13.html]
            [label: Full version, url: www.../paper13.ps.Z]
        [Title: Assembly for the Masses,
         Authors: John Brown, Tom Wood,
         Publication: ACM 2 POPL Proceedings. (1943)]
            [label: Abstract, url: www.../abstr17.html]
            [label: Full version, url: www.../paper17.ps.Z]

Figure 1: Example of a hypertree

Sets of related hypertrees are collected into webs. Both hypertrees and
webs can be manipulated using WebOQL and created as the result of a
query.

WebOQL is a functional language, but queries are couched in the familiar
select-from-where form. For example, suppose that the name csPapers
denotes the papers database in Figure 1, and that we want to extract
from it the title and URL of the full version of papers authored by
"Smith".

    select [y.Title,y'.Url]
    from x in csPapers, y in x'
    where y.Authors ~ "Smith"

In this query, x iterates over the simple trees of csPapers (i.e., over
the research groups) and, given a value for x, y
iterates over the simple trees of x'. The primed variable x' denotes the
result of applying to tree x the Prime operator, which returns the first
subtree of its argument. The same operator is used to extract from tree
y its first subtree in y'.Url. The square brackets denote the Hang
operator, which builds an arc labeled with a record formed with the
arguments (in this example, the field names are inferred). Finally, the
tilde represents the string pattern matching predicate: its left
argument is a string and its right argument is a pattern.

Web Creation

The query above maps a hypertree into another hypertree; more generally,
a query is a function that maps a web into another. For example, the
following query creates a new page for each research group (using the
group name as URL). Each page contains the publications of the
corresponding group.

    select x' as x.Group
    from x in csPapers

In general, the select clause has the form
`select q_1 as s_1, q_2 as s_2, ..., q_m as s_m', where the q_i's are
queries and each of the s_i's is either a string query or the keyword
schema. The "as" clauses create the URL's s_1, s_2, ..., s_m, which are
assigned to the new pages resulting from each query q_i.

Navigation Patterns

Navigation patterns are regular expressions over an alphabet of record
predicates; they allow us to specify the structure of the paths that
must be followed in order to find the instances for variables.

Navigation patterns are mainly useful for two purposes. The first is
extracting subtrees from trees whose structure we do not know in detail
or whose structure presents irregularities, and the second is iterating
over trees connected by external arcs. In fact, the distinction between
internal and external arcs in hypertrees becomes really useful when we
use navigation patterns that traverse external arcs. Suppose that we
have a software product whose documentation is provided in HTML format
and we want to build a full-text index for it. These documents form a
complex hypertext, but it is possible to browse them sequentially by
following links having the string "Next" as label. For building the
full-text index we need to feed the indexer with the text and the URL of
each document. We can obtain this information using the following query:

    select [ x.Url, x.Text ]
    from x in browse("root.html")
    via ^*[Text ~ "Next"]>*

StruQL

StruQL is the query language of the Strudel web site management system,
described below in Section 5. Even though StruQL was developed in the
context of a specific web application, it is a general purpose query
language for semistructured data, based on a data model of labeled
directed graphs. In addition, the Strudel data model includes named
collections, and supports several atomic types that commonly appear in
web pages, such as URLs, and Postscript, text, image, and HTML files.
The result of a StruQL query is a graph in the same data model as the
input graphs. In Strudel, StruQL was used for two tasks: querying
heterogeneous sources to integrate them into a site data graph, and
querying this data graph to produce a site graph.

A StruQL query is a set of possibly nested blocks, each of the form:

    [where C1,...,Ck]
    [create N1,...,Nn]
    [link L1,...,Lp]
    [collect G1,...,Gq].

The where clause can include either membership conditions or conditions
on pairs of nodes expressed using regular path expressions. The where
clause produces all bindings of node and arc variables to values in the
input graph, and the remaining clauses use Skolem functions to construct
a new graph from these bindings.
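
The two-step evaluation just described (the where clause produces bindings, the remaining clauses build a new graph through Skolem functions) can be sketched in Python. This is only an illustration under stated assumptions, not the Strudel implementation: the input arcs, the collection `PUBLICATIONS`, the page names `Page` and `AuthorPage`, and all function names are hypothetical, and a Skolem term is modeled simply as a tuple of a function name and its arguments.

```python
# Hypothetical input graph: (source, label, target) arcs plus a named collection.
EDGES = [
    ("pub1", "title", "Paper A"),
    ("pub1", "author", "Smith"),
    ("pub2", "title", "Paper B"),
    ("pub2", "author", "Smith"),
    ("note1", "author", "Jones"),  # not a member of the collection
]
PUBLICATIONS = {"pub1", "pub2"}


def skolem(name, *args):
    """A Skolem term: the same name and arguments always denote the same node."""
    return (name,) + args


def where_bindings(edges, collection):
    """Membership condition plus arc pattern: all bindings of (x, l, v)."""
    return [(x, l, v) for (x, l, v) in edges if x in collection]


def build_site(edges, collection):
    """create/link step: construct new nodes and arcs from the bindings."""
    nodes, arcs = set(), set()
    for x, l, v in where_bindings(edges, collection):
        page = skolem("Page", x)
        nodes.add(page)
        arcs.add((page, l, v))
        if l == "author":  # nested block restricted to arcs labeled "author"
            author_page = skolem("AuthorPage", v)
            nodes.add(author_page)  # a set, so repeated terms collapse
            arcs.add((author_page, "Paper", page))
    return nodes, arcs
```

Because equal Skolem terms are equal tuples, the page for the author "Smith" is created once even though two bindings mention that author, and both publication pages end up linked from it.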
webdoc::string[url => url; author => string; We illustrate StruQL with a query de ning a web site,
modif => string; starting with a Bibtex bibliograp hy le, mo deled as a lab eled
type => string; hrefs@string =>> url; graph. The web site will consist of three kinds of pages: a
error =>> string]. Pap erPresentation page for each bibliography entry,a Year
page for eachyear, p ointing to all pap ers published in that
year, and a Ro ot page p ointing to all the Year pages. After
The rst declaration intro duces a class url, sub class of string,
showing the query in StruQL,we showitinWebOQL to
with the only metho d get. The notation get => webdoc
give a feel for the di erences b etween the languages.
means that get is a single-valued metho d that returns an
ob ject of typ e webdoc. The metho d get is system-de ned;
// Create Root
the e ect of invoking u.get for a url u in the head of a de-
create RootPage
ductive rule is to retrieve from the web the do cument with
// Create a presentation for every publication x
that URL and cache it in the lo cal Florid database as a
where Publicationsx, x->l->v
webdoc ob ject with ob ject identi er u.get.
create PaperPresentationx
The class webdoc with metho ds self, author, modif, type, link PaperPresentationx->l->v
hrefs and error mo dels the basic information common to
{ // Create a page for every year
all web do cuments. The notation hrefs@string =>> url where l = "year"
means that the multi-valued metho d hrefs takes a string
create YearPagev
as argument and returns a set of ob jects of typ e url. The link
idea is that, if d is a webdoc, then d.hrefs@aLabel returns YearPagev -> "Year" -> v
all URL's of do cuments p ointed to by links lab eled aLabel YearPagev->"Paper"->PaperPresentationx,
within do cument d. // Link root page to each year page
RootPage -> "YearPage" -> YearPagev
Sub classes of do cuments can b e declared as needed using
}
F-logic inheritance, e.g.:
htmldoc::webdoc[title => string; text => string].
In the where clause, the notation Publications(x) means that x belongs to the collection Publications, and the atom x -> l -> v denotes that there is a link in the graph from x to v and the label on the arc is l. The same notation is used in the link clause to specify the newly created edges in the resulting graph. After creating the Root page, the first CREATE generates a page for each publication (denoted by the Skolem function PaperPresentation). The second CREATE, nested within the outer query, generates a Year page for each year, and links it to the Root page and to the PaperPresentation pages of the publications published in that year. Note that the Skolem function YearPage ensures that a Year page for a particular year is only created once, no matter how many papers were published in that year.

Below is the same query in WebOQL.

    select unique [Url: x.year, Label: "YearPage"] as "RootPage",
                  [label: "Paper" / x] as x.year
    from x in browse("bibtex: myfile.bib")
    |
    select [year: y.url] + y as y.url
    from y in "browse(RootPage)"

The WebOQL query consists of two subqueries, with the web resulting from the first one "piped" into the second one using the "|" operator. The first subquery builds the Root, Paper, and Year pages, and the second one redefines each Year page by adding the "year" field to it.

Florid

Florid [HLLS97, LHL+98] is a prototype implementation of the deductive and object-oriented formalism F-logic [KLW95]. To use Florid as a web query engine, a web document is modeled by the following two classes:

    url::string[get => webdoc].

Computation in Florid is expressed by sets of deductive rules. For example, the program below extracts from the web the set of all documents reachable directly or indirectly from the URL www.cs.toronto.edu by links whose labels contain the string "database."

    ("www.cs.toronto.edu":url).get.
    (Y:url).get <-
        (X:url).get[hrefs@(L)=>>{Y}],
        substr("database",L).

Florid provides a powerful formalism for manipulating semistructured data in a web context. However, it does not currently support the construction of new webs as results of computation; the result is always a set of F-logic objects in the local store.

Ulixes and Penelope

In the Araneus project [AMM97b], the query and restructuring process is split into two phases. In the first phase, the Ulixes language is used to build relational views over the web. These views can then be analyzed and integrated using standard database techniques. Ulixes queries extract relational data from instances of page schemes defined in the ADM model, making heavy use of star-free path expressions. The second phase consists of generating hypertextual views of the data using the Penelope language. Query optimization for relational views over sets of web pages, such as those constructed by Ulixes, is discussed in [MMM98].

Interactive query interfaces

All the languages in the previous two subsections are too complex to be used directly by interactive users, just as SQL is; like SQL, they are meant to be used mostly as programming tools. There has however been work in the design of
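The Florid program shown earlier (documents reachable from www.cs.toronto.edu through links whose labels contain "database") amounts to a transitive closure over a filtered link relation. The following sketch expresses the same idea procedurally. It is not Florid: the web is replaced by an in-memory dictionary, and the function `reachable`, the structure `toy_web`, and every URL other than the starting one are assumptions made for the example.

```python
from collections import deque


def reachable(start, links, keyword="database"):
    """Return the URLs reachable from `start` via links whose label contains `keyword`.

    `links` maps a URL to a list of (label, target URL) pairs, standing in for
    the hrefs of a fetched document.
    """
    seen = {start}              # the starting document is always fetched
    queue = deque([start])
    while queue:
        url = queue.popleft()
        for label, target in links.get(url, []):
            if keyword in label and target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


# Hypothetical link structure used only for illustration.
toy_web = {
    "www.cs.toronto.edu": [("database group", "db.example"),
                           ("theory group", "theory.example")],
    "db.example": [("database papers", "papers.example"),
                   ("people", "people.example")],
    "theory.example": [("database theory", "dbtheory.example")],
}
docs = reachable("www.cs.toronto.edu", toy_web)
```

Note that dbtheory.example is not returned: its only incoming link carries a matching label, but the page that holds that link is itself not reachable through matching links.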
interactive query interfaces suitable for casual users. For example, Dataguides [GW97] is an interactive query tool for semistructured databases based on hierarchical summaries of the data graph; extensions to support querying single complex web sites are described in [GW98]. The system described in [HML+98] supports queries that combine multimedia features, such as similarity to a given sketch or image, textual features such as keywords, and domain semantics.

Theory of web queries

In defining the semantics of first-generation web query languages, it was immediately observed that certain easily stated queries, such as "list all web documents that no other document points to," could be rather hard to execute. This leads naturally to questions of query computability in this context. Abiteboul and Vianu [AV97a] and Mendelzon and Milo [MM97] propose formal ways of categorizing web queries according to whether they can in principle be computed or not; the key idea being that essentially, the only possible way to access the web is to navigate links from known starting points. Note that this includes, as a special case, navigating links from the large collections of starting points known as index servers or search engines. Abiteboul and Vianu [AV97b] also discuss fundamental issues posed by query optimization in path traversal queries. Mihaila, Milo and Mendelzon [MMM97] show how to analyze WebSQL queries in terms of the maximum number of web sites. Florescu, Levy and Suciu [FLS98] describe an algorithm for query containment for queries with regular path expressions, which is then used for verifying integrity constraints on the structure of web sites [FFLS98].

4 Information Integration

As stated earlier, the WWW contains a growing number of information sources that can be viewed as containers of sets of tuples. These "tuples" can either be embedded in HTML pages, or be hidden behind form interfaces. By writing specialized programs called wrappers, one can give the illusion that the web site is serving sets of tuples. We refer to the combination of the underlying web site and the wrapper associated with it as a web source.

The task of a web information integration system is to answer queries that may require extracting and combining data from multiple web sources. As an example, consider the domain of movies. The Internet Movie Database contains comprehensive data about movies, their casts, genres and directors. Reviews of movies can be found in multiple other web sources (e.g., web sites of major newspapers), and several web sources provide schedules of movie showings. By combining the data from these sources we can answer queries such as: give me a movie, playing time and a review of movies starring Frank Sinatra, playing tonight in Paris.

Several systems have been built with the goal of answering queries using a multitude of web sources [GMPQ+97, EW94, WBJ+95, LRO96, FW97, DG97b, AKS96, Coh98, AAB+98, BEM+98]. Many of the problems encountered in building these systems are similar to those addressed in building heterogeneous database systems [ACPS96, WAC+93, HZ96, TRV98, FRV96, Bla96, HKWY97]. Web data integration systems have, in addition, to deal with (1) a large and evolving number of web sources, (2) little meta-data about the characteristics of the source, and (3) a larger degree of source autonomy.

An important distinction in building data integration systems, and therefore in building web data integration systems, is whether to take a warehousing or a virtual approach (see [HZ96, Hul97] for a comparison). In the warehousing approach, data from multiple web sources is loaded into a warehouse, and all queries are applied to the warehoused data; this requires that the warehouse be updated when data changes, but the advantage is that adequate performance can be guaranteed at query time. In the virtual approach, the data remains in the web sources, and queries to the data integration system are decomposed at run time into queries on the sources. In this approach, data is not replicated, and is guaranteed to be fresh at query time. On the other hand, because the web sources are autonomous, more sophisticated query optimization and execution methods are needed to guarantee adequate performance. The virtual approach is more appropriate for building systems where the number of sources is large, the data is changing frequently, and there is little control over the web sources. For these reasons, most of the recent research has focused on the virtual approach, and therefore, so will our discussion. We emphasize that many of the issues that arise in the virtual approach also arise in the warehousing approach (often in a slightly different form), and hence our discussion is relevant to both cases. Finally, we refer the reader to two commercial applications of web data integration, one that takes the warehousing approach [Jun98] and the other that takes the virtual approach [Jan98].

A prototypical architecture of a virtual data integration system is shown in Figure 2. There are two main features distinguishing such a system from a traditional database system:

• As stated earlier, the system does not communicate directly with a local storage manager. Instead, in order to obtain data, the query execution engine communicates with a set of wrappers. A wrapper is a program which is specific to every web site, and whose task is to translate the data in the web site to a form that can be further processed by the data integration system. For example, the wrapper may extract from an HTML file a set of tuples. It should be emphasized that the wrapper provides only an interface to the data served by the web site, and hence, if the web site provides only limited access to the data (e.g., through a form interface that requires certain inputs), then the wrapper can only support limited access patterns to the data.

• The second difference from traditional systems is that the user does not pose queries directly in the schema in which the data is stored. The reason for this is that one of the principal goals of a data integration system is to free the user from having to know about the specific data sources and interact with each one. Instead, the user poses queries on a mediated schema. A mediated schema is a set of virtual relations, which are designed for a particular data integration application. The relations in the mediated schema are not actually stored anywhere. As a consequence, the data integration system must first reformulate a user query into a query that refers directly to the schemas in the sources. In order to perform the reformulation step, the data integration system requires a set of source descriptions. A description of an information source specifies the contents of the source (e.g., contains movies), the attributes that can be found in the source (e.g., genre, cast), constraints on the contents of the source (e.g., contains only American movies), completeness and reliability of the source, and finally, the query processing capabilities of the source (e.g., can perform selections, or can answer arbitrary SQL queries).

[Figure 2: Architecture of a data integration system. Labels recoverable from the figure: user interface (WWW-based), world view, query, answers; relevance reasoning, source descriptions (contents, capabilities, ...); logical planner, plan generator, execution planner; execution plan; execution engine (select, project, join, union, ...); interface programs; Internet; sources: structured files, WWW form interface, relational database, object-oriented database.]

The following are the main issues addressed in the work on building web data integration systems.

Specification of mediated schema and reformulation: The mediated schema in a data integration system is the set of collection and attribute names that are used to formulate queries. To evaluate a query, the data integration system must translate the query on the mediated schema into a query on the data sources, which have their own local schemas. In order to do so, the system requires a set of source descriptions. Several recent research works addressed the problem of how to specify source descriptions and how to use them for query reformulation. Broadly speaking, two general approaches have been proposed: Global as view (GAV) [GMPQ+97, PAGM96, ACPS96, HKWY97, FRV96, TRV98] and Local as view (LAV) [LRO96, KW96, DG97a, DG97b, FW97] (see [Ull97] for a detailed comparison).

In the GAV approach, for each relation R in the mediated schema, we write a query over the source relations specifying how to obtain R's tuples from the sources. The LAV approach works in the opposite direction. For every information source S, we write a query over the relations in the mediated schema that describes which tuples are found in S. The main advantage of the GAV approach is that query reformulation is very simple, because it reduces to view unfolding. In contrast, in the LAV approach it is simpler to add or delete sources because the descriptions of the sources do not have to take into account the possible interactions with other sources, as in the GAV approach, and it is also easier to describe constraints on the contents of sources. In the LAV approach, the problem of reformulation becomes a variant of the problem of answering queries using views [YL87, TSI96, LMSS95, CKPS95, RSU95, DG97b].

Completeness of data in web sources: In general, sources that we find on the WWW are not necessarily complete for the domain they are covering. For example, a bibliography source is unlikely to be complete for the field of Computer Science. However, in some cases, we can assert completeness statements about sources. For example, the DB&LP Database (footnote 2) has the complete set of papers published in most major database conferences. Knowledge of completeness of a web source can help a data integration system in several ways. Most importantly, since a negative answer from a complete source is meaningful, the data integration system can prune access to other sources. The problem of describing completeness of web sources and using this information for query processing is addressed in [Mot89, EGW94, Lev96, Dus97, AD98, FW97]. The work in [FKL97] describes a probabilistic formalism for describing the contents and overlaps among information sources, and presents algorithms for choosing optimally between sources.

Footnote 2: http://www.informatik.uni-trier.de/~ley/db/

Differing query processing capabilities: From the perspective of the web data integration system, the web sources appear to have vastly differing query processing capabilities. The main reasons for the different appearance are (1) the underlying data may actually be stored in a structured file or legacy system and in this case the interface to this data is naturally limited, and (2) even if the data is stored in a traditional database system, the web site may provide only limited access capabilities for reasons of security or performance.

To build an effective data integration system, these capabilities need to be explicitly described to the system, adhered to, and exploited as much as possible to improve performance. We distinguish two types of capabilities: negative capabilities that limit the access patterns to the data, and positive capabilities, where a source is able to perform additional algebraic operations in addition to simple data fetches.

The main form of negative capabilities is limitations on the binding patterns that can be used in queries sent to the source. For example, it is not possible to send a query to the Internet Movie Database asking for all the movies in the database and their casts. Instead, it is only possible to ask for the cast of a given movie, or to ask for the set of movies in which a particular actor appears. Several works have considered the problem of answering queries in the presence of binding pattern limitations [RSU95, KW96, LRO96, FW97].

Positive capabilities pose another challenge to a data integration system. If a data source has the ability to perform operations such as selections and joins, we would like to push as much as possible of the processing to the source, thereby hopefully reducing the amount of local processing and the amount of data transmitted over the network. The problem of describing the computing capabilities of data sources and exploiting them to create query execution plans is considered in [PGGMU95, TRV98, LRU96, HKWY97, VP97a].

Query optimization: Many works on web data integration systems have focused on the problem of selecting a minimal set of web sources to access, and on determining the minimal query that needs to be sent to each one. However, the issue of choosing an optimal query execution plan to access the web sources has received relatively little attention in the data integration literature [HKWY97], and remains an active area of research. The additional challenge faced in query optimization over sources on the WWW is that we have few statistics on the data in the sources, and hence little information to evaluate the cost of query execution plans. The work in [NGT98] considers the problem of calibrating the cost model for query execution plans in this context. The work in [YPAGM98] discusses the problem of query optimization for fusion queries, which are a special class of integration queries that focus on retrieving various attributes of a given object from multiple sources. Moreover, we believe that query processing in data integration systems is one area which would benefit from ideas such as interleaving of planning and execution and of computing conditional plans [GC94, KD98].

Query execution engines: Even less attention has been paid to the problem of building query execution engines targeted for web data integration. The challenges in building such engines are caused by the autonomy of the data sources and the unpredictability of the performance of the network. In particular, when accessing web sources we may experience initial delays before data is transmitted, and even when it is, the arrival of the data may be bursty. The work described in [AFT98, UFA98] has considered the problem of adapting query execution plans to initial delays in the arrival of the data.

Wrapper construction: Recall that the role of a wrapper is to extract the data out of a web site into a form that can be manipulated by the data integration system. For example, the task of a wrapper could be to pose a query to a web source using a form interface, and to extract a set of answer tuples out of the resulting HTML page. The difficulty in building wrappers is that the HTML page is usually designed for human viewing, rather than for extracting data by programs. Hence, the data is often embedded in natural language text or hidden within graphical presentation primitives. Moreover, the form of the HTML pages changes frequently, making it hard to maintain the wrappers. Several works have considered the problem of building tools for rapid creation of wrappers. One class of tools (e.g., [HGMN+98, GRVB98]) is based on developing specialized grammars for specifying how the data is laid out in an HTML page, and therefore how to extract the required data. A second class of techniques is based on developing inductive learning techniques for automatically learning a wrapper. Using these algorithms, we provide the system with a set of HTML pages where the data in the page is labeled. The algorithm uses the labeled examples to automatically output a grammar by which the data can be extracted from subsequent pages. Naturally, the more examples we give the system, the more accurate the resulting grammar can be, and the challenge is to discover wrapper languages that can be learned with a small number of examples. The first formulation of wrapper construction as inductive learning and a set of algorithms for learning simple classes of wrappers are given in [KDW97]. The algorithm described in [AK97] exploits heuristics specific to the common uses of HTML in order to obtain faster learning. It should be noted that Machine Learning methods have also been used to learn the mapping between the source schemas and the mediated schemas [PE95, DEW97]. The work described in [CDF+98] is a first step in bridging the gap between the approaches of Machine Learning and of Natural Language Processing to the problem of wrapper construction. Finally, we note that the emergence of XML may lead web site builders to export the data underlying their sites in a machine readable form, thereby greatly simplifying the construction of wrappers.

Matching objects across sources: One of the hardest problems in answering queries over a multitude of sources is deciding that two objects mentioned in two different sources refer to the same entity in the world. This problem arises because each source employs its own naming conventions and shorthands. Most systems deal with this problem using domain specific heuristics (as in [FHM94]). In the WHIRL system [Coh98], matching of objects across sources is done by using techniques from Information Retrieval. Furthermore, the matching of the objects is elegantly integrated in a novel query execution algorithm.

5 Web site construction and restructuring

The previous two sections discussed tasks that concerned querying existing web sites and their content. However, given the fact that web sites essentially provide access to complex structures of information, it is natural to apply techniques from Database Systems to the process of building and maintaining web sites. One can distinguish two general classes of web site building tasks: one in which web sites are created from a collection of underlying data sources, and another in which they are created by restructuring existing web sites. As it turns out, the same techniques are required for both of these classes. Furthermore, we note that the task of providing a web interface to data that exists in a single database system [NS96] is a simple instance of the problem of creating web sites (footnote 3).

Footnote 3: Most database vendors were quick to provide commercial tools for performing this task.

To understand the problem of building web sites and the possible import of database technology, consider the tasks that a web site builder must address: (1) choosing and accessing the data that will be displayed at the site, (2) designing the site's structure, i.e., specifying the data contained within each page and the links between pages, and (3) designing the graphical presentation of pages. In existing web site management tools, these tasks are, for the most part, interdependent. Without any site-creation tools, a site builder writes HTML files by hand or writes programs to produce them and must focus simultaneously on a page's content, its relationship to other pages, and its graphical presentation. As a result, several important tasks, such as automatically
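To make the idea of matching objects with Information Retrieval techniques concrete, the sketch below pairs movie titles from two sources by TF-IDF cosine similarity over their tokens. This is a generic vector-space illustration and should not be read as the algorithm of WHIRL [Coh98] or of any other cited system. The function names, the tokenization, the weighting formula, and the two toy title lists are all assumptions made for the example.

```python
import math
from collections import Counter


def tokens(name):
    """Lower-case a name and split it into whitespace-separated tokens."""
    return name.lower().replace(",", " ").replace(".", " ").split()


def tfidf_vectors(names):
    """Return one unit-length TF-IDF vector (a dict) per name."""
    docs = [Counter(tokens(n)) for n in names]
    df = Counter(t for d in docs for t in d)      # document frequency per token
    n = len(docs)
    vectors = []
    for d in docs:
        v = {t: tf * math.log(1 + n / df[t]) for t, tf in d.items()}
        norm = math.sqrt(sum(w * w for w in v.values())) or 1.0
        vectors.append({t: w / norm for t, w in v.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def best_matches(source_a, source_b):
    """For each name in source_a, return the most similar name in source_b and its score."""
    vecs = tfidf_vectors(source_a + source_b)
    va, vb = vecs[:len(source_a)], vecs[len(source_a):]
    result = {}
    for i, a in enumerate(source_a):
        scores = [cosine(va[i], v) for v in vb]
        j = max(range(len(source_b)), key=scores.__getitem__)
        result[a] = (source_b[j], scores[j])
    return result


# Toy data: the same movies written under two different naming conventions.
movies_a = ["Manchurian Candidate, The", "From Here to Eternity"]
movies_b = ["from here to eternity (1953)",
            "The Manchurian Candidate (1962)",
            "Ocean's Eleven (1960)"]
matches = best_matches(movies_a, movies_b)
```

A similarity score of this kind replaces the exact-equality test of a conventional join, which is why differing conventions such as a trailing article or an appended year do not prevent a match.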
updating a site, restructuring a site, or enforcing integrity constraints on a site's structure, are tedious to perform.

Web sites as declaratively defined structures: Several systems have been developed with the goal of applying database techniques to the problem of web site creation [FFK+98, AMM98, AM98, CDSS98, PF98, JB97, LSB+98, TN98]. The common theme to these systems is that they provide an explicit declarative representation of the structure of a web site. The structure of the web site is defined as a view over existing data. However, we emphasize that the languages used to create these views result in graphs of web pages with hypertext links, rather than simple tables. The systems differ on the data model they use, the query language they use, and whether they have an intermediate logical representation of the web site, rather than having only a representation of the final HTML.

Building a web site using a declarative representation of the structure of the site has several significant advantages. Since a web site's structure and content are defined declaratively by a query, not procedurally by a program, it is easy to create multiple versions of a site. For example, it is possible to easily build internal and external views of an organization's site or to build sites tailored to different classes of users. Currently, creating multiple versions requires writing multiple sets of programs or manually creating different sets of HTML files. Building multiple versions of a site can be done by either writing different site definition queries, or by changing the graphical representation independently of the underlying structure. Furthermore, a declarative representation of the web site's structure also supports easy evolution of a web site's structure. For example, to reorganize pages based on frequent usage patterns [PE97], or to extend the site's content, we simply rewrite the site-definition query, as opposed to rewriting a set of programs or a set of HTML files. Declarative specification of web sites can offer other advantages. For example, it becomes possible to express and enforce integrity constraints on the site [FFLS98], and to update a site incrementally when changes occur in the underlying data. Moreover, a declarative specification provides a platform for developing optimization algorithms for run-time management of data intensive web sites. The challenge in run-time management of a web site is to automatically find an optimal tradeoff between precomputation of parts of the web site and click-time computation. Finally, we remark that building web sites using this paradigm will also facilitate the tasks of querying web sites and integrating data from multiple web sources.

A prototypical architecture of such a system is shown in Figure 3. At the bottom level, the system accesses a set of data sources containing the data that will be served on the web site. The data may be stored in databases, in structured files, or in existing web sites. The data is represented in the system in some data model, and the system provides a uniform interface to these data sources using techniques similar to the ones described in the previous section. The main step in building a web site is to write an expression that declaratively represents the structure of the web site. The expression is written in a specific query language provided by the system. The result of applying this query to the underlying data is the logical representation of the web site in the data model of the system (e.g., a labeled directed graph). Finally, to actually create a browsable web site, the system contains a method (e.g., HTML templates) for translating the logical structure into a set of HTML files.

[Figure 3: Architecture for Web Site Management Systems. Layers, bottom to top: data sources (DB, structured files, HTML pages); wrappers; mediator, giving a uniform view of the underlying data; declarative web site structure specification, giving the logical representation of the web site; graphical presentation specification and HTML generator; browsable web site.]

Some of the salient characteristics of the different systems are the following. Strudel [FFK+98] uses a semistructured data model of labeled directed graphs for modeling both the underlying data and the web site. It uses a single query language, StruQL, throughout the system, both for integrating the raw data and for defining the structure of the web site. Araneus [AMM97b] uses a more structured data model, ADM, and provides a language for transforming data into ADM and a language for creating web sites from data modeled in ADM. In addition, Araneus uses web-specific constructs in the data model. The Autoweb [PF98] system is based on the hypermedia design model HDM, a design tool for hypermedia applications. Its data model is based on the entity-relationship model; the "access schema" specifies how the hyperbase is navigated and accessed in a browsable site; and the "presentation schema" specifies how objects and paths in the hyperbase and access schemas are rendered. All three systems mentioned above provide a clear separation between the creation of the logical structure of the web site and the specification of the graphical presentation of the site. The YAT system [CDSS98] is an application of a data conversion language to the problem of building web sites. Using YAT, a web site designer writes a set of rules converting from the raw data into an abstract syntax tree of the resulting HTML, without going through an intermediate logical representation phase. In fact, in a similar way, other languages for data conversion (such as [MZ98, MPP+93, PMSL94]) can also be used to build web sites. The WIRM system [JB97] is similar in spirit to the above systems in that it enables users to build web sites in which the pages can be viewed as context-sensitive views of the underlying data. The major focus of WIRM is on integrating medical research data for the national Human Brain Project.

6 Conclusions, Perspectives and Future Work

An overarching question regarding the topic of this survey is whether the World-Wide Web presents novel problems to the database community. In many ways, the WWW is not similar to a database. For example, there is no uniform structure, no integrity constraints, no transactions, no
standard query language or data model. And yet, as the survey has shown, the powerful abstractions developed in the database community may prove to be key in taming the web's complexity and providing valuable services.

Of particular importance is the view of a large web site as being not just a database, but an information system built around one or more databases with an accompanying complex navigation structure. In that view, a web site has many similarities to non-web information systems. Designing such a web site requires extending information systems design methodologies [AMM98, PF98]. Using these principles to build web sites will also impact the way we query the web and the way we integrate data from multiple web sources.

Several trends will have significant impact on the use of database technology for web applications. The first is, of course, XML. The considerable momentum behind XML and related metadata initiatives can only help the applicability of database concepts to the web by providing the much needed structure in a widely accepted format. While the availability of data in XML format will reduce the need to focus on wrappers converting human readable data to machine readable data, the challenges of semantic integration of data from web sources still remain. Building on our experience in developing methods for manipulating semistructured data, our community is in a unique position to develop tools for manipulating data in XML format. In fact, some of the concepts developed in this community are already being adapted to the XML context [DFF+98, GMW98]. Other projects under way in the database community in the area of metadata architectures and languages (e.g., [MRT98, KMSS98]) are likely to take advantage of and merge with the XML framework.

A second trend that will affect the applicability of database techniques for querying the web is the growth of the so-called hidden web. The hidden web refers to the web pages that are generated by programs given user inputs, and are therefore not accessible to web crawlers for indexing. A recent article [LG98] claims that close to 80% of the web is already in the hidden web. If our tools are to be able to benefit from data in the hidden web, we must develop techniques for identifying sites that generate web pages, classifying them, and automatically creating query interfaces to them.

There is no shortage of possible directions for future research in this area. In the past, the bulk of the work has focused on the logical level, developing appropriate data models, query languages and methods for describing different aspects of web sources. In contrast, problems of query optimization and query execution have received relatively little attention in the database community, and pose some of the more important challenges for future work. Some of the important directions in which to enrich our data models and query languages include the incorporation of various forms of meta data about sources (e.g., probabilistic information) and the principled combination of querying structured and unstructured data sources on the WWW.

Finally, in this article we tried to provide a representative list of references on the topic of web and databases. In addition to these references, readers can get a more detailed account from recent workshops related to the topic of the survey [SSD97, Web98, AII98].

Acknowledgements

The authors are grateful to the following colleagues who provided suggestions and comments on earlier versions of this paper: Serge Abiteboul, Gustavo Arocena, Paolo Atzeni, José Blakeley, Nick Kushmerick, Bertram Ludäscher, C. Mohan, Yannis Papakonstantinou, Oded Shmueli, Anthony Tomasic and Dan Weld.

References

[AAB+98] José Luis Ambite, Naveen Ashish, Greg Barish, Craig A. Knoblock, Steven Minton, Pragnesh J. Modi, Ion Muslea, Andrew Philpot, and Sheila Tejada. ARIADNE: A system for constructing mediators for internet sources (system demonstration). In Proc. of ACM SIGMOD Conf. on Management of Data, Seattle, WA, 1998.

[Abi97] Serge Abiteboul. Querying semi-structured data. In Proc. of the Int. Conf. on Database Theory (ICDT), Delphi, Greece, 1997.

[ACM93] Serge Abiteboul, Sophie Cluet, and Tova Milo. Querying and updating the file. In Proc. of the Int. Conf. on Very Large Data Bases (VLDB), Dublin, Ireland, 1993.

[ACPS96] S. Adali, K. Candan, Y. Papakonstantinou, and V.S. Subrahmanian. Query caching and optimization in distributed mediator systems. In Proc. of ACM SIGMOD Conf. on Management of Data, Montreal, Canada, 1996.

[AD98] S. Abiteboul and O. Duschka. Complexity of answering queries using materialized views. In Proc. of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS), Seattle, WA, 1998.

[AFT98] Laurent Amsaleg, Michael Franklin, and Anthony Tomasic. Dynamic query operator scheduling for wide-area remote access. Distributed and Parallel Databases, 6(3):217-246, July 1998.

[AII98] Proceedings of the AAAI Workshop on Intelligent Data Integration, Madison, Wisconsin, July 1998.

[AK97] Naveen Ashish and Craig A. Knoblock. Wrapper generation for semi-structured internet sources. SIGMOD Record, 26(4):8-15, 1997.

[AKS96] Yigal Arens, Craig A. Knoblock, and Wei-Min Shen. Query reformulation for dynamic information integration. International Journal on Intelligent and Cooperative Information Systems, 6(2/3):99-130, June 1996.

[AM98] Gustavo Arocena and Alberto Mendelzon. WebOQL: Restructuring documents, databases and webs. In Proc. of Int. Conf. on Data Engineering (ICDE), Orlando, Florida, 1998.

[AMM97a] Gustavo O. Arocena, Alberto O. Mendelzon, and George A. Mihaila. Applications of a web query language. In Proc. of the Int. WWW Conf., April 1997.
[AMM97b] Paolo Atzeni, Giansalvatore Mecca, and Paolo Merialdo. To weave
the web. In Proc. of the Int. Conf. on Very Large Data Bases (VLDB),
1997.
[AMM98] Paolo Atzeni, Giansalvatore Mecca, and Paolo Merialdo. Design and
maintenance of data-intensive web sites. In Proc. of the Conf. on
Extending Database Technology (EDBT), Valencia, Spain, 1998.
[AMR+98] Serge Abiteboul, Jason McHugh, Michael Rys, Vasilis Vassalos,
and Janet Wiener. Incremental maintenance for materialized views over
semistructured data. In Proc. of the Int. Conf. on Very Large Data Bases
(VLDB), New York City, USA, 1998.
[AQM+97] Serge Abiteboul, Dallan Quass, Jason McHugh, Jennifer Widom, and
Janet Wiener. The Lorel query language for semistructured data.
International Journal on Digital Libraries, 1(1):68-88, April 1997.
[Aro97] Gustavo Arocena. WebOQL: Exploiting document structure in web
queries. Master's thesis, University of Toronto, 1997.
[AV97a] S. Abiteboul and V. Vianu. Queries and computation on the Web. In
Proc. of the Int. Conf. on Database Theory (ICDT), Delphi, Greece, 1997.
[AV97b] Serge Abiteboul and Victor Vianu. Regular path queries with
constraints. In Proc. of the ACM SIGACT-SIGMOD-SIGART Symposium on
Principles of Database Systems (PODS), Tucson, Arizona, May 1997.
[BBH+98] Krishna Bharat, Andrei Broder, Monika Henzinger, Puneet Kumar,
and Suresh Venkatasubramanian. The connectivity server: fast access to
linkage information on the web. In Proc. of the Int. WWW Conf., April
1998.
[BBMR89] Alex Borgida, Ronald Brachman, Deborah McGuinness, and Lori
Resnick. CLASSIC: A structural data model for objects. In Proc. of ACM
SIGMOD Conf. on Management of Data, pages 59-67, Portland, Oregon, 1989.
[BDFS97] Peter Buneman, Susan Davidson, Mary Fernandez, and Dan Suciu.
Adding structure to unstructured data. In Proc. of the Int. Conf. on
Database Theory (ICDT), Delphi, Greece, 1997.
[BDH+95] C. M. Bowman, Peter B. Danzig, Darren R. Hardy, Udi Manber, and
Michael F. Schwartz. The harvest information discovery and access system.
Computer Networks and ISDN Systems, 28(1-2):119-125, December 1995.
[BDHS96] P. Buneman, S. Davidson, G. Hillebrand, and D. Suciu. A query
language and optimization techniques for unstructured data. In Proc. of
ACM SIGMOD Conf. on Management of Data, pages 505-516, Montreal, Canada,
1996.
[BEM+98] C. Beeri, G. Elber, T. Milo, Y. Sagiv, O. Shmueli, N. Tishby,
Y. Kogan, D. Konopnicki, P. Mogilevski, and N. Slonim. Websuite - a tool
suite for harnessing web data. In Proceedings of the International
Workshop on the Web and Databases, Valencia, Spain, 1998.
[BH98] Krishna Bharat and Monika Henzinger. Improved algorithms for topic
distillation in hyperlinked environments. In Proc. 21st Int'l ACM SIGIR
Conference, 1998.
[Bil98] David Billard. Transactional services for the web. In Proceedings
of the International Workshop on the Web and Databases, pages 11-17,
Valencia, Spain, 1998.
[BK90] C. Beeri and Y. Kornatzky. A logical query language for hypertext
systems. In Proc. of the European Conference on Hypertext, pages 67-80.
Cambridge University Press, 1990.
[Bla96] Jose A. Blakeley. Data access for the masses through OLE DB. In
Proc. of ACM SIGMOD Conf. on Management of Data, pages 161-172, Montreal,
Canada, 1996.
[BP98] Sergey Brin and Lawrence Page. The anatomy of a large-scale
hypertextual web search engine. In Proc. of the Int. WWW Conf., April
1998.
[BT98] Philippe Bonnet and Anthony Tomasic. Partial answers for
unavailable data sources. In Proceedings of the 1998 Workshop on Flexible
Query-Answering Systems (FQAS'98), pages 43-54. Department of Computer
Science, Roskilde University, 1998.
[Bun97] Peter Buneman. Semistructured data. In Proc. of the ACM
SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS),
pages 117-121, Tucson, Arizona, 1997.
[CACS94] V. Christophides, S. Abiteboul, S. Cluet, and M. Scholl. From
structured documents to novel query facilities. In Proc. of ACM SIGMOD
Conf. on Management of Data, pages 313-324, Minneapolis, Minnesota, 1994.
[CDF+98] Mark Craven, Dan DiPasquo, Dayne Freitag, Andrew McCallum, Tom
Mitchell, Kamal Nigam, and Sean Slattery. Learning to extract symbolic
knowledge from the world-wide web. In Proceedings of the AAAI Fifteenth
National Conference on Artificial Intelligence, 1998.
[CDRR98] Soumen Chakrabarti, Byron Dom, Prabhakar Raghavan, and Sridhar
Rajagopalan. Automatic resource compilation by analyzing hyperlink
structure and associated text. In Proc. of the Int. WWW Conf., April
1998.
[CDSS98] Sophie Cluet, Claude Delobel, Jerome Simeon, and Katarzyna
Smaga. Your mediators need data conversion. In Proc. of ACM SIGMOD Conf.
on Management of Data, Seattle, WA, 1998.
[CGL98] Diego Calvanese, Giuseppe De Giacomo, and Maurizio Lenzerini.
What can knowledge representation do for semi-structured data? In
Proceedings of the AAAI Fifteenth National Conference on Artificial
Intelligence, 1998.
[CGMP98] Junghoo Cho, Hector Garcia-Molina, and Lawrence Page. Efficient
crawling through url ordering. In Proc. of the Int. WWW Conf., April
1998.
[CK98] J. Carriere and R. Kazman. Web query: Searching and visualizing
the web through connectivity. In Proc. of the Int. WWW Conf., April 1998.
[CKPS95] Surajit Chaudhuri, Ravi Krishnamurthy, Spyros Potamianos, and
Kyuseok Shim. Optimizing queries with materialized views. In Proc. of
Int. Conf. on Data Engineering (ICDE), Taipei, Taiwan, 1995.
[CM89] Mariano P. Consens and Alberto O. Mendelzon. Expressing structural
hypertext queries in graphlog. In Proc. 2nd. ACM Conference on Hypertext,
pages 269-292, Pittsburgh, November 1989.
[CM90] Mariano Consens and Alberto Mendelzon. GraphLog: a visual
formalism for real life recursion. In Proc. of the ACM
SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS),
pages 404-416, Atlantic City, NJ, 1990.
[CMW87] Isabel F. Cruz, Alberto O. Mendelzon, and Peter T. Wood. A
graphical query language supporting recursion. In Proc. of ACM SIGMOD
Conf. on Management of Data, pages 323-330, San Francisco, CA, 1987.
[CMW88] I.F. Cruz, A.O. Mendelzon, and P.T. Wood. G+: Recursive queries
without recursion. In Proceedings of the Second International Conference
on Expert Database Systems, pages 355-368, 1988.
[Coh98] William Cohen. Integration of heterogeneous databases without
common domains using queries based on textual similarity. In Proc. of ACM
SIGMOD Conf. on Management of Data, Seattle, WA, 1998.
[CS98] Pei Cao and Sekhar Sarukkai, editors. Proceedings of Workshop on
Internet Server Performance, Madison, Wisconsin, 1998.
http://www.cs.wisc.edu/~cao/WISP98-program.html.
[DEW97] B. Doorenbos, O. Etzioni, and D. Weld. Scalable
comparison-shopping agent for the world-wide web. In Proceedings of the
International Conference on Autonomous Agents, February 1997.
[DFF+98] Alin Deutsch, Mary Fernandez, Daniela Florescu, Alon Levy, and
Dan Suciu. A query language for XML.
http://www.research.att.com/~mff/xml/w3c-note.html, 1998.
[DG97a] Oliver M. Duschka and Michael R. Genesereth. Answering recursive
queries using views. In Proc. of the ACM SIGACT-SIGMOD-SIGART Symposium
on Principles of Database Systems (PODS), Tucson, Arizona, 1997.
[DG97b] Oliver M. Duschka and Michael R. Genesereth. Query planning in
infomaster. In Proceedings of the ACM Symposium on Applied Computing, San
Jose, CA, 1997.
[Dus97] Oliver Duschka. Query optimization using local completeness. In
Proceedings of the AAAI Fourteenth National Conference on Artificial
Intelligence, 1997.
[EGW94] Oren Etzioni, Keith Golden, and Daniel Weld. Tractable closed
world reasoning with updates. In Proceedings of the Conference on
Principles of Knowledge Representation and Reasoning (KR-94), 1994.
Extended version to appear in Artificial Intelligence.
[EW94] Oren Etzioni and Dan Weld. A softbot-based interface to the
internet. CACM, 37(7):72-76, 1994.
[FFK+98] Mary Fernandez, Daniela Florescu, Jaewoo Kang, Alon Levy, and
Dan Suciu. Catching the boat with Strudel: Experiences with a web-site
management system. In Proc. of ACM SIGMOD Conf. on Management of Data,
Seattle, WA, 1998.
[FFLS97] Mary Fernandez, Daniela Florescu, Alon Levy, and Dan Suciu. A
query language for a web-site management system. SIGMOD Record,
26(3):4-11, September 1997.
[FFLS98] Mary Fernandez, Daniela Florescu, Alon Levy, and Dan Suciu.
Reasoning about web-sites. In Working notes of the AAAI-98 Workshop on
Artificial Intelligence and Data Integration. American Association of
Artificial Intelligence, 1998.
[FHM94] Douglas Fang, Joachim Hammer, and Dennis McLeod. The
identification and resolution of semantic heterogeneity in multidatabase
systems. In Multidatabase Systems: An Advanced Solution for Global
Information Sharing, pages 52-60. IEEE Computer Society Press, Los
Alamitos, California, 1994.
[FKL97] Daniela Florescu, Daphne Koller, and Alon Levy. Using
probabilistic information in data integration. In Proc. of the Int. Conf.
on Very Large Data Bases (VLDB), pages 216-225, Athens, Greece, 1997.
[FLS98] Daniela Florescu, Alon Levy, and Dan Suciu. Query containment for
conjunctive queries with regular expressions. In Proc. of the ACM
SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS),
Seattle, WA, 1998.
[FRV96] Daniela Florescu, Louiqa Raschid, and Patrick Valduriez. A
methodology for query reformulation in cis using semantic knowledge. Int.
Journal of Intelligent & Cooperative Information Systems, special issue
on Formal Methods in Cooperative Information Systems, 5(4), 1996.
[FW97] M. Friedman and D. Weld. Efficient execution of information
gathering plans. In Proceedings of the International Joint Conference on
Artificial Intelligence, Nagoya, Japan, 1997.
[GC94] G. Graefe and R. Cole. Optimization of dynamic query evaluation
plans. In Proc. of ACM SIGMOD Conf. on Management of Data, Minneapolis,
Minnesota, 1994.
[GGMT99] Luis Gravano, Hector Garcia-Molina, and Anthony Tomasic. GlOSS:
Text-source discovery over the internet. ACM Transactions on Database
Systems (to appear), 1999.
[GMPQ+97] H. Garcia-Molina, Y. Papakonstantinou, D. Quass, A. Rajaraman,
Y. Sagiv, J. Ullman, and J. Widom. The TSIMMIS project: Integration of
heterogeneous information sources, March 1997.
[GMW98] R. Goldman, J. McHugh, and J. Widom. Lore: A database management
system for XML. Presentation, Stanford University Database Group, 1998.
[GRC97] S. Gadde, M. Rabinovich, and J. Chase. Reduce, reuse, recycle: An
approach to building large internet caches. In Proceedings of the
Workshop on Hot Topics in Operating Systems (HotOS), May 1997.
[GRVB98] Jean-Robert Gruser, Louiqa Raschid, María Esther Vidal, and
Laura Bright. Wrapper generation for web accessible data sources. In
Proceedings of the CoopIS, 1998.
[GW97] Roy Goldman and Jennifer Widom. Dataguides: Enabling query
formulation and optimization in semistructured databases. In Proc. of the
Int. Conf. on Very Large Data Bases (VLDB), Athens, Greece, 1997.
[GW98] Roy Goldman and Jennifer Widom. Interactive query and search in
semistructured databases. In Proceedings of the International Workshop on
the Web and Databases, pages 42-48, Valencia, Spain, 1998.
[GZC89] Ralf Hartmut Güting, Roberto Zicari, and David M. Choy. An
algebra for structured office documents. ACM TOIS, 7(2):123-157, 1989.
[Hal88] Frank G. Halasz. Reflections on Notecards: Seven issues for the
next generation of hypermedia systems. Comm. of the ACM, 31(7):836-852,
1988.
[Har94] Coleman Harrison. Aql: An adaptive query language. Technical
Report NU-CCS-94-19, Northeastern University, October 1994. Master's
Thesis.
[HGMN+98] Joachim Hammer, Hector Garcia-Molina, Svetlozar Nestorov,
Ramana Yerneni, Markus M. Breunig, and Vasilis Vassalos. Template-based
wrappers in the TSIMMIS system (system demonstration). In Proc. of ACM
SIGMOD Conf. on Management of Data, Tucson, Arizona, 1998.
[HKWY97] Laura Haas, Donald Kossmann, Edward Wimmers, and Jun Yang.
Optimizing queries across diverse data sources. In Proc. of the Int.
Conf. on Very Large Data Bases (VLDB), Athens, Greece, 1997.
[HLLS97] Rainer Himmeröder, Georg Lausen, Bertram Ludäscher, and
Christian Schlepphorst. On a declarative semantics for web queries. In
Proc. of the Int. Conf. on Deductive and Object-Oriented Databases
(DOOD), pages 386-398, Singapore, December 1997. Springer.
[HML+98] Kyoji Hirata, Sougata Mukherjea, Wen-Syan Li, Yusaku Okamura,
and Yoshinori Hara. Facilitating object-based navigation through
multimedia web databases. TAPOS, 4(4), 1998. In press.
[Hul97] Richard Hull. Managing semantic heterogeneity in databases: A
theoretical perspective. In Proc. of the ACM SIGACT-SIGMOD-SIGART
Symposium on Principles of Database Systems (PODS), pages 51-61, Tucson,
Arizona, 1997.
[HZ96] Richard Hull and Gang Zhou. A framework for supporting data
integration using the materialized and virtual approaches. In Proc. of
ACM SIGMOD Conf. on Management of Data, pages 481-492, Montreal, Canada,
1996.
[Jan98] http://www.jango.excite.com, 1998.
[JB97] R. Jakobovits and J. F. Brinkley. Managing medical research data
with a web-interfacing repository manager. In American Medical
Informatics Association Fall Symposium, pages 454-458, Nashville, Oct
1997.
[Jun98] http://www.junglee.com, 1998.
[KD98] Navin Kabra and David J. DeWitt. Efficient mid-query
re-optimization of sub-optimal query execution plans. In Proc. of ACM
SIGMOD Conf. on Management of Data, pages 106-117, Seattle, WA, 1998.
[KDW97] N. Kushmerick, R. Doorenbos, and D. Weld. Wrapper induction for
information extraction. In Proceedings of the 15th International Joint
Conference on Artificial Intelligence, 1997.
[Kle98] Jon Kleinberg. Authoritative sources in a hyperlinked
environment. In Proceedings of 9th ACM-SIAM Symposium on Discrete
Algorithms, 1998.
[KLW95] M. Kifer, G. Lausen, and J. Wu. Logical foundations of
object-oriented and frame-based languages. J. ACM, 42(4):741-843, 1995.
[KMSS98] Y. Kogan, D. Michaeli, Y. Sagiv, and O. Shmueli. Utilizing the
multiple facets of www contents. Data and Knowledge Engineering, 1998. In
press.
[KS95] D. Konopnicki and O. Shmueli. W3QS: A query system for the World
Wide Web. In Proc. of the Int. Conf. on Very Large Data Bases (VLDB),
pages 54-65, Zurich, Switzerland, 1995.
[KS98] David Konopnicki and Oded Shmueli. Bringing database functionality
to the WWW. In Proceedings of the International Workshop on the Web and
Databases, Valencia, Spain, pages 49-55, 1998.
[KW96] Chung T. Kwok and Daniel S. Weld. Planning to gather information.
In Proceedings of the AAAI Thirteenth National Conference on Artificial
Intelligence, 1996.
[Lev96] Alon Y. Levy. Obtaining complete answers from incomplete
databases. In Proc. of the Int. Conf. on Very Large Data Bases (VLDB),
Bombay, India, 1996.
[LG98] S. Lawrence and C.L. Giles. Searching the world wide web. Science,
280(4):98-100, 1998.
[LHL+98] Bertram Ludäscher, Rainer Himmeröder, Georg Lausen, Wolfgang
May, and Christian Schlepphorst. Managing semistructured data with
FLORID: A deductive object-oriented perspective. Information Systems,
23(8), 1998. To appear.
[LMSS95] Alon Y. Levy, Alberto O. Mendelzon, Yehoshua Sagiv, and Divesh
Srivastava. Answering queries using views. In Proc. of the ACM
SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS),
San Jose, CA, 1995.
[LRO96] Alon Y. Levy, Anand Rajaraman, and Joann J. Ordille. Querying
heterogeneous information sources using source descriptions. In Proc. of
the Int. Conf. on Very Large Data Bases (VLDB), Bombay, India, 1996.
[LRU96] Alon Y. Levy, Anand Rajaraman, and Jeffrey D. Ullman. Answering
queries using limited external processors. In Proc. of the ACM
SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS),
Montreal, Canada, 1996.
[LSB+98] T. Lee, L. Sheng, T. Bozkaya, N.H. Balkir, Z.M. Ozsoyoglu, and
G. Ozsoyoglu. Querying multimedia presentations based on content. To
appear in IEEE Trans. on Know. and Data Engineering, 1998.
[LSCH98] Wen-Syan Li, Junho Shim, K. Selcuk Candan, and Yoshinori Hara.
WebDB: A web query system and its modeling, language, and implementation.
In Proc. IEEE Advances in Digital Libraries '98, 1998.
[LSS96] Laks V. S. Lakshmanan, Fereidoon Sadri, and Iyer N. Subramanian.
A declarative language for querying and restructuring the Web. In Proc.
of 6th. International Workshop on Research Issues in Data Engineering,
RIDE '96, New Orleans, February 1996.
[MM97] Alberto O. Mendelzon and Tova Milo. Formal models of web queries.
In Proc. of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of
Database Systems (PODS), pages 134-143, Tucson, Arizona, May 1997.
[MMM97] A. Mendelzon, G. Mihaila, and T. Milo. Querying the world wide
web. International Journal on Digital Libraries, 1(1):54-67, April 1997.
[MMM98] Giansalvatore Mecca, Alberto O. Mendelzon, and Paolo Merialdo.
Efficient queries over web views. In Proc. of the Conf. on Extending
Database Technology (EDBT), Valencia, Spain, 1998.
[Mot89] Amihai Motro. Integrity = validity + completeness. ACM
Transactions on Database Systems, 14(4):480-502, December 1989.
[MP96] Kenneth Moore and Michelle Peterson. A groupware benchmark based
on lotus notes. In Proc. of Int. Conf. on Data Engineering (ICDE), 1996.
[MPP+93] Bernhard Mitschang, Hamid Pirahesh, Peter Pistor, Bruce G.
Lindsay, and Norbert Südkamp. Sql/xnf - processing composite objects as
abstractions over relational data. In Proc. of Int. Conf. on Data
Engineering (ICDE), pages 272-282, Vienna, Austria, 1993.
[MRT98] George A. Mihaila, Louiqa Raschid, and Anthony Tomasic. Equal
time for data on the internet with websemantics. In Proc. of the Conf. on
Extending Database Technology (EDBT), Valencia, Spain, 1998.
[MZ98] Tova Milo and Sagit Zohar. Using schema matching to simplify
heterogeneous data translation. In Proc. of the Int. Conf. on Very Large
Data Bases (VLDB), New York City, USA, 1998.
[NAM98] Svetlozar Nestorov, Serge Abiteboul, and Rajeev Motwani.
Extracting schema from semistructured data. In Proc. of ACM SIGMOD Conf.
on Management of Data, Seattle, WA, 1998.
[NGT98] Hubert Naacke, Georges Gardarin, and Anthony Tomasic. Leveraging
mediator cost models with heterogeneous data sources. In Proc. of Int.
Conf. on Data Engineering (ICDE), Orlando, Florida, 1998.
[NS96] Tam Nguyen and V. Srinivasan. Accessing relational databases from
the world wide web. In Proc. of ACM SIGMOD Conf. on Management of Data,
Montreal, Canada, 1996.
[PAGM96] Y. Papakonstantinou, S. Abiteboul, and H. Garcia-Molina. Object
fusion in mediator systems. In Proc. of the Int. Conf. on Very Large Data
Bases (VLDB), Bombay, India, 1996.
[PdBA+92] Jan Paredaens, Jan Van den Bussche, Marc Andries, Marc Gemis,
Marc Gyssens, Inge Thyssens, Dirk Van Gucht, Vijay Sarathy, and Lawrence
V. Saxton. An overview of GOOD. SIGMOD Record, 21(1):25-31, 1992.
[PE95] Mike Perkowitz and Oren Etzioni. Category translation: Learning to
understand information on the internet. In Working Notes of the AAAI
Spring Symposium on Information Gathering from Heterogeneous Distributed
Environments. American Association for Artificial Intelligence, 1995.
[PE97] Mike Perkowitz and Oren Etzioni. Adaptive web sites: an AI
challenge. In Proceedings of the 15th International Joint Conference on
Artificial Intelligence, 1997.
[PF98] P. Paolini and P. Fraternali. A conceptual model and a tool
environment for developing more scalable, dynamic, and customizable web
applications. In Proc. of the Conf. on Extending Database Technology
(EDBT), Valencia, Spain, 1998.
[PGGMU95] Yannis Papakonstantinou, Ashish Gupta, Hector Garcia-Molina,
and Jeffrey Ullman. A query translation scheme for rapid implementation
of wrappers. In Proc. of the Int. Conf. on Deductive and Object-Oriented
Databases (DOOD), 1995.
[PGMW95] Yannis Papakonstantinou, Hector Garcia-Molina, and Jennifer
Widom. Object exchange across heterogeneous information sources. In Proc.
of Int. Conf. on Data Engineering (ICDE), pages 251-260, Taipei, Taiwan,
1995.
[PMSL94] Hamid Pirahesh, Bernhard Mitschang, Norbert Südkamp, and Bruce
G. Lindsay. Composite-object views in relational dbms: an implementation
perspective. Information Systems (IS), 19(1):69-88, 1994.
[PPR96] P. Pirolli, J. Pitkow, and R. Rao. Silk from a sow's ear:
Extracting usable structures from the web. In Proc. ACM SIGCHI
Conference, 1996.
[RSU95] Anand Rajaraman, Yehoshua Sagiv, and Jeffrey D. Ullman. Answering
queries using templates with binding patterns. In Proc. of the ACM
SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS),
San Jose, CA, 1995.
[Spe97] Ellen Spertus. ParaSite: Mining structural information on the
web. In Proc. of the Int. WWW Conf., April 1997.
[SSD97] Proceedings of Workshop on Management of Semistructured Data,
Tucson, Arizona, May 1997.
[TMD92] J. Thierry-Mieg and R. Durbin. A C. elegans database: syntactic
definitions for the ACEDB data base manager, 1992.
[TN98] Motomichi Toyama and T. Nagafuji. Dynamic and structured
presentation of database contents on the web. In Proc. of the Conf. on
Extending Database Technology (EDBT), Valencia, Spain, 1998.
[TRV98] A. Tomasic, L. Raschid, and P. Valduriez. Scaling access to
distributed heterogeneous data sources with Disco. IEEE Transactions On
Knowledge and Data Engineering (to appear), 1998.
[TSI96] Odysseas G. Tsatalos, Marvin H. Solomon, and Yannis E. Ioannidis.
The GMAP: A versatile tool for physical data independence. VLDB Journal,
5(2):101-118, 1996.
[UFA98] Tolga Urhan, Michael J. Franklin, and Laurent Amsaleg. Cost based
query scrambling for initial delays. In Proc. of ACM SIGMOD Conf. on
Management of Data, pages 130-141, Seattle, WA, 1998.
[Ull97] Jeffrey D. Ullman. Information integration using logical views.
In Proc. of the Int. Conf. on Database Theory (ICDT), Delphi, Greece,
1997.
[VP97a] Vasilis Vassalos and Yannis Papakonstantinou. Describing and
using the query capabilities of heterogeneous sources. In Proc. of the
Int. Conf. on Very Large Data Bases (VLDB), Athens, Greece, 1997.
[VP97b] Anne-Marie Vercoustre and Francois Paradis. A descriptive
language for information object reuse through virtual documents. In
Proceedings of the 4th International Conference on Object-Oriented
Information Systems (OOIS'97), Brisbane, Australia, November 1997.
[WAC+93] Darrell Woelk, Paul Attie, Phil Cannata, Greg Meredith, Amit
Sheth, Munindar Singh, and Christine Tomlinson. Task scheduling using
intertask dependencies in Carnot. In Proc. of ACM SIGMOD Conf. on
Management of Data, pages 491-494, 1993.
[WBJ+95] Darrell Woelk, Bill Bohrer, Nigel Jacobs, K. Ong, Christine
Tomlinson, and C. Unnikrishnan. Carnot and infosleuth: Database
technology and the world wide web. In Proc. of ACM SIGMOD Conf. on
Management of Data, pages 443-444, San Jose, CA, 1995.
[Web98] Proceedings of the International Workshop on the Web and
Databases, Valencia, Spain, March 1998.
http://poincare.dia.uniroma3.it:8080/webdb98/papers.html.
[WWW98] Proceedings of 3rd WWW Caching Workshop, Manchester, England,
June 1998.
[XML98] http://w3c.org/XML/, 1998.
[YL87] H. Z. Yang and P. A. Larson. Query transformation for PSJ-queries.
In Proc. of the Int. Conf. on Very Large Data Bases (VLDB), pages
245-254, Brighton, England, 1987.
[YPAGM98] Ramana Yerneni, Yannis Papakonstantinou, Serge Abiteboul, and
Hector Garcia-Molina. Fusion queries over internet databases. In Proc. of
the Conf. on Extending Database Technology (EDBT), pages 57-71, Valencia,
Spain, 1998.
[ZGM98] Yue Zhuge and Hector Garcia-Molina. Graph structured views and
their incremental maintenance. In Proc. of Int. Conf. on Data Engineering
(ICDE), Orlando, Florida, February 1998.