Edinburgh Research Explorer

Semistructured Data

Citation for published version: Buneman, P 1997, Semistructured Data. in PODS '97 Proceedings of the sixteenth ACM SIGACT- SIGMOD-SIGART symposium on Principles of systems. ACM, pp. 117-121. https://doi.org/10.1145/263661.263675

Digital Object Identifier (DOI): 10.1145/263661.263675

Link: Link to publication record in Edinburgh Research Explorer

Document Version: Early version, also known as pre-print

Published In: PODS '97 Proceedings of the sixteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems

General rights Copyright for the publications made accessible via the Edinburgh Research Explorer is retained by the author(s) and / or other copyright owners and it is a condition of accessing these publications that users recognise and abide by the legal requirements associated with these rights.

Take down policy The University of Edinburgh has made every reasonable effort to ensure that Edinburgh Research Explorer content complies with UK legislation. If you believe that the public display of this file breaches copyright please contact [email protected] providing details, and we will remove access to the work immediately and investigate your claim.

Download date: 27. Sep. 2021



Semistructured Data

Peter Buneman

Department of Computer and Information Science

University of Pennsylvania

Philadelphia PA

p etercisup ennedu

Some data really is unstructured Abstract

The most obvious motivation comes from the need to bring

In semistructured data the information that is normally as

new forms of data into the ambit of conventional database

so ciated with a schema is contained within the data which is

technology Some of these such as do cuments with struc

sometimes called selfdescribing In some forms of semi

tured text and data formats while they may

structured data there is no separate schema in others it

call for increasingly expressive query languages and new op

exists but only places lo ose constraints on the data Semi

timization techniques only require mild extensions to the

structured data has recently emerged as an imp ortant topic

existing notion of data mo dels such as ODMG How

of study for a variety of reasons First there are data sources

ever these extensions still require the prior imp osition of

such as the Web which we would like to treat as

structure on the data and there are some forms of data for

but which cannot b e constrained by a schema Second it

which this is genuinely dicult

may b e desirable to have an extremely exible format for

The most immediate example of data that cannot b e con

data exchange b etween disparate databases Third even

strained by a schema is the WorldWideWeb As database

when dealing with structured data it may b e helpful to view

researchers we would like to think of this as a database but

it as semistructured for the purp oses of browsing This tu

to what extent are database to ols available for querying or

torial will cover a numb er of issues surrounding such data

maintaining the web Most web queries exploit information

nding a concise formulation building a suciently expres

retrieval techniques to retrieve individual pages from their

sive language for querying and transformation and opti

contents but there is little available that allows us to use

mization problems

the structure of the web in formulating queries and since

the web do es not obviously conform to any standard data

The motivation

mo del we need a metho d of describing its structure

Another example little known to the database commu

The topic of semistructured data also called unstructured

nity but resp onsible for piquing the authors interest in this

data is relatively recent and a tutorial on the topic may

topic is the database management system ACeDB which

well b e premature It represents if anything the conver

is p opular with biologists Sup ercially it lo oks like

gence of a numb er of lines of thinking ab out new ways to

an ob jectoriented database system for it has a schema

represent and query data that do not completely t with

language that resembles that of an ob jectoriented DBMS

conventional data mo dels The purp ose of this tutorial is

but this schema imp oses only lo ose constraints on the data

to to describ e this motivation and to suggest areas in which

Moreover the relationship b etween data and schema is not

further research may b e fruitful For a similar exp osition

easily describ ed in ob jectoriented terms and there are struc

the reader is referred to Serge Abiteb ouls recent survey pa

tures that are naturally expressed in ACeDB such as trees of

p er

arbitrary depth that cannot b e queried using conventional

The slides for this tutorial will b e made available from a

techniques

section of the Penn database home page

httpwwwcisupennedudb

Data Integration



This work was partly supp orted by the Army Research Oce

A second motivation is that of data exchange and transfor

DAAH and the National Science Foundation CCR

mation which is the starting p oint for the Tsimmis pro ject

at Stanford The rationale here is that none of the

existing data mo dels is allembracing so that it is dicult

to build software that will easily convert b etween two dis

parate mo dels The Ob ject Exchange Mo del OEM oers

a highly exible data structure that may b e used to cap

ture most kinds of data and provides a substrate in which

almost any other data structure may b e represented In ef

fect OEM is an internal data structure for exchange of data

b etween DBMSs but having such a structure invites the

idea of querying data in OEM format directly Entry Entry Entry

References Movie Movie TV Show

Is referenced in

Title Cast Director Title Cast Director Title Cast Episode

“Play it again, Sam” “Casablanca” “Bogart” “Bacall” Credit Actors 1 2 3 1.2E6 Actors

“Allen” Special Guests Director

“Allen”

Figure An example movie database

Browsing b oth with data of typ es such as int and string and p ossi

bly other base or external abstract typ es video audio etc

A nal motivation is that of browsing Generally sp eaking

Edges are also with names such as Movie and Title that

a user cannot write a database query without knowledge

would normally b e used for attribute or class names We

of the schema However schemas may have opaque termi

shall refer to such lab els as symbols Internally they are rep

nology and the rationale for the design is often dicult to

resented as strings Note that arrays may b e represented by

understand It may help in understanding the schema to b e

lab eling internal edges with integers We can formulate the

able to query data without full knowledge of the schema

typ e of this kind of lab eled tree as

For example the queries

type label int j string j j symbol

Where in the database is the string Casablanca to

type tree setlabel tree

b e found

The rst line describ es a tagged union or variant the

16

Are there integers in the database greater than

second says that a tree is a set of lab eltree pairs The edges

out of no des in our trees are assumed to b e unordered

What ob jects in the database have an attribute name

There are a numb er of variations on this basic mo del

that starts with act

and it is worth briey reviewing them In leaf no des

Such questions cannot b e answered in any generic fashion

are lab eled with data internal no des are not lab eled with

by standard relational or ob jectoriented query languages

meaningful data and edges are lab eled only with symb ols

While languages have b een prop osed that allow schema and

type base int j string j

data to b e queried simultaneously in the context of

type tree base j setsymbol tree

relational and ob jectoriented database systems these lan

guages do not have the exibility to express complex con

The dierences b etween the two mo dels are minor and

straints on paths and it is not clear how their implementa

give rise to minor dierences in the query language It is

tion will work on the structures describ ed b elow

easy to dene mappings in b oth directions

Another p ossibility is to allow lab els on internal no des

The Mo del for example

type base int j string j j symbol

The unifying idea in semistructured data is the representa

type tree label setlabel tree

tion of data as some kind of graphlike or treelike structure

Although we shall allow cycles in the data we shall gener

The problem with using this representation directly is

ally refer to these graphs as trees The example in gure is

that it makes the op eration of taking the union of two trees

taken from in which the data mo del is formalized as an

dicult to dene However by intro ducing extra edges

edge labeled graph The structure is taken with some inac

this represaentation can b e converted into one of the edge

curacies from a wellknown web database that provides

lab elled representations ab ove

a go o d example of semistructured data There are several

A nal and more complex issue is that of ob ject identity

things to note ab out it If one connes ones attention to the

by which we mean no de lab els or p ossibly edge lab els

parts of the database b elow Movie edges the data app ears

that apart from an equality test are not observable in

fairly regular except that there are two ways of representing

the query language In OEM ob ject identities are used as

a cast That is the data do es not quite t with some re

no de lab els and placeholders to dene trees While ob ject

lational or ob jectoriented presentation Edges are lab eled

identities provide an ecient way to dene and test equality

The select fragment of UnQL and the Lorel query within a database they p ose problems when comparing data

language solve these problems with very similar syntactic across databases See discussions and related

forms Lorel which is a comp onent of the Lore pro ject work

requires a rich set of overloadings for its op erators for deal It is straightforward to enco de relational and ob ject

ing with comparisons of ob jects with values and of values oriented databases in this mo del although in the latter case

with sets These are avoided in UnQL by not having ob ject one must take care to deal with the issue of ob jectidentity

identity and exploiting a simple form of pattern matching However the co ding is not unique and the examples in

Other languages that use a SQLlike syntax include a pre and show some dierences in how tuples of sets are

cursor to Lorel and WebSQL which contains a treated

numb er of constructs sp ecic to web queries A language The term self describing is often used to describ e un

for web site management is prop osed in structured data In each of the mo dels we have describ ed

Having asked what the surface syntax should lo ok like the data is a tagged union typ e and one can imagine a pro

one wants to ask what the underlying computational strat gram whose b ehavior is dynamically determined by switch

egy should b e Here there app ear to b e two principled strate ing on the typ e For example a programs b ehavior may

gies The rst is to mo del the graph as a relational database b e altered by whether it nds an integer or string as a lab el

and then exploit a relational query language In our lab eled and one would exp ect any language for dealing with semi

graph mo del this is remarkably simple We can take the structured data to incorp orate predicates that describ e the

database as a large relation of typ e no deid lab el no deid typ e of an edge or no de The situation is similar to that

and consider the expressive p ower of relational languages in programming languages Lisp and many interpreted and

on this structure but this apparently simple approach has scripting languages are dynamical ly typ ed Predicates are

a numb er of complications available to determine at run time typ e of a value or class

of an ob ject Languages in the Algol tradition Pascal C

Our lab els are drawn from a heterogeneous collection

ML Mo dula are statical ly typ ed Predicates are not needed

of typ es so it may b e appropriate to use more than

to determine the typ e of a value b ecause it is known from the

one relation

source co de of the program and hence to the programmer

There is a go o d analogy b etween dynamic typ e systems and

If information also is held at no des one needs addi

semistructured data on one hand and static typ e systems

tional relations to express this

and databases with schemas on the other

The no de identiers may only b e used as temp orary

no de lab els and one may want to limit the way they

Query Languages

can app ear in the output of the query How they are

used is related to the discussion of ob ject identity

There app ear to b e two general approaches to devising query

languages for semistructured data First take SQL or p er

We are concerned with what is accessible from a given

haps OQL as a starting p oint and add enough fea

ro ot by forward traversal of the edges and one may

tures to p erform a useful class of queries The second ap

want to limit the languages appropriately

proach is to start from a language based on some formal

notion of computation on semistructured data then to mas

Some forms of unb ounded search will require recursive

sage that language into acceptable syntax It is remarkable

queries ie a graph datalog and such languages are pro

that the two approaches app ear to end up with very similar

p osed in for the web and for hyp ertext Theoretical

languages

treatments of queries that deal with computation on graphs

Let us start with the rst approach to see what what

or on the web app ear in It should also b e mentioned

kinds of queries are useful The following SQLlike syntax

that this mo del of computation is used in as a starting

suggests itself

p oint for optimization

The second strategy is adopted in the basis for UnQL

select EntryMovieTitle

Here the starting p oint is that of structural recur

from DB

sion and is an extension of a principle put forward in

where EntryMovieDirector

that there are natural forms of computation asso ciated with

the typ e For semistructured data one starts with the natu

However the syntax do es not make clear how much of the

ral form of recursion asso ciated with the recursive datatyp e

two paths EntryMovieTitle and EntryMovieDirector

of lab eled trees However some restrictions need to b e

are to b e taken as the same The solution is to intro duce

placed for such recursive programs to b e welldened we

variables to indicate how paths or edges are to b e tied to

want them to b e welldened on graphs with cycles These

gether These variables can then b e used in other expres

restrictions give rise to an algebra that can b e viewed as

sions to form new structures Lab el variables tree variables

having two comp onents a horizontal comp onent that ex

and p ossibly path variables are needed to express a reason

presses computations across the edges of a given no de and

able set of queries

from this computations to a xed depth from the ro ot and

The next problem is that one wants to sp ecify paths of

a vertical comp onent that expresses computations that go

arbitrary length to nd for example all the strings in the

to arbitrary depths in the graph A prop erty of this algebra

database This requires us to b e able to express arbitrary

is that when restricted to input and output data that con

paths in our syntax Even this is not enough Consider the

form to a relational nested relational schema it expresses

problem of nding whether Allen acted in Casablanca

exactly the relational nested relational algebra Hence an

One might try this by searching for paths from a Movie

SQLlike language is a natural fragment of UnQL

edge down to an Allen edge but one would not want

The SQL or OQL like languages we have mentioned typ

this path to contain another Movie edge These problems

ically bring information to the surface but they are not

indicate that one would like to have something like regular

capable of p erforming complex or deep restructuring of

expressions to constrain paths

the data Simple examples of such op erations include delet References

ingcollapsing edges with a certain prop erty relab eling edges

Serge Abiteb oul Querying semistructured data In

or p erforming lo cal interchanges Both graph datalog and

Proceedings of ICDT Jan

UnQL are capable of various forms of restructuring For ex

ample in UnQL one can write a query that corrects the

Serge Abiteb oul Sophie Cluet Vassilis Christophides

egregious error in the Bacall edge lab el One can also

Tova Milo and Jerome Simeon Querying do cuments

p erform a numb er of global restructuring functions such as

in ob ject databases In Journal of Digital Libraries

deleting edges with certain prop erties or adding new edges

volume

to shortcircuit various paths The the relationship b e

tween the restructuring p ossible in UnQL and what can b e

Serge Abiteb oul Sophie Cluet and Tova Milo Query

done in graph datalog is not understo o d Some simple

ing and up dating the le In Proceedings of th In

forms of restructuring are also present in a view denition

ternational Conference on Very Large Databases pages

language prop osed in

Dublin Ireland

Serge Abiteb oul Roy Goldman Jason McHugh Vasilis

Implementation and Optimizations

Vassalos and Yue Zhuge Views for semistructured

data Technical rep ort Stanford University

This topic is very much in its infancy and again dep ends on

the underlying representation of the data Moreover the op

Serge Abiteb oul Dallan Quass Jason McHugh Jen

timization prblems dier dep ending on whether one is using

nifer Widom and Janet L Weiner The lorel query

a semistructured mo del as an interface to existing data or

language for semistructured data In Journal of Dig

one is building a data structure to represent semistructured

ital Libraries volume To app ear See

data directly In the former case the extensions of ex

httpwwwdbstanfordedupu bpap ers

isting techniques for optimization of ob jectoriented or re

lational query languages mentioned ab ove may b e exploited

Serge Abiteb oul and Victor Vianu Queries and compu

together with the addition of path or text indices on lab els

tation on the web In Proceedings of ICDT Jan

and strings In the second case disk layout and clustering

Gustavo O Aro cena Alb erto O Mendelzon and

together with appropriate indexing is also imp ortant

George A Mihaila Applications of a Web query lan

In a large class of computations can b e shown to

guage In Proc th Intl WWW Conf April In

b e translatable into a basic graph transformation technique

press

which in turn allows some simple optimizations Also some

of the basic optimizations of the relational algebra can b e

P Buneman S Davidson Mary Fernandez and D Su

applied to the vertical computations In it is shown

ciu Adding structure to unstructured data In Pro

how an analysis of the query combined with some segmen

ceedings of ICDT January

tation of the graph into lo cal sites can b e used to decom

p ose a query into indep endent parallel subqueries In

P Buneman SB Davidson K Hart C Overton and

and extensions to optimization techniques for ob ject

L Wong A data transformation system for biological

oriented query languages are exploited In a translation

data sources In Proceedings of VLDB Sept

is sp ecied for a fragment UnQL into a an underlying rela

Susan Davidson Gerd Hillebrand and

tional structure

Dan Suciu A query language and optimization tech

niques for unstructured data In Proceedings of ACM

Adding Structure

SIGMOD International Conference on Management of

Data pages Montreal Canada June

One of the main attractions of semistructured data is that

it is unconstrained Nevertheless it may b e appropriate

Peter Buneman Susan Davidson and Pro

to imp ose or to discover some form of structure in the

gramming constructs for unstructured data In Proceed

data In a schema is dened as a graph whose edges are

ings of th International Workshop on Database Pro

lab eled with predicates and the prop erty of simulation is

gramming Languages Gubbio Italy Septemb er

used to describ e the relationship b etween data and schema

To app ear

In the schema is also an edge lab eled graph and the

stronger relationship of automata equivalence is used In

Peter Buneman Shamim Naqvi Val Tannen and Lim

schemas are used for further optimization

so on Wong Principles of programming with complex

Schemas are useful for browsing and for providing partial

ob jects and collection typ es Theoretical Computer Sci

answers to queries They will also b e needed for the passage

ence Septemb er

back from semistructured to structured data for which a

R G G Cattell editor The Object Database Standard

richer notion of schema is necessary This is an area in

ODMG Morgan Kaufmann San Mateo California

which much further work is needed

Acknowledgments

Sophie Cluet and Claude Delob el A general frame

work for the optimization of ob ject oriented queries In

I would like to thank Susan Davidson and Dan Suciu for

M Stonebraker editor Proceedings ACMSIGMOD In

their collab oration and for stimulating my interest in this

ternational Conference on Management of Data pages

area I am greatly indebted to Serge Abiteb oul for most

San Diego California June

constructive discussions on a numb er of issues

Sophie Cluet and Guido Mo erkotte Query pro cessing

in the schemaless and semistructured context Techni

cal rep ort INRIA

Mariano P Consens and Alb erto O Mendelzon Ex Alb erto O Mendelzon and Tova Milo Formal mo d

pressing structural hyp ertext queries in graphlog In els of web queries

Proc nd ACM Conference on Hypertext pages In Proc PODS May In press available in

Pittsburgh Novemb er ftpdbtorontoedupubpapers pods MM ps

Susan B Davidson Christian Overton Val Tan

S Nestorov J Ullman Weiner J and S Chawathe

nen and Limso on Wong Biokleisli A digital li

Representative ob jects Concise representations of

brary for biomedical researchers In Journal of Dig

semistructured hierarchical data In Proceedings of the

ital Libraries volume Novemb er See

Thirteenth International Conference on Data Engineer

httpwwwcisupennedu db

ing Birmingham England April

Mary Fernandez Daniela Florescu Jaewo o Kang Alon

Y Papakonstantinou S Abiteb oul and H Garcia

Levy and Dan Suciu STRUDEL A Web Site Manage

Molina Ob ject fusion in mediator systems In Proc

ment System In Proceedings of ACMSIGMOD Inter

nd VLDB conference Septemb er

national Conference on Management of Data Tuscon

Yannis Papakonstantinou Hector GarciaMolina and

May

Jennifer Widom Ob ject exchange across heterogenous

Mary Fernandez Lucian Popa and Suciu Dan

information sources In Proceedings of IEEE Interna

A structure based approach to querying semi

tional Conference on Data Engineering pages

structured data Manuscript available from

March

httpwwwresearchattcom info fmff suc iug

D Quass A Rjaraman Y Sagiv and J Ullman

Mary Fernandez and Dan

Querying semistructured heterogeneous information In

Suciu Query optimizations for semistructured data

Proceedings of the Fourth International Conference on

using graph schemas Manuscript available from

Deductive and Objectoriented Databases pages

httpwwwresearchattcom info fmff suc iug

dec

H GarciaMolina Y Papakonstantinou D Quass

Dan Suciu Query decomp osition for unstructured

A Ra jaraman Y Sagiv J Ullman and J Widom The

query languages In Proc nd VLDB conference

tsimmis approach to mediation Data mo dels and lan

Septemb er

guages In Proceedings of Second INternational Work

shop on Next Generation Information Technologies and

Jean ThierryMieg and Richard Durbin ACeDB

Systems pages June

A C elegans Database Syntactic denitions for the

ACeDB data base manager

Roy Goldman and Jennifer Widom Dataguides En

abling query formulation and optimization in semi

structured databases Technical rep ort Stanford

The internet movie database httpusimdbcom

M Kifer W Kim and Y Sagiv Querying ob ject

oriented databases In M Stonebraker editor Proceed

ings ACMSIGMOD International Conference on Man

agement of Data pages San Diego California

June

Anthony Kosky Observational prop erties of databases

with ob ject identity Technical Rep ort MSCIS

University of Pennsylvania

Laks VS Lakshmanan Fereido on Sadri and Iyer N

Subramanian A declarative language for querying

and restructuring the worldwideweb In PostICDE

IEEE Workshop on Research Issues in Data Engineer

ing RIDENDS New Orleans February See

also ftpftpcsconcordiacapu blak spa pers

J McHugh S Abiteb oul R Goldman D Quass and

J Widom Lore A database management system for

semistructured data Technical rep ort Stanford Uni

versity Database Group February

J McHugh and J Widom Integrating dynamically

fetched external information into a dbms for semi

structured data Technical rep ort Stanford University

Alb erto O Mendelzon George A Mihaila and Tova

Milo Querying the World Wide Web In Proc

PDIS pages Decemb er Available as

tpdbtorontoedupubpapers pdis p sgz