Semistructured Data
Citation for published version: Buneman, P 1997, Semistructured Data. In PODS '97 Proceedings of the sixteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems. ACM, pp. 117-121. https://doi.org/10.1145/263661.263675
Document Version: Early version, also known as pre-print
Peter Buneman
Department of Computer and Information Science
University of Pennsylvania
Philadelphia, PA
peter@cis.upenn.edu

Abstract

In semistructured data, the information that is normally associated with a schema is contained within the data, which is sometimes called "self-describing". In some forms of semistructured data there is no separate schema; in others it exists but only places loose constraints on the data. Semistructured data has recently emerged as an important topic of study for a variety of reasons. First, there are data sources such as the Web, which we would like to treat as databases but which cannot be constrained by a schema. Second, it may be desirable to have an extremely flexible format for data exchange between disparate databases. Third, even when dealing with structured data, it may be helpful to view it as semistructured for the purposes of browsing. This tutorial will cover a number of issues surrounding such data: finding a concise formulation, building a sufficiently expressive language for querying and transformation, and optimization problems.

The motivation

The topic of semistructured data (also called unstructured data) is relatively recent, and a tutorial on the topic may well be premature. It represents, if anything, the convergence of a number of lines of thinking about new ways to represent and query data that do not completely fit with conventional data models. The purpose of this tutorial is to describe this motivation and to suggest areas in which further research may be fruitful. For a similar exposition, the reader is referred to Serge Abiteboul's recent survey paper.

The slides for this tutorial will be made available from a section of the Penn database home page, http://www.cis.upenn.edu/~db. This work was partly supported by the Army Research Office (DAAH) and the National Science Foundation (CCR).

Some data really is unstructured

The most obvious motivation comes from the need to bring new forms of data into the ambit of conventional database technology. Some of these, such as documents with structured text and data formats, while they may call for increasingly expressive query languages and new optimization techniques, only require mild extensions to the existing notion of data models such as ODMG. However, these extensions still require the prior imposition of structure on the data, and there are some forms of data for which this is genuinely difficult.

The most immediate example of data that cannot be constrained by a schema is the World-Wide Web. As database researchers we would like to think of this as a database, but to what extent are database tools available for querying or maintaining the web? Most web queries exploit information retrieval techniques to retrieve individual pages from their contents, but there is little available that allows us to use the structure of the web in formulating queries, and since the web does not obviously conform to any standard data model, we need a method of describing its structure.

Another example, little known to the database community but responsible for piquing the author's interest in this topic, is the database management system ACeDB, which is popular with biologists. Superficially it looks like an object-oriented database system, for it has a schema language that resembles that of an object-oriented DBMS, but this schema imposes only loose constraints on the data. Moreover, the relationship between data and schema is not easily described in object-oriented terms, and there are structures that are naturally expressed in ACeDB, such as trees of arbitrary depth, that cannot be queried using conventional techniques.

Data Integration

A second motivation is that of data exchange and transformation, which is the starting point for the Tsimmis project at Stanford. The rationale here is that none of the existing data models is all-embracing, so that it is difficult to build software that will easily convert between two disparate models. The Object
Exchange Model (OEM) offers a highly flexible data structure that may be used to capture most kinds of data and provides a substrate in which almost any other data structure may be represented. In effect, OEM is an internal data structure for exchange of data between DBMSs, but having such a structure invites the idea of querying data in OEM format directly.

[Figure: An example movie database. An edge-labeled graph whose edge labels include Entry, Movie, TV Show, Title, Cast, Director, Episode, Credit, Actors, Special Guests, References and Is referenced in, and whose data values include "Casablanca", "Play it again, Sam", "Bogart", "Bacall", "Allen", 1, 2, 3 and 1.2E6.]

Browsing

A final motivation is that of browsing. Generally speaking, a user cannot write a database query without knowledge of the schema. However, schemas may have opaque terminology, and the rationale for the design is often difficult to understand. It may help in understanding the schema to be able to query data without full knowledge of the schema. For example, the queries:

- Where in the database is the string "Casablanca" to be found?
- Are there integers in the database greater than 2^16?
- What objects in the database have an attribute name that starts with "act"?

Such questions cannot be answered in any generic fashion by standard relational or object-oriented query languages. While languages have been proposed that allow schema and data to be queried simultaneously in the context of relational and object-oriented database systems, these languages do not have the flexibility to express complex constraints on paths, and it is not clear how their implementation will work on the structures described below.

The Model

The unifying idea in semistructured data is the representation of data as some kind of graph-like or tree-like structure. Although we shall allow cycles in the data, we shall generally refer to these graphs as trees. The example in the figure is taken from work in which the data model is formalized as an edge-labeled graph. The structure is taken (with some inaccuracies) from a well-known web database that provides a good example of semistructured data. There are several things to note about it. If one confines one's attention to the parts of the database below Movie edges, the data appears fairly regular, except that there are two ways of representing the cast. That is, the data does not quite fit with some relational or object-oriented presentation. Edges are labeled both with data, of types such as int and string and possibly other base or external abstract types (video, audio, etc.). Edges are also labeled with names such as Movie and Title that would normally be used for attribute or class names. We shall refer to such labels as symbols. Internally they are represented as strings. Note that arrays may be represented by labeling internal edges with integers. We can formulate the type of this kind of labeled tree as

    type label = int | string | ... | symbol
    type tree = set(label × tree)

The first line describes a tagged union or variant; the second says that a tree is a set of label/tree pairs. The edges out of nodes in our trees are assumed to be unordered.

There are a number of variations on this basic model, and it is worth briefly reviewing them. In one variation, leaf nodes are labeled with data, internal nodes are not labeled with meaningful data, and edges are labeled only with symbols:

    type base = int | string | ...
    type tree = base | set(symbol × tree)

The differences between the two models are minor and give rise to minor differences in the query language. It is easy to define mappings in both directions.

Another possibility is to allow labels on internal nodes, for example

    type label = int | string | ... | symbol
    type tree = label × set(label × tree)

The problem with using this representation directly is that it makes the operation of taking the union of two trees difficult to define. However, by introducing extra edges, this representation can be converted into one of the edge-labeled representations above.

A final and more complex issue is that of object identity, by which we mean node labels (or possibly edge labels) that, apart from an equality test, are not observable in the query language. In OEM, object identities are used as node labels and placeholders to define trees. While object identities provide an efficient way to define and test equality
The select
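As a rough illustration of the edge-labeled tree type given under "The Model" (a label is an int, a string, or a symbol, and a tree is a set of label/tree pairs) and of the schema-free questions listed under "Browsing", the following Python sketch may help. It is not taken from the paper: the names `Symbol`, `Tree`, `paths_to` and `render` are hypothetical, a list of edges stands in for the unordered set, breadth-first traversal is simply one convenient choice, and the sample data is only loosely modeled on the movie example.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Iterator, List, Optional, Tuple, Union


@dataclass(frozen=True)
class Symbol:
    """Edge label that would normally be an attribute or class name."""
    name: str


# type label = int | string | ... | symbol
Label = Union[int, str, Symbol]
Path = Tuple[Label, ...]


@dataclass(eq=False)  # nodes compare by identity, so cyclic data is safe
class Tree:
    """type tree = set(label x tree); a list stands in for the unordered set."""
    edges: List[Tuple[Label, "Tree"]] = field(default_factory=list)

    def add(self, label: Label, child: Optional["Tree"] = None) -> "Tree":
        """Attach an edge and return its target (a fresh node by default)."""
        child = Tree() if child is None else child
        self.edges.append((label, child))
        return child


def paths_to(root: Tree, wanted: Callable[[Label], bool]) -> Iterator[Path]:
    """Breadth-first walk; yields one label path per edge whose label matches."""
    seen = set()                      # ids of nodes already expanded
    queue = deque([(root, ())])
    while queue:
        node, path = queue.popleft()
        if id(node) in seen:          # guards against cycles and shared nodes
            continue
        seen.add(id(node))
        for label, child in node.edges:
            here = path + (label,)
            if wanted(label):
                yield here
            queue.append((child, here))


def render(path: Path) -> str:
    """Show symbols bare and data values quoted, separated by dots."""
    return ".".join(l.name if isinstance(l, Symbol) else repr(l) for l in path)


# Hypothetical fragment, loosely modeled on the movie example.
db = Tree()
m1 = db.add(Symbol("Entry")).add(Symbol("Movie"))
m1.add(Symbol("Title")).add("Casablanca")
cast1 = m1.add(Symbol("Cast"))
cast1.add("Bogart")
cast1.add("Bacall")

m2 = db.add(Symbol("Entry")).add(Symbol("Movie"))
m2.add(Symbol("Title")).add("Play it again, Sam")
m2.add(Symbol("Cast")).add(Symbol("Actors")).add("Allen")
m2.add(Symbol("Director")).add("Allen")

show = db.add(Symbol("Entry")).add(Symbol("TV Show"))
episodes = show.add(Symbol("Episode"))
for n in (1, 2, 3):                   # integer labels act as array indices
    episodes.add(n)

m2.add(Symbol("References"), m1)      # these two edges close a cycle
m1.add(Symbol("Is referenced in"), m2)

# The three browsing questions, phrased as predicates on labels.
where_is = list(paths_to(db, lambda l: l == "Casablanca"))
big_ints = list(paths_to(db, lambda l: isinstance(l, int) and l > 2 ** 16))
act_attrs = list(paths_to(
    db, lambda l: isinstance(l, Symbol) and l.name.lower().startswith("act")))

for p in where_is + act_attrs:
    print(render(p))
```

Because each predicate inspects labels rather than a schema, the same traversal serves all three questions. Keeping `Symbol` distinct from plain strings is what lets the sketch tell an attribute name such as Title apart from a data value such as "Casablanca".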