Union Types for Semistructured Data
Edinburgh Research Explorer

Citation for published version: Buneman, P & Pierce, B 1999, Union Types for Semistructured Data. in Union Types for Semistructured Data: 7th International Workshop on Database Programming Languages, DBPL’99 Kinloch Rannoch, UK, September 1–3, 1999 Revised Papers. Lecture Notes in Computer Science, vol. 1949, Springer-Verlag GmbH, pp. 184-207. https://doi.org/10.1007/3-540-44543-9_12

Digital Object Identifier (DOI): 10.1007/3-540-44543-9_12
Document Version: Peer reviewed version
Published In: Union Types for Semistructured Data
Download date: 27 Sep. 2021

Peter Buneman, Benjamin Pierce
University of Pennsylvania, Dept. of Computer and Information Science, South rd Street, Philadelphia, PA, USA
{peter,bcpierce}@cis.upenn.edu
Technical report MS-CIS, Corrected Version, July

Abstract

Semistructured databases are treated as dynamically typed: they come equipped with no independent schema or type system to constrain the data. Query languages that are designed for semistructured data, even when used with structured data, typically ignore any type information that may be present. The consequences of this are what one would expect from using a dynamic type system with complex data: fewer guarantees on the correctness of applications. For example, a query that would cause a type error in a statically typed query language will return the empty set when applied to a semistructured representation of the same data.

Much semistructured data originates in structured data. A semistructured representation is useful when one wants to add data that does not conform to the original type, or when one wants to combine sources of different types. However, the deviations from the prescribed types are often minor, and we believe that a better strategy than throwing away all type information is to preserve as much of it as possible. We describe a system of untagged union types that can accommodate variations in structure while still allowing a degree of static type checking. A novelty of this system is that it involves non-trivial equivalences among types, arising from a law of distributivity for records and unions: a value may be introduced with one type (e.g., a record containing a union) and used at another type (a union of records). We describe programming and query language constructs for dealing with such types, prove the soundness of the type system, and develop algorithms for subtyping and typechecking.

Introduction

Although semistructured data has, by definition, no schema, there are many cases in which the data
obviously possesses some structure, even if it has mild deviations from that structure. Moreover, it typically has this structure because it is derived from sources that have structure. In the process of annotating data or combining data from different sources, one needs to accommodate the irregularities that are introduced by these processes. Because there is no way of describing "mildly irregular" structure, current approaches start by ignoring the structure completely, treating the data as some dynamically typed object such as a labelled graph, and then perhaps attempting to recover some structure by a variety of pattern matching and data mining techniques [NAM, Ali]. The purpose of this structure recovery is typically to provide optimization techniques for query evaluation or efficient storage structures, and it is partial. It is not intended as a technique for preserving the integrity of data or for any kind of static typechecking of applications.

When data originates from some structured source, it is desirable to preserve that structure if at all possible. The typical cases in which one cannot require rigid conformance to a schema arise when one wants to annotate or modify the database with unanticipated structure, or when one merges two databases with slight differences in structure. Rather than forgetting the original type and resorting to a completely dynamic type, we believe a more disciplined approach to maintaining type information is appropriate. We propose here a type system that can "degrade gracefully" if sources are added with variations in structure, while preserving the common structure of the sources where it exists.

The advantages of this approach include:

The ability to check the correctness of programs and queries on semistructured data. Current semistructured query languages [BDHS, AQM+, DFF+] have no way of providing type errors; they typically return the empty answer on data whose type does not conform to the type assumed by the query.

The
ability to create data at one type and query it at another (equivalent) type. This is a natural consequence of using a flexible type system for semistructured data.

New query language constructs that permit the efficient implementation of "case" expressions and increase the expressive power of OQL-style query languages.

As an example, biological databases often have a structure that can be expressed naturally using a combination of tuples, records, and collection types. They are typically cast in special-purpose data formats, and there are groups of related databases, each expressed in some format that is a mild variation on some original format. These formats have an intended type, which could be expressed in a number of notations. For example, a source (source1) could have type

    set([id: Int, description: Str, bibl: set([title: Str, authors: list([name: Str, address: Str]), year: Int])])

A second source (source2) might yield a closely related structure:

    set([id: Int, description: Str, bibl: set([title: Str, authors: list([fn: Str, ln: Str, address: Str]), year: Int])])

This differs only in the way in which author names are represented. (This example is fictional, but not far removed from what happens in practice.)

The usual solution to this problem in conventional programming languages is to represent the union of the sources using some form of tagged union type:

    set(⟨tag1: [id: Int, ...], tag2: [id: Int, ...]⟩)

The difficulty with this solution is that a program such as

    for each x in source1 do print(x.description)

that worked on source1 must now be modified to

    for each x in source1 union source2 do
      case x of ⟨tag1 = y1⟩ ⇒ print(y1.description)
              | ⟨tag2 = y2⟩ ⇒ print(y2.description)

in order to work on the union of the sources, even though the two branches of the case statement contain identical code. This is also true for the few database query languages that deal with tagged union types [BLS+].

Contrast this with a typical semistructured query:

    select [description = d, title = t]
    where [description = d, bibl.Title = t] ← source1

This query works by pattern matching
based on the dynamically determined structure of the data. Thus the same query works equally well against either of the two sources, and hence also against their union. The drawback of this approach, however, is that incorrect queries (for example, queries that use a field that does not exist in either source) yield the empty set rather than an error.

In this paper we define a system that combines the advantages of both approaches, based on a system of type-safe untagged union types. As a first example, consider the two forms of the author field in the types above. We may write the union of these types as:

    [name: Str, address: Str] ∨ [ln: Str, fn: Str, address: Str]

It is intuitively obvious that an address can always be extracted from a value of such a type. To express this formally, we begin by writing a multi-field record type [l_1: T_1, l_2: T_2] as a product of single-field record types: [l_1: T_1] × [l_2: T_2]. In this more basic form, the union type above is:

    [name: Str] × [address: Str] ∨ [ln: Str] × [fn: Str] × [address: Str]

We now invoke a distributivity law that allows us to treat

    [a: T_a] × ([b: T_b] ∨ [c: T_c])   and   ([a: T_a] × [b: T_b]) ∨ ([a: T_a] × [c: T_c])

as equivalent types. Using this, the union type above rewrites to:

    ([name: Str] ∨ [fn: Str] × [ln: Str]) × [address: Str]

In this form, it is evident that the selection of the address field is an allowable operation. Type equivalences like this distributivity rule allow us to introduce a value at one type and operate on it at another type. Under this system, both the program and the query above will typecheck when extended to the union of the two sources. On the other hand, queries that reference a field that is not in either source will fail to typecheck.

Some care is needed in designing the operations for manipulating values of union types. Usually, the interrogation operation for records is field selection, and the corresponding operation for unions is a case expression. However, it is not enough simply to use these two operations. Consider the type

    ([a_1: T_1] ∨ [b_1: U_1]) × ... × ([a_n: T_n] ∨ [b_n: U_n])

The form of this type warrants neither selecting a field nor using a case expression. We can