Edinburgh Research Explorer

Union Types for Semistructured Data

Citation for published version: Buneman, P & Pierce, B 1999, Union Types for Semistructured Data. in Union Types for Semistructured Data: 7th International Workshop on Database Programming Languages, DBPL’99 Kinloch Rannoch, UK, September 1–3, 1999 Revised Papers. Lecture Notes in Computer Science, vol. 1949, Springer-Verlag GmbH, pp. 184-207. https://doi.org/10.1007/3-540-44543-9_12

Digital Object Identifier (DOI): 10.1007/3-540-44543-9_12


Document Version: Peer reviewed version

Published In: Union Types for Semistructured Data


Union Types for Semistructured Data

Peter Buneman Benjamin Pierce

University of Pennsylvania

Dept. of Computer and Information Science

South rd Street

Philadelphia PA USA

{peter,bcpierce}@cis.upenn.edu

Technical report MS-CIS

Corrected Version

July

Abstract

Semistructured databases are treated as dynamically typed: they come equipped with no independent schema or type system to constrain the data. Query languages that are designed for semistructured data, even when used with structured data, typically ignore any type information that may be present. The consequences of this are what one would expect from using a dynamic type system with complex data: fewer guarantees on the correctness of applications. For example, a query that would cause a type error in a statically typed query language will return the empty set when applied to a semistructured representation of the same data.

Much semistructured data originates in structured data. A semistructured representation is useful when one wants to add data that does not conform to the original type or when one wants to combine sources of different types. However, the deviations from the prescribed types are often minor, and we believe that a better strategy than throwing away all type information is to preserve as much of it as possible. We describe a system of untagged union types that can accommodate variations in structure while still allowing a degree of static type checking.

A novelty of this system is that it involves non-trivial equivalences among types, arising from a law of distributivity for records and unions: a value may be introduced with one type (e.g., a record containing a union) and used at another type (a union of records). We describe programming and query language constructs for dealing with such types, prove the soundness of the type system, and develop algorithms for subtyping and typechecking.

Introduction

Although semistructured data has, by definition, no schema, there are many cases in which the data obviously possesses some structure, even if it has mild deviations from that structure. Moreover, it typically has this structure because it is derived from sources that have structure. In the process of annotating data or combining data from different sources, one needs to accommodate the irregularities that are introduced by these processes. Because there is no way of describing "mildly irregular" structure, current approaches start by ignoring the structure completely, treating the data as some dynamically typed object such as a labelled graph, and then perhaps attempting to recover some structure by a variety of pattern matching and data mining techniques [NAM, Ali]. The purpose of this structure recovery is typically to provide optimization techniques for query evaluation or efficient storage structures, and it is partial. It is not intended as a technique for preserving the integrity of data or for any kind of static type-checking of applications.

When data originates from some structured source, it is desirable to preserve that structure if at all possible. The typical cases in which one cannot require rigid conformance to a schema arise when one wants to annotate or modify the database with unanticipated structure, or when one merges two databases with slight differences in structure. Rather than forgetting the original type and resorting to a completely dynamic type, we believe a more disciplined approach to maintaining type information is appropriate. We propose here a type system that can "degrade gracefully" if sources are added with variations in structure, while preserving the common structure of the sources where it exists.

The advantages of this approach include:

- The ability to check the correctness of programs and queries on semistructured data. Current semistructured query languages [BDHS, AQM+, DFF+] have no way of reporting type errors: they typically return the empty answer on data whose type does not conform to the type assumed by the query.

- The ability to create data at one type and query it at another (equivalent) type. This is a natural consequence of using a flexible type system for semistructured data.

- New query language constructs that permit the efficient implementation of "case" expressions and increase the expressive power of OQL-style query languages.

As an example, biological databases often have a structure that can be expressed naturally using a combination of records and collection types. They are typically cast in special-purpose data formats, and there are groups of related databases, each expressed in some format that is a mild variation on some original format. These formats have an intended type, which could be expressed in a number of notations. For example, a source source1 could have type

    set([id: Int,
         description: Str,
         bibl: set([title: Str, authors: list([name: Str, address: Str]), year: Int])])

A second source, source2, might yield a closely related structure:

    set([id: Int,
         description: Str,
         bibl: set([title: Str, authors: list([fn: Str, ln: Str, address: Str]), year: Int])])

This differs only in the way in which author names are represented. (This example is fictional, but not far removed from what happens in practice.)

The usual solution to this problem in conventional programming languages is to represent the union of the sources using some form of tagged union type:

    set(⟨tag1: [id: Int, ...], tag2: [id: Int, ...]⟩)

The difficulty with this solution is that a program such as

    for each x in source1 do print(x.description)

that worked on source1 must now be modified to

    for each x in source1 union source2 do
      case x of
        ⟨tag1 = y1⟩ ⇒ print(y1.description)
      | ⟨tag2 = y2⟩ ⇒ print(y2.description)

in order to work on the union of the sources, even though the two branches of the case statement contain identical code. This is also true for the few database query languages that deal with tagged union types [BLS+].

Contrast this with a typical semistructured query:

    select [description = d, title = t]
    where  [description = d, bibl.Title = t] ← source1

This query works by pattern matching based on the (dynamically determined) structure of the data. Thus the same query works equally well against either of the two sources, and hence also against their union.¹ The drawback of this approach, however, is that incorrect queries (for example, queries that use a field that does not exist in either source) yield the empty set rather than an error.
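The drawback described above can be illustrated with a small Python sketch. The data, the helper `select_field`, and the misspelled label are all hypothetical; the sketch only mimics the behaviour of a dynamically typed query and is not the syntax of any of the cited query languages.

```python
# Hypothetical miniature of a source; the values are invented for illustration.
source1 = [
    {"id": 1, "description": "entry A", "bibl": [{"title": "T1", "year": 1998}]},
    {"id": 2, "description": "entry B", "bibl": [{"title": "T2", "year": 1999}]},
]


def select_field(source, label):
    """Project `label` from every record that happens to have it.

    Records lacking the field are skipped silently, which is how a
    dynamically typed pattern-matching query behaves.
    """
    return [record[label] for record in source if label in record]


good = select_field(source1, "description")
typo = select_field(source1, "descripton")  # misspelled on purpose

print(good)
print(typo)
```

The misspelled query produces an empty result instead of an error, which is exactly the kind of mistake a static type system is meant to catch.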

In this paper we define a system that combines the advantages of both approaches, based on a system of type-safe untagged union types. As a first example, consider the two forms of the author field in the types above. We may write the union of these types as:

    [name: Str, address: Str] ∨ [ln: Str, fn: Str, address: Str]

It is intuitively obvious that an address can always be extracted from a value of such a type. To express this formally, we begin by writing a multi-field record type $[l_1 : T_1, l_2 : T_2, \ldots]$ as a product of single-field record types: $[l_1 : T_1] \times [l_2 : T_2] \times \cdots$. In this more basic form, the union type above is:

    ([name: Str] × [address: Str]) ∨ ([ln: Str] × [fn: Str] × [address: Str])

We now invoke a distributivity law that allows us to treat

\[
[a : T_a] \times ([b : T_b] \vee [c : T_c]) \qquad\text{and}\qquad ([a : T_a] \times [b : T_b]) \vee ([a : T_a] \times [c : T_c])
\]

as equivalent types. Using this, the union type above rewrites to:

    ([name: Str] ∨ ([fn: Str] × [ln: Str])) × [address: Str]

In this form, it is evident that the selection of the address field is an allowable operation.

Type equivalences like this distributivity rule allow us to introduce a value at one type and operate on it at another type. Under this system both the program and the query above will type-check when extended to the union of the two sources. On the other hand, queries that reference a field that is not in either source will fail to type check.
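To make the rewriting step concrete, here is a minimal Python sketch that factors the fields shared by every alternative out of a union of record types, which amounts to reading the distributivity law from right to left. The dictionary encoding and the names `common_fields` and `factor` are assumptions made for illustration, and the sketch simplifies by treating a field as shared only when its type is identical in every alternative.

```python
# Each alternative of the union is a record type written as {label: type name}.
author_union = [
    {"name": "Str", "address": "Str"},
    {"fn": "Str", "ln": "Str", "address": "Str"},
]


def common_fields(alternatives):
    """Fields that occur, with the same type, in every alternative."""
    first, *rest = alternatives
    return {l: t for l, t in first.items() if all(alt.get(l) == t for alt in rest)}


def factor(alternatives):
    """Split a union of records into (shared part, union of remainders).

    (R x S) v (R x T) is read as R x (S v T): `shared` plays the role of R
    and `remainders` lists S and T.
    """
    shared = common_fields(alternatives)
    remainders = [
        {l: t for l, t in alt.items() if l not in shared} for alt in alternatives
    ]
    return shared, remainders


shared, remainders = factor(author_union)
print(shared)
print(remainders)
```

Under these assumptions a field selection such as `address` is allowed exactly when the label ends up in the shared part.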

Some care is needed in designing the operations for manipulating values of union types. Usually, the interrogation operation for records is field selection and the corresponding operation for unions is a case expression. However, it is not enough simply to use these two operations. Consider the type $([a_1 : T_1] \vee [b_1 : U_1]) \times \cdots \times ([a_n : T_n] \vee [b_n : U_n])$. The form of this type warrants neither selecting a field nor using a case expression. We can, if we want, use distributivity to rewrite it into a disjunct of products, but the size of this disjunct is exponential in $n$, and so, presumably, would be the corresponding case expression. We propose instead an extended pattern matching syntax that allows us to operate on the type in its original, compact form.

¹ One could also achieve the same effect through the use of inheritance rather than union types in some object-oriented language. This would involve the introduction of named classes with explicit subclass assertions. As we shall shortly see, the number of possible classes is exponential in the size of the type.

More sophisticated pattern matching operations may be useful additions even to existing semistructured query languages. Consider the problem of writing a query that produces a uniform output from a single source that contains two representations of names:

    select [description = d, name = n]
    where  [description = d, bibl.author.name = n] ← source
    union
    select [description = d, name = stringconcat(f, l)]
    where  [description = d, bibl.author.[ln = l, fn = f]] ← source

This is the only method known to the authors of expressing this query in current semistructured query languages. It suggests an inefficient execution model and may not have the intended semantics when, for example, the source is a list and one wants to preserve the order. Thus some enhancement to the syntax is desirable.
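The ordering problem mentioned above is easy to reproduce. The following Python sketch uses a hypothetical list-valued source with invented names; `two_pass` imitates the union of two queries, and `one_pass` imitates a single traversal that decides per element which representation it has.

```python
# Hypothetical ordered source mixing the two name representations.
source = [
    {"name": "Ada Lovelace"},
    {"fn": "Alan", "ln": "Turing"},
    {"name": "Grace Hopper"},
]

# Two separate queries whose results are glued together.
two_pass = [r["name"] for r in source if "name" in r] + [
    r["fn"] + " " + r["ln"] for r in source if "fn" in r and "ln" in r
]


def display_name(record):
    """Per-element case analysis on the shape of the record."""
    if "fn" in record and "ln" in record:
        return record["fn"] + " " + record["ln"]
    return record["name"]


# A single traversal keeps the elements in their original order.
one_pass = [display_name(r) for r in source]

print(two_pass)
print(one_pass)
```

The two-query form also scans the source twice, which is the inefficiency the text alludes to.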

This paper develops a type system based on untagged union types, along with operations to construct and deconstruct these types. In particular, we define a syntax of patterns that may be used both for an extended form of case expression and as an extension to existing query languages for semistructured data. We should remark that we cannot capture all aspects of semistructured query languages. For example, we have nothing that corresponds to "regular path expressions" [BDHS, AQM+]. However, we believe that for most examples of "mildly" semistructured data, especially the forms that arise from the integration of typed data sources, a language such as the one proposed here will be adequate. Our main technical contribution is a proof of the decidability of subtyping for this type system (which is complicated by the non-trivial equivalences involving union and record types).

To our knowledge, untagged union types have never been formalized in the context of database programming languages. Tagged union types have been suggested in several papers on data models [AH, CM] but have had minimal impact on the design of query languages. CPL [BLS+], for example, can match on only one tag of a tagged union, and this is one of the few languages that makes use of union types. Pattern matching has recently been exploited in languages for semistructured data and XML [BDHS, DFF+]. In the programming languages and type theory communities, on the other hand, untagged union types have been studied extensively from a theoretical perspective [Pie, BDCd, Hay, Dam, DCdP, etc.], but the interactions of unions with higher-order function types have been shown to lead to significant complexities; the present system provides only a very limited form of function types (like most database query languages) and remains reasonably straightforward.

Section 2 develops our language for programming with record and union types, including pattern matching primitives that can be used in both case expressions and query languages. Section 3 describes the system formally and demonstrates the decidability of subtyping and type equivalence. Proofs will be provided in the full paper. Section 4 offers concluding remarks.

Programming with Union Types

In this section we shall develop a syntax for the new programming constructs that are needed to deal with union types. The presentation is informal for the moment; more precise definitions appear in Section 3. We start with operations on records and extend these to work with unions of records; we then deal with operations on sets. Taken in conjunction with operations on records, these operations are enough to define a simple query language. We also look at operations on more general union types and give examples of a "typecase" operation.

Record formation

Just as we defined a record type $[l_1 : T_1, \ldots, l_n : T_n]$ as the product $[l_1 : T_1] \times \cdots \times [l_n : T_n]$ of elementary or singleton record types, we can define a record value as the disjoint concatenation of singleton records. The operations for creating records are the empty record $[\,]$, the singleton record $[l = e]$, where $e$ is an expression, and the disjoint concatenation of records $e \times e$. (Actually, we also allow multi-field record values of the form $[l_1 = e_1, \ldots, l_n = e_n]$, as this makes the operational semantics easier to state.)

Case expressions

Records are decomposed through the use of case expressions. These allow us to take alternative actions based on the structure of values. We shall also be able to use components of the syntax of case expressions in the development of matching constructs for query languages. The idea in developing a relatively complex syntax for the body of case expressions is that the structure of the body can be made to match the expected structure of the type of the value on which it is operating. There should be no need to "flatten" the type into disjunctive normal form and write a much larger case expression at that type.

We start with a simple example:

    case e of  [fn = f: Str, ln = l: Str] ⇒ stringconcat(f, l)
            |  [name = n: Str] ⇒ n

This matches the result of evaluating e to one of two record types. If the result is a record with fn and ln fields, the variables f and l are bound and the right-hand side of the first clause is evaluated. If the first pattern does not match, the second clause is tried. This case expression will work provided e has type

    [fn: Str, ln: Str] ∨ [name: Str]

We should note that pattern matching introduces identifiers such as l, f, n in this example, and we shall make a short-sighted assumption that identifiers are introduced when they are associated with a type ($x : T$). This ignores the possibility of type inference. See [BLS+] for a more sophisticated syntax for introducing identifiers in patterns.

Field selection is given by a one-clause case expression: case e of [l = x: T] ⇒ x.

We shall also allow case expressions to dispatch on the run-time type of an argument:

    case e of  x: Int ⇒ x
            |  y: set(Int) ⇒ sum(y)

This will type-check when e : Int ∨ set(Int).

The clauses of a case expression have the form $p \Rightarrow e$, where $p$ is a pattern that introduces (binds) identifiers which may occur free in the expression $e$. Thus each clause defines a function. Two or more functions can be combined by writing $p_1 \Rightarrow e_1 \mid p_2 \Rightarrow e_2 \mid \ldots$ to form another function. The effect of the case expression case e of f is to apply this function to the result of evaluating e.

Now suppose we want to extract information from a value of type

    ([name: Str] ∨ [ln: Str, fn: Str]) × [age: Int]

The age field may be extracted by field selection, using a one-clause case expression as described above. However, information from the left-hand component cannot be extracted by extending this case expression. What we need is something that will turn a multi-clause function back into a pattern that binds a new identifier. We propose the syntax x as f, in which f is a multi-clause function. In the evaluation of x as f, f is applied to the appropriate structure and x is bound to the result. For example,

    case e of
      (x as [fn = f: Str, ln = l: Str] ⇒ stringconcat(f, l) | [name = n: Str] ⇒ n)
      × [age = a: Int]
      ⇒ [name = x, age = a]

could be applied to an expression e of the type above. Note the use of × to combine two patterns so that they match on a product type. This symbol is used to concatenate patterns in the same way that it is used to concatenate record values.
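The reading of clauses as functions, and of x as f as "apply f and bind the result", can be sketched in Python as follows. This is only an illustration under stated assumptions: records are modelled as dictionaries, matching is checked dynamically rather than by a type system, and the helpers `clause`, `alternatives`, `full_name` and `name_and_age` are hypothetical names, not part of the language defined in the paper.

```python
NO_MATCH = object()  # sentinel meaning "this clause did not match"


def clause(labels, body):
    """One clause `p => e`: matches a dict that has every label in `labels`."""
    def run(value):
        if isinstance(value, dict) and all(l in value for l in labels):
            return body(value)
        return NO_MATCH
    return run


def alternatives(*clauses):
    """Compound function `f1 | f2 | ...`: the first matching clause wins."""
    def run(value):
        for c in clauses:
            result = c(value)
            if result is not NO_MATCH:
                return result
        return NO_MATCH
    return run


# [fn = f, ln = l] => stringconcat(f, l)  |  [name = n] => n
full_name = alternatives(
    clause(["fn", "ln"], lambda v: v["fn"] + v["ln"]),
    clause(["name"], lambda v: v["name"]),
)


def name_and_age(e):
    """(x as full_name) x [age = a]  =>  [name = x, age = a]."""
    x = full_name(e)  # "x as f": apply f, bind x to the result
    if x is NO_MATCH or "age" not in e:
        raise TypeError("value does not match the pattern")
    return {"name": x, "age": e["age"]}


print(name_and_age({"fn": "Ada", "ln": "Lovelace", "age": 36}))
print(name_and_age({"name": "Alan Turing", "age": 41}))
```

In the typed language the failure branch would be ruled out statically; here it surfaces as a run-time `TypeError`.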

There are some useful extensions to case expressions and pattern matching that we shall briefly mention here but omit in the formal development (they are essentially syntactic sugar). The first is the addition of a fall-through or else branch of a case expression. The pattern else matches any value that has not been matched in a previous clause. Most programming languages have an analogous construct.

Such branches are particularly useful if we allow constants in patterns. For example:

case e of name n Str age e

j else

Here, only tuples with a specific value for age are matched. Tuples with a different value will be matched in the else clause. Note that patterns bind variables and that, if one allows constants in patterns, one wants to discriminate between those variables that are used as constants and those that are bound in the pattern. CPL [BLS+] uses a special marker to flag bound variables. In that language, [name = n, age = \a] is a pattern in which a is bound and n is treated as a constant (it is bound in some outer scope). This extended syntax of patterns is especially convenient when used in query languages for sets.

Sets

We shall follow the approach to collection types given in [BNTW]. It is known that both relational and complex-object languages can be expressed using this formalism. The operations for forming sets are {e} (singleton set) and e union e (set union).² For iterating over a set we use the form

    collect e where p ← e'

Here, e and e' are both expressions of set type, and p is a pattern as described above. The meaning of this is (informally) $\bigcup \{\sigma(e) \mid \sigma(p) \in e'\}$, in which $\sigma$ is a substitution that binds the variables of $p$ to match an element of $e'$.

These operations, taken in conjunction with the record operations described above and an equality operation, may be used as the basis of practical query languages. Conditionals and booleans may be added, but they can also be simulated with case expressions and some appropriately chosen constants.

Unlike typed systems with tagged unions, in our system there is no formation operation directly associated with the union type. However, we may want to introduce operators such as "relaxed set-union", which takes two sets of type $\mathrm{set}(t_1)$ and $\mathrm{set}(t_2)$ and returns a set of type $\mathrm{set}(t_1 \vee t_2)$.
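The informal meaning of collect can be sketched in Python. The encoding is an assumption made for illustration: a pattern is a function that returns a dictionary of bindings (the substitution) or `None` when the element does not match, the body maps bindings to a set, and source "sets" are given as lists because Python dictionaries are not hashable. The names `collect` and `relaxed_union` are hypothetical.

```python
def collect(body, pattern, source):
    """Union of body(bindings) over all elements of `source` matched by `pattern`."""
    result = set()
    for element in source:
        bindings = pattern(element)  # the substitution, or None on mismatch
        if bindings is not None:
            result |= set(body(bindings))
    return result


def relaxed_union(s1, s2):
    """Combine two collections whose element types may differ."""
    return list(s1) + list(s2)


source1 = [{"description": "a"}]
source2 = [{"description": "b"}, {"note": "x"}]


def has_description(record):
    """Pattern [description = d]: binds d when the field is present."""
    return {"d": record["description"]} if "description" in record else None


answer = collect(lambda b: {b["d"].upper()}, has_description,
                 relaxed_union(source1, source2))
print(sorted(answer))
```

Elements that do not match the pattern simply contribute nothing to the union, mirroring the comprehension given in the text.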

Examples

We conclude this section with some remarks on high-level query languages. A typical form of a query that makes use of pattern matching is:

    select e
    where  p1 ← e1,
           p2 ← e2,
           ...
           condition

Here the $p_i$ are patterns and the expressions $e_1, \ldots, e_i$ have set types. Variables introduced in pattern $p_i$ may be used in expression $e_j$ and (as constants) in pattern $p_j$, where $j > i$. They may also be used in the expression $e$ and the condition, which is simply a boolean expression. This query form can be implemented using the operations described in the previous section.

² The present system does not include {} (empty set). It can be added, at the cost of a slight extension to the type system; see Section

As an example, here is a query based on the example types in the introduction. We make use of the syntax of patterns as developed for case expressions, but here we are using them to match on elements of one or more input sets.

    select [description = d, authName = a, year = y]
    where  [description = d: Str, bibl = b: BT] ← source1 union source2,
           [authors = aa: AT, year = y: Int] ← b,
           a as ([fn = f: Str, ln = l: Str] ⇒ stringconcat(f, l) | [name = n: Str] ⇒ n) ← aa,

y

Note that we have assumed a "relaxed" union to combine the two sources. In the interests of consistency with the formal development, we have also inserted all types for identifiers, so AT and BT are names for the appropriate fragments of the expected source type. In many cases such types can be inferred.

Here are two examples that show the use of patterns in matching on types rather than record structures. Examples of this kind are commonly used to illustrate the need for semistructured data.

    select x
    where  x as (s: set(Num) ⇒ average(s) | r: Num ⇒ r) ← source

    select s
    where  s as (n: Str ⇒ n | [fn = f: Str, ln = l: Str] ⇒ stringconcat(f, l)) ← source'

In the first case we have a set source that may contain both numbers and sets of numbers. In the second case we have a set that may contain both base types and record types. Both of these can be statically type-checked. If, for example, in the first query s has type set(Str), the query would not type-check.

To demonstrate the proposed syntax for the use of functions in patterns, here is one last (slightly contrived) example. We want to calculate the mass of a solid object that is either rectangular or a sphere. Each measure of length can be either integer or real. The type is

    [density: Real] ×
    ( ([intRadius: Int] ∨ [realRadius: Real])
    ∨ ( ([intHeight: Int] ∨ [realHeight: Real])
      × ([intWidth: Int] ∨ [realWidth: Real])
      × ([intDepth: Int] ∨ [realDepth: Real]) ) )

The following case expression makes use of matching based on both unions and products of record structures. Note that the structure of the expression follows that of the type. It would be possible to write an equivalent case expression for the disjunctive normal form for the type and avoid the use of the form x as f, but such an expression would be much larger than the one given here.

    case e of
      [density = d: Real] ×
      (v as
          (r as [intRadius = ir: Int] ⇒ float(ir) | [realRadius = rr: Real] ⇒ rr)
          ⇒ r
        |
          (h as [intHeight = ih: Int] ⇒ float(ih) | [realHeight = rh: Real] ⇒ rh) ×
          (w as [intWidth = iw: Int] ⇒ float(iw) | [realWidth = rw: Real] ⇒ rw) ×
          (d as [intDepth = id: Int] ⇒ float(id) | [realDepth = rd: Real] ⇒ rd)
          ⇒ h * w * d)
      ⇒ d * v

Formal Development

With the foregoing intuitions and examples in mind, we now proceed to the formal definition of our language, its type system, and its operational semantics. Along the way, we establish fundamental properties such as run-time safety and the decidability of subtyping and type-checking.

Types

We develop a type system that is based on conventional complex object types, those that are constructed from the base types with record and set constructors. As described in the introduction, the record constructors are $[\,]$ (the empty record type), $[l : T]$ (the singleton record type), and $R \times R$ (the disjoint concatenation of two record types). By disjoint we mean that the two record types have no field names in common. Thus a conventional record type $[l_1 : T_1, \ldots, l_n : T_n]$ is shorthand for $[l_1 : T_1] \times \cdots \times [l_n : T_n]$. To this we add an untagged union type $T \vee T$. We also assume a single base type $B$ and a set type $\mathrm{set}(T)$. Other collection types such as lists and multisets would behave similarly.

The syntax of types is described by the following grammar:

\[
\begin{array}{rcll}
T & ::= & B & \text{base type} \\
  & \mid & [\,] & \text{empty record type} \\
  & \mid & [l : T] & \text{labeling (single-field record type)} \\
  & \mid & T_1 \times T_2 & \text{record type concatenation} \\
  & \mid & T_1 \vee T_2 & \text{union type} \\
  & \mid & \mathrm{set}(T) & \text{set type}
\end{array}
\]

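Kinding

We have already noted that certain operations on types are restricted. For example, we cannot take the product of two record types with a common field name. This in turn means that any operation on records whose typing rules make improper use of a type constructor is also illegal. In order to control the formation of types, we introduce a system of kinds. This consists of the kind of all types, Type, and a subkind Rcd(L), which is the kind of all record types whose labels are included in the label set L.

\[
\begin{array}{rcll}
K & ::= & \mathrm{Type} & \text{kind of all types} \\
  & \mid & \mathrm{Rcd}(L) & \text{kind of record types with (at most) labels } L
\end{array}
\]

The kinding relation is defined as follows:

\[
\begin{array}{ll}
B :: \mathrm{Type} & (\text{K-Base}) \\[1.5ex]
[\,] :: \mathrm{Rcd}(\{\}) & (\text{K-Empty}) \\[1.5ex]
\frac{T :: \mathrm{Type}}{[l : T] :: \mathrm{Rcd}(\{l\})} & (\text{K-Field}) \\[2.5ex]
\frac{S :: \mathrm{Rcd}(L_1) \quad T :: \mathrm{Rcd}(L_2) \quad L_1 \cap L_2 = \emptyset}{S \times T :: \mathrm{Rcd}(L_1 \cup L_2)} & (\text{K-Rcd}) \\[2.5ex]
\frac{S :: K \quad T :: K}{S \vee T :: K} & (\text{K-Union}) \\[2.5ex]
\frac{T :: \mathrm{Type}}{\mathrm{set}(T) :: \mathrm{Type}} & (\text{K-Set}) \\[2.5ex]
\frac{T :: \mathrm{Rcd}(L_1) \quad L_1 \subseteq L_2}{T :: \mathrm{Rcd}(L_2)} & (\text{K-Subsumption-1}) \\[2.5ex]
\frac{T :: \mathrm{Rcd}(L)}{T :: \mathrm{Type}} & (\text{K-Subsumption-2})
\end{array}
\]

There are two important consequences of these rules. First, record kinds extend to the union type. For example, $([A : t] \times [B : t]) \vee ([C : t] \times [D : t])$ has kind $\mathrm{Rcd}(\{A, B, C, D\})$. Second, the kinding rules require the labels in a concatenation of two record types to be disjoint. (However, the union type constructor is not limited in the same way; $\mathrm{Int} \vee \mathrm{Str}$ and $\mathrm{Int} \vee [a : \mathrm{Str}]$ are well-kinded types.)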
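As a rough illustration of how these rules could be checked mechanically, the following Python sketch computes, for a type encoded as nested tuples, the smallest label set L such that the type has kind Rcd(L), or `None` when the type only has kind Type. The tuple encoding, the helper `field`, and the use of several named base types (the formal system assumes a single base type B) are assumptions made for this sketch; the subsumption rules are applied implicitly.

```python
def kind(t):
    """Return the least L with t :: Rcd(L), or None if t only has kind Type.

    Raises TypeError when t is ill-kinded.
    """
    tag = t[0]
    if tag == "base":                      # K-Base
        return None
    if tag == "empty":                     # K-Empty
        return frozenset()
    if tag == "field":                     # K-Field: the field type must be well-kinded
        kind(t[2])
        return frozenset({t[1]})
    if tag == "times":                     # K-Rcd: both sides are records, labels disjoint
        l1, l2 = kind(t[1]), kind(t[2])
        if l1 is None or l2 is None:
            raise TypeError("concatenation of a non-record type")
        if l1 & l2:
            raise TypeError("overlapping labels: %s" % sorted(l1 & l2))
        return l1 | l2
    if tag == "union":                     # K-Union, using subsumption to align kinds
        l1, l2 = kind(t[1]), kind(t[2])
        if l1 is None or l2 is None:
            return None
        return l1 | l2
    if tag == "set":                       # K-Set
        kind(t[1])
        return None
    raise ValueError("unknown type constructor: %r" % (tag,))


INT, STR = ("base", "Int"), ("base", "Str")


def field(label, t):
    return ("field", label, t)


example = ("union",
           ("times", field("A", INT), field("B", INT)),
           ("times", field("C", INT), field("D", INT)))
print(sorted(kind(example)))
print(kind(("union", INT, field("a", STR))))
```

The first call reproduces the Rcd({A, B, C, D}) example from the text; the second shows a union that is well-kinded but only at kind Type.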

Subtyping

As usual, the subtype relation, written $S \le T$, captures a principle of safe substitutability: any element of $S$ may safely be used in a context expecting an element of $T$.

For sets and records, the subtyping rules are the standard ones: $\mathrm{set}(S) \le \mathrm{set}(T)$ if $S \le T$ (e.g., a set of employees can be used as a set of people), and a record type $S$ is a subtype of a record type $T$ if $S$ has more fields than $T$ and the types of the common fields in $S$ are subtypes of the corresponding fields in $T$. This effect is actually achieved by the combination of several rules below. This "exploded presentation" of record subtyping corresponds to our presentation of record types in terms of separate empty-record, singleton, and concatenation constructors.

For union types, the subtyping rules are a little more interesting. First, we axiomatize the fact that $S \vee T$ is the least upper bound of $S$ and $T$: that is, $S \vee T$ is above both $S$ and $T$, and everything that is above both $S$ and $T$ is also above their union (rules S-Union-UB and S-Union-L below). We then have two rules (S-Dist-Rcd and S-Dist-Field) showing how union distributes over records.

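Formally, the subtype relation is the least relation on well-kinded types closed under the following rules.

\[
\begin{array}{ll}
T \le T & (\text{S-Refl}) \\[1.5ex]
\frac{R \le S \quad S \le T}{R \le T} & (\text{S-Trans}) \\[2.5ex]
[l : T] \le [\,] & (\text{S-Rcd-FE}) \\[1.5ex]
S \times T \le S & (\text{S-Rcd-RE}) \\[1.5ex]
S \times T \le T \times S & (\text{S-Rcd-Comm}) \\[1.5ex]
S \times (T \times U) \le (S \times T) \times U & (\text{S-Rcd-Assoc}) \\[1.5ex]
S \le S \times [\,] & (\text{S-Rcd-Ident}) \\[1.5ex]
\frac{S \le T}{[l : S] \le [l : T]} & (\text{S-Rcd-DF}) \\[2.5ex]
\frac{S_1 \le T_1 \quad S_2 \le T_2}{S_1 \times S_2 \le T_1 \times T_2} & (\text{S-Rcd-DR}) \\[2.5ex]
\frac{S \le T}{\mathrm{set}(S) \le \mathrm{set}(T)} & (\text{S-Set}) \\[2.5ex]
\frac{R \le T \quad S \le T}{R \vee S \le T} & (\text{S-Union-L}) \\[2.5ex]
S_i \le S_1 \vee S_2 \quad (i \in \{1, 2\}) & (\text{S-Union-UB}) \\[1.5ex]
R \times (S \vee T) \le (R \times S) \vee (R \times T) & (\text{S-Dist-Rcd}) \\[1.5ex]
[l : S \vee T] \le [l : S] \vee [l : T] & (\text{S-Dist-Field})
\end{array}
\]

Note that we restrict the subtype relation to well-kinded types: $S$ is never a subtype of $T$ if either $S$ or $T$ is ill-kinded. (The typing rules will be careful only to "call" the subtype relation on types that are already known to be well kinded.)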

If both $S \le T$ and $T \le S$, we say that $S$ and $T$ are equivalent and write $S \sim T$. Note, for example, that the distributive laws S-Dist-Rcd and S-Dist-Field are actually equivalences: the other directions follow from the laws for union (plus transitivity). Also, note the absence of the "other" distributivity law for unions and records: $P \vee (Q \times R) \sim (P \vee Q) \times (P \vee R)$. This law doesn't make sense here, because it violates the kinding constraint that products of record types can only be formed if the two types have disjoint label sets.

The subtype relation includes explicit rules for associativity and commutativity of the operator $\times$. Also, it is easy to check that the associativity, commutativity and idempotence of $\vee$ follow directly from the rules given. We shall take advantage of this fluidity in the following by writing both records and unions in a compound, n-ary form:

\[
\begin{array}{rcl}
[l_1 : T_1, \ldots, l_n : T_n] & \stackrel{\mathrm{def}}{=} & [l_1 : T_1] \times \cdots \times [l_n : T_n] \\[1ex]
\bigvee (T_1, \ldots, T_n) & \stackrel{\mathrm{def}}{=} & T_1 \vee \cdots \vee T_n
\end{array}
\]

(In the first line, $n$ may be 0, since we allow empty records, but in the second it must be positive: for brevity, we do not allow "empty unions" in the present system. See Section )

We often write compound unions using a simple comprehension notation. For example,

\[
\bigvee (A \times B \mid A \in A_1 \vee A_2 \vee \cdots \vee A_m \text{ and } B \in B_1 \vee B_2 \vee \cdots \vee B_n)
\]

denotes

\[
A_1 \times B_1 \;\vee\; A_1 \times B_2 \;\vee\; \cdots \;\vee\; A_1 \times B_n \;\vee\; A_2 \times B_1 \;\vee\; \cdots \;\vee\; A_m \times B_n .
\]

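Properties of Subtyping

For proving properties of the subtype relation, it is convenient to work with types in a more constrained syntactic form:

Definition. The sets of normal ($N$) and simple ($A$) types are defined as follows:

\[
\begin{array}{rcl}
N & ::= & \bigvee (A_1, \ldots, A_n) \\[1ex]
A & ::= & B \;\mid\; [l_1 : A_1, \ldots, l_n : A_n] \;\mid\; \mathrm{set}(N)
\end{array}
\]

Intuitively, a simple type is one in which unions only appear (immediately) inside of the set constructor; a normal type is a union of simple types. Note that every simple type is also normal. □

The restricted form of normal and simple types can be exploited to give a much simpler subtyping relation, written $S \preceq T$, in terms of the following "macro rules":

\[
\begin{array}{ll}
B \preceq B & (\text{SA-Base}) \\[1.5ex]
\frac{N \preceq M}{\mathrm{set}(N) \preceq \mathrm{set}(M)} & (\text{SA-Set}) \\[2.5ex]
\frac{\{k_1, \ldots, k_m\} \subseteq \{l_1, \ldots, l_n\} \qquad \text{for all } k_i \in \{k_1, \ldots, k_m\},\; A_{k_i} \preceq B_{k_i}}{[l_1 : A_{l_1}, \ldots, l_n : A_{l_n}] \preceq [k_1 : B_{k_1}, \ldots, k_m : B_{k_m}]} & (\text{SA-Rcd}) \\[2.5ex]
\frac{\forall i \le m.\; \exists j \le n.\; A_i \preceq B_j}{\bigvee (A_1, \ldots, A_m) \preceq \bigvee (B_1, \ldots, B_n)} & (\text{SN-Union})
\end{array}
\]

Fact. $N \preceq M$ is decidable. □

Proof. The macro rules can be read as a pair of algorithms, one for subtyping between simple types and one for subtyping between normal types. Both of these algorithms are syntax directed and obviously terminate on all inputs (all recursive calls reduce the size of the inputs). □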
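Reading the macro rules as a pair of mutually recursive algorithms can be sketched in Python as follows. The encoding is an assumption made for illustration: a normal type is a list of simple types, and a simple type is `("base", name)`, `("rcd", {label: simple})` or `("set", normal)`. The function names, the helper `rec`, and the use of several named base types (the formal system has a single base type B) are likewise hypothetical.

```python
def sub_simple(a, b):
    """Algorithmic subtyping between simple types (SA-Base, SA-Set, SA-Rcd)."""
    if a[0] == "base" and b[0] == "base":
        return a[1] == b[1]
    if a[0] == "set" and b[0] == "set":
        return sub_normal(a[1], b[1])
    if a[0] == "rcd" and b[0] == "rcd":
        # Every field required by b must be present in a, at a subtype.
        return all(k in a[1] and sub_simple(a[1][k], b[1][k]) for k in b[1])
    return False


def sub_normal(n, m):
    """SN-Union: every alternative of n is below some alternative of m."""
    return all(any(sub_simple(a, b) for b in m) for a in n)


STR = ("base", "Str")


def rec(**fields):
    return ("rcd", fields)


authors = [rec(name=STR, address=STR), rec(fn=STR, ln=STR, address=STR)]
with_address = [rec(address=STR)]

print(sub_normal(authors, with_address))
print(sub_normal(with_address, authors))
```

Each recursive call is made on strictly smaller types, which is the termination argument given in the proof above.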

Lemma. $N \preceq N$, for all $N$. □

Lemma. If $N \preceq M$ and $M \preceq L$, then $N \preceq L$. □

Proof. By induction on the total size of $L$, $M$, $N$. First suppose that all of $L$, $M$, $N$ are simple. The induction hypothesis is immediately satisfied for SA-Base and SA-Set. For SA-Rcd, use the transitivity of set inclusion and induction on the appropriate subterms.

If at least one of $L$, $M$, $N$ is (non-trivially) normal, use the transitivity of the functional relationship expressed by the SN-Union rule and induction on the appropriate subterms. □

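To transfer the property of decidability from $\preceq$ to $\le$, we first show how any type may be converted to an equivalent type in disjunctive normal form.

Definition. The disjunctive normal form (dnf) of a type $T$ is defined as follows:

\[
\begin{array}{rcll}
\mathrm{dnf}(B) & = & B & \\
\mathrm{dnf}([\,]) & = & [\,] & \\
\mathrm{dnf}(P \times Q) & = & \bigvee (A_i \times B_j \mid A_i \in \mathrm{dnf}(P),\; B_j \in \mathrm{dnf}(Q)) & \text{(a)} \\
\mathrm{dnf}([l : P]) & = & \bigvee ([l : A_i] \mid A_i \in \mathrm{dnf}(P)) & \text{(b)} \\
\mathrm{dnf}(P \vee Q) & = & \mathrm{dnf}(P) \vee \mathrm{dnf}(Q) & \text{(c)} \\
\mathrm{dnf}(\mathrm{set}(P)) & = & \mathrm{set}(\mathrm{dnf}(P)) & \text{(d)}
\end{array}
\]
□

Fact. $\mathrm{dnf}(P) \sim P$. □

Fact. $\mathrm{dnf}(P)$ is a normal type, for every type $P$. □

Fact. $N \preceq M$ implies $N \le M$. □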
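The definition of dnf translates almost line by line into a recursive function. The Python sketch below is an illustration under stated assumptions: general types are nested tuples (`"base"`, `"empty"`, `"field"`, `"times"`, `"union"`, `"set"`), the result is a list of simple types in which each record is stored as `("rcd", {label: simple})`, and the input is assumed to be well-kinded so that both sides of a product normalise to records. The helpers `field` and `pair` are hypothetical.

```python
def dnf(t):
    """Disjunctive normal form of a well-kinded type, as a list of simple types."""
    tag = t[0]
    if tag == "base":
        return [t]
    if tag == "empty":
        return [("rcd", {})]
    if tag == "field":    # (b): push the union outside the field
        return [("rcd", {t[1]: a}) for a in dnf(t[2])]
    if tag == "times":    # (a): all pairwise concatenations
        return [("rcd", {**a[1], **b[1]}) for a in dnf(t[1]) for b in dnf(t[2])]
    if tag == "union":    # (c)
        return dnf(t[1]) + dnf(t[2])
    if tag == "set":      # (d): normalise under the set constructor
        return [("set", dnf(t[1]))]
    raise ValueError("unknown type constructor: %r" % (tag,))


STR = ("base", "Str")


def field(label, t):
    return ("field", label, t)


# [address: Str] x ([name: Str] v ([fn: Str] x [ln: Str]))
author = ("times", field("address", STR),
          ("union", field("name", STR),
           ("times", field("fn", STR), field("ln", STR))))
print(dnf(author))


def pair(i):
    """([a_i: Str] v [b_i: Str]), one factor of the exponential example."""
    return ("union", field("a%d" % i, STR), field("b%d" % i, STR))


blowup = ("times", pair(1), ("times", pair(2), pair(3)))
print(len(dnf(blowup)))
```

The second example is the product of n binary unions discussed in the introduction: its normal form has 2^n alternatives, which is why the normal form is used here as a proof device rather than as a practical representation.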

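Lemma. $S \le T$ iff $\mathrm{dnf}(S) \preceq \mathrm{dnf}(T)$. □

Proof. (⇐) By the fact that $\mathrm{dnf}(P) \sim P$, we have derivations of $S \le \mathrm{dnf}(S)$ and $\mathrm{dnf}(T) \le T$, and by the fact that $\preceq$ implies $\le$, we have a derivation of $\mathrm{dnf}(S) \le \mathrm{dnf}(T)$. Use transitivity to build a derivation of $S \le T$.

(⇒) By induction on the height of the derivation of $S \le T$. We consider the final rule in the derivation. By induction we assume we can build a derivation of the normal forms for the antecedents, and now we consider all possible final rules.

We start with the axioms.

S-Refl: By reflexivity of $\preceq$.

S-Rcd-FE: $\mathrm{dnf}([l : T]) = [l : \mathrm{dnf}(T)]$, and $[l : \mathrm{dnf}(T)] \preceq [\,]$ by SA-Rcd.

S-Rcd-RE: $\mathrm{dnf}(S \times T) = \bigvee (S_i \times T_j \mid S_i \in \mathrm{dnf}(S),\; T_j \in \mathrm{dnf}(T))$. Now $S_i \times T_j \preceq S_i$ by SA-Rcd, and the result follows from SN-Union.

S-Rcd-Comm: If $\mathrm{dnf}(S)$ and $\mathrm{dnf}(T)$ are simple, then $\mathrm{dnf}(S) \times \mathrm{dnf}(T) \preceq \mathrm{dnf}(T) \times \mathrm{dnf}(S)$ by SA-Rcd. If not, use SN-Union first.

S-Rcd-Assoc: As for S-Rcd-Comm.

S-Rcd-Ident: As for S-Rcd-Comm.

S-Union-UB: By SN-Union.

S-Dist-Rcd:

\[
\begin{array}{rcl}
\mathrm{dnf}(R \times (S \vee T)) & = & \bigvee (R_i \times U_j \mid R_i \in \mathrm{dnf}(R),\; U_j \in \mathrm{dnf}(S \vee T)) \\
& = & \bigvee (R_i \times S_j \mid R_i \in \mathrm{dnf}(R),\; S_j \in \mathrm{dnf}(S)) \;\vee\; \bigvee (R_i \times T_k \mid R_i \in \mathrm{dnf}(R),\; T_k \in \mathrm{dnf}(T)) \\
& = & \mathrm{dnf}((R \times S) \vee (R \times T))
\end{array}
\]

S-Dist-Field:

\[
\begin{array}{rcl}
\mathrm{dnf}([l : S \vee T]) & = & \bigvee ([l : U_i] \mid U_i \in \mathrm{dnf}(S \vee T)) \\
& = & \bigvee ([l : S_i] \mid S_i \in \mathrm{dnf}(S)) \;\vee\; \bigvee ([l : T_i] \mid T_i \in \mathrm{dnf}(T)) \\
& = & \mathrm{dnf}([l : S]) \vee \mathrm{dnf}([l : T]) \\
& = & \mathrm{dnf}([l : S] \vee [l : T])
\end{array}
\]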

Now for the inference rules. The premises for all the rules are of the form $S \le T$, and our inductive hypothesis is that for the premises of the final rule we have obtained a derivation, using the SA and SN-Union rules, of the corresponding $\mathrm{dnf}(S) \preceq \mathrm{dnf}(T)$. Without loss of generality we may assume that the final rule in the derivation of each such premise is SN-Union. We examine the remaining inference rules:

S-Trans: By the transitivity lemma for $\preceq$.

S-Rcd-DF: Since $\mathrm{dnf}(S) \preceq \mathrm{dnf}(T)$ was derived by SN-Union, we know that for each $A_i \in \mathrm{dnf}(S)$ there is a $B_j \in \mathrm{dnf}(T)$ such that $A_i \preceq B_j$. Therefore, for each such $A_i$ we may use SA-Rcd to derive $[l : A_i] \preceq [l : B_j]$. These derivations may be combined using SN-Union to obtain a derivation of $\mathrm{dnf}([l : S]) \preceq \mathrm{dnf}([l : T])$.

S-Rcd-DR: For each $A^1_i \in \mathrm{dnf}(S_1)$ and each $A^2_i \in \mathrm{dnf}(S_2)$ there exist $B^1_j \in \mathrm{dnf}(T_1)$ and $B^2_j \in \mathrm{dnf}(T_2)$ such that we have derivations of $A^1_i \preceq B^1_j$ and $A^2_i \preceq B^2_j$. For each such pair we can therefore use SA-Rcd to derive $A^1_i \times A^2_i \preceq B^1_j \times B^2_j$, and then use SN-Union to derive $\mathrm{dnf}(S_1 \times S_2) \preceq \mathrm{dnf}(T_1 \times T_2)$.

S-Set: Immediate, by SA-Set.

S-Union-L: For each $A_i \in \mathrm{dnf}(R)$ there is a $C_k \in \mathrm{dnf}(T)$ such that $A_i \preceq C_k$, and for each $B_j \in \mathrm{dnf}(S)$ there is a $C_l \in \mathrm{dnf}(T)$ such that $B_j \preceq C_l$. From these, $\mathrm{dnf}(R \vee S) \preceq \mathrm{dnf}(T)$ can be derived directly using SN-Union. □

Theorem: The subtype relation is decidable.

Proof: Immediate from the preceding lemmas.

We do not yet have any results on the complexity of checking subtyping or equivalence. The proof strategy we have adopted here leads to an algorithm with running time exponential in the size of its inputs.
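The decision procedure implied by this proof strategy can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: the tuple encoding of types and the helper names (`dnf`, `simple_sub`, `subtype`, `rcd`, `union`) are assumptions made for the example. It also assumes that SA-Rcd lets the larger record mention fewer labels, as the S-Rcd-RE case above suggests, and that set types are compared covariantly.

```python
from itertools import product

# Hypothetical encoding of types as nested tuples (not taken from the paper):
#   ("base", name)               base type
#   ("rcd", ((label, T), ...))   concatenation of labelled fields; () is the empty record
#   ("union", T1, T2)            union type
#   ("set", T)                   set type

def dnf(t):
    """Return the simple disjuncts of t as a list (the outer union is implicit)."""
    tag = t[0]
    if tag in ("base", "set"):
        return [t]  # set element types are normalised only when two sets are compared
    if tag == "union":
        return dnf(t[1]) + dnf(t[2])
    if tag == "rcd":
        labels = [label for label, _ in t[1]]
        choices = [dnf(field) for _, field in t[1]]
        # One disjunct for every way of picking a disjunct for each field.
        return [("rcd", tuple(zip(labels, pick))) for pick in product(*choices)]
    raise ValueError(f"unknown type constructor: {tag!r}")

def simple_sub(a, b):
    """Compare two simple types, in the spirit of SA-Rcd and SA-Set."""
    if a[0] != b[0]:
        return False
    if a[0] == "base":
        return a[1] == b[1]
    if a[0] == "set":
        return subtype(a[1], b[1])
    fields_a, fields_b = dict(a[1]), dict(b[1])
    # Assumed reading of SA-Rcd: b may mention fewer labels than a.
    return all(l in fields_a and simple_sub(fields_a[l], fields_b[l]) for l in fields_b)

def subtype(s, t):
    """SN-Union: every disjunct of dnf(s) lies below some disjunct of dnf(t)."""
    right = dnf(t)
    return all(any(simple_sub(a, b) for b in right) for a in dnf(s))

INT, STR = ("base", "int"), ("base", "string")

def rcd(**fields):
    """Build a record type with its fields in a canonical (sorted) order."""
    return ("rcd", tuple(sorted(fields.items())))

def union(a, b):
    return ("union", a, b)

both = union(INT, STR)
print(subtype(rcd(a=both), union(rcd(a=INT), rcd(a=STR))))  # True
print(len(dnf(rcd(a=both, b=both, c=both))))                # 8
```

The last line shows where the exponential cost comes from: a record with n fields, each a two-way union, normalises to 2^n disjuncts.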

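The structured form of the macro rules can be used to derive several inversion properties, which will be useful later in reasoning about the typing relation.

Corollary: If $S \le \mathsf{set}(T_1)$, then $S = \mathsf{set}(S_1)$ with $S_1 \le T_1$.

Corollary: If $W \le U$, with

\[
\begin{array}{rcl}
W & = & l_1 : W_1 \times \cdots \times l_m : W_m \times \cdots \times l_n : W_n \\
U & = & l_1 : U_1 \times \cdots \times l_m : U_m
\end{array}
\]

then $W_k \le U_k$ for each $k \le m$.

Proof: From the definition of disjunctive normal forms, we know that

\[
\mathrm{dnf}(W) = \bigvee \{ l_1 : W_{i1} \times \cdots \times l_m : W_{im} \times \cdots \times l_n : W_{in} \mid W_{i1} \ldots W_{im} \ldots W_{in} \in \mathrm{dnf}(W_1) \times \cdots \times \mathrm{dnf}(W_m) \times \cdots \times \mathrm{dnf}(W_n) \}
\]

and

\[
\mathrm{dnf}(U) = \bigvee \{ l_1 : U_{j1} \times \cdots \times l_m : U_{jm} \mid U_{j1} \ldots U_{jm} \in \mathrm{dnf}(U_1) \times \cdots \times \mathrm{dnf}(U_m) \}.
\]

By SN-Union, for each

\[ A_i = (l_1 : W_{i1} \times \cdots \times l_m : W_{im} \times \cdots \times l_n : W_{in}) \in \mathrm{dnf}(W) \]

there is some

\[ B_j = (l_1 : U_{j1} \times \cdots \times l_m : U_{jm}) \in \mathrm{dnf}(U) \]

with $A_i \le B_j$. This derivation must be an instance of SA-Rcd, with $W_{ik} \le U_{jk}$. In other words, for each $W_{ik} \in \mathrm{dnf}(W_k)$ there is some $U_{jk} \in \mathrm{dnf}(U_k)$ with $W_{ik} \le U_{jk}$. By SN-Union, $\mathrm{dnf}(W_k) \le \mathrm{dnf}(U_k)$. The desired result $W_k \le U_k$ now follows by the earlier lemma relating subtyping to normal forms.

Corollary: If $S$ is a simple type and $S \le T_1 \vee T_2$, then either $S \le T_1$ or else $S \le T_2$.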

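Terms

The sets of programs, functions, and patterns are described by the following grammar:

\[
\begin{array}{rcll}
e & ::= & b & \text{base value} \\
  & \mid & x & \text{variable} \\
  & \mid & l_1 = e_1, \ldots, l_n = e_n & \text{record construction} \\
  & \mid & e_1 \times e_2 & \text{record concatenation} \\
  & \mid & \mathsf{case}\ e\ \mathsf{of}\ f & \text{pattern matching} \\
  & \mid & \{ e_1, \ldots, e_n \} & \text{set} \\
  & \mid & e_1\ \mathsf{union}\ e_2 & \text{union of sets} \\
  & \mid & \mathsf{collect}\ e_1\ \mathsf{where}\ p \in e_2 & \text{set comprehension} \\[1ex]
p & ::= & x : T & \text{variable pattern (typecase)} \\
  & \mid & l_1 = p_1, \ldots, l_n = p_n & \text{record pattern} \\
  & \mid & p_1 \times p_2 & \text{pattern concatenation} \\
  & \mid & x\ \mathsf{as}\ f & \text{function nested in pattern} \\[1ex]
f & ::= & p \Rightarrow e & \text{base function} \\
  & \mid & f_1 \mid f_2 & \text{compound function}
\end{array}
\]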

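Typing

The typing rules are quite standard.

Expressions ($\Gamma \vdash e \in T$):

\[ \Gamma \vdash b \in B \quad \text{(T-Base)} \]

\[ \Gamma \vdash x \in \Gamma(x) \quad \text{(T-Var)} \]

\[ \frac{\Gamma \vdash e_i \in T_i \qquad \text{all the } l_i \text{ are distinct}}
        {\Gamma \vdash (l_1 = e_1, \ldots, l_n = e_n) \in (l_1 : T_1 \times \cdots \times l_n : T_n)} \quad \text{(T-Rcd)} \]

\[ \frac{\Gamma \vdash e_1 \in T_1 \qquad \Gamma \vdash e_2 \in T_2 \qquad \vdash T_1 \times T_2 \in K}
        {\Gamma \vdash e_1 \times e_2 \in T_1 \times T_2} \quad \text{(T-Concat)} \]

\[ \frac{\Gamma \vdash f \in S \to T \qquad \Gamma \vdash e \in R \qquad R \le S}
        {\Gamma \vdash \mathsf{case}\ e\ \mathsf{of}\ f \in T} \quad \text{(T-Case)} \]

\[ \frac{\Gamma \vdash e_i \in T_i \text{ for each } i \qquad n \ge 1}
        {\Gamma \vdash \{ e_1, \ldots, e_n \} \in \mathsf{set}(T_1 \vee \cdots \vee T_n)} \quad \text{(T-Set)} \]

\[ \frac{\Gamma \vdash e_1 \in \mathsf{set}(T_1) \qquad \Gamma \vdash e_2 \in \mathsf{set}(T_2)}
        {\Gamma \vdash e_1\ \mathsf{union}\ e_2 \in \mathsf{set}(T_1 \vee T_2)} \quad \text{(T-Union)} \]

\[ \frac{\Gamma \vdash e_2 \in \mathsf{set}(S) \qquad \Gamma \vdash p \in U \Rightarrow \Gamma' \qquad S \le U \qquad \Gamma, \Gamma' \vdash e_1 \in \mathsf{set}(T)}
        {\Gamma \vdash \mathsf{collect}\ e_1\ \mathsf{where}\ p \in e_2 \in \mathsf{set}(T)} \quad \text{(T-Collect)} \]

Functions ($\Gamma \vdash f \in S \to T$):

\[ \frac{\Gamma \vdash p \in S \Rightarrow \Gamma' \qquad \Gamma, \Gamma' \vdash e \in T}
        {\Gamma \vdash (p \Rightarrow e) \in S \to T} \quad \text{(T-FPat)} \]

\[ \frac{\Gamma \vdash f_1 \in S_1 \to T_1 \qquad \Gamma \vdash f_2 \in S_2 \to T_2}
        {\Gamma \vdash (f_1 \mid f_2) \in S_1 \vee S_2 \to T_1 \vee T_2} \quad \text{(T-FAlt)} \]

Patterns ($\Gamma \vdash p \in T \Rightarrow \Gamma'$):

\[ \frac{\vdash T \in K}
        {\Gamma \vdash (x : T) \in T \Rightarrow x : T} \quad \text{(T-PVar)} \]

\[ \frac{\Gamma \vdash p_i \in T_i \Rightarrow \Gamma'_i \qquad \text{the } \Gamma'_i \text{ all have disjoint domains}}
        {\Gamma \vdash (l_1 = p_1, \ldots, l_n = p_n) \in (l_1 : T_1 \times \cdots \times l_n : T_n) \Rightarrow \Gamma'_1, \ldots, \Gamma'_n} \quad \text{(T-PRcd)} \]

\[ \frac{\begin{array}{c}
          \Gamma \vdash p_1 \in (k_1 : S_1 \times \cdots \times k_m : S_m) \Rightarrow \Gamma'_1 \qquad
          \Gamma \vdash p_2 \in (l_1 : T_1 \times \cdots \times l_n : T_n) \Rightarrow \Gamma'_2 \\
          \Gamma'_1 \text{ and } \Gamma'_2 \text{ have disjoint domains, and } \{ k_1, \ldots, k_m \} \cap \{ l_1, \ldots, l_n \} = \emptyset
        \end{array}}
        {\Gamma \vdash p_1 \times p_2 \in (k_1 : S_1 \times \cdots \times k_m : S_m \times l_1 : T_1 \times \cdots \times l_n : T_n) \Rightarrow \Gamma'_1, \Gamma'_2} \quad \text{(T-PConcat)} \]

\[ \frac{\Gamma \vdash f \in S \to T}
        {\Gamma \vdash (x\ \mathsf{as}\ f) \in S \Rightarrow x : T} \quad \text{(T-PAs)} \]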

Properties of Typing

Proposition: The typing relation is decidable.

Proof: Immediate from the decidability of subtyping and the syntax-directedness of the typing rules.

Definition: A substitution $\sigma$ is a finite function from variables to terms. We say that a substitution $\sigma$ satisfies a context $\Gamma$, written $\sigma \models \Gamma$, if they have the same domain and, for each $x$ in their common domain, we have $\vdash \sigma(x) \in S_x$ for some $S_x$ with $S_x \le \Gamma(x)$.

Definition: We say that a typing context $\Gamma'$ refines another context $\Gamma$, written $\Gamma' \le \Gamma$, if their domains are the same and, for each $x \in \mathrm{dom}(\Gamma)$, we have $\Gamma'(x) \le \Gamma(x)$.

Fact [Narrowing]: If $\Gamma \vdash e \in T$ and $\Gamma' \le \Gamma$, then $\Gamma' \vdash e \in T$.

Lemma [Substitution preserves typing]:

1. If $\sigma \models \Gamma$ and $\Gamma \vdash e \in Q$, then $\vdash \sigma(e) \in P$ for some $P \le Q$.

2. If $\sigma \models \Gamma$ and $\Gamma \vdash f \in S \to Q$, then $\vdash \sigma(f) \in S \to P$ for some $P \le Q$.

3. If $\sigma \models \Gamma$ and $\Gamma \vdash p \in U \Rightarrow \Gamma'$, then $\vdash \sigma(p) \in U \Rightarrow \Gamma''$ for some $\Gamma'' \le \Gamma'$.

Proof: By simultaneous induction on derivations. The arguments are all straightforward, using previously established facts. For the property concerning patterns, note that substitution into a pattern only affects functions that may be embedded in the pattern, since all other variables mentioned in the pattern are binding occurrences. Moreover, by our conventions about names of bound variables, we must assume that the variables bound in an expression, function, or pattern are distinct from those defined by $\sigma$.

Evaluation

The operational semantics of our language is again quite standard: we define a relation $e \Downarrow v$, read "(closed) expression $e$ evaluates to result $v$", by a collection of syntax-directed rules embodying a simple abstract machine.

Definition: We will use the metavariables $v$ and $w$ to range over values, that is, closed expressions not involving case, union, concatenation, or collect:

\[
\begin{array}{rcl}
v & ::= & b \\
  & \mid & l_1 = v_1, \ldots, l_n = v_n \\
  & \mid & \{ v_1, \ldots, v_n \}
\end{array}
\]

We write $\bar{v}$ as shorthand for a set of values $v_1, \ldots, v_n$.

Definition: A substitution $\sigma$ is a finite function from variables to values. When $\sigma_1$ and $\sigma_2$ have disjoint domains, we write $\sigma_1 \cup \sigma_2$ for their combination.

1 2

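Reduction ($e \Downarrow v$, for closed terms $e$):

\[ b \Downarrow b \quad \text{(E-Base)} \]

\[ \frac{e_i \Downarrow v_i \text{ for each } i}
        {(l_1 = e_1, \ldots, l_n = e_n) \Downarrow (l_1 = v_1, \ldots, l_n = v_n)} \quad \text{(E-Rcd)} \]

\[ \frac{\begin{array}{c}
          e_1 \Downarrow (l_1 = v_1, \ldots, l_m = v_m) \qquad e_2 \Downarrow (j_1 = w_1, \ldots, j_n = w_n) \\
          \{ l_1, \ldots, l_m \} \cap \{ j_1, \ldots, j_n \} = \emptyset
        \end{array}}
        {e_1 \times e_2 \Downarrow (l_1 = v_1, \ldots, l_m = v_m, j_1 = w_1, \ldots, j_n = w_n)} \quad \text{(E-Concat)} \]

\[ \frac{e \Downarrow v \qquad \mathrm{match}(v, f) \Rightarrow v'}
        {\mathsf{case}\ e\ \mathsf{of}\ f \Downarrow v'} \quad \text{(E-Case)} \]

\[ \frac{e_i \Downarrow v_i \text{ for each } i}
        {\{ e_1, \ldots, e_n \} \Downarrow \{ v_1, \ldots, v_n \}} \quad \text{(E-Set)} \]

\[ \frac{e_1 \Downarrow \{ \bar{v}_1 \} \qquad e_2 \Downarrow \{ \bar{v}_2 \}}
        {e_1\ \mathsf{union}\ e_2 \Downarrow \{ \bar{v}_1 \cup \bar{v}_2 \}} \quad \text{(E-Union)} \]

\[ \frac{e_2 \Downarrow \{ v_1, \ldots, v_n \} \qquad \text{for each } i,\ \mathrm{match}(v_i, p) \Rightarrow \sigma_i \text{ and } \sigma_i(e_1) \Downarrow \{ \bar{w}_i \}}
        {\mathsf{collect}\ e_1\ \mathsf{where}\ p \in e_2 \Downarrow \{ \bar{w}_1 \cup \cdots \cup \bar{w}_n \}} \quad \text{(E-Collect)} \]

Function matching ($\mathrm{match}(v, f) \Rightarrow v'$):

\[ \frac{\mathrm{match}(v, p) \Rightarrow \sigma \qquad \sigma(e) \Downarrow v'}
        {\mathrm{match}(v, p \Rightarrow e) \Rightarrow v'} \quad \text{(E-FPat)} \]

\[ \frac{\mathrm{match}(v, f_1) \Rightarrow v'}
        {\mathrm{match}(v, f_1 \mid f_2) \Rightarrow v'} \quad \text{(E-FAlt1)} \]

\[ \frac{\neg\, \mathrm{match}(v, f_1) \qquad \mathrm{match}(v, f_2) \Rightarrow v'}
        {\mathrm{match}(v, f_1 \mid f_2) \Rightarrow v'} \quad \text{(E-FAlt2)} \]

Matching ($\mathrm{match}(v, p) \Rightarrow \sigma$):

\[ \frac{\vdash v \in S \qquad S \le T}
        {\mathrm{match}(v, x : T) \Rightarrow \{ x \mapsto v \}} \quad \text{(E-PVar)} \]

\[ \frac{\mathrm{match}(v_i, p_i) \Rightarrow \sigma_i \qquad \text{the } \sigma_i \text{ have disjoint domains}}
        {\mathrm{match}((l_1 = v_1, \ldots, l_m = v_m, \ldots, l_n = v_n),\ (l_1 = p_1, \ldots, l_m = p_m)) \Rightarrow \sigma_1 \cup \cdots \cup \sigma_m} \quad \text{(E-PRcd)} \]

\[ \frac{\mathrm{match}(v, p_1) \Rightarrow \sigma_1 \qquad \mathrm{match}(v, p_2) \Rightarrow \sigma_2 \qquad \sigma_1 \text{ and } \sigma_2 \text{ have disjoint domains}}
        {\mathrm{match}(v, p_1 \times p_2) \Rightarrow \sigma_1 \cup \sigma_2} \quad \text{(E-PConcat)} \]

\[ \frac{\mathrm{match}(v, f) \Rightarrow v'}
        {\mathrm{match}(v, x\ \mathsf{as}\ f) \Rightarrow \{ x \mapsto v' \}} \quad \text{(E-PAs)} \]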
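The matching rules translate almost directly into a recursive procedure. The following is a minimal sketch under stated assumptions rather than the authors' implementation: values are plain Python objects (records are dicts), patterns and functions are tagged tuples, a function body is a Python callable standing in for "apply the substitution, then evaluate", and the type test of E-PVar is delegated to a caller-supplied predicate `has_type`. All of these names and encodings are hypothetical.

```python
NO_MATCH = object()  # sentinel: no branch of the function applies

def merge(substs):
    """Combine substitutions, insisting on disjoint domains."""
    out = {}
    for s in substs:
        if out.keys() & s.keys():
            raise ValueError("substitutions overlap")
        out.update(s)
    return out

def match_pat(v, p, has_type):
    """Return a substitution (dict) if value v matches pattern p, else None."""
    kind = p[0]
    if kind == "var":                       # E-PVar: ("var", x, T)
        _, x, t = p
        return {x: v} if has_type(v, t) else None
    if kind == "rcd":                       # E-PRcd: ("rcd", {label: pattern})
        if not isinstance(v, dict):
            return None
        parts = []
        for label, sub in p[1].items():
            if label not in v:
                return None
            s = match_pat(v[label], sub, has_type)
            if s is None:
                return None
            parts.append(s)
        return merge(parts)                 # extra fields of v are ignored
    if kind == "concat":                    # E-PConcat: ("concat", p1, p2)
        s1 = match_pat(v, p[1], has_type)
        s2 = match_pat(v, p[2], has_type)
        return None if s1 is None or s2 is None else merge([s1, s2])
    if kind == "as":                        # E-PAs: ("as", x, f)
        _, x, f = p
        r = match_fun(v, f, has_type)
        return None if r is NO_MATCH else {x: r}
    raise ValueError(f"unknown pattern: {kind!r}")

def match_fun(v, f, has_type):
    """Return the result of applying f to v, or NO_MATCH."""
    if f[0] == "fpat":                      # E-FPat: ("fpat", p, body)
        s = match_pat(v, f[1], has_type)
        return NO_MATCH if s is None else f[2](s)
    # E-FAlt1 / E-FAlt2: ("alt", f1, f2); try f1 first, fall back to f2.
    r = match_fun(v, f[1], has_type)
    return r if r is not NO_MATCH else match_fun(v, f[2], has_type)

# Usage: Python classes play the role of base types, isinstance the type test.
f = ("alt",
     ("fpat", ("rcd", {"name": ("var", "n", str)}), lambda s: s["n"].upper()),
     ("fpat", ("var", "k", int), lambda s: s["k"] + 1))
print(match_fun({"name": "ann", "age": 3}, f, isinstance))  # ANN
print(match_fun(41, f, isinstance))                         # 42
```

In a well-typed program the `NO_MATCH` outcome never reaches the top level; the sentinel exists only so that a compound function can fall through from its first branch to its second.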

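Properties of Evaluation

Fact: If $v$ is a value and $\vdash v \in V$, then $V$ is a simple type.

Theorem [Subject reduction]:

1. If $e \Downarrow v$ and $\vdash e \in Q$, then $\vdash v \in V$ with $V \le Q$.

2. If $\mathrm{match}(v, f) \Rightarrow v'$, $\vdash f \in U \to V$, $\vdash v \in W$, and $W \le U$, then $\vdash v' \in X$ with $X \le V$.

3. If $\mathrm{match}(v, p) \Rightarrow \sigma$, $\vdash v \in W$, $\vdash p \in U \Rightarrow \Gamma$, and $W \le U$, then $\sigma \models \Gamma$.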

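Proof: By simultaneous induction on evaluation derivations.

1. Straightforward, using part (2) of the induction hypothesis for the interesting case (E-Case).

2. Consider the final rule in the given derivation.

Case E-FPat: $f = (p \Rightarrow e)$, $\mathrm{match}(v, p) \Rightarrow \sigma$, $\sigma(e) \Downarrow v'$.

From $\vdash f \in U \to V$ we know $\vdash p \in U \Rightarrow \Gamma$ and $\Gamma \vdash e \in V$. By part (3) of the induction hypothesis, $\sigma \models \Gamma$. By the substitution lemma, $\vdash \sigma(e) \in V$. Now, by the induction hypothesis, $\vdash v' \in X$ and $X \le V$, as required.

Case E-FAlt1: $f = f_1 \mid f_2$, $\mathrm{match}(v, f_1) \Rightarrow v'$.

From rule T-FAlt, we see that $\vdash f_1 \in U_1 \to V_1$ and $\vdash f_2 \in U_2 \to V_2$, with $U = U_1 \vee U_2$ and $V = V_1 \vee V_2$. The induction hypothesis yields $\vdash v' \in X$ with $X \le V_1$, from which the result follows immediately by S-Union.

Case E-FAlt2: $f = f_1 \mid f_2$, $\neg\, \mathrm{match}(v, f_1)$, $\mathrm{match}(v, f_2) \Rightarrow v'$.

Similar.

3. Consider the final rule in the given derivation.

Case E-PVar: $p = (x : T)$, $\vdash v \in S$, $S \le T$, $\sigma = \{ x \mapsto v \}$, $\Gamma = x : T$.

Immediate.

Case E-PRcd: $v = (l_1 = v_1, \ldots, l_m = v_m, \ldots, l_n = v_n)$, $p = (l_1 = p_1, \ldots, l_m = p_m)$, $\mathrm{match}(v_i, p_i) \Rightarrow \sigma_i$, $\sigma = \sigma_1 \cup \cdots \cup \sigma_m$.

From T-Rcd, we have $W = l_1 : W_1 \times \cdots \times l_n : W_n$ and $\vdash v_i \in W_i$ for each $i$. Similarly, by T-PRcd, we have $U = l_1 : U_1 \times \cdots \times l_m : U_m$ with $\vdash p_i \in U_i \Rightarrow \Gamma_i$. Finally, by the corollary on record types above, we see that $W_i \le U_i$. Now, by the induction hypothesis, $\sigma_i \models \Gamma_i$ for each $i$. But this means that $\sigma \models \Gamma$, as required.

Case E-PConcat: $p = p_1 \times p_2$, $\mathrm{match}(v, p_1) \Rightarrow \sigma_1$, $\mathrm{match}(v, p_2) \Rightarrow \sigma_2$, $\sigma_1$ and $\sigma_2$ have disjoint domains, $\sigma = \sigma_1 \cup \sigma_2$.

By T-PConcat, we have $U = k_1 : S_1 \times \cdots \times k_m : S_m \times l_1 : T_1 \times \cdots \times l_n : T_n$, with $\vdash p_1 \in k_1 : S_1 \times \cdots \times k_m : S_m \Rightarrow \Gamma_1$ and $\vdash p_2 \in l_1 : T_1 \times \cdots \times l_n : T_n \Rightarrow \Gamma_2$. Since $U \le k_1 : S_1 \times \cdots \times k_m : S_m$ and $U \le l_1 : T_1 \times \cdots \times l_n : T_n$ by the subtyping laws, transitivity of subtyping gives us $W \le k_1 : S_1 \times \cdots \times k_m : S_m$ and $W \le l_1 : T_1 \times \cdots \times l_n : T_n$. Now, by the induction hypothesis, $\sigma_1 \models \Gamma_1$ and $\sigma_2 \models \Gamma_2$. But this means that $\sigma \models \Gamma$, as required.

Case E-PAs: $p = (x\ \mathsf{as}\ f)$, $\mathrm{match}(v, f) \Rightarrow v'$, $\sigma = \{ x \mapsto v' \}$.

By T-PAs, $\vdash f \in U \to V$ and $\Gamma = x : V$. By part (2) of the induction hypothesis, $\vdash v' \in X$ for some $X \le V$. So $\{ x \mapsto v' \} \models x : V$ by the definition of satisfaction.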

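Theorem [Safety]:

1. If $\vdash e \in T$, then $e \Downarrow v$ for some $v$. (That is, the evaluation of a closed, well-typed expression cannot lead to a match failure or otherwise get stuck.)

2. If $\vdash f \in S \to T$ and $\vdash v \in R \le S$, then $\mathrm{match}(v, f) \Rightarrow v'$ with $\vdash v' \in T' \le T$.

3. If $\vdash p \in U \Rightarrow \Gamma'$ and $\vdash v \in S \le U$, then $\mathrm{match}(v, p) \Rightarrow \sigma$ with $\sigma \models \Gamma'$.

Proof: Straightforward induction on derivations.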

Conclusions

We have described a type system that may be of use in checking programs or queries that apply to semistructured data. Unlike other approaches to the problem, it is a relaxed version of a conventional system that can handle the kinds of irregular types that occur in semistructured data.

Although we have established the basic properties of the type system, a good deal of work remains to be done. First, there are some extensions that we do not see as problematic. These include:

Both strict and relaxed set-union operations. In the former case, the two types are constrained to be equivalent. Similarly, one can imagine strict and relaxed case expressions.

Equality. Both absolute equality and equality at type $T$ fit with this scheme.

A bottom type, the nullary case of union types. An immediate application is in the typing rule T-Set for set formation, where we can remove the side condition $n \ge 1$ to allow formation of the empty set $\{\}$.

Additional base types, such as booleans, and operations, such as set filtering.

A top type. Such a type would be completely dynamic and would be analyzed by typecase expressions. One could also add type inspection primitives along the lines described for Amber [Carrk].

An otherwise or fall-through branch in case expressions.

A number of more significant problems also remain to be addressed:

Complexity. The obvious method of checking whether two types are equivalent, or whether one is a subtype of the other, involves first reducing both to disjunctive normal form. As we have observed, this process may be exponential in the size of the two type expressions. We conjecture that equivalence and subtyping can be checked faster, but we have not been able to show this.

Even if these problems turn out to be intractable in general, it does not necessarily mean that this approach to typing semistructured data is pointless. Type inference in ML, for example, is known to be exponential [KTU], yet the forms of ML programs that are the cause of this complexity never occur in practice. Here, it may be the case that types that only have small differences will not give rise to expensive transformations.

Recursive types. The proof of the decidability of subtyping works by induction on the derivation tree of a type, which is closely related to the structure of the type. We do not know whether the same result holds in the presence of recursive types.

Relationship with other typing schemes. There may be some relationship between the typing scheme proposed here and those mentioned earlier [NAM, Ali] that work by inferring structure from semistructured data. Simulation, for example, gives rise to something like a subtyping relationship [BDFS], but it is not clear what would give rise to union types.

Applications. Finally, we would like to think that a system like this could be of practical benefit. We mentioned that there is a group of biological data formats that are all derived from a common basic format. We should also mention that the pattern matching constructs introduced earlier in the paper, independently of any typing issues, might be used to augment other query languages such as XML-QL [DFF+] that exploit pattern matching.

Acknowledgements

The idea of using untagged unions to type semistructured data started when Peter Buneman was on a visiting research fellowship provided by the Japanese Society for the Promotion of Science. He is grateful to Atsushi Ohori for stimulating discussions. Benjamin Pierce is supported by the National Science Foundation under Career grant CCR.

An implementation project and meticulous reading by Davor Obradovic helped us correct several flaws in an earlier version of this paper.

References

[AH] Serge Abiteboul and Richard Hull. IFO: A formal semantic database model. ACM Transactions on Database Systems, December.

[Ali] Alin Deutsch, Mary Fernandez, and Dan Suciu. Storing semistructured data with STORED. In Proceedings of ACM SIGMOD International Conference on Management of Data, June.

[AQM+] S. Abiteboul, D. Quass, J. McHugh, J. Widom, and J. Wiener. The Lorel query language for semistructured data. Journal on Digital Libraries.

[BDCd] Franco Barbanera, Mariangiola Dezani-Ciancaglini, and Ugo de'Liguoro. Intersection and union types: Syntax and semantics. Information and Computation, June.

[BDFS] P. Buneman, S. Davidson, M. Fernandez, and D. Suciu. Adding structure to unstructured data. In Proc. ICDT.

[BDHS] P. Buneman, S. Davidson, G. Hillebrand, and D. Suciu. A query language and optimization techniques for unstructured data. In ACM SIGMOD.

[BLS+] P. Buneman, L. Libkin, D. Suciu, V. Tannen, and L. Wong. Comprehension syntax. SIGMOD Record, March.

[BNTW] Peter Buneman, Shamim Naqvi, Val Tannen, and Limsoon Wong. Principles of programming with complex objects and collection types. Theoretical Computer Science, September.

[Carrk] L. Cardelli. Amber. In B. Robinet, G. Cousineau, and P.-L. Curien, editors, Combinators and Functional Programming Languages. Springer-Verlag, New York.

[CM] M. Consens and T. Milo. Optimizing queries on files. In Proc. ACM SIGMOD, Minneapolis.

[Dam] Flemming M. Damm. Subtyping with union types, intersection types and recursive types. In Masami Hagiya and John C. Mitchell, editors, Theoretical Aspects of Computer Software, Lecture Notes in Computer Science. Springer-Verlag, April.

[DCdP] M. Dezani-Ciancaglini, U. de'Liguoro, and A. Piperno. Filter models for conjunctive-disjunctive lambda-calculi. Theoretical Computer Science, December.

[DFF+] A. Deutsch, M. Fernandez, D. Florescu, A. Levy, and D. Suciu. XML-QL: A query language for XML. http://www.w3.org/TR/NOTE-xml-ql.

[Hay] Susumu Hayashi. Singleton, union and intersection types for program extraction. In T. Ito and A. R. Meyer, editors, Theoretical Aspects of Computer Software (Sendai, Japan), Lecture Notes in Computer Science. Springer-Verlag, September. Full version in Information and Computation.

[KTU] A. J. Kfoury, J. Tiuryn, and P. Urzyczyn. An analysis of ML typability. Journal of the ACM, March.

[NAM] S. Nestorov, S. Abiteboul, and R. Motwani. Inferring structure in semistructured data. In Proceedings of the Workshop on Management of Semistructured Data. Available from http://www.research.att.com/~suciu/workshop-papers.html.

[Pie] Benjamin C. Pierce. Programming with intersection types, union types, and polymorphism. Technical Report CMU-CS, Carnegie Mellon University, February.