SchemaSQL - A Language for Interoperability in Relational Multi- Systems*

Laks V. S. Lakshmanan+ Fereidoon SadriS Iyer N. Subramaniant

1 Introduction Abstract In recent years, there has been a tremendous pro- liferation of in the work place, dominated We provide a principled extension of SQL, called by relational database systems. An emerging need SchemaSQL , that offers the capability of uni- for sharing data and programs across the different form manipulation of data and meta-data in re- databases has motivated the need for Multi-database lational multi-database systems. We develop a systems (MDBS), sometimes also referred to as hetero- precise syntax and semantics of SchemaSQL in geneous database systems and federated database sys- a manner that extends traditional SQL syntax tems. Systems capable of operating over a distributed and semantics, and demonstrate the following. (1) network and encompassing a heterogeneous mix of SchemaSQL retains the flavour of SQL while sup- computers, operating systems, communication links, porting querying of both data and meta-data. (2) and local database systems have become highly desir- It can be used to represent data in a database able, and commercial products are slowly app’earing in a structure substantially different from origi- on the market. For surverys on MDBS, see [ACM901 nal database, in which data and meta-data may (in particular, Sheth and Larson, and Litwin, hark, be interchanged. (3) It also permits the cre- and Roussopoulos), and Hsiao [Hsi92]. ation of views whose schema is dynamically de- One of the fundamental requirements in a multi- pendent on the contents of the input instance. (4) database system is interoperabikty, which is the ability While aggregation in SQL is restricted to values to uniformly share, interpret, and manipulate informa- occurring in one column at a time, SchemaSQL tion in component databases in a MDBS. Almost all permits “horizontal” aggregation and even aggre- factors of heterogeneity in a MDBS pose challenges for gation over more general “blocks” of informa- interoperability. These factors can be classified into se- tion. (5) SchemaSQL provides a great facility mantics issues (e.g., interpreting and cross-relating in- for interoperability and data/meta-data manage- formation in different local databases), syntactic issues ment in relational multi-database systems. We (e.g., heterogeneity in database schemas, data models, provide many examples to illustrate our claims. and in query processing, etc.), and systems issues (e.g., We outline an architecture for the implementa- operating systems, communication protocols, consis- tion of SchemaSQL and discuss implementation tency management, security management, etc). We algorithms based on available database technology focus on syntactic issues here. We consider the prob- that allows for powerful integration of SQL based lem of interoperability among a number of component relational DBMS. relational databases storing semantically similar infor- mation in structurally dissimilar ways. As was pointed l This work was supported by grants from the Natural Sci- out in [KLKSl], the requirements for interoperability ences and Engineering Research Council of Canada (NSERC), even in this casefall beyond the capabilities of conven- the National Science Foundation (NSF), and The University of North Carolina at Greensboro. tional languages like SQL. t Dept of Computer Science, Concordia University, Mon- Some of the key features required of a language for treal, Canada. {laks,subbu}Ocs.concordia.ca interoperability in a (relational) MDBS are the fol- $ Dept of Mathematical Sciences, University of North Car- olina, Greensboro, NC. [email protected] lowing. (1) The language must have an expressive Permission to copy without fee all or port of this material is power that is independent the schema with which a gmnted provided that the copies ore not, mode or distributed for database is structured. For instance, in most conven- direct commercial advantage, the VLDB copyright notice and tional relational languages, some queries (e.g., “find all the title of the publication and its dote appear, and notice is department names”) expressible against the database given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires o fee univ-A in Figure 2 are no longer expressible when the and/or special permission from the Endowment. information is reorganized according to the schema of, Proceedings of the 22nd VLDB Conference say, univ-B there. This is undesirable and should be Mumbai(Bombay), India, 1996 avoided. (2) To promote interoperability, the language

239 must permit the restructuring of one database to con- a variable that ranges over the (tuples of the) form to the schema of another. (3) The language employees (in the usual SQL jargon, these variables must be easy to use and yet sufficiently expressive. are called aliases.) The select and where clauses re- (4) The language must provide the full data manip- fer to (the extension of) attributes, where an attribute ulation and definition capabilities, and must be is denoted as . , var being a (tuple) downward compatible with SQL, in the sense that it variable declared in the from clause, and attName be- must be compatible with SQL syntax and semantics. ing the name of an attribute of the relation (extension) We impose this requirement in view of the importance over which var ranges. and popularity of SQL in the database world. (5) Fi- When no ambiguity arises, SQL permits certain ab- nally, the language must admit effective and efficient breviations. Queries of Figure l(b,c) are equivalent to implementation. In particular, it must be possible to the first one, and are the most common ways such realize it non-intrusive implementation that would re- queries are written in practice. Note that in Figures quire minimal additions to component RDBMS. l(b) and l(c), empl 0 y ees acts essentially as a tuple Ccmtcibutions: (1) We propose a language called variable. SchemaSQL which meets the above criteria, re- The SchemaSQL syntax extends that of SQL in sev- view the syntax and semantics of SQL, and develop eral directions. SchemaSQL as a principled extension of SQL. As a The federation consists of databases, with each result, for a SQL user, adapting to SchemaSQL is database containing relations. The syntax allows relatively easy. (2) We illustrate via examples the fol- to distinguish between (the components of) dif- lowing powerful features of SchemaSQL : (i) uniform ferent databases. manipulation of data and meta-data; (ii) creating re- structured views and the ability to dynamically create To permit meta-data queries and restructuring views, SchemaSQL permits the declaration of output schemas; (iii) the ability to express sophisti- other types of variables in addition to the (tuple) cated aggregate computations far beyond those express- ible in conventional languages like SQL (Sections 3.1, variables permitted in SQL. 4.1). (3) We propose an implementation architecture Aggregate operations are generalized in for SchemaSQL that is designed to build on existing SchemaSQL to make horizontal and block aggre- RDBMS technology, and requires minimal additions to gations possible, in addition to the usual vertical it, while greatly enhancing its power (Section 5). We aggregation in SQL. provide an implementation algorithm for SchemaSQL, In this section we will concentrate on the first two and establish its correctness. We, also discuss novel aspects. Restructuring views and aggregation are dis- query optimization issues that arise in the context of cussed in Section 4. this implementation. (4) Finally, we propose an ex- Variable Declarations in SchemaSQL tension to SchemaSQL for systematically resolving the SchemaSQL permits the declaration of variables semantic heterogeneity problem arising in a MDBS en- that can range over any of the following five sets: (i) vironment (Section 6). names of databases in a federation; (ii) names of the In this paper we illustrate the semantics of relations in a database; (iii) names of the attributes in SchemaSQL mainly via examples. A precise seman- the scheme of a relation; (iv) tuples in a given relation tics of SchemaSQL, together with many more exam- in a database; and (v) values appearing in a column ples illustrating its powerful features can be found in corresponding to a given attribute in a relation. Vari- [LSS96b]. able declarations follow the same syntax as in SQL, where var is any identifier. However, 2 Syntax there are two major differences. (1) The only kind of range permitted in SQL is a set of tuples in some re- Our goal is to develop SchemaSQL as a principled lation in the database, whereas in SchemaSQL any of extension of SQL. To this end, we briefly analyze the five kinds of ranges above can be used to declare the syntax of SQL, and then develop the syntax of variables. (2) More importantly, the range specifica- SchemaSQL as a natural extension. Our discussion tion in SQL is made using a constant, i.e. an identifier below is itself a novel way of viewing the syntax and referring to a specific relation in a database. By con- gemantics of SQL, which, in our opinion, helps a better trast, the diversity of ranges possible in SchemaSQL understanding of SQL subtleties. permits range specifications to be nested, in the sense In an SQL query, the (tuple) variables are declared that it is possible to say, e.g., that X is a variable rang- in the from clause. A variable declaration has the ing over the relation names in a database D, and that form . For example, in the query in T is a tuple in the relation denoted by X. These ideas Figure l(a), the expression employees T declares T as are made precise in the following definition.

240 select T. name select employees.name select name from employees T from employees from employees where T.department = where employees.department = where department = “Marketing” “Marketing” “Marketing” (a> (b) cc>

Figure 1: Syntax of simple SQL queries Definition 2.1 (Range Specifications) The con- section we discuss SchemaSQL queries with a fixed cepts of range specifications, constant, and variable output schema. The topics of dynamic output schema identifiers are simultaneously defined by mutual recur- and restructuring views are discussed in the next sec- sion as follows: tion. 1. Range specifications are one of the following five The following federation of databases is used as types of expressions, where db , rel , attr are our running example. Consider the federation con- any constant or variable identifiers (defined in 2 sisting of four databases, univ-A, univ-B, univ-C, below). and univ-D. Each database has (one or more) rela- tion(s) that record(s) the salary floors for employees The expression -> denotes a range corre- (4 by their categories and their departments, as follows: sponding to the set of database names in the federation. 0) The expression db-> denotes the set of rela- univ-A has a relation salInf o (category, tion names in the database db. dept , salFloor). (4 The expression db: :rel-> denotes the set of univ-B has a relation salInfo (category, names of attm’butes in the scheme of the re- deptl, dept2, . . . >. Note that the domains of lation rel in the database dbl. deptl, dept2, . . . are the same as the domain (4 db: :rel denotes the set of tuples in the re- of salFloor in univ-A: : salInf o. lation rel in the database db. (4 db: :rel. attr denotes. the set of values ap- univ-C has one relation for each department with pearing in the column named attr in the re- the scheme depti (category, salFloor). lation rel in the database db. univ-D has a relation salInfo (dept, catl, cat2, . . . >. Note that the domains of cat 1, 2. A variable declaration is of the form cat2, . . . are the same as the domain of where is one of the range speci- salFloor in univ-A: : salInf o. fications above and is an identifie?. An identifier is said to be a variable if it is de- Figure 2 shows some sample data in each of these four clared as a variable by an expression of the form databases. in the from clause. Variables de- Example 3.1 List the departments in univ-A that clared over the ranges (a) to (e) are called db- pay a higher salary floor to their technicians compared name, rel-name, attr-name, tuple, and domain with the same department in univ-B. variables, respectively. Any identifier not so de- clared is a constant. select A.dept from univ-A::salInfo A, univ-B::salInfo B, univiFi;salInfo-> httB As an illustration of the idea of nesting variable dec- where <> category” and larations, consider the clause from dbl-> X, dbl: :X A. dept = AttB and T. This declares X as variable ranging over the set of (91) A.category = “technician” and relation names in the database dbl and T as a vari- B.category = “technician” and A.salFloor > B.AttB able ranging over the tuples in each relation X in the database dbl. Explanation: Variables A and B are (SQL-like) tupie The following sections provide several examples variables ranging over the relations univ-A : : salInf o demonstrating various capabilities of-SchemaSQL. and univ-B : : salInf o, respectively. The variable AttB is declared as an attribute name of the rela- 3 Fixed Output Schema tion univ-B: : salInfo. It is intended to be a depti attribute lhence the condition AttB <> “category” In this and the next section, we illustrate via examples in the where clause). The rest of the query is self- the many powerful features of SchemaSQL . In this explanatory. 8 ‘The intuition for the notation is that we can regard the attributes of a relation as written to the right of- the relation Example 3.2 List the departments in univ-C that name itself! ‘Abbreviations similar in spirit to those allowed for SQL are pay a higher salary floor to their technicians compared also allowed in SchemclSQL . with the same department in univ-D.

241 univ-B univ-C salInf0 univ-A cs salInf0 -1

univ-D Math salInf0 category salFloor Prof 70,000 pyipz&q Assoc Prof 60.000 I I

Figure 2: Representing Similar Information Using Different Schemas in Multiple Databases univ-A, univ-B, univ-C, and univ-D

pl;ct RelC similar way, aggregations over values collected from unlv-C-> RelC, univ-C::RelC C, univ-D::salInfo D more than one database can also -be expressed. Block (02). where RelC = D.dept and aggregations of a more sophisticated form are illus- C. category = “tec6nician” and trated in Example 4.3. n C.salFloor > D.technician

3.1 Aggregation with Fixed Output Schema 4 Dynamic Output Schema and Re- structuring Views In SQL, we are restiicted to %ertical” (or column- wise) aggregation on a predetermined set of columns, The result, of an SQL query (or view definition) is a while SchemaSQL allows “horizontal” (or row-wise) single relation. Our discussion in the previous section aggregation, and also aggregation over more general was limited to the fragment of SchemaSQL queries “blocks” of information. We illustrate these points that produce one relation, with a fixed schema, as out- with examples. The formal development of semantics put. In this section, we provide examples to demon- can be found in [LSS96b]. strate the following capabilities of SchemaSQL . (i) declaration of dynamic output schema, (ii) restruckr- Example 3.3 The query ing views, and (iii) interaction between dynamic output select T.category, avg(T.D) schema creation and aggregation. (co) from univ-B::salInfo-> D, We illustrate the capabilities of SchemaSQL for $he univ-B : : salInfo T where D 0 “category” generation of an output schema which can dynami- group by T.category cally depend on the input instance (i.e. the databases in the federation). While aggregation in SQL is re- computes the average salary floor of each category of stricted to vertical aggregation on a predetermined set employees over all departments in univ-6. This cap- of columns, we have so far seen that SchemaSQL can tures horizontal aggregation. The condition D <> express horizontal aggregation and aggregation over “category” enforces the variable D to range over de- partment names. Hence a knowledge of department more general “blocks” (see Example 3.3). In this sec- names (and even the number of departments) is not tion, we shall see that the combination of dynamic required to express this query. Alternatively, we could output schema and meta-data variables allows us to enumerate the departments, e.g., use the condition (D express vertical aggregation on a variable number of = “Math” or D = “CS” or . ..)“. By contrast, the query columns as well. The examples in this and later sec- tions are based on the database schema of Figure 2. select T.category, avg(T.salFloor) (cm from univ-C-> D, univ-C::D T group by T.category Example 4.1 Consider the relation saIInfo in the database univ-B. The following SchemaSQL view def- computes a similar information from univ-C. Notice inition restructures this inform&ion into the format of that the aggregation is computed over a multiset of the schema univ-A : : salInf o. values obtained from several relations in univ-C. In a create view 3An elegant solution would be to specify some kind of “type BtoA::salInfo(category, dept, salFloor) as hierarchy” for the attributes which can then be used for saying select T.category, D, T.D “D is an attribute of the following kind’, rather than “D is (95) from univ-B::salInfo-> D, one of the following attributes”. Our proposed extension to univ-9::salInfy T SchemaSQL discussed in Section 6, addresses this issue. where D <> category

242 Explanation: Two variables are declared in the section, we shall see that when SchemaSQL aggrega- from clause: T is a tuple variable ranging over the tion is combined with its view definition facility, it is tuples of relation univ-B: : salInf o, and D is an possible to express vertical aggregation over a variable attribute-name variable ranging over the attributes of number of columns, determined dynamically by the univ-B : : salInf o. The condition in the where clause input instance. forces D to be a department name. Finally, each output tuple (T . category, D, T . D) lists the category, depart- Example 4.3 Suppose’that in the database univ-D ment, name, and the corresponding salary floor (which in Figure 2, there is an additional relation is in the format of univ-A : : salInf 0). f acuity (dname, f name>relating each department to Note that each tuple in the univ-B: : salInf o for- its faculty. Consider the query mat generates several tuples in the univ-A : : salInf o scheme. The mapping, in this respect, is one-to-many. select U.fname, avg(T.C) But each instantiation of the variables in the query, from univ-D::salInfo-> C, actually contributes to one output tuple. n (Q7) univ-D::salInfo T, univ-D::faculty U where C <> “dept” and T.dept = U.dname The following example illustrates restructuring involv- group by U.fname ing dynamic creation of output schema. Q7 computes, for each faculty, the faculty-wide av- Example 4.2 This view definition restructures data erage floor salary of all employees (over all depart- in univ-A: : salInf o into the format of the schema ments) in the faculty. Notice that the aggregation is performed over ‘rectangular blocks’ of information. univ-B::salInfo. Consider now the following view definition QS, which create view AtoB: : salInf o (cateEory , D> as is essentially defined using the query Q7. select A. category, A. salFlo& - from (06) univ-A: : salInfo A, A.dept D create view averages::salInfo(faculty, C) as select U.fname, avg(T.C) Explanation: Each tuple of univ-A: : salInf o con- from univ-D::salInfo-> C, tains the salary floor for one category in a single de- (Q8) univ-D : :salInfo T, univ-D::faculty U partment, while each tuple of univ-B: : salInf o con- where C 0 “dept” and T.dept = U.dname tains the salary floors for one category in every de- group by U.fname partment. Intuitively, all tuples in univ-A : : salInf o corresponding to the same category are grouped to- The view defined by QS actually computes, for each gether and “merged” to produce one output tuple. faculty, the average floor salary in each category of em- Another aspect of this restructuring view is the use ployees (over all departments) in the faculty. This is of variables in the create view clause. The variable D achieved by using the variable C, ranging over cat- in create view AtoB: : salInfo(category, D) is de- clared as a domain variable ranging over the values of egories, in the dynamic output schema declaration the dept attribute in the relation univ-A: : salInf o. through the create view statement. n Hence, the schema of the view AtoB : : salInf o is “dy- namicalljr” declared as AtoB : : salInf o (category, 5 Implementation Issues deptl, . . . . deptn) , where dept 1, . . . , deptn are the values occurring in the dept column in the relation In this section we describe the architecture of a sys- univ-A: : salInf o. tem for implementing a multidatabase querying and The restructuring in this example corresponds to restructuring facility based on SchemaSQL. A high- a many-to-one mapping from instantiations to output light of our architecture is that it builds on existing tuples. n architecture in a non-intrusive way, requiring mini- mal extensions to prevailing database technology. This In the full paper [LSS96b], we present additional ex- makes it possible to build a SchemaSQL system on top amples to illustrate restructuring views that distribute of (already available) SQL systems. We also identify values from one tuple into many relations, and vice- novel query optimization opportunities that arise in a versa. Examples of many-to-many mappings (e.g. multidatabase setting. mappings between schemes of univ-B and univ-D) are The architecture consists of a SchemaSQL server also given there. that communicates with the local databases in the fed- eration. We assume that the meta-information com- 4.1 Aggregation with Dynamic View Defini- prising of component database names, names of the tion relations in each database, names of the attributes In Section 3, we illustrated the capability of in each relation, and possibly other useful informa- SchemaSQL for computing (i) horizontal aggregation tion (such as statistical information on the component and (ii) aggregation over blocks of information col- databases useful for query optimization) are stored in lected from several relations, or even databases. In this the SchemaSQL server in the form of a relation called

243 Federation System (FST). Due to the varying OUTPUT: Bindings for the variables appearing in the degrees of autonomy component databases enjoy in a select clause of the SchemaSQL statement. multidatabase system, some or all of this information METHOD: The algorithm consists of two phases. may not available. In [LSS96b] we describe a flex- (1) Corresponding to a set of variable declarations in ible architecture that makes use of as much of the the from clause, create VITs using one or more SQL available information as possible. In discussions here, queries against soine local databases and/or the FST. we assume that the component database names as (2) Rewrite the original SchemaSQL query against the well as their schema information is available in the federation into an equivalent query against the set of SchemaSQL server. VIT relations and run it using the resident SQL server. In our architecture, global SchemaSQL queries are Phase I submitted to the SchemaSQL server, which deter- (0) The input SchemaSQL statement is rewritten into mines a series of local SQL queries and submits them the following form such that the conditions in the to the local databases. The SchemaSQL server then where clause are in conjunctive normal form. collects the answers from local databases, and, using select Sl,...,S, its own resident SQL engine, executes a final series of from (rangel) VI, . . . , (rangek) Vk where (condl) and . . . and (cond,) SQL queries to produce the answer to the global query. groupby groupList Intuitively, the task of the SchemaSQL server is to having haveconditions compile the instantiations for the variables declared in (1) Consider the variable declaration for variable Vi. the query, and enforce the conditions, groupings, ag- (a) If Vi is a meta-variable: In this case, all variables gregations, and mergings to produce, the output. Many in the declaration (rangei) Vi range over meta-data. query optimization opportunities at different stages, Create VITi with a schema consisting of Vi and any and at different levels of abstraction, are possible, and variables appearing in (rangei), and contents obtained should be employed for efficiency (see discussions at using an appropriate SQL query against the FST. For the end of this section). Figure 3 depicts our archi- example, let 'D: :rel-> Vi ’ be the declaration and one of the conditions in the where clause be ‘Vi .op. c’ tecture for implementing SchemaSQL. Algorithm 5.1, where .op. is a (in)equality operator and c is a constant. gives a more detailed account of our query processing Obtain VITi corresponding to VITi as: strategy. select db-name as D, attr-name as Vi from FST Query processing in a SchemaSQL environment where rel-name = ‘rel’ and attr-name .op. c consists of two major phases. In the first phase, ta- Meta-variable declarations of other forms are han- bles called VIT’s (Variable Instantiation Table) cor- dled in a similar way. responding to the variable declaration in the from (b) If Vi is a domain variable: Group together domain clause of a SchemaSQL statement are generated. The variable declarations that are declared using the same tuple variable as for Vi. Create VITi with schema con- schema of a VIT consi&s of all the variables in one sisting of the domain variables in the group. Obtain or more variable declarations in i&e from clause and the (tuple of) bindings for the attr-name variables (in its contents correspond to instantiat’ions of these vari- the range declarations) in the group, using their cor- ables. VIT’s are materialized by executing appro- responding VIT’s. Using this, generate a set of SQL priate SQL queries on the FST and/or component queries against local databases. The contents of VITi databases. In the second phase, the SchemaSQL query will be the union of answers to these queries. For ex- ample, let db::rel T be a tuple variable declaration, is rewritten into an equivalent SQL query on the ‘T.A Vi’ be the declaration for the domain variable VIT’s and the generated answer is appropriately pre- and ‘Vi .op. c’ be a condition in the where clause, where sented to the user. Our algorithm below considers .op. is a (in)equality operator and c is a constant. Let SchemaSQL queries with a fixed output schema pos- ‘T.attr Vj’ be another domain variable declaration in sibly with aggregation. A complete algorithm for the from clause. the implementation of the fuil language, as well as (i) Obtain the bindings for attr-name variable A from its VIT, and name it relation T. novel query optimization strategies are discussed in [LSS96b]. (ii) For a E T, generate an SQL query against In the following, we assume that the FST has database db: the scheme FST (db-name, rel-name, attr-name). select a as Vi, attr as Vj from rel Also, we refer to the db-name, rel-name, and attr- where a .op. c name variables (defined in Definition 2.1) collectively as meta-variables. (iii) Obtain VITi as the union of answers to all the SQL queries generated in (ii), against db.. Algorithm 5.1 SchemaSQL Query Processing Domain-variable declarations of other forms (e.g. INPUT: A SchemaSQL query with a fixed output when db, rel, attr are also variables) are handled in a schema and aggregation. similar way.

244 < > Resident SQL Engine final series of answers to yp SQL queries queries Ql . . . Qn , collected

SchemaSQL final answer ‘schemaSQL ---m-e----- Server Federation Query User

optimized local SQL query Qn

Figure 3: SchemaSQL - Implementation Architecture (c) If Vi is a tuple variable: Generate bindings for the (iii) Obtain VITi as the union of all the SQL state- meta-variables in (rangei) as in case (a). The at- ments generated in (ii). tributes of the VIT corresponding to Vi are obtained by analyzing the select, where, group by, and having Tuple variable declarations of other forms are han- clauses. We consider a variable V as relevant in the dled in a similar way. context of tuple variable Vi, if (i) V is of the form Vi.C Phase II or Vi.c (C, c are a variable and constant respectively) Execution of this phase happens in the SchemaSQL and occurs in the select, groupby, or having clause, or server. The SchemaSQL query is rewritten into an (ii) V occurs in the declaration of Vi and either is com- equivalent conventional SQL statement on the VIT’s pared with a variable in the where clause, or occurs in generated in Phase I, in the following way. (a) The se- the select clause, or (iii) V occurs in a relevant variable lect, group by, and having clauses of the rewritten query of the form Vi.V and V is compared with a variable in are obtained by copying the corresponding clauses in the where clause. The schema of the VIT is the set the SchemaSQL query after disambiguating the at- consisting of all relevant variables in the context of Vi. tribute names that appear in more than one VIT; (b) Findly, the ontents of VIT are obtained by generat- the from clause consists of the subset of VIT’s relevant ing approp l-flate SQL queries against local databases. to the final result, and (c) the where clause is obtained In general, if there are occurrences of the form Vi.C by retaining the conditions involving tuple variables in the select or the where clause, the VIT would be and by adding a condition ‘VITi.X = VITj.X’ for obtained as a union of several SQL queries. tables VITi and VITj having a common attribute. For example, let the select clause contain an aggre- gation of the form avg(Vi.C), the variable declaration be ‘db::R Vi’ and two of the conditions in the where It is interesting to note that using our algorithm, clause be ‘Vi.al .op. Vj.az’ and ‘Vi.as .op. c’, where the novel horizontal aggregation (Section 3.1, Exam- al, a2, as, c are constants. ple 3.3) which cannot be performed in a conventional SQL system, can be easily realized in our framework. 6) Obtain a VIT corresponding to db-> R (as in (a) above) and name it T. More general kind of ‘block’ aggregations can also be handled in a similar way - details can be found in (ii) The schema of VITi is {&.C:, K.al}. [LSS96b], which also contains the proof of the follow- ing theorem. (ii) For each T E T, obtain the attribute names in re- lation T (using an SQL query on the FST) and reycate the following SQL statement. Theorem 5.1 Algorithm 5.1 correctly computes an- 1,. . . , ck be the instantiations of C, corre- swers to SchemaSQL queries. sponding to T. select cl as Vi.C, al as Vi.al Example 5.1 In this example, we illustrate our al- from r gorithm using a variant of the query Q2 of Example where Vi.as .op. c 3.2. UNION . . . ‘List the departments in univ-C that pay a higher UNION salary floor to their technicians compared with the select ck as vi.c, al as vi.al same department in univ-D. List also the (higher) from r where Vi.a3 .op. c pay.’

245 Figure 4: Example - Query Processing select RelC, C.salFloor A SchemaSQL system on the PC-Windows plat- from univ-C-B RelC, univ-C::RelC C, univ-D::salInfo D form is currently under implementation. where RelC = D.dept and C.category = “technician” and Query Optimization C.salFloor > D.technician There are several opportunities for query optimization Phase I which are peculiar to the MDBS environment. In the VITl corresponding to the variable declaration following, we identify the major optimization possi- univ-C-> RelC is created using: bilities and sketch how they can be incorporated in Algorithm 5.1. ;;el;ct $-+-name as RelC where db-name = ‘univ-C’ 1. The conditions in the where clause of the input SchemaSQL query should be pushed inside the lo- Figure 4 shows VITl. To generate the SQL state- cal spawned SQL queries so that they are as ‘tight’ ment that creates VITz, the following SQL queries are issued against the FST. as possible. Algorithm 5.1 incorporates this opti- mization to some extent. select attr-name select attr-name ;f;ro~~FsT from FST 2. Knowledge of the variables in the select and where where db-name = ‘univ-C’ db-name = ‘univ-C’ clauses can be made use of to minimize the size and rel-name = ’ cs’ and rel-name = ‘math’ of the VIT’s generated in Phase I. For example, if certain attributes are not required for processing Let the answer to both the queries be {category, in Phase II, they can ‘dropped’ while generating salFloor). VITz, corresponding to univ-C : : RelC C is obtained by querying the database univ-C using: the local SQL queries. select ‘cs’ as RelC, cs.salFloor as C.salFloor 3. If more than one tuple variable refers to the same from database, and their relevant where conditions do where Z . category = ‘technician’ not involve data from another database, the SQL select ‘%gv as RelC, statements corresponding to these variable dec- iE;k.salFloor as C.salFloor from larations should be combined into one. This where math.category = ‘technician’ would have the effect of combining the VIT’s To obtain VIT3 corresponding to univ-D : : salInf o corresponding to these variable declarations and D, querying is first done on the FST to obtain the thus reducing the number of spawned local SQL grvebof the attributes in relation sa.Unfo of database queries. This can be incorporated by modifying the step I(c) of our algorithm. select attr-name from FST 4. One of the costliest factors for query evaluation in where db-name = ‘univ-D’ & rel-name = ‘salInfo’ a multidatabase environment is database connec- Let the answel;to this query be {dept, prof, techni- tivity. We should minimize the number of times cian}. VITS, shown in Figure 4 is obtained by query- connections are made to a database during query ing the database univ-D: evaluation. Thus, the spawned SQL statements select dept as D.dept, need to be submitted (in batches) to the compo- “,B,+$i.gim as D.technician nent databases in such a way that they are eval- from uated in minimal number of connections to the Phase II databases. Having obtained all the VIT’s corresponding to the variable declarations, Phase II now consists of rewrit- 5. In view,of the sideways information passing (sip) ing the SchemaSQL statement into the following SQL [BFU36] technique inherent in our algorithm, re- statement to obtain the final answer. ordering of variable declarations would result in select RelC, C.salFloor more efficient query processing. However, the from VIT2, VIT3 where RelC = D.deDt and heuristics that meta-variables obtain a signifi- C.salFloor >*D.technician cantly less number of bindings when compared n to other variables in a multidatabase setting,

246 presents novel issues in reordering. For instance for explicitly accessing the data as well as its context the order ‘db: :r.a R, -> D, D-> R’ suggested information. by the conventional reordering strategies could be In this section, we sketch how SchemaSQL can be worse than ‘ -> D, D-> R, db::r.a R’ because extended with the wherewithal to tackle the seman- of the lower number of bindings R obtains for tic heterogeneity problem. We extend the proposal T/IT2 in the latter. of [SSR94], by associating the context information to relation names as well as attribute names, in ad- We should make use of works such as [LN90, dition to the values in a database. Also, in the LNSSO] to determine which of the VIT’s should SchemaSQL setting, there is a natural need for in- be generated first so that the tightest bindings cluding the type information of an object as part of are passed for generating subsequent VITs. its context information. We propose techniques for in- If parallelism can be supported, SQL queries to tensionally specifying the semantic values as well as multiple databases can be submitted in parallel. for algorithmically deriving the (intensional) semantic Replication and Inconsistency value specification of a restructured database, given the old specification and the SchemaSQL view defi- Replication of data, and inconsistency among data nition. The following example illustrates our ideas. from local databases are common in multidatabase sys- tems. The view facility of SchemaSQL and our archi- Details can be found in [LSS96b]. tecture provide the means to cope with these difficul- Example 6.1 Consider the database univlnfoA hav- ties. ing a single relation stats with scheme {cat, cs, math, Controlled (intentional) replication can be addressed Ontario, quebec}. This database stores information on through the Federation System Table, FST. A copy of the floor salary of various employee categories for each the replicated data is identified as the pramay copy, department (as in univ-B of the university federation) and the FST routes all references to the replicated data as well as information on the average number of years to the primary copy. The choice of the primary copy it takes to get promoted to a category, in each province is influenced by factors such as efficiency of query pro- in the country. The type information of the objects in cessing, network connectivity, and the load at local the database univlnfoA is stored in a relation called isa sites. In a dynamic scheme, the FST is updated in and is captured using the following rules4: response to changes in the network (e.g., network dis- isa(cs, dept) t connection) and the load at local sites. isa(math, dept) t Data replication and overlap among (independent) lo- isa(ontario,prov) e caI sites, with the possibility of inconsistency, is much isa(quebec, pruv) t subtler. The view facility:of SchemaSQL can be used isa(C,cat) t stats[cat + C] to resolve inconsistencies by exposing only the appro- isa(S, sal) t stats[D + S], isa(D, dept) priate data through the view. This is similar to the isa(Y, gear) t stats[P -+ Y], isa(P,prov) approach taken in multidatabase systems utilizing an Now, consider restructuring univlnfoA into univln- (integrated) global schema. Our architecture is more foB which consists of two relations salstats{dept, prof, flexible, and does not require a global schema, yet, the assoc-prof} and timestats{prov, prof, assoc-prof}. sal- view facility can mimic the role played by the global stats has tuples of the form < d, sr, sz >, represent- schema for resolving data inconsistency. ing the fact that d is a department that has a floor salary of si for category professor, and sz for asso- 6 Semantic Heterogeneity ciate professor. A tuple of the form < p, yr , yz > in One of the roadblocks to achieving true interoperabil- timestats says that p is a province in which the av- ity is the heterogeneity that arises due to the differ- erage time it takes to reach the category professor is ence in the meaning and interpretation of similar data yi and to reach the category associate professor is yz. across the component systems. This semantic hetero- The following SchemaSQL statements perform the re- geneity problem has been discussed in detail in [Siggl], structuring that yields univlnfoB. [KCGS93], [HM93]. A promising approach to dealing create view with semantic heterogeneity is the proposal of Sciore, univInfoB: :salstats(dept, T.cat) as Siegel, and Rosenthal [SSR94]. The main idea behind select D, T.D from univInfoA: : stats T, their proposal is the notion of semantic vdues, ob- univInf oA: : stats-> D, tained by introducing an explicit context information where D isa ‘dept’ to each data object in the database. In applying this create view idea to the , they develop an exten- 4The syntax of the type specification rules is ‘baaed on the sion of SQL called Context-SQL (C-SQL) that allows syntax of SM [LSS96a].

247 ’ univInfoB::timestats(prov, T.cat) as schema against which the component databases could select P, T.P from univInf oA: : stats T, be mapped. Though MSQL (and its extension) has univInfoA::stats-> P, facilities for ranging variables over multiple database where P isa ‘prov’ names, its treatment of data and meta-data is non- uniform in that relation names and attribute names Note how the type information is used in the where clause to elegantly specify the range of the attribute are not given the same status as the data values. The variables. Our algorithm that processesthe restructur- issues of schema independent querying and resolving ing view definitions derives the following intensional schematic discrepancies of the kind discussed in this type specification for univlnfoB: paper, are not addressed in their work. isa(prof, cat) t Many object-oriented query languages, by virtue of isa(assoc-prof, cat) t treating the schema information as objects, are ca- isa(D,dept) t saZstats[dept + D] isa(S, sal) t saZstats[C + S], isa(C, cat) pable of powerful meta-data querying and manipula- isa(P,prov) t timestats[prov + P] tion Some of these languages include XSQL (Kifer, isa(Y, year) t timestats[P + Y], isa(P,prov) Kim, and Sagiv [KKS92]), HOSQL (Ahmed et al. [ASD+Sl]), and OSQL (Chomicki and Litwin [CL93]). Query processing in this setting involves the fol- XSQL ([KKS92]) has its logical foundations in F- lowing modification to the processing of comparisons logic ([KLW95]) and is capable of querying and re- mentioned in the user’s query. The comparison is per- structuring object-oriented databases. However, it formed after (a) finding the type information using the is not suitable for the needs addressed in this pa- specification, (b) finding the associated context infor- per as its syntax was not designed with interoper- mation, and (c) applying the appropriate conversion ability as a main goal. Besides, the complex nature functions. [LSS96b] has the details. of this raises concerns about effective and efficient implementability, a concern not addressed 7 Comparison with Related Work in [KKS92]. The Pegasus Multi-database system In this section, we compare and. contrast our proposal ([ASD+Sl]) uses a language called HOSQL as its data against some of the related work for meta-data manip- manipulation language. HOSQL is a functional object- oriented language that incorporates non-procedural ulation and multidatabase interoperability. statements to manipulate multiple databases. OSQL The features of SchemaSQL that distinguishes it ([CL93]), an extension of HOSQL is capable of tackling from similar works include schematic discrepancies among heterogeneous object- Uniform treatment of data and metadata. oriented databases with a common . Both No explicit use of object identifiers. HOSQL and OSQL do not provide for ad-hoc queries that refer to many local databases in the federation in Downward compatibility with SQL. one shot. While XSQL, HOSQL, and OSQL have a SQL flavor, unlike SchemaSQL , they do not appear Comprehensive aggregation facility. to be downward compatible with SQL syntax and se- Restructuring views, in which data and meta-data mantics. In other related work, [Ros92] proposes an may be interchanged. interesting algebra and calculus that treats relation Designed specifically for interoperability in multi- names at par with the values in a relation. However, database systems. its expressive power is limited in that attribute names, database names, and comprehensive aggregation capa- Further, we also discuss the implementation of bilities are not supported. SchemaSQL on a platform of SQL servers. In [LBT92], Lefebvre, Bernus, and Topor use F- In [Lit89, GLRS93], Litwin et al. propose a multi- logic ([KLW95]), t o reconcile schematic discrepan- database manipulation language called MSQL that is cies in a federation of relational databases. Unlike capable of expressing queries over multiple databases SchemaSQL which can provide a ‘dynamic global in a single statement. MSQL extends the traditional schema’, ad hoc queries that refer the data and schema functions of SQL to the context of a federation of components of the local databases in a single state- databases. The salient, features of this language in- ment cannot be posed in their framework. clude the ability to retrieve and update relations in UniSQL/M [KGK+95] is a multidatabase system different databases, define multi-database views, and for managing a heterogeneous collection of relational specify compatible and equivalent domains across dif- database systems. The language of UniSQL/M, ferent databases. [MR95] extends MSQL with fea- known as SQL/M, provides facilities for defining a tures for accessing external functions (for resolving global schema over related entities in different local semantic heterogeneity) and for specifying a global databases, and to deal with semantic heterogeneity is-

248 sues such as scaling and unit transformation. How- allows for expressing powerful queries and programs in ever, it does not have facilities for manipulating meta- the context of schema browsing and interoperability. . data. Hence features such as restructuring views that A formal account of SchemaLog’s syntax and seman- transform data into metadata and vice versa, dynamic tics can be found in [LSS96a]. SchemaLog can also schema definitions, and extended aggregation facilities express the complex forms of aggregation discussed in supported in SchemaSQL are not available in SQL/M. this paper. The emerging standard for SQL3 ([SQL96, Bee93]) SchemaSQL has been to a large extent inspired supports ADTs and oid’s, and thus shares some fea- by SchemaLog. Indeed, the logical underpinnings of tures with higher-order languages. However, even SchemaSQL can be found in SchemaLog [LSS96b]. though it is computationally complete, to our knowl- However, SchemaSQL is not obtained by simply edge it does not directly support the kind of higher- “SQL-izing” SchemaLog. There are important differ- order features in SchemaSQL. ences between the two languages. (i) SchemaSQL has Krishnamurthy and Naqvi [KN88] and Krishna- been designed to be as close as possible to SQL. In this murthy, Litwin, and Kent [KLKSl] are early and influ- vein, we have developed the syntax and semantics of ential proposals that demonstrated the power of using SchemaSQL by extending that of SQL. SchemaLog on variables that uniformly range over data and meta- the other hand has a syntax based on logic program- data, for schema browsing and interoperability. While ming. (ii) Answers to SchemaSQL queries come with such ‘higher-order variables’ admitted in SchemaSQL an associated schema. In SchemaLog, as in other logic have been inspired by these proposals, there are major programming systems, answers to queries are simply a differences that distinguish our work from the above set of (tuples of) bindings of variables in the query (un- proposals. (i) These languages have a syntax closer less explicitly specified using a restructuring rule). (iii) to that of iogic programming languages, and far from The aggregation semantics of SchemaSQL is based on that of SQL. (ii) M ore importantly, these languages a ‘merging’ operator ([LSS96b]). There is no obvious do not admit tuple variables of the kind permitted in way to simulate merging in SchemaLog. (iv) To facil- SchemaSQL (and even SQL). This limits their expres- itate an ordinary SQL user to adapt to SchemaSQL sive power. (iii) Lastly, aggregate computations of the in an easy way, we have designed SchemaSQL with- kind discussed in Sections 3.1 and 4.1 are unique to out the following features present in SchemaLog - (a) our framework, and to our knowledge, not addressed function symbols and (b) explicit access to tuple-id’s. elsewhere in the litefature. As demonstrated in this paper, the resulting language In the context of multi-dimensional databases is simple, yet powerful for the interoperability needs (MDDB) and on-line analytical processing (OLAP), in a federation. there is a great need for powerful languages expressing Acknowledgements: We would like to thank complex forms of aggregation ([CCS95]). The power- Fr6d&ic Gingras for his valuable comments at various ful features of SchemaSQL for horizontal and block ag- stages of evolution of this paper and the anonymous gregation will be especially useful in this context (e.g. referees for their suggestions that helped improve the see Examples 3.3, 4.3). We have recently observed paper. that the Data Cube operator proposed by Gray et al. ([GBLP96]) can be simulated in SchemaSQL. Unlike References the cube operator, SchemaSQL can express any subset [ACM901 ACM. ACM Computing Surveys, vol- of the data cube to any level of granularity. ume 22, Sept 1990. Special issue on HDBS. In other rel&ed work, Gyssens et al. ([GLS96]) de- [ASD+Sl] Ahmed, R., Smedt, P., Du, W., Kent, W., velop a general data model called Tabular Data Model, Ketabchi, A., and Litwin, W. The Pegasus which subsumes relations and spreadsheets as special heterogeneous multidatabase system. IEEE cases. They develop an algebra for querying and re- Computer, December 1991. structuring tabular information and show that the al- [Bee931 Beech, D. Collections of objects in SQL3. gebra is complete for a broad .class of natural tra;ns- In Proc. 19th VLDB Conference, 1993. formations. They also demonstrate that the tabular [BR86] Bancilhon, F. and Ramakrishnan, R. An algebra can serve as a foundation for OLAP. Restruc- amateur’s introduction to recursive auerv- turing views expressible in SchemaSQL can also be ex- processing strategies. In Proc. ACM’ SIb- pressed in their algebra but they do not address aggre- MOD, 1986. gate computations. [CCS95] Codd, E.F., Codd, S.B., and Sal- In [LSS96a, LSS93], we proposed a logic-based ley C.T. Providing OLAP (on-line query/restructuring language SchemaLog, for facil- analytical processing) to user-analysts: itating interoperability in multidatabase systems. An IT mandate, 1995. White paper SchemaLog admits a simple syntax and semantics, but www.arborsoft.com/papers/coddTOC.html

249 [CL931 Chomicki, J. and Litwin, W. Declar- [LBT92] Lefebvre, A., Bernus, P., and Topor, R. ative definition of object-oriented multi- Query transformation for accessing hetero- database mappings. In Ozsu, M.T, Dayal, geneous databases. In Workshop on Deduc- U, and Valduriez, P, editors, Distributed tive Databases in conjunction with JICSLP, Object Management. M. Kaufmann Pub- pages 31-40, November 1992. lishers, Los Altos, California, 1993. [Lit891 Litwin, W. MSQL: A multidatabase lan- [GBLP96] Gray, J., Bosworth, A., Layman, A., and guage. Information Science, 48(2), 1989. Pirahesh H. Data Cube: A relational ag- gregation operator generalizing group-bi, [LN90] Lipton, Richard and Naughton, Jeffrey. cross-tab, and sub-totals. In Proceedinas of Query size estimation by adaptive sam- the 12th hernational Conference on Da& pling. In Proc. ACM PODS, 1990. Engineering, pages 152-159,1996. [LNSSO] Lipton, Richard, Naughton, Jeffrey, and Schneider, Donovan. Practical selectivity [GLR$93] John Grant, Witold Litwin, Nick Rous- estimation through adaptive sampling. Iti sopoulos, and Timos Sellis. Query Proc. ACM SIGMOD, 1990. languages for relational multidatabases. VLDB Journal, 2(2):153-171, 1993. [LSS93] Lakshmanan, L.V.S., Sadri, F., and Sub- ramanian, I. N. On the logical foun- [GLS96] Gyssens, Marc, Lakshmanan, L.V.S., and dations of schema integration and evolu- Subramanian, I. N. Tables as a paradigm tion in heterogeneous database systems. for querying and restructuring. In Proc. In Proc. 3rd International Conference on ACM Symposium on Principles of Database Deductive and Object-Oriented Databases Systems (PODS), June 1996. (DOOD ‘93). Springer-Verlag, LNCS-760, December 1993. [HM93] Hammer, J. and McLeod, D. An approach [LSS96a] Lakshmanan, L.V.S., Sadri, F., and Sub- to resolving semantic heterogeneity in a ramanian, I. N. Logic and algebraic lan- federation of autonomous, heterogeneous guages for interoperability in multidatabase database systems. Intl Jouma2 of Intelli- systems. Technical report, Concordia Uni- gent & Cooperative Information Systems, versity, Montreal, Feb 1996. Accepted to 2(l), 1993. the Journal of Logic Programming. [Hsi92] Hsiao. D.K. Federated databases and sys- [LSS96b] Lakshmanan, L.V.S., Sadri, F., and Subra- tems:’ Part-one - a tutorial on their d&a manian, I. N. SchemaSQL - a language for sharing. VLDB Journa2, 1:127-179, 1992. querying and restructuring multidatabase systems. Technical report, Concordia Uni- [KCGS93] Kim, W., Choi, I., Gala, S.K., and versity, Montreal, 1996. In Preparation. Scheevel, M. On resolving schematic het- erogeneity in multidatabase systems. Dis- [MR95] Missier, P. and Rusinkiewicz, Marek. Ex- tributed and Parallel Databases, l(3), 1993. tending a multidatabase manipulation lan- guage to resolve schema and data conflicts. In Proc. Sixth IFIP TC-2 Working Confer- [KGK+95] Kelley, W., Gala, S. K., Kim, W., Reyes, T.C.. and Graham. B. Schema architecture ence on Data Semantics (DS-6), Atlanta, of the UniSQL/M multidatabase system. In May 1995. Modern Database Systems. 1995. [Ros92] Ross, Kenneth. Relations with relation names as arguments: Algebra and calcu- [KKS92] Kifer, Michael, Kim, Won, and Sa- lus. In Proc. 11th ACM Symp. on PODS, giv, Yehoshua., Querying object-oriented pages 346-353, June 1992. databases. In Proc. ACM SIGMOD, pages 393-402, 1992. [SigSl] Semantic Issues in Multidatabase Systems. Sigmod Record, 20(4), December, 1991. [KLKSl] Krishnamurthy, R., Litwin, W., and Kent, Special Issue Edited by Amit Sheth. W. Language features for interoperability of databases with schematic discrepancies. [SQL961SQL Standards Home Page. SQL 3 In Proc. ACM SIGMOD, 1991. articles and publications, 1996. URL: www.jcc.com/sql-articles.html. [KLW95] Kifer M., Lausen G., and Wu J. Logical foundations for object-oriented and frame- [SSR94] Sciore, E., Siegel, M., and Rosenthal, A. based languages. Journal of ACM, May Using semantic values to facilitate interop- 1995. erability among heterogeneous information systems. ACM Transactions on Database [KN88] Krishnamurthy, R. and Naqvi, S. Towards Systems, 19(2):254-290, June 1994. a real horn clause language. In Proc. 14th VLDB Conf., pages 252-263,1988.

250