Heterogeneous Distributed Database Systems for Production Use

GOMER THOMAS, Bellcore, 444 Hoes Lane, Piscataway, NJ 08854
GLENN R. THOMPSON, Amoco Production Company Research Center, 4502 East 41st Street, Tulsa, Oklahoma 74135
CHIN-WAN CHUNG, Computer Science Department, GM Research Laboratories, Warren, Michigan 48090
EDWARD BARKMEYER, National Institute of Standards and Technology, Gaithersburg, Maryland 20899
FRED CARTER, Ingres Corporation, 1080 Marina Village Parkway, Alameda, California 94501
MARJORIE TEMPLETON, Data Integration, Inc., 3233 Federal Avenue, Los Angeles, California 90066
STEPHEN FOX, Xerox Custom Systems Division, 7900 Westpark Drive, McLean, Virginia 22102
BERL HARTMAN, Sybase, Inc., 6475 Christie Avenue, Emeryville, California 94608

It is increasingly important for organizations to achieve additional coordination of diverse computerized operations. To do so, it is necessary to have database systems that can operate over a distributed network and can encompass a heterogeneous mix of computers, operating systems, communications links, and local database management systems. This paper outlines approaches to various aspects of heterogeneous distributed data management and describes the characteristics and architectures of existing heterogeneous distributed database systems developed for production use. The objective is a survey of the state of the art in systems targeted for production environments as opposed to research prototypes.

Categories and Subject Descriptors: C.2.4 [Computer-Communication Networks]: Distributed Systems-distributed databases; H.2.4 [Database Management]: Systems; H.2.5 [Database Management]: Heterogeneous Databases

General Terms: Design

Additional Key Words and Phrases: Database integration, distributed query management, distributed transaction management, federated database, multidatabase, system architecture

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission. © 1990 ACM 0360-0300/90/0900-0237 $01.50

ACM Computing Surveys, Vol. 22, No. 3, September 1990

CONTENTS

INTRODUCTION
1. HETEROGENEOUS DISTRIBUTED DATABASE CAPABILITIES
1.1 Schema Integration
1.2 Distributed Query Management
1.3 Distributed Transaction Management
1.4 Administration
1.5 Types of Heterogeneity
2. STANDARDS ACTIVITIES
3. SOME EXISTING SYSTEMS
3.1 ADDS (Amoco Production Company, Research)
3.2 DATAPLEX (General Motors Corporation)
3.3 IMDAS (National Institute of Standards and Technology, U. Florida)
3.4 Ingres (Ingres Corporation)
3.5 Mermaid (Data Integration, Inc.)
3.6 MULTIBASE (Xerox Advanced Information Technology)
3.7 SYBASE (Sybase, Inc.)
4. SUMMARY
ACKNOWLEDGMENTS
REFERENCES

INTRODUCTION

The objectives of this survey paper are to provide insight into the kinds of heterogeneous distributed database capabilities available in off-the-shelf systems and to describe the practicalities of in-house development of such capabilities; that is, what can be achieved and what approaches are likely to be promising.

This paper takes a broad view of the terms "distributed," "heterogeneous," and "production use." A database system is considered to be distributed if it provides access to data located at multiple (local) sites in a network, even if it does not provide full facilities for schema integration, distributed query management, and/or distributed transaction management. It is considered to be heterogeneous if the local nodes have different types of computers and operating systems, even if all local databases are based on the same data model and perhaps even the same database management system. (This definition of heterogeneous is admittedly at odds with that commonly used by the research community, where the term is typically used to imply multiple data models.) It is considered to be for production use if it has been, or is being, developed for that purpose, even if it is currently only in an advanced prototype state.

Although a number of specific database systems are described or mentioned in this paper, this paper is not intended to contain a definitive list of heterogeneous distributed database systems for production use or to endorse or recommend any particular system. The information about specific systems may or may not be entirely accurate, and it is likely that other systems on the market, in production use, or under development have capabilities equaling or exceeding those mentioned here.

Section 1 discusses different types of distributed database capabilities that can be provided by different systems. Section 2 describes the status of remote database access standards, a key factor in the ease of developing heterogeneous distributed database systems. Section 3 describes the characteristics and architecture of a sampling of existing heterogeneous distributed database systems developed for production use. These descriptions are current as of late 1989, written by individuals who have been involved in the development of these systems. Section 4 gives a brief summary of the state of the art.

1. HETEROGENEOUS DISTRIBUTED DATABASE CAPABILITIES

Different types of capabilities can be provided by heterogeneous distributed database systems. They include schema integration, distributed query processing, distributed transaction management, administrative functions, and coping with different types of heterogeneity. Schema integration has to do with the way in which users can logically view the distributed data. Distributed query management deals with the analysis, optimization, and execution of queries that reference distributed data. Distributed transaction management deals with the atomicity, isolation, and durability of transactions in a distributed system. Administrative functions include

such things as authentication and authorization, defining and enforcing semantic constraints on the data, and management of data dictionaries and directories. Heterogeneity can include differences in hardware, operating systems, communications links, database management system (DBMS) vendors, and/or data models. These are all important aspects of distributed data management. In considering them, it is important to recognize there is no "ideal" set of capabilities for all environments or applications. A particular capability may be invaluable in certain situations while being totally unsuitable in others.

1.1 Schema Integration

Each local database has a local schema, describing the structure of the data in that database. Each user has a user view, describing that portion of the distributed data that is of interest to the user, possibly reorganized to meet the user's requirements. There are two general approaches to the problem of providing mappings between user views and the local schemata:

(1) All of the local schemata may be integrated into a single global schema that represents all data in the entire distributed system. All user views are derived from this global schema.

(2) Various portions of various local schemata may be integrated into multiple federated schemata. These federated schemata may correspond closely to user views, or user views may be further derived from these federated schemata.

In either approach one is faced with schema integration, the process of developing a conceptual schema that encompasses a collection of local schemata. If the local databases are based on different data models and/or use different names or representations for particular data elements, the usual first step is for each local schema to be translated into an equivalent schema in some common data model using common names and representations. Thus, in this case one is also faced with a schema translation problem.

The simplest form of schema integration is schema union, whereby a conceptual schema seen by distributed users is the disjoint union of local schemata or perhaps of subsets or views of local schemata. In the simplest form of this, a data item in the conceptual schema is identified by the name of a local database concatenated with the name of an item in the local database. More commonly, some type of aliasing capability is provided.

More sophisticated capabilities that can be provided include the following:

Replication, whereby data items in different local databases may be identified as copies of each other

Horizontal fragmentation, whereby data items in different local databases may be identified as logically belonging to the same table or entity set in the integrated schema

Vertical fragmentation, whereby data items in different local databases may be identified as logically representing the same row or entity in the integrated schema but containing different attributes for the row or entity

Data mapping, whereby data types or data values are converted for conformity with each other

As an example of data mapping, one local database might store temperatures as floating point numbers representing degrees Celsius, and another local database might store temperatures as character string representations of degrees Fahrenheit. To integrate these schemas, one might want to map degrees Fahrenheit to degrees Celsius and map character string representations of numbers to floating point representations, or vice versa.

1.2 Distributed Query Management

Distributed query management provides the ability to combine data from different local databases in a single retrieval operation (as seen by the user). This capability may or may not be provided by a distributed database system.
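The data-mapping case described in Section 1.1 (floating-point Celsius in one local database, character-string Fahrenheit in another) can be sketched as a pair of conversion routines applied by the integration layer as data is retrieved. The sketch below is illustrative only; the database names and row layout are hypothetical and are not taken from any of the systems surveyed.

```python
def fahrenheit_str_to_celsius(raw: str) -> float:
    """Map one local representation (character-string degrees Fahrenheit)
    to the conceptual schema's representation (floating-point Celsius)."""
    return (float(raw) - 32.0) * 5.0 / 9.0

def celsius_to_fahrenheit_str(temp_c: float) -> str:
    """Inverse mapping, for pushing conceptual-schema values back to a
    local database that stores Fahrenheit character strings."""
    return "%.1f" % (temp_c * 9.0 / 5.0 + 32.0)

# A schema-union style view might tag each value with its source database
# and apply the appropriate mapping as rows are retrieved.
local_rows = [("plant_db", "212.0"),   # hypothetical Fahrenheit-string source
              ("lab_db", 25.0)]        # hypothetical Celsius-float source
integrated = [fahrenheit_str_to_celsius(v) if src == "plant_db" else float(v)
              for src, v in local_rows]
```

Applying the mappings at retrieval time, as here, leaves the local databases untouched; converting in the opposite direction on update is equally possible, which is the "or vice versa" in the discussion above.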
In some systems, an application needing data from multiple local

databases would explicitly send queries to the individual databases rather than presenting a single distributed query to a distributed query manager. If distributed query management is provided, there are many possible approaches to optimizing the heterogeneous distributed queries and managing their execution. To complicate the optimization problem, there may be great differences in the speeds of the communications links, the speeds and workloads of the local processors, and the nature of the data operations available at the local sites. Depending on the system characteristics, it may be desirable to exploit parallelism in the execution of queries. Replicated and/or fragmented data also add complexity. One of the simpler approaches to query management is to move all the desired data to the query site and combine them there. More sophisticated approaches may consider a wide range of algorithms and take into account a wide range of factors in evaluating them [Chen et al. 1989].

1.3 Distributed Transaction Management

Distributed transaction management provides the ability to read and/or update data at multiple sites within a single transaction, preserving the transaction properties of atomicity, isolation, and durability [Ceri and Pelagatti 1984]. This capability may or may not be provided by a distributed database system. If it is provided, there are two aspects involved: concurrency control protocols and commit protocols.

Local concurrency control protocols ensure that the execution of local transactions, whether originating at the local database or coming from a remote site, meets the isolation standards of the local system. In most cases this means the execution is serializable; that is, the actual execution is equivalent to the serial execution of the transactions in some order [Eswaran et al. 1976]. Distributed concurrency control protocols are designed to ensure that distributed transactions are globally serializable; that is, that the serializations of the local components at all the databases are compatible with some global serialization of all the distributed transactions [Traiger et al. 1982].

Local commit protocols guarantee atomicity and durability of local transactions; that is, that either all of the actions of a local transaction complete and commit or none of them do. Distributed commit protocols guarantee global atomicity and durability; that is, that either all of the local subtransactions of a distributed transaction complete and commit or none of them do.

These distributed protocols carry an operational cost, as well as an implementation cost, since a failure at one site may leave locked data at other sites unavailable for extended periods of time. Thus, there can be advantages to not having them if they are not needed. In some situations the semantics of the databases and applications may make both distributed concurrency control and distributed commit protocols unnecessary. An example of this is a collection of independently maintained reference databases (e.g., restaurant ratings or bibliographic listings), with a capability for users to submit queries that form the union of information from the different databases in the collection. Since the databases are being independently updated and since users are typically interested in partial information, even if one of the databases is temporarily unavailable, there is no need to ensure global serializability or global atomicity. In other situations, distributed commit protocols, but not distributed concurrency control, may be needed. An example of this is a travel reservation system in which reservations for different resources are independently managed in different databases. The order in which reservations for different resources are interleaved is not critical, so distributed concurrency control is not necessary. A user would, however, often want to package a number of interdependent reservations into a single transaction and require that either they all commit or none of them do.

In other situations, both distributed concurrency control and commit protocols may be needed. An example is a banking system in which accounts at different branches are maintained in different databases. Global

atomicity is needed to ensure that funds transfers between branches are handled correctly. Global serializability is needed to ensure that audit programs run correctly, even when run concurrently with funds transfer programs.

1.4 Administration

A number of administrative functions, such as authorization of users, definition and enforcement of semantic integrity constraints on the data, and maintenance of schema information, must be performed for any database system. In a heterogeneous distributed database system, there are many choices as to the degree of centralization and transparency of these administrative functions.

1.4.1 Authorization Management

At one extreme, authorization can be completely decentralized, with each user having a user id at each local database and permissions granted by local database administrators and enforced by local database management systems. At the other extreme, authorization can be completely centralized, with each user having a system-wide user id and permissions granted on a system-wide basis by a system-wide database administrator and enforced by a system-wide distributed query manager. An intermediate possibility is to have system-wide user ids but with permissions granted and enforced at local databases.

One potential advantage of centralized permissions is that a user can be given access to certain distributed views while being denied direct access to the underlying local data. For example, one might have a table in one database that tells which employees are assigned to which division of a company and another table in another database that gives the salary of each employee. One might want certain users to have access to the average salaries in the different divisions of the company but not to the salaries of individual employees. With centralized permissions, it is straightforward to define a view containing the average salary information (obtained by taking the join of the two local tables and applying aggregation and projection operations) and give the users permission to access the view but not permission to access the underlying employee salary table. With only local permissions, the only choices are to give the user access to the employee salary table or not to. There is no mechanism for preventing access to the salary table but granting access to a view derived from a join of the salary table with a table in another database.

1.4.2 Semantic Integrity Management

Some database management systems provide capabilities for specifying and enforcing semantic integrity rules for the database instead of leaving that function to the applications programmers. These include such things as rules for allowed data values and existence dependencies. In the context of heterogeneous distributed databases, this can be done either centrally, through a global conceptual schema, or locally, through the individual local schemas. One potential advantage of the centralized approach is that distributed integrity rules, that is, rules that depend on data stored in different databases, can be enforced.

1.4.3 Location Transparency and Schema Maintenance

Location transparency is the ability to deal with distributed data without having to know where it is or even that it is distributed. There are two levels at which location transparency is relevant: the level of the user (programmer or interactive user) and the level of the database administrator.

At one extreme, users may have to specify data items by machine name or address, DBMS identifier, database name, and data item name. At the other extreme, users may only have to specify logical names of data items. For pure logical data item naming to work, there must be some kind of naming convention that guarantees uniqueness of names for all data items accessible by the user, a troublesome requirement if there are users who need access to essentially all the data of the enterprise. A compromise is

to require the user to specify a logical database name and a data item name, with the system mapping logical database names to local databases or federated schemata. This requires only system-wide uniqueness of logical database names and uniqueness of data item names within each logical database.

There is also a range of choices for how the database administrator(s) provide location transparency for the programmer or interactive user. At one extreme, the database administrator(s) may individually maintain multiple data dictionaries at multiple sites describing what data can be found where. These data dictionaries may all represent a global schema and describe all the data in the system, or they may each represent a particular federated schema and only describe data of interest to a particular group of users. At the other extreme, the system itself may make all decisions about data placement and maintain all directories automatically (although this extreme would be highly unusual in a heterogeneous system). A compromise approach is to have any changes in data placement be entered only once and to have the system automatically propagate the information throughout the system as needed.

In a system with multiple federated schemata, the data dictionaries and directories are typically physically decentralized. In a system with a single global schema, they are logically centralized but may be physically centralized or distributed. In either case, as indicated above, varying degrees of centralization and coordination are possible for their maintenance.

1.5 Types of Heterogeneity

Many people in the research community have traditionally regarded a "heterogeneous" database system as one involving multiple data models. From a practical standpoint, however, a distributed database system may involve a number of different types of heterogeneity: computer hardware, operating systems, communications links and protocols, database management systems (vendors), and/or data models. Each of these presents its own problems to the system developer. Different hardware and operating systems may use different data representations, for example, different character codes or different representations for floating point numbers. Different communications mechanisms require gateways, and these can be difficult to implement because of the different capabilities involved. For example, one protocol may allow an out-of-band interruption of a session, whereas another may not. Different DBMS vendors may use different data definition and data manipulation languages, even though they may be using the same data model. Different data models can present a difficult schema translation and query translation problem, especially if performance is important.

2. STANDARDS ACTIVITIES

One key to efficient development of heterogeneous distributed database systems is a standard language and protocol for remote data access. If all of the local sites in the system supported a common set of data operations, accessible via a standard language, it would greatly simplify managing distributed queries. If support for a "prepare to commit" state were also standardized, implementing a distributed two-phase commit protocol would be very straightforward. Moreover, if the local systems all used strict two-phase locking (locks held until commit point), as many of today's database systems do, distributed concurrency control would be implied by distributed two-phase commit [Breitbart and Silberschatz 1988]. Distributed deadlock detection would, however, be needed.

There are two major hurdles to overcome in reaching a consensus on the many technical issues involved in such standards. One is the usual problem of getting a diverse community of vendors and users to agree on a common way of doing things when they may already have investments in separate solutions. The other problem, which in this case is perhaps more serious, is that there has still been relatively little experience with distributed databases in production use. Even those organizations that have implemented them often do not

feel that they have a good understanding of precisely what functionality should be provided, let alone precisely how various things should be handled in the remote data access language. There is not even a widely accepted reference model for distributed database systems.

Nonetheless, both the International Standards Organization (ISO) and the American National Standards Institute (ANSI) are active in this area. The RDA Rapporteur Group of Working Group 3 on Database of ISO TC97/SC21 was formed in late 1985 to work on a Remote Data Access (RDA) standard, and Subcommittee X3H2.1 of Technical Committee X3H2 on Databases of the ANSI/X3 Committee on Information Processing Systems was formed in late 1988 to coordinate U.S. input into the ISO process. It is possible that an ISO Draft International Standard could be approved as early as mid-1990 and an International Standard could be approved as early as mid-1991.

3. SOME EXISTING SYSTEMS

This section presents a brief description of a sampling of systems that have been developed or are being developed for production use. Four general types of information are given for each system:

(1) Background: motivation, objectives, and history for the project or product

(2) System characteristics: capabilities or features of the system, including its positioning in Sheth and Larson's taxonomy of heterogeneous distributed database systems [Sheth and Larson 1990]

(3) System architecture: major system components and their functions

(4) Current status and future plans: including, in some cases, an estimate of the amount of work that has gone into the project or product and/or some of the major lessons learned during development

To avoid numerous footnotes, we consolidated the following trademark acknowledgements: IBM, IMS, SQL/DS, DB2, IBM PC, OS/2, VM, CMS, MVS, and SNA are trademarks of International Business Machines Corporation. DEC, VMS, and VAX are trademarks of Digital Equipment Corporation. Sun Workstation is a trademark of Sun Microsystems, Inc. Apollo Domain is a trademark of Apollo Computers, Inc. RIM is a trademark of Boeing Computer Services. FOCUS is a trademark of Information Builders, Inc. UNIX is a registered trademark of AT&T. ORACLE is a trademark of Oracle Corporation. IDM is a trademark of Sharebase, Inc. Mermaid is a trademark of UNISYS Corporation. System 2000 is a trademark of Intel Corporation. Ingres/STAR and Ingres/Gateway are trademarks of Ingres Corporation (formerly Relational Technology, Inc.). SQL Server, SQL Toolset, Open Client, and Open Server are trademarks of Sybase, Inc.

3.1 ADDS (Amoco Production Company, Research)

3.1.1 Background on ADDS

The Amoco Distributed Database System (ADDS) [Breitbart and Tieman 1985; Breitbart et al. 1986] project began in late 1983, responding to the problem of integrating databases distributed throughout the corporation. Applications were being planned that required data from multiple sources. At the time, database products did not provide effective means for accessing or managing data from diverse systems. Therefore, the ADDS project was initiated to simplify distributed data access and management within Amoco.

3.1.2 ADDS System Characteristics

ADDS provides uniform access to preexisting heterogeneous distributed databases. The ADDS system is based on the relational data model and uses an extended relational algebra query language. A subset of the ANSI SQL language standard is also supported. In the terminology of [Sheth and Larson 1990], ADDS is a tightly coupled federated system supporting multiple


federated schemata. Local database schemata are mapped into multiple federated database schemata, called Composite DataBase (CDB) definitions. The mappings are stored in the ADDS data dictionary. The data dictionary is fully replicated at all ADDS sites to expedite query processing. A CDB is usually defined for each application. Multiple applications and users may, however, share CDB definitions. Users must be authorized to access specific CDBs and relational views that are defined against the CDBs.

The CDBs support the integration of the hierarchical, relational, and network data models. Local DBMSs currently supported include IMS, SQL/DS, DB2, RIM, INGRES, and FOCUS. Semantically equivalent data items from different local databases, as well as appropriate data conversion for the data items, may be defined.

The user interface consists of an Application Program Interface (API) and an interactive interface [Lee et al. 1988]. The API consists of a set of callable procedures that provide access to the ADDS system for application programs. Programs use the API to submit queries for execution, access the schema of retrieved data, and access retrieved data on a row-by-row basis. The API provides programmers with location and DBMS transparent access to distributed databases.

The interactive interface allows terminal users to execute queries, display the results of the queries, and save the retrieved data. The interactive interface is actually an application that uses the API to provide a high-level interface for ADDS. Free-form query submission is supported for experienced users, and menu-driven query submission is supported for those less experienced with the ADDS and SQL languages. Frequently used queries may be stored in the query catalog, and cataloged queries may be selected and modified by the user before execution.

Queries submitted for execution are compiled and optimized for minimal data transmission cost. Semijoins and common subquery elimination are just two of the query optimization techniques used. A user may submit any number of queries for simultaneous execution. ADDS allows a user to "disconnect" from the execution of a query, which is important for long-running queries. A failed query is automatically restarted, without loss of intermediate results, after the cause of the failure is determined and corrected. Also, query execution may be deferred to nonprime time, thereby decreasing execution costs.

The ADDS system includes geographically distributed mainframes running the VM and MVS operating systems and Sun and Apollo workstations running the UNIX operating system. Therefore, providing a uniform network interface to these systems is important for ADDS development and maintenance. The Network Interface Facility (NIFTY) architecture [Lee et al. 1988] is an extension of the OSI Reference Model [ISO 1982] and provides a uniform and reliable interface to computer systems that use different physical communication networks. An ADDS process on one system can initiate a session with an ADDS process on another system without regard for the multitude of heterogeneous network hardware and software that is used to accomplish the session. Currently, NIFTY supports interprocess communication using SNA, ethernet (TCP/IP), and binary synchronous networks.

ADDS maintains the autonomy of the local database systems and does not require any modifications to local DBMS software. The only communication between ADDS and the local DBMSs is in the form of query submission and data retrieval.

3.1.3 ADDS System Architecture

The layered architecture of the ADDS system is illustrated in Figure 1. Global transactions are application programs composed of one or more global database queries and/or updates. A single query may reference data at several sites. The Global Transaction Interface (GTI) verifies the syntactical correctness of user queries and constructs a global execution plan. The Global Data Manager (GDM) determines the location

[Figure 1 here: global transactions enter the ADDS layer through the Global Transaction Interface, which works with the Global Data Manager and Global Transaction Manager; local transactions are issued to the local DBMSs and their databases.]

Figure 1. ADDS architecture from [Breitbart et al. 19871. of the data referenced by a global transac- DBMSs. The servers also perform data tion from the CDB definition in the Data conversion as the local data is retrieved and Dictionary. The GDM also manages all transfer the data to the GTM for further intermediate data that is received from processing. the Global Transaction Manager (GTM) A “site graph” concurrency control algo- during transaction execution. The GTM rithm [Breitbart et al. 1987,1989a; Thomp- manages the execution of the global trans- son 19871 was used in early efforts to actions and allocates servers to process support update transactions in ADDS. This global subtransactions. general algorithm guarantees the serializ- The GTM uses a two-phase server allo- able execution of global transactions with cation strategy to guarantee the atomicity permitted local transactions and the ab- of global transactions. All servers remain sence of global deadlocks [Gligor and allocated to a global transaction until the Popescu-Zeletin 19851. In a very active sys- transaction completes. This implies that all tem, however, too many global transactions local data items referenced by a global are aborted. To reduce transaction aborts transaction remain locked until the trans- and increase throughput, ADDS was mod- action completes. Local transactions that ified to use a two-phase locking algorithm, have no data items in common with global with timeouts to guarantee freedom from transactions are unaffected. The GTMs on global deadlocks. A two-phase commit pro- different network nodes negotiate for tocol is used to write the results of the server resources when it is necessary for a global transactions into the local databases. transaction submitted at one node to access Although no longer used for concurrency data on other nodes. 
The servers perform control, the site graph is an important local security verification, then translate tool for guaranteeing global database con- the subqueries into the language of the local sistency during commit and recovery
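The locking discipline described above (every lock a global transaction acquires is held until the transaction completes, and lock waits time out to break cross-site deadlocks) can be sketched as follows. This is an illustrative reconstruction, not ADDS code; all class and method names are invented.

```python
import threading

class TimeoutLockManager:
    """Sketch of two-phase locking with timeouts in the ADDS spirit:
    a global transaction acquires all its locks (growing phase), holds
    them until completion, and aborts if any acquisition times out."""

    def __init__(self):
        self._locks = {}                 # data item name -> threading.Lock
        self._guard = threading.Lock()   # protects the lock table itself

    def _lock_for(self, item):
        with self._guard:
            return self._locks.setdefault(item, threading.Lock())

    def acquire_all(self, items, timeout=2.0):
        """Growing phase: take every lock the transaction needs.
        Returns the held locks, or None if a timeout forced an abort."""
        held = []
        for item in sorted(items):       # deterministic order
            lock = self._lock_for(item)
            if not lock.acquire(timeout=timeout):
                for h in held:           # abort: release what we took
                    h.release()
                return None
            held.append(lock)
        return held

    def release_all(self, held):
        """Shrinking phase: runs only when the transaction completes."""
        for lock in held:
            lock.release()
```

A timed-out transaction simply aborts and can be retried; the coordinator never has to detect deadlock cycles that span autonomous sites, which is the point of the timeout approach.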

3.1.4 ADDS Status and Future Plans

A production version of the ADDS system that supports retrieval transactions has been deployed within Amoco. Prototype support for distributed update transactions, including concurrency control and commit/recovery management, has been integrated into a centralized version of the system; that is, a version in which all global queries are submitted at a single site. Preparations are being made for the limited deployment of the ADDS update system prototype. Future plans include (1) developing a decentralized version of the transaction management, concurrency control, and commit/recovery algorithms and (2) building tools to simplify the CDB definition process and to maintain synchronization between the CDB definitions and the local database schemata.

Some of the lessons learned during ADDS development and deployment are as follows:

• Introducing distributed data management into a large organization can be a monumental problem.
• A flexible user interface design is necessary to meet diverse and changing user requirements.
• Adequate query optimization and data security are essential for system acceptance.
• Separating the network architecture from the ADDS architecture allowed concentration on important problem areas in both components individually.

3.2 DATAPLEX (General Motors Corporation)

3.2.1 Background on DATAPLEX

Many different kinds of database management systems and file systems are used in the manufacturing industry because of the diverse data management requirements. Historically, there has been no effective means to share these heterogeneous databases. The lack of effective data sharing causes inefficient engineering and manufacturing activities and business operations. Duplicated data at different locations often results in data inconsistency.

A heterogeneous distributed database system is an effective means of sharing data in an organization with diverse data systems. DATAPLEX is a heterogeneous distributed database management system being developed by General Motors Corporation [Chung 1990]. Sections 3.2.2 and 3.2.3 describe the functions and methodologies of DATAPLEX in its target full-function form. Section 3.2.4 describes its current implementation status.

3.2.2 DATAPLEX System Characteristics

DATAPLEX allows queries and transactions to retrieve and update distributed data managed by diverse data systems such that the location of data is transparent to requestors. In this environment, different data management systems can run on different operating systems that may be connected by different communication protocols.

The relational model of data is used as the global data model. Since different data models used by unlike database systems structure data differently, the data definition for each sharable database in the heterogeneous distributed database system is transformed to an equivalent relational data definition, or conceptual schema. The conceptual schema is implemented as a set of overlapping relational schemata, one for each location. The relations at each location represent data objects that need to be accessed by users at that location. Consequently, conceptual schemata are neither centralized nor replicated. Thus, in the terminology of [Sheth and Larson 1990], DATAPLEX is a tightly coupled federated system supporting multiple federated schemata.

Use of a common data model eases the problem of providing a uniform user interface. Among several relational query languages, SQL was chosen as the uniform user interface because SQL is widely used and an ANSI standard has been developed for it. Both interactive SQL queries and embedded SQL programs are supported.

3.2.3 DATAPLEX System Architecture

The above strategies establish the architecture of DATAPLEX. Figure 2 shows DATAPLEX and other elements in a heterogeneous distributed database system. The functions of DATAPLEX are performed by 14 major modules, described here.

The Controller module schedules the invocations of the rest of the modules and handles inputs and outputs of the modules.

The User Interface and Application Interface modules provide interfaces for queries to be entered into DATAPLEX. The User Interface appears to users as a command prompter or a form-oriented query facility. The Application Interface is linked to a compiled application before executing the application.

The Distributed Database Protocol (DDBP) module provides communications between the DATAPLEX software at user locations and data locations. Different communication protocols can be used by adapting the DDBP to them.

The SQL Parser module checks syntactic errors of SQL statements. The Distributed Query Decomposer and Distributed Query Optimizer modules prepare distributed queries for execution, with the aid of the Data Dictionary Manager module. The Translator and Local DBMS Interface modules provide interfaces to the local database systems for execution of local subqueries, and the Relational Operation Processor of the user-location DATAPLEX merges the results from the local sites to provide the final query result.

The Data Dictionary Manager finds the location of the data referenced by a query and determines the type of the query. There are three different types of queries: user-location query, remote single-location query, and distributed query. The user-location query and the remote single-location query are special cases of the distributed query.

To process a distributed retrieval query, the Distributed Query Decomposer decomposes the distributed query into a set of local queries and a user-location query that merges the results from other locations. (A local query references data from a single location, which may be a remote location.) The user-location (source) DATAPLEX sends local SQL queries to data-location (target) DATAPLEXs using the DDBP. The Translator finds query translation information from a translation table that records differences of data names and data structures between the conceptual schema and the local schema. The Translator translates a local SQL query to a query (or program) in a local data manipulation language (DML) using the translation information. The distributed query decomposition method and translation scheme used by DATAPLEX are described in a previous report [Chung 1987]. The Local DBMS Interface sends the translated query to the local DBMS and obtains the local result. The local result is in a report form similar to a relation regardless of the data structure used by a local DBMS.

The Distributed Query Optimizer of the source DATAPLEX schedules an optimal data reduction plan using the statistical information from the target DATAPLEX. The data reduction plan [Chung and Irani 1986] is a sequence of semijoins that consists of local data reduction operations and data moves among computers. Upon completion of the execution of the data reduction plan, the reduced local results are sent to the source DATAPLEX. There the Relational Operation Processor of the source DATAPLEX merges the local results by processing the user-location query.

To process a distributed update query, the Distributed Query Decomposer generates a set of local retrieval queries to identify the specific data to be updated, as well as a set of update queries, one for each relation to be updated.

The Distributed Transaction Coordinator enforces two-phase locking on referenced data at the local DBMSs that are involved. Although there are a number of global deadlock detection and avoidance methods for homogeneous systems [Elmagarmid 1986], there has not been much research on this topic for heterogeneous systems. Until more research results are available, the time-out method is used initially to handle global deadlocks.
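One step of the data reduction plan described above can be sketched as follows: ship only the projected join-column values from the source site, and filter the target site's relation with them before any full rows are moved for the final join. The helper and the sample relations are hypothetical, not DATAPLEX code; a real plan orders many such steps using the sites' statistics.

```python
def semijoin_reduce(source_rows, target_rows, join_attr):
    """One semijoin of a data reduction plan: only the projected
    join-column values travel between sites, and the target relation
    is reduced to the rows that can participate in the final join."""
    shipped = {row[join_attr] for row in source_rows}   # small projection moves
    return [row for row in target_rows if row[join_attr] in shipped]

# Site A holds the parts used at one plant; Site B holds supplier records.
site_a = [{"part": "P1"}, {"part": "P7"}]
site_b = [{"part": "P1", "supplier": "S1"},
          {"part": "P2", "supplier": "S2"},
          {"part": "P7", "supplier": "S3"}]

reduced = semijoin_reduce(site_a, site_b, "part")   # only matching rows remain
```

After the reduction, only the two matching supplier rows need to travel to the source site, which is the payoff the optimizer weighs against the extra round trip.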

[Figure 2. DATAPLEX in a heterogeneous distributed database system: a user's DATAPLEX, holding the conceptual schema, exchanges SQL queries and data over communication hardware and software with a remote DATAPLEX that holds the translation table and detailed information for its local data, in front of a local DBMS and its local database.]

After the specific data to be updated is identified by processing the retrieval part, the local update queries are processed, incorporating a two-phase commit to enforce the update atomicity, and then the locks on the referenced data are released. There are also Security Manager and Error Handler modules.

All modules of DATAPLEX are independent of the local data system except for the Translator and Local DBMS Interface modules. Thus, any data system can be interfaced to DATAPLEX by developing these two modules for it. This architecture is modular and is an open architecture with which functionality and performance can be gradually increased.
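The update sequence described above (identify the target rows by a retrieval pass, then run the local update queries under two-phase commit) relies on the standard prepare/commit handshake. A minimal sketch, with invented participant objects standing in for the local DBMS interfaces:

```python
def two_phase_commit(participants):
    """Minimal two-phase commit sketch: commit everywhere only if
    every participant votes yes in the prepare phase."""
    if all(p.prepare() for p in participants):   # phase 1: unanimous yes?
        for p in participants:
            p.commit()                           # phase 2: commit everywhere
        return True
    for p in participants:
        p.abort()                                # any no: abort everywhere
    return False

class Site:
    """Toy participant standing in for one local DBMS interface."""
    def __init__(self, can_prepare=True):
        self.can_prepare, self.state = can_prepare, "idle"
    def prepare(self):
        self.state = "prepared" if self.can_prepare else "failed"
        return self.can_prepare
    def commit(self):
        self.state = "committed"
    def abort(self):
        self.state = "aborted"
```

The sketch omits logging and crash recovery, which a real coordinator needs so that a site that prepared but never heard the decision can learn the outcome.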

3.2.4 DATAPLEX Status and Future Plans

A prototype DATAPLEX was jointly developed in 1986 with a DBMS vendor to show the feasibility of the concepts underlying the DATAPLEX approach. The prototype system interfaces an IMS hierarchical DBMS running under the MVS operating system and an INGRES relational DBMS running on a VAX computer under the VMS operating system. The features the prototype system provides to users are as follows:

• SQL queries to IMS
• Distributed SQL queries to IMS and INGRES
• Distributed SQL queries embedded in a C language program

The data types supported between IBM and DEC computers are characters, text (variable length fields), integers, floating point numbers, and packed decimal numbers. In addition, the prototype system checks whether a user is authorized to access IMS data at a segment level using the user id.

Since rapid prototyping was required to show the feasibility of the concept before developing a full-function DATAPLEX, updates of IMS data and full query optimization were not implemented in the prototype system. In addition to these restrictions, the system supports only a subset of SQL defined to have the following syntax:

SELECT list of target attributes and set functions
FROM list of relations
WHERE qualifications
ORDER BY attributes

where set functions are MAX, MIN, SUM, COUNT, AVERAGE, and the qualification contains >, >=, <, <=, =, <>, AND, OR, NOT, and parentheses.

A testbed has been established at General Motors Research Laboratories, with a test distributed database and test transactions. Users formulate requests based on the relational view of the distributed database. The location and the type of the actual database are transparent to users. The system executes the requests.

Production IMS data have been used to test the effect of the size of the database on efficiency. The data is from the Maintenance Management Information System (MMIS) running at a car assembly plant. The MMIS database contains a few hundred thousand records. This is about 1000 times bigger than the test IMS database. The results of tests using the production database show the prototype system incurs some overhead compared with access using PL/I programs. As the database size grows, the fraction of the overhead to the total processing time decreases. It was observed that the SQL-to-DL/I Translator and IMS Interface modules were the bottleneck in accessing IMS data through the prototype DATAPLEX.

Based on the success of the prototype system, General Motors Corporation has initiated the development of a full-function DATAPLEX system, interfacing IMS, DB2, and INGRES, with outside vendors. The full-function system uses full SQL, and the initial version supports distributed retrieval and single-location update. Currently, most of the initial version has been implemented, and completed components are being tested using production applications and data. Subsequent versions will provide the capabilities of distributed update, multiple copy synchronization, and support for horizontal and vertical partitioning.

3.3 IMDAS (National Institute of Standards and Technology, U. Florida)

3.3.1 IMDAS Background: Sharing Data in a Manufacturing Complex

In modern manufacturing systems, two developments are paramount:

• Industrial Automation: computer systems controlling and monitoring the physical processes
• Computer Integrated Manufacturing (CIM): direct data sharing among production control systems and the engineering and administrative systems that support them

In most industrial facilities, control, engineering, and administrative systems operate on computer systems and database systems from different manufacturers. They contain independently designed, overlapping databases, with logical and physical differences in the representation of the same real-world objects. These existing systems represent a major investment and support real production. It is not feasible to replace or significantly redesign them.

The Integrated Manufacturing Data Administration System (IMDAS) [Barkmeyer et al. 1986; Krishnamurthy et al. 1987; Su et al. 1986] was developed to support a prototype CIM environment, the NBS Automated Manufacturing Research Facility (AMRF) [Nanzetta 1984], a testbed for small-batch manufacturing automation and in-process measurement. The objective was to provide access from many systems to the many sources of manufacturing data, cooperating with existing applications on existing databases while enabling new and modified application programs to access data as needed, insulated from accidental distinctions in location, representation, and access mechanisms.

3.3.2 IMDAS System Characteristics

In the terminology of [Sheth and Larson 1990], IMDAS is a tightly coupled federated system with a single global schema. The integrating data model is the Semantic Association Model (SAM*) [Su 1985], a semantic network data model capable of representing the complex structures and relationships and many integrity constraints found in a manufacturing enterprise. A fragmentation schema maps the global model to the underlying databases, supporting both horizontal and vertical partitioning of a given object class.

Existing database systems are front-ended by IMDAS modules supporting an internal query interchange form, which is an extended algebra on generalized relations corresponding to the modeled object classes, and a corresponding data interchange form, expressed in Abstract Syntax Notation 1 [ISO 1987a, 1987b]. This common interface is readily mapped onto underlying relational and navigational databases. A library of routines supporting it minimizes the effort involved in integrating new data systems and databases.

The user program phrases queries in an SQL-like language adapted to the model. The query is passed to the IMDAS in string form, rather than precompiled, to permit access by controllers programmed in nonstandard languages. This mechanism can support an interactive interface, although none has been built yet. IMDAS supports both distributed updates (transaction management) and distributed retrievals (query management). The fragmentation schema does not currently support replication, however, which is a significant limitation of the system.

3.3.3 IMDAS System Architecture

The architecture of IMDAS is shown in Figure 3. The lowest level of the architecture comprises the data repositories (databases, files, controller memories) managed by commercial DBMSs, file systems, homegrown application-specific servers, and so on. These are the existing data systems on which the IMDAS depends. Each computer system in the enterprise has a Basic Data Server (BDAS), which provides the interface between the local repository managers and the integrated data system. It contains the front-end processes that provide the standard interfaces for the local DBMSs. The BDASs and the DBMSs are the elements that execute the data manipulations.

The Distributed Data Servers (DDASs) perform the query processing and transaction management functions. Each DDAS provides the query interface to all application programs within a cluster of computer systems that is its segment of the enterprise and logically integrates the collection of data repositories managed by the BDASs in that cluster into a corresponding segment of the global database. The DDASs manage the data manipulations.

The Master Data Server (MDAS) is needed when there is more than one DDAS. It integrates the separately managed segments into the global database and manages transactions that cross segment (i.e., DDAS) boundaries. The MDAS does not "manage" the fully distributed system; it is rather a utility used by the distributed servers to resolve the global model and provide concurrency control for transactions that involve multiple DDASs.

The IMDAS modular architecture permits several "distributed data system architectures" to be built from the same components. A system with exactly one DDAS and one BDAS is essentially a centralized system, whereas a system with one DDAS and multiple BDASs is a distributed system with centralized control. A system with multiple DDASs is a distributed system with distributed control.

An application program issues a transaction to the IMDAS in string form in the data manipulation language. The DDAS query processor in that cluster converts the transaction, expanding application-specific views into standard operations on conceptual generalized relations. If the resulting query can be executed entirely within the DDAS segment, it is passed to the DDAS transaction manager. Otherwise, it is sent to the MDAS. In either case, the query processor reports final status to the user program when the transaction is completed.
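The routing rule just described (a transaction whose data all lie within the local DDAS segment stays with that DDAS's transaction manager; anything that crosses segment boundaries goes to the MDAS) reduces to a one-line test. The fragment records here are illustrative, not IMDAS structures:

```python
def route_transaction(fragments, local_segment):
    """Decide who manages a transaction: the local DDAS transaction
    manager if every referenced fragment is in its segment, else the
    MDAS, which coordinates across DDAS segments."""
    segments = {frag["segment"] for frag in fragments}
    return "DDAS" if segments <= {local_segment} else "MDAS"

# A query touching only shop-floor data stays local; one that also
# touches the engineering segment escalates to the MDAS.
local_only = [{"segment": "shop-floor"}]
cross_segment = [{"segment": "shop-floor"}, {"segment": "engineering"}]
```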

[Figure 3. The IMDAS hierarchy: an application program submits transactions to a Distributed Data Server (DDAS), whose query processor and transaction manager coordinate with the Master Data Server (MDAS) and dispatch subqueries to Basic Data Servers (BDASs), each translating for its local DBMS.]

The DDAS transaction manager, using a fragmentation schema describing the distribution of its segment of the global model, maps the query into a set of subqueries, each of which operates on elements of the global database managed by an individual DBMS. The mapping algorithm takes into account the capabilities of the target DBMS. Operations that exceed the capabilities of the repository DBMS are routed to a sufficiently capable DBMS, with temporary generalized relations specified to hold the data to be operated on. This is also the mechanism by which information units from multiple DBMSs are integrated.

The subqueries are then dispatched to the affected BDASs, using optimistic commitment in the case of update, because many of the DBMSs have no commitment features at all. That is, individual subtransactions commit independently on the assumption that all will commit. The transaction manager uses a locking mechanism to ensure that simultaneous read/write or write/write access to the same underlying "database" is avoided. In this context a "database" is a modeled collection of information, which is a (possibly proper) subset of the data managed by a single DBMS. Transactions that "conflict" at this level are serialized by the controlling transaction manager, thus effecting distributed concurrency control.

An affected BDAS receives subqueries from the DDAS in the interchange form, specifying the operations to be performed on the local data repositories and the sources and destinations of the associated data. The BDAS converts the subquery to the form appropriate to the designated DBMS and passes it to that DBMS, converting any data involved between the DBMS form and the interchange data form, and reports completion to the DDAS. The BDAS itself accesses any referenced local data that are not managed by a DBMS (files or shared memory [Libes 1985; Mitchell and Barkmeyer 1984]), converting between the user-specified representation and the IMDAS interchange form.
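Because subtransactions commit optimistically and cannot be rolled back, the concurrency scheme above hinges on serializing, in advance, any two transactions that touch the same modeled "database" granule. A minimal sketch of that admission test, with an invented interface:

```python
class GranuleSerializer:
    """Sketch of conflict serialization at "database" granularity:
    a transaction is admitted only if none of the granules it touches
    is currently in use; otherwise it waits for the holder to finish."""
    def __init__(self):
        self._busy = set()

    def try_start(self, granules):
        """Admit the transaction, or refuse it on any granule conflict."""
        if self._busy & set(granules):
            return False          # conflict: run after the holder finishes
        self._busy |= set(granules)
        return True

    def finish(self, granules):
        """Release the transaction's granules on completion."""
        self._busy -= set(granules)
```

This is coarser than item-level locking, but it is what makes optimistic, independent commits safe: two transactions that could disagree about the same data never run at the same time.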

The BDAS also accesses required remote data by communication with the remote BDAS, moving the data directly from producer to consumer without regard to the control path. (For example, if DDAS-1 specifies that data be sent from BDAS-1 to BDAS-2, the data would go directly from BDAS-1 to BDAS-2 without going through DDAS-1.)

The MDAS is essentially a DDAS transaction manager with a fragmentation schema that describes the distribution of the global model over the DDASs, instead of the DBMSs. It accepts transactions from and reports status to the individual DDAS query processors. It sends subqueries to and receives status reports from the individual DDAS transaction managers. Since the MDAS is a clone of the DDAS transaction manager, it can be instantiated in any station that has a DDAS and thus can readily be replaced in the event of failure.

3.3.4 IMDAS Status and Future Plans

IMDAS modules currently exist for VAX computers running under the VMS operating system and for Sun Workstations (which run under the UNIX operating system) using TCP/IP networks. IMDAS interfaces currently exist for RTI/Ingres [Ingres 1986] and BCS/RIM [Boeing Computer Services 1985] systems, for the object-oriented GBASE system [Le Noan 1988], for the AMRF Geometry Modeling System [Tu and Hopp 1987], and for several file systems and shared-memory systems. The current IMDAS is just over 100,000 lines of C and Pascal code, and it represents 15-20 staff years of effort.

A front end for DBMSs using SQL and IMDAS modules for the IBM PC are in development. Modifications to IMDAS to use OSI networks for both internal and external communication and the draft Remote Data Access protocol [ISO 1989] for the user-IMDAS interface are also underway.

Experience in mapping the IMDAS semantic model to different DBMSs and application databases indicates that the use of a semantic integrating model was a wise choice, but the SAM* model itself is not sufficiently flexible. Consideration is being given to replacement of SAM* with a more complete semantic network data model into which many common information models can be translated, for example, SDM, IDEF1X, NIAM, Express, and OSAM*. This implies reworking of the query language, the internal query interpretation, and the mapping onto application databases, each of which has its own strengths and weaknesses, and all of which are appropriate joint research efforts for the 1990s.

On the other hand, the separation of paths for control and data flow at the user interface and within the IMDAS, a clear departure from conventional wisdom, has proven itself to be both effective and efficient and has yielded a number of other capabilities, coding elegancies, and design dividends.

3.4 Ingres (Ingres Corporation)

3.4.1 Background on Ingres

Ingres Corporation grew out of the INGRES project at the University of California at Berkeley, a research project on relational database technology that began in the early 1970s [Stonebraker 1986]. The company was incorporated as Relational Technology, Inc., in 1980 and changed its name to Ingres Corporation in 1989. The first commercial Ingres database management systems were delivered to customers in 1981. Ingres products currently are available for a wide range of mainframes, minicomputers, workstations, and personal computers under a wide range of operating systems.

Ingres/NET, which provides remote access from an Ingres application at one site to an Ingres database at another site, was first introduced in 1983. Ingres/STAR, which provides transparent access to distributed data, was first introduced in late 1986.

3.4.2 Ingres/STAR System Characteristics

The Ingres DBMS provides access to an Ingres database, which is a named collection of tables. Ingres front-end programs submit SQL queries to the Ingres DBMS to obtain data stored in the database.

An Ingres Gateway provides a method whereby data stored in other (i.e., non-Ingres) data managers is made to appear as if it were stored in an Ingres database and thus is made available to Ingres front-end programs.

The Ingres/STAR system allows users to access a distributed database, which is defined as a collection of tables from one or more Ingres databases. Any set of tables from any set of Ingres databases can be combined to form a new, distributed Ingres/STAR database. This includes not only databases under an Ingres DBMS but also databases accessible via an Ingres/Gateway and, in the near future, other Ingres/STAR databases. A single Ingres/STAR server may service multiple distributed databases, and multiple Ingres/STAR servers may exist in the network. Thus, in the terminology of [Sheth and Larson 1990], Ingres/STAR is a tightly coupled federated system supporting multiple federated schemata.

Access to the Ingres/STAR distributed databases is transparent in the sense that once the database has been created, the users of the database no longer need to know anything about the existence of the individual Ingres databases that make up the distributed database. Their contents are now available transparently via Ingres/STAR. Ingres/STAR appears to front-end programs just as if it were a centralized Ingres DBMS. Front-end programs function in the same manner regardless of whether the database being accessed is distributed or not, except for the restriction (in the current release) that within a single transaction only inserts/deletes/updates to data at a single site are allowed. That is, a distributed commit protocol has not yet been implemented in the current release of Ingres/STAR.

Ingres/STAR itself does not deal directly with the physical storage and retrieval of data. Instead it relies upon the Ingres/DBMS and/or Ingres/Gateway components to do this. The Ingres/STAR component communicates with these Ingres data managers (either Ingres DBMSs or Gateways) in the same manner a front-end program would. The same information is communicated. Ingres/STAR sends a query language representation of the desired work, and the data manager replies with the requested data. Figure 4 illustrates a typical configuration of users, data managers, and Ingres/STAR servers.

3.4.3 Ingres/STAR System Architecture

As noted above, the Ingres/STAR system builds a distributed database from a number of underlying component databases. In order to provide long-term storage for information about this federation, Ingres/STAR uses a local database called the Coordinator Database (CDB). The CDB holds information on each distributed database concerning

• Which databases are used in the distributed database
• The location of these databases
• The data manager associated with each database
• What tables from each database are included in the distributed database
• Naming (aliasing) information about these tables

Every Ingres DBMS, whether in a centralized or in a distributed system, uses the Internal Ingres Database Database (IIDBDB) to determine the information (sites, disks, users, etc.) necessary to access local or distributed databases. Each site contains one IIDBDB, any number of databases, and some number of servers. Information about each Ingres/STAR distributed database would appear in the IIDBDBs at sites at which queries to the distributed database originate. An Ingres/STAR server must have access to the IIDBDB containing information about each component database of each distributed database that it manages.

Figure 4 depicts a conceptual picture of the various components and databases of an Ingres/STAR system. The functions of Ingres/STAR are provided by seven major modules, described below.

[Figure 4. Configuration of Ingres/STAR system components and databases: Ingres/STAR servers communicate with DBMS servers managing Ingres databases and with a Gateway server managing a heterogeneous database.]

The General Communication Facility (GCF) provides the intercommunication among instances of Ingres/STAR, Ingres DBMSs, and Ingres/Gateways.

The Transaction Processing Facility (TPF), a feature still under development, will be responsible for maintaining Ingres/STAR's transaction system. It will monitor the transaction states of the various Ingres/DBMS, Ingres/Gateway, and Ingres/STAR partners and keep track of the state of distributed transactions. Thus, TPF will know which partners need which instructions during the prepare, commit, and abort portions of a two-phase commit transaction. In the event of an Ingres/STAR crash, it will be TPF's responsibility to resolve any outstanding transactions when Ingres/STAR is running again. TPF will maintain its own log for recovery of distributed transactions. This log will be separate from any log maintained by the DBMSs for recovery of transactions at individual databases.

The Query Evaluation Facility (QEF) manages the actual execution of queries. It sends subqueries to the other participants in a session, manipulates the returned results as required, and returns the final results to the Ingres/STAR client.

The Remote Query Facility (RQF) receives instructions from either QEF or TPF, formats the instructions, sends them to other participants in the session (Ingres/STARs, Ingres/DBMSs, or Ingres/Gateways), and returns answers to the requestor.

The Relation Description Facility (RDF) provides efficient access to catalog information by retrieving it, caching it, and managing the cache.

The Parser Facility (PSF) parses the query and passes it on to the Optimizer Facility (OPF) in parsed form. OPF plans the method of performing the query. This process is more complex than in an Ingres DBMS because it must take into account the capabilities of the various data managers involved in executing the query (since some may be gateways), the amount of data that must be moved from one site to another, the network speed(s), and the query-processing facilities available at the site of the Ingres/STAR server itself.

3.4.4 Ingres/STAR Status and Future Plans

Gateways are currently available on a production basis for RMS files and RDB databases on VAX computers under the VMS operating system and for DB2 databases on IBM (or compatible) mainframes under the MVS operating system. Gateways are soon expected to be available on a production basis for SQL/DS databases and IMS databases.

ACM Computing Surveys, Vol. 22, No. 3, September 1990 Heterogeneous Distributed Database Systems l 255

expected to be available on a production basis for SQL/DS databases and IMS databases.

A near-term future release will provide support for a two-phase commit protocol, thereby implementing the distributed transaction management capabilities and allowing distributed updates with full distributed data consistency and recovery. Support is also planned for horizontal and vertical partitioning of tables and for replicated tables.

One of the biggest challenges in designing gateways has been in understanding the capabilities required of the local data managers for participation in distributed operations. A subset of SQL that is supported by all gateways has been designed, and this common SQL is used in the Ingres/STAR product. This common subset is expected to evolve over time.

Similarly, providing a unified view of the system catalogs and data types across a variety of data managers has been a difficult job. A set of standard catalog information has been defined that is available from all gateways. This catalog information is used by Ingres/STAR to obtain information about the local databases.

3.5 Mermaid (Data Integration, Inc.)

3.5.1 Background on Mermaid

Development of the Mermaid® system began at System Development Corporation (now a part of Unisys) in 1982 [Templeton et al. 1987a, 1987b]. The motivation for the project was the requirement in the Department of Defense (DOD) for accessing and integrating data stored in autonomous databases. DOD cannot standardize on a single type of hardware or DBMS and therefore must develop the capability to operate in a permanently heterogeneous environment. After the completion of Mermaid it became clear that this requirement was not unique to DOD. In fact, any large organization may have multiple autonomous databases. In 1989, the development team left Unisys to start Data Integration, Inc., which is continuing development of Mermaid as a commercial product.

® Mermaid is a trademark of Unisys Corporation.

3.5.2 Mermaid System Characteristics

In the terminology of [Sheth and Larson 1990], Mermaid is a tightly coupled federated system supporting multiple federated schemata. In a sense, Mermaid is not a database management system but rather a front-end system that locates and integrates data that are maintained by local DBMSs. Parts of the local databases may be shared with global users.

There are two parts to presenting a single database view to the user of the federated system. First, a federated view or schema of all or parts of the component databases must be defined. Second, at run time the system needs to translate from the federated schema into the form in which the data are actually stored.

The user is able to use a single query language, SQL, to access and integrate the data from the different databases. The system automatically locates the data, opens connections to the backend DBMSs, issues queries to the DBMSs in the appropriate query language, and integrates the data from multiple sources. Integration may require translation of the data into a standard data type, translation of the units, combination or division of fields, union of horizontal fragments, join of vertical fragments, and/or encoding values.

Several levels of heterogeneity are supported:

• Hardware
• Operating system of the DBMS host
• Network connection to the DBMS host
• DBMS type and query language
• Data model: relational, network, sequential file
• Database schema

Presently the system permits retrieval across databases and updates to a single database. A read transaction may see an inconsistent state of the database, since local updates may occur in the local databases during query execution.
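The integration steps listed above can be illustrated with a small sketch. This is not Mermaid code; the relation names, attributes, and the monthly-to-annual unit rule are invented for the example.

```python
# Illustrative sketch (not Mermaid code): integrating results from two
# hypothetical local databases into one federated relation.

def integrate(fragment_east, fragment_west, salary_attrs):
    """Union two horizontal fragments, then join a vertical fragment of
    salary attributes, translating units to the federated standard."""
    # Union of horizontal fragments: same schema, disjoint sets of rows.
    employees = fragment_east + fragment_west
    # Join of a vertical fragment keyed on employee id, with unit
    # translation: one source stores monthly salary, the standard is annual.
    salary_by_id = {row["id"]: row["monthly_salary"] * 12 for row in salary_attrs}
    return [dict(e, annual_salary=salary_by_id[e["id"]]) for e in employees]

east = [{"id": 1, "name": "Ada"}]
west = [{"id": 2, "name": "Grace"}]
salaries = [{"id": 1, "monthly_salary": 5000}, {"id": 2, "monthly_salary": 6000}]
print(integrate(east, west, salaries))
```

A real front end would perform the same reshaping on data fetched from separate backend DBMSs rather than on in-memory lists.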

Mermaid minimizes the window of inconsistency by making snapshots of all relations as a first step in processing. The snapshot also reduces relations using the query's selects and projects and translates the schema into the federated schema. Updates to replicated and fragmented relations may cross databases, but the updates to all databases are not necessarily made concurrently. Processing is done interactively, although an application program interface is being developed.

3.5.3 Mermaid System Architecture

As shown in Figure 5, Mermaid has four components: the User Interface, the server, the Data Dictionary/Directory (DD/D), and the DBMS Interfaces. Most of the Mermaid software resides in a server that exists on the same network as the user workstations and DBMSs. The DBMSs may reside on the workstations or on mainframe computers. The system is designed for flexibility and modularity. All components may reside on a single computer, or each may reside on a different computer. In a large system with many users there may be several copies of the components.

At least one Mermaid server must exist to provide a platform for the code. The server contains the optimizer that plans query processing and the controller that configures the system and controls execution. The server runs on a Sun Workstation, taking advantage of its support for many network protocols.

The User Interface includes code to authenticate users, initialize the system, edit queries, maintain a query library, get help with the system, view reports, and optionally manipulate the output returned through other programs. Support is provided for both workstations and dumb terminals and for both single window and multiple window interfaces.

The current Mermaid code provides several levels of access control that make it at least as secure as a commercial DBMS. The first level of access control is a list of users with permission to execute Mermaid from the workstation where the user is logged in. Next, the DD/D is opened and permissions are checked for the specific federated database or view. These permissions must reflect the permissions granted by the administrators for the underlying databases. Mermaid then logs into the local computers and opens the underlying databases.

The query optimizer does dynamic planning. It first locates all required relations and selects one or more copies of replicated relations and some or all fragments of relations. Each local relation is reduced with selects, projects, and joins to other relations at the same site. The size of each intermediate result is returned and used to plan the next step. The optimizer considers network speeds and relative processor speeds when determining the best way to process the query. If there are large relations to be joined, it may perform semijoins between sites before moving data to an assembly site. As much processing as possible is done in parallel.

Each Mermaid component uses the RPC remote procedure call protocol above Transmission Control Protocol/Internet Protocol (TCP/IP) to communicate with the other components, with the possible exception of code residing on the DBMS host. This allows the communication to be the same whether the processes are local or remote. All messages between DBMS Interfaces go through the server, which provides protocol conversion. For example, if one DBMS site uses LU6.2 and another uses TCP/IP, the server would receive a message from the DBMS interface (DBMSI) at the first site using the LU6.2 protocol and send it to the DBMSI at the second site using the TCP/IP protocol.

The Data Dictionary/Directory is a relational database, stored in a commercial DBMS, that contains information about the databases and environment. Figure 5 shows the case of the DD/D residing on the server. The DD/D may also reside on a different computer.

The DBMSs generally reside on different computers. Mermaid has an open architecture that will support the development of interfaces to many types of DBMS. The only requirement is that the data are managed by a DBMS that provides separation of the application from the data, retrieval of data elements by name (not file offsets), selection of records, an interactive query language, and concurrency control.
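The semijoin tactic used by the dynamic planner can be sketched as follows. The relations and attribute names are hypothetical; the point is only that shipping the join-column values of a small relation costs far less than shipping a large relation whole.

```python
# Hypothetical sketch of the semijoin reduction idea: before moving a large
# relation to the assembly site, ship only the join-column values of the
# smaller relation and filter the large one where it lives.

def semijoin_reduce(large_rel, small_rel, key):
    """Reduce large_rel to the tuples whose key value appears in small_rel."""
    keys = {row[key] for row in small_rel}          # only this set crosses the network
    return [row for row in large_rel if row[key] in keys]

orders = [{"cust": c, "amt": c * 10} for c in range(1000)]   # large relation at site A
customers = [{"cust": 3}, {"cust": 7}]                        # small relation at site B
reduced = semijoin_reduce(orders, customers, "cust")
print(len(reduced))
```

After the reduction, only two of the thousand tuples need to move to the assembly site; the observed intermediate size would then feed the planner's next decision.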


Figure 5. Mermaid system architecture.

There are three generic types of DBMS interface. A Class I DBMS is the favored type. The current INGRES, ORACLE, and IDM interfaces fall into this class. The DBMSI code can be put on the same computer as the DBMS and interfaced to the DBMS through a subroutine call. This provides the most efficient operation and the best error handling. A Class II DBMS either does not support standard network protocols or does not allow the Mermaid DBMS interface code to reside on the same computer as the DBMS. An example of this would be a database machine. The schema and language translators then reside on the server or on another front-end computer. A Class III DBMS is accessed using a terminal emulator interface such as a 3270 emulator from a front-end processor. An example of this would be a DBMS that only supports an interactive terminal interface. This type of interface can be difficult to develop and is weak in error handling.

Mermaid also has the capability to retrieve data from files. A file is a typed object with a retrieval method and a display method. The user selects a set of files of interest by searching on structured fields and listing structured fields and files in the target list. The report looks like a standard report from a relational system except that files are given a symbolic name. The user enters the symbolic name of one or more files he or she wants to see. The method (process) to retrieve the file is started on the computer where the file is stored, and the method (process) to display the file is started in a window on the user's workstation. New file types can be supported by writing the retrieve and display methods.

3.5.4 Mermaid Status and Future Plans

The run-time part of Mermaid was completed at Unisys. Work on the system was started in 1982, and the system was


completely recoded from 1986 to 1987. The total level of effort was 30-40 staff years.

Data Integration, Inc., was founded in April 1989 to continue the product development. The two major missing components in the system at that time were a DBMS-independent DD/D Builder Tool and a system administration utility. These are currently being completed, and professional documentation is being written. Additional DBMS interfaces are being developed under contract.

The biggest challenge faced by the developers of Mermaid has been error handling. Mermaid runs above many layers of DBMS, operating system, and network protocol. Each layer has many error conditions that may cause errors in other layers. There is no clear definition of all errors that can occur in each layer or of how other layers respond to errors. When an error does occur, it is difficult for the Mermaid code to understand the source of the error and potential cures. It is also difficult to describe some errors to the user.

Another major problem has been coping with new releases of the underlying software. Mermaid frequently encounters problems when new releases of the DBMS, operating system, or network are installed. This poses problems for support of a heterogeneous database system because release control can be difficult in such an environment, and testing of all combinations of underlying software is generally not possible.

3.6 MULTIBASE (Xerox Advanced Information Technology¹)

¹ Formerly the Advanced Information Technology Division of Computer Corporation of America.

3.6.1 Background on MULTIBASE

MULTIBASE [Landers and Rosenberg 1982] provides a uniform, integrated interface for retrieving data from preexisting, heterogeneous, distributed databases. It was designed to allow the user to reference data in these databases with one query language over one database description (schema). By presenting a globally integrated view of information, MULTIBASE allows the user to access data in multiple databases quickly and easily. Because there is an integrated schema and a single query language, the user has to be familiar with only one uniform interface instead of numerous local system interfaces.

The MULTIBASE project was initiated in 1980 to develop a software system that would enable organizations to achieve integrated data access without replacing existing databases. The system has three major design objectives. First, it is a general system that is not designed for any specific application area. It can be used without making changes to existing databases and does not interfere with existing application programs. Second, MULTIBASE has been designed to incorporate a wide range of data sources. These data sources encompass the major classes of DBMSs (hierarchical, network, relational) as well as file systems and custom-built DBMSs. Third, MULTIBASE has been designed to minimize the cost of adding a new data source. The system is largely description driven, and its modular architecture minimizes the amount of custom software that must be developed for each new DBMS.

3.6.2 MULTIBASE System Characteristics

MULTIBASE uses a data definition and manipulation language called DAPLEX, which is based on the functional data model [Landers et al. 1984; Shipman 1981]. Because the functional model is rich enough to represent relational, hierarchical, and network database schemata directly, the need to translate these schemata and their corresponding operations into a strictly relational model has been eliminated. The system supports ad hoc query access through several interactive interfaces and an Ada application program interface.

In the terminology of [Sheth and Larson 1990], MULTIBASE is a tightly coupled federated system that provides for the definition of multiple local schemata and multiple federated schemata, or views. Local schemata describe the data available at an individual local DBMS. Views describe integrations of the data described in local schemata.
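The functional-model idea behind DAPLEX can be suggested with a tiny sketch. This is not DAPLEX syntax; the student/department schema and the function names are invented, and DAPLEX's actual notation (e.g., Name(Dept(s))) should be taken from [Shipman 1981].

```python
# Illustrative sketch (not DAPLEX): the functional data model treats both
# attributes and relationships as functions applied to entities, so
# hierarchical and network structures map to function composition rather
# than to flattened relational tables.

dept_of = {"alice": "physics", "bob": "math"}      # function Dept: Student -> Dept
head_of = {"physics": "curie", "math": "noether"}  # function Head: Dept -> Instructor

def head_of_student(student):
    """Composition Head(Dept(s)), in the spirit of DAPLEX's Name(Dept(s))."""
    return head_of[dept_of[student]]

print(head_of_student("alice"))
```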

Users can query any combination of local schemata or views, and multiple schemata or views can be referenced in a single query. From the user's perspective, views provide complete location transparency. The MULTIBASE view definition language supports horizontal and vertical fragmentation and data mapping of the data contained in the individual local databases. MULTIBASE makes no restrictions on the queries that can be processed over views. The current version of the system does not, however, support updates.

Each MULTIBASE location maintains its own directory of local schemata and views.

The MULTIBASE view mechanism is also used to resolve data incompatibilities [Katz et al. 1981] that frequently arise when separately developed and maintained databases are accessed conjointly. Incompatibilities include (a) differences in naming conventions, underlying data structures, representations, or scale, (b) missing data, and (c) conflicting data values. When defining a view, the database administrator applies knowledge of the local databases to determine what incompatibilities might arise and what rules should be used to reconcile them. The rules are included in the view definition, after which they are followed automatically by the system in generating answers to queries.

MULTIBASE performs query optimization at both the global and local levels [Dayal et al. 1981, 1982]. At the global level, the system creates query execution strategies that attempt to minimize the amount of data moved between sites and to maximize the potential for parallel processing that is inherent when multiple distributed databases are accessed. At the local level, the system attempts to minimize the amount of time needed to retrieve data from a local DBMS by taking full advantage of the local DBMS query language, physical database organization, and fast access paths.

3.6.3 MULTIBASE System Architecture

As shown in Figure 6, a MULTIBASE system consists of three major types of components: the Global Data Manager (GDM), one or more Local Database Interfaces (LDIs), and the Internal DBMS.

The GDM is the central component of MULTIBASE, providing query manipulations that require global knowledge. These operations include translation of queries expressed over views into queries expressed over individual local databases, modification of queries to compensate for operations that cannot be performed at the individual local databases, optimization, and generation of plans for the execution of the user's query. All query manipulation in the GDM is performed in an internal form of DAPLEX. The GDM is responsible for management of the global directory, including views, local schemata, authorization information, and descriptions of local DBMS capabilities.

Local Database Interfaces are responsible for receiving queries that have been generated by the GDM, translating these queries from DAPLEX into a form that is acceptable to the local DBMS, and passing these queries to the local DBMS to be processed. The LDI receives the data from the local DBMS, translates it into MULTIBASE standard format, and returns it to the GDM. A MULTIBASE system can contain multiple LDIs.

An LDI need be developed only once for each new type of local DBMS. The GDM transmits to the LDI the local schemata for the LDI's databases. The local schema contains the information required by the LDI to generate queries against the individual local databases. In general, the GDM does not have to be modified to add a new local database or DBMS. The information about the LDI and local DBMS that is required by the GDM is stored by the GDM in the directory.

The Internal DBMS is a DAPLEX DBMS that is used by the GDM for any processing that cannot be performed at an individual local DBMS. This provides MULTIBASE users with the full power of the DAPLEX query language against any local DBMS. Any operation that cannot be performed by a local DBMS is performed transparently by the Internal DBMS using data that has been retrieved by the LDI under direction of the GDM.
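The reconciliation rules an administrator folds into a view definition might look, in spirit, like the following sketch. The rule table, attribute names, and the recency-based conflict rule are all hypothetical; MULTIBASE expresses such rules declaratively in its view definition language, not in Python.

```python
# Hypothetical sketch of view-definition reconciliation rules: rename
# attributes (naming differences), rescale values (scale differences), and
# resolve conflicting values between two local sources.

RULES = {
    "rename": {"emp_nm": "name", "sal": "salary"},   # naming conventions differ
    "scale": {"salary": 1000},                        # one source stores thousands
}

def apply_view_rules(local_row, other_row=None):
    row = {RULES["rename"].get(k, k): v for k, v in local_row.items()}
    for attr, factor in RULES["scale"].items():
        if attr in row:
            row[attr] *= factor
    # Conflicting values: a simple rule preferring the more recently
    # updated source (one of many possible reconciliation policies).
    if other_row and other_row.get("updated", 0) > row.get("updated", 0):
        row.update({k: v for k, v in other_row.items() if k != "updated"})
    return row

print(apply_view_rules({"emp_nm": "Ada", "sal": 90, "updated": 1}))
```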

Figure 6. MULTIBASE component architecture.

The Internal DBMS can also store auxiliary databases that are used to support resolution of data incompatibilities. For example, auxiliary databases can contain new data that are not available in any local databases, statistics used to determine which data values should be used in case of conflict, and conversion tables that provide a means of performing data transformations that cannot be done using simple formulas.

3.6.4 MULTIBASE Status and Future Plans

A MULTIBASE prototype system has been implemented in Ada. It supports the capabilities described above except that only a portion of the global optimization design has been implemented. The GDM software is about 350,000 Ada source lines. It executes on a VAX under the VMS operating system. LDIs have been developed for five DBMSs. They provide access to ORACLE and RIM systems executing under the VMS operating system and to FOCUS, System 2000, and DMR (a hierarchical DBMS developed by the U.S. Army) systems executing under the MVS operating system. The LDIs vary in size from 7000 to 20,000 Ada source lines. To minimize the cost of adding a new DBMS to MULTIBASE, a set of LDI building blocks has been created. These building blocks implement LDI processing that is common to all LDIs and greatly reduce the amount of new software required to create an LDI.

MULTIBASE is currently being used as a component of two systems. One system is a prototype for supporting design, manufacturing, and logistics of large mechanical systems. The other system is a pilot demonstration of the utility of distributed heterogeneous data management for providing integrated access to logistics databases. Both of these efforts require MULTIBASE to interface to existing databases that are not based on the relational model.

Several lessons have been learned during this project that will provide direction for future enhancements to MULTIBASE. Two of the most important of these are the difficulties in handling local system peculiarities and the need for automated tools to support the creation and maintenance of MULTIBASE schemata.

Although MULTIBASE has proved effective at supporting most of the data management capabilities of a wide range of DBMSs, each DBMS has had its own peculiarities (i.e., special data types, operations, or optimization heuristics) that have

turned out to be difficult to support. This has pointed out the need for a more extensible framework, possibly object oriented, that would allow new capabilities to be readily accommodated within the MULTIBASE system.

Experience has indicated that automated tools are needed for administering the dictionary of a system that is integrating databases from many different organizations. Tools are needed to assist both in creating MULTIBASE schemata and in maintaining consistency as changes occur to the local databases.

3.7 SYBASE (Sybase, Inc.)

3.7.1 Background on SYBASE

Sybase, Inc., was founded in 1984 with the goal of bringing a high-performance distributed RDBMS to the market. The initial production versions of the SYBASE SQL Server and the SQL Toolset (application development tools) were shipped in June 1987, and there are currently more than 1000 customers using the product on nearly 30 different hardware platforms. Using a client/server architecture, SYBASE was designed to handle on-line, transaction-oriented applications that require high performance, continuous availability, and data integrity that cannot be circumvented.

The need to integrate a variety of client applications with multiple sources of data is a clear requirement in today's commercial market. Hence, in September 1989 Sybase introduced the Open Server, a product that extends the SYBASE distributed capabilities to heterogeneous data sources. This product complements the Open Client, a client Application Programming Interface (API) used to send SQL or Remote Procedure Calls (RPCs) to an SQL Server. Together they form the Client/Server Interfaces, the basis of the SYBASE approach to heterogeneous distributed databases.

3.7.2 SYBASE System Characteristics

There are two broad categories of distributed databases: on-line and decision support. Decision support applications tend to read, but not update, remote data. They are mostly concerned with presenting a decision maker with a unified, single-system image of data that is distributed throughout the enterprise. On-line applications, by contrast, involve remote updates and have a strict requirement to maintain local site autonomy. Given the orientation of the SYBASE system to on-line applications, SYBASE is an interoperable system, or a "loosely coupled" federated system in the terminology of Sheth and Larson [1990]. SYBASE attempts to open the architecture as widely as possible to allow any database, application, or service to be integrated into the client/server architecture in a heterogeneous environment. No global data model or schema is enforced. Rather, distributed operations can be supported via application programming or via database-oriented RPCs between SQL Servers. This provides a high degree of site autonomy. At the same time, the SQL Server provides full DBMS support at each location and "prepare to commit" support for a two-phase commit protocol to guarantee recoverability for multisite updates [Gray 1978].

SYBASE is based on the relational model and supports both interactive and programmed access to the SQL Server or any Open Server application. The basic query language is SQL. Multiple SQL statements may, however, be augmented with programming constructs such as conditional logic (if, else, while, etc.), procedure calls and parameters, and local variables. These may be combined into a single database object called a stored procedure. A procedure is an independently protected object and (as in the case of a view) can override the protection of the tables it references. Thus, it is possible to grant execution privileges to a procedure but disallow direct access to the data it references. Procedures can return rows of data and error messages, and they can return values back into programming variables in the application program.

The SQL Server also supports triggers as independent objects in the database [Date 1983]. These have the capabilities of procedures, with three important extensions:

(1) They cannot be directly executed but rather are executed as a side effect of an SQL delete, insert, or update.
(2) A trigger is an extension of the user's current SQL statement. It can roll back or modify the results of a user's transaction.
(3) Triggers can view the data being changed.
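Trigger semantics of this general kind survive in modern SQL engines, so a minimal, runnable sketch is possible using Python's built-in sqlite3 module rather than SYBASE Transact-SQL; the account schema and the overdraft rule are invented for illustration.

```python
# Sketch of a trigger firing as a side effect of an UPDATE and enforcing an
# integrity rule inside the database (sqlite3 standing in for an SQL Server).
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER);
    -- Cannot be invoked directly; runs only as a side effect of UPDATE.
    CREATE TRIGGER no_overdraft BEFORE UPDATE ON account
    WHEN NEW.balance < 0
    BEGIN
        SELECT RAISE(ABORT, 'balance may not go negative');
    END;
""")
db.execute("INSERT INTO account VALUES (1, 100)")
try:
    db.execute("UPDATE account SET balance = -50 WHERE id = 1")
except sqlite3.DatabaseError as e:
    print("rejected:", e)
print(db.execute("SELECT balance FROM account").fetchone()[0])
```

The illegal update is rolled back by the database itself, with no reliance on the application program, which is precisely the database-enforced-integrity argument made below.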

In traditional centralized database systems, users of on-line applications are not given direct update access to a database but rather communicate with an application program that protects the database from the user. This common approach can be called "application-enforced integrity." The legality of any update is determined principally by rules enforced by the application program. Application-enforced integrity is, however, a flawed approach in heterogeneous distributed databases, where the application may be written in a different department or in a different city from the DBA whose database is being updated. A better alternative in a heterogeneous distributed database is to enforce data integrity within the database itself. Under this alternative, an application at a remote site communicates directly with a database that has sufficient richness of semantics to decide by itself whether the transaction violates any integrity rules. Stored procedures and triggers provide this capability.

The SYBASE Open Server provides a consistent method of receiving SQL requests or remote procedure calls from an application based on the SYBASE SQL Toolset or an application using the SYBASE Open Client Interface and passing them to a non-SYBASE database or application. Whereas Open Client gives users the flexibility to use a variety of front-end packages or applications for accessing and updating data, the SYBASE Open Server allows access to and updating of foreign (non-SYBASE) databases and applications.

At run time an application program issues a database RPC to the distributed database system, which consists of any combination of SQL Servers or Open Servers. If the data are stored in a non-SYBASE source, the Open Server provides the necessary data type and network conversions to allow the Open Client to process the returned data.

SYBASE supports distributed updates that span multiple locations. A two-phase commit protocol, coded in the application, enforces distributed transaction control across multiple SQL Servers.

3.7.3 SYBASE System Architecture

As shown in Figure 7, the SYBASE Open Server consists of two logical components. A server network interface manages the network connection and accepts requests from client programs running Open Client or database RPCs from an SQL Server. Event-driven server utilities in the Server-Library provide the logic to ensure that client requests are passed to the appropriate User Developed Handler and are completed properly. They also ensure that the returned data are correctly formatted for the client program. This enables a SYBASE client application to request processing and exchange information with any application or data management system. The SYBASE Network Interface component of the Open Server appears to the developer and user exactly as an SQL Server interface. It

• Supports multiple connections from multiple clients or SQL Servers
• Supports multiple logical connections on a single network connection to increase network efficiency
• Shields the user from knowledge of underlying networking
• Passes returned data a record at a time in exactly the same format as an SQL Server

The Server-Library enormously simplifies the coding of distributed multiuser server applications. The utilities supplied by SYBASE provide the logic to handle basic server events such as establishing or ending client connections and starting or stopping processes. SYBASE also provides utilities to handle SQL and database RPC requests.
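The application-coded two-phase commit mentioned earlier in this section can be sketched as follows. The Server class and its prepare/commit calls are hypothetical stand-ins for Open Client connections to real SQL Servers; the sketch shows only the voting logic the application must implement.

```python
# Hedged sketch of application-coded two-phase commit across several
# servers, each offering "prepare to commit" support.
class Server:
    def __init__(self, name, can_commit=True):
        self.name, self.can_commit, self.state = name, can_commit, "active"
    def prepare(self):                 # phase 1: vote on "prepare to commit"
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit
    def commit(self):                  # phase 2: make the update durable
        self.state = "committed"
    def rollback(self):
        self.state = "aborted"

def two_phase_commit(servers):
    if all(s.prepare() for s in servers):   # phase 1: every site must vote yes
        for s in servers:                   # phase 2: commit at every site
            s.commit()
        return True
    for s in servers:                       # a single "no" vote aborts all sites
        s.rollback()
    return False

print(two_phase_commit([Server("ny"), Server("tulsa")]),
      two_phase_commit([Server("ny"), Server("la", can_commit=False)]))
```

A production protocol would also log each phase so that recovery after a crash can finish or undo the transaction, which is what "prepare to commit" support in the server makes possible.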


Figure 7. SYBASE open client/server architecture.
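The dispatch between Server-Library utilities and user-developed handlers shown in Figure 7 might be sketched like this; the event names, the dispatch table, and the handler logic are invented for illustration and do not reflect the actual Server-Library API.

```python
# Hypothetical sketch of an Open-Server-style event loop: supplied utilities
# handle connection events, while RPC requests are routed to a
# user-developed handler for a foreign data source.
def handle_connect(event):
    return "connection opened for " + event["client"]

def user_rpc_handler(event):
    # A real handler would forward the RPC to a non-SYBASE DBMS and
    # reformat its reply into SQL Server row format for the client.
    return "rows for rpc " + event["proc"]

DISPATCH = {"connect": handle_connect, "rpc": user_rpc_handler}

def serve(events):
    return [DISPATCH[e["type"]](e) for e in events]

print(serve([{"type": "connect", "client": "ui1"},
             {"type": "rpc", "proc": "get_parts"}]))
```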

These utilities are designed to be extended by user-developed handlers. The handlers provide the logic to transmit the RPC or SQL request to the target application or data management system in a predefined, predictable way. The user can also define additional utilities to handle events particular to an application or environment. The combination of SYBASE-supplied and user-developed utilities provides the flexibility to develop the appropriate server environment for any application or data source. Through Server-Library the developer can

• Manage the task queue
• Pass requests and parameters to foreign applications or data management systems through the user handler
• Process responses from user handlers (normal, error, and so on)
• Collect returned data or parameters from the user handler
• Convert returned data to a format compatible with the client application

3.7.4 SYBASE Status and Future Plans

The SYBASE Open Server and Release 4.0 of the SYBASE SQL Server, which provides a server-to-server RPC capability, have been commercially available since September 1989. They are in use at numerous customer sites performing on-line applications.

The total effort involved in the development of these products is hard to measure. It would be fair to say, however, that the development of the SQL Server took more than 30 staff years, and the Open Server and RPC enhancements to the SQL Server added approximately 3 staff years. This includes engineering effort only and does not include quality assurance, documentation, and so on. Future plans for the product include moving the two-phase commit protocol into the SQL Server and extending its capabilities to include heterogeneous systems. Access to distributed SQL Servers will be made transparent by means of synonyms and a distributed data dictionary.

4. SUMMARY

As can be seen from the above system descriptions, significant progress has been made in developing heterogeneous distributed database systems for production use. It is, however, certainly not yet possible to buy a system off the shelf that will link all of the popular data models and database management systems and provide full support for schema integration, distributed query management, and distributed transaction management.


Moreover, although the architectural principles are becoming well understood for building a custom system that provides distributed query management, the effort required actually to implement such a system is high. In addition, implementing heterogeneous distributed transaction management is not yet well understood.

There are commercially available systems that encompass a wide variety of types of computers, operating systems, and networks. Gateways are being developed from these systems to various other database management systems based on different data models. Some of these commercial systems offer distributed query management and some offer distributed transaction management, but none offer both, although that is expected to change in the near future. So far these systems offer only limited schema integration capabilities, without system support for horizontal or vertical fragmentation or replicated data, although this is also expected to change in the near future.

Custom systems have been built that encompass a variety of types of computers, operating systems, networks, database management systems, and data models. Some of these systems have rather sophisticated schema integration and distributed query management capabilities. They are, however, just beginning to develop distributed transaction management.

It is important to remember that this is a very fluid landscape. This paper was written in late 1989, and there will undoubtedly be advances between that time and the time it appears in print. There will undoubtedly be even more advances in the months to follow. Existing systems will improve their capabilities, and new systems and vendors will appear.

ACKNOWLEDGMENTS

The authors wish to express their appreciation to the editors and the anonymous referees for the many helpful suggestions that significantly improved this paper.

REFERENCES

BARKMEYER, E., MITCHELL, M., MIKKILINENI, K., SU, S. Y. W., AND LAM, H. 1986. An architecture for an integrated manufacturing data administration system. NBSIR 863312, National Bureau of Standards, Gaithersburg, Md.

BOEING COMPUTER SERVICES 1985. Boeing RIM User's Manual. Version 7.0, 20492-0502, Boeing Computer Services, Seattle, Wash.

BREITBART, Y. J., AND SILBERSCHATZ, A. 1988. Multidatabase systems with a decentralized concurrency control scheme. IEEE Comput. Soc. Distrib. Proc. Tech. Comm. Newsletter 10, 2 (Nov.), 35-41.

BREITBART, Y. J., AND TIEMAN, L. R. 1985. ADDS: Heterogeneous distributed database system. In Distributed Data Sharing Systems, F. Schreiber and W. Litwin, Eds. North Holland Publishing Co., The Netherlands, pp. 7-24.

BREITBART, Y. J., OLSON, P. L., AND THOMPSON, G. R. 1986. Database integration in a distributed heterogeneous database system. In Proceedings of the International Conference on Data Engineering (Los Angeles, Calif., Feb. 5-7). IEEE, Washington, D.C., pp. 301-310.

BREITBART, Y. J., SILBERSCHATZ, A., AND THOMPSON, G. R. 1987. An update mechanism for multidatabase systems. Data Eng. 10, 3 (Sept.), 12-18.

BREITBART, Y. J., SILBERSCHATZ, A., AND THOMPSON, G. R. 1989a. Transaction management in a multidatabase environment. In Integration of Information Systems: Bridging Heterogeneous Databases, A. Gupta, Ed. IEEE Press, New York, pp. 135-143.

BREITBART, Y. J., SILBERSCHATZ, A., AND THOMPSON, G. R. 1989b. Reliable transaction management in a multidatabase system. Submitted for publication.

CERI, S., AND PELAGATTI, G. 1984. Distributed Databases: Principles and Systems. McGraw-Hill, New York.

CHEN, A. L. P., BRILL, D., TEMPLETON, M. P., AND YU, C. T. 1989. Distributed query processing in a multiple database system. IEEE J. Select. Areas Commun. 7, 3 (Apr.), 390-398.

CHUNG, C. W. 1987. DATAPLEX: A heterogeneous distributed database management system. Research Publication GMR-5973, General Motors Research Laboratories (Sept.).

CHUNG, C. W. 1990. DATAPLEX: An access to heterogeneous distributed databases. Commun. ACM 33, 1 (Jan.), 70-80. (With corrigendum in Commun. ACM 33, 4 (Apr.), p. 459.)

ACM Computing Surveys, Vol. 22, No. 3, September 1990 Heterogeneous Distributed Database Systems 265

CHUNG, C. W., AND IRANI, K. B. 1986. An optimi- distributed databases. Tech. Rep. CCA-81-06. zation of queries in distributed database systems. Computer Corporation of America (May). J. Parall. D&rib. Comput. 3, 2 (June), 137-157. KRISHNAMURTHY, V., SW, S. Y. W., LAM, H., MITCH- DATE, C. J. 1983. An Introduction to Database Sys- ELL, M., AND BARKMEYER, E. 1987. A distrib- tems. Vol. II. Addison-Wesley, Reading, Mass. uted database architecture for an integrated DAYAL, U., GOODMAN, N., LANDERS, T., OLSEN, K., manufacturing facility. In Proceedings of the Zn- SMITH, J. AND YEDWAB, L. 1981. Local query ternational Conference on Data and Knowledge Systems for Manufacturing and Engineering optimization in MULTIBASE: A system for het- (Oct.), Computer Society Press of the IEEE, pp. erogeneous distributed databases. Tech. Rep. CCA-81-11, Computer Corporation of America 4-13. (Oct.). LANDERS, T., AND ROSENBERG, R. 1982. An over- view of MULTIBASE. In Distributed Databases, DAYAL, U., LANDERS, T., AND YEDWAB, L. H.-J. Schneider. Ed. North Holland Publishing 1982. Global optimization techniques in MUL- Company, The fietherlands, pp. 153-184. - TIBASE. Tech. Rep. CCA-82-05, Computer Cor- poration of America (Oct.). LANDERS, T., Fox, S., RIES, D., AND ROSENBERG, R. 1984. DAPLEX User’s Manual. Tech. Rep. ELMAGARMID, A. K. 1986. A survey of distributed CCA-84-01, Computer Corporation of America. deadlock detection algorithms. SZGMOD Rec. 15, LE NOAN, Y. 1988. Object-oriented programming ex- 3 (Sept.), 37-45. - ploits AI. Comput. Technol. Reu. (Apr.). ESWARAN, K. P., GRAY, J. N., LORIE, R. A., AND LEE, W. F., OLSON, P. L., THOMAS, G. F., AND TRAIGER, I. L. 1976. The notions of consistency THOMPSON, G. R. 1988. A remote user interface and predicate locks in a database system. Com- for the ADDS multidatabase system. In Proceed- mum ACM 19, 11 (Nov.), 624-633. ings of the 2nd Oklahoma Workshop on Applied GLIGOR, V. D., AND POPESCU-ZELETIN, R. Computing (Tulsa, Okla. Mar. 
18), The Univer- 1985. Concurrency control issues in distributed sity of Tulsa, pp. 194-204. heterogeneous database management systems. LIBES, D. 1985. User-level shared variables. In Pro- Distributed Data Sharing Systems. F. Schreiber ceedings of the Summer 1985 USENZX Confer- and W. Litwin. Eds. North-Holland Publishing ence (PO&and, Oregon, June), The USEfiIX Co., The Netherlands, 43-56. Association, Berkeley, Calif. GRAY, J. 1978. Notes on data base operating systems. MITCHELL, M., AND BARKMEYER, E. 1984. Data dis- In Operating Systems: An Advanced Course, R. tribution in the NBS AMRF. In Proceedings of Bayer, R. M. Graham, and G. Seegmuller, Eds. the ZPAD ZZ Conference (Denver, Colo., April), Springer-Verlag, New York, pp. 393-481. NASA Conference Publication 2301, 211-227. INCRES 1986. Zngres. Relational Technology,-. Inc, NANZETTA, P. 1984. Update: NBS research facility Alameda, Calif. 94501 (Jan.). addresses problems in set-ups for small batch manufacturing. Znd. Eng. 16, 6 (June), 68-73. Is0 1982. ISO/TC97: Draft International Standard ZSO/DZS 7489: Information Processing System.- SHETH, A., AND LARSON, J. A. 1990. Federated da- Open Systems Interconnection-Basic Reference tabases: Architectures and integration. ACM Model. International Organization for Standard- Comput. Suru. 22, 4 (Dec.). ization. SHIPMAN, D. 1981. The functional data model and Is0 1987a. International Standard IS0 8824: Znfor- the data language DAPLEX. ACM Trans. Data- mation Processing Systems-Open Systems Znter- base Syst. 6, 1 (Mar.), 140-173. connection-Specification of Abstract Syntax STONEBRAKER, M., Ed. 1986. The ZNGRES Papers: Notation One (ASN.Z). International Organiza- Anatomy of a Relational Database System. tion for Standardization. Addison-Wesley, Reading, Mass. Is0 1987b. International Standard IS0 8825: Znfor- Su, S. Y. W. 1985. Modeling integrated manufactur- mation Processing Systems-Open Systems Znter- ing data using SAM*. 
In Proceedings of GZ Fach- connection-Basic &coding kules- for Abstract tagung, Karlsruhe (Mar.). (Reprinted as Syntax Notation One (ASN.1). International Or- Datenbank-Systeme fur Buro, Technik und ganization for Standardization. Wissenschaft. Springer-Verlag, New York.) Is0 1989. Draft proposed international standard Su, S. Y. W., LAM, H., KHATIB, M., KRISHNAMUR- 9579: Information processing systems-Open THY, V., KUM~R, A., MALIK, S. MITCHELL, M., systems interconnection-Generic remote data- AND BARKMEYER. E. 1986. The architecture base access service and protocol. M.A. Corfman, and prototype implementation of an integrated Ed., unpublished document ISO/IEC JTCl SC21 manufacturing database administration system. WG3 N845. In Proceedings of Spring COMPCON. KATZ, R., GOODMAN, N., LANDERS, T., SMITH, J., TEMPLETON, M., WARD P., AND LUND, E. AND YEDWAB, L. 1981. Database integration 1987b. Pra$matics of access control in Mer- and incompatible data handling in MULTI- maid. Q. Bull. IEEE Comput. SOC. Tech. Commit- BASE: A system for integrating heterogeneous tee Data Eng. 10, 3 (Sept.), 33-38.

ACM Computing Surveys, Vol. 22, No. 3, September 1990 266 l Thomas et al.

TEMPLETON, M., BRILL, D., CHEN, A., DAO, S., TRAIGER, I. L., GRAY, J. N., GALTIERI, C. A.. AND LUND, E., MACGREGOR, R., WARD, P. LINDSAY, B. G. 1982. Transactions and cbnsis- 1987a. Mermaid: A front-end to distributed het- tency in distributed database management sys- erogeneous databases. Proc. IEEE 75, 5 (May, tems. ACM 2’ran.s. Database Syst. 7, 3 (Sept.), Special Issue on Distributed Database Systems), 323-342. 695-708. THOMPSON, G. R. 1987. Multidatabase concurrency Tu, J. S., AND HOPP, T. H. 1987. Part geometry control. Ph.D. dissertation, Oklahoma State Uni- data in the AMRF. NBSIR 87-3551, National versity. Bureau of Standards, Gaithersburg, Md.

Received November 1988; final revision accepted June 1990.

ACM Computing Surveys, Vol. 22, No. 3, September 1990