A Relational Multi-Schema Data Model and Query Language for Full Support of Schema Versioning?
Total Page:16
File Type:pdf, Size:1020Kb
A Relational Multi-Schema Data Model and Query Language for Full Support of Schema Versioning? Fabio Grandi CSITE-CNR and DEIS, Alma Mater Studiorum – Universita` di Bologna Viale Risorgimento 2, 40136 Bologna, Italy, email: [email protected] Abstract. Schema versioning is a powerful tool not only to ensure reuse of data and continued support of legacy applications after schema changes, but also to add a new degree of freedom to database designers, application developers and final users. In fact, different schema versions actually allow one to represent, in full relief, different points of view over the modelled application reality. The key to such an improvement is the adop- tion of a multi-pool implementation solution, rather that the single-pool solution usually endorsed by other authors. In this paper, we show some of the application potentialities of the multi-pool approach in schema versioning through a concrete example, introduce a simple but comprehensive logical storage model for the mapping of a multi-schema database onto a standard relational database and use such a model to define and exem- plify a multi-schema query language, called MSQL, which allows one to exploit the full potentialities of schema versioning under the multi-pool approach. 1 Introduction However careful and accurate the initial design may have been, a database schema is likely to undergo changes and revisions after implementation. In order to avoid the loss of data after schema changes, schema evolution has been introduced to provide (partial) automatic recov- ery of the extant data by adapting them to the new schema. However, if only the updated schema is retained, all the applications compiled with the past schema may cease to work. In order to let applications work on multiple schemata (at least to attain continued support of legacy applications), schema evolution is not sufficient and the maintenance of more than one schema is required. This leads to the notion of schema versioning and schema version which is the persistent outcome of the application of schema modifications [6, 5]. Several issues and design options concerning schema versioning support in a relational database have been studied in [2]. In particular, two implementation options, named single- pool and multi-pool solutions, were considered for the management of the extensional data. We briefly recall here the difference between these options: single-pool : all the schema versions are associated with a unique, shared, extensional repos- itory, so that the same objects cannot have different values for the same properties when “viewed” through different schema versions; multi-pool : each schema version is associated with a “private” extensional data pool; dif- ferent data pools may contain the same objects having (possibly) completely independent representations and evolutions. ? This work has been partially supported by the Italian MURST Project “D2I”. Although the multi-pool solution may look more flexible even at a first glance, a single-pool solution has usually been considered satisfactory for implementation, for example by means of the so-called completed schema technique [7, 2], as it limits the storage space overhead due to coexistence of multiple schemas. In our opinion, with a single-pool implementation solution, the schema versioning ap- proach gives up, from a conceptual point of view, most of its potential appeal and useful- ness. As a matter of fact, if the single-pool is implemented via the completed schema solution, the schema versioning support trivially reduces to a view mechanism, as schema versions can be defined as collections of simple projection-based views (the schema of a relation in each schema version is, by definition, a subset of the completed schema). Moreover, a database is an abstract representation (model) of a portion of real world of interest for some application (mini-world). Such a representation includes an intensional aspect –i.e. the schema– and an extensional aspect –i.e. the instance. A unique point of view on the mini-world is implied in this conceptualization. When we consider the useful coexistence of different schema versions for the same database, we actually introduce different viewpoints on the mini-world, which are, in some sense, all useful for the applications. To have different viewpoints means that the very same mini-world objects have different representations in the system: if this “multiple modeling” has an intensional aspect, it also has, in general, an extensional aspect: different viewpoints may render the same objects endowed with different structural properties but also with different values for the same properties. By the way, the notion of useful “versions” orig- inated in the field of OODBs used for design/engineering applications [3, 8], where the main focus was on different versions of the same extant objects. However, for the sake of generality, such a difference was allowed to span also structural properties, giving rise to the introduction of schema versions. When we proposed the multi-pool solution in [1, 2], we also aimed at giv- ing back the right perspective on versioning, stating that different representations at intensional level –i.e. schema versioning– may imply, in general, useful different representations also at extensional level. On the other hand, the main objection that can be moved against the multi-pool solution is the potentially huge storage space overhead due to the introduction of multiple representations of each object in the database: more or less, the full database extent needs to be replicated in every pool. First, our more direct answer to this objection relies on the progress of the sec- ondary memory technology, that has been making the multi-pool storage space requirements more and more inexpensive. For instance, with respect to the storage cost of a database when we introduced the multi-pool in [1], current technology could enable today the storage of the same database with more than 100 schema versions and data pools, at the same cost or less (as a 20GB disk costs today much less than a 200MB disk in 1995). This number may exceed far enough the number of schema versions probably required by any reasonable application. Second, not all the data actually need to be replicated in every data pool: for the unchanged at- tributes and attribute values of the objects which belong to several data pools, a non-redundant storage representation can be adopted, possibly in conjunction with lazy-style update tech- niques to delay space duplication till when it becomes strictly necessary. With such kinds of optimizations, storage space requirements may resemble those of the single-pool solution, and the overhead is introduced to strictly support the enhanced semantics of the multiple versions connected to different schema versions. Anyway, physical optimization is beyond the scope of this paper, and storage requirements will not be considered any longer here. We rather focus on logical aspects, to highlight the multi-pool solution potentialities and show its feasibility with our data model and query language definitions. To this purpose, we will introduce in Sec. 3 a data model, named logical storage model, which maps attributes and tables in a multi-pool schema versioning setting to traditional rela- tional tables, on the basis of which the semantics of a Multi-schema Query Language (MSQL) will be defined in Sec. 4. The specification of a query language supporting multi-schema queries, that is queries where data from different schema versions and data pools can be freely combined, is the novel contribution of this paper. In previous works on schema versioning, query functionalities were limited to accessing data stored according to a schema version through a different schema version, which gives only a small portion of the whole picture. Undoubtedly, one of the reasons why the multi-pool solution could not be appreciated so far was the lack of a multi-schema query language. In order to show how some of the function- alities of schema versioning and of the multi-pool approach could be valuable in a concrete environment, an initial example is also given in the Section that follows. 2 A Concrete Example Assume “ACME Motors” is a car manufacturer based in the United States and selling cars in several countries world-wide by means of a Web-based e-commerce platform. We consider the relational DB supporting the e-commerce activity, with an initial schema version USMKT designed for the American market. Let CAR(NAME,PRICE) be the initial schema of a relation storing the car catalogue in USMKT, which is populated with the following instance: NAME PRICE CAR Bomb 3.0 35K Lark 2.0 20K Lark 2.5 26K where prices are expressed in US dollars. Now we consider a different schema version EUMKT designed for the European market, where the consumers may be interested in knowing the anti air pollution compliance rule of each car (rules are defined as “Euro1”, “Euro2” and so on; the higher the number, the more restrictive the rule). To this purpose, a new column, say APC (for Air Pollution Compliance), is added to table CAR via the schema change: alter table CAR add column APC producing the relation schema CAR(NAME,PRICE,APC) in the new schema version EUMKT. The data pool associated with EUMKT is initially populated as the one associated with USMKT, with the new column padded by null values. However, “ACME Motors” has judged marketing the “Bomb 3.0” model in Europe non convenient and, thus, the corresponding tuple is removed from the data pool. Moreover, as far as car prices are concerned, they must be expressed in Eu- ros rather than in US Dollars as in USMKT. However, their new values are defined not only according to the currency exchange rates, but rather on the basis of a different marketing strat- egy, targeted for Europe: such a strategy takes into account possibly different production and logistic costs but also actual different market conditions (e.g.