Applying Model Management to Classical Meta Data Problems
Philip A. Bernstein
Microsoft Research
One Microsoft Way
Redmond, WA 98052-6399
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.
Proceedings of the 2003 CIDR Conference

Abstract
Model management is a new approach to meta data management that offers a higher level programming interface than current techniques. The main abstractions are models (e.g., schemas, interface definitions) and mappings between models. It treats these abstractions as bulk objects and offers such operators as Match, Merge, Diff, Compose, Apply, and ModelGen. This paper extends earlier treatments of these operators and applies them to three classical meta data management problems: schema integration, schema evolution, and round-trip engineering.

1 Introduction
Many information system problems involve the design, integration, and maintenance of complex application artifacts, such as application programs, databases, web sites, workflow scripts, formatted messages, and user interfaces. Engineers who perform this work use tools to manipulate formal descriptions, or models, of these artifacts, such as object diagrams, interface definitions, database schemas, web site layouts, control flow diagrams, XML schemas, and form definitions. This manipulation usually involves designing transformations between models, which in turn requires an explicit representation of mappings, which describe how two models are related to each other. Some examples are:
• mapping between class definitions and relational schemas to generate object wrappers,
• mapping between XML schemas to drive message translation,
• mapping between data sources and a mediated schema to drive heterogeneous data integration,
• mapping between a database schema and its next release to guide data migration or view evolution,
• mapping between an entity-relationship (ER) model and a SQL schema to navigate between a database design and its implementation,
• mapping source makefiles into target makefiles, to drive the transformation of make scripts from one programming environment to another, and
• mapping interfaces of real-time devices to the interfaces required by a system management environment to enable it to communicate with the device.
Following conventional usage, we classify these as meta data management applications, because they mostly involve manipulating descriptions of data, rather than the data itself.

Today’s approach to implementing such applications is to translate the given models into an object-oriented representation and manipulate the models and mappings in that representation. The manipulation includes designing mappings between the models, generating a model from another model along with a mapping between them, modifying a model or mapping, interpreting a mapping, and generating code from a mapping. Database query languages offer little help for this kind of manipulation. Therefore, most of it is programmed using object-at-a-time primitives.

We have proposed to avoid this object-at-a-time programming by treating models and mappings as abstractions that can be manipulated by model-at-a-time and mapping-at-a-time operators [6]. We believe that an implementation of these abstractions and operators, called a model management system, could offer an order-of-magnitude improvement in programmer productivity for meta data applications.

The approach is meant to be generic in the sense that a single implementation is applicable to all of the data models in the above examples. This is possible because the same modeling concepts are used in virtually all modeling environments, such as UML, extended ER (EER), and XML Schema. Thus, an implementation that uses a representation of models that includes most of those concepts would be applicable to all such environments.

There are many published approaches to the list of meta data problems above and others like them. We borrow from these approaches by abstracting their algorithms into a small set of operators and generalizing them across applications and, to some extent, across data models. We
thereby hope to offer a more powerful database platform for such applications than is available today.

In a model management system, models and mappings are syntactic structures. They are expressed in a type system, but do not have additional semantics based on a constraint language or query language. Despite this limited expressiveness, model management operators are powerful enough to avoid most object-at-a-time programming in meta data applications. And it is precisely this limited expressiveness that makes the semantics and implementation of the operators tractable.

Still, for a complete solution, meta data problems often require some semantic processing, typically the manipulation of formulas in a mathematical system, such as logic or state machines. To cope with this, model management offers an extension mechanism to exploit the power of an inferencing engine for any such mathematical system.

Before diving into details, we offer a short preview to see what model management consists of and how it can yield programmer productivity improvements. First, we summarize the main model management operators:
• Match – takes two models as input and returns a mapping between them
• Compose – takes a mapping between models A and B and a mapping between models B and C, and returns a mapping between A and C
• Diff – takes a model A and mapping between A and some model B, and returns the sub-model of A that does not participate in the mapping
• ModelGen – takes a model A, and returns a new model B based on A (typically in a different data model than A’s) and a mapping between A and B
• Merge – takes two models A and B and a mapping between them, and returns the union C of A and B along with mappings between C and A, and C and B.

Second, to see how the operators might be used, consider the following example [7]: Suppose we are given a mapping map1 from a data source S1 to a data warehouse SW, and want to map a second source S2 to SW, where S2 is similar to S1. See Figure 1. (We use S1, SW, and S2 to name both the schemas and databases.) First we call Match(S1, S2) to obtain a mapping map2 between S1 and S2, which shows where S2 is the same as S1. Second, we call Compose(map1, map2) to obtain a mapping map3 between S2 and SW, which maps to SW those objects of S2 that correspond to objects of S1. To map the other objects of S2 to SW, we call Diff(S2, map3) to find the sub-model S3 of S2 that is not mapped by map3 to SW, and map4 to identify corresponding objects of S2 and S3. We can then call other operators to generate a warehouse schema for S3 and merge it into SW. The latter details are omitted, but we will see similar operator sequences later in the paper.

Given: S1, S2, map1, SW
1. map2 = Match(S1, S2)
2. map3 = Compose(map1, map2)
3. <S3, map4> = Diff(S2, map3)
Figure 1 Using model management to help generate a data warehouse loading script

The main purpose of this paper is to define the semantics of the operators in enough detail to make the above sketchy example concrete, and to present additional examples to demonstrate that model management is a credible approach to solving problems of this type. Although this paper is not the first overview of model management, it is the most complete proposal to date. Past papers presented a short vision [5,6], an example of applying model management to a data warehouse loading scenario [7], an application of Merge to mediated schemas [22], and an initial mathematical semantics for model management [1]. We also studied the match operator [23], which has developed into a separate research area. This paper offers the following new contributions to the overall program:
• The first full description of all of the model management operators.
• New details about two of the operators, Diff and Compose, and a new proposed operator, ModelGen.
• Applications of model management to three well known meta data problems: schema integration, schema evolution, and round-trip engineering.
We regard the latter as particularly important, since they offer the first detailed demonstration that model management can help solve a wide range of meta data problems.

The paper is organized as follows: Section 2 describes the two main structures of model management, models and mappings. Section 3 describes the operators on models and mappings. Section 4 presents walkthroughs of solutions to schema integration, schema evolution, and round-trip engineering. Section 5 gives a few thoughts about implementing model management. Section 6 discusses related work. Section 7 is the conclusion.

2 Models and Mappings
2.1 Models
For the purposes of this paper, the exact choice of model representation is not important. However, there are several technical requirements on the representation of models, which the definitions of mappings and model management operators depend on.

First, a model must contain a set of objects, each of which has an identity. A model needs to be a set so that its content is well-defined (i.e., some objects are in the set while others are not). By requiring that objects have identity, we can define a mapping between models in terms of mappings between objects or combinations of objects.

Second, we want the expressiveness of the representation of models to be comparable to that of EER models. That is, objects can have attributes (i.e., properties), and
can be related by is-a (i.e., generalization) relationships, has-a (i.e., aggregation or part-of) relationships, and associations (i.e., relationships with no special semantics). As well, there may be some built-in types of constraints, such as the min and max cardinality of set-valued properties.

Third, since a model is an object structure, it needs to support the usual object-at-a-time operations to create or delete an object, read or write a property, and add or remove a relationship.

Fourth, we expect objects, properties and relationships to have types. Thus, there are (at least) three meta-levels in the picture. Using conventional meta data terminology, we have: instances, which are models; a meta-model that consists of the type definitions for the objects of models; and the meta-meta-model, which is the representation language in which models and meta-models are expressed. We avoid using the term “data model,” because it is [...]

Figure 2 An example of a mapping
[Diagram labels: Emp, Mapee, Employee; Emp#, Name; EmployeeID, FirstName, LastName; numbered mapping objects; "A morphism between Emp and Mapee".]

[...] represents it as a set of objects (each of which can relate objects in the two models). In our experience, this reification is often needed for satisfactory expressiveness. For example, if the mapping in Figure 2 were represented as a relationship, it would presumably include the pairs [...]
when one model is known to be an incremental modification of another model.

Complex Match is based on complex definitions of equality. Although it need not set the Expression property on mapping objects, it should at least distinguish sets of objects that are equal (=) from those that are only similar (≅). By similar, we mean that they are related but we do not express exactly how. For example, in Figure 3, object 1 says that Emp# and EmployeeID are equal, while object 2 says that Name is similar to a combination of FirstName and LastName. A human mapping designer might update object 2’s Expression property to say that Name equals the concatenation of FirstName and LastName.

Figure 3 A mapping output from Complex Match
[Diagram: Emp (Emp#, Name) and Employee (EmployeeID, FirstName, LastName) related through Mapee; mapping object 1, marked =, links Emp# and EmployeeID; mapping object 2, marked ≅, links Name with FirstName and LastName.]

In practice, Complex Match is not an algorithm that returns a mapping but rather is a design environment to help a human designer develop a mapping. It potentially benefits from using technology from a variety of fields: graph isomorphism to identify structural similarity in large models; natural language processing to identify similarity of names or to analyze text documentation of a model; domain-specific thesauri; and machine learning and data mining to use similarity of data instances to infer the equality of model objects. A recent survey of approaches to Complex Match is [23].

3.2 Diff
Intuitively, the difference between two models is the set of objects in one model that do not correspond to any object in the other model. One part of computing a difference is determining which objects do correspond. This is the main function of Match. Rather than repeating this semantics as part of the diff operator, we compute a difference relative to a given mapping, which may have been computed by an invocation of Match. Thus, given a mapping map1 between models M1 and M2, the operator Diff(M1, map1) returns the objects of M1 that are not [...]

This is inconvenient, because it makes it hard to align the result of Diff with M1 in subsequent operations. We will see examples of this in Section 4. Therefore, we alter the definition of Diff to require that the result includes the object of M1 referenced by map1’s root.

Second, recall that a model is the set of objects reachable by paths of has-a relationships from the root. Since the result of Diff may equal any subset of the objects of M1, some of those objects may not be connected to the Diff result’s root. If they are not, the result of Diff is not a model. For example, consider Diff(Employee, Mapee) on the models and mapping in Figure 4. Since FirstName and LastName are not referenced by Mapee’s morphism between Employee and Mapee, they are in the result. However, Name is not in the result, so FirstName and LastName are not connected to the root, Employee, of the result and therefore are not in that model. This is undesirable, since such objects cannot be subsequently processed by other operators, all of which expect a model as input.

Therefore, to ensure that the result of Diff is a well-formed model, for every object o in the result, we require the result to include all objects O on a path of has-a relationships from the M1 object referenced by map1’s root to o. Objects in O that are referenced in map1’s morphism to M1 are called support objects, because they are added only to support the structural integrity of the model. For example, in Figure 5, Name is a support object in the result of Diff(Employee, Mapee).

Figure 4 Diff(Employee, Mapee) includes FirstName and LastName but not Name
[Diagram: Emp (Emp#, Name) and Employee (EmployeeID, Name, FirstName, LastName) related through Mapee; mapping object 1 links Emp# and EmployeeID, and mapping object 2 links Name and Name.]

Having made this decision, we now have a third problem, namely, in the model that is returned by Diff, how to distinguish support objects from objects that are meant to be in the result of Diff (i.e., that do not participate in map1)? We could simply mark support objects in the result. But this introduces another structure, namely a marked model. To avoid this complication, we use our two existing structures to represent the result, namely, model and mapping. That is, the result of Diff is a pair [...] otherwise required in M1′; every has-a relationship between two objects of M1 that are also in M1′; and every association between two objects in S or between an object in S and an object outside of M1.
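The Diff semantics described in this section can be illustrated with a minimal Python sketch. It rests on several assumptions that are not part of the paper: the model is a tree in which each object has a single has-a parent, the mapping is reduced to the set of M1 objects that its morphism references, and the second return value (the set of non-support objects) is a stand-in for the mapping that Diff returns alongside the model. The function name diff and the dictionary encoding are hypothetical.

```python
from typing import Dict, Optional, Set, Tuple


def diff(parent: Dict[str, Optional[str]], root: str,
         referenced: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Sketch of Diff(M1, map1) on a tree-shaped model (an assumption).

    parent     -- has-a parent of every object in M1 (None for the root)
    root       -- the M1 object referenced by map1's root
    referenced -- the M1 objects referenced by map1's morphism
    Returns (objects of the result model, the non-support objects in it).
    """
    # Objects that do not participate in the mapping.
    unreferenced = {o for o in parent if o not in referenced}

    # The result always includes the root object.
    result = {root} | unreferenced

    # Add support objects: every ancestor on the has-a path up to the root.
    for o in unreferenced:
        p = parent[o]
        while p is not None:
            result.add(p)
            p = parent[p]
    return result, unreferenced


# The Employee model of Figure 4: FirstName and LastName sit under Name.
employee = {
    "Employee": None,
    "EmployeeID": "Employee",
    "Name": "Employee",
    "FirstName": "Name",
    "LastName": "Name",
}
model, non_support = diff(employee, "Employee",
                          referenced={"Employee", "EmployeeID", "Name"})
support = model - non_support - {"Employee"}
```

In this run, Name ends up in the result only as a support object, which matches the role the text assigns to it, while EmployeeID is left out because it participates in the mapping and lies on no path to an unreferenced object.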
Figure 5 The result of Diff(Employee, Mapee) is [...]
[Diagram labels: Employee (EmployeeID, Name, FirstName, LastName); Mapee′ (objects 1 and 2); Employee′ (Name, FirstName, LastName).]

Figure 6 The result of Merge applied to Figure 2
[Diagram labels: Emp′ (Emp#, Name, FirstName, LastName).]

The effect of collapsing objects into a single object can cause the output of Merge to violate basic constraints that [...]
Figure 7 The merge result, Emp′′′, of Figure 2 with its mappings to the input models Emp and Employee
[Diagram labels: Emp (Emp#, Name); MapEmp-Emp′; Emp′ (Emp#, Name, FirstName, LastName); MapEmp′-Employee; Employee (EmployeeID, FirstName, LastName).]
To compute the composition, for each object m2 in map2, we identify each object m1 in map1 where range(m1) ∩ domain(m2) ≠ ∅, which means that range(m1) can supply at least one object to domain(m2). For example, in Figure 8, the ranges of 4, 5, and 6 in map1 can each supply one object to domain(11) in map2. Suppose objects m11, …, m1n in map1 together supply all of domain(m2), and each m1i (1≤i≤n) supplies at least one object to domain(m2). That is, ∪_{1≤i≤n} range(m1i) ⊇ domain(m2) and (range(m1i) ∩ domain(m2)) ≠ ∅ for 1≤i≤n. Then m2 should generate an output object m3 in map3 such that range(m3) = range(m2) and domain(m3) = ∪_{1≤i≤n} domain(m1i).

For example, in Figure 8, range(4) and range(5) can supply all of domain(11). That is, range(4) ∪ range(5) = {7, 8, 9} ⊇ domain(11) = {7, 9}. Then object 11 should generate an output object m3 in map3 (not shown in the figure), such that range(m3) = range(m2) = {13} and domain(m3) = domain(4) ∪ domain(5) = {1, 2}.

Figure 8 Mappings map1 and map2 can be composed
[Diagram: M1 contains objects 1, 2, 3; map1 contains 4, 5, 6; M2 contains 7, 8, 9; map2 contains 10, 11; M3 contains 12, 13.]

There is a problem, though: for a given m2 in map2, there may be more than one set of objects m11, …, m1n in map1 that can supply all of domain(m2). For example, in Figure 8, {4, 5} and {4, 6} can each supply all of domain(11). When defining composition, which set do we choose? In this paper, rather than choosing among them, we use all of them. That is, we compose each m2 in map2 with the union of all objects m1 in map1 where range(m1) ∩ domain(m2) ≠ ∅ ({4, 5, 6} in the example). This semantics supports all of the application scenarios in Section 4.

Given this decision, we define the right composition map3 of map1 and map2 constructively as follows:
1. (Copy) Create a copy map3 of map2. Note that map3 has the same morphisms to M2 and M3 as map2 and, therefore, the same domains and ranges.
2. (Precompute Input) For each object m3 in map3, let Input(m3) be the set of all objects m1 in map1 such that range(m1) ∩ domain(m3) ≠ ∅.
3. (Define domains) For each m3 in map3,
a. if ∪_{m1i ∈ Input(m3)} range(m1i) ⊇ domain(m3), then set domain(m3) = ∪_{m1i ∈ Input(m3)} domain(m1i).
b. else if m3 is not needed as a support object (because none of its descendants satisfies (3a)), then delete it, else set domain(m3) = range(m3) = ∅.

Step 3 defines the domain of each object m3 in map3. Input(m3) is the set of all objects in map1 whose range intersects the domain of m3. If the union of the ranges of Input(m3) contains the domain of m3, then the union of the domains of Input(m3) becomes the domain of m3. Otherwise, m3 is not in the composition, so it is either deleted (if it is not a support object, required to maintain the well-formed-ness of map3), or its domain and range are cleared (since it does not compose with objects in map1).

Sometimes it is useful to keep every object of map2 in map3 even though its Input set does not cover its domain. This is called a right outer composition, because all objects of the right operand, map2, are retained. Its semantics is the same as right composition, except that step 3b is replaced by “else set domain(m3) = ∅.”

A definition of composition that allows a more flexible choice of inputs to m2 is in [7]. It is more complex than the one above and is not required for the examples in Section 4, so we omit it here.

3.5 Apply
The operator Apply takes a model and an arbitrary function f as inputs and applies f to every object of the model. In many cases, f modifies the model, for example, by modifying certain properties and relationships of each object. The purpose of Apply is to reduce the need for application programs to do object-at-a-time navigation over a model. There can be variations of the operator for different traversal strategies, such as pre-order or post-order over has-a relationships with the proviso that it does not visit any object twice (in the event of cycles).

3.6 Copy
The operator Copy takes a model as input and returns a copy of that model. The returned model includes all of the relationships of the input model, including those that connect its objects to objects outside the model.

One variation of Copy is of special interest to us, namely DeepCopy. It takes a model and mapping as input, where the mapping is incident to the model. It returns a copy of both the model and mapping as output. In essence, DeepCopy treats the input model and mapping as a single model, creating a copy of both of them together. To see the need for DeepCopy, consider how complicated it would be to get its effect without it, by copying the model and mapping independently. Several other variations of Copy are discussed in [6].

3.7 ModelGen
Applications of model management usually involve the generation of a model in one meta-model from a model in another meta-model. Examples are the generation of a SQL schema from an ER diagram, interface definitions from a UML model, or HTML links from a web site map.

A model generator is usually meta-model specific. For example, the behavior of an ER-to-SQL generator very much depends on the source and target being ER and SQL models respectively. Therefore, one would not
expect model generation to be a generic, i.e., meta-model-independent, operator.

Still, there is some common structure across all model generators worth abstracting. One is that the generation step should produce not only the output model but also a mapping from the input model to the output model. This allows later operators to propagate changes from one model to the other. For example, if an application developer modifies a SQL schema, it helps to know how the modified objects relate to the ER model, so the ER model can be made consistent with the revised SQL schema. This scenario is developed in some detail in Section 4.3.

A second common structure is that most model generators simply traverse the input model in a predetermined order, much like Apply, and generate output model objects based on the type of input object being visited. For example, a SQL generator might generate a table definition for each entity type, a column definition for each attribute type, a foreign key for each 1:n relationship type, and so on. In effect, the generator is a case-statement, where the case-statement variable is the type of the object being visited. If the case-statement is encapsulated as a function, it can be executed using the operator Apply.

Since the case-statement is driven by object types, one can go a step further in automating model generation by tagging each meta-model object (which is a type definition) by the desired generation behavior for model objects of that type, as proposed in [10]. Using it, model generation could be encapsulated as a model management operator, which we call ModelGen.

3.8 Enumerate
Although our goal is to capture as much model manipulation as possible in model-at-a-time operators, there will be times when iterative object-at-a-time code is needed. To simplify application programming in this case, we offer an operator called Enumerate, which takes a model as input and returns a “cursor” as output. The operator Next, when applied to a cursor, returns an object in the model that was the input to Enumerate, or null when it hits the end of the cursor. Like Apply, Enumerate may offer variations for different traversal orderings.

3.9 Other Data Manipulation Operators
Since models are object structures, they can be manipulated by the usual object-at-a-time operators: read an attribute, traverse a relationship, create an object, update an attribute, add or remove a relationship, etc. In addition, there are two other bulk database operators of interest:
• Select – Return the subset of a model that satisfies a qualification formula. The returned subset includes additional support objects, as in Diff. Like Diff, it also returns a mapping between the returned model and the input model, to identify the non-support objects.
• Delete – This deletes all of the objects in a given model, except for those that are reachable by paths of has-a relationships from other models.

3.10 Semantics
The model management operators defined in Section 3 are purely syntactic. That is, they treat models and mappings as graph structures, not as schemas that are templates for instances. The syntactic orientation is what enables model and mapping manipulation operators to be relatively generic. Still, in most applications, to be useful, models and mappings must ultimately be regarded as templates for instances. That is, they must have semantics. Thus, there is a semantic gap between model management and applications that needs to be filled.

The gap can be partially filled by making the meta-meta-model described in Section 2.1 more expressive and extending the behavior of the operators to exploit that extra expressiveness. So, rather than knowing only about has-a and association relationships, the meta-meta-model should be extended to include is-a, data types, keys, etc.

Another way to introduce semantics is to use the Expression property in each mapping object m. Recall that such an expression’s variables are the objects referenced by m in the two models being related. To exploit these expressions, the model management operators that generate mappings should be extended to produce expressions for any mapping objects they generate. For example, when Compose combines several objects from the two input mappings into an output mapping object m, it would also generate an expression for m based on the expressions on the input mapping objects. Similarly, for Diff and Merge.

The expression language is meta-model-specific, e.g., for the relational data model, it could be conjunctive queries. Therefore, the extensions to model management operators that deal with expressions must be meta-model-specific too and should be performed by a meta-model-specific expression manipulation engine. For example, the expression language extension for Compose would call this engine to generate an expression for each output mapping object it creates [16]. Some example walkthroughs of these extensions for SQL queries are given in [7]. However, a general-purpose interface between model management operators and expression manipulation engines has not yet been worked out.

Another approach to adding semantics to mappings is to develop a design tool for the purpose, such as Clio [17,27].

4 Application Scenarios
In this section, we discuss three common meta data management problems that involve the manipulation of models and mappings: schema integration, schema evolution, and round-trip engineering. We describe each problem in terms of models and mappings and show how to use model management operators to solve it.

4.1 Schema Integration
The problem is to create: a schema S3 that represents all of the information expressed in two given database
schemas, S1 and S2; and mappings between S1 and S3 and between S2 and S3 (see Figure 9). The schema integration literature offers many algorithms for doing this [1,8,23]. They all consist of three main activities: identifying overlapping information in S1 and S2; using the identified overlaps to guide a merge of S1 and S2; and resolving conflict situations (i.e., where the same information was represented differently in S1 and S2) during or after the merge. The main differentiator between these algorithms is in the conflict resolution approaches.

Figure 9 The schema integration problem
[Diagram: S3 at the top, connected by map13 to S1 and by map23 to S2.]

If each schema is regarded as a model, then we can express the first two activities using model management operators as follows:
1. map12 = Match(S1, S2). This step identifies the equal and similar objects in S1 and S2. Since Match is creating a mapping between two independently developed schemas, this is best done with a Complex Match operator (rather than Elementary Match).
2. [...]

[...] similar to FirstName and LastName, these objects are partially integrated in S12 under an object labeled ≅, which is a placeholder for an expression that relates Name to FirstName and LastName.

Figure 11 The result of merging Emp and Employee based on Mapee in Figure 10
[Diagram labels: Emp′, ≅, Emp#, Name, Address, FirstName, Phone, LastName.]

The sub-structure rooted by “≅” represents a conflict between the two input schemas. A schema integration algorithm needs rules to cope with such conflicts. In this case it could consult a knowledge base that explains that first name concatenated with last name is a name. It could use this knowledge to replace the sub-structure rooted by ≅ either by FirstName and LastName, since together they subsume Name, or by a nested structure Name with sub-objects FirstName and LastName. The latter is probably [...]
For example, if S1 and V1 are relational schemas, then we would expect each object m of map1 to contain a relational view definition that tells how to derive a view relation in V1 from some of the relations in S1; the morphisms of m would refer to the objects of S1 and V1 that are mentioned in m’s view definition. Then, given a new version S2 of S1, the problem is to define a new version V2 of V1 that is consistent with S2 and a mapping map2 from S2 to V2.

Figure 12 The schema evolution problem
[Diagram: map1 connects S1 and V1; map2 connects S2 and V2.]

We can solve this problem using model management operators as follows (Figure 13):
1. map3 = Match(S1, S2). This returns a mapping between S1 and S2 that identifies what is unchanged in S2 relative to S1. If we know that S2 is an incremental modification of S1, then this can be done by Elementary Match. If not, then Complex Match is required.
2. map4 = map1 • map3. This is a right composition. Intuitively, each mapping object in map4 describes a part of map1 that is unaffected by the change from S1 to S2. A mapping object m in map1 survives the composition (i.e., becomes an object of map4) if every object in S1 that is connected to m is also connected to some object of S2 via map3. If so, then m is transformed into m′ in map4 by replacing each reference from m to an object of S1 by a reference to the corresponding objects in S2.

Figure 13 Result of schema evolution solution
[Diagram labels: S1, S2, V1, V2, V2′, map1, map2, map3, map4, map5.]

Some objects of V1 may now be “orphans” in the sense that they are not incident to map4. An orphan arises because it maps via map1 to an object in S1 that has no corresponding object in S2 via map3. One way to deal with orphans is to eliminate them. Since doing this would corrupt map1, we first make a copy of V1 and then delete the orphans from the copy:
3. [...]
5. For each e in Enumerate(map5), delete domain(e) from V2. This enumerates the orphans and deletes them. Notice that we are treating map5 as a model.

At this point we have successfully completed the task. An alternative to steps 4 and 5 is to be more selective in deleting view objects, based on knowledge about the syntax and semantics of the mapping expressions. For example, suppose the schemas and views are in the relational data model and S2 is missing an attribute that is used to populate an attribute of a view in V2. In the previous approach, if each view is defined by one object in map1, then the entire view would be an orphan and deleted. Instead, we could drop the attribute from the view without dropping the entire view relation that contains it. To get this effect, we could replace Step 2 above by a right outer composition, so that all objects of map1 are copied to map4, even if they connect to S1 objects that have no counterpart in S2. Then we can write a function f that encapsulates the semantic knowledge necessary to strip out parts of a view definition and replace steps 4 and 5 by Apply(f, map2). Thus, f gives us a way of exploiting non-generic model semantics while still working within the framework of the model management algebra.

4.3 Round-Trip Engineering
Consider a design tool that generates a compiled version of a high-level specification, such as an ER modeling tool that generates SQL DDL or a UML modeling tool that generates C++ interfaces. After a developer modifies the generated version of such a specification (e.g., SQL DDL), the modified generated version is no longer consistent with its specification. Repairing the specification is called round-trip engineering, because the tool forward-engineers the specification into a generated version after which the modified generated version is reverse-engineered back to a specification.

Stating this scenario more precisely, we are given a specification S1, a generated model G1 that was derived from S1, a mapping map1 from S1 to G1, and a modified version G2 of G1. The problem is to produce a revised specification S2 that is consistent with G2 and a mapping map2 between S2 and G2. See Figure 14. Notice that diagrammatically, this is isomorphic to the schema evolution problem; it is exactly like Figure 12, with S1 and S2 replacing V1 and V2, and G1 and G2 replacing S1 and S2.

Figure 14 The round-trip engineering problem
[Diagram: map1 connects S1 and G1; map2 connects S2 and G2. Legend: S1 – original spec; G1 – generated schema; G2 – modified generated schema; S2 – modified spec for G2.]

As in schema evolution, we start by matching G1 and [...] identifies what is unchanged in G2 relative to G1. Since G2 is an incremental modification of G1, Elementary Match should suffice. See Figure 15a.
2. map4 = map1 • map3. Mapping map4, between S1 and G2, includes a copy of each object in map1 all of whose incident G1 objects are still present in G2.

3. <S3, map5> = DeepCopy(S1, map4). This makes a copy S3 of S1 along with a copy map5 of map4.

Steps 2 and 3 eliminate from the specification S3 all objects that do not correspond to generated objects in G2. One could retain these objects by replacing the composition in step 2 by outer composition. The remaining steps in this section would then proceed without modification.

Next, we need to reverse engineer the new objects that were introduced in G2 and merge them with S3. Here is one way to do it (see Figure 15a):

4. [...]

Figure 15 Result of round-trip engineering solution ((b) After Step 8)

For example, suppose G2 and G2′ are SQL schemas, and G2′ introduced a new column C into table T. In the model management representation G2 of the schema, C is an object that is a child of object T. Since C is new, it is not connected via map3 to G1, so it is in the result of Diff. However, to keep G2′ connected, since C is a child of T, T is also in the result of Diff as a support object, though it is not connected to G2 via map6.

5. [...] column into an ER attribute, each table into either an entity type or relationship type (depending on the key structure of the table), etc.

We need to merge S3 and S3′ into a single model S2, which is half of the desired result. (The other half is map2, coming soon.) To do this, we need to create a mapping between S3 and S3′ that connects objects of S3 and S3′ that represent the same thing. Continuing the example after step 4 above, where G2′ introduces a new column C into table T, the desired mapping should connect the reverse engineered object for T in S3′ (e.g., an entity type) with the original object for T in S3 (e.g., the entity type that was used to generate T in G2 in the first place). By [...]

[...] came from the new objects introduced in G2) with S3, producing the desired model S2 (cf. Figure 14).

Finally, we need to produce the desired mapping map2 between G2 and S2. This is the union (i.e., merge) of map11 • map5 and map11′ • map9. To see why this is what we want, recall that G2′ contains the objects of G2 that do not map to S3 via map5. Mapping map7 connects those objects to S3′, as does map9, except on the original objects in G2 rather than on the copies in G2′. Hence, every object in G2 connects to a mapping object in either map5 or map9. So to start, we need to compute these compositions:

10. map2′ = map11 • map5

11. map2″ = map11′ • map9
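To make the composition operator used in steps 2, 10, and 11 more concrete, here is a minimal Python sketch. It rests on simplifying assumptions of our own that are not part of the model management proposal: a mapping object is reduced to the two sets of model objects it connects, and a match result is a plain dictionary from old objects to their counterparts. The names MappingObject, compose, anchor, evolved, and outer are hypothetical, as are the example schema objects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MappingObject:
    """A mapping object, reduced to the objects it connects in two models."""
    anchor: frozenset   # objects in the model that stays fixed (e.g., S1)
    evolved: frozenset  # objects in the model that was modified (e.g., G1)


def compose(mapping, match, outer=False):
    """Hypothetical helper: re-target `mapping` through a match result.

    `match` maps each surviving object of the old model to its counterpart
    in the new model. A mapping object survives only if all of its
    `evolved` objects have a counterpart. With outer=True the remaining
    mapping objects are kept too, minus their dangling references.
    """
    result = []
    for m in mapping:
        complete = all(o in match for o in m.evolved)
        if complete or outer:
            retargeted = frozenset(match[o] for o in m.evolved if o in match)
            result.append(MappingObject(m.anchor, retargeted))
    return result


# map1 connects specification objects (S1) to generated objects (G1).
map1 = [
    MappingObject(frozenset({"Customer"}), frozenset({"CUSTOMER"})),
    MappingObject(frozenset({"Customer.fax"}), frozenset({"CUSTOMER.FAX"})),
]
# map3: the developer dropped column FAX in G2, so only the table matches.
map3 = {"CUSTOMER": "CUSTOMER"}

map4 = compose(map1, map3)                    # composition as in step 2
map4_outer = compose(map1, map3, outer=True)  # outer composition variant

print(len(map4), len(map4_outer))
```

In this sketch the mapping object for Customer.fax is dropped by the plain composition and retained, with no remaining connection to G2, by the outer variant. That mirrors the choice described after step 3 between eliminating specification objects that have no generated counterpart and retaining them.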
[...] to G2 via the union of the compositions, which is probably not what is desired. Getting rid of the duplicates is a bit of effort. One way is to merge the mappings. To do this, we need to match map2′ and map2″ from steps 10 and 11 to find the duplicates (which we can do because mappings are models), and then merge the mappings based on the match result. Here are the steps (not shown in Figure 15):

12. map12 = Match(map2′, map2″). Objects m2′ in map2′ and m2″ in map2″ match if they connect to exactly the same objects of G2 and S2. To use this matching condition, one needs to regard the morphisms of map2′ and map2″ as parts of each map’s model; e.g., the morphisms could be available as relationships on each map’s model. Using this simple match criterion, Elementary Match suffices.

13. map2 = Merge(map2′, map2″, map12). The morphisms of map2′ and map2″ should be merged like ordinary relationships. That is, if map12 connects m2′ in map2′ and m2″ in map2″, then Merge collapses m2′ and m2″ into a single object m2. Object m2 should have only one copy of the mapping connections that m2′ and m2″ had to G2 and S2.

We now have map2 and S2, so we’re done! Cf. Figure 14.

5 Implementation

We envision an implementation of models, mappings, and model management operators on a persistent object-oriented system. Given technology trends, an object-relational system is likely to be the best choice, but an XML database system might also be suitable. The system consists of four layers:

Models and mappings – This layer supports the model and mapping abstractions, each implemented as an object-oriented structure, both on disk and heavily cached for fast navigation. The representation of models should be extensible, so that the system can be specialized to more expressive meta-meta-models. And it should be semi-structured, so that models can be imported from more expressive representations without loss of information. This layer supports:
• Models – We need the usual object-at-a-time operations on objects in models, plus GetSubmodels (of a given model) and DeleteSubmodel, where a submodel is a model rooted by an object in another model. Also Copy (deep and shallow) is supported here.
• Mappings – CreateMapping returns a model and two morphisms. GetSource and GetTarget return the morphisms of a given mapping.
• Morphisms – These are accessible and updatable like normal relationships.

Algebraic operators – This layer implements Match, Merge, Diff, Compose, Apply, ModelGen, and Enumerate. It should have an extension mechanism for handling semantics, such as an expression manipulation engine as discussed in Section 3.10.

Model-driven generator of user interface – Much like an advanced drawing tool, one can tag meta-model objects with descriptions of objects and their behavior (e.g., a table definition is a blue rectangle and a column definition is a line within its table’s rectangle).

Generic tools over models and mappings – browser, editor, catalog, import/export, scripting.

6 Related Work

Although the model management approach is new, much of the existing literature on meta data management offers either algorithms that can be generalized for use in model management or examples that can be studied as challenges for the model management operators. This literature is too large to cite here, but we can highlight a few areas where there is obvious synergy worth exploring. Some of them were mentioned earlier: schema matching (see the survey in [23]); schema integration [1,8,15,25], which is both an example and a source of algorithms for Match and Merge; and adding semantics to mappings [7,17,21,27]. Others include:
• Data translation [24];
• Differencing [11,19,26]; and
• EER-style representations and their expressive power, which may help select the best representation for models and mappings [2,14,15,18,20].

7 Conclusion

In this paper, we described model management, a new approach to manipulating models (e.g., schemas) and mappings as bulk objects using operators such as Match, Merge, Diff, Compose, Apply, Copy, Enumerate, and ModelGen. We showed how to apply these operators to three classical meta data management problems: schema integration, schema evolution, and round-trip engineering. We believe these example solutions strongly suggest that an implementation of model management would provide major programming productivity gains for a wide variety of meta data management problems. Of course, to make this claim compelling, an implementation is needed. If successful, such an implementation could be the prototype for a new category of database system products.

In addition to implementation, there are many other areas where work is needed to fully realize the potential of this approach. Some of the more pressing ones are:
• Choosing a representation that captures most of the constructs of models and mappings of interest, yet is tractable for model management operators.
• More detailed semantics of model management operators. There is substantial work on Match. Merge, Compose, and ModelGen are less well developed.
• A mathematical semantics of model management. The beginnings of a category-theoretic approach appear in [1], but there is much left to do. A less abstract analysis that can speak to the completeness of the set
of operators would help define the boundary of useful model management computations.
• Mechanisms are needed to fill the gap between models and mappings, which are syntactic structures, and their semantics, which treat models as templates for instances and mappings as transformations of instances. Various theories of conjunctive queries are likely to be helpful.
• Trying to apply model management to especially challenging meta data management problems, to identify limits to the approach and opportunities to extend it.

This is a broad agenda that will take many years and many research groups to develop. Although it will be a lot of work, we believe the potential benefits of the approach make the agenda well worth pursuing.

Acknowledgments

The ideas in this paper have benefited greatly from my ongoing collaborations with Suad Alagic, Alon Halevy, Renée Miller, Rachel Pottinger, and Erhard Rahm. I also thank the many people whose discussions have stimulated me to extend and sharpen these ideas, especially Kajal Claypool, Jayant Madhavan, Sergey Melnik, Peter Mork, John Mylopoulos, Arnie Rosenthal, Elke Rundensteiner, Aamod Sane, and Val Tannen.

8 References

1. Alagic, S. and P.A. Bernstein: A Model Theory for Generic Schema Management. Proc. DBPL 2001, Springer Verlag LNCS.
2. Atzeni, Paolo and Riccardo Torlone: Management of Multiple Models in an Extensible Database Design Tool. EDBT 1996: 79-95.
3. Banerjee, Jay, Won Kim, Hyoung-Joo Kim, Henry F. Korth: Semantics and Implementation of Schema Evolution in Object-Oriented Databases. SIGMOD Conference 1987: 311-322.
4. Beeri, C. and T. Milo: Schemas for integration and translation of structured and semi-structured data. ICDT 1999: 296-313.
5. Bernstein, P.A.: Generic Model Management - A Database Infrastructure for Schema Manipulation. Springer Verlag LNCS 2172, CoopIS 2001: 1-6.
6. Bernstein, Philip A., Alon Y. Halevy, and Rachel A. Pottinger: A vision of management of complex models. SIGMOD Record 29(4): 55-63 (2000).
7. Bernstein, Philip A., Erhard Rahm: Data Warehouse Scenarios for Model Management. ER 2000: 1-15.
8. Biskup, J. and B. Convent: A formal view integration method. SIGMOD 1986: 398-407.
9. Buneman, P., S.B. Davidson, A. Kosky: Theoretical aspects of schema merging. EDBT 1992: 152-167.
10. Cattell, R.G.G., D. K. Barry, M. Berler, J. Eastman, D. Jordan, D. Russell, O. Schadow, T. Stanienda, and F. Velez, editors: The Object Database Standard: ODMG 3.0. Morgan Kaufmann Publishers, 2000.
11. Chawathe, Sudarshan S. and Hector Garcia-Molina: Meaningful Change Detection in Structured Data. SIGMOD 1997: 26-37.
12. Claypool, K. T., J. Jin, E. A. Rundensteiner: SERF: Schema Evolution through an Extensible, Re-usable and Flexible Framework. CIKM 1998: 314-321.
13. Claypool, K.T., E.A. Rundensteiner, X. Zhang, H. Su, H.A. Kuno, W-C Lee, G. Mitchell: Gangam - A Solution to Support Multiple Data Models, their Mappings and Maintenance. SIGMOD 2001.
14. Hull, Richard and Roger King: Semantic Database Modeling: Survey, Applications, and Research Issues. ACM Computing Surveys 19(3): 201-260 (1987).
15. Larson, James A., Shamkant B. Navathe, and Ramez Elmasri: A theory of attribute equivalence in databases with application to schema integration. Trans. on Soft. Eng. 15(4): 449-463 (April 1989).
16. Madhavan, J., P. A. Bernstein, P. Domingos, A.Y. Halevy: Representing and Reasoning About Mappings between Domain Models. 18th National Conference on Artificial Intelligence (AAAI 2002).
17. Miller, R.J., L. M. Haas, M. A. Hernández: Schema Mapping as Query Discovery. VLDB 2000: 77-88.
18. Miller, R. J., Y. E. Ioannidis, Raghu Ramakrishnan: Schema equivalence in heterogeneous systems: bridging theory and practice. Information Systems 19(1): 3-31 (1994).
19. Myers, E.: An O(ND) Difference Algorithm and its Variations. Algorithmica 1(2): 251-266 (1986).
20. Mylopoulos, John, Alexander Borgida, Matthias Jarke, Manolis Koubarakis: Telos: Representing Knowledge About Information Systems. TOIS 8(4): 325-362 (1990).
21. Popa, Lucian, Val Tannen: An Equational Chase for Path-Conjunctive Queries, Constraints, and Views. ICDT 1999: 39-57.
22. Pottinger, Rachel A. and Philip A. Bernstein: Creating a Mediated Schema Based on Initial Correspondences. IEEE Data Engineering Bulletin, Sept. 2002.
23. Rahm, Erhard and Philip A. Bernstein: A survey of approaches to automatic schema matching. VLDB J. 10(4): 334-350 (2001).
24. Shu, Nan C., Barron C. Housel, R. W. Taylor, Sakti P. Ghosh, Vincent Y. Lum: EXPRESS: A Data EXtraction, Processing, and REStructuring System. TODS 2(2): 134-174 (1977).
25. Spaccapietra, Stefano and Christine Parent: View integration: A step forward in solving structural conflicts. TKDE 6(2): 258-274 (April 1994).
26. Wang, J. T-L., D. Shasha, G. J-S. Chang, L. Relihan, K. Zhang, G. Patel: Structural Matching and Discovery in Document Databases. SIGMOD 1997: 560-563.
27. Yan, Ling-Ling, Renée J. Miller, Laura M. Haas, Ronald Fagin: Data-Driven Understanding and Refinement of Schema Mappings. SIGMOD 2001.