Industrial-Strength Schema Matching

Industrial-Strength Schema Matching Philip A. Bernstein, Sergey Melnik, Michalis Petropoulos, Christoph Quix Microsoft Research Redmond, WA, U.S.A. {philbe, melnik}@microsoft.com, [email protected], [email protected] Abstract LastName in one schema are related to Name in the other, but not say that concatenating the former yields the latter. Schema matching identifies elements of two Query discovery picks up where schema matching leaves given schemas that correspond to each other. off. Given the correspondences, it obtains queries to trans- Although there are many algorithms for schema late instances of the source schema into instances of the matching, little has been written about building a target, e.g., using query analysis and data mining [5]. system that can be used in practice. We describe This paper focuses on schema matching. There are our initial experience building such a system, a many algorithms to solve it [8]. They exploit name customizable schema matcher called Protoplasm. similarity, thesauri, schema structure, instances, value distribution of instances, past mappings, constraints, 1 Schema Matching cluster analysis of a schema corpus, and similarity to Most systems-integration work requires creating standard schemas. All of these algorithms have merit. So mappings between models, such as database schemas, what we need is a toolset that incorporates them in an message formats, interface definitions, and user-interface integrated package. This is the subject of this paper. forms. As a design activity, schema mapping is similar to The published work on schema matching is mostly database design in that it requires digging deeply into the about algorithms, not systems. This algorithm work is semantics of schemas. This is usually quite time helpful, offering new ways to produce mappings that pre- consuming, not only in level-of-effort but also in elapsed- vious algorithms were unable to find. However, all of the time, because the process of teasing out semantics is slow. published algorithms are fragile: they often need manual It can benefit from the development of improved tools, tuning, such as setting thresholds, providing a thesaurus, but it is unlikely such tools will provide a silver bullet that or being trained on examples; even after tuning, it is easy automates all of the work. For example, it is hard to to find schemas that the algorithms do not map correctly; imagine how the best tool could eliminate the need for a and many of them do not scale to large schemas. designer to read documentation, ask developers and end- COMA is the first work to address engineering issues users how they use the data, or review test runs to check of a schema matching system [1]. Its architecture offers for unexpected results. In a sense, the problem is AI-com- multiple schema-level matchers and a fixed process to plete, that is, as hard as reproducing human intelligence. combine their results. Matchers exploit linguistic, data- The best commercially-available model mapping tools type, and structural information, plus previous matches, to we know of are basically graphical programming tools. produce similarity matrices forming a cube. The cube is That is, they allow one to specify a schema mapping as a aggregated to a matrix. Then particular similarity values directed graph where nodes are simple data transforma- are selected as good match candidates, which are com- tions and edges are data flows. Such tools help specify: a bined to a single value. This process is executed for whole mapping between two messages, a data warehouse schemas or for two schema elements, and is repeated after loading script, or a database query. While such graphical the user provides feedback. Experimentation showed that programming is an improvement over typing code, no finding good combinations of matchers depends on char- database design intelligence is being offered. Despite the acteristics of the schemas being matched, demonstrating limited expectations expressed in the previous paragraph, that customizability of the combination process is crucial. we should certainly be able to offer some intelligence ⎯ Inspired by COMA’s approach, we felt it was impor- automated help, in addition to attractive graphics. tant to push further toward building an industrial-strength There are two steps to automating the creation of schema matcher, one that avoids fragility problems, is mappings between schemas: schema matching and query customizable for use in practical applications, and extends discovery. Schema matching identifies elements that the range of matching solutions being offered. correspond to each other, but does not explain how they We believe fragility is inherent in the problem. To correspond. For example, it might say that FirstName and mitigate it, we need a system that can exploit the best SIGMOD Record, Vol. 33, No. 4, December 2004 38 algorithms and is customizable by a human designer. De- implement match algorithms consisting of interconnected signers use all of the techniques found in specific schema operators. Each script combines a set of operator matching algorithms: name similarity, thesauri, common implementations by passing SMM graphs and similarity schema structure, overlapping instances, common value matrices from one to another. Ideally, strategies would be distribution, re-use of past mappings, constraints, similar- designed using a Graphical Interface that generates the ity to standard schemas, and common-sense reasoning. A strategy script from its graphical representation. system should use all of these techniques too. Customizability is needed for several reasons. First, a Graphical Strategy user can improve a schema matching tool by selecting Interface Scripts particular techniques and combining them in a way that Custom works best for the types of schemas being matched. For Impls Strategy Consuming Operator Execution example, if a version ID is prepended to element names in Interfaces Application Protoplasm Engine evolving schemas, then a user may want to delete the ID Impls before applying any matcher. The degree of such customizability depends on the user’s sophistication. An end-user SMM Similarity Auxiliary Schemas Graph Matrix Info who is matching two schemas wants a limited range of options to choose from. By contrast, a sophisticated user, Figure 1 Protoplasm Architecture such as an application vendor, wants to customize the tool The Strategy Execution Engine takes as input a to work well when matching its schemas to those of other strategy script, the Protoplasm-provided and custom parties. A second reason for customizability is to control operator implementations, and SMM graphs and similar- an algorithm’s scalability, e.g., by trading off response- ity matrices. During the execution, the engine accesses time for the quality of the result. A third reason is extensi- schema repositories and other auxiliary information, e.g., bility, meaning that new techniques can be easily added to a thesaurus or glossary, used by implementations of the tool. This helps sophisticated users, and researchers operators in the strategy script. The execution of the script who want to experiment with new or modified algorithms. results in a similarity matrix exported in a format We therefore embarked on a project to build such a compatible with an external Consuming Application. customizable schema matcher, one that could be used in 2.1 Representing Models commercial settings. This paper describes our early experience in developing this tool, called Protoplasm (a The input schemas can be in any meta-model, e.g., PROTOtype PLAtform for Schema Matching). In SQL DDL, ODMG or XML Schema Definition (XSD) addition to arguing for the importance of research on Language. Since no existing meta-model is expressive industrial-strength schema matching, we offer two main enough to subsume all others, Protoplasm defines SMM contributions: (i) a new architecture for schema matching, Graphs as a meta-model-independent representation. An which includes two new internal schema representations, SMM graph is a rooted, node- and edge-labeled directed interfaces for operators that comprise a matching algor- graph, where each node and edge has a unique identifier. ithm, and a flexible approach to combining operators; and Figure 2 shows an example SMM Graph describing the (ii) experience in tuning the prototype for scalability and SQL DDL in the shaded box that defines a relational using its customization features to add another algorithm. table. The root (i.e., id=1) is indicated by a thick circle. These are covered in Sections 2 and 3 respectively. We The graph identifier (i.e., id=0) and label are in boldface close in Section 4 with a discussion of future work. above the root. The database, table and column names are denoted by white circles, typing information by light gray 2 Architecture circles, and constraints (e.g., primary key) by dark gray circles. SMM is a variation of the Object Exchange Model Figure 1 presents the three-layer architecture of Proto- (OEM) [7], a data model defined for semistructured data, plasm. The bottom layer consists of two supporting struc- thus inheriting its simplicity and self-describing nature. tures. The Schema Matching Model Graph (SMM Graph) Protoplasm builds on widely-used XML technologies is the internal representation of input schemas: a rooted to implement SMM graphs. An XML syntax is defined to node- and edge-labeled directed graph. The Similarity describe and store SMM graphs in secondary memory Matrix holds correspondences between two SMM graphs. (e.g. disk), an extended XML parser is employed to load The middle layer defines a set of Operator Interfaces that them into main memory, a variation of DOM is used to declare the inputs and outputs of generic schema-match- enable programmatic navigation and editing, and an ing tasks such as import, transformation and export of XPath engine is modified to execute simple queries schemas, creation and manipulation of similarity matrices, against SMM graphs. The XML syntax describing SMM and match.

Industrial-Strength Schema Matching

A Framework for Ontology-Based Library Data Generation, Access and Exploitation

Collecting Activity Data Using the Open Mhealth Platform an Exploratory Study on Integrating Objective Data with Sport Monitoring Systems

SAS Decision Services 6.4

Data Warehouse Engineering Process (DWEP) with U.M.L

Schema Profiling of Document-Oriented Databases$,$$

Universal Business Language Version 2.3

Primary Importance of Schemas

Data Warehouse Engineering Process (DWEP) with U.M.L

Designing Data Marts for Data Warehouses

Educare: a Decision Support System for Anticipating Behaviors Among Educational Actors

Sql Server List Schemas Owned User

Business Intelligence on Non-Conventional Data