Shadow Theory: Data Model Design for Data Integration
Total Page:16
File Type:pdf, Size:1020Kb
Shadow Theory: Data Model Design for Data Integration Jason T. Liu* (Draft. Date:9/12/2012) ABSTRACT In information ecosystems, semantic heterogeneity is known as General Terms Algorithms, Design, Human Factors, Theory. the root issue for the difficulties of data integration, and the Relational Model is not designed for addressing such challenges, i.e. to re-use data that is modeled from other sources in the local Keywords data model design. Although researchers have proposed many Data integration, information ecosystems, shadows, meanings, different approaches, and software vendors have designed tools to mental entity, semantic heterogeneity, tags, help the data integration task, it remains an art relying on human labors. Due to lacks of a comprehensive theory to guide the overall modeling process, the quality of the integrated data heavily depends on the data integrators’ experiences. 1. Introduction Based on observations of practical issues for enterprise customer Data integration is essential for most enterprise systems that rely data integration, we believe that the needed solution is a new data on data from multiple sources. Lenzerini defined data integration model. This new data model needs to manage the difficulties of as to combine data residing at different sources in order to provide semantic heterogeneity: different data collected from different users with a unified view of these data [1]. Specifically, the goal perspectives about the same subject matter can naturally be of Enterprise Information Integration (EII) is to provide a uniform inconsistencies or even in conflicts. In other words, existing data access to multiple data sources without having to load the data models are designed to support single version of the truth, and into a central place [2]. Consumes about 40% of IT budget, EII is naturally they will have difficulties during data integration when cited as the biggest and most expensive challenge in [3], where data collected from different sources are based on different Bernstein and Haas provide an overview for related difficulties perspectives or at different levels of abstraction. and technologies. Therefore, we propose Shadow Theory to serve as the In this paper, our main concern is in the conceptual modeling philosophical foundation in order to design a new data model. The level, especially for data integration in large-scale information kernel of the theory is based on the notion of shadow, which can ecosystems. The context is the design activities for data modeling be traced back to Plato’s Allegory of the Cave over 2000 years where a local (downstream) database needs to re-use existing data ago. The basic idea is that whatever we can observe and store into provided by upstream systems, and we focus on the specific issue databases about the subject matter are just shadows. Meanings of that there are multiple sources for the same subject matter (e.g. shadows are mental entities that exist only in the viewers’ customer). These data sources are designed for satisfying different cognitive structures. Such mental entities are constrained by the business requirements based on different understanding, different viewers’ internal model about the reality, especially the implicitly perspectives or even different ontology implicitly. or explicitly chosen perspective(s) or ontology, if formally Therefore, we use the term data integration in a more generalized represented. sense in this paper, for how we can design data models used in the In this paper, we propose six basic principles to guide the overall context of information ecosystems to fulfill its local business data model design. Further, we also propose algebra with a set of functionality through data provided by multiple sources. For basic operators to support data operations by their meanings, not example, ordering systems in an enterprise need to access by their logical structure. The representation is based on point- customer data that exist in billing, marketing, or sales databases. free geometry such that any meaning is represented as an area in However, these different sources have their own perspectives semantic space, which can be decomposed or aggregated in about the same subject matter. To develop an isolated model of different ways concurrently. We use W(hat)-tags to attach on customer data for ordering only is not an acceptable solution since shadows for their meanings, and E(quivalence)-tags to recognize it will not improve the overall performance for the enterprise, and what meanings can be treated as the same. We use enterprise isolated customer data will create more chances of inconsistencies customer data integration as an example to illustrate the data or errors. model design and operation principles. By data model design, there are two levels of meanings here: Categories and Subject Descriptors (1) The generic principles to design the data models for specific applications, for example, to model enterprise customer data H.2.5 [Heterogeneous Databases] in a marketing database, or used in a data warehouse for * The author can be reached at [email protected] business intelligence. 1 (2) The design of a generic data model like the Relational Model Content with algebra to serve as the foundation for data operations. In general, the difficulties of data integration are due to 1. INTRODUCTION heterogeneity at physical level, logical (structural) level, and semantics level. Since semantic heterogeneity is usually hidden 2. ENTERPRISE CUSTOMER DATA INTEGRATION behind structural heterogeneity, it is the most difficult one to 2.1 THE NOTION OF UNIQUE KEYS AND SEMANTIC identify, and the resolution by mapping between semantic HETEROGENEITY heterogeneous data elements usually requires design decisions 2.2 SCHEMA AND RELATED ISSUES made by human data integrators (one of the reasons for why data 2.3 EQUIVALENCE, SIMILARITY, AND DATA INTEGRATION integration still relies on human labors since the beginning of 2.4 EXPECTATIONS AND PRACTICAL NEEDS OF DATA database era). Without a commonly accepted comprehensive INTEGRATION theory to guide such design decisions, data integrators rely on 2.5 SUMMARY their own experiences and personal preferences like performing an art in current practice. 3. SHADOW THEORY The trends of related research are more in making such manual 3.1 SHADOW THEORY process automatic or semi-automatic (e.g. see the survey for 3.2 SIX PRINCIPLES FOR DATA INTEGRATION matching schemas automatically in [4, 5]), in order to reduce the 3.3 SHADOWS IN PLATO’S CAVE dependence of human work. Researchers usually start developing 3.4 W-TAGS: MEANINGS OF SHADOWS and testing their solutions from relative small number of data 3.5 SEMANTIC HETEROGENEITY AND MEANING INDEPENDENCE sources, and push for scale up later. Significant progresses were 4. DATA INTEGRATION made in design software tools at application level to review schema and identify potential mapping candidates. 4.1 E-TAG: EQUIVALENCE OF SHADOWS However, for large-scale information ecosystems, data integration 4.2 MEANINGFUL DATA INTEGRATION AND CRITERIA FOR faces different kinds of challenges. For example, it is not EVALUATION uncommon for an enterprise to have 500 or more different data 4.3 DATA MODEL FEATURES FOR HELPING USERS TO sources that provide similar, overlapped, inconsistent or even UNDERSTAND THEIR INTEGRATED DATA AND USE IT PROPERLY conflicting customer data. First challenge is due to the number of 5. ALGEBRA FOR SHADOW THEORY-BASED DATA schema, which indicates the scale of complexities for mapping MODEL between different logical representations. Second challenge is due to the dynamic nature for subject matters themselves, that 5.1 BASIC STRUCTURE millions of enterprise customers are in endless M&A activities. 5.2 REGIONS AS THE ONLY PRIMITIVE UNITS Third challenge is due to the different business requirements for 5.3 ALGEBRA FOR SHADOW THEORY-BASED DATA MODEL each individual data models, that each one has its procedures to 5.4 SIMULATING RELATIONAL SCHEMA BY TEMPLATES IN decide when to update their data due to M&A results. When there SEMANTIC SPACE is a need to integrate individual information ecosystems due to 6. DISCUSSION AND FUTURE WORK merger of the service providers, each merged department like the integrated billing or ordering usually has to perform their own 6.1 THE CHALLENGES TO SUPPORT MULTIPLE VERSIONS OF THE integration, since enterprise-wide enterprise customer data TRUTH integration for all departments is slow or even not possible to 6.2 COMPARISON WITH RELATIONAL MODEL happen. 6.3 MEANING INDEPENDENCE VERSUS LOGICAL DATA Under these surface issues, the difficulties of semantic INDEPENDENCE heterogeneity can be summarized as the two basic characteristics: 6.4 COMPARISONS WITH ENTITY-RELATIONSHIP MODEL 6.5 BUSINESS PUSH ON CUSTOMER DATA INTEGRATION (CDI) (i) There exist different representations for the same meaning, 6.6 RELATED DATA INTEGRATION APPROACHES (ii) There exist different meanings for the same representation. 6.7 MODEL MANAGEMENT IN SEMANTIC SPACE 6.8 ONTOLOGY, XML, AND SEMANTIC WEB Since Relational Databases are the dominant backbone for 6.9 OTHER RELATED ISSUES enterprise systems, we will focus on Relational Model to make comparison with in this paper. Different representations can be 7. CONCLUSION interpreted as different schema, which is explicitly represented in relational database systems. But semantics is not; it is something out of the scope of Relational Model. Although explicitness was is an uninterpreted finite relational structure” [7]. When one of the original objectives of the Relational Model (as compared with the design of ontology, researchers further summarized by Date: "the meaning of the data should be as explained, “Database schemas often do not provide explicit obvious and explicit as possible” see p.68 in [6]), there is no semantics for their data. Semantics is usually specified explicitly mechanism to enforce meanings to be explicitly represented. at design-time, and frequently is not becoming a part of a What is even worse is that there is no mechanism to prevent database specification, therefore it is not available“ [5, 8].