Semantic Integration in Heterogeneous Databases Using Neural Networks †

Wen-Syan Li        Chris Clifton
Department of Electrical Engineering and Computer Science
Northwestern University
Evanston, Illinois, 60208-3118
{acura,clifton}@eecs.nwu.edu

† This material is based upon work supported by the National Science Foundation under Grant No. CCR-9210704.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.

Proceedings of the 20th VLDB Conference, Santiago, Chile, 1994

Abstract

One important step in integrating heterogeneous databases is matching equivalent attributes: determining which fields in two databases refer to the same data. The meaning of information may be embodied within a database model, a conceptual schema, application programs, or data contents. Integration involves extracting semantics, expressing them as metadata, and matching semantically equivalent data elements. We present a procedure that uses a classifier to categorize attributes according to their field specifications and data values, and then trains a neural network to recognize similar attributes. In our technique, the knowledge of how to match equivalent data elements is "discovered" from metadata, not "pre-programmed".

1 Introduction

One problem in developing federated databases is semantic integration: determining which fields are equivalent between databases. Attributes (classes of data items) are compared in a pairwise fashion to determine their equivalence. Manually comparing all possible pairs of attributes is an unreasonably large task, especially since most pairs do not represent the same information. Simple ad-hoc guesswork, on the other hand, is likely to miss some attributes that should map to the same global attribute. US West reports having 5 terabytes of data managed by 1,000 systems, with customer information alone spread across 200 different databases [DKM+93]. One group at GTE began the integration process with 27,000 data elements from just 40 of its applications; it required an average of four hours per data element to extract and document matching elements when the task was performed by someone other than the data owner [VH94]. Other GTE integration efforts have found that close to 80% of the elements in their databases overlapped or nearly matched [VH94].

[DKM+93] pointed out some important aspects of semantic heterogeneity. Semantics may be embodied within a database model, a conceptual schema, application programs, and the minds of users. Semantic differences among components may be considered inconsistencies or heterogeneity depending on the application, so it is difficult to identify and resolve all the semantic heterogeneity among components. Techniques are therefore required to support the re-use of knowledge gained in the semantic heterogeneity resolution process. The goal of our research is to develop a semi-automated semantic integration procedure that utilizes the metadata specification and data contents at the conceptual schema and data content levels. Note that this is the only information reasonably available to an automated tool; parsing application programs or picking the brains of users is not practical. We want the ability to determine the likelihood of attributes referring to the same real-world class of information from the input data. We also want the ability to reuse or adapt the knowledge gained in the semantic heterogeneity resolution process to work on similar problems. We present a method where the knowledge of how to determine matching data elements is discovered, not pre-programmed.

We start with the assumption that attributes in different databases that represent the same real-world concept will have similarities in structure and data values. For example, employee salaries in two databases will probably be numbers greater than 0; this can be determined from constraints, and is a structural similarity. The same can be said for daily gross receipts. However, the range and distribution of data values will be very different for salaries and gross receipts. From this we can determine that two salary fields probably represent the same real-world concept, but that gross receipts are something different. Note that the assumption is only that there are similarities, not that we know in advance what those similarities are.
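To make this concrete, the Python sketch below encodes an attribute as a small pattern vector of discriminators drawn from its field specification and data values. The particular discriminators, and the names AttributePattern and make_pattern, are illustrative assumptions only; the information our procedure actually uses is discussed in Sections 3 and 4.

    from dataclasses import dataclass
    from statistics import mean, pstdev

    # Hypothetical discriminators mixing structure (numeric or not) with
    # data-content statistics. These four are illustrative only.
    @dataclass
    class AttributePattern:
        name: str
        is_numeric: float   # 1.0 if every value is numeric, else 0.0
        avg_len: float      # average length of values rendered as strings
        val_mean: float     # mean of the numeric values (0 if non-numeric)
        val_dev: float      # standard deviation of the numeric values

    def make_pattern(name, values):
        # Summarize an attribute's data contents as a fixed-size vector.
        nums = [v for v in values if isinstance(v, (int, float))]
        numeric = 1.0 if values and len(nums) == len(values) else 0.0
        return AttributePattern(
            name=name,
            is_numeric=numeric,
            avg_len=mean(len(str(v)) for v in values) if values else 0.0,
            val_mean=mean(nums) if numeric else 0.0,
            val_dev=pstdev(nums) if numeric else 0.0,
        )

    # Both fields are positive numbers (structural similarity), but their
    # value distributions separate them.
    salary = make_pattern("salary", [52000, 61000, 48500, 75000])
    receipts = make_pattern("gross_receipts", [1834.50, 912.00, 2650.75])

Under such an encoding, two salary fields from different databases would produce nearby vectors, while salary and gross receipts would differ markedly in val_mean and val_dev despite their shared structure.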
In Figure 1 we outline our method. In this process, DBMS-specific parsers extract information (schema and data contents) from the databases. We then use a classifier that learns how to discriminate among attributes in a single database. The classifier output, a set of cluster centers, is used to train a neural network to recognize categories; this network can then determine similar attributes between databases.

[Figure 1: Procedure of Semantic Integration Using Neural Networks. DBMS-specific parsers extract database information; a classifier clusters attributes and generates training patterns; trained networks recognize the similarity between attributes and report equivalent attributes.]
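A minimal sketch of this pipeline, assuming numeric pattern vectors like those above: a simple competitive-learning classifier produces cluster centers, and a small backpropagation network trained on those centers outputs a degree of similarity to each category. The network sizes, learning rates, and random stand-in patterns are illustrative assumptions, not the configuration used in our experiments (the actual classifier and network are described in Section 4).

    import numpy as np

    rng = np.random.default_rng(0)

    def classify(patterns, n_clusters, epochs=200, lr=0.5):
        # Self-organizing (competitive-learning) classifier: each pattern
        # is assigned to the nearest weight vector, and that winner is
        # pulled toward the pattern. The weights become cluster centers.
        centers = rng.uniform(size=(n_clusters, patterns.shape[1]))
        for t in range(epochs):
            rate = lr * (1.0 - t / epochs)                # decaying rate
            for p in patterns:
                w = np.argmin(np.linalg.norm(centers - p, axis=1))
                centers[w] += rate * (p - centers[w])
        return centers

    def train_recognizer(centers, epochs=5000, lr=0.3):
        # Backpropagation network trained on the cluster centers: center
        # i is the exemplar for category i, so the network learns to give
        # a degree of similarity to each category for any input pattern.
        n, dim = centers.shape
        W1 = rng.normal(scale=0.5, size=(dim, 2 * n))
        W2 = rng.normal(scale=0.5, size=(2 * n, n))
        targets = np.eye(n)
        sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
        for _ in range(epochs):
            h = sigmoid(centers @ W1)                     # hidden layer
            out = sigmoid(h @ W2)                         # similarities
            d_out = (targets - out) * out * (1 - out)     # output delta
            d_h = (d_out @ W2.T) * h * (1 - h)            # hidden delta
            W2 += lr * h.T @ d_out
            W1 += lr * centers.T @ d_h
        return lambda p: sigmoid(sigmoid(p @ W1) @ W2)

    # Patterns from database A (rows = attributes, columns =
    # discriminators); random stand-ins here for real parser output.
    patterns_a = rng.uniform(size=(8, 5))
    similarity_net = train_recognizer(classify(patterns_a, n_clusters=4))
    # Feeding a database-B attribute pattern yields its similarity to
    # each database-A category.
    print(similarity_net(rng.uniform(size=5)))

A high output in position i suggests that the database-B attribute resembles the database-A attributes in cluster i.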
As an example of how this system could be used, imagine that we are planning university admissions. We wish to make sure that we do not admit too many students; to do this we need to check whether classes are oversubscribed. We are used to using the Registrar's database, which contains the number of students who took a course and the room number (among other information). However, it does not contain room size information. After some checking, we are told that Building and Grounds maintains a database which contains room sizes, and we are given access to this database. A first attempt is to issue the following query (using a multidatabase query language such as [Lit89], or better yet a database browser with multidatabase support):

    Select count(c.room#)
    From   Registrardb.classes c, B+G.rooms r
    Where  c.room# = r.room# and c.size >= r.size

However, we are given the surprising result of 0! We then remember to use the semantic integration tool (using a pre-trained network); in seconds we are given a list of likely matches, which shows that r.seats matches c.size much more closely than r.size does. (Building and Grounds considers the size of a room to be its size in square feet.) We then reissue the query (using r.seats instead of r.size) and are given the number of classes which filled their rooms. Note that the main user effort involved is finding the appropriate databases; the integration effort is low. The end user is able to distinguish between unreasonable and reasonable answers, and exact results aren't critical. This method allows a user to obtain reasonable answers requiring database integration at low cost.

This paper is organized as follows. We first review existing work in this area. In Section 3 we discuss the semantics available from databases. In Section 4 we describe our technique of using a self-organizing map classifier to categorize attributes and then training a network to recognize input patterns and give degrees of similarity. In Section 5 we present the results of testing our techniques on three pairs of real databases. Finally, in Section 6 we offer our conclusions.

2 Related Work

A federated architecture for database systems was proposed by McLeod and Heimbigner in [MH80]. A basic conceptual technique for integrating component views into a "superview" was introduced by Motro and Buneman [MB81]. The Multibase project [SBU+81, DH84] by the Computer Corporation of America in the early 80's first built a system for integrating pre-existing, heterogeneous, distributed databases. The process of schema generalization and integration, however, still needs the involvement of database designers to find those objects that contain information of the same domain or related data. This becomes a bottleneck for schema integration when the size of the database is large.

One approach for determining the degree of object equivalence, proposed by Storey and Goldstein [SG88], is to compare objects in a pairwise fashion by consulting a lexicon of synonyms. It is assumed that some classes, or at least some of their attributes and/or relationships, are assigned meaningful names in a pre-integration phase. The knowledge about the terminological relationship between the names can therefore be used as an indicator of the real-world correspondence between the objects. In pre-integration, object equivalence (or degree of similarity) is calculated by comparing the aspects of each object and computing a weighted probability of similarity and dissimilarity. Sheth and Larson [SL90] noted that comparison of the schema objects is difficult unless the related information is represented in a similar form in different schemas.

2.1 Existing Approaches

A synonym lexicon can reveal correspondences that are not easily detected by inspection, but this shifts the problem to building the synonym lexicon. Even a synonym lexicon has limitations: it is difficult for database designers to define a field name using only words that can be found in a dictionary, or abbreviations carrying unambiguous meanings, and in some cases it is difficult to use a single word rather than a phrase to name a field. These reasons make it expensive to build a system based on this approach. Sheth and Larson [SL90] also pointed out that completely automatic determination of attribute relationships through searching a synonym lexicon is not possible, because it would require that all of the semantics of a schema be completely specified. Moreover, current semantic (or other) data models are not able to capture a real-world state completely, and interpretations of a real-world state change over time.
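To make the approach being critiqued concrete, here is a minimal sketch of lexicon-driven pairwise comparison in the spirit of [SG88]; the lexicon entries, the aspects compared, and the weights are hypothetical illustrations of ours, not the published procedure.

    # Hand-built synonym lexicon; every pair it omits scores zero.
    SYNONYMS = {("salary", "wage"), ("room", "chamber")}

    def name_score(a, b):
        # 1.0 for identical names, 0.8 for lexicon synonyms, else 0.0.
        if a == b:
            return 1.0
        return 0.8 if (a, b) in SYNONYMS or (b, a) in SYNONYMS else 0.0

    def object_similarity(obj_a, obj_b, w_name=0.5, w_attrs=0.5):
        # Weighted combination of two aspects: name similarity, and the
        # fraction of attributes matchable through the lexicon.
        matched = sum(
            1 for x in obj_a["attrs"]
            if any(name_score(x, y) > 0 for y in obj_b["attrs"])
        )
        attr_score = matched / max(len(obj_a["attrs"]), 1)
        return (w_name * name_score(obj_a["name"], obj_b["name"])
                + w_attrs * attr_score)

    emp = {"name": "employee", "attrs": ["salary", "dept"]}
    staff = {"name": "employee", "attrs": ["wage", "office"]}
    print(object_similarity(emp, staff))  # 0.5*1.0 + 0.5*0.5 = 0.75

Note that any name pair missing from the hand-built lexicon scores zero, which illustrates why the quality, and hence the cost, of the lexicon dominates this approach.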