A Survey of Schema Matching Research Using Database Schemas and Instances
Total Page:16
File Type:pdf, Size:1020Kb
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 8, No. 10, 2017 A Survey of Schema Matching Research using Database Schemas and Instances Ali A. Alwan Mogahed Alzeber International Islamic University Malaysia, IIUM, International Islamic University Malaysia, IIUM Kuala Lumpur, Malaysia Kuala Lumpur, Malaysia Azlin Nordin Abedallah Zaid Abualkishik International Islamic University Malaysia, IIUM College of Computer Information Technology Kuala Lumpur, Malaysia American University in the Emirates Dubai, United Arab Emirates Abstract—Schema matching is considered as one of the which might negatively influence in the process of integrating essential phases of data integration in database systems. The the data [3]. main aim of the schema matching process is to identify the correlation between schema which helps later in the data Many firms might attempt to integrate some developed integration process. The main issue concern of schema matching heterogeneous data sources where these businesses have is how to support the merging decision by providing the various databases, and each database might consist of a vast correspondence between attributes through syntactic and number of tables that encompass different attributes. The semantic heterogeneous in data sources. There have been a lot of heterogeneity in these data sources leads to increasing the attempts in the literature toward utilizing database instances to complexity of handling these data, which result in the need for detect the correspondence between attributes during schema data integration [4]. Identifying the conflicts of (syntax matching process. Many approaches based on instances have (structure) and semantic heterogeneity) between schemas is a been proposed aiming at improving the accuracy of the matching significant issue during data integration. For this reason, process. This paper set out a classification of schema matching schema matching has been proposed to handle the process of research in database system exploiting database schema and discovering the correspondence between schema and resolve instances. We survey and analyze the schema matching conflicts when occurred. techniques applied in the literature by highlighting the strengths and the weaknesses of each technique. A deliberate discussion Nevertheless, using schema matching approach is has been reported highlights on challenges and the current inappropriate when databases are developed separately and research trends of schema matching in database. We conclude without unified standards [5]. Furthermore, it is impractical to this paper with some future work directions that help employ the schema design information “schema attributes” to researchers to explore and investigate current issues and determine the correspondences attributes when different challenges related to schema matching in contemporary abbreviations of attribute names “column’s names” is used to databases. represent the same real world entities or objects [2]-[5]. Keywords—Data integration; instance-based schema matching; Consequently, discovering instance correspondences schema matching; semantic matching; syntactic matching become an alternative approach for schema matching when schema information is not available or insufficient to be used I. INTRODUCTION for matching purposes. Instance-based schema matching Nowadays, integrating and managing a tremendous attempts to extract the semantic relationship between targeted amount of data has been extremely simplified due to the attributes via their values “instance”. Therefore, if the schema advancement in information technology. Several solutions matching approach fails to detect the match, then the instances have been proposed to combine data from different will be looked at to carry out the matching process. In this heterogeneous sources to form a unified global view. This paper, we surveyed and examined some well-known process, called data integration aim to represent data in one techniques of instance-based schema matching. We described single view and facilitate the process of interacting with the the strengths and the weaknesses of these techniques and end data to be appearing as one single information system [1]. the paper with some future work directions that can benefit the However, it is very challenging to integrate and manage data researchers in the area of data integration. from several sources that are being independently developed. This paper is organized as follows. Section II presents the This is due to the fact that there are different representations schema information levels. Section III presents and explains of these sources, and data sources might not be designed in a the classification of schema matching methods and the process way to adopt the same abstraction principles or have similar of instance-based schema matching. In Section IV, the semantic concepts to be fully used [2]. Besides, there might be techniques applied based on instance level matching has been various terminologies used to describe and store information 102 | P a g e www.ijacsa.thesai.org (IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 8, No. 10, 2017 explained. The related works for instance-based schema B. Instance Level matching have been reviewed and reported in Section V. The Instance level information, which is also known as discussion on the topics presented in this paper is reported in contents level has been extensively applied as an effective tool Section VI. Conclusion is presented in the final Section VII. to determine the correspondence between schemas. In many cases, it is not easy to obtain information from the schema II. SCHEMA INFORMATION LEVELS structure as either it is not available or the information is Due to the rapid development of information systems, the meaningless and could not be used for the matching purpose demand for schema matching solutions is growing [5], [10], [17]. Thus, in such cases, instances are considered as dramatically [7], [8]. For example, the role and tasks of the the most efficient and reliable source of information to enterprise databases evolved from the traditional use of storing identify the correspondences between attributes and determine and manipulating data to be an effective tool for data analysis the similarities and corresponding attributes of schema based and interpretation. Different heterogeneous databases might on exploiting the characters of available values/instances. need to be integrated for various purposes. The heterogeneity between databases encompasses the structure and the C. Hybrid Level semantic, which have resulted in the necessity of the schema Hybrid level retrieves information from the combination of matching [2]. The driving force behind the significant both schema metadata (attribute names, data type, structure development in database role is due to the complexity in and description) and instance level (values/instances) [8], [13], obtaining data from various heterogeneous sources. Besides, [15]. Several criteria and sources of information might be the need for intelligent decision supports tools that extract taken into consideration to achieve the matching between heterogeneous data to ensure the best decision for users. schemas. Among these criteria in sources includes name Identifying the correspondences (matches) between database matching and thesauri together with compatible data types that schemas has been commonly referred to as a schema matching lead to improving the performance through providing best- problem [6], [7], [9]. combined match candidates compared to the individual performance of different matchers [15]. There are three types of information that commonly used to solve the problem of schema matching by identifying the D. Auxiliary Level semantic of schema attributes and detect the correspondences Auxiliary level information is the process of combing between database schemas, i.e., 1) schema information; existing schema information along with additional information 2) instances; and 3) auxiliary information [9], [10]. Several obtained throughout external sources. Examples of external solutions have been proposed aiming at handling schema sources include WordNet/Thesauri, and dictionaries can be matching based on the available schema information [12], used for identifying the semantic relationships between [13]. These information help in preventing the incorrect match schema attribute names or abbreviations such as synonymy between schema attributes and lead to detect the similarities and hyponyms in order to determine the similarities if it exists between schema attributes, particularly for semantically [5], [13]. complex matching process. There are many beneficial levels of information that can be utilized to identify the schema III. CLASSIFICATION OF SCHEMA MATCHING METHODS matching. This includes metadata level, instance level, and In the literature, there have been many schema matching auxiliary level [2], [10]. Apparently, several approaches have methods developed with the aim of identifying the match been proposed employed levels of schema information. Some between database tables. There are a good number of surveys of these approaches rely on utilizing each level independently that discussed, classified and examined these methods [16], as identified individual matcher based on their problematic [18]. For instance, E. Rahm and P. Bernstein [11] have situations and information available [10], [15]. While, other suggested a