A General Strategy for Knowledge Acquisition from Semantically Heterogeneous Data Sources
Doina Caragea
Department of Computing and Information Sciences
Kansas State University
234 Nichols Hall, Manhattan, KS 66506

Jie Bao and Vasant Honavar
Artificial Intelligence Research Lab
Department of Computer Science
Iowa State University
226 Atanasoff Hall, Ames, IA 50011

Abstract

With the advent of the Semantic Web, there is increased availability of meta data (ontologies) that make explicit the semantic commitments associated with the data sources. Together with tools for specifying mappings between ontologies, this has opened up, for the first time, the possibility of acquiring knowledge from such ontology-extended, semantically disparate data sources. Hence, there is an urgent need for machine learning algorithms for building predictive models (e.g., classifiers) in a setting where there is no unique global interpretation of data from semantically disparate sources and it is neither feasible nor desirable to collect data from such sources in a centralized data warehouse. We formulate the problem of learning classifiers from a set of related, semantically heterogeneous data sources, under the assumption that ontologies and mappings from a user ontology to the data source ontologies are given. We design a general strategy for learning classifiers from such data sources by reducing the problem of learning to the problem of answering queries from semantically heterogeneous data, and we show how to answer such queries.

Introduction

The availability of large amounts of data in many application domains offers unprecedented opportunities in computer-assisted, data-driven knowledge acquisition in a number of applications including, in particular, data-driven scientific discovery (e-Science) in bioinformatics, environmental informatics, geo-informatics, neuro-informatics, health informatics, etc., or data-driven decision making in business and commerce (e-Business and e-commerce).

Machine learning techniques (Mitchell 1997; Duda, Hart, & Stork 2000), in addition to traditional statistical techniques (Casella & Berger 2001), offer some of the most cost-effective approaches to analyzing, exploring and extracting knowledge (features, correlations, and other complex relationships and hypotheses that describe potentially interesting regularities) from such data sources. However, the applicability of current knowledge acquisition techniques is challenged by the nature and the scale of the data available. More precisely:

(a) Data repositories are large in size, dynamic, and physically distributed. Consequently, it is neither desirable nor feasible to gather all of the data in a centralized location for analysis. Hence, there is a need for efficient algorithms for analyzing and exploring multiple distributed data sources without transmitting large amounts of data.

(b) Autonomously developed and operated data sources often differ in their structure and organization (e.g., relational databases, flat files, etc.) and the operations that can be performed on the data sources (e.g., types of queries: relational queries, statistical queries, keyword matches). Hence, there is a need for theoretically well-founded strategies for efficiently obtaining the information needed for analysis within the operational constraints imposed by the data sources.

(c) Autonomously developed data sources are semantically heterogeneous. The ontological commitments associated with a data source (and hence its implied semantics) are typically determined by the data source designers, based on their understanding of the intended use of the data. Very often, data sources that are created for use in one context or application find use in other contexts or applications, and therefore, semantic differences among autonomously designed, owned, and operated data repositories are simply unavoidable. Therefore, effective use of multiple sources of data in a given context requires reconciliation of semantic differences. As a consequence, there have been significant community-wide efforts aimed at the construction of ontologies (e.g., Gene Ontology - GO [1] - in biology, Unified Medical Language System - UMLS [2] - in health informatics, Semantic Web for Earth and Environmental Terminology - SWEET [3] - in geospatial informatics). However, collaborative scientific discovery applications often require users to be able to analyze data from various perspectives. There is no single privileged perspective that can serve all users, or for that matter, even a single user, in every context. Hence, there is a need for methods that can dynamically and efficiently extract and integrate information needed for knowledge acquisition, from semantically heterogeneous data, from a user's perspective, based on user-specified ontologies and user-specified mappings between ontologies.

[1] www.geneontology.org
[2] www.nlm.nih.gov/research/umls
[3] sweet.jpl.nasa.gov

Copyright © 2006, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.

Against this background, we note that a large class of data sources on the Semantic Web can be viewed (at a certain level of abstraction) as a collection of semantically disparate relational data sources that are semantically related, from a user's point of view, in the context of a specific knowledge acquisition task. We design a general strategy for learning classifiers from multiple semantically disparate, geographically distributed, relational data sources on the Semantic Web.

The rest of the paper is organized as follows: We first introduce the concepts and definitions needed to formulate the problem addressed. Next, we formulate the problem of learning from semantically heterogeneous data and describe a general strategy for transforming algorithms for learning classifiers from relational data into algorithms for learning classifiers from semantically disparate, relational data sources, using ontologies and mappings between ontologies, in a setting where it is neither feasible nor desirable to integrate all the data available into a single relational data warehouse. We conclude with a summary and a brief discussion of related work.

Concepts and Definitions

Ontology-Extended Data Sources and User Views

We define an ontology-extended relational data source (OERDS) as a tuple $\mathcal{D} = \{D, S, O\}$, where $D$ is the actual data set in the data source, $S$ represents the data source schema and $O$ represents the data source ontology (Bonatti, Deng, & Subrahmanian 2003; Caragea et al. 2005b).

In the relational model, each data source consists of a set of concepts $X_1, \ldots, X_n$ and a set of properties of these concepts $P_1, \ldots, P_m$. Each concept $X_i$ has associated with it a set of attributes denoted by $A(X_i)$ and a set of $k$-ary relations ($k > 1$) denoted by $R(X_i)$. Each attribute $A_i$ takes values in a set $V(A_i)$. The concepts and the properties of the concepts (attributes and relations) define the schema of a relational data source. A data set $D$ is an instantiation $I(S)$ of a schema $S$ (Getoor et al. 2001).

The ontology $O$ of an OERDS $\mathcal{D}$ consists of two parts: structure ontology, $O_S$, that defines the semantics of the data source schema (concepts and properties of the concepts that

On the Semantic Web, it is unrealistic to assume the existence of a single global ontology that corresponds to a universally agreed upon set of ontological commitments for all users. Instead, it is much more realistic to allow each user or a community of users to choose the ontological commitments that they deem useful in a specific context. A user ontology $O_U$, together with a set of interoperation constraints $IC$, and the associated set of mappings $\{\psi_i \mid i = 1, \ldots, p\}$ from the user ontology $O_U$ to the data source ontologies $O_1, \ldots, O_p$ define a user view (Caragea et al. 2005b). In the relational setting considered in this paper, the interoperation constraints can be equality constraints or inclusion constraints and can be defined at the concept level (between related concepts), the property level (between related attributes or relations) and the attribute value level (between related attribute values).

Bibliography Example

We will use an example from the bibliography domain to illustrate the main notions introduced above. Consider the problem of classifying computer science research papers into categories from a topic hierarchy (e.g., Artificial Intelligence, Networking, Data Mining, Relational Data Mining, etc.) (McCallum et al. 2000). A user interested in a document classification task might consider using several data sources, such as CiteULike (http://www.citeulike.org/) from UMBC; the Collection of Computer Science Bibliographies [4] from University of Karlsruhe; MIT Libraries (http://libraries.mit.edu/index.html); INRIA Reference Database (http://ontoweb.org/), etc., for learning classifiers. The Ontology Alignment Evaluation Initiative (OAEI) has made available a Test Library (http://oaei.inrialpes.fr/2005/benchmarks/) that contains representative ontologies for the data sources above. In this case, the structure ontologies define the relevant concepts (such as Reference, Book, Article, Journal, Conference, etc.), relationships between classes (see Figure 2 for class hierarchies contained in the INRIA, MIT and a user ontology, respectively), and properties of the concepts such as Article author Author; Article cites Article; Author position; Article journal Journal, etc. The properties in these ontologies include both attributes (e.g., position)