Journal of Database Management, 18(1), 43-67, January-March 2007 43 Semantic Integration and Knowledge Discovery for Environmental Research Zhiyuan Chen, University of Maryland, Baltimore County (UMBC), USA Aryya Gangopadhyay, University of Maryland, Baltimore County (UMBC), USA George Karabatis, University of Maryland, Baltimore County (UMBC), USA Michael McGuire, University of Maryland, Baltimore County (UMBC), USA Claire Welty, University of Maryland, Baltimore County (UMBC), USA ABSTRACT Environmental research and knowledge discovery both require extensive use of data stored in various sources and created in different ways for diverse purposes. We describe a new metadata approach to elicit semantic information from environmental data and implement semantics- based techniques to assist users in integrating, navigating, and mining multiple environmental data sources. Our system contains specifications of various environmental data sources and the relationships that are formed among them. User requests are augmented with semantically related data sources and automatically presented as a visual semantic network. In addition, we present a methodology for data navigation and pattern discovery using multi-resolution brows- ing and data mining. The data semantics are captured and utilized in terms of their patterns and trends at multiple levels of resolution. We present the efficacy of our methodology through experimental results. Keywords: environmental research, knowledge discovery and navigation, semantic integra- tion, semantic networks, wavelets INTRODUCTION spatial and temporal) differences and inter- The urban environment is formed dependencies, are collected and managed by complex interactions between natural by multiple organizations, and are stored and human systems. Studying the urban in varying formats. Scientific knowledge environment requires the collection and discovery is often hindered because of analysis of very large datasets that span challenges in the integration and naviga- many disciplines, have semantic (including tion of these disparate data. Furthermore, Copyright © 2007, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of Idea Group Inc. is prohibited. 44 Journal of Database Management, 18(1), 43-67, January-March 2007 as the number of dimensions in the data data repositories and assists the user to increases, novel approaches for pattern navigate through the abundance of diverse discovery are needed. data sources as if they were a single ho- Environmental data are collected in a mogeneous source. More specifically, our variety of units (metric or SI), time incre- contributions are: ments (minutes, hours, or even days), map projections (e.g., UTM or State Plane) and 1. Recommendation of Additional and spatial densities. The data are stored in Relevant Data Sources: We present numerous formats, multiple locations, and our approach to recommend data are not centralized into a single repository sources that are potentially relevant for easy access. To help users (mostly en- to the user’s search interests. Cur- vironmental researchers) identify data sets rently, it is tedious and impractical for of interest, we use a metadata approach to users to locate relevant information extract semantically related data sources sources by themselves. We provide and present them to the researchers as a a methodology that addresses this semantic network. Starting with an initial problem and automatically supplies search (query) submitted by a researcher, users with additional and potentially we exploit stored relationships (metadata) relevant data sources that they might among actual data sources to enhance the not be aware of. In order to discover search result with additional semantically these additional recommendations, related information. Although domain ex- we exploit semantic relationships perts need to manually construct the initial between data sources. We define se- semantic network, which may only include mantic networks for interrelated data a small number of sources, we introduce sources and present an algorithm to an algorithm to let the network expand automatically refine, augment, and and evolve automatically based on usage expand an initial and relatively small patterns. Then, we present the semantic semantic network with additional and network to the user as a visual display of relevant data sources; we also exploit a hyperbolic tree; we claim that semantic user profiles to tailor resulting data networks provide an elegant and compact sources to specific user preferences. technique to visualize considerable amounts 2. Visualization and Navigation of Rel- of semantically relevant data sources in a evant Data Sources: The semantic simple yet powerful manner. network with the additional sources Once users have finalized a set of envi- is shown to the user as a visual hy- ronmental data sources, based on semantic perbolic tree improving usability by networks, they can access the actual sources showing the semantic relationships to extract data and perform techniques for among relevant data sources in a vi- knowledge discovery. We introduce a new sual way. After the user has decided approach to integrate urban environmental on the choice of relevant data sources data and provide scientists with semantic of interest (based on our metadata ap- techniques to navigate and discover patterns proach) and has accessed the actual in very large environmental datasets. data, we also assist the user in navigat- Our system provides access to a mul- ing through the plethora of environ- titude of heterogeneous and autonomous mental data using visualization and Copyright © 2007, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of Idea Group Inc. is prohibited. Journal of Database Management, 18(1), 43-67, January-March 2007 45 navigation techniques that describe have been written over the years (Batini, data at multiple levels of resolution, Lenzerini, & Navathe, 1986; Ouksel & enabling pattern and knowledge Sheth, 1999; Rahm & Bernstein, 2001). discovery at different semantic lev- There has been a significant amount of els. We achieve that, using wavelet work on data integration, especially on transformation techniques, and we resolving discrepancies of different data demonstrate resilience of wavelet schemas using a global (mediated) schema transformation to noisy data. (Friedman, Levy, & Millstein, 1999; Levy, 3. Implementation of a Prototype Rajaraman, & Ordille, 1996; Miller et al., System: Finally, we have designed 2001; Papakonstantinou, Garcia-Molina, and implemented a prototype system & Ullman, 1996; Rahm & Bernstein, as a proof of concept for our tech- 2001; Ram, Park, & Hwang, 2002). More niques. Using this system we have recently, there exists work on decentralized demonstrated the feasibility of our data sharing (Bowers, Lin, & Ludascher, contributions and have conducted 2004; Doan, Domingos, & Halevy, 2003; a set of experiments verifying and Halevy, Ives, Suciu, & Tatarinov, 2003; validating our approach. Hyperion, Tatarinov, & Halevy, 2004) and on integrating data in web-based databases This article is organized as follows. (Bowers et al., 2004; Chang, He, & Zhang, First, we present related work on data inte- 2005; Dispensa & Brulle, 2003). Clustering, gration using semantics, and on exploration classification and ontologies have also been of multi-dimensional data. Next, we present extensively used as a tool to solve semantic our research methodology on semantic net- heterogeneity problems (Jain & Zhao, 2004; works and pattern discovery with wavelet Kalfoglou & Schorlemmer, 2003; Ram transformations. Then, we describe our & Park, 2004; Sheth et al., 2004; Sheth, prototype implementation and the experi- Arpinar, & Kashyap, 2003; Sheth et al., ments we conducted. Our conclusions are 2002; Zhao & Ram, 2002, 2004). presented in the final section. All the previously mentioned work takes a deep integration approach, where RELATED WORK the data schemas (or query interface for integrating web databases) of all sources are Data Integration integrated. However, this approach is often There is a rich body of existing too restrictive for environmental research work on data integration problems. The because: (1) there are so many different fundamental problem is to enable inter- types of data collected by so many different operation across different heterogeneous groups that it is impractical for all of them sources of information. In general, this to agree on a universal mediated schema; (2) problem manifests itself either as schema unlike companies, environmental research- mismatches (schema integration) or data ers often share data in an ad hoc way, e.g., incompatibilities (data integration) while a company may purchase products from accessing disparate data sources. Several several fixed suppliers while environmental surveys identifying problems and proposed researchers may use any dataset collected approaches on schema and data integration by other researchers but related to their current research task. Copyright © 2007, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of Idea Group Inc. is prohibited. 46 Journal of Database Management, 18(1), 43-67, January-March 2007 There has been much effort by the tive/negative association between classes; ecology research community to integrate in our article, we do not restrict our work to its data (EML, ORS, SEEK). These sys- these
Details
-
File Typepdf
-
Upload Time-
-
Content LanguagesEnglish
-
Upload UserAnonymous/Not logged-in
-
File Pages25 Page
-
File Size-