Journal of Management, 18(1), 43-67, January-March 2007 43

Semantic Integration and Knowledge Discovery for Environmental Research

Zhiyuan Chen, University of Maryland, Baltimore County (UMBC), USA Aryya Gangopadhyay, University of Maryland, Baltimore County (UMBC), USA George Karabatis, University of Maryland, Baltimore County (UMBC), USA Michael McGuire, University of Maryland, Baltimore County (UMBC), USA Claire Welty, University of Maryland, Baltimore County (UMBC), USA

Abstract

Environmental research and knowledge discovery both require extensive use of data stored in various sources and created in different ways for diverse purposes. We describe a new approach to elicit semantic from environmental data and implement - based techniques to assist users in integrating, navigating, and mining multiple environmental data sources. Our system contains specifications of various environmental data sources and the relationships that are formed among them. User requests are augmented with semantically related data sources and automatically presented as a visual . In addition, we present a methodology for data navigation and pattern discovery using multi-resolution brows- ing and data mining. The data semantics are captured and utilized in terms of their patterns and trends at multiple levels of resolution. We present the efficacy of our methodology through experimental results.

Keywords: environmental research, knowledge discovery and navigation, semantic integra- tion, semantic networks, wavelets

Introduction spatial and temporal) differences and inter- The urban environment is formed dependencies, are collected and managed by complex interactions between natural by multiple organizations, and are stored and human systems. Studying the urban in varying formats. Scientific knowledge environment requires the collection and discovery is often hindered because of analysis of very large datasets that span challenges in the integration and naviga- many disciplines, have semantic (including tion of these disparate data. Furthermore,

Copyright © 2007, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of Idea Group Inc. is prohibited. 44 Journal of Database Management, 18(1), 43-67, January-March 2007 as the number of dimensions in the data data repositories and assists the user to increases, novel approaches for pattern navigate through the abundance of diverse discovery are needed. data sources as if they were a single ho- Environmental data are collected in a mogeneous source. More specifically, our variety of units (metric or SI), time incre- contributions are: ments (minutes, hours, or even days), map projections (e.g., UTM or State Plane) and 1. Recommendation of Additional and spatial densities. The data are stored in Relevant Data Sources: We present numerous formats, multiple locations, and our approach to recommend data are not centralized into a single repository sources that are potentially relevant for easy access. To help users (mostly en- to the user’s search interests. Cur- vironmental researchers) identify data sets rently, it is tedious and impractical for of interest, we use a metadata approach to users to locate relevant information extract semantically related data sources sources by themselves. We provide and present them to the researchers as a a methodology that addresses this semantic network. Starting with an initial problem and automatically supplies search (query) submitted by a researcher, users with additional and potentially we exploit stored relationships (metadata) relevant data sources that they might among actual data sources to enhance the not be aware of. In order to discover search result with additional semantically these additional recommendations, related information. Although domain ex- we exploit semantic relationships perts need to manually construct the initial between data sources. We define se- semantic network, which may only include mantic networks for interrelated data a small number of sources, we introduce sources and present an algorithm to an algorithm to let the network expand automatically refine, augment, and and evolve automatically based on usage expand an initial and relatively small patterns. Then, we present the semantic semantic network with additional and network to the user as a visual display of relevant data sources; we also exploit a hyperbolic tree; we claim that semantic user profiles to tailor resulting data networks provide an elegant and compact sources to specific user preferences. technique to visualize considerable amounts 2. Visualization and Navigation of Rel- of semantically relevant data sources in a evant Data Sources: The semantic simple yet powerful manner. network with the additional sources Once users have finalized a set of envi- is shown to the user as a visual hy- ronmental data sources, based on semantic perbolic tree improving usability by networks, they can access the actual sources showing the semantic relationships to extract data and perform techniques for among relevant data sources in a vi- knowledge discovery. We introduce a new sual way. After the user has decided approach to integrate urban environmental on the choice of relevant data sources data and provide scientists with semantic of interest (based on our metadata ap- techniques to navigate and discover patterns proach) and has accessed the actual in very large environmental datasets. data, we also assist the user in navigat- Our system provides access to a mul- ing through the plethora of environ- titude of heterogeneous and autonomous mental data using visualization and

Copyright © 2007, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of Idea Group Inc. is prohibited. Journal of Database Management, 18(1), 43-67, January-March 2007 45

navigation techniques that describe have been written over the years (Batini, data at multiple levels of resolution, Lenzerini, & Navathe, 1986; Ouksel & enabling pattern and knowledge Sheth, 1999; Rahm & Bernstein, 2001). discovery at different semantic lev- There has been a significant amount of els. We achieve that, using wavelet work on , especially on transformation techniques, and we resolving discrepancies of different data demonstrate resilience of wavelet schemas using a global (mediated) schema transformation to noisy data. (Friedman, Levy, & Millstein, 1999; Levy, 3. Implementation of a Prototype Rajaraman, & Ordille, 1996; Miller et al., System: Finally, we have designed 2001; Papakonstantinou, Garcia-Molina, and implemented a prototype system & Ullman, 1996; Rahm & Bernstein, as a proof of concept for our tech- 2001; Ram, Park, & Hwang, 2002). More niques. Using this system we have recently, there exists work on decentralized demonstrated the feasibility of our data sharing (Bowers, Lin, & Ludascher, contributions and have conducted 2004; Doan, Domingos, & Halevy, 2003; a set of experiments verifying and Halevy, Ives, Suciu, & Tatarinov, 2003; validating our approach. Hyperion, Tatarinov, & Halevy, 2004) and on integrating data in web-based This article is organized as follows. (Bowers et al., 2004; Chang, He, & Zhang, First, we present related work on data inte- 2005; Dispensa & Brulle, 2003). Clustering, gration using semantics, and on exploration classification and ontologies have also been of multi-dimensional data. Next, we present extensively used as a tool to solve semantic our research methodology on semantic net- heterogeneity problems (Jain & Zhao, 2004; works and pattern discovery with wavelet Kalfoglou & Schorlemmer, 2003; Ram transformations. Then, we describe our & Park, 2004; Sheth et al., 2004; Sheth, prototype implementation and the experi- Arpinar, & Kashyap, 2003; Sheth et al., ments we conducted. Our conclusions are 2002; Zhao & Ram, 2002, 2004). presented in the final section. All the previously mentioned work takes a deep integration approach, where Related Work the data schemas (or query interface for integrating web databases) of all sources are Data Integration integrated. However, this approach is often There is a rich body of existing too restrictive for environmental research work on data integration problems. The because: (1) there are so many different fundamental problem is to enable inter- types of data collected by so many different operation across different heterogeneous groups that it is impractical for all of them sources of information. In general, this to agree on a universal mediated schema; (2) problem manifests itself either as schema unlike companies, environmental research- mismatches (schema integration) or data ers often share data in an ad hoc way, e.g., incompatibilities (data integration) while a company may purchase products from accessing disparate data sources. Several several fixed suppliers while environmental surveys identifying problems and proposed researchers may use any dataset collected approaches on schema and data integration by other researchers but related to their current research task.

Copyright © 2007, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of Idea Group Inc. is prohibited. 46 Journal of Database Management, 18(1), 43-67, January-March 2007

There has been much effort by the tive/negative association between classes; ecology research community to integrate in our article, we do not restrict our work to its data (EML, ORS, SEEK). These sys- these types of classes only, instead, we let tems take a shallow integration approach the users identify the degree of relevance where only metadata is integrated; they between data sources as their own semantic allow researchers to store metadata in a interpretation. Although our approach gives centralized database and to select datasets more emphasis on the user’s semantics, it by searching the metadata using keyword may require more manual work to calculate or SQL-based search. Such systems avoid the semantic relationships in the semantic the problem of defining a global-mediated network, since it is user-based. To reduce data schema and allow researchers to share the amount of manual work, we start with a data in an ad hoc way. A semantics-based small manually created semantic network, integration approach for geospatial data and then we apply an algorithm that we de- is presented in (Ram, Khatri, Zhang, & signed and implemented, to automatically Zeng, 2001). expand, refine, and augment the semantic The main problem of existing systems network by taking advantage of observed for integrating environmental data is that usage patterns. Another difference with they provide limited support to assist users (Fankhauser et al., 1991) is the way that in finding data sources semantically related the degree of relevance is calculated. to their research. Most existing systems as- They use triangular norms (T-norms) from sume researchers have full knowledge of fuzzy logic, while we use conditional what keywords to search and provide no probabilities. Relevant to our work is the assistance in selecting data sources based topic of discovering and ranking semantic on relationships between them. However, relationships for the (Ale- unlike business applications, environmental man-Meza, Halaschek-Wiener, Arpinar, research is more explorative and research- Ramakrishnan, & Sheth, 2005; Sheth et ers are interested in searching semantically al., 2004). However, relationship ranking related datasets. Although experienced is not in the scope of this article. researchers may find all related keywords, inexperienced researchers such as gradu- Using Wavelets for Exploration of ate students may have trouble doing this. Multidimensional Data The only exception is the SEEK project In order to study long-term envi- (Bowers et al., 2004; Bowers & Ludascher, ronmental factors, we need to evaluate 2004), which uses an ontology for ecology measures across multiple dimensions such concepts to find related datasets. However, as time and geographic space at different SEEK assumes the ontology will be com- dimensional hierarchies. An example of the pletely given by domain experts, while type of queries that have to be answered our approach augments such knowledge is, “How do stream temperature and pre- by incremental and automatic refinement cipitation change over time and space?” In of the semantic network. order to answer such queries, the system There has also been work on discov- must integrate diverse sets of information, ering semantic similarity in (Fankhauser, which is typically facilitated by dimensional Kracker, & Neuhold, 1991) based on modeling techniques (Kimball, 2002) and generalization/specialization, and posi- online analytical processing (OLAP). The

Copyright © 2007, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of Idea Group Inc. is prohibited. Journal of Database Management, 18(1), 43-67, January-March 2007 47 challenge stems from the fact that such Li, & Jhingran, 2004; Vitter & Wang, 1999; dimensional models grow exponentially Vitter, Wang, & Iyer, 1998), data cleaning, in size with the number of dimensions and and time-series data analysis (Percival & dimensional hierarchies. Current OLAP Walden, 2000). However, no study has been techniques, however, rely on the intuition done on utilizing wavelet transformation to of the decision maker in navigating through provide decision support. We use wavelets this lattice of cuboids and only provide to provide coarse, low-resolution views to navigation tools such as drill down and researchers with the capability of retriev- roll up. There have been very few attempts ing high-resolution data by zooming into made to address this issue, most notably selected areas. the work done by (Sarawagi, Agrawal, & Generally speaking, wavelet trans- Megiddo, 1998) and (Kumar, Gangopad- formation is a tool that divides up data, hyay, & Karabatis, in press). However, functions, or operators into different fre- the major deficiency of the existing work quency components and then studies each in this area is that the volume of data after component with a resolution matched to its a few drill-downs becomes prohibitively scale (Daubechies, 1992). A wavelet has large, which hinders the effectiveness of the many desirable properties such as compact method. In order to help end users (scientists support, vanishing moments and dilating or engineers) discover and analyze patterns relation and other preferred properties such from large datasets, we have developed a as smoothness (Chui & Lian, 1996). The methodology for visualization of data at core idea behind a discrete wavelet transfor- multiple levels of dimensional hierarchy mation (DWT) is to progressively smooth and pattern discovery through data mining the data using an iterative procedure and techniques (Han & Kamber, 2000; Mitchell, keep the detail along the way. The DWT 1997) at multiple levels of resolution. is performed using the pyramid algorithm The last decade has seen an explosion (Mallat, 1989) in O(N) time. of interest in wavelets (Daubechies, 1992; Goswami & Chan, 1999), a subject area that Research Design and has coalesced from roots in mathematics, Methods physics, electrical engineering and other disciplines. Wavelets have been developed Overview of the Architecture as a means to provide low-resolution views The overall architecture of our system of data with the ability to reconstruct high- is shown in Figure 1. It consists of a data resolution views if necessary. Wavelet integration component, a data warehouse, transformation has been applied in numer- and visualization, navigation, and pattern ous disciplines such as compression and de- discovery component, all for the semantic noising of audio signals and images, finger utilization of heterogeneous data sources print compression, edge detection, object depicted on the left. The data integration detection in two-dimensional images, and component consists of a metadata reposi- image retrieval (Stollnitz, Derose, & Sale- tory, a semantic network, and a set of con- sin, 1996). There have been few studies on version functions. The metadata repository approximate query answering through lossy stores information about the source data compression of multi-dimensional data including descriptions of each particular cubes (Matias, Vitter, & Wang, 1998; Smith, source along with information on its syntax

Copyright © 2007, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of Idea Group Inc. is prohibited. 48 Journal of Database Management, 18(1), 43-67, January-March 2007

Figure 1. System architecture with data integration and knowledge discovery compo- nents

ts

en

Web Cli

n

y

io

er

&

on

y

g

ta

ti

scov

n

lu

er

da

Navigat

so

Di

of

n,

re

n

scov

i-

io

Patter

ew

er

lt

Di

tt

vi

at

Mu

iz

Data Minin

al

d Pa

su

an

Vi

e

ts

t

us

en

data

le

ci

al

ve

Data

Wa

Wareho

coeffi

Origin

on

ti

s

ry

on

ic

ra

si

to

on

nt

ti

eg

er

ma

Network

Se

Metadata

Func

Reposi

Conv

Data Int

ce

ce

ce

ce

ur

ur

ur

ur

eet

DB

DB

File

So

So

So

So Spread sh and semantics. In our approach the metadata data to the canonical form if needed through layer is not a global schema. Instead we conversion functions (on measurement collect various information artifacts about units, formats, etc.), as explained in the the sources to assist in finding relationships following section. and correspondences among data in differ- The semantic network contains rela- ent sources. We also store information on tionships between sources. Users request how to access the data (including location data sources by searching the metadata re- identifier, access method, access rights, pository and our system will automatically username, etc.) and how to transform the use the semantic network to return not only

Copyright © 2007, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of Idea Group Inc. is prohibited. Journal of Database Management, 18(1), 43-67, January-March 2007 49 the requested ones but also to recommend with details from both scientific and storage additional and related data sources that us- perspectives. For example, many ecosystem ers might not know about. Once the users study projects collect data related to climate decide on the final selection of the sources (e.g., precipitation depth, wind speed, wind they may download data directly to their velocity, air temperature, humidity), soils local machines. Data being downloaded (e.g., temperature, water content, trace can be converted to canonical form for gases), and streams (e.g., depth, flow rate, possible analysis. This is achieved by a temperature, nutrients, pathogens, toxics, set of conversion functions that are part of biota). For each such category, we store its the integration component. Data that are definition, measurement unit, collection integrated are stored in the data warehouse. frequency, and measurement location, to The data warehouse is a multidimensional create an accurate description of what is model of commonly used source data, being collected, how it is measured, where which also stores wavelet coefficients. it is stored, and how it is accessed. Usu- Once users have obtained data, they can ally, this type of information is available visualize, navigate, and discover patterns from the data sources themselves. It is part at different dimensions and resolutions to of a routine process to specify specific aid knowledge discovery. metadata information when users submit data at the data sources. Additional infor- Data Integration mation may also be stored from external In this section, we address issues re- sources (e.g., the Open Research System lated to (1) data sources and relationships (ORS)). In general, information about data that form among them, (2) semantic net- sources is not significantly large, especially works, for recommendation of additional when compared with the amount of actual and relevant data sources visualized as hy- data at the sources; metadata information perbolic trees, and (3) automatic expansion can be collected from the sources either and augmentation of the semantic network automatically (through an application by observing user patterns. programming interface (API) if available) or manually. All such information is kept Describing Data Sources and their Rela- in the metadata repository and it serves tionships the purpose of a universal registry; similar The plethora of diverse data in en- but not identical to universal description, vironmental research poses significant discovery and integration (UDDI) for Web integration problems. Some data sources services. The metadata repository, stored in may be structured or semi-structured data- an object-relational database, is augmented bases with varying data models (relational, with information on additional data sources object-oriented, object-relational, etc.); as needed. some may be available as spreadsheets, This work expands on the specification while others may be flat files. Data may of relationships among database objects also contain spatial information in raster stored in heterogeneous database systems or vector formats. (Georgakopoulos, Karabatis, & Gantima- We take a metadata approach, in which hapatruni, 1997; Karabatis, Rusinkiewicz, we store information about the data, which & Sheth, 1999; Rusinkiewicz, Sheth, & is collected and stored in the metadata layer Karabatis, 1991; Sheth & Karabatis, 1993).

Copyright © 2007, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of Idea Group Inc. is prohibited. 50 Journal of Database Management, 18(1), 43-67, January-March 2007

We have created a methodology allowing and each stream reach belongs to a sub- researchers to derive semantic relation- watershed (the land area that drains to a ships among data sources based on source particular point along a stream segment descriptions in the metadata layer. These and is represented by a polygon feature). semantic relationships form a semantic Land use/land cover data is also collected network of related information, which as- on areas represented by polygons (although sists users to discover additional informa- these polygons are different and smaller tion, relevant to their search but possibly than polygons for sub-watersheds). Thus, unknown to them. We realize that some we aggregate stream data to sub-watershed relationships may not be captured initially level, and then aggregate land use/land in the metadata repository, especially when cover data to areas represented by the semantic incompatibilities prevent direct same set of polygons for sub-watersheds identification of data (such as problems using re-projection, spatial joins, or overlay related to synonyms, homonyms, etc.). functions provided by ArcObjects, the API Nevertheless, missing relationships are included in the ESRI’s ArcGIS software captured and added to the metadata reposi- suite (www.esri.com/software/arcgis). tory by observing researchers’ usage pat- terns when they interact with the semantic Using Semantic Networks to Expand network, as will be explained further in the User Queries current section. The notion of relationships In this section, we provide details on between concepts is also related to the the creation and utilization of semantic topic maps or concept maps (TopicMap), networks. We formally define semantic and Semantic Web (W3C) for XML and networks and we present techniques to web documents containing metadata about extract information from semantic networks concepts. However, our work does not limit and recommend additional and relevant itself only to XML or web data, but can be data sources to users in their search of used to describe data in general. related data sources. We also present an algorithm to automatically refine, and Converting Data to a Canonical Form dynamically augment semantic networks; Environmental data sources may have Semantic networks have long been used to differences in formats, data units, spatial represent relationships (Masterman, 1961). and temporal granularities, and may be col- We take advantage of their structure to lected at different time intervals. We have elicit additional semantic information for implemented methods and/or applications environmental data. to convert between different units and formats. In addition, spatial and temporal Definition 1: We formally define a semantic disparities are resolved using spatial and network G(V,E,W) as a graph G where: temporal join/aggregation operations and integrating data at the appropriate level. • V is the set of nodes in the network. As an example, suppose that we need to Each node represents a data source or integrate stream chemical and biological data set. For convenience, we use data data collected at each site with land use source and data set interchangeably and land cover data. In our data warehouse in this article. model, each site belongs to a stream reach,

Copyright © 2007, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of Idea Group Inc. is prohibited. Journal of Database Management, 18(1), 43-67, January-March 2007 51

Figure 2. An example semantic network Using chain rules and assuming all conditional probabilities are independent

Fish 0.9 Stream (Rice, 1994), we have: Population Temperature P(surfaces, temperature | fish) = 0.9 P(temperature | fish) * P(surfaces | tem- Land Use/ 0.9 Impervious perature) = 0.9 * 0.9 = 0.81. Land Cover Surfaces

In general, if vi and vj are two nodes,

there are k paths p1,…, pk between vi and

• E is the set of directed edges in the vj, where path pl (1 <= l <=k) consists of nodes v ,…, v (|p | is the length of network. An edge (vi, vj) indicates that l1 l|pl|+1 l path p ). node vi is semantically related to node l The relevance score rs between v vj. i • W is a |V| * |V| relevance score matrix, and vj is where W(i,j) is a number in range of rs = min( 1, w(l ,l )) [0,1] and represents the degree of rel- ∑ ∏ i i+1 pl 1≤i≤|pl | (1) evance (or relevance score) between nodes v and v . j i The above formula computes the rel- evance score between v and v as the sum Figure 2 depicts an example semantic i j of relevance scores for all paths connecting network related to fish population. The num- v and v . For each such path, the relevance ber on each edge represents the relevance i j score between the two endpoints is com- score associated with the two adjacent puted as the product of relevance scores for nodes. Based on these scores, we can infer all edges along the path. There can be more the relevance between any two nodes in detailed types of semantic relationships the network. We consider each relevance (cause-effect, is-a, and is-part-of), or to use score as a conditional probability and as- more advanced inference rules without the sume they are independent of each other independent assumption on the conditional (Rice, 1994). For example, the relevance probabilities, but these extensions are be- score between fish population and stream yond the scope of this article. temperature can be considered as the condi- tional probability of a researcher interested Construction of Semantic Network in stream temperature given that he or she We assume that domain experts have is interested in fish population. provided an initial semantic network, i.e., a Using the standard notation for con- set of edges and nodes with their relevance ditional probability, we have: scores. Based on this initial semantic net- work, we compute the relevance scores P(surfaces | fish) = P(surfaces, stream between any pair of nodes in the network, temperature | fish)because the user will be and create the matrix W. interested in impervious surfaces, assum- Let us consider the example in Figure ing the user is also interested in stream 2. Suppose matrix R stores the relevance temperature. scores of all edges in the initial semantic

Copyright © 2007, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of Idea Group Inc. is prohibited. 52 Journal of Database Management, 18(1), 43-67, January-March 2007 network. The first, second, third, and fourth which in turn will find data sources that row (column) in the matrix corresponds to directly satisfy the user’s conditions. We edges from (to) fish, temperature, surface call these data sources exact answers. In data, and land data. Rij stores the relevance addition to the exact answers, we describe a score from node i to node j. novel approach to enhance and augment the result set with additional sources, semanti- 0 0.9 0 0    cally relevant to the exact answers, which 0 0 0.9 0  R = the user might not be aware of. We achieve 0 0 0 0.9   this goal by exploiting the semantic net- 0 0 0 0    work, and returning all data sources whose relevance score with the exact answers is Based on formula (1), the relevance higher than a threshold. For simplicity, we score between any pair of nodes equals the have set the threshold in our system to 0.5 sum of relevance scores of all paths between but a user can adjust it according to how them. Using matrix multiplication rules, closely additional data sources should be and for any given pair (i, j) with i ≠ j, we related to the exact answers. calculate the sum of relevance scores of all For example, suppose a user wants to paths between i and j with length k. It is find all data sources related to ‘fish popula- k k equal to R ij where R is the multiplication tion.’ The exact answer contains only the of k matrices R. For example, the relevance fish population data set because only that scores of all paths with length 2 is: data set contains the keyword ‘fish popula- tion.’ However, using the semantic network 0 0 0.81 0    in Figure 2, our system will return all other 0 0 0 0.81 R2 = R * R = 0 0 0 0  three data sources in the figure because they     0 0 0 0  are also related to the fish population ac- cording to the semantic network. Therefore, we can automatically recommend to the There are two non-zero entries: R2 13 users additional semantic information (data = 0.81, identifying the relevance score sources) relevant to the exact answers. between fish data to surface data, and R2 = 0.81 identifying the score between 24 Visualizing Semantic Networks temperature and land data. Hence, the rel- Most existing systems for ecological evance score rs between any pair of nodes research (EML) return data sources as a list, in the network can be computed using the and it is difficult for users to go through them following formula: when the list is long. Our system utilizes a hyperbolic tree technique (Lamping, Rao, rs = R i ∑ & Pirolli, 1995) to visualize data sources 1≤i≤N (2) and the semantic relationships they form. Figure 3 shows an example of such a tree. Using Semantic Networks to Elicit Ad- The main benefit of this technique is to ditional Semantics show users the entire set of exact answers A user in search of ecosystem data and additional related data sources at a sources may perform a keyword search or glance, as a visualization of all relevant submit a regular SQL query to our system, nodes and edges forming a semantic net-

Copyright © 2007, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of Idea Group Inc. is prohibited. Journal of Database Management, 18(1), 43-67, January-March 2007 53

Figure 3. Visualiing a semantic network as a hyperbolic tree

work. In addition, users can dynamically collect information that is used to dynami- adjust the display size of a data source of cally refine and augment the network based their choice and automatically bring it to on these patterns. Therefore, although an the foreground concentrating on a specific initial profile may not completely satisfy data source and all its edges connecting it every user, it will adapt to user preferences to the relevant sources. after some period of time.

Dealing with Different User Prefer- Refining and Augmenting the Semantic ences Network We also consider the issue that differ- The key idea to automatically refine, ent domain experts may not have the same evolve, and augment the semantic network interests; instead they may need to utilize is to monitor usage patterns. Once the initial different semantic networks (if available) semantic network (or a given profile) has pertaining to their own specialties. For been created by a domain expert, users example, a stream chemist may not be can query the metadata repository for data interested in land use/land cover, contrary sources. The system provides exact answers to an urban developer who would certainly and recommends additional data sources focus on it. We address this problem by (displayed visually as in Figure 3). Then, creating different user profiles, each cor- the users select (click on) those data sources responding to a separate semantic network potentially relevant to their research. They with its own bias towards a certain specialty. submit their queries to the data sources, Initially, domain experts will define a set while the metadata repository keeps copies of profiles. A new user will select a profile of queries to identify query patterns. Based before using our system, and can change this on the usage of these patterns by users, we selection at any time. For each profile, we can infer additional relationships that form also track the usage patterns by users and between data sources. These relationships

Copyright © 2007, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of Idea Group Inc. is prohibited. 54 Journal of Database Management, 18(1), 43-67, January-March 2007

Algorithm 1. Automatic refinement of semantic network

Input: current network N, a set of usage patterns {Q1, …, Qm}, where each Qi consists of a set of exact answers and a set of related answers selected by users. Output: a refined network N’ 1. Create N’ as exact copy of N 2. For each user query Qi, a. Identify all edges in the current network N that link two selected sources and add them to a multi-set S1. b. For any source selected by users but is not covered by an edge in N, generate an edge from the exact answer to that source and add it to a multi-set S2. 3. For each edge AB in existing network N There are three possible cases: a. AB appears in S1. The new relevance score r(AB) equals r(AB) * d + Occ(AB) / Occ(A) * (1-d) where d is an aging factor ranging from 0 to 1, Occ(AB) is the number of times edge AB appears in S1, and Occ(A) is the number of times node A is selected in usage patterns. b. AB does not appear in S1, and A is never selected. The score of AB remains unchanged. c. AB does not appear in S1, but A is selected. The new score equals r(AB) * d 4. For each edge AB in S2, add it to the new network N’ with relevance score Occ(AB)/ Occ(A), where Occ(AB) is the number of times edge AB appears in S2, and Occ(A) is the number of times node A is selected in usage patterns.

are used to automatically expand, enhance source that is not covered by any existing and refine the semantic network. edges that the user agrees with. Thus, we As an example, suppose two users have propose the Algorithm 1 to automatically asked for data sources related to ‘fish popu- augment and refine the network. lation.’ User1 selects all four data sources This algorithm first creates a copy of in Figure 2, while User2 selects only fish, the current network at step 1. At step 2a) temperature, and land data. Let F, T, S, L it identifies the edges that users agree on represent fish, temperature, surfaces, and based on usage patterns. At step 2b), the land, respectively. We assume that users algorithm identifies new edges not in the agree to incorporate every edge connect- current network, but necessary for users ing two sources that they selected in the to select those sources connected by these network, but disagree with all other edges edges. For instance, in the above example, of sources they did not select. For example, if the usage patterns consists of Q1 = {F, T,

User1 agrees with the edges F-T, T- S, and S, L}, and Q2 = {F, T, L}. At step 2a), the

S-L. However, User2 agrees with the edge algorithm will add to S1 edges F-T, T-S, S-

F-T, but not the other two. The issue is L for Q1, and F-T for Q2. Thus, S1 = {F-T, how User2 selects the land data, which is T-S, S-L, F-T}. At step 2b), the algorithm only related to fish via surface data in the will add to S2 edge F-L. Thus, S2 = {F-L}. current network, and User2 does not select At step 3, the algorithm re-computes the surface. We assume the user agrees with relevance scores for the existing edges. The relationship between fish and land, where new score consists of two components, the fish is an exact answer and land is a selected first component is the current score, and the

Copyright © 2007, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of Idea Group Inc. is prohibited. Journal of Database Management, 18(1), 43-67, January-March 2007 55 second component is the score based on patterns, trends, and surprises. The main usage patterns. These two components are benefit of using wavelets compared to using combined using a weight d, which is also fixed levels of resolutions is that wavelets called an aging factor because it determines allow finer and more flexible levels of how quickly the new score converges to the resolutions. For example, fixed levels al- usage patterns. We set the aging factor d low users to view stream temperature at = 0.5 in this article. In the above example, minutes, hours, and days, while wavelets the new scores are: allow users to view the temperature at one minute, two minute, or four minute spans, R(fish-temperature) = 0.9 * 0.5 + 1 and so on. * 0.5 = 0.95 In this article, we apply wavelet R(temperature-surfaces) = 0.9 * 0.5 transformations—we used Haar wavelets + 0.5 * 0.5 = 0.7 (Goswami & Chan, 1999), and we are cur- R(surfaces-land) = 0.9 * 0.5 + 0.5 * rently experimenting with other wavelet 0.5 = 0.7 transforms—for numerical attributes. If the R(fish-land) = 0.5 * 0.5 = 0.25. data contains spatial or temporal attributes (e.g., indicating the location or time the Data Navigation: A Visual Approach measurements were collected), we always Visualization of data can be proven sort the data records in the spatial or tem- to be a significant decision support tool. poral order and apply wavelet transforma- It can provide deep insights into data that tions to the sequence of the measurement are very difficult to capture by automatic attributes in this order. Otherwise, we view means. Since environmental data often have the measurement attributes as a sequence in different spatial and temporal granularities, the order that records are stored in the da- environmental researchers are interested tabase, and apply wavelet transformations. in viewing data at multiple resolutions. Of course, in the latter case, the different For example, a spike in stream flow, pre- levels of resolutions do not have spatial cipitation, and nitrogen content will tell a or temporal meanings, and only provide a scientist that there is an influx of nitrogen lower-resolution view of the data. in the stream due to a precipitation event. The generated wavelet coefficients However, a steady increase or decrease in are then stored in an Object-Relational stream flow, precipitation, and nitrogen database (Oracle 10g). We have developed content for several years will indicate a an algorithm (see Goswami & Chan, 1999) possible change in the longer term. Further- to reconstruct not only the complete set of more, the recent development of wireless the original data but also a certain subset sensors and sensor networks has allowed of it, at appropriate levels of resolutions. for the collection of environmental data at The utility of reconstructing a subset of the high temporal resolution. In consequence, original dataset stems from the fact that a researchers often need to visualize this decision maker may want to find out only data for long time scales, that is, at lower that part of the original dataset that was used resolutions. to generate a particular coefficient. Therefore, we present an effective We developed a visualization tool multi-resolution visualization method us- to help environmental scientists visually ing wavelets to help researchers discover inspect temporal and spatial datasets for

Copyright © 2007, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of Idea Group Inc. is prohibited. 56 Journal of Database Management, 18(1), 43-67, January-March 2007

Figure 4. Visual navigation of data

noticeable trends and relationships. Figure he or she may want to visualize the data 4 depicts a prototype interface developed in using a line graph, bar graph, or scatter Visual Basic which allows users to spatially plot. The visualization is then displayed locate and select data collection sites and in the bottom pane of the interface. The visualize time-series data for the selected slider below the displayed data allows the sites. The top pane connects to the ESRI’s user to control the temporal resolution of ArcSDE® Geodatabase system where the the visualization. The slider goes from the user can navigate spatially using zooming resolution of the raw data on the left to the and panning tools. The bottom pane con- level n wavelet transformation on the right. nects to a DBMS which stores raw data The scale on the slider can change, based along with wavelet-transformed data at on a combination of the time period of the various levels of temporal resolution. The raw dataset and the level of wavelet trans- left side of the interface allows the user formations available. Data at the selected to (1) select the site or sites of interest resolution will be reconstructed from the spatially or from a list, (2) select the time stored wavelet coefficients and shown to period of the visualization, (3) select the users. Figure 4 shows the McDonogh stream dataset (4) select the type of visualization, temperature site along with time series data and (5) interactively control the temporal at the 64 minute resolution for the month resolution of the visualization. The user can of June, 2004. select a site, or sites, either spatially by using the GIS interface, or by selecting specific Pattern Discovery through Multi-Resolu- sites based on the site name. Once a site is tion Data Mining chosen, the user can select the time period Multi-resolution data mining is of the visualization by providing the date similar in concept to online analytical and time. Then the user can select whether mining (OLAM) (Han, 1998; Han, Chee,

Copyright © 2007, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of Idea Group Inc. is prohibited. Journal of Database Management, 18(1), 43-67, January-March 2007 57

& Chiang, 1998). Conceptually, it allows demonstrated that for a real data set a user to mine the data at different levels and a benchmark data set, wavelet of the dimension hierarchy. We propose to transformation preserved most pat- augment the dimensional hierarchies with terns in the data while it was used to wavelet coefficients at different levels of convert data to a lower resolution. decomposition and provide mining capa- More interestingly, our results also bilities including association rule mining, showed that wavelet transformation classification, and clustering. This approach is very robust to noise in data and in provides the benefits of OLAM, but in some cases even improved the quality addition, it enables users to select the ap- of discovered patterns. propriate levels of resolution that would be ideally suited for mining the data. If We first describe the implementation the data is noisy, wavelet decomposition details, and then proceed to experimental could reduce noise in the data and would results. result in a better classifier. We illustrate the efficacy of using wavelet decomposition in Implementation classification and its resilience on noise in We used Oracle 10g to store metadata the following section. of data sources and semantic networks us- ing the database schema in Figure 5. We implementation and use three relational tables (sources, edges, experimental results network) to store information about data We have conducted preliminary ex- sources, keywords, and relevance scores. periments to validate our approaches of The Edges table stores the edges and their using semantic networks to help environ- relevance scores in the semantic network. mental researchers find related data sources The Network table stores the relevance and using wavelets to identify patterns in scores between any pair of nodes in the different data resolutions. Our major find- semantic network, which is computed ings are: from the Edge table. We implemented the algorithms described in the previous section • Users of our system concluded that as PL/SQL stored procedures for the con- our query expansion and visualization interface surpasses the traditional exact query interface. In all cases we Figure 5. Database schema for the semantic tested, our query expansion interface network returned more data sources than the exact query interface. They also found Edges PK src value in the automatic adaptation and PK dest

augmentation of the semantic network score based on profiles and refinement Sources techniques. PK id • Wavelet transformation is a promising source_name Network keywords PK src tool to discover patterns at different PK dest resolutions of data. Our experiments score

Copyright © 2007, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of Idea Group Inc. is prohibited. 58 Journal of Database Management, 18(1), 43-67, January-March 2007

Figure 6. Query interface using semantic networks

Figure 7. Visualization of results of Figure 6

struction, query expansion, and dynamic the ‘view network’ button. Figure 7 shows augmentation of the semantic network. the hyperbolic visualization of the results We implemented a semantic network obtained in Figure 6. We use a publicly query interface (written in Visual Basic) available Hyperbolic Tree Java Library for researchers to search data sources with (http://sourceforge.net/projects/hyper- semantic terms related to their research; this tree/) to display hyperbolic trees. Users interface is based on the semantic network can also record their selections by first and metadata repository and is shown in checking the sources of interest and then Figure 6. The user first needs to select a clicking ‘record.’ Recorded selections are profile then provide keywords and finally used as usage patterns to dynamically aug- submit the search to database. Our query ment the network as described in the pre- expansion procedure will augment the vious section. We have also implemented query and return all sources related to the a Haar wavelet transformation as a stored given keywords in the results window. The procedure in an Oracle server and inserted user can then visualize the relationships (as the results into a table, which will be later edges) between returned sources by clicking used for pattern discovery.

Copyright © 2007, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of Idea Group Inc. is prohibited. Journal of Database Management, 18(1), 43-67, January-March 2007 59

Table 1. Data sources used in semantic network experiments

Data Source Name Description Location Vegetation Riparian vegetation of the Gwynns Falls http://www.ecostudies.org/pub/besveg/ Watershed in Baltimore area riparian.zip Hydrology Streamflow data collected along the http://waterdata.usgs.gov/md/nwis/ Gwynns Falls in Baltimore area nwisman?site_no=01589352 Stream Temperature Stream temperature of the Gwynns Falls As an Excel file on local file server Watershed Meteorology Baltimore meteorological station data http://www.ecostudies.org/pub/bes_ 206.zip Stream Chemistry Stream chemistry data of the Gwynns As a text file on local file server Falls Watershed Landscape Satellite image data of Baltimore area As a text file on local file server landscape (forests, grass, crops, etc.)

Experiments with Semantic Networks to take on alternate roles of three different types of users and then we selected one of Setup: We evaluated our semantic network the three profiles. The researcher posted approach using data sets collected by the three example queries as follows: Baltimore Ecosystem Study (http://www. beslter.org/). Table 1 summarizes the details • Query 1: What data sets are related of these data sets. to riparian vegetation? The researcher selected profile 1 and searched the data We asked an environmental researcher sources with keyword ‘vegetation.’ to define the edges in the initial semantic • Query 2: What factors contribute to network between these data sets. The re- the fluctuations in stream tempera- searcher created three different semantic ture? The researcher selected profile networks corresponding to three different 2 and used ‘stream temperature’ as profiles of users interested in vegetation, the keyword. stream temperature, and stream chemistry • Query 3: What factors may affect respectively. Figure 8 shows the networks the stream chemistry? The researcher where Pi identifies the score in the thi profile. selected profile 3 and used ‘stream In this experiment, the researcher consid- chemistry’ as the keyword. ered the relationships bidirectional. In the second experiment, we asked We ran two experiments to test our the researcher to select a set of data sources search interface and the semantic network in the results of Query 4 that he thought refinement algorithm. In the first experi- was most related to the question he asked. ment, we asked another researcher to use We then ran our algorithm to refine the our search interface to find related data semantic network based on his selection sources and asked him to give us feedback and compared the results for Query 4 with on the appropriateness of the results. Due to the results using the original network. Our limited resources, we asked that researcher

Copyright © 2007, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of Idea Group Inc. is prohibited. 60 Journal of Database Management, 18(1), 43-67, January-March 2007

Table 2. Adapted from A Primer on Landsat 7 (http://imaging,geocomm,com/ features/sensor/landsat7)

Spectral Bandwidth Ranges for Landsat 7 ETM+ Sensor (µ) Band Number Wavelength Range Recommended Application soil and vegetation discrimination and forest type Band 1 0.45 - 0.52 (blue-green) mapping Band 2 0.53 - 0.61 (green) vegetation discrimination, plant vigor Band 3 0.63 - 0.69 (red) detection of roads, bare soil, and vegetation type biomass estimation, separation of water from vegetation, Band 4 0.78 - 0.90 (near-infrared) soil moisture discrimination Band 5 1.55 - 1.75 (mid-infrared) discrimination of roads, bare soils, and water 10.4 - 12.5 (thermal Band 6 measuring plant heat stress and thermal mapping infrared) discrimination of mineral and rock types, interpreting Band 7 2.09 - 2.35 (mid-infrared) vegetation cover and soil moisture Band 8 .52 - .90 (panchromatic) for enhanced resolution and increased detection ability

search interface returned the following due to the refinement. This reflected the results: user selection. In summary, our experiments veri- • Query 1: Vegetation, hydrology, fied that our system exploits data source stream temperature, and landscape. relationships that are maintained in se- • Query 2: Stream temperature, me- mantic networks and supplies users with teorology, and landscape. additional data sources that are relevant • Query 3: Hydrology, meteorology, to their original search, but they might not stream chemistry, and landscape. be aware of.

In all cases, the exact search interface Experiments for Knowledge Discovery only returned one data source with the using Wavelet Transformations search keyword, while our search interface We conducted several experiments to returned multiple sources (4 for Query 1 test our hypothesis that wavelet transforma- and 3, and 3 for Query 2). We also asked tion results in preserving patterns in data. the researcher to look at the results returned In the first experiment, we used remote by our interface, and he found the answers sensing data from the Landsat 7 ETM+ returned by our search interface actually sensor. Data from the Landsat 7 ETM+ related to these research questions. sensor is typically used by environmental In the second experiment, the user scientists to characterize the landscape in selected only the first three data sources for terms of land cover. The Landsat 7 ETM+ Query 3. When the researcher ran Query sensor captures wavelength values for 8 3 on the refined network, the ‘landscape’ spectral bands based on the reflectance of data source is no longer in the search results the earth’s surface. Table 2 shows the range

Copyright © 2007, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of Idea Group Inc. is prohibited. Journal of Database Management, 18(1), 43-67, January-March 2007 61 of wavelengths captured in each band and disjoint sets – a test set and a training set; its recommended application. (2) performed first level of Haar wavelet We downloaded a Landsat 7 ETM+ transformation on the raw data; (3) created scene from October 5, 2001, covering a 50% stratified sample of the training set central Maryland (path 15/row 33), from of the raw data; (4) created three classifiers the Global Land Cover Facility (http://glcf. based on raw data, 50% stratified sample, umiacs.umd.edu/data/). We extracted and wavelet transformed data; and (5) spectral information from the Landsat im- compared the predictive accuracy on test age and a subset was evaluated for a 1.2 data for the three classifiers. Two sets of km2 area in northern Baltimore County, three different classifiers were built using Maryland. We then manually classified land an adaptive Bayes network, a naïve Bayes cover values of crop, grass, forest, or water method, and support vector machines (with based on high resolution aerial photography. linear kernel function) for the raw data The resulting dataset consisted of eight and approximate coefficients from a Haar attributes representing the spectral bands wavelet transformation of the raw data. We and one class attribute representing the four used Oracle 10g as the database and Oracle distinct land cover values. The spectral Data Miner for the mining functions. The bands were used to identify whether the reason for testing with a stratified sample land cover is ‘grass,’ ‘forest,’ ‘water,’ or is that the Haar wavelet transform reduces ‘crops’.This yielded 1193 instances that the data size by half. Hence the size of the were divided into two groups. Group 1 had training set for raw data is twice as large 616 instances that were used for training as that of the wavelet transformed data. and group 2 had 577 instances that were As shown in Figure 9, in two out of three used for testing. We performed the follow- methods (naïve Bayes and support vec- ing steps. We (1) divided the data into two tor machine), the use of a Haar wavelet

Figure 9. Performance comparison

Performance comparison

0.89

0.88

0.87

0.86 y

ac 0.85 ur Raw data acc 0.84 50% stratified sample ve ve ti Wavelet transfomed data ic 0.83 ed Pr 0.82

0.81

0.8

0.79 Adaptive Bayes Network Naïve Bayes Support Vector Machine Data mining algorithms

Copyright © 2007, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of Idea Group Inc. is prohibited. 62 Journal of Database Management, 18(1), 43-67, January-March 2007

transform resulted in a better classifier than Smith et al., 2004; Vitter & Wang, 1999; both (1) the raw data with twice the size Vitter et al., 1998). The only difference is of the training set and (2) a 50% stratified that the levels of resolutions do not have sample that had the same size of the train- temporal or spatial meanings. While more ing set. For the adaptive Bayes network, a research is needed to establish the efficacy wavelet transform resulted in 2% loss of of wavelet transformation, these prelimi- predictive accuracy as compared with the nary experiments do indicate that wavelet raw data, but had a slightly higher predic- transformation holds promise as a robust tive accuracy when tested with a training tool for pattern discovery at multiple levels set of the same size. of data and as a method for data reduction In addition, we decided to test the in very large datasets with little degradation sensitivity of classifier accuracy on a in predictive accuracy. noisy dataset. For these experiments, we used the Iris plant dataset from the UCI conclusion machine learning repository (http://www. In this article, we have described a ics.uci.edu/~mlearn/MLSummary.html). methodology for data integration and pat- The dataset contained 150 instances with tern discovery for environmental research four attributes and three class labels. The using data semantics. We used semantics attributes represent sepal length, sepal to integrate multiple data sources to answer Figure 10. Performance comparison on noisy data height, petal length, and petal width and the user queries for environmental research. class variable refers to one of three types Our methodology to describe the data of iris plant. Again, the same environment sources is based on a metadata approach was used to test the predictive accuracy of and takes advantage of data interrelation- three classifiers: an adaptive Bayes network, ships represented as a semantic network. a naïve Bayes method, and support vec- User queries are automatically expanded tor machine. In each case, we introduced using a relevance score matrix and a se- random noise following standard normal mantic network, which can be visualized distribution to 10%-40% of the instances. as a hyperbolic tree. We have utilized user As shown in Figures 10a-c, use of the Haar profiles to capture diverse user preferences wavelet transform resulted in a classifier to precisely answer user queries, and have with a comparable predictive accuracy presented an algorithm to automatically to the raw data. It is evident that the raw expand, augment and refine the semantic data outperforms the wavelet transformed network by observing usage patterns. data with 40% noise with the disparity We have demonstrated that our semantic in performance being more pronounced integration techniques offer a powerful, in the adaptive Bayes and naïve Bayes straightforward and user-friendly approach methods. This indicates a threshold in for the visualization of significant amounts the noise-to-signal ratio below which the of environmental data sources. benefit of wavelet transformation is lost. In addition to enabling search for Wavelets can be applied to any numerical data in the integrated system described attributes assuming that the data is sorted above, we also allow users to navigate in the order that they are stored in the data- through multi-dimensional data through base. This approach has also been used by visualization, implemented using wavelet many existing studies (Matias et al., 1998;

Copyright © 2007, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of Idea Group Inc. is prohibited. Journal of Database Management, 18(1), 43-67, January-March 2007 63 transformation. We have used Haar wave- References lets that decompose data by averaging and Aleman-Meza, B., Halaschek-Wiener, C., differencing consecutive, non-overlapping Arpinar, I. B., Ramakrishnan, C., & pairs of data at each level of decomposition. Sheth, A. P. (2005). Ranking complex Thus, users can visualize multiple levels of relationships on the Semantic Web. data and roll-up or drill down at different IEEE Computing, 37-44. levels of hierarchy. They can also apply Batini, C., Lenzerini, M., & Navathe, S. data mining techniques such as classifica- (1986). A comparative analysis of tion at different levels of resolution. We methodologies for database schema have illustrated that patterns in the data are integration. ACM Computing Surveys, well preserved at first level decomposition 18(4), 323-364. with 50% reduction in data size. We have Bowers, S., Lin, K., & Ludascher, B. (2004). also demonstrated the resilience of wavelet On integrating scientific resources transformation to noisy data. through semantic registration. Paper The research presented in this article presented at the Scientific and Statisti- is being enhanced by further development cal Database Management. of the described methodologies and further Bowers, S., & Ludascher, B. (2004). An experimentation with pattern discovery at ontology-driven framework for data multiple levels of resolution. We plan to in- transformation in scientific work- corporate data mining and machine learning flows. Paper presented at the Interna- techniques to aid in the enhancement and tional Workshop on Data Integration refinement of the semantic network. The in the Life Sciences. methodology presented in this article can Chang, K. C.-C., He, B., & Zhang, Z. also be applied to other application areas (2005). Toward large scale integra- where search, visualization, and pattern tion: Building a MetaQuerier over discovery of data from multiple sources databases on the Web. Paper presented are needed. at the CIDR. Chui, C. K., & Lian, J. (1996). A study of Acknowledgments orthonormal multiwavelets. Applied This material is based upon work Numerical Mathematics: Transac- partly supported by the National Science tions of IMACS, 20(3), 273-298. Foundation under Grant Nos. DEB- Daubechies, I. (1992). Ten lectures on 0423476 and BES-0414206 and by U.S. wavelets. Capital City Press. Environmental Protection Agency under Dispensa, J. M., & Brulle, R. J. (2003). grants R-82818201-0 and CR83105801. The Sprawling Frontier: Politics of Although the research described in this Watershed Management. Submitted article has been funded in part by the U.S. to Rural Sociology. Environmental Protection Agency, it has Doan, A., Domingos, P., & Halevy, A. Y. not been subjected to the agency’s required (2003). Learning to match the sche- peer and policy review and therefore does mas of data sources: A multistrategy not necessarily reflect the views of the approach. Machine Learning, 50(3), agency and no official endorsement should 279-301. be inferred.

Copyright © 2007, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of Idea Group Inc. is prohibited. 64 Journal of Database Management, 18(1), 43-67, January-March 2007

EML.Ecological Metadata Language. Jain, H., & Zhao, H. (2004). Federating http://knb.ecoinformatics.org/ heterogeneous information systems software/eml/ using Web services and ontologies. Fankhauser, P., Kracker, M., & Neuhold, Paper presented at the Tenth Americas E. J. (1991). Semantic vs. structural Conference on Information Systems, resemblance of classes. SIGMOD New York. Record, 20(4), 59-63. Kalfoglou, Y., & Schorlemmer, M. (2003). Friedman, M., Levy, A., & Millstein, T. Ontology mapping: The state of the (1999). Navigational plans for data art. Knowledge Engineering Review, integration. Paper presented at the 18(1), 1-31. AAAI/IAAI. Karabatis, G., Rusinkiewicz, M., & Sheth, Georgakopoulos, D., Karabatis, G., & A. (1999). Interdependent database Gantimahapatruni, S. (1997). Speci- systems. In Management of Hetero- fication and management of interde- geneous and Autonomous Database pendent data in operational systems Systems (pp. 217-252). San Francisco, and data warehouses. Distributed and CA: Morgan-Kaufmann. Parallel Databases, An International Kimball, R. (2002). The data warehouse Journal, 5(2), 121-166. toolkit (2nd ed.). Goswami, J. C., & Chan, A. K. (1999). Kumar, N., Gangopadhyay, A., & Karabatis, Fundamentals of wavelets: Theory, G. (in press). Supporting mobile de- algorithms and applications: John cision making with association rules Wiley. and multi-layered caching. Decision Halevy, A. Y., Ives, Z. G., Suciu, D., & Ta- Support Systems. tarinov, I. (2003). Schema mediation Lamping, J., Rao, R., & Pirolli, P. (1995). in peer data management systems. A Focus+Context Technique Based on Paper presented at the ICDE. Hyperbolic Geometry for Visualizing Han, J. (1998). Towards on-line analyti- Large Hierarchies. Paper presented cal mining in large databases. Paper at the Proceedings ACM Confer- presented at the ACM SIGMOD. ence Human Factors in Computing Han, J., Chee, S., & Chiang, J. (1998). Is- Systems. sues for on-line analytical mining of Levy, A. Y., Rajaraman, A., & Ordille, J. data warehouses. Paper presented at J. (1996). Querying heterogeneous the Proceedings of 1998 SIGMOD’96 information sources using source Workshop on Research Issues on Data descriptions. Paper presented at the Mining and Knowledge Discovery VLDB. DMKD, Seattle, Washington. Mallat, S. G. (1989). A theory for mul- Han, J., & Kamber, M. (2000). Data tiresolution signal decomposition: Mining: Concepts and Techniques: The wavelet representation. IEEE Morgan Kaufmann. Transactions on Pattern Analysis and Hyperion. The Hyperion Project. http:// Machine Intelligence, 11, 674-693. www.cs.toronto.edu/db/hyper- Masterman, M. (1961). Semantic message detection for machine translation, ion/.

Copyright © 2007, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of Idea Group Inc. is prohibited. Journal of Database Management, 18(1), 43-67, January-March 2007 65

using an interlingua. NPL, pp. 438- conflicts. IEEE Transactions on 475. Knowledge and Data Engineering, Matias, Y., Vitter, J. S., & Wang, M. (1998). 16(2), 189-202. Wavelet-based histograms for selec- Ram, S., Park, J., & Hwang, Y. (2002). tivity estimation. Paper presented at CREAM: A mediator based envi- the ACM SIGMOD. ronment for modeling and access- Miller, R. J., Hernandez, M. A., Haas, ing distributed information on the L. M., Yan, L., Ho, C. T. H., Fagin, Web. Paper presented at the British R., et al. (2001). The clio project: National Conference on Databases managing heterogeneity. SIGMOD (BNCOD). Record, 30(1). Rice, J. A. (1994). Mathematical statistics Mitchell, T. M. (1997). Machine learning: and data analysis. Duxbury Press. McGraw-Hill. Rusinkiewicz, M., Sheth, A., & Karabatis, ORS.Open Research System. http://www. G. (1991). Specifying interdatabase orspublic.org dependencies in a multidatabase en- Ouksel, A., & Sheth, A. P. (1999). Special vironment. IEEE Computer, 24(12), issue on semantic interoperability in 46-53. global information systems. SIGMOD Sarawagi, S., Agrawal, R., & Megiddo, N. Record, 28(1). (1998). Discovery-driven exploration Papakonstantinou, Y., Garcia-Molina, H., of OLAP data cubes. Paper presented & Ullman, J. (1996). Medmaker: A at the International Conference on mediation system based on declara- Extending Database Technology. tive specifications. Paper presented SEEK. The Science Environment for at the ICDE. Ecological Knowledge. http://seek. Percival, D. B., & Walden, A. T. (2000). ecoinformatics.org Wavelet methods for time series analy- Sheth, A., Aleman-Meza, B., Arpinar, I. sis. Cambridge University Press. B., Bertram, C., Warke, Y., Ramak- Rahm, E., & Bernstein, P. A. (2001). A rishanan, C., et al. (2004). Semantic survey of approaches to automatic association identification and knowl- schema matching. VLDB Journal, edge discovery for national security 10(4). applications. Journal of Database Ram, S., Khatri, V., Zhang, L., & Zeng, D. Management, 16(1). (2001). GeoCosm: A semantics-based Sheth, A., Arpinar, I. B., & Kashyap, V. approach for information integration (2003). Relationships at the heart of of geospatial data. Paper presented at Semantic Web: Modeling, discover- the Proceedings of the Workshop on ing, and exploiting complex semantic Data Semantics in Web Information relationships. In M. Nikravesh, B. Systems (DASWIS2001), Yokohama, Azvin, R. Yager & L. A. Zadeh (Eds.), Japan. Enhancing the power of the Internet Ram, S., & Park, J. (2004). Semantic con- studies in fuzziness and soft comput- flict resolution ontology (SCROL): ing. Springer-Verlag. An ontology for detecting and resolv- Sheth, A., Bertram, C., Avant, D., Ham- ing data and schema-level semantic mond, B., Kochut, K., & Warke, Y.

Copyright © 2007, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of Idea Group Inc. is prohibited. 66 Journal of Database Management, 18(1), 43-67, January-March 2007

(2002). Managing semantic content and integration. http://www.uddi. for the web. IEEE Internet Comput- org ing, 6(4), 80-87. Vitter, J. S., & Wang, M. (1999). Ap- Sheth, A., & Karabatis, G. (1993, May). proximate computation of multidi- Multidatabase Interdependencies in mensional aggregates of sparse data Industry. Paper presented at the ACM using wavelets. Paper presented at the SIGMOD, Washington DC. ACM SIGMOD. Smith, J. R., Li, C.-S., & Jhingran, A. Vitter, J. S., Wang, M., & Iyer, B. (1998). (2004). A wavelet framework for Data Cube Approximation and Histo- adapting data cube views for OLAP. grams via Wavelets. Paper presented IEEE Transactions on Knowledge and at the 7th CIKM. Data Engineering, 16(5), 552-565. W3C. Semantic Web. http://www. Stollnitz, E. J., Derose, T. D., & Salesin, w3.org/2001/sw/. D. H. (1996). Wavelets for Computer Zhao, H., & Ram, S. (2002). Applying Graphics Theory and Applications: classification techniques in semantic Morgan Kaufmann Publishers. integration of heterogeneous data Tatarinov, I., & Halevy, A. Y. (2004). Ef- sources. Paper presented at the Eighth ficient Query Reformulation in Peer- Americas Conference on Information Data Management Systems. Paper Systems, Dallas, TX. presented at the SIGMOD. Zhao, H., & Ram, S. (2004). Clustering TopicMap.XML Topic Maps (XTM) 1.0 schema elements for semantic integra- http://www.topicmaps.org/xtm/ tion of heterogeneous data sources. UDDI. Universal description, discovery Journal of Database Management, 15(4), 88-106.

Zhiyuan Chen received the PhD degree in computer science from Cornell University in 2002. Presently, he is an assistant professor in the Information Systems Department at University of Maryland, Baltimore County (UMBC). His research interests include XML and semi-structured data, privacy-preserving data mining, data integration, automatic database tuning, and database compression.

Aryya Gangopadhyay is an associate professor of Information Systems at UMBC. He has a PhD in Information Systems from Rutgers University. His current research interests are in the areas of knowledge discovery in databases and privacy issues in information systems. He has published over 50 peer-reviewed articles in academic publications such as Decision Support Systems, IEEE Transactions on Knowledge and Data Engineering, Information Resources Management Journal, Artificial Intelligence in Engineering, IEEE Computer, and Journal of Management Information Systems. He is in the editorial board of the International Journal of E-Business Research and the Journal of Electronic Commerce in Organizations.

Copyright © 2007, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of Idea Group Inc. is prohibited. Journal of Database Management, 18(1), 43-67, January-March 2007 67

George Karabatis is an assistant professor of information systems at the University of Maryland, Baltimore County (UMBC). He holds degrees in computer science (PhD and MS) and mathematics (BS). His research interests are on various aspects of information technology related to database systems, including semantic information integration, bioinformatics and applications for mobile handheld devices. Prior to his current appointment he was a research scientist at Telcordia Technologies (formerly Bellcore) where he led several telecommunications projects involving database related technologies. His work has been published in journals, conference proceedings and book chapters.

Michael P. McGuire is the geospatial data services manager for the Center for Urban Environmental Research and Education at UMBC and provides technical support to the Baltimore Ecosystem Study for data management and spatial analysis Mr. McGuire holds a Bachelor of Science degree in geography from Towson University and an MS in information systems from UMBC where he is now pursuing his PhD. Prior to his appointment at UMBC, Mr. McGuire held positions at the Baltimore County Office of Planning and Maryland Department of the Environment. His research interests include spatial data warehousing, information visualization, and decision support systems.

Claire Welty is the director of the Center for Urban Environmental Research and Education and professor of civil and environmental engineering at University of Maryland, Baltimore County. Dr. Welty’s work has primarily focused on in transport processes in aquifers; her current research interest is in urban watershed hydrology, particularly in urban groundwater. Prior to her appointment at UMBC, Dr. Welty was a faculty member at Drexel University for 15 years. She holds a PhD in civil and environmental engineering from MIT and is currently a member of the Water Science and Technology Board of the National Research Council.

Copyright © 2007, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of Idea Group Inc. is prohibited.