Semanteco: a Semantically Powered Modular Architecture for Integrating Distributed Environmental and Ecological Data

Future Generation Computer Systems ( ) –

Contents lists available at ScienceDirect

Future Generation Computer Systems

journal homepage: www.elsevier.com/locate/fgcs

SemantEco: A semantically powered modular architecture for integrating distributed environmental and ecological data

Evan W. Patton a,∗, Patrice Seyed a,b, Ping Wang a, Linyun Fu a, F. Joshua Dein c, R. Sky Bristol d, Deborah L. McGuinness a a Tetherless World Constellation, Rensselaer Polytechnic Institute, 110 8th Street, Troy, NY 12180, USA b DataONE, University of New Mexico, 1 University Boulevard N.E., Albuquerque, NM 87131, USA c US Geological Survey, National Wildlife Health Center, USA d US Geological Survey, Core Science Analytics and Synthesis, USA h i g h l i g h t s

• Discusses the need for flexible frameworks to handle data integration. • Proposes a semantic solution to integrate environmental data for resource managers. • Discusses interoperability challenges addressed with existing ontologies. • Analyzes performance of different schema for artificial intelligence reasoning. • Compares OWL-DL reasoning with more lightweight rule-based reasoning. article info a b s t r a c t

Article history: We aim to inform the development of decision support tools for resource managers who need to examine Received 2 February 2013 large complex ecosystems and make recommendations in the face of many tradeoffs and conflicting Received in revised form drivers. We take a semantic technology approach, leveraging background ontologies and the growing 31 August 2013 body of linked open data. In previous work, we designed and implemented a semantically enabled Accepted 5 September 2013 environmental monitoring framework called SemantEco and used it to build a water quality portal named Available online xxxx SemantAqua. Our previous system included foundational ontologies to support environmental regulation violations and relevant human health effects. In this work, we discuss SemantEco’s new architecture that Keywords: Semantic web supports modular extensions and makes it easier to support additional domains. Our enhanced framework Semantic environmental informatics includes foundational ontologies to support modeling of wildlife observation and wildlife health impacts, Provenance thereby enabling deeper and broader support for more holistically examining the effects of environmental Ecological data integration pollution on ecosystems. We conclude with a discussion of how, through the application of semantic technologies, modular designs will make it easier for resource managers to bring in new sources of data to support more complex use cases. © 2013 Elsevier B.V. All rights reserved.

1. Introduction investigate causes and possible effects of pollution, and identify threats to wildlife and their habitats. To be effective, these efforts In many places around the world, wildlife and the habitats on must be undertaken with a large-scale and multidisciplinary per- which they depend are deteriorating. For instance, almost 40% of spective. However, by nature, this approach requires access and the United States’ freshwater fish species are considered at risk use of disparate data sources and systems. For example, the US Ge- or vulnerable to extinction [1]. Aiming at preserving the environ- ological Survey (USGS) provides integrated science and technology ment and wildlife, scientists and resource managers have initiated to support resource managers in the US Department of the Interior various efforts to monitor ecological and environmental trends, (DOI) through initiatives such as the Wyoming Landscape Conser- vation Initiative (WLCI)1: an effort to assess and enhance aquatic and terrestrial habitats at a landscape scale in southwest Wyoming. Decision support systems are one end result of scientific research ∗ Corresponding author. Tel.: +1 4014847865. E-mail addresses: [email protected] (E.W. Patton), [email protected] (P. Seyed), [email protected] (P. Wang), [email protected] (L. Fu), [email protected] (F.J. Dein), [email protected] (R.S. Bristol), [email protected] (D.L. McGuinness). 1 http://www.wlci.gov/.

0167-739X/$ – see front matter © 2013 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.future.2013.09.017 2 E.W. Patton et al. / Future Generation Computer Systems ( ) – that facilitates examination of the many tradeoffs and conflicting approach provides a formal encoding of the semantics of the data drivers that resource managers often encounter in their work, from and provides services for automatic reasoning and visualizations. energy and agricultural development, fish and wildlife conserva- Furthermore, we compared the performance of a standard seman- tion, to recreational uses of public lands. tic web reasoner used in previous iterations with a new customized Meanwhile, semantic web technologies have been used in en- rule-based reasoner over our data. Lastly, we enhanced the prove- vironmental monitoring to facilitate knowledge encoding, data in- nance support of the existing portal by incorporating rationale as tegration, and collaborative scientific workflows [2]. Previously, provenance. This extensibility enabled by semantic technologies we proposed the Tetherless World Constellation Semantic Ecol- allows the portal to consume data from many different domains ogy and Environment Portal (SemantEco) [3,4] as both an envi- and present it together for the purposes of informing environmen- ronmental portal application and as an example of a semantic tal stakeholders. infrastructure for environmental informatics applications. The In this paper, Section2 elaborates how semantic web technolo- portal captures the semantics of domain knowledge using a fam- gies have been used to extend and improve the portal, including ily of ontologies, integrates environmental data (e.g., water qual- linkages to OBOE, extensions for wildlife monitoring, upgrades to ity measurement and regulatory data), infers pollution events the reasoning mechanism, and capture of rationale as provenance. using ontologies modeled with the Web Ontology Language, Section3 reviews related work. Section4 discusses impacts, high- version 2 (OWL 2), and leverages provenance. In this paper, we lights, and future directions. Section5 presents the conclusion. extend the focus of SemantEco beyond water quality and related human health effects to a more comprehensive effort that includes 2. Semantic approach fish and wildlife species and their related health effects. As such, this extension is in concert with the One Health concept, defined To inform the development of decision support tools for envi- as ‘‘the collaborative effort of multiple disciplines working locally, ronmental and observational communities, we extend the focus nationally, and globally to attain optimal health for people, ani- of SemantEco to a more comprehensive effort including fish and mals and the environment’’ [5]. Consider a data manager interested wildlife species and their related health effects, and upgrade the in identifying correlations between declines of a bird population portal with respect to three aspects: data integration, breadth and observed in a river due to chemicals released by nearby factories. scalability, and provenance support. Furthermore, to aid in ease of There may be direct causal links (i.e., pollutant X in a diet kills Y ) or deployment, the latest version of SemantEco is provided as a Java indirect links (e.g., pollutant X causes reproductive system changes Web Archive that can be deployed in any servlet container, such as leading to lowered breeding success and resultant population de- Tomcat or Jetty. clines). To identify these scenarios, resource managers need access to a number and variety of information and data sources including 2.1. Architecture databases containing species observation counts, literature on the environmental effects of pollutants, how species may migrate and interact with one another, and so forth. Semantic technologies will SemantEco is designed with a modular architecture in order facilitate access to multidisciplinary information that will aid re- to incorporate different domains as well as to provide different source managers in making decisions about complex ecosystems. facets for data discovery and exploration. Interaction between the These technologies also enhance reusability and address extensi- portal and installed modules are handled by a ModuleManager bility issues targeting challenges in the areas of data integration, that is responsible for calling appropriate module methods dur- provenance, and scalability. ing the processing of a client request. Modules are provided to We use semantic technologies to provide responses to these SemantEco in the form of a Java Archive (JAR) file and are, on av- challenges. This work presents a new architecture that evolves an erage, around 64 kB in size, with the largest measured at 440 kB. existing water quality portal into a general model for perform- SemantEco provides 15 modules by default, providing client-side ing ecological and environmental data integration across many interface functionality, a variety of datasets (water quality, air different data sources and domains. Additionally, this new archi- quality, etc.), and facets for that are used to constrain queries. The tecture is designed to be horizontally scalable through the use of ModuleManager actively examines the file system for new mod- representational state transfer [6]. We also demonstrate how se- ules, allowing modules to be added to a production system without mantic technologies can be used to encourage interoperability by taking it offline. Additionally, each JAR can contain properties to connecting SemantEco’s schema to one existing well-used obser- tell the ModuleManager how to prepare an environment in which vation schema—the Extensible Observation Ontology (OBOE) [7] to the module can execute (e.g., setting remote endpoint URLs to find support interoperable observation data. We investigated the rea- data). Furthermore, since modules are based on the Java language, soning implications of adopting OBOE for interpreting data, and it is possible to publish a module via HTTP that can be loaded into discovered that OBOE brings greater interoperability at the cost of SemantEco, providing a straightforward way for data providers to expose data in a usable distributed fashion without system longer reasoning time. Additionally, we now integrate various eco- maintainers needing to expend resources keeping track of mod- logical and environmental data: wildlife observational data from ule versions. In the future, we plan to develop an ecosystem where the Avian Knowledge Network (AKN2) and the US Geological Sur- organizations with scientific datasets can find, download, and pub- vey (USGS3); environmental criteria for wildlife from the Envi- lish modules to support their own scientific inquiries. While this ronmental Protection Agency (EPA4); water body data from the is similar to current content management systems such as Word- USGS; and health effects of contaminants on wildlife from Wild- press or Drupal that provide central repositories of modules, our pro.5 These are provided in addition to the existing water quality aim is to add more semantic infrastructure enabling enhanced cus- and regulatory data presented in earlier versions of this work. Our tomization and validation support through artificial intelligence techniques. This architecture is a vast departure from previous designs 2 http://www.avianknowledge.net/. of SemantEco [3,4,8,9], and was motivated by the need to allow 3 http://www.usgs.gov/. domain-literate (but possibly less semantic technology-literate) 4 http://www.epa.gov/. people to expand the system’s data and functionality without re- 5 http://wildpro.twycrosszoo.org/. quiring significant changes to any particular component. Previous E.W. Patton et al. / Future Generation Computer Systems ( ) – 3

Table 1 Architecture performance comparison. Region Kent, RI Suffolk, MA Yates, NY San Francisco, CA

Triples 11 233 5805 19 196 20 825 Old store (ms) 3 245 2351 3 299 2 941 Old reasoning (ms) 338 169 373 247 Old total (ms) 3 583 2520 3 672 3 188 New store (ms) 1 543 1058 1 840 1 447 New reasoning (ms) 974 780 1 308 1 197 New total (ms) 2 517 1838 3 148 2 644 Reasoning change 188% 362% 251% 385% Fig. 1. The domain model for an implementation of ProvidesDomain and the associated Domain object. extensions (e.g., [9]) required a number of changes to both client- to enable modules to interact with the active request and relevant side and server-side code that touched existing behavior. Conse- entities. Visit methods are defined for four entities: (1) the user quently, the decoupling gained by modularizing the architecture interface, where the module can add new facets, style sheets, or of software can change its performance characteristics. To this end, client-side logic; (2) the data model in which instance level data are we ran a number of queries on both the old portal and the newly loaded from remote sources (e.g., water measurement data from designed portal via the user interface. We used one county from EPA); (3) the ontology model in which the semantics of classes and each of the original four states available in the early version of Se- properties are defined (e.g., regulations for a domain would be im- mantEco, as they have the best curated data within the system. The ported here); and (4) any query prior to execution if the module performance times of these queries are presented in Table 1. desires to specialize it with additional constraints. As the data in Table 1 demonstrate, the total running time of the Modules can also implement an interface, , new portal architecture is faster than that of the previous version ProvidesDomain to mark the module as a domain provider. Modules with this inter- by 14–30%. Most of this improvement is due to improved perfor- face provide one or more s, which are a collection of data mance in the response time of queries to the triple store between Domain sources, entity types, and regulations (see Fig. 1). The user activates the two versions. The total size of the data transferred between domains via the user interface. By knowing the user’s domains of the two systems varied, with San Francisco resulting in the largest interest, SemantEco’s selectively chooses which dataset at 887 kB.6 Reasoning queries, which take up the major- ModuleManager modules to execute in response to a request. We define an active ity of the total time after the triple store queries, were anywhere module to be a module that either does not provide domains, thus from 180% to 400% slower due to the fact that queries have more activated by default, or has at least one domain enabled by the user. variables that need to be bound as a result of the additional do- There are three classes of module in SemantEco, defined by their mains present in the new SemantEco. For example, a module that features. Core modules are essential to how SemantEco functional- provides data for the water domain may add a binding type for ity processes information. This includes, among others, the Domain a variable and then use the reasoner to look for instances of that Module whose inputs are used in the to control type. These bindings are typically of the form (exists(?site ModuleManager which other modules are executed. Data provider modules are the a water:WaterSite) as ?isWater). To ensure complete- second class of modules. These, at a minimum, add at least one do- ness, the reasoner will attempt to compute for every instance sat- isfying the query whether or not it is an instance of the class main when implementing ProvidesDomain, and have an imple- water:WaterSite. This computation, if the reasoner does not have mentation of the data model visitor in order to load data. Much the results available in its cache, can be NP-hard. Ongoing engi- of the data loading features of SemantAqua have been preserved neering is being done to intelligently partition the data to reduce or in SemantEco as the Water Data Provider, which provides the EPA eliminate these excessive tests based on the user-specified query and USGS as data sources as well as a number of semantically en- constraints. Furthermore, it is important to note that many of the coded water regulations. Lastly, query-modifying modules are mod- intermediate queries required to process a request are cacheable, ules that do not provide a domain but instead implement the query but this caching functionality was disabled for these tests. The visitor to customize query patterns. For example, the time module caching mechanism reduces the number of requests to external adds query patterns related to time, but an existing data provider triple stores by using an HTTP proxy to mirror results locally un- module must provide a time dimension on its measurements for til a set timer makes them stale or they reach an expiration set this to succeed. in the HTTP headers, which is the same mechanism used in web In addition to visitor methods, modules optionally provide browsers. Lastly, because SemantEco modules are designed to min- query methods. A query method has a particular signature rec- imize or eliminate side-effects, it is possible to run many instances ognized by SemantEco that allows the module to expose func- of SemantEco to balance the reasoning workload behind a high- tions to the client. A query method must function without availability cluster or on a cloud service such as Amazon EC2. side-effects, as it is potentially called by multiple clients simulta- neously. Thus, query methods can be horizontally scaled to support as many clients as required. An example of a query method would 2.2. Modules be the regulation module’s queryForSites method that searches the knowledge base for sites and binds attributes such as latitude and Modules are the essence of the new version of SemantEco as longitude as well as the results of artificial intelligence inference they encapsulate SemantEco’s functionality in discrete compo- (e.g., whether or not the site is polluted, given the semantics of the nents, and can be dynamically added and removed to affect its be- regulation ontology). havior. SemantEco employs the use of the Visitor design pattern When a client calls a query method, the portal executes a sequence of steps prior to responding to it (see Fig. 2). First, SemantEco decomposes the request components to determine 6 For comparison, the Google homepage served to Chrome 28 on Mac OS X 1.7.5 which module and query method should be executed. Then, the is 1.4 MB. ModuleManager locates the module, uses the Java Reflection API 4 E.W. Patton et al. / Future Generation Computer Systems ( ) –

Fig. 3. Measurement in SemantEco.

Fig. 4. Measurement in OBOE.

differ in several ways. For instance, the SemantEco ontology cap- Fig. 2. Overview of how SemantEco responds to a request to execute a module’s tures measurement values with the data property has Value, while query method. OBOE encodes measured values using property chains on structured objects. More details about the differences between Seman- to obtain the appropriate method, and executes it. At this point, tEco and OBOE can be found in [8]. Figs. 3 and4 depict the data control is turned over to the module that proceeds to perform its presentation of an example water observation record generated task. according to the SemantEco version 2 and version 3, respectively. During execution, the module may wish to make queries to We model a polluted site as something that is both a mea- the local data model or any number of external endpoints. The surement site and a polluted thing. The different observation data recommended way to execute a query is via the QueryExecutor. models lead to different models of polluted things. In version 2, a The QueryExecutor and the ModuleManager work together to polluted thing is defined as something that has at least one mea- provide complex behavior regarding queries. First, the Module surement that violates a regulation. With version 3, a polluted Manager takes the query and calls the query visitor method of thing is modeled as something that has at least one observation of a each active module so that they may be allowed to augment the regulation violation. This allows, for example, statements of obser- query. One such example is the time module that, when the query vations as an aggregate of measurements. It can then be observed asks about items of type Measurement, will insert a graph pattern that some mathematical property of the observation, such as the that will restrict any measurements to a time window based on the average outflow temperature over the course of an hour, violates a user’s selections. If the query is to be executed against a remote regulation. An observation of a regulation violation is modeled as endpoint, the QueryExecutor handles all of the necessary an observation that has at least one measurement that violates a content negotiation on behalf of the calling module. In the event regulation, i.e., the measured value is outside of an allowable range. of a local query, the will build for itself, with QueryExecutor The allowable range from a regulation rule is encoded via property the help of the ModuleManager, a data model and an ontology chains. Fig. 5 gives an example of an observation of regulation vio- model over which the query will be executed. These models are lation. built in a manner analogous to how queries are augmented: the We updated our data converter and regulation converter for ModuleManager takes the newly instantiated models and visits encoding the water observation data and regulation rules accord- each active module so they can provide any data or ontologies appropriate to the current request. The QueryExecutor will then ing to the new ontologies, and investigated the reasoning implica- evaluate the query over the combined data and ontology models tions of adopting OBOE for interpreting data. Our system uses the so that the Pellet reasoner [10] can be used to infer new triples not Pellet reasoner [10] together with the Jena Semantic Web Frame- already stated that might be used to provide solutions to the query. work [11] to reason over the data and ontologies to answer the queries for polluted sites. Listing 1 is under SemantEco version 2, 2.3. Knowledge bases and Listing 2 is under version 3. We execute each query ten times to obtain an average query time. Table 2 summarizes the datasets To increase interoperability, we provided a mapping that and the query execution time. There is a clear tradeoff between allows our system to be interoperable with OBOE-compatible ap- interoperability and reasoning performance: while OBOE brings plications. We name the previous version of our SemantEco ontolo- greater interoperability for the portal, it incurs much longer rea- gies as version 2 and the new version as version 3. The two versions soning time. E.W. Patton et al. / Future Generation Computer Systems ( ) – 5

Table 2 Reasoning comparison. Region Kent, RI Suffolk, MA Yates, NY San Francisco, CA

Regulations RI MA NY CA v2 size (triples) 25 293 53 866 27 631 74 920 v3 size (triples) 31 670 75 059 38 926 101 969 v2 reasoning (s) 6.4 32.9 32.6 58.4 v3 reasoning (s) 301.9 1797.1 568.9 856.0

Fig. 5. Updated model of a regulation violation in OBOE.

Listing 1: Query for polluted sites in the SemantEco ontology select distinct ?pollutedSite where { ?violation a pol:RegulationViolation. ?violation pol:hasSite ?pollutedSite. }

Fig. 6. Subset of the wildlife ontology. Listing 2: Query for polluted sites using OBOE select distinct ?pollutedSite species in the region. Then, the resource manager views the where { map to find out if the selected species might be endangered ?observation a by water pollution in the region. The resource manager can oboe-pol:ObservationOfRegulationViolation. ?observation oboe-core:hasContext ?obsContext. click on polluting facilities or polluted sites to investigate more ?context oboe-core:ofEntity about the pollution, e.g., the health effects of the pollution on oboe-pol:SpatialLocationEntity. the species. ?context oboe-core:hasMeasurement ?locationMea. ?locationMea oboe-core:hasValue ?pollutedSite. } To realize the use case, we enhanced our ontology for modeling the domain of wildlife observation, integrated ecological and environmental observation data according to linked data princi- Reasoning performance is tested on a Silicon Mechanics 4U rack ples [12], and developed visualizations to present the data and server running Debian Squeeze (6.0.4) configured with two six- provenance. core Intel Xeon E5620 with hyper-threading operating at 2.67 GHz and 48 GB of 1333 MHz dual ranked memory. Tests using Pellet Ontology for wildlife monitoring: There are a number of existing were conducted using the 64-bit Java 1.6.0_26 runtime environ- ontologies for modeling and publishing data using the Resource ment (sun-java6-jre, version 6.26-0squeeze1), and the virtual ma- Description Framework (RDF) [13] about species descriptions chine was configured with 1 GB min heap size and 32 GB max heap [14,15]. After reviewing these ontologies, we choose to reuse the size. Geospecies ontology for the purpose of modeling the domain of Table 2 demonstrates that the cost of adding interoperabil- wildlife monitoring, as it contains most of the concepts required ity through support of the OBOE ontology greatly increases the by the use case. For example, the Geospecies ontology defines the amount of time required to complete reasoning over all of the classes Observation and SpeciesConcept, and links the two classes counties studied. There are a number of reasons for this difference, with object properties hasObservation and hasSpecies. but the one that contributes the most is the complexity of the graph However, the Observation class from Geospecies does not cap- and how the changes in OBOE affect how the logical structure of a ture the observed habitat of the wildlife nor the date of the 7 regulation increase the amount of time needed by Pellet to classify observation. Thus, in the extension, we introduce a new class each observation (e.g., compare the order of the graphs in Figs. 3 WildlifeObservation that subclasses Observation and provides a new and4). Thus, one technique that may be important in further ver- property hasDate. A subset of our ontology extension is illustrated sions of the portal is a tool for converting OBOE observations into in Fig. 6. the more compact representation used in earlier versions so that Each contaminant can cause different adverse effects on differ- the reasoning performance is increased. ent wildlife species. For example, when exposed to excessive zinc concentrations, mallards exhibit leg paralysis and decreased food 2.4. Extensions for wildlife monitoring consumption, while invertebrates exhibit decreased growth rate and increased mortality [16]. To help facilitate the investigating We designed the following use case to identify necessary health impacts of contaminants on wildlife species, we refactored extensions to the portal for evaluating wildlife health effects from our ontologies to include the class HealthEffect to model the poten- 8 pollutants. tial health effects of overexposure to contaminants. The property isCausedBy establishes the relationship between each effect and its The resource manager chooses a geographic region of interest by entering a zip code and the species of concern in the species facet. The portal identifies polluted water sources and polluting facilities, and visualizes the results on a map using different 7 http://escience.rpi.edu/ontology/semanteco/3/0/wildlife.owl. icons. Meanwhile, the portal displays the distribution of the 8 http://escience.rpi.edu/ontology/semanteco/3/0/wildlife-healtheffect.owl. 6 E.W. Patton et al. / Future Generation Computer Systems ( ) – causing contaminant, and the property forSpecies links each effect with its target species. Wildlife source data: We integrate various ecological and environmental datasets from government and research institutions as follows. Bird observation data: One major source of bird observation data is AKN, an international effort of government and non- government institutions to understand the patterns and dynamics of bird populations across the Western Hemisphere. We obtain a subset of the eBird Reference Dataset (3.0) [17] from AKN via its database query interface. The datasets are based on reported observations from novice and experienced bird observers. These datasets contain count data for bird species, the location where observation took place, and the time observation started. The AKN is strengthened through the adoption of species codes from the Integrated Taxonomic Information System (ITIS),9 providing unique identification of species tied to an authoritative source. Currently, we have over 485 million triples representing bird observations for the state of California. Fig. 7. SemantEco’s map visualization. Fish observation data: The National Fisheries Data Infrastruc- ture brings together local and regional fisheries information systems and provides fishery managers and decision-makers with one source of comprehensive data and information of fish species.10 We fetch fish observation data from its query interface. The fish dataset includes the species name, the hydrological unit code (HUC)11 of the watersheds where the fish species is observed, the date of the observation, and the originating database. Currently, we have 74 thousand triples on fisheries data from the USGS and 624 thousand triples representing fish counts in Santa Barbara, CA, sourced from the Long Term Ecological Research (LTER) network.12 Regulation data: We integrate EPA’s compilation of national recommended water quality criteria [18], which is presented as a summary table containing recommended water quality criteria for the protection of aquatic life and human health in surface water for approximately 150 contaminants. Fig. 8. SemantEco’s time series visualization. Water body data: The USGS National Hydrography Dataset (NHD) services are used to get HUC codes given locations on the map,13 and the data for the water body shapes.14 Furthermore, we Data visualization: We support two types of visualization: (1) have aggregated over 1.5 billion triples of water quality data mea- map visualization that displays the sources of the water pollution surements from the EPA and USGS distributed across the United together with species habitat in the context of geographic regions States. and (2) time series visualization that depicts species count over Health data: We obtain the health effects on wildlife from the time with respect to a particular geographic region. research effort Wildpro, an electronic encyclopedia and library for The map visualization retrieves sources of water pollution and wildlife [19]. species habitats from the back-end reasoner and the triple store. Data conversion: The general-purpose csv2rdf4lod tool provides We label clean and polluted water sources and clean and pollut- the capability of quick and easy data integration [20,21]. We pro- ing facilities with different markers. In the extension, we focus on vide the converter with declarative parameters that map proper- waterfowl and fish. We visualize highlight the water bodies that ties of the raw data to terms defined in ontologies. For example, the are their habitats. When the user clicks one of the highlighted wa- field Common Name in the eBird dataset is mapped to the property ter bodies, information about the water body and the provenance geospecies15:hasCommonName. Using this mapping, the converter of the information are shown in the ‘‘Water Body Properties’’ tab is able to generate RDF triples compatible to our ontologies from of the pop-up window. Fig. 7 shows an example of our map visu- the raw tabular data provided by our data sources. We use the same alization and pop-up window for water body. In the example, the regulation ontology design and conversion workflow with our pre- portal applies EPA’s water quality criteria for aquatic life on the re- vious work [4] to map the rules in wildlife regulations to OWL [22] gion with the zip code 98103 (Seattle, WA) and identifies polluting classes. sites (outlined in red) that are close to bird habitats. The time series visualization retrieves species count data within a particular geographic region by querying the triple store and displays the data as a time series using the d3.js library. Fig. 8 shows 9 http://www.itis.gov/. the count of Canada geese in Washington state in 2007. 10 http://ecosystems.usgs.gov/fisheriesdata/querybystate.aspx. 11 HUC8 is an 8-digit hydrological unit code identifying a subbasin area of size around 700 square miles. See http://en.wikipedia.org/wiki/Hydrological_code. 2.5. Reasoning engine 12 http://www.lternet.edu/. 13 http://services.nationalmap.gov/ArcGIS/rest/services/nhd/MapServer. Due to forward-chaining closure computing, standard reason- 14 http://nhd.usgs.gov/data.html. ers such as Pellet are much slower than general rule reasoners with 15 http://rdf.geospecies.org/ont/geospecies#. only the necessary rules. For example, a standard reasoner for the E.W. Patton et al. / Future Generation Computer Systems ( ) – 7

Rationale helps portal users to obtain a holistic understanding of Listing 3: Excessive Chloride Rule the portal and facilitates the portal’s maintenance and reuse. Users [Chloride: (?x pol:hasValue ?v) ge(?v, 10.0) (?x pol:hasCharacteristic pol:Chloride) are more likely to have confidence in the portal if the rationale be- (?x repr:hasUnit ‘mg/l’) -> hind design choices and the presentation of data are acceptable to (?x rdf:type pol:ExcessiveChlorideMeasurement)] them. If other portal builders are interested in reusing the architecture or workflow, they can more easily decide whether they would [Chloride-subclass: (?x rdf:type pol:ExcessiveChlorideMeasurement) -> like to reuse one dataset or software agent when given the ratio- (?x rdf:type pol:RegulationViolation)] nale for why we selected a particular dataset or software agent. To encode rationale as provenance, we extend PML 2. PML 2 is a modular explanation interlingua, and it contains three ontologies Listing 4: Query for Regulation Violations using Rules that focus on three types of explanation metadata: provenance, select ?s ?x ?v where { information manipulation or justifications, and trust [24]. We in- ?x a pol:RegulationViolation. troduce the property pmlp:hasRationale, whose domain is the class ?x pol:hasValue ?v. ?x pol:hasSite ?s. } IdentifiedThing and range is String. Identified things can be information, language, and sources (including organization, person, agent, services). With hasRationale, we can provide the rationale for why Table 3 Rule reasoner versus Pellet. we choose to adopt the identified things in simple text. By extend- ing the scope of provenance to include rationale, we are able to Region Kent, RI Suffolk, MA Yates, NY San Francisco, CA capture some important information that may otherwise be lost Rule (ms) 4 014 15 408 4 333 13 093 when the original architects of the portal leave the project. Table 4 Pellet (ms) 11 384 162 566 31 148 31 558 includes three rationale examples. While we automatically capture provenance during the data RDF16 language would include the following rule (encoded with integration stages, such as data source URL, time, and method Jena rule syntax17): used in data manipulation due to the provenance support from csv2rdf4lod [3,4], the rationales as provenance are captured man- [rdf1: (uuu aaa yyy) -> (aaa rdf:type rdf:Property)] ually. The provenances captured using csv2rdf4lod are published Although this rule ensures the fulfillment of the semantics of as metadata graphs within the SemantEco triple store and can be RDF, it is not useful in the query answering task of our system. used for generating faceted search interfaces, e.g., providing data There are 14 such rules embedded in a standard RDFS reasoner18 source-based search to allow users with a particular interest in EPA and even more for an OWL-DL reasoner like Pellet. We avoid the data to selectively access datasets provided by the EPA. invocation of these rules to improve query-answering efficiency. Table 3 compares the performance of a specifically tailored rule 3. Related work reasoner implementing rules such as that in Listing 3 to the Pellet reasoner. They both answer the query in Listing 4 under In ecology and environmental communities, there have been the California water quality regulations on the same hardware as research efforts that facilitate domain knowledge integration via described in Section 2.3. semantic approaches. These research projects focus on differ- These results show that the new custom rule implementation ent fields of ecological and environmental science. OBOE focuses of the regulations outperforms Pellet by an average factor of 4, on encoding generic scientific observation and measurement [7]. and range between 10% and 40% of the overall running time of the GeoSpecies is an effort for enabling species data to be linked to- complete OWL-DL closure. This suggests that having automated gether as part of the Linked Data network [14]. Chen et al. proposed mechanisms to convert from complete OWL-DL ontologies into a prototype system that integrates water quality data from mul- rule-based approximations could provide significant performance tiple sources [25]. However, environmental decision support sys- improvements when completeness of the OWL-DL closure is not tems need to integrate data across multiple fields, facilitate data necessary to sufficiently answer application queries. However, this examination, and assist natural resource managers to draw infor- transformation does come at a cost in terms of expressivity. OWL 2 mative conclusions. In this work, we designed a family of ontolo- specifies a number of language profiles with different complexity gies for encoding water measurement, species observation, and classes [23]. All of the current regulations modeled by our ontol- the health effects of pollution on species, utilized the ontologies ogy can be mapped into OWL 2 RL, which means that using a rule for data integration and reasoning, and demonstrated the value of engine, as we have done in this example, gives us O(n4) runtime. semantic approaches for building environmental decision support This performance comes at the cost of eliminating cardinality re- systems. strictions greater than 1, and it places restrictions on how classes eScience can benefit from provenance for a number of reasons. can be constructed. The benefit is that consistency checking of the For example, provenance provides a context for data and infor- data is PTIME-complete rather than NP-hard. mation interpretation, enabling evaluation of the experimental results and replication of scientific workflows [26]. Research projects 2.6. Rationale as provenance such as myGrid [27] and the Collaboratory for Multi-Scale Chem- ical Science (CMCS) [28] have been conducted to build infrastruc- During the construction of an information portal, we make var- ture that generates provenance data and allows users to view and ious choices, including what data sources to use, which integration use provenance in analyses. The provenance support of this work tools, which data transformations, etc. As pointed by Bristol from differs from that of previous projects in that we extend the scope USGS, rationale that explains how choices were made is critical. of provenance support to include rationale. Futrelle et al. [29] developed a semantic middleware, Tupelo, for eScience that was oriented towards collaborative knowledge generation and the provenance required to capture the artifacts and 16 http://www.w3.org/TR/2004/REC-rdf-mt-20040210/#RDFRules. processes used by scientists. It combined semantic web technolo- 17 http://jena.sourceforge.net/inference/#RULEsyntax. gies with content management systems to provide a platform for 18 http://www.w3.org/TR/2004/REC-rdf-mt-20040210/#RDFSRules. science and included reasoning and rule engine features via OWL 8 E.W. Patton et al. / Future Generation Computer Systems ( ) –

Table 4 Examples of the rationale captured. Identified thing Type Rationale

USGS Data organization It is an authoritative government agency for science about the Earth, its resources, and the environment. NWIS dataset Dataset It is distributed via web services and can be accessed periodically with automated means. csv2rdf4lod Software The open source tool provides a quick and easy way to convert tabular data into well-structured RDF. We have direct support from the author of the tool, who is our lab mate. and the Semantic Web Rule Language (SWRL) [30]. Provenance three groups and provides a unified web service interface for re- captured in Tupelo was represented using the Open Provenance trieving data. However, this system does not cover water quality Model (OPM) [31,32]. Futrelle et al. also presented two use cases data collected at facilities monitored under the Clean Water Act related to virtual sensors and plant growth modeling to demon- or the Safe Water Drinking Act. The SemantEco application incor- strate the generalizability of the middleware to diverse scientific porates water quality data from the aforementioned datasets, and domains. The SemantEco framework provides similar provenance also provides interpretations of the data using regulatory informa- capture, but in two different ways: (1) it captures the provenance tion obtained for federal and state agencies. of the data as they are ingested into triple stores from elsewhere via the csv2rdf4lod tool [21], and (2) the framework can capture 4. Discussion and future work the content and products of all queries executed during a request. This allows our environment to support explanations of how the Environmental and health information systems benefit from se- final display was generated at a granular trace level or also in ab- mantic science and technology from several aspects. First, by en- stracted form. coding the domain knowledge required by the information system Li, Kennedy, Ngoran, Wu, and Hunter [33] outlined a number with ontologies, we make the information system easier to main- of requirements for scientific data management systems and then tain and extend by third parties interested in integrating their data presented a content repository for phonemics data that used an into our system. In SemantEco, we encode the environmental reg- ontology-driven approach to manipulate and validate data and ulation rules and the health effects of pollution on wildlife as OWL metadata. Their solution also introduced an upper application- classes. If one rule becomes stricter, we only need to update the level ontology combined with domain-specific ontologies to pro- OWL class to adopt the stricter threshold value. This extensibil- vide the necessary ontological model, and this model was enforced ity can enable users to make use of latest finds in research to reusing Pellet [10] for consistency checking. SemantEco uses similar fine thresholds. While such a system could be implemented with techniques, but also captures extensive provenance data during its a relational database, it would require additional code complexity operation as well as during data ingest, and it can use the prove- for maintaining threshold rules and providing user interfaces for nance data as part of the faceted search features that drive its data manipulating them. Consider as an example a state wanting to visualization capabilities. import the EPA ruleset and then provide a more strict set of regula- Chau [34] proposed the use of using ontologies for modeling the tions for half of the pollutants while maintaining the EPA’s thresh- metadata of various water flow and quality models combined with olds for the remaining contaminants. An OWL-DL reasoner such as a questionnaire system to assist researchers in finding classes of Pellet can infer that the stricter regulation is a subclass of the less models based on their needs as determined by the questionnaires. stringent EPA regulation and thus skip processing of the additional This system used the KAON suite of semantic web tools [35] to class. In a database environment, a developer or other maintainer perform the necessary inference to link the researchers’ inputs on would need to make a determination of whether exclude such a the questionnaire to the underlying ontology that was then queried rule (e.g., in the form of an SQL query) from the application to sim- to find the appropriate class of models and parameters. ulate the same behavior or potentially face duplicate or irrelevant Liu et al. [36] presented a solution for interoperability of prove- results. nance data of two different hydrological modeling systems. Like Similarly, adding a new OWL instance is sufficient for an exten- in SemantEco, they represented the provenance information gen- sion such as introducing the health effect of a new contaminant. In erated by their system using the Proof Markup Language [24] in contrast, if we embed the domain knowledge in the source infor- conjunction with domain-specific schema. The artifacts produced mation systems, these changes would require more costly mod- by their solution, therefore, can be consumed by SemantEco if the ifications than changing the ontology files. The small number of models are exposed via a SemantEco module, and, conversely, any example datasets used in this experiment for wildlife observations, provenance produced by SemantEco could be queried at a basic pollution sources, regulatory thresholds, and health effects will level by the same queries presented in [36]. need to be dramatically expanded and diversified to answer com- The Consortium of Universities for the Advancement of Hydro- plex resource management questions. However, the methods pio- logic Science, Inc. has produced a system that aggregates meta- neered by the SemantEco portal should prove useful in scaling to data of a vast number of water data repositories from around the necessary breadth and depth of such systems. In particular, Se- the globe and provides a metadata query application built in Mi- mantEco provides a platform that uses data published under a col- crosoft.NET [37]. This makes it easy for researchers working in the lection of interoperable ontologies, so data published by providers field of hydrology to search for water data and then retrieve it in using an interface such as D2RQ [40] would allow SemantEco to be a standardized format called WaterML [38], built on top of the Ex- easily extended to incorporate external database systems. Such a tensible Markup Language [39]. system would be easier to maintain than writing ad hoc data access Efforts to integrate water quality data have also been taking APIs on a per-database level. place at the federal level. The National Water Quality Monitoring Council (NWQMC), in cooperation with the USGS and EPA, main- Semantic technologies facilitate data integration, a crucial step tains a water quality portal19 that aggregates water data from the in building ecological and environmental information systems that are inherently interdisciplinary and sourced from many sectors, e.g., government, industry, private citizens, and others. Synthesiz- ing observation data using our wildlife ontology and adoption of 19 http://www.waterqualitydata.us/. standards such as the Integrated Taxonomic Information System E.W. Patton et al. / Future Generation Computer Systems ( ) – 9

Listing 5: Counting Wildlife Observations in Washington State Listing 6: OWL Definitions for Dangerous Site Detection PREFIX geospecies: :HealthEventSite rdfs:subClassOf [ a owl:Class ; . owl:intersectionOf (:SickEventSite :DeathEventSite)]. PREFIX wildlife: :SickEventSite owl:equivalentClass [ . owl:onProperty :hasWildlifeEvent ; SELECT ?month SUM(?count) as ?total owl:someValuesFrom :WildlifeSickEvent ] . WHERE { ?obv wildlife:hasState ‘‘Washington’’; :DeathEventSite owl:equivalentClass [ geospecies:hasCommonName ‘‘Canada Goose’’; rdf:type owl:Restriction ; wildlife:hasYearCollected ‘‘2007’’; owl:onProperty :hasWildlifeEvent ; wildlife:hasMonthCollected ?month; owl:someValuesFrom :WildlifeDeathEvent ] . wildlife:hasObservationCount ?count. :DangerousSite rdfs:subClassOf [ } GROUP BY ?month rdf:type owl:Class ; owl:intersectionOf (:PollutedSite :HealthEventSite)]

(ITIS)20 leads to the ability to integrate across disparate datasets. For example, we map the fields ‘‘Latitude’’ and ‘‘Longitude’’ of the Listing 7: Retrieving Location Information on Polluted and Health eBird Reference Dataset to the properties geo21:lat and geo:long Event Sites whereas the EPA datasets identify these fields as FCLGLAT and ?pollutedSite wgs:long ?pSiteLong. FCLGLON, and USGS refers to them as ‘‘LatitudeMeasure’’ and ‘‘Lon- ?pollutedSite wgs:lat ?pSiteLat. gitudeMeasure’’. By performing this mapping consistently with ?healthSite wgs:long ?hSiteLong. ?healthSite wgs:lat ?hSiteLat. tools like csv2rdf4lod [21], queries can be performed over these disparate schema. Resource managers need the best available scientific informa- Listing 8: Proximity-based Filtering tion in a form that is easy to comprehend and assimilate into decision-making processes. For instance, evaluation of a pollution FILTER (?pSiteLat < (?hSiteLat+"+delta+") && ? pSiteLat > (?hSiteLat-"+delta+") event in a watershed can benefit more from a focused list of species && ? pSiteLong < (?hSiteLong+"+delta+") most likely to be impacted rather than a comprehensive list of all && ? pSiteLong > (?heSiteLong-"+delta+")) species that might be found in the watershed. This often involves large amounts of data, and the analysis can require much time and Such modeling and reasoning has a constraint: a monitoring site effort to arrive at a decision with significant impacts. Semantic has records for both environmental observation and wildlife health technologies can be used to lower the cost and shorten the time events. However, as environmental qualities and wildlife health required by such decision-making processes by applying knowl- are usually monitored at different locations, the constraint usually edge of known data connections under varying management and does not hold. In such cases, we can model a DangerousSite as a condition scenarios. PollutedSite having at least one HealthEventSite nearby. Then we can We use the SPARQL Protocol and RDF Query Language (SPARQL) get the location information of PollutedSite and Health-EventSite [41] to retrieve data and perform appropriate data aggregation as with a query snippet in Listing 7 and utilize the SPARQL filter to guided by the semantics encoded in the ontologies. Not only does identify DangerousSite as shown in Listing 8. SPARQL enable us to specify the constraints of the data aggregation, To identify HealthEventSite, the portal needs data for wildlife it also supports aggregation functions including COUNT, SUM, AVG, health events provided monitoring systems such as the Wildlife MIN, and MAX. For example, we obtain the total counts of ‘‘Canada Health Event Reporter (WHER) [42]. We are interested in link- Goose’’ in Washington State in 2007 with the SPARQL query in List- ing our portal to WHER and enable the portal to identify ing 5. The query result can be provided in XML or JSON that can be HealthEventSite and DangerousSite. readily consumed by different visualization toolkits (e.g., D3.js) to Next, for modeling wildlife habitats, we plan to connect to produce a time series plot. This way, with SPARQL and visualization the Harmonisa project [43] that provides semantic descriptions toolkits, the portal enables users to review and interact with the of land-use and land-cover categories. Furthermore, EPA’s water growing data resource in the form of maps and other visualizations. quality criteria [18] provides multiple types of threshold and would It can be challenging to retrieve, aggregate, and visualize data provide an important facet to complement our existing datasets. not encoded in semantic format. For instance, the Washington These criteria include measures of acute pollution in freshwater, Department of Fish and Wildlife provides species distribution data chronic pollution in freshwater, acute pollution in saltwater, and in a spreadsheet. To retrieve the data to be aggregated and then chronic pollution in saltwater. We currently incorporate thresh- perform the aggregation, a resource manager has three options: olds for acute pollution in freshwater for two reasons: (1) we do it manually, write ad hoc programs, or write complex Excel mainly focus on inland water bodies; (2) acute pollution can affect macros. All three options require considerable time and effort from both species that live near the polluting water source and those the resource manager. that pass by the water source occasionally. To support thresholds The SemantEco portal has many potential directions to ex- for chronic pollution, we need to consider some additional factors, plore. We can use semantic technologies to enable automatic anal- e.g., the time that species stay near the polluting water source. We ysis over ecological and environmental data. For example, we would require species distribution models from animal experts to model HealthEventSite as a place where some animals are reported model the chronic pollution. We plan to further enhance our mod- as dead or sick, and DangerousSite as a location that is both a eling of rationale as provenance and to collect and integrate more HealthEventSite and a PollutedSite (see Listing 6). Then, if we feed data on the health effects of pollution on species. a semantic reasoner with the ontology and data, it will automati- Lastly, as these datasets continue to grow in size, work in dis- cally identify the dangerous sites. tributed reasoning systems over the web will be critical to providing robust solutions without needing large data centers for performing data integration. Distributed systems would enable the ability for low-powered solutions to operate on data streams re- 20 http://www.itis.gov/. ported by sensor equipment and provide publishing of pollution 21 http://www.w3.org/2003/01/geo/wgs84_pos#. events from the field. 10 E.W. Patton et al. / Future Generation Computer Systems ( ) –

5. Conclusion [7] J. Madin, S. Bowers, M. Schildhauer, S. Krivov, D. Pennington, F. Villa, An ontology for describing and synthesizing ecological observation data, Ecological Informatics 2 (2007) 279–296. We modularized our SemantEco portal based on two driving [8] P. Wang, L. Fu, E.W. Patton, D.L. McGuinness, F.J. Dein, R.S. Bristol, Putting factors: to facilitate decision support systems for resource man- Semantic Technology into Practice: Expanding the Semanteco Environment to agers and to make the portal more broadly reusable. The module Support Natural Resource Managers, Technical Report, Rensselaer Polytechnic Institute, Troy, NY, 2012. architecture allows for new components to be added to the existing [9] P. Wang, L. Fu, E.W. Patton, D.L. McGuinness, F.J. Dein, R.S. Bristol, Towards system without requiring extensive understanding of which com- semantically-enabled exploration and analysis of environmental ecosystems, ponents generate queries. Additionally, modules can be added or in: 8th IEEE International Conference on eScience. removed to provide data relevant to the goals of data managers [10] E. Sirin, B. Parsia, B. Cuenca-Grau, A. Kalyanpur, Y. Katz, Pellet: a practical owl- dl reasoner, Web Semantics: Science, Services and Agents on the World Wide and can be adapted as the needs of the managers evolve. Fur- Web 5 (2007) 51–53. thermore, we included extensions that supported wildlife moni- [11] J.J. Carroll, I. Dickinson, C. Dollin, D. Reynolds, A. Seaborne, K. Wilkinson, Jena: toring, introduced connections to OBOE, demonstrated integration implementing the semantic web recommendations, in: Proceedings of the 13th International World Wide Web Conference on Alternate Track Papers & of wildlife observation data as linked data, enhanced provenance Posters, ACM, New York, NY, USA, 2004, pp. 74–83. support through the incorporation of rationale, and completed a [12] C. Bizer, T. Heath, T. Berners-Lee, Linked data—the story so far, International performance comparison between a standard Tableaux-based De- Journal on Semantic Web and Information Systems 5 (2009) 1–22. scription Logic reasoner and customized rule-based reasoner. [13] D. Beckett, RDF/XML Syntax Specification, Technical Report, W3C, 2004. [14] GeoSpecies, Geospecies knowledge base, 2009. There are a number of directions that we are continuing to [15] L. Dodds, T. Scott, BBC Ontologies—the wildlife ontology, 2010. pursue with SemantEco. [16] M.S. Schwarz, K.R. Echols, M.J. Wolcott, K.J. Nelson, Environmental contaminants associated with a swine concentrated animal feeding operation and im- • Continued work on data and query partitioning to reduce the plications for McMurtrey National Wildlife Refuge, 2004. reasoning time in the new multi-domain SemantEco. [17] M.A. Munson, K. Webb, D. Sheldon, D. Fink, W.M. Hochachka, M. Iliff, M. • Continued active module development to support new use Riedewald, D. Sorokina, B. Sullivan, C. Wood, S. Kelling, The eBird Reference Dataset, Version 3.0, Technical Report, Avian Knowledge Network, 2011. cases, including a new initiative between IBM, Rensselaer Poly- [18] US EPA, National water recommended water quality criteria, 2012. technic Institute, and the FUND for Lake George, to help monitor [19] S.I. Boardman, F.J. Dein, WildProMultimedia: an electronic manual on the the overall health of the Adirondack lakes. Initial investigations health, management, and natural history of captive and free-ranging animals, in: Annual Conference of the American Association of Zoo Veterinarians, are underway concerning invasive species, salinity, and general pp. 107–108. water chemistry trend monitoring.22 [20] T. Lebo, G.T. Williams, Converting governmental datasets into linked data, in: • A graphical annotator tool to provide scientists access to a large Proceedings of the 6th International Conference on Semantic Systems. [21] T. Lebo, J.S. Erickson, L. Ding, A. Graves, G.T. Williams, D. DiFranzo, X. Li, J. array of scientific ontologies while also providing recommenda- Michaelis, J.G. Zheng, J. Flores, Z. Shangguan, D.L. McGuinness, J. Hendler, tions for transforming tabular data into graph-structured data Producing and using linked open government data in the TWC LOGD portal, that can be immediately consumed by SemantEco and other Se- in: D. Wood (Ed.), Linking Government Data, Springer, New York, NY, 2011. mantic eScience applications. [22] P. Hitzler, M. Krötzsch, B. Parsia, P.F. Patel-Schneider, S. Rudolph, OWL 2 Web Ontology Language Primer, Technical Report, W3C, 2009. • Support for the Open Geospatial Consortium’s GeoSPARQL stan- [23] W3C, OWL 2 Web Ontology Language Profiles, Technical Report, W3C, 2012. dard for performing SPARQL queries over geographical data be- [24] D.L. McGuinness, L. Ding, P. Pinheiro da Silva, C. Chang, PML2: a modular yond ZIP code boundaries. explanation interlingua, in: Proceedings of the AAAI 2007 Workshop on Explanation-aware Computing, pp. 22–23. [25] Z. Chen, A. Gangopadhyay, S.H. Holden, G. Karabatis, M.P. McGuire, Semantic Acknowledgments integration of government data for water quality management, Government Information Quarterly 24 (2007) 716–735. [26] Y.L. Simmhan, B. Plale, D. Gannon, A survey of data provenance in e-science, The Tetherless World Constellation is supported in part by Fu- ACM SIGMOD Record 34 (2005) 31–36. jitsu, Lockheed Martin, LGS, Microsoft Research, and Qualcomm, in [27] J. Zhao, C. Goble, R. Stevens, S. Bechhofer, Semantically linking and browsing addition to sponsored research from DARPA, IARPA, NASA, NIST, provenance logs for e-science, in: Proceedings of the 1st International Conference on Semantics of a Networked World, pp. 158–176. NSF, and USGS. E.W.P. is supported by a National Science Foun- [28] J.D. Myers, C.M. Pancerella, C.S. Lansing, K.L. Schuchardt, B.T. Didier, N. dation Graduate Research Fellowship. A.P.S. is supported by NSF’s Ashish, C.A. Goble, Multi-scale science: supporting emerging practice with DataONE project. The use of trade or product names does not con- semantically derived provenance, in: Proceedings of the ISWC workshop on Semantic Technologies for Searching and Retrieving Scientific Data. stitute endorsement by the US Government. [29] J. Futrelle, J. Gaynor, J. Plutchak, J.D. Myers, R.E. McGrath, P. Bajcsy, J. Kastner, K. Kotwani, J.S. Lee, L. Marini, R. Kooper, T. McLaren, Y. Liu, Semantic middleware for e-science knowledge spaces, in: Proceedings of the 7th International References Workshop on Middleware for Grids, Clouds and e-Science, MGC’09, ACM, New York, NY, USA, 2009, pp. 4:1–4:6. [1] National Fish Habitat Board, Through a Fish’s Eye: The Status of Fish Habitats [30] I. Horrocks, P.F. Patel-Schneider, H. Boley, S. Tabet, B. Grosof, M. Dean, SWRL: A in the United States 2010, Technical Report, Association of Fish and Wildlife Semantic Web Rule Language Combining OWL and RuleML, Technical Report, Agencies, Washington, D.C., 2010. W3C, 2004. [2] F. Villa, I.N. Athanasiadis, A.E. Rizzoli, Modelling with knowledge: a review of [31] L. Moreau, J. Freire, J. Futrelle, R. McGrath, J. Myers, P. Paulson, The open emerging semantic approaches to environmental modelling, Environmental provenance model: An overview, in: J. Freire, D. Koop, L. Moreau (Eds.), Modelling & Software 24 (2009) 577–587. Provenance and Annotation of Data and Processes, in: Lecture Notes in [3] P. Wang, J.G. Zheng, L. Fu, E.W. Patton, T. Lebo, L. Ding, Q. Liu, J.S. Luciano, Computer Science, vol. 5272, Springer, Berlin, Heidelberg, 2008, pp. 323–326. D.L. McGuinness, A semantic portal for next generation monitoring systems, [32] L. Moreau, B. Clifford, J. Freire, J. Futrelle, Y. Gil, P. Groth, N. Kwasnikowska, S. in: Proceedings of the 10th International Semantic Web Conference, Springer- Miles, P. Missier, J. Myers, B. Plale, Y. Simmhan, E. Stephan, J.V. den Bussche, Verlag, Berlin, Heidelberg, 2011, pp. 253–268. The open provenance model core specification (v1.1), Future Generation [4] P. Wang, Semantically enabling next generation environmental informatics Computer Systems 27 (2011) 743–756. portals, Master’s Thesis, Rensselaer Polytechnic Institute, 2012. [33] Y.-F. Li, G. Kennedy, F. Ngoran, P. Wu, J. Hunter, An ontology-centric [5] The American Veterinary Medical Association, One Health: A New Professional architecture for extensible scientific data management systems, Future Imperative, Technical Report, The American Veterinary Medical Association, Generation Computer Systems 29 (2013) 641–653. 2007. [34] K.W. Chau, An ontology-based knowledge management system for flow and [6] R.T. Fielding, Principled Design of the Modern Web Architecture, University of water quality modeling, Advances in Engineering Software 38 (2007) 172–181. California, Irvine. [35] E. Bozsak, M. Ehrig, S. Handschuh, A. Hotho, A. Maedche, B. Motik, D. Oberle, C. Schmitz, S. Staab, L. Stojanovic, et al., Kaontowards a large scale semantic web, in: E-Commerce and Web Technologies, Springer, 2002, pp. 304–313. [36] Q. Liu, Q. Bai, C. Kloppers, P. Fitch, Q. Bai, K. Taylor, P. Fox, S. Zednik, 22 http://news.rpi.edu/content/2013/06/27/new-project-aims-make- L. Ding, A. Terhorst, D.L. McGuinness, An ontology-based knowledge new-york%E2%80%99s-lake-george-%E2%80%9Csmartest-lake%E2%80%9D- management framework for distributed water information system, Journal of world?destination=node/40140. Hydroinformatics 15 (4) (2013) 1169–1188. E.W. Patton et al. / Future Generation Computer Systems ( ) – 11

[37] D.G. Tarboton, D. Maidment, I. Zaslavsky, D.P. Ames, J. Goodall, J.S. Horsburgh, Linyun Fu is a Ph.D. student at Rensselaer Polytechnic CUAHSI Hydrologic Information System 2010 Status Report, Technical Report, Institute, studying data portal modeling. He has worked Consortium of Universities for the Advancement of Hydrologic Science, Inc., on several environmental information systems since he 2010. joined Tetherless World Constellation in 2010. [38] P. Taylor, G. Walker, D. Valentine, S. Cox, WaterML2.0: harmonising standards for water observation data, Geophysical Research Abstracts (2010). [39] T. Bray, J. Paoli, C.M. Sperberg-McQueen, E. Maler, F. Yergeau, Extensible markup language (xml), World Wide Web Journal 2 (1997) 27–66. [40] C. Bizer, D2RQ—treating non-rdf databases as virtual rdf graphs, in: Proceedings of the 3rd International Semantic Web Conference. [41] E. Prud’hommeaux, A. Seaborne, SPARQL Query Language for RDF, Technical Report, W3C, 2008. [42] F.J. Dein, M.K. Hines, C.M. Marsh, Tools for electronic reporting and analysis of wildlife disease events, in: Proceedings of the 61st Annual Conference of the Wildlife Disease Association, Quebec City, Quebec. [43] HarmonISA, Harmonisation of land use data, 2012. F. Joshua Dein, VMD, MS, is a Veterinary Medical Officer at the USGS National Wildlife Health Center. He has been involved in numerous projects focused on the development of tools for multidisciplinary knowledge Evan W. Patton is a Ph.D. student at Rensselaer Polytech- integration, and for public input and use of wildlife event nic Institute, studying how semantic technologies impact data. Some of these include Wildpro, the Canary Database, power consumption on mobile devices and how semantics and the WDIN Wildlife Health Event Reporter. can be applied to a number of problems that benefit from the availability of mobile devices in the field. He holds an M.S. in Cognitive Science from Rensselaer Polytechnic In- stitute.

R. Sky Bristol is currently the Director of Applied Earth Patrice Seyed is a postdoctoral fellow at the Tetherless Systems Informatics Research at the US Geological Survey World Constellation at RPI, via the Data Integration and in Denver, Colorado. Sky’s passion for the effective blend- Semantics Work Group of the NSF DataONE project, un- ing of science and technology towards more effective and der the supervision of Deborah McGuinness. Patrice com- efficient resource management decision making led him pleted his Ph.D. in Computer Science and Engineering from from an early career in environmental contaminants in- the University of Buffalo (UB) in December of 2011. Prior vestigations with the US Fish and Wildlife Service to the to beginning his Ph.D. program at UB, Patrice received an pursuit of a computer and information science capability M.A. in Psychology and an M.S. in Computer Science, from in the USGS. Sky believes strongly in pushing the bound- Boston University. Patrice’s main interests include formal aries of the possible and challenging established wisdom and applied ontology, including the practical application through enabling smart scientists and technologists, cre- of semantic web technologies to help solve data trans- ating right places and right times for innovation. parency and data integration research problems.

Ping Wang is a software engineer at Amazon. She has Deborah L. McGuinness is a leading expert in knowledge worked on how semantic science and technologies enable representation and reasoning languages and systems, next-generation environmental information systems after and she has worked in ontology creation and evolution she joined Tetherless World Constellation 2010. She holds environments for over 20 years. Deborah is best known for an M.S. in Computer Science from Rensselaer Polytechnic her leadership role in semantic web research, and for her Institute. work on explanation, trust, and applications of semantic web technology, particularly for scientific applications. She is a senior constellation professor of the Tetherless World Constellation at Rensselaer Polytechnic Institute.