AI Magazine Volume 29 Number 1 (2008) (© AAAI)

Enabling Scientific Research Using an Interdisciplinary Virtual Observatory
The Virtual Solar-Terrestrial Observatory Example

Deborah L. McGuinness, , Luca Cinquini, Patrick West, Jose Garcia, James L. Benedict, and Don Middleton

■ Our work is aimed at enabling a new style of virtual, distributed scientific research. We have designed, built, and deployed an interdisciplinary virtual observatory, an online service providing access to what appears to be an integrated collection of scientific data. The Virtual Solar-Terrestrial Observatory (VSTO) is a production data framework providing access to observational data sets from fields spanning upper atmospheric terrestrial physics to solar physics. The observatory allows virtual access to a highly distributed and heterogeneous set of data that appears as if all resources are organized, stored, and retrieved or used in a common way. The end-user community includes scientists, students, and data providers. We will introduce interdisciplinary virtual observatories and their potential impact by describing our experiences with VSTO. We will also highlight some benefits of the embedded semantic web technology and also provide evaluation results after the first year of use.

Copyright © 2008, Association for the Advancement of Artificial Intelligence. All rights reserved. ISSN 0738-4602

Scientific data is being collected and maintained in digital form in high volumes by many research groups. The need for access to and interoperability between these repositories is growing. Research groups need to access their own increasingly diverse data collections. As investigations begin to include results from many different experiments, researchers also need to access and utilize other research groups’ data repositories in a single discipline or, more interestingly, in multiple disciplines. Also, it is not simply trained scientists who are interested in accessing scientific data; lay people are becoming interested in looking at trends in scientific data as well, for example, when they become engaged in climate discussions.

The promise of the true virtual interconnected heterogeneous distributed international data repository is starting to be realized. Many challenges still exist, including interoperability and integration between data collections. We are exploring ways of technologically enabling scientific virtual observatories: distributed resources that may contain vast amounts of scientific observational data, theoretical models, and analysis programs and results from a broad range of disciplines. Our goal is to make these repositories appear as if they are one integrated local resource, while realizing that the information is collected by many research groups, using a multitude of instruments with varying instrument settings in multiple experiments with different goals, and captured in a wide range of formats. Initially our focus is on trained scientists. Our ultimate goal is to provide support for broader usage, including lay people.

Because we believe science increasingly requires interactions across multiple domains, our setting is interdisciplinary virtual observatories. A researcher with a single Ph.D. is expected to have depth in his or her chosen subject area but is unlikely to have enough depth to be considered a subject matter expert in the entire collection. Vocabulary differences across disciplines, varying terminologies, similar terms with different meanings, and multiple terms for the same phenomenon or process provide challenges. These challenges present barriers to efforts that hope to use existing technology in support of interdisciplinary data query and access, especially when the interdisciplinary applications must go beyond search and access to actual manipulation and use of the data. In addition, the user community has a more diverse level of education and training. We used artificial intelligence technologies, in particular semantic technologies, to create declarative, machine-operational encodings of the semantics of the data to facilitate interoperability and semantic integration of data. We then wrote web services that used background knowledge to help them find, manipulate, and present scientific data.

Encoding formal semantics in the technical architecture of virtual observatories and their associated data frameworks is similar to efforts to add semantics to the web in general (Berners-Lee et al. 2006), workflow systems (for example, Gil, Ratnakar, and Deelman [2006]), computational grids (for example, DeRoure, Jennings, and Shadbolt [2005]) and data-mining frameworks (such as Rushing et al. [2005]). The value added by basic knowledge representation and reasoning is supporting both computer-to-computer and researcher-to-computer interfaces that find, access, and use data in a more effective, robust, and reliable way.

In the rest of this article, we describe our virtual observatory project, including our vision, design, and AI-enabled implementation. We will highlight where we are using semantic web technologies and discuss our motivation for using them and some benefits we are realizing. We describe our deployment and maintenance settings that started production in the summer of 2006. We also include results from our initial evaluation study.

Task Description

Our goal was to create a scalable interdisciplinary virtual observatory that would support scientists in searching, integrating, and analyzing distributed heterogeneous data resources. A distributed multidisciplinary [...]-enabled virtual observatory requires a higher level of semantic interoperability than was previously required by most (if not all) distributed data systems or discipline-specific virtual observatories. Existing work targeted subject matter experts as end users and did little to support integration of multiple collections (other than providing basic access to search interfaces that are typically specialized and idiosyncratic).

Our initial science domains were those of interest to scientists who study the Earth’s atmosphere and the Sun. Our initial virtual observatory is thus VSTO, the Virtual Solar-Terrestrial Observatory. Scientists in these areas must utilize a balance of observational data, theoretical models, analysis, and interpretation to make effective progress. Since many data collections are interdisciplinary, and growing in volume and complexity, the task of making them a research resource that is easy to find, access, compare, and utilize is still a significant challenge. These collections provide a good initial focus for virtual observatory work since the data sets are of significant scientific value to a set of researchers and capture many, if not all, of the challenges inherent in complex, diverse scientific data. We view VSTO as representative of multidisciplinary virtual observatories in general and thus claim that many of our results can be applied in other multidisciplinary virtual observatory efforts.

To provide a scientific infrastructure that is usable and extensible, VSTO requires contributions concerning semantic integration and knowledge representation while requiring depth in a number of science areas. We chose an AI technology foundation because of the promise for a declarative, extensible, reusable technology platform.

Application Description

The application uses background information about the terms used in the subject matter repositories. We encoded this information in OWL (McGuinness and van Harmelen, 2004), the ontology language recommended by the World Wide Web Consortium (W3C). We used best-in-class semantic web tools for development. We used both the SWOOP1 and Protégé2 editors for ontology development. The definitions in the ontologies are used (through the Jena3 and Eclipse4 Protégé plug-ins) to generate Java classes in a Java object model. We built Java services that use this Java code to access the catalog data services. We use the Pellet5 reasoning engine to compute information that is implied and also to identify contradictions. The user interface uses the Spring6 framework for supporting workflow and navigation.

The main AI elements that support the semantic foundation for integration in our application include the OWL ontologies and a description logic reasoner (along with supporting tool infrastructure for ontology editing and validation). We will describe these elements, how they are used to create “smart” web services, and their impact in the next two sections. Figure 1 depicts the software architecture.

[Figure 1 diagram, summarized: the OWL ontologies (vsto_core.owl, vsto.owl, and per-archive ontology and instance files such as cedar.owl and cedar_instances.owl) are used, through the Java Protégé-OWL API, to automatically generate Java interfaces and classes (packages ncar.vsto.auto and ncar.vsto.auto.impl) that form the Java object model, with stub extensions for inserting custom functionality. VSTOservice, a high-level object-oriented API for managing and querying the VSTO ontology, uses this object model and the Pellet reasoning engine and is extended by the CEDAR-specific CEDARservice and the MLSO-specific MLSOservice, which reach the CEDAR and MLSO databases. The VSTO web portal consists of Spring controllers and data beans with JSP views; JUnit tests and simulation programs exercise the use cases and workflow.]

Figure 1: VSTO Software Architecture.

Artificial Intelligence Technology Usage Highlights

We made the effort to create ontologies defining the terms used in the data collections because we wanted to leverage the precise formal definitions of the terms for [...] and interoperability. The use cases described below were used to scope the ontologies. The general form of the use cases is “retrieve data (from appropriate collections) subject to (stated and implicit) constraints and plot in a manner appropriate for the data.”

The three initial motivating use case scenarios are provided in a templated form and then in an instantiated form:

Template 1: Plot the values of parameter X as taken by instrument Y subject to constraint Z during the period W using data product S.
Example 1: Plot the neutral temperature (parameter) taken by the Millstone Hill Fabry-Perot interferometer (instrument) looking in the vertical direction from January 2000 to August 2000 as a time series (data product).

Template 2: Find and retrieve image data of type X for images of content Y during times described by Z.
Example 2: Find and retrieve quick look and science data for images of the solar corona during a recent observation period.

Template 3: Find data for parameter X constrained by Y during times described by Z.
Example 3: Find data representing the state of the neutral atmosphere anywhere above 100 km and toward the Arctic circle (above 45 degrees N) at times of high geomagnetic activity.

After we elaborated upon the use cases, we identified the breadth and depth of the science terms that were used to determine what material we needed to cover and also to scope the search for controlled vocabulary starting points. Essentially we looked at the variables in the templates above and natural hierarchies in those areas (such as an instrument hierarchy), important properties (such as instrument settings), and restrictions. We also looked for useful simplifications in areas such as the temporal domain. The data collections already embodied a significant number of controlled vocabularies. We began with the main science communities: CEDAR7 and MLSO8. The CEDAR archive provides an online database of middle and upper atmospheric, geophysical index, and model data. The MLSO archive provides an online database (including many images) of solar atmospheric physics data. The CEDAR holdings embody a controlled vocabulary including terms related to observatories, instruments, operating modes, parameters, observations, and so on. MLSO holdings also embody a controlled vocabulary with significant overlap in concepts.

We searched for existing ontologies in our domain areas and identified SWEET9, an ontology gaining traction in the science community with sufficient overlap with our domains. SWEET considers itself an upper-level ontology for Earth and environmental scientists (and from a general perspective, it would be considered a midlevel ontology). This ontology covered much more than we needed in breadth and not enough in depth in multiple places. We reused the conceptual decomposition and terms from the ontology as much as possible and added depth in the areas we required.

We focused on high-leverage domain areas. These areas also have proven to be leverageable in applications outside of a solar-terrestrial focus. The expansion into the disciplines of volcanic effects on climate has led us to reuse many of the ontology concepts we developed for VSTO (McGuinness et al. 2007a, Fox et al. 2007a). Our first focus area was instruments. One challenge for integration of scientific data taken from multiple instruments is understanding the data-collection conditions. It is important to collect not only the instrument (along with its geographic location) but also its operating modes and settings. Scientists who need to interpret data may need to know how an instrument is being used, that is, using a spectrometer as a photometer. (The Davis Antarctica Spectrometer is a spectrophotometer and thus has the capability to observe data that other photometers may collect.) A more sophisticated notion is capturing the assumptions embedded in the experiment in which the data was collected and potentially the goal of the experiment. Phase II of our work will address these latter issues. A schematic of part of the ontology is given in figure 2.

Reasoning

Our goal was to create a system usable by a broad range of people, some of whom will not be trained in all areas of science covered in the collection. Initially, we targeted trained scientists (in future work, we plan to expand the interfaces and data collections to include components appropriate for a broader population). The previous science systems required a significant amount of domain knowledge to formulate meaningful and correct queries. Previous interfaces required multiple decisions (eight for CEDAR and five for MLSO) to be made by the query generator, and those decisions were difficult to make without depth in the subject matter. We used the background ontologies together with the reasoning system to do more work for users and to help them form queries that are both syntactically correct and semantically meaningful. For example, in one workflow pattern, users are prompted for an instrument, and they may choose to filter the instruments by class. If they ask for photometers, they will be given options shown in figure 3, and for at least some of these it is not obvious by name that they can act as a photometer.

An unexpected outcome of the additional knowledge representation and reasoning was that the same data-query workflow is used across the two disciplines. We expect it to generalize to a variety of other data sets as well, and we have seen evidence supporting this expectation in our work on other semantically enabled data-integration efforts in domains including volcanology, plate tectonics, and climate change (McGuinness et al. 2007b, Fox et al. 2006a), which, as noted earlier, is both leveraging our VSTO ontology work and adding the need for additional reasoning in support of the data integration in this latter project (McGuinness et al. 2007a, Fox et al. 2007b).

The reasoner is also used to deduce the potential plot type and return products as well as the independent variable for plotting on the axes. Previously, users needed to specify all of these items without assistance. One useful reasoning calculation is the determination of parameters that make sense to plot along with the parameter specified. The background ontology is leveraged to determine, for example, that if one is retrieving data concerning neutral temperature (subject to certain conditions) a time series plot is the appropriate plotting method and neutral winds (the velocity field components) should be shown.
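The plot deduction described above can be pictured with a small Python sketch. It is not the VSTO implementation, which relies on OWL ontologies and the Pellet reasoner from Java; here a hand-written lookup table stands in for the background ontology. The names `PlotAdvice`, `PARAMETER_KNOWLEDGE`, and `advise_plot` are hypothetical, and the choice of time as the independent variable for a time series is an assumption made for illustration. Only the neutral temperature entry reflects an example given in the text.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class PlotAdvice:
    """What a query interface would need to know before plotting a parameter."""
    plot_type: str
    independent_variable: str
    companion_parameters: Tuple[str, ...]


# Hypothetical stand-in for knowledge the article keeps in the ontology.
# The neutral temperature entry mirrors the example in the text; using
# "time" as the independent variable is an assumption of this sketch.
PARAMETER_KNOWLEDGE: Dict[str, PlotAdvice] = {
    "neutral temperature": PlotAdvice(
        plot_type="time series",
        independent_variable="time",
        companion_parameters=("neutral winds",),
    ),
}


def advise_plot(parameter: str) -> PlotAdvice:
    """Return plotting advice for a parameter, or raise KeyError if unknown."""
    key = parameter.strip().lower()
    if key not in PARAMETER_KNOWLEDGE:
        raise KeyError(f"no background knowledge for parameter: {parameter!r}")
    return PARAMETER_KNOWLEDGE[key]


if __name__ == "__main__":
    advice = advise_plot("Neutral Temperature")
    print(advice.plot_type, advice.independent_variable, advice.companion_parameters)
```

In a real reasoner the advice would be inferred from class definitions and property restrictions rather than looked up, so adding a new parameter subclass would extend the behavior without editing a table like this one.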


[Figure 2 diagram, summarized: class nodes Instrument, Optical Instrument, Photometer, Spectrophotometer, Coronometer, Observatory, Data Archive, and DataProduct, joined by "is a" links and by the properties hasDataProduct, hasMeasuredParameter, hasDataArchive, dataArchiveFor, hasInstrumentOperatingMode, hasOperatedInstrument, and isOperatedByObservatory; the Davis Antarctic Spectrophotometer appears as an example instance.]

Figure 2: VSTO Ontology Instrument Fragment.
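To make the class-based instrument filtering concrete, the following Python sketch encodes part of the instrument hierarchy of figure 2 and answers "which instruments can act as a photometer?" by walking subclass links. It is an illustration only: VSTO expresses this in OWL and delegates the inference to a description logic reasoner. The subclass edges are one reading of the figure and the text (a spectrophotometer is a photometer, which is an optical instrument), and the entries "Example Photometer" and "Example Radar" are hypothetical instances added so that the filter has something to exclude.

```python
from typing import Dict, List, Optional

# Subclass edges, child -> parent. An assumed reading of figure 2.
SUBCLASS_OF: Dict[str, Optional[str]] = {
    "Instrument": None,
    "OpticalInstrument": "Instrument",
    "Photometer": "OpticalInstrument",
    "Spectrophotometer": "Photometer",
}

# Instrument instances and their most specific class.
INSTANCE_OF: Dict[str, str] = {
    "Davis Antarctic Spectrophotometer": "Spectrophotometer",  # named in the article
    "Example Photometer": "Photometer",                        # hypothetical
    "Example Radar": "Instrument",                             # hypothetical
}


def is_subclass_of(cls: str, ancestor: str) -> bool:
    """True if cls equals ancestor or reaches it by following parent links."""
    current: Optional[str] = cls
    while current is not None:
        if current == ancestor:
            return True
        current = SUBCLASS_OF[current]
    return False


def instruments_of_class(cls: str) -> List[str]:
    """All instruments whose class is cls or any subclass of it, sorted by name."""
    return sorted(
        name for name, c in INSTANCE_OF.items() if is_subclass_of(c, cls)
    )


if __name__ == "__main__":
    # The spectrophotometer is offered as a photometer although its name
    # does not say so, which is the behavior the text describes.
    print(instruments_of_class("Photometer"))
```

A description logic reasoner does considerably more than this transitive walk (for example, classifying instances from property restrictions), so the sketch covers only the simplest case.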

Complex Scientific Data Case Study

Our first and third use cases involve a heterogeneous collection of community data from a nationally funded global change research program, CEDAR. The data collection consists of more than 310 different instruments, and the data holdings, which are often specific to each instrument, contain over 820 measured quantities (or parameters) including physical quantities, derived quantities, indices, and ancillary information. CEDAR is further complicated by the lack of specification of independent variables in data sets. Also, the original logical data record encoding for many instruments contains interleaved records representing data from the instrument operating in different modes. Thus odd and even records typically contain different parameters. Sometimes these records are returned without column headings, so the user needs to be knowledgeable in the science domain and in the retrieval system just to make sense of the data.

In solar physics images, the original data presentation was that of complex data products, for example, Mark IV white light polarization brightness vignetted data (rectangular coordinates). This is a compound description containing instrument name (Mark IV), parameter (brightness), operating mode (white light polarization), and processing operations (vignetted data indicates it has not been corrected for that effect, and a coordinate transformation to rectangular coordinates is specified). Further, the data content retrieved cannot be distinguished from another file unless the file-name encoding is understood.

Ontologies for Interdisciplinary Observational Science Systems

We focused on six root classes: instrument, observatory, operating mode, parameter, coordinate (including date/time and spatial extent), and data archive. While this set of classes does not cover all observational data, it was interesting to note that as we have added data sources to the VSTO use cases, we have found these classes to capture the key and defining characteristics of a significant number of observational data holdings in solar and solar-terrestrial physics. As a result, the knowledge represented in these classes is applicable across a range of disciplines. While we do not claim that we have designed a universal broad coverage representation for all observational data sources, we believe that this is a major step in that direction. We have tested the hypothesis to some extent in areas including volcanoes and plate tectonics. The work has strong similarities to work in the geospatial application domain (Cox 2006, Wolff et al. 2006).

Figure 3: VSTO Data Search and Query Interface, Exposing Taxonomy-Based Instrument Selection.

Uses of AI Technology: Ontology-Enhanced Search

VSTO depends on a number of AI components and tools including background ontologies, reasoners, and, from a maintenance perspective, the semantic technology tools including ontology editors, validators, and plug-ins for code development. We designed the ontology to limit its expressiveness to OWL-DL. We did this so that we could leverage the reasoners available for OWL-DL, along with their better computational efficiency. Within OWL-DL, we basically had the expressive power we needed with the following two exceptions. We could use support for numerics (representation and comparison, such as the proposal for numerics in OWL 1.1) and defaults. The current application does not use an encoding for default values. Our current application handles numerical analysis with special-purpose query and comparison code. While it would have been nice to have more support within the semantic web technology toolkit, this is somewhat less of an issue for our application since the sheer quantity of numerical data meant that we needed special-purpose handling anyway. The quantity of data in the distributed repositories is overwhelming, so we have support functions for accessing it directly from those repositories instead of actually retrieving it into some cached or local store. Our solution uses semantically enhanced web services to retrieve the data directly.

We used only open source free software for our project. From an ontology editing and reasoning perspective, this mostly met our needs. A few times in the project, it would have been nice to have had the support that one typically gets with commercial software, but we did get some support where needed on the mailing lists and with limited personal communication. The one thing that we would make the most use of if it existed would be a commercial-strength collaborative ontology evolution and source control system. Our initial rounds of development on the ontology were distributed in design but centralized in input because our initial environment was fragile in terms of building the ontology and then generating robust functional Java code. The issues concerning the development environment did eventually get resolved, and we are now doing distributed ontology development and maintenance using modularization and social conventions.
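The division of labor described in this section, with class-level knowledge in the ontology and numeric comparison in special-purpose code, can be sketched in Python as follows. This is an assumption-laden illustration rather than the VSTO code: the `Observation` record, its field names, and `select_observations` are hypothetical, and the thresholds are taken from use case example 3 (above 100 km, above 45 degrees N).

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Observation:
    """One numeric record as it might come back from a data repository."""
    parameter: str
    altitude_km: float
    latitude_deg: float
    value: float


def select_observations(
    observations: Iterable[Observation],
    parameter: str,
    min_altitude_km: float = 100.0,
    min_latitude_deg: float = 45.0,
) -> List[Observation]:
    """Keep records for one parameter that lie strictly above both thresholds.

    The numeric comparisons happen here, outside any ontology, in the spirit
    of the special-purpose query and comparison code mentioned in the text.
    """
    return [
        obs
        for obs in observations
        if obs.parameter == parameter
        and obs.altitude_km > min_altitude_km
        and obs.latitude_deg > min_latitude_deg
    ]


if __name__ == "__main__":
    sample = [
        Observation("neutral temperature", 250.0, 67.0, 900.0),
        Observation("neutral temperature", 90.0, 67.0, 200.0),   # too low
        Observation("neutral temperature", 250.0, 30.0, 950.0),  # too far south
        Observation("neutral wind", 250.0, 67.0, 120.0),         # other parameter
    ]
    for obs in select_observations(sample, "neutral temperature"):
        print(obs)
```

Keeping the thresholds as function parameters rather than ontology axioms matches the constraint noted above that OWL-DL offered no built-in numeric comparison at the time.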


Application Use and Evaluation get it right away,” “nice to have quarter of the options,” and “I am getting closer to 1 query to 1 VSTO has been operational since the summer of data retrieval, that’s nice.” 2006. It has achieved broad acceptance and is cur- Additionally, members of our team who do not rently used by approximately 80 percent of the have training in the subject area are able to use this 10 research community. The production VSTO por- interface while they were unable to use previously tal has been the primary entry point to date for existing systems (largely because they did not have users (as well as those interested in semantic web enough depth in the area, for example, to know technologies in practice). Until recently, all data which parameters needed to be grouped together query formations up to the stage of data retrieval or other subject-specific information). As we pre- in the new and old portal were treated anony- sented this work in computer, biomedical, and mously. The newest release of the portal now cap- physical science communities, we have had many tures session statistics that we reported upon interested parties request accounts to try out the briefly at the IAAI/AAAI June 2007 meeting. We capabilities, and all have successfully retrieved or now collect query logs in the form of both access- plotted data, even users from medical informatics es to the triple store (Jena in memory), as well as who know nothing about space physics. One com- calls to the reasoner (Pellet) and any SPARQL mented, “This is cool, I can now impress my kids.” queries. We are also investigating click-stream This was made possible by appropriately plotting methods of instrumenting parts of the portal inter- the data in a visually appealing and meaningful face as well as the underlying key classes in the API. way, something that someone unfamiliar with the Our intent is to capture and distinguish between data or science could not have done before. 
portal and web services access (which also record There have been multiple payoffs for the system, details of the arguments and return documents) many of which have quantitative metrics. and query formation. Perhaps most importantly First, there are decreased input requirements. the results of an evaluation study conducted at the The previous system required the user to provide same meeting are presented in a later section. eight pieces of input data to generate a query, and Currently there are on average between 80 and 90 our system requires three. Additionally, the three distinct users authenticated through the portal and choices are constrained by value restrictions prop- issuing 400 to 450 data requests per day, resulting in agated by the reasoning engine. Thus, we have data access volumes of 100 KB to 210 MB per request. In the last year, 100 new users have regis- made the workflow more efficient and reduced tered, more than four times the number from the errors (note the supportive user comments two previous year. The users registered last year when paragraphs ago). the new portal was released and after the primary A second payoff is syntactic query support. The community workshop in June at which the new interface generates only syntactically correct VSTO system was presented. At that meeting, com- queries. The previous interface allowed users to munity agreement was given to transfer operations edit the query directly, thus providing multiple to the new system and move away from the existing opportunities for syntactic errors in the query for- one. mation stage. 
As one user put it, “I used to do one At the 2006 CEDAR workshop a priority area for query, get the data, and then alter the URL in a way the community was identified that involved the I thought would get me similar data, but I rarely accuracy and consistency of temperature measure- succeeded; now I can quickly regenerate the query ments determined from instruments like the Fab- for new data and always get what I intended.” ry-Perot interferometer. As a result, we have seen a Third, there is semantic query support. By using 44 percent increase in data requests in that area. background ontologies and a reasoner, our appli- We increased the granularity in the related portion cation has the opportunity to expose only query of the ontology to facilitate this study. We focused options that will not generate incoherent queries. on improving users’ ability to find related or sup- The interface exposes options, for example, only in portive data with which to evaluate the neutral date ranges for which data actually exists. This temperatures under investigation. We are seeing an semantic support did not exist in the previous sys- increase (10 percent) in other neutral temperature tem. In fact, the previous interface had limited data accesses, which we believe is a result of this functionality to minimize the chances of mislead- related need. ing or semantically incorrect query construction. One measure that we hoped to achieve is to have This means, for example, that users have increased usage by all levels of domain scientist—from the functionality—that is, they can now initiate a principal investigator (PI) to the early-level gradu- query by selecting a class of parameters. As the ate student. Anecdotal evidence shows this is hap- query progresses, the subclasses and specific pening, and self-classification also confirms the instances of that parameter class are available as distribution. 
A scientist doing model/observation- the data sets are identified later in the query al comparisons noted, “took me two passes now, I process. We removed the parameter-initiated

SPRING 2008 71 Articles

search in the previous system because only the the data. Our approach to developing the ontology parameter instances could be chosen (for example, allows us to add new subclasses, properties, and there are eight different instances that represent relationships in a way that will naturally evolve neutral temperature, 18 representations of time, the reasoning capabilities available to a user, as and so on), and it was too easy for the wrong one well as to incoming and outgoing web services, to be chosen, quickly leading to a dead-end query especially as those take advantage of our ontolo- and frustrated user. One user with more than five gies. years of CEDAR system experience noted, “Ah, at We conducted an informal user study asking last; I’ve always wanted to be able to search this three questions: What do you like about the new way, and the way you’ve done it makes so much searching interface? Are you finding the data you sense.” need? What is the single biggest difference? Users Fourth is semantic integration. Users now are already changing the way they search for and depend on the ontologies rather than themselves access data. Anecdotal evidence indicates that to represent and remember the nuances of the ter- users are starting to think at the science level of minologies used in varying data collections. Per- queries, rather than at the former syntactic level. haps more importantly, they also can access infor- For example, instead of telling students to enter a mation about how data was collected, including particular instrument and date and time range and the operating modes of the instruments used. 
“The see what they get, they are able to explore physical fact that plots come along with the data query is quantities of interest at relevant epochs where really nice, and that when I selected the data it these quantities go to extreme values, such as auro- comes with the correct time parameter” (new grad- ral brightness at a time of high solar activity uate student, approximately one year of use). The (which leads to spectacular auroral phenomena). nature of the encoding of time for different instru- A one-hour VSTO workshop was held at the ments means not only that are there 18 different annual CEDAR community meeting on the day parameter representations but also that those after the main plenary presentation for VSTO. The parameters are sometimes recorded in the prologue workshop was very well attended with 35 diverse entries of the data records, sometimes in the head- participants (25 were expected) including senior er of the data entry (that is, as ), and researchers, junior researchers, postdoctoral fel- sometimes as entries in the data tables themselves. lows, and students—including 3 that had just start- Users had to remember (and maintain codes) to ed in the field. account for numerous combinations. The seman- After some self-introductions, eight questions tic mediation now provides the level of sensible were posed and responses recorded, some by count data integration required. (yes/no) or comment. Overall responses ranged Finally, there is a broader range of potential from 5 to 35 per question. We note some general users. VSTO is usable by people who do not have responses as well as some more specific. Out of Ph.D.-level expertise in all of the domain science these responses we identified some new use cases, areas, thus supporting efforts including interdisci- which we enumerate. The quantitative questions plinary research. 
The user population consists of and answers were: students (undergraduate, graduate) and nonstu- How do you like to search for data? Browse, type a dents (instrument PI, scientists, data managers, query, visual? Responses: 10; Browse = 7, Type = 0, professional research associates). For CEDAR, there Visual = 3. were 168 students and 337 nonstudents. For What other concepts are you interested in using for MLSO, there were 50 students and 250 nonstu- search, for example, time of high solar activity, cam- dents. In addition 36 percent and 25 percent of the paign, feature, phenomenon, others? Responses: 5; users are non-U.S. based (CEDAR—a 57 percent all of these, no others were suggested. increase over the last year—and MLSO, respective- Do the interface and its services deliver the func- ly). The relative percentage of students has tionality, speed, flexibility you require? Responses: increased by approximately 10 percent for both 30; Yes = 30, No = 0. groups. Are you finding the data you need? Responses: 35; Over time, as we continue to add data sources Yes = 34, No = 1. and their associated instruments and measured How often do you use the interface in your normal parameters, users will benefit by being able to find work? Responses: 19; Daily = 13, Monthly = 4, Longer even more data relevant to their inquiry than = 2. before with no additional effort or changes in Are there places where the interface/services fail to search behavior. For example, both dynamic and perform as desired? Responses: 5; Yes = 1, No = 4. climatological models to be added provide an alter- nate, complementary, or comparative source of The more qualitative questions were: data to those measured by instruments, but at pres- What do you like about the new searching inter- ent a user has to know how to search for and use face? Responses: 9.


What is the single biggest difference? Responses: 8.

The general answers were as follows:

1. Fewer clicks to data
2. Autoidentification and retrieval of independent variables
3. Faster
4. Seems to converge faster

The majority of the comments were on the first three, but a few people identified that the search seemed to be more accurate (and converge faster). There were three new users in the session, and their responses were mixed in with the group at large. Interestingly, all three had no problem searching for and accessing the data archives.

Some of the more specific comments (in quote fragment form since they were given orally and the recorder had to write them down quickly) were:

It makes sense now!
[I] Like the plotting.
Finding instruments I never knew about.
Descriptions are very handy.
What else can you add?
How about a Python interface [to the services]?

These general and specific comments, along with the more quantitative answers above, indicate that the VSTO, built on semantic technologies, provided significant additional value for the users and the developers. In several cases, the answers and several unsolicited comments matched almost exactly the ad hoc feedback we had received in the initial study, thus confirming our sense of the initial evaluation.

Several new use cases arose out of the responses. First was the need for a programming/script-level interface, that is, building on the services interfaces, in Python, Perl, C, Ruby, Tcl, and three others. Second was the addition of models alongside observational data, that is, finding data from observations or models that are comparable or compatible. Third were more services (particularly plotting options, for example, coordinate transformation, that are hard to add without detailed knowledge of the data).

The requirement for script-level access was unexpected but afterward seemed a natural way for developers to access the programming-level interface of VSTO (API and web services). Python was the clear first choice but was followed closely by Perl. Numerical (simulation) models are an increasing need for the CEDAR community, especially for integrating models and observational data. The need is that the models fit seamlessly into the selection criteria. This means they are not part of the instrument selection, which is the way the previous interface exposed model simulation data (listing the models as instruments).

Application Development and Deployment

VSTO was funded by a three-year NSF grant. In the first year, a small, carefully chosen six-person team wrote the use cases, built the ontologies, designed the architecture, and implemented an alpha release. We had our first users within the first eight months, with a small ontology providing access to all of the data resources. Over the last two years, we expanded the ontology, made the system more robust, and increased domain coverage.

Early issues that needed attention in design included determining an appropriate ontology structure and granularity. Our method was to generate iterations, initially done by our lead domain scientist and lead knowledge representation expert, and to vet the design through use-case analysis and other subject matter experts, as well as vetting by the entire team. We developed minimalist class and property structures that capture all the concepts in classes and subclass hierarchies, including only the associations and class value restrictions needed to support the reasoning required for the use cases. This choice was driven by two factors. First, keeping a simple representation allowed the scientific domain-literate experts to view and vet the ontology easily, and second, complex class and property relations, while clear to a knowledge engineer, take time for a domain expert to comprehend and agree upon.

A practical consideration arose from Protégé with automatic generation of a Java class interface and factory classes (see figure 1 and Fox et al. [2006a] for details). As we assembled the possible user query work flows and used the Pellet reasoning engine, we built dependencies on properties and their values. If we had implemented a large number of properties and needed to change them or, as we added classes and evolved the ontology, had placed properties at different class levels, the existing code would have needed to be substantially rewritten manually to remove the old dependencies. Our current approach preserves the existing code, automatically generates the new classes, and adds incrementally to the existing code. This allows rapid development. Deployment cycles and updates to the ontology can be released with no changes in the existing data framework, benefiting developers and users.

We rely on a combination of editors (Protégé and Swoop). We use Protégé for its plug-in support for Java code generation. Earlier iterations had some glitches with interoperation in a distributed fashion that supported incremental updates, but we overcame these issues, and the team now uses a distributed, multicomponent platform.
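To illustrate why a minimalist class and property structure keeps existing code stable as an ontology evolves, the following is a minimal Python sketch. It is not the Java interface and factory classes generated from Protégé, and it does not use Pellet. The class names, property names, and helper functions are all hypothetical, chosen only for illustration. The idea shown is that a property is declared once at a general class, so a newly added subclass inherits it and code written against that property does not change.

```python
# Toy ontology: every class names its parent (None marks the root).
# All class and property names are illustrative assumptions,
# not the actual VSTO ontology.
SUBCLASS_OF = {
    "Instrument": None,
    "OpticalInstrument": "Instrument",
    "Interferometer": "OpticalInstrument",
    "FabryPerotInterferometer": "Interferometer",
}

# Properties are declared once, at the most general class where they apply.
PROPERTIES = {
    "Instrument": {"measuresParameter", "locatedAtObservatory"},
}


def ancestors(cls, hierarchy):
    """Yield cls itself, then each superclass up to the root."""
    while cls is not None:
        yield cls
        cls = hierarchy[cls]


def is_subclass(cls, candidate_super, hierarchy):
    """True if candidate_super is cls or one of its superclasses."""
    return candidate_super in ancestors(cls, hierarchy)


def properties_of(cls, hierarchy, declared):
    """Collect the properties declared on cls and on all its superclasses."""
    props = set()
    for c in ancestors(cls, hierarchy):
        props |= declared.get(c, set())
    return props


def can_search_by_parameter(cls, hierarchy=SUBCLASS_OF, declared=PROPERTIES):
    """Existing 'query' code: depends only on an inherited property."""
    return "measuresParameter" in properties_of(cls, hierarchy, declared)


# Evolving the ontology: add a subclass without touching the code above.
SUBCLASS_OF["MichelsonInterferometer"] = "Interferometer"
```

In this sketch the lookup for the newly added class succeeds with no edits to `can_search_by_parameter`. Had the property been declared separately at several class levels, each later move of a declaration would have meant revisiting the code that depends on it, which is the rewriting cost the section describes.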


Maintenance

Academic and industrial work has been done on ontology evolution environments that this project can draw on. In a paper titled “Industrial Strength Ontology Management” (Das et al. 2001), a list of ontology management requirements is provided that we endorse and include in our evolution plan: (1) scalability, (2) availability, (3) reliability and performance, (4) ease of use by domain-literate people, (5) extensible and flexible knowledge representation, (6) distributed multiuser collaboration, (7) security management, (8) difference and merging, (9) XML interfaces, (10) internationalization, including support for multiple languages, and (11) versioning. We would also add transparency and provenance.

Our efforts so far have focused on points 1–3 and to a lesser extent on 4, 10, and 11. Our new system needed to be at least as robust and useful as the previously available community system. It was imperative that our application have at least adequate performance, high reliability, and availability. We considered two aspects of scaling: first, expanding to include broader and deeper domain knowledge, and second, handling large volumes of data. We designed for performance in terms of raw quantity of data. We do not import all of the information into a local knowledge base when we know that volumes of data are large; instead we use database calls to existing data services. Thus, we do not suffer decreased performance. We address reasoning performance by limiting our representation to OWL-DL.

We built our ontology design to be extensible, and over time we are finding that the design is holding up both to extension within our project and also to reuse in other projects. We have investigated the reuse of our ontologies in our Semantically Enabled Science Data Integration (SESDI) project, which addresses virtual observatory needs in the overlapping areas of climate, volcano, and plate tectonics. We found that while, for example, seismologists use some instruments that solar-terrestrial physicists do not, the basic properties used to describe the instruments, observatories, and observations are quite similar. Routine maintenance and expansion of the ontologies are done by the larger team.

We promote use-case-based design and extensions. When we plan for extensions, we begin with use cases to identify additional vocabulary and inferences that need to be supported. We have also used standard naming conventions and have maintained as much compatibility as possible with terms in existing controlled vocabularies.

Our approach to distributed multiuser collaboration is a combination of social and technical conventions. This is largely due to the state of the art, where there is no single best multiuser ontology evolution environment. We have one person in charge of all VSTO releases, and this person maintains a versioned, stable release at all times. We also maintain an evolving, working version. The ontology is modular so that different team members can work on different pieces of the ontology in parallel.

We are just beginning our work on transparency and provenance. Our design leverages the Proof Markup Language (Pinheiro da Silva, McGuinness, and Fikes 2006), an interlingua for representing provenance, justification, and trust information. Our initial provenance plans include capturing content such as where the data came from. Once this content is captured in PML, the Inference Web toolkit (McGuinness and Pinheiro da Silva 2004) may be used to display information about why an answer was generated, where it came from, and how much the information might be believed and why. We have just received National Science Foundation (NSF) funding for extending VSTO in these directions.

Summary and Discussion

We introduced our interdisciplinary virtual observatory project, VSTO. We used semantic technologies to quickly design, develop, and deploy an integrated, virtual repository of scientific data in the fields of solar and solar-terrestrial physics. Our new virtual observatory can be used in ways that the previous system could not conveniently support, to address emerging science area topics such as the correctness of temperature measurements from Fabry-Perot interferometers. A few highlights of the knowledge representation that may be of interest follow.

We designed what appears to be an extensible, reusable ontology for solar-terrestrial physics. It is compatible with controlled vocabularies in use in the most widely used relevant data collections. Further, and potentially of much greater leverage, the structure of the ontology is withstanding reuse in multiple virtual observatory projects. We have reviewed the ontology with respect to needs for the NSF-funded GEON project, the NASA-funded SESDI project, and the NASA-funded SKIF project.

The SWEET ontology suite was simultaneously much too broad and not deep enough in our subject areas. If we could have imported just the portions of SWEET that we needed and expanded from there, it might have been possible to use it more directly. We made every effort to use terms from SWEET and to be compatible with the general modeling style. We are working with the SWEET developers to make a general, reusable, modular ontology for earth and space science. Our ontologies are open source and have been delivered to the SWEET community for integration. A website is available for obtaining status information on this effort: www.planetont.org.


This project has a multitude of challenges. The scope of the ontology is broad enough that it is not possible for any single scientist to have enough depth in the subject matter to provide the raw content. The project thus must be a collaborative effort. Additionally, a small set of experts could be identified to be the main contributors to particular subject areas, and an ontology could be created by them. If the ontology effort stops there, though, we will not achieve the results we are looking for. We want to have an extensible, evolving, widely reusable ontology. We believe this requires broad community buy-in that will include vetting and augmentation by the larger scientific community, and ultimately it needs usage from the broad community and multiple publication venues including a new Journal of Earth Science Informatics.

We also believe judicious work on modularization is critical since our biggest barrier to reuse of SWEET was the lack of support for importing modules that were appropriate for our particular subject areas. We believe this effort requires community education on processes for updating and extending a community resource such as a large (potentially complicated) ontology.

Today, our implementation uses fairly limited inference and supports somewhat modest use cases. This was intentional as we were trying to provide an initial implementation that was simple enough to be usable by the broad community with minimum training. Initial usage reports show that it is well received and that users may be amenable to additional inferential support. We plan to redesign the multiple-work-flow interface and combine it into a much more general and flexible single work flow that is adaptable in its entry points. Additionally, we plan to augment the ontology to capture more detail, for example, in value restrictions, and thus be able to support more sophisticated reasoning. The current implementation also has limited support for encoding provenance of data. Thus we will use the provenance interlingua PML-P to capture knowledge provenance so that end users may ask about data lineage.

Our follow-up to the initial informal evaluation in a workshop setting provided both general and specific answers and comments, as well as more quantitative yes/no or multiple-choice answers. Both sets reaffirmed the sense we obtained in the initial study that our efforts in applying AI techniques in the form of semantics led to an interdisciplinary virtual observatory that provides significant additional value for a spectrum of end users. Perhaps more importantly, it also provides significant additional value for the developers of both the VSTO and other federated virtual observatories and data systems wishing to take advantage of the services that our virtual observatory provides. We plan to engage those developers in articulating the new use cases (for script/programming language access, synthesizing models and observations, and new plotting options) in the near future. We also plan to hold a similar evaluation and feedback workshop at the next annual CEDAR meeting and with the other science communities that VSTO is serving.

Acknowledgements

The VSTO project is funded by the National Science Foundation, Office of Cyber Infrastructure under the SEI+II program, grant number 0431153. This article is an expanded version of a paper presented at the 2007 Innovative Applications of Artificial Intelligence conference.

Notes

1. www.mindswap.org/2004/SWOOP/.
2. protege.stanford.edu/.
3. jena.sourceforge.net/.
4. www.eclipse.org/.
5. www.mindswap.org/2003/pellet/.
6. www.springframework.org/.
7. CEDAR: Coupling, Energetics, and Dynamics of Atmospheric Regions; cedarweb.hao.ucar.edu.
8. MLSO: Mauna Loa Solar Observatory; mlso.hao.ucar.edu.
9. SWEET: Semantic Web for Earth and Environmental Terminologies; sweet.jpl.nasa.gov/ontology/.
10. We determined this percentage by taking the number of people in the community as measured by the most recent subject matter conferences and the number of registered users for our system.

References

Berners-Lee, T.; Hall, W.; Hendler, J.; Shadbolt, N.; and Weitzner, J. 2006. Enhanced: Creating a Science of the Web. Science 313(5788): 769–771 (DOI 10.1126/Science.1126902).

Cox, S. 2006. Exchanging Observations and Measurements Data: Applications of a Generic Model and Encoding. Eos Transactions of the American Geophysical Union Fall Meeting, Supplement, 87(52) In53c-01.

Das, A.; Wu, W.; and McGuinness, D. L. 2001. Industrial Strength Ontology Management. Stanford Knowledge Systems Laboratory Technical Report KSL-001-09. Stanford University, Stanford, CA.

Das, A.; Wu, W.; and McGuinness, D. L. 2002. Industrial Strength Ontology Management. In The Emerging Semantic Web, ed. I. Cruz, S. Decker, J. Euzenat, and D. L. McGuinness. Amsterdam: IOS Press.

De Roure, D.; Jennings, N. R.; and Shadbolt, N. R. 2005. The Semantic Grid: Past, Present, and Future. In Proceedings of the IEEE, 93(3): 669–681 (DOI: 10.1109/Jproc.2004.842781).

Fox, P.; McGuinness, D. L.; Middleton, D.; Cinquini, L.; Darnell, J. A.; Garcia, J.; West, P.; Benedict, J.; and Solomon, S. 2006a. Semantically Enabled Large-Scale Science Data Repositories. The Fifth International Semantic Web Conference (ISWC06), Lecture Notes in Computer Science 4273, 792–805. Berlin: Springer-Verlag.


Fox, P.; McGuinness, D. L.; Raskin, R.; and Sinha, A. K. 2006b. Semantically Enabled Scientific Data Integration. In Proceedings of the Geoinformatics Conference. Washington, DC: American Geophysical Union.

Gil, Y.; Ratnakar, V.; and Deelman, E. 2006. Metadata Catalogs with Semantic Representations. International Provenance and Annotation Workshop 2006 (IPAW 2006), Lecture Notes in Computer Science 4145, ed. L. Moreau and I. Foster, 90–100. Berlin: Springer-Verlag.

McGuinness, D.; Fox, P.; Cinquini, L.; West, P.; Garcia, J.; Benedict, J.; and Middleton, D. 2007a. The Virtual Solar-Terrestrial Observatory: A Deployed Semantic Web Application Case Study for Scientific Research. In Proceedings of the Nineteenth Conference on Innovative Applications of Artificial Intelligence (IAAI-07). Menlo Park, CA: AAAI Press.

McGuinness, D.; Fox, P.; Sinha, A. K.; and Raskin, R. 2007b. Semantic Integration of Heterogeneous Volcanic and Atmospheric Data. In Proceedings of the Geoinformatics 2007 Conference. Washington, DC: Geological Society of America.

McGuinness, D., and Pinheiro da Silva, P. 2004. Explaining Answers from the Semantic Web: The Inference Web Approach. Special Issue on the International Semantic Web Conference 2003, ed. K. Sycara and J. Mylopoulos. Web Semantics: Science, Services and Agents on the World Wide Web 1(4), Fall.

McGuinness, D., and van Harmelen, F. 2004. OWL Web Ontology Language Overview. World Wide Web Consortium (W3C) Recommendation. February 10, 2004 (www.w3.org/TR/owl-features). Cambridge, MA: Massachusetts Institute of Technology.

Pinheiro da Silva, P.; McGuinness, D.; and Fikes, R. 2006. A Proof Markup Language for Semantic Web Services. Information Systems 31(4–5): 381–395.

Rushing, J.; Ramachandran, R.; Nair, U.; Graves, S.; Welch, R.; and Lin, A. 2005. ADaM: A Data Mining Toolkit for Scientists and Engineers. Computers and Geosciences 31(5): 607–618.

Wolff, A.; Lawrence, B. N.; Tandy, J.; Millard, K.; and Lowe, D. 2006. Feature Types as an Integration Bridge in the Climate Sciences. In Eos Transactions, American Geophysical Union, Fall Meeting (San Diego, CA), Supplement, 87(52) Abstract In53c-02. Washington, DC: American Geophysical Union.

Deborah McGuinness is the acting director and senior research scientist of the Knowledge Systems, Artificial Intelligence Laboratory (KSL) at Stanford University and CEO of McGuinness Associates Consulting. By publication time, McGuinness will be the Tetherless World Chair and professor of computer science at Rensselaer Polytechnic Institute (RPI). McGuinness’s research focuses on the semantic web, ontologies and ontology evolution environments, knowledge representation and reasoning, explanation, trust, privacy, and search. McGuinness’s consulting focuses on helping companies utilize AI research in applications such as smart search, ontology management, information integration, e-science, online commerce, and configuration.

Peter Fox is the chief computational scientist at the High Altitude Observatory, National Center for Atmospheric Research. Fox is a research scientist specializing in the fields of solar-terrestrial physics, computational and computer science, information technology, and grid-enabled, distributed semantic data frameworks. He utilizes state-of-the-art modeling techniques and Internet-based technologies, including the semantic web, and applies them to large-scale distributed scientific data systems. Fox is currently PI for the Virtual Solar-Terrestrial Observatory, the semantically enabled scientific data integration, the semantic provenance capture in data ingest systems, and the CEDAR database projects.

Luca Cinquini holds a Ph.D. in high energy physics from the University of Colorado and is now working as a senior software engineer at the National Center for Atmospheric Research (Boulder, CO). His work focuses on researching and applying web, semantic, and grid technologies to facilitate access to geophysical and space data. He is actively involved in several geoinformatics projects including (among others) the Virtual Solar-Terrestrial Observatory, the Earth System Grid, and the Community Data Portal. He can be reached at [email protected].

Patrick West has a bachelor of science in computer science from Indiana University, Bloomington, with 17 years of experience in developing data access and server-side data processing systems. West is a software engineer at the High Altitude Observatory at the University Corporation for Atmospheric Research (HAO/UCAR), developing high-performance server frameworks for the access and processing of scientific data and ontology management.

Jose Garcia has a master of science in computer science and a master of science in applied mathematics from the University of Colorado, with 12 years of experience developing software for scientific and nonscientific projects. Garcia currently works as a software engineer at the High Altitude Observatory developing applications for database systems, ontology management, informatics, numerical analysis, signal processing, high-performance computing, web-based portals, and system integration. His e-mail is [email protected].

James Benedict is the chief operating officer of McGuinness Associates Consulting. Benedict has a master’s of international management with more than 20 years of work experience in various industries providing management of corporate information technology, accounting systems, and business planning. Benedict helps clients plan, evaluate business impact, develop, deploy, and maintain applications of artificial intelligence and semantic web technology.

Don Middleton leads the Visualization and Enabling Technologies program at the U.S. National Center for Atmospheric Research. He is responsible for a program that encompasses data and [...] and systems, advanced analysis and visualization, collaborative visual computing environments, grid computing, and education research and outreach activities. Middleton is currently contributing to the leadership of a number of projects, including the Earth System Grid, the Earth System Curator, the Virtual Solar-Terrestrial Observatory, the North American Regional Climate Change Assessment Project, and the Cyberinfrastructure Strategic Initiative. Middleton has also served on a National Research Council committee, and is currently serving on World Meteorological committees working to build global federated data systems.
