Building Web Observatories for Health

Dominic DiFranzo, John S. Erickson, Marie Joan Kristine T. Gloria, Deborah L. McGuinness, Joanne S. Luciano

Tetherless World Constellation

Rensselaer Polytechnic Institute

Troy, NY 12180

difrad, erickj4, [email protected]

dlm, [email protected]

Abstract. As the Web continues to grow and change, so does the need to study and understand it. Web Science is an effort to do just this. Because Web Science is multidisciplinary, and because it studies and produces a wide variety of data on the Web, Web Observatories are needed to foster collaboration and to archive work and studies for future generations of researchers. In this paper, we use health data on the Web, and Health Web Science, as a use case to show how and why we need to build Web Observatories. We characterize the unique challenges of building Web Observatories and present a methodology that accommodates these challenges.

1 Introduction

As the World Wide Web continues to evolve and impact our lives on a daily basis, researchers from a range of disciplines have turned their attention to its study. The field of Web Science, created as an approach for a coherent and connected study of the Web as an entity unto itself, faces several unique challenges ranging from conflicting methodological philosophies to the development of a suitable infrastructure for its scientists to collaborate and share resources. Web scientists, therefore, are looking to mixed methodology practices, new computational infrastructure and large-scale analytics in order to better make sense of these complex phenomena.

1.1 Web Observatories

As a means of managing this research complexity, the Web Science community has undertaken the development of a new distributed platform to facilitate the collection, analysis and sharing of data about the Web. Whether viewed as individual observatories or as a federation, the web observatory will be a distributed archive of data on the Web and its activity. Additionally, it will expose mechanisms and tools for exploring the Web's past development, examining its present condition and establishing potential developments in the future [1]. Web observatories promise to be a vast improvement over the resources currently in use by Web scientists, which are centralized, ad hoc in their composition, and often proprietary. With these challenges in mind, the Web Science Trust's Web Observatory Project has three main objectives [1]:

1. Create a global data resource that moves beyond the traditional understanding of a centralized data warehouse to that of a more distributed environment for interdisciplinary analysis and knowledge sharing.

2. Provide Web scientists a space to foster the development and sharing of toolsets, frameworks and workflows. This can only be accomplished by adopting a bottom-up approach that aggregates individual repositories into a virtual infrastructure.

3. Promote and empower researchers to use not just quantitative correlation methods on datasets, but to explore and incorporate qualitative analyses that may help provide a more comprehensive understanding of the socio-technical evolution of the Web.

2 Health Web Science and the Health Web Observatory Imperative

One sub-domain of particular interest to web scientists is the interplay between health, health sciences, and the Web. It is in this intersection that we are challenged with multi-level, multi-disciplinary questions about not just the impact on users of the Web, but the evolution of the Web's infrastructure and related public policies. Thus, Health Web Science (HWS) is an emerging discipline whose practitioners aim to "describe how the Web shapes, and is shaped by, medicine and health care ecosystems. Through this information, HWS will help engineer the Web and Web-related technologies to facilitate health-related endeavors and empower health professionals, patients, health researchers and lay communities."

For example, HWS-based research may focus on developing and/or evaluating user responses to Web-based applications that seek to promote the formation of communities and the sharing of experience [5]. Unique to this discipline, HWS researchers would not just examine the social practices of such experiences, but also raise questions regarding the role of the data platform itself in ensuring the proper anonymization of data and user privacy. Furthermore, HWS, in general, includes the examination and understanding of “citizen science”, discovery of new questions and answers in the Web’s , and representation and use of health knowledge by patients, medical professionals, and their machines [8].

2.1 From Health Data to a Health Web Observatory

Given the copious amounts of richly heterogeneous data, which is provided in many file formats, is both unstructured and structured, and is highly sensitive, the HWS community would benefit greatly from the development of a distributed network of Health Web observatories (HWOs). This would allow for better gathering, quantifying, connecting, sharing, and searching over health and related data on the web (both quantitative and qualitative in nature). This federated HWO would focus on: “the synthesis, curation, and discovery of Web pages containing health information; the structure and utilization of interactive sites relevant to patient support groups; and semantic annotation and linking of health records and data to facilitate mechanized exploration and analysis” [8].

We argue that the health domain, and Health Web observatories in particular, may serve as an exemplar for all future Web observatories given its technical, social and political complexity. In general, an HWO must be a materialization of the Web Observatory information model as represented by the Web Observatory schema [footnote 1]. This model allows Web scientists to leverage the vast scale of commercial search engines to uncover properly described and published resources. Using the Web Observatory model, certain data characteristics of interest to health web science investigators must be clearly expressed:

• Data identification and description: What is this data? In particular, is it statistical in nature? Does it describe a scientific or medical result?

• Origination: What is the source of this data? Did it originate with physicians, or patients, or researchers, or governments? Was the data scraped, or submitted, or generated through some study?

• Usage: How has this data been used previously, and by whom? Has it been used in research, or advocacy, or treatment, or for decision making?

• Citation: How is the data related to specific publications in the scientific and medical literature?

• Provenance: How has this data been aggregated and shared? Was (for example) the data made available through a built-for-purpose data repository? Is it an ad hoc source? Has it been provided through a public "data hub" such as datahub.io?

• Policy: What policies and restrictions govern the collection and use of the data? This applies both to data submitted (collection terms) and to data provided by a particular web observatory. For any given dataset, what policies affect it?
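As an illustration only, the six characteristics above could be captured in a simple record so that undocumented ones are easy to spot. The following Python sketch is not the Web Observatory schema; the class name, field names and sample values are hypothetical.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class DatasetDescription:
    """Hypothetical record covering the six characteristics listed above."""
    identification: Optional[str] = None   # what the data is, statistical or not
    origination: Optional[str] = None      # who produced it and how it was obtained
    usage: List[str] = field(default_factory=list)      # known prior uses
    citation: List[str] = field(default_factory=list)   # related publications
    provenance: Optional[str] = None       # how it was aggregated and shared
    policy: Optional[str] = None           # terms governing collection and use

    def undocumented(self) -> List[str]:
        """Return the names of characteristics that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Illustrative values only; they do not describe a real dataset.
record = DatasetDescription(
    identification="County-level health statistics (illustrative)",
    origination="Government agency, published as CSV (illustrative)",
    policy="Public domain (illustrative)",
)
print(record.undocumented())
```

Listing the empty fields makes it explicit which questions about a dataset remain unanswered before it is published through an observatory.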

1 http://logd.tw.rpi.edu/web_observatory

3 Building a Web Observatory

As outlined previously, the goals of the Web Observatory Project are to create a distributed data resource that enables collaboration among Web scientists across multiple domains and from around the world. It also aims to promote the exploration and incorporation of both qualitative and quantitative methodologies. It is, therefore, unsurprising that one of the biggest obstacles in designing and building a WO is anticipating the variety of innovative and interesting uses by others that are not explicitly built into the WO. To minimize this uncertainty, we present a process that prioritizes internal purposes first, while promoting the use of technologies and systems that enable others to duplicate, understand and reuse your Web observatory for alternative purposes.

This general process draws its inspiration from the Semantic eScience methodology [6]. Because of the unique challenges and nature of Web Observatories [1], the Semantic eScience methodology does not completely address some of the issues that arise in building one. In the following section, we use many parts of the Semantic eScience methodology as a basis for a new methodology for building Web Observatories.

3.1 Forming Your Purpose, Questions, Ideas and Users

We propose that the first step be to define the purpose of your specific Web observatory. In doing so, one should define the audience of the WO as well as its initial, or alpha, users. We recommend asking questions such as: What research domain do we want to explore? What types of questions would we like to answer? What will this WO contribute to the greater Web Science community? This phase differs from the "use case" step provided in the eScience methodology in that we focus on identifying a "user story" rather than a formal, specific use case. The intent is to avoid restricting the observatory to serving only alpha users at the cost of alienating new users and applications.

3.2 Small Core Team, Mixed Skills

The next step is to create a small team with a diverse skill set. As in the second step of the eScience methodology, this team is necessary for researching and discovering the datasets and tools to be used in the Web observatory. This step has yet to be formalized within our own current Web observatory process; however, we recognize that it is a great advantage in ensuring that your WO encompasses relevant and authoritative data.

3.3 Gather Data Sources and Data Sets

The third step is one of the most difficult. Gathering data sources and data sets may seem an insurmountable task, as the Web is a vast source of information. Identifying which datasets and data sources exist is only one part of the puzzle. One must also consider whether the data collected is trustworthy, non-proprietary, privacy-sensitive, etc. This step requires an exhaustive exploration across multiple resources as well as deep expertise in a specific domain. The formation of a small core team, as discussed in step two, can alleviate the workload and distribute these tasks. In addition, we recommend leveraging different search tools and systems to help. At TWC, we have IOGDS (International Open Government Data Search), which can search the metadata of over a million datasets from around the world.

3.4 Creating Data Collection Systems: Scraping, User Input, Crowdsourcing, Free Text

This is an optional step, though often a needed one. As noted above, procuring the "right" data can be difficult. Without domain expertise, locating datasets and data resources can be a fruitless exercise. Worse yet, the reproduction of already collected data slows down research and results in redundant resources. Instead, researchers must consider and deliberately outline their collection method. For example, a researcher interested in capturing and extracting data from a pre-existing community should ask the following: How do we plan on mining this data? Do we have access and permission to mine it? Is the data crowdsourced? How will we collect crowdsourced information? Is the data expressed as free text, in a forum, in a survey, etc.? Only after answering these types of questions can one move forward. Fortunately, there is a plethora of tools already available to help in executing these collection methods.

3.5 Making a List of Where This Data Exists/Capturing Metadata

One of the major components of a well-built WO is its comprehensiveness. By this, we mean the inclusion not only of the intended data, such as public tweets or public health statistics, but also of the metadata about this data. This includes where the data originated, how it was produced, column identification, unit distinctions, etc. This information may not always be readily available; however, noting such gaps in knowledge is just as important as recording the metadata that is explicitly expressed. It is also helpful to denote whether the metadata is machine readable or in a constrained format, such as a PDF file. Lastly, and most importantly, data produced and shared by the WO builder(s) must expose its metadata for others to use.

3.6 Analysis

Now that we have a list of datasets with their metadata, our core team, and possibly some data capture tools, we need to revisit our user stories and see whether what we have will address them. Do we need to continue our search for more data? Do we need to scrape or capture more data from outside sources? Do we need to learn more about our data? Do we have strategies for expressing this to others?
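One way to make the question of whether we need to learn more about our data concrete is to list the metadata gaps noted in Section 3.5. The following Python sketch is illustrative only: the function names, the set of file extensions treated as machine readable, and the sample data are all assumptions, not part of any TWC tool.

```python
import csv
import io
from pathlib import PurePosixPath

# Assumed set of formats treated as machine readable in this sketch.
MACHINE_READABLE = {".csv", ".json", ".xml", ".rdf", ".ttl"}


def is_machine_readable(filename):
    """Guess machine readability from the file extension alone."""
    return PurePosixPath(filename).suffix.lower() in MACHINE_READABLE


def column_stubs(csv_text, known_units):
    """Build one metadata stub per column, leaving unknown units as None."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return [{"column": name, "unit": known_units.get(name)} for name in header]


def gaps(stubs):
    """Return the columns whose unit has not been documented yet."""
    return [stub["column"] for stub in stubs if stub["unit"] is None]


# Made-up sample data for illustration.
sample = "county,year,cases\nAlbany,2012,40\n"
stubs = column_stubs(sample, {"year": "calendar year", "cases": "count"})
print(gaps(stubs))
print(is_machine_readable("report.pdf"))
```

A non-empty gap list is a signal that more domain knowledge is needed before the dataset can address the user stories.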

3.7 Develop Model/Ontology

Now that we have some idea of the data we will be using, along with our user story and the initial uses from our alpha users, we can start developing a model to organize, use and publish our data. This step differs slightly from the one presented in the eScience methodology, as we want to model just enough to cover our alpha users without losing potential future users and uses of this data. This step, at least initially, requires more lightweight vocabulary modeling and building.

3.8 Review and Iteration

Although this is the next step in the eScience methodology, we feel this step should happen after we have produced and published a prototype of our web observatory, and not before. Our web observatory is a means by which we reach out to the greater web science community and, we hope, bring value to it. Reviewing it before it exists would be difficult.

3.9 Adopt Technology and Technical Infrastructure

Because we do not wish to recreate a publication and observatory infrastructure from scratch, we reuse what has already been done in other web observatories and data publishing platforms. At TWC, we use PRIZMS, a collection of the many conversion and publication tools we have developed over the years. PRIZMS allows us to reuse and copy the infrastructure models of other PRIZMS instances, so that we can build on existing, successful projects and observatories.

We start by going through the PRIZMS installation script and collecting all the information we need (https://github.com/timrdf/prizms/wiki/Installing-Prizms). PRIZMS uses many of the tools and resources we have developed and used at TWC in our web observatory projects. It uses a combination of CKAN and GitHub to version-control our data, metadata and installation scripts so that our system can be replicated and reused by anyone else. PRIZMS includes our csv2rdf converter, which enables us to convert, enhance and publish our data. It also includes DataFAQs, which allows us to show and express the quality of our data to others. It also installs and uses LODSpeaker, which helps to publish our linked data on the web and allows us to easily build visualizations and applications from it. PRIZMS installs all the software and tools needed to run this for us. We just need to create the correct accounts on GitHub and CKAN for our project.

Once we have PRIZMS installed and running, we must convert our data using csv2rdf4lod. Following the tutorial (https://github.com/timrdf/csv2rdf4lod-automation/wiki/Real-world-examples), we take our raw CSV data and convert it to RDF.
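To give a sense of what converting raw CSV to RDF involves, the following Python sketch turns each CSV row into N-Triples statements. It is not csv2rdf4lod and does not reproduce its behavior or configuration; the `http://example.org/` namespace, the function names and the sample data are placeholders chosen for illustration.

```python
import csv
import io
from urllib.parse import quote

BASE = "http://example.org/"  # placeholder namespace, not a TWC URI


def escape_literal(value):
    """Escape backslashes, quotes and line breaks for an N-Triples literal."""
    return (value.replace("\\", "\\\\").replace('"', '\\"')
                 .replace("\n", "\\n").replace("\r", "\\r"))


def csv_to_ntriples(csv_text, dataset, key_column):
    """Emit one triple per non-empty cell, using key_column to name the row."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{BASE}{dataset}/{quote(row[key_column], safe='')}>"
        for column, value in row.items():
            # Skip the key itself, empty cells, and cells beyond the header.
            if column is None or column == key_column or not value:
                continue
            predicate = f"<{BASE}vocab/{quote(column, safe='')}>"
            lines.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return lines


# Made-up sample data for illustration.
sample = "county,cases\nAlbany County,40\n"
triples = csv_to_ntriples(sample, "health", "county")
print("\n".join(triples))
```

Every value is emitted as a plain string literal here; typing values and linking them to shared vocabularies is the kind of enhancement discussed in the next section.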

3.10 Evolve, Iterate, and Evaluate

Here we begin linking together our data, linking to outside data, modeling our data using vocabularies and ontologies, and exposing the metadata we collected above in the data itself (https://github.com/timrdf/csv2rdf4lod-automation/wiki/conversion:Enhancement). This step is also difficult, as you need to know a lot about your data, these tools, and linked data to do it well.

We also look for new ways to expose our metadata to others and to increase the ability of our data to be used and discovered by others. At TWC we developed the Web Observatory extension to schema.org to help with this. By modeling and exposing our metadata using schema.org, we can allow for better search and discovery of our web observatory datasets and tools. This is currently a manual process, but we plan to have tools in the future to better integrate it into our existing infrastructure.

We also use S2S, the same technology used in IOGDS, to build a better search interface for our web observatory. This currently requires a lot of custom modification and scripting to use correctly.
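As a rough sketch of what exposing dataset metadata with schema.org can look like, the following Python code builds a JSON-LD description using only core schema.org terms (`Dataset`, `DataDownload`, `license`, `distribution`, `contentUrl`, `encodingFormat`). Terms specific to the Web Observatory extension are not shown, and the function name and all sample values are hypothetical.

```python
import json


def dataset_jsonld(name, description, url, license_url,
                   download_url, encoding_format):
    """Describe one dataset with core schema.org terms as a JSON-LD dict."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "license": license_url,
        "distribution": {
            "@type": "DataDownload",
            "contentUrl": download_url,
            "encodingFormat": encoding_format,
        },
    }


# Placeholder values; they do not refer to a real dataset.
doc = dataset_jsonld(
    name="Example health statistics",
    description="Illustrative dataset description.",
    url="http://example.org/dataset/health",
    license_url="http://example.org/license",
    download_url="http://example.org/dataset/health.csv",
    encoding_format="text/csv",
)
print(json.dumps(doc, indent=2))
```

The resulting JSON can be embedded in a dataset's landing page inside a `<script type="application/ld+json">` element, which is one common way for crawlers to pick up schema.org markup.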

4 Using a Web Observatory

Having web observatories in place will allow much of the difficult data work in web science to become more open, more automated, and easier. Data discovery will be aided by semantic markup, which allows intelligent agents to read, crawl and index these observatories for better searching, discovery and display. We will be able to build new tools that help researchers quickly find datasets and information for their research questions.

5 Future Work

There is still a lot of work to be done to make this a reality. Although much has been done on many of the individual pieces, some pieces are still missing, and they all still need to be put together.

We have referred many times in this paper to different semantic markup standards and vocabularies for better describing tools, data and catalogs in web observatories, but many of these are still in an early phase of development. More community effort is needed to develop, expand, evolve and use these emerging standards. We also need tools to better build, maintain and use web observatories. Many existing tools do some of the things listed here, but there is no unifying tool that brings everything together and that is easy to use, well documented and mature. We need to better identify where the gaps lie in our tools, and to develop and mature them over time.

6 Concluding Remarks

Web Observatories are needed to better realize the goals of web science research. We as researchers must begin to rein in many of our ad hoc data practices and build infrastructure that allows others, now and in the future, to use and understand our work. This is a difficult challenge given the wide diversity of disciplines and practices in Web Science. In this paper we have outlined some of these challenges along with a first step toward a methodology for building web observatories. We do this both to provide steps that other researchers can follow and use in their own web observatory efforts, and to better illustrate the gaps and missing pieces that currently exist. We hope that this will motivate others to start building their own web observatories and to begin filling these gaps.

7 References

1. Tiropanis, T., Hall, W., Shadbolt, N., De Roure, D., Contractor, N., & Hendler, J. (2013). The Web Science Observatory. IEEE Intelligent Systems, 28(2), 100-104.

2. Fox, P., Cinquini, L., McGuinness, D., West, P., Garcia, J., Benedict, J. L., & Zednik, S. (2007). services for interdisciplinary scientific data query and retrieval. In Proceedings of the AAAI Workshop on Semantic eScience. doi (Vol. 10, No. 144.7441).

3. Rozell, E., Erickson, J., & Hendler, J. (2012, October). From international open government dataset search to discovery: a approach. In Proceedings of the 6th International Conference on Theory and Practice of Electronic Governance (pp. 480-481). ACM.

4. Rozell, E., Fox, P., Zheng, J., & Hendler, J. (2012, April). S2S architecture and faceted browsing applications. In Proceedings of the 21st international conference companion on World Wide Web (pp. 413-416). ACM.

5. Gloria, M. J. K., McGuinness, D. L., Luciano, J. S., & Zhang, Q. (2013, May). Exploration in web science: instruments for web observatories. In Proceedings of the 22nd international conference on World Wide Web companion (pp. 1325-1328). International World Wide Web Conferences Steering Committee.

6. McGuinness, D., Fox, P., Cinquini, L., West, P., Garcia, J., Benedict, J. L., & Middleton, D. (2007, July). The virtual solar-terrestrial observatory: A deployed semantic web application case study for scientific research. In Proceedings of the National Conference on Artificial Intelligence (Vol. 22, No. 2, p. 1730). Menlo Park, CA; Cambridge, MA; AAAI Press; MIT Press; 1999.

7. Seyed, P., Chastain, K., Ashby, B., Liu, Y., Lebo, T., Patton, E., & McGuinness, D. SemantEco Annotator.

8. Luciano, J. S., Cumming, G. P., Wilkinson, M. D., & Kahana, E. (2013). The Emergent Discipline of Health Web Science. Journal of Medical Internet Research, 15(8).