FUSION ONTOLOGY: ENABLING A PARADIGM SHIFT FROM DATA WAREHOUSING TO CROWDSOURCING FOR

ACCELERATED PACE OF RESEARCH

Dissertation

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By

Satyajeet Raje, MS

Graduate Program in Computer Science and Engineering

The Ohio State University

2016

Dissertation Committee

Dr. Jayashree Ramanathan, PhD

Dr. Philip Payne, PhD

Dr. Rajiv Ramnath, PhD

Dr. Gagan Agarwal, PhD

Copyright by

Satyajeet Raje

2016

“The Web as I envisaged it, we have not seen it yet.

The future is still so much bigger than the past.”

- Tim Berners-Lee,

Inventor of the World Wide Web

ABSTRACT

The last decade has seen an explosion of publicly available, crowd-sourced research datasets in various basic and applied science domains. Typically, research datasets are much smaller than their industry counterparts, but they still contain rich knowledge that can be reused to maximize new scientific discoveries. This availability of data has made data-driven academic research across domains at large scale a real possibility.

However, the availability of the data has not translated to its effective usage in research. A large amount of data within the data repositories remains underutilized.

This research determined that a critical factor in this inefficiency is that the majority of researchers' time is devoted to pre-processing activities rather than the actual analysis. The amount of time and resources required to complete the beginning of this progression appears to limit the efficacy of the remaining research process.

The thesis proposes research-centric "data fusion platforms" to accelerate the research process through a reduction in the time required for the quantitative data analysis cycle. This is achieved by automating the critical data preprocessing steps in the workflow.

First, the thesis defines the required computational framework to enable implementation of such platforms. It conceptualizes a novel crowdsourcing solution to enable a paradigm shift from "in-advance" to "on-demand" mechanisms. This is achieved through a middle-layer ontological architecture placed over a shared, crowd-sourced data repository. Highly granular semantic annotations anchor the column-level entities and relationships within the datasets to global reference ontologies. The core idea is that these annotations provide the declarative meta-data that can leverage the reference ontologies. The thesis demonstrates how these annotations provide global interoperability across datasets without a pre-determined schema. This enables automation of the data pre-processing operations.

The research is centered on a Data Fusion Ontology (DFO) as the required representational framework to enable the above computational solution. The DFO is designed to enable sophisticated data integration platforms with capabilities to support the research data cycle of discovery, mediation, integration and visualization.

The DFO is validated using a prototype data fusion platform that can leverage the DFO specified annotations to meet the functional requirements to accelerate the pace of research.

DEDICATION

To my parents, Sanjeev and Swati Raje, and my sister Surabhi

for their constant belief in my ability to achieve any goal set

and for teaching me to believe in myself.

ACKNOWLEDGEMENTS

A special thank you to my advisors, Dr. Jayashree Ramanathan, Dr. Philip Payne and Dr. Rajiv Ramnath for their guidance throughout this project and my graduate studies. I have benefited greatly from your advice, continual support and direction and hope to continue to do so in the future.

I would like to thank Lakshmikanth Krishnan Kaushik, Justin Dano, Marques Mayoras, Kuhn, Bradley Myers and Tyler Kuhn for their contributions in implementation of the prototype data fusion platform demonstrated in this dissertation.

I would like to acknowledge my colleagues at CETI and the Payne lab: Kayhan Moharreri, Zachery Abrams, Kelly Regan, Tasneem Motiwala, Bobbie Kite among many others. I have been fortunate to be involved in several projects with them during the course of my PhD at OSU. I have thoroughly enjoyed working with all of you and hope to continue to collaborate in future research endeavors.

VITA

EDUCATION

Degree | University | Graduation Date | Field of Study
BE | University of Pune, India | July 2009 | Computer Engineering
MS | The Ohio State University, Columbus, Ohio, USA | December 2012 | Computer Science & Engineering
PhD | The Ohio State University, Columbus, Ohio, USA | May 2016 | Computer Science & Engineering

RELATED PUBLICATIONS

Raje, S. (2015, November). Accelerating Biomedical Informatics Research with Interactive Multidimensional Data Fusion Platforms. American Medical Informatics Association (AMIA) Annual Symposium [Poster].

Raje, S., Kite, B., Ramanathan, J., & Payne, P. (2014). Real-time Data Fusion Platforms: The Need of Multi-dimensional Data-driven Research in Biomedical Informatics. Studies in health technology and informatics, 216, 1107-1107.

Lele, O., Raje, S., Yen, P. Y., & Payne, P. (2015). ResearchIQ: Design of a Semantically Anchored Integrative Query Tool. AMIA Summits on Translational Science Proceedings, 2015, 97.

Raje, S. (2012). ResearchIQ: An End-To-End Semantic Knowledge Platform For Resource Discovery in Biomedical Research (Master's thesis, The Ohio State University).

OTHER PUBLICATIONS

Regan, K., Raje, S., Saravanamuthu, C., & Payne, P. R. (2014). Conceptual Knowledge Discovery in for Drug Combinations Predictions in Malignant Melanoma. Studies in health technology and informatics, 216, 663-667.

Motiwala T, Shivade C, Raje S, Lai A, Payne, P. (2015, March) RECRUIT: Roadmap to Enhance Clinical Trial Recruitment Using Information Technology. AMIA Joint Summits on Translational Science - Clinical Research Informatics Summit, March 2015. Abstract.

Raje, S., Davuluri, C., Freitas, M., Ramnath, R., & Ramanathan, J. (2012, July). Using Technologies for RBAC in Project-Oriented Environments. In Computer Software and Applications Conference (COMPSAC), 2012 IEEE 36th Annual (pp. 521-530). IEEE.

Raje, S., Davuluri, C., Freitas, M., Ramnath, R., & Ramanathan, J. (2012, March). Using ontology-based methods for implementing role-based access control in cooperative systems. In Proceedings of the 27th Annual ACM Symposium on Applied Computing (SAC) (pp. 763-764). ACM.

Herold, M. J., Raje, S., Ramanathan, J., Ramnath, R., & Xu, Z. (2012). Human Computation Recommender for Inter-Enterprise Data Sharing and ETL Processes. Tech Report. The Ohio State University.

Raje, S., Tulangekar, S., Waghe, R., Pathak, R., & Mahalle, P. (2009, July). Extraction of key phrases from document using statistical and linguistic analysis. In 2009 4th International Conference on Computer Science & Education (ICCSE). IEEE

FIELDS OF STUDY

Major: Computer Science and Engineering

Minor: Biomedical Informatics

TABLE OF CONTENTS

Abstract
Dedication
Acknowledgements
Vita
Table of Contents
List of Figures
List of Tables

CHAPTER 1: INTRODUCTION
  1.1 METHODOLOGY
    1.1.1 Identifying the Research Problem
    1.1.2 Traditional data warehousing approach is not suitable for research environment
  1.2 PROPOSED RESEARCH SOLUTION: DATA CROWDSOURCING
  1.3 THE DATA FUSION ONTOLOGY METHODOLOGY
  1.4 PROTOTYPE DATA FUSION PLATFORM IMPLEMENTATION

CHAPTER 2: NEED FOR DATA FUSION PLATFORMS IN RESEARCH
  2.1 METHODS
    2.1.1 Semi-structured interviews
    2.1.2 Analysis of current available tools
    2.1.3 Comparison of research process versus business analytics process
  2.2 INTERVIEW QUESTIONS AND RESPONSES
    2.2.1 Summary
  2.3 ANALYSIS OF CURRENT AVAILABLE TOOLS AND SYSTEMS
    2.3.1 Data Repositories
    2.3.2 Workflow Management Tools
    2.3.3 Data Integration Platforms
    2.3.4 Identified gaps

CHAPTER 3: TRADITIONAL DATA WAREHOUSING VERSUS RESEARCH ENVIRONMENT
  3.1 RESEARCH VERSUS BUSINESS ANALYTICS
    3.1.1 Process Differences
    3.1.2 Dynamic and iterative nature of research versus relatively stable and controlled business environment
    3.1.3 Autonomous research versus coordinated business data analytics
  3.2 QUANTIFYING THE COST OF DATA ANALYSIS
    3.2.1 Cost in Traditional Data-warehousing
    3.2.2 Cost in the Typical Autonomous Research Process
  3.3 PARADIGM SHIFT TO CROWDSOURCING FOR RESEARCH ENVIRONMENTS
    3.3.1 Why is Data Warehousing not suitable for Research Environments?
    3.3.2 Defining the Hard Problem
    3.3.3 The Computational Requirements

CHAPTER 4: PROPOSED DATA CROWDSOURCING PARADIGM
  4.1 RELATED WORK
    4.1.1 Emergence of Smarter Data Repositories
    4.1.2 Emergence of end-user driven data platforms
    4.1.3 Crowd-sourced Schema Matching
    4.1.4 Role of Ontologies in Data Integration
    4.1.5 Leveraging the Linked Open Data cloud
  4.2 DATA CROWDSOURCING PRINCIPLES
    4.2.1 Crowd-sourced data sources {Si}
    4.2.2 Ontologies {O} as universally recognized "schema"
    4.2.3 Reusable crowd-sourced semantic annotations {Ai}
    4.2.4 Semantic Match Operation (↔)
  4.3 "ON DEMAND" DATA INTEGRATION OPERATIONS USING SEMANTIC MATCH
    4.3.1 Extended Semantic Match (⇔)
    4.3.2 Data Exploration with Semantic Match
    4.3.3 Operation to find matches between given data sources
    4.3.4 Complex match operation
  4.4 RATIONALE OF DATA CROWDSOURCING
    4.4.1 Graph of interconnected data sources
    4.4.2 Acceptable cost of querying
    4.4.3 Ad hoc integration functionality
    4.4.4 Preserving researchers' autonomy

CHAPTER 5: DATA FUSION ONTOLOGY DESIGN
  5.1 DESIGN METHODOLOGY
    5.1.1 Purpose of DFO
    5.1.2 Scope of DFO
  5.2 DESIGN RATIONALE
    5.2.1 Bridging crowd-sourced datasets using semantic interoperability meta-data
    5.2.2 Multi-granular Representation
    5.2.3 Lightweight Solution

CHAPTER 6: DATA FUSION ONTOLOGY SPECIFICATION
  6.1 FUSION GRAPH NOTATION
  6.2 OWL REPRESENTATION OF DFO
    6.2.1 Definitions
    6.2.2 Design notes
  6.3 ANNOTATING TABULAR DATASET USING DFO
    6.3.1 Step 1: Capturing Structural Metadata
    6.3.2 Step 2: Capturing Annotations to {O}
    6.3.3 Step 3: Capturing Relationships and Types of Columns
    6.3.4 Semantics for Join Table
  6.4 EVALUATION OF DFO
    6.4.1 OWL Representation and Feasibility
    6.4.2 Evaluation Methodology
    6.4.3 Conceptual Completeness Evaluation
    6.4.4 Consistency Evaluation
  6.5 APPLICATION-BASED EVALUATION: DATA PREPROCESSING TASKS USING DFO ANNOTATIONS
    6.5.1 Method
    6.5.2 Use Cases

CHAPTER 7: PROTOTYPE DATA FUSION PLATFORM
  7.1 SCOPE OF THE DATA FUSION PLATFORM
    7.1.1 Functional Requirements Specification using Use Cases
    7.1.2 Non-functional Requirements
    7.1.3 Scope Definition
  7.2 IMPLEMENTATION DETAILS
    7.2.1 System components
  7.3 WALKTHROUGH OF DATA PREPROCESSING TASKS
    7.3.1 Find available datasets that can be integrated with given dataset on a particular column
    7.3.2 Find all possible matches between given two datasets
    7.3.3 Find matches through intermediate dataset
    7.3.4 Ad hoc Join functionality and on demand integration

CHAPTER 8: DISCUSSION
  8.1 SOCIAL AND ORGANIZATIONAL CHALLENGES
    8.1.1 Change in incentives for data sharing and annotation for researchers
    8.1.2 Impact in business data analytics
  8.2 TECHNICAL CHALLENGES
    8.2.1 Further development of DFO
    8.2.2 Inclusion of unstructured data
    8.2.3 Cost of annotation
    8.2.4 Coverage of ontologies
  8.3 FUNCTIONALITY OF THE DATA FUSION PLATFORM
    8.3.1 Implementing advanced ontology-based use cases
    8.3.2 End-to-end data functionality
    8.3.3 Bootstrapping the annotation process
    8.3.4 HCI evaluation of the platform
  8.4 CONCLUSION

BIBLIOGRAPHY

LIST OF FIGURES

FIGURE 1. CHAPTERS OF THE THESIS ORGANIZED AROUND THE STEPS INVOLVED IN THE PROBLEM SOLVING METHODOLOGY.
FIGURE 2. DATA INTEGRATION TASKS.
FIGURE 3. DATA INTEGRATION THROUGH MEDIATION.
FIGURE 4. THE "UNICORN" DATA FUSION PLATFORM.
FIGURE 5. TRADITIONAL DATA WAREHOUSING ARCHITECTURAL "TRIPLE".
FIGURE 6. COMPARING THE DATA PREPROCESSING WORKFLOWS FOR INDIVIDUAL RESEARCHER VERSUS LARGE-SCALE BUSINESS ANALYTICS.
FIGURE 7. DATA QUERY PROCESS AND ASSOCIATED COST FACTORS IN THE TRADITIONAL "IN ADVANCE" WAREHOUSING BASED APPROACH.
FIGURE 8. DATA QUERY PROCESS AND ASSOCIATED COST FACTORS FOR "ON DEMAND" RESEARCH ENVIRONMENT.
FIGURE 9. THE ARCHITECTURAL COMPONENTS FOR THE PROPOSED CROWDSOURCING PARADIGM FRAMEWORK.
FIGURE 10. SEMANTIC MATCH OPERATION, USING ANNOTATIONS TO REFERENCE ONTOLOGY CLASS.
FIGURE 11. COMPLEX MATCH OPERATION THROUGH AN INTERMEDIATE DATASET.
FIGURE 12. COST OF QUERY USING THE PROPOSED CROWDSOURCING PARADIGM.
FIGURE 13. STEPS IN DESIGNING THE DATA FUSION ONTOLOGY.
FIGURE 14. POSITIONING THE DFO.
FIGURE 15. DATA FUSION ONTOLOGY COMPONENTS REPRESENTED USING A "FUSION GRAPH NOTATION".
FIGURE 16. DATA FUSION ONTOLOGY AS AN OWL/RDF REPRESENTATION.
FIGURE 17. ANNOTATIONS OF STRUCTURAL META-DATA (PHYSICAL VIEW) USING "DFO:BELONGSTO".
FIGURE 18. HOSPITAL TABLE WITH COLUMN LEVEL ANNOTATIONS.
FIGURE 19. CAPTURING SEMANTICS OF ASSOCIATIONS USING "DFO:JOINTABLE".
FIGURE 20. (A) DATASET OWNED BY THE RESEARCHER CONTAINING PATIENT AND ALLERGY INFORMATION. (B) PUBLICLY AVAILABLE DATASET THAT CONTAINS ENVIRONMENTAL DATA TO BE MERGED WITH (A).
FIGURE 21. INTEGRATION USING COLUMN-LEVEL ANNOTATIONS: THE HOSPITAL TABLE IS USED TO JOIN THE TWO EXAMPLE DATASETS USING CLASS ANNOTATIONS OF COLUMNS.
FIGURE 22. DATA FUSION PLATFORM COMPONENTS.
FIGURE 23. START UP PAGE. THE DATASETS ARE ALREADY ANNOTATED AS PER THE DFO SPECIFICATIONS.
FIGURE 24. EXPLORING THE SELECTED DATASET. THE COLUMN THAT THE USER WANTS TO FIND MATCHES FOR IS SELECTED.
FIGURE 25. ALL POSSIBLE MATCHES FOUND FOR A PARTICULAR COLUMN.
FIGURE 27. FINDING ALL POSSIBLE MATCHES BETWEEN GIVEN TWO DATASETS.
FIGURE 28. NO DIRECT MATCHES BETWEEN THE PATIENT AND GENE DATASETS.
FIGURE 29. AUTOMATICALLY FINDING INTERMEDIATE DATASETS.
FIGURE 30. AD HOC JOIN FUNCTIONALITY.
FIGURE 31. INTEGRATING DATASETS "ON DEMAND".

LIST OF TABLES

TABLE 1. SUMMARY OF RESEARCH CONTRIBUTIONS AND THE RESEARCH "PRODUCTS" OF THE CHAPTERS IN THE THESIS.
TABLE 2. INTERVIEW QUESTIONS.
TABLE 3. INTERVIEW RESPONSES FOR QUESTION 1.
TABLE 4. DIFFERENCES IN THE BUSINESS AND RESEARCH ENVIRONMENTS.
TABLE 5. SUMMARY OF COST FACTORS AND HOW THEY EFFECT THE DATA WAREHOUSING AND RESEARCH ENVIRONMENTS.
TABLE 6. TYPES OF METADATA CAPTURED BY DFO.
TABLE 7. DATA FUSION ONTOLOGY CLASS DEFINITIONS.
TABLE 8. DFO DATA PROPERTIES.
TABLE 9. DFO OBJECT PROPERTIES.
TABLE 10. EXAMPLE DATASETS FOR ANNOTATION.

Chapter 1

INTRODUCTION

Summary. This chapter provides an overview of the research presented in this thesis, the underlying research hypothesis and the resulting research contributions. It also describes the overall research methodology and organization of the remaining chapters.

The last decade has seen an explosion of publicly available, crowd-sourced research datasets accessible to researchers for quantitative data analysis in various basic and applied science domains such as the social sciences, economics, biomedical and biological sciences, etc. This growing trend of publishing datasets is attributable to new incentives encouraging data sharing and reuse, such as the White House Office of Science and Technology Policy memo addressing open access to data and directing federal agencies to provide funding towards initiatives that increase accessibility of publicly-funded research (Holdren, 2013) and directives like 'Open Science' and 'Open Data' (Molloy, 2011; Murray-Rust, 2008).

Although research datasets are typically much smaller than their industry counterparts, they still contain rich knowledge that can be reused to maximize new scientific discoveries. Thus, this availability of data has made data-driven academic research across domains at large scale a real possibility (Nielsen, 2012; Tolle, 2011).

However, the data are often highly distributed across heterogeneous sources and stored in varied formats, and are not well organized or curated. Research datasets are often individually managed and suffer from a lack of sufficient and consistent provenance and meta-data critical for enabling efficient reusability. At best, they are isolated in silos of domain-specific repositories, hindering cross-domain reuse (Science Staff, 2011). A large amount of data within the data repositories remains underutilized (Ingwersen & Chavan, 2011). This drastically reduces the discoverability, accessibility and, in turn, the overall reuse of these datasets.

To illustrate, the Harvard DataVerse repository contains over 60 thousand datasets but only just over 1.5 million downloads [1], thus averaging only about 2.5 downloads per dataset. data.gov.uk performs better: its more than 27 thousand datasets have been downloaded a total of a little over 1.6 million times. However, a closer inspection of the usage statistics [2] reveals that a little over 16,500 datasets were downloaded five times or less; of these, about 9,300 datasets were never downloaded at all. In fact, just the top 20 datasets contribute a whopping 25% of the total downloads. Usage statistics for data.gov, which contains over 120 thousand datasets [3], were not available at all.

[1] Statistics obtained from http://dataverse.org/ on February 24th, 2016
[2] https://data.gov.uk/data/site-usage/dataset
[3] http://www.data.gov/metrics (February 24th, 2016)

Thus, the availability of the data has not translated into its effective usage in research. This has created a conundrum among the research community, questioning the "value" of data sharing (Borgman, 2012; Wallis et al., 2013). A key research challenge in this area is therefore the development of tools that will allow researchers to leverage the available datasets efficiently (Embi & Payne, 2009; Payne et al., 2009).

The overall goal of this research is “To accelerate the research process through reduction in the time required for the quantitative data analysis cycle by automating the critical components in the workflow.” We can refine this problem as: “Reduction in both time and computational expertise required for the data preprocessing tasks by enabling automation in the research environment”. Keeping this in mind, the thesis addresses two main hypotheses:

Hypothesis 1: Current data warehousing techniques are not suitable to the research environment due to fundamental process differences between the two.

Hypothesis 2: The semantic information necessary for the data preprocessing can be crowd-sourced and represented in a computer readable format in order to enable the automation of these tasks. This offers a scalable and distributed solution for the research environment.

Figure 1. Chapters of the thesis organized around the steps involved in the problem solving methodology:

• Problem Identification (Chapter 2): Identify the factors affecting the pace of quantitative data analysis in the research environment.
• Problem Refinement (Chapter 3): Illustrate why warehousing is not suitable and scalable in the distributed research environment, and state the computational hard problem that needs to be solved.
• Computational Solution (Chapter 4): The crowdsourcing solution is described in this chapter, including the computational components and how they interact for data preprocessing tasks.
• Representational Framework Design (Chapter 5): The Data Fusion Ontology design is presented as the required representational framework that will enable the crowdsourcing solution.
• Computational Framework Specification (Chapter 6): The DFO specifications are provided in this chapter; the DFO evaluation is also presented here.
• Prototype Implementation (Chapter 7): To demonstrate that the crowdsourcing solution with the DFO is feasible, a prototype data fusion platform is implemented and the different data preprocessing tasks are illustrated.
• Discussion (Chapter 8): A discussion of the limitations of the current approach, the future research agenda and concluding remarks.

The thesis first identifies the need for data fusion platforms in research environments in chapter 2. The unsuitability of traditional warehousing and the hard problem that needs to be resolved for research are identified in chapter 3. The related literature is reviewed in chapter 3

as well. Chapter 4 proposes a data crowdsourcing approach that forms a foundation for the data fusion platform implementation. Chapter 5 details the methodology, rationale and principles behind the design of the Data Fusion Ontology. The specification of the DFO itself follows in chapter 6. The evaluation of the ontology is also discussed in chapter 6. The application of the ontology to data fusion platform functionality along with a proof-of-concept implementation of such a platform is demonstrated in chapter 7. Finally, chapter 8 discusses the limitations and challenges of the current proposed solution along with the future research agenda.

The following methodology was implemented to transform this high-level objective into a specific research goal and subsequently into a computational framework and architectural model that can enable an implementable solution.

1.1 Methodology

The thesis methodology is based on the enterprise engineering approach (Giachetti, 2010; Gustas & Gustiene, 2009). Such an approach is frequently used in software development for modeling computational solutions to generic enterprise-level problems, especially for optimization of process and information flow at large scale. Optimizing the quantitative data-driven research process falls squarely into this class of problems.

As illustrated in Figure 1, the subsequent chapters of this thesis are organized around the steps involved in this methodology. The "products" output from the work described in the above chapters form the key contributions of the thesis.

These research contributions can be categorized as belonging to both information science and computer science, with a certain degree of overlap. Table 1 summarizes the research contributions as they relate to the above methodology.

Table 1. Summary of research contributions and the research "products" of the chapters in the thesis.

Research Product | Chapters | Research Contributions
Requirements specification for quantitative data-driven analysis in research environment | 2 & 3 | 1. Characterization of critical cost factors in the data analysis process, both in research and business environments. 2. Gap analysis between the existing tools and systems suitable for data integration in the business environment and the needs and work patterns of the researchers.
Crowdsourcing Solution | 4 | 3. Conceptual architecture enabling a paradigm shift from "in advance" to "on demand" data integration using crowd-sourced annotations. 4. Formulation of primitives and operations for "on demand" integration.
Data Fusion Ontology | 5 & 6 | 5. Representational framework required for implementation of the conceptual solution.
Prototype Data Fusion Platform | 7 | 6. Validation of the DFO and the proposed crowdsourcing solution.

1.1.1 Identifying the Research Problem

The research approach begins (in Chapter 2) with a needs assessment from the researchers' perspective to condense the generic high-level objective into a specific research problem definition. This requires a thorough understanding of the data-driven research process itself, specifically the steps involved in the workflow and the time needed to complete these steps, as well as identification of the factors affecting the pace of this process (Raje, 2015).

Several researchers and statisticians working in the biomedical and life science domains, as well as business data analysts working in a commercial setting, were interviewed to this end. The interviews were also designed to reveal any gaps between the tools and resources generally available to researchers and their requirements and expectations. The key observations were as follows:

Observation 1: Most of the data-driven research depends on tabular data formats and employs statistical or machine learning based analysis methods.

Observation 2: There is an abysmal gap between the tools available to researchers for the actual data analysis and those available for the data preprocessing requirements, meaning that the preprocessing steps are performed almost exclusively manually. Additionally, even though widely used data repositories like data.gov [4], DataVerse [5], re3data [6], etc. provide functionality for the discovery of crowd-sourced datasets, the next steps of pre-processing of heterogeneous datasets are largely unsupported.

[4] http://www.data.gov/

Observation 3: As a result, it was generally observed that researchers spend about 80% of the total research process time in the data pre-processing tasks rather than the actual data analysis tasks.

Observation 4: The research environment critically requires “data fusion platforms” (Raje, 2015) that allow researchers to leverage crowd-sourced datasets efficiently, specifically aiding the data preprocessing tasks in real time. Such data fusion platforms should:

1. Provide researchers with automated tools for more efficient pre-processing.

2. Provide interoperability across the datasets, enabling automation of "on demand" integration to offset the high cost of repeated manual data integration.

[5] http://dataverse.org/

[6] http://www.re3data.org/

1.1.2 Traditional data warehousing approach is not suitable for research environment

The gap between data preprocessing and analysis is a well-recognized problem in data science in general (Pyle, 1999; Kamel, 2009; Munson, 2012) and is exacerbated by the increasing amount and heterogeneity of data. Data integration is a well-established solution for achieving interoperability across datasets in the business environment (Goodhue et al., 1992; Zerbe & Scott, 2015). It was therefore necessary to study the data warehousing approach to understand whether it can be applied to meet the needs of research environments. Traditional data warehousing techniques, relying on a centralized, predetermined data schema and an "in advance" data integration approach, are effective in reducing the cost of the data analysis cycle in the business environment to some extent (Inmon & Linstedt, 2014; Kimball & Ross, 2013). It is clear from the data warehousing experience that an integrated data store is vital for the automation of data preprocessing tasks.

Chapter 3 presents in detail the argument illustrating why such a warehousing approach is not viable for research environments. Such a top-down approach is prone to scalability issues. Researchers agree that traditional "in advance" data integration will not work in distributed and web-scale crowd-sourced environments (Franklin et al., 2005; Halevy et al., 2006). The research environment exhibits these characteristics, where the datasets are often dynamic and their schemas are not known in advance.

There are fundamental differences in the research and business analytics needs and work patterns. The process for quantitative data analysis in a research environment is significantly different from analytics within an enterprise / commercial setting. The research environment is highly distributed and would require far more scalability. Researchers by and large work autonomously or in relatively small cohorts and iteratively test new hypotheses. This requires using new datasets that are often crowd-sourced from public repositories. Thus, the identification of datasets and query formulation is "on demand", as queries are dictated by changing hypotheses.

Thus, the research process requires data integration to be conducted "on demand". For the research environment, the hard problem identified here is:

How can we achieve mechanisms for “on demand” data integration (emulating the functionality in traditional data warehousing) WITHOUT a predetermined data schema?

1.2 Proposed Research Solution: Data Crowdsourcing

The novel conceptual solution to the hard problem stated above is presented in Chapter 4. The proposed crowdsourcing approach adopts a middle-layer ontological architecture placed over a shared, crowd-sourced data repository. The goal is to leverage the power of ontologies with respect to their automated inference capability and the global understandability of entities.

1.3 The Data Fusion Ontology

The Data Fusion Ontology (DFO) contributes a representational standard that addresses the necessary granularity and features for the crowdsourcing approach and enables development of a "data fusion platform". This minimizes the effort required to implement a platform solution that can deliver the required integration functionality. Such a platform should minimize the exposure of the technological aspects to the end user.

The DFO addresses the following requirements:

1. It provides a unified declarative semantic representation of interoperability metadata across datasets. The DFO provides for the column-level granularity required for automated "on demand" interpretation and integration operations.

2. It allows independent (crowd-sourced) generation of semantically coherent annotations across datasets that can be utilized for the integration tasks. These annotations are used as anchors between the local dataset schemas and concepts from reference ontologies (UMLS, etc.); a small illustrative sketch of this idea follows below.

3. It provides a lightweight solution for automated interpretation and ad hoc integration in real time with minimal programming resources. The DFO achieves this by replacing the overhead of schema matching and mapping with reusable semantic annotations.
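To make the anchoring idea concrete, the following minimal Python sketch illustrates how independently published (crowd-sourced) datasets whose columns have been annotated against the same reference ontology can be matched automatically, without any shared or predetermined schema. This is an illustration only, not the DFO specification itself (which is given in Chapter 6); all dataset names, column names and concept URIs below are hypothetical.

    # Column-level annotations: each column is anchored to a reference ontology class.
    # The URIs are placeholders standing in for concepts from ontologies such as UMLS.
    PATIENTS = {                       # researcher's local dataset
        "zip_code":  "http://example.org/ref#PostalCode",
        "diagnosis": "http://example.org/ref#DiseaseOrSyndrome",
    }
    AIR_QUALITY = {                    # crowd-sourced public dataset
        "zip":        "http://example.org/ref#PostalCode",
        "pm25_ug_m3": "http://example.org/ref#ParticulateMatterLevel",
    }

    def semantic_matches(source, target):
        """Return column pairs whose annotations anchor to the same ontology class."""
        return [(s_col, t_col)
                for s_col, s_cls in source.items()
                for t_col, t_cls in target.items()
                if s_cls == t_cls]

    print(semantic_matches(PATIENTS, AIR_QUALITY))   # [('zip_code', 'zip')]

Because the match is computed from the declarative annotations rather than from column names, the two datasets can be joined "on demand" even though their schemas were never coordinated.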

Chapters 5 and 6 are dedicated to the design, specification and evaluation of the DFO. Chapter 5 details the design methodology used to engineer the ontology.

1.4 Prototype Data Fusion Platform Implementation

A prototype data fusion platform is presented in Chapter 7. The prototype demonstrates how users can perform automated interpretation and ad hoc integration of multiple datasets in real time. It is designed to cut down data pre-processing time significantly. The platform design illustrates interactive features that are highly visual and intuitive and do not require programming expertise. It effectively validates the DFO and the crowdsourcing approach in automating the various data preprocessing tasks.

Chapter 2

NEED FOR DATA FUSION PLATFORMS IN RESEARCH

Summary. The motivating real-world problem that this thesis is trying to solve is identified in this chapter. The hypothesis is that the data preprocessing steps, required before the actual data analysis, form the critical time consuming factor in the overall quantitative data analysis process. Data fusion platforms with automated functionality to efficiently handle the preprocessing tasks are critically required in the research environment (and for the growing data sciences in general). In order to validate this hypothesis we interviewed researchers heavily involved in quantitative data-driven research in the biomedical informatics domain along with research statisticians and business data analysts. In addition, we surveyed the current tools readily available to researchers for data preprocessing tasks. The findings from these studies are presented in this chapter.

It is generally understood that the data-driven research and analytics process often requires integration of datasets from diverse sources involving domain knowledge, experimental and environmental data (Nielsen, 2012; Tolle, 2011). There has been significant previous research (Pawlowsky-Glahn & Buccianti, 2011; Robson & McCartan, 2016; Schutt & O'Neil, 2013) dedicated to establishing the data analysis methodology.

Data analysis involves several preprocessing activities such as data exploration, collection, cleaning, interpretation, etc., as well as post-processing activities like archiving and reuse. These iterative steps are often interwoven to incorporate findings of previous research. Although these activities span a wide range of tasks, they can be broadly classified into the following three stages:

1. Data preprocessing activities such as exploration, collection, cleaning and curation, interpretation, and annotation, achieving the normalization of meaning across datasets that is necessary prior to any queries;

2. Processing activities involving the actual analysis, where several tools are applied to the curated datasets, interim results are stored, and the research hypothesis is iteratively tested; and

3. Post-processing activities including not only archiving, but also publishing and promoting reuse of the data created by researchers working independently.

An increasing number of datasets are becoming publicly available in the basic and applied sciences (e.g. social sciences, economics, biomedical and biological sciences, etc.). The natural assumption is that this availability of data would propel data-driven research, allowing large-scale data analyses across domains and providing the capability to answer complex research questions.

Consider the following research scenarios:

"I am a medical researcher interested in finding correlations between factors affecting metabolic syndrome and liver disease in patients (such as obesity, alcoholism, etc.) and any socio-economic and geographic factors (income, education, food deserts, etc.). I have the patient data and now require the appropriate socio-economic and demographic datasets."

OR

"Can I identify patterns between car, bike and public transportation usage and the pollution data in a densely populated metro region? Such a query would require a specific set of datasets from disparate data sources like government data repositories (from the department of transportation, for instance), social data (commuter data statistics), as well as data from commercial vendors (e.g. bike rental companies)."

Both of these scenarios are representative of typical questions in data-driven research in any domain. They seem simple enough to answer, especially knowing that the required data is bound to be available in the new era of data abundance.

However, in practice, the data availability alone has not been sufficient to fulfill the promise of large-scale data-driven research as was initially expected. While an increasing number of datasets are becoming available, exploring and integrating these remains a challenge.

The thesis makes the following hypotheses in this regard:

Hypothesis 1: The majority of the time in the research process is spent in the preprocessing tasks. Although it is often assumed in data sciences in general that "80% of the time is spent in preprocessing versus only 20% spent in actual analysis", there are very few studies actually validating this claim (Pyle, 1999; Kamel, 2009; Munson, 2012). At the time of writing this thesis, to the best of our knowledge, there have been no studies specific to the research environment.

Hypothesis 2: There are fundamental differences between the quantitative data analysis process and requirements of the research environment and the corresponding process and requirements of the commercial / business environment. As a result, the traditional warehousing based data integration techniques that are successful in the business setting are not suitable for the research environment.

Hypothesis 3: Although several tools are available for analysis, very few tools address the data preprocessing tasks. Due to this unavailability, researchers often perform all data preprocessing tasks manually. Hence, the effort currently required by the independent researcher to complete these steps can be overwhelming and time consuming, causing the researcher to dedicate more time to data preprocessing and curation than to actual productive analysis of the data.

Hypothesis 4: The research environment critically requires automated platforms that allow researchers to leverage crowd-sourced datasets efficiently, specifically aiding the data preprocessing tasks in real time. Such data fusion platforms should have interactive capability with an intuitive visual interface.

2.1 Methods

In order to validate the above hypotheses, further understand the exact requirements of the researchers, and assess the capabilities of the current set of data preprocessing tools, we adopted the following study methodologies:

2.1.1 Semi-structured interviews

The first step in designing an effective data fusion platform is identifying the researchers' exact requirements and recognizing the gaps in how the current set of tools satisfies these requirements. Subject matter experts were interviewed in a semi-structured manner to gauge these requirements.

I interviewed 4 biomedical researchers with varying levels of experience, working primarily in different areas of research (such as cytogenetics, nursing, epidemiology, nutrition, etc.). In addition, we also interviewed 2 business data analysts and 2 biostatisticians.

The study adopts the grounded theory approach (Corbin & Strauss, 2014). Grounded Theory is an inductive research methodology that finds common themes and ideas from qualitative data (in our case, interview responses). In accordance with this approach,

1. The questions are designed such that the responses would be as descriptive as possible. The responses to each question are recorded separately.

2. The interview responses for each question are analyzed to find common patterns in the interviewees' responses. Responses that resonate with the same ideas and thoughts are grouped or "coded" together.

3. These groups are then aggregated and directed towards answering the hypotheses mentioned above.

2.1.2 Analysis of current available tools and systems

A survey was conducted, based on the responses of the interviewees, to assess the capabilities of the tools accessible to researchers. The evaluation analyzed the current systems to identify the gaps between their capabilities and the expectations of the researchers for a (hypothetical) fully automated system. These gaps served as the design principles for the proposed data fusion platforms.

Tools commonly used in a commercial or business setting were also surveyed. The goal was to identify why such tools are generally not prevalent in the research setting.

2.1.3 Comparison of research process versus business analytics process

The interviews revealed specific differences in the work patterns of researchers in the academic environment and data scientists working in the commercial setting. For the final study, the typical workflow for data analysis in the research environment was compared to the data analysis process in the business / commercial setting, particularly for the preprocessing of data, to further understand the exact differences in business and research process requirements. These differences severely impact the cost incurred for data preprocessing in the two environments. These costs are highlighted in the following chapter.

Table 2. Interview Questions

1. How often do you work with structured tabular datasets? (as percentage of total research datasets)
2. Typical quantitative data analysis process?
3. Typical data integration tasks?
4. Approach to each task
5. Typical problems from a technical standpoint?
   5.a. Typical platforms / tools / formats used?
   5.b. Expertise possessed / required with current tools?
6. Time spent in preprocessing vs. analysis?
7. Expectations from an automated system?
8. Knowledge of existing tools for integration

2.2 Interview Questions and Responses

Table 2 displays the questions we asked the interviewees. The questions were designed to test the hypotheses stated above and to check if the researchers' responses align with the established data analysis practices identified in previous literature. In this section we expand on them and discuss the responses. It can be observed that the interview responses also validated the hypotheses proposed by the thesis (stated above).

Question 1: How often do you work with structured (tabular) datasets?

This question was asked to get a broad perspective on how much of data-driven research in biomedical informatics involves analysis of tabular data. The interviewees were asked to provide the percentage of relational datasets out of the total datasets they utilize for research. Most interviewees provided a range of values.

Table 3. Interview Responses for question 1. The values indicate the % of times the data used for analysis is tabular in nature.

BMI = Biomedical Informatician, BioStat = Bio-statistician, DA = Data analyst

Interviewee | BMI1 | BMI2 | BMI3 | BMI4
% work with relational data | 70–80 | 60–70 | 70–80 | 80–90

Interviewee | BioStat1 | BioStat2 | DA1 | DA2
% work with relational data | 100 | 90–100 | 90–100* | 100*

Biostatisticians indicated, unanimously, that they work with structured data nearly 100% of the time. Biomedical informatics researchers had variance in their answers. The lowest response was 60% whereas the highest was 90%. The average for all interviewees was noted at 80% of the time. The exact responses are shown in Table 3.

The business analysts initially indicated use of tabular data approximately 60% of the time. On further questioning, it was found that the low percentage was attributable to the fact that the analysts also had other responsibilities relating to data administration and Extract-Transform-Load (ETL) processes within the business setting, apart from the data analysis tasks themselves. When asked how much of the analysis itself was based on relational data, the analysts corrected this to almost 100% of the time as well. The corrected responses are reflected in Table 3.

Question 2: What is the typical data analysis process followed?

This inquiry was designed to recognize any common patterns in researchers' data analysis practice. All interviewees generally agreed upon the following progression of steps, consistent with the existing literature.

• Query Formulation

The first step is query formulation for a particular research hypothesis. A researcher, or data scientist in general, generates a hypothesis and a set of queries that would be required to test that particular hypothesis. In the research environment, individual researchers generate the hypothesis independently and, thus, can generate their own autonomous set of queries. The hypothesis, and consequently the query generation, in the business setting is much more coordinated among the data analysts and driven by the business needs of the organization.

• Data Exploration

The next step is to locate the required set of datasets {Sq} that can answer the query from a pool of potential datasets {Si}. The source of the datasets {Si} is where the business and research environments differed drastically.

In the research setting, the {Si} are usually drawn from individual researchers' personal archives and publicly available datasets from global data repositories. The researchers then perform an initial analysis of the validity of the data contained within these datasets based, in part, on its relevance to previous research. This requires locating which datasets are relevant for the current research task. Note, again, that individual researchers, or at best small cohorts of collaborators, perform these tasks autonomously.

In the business setting, by contrast, the {Si} is generally an internal data store that already exists within a single system. The appropriate {Sq} are extracted from this database. External datasets, if required, are quickly identified and integrated with this set based on the data query.

Problem of optimality: A key challenge of data exploration in the current data environment is the plethora of publicly available datasets for research. In fact, research shows that the plethora of data has a detrimental effect on research (van der Putten et al., 2002). The vast volume and heterogeneity lead to the problem of optimality. This means an optimal dataset may exist, but the complicated, time-consuming and almost entirely manual data exploration process often renders it impossible to find such a dataset among other similar datasets. This problem of "finding a needle in a needle stack" in turn drastically hinders research progress.

This problem is exacerbated in the research environment due to the diversity and heterogeneity in ways to store and manage datasets, especially across domains.

• Data Interpretation

The interpretation step involves understanding the data content and the metadata of a given dataset in order to determine how the selected datasets {Sq} can be integrated with the researcher’s data (or any other datasets that the researcher might be interested in).

When a new dataset is obtained, the cost of data interpretation is extremely high. For researchers, this occurs quite often, as new hypotheses almost always require new sets of data. In other words, most of the time, among the datasets {Sq} selected by the researchers, the number of new datasets is generally higher than the number of previously used or personally owned datasets. On the other hand, in the commercial setting, most of the queries themselves are designed around the internal, previously owned data. Although this is not universally true, it is a commonly occurring work pattern.

• Data Integration

This stage involves fusing the selected datasets {Sq}, and the tables within them, based on the researchers' interpretation. The output of this integration should finally produce data ready for analysis.

All interviewees undertook different types of data integration tasks (as illustrated in question 3 below) and had several different approaches to performing each task, with varying levels of difficulty. Data integration, in general, was recognized as the most challenging task in preprocessing. This is also corroborated by other literature studying the data analysis workflow (Seligman et al., 2002). It often requires technical expertise with tools and technologies beyond the scope of the researchers' domain. This significantly impacts the usability (or reusability) of data.

Questions 3 through 5 were asked to gauge the details of the integration step.

Question 3: Typical Data Integration tasks:

The interviewees were asked to identify what types of data integration tasks they normally performed as part of their research. Based on the responses, the following classes of data integration tasks, commonly performed by researchers at the table level, were identified.

Aggregation: The researcher wants to add more data samples (rows) to the existing data table while keeping the number of fields (columns) the same. Usually the datasets are from within the same domain. From a database perspective, this is a typical "union" operation where each column in a "source" table is "matched" with its respective column in the "target" table and the data rows are combined together.
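For illustration, assuming two tables that have already been interpreted as sharing the same fields, the aggregation task reduces to a union of their rows; a minimal pandas sketch with hypothetical data:

    import pandas as pd

    # Two tables with the same fields (columns); the sample data is hypothetical.
    source = pd.DataFrame({"patient_id": [1, 2], "age": [54, 61]})
    target = pd.DataFrame({"patient_id": [3, 4], "age": [47, 39]})

    # Aggregation: match each source column to its target counterpart and stack the rows.
    aggregated = pd.concat([source, target], ignore_index=True)
    print(aggregated)   # 4 rows, same 2 columns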

A" B" C" D" A" B" C" C" D" E"

A" B" C" D"

A" B" C" D" A" B" C" D" E"

(a) Aggregation of data (b) Expansion of tables

Figure 2. Data Integration tasks

Expansion: The researcher wants to add more fields (columns) to the same data points (rows). This appeared to be a common use case when adding data from external sources. Here the researchers had to find a possible join (based on a common field) between the datasets to be integrated. From the database perspective, this is a typical “join” operation over a particular column (or set of columns) to give a unified table.
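Assuming a common field has already been identified between the two datasets, the expansion task corresponds to a relational join; a minimal pandas sketch with hypothetical data:

    import pandas as pd

    # Hypothetical datasets sharing a common field ("zip_code").
    patients = pd.DataFrame({"patient_id": [1, 2], "zip_code": ["43210", "43201"]})
    income = pd.DataFrame({"zip_code": ["43210", "43201"], "median_income": [52000, 61000]})

    # Expansion: a join over the common column adds new fields (columns) to the same rows.
    expanded = patients.merge(income, on="zip_code", how="left")
    print(expanded)   # columns: patient_id, zip_code, median_income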

Mediation: A special case of expansion in which no direct possible join exists between the datasets. The alternatives are to look for another dataset or to find an intermediate dataset that can bridge the datasets. Interviewees indicated the manual effort of finding intermediate datasets is huge. Thus, they often have to settle for alternative datasets that might not be the most suitable but are easier to interpret and integrate. Very few exceptions are made, when the dataset is too valuable or absolutely critical. In such cases, the preprocessing time required for the interpretation and integration is also significantly higher.

Figure 3. Data integration through mediation. An intermediate dataset is identified and leveraged to integrate the required datasets when no direct matches are present between them.
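The mediation case can be sketched in the same way: when no column is shared between the two datasets of interest, an intermediate dataset that shares a column with each of them bridges the join. The tables and columns below are hypothetical.

    import pandas as pd

    # No shared column between "patients" and "pollution", but an intermediate
    # "hospitals" table bridges them (hospital_id -> zip_code).
    patients = pd.DataFrame({"patient_id": [1, 2], "hospital_id": ["H1", "H2"]})
    hospitals = pd.DataFrame({"hospital_id": ["H1", "H2"], "zip_code": ["43210", "43201"]})
    pollution = pd.DataFrame({"zip_code": ["43210", "43201"], "pm25": [11.2, 8.7]})

    # Mediation: join through the intermediate dataset in two steps.
    bridged = patients.merge(hospitals, on="hospital_id").merge(pollution, on="zip_code")
    print(bridged)   # columns: patient_id, hospital_id, zip_code, pm25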

Question 4: Approach to each task

The interviewees were asked to provide anecdotal evidence, based on their experience, of how these tasks were handled. All the interviewees, apart from the business analysts, said that they performed the preprocessing tasks manually, and very often, individually. The researchers generally used to rely on their previous experience to know the domain-specific data sources and repositories. However, most of the researchers now confessed to "googling" for data and manually sifting through datasets, assessing their usability and quality, and trying to interpret and eventually integrate them for each research hypothesis / query. This is becoming increasingly difficult with the increasing number and sources of data.

In contrast, in the business scenario, the datasets are generally limited to the scope of the current organization. External datasets, if any, are usually reused from a pool of known and available datasets. Also, in most cases the analysts work as teams and have a greater pool of resources at their disposal. Thus, the approach to the preprocessing tasks, and the process itself, is significantly different in the research and business environments.

Question 5: Time spent in pre-processing vs. analysis

This was a straightforward and yet the most crucial question, asking the interviewees what percentage of their research time is dedicated to the preprocessing tasks identified through the previous questions. The researchers indicated that the preprocessing time is lower (still averaging 60%) if the dataset has been used previously, saving time on the data interpretation. The corresponding time increases (averaging 80%) when a new dataset is encountered. However, this does not negate the time spent in the data integration.

Even the data analysts in the business setting claimed a similar percentage when dealing with a new dataset. However, the time is cut down drastically once the data is integrated in a warehouse. The business environment relies heavily on this warehousing to reduce the preprocessing costs over time.

27 Question 6: Typical problems from a technical standpoint

A two-part question gathered information on the technical aspects related to these processes. The first part gauged the interviewees' current technical requirements for data handling and management, specifically for data-driven analysis. The second part assessed the interviewees' current capabilities to perform data preprocessing tasks from a technical standpoint.

Question 6.a: What are the typical data formats used and what tools / platforms are used to process these formats?

The first difference between the research and commercial settings is the size and complexity of the datasets used by the data scientists / researchers. The research datasets, even those obtained from large data repositories, are relatively small, both in terms of complexity and generally also in terms of dataset size. Research literature also agrees with these observations (Rajasekar et al., 2013; Cragin et al., 2010), described by Palmer et al. (2007) as the "long tail of science." To facilitate reuse, datasets are crowd-sourced into shared cross-domain "meta" data repositories like Harvard Dataverse (King, 2007), DataHub (Bhardwaj et al., 2014), etc., or domain-specific collections such as dbGap (Mailman et al., 2007) or GenBank (Benson et al., 2008). Thus, the "big data" problem faced by the researchers is not that of volume, and neither of velocity; it is that of variety. The number and heterogeneity of datasets and sources, as well as the absence of meta-data critically required for effective interoperability across the datasets, result in huge costs of data exploration and interpretation and their subsequent integration.

The second difference was in the format of the data primarily used. From the format perspective, the researchers and statisticians relied heavily on raw data in comma-separated or tab-separated values formats (.txt, .csv, .dat, etc.). This was due, largely, to their use of statistical tools and platforms including Excel, SPSS [7] and Stata [8] for analysis, and the ease of ingesting the said data formats directly into these tools. In addition, they were also comfortable with tool-specific formats for SPSS, Stata, etc. Some researchers occasionally used data from traditional relational databases.

The business analysts also used SPSS and Stata (and, very minimally, Excel) for data analysis. Additionally, they relied heavily on commercial platforms, for example the SAS [9] and SAP Hana [10] platforms. However, the business analysts relied primarily on data extracted from a single data store, which can be in any format: SQL, graph-based, etc. This data was translated into the required format in an integrated data store through ETL processes, based on the analytics tools used.

[7] IBM SPSS software (Retrieved 2016, Feb. 24), http://www-01.ibm.com/software/analytics/spss/
[8] Stata statistical analysis tool (Retrieved 2016, Feb. 24), http://www.stata.com/
[9] SAS 9.4 (Retrieved 2016, Feb. 24), https://www.sas.com/en_us/software/sas9.html
[10] SAP Hana (Retrieved 2016, Feb. 24), https://hana.sap.com/capabilities/analytics.html

29 Question 6.b: What is the expertise and comfort-level with the tools and formats used?

All the interviewees, researchers and analysts alike, were comfortable with the manipulation of raw data formats (such as text, csv, Excel, etc.). However, some stark differences were identified in the technical abilities and, as a result, methods of operation.

Though well versed with statistical languages (like R and Matlab), the interviewed researchers (with the exception of one) had only basic knowledge of other "mainstream" programming languages such as Java and C/C++, or even database languages like SQL. Most researchers manipulated and preprocessed data manually to obtain data in the desired format.

Occasionally, researchers obtained the data programmatically as well, using scripting languages like Perl or Python, but this was found to be rare. In general, the researchers lacked the technical skills with SQL to manipulate large relational databases. This lack of automation and expertise obviously adds a lot of time to the preprocessing. In addition, it severely curtails the selection of datasets. Most researchers relied on external programming expertise for critical and large projects. This is complicated because researchers typically work individually and have limited resources.

The business analysts, on the other hand, generally work in larger teams and thus have access to a good amount of technical expertise. The analysts we interviewed were both from computer science / engineering backgrounds and possessed significant programming skills themselves, and identified that this is not uncommon in the commercial environment. They also have greater resources available to them that allow them to utilize large-scale data integration platforms with automated schema matching and mapping techniques, driving the preprocessing cost further down.

Question 7: Expectations from automated systems:

The researchers unanimously agreed that intuitive platforms requiring limited computer programming knowledge are highly desirable. Ideally, the systems or tools should be completely automated, with minimum intervention from the researcher. They also identified that although there are several tools for data analysis (and the researchers have sufficient technical expertise with such tools), there are few affordable tools that support research-specific data preprocessing tasks.

Figure 4 illustrates the capabilities desired of new automated platforms from the researchers' perspective. The specific functionality, grouped around the different preprocessing stages, as identified by the researchers we interviewed, is as follows:

Dataset exploration

• Search for relevant published datasets using dataset annotations in global data repositories.

Figure 4. The "Unicorn" Data Fusion Platform, with capabilities grouped into Data Discovery, Data Mediation, Data Integration and Data Visualization. The expectations of researchers from new systems are fairly high. The key observation is their emphasis on an "end-to-end" solution with automated preprocessing capabilities.

• Automatically detect usable datasets (that can be integrated) with local (researcher) datasets.

Dataset Interpretation

• Automated "match" finding between the local dataset and public datasets.

• Automatically find intermediate datasets for integrating two given datasets with no direct "match" between them.

• Get information such as missing values, source, etc., and provide tools for assessment.

Data Integration

• Perform ad-hoc, real-time "joins" across identified datasets and deliver them in a relational (tabular) format for use in various analysis tools.

• Compare multiple datasets for integration in a real-time environment using heuristics of data quality / retention / loss.

Example: Datasets A and B can be joined directly on column X, or can be joined through an intermediate dataset P on different columns, etc. The number of data rows (observations) retained is a good heuristic for choosing between such alternatives.

Visualization

• An intuitive and visual interface that does not demand programming experience.

• Interactive interfaces (drag-and-drop, etc.).

The automation provided by end-to-end data fusion platforms with such capabilities is crucial, but far from realization. Implementing such platforms would require rethinking the way research datasets are published. It would also require a meta-data representation framework capable of providing the required interoperability across datasets, not only at the dataset level but also at the deeper table and column level.
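To make the row-retention heuristic from the integration capabilities above concrete, the following minimal sketch scores two candidate join paths by the fraction of the researcher's local rows that survive the join. It is illustrative only: the datasets A, B, P and their columns are hypothetical stand-ins for the example above, and pandas is used simply as a convenient tabular engine, not as the envisioned platform itself.

```python
# Minimal sketch (not the platform itself): scoring candidate join paths by row retention.
# Dataset names (A, B, P) and column names are hypothetical, mirroring the example above.
import pandas as pd

A = pd.DataFrame({"X": [1, 2, 3, 4], "outcome": [0.1, 0.4, 0.3, 0.9]})   # local dataset
B = pd.DataFrame({"X": [2, 3, 5], "exposure": ["low", "high", "low"]})
P = pd.DataFrame({"X": [1, 2, 3], "Y": ["a", "b", "c"]})                  # intermediate dataset
B_alt = pd.DataFrame({"Y": ["a", "c", "d"], "exposure": ["low", "high", "low"]})

def retention(joined: pd.DataFrame, local: pd.DataFrame) -> float:
    """Fraction of the researcher's local rows that survive the join."""
    return len(joined) / len(local)

# Candidate 1: direct join of A and B on column X.
direct = A.merge(B, on="X", how="inner")

# Candidate 2: join A to P on X, then to B_alt on Y (via the intermediate dataset).
via_P = A.merge(P, on="X", how="inner").merge(B_alt, on="Y", how="inner")

for name, joined in [("direct on X", direct), ("via intermediate P", via_P)]:
    print(f"{name}: {len(joined)} rows retained ({retention(joined, A):.0%} of A)")
```

In an actual platform this comparison would run automatically over all candidate join paths; the sketch only shows the retention measure the researchers asked for.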

Question 8: What are the different data preprocessing tools that you currently have access to or have knowledge of?

This question was mainly directed towards the researchers and statisticians. The business analysts had knowledge of (and also worked with) large-scale, commercial data integration platforms such as IBM InfoSphere, Informatica, SAS, etc. Most researchers, in contrast, had little exposure to these commercial platforms (mentioned by the business analysts) or even to open source or "freeware" tools. Though some of the researchers were aware of these tools, they identified that the tools do not fit the researchers' requirements. The reasons identified by the researchers are discussed in detail in the next section.

Most researchers regularly used domain-specific search tools for data discovery, notably data.gov (Ding et. al. 2010) and census data for environmental and socio-economic datasets, along with several domain-specific data repositories such as health.data.gov for health-related data (especially population health), GenBank for genetic data, NCBI for biomedical datasets (Geer et. al., 2009), etc. As mentioned earlier, the researchers frequently rely on their domain experience to identify the data sources. For example, a researcher studying drug interactions would rely on DrugBank or the Kyoto Encyclopedia of Genes and Genomes (KEGG) for pathway analysis.

Another important observation in this regard is that data exploration is still reasonably quick when the hypothesis is domain specific. However, if the research question is large-scale or covers a "big picture", requiring datasets and collaborators from across domains, then the time required for data preprocessing increases exponentially. This was found to be true even within the larger biomedical and life sciences domain. For example, consider a researcher trying to find patterns between patients' responses to medication, traits within their respective genomic structures, and correlations with their activity. This requires integration of genomic, medication and behavioral data.

Some researchers also used workflow-based tools, such as Galaxy workflow (Goecks et. al., 2010), Taverna (Wolstencroft et. al., 2013), the LONI pipeline (Rex et. al., 2003), etc., for specific types of data (especially gene and image data).

However, most researchers did not use any automated preprocessing tools, let alone data integration platforms, primarily because such tools did not fit their requirements and because the researchers had limited resources and access.

2.2.1 Summary

The interviews revealed major differences between the process and philosophy of individual researchers and those of large-scale business analytics.

The key findings are listed below:

• About 80% of data-driven research (at least in the selected biomedical informatics domain) relies on tabular data.

• The research datasets are generally much smaller in size and complexity compared to commercial datasets, and they are generally published in textual formats (csv, Excel, etc.).

• Currently researchers have limited access to automated tools to help with data preprocessing and usually perform all tasks manually. Consequently, nearly 80% of the research time is spent in preprocessing as opposed to analysis.

• Of the preprocessing tasks, the integration of heterogeneous datasets for each different research query / hypothesis is the most critical.

• The current set of tools is more suitable for the large-scale business analytics process than for individual researcher workflows.

• Individual researchers usually possess their own collections of datasets that are rarely shared or reused. This is mainly due to the lack of standardized frameworks for data archiving that can support data preprocessing tasks.

2.3 Analysis of current available tools and systems

Tools for data-driven research and analytics processes work at different levels of metadata granularity and use diverse views of a dataset to provide different functionalities. Three types of tools are predominantly used in data preprocessing, as identified in the interviews. These are data repositories, scientific workflow systems and data integration platforms.

2.3.1 Data Repositories

Several data repositories are now readily available to researchers to find relevant datasets. A detailed survey of such repositories is found in the study by Marcial & Hemminger (2010). These registries rarely store the actual data themselves. Usually, datasets are registered in the repository, allowing the researcher to find the relevant data and link to the actual dataset.

The repositories generally tend to be domain-specific, for example, the NCBI Entrez (Maglott et. al., 2005) collection of biomedical datasets and GenBank specifically for gene datasets, among many others11. Such repositories were created through consolidated efforts among researchers within specific domains (and often within specific regions) to create data networks to facilitate data reuse and collaboration.

For example, the Shared Health Research Information Network (SHRINE) (Weber et. al., 2009) or the Inter-university Consortium for Political and Social Research (ICPSR) (The ICPSR Annual Report, 2015), one of the biggest collections of social and behavioral data.

In addition, there has been significant effort dedicated to the development of cross-domain "meta" data repositories. Notable interdisciplinary research repositories include the Harvard DataVerse (King, 2007) and the Registry of Research Data Repositories (re3data) (Pampel et. al., 2013). Collections such as data.gov and census.gov contain a variety of publicly available datasets from various government agencies. Another important collection of data is the DataHub, based on CKAN12. Similarly, the NuCivic DKAN data catalog platform13 in Europe has over 3600 datasets registered.

In general, these tools currently provide only dataset search capabilities. This expedites data exploration somewhat, but the remaining data preprocessing tasks still have to be performed manually.

11 A list of open data repositories can be found at oad.simmons.edu/oadwiki/Data_repositories
12 http://datahub.io/
13 http://www.nucivic.com/dkan/

2.3.2 Workflow Management Tools

Another notable approach is the use of workflow-based data platforms, such as Taverna and Galaxy Workflow. Such tools allow researchers to create and share common tasks that need to be performed on data in the form of a series of repeatable steps in a single workflow. For example, the process of converting raw genetic data to a usable tabular format is a standard workflow.

Such tools are useful for recording common preprocessing steps, such as transformations and cleaning, on repeatedly used datasets and formats that are very domain specific, e.g. microarray DNA chip data or pathology imaging data. Thus, interpretation and integration of new datasets is still problematic, and cross-domain research is entirely unsupported.

2.3.3 Data Integration Platforms

Commercial large-scale data integration platforms, such as Informatica14 and Infosphere15, among many others mentioned earlier, come bundled with data pre-processing and visualization capabilities. Other data pre-processing specific platforms, such as Paxata16 and Trifacta17, are also gaining popularity. Eckerson & White (2003) conducted a detailed survey studying the prevalence, functionality, technical expertise required to operate, and cost of such systems.

14 http://www.informatica.com/
15 http://www-01.ibm.com/software/data/infosphere
16 http://www.paxata.com/
17 http://www.trifacta.com/

The data integration platforms available commercially are designed for large-scale, shared business needs and not for the individual researcher's needs. The reasons identified for this gap are as follows:

Financial constraints: Researchers typically work independently or in small collaborative cohorts. Thus, they have limited research budgets. Typical commercial enterprise-scale platforms are too expensive for such budgets.

Lack of intuitive interfaces: While commercial data platforms are moving towards more graphical interfaces, most still require significant programming expertise for setup and advanced pre-processing functionality. Even though there are some free (open-source) tools available, these require even more technical expertise.

Reliance on traditional data warehousing and ETL process: Traditionally, data integration tools follow the Extract-Transform-Load (ETL) process, relying on schema mapping and matching. This warehousing approach assumes that the data, once integrated, will be reused for a long duration of time, as is often the case in business environments. This is not suitable for research, as researchers frequently need to work with many new and different datasets for different research hypotheses.

2.3.4 Identified gaps

• The data repositories have good search capabilities for recommending relevant datasets and obtaining provenance information. However, with the increasing number of very similar datasets, the interpretation of the recommended datasets to find the "best fit" is proving to be a difficult problem.

• The researchers have access to data repositories that allow searching for relevant datasets (partially supporting data exploration), as well as tools for data analysis. But there are no tools that can bridge the gap between them. As a result, most preprocessing (interpretation and integration) tasks are performed manually.

• Commercial data integration platforms with functionality for automated data preprocessing are not suitable in the research environment (for the reasons mentioned above).

This unsuitability of the traditional data warehousing approach in research environments, owing to the specific work patterns and requirements of researchers, and its impact on the cost of data preprocessing are the subject of the next chapter.

Chapter 3

TRADITIONAL DATA WAREHOUSING VERSUS RESEARCH ENVIRONMENT

Summary. In the previous chapter, the interviews and the analysis of available tools revealed gaps between the current data preprocessing tools and warehousing-based integration platforms of the commercial environment and the specific requirements of the research environment. This chapter expands on the underlying causes of this divide with regard to the differences in process and work patterns. These differences have a profound impact on the cost of data preprocessing and the overall data analysis process. The chapter also identifies the hard computational problem that is at the core of this thesis.

Table 4. Differences in the business and research environments.

Work patterns
Business Environment (warehousing process):
- Coordination of efforts by data analysts (within the organization), typically driven by specific business needs, which can be incremental.
- Teams manage data.
- The datasets are highly reusable.
Research Environment (autonomous processes):
- Highly autonomous efforts driven by researchers' hypotheses and only loose collaboration across organizations. The efforts are highly iterative, but not necessarily incremental.
- Individuals (or small groups of collaborators) manage data.

Heterogeneity of data sources
Business Environment (warehousing process):
- Mostly internal data (e.g. operational data from departments within an organization). Hence, a finite known number.
- Additionally, there can be a few known external datasets (such as environmental or socio-economic).
- Limited number of large datasets, with a controlled set of entities and relationships.
- A predetermined schema is conceivable and manageable due to this finiteness.
Research Environment (autonomous processes):
- Individual researchers (from various domains, with various hypotheses) explore heterogeneous data sources on the web, external to their own research repositories.
- Practically infinite number of data sources, entities and relationships.
- A predetermined "universal" schema generalizable for ALL research would be impossible to achieve and extremely difficult to maintain by distributed researchers.

Heterogeneity of data formats and preprocessing challenges
Business Environment (warehousing process):
- Data is in relational databases (e.g. SQL) with carefully designed schema definitions.
- If not, data is loaded into the warehouse using schema mapping by the Extract-Transform-Load (ETL) process and integrated.
Research Environment (autonomous processes):
- Heterogeneous data formats (such as csv, Excel sheets or even plain text), usually without any schema information.
- Lack of schemas renders established schema matching, mapping and ETL techniques ineffective.

Resources (expertise and budget)
Business Environment (warehousing process):
- High technical expertise capable of handling complex datasets and a variety of formats.
- Large budgets and dedicated resources for data preprocessing.
Research Environment (autonomous processes):
- Low technical expertise lacking the capability to handle complex datasets.
- Limited research budgets.

Availability of tools
Business Environment (warehousing process):
- Large budgets and the predetermined schema allow implementation of sophisticated data integration platforms (such as Informatica, SAP Hana, IBM Infosphere, etc.) that support data preprocessing.
Research Environment (autonomous processes):
- Data repositories (both domain specific and cross-domain) are common, providing support for data identification, but the preprocessing tasks (e.g. interpretation and integration) are still manual.
- Several tools are available for analysis (such as SPSS, Stata, R packages, etc.) but not for preprocessing.

3.1 Research versus Business Analytics

The previous chapter identified the differences between the requirements of data scientists / analysts within the research environment and those in the typical business setting. It also identified the gaps between the warehousing-based data integration and preprocessing tools available today and the specific requirements of the researchers. A further analysis of these gaps, presented in this section, reveals that the misalignment is due to deeper disparities in workflow and work patterns between the research and commercial environments. Table 4 summarizes the differences in the work patterns and data characteristics of the business and research environments.

3.1.1 Process Differences

Though the end goal is analysis of data and the analysis tools are common, the preprocessing steps vary drastically. The differences between the requirements of the research process and the data warehousing process were identified fairly early (Widom, 1995). The main process distinction lies in the ordering of two steps: query formulation and data integration.

Data warehousing process

Data integration is a well-established solution for achieving interoperability across datasets in the enterprise business environment (Goodhue et. al., 1992). Traditional data integration implements an "in advance" data warehousing approach (Inmon, 2005; Kimball, 2011) using mappings {Mi} from known datasets {Si} to a pre-determined, centralized "target" data schema T (Lenzerini, 2002). Computationally, data warehousing is defined by the triple (T, {Si}, {Mi}), as shown in Figure 5. It follows an "eager" or "in-advance" approach, explained below.

1. Data from all sources of interest {Si} is integrated (using an Extract-Transform-Load (ETL) process for normalizing) and mapped {Mi} to the predetermined schema T. These sources are mostly internal (e.g. operational data generated by different departments within an organization) or predetermined external data sources (e.g. socio-economic or geo-political datasets). The mappings generated are persistent and are updated as the underlying data changes.

2. Queries {Q} are then formulated (usually to address specific business / organization needs) in a centralized manner. They are compiled against the central schema, T. The required data is then accessed through the mappings {Mi}. Thus, the mappings {Mi} play the critical role of bridging the central schema T to the data sources {Si} for query translation.

Such an "in-advance" approach to data integration works well when dealing with a predictable number of data sources, as in a business setting. Here, the queries formulated are dictated by the data available in the repository. A centralized data store is highly advantageous for performing consolidated queries on the data, as well as for several processing and analytics applications critical to business, such as OLAP and OLTP (Chaudhuri & Dayal, 1997). Nowadays, even though the data is spatially distributed, the principle of a predetermined schema has remained unchanged.
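As a toy illustration of the triple (T, {Si}, {Mi}) just described, the sketch below loads two hypothetical sources into a fixed target schema through persistent column mappings and then answers a query against T alone. All schemas, mappings and records here are invented for illustration; this is not the dissertation's implementation.

```python
# Minimal sketch of the warehousing triple (T, {Si}, {Mi}): queries are written against
# the fixed target schema T and answered through persistent source-to-target mappings.
# All schemas, mappings and records below are hypothetical.
import pandas as pd

T = ["patient_id", "age", "diagnosis"]                     # predetermined target schema

S1 = pd.DataFrame({"pid": [1, 2], "yrs": [34, 51], "dx": ["asthma", "copd"]})
S2 = pd.DataFrame({"subject": [3], "age_years": [47], "condition": ["asthma"]})

# Persistent mappings {Mi}: source column -> target column, generated once, in advance.
M = {
    "S1": {"pid": "patient_id", "yrs": "age", "dx": "diagnosis"},
    "S2": {"subject": "patient_id", "age_years": "age", "condition": "diagnosis"},
}

def load(sources: dict[str, pd.DataFrame]) -> pd.DataFrame:
    """ETL step: rename every source into T and append it to the central store."""
    frames = [df.rename(columns=M[name])[T] for name, df in sources.items()]
    return pd.concat(frames, ignore_index=True)

warehouse = load({"S1": S1, "S2": S2})
# A query from {Q} is formulated directly against T; the mapping cost has already been paid.
print(warehouse.query("diagnosis == 'asthma'"))
```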

Figure 5. The traditional data warehousing architectural "triple": a target schema T, a set {Mi} of source-to-target schema mappings, and a finite set {Si} of known internal and external data sources.

Research environment process

Research on the other hand requires an “on-demand” or “ad hoc” approach, as explained below.

1. In research, queries {Q} are determined based on the research hypotheses. Such hypotheses are dynamic, and so researchers, working primarily autonomously, generate multiple unrelated queries on an ongoing basis. The appropriate data {Sq} to answer a query is thus also determined by exploring and interpreting several data sources {Si} dynamically.

2. The required data sources {Sq} are then integrated in an ad hoc fashion. This, as in warehousing, requires creating mappings across the data sources. However, these mappings are specific to the immediate integration task and do not persist beyond it. This is mainly because a universal target schema (T) to which these {Sq} could be mapped cannot exist, given the autonomous and highly distributed nature of research.

An important note here is that the steps in the research environment are the reverse of those in data warehousing above. Specifically, in research environments, data is integrated after the queries are formulated, and this integration cost is incurred every time a new query is formulated. This fundamental reversal of process between traditional data warehousing and the research environment, as illustrated in Figure 6, is the root characterization of the paradigm shift discussed here.

Figure 6. Comparing the data preprocessing workflows for the individual researcher versus large-scale business analytics: (a) typical research workflow, (b) typical business analytics workflow.

3.1.2 Dynamic and iterative nature of research versus the relatively stable and controlled business environment

The new hypothesis generation step in the research process corresponds to the business need identification step in the business analytics workflow. While previous research analysis and findings can naturally lead to new hypothesis generation, so too the analysis and results of quality improvement initiatives may illuminate new business needs.

However, in research, there may be other factors which influence hypothesis generation, such as investigator expertise or the current funding environment. Thus, research hypotheses are generated more iteratively and dynamically as research progresses. This requires using new datasets that are often crowd-sourced from public repositories. Thus, as explained in the section above, the identification of datasets and query formulation is "on-demand", as queries are dictated by changing hypotheses. Hence, this process is repeated for each research cycle.

In business, client requirements, availability of data, as well as profit or cost limitations may influence business needs and as a result the data analysis priorities.

In the business setting, the {Si} are mostly known datasets and the users are limited and organized. Hence, T is relatively stable, and the mappings provide the necessary interoperability meta-data critical to automating the data interpretation and integration tasks across multiple datasets at query time.

Once a research cycle is started (i.e. the hypothesis is set), there is generally little deviation from the original research plan until the hypothesis can be answered. The lack of a pre-integrated repository also contributes to this rigidity. In contrast, the business model may change the proposed plan while executing it multiple times through an iterative process, with less difficulty, due to its "in advance" warehousing. This is important from a data flow and fusion perspective throughout both models.

3.1.3 Autonomous research versus coordinated business data analytics

In the business environment, the data analysts generally coordinate their queries and, as a result, the datasets required to answer them. To illustrate, in business analytics the business needs are dynamic, so the analysts prepare by maintaining an integrated data store. In business analytics, all data is generally aggregated into a single data store. Business vision, mission, and financial goals may act as a cohesive glue pushing for this data store to remain updated, both in terms of the data and the methods deployed. The same data store is frequently reused and all future analysis is also executed on this store. Thus, the cost of data integration is shared. Moreover, because of this coordination, large-scale data integration solutions are applicable and cost effective.

Conversely, the research process traditionally functions with less cohesion. Researchers by and large work autonomously. Individual researchers often archive datasets separately and in their own way. They do not have a common framework for connecting these datasets. This currently leads to a pattern of data integration that happens in relative vacuums, with little coordinated effort outside of the immediate research group, duplication of effort, and inefficient workflow processes. The differing lengths of these cycles are an unintended consequence of the two models. The business analytics model allows for a faster turnaround time of the cycle than the research process.

3.2 Quantifying the Cost of Data Analysis

The above differences have a profound influence on the cost of the overall data analysis process. This section analyzes the cost factors in the "in advance", traditional warehousing-based approach of business environments and in the "on demand", largely manual data preprocessing of the research environment.

3.2.1 Cost in Traditional Data-warehousing

The typical steps and operations and the costs involved are as follows.

1. Cost of “in advance" data integration

Architecturally and computationally, data integration in warehousing requires the triple (T, {Si}, {Mi}) (Lenzerini, 2002), as shown in Figure 5. Data from all sources of interest {Si} is integrated "in advance" through an Extract-Transform-Load (ETL) process for normalizing and mapping, and stored according to a predetermined schema T. These sources are generally a finite, manageable number, consisting mainly of internal sources (e.g. data generated by different departments within an organization) or known external data sources (such as environmental or socio-economic data). Hence, the target schema is typically of computationally manageable size and complexity.

The mappings {Mi} play the critical role of anchoring each data source in {Si} to the central schema T for efficient data access (Chaudhuri & Dayal, 1997). Thus, generating these mappings is the most critical cost factor in the integration. The cost of generating the mappings {Mi} can be calculated as follows. Suppose the cost of generating a mapping Mi for each Si is Ci. Ci depends on the complexity of Si and the coverage of its entities by T. Also, the cost of integrating an external data source is typically higher, as there are additional process costs of identification, interpretation, curation, etc. Generally speaking, any mapping generation step that requires modification of the schema will incur a higher cost. Overall, the initial cost of integrating n data sources is the sum of the individual mapping costs, ∑_{i=1}^{n} Ci.
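The split between internal and external sources described above can also be written out explicitly. The surcharge term E_k below is introduced here purely to illustrate the extra identification / interpretation / curation cost of external sources; it is not part of the dissertation's notation.

```latex
% One-time, "in advance" integration cost of the warehousing triple (T, {S_i}, {M_i}).
% E_k is a hypothetical surcharge for external sources; only C_i comes from the text above.
C_{\text{in-advance}}
  = \sum_{i=1}^{n} C_i
  = \sum_{j \in \text{internal}} C_j
  + \sum_{k \in \text{external}} \left( C_k + E_k \right)
```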

Figure 7. Data query process and associated cost factors in the traditional "in advance" warehousing-based approach. The critical cost factor is the cost of "in advance" data integration (generating mappings for the data sources, a schema matching operation), which is incurred only once. The cost of data exploration and query translation is incurred for each query, but is severely reduced because the persistent, pre-determined mappings enable automation.

However, since these mappings are generated in advance and persisted for reuse this cost of data exploration and integration is incurred only once per data source. In addition, this cost is incurred in advance and not during the query processing. Moreover, usually in an enterprise setting, dedicated knowledge resources and automated schema mapping and matching tools are used to reduce the cost Ci (Bernstein et. al, 2011; Rahm & Bernstein, 2001; Xu & Embley, 2003).

2. Cost of schema and mapping maintenance and adding new data sources

The mappings {Mi} are updated as required when the schema is modified or the underlying data changes. However, this cost is not critical, as the effective cost for a single mapping Mn can be at most Cn, i.e. the cost of generating it originally. An update should typically affect only a subset of the data sources. Also, as with the initial data integration, this cost is incurred before the actual query process itself.

Similarly, when a new data source Sx is added, a mapping Mx is generated with cost Cx and the data source is integrated into T. Since T remains constant, this cost is incurred incrementally in warehousing.

3. Query formulation and cost of query translation

Queries Qi in warehousing are formulated against the central schema, T. The query set is usually agreed upon and consolidated among the data analysts based on specific business / organization needs. The same schema is reused frequently for querying. The required data is directly accessed through the already existing mappings Mi. Thus, the majority of the data preprocessing cost is incurred in advance and effectively one time, as shown in Figure 7. Typically, in the data-warehousing paradigm, the frequency and number of queries against the pre-integrated schema far outweigh the number of modifications. Thus, the amortized cost of the query itself is fixed and computationally finite.

Thus, in the warehousing paradigm, the cost of the query process itself is highly affordable, as it is directly related to the automated query translation and so can be computationally completed in real time.

3.2.2 Cost in the Typical Autonomous Research Process

Typically, individual researchers work highly autonomously, and hence query processing in research is usually performed independently by every researcher, as in Figure 8. Furthermore, the research process itself requires "on demand" or "ad hoc" data integration to pursue the unforeseen. Such an approach has a significant impact on the cost factors and, importantly, on when in the research process they are incurred, as explained below.

1. Query Formulation

In research, the set of queries is more exploratory in nature and not limited by the scope of a schema. The queries are determined by the research hypotheses. Individual researchers have their own independent hypotheses, which are also constantly evolving / changing as research progresses. Hence, multiple, possibly unrelated, queries are generated simultaneously.

2. Cost of “ad hoc” Data Exploration

A researcher explores different heterogeneous data sources {Si} independently to identify the data {Sq} that would be required to answer his/her specific queries. Thus, the cumulative scope and number of data sources {Si} over all researchers is almost infinite. In research, this exploration is largely done manually, relying mainly on past experience with domain-specific repositories and the aid of limited search tools. Thus, the cost of exploration may decrease as the researcher gains experience with a particular dataset or domain. Still, with the increase in the number of data sources within the domains themselves, the overall cost of data exploration is quite high and is incurred by each individual researcher. This cost is compounded further when working across domains. Since each query set or hypothesis potentially requires a completely new set of data to answer, this cost is incurred repeatedly for each new query cycle.

Figure 8. Data query process and associated cost factors in the "on demand" research environment. Each researcher independently incurs a cost of exploration (finding the required data {Sq} to answer his/her queries from the pool of data sources) and a cost of "on demand" integration (comparable to the cost of generating mappings, a schema matching operation). Both costs are heavy and are incurred repeatedly for each research / query cycle by each individual researcher working autonomously.

3. Cost of "on demand" Data Integration

Once the required data sources {Sq} are collected, they need to be integrated "on demand" in order to answer the queries. This again means creating mappings across the data sources. Let us assume again that Ci is the cost of generating a mapping. Ci in research is somewhat larger than its counterpart in the business environment. At the same time, researchers lack the resources, tools and expertise to generate the mappings efficiently and thus often do so entirely manually. Generating mappings is therefore extremely time-consuming, and individual researchers incur a high cost of data integration within the query cycle. This drastically increases the overall cost of integration. (This cost also limits the complexity of data sources that researchers can integrate.)

In addition, these mappings, unlike mappings {Mi} in the data warehousing approach, do not today persist beyond the integration task. Due to this, the full cost of integration is incurred for each new query cycle.

To emphasize, as there is no effective way to persist the knowledge of previous exploration and integration tasks, these costs are incurred repeatedly for each new query cycle. This is often true even when researchers use the same data sources for subsequent query cycles. The compounding of the higher cost of exploration and integration incurred within the query cycle with its repetition for each new query cycle multiplies the cost of a query in research. A big reason for the hidden cumulative cost of the research process is that the autonomous nature of research means that even if different researchers were working on the same set of data, each would incur the cost of exploration and integration separately. This duplication further escalates the cost of overall research.
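A schematic way to contrast the two regimes, using the Ci notation above plus a few symbols introduced only for this sketch (R autonomous researchers, H query cycles per researcher, Q warehouse queries, and a small per-query translation cost c_q), is:

```latex
% Illustrative totals only; R, H, Q and c_q are hypothetical bookkeeping symbols.
\text{Warehousing: } C_{\text{total}} \approx
  \underbrace{\sum_{i=1}^{n} C_i}_{\text{paid once, in advance}} + \; Q \cdot c_q

\text{Research (today): } C_{\text{total}} \approx
  \sum_{r=1}^{R} \sum_{h=1}^{H}
  \Big( C^{\,r,h}_{\text{explore}} + \sum_{i \in S_q^{\,r,h}} C_i \Big)
```

The point of the comparison is only that the inner sum is paid inside every research cycle, by every researcher, whereas in warehousing it is paid once and amortized over all queries.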

Table 5. Summary of cost factors and how they affect the data warehousing and research environments.

1. Predetermined Schema (T)
Data warehousing:
• Monolithic, centralized organization of data.
• A fixed target schema (T) is possible to achieve due to a finite known number of data sources {Si}.
• Coordination costs of designing, updating, maintaining and accessing T are incurred, but are finite because {Si} is finite.
Research environment:
• Distributed, autonomous environment.
• A universal schema that can encompass all entity types would be too complex. Numerous, unknown data sources and types would require constant maintenance of the schema.
• Thus, a hugely undetermined coordination cost arising from many atomicity and consistency issues, and a huge cost of adherence to such a complex and dynamically changing schema.
Conclusion: A universal schema for all research data would be ideal but impossible to achieve in reality. The cost of maintaining such a schema cannot be sustained in research environments.

2. Cost of generating mappings {Mi}
Let the cost of generating a mapping Mi for each Si be Ci. Ci depends on the complexity of Si and the coverage of its entities by T. Thus, the initial cost of integrating n datasets is ∑_{i=1}^{n} Ci. (Modifications to T incur additional costs.)
Data warehousing:
• There are dedicated resources and experts to generate the mappings {Mi} for the known sources {Si} in the business environment.
• These mappings can be generated manually, or using automated schema matchers. Thus, Ci is "controlled".
Research environment:
• The Ci in research is somewhat larger than its counterpart in business. Researchers lack the resources, tools and expertise to generate the mappings efficiently and thus often do so entirely manually.
• Also, generating mappings is extremely time consuming, and this limits the complexity of data sources that researchers can integrate.
Conclusion: The cost of generating mappings is much greater for researchers working individually and in a distributed fashion to test independent hypotheses than in the business environment, where teams of data scientists are shared across the enterprise.

3. Recurring cost of mapping {Si} due to changing hypotheses (every new Si needs a new mapping Mi)
Data warehousing:
• When a new data source Sx is added, a mapping Mx is generated with cost Cx and the data is integrated into T. Since T remains constant, this cost is incurred incrementally.
• The new data sources {Si} + Sx can now be queried through T, which is reused. Hence the cost of mapping Cx is effectively incurred only once.
Research environment:
• The dynamic nature of research requires researchers to work with a constantly changing set of data sources {Si} for every new hypothesis / query in {Q}. Because a fixed target schema T is not feasible in research (as shown in 1), a previous mapping is often useless.
• Thus, the full cost of integration ∑_{i=1}^{n} Ci is incurred every time a new {Si} is obtained (effectively for each new hypothesis).
• This, compounded with the generally higher Ci in research (as in 2 above), "cripples" integration in research.
Conclusion: Points 1 (T is not feasible) and 2 (the cost Ci is higher) compound in research environments, causing the high cost of generating the mappings {Mi} to be incurred repeatedly for every new research hypothesis. This is the reason for the high cost of data preprocessing.

3.3 Paradigm Shift to Crowdsourcing for Research Environments

3.3.1 Why is Data Warehousing not suitable for Research Environments?

There is no doubt that the "in advance" data integration of warehousing is singularly responsible for the severe reduction in query cost. This enables the implementation of several automated / real-time data preprocessing and analytics applications, such as OLAP and OLTP, hence reducing the overall preprocessing time and cost. However, we argue that applying these techniques to the distributed research process is impractical due to the process and environment differences. The discussion in Section 3.1 above can be distilled into the following key factors identifying the need for the paradigm shift:

– The "in advance" versus "on demand" process of research;

– A predetermined universal schema is impossible to achieve due to its potential complexity; and

– Researchers need autonomy of process even though they want to be able to store data for effective reuse and sharing.

These factors are highly inter-related. The main reason why traditional warehousing is not applicable to research is the fundamental difference between the "in advance" and "on demand" approaches to data integration. In data warehousing, typically in business environments, teams of data scientists and analysts coordinate their efforts, resulting in a well-formed and focused query set. It would be impossible to coordinate queries among researchers at the same scale as is done in business environments. Research is far too distributed and autonomous in nature, and funding for sustained efforts is uncertain. Also, research produces independent hypotheses and multiple, possibly unrelated, queries. As a result, as explained in the previous section, data integration has to necessarily follow the query formulation step, since the queries are dictated by open-ended hypotheses. This adds to the difficulty of designing a universal schema.

A universal schema for all research data would be ideal, but impossible to achieve in reality. In the enterprise / business setting, a fixed target schema (T) is possible to achieve due to 1) a finite known number of data sources {Si} and, consequently, 2) a well-formed and coordinated query set. The volatility and diversity of the research datasets and the need for "on-demand" integration contribute to the difficulty of designing a universal schema T. A schema that is generalizable to any data, i.e. one that can encompass all potential entity types, would be too complex.

In addition, the tremendous costs associated with designing, updating, maintaining and accessing such a complex schema could not be sustained in typical research environments. The autonomous nature of research queries would involve numerous, unknown data sources and types that would demand constant maintenance and updating of the schema. The dynamic nature of research would ultimately require a schema for each individual researcher for each hypothesis / research cycle. Thus, the cost of deciding on a schema and generating mappings would be incurred for each research cycle, culminating in a huge total cost.

3.3.2 Defining the Hard Problem

We know that the cost efficiency of data integration in traditional data warehousing is primarily due to mappings to a central schema generated "in advance", as explained above. Thus, the hard problem is achieving mechanisms that can emulate, for "on-demand" data integration, the same functionality that mappings provide in data warehousing, WITHOUT a predetermined central schema (T). The ultimate goal of this shift is to minimize the cost of data preprocessing and, consequently, to reduce the cost, basically in terms of the time and computational expertise required, of the research query process described above.

3.3.3 The Computational Requirements

To establish that such a fundamental shift is feasible, we next address the two key costs in the research process, i.e. the cost of exploration and the cost of "on demand" integration. In order to do this, the following requirements must be met:

• Researchers need computational tools for data exploration to discover the appropriate data sources in {Si} and identify the {Sq} needed to answer a specific query. These selected sources may not be centrally stored or connected. However, there should be mechanisms for them to be curated and shared through a crowd-sourced data repository.

• New mechanisms to perform on-demand integration of {Sq} are critically required to offset the repeated high cost of manually integrating data in the current research process. The mechanisms should provide functionality similar to mappings by persisting the specific schema details of the data. Such mechanisms should enable real-time execution of the two basic column-level operations that are a core requirement for data integration (Lenzerini, 2002), namely match and ad hoc join (a minimal sketch of both follows below).

– The "join" operation is when data is actually fused together into a single data table using a specific column or set of columns.

– The "match" operation, which technically is used for the exploration performed before the join, finds all possible columns to "join" across the given data sources.
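The sketch referenced in the list above is a minimal illustration of these two operations under one assumption of this thesis: that every crowd-sourced dataset carries column-level annotations anchoring its columns to shared ontology terms. The datasets, column names and ontology URIs are illustrative only, and the code is not the prototype platform itself.

```python
# Minimal sketch of the two column-level operations, assuming each crowd-sourced dataset
# carries annotations that anchor its columns to globally recognized ontology terms.
# The datasets, column names and ontology URIs below are illustrative only.
import pandas as pd

genes = pd.DataFrame({"gene_symbol": ["BRCA1", "TP53"], "expression": [2.4, 1.1]})
drugs = pd.DataFrame({"target": ["TP53", "EGFR"], "drug": ["drugX", "drugY"]})

# Annotations: dataset column -> ontology class (no shared target schema T is assumed).
annotations = {
    "genes": {"gene_symbol": "http://purl.obolibrary.org/obo/SO_0000704"},   # "gene"
    "drugs": {"target":      "http://purl.obolibrary.org/obo/SO_0000704"},   # "gene"
}

def match(a_name: str, b_name: str) -> list[tuple[str, str]]:
    """'Match': all column pairs whose annotations point to the same ontology term."""
    return [(ca, cb)
            for ca, ta in annotations[a_name].items()
            for cb, tb in annotations[b_name].items()
            if ta == tb]

def join(a: pd.DataFrame, b: pd.DataFrame, pair: tuple[str, str]) -> pd.DataFrame:
    """'Join': ad hoc fusion of two tables on a matched column pair."""
    ca, cb = pair
    return a.merge(b, left_on=ca, right_on=cb, how="inner")

pairs = match("genes", "drugs")          # -> [("gene_symbol", "target")]
print(join(genes, drugs, pairs[0]))      # one fused row, for TP53
```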

In conclusion, these core features will allow autonomous researchers to start using data more effectively and will promote both personal and shared reuse without reliance on centralized schemas. This shift will also overcome the disadvantages of data stored in personal storage or in data repositories within the local lab environment. To establish the feasibility of the paradigm shift, we show in Chapter 4 that there is a scalable solution that can eventually encompass a large and diverse number of data sources and still reliably achieve the traditional data warehousing functionality.

Chapter 4

PROPOSED DATA CROWDSOURCING PARADIGM

Summary. The proposed data crowdsourcing solution to the "hard problem" defined in the previous chapter is detailed in this chapter. First, the related previous research is discussed. The proposed crowdsourcing approach leverages the lessons learned from this previous work. Next, the principles of crowdsourcing, as they relate to the architectural components of crowdsourcing, are identified. The "semantic match" operation is defined. Operations for "on demand" data integration using the semantic match are illustrated. The chapter also provides the rationale for the crowdsourcing approach and explains how it can resolve the requirements of the research process and the "hard problem".

4.1 Related Work

The researchers' requirement for new automated data preprocessing tools and systems is a well-established research problem. Owing to this, there have been parallel research efforts focusing on different aspects of the problem. This section discusses these complementary research directions. These research topics provide crucial components and should be leveraged to tackle the hard problem of "on demand" data integration identified in the previous chapter. The proposed data crowdsourcing approach leverages the advantages of this research. A critical aspect for the success of the approach lies in the effective interaction between the various components.

4.1.1 Emergence of Smarter Data Repositories

The first major research direction was towards establishing shared data repositories. Chapter 2, Section 2.3.1 identified that a lot of research data on the web is collected in data repositories that can be queried to find relevant datasets. Currently, these tools are extremely valuable resources for research datasets, especially in the social, biomedical and economic research domains.

Specifically, the focus here is on the interdisciplinary research repositories. Notable interdisciplinary research repositories include the Harvard DataVerse (King, 2007) and the Registry of Research Data Repositories (re3data) (Pampel et. al., 2013), which provide facilities for long-term archival of and access to a large number of crowd-sourced research datasets. Similarly, the data.gov and census.gov collections contain a variety of publicly available datasets from various government agencies. Other important collections of datasets are the DataHub, based on CKAN (Winn, 2013), and the NuCivic DKAN data catalog platform in Europe.

In general, these repositories provide only dataset search capabilities. The Integrated Rule-Oriented Data System (iRODS) (Hedges et. al. 2007) as well as the recent DataBridge effort (Rajasekar et. al. 2013) provide value-added services, such as recommending datasets based on usage by other researchers, collaborators, etc., or analyzing the usage patterns of the datasets within the above-mentioned repositories.

Some domain-specific data repositories, e.g. American FactFinder in census.gov, provide limited preprocessing capabilities, such as pivot, column selection / filtering, etc., for individual tables. Some extremely domain-specific tools might also provide OLAP-like capabilities. However, such tools have only been implemented for very limited domains, such as gene expression analysis (Zeeberg et. al., 2003).

While such "smart" repositories expedite dataset identification and finding the relevant datasets, the task of integrating across these datasets using the appropriate columns for the researcher's exact needs is still unaddressed. Also, as the size of the repositories increases, it is getting harder to find the optimal dataset for the researcher's exact needs (van der Putten et. al., 2002).

Such repositories provide a starting point for further investigation into end-user tools for researchers. Most rely on standard metadata representations that use semantics limited to the dataset level. For example, the DCAT vocabulary (Ding et. al., 2010; Maali et. al., 2014) provides a rich semantic representation standard for metadata of public datasets from data.gov. Some have their own internal schema, e.g. the re3data schema (Pampel et. al., 2013; Vierkant et. al., 2013), CKAN (Winn, 2013) and DKAN18. These standards leverage standard classes and properties from upper-level ontologies such as Dublin Core (Weibel et. al., 1998), the Simple Knowledge Organization System – SKOS (Miles & Bechhofer, 2009) and Friend of a Friend – FoaF (Brickley & Miller, 2012). Others are based on standardized principles for data cataloging, such as "Project Open Data"19.

The research herein provides the means to leverage such repositories within the data fusion platforms described above. We extend the standard metadata representations at the dataset level inherent within most repositories, specifically extending the DCAT vocabulary and the Dublin Core "Dataset" class20. Such extensions provide the interoperability features that are as yet nonexistent across heterogeneous datasets.

18 http://docs.getdkan.com/dkan-documentation/dkan-users-guide/project-open-data
19 https://project-open-data.cio.gov/v1.1/schema/
20 http://dublincore.org/documents/2000/07/11/dcmi-type-vocabulary/#dataset
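As an illustration of the kind of extension just described, the sketch below uses rdflib to describe a dataset with standard DCAT / Dublin Core terms and then attaches a column-level annotation. The dfo: namespace and its property names (hasColumn, annotatedWith) are hypothetical placeholders introduced only for this sketch; they are not the published DFO vocabulary.

```python
# Minimal sketch: dataset-level DCAT / Dublin Core metadata plus a hypothetical
# column-level annotation. The "dfo:" namespace and its properties are placeholders.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, DCTERMS

DCAT = Namespace("http://www.w3.org/ns/dcat#")
DFO = Namespace("http://example.org/dfo#")        # hypothetical namespace for illustration

g = Graph()
g.bind("dcat", DCAT); g.bind("dct", DCTERMS); g.bind("dfo", DFO)

ds = URIRef("http://example.org/dataset/clinical-trial-42")   # illustrative dataset URI
col = URIRef("http://example.org/dataset/clinical-trial-42#gene_symbol")

# Standard dataset-level metadata (what repositories already publish).
g.add((ds, RDF.type, DCAT.Dataset))
g.add((ds, DCTERMS.title, Literal("Example clinical trial dataset")))

# Column-level extension: anchor one column to a globally recognized ontology class.
g.add((ds, DFO.hasColumn, col))
g.add((col, DFO.annotatedWith, URIRef("http://purl.obolibrary.org/obo/SO_0000704")))

print(g.serialize(format="turtle"))
```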

4.1.2 Emergence of end-user driven data platforms

Recent research recognizes the need for a shift from traditional transaction-oriented DBMSs to more end-user driven database systems (Jagadish et. al., 2007; Bonifati et. al, 2008). The end-users in this case are data scientists and analysts. Our research contributes to this trend by enabling automated data preprocessing functionality for data analytics applications.

Commercial large-scale data integration platforms, such as Informatica, Infosphere, etc., come bundled with data pre-processing and visualization capabilities. Other data pre-processing specific platforms, such as Paxata and Trifacta, are also gaining popularity. These may be ideal for data analytics in industry, justifying the huge investment costs, but they are usually beyond limited research budgets. Moreover, they usually assume a centralized schema and hence are not suitable for research environments for the reasons stated in the previous section. As a result, researchers and data scientists in the academic environment are 'left out'.

Franklin et. al. (2005) envisioned the requirements and challenges in implementing "DataSpace Support Platforms" at a crowd-sourced web scale. The underlying principles of such platforms are:

• Seamless integration with data on the web

• Emphasis on ease of use for end-users

• Provide incentives for sharing data

• Facilitate collaboration at web scale

In this research, such systems are called "data fusion platforms". Google Fusion Tables (Gonzalez et. al., 2010) serves as an example of this emerging class of systems, providing a solid platform for individual dataset management and sharing. However, currently its usability is directed towards personal use or a limited group of users; it is not yet a truly public, web-scale, crowd-sourced data sharing platform. The following major drawbacks are identified in this regard:

1. Lack of sophisticated mechanisms to publicly explore the data and find relevant datasets efficiently after upload.

2. Manual and costly data integration operations across heterogeneous datasets.

Our research addresses the deeper automation challenges needed to overcome the above drawbacks and implement data fusion platforms that deal with web-scale interoperability metadata across existing public, heterogeneous data repositories. We provide a solid representational framework for a novel crowdsourcing approach in the form of the DFO, which enables global sharing and reuse of interoperability meta-data across a large number of datasets.

4.1.3 Crowd-sourced Schema Matching

The "in advance" data integration operation has a huge up-front cost of generating mappings from the datasets to the shared schema. In order to minimize this cost, several automated schema-matching techniques have been studied over the last decades (Batini et. al., 1986; Rahm & Bernstein, 2001; Gal, 2006). Yet, such techniques are either time-consuming and cannot be performed in real time, or show only moderate accuracy and hence cannot be completely and reliably automated (Bernstein et. al., 2011). Thus, they are not suitable for "on demand" integration, where this cost is incurred in every research cycle.

A newer approach is crowdsourcing the schema matching and entity resolution process (Dong et. al., 2007; Wang et. al., 2012). This follows a "pay-as-you-go" model for integration, where only the required datasets are integrated at a time (Das Sarma et. al. 2008). The mappings created are preserved and the schema is appended as more datasets are added. The cost is amortized over the group of users involved across iterations. This cost can be reduced further by enabling faster semi-automated processes through bootstrapping with automated schema-matching techniques to provide probabilistic schema mappings (Nottelmann & Straccia, 2007). However, the assumption again is that the same schema remains shared.

The computational solution presented here borrows the crowdsourcing idea but without a centralized schema. Instead, reusable, shared annotations emulating the functionality of mappings are crowd-sourced from the users.

4.1.4 Role of Ontologies in Data Integration

The semantic web was created to provide "a common framework that allows data to be shared and reused across application, enterprise, and community boundaries" (Berners-Lee et. al., 2001), and ontologies are the means by which it does so. A significant body of research explores the use of ontologies in data integration tasks, especially their role in bridging heterogeneous relational datasets across the semantic web (Spanos et. al., 2012).

One approach is to completely convert relational data values into graph-based ontological representations, as in the D2R server (Bizer, 2006) or Triplify (Auer et. al., 2009). This is achieved by mapping the relational data to an RDF representation (Sahoo et. al., 2009) and profiling the relational data schema (Abedjan et. al., 2015).

Over the last decade, there has been a great deal of research dedicated to the use of ontologies for schema mapping and matching, including ontology matching (Noy, 2004). Though several platforms have been developed to this end (Michel et. al., 2014), the R2RML (Das et. al., 2012) and RDB2RDF (Halpin & Herman, 2011) frameworks can be considered representative of these efforts. An RDF representation of databases (Bizer & Seaborne, 2004) is highly advantageous as it enables complex graph-dependent querying for path analysis, neighborhood exploration, etc. However, significant experience with RDF query tools and languages, along with intimate knowledge of the underlying ontology structure, would be required for effective queries to be formulated. Neither of these requirements can be assumed in the research environments of our study.

Another, and possibly better, approach is a semantic middleware layer between the relational data and the ontological knowledge representation. Here the query is translated at the ontology level and processed at the data level. In a way, the ontologies serve the same purpose as mappings in traditional warehousing. In early work, researchers (Volz et. al., 2009) enhanced XML dataset schemas with additional annotations. Meanwhile, in (Hassenzadeh et. al., 2009) a language to accommodate such annotations within relational structures themselves is presented. The MASTRO platform (Calvanese et. al., 2011) is also an example of such an approach. The M4 approach is the most closely related to the solution proposed here. In recent years, extensions specifically to the R2RML framework do provide such a middle-layer capability between relational databases and the end users (Dimou et. al., 2013; Mihindukulasooriya et. al., 2014).

Note that in both of these approaches, in most cases, the presence of a predetermined schema to which the data can be mapped is again assumed. In the first case, the ontologies are used to represent the schema and the underlying data itself. In the second case, the ontologies are essentially used to represent the mappings between the data and a schema. However, the important lesson from using ontologies as middleware is the decoupling of the data, the mappings and the schema itself: all three can be expressed in different representational formats. A key reason for maintaining the actual data in a relational (tabular) format is that such a format is a prerequisite for the various methods and tools that will ingest this data in the next step of data analysis.

4.1.5 Leveraging the Linked Open Data cloud

Ever since its initial conceptualization, there has been tremendous growth in the scope of the Linked Open Data (LOD) cloud (Bizer et. al., 2008; Ermilov et. al., 2013). In general, Linked Open Data has been critical in several applications (Hausenblas, 2009), providing bridging data and, more importantly, a global representation and understanding of data entities. Within the dense network of interconnected ontologies and vocabularies in the LOD, dbpedia has been the central nucleus of the web of data (Bizer et. al., 2009a). For example, links between DataHub and dbpedia make browsing at the dataset level easier. The classes and relationships from the open ontologies form a "kind of SQL for the Web" (Cagle, 2014).

The next challenge is how to transform and leverage the LOD for advanced, large-scale and global data fusion applications (Bizer et al., 2009b). However, an important consideration while using the LOD is that it necessitates the use of semantic web technologies in some capacity. These technologies are not directly compatible with the relational / tabular data formats common in research, or analytics in general, and can cause friction when implementing end-to-end systems. The middleware application of ontologies, discussed in the previous subsection, has an important role to play in overcoming this difficulty.

The research here presents an explicit answer to this by leveraging the classes and properties of the ontologies within LOD as a set of globally recognizable entities and relationships.

4.2 Data Crowdsourcing Principles

The thesis first defines the basic principles associated with the components of the new crowdsourcing paradigm. These principles are identified as essential to establishing that the crowdsourcing of both data sources and reusable annotations for "on demand" integration is feasible.

Based on these principles, the proposed crowdsourcing approach adopts a middle-layer ontological architecture placed over a shared, crowd-sourced data repository. The goal is to leverage the power of the domain and global ontologies with automated inference capability. The data crowdsourcing architecture model and the computational "triple" for the solution are illustrated in Figure 9.

[Figure 9 content, summarized: a three-layer architecture consisting of a reference ontology layer drawn from the Linked Open Data (LOD) cloud and the Open Bio Ontologies (OBO), a layer of reusable semantic annotations A1 ... An, and a layer S1 ... Sn of research data sets from (distributed) sources.]

{Si}: Set of crowd-sourced data from individual researchers within a shared data repository.

{Ai}: Set of annotations connecting entities in individual data schemas {Si} to "target" globally recognized classes from {O}.

{O}: Universal set of ontologies with globally recognized entities and relationships serving as a "target schema".

{Si} → {Ai} is a one-to-many relationship because the researcher is free to use any O ∈ {O} and produce multiple annotations for the same data. The only requirement is that each annotation Ai is complete and consistent for its respective source Si.

Figure 9. The architectural components for the proposed crowdsourcing paradigm framework

These features can then increase the global understandability of data while minimizing the cost and effort required to implement a platform solution that can deliver the required functionality for integration. The subsequent chapters discuss how such a platform can minimize the exposure of the technological aspects to the end user, viz. the researcher.

4.2.1 Crowd-sourced data sources {Si}

Definition 1. {Si} is a set of data sources registered in a shared data repository by researchers wishing to participate, i.e., to crowd-source their data autonomously.

Since these data sources in the repository are crowd-sourced, they do not necessarily adhere to a shared schema nor do they have any metadata that recognizes related entities across them at the time of upload. Note that the data can be stored in a distributed manner.

4.2.2 Ontologies {O} as universally recognized “schema”

Definition 2. {O} is a set of universally recognizable reference ontologies, such as those from the Linked Open Data (LOD) cloud or the OBO Foundry (Smith et al., 2007).

The set of ontologies, {O}, provides a comprehensive set of globally recognizable classes and relationships. This approach leverages these as entities in the universal schema and treats {O} as a universal target schema.

4.2.3 Reusable crowd-sourced semantic annotations {Ai}

Definition 3. A is an annotation of S such that ALL entities (columns / tables) in S are anchored to respective ontological classes in O and all relations in the schema S are mapped to specific properties in O.

Thus, if data source Sx has a set of columns {Colx} and annotations Ax, then
∀ Coli ∈ {Colx}, ∃ at least one annotation ∈ Ax indicating
Coli → "represents" → x, where x is some class ∈ {O}.

Note that the annotations are not restricted to a specific ontology. Thus, multiple researchers working autonomously can provide different annotations for the same data. This does not affect the data integration functionality (more on this in Section 4.4). The researcher uploading a data source knows the exact and highly granular details about the columns and tables contained within it. Thus, the annotations are provided by the researcher while uploading / registering the new data to the shared repository. Note that these annotations are persistent and can be modified subsequently as required. Semantic annotations {Ai} also help preserve knowledge for reuse, similar to mappings in warehousing. The predetermined enterprise-specific schema T is replaced by the universal {O}.

4.2.4 Semantic Match Operation (↔)

Let Col1 and Col2 be columns in data sources S1 and S2, with annotations A1 and A2 respectively.

Definition 4. A "semantic match" between two columns Col1 and Col2 occurs when an annotation of Col1 in A1 and an annotation of Col2 in A2 represent the same ontological class in {O} (i.e. the same URI in the reference ontology set).

Thus, ∃ a semantic match, denoted by ↔, between Col1 and Col2 iff ∃ the following annotations ∈ {Ai}, the global annotation set:
{Col1 → "represents" → x} AND {Col2 → "represents" → x}, where x is some class ∈ {O}.

Such a semantic match relies on explicit "ground truth" annotations to standard ontological classes, already provided by the researchers at large, and not on implicit features like column titles or contents. In other words, the annotations are leveraged to find "matches" across data sources by simply finding columns representing the same semantic entity in the graph, as shown in Figure 10. Thus, the semantic match operation between two columns defined above has trivial cost, with O(1) complexity, and can be performed in real time.

Figure 10. The semantic match operation, using annotations to a reference ontology class: both columns are annotated as representing mesh:Hospital.

The semantic match in this case does not need to rely on column names, their contents, or the structure of the schema; instead it utilizes the "ground truth" annotations. In this example, the "MedCenter No." and the "Hospital ID" columns represent the same semantic entity, namely "mesh:Hospital". The Medical Subject Headings (MeSH) terminology, a standard vocabulary commonly used in the biomedical domain, is used here for annotation and provides a common "schema" for defining entities and relations.
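To make the O(1) claim concrete, the following minimal Python sketch illustrates one possible realization of Definition 4. The dictionary-based storage of annotations and the example column and class names are assumptions made here for illustration only; they are not part of the DFO specification.

# Hypothetical in-memory store of crowd-sourced annotations: each
# (dataset, column) identifier maps to the URI of the ontology class
# that the column "represents". All names below are illustrative.
annotations = {
    ("Hospital_Info", "MedCenter No."): "mesh:Hospital",
    ("Allergy_Data", "Hospital ID"):    "mesh:Hospital",
    ("Allergy_Data", "Allergy"):        "mesh:Hypersensitivity",
}

def semantic_match(col1, col2, annotations):
    """Definition 4: Col1 <-> Col2 iff both are annotated with the same class URI.
    A dictionary lookup makes this an O(1) check."""
    cls1 = annotations.get(col1)
    return cls1 is not None and cls1 == annotations.get(col2)

# Example: the two hospital columns match.
assert semantic_match(("Hospital_Info", "MedCenter No."),
                      ("Allergy_Data", "Hospital ID"), annotations)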

The next discussion illustrates how the combined principles provide the critical mechanisms required to perform the "on demand" data integration.

4.3 "On Demand" Data Integration Operations using Semantic Match

This section demonstrates how the semantic match can be leveraged for the operations critical to the "on demand" data integration functionality, addressing the hard problem (Chapter 3, Section 3.3.2) and its computational requirements (Section 3.3.3), while reflecting the tasks researchers perform for such data integration.

The computational complexity of the operations builds on the fact established above: the basic semantic match operation has complexity O(1).

4.3.1 Extended Semantic Match (⇔)

Operation 1. Find a potential semantic match between two given columns Col1 and Col2 when a direct semantic match (↔) does not exist.

The classes in the linked ontology cloud representing the same entity are already associated with each other through the "owl:sameAs" relationship, which is used to identify classes that semantically represent the same entity in the ontology cloud (Bizer et al., 2008). (However, recent research indicates misuse of the owl:sameAs property and resulting semantic discrepancies in linked data; such a match therefore needs to be performed carefully (Ding et al., 2010).)

The above semantic match can thus be easily extended to leverage this relationship. Such an extended match gives researchers the flexibility to select any ontological annotation for their data without worrying about how it will be integrated. The extended semantic match is defined as follows:

Definition 5. An "extended semantic match" between two columns Col1 and Col2 occurs when the annotation of Col1 in A1 represents X1, the annotation of Col2 in A2 represents X2, AND X1 and X2 have a transitive "owl:sameAs" relation in {O}.

Thus, ∃ an extended semantic match, denoted by ⇔, between Col1 and Col2 iff ∃ the following annotations ∈ {Ai}, the global annotation set:
{Col1 → "represents" → x1} AND {Col2 → "represents" → x2}, where x1 and x2 are classes ∈ {O},
AND ∃ x1 ↔ "owl:sameAs" ↔ x2 ∈ {O}.

Since determining whether an "owl:sameAs" relationship exists between two given ontology classes requires O(1) time, the complexity of the extended match is dominated by finding the classes X1 and X2 in {O}. Thus, it has complexity O(V), where V is the number of classes in {O}. Note that the semantic match (↔) operates at the level of the annotations {Ai} only, whereas the extended match (⇔) operates on the set of reference ontologies {O} itself. Also, the extended match can replace the semantic match in the operations below, albeit at a higher computational cost.
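As a companion to the definition, here is a hedged Python sketch of the extended match. It assumes the owl:sameAs links extracted from {O} are available as a list of URI pairs (the pair shown is illustrative only) and takes the same kind of annotations dictionary as the previous sketch; a union-find structure supplies the transitive closure.

# Illustrative owl:sameAs links harvested from the reference ontologies {O}.
same_as_pairs = [("dbpedia:Hospital", "mesh:Hospital")]

parent = {}

def find(x):
    # Union-find "find" with path compression over class URIs.
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(x, y):
    parent[find(x)] = find(y)

for a, b in same_as_pairs:
    union(a, b)

def extended_semantic_match(col1, col2, annotations):
    """Definition 5: match iff the two annotated classes fall in the same
    transitive owl:sameAs component of {O}."""
    x1, x2 = annotations.get(col1), annotations.get(col2)
    return x1 is not None and x2 is not None and find(x1) == find(x2)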

4.3.2 Data Exploration with Semantic Match

Operation 2. Discover additional data sources, from a pool of annotated data sources, that can be joined with a particular data source on a particular column.

Complexity: O(C), where C is the total number of columns across all data sources in the repository.

This demonstrates how the semantic match is leveraged to discover all possible "matches" with columns from other data sources representing the same entity (i.e. the same class in the reference ontology) in the data crowdsourcing paradigm. The operation can be implemented as below. Importantly, it has complexity O(C), where C is the total number of columns of all data sources in the repository. If the extended match is used instead, the complexity is O(CV), where V is the number of classes in {O}.

Let {Si} and {Ai} be the sets of all data sources and all their annotations respectively.
Let Sa be a data source with annotations Aa and columns {Col}a.
Let Colx ∈ {Col}a such that ∃ annotation {Colx → "represents" → x} ∈ Aa, where x is some class ∈ {O}. Then,
∀ {Col}i of ALL {Si} do {
    FIND Colm ∈ {Col}i such that
        {Col}i is the set of columns of Si with annotations Ai AND
        ∃ annotation {Colm → "represents" → x} ∈ Ai }

Thus, the operation finds semantic matches between Colx and the columns {Col}i of all data sets {Si} (except Sa itself).

Note that this is actually the data exploration operation. For example, let us say the researcher has drug data with one of the columns representing the disease entity that the drug is supposed to treat. Now the researcher wants to find additional data sources with additional knowledge about the diseases in the context of, say, genes. In the traditional research process, this would require manual exploration of the various relevant data sources, followed by their interpretation to check if they can be joined with the given data (again manually). The above discussion illustrates how executing such an exploratory operation over heterogeneous data sources is now possible in an automated manner in the crowd-sourced paradigm with autonomously annotated datasets, as sketched below.
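A minimal Python sketch of Operation 2, under the same illustrative assumptions as before (a global dictionary of column annotations keyed by (dataset, column)), is shown below; it scans every annotated column once, in line with the O(C) complexity stated above. The example dataset and column names in the comment are hypothetical.

def discover_joinable_sources(target_col, annotations):
    """Operation 2: return every (dataset, column) in the repository that is
    annotated with the same class as target_col, excluding the source dataset."""
    target_class = annotations.get(target_col)
    if target_class is None:
        return []
    return [col for col, cls in annotations.items()
            if cls == target_class and col[0] != target_col[0]]

# e.g. discover_joinable_sources(("DrugData", "Disease"), annotations) would
# surface every crowd-sourced column annotated with the same disease class.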

4.3.3 Operation to find matches between given data sources

Operation 3. Find all possible matches between two given data sources S1 and S2.

Complexity: O(mn), where m and n are the numbers of columns in S1 and S2 respectively.

This is a typical match operation. The researcher has two data sources that need to be integrated. Matches may exist between columns across these two data sources but are not explicit. Normally this would be done manually or through some schema matching approach, either way requiring parsing and understanding of the actual data.

A semantic match, on the other hand, can now simply check if the columns across different data sources “represent” the same entity, indicated by the semantic annotations, to find matches automatically across the given data sources.

Let S1 and S2 be the data sources and A1 and A2 their respective annotations,
and let {Col1} and {Col2} be the columns of S1 and S2. Then,
∀ Colx ∈ {Col1} do {
    FIND ALL semantic matches (↔) [or extended semantic matches (⇔)]
        Colx ↔ Coly, ∀ Coly ∈ {Col2} }

The computational complexity of such an operation is O(mn), where m and n are the numbers of columns in S1 and S2 respectively.
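The following short sketch, using the same illustrative annotation dictionary as the earlier examples, spells out the O(mn) pairwise scan of Operation 3.

def find_matches(cols1, cols2, annotations):
    """Operation 3: all column pairs, one from each source, that semantically match."""
    return [(c1, c2)
            for c1 in cols1
            for c2 in cols2
            if annotations.get(c1) is not None
            and annotations.get(c1) == annotations.get(c2)]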

4.3.4 Complex match operation

Operation 4. Find potential matches through an intermediate data source when no direct matches are found between two given data sources S1 and S2.

Complexity: O((m+n)·C), where m and n are the numbers of columns in S1 and S2 respectively and C is the total number of columns of all data sources in the repository.

This requires expanding on the match by combining operations 2 and 3. Figure 11 illustrates such a match operation. Essentially, operation 3 is performed between every data source in the repository and each of S1 and S2. Thus, computationally such an operation takes O((m+n)·C), where m and n are the numbers of columns in S1 and S2 respectively and C is the total number of columns of all data sources in the repository. Note that this operation can get quite complex with the extended match.

Thus, the operation can be conducted as follows,

Let S1 and S2 be the data sources and A1 and A2 their respective annotations,
and let {Col1} and {Col2} be the columns of S1 and S2. Then,
∀ Colx ∈ {Col1} AND Coly ∈ {Col2} do {
    FIND Sx such that, for the semantic match (↔),
        Colx ↔ Cola AND
        Coly ↔ Colb AND
        Cola and Colb ∈ Sx }
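A hedged sketch of Operation 4, reusing the find_matches helper from the previous sketch, is given below. The repository dictionary mapping a dataset name to its list of column identifiers is an assumed, illustrative structure, not part of the thesis specification.

def find_intermediate_sources(cols1, cols2, repository, annotations):
    """Operation 4: when no direct match exists between S1 and S2, look for an
    intermediate source Sx matching a column of S1 and a column of S2."""
    candidates = []
    for name, cols_x in repository.items():
        left = find_matches(cols1, cols_x, annotations)    # S1 <-> Sx
        right = find_matches(cols2, cols_x, annotations)   # S2 <-> Sx
        if left and right:
            candidates.append((name, left, right))         # a two-step join S1 - Sx - S2
    return candidates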

Figure 11. Complex match operation through an intermediate dataset. No explicit match exists between the two given data sources, but each has a column that semantically matches a column of an intermediate dataset (annotated as representing mesh:Hospital and geo:ZipCode respectively).

4.4 Rationale of Data Crowdsourcing

Having established the linear-cost arguments above, based on the semantic and extended semantic match, we next argue how the proposed crowdsourcing approach accommodates the other requirements and work patterns of the research environment, which are:

• Reduce the cost of exploration and "on demand" integration in the quantitative data analysis process for researchers.

• Enable real-time semantic match and ad hoc join capability without a predetermined schema.

• Preserve the researchers' autonomy.

4.4.1 Graph of interconnected data sources

Extensive connections among classes of different ontologies are also already defined in {O} itself. Thus, once generated, the annotations {Ai}, their ontological references to {O}, and the ontologies themselves together form a network of interconnected data sources in a "graph" structure without the need for a predetermined schema. Over time, as the set {Ai} grows through crowdsourcing, the graph becomes richer. This graph of interconnected datasets is critical for the success of the "on demand" integration operations described in the previous section.

The amortized cost of generating the annotations, which is arguably incurred only once per dataset, decreases over time. Thus, interoperability increases over time without a centralized schema.

4.4.2 Acceptable cost of querying

Finally, it is important to assess how the semantic match operations described above effectively reduce the query processing cost. This is illustrated in Figure 12. The query is still formulated against all data sources {Si} to obtain the required set of data {Sq} for a research hypothesis. Note that operation 2 described above can be used to expedite such data exploration.

Once {Sq} is determined, the semantic match empowers us to perform the "match", as described in operations 3 and 4, at affordable cost across heterogeneous data sources for "on demand" data integration. These operations are executable in real time. Because of this, the recurring cost incurred by researchers in the manual "on demand" integration of the current process can be replaced by the semantic match performed (semi-)automatically in polynomial time.

Furthermore, this is achieved without a predetermined schema and mappings as in traditional data warehousing. The burden of integration is borne by the annotations instead. As discussed above, the integration operations rely on reusable crowd-sourced annotations to {O}, a universal "schema". The only underlying assumption is that the data sources have been previously "annotated".

[Figure 12 content, summarized: researchers work autonomously, each publishing data plus annotations (S1 with A1, S2 with A2, ..., Sn with An) to the shared repository at a one-time cost, while the annotations {Ai} leverage {O} for "on demand" integration as well as exploration over {Si} to obtain {Sq} for data analysis. The cost factors summarized in the figure are:

• Cost of generating annotations "in advance": comparable to the cost of generating mappings in traditional warehousing; incurred only once.

• Cost of data exploration: exploration in research is more expansive as queries are more open-ended; such exploration is user-directed to obtain the subset of data required to answer the query; it is repeated but semi-automated by leveraging the pre-existing annotations, so the cost is reduced.

• Cost of "on demand" integration: the integration is also user-directed and ad hoc; the actual "match" and "join" operations are automated; the cost is repeated but severely reduced as the critical operations are automated.]

Figure 12. Cost of query using the proposed crowdsourcing paradigm

4.4.3 Ad hoc integration functionality

Because the cost of the semantic match is affordable, multiple match operations can now be completed in real time to find multiple and complex potential joins. An example complex join that uses an intermediate data source is shown in Figure 11. These candidate joins can then be compared using different metrics (such as data quality, retention, or loss) to provide alternatives to the researcher for ad hoc "joins". This operation is the essence of "on demand" integration.

4.4.4 Preserving researchers' autonomy

The researchers can add annotations to the data independently. Ideally, these should be added when uploading the data to a shared repository. However, there is no restriction preventing researchers from adding or updating these annotations later at their convenience. Also, multiple annotations can exist for the same dataset, unlike mappings, which are strictly unique. In the case of multiple annotations, the complexity of the semantic match becomes O(n) instead of O(1), where n is the number of annotations for the same column. This increases the overall complexity by a factor of n but does not affect the feasibility of the integration operations. On the other hand, it preserves the autonomy of the researchers, allowing them to work independently without worrying about others' annotations while publishing their data.

Chapter 5 DATA FUSION ONTOLOGY DESIGN

Summary. The thesis proposes the Data Fusion Ontology (DFO) as the representational framework required to implement the crowdsourcing solution defined in the previous chapter. This chapter describes the methodology used for designing the Data Fusion Ontology.

The previous chapter defined the crowdsourcing computational framework that would provide the interoperability across heterogeneous research datasets and enable the “on demand” integration operations desired by the researchers within an end-to-end data fusion platform. The realization of this framework in the form of an implemented application would require a representational framework that provides the necessary architectural features supporting the stated functionality.

The thesis proposes the Data Fusion Ontology (DFO) that serves as this representational framework. This chapter provides the methodology followed for engineering the DFO.

5.1 Design Methodology

The ontology design methodology adopted here combines the enterprise model approach with elements from the TOVE methodology, as recognized in the survey on ontology design methodologies by Jones et al. (1998). The ontology development also followed the principles and criteria laid out by Gruber (1995), which are generally recognized as best practices in the semantic web community for ontology design. The overall steps of the design methodology are:

1. Identify the purpose of the ontology with the help of motivating scenarios.

2. Formally define the scope of the ontology.

3. Formal terminology and conceptual specification.

4. Evaluation of the ontology through theoretical and applied validation.

Note that although there are several established methodologies for designing domain-specific ontologies (Noy & McGuinness, 2001), there has been no consensus within the research community on a methodology for designing application ontologies (as in the case of the DFO). As a result, each ontology-based application follows its own methodology, as evidenced in (Corcho et al., 2003).

The methodology steps proposed here, combining basic application and ontology development principles and best practices, are specifically designed to provide a generalized template for the engineering of application ontologies. Figure 13 illustrates the role of each methodology step and its outcome in relation to the development of the DFO.

[Figure 13 content, summarized: the purpose of the DFO, derived from researchers' data preprocessing scenarios, is to support real-time "match" and "join" operations for cost-effective "on demand" integration. This defines the scope of the DFO: provide the representational framework for the conceptual crowdsourcing solution, i.e. mechanisms for crowd-sourced annotations {Ai} that bridge (1) entities and relationships within individual datasets from a dataset collection {Si} to (2) globally recognized classes and properties from reference ontologies, so that the annotations provide interoperability meta-data across heterogeneous datasets without a predetermined schema. The scope is implemented by the DFO specification (DFO classes and properties, extending dcat:Dataset and the E-R model) and validated by the DFO evaluation: a conceptual evaluation checking the completeness and consistency of the ontology, and an applied evaluation showing through example use cases how the DFO primitives automate "on demand" data integration at the desired computational cost.]

Figure 13. Steps in designing the Data Fusion Ontology.

5.1.1 Purpose of DFO

The researchers we interviewed identified their expectations of, and the specific functionality required from, a hypothetical data fusion platform for leveraging crowd-sourced datasets. These were condensed into specific motivating scenarios / use cases around the tasks of data discovery, integration and interpretation, illustrated below:

1. Explore relevant datasets {Si} and view meta-information:

a. Find datasets of topical interest {Sq} using existing data repositories like data.gov, NCBI Entrez datasets, etc.

b. Assess provenance data (publishers, version, etc.) as well as quality information (e.g. number of data rows, missing value analysis, etc.).

2. Real-time data interpretation:

a. Automatically find matches between datasets {Sq} by resolving semantic ambiguity (e.g. "Identity" and "Name" mean the same).

b. Automatically find intermediate datasets required to integrate two given datasets with no direct match between them.

3. Ad hoc integration support within {Sq}:

a. Compare multiple "matches", establishing relevance and inclusion in the set {Sq} in real time, using heuristics such as data quality / retention / loss.

b. Perform ad hoc, real-time joins across datasets in {Sq} and deliver them in a relational (tabular) form for use in analysis tools like SPSS, Stata, etc.

4. Visualize semantics across datasets:

a. An intuitive visual / graphical user environment that cuts across datasets, is non-programmer oriented, and is capable of delivering the above functionalities.

The purpose of the Data Fusion Ontology (DFO) is to achieve the above preprocessing functionality with mechanisms to preserve the interoperability meta-data across datasets. The DFO should enable the "on demand" data integration operations of "match" and "join", and achieve this without the need for an "in-advance" centralized schema {T}. This meta-information should provide the semantics within and across datasets, explicitly down to the column level.

5.1.2 Scope of DFO

The scope of the DFO is to represent interoperability meta-data that enables the algorithmic realization of the crowdsourcing solution above, and in particular the core operations for data pre-processing:

1. Allow the complete representation of any relational dataset schema from the source datasets {Si}.

2. Declarative semantic representation of annotations {Ai} from datasets in {Si} to the reference ontologies {O}, in order to be able to implement the semantic match operations.

3. The representation should be computationally readable (meaning it should provide mechanisms for the development of automated functionality and direct interaction with other applications) and provide the architecture to handle on-demand operations, i.e. match and join, in real time.

4. The coverage of classes, entities, relationships, etc. in the reference ontology used for annotations is beyond the scope of the DFO.

5. The usability and capability of any data fusion platform developed based on the DFO is also beyond the scope of the DFO itself.

Note that points 4 and 5 define the limits of the DFO scope.

The challenge here is that the framework should be able to translate between the semantic operations and technology stack required by the ontologies (OWL/RDF stores, SPARQL, etc.) and the traditional relational data representation and query requirements (Entity-Relationship, SQL, etc.). For example, the semantic match (particularly the extended semantic match) and the operations defined above are based on ontological or graph inferencing and could be implemented in SPARQL, for instance. However, the actual "join" operation itself requires the execution of an appropriate SQL JOIN on the underlying data.
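As an illustration of that translation, the short Python sketch below shows how a semantic match found at the annotation level could be turned into an ordinary SQL JOIN over the underlying tables. The table and column names, and the shape of the generated SQL, are assumptions made for illustration rather than the platform's actual implementation.

def build_join_sql(table1, col1, table2, col2):
    """Turn a column-level semantic match into an executable relational query."""
    return (f'SELECT * FROM "{table1}" '
            f'JOIN "{table2}" ON "{table1}"."{col1}" = "{table2}"."{col2}"')

# build_join_sql("Hospital_Info", "MedCtr", "Allergy_Data", "Hospital ID") returns
# (wrapped here for readability):
#   SELECT * FROM "Hospital_Info" JOIN "Allergy_Data"
#   ON "Hospital_Info"."MedCtr" = "Allergy_Data"."Hospital ID"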

5.2 Design Rationale

The specification and validation of the DFO are presented in the subsequent chapters. But first, let us look at how the DFO incorporates each of the requirements of the crowdsourcing solution and its rationale from the previous chapter. The rationale behind the DFO is discussed here in the context of its scope, as identified above.

Figure 14. Positioning the DFO

5.2.1 Bridging crowd-sourced datasets using semantic interoperability meta-data

The positioning of the ontology across datasets, shown in Figure 14, is critical for its role in providing the interoperability meta-data across datasets. The figure complements the conceptual architecture proposed in the previous chapter (Figure 9). As described in the crowdsourcing solution, the DFO enables mapping of the underlying schema of a particular dataset in {Si} to classes and associations from the reference ontologies {O} using crowd-sourced annotations {Ai}. This allows for semantic annotations that elevate the metadata of individual datasets by relating the entire local relational schema to classes and properties from global reference ontologies.

Moreover, due to this architecture, the DFO remains independent of the underlying proprietary schema of a dataset and of the reference ontology used, as defined by its scope.

5.2.2 Multi-granular Metadata Representation

Metadata has always been used to catalogue data in a more efficient manner. It is vital to even begin to understand and integrate data from heterogeneous sources. Metadata has evolved with the way we represent data itself, from early library card systems to the very popular relational database systems and, more recently, object-oriented structures. It has now become imperative that metadata be machine-understandable in order to gain efficient access to data. This plays a very important role in the usability and reusability of data. Metadata repositories are traditionally designed using the Entity-Relationship model or as object-oriented models. Ontology-based metadata management systems were introduced to help standardize metadata and enable sharing and analysis of data.

NISO defines three fundamental types of metadata, viz. structural, descriptive and administrative (Guenther & Radebaugh, 2004). Administrative metadata provides information about technical details such as file type, size, etc. Structural metadata defines the design and specification of the data structure containing the data. Descriptive metadata provides additional information about the data content itself. The targeted government datasets include structural and administrative metadata but lack descriptive metadata.

Table 6. Types of metadata captured by DFO

The table lists, for each data view and metadata type, the metadata granularity and the data fusion functionality provided:

• Physical View, Administrative metadata. Granularity: dataset level (from DCAT, including publisher, version, source, etc.). Functionality: data exploration (access rights, provenance, quality).

• Physical View, Structural metadata. Granularity: table and column structure (number of data rows, data types, etc.). Functionality: ad hoc integration tasks (matches / joins at the column level).

• Logical View, Descriptive metadata. Granularity: entities and relationships across datasets. Functionality: automated interpretation tasks based on the achieved semantic coherence.

• Conceptual View, Semantic metadata. Granularity: annotations extending to reference ontologies. Functionality: automated interpretation tasks based on the achieved semantic coherence.

DFO supports multiple views of the datasets modeled in a singular ontological (i.e. RDF/OWL) format. This allows the proposed DFO to architecturally leverage widely used schema models such as the ER model. It also extends pre-existing vocabularies and standards like DCAT and Linked Open Data. Thus, the DFO provides a powerful framework capable of supporting automated data fusion platforms without 'reinventing the wheel'. Each view of this metadata is important to enable specific scenarios / use cases of the data fusion platform, as identified in Table 6 above.

5.2.3 Lightweight Solution

Let us consider this in the context of the schema-matching problem. Using the DFO architecture, open ontologies and controlled vocabularies standardize the entities and relationships across datasets. For example, the meaning of the entities 'City' or 'Medical Symptom' in heterogeneous datasets will be established by 'dbpedia:city' or 'mesh:symptom' respectively.

The DFO should provide the mechanisms to represent the semantic annotations as described in the crowdsourcing solution, enabling the semantic match and "on demand" operations described in the previous chapter (Section 4.3). Thus, the schema-matching problem is reduced to finding annotations pointing to the right ontological classes or properties in the graph. Such a graph-based query can be performed in real time. The DFO supports reusable ontological annotations and representation at the time of dataset publication (or the first time the dataset is used). This means that the cost of annotation is incurred only once, versus the recurring cost of schema mapping incurred for every new integration task. This is arguably the most important contribution of the DFO.

Chapter 6 DATA FUSION ONTOLOGY SPECIFICATION

Summary. This chapter contains the Data Fusion Ontology specifications. The ontology classes and their object properties and data properties are defined. A fusion graph notation for the ontology is provided. The annotation of an example relational (tabular) dataset using the DFO is discussed in detail. The evaluation of the DFO from both a conceptual (theoretical) and an applied perspective is provided.

The design and functional requirements, as well as the scope, of the Data Fusion Ontology (DFO) were identified in the previous chapter. The DFO specifications and semantic restrictions are presented here.

The DFO is then evaluated both theoretically and through validation from an application perspective. The theoretical evaluation tests the conceptual completeness and consistency aspects of the proposed ontology based on its scope definition. The applied validation demonstrates how the DFO components and properties can be leveraged to algorithmically implement the researcher scenarios / use cases.

6.1 Fusion Graph Notation

The DFO components are defined first. These mechanisms represent (capture) the semantic annotations of heterogeneous dataset schema entities as well as relationships and references to standard ontological concepts.

The DFO is achieved with extensions based on the Entity-Relationship (ER) model for representation of relational (tabular) data. The ER model is selected as a starting point because it is a well-established model proven to be complete with respect to set manipulations, and it has the advantage of making joins declarative and graphical (Chen, 1976). The DFO builds on these advantages and starts with the ER components "entity", "attribute", and "association". To these the DFO adds specific components for interlinking data sources and external ontologies: "column type", "other relationships", "class annotation", "property annotation", "ontology URI", and "column-to-column relation".

Figure 15 specifies the class components of the proposed DFO as a "fusion graph notation". Such a notation provides a declarative and intuitive understanding of how the DFO is used to automate the data pre-processing functions. In addition, it helps us justify the DFO explicitly as an extension of the Entity-Relationship model, which is necessary for the completeness requirement of the DFO discussed in Section 6.4.3.

[Figure 15 content, summarized: E-R model components and their DFO representations are dataset → dfo:Dataset; entity → dfo:Table; attribute → dfo:Column; association → dfo:JoinTable; E-R relationship → dfo:belongsTo. New DFO components and their representations are ontology class annotation → dfo:represents; ontology property annotation → dfo:semanticRelation; ontology URI → N/A; column type → N/A; column relationship → dfo:ColumnRelation with dfo:source and dfo:destination; other DFO relationships → examples: dfo:unit, dfo:isColumnType, etc.]

Figure 15. Data Fusion Ontology components represented using a "fusion graph notation". The left side shows the components of the E-R model aligned with their respective DFO representations. The right side shows the new DFO extensions that capture the highly granular column-level meta-data and annotations to the reference ontologies.

The semantic annotations of heterogeneous dataset schema entities, as well as their relationships and references to standard ontological concepts, provided by the DFO can be visualized as a Data Fusion Graph (DFG). With the DFO, every dataset table and column is mapped to entities, attributes, and associations that, in turn, are mapped to classes and properties from reference ontologies as part of a single coherent graph, as envisioned in the data crowdsourcing paradigm rationale (Chapter 4, Section 4.4).

6.2 OWL Representation of DFO

The OWL representation that captures the above graphical notation is specified next. This representation allowed us to use the DFO for the actual implementation of the data preprocessing tasks. We used the Protégé tool for ontology development (Knublauch et al., 2004) in the OWL/RDF format. Figure 16 shows the overall DFO schema in OWL. Key elements of this specification are as follows.

[Figure 16 content, summarized: the DFO schema in OWL, with dfo:DataSet (a subclass of dcat:Dataset carrying DCAT / Dublin Core administrative metadata), dfo:Table, dfo:JoinTable, dfo:Column, dfo:ColumnRelation, and the dfo:ColumnTypePartition and dfo:TemporalTypePartition classes, together with their data and object properties. Namespace prefixes used: dfo (Data Fusion Ontology), dcat (http://www.w3.org/ns/dcat#), dct (http://purl.org/dc/terms/), foaf (http://xmlns.com/foaf/0.1/), skos (http://www.w3.org/2004/02/skos/core#).]

Figure 16. Data Fusion Ontology as an OWL/RDF representation

6.2.1 Definitions

The classes in the ontology, seen in Figure 16, along with the data and object properties are defined here.

• Classes

Table 7 defines the DFO classes. They capture and store the required relational metadata consisting of tables, join tables, columns, and column relations, in addition to the dataset specifics. Note that since we wish to make the DFO declarative, the join tables and column relations are explicitly identified when ingesting a dataset or tables. These map to the ER model introduced earlier.

Table 7. Data Fusion Ontology class definitions

dfo:DataSet: A collection of information, structured or unstructured (for example, a table or tables, lists, etc.).
dfo:Table: A single data table within a dataset, consisting of rows and columns.
dfo:JoinTable: A table whose columns represent different entities and whose sole purpose is to represent relationships amongst these entities in the dataset.
dfo:Column: A single column within a table.
dfo:ColumnRelation: Represents an object property between two columns of a dfo:JoinTable.

Table 8. DFO Data Properties

dfo:represents (Literal): URI of the ontological class from the reference ontology. Class restrictions: dfo:Table; dfo:Column of type "Class".
dfo:semanticRelation (Literal): URI of the ontological relationship between the two entities being represented. Class restrictions: dfo:Column; dfo:ColumnRelationship.
dfo:unit (Literal): URI of the unit being represented, from the Quantities, Units, Dimensions and Types (QUDT) ontology. Class restrictions: dfo:Column of type "Measure" only.
dfo:dimension (Literal): URI of the dimension of the specified unit. Class restrictions: dfo:Column of type "Measure" only.
dfo:temporalFormat (Literal): String representing the format of the time stamp, based on International Standard ISO 8601. Class restrictions: dfo:Column of type "Temporal" only.
dfo:temporalGranularity (Literal): URI from the Time Ontology of the lowest unit of time. Class restrictions: dfo:Column of type "Temporal" only.

• Data properties

Table 8 lists the data properties of DFO classes that are used to capture the different references to standard ontologies. Definitions for properties inherited from upper-level ontologies like Dublin Core, FOAF and SKOS (shown in Figure 16) can be obtained from the respective ontology specifications. The DFO-specific data properties, along with their semantic restrictions, are defined in Table 8.

Table 9. DFO Object Properties

dfo:belongsTo: From a database point of view, indicates that the domain entity logically belongs to the range entity. Domain: dfo:Table, dfo:Column. Range: dfo:DataSet (iff the domain is a dfo:Table); dfo:Table (iff the domain is a dfo:Column).
dfo:source: Source (or domain) of a column relationship. Domain: dfo:ColumnRelationship. Range: dfo:Column.
dfo:destination: Destination (or range) of a column relationship. Domain: dfo:ColumnRelationship. Range: dfo:Column.
dfo:columnType: Defines the type of the column (viz. "Class", "Property", "Measure" or "Temporal"). Domain: dfo:Column. Range: dfo:ColumnType. NOTE: Columns of a dfo:JoinTable can only be of type "Class".
dfo:temporalType: Defines the type of temporal quantity represented by the column values (viz. "TimeStamp" or "Interval"). Domain: dfo:Column of type "Temporal" only. Range: the temporal type partition individuals "TimeStamp" or "Interval".

• Object properties

Table 9 defines the object properties of the different associations between the classes in Table 7.

6.2.2 Design notes:

1. The "dfo:Dataset" is a subclass of the DCAT "dcat:Dataset" and the starting point for the DFO. This allows the DFO to easily extend data.gov and other such open datasets. All the administrative metadata of the DFO at the "Dataset" level is thus borrowed directly from DCAT (Maali et al., 2014).

2. The DFO also inherits other data properties (shown in Figure 16) from upper-level ontologies, for example the "title" and "description" properties from Dublin Core (Weibel & Koch, 2000), as well as dataset-level properties from SKOS (Miles & Bechhofer, 2009) and FOAF (Brickley & Miller, 2012).

3. The dfo:ColumnTypePartition and the dfo:TemporalTypePartition are meant solely as type partition classes, meaning that instances of these class types cannot be created. They are used to add specific semantic restrictions on dfo:Column in the OWL representation.

6.3 Annotating a tabular dataset using DFO

To complete the above specification, a walkthrough annotating two simple sample datasets, shown in Table 10 (a) and (b), is presented next. The "Hospital_Info" dataset has hospital information and the second dataset, "Allergy Data", has the patient admittance data. The annotation process has three major steps, each capturing a different type of metadata: dataset metadata, mapping to {O}, and the types of columns. Additional semantics are also needed to capture association relations, discussed in Section 6.3.4.

Table 10. Example datasets for annotation. These are the datasets that we use for the annotation walkthrough. For simplicity of understanding, each dataset has a single table. The “Hospital Info” table (a) represents an entity by itself, whereas the “Allergy_Data” (b) holds associations between its columns. E.g. number of “Patients” suffering from an “Allergy” showing “Symptom” and treated at “MedCtr”.

(a) Hospital Info. Columns: Hospital | Address | City | Number of Beds. Rows: XYZ | … | New York | 1200; ABC | | Columbus | 500; PQR | | Chicago | 900.

(b) Allergy Data. Columns: MedCtr | Allergy | Symptom | Num of Patients. Rows: XYZ | … | Rash | 400; ABC | Asthma | Wheezing | 150.


Figure 17. Annotations of structural meta-data (physical view) using "dfo:belongsTo": (a) annotation using the fusion graph notation; (b) the same annotations in Protégé. The first step of annotation is capturing the structure of the relational datasets. Maintaining this structure is critical for the execution of relational queries on the data.

6.3.1 Step 1: Capturing Structural Metadata

The structural metadata is captured using the DFO classes Dataset, Table and Column. The relationship "dfo:belongsTo" has a domain of the Column or Table class and a range of Table or Dataset respectively. The structural metadata can be obtained from the dataset schema automatically. It is interesting to note the distinction between a dfo:Table and a dfo:JoinTable: while a dfo:Table represents an entity on its own, the dfo:JoinTable is equivalent to an association relationship in a comparable ER model. The semantics of JoinTables are discussed in Section 6.3.4. The resulting ontology representation is shown in Figure 17.
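A minimal sketch of this first step using the Python rdflib library is shown below. The DFO namespace URI and the instance URIs are placeholders invented for illustration, since the thesis does not fix them here.

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import DCTERMS, RDF

DFO = Namespace("http://example.org/dfo#")   # placeholder namespace URI for the DFO
EX = Namespace("http://example.org/data/")   # placeholder URIs for the annotated instances

g = Graph()
g.bind("dfo", DFO)

# Dataset -> Table -> Column structure, linked with dfo:belongsTo.
g.add((EX.HospitalInfoDataset, RDF.type, DFO.DataSet))
g.add((EX.HospitalTable, RDF.type, DFO.Table))
g.add((EX.HospitalTable, DCTERMS.title, Literal("Hospital_Info")))
g.add((EX.HospitalTable, DFO.belongsTo, EX.HospitalInfoDataset))

for col_name in ["Hospital", "Address", "City", "Number of Beds"]:
    col = EX["HospitalTable/" + col_name.replace(" ", "_")]
    g.add((col, RDF.type, DFO.Column))
    g.add((col, DCTERMS.title, Literal(col_name)))
    g.add((col, DFO.belongsTo, EX.HospitalTable))

print(g.serialize(format="turtle"))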

6.3.2 Step 2: Capturing Annotations to {O}:

In this step the data publisher recognizes each dfo:Table individual as representing a standard class in some reference ontology or controlled vocabulary in {O}, which depends on the domain and scope of the datasets. The property "dfo:represents" is used to annotate each table with the appropriate class from the reference ontology that it represents. Note the annotation for the table "Hospital" in Figure 18. Here the table itself represents the class "Hospital" defined in the Medical Subject Headings (MeSH) (Lipscomb, 2000) reference taxonomy.

6.3.3 Step 3: Capturing Relationships and Types of Columns

Every column of the table is annotated with classes from the reference ontology as well. The same property, "dfo:represents", defines the column-level annotations. An example of class annotations for the same example as above is shown in Figure 18. The semantic relationships between the dfo:Table (in this case Hospital) and its columns are captured using the property "dfo:semanticRelation", whose domain is the Column class and whose value can be the URI of any property in the reference ontology. For example, the column ADDR has a dfo:semanticRelation of "dbpedia:location", indicating that ADDR contains the dbpedia:location of the Hospital.

Figure 18. Hospital Table with Column level Annotations. The figure shows the typical column-level annotations that the DFO captures for a “dfo:Table”. These annotations will be leveraged for the automated data integration functionality.
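Continuing the rdflib sketch from Step 1 (the graph g and the namespaces DFO and EX), the hedged fragment below illustrates Steps 2 and 3 for the Hospital table. Per Table 8, the annotation properties are modeled here as literal-valued URIs; the column-type individuals (dfo:ClassType, dfo:PropertyType) and the MeSH and DBpedia namespace URIs are illustrative assumptions rather than values fixed by the thesis.

MESH = Namespace("http://id.nlm.nih.gov/mesh/")      # illustrative MeSH namespace
DBO = Namespace("http://dbpedia.org/ontology/")      # illustrative DBpedia namespace

# Step 2: the table itself represents mesh:Hospital.
g.add((EX.HospitalTable, DFO.represents, Literal(str(MESH.Hospital))))

# Step 3: column-level annotations.
city = EX["HospitalTable/City"]
g.add((city, DFO.isColumnType, DFO.ClassType))                       # a "Class" column
g.add((city, DFO.represents, Literal(str(DBO.City))))
g.add((city, DFO.semanticRelation, Literal(str(DBO.locationCity))))

hosp = EX["HospitalTable/Hospital"]
g.add((hosp, DFO.isColumnType, DFO.PropertyType))                    # a "Property" column
g.add((hosp, DFO.semanticRelation, Literal(str(DCTERMS.identifier))))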

It is not possible to capture, and more importantly to apply, semantic restrictions if all columns are treated in the same way. For example, the ID column in Hospital is a property of "Hospital", not representing a class on its own. On the other hand, the "CITY" column represents another class or entity itself. This requires that the properties for columns change with respect to the column type. The different types of columns in the DFO are identified by "dfo:columnType". The column types and their associated rules are discussed below.

Class

─ The column represents another class from the reference ontology. A class type column must have the property "dfo:represents" to point to the appropriate class entity from the reference ontology.

─ In addition, it should also have the "dfo:semanticRelation" property indicating the relationship between the table entity and the column entity. Thus, the value of the "dfo:semanticRelation" can be an object property.

─ In a strict ontological annotation schema implementation of the DFO, the values of "semanticRelation" would necessarily have to be object properties restricted to a set having the domain of the class represented by the table entity and the range of the class represented by the column entity.

─ In Figure 18, the "City" column is of type class and represents the class "dbpedia:City", with the semantic relation "dbpedia:locationCity". Thus, the annotations form the triple mesh:Hospital → dbpedia:locationCity → dbpedia:City.

Property

─ The column is a data property of the class represented by the table it belongs to. Since these are expected to be data attributes, they do not require the "dfo:represents" property. The HOSP_ID column, for example, is the data property "dct:identifier" of the table entity Hospital.

─ The same property, namely "dfo:semanticRelation", is used to represent this information.

─ Again, note that for a strict ontological annotation the value should be strictly restricted to the data properties of the class represented by the table entity.

Measure

─ A column of measure type is a special case of the class type. A measure can be any numerical data representation. In Figure 18, Num_Beds is a measure type column.

─ In addition to all properties of the class type, a column of measure type must have the property "dfo:quantity", which indicates the quantity measured. The quantities and units are defined by the "Quantities, Units, Dimensions and Types" (QUDT) ontology. QUDT covers a wide range of measures and units from different systems of measurement. It even allows generic and dimensionless measures like "Count", as is the case in our example.

─ If the quantity requires a specific unit, the property "dfo:unit" represents what unit is used by that particular quantity. For example, if the quantity is "Speed" then the unit may be "mph" or "kmph". This semantic can be used for other kinds of quantities as well; for example, if the quantity is "Monetary Value" then the unit can be "Dollars" or "Pounds." In our example, Count is "qudt:Dimensionless".

Temporal

─ A column of temporal type represents a temporal entity. It should, therefore, have the property "dfo:temporalType", which indicates whether the values represent timestamps or intervals.

─ It is necessary to capture the granularity and format of the time stamp. The granularity indicates the least division of time in the time stamp or interval.

─ The Time Ontology can be used here to capture these semantics. It can also be used to capture information about time zones and daylight saving.

On observing common research datasets, it was found that almost all have measurement data for some quantity and most have some temporal component associated with them. Thus, the measure and temporal type columns are treated separately, and they were found to require additional semantics to enable the needed interoperability. The other special type we considered was geographical data, such as latitude, longitude, city, address, etc. However, we argue that the class type column is sufficient for capturing the required semantics, as is evident for the "City" column in our example.

6.3.4 Semantics for Join Table

Associations in the Entity-Relationship model capture relationships between different entities. In a traditional relational dataset, this information is stored in separate tables with columns that contain foreign keys. However, these tables themselves cannot represent any class or entity.

We represent these associations using a subtype of the "Table" class, called "dfo:JoinTable". Thus, an instance of the JoinTable class does not represent an entity itself, but rather relationships between different entities. For example, the Allergy_Data table gives information about relations between the entities Allergies, Symptoms, Patients and Hospitals. The JoinTable is used to show relationships between these entities, as can be seen in Figure 19, which also illustrates the use of Protégé to manage the representations and mappings.


Figure 19. Capturing the semantics of associations using "dfo:JoinTable". An association is actually a table itself in the underlying data; the figure shows how the dfo:JoinTable is used to capture the semantics of such a table, and how the "dfo:ColumnRelationship" captures the relationships between the entities involved in the association. Figure 19(a) is an illustration using the fusion graph notation; Figure 19(b) is a snapshot of the same table annotated in the Protégé tool for ontology development.

In the case of a "JoinTable", its columns can only be of type "Class" since they necessarily represent entities themselves. The relationship between the two classes is represented using an instance of the class "dfo:ColumnRelationship". This instance has the property "dfo:semanticRelation", whose value is the relation between the classes represented by the two columns. The directionality of the relationship is captured using the properties "dfo:source" and "dfo:destination", each pointing to a distinct column. The columns Person_ID and Hospital_Name represent "mesh:Patient" and "mesh:Hospital" respectively. The columns of a JoinTable should be "connected", meaning every column is part of at least one relationship.
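Continuing the same rdflib sketch, the fragment below shows how the Allergy_Data association could be captured as a dfo:JoinTable with a column relation between its Person_ID and Hospital_Name columns. The relation node URI and the "ex:treatedAt" property value used for dfo:semanticRelation are hypothetical placeholders.

allergy = EX.AllergyDataTable
g.add((allergy, RDF.type, DFO.JoinTable))

person = EX["AllergyDataTable/Person_ID"]
hospital = EX["AllergyDataTable/Hospital_Name"]
for col, cls in [(person, MESH.Patient), (hospital, MESH.Hospital)]:
    g.add((col, RDF.type, DFO.Column))
    g.add((col, DFO.belongsTo, allergy))
    g.add((col, DFO.isColumnType, DFO.ClassType))       # JoinTable columns are "Class" only
    g.add((col, DFO.represents, Literal(str(cls))))

# The association itself: Person_ID --(treated at)--> Hospital_Name.
rel = EX["AllergyDataTable/relation_1"]                 # hypothetical relation node
g.add((rel, RDF.type, DFO.ColumnRelation))
g.add((rel, DFO.source, person))
g.add((rel, DFO.destination, hospital))
g.add((rel, DFO.semanticRelation, Literal("ex:treatedAt")))   # hypothetical property URI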

6.4 Evaluation of DFO

With the annotation process of the data fusion platform discussed above, the evaluation of the DFO is discussed next. It needs to address the following important questions, based on the scope of the DFO:

Conceptual / Theoretical

1. Does the DFO specification provide the required translation between the semantic layer of reference ontologies and the underlying relational data?

2. Is the DFO capable of representing any relational (tabular) data completely and consistently?

Applied

3. What benefits does the DFO provide, with its descriptive and deep column-level annotations, to support data preprocessing use cases around data exploration, interpretation and integration (as recognized by the researchers in Chapter 5, Section 5.1.1)?

4. Can a data fusion platform that leverages the DFO annotations, and that helps the researcher deal better with real-world data preprocessing, actually be implemented?

6.4.1 OWL Representation and Feasibility

A necessary first step in the evaluation of the Data Fusion Ontology (DFO) is completely representing it in an ontological format (such as OWL or N3). This was done using the Protégé tool for ontology design and development. An OWL representation can be machine-interpretable. Since the DFO representation can be maintained in a triple store (such as an RDF store), a variety of incremental functions can be provided to support the needed functionality.

The ontological representation of the DFO can itself be seen as a test of feasibility. The challenge is the translation of the nuances and semantic restrictions as specified in the above sections to an ontological format. The evaluation must verify if this has been achieved successfully.

6.4.2 Evaluation Methodology

Once the ontology is in OWL format, a thorough examination of its faculties can be conducted. The evaluation of the Data Fusion Ontology (DFO) in this thesis is conducted as a combination of a conceptual evaluation and an implementation-based evaluation. It is organized as follows:

Conceptual Evaluation

The conceptual evaluation is a theoretical examination.

• Representational Completeness

Assess if DFO can represent any relational dataset.

• Semantic Consistency

o Consistency of the Ontology

o Consistency of the Annotations

Applied Evaluation

The applied evaluation is critical for the DFO, as it is an application ontology. The DFO will be tested on how it supports the different functions of the data fusion platform. The evaluation is based on the use cases for data preprocessing designed around the user-specified requirements identified for the DFO design. These operations are critical to the implementation of the data fusion platform. Ultimately, this is an implementation-based evaluation that involves the development of a prototype data fusion platform based on the DFO.

This chapter concentrates on the heuristic evaluation of the DFO and also discusses the computational application of the DFO for the data preprocessing use cases. The actual prototype implementation and practical demonstration of the data preprocessing operations are detailed in the next chapter.

6.4.3 Conceptual Completeness Evaluation

Experiment 1: Prove that the DFO is capable of representing ANY and ALL relational (tabular) data.

Rationale: The DFO should be able to fully represent any relational dataset it encounters in order to satisfy the completeness requirement. As stated earlier, the ER model itself has been argued to be complete with respect to set manipulations and has the advantage of making joins declarative and graphical (Chen, 1976). Thus, we need to show a mapping between the DFO and the ER model. Such a technique is commonly used in the evaluation of representational frameworks such as UML class diagrams and the IDEF data model (Berardi et al., 2005; Kim et al., 2003).

The DFO is said to meet the completeness requirement if all the notations in an ER model can be mapped to notations from the fusion graph illustrated in Figure 15. This is a necessary condition for the DFO to succeed.

Argument: We present the argument for this evaluation as follows. The DFO is achieved by extending the components of the current ER model. These extensions are expressed as the fusion graph components in Figure 15. Thus, the components of the fusion graph are a superset of the ER model notations. Based on this mapping, we can argue that the DFO model is completely capable of representing the semantics of any dataset.

Discussion: An important discussion around completeness is that of coverage, which deals with how entities and attributes are mapped to ontological classes and properties. It should be noted that the coverage of entities and relationships depends entirely on the reference ontologies. The assumption of the DFO is that the current set of ontologies from the LOD is capable of representing any entity and attribute from any dataset. Thus, the completeness evaluation of the DFO should include the mapping of relational database schemas to the DFO while excluding the mapping of the DFO to the ontologies and the LOD.

6.4.4 Consistency Evaluation

Rationale: The consistency evaluation tests whether the semantic restrictions enforced by the DFO are sufficient to protect against annotations of datasets that can cause inconsistencies. For example, the “dfo:Column” class has the following restrictions:

a) belongsTo only Table

b) belongsTo some Table

c) isColumnType some ColumnTypePartition

The consistency evaluation is done in two parts. The first is to check the Data Fusion Ontology (DFO) itself for inconsistencies. The second is to verify that the DFO only allows annotations that are semantically consistent.

Experiment 2: Check the OWL representation of the DFO with automated semantic reasoners to verify that it is semantically consistent.

Method: The Pellet and FaCT++ reasoners are used for this purpose; a scripted sketch of such a check follows the list of possible outcomes below.

The following were the possible outcomes:

a) The OWL representation is completely consistent and error free.

b) The OWL representation has inconsistencies, but these can be compensated for by modifying the OWL without jeopardizing the completeness requirement.

c) The OWL representation has inconsistencies that cannot be corrected, as they would lead to the failure of the completeness requirement. The DFO fails in this case.
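For reference, the same consistency check can also be scripted outside of Protégé. The following is a minimal sketch assuming the owlready2 Python library and a local copy of the DFO OWL file; neither the library nor the file path is part of the original evaluation workflow.

```python
from owlready2 import get_ontology, sync_reasoner_pellet, default_world

# Load the OWL serialization of the DFO (the file name is hypothetical).
onto = get_ontology("file://./dfo.owl").load()

# Run the Pellet reasoner; unsatisfiable classes become equivalent to owl:Nothing
# and are reported by inconsistent_classes().
with onto:
    sync_reasoner_pellet()

unsatisfiable = list(default_world.inconsistent_classes())
print("Unsatisfiable classes:", unsatisfiable if unsatisfiable else "none")
```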

Results: An initial run of the reasoners on the first version of the DFO in OWL format produced result (b). The inconsistencies were, specifically, semantic type errors caused by the representation of “dfo:ColumnTypePartition”. Appropriate changes were made to the OWL representation and are reflected in the latest specifications presented in Figure 16. (Earlier, a column was created as an instance of the class “dfo:Column” as well as of one of the subclasses of “dfo:ColumnTypePartition”. Now a column is associated with an individual of “dfo:ColumnTypePartition” through the property “dfo:ColumnType”.) The new version was found to be semantically consistent by the same automated reasoning tools.

Experiment 3: Verify that the DFO allows annotations for a representative dataset in a consistent manner.

Method: A collection of heterogeneous datasets from the biomedical and life sciences domain around drugs, cytogenetics and diseases is used for this experiment. Multiple independent annotators manually annotated a sufficiently complex dataset using Protégé. The resulting OWL representation (triples) contains both the DFO and the individuals produced by annotating the dataset.

Automated reasoners will again be used to check for inconsistencies. Note that a single dataset can have multiple distinct annotations by different users. However, these annotations should not be in conflict with each other.

The possible outcomes of the experiment are:

a) The resulting triples are semantically consistent.

b) The annotations produce inconsistencies and the fault is caused by annotator error. Re-run the experiment in this case.

c) The annotations produce inconsistencies and the fault is in the DFO design.

a. If the DFO can be modified such that the modification does not conflict with the completeness requirement or the result of Experiment 1 above, such modifications would be implemented and the experiment re-run.

b. If such modifications are not possible, the DFO fails.

6.5 Application-based Evaluation: Data Preprocessing Tasks using DFO Annotations

6.5.1 Method

This is a use-case based testing approach. We demonstrate by use cases how the DFO annotations can be leveraged for “on-demand” data integration functionality.

The interpretation and integration tasks are directly correlated to the “on demand” data integration operations defined in the crowdsourcing paradigm (Chapter 4, Section 4.3). The overall logic of the evaluation is: 1) the conceptual model of Section 3.2 ensures that the computational complexity of the operations is O(1); 2) the DFO implements the conceptual components of Section 3.2, also in O(1); it follows that 3) the semantic match implementations in the following use cases with the DFO are also of complexity O(1). Thus, we are effectively demonstrating here that the DFO directly supports the implementation of the crowdsourcing paradigm proposed in the thesis.

Also, the extended match can replace the semantic match in the use cases below, albeit at higher computational cost.

For the evaluation, the same collection of datasets used for the consistency evaluation will be annotated using the DFO specifications.

Note that the actual implementation of these use cases in a working platform is discussed in the next chapter.

6.5.2 Use Cases

Here we briefly discuss the different possible use cases based on the researchers’ requirements and data preprocessing tasks (described in Chapter 5, Section 5.1.1). These tasks are data discovery, interpretation, integration and visualization. The use cases reflect the critical requirements from researchers, which are 1) to provide tools for data exploration and discovery and 2) to perform “match” and ad hoc “join” functionality in real time.

Use Case 1: “Find relevant data sources and provide quality information”

Here the researcher wishes to search for datasets based on various meta-information that is not related to the actual contents of the datasets. An important thing to note is the provision by the DFO for recording the details of the data publisher, uploader, etc. as an extension of DCAT. This information can also be used to further filter the required data sources.

The meta-information stored at the data source level (such as semantic tags) is utilized to allow researchers to find relevant data sources. Here we leverage the metadata from DCAT to provide provenance information to the researchers.

Use Case 2: “Discover additional datasets from a pool of annotated datasets that can be joined with a particular dataset on a particular column.”

This is a data exploration operation. For example, the researcher has drug data with one of the columns representing the disease entity that the drug is supposed to treat. Now the researcher wants to find additional data sources with additional knowledge about the diseases in the context of, say, genes. The goal of the researcher is to automatically find datasets that can actually be fused with the current (selected / uploaded) dataset. From the data integration perspective this is an expansion-type operation.

In the traditional research process, this would require manual exploration of the various relevant data sources, followed by their interpretation to check if they can be joined with the given data (again manually).

Here, we demonstrate how executing such an exploratory operation over heterogeneous data sources in an automated manner is now possible using the DFO annotations. This use case can be executed using operation 2 described in the crowdsourcing paradigm (Chapter 4, Section 4.3.2). Here we can perform the semantic match to discover all possible “matches” with columns from other datasets representing the same entity, based on the “dfo:represents” property of the column (or the table). Columns from other datasets that have the same “dfo:represents” property value are the required targets.
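To make the operation concrete, the following is a minimal sketch of how such a semantic match could be expressed against a Neo4j store of DFO annotations using the official Python driver. The node labels, property keys (uri, represents, dataset) and connection details are illustrative assumptions; they are not the prototype's actual schema, which is queried from Java over JDBC.

```python
from neo4j import GraphDatabase

# Hypothetical connection details; the prototype queries Neo4j from Java over JDBC.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Find columns in other datasets annotated with the same "dfo:represents" URI
# as the selected column (e.g. a column representing "some:CytogeneticLocation").
MATCH_QUERY = """
MATCH (c:Column {uri: $column_uri})
MATCH (other:Column)
WHERE other.represents = c.represents AND other.dataset <> c.dataset
RETURN other.dataset AS dataset, other.name AS column
"""

def find_semantic_matches(column_uri):
    with driver.session() as session:
        result = session.run(MATCH_QUERY, column_uri=column_uri)
        return [(record["dataset"], record["column"]) for record in result]

if __name__ == "__main__":
    for dataset, column in find_semantic_matches("gene.cytogenetic_location"):
        print(dataset, column)
```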

Use Case 3: “Find all possible matches between given two datasets S1 and S2.”

This is a typical match operation. The researcher has two data sources that need to be integrated. The columns across the two datasets might represent the same entity, but this is not explicit. Normally this would be done manually or through some schema matching approach, either way requiring parsing and understanding the actual data.

A semantic match, on the other hand, can now simply query to check if the columns across different datasets are annotated to “represent” the same entity (i.e. the same URI in the reference ontology, say “dbpedia:City”), indicated by the semantic annotation “dfo:represents”. Thus, the DFO annotations are used to find matches automatically across the given datasets.

This use case can also be expanded for an aggregation or union type of data integration, where the goal is to match ALL the columns between two datasets (like traditional schema matching). In this case an application can check if the ALL match condition is met.

Use Case 4: “Suggest intermediate data sources when no direct matches exist between given datasets”

This is an extension of the typical match, based on operation 4. The researcher’s context is data regarding patients with certain allergy conditions that might be related to environmental factors like air pollution. Suppose the relevant air quality public dataset is found (Figure 20.b) and now this must be joined with a private dataset (Figure 20.a). Notice that the medical centers are located using identifiers, whereas the air quality data is aggregated at the city level. Thus, even though the researcher has found the right dataset, he/she would have to manually alter the dataset and translate the identifiers into the cities they are located in, which is a time-consuming process.

Figure 20. (a) Dataset owned by the researcher containing patient and allergy information. (b) Publicly available dataset that contains environmental data to be merged with (a).

Here we need to expand on the match by combining use cases 2 and 3. Figure 21 shows how, with the help of the new annotations, the platform should also be able to automatically address this problem. This query utilizes the column annotations to find the dataset for conversion between hospital and city, which thus acts as a mediator for a very time-consuming process for the human. With the contribution of column-level annotations the platform can find new “pathways” between the two datasets in question using the saved ontological structure.

The columns across the two datasets might be associated, but this is not explicit. The DFO identifies all possible ways by which the above datasets can be joined. It does so by executing a query to check if the columns across different datasets are annotated to “represent” the same class entity (having the same URI in the reference ontology, say “dbpedia:City”). These columns are used to find matches automatically across the datasets.
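As an illustration of the mediation step, the search for intermediate datasets can be phrased as a path query over the same annotation graph. The sketch below again relies on an assumed schema (the relationship types HAS_TABLE, HAS_COLUMN and REPRESENTS are hypothetical), not on the prototype's implementation.

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Find a chain of column-level matches connecting two datasets with no direct
# match; any dataset appearing on the path acts as a mediator (e.g. a hospital
# table that maps medical-center identifiers to cities).
PATH_QUERY = """
MATCH (a:Dataset {name: $source}), (b:Dataset {name: $target})
MATCH p = shortestPath((a)-[:HAS_TABLE|HAS_COLUMN|REPRESENTS*..8]-(b))
RETURN [n IN nodes(p) | coalesce(n.name, n.uri)] AS pathway
"""

def find_mediating_pathway(source, target):
    with driver.session() as session:
        record = session.run(PATH_QUERY, source=source, target=target).single()
        return record["pathway"] if record else None

if __name__ == "__main__":
    print(find_mediating_pathway("PatientAllergies", "AirQuality"))
```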

Figure 21. Integration using column-level annotations: The Hospital table is used to join the two example datasets using class annotations of columns.

Use Case 5: “Suggest the best possible join for me to integrate given datasets”

The above use cases 2, 3 and 4 all deal with finding the matches between heterogeneous datasets. Because the cost of semantic match is affordable, multiple match operations can now be completed in real-time to find multiple and complex potential joins.

In use cases 2 through 4, we saw that the query returns multiple possible matches, both direct and through intermediate datasets, between two different datasets. Provision for ad hoc integration should allow the user to compare the different possible matches and then perform the required join in real-time as well. This decision can depend on several factors, like data quality or how the join affects the data values. For example, datasets A and B can be joined on column X, or can be joined using P, an intermediate dataset with a different set of columns. The number of data rows (observations) retained after the join can be a simple yet good measure to compare which of these “matches” is more suitable for integration. Allowing such comparison is critical for the ad hoc integration requirement.
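The row-retention heuristic itself is straightforward to compute once the candidate matches are known. A minimal sketch follows, assuming the data is available as pandas DataFrames (an assumption; the prototype itself works only at the metadata layer):

```python
import pandas as pd

def rows_retained(left, right, left_on, right_on):
    """Number of rows surviving an inner join on the given columns."""
    joined = pd.merge(left, right, left_on=left_on, right_on=right_on, how="inner")
    return len(joined)

def rank_candidate_joins(left, right, candidates):
    """candidates: list of (left_column, right_column) pairs suggested by the
    semantic match. Returns them sorted by how many observations are retained."""
    scored = [(lc, rc, rows_retained(left, right, lc, rc)) for lc, rc in candidates]
    return sorted(scored, key=lambda item: item[2], reverse=True)

# Toy example: joining drug data to disease data on a matched column.
if __name__ == "__main__":
    drugs = pd.DataFrame({"drug": ["a", "b", "c"], "disease": ["flu", "asthma", "flu"]})
    diseases = pd.DataFrame({"disease": ["flu", "asthma"], "gene": ["G1", "G2"]})
    print(rank_candidate_joins(drugs, diseases, [("disease", "disease")]))
```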

Following this process the researcher discovers suitable datasets, determines that a join through a particular match is appropriate, and obtains an integrated dataset. Since the datasets were integrated semantically based on the ontology, which follows schema rules, the join is schematically correct.

Use Case 6: “Integrate two datasets with numeric data in different dimensions”

For this use case we exploit the restrictions and metadata of “Measure” type columns for data aggregation in a completely automated fashion. The DFO uses the QUDT ontology to store the specific unit and dimension of the column. Consider two data sources, representing the GDP of countries in N. America and Europe respectively. These two data sources need to be integrated, and we have found the necessary matches using use case 2. Now the column containing the GDP value is matched; however, the currency denominations are different (one in dollars and the other in euros). We do know the denominations, as this information is stored using the “dfo:dimension” property. This information can be used to automatically convert the data to a normal form and aggregate the data sources.
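A minimal sketch of this idea follows, with hypothetical unit annotations and a hard-coded illustrative exchange rate; the DFO stores only the unit and dimension metadata, so the conversion factors themselves would have to come from elsewhere.

```python
# Hypothetical unit annotations read from the "dfo:dimension" / unit metadata
# of two matched "Measure" columns.
column_units = {"gdp_na": "unit:USDollar", "gdp_eu": "unit:Euro"}

# Illustrative conversion factors to a chosen normal form (US dollars);
# the exchange rate is an assumption made purely for the example.
TO_USD = {"unit:USDollar": 1.0, "unit:Euro": 1.1}

def normalize(values, unit, target="unit:USDollar"):
    """Convert a list of numeric values from `unit` into the target unit."""
    factor = TO_USD[unit] / TO_USD[target]
    return [v * factor for v in values]

gdp_na = [21000, 1700]   # GDP values annotated as US dollars
gdp_eu = [3800, 2600]    # GDP values annotated as euros

aggregated = normalize(gdp_na, column_units["gdp_na"]) + \
             normalize(gdp_eu, column_units["gdp_eu"])
print(aggregated)
```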

Use Case 7: “Integrate two datasets with temporal data in different formats”

The same principle of the above use case can be applied to “Temporal” type columns. The DFO has provision to recognize two types of temporal data, viz. a time-stamp or a time range. The DFO requires both these column types to specify the “temporal granularity”, and the timestamp to additionally specify the “temporal format” using the International Standard ISO 8601 representation. These properties are used to facilitate the integration of columns containing temporal data.

Suppose the researcher has found two data sources containing time-sensitive patient information. Thus, both have a column that holds time-stamps. However, the format of the time-stamps for these is different, e.g. one has “mm-dd-yyyy” and the other has “dd-mm-yyyy”. The “temporal format”, defined in the DFO annotations, provides a construct for defining the format of a time-stamp value. This can be utilized to programmatically resolve the formats and perform integration, again with minimal programming effort on the part of the researcher.
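For instance, if the “temporal format” annotation is read as a strptime-style pattern (a simplification of the ISO 8601 machinery described above), normalizing the two columns becomes mechanical:

```python
from datetime import datetime

# Hypothetical "temporal format" annotations attached to the two timestamp columns.
formats = {"dataset_a.visit_date": "%m-%d-%Y", "dataset_b.visit_date": "%d-%m-%Y"}

def to_iso(value, column):
    """Parse a raw timestamp using its column's annotated format and emit ISO 8601."""
    return datetime.strptime(value, formats[column]).date().isoformat()

print(to_iso("03-15-2016", "dataset_a.visit_date"))  # -> 2016-03-15
print(to_iso("15-03-2016", "dataset_b.visit_date"))  # -> 2016-03-15
```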

Use Case 8: "Can the data in my dataset be visualized, say, on a map?"

The research context is visualizations to get better intuition about the data (Tufte & Graves-Morris, 1983). The declarative annotations at the geographical column level for city, latitude, longitude, etc. can be used to identify possible anchors for visualization. This is more of an application-specific use case as compared to the others.

These use cases and the corresponding discussion validate that the DFO can implement the data crowdsourcing operations as envisioned in Chapter 4. The ultimate test for the DFO would be the actual implementation of the use cases using the DFO as a representational framework. Such a system is the topic of the next chapter.

Chapter 7 PROTOTYPE DATA FUSION PLATFORM

Summary. A prototype data fusion platform is presented that enables users to perform automated interpretation and ad-hoc integration of multiple datasets in real-time. It is designed to cut down data pre-processing time significantly. The prototype platform design illustrates interactive features that are highly visual and intuitive and do not require programming expertise.

In this chapter, a proof-of-concept implementation of a data fusion platform is presented. The prototype is designed to demonstrate how the DFO can be leveraged for automated, end-to-end, real-time and ad hoc functionality across the data preprocessing tasks in a “real-world” application. The hypothesis is that such an automated “plug-and-play” solution would drastically reduce the preprocessing time, which, in turn, would greatly accelerate the pace of biomedical and healthcare research.

The platform leverages reusable, semantically coherent annotations across datasets using standard reference ontologies as defined by the DFO in order to provide automated features for the preprocessing needs of the researchers. This allows individual researchers, working independently, to effectively crowd-source dataset collections. Through the proposed platform, the researchers should be able to easily navigate through the dataset collection, interpret and integrate the relevant datasets, and obtain data in a format that is ready to be processed by various analysis tools (such as SPSS, Stata and R). The platform provides an intuitive and researcher-friendly interface, without the necessity of programming.

7.1 Scope of the Data Fusion Platform

7.1.1 Functional Requirements Specification using Use Cases

The prototype implementation of the data fusion platform presented here is intended to demonstrate how the DFO annotations are implemented for the use cases described in the previous chapter. These use cases, restated here, serve as the functional design requirements for the data fusion platform.

Discover relevant datasets and conduct initial assessment of suitability

• Use Case 1: “Find available datasets that can be integrated with given dataset”: Search for relational datasets of topical interest using existing data discovery tools like data.gov, census.gov, geodata, NCBI Entrez datasets, etc.

• Use Case 2: “Get quality information about the dataset”: Assess provenance data (publishers, version, etc.) as well as quality information (e.g. number of data entries, analysis of missing values, etc.) of the datasets.

Real-time Interpretation

• Use Case 3: “Find all possible matches between given datasets”: Automatically find “matches” between multiple local and public datasets using explicit table / column level annotations.

• Use Case 4: “Suggest intermediate datasets for integration”: Automatically find intermediate datasets required to integrate given two datasets with no direct “match”.

Perform ad-hoc integration of data

• Use Case 5: “Suggest the best possible join for me to integrate given datasets”: Compare multiple dataset “matches” for relevance and integration possibilities in a real-time environment using heuristics of data quality / retention / loss.

• Use Case 6: “Export the “joins” and data in format ready for analysis”: Perform ad-hoc, real-time “joins” across identified datasets and deliver them in relational (tabular) format for use in various analysis tools like SPSS, Stata, etc.

7.1.2 Non-functional Requirements

The interviewed researchers also identified the following non-functional requirements for the data fusion platform:

• Visual query interface: An intuitive visual / graphical user environment that cuts across datasets, is not programmer oriented, and is capable of delivering the above functionalities.

• Semi-automated annotation: The platform should provide some degree of automation in the annotation process of a particular dataset. This would reduce the cost of annotation and improve their precision.

• Low query latency: The query time for the automated interpretation tasks should be reasonably low to allow smooth operation in real-time.

• Scalable architecture: The platform should be able to support a large number of datasets without losing the above requirements of intuitive interface and low latency.

7.1.3 Scope Definition

Based on these requirements, the overall scope of the platform can be defined as follows.

• Support the various data preprocessing tasks / use cases (stated above) in order to validate the Data Fusion Ontology.

• Provide a lightweight solution to handle research needs and support automated interpretation and ad-hoc integration tasks for multiple / changing hypotheses.

• Aid automated and real-time execution of data preprocessing tasks with limited requirement of technical resources or programming expertise. The platform should be capable of accelerating data-driven research with an intuitive and interactive researcher-friendly (and not programmer-oriented) interface.

• The coverage of classes, entities, relationships, etc. by the reference ontologies used for annotations is beyond the scope of the platform implementation.

7.2 Implementation Details

We have successfully implemented a prototype platform that fits the defined scope and addresses most of the stated requirements. It can perform automated interpretation across datasets and can identify “joins” for data integration using user-defined semantic annotations. The key design elements that help address the requirements are:

Persistent data store for DFO annotations

A persistent data store is key to scalability of the platform. The annotations are stored in an ontological format (OWL / N-triples) using the DFO specifications. They can then be easily translated to any required database, for instance the Neo4j format for this implementation.
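As a rough illustration of this translation step, assuming the rdflib library and a greatly simplified mapping in which every triple becomes a relationship between two nodes keyed by URI (the actual loader's Neo4j schema is not described here):

```python
from rdflib import Graph

def triples_to_cypher(nt_path):
    """Yield Cypher MERGE statements mirroring DFO annotation triples as a graph.
    Escaping and literal handling are omitted; this is only an illustration."""
    g = Graph()
    g.parse(nt_path, format="nt")
    for s, p, o in g:
        # Derive a relationship type from the predicate's local name
        # (assumes simple local names such as "represents" or "belongsTo").
        rel = str(p).split("/")[-1].split("#")[-1].upper()
        yield (
            f"MERGE (a:Resource {{uri: '{s}'}}) "
            f"MERGE (b:Resource {{uri: '{o}'}}) "
            f"MERGE (a)-[:{rel}]->(b)"
        )

if __name__ == "__main__":
    for statement in triples_to_cypher("dfo_annotations.nt"):
        print(statement)
```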

Graph-based exploration of datasets

The interface allows an intuitive graph-based exploration of the datasets. The automated interpretation and ad-hoc integration can also be performed within the same interface.

7.2.1 System components

The components of the platform are shown in Figure 22. The dataset is input using the annotation interface, which interacts with the annotation engine. The user can annotate the tables and columns of the dataset using classes and properties from the reference ontologies. Note that in the above platform design the cost of annotation of a particular dataset is incurred only once. The annotation engine then verifies these annotations for completeness and validates them against the reference ontologies and the data fusion ontology (DFO). They are then translated into a Neo4j schema and loaded into the Neo4j graph database. These can now be reused by the query interface to perform the various preprocessing tasks as discussed in the use cases.

[Figure 22 diagram labels: User Interface; Input Dataset; Annotation Interface (Java Spring); Query Interface (JavaScript, D3); semi-automated annotation process; Reference Ontologies; NLP Tools; Annotation Engine; Internal Data Representation; Neo4j Database; Cypher queries with JDBC; data translation and loading; Previous User Annotations; data exported in an ontological format; Future Extensions]

Figure 22. Data Fusion Platform components

For this design challenge, we are showcasing only the query interface, with the assumption that the datasets are already annotated and loaded into the Neo4j database. This is done in order to save the evaluators the time and effort of annotating new datasets. The demo interface is implemented as a web application using JavaScript and the D3 visualization library on the frontend and a Neo4j graph database at the backend. The database is queried using the Cypher query language based on the user interactions at the front end. The interface allows users to browse the annotated datasets and perform real-time interpretation and ad-hoc integration.

Currently, the platform can record and export the “joins” made by the users in the interface. This record can, in turn, be used to perform the actual integration.

7.3 Walkthrough of Data Preprocessing Tasks

[Figure 23 callouts: dataset visualization space; select a dataset (these datasets are already annotated); start exploring the selected dataset]

Figure 23. Start up page. The datasets are already annotated as per the DFO specifications.

This section demonstrates how the different preprocessing tasks can be performed in the prototype platform. Figure 23 is the screenshot of the start page of the prototype platform. The display area is divided into two parts:

• The left side changes dynamically and performs multiple functions.

• The right side is for visualizing the datasets graphically.

7.3.1 Find available datasets that can be integrated with given dataset on a particular column

[Figure 24 callouts: list of columns; information about the selected node; selected node]

Figure 24. Exploring the selected dataset. The column that the user wants to find matches for is selected.

This is the typical expansion-type data integration. Once a dataset is selected, the user can view its tables and columns, as shown in Figure 24. The required column that the user wants to find matches for is selected.

The info panel provides the details of the annotations of a particular column (or even table). In the case of a dataset, it would display provenance information such as who created the dataset, when it was created, etc. In the figure above, the “Cytogenetic Location” column is selected and it “represents” the ontological class “some:CytogeneticLocation”, as displayed in the info box.

[Figure 25 callouts: table of all possible matches; potential match]

Figure 25. All possible matches found for a particular column

The “Find Matches” operation will find all the columns within the annotated datasets that represent the same ontological class, as illustrated in Figure 25. All potential joins are indicated using a dotted edge. The left side displays a table of all such potential matches. The user can choose the necessary matches to perform the actual “join” across these datasets.

7.3.2 Find all possible matches between given two datasets

Figure 26. Finding all possible matches between given two datasets

In this case, the user selects two datasets in the start screen. Clicking “Explore” will automatically execute this query. In the example illustrated in Figure 26, the Gene and Drug datasets are selected. All the possible matches within the tables of the two datasets are found.

7.3.3 Find matches through intermediate dataset

If there are no matches between the given datasets, the platform can also be used to find intermediate datasets automatically. For example, the Gene and Patient datasets have no direct matches between them. This is a complex match operation.

Figure 27. No direct matches between the patient and gene datasets

The result of the above operation can be seen in Figure 28. The platform automatically finds matches between the columns through the “Drug” and “Disease” datasets.

Figure 28. Automatically finding intermediate datasets

7.3.4 Ad hoc Join functionality and on demand integration

The ad hoc join functionality means that the user should be able to compare multiple join possibilities in real time through the platform. This is achieved as follows. The above basic operations can be combined over a series of steps, meaning that the users can shuffle back and forth across the various operations, selecting different datasets, in order to create multiple join possibilities. In addition, the platform also maintains the previous “joins” that the user wants to create.

An example result of such an ad hoc operation can be seen in Figure 29 below. The user started with the “Disease” and “Drug” datasets above. Then, the “Patient” and “Gene” datasets were added in subsequent exploration / integration steps.

Figure 29. Ad hoc join functionality. Comparing multiple potential joins in real time for integration.

[Figure 30 callouts: new edges; export the new edges to indicate a join]

Figure 30. Integrating datasets "on demand"

The “on demand” data integration functionality should allow users to “export” the joins they created within the platform (however complex) and apply them directly to the data. Ideally, the users should not have to program the integration or perform it manually. The platform itself should be able to generate the required script for data integration. In the platform, the “Export” function generates the required SQL query to perform the integration.
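A minimal sketch of what such an export could produce, assuming the recorded “joins” are simple (table, column) pairs; the SQL actually generated by the prototype is not reproduced here:

```python
def generate_join_sql(base_table, joins):
    """joins: list of ((left_table, left_column), (right_table, right_column))
    pairs recorded in the interface. Builds a single SELECT with INNER JOINs."""
    sql = ["SELECT *", f"FROM {base_table}"]
    for (lt, lc), (rt, rc) in joins:
        sql.append(f"INNER JOIN {rt} ON {lt}.{lc} = {rt}.{rc}")
    return "\n".join(sql)

# Example: joining the drug dataset to the disease and gene datasets on the
# columns identified by the semantic match (table and column names are made up).
print(generate_join_sql("drug", [
    (("drug", "disease_name"), ("disease", "name")),
    (("disease", "gene_symbol"), ("gene", "symbol")),
]))
```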

Notes

The prototype is available at: http://140.254.126.122/dataFusionPlatform/index2.html

The code is available at https://code.bmi.osumc.edu/kbase/data-fusion-platform.git

Chapter 8 DISCUSSION

Summary. The limitations, challenges and the future research agenda are discussed in this chapter.

This concluding chapter summarizes the key contributions of the thesis. It recognizes the limitations, challenges and open research questions with regard to the proposed crowdsourcing paradigm, the data fusion ontology, and the data fusion platform implementation (which are the three major products of this research). The chapter also identifies potential future research directions towards overcoming the above limitations and challenges.

The discussion is organized into sections dealing with issues regarding the 1) Social and Organizational challenges, 2) Technical aspects of the ontology and 3) Functionality of the Data Fusion Platform.

8.1 Social and Organizational Challenges

8.1.1 Change in incentives for data sharing and annotation for researchers

For the crowdsourcing paradigm to work, the motivations of researchers to share annotations have to be made clear and supported within the organization (Fecher et al., 2015). Today, researchers working autonomously are primarily measured based on publications and not in terms of data-related contributions to the wider community.

Publication houses are changing their policies for researchers to share their data (Alsheikh-Ali et al., 2011; Bloom et al., 2014). An increasing number of datasets are becoming publicly available because of such practices. However, the basic need remains for incentives that will prompt researchers to participate by uploading not only their raw data, but also the proper annotations associated with such data. There is on-going research in large-scale collaboration establishing the benefits of crowdsourcing (Molloy, 2011; Nielsen, 2012; Pampel & Dallmeier-Tiessen, 2014). Such a change is critical for the success of the crowdsourcing paradigm.

8.1.2 Impact in business data analytics

Another important direction is how the crowdsourcing approach impacts the business environment. Currently, not all organizations have well-formed and planned data warehouses. Certain key personnel are responsible for database administration. If these individuals move, then there is a huge cost incurred by the organization associated with the transfer of their knowledge.

One of the major factors contributing to this is the fact that the schema information itself is not necessarily stored in a computable or even a standardized format. Textual documentation of the entire schema is not uncommon. In addition, there is limited scalability in the current warehousing approaches. Due to this, expanding warehouses across an organization is extremely difficult and very costly.

This is not unlike the distributed research environment. Thus, the crowdsourcing approach provides a potential long-term and scalable solution for the business environment as well. This can possibly shift the data-modeling paradigm itself from a transaction-driven to an end-user, analytics-driven system.

Further research and understanding of the data analytics process and how it matures is needed to explore this possibility properly.

8.2 Technical Challenges

8.2.1 Further development of DFO

The proposed ontology is the first iteration of the DFO and requires further development. Specifically, in the next iteration we want to make the DFO compatible with the R2RML standard and the recent W3C recommendations for publishing CSV data on the Web, mentioned in the related work, for easier adoption into applications. Such an integration would make the DFO much more universally adaptable.

8.2.2 Inclusion of unstructured data

A key limitation of the current approach is that it only considers structured / tabular forms of data. This was mainly because most data-driven research and analytics (in both academic and business environments) relies on tabular data formats. One reason for this is that most of the commonly used analysis tools (SPSS, R, Stata, etc. in research; and Tableau, Informatica, etc. in business) primarily require tabular data as input. Most of the statistical and mining methods are also based on tabular data.

This is an important future direction, not only for the crowdsourcing solution but also from the technical aspects for the Data Fusion Ontology and eventual implementation in the Fusion Platform. Some of the open research questions in this direction are:

1. How can the annotations and the annotation process be modified to incorporate unstructured data?

2. How can the combination of structured and unstructured data be presented to the researchers efficiently?

8.2.3 Cost of annotation

In our cost assessment above we assume that the cost of annotations is incurred only once, since an annotation is registered and can be located and reused by other researchers (akin to a Wikipedia for annotations). However, the generation of annotations for data sources is the biggest cost factor for both warehousing and the proposed crowdsourcing.

The large annotation cost means that annotating, at large scale, the research datasets already published in the several data repositories (and thus incorporating them in the data fusion platform) is still prohibitively expensive. Thus, we need to find ways to reduce this cost.

Potential research directions for (semi-) automating the annotation process within the data fusion platform implementation are identified in the next section.

8.2.4 Coverage of ontologies

The viability and scalability of ontologies is itself under wide discussion. The proposed crowdsourcing approach relies heavily on the assumption that the current set of ontologies and their interconnections are sufficient for a universal schema, capable of representing any entity. This raises an important question on the coverage of classes in ontologies {O}; is it really universal?

A related research challenge is the efficient incorporation of new versions of the ontologies. The annotations should ideally be updated automatically. This raises important concerns regarding versioning and provenance of the annotations themselves.

8.3 Functionality of the Data Fusion Platform

Since the currently implemented platform is only a working prototype developed to demonstrate the application of the DFO, there is still a lot of functionality that needs to be implemented to convert it into a full-fledged application. The key components are identified below.

8.3.1 Implementing advanced ontology-based use cases

Since the current application is implemented using a Neo4j backend, it lacks the benefit of a fully ontological triple store. In other words, the current implementation is limited to functionality over the annotation layer only. The reference ontologies themselves have not been integrated within the existing backend, which would be easily possible in case of an OWL / RDF based triple store.

The result is that the prototype cannot support the Extended Semantic match operation (as it leverages the “owl:sameAs” property from the reference ontology). In addition, since the QUDT ontology and the Time Ontology are also not integrated, the advanced use cases relating to the Measure and Temporal column types that are supported by the DFO annotations are still not implementable in the current architecture.

Thus, a key future research item is to shift from the current Neo4j architecture to a fully ontological architecture. However, a major consideration, if the entire platform is ontology-based, is that implementing all the capabilities discussed earlier would require a lot of querying using the semantic web technology stack. The annotations would be stored in a triple store like Sesame, Jena or DB2-RDF. The typical query language is SPARQL, which does not scale well with the complexity of the query. It is important to consider how the platform will scale as more datasets are added and the annotations get complex.
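For comparison, the semantic match of Use Case 3 expressed directly over the annotation triples would look roughly like the following SPARQL query, here executed in-memory with rdflib; the dfo namespace URI and property names are placeholders rather than the ontology's actual identifiers.

```python
from rdflib import Graph

# Load the DFO annotations (exported as N-triples) into an in-memory RDF graph.
g = Graph()
g.parse("dfo_annotations.nt", format="nt")

# Placeholder namespace; the real DFO URIs are not reproduced here.
SEMANTIC_MATCH = """
PREFIX dfo: <http://example.org/dfo#>
SELECT DISTINCT ?col1 ?col2 ?cls WHERE {
    ?col1 dfo:represents ?cls .
    ?col2 dfo:represents ?cls .
    ?col1 dfo:belongsTo ?table1 .
    ?col2 dfo:belongsTo ?table2 .
    FILTER (?table1 != ?table2)
}
"""

for row in g.query(SEMANTIC_MATCH):
    print(row.col1, "matches", row.col2, "on", row.cls)
```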

A more critical factor is that most of the ontology-based inference and recommendations for automation discussed in the previous sections need to execute in real time. The current implementation is able to achieve very fast operation with minimal (less than 1 second) latency. The added latency of SPARQL in a fully ontological system, as compared to traditional SQL or even a Neo4j Cypher query, makes this a challenging task.

8.3.2 End-to-end data functionality

All the data preprocessing operations using the DFO annotations within the crowdsourcing paradigm occur at the meta-data layer. Hence, so far, the current prototype implementation has not been integrated with the actual data itself.

For end-to-end functionality ranging from data exploration to its delivery in an actionable format, the platform should actively interact with the actual data itself. This would facilitate incorporation of complementary research efforts, such as ResearchIQ (Raje, 2012; Lele et al., 2015) and GestureDB (Jiang et al., 2013). Such a deep integration would enable advanced data-based functionality such as “smart” previews of the multiple possible joins and advanced search capabilities for finding related resources (experts, publications, etc.).

However, integrating the actual data raises some important issues:

1. Where will the actual data be stored?

2. How will it be accessed?

3. What would the integration mean for the overall latency of the system?

8.3.3 Bootstrapping the annotation process

Automated Ontology Class Discovery

A closer look at the annotation generation process itself reveals that it is not unlike the generation of mappings in traditional data integration. The difference is that instead of mappings between a source and a target schema, we need annotations between the source schema and global ontology classes. In fact, we can potentially borrow some of the schema matching techniques, especially the ontology-based and crowd-sourced methods mentioned in the related work, to (semi-) automate the annotation generation process, thereby reducing cost.

A common set of techniques that are applicable for the ontology class derivation problem can be identified as follows:

1. Domain-specific semantic annotation techniques can be reused. However, this limits the coverage of the ontologies for automated class derivation (or the entity classification).

2. Most entity-class or entity-entity classification techniques that perform relatively well employ a supervised (corpus or lexicon based) learning method, such as Naïve Bayes, decision trees or SVM.

3. Combining multiple features improves the performance of classification.

4. Semi-automated techniques would be more suitable for direct entity classification (not dependent on syntactic features).

Based on these conclusions, a hybrid lexicon-based approach would be most suitable. One potential solution that can be tried is augmenting such a method to create a simple probability distribution model. The model would annotate entries in a column and assign a score to each individually. These individual scores can be combined using a probability function to find the most probable ontological class. A simple formulation is given below, where “w” is an individual entry among the set of entries “W” from a column and “o” is an ontology class from a given ontology “O”. F(w, O) is an automatic annotation function that assigns an ontology class given a term and the ontology, and λ is a weighting parameter.
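One plausible form of such a formulation, offered only as an illustrative sketch consistent with the description above rather than a definitive specification, is:

```latex
o^{*} \;=\; \operatorname*{arg\,max}_{o \,\in\, O}\; \prod_{w \,\in\, W}
\Big( \lambda \cdot \mathbf{1}\big[\, F(w, O) = o \,\big] \;+\; (1 - \lambda) \Big)
```

Here the indicator term plays the role of an entry's individual score and the (1 − λ) term smooths over entries that the annotation function fails to recognize.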

A model such as the one above is too simplistic to be applied directly. Additional tree-based algorithms, especially the “lowest common ancestor”, might need to be applied over the above model. This approach can work for the data fusion platform and would still be relatively easy to implement. The issues to consider with this approach are:

1. The performance of such a method would be directly dependent on the underlying semantic annotation algorithm. A purely lexicon-based method would mean that some of the column entries would not be recognized. However, the combined probability product of the individual entries should cover these “missing values”.

2. Numerical data values would be nearly impossible to recognize by an automated annotation method. So a semi-automated approach would be more suitable in the interface.

Using Previous Annotations

Using the annotations made by the users previously is an essential human-in-the-loop component of the system. For the system to continuously improve, it should be able to use previous annotations towards improving the annotation process. A major use of the previous annotations would be to order the recommendations of the semantic relationships between different classes, in conjunction with the ontology-based techniques discussed above. This is especially useful as there can be several semantic relationships between a pair of classes and there is no inherent way in ontologies to distinguish between them. For example, if we have two classes “Person” and “City”, then there can be several semantic relationships with domain Person and range City, like “livesIn”, “bornIn”, “worksIn”, “mayorOf” and so on. When suggesting these relationships to the user, the ordering can depend on the number of times they are used. So if the semantic relation “livesIn” is used more often than, say, “mayorOf”, then it would have a higher probability of being used again and would be pushed higher up the recommended list of semantic relationships.
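A minimal sketch of this usage-based ranking, assuming previous annotations are available as simple (domain, relation, range) records (a hypothetical storage format):

```python
from collections import Counter

# Hypothetical log of previously used relationship annotations.
previous = [
    ("Person", "livesIn", "City"),
    ("Person", "livesIn", "City"),
    ("Person", "bornIn", "City"),
    ("Person", "mayorOf", "City"),
]

def rank_relations(domain, range_, candidates):
    """Order candidate relations between two classes by past usage frequency."""
    counts = Counter(rel for d, rel, r in previous if d == domain and r == range_)
    return sorted(candidates, key=lambda rel: counts[rel], reverse=True)

print(rank_relations("Person", "City", ["mayorOf", "bornIn", "livesIn", "worksIn"]))
# -> ['livesIn', 'mayorOf', 'bornIn', 'worksIn'] (ties keep their input order)
```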

Previous annotations can also be used for recommending properties of special column types. For example, if the column is of type “Measure” and represents the class “NumberOfBeds”, the system should be able to recommend possible quantities that the column can be annotated with, based on previous annotations. The same technique can be applied to annotating units for the selected quantities in the case of a Measure type column. In the case of a Temporal type column, the previous annotations can be used to predict the format of the time stamps.

Automated Validation

Once annotation for a dataset is completed, ontological reasoning techniques can be used to further bolster the annotation process. Some such techniques are suggested below.

Schema Validation Service: A major use of ontologies in any data integration platform is the validation of the annotations. Defining an ontology-based schema using OWL/RDF is beneficial. Even within a data fusion platform, rule-based wrappers and automated reasoners can be implemented to maintain a conflict-free, consistent fusion graph. In this case the schema validation is done inherently by most ontology-based applications and tools, for example Jena, using ontological reasoners. Ontological reasoners can be programmed to check for the rules and restrictions that apply to the different types of columns as defined above. Such rules are applied at the schema level of the DFO. For example, if a column is of the type “Measure”, reasoning can ensure the following restrictions (a minimal rule-checking sketch follows the list):

a. That the column has the property “dfo:quantity”

b. The value of the “dfo:quantity” property is of the right type.
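A minimal rule-checking sketch of this kind, in plain Python over a dictionary of annotations standing in for an actual reasoner (the property names follow the DFO, but the data structure and the quantity-value convention are assumptions):

```python
# Hypothetical in-memory view of column annotations extracted from the fusion graph.
columns = {
    "hospital.number_of_beds": {
        "columnType": "Measure",
        "dfo:quantity": "qudt:Count",
    },
    "hospital.city": {"columnType": "Class"},
}

VALID_QUANTITY_PREFIX = "qudt:"  # assumed convention for QUDT quantity kinds

def validate_measure(name, annotation):
    """Check the two Measure-column restrictions described above."""
    errors = []
    if "dfo:quantity" not in annotation:
        errors.append(f"{name}: Measure column is missing dfo:quantity")
    elif not str(annotation["dfo:quantity"]).startswith(VALID_QUANTITY_PREFIX):
        errors.append(f"{name}: dfo:quantity value has an unexpected type")
    return errors

for name, annotation in columns.items():
    if annotation.get("columnType") == "Measure":
        for error in validate_measure(name, annotation):
            print(error)
```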

Another example would be to check automatically if a JoinTable in the dataset has all the columns of type “Class” and that the connectedness condition is met. Such a validation is similar to the consistency evaluation discussed earlier.

Annotation and Validation of Relationships: This can be used both for validation of annotations already made and for suggesting new relationships between columns for which this property has not been set. For example, once column and table entities are assigned as representing classes from the reference ontologies, an inference engine can determine what semantic relationships should exist between them. To clarify, the value of the property “dfo:semanticRelation” is being validated or derived here. A validation of a column of type Property will check if the semantic relation annotated, say “foaf:name”, indeed belongs to the class represented by the table, say “foaf:Person”. Or, if two entities are annotated as “dbpedia:City” and “dbpedia:State”, a list of possible relationships that can exist between them (locatedIn, capitalCity, etc.) is easily derived by querying the DBpedia ontology.
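A sketch of such a derivation against the public DBpedia SPARQL endpoint, assuming the SPARQLWrapper library; whether useful candidates are returned depends entirely on how the reference ontology declares domains and ranges for its properties.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

def candidate_relations(domain_uri, range_uri):
    """Ask DBpedia for object properties declared with the given domain and range."""
    endpoint = SPARQLWrapper("https://dbpedia.org/sparql")
    endpoint.setQuery(f"""
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT DISTINCT ?prop WHERE {{
            ?prop rdfs:domain <{domain_uri}> .
            ?prop rdfs:range  <{range_uri}> .
        }}
    """)
    endpoint.setReturnFormat(JSON)
    results = endpoint.query().convert()
    return [b["prop"]["value"] for b in results["results"]["bindings"]]

if __name__ == "__main__":
    print(candidate_relations("http://dbpedia.org/ontology/City",
                              "http://dbpedia.org/ontology/Region"))
```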

8.3.4 HCI evaluation of the platform

Once the advanced functionality is implemented, the data fusion platform should be evaluated for functionality as well as usability. The following three different types of evaluation methods are suggested in this regard:

• Task-based evaluation: A task-based evaluation method involves identifying specific tasks to be performed by the system. In this case, the tasks will be designed around the use cases specified above. The platform will first be evaluated for functionality by checking if it is actually capable of performing the tasks. A successful platform should be able to perform all the specified use cases to be functionally complete.

• Cognitive walkthrough: The cognitive walkthrough can be considered an extension of the task-based evaluation. An optimal workflow to perform each task is prerecorded. In the next stage of evaluation, a set of test users will be asked to perform the same tasks. Eye-tracking and think-aloud protocols will be used as users perform the tasks. An evaluator would also note various aspects of the process, such as whether the user could actually finish the task, the deviation from the optimal flow, the time required to complete, etc. Both the user actions and the responses will be recorded and analyzed to evaluate the user interface.

• Heuristic evaluation: The heuristic evaluation will be performed in two parts. First, the interface is to be evaluated by UI experts based on HCI principles (Nielsen, 1994). Next, a survey using a questionnaire designed based on these principles should be administered to the naïve users of the system. The platform will be evaluated as a combination of the expert comments and the user responses.

8.4 Conclusion

The following is a summary of the key contributions of the thesis:

Key Contribution 1: This thesis first identifies the paradigm shift from current data warehousing practices and the research process requirements. This shift warrants a new approach to tool and platform design for the research environment.

Key Contribution 2: A novel data crowdsourcing approach as a computational solution that addresses the limitations of traditional data warehousing and allows for “on demand” integration of research datasets.

Key Contribution 3: Development of a Data Fusion Ontology (DFO) as a necessary representational and architectural framework within the proposed crowdsourcing paradigm. It should be noted that the ontology design methodology is another sub-contribution of the thesis. Though there are several established methods for the design and specification of domain ontologies, the design of application ontologies remains, thus far, fairly ad hoc. In this research, a generalizable methodology for application ontology engineering is proposed and used for the DFO.

Key Contribution 4: The thesis illustrates how the DFO enables a data fusion platform21 that supports automated, real-time and ad hoc functionality across the data preprocessing tasks of discovery, interpretation, and integration. Thus, this platform specifically addresses the academic research environment with the primary users being researchers heavily involved in quantitative analysis of multidimensional data.

Notes

A: The design was among the top 7 finalists for the American Medical Informatics Association (AMIA) Student Design Challenge 2015 – “The Human Side of Big Data – Facilitating Human-Data Interaction”

21 See Note A

BIBLIOGRAPHY

Abedjan, Z., Golab, L., & Naumann, F. (2015). Profiling relational data: a survey. The VLDB Journal, 24(4), 557-581.

Alsheikh-Ali, A. A., Qureshi, W., Al-Mallah, M. H., & Ioannidis, J. P. (2011). Public availability of published research data in high-impact journals. PloS one, 6(9), e24357.

Auer, S., Dietzold, S., Lehmann, J., Hellmann, S., & Aumueller, D. (2009, April). Triplify: light-weight linked data publication from relational databases. In Proceedings of the 18th international conference on World wide web (pp. 621-630). ACM.

Batini, C., Lenzerini, M., & Navathe, S. B. (1986). A comparative analysis of methodologies for database schema integration. ACM computing surveys (CSUR), 18(4), 323-364.

Benson, D. A., Karsch-Mizrachi, I., Lipman, D. J., Ostell, J., & Wheeler, D. L. (2008). GenBank. Nucleic acids research, 36(suppl 1), D25-D30.

Berardi, D., Calvanese, D., & De Giacomo, G. (2005). Reasoning on UML class diagrams. Artificial Intelligence, 168(1), 70-118.

Berners-Lee, T., Hendler, J., & Lassila, O. (2001). The semantic web. Scientific american, 284(5), 28-37.

Bernstein, P. A., Madhavan, J., & Rahm, E. (2011). Generic schema matching, ten years later. Proceedings of the VLDB Endowment, 4(11), 695-701.

Bhardwaj, A., Bhattacherjee, S., Chavan, A., Deshpande, A., Elmore, A. J., Madden, S., & Parameswaran, A. G. (2014). Datahub: Collaborative data science & dataset version management at scale. arXiv preprint arXiv:1409.0798.

Bizer, C., & Seaborne, A. (2004, November). D2RQ-treating non-RDF databases as virtual RDF graphs. In Proceedings of the 3rd international semantic web conference (ISWC2004) (Vol. 2004). Hiroshima: Citeseer.

Bizer, C., & Cyganiak, R. (2006, November). D2r server-publishing relational databases on the semantic web. In Poster at the 5th International Semantic Web Conference (pp. 294-309).

Bizer, C., Heath, T., Idehen, K., & Berners-Lee, T. (2008, April). Linked data on the web (LDOW2008). In Proceedings of the 17th international conference on World Wide Web (pp. 1265-1266). ACM.

Bizer, C., Lehmann, J., Kobilarov, G., Auer, S., Becker, C., Cyganiak, R., & Hellmann, S. (2009a). DBpedia-A crystallization point for the Web of Data. Web Semantics: science, services and agents on the world wide web, 7(3), 154-165.

Bizer, C., Heath, T., & Berners-Lee, T. (2009b). Linked data-the story so far. Semantic Services, Interoperability and Web Applications: Emerging Concepts, 205-227.

Bloom, T., Ganley, E., & Winker, M. (2014). Data access for the open access literature: PLOS's data policy. PLoS Med, 11(2), e1001607.

Bonifati, A., Chrysanthis, P. K., Ouksel, A. M., & Sattler, K. U. (2008). Distributed databases and peer-to-peer databases: past and present. ACM SIGMOD Record, 37(1), 5-11.

Borgman, C. L. (2012). The conundrum of sharing research data. Journal of the American Society for Information Science and Technology, 63(6), 1059-1078.

Brickley, D., & Miller, L. (2012). FOAF vocabulary specification 0.98. Namespace document, 9.

Cagle, K. (2014). Semantics in the Library of Babel (with hobbits). Journal of Digital Media Management, 3(3), 241-248.

Calvanese, D., De Giacomo, G., Lembo, D., Lenzerini, M., Poggi, A., Rodriguez-Muro, M., ... & Savo, D. F. (2011). The MASTRO system for ontology-based data access. Semantic Web, 2(1), 43-53.

Chaudhuri, S., & Dayal, U. (1997). An overview of data warehousing and OLAP technology. ACM Sigmod record, 26(1), 65-74.

Chen, P. P. S. (1976). The entity-relationship model—toward a unified view of data. ACM Transactions on Database Systems (TODS), 1(1), 9-36.

Corbin, J., & Strauss, A. (2014). Basics of qualitative research: Techniques and procedures for developing grounded theory. Sage publications.

Corcho, O., Fernández-López, M., & Gómez-Pérez, A. (2003). Methodologies, tools and languages for building ontologies. Where is their meeting point?. Data & knowledge engineering, 46(1), 41-64.

Cragin, M. H., Palmer, C. L., Carlson, J. R., & Witt, M. (2010). Data sharing, small science and institutional repositories. Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 368(1926), 4023-4038.

Das, S., Sundara, S., & Cyganiak, R. (2012). R2RML: RDB to RDF Mapping Language.

Dimou, A., Sande, M. V., Colpaert, P., Mannens, E., & Van de Walle, R. (2013, October). Extending R2RML to a source-independent mapping language for RDF. In Proceedings of the 2013th International Conference on Posters & Demonstrations Track-Volume 1035 (pp. 237-240). CEUR-WS. org.

Ding, L., Shinavier, J., Finin, T., & McGuinness, D. L. (2010). owl: sameAs and Linked Data: An empirical study.

Ding, L., DiFranzo, D., Graves, A., Michaelis, J., Li, X., McGuinness, D. L., & Hendler, J. (2010, March). Data-gov Wiki: Towards Linking Government Data. In AAAI Spring Symposium: Linked data meets artificial intelligence (Vol. 10, pp. 1-1).

Dong, X. L., Halevy, A., & Yu, C. (2009). Data integration with uncertainty. The VLDB Journal, 18(2), 469-500.

Eckerson, W., & White, C. (2003). Evaluating ETL and data integration platforms. Report of The Data Warehousing Institute, 184.

Embi, P. J., & Payne, P. R. (2009). Clinical research informatics: challenges, opportunities and definition for an emerging domain. Journal of the American Medical Informatics Association, 16(3), 316-327.

Ermilov, I., Martin, M., Lehmann, J., & Auer, S. (2013). Linked open data statistics: Collection and exploitation. In Knowledge Engineering and the Semantic Web (pp. 242-249). Springer Berlin Heidelberg.

Fecher, B., Friesike, S., & Hebing, M. (2015). What drives academic data sharing?. PloS one, 10(2), e0118053.

Franklin, M., Halevy, A., & Maier, D. (2005). From databases to dataspaces: a new abstraction for information management. ACM Sigmod Record, 34(4), 27-33.

Gal, A. (2006). Why is schema matching tough and what can we do about it?. ACM Sigmod Record, 35(4), 2-5.

Geer, L. Y., Marchler-Bauer, A., Geer, R. C., Han, L., He, J., He, S., ... & Bryant, S. H. (2009). The NCBI biosystems database. Nucleic acids research, gkp858.

Giachetti, R. E. (2010). Design of enterprise systems: Theory, architecture, and methods. CRC Press.

Goecks, J., Nekrutenko, A., & Taylor, J. (2010). Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol, 11(8), R86.

Gonzalez, H., Halevy, A. Y., Jensen, C. S., Langen, A., Madhavan, J., Shapley, R., Shen, W., & Goldberg-Kidon, J. (2010). Google fusion tables: web-centered data management and collaboration. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data (pp. 1061-1066). ACM.

Goodhue, D. L., Wybo, M. D., & Kirsch, L. J. (1992). The impact of data integration on the costs and benefits of information systems. MiS Quarterly, 293-311.

Gruber, T. R. (1995). Toward principles for the design of ontologies used for knowledge sharing?. International journal of human-computer studies, 43(5), 907-928.

Guenther, R., & Radebaugh, J. (2004). Understanding metadata. National Information Standard Organization (NISO) Press, Bethesda, USA.

Gustas, R., & Gustiene, P. (2009). Service-Oriented Foundation and Analysis Patterns for Conceptual Modelling of Information Systems. In Information Systems Development (pp. 249-265). Springer US.

Halevy, A., Rajaraman, A., & Ordille, J. (2006, September). Data integration: the teenage years. In Proceedings of the 32nd international conference on Very large data bases (pp. 9-16). VLDB Endowment.

Halpin, H., Hayes, P. J., McCusker, J. P., McGuinness, D. L., & Thompson, H. S. (2010). When owl: sameas isn’t the same: An analysis of identity in linked data. In The Semantic Web–ISWC 2010 (pp. 305-320). Springer Berlin Heidelberg.

Halpin, H., & Herman, I. (2011). RDB2RDF Working Group Charter.

Hassanzadeh, O., Kementsietsidis, A., Lim, L., Miller, R. J., & Wang, M. (2009, November). A framework for semantic link discovery over relational data. In Proceedings of the 18th ACM conference on Information and knowledge management (pp. 1027-1036). ACM.

Hausenblas, M. (2009). Exploiting linked data to build web applications. IEEE Internet Computing, 13(4), 68.

Hedges, M., Hasan, A., & Blanke, T. (2007, November). Management and preservation of research data with iRODS. In Proceedings of the ACM first workshop on CyberInfrastructure: information management in eScience (pp. 17-22). ACM.

Hert, M., Reif, G., & Gall, H. C. (2011, September). A comparison of RDB-to-RDF mapping languages. In Proceedings of the 7th International Conference on Semantic Systems (pp. 25-32). ACM.

Holdren, P. (2013, February). Memorandum For The Heads Of Executive Departments And Agencies. SUBJECT: Increasing Access to the Results of Federally Funded Scientific Research. https://www.whitehouse.gov/sites/default/files/microsites/ostp/ostp_public_access_memo_2013.pdf

Ingwersen, P., & Chavan, V. (2011). Indicators for the Data Usage Index (DUI): an incentive for publishing primary biodiversity data through global information infrastructure. BMC Bioinformatics, 12(15), 1.

Inmon, W. H., & Linstedt, D. (2014). Data Architecture: A Primer for the Data Scientist: Big Data, Data Warehouse and Data Vault. Morgan Kaufmann.

Inmon, W. H. (2005). Building the data warehouse. John wiley & sons.

Jagadish, H., Chapman, A., Elkiss, A., Jayapandian, M., Li, Y., Nandi, A., Yu, C. (2007). Making database systems usable. in: Proceedings of the 2007 ACM SIGMOD international conference on Management of data, ACM, pp. 13–24.

Jiang, L., Mandel, M., & Nandi, A. (2013). GestureQuery: a multitouch database query interface. Proceedings of the VLDB Endowment, 6(12), 1342-1345.

Jones, D., Bench-Capon, T., & Visser, P. (1998). Methodologies for ontology development.

Kamel, M. (2009). for .

Kim, C. H., Weston, R. H., Hodgson, A., & Lee, K. H. (2003). The complementary use of IDEF and UML modelling approaches. Computers in Industry, 50(1), 35-56.

Kimball, R., & Ross, M. (2013). The data warehouse toolkit: The definitive guide to dimensional modeling. John Wiley & Sons.

King, G. (2007). An introduction to the Dataverse Network as an infrastructure for data sharing. Sociological Methods & Research, 36(2), 173-199.

Knublauch, H., Fergerson, R. W., Noy, N. F., & Musen, M. A. (2004). The Protégé OWL plugin: An open development environment for semantic web applications. In The Semantic Web–ISWC 2004 (pp. 229-243). Springer Berlin Heidelberg.

Lele, O., Raje, S., Yen, P. Y., & Payne, P. (2015). ResearchIQ: Design of a Semantically Anchored Integrative Query Tool. AMIA Summits on Translational Science Proceedings, 2015, 97.

Lenzerini, M. (2002, June). Data integration: A theoretical perspective. In Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems (pp. 233-246). ACM.

Lipscomb, C. E. (2000). Medical subject headings (MeSH). Bulletin of the Medical Library Association, 88(3), 265.

Maali, F., Erickson, J., & Archer, P. (2014). Data catalog vocabulary (DCAT). W3C Recommendation.

Maglott, D., Ostell, J., Pruitt, K. D., & Tatusova, T. (2005). Entrez Gene: gene-centered information at NCBI. Nucleic acids research, 33(suppl 1), D54-D58.

Mailman, M. D., Feolo, M., Jin, Y., Kimura, M., Tryka, K., Bagoutdinov, R., ... & Popova, N. (2007). The NCBI dbGaP database of genotypes and phenotypes. Nature genetics, 39(10), 1181-1186.

Marcial, L. H., & Hemminger, B. M. (2010). Scientific data repositories on the Web: An initial survey. Journal of the American Society for Information Science and Technology, 61(10), 2029-2048.

Michel, F., Montagnat, J., & Faron-Zucker, C. (2014). A survey of RDB to RDF translation approaches and tools (Doctoral dissertation, I3S).

Mihindukulasooriya, N., Priyatna, F., Corcho, O., García-Castro, R., & Esteban-Gutiérrez, M. (2014). morph-LDP: An R2RML-based Linked Data Platform implementation. In The Semantic Web: ESWC 2014 Satellite Events (pp. 418-423). Springer International Publishing.

Miles, A., & Bechhofer, S. (2009). SKOS Simple Knowledge Organization System Reference. W3C Recommendation, 18 August 2009.

Molloy, J. C. (2011). The open knowledge foundation: open data means better science. PLoS Biol, 9(12), e1001195.

Munson, M. A. (2012). A study on the importance of and time spent on different modeling steps. ACM SIGKDD Explorations Newsletter, 13(2), 65-71.

Murray-Rust, P. (2008). Open data in science. Serials Review, 34(1), 52-64.

Nielsen, J. (1994, April). Usability inspection methods. In Conference companion on Human factors in computing systems (pp. 413-414). ACM.

Nielsen, M. (2012). Reinventing discovery: the new era of networked science. Princeton University Press.

Nottelmann, H., & Straccia, U. (2007). Information retrieval and machine learning for probabilistic schema matching. Information processing & management, 43(3), 552-576.

Noy, N. F., & McGuinness, D. L. (2001). Ontology development 101: A guide to creating your first ontology.

Noy, N. F. (2004). Semantic integration: a survey of ontology-based approaches. ACM SIGMOD Record, 33(4), 65-70.

Palmer, C. L., Cragin, M. H., Heidorn, P. B., & Smith, L. C. (2007, December). Data curation for the long tail of science: The case of environmental sciences. In Third International Digital Curation Conference, Washington, DC.

Pampel, H., Vierkant, P., Scholze, F., Bertelmann, R., Kindling, M., Klump, J., ... & Dierolf, U. (2013). Making research data repositories visible: the re3data.org registry. PloS one, 8(11), e78080.

Pampel, H., & Dallmeier-Tiessen, S. (2014). Open research data: From vision to practice. In Opening science (pp. 213-224). Springer International Publishing.

Pawlowsky-Glahn, V., & Buccianti, A. (2011). Compositional data analysis: Theory and applications. John Wiley & Sons.

Payne, P. R., Embi, P. J., & Sen, C. K. (2009). Translational informatics: enabling high-throughput research paradigms. Physiological genomics, 39(3), 131-140.

Pyle, D. (1999). Data preparation for data mining (Vol. 1). Morgan Kaufmann.

Rahm, E., & Bernstein, P. A. (2001). A survey of approaches to automatic schema matching. The VLDB Journal, 10(4), 334-350.

Rajasekar, A., Kum, H. C., Crosas, M., Crabtree, J., Sankaran, S., Lander, H., ... & Zhan, J. (2013). The DataBridge. SCiENCE, 2(1), 1.

Raje, S. (2012). ResearchIQ: An End-To-End Semantic Knowledge Platform For Resource Discovery in Biomedical Research (Doctoral dissertation, The Ohio State University).

Raje, S., Kite, B., Ramanathan, J., & Payne, P. (2015). Real-time Data Fusion Platforms: The Need of Multi-dimensional Data-driven Research in Biomedical Informatics. Studies in health technology and informatics, 216, 1107.

Rex, D. E., Ma, J. Q., & Toga, A. W. (2003). The LONI pipeline processing environment. Neuroimage, 19(3), 1033-1048.

Robson, C., & McCartan, K. (2016). Real world research. Wiley.

Sarma, A. D., Dong, X., & Halevy, A. (2008, June). Bootstrapping pay-as-you-go data integration systems. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data (pp. 861-874). ACM.

Sahoo, S. S., Halb, W., Hellmann, S., Idehen, K., Thibodeau Jr, T., Auer, S., ... & Ezzat, A. (2009). A survey of current approaches for mapping of relational databases to RDF. W3C RDB2RDF Incubator Group Report.

Science Staff. (2011, February). Challenges and opportunities. Science, 331(6018), 692-693. DOI: 10.1126/science.331.6018.692

Schutt, R., & O'Neil, C. (2013). Doing data science: Straight talk from the frontline. O'Reilly Media, Inc.

Seligman, L. J., Rosenthal, A., Lehner, P. E., & Smith, A. (2002). Data Integration: Where Does the Time Go?. IEEE Data Eng. Bull., 25(3), 3-10.

Smith, B., Ashburner, M., Rosse, C., Bard, J., Bug, W., Ceusters, W., ... & Leontis, N. (2007). The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nature biotechnology, 25(11), 1251-1255.

Spanos, D. E., Stavrou, P., & Mitrou, N. (2012). Bringing relational databases into the semantic web: A survey. Semantic Web, 3(2), 169-209.

Tolle, K. M., Tansley, D. S. W., & Hey, A. J. (2011). The fourth paradigm: Data-intensive scientific discovery [point of view]. Proceedings of the IEEE, 99(8), 1334-1337.

Tufte, E. R., & Graves-Morris, P. R. (1983). The visual display of quantitative information (Vol. 2, No. 9). Cheshire, CT: Graphics Press.

van der Putten, P., Kok, J. N., & Gupta, A. (2002, April). Why the Information Explosion Can Be Bad for Data Mining, and How Data Fusion Provides a Way Out. In SDM (pp. 128-138).

Vierkant, P., Spier, S., Rücknagel, J., Pampel, H., & Gundlach, J. (2013). Schema for the description of research data repositories. re3data.org, Version 2.

Volz, J., Bizer, C., Gaedke, M., & Kobilarov, G. (2009). Silk-A Link Discovery Framework for the Web of Data. LDOW, 538.

Wallis, J. C., Rolando, E., & Borgman, C. L. (2013). If we share data, will anyone use them? Data sharing and reuse in the long tail of science and technology. PloS one, 8(7), e67332.

Wang, J., Kraska, T., Franklin, M. J., & Feng, J. (2012). CrowdER: Crowdsourcing entity resolution. Proceedings of the VLDB Endowment, 5(11), 1483-1494.

Weber, G. M., Murphy, S. N., McMurry, A. J., MacFadden, D., Nigrin, D. J., Churchill, S., & Kohane, I. S. (2009). The Shared Health Research Information Network (SHRINE): a prototype federated query tool for clinical data repositories. Journal of the American Medical Informatics Association, 16(5), 624-630.

Weibel, S. L., & Koch, T. (2000). The Dublin core metadata initiative. D-lib magazine, 6(12), 1082-9873.

Widom, J. (1995, December). Research problems in data warehousing. In Proceedings of the fourth international conference on Information and knowledge management (pp. 25-30). ACM.

Winn, J. (2013). Open data and the academy: An evaluation of CKAN for research data management.

Wolstencroft, K., Haines, R., Fellows, D., Williams, A., Withers, D., Owen, S., ... & Bhagat, J. (2013). The Taverna workflow suite: designing and executing workflows of Web Services on the desktop, web or in the cloud. Nucleic acids research, gkt328.

Xu, L., & Embley, D. W. (2003). Using schema mapping to facilitate data integration. In Conceptual Modeling - ER 2003. Springer.

Zeeberg, B. R., Feng, W., Wang, G., Wang, M. D., Fojo, A. T., Sunshine, M., ... & Bussey, K. J. (2003). GoMiner: a resource for biological interpretation of genomic and proteomic data. Genome Biol, 4(4), R28.

Zerbe, R. O., & Scott, T. A. (2015). Benefit-Cost Analysis and Integrated Data Systems. In Actionable Intelligence (pp. 157-167). Palgrave Macmillan US.
