RESEARCHIQ: AN END-TO-END SEMANTIC KNOWLEDGE PLATFORM FOR RESOURCE DISCOVERY IN BIOMEDICAL RESEARCH

Thesis

Presented in Partial Fulfillment of the Requirements for the Degree

Master of Science in the Graduate School of The Ohio State University

By

Satyajeet Raje

Graduate Program in Computer Science and Engineering

The Ohio State University

2012

Thesis Committee

Dr. Jayashree Ramanathan, Advisor

Dr. Rajiv Ramnath

Copyright by

Satyajeet Raje

2012

ABSTRACT

There has been a tremendous change in the amount of electronic data available to us and in the manner in which we use it. With the ongoing "Big Data" movement, we face the challenge of data "volume, variety and velocity." The linked data movement and its technologies attempt to address the issue of data variety. The current demand for advanced data analytics and services has triggered a shift from data services to knowledge services and delivery platforms. The semantic web plays a major role in providing richer and more comprehensive knowledge services.

We need a stable, sustainable, scalable and verifiable framework for knowledge-based semantic services, as well as a way to validate the "semantic" nature of such services using this framework. Having a framework alone is not enough: its usability should be tested with a good example of a semantic service as a case study in a key research domain. This thesis addresses two research problems.

Problem 1: A generalized framework for the development of end-to-end semantic services needs to be established.

The thesis proposes such a framework, which provides an architecture for developing end-to-end semantic services and metrics for measuring their semantic nature.

Problem 2: A robust knowledge-based service needs to be implemented using the architecture proposed by the semantic services framework, and its semantic nature needs to be validated using the proposed framework.

ResearchIQ, a semantic search portal for resource discovery in the biomedical research domain, has been implemented. It is intended to serve as the required case study for testing the framework. The architecture of the system follows the design principles of the proposed framework.

The ResearchIQ system is truly semantic from end to end, and the baseline evaluation metrics of the proposed framework are used to support this claim. Several key data sources have been integrated in the first version of the system. ResearchIQ serves as a framework for semantic data integration in the biomedical domain and can be used as a platform for the development and support of a variety of semantic services and applications in that domain.

A large part of this thesis is devoted to the details of the ResearchIQ project. The document is intended as a report on ResearchIQ as a successful implementation of an end-to-end semantic framework.


DEDICATION

To my parents, Sanjeev and Swati Raje, and my sister Surabhi

for their constant belief in my ability to achieve any goal set and

for teaching me to believe in myself.

ACKNOWLEDGEMENTS

A special thank you to my advisors, Dr. Jayashree Ramanathan, Dr. Rajiv Ramnath and Dr. Philip Payne for their guidance throughout this project and my graduate studies. I have benefited greatly from your continual support and direction.

I would also like to acknowledge Dr. Tara Payne, Omkar Lele and the rest of the ResearchIQ team: Puneet Mathur, Sandeep Chatra Raveesh and Dr. Po-Yin Yen, without whose contributions this work would not have been possible. Likewise, I thank CETI and my colleagues in the CSE Department. I have thoroughly enjoyed working with you and shall continue to do so.

Finally, I would like to thank my roommates Shrikant, Pranav and Akshay and friends in Columbus who have been my family away from home.

I gratefully acknowledge that the work done under the ResearchIQ project is supported in part by an Institutional Clinical and Translational Science Award, NIH/NCRR Grant Number UL1-RR025755.

TABLE OF CONTENTS

Abstract
Dedication
Acknowledgements
Table of Contents
List of Figures
List of Tables

CHAPTER 1
INTRODUCTION
1. Background
2. Desiderata for Semantic Services
3. Problem Analysis
Need for a Semantic Search Portal for Biomedical Research
4. Contributions

CHAPTER 2
RELATED WORK
1. Linked Data
2. Introduction to Semantic Web
3. Semantic Web (Web 3.0) Technologies

CHAPTER 3
THE SEMANTIC SERVICES FRAMEWORK
1. Semantic Web Applications
2. The Proposed Framework
3. Semantic Web Services Evaluation

CHAPTER 4
INTRODUCTION TO RESEARCHIQ
1. Goals of ResearchIQ
2. Challenges
3. Contributions within ResearchIQ

CHAPTER 5
METHODS AND IMPLEMENTATION
1. The Basics
2. Implementation
3. Component Diagram

CHAPTER 6
ANNOTATION
1. Annotation Pipeline
2. Annotations

CHAPTER 7
QUERYING
1. Querying for Direct Resources
2. Querying for Concepts

CHAPTER 8
CONCLUSION
1. Discussion
2. Final Comments

REFERENCES

LIST OF FIGURES

Figure 1: Semantic Services in the New Era
Figure 2: Size of the Web (Source: netcraft.com)
Figure 3: Growth in Linked Open Data
Figure 4: The Linked Open Data Cloud
Figure 5: Comparison Between Traditional (Web 2.0) and Semantic Web
Figure 6: Semantic Web Services and Intelligent Business
Figure 7: Semantic Web Stack [28]
Figure 8: Example RDF Graph
Figure 9: Gene Ontology (Source: Nature.com)
Figure 10(a): Visualization of the Google Knowledge Graph (Source: cnet.com). (b): An example search for "Barack Obama" showing the different semantic types of results (Source: Google.com)
Figure 11: Ontology for role-based access control in MetaDB [46]
Figure 12: Evolution of semantic web services [48]
Figure 13: Proposed semantic services framework
Figure 14: Transforming data for meaningful use using semantics
Figure 15: Evaluating Semantic ETL
Figure 16: Evaluating semantic services
Figure 17: Evaluating ResearchIQ semantic search
Figure 18: ResearchIQ utilizes ontological data structure
Figure 19: ResearchIQ Ontology Hierarchy
Figure 20: ResearchIQ Semantic Web Stack
Figure 21(a): Home Page with Autocomplete List. (b): Search Results for "Mass Spec"
Figure 22: Component Diagram
Figure 23: Annotation Pipeline
Figure 24: Annotation Process
Figure 25: Knowledge Graph
Figure 26: Example PubMed Resource
Figure 27: Query Pipeline
Figure 28: Propagation in the Knowledge Graph

LIST OF TABLES

Table 1: 5-star rating for data as suggested in the linked data movement

CHAPTER 1

INTRODUCTION

1. BACKGROUND

In the past decade, there has been great progress in the field of electronic data accrual and dissemination [1]. Data in its electronic form is no longer restricted to a few critical domains; it has pervaded, and in most cases become a requirement in, a variety of commercial and social applications. Consequently, there has been a shift in the way we perceive digital data: from its initial use case as a persistent form of storage that increases the accessibility and longevity of data, to valuable pieces of information from which knowledge can be drawn. The push was to identify and leverage secondary uses of data to extract wisdom from it. Researchers and industry experts realized the value of data, and a vigorous race to collect it began. This race evolved into the "Big Data" challenge that we face now [2]. With vast amounts of data coming from a variety of sources that might not be connected explicitly, it has become necessary to have smart knowledge-based services that can perform intelligent tasks on such data.

Semantic services [3] are viewed as a promising solution to this need. Hence, there has been considerable research on the development of such services. These services add contextual and meta-information to provide richer semantics that can be utilized while performing the services. A few of the key requirements of a good semantic service are identified in the section below.

2. DESIDERATA FOR SEMANTIC SERVICES

The desired features of a semantic service are derived from the demands of key data-related issues as perceived today [4]. Figure 1 shows a mash-up of these requirements and the environments from which they hail.

ABILITY TO HANDLE DATA

Semantic services should by default be able to handle data heterogeneity. In addition, it is desirable that any semantic service be able to process large volumes of data in a reasonable amount of time.

UTILIZE SEMANTIC WEB TECHNOLOGIES EFFICIENTLY

Semantic services rely on semantic web technologies for their functioning. A smart semantic service should maximize its use of standard technological frameworks and methods of handling data. This will make the service extensible and reusable.

SUPPORT A VARIETY OF SERVICES

A single service framework should be designed such that it provides an infrastructure for a variety of data and knowledge-based services, including information retrieval, visualization, data integration and other such services.

PROVIDE A SOLID INTERFACE TO THE USERS

A semantic service is ineffective if it cannot be used efficiently for its intended purposes. The service interface is required to be functional, usable and testable.

Figure 1: Semantic Services in the New Era

3. PROBLEM ANALYSIS

NEED FOR A GENERALIZED FRAMEWORK FOR SEMANTIC SERVICES

The number of semantic web services is increasing progressively, and well-defined representational architectures have been established. However, many of these are specific to a domain or to the service delivered. For instance, a semantic services framework for the healthcare domain is proposed by Niland et al. [5], while Ding et al. [6] and [7] provide frameworks suitable for information retrieval as a semantic service. A generalized framework for the development and evaluation of semantic web services themselves, ones that are truly semantic from end to end, is still lacking. Most semantic web service providers have developed system architectures specific to their application or service. It cannot be disregarded that common elements exist across several of these architectures, yet none of them provides all the components of an end-to-end system on its own. Only after components from several of these separate architectures are integrated can we derive a generalized framework. Also, to the best of our knowledge, no standard has yet been defined to measure the semantic nature of such services. It is crucial to have a method to validate the semantic nature of any service claiming to be a semantic service in the first place. Such validation tools can be used to better identify the requirements of a semantic system and to document the nature of the semantic knowledge utilized by the service.

NEED FOR A SEMANTIC SEARCH PORTAL FOR BIOMEDICAL RESEARCH

The growing demand for innovative and effective methods and treatments by healthcare delivery agencies has put tremendous pressure on the biomedical research community. With the increased use of data-centric methods in biomedical informatics research, it is not surprising that the number of datasets available to researchers is enormous [8]. Though the availability of more data is generally a good thing, it is of limited use if access to it is restricted by its heterogeneous and distributed nature. Moreover, there are very few knowledge-based query tools that can truly cater to the need of perusing this massive amount of data and provide access to credible and relevant resources [9]. Given the sheer extent and expanse of the data and the increasingly sophisticated technologies used to access it, one of the major hindrances to the pace of research is the inability of researchers to discover and leverage resources [10]. For example, information about laboratory equipment and data samples can be scattered across an organization. Similarly, a vast quantity of data in electronic medical records, clinical trials and their results, as well as genomics data, is stored in heterogeneous datasets where it is hard to reach. This lack of easy access to quality data stifles research and collaboration, and is even more pronounced when the data is spread across organizations. Another major need for research is the availability of expertise and collaborators, and the means to find them quickly. The domain of biomedical research suffers from a lack of data standards, both for storage and for interchange. A lot of current research is directed towards the development of these standards and of integrative tools that employ them [11]. Though progress has been made within silos of resource types through the use of domain ontologies, like the Gene Ontology or the MeSH vocabulary, we are far from realizing full semantic connectedness.

Thus, we have identified two problem statements that are addressed in this thesis.

PROBLEM STATEMENT 1: Design a generalized framework for the development of end-to-end semantic services and the validation of their semantic nature. Such a framework should provide an architecture that identifies the key elements required for the implementation of semantic services. The framework should also define a way to check the semantic nature of such services.

PROBLEM STATEMENT 2: Implement a robust knowledge-based service for the biomedical research domain using the architecture proposed by the semantic services framework. The service should provide semantic search capability and serve as a platform for semantic data integration. Validate the semantic nature of the implementation using the baseline metrics proposed in the framework.

4. CONTRIBUTIONS

In this work, a framework that describes the different components required for implementing semantic services has been proposed. The framework provides a software architecture that is generic to all semantic services. Baseline evaluation metrics that apply to all semantic services are also provided as a means to test the “semantic” nature of the developed services.

ResearchIQ [12] is implemented as a semantic search portal for resource discovery for researchers in biomedical informatics. ResearchIQ serves as a framework for semantic data integration and knowledge delivery in the biomedical domain. It provides a case study for testing the architecture and metrics defined in the proposed framework.

CHAPTER 2

RELATED WORK

1. LINKED DATA

The World Wide Web has evolved tremendously over the last decade. The Netcraft web server survey for October 2012 [13] reported responses from over 620 million servers, as can be seen in Figure 2. This number is staggering compared to about 9 million in 2002 and only about 2.5 million in 1998 [14]. The amount of data generated by the Web is simply colossal. The key reasons for this massive expansion [15], along with some examples, are:

INTERNET AVAILABILITY HAS INCREASED AND MORE USERS AND ENTITIES ARE CONNECTED TO THE INTERNET

In a recent study published by Internet World Stats, 63.5% of the population of Europe are Internet users; in North America the figure is 78.6%. The total number of Internet users in the world is five times what it was in 2000.

COMPUTATIONAL POWER HAS INCREASED THE DATA STORAGE AND PROCESSING CAPACITY

Evidence is provided by the amount of data handled by social networking sites [16] like Facebook and Twitter, and by commercial websites like Amazon and Wal-Mart, which generate millions of transactions per day. Google is undoubtedly the leader in this area, with information flowing into petabytes.

LEGISLATIVE NORMS DICTATE THE RECORDING AND RETENTION OF MORE INFORMATION

This is tied to the previous point about the availability of computational power. An example is the adoption of electronic health record systems by health care providers and organizations across developed countries and many of the developing countries.

Figure 2: Size of the Web. (Source: netcraft.com)

This explosion of generated information led to the idea of Big Data [17]: collections of data sets so large and complex that conventional data management becomes obsolete. Beyond volume, heterogeneity of the data is yet another effect of the data explosion. To put this voluminous data to good use, it needs to be organized in an orderly and efficient manner. The increased use of data analytics and smarter information retrieval has led to an unfulfilled and growing need for access to such data.

This is the essence of Linked Data. The term Linked Data refers to a set of best practices for publishing and interlinking structured data on the Web, as defined by [18]. The goal of linked data is to revolutionize the way we discover, access, integrate and use data. Tim Berners-Lee emphasized the need for linked data, its uses and, most significantly, the principles [19] of linked open data (LOD) [20]:

• Use Uniform Resource Identifiers (URIs) to name entities. These entities can be tangible things like people, places and publications, but they can also be abstract, such as relationships or sets of entities.

• As far as possible, use HTTP URIs so that people can look up these entities.

• Provide useful meta-information using established standards.

• Provide appropriate valid URIs linking to your entity.

A key factor that determines how reusable data is, is the extent to which it is well structured. Table 1 gives a simple 5-star evaluation of data put on the Web, proposed by Tim Berners-Lee [19]. It should be noted that the stars are achieved progressively; thus, the linked data framework can be adopted hierarchically rather than in a single "go."

★ Make your stuff available on the web (whatever format)

★★ Make it available as structured data (e.g. Excel instead of image scan of a table)

★★★ Non-proprietary format (e.g. csv instead of excel)

★★★★ Use URLs to identify things, so that people can point at your stuff

★★★★★ Link your data to other people’s data to provide context

Table 1: 5-star rating for data as suggested in the linked data movement

In conclusion, linked data leverages the architectural principles of the Web as we know it and reengineers it into a global data space. The properties of the linked data framework allow data consumers to discover, re-use and integrate data. Linked data aids this integration of data from heterogeneous sources through its reliance on shared vocabularies. One issue with linked data is the lack of explicit licensing information provided by publishers, which leaves data consumers skeptical about using the data. Another related issue is the lack of a proper unified forum to advertise the published data. Being relatively new, linked data has many competing standards, which leads to confusion about how these can be connected to each other. With the tremendous growth in the linked open data (LOD) initiative over the past couple of years, as can be seen in Figure 3 [18], there has been corresponding growth in the adoption of the semantic web and semantic web technologies.


Figure 3: Growth in Linked Open Data

Figure 4: The Linked Open Data Cloud

2. INTRODUCTION TO SEMANTIC WEB

Tim Berners-Lee coined the term "Semantic Web" [20] and defined the web of linked data in his book published in 2009 [21]. He defined it as a "web of data that can be processed directly and indirectly by machines," true to the philosophy of linked data. The World Wide Web Consortium (W3C) defined it as a network "that facilitates machines to understand the semantics, or meaning, of information on the WWW" and has called it Web 3.0 [22]. It extends the network of hyperlinked, human-readable web pages by inserting machine-readable metadata about pages and how they are related to each other, enabling automated agents to access the Web more intelligently and perform tasks on behalf of users. Even though the idea of a semantically structured web has been around for about a decade, only the recent push towards the ideologies of Linked Data and Big Data has brought it into the limelight. As mentioned before, semantic web technologies have matured significantly only in the last few years. The linked open data cloud [23] as of September 2010, shown in Figure 4, bears testimony to the potential of the semantic web. The different colors represent the broad domains to which the data sets belong. Linked data and semantic web technologies have pervaded significantly through the domains of publications, geographic data, government data and life sciences data. The heterogeneity and scope of the data are clearly visible from the graph.

A reasonable and now well-established assumption is that we can no longer look at the Web in the traditional sense, as a collection of interconnected documents. The linked data movement put forth the idea of a "web of things" [24], wherein we look at the web as a collection of entities linked by well-defined relationships. As can be seen in Figure 5, the different entities and their relationships form an intricate graph structure. Hence, an alternative view of the web is as a "Giant Global Graph" of data [25]. The advantage of looking at the web in this way is that graph-based algorithms can now be applied for information retrieval and analysis. Moreover, since the "edges" of this graph are well-defined relationships, additional sophistication can be applied to these algorithms to provide semantic services.
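The graph view described above can be made concrete with a minimal sketch. The entities and relationship names below are invented for illustration; the point is that once the web is modeled as labeled edges, a standard graph algorithm such as breadth-first search can discover how two entities are connected:

```python
from collections import deque

# Entities as nodes, well-defined relationships as labeled directed
# edges. These toy facts are illustrative, not from a real dataset.
EDGES = [
    ("IBM", "headquartered_in", "Armonk"),
    ("Armonk", "located_in", "New York"),
    ("New York", "part_of", "United States"),
    ("IBM", "has_lab_in", "Zurich"),
]

def shortest_relationship_path(edges, start, goal):
    """Breadth-first search over labeled edges; returns the list of
    (relationship, entity) hops connecting start to goal, or None."""
    adjacency = {}
    for s, p, o in edges:
        adjacency.setdefault(s, []).append((p, o))
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for rel, nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(rel, nxt)]))
    return None

print(shortest_relationship_path(EDGES, "IBM", "United States"))
# [('headquartered_in', 'Armonk'), ('located_in', 'New York'), ('part_of', 'United States')]
```

Because the edges carry relationship labels, the returned path is not just a chain of nodes but a readable chain of facts, which is exactly what makes graph algorithms over such data "semantic."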

Figure 5: Comparison Between Traditional (Web 2.0) and Semantic Web

The availability of data and the computational power to perform analytics over it have triggered the use of large-scale data analytics in commercial, academic and research environments. However, the mere availability of data is not sufficient for these tasks. Ultimately, the amount of computation that can be performed on the data is governed by:

• How easy is it to find the data?

• Is the data available for reuse?

• How structured is the data?

The purpose of the semantic web is to answer these questions by providing a global framework to find, share and integrate data easily. Additionally, it helps developers re-use, analyze and exchange data for knowledge-driven applications. Thus, the semantic web is needed to transform raw data into usable knowledge. Semantic web technologies provide the platform for a variety of intelligent web-based activities [26], significant among which are data interoperability, cross-data referencing and inference, and the integration of web services. Figure 6 summarizes the various applications that can be supported, thus exemplifying the need for a semantic web.

Figure 6: Semantic Web Services and Intelligent Business

MYTHS ABOUT SEMANTIC WEB

There are several misunderstandings about the semantic web. In his book "Semantic Web for Dummies," author Jeff Pollock [27] elaborated on some of the top myths, which are reiterated here:

• SEMANTIC WEB IS SCIENCE FICTION: Though it is true that a full-fledged implementation of Web 3.0 has not yet been achieved, it cannot be claimed that the semantic web does not exist. It can be seen in bits and pieces all around us. We brief over some of the many success stories of functioning semantic-web-based systems in a later section of this document.

• SEMANTIC WEB IS FOR TAGGING WEB SITES ONLY: Though the idea of the semantic web was conceptualized primarily for tagging web pages, that is hardly the full extent of its scope. As discussed earlier, the semantic web can enhance a variety of data-driven services. It provides for generalized views and services by grouping entities across domains and geographical boundaries, which points to the future directions of social and business interactions and networking. At the same time, it has the capacity to enable personalized services due to the ample meta-information attached to entities. Semantic web technologies support many services other than just linking. In fact, the semantic search service that is the topic of this thesis provides a counterexample to this myth.

• SEMANTIC WEB IS TOO COMPLICATED TO SUCCEED: The progressive approach to adopting linked data applies to the implementation of the semantic web as well. There is no need to implement the whole thing overnight. Simple beginnings will help the semantic web evolve, the way they did with the traditional web.

• ALL DATA ON THE SEMANTIC WEB IS IN ONTOLOGICAL FORMAT: Eventually, maybe, but there are several intermediate stages that also contribute to the semantic web. It is not necessary that the data you publish be in ontological format. This ties into the previous myth: as long as there is a clear understanding of the principles of linked data, a contribution to the semantic web can be made.

3. SEMANTIC WEB (WEB 3.0) TECHNOLOGIES

This section elaborates on the technologies that have been developed and are in use for supporting the semantic web. It should be noted that all of these technologies and standards have been established to facilitate the publishing and exchange of data in the semantic web. The semantic web stack demonstrates the different layers of services. We shall look at the key concepts in detail.

Figure 7: Semantic Web Stack [28]

HYPERTEXT (WEB 2.0) TECHNOLOGIES

Since the semantic web is based on the traditional web, the bottommost layers are formed of the existing Web 2.0 technologies. The Identifier and Character Set are universally accepted; these, along with the Syntax, are part of the current web architecture. The XML [29] syntax should be noted, as it set the groundwork for creating and exchanging structured data over the Web. However, on its own it lacked the capability to represent complex data structures. Since XML was essentially used to feed web services, it only allowed tree structures for fear of causing the services to loop infinitely.

RESOURCE DESCRIPTION FRAMEWORK (RDF)

The W3C specified the Resource Description Framework (RDF) [30] as a way of representing information, originally to model meta-data; it is the W3C's recommended framework for data interchange. Almost immediately after XML was conceived, the need for RDF was realized [31]. Over the years there have been continuous refinements to the framework, but the fundamentals have remained the same. The basic structure of RDF is very similar to the entity-relationship format in relational databases. Data is represented in the form of "triples" as subject – predicate – object. The "subject" and "object" are entities, while the "predicate" is an abstract link or relationship between the two entities. For example, the graph in Figure 8 [32] has two triples. The subject in both cases is "IBM", which is of type "company". Thus, the triples are:

1. "IBM" – "Headquarters located in" – "Armonk, New York, United States" and

2. "IBM" – "Research lab located in" – "Zurich, Switzerland"

All the subjects, objects and their relationships are defined using Uniform Resource Identifiers (URIs). The insistence on URIs prevents ambiguity due to replication of resources and allows others to understand and use your data more effectively, so care should be taken while assigning URIs to resources.

Figure 8: Example RDF Graph

Since all the triples are directed in nature, the resulting structure is a directed graph. A collection of triples, like the simple one above, is stored in "triplestores." A triplestore is purpose-built for storing and retrieving triples. Several triplestore implementations, including Jena and Sesame, are available for use.
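As a toy illustration of the idea behind a triplestore (this is not the API of Jena or Sesame), the triples above can be held as plain (subject, predicate, object) tuples and queried by pattern matching, with None acting as a wildcard. Identifiers are shortened strings here for brevity, whereas a real store would use URIs:

```python
# A minimal in-memory "triplestore" sketch: triples stored as tuples.
triples = {
    ("IBM", "type", "Company"),
    ("IBM", "headquarters_located_in", "Armonk, New York, United States"),
    ("IBM", "research_lab_located_in", "Zurich, Switzerland"),
}

def match(store, s=None, p=None, o=None):
    """Return all triples matching the pattern; None is a wildcard."""
    return sorted(t for t in store
                  if (s is None or t[0] == s)
                  and (p is None or t[1] == p)
                  and (o is None or t[2] == o))

# All facts about IBM:
for t in match(triples, s="IBM"):
    print(t)

# Where is IBM's research lab?
print(match(triples, s="IBM", p="research_lab_located_in")[0][2])
# Zurich, Switzerland
```

This triple-pattern-with-wildcards style of retrieval is the core operation that query languages over triplestores build upon.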

RDF SCHEMA (RDFS)

RDF Schema [33] is a knowledge representation method recommended by the W3C to define RDF structures. It defines the basic entities within RDF and provides basic constructs such as Classes and Properties, giving logic and structure to raw RDF.
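The kind of structure RDFS adds can be sketched minimally. The class names and hierarchy below are invented for illustration; the inference shown (an instance of a class is also an instance of its superclasses, via rdfs:subClassOf) is computed with a simple upward walk:

```python
# Illustrative class hierarchy, in the spirit of rdfs:subClassOf.
# Each class maps to its (single, for simplicity) superclass.
subclass_of = {
    "ResearchArticle": "Publication",
    "Publication": "Resource",
    "Dataset": "Resource",
}

def inferred_types(direct_type, hierarchy):
    """Walk subClassOf links upward to collect every class an
    instance belongs to, not just its directly asserted type."""
    types = [direct_type]
    while types[-1] in hierarchy:
        types.append(hierarchy[types[-1]])
    return types

print(inferred_types("ResearchArticle", subclass_of))
# ['ResearchArticle', 'Publication', 'Resource']
```

Real RDFS allows multiple superclasses and also constrains properties (domain and range); this sketch only shows the class-hierarchy aspect.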

ONTOLOGIES

Tom Gruber proposed the idea of ontologies in the information sciences back in 1993 [34]. He said, "An ontology is a description (like a formal specification of a program) of the concepts and relationships that can formally exist for an agent or a community of agents. This definition is consistent with the usage of ontology as set of concept definitions, but more general." The modern definition, again given by Gruber [35], is not very different from the original. In 2008 the term was redefined as, "An ontology defines a set of representational primitives with which to model a domain of knowledge or discourse," giving more insight into the goal of defining an ontology. In general, an ontology can be seen as a well-defined vocabulary for a particular domain, containing not only the definitions of all entities within its scope but also stating explicitly the relationships between these entities [36].

In essence, ontologies are a more expressive and refined version of the RDF and RDFS previously described. They are built on the same basic principles but incorporate additional components like Individuals, Attributes and Events in addition to just Classes and Relations. This means that ontologies can also be stored in the form of triplestores, provided there is some language to represent and query them. In fact, a fully formed ontology is itself, in essence, a triplestore. Thus, ontologies can be used to represent the schema of the data being stored as well as the actual data values.

There are several representational languages that can formulate an ontology. One of the most popular, also recommended by the W3C, is the Web Ontology Language (OWL) [37]. OWL is preferred because it is based on the original RDF and RDFS; additionally, it allows for reasoning and inferencing over the structure of the ontology [38]. Other formats such as N-Triples or N3 [39] are also very popular. These representations are typically interchangeable, and it is not surprising that all of them are valid representations for triplestores.
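One of these interchangeable formats, N-Triples, is simple enough to sketch: each statement is a single line of the form `<subject> <predicate> <object> .`. The example.org URIs below are made up for illustration:

```python
# Serialize (subject, predicate, object) tuples to N-Triples lines.
# The namespace is a made-up example; real data would use the URIs
# under which the entities are actually published.
EX = "http://example.org/"

def to_ntriples(triples):
    """Emit one '<s> <p> <o> .' statement per line."""
    return "\n".join(f"<{EX}{s}> <{EX}{p}> <{EX}{o}> ." for s, p, o in triples)

data = [("IBM", "type", "Company"),
        ("IBM", "researchLabLocatedIn", "Zurich")]
print(to_ntriples(data))
# <http://example.org/IBM> <http://example.org/type> <http://example.org/Company> .
# <http://example.org/IBM> <http://example.org/researchLabLocatedIn> <http://example.org/Zurich> .
```

Because the format is line-oriented, the same set of triples round-trips trivially between a triplestore and a flat file, which is part of why these representations are interchangeable in practice. (Real N-Triples also has quoting rules for literal values, omitted here.)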

ADDITIONAL LAYERS

Over these core layers lie the Unifying Logic and Proof layers. The proof layer is used to validate the underlying structure and logic. Though optional, it is beneficial and good practice for applications to include the proof layer.

CHAPTER 3

THE SEMANTIC SERVICES FRAMEWORK

1. SEMANTIC WEB APPLICATIONS

In this section we illustrate a few case studies where semantic web technologies have been used successfully. This is to demonstrate the scope of these technologies, especially beyond just linking of data on the web.

CASE 1: LINKING GENOMIC DATA USING THE GENE ONTOLOGY

The Gene Ontology (GO) [40] was developed to facilitate a unified representation of genes and gene products across all species and to provide easy access to the annotated data. The Gene Ontology project provides an ontology of defined terms representing gene product properties. Prior to GO there was no accepted standard for gene representation, which made sharing genomic data very difficult.

Three domains are covered by this ontology:

• Cellular Component: the parts of a cell or its extracellular environment.

• Molecular Function: the elemental activities of a gene product at the molecular level, such as binding or catalysis.

• Biological Process: operations or sets of molecular events with a defined beginning and end, pertinent to the functioning of integrated living units: cells, tissues, organs, and organisms.

Figure 9: Gene Ontology (Source: Nature.com)

Figure 9 shows an excerpt of the gene ontology specific to cancer gene annotations. As can be seen in the figure, the gene ontology is structured as a directed acyclic graph (DAG). The gene ontology continues to grow as researchers contribute new annotations as the corresponding samples are discovered [41]. Several tools have been developed to visualize as well as annotate data using GO; most notable among these are AmiGO [42], GOCat [43] and OBO-Edit [44].

The gene ontology allowed automated analysis and reasoning procedures to be run over the ontology structure. These procedures generated inferred genomic annotations that are critical for research. GO is thus an example of using an ontology for standardization and of creating an environment for integrating and sharing data.

CASE 2: GOOGLE MOVING TO SEMANTIC SEARCH

In 2012, Google announced that it would use semantics to improve its web search [45]. It uses a "knowledge graph," illustrated in Figure 8(a) below, which is essentially a graph of connected entities similar to the RDF graph structure. The idea is that this should provide more contextual information in order to give better search results, or at least allow the results to be portrayed in a format that gives more context to the users.

Figure 8(b) is a search example for the keywords "Barack Obama." Google is able to identify the most likely entity related to these keywords and provide search results accordingly. The search is separated into several categories, as seen in the figure. This effective integration of results is achieved using semantic web technologies.

It is quite a challenge to design semantic services that effectively utilize background information. This semantic search structure is an example of a semantic service that uses the available contextual and semantic data effectively.

Figure 8(a): Visualization of the Google Knowledge Graph (Source: cnet.com). (b): An example search for "Barack Obama" showing the different semantic types of results: search results, news, social media, videos, images, related searches and meta information (Source: Google.com).

CASE 3: ONTOLOGICAL ACCESS CONTROL FRAMEWORK BY METADB

The MetaDB [46] project was designed as an effective system for providing role-based access control to federated data in project-oriented environments. It uses an ontology to organize the different roles within and across organizations and the access rights assigned to each. Since such organizations can become quite complicated, an ontology-based architecture provides easy identification and extensibility. Figure 9 shows an example ontology for providing access control within the MetaDB system.

Figure 9: Ontology for role-based access control in MetaDB [46]

These are but a few examples of what semantic web technologies can achieve. The rise in adoption of semantics in the way we create, analyze and share data will, in turn, increase the number of semantic services and applications exponentially.

2. THE PROPOSED FRAMEWORK

As seen in the previous section, a wide range of applications and services can be implemented using the semantic web framework. We refer to these as semantic web services. Web services, as we know them, are interaction based and machine understandable. Traditional web services are based on WSDL [47]; the new semantic web services are designed using RDF. Figure 10 [48] shows the evolution of semantic web services and where they fit relative to the traditional web and web services.

Figure 10: Evolution of semantic web services [48]

Figure 11 describes the proposed framework. It is necessary to separate data generation from data consumption by taking a modular approach to semantic web services. The resources fed into semantic web services can be monolithic, but usually they are disparate and heterogeneous; hence the need for semantics. The initial data extraction and annotation together form the "Semantic ETL" process, whose goal is to associate semantic meta-information with the data, transforming it by annotating the appropriate semantic content within the raw data.

Let us look at each of the Extract, Transform (Annotate) and Load processes in the semantic context in detail.

[Figure: resources flow through a Data Extractor and a Semantic Annotator (backed by ontological knowledge) into a Triple Store, which feeds a Semantic Services Bus and, finally, applications.]

Figure 11: Proposed semantic services framework

SEMANTIC ETL PROCESS

DATA EXTRACTION

The semantic data extractor needs to handle different types of structured data sets, such as relational databases or other triplestores, and semi-structured data such as other web services and scientific corpora. It should also be able to digest completely unstructured data. For large integrative systems the extractor can be subdivided into modules for different data sources. Doing so is in fact beneficial, as it makes the system more scalable and extensible: newer resources and data sets can be extracted by adding the appropriate module. The data extractor feeds the semantic annotator, which acts as the transformer in the semantic ETL process.
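As a concrete illustration of this modular design, the sketch below (in Java, with entirely hypothetical names and stub data, not ResearchIQ's actual code) puts each source behind a common extractor interface, so adding a new source means adding a new module:

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a modular extractor: each data source gets its own
// module behind a common interface, so new sources can be added without
// touching the rest of the pipeline.
public class ExtractorSketch {

    // A raw record handed to the semantic annotator: the text to annotate
    // plus source-level metadata (name, URL, resource type, ...).
    record RawRecord(String text, Map<String, String> metadata) {}

    interface DataExtractor {
        String sourceName();
        List<RawRecord> extract();
    }

    // One module per source; these stubs stand in for a real RDBMS
    // connector, web crawler, or XML parser.
    static class RdbmsExtractor implements DataExtractor {
        public String sourceName() { return "clinical-trials-db"; }
        public List<RawRecord> extract() {
            return List.of(new RawRecord("Phase II trial of drug X",
                    Map.of("type", "ClinicalTrial")));
        }
    }

    static class WebExtractor implements DataExtractor {
        public String sourceName() { return "lab-websites"; }
        public List<RawRecord> extract() {
            return List.of(new RawRecord("Mass spectrometry core facility",
                    Map.of("type", "Website")));
        }
    }

    // The pipeline simply iterates over whichever modules are registered.
    static List<RawRecord> runAll(List<DataExtractor> modules) {
        return modules.stream().flatMap(m -> m.extract().stream()).toList();
    }

    public static void main(String[] args) {
        List<RawRecord> records = runAll(List.of(new RdbmsExtractor(), new WebExtractor()));
        System.out.println(records.size() + " records extracted");
    }
}
```

Adding a new data source then amounts to implementing `DataExtractor` once and registering the module.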

SEMANTIC ANNOTATOR

The semantic annotator is the most crucial step in the semantic ETL process. Like the transform step of a traditional ETL process, it is responsible for turning the data arriving from the extraction phase into a consolidated, standardized and usable data store. The difference is that in semantic ETL the annotator stores semantic annotations and metadata of the data along with the direct data values. The semantic annotator also relies on ontological knowledge to transform the data. The level of abstraction of the ontological information used, and the type of knowledge it represents, depend on the scope of the data set being aggregated; typically this should be a domain-specific collection of ontologies and controlled vocabularies. There are several ways of storing semantic information, and it need not be restricted to ontologies. The different levels of data representation are shown in Figure 12, a modified version of the McCandless pyramid [49]. Ontologies provide the deepest semantic knowledge and context. Storing the information in one of these formats is the job of the loader.

LOADER AND TRIPLESTORE

The loader should store information in a triplestore as far as possible. The ontological format for storing information has several benefits, despite being slower than a traditional RDBMS.

• Ontologies have very rich schema descriptions, because every relationship is explicitly specified within the ontology. This information might not be available for fields within a table in traditional RDBMS systems.

• Conformance to ontologies, especially standard ontologies, immediately makes the data portable and reusable. This ranks high in the five-star evaluation discussed earlier in this document.

• The schema (including constraints and inference logic), the metadata and the data itself are all bundled into one single data store.

Once all the data is loaded, it can be exposed to data consumers. It is important that the standard ontologies on which the ETL process is based are made explicit and open to other users, to facilitate effective use of the data.
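The point about bundling schema, metadata and data in one store can be illustrated with a minimal in-memory sketch; the `Triple` record, the prefixes and the example statements below are illustrative stand-ins, not a real triplestore:

```java
import java.util.HashSet;
import java.util.Set;

// Minimal in-memory illustration (not a real triplestore) of the point
// above: schema statements, metadata and data values all live in the same
// store as subject-predicate-object triples.
public class TripleStoreSketch {
    record Triple(String subject, String predicate, String object) {}

    static class TripleStore {
        private final Set<Triple> triples = new HashSet<>();
        void add(String s, String p, String o) { triples.add(new Triple(s, p, o)); }
        long countWithPredicate(String p) {
            return triples.stream().filter(t -> t.predicate().equals(p)).count();
        }
        int size() { return triples.size(); }
    }

    public static void main(String[] args) {
        TripleStore store = new TripleStore();
        // A schema-level triple (class definition)...
        store.add(":Publication", "rdf:type", "owl:Class");
        // ...a metadata triple typing a resource instance...
        store.add(":pubmed_123", "rdf:type", ":Publication");
        // ...and a data-level triple, all in one store.
        store.add(":pubmed_123", "dc:title", "A study of biomarkers");
        System.out.println(store.size() + " triples");
    }
}
```

A production system would of course use a persistent RDF store, but the uniform triple shape is what makes the bundling possible.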


Figure 12: Transforming data for meaningful use using semantics

3. SEMANTIC WEB SERVICES EVALUATION

Data consumers develop the semantic web services, which form the applications on the semantic service bus. As the earlier use cases show, these services can range from information retrieval to analysis to visualization. The breadth of the semantic service bus depends in turn on the ontological knowledge used during the semantic ETL process. It is important to evaluate the different semantic services irrespective of their type. Since standard evaluation methods for web services already exist, the additional evaluation needed for semantic web services is to ask how semantic the web service actually is.

First we need to evaluate the semantic web services bus. A baseline judgment could be the number of different types of services and applications it can support. This depends highly on the outcomes of the semantic ETL process and the way the data is represented in the data store. Borrowing guidelines from the linked data movement, we can define a five-star rating, the levels of which can be seen in Figure 12: storing raw data earns a single star, while a fully ontological triplestore earns five stars.

Now we evaluate the semantic services and applications themselves. Giunchiglia et al. [50] used a three-point evaluation for search engines. Here we extend that evaluation to encompass all types of semantic services. The axes have been selected to provide a means of validating the semantic nature of a service and of allowing comparison with other semantic services irrespective of their type. Figure 14 shows the proposed evaluation criteria, which apply to all semantic services. The evaluation is done on three axes.

[Figure: a pyramid of data representations, from least to most semantic: raw data (structured or unstructured); syntactic metadata (tag sets, bags of words, descriptions); structural metadata (schemas, XML, JSON); vocabularies and taxonomies; and, at the apex, ontologies.]

Figure 13: Evaluating Semantic ETL


Figure 14: Evaluating semantic services

1. THE KNOWLEDGE CONSUMED BY THE SERVICE

The service, irrespective of its type or domain, may utilize only up to a certain level of meta-information. The deeper the semantics involved, the richer the semantic understanding of the service. The range of semantic meta-information types is portrayed in Figure 13; the scale runs from purely syntactic forms of knowledge representation to highly semantic ones. Understandably, ontologies form the apex of semantic meta-information representation. They provide the most precise and complete semantic knowledge, and hence any service able to articulate its usage of such a representation can be considered more "semantic" in nature.

2. GRANULARITY OF THE ENTITY REASONED UPON

Semantic web services make use of a semantic graph of data to provide their service. They may differ in what they use as the nodes, the "entities," of this graph. Though this criterion is better suited to information retrieval services, it is still a good idea to define the granularity for all services for greater transparency. For instance, a visualization service might be restricted to the level of visualization it is intended to provide, but specifying that level helps other services and applications interact with it better. The range of granularity runs from individual words to more conceptual representations; this can also be seen as ranging from syntactic to semantic representation of the entities of the knowledge graph used.

3. UNDERSTANDABILITY OF USER INPUT

The lowest form of understandability is when user input is parsed as keywords. Better understanding of the input, in the form of clusters and phrases, requires better parsing capabilities. The ultimate semantic parser would understand input in natural language and extract the semantic information necessary to form the query.

This can be thought of as a white box testing method for evaluating the semantics involved in the system. The semantic evaluation should provide better transparency of the web service. This in turn will help develop better applications and interfaces to these services. The evaluation metrics should provide a basis for performing semantic level comparisons between different semantic services irrespective of the domain of their implementation or the kind of service they provide.
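One hypothetical way to operationalize the three axes is to treat each as an ordinal scale and score a service by its position on each. The enum level names below follow the text, while the numeric scoring (summing ordinal positions) is purely an illustrative choice of ours, not part of the proposed framework:

```java
// Hedged sketch of the three evaluation axes as ordinal scales.
public class SemanticEvaluationSketch {
    enum Knowledge { RAW, SYNTACTIC_METADATA, STRUCTURAL_METADATA, VOCABULARY, ONTOLOGY }
    enum Granularity { WORD, PHRASE, CONCEPT }
    enum InputUnderstanding { KEYWORDS, PHRASES, NATURAL_LANGUAGE }

    // Each enum's ordinal position serves as its score on that axis.
    static int score(Knowledge k, Granularity g, InputUnderstanding u) {
        return k.ordinal() + g.ordinal() + u.ordinal();
    }

    public static void main(String[] args) {
        // A keyword search over raw text scores lowest...
        int baseline = score(Knowledge.RAW, Granularity.WORD, InputUnderstanding.KEYWORDS);
        // ...while an ontology-backed, concept-level, natural-language
        // service scores highest.
        int ontological = score(Knowledge.ONTOLOGY, Granularity.CONCEPT,
                InputUnderstanding.NATURAL_LANGUAGE);
        System.out.println(baseline + " vs " + ontological);
    }
}
```

Such a score only supports the coarse comparisons between services that the text describes; the axes remain qualitative.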

The proposed framework is exemplified in the implementation of the "ResearchIQ" project. The goal of the project is to provide semantic search capabilities to researchers over several heterogeneous data sources in the domain of biomedical informatics. The subsequent chapters of this thesis are dedicated to ResearchIQ.

CHAPTER 4

INTRODUCTION TO RESEARCHIQ

In this section the Research Integrative Query (ResearchIQ) tool [51] is presented. It is a semantically anchored resource discovery platform that facilitates the discovery of semantically related local and publicly available data through a single web portal. The tool uses an ontological framework for annotating and linking resources. These semantic linkages are subsequently leveraged while searching for relevant knowledge and resources. The user interface is designed to be simple and intuitive, taking HCI factors into consideration. ResearchIQ has been developed to act as a front door to different forms and types of information, specifically for researchers in the clinical and translational sciences domain. ResearchIQ attempts to break the barrier between multidimensional data and researchers by providing a simple means to find much-sought-after quality data. The system relies heavily on ontologies to perform all its functions. This ontological framework makes the system extensible and able to use semantics thoroughly, not just at the surface.

1. GOALS OF RESEARCHIQ

The broad goals of ResearchIQ are summarized as follows:

• Integration of heterogeneous data sources in the biomedical informatics domain.

• Consolidated search capability over heterogeneous data sets.

• Semantic search capability to provide more meaningful results.

• A single, easy-to-use search portal for resource discovery.

2. CHALLENGES

There are several inherent challenges when handling large and heterogeneous data sets.

These problems are even more pronounced when combined with complications related to semantic metadata. The key problems faced by the system depend on the kind of service provided. Since ResearchIQ is essentially an information retrieval system, the following section highlights the key challenges in such systems.

CHALLENGES IN SEMANTIC DATA EXTRACTION

Converting raw textual data into semantic representations is a research problem in its own right. Several natural language processing algorithms have been used, with varying degrees of effectiveness, to annotate text with semantic structures. It is a very challenging task and a constant cause of debate in the semantic web community. Since a lot of the textual content is discarded, the process of annotation is inherently lossy. Moreover, the precision and recall required of the generated annotations are defined by the intended application, which makes the task even more arduous.

PROBLEMS WITH HETEROGENEITY

There are several types of heterogeneity that make the task of integrating information particularly difficult. Here we are not referring to the technical heterogeneity involved in bringing in data from different sources, but to the higher-level heterogeneity that dealing with any data entails. Such heterogeneity can be broadly classified as follows:

• Syntactic Heterogeneity

This results from different representations of the same data. For example, "Date of Birth" in one data source is referred to as "DOB" in another. It can even be as simple as data being in different languages.

• Schematic or Structural Heterogeneity

The structure of the data stored in different data sources differs. To continue the earlier example of "date of birth," it is easily conceivable that the actual format for storing dates differs: one source might use "mm/dd/yyyy" while another uses "dd-mm-yy."

• Semantic Heterogeneity

The meaning of data changes with context. It is quite a challenge to determine the right meaning in natural text, and even more so for abbreviations. Even though the problem is most prominent across domains, the field of biomedicine is riddled with abbreviations that make it critical. For instance, "TOD" stands for "time of death" in the clinical domain while it refers to "tricho onychodental dysplasia" in the medical context; it is quite possible that both occur in the same text.

• System Heterogeneity

Though this kind of heterogeneity is not present at the application layer, it is included here for completeness. It refers to heterogeneity derived from the use of different operating systems or versions of the same. The big-endian versus little-endian problem is a classic example.
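The structural heterogeneity in the "date of birth" example can be resolved by normalizing every incoming format to one canonical representation. A minimal Java sketch, assuming only the two formats mentioned above (note that two-digit years remain inherently ambiguous and are resolved here by Java's default century pivot):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.List;

// Sketch: the same "date of birth" value arrives as "mm/dd/yyyy" from one
// source and "dd-mm-yy" from another, and is normalized to a canonical
// LocalDate by trying each known format in turn.
public class DateNormalizer {
    private static final List<DateTimeFormatter> KNOWN_FORMATS = List.of(
            DateTimeFormatter.ofPattern("MM/dd/yyyy"),
            DateTimeFormatter.ofPattern("dd-MM-yy"));

    static LocalDate normalize(String raw) {
        for (DateTimeFormatter f : KNOWN_FORMATS) {
            try {
                return LocalDate.parse(raw, f);
            } catch (DateTimeParseException ignored) {
                // try the next known format
            }
        }
        throw new IllegalArgumentException("Unrecognized date format: " + raw);
    }

    public static void main(String[] args) {
        System.out.println(normalize("07/04/1976")); // prints 1976-07-04
        System.out.println(normalize("04-07-76"));   // month and day recovered; the century is guessed
    }
}
```

A real integration layer would also record which source each format came from, since format lists like this grow with every new source.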

PROBLEMS WITH AMBIGUITY

Ambiguity refers to the difficulty of understanding search terms, both while accepting input from the user and during reasoning while performing the search. The basic problems are similar to those of heterogeneity but occur at different stages of the system. Typically, ambiguity arises when dealing with natural language. The types of problems faced are:

• Homonyms

A word with the same spelling may have different meanings in different contexts. This type of ambiguity is very hard to detect, especially when dealing only with keywords as inputs, as is the case with most search engines. A simple example is the word "tire," which means both to wear down and a rubber ring.

• Polysemy

This is a special case of homonymy in which the two meanings of the word are related to each other. For instance, the term "age" in clinical notes might mean "present age," "age at time of surgery" or "age at time of admission."

• N-types

Derivatives of the same word, like "corpus" and "corpora." It is essential to know that these refer to the same concept in order to reduce ambiguity in the search.

• Semantic

These cases are the hardest to detect and arise from the structure of the language itself. The sentence "I cooked her goose" has several different meanings.
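The n-types case can be handled by collapsing derivatives onto a canonical form before searching. The sketch below is illustrative only: a real system would draw its variants from UMLS rather than from the toy map shown here:

```java
import java.util.Map;

// Illustrative sketch (not ResearchIQ's actual logic) of collapsing
// derivatives of a word -- the "n-types" case above -- onto one canonical
// concept, so "corpus" and "corpora" hit the same node in the search.
public class ConceptNormalizer {
    // Hypothetical stand-in for a UMLS-derived variant table.
    private static final Map<String, String> VARIANTS = Map.of(
            "corpora", "corpus",
            "corpuses", "corpus",
            "analyses", "analysis");

    static String canonical(String term) {
        String t = term.toLowerCase();
        return VARIANTS.getOrDefault(t, t);
    }

    public static void main(String[] args) {
        System.out.println(canonical("Corpora")); // prints corpus
    }
}
```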

CHALLENGES OF RESULT DISSEMINATION

It is important to deliver the results of any information retrieval system in a suitable manner. This problem becomes more difficult as the number of different types of resources increases. The system should be designed with an understanding of the constraints and capabilities of the user.

3. CONTRIBUTIONS WITHIN RESEARCHIQ

To better understand the contributions made by ResearchIQ, examples and comparisons with other state-of-the-art semantic systems for information retrieval are provided wherever possible.

DATA SOURCES

As mentioned earlier, several heterogeneous knowledge sources have been incorporated as resources within the ResearchIQ framework. These form a mixture of structured, semi-structured and completely unstructured data sets, each treated as a separate resource type. A list of the resource types as of September 2012, with their status and a short description of each, is given below:

PUBMED

NCBI PubMed is one of the most comprehensive collections of publications in the medical domain. ResearchIQ currently holds over 14,000 articles from PubMed, comprising all publications related to OSU from 2004 to the present (July 2012).

OSUPRO

OSUPro contains profiles of faculty and researchers within OSU. We have currently added over 1,200 profiles related to biomedical informatics.

CLINICAL TRIALS

These come from two sources: the publicly available clinical trials data sets from clinicaltrials.gov, and the studies registered in the StudySearch databases. About 950 of these are currently annotated.

WEBSITES

These are the websites of the various labs, departments, resources and pieces of equipment scattered across the OSU medical center, for which a single place to find information is needed. About 200 selected websites have been included as resources in ResearchIQ.

GRANTS.GOV

Grants.gov holds calls for grant proposals from various funding sources, such as the NIH and the Department of Public Health. We currently have about 500 grants.

LARGE AND CONNECTED TRIPLESTORE

Even though ResearchIQ is still in the initial stages of deployment, it has collected a large number of different resources, and the project has connected the metadata of the different data sets with reasonable success. For example, the people resources within OSUPro are connected to their respective publications in PubMed. Similarly, connections are made between labs and instruments and the people and services within OSU. These connections are generated semi-automatically: the system relies on statistical and natural language processing (NLP) techniques to recommend possible connections in the triplestore, provides confidence scores with these recommendations, and automatically makes the connections it considers "obvious," meaning that the confidence score is sufficiently high. Manual annotators make the remaining connections of "fair" confidence where applicable. Informal observation of the recommendations has shown that the system is fairly accurate.

ResearchIQ also relies on standard ontologies in BioPortal and the Open Biomedical Ontologies for data representation, and the content management infrastructure is likewise described in standard ontologies. The details of the ontological structure of ResearchIQ are provided later in this document. The result of this reliance on standard ontologies and of semantic cross-referencing between the different data sources is a highly connected triplestore that is portable and extensible. The current size of this triplestore is ## triples, and the project is growing rapidly. The triplestore itself can be used to develop a plethora of different semantic services.

DEEP SEMANTIC SEARCH

Since the goal of ResearchIQ is to provide search based on semantics rather than just syntax, it is essential that the search rely deeply on semantics. The evaluation of the semantic search in ResearchIQ according to the framework introduced earlier is shown in Figure 17. The search uses full ontological knowledge for querying, leveraging the semantic graph and the connectedness of the triplestore. The challenge of relying heavily on semantics for reasoning is that the underlying technologies, such as OWL and SPARQL [52], are generally slower than traditional approaches such as XML or relational databases. It is therefore essential to keep the actual SPARQL queries simple and to exploit the semantics algorithmically as far as possible.

Another challenge is to utilize the semantic relationships effectively. ResearchIQ processes different semantic relationships separately. The search is based on concepts and the linkages between them, using graph-based propagation over the data: resources and concepts are the nodes, and relationships are the edges of the graph. Thus, the reasoning and resource discovery are inherently semantic in nature.
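A simplified sketch of such graph-based propagation follows; the decay factor, the two-hop limit, and all node names are illustrative assumptions, not ResearchIQ's actual algorithm:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Relevance scores spread outward from a seed concept along the edges of
// the resource/concept graph, attenuated by a decay factor at each hop.
public class PropagationSketch {
    // adjacency: node -> directly linked nodes (hypothetical example data)
    static Map<String, List<String>> edges = Map.of(
            "c:MassSpectrometry", List.of("r:CoreFacilityPage", "r:Pub123"),
            "r:Pub123", List.of("r:ProfileDrSmith"));

    static Map<String, Double> propagate(String seed, double decay, int hops) {
        Map<String, Double> scores = new HashMap<>();
        Map<String, Double> frontier = Map.of(seed, 1.0);
        scores.putAll(frontier);
        for (int h = 0; h < hops; h++) {
            Map<String, Double> next = new HashMap<>();
            for (var e : frontier.entrySet())
                for (String nb : edges.getOrDefault(e.getKey(), List.of()))
                    next.merge(nb, e.getValue() * decay, Double::sum);
            // accumulate the newly reached nodes into the overall scores
            next.forEach((k, v) -> scores.merge(k, v, Double::sum));
            frontier = next;
        }
        return scores;
    }

    public static void main(String[] args) {
        // A publication linked to the seed concept scores 0.5; the author
        // profile reachable through it scores 0.25.
        System.out.println(propagate("c:MassSpectrometry", 0.5, 2));
    }
}
```

The decaying scores give directly annotated resources precedence over resources that are only transitively related, which matches the ranking behavior the text describes.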

The query processing also circumvents problems of syntactic ambiguity by keeping the search purely semantic: the user directly enters concepts rather than ambiguous free text. The query processor allows the user to form complex queries from individual concepts and resources. For example, if A, B and C are three concepts, the user can form a query such as (A AND B) OR C. This adds to the precision of the search and allows the user to zero in on the relevant results.
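Concept-level query composition of this kind reduces to set operations over the resources annotated with each concept. A sketch, with a hypothetical concept-to-resource index standing in for the triplestore:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Each concept maps to the set of resources annotated with it; a query
// like "(A AND B) OR C" then becomes set intersection and union.
public class ConceptQuerySketch {
    // Hypothetical index: concept -> annotated resources.
    static final Map<String, Set<String>> INDEX = Map.of(
            "A", Set.of("r1", "r2", "r3"),
            "B", Set.of("r2", "r3"),
            "C", Set.of("r4"));

    static Set<String> and(Set<String> x, Set<String> y) {
        Set<String> out = new HashSet<>(x);
        out.retainAll(y);  // intersection
        return out;
    }

    static Set<String> or(Set<String> x, Set<String> y) {
        Set<String> out = new HashSet<>(x);
        out.addAll(y);     // union
        return out;
    }

    public static void main(String[] args) {
        // (A AND B) OR C -> contains r2, r3 and r4
        Set<String> result = or(and(INDEX.get("A"), INDEX.get("B")), INDEX.get("C"));
        System.out.println(result);
    }
}
```

In the real system the per-concept resource sets would come from the propagation step, with scores attached, rather than from a static index.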

Figure 17: Evaluating ResearchIQ semantic search

CONSOLIDATED QUERY STRUCTURE

The ResearchIQ query structure relies on a single user-submitted query to find all the resources, across all resource types, related to that query. Since the triplestore is unified across these resource types, such queries are easy to execute. As mentioned earlier, the ResearchIQ triplestore stores relations between different types of resources; this, together with the unified triplestore, allows consolidated reasoning and propagation over all resource types. This is a different approach from the semantic search provided by Harvard Catalyst.

END TO END SEMANTIC SYSTEM

ResearchIQ is truly semantic in nature, as it uses ontological representations not only for the content of the resources but also for the meta-information about the resources themselves. It uses semantics for annotation, reasoning and visualization. It should be noted that other state-of-the-art systems, such as Eagle-i and Harvard Catalyst, are not end-to-end: their semantics are restricted to the annotation or the reasoning rather than pervading the entire system.

PLATFORM FOR SEMANTIC DATA INTEGRATION AND DISSEMINATION IN BIOMEDICAL INFORMATICS

ResearchIQ follows the proposed framework for the development of semantic services and uses standard ontologies from the domain of biomedical informatics. It can therefore serve as a platform for integrating additional data sets with relatively little effort. Since the core system is built to annotate resources so that newer data sources can be incorporated easily and consolidated queries can be run over them, the result is a robust and highly extensible platform for providing semantic services.

This is the most significant contribution of the ResearchIQ project. The project architecture fits the semantic services framework proposed in this thesis. Figure 18 depicts the evaluation of ResearchIQ as per the linked open data guidelines.


Figure 18: ResearchIQ utilizes ontological data structure

CHAPTER 5

METHODS AND IMPLEMENTATION

1. THE BASICS

DEFINITIONS

There are several terms used frequently in ResearchIQ. The following section defines some terms specific to the project:

RESOURCE: A resource in ResearchIQ is a source of data, knowledge or tools. The resource types include publications, web pages, expertise, clinical data sets, medical data sets, etc. Details of the variety of these resources are provided in the contributions section. The goal of ResearchIQ is to retrieve these resources using semantic search logic.

INSTANCE: An instance is used in conjunction with a resource. It refers to the actual entry for the resource in the triplestore. An instance of a resource contains all the meta-information about the resource, represented semantically.

CONCEPT: A concept in ResearchIQ refers to a concept in the UMLS [53] ontology. It contains information about a particular conceptualization in UMLS.

CUI: CUI stands for "Concept Unique Identifier" in UMLS. In the context of ResearchIQ it is used interchangeably with the term "concept."

ONTOLOGY STRUCTURE OF RESEARCHIQ

ResearchIQ uses several standard ontologies to form a complex and comprehensive semantic structure. A bottom-up analysis of the structure is the easiest way to understand it.

Figure 19 shows the ontological structure followed in ResearchIQ.

[Figure: the ResearchIQ ontology hierarchy. At the top, the ResearchIQ ontology with its instances, the BRO and UMLS; beneath it, the biomedical domain ontologies (SNOMED-CT, MeSH, HUGO, OMIM, GO, NCI, RxNorm, ICD9-CM, ...); then the upper-level ontologies (Dublin Core, FOAF, NAO, SKOS, NCO); and at the base, the schema ontologies (RDF, RDFS, OWL, XSD).]

Figure 19: ResearchIQ Ontology Hierarchy

SCHEMA ONTOLOGIES

These form the foundational layer for creating and maintaining any ontological structure. They include the schema-level features defined by OWL, RDF and RDFS; the basic relationships are defined at this level.

UPPER LEVEL ONTOLOGIES

ResearchIQ relies on relationships and classes defined in the upper level

ontologies to enrich its own ontology. This reliance on widely used and standard

ontologies increases the portability of the triplestore. This is key to making the

ResearchIQ data interoperable and extensible.

UMLS AND THE BIOMEDICAL DOMAIN ONTOLOGIES

UMLS stands for the Unified Medical Language System [53]. It is a collection of over 150 taxonomies, controlled vocabularies and ontologies in the biomedical domain, providing a consolidated collection of biomedical concepts and detailed relationships between them. Each concept has a unique identifier associated with it, which makes it possible to address some of the problems of ambiguity and heterogeneity discussed earlier. In ResearchIQ, the UMLS ontology is used to annotate the resources and to perform the semantic search.

INSTANCES

All the resources and their respective meta-information are stored in ontological form as part of the instances ontology in ResearchIQ. It uses relationships from the upper-level and schema ontologies to annotate the resources. In addition, all the annotated UMLS concepts corresponding to a particular resource are also stored in the instances ontology.

BIOMEDICAL RESOURCE ONTOLOGY (BRO)

The Biomedical Resource Ontology (BRO) [54] is used to classify the resources

into different resource types. These types are classes within the BRO. The

classification is used in the user interface to arrange the results.

2. IMPLEMENTATION

TECHNOLOGIES AND TOOLS

ResearchIQ is implemented in a completely open source environment. It is developed in Java using Java-based tools. Due to space limitations, the full details of the implementation are not included in this thesis. Below is a list of external tools and applications utilized by ResearchIQ.

LUCENE / SOLR

Lucene [55] and SOLR [56] are indexing tools from Apache. While Lucene has been around for a long time and manages text-based data, SOLR is a more recent server built on top of it. SOLR supports distributed operation, which gives it the capability to handle large numbers of resources.

METAMAP

MetaMap [57] is a free-text annotator provided by the National Library of Medicine (NLM) to map text to the Unified Medical Language System (UMLS). The MetaMap tool annotates concepts from a specified ontology in given raw text. It provides many options that allow the annotation process to be controlled.
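To show the shape of the task MetaMap performs, here is a deliberately naive dictionary-based annotator; MetaMap's actual NLP pipeline is far more sophisticated, and the term-to-identifier map below is invented for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Naive sketch of concept annotation: scan the text for known terms and
// emit their concept identifiers. Real annotators handle variants,
// disambiguation and scoring; these CUIs are purely illustrative.
public class NaiveAnnotator {
    static final Map<String, String> TERM_TO_CUI = Map.of(
            "mass spectrometry", "C0000001",   // hypothetical CUIs
            "clinical trial", "C0000002");

    static List<String> annotate(String text) {
        List<String> cuis = new ArrayList<>();
        String lower = text.toLowerCase();
        for (var e : TERM_TO_CUI.entrySet())
            if (lower.contains(e.getKey())) cuis.add(e.getValue());
        return cuis;
    }

    public static void main(String[] args) {
        System.out.println(annotate("A clinical trial using mass spectrometry"));
    }
}
```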

JENA

Jena [58] is a framework from Apache for developing semantic web applications. It provides extensive Java libraries supporting a variety of ontology representation formats, and includes a rule-based inference engine and a variety of storage strategies for RDF triples. Jena is a completely in-memory application and loads the entire triplestore into memory; thus, it is not a scalable tool and cannot be used for larger triplestores.

SESAME

Sesame [59] is an open source framework for storing and analyzing RDF data. It is deployed on the web and is thus scalable, while offering the same capabilities as Jena. Both Jena and Sesame support SPARQL as the query language for RDF.

SEMANTIC WEB STACK

The semantic web stack used by ResearchIQ can be seen in Figure 20. It is very similar to the original semantic web stack, except that it does not contain the proof and encryption layers. The ontologies are stored either as OWL or as N-triples; the query language in either case is SPARQL.

[Figure: the ResearchIQ semantic web stack, from top to bottom: user interface and applications; web services; unifying logic; query (SPARQL), ontologies (OWL / N3) and rules (SWRL / RIF); taxonomies (RDFS); data interchange (RDF); syntax (XML); and, at the base, identifiers (URI) and character set (UNICODE).]

Figure 20: ResearchIQ Semantic Web Stack

USER INTERFACE

The effective delivery of search results is an important function of a semantic search engine. The delivery method should exploit the semantic backbone architecture while providing ease of access. Developing such a user interface is a research topic in itself; the work done in that regard, though crucial, is not part of this thesis. For completeness, screenshots of the currently implemented system are provided in Figure 21.

Figure 21(a): Home page with auto-complete list. (b): Search results for "Mass Spec."

3. COMPONENT DIAGRAM

Figure 22 shows the different components of the system. The analogy to the semantic services framework proposed in this thesis is immediate: the system is architected using the same guidelines.

As the diagram shows, several different types of resources are integrated. The ResearchIQ system can be broken into two core components: the Annotation Pipeline and the Query Pipeline. The annotation pipeline takes input from several data sources, writing knowledge into the triple store and indices into the Lucene / SOLR directory. The query pipeline reads this stored knowledge and runs semantic queries over it. We will look at each of these pipelines in detail.

Figure 22: Component Diagram. (Data sources, including websites, Clinical Trials, PubMed, OSUPro, NCBI datasets and Grants.gov, feed the Annotation Pipeline, consisting of the Extractor, Annotator and Loader, which populates the Triple Store. The Query Pipeline, consisting of the Seed Score Generator, Propagator and Aggregator, reads the stored knowledge and delivers query results to the User Interface.)

CHAPTER 6

ANNOTATION

1. ANNOTATION PIPELINE

The annotation pipeline, seen in Figure 23, is essentially an ETL process.

EXTRACTOR

The extraction process differs for the different types of resources. The job of the extractor is to parse information from the various data sources and provide the annotator with the raw data that needs to be transformed. The extractor also captures the meta-information that will be associated with the resource, such as its name and URL, as well as semantic information such as the type of resource, other related resources, etc. The first stage in Figure 23 shows this process. This meta-information is also sent to the annotator, which in turn passes it to the loader. It is crucial to have this meta-information for executing semantic queries.

A custom data extractor needs to be written for each source of data, even though most of the technical details and functions used remain the same. For example, the clinical trials data is stored in a structured MySQL database. The database connector queries this database for the required fields; in this case the MeSH terms associated with each clinical study are sent to the annotator.
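As an illustration, the clinical-trials extractor might be sketched as below. The table and column names are assumptions, and sqlite3 stands in here for the production MySQL database; the real connector would use a MySQL driver instead.

```python
import sqlite3

def extract_clinical_trials(conn):
    """Pull each clinical trial plus its MeSH terms and package the
    meta-information the annotator expects (field names are assumed)."""
    rows = conn.execute(
        "SELECT trial_id, title, url, mesh_terms FROM clinical_trials"
    ).fetchall()
    records = []
    for trial_id, title, url, mesh_terms in rows:
        records.append({
            "resource_type": "ClinicalTrial",    # semantic meta-information
            "name": title,
            "url": url,
            "raw_terms": mesh_terms.split(";"),  # MeSH terms for the annotator
        })
    return records

# Demo with an in-memory stand-in database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clinical_trials (trial_id, title, url, mesh_terms)")
conn.execute("INSERT INTO clinical_trials VALUES (1, 'Statin Trial', "
             "'http://example.org/t1', 'Hypercholesterolemia;Statins')")
records = extract_clinical_trials(conn)
```

The resulting records carry both the raw MeSH terms for the annotator and the meta-information (name, URL, resource type) that is later passed on to the loader.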

Figure 23: Annotation Pipeline. (Source-specific extractors, namely a database connector for Clinical Trials, the Nutch crawler for websites and an XML parser for Grants.gov, feed the MetaMap annotation engine, which is backed by the UMLS. The resulting collections of semantic documents, one per source (OSUPro, PubMed, ClinicalTrials, Grants.gov, NCBI datasets), pass through the Lucene index creation pipeline and the loader, which generates the Lucene / SOLR indices and the triple store read by the query engine.)

Figure 24: Annotation Process. (Person, Grant and Publication resources, each with their meta-information, are annotated with UMLS concepts.)

ANNOTATOR

The annotator uses the MetaMap Annotation API to annotate the raw text of the resources with appropriate UMLS concepts. A wrapper around the API determines the filters and granularity of the annotation process. Not all annotations found in a document are retained: concepts too close to the root node of the UMLS tree structure are too generic, so only annotations of concepts below a certain depth in the UMLS structure are kept. We only have the concept identifiers at this stage; all the meta-information about the CUIs is yet to be extracted.
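The depth-based filter can be sketched as follows. The cutoff value and the parent map are assumptions; the real wrapper works against the UMLS hierarchy rather than an in-memory dictionary.

```python
MIN_DEPTH = 3  # assumed cutoff: concepts closer to the root are too generic

def concept_depth(cui, parent_of):
    """Depth of a CUI in the hierarchy: number of hops up to the root."""
    depth = 0
    while cui in parent_of:
        cui = parent_of[cui]
        depth += 1
    return depth

def filter_annotations(cuis, parent_of, min_depth=MIN_DEPTH):
    """Retain only annotations deep enough in the UMLS tree."""
    return [c for c in cuis if concept_depth(c, parent_of) >= min_depth]

# Toy hierarchy: C4 -> C3 -> C2 -> C1 (root); C1 and C2 are too generic.
parent_of = {"C4": "C3", "C3": "C2", "C2": "C1"}
kept = filter_annotations(["C1", "C2", "C4"], parent_of)
```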

The Annotation Engine also folds the meta-information of the resource forwarded from the extractor phase into data objects. The structure of the data objects differs for each type of resource. These are archived and used by the loader; the semantic annotations are stored separately. The process, including the data objects, is illustrated in Figure 24.

Figure 25: Knowledge Graph. (The annotated Person, Grant and Publication resources linked through their shared UMLS concepts.)

LOADER

The loader is tasked with converting the resources into ontological format and generating the Lucene / SOLR indices for the resources. Traditionally, Lucene indexes only textual matter in natural language at different levels of granularity. ResearchIQ instead feeds Lucene with documents at a conceptual level, using a formal conceptual language. The fact that Lucene sees each document in this semantic view, as a list of concepts rather than words, supports the claim that the search is semantic.

The loader also derives the extended UMLS ontology. It does this by storing all synonymous and ancestor concepts up to a certain depth in the UMLS hierarchy. This step allows the system to store the deep semantic relationships between the concepts within the UMLS ontology.

The meta-information about relationships between the different types of resources is also attached to the data before it is loaded into the triple store. For example, the authors of a publication are linked to it with an appropriate “isAuthorOf” relationship, and so on. The resulting structure is a comprehensive knowledge graph, as seen in Figure 25. This graph is represented as RDF triples and stored in a triplestore.
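A sketch of the triples the loader might emit for a single publication is shown below, with plain (subject, predicate, object) tuples standing in for RDF statements. The predicate names other than “isAuthorOf”, the namespace prefixes and the example URIs are assumptions.

```python
def publication_triples(pub_uri, title, author_uris, concept_uris):
    """Emit the RDF-style triples the loader would store for one
    publication (predicates besides isAuthorOf are assumptions)."""
    triples = [
        (pub_uri, "rdf:type", "bro:Publication"),
        (pub_uri, "riq:preferredLabel", title),
    ]
    for person in author_uris:
        # Authors are linked to the publication; reasoning over the
        # ontology makes the inverse association unnecessary to store.
        triples.append((person, "riq:isAuthorOf", pub_uri))
    for cui in concept_uris:
        triples.append((pub_uri, "riq:relatedConcept", cui))
    return triples

triples = publication_triples(
    "riq:pub42", "A Statin Study", ["riq:personA"], ["umls:C0036581"])
```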

2. ANNOTATIONS

RESOURCE ANNOTATIONS

The annotations for a resource are stored in the instances ontology. Each resource is marked with at least one resource type from the Biomedical Resource Ontology (BRO), and multiple resource types can be associated with a single resource. An annotated resource carries different meta-data depending on its type. The following associations are common to all resources regardless of type:

PREFERRED LABEL

The name that will be displayed to the user.

URL

The source location of the resource on the Internet.

DESCRIPTION

A short description of the resource that will be displayed to the user on the user interface.

ORGANIZATION

The organization or sub-organization within OSU that the resource belongs to. It should be noted that the organizations themselves are usually resources within ResearchIQ.

RELATED CONCEPTS

The annotations from the UMLS ontology that describe the resource.

Now let us consider specialized types of annotations:

RELATED RESOURCES

These are stored for Publications, Labs and Clinical Trials, and are People resources in all these cases. For publications they are the authors of the particular publication; in the other cases they are the principal investigators responsible for that particular resource. Since the ontology allows reasoning, we do not have to store a complementary association with the People resources.

FIRST AND LAST NAMES

These are stored for People resources.

JOB TITLE

Like the names, these are associated only with People resources.

START AND END DATES

These are associated with Grant resources. The start date indicates the date on which the grant was posted, and the end date is the deadline for submitting proposals for that grant.

Figure 26 shows annotations for a typical Publication resource. The various annotations discussed in this section are represented by the different colored edges.

Figure 26: Example PubMed Resource

CONCEPT ANNOTATIONS

The following annotations are stored for every concept extracted from the UMLS. Though the UMLS has a lot of information associated with each CUI, for the purposes of ResearchIQ the following relationships were sufficient as of September 2012; hence only these required annotations are stored in the triplestore.

PREFERRED NAME

The preferred label of the concept.

STRING

Alternative names and possible string representations of the concept.

SOURCE (SAB)

A list of ontologies, taxonomies and vocabularies in which the concept originally appears.

SYNONYMS (SY)

List of synonymous concepts, for example “Myocardial Infarction” and “Heart Attack.” By default each concept is synonymous with itself.

CHILD CONCEPTS (CH)

List of child concepts in the UMLS ontology.

RELATED CONCEPT (RN)

List of concepts that are related to a given concept.

HYPERNYMS (ISA)

These are abstractions (more general forms) of a given concept.
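As a small illustration of how the “SY” annotation is used downstream, a synonym lookup over edge triples might look like the following sketch. The edge representation and the example CUIs are illustrative assumptions.

```python
def synonyms(cui, edges):
    """Return the SY-related CUIs for a concept; every concept is
    synonymous with itself by default (edge structure is assumed)."""
    syns = {cui}
    for (subj, rel, obj) in edges:
        if rel == "SY" and subj == cui:
            syns.add(obj)
        if rel == "SY" and obj == cui:
            syns.add(subj)
    return syns

# Illustrative edges: one synonym pair and one child relation.
edges = [("C0027051", "SY", "C0155626"),
         ("C0027051", "CH", "C0155668")]
syns = synonyms("C0027051", edges)
```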

All these associations play a critical role in performing the semantic search. The next chapter explains the query pipeline and details how these associations are utilized.

CHAPTER 7

QUERYING

The detailed query pipeline of ResearchIQ is shown in Figure 27. The user is restricted to selecting query terms from a drop-down auto-complete list. Each entry in this list is associated with a URI in the knowledge graph of the triplestore.

A query can consist of several individual terms joined by Boolean AND or OR. The first step is to separate these terms. Each term is then either a concept or a direct resource, and the search proceeds differently in each case; for direct resources we use the knowledge graph directly. Let us look at both scenarios in detail, starting with the query for direct resources.

1. QUERYING FOR DIRECT RESOURCES

1. If a resource is selected from the auto-complete list, the relationships associated with that resource are utilized.

2. The query engine searches the triplestore for all the resources that are directly associated with the searched resource. These derived resources are assigned a lower score in order to maintain the hierarchy of the search when displaying the results.

3. The above step is repeated for the derived resources as well, in order to find more semantically associated resources.

For instance, consider a user searching for a particular Person resource within OSUPro. The search would return all the publications that the person has authored, as well as all the clinical trials, labs and grants the person is associated with. In the next step, all the resources related to the ones just found are added to the search results.
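A minimal sketch of this two-hop direct-resource search is given below, with an in-memory adjacency map standing in for the triplestore and an assumed per-hop score penalty to keep derived resources ranked below directly associated ones.

```python
def direct_resource_search(start, links, hop_penalty=0.5):
    """Score resources reachable within two hops of the searched
    resource; each hop lowers the score (penalty value is assumed)."""
    scores = {}
    frontier = [start]
    score = 1.0
    for _hop in range(2):
        score *= hop_penalty
        next_frontier = []
        for node in frontier:
            for neigh in links.get(node, []):
                if neigh != start and neigh not in scores:
                    scores[neigh] = score
                    next_frontier.append(neigh)
        frontier = next_frontier
    return scores

# personA authored pub1 and runs trial1; pub1 has co-author personB.
links = {"personA": ["pub1", "trial1"], "pub1": ["personB"]}
scores = direct_resource_search("personA", links)
```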

2. QUERYING FOR CONCEPTS

Querying for concepts is far more complex. It is achieved in two steps: first the seed resources are obtained, and then the scores are propagated using the UMLS ontology to discover additional resources.

SEED SCORE GENERATION

1. For every query concept we generate all its synonyms. These will be CUIs in the UMLS ontology related by “SY”.

2. A set of seed resources and associated scores is generated using each synonym. In case of repetitions, which are frequent, the higher seed score is retained for the resource.

3. For propagation, the lowest of the seed resource scores is propagated using the propagation technique described below. Using the lowest score keeps the comparative “relevance” of seed resources higher than that of resources generated by propagation.

4. Let us call the set generated after propagation a “resultset”. Note that this is for a single queried concept. We immediately normalize the scores with respect to the highest score in each resultset. This maintains comparability across the resultsets for individual CUIs.
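Steps 1, 2 and 4 above can be sketched as follows; the synonym and resource lookup tables stand in for the Lucene / SOLR and triplestore lookups, and the scores are illustrative.

```python
def seed_scores(query_cui, synonyms_of, resources_for):
    """Steps 1-2: gather seed resources over all synonyms, keeping the
    higher score for repeated resources (lookup tables are stand-ins)."""
    seeds = {}
    for syn in synonyms_of.get(query_cui, {query_cui}):
        for resource, score in resources_for.get(syn, []):
            seeds[resource] = max(score, seeds.get(resource, 0.0))
    return seeds

def normalize(resultset):
    """Step 4: divide by the highest score so resultsets for different
    query concepts stay comparable."""
    top = max(resultset.values())
    return {r: s / top for r, s in resultset.items()}

synonyms_of = {"C1": {"C1", "C1s"}}
resources_for = {"C1": [("pub1", 0.8)],
                 "C1s": [("pub1", 0.6), ("pub2", 0.4)]}
result = normalize(seed_scores("C1", synonyms_of, resources_for))
```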

Figure 27: Query Pipeline. (Query terms from the visual interface pass through a term separator. Each term is routed either to the direct resource finder or, for concepts, to the seed resource finder backed by the Lucene / SOLR indices, followed by the score propagator and the related resource finder backed by the triple store. The single-term resultsets are combined by the result aggregator into the final search results.)


Figure 28: Propagation in the Knowledge Graph

SCORE PROPAGATION

The example in Figure 28 explains at a high level how the propagation proceeds. In the figure, squares mark the concepts and circles mark the resources.

1. "C" is the queried concept. Let us assume that the seed resources and their scores are already generated; we are only interested in the propagation of the scores. Also assume that R1, R2 and R3 in the figure are not seed resources. If we encounter seed resources or previously visited resources during propagation, we simply ignore them.

2. The lowest seed score of each seed resultset is used for propagation, as this ensures that the score of any derived result will still be lower than that of the initial set of results.

3. There are three possible relations between the concepts in the UMLS ontology:

a. "CH", indicating a child concept,

b. "RN", indicating an association, and

c. "SY", meaning the CUIs are synonyms.

4. These types are leveraged so that propagation differs by relation:

a. In case of "SY" the same score is assigned to the related concept.

b. In case of "CH" and "RN" we use an exponential decay of the score.

5. The propagation of the concepts is limited to a depth of 5. If a concept is revisited during propagation, propagation is stopped for that path. For now, we proceed in a depth-first manner as the program executes recursively.

6. For each new concept found, the propagator finds all the resources related to it. These are added to the resultset for that concept with the appropriate score.

7. Notice in the figure that "R2" is reachable from two paths. In such a case (which is quite common) the highest possible score is retained.
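A compact sketch of this propagation, under an assumed decay factor, is shown below; dictionaries stand in for the UMLS graph and the concept-to-resource index.

```python
DECAY = 0.5      # assumed decay factor for CH / RN hops
MAX_DEPTH = 5    # propagation depth limit from the text

def propagate(start_cui, start_score, edges, resources_for, seeds):
    """Depth-first score propagation: SY keeps the score, CH/RN decay it
    exponentially; seed resources and visited concepts are skipped; a
    resource reachable by several paths keeps its highest score."""
    results = {}
    visited = set()

    def visit(cui, score, depth):
        if depth > MAX_DEPTH or cui in visited:
            return
        visited.add(cui)
        for resource in resources_for.get(cui, []):
            if resource not in seeds:
                results[resource] = max(score, results.get(resource, 0.0))
        for rel, neigh in edges.get(cui, []):
            visit(neigh, score if rel == "SY" else score * DECAY, depth + 1)

    visit(start_cui, start_score, 0)
    return results

# Toy graph: C has a child C1 and an associated concept C2;
# C1 has a synonym C1a. R2 is reachable along two paths.
edges = {"C": [("CH", "C1"), ("RN", "C2")], "C1": [("SY", "C1a")]}
resources_for = {"C1": ["R1"], "C1a": ["R2"], "C2": ["R2", "R3"]}
scores = propagate("C", 1.0, edges, resources_for, seeds={"R0"})
```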

Such resultsets are generated for each of the queried concepts in the same manner. The results in these sets are then retained following the AND/OR logic selected at query time.

• In case of AND (a set intersection operation), only the resources common to all sets are retained as part of the final results. The highest score for the resource among all resultsets is kept.

• In case of OR (a set union operation), all unique results are retained in the final result set while maintaining the highest possible scores for individual resources.
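The AND/OR combination step can be sketched as a set operation over per-term resultsets, keeping the highest score in either mode:

```python
def combine(resultsets, mode):
    """AND keeps resources common to all resultsets; OR keeps every
    unique resource; either way the highest score wins."""
    if mode == "AND":
        keep = set.intersection(*(set(rs) for rs in resultsets))
    else:  # OR
        keep = set.union(*(set(rs) for rs in resultsets))
    return {r: max(rs.get(r, 0.0) for rs in resultsets) for r in keep}

a = {"pub1": 0.9, "pub2": 0.4}
b = {"pub1": 0.7, "pub3": 0.6}
both = combine([a, b], "AND")
either = combine([a, b], "OR")
```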

In the early stages of the project, the AND/OR logic was applied to only the seed resources before propagating. This did not implement the Boolean logic in its true sense and was modified to its current form where the logic is applied across all the resources.

The propagation works by finding concepts related to the query concepts while carrying the seed scores of the resources. If we instead used the concepts related to each resource (an approach tried earlier), the focus of the search would be lost, because a resource can be annotated with generic concepts other than the ones the search is intended for.

The Result Aggregator fetches the meta-information, such as the preferred label and the description, required by the user interface.

3. FEATURES OF THE RESEARCHIQ SEMANTIC SEARCH

We have already assessed the significance of the semantic search architecture from the point of view of the proposed semantic services framework in the contributions section.

In this section we cover the algorithmic features of the search engine.

PURELY SEMANTIC SEARCH

Since the semantic ETL process delivers a purely semantic database, the search has to be semantic as well. Even within Lucene, the semantic concepts are indexed. The semantic search tends to be very precise, as it relies solely on the semantic relationships defined by standard ontologies.

PLUGGABLE DECAY ALGORITHM

The decay algorithm for the search propagation is pluggable. It is possible to choose between exponential and logarithmic decay algorithms and to set the slope of decay. This allows control over the rate at which the score decays during propagation.
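The pluggable decay might be sketched as two interchangeable functions; the exact formulas and slope values here are assumptions, not the deployed ones.

```python
import math

def exponential_decay(score, depth, slope=0.7):
    """score * slope**depth: an assumed form of the exponential option."""
    return score * slope ** depth

def logarithmic_decay(score, depth, slope=1.0):
    """score / (1 + slope*ln(1+depth)): an assumed form of the log option."""
    return score / (1.0 + slope * math.log1p(depth))

decay = exponential_decay  # "pluggable": swap in either function
decayed = [round(decay(1.0, d), 3) for d in range(3)]
```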

CACHING FOR PARTS OF SEARCH

A cache stores the information associated with all the unique concepts explored during propagation. This avoids redundant lookups of concepts in the triple store and increases the speed of the search.
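A minimal concept cache of the kind described might look like this sketch, with a counter exposing how many triple-store round trips were actually made (the lookup function is a stand-in):

```python
class ConceptCache:
    """Memoize triple-store lookups for concepts seen during propagation."""
    def __init__(self, lookup):
        self._lookup = lookup
        self._cache = {}
        self.store_hits = 0  # actual triple-store round trips

    def get(self, cui):
        if cui not in self._cache:
            self.store_hits += 1
            self._cache[cui] = self._lookup(cui)
        return self._cache[cui]

cache = ConceptCache(lambda cui: {"label": "concept " + cui})
for cui in ["C1", "C2", "C1", "C1"]:
    info = cache.get(cui)
```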

SEMI-MULTITHREADED ARCHITECTURE

The Result Aggregator runs in a multi-threaded fashion in order to fetch the meta-information as quickly as possible.
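The multi-threaded aggregation can be sketched with a thread pool; the meta-information fetch here is a stand-in for the actual triplestore lookup.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_meta(resource):
    """Stand-in for the triple-store lookup of label and description."""
    return {"uri": resource, "label": resource.upper()}

def aggregate(resources, workers=4):
    """Fetch meta-information for all result resources in parallel;
    map() preserves the input (score) order of the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_meta, resources))

meta = aggregate(["pub1", "trial1", "grant1"])
```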

CHAPTER 8

CONCLUSION

1. DISCUSSION

RESEARCHIQ

The primary focus of the ResearchIQ project thus far has been to successfully implement a semantic search engine. Based on the arguments presented in this work, it can be said that this goal has been achieved. Henceforth, efforts will be directed towards expansion of the ResearchIQ framework. The following are the planned next steps and possible future directions for ResearchIQ.

INCLUSION OF MORE DATA SETS

Now that an extensible and scalable framework is established, the focus will be on increasing its data coverage. There are two future goals in this regard:

1. To integrate more data from the existing data sources, for example increasing the number of publications from PubMed or gathering more people profiles from OSUPro. This expansion should be low-hanging fruit, as the end-to-end pipelines for these resources already exist.

2. To integrate new data sources into ResearchIQ. This requires more effort, as a new module might have to be written for the data extractor; however, the subsequent pipeline is already complete.

HYBRID SEARCH

As previously stated, ResearchIQ is a purely semantic search engine. This provides high precision in the search results. However, a lot of the syntactic information is discarded when the resources are converted into the semantic structure. The semantic search structure does not provide much recall, meaning that the “fuzziness” that syntactic search provides is missing. A hybrid search infrastructure would combine semantic and syntactic search in order to provide “better” search results: the idea is that users of the system can choose how semantic or syntactic they want the search results to be. This would add functionality at the user interface.

FREE FORM SEARCH

Currently the user input is restricted to the search terms provided by the auto-complete list; users have to choose a specific term or set of terms from the list, else the search cannot be completed. In the future the user should be able to enter free text, and the search interface should be able to process the input and find the appropriate concepts.

PARALLEL QUERY ARCHITECTURE

Since the primary goal of ResearchIQ is the effective delivery of search results to researchers, the speed of result delivery is a critical factor. The resource discovery platform is of no use if it takes too much time to retrieve and display the results.

Semantic technologies tend to be slower than their syntactic counterparts due to the overhead of the additional reasoning required. It is a challenge, then, to implement an efficient search engine that delivers results quickly.

To make the semantic search architecture scalable, parallel processing techniques such as MapReduce on Hadoop are planned. It is expected that a map-reduce style search architecture will reduce the query time significantly and will scale well even with an increased number of resources.

EVALUATION

A “semantic” evaluation of the ResearchIQ system is provided in this thesis. Since the ResearchIQ system is constantly growing in terms of data, and many enhancements are being implemented for the semantic search, it is difficult to perform functional and usability evaluation.

The following factors need to be considered during the evaluation:

• Functional Evaluation

1. A task-based evaluation of the effectiveness of the search results needs to be done. It is crucial that researchers are able to find the resources they are looking for; such an evaluation should provide insight into the accuracy of the search. Only once we have the results of this evaluation can we compare ResearchIQ with other similar systems.

2. A scientific evaluation of the query times required to perform the search also needs to be done.

3. A comparative analysis with other similar systems in the biomedical research domain can be performed.

• Usability Evaluation

1. The current graphical user interface was developed after a preliminary usability analysis. A more comprehensive evaluation covering cognitive, aesthetic and non-functional requirements should be performed.

2. A comparative usability analysis should be performed against other popular search systems.

SEMANTIC SERVICES FRAMEWORK

From the point of view of the semantic services framework, ResearchIQ provides a great example of how one might utilize the framework architecture for developing a semantic system and eventually validating its semantic nature. The project supports the validity of the framework itself.

A future direction would be to refine the evaluation metrics for both the semantic ETL process and the semantic services bus. It would also be beneficial to include some basic functional evaluation within the framework itself.

Another direction is to apply the semantic framework to other domains and services. The lessons learned from the ResearchIQ project would be helpful in achieving this.

2. FINAL COMMENTS

In this thesis, a generalized framework for developing and evaluating semantic services has been proposed. The framework provides an architecture coupling the key components required for semantic services. This is essential if we are to identify semantic services.

Metrics for measuring the semantic nature of the services are also proposed. These metrics provide a white-box evaluation to validate the services as being semantic. They lay down guidelines for the documentation and details needed to make the semantics involved more transparent. This in turn would make services more discoverable and portable, and easier for data consumers to access.

In order to verify the soundness of the framework, we implemented a semantic search service as a case study: ResearchIQ, a semantic search portal for resource discovery for researchers in biomedical informatics. The ResearchIQ system serves as a framework for semantic integration and knowledge delivery for the biomedical domain, and the success of the project reflects the stability of the proposed semantic services framework. Please note that ResearchIQ is a working system, implemented as part of the Clinical and Translational Sciences website at The Ohio State University Medical Center; it can be explored by anyone interested at http://researchiq.bmi.osumc.edu.

It is intended that the framework provide a good platform for accelerating the development of services that are more “semantic” in nature. It should provide the means to analyze and compare semantic services. The developers of such services should use the framework to understand the semantic requirements of their systems and implement services that are extensible, discoverable and transparent, and that better conform to the requirements of the linked open data movement.

REFERENCES

[1] Wiig, K. M. (1997). Knowledge management: where did it come from and where will it go?. Expert systems with applications, 13(1), 1-14.

[2] Lohr S. (2012). How Big Data became so big. The New York Times

[3] McIlraith, S. A., Son, T. C., & Zeng, H. (2001). Semantic web services. Intelligent Systems, IEEE, 16(2), 46-53.

[4] Studer, R., Grimm, S., & Abecker, A. (Eds.). (2007). Semantic web services: concepts, technologies, and applications. Springer.

[5] Niland, J. C., & Rouse, L. (2010). Clinical research systems and integration with medical systems. Biomedical Informatics for Cancer Research, 17-37.

[6] Ding, L., Finin, T., Joshi, A., Pan, R., Cost, R. S., Peng, Y., ... & Sachs, J. (2004, November). Swoogle: a search and metadata engine for the semantic web. In Proceedings of the thirteenth ACM international conference on Information and knowledge management (pp. 652-659). ACM.

[7] Decker, S., Erdmann, M., Fensel, D., & Studer, R. (1998). Ontobroker: Ontology based access to distributed and semi-structured information. AIFB.

[8] Butte, A. J. (2008). Translational bioinformatics: coming of age. Journal of the American Medical Informatics Association, 15(6), 709-714.

[9] Payne, P. R., Johnson, S. B., Starren, J. B., Tilson, H. H., & Dowdy, D. (2005). Breaking the translational barriers: the value of integrating biomedical informatics and translational research. Journal of investigative medicine, 53(4), 192-201.

[10] Collins, F. S. (2011). Reengineering translational science: the time is right. Science translational medicine, 3(90), 90cm17-90cm17.

[11] Embi, P. J., & Payne, P. R. (2009). Clinical research informatics: challenges, opportunities and definition for an emerging domain. Journal of the American Medical Informatics Association, 16(3), 316-327.

[12] Borlawsky, T. B., Lele, O., & Payne, P. R. (2011). Research-IQ: Development and evaluation of an ontology-anchored integrative query tool. Journal of biomedical informatics, 44, S56-S62.

[13] “October 2012 Web Server Survey.” Retrieved on 12 November, 2012 from Netcraft.com

[14] World Internet Users Statistics Usage and World Population Stats. Retrieved on 12 November, 2012 from http://www.internetworldstats.com/stats.htm

[15] Where did the ‘Data Explosion’ come from? Retrieved on 10 November. 2012 from http://blog.bimeanalytics.com

[16] Top 10 Largest Databases in the World | Reviews, Comparisons and Buyer’s Guides. Retrieved on 10 November, 2012

[17] Mike2.0. Defining Big Data. (2010)

[18] Heath, T., & Bizer, C. (2011). Linked data: Evolving the web into a global data space. Synthesis Lectures on the Semantic Web: Theory and Technology, 1(1), 1- 136.

[19] Berners-Lee, T. Linked Data - Design Issues, (2006). Retrieved from http://www.w3.org/DesignIssues/LinkedData.html.

[20] Berners-Lee, T., Hendler, J., & Lassila, O. (2001). The semantic web. Scientific American, 284(5), 28-37.

[21] Berners-Lee, T., & Fischetti, M. (2001). Weaving the Web: The original design and ultimate destiny of the World Wide Web by its inventor. DIANE Publishing Company.

[22] "W3C Semantic Web Activity". World Wide Web Consortium (W3C). November 7, 2011. Retrieved November 10, 2012.

[23] Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. Retrieved 10 November 2012 from http://lod-cloud.net/

[24] Guinard, D., & Trifa, V. (2009, April). Towards the web of things: Web mashups for embedded devices. In Workshop on Mashups, Enterprise Mashups and Lightweight Composition on the Web (MEM 2009), in proceedings of WWW (International World Wide Web Conferences), Madrid, Spain.

[25] Berners-Lee, T. Giant global graph, November 2007. Available from World Wide Web: http://dig.csail.mit.edu/breadcrumbs/node/215 [cited 28.09.2008].

[26] McIlraith, S. A., & Martin, D. L. (2003). Bringing semantics to web services. Intelligent Systems, IEEE, 18(1), 90-93.

[27] Pollock, J. T. (2009). Semantic Web for dummies. For Dummies.

[28] "Semantic Web - XML2000, slide 10". W3C. Retrieved 20 October, 2012

[29] "XML 1.0 Specification". W3.org. Retrieved 20 October, 2012

[30] "Resource Description Framework (RDF) Model and Syntax Specification" http://www.w3.org/TR/PR-rdf-syntax/

[31] "XML and Semantic Web W3C Standards Timeline". Retrieved 20 October, 2012

[32] Sheth, A. (2012) Semantic Web: Intro and Overview. Presentation.

[33] Brickley, D., Guha R. V. & Layman A. Resource Description Framework (RDF) Schemas. http://www.w3.org/TR/1998/WD-rdf-schema-19980409/

[34] Gruber, T. R. (1993). A translation approach to portable ontology specifications. Knowledge acquisition, 5(2), 199-220.

[35] Gruber, T. (2008). What is an Ontology. Encyclopedia of Database Systems, 1.

[36] Fensel, D. (2003). Ontologies:: A Silver Bullet for Knowledge Management and Electronic Commerce. Springer.

[37] "OWL 2 Web Ontology Language Document Overview". W3C. Retrieved on 20 October 2012 from http://www.w3.org/TR/owl2-overview/

[38] McGuinness, D. L., & Van Harmelen, F. (2004). OWL web ontology language overview. W3C recommendation, 10(2004-03), 10.

[39] "N-Triples". W3C RDF Core WG Internal Working Draft. www.w3.org. Retrieved 20 October 2012 from http://www.w3.org/2001/sw/RDFCore/ntriples/

[40] Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., ... & Sherlock, G. (2000). Gene Ontology: tool for the unification of biology. Nature genetics, 25(1), 25.

[41] Harris, M. A., Clark, J., Ireland, A., Lomax, J., Ashburner, M., Foulger, R., ... & Gwinn, M. (2004). The Gene Ontology (GO) database and informatics resource. Nucleic acids research, 32(Database issue), D258.

[42] Carbon, S., Ireland, A., Mungall, C. J., Shu, S., Marshall, B., & Lewis, S. (2009). AmiGO: online access to ontology and annotation data. Bioinformatics, 25(2), 288-289.

[43] Gobeill, J., & Ruch, P. (2007). GOCat: A Gene Ontology Categorization/Navigation Service for Functional Annotation of Proteins.

[44] Day-Richter, J., Harris, M. A., Haendel, M., & Lewis, S. (2007). OBO-Edit—an ontology editor for biologists. Bioinformatics, 23(16), 2198-2200.

[45] “Introducing the Knowledge Graph: things, not strings.” Google Blog. (2012) Retrieved on 14 November 2012 from http://googleblog.blogspot.com

[46] Raje, S., Davuluri, C., Freitas, M., Ramnath, R., & Ramanathan, J. (2012, July). Using Semantic Web Technologies for RBAC in Project-Oriented Environments. In Computer Software and Applications Conference (COMPSAC), 2012 IEEE 36th Annual (pp. 521-530). IEEE.

[47] "Web Services Description Language (WSDL) Version 2.0 Part 1: Core Language.” W3C. Retrieved 15 November 2012.

[48] Jeckle, M. (2004). Semantik, Odem einer Service-orientierten Architektur.

[49] McCandless, D. Data, Information, Knowledge, Wisdom? (2010). Retrieved on 12 November 2012 from http://www.informationisbeautiful.net/

[50] Giunchiglia, F., Kharkevich, U., & Zaihrayeu, I. (2010). Concept Search: Semantics Enabled Information Retrieval.

[51] Lele, O., Raje, S., Yen, P., Borlawsky T.B. & Payne P.R.O. (2012). ResearchIQ: An Ontology-anchored Knowledge and Resource Discovery Tool [Poster]. AMIA Annual Symposium Proc.

[52] Prud’Hommeaux, E., & Seaborne, A. (2008). SPARQL query language for RDF. W3C recommendation, 15.

[53] Bodenreider, O. (2004). The unified medical language system (UMLS): integrating biomedical terminology. Nucleic acids research, 32(suppl 1), D267- D270.

[54] Tenenbaum, J. D., Whetzel, P. L., Anderson, K., Borromeo, C. D., Dinov, I. D., Gabriel, D., ... & Lyster, P. (2011). The Biomedical Resource Ontology (BRO) to enable resource discovery in clinical and translational research. Journal of biomedical informatics, 44(1), 137-145.

[55] “Lucene.” http://lucene.apache.org/

[56] “SOLR.” http://lucene.apache.org/solr/index.html

[57] Aronson, A. R. (2001). Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. In Proceedings of the AMIA Symposium (p. 17). American Medical Informatics Association.

[58] Carroll, J. J., Dickinson, I., Dollin, C., Reynolds, D., Seaborne, A., & Wilkinson, K. (2004, May). Jena: implementing the semantic web recommendations. In Proceedings of the 13th international World Wide Web conference on Alternate track papers & posters (pp. 74-83). ACM.

[59] Open, R. D. F. Sesame RDF Database, 2006. Internet: http://www.openrdf.org.
