RESEARCHIQ: AN END-TO-END SEMANTIC KNOWLEDGE PLATFORM FOR RESOURCE DISCOVERY IN BIOMEDICAL RESEARCH

Thesis

Presented in Partial Fulfillment of the Requirements for the Degree

Master of Science in the Graduate School of The Ohio State University

By

Satyajeet Raje

Graduate Program in Computer Science and Engineering

The Ohio State University

2012

Thesis Committee

Dr. Jayashree Ramanathan, Advisor

Dr. Rajiv Ramnath

Copyright by

Satyajeet Raje

2012

ABSTRACT

There has been a tremendous change in the amount of electronic data available to us and in the manner in which we use it. With the ongoing "Big Data" movement, we face the challenge of data "volume, variety and velocity." The linked data movement and its technologies attempt to address the issue of data variety. The current demand for advanced data analytics and services has triggered a shift from data services to knowledge services and delivery platforms. The semantic web plays a major role in providing richer and more comprehensive knowledge services.

We need a stable, sustainable, scalable and verifiable framework for knowledge-based semantic services, as well as a way to validate the "semantic" nature of such services using this framework. Having a framework alone is not enough: its usability should be tested with a good example of a semantic service as a case study in a key research domain. This thesis addresses two research problems.

Problem 1: A generalized framework for the development of end-to-end semantic services needs to be established.

The thesis proposes such a framework, which provides an architecture for developing end-to-end semantic services and metrics for measuring their semantic nature.

Problem 2: A robust knowledge-based service needs to be implemented using the architecture proposed by the semantic services framework, and its semantic nature needs to be validated using the proposed framework.

ResearchIQ, a semantic search portal for resource discovery in the biomedical research domain, has been implemented. It is intended to serve as the required case study for testing the framework. The architecture of the system follows the design principles of the proposed framework.

The ResearchIQ system is truly semantic from end to end, and the baseline evaluation metrics of the proposed framework are used to support this claim. Several key data sources have been integrated in the first version of the system. ResearchIQ serves as a framework for semantic data integration in the biomedical domain and can be used as a platform for the development and support of a variety of semantic services and applications in that domain.

A large part of this thesis is devoted to the details of the ResearchIQ project. The document is intended as a report on ResearchIQ as a successful implementation of an end-to-end semantic framework.


DEDICATION

To my parents, Sanjeev and Swati Raje, and my sister Surabhi

for their constant belief in my ability to achieve any goal set and

for teaching me to believe in myself.

ACKNOWLEDGEMENTS

A special thank you to my advisors, Dr. Jayashree Ramanathan, Dr. Rajiv Ramnath and Dr. Philip Payne for their guidance throughout this project and my graduate studies. I have benefited greatly from your continual support and direction.

I would also like to acknowledge Dr. Tara Payne, Omkar Lele and the rest of the ResearchIQ team: Puneet Mathur, Sandeep Chatra Raveesh and Dr. Po-Yin Yen, without whose contributions this work would not have been possible. Likewise, I thank CETI and my colleagues in the CSE Department. I have thoroughly enjoyed working with you and shall continue to do so.

Finally, I would like to thank my roommates Shrikant, Pranav and Akshay and friends in Columbus who have been my family away from home.

I gratefully acknowledge that the work done under the ResearchIQ project is supported in part by an Institutional Clinical and Translational Science Award, NIH/NCRR Grant Number UL1-RR025755.

TABLE OF CONTENTS

Abstract
Dedication
Acknowledgements
Table of Contents
List of Figures
List of Tables

CHAPTER 1
INTRODUCTION
1. Background
2. Desiderata for Semantic Services
3. Problem Analysis
Need for a Semantic Search Portal for Biomedical Research
4. Contributions

CHAPTER 2
RELATED WORK
1. Linked Data
2. Introduction to Semantic Web
3. Semantic Web (Web 3.0) Technologies

CHAPTER 3
THE SEMANTIC SERVICES FRAMEWORK
1. Semantic Web Applications
2. The Proposed Framework
3. Semantic Web Services Evaluation

CHAPTER 4
INTRODUCTION TO RESEARCHIQ
1. Goals of ResearchIQ
2. Challenges
3. Contributions within ResearchIQ

CHAPTER 5
METHODS AND IMPLEMENTATION
1. The Basics
2. Implementation
3. Component Diagram

CHAPTER 6
ANNOTATION
1. Annotation Pipeline
2. Annotations

CHAPTER 7
QUERYING
1. Querying for Direct Resources
2. Querying for Concepts

CHAPTER 8
CONCLUSION
1. Discussion
2. Final Comments

REFERENCES

LIST OF FIGURES

Figure 1: Semantic Services in the New Era
Figure 2: Size of the Web (Source: netcraft.com)
Figure 3: Growth in Linked Open Data
Figure 4: The Linked Open Data Cloud
Figure 5: Comparison Between Traditional (Web 2.0) and Semantic Web
Figure 6: Semantic Web Services and Intelligent Business
Figure 7: Semantic Web Stack [28]
Figure 8: Example RDF Graph
Figure 9: Gene Ontology (Source: Nature.com)
Figure 10(a): Visualization of the Google Knowledge Graph (Source: cnet.com). (b): An example search for "Barack Obama" showing the different semantic types of results (Source: Google.com)
Figure 11: Ontology for role-based access control in MetaDB [46]
Figure 12: Evolution of semantic web services [48]
Figure 13: Proposed semantic services framework
Figure 14: Transforming data for meaningful use using semantics
Figure 15: Evaluating Semantic ETL
Figure 16: Evaluating semantic services
Figure 17: Evaluating ResearchIQ semantic search
Figure 18: ResearchIQ utilizes ontological data structure
Figure 19: ResearchIQ Ontology Hierarchy
Figure 20: ResearchIQ Semantic Web Stack
Figure 21(a): Home Page with Autocomplete List. (b): Search Results for "Mass Spec"
Figure 22: Component Diagram
Figure 23: Annotation Pipeline
Figure 24: Annotation Process
Figure 25: Knowledge Graph
Figure 26: Example PubMed Resource
Figure 27: Query Pipeline
Figure 28: Propagation in the Knowledge Graph

LIST OF TABLES

Table 1: 5-star rating for data as suggested in the linked data movement

CHAPTER 1

INTRODUCTION

1. BACKGROUND

In the past decade, there has been great progress in the field of electronic data accrual and dissemination [1]. Data in its electronic form is no longer restricted to a few critical domains; it has pervaded, and in most cases become a requirement in, a variety of commercial and social applications. Consequently, there has been a shift in the way we perceive digital data: from its initial use case as a persistent form of storage that increases the accessibility and longevity of data, to valuable pieces of information from which knowledge can be drawn. The push was to identify and leverage secondary uses of data to extract wisdom from it. Researchers and industry experts realized the value of data, and a vigorous race to collect it began. This race evolved into the "Big Data" challenge that we face now [2]. With vast amounts of data coming from a variety of sources that might not be connected explicitly, it has become necessary to have smart knowledge-based services that can perform intelligent tasks on such data.

Semantic services [3] are viewed as a promising solution to this need. Hence, there has been considerable research on the development of such services. These services add contextual and meta-information to provide richer semantics that can be utilized while performing the services. A few of the key requirements of a good semantic service are identified in the section below.

2. DESIDERATA FOR SEMANTIC SERVICES

The desired features of a semantic service are derived from the demands of key data-related issues as perceived today [4]. Figure 1 shows a mash-up of these requirements and the environments from which they hail.

ABILITY TO HANDLE DATA

Semantic services should by default be able to handle data heterogeneity. In addition, it is desirable that any semantic service be able to process large volumes of data in a reasonable amount of time.

UTILIZE SEMANTIC WEB TECHNOLOGIES EFFICIENTLY

Semantic services rely on semantic web technologies for their functioning. A smart semantic service should maximize its use of standard technological frameworks and methods of handling data. This will make the service extensible and reusable.

SUPPORT A VARIETY OF SERVICES

A single service framework should be designed such that it provides an infrastructure for a variety of data and knowledge-based services, including information retrieval, visualization, data integration and other such services.

PROVIDE A SOLID INTERFACE TO THE USERS

A semantic service is ineffective if it cannot be used efficiently for its intended purposes. The service interface is required to be functional, usable and testable.

Figure 1: Semantic Services in the New Era

3. PROBLEM ANALYSIS

NEED FOR A GENERALIZED FRAMEWORK FOR SEMANTIC SERVICES

The number of semantic web services is increasing progressively, and well-defined representational architectures have been established. However, many of these are specific to a domain or to the service delivered. For instance, a semantic services framework for the healthcare domain is proposed by Niland et al. [5], while Ding et al. [6] and [7] provide frameworks suitable for information retrieval as a semantic service. A generalized framework for the development and evaluation of semantic web services themselves, ones that are truly semantic from end to end, is still lacking. Most semantic web service providers have developed system architectures specific to their application or service. It cannot be disregarded that common elements exist across several of these architectures, yet none of them provides all the components of an end-to-end system on its own. Only after components from several of these separate architectures are integrated can we derive a generalized framework. Also, to the best of our knowledge, no standard has yet been defined to measure the semantic nature of such services. It is crucial to have a method to validate the semantic nature of any service claiming to be a semantic service in the first place. Such validation tools can be used to better identify the requirements of a semantic system and to document the nature of the semantic knowledge utilized by the service.

NEED FOR A SEMANTIC SEARCH PORTAL FOR BIOMEDICAL RESEARCH

The growing demand for innovative and effective methods and treatments by healthcare delivery agencies has put tremendous pressure on the biomedical research community. With the increased use of data-centric methods in biomedical informatics research, it is not surprising that the number of datasets available to researchers is enormous [8]. Though the availability of more data is generally a good thing, it is of limited use if access to it is restricted by its heterogeneous and distributed nature. Moreover, there are very few knowledge-based query tools that can truly cater to the need of perusing this massive amount of data and provide access to credible and relevant resources [9]. Given the sheer extent and expanse of the data and the increasingly sophisticated technologies used to access it, one of the major hindrances to the pace of research is the inability of researchers to discover and leverage resources [10]. For example, information about laboratory equipment and data samples can be scattered across an organization. Similarly, a vast quantity of data in electronic medical records, clinical trials and their results, as well as genomics data, is stored in heterogeneous datasets where it is hard to reach. This lack of easy access to quality data stifles research and collaboration, and is even more pronounced when the data is spread across organizations. Another major need for research is the availability of expertise and collaborators, and the means to find them quickly. The domain of biomedical research suffers from a lack of data standards, both for storage and for interchange. A lot of current research is directed towards the development of these standards and of integrative tools that employ them [11]. Though progress has been made within silos of resource types through the use of domain ontologies, like the Gene Ontology or the MeSH vocabulary, we are far from realizing full semantic connectedness.

Thus, we have identified two problem statements that are addressed in this thesis.

PROBLEM STATEMENT 1: Design a generalized framework for the development of end-to-end semantic services and the validation of their semantic nature. Such a framework should provide an architecture that identifies the key elements required for the implementation of semantic services. The framework should also define a way to check the semantic nature of such services.

PROBLEM STATEMENT 2: Implement a robust knowledge-based service for the biomedical research domain using the architecture proposed by the semantic services framework. The service should provide semantic search capability and serve as a platform for semantic data integration. Validate the semantic nature of the implementation using the baseline metrics proposed in the framework.

4. CONTRIBUTIONS

In this work, a framework that describes the different components required for implementing semantic services has been proposed. The framework provides a software architecture that is generic to all semantic services. Baseline evaluation metrics that apply to all semantic services are also provided as a means to test the “semantic” nature of the developed services.

ResearchIQ [12] is implemented as a semantic search portal for resource discovery for researchers in biomedical informatics. ResearchIQ serves as a framework for semantic data integration and knowledge delivery in the biomedical domain. It provides a case study for testing the architecture and metrics defined in the proposed framework.

CHAPTER 2

RELATED WORK

1. LINKED DATA

The World Wide Web has evolved tremendously over the last decade. The Netcraft web server survey for October 2012 [13] reported responses from over 620 million servers, as can be seen in Figure 2. This number is staggering compared to about 9 million in 2002 and only about 2.5 million in 1998 [14]. The amount of data generated by the Web is simply colossal. The key reasons for this massive expansion [15], along with some examples, are:

INTERNET AVAILABILITY HAS INCREASED AND MORE USERS AND ENTITIES ARE CONNECTED TO THE INTERNET

In a recent study published by Internet World Stats, 63.5% of the population of Europe are Internet users; in North America the figure is 78.6%. The total number of Internet users in the world is five times what it was in 2000.

COMPUTATIONAL POWER HAS INCREASED THE DATA STORAGE AND PROCESSING CAPACITY

Evidence is provided by the amount of data handled by social networking sites [16] like Facebook and Twitter, and by commercial websites like Amazon and Wal-Mart, which generate millions of transactions per day. Google is undoubtedly the leader in this area, with information flowing into petabytes.

LEGISLATIVE NORMS DICTATE THE RECORDING AND RETENTION OF MORE INFORMATION

This is tied to the previous point about the availability of computational power. An example is the adoption of electronic health record systems by health care providers and organizations across developed countries and many of the developing countries.

Figure 2: Size of the Web. (Source: netcraft.com)

This explosion of generated information led to the idea of Big Data [17]: collections of data sets so large and complex that conventional data management becomes obsolete. Beyond volume, heterogeneity of the data is yet another effect of the data explosion. To put this voluminous data to good use, it needs to be organized in an orderly and efficient manner. The increased use of data analytics and smarter information retrieval has led to an unfulfilled and growing need for access to such data.

This is the essence of Linked Data. The term Linked Data refers to a set of best practices for publishing and interlinking structured data on the Web, as defined by [18]. The goal of linked data is to revolutionize the way we discover, access, integrate and use data. Tim Berners-Lee emphasized the need for linked data, its uses and, most significantly, the principles [19] of linked open data (LOD) [20]:

• Use Uniform Resource Identifiers (URIs) to name entities. These entities can be tangible things like people, places and publications, but they can also be abstract, such as relationships or sets of entities.

• As far as possible, use HTTP URIs so that people can look up these entities.

• Provide useful meta-information using established standards.

• Provide appropriate valid URIs linking to your entity.

A key factor that determines how reusable data is, is the extent to which it is well structured. Table 1 gives a simple 5-star evaluation of data put on the Web, proposed by Tim Berners-Lee [19]. It should be noted that the stars are achieved progressively; thus, the linked data framework can be adopted hierarchically rather than in a single "go."

★ Make your stuff available on the web (whatever format)

★★ Make it available as structured data (e.g. Excel instead of image scan of a table)

★★★ Non-proprietary format (e.g. csv instead of excel)

★★★★ Use URLs to identify things, so that people can point at your stuff

★★★★★ Link your data to other people’s data to provide context

Table 1: 5-star rating for data as suggested in the linked data movement

In conclusion, linked data leverages the architectural principles of the Web as we know it and reengineers it into a global data space. The properties of the linked data framework allow data consumers to discover, re-use and integrate data. Linked data aids this integration of data from heterogeneous sources through its reliance on shared vocabularies. One issue with linked data is the lack of explicit licensing information provided by publishers, which leaves data consumers skeptical about using the data. Another related issue is the lack of a proper unified forum to advertise the published data. Being relatively new, linked data has many competing standards, which leads to confusion about how these can be connected to each other. With the tremendous growth in the linked open data (LOD) initiative over the past couple of years, as can be seen in Figure 3 [18], there has been corresponding growth in the adoption of the semantic web and semantic web technologies.


Figure 3: Growth in Linked Open Data

Figure 4: The Linked Open Data Cloud

2. INTRODUCTION TO SEMANTIC WEB

Tim Berners-Lee coined the term "Semantic Web" [20] and defined the web of linked data in his book published in 2009 [21]. He defined it as a "web of data that can be processed directly and indirectly by machines," true to the philosophy of linked data. The World Wide Web Consortium (W3C) defined it as a network "that facilitates machines to understand the semantics, or meaning, of information on the WWW" and has called it Web 3.0 [22]. It extends the network of hyperlinked, human-readable web pages by inserting machine-readable metadata about pages and how they are related to each other, enabling automated agents to access the Web more intelligently and perform tasks on behalf of users. Even though the idea of a semantically structured web has been around for about a decade, only the recent push towards the ideologies of Linked Data and Big Data has brought it into the limelight. As mentioned before, semantic web technologies have matured significantly only in the last few years. The linked open data cloud [23] as of September 2010, shown in Figure 4, bears testimony to the potential of the semantic web. The different colors represent the broad domains to which the data sets belong. Linked data and semantic web technologies have pervaded significantly through the domains of publications, geographic data, government data and life sciences data. The heterogeneity and scope of the data are clearly visible from the graph.

A reasonable and now well-established assumption is that we can no longer look at the Web in the traditional sense, as a collection of interconnected documents. The linked data movement put forth the idea of a "web of things" [24], wherein we look at the web as a collection of entities linked by well-defined relationships. As can be seen in Figure 5, the different entities and their relationships form an intricate graph structure. Hence, an alternative view of the web is as a "Giant Global Graph" of data [25]. The advantage of looking at the web in this way is that graph-based algorithms can now be applied for information retrieval and analysis. Moreover, since the "edges" of this graph are well-defined relationships, additional sophistication can be applied to these algorithms to provide semantic services.
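The graph view described above can be made concrete with a minimal sketch. The entities and relationship names below are invented for illustration; the point is that once the web is modeled as labeled edges, a standard graph algorithm such as breadth-first search can discover how two entities are connected:

```python
from collections import deque

# Entities as nodes, well-defined relationships as labeled directed
# edges. These toy facts are illustrative, not from a real dataset.
EDGES = [
    ("IBM", "headquartered_in", "Armonk"),
    ("Armonk", "located_in", "New York"),
    ("New York", "part_of", "United States"),
    ("IBM", "has_lab_in", "Zurich"),
]

def shortest_relationship_path(edges, start, goal):
    """Breadth-first search over labeled edges; returns the list of
    (relationship, entity) hops connecting start to goal, or None."""
    adjacency = {}
    for s, p, o in edges:
        adjacency.setdefault(s, []).append((p, o))
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for rel, nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(rel, nxt)]))
    return None

print(shortest_relationship_path(EDGES, "IBM", "United States"))
# [('headquartered_in', 'Armonk'), ('located_in', 'New York'), ('part_of', 'United States')]
```

Because the edges carry relationship labels, the returned path is not just a chain of nodes but a readable chain of facts, which is exactly what makes graph algorithms over such data "semantic."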

Figure 5: Comparison Between Traditional (Web 2.0) and Semantic Web

The availability of data and the computational power to perform analytics over it have triggered the use of large-scale data analytics in commercial, academic and research environments. However, the mere availability of data is not sufficient for these tasks. Ultimately, the amount of computation that can be performed on the data is governed by:

• How easy is it to find the data?

• Is the data available for reuse?

• How structured is the data?

The purpose of the semantic web is to answer these questions by providing a global framework to find, share and integrate data easily. Additionally, it helps developers re-use, analyze and exchange data for knowledge-driven applications. Thus, the semantic web is needed to transform raw data into usable knowledge. Semantic web technologies provide the platform for a variety of intelligent web-based activities [26], significant among which are data interoperability, cross-data referencing and inference, and the integration of web services. Figure 6 summarizes the various applications that can be supported, thus exemplifying the need for a semantic web.

Figure 6: Semantic Web Services and Intelligent Business

MYTHS ABOUT SEMANTIC WEB

There are several misunderstandings about the semantic web. In his book "Semantic Web for Dummies," author Jeff Pollock [27] elaborated on some of the top myths, which are reiterated here:

• SEMANTIC WEB IS SCIENCE FICTION: Though it is true that a full-fledged implementation of Web 3.0 has not yet been achieved, it cannot be claimed that the semantic web does not exist. It can be seen in bits and pieces all around us. We brief over some of the many success stories of functioning semantic-web-based systems in a later section of this document.

• SEMANTIC WEB IS FOR TAGGING WEB SITES ONLY: Though the idea of the semantic web was conceptualized primarily for tagging web pages, that is hardly the full extent of its scope. As discussed earlier, the semantic web can enhance a variety of data-driven services. It provides for generalized views and services by grouping entities across domains and geographical boundaries, which points to the future directions of social and business interactions and networking. At the same time, it has the capacity to enable personalized services due to the ample meta-information attached to entities. Semantic web technologies support many services other than just linking. In fact, the semantic search service that is the topic of this thesis provides a counterexample to this myth.

• SEMANTIC WEB IS TOO COMPLICATED TO SUCCEED: The progressive approach to adopting linked data applies to the implementation of the semantic web as well. There is no need to implement the whole thing overnight. Simple beginnings will help the semantic web evolve, the way they did with the traditional web.

• ALL DATA ON THE SEMANTIC WEB IS IN ONTOLOGICAL FORMAT: Eventually, maybe, but there are several intermediate stages that also contribute to the semantic web. It is not necessary that the data you publish be in ontological format. This ties into the previous myth: as long as there is a clear understanding of the principles of linked data, a contribution to the semantic web can be made.

3. SEMANTIC WEB (WEB 3.0) TECHNOLOGIES

This section elaborates on the technologies that have been developed and are in use for supporting the semantic web. It should be noted that all of these technologies and standards have been established to facilitate the publishing and exchange of data in the semantic web. The semantic web stack demonstrates the different layers of services. We shall look at the key concepts in detail.

Figure 7: Semantic Web Stack [28]

HYPERTEXT (WEB 2.0) TECHNOLOGIES

Since the semantic web is based on the traditional web, the bottommost layers are formed of the existing Web 2.0 technologies. The Identifier and Character Set are universally accepted; these, along with the Syntax, are part of the current web architecture. The XML [29] syntax should be noted, as it set the groundwork for creating and exchanging structured data over the Web. However, on its own it lacked the capability to represent complex data structures. Since XML was essentially used to feed web services, it only allowed tree structures for fear of causing the services to loop infinitely.

RESOURCE DESCRIPTION FRAMEWORK (RDF)

The W3C specified the Resource Description Framework (RDF) [30] as a way of representing information, originally to model meta-data; it is the W3C's recommended framework for data interchange. Almost immediately after XML was conceived, the need for RDF was realized [31]. Over the years there have been continuous refinements to the framework, but the fundamentals have remained the same. The basic structure of RDF is very similar to the entity-relationship format in relational databases. Data is represented in the form of "triples" as subject – predicate – object. The "subject" and "object" are entities, while the "predicate" is an abstract link or relationship between the two entities. For example, the graph in Figure 8 [32] has two triples. The subject in both cases is "IBM", which is of type "company". Thus, the triples are:

1. "IBM" – "Headquarters located in" – "Armonk, New York, United States" and

2. "IBM" – "Research lab located in" – "Zurich, Switzerland"

All the subjects, objects and their relationships are defined using Uniform Resource Identifiers (URIs). The insistence on URIs prevents ambiguity due to replication of resources and allows others to understand and use your data more effectively, so care should be taken while assigning URIs to resources.

Figure 8: Example RDF Graph

Since all the triples are directed in nature, the resulting structure is a directed graph. A collection of triples, like the simple one above, is stored in "triplestores." A triplestore is purpose-built for storing and retrieving triples. Several triplestore implementations, including Jena and Sesame, are available for use.
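As a toy illustration of the idea behind a triplestore (this is not the API of Jena or Sesame), the triples above can be held as plain (subject, predicate, object) tuples and queried by pattern matching, with None acting as a wildcard. Identifiers are shortened strings here for brevity, whereas a real store would use URIs:

```python
# A minimal in-memory "triplestore" sketch: triples stored as tuples.
triples = {
    ("IBM", "type", "Company"),
    ("IBM", "headquarters_located_in", "Armonk, New York, United States"),
    ("IBM", "research_lab_located_in", "Zurich, Switzerland"),
}

def match(store, s=None, p=None, o=None):
    """Return all triples matching the pattern; None is a wildcard."""
    return sorted(t for t in store
                  if (s is None or t[0] == s)
                  and (p is None or t[1] == p)
                  and (o is None or t[2] == o))

# All facts about IBM:
for t in match(triples, s="IBM"):
    print(t)

# Where is IBM's research lab?
print(match(triples, s="IBM", p="research_lab_located_in")[0][2])
# Zurich, Switzerland
```

This triple-pattern-with-wildcards style of retrieval is the core operation that query languages over triplestores build upon.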

RDF SCHEMA (RDFS)

RDF Schema [33] is a knowledge representation method recommended by the W3C to define RDF structures. It defines the basic entities within RDF and provides basic constructs such as Classes and Properties, giving logic and structure to raw RDF.
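The kind of structure RDFS adds can be sketched minimally. The class names and hierarchy below are invented for illustration; the inference shown (an instance of a class is also an instance of its superclasses, via rdfs:subClassOf) is computed with a simple upward walk:

```python
# Illustrative class hierarchy, in the spirit of rdfs:subClassOf.
# Each class maps to its (single, for simplicity) superclass.
subclass_of = {
    "ResearchArticle": "Publication",
    "Publication": "Resource",
    "Dataset": "Resource",
}

def inferred_types(direct_type, hierarchy):
    """Walk subClassOf links upward to collect every class an
    instance belongs to, not just its directly asserted type."""
    types = [direct_type]
    while types[-1] in hierarchy:
        types.append(hierarchy[types[-1]])
    return types

print(inferred_types("ResearchArticle", subclass_of))
# ['ResearchArticle', 'Publication', 'Resource']
```

Real RDFS allows multiple superclasses and also constrains properties (domain and range); this sketch only shows the class-hierarchy aspect.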

ONTOLOGIES

Tom Gruber proposed the idea of ontologies in the information sciences back in 1993 [34]. He said, "An ontology is a description (like a formal specification of a program) of the concepts and relationships that can formally exist for an agent or a community of agents. This definition is consistent with the usage of ontology as set of concept definitions, but more general." The modern definition, again given by Gruber [35], is not very different from the original. In 2008 the term was redefined as, "An ontology defines a set of representational primitives with which to model a domain of knowledge or discourse," giving more insight into the goal of defining an ontology. In general, an ontology can be seen as a well-defined vocabulary for a particular domain, containing not only the definitions of all entities within its scope but also stating explicitly the relationships between these entities [36].

In essence, ontologies are a more expressive and refined version of the RDF and RDFS previously described. They are built on the same basic principles but incorporate additional components like Individuals, Attributes and Events in addition to just Classes and Relations. This means that ontologies can also be stored in the form of triplestores, provided there is some language to represent and query them. In fact, a fully formed ontology is itself, in essence, a triplestore. Thus, ontologies can be used to represent the schema of the data being stored as well as the actual data values.

There are several representational languages that can formulate an ontology. One of the most popular, also recommended by the W3C, is the Web Ontology Language (OWL) [37]. OWL is preferred because it is based on the original RDF and RDFS; additionally, it allows for reasoning and inferencing over the structure of the ontology [38]. Other formats such as N-Triples or N3 [39] are also very popular. These representations are typically interchangeable, and it is not surprising that all of them are valid representations for triplestores.
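One of these interchangeable formats, N-Triples, is simple enough to sketch: each statement is a single line of the form `<subject> <predicate> <object> .`. The example.org URIs below are made up for illustration:

```python
# Serialize (subject, predicate, object) tuples to N-Triples lines.
# The namespace is a made-up example; real data would use the URIs
# under which the entities are actually published.
EX = "http://example.org/"

def to_ntriples(triples):
    """Emit one '<s> <p> <o> .' statement per line."""
    return "\n".join(f"<{EX}{s}> <{EX}{p}> <{EX}{o}> ." for s, p, o in triples)

data = [("IBM", "type", "Company"),
        ("IBM", "researchLabLocatedIn", "Zurich")]
print(to_ntriples(data))
# <http://example.org/IBM> <http://example.org/type> <http://example.org/Company> .
# <http://example.org/IBM> <http://example.org/researchLabLocatedIn> <http://example.org/Zurich> .
```

Because the format is line-oriented, the same set of triples round-trips trivially between a triplestore and a flat file, which is part of why these representations are interchangeable in practice. (Real N-Triples also has quoting rules for literal values, omitted here.)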

ADDITIONAL LAYERS

Over these core layers lie the Unifying Logic and Proof layers. The proof layer is used to validate the underlying structure and logic. Though optional, it is beneficial and good practice for applications to include the proof layer.

CHAPTER 3

THE SEMANTIC SERVICES FRAMEWORK

1. SEMANTIC WEB APPLICATIONS

In this section we illustrate a few case studies where semantic web technologies have been used successfully. This is to demonstrate the scope of these technologies, especially beyond just linking of data on the web.

CASE 1: LINKING GENOMIC DATA USING THE GENE ONTOLOGY

The Gene Ontology (GO) [40] was developed to facilitate a unified representation of genes and gene products across all species and to provide easy access to the annotated data. The Gene Ontology project provides an ontology of defined terms representing gene product properties. Prior to GO there was no accepted standard for gene representation, which made sharing genomic data very difficult.

Three domains are covered by this ontology:

• Cellular Component: the parts of a cell or its extracellular environment.

• Molecular Function: the elemental activities of a gene product at the molecular level, such as binding or catalysis.

• Biological Process: operations or sets of molecular events with a defined beginning and end, pertinent to the functioning of integrated living units: cells, tissues, organs, and organisms.

Figure 9: Gene Ontology (Source: Nature.com)

Figure 9 shows an excerpt of the gene ontology specific to cancer gene annotations. As can be seen in the figure, the gene ontology is structured as a directed acyclic graph (DAG). The gene ontology continues to grow as researchers contribute new annotations as the corresponding samples are discovered [41]. Several tools have been developed to visualize as well as annotate data using GO; most notable among these are AmiGO [42], GOCat [43] and OBO-Edit [44].

The gene ontology allowed automated analysis and reasoning procedures to be run over the ontology structure. These procedures generated inferred genomic annotations that are critical for research. GO is thus an example of using an ontology for standardization and of creating an environment for integrating and sharing data.

CASE 2: GOOGLE MOVING TO SEMANTIC SEARCH

In 2012, Google announced that it would use semantics to improve its web search [45]. It uses a "knowledge graph," illustrated in Figure 8(a) below, which is essentially a graph of connected entities similar to the RDF graph structure. The idea is that this should provide more contextual information in order to give better search results, or at least allow the results to be portrayed in a format that gives more context to the users.

Figure 8(b) is a search example for the keywords "Barack Obama." Google is able to identify the most likely entity related to these keywords and provide search results accordingly. The search is separated into several categories, as seen in the figure. This effective integration of results is achieved using semantic web technologies.

It is quite a challenge to design semantic services that effectively utilize background information. This semantic search structure is an example of a semantic service that uses the available contextual and semantic data effectively.

Figure 8(a): Visualization of the Google Knowledge Graph (Source: cnet.com). (b): An example search for "Barack Obama" showing the different semantic types of results: search results, news, social media, videos, images, related searches and meta information (Source: Google.com).

CASE 3: ONTOLOGICAL ACCESS CONTROL FRAMEWORK BY METADB

The MetaDB [46] project was designed as an effective system for providing role-based access control to federated data in project-oriented environments. It uses an ontology to organize the different roles within and across organizations and the access rights assigned to each. Since such organizations can become quite complicated, an ontology-based architecture provides easy identification and extensibility. Figure 9 shows an example ontology for providing access control within the MetaDB system.

Figure 9: Ontology for role-based access control in MetaDB [46]

These are but a few examples of what semantic web technologies can achieve. The rise in adoption of semantics in the way we create, analyze and share data will, in turn, increase the number of semantic services and applications exponentially.

2. THE PROPOSED FRAMEWORK

As seen in the previous section, a wide range of applications and services can be implemented using the semantic web framework. We refer to these as semantic web services. Web services, as we know them, are interaction based and machine understandable. Traditional web services are based on WSDL [47]; the new semantic web services are designed using RDF. Figure 10 [48] shows the evolution of semantic web services and where they fit relative to the traditional web and web services.

Figure 10: Evolution of semantic web services [48]

Figure 11 describes the proposed framework. It is necessary to separate data generation from data consumption by taking a modular approach to semantic web services. The resources fed into semantic web services can be monolithic, but usually they are disparate and heterogeneous; hence the need for semantics. The initial data extraction and annotation together form the "Semantic ETL" process, whose goal is to associate semantic meta-information with the data, transforming it by annotating the appropriate semantic content within the raw data.

Let us look at each of the Extract, Transform (Annotate) and Load processes in the semantic context in detail.

[Figure: resources flow through a Data Extractor and a Semantic Annotator (backed by ontological knowledge) into a Triple Store, which feeds a Semantic Services Bus and, finally, applications.]

Figure 11: Proposed semantic services framework

SEMANTIC ETL PROCESS

DATA EXTRACTION

The semantic data extractor needs to handle different types of structured data sets, such as relational databases or other triplestores, and semi-structured data such as other web services and scientific corpora. It should also be able to digest completely unstructured data. For large integrative systems the extractor can be subdivided into modules for different data sources. Doing so is in fact beneficial, as it makes the system more scalable and extensible: newer resources and data sets can be extracted by adding the appropriate module. The data extractor feeds the semantic annotator, which acts as the transformer in the semantic ETL process.
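As a concrete illustration of this modular design, the sketch below (in Java, with entirely hypothetical names and stub data, not ResearchIQ's actual code) puts each source behind a common extractor interface, so adding a new source means adding a new module:

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a modular extractor: each data source gets its own
// module behind a common interface, so new sources can be added without
// touching the rest of the pipeline.
public class ExtractorSketch {

    // A raw record handed to the semantic annotator: the text to annotate
    // plus source-level metadata (name, URL, resource type, ...).
    record RawRecord(String text, Map<String, String> metadata) {}

    interface DataExtractor {
        String sourceName();
        List<RawRecord> extract();
    }

    // One module per source; these stubs stand in for a real RDBMS
    // connector, web crawler, or XML parser.
    static class RdbmsExtractor implements DataExtractor {
        public String sourceName() { return "clinical-trials-db"; }
        public List<RawRecord> extract() {
            return List.of(new RawRecord("Phase II trial of drug X",
                    Map.of("type", "ClinicalTrial")));
        }
    }

    static class WebExtractor implements DataExtractor {
        public String sourceName() { return "lab-websites"; }
        public List<RawRecord> extract() {
            return List.of(new RawRecord("Mass spectrometry core facility",
                    Map.of("type", "Website")));
        }
    }

    // The pipeline simply iterates over whichever modules are registered.
    static List<RawRecord> runAll(List<DataExtractor> modules) {
        return modules.stream().flatMap(m -> m.extract().stream()).toList();
    }

    public static void main(String[] args) {
        List<RawRecord> records = runAll(List.of(new RdbmsExtractor(), new WebExtractor()));
        System.out.println(records.size() + " records extracted");
    }
}
```

Adding a new data source then amounts to implementing `DataExtractor` once and registering the module.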

SEMANTIC ANNOTATOR

The semantic annotator is the most crucial step in the semantic ETL process. Like the transform step of a traditional ETL process, it is responsible for turning the data arriving from the extraction phase into a consolidated, standardized and usable data store. The difference is that in semantic ETL the annotator stores semantic annotations and metadata of the data along with the direct data values. The semantic annotator also relies on ontological knowledge to transform the data. The level of abstraction of the ontological information used, and the type of knowledge it represents, depend on the scope of the data set being aggregated; typically this should be a domain-specific collection of ontologies and controlled vocabularies. There are several ways of storing semantic information, and it need not be restricted to ontologies. The different levels of data representation are shown in Figure 12, a modified version of the McCandless pyramid [49]. Ontologies provide the deepest semantic knowledge and context. Storing the information in one of these formats is the job of the loader.

LOADER AND TRIPLESTORE

The loader should store information in a triplestore as far as possible. The ontological format for storing information has several benefits, despite being slower than a traditional RDBMS.

• Ontologies have very rich schema descriptions, because every relationship is explicitly specified within the ontology. This information might not be available for fields within a table in traditional RDBMS systems.

• Conformance to ontologies, especially standard ontologies, immediately makes the data portable and reusable. This ranks high in the five-star evaluation discussed earlier in this document.

• The schema (including constraints and inference logic), the metadata and the data itself are all bundled into one single data store.

Once all the data is loaded, it can be exposed to data consumers. It is important that the standard ontologies on which the ETL process is based are made explicit and open to other users, to facilitate effective use of the data.
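The point about bundling schema, metadata and data in one store can be illustrated with a minimal in-memory sketch; the `Triple` record, the prefixes and the example statements below are illustrative stand-ins, not a real triplestore:

```java
import java.util.HashSet;
import java.util.Set;

// Minimal in-memory illustration (not a real triplestore) of the point
// above: schema statements, metadata and data values all live in the same
// store as subject-predicate-object triples.
public class TripleStoreSketch {
    record Triple(String subject, String predicate, String object) {}

    static class TripleStore {
        private final Set<Triple> triples = new HashSet<>();
        void add(String s, String p, String o) { triples.add(new Triple(s, p, o)); }
        long countWithPredicate(String p) {
            return triples.stream().filter(t -> t.predicate().equals(p)).count();
        }
        int size() { return triples.size(); }
    }

    public static void main(String[] args) {
        TripleStore store = new TripleStore();
        // A schema-level triple (class definition)...
        store.add(":Publication", "rdf:type", "owl:Class");
        // ...a metadata triple typing a resource instance...
        store.add(":pubmed_123", "rdf:type", ":Publication");
        // ...and a data-level triple, all in one store.
        store.add(":pubmed_123", "dc:title", "A study of biomarkers");
        System.out.println(store.size() + " triples");
    }
}
```

A production system would of course use a persistent RDF store, but the uniform triple shape is what makes the bundling possible.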


Figure 12: Transforming data for meaningful use using semantics

3. SEMANTIC WEB SERVICES EVALUATION

Data consumers develop the semantic web services, which form the applications on the semantic service bus. As the earlier use cases show, these services can range from information retrieval to analysis to visualization. The breadth of the semantic service bus depends in turn on the ontological knowledge used during the semantic ETL process. It is important to evaluate the different semantic services irrespective of their type. Since standard evaluation methods for web services already exist, the additional evaluation needed for semantic web services is to ask how semantic the web service actually is.

First we need to evaluate the semantic web services bus. A baseline judgment could be the number of different types of services and applications it can support. This depends highly on the outcomes of the semantic ETL process and the way the data is represented in the data store. Borrowing guidelines from the linked data movement, we can define a five-star rating, the levels of which can be seen in Figure 12: storing raw data earns a single star, while a fully ontological triplestore earns five stars.

Now we evaluate the semantic services and applications themselves. Giunchiglia et al. [50] used a three-point evaluation for search engines. Here we extend that evaluation to encompass all types of semantic services. The axes have been selected to provide a means of validating the semantic nature of a service and of allowing comparison with other semantic services irrespective of their type. Figure 14 shows the proposed evaluation criteria, which apply to all semantic services. The evaluation is done on three axes.

[Figure: a pyramid of data representations, from least to most semantic: raw data (structured or unstructured); syntactic metadata (tag sets, bags of words, descriptions); structural metadata (schemas, XML, JSON); vocabularies and taxonomies; and, at the apex, ontologies.]

Figure 13: Evaluating Semantic ETL


Figure 14: Evaluating semantic services

1. THE KNOWLEDGE CONSUMED BY THE SERVICE

The service, irrespective of its type or domain, may utilize only up to a certain level of meta-information. The deeper the semantics involved, the richer the semantic understanding of the service. The range of semantic meta-information types is portrayed in Figure 13; the scale runs from purely syntactic forms of knowledge representation to highly semantic ones. Understandably, ontologies form the apex of semantic meta-information representation. They provide the most precise and complete semantic knowledge, and hence any service able to articulate its usage of such a representation can be considered more "semantic" in nature.

2. GRANULARITY OF THE ENTITY REASONED UPON

Semantic web services make use of a semantic graph of data to provide their service. They may differ in what they use as the nodes, the "entities," of this graph. Though this criterion is better suited to information retrieval services, it is still a good idea to define the granularity for all services for greater transparency. For instance, a visualization service might be restricted to the level of visualization it is intended to provide, but specifying that level helps other services and applications interact with it better. The range of granularity runs from individual words to more conceptual representations; this can also be seen as ranging from syntactic to semantic representation of the entities of the knowledge graph used.

3. UNDERSTANDABILITY OF USER INPUT

The lowest form of understandability is when user input is parsed as keywords. Better understanding of the input, in the form of clusters and phrases, requires better parsing capabilities. The ultimate semantic parser would understand input in natural language and extract the semantic information necessary to form the query.

This can be thought of as a white box testing method for evaluating the semantics involved in the system. The semantic evaluation should provide better transparency of the web service. This in turn will help develop better applications and interfaces to these services. The evaluation metrics should provide a basis for performing semantic level comparisons between different semantic services irrespective of the domain of their implementation or the kind of service they provide.
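One hypothetical way to operationalize the three axes is to treat each as an ordinal scale and score a service by its position on each. The enum level names below follow the text, while the numeric scoring (summing ordinal positions) is purely an illustrative choice of ours, not part of the proposed framework:

```java
// Hedged sketch of the three evaluation axes as ordinal scales.
public class SemanticEvaluationSketch {
    enum Knowledge { RAW, SYNTACTIC_METADATA, STRUCTURAL_METADATA, VOCABULARY, ONTOLOGY }
    enum Granularity { WORD, PHRASE, CONCEPT }
    enum InputUnderstanding { KEYWORDS, PHRASES, NATURAL_LANGUAGE }

    // Each enum's ordinal position serves as its score on that axis.
    static int score(Knowledge k, Granularity g, InputUnderstanding u) {
        return k.ordinal() + g.ordinal() + u.ordinal();
    }

    public static void main(String[] args) {
        // A keyword search over raw text scores lowest...
        int baseline = score(Knowledge.RAW, Granularity.WORD, InputUnderstanding.KEYWORDS);
        // ...while an ontology-backed, concept-level, natural-language
        // service scores highest.
        int ontological = score(Knowledge.ONTOLOGY, Granularity.CONCEPT,
                InputUnderstanding.NATURAL_LANGUAGE);
        System.out.println(baseline + " vs " + ontological);
    }
}
```

Such a score only supports the coarse comparisons between services that the text describes; the axes remain qualitative.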

The proposed framework is exemplified in the implementation of the "ResearchIQ" project. The goal of the project is to provide semantic search capabilities to researchers over several heterogeneous data sources in the domain of biomedical informatics. The subsequent chapters of this thesis are dedicated to ResearchIQ.

CHAPTER 4

INTRODUCTION TO RESEARCHIQ

In this section the Research Integrative Query (ResearchIQ) tool [51] is presented. It is a semantically anchored resource discovery platform that facilitates the discovery of semantically related local and publicly available data through a single web portal. The tool uses an ontological framework for annotating and linking resources. These semantic linkages are subsequently leveraged while searching for relevant knowledge and resources. The user interface is designed to be simple and intuitive, taking HCI factors into consideration. ResearchIQ has been developed to act as a front door to different forms and types of information, specifically for researchers in the clinical and translational sciences domain. ResearchIQ attempts to break the barrier between multidimensional data and researchers by providing a simple means to find much-sought-after quality data. The system relies heavily on ontologies to perform all its functions. This ontological framework makes the system extensible and able to use semantics thoroughly, not just at the surface.

1. GOALS OF RESEARCHIQ

The broad goals of ResearchIQ are summarized as follows:

• Integration of heterogeneous data sources in the biomedical informatics domain.

• Consolidated search capability over heterogeneous data sets.

• Semantic search capability to provide more meaningful results.

• A single, easy-to-use search portal for resource discovery.

2. CHALLENGES

There are several inherent challenges when handling large and heterogeneous data sets.

These problems are even more pronounced when combined with complications related to semantic metadata. The key problems faced by the system depend on the kind of service provided. Since ResearchIQ is essentially an information retrieval system, the following section highlights the key challenges in such systems.

CHALLENGES IN SEMANTIC DATA EXTRACTION

Converting raw textual data into semantic representations is a research problem in its own right. Several natural language processing algorithms have been used, with varying degrees of effectiveness, to annotate text with semantic structures. It is a very challenging task and a constant cause of debate in the semantic web community. Since a lot of the textual content is discarded, the process of annotation is inherently lossy. Moreover, the precision and recall required of the generated annotations are defined by the intended application, which makes the task even more arduous.

PROBLEMS WITH HETEROGENEITY

There are several types of heterogeneity that make the task of integrating information particularly difficult. Here we are not referring to the technical heterogeneity involved in bringing in data from different sources, but to the higher-level heterogeneity that dealing with any data entails. Such heterogeneity can be broadly classified as follows:

• Syntactic Heterogeneity

This results from different representations of the same data. For example, "Date of Birth" in one data source is referred to as "DOB" in another. It can even be as simple as data being in different languages.

• Schematic or Structural Heterogeneity

The structure of the data stored in different data sources differs. To continue the earlier example of "date of birth," it is easily conceivable that the actual format for storing dates differs: one source might use "mm/dd/yyyy" while another uses "dd-mm-yy."

• Semantic Heterogeneity

The meaning of data changes with context. It is quite a challenge to determine the right meaning in natural text, and even more so for abbreviations. Even though the problem is most prominent across domains, the field of biomedicine is riddled with abbreviations that make it critical. For instance, "TOD" stands for "time of death" in the clinical domain while it refers to "tricho onychodental dysplasia" in the medical context; it is quite possible that both occur in the same text.

• System Heterogeneity

Though this kind of heterogeneity is not present at the application layer, it is included here for completeness. It refers to heterogeneity derived from the use of different operating systems or versions of the same. The big-endian versus little-endian problem is a classic example.
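The structural heterogeneity in the "date of birth" example can be resolved by normalizing every incoming format to one canonical representation. A minimal Java sketch, assuming only the two formats mentioned above (note that two-digit years remain inherently ambiguous and are resolved here by Java's default century pivot):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.List;

// Sketch: the same "date of birth" value arrives as "mm/dd/yyyy" from one
// source and "dd-mm-yy" from another, and is normalized to a canonical
// LocalDate by trying each known format in turn.
public class DateNormalizer {
    private static final List<DateTimeFormatter> KNOWN_FORMATS = List.of(
            DateTimeFormatter.ofPattern("MM/dd/yyyy"),
            DateTimeFormatter.ofPattern("dd-MM-yy"));

    static LocalDate normalize(String raw) {
        for (DateTimeFormatter f : KNOWN_FORMATS) {
            try {
                return LocalDate.parse(raw, f);
            } catch (DateTimeParseException ignored) {
                // try the next known format
            }
        }
        throw new IllegalArgumentException("Unrecognized date format: " + raw);
    }

    public static void main(String[] args) {
        System.out.println(normalize("07/04/1976")); // prints 1976-07-04
        System.out.println(normalize("04-07-76"));   // month and day recovered; the century is guessed
    }
}
```

A real integration layer would also record which source each format came from, since format lists like this grow with every new source.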

PROBLEMS WITH AMBIGUITY

Ambiguity refers to the difficulty of understanding search terms, both while accepting input from the user and during reasoning while performing the search. The basic problems are similar to those of heterogeneity but occur at different stages of the system. Typically, ambiguity arises when dealing with natural language. The types of problems faced are:

• Homonyms

A word with the same spelling may have different meanings in different contexts. This type of ambiguity is very hard to detect, especially when dealing only with keywords as inputs, as is the case with most search engines. A simple example is the word "tire," which means both to wear down and a rubber ring.

• Polysemy

This is a special case of homonymy in which the two meanings of the word are related to each other. For instance, the term "age" in clinical notes might mean "present age," "age at time of surgery" or "age at time of admission."

• N-types

Derivatives of the same word, like "corpus" and "corpora." It is essential to know that these refer to the same concept in order to reduce ambiguity in the search.

• Semantic

These cases are the hardest to detect and arise from the structure of the language itself. The sentence "I cooked her goose" has several different meanings.
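The n-types case can be handled by collapsing derivatives onto a canonical form before searching. The sketch below is illustrative only: a real system would draw its variants from UMLS rather than from the toy map shown here:

```java
import java.util.Map;

// Illustrative sketch (not ResearchIQ's actual logic) of collapsing
// derivatives of a word -- the "n-types" case above -- onto one canonical
// concept, so "corpus" and "corpora" hit the same node in the search.
public class ConceptNormalizer {
    // Hypothetical stand-in for a UMLS-derived variant table.
    private static final Map<String, String> VARIANTS = Map.of(
            "corpora", "corpus",
            "corpuses", "corpus",
            "analyses", "analysis");

    static String canonical(String term) {
        String t = term.toLowerCase();
        return VARIANTS.getOrDefault(t, t);
    }

    public static void main(String[] args) {
        System.out.println(canonical("Corpora")); // prints corpus
    }
}
```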

CHALLENGES OF RESULT DISSEMINATION

It is important to deliver the results of any information retrieval system in a suitable manner. This problem becomes more difficult as the number of different types of resources increases. The system should be designed with an understanding of the constraints and capabilities of the user.

3. CONTRIBUTIONS WITHIN RESEARCHIQ

To better understand the contributions made by ResearchIQ, examples and comparisons with other state-of-the-art semantic systems for information retrieval are provided wherever possible.

DATA SOURCES

As mentioned earlier, several heterogeneous knowledge sources have been incorporated as resources within the ResearchIQ framework. These form a mixture of structured, semi-structured and completely unstructured data sets, each treated as a separate resource type. A list of the resource types as of September 2012, with their status and a short description of each, is given below:

PUBMED

NCBI PubMed is one of the most comprehensive collections of publications in the medical domain. ResearchIQ currently holds over 14,000 articles from PubMed, comprising all publications related to OSU from 2004 to the present (July 2012).

OSUPRO

OSUPro contains profiles of faculty and researchers within OSU. We have currently added over 1,200 profiles related to biomedical informatics.

CLINICAL TRIALS

These come from two sources: the publicly available clinical trials data sets from clinicaltrials.gov, and the studies registered in the StudySearch databases. About 950 of these are currently annotated.

WEBSITES

These are the websites of the various labs, departments, resources and pieces of equipment scattered across the OSU medical center, for which a single place to find information is needed. About 200 selected websites have been included as resources in ResearchIQ.

GRANTS.GOV

Grants.gov holds calls for grant proposals from various funding sources, such as the NIH and the Department of Public Health. We currently have about 500 grants.

LARGE AND CONNECTED TRIPLESTORE

Even though ResearchIQ is still in the initial stages of deployment, it has collected a large number of different resources, and the project has connected the metadata of the different data sets with reasonable success. For example, the people resources within OSUPro are connected to their respective publications in PubMed. Similarly, connections are made between labs and instruments and the people and services within OSU. These connections are generated semi-automatically: the system relies on statistical and natural language processing (NLP) techniques to recommend possible connections in the triplestore, provides confidence scores with these recommendations, and automatically makes the connections it considers "obvious," meaning that the confidence score is sufficiently high. Manual annotators make the remaining connections of "fair" confidence where applicable. Informal observation of the recommendations has shown that the system is fairly accurate.

ResearchIQ also relies on standard ontologies in BioPortal and the Open Biomedical Ontologies for data representation, and the content management infrastructure is likewise described in standard ontologies. The details of the ontological structure of ResearchIQ are provided later in this document. The result of this reliance on standard ontologies and of semantic cross-referencing between the different data sources is a highly connected triplestore that is portable and extensible. The current size of this triplestore is ## triples, and the project is growing rapidly. The triplestore itself can be used to develop a plethora of different semantic services.

DEEP SEMANTIC SEARCH

Since the goal of ResearchIQ is to provide search based on semantics rather than just syntax, it is essential that the search rely deeply on semantics. The evaluation of the semantic search in ResearchIQ according to the framework introduced earlier is shown in Figure 17. The search uses full ontological knowledge for querying, leveraging the semantic graph and the connectedness of the triplestore. The challenge of relying heavily on semantics for reasoning is that the underlying technologies, such as OWL and SPARQL [52], are generally slower than traditional approaches such as XML or relational databases. It is therefore essential to keep the actual SPARQL queries simple and to exploit the semantics algorithmically as far as possible.

Another challenge is to utilize the semantic relationships effectively. ResearchIQ processes different semantic relationships separately. The search is based on concepts and the linkages between them, using graph-based propagation over the data: resources and concepts are the nodes, and relationships are the edges of the graph. Thus, the reasoning and resource discovery are inherently semantic in nature.
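A simplified sketch of such graph-based propagation follows; the decay factor, the two-hop limit, and all node names are illustrative assumptions, not ResearchIQ's actual algorithm:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Relevance scores spread outward from a seed concept along the edges of
// the resource/concept graph, attenuated by a decay factor at each hop.
public class PropagationSketch {
    // adjacency: node -> directly linked nodes (hypothetical example data)
    static Map<String, List<String>> edges = Map.of(
            "c:MassSpectrometry", List.of("r:CoreFacilityPage", "r:Pub123"),
            "r:Pub123", List.of("r:ProfileDrSmith"));

    static Map<String, Double> propagate(String seed, double decay, int hops) {
        Map<String, Double> scores = new HashMap<>();
        Map<String, Double> frontier = Map.of(seed, 1.0);
        scores.putAll(frontier);
        for (int h = 0; h < hops; h++) {
            Map<String, Double> next = new HashMap<>();
            for (var e : frontier.entrySet())
                for (String nb : edges.getOrDefault(e.getKey(), List.of()))
                    next.merge(nb, e.getValue() * decay, Double::sum);
            // accumulate the newly reached nodes into the overall scores
            next.forEach((k, v) -> scores.merge(k, v, Double::sum));
            frontier = next;
        }
        return scores;
    }

    public static void main(String[] args) {
        // A publication linked to the seed concept scores 0.5; the author
        // profile reachable through it scores 0.25.
        System.out.println(propagate("c:MassSpectrometry", 0.5, 2));
    }
}
```

The decaying scores give directly annotated resources precedence over resources that are only transitively related, which matches the ranking behavior the text describes.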

The query processing also circumvents problems of syntactic ambiguity by keeping the search purely semantic: the user directly enters concepts rather than ambiguous free text. The query processor allows the user to form complex queries from individual concepts and resources. For example, if A, B and C are three concepts, the user can form a query such as (A AND B) OR C. This adds to the precision of the search and allows the user to zero in on the relevant results.
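Concept-level query composition of this kind reduces to set operations over the resources annotated with each concept. A sketch, with a hypothetical concept-to-resource index standing in for the triplestore:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Each concept maps to the set of resources annotated with it; a query
// like "(A AND B) OR C" then becomes set intersection and union.
public class ConceptQuerySketch {
    // Hypothetical index: concept -> annotated resources.
    static final Map<String, Set<String>> INDEX = Map.of(
            "A", Set.of("r1", "r2", "r3"),
            "B", Set.of("r2", "r3"),
            "C", Set.of("r4"));

    static Set<String> and(Set<String> x, Set<String> y) {
        Set<String> out = new HashSet<>(x);
        out.retainAll(y);  // intersection
        return out;
    }

    static Set<String> or(Set<String> x, Set<String> y) {
        Set<String> out = new HashSet<>(x);
        out.addAll(y);     // union
        return out;
    }

    public static void main(String[] args) {
        // (A AND B) OR C -> contains r2, r3 and r4
        Set<String> result = or(and(INDEX.get("A"), INDEX.get("B")), INDEX.get("C"));
        System.out.println(result);
    }
}
```

In the real system the per-concept resource sets would come from the propagation step, with scores attached, rather than from a static index.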

Figure 17: Evaluating ResearchIQ semantic search

CONSOLIDATED QUERY STRUCTURE

The ResearchIQ query structure relies on a single user-submitted query to find all the resources, across all resource types, related to that query. Since the triplestore is unified across these resource types, such queries are easy to execute. As mentioned earlier, the ResearchIQ triplestore stores relations between different types of resources; this, together with the unified triplestore, allows consolidated reasoning and propagation over all resource types. This is a different approach from the semantic search provided by Harvard Catalyst.

END TO END SEMANTIC SYSTEM

ResearchIQ is truly semantic in nature, as it uses ontological representations not only for the content of the resources but also for the meta-information about the resources themselves. It uses semantics for annotation, reasoning and visualization. It should be noted that other state-of-the-art systems, such as Eagle-i and Harvard Catalyst, are not end-to-end: their semantics are restricted to the annotation or the reasoning rather than pervading the entire system.

PLATFORM FOR SEMANTIC DATA INTEGRATION AND DISSEMINATION IN BIOMEDICAL INFORMATICS

ResearchIQ follows the proposed framework for the development of semantic services and uses standard ontologies from the domain of biomedical informatics. It can therefore serve as a platform for integrating additional data sets with relatively little effort. Since the core system is built to annotate resources so that newer data sources can be incorporated easily and consolidated queries can be run over them, the result is a robust and highly extensible platform for providing semantic services.

This is the most significant contribution of the ResearchIQ project. The project architecture fits the semantic services framework proposed in this thesis. Figure 18 depicts the evaluation of ResearchIQ as per the linked open data guidelines.


Figure 18: ResearchIQ utilizes ontological data structure

CHAPTER 5

METHODS AND IMPLEMENTATION

1. THE BASICS

DEFINITIONS

There are several terms used frequently in ResearchIQ. The following section defines some terms specific to the project:

RESOURCE: A resource in ResearchIQ is a source of data, knowledge or tools. The resource types include publications, web pages, expertise, clinical data sets, medical data sets, etc. Details of the variety of these resources are provided in the contributions section. The goal of ResearchIQ is to retrieve these resources using semantic search logic.

INSTANCE: An instance is used in conjunction with a resource. It refers to the actual entry for the resource in the triplestore. An instance of a resource contains all the meta-information about the resource, represented semantically.

CONCEPT: A concept in ResearchIQ refers to a concept in the UMLS [53] ontology. It contains information about a particular conceptualization in UMLS.

CUI: CUI stands for "Concept Unique Identifier" in UMLS. In the context of ResearchIQ it is used interchangeably with the term "concept."

ONTOLOGY STRUCTURE OF RESEARCHIQ

ResearchIQ uses several standard ontologies to form a complex and comprehensive semantic structure. A bottom-up analysis of the structure is the easiest way to understand it.

Figure 19 shows the ontological structure followed in ResearchIQ.

[Figure: the ResearchIQ ontology hierarchy. At the top, the ResearchIQ ontology with its instances, the BRO and UMLS; beneath it, the biomedical domain ontologies (SNOMED-CT, MeSH, HUGO, OMIM, GO, NCI, RxNorm, ICD9-CM, ...); then the upper-level ontologies (Dublin Core, FOAF, NAO, SKOS, NCO); and at the base, the schema ontologies (RDF, RDFS, OWL, XSD).]

Figure 19: ResearchIQ Ontology Hierarchy

SCHEMA ONTOLOGIES

These form the foundational layer for creating and maintaining any ontological structure. They include the schema-level features defined by OWL, RDF and RDFS; the basic relationships are defined at this level.

UPPER LEVEL ONTOLOGIES

ResearchIQ relies on relationships and classes defined in the upper level

ontologies to enrich its own ontology. This reliance on widely used and standard

ontologies increases the portability of the triplestore. This is key to making the

ResearchIQ data interoperable and extensible.

UMLS AND THE BIOMEDICAL DOMAIN ONTOLOGIES

UMLS stands for the Unified Medical Language System [53]. It is a collection of over 150 taxonomies, controlled vocabularies and ontologies in the biomedical domain, providing a consolidated collection of biomedical concepts and detailed relationships between them. Each concept has a unique identifier associated with it, which makes it possible to address some of the problems of ambiguity and heterogeneity discussed earlier. In ResearchIQ, the UMLS ontology is used to annotate the resources and to perform the semantic search.

INSTANCES

All the resources and their respective meta-information are stored in ontological form as part of the instances ontology in ResearchIQ. It uses relationships from the upper-level and schema ontologies to annotate the resources. In addition, all the annotated UMLS concepts corresponding to a particular resource are also stored in the instances ontology.

BIOMEDICAL RESOURCE ONTOLOGY (BRO)

The Biomedical Resource Ontology (BRO) [54] is used to classify the resources

into different resource types. These types are classes within the BRO. The

classification is used in the user interface to arrange the results.

2. IMPLEMENTATION

TECHNOLOGIES AND TOOLS

ResearchIQ is implemented in a completely open source environment. It is developed in Java using Java-based tools. Due to space limitations, the full details of the implementation are not included in this thesis. Below is a list of external tools and applications utilized by ResearchIQ.

LUCENE / SOLR

Lucene [55] and SOLR [56] are indexing tools from Apache. While Lucene has been around for a long time and manages text-based data, SOLR is a more recent server built on top of it. SOLR supports distributed operation, which gives it the capability to handle large numbers of resources.

METAMAP

MetaMap [57] is a free-text annotator provided by the National Library of Medicine (NLM) to map text to the Unified Medical Language System (UMLS). The MetaMap tool annotates concepts from a specified ontology in given raw text. It provides many options that allow the annotation process to be controlled.
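To show the shape of the task MetaMap performs, here is a deliberately naive dictionary-based annotator; MetaMap's actual NLP pipeline is far more sophisticated, and the term-to-identifier map below is invented for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Naive sketch of concept annotation: scan the text for known terms and
// emit their concept identifiers. Real annotators handle variants,
// disambiguation and scoring; these CUIs are purely illustrative.
public class NaiveAnnotator {
    static final Map<String, String> TERM_TO_CUI = Map.of(
            "mass spectrometry", "C0000001",   // hypothetical CUIs
            "clinical trial", "C0000002");

    static List<String> annotate(String text) {
        List<String> cuis = new ArrayList<>();
        String lower = text.toLowerCase();
        for (var e : TERM_TO_CUI.entrySet())
            if (lower.contains(e.getKey())) cuis.add(e.getValue());
        return cuis;
    }

    public static void main(String[] args) {
        System.out.println(annotate("A clinical trial using mass spectrometry"));
    }
}
```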

JENA

Jena [58] is a framework from Apache for developing semantic web applications. It provides extensive Java libraries supporting a variety of ontology representation formats, and includes a rule-based inference engine and a variety of storage strategies for RDF triples. Jena is a completely in-memory application and loads the entire triplestore into memory; thus, it is not a scalable tool and cannot be used for larger triplestores.

SESAME

Sesame [59] is an open source framework for storing and analyzing RDF data. It is deployed on the web and is thus scalable, while offering the same capabilities as Jena. Both Jena and Sesame support SPARQL as the query language for RDF.

SEMANTIC WEB STACK

The semantic web stack used by ResearchIQ can be seen in Figure 20. It is very similar to the original semantic web stack, except that it does not contain the proof and encryption layers. The ontologies are stored either as OWL or as N-triples; the query language in either case is SPARQL.

[Figure: the ResearchIQ semantic web stack, from top to bottom: user interface and applications; web services; unifying logic; query (SPARQL), ontologies (OWL / N3) and rules (SWRL / RIF); taxonomies (RDFS); data interchange (RDF); syntax (XML); and, at the base, identifiers (URI) and character set (UNICODE).]

Figure 20: ResearchIQ Semantic Web Stack

USER INTERFACE

The effective delivery of search results is an important function of a semantic search engine. The delivery method should exploit the semantic backbone architecture while providing ease of access. Developing such a user interface is a research topic in itself; the work done in that regard, though crucial, is not part of this thesis. For completeness, screenshots of the currently implemented system are provided in Figure 21.

Figure 21(a): Home page with auto-complete list. (b): Search results for "Mass Spec."

3. COMPONENT DIAGRAM

Figure 22 shows the different components of the system. The analogy to the semantic services framework proposed in this thesis is immediate: the system is architected using the same guidelines.

As the diagram shows, several different types of resources are integrated. The ResearchIQ system can be broken into two core components: the Annotation Pipeline and the Query Pipeline. The annotation pipeline takes input from several data sources, writing knowledge into the triple store and indices into the Lucene / SOLR directory. The query pipeline reads this stored knowledge and runs semantic queries over it. We will look at each of these pipelines in detail.

Figure 22: Component Diagram. (Data sources, including websites, Clinical Trials, PubMed, OSUPro, NCBI datasets and Grants.gov, feed the Annotation Pipeline, consisting of the Extractor, Annotator and Loader, which populates the Triple Store. The Query Pipeline, consisting of the Seed Score Generator, Propagator and Aggregator, reads the stored knowledge and delivers query results to the User Interface.)

CHAPTER 6

ANNOTATION

1. ANNOTATION PIPELINE

The annotation pipeline, seen in Figure 23, is essentially an ETL process.

EXTRACTOR

The extraction process differs for the different types of resources. The job of the extractor is to parse information from the various data sources and provide the annotator with the raw data that needs to be transformed. The extractor also captures the meta-information that will be associated with the resource, such as its name and URL, as well as semantic information such as the type of resource, other related resources, etc. The first stage in Figure 23 shows this process. This meta-information is also sent to the annotator, which in turn passes it to the loader. It is crucial to have this meta-information for executing semantic queries.

A custom data extractor needs to be written for each source of data, even though most of the technical details and functions used remain the same. For example, the clinical trials data is stored in a structured MySQL database. The database connector queries this database for the required fields; in this case the MeSH terms associated with each clinical study are sent to the annotator.
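As an illustration, the clinical-trials extractor might be sketched as below. The table and column names are assumptions, and sqlite3 stands in here for the production MySQL database; the real connector would use a MySQL driver instead.

```python
import sqlite3

def extract_clinical_trials(conn):
    """Pull each clinical trial plus its MeSH terms and package the
    meta-information the annotator expects (field names are assumed)."""
    rows = conn.execute(
        "SELECT trial_id, title, url, mesh_terms FROM clinical_trials"
    ).fetchall()
    records = []
    for trial_id, title, url, mesh_terms in rows:
        records.append({
            "resource_type": "ClinicalTrial",    # semantic meta-information
            "name": title,
            "url": url,
            "raw_terms": mesh_terms.split(";"),  # MeSH terms for the annotator
        })
    return records

# Demo with an in-memory stand-in database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clinical_trials (trial_id, title, url, mesh_terms)")
conn.execute("INSERT INTO clinical_trials VALUES (1, 'Statin Trial', "
             "'http://example.org/t1', 'Hypercholesterolemia;Statins')")
records = extract_clinical_trials(conn)
```

The resulting records carry both the raw MeSH terms for the annotator and the meta-information (name, URL, resource type) that is later passed on to the loader.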

Figure 23: Annotation Pipeline. (Source-specific extractors, namely a database connector for Clinical Trials, the Nutch crawler for websites and an XML parser for Grants.gov, feed the MetaMap annotation engine, which is backed by the UMLS. The resulting collections of semantic documents, one per source (OSUPro, PubMed, ClinicalTrials, Grants.gov, NCBI datasets), pass through the Lucene index creation pipeline and the loader, which generates the Lucene / SOLR indices and the triple store read by the query engine.)

Figure 24: Annotation Process. (Person, Grant and Publication resources, each with their meta-information, are annotated with UMLS concepts.)

ANNOTATOR

The annotator uses the MetaMap Annotation API to annotate the raw text of the resources with appropriate UMLS concepts. A wrapper around the API determines the filters and granularity of the annotation process. Not all annotations found in a document are retained: concepts too close to the root node of the UMLS tree structure are too generic, so only annotations of concepts below a certain depth in the UMLS structure are kept. We only have the concept identifiers at this stage; all the meta-information about the CUIs is yet to be extracted.
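The depth-based filter can be sketched as follows. The cutoff value and the parent map are assumptions; the real wrapper works against the UMLS hierarchy rather than an in-memory dictionary.

```python
MIN_DEPTH = 3  # assumed cutoff: concepts closer to the root are too generic

def concept_depth(cui, parent_of):
    """Depth of a CUI in the hierarchy: number of hops up to the root."""
    depth = 0
    while cui in parent_of:
        cui = parent_of[cui]
        depth += 1
    return depth

def filter_annotations(cuis, parent_of, min_depth=MIN_DEPTH):
    """Retain only annotations deep enough in the UMLS tree."""
    return [c for c in cuis if concept_depth(c, parent_of) >= min_depth]

# Toy hierarchy: C4 -> C3 -> C2 -> C1 (root); C1 and C2 are too generic.
parent_of = {"C4": "C3", "C3": "C2", "C2": "C1"}
kept = filter_annotations(["C1", "C2", "C4"], parent_of)
```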

The Annotation Engine also folds the meta-information of the resource forwarded from the extractor phase into data objects. The structure of the data objects differs for each type of resource. These are archived and used by the loader; the semantic annotations are stored separately. The process, including the data objects, is illustrated in Figure 24.

Figure 25: Knowledge Graph. (The annotated Person, Grant and Publication resources linked through their shared UMLS concepts.)

LOADER

The loader is tasked with converting the resources into ontological format and generating the Lucene / SOLR indices for the resources. Traditionally, Lucene indexes only textual matter in natural language at different levels of granularity. ResearchIQ instead feeds Lucene with documents at a conceptual level, using a formal conceptual language. The fact that Lucene sees each document in this semantic view, as a list of concepts rather than words, supports the claim that the search is semantic.

The loader also derives the extended UMLS ontology. It does this by storing all synonymous and ancestor concepts up to a certain depth in the UMLS hierarchy. This step allows the system to store the deep semantic relationships between the concepts within the UMLS ontology.

The meta-information about relationships between the different types of resources is also attached to the data before it is loaded into the triple store. For example, the authors of a publication are linked to it with an appropriate “isAuthorOf” relationship, and so on. The resulting structure is a comprehensive knowledge graph, as seen in Figure 25. This graph is represented as RDF triples and stored in a triplestore.
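A sketch of the triples the loader might emit for a single publication is shown below, with plain (subject, predicate, object) tuples standing in for RDF statements. The predicate names other than “isAuthorOf”, the namespace prefixes and the example URIs are assumptions.

```python
def publication_triples(pub_uri, title, author_uris, concept_uris):
    """Emit the RDF-style triples the loader would store for one
    publication (predicates besides isAuthorOf are assumptions)."""
    triples = [
        (pub_uri, "rdf:type", "bro:Publication"),
        (pub_uri, "riq:preferredLabel", title),
    ]
    for person in author_uris:
        # Authors are linked to the publication; reasoning over the
        # ontology makes the inverse association unnecessary to store.
        triples.append((person, "riq:isAuthorOf", pub_uri))
    for cui in concept_uris:
        triples.append((pub_uri, "riq:relatedConcept", cui))
    return triples

triples = publication_triples(
    "riq:pub42", "A Statin Study", ["riq:personA"], ["umls:C0036581"])
```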

2. ANNOTATIONS

RESOURCE ANNOTATIONS

The annotations for a resource are stored in the instances ontology. Each resource is marked with at least one resource type from the Biomedical Resource Ontology (BRO), and multiple resource types can be associated with a single resource. An annotated resource carries different meta-data depending on its type. The following associations are common to all resources regardless of type:

PREFERRED LABEL

The name that will be displayed to the user.

URL

The source location of the resource on the Internet.

DESCRIPTION

A short description of the resource that will be displayed to the user on the user interface.

ORGANIZATION

The organization or sub-organization within OSU that the resource belongs to. It should be noted that the organizations themselves are usually resources within ResearchIQ.

RELATED CONCEPTS

The annotations from the UMLS ontology that describe the resource.

Now let us consider specialized types of annotations:

RELATED RESOURCES

These are stored for Publications, Labs and Clinical Trials, and are People resources in all these cases. For publications they are the authors of the particular publication; in the other cases they are the principal investigators responsible for that particular resource. Since the ontology allows reasoning, we do not have to store a complementary association with the People resources.

FIRST AND LAST NAMES

These are stored for People resources.

JOB TITLE

Like the names, these are associated only with People resources.

START AND END DATES

These are associated with Grant resources. The start date indicates the date on which the grant was posted, and the end date is the deadline for submitting proposals for that grant.

Figure 26 shows annotations for a typical Publication resource. The various annotations discussed in this section are represented by the different colored edges.

Figure 26: Example PubMed Resource

CONCEPT ANNOTATIONS

The following annotations are stored for every concept extracted from the UMLS. Though the UMLS has a lot of information associated with each CUI, for the purposes of ResearchIQ the following relationships were sufficient as of September 2012; hence only these required annotations are stored in the triplestore.

PREFERRED NAME

The preferred label of the concept.

STRING

Alternative names and possible string representations of the concept.

SOURCE (SAB)

A list of ontologies, taxonomies and vocabularies in which the concept originally appears.

SYNONYMS (SY)

List of synonymous concepts, for example “Myocardial Infarction” and “Heart Attack.” By default each concept is synonymous with itself.

CHILD CONCEPTS (CH)

List of child concepts in the UMLS ontology.

RELATED CONCEPT (RN)

List of concepts that are related to a given concept.

HYPERNYMS (ISA)

These are abstractions (more general forms) of a given concept.
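As a small illustration of how the “SY” annotation is used downstream, a synonym lookup over edge triples might look like the following sketch. The edge representation and the example CUIs are illustrative assumptions.

```python
def synonyms(cui, edges):
    """Return the SY-related CUIs for a concept; every concept is
    synonymous with itself by default (edge structure is assumed)."""
    syns = {cui}
    for (subj, rel, obj) in edges:
        if rel == "SY" and subj == cui:
            syns.add(obj)
        if rel == "SY" and obj == cui:
            syns.add(subj)
    return syns

# Illustrative edges: one synonym pair and one child relation.
edges = [("C0027051", "SY", "C0155626"),
         ("C0027051", "CH", "C0155668")]
syns = synonyms("C0027051", edges)
```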

All these associations play a critical role in performing the semantic search. The next chapter explains the query pipeline and details how these associations are utilized.

CHAPTER 7

QUERYING

The detailed query pipeline of ResearchIQ is shown in Figure 27. The user is restricted to selecting query terms from a drop-down auto-complete list. Each entry in this list is associated with a URI in the knowledge graph of the triplestore.

A query can consist of several individual terms joined by Boolean AND or OR. The first step is to separate these terms. Each term is then either a concept or a direct resource, and the search proceeds differently in each case; for direct resources we use the knowledge graph directly. Let us look at both scenarios in detail, starting with the query for direct resources.

1. QUERYING FOR DIRECT RESOURCES

1. If a resource is selected from the auto-complete list, the relationships associated with that resource are utilized.

2. The query engine searches the triplestore for all the resources that are directly associated with the searched resource. These derived resources are assigned a lower score in order to maintain the hierarchy of the search when displaying the results.

3. The above step is repeated for the derived resources as well, in order to find more semantically associated resources.

For instance, consider a user searching for a particular Person resource within OSUPro. The search would return all the publications that the person has authored, as well as all the clinical trials, labs and grants the person is associated with. In the next step, all the resources related to the ones just found are added to the search results.
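A minimal sketch of this two-hop direct-resource search is given below, with an in-memory adjacency map standing in for the triplestore and an assumed per-hop score penalty to keep derived resources ranked below directly associated ones.

```python
def direct_resource_search(start, links, hop_penalty=0.5):
    """Score resources reachable within two hops of the searched
    resource; each hop lowers the score (penalty value is assumed)."""
    scores = {}
    frontier = [start]
    score = 1.0
    for _hop in range(2):
        score *= hop_penalty
        next_frontier = []
        for node in frontier:
            for neigh in links.get(node, []):
                if neigh != start and neigh not in scores:
                    scores[neigh] = score
                    next_frontier.append(neigh)
        frontier = next_frontier
    return scores

# personA authored pub1 and runs trial1; pub1 has co-author personB.
links = {"personA": ["pub1", "trial1"], "pub1": ["personB"]}
scores = direct_resource_search("personA", links)
```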

2. QUERYING FOR CONCEPTS

Querying for concepts is far more complex. It is achieved in two steps: first the seed resources are obtained, and then the scores are propagated using the UMLS ontology to discover additional resources.

SEED SCORE GENERATION

1. For every query concept we generate all its synonyms. These will be CUIs in the UMLS ontology related by “SY”.

2. A set of seed resources and associated scores is generated using each synonym. In case of repetitions, which are frequent, the higher seed score is retained for the resource.

3. For propagation, the lowest of the seed resource scores is propagated using the propagation technique described below. Using the lowest score keeps the comparative “relevance” of seed resources higher than that of resources generated by propagation.

4. Let us call the set generated after propagation a “resultset”. Note that this is for a single queried concept. We immediately normalize the scores with respect to the highest score in each resultset. This maintains comparability across the resultsets for individual CUIs.
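Steps 1, 2 and 4 above can be sketched as follows; the synonym and resource lookup tables stand in for the Lucene / SOLR and triplestore lookups, and the scores are illustrative.

```python
def seed_scores(query_cui, synonyms_of, resources_for):
    """Steps 1-2: gather seed resources over all synonyms, keeping the
    higher score for repeated resources (lookup tables are stand-ins)."""
    seeds = {}
    for syn in synonyms_of.get(query_cui, {query_cui}):
        for resource, score in resources_for.get(syn, []):
            seeds[resource] = max(score, seeds.get(resource, 0.0))
    return seeds

def normalize(resultset):
    """Step 4: divide by the highest score so resultsets for different
    query concepts stay comparable."""
    top = max(resultset.values())
    return {r: s / top for r, s in resultset.items()}

synonyms_of = {"C1": {"C1", "C1s"}}
resources_for = {"C1": [("pub1", 0.8)],
                 "C1s": [("pub1", 0.6), ("pub2", 0.4)]}
result = normalize(seed_scores("C1", synonyms_of, resources_for))
```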

Figure 27: Query Pipeline. (Query terms from the visual interface pass through a term separator. Each term is routed either to the direct resource finder or, for concepts, to the seed resource finder backed by the Lucene / SOLR indices, followed by the score propagator and the related resource finder backed by the triple store. The single-term resultsets are combined by the result aggregator into the final search results.)


Figure 28: Propagation in the Knowledge Graph

SCORE PROPAGATION

The example in Figure 28 explains at a high level how the propagation proceeds. In the figure, squares mark the concepts and circles mark the resources.

1. "C" is the queried concept. Let us assume that the seed resources and their scores are already generated; we are only interested in the propagation of the scores. Also assume that R1, R2 and R3 in the figure are not seed resources. If we encounter seed resources or previously visited resources during propagation, we simply ignore them.

2. The lowest seed score of each seed resultset is used for propagation, as this ensures that the score of any derived result will still be lower than that of the initial set of results.

3. There are three possible relations between the concepts in the UMLS ontology:

a. "CH", indicating a child concept,

b. "RN", indicating an association, and

c. "SY", meaning the CUIs are synonyms.

4. These types are leveraged so that propagation differs by relation:

a. In case of "SY" the same score is assigned to the related concept.

b. In case of "CH" and "RN" we use an exponential decay of the score.

5. The propagation of the concepts is limited to a depth of 5. If a concept is revisited during propagation, propagation is stopped for that path. For now, we proceed in a depth-first manner as the program executes recursively.

6. For each new concept found, the propagator finds all the resources related to it. These are added to the resultset for that concept with the appropriate score.

7. Notice in the figure that "R2" is reachable from two paths. In such a case (which is quite common) the highest possible score is retained.
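A compact sketch of this propagation, under an assumed decay factor, is shown below; dictionaries stand in for the UMLS graph and the concept-to-resource index.

```python
DECAY = 0.5      # assumed decay factor for CH / RN hops
MAX_DEPTH = 5    # propagation depth limit from the text

def propagate(start_cui, start_score, edges, resources_for, seeds):
    """Depth-first score propagation: SY keeps the score, CH/RN decay it
    exponentially; seed resources and visited concepts are skipped; a
    resource reachable by several paths keeps its highest score."""
    results = {}
    visited = set()

    def visit(cui, score, depth):
        if depth > MAX_DEPTH or cui in visited:
            return
        visited.add(cui)
        for resource in resources_for.get(cui, []):
            if resource not in seeds:
                results[resource] = max(score, results.get(resource, 0.0))
        for rel, neigh in edges.get(cui, []):
            visit(neigh, score if rel == "SY" else score * DECAY, depth + 1)

    visit(start_cui, start_score, 0)
    return results

# Toy graph: C has a child C1 and an associated concept C2;
# C1 has a synonym C1a. R2 is reachable along two paths.
edges = {"C": [("CH", "C1"), ("RN", "C2")], "C1": [("SY", "C1a")]}
resources_for = {"C1": ["R1"], "C1a": ["R2"], "C2": ["R2", "R3"]}
scores = propagate("C", 1.0, edges, resources_for, seeds={"R0"})
```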

Such resultsets are generated for each of the queried concepts in the same manner. The results in these sets are then retained following the AND/OR logic selected at query time.

• In case of AND (a set intersection operation), only the resources common to all sets are retained as part of the final results. The highest score for the resource among all resultsets is kept.

• In case of OR (a set union operation), all unique results are retained in the final result set while maintaining the highest possible scores for individual resources.
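The AND/OR combination step can be sketched as a set operation over per-term resultsets, keeping the highest score in either mode:

```python
def combine(resultsets, mode):
    """AND keeps resources common to all resultsets; OR keeps every
    unique resource; either way the highest score wins."""
    if mode == "AND":
        keep = set.intersection(*(set(rs) for rs in resultsets))
    else:  # OR
        keep = set.union(*(set(rs) for rs in resultsets))
    return {r: max(rs.get(r, 0.0) for rs in resultsets) for r in keep}

a = {"pub1": 0.9, "pub2": 0.4}
b = {"pub1": 0.7, "pub3": 0.6}
both = combine([a, b], "AND")
either = combine([a, b], "OR")
```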

In the early stages of the project, the AND/OR logic was applied to only the seed resources before propagating. This did not implement the Boolean logic in its true sense and was modified to its current form where the logic is applied across all the resources.

The propagation works by finding concepts related to the query concepts while carrying the seed scores of the resources. If we instead used the concepts related to each resource (an approach tried earlier), the focus of the search would be lost, because a resource can be annotated with generic concepts other than the ones the search is intended for.

The Result Aggregator fetches the meta-information, such as the preferred label and the description, required by the user interface.

3. FEATURES OF THE RESEARCHIQ SEMANTIC SEARCH

We have already assessed the significance of the semantic search architecture from the point of view of the proposed semantic services framework in the contributions section.

In this section we cover the algorithmic features of the search engine.

PURELY SEMANTIC SEARCH

Since the semantic ETL process delivers a purely semantic database, the search has to be semantic as well. Even within Lucene, the semantic concepts are indexed. The semantic search tends to be very precise, as it relies solely on the semantic relationships defined by standard ontologies.

PLUGGABLE DECAY ALGORITHM

The decay algorithm for the search propagation is pluggable. It is possible to choose between exponential and logarithmic decay algorithms and to set the slope of decay. This allows control over the rate at which the score decays during propagation.
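The pluggable decay might be sketched as two interchangeable functions; the exact formulas and slope values here are assumptions, not the deployed ones.

```python
import math

def exponential_decay(score, depth, slope=0.7):
    """score * slope**depth: an assumed form of the exponential option."""
    return score * slope ** depth

def logarithmic_decay(score, depth, slope=1.0):
    """score / (1 + slope*ln(1+depth)): an assumed form of the log option."""
    return score / (1.0 + slope * math.log1p(depth))

decay = exponential_decay  # "pluggable": swap in either function
decayed = [round(decay(1.0, d), 3) for d in range(3)]
```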

CACHING FOR PARTS OF SEARCH

A cache stores the information associated with all the unique concepts explored during propagation. This avoids redundant lookups of concepts in the triple store and increases the speed of the search.
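A minimal concept cache of the kind described might look like this sketch, with a counter exposing how many triple-store round trips were actually made (the lookup function is a stand-in):

```python
class ConceptCache:
    """Memoize triple-store lookups for concepts seen during propagation."""
    def __init__(self, lookup):
        self._lookup = lookup
        self._cache = {}
        self.store_hits = 0  # actual triple-store round trips

    def get(self, cui):
        if cui not in self._cache:
            self.store_hits += 1
            self._cache[cui] = self._lookup(cui)
        return self._cache[cui]

cache = ConceptCache(lambda cui: {"label": "concept " + cui})
for cui in ["C1", "C2", "C1", "C1"]:
    info = cache.get(cui)
```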

SEMI-MULTITHREADED ARCHITECTURE

The Result Aggregator runs in a multi-threaded fashion in order to fetch the meta-information as quickly as possible.
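The multi-threaded aggregation can be sketched with a thread pool; the meta-information fetch here is a stand-in for the actual triplestore lookup.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_meta(resource):
    """Stand-in for the triple-store lookup of label and description."""
    return {"uri": resource, "label": resource.upper()}

def aggregate(resources, workers=4):
    """Fetch meta-information for all result resources in parallel;
    map() preserves the input (score) order of the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_meta, resources))

meta = aggregate(["pub1", "trial1", "grant1"])
```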

CHAPTER 8

CONCLUSION

1. DISCUSSION

RESEARCHIQ

The primary focus of the ResearchIQ project thus far has been to successfully implement a semantic search engine. Based on the arguments presented in this work, it can be said that this goal has been achieved. Henceforth, efforts will be directed towards expansion of the ResearchIQ framework. The following are the planned next steps and possible future directions for ResearchIQ.

INCLUSION OF MORE DATA SETS

Now that an extensible and scalable framework is established, the focus will be on increasing its data coverage. There are two future goals in this regard:

1. To integrate more data from the existing data sources, for example increasing the number of publications from PubMed or gathering more people profiles from OSUPro. This expansion should be low-hanging fruit, as the end-to-end pipelines for these resources already exist.

2. To integrate new data sources into ResearchIQ. This requires more effort, as a new module might have to be written for the data extractor; however, the subsequent pipeline is already complete.

HYBRID SEARCH

As previously stated, ResearchIQ is a purely semantic search engine. This provides high precision in the search results. However, a lot of the syntactic information is discarded when the resources are converted into the semantic structure. The semantic search structure does not provide much recall, meaning that the “fuzziness” that syntactic search provides is missing. A hybrid search infrastructure would combine semantic and syntactic search in order to provide “better” search results: the idea is that users of the system can choose how semantic or syntactic they want the search results to be. This would add functionality at the user interface.

FREE FORM SEARCH

Currently the user input is restricted to the search terms provided by the auto-complete list; users have to choose a specific term or set of terms from the list, else the search cannot be completed. In the future the user should be able to enter free text, and the search interface should be able to process the input and find the appropriate concepts.

PARALLEL QUERY ARCHITECTURE

Since the primary goal of ResearchIQ is the effective delivery of search results to researchers, the speed of result delivery is a critical factor. The resource discovery platform is of no use if it takes too much time to retrieve and display the results.

Semantic technologies tend to be slower than their syntactic counterparts due to the overhead of the additional reasoning required. It is a challenge, then, to implement an efficient search engine that delivers results quickly.

To make the semantic search architecture scalable, parallel processing techniques such as MapReduce on Hadoop are planned. It is expected that a map-reduce style search architecture will reduce the query time significantly and will scale well even with an increased number of resources.

EVALUATION

A “semantic” evaluation of the ResearchIQ system is provided in this thesis. Since the ResearchIQ system is constantly growing in terms of data, and many enhancements are being implemented for the semantic search, it is difficult to perform functional and usability evaluation.

The following factors need to be considered during the evaluation:

• Functional Evaluation

1. A task-based evaluation of the effectiveness of the search results needs to be done. It is crucial that researchers are able to find the resources they are looking for; such an evaluation should provide insight into the accuracy of the search. Only once we have the results of this evaluation can we compare ResearchIQ with other similar systems.

2. A scientific evaluation of the query times required to perform the search also needs to be done.

3. A comparative analysis with other similar systems in the biomedical research domain can be performed.

• Usability Evaluation

1. The current graphical user interface was developed after a preliminary usability analysis. A more comprehensive evaluation covering cognitive, aesthetic and non-functional requirements should be performed.

2. A comparative usability analysis should be performed against other popular search systems.

SEMANTIC SERVICES FRAMEWORK

From the point of view of the semantic services framework, ResearchIQ provides a great example of how one might utilize the framework architecture for developing a semantic system and eventually validating its semantic nature. The project supports the validity of the framework itself.

A future direction would be to refine the evaluation metrics for both the semantic ETL process and the semantic services bus. It would also be beneficial to include some basic functional evaluation within the framework itself.

Another direction is to apply the semantic framework to other domains and services. The lessons learned from the ResearchIQ project would be helpful in achieving this.

2. FINAL COMMENTS

In this thesis, a generalized framework for developing and evaluating semantic services has been proposed. The framework provides an architecture coupling the key components required for semantic services. This is essential if we are to identify semantic services.

Metrics for measuring the semantic nature of the services are also proposed. These metrics provide a white-box evaluation to validate the services as being semantic. They lay down guidelines for the documentation and details needed to make the semantics involved more transparent. This in turn would make services more discoverable and portable, and easier for data consumers to access.

In order to verify the soundness of the framework, we implemented a semantic search service as a case study: ResearchIQ, a semantic search portal for resource discovery for researchers in biomedical informatics. The ResearchIQ system serves as a framework for semantic integration and knowledge delivery for the biomedical domain, and the success of the project reflects the stability of the proposed semantic services framework. Please note that ResearchIQ is a working system, implemented as part of the Clinical and Translational Sciences website at The Ohio State University Medical Center; it can be explored by anyone interested at http://researchiq.bmi.osumc.edu.

It is intended that the framework provide a good platform for accelerating the development of services that are more “semantic” in nature. It should provide the means to analyze and compare semantic services. The developers of such services should use the framework to understand the semantic requirements of their systems and implement services that are extensible, discoverable and transparent, and that better conform to the requirements of the linked open data movement.

REFERENCES

[1] Wiig, K. M. (1997). Knowledge management: where did it come from and where will it go?. Expert systems with applications, 13(1), 1-14.

[2] Lohr S. (2012). How Big Data became so big. The New York Times

[3] McIlraith, S. A., Son, T. C., & Zeng, H. (2001). Semantic web services. Intelligent Systems, IEEE, 16(2), 46-53.

[4] Studer, R., Grimm, S., & Abecker, A. (Eds.). (2007). Semantic web services: concepts, technologies, and applications. Springer.

[5] Niland, J. C., & Rouse, L. (2010). Clinical research systems and integration with medical systems. Biomedical Informatics for Cancer Research, 17-37.

[6] Ding, L., Finin, T., Joshi, A., Pan, R., Cost, R. S., Peng, Y., ... & Sachs, J. (2004, November). Swoogle: a search and metadata engine for the semantic web. In Proceedings of the thirteenth ACM international conference on Information and knowledge management (pp. 652-659). ACM.

[7] Decker, S., Erdmann, M., Fensel, D., & Studer, R. (1998). Ontobroker: Ontology based access to distributed and semi-structured information. AIFB.

[8] Butte, A. J. (2008). Translational bioinformatics: coming of age. Journal of the American Medical Informatics Association, 15(6), 709-714.

[9] Payne, P. R., Johnson, S. B., Starren, J. B., Tilson, H. H., & Dowdy, D. (2005). Breaking the translational barriers: the value of integrating biomedical informatics and translational research. Journal of investigative medicine, 53(4), 192-201.

[10] Collins, F. S. (2011). Reengineering translational science: the time is right. Science translational medicine, 3(90), 90cm17-90cm17.

[11] Embi, P. J., & Payne, P. R. (2009). Clinical research informatics: challenges, opportunities and definition for an emerging domain. Journal of the American Medical Informatics Association, 16(3), 316-327.

[12] Borlawsky, T. B., Lele, O., & Payne, P. R. (2011). Research-IQ: Development and evaluation of an ontology-anchored integrative query tool. Journal of biomedical informatics, 44, S56-S62.

[13] “October 2012 Web Server Survey.” Retrieved on 12 November, 2012 from Netcraft.com

[14] World Internet Users Statistics Usage and World Population Stats. Retrieved on 12 November, 2012 from http://www.internetworldstats.com/stats.htm

[15] Where did the ‘Data Explosion’ come from? Retrieved on 10 November. 2012 from http://blog.bimeanalytics.com

[16] Top 10 Largest Databases in the World | Reviews, Comparisons and Buyer’s Guides. Retrieved on 10 November, 2012

[17] Mike2.0. Defining Big Data. (2010)

[18] Heath, T., & Bizer, C. (2011). Linked data: Evolving the web into a global data space. Synthesis Lectures on the Semantic Web: Theory and Technology, 1(1), 1- 136.

[19] Berners-Lee, T. Linked Data - Design Issues, (2006). Retrieved from http://www.w3.org/DesignIssues/LinkedData.html.

[20] Berners-Lee, T., Hendler, J., & Lassila, O. (2001). The semantic web. Scientific American, 284(5), 28-37.

[21] Berners-Lee, T., & Fischetti, M. (2001). Weaving the Web: The original design and ultimate destiny of the World Wide Web by its inventor. DIANE Publishing Company.

[22] "W3C Semantic Web Activity". World Wide Web Consortium (W3C). November 7, 2011. Retrieved November 10, 2012.

[23] Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. Retrieved 10 November 2012 from http://lod-cloud.net/

[24] Guinard, D., & Trifa, V. (2009, April). Towards the web of things: Web mashups for embedded devices. In Workshop on Mashups, Enterprise Mashups and Lightweight Composition on the Web (MEM 2009), in proceedings of WWW (International World Wide Web Conferences), Madrid, Spain.

[25] Berners-Lee, T. Giant global graph, November 2007. Available from World Wide Web: http://dig.csail.mit.edu/breadcrumbs/node/215 [cited 28.09.2008].

[26] McIlraith, S. A., & Martin, D. L. (2003). Bringing semantics to web services. Intelligent Systems, IEEE, 18(1), 90-93.

[27] Pollock, J. T. (2009). Semantic Web for dummies. For Dummies.

[28] "Semantic Web - XML2000, slide 10". W3C. Retrieved 20 October, 2012

[29] "XML 1.0 Specification". W3.org. Retrieved 20 October, 2012

[30] "Resource Description Framework (RDF) Model and Syntax Specification" http://www.w3.org/TR/PR-rdf-syntax/

[31] "XML and Semantic Web W3C Standards Timeline". Retrieved 20 October, 2012

[32] Sheth, A. (2012) Semantic Web: Intro and Overview. Presentation.

[33] Brickley, D., Guha R. V. & Layman A. Resource Description Framework (RDF) Schemas. http://www.w3.org/TR/1998/WD-rdf-schema-19980409/

[34] Gruber, T. R. (1993). A translation approach to portable ontology specifications. Knowledge acquisition, 5(2), 199-220.

[35] Gruber, T. (2008). What is an Ontology. Encyclopedia of Database Systems, 1.

[36] Fensel, D. (2003). Ontologies:: A Silver Bullet for Knowledge Management and Electronic Commerce. Springer.

[37] "OWL 2 Web Ontology Language Document Overview". W3C. Retrieved on 20 October 2012 from http://www.w3.org/TR/owl2-overview/

[38] McGuinness, D. L., & Van Harmelen, F. (2004). OWL web ontology language overview. W3C recommendation, 10(2004-03), 10.

[39] "N-Triples". W3C RDF Core WG Internal Working Draft. www.w3.org. Retrieved 20 October 2012 from http://www.w3.org/2001/sw/RDFCore/ntriples/

[40] Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., ... & Sherlock, G. (2000). Gene Ontology: tool for the unification of biology. Nature genetics, 25(1), 25.

[41] Harris, M. A., Clark, J., Ireland, A., Lomax, J., Ashburner, M., Foulger, R., ... & Gwinn, M. (2004). The Gene Ontology (GO) database and informatics resource. Nucleic acids research, 32(Database issue), D258.

[42] Carbon, S., Ireland, A., Mungall, C. J., Shu, S., Marshall, B., & Lewis, S. (2009). AmiGO: online access to ontology and annotation data. Bioinformatics, 25(2), 288-289.

[43] Gobeill, J., & Ruch, P. (2007). GOCat: A Gene Ontology Categorization/Navigation Service for Functional Annotation of Proteins.

[44] Day-Richter, J., Harris, M. A., Haendel, M., & Lewis, S. (2007). OBO-Edit—an ontology editor for biologists. Bioinformatics, 23(16), 2198-2200.

[45] “Introducing the Knowledge Graph: things, not strings.” Google Blog. (2012) Retrieved on 14 November 2012 from http://googleblog.blogspot.com

[46] Raje, S., Davuluri, C., Freitas, M., Ramnath, R., & Ramanathan, J. (2012, July). Using Semantic Web Technologies for RBAC in Project-Oriented Environments. In Computer Software and Applications Conference (COMPSAC), 2012 IEEE 36th Annual (pp. 521-530). IEEE.

[47] "Web Services Description Language (WSDL) Version 2.0 Part 1: Core Language.” W3C. Retrieved 15 November 2012.

[48] Jeckle, M. (2004). Semantik, Odem einer Service-orientierten Architektur.

[49] McCandless, D. Data, Information, Knowledge, Wisdom? (2010). Retrieved on 12 November 2012 from http://www.informationisbeautiful.net/

[50] Giunchiglia, F., Kharkevich, U., & Zaihrayeu, I. (2010). Concept Search: Semantics Enabled Information Retrieval.

[51] Lele, O., Raje, S., Yen, P., Borlawsky T.B. & Payne P.R.O. (2012). ResearchIQ: An Ontology-anchored Knowledge and Resource Discovery Tool [Poster]. AMIA Annual Symposium Proc.

[52] Prud’Hommeaux, E., & Seaborne, A. (2008). SPARQL query language for RDF. W3C recommendation, 15.

[53] Bodenreider, O. (2004). The unified medical language system (UMLS): integrating biomedical terminology. Nucleic acids research, 32(suppl 1), D267- D270.

[54] Tenenbaum, J. D., Whetzel, P. L., Anderson, K., Borromeo, C. D., Dinov, I. D., Gabriel, D., ... & Lyster, P. (2011). The Biomedical Resource Ontology (BRO) to enable resource discovery in clinical and translational research. Journal of biomedical informatics, 44(1), 137-145.

[55] “Lucene.” http://lucene.apache.org/

[56] “SOLR.” http://lucene.apache.org/solr/index.html

[57] Aronson, A. R. (2001). Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. In Proceedings of the AMIA Symposium (p. 17). American Medical Informatics Association.

[58] Carroll, J. J., Dickinson, I., Dollin, C., Reynolds, D., Seaborne, A., & Wilkinson, K. (2004, May). Jena: implementing the semantic web recommendations. In Proceedings of the 13th international World Wide Web conference on Alternate track papers & posters (pp. 74-83). ACM.

[59] Open, R. D. F. Sesame RDF Database, 2006. Internet: http://www.openrdf.org.
