Scalable Methods to Analyze Semantic Web Data

UNIVERSITAT JAUME I Departament de Llenguatges i Sistemes Informàtics Scalable methods to analyze Semantic Web data Ph. D. dissertation Victoria Nebot Romero Supervisor Dr. Rafael Berlanga Llavori Castellón,December, 2012 To my family. To Rafa. Abstract Semantic Web data is currently being heavily used as a data rep- resentation format in scientific communities, social networks, business companies, news portals and other domains. The irruption and availability of Semantic Web data is demanding new methods and tools to efficiently analyze such data and take advantage of the underlying semantics. Although there exist some applications that make use of Semantic Web data, advanced analytical tools are still lacking, preventing the user from exploiting the attached semantics. The main objective of this dissertation is to provide a formal framework that enables the multidimensional analysis of Semantic Web data in an scalable and efficient manner. The success of multidimensional analysis techniques applied to large volumes of structured data in the context of business intelligence, especially for data ware- housing and OLAP applications, has prompted us to investigate the application of such techniques to Semantic Web data, whose nature is semi-structured and contain implicit knowledge. Multidimensionality is based on the fact/dimension dichotomy. Data are modeled in terms of facts, which are the analytical metrics, and dimensions, which are the different analysis perspectives, and are usually hierarchically organized. We believe that the construc- tion of a multidimensional view of Semantic Web data driven by the semantics encoded in the data themselves and the user's re- quirements, empowers and enriches the analysis process in a unique manner, as it brings about new analytical capabilities not possible before. Aggregations and display operations typical of multidimensional analysis tools, such as changing the granularity level of the displayed data, or adding a new analysis perspective to the data, will be performed based on the semantic relations encoded in the data. This is possible thanks to the mapping of the data to a conceptual multidimensional space. We base our research on the hypothesis that Semantic Web data is an emerging knowledge resource worth exploiting, and that the knowledge encoded in Semantic Web data can be leveraged to per- form an efficient, scalable and full-fledged multidimensional analysis. Scalability is achieved by two means. On one hand, we provide ii an ontology indexing model over ontologies that allows to manage implicit knowledge in a compact format. Therefore, operations that require reasoning can be efficiently solved using the indexes. On the other hand, we have developed several scalable modularization techniques that build upon the previous indexes and allow to extract and work only with the ontological subsets of interest. The previous indexing and modularization methods assist in the analytical tasks by enabling the extraction of multidimensional data from semantic sources based on the user query. That is, the methods are used to make the extraction of facts and dimensions from Se- mantic Web data efficient and scalable. However, identifying facts, dimensions, measures and well-shaped dimension hierarchies from the graph structure that underlies SW data is a big challenge due to the mismatch between the graph model that underlies Semantic Web data and the multidimensional model. Therefore, the notions of fact and dimension are revisited in the Semantic Web context and both facts and dimensions are defined from a logical viewpoint. Facts are formally defined as multidimensional points quantified by measure values, all of which are logically reachable from the subject of analysis defined by the user. To this end, we use the notion of aggregation path, which is less restrictive than the traditional multidimensional constraints usually imposed between facts and dimensions, and define different interesting subgroups for analysis. We also detect the summarizability of the extracted facts and produce correctly aggregated results. Dimensions are defined as direct acyclic graphs composed by nodes that are semantically related and adhere to the conceptual specifi- cation of the user. That is, nodes are sub-concepts of the dimension type and the edges are subsumption relations. Two alternative methods allow to re-shape the extracted dimension hierarchies to favor aggregation while preserving the semantics of the dimension values as much as possible. The flexibility introduced in the discovery of facts and dimensions allows us to analyze data that have complex relations, which es- cape to the traditional multidimensional model and cannot be oth- erwise analyzed. Moreover, instead of building a huge, one time data warehouse, our method for fact and dimension extraction is based on several indexes and precomputed data, which allows us to efficiently materialize only the facts and dimensions required by an analytical query. This provides up-to-date, dynamic and cus- tomized results to the user. The experimental evaluation performed on each of the components of the framework demonstrates that the proposed analysis frame- iii work scales to large ontological resources. Likewise, the developed use cases show the potential of the multidimensional analysis of Semantic Web data. vi Resumen Introducción La Web se ha convertido en una de las mayores fuentes de conocimiento, tanto a nivel cient´ıficoy empresarial, como a nivel cotidiano. Inicialmente, la Web se planteócomo un recurso de consulta de informacióncon contenidos máso menos estáticos,el cual se compon´ıade una serie de documentos enlazados, t´ıpicamente ricos en texto. La primera gran evoluciónde la Web desembocó en la Web 2.0, cuyo objetivo es crear una Web másdinámicaque facilite a los usuarios interactuar y compartir información.A travésde las tecnolog´ıasWeb 2.0 se permite a los usuarios crear y gestionar contenidos en la Web. Parte del éxitode la Web 2.0 se debe a la utilizaciónde lenguajes semi-estructurados como XML para el intercambio de información,los cuales permiten dar cierta estructura al contenido. Sin embargo, se ha demostrado que las tecnologas XML son insuficientes para gestionar de forma eficiente la avalancha de in- formaciónpublicada en la Web, puesto que XML es capaz de estructurar sintácticamente un documento pero no su contenido. Ello ha propiciado la evoluciónde la Web hacia una Web másinteligente, la Web Semántica, o Web 3.0. El objetivo que persigue la Web Semántica es describir los contenidos y la informaciónpresente en la Web utilizando formalismos lógicospara que éstapueda ser \entendida" y procesada de forma automáticapor las máquinas. Aunque este objetivo es muy ambicioso, ya se dispone de tecnolog´ıasadecuadas para la descripciónsemántica de los contenidos, tales como RDF y OWL. Además,existen iniciativas como Linked Open Data (LOD) que promueven la publicacióny enlace de datos en formato semántico en la Web. Tal ha sido el éxitode este tipo de iniciativas que distintas organizaciones de ámbitos muy diversos han decidido publicar sus datos en la Web en formato semántico. Entre las organizaciones que se han adherido a esta iniciativa podemos men- cionar: distintos gobiernos como el de Reino Unido o los Estados Unidos, los cuales han publicado gran cantidad de datos acerca de temas tan diversos e interesantes como los niveles de ozono de las distintas ciudades o la situación financiera de las empresas; medios de comunicacióncomo la BBC, que pone a vii viii disposiciónde los usuarios datos sobre sus programas de televisióny radio enlazados con otros recursos de conocimiento; varias comunidades cient´ıficasdel ámbito de la biomedicina como Bio2RDF, DailyMed, Diseasome o DrungBank entre muchas otras, han publicado gran cantidad de datos sobre el genoma hu- mano, medicamentos junto con su composiciónqu´ımicay su uso, enfermedades y sus relaciones genéticas, etc. El dominio biomédicoes especialmente com- plejo, y es por ello que la conceptualizaciónde estos datos puede acarrear avances importantes. La existencia de este tipo de datos anotados semánticamente abre nuevas áreasde investigaciónpara el desarrollo de aplicaciones que permitan explotar el conocimiento impl´ıcitode los datos. Sin embargo, la explotaciónde datos anotados semánticamente es bastante compleja debido a su estructura subya- cente en forma de grafo y a la semántica impl´ıcita,la cual requiere de técnicas de razonamiento que suelen ser costosas computacionalmente. Objetivos En esta tesis se proponen una serie de métodos para al análisisde datos anotados semánticamente de forma escalable y eficiente. El éxitoque han tenido las técnicasde analisis multidimensional aplicadas a grandes cantidades de datos estructurados, mayormente las técnicas de almacenes de datos y OLAP, nos ha hecho plantearnos la aplicaciónde dichas técnicassobre datos anotados semánticamente. De esta forma, en un hospital en donde la informaciónde los pacientes estéanotada semánticamente y enlazada con datos biomédicos, el médicopodrárealizar un análisismultidimensional a nivel conceptual para poder analizar el impacto de administrar ciertos fármacosa pacientes con un de- terminado tipo de enfermedad. Es decir, el médicopodráescoger sus variables de estudio o dimensiones, tales como el tipo de enfermedad, el sexo del paciente, o el tipo de medicamento administrado, y estudiar el impacto que tienen en distintos indicadores

Scalable Methods to Analyze Semantic Web Data

The Semantic Web - ISWC 2004

Trombert-Paviot, Alan Rector, Robert Baud, Pieter

Ontologies and the Semantic Web

Knowledge Driven Software and “Fractal Tailoring”: Ontologies in Development Environments for Clinical Systems

Analysing Syntactic Regularities and Irregularities in SNOMED-CT Eleni Mikroyannidi*, Robert Stevens, Luigi Iannone and Alan Rector

Ontologies and the Semantic Web

Thesis Submitted to the University of Manchester for the Degree of Doctor of Philosophy in the Faculty of Medicine

Semantic Mapping of Clinical Model Data to Biomedical Terminologies to Facilitate Interoperability

An Introduction to OWL

OWL 2 Web Ontology Language Primer (Second Edition) W3C Recommendation 11 December 2012

Proceedings Ontology Matching 2009

A Practical Introduction to Protégé OWL