Scalable Methods to Analyze Semantic Web Data
UNIVERSITAT JAUME I
Departament de Llenguatges i Sistemes Informàtics

Scalable methods to analyze Semantic Web data

Ph.D. dissertation
Victoria Nebot Romero
Supervisor: Dr. Rafael Berlanga Llavori
Castellón, December 2012

To my family. To Rafa.

Abstract

Semantic Web data is currently heavily used as a data representation format in scientific communities, social networks, business companies, news portals and other domains. The emergence and availability of Semantic Web data demands new methods and tools to analyze such data efficiently and take advantage of the underlying semantics. Although some applications make use of Semantic Web data, advanced analytical tools are still lacking, which prevents users from exploiting the attached semantics.

The main objective of this dissertation is to provide a formal framework that enables the multidimensional analysis of Semantic Web data in a scalable and efficient manner. The success of multidimensional analysis techniques applied to large volumes of structured data in the context of business intelligence, especially for data warehousing and OLAP applications, has prompted us to investigate the application of such techniques to Semantic Web data, which is semi-structured and contains implicit knowledge.

Multidimensionality is based on the fact/dimension dichotomy. Data are modeled in terms of facts, which are the analytical metrics, and dimensions, which are the different analysis perspectives and are usually hierarchically organized. We believe that constructing a multidimensional view of Semantic Web data, driven by the semantics encoded in the data themselves and by the user's requirements, empowers and enriches the analysis process in a unique manner, as it brings about analytical capabilities that were not possible before.
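The fact/dimension dichotomy described above can be illustrated with a minimal Python sketch. It is not the dissertation's method: the disease hierarchy, the fact records, the `dose_mg` measure and the helper names `roll_up` and `aggregate` are all hypothetical and chosen only for illustration. The sketch groups facts by their dimension values and sums a measure, optionally replacing a dimension value by its ancestor to obtain a coarser granularity.

```python
from collections import defaultdict

# Hypothetical dimension hierarchy (child -> parent); not taken from the dissertation.
DISEASE_PARENT = {
    "Type1Diabetes": "Diabetes",
    "Type2Diabetes": "Diabetes",
    "Diabetes": "MetabolicDisease",
}

# Hypothetical facts: one value per dimension plus a numeric measure.
FACTS = [
    {"disease": "Type1Diabetes", "sex": "F", "dose_mg": 10.0},
    {"disease": "Type2Diabetes", "sex": "F", "dose_mg": 20.0},
    {"disease": "Type2Diabetes", "sex": "M", "dose_mg": 30.0},
]


def roll_up(value, parent, levels):
    """Climb up to `levels` steps in the hierarchy, stopping at the root."""
    for _ in range(levels):
        if value not in parent:
            break
        value = parent[value]
    return value


def aggregate(facts, dims, measure, hierarchies=None, levels=0):
    """Sum `measure` per combination of `dims`.

    `hierarchies` maps a dimension name to its child -> parent table; values of
    those dimensions are rolled up `levels` steps before grouping.
    """
    hierarchies = hierarchies or {}
    totals = defaultdict(float)
    for fact in facts:
        key = tuple(
            roll_up(fact[d], hierarchies[d], levels) if d in hierarchies else fact[d]
            for d in dims
        )
        totals[key] += fact[measure]
    return dict(totals)


# Finest granularity, then one level coarser, then with the "sex" perspective dropped.
fine = aggregate(FACTS, ["disease", "sex"], "dose_mg")
coarse = aggregate(FACTS, ["disease", "sex"], "dose_mg", {"disease": DISEASE_PARENT}, levels=1)
by_disease = aggregate(FACTS, ["disease"], "dose_mg", {"disease": DISEASE_PARENT}, levels=1)
```

Here `coarse` merges the two diabetes subtypes under their common parent, and `by_disease` shows the effect of removing an analysis perspective. Summation is only one possible aggregate; whether it is meaningful depends on the measure.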
Aggregations and display operations typical of multidimensional analysis tools, such as changing the granularity level of the displayed data or adding a new analysis perspective, are performed based on the semantic relations encoded in the data. This is possible thanks to the mapping of the data to a conceptual multidimensional space.

We base our research on the hypothesis that Semantic Web data is an emerging knowledge resource worth exploiting, and that the knowledge encoded in Semantic Web data can be leveraged to perform an efficient, scalable and full-fledged multidimensional analysis. Scalability is achieved by two means. On the one hand, we provide an indexing model over ontologies that manages implicit knowledge in a compact format, so that operations that require reasoning can be solved efficiently using the indexes. On the other hand, we have developed several scalable modularization techniques that build upon these indexes and make it possible to extract and work only with the ontological subsets of interest.

These indexing and modularization methods assist in the analytical tasks by enabling the extraction of multidimensional data from semantic sources based on the user query. That is, the methods are used to make the extraction of facts and dimensions from Semantic Web data efficient and scalable. However, identifying facts, dimensions, measures and well-shaped dimension hierarchies in Semantic Web data is a major challenge because of the mismatch between the graph model that underlies such data and the multidimensional model. Therefore, the notions of fact and dimension are revisited in the Semantic Web context, and both are defined from a logical viewpoint.

Facts are formally defined as multidimensional points quantified by measure values, all of which are logically reachable from the subject of analysis defined by the user.
To this end, we use the notion of aggregation path, which is less restrictive than the traditional multidimensional constraints usually imposed between facts and dimensions, and define different interesting subgroups for analysis. We also detect the summarizability of the extracted facts and produce correctly aggregated results.

Dimensions are defined as directed acyclic graphs composed of nodes that are semantically related and adhere to the conceptual specification of the user. That is, nodes are sub-concepts of the dimension type and edges are subsumption relations. Two alternative methods make it possible to reshape the extracted dimension hierarchies to favor aggregation while preserving the semantics of the dimension values as much as possible.

The flexibility introduced in the discovery of facts and dimensions allows us to analyze data with complex relations, which escape the traditional multidimensional model and cannot otherwise be analyzed. Moreover, instead of building a huge, one-time data warehouse, our method for fact and dimension extraction is based on several indexes and precomputed data, which allows us to efficiently materialize only the facts and dimensions required by an analytical query. This provides up-to-date, dynamic and customized results to the user.

The experimental evaluation performed on each of the components of the framework demonstrates that the proposed analysis framework scales to large ontological resources. Likewise, the developed use cases show the potential of the multidimensional analysis of Semantic Web data.

Summary

Introduction

The Web has become one of the largest sources of knowledge, at the scientific and business level as well as in everyday life. Initially, the Web was conceived as a resource for consulting information with more or less static content, made up of a series of linked documents that were typically rich in text.
The first major evolution of the Web led to Web 2.0, whose goal is to create a more dynamic Web that makes it easier for users to interact and share information. Web 2.0 technologies allow users to create and manage content on the Web. Part of the success of Web 2.0 is due to the use of semi-structured languages such as XML for information exchange, which give some structure to the content. However, XML technologies have proved insufficient to manage efficiently the avalanche of information published on the Web, since XML can structure a document syntactically but not its content. This has driven the evolution of the Web toward a more intelligent Web, the Semantic Web, or Web 3.0.

The goal of the Semantic Web is to describe the content and information present on the Web using logical formalisms, so that it can be "understood" and processed automatically by machines. Although this goal is very ambitious, suitable technologies for the semantic description of content, such as RDF and OWL, are already available. In addition, initiatives such as Linked Open Data (LOD) promote the publication and linking of data in semantic format on the Web. Such has been the success of these initiatives that organizations from very diverse fields have decided to publish their data on the Web in semantic format.
Organizations that have joined this initiative include: several governments, such as those of the United Kingdom and the United States, which have published large amounts of data on topics as diverse and interesting as ozone levels in different cities or the financial situation of companies; media outlets such as the BBC, which makes available to users data about its television and radio programmes linked to other knowledge resources; and several scientific communities in the biomedical field, such as Bio2RDF, DailyMed, Diseasome and DrugBank among many others, which have published large amounts of data on the human genome, drugs together with their chemical composition and use, diseases and their genetic relationships, etc. The biomedical domain is especially complex, which is why the conceptualization of these data can bring about important advances.

The existence of this kind of semantically annotated data opens new research areas for developing applications that exploit the implicit knowledge in the data. However, exploiting semantically annotated data is quite complex because of its underlying graph structure and its implicit semantics, which requires reasoning techniques that tend to be computationally expensive.

Objectives

This thesis proposes a series of methods for analyzing semantically annotated data in a scalable and efficient way. The success of multidimensional analysis techniques applied to large amounts of structured data, mainly data warehouse and OLAP techniques, has led us to consider applying these techniques to semantically annotated data.
Thus, in a hospital where patient information is semantically annotated and linked with biomedical data, a physician will be able to carry out a multidimensional analysis at the conceptual level to analyze the impact of administering certain drugs to patients with a given type of disease. That is, the physician will be able to choose the study variables, or dimensions, such as the type of disease, the patient's sex, or the type of drug administered, and study the impact they have on different indicators