Integrating Semi-structured Information using Semantic Technologies An Evaluation of Tools and a Case Study on University Rankings Data

Alejandra Casas-Bayona and Hector G. Ceballos Campus Monterrey, Tecnologico de Monterrey, Av. Eugenio Garza Sada 2501, 64849, Monterrey, Nuevo León, Mexico

Keywords: , Ontologies, Information Integration, Information Reconciliation, Best Practices.

Abstract: Information integration is not a trivial activity. Information managers face problems like: heterogeneity (in data, schemas, syntax and platforms), distribution and duplicity. In this paper we: 1) analyze ontology-based methodologies that provide mediation frameworks for integrating and reconciling information from structured data sources, and 2) propose the use of available semantic technologies for replicating such functionality. Our aim is providing an agile method for integrating and reconciling information from semi- structured data (spreadsheets) and determining to which extent available semantic technologies minimize the need of ontological expertise for information integration. We present our findings and lessons learned from a case study on university rankings data.

1 INTRODUCTION whereas instance reconciliation consists on identifying instances that represent the same entity Information has become a fundamental asset for in the real world despite their multiple company’s competitiveness. Integrating information representations (Zhao & Ram, 2008). distributed across systems and platforms in the Ontologies are particularly suitable for organization for its later analysis has become the integrating information as long as they deal with most important task for information managers. syntactic and (Silvescu, Integrating information residing in distributed and Reinoso-castillo, & Honavar, 1997). The term heterogeneous data sources is a well-known problem ontology was introduced by Gruber in the context of that faces two difficult challenges: heterogeneity and knowledge sharing as “the conceptualization of a reconciliation. specification” (Gruber, 2009). This is, an ontology is Heterogeneity can be classified in four the description of concepts and relationships that an categories: 1) structural, i.e. different schemas in agent or a community may use (Gruber, 2008). data sources; 2) syntactical, i.e. different ways of Some ontology-based methodologies and naming the same object (catalogues); 3) systemic, techniques for information integration have been i.e. diverse platforms governed by different proposed (Buccella et al., 2005). For example, the authorities; and 4) semantic, i.e. different meanings project MOMIS (Mediator environment for Multiple for same concept or different names for the same Information Sources) allows integrating data from concept (Buccella, Cechich, & Brisaboa, 2005). For structured and semi-structured data sources (Cui, Brien, & Park, 2000), resolving semantic (Beneventano et al., 2001). SIMS is another heterogeneity consists in identifying equivalent mediator platform that provides access to distributed concepts, no-related concepts and related concepts. data sources and online integration (ARENS et al., On the other hand, data reconciliation has been 1993). MIRSOFT, unlike the other two approaches, classified on two levels: schema and instance does not map entirely the origin data sources but (Bakhtouchi, Jean, & Ait-ameur, 2012). Schema focuses on the minimum necessary data for reconciliation consists on finding equivalences answering the integration question (Bakhtouchi et between tables and columns from different data al., 2012). These approaches support the task of data sources, i.e. it addresses structural heterogeneity, interpretation by requiring the participation of the domain expert at some extent and minimizing the

Casas-Bayona A. and Ceballos H.. 357 Integrating Semi-structured Information using Semantic Technologies - An Evaluation of Tools and a Case Study on University Rankings Data. DOI: 10.5220/0005004203570364 In Proceedings of 3rd International Conference on Data Management Technologies and Applications (DATA-2014), pages 357-364 ISBN: 978-989-758-035-2 Copyright c 2014 SCITEPRESS (Science and Technology Publications, Lda.) DATA2014-3rdInternationalConferenceonDataManagementTechnologiesandApplications

involvement of the database administrator. The MOMIS methodology starts in the extraction In scenarios with periodic data extraction the phase where it uses a wrapper that transform the cost of the required infrastructure and the time structure of available data sources to a model OLDI3 dedicated by the expert is justified. Nevertheless, in based on Description Logics (Moreno Paredes, scenarios where an agile response is needed, 2007). The integration process generates an information requirements are volatile, and data is integrated view of the data sources (global-as-view) provided by different departments in semi-structured through the generation of a thesaurus. Then it formats, a more agile solution is needed. Our performs an analysis for determining affinity proposal is to use available semantic technologies between terms and it makes clusters of similar for replicating the functionality that these classes (Beneventano et al., 2001). Queries are approaches provide. expressed in terms of thesaurus schemas. In section 2 we describe and analyse the facilities SIMS, on the opposite way, was created provided by each approach. In section 3 we assuming that data sources are dynamic; hence it described our approach for information integration starts by defining an initial question from which and our criteria for selecting semantic technologies. proceeds to the selection of suitable data sources, in In section 4 we exemplify our methodology by order to minimize the cost of mapping schemas. The integrating faculty staff information from two phase starts with an expert annotating sources for a university ranking. Finally in section 5 data sources with ontology schemas and continues we present a summary of lessons learned and enriching initial mappings with other vocabularies conclude with closing remarks in section 6. automatically. SIMS elaborates query plans for answering queries expressed in terms of the used ontologies (Arens et al., 1996). 2 RELATED WORK MIRSOFT departs from an initial user requirement and importing mediator ontology. Then Ontology-based information integration approaches it requires mapping data sources to mediator schemas, as well as expressing functional provide a solution for integrating heterogeneous databases and semi-structured data. Some of them dependencies (FDs) of mapped classes, providing additionally provide data reconciliation facilities. ontology-based database access, also known as OBDB or OntoDB. Before starting the mapping Table 1 shows a comparison of the three analyzed methodologies found in literature related to stage, a selection of relevant classes and properties is ontology-based information integration. For the done, producing a pruned ontology. This methodology additionally uses functional comparison we focused in five aspects: 1) the original information requirement, 2) the way on dependencies for complementing queries and which information is mapped, 3) the way on which providing reconciliation in query results (Bakhtouchi, Bellatreche, Jean, & Ameur, 2012). integration is performed, 4) reconciliation facilities, and 5) the schemas used for expressing queries. Queries are expressed and answered in terms of the Numbers in each row represent the order followed in mediator ontology. each methodology. Table 1: Comparison of ontology-based methodologies for information integration.

358 IntegratingSemi-structuredInformationusingSemanticTechnologies-AnEvaluationofToolsandaCaseStudyon UniversityRankingsData

3 INTEGRATION AND In this step is important to declare consistency constraints like if two classes CR are disjoint, this is, RECONCILIATION METHOD when an individual must be classified as member of only one of these classes. We evaluated the use of semantic technologies in scenarios of periodic integration exercises where 3.2 Local Ontologies information used for answering queries is provided by different departments using spreadsheets with Next, each datasheet is transformed to a RDF file varying formats. Heterogeneity is present in all of its where each row represents an individual of a new expressions: information resides in different systems class CL, columns represent properties of the class (systemic), spreadsheets have different columns and cells are property values that describe each (structural), varying column names and use different individual. The resulting class, properties and catalogues (syntactical), and the same concept is individuals constitute a local ontology OL. Available defined distinctly in each data source (semantic). tools for this purpose are referenced by W3Cii. Additionally, equivalent individuals can be found across these data sources with contradictory 3.3 Local-Global Ontology Alignments information. In this scenario, we cannot rely on fixed structures, and mappings must be generated The ontology O is imported in each O for cost-effectively. G L expressing the corresponding definition of each C . Based on the aspects analyzed in previous R If the definition of C only requires information of a methodologies we propose a method suitable for this R single data source then C is defined in O . If the scenario that provides ontology-based information R L definition of C requires information from multiple integration taking advantage of current Semantic R sources its definition must be done in the integrated Web standards and tools. Stages of our method are ontology. The integrated ontology O is a new OWL described Figure 1. I file that imports OD and each OI. This stage can be facilitated by ontology mapping tools that identify semantic correspondences between elements in two different schemas (Zohra et al., 2011). Available tools for this task are referred by the initiativeiii.

3.4 Reconciliation

Product of local definitions we can have multiple Figure 1: Information integration process. individuals representing the same individual. Reconciliation of equivalent individuals is made 3.1 User Requirement declaring an OWL 2 constraint HasKey for classes CR. Another alternative is declaring a SWRL rule or using a SPARQL CONSTRUCT query that assert We start with an initial user requirement expressed statements SameIndividual for each pair of through a global schema represented by an extension equivalent individuals, based on functional O of an existing ontology O . A suitable OWL G B dependencies. ontology for your domain can be chosen from This procedure is especially important when ontologies repositories like Linked Open datasheets contain multiple rows for the same Vocabularies i. A data requirement is denoted by a individual that is being quantified in CR. class CR that represents a set of individuals that must be identified and quantified across the available data sources. Each class CR is declared in OG as subclass of a class CG originally contained in OB in order to make query results compatible with the original ii ontology OB. http://www.w3.org/wiki/ConverterToRdf iiihttp://wiki.opensemanticframework.org/index.php/Ontol i http://lov.okfn.org/dataset/lov/ ogy_Tools#Ontology_Mapping

359 DATA2014-3rdInternationalConferenceonDataManagementTechnologiesandApplications

3.5 Inference integration to replicate one of the integration exercises. For this purpose we prepared two user

Finally, a concept reasoner is used on OI for guides with a summary of use cases given in classifying individuals in each CR. Reasoners can be (Allemang & Hendler, 2008), where OWL executed from the IDE of a comprehensive constructors are classified according to integration framework like Protégé or Top Braid, or be executed or reconciliation purposes. Reconciliation guide is directly over OI. Another alternative is using an RDF shown in the Appendix. database with inference capabilities like Virtuosoiv or Stardogv. (i) OG OB In this stage, concept reasoners are used for detecting inconsistencies between definitions and data. In order to solve these inconsistencies, violated (ii) L1 L2 constraints can be lift or inconsistent individuals can Spreadsheet be removed from the data set. For instance, an converter individual classified in two disjoint classes will be (iii) OL1 OL2 inconsistent and the solution will be up to the (v) domain expert.

O OI’ Results Imported (iv) I 3.6 Query Resolution ProcessReasoner SPARQL (vi)

Finally, the domain expert identify and quantifies individuals classified in each CR by formulating a Figure 2: Datasets and Tools. query over a SPARQL end-point connected to the inferred model OI’ or by enquiring the RDF In this section we describe the results of the database using the reasoning level required by evaluation of tools and in section 5 we discuss our definitions. CR is quantified by a query like: findings and perspectives on semantic technologies. SELECT COUNT (DISTINCT (?ind)) 4.1 User Requirement WHERE{ ?ind a CR }

Figure 2 shows the inputs (OB, OG, L1, L2), the The initial requirement is given by definitions of the working models (OL1, OL2, OI, OI’), and tools. Stages ranking evaluation criteria. In our integration of the method are numbered as well. exercise we identified six classes CR that can be used for answering the faculty staff criteria: Faculty Staff, Full Time Professor, Part Time Professor, Local 4 CASE STUDY Professor, International Professor and Professor with PhD. Further calculations like the Full-time Our case study was carried out in a University that equivalent are out of the scope of the integration and provides statistical data to an international Ranking reconciliation task. We evaluated Neon Toolkitvi, Protégévii and organization. In this sense, the rankings department viii collects information in spreadsheets from several TopBraid Composer Standard Edition based on the departments. In order to identify and classify the following criteria: 1) the usability of its interface, 2) university academic staff, information is collected visualization of definitions, 3) facilities for from both Human Resources and Academic importing data from spreadsheets, 4) configurable departments. reasoning levels, and 5) facilities for exporting query The integration was initially performed by a results. Additionally, our selection criteria computer science specialist with basic knowledge on considered the minimization of the number of tools ontologies, following the use cases from (Allemang used along the entire process, being TopBraid & Hendler, 2008). After a series of integration Composer choosen for the first four stages and exercises that allowed us to evaluate available tools Protégé for the last two. we asked two other specialists in information On the other hand, we evaluated two ontologies:

vi http://neon-toolkit.org iv http://virtuoso.openlinksw.com/rdf-quad-store/ vii http://protege.stanford.edu/ v http://stardog.com/ viii http://www.topquadrant.com/

360 IntegratingSemi-structuredInformationusingSemanticTechnologies-AnEvaluationofToolsandaCaseStudyon UniversityRankingsData

VIVOix and for Research Communitiesx (SWRC) as base for our global ontology, and selected the last one because it provides enough expressivity for our purpose and has less and simpler schemas.

4.2 Local Ontologies

We used TopBraid Composer facilities for Figure 4: Equivalent classes and properties. transforming spreadsheet data to local ontologies, generating a local ontology for each data source (one class and N properties). Two local ontologies where 4.4 Reconciliation produced, denoted Local A (OL-A) and Local B (OL- ) in the following. Local classes (C ) generated in Reconciliation was done through the assertion of B L statements SameIndividual between persons with the each local ontology are denoted Person A (CL-A) and same employee identification number. In OL-A this Person B (CL-B), respectively. Figure 2 illustrates with an example the heterogeneity between both property is called hasId, whereas in OL-B the data sources. equivalent property is named hasIdentification. This procedure was done with a SPARQL CONSTRUCT query.

4.5 Inference

For the reasoning stage we evaluated the reasoners integrated in Protégé and TopBraid Composer: FACT++, Hermit, OWLIMxi and Pellet. Figure 2: Example of data found in both data sources. Not a single reasoner in TopBraid Composer could be configured for making the Closed World 4.3 Local-Global Ontology Alignments Assumption (CWA), hence it was not possible to classify individuals based on definitions like In the Global-Local ontology alignments stage we International Professor that uses a ComplementOf formulated one definition for each CR per local constraint. This was not the case of Hermit in ontology, i.e. 12 definitions in total. Figure 3 Protégé which makes the CWA by default and illustrates some definitions made for OL-A and OL- correctly made this classification. The resoner that B. Prefixes G: and A: represent the ontologies OG had better performance in TopBraid Composer was and OL-A respectively. OWLIM. TopBraid Composer suggested mappings Thanks to the definition of Full-time Professor between equivalent properties in OL-A and OL-B after and Part-time Professor as disjoint classes it was they were imported to the integrated ontology OI. possible to detect inconsistencies between data (see Figure 4). Mappings proposed by this tool were sources. Figure 4 illustrates an inconsistency of this correct and facilitated using properties defined in type, where the same professor is classified as Part- local ontologies for recovering information from time in OA (A:hasContract = A) and as Full-time in both data sources (see queries in section 4.6). OB (B:hasContract = X), given local definitions. To G:PTProfessor  A:Person  A:hasContrat = {A,B,C} (1) solve this problem we removed the disjoint G:FTProfessor  A:Person  A:hasContrat = {D, E, F} (2) constraint and adjusted the query for retrieving Part- G:FacultyStaff  G:PTProfessor  G:FTProfessor (3) time professors as explained in next section. G:FTProfessor  G:PTProfessor ⊆ ⊥ (4) G:LocalProfessor  A:Person  A:hasNac = ‘Mexican’ (5) 4.6 Query Resolution G:LocalProfessor  B:Person  B:hasNacionality = ‘Mex’ (6) G:InternationalProfessor  ¬ A:MexicanProfessor (7) Information required for answering rankings Figure 3: Example of definitions in local ontologies. xi ix http://www.vivoweb.org http://owlim.ontotext.com/display/OWLIMv40/OWLI x http://ontoware.org/swrc/ M-Lite+Reasoner

361 DATA2014-3rdInternationalConferenceonDataManagementTechnologiesandApplications

questions were expressed in SPARQL through the database with inference support. In the first case, combination of classes CR. For instance, for information that could be useful for further determining the number of Full Time International integrations is lost, and in the second we sacrify Faculty Staff we used the intersection of Full Time the facilities provided by the IDE for making Professor and International Professor. definitions and export results. Given that we found professors classified as From the experience of users following our Full-time and Part-time the query for retrieving Part- method and the selected tools we draw the following time Professors was adjusted as follows: preliminary conclusions: SELECT COUNT (DISTINCT (?ID))  The most dificult concept to understand for WHERE{ database specialists was the notion of Open ?prof a G:PTProfessor . World Assumption in the construction of ?prof A:hasID ?ID . definitions. Unfortunately most of the FILTER NOT EXISTS {?prof a G:FTProfessor.} reasoners implement this assumption and forces the user to adopt the equally confusing We also needed to adjust the query for retrieving notion of negation as failure (c.f. queries in International professors using OWLIM. Given that it section 4.6). does not support CWA, the following query asks for faculty staff and discards local professors:  The usability of the selected tools was the best given our evaluation and user guides prepared SELECT COUNT (DISTINCT (?ID)) for applying OWL constructors in the WHERE{ integration and reconciliation stages (see ?prof a G:FacultyStaff . Appendix), in despite the users found hard to ?prof A:hasID ?ID . understand the meaning of OWL constructors FILTER NOT EXISTS {?prof a without a previous introduction that include G:MexicanProfessor.} notions of set theory.  Interfaces for constructing OWL definitions are not suitable for non-experts on ontology 5 DISCUSSION languages.

In this section we describe the lessons learned from 5.2 Ontology-based versus tool’s maturity and the experience of non-ontologist Relational-based Integration users with our methodology. We also identify the advantages of ontology-based versus relational- Volatile input formats in our scenario make based integration, and delimit the reach of ontology affordable the use of semantic technology as long as mapping tools in the integration process. avoids the necessity of creating fixed-data structures (tables) that cannot be reused in subsequent 5.1 Lessons Learned integrations. On the other hand, the absence of common keys From the proposed method and the evaluation of between datasources requires the use of entity tools we get to the following conclusions. resolution techniques, which have been developed  In our scenario, an ontologist expert could for both technologies. Nonetheless, the semantic replicate most of the functionality provided by representation allows representing equivalences mediator architectures described in section 2, between individuals that produce an automatic including reconciliation of individuals. horizontal integration once datasources are merged.  It was possible to represent semantically the Silk is an example of a tool deviced for entity concepts in our case study with constructors resolution in (Volz, et al., 2009). supported by OWL 2 profiles. Despite reasoners allow inferring information  Deficiencies in inference support was remediated like individual equivalences based on keys, it is through the use of SPARQL. necessary to merge the information from all  Inference in memory becomes unfeasible with a datasources in a single repository or having relative small ammount of data (above a distributed stream reasoning. This kind of thousand records with more than 30 columns). technologies are still emerging. Stardog is an We have to decide between preselect the example of it. columns to use in the integration, or use a RDF

362 IntegratingSemi-structuredInformationusingSemanticTechnologies-AnEvaluationofToolsandaCaseStudyon UniversityRankingsData

5.3 Ontology Mapping Tools according to their usage in the integration stage, but they can be further classified according to the Ontology mapping approaches may solve the heterogeneity type they address, in order to facilitate problem of aligning local and global schemas, user’s adoption. avoiding (or at least reducing) the need of a non- ontologist user to decide which OWL constructors to use for integrating and reconciling information. ACKNOWLEDGEMENTS Despite local schemas may contain few properties (datasheet columns), mapping them to Authors thank to Tecnologico de Monterrey and large global ontologies may be largely benefited CONACYT for sponsoring this research through from the use of these tools, as ilustrated by grants 0020PRY058 and CB-2011-01-167460, (Rodriguez, et al., 2011). respectively. However, the automatic discovery of mappings could exacerbate the problem of reasoning in large amounts of data producing an excess of mappings between schemas that are not used to answer the REFERENCES question that motivates integration. Allemang, D, & Hendler, J., 2008. Semantic Web for the Working Ontologist: Effective Modeling in RDFS and OWL. Morgan Kaufmann. 6 CONCLUSIONS Arens, Y., Chee, C., Hsu, C., In, H., Knoblock, C. A., & Rey, M., 1996. Query Processing in the SIMS The need of an agile process for information Information Mediator. In Proceedings of ARPI 1996. integration has become more evident in large 61–69. Arens, Y., Chee, C. Y., Hsu, C.-N., & Craig A., K., 1993. organizations, as much as for trivial questions as for Retriving and Integrating Data from Multiple building indicators that involve information Information Sources. International Journal of managed by external organizations. Intelligent and Cooperative Information Systems, 2. We presented a method supported by current 127-158. semantic technologies that provide information Bakhtouchi, A., Bellatreche, L., Jean, S., & Ameur, Y. A., integration of heterogeneous data sources. In 2012. MIRSOFT: mediator for integrating and scenarios where data is provided in semi-structured reconciling sources using ontological functional formats and varies over time, our method matches dependencies. International Journal of Web and Grid the results obtained by other robust mediator Services. doi:10.1504/IJWGS.2012.046731. Beneventano, D., Bergamaschi, F., Guerra, M., & Vincini, architectures in a cost-effective manner. 2001. The MOMIS approach to Information Nevertheless, information integration requires a Integration. deep understanding of ontological representations in Beneventano, Domenico, & Bergamaschi, S., 2004. The order to perform a proper integration and Momis Methodology For Integrating Heterogeneous reconciliation, despite the facilities provided by Data Sources. Chapter in Building the Information current tools. Society, 19-24. Despite semantic web languages and protocols as Buccella, A., Cechich, A., & Brisaboa, N. R, 2005. well as current semantic technologies are mature Ontology-Based Data Integration Methods : A enough for enabling information integration there Framework for Comparison. Revista Colombiana de Computación, 6. are still some issues that must be addressed. For Cui, Z., Brien, P. O., & Park, A., 2000. Domain Ontology instance, by providing scalable reasoning on large Management Environment, In Proceedings of the 33rd amounts of data, friendly interfaces for non-ontology Hawaii International Conference on System Sciences. experts and configuration options for reasoners that 1–9. allows making the Closed World Assumption when Gruber, T., 2009. What is Ontology? In the Encyclopedia the information is complete. of Database Systems, Ling Liu and M. Tamer Özsu Solving these problems will allow taking (Eds.), Springer-Verlag. advantage of ontology mapping tools for discovering Lenzerini, M., Sapienza, L., Salaria, V., & Roma, I., 2002. all possible alignments between local and global Data Integration : A Theoretical Perspective. In Proceedings of the twenty-first ACM SIGMOD- schemas. Furthermore, it will be possible to select SIGACT-SIGART symposium on Principles of automatically the domain ontology that better suits database systems. 233 - 246. (best scored matches) to a set of local schemas. Moreno Paredes, A., 2007. Técnicas de depuración e In this exercise we classified OWL constructors

363 DATA2014-3rdInternationalConferenceonDataManagementTechnologiesandApplications

integración de ontologías en el ámbito empresarial. APPENDIX Universidad de Sevilla. Rodríguez-Mancha, M., Ceballos, H., Cantú, F., & Díaz- Prado, A., 2011. Mapping relational databases through Table 2 is an example of the guide with ontology matching: a case study on information reconciliation operators provided to database migration. In Proceedings of the 6th International specialist during the integration exercise. Cases and Workshop on Ontology Matching (OM-2011). CEUR- solutions were obained from (Allemang & Hendler, WS Vol-814, 244–245. 2008). A similar table was prepared for integration Silvescu, A., Reinoso-castillo, J., & Honavar, V., 2001. operators. Ontology-Driven Information Extraction and Knowledge Acquisition from Heterogeneous , Distributed , Autonomous Biological Data Sources. In Proceedings of the IJCAI-2001 Workshop on Knowledge Discovery from Heterogeneous, Distributed, Autonomous, Dynamic Data and Knowledge Sources. Volz, J., Bizer, C., Gaedke, M., & Kobilarov, G., 2009. Discovering and Maintaining Links on the Web of Data. In Proceedings of the 8th International Semantic Web Conference (ISWC 2009), 650–665. Zhao, H., & Ram, S., 2008. Entity matching across heterogeneous data sources: An approach based on constrained cascade generalization. Data & Knowledge Engineering, 66(3), 368–381. Zohra B., Angela B., Erhard R., 2011. Schema matching and mapping. Springer.

Table 2: User’s guide with OWL reconciliation operators.

364