Semantic Data Mining: a Survey of Ontology-Based Approaches

Semantic Data Mining: A Survey of Ontology-based Approaches Dejing Dou Hao Wang Haishan Liu Computer and Information Science Computer and Information Science Computer and Information Science University of Oregon University of Oregon University of Oregon Eugene, OR 97403, USA Eugene, OR 97403, USA Eugene, OR 97403, USA Email: [email protected] Email: [email protected] Email: [email protected] Abstract—Semantic Data Mining refers to the data mining tasks domain knowledge can work as a set of prior knowledge of that systematically incorporate domain knowledge, especially for- constraints to help reduce search space and guide the search mal semantics, into the process. In the past, many research efforts path [8], [9]. Further more, the discovered patterns can be have attested the benefits of incorporating domain knowledge in data mining. At the same time, the proliferation of knowledge cleaned out [49], [48] or made more visible by encoding them engineering has enriched the family of domain knowledge, espe- in the formal structure of knowledge engineering [76]. cially formal semantics and Semantic Web ontologies. Ontology is To make use of domain knowledge in the data mining pro- an explicit specification of conceptualization and a formal way to cess, the first step must account for representing and building define the semantics of knowledge and data. The formal structure the knowledge by models that the computer can further access of ontology makes it a nature way to encode domain knowledge for the data mining use. In this survey paper, we introduce and process. The proliferation of knowledge engineering (KE) general concepts of semantic data mining. We investigate why has remarkably enriched the family of domain knowledge with ontology has the potential to help semantic data mining and techniques that build and use domain knowledge in a formal how formal semantics in ontologies can be incorporated into way [64]. Ontology is one of successful knowledge engineer- the data mining process. We provide detail discussions for the ing advances, which is the explicit specification of a concep- advances and state of art of ontology-based approaches and an introduction of approaches that are based on other form of tualization [26], [67]. Normally, an ontology is developed to knowledge representations. specify a particular domain (e.g., genetics). Such an ontology, often known as a domain ontology, formally specifies the I. INTRODUCTION concepts and relationships in that domain. The encoded formal Data mining, also known as knowledge discovery from semantics in ontologies is primarily used for effectively shar- database (KDD), is the process of nontrivial extraction of im- ing and reusing of knowledge and data. Prominent examples plicit, previously unknown, and potentially useful information of domain ontologies include the Gene Ontology (GO [73]), from data [30]. In the past few decades, advances in data Unified Medical Language System (UMLS [45]), and more mining techniques lead to many remarkable revolutions in than 300 ontologies in the National Center for Biomedical data analytics and big data. Data mining also combines tech- Ontology (NCBO [2]). niques from statistics, artificial intelligence, machine learning, Research in the area of the Semantic Web [10] has led to database system, and many other disciplines to analyze large quite mature standards for modeling and codifying domain data sets. Semantic Data Mining refers to data mining tasks knowledge. Today, Semantic Web ontologies become a key that systematically incorporate domain knowledge, especially technology for intelligent knowledge processing, providing a formal semantics, into the process. The effectiveness of do- framework for sharing conceptual models about a domain. The main knowledge in data mining has been attested in past Web Ontology Language (OWL) [1], which has emerged as research efforts. Fayyad et al. [21] claimed that domain knowl- the de facto standard for defining Semantic Web ontologies, is edge can play an important role in all stages of data mining widely used for this purpose. The Semantic Web technologies including, data transformation, feature reduction, algorithm that formally represent domain knowledge including structured selection, post-processing, model interpretation and so forth. collection of prior information, inference rules, knowledge Russell and Norvig [64] believed that an intelligent agent enriched datasets etc., could thus develop frameworks for (e.g., a data mining system) must have the ability to obtain systematic incorporation of domain knowledge in an intelligent the background knowledge and should learn knowledge more data mining environment. effectively with the background knowledge. In this survey paper, we study the advances and state of Previous semantic data mining research has attested the pos- art of semantic data mining. We specifically focus on the itive influence of domain knowledge on data mining. For ex- ontology-based approaches. The ontology-based approaches ample, the preprocessing can benefit from domain knowledge for semantic data mining attempt to make use of formal on- that can help filter out the redundant or inconsistent data [41], tologies in the data mining process. This is generally achieved [59]. During the searching and pattern generating process, by using the formal definition of concepts and relationships in ontologies as auxiliary information or constraint conditions into one combined attribute. In practice, semantic gaps are usu- to guide the data mining process. For example, in classifi- ally filled manually by domain experts. However, ontologies cation, ontology can specify the consistency relationships of have been shown to be beneficial in many data preprocessing the classification task. By ruling out the inconsistent search tasks [41], [59], [72]. space, the classification task would result in a better accu- There exists semantic gap between the data mining algo- racy [8]. Further more, structured organization of ontologies rithm and data as well. Data mining algorithms are usually can serve as a good representation for the data mining result. designed for data collected from different domains and sce- For example, in information extraction and text mining, the narios. However, data from a specific domain usually carry extracted information can be presented through the ontology domain specific semantics. The generic data mining algorithms itself using an ontology definition language (e.g., OWL) [77]. lack the ability to identify and make use of semantics across In this paper, we focus on three perspectives of ontology-based different domains and applications. Ontologies are useful to approaches in the research of semantic data mining: specify domain semantics and can reduce the semantic gap by • Role of ontologies: Why domain knowledge with formal annotating the data with rich semantics. Semantic annotation semantics, such as ontologies, are necessary in all stages aims at assigning the basic element of information links to of the data mining process. formal semantic descriptions [20], [42]. Such elements should • Mining with ontologies: How ontologies are represented constitute the semantics of their source. Semantic annotation and processed to help the data mining process. is crucial in realizing semantic data mining by bringing formal • Performance evaluation: How ontologies can improve the semantics to data. The annotated data are very convenient for performance of data mining systems in applications. the later steps of semantic data mining because the data are promoted to the formal and structured format that connects II. ROLE OF ONTOLOGIES IN SEMANTIC DATA MINING ontological terms and relations. Many research efforts have dedicated to bridge the semantic The perspective and mechanism of utilizing ontologies in gap between data mining results and users. Marinica et al. [49], semantic data mining varies across different systems and [50] used ontology for the post pruning and filtering of the applications. The question why ontology is useful in assisting association rule mining results. Mansingh et al. [48] used on- data mining process does not have an uniform conclusion. tology to assist the subjective analysis for the association rule By reviewing the previous ontology-based approaches, we post-pruning task. The data mining results can be represented summarize the following three purposes that ontologies have by ontologies in the semantic rich format which help sharing been introduced to semantic data mining: and reuse. For example, information extraction (IE) is the task • To bridge the semantic gap between the data, applica- of automatically extracting structured information from text. tions, data mining algorithms, and data mining results. The data/text mining results are sets of structured information • To provide data mining algorithms with a priori knowl- and knowledge with regarding to the domain. To represent edge which either guides the mining process or re- the structured and machine-readable information, it is nature duces/constrains the search space. to represent the information with ontology. Ontology Based • To provide a formal way for representing the data mining Information Extraction (OBIE) [77] has extensively used this flow, from data preprocessing to mining results. representation. With OBIE, the information extracted is not only well structured but also represented by predicates in the A. Bridging the semantic gap ontology which are easy for sharing and reuse. The question why domain

Semantic Data Mining: a Survey of Ontology-Based Approaches

Metadata for Semantic and Social Applications

From the Semantic Web to Social Machines

A New Generation Digital Content Service for Cultural Heritage Institutions

Enterprise Data Classification Using Semantic Web Technologies

Web Semantics in Cloud Computing

A Multimodal Approach for Automatic WS Semantic Annotation

Semantic Computing

Exploiting Semantic Web Knowledge Graphs in Data Mining

Ontology Engineering for the Design and Implementation of Personal Pervasive Lifestyle Support Michael A

Leveraging Protection and Efficiency of Query Answering in Heterogenous RDF Data Using Blockchain

Computing for the Human Experience: Semantics-Empowered Sensors, Services, and Social Computing on the Ubiquitous Web

A Semantic Engine for Internet of Things: Cloud, Mobile Devices and Gateways