(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 11, No. 6, 2020

A Framework for Semantic Text Clustering

Soukaina Fatimi, Chama EL Saili, Larbi Alaoui
TIC Lab, International University of Rabat, Sala Al Jadida, Morocco

This work is within the framework of the research project "Big Data Analytics - Methods and Applications (BDA-MA)". Authors S. Fatimi and C. El Saili are financially supported by a PhD grant of the International University of Rabat.

Abstract—Existing approaches for text clustering are either agglomerative, divisive or based on frequent itemsets. However, most of the suggested solutions do not take the semantic associations between words into account, and documents are only regarded as bags of unrelated words. Indeed, traditional text clustering methods usually focus on the frequency of terms in documents to create connected homogenous clusters without considering the associated semantics, which of course leads to inaccurate clustering results. Accordingly, this research aims to understand the meanings of text phrases in the process of clustering to make maximum use of documents. The semantic web framework is filled with useful techniques enabling substantial use of data. The goal is to exploit these techniques to make full use of the Resource Description Framework (RDF) to represent textual data as triplets. To come up with a more effective clustering method, we provide a semantic representation of the data in texts on which the clustering process would be based. On the other hand, this study opts to implement other techniques within the clustering process, such as ontology representation, to manipulate and extract meaningful topics using RDF, RDF Schemas (RDFS), and the Web Ontology Language (OWL). Since text clustering is an indispensable task for the better exploitation of documents, the use of documents may be more intelligently conducted by considering semantics in the process of text clustering to efficiently identify the more related groups in a document collection. To this end, the proposed framework combines multiple techniques to come up with an efficient approach combining machine learning tools with semantic web principles. The framework allows document RDF representation, clustering, topic modeling, cluster summarizing, and information retrieval based on RDF querying and reasoning tools. It also highlights the advantages of using semantic web techniques in clustering, topic modeling and knowledge extraction based on processes of querying, reasoning and inferencing.

Keywords—Text clustering; similarity measure; ontology; semantic web; RDF; RDFS; OWL; reasoning; inferencing rules; SPARQL; topic modeling; summarization

I. INTRODUCTION

It has been a while since the web has changed from the web of documents to the web of data. Before this upgrade, the information on the web was designed to be human-understandable only. Therefore a device or a robot could not access information in the same manner as humans, and artificial intelligence could not evolve under these circumstances. This particular issue is considered the motivation behind the evolution of information representation and the launch of the Semantic web as a web of connected data. The base concept is to transform the web of unstructured data into a network of interconnected chunks of information. Hence, both humans and machines can navigate between bits of data to explore them and retrieve more information from them. This collection of interrelated data is referred to as Linked Data [2]. Linked Data is guided by a set of principles to allow easy sharing of structured data planet-wide. To represent and enable the use of this linked data and to allow the navigation between pieces of information, a special representation should be used. The Resource Description Framework is at the core of the linked data paradigm. The RDF model, in which the data is represented as triples of an interconnected subject and object with the intermediary of a predicate, is the mainstay of the interconnection of information in the semantic web. This model enables the navigation between pieces of information following RDF links. It is true that unstructured representation of information is still used, and that studies have provided significantly valuable tools for the manipulation of textual documents, such as text clustering, information retrieval, topic identification, etc. Still, the advantages of linked data are captivating. Therefore, semantic data manipulation based on a semantic web model needs to be explored and strongly highlighted.

Text clustering has been widely explored for textual document manipulation. Yet proposed methodologies have lacked in the use of semantic relationships between words. Generally, documents are considered a bag of unrelated words, and semantics are not explored in the process of text clustering.

Nevertheless, this work aims to use the semantic web approach for semantic text clustering using the graph-based representation model RDF, with respect to the linked data principles. We propose a system that is an integrated set of techniques in which the textual documents are transformed into an RDF graph representation and divided into homogenous clusters based on a semantic clustering approach. These documents are further explored using semantic web techniques such as querying and information retrieval using inferencing and reasoning tools.

Text clustering is an indispensable task for better exploitation of documents to retrieve information and identify topics in more efficient ways. The provided system is a holistic approach allowing better understanding and use of textual documents by means of a semantic framework based on the RDF model. The purpose of working with RDF is due to its countless advantages; the self-explanatory or semantic characteristics of RDF data are very beneficial for better computing and more efficient clustering.

We present an overall framework, show how to apply machine learning techniques to mine textual documents using Linked Data principles, and highlight the importance of text clustering and the use of semantics in text clustering based on the RDF model.

The rest of the paper is organized as follows. The next section introduces a review and presents the general context of our work, covering text clustering, the semantic web, semantic similarity measurement, and topic identification. In the third section, the overall framework is presented and the steps of the system are discussed in the subsections, emphasizing the clustering process. Finally, a conclusion and perspective work are given in the last section.

II. RELATED WORK

The semantic oriented clustering approach that we are presenting in this paper is a combination of interesting concepts and techniques, from text clustering to similarity measurement and also to semantic web concepts and frameworks. In this section, we give an overview of all the above notions and mention some of the related studies and works in these fields.

A. Semantic Web

The Semantic Web concept was introduced by Tim Berners-Lee as a novel form of web content that is understandable by machines as well as humans [2]. The main goal of this concept is to interconnect and structure data on the World Wide Web to create an environment where programs can move between different pages to understand, process, and query existing information. The semantic web has caught the attention of many researchers ([3], [4], [5], [6]). Tim Berners-Lee introduced several principles for the semantic web concept; he defined the Resource Description Framework (RDF), a graph model to present data on the web. RDF interconnects data as triplets of a subject, predicate and object, where the subject and object are nodes and the property is an arc. These RDF elements may be a textual value or a blank node, or may be represented by Uniform Resource Identifiers (URIs) that distinguish notions and the relations that can connect them [2]. The use of RDF allows machines to understand the meaning of these notions and their linkage. This type of data is stored in special repositories called Triple Stores [7]. One other fundamental component of the Semantic Web is ontology creation. Researchers have intensely studied this concept and its application in many domains like biomedicine, network security [8], smart cities [9] and robotic applications [10]. An ontology is defined as a collection of information that describes a concept and provides its vocabulary. Ontologies are understandable by both humans and machines and allow semantic and syntactic exchange. Definite web ontology languages have been unified thanks to research in the Semantic Web, which allows to efficiently describe a domain with the use of the semantic web languages RDF Schemas (RDFS) and the Web Ontology Language (OWL). Ontologies are engineered based on the domain concepts, referred to by "classes", and the relationships between these concepts, which can be hierarchical, as subclass relationships, or predefined as properties. The models can also include constraints on the expressed information.
B. Text Clustering

Document clustering is an unsupervised learning process that separates documents into significant groups; it is one of the main techniques of text mining [11]. Document clustering designates a corpus of text documents into distinct groups so that documents within the same group depict the same subject. Clustering of documents comprises three categories: partitioning methods, agglomerative and divisive clustering. Researchers have proposed several document clustering algorithms, like K-Means, Hierarchical Agglomerative clustering and Frequent Itemsets based clustering, and more algorithms have been utilized in this learning process.

Text clustering is a dynamic field that has caught researchers' attention. The enormous textual data shared on the net is considered a bag of information and can be labeled as the raw material of knowledge. Diverse methods have been implemented to improve the extraction of valuable data from this information. Text clustering involves indexing, crawling and filtering the information. We distinguish four steps within the process of text clustering: the collection of data, then the preprocessing, the clustering, and the post-processing of the clusters. Initially, documents are collected and stored, and it is essential to preprocess all these documents to remove the noise [11] before clustering them.

The Internet is nowadays advancing from a Web of documents to a Web of Information, employing a graph-based representation and a set of basic standards known as the Linked Data Principles [12]. Statistics on the LOD Cloud (April 2017) report that over 149 billion facts are already stored as RDF triples in 9960 data sources. Hence, the number of RDF datasets distributed on the Net is continuously and rapidly expanding. A client willing to use these datasets will first have to investigate them in order to decide which data is pertinent to his particular needs. Therefore, to facilitate this interaction, a topical view of an RDF dataset can be given by applying clustering, which can be characterized as producing a set of homogeneous clusters with large intra-cluster similarity and large inter-cluster disparity [12].

C. Text Documents to RDF Triples

In order to handle documents of unstructured text data with semantic web techniques, it is obvious that the conversion of text documents to RDF triples is the major step to be done. The objective of this transformation is to change plain text into data units understandable by machines. Authors in [33] proposed an approach that converts a given text into RDF triples based on the semantic and syntactic structure of sentences. Based on this approach they built a system called T2R that creates important triples with all fundamental linguistic relations and semantic parts of the text. This approach can be used for any plain text. T2R feeds a text document into a syntactic parser using the Stanford tool and a semantic parser utilizing the Senna tool.

LODifier [34] is one of the inspiring approaches in the knowledge graph construction process to provide a tool for the conversion of texts to RDF. It is based on both deep semantic analysis and named entity recognition systems.


Based on LODifier, authors in [13] proposed a conversion of tweets into RDF triples, where tweets are assembled topic-wise by utilizing topic identification methods and shaping homogeneous clusters using the K-Means algorithm. Only the tweets containing named entities in DBpedia datasets are used. Each topic corpus is summarized, then transformed into an RDF graph utilizing the LODifier tool. In the work [14], another model that collects resume data from the internet and classifies it based on the cosine similarity measure was proposed. In this model, the data is represented using semantic web formats like RDF, based on the Protégé tool and SPARQL. Another methodology is proposed by the authors of [15] as a combination of the techniques of similarity computing, visualization procedures and the RDF query language (SPARQL) to manipulate academic contents. They also utilized the ontological model to make syllabus information understandable by both people and computers. Another framework for automatic knowledge graph (KG) extraction from unstructured text was proposed by the authors of [16] and extended in their second work [17]. They underlined two RDF extraction steps: firstly, candidate generation, focusing on the importance of mapping predicates to a referential KG to increase searchability, and secondly, a candidate selection process using pre-defined ontologies. Following the same path, [18] proposed an open-source platform for KG construction that includes graph management and downstream application support and is based on tools such as Stanford CoreNLP, Neo4j and Apache Solr.

D. Similarity Measures

The literature has many methods for computing the semantic similarity between terms. Semantic similarity measures can be classified into four categories: Edge Counting Measures, Information Content Measures, Feature Based Measures and Hybrid Methods. In the following we present the overall idea behind similarity measurement and highlight the antecedent works about semantic similarity measures and their uses.

1) Similarity measure concept: A similarity measure is a function that assigns a non-negative real number to each pair of patterns, defining a notion of resemblance, with a target range of [0,1]. Similarity measures form the basis for many pattern matching algorithms. Besides that, similarity measures compare vectors; they should be symmetrical and assign values that become larger when the vectors are similar, reaching the largest value when they are identical. Usually measured as the cosine of the angle between vectors, the so-called cosine similarity is one of the most well-known similarity measures in various information retrieval applications and in clustering as well. The Jaccard measure was introduced in [19] and is in some cases referred to as the Tanimoto coefficient; it measures the similarity between finite sample sets and is defined as the size of the intersection divided by the size of the union of the sample sets. For this measure, the Jaccard coefficient compares the members of two sets to see which members are shared and which are distinct. Most AHC (agglomerative hierarchical clustering) strategies perform their computations on such a similarity matrix and develop a hierarchical structure indicating connections or proximities among the data.
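To make these two measures concrete, the following minimal sketch computes both of them over whitespace-tokenized strings; the example documents are ours, not drawn from the cited works.

```python
# Illustrative implementations of the cosine and Jaccard similarity
# measures discussed above (example documents are hypothetical).
import math
from collections import Counter

def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Cosine of the angle between term-frequency vectors."""
    va, vb = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def jaccard_similarity(doc_a: str, doc_b: str) -> float:
    """Size of the intersection divided by the size of the union."""
    sa, sb = set(doc_a.lower().split()), set(doc_b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

print(cosine_similarity("rdf graph clustering", "semantic rdf clustering"))
print(jaccard_similarity("rdf graph clustering", "semantic rdf clustering"))
```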
2) Semantic similarity measure: Research on semantic similarity measures based on RDF data has mainly been done for the similarity measurement of RDF graphs in query matching. The goal is to extract the best matching result. A similarity measure (gSemSim) was proposed to improve ordinary similarity measures and reduce their limitations. The notable feature of this semantic similarity measure is its capacity to display more reasonable similarity between concepts from the viewpoint of domain information. Reference [20] models pairwise word interactions and presents a novel similarity focus mechanism to recognize important correspondences for better similarity estimation. These ideas are implemented in a neural network architecture that demonstrates state-of-the-art accuracy on three SemEval tasks and two answer selection tasks.

E. Semantic Text Clustering

Document clustering is one of the main techniques of text mining; it is considered an unsupervised learning process that separates documents into significant groups. It designates a corpus of text documents into distinct groups so that documents within the same group depict the same subject. Researchers have proposed several document clustering algorithms, like Hierarchical Agglomerative clustering and Frequent Itemsets based clustering, and others that are used in this learning process [21]. Traditional text clustering methods usually focus on the frequency of terms in documents to create connected homogenous clusters; however, documents can be semantically related, so these approaches produce inaccurate clustering results. The complexity of natural language results in the difficulty of achieving accurate and efficient text clustering. Researchers have made use of semantic web technologies such as ontologies to take advantage of the semantic relationship between words in clustering. Walaa K. Gad and Mohamed S. Kamel [34] proposed a semantic similarity-based model (SSBM) to handle the semantics in documents. They incorporated the use of an ontology, in their case WordNet, to obtain the semantic similarities between words, such as synonyms and hypernyms, and the document vectors are constructed based on a refined term weight that includes term frequency (TF) and inverse document frequency (IDF) and a semantic weight based on the terms' semantic relationships. Following the same path, authors in [35] applied text clustering based on the semantic body for Chinese spam mail filtering; the proposed methodology is based on lexical chains and HowNet semantic similarity to handle the words' synonyms. This technique helps to overcome challenges related to synonyms and near-synonyms by merging them. The results of the experiment were good, but the use of HowNet resulted in some limitations since it doesn't cover all possible similarities between words. [22] also presents an approach using lexical chains combined with WordNet: a WordNet-based semantic similarity measure for solving the problem of polysemy and synonymy, and lexical chains to extract a small subset of the semantic features which not only denote the topic of documents but are also advantageous to clustering.

In [15] a combination of techniques, methods, and algorithms such as cosine similarity and visualization procedures has been used for semantic text clustering; moreover, the ontological model was used to make syllabus information intelligible for both people and computers. In [12], using Candidate Descriptions (CDs) as sets of predicates, a form of RDF clustering algorithm has been developed; it used a similarity matrix which contains the pairwise similarities between CD clusters and utilized Cosine Similarity, Jaccard similarity and Sorensen-Dice to measure the similarity between CDs.

F. Topic Modeling

Topic models are unsupervised machine learning techniques used to thematically describe a set of documents; they intend to detect the group of words that characterize and describe the collection of documents. Topic modeling is among the important techniques used for the measurement of document similarity for classification [23], for clustering and cluster labeling, for summarizing documents, and more [24]. Topic modeling was first introduced for textual documents. Yet, its use for unstructured types of data such as images has been explored. In multiple studies, topic modeling has been combined with the semantic web [25] to improve the topic modeling results. However, few techniques have been provided to apply topic modeling over unstructured data. [26] proposed a framework for applying topic modeling to RDF graph data based on LDA; they highlighted some of the major challenges in using topic modeling over RDF data. These challenges are related to the sparseness and the unnatural language of the RDF graphs, and they gave some methods to tackle them. In [27] a method to profile RDF datasets based on knowledge-based topic modeling techniques is presented, with the goal of describing the content of the datasets; the extracted representative topics for the RDF dataset are annotated with Wikipedia categories. Knowledge-based topic modeling was earlier used for entity summarization in [36] through a probabilistic model called ES-LDA, in which a modified version of the LDA algorithm handles the challenges of working with the RDF model. The model uses prior knowledge for statistical learning techniques to create representative summaries for large semantic web documents in order to facilitate the use of semantic web entities.

The approaches presented above have provided valuable tools for the manipulation of textual documents, such as text clustering, information retrieval, topic identification, etc. Still, the advantages of linked data are captivating. Therefore, semantic data manipulation based on a semantic web model needs to be explored and strongly highlighted. This work aims to take advantage of most of the semantic web techniques' benefits and to present an overall framework for semantic text clustering based on RDF data that is more efficient than these approaches.

III. METHODOLOGY

Text is considered the essential and most utilized representation of data, and numerous investigations and strategies have been examined to improve knowledge discovery based on textual information. The aim behind transforming textual data into an RDF model is to make it understandable by both humans and machines. The transformation should take into consideration the syntactic and semantic relations between terms. The goal is to analyze, summarize, and extract information from this data. All these tasks require a profound understanding of the basic structures and semantics of the documents. Exploring large amounts of data in order to retrieve relevant information can be a frustrating task; this can be addressed based on the RDF framework and associated techniques such as SPARQL for querying the data, and RDF Schema (RDFS) and the Web Ontology Language (OWL) to apply reasoning and inference support on the data.

Furthermore, clustering methods improve these interactions in order to provide better results, and clustering is considered the pillar of other knowledge discovery tasks such as summarization and visualization. Classic clustering methods ignore the semantics between words; generally, documents are considered as a bag of words, and these methods do not make use of the relations that may exist between the words. Words can have multiple meanings depending on the context they are used in. Therefore, separating words from their context can lead to a misunderstanding of the words. The use of RDF based models for textual document clustering is a step toward preserving semantics in documents and providing efficient clustering with better accuracy.

In this sense, and as previously mentioned, this work aims to take advantage of most of the semantic web techniques' benefits. Therefore, it proposes a graph data model for clustering and mining text documents.
The model is based on the use of the semantic web technology RDF to represent the given information, and on SPARQL, RDFS and OWL to use a reasoning engine in order to retrieve information from it.

The proposed methodology starts with the extraction of an RDF representation from textual documents based on the semantic and syntactic nature of sentences, and then several mining techniques are introduced. This methodology focuses on clustering, based on a proposed similarity measure over the RDF graphs, in order to group related documents in homogenous clusters, and on a topic modeling process to extract the underlying topics presented in the documents. Finally, an inferencing model is introduced based on the RDFS and OWL languages in order to extract more facts from the data.

Fig. 1 summarizes the proposed framework, whose main components are explained in the subsequent subsections.

A. Extract RDF from Data

The first step toward our semantic-based clustering and document querying system is to transform the unstructured textual documents into an RDF triples representation. As previously discussed in the above section, there have been many studies tackling the transition from text to RDF triples. The main goal of this transformation is to switch from textual sentences that are understandable by humans only to interconnected information intelligible by both humans and machines.


Fig. 1. Semantic-based Text Clustering Framework.

An RDF graph G is defined as a collection of statements. A statement is a triple (t) representing the relationship, named predicate (p), which is generally presented as a URI, between a subject (s) and an object (o), in the form t=(s,p,o). A subject can be either a resource (URI or IRI) or an abstract identifier (blank node), while an object can be denoted as a resource, a basic string (Literal) or a blank node.

Overall, the task of text transformation to RDF is an iterative process that consists of converting each sentence into RDF triples according to its semantic and syntactic form. This process can be represented by the schema given in Fig. 2.

The preprocessing phase is vital before addressing the triples extraction. Usually, the sentence parsers that can be used for the triples extraction cannot handle some special-case words, such as capitalized names and multi-word names, which are treated as independent words, as well as the ambiguities in distinguishing named entities. These issues can be handled during the preprocessing phase to prepare the sentences for the RDF extraction. Another cause for concern is multi-clause sentences: processing such sentences directly can lead to a shortened or false representation of the real meaning of the sentence. In this case, multi-clause sentences should be split into single-clause ones while maintaining the semantics of the original text.

For each extracted single-clause sentence, the Stanford parser [28] can be used to analyze the grammatical structure of sentences, and in particular to identify the subject or object of a verb, in order to represent sentences in the form of the triple (subject, verb, object). The Senna parser [29] can also be used to enrich the RDF extraction. Senna provides useful information by allowing named entity recognition: it allows labeling of the named entities with given categories such as organization, monetary value, person, etc. It can as well be used to enhance the discovery of the semantic sense of words in a given sentence.

Fig. 2. RDF Extraction Process.

Now that the triplets have been discovered, it is time to map each extracted entity and predicate to its Uniform Resource Identifier (URI). DBpedia is a data set powered by Wikipedia articles that relates an entity to a Wikipedia article and provides a URI to identify it. In the mapping of RDF triples, we use URIs provided by DBpedia. The use of DBpedia is due to the richness of the subjects' details and the multilanguage descriptions provided, as well as the continuous updates of the data sets. The identification of the most relevant meaning of words can be done based on the synsets provided by WordNet, which is a large-scale lexical database for English. After identifying the most relevant meaning of an entity based on its context and its syntactic and semantic role, it can be associated with its DBpedia URI, or its WordNet URI if one exists. The named entity recognition tool Wikifier [30] can be used to obtain links to the Wikipedia articles associated with the named entities.

The following example illustrates the transformation from an unstructured text to an RDF graph. Consider the sentence: "The WHO declared Covid-19 a pandemic". The Stanford parser enables the tagging of this sentence (Table I). Using WordNet to discover the appropriate sense of the words and to select the named entity that corresponds to each of them in order to assign its DBpedia URIs/IRIs, Fig. 3 represents an RDF graphic representation of the sentence.
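As a minimal sketch of the end product of this step, the triple for the example sentence can be built with rdflib; here the triple is written by hand rather than produced by a parser, the subject and object URIs come from DBpedia, and the predicate namespace is a made-up placeholder since the framework does not fix a predicate vocabulary.

```python
# Minimal sketch: the RDF triple for "The WHO declared Covid-19 a
# pandemic", built by hand with rdflib. Subject/object URIs are
# DBpedia resources; the predicate namespace is hypothetical.
from rdflib import Graph, Namespace

DBR = Namespace("http://dbpedia.org/resource/")
EX = Namespace("http://example.org/predicate/")   # hypothetical namespace

g = Graph()
g.bind("dbr", DBR)

g.add((
    DBR["World_Health_Organization"],   # "WHO", resolved via DBpedia
    EX["declared"],                     # verb lifted from the parse
    DBR["Pandemic"],                    # "pandemic", resolved via DBpedia
))

print(g.serialize(format="turtle"))
```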


TABLE I. PART OF WORDS TAGGING USING STANFORD PARSER

    #  Word      POS      #  Word      POS
    1  The       DT       4  COVID-19  VBN
    2  WHO       NN       5  A         DT
    3  declared  VBN      6  Pandemic  NN

Fig. 3. Example of RDF Schema.

The extracted sets of triplets of each document not only allow the comprehension of documents by machines, but will also be used in more sophisticated tasks such as document clustering, query answering and text summarization.

B. RDF based Clustering

After retrieving the RDF triples for each text document we can proceed to the clustering of these documents. The clustering process consists of grouping documents in related clusters based on the similarity between them. In this case, the documents are represented using RDF graphs. The RDF triples in the graphs correspond to the document sentences, where each subject, object, or predicate is identified using a URI from the DBpedia datasets. In traditional clustering, the similarity between documents is calculated based on the text words considered as independent items. The use of the RDF representation allows the incorporation of the semantic relationships of terms. However, the key to an efficient clustering is the use of a similarity measure that results in better matching between documents, not only based on relevant words with the highest strength or occurrence frequency in the documents as feature words for clustering, but taking into consideration the semantic relationships between words and between documents. The extracted RDF graphs are loaded with semantic and syntactic information about the texts.

Our goal is to put forward a semantic similarity measure based on the RDF model with the exploitation of the semantic web tools. To tackle this issue, a similarity function based on RDF graph matching is set up to compute the similarities between documents.

As earlier discussed, textual unstructured data is transformed into RDF graphs; the matching of two RDF graphs consists of the matching of their unitary elements, which are the RDF triples, and precisely it is about the calculation of the similarity between the triples' subjects, predicates, and objects. Graph matching algorithms for RDF have been used for the matching of RDF queries and RDF graphs in order to implement searching processes or to put in place Linked Data recommendation processes. This study introduces a graph matching algorithm to calculate the similarity between the RDF graphs corresponding to a documents' dataset in order to perform clustering over these documents. In the following, we introduce and discuss the RDF graph distance computation in the clustering process.

1) Similarity computing between documents and clustering: We assume in the following that the previous step of RDF extraction is completed and that each document is represented as an RDF graph, where every RDF graph consists of a list of RDF statements. The matching of two graphs can be translated to an assignment problem that is solved using the Hungarian matching algorithm over a bipartite graph. The vertices of the bipartite graph are the triples of the two RDF graphs, each RDF graph's set of triples being considered as an independent part of the bipartite graph, whereas the weight of an edge is the similarity measure between the triples it connects.

Considering two documents D1 and D2, we represent these documents by two RDF graphs G1 and G2 respectively, and we define the bipartite graph BG as BG:=(U,V,E), U and V being the parts of BG's partition, such that U is the set of nodes related to the triples of document D1, and V represents the triples of the second document D2. Moreover, E states the edges of the graph, and the weight of these edges is the computed similarity measure between the nodes they connect. The Hungarian matching algorithm is used over BG to find the maximum similarity matching between the pairs of triples represented by U and V. Based on the matching result, the overall similarity measure can be computed between the two RDF graphs. This ability to measure the similarity between a pair of documents yields a similarity matrix based on the computed results. The computed similarity matrices can be used in multiple clustering techniques to provide homogenous clusters. This proposed framework therefore allows introducing an agglomerative hierarchical clustering to extract the clusters, as sketched below.
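The following sketch is our illustration of this step, not the paper's published implementation: scipy's linear_sum_assignment implements the Hungarian method, a triple-level similarity function sim_t (developed in the next subsection) is assumed to be given, and normalizing by the larger triple set plus the average-linkage choice are our assumptions.

```python
# Sketch: document similarity by Hungarian matching of triple sets,
# then agglomerative clustering over the resulting similarity matrix.
# sim_t(t1, t2) is assumed to return a similarity in [0, 1] between
# two triples (see the next subsection).
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def graph_similarity(triples_u, triples_v, sim_t):
    """Maximum-similarity matching between the triples of two documents."""
    w = np.array([[sim_t(tu, tv) for tv in triples_v] for tu in triples_u])
    rows, cols = linear_sum_assignment(-w)   # Hungarian method, maximizing
    # Normalize by the larger graph so unmatched triples lower the score.
    return w[rows, cols].sum() / max(len(triples_u), len(triples_v))

def cluster_documents(doc_triples, sim_t, n_clusters=3):
    """Pairwise similarity matrix -> average-linkage hierarchical clusters."""
    n = len(doc_triples)
    sim = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            sim[i, j] = sim[j, i] = graph_similarity(
                doc_triples[i], doc_triples[j], sim_t)
    dist = squareform(1.0 - sim, checks=False)   # condensed distances
    tree = linkage(dist, method="average")       # agglomerative hierarchy
    return fcluster(tree, t=n_clusters, criterion="maxclust")
```

In this sketch the dissimilarity 1 - sim feeds a standard agglomerative hierarchical clustering, matching the hierarchical approach the framework proposes; other clustering techniques could consume the same similarity matrix.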
The following subsection discusses how we can obtain the similarity measure between a pair of triples.

2) Similarity computing between triples: As follows from the previous section, computing the similarity between triples is one of the most crucial tasks to put in place in the clustering process, since it can impact the clustering efficiency. As already mentioned, several methods consider the text as a dissociated bag of words, ignoring the semantics in texts. This is what we aim to tackle by handling unstructured textual data within the context of an RDF model in order to preserve the semantic relationships in the text, linking it to an important knowledge base, in our case DBpedia, and furthermore including a semantic similarity measure to compute the distance between RDF triples.


In order to compute the similarities between documents we need to measure the unitary similarity of the triple pairs. One of the major advantages of the RDF representation of unstructured textual documents is the ability to utilize the data values of resources to calculate the resources' similarity scores.

Considering two triples t1=(s1, p1, o1) and t2=(s2, p2, o2), the similarity between t1 and t2, Sim_t(t1, t2), is related to the similarities between the subjects, Sim_s(s1,s2), between the predicates, Sim_p(p1,p2), and between the objects, Sim_o(o1,o2). Firstly, to compare words, it is essential to use a linguistic similarity measure based on a reliable source such as WordNet. Several researchers have tackled the use of WordNet in similarity computing based on the provided synsets, and one of the most used formulae is Lin's similarity. Secondly, in the case of a string value, or when the word cannot be found in the WordNet database, we can proceed to a string similarity measurement such as the normalized compression distance or the Levenshtein distance [31]. Finally, for the URI form of data, if the corresponding value can be matched to a WordNet word then we can use the linguistic computing method; if not, the string similarity measurement can be used instead. The triple's object similarity is handled in the same way as the subject. As for the predicate, we can consider the fact that if two triples' subjects (s1, s2) and objects (o1, o2) are similar, then it is very likely that the predicates (p1, p2) are also similar; otherwise, the linguistic and string similarity computing methods can be used.
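A sketch of such a triple-level measure follows, assuming NLTK's WordNet with the Brown information-content file for Lin's similarity and a normalized Levenshtein ratio as the string fallback; the equal weighting of the subject, predicate and object similarities is our illustrative choice, as the paper does not fix an aggregation formula, and the triples are given here as plain labels rather than URIs.

```python
# Sketch: similarity between two triples t=(s,p,o), combining WordNet
# Lin similarity (when both terms are in WordNet) with a normalized
# Levenshtein similarity as the string fallback. Equal weighting of
# subject, predicate and object is an illustrative assumption.
from nltk.corpus import wordnet as wn, wordnet_ic
# requires: nltk.download('wordnet'); nltk.download('wordnet_ic')

brown_ic = wordnet_ic.ic("ic-brown.dat")

def levenshtein_sim(a: str, b: str) -> float:
    """1 - (edit distance / length of the longer string)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return 1.0 - prev[-1] / max(len(a), len(b), 1)

def term_sim(a: str, b: str) -> float:
    sa, sb = wn.synsets(a, pos=wn.NOUN), wn.synsets(b, pos=wn.NOUN)
    if sa and sb:   # linguistic measure: Lin's similarity over synsets
        return max(x.lin_similarity(y, brown_ic) or 0.0 for x in sa for y in sb)
    return levenshtein_sim(a.lower(), b.lower())   # string fallback

def sim_t(t1, t2) -> float:
    (s1, p1, o1), (s2, p2, o2) = t1, t2
    return (term_sim(s1, s2) + term_sim(p1, p2) + term_sim(o1, o2)) / 3.0

print(sim_t(("WHO", "declared", "pandemic"),
            ("organization", "announced", "epidemic")))
```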
C. Topic Modeling

The clustering goal is to divide a collection of text documents into different category groups so that documents in the same category group describe the same topic. One of the major challenges related to clustering is cluster labeling and how to provide a clear description of the clusters. Therefore, topic models can be used for identifying and describing the constructed clusters. In this case, clustering allows the inference of more coherent topics. A topic is a group of words that summarizes and refers to the content of the cluster. The identification of the cluster's topic allows a preview of its content and eases the searching process. Topic models were first introduced for text documents and are easily adaptable to the case of RDF graphs. In [26] an approach to using topic modeling with RDF data was proposed using Latent Dirichlet Allocation (LDA), which is a commonly used model to identify the topics of documents. LDA aims to extract thematic information from document collections and is based on the bag of words as the vocabulary extracted from these documents. In our use case, the documents are the RDF graphs and the words are the words extracted from the graphs' triples. [26] introduced several limitations and challenges of using topic modeling for RDF graphs. Firstly, the sparseness of RDF data, which means that even with large datasets, the preprocessing of this data could result in a restricted set of words usable as a bag of words. Secondly, the lack of context, since the used words can have several meanings; in the case of RDF data, the context is hard to determine due to the unnatural language of the data and/or the sparseness of the RDF graphs. The unnatural language is related to the graph representation, unlike the sentence representation that enhances the understanding of the words. Finally, the short text problem, which can be handled either by text supplementing or by providing a modified version of the LDA algorithm. However, the strengths of our RDF graph representation process help in overcoming these challenges. Since the textual documents were converted to RDF graphs, the use of semantic and syntactic parsing tools and the introduction of DBpedia and WordNet synsets for efficient entity recognition based on the context of the sentence help to tackle the issues related to the unnatural language nature of RDF, since the identified entities are based on the context of the documents in order to identify the most relevant meaning of an entity. On the other hand, the RDF graph is enhanced with semantic role relations, and the DBpedia classes allow overcoming the likely sparseness of RDF and the text shortness problems.
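As an illustration of this adaptation, the sketch below runs scikit-learn's LatentDirichletAllocation over bags of words; the per-cluster word lists here are hypothetical, whereas in the framework they would be the labels of the triples in each cluster's RDF graphs, and the parameter values are ours.

```python
# Sketch: LDA over the words extracted from each cluster's RDF triples.
# The documents below are hypothetical bags of triple labels.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

cluster_docs = [
    "who declared pandemic virus outbreak health",
    "ontology rdf triple graph semantic query",
    "virus vaccine health outbreak who pandemic",
]

vec = CountVectorizer()
bow = vec.fit_transform(cluster_docs)                 # bag-of-words matrix
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(bow)

terms = vec.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[::-1][:4]]
    print(f"topic {k}: {', '.join(top)}")             # candidate cluster labels
```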
D. Summarizing Clusters and Questioning System based on RDF Clustered Data

In order to enhance the exploration of RDF data, and due to the big amount of RDF data and its complexity, RDF summarization was introduced to assist the understanding and use of this type of data. Summarization aims to provide brief, concise, and significant information. Our framework's goal is to make better use of textual documents through a semantic text clustering system. Therefore, the use of RDF summarization techniques in the proposed framework is guided by the intention to improve the querying of the handled datasets. Hence, it is important to assist the querying system based on RDF data, since the extracted RDF graph clusters can be significantly large, resulting in a querying process that is extremely expensive with regard to resources and time.

Summarization can be used for various purposes or applications, such as ontology extraction from RDF graphs, assisting users by providing graph visualization, and improving the querying process. In our case, we are basically interested in the applications related to the advancement of the querying task in many ways. In particular, in indexing, summary graphs are seen as an index for the larger RDF graph. In this case, a query is initially matched against the summary graph to find the matching index nodes, and then the original graph is explored after detecting the matching nodes. This process thus reduces the computation time and improves the querying task. It should be noted that a summary will also help identify the best matching data partition on which to apply the query when working with distributed systems.

There are multiple RDF summarization approaches; some include ontologies to handle the summarization of an RDF graph, and others ignore their use and only work on the bare RDF graph. Based on some recent reviews such as [32], an RDF summary can either be compact information that contains the major meanings of the graph, or can be a graph that is exploitable rather than the massive original graph. In [32] the existing summarization approaches are classified based on multiple criteria; among others, we cite the input and output type, the purpose, and the methods, which can be structural, statistical, pattern-mining, or hybrid. In order to reach our goal, we opt for structural quotient summaries for their wide applicability in the indexing and query answering tasks. Quotient summarization graphs are summary graphs where each summary node is connected to the IDs of the original graph nodes. These graphs can be obtained based on equivalence relations such as bi-similarity.
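As a minimal illustration of the questioning step, the following sketch queries an rdflib graph with SPARQL, reusing the hypothetical predicate namespace from the extraction example in Section III.A; a production system would run such queries against the (summary-indexed) clustered triple store instead of an in-memory graph.

```python
# Sketch: querying the clustered RDF data with SPARQL via rdflib.
# Reuses the hypothetical predicate namespace from the earlier example.
from rdflib import Graph, Namespace

DBR = Namespace("http://dbpedia.org/resource/")
EX = Namespace("http://example.org/predicate/")

g = Graph()
g.add((DBR["World_Health_Organization"], EX["declared"], DBR["Pandemic"]))

# "What did the WHO declare?"
results = g.query(
    """
    SELECT ?o WHERE {
        <http://dbpedia.org/resource/World_Health_Organization>
            <http://example.org/predicate/declared> ?o .
    }
    """
)
for row in results:
    print(row.o)   # -> http://dbpedia.org/resource/Pandemic
```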
E. Reasoning using Jena Inference Engine

RDF Schema (RDFS) allows defining and organizing RDF data vocabularies. In RDFS, the relationships between properties and resources are defined using RDF, which offers a typing arrangement for RDF models. These relations are hierarchical, like the notion of classes, subclasses, and properties; a huge number of links between elements can be identified by specifying the properties of classes and the inheritance between classes. Therefore, RDF objects are considered as instances of one or many classes and are specified with the class properties and parent class specifications. Many projects have incorporated the use of the RDF(S) representation format, such as Protégé and Mozilla. The Web Ontology Language enhances the description of properties and classes by providing an extended vocabulary. It allows, for example, expressing the cardinality of relations between classes, and it offers other assets such as equality and symmetry of properties, and so on [1]. These OWL characteristics result in more detailed ontologies, allowing high performance in document reasoning tasks. As we already mentioned, the semantic web concept is all about allowing both humans and machines to understand data; therefore data should be presented in a well-structured form and rules should be provided in a well-defined language in order to implement a reasoning process and allow data to be shared on the web [2]. Logically, notions are inferred from ontologies if they conform to their associated models; this process is referred to as reasoning. The semantic-based clustering framework proposed in this work can be enhanced by including a reasoning layer that allows deriving additional truths from the RDF graphs. Tools such as the Jena framework have been widely used to extract data from RDF graphs and OWL ontologies.
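The Jena framework is Java-based; as a Python illustration of the same RDFS entailment idea, the owlrl library can expand an rdflib graph with the triples an RDFS reasoner would derive. The tiny class hierarchy below is a made-up example, not part of the framework's data.

```python
# Sketch: RDFS reasoning over an rdflib graph using the owlrl library
# (a Python stand-in for the Jena inference engine mentioned above).
# The small class hierarchy is a hypothetical example.
import owlrl
from rdflib import Graph, Namespace, RDF, RDFS

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.Pandemic, RDFS.subClassOf, EX.HealthCrisis))  # schema triple
g.add((EX.Covid19, RDF.type, EX.Pandemic))              # instance triple

owlrl.DeductiveClosure(owlrl.RDFS_Semantics).expand(g)  # RDFS entailment

# The reasoner derives the new fact (Covid19, type, HealthCrisis).
print((EX.Covid19, RDF.type, EX.HealthCrisis) in g)     # -> True
```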
IV. CONCLUSION

This paper showed how semantic web techniques can be used for textual document clustering and exploration. It underlined some of the existing works on manipulating RDF data and got inspired by these to present a connected pipeline of semantic processes for semantic text clustering based on RDF. The main contribution consists in presenting an overall framework for semantic text clustering based on RDF data modeling. This framework combines multiple techniques in order to get an efficient and accurate system, allowing the exploration of textual documents using machine learning techniques combined with semantic web principles. The system allows document RDF representation, clustering, topic modeling and cluster summarizing, as well as information retrieval using both RDF querying and reasoning tools. The aim is to take advantage of the semantic web in order to enhance the exploration of documents and enhance the use of semantics along the whole process. In future work we intend to validate our framework and improve it by choosing the most relevant tools and techniques in each of the framework's steps. An experiment with the proposed approach on a real dataset will be further conducted.
REFERENCES

[1] P. Hitzler, M. Krötzsch, B. Parsia, P. F. Patel-Schneider, and S. Rudolph, "OWL 2 web ontology language primer," https://www.w3.org/TR/owl-primer/, 2012.
[2] T. Heath and C. Bizer, Linked Data: Evolving the Web into a Global Data Space, 1st ed., Synthesis Lectures on the Semantic Web: Theory and Technology, 1:1, pp. 1–136, Morgan & Claypool, 2011.
[3] M. Noura, A. Gyrard, S. Heil, and M. Gaedke, "Automatic knowledge extraction to build semantic web of things applications," IEEE Internet of Things J., vol. 6, no. 5, pp. 8447–8454, Oct. 2019.
[4] P. Ristoski and H. Paulheim, "Semantic Web in data mining and knowledge discovery: A comprehensive survey," J. of Web Semantics, vol. 36, pp. 1–22, Jan. 2016.
[5] A. Gangemi, V. Presutti, D. Reforgiato Recupero, A. G. Nuzzolese, F. Draicchio, and M. Mongiovì, "Semantic web machine reading with FRED," Semantic Web, vol. 8, no. 6, pp. 873–893, Aug. 2017.
[6] P. Pauwels, S. Zhang, and Y. C. Lee, "Semantic web technologies in AEC industry: A literature overview," Automation in Construction, vol. 73, pp. 145–165, Jan. 2017.
[7] M. Trianes Torres, A. Sánchez Sánchez, M. Blanca Mena, and J. García, "Competencia social en alumnos con necesidades educativas especiales: nivel de inteligencia, edad y género," Rev. Psicol. Gen. y Apl., vol. 56, no. 3, pp. 325–338, 2003.
[8] G. Xu, Y. Cao, Y. Ren, X. Li, and Z. Feng, "Network security situation awareness based on semantic ontology and user-defined rules for Internet of Things," IEEE Access, vol. 5, pp. 21046–21056, Aug. 2017.
[9] N. Komninos, C. Bratsas, C. Kakderi, and P. Tsarchopoulos, "Smart city ontologies: improving the effectiveness of smart city applications," J. Smart Cities, vol. 1, no. 1, pp. 31–46, Nov. 2016.
[10] G. Sarthou, R. Alami, and A. Clodic, "Semantic Spatial Representation: a unique representation of an environment based on an ontology for robotic applications," in Proc. Combined Workshop on Spatial Language Understanding and Grounded Communication for Robotics (SpLU-RoboNLP), pp. 50–60, Association for Computational Linguistics, 2019.
[11] G. S. Reddy, T. V. Rajinikanth, and A. A. Rao, "A frequent term based text clustering approach using novel similarity measure," in Proc. 2014 IEEE Int. Advance Computing Conf. (IACC), pp. 495–499, 2014.
[12] S. Eddamiri, E. M. Zemmouri, and A. Benghabrit, "An improved RDF data clustering algorithm," Procedia Comput. Sci., vol. 148, pp. 208–217, 2019.
[13] M. Achichi, Z. Bellahsene, D. Ienco, and K. Todorov, "Towards Linked Data extraction from tweets," https://hal.archives-ouvertes.fr/hal-01411403/document, 2016.
[14] A. M. Abirami, A. Askarunisa, R. Sangeetha, C. Padmavathi, and M. Priya, "Ontology based ranking of documents using graph databases: a Big Data approach," https://www.amrita.edu/icdcn/sangeetha-r.pdf, 2014.
[15] V. Saquicela, F. Baculima, G. Orellana, N. Piedra, M. Orellana, and M. Espinoza, "Similarity detection among academic contents through semantic technologies and text mining," in IWSW, pp. 1–12, 2018.
[16] N. Kertkeidkachorn and R. Ichise, "T2KG: An end-to-end system for creating knowledge graph from unstructured text," AAAI Workshops, Tech. Rep. WS-17-01, pp. 743–749, 2017.
[17] N. Kertkeidkachorn and R. Ichise, "An automatic knowledge graph creation framework from natural language text," IEICE Trans. Inf. Syst., vol. E101.D, no. 1, pp. 90–98, 2018.
[18] R. Clancy, I. F. Ilyas, and J. Lin, "Knowledge graph construction from unstructured text with applications to fact verification and beyond," in Proc. Second Workshop on Fact Extraction and VERification (FEVER), pp. 39–46, Hong Kong, Nov. 2019, Association for Computational Linguistics.
[19] T. Tran, H. Wang, and P. Haase, "Hermes: Data Web search on a pay-as-you-go integration infrastructure," J. Web Semant., vol. 7, no. 3, pp. 189–203, 2009.
[20] H. He and J. Lin, "Pairwise word interaction modeling with deep neural networks for semantic similarity measurement," in Proc. NAACL HLT 2016, pp. 937–948, 2016.


[21] H. Patil and R. S. Thakur, "Frequent term-based text clustering using hidden support," in Proc. Int. Conf. on Recent Advancement on Computer and Communication, B. Tiwari et al. (Eds.), Lecture Notes in Networks and Systems, vol. 34, Springer Singapore, 2018.
[22] T. Wei, Y. Lu, H. Chang, Q. Zhou, and X. Bao, "A semantic approach for text clustering using WordNet and lexical chains," Expert Syst. Appl., vol. 42, no. 4, pp. 2264–2275, 2015.
[23] V. S. Anoop, S. Asharaf, and P. Deepak, "Topic modeling for unsupervised concept extraction and document ranking," Adv. Intell. Syst. Comput., vol. 683, pp. 123–135, 2018.
[24] L. Liu, L. Tang, W. Dong, S. Yao, and W. Zhou, "An overview of topic modeling and its current applications in bioinformatics," SpringerPlus, vol. 5, no. 1, 2016.
[25] L. Yao et al., "Incorporating knowledge graph embeddings into topic modeling," in Proc. 31st AAAI Conf. on Artificial Intelligence (AAAI 2017), pp. 3119–3126, 2017.
[26] J. Sleeman, T. Finin, and A. Joshi, "Topic modeling for RDF graphs," CEUR Workshop Proc., vol. 1467, pp. 48–62, 2015.
[27] S. Pouriyeh, M. Allahyari, G. Cheng, H. R. Arabnia, K. Kochut, and M. Atzori, "R-LDA: Profiling RDF datasets using knowledge-based topic modeling," in Proc. 13th IEEE Int. Conf. on Semantic Computing (ICSC 2019), pp. 146–149, 2019.
[28] "The Stanford Natural Language Processing Group," https://nlp.stanford.edu/software/lex-parser.shtml.
[29] "SENNA," https://ronan.collobert.com/senna/.
[30] J. Brank, G. Leban, and M. Grobelnik, "Annotating documents with relevant Wikipedia concepts," in Proc. Slovenian Conf. on Data Mining and Data Warehouses, 2017.
[31] K. Al-Khamaiseh, "A survey of string matching algorithms," Int. J. of Engineering Research and Applications, vol. 4, no. 7 (version 2), pp. 144–156, 2014.
[32] Š. Čebirić, F. Goasdoué, H. Kondylakis, D. Kotzinos, I. Manolescu, G. Troullinou, and M. Zneika, "Summarizing semantic graphs: a survey," The VLDB J., vol. 28, no. 3, pp. 295–327, 2019.
[33] K. Hassanzadeh, M. Reformat, W. Pedrycz, I. Jamal, and J. Berezowski, "T2R: System for converting textual documents into RDF triples," in Proc. IEEE/WIC/ACM Int. Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT), 2013.
[34] I. Augenstein, S. Padó, and S. Rudolph, "LODifier: Generating linked data from unstructured text," in ESWC, pp. 210–224, 2012.
[35] Q.-Y. Zhang, P. Wang, and H.-J. Yang, "Applications of text clustering based on semantic body for Chinese spam filtering," J. of Computers, vol. 7, no. 11, pp. 2612–2616, 2012.
[36] S. Pouriyeh, M. Allahyari, K. Kochut, G. Cheng, and H. R. Arabnia, "ES-LDA: Entity summarization using knowledge-based topic modeling," in Proc. Eighth Int. Joint Conf. on Natural Language Processing (Volume 1: Long Papers), pp. 316–325, 2017.
