Unsupervised Domain Ontology Learning from Text
V. Sree Harissh, M. Vignesh, U. Kodaikkaavirinaadan, and T. V. Geetha

Abstract: Construction of ontologies is indispensable given the rapid increase in textual information. Much research in ontology learning is supervised and requires manually annotated resources. Also, the quality of an ontology depends on the quality of the corpus, which may not be readily available. To tackle these problems, we present an iterative focused web crawler for building the corpus and an unsupervised framework for constructing a domain ontology. The proposed framework consists of five phases: Corpus Collection using iterative focused crawling with a novel weighting measure, Term Extraction using the HITS algorithm, Taxonomic Relation Extraction using Hearst and morpho-syntactic patterns, Non-Taxonomic Relation Extraction using association rule mining, and Domain Ontology Building. Evaluation results show that the proposed crawler outperforms traditional crawling techniques, that the extracted domain terms show higher precision than those obtained with statistical techniques, and that the learnt ontology has a rich knowledge representation.

Index Terms: Iterative focused crawling, domain ontology, domain terms extraction, taxonomy, non taxonomy.

Manuscript received on December 21, 2016, accepted for publication on June 18, 2017, published on June 30, 2018. The authors are with Department of Computer Science and Engineering, College of Engineering, Guindy, India (e-mail: fvharissh14,vigneshmohanceg,[email protected], tv [email protected]). POLIBITS, vol. 57, 2018, pp. 59–66. ISSN 2395-8618. https://doi.org/10.17562/PB-57-6

I. INTRODUCTION

Ontology in computer science can be viewed as a formal representation of knowledge pertaining to a particular domain [1]. In simpler terms, an ontology provides concepts and the relationships among concepts in a domain. Machines perceive the contents of documents (blogs, articles, web pages, forums, scientific research papers, e-books, etc.) as sequences of characters. Much of the semantic information is already encoded in some form or other in these documents. There is an increasing demand to convert this unstructured information into structured information. Ontology plays a key role in representing the knowledge hidden in these texts and making it understandable to both humans and computers.

Construction of a domain ontology provides various semantic solutions, including: (1) Knowledge Management, (2) Knowledge Sharing, (3) Knowledge Organization, and (4) Knowledge Enrichment.

It can be effectively used in semantic computing applications such as Expert Systems [2], Search Engines [3], and Question Answering Systems [4] to solve day-to-day problems. For example, if a search engine is aware that “prokaryote” is a type of organism, better search results can be obtained and the recall of the system improves accordingly.

Ontologies are generally built under the supervision of domain experts, which is a time-intensive process. The corpus required for building an ontology is not always readily available. Therefore, it is important to build the corpus from the web through crawling. Very little work has incorporated crawling as a phase for collecting the corpus when building an ontology. Since general crawling does not always provide domain-related pages, many irrelevant pages are downloaded and filtering is required. Terms extracted using statistical measures or linguistic patterns are prone to noise and require an additional level of filtering using machine learning techniques. Also, most systems rely on manually annotated resources for obtaining terms and for relation discovery. These resources, however, mostly contain domain-generic concepts and lack domain-specific concepts and relations [1]. Ontologies extracted using lexico-syntactic patterns are limited to certain patterns and require enrichment.

In this work, we propose a framework for crawling websites relevant to the domain of interest and for building a domain ontology in an unsupervised manner, without the use of any annotated resource. The crawling framework uses a novel weighting measure to rank the domain terms. The proposed framework consists of five phases: Corpus Collection, Term Extraction, Taxonomic Relation Extraction, Non-Taxonomic Relation Extraction, and Domain Ontology Building. The corpus is crawled using an iterative focused web crawler, which downloads content pertinent to the domain by selectively rejecting URLs based on the link, anchor text, and link context. Terms are extracted by feeding the graph-based HITS algorithm with Shallow Semantic Relations, together with the proposed use of adjective modifiers to obtain fine-grained domain terms. Hearst patterns and morpho-syntactic patterns are extracted to build taxonomies. Non-taxonomic relations are extracted through association rule mining on triples.

The paper is organized as follows: Section II describes related work, Section III the system design, Section IV the results and evaluation, Section V the conclusion, and Section VI future work.

II. RELATED WORK

In this section, we survey the literature on Corpus Collection, Term Extraction, Taxonomic Relation Extraction, and Non-Taxonomic Relation Extraction.

A. Domain Corpus Collection

A domain corpus is a coherent collection of domain text. Building one requires an iterative focused, or topical, web crawler to fetch the pages that are pertinent to the domain of interest. In the work proposed in [5], a heuristic-based approach is used to locate anchor text using the DOM tree instead of the entire HTML page. A statistical term-weighting measure based on TF-IDF, called TFIPNDF (Term Frequency Inverse Positive Negative Document Frequency), was proposed for weighting anchor text and link context. Pages are classified as relevant or not relevant by a trained classifier, so the approach is entirely supervised. The work, however, lacks iterative learning of terms to classify pages [6].

B. Domain Term Extraction

Domain terms are the elementary components used to represent the concepts of a domain. Examples of domain terms pertaining to the agricultural domain are “farming”, “crops”, “plants”, “fertilizers”, etc. Term extraction is generally performed on a collection of domain documents using any of the following methods: Statistical Measure, Linguistic Measure, Machine Learning, and Graph-Based Measure.

1) Statistical Measure: The most common statistical measures make use of TF (Term Frequency) and IDF (Inverse Document Frequency). Meijer et al. [7] proposed four measures, namely Domain Pertinence, Lexical Cohesion, Domain Consensus, and Structural Relevance, to compute the importance of terms in a domain. Drymonas et al. [8] used C/NC values to calculate the relevance of multiword terms in a corpus. These measures, however, fail to consider the context of terms and to capture the importance of infrequent domain terms.

2) Linguistic Measure: Linguistic measures traditionally acquire terms using syntactic patterns such as Noun-Noun, Adjective-Noun, etc. For example, POS tagging of the sentence “Western Rajasthan and northern Gujarat are included in this region” tags “Western” as an adjective and “Rajasthan” as a noun. Lexico-syntactic patterns make use of predefined patterns such as “including”, “like”, “such as”, etc., to extract terms. It is, however, tedious and time-consuming to predefine patterns.

4) Graph Based Measure: Graph-based measures are used to model the importance of a term and the relationships between terms in an effective way. The survey of graph methods by Beliga et al. [10] suggests that graphs can be used to represent co-occurrence relations, semantic relations, syntactic relations, and other relations (intersecting words from a sentence, paragraph, etc.). Ventura et al. [11] used a novel graph-based ranking method called “Terminology Ranking Based on Graph Information” to rank terms, with the Dice coefficient used to measure the co-occurrence between two terms. Mukherjee et al. [12] used the HITS index with Shallow Semantic Relations as hubs and nouns as authorities. Terms are filtered based on hub and authority scores.

C. Taxonomic Relation Extraction

Taxonomy construction involves building a concept hierarchy in which broader-narrower relations are stored and which can be visualized as a hierarchy of concepts. For example, “rice”, “wheat”, and “maize” come under “crop”. Taxonomies are commonly built using predefined patterns, as in the work of Hearst [13] and Ochoa et al. [14]. Meijer et al. [7] proposed constructing a taxonomy using the subsumption method, which calculates co-occurrence relations between different concepts. Knijff et al. [15] compared two methods, the subsumption method and hierarchical agglomerative clustering, for constructing taxonomies. They concluded that the subsumption method is suitable for shallow taxonomies and that hierarchical agglomerative clustering is suitable for building deep taxonomies.

D. Non Taxonomic Relation Extraction

Non-taxonomic relations describe the non-hierarchical attributes of concepts. For example, in the non-taxonomic relation “predators eat plants”, “eat” is a feature of “predator”. Nabila et al. [16] proposed automatic extraction of non-taxonomic relations by finding the non-taxonomic relations between concepts in the same sentence and between concepts in different sentences. Serra and Girardri [17] proposed a semi-automatic construction of non-taxonomic relations from a text corpus. Association between two concepts