J. Intell. Syst. 2019; 28(4): 669–681

V.S. Anoop* and S. Asharaf

Extracting Conceptual Relationships and Inducing Concept Lattices from Unstructured Text

https://doi.org/10.1515/jisys-2017-0225
Received May 16, 2017; previously published online September 26, 2017.

Abstract: Concept and relationship extraction from unstructured text data plays a key role in meaning aware computing paradigms, which make computers intelligent by helping them learn, interpret, and synthesize information. These concepts and relationships leverage knowledge in the form of ontological structures, which is the backbone of the semantic web. This paper proposes a framework that extracts concepts and relationships from unstructured text data and then learns lattices that connect concepts and relationships. The proposed framework uses an off-the-shelf tool for identifying common concepts from a plain text corpus and then implements algorithms for classifying common relations that connect those concepts. Formal concept analysis is then used for generating concept lattices, which is a proven and principled method of creating formal ontologies that aid machines to learn things. A rigorous and structured experimental evaluation of the proposed method on real-world datasets has been conducted. The results show that the newly proposed framework outperforms state-of-the-art approaches in concept extraction and lattice generation.

Keywords: Formal concept analysis, concept extraction, concept lattices, relation extraction, knowledge discovery.

1 Introduction

Text is considered to be one form of data that is generated very rapidly because of the number of text-producing and text-consuming applications. User applications and platforms such as online social networks, digital libraries, e-commerce websites, and blogs generate text data, and this has caused the creation of large unstructured text archives in organizations. These repositories are gold mines for organizations, as they contain invaluable patterns that help them leverage knowledge that can be used as an input to strategic and intelligent decision-making processes. As the complexity and quantity of text data being generated grow exponentially, the need for more intelligent, scalable, and text-understanding algorithms is indispensable. The advent of the semantic web, an extension and a meaning aware version of the current World Wide Web, leads the way to the introduction of numerous tools and techniques for leveraging, organizing, and presenting knowledge. Ontologies are the building blocks of any meaning aware or semantic computing paradigm, comprising a set of concepts and their hierarchies and relationships in a domain of interest. Thus, automated concept hierarchy learning from unstructured text has gained significant attention among natural language processing (NLP) researchers and practitioners. Concept hierarchy learning algorithms extract concepts from text and connect those concepts using potential relations that exist among them. Such hierarchies may find useful applications in concept-based ontology generation [19], concept-guided document summarization [24], and concept-guided information retrieval [10], to name a few.

*Corresponding author: V.S. Anoop, Data Engineering Lab, Indian Institute of Information Technology and Management-Kerala (IIITM-K), Thiruvananthapuram, India, e-mail: [email protected]
S. Asharaf: Indian Institute of Information Technology and Management-Kerala (IIITM-K), Thiruvananthapuram, India

1.1 Contributions

This work proposes a framework for identifying commonly occurring relations that connect concepts in unstructured text data and then learning them using machine learning techniques. Specifically, the proposed approach can identify and learn subsumption ("is-a"), Hearst patterns [14] ("such as", "or other", "and other", "including", "especially", etc.), and other potential indications of relations among concepts. This approach makes use of formal concept analysis (FCA) [34], a well-established mathematical theory for analyzing data, to form context tables and concept lattices [34] from the identified concepts and relations. These lattices can then be used for generating ontologies that may be used by intelligent and meaning aware computing systems. The authors compared the performance of the proposed system to some state-of-the-art methods through rigorous experiments, and the results indicate that this approach outperforms the chosen baselines.

1.2 Organization

The rest of this paper is organized as follows. Section 2 discusses some of the very recent state-of-the-art methods in relation extraction (RE) and knowledge representation using FCA. The research objective and a formal problem definition are given in Sections 3 and 4, respectively. The new method proposed in this paper is explained in Section 5, and our experimental setup is given in Section 6. A detailed evaluation of results is given in Section 7, and we draw conclusions and then discuss future work in Section 8.

2 Related Work

RE is a subtask of information extraction (IE) that aims at extracting relevant and potentially useful patterns or information from the humongous amounts of data being generated day by day. The sheer volume and heterogeneity of data make it difficult to analyze and extract these patterns manually. Thus, we need automated techniques for this process. NLP is one of the major areas that address this issue by scanning natural language texts and extracting useful patterns. IE tasks, specifically RE, have a long history going back to the late 1970s. However, successful commercial systems were introduced in the 1990s. In this section, we discuss some of the recent approaches in RE. In addition, we also throw light on state-of-the-art methods in FCA-based concept lattice generation.

Informally, we can group all the RE approaches introduced in the literature into five categories: hand-built patterns, bootstrapping methods, supervised methods, distant supervision, and unsupervised approaches. Hand-built pattern approaches use handcrafted rules for extracting potentially relevant relation words from text; one very notable work, introduced by M.A. Hearst, is known as Hearst patterns [14]. One issue with this approach is that it is difficult to write all sets of possible rules, and for other tasks, such as meronym extraction, the set of rules will be different. Still, a good number of extensions have been reported that use Hearst patterns as their foundation for RE [16, 27, 29].

Another category of RE is the bootstrapping-based approach, in which a specific set of seed relation instances is created and then used for searching for new tuples. One such approach for extracting the "author-book" relation was DIPRE [8]. Later, another system that uses the idea of bootstrapping was introduced; Snowball [1] extracts "organization-location" relation pairs. The limitation of the above algorithms is that they can deal only with specific relations.
Users have to specify the type of relation they need to work with, such as "author-book" or "organization-location". Later, TextRunner [36] was introduced in the domain of RE, which can learn relations, classes, and entities from a corpus in a self-supervised manner. This approach first tags the training data as positive and negative and then trains a classifier on the data to generate potential relations and entities. Another two-stage bootstrapping algorithm [33] was proposed by Sun. In the first step, the algorithm uses a bootstrapping method to scan the tuples, and in the second stage, it learns relation nominals and contexts.

More recently, supervised and semisupervised approaches have been found to be promising, and a good number of works have been reported in the area of RE that use deep learning techniques for identifying relation patterns. Very recently, a neural temporal RE approach [12] was introduced in which the authors experimented with neural architectures for temporal RE. They showed that neural models that take only tokens as input outperform state-of-the-art hand-engineered feature-based models. They also reported that encoding relation arguments with XML tags performs better than a traditional position-based encoding. Another notable approach attempts neural RE with selective attention over instances [21]. This work employs convolutional neural networks (CNN) to embed the semantics of sentences. Experimental results show that this model could make full use of all informative sentences and achieved significant and consistent improvement on the RE task. An approach for extracting relationships from clinical text was introduced [28] that exploited CNN to learn features automatically, which reduces the dependency on manual feature engineering.
They showed that CNN can be effectively used for RE in clinical text without depending on expert knowledge for feature engineering. Our proposed RE method uses machine learning techniques to classify Hearst patterns [14] ("such as", "or other", "and other", "including", "especially", etc.) and other potential indications of relations, such as "is-a", among textual concepts.

In recent years, FCA [34] has attained significant interest from research communities in various domains. FCA can analyze data that describe relationships that exist between a particular set of objects and their attributes. FCA is widely used as a knowledge representation framework, especially in knowledge engineering and ontology generation tasks in information science. This proposed work also uses FCA to create concept lattices that incorporate a set of concepts and the relationships that connect them. Here, we discuss some of the very recent works on FCA that use concept lattices for knowledge representation and ontology generation.

One of the recent notable works on extending FCA to association rule mining for knowledge representation is FCA-ARMM [15]. The authors integrated FCA and an association rule mining model (ARMM) and developed a tool called FCA Miner, which is capable of generating association rules from real datasets. A portal retrieval engine based on FCA (PREFCA) [23] was introduced in which a portal's semantic data were collected and formed into a concept lattice. Later, in the information retrieval phase, ranking is performed to retrieve the best results. Another work on identifying and validating ontology mappings using FCA was reported very recently [37]. The authors proposed a method called FCA-Map, which constructs formal contexts and then extracts mappings from the derived lattices. Then, a relation-based formal context is built and used for discovering additional structural mappings.
An interactive knowledge discovery and data mining approach on genomic data using FCA [13] was introduced recently. The authors used FCA-based methods to index external databases for observing the evolution of genes throughout the different biclusters. Very recently, an approach was proposed by Monnin et al. [22] that builds an optimal lattice-based structure for classifying RDF resources with respect to their predicates. The authors introduced the notion of lattice annotation, which enabled them to compare their classification to an ontology schema for confirming axioms that exhibit the subsumption relation or for suggesting completely new ones. The authors used the DBpedia dataset for their experiments, and the results showed that their proposed approach could strongly demonstrate the ability of FCA to guide a possible structuring of Linked Open Data [22].

An approach for concept lattice reduction using fuzzy k-means clustering was introduced by Kumar and Srinivas [17]. The authors took into consideration the complexity of computing all the concepts from a large incidence matrix and used fuzzy k-means clustering to reduce the size of concept lattices. They also showcased the usefulness of their proposed method on two real-world applications, namely information retrieval and information visualization. This method performed well on large context tables, and the authors could represent reduced concept lattices efficiently [17]. A fuzzy clustering-based FCA for association rule mining was proposed by Kumar [18]. The author performed association rule mining on a reduced formal context using the fuzzy k-means clustering approach introduced in the previous work [17]. The authors conducted experiments on two real-world healthcare datasets and showed that better association rule mining is possible on a reduced concept lattice [18].
Zhao and Zhang [37] introduced a novel method for identifying and validating ontology mappings using FCA in which the authors constructed three types of formal contexts and extracted mappings from the derived lattices. First, they showed that class names, labels, and synonyms share lexical tokens that may lead to lexical mappings across ontologies. Then, they showed how the lattice can be used to validate the lexical mappings as either positive or negative based on lexical anchors. In the third phase, they showed how additional structural mappings can be discovered from the positive relation-based context [37]. The authors conducted experiments and evaluated their methods on the anatomy and large biomedical ontologies tracks of OAEI 2015 [37].

A very comprehensive survey on FCA and its research trends and applications was reported in the literature, compiled by Singh et al. [32]. The work is a torchbearer for researchers who wish to work on FCA and related areas. The authors summarized more than 350 recent research papers published after 2011 and indexed in major reputed indexing services. They specifically provided the mathematical foundations of each extension of FCA, such as FCA with granular computing, interval-valued FCA, and possibility theory [32]. Semenova and Smirnov [31] recently published a paper on building formal ontologies from incomplete data. They presented new models and methods for ontological data analysis, which facilitate the identification of conceptual structures or formal ontologies of a particular knowledge domain. They proposed an intelligent analysis of incomplete data for building conceptual structures using FCA [31].

In this work, we make use of FCA for building context tables depicting various concepts and their associated relations and then transform these contexts into concept lattices. We show that efficient knowledge representation is thus possible, and this may be extended into ontology engineering tasks.

2.1 Background: FCA

FCA is a mathematical model or framework based on lattice theory [35], which is well suited for knowledge engineering and processing tasks. In recent years, the complexity and amount of data being produced across organizations have grown exponentially, and practitioners and researchers use FCA as an intelligent data analysis tool. In its basic setting, FCA generates two outputs for any given context table. The first one is called a concept lattice and the second one is called attribute implications. The former, the concept lattice, is a partially ordered collection of objects and their attributes, and the latter, the attribute implications, describes particular attribute dependencies that are true in the context table [6]. One useful feature of FCA worth mentioning is that we can perform reasoning with dependencies in data, reasoning with concepts in data, and visualization of data with concepts and relationships. Some common examples are hierarchical arrangements of web search results, gene expression data analysis, analysis of the organization of annotated taxonomies, etc. [6].

Definition 1: Formal context: In FCA, a formal context can be defined as a triplet ⟨X, Y, R⟩, where X and Y are nonempty sets and R is a binary relation between X and Y. For a formal context, elements x from X are called objects and elements y from Y are called attributes.

Definition 2: Concept-forming operators: For a formal context ⟨X, Y, R⟩, operators ↑: 2^X → 2^Y and ↓: 2^Y → 2^X are defined for every A ⊆ X and B ⊆ Y by

A↑ = {y ∈ Y | for each x ∈ A: ⟨x, y⟩ ∈ R} and

B↓ = {x ∈ X | for each y ∈ B: ⟨x, y⟩ ∈ R}.

Definition 3: Formal concept: In FCA, a formal concept in ⟨X, Y, R⟩ is a pair ⟨A, B⟩ with A ⊆ X and B ⊆ Y such that A↑ = B and B↓ = A. For a formal concept ⟨A, B⟩ in ⟨X, Y, R⟩, A and B are called the extent and intent of ⟨A, B⟩, respectively.

Definition 4: Attribute implications: In FCA, an attribute implication can be defined as an expression A → B, where A, B ⊆ Y, and it holds in a formal context if A↓ ⊆ B↓. It means that any object that has all the attributes in A also has all the attributes in B. It is also well known that the sets of attribute implications satisfied by a context satisfy Armstrong's axioms [4].

FCA makes use of a formal context (Definition 1) for data analysis in which each row corresponds to an object, each column corresponds to an attribute, and the field value denotes the relationship between them. FCA takes this formal context as input and then outputs a concept lattice that reflects generalization and specialization between the formal concepts derived from the incidence matrix [11]. These formal concepts, with their distinct extents and intents (sets of objects and their shared attributes), are extensively used for knowledge processing tasks. These relations are represented in the form of a formal context, F = (X, Y, R), where X is a set of objects, Y is a set of attributes, and R is a binary relation between them. From this given context, FCA creates sets of objects (A) and the sets of all attributes (B) that are common to these objects.
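As an aside for readers who want to experiment, the derivation operators of Definition 2 and the concept condition of Definition 3 can be sketched in a few lines of Python; the tiny context used in the usage note below is illustrative, not from the paper.

```python
# Derivation operators for a formal context (X, Y, R),
# with R given as a set of (object, attribute) pairs.

def up(A, Y, R):
    """A-up: attributes shared by every object in A."""
    return {y for y in Y if all((x, y) in R for x in A)}

def down(B, X, R):
    """B-down: objects having every attribute in B."""
    return {x for x in X if all((x, y) in R for y in B)}

def is_formal_concept(A, B, X, Y, R):
    """(A, B) is a formal concept iff A-up = B and B-down = A."""
    return up(A, Y, R) == B and down(B, X, R) == A
```

For example, with X = {1, 2}, Y = {"a", "b"}, and R = {(1, "a"), (1, "b"), (2, "a")}, the pair ({1, 2}, {"a"}) satisfies the concept condition, while ({1}, {"a"}) does not, because the attributes shared by object 1 alone are {"a", "b"}.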

Concept lattice: The concept lattice built from the incidence matrix (context table) determines the hierarchy of formal concepts, which follow the partial ordering principle (A1, B1) ≤ (A2, B2) iff A1 ⊆ A2 (equivalently, B2 ⊆ B1), and thus captures generalization and specialization between the concepts. That is, (A1, B1) is more specific than (A2, B2). The attribute implications are represented in the form A → B over the set Y. Several algorithms have been developed for generating concept lattices [5, 9, 20, 25]. An example of a formal context showing airlines and their sectors of operation and the corresponding concept lattice visualization are shown in Figures 1 and 2, respectively. In the formal context (Figure 1), the rows represent the concepts or objects (in this case, "Air Canada", "Air New Zealand", and "Air India") and the columns represent the set of attributes (in this case, "Latin America", "Asia", "Europe", and "Middle East"). A cross ("X") in the intersection (cell) of the formal context denotes that the object has the corresponding attribute; in Figures 1 and 2, it denotes that an airline operates in that particular sector. See Ref. [6] for a more detailed and comprehensive explanation of FCA and its related theory.

Figure 1: Formal Context Showing Airlines and their Sector of Operations. [Rows: Air India, Air New Zealand, Air Canada; columns: Asia, Latin America, Middle East, Europe.]

Figure 2: Concept Lattice Generated for Formal Context Given in Figure 1.
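For a context as small as the one in Figure 1, all formal concepts can be enumerated naively by closing every attribute subset. The sketch below assumes a plausible reading of the incidence in Figure 1 (the exact crosses are not fully recoverable from the text, so this table is illustrative only):

```python
from itertools import combinations

# Assumed incidence: airline -> sectors of operation (illustrative).
context = {
    "Air Canada": {"Latin America", "Europe"},
    "Air New Zealand": {"Asia", "Europe"},
    "Air India": {"Asia", "Middle East", "Europe"},
}
attributes = set().union(*context.values())

def extent(B):
    """B-down: airlines operating in every sector of B."""
    return frozenset(x for x, ys in context.items() if B <= ys)

def intent(A):
    """A-up: sectors served by every airline in A."""
    return frozenset(y for y in attributes if all(y in context[x] for x in A))

# Closing every attribute subset yields every formal concept
# (duplicates collapse in the set).
concepts = set()
for r in range(len(attributes) + 1):
    for B in combinations(sorted(attributes), r):
        A = extent(set(B))
        concepts.add((A, intent(A)))

# Print concepts from most general (largest extent) to most specific.
for A, B in sorted(concepts, key=lambda c: -len(c[0])):
    print(sorted(A), "|", sorted(B))
```

Closing every attribute subset is exponential in |Y| and is shown only for intuition; the algorithms cited in the text ([5, 9, 20, 25]) avoid this blow-up.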

3 Research Objective

The following are our main research objectives:
1. Introduce the task of RE from unstructured text and the major approaches and categories for the same.
2. Propose a framework that uses a machine learning approach for automatically extracting and learning the subsumption relation ("is-a"), Hearst patterns ("such as", "or other", "and other", "including", "especially", etc.), and other potential indications of relations among concepts.
3. Represent the knowledge (concepts and extracted relationships) using FCA.
4. Verify experimentally the effectiveness of the method in extracting and representing real-world concepts and relationships.

4 Problem Definition

We now define the problem formally. The RE problem is the task of detecting and classifying semantic relationships that connect entities, phrases, or concepts in a corpus of interest. Given a static document corpus D, the relationship extraction task identifies valid relation words that connect two concepts together. For example, consider the sentence, "Alzheimer is a degenerative disease". The words "Alzheimer" and "degenerative disease" are potential concepts in a medical text document. The relationship extraction method identifies "is-a" as a potential relation that connects these two concepts. Given a static document corpus D = d1, d2, …, dn that contains key-phrases or concepts C = c1, c2, …, cn, our problem is to identify semantically distinguishable relations that connect c1, c2, …, cn. We also address the problem of representing this knowledge using FCA, which is a widely used knowledge representation framework that comes with well-implemented mathematical models.
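As an illustration of the detection step on the example sentence above, a subsumption pattern such as "X is a Y" can be matched directly once candidate concepts are known. This regex sketch is only a simplified stand-in for the learned classifier described later, and the concept list is hypothetical:

```python
import re

def find_is_a(sentence, concepts):
    """Return (hyponym, hypernym) pairs linked by an 'is a/an' pattern."""
    pairs = []
    for c1 in concepts:
        for c2 in concepts:
            if c1 == c2:
                continue
            # Matches 'X is a Y' or 'X is an Y'; intervening modifiers
            # are deliberately not handled in this sketch.
            pattern = rf"{re.escape(c1)}\s+is\s+an?\s+{re.escape(c2)}"
            if re.search(pattern, sentence, flags=re.IGNORECASE):
                pairs.append((c1, c2))
    return pairs

# Hypothetical candidate concepts, e.g. from a key-phrase extractor.
print(find_is_a("Alzheimer is a degenerative disease",
                ["Alzheimer", "degenerative disease"]))
```

A real pipeline would first tag candidate concepts in the sentence and handle intervening words; the point here is only the direction of the extracted pair (hyponym, hypernym).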

5 Proposed Approach

In this section, we outline our proposed approach for identifying and extracting relationships that connect entities or concepts extracted from unstructured text. Considering complex sentence and language structures, concept extraction as well as RE is an extremely difficult task in NLP and information retrieval. Although many attempts have been reported on how to extract them, the majority of the works are heavily dependent on the specific corpus chosen for the experiment. In previous works, we have also attempted the concept extraction task, guided by a topic modeling process that works well on any plain text corpus [2, 3]. In this work, our main focus is on the RE task; thus, the process of concept extraction is not emphasized. Here, we use an off-the-shelf tool for identifying potential entities, phrases, and concepts from our static document corpus and then implement our relationship extraction algorithm on top of it to extract relation patterns. The overall workflow of the proposed approach is shown in Figure 3.

Figure 3: Overall Workflow of the Proposed Approach. [Pipeline: text corpus → pre-processing → key-phrase extraction → relation classifier (ML model) → context table → concept lattice.]

6 Experimental Setup

In this section, we describe our experimental setup and a detailed evaluation of the results. Two separate experiments are conducted. The first one is for extracting semantically valid relationships that connect the extracted concepts. The second one is for validating the usefulness of FCA for representing concepts and relationships extracted from a plain text corpus. The entire dataset, noun phrases, and the Python code are available in an open repository and can be freely accessed at https://github.com/anoop-research/relation-extraction.

6.1 Dataset Description

For the experiment, we have used disease description data that are publicly available in unstructured form from the Medscape website (http://www.medscape.com), which is an online global destination for physicians and other healthcare professionals. This website offers the latest medical news, expert opinions, and disease details. We have crawled the website for disease and treatment descriptions for two categories (cardiology and neurology) and collected the data in plain text files. Some of the concepts or medical phrases extracted from those plain text files are shown in Table 1. For the cardiology category, we have collected descriptions of 45 diseases, such as acute coronary syndrome, alcoholic cardiomyopathy, heart failure, and hypertension, and for neurology, there are 42 descriptions of diseases, such as Parkinson's disease, depression, and Alzheimer's disease. A snapshot of such a description (for Parkinson's disease) is shown in Figure 4.

Table 1: Some of the Concepts/Medical Phrases Extracted for Cardiology and Neurology Categories.

Cardiology                   | Neurology
Acute aortic dissection      | Central nervous system
Marfan syndrome              | Potential toxic metabolites
Type III dissections         | Pathological processes
β-Adrenergic blocker         | Progressive disability
Thoracic aortic dissections  | Vascular parkinsonism
Atherosclerotic disease      | Thalamocortical pathway
Lymphocyte activation        | Painful muscular contractions
Urine microalbumin           | Umbilical cord contamination
Rheumatogenic strains        | Neonatal tetanus
Streptococcal infections     | Elastic membrane

Figure 4: Snapshot of Disease Description Collected from Medscape (http://www.medscape.com).

6.2 Dataset Preprocessing

Dataset preprocessing concentrates on tidying up the data by removing unwanted characters, words, special symbols, and links to external sources. We have not removed stop-words from the corpus for this experiment, as some words in the stop-words list may be useful in identifying a particular relation word, for example, "is-a". Special symbols and other irrelevant characters are removed using regular expressions, and we used the Snowball stemmer to reduce words to their root forms, say, "affecting" to "affect". Then, we vectorized these words to feed into our proposed machine learning model.
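The cleanup step can be sketched with standard-library regular expressions. Note that the paper uses the Snowball stemmer; the crude_stem function below is only a toy suffix-stripping stand-in so that the sketch stays self-contained:

```python
import re

def clean(text):
    """Strip URLs and special symbols; stop-words are kept (needed for 'is-a')."""
    text = re.sub(r"https?://\S+", " ", text)     # links to external sources
    text = re.sub(r"[^A-Za-z0-9\s-]", " ", text)  # special symbols
    return re.sub(r"\s+", " ", text).strip().lower()

def crude_stem(word):
    """Toy stand-in for the Snowball stemmer used in the paper."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

tokens = [crude_stem(w) for w in clean("Affecting the heart! See http://example.com").split()]
print(tokens)
```

In practice, the stemming step would use nltk's SnowballStemmer (or equivalent) rather than the toy rule above; the cleaning regexes are illustrative of the kind used, not the paper's exact expressions.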

6.3 Building a Machine Learning Model

Our next step is to create a machine learning classifier that learns the Hearst patterns and other potential indications of relation patterns. We have used several classification algorithms to build the model: decision tree, random forest, Adaboost, and support vector machine (SVM). The entire dataset has been split 70–30, where 70% is for training and 30% for testing the model. For training and testing, we have used a server machine configured with a 16-core AMD Opteron 6376 processor at 2.3 GHz and 16 GB of main memory. For implementing the classifiers, Python 2.7 is used along with the "scikit-learn" [26] library. The input to these classifiers is in the form of a sentence-relation word matrix, where the concept/key-phrase tagged sentences are in the rows of the matrix and the relation words, such as "is-a" and "such as", are along the columns. Each cell contains the total count of a specific relation word occurring in that sentence. As this is a binary classifier, "0" or "1" is given as the label based on the absence or presence of the desired relation. For this experiment, the input matrix contains 10,000 such rows, which are given as input to the four relation classifier models. After training, for testing, we present a new sentence to the trained model, and the model outputs either "1" or "0" based on the presence or absence of the relation; this output is then converted to a formal context for building concept lattices.

For the decision tree classifier, the "gini" criterion is used as the measure of split quality, and the maximum depth parameter has been chosen as 15 by a trial-and-error method. Second, the random forest classifier has been implemented with the number of estimators as 300, maximum depth as 15, and random state as 42. Our Adaboost classifier used a decision tree classifier as the base estimator with a maximum depth of 8 and a random state of 42; the learning rate, the number of estimators, and the random state of the ensemble were set to 0.9, 500, and 1332, respectively. For our SVM classifier, we have set the random state to 22 and the maximum number of iterations to 100.
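The sentence-relation word matrix described above can be sketched as follows. The sentences and the naive substring counting are illustrative; the actual pipeline tags key-phrases first and feeds the matrix to the scikit-learn classifiers:

```python
# Relation vocabulary from the paper ("is-a" written as it appears in text).
RELATION_WORDS = ["is a", "such as", "or other", "and other", "including", "especially"]

def relation_matrix(sentences):
    """Rows: sentences; columns: counts of each relation word."""
    # Naive substring counts; a real pipeline would tokenize first.
    return [[s.lower().count(r) for r in RELATION_WORDS] for s in sentences]

def label(row):
    """Binary label: 1 if any desired relation word is present, else 0."""
    return int(any(row))

sentences = [
    "Alzheimer is a degenerative disease",
    "Diseases such as hypertension affect the heart",
    "The patient was discharged yesterday",
]
X = relation_matrix(sentences)
y = [label(row) for row in X]
print(X, y)
```

Each (X row, y label) pair corresponds to one of the 10,000 training rows mentioned above; in the paper, the labels come from annotation rather than from the counts themselves.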

6.4 Creating Concept Lattices

Once the major concepts, phrases, and relationships have been extracted, the knowledge may be represented in an intuitive and informative way so that meaning aware applications can be built on top of it. We used FCA for deriving a concept hierarchy, or formal ontology, that represents the concepts and associated relationships we have leveraged. To build the lattice, a table with logical attributes represented as a triplet ⟨X, Y, R⟩ should be created first. In the triplet ⟨X, Y, R⟩, R denotes a binary relation between the objects X and the attributes Y. In our case, X denotes a set of disease names and Y denotes a set of attributes of the diseases. For example, consider "hypertension" as the case; then, we have "blood pressure", "breathing disorder", "cortisol stress reactivity", etc., as its attributes. The binary relation R has a value of 1 if "hypertension" has a particular attribute; otherwise, the value is 0. The entire formal context and concept lattice are very large. For the "cardiology" domain, the formal context contained 45 objects (disease names) and 7746 attributes (symptoms). For the "neurology" domain, there are 42 objects (disease names) and 9156 attributes (symptoms). Due to space constraints, it is impossible to show the entire context table and concept lattice generated for the whole dataset. A part of such a context table is shown in Figure 5, and the corresponding concept lattice is shown in Figure 6.
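The construction of the context table from classifier output can be sketched as follows; the (disease, attribute) pairs are illustrative examples drawn from the text above, not the paper's actual extraction results:

```python
# Hypothetical (disease, attribute) pairs produced by the relation classifier.
pairs = [
    ("hypertension", "blood pressure"),
    ("hypertension", "breathing disorder"),
    ("myocarditis", "fulminant myocarditis"),
]

objects = sorted({d for d, _ in pairs})      # X: disease names
attributes = sorted({a for _, a in pairs})   # Y: disease attributes
relation = set(pairs)                        # R: binary relation

# Binary incidence matrix: 1 if the disease has the attribute, else 0.
table = [[int((d, a) in relation) for a in attributes] for d in objects]
for d, row in zip(objects, table):
    print(d, row)
```

At the paper's scale (45 objects x 7746 attributes for cardiology), such a matrix would be stored sparsely, but the construction is the same.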

Figure 5: Part of a Formal Context Generated from the Original Dataset Using Our Proposed Method.

Figure 6: Part of a Concept Lattice Generated from the Original Dataset Using Our Proposed Method. [Nodes include diseases such as Hypertension, Brown syndrome, Nephrosclerosis, and Myocarditis, with attributes such as blood pressure, breathing disorder, fulminant myocarditis, genetic trait, cytotoxic effects, and pulmonic stenosis.]

6.5 Baselines Chosen

We have chosen two baselines for comparing our proposed method. Of all the classifier algorithms chosen (decision tree, random forest, Adaboost, and SVM), SVM showed the best accuracy on relation classification. Thus, we compared the SVM model to the chosen baselines.
–– Baseline 1 [7]: This approach extracts facts from natural language texts with conceptual modeling [7]. This work shows the application of FCA for extracting facts from natural language text. Their approach combines concept graphs and concept lattices for leveraging facts, which is closely associated with our proposed approach. They use concept lattices to model relationships that connect words and then use these relationships for interpreting formal concepts as possible facts.
–– Baseline 2 [30]: The second baseline, in contrast, attempts to create a public dataset containing more than 400 million hypernymy relations from the CommonCrawl web corpus. Although we do not consider all the relations used in their work, we have chosen this as our second baseline, as our proposed work is aligned with their workflow to a great extent.

A comparison of the results of these baselines to our proposed framework and a detailed evaluation are dis- cussed in Section 7.

7 Results and Evaluation

This section describes the results of our rigorous and systematic experiment on relation classification and knowledge representation using FCA. As explained in Section 6, for relation classification, we have chosen four different algorithms: Adaboost, random forest, decision tree, and SVM. The precision, recall, and F1 scores reported by our machine learning classifiers are shown in Table 2, and comparisons with the baselines are shown in Tables 3 and 4. Of the four different

Table 2: Precision, Recall, and F1 Score of Different Classifier Algorithms on Our Dataset.

Algorithm Precision Recall F1 score

Adaboost       0.95  0.85  0.90
Random forest  0.97  0.86  0.91
Decision tree  0.95  0.84  0.89
SVM            0.98  0.87  0.92

Table 3: Precision, Recall, and F1 Score Comparison of Baselines and Our Proposed Method for RE.

Algorithm Precision Recall F1 score

Baseline (for RE) [7]  0.79  0.81  0.79
Proposed               0.91  0.87  0.88

Table 4: Precision, Recall, and F1 Score Comparison of Baselines and Our Proposed Method for Concept Lattice Generation.

Algorithm Precision Recall F1 score

Baseline (for lattice generation) [30] 0.80 0.78 0.78 Proposed 0.88 0.87 0.87

1

0.95

0.9 Precison 0.85 Recall F-measure 0.8

0.75 Adaboost Random Decision SVM forest tree

Figure 7: Graph Representation of Classifier Performance on the Chosen Dataset. algorithms chosen with optimal parameters, SVM is found to show better classification accuracy; thus, this model has been chosen for comparing the performance of our proposed method to the chosen baselines. The normalized confusion matrix for all the four classification algorithms is shown in Figure 7 and the classifica- tion accuracy comparison in terms of precision, recall, and F1 score in a graph is shown in Figure 8.
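The per-classifier scores of Table 2 and the row-normalized matrices of Figure 8 can be reproduced in spirit with scikit-learn [26]. The sketch below uses synthetic stand-in features and default hyperparameters, so it is a schematic illustration of the evaluation loop rather than the paper's exact pipeline:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix

# Synthetic stand-in for relation features: each row encodes a candidate
# concept pair; label 1 = "related", 0 = "unrelated".
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

classifiers = {
    "Adaboost": AdaBoostClassifier(random_state=0),
    "Random forest": RandomForestClassifier(random_state=0),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(random_state=0),
}

for name, clf in classifiers.items():
    clf.fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    p = precision_score(y_te, pred)
    r = recall_score(y_te, pred)
    f = f1_score(y_te, pred)
    # Row-normalized confusion matrix, as plotted in Figure 8.
    cm = confusion_matrix(y_te, pred, normalize="true")
    print(f"{name}: P={p:.2f} R={r:.2f} F1={f:.2f}")
    print(cm)
```

With `normalize="true"`, each row of the confusion matrix sums to 1, so the diagonal entries are per-class recall, matching the quantities displayed in the figure.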

8 Conclusions and Future Work

This paper proposed a framework for extracting relationships that connect concepts and phrases found in unstructured text documents. We make use of a machine learning-based approach for learning commonly occurring relations such as "is-a", together with the Hearst patterns "such as", "or other", "and other", "including", and "especially". This approach employs different machine learning algorithms, namely Adaboost, random forest, decision tree, and SVM, to classify potential relationships.

Figure 8: Normalized Confusion Matrices for (A) Adaboost, (B) Random Forest, (C) Decision Tree, and (D) SVM Classifiers.

Classifier      True label   Predicted Unrelated   Predicted Related
Adaboost        Unrelated    0.9875                0.0125
Adaboost        Related      0.1461                0.8539
Random forest   Unrelated    0.9922                0.0078
Random forest   Related      0.1367                0.8633
Decision tree   Unrelated    0.9859                0.0141
Decision tree   Related      0.1570                0.8430
SVM             Unrelated    1.0                   0.0
SVM             Related      0.0                   1.0

This work makes use of FCA to represent the noun phrases and relations leveraged by our proposed RE algorithm. Experiments on a real-world medical dataset collected from the public web show that the proposed method extracts better conceptual structures than the baselines [7, 30]. As the end results are promising, our future work will mainly be in the direction of improving the accuracy of our classification engine and extracting more semantically valid relation patterns. This may generate more fine-grained facts from unstructured text and may aid the ontology enrichment process in semantic computing paradigms. The current experimental setup works only with a static unstructured text corpus for extracting facts and building concept lattices. In the future, we may extend it to dynamically generated text content from platforms such as social networks.
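The FCA step mentioned above derives formal concepts from a binary object-attribute context and orders them by extent inclusion to form the concept lattice [34]. A minimal brute-force sketch over a toy medical context (our own illustration with invented noun phrases and attributes, not the paper's implementation or dataset) enumerates all formal concepts as closed (extent, intent) pairs:

```python
from itertools import combinations

# Toy formal context: noun-phrase objects x relation/attribute columns.
objects = ["diabetes", "asthma", "insulin"]
attributes = ["is-a disease", "is-a hormone", "treats disease"]
incidence = {
    ("diabetes", "is-a disease"),
    ("asthma", "is-a disease"),
    ("insulin", "is-a hormone"),
    ("insulin", "treats disease"),
}

def common_attributes(objs):
    """Attributes shared by every object in objs (the ' operator on extents)."""
    return {a for a in attributes if all((o, a) in incidence for o in objs)}

def common_objects(attrs):
    """Objects possessing every attribute in attrs (the ' operator on intents)."""
    return {o for o in objects if all((o, a) in incidence for a in attrs)}

def formal_concepts():
    """All (extent, intent) pairs closed under '' , by brute force over subsets."""
    concepts = set()
    for r in range(len(objects) + 1):
        for objs in combinations(objects, r):
            intent = common_attributes(set(objs))
            extent = common_objects(intent)
            concepts.add((frozenset(extent), frozenset(intent)))
    return concepts

for extent, intent in sorted(formal_concepts(), key=lambda c: len(c[0])):
    print(sorted(extent), sorted(intent))
```

This exhaustive enumeration is exponential in the number of objects and only suitable for illustration; scalable lattice construction uses algorithms such as those compared by Kuznetsov and Obiedkov [20].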

Acknowledgments: The authors thank all researchers from the Data Engineering Lab at the Indian Institute of Information Technology and Management-Kerala (IIITM-K) for their suggestions that improved the quality of this paper. The authors also acknowledge the anonymous reviewers for their constructive comments.

Bibliography

[1] E. Agichtein and L. Gravano, Snowball: extracting relations from large plain-text collections, in: Proceedings of the 5th ACM Conference on Digital Libraries, pp. 85–94, San Antonio, TX, USA: ACM, 2000.
[2] V. S. Anoop, S. Asharaf and P. Deepak, Learning concept hierarchies through probabilistic topic modeling, Int. J. Inf. Process. 10 (2016), 1–11.
[3] V. S. Anoop, S. Asharaf and P. Deepak, Unsupervised concept hierarchy learning: a topic modeling guided approach, Proc. Comput. Sci. 89 (2016), 386–394.
[4] W. W. Armstrong, Dependency structures of data base relationships, in: IFIP Congress, vol. 74, pp. 580–583, 1974.
[5] E. Bartl, H. Rezankova and L. Sobisek, Comparison of classical dimensionality reduction methods with novel approach based on formal concept analysis, in: International Conference on Rough Sets and Knowledge Technology, pp. 26–35, Springer, Berlin/Heidelberg, 2011.
[6] R. Belohlavek, Introduction to Formal Concept Analysis, Department of Computer Science, Palacky University, Olomouc, 2008.
[7] M. Bogatyrev, Fact extraction from natural language texts with conceptual modeling, in: International Conference on Data Analytics and Management in Data Intensive Domains, pp. 89–102, Moscow, Russia: Springer, 2016.
[8] S. Brin, Extracting patterns and relations from the world wide web, in: International Workshop on the World Wide Web and Databases, pp. 172–183, Springer, Berlin/Heidelberg, 1998.
[9] V. Codocedo, C. Taramasco and H. Astudillo, Cheating to achieve formal concept analysis over a large formal context, in: The 8th International Conference on Concept Lattices and Their Applications-CLA 2011, pp. 349–362, LORIA Nancy, France, 2011.
[10] C. Cui, J. Shen, Z. Chen, S. Wang and J. Ma, Learning to rank images for complex queries in concept-based search, Neurocomputing, Elsevier (2017, In Press).
[11] B. A. Davey and H. A. Priestley, Introduction to Lattices and Order, Cambridge University Press, Cambridge, UK, 2002.
[12] D. Dligach, T. Miller, C. Lin, S. Bethard and G. Savova, Neural temporal relation extraction, in: European Chapter of the Association for Computational Linguistics, p. 746, Valencia, Spain, 2017.
[13] J. M. Gonzalez-Calabozo, F. J. Valverde-Albacete and C. Pelaez-Moreno, Interactive knowledge discovery and data mining on genomic expression data with numeric formal concept analysis, BMC Bioinform. 17 (2016), 374.
[14] M. A. Hearst, Automatic acquisition of hyponyms from large text corpora, in: Proceedings of the 14th Conference on Computational Linguistics, vol. 2, pp. 539–545, Association for Computational Linguistics, Nantes, France, 1992.
[15] T. Herawan, M. M. Deris and A. R. Hamdan, FCA-ARMM: a model for mining association rules from formal concept analysis, in: Recent Advances on Soft Computing and Data Mining: The Second International Conference on Soft Computing and Data Mining (SCDM-2016), Bandung, Indonesia, August 18–20, 2016 Proceedings, vol. 549, p. 213, Springer, 2017.
[16] T. Kawaumra, M. Sekine and K. Matsumura, Hyponym/hypernym detection in science and technology thesauri from bibliographic datasets, in: Semantic Computing (ICSC), 2017 IEEE 11th International Conference on, pp. 180–187, San Diego, CA, USA: IEEE, 2017.
[17] C. A. Kumar and S. Srinivas, Concept lattice reduction using fuzzy k-means clustering, Expert Syst. Appl. 37 (2010), 2696–2704.
[18] C. A. Kumar, Fuzzy clustering-based formal concept analysis for association rules mining, Appl. Artif. Intell. 26 (2012), 274–301.
[19] N. Kumar, M. Kumar and M. Singh, Automated ontology generation from a plain text using statistical and NLP techniques, Int. J. Syst. Assur. Eng. Manage. 7 (2016), 282–293.
[20] S. O. Kuznetsov and S. A. Obiedkov, Comparing performance of algorithms for generating concept lattices, J. Exp. Theor. Artif. Intell. 14 (2002), 189–216.
[21] Y. Lin, S. Shen, Z. Liu, H. Luan and M. Sun, Neural relation extraction with selective attention over instances, in: Proceedings of ACL, vol. 1, pp. 2124–2133, 2016.
[22] P. Monnin, M. Lezoche, A. Napoli and A. Coulet, Using formal concept analysis for checking the structure of an ontology in LOD: the example of DBpedia, in: 23rd International Symposium on Methodologies for Intelligent Systems, ISMIS, 2017.
[23] E. Negm, S. AbdelRahman and R. Bahgat, PREFCA: a portal retrieval engine based on formal concept analysis, Inf. Process. Manage. 53 (2017), 203–222.
[24] H. Oliveira, R. Lima, R. D. Lins, F. Freitas, M. Riss and S. J. Simske, A concept-based integer linear programming approach for single-document summarization, in: Intelligent Systems (BRACIS), 2016 5th Brazilian Conference on, pp. 403–408, Recife, Pernambuco, Brazil: IEEE, 2016.
[25] J. Outrata and V. Vychodil, Fast algorithm for computing fixpoints of Galois connections induced by object-attribute relational data, Inf. Sci. 185 (2012), 114–127.
[26] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel and J. Vanderplas, Scikit-learn: machine learning in Python, J. Mach. Learn. Res. 12 (2011), 2825–2830.
[27] S. Roller and K. Erk, Relations such as hypernymy: identifying and exploiting Hearst patterns in distributional vectors for lexical entailment, arXiv preprint arXiv:1605.05433 (2016).
[28] S. K. Sahu, A. Anand, K. Oruganty and M. Gattu, Relation extraction from clinical texts using domain invariant convolutional neural network, arXiv preprint arXiv:1606.09370 (2016).

[29] J. Seitner, C. Bizer, K. Eckert, S. Faralli, R. Meusel, H. Paulheim and S. Ponzetto, A large database of hypernymy relations extracted from the web, in: Proceedings of the 10th Edition of the Language Resources and Evaluation Conference, Portoroz, Slovenia, 2016.
[30] J. Seitner, C. Bizer, K. Eckert, S. Faralli, R. Meusel, H. Paulheim and S. Ponzetto, A large database of hypernymy relations extracted from the web, in: Proceedings of the 10th Edition of the Language Resources and Evaluation Conference, Portoroz, Slovenia, 2016.
[31] V. A. Semenova and S. V. Smirnov, Intelligent analysis of incomplete data for building formal ontologies, in: CEUR Workshop Proceedings, vol. 1638, pp. 796–805, 2016.
[32] P. K. Singh, C. A. Kumar and A. Gani, A comprehensive survey on formal concept analysis, its research trends and applications, Int. J. Appl. Math. Comput. Sci. 26 (2016), 495–516.
[33] A. Sun, A two-stage bootstrapping algorithm for relation extraction, in: Proceedings of Recent Advances in Natural Language Processing, pp. 76–82, Borovets, Bulgaria, 2009.
[34] R. Wille, Restructuring lattice theory: an approach based on hierarchies of concepts, in: Ordered Sets, pp. 445–470, Springer, The Netherlands, 1982.
[35] R. Wille, Concept lattices and conceptual knowledge systems, Comput. Math. Appl. 23 (1992), 493–515.
[36] A. Yates, M. Cafarella, M. Banko, O. Etzioni, M. Broadhead and S. Soderland, TextRunner: open information extraction on the web, in: Proceedings of Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, pp. 25–26, Association for Computational Linguistics, Rochester, New York, 2007.
[37] M. Zhao and S. Zhang, Identifying and validating ontology mappings by formal concept analysis, in: Proceedings of the 15th International Semantic Web Conference, pp. 61–72, Kobe, Japan, 2016.