Evolving Knowledge Extraction from Online Resources Zhibo Xiao, Tharini Nayanika De Silva, Kezhi Mao
Total Page:16
File Type:pdf, Size:1020Kb
World Academy of Science, Engineering and Technology International Journal of Computer and Systems Engineering Vol:11, No:6, 2017 Evolving Knowledge Extraction from Online Resources Zhibo Xiao, Tharini Nayanika de Silva, Kezhi Mao Abstract—In this paper, we present an evolving knowledge information retrieval and information extraction have been extraction system named AKEOS (Automatic Knowledge Extraction investigated. reference [5] proposed a summarization method from Online Sources). AKEOS consists of two modules, including to track the development of tweet stream. This work extracts a one-time learning module and an evolving learning module. The one-time learning module takes in user input query, and key phrases but does not discover the development of an event. automatically harvests knowledge from online unstructured resources Reference [6] proposed a method to track the emerging topics in an unsupervised way. The output of the one-time learning is a of streaming documents using dictionary learning. References CORE ZENODO structured vector representing the harvested knowledge. The evolving [7]-[9] investigated a specific time-related information retrieval learning module automatically schedules and performs repeated task, chronological citation recommendation, which utilized one-time learning to extract the newest information and track the development of an event. In addition, the evolving learning module different topic models to discover time-related features provided by summarizes the knowledge learned at different time points to produce of academic citations. In spite of the success of the a final knowledge vector about the event. With the evolving learning, aforementioned work, each work only touches on one single brought to you by we are able to visualize the key information of the event, discover aspect of evolving learning. In this study, we present the trends, and track the development of an event. our AKEOS (Automatic Knowledge Extraction from Online Keywords—Evolving learning, knowledge extraction, knowledge Sources) which is an end-to-end system. graph, text mining. AKEOS consists of two main modules, including a one-time learning module and an evolving learning module. The I. INTRODUCTION one-time learning aims to distil the valuable information HE information on the Internet is overwhelming us. Even from the overwhelmingly large volume of unstructured text T for the result of a single query on a search engine, it that search engine returns for one query, and output a is hard to quickly grasp the key information underlying the structured knowledge vector to summarize the query result. returned search results. Traditional media and social media are The one-time learning extracts knowledge for one query also shortening our attentions and shifting our focuses with the session and the knowledge extracted just reflects the event news flash that we usually may not be able to keep up with. In until a particular time. As the event develops, new information order to address this problem, automatic knowledge extraction will be generated. The objective of the evolving learning is to has been proposed. extract the knowledge of the new development of the event. A few open information extraction (OIE) systems have been The user only needs to submit a query once, the evolving developed. As stated in [1], an OIE system takes in a corpus learning mechanism in AKEOS automatically schedules and of texts, without any a priori knowledge or specification of the performs the one-time knowledge extraction, and summarize relations of interest, and outputs a set of extracted relations. the knowledge extracted at a different time. Here, the One example work is KnowItAll [2], which is an information knowledge summarization is done through knowledge vector extraction system that addresses the challenge of automated merging, including attribute merging and value merging. The information extraction by learning to label its own training end product is a knowledge vector. The final knowledge vector examples. KnowItAll sets a foundation of modern OIE system summarizes the overall information of an event, while the but it requires a large amount of data from search engine individual knowledge vectors generate at different time points to train and learn relations. The more recent TextRunner [3], reflect how the event evolves over time. however, learns a general model of how relations are expressed In summary, AKEOS has the following merits: in a particular language using CRF paradigm, in this way, 1) It harvests unstructured text data from online resources, International Science Index, Computer and Systems Engineering Vol:11, No:6, 2017 waset.org/Publication/10007303 it not only reduces the error rate of KnowItAll, but also distils the important information from the data to save the time when learning new relations. Recently, SemIE produce structured knowledge vectors. is proposed in [4], which extends TextRunner and exploits 2) It is an end-to-end system, with query words as input, the predicate-argument structure of a text to handle complex knowledge vector as the end product. sentences. 3) It tracks the development of events and discovers the In recent years, evolving learning has received more and underlying patterns. more attentions. Different aspects of evolving learning in Zhibo Xiao and Tharini Nayanika de Silva are with the School of Electrical II. SYSTEM ARCHITECTURE and Electronic Engineering, Nanyang Technological University, Singapore. Kezhi Mao is with the School of Electrical and Electronic Engineering, AKEOS consists of two modules: one-time learning Nanyang Technological University, Singapore (e-mail: [email protected]). and evolving learning. The one-time learning module is a International Scholarly and Scientific Research & Innovation 11(6) 2017 747 scholar.waset.org/1307-6892/10007303 View metadata, citation and similar papers at core.ac.uk World Academy of Science, Engineering and Technology International Journal of Computer and Systems Engineering Vol:11, No:6, 2017 Fig. 1 The AKEOS architecture stand-alone system, while the evolving learning module is built III. ONE-TIME KNOWLEDGE EXTRACTION upon the one-time learning module. The system architecture One-time learning module takes in user query to crawl the is shown in Fig. 1. The inputs to the one-time learning Internet for relevant articles and extract relevant information system are query words from a user. The query words are from these article. One-time learning is the foundation of the then used to crawl related documents from online resources. evolving learning module. Since Google ranks search results by relevance to the query, The module first takes in user query and use Google search top documents, say top 100 documents, are deemed “relevant engine to search relevant articles on the Internet. We treat documents”. The one-time learning system first extracts the the top 100 query results from Google as relevant articles. main paragraph text from each document. Feature words are After extracting the text from these webpages, we adopted then selected from the relevant documents. These feature an enhanced TF-IDF method which combines the TF-IDF words are associated with key information of an event and measure with word frequencies from Google’s N-gram dataset are used as a guide to relevant sentence selection. From these to measure the importance of the word in the context of the relevant sentences, useful entities are extracted. Through this user query in relation to a generic context. Extracted feature procedure of knowledge extraction, the unstructured text is words will be used in the following tasks, including one-time transformed into a structured knowledge vector. Each time learning and evolving learning. the one-time learning is performed, the structured knowledge Till now, the relevant documents concerning the query vector will be stored in the knowledge base for further use. words are collected, but not all the sentences are relevant to the query words. Hence, feature words are used to select The one-time learning module only extract the knowledge the relevant sentences. For a sentence containing M words of an event until a particular time point. The evolving learning in the corpus, each word is transformed into a vector, then module schedules and performs repeated one-time learning at convoluted with each feature word vector. The sentence is different stages of an event to ensure the knowledge extracted transformed into a vector with the length of M. A further is up-to-date. The newly extracted knowledge will be merged max pooling operation will select the most distinct feature in with previously extracted knowledge to form a new vector. For the sentence. In this manner, a sentence with variable length example, after the first day of an event, the feature word list is transformed into a scalar. With a predefined threshold, the for day 1 is learned, along with relevant sentences, extracted relevant sentences are selected. useful entities and the knowledge vector v1 of the event for After relevant sentences are selected, the next step is to day 1. After the second day, the feature word list based on International Science Index, Computer and Systems Engineering Vol:11, No:6, 2017 waset.org/Publication/10007303 extract entities containing useful information. In AKEOS, the the newly published articles on day 2 are generated, along following are extracted as entities: with the corresponding relevant sentences, useful entities and • knowledge vector of v2. At the end of day 2, the knowledge Quantitative words, which are words associated with a vectors of day 1 and day 2 are merged into a new vector, number, such as ‘deaths’, ‘people killed’, ‘missing’, ‘km’, denoted by u1. After n days, un−1 is learned and is treated ‘magnitude’. • as final vector v = un−1. User can use the final vector v Geographic locations to check the summarized knowledge of the event, and use • NER tags, such as date, time, location, person, vi,i =1, 2,...,n to review the development of the event.