Strategies for Lifelong Knowledge Extraction from the Web
Total Page:16
File Type:pdf, Size:1020Kb
Strategies for Lifelong Knowledge Extraction from the Web Michele Banko and Oren Etzioni Turing Center University of Washington Computer Science and Engineering Box 352350 Seattle, WA 98195, USA [email protected], [email protected] ABSTRACT Categories and Subject Descriptors The increasing availability of electronic text has made it I.2.6 [Artificial Intelligence]: Learning—Knowledge possible to acquire information using a variety of tech- Acquisition; I.2.8 [Artificial Intelligence]: Problem niques that leverage the expertise of both humans and Solving, Control Methods, and Search—Control theory machines. In particular, the field of Information Extrac- tion (IE), in which knowledge is extracted automatically General Terms from text, has shown promise for large-scale knowledge Algorithms acquisition. While IE systems can uncover assertions about individ- 1. INTRODUCTION ual entities with an increasing level of sophistication, The accumulation of knowledge takes on a variety of text understanding – the formation of a coherent the- forms, ranging from the provision of information by do- ory from a textual corpus – involves representation and main experts and volunteer contributors, to techniques learning abilities not currently achievable by today’s IE that extract information from structured data, to sys- systems. Compared to individual relational assertions tems that seek to understand natural language. Driven outputted by IE systems, a theory includes coherent by the increasing availability of online information and knowledge of abstract concepts and the relationships recent advances in the fields of machine learning and among them. natural language processing, the AI community is in- creasingly optimistic that knowledge capture from text We believe that the ability to fully discover the rich- is within reach [4, 9]. Unsupervised algorithms that ness of knowledge present within large, unstructured exploit the availability of increasingly large volumes of and heterogeneous corpora will require a lifelong learn- Web pages are learning to extract structured knowledge ing process in which earlier learned knowledge is used to from unstructured text. guide subsequent learning. This paper introduces Al- ice, a lifelong learning agent whose goal is to automat- Previous efforts in text-based knowledge acquisition can ically discover a collection of concepts, facts and gen- largely be attributed to the field of Information Extrac- eralizations that describe a particular topic of interest tion (IE), where the task is to recognize entities and directly from a large volume of Web text. Building upon relations mentioned within text corpora. Traditional recent advances in unsupervised information extraction, IE systems focused on locating instances of narrow, we demonstrate that Alice can iteratively discover new pre-specified relations, such as the time and place of concepts and compose general domain knowledge with events, from small, homogeneous corpora. The Know- a precision of 78%. ItAll system [5] advanced the field of IE by capturing knowledge in a manner that scaled to the size and di- versity of relationships present within millions of Web pages. KnowItAll accomplished this task by learn- ing to label its own training examples using only a set of domain-independent extraction patterns and a boot- strapping procedure. While KnowItAll is capable of self-supervising its training process, extraction is not fully automatic; KnowItAll requires a user to name a K-CAP’07, October 28–31, 2007, Whistler, British Columbia, Canada. relation prior to each extraction cycle for every relation of interest. When acquiring knowledge from corpora as large and varied as the the Web, the task of anticipating In this paper, we: all relations of interest becomes highly problematic. • Develop the paradigm of lifelong learning in the A recent goal for knowledge extraction systems has been context of hierarchical, unsupervised knowledge ac- to eliminate the need for human involvement, thus mak- quisition from text. ing it possible to automate the knowledge acquisition process for a new domain or set of relations. Such efforts • Describe Alice, a lifelong learning agent that builds have resulted in the paradigm of preemptive or open upon recent advances in unsupervised information information extraction. Compared to traditional infor- extraction to discover new concepts and compose mation extraction, open information extraction systems abstract domain knowledge with a precision of 78%. attempt to automatically discover all possible relations from each sentence encountered. Shinyama and Sekine • Report on a collection of lifelong learning strate- [12] took important steps in the direction of open IE by gies that enable Alice to iteratively acquire a tex- developing a method for unrestricted relation extraction tual theory at Web scale. that could be applied to a small corpus of newswire ar- ticles. The TextRunner system [1] was the first open The remainder of the paper is organized as follows. In information extraction system to do this at Web scale, Section 2, we reintroduce the paradigm of lifelong learn- processing just over 110,000,000 Web pages and yield- ing previously articulated by members AI community. ing over 330,000,00 statements about concrete entities In Section 3, we describe Alice, our embodiment of a with a precision of 88%. lifelong learning agent that builds a domain theory from text. We report on experimental results in Section 4, Despite advances in information extraction, the process followed by a review of related research in textual the- of text understanding – the formation of a coherent set ory composition in Section 5. The paper concludes with of beliefs from a textual corpus – involves representa- a discussion of future work. tion and learning abilities not currently achievable by today’s IE systems. While IE systems can uncover as- 2. LIFELONG LEARNING sertions about individual objects, a theory of a partic- The paradigm of “lifelong learning” was articulated in ular domain is built from collective knowledge of con- the machine learning community in response to stark cepts and relationships among them. Compared to a differences observed between human and machine learn- vast set of independent statements extracted through ing abilities [16]. While humans can successfully learn IE, a domain theory represents knowledge compactly; behaviors having had only a small set of marginally re- relationships are expressed between abstract concepts lated experiences, machine learning algorithms typically at an appropriate level of generalization in a hierarchy. have difficulty learning from small datasets. The abil- Additionally, a domain theory contains only informa- ity of humans to learn when faced with complex tasks tion relevant to the topic at hand. in new settings fraught with rich sensory input can be attributed to the ability to exploit knowledge learned We believe that theory formation will involve a more during previous learning tasks. The fact that humans complex process in which the corpus is continually ana- learn continuously – increasingly more complicated con- lyzed and knowledge is acquired increasingly over time. cepts and behaviors are developed over the course of an The amount and richness of information that can be entire lifetime – has motivated the development of boot- gleaned from large, heterogeneous corpora such as the strapping algorithms in which knowledge is transferred Web will require an ongoing or lifelong learning pro- over time from simple tasks to increasingly more diffi- cess in which earlier learned knowledge is used to guide cult ones. subsequent learning. While open IE further automated the process of relation discovery, theory formation in- By definition, lifelong learning agents reduce the dif- volves the challenge of intelligently mechanizing the ex- ficulty of solving the nth learning problem they face ploration of concepts within a given domain. by using knowledge acquired from having solved ear- lier problems. Thus, the learning process of a lifelong This paper introduces Alice,1 one of the first lifelong learning agent happens incrementally; learning occurs learning agents to automatically build a collection of at every time step and knowledge acquired can be used concepts, facts and generalizations about a particular later. Learning also happens hierarchically; knowledge topic directly from a large volume of Web text. Over is acquired in a bottom-up fashion and can be sub- time, Alice uses knowledge gained about attributes of sequently built upon and altered. This type of be- the domain to focus its search for additional knowledge. havior has been implemented in several autonomous agents, mostly with application to robotics [17, 13] and 1Not to be confused with the infamous chatbot with the same name, the name Alice was inspired by the fictional reinforcement-learning-based tasks [10]. creator of a text-reading agent in Astro Teller’s novel, Exe- gesis [15]. An agent that learns incrementally inherently possesses a mechanism for identifying and executing a series of In addition to the TextRunner system that provides simple problem-solving tasks from the overall complex Alice with the ability to uncover a set of specific facts problem at hand. This process has historically been about the domain, Alice’s set of learners currently in- cast in terms of one of the most pervasive paradigms cludes modules for concept discovery and proposition of AI – learning as search. One of the earliest systems abstraction.