Strategies for Lifelong Knowledge Extraction from the Web
Michele Banko and Oren Etzioni
Turing Center
University of Washington, Computer Science and Engineering
Box 352350, Seattle, WA 98195, USA
[email protected], [email protected]
ABSTRACT
The increasing availability of electronic text has made it possible to acquire information using a variety of techniques that leverage the expertise of both humans and machines. In particular, the field of Information Extraction (IE), in which knowledge is extracted automatically from text, has shown promise for large-scale knowledge acquisition.

While IE systems can uncover assertions about individual entities with an increasing level of sophistication, text understanding – the formation of a coherent theory from a textual corpus – involves representation and learning abilities not currently achievable by today's IE systems. Compared to the individual relational assertions output by IE systems, a theory includes coherent knowledge of abstract concepts and the relationships among them.

We believe that the ability to fully discover the richness of knowledge present within large, unstructured and heterogeneous corpora will require a lifelong learning process in which earlier learned knowledge is used to guide subsequent learning. This paper introduces Alice, a lifelong learning agent whose goal is to automatically discover a collection of concepts, facts and generalizations that describe a particular topic of interest directly from a large volume of Web text. Building upon recent advances in unsupervised information extraction, we demonstrate that Alice can iteratively discover new concepts and compose general domain knowledge with a precision of 78%.

Categories and Subject Descriptors
I.2.6 [Artificial Intelligence]: Learning—Knowledge Acquisition; I.2.8 [Artificial Intelligence]: Problem Solving, Control Methods, and Search—Control theory

General Terms
Algorithms

K-CAP'07, October 28–31, 2007, Whistler, British Columbia, Canada.

1. INTRODUCTION
The accumulation of knowledge takes on a variety of forms, ranging from the provision of information by domain experts and volunteer contributors, to techniques that extract information from structured data, to systems that seek to understand natural language. Driven by the increasing availability of online information and recent advances in the fields of machine learning and natural language processing, the AI community is increasingly optimistic that knowledge capture from text is within reach [4, 9]. Unsupervised algorithms that exploit the availability of increasingly large volumes of Web pages are learning to extract structured knowledge from unstructured text.

Previous efforts in text-based knowledge acquisition can largely be attributed to the field of Information Extraction (IE), where the task is to recognize entities and relations mentioned within text corpora. Traditional IE systems focused on locating instances of narrow, pre-specified relations, such as the time and place of events, from small, homogeneous corpora. The KnowItAll system [5] advanced the field of IE by capturing knowledge in a manner that scaled to the size and diversity of relationships present within millions of Web pages. KnowItAll accomplished this task by learning to label its own training examples using only a set of domain-independent extraction patterns and a bootstrapping procedure. While KnowItAll is capable of self-supervising its training process, extraction is not fully automatic; KnowItAll requires a user to name a relation prior to each extraction cycle for every relation of interest. When acquiring knowledge from corpora as large and varied as the Web, the task of anticipating all relations of interest becomes highly problematic.

A recent goal for knowledge extraction systems has been to eliminate the need for human involvement, thus making it possible to automate the knowledge acquisition process for a new domain or set of relations. Such efforts have resulted in the paradigm of preemptive or open information extraction. Compared to traditional information extraction, open information extraction systems attempt to automatically discover all possible relations from each sentence encountered. Shinyama and Sekine [12] took important steps in the direction of open IE by developing a method for unrestricted relation extraction that could be applied to a small corpus of newswire articles. The TextRunner system [1] was the first open information extraction system to do this at Web scale, processing just over 110,000,000 Web pages and yielding over 330,000,000 statements about concrete entities with a precision of 88%.

Despite advances in information extraction, the process of text understanding – the formation of a coherent set of beliefs from a textual corpus – involves representation and learning abilities not currently achievable by today's IE systems. While IE systems can uncover assertions about individual objects, a theory of a particular domain is built from collective knowledge of concepts and relationships among them. Compared to a vast set of independent statements extracted through IE, a domain theory represents knowledge compactly; relationships are expressed between abstract concepts at an appropriate level of generalization in a hierarchy. Additionally, a domain theory contains only information relevant to the topic at hand.

We believe that theory formation will involve a more complex process in which the corpus is continually analyzed and knowledge is acquired incrementally over time. The amount and richness of information that can be gleaned from large, heterogeneous corpora such as the Web will require an ongoing or lifelong learning process in which earlier learned knowledge is used to guide subsequent learning. While open IE further automated the process of relation discovery, theory formation involves the challenge of intelligently mechanizing the exploration of concepts within a given domain.

This paper introduces Alice,¹ one of the first lifelong learning agents to automatically build a collection of concepts, facts and generalizations about a particular topic directly from a large volume of Web text. Over time, Alice uses knowledge gained about attributes of the domain to focus its search for additional knowledge.

¹Not to be confused with the infamous chatbot of the same name; the name Alice was inspired by the fictional creator of a text-reading agent in Astro Teller's novel, Exegesis [15].

In this paper, we:

• Develop the paradigm of lifelong learning in the context of hierarchical, unsupervised knowledge acquisition from text.

• Describe Alice, a lifelong learning agent that builds upon recent advances in unsupervised information extraction to discover new concepts and compose abstract domain knowledge with a precision of 78%.

• Report on a collection of lifelong learning strategies that enable Alice to iteratively acquire a textual theory at Web scale.

The remainder of the paper is organized as follows. In Section 2, we reintroduce the paradigm of lifelong learning previously articulated by members of the AI community. In Section 3, we describe Alice, our embodiment of a lifelong learning agent that builds a domain theory from text. We report on experimental results in Section 4, followed by a review of related research in textual theory composition in Section 5. The paper concludes with a discussion of future work.

2. LIFELONG LEARNING
The paradigm of "lifelong learning" was articulated in the machine learning community in response to stark differences observed between human and machine learning abilities [16]. While humans can successfully learn behaviors having had only a small set of marginally related experiences, machine learning algorithms typically have difficulty learning from small datasets. The ability of humans to learn when faced with complex tasks in new settings fraught with rich sensory input can be attributed to the ability to exploit knowledge learned during previous learning tasks. The fact that humans learn continuously – increasingly complicated concepts and behaviors are developed over the course of an entire lifetime – has motivated the development of bootstrapping algorithms in which knowledge is transferred over time from simple tasks to increasingly difficult ones.

By definition, lifelong learning agents reduce the difficulty of solving the nth learning problem they face by using knowledge acquired from having solved earlier problems. Thus, the learning process of a lifelong learning agent happens incrementally; learning occurs at every time step and knowledge acquired can be used later. Learning also happens hierarchically; knowledge is acquired in a bottom-up fashion and can be subsequently built upon and altered. This type of behavior has been implemented in several autonomous agents, mostly with application to robotics [17, 13] and reinforcement-learning-based tasks [10].

An agent that learns incrementally inherently possesses a mechanism for identifying and executing a series of simple problem-solving tasks drawn from the overall complex problem at hand. This process has historically been cast in terms of one of the most pervasive paradigms of AI – learning as search. One of the earliest systems to demonstrate this notion in the area of knowledge representation and discovery was Automated Mathematician (AM) [6]. AM was an agent that attempted to discover new concepts in mathematics and the relationships between them using a heuristic-driven search strategy. While AM notably suffered from several limitations, it demonstrated that open-ended theory formation could be mechanized and modeled as search.

3. LIFELONG KNOWLEDGE ACQUISITION WITH ALICE
This section introduces Alice, a lifelong learning agent whose goal is to iteratively and hierarchically construct a theory – a collection of concepts, facts and generalizations that describe a particular domain of interest – directly from text. Alice begins with a domain-specific textual corpus, a source of background knowledge, and a control strategy, and embarks on a search to update and refine a theory of the domain. Alice's domain theory is iteratively updated with various forms of knowledge, including concepts and their instances, attributes of concepts, and the various relationships among them. Concepts abstractly refer to all instances of a given category or class of entities (e.g. orange is an instance of the concept,

In addition to the TextRunner system that provides Alice with the ability to uncover a set of specific facts about the domain, Alice's set of learners currently includes modules for concept discovery and proposition abstraction.

Each of Alice's learning tasks is defined by a concept c and a set of attributes π(c). Attributes take the form of relations associated with the current concept according to Alice's unrestricted relation extraction process. In our current implementation, relations in π(c) are explored in descending order of the frequency with which they have been observed to occur in the corpus along with a sample of known instances of the concept c. Although π is currently implemented as a heuristic function, we aim to have Alice learn π as well. We describe Alice's learning process in more detail in Section 3.1.

The final property that describes Alice's behavior is S, the strategy that defines the order in which individual learning tasks are addressed. Instead of taking an uninformed, exhaustive path to theory construction, Alice is guided by knowledge accrued about attributes perceived to be highly relevant to understanding of the domain. One strategy may be for the system to thoroughly explore properties of the first concept it discovers before attempting to learn about other concepts encountered subsequently. Alternatively, one can imagine an agent in possession of a short attention span who, upon discovery of a new concept, immediately shifts its
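The control machinery just described – learning tasks that pair a concept c with an attribute relation from π(c), ordered by corpus frequency and sequenced by a strategy S – can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy `learn` callback, and the two strategy policies are hypothetical.

```python
from collections import deque

def pi(concept, relation_freq):
    """Attributes of a concept: relations observed with it in the corpus,
    explored in descending order of frequency."""
    rels = relation_freq.get(concept, {})
    return sorted(rels, key=rels.get, reverse=True)

def run(seed_concept, relation_freq, learn, strategy="breadth", max_tasks=100):
    """Agenda-driven lifelong learning loop: completing one task both
    updates the theory and generates subsequent learning tasks."""
    theory = []
    agenda = deque((seed_concept, r) for r in pi(seed_concept, relation_freq))
    done = 0
    while agenda and done < max_tasks:
        concept, relation = agenda.popleft()
        new_knowledge, new_concepts = learn(concept, relation)
        theory.extend(new_knowledge)            # update the current theory ...
        for c in new_concepts:                  # ... and the future agenda
            tasks = [(c, r) for r in pi(c, relation_freq)]
            if strategy == "depth":
                agenda.extendleft(reversed(tasks))  # explore new concept first
            else:
                agenda.extend(tasks)                # breadth-style exploration
        done += 1
    return theory
```

The `strategy` switch mirrors the two behaviors contrasted in the text: exhaustively exploring the first concept discovered ("depth") versus shifting attention to newly discovered concepts only after the current frontier is exhausted ("breadth").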
Alice's general learning architecture is depicted in Figure 1. From a given input text corpus T in a domain D, Alice begins by using the TextRunner open information extraction system to extract facts about individual objects in the domain from each sentence in the corpus. A fact takes the form of a relational tuple f = (e_i, r_{i,j}, e_j), where e_i and e_j are strings that denote objects in the domain, and r_{i,j} is a string that expresses a relationship believed to exist between them. Each tuple is assigned a probability based on a probabilistic model of redundancy in text [3].

The set of individual assertions output by TextRunner serves as the basis from which Alice then proceeds to add general knowledge to its domain theory. Beginning with a single domain-specific concept c_0, which can be specified manually or heuristically, Alice uses a set of learners L and a given search strategy S to develop an agenda A of learning tasks to be completed over time.

The knowledge output upon completion of a learning task is used in two ways: to update the current domain theory and to generate subsequent learning tasks. This very behavior is what makes Alice a lifelong agent – Alice uses the knowledge acquired during the nth learning task to specify its future learning agenda. Alice's general learning behavior is summarized in Figure 2.

3.1 Theory Learning
For our experiments, we have given Alice access to 2.5 million Web pages focused on the topic of nutrition. As an initial source of knowledge, Alice makes use of WordNet,³ a hand-built semantic lexicon for the English language that groups 117,097 distinct nouns into 81,426 concepts. WordNet also expresses a handful of semantic relationships between concepts, such as hypernymy/hyponymy and holonymy/meronymy. Despite its ability to cover a wide range of concepts, this knowledge

²A domain theory may also contain rules describing dependencies among propositions (e.g. GrowIn(x, y) ∧ IsA(x,

³Although Alice receives a background theory as input,
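The relational-tuple representation f = (e_i, r_{i,j}, e_j) described above can be made concrete with a small sketch. The class layout and the toy redundancy score below are illustrative assumptions, standing in for TextRunner's actual data structures and the probabilistic model of [3].

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fact:
    """A TextRunner-style relational tuple f = (e_i, r_ij, e_j)."""
    e_i: str     # string naming the first object
    r_ij: str    # string expressing the relationship between them
    e_j: str     # string naming the second object
    prob: float  # redundancy-based confidence in [0, 1]

def redundancy_score(count):
    """Toy stand-in for the redundancy model: tuples extracted from many
    independent sentences receive probabilities approaching 1."""
    return count / (count + 1.0)

f = Fact("oranges", "provide", "vitamin C", redundancy_score(12))
```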
Figure 1: System Overview of Alice. Given a domain, corpus and background theory, Alice iteratively adds general, domain knowledge to its theory. The output of each learning cycle suggests the focus of subsequent learning tasks. Examples of the concepts, instances and relations automatically added to Alice's theory are given in boldface.

source falls short in its ability to offer complete coverage of the domain. WordNet's knowledge of entities and concepts is incomplete, and it fails to specify the myriad relationships that exist between them. For example, in the nutrition domain, we observed omissions such as the lack of an entry for the noni, an exotic fruit touted for its potent antioxidant properties, no concept of "healthy food," and the inability to express the belief that foods rich in antioxidants may prevent disease if present in one's diet.

3.1.1 Concept Discovery
Although the initial theory provided by WordNet contains a large number of entities and concepts, it is by no means exhaustive. Therefore, in addition to the hypernymy and hyponymy information available in WordNet, Alice acquires class membership information directly from its corpus using the set of domain-independent extraction patterns and probabilistic assessment model employed by KnowItAll. This data-driven approach makes it possible to acquire knowledge about entities missing from WordNet with high accuracy.

to which they belong, such as the knowledge that buckwheat is a type of grain and that grain is a type of food. Yet the concept hierarchy is in some places incomplete. Alice further utilizes the small set of extraction patterns to add more fine-grained classes of existing concepts present in its theory. By applying the constrained extraction patterns using our knowledge of buckwheat (e.g. "<?> such as buckwheat", "buckwheat is a <? food>") and using a redundancy-based model of fact assessment over Alice's corpus, Alice finds that buckwheat is not just any grain or food, but more specifically, a type of whole grain, gluten-free grain, fiber-rich food and nutritious food.

Occasionally we find that an existing WordNet concept possesses more than one sense. While fruit is most often used to refer to a ripe object produced by a seed plant, it can also designate the consequence of some action. While it may be possible to disambiguate among various senses of a word, we found that the heuristic of placing the newly discovered subcategory under the most frequent sense of the word according to WordNet served well.
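The hypernym patterns quoted above can be instantiated mechanically for a known instance. The sketch below – with a hypothetical mini-corpus and regular expressions approximating the "<class> such as <instance>" and "<instance> is a <class>" forms – shows how candidate classes might be collected and counted; it omits the redundancy-based assessment step and is not the KnowItAll pattern set itself.

```python
import re

def candidate_classes(instance, sentences):
    """Collect candidate class labels for `instance` by instantiating two
    domain-independent hypernym patterns over a corpus of sentences."""
    inst = re.escape(instance.lower())
    patterns = [
        re.compile(r"([\w-]+(?: [\w-]+)?) such as " + inst),  # "<class> such as X"
        re.compile(inst + r" is a ([\w-]+(?: [\w-]+)?)"),     # "X is a <class>"
    ]
    counts = {}
    for sentence in sentences:
        s = sentence.lower()
        for p in patterns:
            for m in p.finditer(s):
                counts[m.group(1)] = counts.get(m.group(1), 0) + 1
    return counts

corpus = ["Whole grains such as buckwheat are rich in fiber.",
          "Buckwheat is a gluten-free grain."]
classes = candidate_classes("buckwheat", corpus)
```

In a full pipeline, the counts returned here would feed a redundancy-based assessor, and each surviving class would be attached under the most frequent WordNet sense of its head noun, as the text describes.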
At the beginning of each learning task, Alice receives an unexplored concept c_n as input, instantiates the set of extraction patterns (e.g.

Upon inspection of approximately 100 random samples of subclasses proposed by this method, we found that
The task of proposition abstraction can also be cast as the process of estimating the likelihood that a given concept appears as an argument to a given relation:

    M(arg_i = c | r) = w_c * ( Σ_{j=1}^{|T_k|} Count(r, e_{i,j} ∈ c) ) / |T_k|

Yet relations often take multiple concepts as arguments – the predicate Provide(x,y) can describe both a relationship between
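Under this reading, the abstraction score can be computed directly from the tuple store. In the sketch below, T_k is taken to be the set of tuples containing relation r, the concept c is represented by its set of known instance strings, and w_c defaults to 1; these representational choices are assumptions made for illustration.

```python
def abstraction_score(instances_of_c, tuples, r, i, w_c=1.0):
    """M(arg_i = c | r): the weighted fraction of tuples mentioning relation
    r whose i-th argument (i = 1 or 2) is a known instance of concept c.
    `tuples` holds (arg1, relation, arg2) string triples."""
    T_k = [t for t in tuples if t[1] == r]
    if not T_k:
        return 0.0
    count = sum(1 for t in T_k if (t[0] if i == 1 else t[2]) in instances_of_c)
    return w_c * count / len(T_k)

facts = [("oranges", "provide", "vitamin C"),
         ("beans", "provide", "protein"),
         ("exercise", "provide", "benefits")]
food = {"oranges", "beans"}
score = abstraction_score(food, facts, "provide", 1)  # 2 of 3 first arguments are foods
```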
Compared to best-first search, associative and breadth-limited search fared much better, adding appropriately general, domain-specific abstractions to Alice's theory with a precision of 78.0% and 75.5%, respectively. The bulk of the remaining assertions were characterized as on-topic but vacuously true – 9.5% in the case of associative search, and 12.5% in the case of breadth-limited search. The small number of propositions judged as "incomplete" is due to a limitation of the knowledge representation scheme in TextRunner, in which n-ary relations are currently not handled. The few instances deemed to be "errors" were largely due to cascading errors originating with TextRunner or an inability to correctly disambiguate objects having multiple senses.

We also compared the recall of the three approaches, as shown in Figure 3.

Figure 3: Knowledge Acquisition By Search Strategy. For each algorithm, recall is computed as the total number of distinct abstractions proposed after each search iteration.

5. RELATED WORK
To date, the task of inducing domain theories directly from textual corpora has been explored by only a few systems. Liakata and Pulman [7] showed that they could induce logically structured inference rules from a 4000-word corpus describing company succession events. While the method was anecdotally shown to learn a handful of useful domain-specific rules, to our knowledge, an extensive empirical evaluation of the system output has not been reported.

While previous efforts to augment the WordNet ontology are too numerous to discuss here, most recently, Suchanek et al. developed YAGO [14], a system capable of unifying facts automatically extracted from Wikipedia Web pages with concepts in WordNet. While YAGO performs this unification with an accuracy of 95%, its coverage is currently limited to a handful of relations that are derivable using the structure and headings contained within Wikipedia pages.

Earlier, Schubert and Tong [11] developed a method for extracting "general world knowledge" of the flavor deduced by Alice. Their approach took a set of human-validated parse trees from the Penn Treebank, and used a combination of tree-based pattern matching and heuristics to obtain a set of abstract propositions. Using criteria similar to those we have employed in our evaluation of Alice, human assessors found about 60% of statements extracted by this method to be "reasonable general claims." Also related to the task of deriving relationships between concepts is Clark and Weir's use of parsed corpora and chi-squared statistics to induce relationships between existing concepts in WordNet [2]. Their method was evaluated indirectly by its ability to improve a natural language disambiguation task, on which a precision of around 75% was obtained.

While these methods attempt to solve the same task of finding abstract propositions within text, their inherent assumptions – the use of small, parsed corpora – make it impossible to perform a direct comparison relative to the method we have presented in this paper. The success of Alice relies not on the ability to parse its input corpus, which is orders of magnitude larger, but on unsupervised techniques that exploit the redundancy of information found within unstructured text.

6. CONCLUSIONS AND FUTURE WORK
We have introduced Alice, one of the first lifelong learning agents capable of building a domain theory from a large collection of Web text. Alice uses information from previous knowledge acquisition tasks to iteratively compose individual statements output by a state-of-the-art information extraction system into an abstract domain theory with a precision of 78%.

In the immediate future, we plan to explore the notion of mutual recursion in the context of Alice. Mutual recursion can take several forms, including the use of the domain theory output by Alice to improve modules like TextRunner that underlie its learning process, and the construction of inference rules from Alice's general world knowledge. Finally, we plan to explore what Alice can learn from its search through the space of knowledge, making it possible to both eliminate fruitless learning tasks and identify areas in which Alice's theory can be further enriched.

7. ACKNOWLEDGMENTS
This research was supported in part by NSF grants IIS-0535284 and IIS-0312988, DARPA contract NBCHD030010, and ONR grant N00014-05-1-0185, as well as gifts from Google, and was carried out at the University of Washington's Turing Center.

8. REFERENCES
[1] M. Banko, M. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni. Open information extraction from the web. In Procs. of IJCAI, 2007.
[2] S. Clark and D. Weir. Class-based probability estimation using a semantic hierarchy. Computational Linguistics, 28(2), 2002.
[3] D. Downey, O. Etzioni, and S. Soderland. A probabilistic model of redundancy in information extraction. In Procs. of IJCAI, 2005.
[4] O. Etzioni, M. Banko, and M. Cafarella. Machine reading. In Procs. of AAAI, 2006.
[5] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A. Popescu, T. Shaked, S. Soderland, D. Weld, and A. Yates. Unsupervised named-entity extraction from the web: An experimental study. Artificial Intelligence, 165(1):91–134, 2005.
[6] D. Lenat. Automated theory formation in mathematics. In Procs. of IJCAI, 1977.
[7] M. Liakata and S. Pulman. Learning theories from text. In Procs. of COLING, 2004.
[8] J. B. MacQueen. Some methods for classification and analysis of multivariate observations. In Procs. of the 5th Berkeley Symposium on Mathematical Statistics and Probability, 1967.
[9] T. Mitchell. Reading the web: A breakthrough goal for AI. In AI Magazine. AAAI Press, 2005.
[10] M. B. Ring. CHILD: A first step towards continual learning. Machine Learning, 28:77–105, 1997.
[11] L. Schubert and M. Tong. Extracting and evaluating general world knowledge from the brown corpus. In Procs. of the HLT/NAACL Workshop on Text Meaning, 2003.
[12] Y. Shinyama and S. Sekine. Preemptive information extraction using unrestricted relation discovery. In Procs. of HLT/NAACL, 2006.
[13] P. Stone and M. Veloso. Layered learning. In Procs. of ECML, 2000.
[14] F. Suchanek, G. Kasneci, and G. Weikum. Yago: A core of semantic knowledge. In Procs. of WWW, 2007.
[15] A. Teller. Exegesis. Random House, 1999.
[16] S. Thrun. Lifelong learning algorithms. In S. Thrun and L. Pratt, editors, Learning To Learn. Kluwer Academic Publishers, 1998.
[17] S. Thrun and T. Mitchell. Lifelong robot learning. Robotics and Autonomous Systems, 15:25–46, 1995.