
Strategies for Lifelong Knowledge Extraction from the Web

Michele Banko and Oren Etzioni
Turing Center
University of Washington, Computer Science and Engineering
Box 352350, Seattle, WA 98195, USA
[email protected], [email protected]

ABSTRACT

The increasing availability of electronic text has made it possible to acquire information using a variety of techniques that leverage the expertise of both humans and machines. In particular, the field of Information Extraction (IE), in which knowledge is extracted automatically from text, has shown promise for large-scale knowledge acquisition.

While IE systems can uncover assertions about individual entities with an increasing level of sophistication, text understanding – the formation of a coherent theory from a textual corpus – involves representation and learning abilities not currently achievable by today’s IE systems. Compared to the individual relational assertions output by IE systems, a theory includes coherent knowledge of abstract concepts and the relationships among them.

We believe that the ability to fully discover the richness of knowledge present within large, unstructured and heterogeneous corpora will require a lifelong learning process in which earlier learned knowledge is used to guide subsequent learning. This paper introduces Alice, a lifelong learning agent whose goal is to automatically discover a collection of concepts, facts and generalizations that describe a particular topic of interest directly from a large volume of Web text. Building upon recent advances in unsupervised information extraction, we demonstrate that Alice can iteratively discover new concepts and compose general domain knowledge with a precision of 78%.

Categories and Subject Descriptors

I.2.6 [Artificial Intelligence]: Learning—Knowledge Acquisition; I.2.8 [Artificial Intelligence]: Problem Solving, Control Methods, and Search—Control theory

General Terms

Algorithms

1. INTRODUCTION

The accumulation of knowledge takes on a variety of forms, ranging from the provision of information by domain experts and volunteer contributors, to techniques that extract information from structured sources, to systems that seek to understand natural language. Driven by the increasing availability of online information and recent advances in the fields of machine learning and natural language processing, the AI community is increasingly optimistic that knowledge capture from text is within reach [4, 9]. Unsupervised algorithms that exploit the availability of increasingly large volumes of Web pages are learning to extract structured knowledge from unstructured text.

Previous efforts in text-based knowledge acquisition can largely be attributed to the field of Information Extraction (IE), where the task is to recognize entities and relations mentioned within text corpora. Traditional IE systems focused on locating instances of narrow, pre-specified relations, such as the time and place of events, from small, homogeneous corpora. The KnowItAll system [5] advanced the field of IE by capturing knowledge in a manner that scaled to the size and diversity of relationships present within millions of Web pages. KnowItAll accomplished this task by learning to label its own training examples using only a set of domain-independent extraction patterns and a bootstrapping procedure. While KnowItAll is capable of self-supervising its training process, extraction is not fully automatic; KnowItAll requires a user to name a relation prior to each extraction cycle for every relation of interest. When acquiring knowledge from corpora as large and varied as the Web, the task of anticipating all relations of interest becomes highly problematic.

A recent goal for knowledge extraction systems has been to eliminate the need for human involvement, thus making it possible to automate the knowledge acquisition process for a new domain or set of relations. Such efforts have resulted in the paradigm of preemptive or open information extraction.
Compared to traditional information extraction, open information extraction systems attempt to automatically discover all possible relations from each sentence encountered. Shinyama and Sekine [12] took important steps in the direction of open IE by developing a method for unrestricted relation extraction that could be applied to a small corpus of newswire articles. The TextRunner system [1] was the first open information extraction system to operate at Web scale, processing just over 110,000,000 Web pages and yielding over 330,000,000 statements about concrete entities with a precision of 88%.

Despite advances in information extraction, the process of text understanding – the formation of a coherent set of beliefs from a textual corpus – involves representation and learning abilities not currently achievable by today’s IE systems. While IE systems can uncover assertions about individual objects, a theory of a particular domain is built from collective knowledge of concepts and relationships among them. Compared to a vast set of independent statements extracted through IE, a domain theory represents knowledge compactly; relationships are expressed between abstract concepts at an appropriate level of generalization in a hierarchy. Additionally, a domain theory contains only information relevant to the topic at hand.

We believe that theory formation will involve a more complex process in which the corpus is continually analyzed and knowledge is acquired incrementally over time. The amount and richness of information that can be gleaned from large, heterogeneous corpora such as the Web will require an ongoing or lifelong learning process in which earlier learned knowledge is used to guide subsequent learning. While open IE further automated the process of relation discovery, theory formation involves the challenge of intelligently mechanizing the exploration of concepts within a given domain.

This paper introduces Alice,¹ one of the first lifelong learning agents to automatically build a collection of concepts, facts and generalizations about a particular topic directly from a large volume of Web text. Over time, Alice uses knowledge gained about attributes of the domain to focus its search for additional knowledge.

¹ Not to be confused with the infamous chatbot of the same name; the name Alice was inspired by the fictional creator of a text-reading agent in Astro Teller’s novel, Exegesis [15].

In this paper, we:

• Develop the paradigm of lifelong learning in the context of hierarchical, unsupervised knowledge acquisition from text.

• Describe Alice, a lifelong learning agent that builds upon recent advances in unsupervised information extraction to discover new concepts and compose abstract domain knowledge with a precision of 78%.

• Report on a collection of lifelong learning strategies that enable Alice to iteratively acquire a textual theory at Web scale.

The remainder of the paper is organized as follows. In Section 2, we reintroduce the paradigm of lifelong learning previously articulated by members of the AI community. In Section 3, we describe Alice, our embodiment of a lifelong learning agent that builds a domain theory from text. We report on experimental results in Section 4, followed by a review of related research in textual theory composition in Section 5. The paper concludes with a discussion of future work.
2. LIFELONG LEARNING

The paradigm of “lifelong learning” was articulated in the machine learning community in response to stark differences observed between human and machine learning abilities [16]. While humans can successfully learn behaviors having had only a small set of marginally related experiences, machine learning algorithms typically have difficulty learning from small datasets. The ability of humans to learn when faced with complex tasks in new settings fraught with rich sensory input can be attributed to the ability to exploit knowledge learned during previous learning tasks. The fact that humans learn continuously – increasingly complicated concepts and behaviors are developed over the course of an entire lifetime – has motivated the development of bootstrapping algorithms in which knowledge is transferred over time from simple tasks to increasingly more difficult ones.

By definition, lifelong learning agents reduce the difficulty of solving the nth learning problem they face by using knowledge acquired from having solved earlier problems. Thus, the learning process of a lifelong learning agent happens incrementally; learning occurs at every time step and knowledge acquired can be used later. Learning also happens hierarchically; knowledge is acquired in a bottom-up fashion and can be subsequently built upon and altered. This type of behavior has been implemented in several autonomous agents, mostly with application to robotics [17, 13] and reinforcement-learning-based tasks [10].

An agent that learns incrementally inherently possesses a mechanism for identifying and executing a series of simple problem-solving tasks drawn from the overall complex problem at hand. This process has historically been cast in terms of one of the most pervasive paradigms of AI – learning as search. One of the earliest systems to demonstrate this notion in the area of knowledge representation and discovery was Automated Mathematician (AM) [6]. AM was an agent that attempted to discover new concepts in mathematics and the relationships between them using a heuristic-driven search strategy. While AM notably suffered from several limitations, it demonstrated that open-ended theory formation could be mechanized and modeled as search.

3. LIFELONG KNOWLEDGE ACQUISITION WITH ALICE

This section introduces Alice, a lifelong learning agent whose goal is to iteratively and hierarchically construct a theory – a collection of concepts, facts and generalizations that describe a particular domain of interest – directly from text.
Alice begins with a domain-specific textual corpus, a source of background knowledge, and a control strategy, and embarks on a search to update and refine a theory of the domain. Alice’s domain theory is iteratively updated with various forms of knowledge, including concepts and their instances, attributes of concepts, and the various relationships among them. Concepts abstractly refer to all instances of a given category or class of entities (e.g. orange is an instance of the concept <fruit>). Attributes of concepts include the various propositions or relations in which they take part (e.g. a <fruit> is something that Grows). Relationships describe how concepts or instances are associated (e.g. GrowIn(<fruit>, <region>)).²

² A domain theory may also contain rules describing dependencies among propositions (e.g. GrowIn(x, y) ∧ IsA(x, <tropical fruit>) → HasWarmClimate(y)). We anticipate adding the ability to learn rules in future work.
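To make these definitions concrete, here is a minimal sketch of how a theory’s building blocks might be represented. The paper does not prescribe an implementation; all of the names below (Concept, Proposition, Theory) are our own illustration, in Python:

    from dataclasses import dataclass, field

    @dataclass
    class Concept:
        """A category such as 'fruit'; 'orange' would be one of its instances."""
        name: str
        instances: set = field(default_factory=set)
        parents: set = field(default_factory=set)  # hypernyms, e.g. {'food'}

    @dataclass(frozen=True)
    class Proposition:
        """An abstract relationship, e.g. GrowIn(<fruit>, <region>)."""
        relation: str   # 'GrowIn'
        arg1: str       # 'fruit'
        arg2: str       # 'region'

    class Theory:
        """The evolving collection of concepts, instances and propositions."""
        def __init__(self):
            self.concepts = {}         # concept name -> Concept
            self.propositions = set()  # general claims about the domain

        def add_instance(self, concept_name, instance):
            c = self.concepts.setdefault(concept_name, Concept(concept_name))
            c.instances.add(instance)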

Alice’s general learning architecture is depicted in Figure 1. From a given input text corpus T in a domain D, Alice begins by using the TextRunner open information extraction system to extract facts about individual objects in the domain from each sentence in the corpus. A fact takes the form of a relational tuple f = (e_i, r_{i,j}, e_j), where e_i and e_j are strings that denote objects in the domain, and r_{i,j} is a string that expresses a relationship believed to exist between them. Each tuple is assigned a probability based on a probabilistic model of redundancy in text [3].

The set of individual assertions output by TextRunner serves as the basis from which Alice then proceeds to add general knowledge to its domain theory. Beginning with a single domain-specific concept c0, which can be specified manually or heuristically, Alice uses a set of learners L and a given search strategy S to develop an agenda A of learning tasks to be completed over time.

In addition to the TextRunner system that provides Alice with the ability to uncover a set of specific facts about the domain, Alice’s set of learners currently includes modules for concept discovery and proposition abstraction.

Each of Alice’s learning tasks is defined by a concept c and a set of attributes π(c). Attributes take the form of relations associated with the current concept according to Alice’s unrestricted relation extraction process. In our current implementation, relations in π(c) are explored in descending order of the frequency with which they have been observed to occur in the corpus along with a sample of known instances of the concept c. Although π is currently implemented as a heuristic function, we aim to have Alice learn π as well. We describe Alice’s learning process in more detail in Section 3.1.

The final property that describes Alice’s behavior is S, the strategy that defines the order in which individual learning tasks are addressed. Instead of taking an uninformed, exhaustive path to theory construction, Alice is guided by knowledge accrued about attributes perceived to be highly relevant to understanding of the domain. One strategy may be for the system to thoroughly explore properties of the first concept it discovers before attempting to learn about other concepts encountered subsequently. Alternatively, one can imagine an agent in possession of a short attention span who, upon discovery of a new concept, immediately shifts its current focus to learn something about the new item. Thus, S can be exhaustive (e.g. depth-first or breadth-first search) or heuristic-driven (e.g. best-first or A* search). We describe several lifelong search strategies in Section 3.2.

The knowledge output upon completion of a learning task is used in two ways: to update the current domain theory and to generate subsequent learning tasks. This very behavior is what makes Alice a lifelong agent – Alice uses the knowledge acquired during the nth learning task to specify its future learning agenda. Alice’s general learning behavior is summarized in Figure 2.

    Alice(Corpus T, Background Theory θ, Search Strategy S)
      Facts F ← TextRunner(T)
      Agenda A ← {c0}
      For n = 0 ... ∞
        cn ← next(A)
        If IsNewConcept(cn)
          θ′ ← DiscoverConcept(cn, T, θ)
          θ ← θ′
        r = SelectAttribute(cn)
        θ′ ← Abstract(r, cn, F, θ)
        A ← Insert(A, cn+1 ... cn+i) according to S
        θ ← θ′

Figure 2: Alice’s Lifelong Learning Process
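Read as ordinary code, the loop of Figure 2 might look as follows. This is a minimal runnable sketch in Python: every helper below is a stub of our own devising, and only the control flow mirrors the figure (we also bound the loop, whereas the figure’s runs indefinitely):

    def textrunner(corpus):                      # open IE pass (stub)
        return []

    def discover_concept(c, corpus, theory):     # Section 3.1.1 (stub)
        theory.setdefault(c, set())

    def select_attribute(c, theory):             # the function pi(c) (stub)
        return "contain"                         # most frequent unexplored relation

    def abstract(r, c, facts, theory):           # Section 3.1.2 (stub)
        return []                                # concepts activated along the way

    def alice(corpus, theory, insert, c0, max_steps=200):
        facts = textrunner(corpus)               # F <- TextRunner(T)
        agenda = [c0]                            # A <- {c0}
        for _ in range(max_steps):
            if not agenda:
                break
            c = agenda.pop(0)                    # c_n <- next(A)
            if c not in theory:                  # IsNewConcept(c_n)
                discover_concept(c, corpus, theory)
            r = select_attribute(c, theory)      # r = SelectAttribute(c_n)
            new = abstract(r, c, facts, theory)  # update theta via Abstract(...)
            insert(agenda, new)                  # A <- Insert(...), per strategy S
        return theory

    # e.g. alice(corpus=[], theory={}, insert=lambda a, cs: a.extend(cs), c0="food")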

Figure 1: System Overview of Alice. Given a domain, corpus and background theory, Alice iteratively adds general, domain knowledge to its theory. The output of each learning cycle suggests the focus of subsequent learning tasks. Examples of the concepts, instances and relations automatically added to Alice’s theory are given in boldface.

3.1 Theory Learning

For our experiments, we have given Alice access to 2.5 million Web pages focused on the topic of nutrition. As an initial source of knowledge, Alice makes use of WordNet, a hand-built semantic lexicon for the English language that groups 117,097 distinct nouns into 81,426 concepts.³ WordNet also expresses a handful of semantic relationships between concepts, such as hypernymy/hyponymy and holonymy/meronymy. Despite its ability to cover a wide range of concepts, this knowledge source falls short in its ability to offer complete coverage of the domain. WordNet’s knowledge of entities and concepts is incomplete, and it fails to specify the myriad of relationships that exist between them. For example, in the nutrition domain, we observed omissions such as the lack of an entry for the noni, an exotic fruit touted for its potent antioxidant properties, no concept of “healthy food”, and the inability to express the belief that foods rich in antioxidants may prevent disease if present in one’s diet.

³ Although Alice receives a background theory as input, the system learns to add knowledge without hand-tagged examples or manual intervention, and is thus unsupervised.

3.1.1 Concept Discovery

Although the initial theory provided by WordNet contains a large number of entities and concepts, it is by no means exhaustive. Therefore, in addition to the hypernymy and hyponymy information available in WordNet, Alice acquires class membership information directly from its corpus using the set of domain-independent extraction patterns and the probabilistic assessment model employed by KnowItAll. This data-driven approach makes it possible to acquire knowledge about entities missing from WordNet with high accuracy.

At the beginning of each learning task, Alice receives an unexplored concept cn as input, instantiates the set of extraction patterns (e.g. “<class> such as <instance>”) with the name of cn (e.g. “fruit such as <instance>”), applies them over the input corpus, and uses the assessment model to add high-probability instances of cn to its theory.
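As an illustration, pattern instantiation and matching can be reduced to a few lines. The sketch below is our own simplification (a regular-expression stand-in for KnowItAll’s machinery, which also chunks noun phrases and assesses each candidate probabilistically):

    import re

    # Domain-independent patterns; {c} is filled with the concept name.
    PATTERNS = [r"{c}s? such as (\w[\w ]*)", r"(\w[\w ]*) is a {c}"]

    def candidate_instances(concept, sentences):
        """Yield raw candidate instances of `concept` found by the patterns."""
        for template in PATTERNS:
            pattern = re.compile(template.format(c=concept), re.IGNORECASE)
            for sentence in sentences:
                for match in pattern.finditer(sentence):
                    yield match.group(1).strip()

    # candidate_instances("fruit", ["We ate fruits such as oranges."])
    # yields "oranges" -- a real system would keep only instances that pass
    # the probabilistic assessment model with high probability.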

Alice’s initial theory contains useful knowledge in the form of previously discovered instances and some classes to which they belong, such as the knowledge that buckwheat is a type of grain and that grain is a type of food. Yet the concept hierarchy is in some places incomplete. Alice further utilizes the small set of extraction patterns to add more fine-grained classes of existing concepts present in its theory. By applying the constrained extraction patterns using our knowledge of buckwheat (e.g. “<? grain> such as buckwheat”, “buckwheat is a <? food>”) and using a redundancy-based model of fact assessment over Alice’s corpus, Alice finds that buckwheat is not just any grain or food, but more specifically, a type of whole grain, gluten-free grain, fiber-rich food and nutritious food.

Upon inspection of approximately 100 random samples of subclasses proposed by this method, we found that 78.3% of the classes were legitimate and meaningful additions to the theory (e.g. oily fish, leafy green vegetable, complex carbohydrate), 14.1% were vacuous or not immediately useful (e.g. important antioxidant, favorite food), and the remaining 7.6% were errors.

Occasionally we find that an existing WordNet concept possesses more than one sense. While fruit is most often used to refer to a ripe object produced by a seed plant, it can also designate the consequence of some action. While it may be possible to disambiguate among the various senses of a word, we found that the heuristic of placing a newly discovered subcategory under the most frequent sense of the word according to WordNet served well.
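With an off-the-shelf WordNet interface, the most-frequent-sense heuristic is nearly a one-liner. A sketch using NLTK (our choice of library, not the paper’s):

    # Requires: pip install nltk, then nltk.download('wordnet')
    from nltk.corpus import wordnet as wn

    def attach_subclass(subclass, head_word, theory):
        """Place a discovered subclass (e.g. 'oily fish') under the most
        frequent WordNet sense of its head word (e.g. 'fish')."""
        senses = wn.synsets(head_word, pos=wn.NOUN)
        if senses:
            # NLTK lists senses in WordNet's order, most frequent first.
            theory.setdefault(senses[0].name(), set()).add(subclass)

    # attach_subclass('oily fish', 'fish', theory={})  -> filed under 'fish.n.01'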

3.1.2 Proposition Abstraction

Another type of information that can be added to Alice’s domain theory is the general proposition, found by the process of abstraction. Compared to statements about individual entities found by relation extraction in TextRunner, general propositions express relationships among concepts. Such relationships encode “reasonable general claims” about the world from particular facts [11]. In our domain, the general proposition Provide(<fruit>, <vitamin>) should be deduced from the set of tuples {(oranges, provide, vitamin c), (bananas, provide, a source of B vitamins), (an avocado, provides, niacin)}. A key challenge in deducing general claims from a set of isolated facts is finding a suitable level of generalization in the concept hierarchy. Consider, for example, the difference in knowing Provide(<food>, <substance>) versus Provide(<fruit>, <vitamin>). The goal of proposition abstraction is to find the lowest point in the concept hierarchy that describes a set of related instances.

The task of proposition abstraction can also be cast as the process of estimating the likelihood that a given concept appears as an argument to a given relation. Yet relations often take multiple concepts as arguments – the predicate Provide(x, y) can describe both a relationship between <company> and <product> as well as one between <fruit> and <vitamin>. If in addition to the three tuples observed in the previous example, we saw that (Vitamin World, provides, Super Antioxidant Plus), a naive approach might propose Provide(<entity>, <entity>), an abstraction too general to be of use. We use this observation to motivate a clustering-based approach to proposition abstraction, which we now describe.

Given an input relation r whose arguments we wish to generalize, Alice obtains the set of TextRunner tuples t that match the form (e_i, r, e_j). The system then examines each entity e_i and e_j in t and tries to obtain the set of concepts to which they belong. In some cases, Alice will already have this information in the theory; otherwise, Alice will take some time to learn about the possible class and subclass memberships for a new instance using the discovery process described in the previous section. The system automatically clusters the tuples in t, using class-membership features where known, and words frequently observed to appear in the individual entity strings. For each cluster t_k ∈ t, Alice generates a set of abstractions using the following algorithm, and adds them to the theory.

Given a cluster of tuples t_k, Alice obtains the set of entities e_{i,1} ... e_{i,|t_k|} observed in the ith argument position and retrieves the set of concepts that possibly characterize it according to the current theory. The system then computes an association measure M that estimates the likelihood that each concept c describes the ith argument slot in the context of the relation r:

    M(arg_i = c | r) = w_c · ( Σ_{j=1}^{|t_k|} Count(r, e_{i,j} ∈ c) ) / |t_k|

Since we typically prefer more specific concepts to those that are more general, w_c weights each concept according to its place in the theory’s current concept hierarchy, biasing the measure by 1/d, where d is the maximum distance found between any entity e_{i,j} and the concept c under consideration. The algorithm greedily searches for the combination of conceptual slot descriptions that best covers the set of tuples in the cluster, with the constraint that an abstraction must cover a minimum number of tuples to be meaningful. If any tuples remain undescribed by the best abstraction, the procedure is repeated until all items in the cluster are fully covered, or until we can propose no more generalizations.

In order to sidestep the problem of needing to pre-specify the number of clusters to find in the data, we first employed an expectation-maximization clustering algorithm that used cross validation to determine the number of clusters. We found the approach not only to be too slow in practice, but unable to produce clusters that were sufficiently fine-grained. We found a better solution in k-means clustering [8]. Although k-means requires the number of clusters to be specified a priori, the system used its theory to estimate k as the maximum number of distinct concepts present in each argument position according to the entities in the cluster. In practice, this approach results in a large number of clusters, which can easily be reduced via a post-processing step that merges clusters whose centroids are found to be close together. We empirically evaluate Alice’s ability to generalize from individual facts in Section 4.
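In code, the association measure reduces to a weighted coverage count. The following sketch scores one candidate concept for one argument slot; the helper structures (instances_of, the precomputed distance d) are our own framing of the definitions above:

    def score_concept(cluster, slot, concept, instances_of, d):
        """M(arg_i = c | r): the weighted fraction of a cluster's tuples
        whose argument in position `slot` is a known instance of `concept`.

        cluster      -- tuples (e1, relation, e2) sharing one relation r
        slot         -- 0 for the first argument, 2 for the second
        instances_of -- concept name -> set of known instance strings
        d            -- max hierarchy distance from any entity to `concept`
        """
        covered = sum(1 for t in cluster
                      if t[slot] in instances_of.get(concept, set()))
        return (1.0 / max(d, 1)) * covered / len(cluster)

    # The greedy cover then repeatedly picks the best-scoring concept per
    # slot, removes the tuples it covers (subject to a minimum-support
    # threshold), and recurses on the remainder until nothing qualifies.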
3.2 Search

We now describe how Alice generates and prioritizes subsequent theory-learning tasks. Recall that our goal is to automatically construct a theory of a particular topic from a textual corpus. In some cases, such as when working with a set of automatically gathered Web pages or a general-purpose text collection, out-of-domain information may be present. A page might mention the nutrient iron as part of a discussion about health conditions in which the body is unable to absorb it. A text collection might contain pages about an object in the domain in a different sense of the word (e.g. iron, a device used to press clothing). Therefore, Alice’s search must be guided towards domain-specific concepts and relations as much as possible. An uninformed exhaustive strategy that simply iterates over all concepts and relations found in the input corpus, independent of existing domain knowledge, will likely introduce spurious information into our domain theory.

3.2.1 Best-first search

Our first attempt at building a lifelong learner produced a heuristic-driven agent that is eager to explore new concepts at every opportunity. For instance, after beginning with the concept <food> and learning that, in general, the proposition Contain(<food>, <vitamin>) holds, Alice is eager to learn more about the concept of <vitamin>. This inclination feels quite natural. Yet upon subsequently learning that MayPrevent(<vitamin>, <disease>) is true in the general case, Alice prefers to see what can be gleaned by learning about diseases. This greedy approach can sometimes send Alice into a tailspin.

Formally, after an initial concept c0 has been specified to Alice, the agent executes π(c0) to obtain the next relation r associated with the concept. Alice uses the set of learners to add general propositions of r and further develop concepts encountered during this procedure. After completing the nth learning cycle, all concepts c′ which have been activated while learning about r are put into A, which in this case is a priority queue. During best-first search, the concepts in A are ordered in descending order according to how many instances of the concept have been newly discovered by the learner. Other possible heuristics include an ordering of concepts based on distance in the concept hierarchy, or the similarity with which attributes have been observed to occur for concepts under consideration.

3.2.2 Associative search

We developed an alternate search strategy based on yet another natural learning process – search by analogy, or what we will refer to as associative search. Broadly speaking, learning by analogy utilizes the structure and outcome of the agent’s problem-solving history to learn something about a new problem.

During associative search, Alice uses the output of iteration n to immediately conjecture similar learning tasks for iteration n + 1. For example, if Alice learns that Contain(<food>, <vitamin>), it is reasonable to expect a cohesive theory to tell us what, if anything else in the domain, contains vitamins. Associative search provides an elegant manner for exploring related concepts while remaining focused on the domain at hand. If, as before, Alice concluded that MayPrevent(<vitamin>, <disease>), the agent will next try to understand what else may prevent disease, rather than fall prey to exploring a tangent on the topic of disease that gradually shifts its focus away from the domain of interest.

Upon learning about an active concept cn having attribute r, associative search places all c′ found to be related to cn via r into the concept list A with attribute r. After Alice has completed this stack of learning tasks (A is empty), a new attribute of cn is chosen by the function π, using the frequency-based ordering described previously.

3.2.3 Breadth-Limited search

The final search strategy we implemented was a simple one that encouraged the agent to explore concepts in a semi-exhaustive fashion. Instead of pursuing a fully exhaustive strategy in which the agent attempts to learn a complete set of statements about a concept before moving on to another, we implemented a breadth-first search algorithm that limits the number of concepts explored based on information obtained during previous learning tasks.

From an implementation standpoint, this search strategy is identical to the best-first strategy with one key exception – concepts placed in A are explored in order of when they enter the queue, rather than sorted by the “newness” criteria previously defined. Each time a concept c′ is activated during a learning task involving cn at depth d, a new learning task for c′ is generated for depth d + 1. While Alice might end up learning some additional facts about <disease>, this will happen incrementally at an appropriate rate relative to other concepts, rather than allowing the agent to spend all immediate resources learning about disease and related concepts.
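The three strategies differ chiefly in how activated concepts enter and leave the agenda A. A compact sketch of the two queue disciplines (the newness score stands in for the count of freshly discovered instances; the class names are ours):

    import heapq
    from collections import deque

    class BestFirstAgenda:
        """Pop the concept with the most newly discovered instances."""
        def __init__(self):
            self.heap = []
        def insert(self, concept, newness, depth=0):
            heapq.heappush(self.heap, (-newness, concept))  # max-heap
        def next(self):
            return heapq.heappop(self.heap)[1]

    class BreadthLimitedAgenda:
        """FIFO order; each activation deepens the search by one level."""
        def __init__(self, max_depth=3):
            self.queue, self.max_depth = deque(), max_depth
        def insert(self, concept, newness, depth=0):
            if depth <= self.max_depth:      # limit how far Alice wanders
                self.queue.append((concept, depth))
        def next(self):
            return self.queue.popleft()[0]

    # Associative search behaves like a stack scoped to one attribute r:
    # concepts related to c_n via r are explored first; only when that
    # stack empties does pi select a new attribute of c_n.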
4. EXPERIMENTAL RESULTS

In this section, we empirically assess Alice’s ability to add a collection of abstract assertions about nutrition from a 2.5-million-page Web crawl constructed to contain documents in the domain. There are several criteria at play when judging the overall “goodness” of the learned abstractions. Our assessment should take into account an abstraction’s ability to generalize from instances to concepts at an appropriate level. Our evaluation metric should also consider whether or not the abstraction in question is on topic given the domain. These criteria provide us with a way to assess the performance of each search strategy we have implemented.

We ran each search strategy over the same corpus for a fixed number of learning steps (200), and chose a random sample of 200 abstract propositions for each of the three search methods. Using a modified version of an assessment scheme previously developed by Schubert [11], one of the authors of this paper was asked to assign each proposition P one of the following labels in the context of the domain D:

Off-Topic: P expresses a fact not central to D, e.g. Cause(<…>, <…>)
True: P is a reasonable general claim about D, e.g. Provide(<…>, <…>)
Vacuous: One or more arguments of P are not appropriately specific/general, e.g. Provide(<…>, <…>)
Incomplete: P is missing something, e.g. Provide(<…>, <…>)
Error: P contradicts our knowledge of D, e.g. BeNot(<…>, <…>)
                     On-Topic                              Off-Topic
                     True     Vacuous   Incomplete  Error
    Best-First       39.5%    6.5%      0.0%        1.0%   43.0%
    Associative      78.0%    9.5%      3.0%        3.5%    6.0%
    Breadth-Limited  75.5%    12.5%     3.0%        1.5%    7.5%

Table 1: Precision of Proposition Abstraction

The results of our assessment are given in Table 1. The heuristic-driven best-first search was observed to venture off-topic after a few iterations, yielding a theory judged to contain a sizable amount of out-of-domain knowledge. Once Alice has discovered the concept of <disease>, Alice pursues a set of illness-related concepts. This failure can be attributed not only to the greediness of the heuristic and the imperfect nature of the Web corpus from which Alice constructs the theory, but also to the overly general concept, <disease>, that generated this particular learning task. The fact that the concept of <disease> is positioned fairly high in the concept hierarchy means that it stands in relation to quite a number of concepts, a proportion of which will be out-of-domain.

Compared to best-first search, associative and breadth-limited search fared much better, adding appropriately general, domain-specific abstractions to Alice’s theory with a precision of 78.0% and 75.5%, respectively. The bulk of the remaining assertions were characterized as on-topic but vacuously true – 9.5% in the case of associative search, and 12.5% in the case of breadth-limited search. The small number of propositions judged as “incomplete” is due to a limitation of the knowledge representation scheme in TextRunner, in which n-ary relations are currently not handled. The few instances deemed to be “errors” were largely due to cascading errors originating with TextRunner or an inability to correctly disambiguate objects having multiple senses.

We also compared the recall of the three approaches, as shown in Figure 3. Since true recall cannot be easily computed from an unstructured Web corpus, we measure recall in terms of the size of the set of distinct abstractions proposed. Although the best-first strategy outputs a large number of generalizations, recall that a low proportion of them were judged to meet our measure of goodness. Using our precision assessment, both associative and breadth-limited search are estimated to output a greater proportion of correct, on-topic generalizations after 200 iterations – 543/696 and 779/1038, respectively – compared to only 517/1309 for best-first search.

Figure 3: Knowledge Acquisition By Search Strategy. For each algorithm, recall is computed as the total number of distinct abstractions proposed after each search iteration.

Finally, we measured the diversity of concepts and relations expressed within the propositions found by each search algorithm, and found that the statements cover a wide variety of attributes of the domain. Associative search was found to propose statements about distinct concepts at a rate of 0.67, and novel predicates at a rate of 0.32. The propositions output by best-first search involved a large number of distinct concepts (0.78), but a smaller number of predicates (0.23). Breadth-limited search proposed abstractions at a rate that contains the fewest number of distinct concepts (0.56), while the rate of unique predicates was observed to be 0.29. Nearly all of the concept-to-concept relationships proposed by Alice were novel relative to the input source of WordNet – 98.5% of the propositions output by associative search were not already present in the background ontology. Similar results were observed for breadth-limited search (99.5%) and best-first search (95.5%).

5. RELATED WORK

To date, the task of inducing domain theories directly from textual corpora has been explored by only a few systems. Liakata and Pulman [7] showed that they could induce logically structured inference rules from a 4000-word corpus describing company succession events. While the method was anecdotally shown to learn a handful of useful domain-specific rules, to our knowledge, an extensive empirical evaluation of the system output has not been reported.

While previous efforts to augment the WordNet ontology are too numerous to discuss here, most recently, Suchanek et al. developed YAGO [14], a system capable of unifying facts automatically extracted from Web pages with concepts in WordNet. While YAGO performs this unification with an accuracy of 95%, its coverage is currently limited to a handful of relations that are derivable using the structure and headings contained within Wikipedia pages.

Earlier, Schubert and Tong developed a method for extracting “general world knowledge” of the flavor deduced by Alice. Their approach took a set of human-validated parse trees from the Penn Treebank, and used a combination of tree-based pattern matching and heuristics to obtain a set of abstract propositions. Using criteria similar to those we have employed in our evaluation of Alice, human assessors found about 60% of the statements extracted by this method to be “reasonable general claims.” Also related to the task of deriving relationships between concepts is Clark and Weir’s use of parsed corpora and chi-squared statistics to induce relationships between existing concepts in WordNet [2]. Their method was evaluated indirectly by its ability to improve a natural language disambiguation task, on which a precision of around 75% was obtained.

While these methods attempt to solve the same task of finding abstract propositions within text, their inherent assumptions – the use of small, parsed corpora – make it impossible to perform a direct comparison relative to the method we have presented in this paper. The success of Alice relies not on the ability to parse its input corpus, which is orders of magnitude larger, but on unsupervised techniques that exploit the redundancy of information found within unstructured text.

6. CONCLUSIONS AND FUTURE WORK

We have introduced Alice, one of the first lifelong learning agents capable of building a domain theory from a large collection of Web text. Alice uses information from previous knowledge acquisition tasks to iteratively compose individual statements output by a state-of-the-art information extraction system into an abstract domain theory with a precision of 78%.

In the immediate future, we plan to explore the notion of mutual recursion in the context of Alice. Mutual recursion can take several forms, including the use of the domain theory output by Alice to improve modules like TextRunner that underlie its learning process, and the construction of inference rules from Alice’s general world knowledge. Finally, we plan to explore what Alice can learn from its search through the space of knowledge, making it possible to both eliminate fruitless learning tasks and identify areas in which Alice’s theory can be further enriched.

7. ACKNOWLEDGMENTS

This research was supported in part by NSF grants IIS-0535284 and IIS-0312988, DARPA contract NBCHD030010, and ONR grant N00014-05-1-0185, as well as gifts from Google, and was carried out at the University of Washington’s Turing Center.

8. REFERENCES

[1] M. Banko, M. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni. Open information extraction from the web. In Procs. of IJCAI, 2007.
[2] S. Clark and D. Weir. Class-based probability estimation using a semantic hierarchy. Computational Linguistics, 28(2), 2002.
[3] D. Downey, O. Etzioni, and S. Soderland. A probabilistic model of redundancy in information extraction. In Procs. of IJCAI, 2005.
[4] O. Etzioni, M. Banko, and M. Cafarella. Machine reading. In Procs. of AAAI, 2006.
[5] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A. Popescu, T. Shaked, S. Soderland, D. Weld, and A. Yates. Unsupervised named-entity extraction from the web: An experimental study. Artificial Intelligence, 165(1):91–134, 2005.
[6] D. Lenat. Automated theory formation in mathematics. In Procs. of IJCAI, 1977.
[7] M. Liakata and S. Pulman. Learning theories from text. In Procs. of COLING, 2004.
[8] J. B. MacQueen. Some methods for classification and analysis of multivariate observations. In Procs. of the 5th Berkeley Symposium on Mathematical Statistics and Probability, 1967.
[9] T. Mitchell. Reading the web: A breakthrough goal for AI. AI Magazine, AAAI Press, 2005.
[10] M. B. Ring. CHILD: A first step towards continual learning. Machine Learning, 28:77–105, 1997.
[11] L. Schubert and M. Tong. Extracting and evaluating general world knowledge from the brown corpus. In Procs. of the HLT/NAACL Workshop on Text Meaning, 2003.
[12] Y. Shinyama and S. Sekine. Preemptive information extraction using unrestricted relation discovery. In Procs. of HLT/NAACL, 2006.
[13] P. Stone and M. Veloso. Layered learning. In Procs. of ECML, 2000.
[14] F. Suchanek, G. Kasneci, and G. Weikum. Yago: A core of semantic knowledge. In Procs. of WWW, 2007.
[15] A. Teller. Exegesis. Random House, 1999.
[16] S. Thrun. Lifelong learning algorithms. In S. Thrun and L. Pratt, editors, Learning To Learn. Kluwer Academic Publishers, 1998.
[17] S. Thrun and T. Mitchell. Lifelong robot learning. Robotics and Autonomous Systems, 15:25–46, 1995.