Omcsnet: a Commonsense Inference Toolkit
Total Page:16
File Type:pdf, Size:1020Kb
OMCSNet: A Commonsense Inference Toolkit Hugo Liu and Push Singh MIT Media Laboratory 20 Ames St., Bldg. E15 Cambridge, MA 02139 USA {hugo, push}@media.mit.edu Abstract throughout the computational linguistics community, running the gamut of linguistic processing tasks (see Large, easy-to-use semantic networks of sym- WordNet Bibliography, 2003). bolic linguistic knowledge such as WordNet and However, in many ways WordNet is far from ideal. MindNet have become staple resources for se- Often, the knowledge encoded by WordNet is too formal mantic analysis tasks from query expansion to and taxonomic to be of practical value. For example, word-sense disambiguation. However, the WordNet can tell us that a dog is a kind of canine which knowledge captured by these resources is limited is a kind of carnivore, which is a kind of placental mam- to formal taxonomic relations between or dic- mal, but it does not tell us that a dog is a kind of pet, tionary definitions of lexical items. While such which is something that most people would think of. knowledge is sufficient for some NLP tasks, we Also, because it is a lexical database, WordNet only in- believe that broader opportunities are afforded cludes concepts expressable as single words. Further- by databases containing more diverse kinds of more, its ontology of relations consists of the limited set world knowledge, including substantial knowl- of nymic relations comprised by synonyms, is-a relations, edge about compound concepts like activities and part-of relations. (e.g. “washing hair”), accompanied by a richer Ideally, a semantic resource should contain knowledge set of temporal, spatial, functional, and social re- not just those concepts that are lexicalized, but also about lations between concepts. lexically compound concepts. It should be connected by Based on this premise, we introduce an ontology of relations rich enough to encode a broad OMCSNet, a freely available, large semantic range of commonsense knowledge about objects, actions, network of commonsense knowledge. Built goals, the structure of events, and so forth. In addition, it from the Open Mind Common Sense corpus, should come with tools for easily making use of that which acquires world knowledge from a web- knowledge within linguistic applications. We believe based community of instructors, OMCSNet is that such a resource would open the door to many new presently a semantic network of 280,000 items innovations and improvements across the gamut of lin- of common-sense knowledge, and a set of tools guistic processing tasks. for making inferences using this knowledge. In Building large-scale databases of commonsense this paper, we describe OMCSNet, evaluate it in knowledge is not a trivial task. One problem is scale. It the context of other semantic knowledge bases, has been estimated that the scope of common sense may and review how OMCSNet has been used to en- involve many tens of millions of pieces of knowledge. able and improve various NLP tasks. Unfortunately, common sense cannot be easily mined from dictionaries, encyclopedias, the web, or other cor- 1 Introduction pora because it consists largely of knowledge obvious to There has been an increasing thirst for large-scale seman- a reader, and thus omitted. Indeed, it likely takes much tic knowledge bases in the AI community. Such a re- common sense to even interpret dictionaries and ency- source would improve many broad-coverage natural lan- clopedias. Until recently, it seemed that the only way to guage processing tasks such as parsing, information re- built a commonsense knowledgebase was through the trieval, word-sense disambiguation, and document sum- expensive process of hand-coding each and every fact. marization, just to name a few. WordNet (Fellbaum, However, in recent years we have been exploring a 1998) is currently the most popular semantic resource in new approach. Inspired by the success of distributed and the computational linguistics community. Its knowledge collaborative projects on the Web, Singh et al. (2002) is easy to apply in linguistic applications because Word- turned to the general public to massively distribute the Net takes the form of a simple semantic network—there problem of building a commonsense knowledgebase. is no esoteric representation to map into and out of. In They succeeded at gathering well over 500,000 simple addition, WordNet and tools for using it are freely avail- assertions from many contributors. From this corpus of able to the community and easily obtained. As a result, commonsense facts, we built OMCSNet, a semantic net- WordNet has been used in hundreds of research projects work of 280,000 items of commonsense knowledge. An excerpt of OMCSNet is shown in Figure 1. Our aim was sense knowledge—simple assertions, descriptions of to create a large-scale machine-readable resource struc- typical situations, stories describing ordinary activities tured as an easy-to-use semantic network representation and actions, and so forth. Since then the website has like WordNet (Fellbaum, 1998) and MindNet (Richard- gathered nearly 500,000 items of commonsense knowl- son et al., 1998), yet whose contents reflect the broader edge from over 10,000 contributors from around the range of world knowledge characteristic of commonsense world, many with no special training in computer sci- as in Cyc (Lenat, 1995). While far from being a perfect ence. The OMCS corpus now consists of a tremendous or complete commonsense knowledgebase, OMCSNet range of different types of commonsense knowledge, has nonetheless offered world knowledge on a large scale expressed in natural language. The earliest applications of the OMCS corpus made Washing Hair H use of its knowledge not directly but by first extracting For as E Shampoo Used ffec Co t Clean Hair nt into semantic networks only the types of knowledge they ai Loc ns ate E Lo d v needed. For example, the ARIA photo retrieval system t c At en c a Shower t e t Turning On j ed b I (Lieberman & Liu, 2002) extracted taxonomic, spatial, O n Water s a H Bottle functional, causal, and emotional knowledge to improve M information retrieval. This suggested a new approach to Buying Shampoo ade Of Plastic building a commonsense knowledgebase. Rather than R equ ires directly engineering the knowledge structures used by the Money Bank Located At reasoning system, as is done in Cyc, OMCS encourages L O o ft c e a t n people to provide information clearly in natural language t r N e a e d a P r I b and then extract from that more usable knowledge repre- s y n T a o H Wallet ATM sentations. Inspiring was the fact that there had been Front Door significant progress in the area of information extraction Part lves Opening Door Invo from text in recent years, due to improvements in broad- To Ach Pulling coverage parsing (Cardie, 1997). A number of systems ieve T his Handle are able to successfully extract facts, conceptual rela- tions, and even complex events from text. Figure 1. An excerpt from OMCSNet. Relation names are OMCSNet is produced by an automatic process, which expanded here for illustrative purposes. applies a set of ‘commonsense extraction rules’ to the OMCS corpus. A pattern matching parser uses 40 map- and has been employed to support and tackle a variety of ping rules to easily parse semi-structured sentences into linguistic processing tasks. predicate relations and arguments which are short frag- This paper is structured as follows. First, we discuss ments of English. These arguments are then normalized how OMCSNet was built, how it is structured, and the using natural language techniques (stripped of stop nature of its contents. Second, we present the OMCSNet words, lemmatized), and are massaged into one of many inference toolkit distributed with the semantic network. standard syntactic forms. To account for richer concepts Third, we review how OMCSNet has been applied to which are more than words, we created three categories improve or enable several linguistic processing tasks. of concepts: Noun Phrases (things, places, people), At- Fourth, we evaluate several aspects of the knowledge and tributes (modifiers), and Activity Phrases (actions and the inference toolkit, and compare it to several other actions compounded with a noun phrase or prepositional large-scale semantic knowledge bases. We conclude phrase, e.g.: “turn on water,” “wash hair.”). A small with a discussion of the potential impact of this resource part-of-speech tag –driven grammar filters out non- on the computational linguistics community at large, and compliant text fragments and massages the rest to take explore directions for future work. one of these standard syntactic forms. When all is done, the cleaned relations and arguments are linked together 2 OMCSNet into the OMCSNet semantic network. In this section, we first explain the origins of OMCSNet 2.3 Contents of OMCSNet in the Open Mind Commonsense corpus; then we demon- strate how knowledge is extracted to produce the seman- At present OMCSNet consists of the 20 binary relations tic network; and third, we describe the structure and se- shown below in Table 1. These relations were chosen mantic content of the network. The OMCSNet Knowl- because the original OMCS corpus was built largely edge Base, Knowledge Browser, and Inference Tool API through its users filling in the blanks of templates like ‘a is available for download (Liu & Singh, 2003). hammer is for _____’. Thus the relations we chose to extract largely reflect the original choice of templates 2.1 Building OMCSNet used on the OMCS web site. OMCSNet came about in a unique way. Three years ago, the Open Mind Commonsense (OMCS) web site (Singh et al. 2002) was built, a collection of 30 different activi- ties, each of which elicits a different type of common- Table 1.