Representing General Relational Knowledge in ConceptNet 5

Robert Speer and Catherine Havasi

MIT Media Lab, 20 Ames St., Cambridge, MA 02139
[email protected], [email protected]

Abstract

ConceptNet is a knowledge representation project, providing a large semantic graph that describes general human knowledge and how it is expressed in natural language. This paper presents the latest iteration, ConceptNet 5, including its fundamental design decisions, ways to use it, and evaluations of its coverage and accuracy.

Keywords: ConceptNet, common sense, semantic networks

1. Introduction

ConceptNet is a knowledge representation project, providing a large semantic graph that describes general human knowledge and how it is expressed in natural language. The scope of ConceptNet includes words and common phrases in any written human language. It provides a large set of background knowledge that a computer application working with natural language text should know.

These words and phrases are related through an open domain of predicates, describing not just how words are related by their lexical definitions, but also how they are related through common knowledge. For example, its knowledge about "jazz" includes not just the properties that define it, such as IsA(jazz, genre of music); it also includes incidental facts such as:

• AtLocation(jazz, new orleans)
• UsedFor(saxophone, jazz), and
• plays percussion in(jazz drummer, jazz).

A cluster of related concepts and the ConceptNet assertions that connect them is visualized in Figure 1.

Figure 1: A high-level view of the knowledge ConceptNet has about a cluster of related concepts.

ConceptNet originated as a representation for the knowledge collected by the Open Mind Common Sense project (Singh et al., 2002), which uses a long-running interactive Web site to collect new statements from visitors to the site, and asks them target questions about statements it thinks may be true. Later releases included knowledge from similar websites in other languages, such as Portuguese and Dutch, and collaborations with online word games that automatically collect general knowledge, yielding further knowledge in English, Japanese, and Chinese.

ConceptNet gives a foundation of real-world knowledge to a variety of AI projects and applications. Previous versions of ConceptNet (Havasi et al., 2007) have been used, for example, to build a system for analyzing the emotional content of text (Cambria et al., 2010), to create a dialog system for improving software specifications (Körner and Brumm, 2009), to recognize activities of daily living (Ullberg et al., 2010), to visualize topics and trends in a corpus of unstructured text (Speer et al., 2010), and to create public information displays by reading text about people and projects from a knowledge base (Havasi et al., 2011).

ConceptNet provides a combination of features not available in other knowledge representation projects:

• Its concepts are connected to natural language words and phrases that can also be found in free text.
• It includes not just definitions and lexical relationships, but also the common-sense associations that ordinary people make among these concepts. Its sources range in formality from dictionaries to online games.
• The concepts are not limited to a single language; they can be from any written language.
• It integrates knowledge from sources with varying levels of granularity and varying registers of formality, and makes them available through a common representation.

ConceptNet aims to contain both specific facts and the messy, inconsistent world of common sense knowledge. To truly understand concepts that appear in natural language text, it is important to recognize the informal relations between these concepts that are part of everyday knowledge, which are often under-represented in other lexical resources. WordNet, for example, can tell you that a dog is a type of carnivore, but not that it is a type of pet. It can tell you that a fork is an eating utensil, but has no link between fork and eat to tell you that a fork is used for eating.

Adding common sense knowledge creates many new questions. Can we say that "a fork is used for eating" if a fork is used for other things besides eating, and other things are used for eating? Should we make sure to distinguish the eating utensil from the branching of a path? Is the statement still true in cultures that typically use chopsticks instead of forks? We can try to collect representations that answer these questions, while pragmatically accepting that much of the content of a common sense knowledge base will leave them unresolved.

2. Motivation for ConceptNet 5

The new goals of ConceptNet 5 include: incorporating knowledge from other crowd-sourced knowledge bases with their own communities and editing processes, particularly data mined from Wiktionary and Wikipedia; adding links to other resources such as DBPedia (Auer et al., 2007), Freebase (Bollacker et al., 2008), and WordNet (Fellbaum, 1998); supporting machine-reading tools such as ReVerb (Etzioni et al., 2008), which extracts relational knowledge from Web pages; and finding translations between concepts represented in different natural languages.

ConceptNet 5 aims to grow freely and absorb knowledge from many sources, with contributions from many different projects. We aim to allow different projects to contribute data that can easily be merged into ConceptNet 5 without the difficulty of aligning large databases.

Combining all these knowledge sources in a useful way requires processes for normalizing and aligning their different representations, while avoiding information loss. It also requires a system for comparing the reliability of its collected knowledge when it can come from a variety of processes, sometimes involving unreliable sources (such as players of online games) and sometimes involving unreliable processes (parsers and transformations between representations).

In a sense, while ConceptNet 4 and earlier versions collected facts, ConceptNet 5 at a higher level collects sources of facts. This greatly expands its domain, makes it interoperable with many other public knowledge resources, and makes it applicable to a wider variety of text-understanding applications.

2.1. Previous Development of ConceptNet

ConceptNet has been developed as part of the Open Mind Common Sense project, a Media Lab project to collect the things that computers should know in order to understand what people are talking about, which then grew into an international, multi-homed project called the Common Sense Computing Initiative.

The first publicly released version of ConceptNet was ConceptNet 2 (Liu and Singh, 2004). ConceptNet 2 was distributed as a packed Python data structure, along with code to read it and operations that could be performed with it, such as spreading activation from a set of words.

ConceptNet 2 was only in English. The project became multilingual shortly afterward, both with the sister project OMCS no Brasil (Anacleto et al., 2006), collecting knowledge in Brazilian Portuguese, and GlobalMind (Chung, 2006), collecting knowledge in English, Chinese, Japanese, and Korean.

ConceptNet 3 (Havasi et al., 2007) made ConceptNet into a SQL database, so it could be easily updated by processes including user interaction from a Web site. There were separate editions of ConceptNet 3 for English and Brazilian Portuguese.

ConceptNet 4 was quite similar to ConceptNet 3, but it revised and normalized the database structure so that it could contain all languages of ConceptNet simultaneously; it finally could represent all the knowledge from the English OMCS, OMCS no Brasil, a new OMCS in Dutch (Eckhardt, 2008), and GlobalMind in one place. It also incorporated contributions from other projects, including online games collecting knowledge in English, Chinese, and Japanese. To aid the use of ConceptNet within other projects, we also added a Web API for accessing and querying the data in ConceptNet 4.

A strong motivation for a new version is that the data from other projects was difficult to fully incorporate into ConceptNet 4. It had to be aligned and deduplicated in a sprawling SQL database, a time-consuming and code-intensive process that was performed infrequently, and meanwhile the projects at other locations had to maintain their own out-of-sync versions of the database. ConceptNet 5 contains many representational improvements, but the primary focus is to make the collection, storage, and querying of knowledge truly distributable.

3. Knowledge in ConceptNet 5

ConceptNet expresses concepts, which are words and phrases that can be extracted from natural language text, and assertions of the ways that these concepts relate to each other. These assertions can come from a wide variety of sources that create justifications for them. The current sources of knowledge in ConceptNet 5 are:

• The Open Mind Common Sense website (http://openmind.media.mit.edu), which collects common-sense knowledge mostly in English, but has more recently supported other languages.
• Sister projects to OMCS in Portuguese (Anacleto et al., 2006) and Dutch (Eckhardt, 2008).
• The multilingual data, including translations between assertions, collected by GlobalMind.
• "Games with a purpose" that collect common knowledge, including Verbosity (von Ahn et al., 2006) in English, nadya.jp in Japanese, and the "pet game" (Kuo et al., 2009) on the popular Taiwanese bulletin board PTT, collecting Chinese knowledge in traditional script.

• A new process that scans the English Wiktionary (a Wikimedia project at en.wiktionary.org that defines words in many languages in English). In addition to extracting structured knowledge such as synonyms and translations, it also extracts some slightly-unstructured knowledge. For example, it extracts additional translations from the English-language glosses of words in other languages.
• WordNet 3.0 (Fellbaum, 1998), including cross-references to its RDF definition at http://semanticweb.cs.vu.nl/lod/wn30/ (van Assem et al., 2010).
• The semantic connections between Wikipedia articles represented in DBPedia (Auer et al., 2007), with cross-references to the corresponding DBPedia resources. DBPedia contains a number of collections, in different languages, representing relationships with different levels of specificity; so far we use only the "instance types en" collection of DBPedia.
• Relational statements mined from Wikipedia's free text using ReVerb (Etzioni et al., 2008), run through a filter we designed to keep only the statements that are going to be most useful to represent in ConceptNet. We discarded statements whose ReVerb scores were too low, and those that contained uninformative terms such as "this".

Adding knowledge from other free projects such as WordNet does more than just increase the coverage of ConceptNet; it also allows us to align entries in ConceptNet with those in WordNet and refer to those alignments without having to derive them again. This is an important aspect of the linked data movement: different projects collect data in different forms, but it is best when there is a clear way to map from one to the other. When the data is linked, ConceptNet enhances the power of WordNet and vice versa.

ConceptNet 5 is growing as we find new sources and new ways to integrate their knowledge. As of April 2012, it contains 12.5 million edges, representing about 8.7 million assertions connecting 3.9 million concepts. 2.78 million of the concepts appear in more than one edge. Its most represented language is English: 11.5 million of the edges contain at least one English concept. The next most represented languages are Chinese (900,000 edges), Portuguese (228,000 edges), Japanese (130,000 edges), French (106,000 edges), Russian (93,700 edges), Spanish (92,400 edges), Dutch (90,000 edges), German (86,500 edges), and Korean (71,400 edges). The well-represented languages largely represent languages for which multilingual collaborations with Open Mind Common Sense exist, with an extra boost for languages that are well-represented in Wiktionary.

Additional sources that may be added include the plan-oriented knowledge in Honda's Open Mind Indoor Common Sense (Kochenderfer, 2004), connections to knowledge in Freebase (Bollacker et al., 2008), ontological connections to SUMO and MILO (Niles and Pease, 2001), and new processes that scan well-structured Wiktionaries in other target languages, such as Japanese and German.

3.1. Representation

ConceptNet 5 is conceptually represented as a hypergraph. Its assertions can be seen as edges that connect its nodes, which are concepts (words and phrases). These assertions, however, can be justified by other assertions, knowledge sources, or processes. The predicates that label them can be one of a set of interlingual relations, such as "IsA" or "UsedFor", or they can be automatically-extracted relations that are specific to a language, such as "is known for" or "is on". The values of the predicates – referred to hereafter as the relation of each assertion – are represented using concept nodes as well. The structure of edges surrounding two assertions appears in Figure 2.

Figure 2: An example of two assertions in ConceptNet 5, and the edges they involve. Rounded rectangles and dotted edges represent knowledge sources; solid edges are grouped together into assertions.

One way to represent a hypergraph is to reify all edges as nodes, with lower-level relationships such as "x is the first argument of y" becoming the new edges. We experimented with representations of reified hypergraphs, but found that the result was exceptionally difficult to query as the database grew. Asking simple questions such as "What are the parts of a car?" in a hypergraph is a complex, multi-step query, and we found no mature database system that could perform all the queries we needed efficiently.

Instead, we store almost all of the relevant information about an edge as properties on that edge. Each assertion is still reified with a unique ID, but that ID is only referred to within the assertion or in higher-level assertions about that assertion, such as translations.

In particular, an edge in ConceptNet 5 is an instance of an assertion, as learned from some knowledge source. The same assertion might be represented by a large bundle of edges, when we learn it in many different ways; these all have the same assertion ID, along with algorithmically-generated unique edge IDs that we can use to deduplicate data later.

The sources that justify each assertion form a disjunction of conjunctions. Each edge – that is, each instance – indicates a conjunction of sources that produced that edge, while the bundle of edges making up an assertion represents the disjunction of all those conjunctions. Each conjunction comes with a positive or negative score, a weight that it assigns to that edge. The more positive the weight, the more solidly we can conclude from these sources that the assertion is true; a negative weight means we should conclude from these sources that the assertion is not true.

As in previous versions of ConceptNet, an assertion that receives a negative weight is not an assertion whose negation is true. It may in fact be a nonsensical or irrelevant assertion. To represent a true negative statement, such as "Pigs cannot fly", ConceptNet 5 uses negated relations such as /r/NotCapableOf.

Conjunctions are necessary to assign credit appropriately to the multi-part processes that create many assertions. For example, an OMCS sentence may be typed in by a human contributor and then interpreted by a parser, and we want the ability to examine the collected data and determine whether the human is a reliable data source as well as whether the parser is. As another example, relations mined from Wikipedia using ReVerb depend on both the reliability of Wikipedia and of ReVerb.

3.2. Relations

In addition to free-text relations, the standard interlingual relations we identify in ConceptNet appear in Table 1.

Relation          Sentence pattern
IsA               NP is a kind of NP.
UsedFor           NP is used for VP.
HasA              NP has NP.
CapableOf         NP can VP.
Desires           NP wants to VP.
CreatedBy         You make NP by VP.
PartOf            NP is part of NP.
Causes            The effect of VP is NP|VP.
HasFirstSubevent  The first thing you do when you VP is NP|VP.
HasSubevent       One of the things you do when you VP is NP|VP.
HasLastSubevent   The last thing you do when you VP is NP|VP.
AtLocation        Somewhere NP can be is NP.
HasProperty       NP is AP.
LocatedNear       You are likely to find NP near NP.
DefinedAs         NP is defined as NP.
SymbolOf          NP represents NP.
ReceivesAction    NP can be VP.
HasPrerequisite   NP|VP requires NP|VP.
MotivatedByGoal   You would VP because you want VP.
CausesDesire      NP would make you want to VP.
MadeOf            NP is made of NP.

Table 1: The interlingual relations in ConceptNet, with example sentence frames in English.

3.3. Granularity

The different knowledge sources that feed ConceptNet 5 represent concepts at different levels of granularity, especially in that concepts can be ambiguous or disambiguated. Concepts are often ambiguous when we acquire them from natural-language text. Other concepts are explicitly disambiguated by a resource such as WordNet or Wiktionary. ConceptNet 5 contains, for example, the ambiguous node /concept/en/jazz. A source such as Wiktionary might define it as a noun, yielding the more specific concept /concept/en/jazz/n, and it may even distinguish the word sense from other possible senses, yielding /concept/en/jazz/n/musical_art_form. These URIs do not represent the same node, but the nodes they represent are highly related. This indicates that when we add a way to query ConceptNet 5, described in Section 4.1., we need to structure the index so that a query for /concept/en/jazz also matches /concept/en/jazz/n/musical_art_form.

3.4. Normalizing and aligning concepts

ConceptNet deals with natural-language data, but it should not store the assertion that "a cat is an animal" in a completely different way than "cats are animals". Therefore, we represent each concept using normalized versions of the concept's text. The process for creating a normalized concept differs by language. Some examples are:

• running, in English: /c/en/run
• correr, in Spanish: /c/es/corr
• rennen, in Dutch: /c/nl/renn
• run (baseball), a disambiguated English word: /c/en/run/n/baseball

Normalization inherently involves discarding information, but since ConceptNet 3, we have ensured that this information is stored with the assertion and not truly discarded. Every edge that forms every assertion is annotated with how it was expressed in natural language. That information is important in some applications, such as generating natural-language questions to ask, as the AnalogySpace system (Speer et al., 2008) does with ConceptNet data; it is also very important so that if we change the normalization process one day, the original data is not lost and there is a clear way to determine which new concepts correspond to which old concepts.

The normalization process in English is an extension of WordNet's Morphy, plus removal of a very small number of stopwords, and a transformation that undoes CamelCase on knowledge sources that write their multiple-word concepts that way. In Japanese, we use the commonly-used MeCab algorithm for splitting words and reducing the words to a dictionary form (Kudo et al., 2004), and in many European languages we use the Snowball stemmer for that language (Porter, 2001) to remove stop words and reduce inflected words to a common stem.

3.5. URIs and Namespaces

An important aspect of the representation used by ConceptNet 5 is that it is free from arbitrarily-assigned IDs, such as sequential row numbers in a relational database. Every node and edge has a URI, which contains all the information necessary to identify it uniquely and no more.

Concepts (normalized terms) are the fundamental unit of representation in ConceptNet 5. Each concept is represented by a URI that identifies that it is a concept, what language it is in, its normalized text, and possibly its part of speech and disambiguation. A concept URI looks like /c/en/run/n/baseball.

The predicates that relate concepts can be multilingual relations such as /r/IsA: this represents the "is-a" or "hypernym" relation that will be expressed in different ways, especially when the text is in different languages. Processes that read free text, such as ReVerb, will produce relations that come from natural language and cannot be aligned in any known way with our multilingual relations. In this case, the relation is in fact another concept, with a specified language and a normalized form. In the text "A bassist performs in a jazz trio", the relation is /c/en/perform_in.

The fact that interlingual relations and language-specific concepts can be interchanged in this way is one reason we need to distinguish them with the namespaces /r/ and /c/. The namespaces are as short as possible so as to not waste memory and disk space; they appear millions of times in ConceptNet.

There is a namespace /s/ for data sources that justify an edge. These contain, for example, information extraction rules such as /s/rule/reverb, human contributors such as /s/contributor/omcs/rspeer, and curated projects such as /s//3.0.

An assertion URI contains all the information necessary to reconstruct that assertion. For example, the assertion that "jazz is a kind of music" has the URI /a/[/r/IsA/,/c/en/jazz/,/c/en/music/]. By using the special path components /[/ and /]/, we can express arbitrary tree structures within the URI, so that the representation even includes assertions about assertions. The advantage of this is that if multiple branches of ConceptNet are developed in multiple places, we can later merge them simply by taking the union of the edges. If they acquire the same fact, they will assign it the same ID.

Edge IDs also take into account all the information that uniquely identifies the edge. There is no need to represent this information in a way from which its parts can be reconstructed; doing so would create very long edge IDs, which are unnecessary because edges are the lowest level of data. Thus, the parts of every edge are right there in the edge's data structure. Instead, edge IDs are the hexadecimal SHA-1 hash of all the unique components, separated by spaces: its assertion URI, its context, and its conjoined sources in Unicode sorted order. The 160-bit SHA-1 hash provides more than enough room to be unique over even a large number of edges, is shorter than the data contained in the edge itself, and can be queried to get an arbitrary subset of edges, which is very useful for evaluation.

4. Storing and accessing ConceptNet data

As ConceptNet grows larger and is used for more purposes, it has been increasingly important to separate the data from the interface to that data. A significant problem with ConceptNet 3, for example, was that the only way to access it was through the same Django database models that created it.

ConceptNet 5 fully separates the data from the interface. The data in ConceptNet 5 is a flat list of edges, available in JSON or as tab-separated values. A flat file is in fact the most useful format for many applications:

• Many statistics about ConceptNet can be compiled by iterating over the full list of data, which neither a database nor a graph structure is optimized for.
• A subset of the information in each line of the flat file is the appropriate input for many machine learning tools.
• A flat file can be easily converted to different formats using widely-available tools.
• A CSV flat file can be used as a spreadsheet.
• It is extremely easy to merge flat files. It is sometimes sufficient simply to put them in the same directory and iterate over both. If deduplication is needed, one can use highly optimized tools to sort the lines and make them unique.

However, a flat file is not particularly efficient for querying. A question such as "What are the parts of a car?" involves a very small proportion of the data, which could only be found in a flat file by iterating over the entire thing. Thus, we build indexes on top of ConceptNet 5.

4.1. Indexes

Currently, we index ConceptNet 5 with a combination of Apache Solr and MongoDB. We provide access to them through a REST API, as well as transformations of the data that a downstream user can import into a local Solr index or MongoDB database. The Solr index seems to be the most useful and scalable, and its distributed queries make it simple to distribute it between sites, so it is the primary index that we currently use. For example, we can maintain the main index while our collaborators in Taiwan maintain a separate index, including up-to-date information they have collected, and now a single API query can reach both.

Using the Solr server, we can efficiently index all edges by all lemmas (normalized words) they contain and prefixes of any URIs they involve. A search for rel:/r/PartOf and end:/c/en/wheel OR end:/c/en/wheel/* will find all edges describing the parts of a wheel, automatically ordered by the absolute value of their score. The Solr index would not make sense as a primary way to store the ConceptNet data, but it allows very efficient searches for many kinds of queries a downstream user would want to perform.
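The edge-ID scheme described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the function name and the example context URI are assumptions, while the source URIs are taken from the /s/ namespace examples given earlier.

```python
import hashlib

def edge_id(assertion_uri, context, sources):
    # Join the edge's unique components with spaces, with the
    # conjoined sources in Unicode sorted order, and take the
    # hexadecimal SHA-1 hash of the result.
    components = [assertion_uri, context] + sorted(sources)
    return hashlib.sha1(" ".join(components).encode("utf-8")).hexdigest()

# The same components always produce the same 40-hex-digit ID,
# regardless of the order the sources were listed in, so branches of
# ConceptNet that acquire the same edge can be merged by taking the
# union of their edges.
a = edge_id("/a/[/r/IsA/,/c/en/jazz/,/c/en/music/]", "/ctx/all",
            ["/s/contributor/omcs/rspeer", "/s/rule/reverb"])
b = edge_id("/a/[/r/IsA/,/c/en/jazz/,/c/en/music/]", "/ctx/all",
            ["/s/rule/reverb", "/s/contributor/omcs/rspeer"])
assert a == b and len(a) == 40
```

Because the ID is derived from content rather than assigned sequentially, deduplication after a merge reduces to sorting the flat file and removing repeated lines.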

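To make the trade-off between the flat file and the Solr index concrete, here is a sketch of answering "what are the parts of a wheel?" by scanning tab-separated edges directly. The three-row sample and its simplified column layout (assertion URI, relation, start, end, weight) are assumptions for illustration; the real files, with more columns, are distributed from the downloads page.

```python
import csv
import io

# A tiny stand-in for the tab-separated flat file of edges.
SAMPLE = """\
/a/[/r/PartOf/,/c/en/spoke/,/c/en/wheel/]\t/r/PartOf\t/c/en/spoke\t/c/en/wheel\t1.0
/a/[/r/PartOf/,/c/en/rim/,/c/en/wheel/]\t/r/PartOf\t/c/en/rim\t/c/en/wheel\t1.0
/a/[/r/IsA/,/c/en/wheel/,/c/en/round_object/]\t/r/IsA\t/c/en/wheel\t/c/en/round_object\t0.5
"""

def parts_of(flat_file, concept):
    """Find PartOf edges ending at `concept` by scanning every row --
    the full iteration that motivates building an index instead."""
    rows = csv.reader(io.StringIO(flat_file), delimiter="\t")
    return [start for uri, rel, start, end, weight in rows
            if rel == "/r/PartOf" and end == concept]

print(parts_of(SAMPLE, "/c/en/wheel"))  # ['/c/en/spoke', '/c/en/rim']
```

The scan touches every edge no matter how selective the question is; the Solr query `rel:/r/PartOf AND end:/c/en/wheel` returns the same answer from an index without reading the rest of the data.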
4.2. Downloading

ConceptNet's usefulness as a knowledge platform depends on its data being freely available under a minimally restrictive license, and not (for example) tied up in agreements to use the data only for research purposes. ConceptNet 5 can be downloaded or accessed through a Web API at its web site, http://conceptnet5.media.mit.edu, and may be redistributed or reused under a choice of two Creative Commons licenses.

The flat files containing ConceptNet 5 data are available at http://conceptnet5.media.mit.edu/downloads/. Python code for working with this data, transforming it, and building indexes from it is maintained on GitHub in the "conceptnet5" project: https://github.com/commonsense/conceptnet5.

5. Evaluation

To evaluate the current content of ConceptNet, we put up a website for 48 hours that showed a random sample of the edges in ConceptNet. It showed the natural language form of the text (which was machine-generated in the cases where the original data was not in natural language) and asked people to classify the statement as "Generally true", "Somewhat true", "I don't know", "Unhelpful or vague", "Generally false", or "This is garbled nonsense". People were invited to participate via e-mail and social media. They were shown 25 results at a time. We got 81 responses that evaluated a total of 1888 statements, or 1193 if "Don't know" results are discarded.

All participants were English speakers, so we filtered out statements whose surface text was not in English. Statements that translate another language to English were left in, but participants were not required to look them up, so in many cases they answered "Don't know".

We have grouped the results by dataset, distinguishing edges that come from fundamentally different sources. The datasets are:

• Existing ConceptNet: statements previously collected by Common Sense Computing Initiative projects, which can be found in ConceptNet 4.
• WordNet: connections from WordNet 3.0.
• Wiktionary, English-only: monolingual information from the English Wiktionary, such as synonyms, antonyms, and derived words.
• Wiktionary, translations: translations in Wiktionary from some other language to English. As these are numerous compared to other sources, we kept only 50% of them.
• DBPedia: triples from DBPedia's instance types en dataset. As these are numerous compared to other sources, we kept only 25% of them.
• Verbosity: statements collected from players of Verbosity on gwap.com.
• ReVerb: filtered statements extracted from ReVerb parses of a corpus of Wikipedia's front-paged articles.
• GlobalMind translations: translations of entire assertions between languages.

We also separated out negative edges, those which previous contributors to ConceptNet have rated as not true, confirming that most of them are rated similarly now.

The breakdown of results appears in Table 2. Their relative proportions are graphed in Figure 3.

Dataset                   False/Nonsense  Vague  Don't know  Sometimes  True  Total
Existing ConceptNet                   84     15          19        117   300    535
WordNet                               21      0          11          9    35     76
Wiktionary, English-only               7      3           9          6    10     35
Wiktionary, translations              10      2         233          8    51    304
DBPedia                               46      9         389         41   238    723
Verbosity                             51      7           2         32    51    143
ReVerb                                17     15          19          3     5     59
GlobalMind translations                0      0           4          0     0      4
Negative edges                         6      0           0          1     2      9

Table 2: The breakdown of responses to an evaluation of random statements in ConceptNet 5.

Figure 3: The relative proportions of responses people gave about each dataset.

We can see that people often answered "Don't know" when faced with very specific knowledge, which is to be expected when presenting expert knowledge to arbitrary people. Interestingly, existing ConceptNet data was rated better than WordNet data; perhaps WordNet edges inherently form assertions that sound too unnatural, or perhaps our English-language glosses of them are at fault. The processes of extracting translations from Wiktionary and triples from DBPedia performed very well, while the ReVerb data – faced with the hardest task, extracting knowledge from free text – did poorly. The few negative-score edges were mostly rated as false, as expected, though 3 of the 9 responses to them disagreed.

All the examples of higher-level assertions that translate assertions between languages were rated as "Don't know". A more complete evaluation could be performed in the future with the help of bilingual participants who could evaluate translations.

6. References

Junia Anacleto, Henry Lieberman, Marie Tsutsumi, Vania Neris, Aparecido Carvalho, Jose Espinosa, and Silvia Zem-Mascarenhas. 2006. Can common sense uncover cultural differences in computer applications? In Proceedings of IFIP World Computer Conference, Santiago, Chile.

Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2007. DBpedia: A nucleus for a web of open data. In Karl Aberer, Key-Sun Choi, Natasha Noy, Dean Allemang, Kyung-Il Lee, Lyndon Nixon, Jennifer Golbeck, Peter Mika, Diana Maynard, Riichiro Mizoguchi, Guus Schreiber, and Philippe Cudré-Mauroux, editors, The Semantic Web, volume 4825 of Lecture Notes in Computer Science, pages 722–735. Springer Berlin / Heidelberg.

Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, SIGMOD '08, pages 1247–1250, New York, NY, USA. ACM.

Erik Cambria, Amir Hussain, Catherine Havasi, and Chris Eckl. 2010. SenticSpace: Visualizing opinions and sentiments in a multi-dimensional vector space. In Knowledge-Based and Intelligent Information and Engineering Systems, pages 385–393, Heidelberg.

Hyemin Chung. 2006. GlobalMind — bridging the gap between different cultures and languages with common-sense computing. Ph.D. thesis, MIT Media Lab.

Nienke Eckhardt. 2008. A Kid's Open Mind Common Sense. Ph.D. thesis, Tilburg University.

Oren Etzioni, Michele Banko, Stephen Soderland, and Daniel S. Weld. 2008. Open information extraction from the web. Commun. ACM, 51:68–74, December.

Christiane Fellbaum. 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA.

Catherine Havasi, Robert Speer, and Jason Alonso. 2007. ConceptNet 3: a flexible, multilingual semantic network for common sense knowledge. In Recent Advances in Natural Language Processing, pages 27–29, Borovets, Bulgaria, September.

Catherine Havasi, Richard Borovoy, Boris Kizelshteyn, Polychronis Ypodimatopoulos, Jon Ferguson, Henry Holtzman, Andrew Lippman, Dan Schultz, Matthew Blackshaw, Greg Elliott, and Chaki Ng. 2011. The glass infrastructure: Using common sense to create a dynamic, place-based social information system. In Proceedings of 2011 Conference on Innovative Applications of Artificial Intelligence, San Francisco, CA, July. AAAI.

Mykel J. Kochenderfer. 2004. Common sense data acquisition for indoor mobile robots. In Proceedings of the Nineteenth National Conference on Artificial Intelligence (AAAI-04), pages 605–610.

S.J. Körner and T. Brumm. 2009. RESI - a natural language specification improver. In Semantic Computing, 2009. ICSC '09. IEEE International Conference on, pages 1–8, September.

Taku Kudo, Kaoru Yamamoto, and Yuji Matsumoto. 2004. Applying conditional random fields to Japanese morphological analysis. In Dekang Lin and Dekai Wu, editors, Proceedings of EMNLP 2004, pages 230–237, Barcelona, Spain, July. Association for Computational Linguistics.

Yen-Ling Kuo, Jong-Chuan Lee, Kai-Yang Chiang, Rex Wang, Edward Shen, Cheng-Wei Chan, and Jane Yung-Jen Hsu. 2009. Community-based game design: experiments on social games for commonsense data collection. In Proceedings of the ACM SIGKDD Workshop on Human Computation, HCOMP '09, pages 15–22, New York, NY, USA. ACM.

Hugo Liu and Push Singh. 2004. ConceptNet: A practical commonsense reasoning toolkit. BT Technology Journal, 22(4):211–226, October.

Ian Niles and Adam Pease. 2001. Towards a standard upper ontology. In Proceedings of the international conference on Formal Ontology in Information Systems - Volume 2001, FOIS '01, pages 2–9, New York, NY, USA. ACM.

Martin F. Porter. 2001. Snowball: A language for stemming algorithms. Published online at http://snowball.tartarus.org/texts/introduction.html, October. Accessed 2011-10-24.

Push Singh, Thomas Lin, Erik T. Mueller, Grace Lim, Travell Perkins, and Wan L. Zhu. 2002. Open Mind Common Sense: Knowledge acquisition from the general public. In On the Move to Meaningful Internet Systems, 2002 - DOA/CoopIS/ODBASE 2002 Confederated International Conferences DOA, CoopIS and ODBASE 2002, pages 1223–1237, London, UK. Springer-Verlag.

Robert Speer, Catherine Havasi, and Henry Lieberman. 2008. AnalogySpace: Reducing the dimensionality of common sense knowledge. Proceedings of AAAI 2008, July.

Robert Speer, Catherine Havasi, Nichole Treadway, and Henry Lieberman. 2010. Finding your way in a multi-dimensional semantic space with Luminoso. In Proceedings of the 15th International Conference on Intelligent User Interfaces.

Jonas Ullberg, Silvia Coradeschi, and Federico Pecora. 2010. On-line ADL recognition with prior knowledge. In Proceedings of the 2010 conference on STAIRS 2010: Proceedings of the Fifth Starting AI Researchers' Symposium, pages 354–366, Amsterdam, The Netherlands. IOS Press.

Mark van Assem, Antoine Isaac, and Jacco van Ossenbruggen. 2010. WordNet 3.0 in RDF. Published online at http://semanticweb.cs.vu.nl/lod/wn30/, September. Accessed 2011-10-24.

Luis von Ahn, Mihir Kedia, and Manuel Blum. 2006. Verbosity: a game for collecting common-sense facts. In Proceedings of the SIGCHI conference on Human Factors in computing systems, CHI '06, pages 75–78, New York, NY, USA. ACM.
