Open Mind Animals: Insuring the Quality of Data Openly Contributed Over the World Wide Web

Open Mind Animals: Insuring the quality of data openly contributed over the World Wide Web David G. Stork and Chuck P. Lam Ricoh Silicon Valley 2882 Sand Hill Road, Suite 115 Menlo Park CA 94025-7022 {stork,lam}¢rsv, ricoh, com From: AAAI Technical Report WS-00-05. Compilation copyright © 2000, AAAI (www.aaai.org). All rights reserved. Abstract The Open Mind Initiative Briefly stated, the Open Mind Initiative is an inter- Wedescribe the Open Mind Initiative, a frame- net based collaborative framework for developing intel- work for building intelligent systems collabora- tively over the internet, and focus on one of its ligent software such as speech and handwriting recog- simpler componentprojects, Open MindAnimals. nizers, commonsense reasoning systems, smart spam The Initiative extends traditional open source de- filters, and so on (Stork 1999a; 1999b; Valin & Stork velopment methods by allowing non-expert neti- 1999). The Initiative relies on three main forms of zens to contribute informal data over the inter- contributions: 1) domain experts provide fundamental net. Suchdata is used to train classifiers or guide algorithms such as learning algorithms for character automatic inference systems, and thus it is im- recognition, 2) infrastructure/tool developers provide portant that only data of high accuracy and con- software infrastructure such as for capturing raw data sistency be accepted. We identify a number of and rewarding netizens, and 3) non-expert netizens con- possible sources of poor data in Animals -- sev- 1 eral of whichare generic and applicable to a range tribute raw data over the internet. There are many of open data collection projects -- and imple- incentives for netizens to contribute, including public ment a system of software modules for automati- acknowledgement on the Open Mind website, altruism cally and semi-automatically preventing poor data and inherent interest in artificial intelligence and par- from being accepted. Our system, tested in a ticular project domains, pleasure from playing games controlled laboratory intranet, filters faulty data (serving as interfaces), lotteries, gifts and couponsdo- through a variety of mechanismsand leads to ac- nated by corporations, and more (Stork 1999a). curate decision tree classifiers. Ourreusable mod- In contradistinction with traditional data mining ules can be employedin our planned large-scale techniques (Fayyad et al. 1996), Open Mind affords internet deployment of Animals and other Open - Mindprojects. interactive learning, such as learning with queries CAn gluin 1988) in which the most informative or most use- ful information (given the state of the system) is re- quested from contributors. Thus learning is recast as a Introduction problem in decision theory in which each action (query) Unlike many small demonstration and toy systems in has an expected cost/benefit, and the optimal strat- egy is to present the query which when answered is artificial intelligence, virtually all real-world systems re- expected to improve the classifier’s accuracy the most quire large sets of knowledgeor training data (Shortliffe (Pratt, Ralffa, & Schlaifer 1995; Thrun 1995). Learn- 1976; Lenat 1995). Pattern recognition systems in par- ing with queries is generally faster than non-interactive ticular rely quite heavily on large corpora of training learning based on randomly sampled data. Consider, data (Ho & Baird 1997; Jelinek 1998). In many cases for instance, the use of interactive learning for devel- such data can be collected from non-experts, for in- oping a recognizer of handwritten digits. The current stance speech data or the labels of handwritten charac- classifier identifies the region in pattern space where ters. The central hypothesis underlying the Open Mind images are ambiguous (e.g., the boundary between cat- Initiative is that for someclasses of problems, such data egory "I" and "1"), and interactively queries contrib- can be collected from large number of non-expert "net- utors to provide labels of such ambiguous patterns. In izens" over the internet. Our focus in this paper is on a this way the system "focusses in" on difficult or most key problem in Open Mind: how to ensure that the data informative patterns and increases the opportunity to contributed in this way has an acceptably small number resolve ambiguities in the data. of errors and represents consensus of the participating The software from separate Open Mind projects can netizen population. be integrated, for instance by using a language identification or topic identification system to provide con- Copyright ~) 2000, AmericanAssociation for Artificial In- telligence (www.aaai.org).All rights reserved. lwww.OpenMind.org straints for handwritten OCR.All the resulting data Open Source Open Mind and software are then available through an open source no netizens netizens crucial license and experts can propose changes. expert knowledge informal knowledge The Initiative arose from a deep appreciation of the no machine learning machine learning used following facts and recent trends: adaptive techniques used machine learning used to ¯ The success and increasing acceptance of open source to facilitate navigation build a single classifier or development methods and resulting software such as of contributed data (e.g., AI system (e.g., speech Linux, emacs, Mozilla and Apache. movie recommendations) recognizer) ¯ The realization that manyproblems in pattern recog- most work is directly on most work is not on the nition and intelligent systems require very large data the final software final software sets that can be provided by non-experts. hacker/programmer cul- netizen/business culture ture (~ 105 contributors (~ 108 members) ¯ The refinement of well developed techniques in pat- to Linux) tern recognition, machine learning, grammatical in- separate functions single function goal (e.g., ference, data mining and closely related core disci- contributed (e.g., Linux high OCRaccuracy) plines. device drivers) ¯ The increase in collaboration over the internet, and the improvementof tools for facilitating such collab- Table 1: Comparison of traditional open source and oration, both amongexperts and non-experts. Open Mind approaches. ¯ The growth in the participation of non-specialists (e.g., non-programmers) in group projects over the internet. Some of these allow participants to do- a dynamic pen tablet in a semi-controlled environment. nate temporarily unused computer resources, as the The segmention information and character labels pro- Search for Extraterrestrial Intelligence, 2 Great Inter- vided by netizens in this way are used to train mul- net Mersenne Prime Search, 3 prime4 factorization, ticlassifier based letter and full word recognizers. The and others. 5 In other collaborative projects, netizens existing data labelling system, written in Java, has been used successfully by experienced segmenters over an in- contribute "informal" or non-expert knowledge, as tranet and is being made more robust and user friendly in the Newhoo! open web directory 6 and MovieLens movie~ recommendation database. for deployment on the World Wide Web. ¯ The massive expansion of the web itself and partic- ularly the growing number of non-specialist netizens on the web. Table 1 summarizes some of the differences between traditional open source and Open Mind. Current Open Mind projects There are currently three full Open Mind projects un- der development. Open Mind speech recognition is initially addressing the recognition of isolated spoken Linux commandsand speaker identification (Valin Stork 1999). Open Mind common sense builds common sense ontologies and reasoning mechanisms; netizens contribute simple assertions (e.g., "a mother is older than her children"), ontology information ("all Figure 1: In an Open Mind project for recognizing rabbits are animals"), as well as abstract inferencing handwriting words, netizens segment letters within a rules. Open Mind handwriting is focussing on recog- pixel image of a connected handwritten word from an nizing handwritten English words (Schomaker & Stork unlabelled database of scanned or dynamically sampled 2000). Figure 1 shows the web interface for netizen handwritten words. Netizens use a mouse to indicate contributions in this project. Here the netizen’s task the beginning and end of each letter, and type the let- is to segment individual characters in pixel images of ter’s label, all through a browser interface. unlabelled handwritten words collected previously with 2setiathome.ssl. berkeley,edu Open Mind Animals E3~rww.mersenne, or 4www. rsasecurity,com To better understand general problems in infrastructure 5dlstributed.net and data quality assurance in the Open Mind frame- E6www. dmoz. or work we have implemented a simple "20 questions" AI 7www. movielens,umn. edu game called Animals (Shapiro 1982). During each ses- sion (game) a decision tree based on animal attributes is grown in the background. The game proceeds as follows. The netizen player thinks of a target animal and the system tries to guess this animal based on the player’s answers to a sequence of yes/no questions; this sequence corresponds to a path through a binary decision tree (Fig. 2). The system ventures a guess the target animal’s identity according to the label at the leaf node reached this way. If this guess is correct, then the game is over and the player is encouraged to play the game again. If instead the guess is incorrect, then the system asks the player for the identity of the

Load more