CAKES: Crowdsourced Automatic Keyword Extraction

Daniel Erenrich, Chris Kennelly

ABSTRACT
We present progress towards applying "Games With A Purpose" (GWAP) techniques to extract keywords which describe films. We discuss the use of machine learning algorithms to automatically extract keywords from movie scripts. The data collected from the game is used to generate similarity metrics between the films, which can eventually be used to recommend films. Finally, we discuss the merits of GWAP as a system and make recommendations concerning its use.

Categories and Subject Descriptors
I.2.7 [Natural Language Processing]: Text Analysis; H.5.2 [User Interfaces]: Natural language

General Terms
Measurement

1. MOTIVATION
As part of the CS144 "Rankmaniac 2010" competition, we created an online game using the Internet Movie Database (IMDB) dataset. Hoping that it would encourage incoming links, we chose to make a game so our website would have content. The initial game gave the player a set of tags for a randomly chosen movie in the dataset. The goal of the player was to guess the movie possessing those tags within sixty seconds. If needed, the player could request more tags from the pool. From the start of the project, we chose to diligently log game play in order to construct a fascinating dataset for movies.

This strategy has been dubbed a "Game with a Purpose" by Luis von Ahn, a professor at Carnegie Mellon University. These games are entertaining while simultaneously guiding players to solve problems that are difficult for machines. For example, he designed an image labeling game which asks two players to describe an image with the same word.

Sixty-five years ago, Vannevar Bush presented a vision for an automated system of data organization entitled "the Memex" [1]. Since then, the task of automatically organizing, summarizing, and indexing information has remained open. While search engines perform much of the task of indexing, automatic summarization is a work in progress [2]. Like image recognition, keyword assignment to text challenges machines, both in describing a large piece of content with relevant keywords and in ascertaining the importance of each keyword. This task is important for content recommendation systems and search relevance.

Game-playing humans provide a potential audience for this task. The dataset produced by gameplay can be used to assess the relatedness of films and the quality of keywords describing films. Additionally, possible keyword generation algorithms can be validated on-the-fly by emulating an opponent. Rather than summarize widely available texts, we chose keyword generation. In order to apply the Games With A Purpose-style approach to validation, we must keep our game interesting in order to encourage play. Summarizing dusty tomes into multiple paragraphs may be an interesting AI problem, but it lacks the pace that keywords appear to provide, and that pace is what makes a game fun for a broad audience.

2. METHODOLOGY
Our research focuses on automatic keyword generation from underlying source texts and keyword validation by human game play. The former automatically produces keywords by analyzing source material, namely movie scripts. The latter maximizes the value of keywords by attempting to discover which keywords convey the most information to a human about the content of a film by leading them to guess a movie title. Statistical information gleaned from the work in keyword presentation provides feedback to keyword discovery algorithms for selecting better keywords.

Gameplay lends itself to several potentially noisy approaches for measuring the usefulness of keywords for a human player. We considered focusing on the last displayed keyword, assigning equal weight to each displayed keyword, and giving an exponential decay of weights skewed towards more recently displayed keywords. For a "stuck" user, the last displayed keyword ought to give the most insight into the movie of interest, justifying the exclusive use of the last keyword for measuring effectiveness. However, many keywords are used to describe multiple movies, limiting the usefulness of this technique. This suggests that the whole set of keywords is greater than the sum of its parts.

The other aspect of the project is the games with a purpose angle. We are using techniques similar to games distributed on the Games With A Purpose website created by Luis von Ahn [3]. These games are designed to be maximally fun while at the same time extracting usable information. There are both one-player and two-player games on the GWAP website. We implemented a two-player game centered around tagging movies, similar to GWAP's "ESP." ESP works by having two users describe a particular image; the round ends when both users choose the same word for the image, and a new image is shown. Additionally, some terms are marked as "taboo" and cannot be used. Taboo words are used to ensure that new rounds of the game do not result in keywords that are already known. In our version, one user describes the film and the other user guesses the movie's name. This alteration allows us to insert a "bot" into one half of the game in order to test new keyword extraction algorithms and to provide an opponent when no opponent is available. Our two-player game is a combination of the so-called "output-agreement" and "input-agreement" games, which is called an "inversion problem game" [4]. The one-player version is the game that already exists; it provides game play when no human partner is available to be matched. Users are shown a series of keywords and then attempt to use them to guess the film's name. We use the frequency with which certain words result in correct identification to determine how likely words are to be good descriptions of a film.

3. DATASETS

3.1 The IMDB Dataset
The Internet Movie Database (IMDB) releases its underlying data for non-commercial use. Figures 1 and 2 give distributions on the frequencies of keywords as associated with movies.

[Figure 1: Distribution of frequency of keyword appearances. Panel title: Number of Movies per Keyword. x-axis: Number of Movies; y-axis: Frequency.]

[Figure 2: Distribution of frequency of keyword counts. Panel title: Number of Keywords per Movie. x-axis: Number of Keywords; y-axis: Frequency.]

3.2 Movie Scripts
Scripts were scraped from The Internet Movie Script Database (IMSDb). After being downloaded, they were automatically cleaned of HTML markup. While some formatting could provide additional information to indicate distinctions between stage directions, scenes, and dialogue, we felt that accurately distinguishing between these across hundreds of scripts would be difficult. Small variations could easily disrupt attempts to reliably analyze this information.

Ultimately, 845 scripts were collected for analysis. Manual examination showed that there was considerable variation between some scripts and their respective movies in the form of scene additions and deletions. Figure 3 shows the distribution of unique words in each script. Figure 4 shows the frequency of the number of movie appearances each word had.

[Figure 3: Distribution of Unique Words per Script. Panel title: Unique Words Per Script. x-axis: Unique Words; y-axis: Number of Movies.]

4. KEYWORD EXTRACTION
As this project sought to extract keywords from movies, we applied two approaches for automatically analyzing scripts.

4.1 Statistical Techniques
In the literature, the simplest and most common techniques for extracting keywords from bodies of text are based on simple word frequency. These methods count the number of times each word appears and then declare a word to be a keyword if it appears very frequently. Obviously we do not want to consider stop-words like "the" or "hello" in our set of keywords. One technique that deals with this problem has a program first scan a corpus of text and then extract the words which are unusually common in a given text. That is, the word appears more often than in the average text. Applying this algorithm to the film scripts in our possession revealed reasonable, but not great, keywords. An example keyword set for the film "Watchmen" is:

rorschach, laurie, dan, adrian, blake, manhattan, owl, continued, jon, hollis, moloch, blake's, dr, rorschach's, veidt, moloch's, comedian, janey, dan's, watchmen, ship, karnak, sally, laurie's, forbes, slater, cont'd, det, adrian's, seymour, roth, doug, thug, lynx, dreiberg, nite, industries, manhattan's, jupiter, pyramid, osterman, antarctica, daniel, costume, jon's, mars, psychiatrist, thug's, ion, flashback, agent, frontiersman, swat, gallagher, particles, kashmir, gang, anchorwoman, archie, kovacs, intruder, editor, smartest, mask, prison, cancer, enterprises, edward, rockefeller, vendor, welder, supervillian

The words are all relevant, but are not what we would consider keywords. Even if we were to remove proper nouns and stem the remaining words, we would be left with a collection of keywords which fail to capture many of the most important characteristics of the film.

From the original IMDB set as well as our game play-based dataset, we assigned keywords to movies. Standard keyword generation algorithms accept blocks of text and find words which have statistically more appearances than might be expected. Film scripts formed the basis for these algorithms, as direct analysis of the audio and video of a film is a much more difficult and error-prone task. We considered and tested the use of film reviews as a corpus, but the results were not promising enough to justify the much larger difficulty in assembling a good corpus.
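
The corpus-relative frequency idea described in this section can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`tokenize`, `relative_frequencies`, `unusual_words`), the ratio-based score, and the `smoothing` constant are all assumptions chosen for illustration. The sketch scores each word in a target text by how much larger its relative frequency is than its average relative frequency across a background corpus, and keeps only words that are more common than in the average text.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (inner apostrophes kept)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())


def relative_frequencies(tokens):
    """Map each word to its share of all tokens in one text."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def unusual_words(target_text, corpus_texts, top_n=10, smoothing=1e-6):
    """Rank words in target_text by how much more often they occur there
    than in the average text of the background corpus (hypothetical scoring)."""
    # Average relative frequency of each word over the background corpus.
    average = Counter()
    for text in corpus_texts:
        for word, freq in relative_frequencies(tokenize(text)).items():
            average[word] += freq / len(corpus_texts)

    target = relative_frequencies(tokenize(target_text))
    # A ratio above 1 means the word is more common here than in the average text.
    # The smoothing term (an assumption) avoids division by zero for unseen words.
    scores = {
        word: freq / (average[word] + smoothing)
        for word, freq in target.items()
    }
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [word for word, score in ranked[:top_n] if score > 1.0]


if __name__ == "__main__":
    # Toy background corpus and target text, invented for this example.
    corpus = [
        "the cat sat on the mat",
        "the dog sat on the log",
        "the cat and the dog",
    ]
    print(unusual_words("the robot sat on the robot", corpus, top_n=3))
```

With this toy input, "robot" ranks first and "the" is dropped because it is no more frequent in the target than in the average corpus text. Words that never occur in the background corpus receive very large scores under a small smoothing constant, which may be one reason character names dominate keyword lists such as the "Watchmen" example above.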
