CAKES: Crowdsourced Automatic Keyword Extraction
Daniel Erenrich Chris Kennelly
[email protected] [email protected]

ABSTRACT
We present progress towards applying "Games With A Purpose" (GWAP) techniques to extract keywords which describe films. We discuss the use of machine learning algorithms to automatically extract keywords from movie scripts. The data collected from the game is used to generate similarity metrics between the films which eventually can be used to recommend films. Finally, we discuss the merits of GWAP as a system and make recommendations concerning its use.

Categories and Subject Descriptors
I.2.7 [Natural Language Processing]: Text Analysis; H.5.2 [User Interfaces]: Natural language

General Terms
Measurement

Sixty-five years ago, Vannevar Bush presented a vision for an automated system of data organization entitled "the Memex" [1]. Since then, the task of automatically organizing, summarizing, and indexing information still remains. While search engines perform much of the task of indexing, automatic summarization is a work in progress [2]. Like image recognition, keyword assignment to text challenges machines, both in describing a large piece of content with relevant keywords as well as ascertaining the importance of each keyword. This task is important for content recommendation systems and search relevance.

Game-playing humans provide a potential audience for this task. The dataset produced by gameplay can be used to assess the relatedness of films and quality of keywords describing films. Additionally, possible keyword generation algorithms can be validated on-the-fly by emulating an opponent. Rather than summarize widely available texts, we chose keyword generation.
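As a rough illustration of how gameplay data might be turned into a relatedness score between films, the sketch below compares two films by the keywords players assigned to them. The choice of weighted Jaccard similarity, the function name `weighted_jaccard`, and the sample films and counts are all assumptions made for this example; the text above does not specify which similarity metric is used.

```python
from collections import Counter


def weighted_jaccard(a: Counter, b: Counter) -> float:
    """Weighted Jaccard similarity between two keyword-count profiles.

    Each Counter maps a keyword to the number of times players
    submitted it for a film (hypothetical data layout).
    """
    keys = set(a) | set(b)
    if not keys:
        return 0.0
    # Counter returns 0 for missing keywords, so no special-casing is needed.
    overlap = sum(min(a[k], b[k]) for k in keys)
    total = sum(max(a[k], b[k]) for k in keys)
    return overlap / total if total else 0.0


# Hypothetical keyword counts collected from gameplay.
film_keywords = {
    "Film A": Counter({"space": 5, "robot": 3, "war": 1}),
    "Film B": Counter({"space": 4, "alien": 2, "war": 2}),
    "Film C": Counter({"romance": 6, "paris": 2}),
}

if __name__ == "__main__":
    names = sorted(film_keywords)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            score = weighted_jaccard(film_keywords[first], film_keywords[second])
            print(f"{first} vs {second}: {score:.3f}")
```

Under this assumed metric, films that share frequently submitted keywords score closer to 1, and films with no keywords in common score 0. Weighting by submission count lets a keyword that many players agree on count for more than a one-off guess.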