Toward Emergent Representations for Video
Ryan Shaw and Marc Davis
University of California at Berkeley
School of Information Management and Systems
Garage Cinema Research
102 South Hall, Berkeley, CA 94720-4600, USA
http://garage.sims.berkeley.edu
{ryanshaw, marc}@sims.berkeley.edu

ABSTRACT
Advanced systems for finding, using, sharing, and remixing video require high-level representations of video content. A number of researchers have taken top-down, analytic approaches to the specification of representation structures for video. The resulting schemes, while showing the potential of high-level representations for aiding the retrieval and resequencing of video, have generally proved too complex for mainstream use. In this paper, we propose a bottom-up, emergent approach to developing video representation structures by examining retrieval requests and annotations made by a community of video remixers. Our initial research has found a useful degree of convergence between user-generated indexing terms and query terms, with the salient exception of descriptions of characters' corporeal characteristics.

Categories and Subject Descriptors
H.1.2 [Models and Principles]: User/Machine Systems – human factors, human information processing; H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing – abstracting methods, indexing methods, thesauruses.

General Terms: Experimentation, Human Factors.

Keywords: Video retrieval, video annotation, video representation, remixing.

1. INTRODUCTION
This research is part of an effort to design a system to enable non-professional users to find, use, share, and remix video on the web. A key component in this system will be a shared metadata repository which holds pointers to fragments of video along with descriptive metadata for the content of those fragments. Users will both contribute descriptive metadata to the repository by annotating video content and make requests for video by providing descriptions of the desired content.

Such a system poses two coordination problems. First, there is the well-known problem of coordinating the vocabularies of metadata providers (users annotating video) with the vocabularies of video seekers [4]. For example, one user may describe a character in a shot as a "bad guy" while another may request shots portraying "villains." This problem may be solved in part by creating lists of related terms, either manually or by analyzing patterns of co-occurrence in textual descriptions.
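As an illustration of the automatic variant, the following minimal sketch (in Python) derives candidate related-term lists from co-occurrence patterns in a collection of free-text descriptions. The tokenization, Dice-coefficient scoring, and frequency threshold here are illustrative assumptions, not a method specified in this paper.

    # Minimal sketch: suggest related terms from co-occurrence in
    # free-text descriptions. All design choices are illustrative.
    from collections import Counter
    from itertools import combinations

    def related_terms(descriptions, min_pair_count=2):
        term_counts = Counter()
        pair_counts = Counter()
        for text in descriptions:
            terms = sorted(set(text.lower().split()))
            term_counts.update(terms)
            pair_counts.update(combinations(terms, 2))
        related = {}
        for (a, b), n in pair_counts.items():
            if n < min_pair_count:
                continue
            # Dice coefficient: co-occurrences relative to how often
            # each term appears at all.
            score = 2.0 * n / (term_counts[a] + term_counts[b])
            related.setdefault(a, []).append((b, score))
            related.setdefault(b, []).append((a, score))
        return {t: sorted(ps, key=lambda p: -p[1]) for t, ps in related.items()}

    # e.g. related_terms(["the villain attacks", "a bad guy attacks", ...])
    # would rank "villain" and "bad"/"guy" as candidates for the same list.

Run over a large corpus of requests and annotations, such scores could seed the related-term lists described above, with manual review catching spurious pairs.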
More worrisome is the possibility that annotators will choose to describe attributes of content unrelated to the kinds of attributes video seekers specify in their requests. For example, an annotator may choose to spend all her time describing the lighting in a shot, neglecting to provide any information about the action it depicts. If a video seeker is interested in action, these annotations will be useless to him.

This could be solved by requiring annotators to annotate all possible facets of a given shot, but this solution makes the cost of voluntarily contributing to the metadata repository unacceptably high for a single annotator. A "divide and conquer" approach, in which various facets are described by different annotators or groups of annotators, could achieve the same goal while keeping the granularity of individual contributions reasonably small. Such an approach requires representation structures which support the creation of compatible annotations by different users.

Another possible solution is to provide representation structures which can relate concepts used in annotations to those specified in queries. Rather than imposing an a priori representation structure on users, we have examined queries and annotations made by members of the target user community in order to determine what kinds of representation structures may be useful for ensuring coordination.

The goal of the experiment reported here was to determine the degree of divergence or convergence between the descriptions users use to annotate video content and the queries music video makers use to find content for remixing. The larger research this study is a part of aims to create systems that generalize and leverage the collective annotative efforts of enthusiast communities to enable the large-scale annotation and retrieval of media on the web.

2. METHODOLOGY
The study was conducted through AMV.Org [1], a large (approximately 300,000 members) community of fans who appropriate Japanese animation (anime) content and re-edit it into music videos. (AMV is an acronym for Anime Music Video.) Editors use the site's discussion forums to make requests for desired content, which other fans then attempt to satisfy by suggesting suitable shots from various anime releases.

2.1 Request Analysis
We analyzed 220 requests for anime shots made on the AMV.Org "AMV Suggestions" discussion forum from July 2002 to April 2005 using a two-pass approach [5]. We first examined 100 requests and manually categorized them according to the attributes used in specifying the request. We developed the attribute set in parallel with our examination of the user requests, and based it on the actual requests generated by the user community. Next, this coding scheme was used to categorize the full set of 220 requests.

2.2 Annotation Experiment
We created a simple media player application which enabled users to select temporal and spatial regions of two one-minute anime clips and add free-text annotations. The application was implemented using Macromedia Flash so that it could be accessed through a standard web browser. Annotations and their corresponding selected regions were stored in a relational database as they were created. Users could see a list of all the annotations they had created and could select annotations in the list to see the associated regions highlighted in the media player. Annotations could not be removed once they had been added.
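The sketch below (Python with SQLite) shows the kind of relational schema such a tool might use to store annotations with their temporal and spatial regions. It is a minimal illustration under our own assumptions; the table and column names are hypothetical and do not describe the actual Flash implementation.

    # Minimal sketch of an annotation store; schema is an assumption,
    # not the implementation used in the study.
    import sqlite3

    conn = sqlite3.connect("annotations.db")
    conn.executescript("""
    CREATE TABLE IF NOT EXISTS annotation (
        id      INTEGER PRIMARY KEY,
        user_id TEXT NOT NULL,  -- annotator
        clip_id TEXT NOT NULL,  -- which one-minute clip
        body    TEXT NOT NULL,  -- the free-text annotation
        t_start REAL,           -- selected temporal region (seconds);
        t_end   REAL,           --   NULL when no region was selected
        x       REAL,           -- optional spatial region, stored as a
        y       REAL,           --   rectangle in clip coordinates
        w       REAL,
        h       REAL
    );
    """)

    # Storing an annotation with a temporal region only:
    conn.execute(
        "INSERT INTO annotation (user_id, clip_id, body, t_start, t_end)"
        " VALUES (?, ?, ?, ?, ?)",
        ("u17", "clip1", "villain attacks the crowd", 12.0, 15.5))
    conn.commit()

    # Listing a user's annotations so their regions can be re-highlighted:
    for body, t0, t1 in conn.execute(
            "SELECT body, t_start, t_end FROM annotation WHERE user_id = ?",
            ("u17",)):
        print(body, t0, t1)

Making the region columns nullable mirrors the behavior reported below, where some annotations specified no temporal region and most specified no spatial region.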
The first one-minute clip was a scene from the English-dubbed USA release of Cowboy Bebop: Knockin' on Heaven's Door (2001), a popular anime film used as source content for 568 music videos listed on AMV.Org. This scene was selected because of its high action content, relative lack of dialogue, and the likelihood that it would be well-known to fans. The second clip was a scene from the English-dubbed USA release of Perfect Blue (1997), a somewhat less well-known film used as source content for 250 music videos listed on AMV.Org. This scene primarily featured a dialogue between two characters.

We invited 60 survey respondents, who had provided contact information and expressed interest in participating further in the project, to use the tool to annotate the two clips. Invitees were given no guidelines on what or how to annotate beyond basic instructions on how to use the annotation tool. The instructions simply asked users to "Add as many tags as you think are necessary to describe the scene."

3. RESULTS

3.1 Request Analysis
The results of the request analysis are shown in Table 1. Requests specified two different kinds of attributes on average, with a combination of action and object being the most common (14% of requests). The most commonly specified attributes overall were appearance-related, especially: actions depicted, specific objects, and specific physical attributes of bodies or faces. Specific names of characters, films, or TV series were less common than we expected.

Table 1. Shot request attributes and their frequencies.

  Attribute                      Examples                                        Requests containing attribute
  Subject-related
    plot/theme                   one character dumping another; heroic           26 (12%)
                                 characters working toward a common good
    mood/emotion                 silly; psychedelic; sad                         19 (9%)
    role                         villain; father; maniac                         17 (8%)
    genre                        sci-fi; romance; yaoi                           11 (5%)
  Specific names
    cast member name             Vash; Tenchi                                    20 (9%)
    film/series name             Escaflowne; Trigun                              20 (9%)
  Production-related
    camera/animation style/etc.  camera scrolls up; zoom in or out; loopable      9 (4%)

3.2 Annotation Experiment
Over the two weeks of the study, 25 respondents created a total of 244 annotations for the two clips. Of these annotations, 210 (86%) specified particular temporal regions of the clip. In most of the cases where a specific region was not specified, this seems to have been due to user misunderstanding, since the annotation text seemed to refer to specific shots rather than the scene as a whole (although there were a few annotations which described the scene as a whole). Only 16 (7%) of the annotations specified spatial regions. This low number was probably due to the fact that the instructions did not make sufficiently clear that spatial selections were possible. Since spatial region selection, unlike temporal extent selection, is not a common feature of the media manipulation tools with which the annotators were familiar, it is likely this functionality was simply overlooked.

A wide variety of annotation styles were observed. Some annotators wrote long descriptions consisting of complete sentences:

    Nina's so stressed out and confused that she