Movie Description

Int J Comput Vis DOI 10.1007/s11263-016-0987-1 Movie Description Anna Rohrbach1 · Atousa Torabi3 · Marcus Rohrbach2 · Niket Tandon1 · Christopher Pal4 · Hugo Larochelle5,6 · Aaron Courville7 · Bernt Schiele1 Received: 10 May 2016 / Accepted: 23 December 2016 © The Author(s) 2017. This article is published with open access at Springerlink.com Abstract Audio description (AD) provides linguistic teams who participated in the challenges organized in the descriptions of movies and allows visually impaired people context of two workshops at ICCV 2015 and ECCV 2016. to follow a movie along with their peers. Such descriptions are by design mainly visual and thus naturally form an inter- Keywords Movie description · Video description · Video esting data source for computer vision and computational captioning · Video understanding · Movie description linguistics. In this work we propose a novel dataset which dataset · Movie description challenge · Long short-term contains transcribed ADs, which are temporally aligned to memory network · Audio description · LSMDC full length movies. In addition we also collected and aligned movie scripts used in prior work and compare the two sources of descriptions. We introduce the Large Scale Movie Descrip- 1 Introduction tion Challenge (LSMDC) which contains a parallel corpus of 128,118 sentences aligned to video clips from 200 movies Audio descriptions (ADs) make movies accessible to mil- (around 150h of video in total). The goal of the challenge lions of blind or visually impaired people.1 AD—sometimes is to automatically generate descriptions for the movie clips. also referred to as descriptive video service (DVS)—provides First we characterize the dataset by benchmarking differ- an audio narrative of the “most important aspects of the ent approaches for generating video descriptions. Comparing visual information” (Salway 2007), namely actions, gestures, ADs to scripts, we find that ADs are more visual and describe scenes, and character appearance as can be seen in Figs.1 precisely what is shown rather than what should happen and 2. AD is prepared by trained describers and read by pro- according to the scripts created prior to movie production. fessional narrators. While more and more movies are audio Furthermore, we present and compare the results of several transcribed, it may take up to 60 person-hours to describe a 2- hmovie(Lakritz and Salway 2006), resulting in the fact that today only a small subset of movies and TV programs are Communicated by Margaret Mitchell, John Platt and Kate Saenko. available for the blind. Consequently, automating this pro- cess has the potential to greatly increase accessibility to this B Anna Rohrbach [email protected] media content. In addition to the benefits for the blind, generating descrip- 1 Max Planck Institute for Informatics, Saarland Informatics tions for video is an interesting task in itself, requiring the Campus, Saarbrücken, Germany combination of core techniques from computer vision and 2 ICSI and EECS, UC Berkeley, Berkeley, CA, USA computational linguistics. To understand the visual input one 3 Disney Research, Pittsburgh, PA, USA has to reliably recognize scenes, human activities, and par- 4 École Polytechnique de Montréal, Montreal, Canada ticipating objects. To generate a good description one has to 5 Université de Sherbrooke, Sherbrooke, Canada 1 In this work we refer for simplicity to “the blind” to account for all 6 Twitter, Cambridge, ON, USA blind and visually impaired people which benefit from AD, knowing of 7 Université de Montréal, Montreal, Canada the variety of visually impaired and that AD is not accessible to all. 123 Int J Comput Vis has been particularly potent (Krizhevsky et al. 2012). To be able to learn how to generate descriptions of visual content, parallel datasets of visual content paired with descriptions are indispensable (Rohrbach et al. 2013). While recently several large datasets have been released which provide images with AD: Abby gets in Mike leans over and Abby clasps her the basket. sees how high they hands around his descriptions (Younget al. 2014; Lin et al. 2014; Ordonez et al. are. face and kisses him 2011), video description datasets focus on short video clips passionately. Script: After a Mike looks down to For the first time in with single sentence descriptions and have a limited number moment a frazzled see – they are now her life, she stops Abby pops up in his fifteen feet above the thinking and grabs of video clips (Xu et al. 2016; Chen and Dolan 2011)orare place. ground. Mike and kisses the not publicly available (Over et al. 2012). TACoS Multi-Level hell out of him. (Rohrbach et al. 2014) and YouCook (Das et al. 2013)are Fig. 1 Audio description (AD) and movie script samples from the exceptions as they provide multiple sentence descriptions and movie “Ugly Truth” longer videos. While these corpora pose challenges in terms of fine-grained recognition, they are restricted to the cooking scenario. In contrast, movies are open domain and realistic, decide what part of the visual information to verbalize, i.e. even though, as any other video source (e.g. YouTube or recognize what is salient. surveillance videos), they have their specific characteristics. Large datasets of objects (Deng et al. 2009) and scenes ADs and scripts associated with movies provide rich multiple (Xiao et al. 2010; Zhou et al. 2014) have had an important sentence descriptions. They even go beyond this by telling a impact in computer vision and have significantly improved story which means they facilitate the study of how to extract our ability to recognize objects and scenes. The combination plots, the understanding of long term semantic dependencies of large datasets and convolutional neural networks (CNNs) and human interactions from both visual and textual data. AD: Buckbeak rears and Hagrid lifts Malfoy up. As Hagrid carries Malfoy attacks Malfoy. away, the hippogriff gen- tly nudges Harry. Script: In a flash, Buck- Malfoy freezes. Looks down at the blood Buckbeak whips around, beak’s steely talons slash blossoming on his robes. raises its talons and - down. seeing Harry - lowers them. AD: Another room, the She smokes a cigarette Putting the cigarette out, She pats her face and She pats her face and wife and mother sits at a with a latex-gloved hand. she uncovers her hair, re- hands with a wipe, then hands with a wipe, then window with a towel over moves the glove and pops sprays herself with per- sprays herself with per- her hair. gum in her mouth. fume. fume. Script: Debbie opens a She holds her cigarette She puts out the cigarette She puts some weird oil She sprays cologne and window and sneaks a with a yellow dish wash- and goes through an elab- in her hair and uses a walks through it. cigarette. ing glove. orate routine of hiding wet nap on her neck and the smell of smoke. clothes and brushes her teeth. AD: They rush out onto A man is trapped under a Valjean is crouched down Javert watches as Valjean Javert’s eyes narrow. the street. cart. beside him. places his shoulder under the shaft. Script: Valjean and A heavily laden cart has Valjean, Javert and He throws himself under Javert stands back and Javert hurry out across toppled onto the cart Javert’s assistant all the cart at this higher looks on. the factory yard and driver. hurry to help, but they end, and braces himself down the muddy track can’t get a proper pur- to lift it from beneath. beyond to discover - chase in the spongy ground. Fig. 2 Audio description (AD) and movie script samples from the movies “Harry Potter and the Prisoner of Azkaban”, “This is 40”, and “Les Miserables”. Typical mistakes contained in scripts marked in red italic 123 Int J Comput Vis Fig. 3 Some of the diverse verbs/actions present in our Large Scale Movie Description Challenge (LSMDC) Figures 1 and 2 show examples of ADs and compare them As a first study on our dataset we benchmark several to movie scripts. Scripts have been used for various tasks approaches for movie description. We first examine near- (Cour et al. 2008; Duchenne et al. 2009; Laptev et al. 2008; est neighbor retrieval using diverse visual features which do Liang et al. 2011; Marszalek et al. 2009), but so far not for not require any additional labels, but retrieve sentences from video description. The main reason for this is that automatic the training data. Second, we adapt the translation approach alignment frequently fails due to the discrepancy between of Rohrbach et al. (2013) by automatically extracting an the movie and the script. As scripts are produced prior to the intermediate semantic representation from the sentences shooting of the movie they are frequently not as precise as using semantic parsing. Third, based on the success of the AD (Fig. 2 shows some typical mistakes marked in red long short-term memory networks (LSTMs) (Hochreiter italic). A common case is that part of the sentence is correct, and Schmidhuber 1997) for the image captioning problem while another part contains incorrect/irrelevant information. (Donahue et al. 2015; Karpathy and Fei-Fei 2015; Kiros As can be seen in the examples, AD narrations describe key et al. 2015; Vinyals et al. 2015) we propose our approach visual elements of the video such as changes in the scene, Visual-Labels. It first builds robust visual classifiers which people’s appearance, gestures, actions, and their interaction distinguish verbs, objects, and places extracted from weak with each other and the scene’s objects in concise and pre- sentence annotations. Then the visual classifiers form the cise language. Figure 3 shows the variability of AD data input to an LSTM for generating movie descriptions.

Load more