A Scalable Video Search Engine Based on Audio Content Indexing and Topic Segmentation
Julien LawTo (1), Jean-Luc Gauvain (2), Lori Lamel (2), Gregory Grefenstette (1), Guillaume Gravier (3), Julien Despres (4), Camille Guinaudeau (3), Pascale Sébillot (3)

(1) Dassault Systèmes/Exalead, Paris, France; (2) LIMSI, Orsay, France; (3) IRISA, Rennes, France; (4) Vocapia Research, Orsay, France
E-mail: (1) 3ds.com, (2) limsi.fr, (3) irisa.fr, (4) vocapia.com
Corresponding author: Julien Lawto, Exalead

To cite this version: Julien Lawto, Jean-Luc Gauvain, Lori Lamel, Gregory Grefenstette, Guillaume Gravier, et al. A Scalable Video Search Engine Based on Audio Content Indexing and Topic Segmentation. 2011 Networked and Electronic Media (NEM) Summit: Implementing Future Media Internet, Sep 2011, Torino, Italy. 160 p. hal-00645228

HAL Id: hal-00645228
https://hal.archives-ouvertes.fr/hal-00645228
Submitted on 27 Nov 2011

Abstract: One important class of online videos is that of news broadcasts. Most news organisations provide near-immediate access to topical news broadcasts over the Internet, through RSS streams or podcasts. Until lately, technology has not made it possible for a user to automatically go to the smaller parts, within a longer broadcast, that might interest them. Recent advances in both speech recognition systems and natural language processing have led to a number of robust tools that allow us to provide users with quicker, more focussed access to relevant segments of one or more news broadcast videos. Here we present our new interface for browsing or searching news broadcasts (video/audio) that exploits these new language processing tools to (i) provide immediate access to topical passages within news broadcasts, (ii) browse news broadcasts by events as well as by people, places and organisations, (iii) perform cross-lingual search of news broadcasts, (iv) search for news through a map interface, (v) browse news by trending topics, and (vi) see automatically generated textual clues for news segments, before listening. Our publicly searchable demonstrator currently indexes daily broadcast news content from 50 sources in English, French, Chinese, Arabic, Spanish, Dutch and Russian.

Keywords: Video indexing, Video search, Video browsing

1 INTRODUCTION

Many of the major news organisations provide immediate access to topical news broadcasts over the internet, through RSS streams or podcasts. In parallel, many users rely on third-party sites¹ to describe topical extracts of longer news broadcasts. However, in spite of early attempts at broadcast news retrieval and browsing from speech [1, 2, 3], technology has not made it possible for a user to efficiently find small segments of interest from longer broadcasts within a large collection spanning multiple languages. In particular, work on topic segmentation of broadcast news, e.g. [4, 5], was limited in the number of shows and languages that could be handled.

¹ For example, reddit.com, huffingtonpost.com, newser.com, ...

Most current video search engines rely in large part on indexing the textual metadata associated with the video (title, tags, surrounding page text). Videos that are returned for a search over common search engines are those which contain the search terms in their metadata. Recent progress in spoken language processing (in transcription, topic segmentation, keyword extraction) has led to a number of robust tools that allow us to now provide users with quicker and more focussed access to relevant segments of one or more news broadcast videos. Researchers have focused on either one or the other aspect of the processing chain. For example, new approaches to topic segmentation of broadcast news have been proposed in [6, 7, 8], the latter focusing on robustness to different shows and to transcription errors. Still, few systems integrate all of these components in a complete and comprehensive large-scale demonstration able to return portions of videos relevant to a query, while also providing query-free browsing capabilities.

This paper presents a complete system demonstrating an alternative approach for browsing and searching videos and audio newscasts based on robust spoken document processing in multiple languages. In broadcast news, most of the linguistic information is encoded in the audio channel of video data, which, once transcribed, can be processed using natural language processing and semantic processing techniques. The interface presented here integrates many of these technologies to provide topical access to automatically identified broadcast segments.

The main novelty here, with respect to similar available systems (including previous versions of our own), is the topical segmentation of news broadcasts. All the search and browsing tools are generated from automatically detected topic segments, using methods described below. Search is constrained to each segment, and these segments are returned as the result of a search, though the entire broadcast is still accessible, if desired. For example, if a user searches for Ron Paul AND Barack Obama, only those segments in which both politicians are mentioned will be returned. Our interface also provides additional, more elaborate optional annotations for each segment: named entities, timestamps of mentions of each query term, a pincushion timeline bar showing mentions, and for each segment, a label of corpus-derived important terms mentioned in that segment. Querying can be performed in a language different from the broadcast language, exploiting commonly available translation tools.

The paper is organized as follows: Back-end processing of the video and audio sources is described in Section 2. Section 3 describes the user interface, illustrating the various features of the system. In Section 4, processing time is discussed, followed by a conclusion and a description of future evolution of our system.

Figure 1: Overview of the architecture of the video indexing

2 BACK END: OFF LINE PROCESSING

Our system indexes freely available podcast sources, broadcast via RSS streams. Figure 1 presents the main steps of the processing of these podcasts.

2.1 Automatic Speech Recognition (ASR)

State-of-the-art speech transcription systems for 7 languages (French, English, Spanish, Mandarin, Dutch, Russian and Arabic) are the core of the demonstration. The transcription systems make use of statistical modelling techniques similar to those described in [9, 10], which give details for an English broadcast news system. The acoustic and language models and pronunciation dictionaries are language dependent [11, 12], and trained on large audio and text corpora. Speech decoding is carried out in a single pass with statistical n-gram language models, and takes less time than the signal duration. Proper case is output for all languages, and postprocessing converts numerical quantities for amounts, dates and telephone numbers to Arabic numbers for English, French, and Spanish. The system outputs an XML file containing the words identified in the audio document, along with their time codes and a transcription confidence measure.

Our first processing step partitions the data into speech segments, and, after determining the gender, clusters segments from the same speaker. This information can later be combined with the content in the automatic transcription to associate true speaker names with parts of the data. Each language has recognition word lists containing from 50k to 300k words, which generally give good coverage of the language. However, "breaking news" may have repeated occurrences of words that are unknown to the system. New functionality has recently been incorporated which allows users to update the recognition word list, and this technology is currently undergoing experimentation. The automatic speech recognition technology used in our system has been frequently demonstrated to obtain top performance in international benchmarks.

2.2 Topic segmentation

A news broadcast is often divided into stories, which may have no relation with each other. If the broadcast is transcribed into one textual document, a complex search, such as Barack Obama in China, may return videos in which China is mentioned in one story and Barack Obama in another, contrary to what the user intended to find. To remedy this problem in the newest version of our system, we process the uninterrupted textual output of the automatic speech transcription by applying topic segmentation to break the transcript of a show into topically homogeneous segments. These segments would ideally correspond to individual reports in classical news.

Figure 2: News trends over different periods

respect to [14]. We added features to account for the peculiarities of broadcast news transcripts, namely

Topic segmentation has been studied in natural