Multimedia Content Analysis: Using Both Audio and Visual Clues
Yao Wang, Zhu Liu, and Jin-Cheng Huang

Multimedia content analysis refers to the computerized understanding of the semantic meanings of a multimedia document, such as a video sequence with an accompanying audio track. As we enter the digital multimedia information era, tools that enable such automated analysis are becoming indispensable for efficiently accessing, digesting, and retrieving information. Information retrieval, as a field, has existed for some time. Until recently, however, the focus has been on understanding text information, e.g., how to extract key words from a document, how to categorize a document, and how to summarize a document, all based on written text. With a multimedia document, its semantics are embedded in multiple forms that are usually complementary to each other. For example, live coverage on TV about an earthquake conveys information that is far beyond what we hear from the reporter. We can see and feel the effects of the earthquake, while hearing the reporter talking about the statistics. Therefore, it is necessary to analyze all types of data: image frames, sound tracks, texts that can be extracted from image frames, and spoken words that can be deciphered from the audio track. This usually involves segmenting the document into semantically meaningful units, classifying each unit into a predefined scene type, and indexing and summarizing the document for efficient retrieval and browsing.

In this article, we review recent advances in using audio and visual information jointly for accomplishing the above tasks. We will describe audio and visual features that can effectively characterize scene content, present selected algorithms for segmentation and classification, and review some testbed systems for video archiving and retrieval. We will also briefly describe audio and visual descriptors and description schemes that are being considered by the MPEG-7 standard for multimedia content description.

12 IEEE SIGNAL PROCESSING MAGAZINE NOVEMBER 2000 1053-5888/00/$10.00©2000IEEE

List of Abbreviations

Abbreviation   Full Term
4ME            4-Hz modulation energy
AMDF           Average magnitude difference function
AV             Audiovisual
BW             Bandwidth
CC             Cepstral coefficient
CCV            Color coherence vector
D              Descriptor
DCH            Difference between color histograms
DS             Description scheme
ERSB1/2/3      Subband energy ratio at frequency bands 0-630 Hz, 630-1720 Hz, and 1720-4400 Hz, respectively
FC             Frequency centroid
GMM            Gaussian mixture model
GoF/GoP        Group of frames/pictures
HMM            Hidden Markov model
KLT            Karhunen-Loeve transform
LSMDC          Least square minimum distance classifier
MDA            Multiple discriminant analysis
ME             Motion energy
MFCC           Mel-frequency cepstral coefficient
MoCA           Movie content analysis
MPEG           Moving Picture Experts Group
NPR            Nonpitch ratio
NSR            Nonsilence ratio
OCR            Optical character recognition
PCF            Phase correlation function
PSTD           Standard deviation of pitch
rms            Root mean square
SPR            Smooth pitch ratio
SPT            Spectral peak track
SVM            Support vector machine
VDR            Volume dynamic range
VQ             Vector quantizer
VSTD           Volume standard deviation
VU             Volume undulation
ZCR            Zero crossing rate
ZSTD           Standard deviation of ZCR

What Does Multimedia Content Analysis Entail?

The first step in any multimedia content analysis task is the parsing or segmentation of a document. (From here onwards, we use the word video to refer to both the image frames and the audio waveform contained in a video.) For a video, this usually means segmenting the entire video into scenes so that each scene corresponds to a story unit. Sometimes it is also necessary to divide each scene into shots, so that the audio and/or visual characteristics of each shot are coherent. Depending on the application, different tasks follow the segmentation stage. One important task is the classification of a scene or shot into some predefined category, which can be very high level (an opera performance in the Metropolitan Opera House), mid level (a music performance), or low level (a scene in which audio is dominated by music). Such semantic level classification is key to generating text-form indexes. Beyond such "labeled" indexes, some audio and visual descriptors may also be useful as low-level indexes, so that a user can retrieve a video clip that is aurally or visually similar to an example clip. Finally, video summarization is essential in building a video retrieval system to enable a user to quickly browse through a large set of returned items in response to a query. Beyond a text summary of the video content, some AV summaries will give the user a better grasp of the characters, the settings, and the style of the video.

Note that the above tasks are not mutually exclusive, but may share some basic elements or be interdependent. For example, both indexing and summarization may require the extraction of some key frames within each scene/shot that best reflect the visual content of the scene/shot. Likewise, scene segmentation and classification are dependent on each other, because segmentation criteria are determined by scene class definitions. A key to the success of all the above tasks is the extraction of appropriate audio and visual features. They are not only useful as low-level indexes, but also provide a basis for comparison between scenes/shots. Such a comparison is required for scene/shot segmentation and classification and for choosing sample frames/clips for summarization.

Earlier research in this field has focused on using visual features for segmentation, classification, and summarization. Recently, researchers have begun to realize that audio characteristics are equally, if not more, important when it comes to understanding the semantic content of a video. This applies not just to the speech information, which obviously provides semantic information, but also to generic acoustic properties. For example, we can tell whether a TV program is a news report, a commercial, or a sports game without actually watching the TV or understanding the words being spoken, because the background sound characteristics in these scenes are very different. Although it is also possible to differentiate these scenes based on the visual information, the audio-based approach requires significantly less complex processing. When audio alone can already give definitive answers regarding scene content, more sophisticated visual processing can be saved. On the other hand, audio analysis results can always be used to guide additional visual processing. When either audio or visual information alone is not sufficient for determining the scene content, combining audio and visual cues may resolve the ambiguities in the individual modalities and thereby help to obtain more accurate answers.

AV Features for Characterizing Semantic Content

A key to the success of any multimedia content analysis algorithm is the type of AV features employed for the analysis. These features must be able to discriminate among different target scene classes. Many features have been proposed for this purpose. Some of them are designed for specific tasks, while others are more general and can be useful for a variety of applications. In this section, we review some of these features. We will describe audio features in greater detail than visual features, as there have been several recent review papers covering visual features.

Audio Features

There are many features that can be used to characterize audio signals. Usually audio features are extracted at two levels: the short-term frame level and the long-term clip level. Here a frame is defined as a group of neighboring samples lasting about 10 to 40 ms, within which we can assume that the audio signal is stationary and short-term features such as volume and Fourier transform coefficients can be extracted.
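As a rough illustration of this two-level scheme, the sketch below splits a signal into fixed-length clips and frames, computes a frame-level feature (here the rms volume, one common definition), and then summarizes it over each clip with a standard deviation, in the spirit of the clip-level VSTD (volume standard deviation) feature. The 8-kHz sampling rate, the 20-ms/1-s lengths, and the unnormalized standard deviation are arbitrary choices for illustration, not the authors' exact definitions:

```python
import math

def split(seq, size):
    """Non-overlapping, fully filled windows of `size` samples."""
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, size)]

def rms_volume(frame):
    """Frame-level volume, approximated by the rms of the samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def clip_vstd(clip, frame_len):
    """Clip-level feature: standard deviation of the frame volumes."""
    vols = [rms_volume(f) for f in split(clip, frame_len)]
    mean = sum(vols) / len(vols)
    return math.sqrt(sum((v - mean) ** 2 for v in vols) / len(vols))

fs = 8000                    # assumed sampling rate (Hz)
frame_len = int(0.020 * fs)  # 20-ms frame -> 160 samples
clip_len = fs                # 1-s clip   -> 8000 samples = 50 frames

# Toy signal: 1 s of a 440-Hz tone followed by 1 s of silence.
tone = [math.sin(2 * math.pi * 440 * i / fs) for i in range(fs)]
signal = tone + [0.0] * fs

clips = split(signal, clip_len)
print(len(clips))  # 2 clips
print([clip_vstd(c, frame_len) for c in clips])  # second (silent) clip gives 0.0
```

Because the frame volumes are nearly constant within the tone and exactly zero within the silence, both clips have a small VSTD; features such as VSTD become discriminative when clips mix speech pauses, music, and noise of varying loudness.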
The concept of an audio frame comes from traditional speech signal processing, where analysis over a very short time interval has been found to be most appropriate. For a feature to reveal the semantic meaning of an audio signal, however, analysis over a much longer period is necessary, usually from one second to several tens of seconds. Here we call such an interval an audio clip (in the literature, the term "window" is sometimes used). A clip consists of a sequence of frames, and clip-level features usually characterize how frame-level features change over a clip. The clip boundaries may be the result of audio segmentation such that the frame features within each clip are similar. Alternatively, fixed-length clips, usually 1 to 2 seconds, may be used.

▲ 1. Decomposition of an audio signal into clips and frames.

Volume: The volume of frame n is commonly approximated by the root mean square (rms) of the signal magnitude within the frame:

v(n) = \sqrt{\frac{1}{N} \sum_{i=0}^{N-1} s_n^2(i)}

where s_n(i) is the i-th sample in frame n and N is the frame length (the rms volume is also referred to as energy). Note that the volume of an audio signal depends on the gain of the recording and digitizing devices. To eliminate the influence of such device-dependent conditions, we may normalize the volume of a frame by the maximum volume of some previous frames.

Zero Crossing Rate: Besides the volume, ZCR is another widely used temporal feature. To compute the ZCR of a frame, we count the number of times that the audio waveform crosses the zero axis. Formally,

Z(n) = \frac{1}{2} \sum_{i=1}^{N-1} \left| \operatorname{sign}\bigl(s_n(i)\bigr) - \operatorname{sign}\bigl(s_n(i-1)\bigr) \right| \cdot \frac{f_s}{N}

where f_s represents the sampling rate. ZCR is one of the most indicative and robust measures to discern unvoiced speech. Typically, unvoiced speech has a low volume but a high ZCR. By using ZCR and volume together, one can prevent low-energy unvoiced speech frames from being classified as silent.

Pitch: Pitch is the fundamental frequency of an audio