Multimedia Content Analysis Using Both Audio and Visual Clues
Yao Wang, Zhu Liu, and Jin-Cheng Huang

Multimedia content analysis refers to the computerized understanding of the semantic meaning of a multimedia document, such as a video sequence with an accompanying audio track. As we enter the digital multimedia information era, tools that enable such automated analysis are becoming indispensable for efficiently accessing, digesting, and retrieving information. Information retrieval, as a field, has existed for some time. Until recently, however, the focus has been on understanding text information, e.g., how to extract key words from a document, how to categorize a document, and how to summarize a document, all based on written text. With a multimedia document, its semantics are embedded in multiple forms that are usually complementary to each other. For example, live TV coverage of an earthquake conveys information far beyond what we hear from the reporter. We can see and feel the effects of the earthquake while hearing the reporter talk about the statistics. Therefore, it is necessary to analyze all types of data: image frames, sound tracks, text that can be extracted from image frames, and spoken words that can be deciphered from the audio track. This usually involves segmenting the document into semantically meaningful units, classifying each unit into a predefined scene type, and indexing and summarizing the document for efficient retrieval and browsing.

In this article, we review recent advances in using audio and visual information jointly for accomplishing the above tasks. We describe audio and visual features that can effectively characterize scene content, present selected algorithms for segmentation and classification, and review some testbed systems for video archiving and retrieval. We also briefly describe audio and visual descriptors and description schemes that are being considered by the MPEG-7 standard for multimedia content description.

What Does Multimedia Content Analysis Entail?

The first step in any multimedia content analysis task is the parsing or segmentation of a document. (From here onwards, we use the word video to refer to both the image frames and the audio waveform contained in a video.) For a video, this usually means segmenting the entire video into scenes so that each scene corresponds to a story unit. Sometimes it is also necessary to divide each scene into shots, so that the audio and/or visual characteristics of each shot are coherent. Depending on the application, different tasks follow the segmentation stage. One important task is the classification of a scene or shot into some predefined category, which can be very high level (an opera performance in the Metropolitan Opera House), mid level (a music performance), or low level (a scene in which the audio is dominated by music). Such semantic-level classification is key to generating text-form indexes.

Beyond such "labeled" indexes, some audio and visual descriptors may also be useful as low-level indexes, so that a user can retrieve a video clip that is aurally or visually similar to an example clip. Finally, video summarization is essential in building a video retrieval system, enabling a user to quickly browse through a large set of items returned in response to a query. Beyond a text summary of the video content, AV summaries give the user a better grasp of the characters, the settings, and the style of the video.

Note that the above tasks are not mutually exclusive, but may share some basic elements or be interdependent. For example, both indexing and summarization may require the extraction of some key frames within each scene/shot that best reflect the visual content of that scene/shot. Likewise, scene segmentation and classification are dependent on each other, because segmentation criteria are determined by scene class definitions. A key to the success of all the above tasks is the extraction of appropriate audio and visual features. They are not only useful as low-level indexes, but also provide a basis for comparison between scenes/shots. Such a comparison is required for scene/shot segmentation and classification and for choosing sample frames/clips for summarization.

Earlier research in this field focused on using visual features for segmentation, classification, and summarization. Recently, researchers have begun to realize that audio characteristics are equally, if not more, important when it comes to understanding the semantic content of a video. This applies not just to the speech information, which obviously provides semantic information, but also to generic acoustic properties. For example, we can tell whether a TV program is a news report, a commercial, or a sports game without actually watching the TV or understanding the words being spoken, because the background sound characteristics in these scenes are very different. Although it is also possible to differentiate these scenes based on the visual information, audio-based processing is significantly less complex. When audio alone can already give definitive answers regarding scene content, more sophisticated visual processing can be saved. On the other hand, audio analysis results can always be used to guide additional visual processing. When either audio or visual information alone is not sufficient for determining the scene content, combining audio and visual cues may resolve the ambiguities in the individual modalities and thereby help to obtain more accurate answers.

List of Abbreviations
4ME: 4-Hz modulation energy
AMDF: Average magnitude difference function
AV: Audiovisual
BW: Bandwidth
CC: Cepstral coefficient
CCV: Color coherence vector
D: Descriptor
DCH: Difference between color histograms
DS: Description scheme
ERSB1/2/3: Subband energy ratio at frequency bands 0-630 Hz, 630-1720 Hz, and 1720-4400 Hz, respectively
FC: Frequency centroid
GMM: Gaussian mixture model
GoF/GoP: Group of frames/pictures
HMM: Hidden Markov model
KLT: Karhunen-Loeve transform
LSMDC: Least square minimum distance classifier
MDA: Multiple discriminant analysis
ME: Motion energy
MFCC: Mel-frequency cepstral coefficient
MoCA: Movie content analysis
MPEG: Motion picture expert group
NPR: Nonpitch ratio
NSR: Nonsilence ratio
OCR: Optical character recognition
PCF: Phase correlation function
PSTD: Standard deviation of pitch
rms: Root mean square
SPR: Smooth pitch ratio
SPT: Spectral peak track
SVM: Support vector machine
VDR: Volume dynamic range
VQ: Vector quantizer
VSTD: Volume standard deviation
VU: Volume undulation
ZCR: Zero crossing rate
ZSTD: Standard deviation of ZCR

AV Features for Characterizing Semantic Content

A key to the success of any multimedia content analysis algorithm is the type of AV features employed for the analysis. These features must be able to discriminate among different target scene classes. Many features have been proposed for this purpose. Some of them are designed for specific tasks, while others are more general and can be useful for a variety of applications. In this section, we review some of these features. We describe audio features in greater detail than visual features, as there have been several recent review papers covering visual features.

Audio Features

There are many features that can be used to characterize audio signals. Usually audio features are extracted at two levels: the short-term frame level and the long-term clip level. Here a frame is defined as a group of neighboring samples lasting about 10 to 40 ms, within which we can assume that the audio signal is stationary and short-term features such as volume and spectral coefficients can be extracted. The concept of an audio frame comes from traditional speech signal processing, where analysis over such a short time interval has been found to be most appropriate.

For a feature to reveal the semantic meaning of an audio signal, analysis over a much longer period is necessary, usually from one second to several tens of seconds. Here we call such an interval an audio clip (in the literature, the term "window" is sometimes used). A clip consists of a sequence of frames, and clip-level features usually characterize how frame-level features change over a clip. The clip boundaries may be the result of audio segmentation such that the frame features within each clip are similar. Alternatively, fixed-length clips, usually 1 to 2 seconds, may be used. Both frames and clips may overlap with their previous ones, and the overlapping lengths depend on the underlying application. Figure 1 illustrates the relation between frames and clips. In the following, we first describe frame-level features and then move on to clip-level features.

▲ 1. Decomposition of an audio signal into clips and frames.

Frame-Level Features

Most of the frame-level features are inherited from traditional speech signal processing. Generally they can be separated into two categories: time-domain features, which are computed from the audio waveform directly, and frequency-domain features, which are derived from the Fourier transform of the samples over a frame. In the following, we use N to denote the frame length and s_n(i) to denote the ith sample in the nth audio frame.

Volume: The most widely used and easy-to-compute frame feature is volume. (Volume is also referred to as loudness, although strictly speaking, loudness is a subjective measure that depends on the frequency response of the human listener.) Volume is a reliable indicator for silence detection, which may help to segment an audio sequence and to determine clip boundaries. Normally volume is approximated by the rms of the signal magnitude within each frame. Specifically, the volume of frame n is calculated by

v(n) = \sqrt{\frac{1}{N} \sum_{i=0}^{N-1} s_n^2(i)}

(the rms volume is also referred to as energy). Note that the volume of an audio signal depends on the gain of the recording and digitizing devices. To eliminate the influence of such device-dependent conditions, we may normalize the volume of a frame by the maximum volume of some previous frames.

Zero Crossing Rate: Besides the volume, ZCR is another widely used temporal feature. To compute the ZCR of a frame, we count the number of times that the audio waveform crosses the zero axis. Formally,

Z(n) = \frac{1}{2} \left[ \sum_{i=1}^{N-1} \left| \mathrm{sign}\big(s_n(i)\big) - \mathrm{sign}\big(s_n(i-1)\big) \right| \right] \frac{f_s}{N},

where f_s represents the sampling rate. ZCR is one of the most indicative and robust measures for discerning unvoiced speech. Typically, unvoiced speech has a low volume but a high ZCR. By using ZCR and volume together, one can prevent low-energy unvoiced speech frames from being classified as silent.

Pitch: Pitch is the fundamental frequency of an audio waveform and is an important parameter in the analysis and synthesis of speech and music. Normally only voiced speech and harmonic music have well-defined pitch, but we can still use pitch as a low-level feature to characterize the fundamental frequency of any audio waveform. The typical pitch frequency for a human being is between 50 and 450 Hz, whereas the pitch range for music is much wider. It is not easy to robustly and reliably estimate the pitch value of an audio signal. Depending on the required accuracy and complexity constraints, different methods for pitch estimation can be applied [1].

One can extract pitch information by using either temporal or frequency analysis. Temporal estimation methods rely on computation of the short-time autocorrelation function R_n(l) or the AMDF A_n(l), where

R_n(l) = \sum_{i=0}^{N-l-1} s_n(i)\, s_n(i+l)

and

A_n(l) = \sum_{i=0}^{N-l-1} \left| s_n(i+l) - s_n(i) \right|.

Figure 2 shows the autocorrelation function and AMDF for a typical voiced speech frame. We can see that there exist periodic peaks in the autocorrelation function. Similarly, there are periodic valleys in the AMDF. Here peaks and valleys are defined as local extremes that satisfy additional constraints in terms of their values relative to the global minimum and their curvatures. For example, the AMDF in Fig. 2 has two valleys, and the pitch frequency is the reciprocal of the time period between the origin and the first valley. Such valleys exist in voiced and music frames and vanish in noise or unvoiced frames.
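To make these frame-level definitions concrete, the following is a minimal sketch (our illustration, not code from the cited papers) of how the volume, ZCR, and an AMDF-based pitch estimate could be computed for one frame with NumPy. The frame length, the normalization of the AMDF, and the valley-search range are illustrative assumptions.

```python
import numpy as np

def frame_volume(frame):
    """RMS volume v(n) of one frame (1-D array of samples)."""
    return np.sqrt(np.mean(frame ** 2))

def frame_zcr(frame, fs):
    """Zero crossing rate Z(n), scaled by the sampling rate fs."""
    signs = np.sign(frame)
    return 0.5 * np.sum(np.abs(np.diff(signs))) * fs / len(frame)

def frame_pitch_amdf(frame, fs, fmin=50.0, fmax=450.0):
    """Crude pitch estimate: deepest AMDF valley in the lag range
    corresponding to [fmin, fmax] Hz.  The AMDF is normalized by the
    number of terms at each lag; callers should reject unvoiced frames."""
    n = len(frame)
    lags = np.arange(int(fs / fmax), int(fs / fmin) + 1)
    amdf = np.array([np.mean(np.abs(frame[l:] - frame[:n - l])) for l in lags])
    return fs / lags[np.argmin(amdf)]   # pitch frequency in Hz

# Example use with 30-ms non-overlapping frames from a mono signal x:
# N = int(0.03 * fs)
# frames = [x[i:i + N] for i in range(0, len(x) - N, N)]
# volumes = [frame_volume(f) for f in frames]
```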

▲ 2. Autocorrelation and AMDF of a typical male voice segment.

In frequency-based approaches, pitch is determined from the periodic structure in the magnitude of the Fourier transform or cepstral coefficients of a frame. For example, we can determine the pitch by finding the maximum common divisor of all the local peaks in the magnitude spectrum [2]. When the required accuracy is high, a large-size Fourier transform needs to be computed, which is time consuming.

Spectral Features: The spectrum of an audio frame refers to the Fourier transform of the samples in this frame. Figure 3 shows the waveforms of three audio clips digitized from TV broadcasts. The commercial clip contains male speech over a music background, the news clip includes clean male speech, and the sports clip is from a live broadcast of a basketball game. Figure 4 shows the spectrograms (magnitude spectra of successive overlapping frames) of these three clips. Obviously, the difference among these three clips is more noticeable in the frequency domain than in the waveform domain. Therefore, features computed from the spectrum are likely to help audio content analysis.

▲ 3. Waveforms of commercial, news, and sports clips.

The difficulty with using the spectrum itself as a frame-level feature lies in its very high dimension. For practical applications, it is necessary to find a more succinct description. Let S_n(\omega) denote the power spectrum (i.e., the magnitude square of the spectrum) of frame n. If we think of \omega as a random variable and S_n(\omega) normalized by the total power as the probability density function of \omega, we can define the mean and standard deviation of \omega. It is easy to see that the mean measures the FC, whereas the standard deviation measures the BW of the signal. They are defined as [3]

FC(n) = \frac{\int_0^\infty \omega S_n(\omega)\, d\omega}{\int_0^\infty S_n(\omega)\, d\omega}

and

BW^2(n) = \frac{\int_0^\infty \big(\omega - FC(n)\big)^2 S_n(\omega)\, d\omega}{\int_0^\infty S_n(\omega)\, d\omega}.

It has been found that FC is related to the human sensation of the brightness of a sound we hear [4].

In addition to FC and BW, Liu et al. proposed to use the ratio of the energy in a frequency subband to the total energy as a feature [3], which is referred to as ERSB. Considering the perceptual properties of human ears, the entire frequency band is divided into four subbands, each consisting of the same number of critical bands, where the critical bands correspond to cochlear filters in the human auditory model [5]. Specifically, when the sampling rate is 22050 Hz, the frequency ranges of the four subbands are 0-630 Hz, 630-1720 Hz, 1720-4400 Hz, and 4400-11025 Hz. Because the summation of the four ERSBs is always one, only the first three ratios are used as audio features, referred to as ERSB1, ERSB2, and ERSB3, respectively.
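As a rough illustration (our own sketch, not code from [3]), the frequency centroid, bandwidth, and subband energy ratios of a frame can be computed from its power spectrum as follows; the subband edges are those quoted above for a 22 050-Hz sampling rate.

```python
import numpy as np

def spectral_features(frame, fs):
    """Frequency centroid (FC), bandwidth (BW), and ERSB1-3 of one frame."""
    spec = np.abs(np.fft.rfft(frame)) ** 2           # power spectrum S_n(w)
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)  # frequency of each bin (Hz)
    total = np.sum(spec) + 1e-12                     # guard against silent frames
    fc = np.sum(freqs * spec) / total
    bw = np.sqrt(np.sum((freqs - fc) ** 2 * spec) / total)
    # Subband edges (Hz) for fs = 22050 Hz, as described in the text
    edges = [0, 630, 1720, 4400, fs / 2]
    ersb = [np.sum(spec[(freqs >= lo) & (freqs < hi)]) / total
            for lo, hi in zip(edges[:-1], edges[1:])]
    # ERSB4 is redundant because the four ratios sum to one
    return fc, bw, ersb[0], ersb[1], ersb[2]
```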

▲ 4. Spectrograms of commercial, news, and sports clips.

Scheirer et al. used the spectral rolloff point as a frequency-domain feature [6], which is defined as the 95th percentile of the power spectrum. This is useful for distinguishing voiced from unvoiced speech. It is a measure of the "skewness" of the spectral shape, with a right-skewed distribution having a higher value.

MFCCs or CCs [7] are widely used for speech recognition and speaker recognition. While both of them provide a smoothed representation of the original spectrum of an audio signal, MFCC further considers the nonlinear property of the human hearing system with respect to different frequencies. Based on the temporal change of MFCC, an audio sequence can be segmented into different segments, so that each segment contains music of the same style or speech from one person. Boreczky and Wilcox used 12 cepstral coefficients along with some color and motion features to segment video sequences [8].

Clip-Level Features

As described before, frame-level features are designed to capture the short-term characteristics of an audio signal. To extract the semantic content, we need to observe the temporal variation of frame features on a longer time scale. This consideration leads to the development of various clip-level features, which characterize how frame-level features change over a clip. Therefore, clip-level features can be grouped by the type of frame-level features on which they are based.

Volume Based: Figure 5 presents the volume contours of the three clips previously shown in Fig. 3. By comparing these three graphs, we see that the mean volume of a clip does not necessarily reflect the scene content, but the temporal variation of the volume in a clip does. To measure the variation of volume, Liu et al. proposed several clip-level features [3]. The VSTD is the standard deviation of the volume over a clip, normalized by the maximum volume in the clip. The VDR is defined as (\max(v) - \min(v)) / \max(v), where \min(v) and \max(v) are the minimum and maximum volume within an audio clip. Obviously these two features are correlated, but they do carry some independent information about the scene content. Another feature is VU, which is the accumulation of the differences between neighboring peaks and valleys of the volume contour within a clip.

Scheirer proposed to use the percentage of "low-energy" frames [6], which is the proportion of frames with rms volume less than 50% of the mean volume within one clip. Liu et al. used the NSR [3], the ratio of the number of nonsilent frames to the total number of frames in a clip, where silence detection is based on both volume and ZCR.

The volume contour of a speech waveform typically peaks at 4 Hz. To discriminate speech from music, Scheirer et al. proposed a feature called 4ME [6], which is calculated based on the energy distribution in 40 subbands. Liu et al. proposed a different definition that can be computed directly from the volume contour. Specifically, it is defined as [3]

4ME = \frac{\int_0^\infty W(\omega) |C(\omega)|^2\, d\omega}{\int_0^\infty |C(\omega)|^2\, d\omega},

where C(\omega) is the Fourier transform of the volume contour of a given clip and W(\omega) is a triangular window function centered at 4 Hz. Speech clips usually have higher values of 4ME than music or noise clips.

ZCR Based: Figure 6 shows the ZCR contours of the previous three clips. We can see that, with a speech signal, low and high ZCR periods are interlaced. This is because voiced and unvoiced sounds often occur alternately in speech. The commercial clip has a relatively smooth contour since it has a strong music background. The sports clip has a smoother contour than the news clip, due to the presence of a noise background. Liu et al. used the ZSTD within a clip to classify different audio content [3]. Saunders proposed to use four statistics of the ZCR as features [9]. These are i) the standard deviation of first-order differences, ii) the third central moment about the mean, iii) the total number of zero crossings exceeding a threshold, and iv) the difference between the number of zero crossings above and below the mean value. Combined with the volume information, the proposed algorithm can discriminate speech and music at a high accuracy of 98%.
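The volume-based clip features described above can be sketched in a few lines of NumPy (again our own illustration; the silence threshold, the volume-only silence test, and the width of the 4-Hz window are assumptions, not values from [3] or [6]).

```python
import numpy as np

def volume_clip_features(volumes, frame_rate, silence_thresh=0.02):
    """Clip-level features from a volume contour (one value per frame)."""
    v = np.asarray(volumes, dtype=float)
    vmax = v.max() + 1e-12
    vstd = v.std() / vmax                          # VSTD
    vdr = (v.max() - v.min()) / vmax               # VDR
    # VU: accumulated difference between neighboring peaks and valleys
    d = np.diff(v)
    turning = np.where(np.sign(d[1:]) != np.sign(d[:-1]))[0] + 1
    vu = np.sum(np.abs(np.diff(v[turning]))) if turning.size > 1 else 0.0
    # NSR: the text uses volume and ZCR; this sketch uses volume only
    nsr = np.mean(v > silence_thresh * vmax)
    # 4ME: energy of the (zero-mean) volume contour near 4 Hz / total energy
    C = np.abs(np.fft.rfft(v - v.mean())) ** 2
    f = np.fft.rfftfreq(len(v), d=1.0 / frame_rate)
    w = np.clip(1.0 - np.abs(f - 4.0) / 2.0, 0.0, None)   # triangular window at 4 Hz
    me4 = np.sum(w * C) / (np.sum(C) + 1e-12)
    return vstd, vdr, vu, nsr, me4
```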

Pitch Based: Figure 7 shows the pitch contours of the three audio clips, which are obtained using the autocorrelation function method. In these graphs, the frames that are silent or without detected pitch are assigned a zero pitch frequency. For the news clip, by comparing with its volume and ZCR contours, we know that the zero-pitch segments correspond to silence or unvoiced speech. Although the sports clip has many zero-pitch segments, they correspond not to silent periods but to periods with only background sounds. There are many discontinuous pitch segments in the commercial clip, within each of which the pitch value is almost constant. This pattern is due to the music background in the commercial clip. The pitch frequency of a speech signal is primarily influenced by the speaker (male or female), whereas the pitch of a music signal is dominated by the strongest note that is being played. It is not easy to derive the scene content directly from the pitch level of isolated frames, but the dynamics of the pitch contour over successive frames appear to reveal the scene content more.

Liu et al. utilized three clip-level features to capture the variation of pitch [3]: PSTD, SPR, and NPR. SPR is the percentage of frames in a clip that have similar pitch as the previous frames. This feature is used to measure the percentage of voiced or music frames within a clip, since only voiced speech and music have smooth pitch. On the other hand, NPR is the percentage of frames without pitch. This feature can measure how many frames are unvoiced speech or noise within a clip.

Frequency Based: Given frame-level features that reflect the frequency distribution, such as FC, BW, and ERSB, one can compute their mean values over a clip to derive corresponding clip-level features. Since a frame with high energy has more influence on the sound perceived by the human ear, Liu et al. proposed using a weighted average of the corresponding frame-level features, where the weighting for a frame is proportional to the energy of the frame [3]. This is especially useful when there are many silent frames in a clip, because the frequency features in silent frames are almost random. By using energy-based weighting, their detrimental effects can be removed.

Zhang and Kuo used SPTs in a spectrogram to classify audio signals [10]. First, SPT is used to detect music segments. If there are tracks that stay at about the same frequency level for a certain period of time, this period is considered a music segment. Then, SPT is used to further classify music segments into three subclasses: song, speech with music, and environmental sound with a music background. Song segments have one of three features: ripple-shaped harmonic peak tracks due to voice sounds, tracks with longer duration than speech, and tracks with fundamental frequency higher than 300 Hz. A speech-with-music-background segment has SPTs concentrating in the lower to middle frequency bands and has lengths within a certain range. Segments without these characteristics are classified as environmental sound with a music background.

There are other clip features that are very useful. Due to space limits, we cannot include all of them here. Interested readers are referred to [11]-[13].

Visual Features

Several excellent papers have appeared recently, summarizing and reviewing various visual features useful for image/video indexing [14], [15]. Therefore, we only briefly review some of the visual features in this article. Visual features can be categorized into four groups: color, texture, shape, and motion. We describe these separately.

Color: Color is an important attribute for image representation. The color histogram, which represents the color distribution in an image, is one of the most widely used color features. It is invariant to image rotation, translation, and viewing axis. The effectiveness of the color histogram feature depends on the color coordinates used and the quantization method. Wan and Kuo [16] studied the effect of different color quantization methods in different color spaces, including RGB, YUV, HSV, and CIE L*u*v*. When it is not feasible to use the complete color histogram, one can also specify the first few dominant colors (the color values and their percentages) in an image.

▲ 5. Volume contours of commercial, news, and sports clips.
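As an illustration of the color features just described (a generic sketch, not the method of [16]), a quantized color histogram and the first few dominant colors can be computed as follows; the uniform 4x4x4 RGB quantization is an arbitrary choice.

```python
import numpy as np

def color_histogram(rgb_image, bins_per_channel=4):
    """Normalized histogram over uniformly quantized RGB colors.
    rgb_image: H x W x 3 array with values in [0, 255]."""
    q = (rgb_image.astype(int) * bins_per_channel) // 256           # 0..bins-1 per channel
    index = (q[..., 0] * bins_per_channel + q[..., 1]) * bins_per_channel + q[..., 2]
    hist = np.bincount(index.ravel(), minlength=bins_per_channel ** 3)
    return hist / hist.sum()

def dominant_colors(hist, k=3):
    """Indices and percentages of the k most frequent quantized colors."""
    top = np.argsort(hist)[::-1][:k]
    return list(zip(top.tolist(), hist[top].tolist()))
```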

A problem with the color histogram is that it does not consider the spatial configuration of pixels with the same color. Therefore, images with similar histograms can have drastically different appearances. Several approaches have been proposed to circumvent this problem. In [17], Pass and Zabih proposed a histogram refinement algorithm. The algorithm is based on the CCV, which partitions pixels based upon their spatial coherence. A pixel is considered coherent if it belongs to a sizable contiguous region with similar colors. A CCV is a collection of coherence pairs, which are the numbers of coherent and incoherent pixels, for each quantized color. Similarly, Chen and Wong proposed an augmented image histogram [18], which includes, for each color, not only its probability, but also the mean, variance, and entropy of the pair-wise distances among pixels with this color.

Texture: Texture is an important feature of a visible surface where repetition or quasi-repetition of a fundamental pattern occurs. There are two popular texture representations: the co-occurrence matrix representation and the Tamura representation. A co-occurrence matrix describes orientation and distance between image pixels, from which meaningful statistics can be extracted. Contrast, inverse difference moment, and entropy have been found to have the biggest discriminatory power [19]. The Tamura representation is motivated by psychological studies of human visual perception of texture and includes measures of coarseness, contrast, directionality, linelikeness, regularity, and roughness [20]. Tamura features are attractive in image retrieval because they are visually meaningful. Other representations include Markov random fields, Gabor transforms, and wavelet transforms.

Shape: Shape features can be represented using traditional shape analysis tools such as moment invariants, Fourier descriptors, autoregressive models, and geometric attributes. They can be classified into global and local features. Global features are properties derived from the entire shape. Examples of global features are roundness or circularity, central moments, eccentricity, and major axis orientation. Local features are those derived by partial processing of a shape and do not depend on the entire shape. Examples of local features are the size and orientation of consecutive boundary segments, points of curvature, corners, and turning angle. The most popular global representations are the Fourier descriptor and moment invariants. The Fourier descriptor uses the Fourier transform of the boundary as the shape feature. Moment invariants use region-based moments, which are invariant to geometric transformations. Studies have shown that the combined representation of Fourier descriptor and moment invariants performs better than using either alone [21]. Other works in shape representation include the finite element method [22], the turning function [23], and wavelet descriptors [24].

▲ 6. ZCR contours of commercial, news, and sports clips.

▲ 7. Pitch contours of commercial, news, and sports clips.

Motion: Motion is an important attribute of video. Motion information can be generated by block-matching or optical flow techniques. Motion features such as moments of the motion field, the motion histogram, or global motion parameters (e.g., affine or bilinear) can be extracted from the motion vectors. High-level features that reflect camera motions such as panning, tilting, and zooming can also be extracted.
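The following sketch (ours, assuming a block motion field has already been estimated) shows how a motion histogram and the dominant motion vectors with their percentages, the kind of motion features used later in this article, might be derived from motion vectors.

```python
import numpy as np

def motion_histogram(mv, bin_size=4):
    """Histogram of quantized motion vectors.
    mv: N x 2 array of (dx, dy) block motion vectors (e.g., from block matching).
    Returns the distinct quantized motions (most frequent first) and their percentages."""
    mv = np.asarray(mv, dtype=float)
    q = np.round(mv / bin_size).astype(int)
    labels, counts = np.unique(q, axis=0, return_counts=True)
    order = np.argsort(counts)[::-1]
    return labels[order] * bin_size, counts[order] / counts.sum()

# Example: mean/variance of the field plus the top-4 dominant motions
# vectors, perc = motion_histogram(mv)
# features = np.concatenate([mv.mean(axis=0), mv.var(axis=0),
#                            vectors[:4].ravel(), perc[:4]])
```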

Another useful motion feature is the PCF between two frames [25]. When one frame is a translation of the other, the PCF has a single peak at a location corresponding to the translation vector. When there are multiple objects with different motions in the imaged scene, the PCF tends to have multiple peaks, each with a magnitude proportional to the number of pixels experiencing a particular motion. In this sense, the PCF reveals similar information as the motion histogram, but it can be computed from the image functions directly and therefore is not affected by motion estimation inaccuracy. Figure 8 shows the PCFs corresponding to several typical motions. For a motion field that contains primarily zero motion, the PCF has a single peak at (0, 0). For a motion field that contains a global translation, the peak occurs at a nonzero position. The peak spreads out gradually when the camera zooms, and the PCF is almost flat when a shot change occurs and the estimated motion field appears as a uniform random field. Instead of using the entire PCF, a parametric representation can be used as a motion feature.
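A minimal sketch of the phase correlation function between two gray-level frames, computed directly from the image data as described above; windowing and sub-pixel refinement are omitted.

```python
import numpy as np

def phase_correlation(frame1, frame2, eps=1e-8):
    """Phase correlation surface between two equally sized gray-level images.
    A single sharp peak indicates a global translation (its offset from the
    center gives the displacement); a flat surface suggests a shot change."""
    F1 = np.fft.fft2(frame1)
    F2 = np.fft.fft2(frame2)
    cross = F1 * np.conj(F2)
    cross /= np.abs(cross) + eps                 # keep phase only
    pcf = np.real(np.fft.ifft2(cross))
    return np.fft.fftshift(pcf)                  # zero displacement at the center

# pcf = phase_correlation(prev_frame, curr_frame)
# peak = np.unravel_index(np.argmax(pcf), pcf.shape)   # dominant displacement
```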


▲ 8. The phase correlation functions for typical motions: (a) static background (news report), (b) camera pan [football game; peaks at (−2, 7) and (−1, 7)], (c) camera zoom into a baseball field, and (d) scene change.

Correlation between Audio and Visual Features and Feature Space Reduction

Given the almost endless list of audio and visual features that one can come up with, a natural question to ask is whether they provide independent information about the scene content and, if not, how to derive a reduced set of features that can best serve the purpose. One way to measure the correlation among features within the same modality and across different modalities is by computing the covariance matrix

C = \frac{1}{N} \sum_{x \in \chi} (x - m)(x - m)^T \quad \text{with} \quad m = \frac{1}{N} \sum_{x \in \chi} x,

where x = (x_1, x_2, \ldots, x_K)^T is a K-dimensional feature vector, \chi is the set containing all feature vectors derived from training sequences, and N is the total number of feature vectors in \chi. The normalized correlation between features i and j is defined by

\tilde{C}(i, j) = \frac{C(i, j)}{\sqrt{C(i, i)\, C(j, j)}},

where C(i, j) is the (i, j)th element of C.

Figure 9 shows the normalized correlation matrix, in absolute value, derived from a training set containing five types of TV programs: commercials, news, live basketball games, live football games, and weather forecasts. About ten minutes of each scene type are included in the training set. A total of 28 features are considered: 14 audio features, eight color features, and six motion features. The audio features consist of VSTD, VDR, VU, ZSTD, NSR, 4ME, PSTD, SPR, NPR, FC, BW, ERSB1, ERSB2, and ERSB3. The color features include the mean values of the three color components and the percentage of the most dominant color, followed by their variances; the motion features are the mean values of the two motion components and the percentage of the most dominant motion vector, followed by their variances. The color and motion features are collected over the same time duration associated with an audio clip, so that the mean or variance is calculated over all frames corresponding to an audio clip. As shown in Fig. 9, the correlation between features from different modalities (e.g., audio, color, and motion) is very low. Within the same modality, high correlation exists between some features, such as NSR and VSTD, VSTD and VDR, SPR and NPR, FC and BW, and ERSB1 and ERSB2 among the audio features, and between the means and variances of the three color components of the dominant color.

The high correlation between certain features in the above example suggests that the feature dimension can be reduced through proper transformations. Two powerful feature space reduction techniques are the KLT and MDA, both using linear transforms. With the KLT, the transform is designed to decorrelate the features, and only those features with eigenvalues larger than a threshold are retained. With MDA, the transform is designed to maximize the ratio of the between-class scattering to the within-class scattering. The maximum dimension of the new feature space is the number of classes minus one. Figure 10 shows the distribution of feature points from five scene classes (denoted by different symbols and colors). The original feature space consists of the same 28 features used in Fig. 9. The left plot is based on two original features: FC and the mean of the most dominant color (red component). The middle plot is based on the first two features obtained after applying the KLT to the original feature vector. The right plot is based on the first two features obtained with MDA. We can easily see that there is the least amount of between-class overlap in the feature space obtained with MDA. This means that the two new features after MDA have the best scene discrimination capability.
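The correlation computation and a KLT-style reduction can be sketched as follows (a plain eigendecomposition, not the authors' code). MDA would be analogous but would use the between-class and within-class scatter matrices in place of C.

```python
import numpy as np

def normalized_correlation(X):
    """X: N x K matrix of feature vectors (rows).  Returns the K x K matrix
    of |C~(i, j)|.  Note np.cov uses 1/(N-1) rather than 1/N, which does not
    affect the normalized correlation."""
    C = np.cov(X, rowvar=False)
    d = np.sqrt(np.diag(C)) + 1e-12
    return np.abs(C / np.outer(d, d))

def klt_reduce(X, num_features=2):
    """Project the (centered) features onto the leading eigenvectors of C (KLT)."""
    Xc = X - X.mean(axis=0)
    C = np.cov(Xc, rowvar=False)
    eigval, eigvec = np.linalg.eigh(C)              # eigenvalues in ascending order
    order = np.argsort(eigval)[::-1][:num_features]
    return Xc @ eigvec[:, order]                    # N x num_features
```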
Video Segmentation Using AV Features

Video segmentation is a fundamental step for analyzing the content of a video sequence and for the efficient accessing, retrieval, and browsing of large video databases. A video sequence usually consists of separate scenes, and each scene includes many shots. The ultimate goal of video segmentation is to automatically group shots into what a human being perceives as "scenes." In the literature, scenes are also referred to as story units, story segments, or video paragraphs. Using movie production terminology, a shot is "one uninterrupted image with a single static or mobile framing" [26], and a scene is "usually composed of a small number of inter-related shots that are unified by location or dramatic incident" [27]. Translating the definition of shots into technical terms, a shot should be a group of frames that have consistent visual (including color, texture, and motion) characteristics. Typically, the camera direction and view angle define a shot: when a camera looks at the same scene from different angles, or at different regions of a scene from the same angle, we see different (camera) shots. Because shots are characterized by the coherence of some low-level visual features, it is a relatively easy task to separate a video into shots. On the other hand, the clustering of "shots" into "scenes" depends on a subjective judgment of semantic correlation. Strictly speaking, such clustering requires an understanding of the semantic content of the video. By joint analysis of audio and visual characteristics, however, it is sometimes possible to recognize shots that are related in location or events, without actually invoking high-level analysis of semantic meanings.

▲ 9. The normalized correlation matrix between features from different modalities. The first 14 features are audio features; the next eight are color features; and the last six are motion features.

Earlier works on video segmentation have primarily focused on using the visual information. They mostly rely on comparing color, motion, or other visual features. The resulting segments usually correspond to individual camera shots, which are often overly detailed for content-level analysis. Recognizing the importance of audio in video segmentation, more research effort has recently been devoted to scene-level segmentation using joint AV analysis. We review one such work in this section. Other approaches that accomplish scene segmentation and classification jointly are reviewed later on. We also review an approach for shot segmentation using both audio and visual information.

Hierarchical Segmentation

In [28], a hierarchical segmentation approach was proposed that can detect scene breaks and shot breaks at different hierarchies. The algorithm is based on the observation that a scene change is usually associated with simultaneous changes of color, motion, and audio characteristics, whereas a shot break is only accompanied by visual changes. For example, a TV commercial often consists of many shots, but the audio in the same commercial follows the same rhythm and tone. The algorithm proceeds in two steps. First, significant changes in audio, color, and motion characteristics are detected separately. Then shot and scene breaks are identified depending on the coincidence of the changes in audio, color, and motion.

To detect audio breaks, an audio feature vector consisting of 14 audio features is computed over each short audio clip. The audio features consist of VSTD, VDR, VU, ZSTD, NSR, 4ME, PSTD, SPR, NPR, FC, BW, ERSB1, ERSB2, and ERSB3. Then, at each time instance, the difference between the audio features in several previous clips and several following clips is calculated. This difference is further normalized by the variance of the audio features in these clips to yield a measure of the relative change in audio characteristics.
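To illustrate this audio-break measure (a sketch under assumptions; the window length and any threshold are not values from [28]), one can compare the mean feature vectors of a few clips before and after each time instant and normalize by the local variance:

```python
import numpy as np

def audio_change_curve(features, half_window=3):
    """features: T x K matrix of clip-level audio feature vectors.
    Returns a relative-change score for each interior time instant."""
    T = features.shape[0]
    score = np.zeros(T)
    for t in range(half_window, T - half_window):
        before = features[t - half_window:t]
        after = features[t:t + half_window]
        local_var = np.vstack([before, after]).var(axis=0) + 1e-9
        diff = (before.mean(axis=0) - after.mean(axis=0)) ** 2 / local_var
        score[t] = diff.mean()
    return score

# audio_breaks = audio_change_curve(clip_features) > threshold   # threshold is user-chosen
```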


▲ 10. Distribution of two features in the original feature vector, after KLT, and after MDA. Feature vectors are extracted from five scene classes.

Color breaks are detected by comparing the color histograms of adjacent video frames. Directly thresholding the histogram difference value can yield false breaks during a fast camera pan, where different parts (differing in color) of the same scene are captured in successive frames. Therefore, a relative change is examined, which normalizes the histogram difference by the mean differences of several previous and future frames.

Motion breaks are detected by comparing the PCFs of adjacent frames. For frames that experience camera motion acceleration or deceleration, a shift of the dominant motion location often occurs in the PCF. Hence, the correlation of two PCFs is used instead of a simple subtraction. Finally, a relative change measure is used to locate motion break boundaries.

Real shot breaks are usually accompanied by significant changes in color and discontinuities in motion. Using the color histogram alone may result in false detection under lighting changes (e.g., a camera flash). This false detection can be avoided by examining the PCF, which is invariant to lighting changes. When no significant color change occurs during a shot break, the motion change is usually very significant. In the algorithm of [28], shot breaks are located by using three thresholds: T_c for color, and T_m^{high} and T_m^{low} for motion. A shot break is declared only if the change in motion is above T_m^{high}, or the change in motion is above T_m^{low} and the change in color is above T_c.

To detect scene breaks, frames with both audio and visual breaks are located. The threshold for audio breaks is intentionally set low so that no scene changes are missed. The falsely detected breaks can be corrected by the visual information. For each detected audio break, visual breaks are searched for in frames neighboring the time instant associated with the audio break. If a visual break is present, then a scene change is declared.

The above algorithm has been tested on several 3-minute-long sequences digitized from broadcast TV programs. Figure 11 shows the result for one test sequence, which comprises five segments with different semantic content. All the scene breaks are correctly identified. Most of the true shot breaks are detected correctly, with one falsely detected break and one missed break. The false detection happens when the camera suddenly stops tracking the basketball player, and the missed break is due to the imposed constraint that forbids more than one break within one second.

▲ 11. Scene and shot segmentation result for a testing sequence using audio, color, and motion information. The short lines indicate detected shot breaks, whereas the tall lines indicate scene breaks.
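The decision logic of [28], as described above, can be paraphrased in a few lines; the threshold values below are placeholders, and the paper actually searches a neighborhood around each audio break rather than assuming time-aligned measurements.

```python
def detect_breaks(color_change, motion_change, audio_break,
                  t_c=0.5, t_m_low=0.3, t_m_high=0.7):
    """Combine per-instant change measures into shot/scene break decisions."""
    shot_break = (motion_change > t_m_high) or \
                 (motion_change > t_m_low and color_change > t_c)
    # A scene break requires an audio break coinciding with a visual break
    scene_break = audio_break and shot_break
    return shot_break, scene_break
```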
Video Shot Detection and Classification Using HMM

A common approach to detecting shot boundaries is to compute the difference between the luminance or color histograms of two adjacent frames and compare it to a preset threshold. A problem with this approach is that it is hard to select a threshold that will work with different shot transitions. To circumvent this problem, Boreczky and Wilcox proposed an alternative approach that uses an HMM (see "TV Program Categorization Using HMM" for more details on HMMs) to model a video sequence and accomplish shot segmentation and classification simultaneously [8]. By turning the problem into a classification problem, the need for thresholding is eliminated. Another advantage of the HMM framework is that it can integrate multimodal features easily.

The use of an HMM for modeling a video sequence is motivated by the fact that a video consists of different shots connected by different transition types and camera motions. As shown in Fig. 12, different states of the HMM are used to model different types of video segments: the shots themselves, the transitions between them (cuts, fades, and dissolves), and camera motions (pans and zooms). Transition is allowed only between two states connected through an arc. For example, from the shot state, it is possible to go to any of the transition states, but a transition state cannot go to a different transition state.

The parameters of the HMM (including the distribution of each state and the transition probabilities between states) are learned using training data of a video manually labeled with shots, transition types, and motion types. Note that, in fact, this model is not a "hidden" one, as the states are prelabeled, and the probability distribution of each state is trained using training segments with the corresponding label. Once the HMM is trained, a given video is segmented into its component shots and transitions by applying the Viterbi algorithm to determine the most likely sequence of states.

Both audio and visual features can be used in the input data for the HMM, and they are calculated at every new video frame time. The visual features are the gray-level histogram difference and two motion features that are intended to separate camera pans and zooms. The audio features are computed in two stages. First, a 12-dimensional cepstral vector is computed over short (20 ms) audio clips. Then all such vectors in a 2-second interval before the frame being considered and those in the interval after this frame are compared to yield a single audio feature, which measures the likelihood that the two intervals share the same sound type.

This method has been applied to portions of a video database containing television shows, news, movies, commercials, and cartoons. The HMM method, with either the histogram feature alone or with additional audio or motion features, was compared to the thresholding method. For shot boundary detection, the HMM method achieved higher precision than the threshold method, while maintaining a similar recall rate. The HMM method has the added advantage that it can classify the motion and transition types. Using the histogram and audio features together yielded the highest classification accuracy.

Scene Content Classification

The goal of scene content classification is to label an AV segment as one of several predefined semantic classes. Scene classification can be done at different levels. At the very low level, an audio or visual segment can be categorized into some elementary classes, e.g., speech versus music, indoor versus outdoor, high action versus low action. At the next, mid level, some basic scene types may be identified, such as a dialog between two people, an indoor concert, or a typical scene on the beach. At the highest level, the actual story needs to be understood, e.g., a hurricane in Florida or a New Year's Eve celebration in New York's Times Square. Somewhat in parallel with mid-level scene classification is the problem of recognizing typical video categories (or genres), e.g., news report, commercial, sports game, cartoon, and, in the case of a movie, drama, comedy, or action.

Obviously, based on low-level audio and visual features alone, it is hard to understand the story content. On the other hand, there have been encouraging research efforts demonstrating successful low- to mid-level scene classification and categorization. In this section, we first review works on low-level audio content classification, and then we describe works on basic scene type detection and video program categorization.

Audio Content Classification

In a content-based audio indexing and retrieval system, the most important task is to identify the content of audio automatically. Depending on the application, different categorizations can be applied. In [9], Saunders considered the discrimination of speech from music. Saraceno and Leonardi further classified audio into four groups: silence, speech, music, and noise [29]. The addition of the silence and noise categories is appropriate, since a large silence interval can be used to mark segment boundaries, and the characteristics of noise are much different from those of speech or music.

A more elaborate audio content categorization was proposed by Wold et al. [4], which divides audio content into ten groups: animal, bells, crowds, laughter, machine, instrument, male speech, female speech, telephone, and water. Furthermore, instrument sound is classified into altotrombone, cellobowed, oboe, percussion, tubularbells, violinbowed, and violinpizz. To characterize the differences among these audio groups, the authors used the mean, variance, and autocorrelation of loudness, pitch, brightness (i.e., frequency centroid), and bandwidth as audio features. A nearest neighbor classifier based on a weighted Euclidean distance measure was employed. The classification accuracy is about 81% over an audio database with 400 sound files.

Another interesting work related to general audio content classification is by Zhang and Kuo [30]. They explored five kinds of audio features: energy, ZCR, fundamental frequency, timbre, and rhythm. Based on these features, a hierarchical system for audio classification and retrieval was built. In the first step, audio data is classified into speech, music, environmental sounds, and silence using a rule-based heuristic procedure. In the second step, environmental sounds are further classified into applause, rain, birds' sound, etc., using an HMM classifier. These two steps provide so-called coarse-level and fine-level classification. The coarse-level classification achieved 90% accuracy and the fine-level classification achieved 80% accuracy in a test involving ten sound classes.

▲ 12. Hidden Markov model for video shot segmentation/classification. From [8].

Besides audio classification in general video, several research groups have focused on the problem of music classification. This is important for automatically indexing large volumes of music resources and for providing the capability of music query by example. Matityaho et al. proposed to use a multilayer neural network classifier to separate classical and pop music [31]. The audio features used are the average amplitudes of the Fourier transform coefficients within different subbands. The subbands are defined by dividing the cochlea into several equal-sized bands and choosing the corresponding resonance frequencies along the cochlea at these positions. The neural network considers a window of successive frames simultaneously, and the final decision is made after the output of the neural network is integrated over a short period. An accuracy of 100% was reported with the best parameter setting over a database containing 24 music pieces, about half containing classical music and the other half rock music.

Lambrou et al. [32] attempted to classify music into rock, piano, and jazz. They collected eight first- and second-order statistical features in the temporal domain as well as in three different transform domains: the adaptive splitting wavelet transform, the logarithmic splitting wavelet transform, and the uniform splitting wavelet transform. For the features from each domain, four different classifiers were examined: a minimum distance classifier, a K-nearest neighbor distance classifier, the LSMDC, and a quadrature classifier. An accuracy of 91.67% was achieved under several combinations of feature sets and classifiers. The LSMDC was the best classifier for most feature types.

Basic Video Scene Type Detection

We present two approaches for segmenting a video into some basic scene types. In both approaches, shot-level segmentation is first accomplished, and then shots are grouped into scenes based on the scene definitions. Therefore, scene segmentation and classification are accomplished simultaneously.

Scene Characterization Using AV Features Jointly

Saraceno and Leonardi [33] considered segmenting a video into the following basic scene types: dialogs, stories, actions, and generic scenes. This is accomplished by first dividing the video into audio and visual shots independently and then grouping video shots so that the audio and visual characteristics within each group follow some predefined patterns. First, the audio signal is segmented into audio shots, and each audio shot is classified as silence, speech, music, or noise. Simultaneously, the video signal is divided into video shots based on the color information. For each detected video shot, typical block patterns in this shot are determined using a vector quantization method. Finally, the scene detector and characterization unit groups a set of consecutive video shots into a scene if they match the visual and audio characteristics defined for a particular scene type.

To accomplish audio segmentation and classification, the silence segments are first detected based on an analysis of the signal energy. The nonsilence segments are further separated into speech, music, and noise by evaluating the degree of periodicity and the ZCR of the audio samples [29], [34].

Video shot breaks are determined based on the color histograms, using the twin-comparison technique proposed in [35]. For each shot, a codebook is designed, which contains typical block patterns in all frames within this shot. Then successive shots are compared and labeled sequentially: a shot that is close to a previous shot is given the same label as that shot; otherwise, it is given a new label. To compare whether two shots are similar, the codebook for one shot is used to approximate the block patterns in the other shot. If each shot can be represented well using the codebook of the other shot, then these two shots are considered similar.

Scene detection and classification depend on the definition of scene types in terms of their audio and visual patterns. For example, in a dialog scene, the audio signal should consist mostly of speech, and the video shot labels should follow a pattern of the type ABABAB. On the other hand, in an action scene, the audio signal should mostly belong to one nonspeech type, and the visual information should exhibit a progressive pattern of the type ABCDEF. By examining successive video shots to see whether they follow one of the predefined patterns (e.g., ABABAB) and whether the corresponding audio shots have the correct labels (e.g., mostly speech), the maximum set of consecutive video shots that satisfies a particular scene definition is identified and classified.

The above technique was applied to 75 minutes of movie and 30 minutes of news. It was found that more accurate results can be obtained for news than for movies. This is as expected, because news sequences follow the somewhat simplified scene definitions more closely.

Scene Characterization Based on Separate Criteria

Lienhart et al. [12] proposed to use different criteria to segment a video: scenes with similar audio characteristics, scenes with similar settings, and dialogs. The scheme consists of four steps. First, video shot boundaries are detected by the algorithm in [36], which can detect and classify hard cuts, fades, and dissolves. Then, audio features, color features, orientation features, and faces are extracted. Next, distances between every two video shots are calculated, with respect to each feature modality, to form a distance table. Finally, based on the calculated shot distance tables, video shots are merged based on each feature separately. The authors also investigated how to merge the scene detection results obtained using the different features. They argued that it is better to first perform scene segmentation/classification based on separate criteria and leave the merging task to a later stage that is application dependent.

To examine audio similarity, an audio feature vector, which includes the magnitude spectrum of the audio samples, is computed for each short audio clip. A forecasting feature vector is also calculated at every instant using exponential smoothing of the previous feature vectors. An audio shot cut is detected by comparing the calculated feature vector with the forecasting vector. The prediction process is reset after a detected shot cut. All feature vectors of an audio shot describe the audio content of that shot. The distance between two shots is defined as the minimal distance between two audio feature vectors of the respective video shots. A scene in which audio characteristics are similar is noted as an "audio sequence." It consists of all video shots such that every two shots are separated by no more than a look-ahead number (three) of shots and the distance between these two shots is below a threshold.

A setting is defined as a locale where the action takes place. The distributions of color and edge orientation are usually similar in frames under the same setting. To examine color similarity, the CCV is computed between every two frames. For edge orientation, an orientation correlogram is determined, which describes the correlation of orientation pairs separated by a certain spatial distance. The color (respectively, orientation) distance between two video shots is defined as the minimum distance between the CCVs (respectively, orientation correlograms) of all frames contained within the respective shots. A scene in which color or orientation characteristics are similar is noted as a "video setting." It is determined in the same way as an "audio sequence," but using the shot distance table calculated based on the color or orientation feature.

A dialog scene is detected by using face detection and matching techniques. Faces are detected by modifying the algorithm developed by Rowley et al. [37]. Similar faces are then grouped into face-based sets using the eigenface face recognition algorithm [38]. A consecutive set of video shots is identified as a dialog scene if alternating face-based sets occur in these shots.

The above scene determination scheme has been applied to two full-length feature movies. On average, the accuracy of determining the dialog scenes is much higher than that for audio sequences and settings. This is probably because the definition and extraction of dialog scenes conform more closely with the human perception of a dialog. The authors also attempted to combine the video shot clustering results obtained based on the different criteria. The algorithm works by merging two clusters resulting from different criteria whenever they overlap. This has yielded much better results than those obtained based on audio, color, or orientation features separately.

Video Program Categorization

Film Genre Recognition

Fischer et al. investigated automatic recognition of film genres using both audio and visual cues [39]. In their work, film materials are actually extracted from TV programs, and film genres refer to the type of programs: news cast, sports, commercials, or cartoons. Video analysis is accomplished at three increasing levels of abstraction. At the first level, some syntactic properties of a video are extracted, including the color histogram, motion energy (sum of frame differences), motion field, and the waveform and spectrum of the audio. Shot cut detection is also accomplished at this stage based on changes in both color histogram and motion energy. (In their paper, each shot is referred to as a scene.) Video style attributes are then derived from the syntactic properties at the second level. Typical style attributes are shot lengths, camera motion such as zooming and panning, shot transitions such as fading and morphing, object motion (only rigid objects with translational motions are considered), the presence of some predefined patterns such as TV station logos, and the audio type (speech, music, silence, or noise, detected based on the spectrum and loudness). At the final level, the temporal variation pattern of each style attribute, called a style profile, is compared to the typical profiles of various video genres. Four classification modules are used to compare the profiles: one for shot length and transition styles, one for camera motion and object motion, one for the occurrence of recognized patterns, and one for audio spectrum and loudness. The classification is accomplished in two steps. In the first step, each classification module assigns a membership value to each particular genre. In the second step, the outputs of all classification modules are integrated by a weighted average to determine the genre of the film.

The above technique was applied to separate five film genres, namely news cast, car race, tennis, animated cartoon, and commercials. The test data consist of about two four-minute video segments for each genre. Promising results were reported.
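The final fusion step, weighted averaging of the per-module membership values, can be sketched as follows; the weights are hypothetical and the genre names are simply those listed above.

```python
import numpy as np

GENRES = ["news cast", "car race", "tennis", "cartoon", "commercial"]

def fuse_modules(memberships, weights):
    """memberships: M x G array, one row of genre membership values per module.
    weights: length-M module weights.  Returns the winning genre and the
    combined membership values."""
    memberships = np.asarray(memberships, dtype=float)
    weights = np.asarray(weights, dtype=float)
    combined = weights @ memberships / weights.sum()   # weighted average per genre
    return GENRES[int(np.argmax(combined))], combined
```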
Video Program Categorization

Film Genre Recognition
Fischer et al. investigated automatic recognition of film genres using both audio and visual cues [39]. In their work, film materials are actually extracted from TV programs, and film genres refer to the type of program: news cast, sports, commercials, or cartoons. Video analysis is accomplished at three increasing levels of abstraction. At the first level, some syntactic properties of a video, which include the color histogram, motion energy (sum of frame differences), motion field, and the waveform and spectrum of the audio, are extracted. Shot cut detection is also accomplished at this stage, based on changes in both the color histogram and the motion energy. (In their paper, each shot is referred to as a scene.) Video style attributes are then derived from the syntactic properties at the second level. Typical style attributes are shot lengths, camera motion such as zooming and panning, shot transitions such as fading and morphing, object motion (only rigid objects with translational motions are considered), the presence of some predefined patterns such as TV station logos, and audio type (speech, music, silence, or noise, detected based on the spectrum and loudness). At the final level, the temporal variation pattern of each style attribute, called the style profile, is compared to the typical profiles of various video genres. Four classification modules are used to compare the profiles: one for shot length and transition styles, one for camera motion and object motion, one for the occurrence of recognized patterns, and one for audio spectrum and loudness. The classification is accomplished in two steps. In the first step, each classification module assigns a membership value to each particular genre. In the second step, the outputs of all classification modules are integrated by a weighted average to determine the genre of the film.

The above technique was applied to separate five film genres, namely news cast, car race, tennis, animated cartoon, and commercials. The test data consist of about two four-minute video segments for each genre. Promising results were reported.

TV Program Categorization Using HMM
The classification approaches reviewed so far are primarily rule based, in that they rely on some rules that each scene class/category must follow. As such, the resulting solution depends on the scene definition and the appropriateness of the rules. Here, we describe an HMM-based classifier for TV program categorization, which is driven by training data instead of manually devised rules [40], [41]. A similar framework can be applied to other classification/categorization tasks.

▲ 13. Illustration of a discrete HMM classifier: the feature vector sequence X is quantized by a VQ into an observation sequence O, the likelihood P(O|λl) is evaluated by the HMM of each scene class l (l = 1, ..., L), and the scene class label with the maximum likelihood is selected.

Our use of HMMs to model a video program is motivated by the fact that the feature values of different programs at any time instant can be similar. Their temporal behaviors, however, are quite different. For example, in a basketball game, a shooting consists of a series of basic motions: the camera first points to the player (static) and then follows the ball after the ball is thrown away from the player's hand (panning/tilting). These individual basic events can occur in a football game or a commercial as well, but the ordering of the basic events and the duration of each are quite unique to basketball games.

An HMM assumes that an input observation sequence of feature vectors follows a multistate distribution. An HMM is characterized by the initial state distribution (Π), the state transition probabilities (A), and the observation probability distribution in each state (B). An HMM classifier requires training, which estimates the parameter set λ = (A, B, Π) for each class based on certain training sequences from that class. A widely used algorithm is the Baum-Welch method [7], which is an iterative procedure based on expectation-maximization. The classification stage is accomplished using the maximum likelihood method. Figure 13 illustrates the process of a discrete HMM classifier. An input sequence of feature vectors is first discretized into a series of symbols using a VQ. The resulting discrete observation sequence is then fed into pretrained HMMs for the different classes, and the most likely sequence of states and the corresponding likelihood for each class are determined using the Viterbi algorithm. The class with the highest likelihood is then identified. For more detailed descriptions of HMMs, see the classical paper by Rabiner [42].
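A minimal sketch of this discrete HMM classifier is given below, assuming the per-class parameters (Π, A, B) have already been estimated with Baum-Welch and the VQ codebook is given. The codebook size, number of states, class names, and feature dimensions are illustrative, not the values used in [40], [41].

```python
import numpy as np

def quantize(features, codebook):
    """VQ step: map each feature vector to the index of its nearest codeword."""
    d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
    return d.argmin(axis=1)

def viterbi_log_score(obs, pi, A, B):
    """Log-likelihood of the best state path for a discrete HMM (pi, A, B)."""
    logpi, logA, logB = (np.log(x + 1e-12) for x in (pi, A, B))
    delta = logpi + logB[:, obs[0]]
    for o in obs[1:]:
        delta = (delta[:, None] + logA).max(axis=0) + logB[:, o]
    return delta.max()

def classify(features, codebook, models):
    """Pick the scene class whose HMM gives the highest Viterbi score."""
    obs = quantize(features, codebook)
    scores = {c: viterbi_log_score(obs, *m) for c, m in models.items()}
    return max(scores, key=scores.get)

# toy setup: 2 classes, 3-state ergodic HMMs, 8-symbol codebook, 20 x 14 features
rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 14))
def random_hmm(states=3, symbols=8):
    pi = np.full(states, 1 / states)
    A = rng.dirichlet(np.ones(states), size=states)   # row-stochastic transitions
    B = rng.dirichlet(np.ones(symbols), size=states)  # per-state symbol probabilities
    return pi, A, B
models = {"news": random_hmm(), "commercial": random_hmm()}
print(classify(rng.normal(size=(20, 14)), codebook, models))
```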
We considered the classification of five common TV programs: commercial, live basketball game, live football game, news report, and weather forecast. For each program type, 20 minutes of data are collected, with half used for training and the remaining half for testing. The classification is done for every short sequence of 11 s. Audio features are collected over 1-s-long clips, and each audio observation sequence consists of 20 feature vectors calculated over 20 overlapping clips. Visual (color and motion) features are calculated over 0.1-s-long frames, and each visual observation sequence consists of 110 feature vectors. The audio features are VSTD, VDR, VU, ZSTD, NSR, 4ME, PSTD, SPR, NPR, FC, BW, ERSB1, ERSB2, and ERSB3; the color features are the most dominant color with its percentage, and the motion features are the first four dominant motions with their percentages. For each scene class, a five-state ergodic HMM is used and the feature vector is quantized into 256 observation symbols.

Tables 1-3 show the classification results using audio, color, and motion features separately. We can see that audio features alone can effectively separate the video into three super classes, i.e., commercial, basketball and football games, and news and weather forecast. This is because each super class has distinct audio characteristics. On the other hand, visual information, color or motion, can distinguish a basketball game from a football game and a weather forecast from news. Overall, audio features are more effective than the visual features in this classification task.

To resolve the ambiguities present in the individual modalities, we explored different approaches for integrating multimodal features [41]. Here, we present the two most effective and generalizable approaches.

Direct Concatenation: Concatenating feature vectors from different modalities into one super vector is a straightforward way of combining audio and visual information. This approach has been employed for speech recognition using both speech and mouth shape information [43], [44]. To combine audio and visual features into a super vector, the audio and visual features need to be synchronized. Hence, visual features are calculated for the same duration associated with each audio clip.
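The synchronization-and-concatenation step can be sketched as follows; the feature dimensions, timestamps, and the choice of averaging visual features over each audio clip are illustrative assumptions, not the exact procedure of [41].

```python
import numpy as np

def concat_av(audio_feats, audio_times, visual_feats, visual_times):
    """Build one 'super vector' per audio clip by averaging the visual
    features whose timestamps fall inside that clip, then concatenating.
    audio_times: (N, 2) start/end per clip; visual_times: (M,) frame times."""
    fused = []
    for a, (t0, t1) in zip(audio_feats, audio_times):
        inside = (visual_times >= t0) & (visual_times < t1)
        v = visual_feats[inside].mean(axis=0)
        fused.append(np.concatenate([a, v]))
    return np.vstack(fused)

audio = np.random.randn(3, 14)                         # e.g., 14 audio features per clip
atimes = np.array([[0.0, 1.0], [0.5, 1.5], [1.0, 2.0]])
visual = np.random.randn(20, 6)                        # e.g., 6 visual features per frame
vtimes = np.linspace(0.0, 2.0, 20, endpoint=False)
print(concat_av(audio, atimes, visual, vtimes).shape)  # (3, 20)
```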

In general, the concatenation approach can improve classification results. As the feature dimension increases, however, more data are needed for training.

Product HMM: With the product approach, we assume that the features from different modalities are independent of each other and use an HMM classifier to compute the likelihood values for all scene classes based on the features from each modality separately. The final likelihood for a scene class is the product of the results from all modules, as shown in Fig. 14. With this approach, HMM classifiers for different modalities can be trained separately. Also, features from different modalities can be calculated on different time scales. Another advantage of this approach is that it can easily accommodate a new modality if the features in the new modality are independent of the existing features.

▲ 14. The product HMM classifier: Pl,a, Pl,c, and Pl,m are the likelihoods that an input sequence belongs to class l based on the individual (audio, color, motion) feature sets; Pl is the overall likelihood for class l based on all features.

Tables 4 and 5 show the results obtained using the direct concatenation and product HMM approaches, respectively. Both approaches achieved significantly better performance than that obtained with any single modality. Overall, the product approach is better than the concatenation approach. This is partly because the model parameters can be trained reliably with the product approach. Another reason is that, as previously shown in Fig. 9, the audio, color, and motion features employed for this classification task are not strongly correlated, and therefore one does not lose much by assuming these features are independent. In our simulations, because of the limitation of training data, the HMMs obtained for the concatenation approach tend to be unreliable, and the classification accuracy is not always as expected. With either approach, the classification accuracy for the news category is lowest. This is because the news category in our simulation includes both anchor-person reports in a studio and live reports in various outdoor environments. A much higher accuracy is expected if only in-studio reports are included.

Table 1. Classification Accuracy Based on Audio Features Only (Average Accuracy: 79.71%).

Input \ Output    ad      bskb    ftb     news    wth
ad                75.66    7.36    0.38   15.66    0.94
bskb               1.46   91.79    5.29    1.46    0.00
ftb                1.82   13.28   83.64    1.26    0.00
news               0.00    0.19    4.58   57.55   37.68
wth                0.00    0.00    0.00   10.08   89.92

ad—commercial; bskb—basketball game; ftb—football; wth—weather forecast.

Table 2. Classification Accuracy Based on Color Features Only (Average Accuracy: 72.79%).

Input \ Output    ad      bskb    ftb     news    wth
ad                84.25    1.79    0.00   10.19    3.77
bskb               9.76   78.10    0.00    9.67    2.46
ftb                0.79    0.00   98.10    1.11    0.00
news              56.67   11.88    0.00   20.64   10.81
wth                3.78    4.07    0.00    9.30   82.85

ad—commercial; bskb—basketball game; ftb—football; wth—weather forecast.

Table 3. Classification Accuracy Based on Motion Features Only (Average Accuracy: 78.02%).

Input \ Output    ad      bskb    ftb     news    wth
ad                64.45    3.68    5.94   21.51    1.42
bskb               1.55   88.59    1.00    8.85    0.00
ftb                0.47    9.41   80.55    9.57    0.00
news              17.43    9.15    7.79   65.63    0.00
wth                3.10    0.00    7.56    1.45   87.89

ad—commercial; bskb—basketball game; ftb—football; wth—weather forecast.
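The product fusion described above (Fig. 14) reduces to summing per-modality log-likelihoods under the independence assumption. The sketch below assumes each modality's HMM classifier already returns a dictionary of log-likelihoods per class; the numbers are purely illustrative.

```python
import numpy as np

def product_fusion(loglik_by_modality):
    """Late fusion: assuming the modalities are independent, the fused
    log-likelihood of each class is the sum over modalities (i.e., the
    product of the per-modality likelihoods)."""
    classes = loglik_by_modality[0].keys()
    fused = {c: sum(m[c] for m in loglik_by_modality) for c in classes}
    return max(fused, key=fused.get), fused

audio  = {"news": -120.3, "bskb": -135.8, "ad": -131.0}   # illustrative log-likelihoods
color  = {"news": -80.1,  "bskb": -72.4,  "ad": -79.5}
motion = {"news": -60.7,  "bskb": -55.2,  "ad": -66.9}
print(product_fusion([audio, color, motion]))
```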
Testbeds for Video Archiving and Retrieval

Informedia: The CMU Digital Video Library
The Informedia project [45]-[47] at Carnegie Mellon University (CMU) is one of the six projects supported by the U.S. Digital Library Initiative [48]. It has created a terabyte digital video library (presently containing only news and documentary video) that allows users to retrieve/browse a video segment using both textual and visual means. A key feature of the system is that it combines speech recognition, natural language understanding, and image processing for multimedia content analysis. The following steps are applied to each video document: i) generate a speech transcript from the audio track using CMU's Sphinx-II speech recognition system [49]; ii) select keywords from the transcript using natural language understanding techniques to support text-based query and retrieval; iii) divide the video document into story units; iv) divide each story unit into shots and select a key frame for each shot; v) generate visual descriptors for each shot; and vi) provide a summary for each story unit in different forms. Figure 15 illustrates the analysis results from different processing modules for a sample video segment. We briefly review the techniques used for the last three tasks, with emphasis on how audio and visual information are combined. We will also describe a companion project, Name-It.

Segmentation into Story Units: Each video document is partitioned into story units. This is done manually in the current running version of the system. However, various automatic segmentation schemes have been developed and evaluated. The approach described in [50] makes use of text, audio, and image information jointly. (In [50], "video paragraph" refers to a story unit, and "scene" actually refers to a shot.) When closed caption is available, text markers such as punctuation are used to identify story segments. Otherwise, silence periods are detected based on the audio sample energy. Either approach yields boundaries of "acoustic paragraphs." Then, for each acoustic paragraph boundary, the nearest shot break is detected (see below), which is considered the boundary of a story unit.

Segmentation into Shots: To enable quick access to content within a segment and to support visual summary and image-similarity matching, each story unit is further partitioned into shots, so that each shot has a continuous action or a similar color distribution. Shot break detection is based on frame differences in both the color histogram and the motion field [50].
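A common form of histogram-based shot break detection is sketched below. Note that [50] uses both color histogram and motion-field differences; this fragment keeps only the histogram cue, and the bin count and threshold are illustrative values.

```python
import numpy as np

def shot_breaks(frames, bins=16, threshold=0.4):
    """Declare a shot break wherever the normalized L1 difference between
    the color histograms of consecutive frames exceeds a threshold."""
    hists = []
    for f in frames:
        h, _ = np.histogramdd(f.reshape(-1, 3), bins=(bins,) * 3,
                              range=[(0, 256)] * 3)
        hists.append(h.ravel() / h.sum())
    diffs = [np.abs(hists[i] - hists[i - 1]).sum() / 2 for i in range(1, len(hists))]
    return [i for i, d in enumerate(diffs, start=1) if d > threshold]

frames = np.random.randint(0, 256, (30, 24, 24, 3))   # toy video: 30 small frames
print(shot_breaks(frames))
```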
Key Frame Selection within Each Shot: Several heuristics were employed to choose key frames. One is based on camera motion analysis. When there is a continuous camera motion in a shot and this motion stops, the corresponding frame usually contains a person or an object of significance and hence can be chosen as a key frame. Another cue is the presence of faces and texts in the video. For this purpose, face detection [37] and text detection [51] are applied to the video frames. Text detection followed by OCR is also used to improve the accuracy of speech recognition based on the audio track.

Visual Descriptors: In addition to the keywords generated from the transcript, Informedia provides other information extracted from the video frames. First, to enable image-similarity based retrieval, a human perceptual color clustering algorithm was developed [52], which quantizes colors using a perceptual similarity distance and forms regions with similar colors using an automatic clustering scheme. Each key frame is indexed by two sets of descriptors: a modified color histogram, which includes the color value, its percentage, and the spatial connectivity of pixels for each quantized color; and a list of regions, which describes the mean color and geometry of each region.

In addition to the above low-level color and shape features, the presence of faces and/or texts is also part of the index. The face presence descriptor also provides library users with a face-searching capability.

Video-Skims: Informedia supports three types of visual summary, including thumbnail, filmstrip, and video-skim [53]. To generate the video-skim for a segment, keywords and phrases that are most relevant to the query text are first identified in the transcript. Then, time-corresponding video frames are examined for scene changes, entrance and exit of a human face, presence of overlaid texts, and desirable camera motion (e.g., a static frame before or after a continuous camera motion). Audio levels are also examined to detect transitions between speakers and topics, which often lead to low energy or silence in the sound track. Finally, the chosen audio and visual intervals are combined to yield a video-skim, individually extended or reduced in time as necessary to yield satisfactory AV synchronization. Such skims have demonstrated significant benefits in performance and user satisfaction compared to simple subsampled skims in empirical studies [54].

Name-It: Name-It is a companion project to Informedia. It is aimed at automatically associating faces detected from video frames and names detected from the closed caption in news [55]. It does not rely on any prestored face templates for selected names, which is both the challenge and the novelty of the system. Initial face detection is accomplished using CMU's neural net-based approach [37], and then the same face is tracked based on color similarity until a scene change occurs. To recognize the same face appearing in separated shots, the eigenface technique developed at MIT [56] is used. For name detection, it uses not only the closed-caption text, but also texts detected from the video frames [51]. To avoid detecting all names appearing in the transcript, lexical and grammatical analysis, as well as a priori knowledge about the structure of a news coverage, is used to identify names that may be of interest to the underlying news topic. To accomplish name and face association, a co-occurrence factor is calculated for each possible pairing of a detected name and a detected face, which is primarily determined by their time-coincidence. Multiple co-occurrences of the same name-face pair will lead to a higher score. Finding the correct association is a very difficult task, because the name detected in the closed caption is often not the same as the person appearing on the screen. This will be the case, for example, when an anchor or reporter is telling the story of someone else. The system reported in [55], although impressive, still has a relatively low success rate (lower than 50%), either for face-to-name retrieval or name-to-face retrieval.

One possible approach to improve the performance of Name-It is by exploiting the sound characteristics of the speaker. For example, if one can tell a male speaker from a female speaker based on the sound characteristics and differentiate male and female names based on some language models, then one can reduce the false association of a female face with a male name or vice versa. One can also use speaker identification techniques to improve the performance of face clustering: by requiring similarity in both facial features and voice characteristics.
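The time-coincidence idea behind the co-occurrence factor can be sketched in a few lines. The actual factor in [55] is more elaborate; the window length, the additive scoring, and the names below are hypothetical.

```python
from collections import defaultdict

def cooccurrence_scores(name_events, face_events, window=5.0):
    """Score each (name, face-cluster) pair by how often the name's caption
    time falls within `window` seconds of a face appearance; repeated
    co-occurrences of the same pair accumulate a higher score."""
    scores = defaultdict(float)
    for name, t_name in name_events:
        for face_id, t_face in face_events:
            if abs(t_name - t_face) <= window:
                scores[(name, face_id)] += 1.0
    return dict(scores)

names = [("J. SMITH", 12.0), ("J. SMITH", 95.0), ("A. LEE", 40.0)]   # (name, time in s)
faces = [("face_0", 11.5), ("face_0", 94.0), ("face_1", 41.0)]       # (face cluster, time)
print(cooccurrence_scores(names, faces))
```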

AT&T Pictorial Transcript System
Pictorial Transcript is an automated archiving and retrieval system for broadcast news programs, developed at AT&T Labs [57]. In the first generation of the system, Pictorial Transcript extracts the content structure by lexical and linguistic processing based on the text information in the closed caption. After text processing, a news program is decomposed into a hierarchical structure, including page, paragraph, and sentence, which helps to generate a hypermedia document for browsing and retrieval. Simultaneously, the video stream is segmented into shots by a content-based sampling algorithm [58], based on brightness and motion information. A key frame is then chosen for each shot. The key frames for the shots corresponding to a paragraph in the closed caption are used as the visual representation of that paragraph.

In the current development of the system, more sophisticated audio and visual analysis is being added to help the automatic generation of the content hierarchy. The multimedia data stream, composed of audio, video, and text information, lies at the lowest level of the hierarchy. On the next level, the two major content categories in a news program, news report and commercial, are separated. On the third level, the news report is further segmented into anchor-person speech and others, which include live reports. On the highest level, text processing is used to generate a table of contents based on the boundary information extracted at the lower levels and the corresponding closed-caption information. In the following, we describe how audio and visual cues are combined to separate news report and commercial [59] and to detect anchor-person segments [60].

Separation of News and Commercials: To accomplish this task, a news program is first segmented into clips based on audio volume, where each clip is about 2 s long. For each clip, 14 audio features and four visual features are extracted and integrated into an 18-dimension clip-level feature vector. The audio features used are NSR, VSTD, ZSTD, VDR, VU, 4ME, PSTD, SPR, NPR, FC, BW, and ERSB1-3. To extract the visual features, the DCH and ME of adjacent frames are first computed. The visual features are the means and standard deviations of DCH and ME within a clip. Then each clip is classified as either news or commercial. For this purpose, four different classification mechanisms have been tested, including a hard threshold classifier, a linear fuzzy classifier, a GMM classifier, and an SVM classifier. The SVM classifier was found to achieve the best performance, with an error rate of less than 7% over two hours of video data from NBC Nightly News. It was found that audio features play a more important role in this classification task compared to the visual counterpart [59].

Table 4. Classification Accuracy Based on Audio, Color, and Motion Features Using the Direct Concatenation Approach (Average Accuracy: 86.49%).

Input \ Output    ad      bskb    ftb     news    wth
ad                91.23    7.08    0.00    1.60    0.09
bskb               2.55   86.13    8.21    3.10    0.00
ftb                1.58    1.34   94.31    2.77    0.00
news               2.63    1.66    3.02   64.95   27.75
wth                0.00    0.00    0.00    4.17   95.83

ad—commercial; bskb—basketball game; ftb—football; wth—weather forecast.

Table 5. Classification Accuracy Based on Audio, Color, and Motion Features Using the Product Approach (Average Accuracy: 91.40%).

Input \ Output    ad      bskb    ftb     news    wth
ad                93.58    0.47    0.00    5.38    0.57
bskb               6.39   93.34    0.27    0.00    0.00
ftb                0.00    0.00  100.00    0.00    0.00
news               7.30    2.14    0.29   83.54    6.72
wth                0.39    0.00    0.00   13.08   86.53

ad—commercial; bskb—basketball game; ftb—football; wth—weather forecast.
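The clip-level SVM classification described above can be sketched with scikit-learn. The random features below are placeholders for the real 18-dimensional clip vectors, and the kernel and parameters are generic choices rather than those used in [59].

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# 200 clips x 18 features (14 audio + 4 visual); labels 0 = news, 1 = commercial
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 18))
y = rng.integers(0, 2, size=200)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
clf.fit(X[:150], y[:150])            # train on part of the clips
print(clf.score(X[150:], y[150:]))   # accuracy on held-out clips
```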

Detection of Anchor Person: When the anchor person for a certain news program is known and fixed, an audio approach (e.g., speaker identification) or a visual approach (e.g., face recognition) alone may be sufficient. But it is not easy to reliably detect unknown anchor persons. Figure 16 shows an approach to adaptively detect unknown anchors using on-line trained audio and visual models [60]. First, audio cues are exploited to identify the theme music segment of the given news program. Based on the starting time of the theme music, an anchor frame is located, from which a feature block, the neck-down area with respect to the detected face region, is extracted to build an on-line visual model for the anchor. Such a feature block captures both the style and the color of the clothing, and it is independent of the background setting and the location of the anchor. Using this model, all other anchor frames are identified by matching against it.

When the theme music cannot be detected reliably, face detection is applied to every key frame and then feature blocks are identified for every detected face. Once the feature blocks are extracted, dissimilarity measures are computed among all possible pairs of detected persons, based on color histogram difference and block matching error. An agglomerative hierarchical clustering scheme is then applied to group faces into clusters that are associated with similar feature blocks (i.e., the same clothing). Given the nature of the anchor's function, it is clear that the largest cluster with the most scattered appearance time corresponds to the anchor class.

To recover segments where the anchor speech is present but not the anchor appearance, an audio-based anchor detection scheme is also developed. The anchor frames detected from the video stream identify the locations of the anchor speech in the audio stream. Acoustic data at these locations are gathered as training data to build an on-line speaker model for the anchor, which is then applied to the remaining audio stream to extract all other segments containing the anchor speech.

Movie Content Analysis (MoCA) Project at the University of Mannheim
MoCA is a project at the University of Mannheim, Germany, targeted mainly at understanding the semantic content of movies (including those made for television broadcasting) [61]. We have reviewed the techniques used for movie genre recognition previously, as an example of scene classification. Here we review the techniques developed for video abstracting [62], [63].

▲ 15. Component technologies applied to segment video data in the Informedia system: scene changes, camera motion, face detection, text detection, word relevance, and audio level. From [46, 3c].

Video abstracting refers to the construction of a video sequence, similar to the trailer (or preview) of a movie, that will give the audience a good overview of the story content, the characters, and the style of the movie. Four steps are involved in producing a video abstract: i) segment the movie into scenes and shots; ii) identify clips containing special events; iii) select among the special event clips those to be included in the abstract and certain filler clips; and iv) assemble the selected clips into a video abstract by applying appropriate ordering and editing effects between clips. We have reviewed the segmentation technique previously. Not all the features of this scheme have been implemented in the current MoCA system. In the following, we describe steps ii)-iv) in more detail.

Special Event Detection: Special events are defined as those shots containing close-up views of main characters, dialogs between main characters, title texts, and shots with special sound characteristics such as explosions and gun fires. Dialog detection is part of scene segmentation, which has been described in "Scene Characterization Based on Separate Criteria." Detection of close-up views is accomplished as a post-processing of dialog detection. To detect title texts, the algorithm of [64] is used, which makes use of heuristics about the special appearance of titles in most movies. The detected texts are also subjected to OCR to produce text-based indexes. To detect gun-fire and explosion sounds, an audio feature vector consisting of loudness, frequencies, pitch, fundamental frequency, onset, offset, and frequency transition is computed for each short audio clip. This feature vector is then compared to those stored for typical gun-fire and explosion sounds [2].

Selection of Clips for Inclusion: Several heuristic criteria are applied when choosing candidate clips for inclusion in the abstract. Obviously, the criteria would vary depending on the type of video: the preview for a feature movie is different from the summary for a documentary movie. For feature movies, the criteria used include: i) important objects and people (main actors/actresses); ii) shots containing action (gun fire and explosion); iii) scenes containing dialogs; and iv) frames containing title text and title music.
different scenes, and finally each scene segment may con- For feature movies, the criteria used include: i) important tain many subsegments corresponding to different cam- objects and people (main actors/actresses); ii) shots con- era shots. A segment is further divided into video taining action (gun fire and explosion); iii) scenes con- segment and audio segment, corresponding to the video taining dialogs; and iv) frames containing title text and frames and the audio waveform, respectively. In addition title music. to using a video segment that contains a set of complete Generation of Abstract: Including all the scenes/shots video frames (may not be contiguous in time), a still or that contain special events may generate too long an ab- moving region can also be extracted. A region can be re- stract. Also, simply staggering them together may not be cursively divided into subregions to form a region tree. visually or aurally appealing. In the MoCA project, it was Each segment or region is described by a set of DSs determined that only 50% of the abstract should contain and Ds. A DS is a combination of Ds specified with a special events. The remaining part should be left for filler designated format. There are some DSs that specify clips. The special event clips to be included are chosen some meta information, e.g., the source, creation, us- uniformly and randomly from different types of events. age, and time duration. There are also Ds that specify the The selection of a short clip from a scene is subject to segment level in the segment tree and its relative impor- some additional criteria, such as the amount of action and tance. The spatial-temporal connectivity of a segment or the similarity to the overall color composition of the region is defined by some spatial- and temporal-connec- movie. Closeness to the desired AV characteristics of cer- tivity attributes. The actual AV characteristics are de- tain scene types are also considered. The filler clips are scribed by audio and visual Ds, which are explained in chosen so that they do not overlap with the content of more detail below. The relation between the segments is

▲ 16. Diagram of an integrated algorithm for anchor detection: theme music detection and face detection yield anchor key frames; feature block extraction builds an on-line visual model for model-based and unsupervised anchor detection; the detected anchor frames drive on-line speaker model generation and audio model-based anchor detection; the audio and visual results are integrated to extract anchor segments. From [60].

The relation between the segments is specified by a segment relation graph, which describes the spatial-temporal relations between the segments. Similarity in certain AV attributes can also be specified (e.g., similar color, faster motion, etc.).

To facilitate fast and effective browsing of an AV document, MPEG-7 also defines different summary representations for an AV segment. For example, a hierarchical summary of a segment can be formed by selecting one key frame for each node in the tree decomposition of this segment. Instead of using a key frame, an audio or visual or AV clip (a subsegment) can also be used at all or certain nodes of the tree.

Semantic Decomposition: In parallel with the syntactic decomposition, MPEG-7 also uses semantic decomposition of an AV document to describe its semantic content. It identifies certain events and objects that occur in a document and attaches corresponding "semantic labels" to them. For example, the event type could be a news broadcast, a sports game, etc. The object type could be a person, a car, etc. An event is described by an event-type D and an annotation DS (who, what, where, when, and why, and a free-text description). An object is described by an object-type D and an annotation DS (who, what object, what action, where, when, and why, and a free-text description). An event can be further broken up into many subevents to form an event tree. Similarly, an object tree can be formed. An event-object relation graph describes the relation between events and objects.

Relation between Syntactic and Semantic Decompositions: An event is usually associated with a segment and an object with a region. Each event or object may occur multiple times in a document, and their actual locations (which segment or region) are described by a syntactic-semantic link. In this sense, the syntactic structure, represented by the segment tree and region tree, is like the table of contents at the beginning of a book, whereas the semantic structure, i.e., the event tree and object tree, is like the index at the end of the book.

An event or object can be characterized by a model, and any segment or region can be assigned to an event or object by a classifier. MPEG-7 defines a variety of model descriptions: probabilistic, analytic, correspondence, synthetic (for synthetic objects), and knowledge (for defining different semantic notions). For example, probabilistic models define the statistical distributions of the AV samples in a segment or region belonging to a particular event/class. Defined models include: Gaussian, high-order Gaussian (in MPEG-7, Gaussian refers to a joint Gaussian distribution where the individual variables are independent, so that only the mean and variance of the individual components need to be specified; high-order Gaussian refers to the more general case, through the definition of a covariance matrix), mixture of Gaussians, and HMM.
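The segment-tree/event-tree relationship can be visualized with a small data-structure sketch. This is not MPEG-7 DDL or any normative syntax; it is only an illustrative Python rendering of the segment tree, the event annotation, and the syntactic-semantic link discussed above.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Segment:
    """Node of a syntactic segment tree (document -> story -> scene -> shot)."""
    label: str
    start: float                 # seconds
    end: float
    children: List["Segment"] = field(default_factory=list)

@dataclass
class Event:
    """Node of a semantic event tree, linked to the segments where it occurs."""
    event_type: str
    annotation: str              # free-text "who/what/where/when/why"
    occurrences: List[Segment] = field(default_factory=list)  # syntactic-semantic link

shot = Segment("shot 1", 0.0, 4.2)
scene = Segment("scene 1", 0.0, 61.0, [shot])
doc = Segment("news program", 0.0, 1800.0, [scene])
goal = Event("sports highlight", "who: team A; what: goal", [scene])
print(len(doc.children), goal.occurrences[0].label)
```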
Visual Descriptors
For each segment or region at any level of the hierarchy, a set of audio and visual Ds and DSs are used to characterize the segment or region. The MPEG-7 video group has currently established the following features as visual Ds, which may still undergo changes in the future. The visual Ds can be categorized into four groups: color, shape, motion, and texture. Our description below follows [70].

Color: These descriptors describe the color distributions in a video segment, a moving region, or a still region.
▲ Color Space: Four color spaces are defined: RGB, YCrCb, HSV, and HMMD. Alternatively, one can specify an arbitrary linear transformation matrix from the RGB coordinates.
▲ Color Quantization: This D is used to specify the quantization method, which can be linear, nonlinear (in MPEG-7, uniform quantization is referred to as linear quantization and a nonuniform quantizer as nonlinear), or lookup table, and the quantization parameters (e.g., the number of bins for linear quantization).

▲ Dominant Color: This D describes the dominant colors in the underlying segment, including the number of dominant colors, a confidence measure on the calculated dominant colors, and, for each dominant color, the value of each color component and its percentage.
▲ Color Histogram: Several types of histograms can be specified: i) the common color histogram, which includes the percentage of each quantized color among all pixels in a segment or region; ii) the GoF/GoP histogram, which can be the average, median, or intersection (minimum percentage for each color) of conventional histograms over a group of frames or pictures (a short sketch of this aggregation appears after this list); and iii) the color-structure histogram, which is intended to capture some spatial coherence of pixels with the same color. An example is to increase the counter for a color as long as there is at least one pixel with this color in a small neighborhood around each pixel.
▲ Compact Color Descriptor: Instead of specifying the entire color histogram, one can specify the first few coefficients of the Haar transform of the color histogram.
▲ Color Layout: This is used to describe, at a coarse level, the color pattern of an image. An image is reduced to 8 × 8 blocks, with each block represented by its dominant color. Each color component (Y/Cb/Cr) in the reduced image is then transformed using the discrete cosine transform, and the first few coefficients are specified.

Shape: These descriptors are used to describe the spatial geometry of still and moving regions.
▲ Object Bounding Box: This D specifies the tightest rectangular box enclosing a two- or three-dimensional (2-D or 3-D) object. In addition to the size, center, and orientation of the box, the occupancy of the object in the box is also specified, by the ratio of the object area (volume) to the box area (volume).
▲ Contour-Based Descriptor: This D is applicable to a 2-D region with a closed boundary. MPEG-7 has chosen to use the peaks in the curvature scale space (CSS) representation [71], [72] to describe a boundary, which has been found to reflect human perception of shapes, i.e., similar shapes have similar parameters in this representation.
▲ Region-Based Shape Descriptor: This D can be used to describe the shape of any 2-D region, which may consist of several disconnected subregions. MPEG-7 has chosen to use Zernike moments [73] to describe the geometry of a region. The descriptor specifies the number of moments used and the value of each moment.

Texture: This category is used to describe the texture pattern of an image.
▲ Homogeneous Texture: This is used to specify the energy distribution in different orientations and frequency bands (scales). This can be obtained through a Gabor transform with six orientation zones and five scale bands.
▲ Texture Browsing: This D specifies the texture appearance in terms of regularity, coarseness, and directionality, which are more in line with the type of descriptions that a human may use in browsing/retrieving a texture pattern. In addition to regularity, up to two dominant directions and the coarseness along each direction can be specified.
▲ Edge Histogram: This D is used to describe the edge orientation distribution in an image. Three types of edge histograms can be specified, each with five entries, describing the percentages of directional edges in four possible orientations and nondirectional edges. The global edge histogram is accumulated over every pixel in an image; the local histogram consists of 16 subhistograms, one for each block in an image; the semi-global histogram consists of eight subhistograms, one for each group of rows or columns in an image.

Motion: These Ds describe the motion characteristics of a video segment or a moving region as well as global camera motion.
▲ Camera Motion: Seven possible camera motions are considered: panning, tracking (horizontal translation), tilting, booming (vertical translation), zooming, dollying (translation along the optical axis), and rolling (rotation around the optical axis). For each motion, two moving directions are possible. For each motion type and direction, the presence (i.e., duration), speed, and amount of motion are specified. The last term measures the area that is covered or uncovered due to a particular motion.
▲ Motion Trajectory: This D is used to specify the trajectory of a nonrigid moving object, in terms of the 2-D or 3-D coordinates of certain key points at selected sampling times. For each key point, the trajectory between two adjacent sampling times is interpolated by a specified interpolation function (either linear or parabolic).
▲ Parametric Object Motion: This D is used to specify the 2-D motion of a rigid moving object. Five types of motion models are included: translation, rotation/scaling, affine, planar perspective, and parabolic. In addition to the model type and model parameters, the coordinate origin and time duration need to be specified.
▲ Motion Activity: This D is used to describe the intensity and spread of activity over a video segment (typically at the shot level). Four attributes are associated with this D: i) intensity of activity, measured by the standard deviation of the motion vector magnitudes; ii) direction of activity, determined from the average of the motion vector directions; iii) spatial distribution of activity, derived from the run-lengths of blocks with motion magnitudes lower than the average magnitude; and iv) temporal distribution of activity, described by a histogram of quantized activity levels over the individual frames in the shot.
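As promised in the Color Histogram item above, the GoF/GoP aggregation (average, median, or intersection of per-frame histograms) can be sketched as follows; the frame histograms here are random toy data.

```python
import numpy as np

def gof_histogram(frame_histograms, mode="average"):
    """Aggregate per-frame color histograms into a single GoF/GoP histogram
    by averaging, taking the per-bin median, or taking the per-bin minimum
    ('intersection'), then renormalizing."""
    h = np.asarray(frame_histograms, dtype=float)
    if mode == "average":
        g = h.mean(axis=0)
    elif mode == "median":
        g = np.median(h, axis=0)
    elif mode == "intersection":
        g = h.min(axis=0)
    else:
        raise ValueError(mode)
    return g / g.sum()

hists = np.random.rand(30, 64)                    # 30 frames, 64 quantized colors
hists /= hists.sum(axis=1, keepdims=True)
print(gof_histogram(hists, "intersection").shape)
```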
2-D motion of a rigid moving object. Five types of mo- ▲ Contour-Based Descriptor: This D is applicable to a 2-D tion models are included: translation, rotation/scaling, region with a closed boundary. MPEG-7 has chosen to affine, planar perspective, and parabolic. In addition to use the peaks in the curvature scale space (CSS) represen- the model type and model parameters, the coordinate ori- tation [71], [72] to describe a boundary, which has been gin and time duration need to be specified. found to reflect human perception of shapes, i.e., similar ▲ Motion Activity: This D is used to describe the intensity shapes have similar parameters in this representation. and spread of activity over a video segment (typically at the ▲ Region-Based Shape Descriptor: This D can be used to shot level). Four attributes are associated with this D: i) in- describe the shape of any 2-D region, which may consist tensity of activity, measured by the standard deviation of of several disconnected subregions. MPEG-7 has chosen the motion vector magnitudes; ii) direction of activity, de- to use Zernike moments [73] to describe the geometry of termined from the average of motion vector directions; iii) a region. The descriptor specifies the number of moments spatial distribution of activity, derived from the run- used and the value of each moment. lengths of blocks with motion magnitudes lower than the Texture: This category is used to describe the texture average magnitude; and iv) temporal distribution of activ- pattern of an image. ity, described by a histogram of quantized activity levels ▲ Homogeneous Texture: This is used to specify the energy over individual frames in the shot. distribution in different orientations and frequency bands (scales). This can be obtained through a Gabor transform with six orientation zones and five scale bands. Audio Descriptors ▲ Texture Browsing: This D specifies the texture ap- At the time of writing, the work on developing audio pearances in terms of regularity, coarseness, and descriptors are still on-going in MPEG-7 audio group. It directionality, which are more in line with the type of is planned, however, that Ds and DSs for four types of au- descriptions that a human may use in browsing/retriev- dio will be developed: music, pure speech, sound effect, ing a texture pattern. In addition to regularity, up to and arbitrary sound track [74].

Concluding Remarks
In this article, we have reviewed several important aspects of multimedia content analysis using both audio and visual information. We hope that the readers will carry away two messages from this article: i) there is much to be gained by exploiting both audio and visual information, and ii) there are still plenty of unexplored territories in this new research field. We conclude the paper by outlining some of the interesting open issues.

Most of the approaches we reviewed are driven by application-dependent heuristics. One question that still lacks consensus among researchers in the field is whether it is feasible to derive some application-domain independent approaches. We believe that at least at the low- and mid-level processing this is feasible. For example, feature extraction, shot-level segmentation, and some basic scene-level segmentation and classification are required in almost all multimedia applications requiring understanding of semantic content. The features and segmentation approaches that have been proven useful for one application (e.g., news analysis) are often appropriate for another application (e.g., movie analysis) as well. On the other hand, higher level processing (such as story segmentation and summarization) may have to be application specific, although even there, methodologies developed for one application can provide insights for another.

How to combine audio and visual information belongs to the general problem of multiple evidence fusion. Until now, most work in this area was quite heuristic and application-domain dependent. A challenging task is to develop some theoretical framework for joint AV processing and, more generally, for multimodal processing. For example, one may look into theories and techniques developed for sensor fusion, such as the Dempster-Shafer theory of evidence, information theory regarding mutual information from multiple sources, and Bayesian theory.

Another potential direction is to explore the analogy between natural language understanding and multimedia content understanding. The ultimate goals of these two problems are quite similar: being able to meaningfully partition, index, summarize, and categorize a document. A major difference between a text medium and a multimedia document is that, for the same semantic concept, there are relatively few text expressions, whereas there could be infinitely many AV renditions. This makes the latter problem more complex and dynamic. On the other hand, the multiple cues present in a multimedia document may make it easier to derive the semantic content.

Acknowledgments
We would like to thank Howard Wactlar for providing information regarding CMU's Informedia project, Qian Huang for reviewing information regarding AT&T's Pictorial Transcript project, Rainer Lienhart for exchanging information regarding the MoCA project, Riccardo Leonardi for providing updated information regarding their research, and Philippe Salembier, Adam Lindsay, and B.S. Manjunath for providing information regarding MPEG-7 audio and visual descriptors and description schemes. This work was supported in part by the National Science Foundation through its STIMULATE program under Grant IRI-9619114.

Yao Wang (Senior Member) received the B.S. and M.S. degrees from Tsinghua University, Beijing, China, and the Ph.D. degree from the University of California at Santa Barbara, all in electrical engineering. Since 1990, she has been with Polytechnic University, Brooklyn, NY, where she is a Professor of Electrical Engineering. She is also a consultant with AT&T Bell Laboratories, now AT&T Labs-Research. She was on sabbatical leave at Princeton University in 1998 and was a visiting professor at the University of Erlangen, Germany. She was an Associate Editor for IEEE Transactions on Multimedia and IEEE Transactions on Circuits and Systems for Video Technology. She is a member of several IEEE technical committees. Her current research interests include image and video compression for unreliable networks, motion estimation and object-oriented video coding, signal processing using multimodal information, and image reconstruction problems in medical imaging. She was awarded the New York City Mayor's Award for Excellence in Science and Technology in the Young Investigator category in 2000.

Zhu Liu (Student Member) received the B.S. and M.S. degrees in electronic engineering from Tsinghua University, Beijing, China, in 1994 and 1996, respectively. Since September 1996, he has been a Ph.D. candidate in the Electrical Engineering Department at Polytechnic University. From May 1998 to August 1999, he was a consultant with AT&T Labs-Research; in 2000 he joined the company as a Senior Technical Staff Member. His research interests include audio/video signal processing, multimedia databases, pattern recognition, and neural networks. He is a member of Tau Beta Pi.

Jincheng Huang (Student Member) received the B.S. and M.S. degrees in electrical engineering from Polytechnic University, Brooklyn, NY, in 1994. Currently, he is pursuing the Ph.D. degree in electrical engineering at the same institution. Since 2000, he has been a Member of Technical Staff at Epson Palo Alto Laboratory, Palo Alto, CA. His current research interests include image halftoning, image and video compression, and multimedia content analysis. He is a member of Eta Kappa Nu.

References
[1] W. Hess, Pitch Determination of Speech Signals. New York: Springer-Verlag, 1983.
[2] S. Pfeiffer, S. Fischer, and W. Effelsberg, "Automatic audio content analysis," in Proc. 4th ACM Int. Conf. Multimedia, Boston, MA, Nov. 18-22, 1996, pp. 21-30.
[3] Z. Liu, Y. Wang, and T. Chen, "Audio feature extraction and analysis for scene segmentation and classification," J. VLSI Signal Processing Syst. Signal, Image, Video Technol., vol. 20, pp. 61-79, Oct. 1998.
[4] E. Wold, T. Blum, D. Keislar, and J. Wheaton, "Content-based classification, search, and retrieval of audio," IEEE Multimedia Mag., vol. 3, pp. 27-36, Fall 1996.
[5] N. Jayant, J. Johnston, and R. Safranek, "Signal compression based on models of human perception," Proc. IEEE, vol. 81, pp. 1385-1422, Oct. 1993.
[6] E. Scheirer and M. Slaney, "Construction and evaluation of a robust multifeature speech/music discriminator," in Proc. Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP-97), vol. 2, Munich, Germany, Apr. 21-24, 1997, pp. 1331-1334.
[7] L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition. Englewood Cliffs, NJ: Prentice Hall, 1993.
[8] J.S. Boreczky and L.D. Wilcox, "A hidden Markov model framework for video segmentation using audio and image features," in Proc. Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP-98), vol. 6, Seattle, WA, May 12-15, 1998, pp. 3741-3744.
[9] J. Saunders, "Real-time discrimination of broadcast speech/music," in Proc. Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP-96), vol. 2, Atlanta, GA, May 7-10, 1996, pp. 993-996.
[10] T. Zhang and C.-C.J. Kuo, "Video content parsing based on combined audio and visual information," in Proc. SPIE Conf. Multimedia Storage and Archiving Systems IV, Boston, MA, Sept. 1999, pp. 78-89.
[11] K. Minami, A. Akutsu, H. Hamada, and Y. Tonomura, "Video handling with music and speech detection," IEEE Multimedia Mag., vol. 5, pp. 17-25, July-Sept. 1998.
[12] R. Lienhart, S. Pfeiffer, and W. Effelsberg, "Scene determination based on video and audio features," in Proc. IEEE Int. Conf. Multimedia Computing and Systems, vol. 1, Florence, Italy, June 7-11, 1999, pp. 685-690.
[13] Y. Chang, W. Zeng, I. Kamel, and R. Alonso, "Integrated image and speech analysis for content-based video indexing," in Proc. 3rd IEEE Int. Conf. Multimedia Computing and Systems, Hiroshima, Japan, June 17-23, 1996, pp. 306-313.
[14] F. Idris and S. Panchanathan, "Review of image and video indexing techniques," J. Visual Commun. Image Represent., vol. 8, pp. 146-166, June 1997.
[15] Y. Rui, T.S. Huang, and S.-F. Chang, "Image retrieval: Current technologies, promising directions, and open issues," J. Visual Commun. Image Represent., vol. 10, pp. 39-62, Mar. 1999.
[16] X. Wan and C.-C.J. Kuo, "A new approach to image retrieval with hierarchical color clustering," IEEE Trans. Circuits Syst. Video Technol., vol. 8, pp. 628-643, Sept. 1998.
[17] G. Pass, R. Zabih, and J. Miller, "Comparing images using color coherence vectors," in Proc. 4th ACM Int. Conf. Multimedia, Boston, MA, Nov. 5-9, 1996, pp. 65-73.
[18] Y. Chen and E.K. Wong, "Augmented image histogram for image and video similarity search," in Proc. SPIE Conf. Storage and Retrieval for Image and Video Databases VII, San Jose, CA, Jan. 26-29, 1999, pp. 523-532.
[19] C.C. Gotlieb and H.E. Kreyszig, "Texture descriptors based on co-occurrence matrices," Comput. Vision, Graphics, Image Processing, vol. 51, pp. 70-86, July 1990.
[20] H. Tamura, S. Mori, and T. Yamawaki, "Texture features corresponding to visual perception," IEEE Trans. Syst., Man, Cybern., vol. SMC-8, no. 6, 1978.
[21] B.M. Mehtre, M.S. Kankanhalli, and W.F. Lee, "Shape measures for content based image retrieval: A comparison," Inform. Processing Manage., vol. 33, pp. 319-337, May 1997.
[22] A. Pentland, R.W. Picard, and S. Sclaroff, "Photobook: Content-based manipulation of image databases," Int. J. Computer Vision, vol. 18, pp. 233-254, June 1996.
[23] E.M. Arkin, L. Chew, D. Huttenlocher, K. Kedem, and J. Mitchell, "An efficiently computable metric for comparing polygonal shapes," IEEE Trans. Pattern Anal. Mach. Intell., vol. 13, pp. 209-216, Mar. 1991.
[24] G.C.-H. Chuang and C.-C.J. Kuo, "Wavelet descriptor of planar curves: Theory and applications," IEEE Trans. Image Processing, vol. 5, pp. 56-70, Jan. 1996.
[25] Y. Wang, J. Huang, Z. Liu, and T. Chen, "Multimedia content classification using motion and audio information," in Proc. IEEE Int. Symp. Circuits and Systems (ISCAS-97), vol. 2, Hong Kong, June 9-12, 1997, pp. 1488-1491.
[26] D. Bordwell and K. Thompson, Film Art: An Introduction, 4th ed. New York: McGraw-Hill, 1993.
[27] F. Beaver, Dictionary of Film Terms. New York: Twayne, 1994.
[28] J. Huang, Z. Liu, and Y. Wang, "Integration of audio and visual information for content-based video segmentation," in Proc. IEEE Int. Conf. Image Processing (ICIP-98), vol. 3, Chicago, IL, Oct. 4-7, 1998, pp. 526-530.
[29] C. Saraceno and R. Leonardi, "Audio as a support to scene change detection and characterization of video sequences," in Proc. Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP-97), vol. 4, Munich, Germany, Apr. 21-24, 1997, pp. 2597-2600.
[30] T. Zhang and C.-C.J. Kuo, "Hierarchical classification of audio data for archiving and retrieving," in Proc. Int. Conf. Acoustics, Speech, and Signal Processing, vol. 6, Phoenix, AZ, Mar. 15-19, 1999, pp. 3001-3004.
[31] B. Matityaho and M. Furst, "Neural network based model for classification of music type," in Proc. 18th Conv. Electrical and Electronic Engineers in Israel, Tel Aviv, Israel, Mar. 7-8, 1995, pp. 4.3.4/1-5.
[32] T. Lambrou, P. Kudumakis, R. Speller, M. Sandler, and A. Linney, "Classification of audio signals using statistical features on time and wavelet transform domains," in Proc. Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP-98), vol. 6, Seattle, WA, May 12-15, 1998, pp. 3621-3624.
[33] C. Saraceno and R. Leonardi, "Identification of story units in AV sequences by joint audio and video processing," in Proc. Int. Conf. Image Processing (ICIP-98), vol. 1, Chicago, IL, Oct. 4-7, 1998, pp. 363-367.
[34] C. Saraceno and R. Leonardi, "Indexing AV databases through a joint audio and video processing," Int. J. Image Syst. Technol., vol. 9, pp. 320-331, Oct. 1998.
[35] H.J. Zhang, A. Kankanhalli, and S.W. Smoliar, "Automatic partitioning of video," Multimedia Syst., vol. 1, no. 1, pp. 10-28, 1993.
[36] R. Lienhart, "Comparison of automatic shot boundary detection algorithms," in Proc. SPIE Conf. Image and Video Processing VII, San Jose, CA, Jan. 26-29, 1999, pp. 290-301.
[37] H.A. Rowley, S. Baluja, and T. Kanade, "Neural network-based face detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, pp. 23-38, Jan. 1998.
[38] M. Turk and A. Pentland, "Eigenfaces for recognition," J. Cognitive Neuroscience, vol. 3, no. 1, pp. 71-86, 1991.
[39] S. Fischer, R. Lienhart, and W. Effelsberg, "Automatic recognition of film genres," in Proc. 3rd ACM Int. Conf. Multimedia, San Francisco, CA, Nov. 5-9, 1995, pp. 295-304.
[40] Z. Liu, J. Huang, and Y. Wang, "Classification of TV programs based on audio information using hidden Markov model," in IEEE Workshop Multimedia Signal Processing (MMSP-98), Los Angeles, CA, Dec. 7-9, 1998, pp. 27-32.
[41] J. Huang, Z. Liu, Y. Wang, Y. Chen, and E.K. Wong, "Integration of multimodal features for video classification based on HMM," in IEEE Workshop Multimedia Signal Processing (MMSP-99), Copenhagen, Denmark, Sept. 13-15, 1999, pp. 53-58.
[42] L.R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. IEEE, vol. 77, pp. 257-286, Feb. 1989.
[43] M.T. Chan, Y. Zhang, and T.S. Huang, "Real-time lip tracking and bimodal continuous speech recognition," in IEEE Workshop Multimedia Signal Processing (MMSP-98), Los Angeles, CA, Dec. 7-9, 1998, pp. 65-70.
[44] G. Potamianos and H.P. Graf, "Discriminative training of HMM stream exponents for AV speech recognition," in Proc. Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP-98), vol. 6, Seattle, WA, May 12-15, 1998, pp. 3733-3766.
[45] H.D. Wactlar, M.G. Christel, Y. Gong, and A.G. Hauptmann, "Lessons learned from building a terabyte digital video library," IEEE Computer Mag., vol. 32, pp. 66-73, Feb. 1999.
[46] H.D. Wactlar, T. Kanade, M.A. Smith, and S.M. Stevens, "Intelligent access to digital video: Informedia project," IEEE Computer Mag., vol. 29, pp. 46-52, May 1996.
[47] Available: http://www.infomedia.cs.cmu.edu/html/main.html
[48] Available: http://www.dli2.nsf.gov/index.html
[49] M. Hwang, R. Rosenfeld, E. Thayer, R. Mosur, L. Chase, R. Weide, X. Huang, and F. Alleva, "Improving speech recognition performance via phone-dependent VQ codebooks and adaptive language models in SPHINX-II," in Proc. Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP-94), vol. 1, Adelaide, Australia, Apr. 19-22, 1994, pp. 549-552.
[50] A.G. Hauptmann and M.A. Smith, "Text, speech and vision for video segmentation: The Informedia project," in Proc. AAAI Fall Symp. Computational Models for Integrating Language and Vision, Boston, MA, Nov. 10-12, 1995.
[51] T. Sato, T. Kanade, E.K. Hughes, M.A. Smith, and S. Satoh, "Video OCR: Indexing digital news libraries by recognition of superimposed caption," ACM Multimedia Syst., vol. 7, no. 5, pp. 385-395, 1999.
[52] Y.H. Gong, G. Proietti, and C. Faloutsos, "Image indexing and retrieval based on human perceptual color clustering," in Proc. Computer Vision and Pattern Recognition Conf., 1998, pp. 578-583.
[53] M. Smith and T. Kanade, "Video skimming and characterization through the combination of image and language understanding techniques," in Proc. Computer Vision and Pattern Recognition Conf., San Juan, Puerto Rico, June 1997, pp. 775-781.
[54] M. Christel et al., "Evolving video skims into useful multimedia abstractions," in Proc. CHI'98 Conf. Human Factors in Computing Systems, 1998, pp. 171-178.
[55] S. Satoh, Y. Nakamura, and T. Kanade, "Name-It: Naming and detecting faces in news videos," IEEE Multimedia Mag., vol. 6, pp. 22-35, Jan.-Mar. 1999.
[56] M. Turk and A. Pentland, "Eigenfaces for recognition," J. Cognitive Neuroscience, vol. 3, no. 1, pp. 71-86, 1991.
[57] B. Shahraray and D. Gibbon, "Pictorial transcripts: Multimedia processing applied to digital library creation," in Proc. IEEE 1st Multimedia Signal Processing Workshop (MMSP-97), Princeton, NJ, June 23-25, 1997, pp. 581-586.
[58] B. Shahraray, "Scene change detection and content-based sampling of video sequences," in Proc. SPIE Conf. Digital Video Compression: Algorithms and Technologies, 1995, pp. 2-13.
[59] Z. Liu and Q. Huang, "Detecting news reporting using AV information," in Proc. IEEE Int. Conf. Image Processing (ICIP-99), Kobe, Japan, Oct. 1999, pp. 324-328.
[60] Z. Liu and Q. Huang, "Adaptive anchor detection using on-line trained AV model," in Proc. SPIE Conf. Storage and Retrieval for Media Database, San Jose, CA, Jan. 2000, pp. 156-167.
[61] Available: http://www.informatik.uni-mannheim.de/informatik/pi4/projects/MoCA
[62] R. Lienhart, S. Pfeiffer, and W. Effelsberg, "Video abstracting," J. ACM, vol. 40, pp. 55-62, Dec. 1997.
[63] S. Pfeiffer, R. Lienhart, S. Fischer, and W. Effelsberg, "Abstracting digital movies automatically," J. Visual Commun. Image Represent., vol. 7, no. 4, pp. 345-353, 1996.
[64] R. Lienhart, "Automatic text recognition for video indexing," in Proc. 4th ACM Int. Conf. Multimedia, Boston, MA, Nov. 18-22, 1996, pp. 11-20.
[65] Available: http://www.darmstadt.gmd.de/mobile/MPEG7
[66] F. Nack and A.T. Lindsay, "Everything you wanted to know about MPEG-7: Part I," IEEE Multimedia Mag., vol. 6, pp. 65-77, July-Sept. 1999.
[67] F. Nack and A.T. Lindsay, "Everything you wanted to know about MPEG-7: Part II," IEEE Multimedia Mag., vol. 6, pp. 64-73, Oct.-Dec. 1999.
[68] MPEG-7 Generic AV Description Schemes (v0.8), ISO/IEC JTC1/SC29/WG11 M5380, Dec. 1999.
[69] Supporting Information for MPEG-7 Description Schemes, ISO/IEC JTC1/SC29/WG11 N3114, Dec. 1999.
[70] MPEG-7 Visual Part of Experimentation Model (v4.0), ISO/IEC JTC1/SC29/WG11 N3068, Dec. 1999.
[71] F. Mokhtarian, S. Abbasi, and J. Kittler, "Robust and efficient shape indexing through curvature scale space," in Proc. British Machine Vision Conf., Edinburgh, UK, 1996, pp. 53-62.
[72] Available: http://www.ee.surrey.ac.uk/Research/VSSP imagedb/demo.html
[73] A Rotation Invariant Geometric Shape Description Using Zernike Moments, ISO/IEC JTC1/SC29/WG11, P687, Feb. 1999.
[74] Framework for MPEG-7 Audio Descriptor, ISO/IEC JTC1/SC29/WG11 N3078, Dec. 1999.
