Real-Time Soundtrack Analysis

UNIVERSIDAD AUTONOMA DE MADRID ESCUELA POLITECNICA SUPERIOR PROYECTO FIN DE CARRERA REAL-TIME SOUNDTRACK ANALYSIS Francesco Alfeo 2009 REAL-TIME SOUNDTRACK ANALYSIS (Análisis en vivo de la banda sonora) AUTOR: Francesco Alfeo TUTOR: José M. Martínez Dpto. de Ingeniería de Telecomunicación Escuela Politécnica Superior Universidad Autónoma de Madrid diciembre de 2009 Summary The practical realization of the project described in the present document took place in the Telecommunication Engineering Faculty of the Escuela Politecnica Superior (Autónoma University, Madrid), with the supervision of prof. Josè Marià Martinez Sanchez. This work tries to find a solution to the problem of automatically segment a video using information extracted from the audio track that accompanies it. For this purpose, two classifiers have been designed and projected: the first one is capable to perform this task in a real-time contest, while the other performs the same task without worrying about any time constrain. Another objective of the present work is to discover and discuss the differences between these two classifiers in terms of computational complexity, execution velocity and obtained performances. Both classifiers start from a WAV file (obtained from the starting audio o video file of a generic format) to subdivide the analyzed clip into smaller parts, depending on the identified audio type. At the end of the process, there will be parts belonging to 4 different audio types: voice, music, noise and silence. The proposed classification algorithm is divided into two parts: the first one recognizes and marks the silence frames in the audio clip, while the second one distinguishes the remaining frames marking them as music, noise or voice. Both parts use a series of heuristic conditions based on statistical measures (local maximum and minimum, variance and arithmetic mean) calculated from a series of audio low-level features (energy, Zero Crossing Rate, Root Mean Square, Spectral Energy Distribution in 4 frequency sub bands, spectral Centroid and Roll-Off point). Only in the second part of the algorithm, a certain number of points (stored in a vector) is assigned to each audio type for a certain frame each time one of the above mentioned conditions is satisfied. Results obtained by both classifiers appear satisfactory: the off-line classifier reaches on average (evaluating 10 different combinations “length of an analysis window, number of analysis windows for frame” - both parameters are set by the user) a classification accuracy equal to 77%, while the real-time one obtains 72% and fully satisfies real-time constraints. These results appear to be very promising as a starting point for future lines of research and development in the field of automated video segmentation using audio features only. INDEX OF CONTENTS Chapter 1 INTRODUCTION.......................................................................................1 1.1 Motivations and objectives.......................................................................1 1.2 Work structure..........................................................................................3 Chapter 2 BACKGROUND..........................................................................................4 2.1 Digital Signal Processing: definition and history.....................................4 2.2 Automated video segmentation and classification....................................5 2.2.1 Approaches to video classification......................................................................6 2.2.2 A panoramic on algorithms of video segmentation using video features............8 2.3 Audio classification and segmentation...................................................10 2.3.1 Why audio content analysis?.............................................................................10 2.3.2 Important contributions in the field of audio classification...............................11 2.3.3 Audio classification algorithms with real-time constraints...............................13 2.4 Integration of audio and visual information for video segmentation......15 2.5 Summary.................................................................................................17 Chapter 3 WORK DESIGN........................................................................................18 3.1 Introduction............................................................................................18 3.2 Classified audio types: a definition........................................................18 3.2.1 Voice..................................................................................................................19 3.2.2 Music.................................................................................................................21 3.2.3 Noise..................................................................................................................22 3.2.4 Silence...............................................................................................................22 3.3 Low-level features choice.......................................................................23 3.3.1 Time-domain features........................................................................................24 3.3.1.1 Energy.........................................................................................................................24 3.3.1.2 Zero Crossing Rate......................................................................................................25 3.3.2 Frequency-domain features...............................................................................26 3.3.2.1 Energy distribution across frequency bands................................................................26 3.3.2.2 Root Mean Square.......................................................................................................27 3.3.2.3 Spectral Centroid.........................................................................................................28 3.3.2.4 Roll-off Point...............................................................................................................29 3.4 An idea for the classification algorithm.................................................29 Chapter 4 WORK DEVELOPMENT........................................................................32 4.1 Chapter organization.............................................................................32 4.2 Audio file opening and reading..............................................................33 4.2.1 Practical implementation...................................................................................35 4.3 Features extraction.................................................................................36 4.3.1 Off-line classifier...............................................................................................36 4.3.2 Real-time classifier............................................................................................38 4.4 Features plotting....................................................................................40 4.4.1 Audio types behavior as a function of low-level features.................................41 4.4.1.1 Energy.........................................................................................................................42 4.4.1.2 Zero Crossing Rate......................................................................................................43 4.4.1.3 Energy distribution across frequency bands................................................................44 4.4.1.4 Root Mean Square.......................................................................................................47 4.4.1.5 Spectral Centroid.........................................................................................................48 4.4.1.6 Roll-off Point...............................................................................................................49 4.5 Classification algorithm.........................................................................50 4.5.1 Heuristic-based conditions................................................................................51 4.5.2 A points vector for each audio type...................................................................54 4.5.3 Practical implementation...................................................................................56 4.5.3.1 Off-line classifier.........................................................................................................56 4.5.3.2 Real-time classifier......................................................................................................62 4.6 Classification results plotting.................................................................65 4.6.1 Off-line classifier...............................................................................................66 4.6.2 Real-time classifier............................................................................................71 4.7 Summary.................................................................................................71 Chapter 5 EVALUATIONS........................................................................................74 5.1 Preliminary considerations....................................................................74 5.1.1 Manual classification and its limits...................................................................74 5.1.2 Choice of parameters to describe

Load more