Contributions in Audio Modeling and Multimodal Data Analysis via Machine Learning

Ngoc Duong
To cite this version: Ngoc Duong. Contributions in Audio Modeling and Multimodal Data Analysis via Machine Learning. Machine Learning [stat.ML]. Université de Rennes 1, 2020. tel-02963685

HAL Id: tel-02963685
https://hal.archives-ouvertes.fr/tel-02963685
Submitted on 11 Oct 2020

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

YEAR 2020
UNIVERSITÉ DE RENNES 1
under the seal of the Université Européenne de Bretagne

Thesis for the degree of Habilitation à Diriger des Recherches
Specialty: Signal Processing and Machine Learning

presented by Rémi, Quang-Khanh-Ngoc DUONG

prepared and defended at InterDigital, Cesson-Sévigné, France, on 8 October 2020, before the jury composed of:

Mark Plumbley, Professor, University of Surrey, UK / reviewer
Laurent Girin, Professor, Institut Polytechnique de Grenoble / reviewer
Caroline Chaux, Chargée de Recherche, CNRS / reviewer
Patrick Le Callet, Professor, Polytech Nantes / Université de Nantes / examiner
Guillaume Gravier, Directeur de Recherche, CNRS / examiner

Acknowledgements

I would like to acknowledge the support of the many people who have helped me along the way to this milestone. First of all, I would like to thank all the jury members, Mark Plumbley, Laurent Girin, Caroline Chaux, Patrick Le Callet, and Guillaume Gravier, for the evaluation of this thesis.

My special thanks go to Alexey Ozerov, Patrick Pérez, Claire-Hélène Demarty, Slim Essid, Gaël Richard, Bogdan Ionescu, Mats Sjöberg, and many other colleagues whose names I cannot mention one by one here. They have kindly collaborated with me along the way to this milestone. I would like to thank Emmanuel Vincent and Rémi Gribonval for supervising my PhD work, which opened a new door to my professional career. I would also like to thank Seung-Hyon Nam for supervising my Master's thesis, which introduced me to the field of signal processing. I am also very grateful to Romain Cohendet, Sanjeel Parekh, Hien-Thanh Duong, Dalia El Badawy, and the thirteen other Master's intern students whom I co-supervised with colleagues. Without them, the work described in this manuscript would not have been possible.

I thank all my friends in Vietnam, France, and other countries, without mentioning their names as there are many. You have always been with me through the happy and sad moments. Last but not least, my special love goes to my parents, my sisters, my relatives in Vietnam, my wife, and my children Kévin and Émilie.

Abstract

This HDR manuscript summarizes our work concerning the application of machine learning techniques to solve various problems in audio and multimodal data. First, we will present nonnegative matrix factorization (NMF) modeling of audio spectrograms to address the audio source separation problem, in both single-channel and multichannel settings.
Second, we will focus on the multiple instance learning (MIL)-based audio-visual representation learning approach, which allows us to tackle several tasks such as event/object classification, audio event detection, and visual object localization. Third, we will present contributions to multimodal media interestingness and memorability, including the construction of novel datasets, their analysis, and computational models. This summary is based on major contributions published in three journal papers in the IEEE/ACM Transactions on Audio, Speech, and Language Processing, and a paper presented at the International Conference on Computer Vision (ICCV). Finally, we will briefly mention our other work in different applications concerning audio synchronization, audio zoom, audio classification, audio style transfer, speech inpainting, and image inpainting.

Contents

Acknowledgements
Introduction
1 Audio source separation
  1.1 Motivation and problem formulation
  1.2 Background of NMF-based supervised audio source separation
  1.3 Single-channel audio source separation exploiting the generic source spectral model (GSSM)
    1.3.1 Motivation, challenges, and contributions
    1.3.2 GSSM construction and model fitting
    1.3.3 Group sparsity constraints
    1.3.4 Relative group sparsity constraints
    1.3.5 Algorithms for parameter estimation and results
  1.4 Multichannel audio source separation exploiting the GSSM
    1.4.1 Local Gaussian modeling
    1.4.2 NMF-based source variance model
    1.4.3 Source variance fitting with GSSM and group sparsity constraint
    1.4.4 Algorithms for parameter estimation and results
  1.5 Other contributions in audio source separation
    1.5.1 Text-informed source separation
    1.5.2 Interactive user-guided source separation
    1.5.3 Informed source separation via compressive graph signal sampling
  1.6 Conclusion
2 Audio-visual scene analysis
  2.1 Motivation and related works
  2.2 Weakly supervised representation learning framework
  2.3 Some implementation details and variant
  2.4 Results
  2.5 Conclusion
3 Media interestingness and memorability
  3.1 Image and video interestingness
  3.2 Video memorability
    3.2.1 VM dataset creation
    3.2.2 VM understanding
    3.2.3 VM prediction
  3.3 Conclusion
4 Other contributions
  4.1 Audio synchronization using fingerprinting
  4.2 Audio zoom via beamforming technique
  4.3 Audio classification
  4.4 Audio style transfer
  4.5 Speech inpainting
  4.6 Image inpainting
5 Conclusion
  5.1 Achievements
  5.2 Future directions
Appendices
A Paper 1: On-the-Fly Audio Source Separation—A Novel User-Friendly Framework
B Paper 2: Gaussian Modeling-Based Multichannel Audio Source Separation Exploiting Generic Source Spectral Model
C Paper 3: Weakly Supervised Representation Learning for Audio-Visual Scene Analysis
D Paper 4: VideoMem: Constructing, Analyzing, Predicting Short-Term and Long-Term Video Memorability
E Curriculum Vitae

Introduction

This HDR thesis is an extensive summary of a major part of the work done since my PhD defense in 2011. Following the PhD, which focused on audio signal processing, I first worked on audio-related topics such as non-negative matrix factorization (NMF)-based audio source separation [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], audio synchronization using fingerprinting techniques [12, 13], and audio zoom using beamforming techniques [14].
Then, in the deep learning era, thanks to collaborations with a number of colleagues and PhD/Master's students, I have extended my research interests to the application of machine learning techniques in audio/image manipulation and multimodal data analysis. In the first area, I have considered multiple problems such as audio style transfer [15], speech inpainting [16], and image inpainting [17]. In the second area, I have investigated other challenges such as audio-visual source separation [18, 19], audio-visual representation learning applied to event/object classification and localization [20, 21, 22], image/video interestingness [23, 24, 25, 26], and image/video memorability [27, 28, 29] for visual content assessment. In particular, to push research forward on such high-level concepts as how media content can be interesting and memorable to viewers, I co-founded two series of international challenges in the MediaEval benchmark (http://www.multimediaeval.org/): the Predicting Media Interestingness Task, running in 2016 [30] and 2017 [31], and the Predicting Media Memorability Task, running in 2018 [32] and 2019 [33]. These tasks have greatly interested the multimedia research community, as shown by the large number of international participants.

As most of my work applies machine learning techniques (whether a conventional model such as NMF or the emerging deep learning approach) to analyze audio and multimodal data, I have entitled the thesis "Contributions in Audio Modeling and Multimodal Data Analysis via Machine Learning" and present in this document mainly the work described in the following four major publications:

• Paper 1 [8]: a novel user-guided audio source separation framework based on NMF with group sparsity constraints is introduced for the single-channel setting. The NMF-based generic source spectral models (GSSM) that govern the separation process are learned on the fly from audio examples retrieved online (a minimal NMF sketch is given after this list).

• Paper 2 [11]: a combination of the GSSM introduced in Paper 1 with the full-rank spatial covariance model within a unified Gaussian modeling framework is proposed to address multichannel mixtures. In particular, a new source variance separation criterion is considered in order to better constrain the intermediate source variances estimated in each EM iteration (see the Wiener filtering sketch after this list).

• Paper 3 [22]: a novel multimodal framework that instantiates multiple instance learning (MIL) is proposed for audio-visual (AV) representation learning. The learned representations are shown to be useful for performing several tasks such as event/object classification, audio event detection, audio source separation, and visual object localization. Notably, the proposed framework has the capacity to learn from unsynchronized audio-visual events (see the MIL pooling sketch after this list).

• Paper 4 [29]: this work focuses on understanding the intrinsic memorability of visual content. For this purpose, a large-scale dataset (VideoMem10k) composed of 10,000 videos with both short-term and long-term memorability scores is introduced to the public. Various deep neural network-based models for the prediction of video memorability are investigated, and our model with an attention mechanism provides insights into what makes content memorable.
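To make the supervised NMF decomposition at the heart of Paper 1 concrete, the following is a minimal Python sketch of single-channel separation with a pretrained two-source dictionary and a log-based group sparsity penalty on the activations. It is an illustration only: the function names, the choice of KL-divergence multiplicative updates, and the penalty weight `lam` are assumptions made for exposition, not the exact criterion or the relative group sparsity constraints of [8].

```python
import numpy as np

def separate_two_sources(V, W1, W2, n_iter=200, lam=0.1, eps=1e-12):
    """Hypothetical sketch of supervised NMF separation with a
    group sparsity penalty on the activations.

    V  : (F, N) nonnegative magnitude spectrogram of the mixture
    W1 : (F, K1) pretrained spectral dictionary for source 1
    W2 : (F, K2) pretrained spectral dictionary for source 2
    lam: weight of an illustrative log-based group sparsity penalty
    """
    W = np.hstack([W1, W2])                       # fixed dictionary
    groups = [np.arange(W1.shape[1]),
              W1.shape[1] + np.arange(W2.shape[1])]
    rng = np.random.default_rng(0)
    H = rng.random((W.shape[1], V.shape[1])) + eps

    for _ in range(n_iter):
        V_hat = W @ H + eps
        num = W.T @ (V / V_hat)                   # KL-divergence numerator
        den = W.T @ np.ones_like(V)               # KL-divergence denominator
        for g in groups:                          # log(eps + ||H_g||_1) penalty
            den[g] += lam / (eps + H[g].sum())
        H *= num / (den + eps)                    # multiplicative update

    V_hat = W @ H + eps
    # Wiener-like mask of each source from its block of components
    masks = [(W[:, g] @ H[g]) / V_hat for g in groups]
    return masks, H
```

Applying each mask to the complex mixture STFT and inverting yields the time-domain source estimates; in the on-the-fly setting of [8], W1 and W2 would themselves be learned from retrieved audio examples.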
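For the multichannel setting of Paper 2, the core object is the local Gaussian model: at each time-frequency bin, the image of source j is modeled as a zero-mean Gaussian with covariance v_j R_j, where v_j is a scalar variance (NMF/GSSM-structured across bins in the paper) and R_j a full-rank spatial covariance matrix. The sketch below shows only the resulting multichannel Wiener filtering step for one bin, under illustrative variable names; the actual algorithm in [11] estimates v and R by EM with the GSSM constraining the source variances.

```python
import numpy as np

def multichannel_wiener_bin(x, v, R):
    """Illustrative local Gaussian model separation at one
    time-frequency bin (not the full EM algorithm of [11]).

    x : (I,) complex mixture STFT coefficients over I channels
    v : (J,) nonnegative variances of the J sources at this bin
    R : (J, I, I) full-rank spatial covariance of each source
    Returns the (J, I) MMSE estimates of the J source images.
    """
    Sigma_x = sum(vj * Rj for vj, Rj in zip(v, R))   # mixture covariance
    Sigma_inv = np.linalg.inv(Sigma_x)
    # Wiener gain of source j is v_j R_j Sigma_x^{-1}
    return np.stack([v[j] * R[j] @ Sigma_inv @ x for j in range(len(v))])
```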
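Finally, the MIL principle behind Paper 3 can be illustrated in a few lines: per-instance scores (over audio temporal segments and visual region proposals) are aggregated by max pooling into a bag-level prediction, so training needs only weak video-level labels, while the instance scores indicate which segment or region fired. Everything here (linear scoring, sum fusion, the names Wa and Wv) is a simplified assumption rather than the deep architecture of [22].

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def av_mil_scores(audio_feats, visual_feats, Wa, Wv):
    """Illustrative audio-visual MIL scoring for one video.

    audio_feats  : (Ta, Da) features of Ta audio temporal segments
    visual_feats : (Tv, Dv) features of Tv visual region proposals
    Wa, Wv       : (Da, C) and (Dv, C) per-modality classifier weights
    Returns (C,) video-level class probabilities trainable from
    weak video-level labels alone.
    """
    audio_scores = audio_feats @ Wa            # (Ta, C) instance scores
    visual_scores = visual_feats @ Wv          # (Tv, C) instance scores
    bag_audio = audio_scores.max(axis=0)       # MIL max pooling per class
    bag_visual = visual_scores.max(axis=0)
    return sigmoid(bag_audio + bag_visual)     # fused video-level prediction
```

Because each modality is pooled independently before fusion, such a scheme can in principle relate an audio event to a visual object even when the two are not synchronized in time, which is the property emphasized in Paper 3.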