Design of QbE-STD System: Audio Representation and Matching Perspective
by
Maulik C. Madhavi 201121003
A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of
DOCTOR OF PHILOSOPHY in INFORMATION AND COMMUNICATION TECHNOLOGY to
DHIRUBHAI AMBANI INSTITUTE OF INFORMATION AND COMMUNICATION TECHNOLOGY
October, 2017
Declaration
I hereby declare that
i) the thesis comprises my original work towards the degree of Doctor of Philosophy in Information and Communication Technology at Dhirubhai Ambani Institute of Information and Communication Technology and has not been submitted elsewhere for a degree,
ii) due acknowledgment has been made in the text to all the reference material used.
Maulik C. Madhavi
Certificate
This is to certify that the thesis work entitled, “Design of QbE-STD System: Audio Representation and Matching Perspective,” has been carried out by Maulik C. Madhavi for the degree of Doctor of Philosophy in Information and Communication Technology at Dhirubhai Ambani Institute of Information and Communication Technology under my supervision.
Prof. Hemant A. Patil Thesis Supervisor
Acknowledgments
First and foremost, I would like to thank the Almighty God for providing me the courage and capability to proceed and to complete this thesis work successfully. Many extraordinary people assisted me, directly or indirectly, in achieving this work; I will mention just a few of them. I would like to express my sincere gratitude to my supervisor, Prof. Hemant A. Patil, for his insightful suggestions and expert guidance. His patience, motivation, enthusiasm, immense knowledge, and friendly nature helped me a lot to produce this fruitful outcome. I could not have imagined having a better supervisor (and mentor) than Prof. Hemant A. Patil. I am fortunate to have had his excellent support during my doctoral study. I would also like to thank DA-IICT, my research progress seminar (RPS) committee (Prof. L. Pillutla and Prof. A. Tatu), and my thesis examination committee (Prof. A. Banerjee, Prof. L. Pillutla, and Prof. Y. Vasavada) for suggesting valuable inputs to my work. I acknowledge my thesis examiners, namely, Prof. Hynek Hermansky (Johns Hopkins University) and Prof. K. Sri Rama Murty (IIT Hyderabad), for valuable technical inputs related to the thesis. Besides them, I would also like to thank all the staff and faculty members of DA-IICT, who have been kind enough to help me throughout my stay at DA-IICT. I have learned many things, and their help shaped my career as a researcher. I thank the Resource Center (RC), DA-IICT, for creating an excellent research atmosphere and infrastructure. I would like to thank the organizers of MediaEval SWS 2013 and MediaEval QUESST 2014 for providing the databases, which are used extensively in this thesis. I would also like to thank Prof. (Dr.) Cheung-Chi Leung and his team for sharing the MATLAB code for Acoustic Segment Model (ASM) training, which was used in Chapter 4. I also sincerely thank the Department of Electronics and Information Technology (DeitY), Govt. of India, for sponsoring a consortium project (in which I was a project staff member), and the authorities of DA-IICT for providing excellent infrastructure. I would like to thank the anonymous reviewers who reviewed my work submitted to journals and conferences. They provided very useful insights into the work, which tremendously helped me to revise the particular papers as well as this thesis. Besides them, I would also like to thank my lab-mates at the Speech Research Lab, DA-IICT. The fruitful technical discussions with all the members were incredible learning opportunities. Thanks to all for creating a continuous learning environment in the lab. The team effort and shared learning experiences across the members tremendously helped in shaping the thesis and in developing me as a good team member. We all worked together to meet the deadlines and enjoyed several celebratory moments during the last six years. Last but not least, I would like to thank my family: my parents, brother, and wife. I have no words to describe their huge support and love; “thanks” is too small a word to describe their affection for me. They missed me during our family get-togethers and supported me throughout the doctoral study. I must acknowledge that without their encouragement, love, care, and blessings, I would not have finished this thesis.
Contents
Abstract ...... xii
List of Acronyms ...... xii
List of Symbols ...... xvi
List of Tables ...... xix
List of Figures ...... xxi
1 Introduction ...... 1
1.1 Motivation ...... 1
1.2 Spoken Content Retrieval Systems ...... 3
1.2.1 Spoken Term Detection ...... 3
1.2.1.1 Research Issues in STD ...... 5
1.2.2 Query-by-Example Spoken Term Detection ...... 5
1.2.2.1 Research Issues in QbE-STD ...... 6
1.2.3 Keyword Spotting System ...... 6
1.2.3.1 Research Issues in KWS ...... 7
1.3 Focus and Contributions in the Thesis ...... 7
1.3.1 GMM Framework for VTLN ...... 7
1.3.2 Mixture of GMMs for Posteriorgram Design ...... 8
1.3.3 Partial Matching for Non-Exact Query Matching ...... 8
1.3.4 Feature Reduction Approach ...... 9
1.3.5 Segment-Level Bag-of-Acoustic Words ...... 9
1.3.6 Exploring Detection Sources and Multiple Acoustic Features ...... 9
1.4 Organization of the Thesis ...... 10
1.5 Chapter Summary ...... 11
2 Literature Survey ...... 13
2.1 Introduction ...... 13
2.2 Motivation and Components of QbE-STD ...... 13
2.3 Performance Evaluation Metrics ...... 15
2.4 Front-end Subsystem ...... 18
2.4.1 Acoustic Representation ...... 19
2.4.2 Speech Activity Detection ...... 20
2.4.3 Posteriorgram Representation ...... 21
2.4.3.1 Supervised Posteriorgram ...... 21
2.4.3.2 Unsupervised Posteriorgram ...... 25
2.4.4 Symbol-based Representation ...... 30
2.5 Searching Subsystem ...... 30
2.5.1 Search Subsystem for Frame-based Representation ...... 31
2.5.1.1 Computational Improvement for Frame-based Approaches ...... 32
2.5.2 Search Subsystem for Symbol-based Representation ...... 35
2.6 Detection Subsystem ...... 37
2.6.1 Score Normalization ...... 37
2.6.2 Query Selection ...... 37
2.6.3 Pseudo Relevance Feedback ...... 38
2.6.4 Non-Exact Query Matching ...... 40
2.6.5 Calibration and Fusion of Multiple Search Systems ...... 40
2.7 Research Issues in QbE-STD System ...... 41
2.8 QbE-STD Submission in MediaEval ...... 42
2.8.1 Summary of MediaEval SWS 2011 ...... 42
2.8.2 Summary of MediaEval SWS 2012 ...... 44
2.8.3 Summary of MediaEval SWS 2013 ...... 45
2.8.4 Summary of MediaEval QUESST 2014 ...... 47
2.8.5 Summary of MediaEval QUESST 2015 ...... 47
2.9 Chapter Summary ...... 49
3 Experimental Setup ...... 51
3.1 Introduction ...... 51
3.2 Databases Used ...... 51
3.2.1 Challenges in Databases ...... 53
3.3 Front-end Subsystem ...... 53
3.3.1 Acoustic Representation ...... 53
3.3.1.1 MFCC and PLP ...... 54
3.3.1.2 MFCC-TMP ...... 54
3.3.2 Posterior Representation ...... 55
3.3.2.1 Motivation for GMM ...... 55
3.4 Searching Subsystem ...... 56
3.5 Detection Subsystem ...... 61
3.5.1 Score-level Fusion of Multiple Systems ...... 64
3.6 Effect of Local Constraints ...... 65
3.7 Effect of Dissimilarity Functions ...... 70
3.8 Chapter Summary ...... 70
4 Representation Perspective ...... 71
4.1 Introduction ...... 71
4.2 Vocal Tract Length Normalization ...... 71
4.2.1 Prior Studies in VTLN ...... 73
4.2.2 GMM-based VTLN ...... 75
4.2.3 Iterative Approach for VTLN ...... 78
4.2.4 Results for Phoneme Recognition ...... 80
4.2.5 Experimental Results ...... 82
4.2.5.1 Effect of Number of Gaussians ...... 84
4.2.5.2 Effect of Local Constraints ...... 85
4.2.5.3 Multiple Examples per Query ...... 85
4.2.5.4 Score-level Fusion of VTL-warped Gaussian Posteriorgrams ...... 86
4.2.5.5 VTLN on Reduced/Expanded Number of Features ...... 88
4.2.5.6 Deterministic Annealing Expectation Maximization ...... 89
4.3 Mixture of GMMs ...... 91
4.3.1 Mixture of GMM Posteriorgram ...... 91
4.3.2 Practical Implementation ...... 93
4.3.2.1 Broad Phoneme Posterior Probability ...... 93
4.3.2.2 Relative Significance of Each Broad Class ...... 94
4.3.2.3 Training Procedure and Posteriorgram Computation ...... 95
4.3.3 Experimental Results ...... 98
4.3.3.1 Score-level Fusion ...... 98
4.3.3.2 Effect of Amount of Labeled Data ...... 99
4.4 Chapter Summary ...... 100
5 Audio Matching Perspective ...... 101
5.1 Introduction ...... 101
5.2 Partial Matching for Non-exact Query Matching ...... 102
5.3 Feature Reduction Approach ...... 110
5.3.1 Phone Segmentation ...... 111
5.3.2 Results for SWS 2013 ...... 113
5.3.3 Results for QUESST 2014 ...... 114
5.4 Segment-level Bag-of-Acoustic Words ...... 115
5.4.1 BoAW Model ...... 115
5.4.1.1 Score Computation ...... 119
5.4.2 Results for SWS 2013 using Phonetic Posteriorgram ...... 120
5.4.2.1 Performance of the First Stage ...... 121
5.4.2.2 Performance of the Second Stage ...... 122
5.4.3 Results for SWS 2013 using Gaussian Posteriorgram ...... 125
5.4.3.1 Performance of the First Stage ...... 125
5.4.3.2 Performance of the Second Stage ...... 126
5.4.4 Results for QUESST 2014 ...... 128
5.5 Chapter Summary ...... 128
6 Exploring Multiple Resources ...... 131
6.1 Introduction ...... 131
6.2 Acoustic Features ...... 132
6.2.1 Warped Linear Prediction ...... 133
6.2.2 Modified Group Delay Function ...... 134
6.2.3 OpenSMILE Library Features ...... 134
6.2.4 Experimental Results ...... 134
6.3 Detection Sources ...... 137
6.3.1 Depth of Detection Valley ...... 138
6.3.2 Term Frequency Similarity ...... 138
6.3.3 Self-Similarity Matrix ...... 138
6.3.4 Pseudo Relevance Feedback ...... 139
6.3.5 Weighted Mean Features ...... 140
6.3.6 Experimental Results ...... 142
6.3.6.1 Combining Acoustic Features and Detection Sources ...... 144
6.4 Chapter Summary ...... 146
7 Summary and Conclusions ...... 147
7.1 Summary of the Thesis ...... 147
7.2 Limitations of the Work ...... 149
7.3 Future Research Directions ...... 149
A TIMIT QbE-STD ...... 153
A.1 VTL-warped Gaussian Posteriorgram ...... 154
A.1.1 Effect of Number of Gaussians ...... 155
A.1.1.1 Effect of Local Constraints ...... 156
A.1.1.2 Effect of Number of Iterations in Proposed Iterative Approach ...... 156
A.1.2 Deterministic Annealing Expectation Maximization ...... 156
A.2 Mixture of GMMs ...... 157
B Miscellaneous Studies ...... 161
B.1 Prosodically Guided Phonetic Engine ...... 161
B.1.1 Phonetic Engine ...... 162
B.1.2 QbE-STD system ...... 164
B.2 Effect of Isolated Query vs. Query in Carrier Phrase ...... 165
B.3 Non-uniform Frequency Warping Approaches ...... 166
B.4 QbE-STD System for Gujarati Language ...... 167
C Frequency Warping ...... 169
C.1 Time-Resampling ...... 169
C.2 Filterbank Modification ...... 169
D Flowchart of Vector Quantization ...... 171
E Selection of Operating Threshold θ in TWV ...... 173
F Role of β in TWV ...... 175
References ...... 177
List of Publications from Thesis ...... 195
Abstract
The retrieval of spoken documents and the detection of a query (keyword) within an audio document have attracted huge research interest. The problem of retrieving audio documents and detecting the query (keyword) using a spoken form of the query is widely known as Query-by-Example Spoken Term Detection (QbE-STD). This thesis presents the design of a QbE-STD system from the representation and matching perspectives. A speech spectrum is known to be affected by variations in the length of the vocal tract of a speaker, due to the inverse relation between formant frequencies and vocal tract length. The process of compensating for the spectral variation caused by the length of the vocal tract is popularly known as Vocal Tract Length Normalization (VTLN), especially in the speech recognition literature. VTLN is a very important speaker normalization technique for the speech recognition task. In this context, this thesis proposes the use of a Gaussian posteriorgram of VTL-warped spectral features for the QbE-STD task. This study presents the novel use of a Gaussian Mixture Model (GMM) framework for VTLN warping factor estimation. In particular, the presented GMM framework does not require phoneme-level transcription and hence can be useful for the unsupervised task. In addition, we also propose the use of a mixture of GMMs for posteriorgram design. Speech data exhibits acoustically similar broad phonetic structures. To capture this broad phonetic structure, we exploit supplementary knowledge of broad phoneme classes (such as vowels, semi-vowels, nasals, fricatives, and plosives) for the training of the GMM. The mixture of GMMs is tied with the GMMs of these broad phoneme classes. A GMM trained with no supervision assigns uniform priors to each Gaussian component, whereas a mixture of GMMs assigns the prior probability based on the broad phoneme class. The novelty of our work lies in the prior probability assignments (as weights of the mixture of GMMs) for better Gaussian posteriorgram design.
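For a concrete view of the representation side, the following is a minimal pure-Python sketch of how a Gaussian posteriorgram is computed from an already-trained diagonal-covariance GMM: each speech frame is mapped to a vector of posterior probabilities over the Gaussian components. The GMM training and the VTL warping of the features are outside this snippet, and all names are illustrative.

```python
import math

def gaussian_posteriorgram(frames, means, variances, weights):
    """For each frame o_t, compute P(k | o_t) over the K diagonal-covariance
    Gaussian components, i.e., one posterior vector per frame."""
    posteriorgram = []
    for o in frames:
        log_liks = []
        for mu, var, w in zip(means, variances, weights):
            ll = math.log(w)
            for x, m, v in zip(o, mu, var):
                ll += -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
            log_liks.append(ll)
        mx = max(log_liks)  # log-sum-exp for numerical stability
        denom = mx + math.log(sum(math.exp(l - mx) for l in log_liks))
        posteriorgram.append([math.exp(l - denom) for l in log_liks])
    return posteriorgram
```

Each output row sums to one; a frame close to a component mean yields a posterior vector dominated by that component.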
In realistic scenarios, there is a need to retrieve a query that does not appear exactly in the spoken document: the occurring instance of the query might have a different suffix, prefix, or word order. The DTW algorithm monotonically aligns the two sequences and hence is not suitable for partial matching between the frame sequences of the query and the test utterance. We propose a novel partial matching approach between the spoken query and the utterance using a modified DTW algorithm, where multiple warping paths are constructed for each query and test utterance pair. This partial matching approach improves the detection of non-exact queries in realistic scenarios, where both exact and non-exact queries are present. Next, we address the research issue associated with the search complexity of the DTW algorithm and suggest two approaches, namely, a feature reduction approach and a segment-level Bag-of-Acoustic-Words (BoAW) model. In the feature reduction approach, the number of feature vectors is reduced by averaging across the consecutive frames within phonetic boundaries. Fewer feature vectors require fewer comparison operations, and hence DTW speeds up the search computation. In the BoAW model, we construct term frequency-inverse document frequency (tf-idf) vectors at the segment level to retrieve audio documents. The proposed segment-level BoAW model matches a test utterance with a query using tf-idf vectors, and the scores obtained are used to rank the test utterances. Both of these search space reduction approaches speed up the execution with a slight degradation in the search performance. We propose two-stage approaches for re-scoring the detection hypotheses with the help of acoustic features and detection sources. First, we explore several acoustic features to re-score the detection hypotheses. The second approach considers additional detection sources, such as the depth of the detection valley, term frequency, the Self-Similarity Matrix (SSM), Pseudo Relevance Feedback (PRF), and weighted mean features, with Gaussian and phonetic posteriorgrams. These two-stage approaches improve the detection performance by re-scoring the hypotheses of a single QbE-STD system.
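For reference, the standard subsequence DTW that such partial-matching schemes build on can be sketched as follows. This is a minimal pure-Python illustration with a user-supplied frame distance (in a QbE-STD system the frames would be posteriorgram vectors and the distance, e.g., a log-cosine dissimilarity); it is not the proposed modified algorithm, only the baseline it departs from, and all names are illustrative.

```python
def subsequence_dtw(query, utterance, dist):
    """Align the whole query against the best-matching region of the
    utterance. Returns (cost, end_index) of the best match, with the
    accumulated cost normalized by the query length."""
    m, n = len(query), len(utterance)
    INF = float("inf")
    # prev[j]: accumulated cost of aligning query[:i+1] ending at utterance[j];
    # row i = 0 allows the match to start anywhere in the utterance for free.
    prev = [dist(query[0], utterance[j]) for j in range(n)]
    for i in range(1, m):
        cur = [INF] * n
        cur[0] = prev[0] + dist(query[i], utterance[0])
        for j in range(1, n):
            # classical local constraints: vertical, diagonal, horizontal
            cur[j] = dist(query[i], utterance[j]) + min(prev[j], prev[j - 1], cur[j - 1])
        prev = cur
    end = min(range(n), key=lambda j: prev[j])  # match may end anywhere
    return prev[end] / m, end
```

Detection then thresholds the normalized cost; the O(M x N) cost per query-utterance pair is exactly what the feature reduction and BoAW stages described above aim to reduce.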
Finally, the thesis concludes by presenting a few miscellaneous studies, a summary of the entire thesis, and a few potential future research directions.
List of Acronyms
AKWS Acoustic Keyword Spotting
AM Acoustic Model
AP Average Precision
AR-BNF Articulatory Bottleneck Features
ASM Acoustic Segment Model
ASR Automatic Speech Recognition
BNF Bottleneck Features
BoAW Bag of Acoustic Words
BoW Bag of Words
BUT Brno University of Technology
CD-DNN Context-Dependent Deep Neural Network
CDTW Cumulative Dynamic Time Warping
CZ Czech Language
DBN Deep Belief Networks
DCT Discrete Cosine Transform
DET Detection Error Trade-off
Dev Development
DFS Depth-First Search
DFW Dynamic Frequency Warping
DLG Double-Layer neighborhood Graph
DNN Deep Neural Network
DP Dynamic Programming
DPGMM Dirichlet Process Gaussian Mixture Model
DTW Dynamic Time Warping
DTW-SS Dynamic Time Warping String Search
EHMM Ergodic Hidden Markov Model
EM-ML Expectation Maximization-Maximum Likelihood
EN English
Eval Evaluation
FA False Alarm
FBCCs Fourier-Bessel Cepstral Coefficients
G Gujarati Language
GBRBM Gaussian-Bernoulli Restricted Boltzmann Machines
GCC Gaussian Component Clustering
GMM Gaussian Mixture Model
GMM-FST Gaussian Mixture Model-Finite State Transducers
GP Gaussian Posteriorgram
GPUs Graphical Processing Units
GS Greedy Search
GSS Graph-based Similarity Search
HAC Hierarchical Agglomerative Clustering
HGSS Hierarchical Graph-based Similarity Search
HMM Hidden Markov Model
HTK Hidden Markov Model ToolKit
HU Hungarian Language
IO Input-Output
IPA International Phonetic Alphabet
IPAs Intelligent Personal Assistants
IR Information Retrieval
IR-DTW Information Retrieval Dynamic Time Warping
ISA Intrinsic Spectral Analysis
k-DR Degree Reduced k-nearest neighbors
k-NN k-nearest neighbors
KL Kullback-Leibler
KWS KeyWord Spotting
LB Lower Bound
LBG Linde-Buzo-Gray algorithm
LBP Local Binary Patterns
LLR Log-likelihood Ratio
LP Linear Prediction
LPCCs Linear Prediction Cepstral Coefficients
LPCs Linear Prediction Coefficients
LSH Locality Sensitive Hashing
LSTM Long Short Term Memory
M Marathi Language
MA Mandarin
MAP Mean Average Precision
MCA Minimum Cost of Alignment
MDL Minimum Description Length
MEX MATLAB EXecutable
MFCC-TMP Mel Frequency Cepstral Coefficients, where subband energy is computed via Teager Energy Operator considering the Magnitude and Phase spectra of subband signals
MFCCs Mel Frequency Cepstral Coefficients
MIR Music Information Retrieval
MLE Maximum Likelihood Estimation
MLP Multilayer Perceptron
MSC Multi-view Segment Clustering
MTWV Maximum Term Weighted Value
NMI Normalized Mutual Information
NTSS Note Taking Support System
OLN On-the-fly Length Normalization (Online Length Normalization)
OOV Out-of-Vocabulary
PAKWS Parallel Acoustic Keyword Spotting
PCA Principal Component Analysis
pdf Probability density function
PE Phonetic Engine
PER Phone Error Rate
PF Progressive Filter
PhnRec Phoneme Recognizer
PLP Perceptual Linear Prediction
PNCCs Power Normalized Cepstral Coefficients
PP Phonetic Posteriorgram
PRF Pseudo-Relevance Feedback
PSOLA Pitch Synchronous Overlap and Add
QbE-STD Query-by-Example Spoken Term Detection
QbH Query-by-Humming
QbSH Query-by-Singing/Humming
QUESST Query-by-Example Search on Speech Task
RAILS Randomized Acoustic Indexing and Logarithmic-time Search
RBM Restricted Boltzmann Machines
RMS Root Mean Square
RNN Recurrent Neural Network
RU Russian Language
SAD Speech Activity Detection
SAM Spectral Acoustic Model
SBNF Stacked Bottleneck Features
SDR Spoken Document Retrieval
SDTW Segmental Dynamic Time Warping
SLNDTW Segmentally Local Normalized Dynamic Time Warping
SMS Short Message Service
SSM Self-Similarity Matrix
STC-NN Split Temporal Context Neural Network
STD Spoken Term Detection
STM Spectral Transition Measure
SubDTW Subsequence Dynamic Time Warping
SVMs Support Vector Machines
SWS Spoken Web Search
TAM Temporal Acoustic Model
TEO Teager Energy Operator
TRAP TempoRAl Pattern
TWV Term Weighted Value
VAMFCC Variance of Acceleration of Mel Frequency Cepstral Coefficients
VQ Vector Quantization
WFST Weighted Finite State Transducer
List of Symbols
α VTLN warping factor
β Feature reduction factor
λinit Initial GMM model with parameters (µinit, Σinit, winit)
ot Observation vector at the t-th frame (t-th speech frame index)
Bk The k-th broad class
Cnxe Normalized Cross Entropy Cost
Cnxe^min Minimum Normalized Cross Entropy Cost
D Local distance matrix
Fn The n-th formant frequency for the uniform tube model of the vocal tract
K Number of broad phoneme classes
M Number of frames in the query
Mk Number of Gaussian components in the k-th broad phoneme class
Ms Total number of segments in the test utterance
N Number of frames in the test utterance
NC Number of contexts in BoAW
NS Number of QbE-STD systems used in discriminative calibration fusion
Np Number of classes in the posteriorgram (dimension of the posteriorgram)
Ns Total number of segments in the query
Nd Number of documents in the BoAW model
Nseg Number of consecutive acoustic segments considered in the tf computation of the BoAW model
O Acoustic observation sequence
P Path matrix that indicates the start frame
P(Bk|ot) Posterior probability of the k-th broad phoneme class given the t-th speech frame
P(Ck|ot) Posterior probability of the k-th cluster given the t-th speech frame
S Accumulated distance matrix
W Word transcription
Xα VTL-warped acoustic features with warping factor α
δ Pruning threshold in BoAW
ot The t-th observation or acoustic feature vector
ξi Fusion or calibration coefficients, estimated by binary logistic regression
d Document in the BoAW model
dB(k) Discrimination capability of the k-th broad phoneme class
idf Inverse Document Frequency
t Term in the BoAW model
tf Term Frequency
v Velocity of the sound wave (≈ 344 m/s at sea level)
List of Tables
2.1 QbE-STD submission in SWS MediaEval 2011 ...... 43
2.2 QbE-STD submission in SWS MediaEval 2012 ...... 45
2.3 QbE-STD submission in SWS MediaEval 2013 ...... 46
2.4 QbE-STD submission in QUESST MediaEval 2014 ...... 48
2.5 QbE-STD submission in MediaEval QUESST 2015 ...... 49
3.1 Statistics of MediaEval SWS 2013 database ...... 52
3.2 Statistics of MediaEval QUESST 2014 database ...... 52
3.3 Performance of SWS 2013 QbE-STD for various local constraints ...... 68
3.4 Performance of SWS 2013 QbE-STD for on-the-fly length normalization ...... 69
4.1 Effect of VTLN warping factor estimation on phoneme recognition performance ...... 81
4.2 Performance of score-level fusion for SWS 2013 QbE-STD ...... 87
4.3 Performance of VTL-warped Gaussian posteriorgram with reduced number of frames ...... 89
4.4 Performance of SWS 2013 QbE-STD with pooled data from QUESST 2014 ...... 90
4.5 Performance of DAEM on SWS 2013 QbE-STD ...... 91
4.6 Contribution of each broad phoneme class in discriminating different spoken words ...... 95
4.7 Performance of mixture of GMMs for SWS 2013 QbE-STD ...... 98
4.8 Performance of score-level fusion with mixture of GMMs for SWS 2013 QbE-STD ...... 98
5.1 Performance of T2 query using no partial matching and dist23 on QUESST 2014 Dev set ...... 104
5.2 Performance of T3 query using no partial matching and dist4 on QUESST 2014 Dev set ...... 106
5.3 The average time (minutes) required to execute search using feature reduction ...... 114
5.4 Effect of BoAW pruning on search performance for SWS 2013 Dev set ...... 124
5.5 Effect of BoAW pruning on search performance for SWS 2013 Eval set ...... 124
5.6 Time and space requirement for BoAW stage on Dev set of SWS 2013 using phonetic posteriorgram ...... 125
5.7 Performance of SWS 2013 for feature reduction on Dev set ...... 126
5.8 Performance of SWS 2013 for feature reduction on Eval set ...... 127
5.9 Time and space requirements for BoAW stage on SWS 2013 using Gaussian posteriorgram ...... 127
6.1 Acoustic features extracted using OpenSMILE ...... 135
6.2 Score-level fusion of several acoustic features and detection sources ...... 145
A.1 List of keywords used in TIMIT QbE-STD ...... 153
A.2 Performance of TIMIT QbE-STD systems for individual query example for VTL-warped Gaussian posteriorgram ...... 155
A.3 Performance of DAEM on Q-84 TIMIT QbE-STD. After [1]. ...... 158
A.4 Performance of proposed mixture of GMM (mixGP) posteriorgram for TIMIT QbE-STD task in terms of (a) p@N, and (b) MAP (with local constraint LC2 and 128 Gaussian components) ...... 159
B.1 Phoneme recognition performance of MFCC and PLP in classification of phonetic units for Gujarati and Marathi ...... 163
B.2 Performance of Phonetic Engine for SWS 2013 QbE-STD Task ...... 165
B.3 Performance for isolated queries truncated by different % number of frames from the beginning ...... 166
B.4 Experimental results of non-uniform frequency warping features for TIMIT QbE-STD task ...... 167
B.5 The list of queries used in Gujarati QbE-STD ...... 167
B.6 Experimental results for Gujarati QbE-STD ...... 168
E.1 TWV values for Dev and Eval sets ...... 173
List of Figures
1.1 Voice-enabled IPAs ...... 2
1.2 Block diagram representation of STD system ...... 4
1.3 Block diagram of QbE-STD system ...... 5
1.4 Block diagram representation of AKWS system ...... 7
1.5 Flowchart of the organization of the thesis ...... 10
2.1 A schematic of generic block diagram of QbE-STD system ...... 14
2.2 A graphical representation of hit, miss and false alarm ...... 17
2.3 A schematic of posteriorgram plot ...... 22
2.4 Phonetic posteriorgram examples ...... 23
2.5 Gaussian posteriorgram examples ...... 26
2.6 The schematic diagram for ASM computation ...... 28
2.7 The schematic flow diagram of DBN posteriorgram feature extraction ...... 29
2.8 An example of segmental DTW ...... 31
2.9 An illustration of KNN search with LB estimate to speed up SDTW ...... 33
2.10 Illustration of re-scoring using the pseudo relevance feedback ...... 39
2.11 Selected chronological progress during last few years in QbE-STD research problem ...... 43
3.1 Schematic block diagram of MFCC-TMP feature extraction ...... 54
3.2 Generation of warping path using subDTW and local constraints used in subDTW ...... 57
3.3 The graphical representation of matrices associated in subDTW operation ...... 58
3.4 The distribution of unnormalized and normalized scores ...... 63
3.5 Illustration of fusion for three QbE-STD systems ...... 65
3.6 Types of local constraints used ...... 68
3.7 Performance of various dissimilarity functions on SWS 2013 QbE-STD task ...... 69
4.1 Mel filterbank for different VTLN warping factors ...... 72
4.2 Plot of VTLN warping factor α for 20 speakers ...... 73
4.3 A schematic block diagram of proposed unsupervised GMM-based VTLN warping factor estimation and Gaussian posteriorgram feature extraction ...... 77
4.4 Estimated values of VTLN warping factor using two different methods, namely, HMM (supervised) and GMM (unsupervised) ...... 78
4.5 Log-likelihood values and change in VTLN warping factor α w.r.t. the iteration index ...... 80
4.6 Effect of number of utterances in GMM-based VTLN warping factor estimation ...... 81
4.7 Dissimilarity analysis with VTL-warped Gaussian posteriorgram ...... 83
4.8 Effect of the number of iterations for VTLN warping factor estimation on SWS 2013 QbE-STD performance ...... 84
4.9 Effect of the number of Gaussians on SWS 2013 QbE-STD performance ...... 85
4.10 Effect of local constraints (LC) on SWS 2013 QbE-STD systems ...... 86
4.11 Effect of the number of iterations for VTLN warping factor estimation on SWS 2013 QbE-STD performance with multiple examples of query ...... 87
4.12 Effect of reduced features in VTLN warping factor estimate on DET curve ...... 89
4.13 Values of annealing factor (ζ) at every iteration ...... 90
4.14 An illustration of mixture of GMMs as set of broad phoneme classes ...... 92
4.15 Plot of log-likelihood w.r.t. number of iterations ...... 95
4.16 Comparison between Gaussian posteriorgram and mixture of GMM posteriorgram ...... 96
4.17 Effect of amount of labeled data used for broad phoneme posterior probability computation on SWS 2013 QbE-STD ...... 99
5.1 Modified DTW for non-exact matching ...... 103
5.2 Performance of partial matching for QUESST 2014 Dev set ...... 107
5.3 Performance of partial matching for QUESST 2014 Eval set ...... 108
5.4 Results of QbE-STD for equal frame split and equal phone segment split on QUESST 2014 Dev set ...... 109
5.5 An illustrative diagram of feature reduction ...... 110
5.6 Phonetic boundary segmentation using STM ...... 111
5.7 Performance of phonetic segmentation using STM and BUT’s phonetic decoders ...... 113
5.8 Effect of feature reduction approach on performance of SWS 2013 ...... 114
5.9 Performance of QUESST 2014 on Dev set for feature reduction ...... 116
5.10 Performance of QUESST 2014 on Eval set for feature reduction ...... 117
5.11 A schematic block diagram of two-stage search space reduction using segment-level BoAW ...... 118
5.12 Recall values at various levels of pruning for different posteriorgrams ...... 122
5.13 Average recall values of SWS QbE-STD ...... 123
5.14 Recall performance of Gaussian posteriorgram representation ...... 125
5.15 Recall values at various levels of pruning for QUESST 2014 ...... 128
5.16 Performance of BoAW model for QbE-STD on QUESST 2014 Eval set ...... 129
6.1 A schematic block diagram of proposed two-stage QbE-STD search system using acoustic features ...... 132
6.2 Performance of two-stage QbE-STD systems using several acoustic features with different posteriorgrams for SWS 2013 Dev set ...... 136
6.3 A schematic block diagram of proposed two-stage QbE-STD search system using detection sources ...... 137
6.4 An illustration of depth of detection valley ...... 139
6.5 An illustration of Self-Similarity Matrix ...... 140
6.6 An illustration of average query with PRF ...... 141
6.7 An illustration of weighted mean representation of a query ...... 142
6.8 Performance of two-stage QbE-STD systems using several detection sources with different posteriorgrams for SWS 2013 Dev set ...... 143
A.1 Probability density function (pdf) of the difference between two VTLN warping factor estimates of shorter and longer utterances from the same speaker ...... 155
A.2 Effect of the number of Gaussians on Q-84 TIMIT QbE-STD performance ...... 156
A.3 Effect of local constraints on TIMIT QbE-STD ...... 157
A.5 Values of annealing factor (ζ) at every iteration ...... 157
A.4 Effect of number of iterations in iterative VTLN warping factor estimation on TIMIT QbE-STD ...... 158
A.6 Performance in MAP of mixture of GMM posteriorgram ...... 159
B.1 Places in the Gujarat and Maharashtra states where the speech data was collected ...... 162
B.2 Schematic block diagram for feature extraction of STCC and WLPC ...... 167
D.1 Flowchart of K-means vector quantization ...... 171
E.1 The selection of optimal threshold, θ, optimizing TWV ...... 173
F.1 Impact on TWV for different β ...... 176
CHAPTER 1 Introduction
1.1 Motivation
In the recent era, the Internet has become a very important communication medium for sharing information across the globe. The information available on the Internet is mostly in the form of text-based documents. Nowadays, a lot of information is also available in the form of multimedia. In particular, the spoken (audio) form of information is important, and it is stored in the form of audio and video recordings. This information will continue to grow in the future, for example, news broadcasts, YouTube videos, etc. It has become an inevitable need of society for knowledge and entertainment purposes. In addition, spoken information can be thought of as an alternative representation of written (text) information so far as human communication is concerned. In such a scenario, access and retrieval of audio information help us tremendously for various purposes [2]. Human speech is easily available, i.e., natural to produce, and provides easy access, which primarily does not require a physical interface. In addition, the interaction between a human and a machine does not necessarily require costly hardware, merely audio input-output (IO) devices (i.e., microphones and speakers), which are small in size and consume low power [3]. During the last decade, many technology companies have been involved in the design of Intelligent Personal Assistants (IPAs). IPAs are software agents that take input from the user in the form of audio and respond to the query. Thus, IPAs form interactive channels between the voice input from the user and the database system. The well-known IPAs available are Apple's Siri, Google Now, Microsoft Cortana, Amazon Echo, etc. [4]. Figure 1.1 (a) shows the chronological development of voice-enabled IPAs during the last few years. Recently, an open-source project, Sirius, was started to explore voice as well as image queries from the user [5]. The schematic of the open end-to-end Sirius system is shown in Figure 1.1 (b).
Figure 1.1: Voice-enabled IPAs: (a) the chronological development of voice-driven IPAs (Siri, Google Now, Cortana, Echo, and Sirius, 2011-2015). After [4]; (b) the end-to-end architecture of Sirius (Lucida). After [5].

Mobile devices, such as computers, mobile phones, and smart watches, are used to take input from the user. The response from the system is either displayed on the respective device or executed in the form of an action. Many mobile users prefer to stay connected wherever they are and whatever they are doing. For instance, to communicate via Short Message Service (SMS) through a mobile phone, a cab driver chooses to use voice rather than typing the SMS, i.e., a hands-free and even eyes-free mode of communication [6]. Spoken dialogue systems use state-of-the-art speech technologies, such as speech recognition, speech synthesis, and machine translation, to provide a human-machine interface through voice. However, there are challenges created by the poor performance of speech recognition due to noisy automobile ambiance as well as head movements of the driver [6]. Recent technological advancements allow recording and storing vast collections of speech or audio data with various contents. In addition, classroom lectures that are stored in audio form can be effectively browsed and accessed if one has a reliable audio retrieval mechanism 1. Furthermore, current speech technology allows effective access to speech data via telephone. Thus, it is important to seek automatic, reliable, and fast solutions to search large collections of speech data. The information retrieval associated with spoken content
1 A few online portals for online video lectures are: (1) http://nptel.ac.in/courses.php, (2) http://ocw.mit.edu/courses/audio-video-courses/, (3) https://www.ted.com/talks, etc.
has been an important area of research in spoken language processing. Spoken Content Retrieval (SCR) deals with information retrieval and processing of spoken media to provide access and control to the user. An SCR system focuses on the retrieval of the linguistic message as well as the semantics associated with the speech data. However, in this thesis, we focus only on retrieving the linguistic message present in the speech data. The SCR system aims to retrieve audio documents from a collection of a large number of documents and detect the location of the query. A detailed review of SCR is presented in [7, 8]. The query can be expressed in different forms, i.e., either text or spoken form.
1. Text-based query representation: The problem is known as Spoken Term Detection (STD).
2. Spoken-example query representation: The problem is known as Query-by-Example Spoken Term Detection (QbE-STD).
In this context, several applications that enhance speech technology for SCR have been explored. STD technology was used for an electronic note-taking support system (NTSS) [9]. The NTSS is equipped with a speech interface and Automatic Speech Recognition (ASR). The user can simply touch and trace the notes with their fingers and avoid the difficulty of preparing notes during a classroom lecture. The ease of access to speech also motivated annotating (tagging) and searching digital photographs or images [10, 11]. The organization of this chapter is as follows. Section 1.2 discusses SCR systems, namely, STD, keyword spotting (KWS), and QbE-STD. Section 1.3 discusses the focus of this thesis and its contributions. Section 1.4 presents the overall organization of the thesis.
1.2 Spoken Content Retrieval (SCR) Systems
1.2.1 Spoken Term Detection (STD)
An STD evaluation campaign was initiated by NIST, USA, in 2006 [12]. A term corresponds to a word or a sequence of words that is to be detected in the audio documents. The objective was to use speech technology, in particular, Automatic Speech Recognition (ASR), for audio retrieval using a text query. The state-of-the-art STD system is a cascade connection of an ASR system and a text-retrieval system [7]. An ASR system can be seen as a nonlinear transformation from the speech signal to a sequence of words [13].

Figure 1.2: Block diagram representation of an STD system. The red colored speech waveform inside the dashed box indicates the detection candidate. After [14].

The core ASR architecture has been constructed on the Bayesian probabilistic framework. In particular, given the acoustic observation sequence, $O = o_1, o_2, \cdots, o_n$, the objective of an ASR system is to determine the best word transcript, $\hat{W} = w_1, w_2, \cdots, w_m$, that maximizes the posterior probability, $P(W|O)$, as [13]:

$$\hat{W} = \underset{W}{\operatorname{argmax}}\{P(W|O)\} = \underset{W}{\operatorname{argmax}}\,\frac{P(O|W)P(W)}{P(O)}. \qquad (1.1)$$

The estimate is highly dependent on two models, namely, the acoustic model, $P(O|W)$, and the language model, $P(W)$. The ASR system decodes speech into a word transcript, and the term is searched in the decoded transcripts. The one-best hypothesis output can be erroneous, given the limited performance of ASR. To cope with ASR errors, the ASR system produces multiple word transcripts or an output in terms of a lattice [7, 15, 16]. The traditional STD framework for audio content retrieval is shown in Figure 1.2. The detection scores are transformed into confidence scores by employing score normalization (which takes into account that the scores of different terms have different dynamic ranges). The detection threshold is applied to filter the confidence scores and obtain the detection candidates, whose confidence scores are higher than the detection threshold [14]. The lattice information needs to be stored effectively for fast retrieval; the Weighted Finite State Transducer (WFST) is one way to represent a lattice for the STD task [17]. If a query contains Out-of-Vocabulary (OOV) words that do not appear in the ASR vocabulary, the ASR cannot generate an output for STD. To resolve this, the ASR system is modeled at the subword-level, i.e., with units smaller than the word, such as phonemes or syllables. Such a subword-based ASR generates subword sequences or lattices, and searching is then performed by a minimum edit distance criterion [15, 18, 19]. In order to search quickly using subword units, the output of the ASR has to be converted into an index representation. A few such indexing schemes are the suffix indexing scheme [20] and metric subspace indexing [21]. The STD system exploits a well-trained speech recognizer to convert speech utterances into word sequences or word lattices. In the case of an OOV word, the STD system generates subword-level, i.e., phoneme- or syllable-level, sequences. In order to perform matching, soft matching (approximation) is needed to overcome insertion, deletion, and substitution errors. This is also known as phonetic search. In the phonetic search approach, each text query is converted into a string of phonemes via a pronunciation dictionary.
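The soft matching described here can be made concrete with the classical dynamic programming recursion for the minimum edit (Levenshtein) distance between two phoneme strings. This is a generic textbook sketch, not the specific implementation of any system cited above:

```python
def edit_distance(ref, hyp):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn the phoneme sequence `ref` into `hyp`."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # deletions only
    for j in range(n + 1):
        d[0][j] = j                      # insertions only
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # (mis)match
    return d[m][n]

# One phoneme substituted by the recognizer -> distance 1:
print(edit_distance(["s", "p", "iy", "ch"], ["s", "b", "iy", "ch"]))  # 1
```

A query phoneme string can thus be matched against decoded subword sequences even when the recognizer makes a few insertion, deletion, or substitution errors.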
1.2.1.1 Research Issues in STD
The major research issues in STD are the need for abundant resources to build the ASR; efficient indexing and searching methods to speed up the search; and OOV words, which are very often part of the query. Speech in vernacular languages may contain frequent code-switching, where ASR systems cannot be useful [22].
1.2.2 Query-by-Example Spoken Term Detection (QbE-STD)
As discussed earlier, STD technology requires an ASR system (to generate the transcription) and text-retrieval (for content/message retrieval) [7]. However, the lack of resources in some languages (which are called under-resourced languages) makes ASR building infeasible, and SCR in such languages cannot be feasible without an ASR system. In addition, an ASR system suffers from OOV word recognition and hence, the STD system requires a subword (such as phoneme, syllable, etc.) recognizer. In this context, a study was conducted to exploit a spoken example and the phoneme transcription of an OOV query [23]. QbE-STD is important for low-resourced languages and under non-mainstream conditions, where ASR is not available or feasible to develop; hence, it has also been called unsupervised STD [7, 24, 25].
Figure 1.3: Block diagram of a QbE-STD system.
The architecture of a QbE-STD system is shown in Figure 1.3. The use of a spoken example enables audio retrieval in a multi-lingual scenario (i.e., where audio documents belong to more than one language). Inspired by this motivation, the MediaEval campaign commenced the Spoken Web Search (SWS) task in 2011 [26]. This task involved language-independent audio search for low-resourced languages and was held almost every year in MediaEval [26-30]. Since 2014, the SWS task has been renamed the Query-by-Example Search on Speech Task (QUESST) to better exhibit the fundamental characteristic of a QbE-STD system, i.e., the query in the form of a spoken example [29, 31]. More detailed reviews of QbE-STD systems are presented in Section 2.8 of Chapter 2.
1.2.2.1 Research Issues in QbE-STD
The major research issues in the QbE-STD framework are the audio representation and the complexity of the search algorithm.
• The audio representation should be speaker-invariant (i.e., invariant w.r.t. speaker variabilities). In addition, it should emphasize phonetic information rather than speaker and other paralinguistic characteristics.
• QbE-STD is very slow due to the computationally intensive Dynamic Programming (DP)-based algorithm operating on a frame-based representation. The matching task has to be executed fast, which demands indexing or search space reduction approaches that do not much affect the matching performance.
• For realistic scenarios, the QbE-STD system should consider non-exact variants of the query, such as variation of the query at the suffix or prefix, or in word ordering.
More detailed discussion on these research issues is presented in Section 2.7.
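To make the DP-based matching above concrete, the following is a minimal sketch of subsequence DTW between a query and a test utterance, assuming a cosine distance between frame vectors and free start/end points on the utterance axis; the local constraints and dissimilarity function used in practice may differ.

```python
import numpy as np

def subsequence_dtw(query, utterance):
    """Minimal subsequence DTW: the query may start and end anywhere
    in the utterance (free begin/end on the utterance axis).
    Returns (length-normalized cost, end frame of the best match)."""
    n, m = len(query), len(utterance)
    # Frame-level cosine distance matrix.
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    u = utterance / np.linalg.norm(utterance, axis=1, keepdims=True)
    dist = 1.0 - q @ u.T
    acc = np.full((n, m), np.inf)
    acc[0, :] = dist[0, :]                 # free start along the utterance
    for i in range(1, n):
        for j in range(1, m):
            acc[i, j] = dist[i, j] + min(acc[i - 1, j],      # vertical
                                         acc[i, j - 1],      # horizontal
                                         acc[i - 1, j - 1])  # diagonal
    end = int(np.argmin(acc[n - 1, :]))    # free end along the utterance
    return acc[n - 1, end] / n, end
```

Every query frame is compared against every utterance frame, which is exactly why frame-level DTW is expensive and motivates the search space reduction techniques discussed later.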
1.2.3 Keyword Spotting System
The Keyword Spotting (KWS) problem is slightly different from ASR. A keyword spotter locates a small set of words rather than the optimal word sequence as in ASR [32]. In a traditional keyword spotting framework, two acoustic models are used, namely, a keyword model for modeling the keyword and a background model to model the entire speech signal [19, 33]. The schematic block diagram of acoustic keyword spotting (AKWS) is shown in Figure 1.4. The likelihood ratio between the keyword model and the background model is used to compute the detection score. These detection scores are normalized (transformed into confidence scores) and filtered to obtain detection candidates whose confidence scores are higher than the detection threshold. In this query detection, the score is computed by taking the likelihood ratio between the phoneme sequence passing through the keyword sequence and an arbitrary phoneme loop (ergodic model). This model is applicable when a pre-defined set of keywords is available. This framework can be useful for both text and spoken query realizations. In the traditional STD scenario, a keyword is described by a phonemic sequence using a pronunciation dictionary, and Viterbi decoders can be used [23, 34].

Figure 1.4: Block diagram representation of an AKWS system. The red colored speech waveform inside the dashed box indicates the detection candidate. After [14].
1.2.3.1 Research Issues in KWS
The major limitation of AKWS is its applicability only to a fixed set of keywords; the entire system needs to be retrained for a new set of keywords. In addition, the KWS problem has not been attempted for the non-exact keyword matching task, where the objective is to detect a keyword having lexical variation at either the suffix or the prefix.
1.3 Focus and Contributions in the Thesis
The major focus of the present thesis is to exploit spoken queries and build a language-independent QbE-STD system. In this thesis, we present the speech representation and matching perspectives for the design of QbE-STD systems. A brief summary of the contributions in the thesis is as follows:
1.3.1 GMM Framework for VTLN
In this thesis, in order to make the query and document invariant w.r.t. the speaker, we exploit the Vocal Tract Length Normalization (VTLN) approach. In particular, we use the Gaussian posteriorgram of VTL-warped spectral features for the QbE-STD task. A novel use of the Gaussian Mixture Model (GMM) framework for VTLN warping factor estimation is presented. In particular, the presented GMM framework does not require phoneme-level transcription and hence, it can be useful for the unsupervised task. In the GMM-based VTLN warping factor estimation approach, initially, GMM parameters are estimated using unwarped features, i.e., having VTLN warping factor α = 1. Then, the optimal VTLN warping factors α̂ are estimated based on Maximum Likelihood Estimation (MLE). In the next cycle, GMM parameters are re-estimated with the help of the features having the optimal warping factor, i.e., α̂. The GMM framework, along with the proposed iterative variant, estimates the VTLN warping factor and captures the spectral scaling suitable for the speaker-independent scenario. We will discuss the VTL-warped Gaussian posteriorgram in Chapter 4.
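The ML grid search at the heart of such warping factor estimation can be sketched as follows. The GMM here is specified directly by its parameters, the `warp` callable is a hypothetical stand-in for spectral warping, and the whole example is a toy illustration rather than the thesis's implementation:

```python
import numpy as np

def gmm_loglik(X, weights, means, variances):
    """Average log-likelihood of frames X under a diagonal-covariance GMM."""
    ll = []
    for w, m, v in zip(weights, means, variances):
        ll.append(np.log(w)
                  - 0.5 * np.sum(np.log(2 * np.pi * v))
                  - 0.5 * np.sum((X - m) ** 2 / v, axis=1))
    # Log-sum-exp over components, then average over frames.
    return float(np.mean(np.logaddexp.reduce(np.stack(ll), axis=0)))

def estimate_alpha(feats, warp, gmm, alphas):
    """ML grid search: pick the warping factor that maximizes the
    GMM likelihood of the warped features."""
    scores = [gmm_loglik(warp(feats, a), *gmm) for a in alphas]
    return alphas[int(np.argmax(scores))]
```

In the iterative variant, the GMM would be re-estimated on features warped with the current best factors and the grid search repeated, rather than keeping the GMM fixed as here.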
1.3.2 Mixture of GMMs for Posteriorgram Design
Speech data exhibits acoustically similar broad phonetic structures. To capture the broad phonetic structure, we exploit supplementary knowledge of broad phoneme classes (such as vowels, semi-vowels, nasals, fricatives, and plosives) for the training of the GMM. The mixture of GMMs is tied with the GMMs of these broad phoneme classes. A GMM trained with no supervision assumes uniform priors for each Gaussian component, whereas a mixture of GMMs assigns the prior probabilities based on the broad phoneme classes. The novelty of this work lies in the prior probability assignments (as weights of the mixture of GMMs) for better Gaussian posteriorgram design. The proposed posterior features from the mixture of GMMs outperform the Gaussian posteriorgram because of the implicit constraints supplied by the broad phonetic posteriorgram. We will discuss the posteriorgram representation based on the mixture of GMMs in Chapter 4.
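As an illustration of the posteriorgram itself, the sketch below computes frame-wise component posteriors under a diagonal-covariance GMM; in the proposed mixture of GMMs, the component priors would be derived from broad phoneme class probabilities rather than being uniform. This is a simplified illustration, not the thesis code:

```python
import numpy as np

def log_gauss(X, mean, var):
    """Log density of a diagonal-covariance Gaussian, per frame."""
    return (-0.5 * np.sum(np.log(2 * np.pi * var))
            - 0.5 * np.sum((X - mean) ** 2 / var, axis=1))

def posteriorgram(X, weights, means, variances):
    """Frame-wise posterior probabilities over the K Gaussian components.

    `weights` plays the role of the component priors; replacing uniform
    weights with class-informed ones changes the resulting posteriors."""
    log_p = np.stack([np.log(w) + log_gauss(X, m, v)
                      for w, m, v in zip(weights, means, variances)])
    log_p -= np.logaddexp.reduce(log_p, axis=0)   # normalize per frame
    return np.exp(log_p).T                        # shape: (frames, K)
```

Each row of the returned matrix sums to one, so a speech file becomes a sequence of probability vectors that can be compared with a distance suitable for distributions.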
1.3.3 Partial Matching for Non-Exact Query Matching
In realistic scenarios, there is a need to retrieve a query that is not present exactly in the spoken documents; the appeared instance of the query might have a different suffix, prefix, or word order. The DTW algorithm monotonically aligns the two sequences, and hence, the conventional approach is not suitable for partial matching between the frame sequences of the query and the test utterance. The proposed modified approach does not require running Dynamic Time Warping (DTW) multiple times for each query and test utterance pair [2]. Non-exact query matching can be handled by considering four different cases, namely, forward partial match, reverse partial match, the query containing a filler, and the query having a reordered word sequence. In Chapter 5, we will discuss these partial matching strategies.
1.3.4 Feature Reduction Approach
To execute DTW faster, the average of consecutive features is considered without overlapping [35]. However, blind merging of features might average out the information in the vicinity of phonetic boundaries, since the posterior features on either side of a phone boundary exhibit different characteristics. Hence, a loss might be introduced due to the merging of feature vectors in the vicinity of phoneme boundaries. In the feature reduction approach, we merge consecutive feature vectors within phonetic segment boundaries and execute DTW with a reduced number of feature vectors. Thus, a smaller number of feature vectors reduces the computational cost by reducing the number of comparison operations, with a slight degradation in search performance. We will discuss the feature reduction-based search space reduction in Chapter 5.
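The segment-wise merging can be sketched as follows, assuming segment start indices are available from a phone-level segmentation (the boundary detector itself is not shown):

```python
import numpy as np

def reduce_features(feats, boundaries):
    """Average consecutive feature vectors within each segment.

    feats      : (T x D) feature matrix.
    boundaries : segment start indices, e.g. from a phone-level
                 segmentation; [0, 5, 12] splits the frames into
                 [0:5), [5:12), and [12:T)."""
    edges = list(boundaries) + [len(feats)]
    return np.stack([feats[a:b].mean(axis=0)
                     for a, b in zip(edges[:-1], edges[1:])])
```

DTW then runs over one averaged vector per segment instead of one vector per frame, so the number of distance computations drops roughly quadratically with the reduction ratio.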
1.3.5 Segment-Level Bag-of-Acoustic Words
Conventional QbE-STD systems are based on DTW, which is a computationally intensive algorithm, leading to a scalability issue. This study discusses a novel segment-level Bag-of-Acoustic-Words (BoAW) approach [2]. The objective is to use speech segmentation at the phone-level to build an inverted index. The inverted index representations of the spoken query and test utterances are used to reduce the search space. To restore the time information and perform query detection, a DTW search is executed on the selected test utterances. In Chapter 5, we will discuss the segment-level BoAW for search space reduction.
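The inverted-index idea can be sketched as follows, assuming each phone-level segment has already been quantized to an acoustic word id (the quantizer is not shown); utterances sharing many acoustic words with the query are selected for the subsequent DTW stage:

```python
from collections import defaultdict

def build_inverted_index(utterances):
    """utterances: dict utt_id -> sequence of acoustic word ids
    (e.g., labels of vector-quantized phone-level segments)."""
    index = defaultdict(set)
    for utt_id, words in utterances.items():
        for w in words:
            index[w].add(utt_id)
    return index

def candidate_utterances(index, query_words):
    """Rank utterances by the number of query acoustic words they share."""
    counts = defaultdict(int)
    for w in set(query_words):
        for utt_id in index.get(w, ()):
            counts[utt_id] += 1
    return sorted(counts, key=counts.get, reverse=True)
```

Only the top-ranked candidates are passed to DTW, which restores the frame-level timing information that the bag-of-words representation discards.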
1.3.6 Exploring Detection Sources and Multiple Acoustic Features
In addition, we discuss a two-stage zero-resource approach for QbE-STD [36] that exploits the detection candidates at the first level and several acoustic features. At the first stage, a subDTW search algorithm with Gaussian posteriorgram (GP) representation gives detection candidates. Then, several acoustic features and detection sources are used to re-score the detection hypotheses. The individual performances of the various detectors were found to be complementary. In a similar framework, we exploited several detection sources, such as the self-similarity matrix, the depth of the valley along the warping path in DTW, Term Frequency (TF), and the weighted mean representation, to improve the performance of the Gaussian posteriorgram and phonetic posteriorgram [37]. These additional cues are complementary to the posteriorgram representation and hence, give better performance than the posteriorgram alone.
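Score-level combination of such complementary sources can be sketched as a weighted sum of per-source z-normalized scores; the normalization and weights below are illustrative, not the exact scheme used in [36, 37]:

```python
import numpy as np

def fuse_scores(score_lists, weights):
    """Weighted score-level fusion after per-source z-normalization.

    score_lists : list of 1-D arrays, one per detection source, each
                  holding scores for the same detection candidates.
    weights     : illustrative fusion weights, one per source."""
    fused = np.zeros_like(np.asarray(score_lists[0], dtype=float))
    for s, w in zip(score_lists, weights):
        s = np.asarray(s, dtype=float)
        fused += w * (s - s.mean()) / s.std()   # z-norm each source
    return fused
```

Normalizing each source before fusion matters because raw scores from different detectors (DTW costs, likelihoods, term frequencies) live on very different dynamic ranges.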
1.4 Organization of the Thesis
The organization of the thesis is as follows.
Figure 1.5: Flowchart of the organization of the thesis: Chapter 1 (Introduction), Chapter 2 (Literature Survey), Chapter 3 (Experimental Setup), Chapters 4-6 (Representation Perspective, Matching Perspective, and Two-Stage Approach), and Chapter 7 (Summary and Conclusions).
• A literature survey of the QbE-STD problem is presented in Chapter 2. In particular, we will discuss the representation, matching, and detection subsystems in detail. The primary performance evaluation metrics used in QbE-STD and various research issues are also presented. We will discuss several MediaEval SWS and QUESST QbE-STD systems in brief. A more detailed literature survey can be found in [7, 22, 24].
• The experimental setup used in this thesis is discussed in Chapter 3. In particular, this chapter gives details of the databases used in the thesis, the posteriorgram representation, and the search and detection subsystems. In this chapter, we investigate the effect of local constraints and dissimilarity functions used in subsequence DTW.
• Chapter 4 discusses the representation perspective for the QbE-STD system. In particular, the VTL-warped Gaussian posteriorgram and the posteriorgram of a mixture of GMMs for QbE-STD are discussed. The objective is to design a speaker-invariant speech representation that emphasizes phonetic information rather than speaker-specific information.
• Chapter 5 discusses the search (matching) perspective of the QbE-STD system. In particular, the partial matching approach for non-exact query matching and search space reduction approaches are presented. The feature reduction-based and segment-level BoAW approaches are discussed as search space reduction approaches.
• Chapter 6 discusses the fusion of evidence from various acoustic features as well as detection sources, within the two-stage zero-resource approach. We observed that incorporating these detection sources and acoustic features improves the performance as compared to the posteriorgram alone.
• Finally, Chapter 7 presents the summary of the thesis along with limitations and research directions that can be explored in the future.
1.5 Chapter Summary
This chapter presented an introduction to the spoken content retrieval problem of retrieving audio documents. We briefly discussed spoken content retrieval systems, namely, STD, QbE-STD, and keyword spotting. In addition, we briefly discussed the key contributions in this thesis, namely, the GMM-based VTL-warped Gaussian posteriorgram, the posteriorgram from a mixture of GMMs, feature reduction and segment-level BoAW for search space reduction, partial matching for the non-exact query detection task, and the exploration of multiple detection sources and acoustic features. In the next chapter, the literature survey on the QbE-STD problem is discussed.
CHAPTER 2 Literature Survey
2.1 Introduction
In Chapter 1, we briefly discussed different Spoken Content Retrieval (SCR) frameworks, namely, Spoken Term Detection (STD), Query-by-Example Spoken Term Detection (QbE-STD), and keyword spotting (KWS). In addition, we also discussed the major contributions in the thesis. In this chapter, we will present a literature survey of the various methods and approaches for designing a QbE-STD system. The organization of this chapter is as follows: Section 2.2 discusses the motivation behind QbE-STD and the components of QbE-STD systems. Performance evaluation metrics are presented in Section 2.3. The details of the components of QbE-STD, such as the front-end subsystem, searching subsystem, and detection subsystem, are presented in Sections 2.4, 2.5, and 2.6, respectively. In Section 2.7, the research issues in QbE-STD systems are discussed. Brief details of the QbE-STD systems submitted to the MediaEval SWS and MediaEval QUESST evaluation campaigns are given in Section 2.8.
2.2 Motivation and Components of QbE-STD
The Query-by-Example (QbE) paradigm for STD has been introduced with two major motivations. One is the limitation of Automatic Speech Recognition (ASR) systems for Out-of-Vocabulary (OOV) words, as an alternative approach to a subword recognizer [38]. The other is to extend the speech retrieval task to low-resource scenarios, where ASR is not feasible due to the lack of a huge amount of transcribed speech data [39]. Recent technological developments in smartphones and digital devices allow a user to access information via spoken media. This technological development motivated a new research direction for spoken content retrieval, i.e., to retrieve the audio document via spoken queries, which is called Query-by-Example Spoken Term Detection (QbE-STD).
Figure 2.1: A generic schematic block diagram of a QbE-STD system. After [24].
Many information management tasks have adopted the QbE paradigm. Earlier, the QbE paradigm was utilized to execute the image retrieval task by computing the similarity between a query image and the images in a database [40]. The QbE framework was also presented in [41] for music information retrieval (MIR), to extract songs and other metadata, such as singer, album, date of recording, etc. Recently, an example-based approach was employed for a surgical activity detection task under a context-based information extraction paradigm [42]. The stacked autoencoder representation of the query was used with an asymmetrical subsequence Dynamic Time Warping (DTW) search algorithm [42]. Due to the diversified nature of QbE-STD, there are plenty of views and approaches regarding its solution. However, from a philosophical perspective, we may categorize the majority of the approaches into two broad categories, namely, low-resourced and zero-resourced, which are also referred to as supervised and unsupervised approaches, respectively. In the low-resourced category, acoustic models (from an acoustically close language) can be used to obtain a symbolic form of the acoustics. Furthermore, acoustic data adaptation and bootstrapping of an acoustic model can be performed to tune the parameters of the acoustic representation towards the audio documents. In the zero-resourced or unsupervised approach, no resources from rich-resourced languages are adapted. These are data-driven approaches, where the information about the acoustics is learned from the audio documents themselves. To generalize these categories, we may consider the general schematic block diagram shown in Figure 2.1. The basic components of a QbE-STD system are as follows [24]:
(i) Front-end subsystem: The role of the front-end subsystem is to represent the audio documents and the query as posteriorgrams (i.e., frame-based posterior vectors) or symbol-based linguistic units. The front-end also performs speech vs. non-speech detection to remove the silence present in the query.
(ii) Search subsystem: The major role of the search subsystem is to perform dynamic alignment between the query and audio document realizations to locate possible acoustic similarities. In addition, it also uses indexing to speed up the dynamic matching task.
(iii) Detection subsystem: This subsystem ranks the detection scores and performs the decision making. It also combines several pieces of evidence from multiple QbE-STD systems via effective score normalization and score-level fusion. In the pseudo-relevance feedback (PRF) scenario, the query located within the audio documents can be further utilized as a secondary query (i.e., pseudo-query). The acoustic representation from such a pseudo-query can be further used to search within the audio documents, and the associated detection scores are modified accordingly.
In the STD framework for SCR, the front-end subsystem corresponds to the ASR, whereas the search and detection subsystems perform the text-retrieval task on the ASR output using the text representation of a query to detect the query (keyword) (please refer to Figure 1.2). A more detailed aspect of each subsystem is discussed from the next section onwards. The research issues (which were discussed earlier in Chapter 1) are the acoustic representation, searching algorithm, search space reduction, search system combinations, etc.
2.3 Performance Evaluation Metrics
The evaluation of QbE-STD systems can be categorized into ranked and unranked evaluation [7]. The ranked evaluation displays the list of items based on their relevance w.r.t. the query. The unranked evaluation assesses the performance based on thresholding, i.e., the audio documents whose relevance scores are above a particular threshold are retrieved. The performance of QbE-STD systems has been evaluated using the following evaluation metrics [7, 43]:
(a) Precision@N: If N is the number of occurrences of the query present in the database (documents), precision@N (i.e., p@N) is defined as [43]:

$$p@N = \frac{N_{corr}}{N} \times 100, \qquad (2.1)$$

where $N_{corr}$ is the total number of correct occurrences found in the top N items and N is the total number of occurrences of the query. For an ideal search system, p@N should be 100 %, indicating that all top N items are hits (correct detections).
Consider a scenario where the query, q, appears 4 times in the documents and the top 10 potential matches are

(t, t, f, f, t, f, f, t, f, f),

where t and f correspond to a true occurrence and a false alarm, respectively. Here, the total number of correct occurrences in the top N = 4 items is 2 and hence, the precision at N is p@N = (2/4) × 100 = 50 %.

(b) Mean Average Precision (MAP): Average precision (AP) is defined as [7]:

$$AP(q) = \frac{\sum_{k=1}^{n} p_q(k)\, 1_q(k)}{N}, \qquad (2.2)$$

where $p_q(k)$ is the precision at k for the query q, n is the total number of audio documents, N is the total number of occurrences of the query, and $1_q(k)$ is an indicator function for the presence of the query q (i.e., a relevant document), which means $1_q(k) = 1$ if the k-th document is relevant, otherwise $1_q(k) = 0$. Mean Average Precision (MAP) is the mean value of average precision (AP) across different queries, i.e., $MAP = \frac{1}{Q}\sum_{q=1}^{Q} AP(q)$. Hence, AP shows the performance w.r.t. a single query, and MAP, being the mean of AP across different queries, indicates the overall system performance independent of the query. For the above scenario, $AP(q) = \frac{1}{4}\left(\frac{1}{1} + \frac{2}{2} + \frac{3}{5} + \frac{4}{8}\right) = 0.775$ (i.e., 77.50 %). For an ideal search system, MAP should be 100 %, indicating that all the relevant documents are retrieved and none of the irrelevant documents are retrieved.
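The worked example above can be verified with a few lines of code (a direct transcription of eqs. (2.1) and (2.2) for a single query):

```python
def precision_at_n(detections, n):
    """detections: ranked list of booleans (True = correct occurrence)."""
    return 100.0 * sum(detections[:n]) / n

def average_precision(detections, n_occurrences):
    """AP as in eq. (2.2): precision at each hit rank, averaged over
    all true occurrences of the query."""
    hits, total = 0, 0.0
    for k, correct in enumerate(detections, start=1):
        if correct:
            hits += 1
            total += hits / k      # precision at rank k
    return total / n_occurrences

ranked = [True, True, False, False, True, False, False, True, False, False]
print(precision_at_n(ranked, 4))     # 50.0
print(average_precision(ranked, 4))  # ≈ 0.775 (i.e., 77.50 %)
```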
(c) Maximum Term Weighted Value (MTWV): The Term Weighted Value (TWV) is a weighted linear combination of the false alarm probability $P_{fa}(q, \theta)$ and the miss probability $P_{miss}(q, \theta)$ at an operating threshold, θ, i.e.,

$$TWV(\theta) = 1 - \frac{1}{|Q|} \sum_{\forall q \in Q} \left( P_{miss}(q, \theta) + \beta\, P_{fa}(q, \theta) \right), \qquad (2.3)$$

where Q is the set of queries and β is a constant (which is set empirically to 66.66 for SWS 2013 and 12.49 for QUESST 2014 [29, 44]). The Maximum Term Weighted Value (MTWV) is the maximum value of TWV, obtained at an optimal threshold $\theta_{opt}$. The ideal system should give MTWV = 1, which corresponds to $P_{miss} = 0$ and $P_{fa} = 0$, i.e., all occurrences of the query are detected with no false acceptances. The graphical view of a miss, a false alarm (FA), and a hit is shown for a given query in Figure 2.2.
Figure 2.2: A graphical representation of hit, miss and false alarm (FA): (a) the detection within the ground truth (i.e., reference) that corresponds to HIT, (b) two detections within the ground truth that corresponds to one HIT and one FA, and (c) no detection within the ground truth corresponds to MISS. After [14, 45, 46].
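The threshold sweep behind MTWV can be sketched as follows, assuming the per-threshold miss and false alarm probabilities have already been averaged over the query set:

```python
import numpy as np

def mtwv(p_miss, p_fa, thresholds, beta=66.66):
    """Sweep the detection threshold and return (MTWV, theta_opt).

    p_miss, p_fa : arrays of query-averaged miss / false alarm
                   probabilities, one value per candidate threshold.
    beta         : miss vs. false-alarm trade-off constant
                   (66.66 for SWS 2013, per eq. (2.3))."""
    twv = 1.0 - (np.asarray(p_miss) + beta * np.asarray(p_fa))
    i = int(np.argmax(twv))
    return float(twv[i]), thresholds[i]
```

Because β is large, even a small false alarm rate is heavily penalized, which pushes the optimal operating point toward conservative thresholds.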
(d) Normalized cross-entropy $C_{nxe}$: It measures the fraction of information, w.r.t. the ground truth, that is not provided by the system scores [44]. The scores are considered as log-likelihood ratios (llr). A perfect system should give $C_{nxe} \approx 0$, indicating no randomness in the system detection; hence, it is a well-calibrated system, where target and non-target scores are far apart [47]. The empirical cross-entropy can be expressed as follows:

$$C_{xe} = \frac{1}{\log 2} \left[ \frac{p_{target}}{|T_{true}(S)|} \sum_{t \in T_{true}(S)} C_{log}(llr_t) + \frac{1 - p_{target}}{|T_{false}(S)|} \sum_{t \in T_{false}(S)} C_{log}(llr_t) \right], \qquad (2.4)$$

where $T_{true}(S)$ and $T_{false}(S)$ are the sets of target and non-target trials, respectively, and $C_{log}(llr_t)$ is the logarithm of the cost function. The empirical cross-entropy of the trivial system (i.e., always rejecting or always accepting the trials, whichever gives the lower cost) is the prior entropy, and it is given by [44]:

$$C_{xe}^{prior} = \frac{1}{\log 2} \left[ p_{target} \cdot \log \frac{1}{p_{target}} + (1 - p_{target}) \cdot \log \frac{1}{1 - p_{target}} \right]. \qquad (2.5)$$

From eq. (2.4) and eq. (2.5), the normalized cross-entropy is defined by [44]:

$$C_{nxe} = \frac{C_{xe}}{C_{xe}^{prior}}. \qquad (2.6)$$

To minimize the normalized cross-entropy, an affine transform is applied to the log-likelihood scores such that $\hat{llr} = \gamma \cdot llr + \delta$, where γ and δ are the calibration parameters, and the minimum normalized cross-entropy is given by [44]:

$$C_{nxe}^{min} = \min_{\gamma, \delta} \{ C_{nxe} \}. \qquad (2.7)$$
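A sketch of the computation is given below. The source leaves $C_{log}$ unspecified; the code assumes the log-likelihood-ratio cost commonly used with this metric, $\ln(1 + e^{\mp(llr + \operatorname{logit} p_{target})})$, so treat it as an illustration rather than the official QUESST scoring tool:

```python
import math

def c_nxe(target_llrs, nontarget_llrs, p_target):
    """Normalized cross-entropy (illustrative sketch of eqs. (2.4)-(2.6))."""
    prior_logit = math.log(p_target / (1.0 - p_target))
    # Assumed per-trial costs (not given explicitly in the source).
    c_tgt = sum(math.log1p(math.exp(-(l + prior_logit))) for l in target_llrs)
    c_non = sum(math.log1p(math.exp(l + prior_logit)) for l in nontarget_llrs)
    c_xe = (p_target * c_tgt / len(target_llrs)
            + (1.0 - p_target) * c_non / len(nontarget_llrs)) / math.log(2)
    # Prior (trivial-system) entropy, eq. (2.5).
    c_prior = (-p_target * math.log2(p_target)
               - (1.0 - p_target) * math.log2(1.0 - p_target))
    return c_xe / c_prior
```

With this cost, uninformative scores (all llr = 0 at $p_{target} = 0.5$) give $C_{nxe} = 1$, while confident, well-calibrated scores drive it toward 0, matching the interpretation in the text.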
(e) Recall: We use this metric to evaluate the performance of the BoAW model. Recall is defined as [43]:

$$Recall = \frac{N_{rel \wedge ret}}{N_{rel}}, \qquad (2.8)$$

where $N_{rel \wedge ret}$ is the total number of retrieved documents that are relevant to the query and $N_{rel}$ is the total number of relevant documents. The ideal value of recall is 1, indicating that all the relevant documents are retrieved.
(f) Detection Error Trade-off (DET) curve: The DET curve has been the de facto standard performance evaluation metric in the speaker recognition literature [48]. The plot shows the trade-off between the false alarm probability ($P_{fa}$) and the miss probability ($P_{miss}$) for various detection thresholds. In STD and QbE-STD, this evaluation metric was introduced in [12], showing the performance of the query detection task at each detection threshold.
2.4 Front-end Subsystem
The front-end subsystem is responsible for converting the acoustic signal into either a frame-based or a symbol-based representation. Typically, speech production knowledge is exploited to extract features from an acoustic realization of the spoken audio. The acoustic realization is converted into a parametric representation by signal-level feature extraction, which we refer to as the acoustic representation [49]. The acoustic representation has some speaker-specific characteristics, which need to be removed before searching audio in multi-speaker scenarios. To address this issue, the acoustic data is modeled and transformed into a posterior representation. In order to avoid the silence regions present in the spoken query, Speech Activity Detection (SAD) is performed.
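A toy energy-based SAD, often used as a simple baseline for removing silence, can be sketched as follows; the threshold is illustrative, and the SAD method actually used in any given front-end may differ:

```python
import numpy as np

def energy_sad(frames, threshold_db=-35.0):
    """Toy energy-based speech activity detection.

    frames       : (N x L) array of windowed signal frames.
    threshold_db : frames whose energy falls this far below the loudest
                   frame are marked as non-speech (illustrative value)."""
    energy = np.sum(frames ** 2, axis=1) + 1e-12   # avoid log of zero
    energy_db = 10.0 * np.log10(energy)
    return energy_db > (energy_db.max() + threshold_db)
```

The returned boolean mask selects the speech frames; only those frames are passed on to posteriorgram extraction and matching.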
2.4.1 Acoustic Representation
In this thesis, we refer to acoustic representation as the first-level parameterization, conventionally at short-time or segment-level (within a 20-30 ms interval). To characterize production and perception properties, linear prediction [50] and mel cepstrum [51], respectively, have been used as acoustic representations. Next, the various acoustic representations explored for the QbE-STD problem are briefly discussed:
• Mel Frequency Cepstral Coefficients (MFCCs) [24, 52]: The human hearing mechanism has a better frequency resolving capability in the lower frequency region than in the higher frequency region. To incorporate this into the acoustic characteristics, nonlinearly-spaced subband filters (on the mel frequency scale) are used. To extract MFCCs, the Discrete Cosine Transform (DCT) is applied onto the logarithm of the subband energies (that are estimated at the output of a mel filterbank having triangular-shaped subband filters).
• Linear Prediction Cepstral Coefficients (LPCCs) [53]: The LPCCs are the cepstrum representation of the Linear Prediction Coefficients (LPCs). LPC models the short-term power spectrum of speech using an autoregressive all-pole model. The formants of speech acoustics are enhanced by the all-pole model [50]. LPCs can capture the physiological characteristics of the human speech production mechanism [54].
• Perceptual Linear Prediction (PLP) [24, 52]: In Linear Prediction (LP) analysis, we fit an all-pole model spanning all possible frequency values. However, as discussed above, the human perception mechanism has a better frequency resolution in the lower frequency region than in the higher frequency region. Thus, human hearing can discriminate lower frequencies better than higher frequencies. In addition, hearing sensitivity is relatively higher in the middle frequency range (typically, 3100 Hz - 5000 Hz [55]). To incorporate these psychoacoustic observations of the human hearing process, the PLP feature set was devised [55, 56]. In general, the cepstral version of PLP is used to represent the acoustic data from a short-time speech signal.
• Frequency-domain Linear Prediction (FDLP) [24, 52]: The FDLP technique fits an all-pole model in the frequency-domain to obtain temporal envelopes. The DCT is used to get a real-valued frequency-domain representation [57]. It was found that FDLP captures the acoustical characteristics better under noisy and reverberant scenarios [58].
• Mel subband filters [59, 60]: Subband filter energies have been used in connection with the DNN and the Split Temporal Context Neural Network (STC-NN). The phoneme recognition systems developed by the Brno University of Technology (BUT) use mel subband energies to train the STC-NN for the Czech (CZ), Hungarian (HU), Russian (RU), and English (EN) languages [61]. These phoneme recognizers have been extensively used in low-resourced scenarios for QbE-STD. The spacing of the subbands in the frequency-domain is nonlinear, typically following the mel frequency scale to imitate the human perception process for hearing [51].
The features, such as MFCCs, PLP, and FDLP, have been used extensively as acoustic representations in QbE-STD systems. This might be because of their wide usage in various speech research activities and the availability of various software tools. In addition to the above mentioned acoustic representations, various other speech feature extraction schemes were used, such as the modulation spectrogram [62], modified group delay [34], MPEG-7 with low-level descriptors (such as audio spectrum centroid, spectral flux, audio spectrum spread, and audio spectrum envelope) [63], Power Normalized Cepstral Coefficients (PNCCs) [64], Local Binary Patterns (LBP) from spectrograms [65], etc. The design of a suitable feature extraction scheme for the QbE-STD task is an important research issue. Feature selection was performed using correlation and valley depth approaches [66, 67]. In particular, the objective was to select the better features from a bunch of features (namely, a sum of auditory spectra, zero-crossing rate, frame intensity, loudness, Root Mean Square (RMS) energy, log-energy, MFCCs, mel filterbank, PLP, etc.) that discriminate two different words and hence, aid the audio search task.
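To make the MFCC pipeline discussed above concrete, the following is a toy single-frame sketch: a triangular mel filterbank is applied to a power spectrum, followed by the logarithm and a DCT-II. The filterbank size, FFT length, and scaling are illustrative choices, not the exact parameterization used in the thesis:

```python
import numpy as np

def hz_to_mel(f):
    # Mel scale commonly used for the triangular filterbank spacing
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_from_power_spectrum(power_spec, sr, n_filters=26, n_ceps=13):
    """Toy MFCC: triangular mel filterbank -> log subband energies -> DCT-II.
    `power_spec` is one frame's magnitude-squared FFT (non-negative half)."""
    n_bins = len(power_spec)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bin_pts = np.floor((n_bins - 1) * mel_to_hz(mel_pts) / (sr / 2.0)).astype(int)
    fbank = np.zeros((n_filters, n_bins))
    for i in range(n_filters):
        l, c, r = bin_pts[i], bin_pts[i + 1], bin_pts[i + 2]
        for b in range(l, c):                 # rising edge of triangle
            fbank[i, b] = (b - l) / max(c - l, 1)
        for b in range(c, r):                 # falling edge of triangle
            fbank[i, b] = (r - b) / max(r - c, 1)
    log_e = np.log(fbank @ power_spec + 1e-10)  # log subband energies
    n = np.arange(n_filters)
    # DCT-II of the log energies gives the cepstral coefficients
    return np.array([np.sum(log_e * np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters)))
                     for k in range(n_ceps)])
```

A full front end would window the signal (e.g., 25 ms frames with 10 ms shift), compute the FFT per frame, and stack the per-frame coefficient vectors.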
2.4.2 Speech Activity Detection (SAD)
The Speech Activity Detection (SAD) task is accomplished with the use of short-term energy. SAD is useful to remove the silence regions present at the beginning and end of the spoken query. The top-hat algorithm is applied with a window duration of 100 ms, so that silence regions shorter than 100 ms are retained in order to preserve small phrase breaks [68]. Speech vs. non-speech classification was performed by unsupervised clustering of MFCC features [28]. The zero-frequency filtered signal was used, and non-speech regions having more than 300 ms duration are ignored [69]. The Variance of Acceleration of MFCC (VAMFCC) rule-based approach was used to separate non-speech regions [70–72]. A Schmitt-trigger-based technique was used to trim the start and end silence regions associated with the spoken query [73]. A Multilayer Perceptron (MLP) was used to train the phone posteriors, and the posterior associated with the non-speech class is used to perform speech vs. non-speech classification [59, 74–76]. To avoid false alarms, queries having less than 100 ms duration are ignored [74]. A Principal Component Analysis (PCA) transformation is applied onto the speech segment, and then a threshold is applied on the corresponding eigenvalues of the segmented speech to perform non-speech detection [71].
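The short-term-energy idea above can be sketched as follows. The threshold rule (a fraction of the peak frame energy) and the 100 ms minimum-duration pruning are simplified stand-ins for the published algorithms, not a reimplementation of any of them:

```python
def short_term_energy_sad(samples, sr, frame_ms=20, min_speech_ms=100,
                          threshold_ratio=0.1):
    """Toy short-term-energy SAD: a frame is speech when its energy exceeds
    a fraction of the peak frame energy; speech runs shorter than
    `min_speech_ms` are then discarded (cf. the 100 ms rule above).
    Returns one boolean per frame."""
    n = max(1, int(sr * frame_ms / 1000))
    energies = [sum(s * s for s in samples[i:i + n])
                for i in range(0, len(samples), n)]
    thr = threshold_ratio * max(energies)
    flags = [e >= thr for e in energies]
    min_frames = max(1, min_speech_ms // frame_ms)
    out, i = flags[:], 0
    while i < len(flags):
        if flags[i]:
            j = i
            while j < len(flags) and flags[j]:
                j += 1
            if j - i < min_frames:          # drop too-short speech runs
                for t in range(i, j):
                    out[t] = False
            i = j
        else:
            i += 1
    return out
```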
2.4.3 Posteriorgram Representation
The posteriorgram representation is a 2-dimensional (i.e., 2-D) plot of time (speech frame index) vs. posterior probabilities. Posteriorgrams are broadly categorized into two types, namely, supervised and unsupervised. The supervised posteriorgrams are computed using acoustic models trained on speech data and the associated transcription, whereas the unsupervised posteriorgrams are computed without trained acoustic models, using only speech data. For an observation sequence, O, having T acoustic feature vectors corresponding to segmental speech frames (namely, o1, o2, · · · , oT), the posteriorgram (PG) is defined as the sequence of posterior vectors [77]:
PG(O) = [PG(o1), PG(o2), ··· , PG(oT)] . (2.9)
Each posterior vector can be obtained by [77]:

PG(oi) = [P(C1|oi), P(C2|oi), · · · , P(CNp|oi)], (2.10)
where Cj represents the jth class, Np is the total number of classes, and PG(oi) is the posterior vector for the ith feature vector oi. Hence, the posteriorgram for the observation sequence, O, can be represented as an Np × T matrix. Figure 2.3 shows an example of a posteriorgram representation. Here, a class can be hypothesized as a Gaussian component [77], an acoustic segment [78], a Restricted Boltzmann Machine (RBM) unit [79, 80], or a phonetic unit [39] to compute the posteriorgram. The posteriorgram representation takes real values between 0 and 1 that correspond to posterior probabilities (please refer to Figure 2.3). In contrast, acoustic representations, such as MFCCs, PLPs, etc., are not sparse, taking multiple non-zero values for each speech frame.
2.4.3.1 Supervised Posteriorgram
Supervised posteriorgram represents an acoustic frame in terms of posterior probabilities w.r.t. phonetic units. For low-resourced scenarios, phoneme recognizers from rich-resourced languages have been used to characterize speech in terms of the phonetic symbols of the rich-resourced language. For low-resource languages, labeling the speech data is a challenging and erroneous task. Hence, a bootstrapping approach is employed, as discussed in [24]. The trained phoneme recognizer is used to compute the phonetic posteriorgram. Posterior features were exploited in template-based ASR, which was found to be comparable with HMM-based ASR. It was found that a very small number of templates perform effectively in the template-based ASR task, and the posteriorgram was found to be speaker-independent [81]. Different approaches were used to compute the posterior probability values in supervised posteriorgrams. For example, a Multilayer Perceptron (MLP) has been used to model the phonetic symbols [82]. Hence, the problem of acoustic representation can be posed as a phonetic-level frame classification problem. However, this needs frame-level phonetic labels to train an MLP. In low-resourced scenarios, the language of the phonetic recognizer might be different from that of the audio documents. It is assumed that the language of the low-resourced audio documents and query is phonetically closer to the rich-resourced language. Thus, with the help of an MLP trained on a rich-resourced language, speech can be represented in terms of MLP posteriors (that are assumed to be sound units and language-independent representations), which can be used as part of cross-lingual decoding (in such a case, the MLP need not be trained on the language of the speech) [74, 82]. The phoneme recognizer developed by the Brno University of Technology (BUT) has been extensively used in the QbE-STD task [59, 74]. These phoneme recognizers are based on the Split Temporal Context Neural Network (STC-NN), which is a Multilayer Perceptron (MLP) framework. This framework exploits longer temporal context from the left and the right side and
Figure 2.3: A schematic of the posteriorgram plot. The vertical axis shows the posterior probabilities P(C1|ot), . . . , P(CNp|ot), and the horizontal axis shows the frame index (time), o1, . . . , oT.
then merges them in the MLP framework [61]. Figure 2.4 shows phonetic posteriorgrams for the word intelligence spoken by two different persons. Figures 2.4 (a) and (b) show examples of phonetic posteriorgrams obtained with an English phonetic recognizer. The language (as well as the database, i.e., TIMIT) of the spoken query and the recognizer is the same. Thus, it can be observed that the posteriorgram plots in (a) and (b) are very similar. Figures 2.4 (c) and (d) show cross-lingual posteriorgrams, where the language of the phonetic recognizer (i.e., Czech (CZ)) is different from the language of the spoken query (i.e., American English). Due to this language mismatch condition, the posteriorgram plots shown in Figures 2.4 (c) and (d) are not as similar as those in Figures 2.4 (a) and (b). Thus, in a low-resource scenario, a foreign language phoneme recognizer can be useful to represent the audio. The objective of using a foreign language posteriorgram is to characterize the acoustic events of the audio, not to correlate this characterization with linguistic units, such as phonemes. With this objective, many researchers have used cross-lingual foreign recognizers for QbE-STD [2, 44, 83, 84].
Figure 2.4: Phonetic posteriorgram examples for the spoken word ‘intelligence’. Posteriorgram with English phoneme recognizer: (a) male and (b) female speakers. Posteriorgram with Czech (CZ) phoneme recognizer: (c) male and (d) female speakers.
Recently, articulatory features (from the speech production viewpoint), which have the advantage of being language-independent, were used for the QbE-STD task [85, 86].
The articulatory posteriorgram was computed by taking articulatory labels instead of phonetic labels as the output. Three different MLPs were used to model broad vowel classes, the place of articulation for consonants, and the manner of articulation for consonants, inspired from IPA charts [86]. The suggested lower-dimensional articulatory posteriorgram gave an improvement of 2.87 % in terms of p@N as compared to the phonemic posteriorgram representation. In MLP construction, bottleneck features (BNF) from the middle hidden layers were also investigated for the QbE-STD task. The GP computed from articulatory BNF and FDLP features was found to be a superior posterior representation to the GP of FDLP features [85]. Furthermore, it was found that a lesser amount of training data (about 30 minutes) for training BNF gave better performance than the phonetic posteriorgram and the Gaussian posteriorgram, indicating the convergence property of articulatory features [85]. In the cross-lingual QbE-STD framework, a phonetic recognizer from one language is used to represent the spoken data of another language. The study conducted in [67] focuses on the issues related to the cross-lingual QbE-STD task. The phonetic posteriorgram trained on one language may not phonetically cover the target language (i.e., the language of the audio documents and query). In such a scenario, a few phonetic symbols might have redundant behavior, whereas a few symbols might act as nuisance [67]. Thus, the study reported in [67] suggested a feature selection approach by optimizing the DTW distance criteria and, interestingly, observed that the selected features gave better performance than the phoneme posteriorgram.
The study presented in [80] shows that deep architectures do not always help in performance improvement. In fact, the performance of QbE-STD saturates using only a single hidden layer in the DNN. Furthermore, it was observed that a DNN trained with only 30 % of the labels from TIMIT data could perform as well as with the entire training data [80]. A study was conducted on a Tibetan corpus for QbE-STD using Chinese BNF [87]. The experimental results show a 6 % improvement in F1 score (F-score) over PLP features. The query detection was conducted using a keyword model and a garbage (background) model within the GMM/HMM framework.
Knowledge-based information was used to design posteriorgram features [88]. The knowledge about speech sound units is integrated in terms of binary distinctive features. Earlier, distinctive features were found to be effective for the event-based ASR task using landmark detection [89]. Each distinctive feature can be trained using a discriminative training approach. To that effect, Support Vector Machines (SVMs) were utilized to perform binary classification for each distinctive feature. This approach significantly improved the performance of QbE-STD as compared to other MLP phonetic posteriorgrams [88]. Recently, phone-time boundaries and data augmentation techniques were exploited in the training of posteriorgrams for QbE-STD [60, 90]. The data augmentation technique produces a large amount of training data having acoustical variations, obtained by convolving the speech signal with simulated room impulse responses and additive noise [60]. The motivation behind data augmentation is to generate more degraded speech in the training phase in order to have a better match with degraded test conditions. The partial matching approach suggested in [60] took phone boundaries into consideration as a guiding tool for the warping path of subDTW. Interestingly, data augmentation with stacked BNF neither hurts the performance nor gives any substantial improvement [90]. The reason might be that QUESST 2014 does not have artificial noise and reverberation in the audio data.
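Whatever classifier produces the per-frame scores (an MLP, a DNN, or SVM outputs converted to probabilities), the supervised posteriorgram itself is simply one probability vector per frame, as in eq. (2.10). A minimal sketch, assuming hypothetical per-frame classifier logits as input:

```python
import math

def softmax_posteriorgram(logits):
    """Turn per-frame classifier outputs (logits) into a posteriorgram:
    one softmax-normalized posterior probability vector per frame.
    `logits` is a T x Np list of lists (hypothetical MLP outputs)."""
    pg = []
    for frame in logits:
        m = max(frame)                            # for numerical stability
        exps = [math.exp(x - m) for x in frame]
        s = sum(exps)
        pg.append([e / s for e in exps])
    return pg
```

Each row of the result lies between 0 and 1 and sums to 1, matching the Np × T posteriorgram matrix described in Section 2.4.3.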
2.4.3.2 Unsupervised Posteriorgram
Unsupervised approaches for speech have been a popular area of research. In that context, the Zero Resource Speech Challenge 2015 (Zero-speech) was held at INTERSPEECH 2015 [91]. The primary objective of the challenge was to learn the subword units directly from a raw speech signal. The major motivation behind the task is to exploit the language acquisition performed by an infant. During the early infancy period, an infant gathers and processes speech and develops an acoustic and language model in an unsupervised manner [92]. Our traditional speech systems, in particular, the ASR system, use a huge amount of acoustic and linguistic data. The researchers are keen to exploit these additional cues to improve the ASR performance. An unsupervised posteriorgram characterizes each feature vector by computing the posterior probability without any resources, using features only. Such approaches are also referred to as zero-resource. A few unsupervised modeling techniques have been used for QbE-STD, such as the Gaussian posteriorgram [25, 77], the Acoustic Segment Model (ASM) posteriorgram [78, 93–96], and the Restricted Boltzmann Machine (RBM) posteriorgram [79, 80]. Next, we will describe these unsupervised posteriorgrams in brief.
• Gaussian Posteriorgram: The Gaussian posteriorgram has been extensively used for the QbE-STD task because it is easy to train and fewer parameters are required for tuning as compared to the ASM and RBM. Another advantage is that the GMM can be easily adapted and used to initialize deep learning networks, such as Deep Belief Networks (DBN) [80]. The posterior probability P(Ck|ot) (for the kth cluster and the tth speech frame index) of the Gaussian Posteriorgram (GP) can be computed as follows:
P(Ck|ot) = P(ot|Ck)P(Ck) / P(ot) (2.11)
= πk N(ot; µk, Σk) / ∑_{j=1}^{Np} πj N(ot; µj, Σj), (2.12)
where Np is the number of GMM components and {πk, µk, Σk}, k = 1, . . . , Np, are the weights, mean vectors, and covariance matrices of the GMM components. The weighted mean of the Gaussian posteriorgram representation was explored for QbE-STD. A weighted mean representation is computed from a linear combination of the mean vectors of the GMM, whose weights are the posterior values associated with each posterior component. A smoothing operation on the Gaussian posteriorgram was conducted to spread the posterior values across different components and frames. The reconstructed representation from the smoothed Gaussian posteriorgram was found to be more efficient than the earlier weighted mean representation [97]. GMM posteriorgram-based QbE-STD was used for a Telugu broadcast news dataset in [98]. Figure 2.5 shows the Gaussian posteriorgram for the word intelligence spoken by two different persons. As compared to the phonetic posteriorgrams (shown in Figures 2.4 (a) and (b)), the Gaussian posteriorgrams (shown in Figure 2.5) are not as similar to each other, which may be due to the unsupervised nature of the GMM, i.e., it is obtained without any supervision (or transcription).
Figure 2.5: Gaussian posteriorgrams examples for the spoken word ‘intelligence’. (a) male speaker and (b) female speaker.
The binary version of the posteriorgram, called BinaryGrams, was used in zero-resource audio matching [68, 99]. In this representation, posterior probabilities are mapped to binary values. The advantage of the BinaryGrams representation is lesser storage and computational requirements, employing Boolean logical distance computation [68, 99]. The human vocal tract system can be assumed to be a cascaded connection of linear cylindrical-shaped acoustic tubes or organ pipes (which are nothing but resonators corresponding to the formants). The Bessel functions are the solutions of the cylindrical wave equation. With that motivation, Fourier-Bessel Cepstral Coefficients (FBCCs) were used for QbE-STD [100]. FBCCs are expected to give a better speech representation than the MFCCs (where the frequency-domain is characterized by sinusoidal basis functions). FBCC gave 7 % (i.e., 0.07) absolute improvement over the MFCCs for the SWS 2012 QbE-STD task using the Gaussian posteriorgram [100]. The spectral and temporal acoustic models were linked for the QbE-STD task [101, 102]. In particular, the GMM of the spectral acoustic model is initialized with the TIMIT phonetic ground truth. The 39 phones of TIMIT were modeled as 4-component GMMs, resulting in a 156-dimensional posterior representation. The Kullback-Leibler (KL) divergence w.r.t. each Gaussian of the GMM was used to measure the dissimilarity between posterior vectors. The temporal acoustic model (TAM) captures long temporal information of duration 150 ms taken from the MFCC representation. The combined spectral and temporal acoustic model gave 69 % improvement in MTWV as compared to the baseline approach. Several modifications of GMM training were proposed, and their effectiveness was studied for the QbE-STD task. An Expectation Maximization-Maximum Likelihood (EM-ML)-based approach was used to train the GMM. In this study, K-means is trained (which creates a hard assignment) and then a Gaussian distribution is fitted onto each cluster [99]. For better initialization, the GMM-based acoustic model trained on a high-resourced language was adapted for training on a low-resourced language [101]. An MLP is trained with either articulatory classes or phoneme labels, and the BNF from the middle layer were used.
The BNF with articulatory class labels at the output were used with acoustic features to train a GMM, and the GP derived from them outperformed the phonetic posteriorgram [85]. Intrinsic Spectral Analysis (ISA) has been found to give a speaker-invariant as well as phonetically distinctive representation in an unsupervised method [103, 104]. The experimental results on TIMIT QbE-STD show that ISA features gave a relative improvement of 13.5 % over the Gaussian posteriorgram, when temporal information is used in ISA [103].

• Acoustic Segment Model (ASM): The unsupervised model developed using GMM does not take sequential information into account. To incorporate temporal dynamics across the frames, ASM was proposed, where a segmentation and labeling-based approach was employed. ASM characterizes the temporal dynamics via the HMM framework [78, 93, 95, 96, 105]. Figure 2.6
shows the steps for computing the ASM. To initiate HMM training, the segmentation and label assignment tasks are performed by the Hierarchical Agglomerative Clustering (HAC) algorithm [106]. To label each acoustic segment, K-means clustering [78, 93, 105], Gaussian Component Clustering (GCC), and Multi-view Segment Clustering (MSC) were explored [96]. The initial labels are used to train the ASM and decode new labels. Again, the ASM is trained with these newly generated labels, and this process iterates till convergence in label assignment is achieved. Conventionally, the number of ASM clusters was selected based on tuning on the development set [78]. However, the Minimum Description Length (MDL) criterion was used along with the likelihood to select an optimum number of ASM components [93]. In the unsupervised ASM framework, a few studies discussed multi-level acoustic, phonetic, and temporal granularity (resolution) [107, 108]. In these studies, the phonetic, acoustic, and temporal resolutions were adjusted by varying the number of classes (i.e., ASM units), the number of Gaussian components, and the number of states in the HMM. To perform QbE-STD, the KL-divergence between the states of the HMM (in terms of a variational approximation) was used, and DTW was used to align the sequences [108]. Furthermore, the content consistency across the multiple recognized hypotheses was inferred and re-labelling was conducted. This re-labelling scheme was found to be more effective than the one without re-labelling [107].

• Restricted Boltzmann Machine (RBM): The RBM was explored to represent the audio signal in QbE-STD. The Gaussian-Bernoulli RBM (GBRBM) was used to learn the distribution of feature vectors [79]. The joint energy for the GBRBM is defined as follows [79]:
E(v, h) = ∑_{i∈V} (vi − bi)² / (2σi²) − ∑_{i∈V, j∈H} (vi/σi) wij hj − ∑_{j∈H} hj cj, (2.13)
where vi and hj represent the ith component of the visible layer and the jth component of the hidden layer, respectively; wij is the weight associated with visible unit vi and hidden unit hj; bi is the bias of visible unit vi; cj is the bias of hidden unit hj; and σi is the standard deviation associated with visible unit vi. The sigmoid activation obtained at the first hidden layer was used as a posterior
Figure 2.6: The schematic diagram for ASM computation. After [78, 96].
Figure 2.7: The schematic flow diagram of DBN posteriorgram feature extraction. After [109].
representation. The major motivation behind the GBRBM was the non-Gaussian distribution of the features. In the GP, we generally assume a diagonal covariance, which may not be correct for different acoustic features. Since the GBRBM posteriorgram considers this fact, it is expected to give better performance than the GP. The GBRBM posteriorgram gave 12 % improvement in p@N as compared to the MFCC representation. The Deep Belief Network (DBN), a stack of RBMs, was used for posteriorgram building [80, 109]. DBN posteriorgrams were used to tune the Gaussian posteriorgram by assigning a class to each feature based on the values attained by the Gaussian posteriorgram. This unsupervised framework improves the search performance over the Gaussian posteriorgram representation [80]. The procedure of DBN posteriorgram feature extraction is shown in Figure 2.7.
Recently, an autoencoder-based approach was used to learn the subword units in the speech signal. The autoencoder-based representation was found to be more efficient than the GMM posteriorgram for a spoken query classification task [110]. Recently, the Dirichlet Process Gaussian Mixture Model (DPGMM) was used in connection with a DNN for audio representation [111]. The DPGMM has the ability to learn and adapt a suitable number of components rather than keeping it fixed, as in the case of GMM. The DPGMM is used to label the speech data in an unsupervised manner. The low-dimensional BNFs were computed in the DNN framework. The performance of the unsupervised BNF was found to be comparable with the supervised BNF. Score-level fusion further gave a 10 % relative improvement over the supervised BNF.
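Among the unsupervised representations above, the Gaussian posteriorgram of eqs. (2.11)-(2.12) is the simplest to sketch. The following assumes a GMM with diagonal covariances has already been trained (the training itself, e.g., by EM, is not shown):

```python
import numpy as np

def gaussian_posteriorgram(features, weights, means, variances):
    """Gaussian posteriorgram per eqs. (2.11)-(2.12): each frame is mapped
    to its vector of GMM component posteriors. Diagonal covariances assumed.
    features: (T, D); weights: (Np,); means, variances: (Np, D)."""
    T, _ = features.shape
    Np = len(weights)
    log_post = np.empty((T, Np))
    for k in range(Np):
        diff = features - means[k]
        log_lik = -0.5 * (np.sum(diff * diff / variances[k], axis=1)
                          + np.sum(np.log(2 * np.pi * variances[k])))
        log_post[:, k] = np.log(weights[k]) + log_lik
    log_post -= log_post.max(axis=1, keepdims=True)  # numerical stability
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)    # rows sum to 1
```

Stacking the T posterior vectors column-wise gives exactly the Np × T posteriorgram matrix of Section 2.4.3.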
2.4.4 Symbol-based Representation
Along with the frame-level posteriorgram representation, symbol-level representation was also used in QbE-STD. Phoneme decoders from trained recognizers [69, 112, 113] or unsupervised ASM trained under zero-resource scenarios [93, 108] were used to produce the symbolic representation. A well-trained phoneme recognizer and International Phonetic Alphabet (IPA) symbols were used to represent spoken documents and queries [114]. The phoneme labels obtained by Romanian phoneme recognition are mapped to the respective IPA symbols. The Dynamic Time Warping String Search (DTW-SS) approach performs alignment on a smaller part of the test utterance and the query. DTW-SS operates at the symbol-level, and hence, it is scalable and practically feasible in terms of computation as compared to the frame-based approaches. Given spoken data, the unsupervised HMM training and labeling tasks can be executed in an iterative manner. Initially, the spoken data are transcribed as W0 using a segmentation and labeling approach [115]. The iterative model training and decoding steps are executed as follows:
λt = argmax_{λ} P(O|λ, Wt−1), (2.14)

Wt = argmax_{W} P(O|λt, W). (2.15)
HMM models are trained using the initial transcripts (labels) W0, as in eq. (2.14), and then new transcriptions (labels) are obtained using the trained HMM, as in eq. (2.15). For every iteration t, the process is repeated (as in eq. (2.14) and eq. (2.15)) till there is no change between the old transcript Wt−1 and the new transcript Wt. In this Section, we studied the front-end subsystem of the QbE-STD system, where each audio waveform is converted into an equivalent frame-based posterior representation. This sequence of posterior representations (i.e., the posteriorgram) is used to execute matching between the audio documents and the spoken query. In the next Section, we discuss variants of the search algorithms used to perform audio matching.
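The alternation of eqs. (2.14) and (2.15) is a fixed-point iteration over the transcripts. A generic sketch follows, where `train` and `decode` are placeholder callables standing in for HMM parameter estimation and Viterbi decoding (they are assumptions, not part of any specific toolkit):

```python
def iterative_train_decode(observations, init_labels, train, decode, max_iters=20):
    """Sketch of eqs. (2.14)-(2.15): alternate model training and re-decoding
    until the transcripts stop changing (convergence in label assignment)."""
    labels = init_labels
    model = None
    for _ in range(max_iters):
        model = train(observations, labels)       # eq. (2.14)
        new_labels = decode(observations, model)  # eq. (2.15)
        if new_labels == labels:                  # W_t == W_{t-1}: converged
            break
        labels = new_labels
    return model, labels
```

With real HMMs, `train` would maximize P(O|λ, W) (e.g., by Baum-Welch) and `decode` would return the most likely label sequence under the trained model.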
2.5 Searching Subsystem
In this Section, we will discuss the search algorithms used for the QbE-STD task for frame-based and symbol-based representations.

2.5.1 Search Subsystem for Frame-based Representation

The same query word spoken by the same or a different speaker does not have the same temporal alignment (i.e., due to speaking rate variation, individual speakers may or may not prolong a few vowels in any word), and hence, a linear time normalization approach may not be suitable for the alignment task. A nonlinear time normalization scheme was proposed in [116]. This nonlinear DP-based technique has been the basis of the majority of popular search algorithms used in the QbE-STD problem, which are variants of DTW. This Section discusses some variants of DTW.

(A) Segmental DTW (SDTW): SDTW is a popular search technique, proposed for spoken term discovery [117]. In SDTW, constraints are applied by considering the length of a segment to be nearly equal to the length of the query. DTW is performed for each segment [77, 117] to obtain an alignment cost. The number of comparison operations to execute the search task for a query having length N frames and a test utterance having length M frames is of the order of O(MN²), which is a very time-complex operation. The number of segments in SDTW is proportional to the length of the test utterance. In some cases, the warping window was set in proportion to the number of segments of the query [118]. Figure 2.8 shows three segments of SDTW.

Figure 2.8: An example of segmental DTW with adjustment window length 2. Three segments are shown. After [25, 77, 109, 119].

(B) Subsequence DTW (SubDTW): SubDTW is used to find a subsequence of the query in the reference. Here, the endpoint constraint of classical DTW is relaxed, and hence, the local distance is accumulated only once during a single warping path construction [120]. In SubDTW, DTW is executed only once (unlike the multiple DTW operations in the SDTW algorithm). The number of comparison operations to execute the search task for a query
having length N frames and a reference having length M frames is of the order of O(MN), which is far lesser than O(MN²). Few modifications in the SubDTW algorithm were suggested, mainly in the distance accumulation procedure.

(C) Other variants of DTW: SDTW and SubDTW have been extensively used in QbE-STD; however, a few research studies considered different variants of DTW w.r.t. different local constraints or distance computations [53, 121, 122]. Non-segmental DTW (NSDTW) accumulates the distances, which are then normalized by the total number of hops required to reach the current position from the starting frame index [53]. Slope-constrained DTW (SC-DTW) was proposed to overcome high speaking rate distortion between two audio signals [121]. In SC-DTW, local slope constraints are adjusted between 0.5 and 2, i.e., the warping path is adjusted such that no frame of the query is aligned to more than two frames of the test utterance and vice-versa [25, 121]. Cumulative Dynamic Time Warping (CDTW) was proposed to introduce a softmax operation in the distance accumulation [122]. The detection score is obtained through a logistic function. The parameters of the logistic function can be derived using gradient-based optimization, where the development data is used as training data.
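The relaxed start/end constraint of SubDTW can be sketched with the standard O(MN) recursion below. The frame-level distance (Euclidean here) and the absence of path-length normalization are simplifying assumptions, not the exact design of any cited system:

```python
import numpy as np

def subsequence_dtw(query, doc):
    """SubDTW sketch: the query (N frames) may start and end anywhere in the
    document (M frames). Returns the best matching end frame (1-indexed)
    and its accumulated distance. Cost is O(MN)."""
    N, M = len(query), len(doc)
    D = np.full((N + 1, M + 1), np.inf)
    D[0, :] = 0.0                       # free start: query may begin anywhere
    for i in range(1, N + 1):
        for j in range(1, M + 1):
            c = np.linalg.norm(query[i - 1] - doc[j - 1])  # local distance
            D[i, j] = c + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    end = int(np.argmin(D[N, 1:])) + 1  # free end: best end frame in document
    return end, D[N, end]
```

Backtracking from `end` through the minima of `D` would recover the detected region and warping path; practical systems also normalize the accumulated distance by the path length.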
2.5.1.1 Computational Improvement for Frame-based Approaches
The major issue with DTW-based approaches is their high computational requirement. A simple, straightforward solution is to use advanced computing platforms and parallel computing environments. Several such attempts were made, distributing the computational load across multiple CPU cores [123] for acoustic pattern discovery, as well as using Graphical Processing Units (GPUs) [119, 124] for QbE-STD. In order to speed up the search, various improvements were suggested, which are as follows:
• Lower bound estimate: A lower bound (LB) is used as a computationally cheap approximation to the exact DTW distance [125]. For the query Q and the segment of an utterance US, the LB and DTW are related as follows:
LB(Q, US) ≤ DTW(Q, US). (2.16)
The idea is to compute the computationally cheaper LB values for each possible segment of an utterance before the execution of DTW. The query and test segment pairs, along with their LB values, are ranked and stored in the queue, PQ. The top of the queue points to the lowest LB value among all possible segments. The algorithm pops off the segment UStop at the top of the queue and
Figure 2.9: An illustration of KNN search with the LB estimate to speed up SDTW. The LHS indicates the priority queue PQ and the RHS indicates the result list RL. After [109].
execute DTW, i.e., DTW(Q, UStop ). The result of DTW is stored into the result list RL. If result list RL does not contain the utterance associated with segment
UStop , then RL is updated, else the better segment having lower DTW is kept th in RL. The k best match is set to DTW(Q, UStop ). Again, the top of the queue PQ is popped off and compared against the kth best and the list RL is updated. Figure 2.9 shows the working of K nearest neighbors (KNN) search with LB es- timate. Thus, KNN DTW method uses the lower bound technique to remove the unnecessary DTW operations [126]. In addition, tighter lower bound for DTW was also proposed in [127] that can be used for any distance unlike inner product distance as proposed in [126].
• Segment-based approach: In this approach, a sequence of homogeneous frames is efficiently represented as a segment. Thus, it can reduce the number of comparison operations and speed up the searching task at little cost [121]. An integrated segment- and frame-based approach was presented for QbE-STD [121], which performs two passes to detect the query. The first pass performs segment-based DTW and locates the possible regions that hypothesize the presence of the query. The distance between the possible regions and the query is then recalculated using frame-level DTW. The detection performance is improved by 2.0 % in terms of MAP, and the CPU time is reduced by 45.2 % as compared to the frame-based DTW [121].
• Syllable-based segmentation: The study conducted in [128] presents a fast approach to QbE-STD via a two-pass approach. In the first pass, new query instances are derived using the actual query. In the second pass, the new instances along with the actual query are used, and normalized DTW distances are computed. The DTW alignment distances between syllable segments of a query and parts of the test utterance are used in the normalization process. The query detection task is conducted by segmental DTW, where the search is executed only at syllable boundaries; hence, it reduces the search computational complexity by a factor of nine compared to the segmental DTW approaches [128].
• Fast DTW: Inspired by the fast DTW approach [35], a study was conducted to speed up NS-DTW by feature reduction [53]. The number of comparison operations to execute the search task for the query (having length N frames) and the reference (having length M frames) is of the order of O(MN/β²), where β > 1 is the feature reduction factor. This means that, theoretically, the search complexity reduces in a quadratic manner. Unbounded-DTW was proposed in [129] to reduce the high computational cost. In this approach, possible alignment points are defined that are used to search time-warped matches. This results in a reduction of the exhaustive computational matrix. Start-end alignment points are found using a forward-backward DP algorithm.
• Information Retrieval DTW (IR-DTW): IR-DTW was proposed to match two subsequences of a test utterance and a spoken query in QbE-STD [130]. IR-DTW does not involve dynamic alignment; it sequentially computes the similarity measure on an entire distance matrix. Hence, IR-DTW has a smaller memory footprint than subDTW. In addition, IR-DTW offers indexing for searching, which makes this approach feasible for large-scale subsequence matching scenarios. One of the limitations of IR-DTW was the requirement of an exhaustive search over all the test utterances for a given query.
• Randomized Algorithm for Fast Template Matching: For computationally efficient DTW, randomized algorithms are used.
Locality Sensitive Hashing (LSH) is used to approximate the sparse cosine similarity [131]. LSH converts the high-dimensional raw speech representation into low-dimensional signature bits, so that the Hamming distance can be exploited as an approximation to some distance metrics in the feature space. The Randomized Acoustic Indexing and Logarithmic-time Search (RAILS) algorithm is rooted in randomized hashing and nearest neighbor search operations [132]. RAILS makes the large-scale QbE-STD task feasible by combining a stage of search-space reduction using an approximate distance with an application of segmental DTW for further re-scoring [132].
• Bag of Acoustic Words (BoAW) Approach: To reduce the search space, BoAW is proposed in [133]. This method is derived from the Bag of Words (BoW) method used in the text retrieval literature [43, 134]. In this study, acoustic words are associated with the components of the Gaussian posteriorgram. In BoAW, a spoken document is represented as unordered acoustic units, such as framewise phonetic contents, syllables, or spoken lexical words. Each document is represented as a histogram, or frequency count, of acoustic words. Acoustic words are indexed using the inverted indexing method [133]. In the BoAW method, the temporal information in the speech signal is lost. Hence, DTW-based re-scoring is applied to the detections of the BoAW inverted indexing method. The idea of BoAW was extended to bi-grams and tri-grams for indexing [135].
• Graph-based similarity search: To speed up QbE-STD, several Graph-based Similarity Search (GSS) approaches were proposed, where the indexing graph is prepared offline and query detection is performed on the selected utterances [136–138]. Degree-Reduced k-nearest neighbor (k-DR) graphs are constructed from the GP representation. The GSS is a combination of Greedy Search (GS) and Depth-First Search (DFS). The GSS looks for the graph index that is similar to the query and generates selected detection candidates with scores [136]. The DTW is further performed on these selected candidates. The GSS approach executes the search 28 times faster than the LB-based approach, keeping almost the same p@N [136]. An extension of k-DR, i.e., hierarchical k-DR, was proposed to perform Hierarchical Graph-based Similarity Search (HGSS). The HGSS reduces the CPU execution time by about 40 % compared to the GSS, with almost the same precision [137]. A Double-Layer neighborhood Graph (DLG) index search method was proposed, which has two distinct layers, namely, an upper layer (analogous to an express highway) and a base layer (analogous to general roads). Both layers have different degree-reduced k-nearest neighbors. Thus, the vertices on the base layer are posteriorgram segments that act as general roads, and the top layer has representative vertices that serve as an express highway. DLG reduces the CPU execution time by 40 % and more than 60 % as compared to the HGSS and GSS, respectively [138].
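The BoAW indexing idea above can be sketched as follows, assuming each frame has already been quantized to an acoustic-word ID (e.g., the index of the top-scoring Gaussian component per frame). The function names are illustrative, and the DTW-based re-scoring of the returned candidates is omitted:

```python
from collections import Counter, defaultdict

def build_index(documents):
    """documents: {doc_id: list of acoustic-word IDs}. Returns per-document
    histograms and an inverted index mapping each acoustic word to the set
    of documents that contain it."""
    histograms = {d: Counter(words) for d, words in documents.items()}
    inverted = defaultdict(set)
    for d, hist in histograms.items():
        for w in hist:
            inverted[w].add(d)
    return histograms, inverted

def candidate_documents(query_words, histograms, inverted):
    """Rank documents by histogram intersection with the query; only documents
    sharing at least one acoustic word are touched (the search-space
    reduction). Temporal order is ignored here, which is why DTW-based
    re-scoring follows in practice."""
    q_hist = Counter(query_words)
    candidates = set().union(*(inverted[w] for w in q_hist if w in inverted))
    scores = {d: sum(min(q_hist[w], histograms[d][w]) for w in q_hist)
              for d in candidates}
    return sorted(scores, key=scores.get, reverse=True)
```

Only the inverted-index lookup touches the full collection; the expensive frame-level matching is then restricted to the short ranked candidate list.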
2.5.2 Search Subsystem for Symbol-based Representation
Symbol-based approaches convert the query and the spoken document into a sequence of symbols. To do this, low-resourced approaches are employed to obtain phoneme symbols [139] or articulatory labels [69] from the spoken document and the query. However, a few studies used zero-resourced approaches and exploited ASM to obtain symbols [93, 108]. There are two major approaches, namely, DTW on strings and Viterbi alignment.
• DTW: Similar to frame-based approaches, DTW was also explored on the symbol-based representation. Various approaches used different numbers of phone sets in order to handle the language mismatch between the trained recognizer and the audio data. In addition, the hyperphone, i.e., a broad phoneme symbol-based approach, was used for indexing and searching [18]. The cost of alignment between the symbols was based on either flat binary costs [69], data-driven costs (i.e., derived from a confusion matrix [107, 140]), or linguistically-motivated costs [18, 112]. In data-driven approaches, the phoneme confusion matrix is normalized, and the costs of insertion, substitution, and deletion are computed [140]. To compute the dissimilarities between two symbols, the KL-divergence averaged over all the states of the HMM was used in [107]. To compensate for the phoneme (symbol) decoding errors, N-best lists or lattices were used. However, this technique introduces a speed vs. performance trade-off.
• Viterbi Decoding: A keyword-background model-based AKWS approach has been popular for the query detection task [33]. In the AKWS model, each query is modeled as an HMM (i.e., query HMM). Then, the likelihood difference between two HMMs is considered, namely, staying in the background HMM versus passing through the query HMM sequence. However, this technique was designed for a pre-defined set of keywords (queries). In order to use this approach with spoken queries, a phoneme sequence for each query is generated, and then the likelihood difference is calculated [30, 113]. Under the ASM framework, the entire audio is converted into ASM states, and a Viterbi algorithm is applied to compute the alignment cost. Furthermore, to introduce the slope constraint in query alignment, Duration Constraint Viterbi (DC-Vite) was proposed [93]. In DC-Vite, the alignment of frames to an HMM state is restricted by a number of frames per state, which is analogous to SC-DTW at frame-level matching [25].
The unsupervised HMM has three degrees of freedom, namely, the number of models, the number of states in each model, and the number of Gaussians in each state. These are referred to as phonetic granularity, temporal granularity, and acoustic granularity, respectively, in the study presented in [108]. The unsupervised HMM converts audio documents and the spoken query into a sequence of unsupervised HMM labels. For TIMIT QbE-STD, this approach gave a 16.16 % improvement in MAP as compared to the DTW-based approach.
• Weighted Finite State Transducer (WFST): The lattice representation can be effectively transformed into a WFST, and the query audio is represented in terms of a Weighted Finite State Acceptor [23]. WFST in the QbE framework usually converts the phonetic symbols of the lattice into a transducer, where time information is associated with a vertex of the graph and the likelihood is associated with a link. Later, the weight-pushing algorithm transforms the weights (likelihoods) into posterior probabilities [141]. The query detection is performed by composing the query transducer with the utterance transducer. The resulting transducer gives all the possible occurrences of the query within the utterance [142].
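As an illustration of the symbol-level alignment costs discussed above, a weighted edit distance with pluggable substitution, insertion, and deletion costs can be sketched as follows. This is a minimal sketch; in the data-driven case, the cost tables would be derived from a normalized phoneme confusion matrix as in [140]:

```python
def symbol_match_cost(query, reference, sub_cost, ins_cost, del_cost):
    """Weighted edit distance between two symbol strings. sub_cost[a][b] is
    the data-driven substitution cost (e.g., derived from the normalized
    phoneme confusion probability); ins_cost/del_cost map a symbol to its
    insertion/deletion cost."""
    n, m = len(query), len(reference)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + del_cost[query[i - 1]]
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + ins_cost[reference[j - 1]]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j - 1] + sub_cost[query[i - 1]][reference[j - 1]],
                D[i - 1][j] + del_cost[query[i - 1]],
                D[i][j - 1] + ins_cost[reference[j - 1]],
            )
    return D[n][m]
```

With flat binary costs (substitution cost 0 for identical symbols and 1 otherwise, insertion and deletion cost 1), this reduces to the ordinary Levenshtein distance.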
2.6 Detection Subsystem
In this Section, we will discuss the detection subsystem, which performs decision making on the output of the search algorithm. In addition, the detection subsystem combines evidence from multiple sources, i.e., multiple query realizations and multiple subsystems.
2.6.1 Score Normalization
The scores associated with each query detection might be different due to the length and the content of the query. The QbE-STD system deals with single- or multi-word queries, and the amount of silence present differs in each query. For these reasons, the scores from each query have different distributions [24]. Typically, the DTW algorithm accumulates the distances between frame sequences, and the scores taken from DTW distances follow a Gaussian distribution having different means and variances. In order to use a common threshold for the scores corresponding to all the queries, the scores per query are normalized to have zero mean and unit variance (q-norm). The concept of q-norm has been applied in many SWS systems [76, 118, 143, 144]. The query normalization problem was formulated with a logistic function, and the parameters were estimated using the development set in [122]. A novel m-norm was proposed, where the mode (maximum) score is subtracted and the standard deviation is divided out for each query [113].
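The q-norm step can be sketched as follows. This is a minimal sketch; in practice, the per-query statistics are estimated over all the detection scores produced for that query:

```python
import math

def q_norm(scores_per_query):
    """Normalize the detection scores of each query to zero mean and unit
    variance, so that a single global threshold can be applied across
    queries with different lengths and contents."""
    normalized = {}
    for query_id, scores in scores_per_query.items():
        mean = sum(scores) / len(scores)
        var = sum((s - mean) ** 2 for s in scores) / len(scores)
        std = math.sqrt(var) or 1.0  # guard against constant scores
        normalized[query_id] = [(s - mean) / std for s in scores]
    return normalized
```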
2.6.2 Query Selection
The selection of a query out of multiple examples of a query (i.e., query realizations) is very important. The query selection and the combination of multiple examples were analyzed in [59]. Various strategies have been investigated to effectively select the example. Each spoken example may exhibit different temporal dynamics. The sharpness of the phonetic posteriorgram can be quantified in terms of self dot products and cross-entropy. In addition, to incorporate the relative behavior of a query example w.r.t. other examples, pairwise DTW can be employed. After ranking, queries are fused with each other by alignment. Multiple examples are fused in terms of a graphical keyword model in the ergodic HMM (EHMM) framework [144].
Posteriorgram stability, reliability, and local similarity were found to be effective metrics to select the best query [60]. It was observed that the query selected via these metrics gave better precision than score-level fusion and alignment onto the longest query. The query selection schemes used in [59] and [60] were designed for the phonetic posteriorgram. However, the longest-query alignment method proposed in [74] applies to all kinds of representations. A computationally simple scheme is to consider the longest query [74]. The average query representation is formed by the alignment of the other queries onto the longest query. This technique is computationally cheaper because only a single average example is used for the searching task. Here, the longer-duration query is used as a base, and each feature from the other queries is aligned onto the base query to produce an average query. The average query X^{avg} for two different spoken queries X = [x_1, x_2, ..., x_{L_X}] and Y = [y_1, y_2, ..., y_{L_Y}] can be computed as follows [74]:

X_t^{avg} = (1 / (1 + |S_t|)) * ( x_t + Σ_{v ∈ S_t} v ),   (2.17)

where S_t is the set of all the features of Y that are aligned to the t-th frame of X (i.e., x_t) and 1 ≤ t ≤ L_X. In this thesis, we will use this approach for constructing the single average query example (please refer to sub-Section 4.2.5.3).
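A minimal sketch of Eq. (2.17), assuming the frame alignment between Y and the base query X is given as a list of (t, u) index pairs (e.g., the DTW warping path); the variable names are illustrative:

```python
def average_query(X, Y, alignment):
    """Implements Eq. (2.17): X is the longest query (list of frame vectors),
    Y is a shorter example, and alignment is a list of (t, u) pairs meaning
    frame y_u of Y is aligned to frame x_t of X."""
    S = {t: [] for t in range(len(X))}  # S_t: frames of Y aligned to x_t
    for t, u in alignment:
        S[t].append(Y[u])
    avg = []
    for t, x_t in enumerate(X):
        aligned = S[t]
        # X_t^avg = (x_t + sum of aligned frames) / (1 + |S_t|)
        summed = [x_t[d] + sum(v[d] for v in aligned) for d in range(len(x_t))]
        avg.append([s / (1 + len(aligned)) for s in summed])
    return avg
```

A frame of X with no aligned counterpart in Y is simply carried over unchanged, since |S_t| = 0 there.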
2.6.3 Pseudo Relevance Feedback (PRF)
Inspired by the Information Retrieval (IR) literature, several studies in the QbE-STD literature exploited the concept of relevance feedback. In the relevance feedback scenario, the detection candidates (i.e., the parts of spoken audio detected) at the first few hits are assumed to be close to the query. Hence, these detected parts of spoken audio may be treated as queries (referred to as pseudo-queries or pseudo-relevant examples), and the search can be repeated with these queries. This approach is known as pseudo-relevance feedback (PRF) [93, 121, 145]. The scores obtained using the pseudo queries are merged with the scores obtained using the actual query. However, the weight associated with a pseudo query is kept low as compared to the weight of the actual query, because the pseudo query might be a false detection (i.e., a wrong detection by the actual query). Thus, the lower weight associated with a pseudo query might not affect the performance severely. Figure 2.10 illustrates how pseudo-relevance feedback performs re-scoring of the detection candidates. From Figure 2.10, it can be seen that the top N ranked detection candidates are treated as pseudo-query examples; a lower-ranked (i.e., better) pseudo-query example is more similar to the query than a higher-ranked one. All the DTW distances between the pseudo queries and the detection candidates are used for re-scoring. This is achieved by taking a linear combination of the DTW distances. Since the relevance of a higher-ranked pseudo query is lower, its DTW distance is assigned a lower weight. In Figure 2.10, the weights are set as w0 > w1 > ··· > wN. More details of pseudo-relevance feedback are given in [25]. We will use pseudo-relevant queries to re-score the detection hypotheses (please refer to sub-Section 6.3.4).
Figure 2.10: Illustration of re-scoring using pseudo relevance feedback; the top N detection candidates are treated as pseudo query examples. The relevance factors are w0, w1, w2, ··· , wN, where wi > wj, ∀i < j. After [25].
PRF can be employed as a second-pass search system along with a first pass, such as a segment-based search system, whose execution is faster than DTW [93, 121]. In addition to pseudo-relevant examples, the concept of pseudo-irrelevant examples was used in the ASM framework [93]. A pseudo-irrelevant example corresponds to a detection that has the least similarity to the query. In this study, a query HMM (i.e., ΛR) and an anti-query HMM (i.e., ΛI) were built using the set of pseudo-relevant examples R and pseudo-irrelevant examples I. The likelihood ratio w.r.t. the two models was computed in order to score the hypothesis. The Self-Similarity Matrix (SSM) along with DTW scores were combined to improve the DTW detection score [146]. DTW computes the effective distance to align the frame sequences, whereas the SSM takes into account the distance between the entire spoken segment of the query and the detected part in the test audio. In the model-based approach, relevant and irrelevant examples are used to model the query and anti-query HMMs, respectively. An absolute improvement of 11.8 % was obtained with almost a 50 % reduction in computational cost as compared to SDTW on a Mandarin broadcast news corpus [93, 105]. For the query detection task, an image processing-based approach was used in [147]. In this approach, the distance matrix of a query and a test utterance is processed with a series of morphological operations.
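The re-scoring step of Figure 2.10 can be sketched as a weighted linear combination of DTW distances. This is a minimal sketch; the weights w0 > w1 > ... > wN are illustrative and would be tuned on development data:

```python
def prf_rescore(dtw_to_query, dtw_to_pseudo, weights):
    """Re-score each detection candidate as a weighted sum of its DTW
    distance to the actual query (weight w0) and to the top-N pseudo
    queries (weights w1..wN, with w0 > w1 > ... > wN as in Figure 2.10).

    dtw_to_query:  list with one DTW distance per candidate.
    dtw_to_pseudo: list of lists; dtw_to_pseudo[i][c] is the distance
                   between the rank-(i+1) pseudo query and candidate c.
    """
    new_scores = []
    for c, d_query in enumerate(dtw_to_query):
        score = weights[0] * d_query
        for i, row in enumerate(dtw_to_pseudo):
            score += weights[i + 1] * row[c]  # lower weight for higher rank
        new_scores.append(score)
    return new_scores
```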
2.6.4 Non-Exact Query Matching
The MediaEval QUESST 2014 dataset contains different types of queries that have variations at the suffix, the prefix, or in word order. We performed a modified DTW search to locate such occurrences of a query in the audio document. These kinds of variations in the spoken query retrieval task require one to perform approximate/partial matching rather than only exact search. Given that no prior information about the query type is provided with the QUESST 2014 database, the search algorithm should be general and incorporate all possible variations in the query matching task. A symbol-based search approach was adopted in [60, 148], where the phone sequence from query decoding was split to perform partial matching. In order to detect a non-exact type of query (i.e., variations at the suffix, prefix, or in word order), the DTW algorithm needs to be modified [83]. This technique considers different partial matches and takes the harmonic mean of the DTW distances from these partial matches. We considered a slightly modified version of [83] with a smaller number of backtracking operations. In the next part of the chapter, we address the issue of computation related to the search task. To that effect, we suggest two approaches, namely, feature reduction and a segment-level BoAW model.
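The combination of partial-match distances via the harmonic mean can be sketched as follows; the distances themselves would come from the modified DTW of [83], and this sketch shows only the combination step:

```python
def combine_partial_distances(partial_distances):
    """Harmonic mean of the DTW distances from the different partial matches
    (e.g., full match, prefix-only, suffix-only). Smaller is better; the
    harmonic mean is dominated by the best (smallest) partial-match
    distance, so one good partial match yields a good combined score.
    Assumes strictly positive distances."""
    n = len(partial_distances)
    return n / sum(1.0 / d for d in partial_distances)
```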
2.6.5 Calibration and Fusion of Multiple Search Systems
The discriminative calibration using logistic regression has been extensively used to fuse various heterogeneous QbE-STD systems [62]. To improve the fusion score and provide better calibration, additional resources (such as the length of a query, the number of non-silence frames, and language identification (LID) scores) were used [149]. Given multiple examples of the same query, a reliable warping path on the test utterance is evaluated, where most of the minimum DTW warping paths overlap. The DTW scores from such reliable warping paths are taken into the fusion procedure, and the rest are discarded [77].
Parallel DTW distance matrix fusion was proposed to combine various frame-level representations [150]. In this distance matrix fusion, DTW is executed once for all the systems, as opposed to multiple executions of DTW per system; hence, this approach is computationally less complex. The DTW local distance matrix combination gave a 4.78 % relative MAP improvement over the score-level fusion approach [150]. In the next Section, we present various research issues in the QbE-STD task.
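The distance-matrix fusion of [150] discussed above can be sketched as an element-wise average of the per-system local distance matrices, after which DTW is executed only once. This is a minimal sketch; the uniform weights are an illustrative default:

```python
def fuse_distance_matrices(matrices, weights=None):
    """Element-wise (weighted) average of the local distance matrices
    computed from different frame-level representations. DTW is then run
    once on the fused matrix instead of once per system."""
    n_sys = len(matrices)
    if weights is None:
        weights = [1.0 / n_sys] * n_sys  # uniform weights by default
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [[sum(w * M[i][j] for w, M in zip(weights, matrices))
             for j in range(cols)] for i in range(rows)]
```

Note that, as the text points out, a plain average ignores the relative significance of each representation; non-uniform weights would have to be learned on development data.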
2.7 Research Issues in QbE-STD System
QbE-STD is expected to bypass all the errors generated by ASR, since it offers an ASR-free spoken query detection framework [25]. However, QbE-STD suffers from the following research issues:
• Acoustic Representation: In realistic scenarios, different spoken realizations of the same word may have different statistics due to the stochastic nature of the speech production mechanism (i.e., speaker variations) and differences in speech acquisition (i.e., microphone and transmission channel characteristics). This naturally results in different acoustic characteristics for these different realizations. Representing these different acoustic characteristics as a similar pattern is a technological challenge. Various speaker normalization and noise-robust approaches have been attempted in the literature. The majority of approaches convert speech into either a frame- or symbol-based representation [24].
• Design of Speech Activity Detection (SAD) System: The spoken query often has silence regions in the vicinity of the spoken part, since it is recorded in isolation [151]. The presence of silence regions in the query might cause the query to be incorrectly detected within the audio documents. Thus, the SAD system plays a vital role in the QbE-STD system.
• Searching Execution: The common approaches convert the spoken query and the test utterances into representative templates and execute the task of template matching. However, the size of the audio documents (which contain the test utterances) is huge as compared to the actual presence of the query. Hence, a large amount of time is spent searching for the query within test utterances that do not contain the query. This time is significant, since most of the search algorithms are based on DTW, whose time-complexity is directly proportional to the duration (in terms of the number of frames). To alleviate this issue, data abstraction [35, 53] and indexing techniques [2, 131, 133] were used for QbE-STD.
• Fusion of the search systems: There is a need to combine the evidence from different search systems because of the low performance of the individual QbE-STD systems. The detection scores from majority-voted detections were fused using score-dependent weights. These calibration weights are determined by logistic regression using the scores and the available ground truth information [62]. Another approach suggests combining the local distance matrices for various templates obtained using different front-ends and executing DTW on the average local distance matrix [150]. This approach is computationally cheaper as compared to combining the evidence from different QbE-STD systems as suggested in [62]. However, this approach does not consider the relative significance of each audio representation exploited in QbE-STD and might give a wrong detection if the majority of the audio representations favor a wrong detection [150].
The overall summary of selected chronological progress on the QbE-STD problem since 2009 is shown in Figure 2.11, which indicates that research activity has continuously evolved w.r.t. all three perspectives, i.e., representation, matching, and detection.
2.8 QbE-STD Submission in MediaEval
In this Section, we will discuss the various QbE-STD systems submitted to the MediaEval campaigns in the years from 2011 to 2015. The details of the QbE-STD systems submitted for MediaEval's SWS evaluations are shown in Table 2.1-Table 2.5.
2.8.1 Summary of MediaEval SWS 2011
To the best of author’s knowledge, SWS 2011 evaluation was first time initiated by IBM, India [27] to develop STD systems on Indian languages (namely, Indian English, Hindi, Gujarati and Telugu). Audio documents/test utterances are hav- ing duration of 4-30 sec. Dev set contains 400 queries (100 per language) and Eval set contains 200 queries (50 per language) [27]. Most of the audio is taken in spon- taneous recording mode in the real-life settings [151]. The performance metric used is MTWV. Table 2.1 shows the QbE-STD submission to MediaEval SWS 2011
Figure 2.11: Selected chronological progress during the last few years in the QbE-STD research problem.
Table 2.1: QbE-STD submission in SWS MediaEval 2011
Study | Front-end | Search Algorithm | MTWV

Telefonica [68] | Binarization of posterior features | SubDTW | 0.222
Telefonica [68] | EM-ML GP | SubDTW | 0.173
BUT-HCTLabs [152] | GMM/HMM query model | LLR | 0.131
MUST, HLT, Microsoft [112] | Hindi AM, PhnRec | DP | 0.114
Irisa [153] | MFCC, GP, PP | SLNDTW with SSM | 0.100
MUST, HLT, Microsoft [112] | Hindi AM, lattice | DP | 0.086
BUT-HCTLabs [152] | PAKWS | LLR | 0.033
BUT-HCTLabs [152] | PP | DTW | 0.014
IIIT-H [69] | Articulatory-triphone | Sliding DTW | 0.000
EM-ML = Expectation-Maximization Maximum-Likelihood, AM = Acoustic Model, PAKWS = Parallel Acoustic Keyword Spotting, LLR = Log-Likelihood Ratio, DP = Dynamic Programming, SLNDTW = Segmentally Local Normalized DTW, GP = Gaussian Posteriorgram, PP = Phonetic Posteriorgram, PhnRec = Phoneme Recognizer. All MTWV values are rounded off to 3 decimal places.

A brief summary of the systems submitted at MediaEval SWS 2011 is as follows.
• Articulatory decoders were used to convert speech into articulatory symbols, and each articulatory tag has corresponding phonemic symbols in the Telugu database. The miss and false alarm probabilities on the evaluation data were 96 %-98 % and 0.1 %-0.2 %, respectively [69].
• The performance of MFCC, GP, and the BUT phoneme recognizer was presented for frame-level audio matching [153]. The system presented a novel SSM technique along with DTW scores.
• Hindi, as an Indian language, should be acoustically close to the other Indian languages that are part of SWS 2011. With this motivation, a Hindi phoneme recognizer was used in [112]. Lattice- and DP-based search were used.
• The GMM/HMM keyword search model was found to be effective for the SWS 2011 QbE-STD task. However, the language-specific AKWS model did not give better performance [152].
• An unsupervised hard clustering technique was used after GMM training in [68]. In addition, a binary representation of the posteriorgram with a Boolean distance metric was explored.
2.8.2 Summary of MediaEval SWS 2012
SWS 2012 targeted four African languages, namely, isiNdebele, Siswati, Tshivenda, and Xitsonga [151]. The Dev set contains 1580 audio documents (395 per language) and 100 query examples. The Eval set contains 1660 audio documents and 100 queries [151]. The performance metric used is again MTWV. Table 2.2 shows the QbE-STD submissions to the MediaEval SWS 2012 campaign. A brief summary of the systems submitted at MediaEval SWS 2012 is as follows [24]:
• The use of phoneme decoders and their mapping using IPA symbols were presented in [154, 155].
• The use of AKWS (the likelihood ratio of a query HMM and a background model) gave more promising results than the BNF-based DTW search system [139].
• To benefit from the complementary information of different posterior representations (from different tokenizers), a parallel tokenizer followed by DTW detection (PTDTW) framework was explored in [118].
• The RAILS algorithm was used to obtain a scalable zero-resource search system by representing the features as bit signatures and computing the matching using image processing approaches [143].
• A majority voting scheme was introduced to keep the detection candidates supported by at least two search systems [156].
• The novel CDTW was proposed, where a softmax operation is used instead of a hard maximum (or minimum) for the accumulated distance computation [122].
• An SVM-driven unsupervised classification framework was proposed in [63], which used the alignment cost (obtained using DTW) to label the segment pairs.
Table 2.2: QbE-STD submissions in SWS MediaEval 2012. The evaluation results are reported in terms of MTWV.
Study | Front-end | Search Algorithm | MTWV

CUHK [118] | MFCC-GMM, MFCC-ASM; PT: CZ-GMM, HU-GMM, RU-GMM, MA-GMM, EN-GMM | Segmental DTW, PRF and score normalization | 0.740
CUHK [118] | PT: CZ-GMM, HU-GMM, RU-GMM, MA-GMM, EN-GMM | SDTW and score normalization | 0.720
CUHK [118] | MFCC-GMM, MFCC-ASM | SDTW, PRF and score normalization | 0.640
BUT [139] | Mel filterbank, ANN/HMM decoder | LLR | 0.530
L2F [156] | PLP-RASTA, modulation spectrogram, MLP | AKWS, MV fusion | 0.523
BUT [139] | Mel filterbank, BNF | DTW | 0.488
JHU-HLTCOE [143] | FDLP | RAILS with SDTW | 0.369
Telefonica [157] | MFCC, GP | IR-DTW | 0.342
SpeeD, LAPI [154] | Romanian PhnRec | DTWSS | 0.310
TUM [122] | MFCC | CDTW | 0.296
Telefonica [157] | MFCC, GP | SDTW | 0.311
GTTS [155] | BUT PhnRec | Approximate string matching | 0.081
TUKE [63] | MFCC | Supervised SVM with MCA | 0.000

PT = Phonetic Tokenizer, MCA = Minimum Cost of Alignment, MV = Majority Voting, AM = Acoustic Model, PAKWS = Parallel Acoustic Keyword Spotting, LLR = Log-Likelihood Ratio, GP = Gaussian Posteriorgram, PP = Phonetic Posteriorgram, PhnRec = Phoneme Recognizer. All MTWV values are rounded off to 3 decimal places.
2.8.3 Summary of MediaEval SWS 2013
SWS 2013 targeted nine languages [28]: four African languages (namely, isiXhosa, isiZulu, Sepedi, and Setswana), Albanian, Romanian, Czech, Basque, and non-native English. The number of utterances is not uniform across languages, and the duration of the database is almost 5 times that of MediaEval SWS 2012 [28]. The Dev and Eval sets contain 505 and 503 unique spoken query examples, respectively. The performance metric used is again MTWV. Table 2.3 shows the QbE-STD submissions to the MediaEval SWS 2013 campaign. Brief novelties of the systems submitted at MediaEval SWS 2013 are as follows:
• To expand the spoken query, the Pitch Synchronous Overlap and Add (PSOLA) technique was used in [158], where the query is expanded by two time-scale factors, namely, 0.7 and 1.3.
• To exploit multiple query examples yet perform a single DTW to save execution time, average query computation was used in [76]. The longest-duration query example was considered, and all the queries were aligned to the frames of the longest query example.
Table 2.3: QbE-STD submissions in SWS MediaEval 2013. The evaluation results are reported in terms of MTWV.
Team | Front-End | Search Algorithm | MTWV

GTTS [76] | BUT PhnRec | subDTW, score normalization and calibration/fusion | 0.399
L2F [62] | PLP-RASTA, Modulation Spectrogram; MLP, ANN/HMM network | AKWS, DTW and calibration/fusion | 0.342
GTTS [76] | BUT PhnRec | subDTW | 0.346
CUHK [158] | MFCC, GCC and CD-DNN | DTW, distance matrix fusion, PSOLA for query expansion and score normalization (z-norm) | 0.306
BUT [113] | 13 atomic systems | AKWS, score normalization (m-norm) | 0.304
BUT [113] | 13 atomic systems | AKWS, DTW, score normalization (m-norm) | 0.297
CMTECH [102] | SAT, TAM | subDTW, score normalization (CDF-equalization) | 0.269
CMTECH [102] | SAT, TAM | subDTW, score normalization (z-norm) | 0.264
IIIT-H [159] | GP of FDLP+PH-BNF | NSDTW | 0.249
IIIT-H [159] | GP of FDLP+AR-BNF | NSDTW | 0.241
ELIRF [73] | MFCC | subDTW | 0.159
Telefonica [157] | MFCC, GP | IR-DTW | 0.093
Georgia Tech [144] | MFCC, ergodic HMM model | Viterbi search | 0.084
SpeeD [160] | Romanian PhnRec | DTWSS | 0.059
UNIZA [161] | MFCC, quasi-phoneme HMM | Viterbi search | 0.001
TUKE [70] | MFCC, audio segment unit | GMM-FST, SDTW | 0.000
SAM = Spectral Acoustic Model, TAM = Temporal Acoustic Model, AR-BNF = Articulatory BNF, GMM-FST = GMM Finite State Transducer, CD-DNN = Context-Dependent DNN. All MTWV values are rounded off to 3 decimal places.
• The KL divergence, which considers the inter-cluster and intra-cluster variabilities in the distance, was used in [102].
• Sets of i-vectors for the query and the test utterance were used in [162]. The alignment between the i-vectors of the query and the test utterance was computed using DTW and used to detect the query.
• Unsupervised HMMs, in particular, the semi-continuous density HMM [161] and the EHMM [144], were used for QbE-STD. The Viterbi algorithm was used to detect the query.
• The novel m-norm was proposed to normalize the scores of different queries [113]. In m-norm, the mode (maximum) value is subtracted from the scores, which are then divided by the standard deviation. To fuse several detection evidences, 13 search systems (i.e., atomic systems) were used [113].
• Articulatory information was used to train the MLP, and its BNF were used for QbE-STD because of their language-independent characteristics [159]. The Gaussian posteriorgram of BNF with raw FDLP features gave an improvement over the MLP phonetic posteriorgram features.
2.8.4 Summary of MediaEval QUESST 2014
The MediaEval QUESST 2014 database targeted six languages, namely, Albanian, Basque, Czech, non-native English, Romanian, and Slovak. In addition to the exact lexical match, this evaluation contains queries with lexical variation at the start or end as well as word reordering for multi-word queries [29]. The audio documents span a total duration of 23 hours (12492 files). The Dev and Eval sets contain 560 and 555 queries, respectively [29]. The primary evaluation metric used in QUESST 2014 is Cnxe. Table 2.4 shows the QbE-STD submissions to the MediaEval QUESST 2014 campaign. Brief novelties of the submitted systems at MediaEval QUESST 2014 are as follows:
• MSC is used to label segments of posterior feature vectors. After initial label estimation, HMM training and decoding are performed iteratively until the label sequence converges [163].
• The raw and choi-processed MFCCs were used in a zero-resource approach in [164]. The results showed that choi-processed MFCCs did not give any improvement over MFCCs. Similarly, PNCCs were used in [64] as a noise-robust acoustic representation.
• Various supervised [64, 140, 149, 165] and unsupervised [163, 165, 166] feature representations were used to improve the detection performance.
• The split query with AKWS was used to improve the performance on non-exact matched queries, which have lexical variation at the start, end, or middle [149].
• Partial matching schemes were used to perform the non-exact query matching task to deal with truncation, filler, and re-ordering cases [167].
2.8.5 Summary of MediaEval QUESST 2015
MediaEval QUESST 2015 focused on seven languages, namely, Albanian, Czech, English, Mandarin, Portuguese, Romanian, and Slovak. The QUESST 2015 evaluation comprises about 450 spoken queries in the Dev and Eval sets. These queries have exact as well as non-exact matches involving fillers and reordering. Noise and room reverberation effects were artificially introduced into the recorded queries by simulation [30]. Most of the QbE-STD systems developed by the participants use more
Table 2.4: QbE-STD submission in MediaEval QUESST 2014. The Eval results in terms of minimum Cnxe (Cnxe^min) are reported
Team | Front-End | Search Algorithm | Results (Eval)
BUT [149] | 10 PhnRec | AKWS, DTW; score normalization (m-norm); calibration/fusion | 0.464 (0.323 / 0.470 / 0.660)
BUT [149] | 10 PhnRec | AKWS, DTW; score normalization (m-norm); calibration/fusion with side information | 0.465 (0.310 / 0.461 / 0.673)
SPL-IT [167] | BUT PhnRec (CZ, HU, RU) | Partial matching DTW | 0.5080
GTTS [75] | BUT PhnRec | subDTW, calibration/fusion | 0.5994 (0.440 / 0.641 / 0.770)
NNI [166] | PhnRec | DTW, WFST-based AKWS | 0.6023 (0.507 / 0.621 / 0.725)
CUHK [163] | GP, ASM; PhnRec: CZ, HU, RU, MA, EN | DTW, distance matrix fusion | 0.659 (0.486 / - / -)
NNI [166] | 9 PhnRec | DTW | 0.6925 (0.573 / 0.730 / 0.803)
TUKE [165] | Slovak PhnRec | Weighted fast sequential variant of DTW | 0.891
IIIT-H [140] | FDLP, GP of FDLP+AR-BNF | NS-DTW | 0.922 (0.812 / 1.021 / 1.001)
TUKE [165] | MFCC, Viterbi decoding | Weighted fast sequential variant of DTW | 0.934
IIIT-H [140] | Telugu PhnRec | NS-DTW | 0.949 (0.933 / 0.960 / 0.964)
ELIRF [164] | MFCC, GP | subDTW | 0.965
ELIRF [164] | MFCC (choi processing), GP | subDTW | 0.967
SPEED [64] | PNCC; Romanian, Albanian, English PhnRec | DTWSS | 0.972 (0.972 / 0.970 / 0.963)
PhnRec : Phoneme recognizer, MA : Mandarin, AR-BNF : Articulatory BNF, DNN : Deep Neural Network, LSTM : Long Short-Term Memory, RNN : Recurrent Neural Network, TRAP : TempoRAl Pattern, BNF : Bottleneck Features, SBNF : Stacked BNF. All Cnxe^min values are rounded off to 3 decimal places. The dash (-) symbol indicates no values reported in the respective papers.

than a single phoneme recognizer (PhnRec) front-end and subsequence DTW. Finally, discriminative calibration was performed to execute score-level fusion. Table 2.5 shows the QbE-STD submissions to the MediaEval QUESST 2015 campaign. Brief novel concepts that were the outcome of the QUESST 2015 evaluation are as follows:
• Noise filtering and pre-processing were introduced before the feature extraction stage. Spectral subtraction was used to remove noise from the audio by estimating the power spectral density (PSD) of the noise [168, 172]. The Wiener filter approach was used in [177] to reduce the noise. In addition, a data-augmentation approach was used, where noise is artificially added during training to match the acoustic conditions of QUESST 2015 [177].
• Several deep learning frameworks were used for zero-resource and low-resource scenarios. The zero-resource Recurrent Neural Network (RNN)-based
Table 2.5: QbE-STD submission in MediaEval QUESST 2015. The Eval results in terms of minimum Cnxe (Cnxe^min) are reported
Team | Front-End | Search Algorithm | Results (Eval)
NNI [166] | Phonetic symbol, BNF, SBNF | DTW; score-level fusion of 66 systems | 0.747 (0.577 / 0.831 / 0.769)
SPL-IT-UC [168] | 5 PhnRec | SubDTW with 6 different partial matching strategies; calibration/fusion, side information | 0.781
BUT [169] | 7 PhnRec: SpeechDat, GlobalPhone: 1 stacked BNF | DTW with slope constraint; partial match in 2 and 3 bands | 0.812 (0.742 / 0.845 / 0.821)
BUT [169] | 7 PhnRec: SpeechDat, GlobalPhone: 1 stacked BNF | DTW with slope constraint | 0.818 (0.741 / 0.852 / 0.831)
BUT [169] | 7 PhnRec: SpeechDat, GlobalPhone | DTW; majority voting calibration/fusion with side information | 0.826 (0.757 / 0.865 / 0.834)
ELiRF [170] | BUT PhnRec | SubDTW; calibration/fusion | 0.875
GTM-Uvigo [171] | 11 PhnRec: DNN, LSTM, TRAP | Phoneme unit selection, DTW; calibration/fusion | 0.905 (0.861 / 0.928 / 0.904)
IIT-B [172] | 3 PhnRec, GP | DTW; majority voting calibration/fusion | 0.908 (0.868 / 0.911 / 0.921)
TUKE [173] | 4 PhnRec: GMM training | DTW; max score merging | 0.971
CUNY [174] | MFCC | LB_Keogh for search score reduction followed by subDTW | 0.985
SPEED [175] | Multilingual common phones | DTWSS | 0.993
NTU [176] | MFCC: zero-resource RNN | DTW | 0.997
PhnRec : Phoneme recognizer, MA : Mandarin, AR-BNF : Articulatory BNF, DNN : Deep Neural Network, LSTM : Long Short-Term Memory, RNN : Recurrent Neural Network, TRAP : TempoRAl Pattern, BNF : Bottleneck Features, SBNF : Stacked BNF. All Cnxe^min values are rounded off to 3 decimal places.
approach was presented in [176]. Phoneme posteriorgrams trained using long short-term memory (LSTM) neural networks and DNNs were presented in [171].
• To address non-exact matching, several partial matching strategies were employed at the frame-level or at the symbol-level. In particular, approximate phone sequence matching was used in [177], and partitioning of the DTW distance matrix at different locations was used in [168, 169].
• An approach that selects the most common relevant phoneme units to improve search performance was presented in [171].
• One common observation was that the problem of non-exact matching is very hard, and many times, while dealing with non-exact matching, the performance on exact match (Type 1) queries got affected.
2.9 Chapter Summary
In this chapter, we presented a survey of various methods for the QbE-STD task, where the spoken content retrieval task is attempted using the spoken form
of a query. The STD task uses a text-based query, and retrieval is conducted using an ASR and information retrieval system. Due to the scarcity of adequately transcribed data in low-resourced languages, ASR is not feasible in many such languages. To retrieve spoken content in the audio of such languages, a spoken example is exploited and is regarded as a query in the QbE-STD framework. This chapter summarized the various subsystems of a QbE-STD system, namely, the front-end subsystem, searching subsystem, and detection subsystem. Finally, we presented various submissions of QbE-STD systems at MediaEval from the years 2011 to 2015. In the next chapter, we will present brief details of the experimental setup that is used in this thesis to perform the QbE-STD task.
CHAPTER 3
Experimental Setup
3.1 Introduction
In this chapter, the details of the experimental setup used in the QbE-STD system are discussed. The organization of this chapter is as follows. Section 3.2 gives the details of the databases used in QbE-STD. In this thesis, we have used two databases, namely, SWS 2013 and QUESST 2014. As discussed earlier in Section 2.2, a QbE-STD system constitutes many components (i.e., sub-systems), namely, the front-end subsystem, searching sub-system, and detection sub-system. In that framework, the thesis focuses on the frame-based posteriorgram representation and subDTW as a searching algorithm. Section 3.3 presents the details of the acoustic representation and posteriorgram representation. Section 3.4 presents the details of the subDTW algorithm, which is extensively used in this thesis. Section 3.5 presents the details of score normalization in the detection subsystem, along with the effect of local constraints and dissimilarity functions (or distance metrics).
3.2 Databases Used
In this thesis, two databases are used for the QbE-STD task, namely, (A) the SWS 2013 database, and (B) the QUESST 2014 database. Next, the details of these databases, such as the number of keywords (queries), their instances, average duration, the number of utterances in the test dataset, etc., are described in brief.
(A) SWS 2013 database
The MediaEval SWS 2013 dataset is used for unranked evaluation, where the detection of a query is made based on a threshold value. All audio recordings have an 8 kHz sampling rate and are stored in 16 bits/sample PCM-encoded *.wav format. The statistics of the database are given in Table 3.1. The queries are categorized into two sets, namely, the Development (Dev) and Evaluation (Eval) sets. Common test audio data is provided, which is used for both query sets. The objective of the development set is to fine-tune the parameters in the design of the QbE-STD system. The performance on the evaluation set is investigated under these tuned parameters. SWS 2013 data consists of two types of queries, namely, normal or basic type (which involves only one example per query) and extended type (which includes multiple examples per query). We use the cepstral-domain features from the test data for GMM training.

Table 3.1: Statistics of MediaEval SWS 2013 database. After [178]
Data | # Utterances | Average duration (in sec.)
Test | 10762 | 6.67
Dev query | 505 | 1.35
Eval query | 503 | 1.38

Table 3.2: Statistics of MediaEval QUESST 2014 database. After [31]
Data | # Utterances | Average duration (in sec.)
Test | 12492 | 6.65
Dev query | 560 | 2.18
Eval query | 555 | 2.10
(B) QUESST 2014 database
MediaEval QUESST 2014 consists of 23 hours of total search data. All audio recordings have an 8 kHz sampling rate and are stored in 16 bits/sample PCM-encoded *.wav format [31]. The statistics of the QUESST 2014 database are given in Table 3.2. Findings from the SWS 2013 database suggest that although Czech (CZ) had similar acoustic conditions in the audio documents and queries, it performed worse. One possible reason could be the rapid variations in the speaking rate of conversational speech, which might give short queries when excised through speech cuts using forced alignment [31]. This motivated recording the queries directly from the users for QUESST 2014. The major distinctive characteristic of this database lies in the form of the query. The structure of the query motivates matching audio using approximate search rather than only exact search. The matching between audio can be categorized into three different types, namely, Type 1, Type 2, and Type 3.
• Type 1 : This type of audio matching refers to the exact match. For example, if a query is ‘Funny joke’, it should match the audio document containing ‘This is a
funny joke.’
• Type 2 : This type of audio matching refers to lexical variation of the query either at the beginning or at the end. The ground truth for such a query is prepared such that the matching part should be longer than 250 ms. In addition, the matching part should be longer than the non-matching part. For example, if the query is ‘perform’, it should match an audio document containing ‘That was a nice performance’, and vice-versa. Similarly, if the query is ‘encounter’, it should match audio containing ‘Please! come at the first counter’, and vice-versa.
• Type 3 : This type of audio matching refers to the reordering and filler cases of a query. The audio document contains all the words of a multi-word query; however, the sequence of words may be different from the query. In addition, some filler words can be present in the audio content. For example, if the query is ‘Funny joke’, it should match an audio document containing ‘This joke is funny.’
3.2.1 Challenges in Databases
QUESST 2014 requires non-exact matching of the spoken query, which is not straightforward. subDTW is not able to perform non-exact matching, and thus, the results for non-exact matches are poor as compared to exact matches with the subDTW search algorithm. We will discuss the subDTW search algorithm in Section 3.4.
3.3 Front-end Subsystem
3.3.1 Acoustic Representation
In this thesis, Mel Frequency Cepstral Coefficients (MFCC) [51], Perceptual Linear Prediction (PLP) cepstral coefficients [55], and MFCC-TMP [179] features are used. The objective of using different types of cepstral-domain features is to verify that the proposed representations, namely, the VTL-warped Gaussian posteriorgram and the posteriorgram using a mixture of GMMs, are not biased toward any particular type of representation. PLP features are better for formant matching across different age groups, as reported in [55], and hence are expected to be a good cepstral representation of speech.
Figure 3.1: Schematic block diagram of MFCC-TMP feature extraction. After [179, 182].
3.3.1.1 MFCC and PLP
We have used 26 subband filters spanning the 0-4000 Hz frequency range and 13 DCT coefficients (Type III of DCT) for feature extraction. Features are extracted over a 25 ms window duration with a 10 ms frame shift. Here, the 13 coefficients along with their delta (∆) and delta-delta (∆²) features are considered. The details of the feature extraction computation and the feature vector formation scheme are as follows. MFCC and PLP features are extracted using the Hidden Markov Model Toolkit (HTK) [180].
3.3.1.2 MFCC-TMP
MFCC-TMP stands for Mel Frequency Cepstral Coefficients where the subband energy is computed via the Teager Energy Operator (TEO), considering the magnitude and phase parts of the subband signal. In MFCC feature extraction, the conventional l2 norm is used for subband energy computation; hence, the energy of a subband signal is equal to the sum of squared values of the magnitude spectrum [51]. In MFCC-TMP, the Teager energy (which is a running estimate of the signal's energy) is used instead of the l2 energy. The time-domain signal is used to compute the TEO of the subband signal [179]. The nonlinear energy tracking operator referred to as the Teager Energy Operator (TEO) (denoted as ψ) for a discrete-time signal, x(n), is defined as [181]:
TEO{x(n)} = ψ{x(n)} = x²(n) − x(n + 1) x(n − 1).     (3.1)
The feature extraction procedure for MFCC-TMP is shown in Figure 3.1. Finally, the normalized subband energy is computed, followed by logarithm and Discrete Cosine Transform (DCT) operations, to get the proposed feature set, namely, MFCC-TMP, i.e.,
MFCC-TMP_i(k) = ∑_{j=1}^{N_F} S_{i,j} cos( k(j − 0.5)π / N_F ),     (3.2)

where k = 1, 2, ..., N_c; N_c = dimension of the feature vector (13 in this work); N_F = number of filters used in the Mel filterbank (26 in this work); and S_{i,j} = logarithm of the subband energy for the i-th frame index and j-th subband filter.
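The DCT stage of eq. (3.2) can be sketched as follows, assuming the N_F = 26 log Teager-energy values of one frame are already computed (the function name is illustrative):

```python
import math

def mfcc_tmp_frame(log_subband_energies, num_coeffs=13):
    """DCT stage of eq. (3.2):
    MFCC-TMP_i(k) = sum_{j=1}^{NF} S_{i,j} cos(k (j - 0.5) pi / NF),
    applied to the log subband energies S_{i,j} of one frame.
    With 0-based j below, (j + 0.5) equals the 1-based (j' - 0.5)."""
    NF = len(log_subband_energies)
    return [sum(S_j * math.cos(k * (j + 0.5) * math.pi / NF)
                for j, S_j in enumerate(log_subband_energies))
            for k in range(1, num_coeffs + 1)]
```

A constant log-energy vector yields (near-)zero coefficients for all k ≥ 1, since the DC component is carried by the k = 0 term, which eq. (3.2) excludes.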
3.3.2 Posterior Representation
These acoustic representations are transformed into posterior representations for better query detection. Conventionally, Gaussian posteriorgrams are computed with the help of acoustic representations (such as MFCC, PLP, and MFCC-TMP).
3.3.2.1 Motivation for GMM
There are two major motivations behind using a GMM for the cepstral representation. The first is the belief that each component of a multi-modal distribution such as a GMM represents a distinct acoustical event in speech [183]. Each such acoustic event is caused by an average vocal tract configuration, which is characterized by the mean component µ_i. The variation in the vocal tract structure is characterized by the covariance Σ_i. The unbalanced number of different acoustic classes can be characterized in terms of the weights of the GMM. The second reason is the approximation capability of the GMM: a GMM can approximate any arbitrary distribution without any labels [183]. Thus, we can also represent the cepstral-domain features with the help of a weighted linear combination of mean vectors. The details of the Gaussian posteriorgram are presented in sub-Section 2.4.3.2. We will discuss the VTLN-warped Gaussian posteriorgram and the mixture of GMMs posteriorgram in Chapter 4. As discussed earlier in sub-Section 2.4.3.2, the Gaussian posterior probability P(C_k|o_t) (for the k-th cluster and t-th speech frame index) is:
P(C_k | o_t) = π_k N(o_t; µ_k, Σ_k) / ∑_{j=1}^{N_p} π_j N(o_t; µ_j, Σ_j),     (3.3)

where the likelihood of feature o_t belonging to the k-th cluster is:

N(o_t; µ_k, Σ_k) = ( 1 / √( (2π)^N |Σ_k| ) ) exp( −(1/2) (o_t − µ_k)^T Σ_k^{−1} (o_t − µ_k) ).     (3.4)
The GMM parameters are estimated using the Expectation-Maximization (EM) algorithm. The initial parameters are set by a vector quantization (VQ) codebook computed via the Linde-Buzo-Gray (LBG) algorithm [184]. The procedure of VQ codebook preparation is demonstrated in Figure D.1.
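Eqs. (3.3)-(3.4) can be sketched as follows for a GMM with diagonal covariances (a simplifying assumption of this illustration; the equations above do not restrict Σ_k to be diagonal):

```python
import math

def gaussian_posteriorgram(o, weights, means, variances):
    """Posterior P(C_k | o_t) of eq. (3.3) for one feature vector `o`,
    for a GMM with diagonal covariances (a simplifying assumption of
    this sketch).  `means` and `variances` hold one list per component;
    `weights` are the mixture weights pi_k."""
    unnorm = []
    for w, mu, var in zip(weights, means, variances):
        # log N(o; mu, Sigma) with diagonal Sigma, cf. eq. (3.4)
        ll = sum(-0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
                 for x, m, v in zip(o, mu, var))
        unnorm.append(w * math.exp(ll))
    total = sum(unnorm)
    return [u / total for u in unnorm]
```

By construction the posterior vector sums to one, which is what allows it to be treated as a pdf in the KL-divergence-based matching of Section 3.4.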
In addition to the Gaussian posteriorgram, we have also used the phonetic posteriorgram. The phone posteriors are obtained using the open source Brno University phoneme recognizer [61]. The Czech (CZ), Hungarian (HU), and Russian (RU) phonetic recognizer systems were trained on the SpeechDat-E databases. We merge the state posterior probabilities into a single phone posterior, as performed in the study presented in [74]. Furthermore, we perform Speech Activity Detection (SAD) using all the phone posteriorgrams (i.e., CZ, HU, and RU). SAD separates the speech and non-speech parts of the spoken audio. Non-speech regions may be background noise, babble, or silence regions present in the speech signal. In the QbE-STD task, the non-speech region of the speech is not important for detection and consumes unnecessary search processing time. In addition, silence or babble regions may produce similar posteriorgram features, and hence may create ambiguity in query detection and reduce the search performance. We considered the average of the posterior probabilities of the non-speech units (such as pau, int, and spk) from CZ, HU, and RU to perform SAD. Thus, we have 43, 59, and 50 speech units corresponding to the CZ, HU, and RU phoneme posteriorgrams, respectively [74]. In this thesis, we propose the VTL-warped Gaussian posteriorgram and the mixture of GMMs posteriorgram. The VTL-warped Gaussian posteriorgram removes the speaker variability caused by spectral scaling variations (please refer to Section 4.2 in Chapter 4). The mixture of GMMs brings broad phoneme posterior probability during GMM training; hence, this might be useful to emphasize broad phoneme class-related information in the posteriorgram (please refer to Section 4.3 in Chapter 4).
3.4 Searching Subsystem
The searching subsystem uses subsequence DTW (subDTW) as the searching algorithm [120]. Let the dimension of the posteriorgram be N_p, the posteriorgram feature vector sequence for the spoken query be q_y = (q_y^1, q_y^2, ..., q_y^N), and the test utterance be t_x = (t_x^1, t_x^2, ..., t_x^M). The local distance between two posterior vectors, namely, t_x^i and q_y^j, is computed using the symmetric Kullback-Leibler (KL) divergence and is defined as [82, 98]:

Dis(t_x^i, q_y^j) = ∑_{k=1}^{N_d} t_x^i(k) log( t_x^i(k) / q_y^j(k) ) + ∑_{k=1}^{N_d} q_y^j(k) log( q_y^j(k) / t_x^i(k) ).     (3.5)
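As an illustrative sketch (pure Python, not the thesis implementation), the symmetric KL divergence of eq. (3.5) between two posterior vectors can be computed as follows; the small floor `eps` guarding against log(0) is an implementation detail assumed here, not specified in the text:

```python
import math

def symmetric_kl(p, q, eps=1e-12):
    """Symmetric KL divergence of eq. (3.5) between two posterior
    vectors p and q.  `eps` avoids log(0) for near-zero posteriors
    (an assumption of this sketch)."""
    kl_pq = sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
    kl_qp = sum(qi * math.log((qi + eps) / (pi + eps)) for pi, qi in zip(p, q))
    return kl_pq + kl_qp
```

The symmetrized form satisfies Dis(p, q) = Dis(q, p) and equals zero when the two posterior vectors are identical, which is why it can serve directly as a local distance.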
Figure 3.2: (a) Generation of warping path using subDTW and (b) local constraints used in subDTW. The red circles show initial points and the rest of the circles are computed recursively. The values of accumulated distance matrix at an arbitrary node (i, j) can be determined by the adjacent nodes, which is shown in Figure 3.2 (a). After [2].
As a posteriorgram represents a probability density function (pdf) across different phonetic classes, the KL divergence between two posterior vectors corresponds to the divergence between two pdfs [82, 98]. The KL divergence between two pdfs gives the relative entropy, which does not satisfy the symmetry property. Hence, we use the symmetrized version of the KL divergence to find the distance between two posteriorgram vectors. The KL divergence-based local distance was found to be effective on posterior feature vectors [52]. This might be because of the nature of posterior feature vectors, which can be regarded as a pdf or probability mass function (pmf). We consider the local constraints shown in Figure 3.2.
Each cell represents a pair of a test utterance frame and a query frame. The rows and columns indicate frames associated with the test utterance and the query, respectively. The matrix S stores the accumulated distance of the optimal warping path, and the frame-counting matrix T stores the length of the optimal warping path. The starting frame indicator matrix P is used to store the starting frame index of the corresponding warping path, which removes the need for backtracking. The procedure used to execute subsequence DTW with the local constraints specified in Figure 3.2 is as follows. For a single query, q_y, and test utterance, t_x, the local distance matrix is D = Dis_{t_x,q_y}.
Initialization:
Figure 3.3: The graphical representation of the matrices associated with subDTW (using the symmetrical local constraints shown in Figure 3.2 (b)): (a) local distance matrix D, (b) accumulated distance matrix S (the white line shows the warping path), (c) frame counting matrix T, and (d) starting frame indicator matrix P. Note that the image plots are rotated so that the row indicates the query frame index and the column indicates the test utterance frame index. After [2].
For j = 1 and i = 1, 2, ..., M:

    S(i, 1) = D(i, 1),
    T(i, 1) = 1,                                          (3.6)
    P(i, 1) = i.

Path-tracing:
For i = 1 and j = 2, 3, ..., N:

    S(1, j) = ∑_{k=1}^{j} D(1, k),
    T(1, j) = j,                                          (3.7)
    P(1, j) = 1.

For j > 1 and i = 2, 3, ..., M:

    Ω = {(i, j − 1), (i − 1, j − 1), (i − 1, j)},          (3.8)
    (r, s) = argmin_{(a,b)∈Ω} S(a, b),                     (3.9)
    S(i, j) = S(r, s) + D(i, j),                           (3.10)
    T(i, j) = T(r, s) + 1,                                 (3.11)
    P(i, j) = P(r, s).                                     (3.12)
Algorithm 1 presents the pseudo-code for subDTW with the symmetrical local constraint. Figure 3.3 (a) shows the local distance obtained using eq. (3.5) between each frame of the query and the test utterance. Figure 3.3 (b) shows the accumulated distance (computed as per eq. (3.10)); the diagonal trace indicates the presence of the query. The warping path along the distance accumulation can be traced with backtracking (shown as a white colored path). To count the number of cells on the white colored warping path (as shown in Figure 3.3 (b)), we use the frame counting matrix T (computed as per eq. (3.11)). The image plot of the frame counting matrix T is shown in Figure 3.3 (c). It can be seen that as the query frame index increases, the value of the T matrix increases, indicating that more cells are traced. Backtracking requires additional computation, and the exact alignment path is not required; rather, the start and end time stamps are important. With this consideration, we use the start frame indicator or path tracing matrix P. Figure 3.3 (d) shows the path tracing matrix P, which contains a few distinct colors indicating different warping paths for different starting
Algorithm 1 An algorithm for computing the matrices S, T, and P from the matrix D for the symmetrical local constraint LC1 (as specified in Figure 3.2 (b)). After [120].
Input: Matrix D
Output: Matrices S, T and P
Initialization: # 1st column, i.e., j = 1
1: for i = 1 to M do
2:   S(i, 1) = D(i, 1)
3:   T(i, 1) = 1
4:   P(i, 1) = i
5: end for
Path tracing: # 1st row, i.e., i = 1
6: for j = 2 to N do
7:   S(1, j) = S(1, j − 1) + D(1, j)
8:   T(1, j) = j
9:   P(1, j) = 1
10: end for
Path tracing: # For the rest: i > 1 and j > 1
11: for i = 2 to M do
12:   for j = 2 to N do
13:     Ω = {(i, j − 1), (i − 1, j − 1), (i − 1, j)}
14:     (r, s) = argmin_{(a,b)∈Ω} S(a, b)  # Selecting the predecessor
15:     S(i, j) = S(r, s) + D(i, j)
16:     T(i, j) = T(r, s) + 1
17:     P(i, j) = P(r, s)
18:   end for
19: end for
frame index. In this thesis, warping paths with interval length less than twice the query length (i.e., 2N) and greater than half the query length (i.e., N/2) are considered valid warping paths; the remaining warping paths, which do not satisfy this condition, are discarded. For SWS 2013 and QUESST 2014, we employ different strategies to obtain the DTW distance due to the different nature of the tasks. There are different warping paths across the test utterance. For SWS 2013, we select the warping path having the least average DTW distance. In this way, we considered a maximum of seven warping path intervals and the associated distances for each query and test utterance pair. In practice, the selection of seven warping paths for each query and test utterance pair is executed as given in Algorithm 2. Each end point and the corresponding start point correspond to a warping path (determined by the matrix P) and a warping cost (determined by the matrices S and T). We select the warping paths having length less than twice and greater than half of the query length. Later, we select the minimum-cost warping paths across different groups and store them (as valid start point (spv), end point (epv), and DTW distance (dv)), as shown in Algorithm 2. The possibility of more than one occurrence of a query in a test utterance (in the SWS 2013 dataset) demands considering more than one warping path. The selection of seven warping paths was made based on the optimal performance on the Dev set. The top Ntop = 1000 minimum distance values are considered. The rationale behind taking a maximum of seven warping paths for each query and utterance pair and Ntop = 1000 distance values is the high performance gain with these settings, as reported in [74]. For QUESST 2014, the objective is to retrieve the test utterance rather than detecting the location (time stamp) of a query. Hence, we consider the distances of all the test utterances with a query. Thus, for QUESST 2014, Ntop = 12492, i.e., the total number of test utterances (as stated in Table 3.2).
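The computation of the matrices S, T, and P described by Algorithm 1 can be sketched in Python as follows (0-based indices, unlike the 1-based pseudo-code; a minimal illustration, not the thesis code):

```python
def subdtw(D):
    """Subsequence DTW accumulation (Algorithm 1, symmetrical local
    constraint LC1).  D is an M x N local-distance matrix (rows: test
    utterance frames, columns: query frames).  Returns the accumulated
    distance matrix S, the path-length counter T, and the start-frame
    indicator P, which removes the need for backtracking."""
    M, N = len(D), len(D[0])
    S = [[0.0] * N for _ in range(M)]
    T = [[0] * N for _ in range(M)]
    P = [[0] * N for _ in range(M)]
    for i in range(M):          # first column: a path may start at any test frame
        S[i][0], T[i][0], P[i][0] = D[i][0], 1, i
    for j in range(1, N):       # first row: only horizontal moves are possible
        S[0][j] = S[0][j - 1] + D[0][j]
        T[0][j], P[0][j] = j + 1, 0
    for i in range(1, M):
        for j in range(1, N):
            # predecessors: Omega = {(i, j-1), (i-1, j-1), (i-1, j)}
            r, s = min([(i, j - 1), (i - 1, j - 1), (i - 1, j)],
                       key=lambda rs: S[rs[0]][rs[1]])
            S[i][j] = S[r][s] + D[i][j]
            T[i][j] = T[r][s] + 1
            P[i][j] = P[r][s]
    return S, T, P
```

Because every cell of the first column is a potential path start, the minimum of the last column of S directly locates the best-matching subsequence without restarting the recursion for each candidate start frame.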
3.5 Detection Subsystem
The distance values per query are taken, and the negatives of their normalized distances are treated as scores. For instance, consider the top N_top distances (selected by their minimum value) per query: ds_1, ds_2, ..., ds_Ntop. Score normalization is performed to obtain the normalized distance values ~ds_1, ~ds_2, ..., ~ds_Ntop, where ~ds_i = (ds_i − µ_q)/σ_q, and µ_q and σ_q indicate the mean and the standard deviation of ds_1, ds_2, ..., ds_Ntop, respectively. The respective scores associated with each detection are s_1, s_2, ..., s_Ntop, where s_i = −~ds_i. Algorithm 3 shows the proce-
Algorithm 2 An algorithm for warping path selection for each pair of test utterance (M frames) and query (N frames) for the SWS 2013 database.
Input: Matrices S, T and P.
Output: Nbst or fewer warping paths (starting point spv and ending point epv) with their distances (dv). In this thesis, Nbst = 7.
Initialization: G = {0}_{M×1} # Group assignment, 0 = no group
                CG = 1 # Current group
Group warping paths:
1: for i = 1 to M do
2:   dist(i) = S(i, N)/T(i, N)  # Path-length normalized DTW distance
3:   wr(i) = (i − P(i, N))/N  # Slope constraint
Check valid warping path:
4:   if (wr(i) ≤ 2) & (wr(i) ≥ 1/2) then
5:     Vwr(i) = 1
6:   else
7:     Vwr(i) = 0
8:   end if
9:   if Vwr(i) = 1 then
10:    G(i) = CG
11:  else
12:    CG = CG + 1
13:  end if
14: end for
Select best warping paths:
15: for k = 1 to max(G) do
16:   Let the set Sk := {i | G(i) = k}
17:   dv(k) = min_{i∈Sk} dist(i)
18:   epv(k) = argmin_{i∈Sk} dist(i)
19:   spv(k) = P(epv(k), N)
20: end for
21: Sort and select the Nbst best warping paths based on minimum distance values. If max(G) ≤ Nbst, then select all warping paths.
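The core of the warping-path selection (path-length-normalized distance, slope/interval check against the query length, and sorting by distance) can be sketched as follows; the grouping step of Algorithm 2 is omitted, and the 0-based interval formula `(i - start + 1) / N` is a simplifying assumption of this illustration:

```python
def valid_paths(S, T, P, N):
    """Simplified sketch of warping-path selection: for each candidate
    end frame i (last query column), compute the normalized distance
    and keep only paths whose interval length lies in [N/2, 2N]."""
    candidates = []
    for i in range(len(S)):
        dist = S[i][N - 1] / T[i][N - 1]        # normalized DTW distance
        start = P[i][N - 1]
        ratio = (i - start + 1) / N             # warping interval vs. query length
        if 0.5 <= ratio <= 2.0:                 # slope constraint check
            candidates.append((start, i, dist)) # (start, end, distance)
    return sorted(candidates, key=lambda c: c[2])
```

The returned list is sorted by ascending normalized distance, so keeping its first Nbst entries corresponds to the final selection step of Algorithm 2.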
Figure 3.4: The distribution of unnormalized and normalized scores: (a) unnormalized scores (DTW distance/cost), and (b) normalized scores. The pdf is approximated from a histogram with 20 bins.

dure to execute score normalization for a single query, which is adapted from [62].
Algorithm 3 An algorithm for score normalization for a single query. Adapted from [74].
Input: Unnormalized N_top distance values: ds_1, ds_2, ..., ds_Ntop.
Output: Normalized N_top score values: s_1, s_2, ..., s_Ntop.
1: Sample mean: µ_q = (1/N_top) ∑_{k=1}^{N_top} ds_k
2: Sample standard deviation: σ_q = sqrt( (1/(N_top − 1)) ∑_{k=1}^{N_top} (ds_k − µ_q)² )
3: Normalized scores: s_i = −(ds_i − µ_q)/σ_q, for 1 ≤ i ≤ N_top
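Algorithm 3 translates directly into a few lines (a minimal sketch; the function name is illustrative):

```python
import math

def normalize_scores(distances):
    """Algorithm 3: per-query z-normalization of DTW distances.  The
    negated normalized distances serve as detection scores, so the
    smallest distance yields the largest score."""
    n = len(distances)
    mu = sum(distances) / n                                  # sample mean
    sigma = math.sqrt(sum((d - mu) ** 2 for d in distances) / (n - 1))
    return [-(d - mu) / sigma for d in distances]
```

The negation means a common detection threshold can be applied across queries: better (smaller-distance) detections always map to higher scores regardless of the query's raw distance range.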
Figure 3.4 (a) shows the probability density function (pdf) of the unnormalized scores (DTW alignment cost/distance) for the same query (‘intelligence’) spoken by two different speakers. It can be observed from Figure 3.4 (a) that the distribution of the unnormalized scores follows a Gaussian distribution. The reason could be explained as follows. Note that the distance value computed by subDTW is the distance accumulated over the warping, i.e., the sum of the local distances along the warping path. Under the assumption that the local distance values have finite mean and variance, the unnormalized scores (DTW distances) approach a Gaussian distribution according to the central limit theorem [185]. Figure 3.4 (b) shows the probability density function of the normalized scores for the query (‘intelligence’) spoken by two different speakers. The distributions appear identical after score normalization, and both pdfs in Figure 3.4 (b) are in the vicinity of 0, indicating that the distributions are centered at zero (i.e., zero mean).
As discussed in sub-Section 1.2.1, score normalization is important because the threshold for query detection should not be biased toward particular sets of queries. After normalization, we used an arbitrary threshold value of 2 for the Dev set. The operating threshold at which the Dev set gives MTWV is used for the Eval set. For QUESST 2014, we optimized the threshold to achieve MTWV on the Dev set and calibrated the scores to minimize Cnxe. We use the Bosaris toolkit to compute Cnxe and the weights to obtain the minimum Cnxe across all types of queries [186]. The details related to the threshold θ selection are given in Appendix E.
3.5.1 Score-level Fusion of Multiple Systems
Earlier, different speaker and language recognition systems were fused at the score level to improve the performance of the speaker and language recognition tasks [187, 188]. These score-level fusion approaches assume that scores are available for each trial, which is not the case in QbE-STD. The detection candidates from several QbE-STD systems may not be synchronized, i.e., they may have different start and end positions (time-stamps) that hypothesize the location of the query in the utterance. For instance, Figure 3.5 (a) shows three QbE-STD systems that do not have exactly time-synchronous detection candidates. However, they can be aligned, since parts of the detection candidates overlap with one another. Thus, the time stamps that cover all the detection candidates are taken. In some cases, a few QbE-STD systems might not produce an output, i.e., they do not give detection scores, whereas the other QbE-STD systems produce detection scores for that detection candidate. As shown in Figure 3.5 (b), the detection candidates from two QbE-STD systems are aligned, and system-3 does not produce a score for that detection candidate. The detection score (missing score) for such a detection candidate is set to a default score, which is the minimum score per query or the minimum score per system. Thus, the detection regions from various search systems are aligned such that their time-stamps overlap based on a majority voting decision. In a few cases, if there is no detection region for a particular search system (i.e., missing scores), a default score is assigned. In this thesis, the missing score is taken as the minimum score per query. After the alignment, the scores are calibrated using logistic regression, where the inferences (i.e., the ground truth) are taken from the Dev set. The scores obtained from different systems are combined using the discriminative fusion approach presented in [62]. For NS given systems having t trials, the scores are fused as [62]:
Figure 3.5: Illustration of fusion for three QbE-STD systems: (a) all the detection candidates have overlap, and (b) missing detection candidates. The tsi and tei indicate the start and end times of the detection candidate for the i-th QbE-STD system. After [62].
ŝ_t = ξ_0 + ∑_{i=1}^{NS} ξ_i s_t^i, (3.13)

where s_t^i is the score of the i-th system for the t-th trial, and the ξ_i's are fusion/calibration coefficients, which are estimated by binary logistic regression [62]. The script for fusing multiple QbE-STD systems is available online [189].
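The alignment-with-default-scores step and the linear fusion of eq. (3.13) can be sketched as follows. The function name `fuse_scores`, the array layout, and the per-system missing-score default are illustrative assumptions (the thesis fills missing scores with the minimum score per query), with scikit-learn's `LogisticRegression` standing in for the binary logistic regression of [62]:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fuse_scores(dev_scores, dev_labels, eval_scores):
    """Discriminative score-level fusion of N_S QbE-STD systems.

    dev_scores  : (n_trials, N_S) aligned detection scores on the Dev set;
                  np.nan marks a missing score for a system.
    dev_labels  : (n_trials,) ground truth (1 = query present, 0 = absent).
    eval_scores : (m_trials, N_S) aligned scores to be fused.
    """
    def fill_missing(S):
        # Missing scores get a default value: here the minimum observed
        # score per system column (the thesis uses the minimum per query).
        S = S.copy()
        col_min = np.nanmin(S, axis=0)
        rows, cols = np.where(np.isnan(S))
        S[rows, cols] = col_min[cols]
        return S

    dev = fill_missing(np.asarray(dev_scores, float))
    ev = fill_missing(np.asarray(eval_scores, float))
    # Binary logistic regression learns xi_0 (intercept) and xi_i (weights)
    # from the Dev-set ground truth.
    clf = LogisticRegression().fit(dev, dev_labels)
    # Fused score: s_hat = xi_0 + sum_i xi_i * s_i  (the linear part of eq. 3.13).
    return clf.intercept_[0] + ev @ clf.coef_[0]
```

Calibration and fusion are learned jointly here; in practice the per-system scores are first aligned by the majority-voting overlap procedure described above.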
3.6 Effect of Local Constraints (LC)
The relative local temporal mismatch between a query and an utterance, due to different speaking rates of various speakers, may require additional treatment in the search algorithm. In particular, the locality constraints used during DTW distance accumulation have to be adjusted. The feature alignment is performed by similarity matching of consecutive features under different local constraints. We analyze the performance of the QbE-STD task for various local constraints in subDTW. To that effect, Figure 3.6 shows three different local constraints for DTW-based searching. To use these local constraints, we need to change the initialization of the subDTW algorithm and modify eq. (3.6) and eq. (3.7); the rest of the computation remains the same for all the local constraints used in this thesis. In the analysis of DTW presented earlier in Figure 3.3, we used the local constraint LC1. The pseudo codes for the other asymmetrical local constraints, i.e., LC2 and LC3, are described in Algorithm 4 and Algorithm 5, respectively (modified after [120]).
Algorithm 4 An algorithm for computing the matrices S, T, and P from the matrix D for the asymmetrical local constraint LC2 (as specified in Figure 3.6 (b)). Note that for LC2, the values in matrix T depend only on the number of frames in the query, i.e., N.

Input: Matrix D
Output: Matrices S, T and P
Initialization: # 1st column, i.e., j = 1
1:  for i = 1 to M do
2:      S(i, 1) = D(i, 1)
3:      T(i, 1) = 1
4:      P(i, 1) = i
5:  end for
Path tracing: # 1st two rows, i.e., i = 1, 2
6:  for i = 1 to 2 do
7:      for j = 2 to N do
8:          S(i, j) = S(i, j − 1) + D(i, j)
9:          T(i, j) = j
10:         P(i, j) = i
11:     end for
12: end for
Path tracing: # For the rest: i > 2 and j > 1
13: for i = 3 to M do
14:     for j = 2 to N do
15:         Ω = {(i, j − 1), (i − 1, j − 1), (i − 2, j − 1)}
16:         (r, s) = argmin_{(a,b)∈Ω} S(a, b)  # selecting the predecessor
17:         S(i, j) = S(r, s) + D(i, j)
18:         T(i, j) = T(r, s) + 1
19:         P(i, j) = P(r, s)
20:     end for
21: end for
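A direct NumPy rendering of the LC2 accumulation in Algorithm 4 might look as follows (a 0-based sketch; `subdtw_lc2` is an illustrative name, and predecessor ties are broken in favor of the horizontal step):

```python
import numpy as np

def subdtw_lc2(D):
    """Accumulate the cost matrix S, path-length matrix T, and start-row
    matrix P for local constraint LC2, whose predecessors are
    {(i, j-1), (i-1, j-1), (i-2, j-1)}. D is the (M, N) local distance
    matrix (M test frames, N query frames)."""
    M, N = D.shape
    S = np.zeros((M, N))
    T = np.zeros((M, N), dtype=int)
    P = np.zeros((M, N), dtype=int)
    # Initialization (first column): the path may start at any test frame.
    S[:, 0] = D[:, 0]
    T[:, 0] = 1
    P[:, 0] = np.arange(M)
    # First two rows: only the horizontal predecessor (i, j-1) exists.
    for i in range(min(2, M)):
        for j in range(1, N):
            S[i, j] = S[i, j - 1] + D[i, j]
            T[i, j] = j + 1
            P[i, j] = P[i, j - 1]
    # Remaining cells: pick the cheapest of the three predecessors.
    for i in range(2, M):
        for j in range(1, N):
            preds = [(i, j - 1), (i - 1, j - 1), (i - 2, j - 1)]
            r, s = min(preds, key=lambda ab: S[ab])  # ties favor the first entry
            S[i, j] = S[r, s] + D[i, j]
            T[i, j] = T[r, s] + 1  # every LC2 step advances j, so T = j + 1
            P[i, j] = P[r, s]
    return S, T, P
```

Because every LC2 step advances the query index j by exactly one, the path length in T is fully determined by the column index, which is the property exploited in the OLN discussion below eq. (3.14).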
Algorithm 5 An algorithm for computing the matrices S, T, and P from the matrix D for the asymmetrical local constraint LC3 (as specified in Figure 3.6 (c)).

Input: Matrix D
Output: Matrices S, T and P
Initialization: # 1st two columns, i.e., j = 1, 2
1:  for j = 1 to 2 do
2:      for i = 1 to M do
3:          S(i, j) = D(i, j)
4:          T(i, j) = j
5:          P(i, j) = i
6:      end for
7:  end for
Path tracing: # 1st row, i.e., i = 1
8:  for j = 2 to N do
9:      S(1, j) = S(1, j − 1) + D(1, j)
10:     T(1, j) = j
11:     P(1, j) = 1
12: end for
Path tracing: # For the rest: i > 1 and j > 2
13: for i = 2 to M do
14:     for j = 3 to N do
15:         Ω = {(i − 1, j − 2), (i − 1, j − 1), (i − 1, j)}
16:         (r, s) = argmin_{(a,b)∈Ω} S(a, b)  # selecting the predecessor
17:         S(i, j) = S(r, s) + D(i, j)
18:         T(i, j) = T(r, s) + 1
19:         P(i, j) = P(r, s)
20:     end for
21: end for
We do not perform length normalization for every transition (i.e., on-the-fly length normalization), as opposed to the study reported in [53, 148]. It was observed that on-the-fly length normalization (OLN) prefers the shorter alignment path over the longer alignment path [45, 148]. In subDTW, only a single DTW is performed for each query and test utterance pair, and the warping path can start at any instant within the test utterance. With OLN, the adjacent warping path is selected based on the accumulated distance normalized by the path length; without it, the adjacent path is selected based on the accumulated distances alone. The performance w.r.t. OLN using the local constraint LC1 is shown in Table 3.4. It can be observed that a slight improvement in performance is obtained. However, the performance is still not better than that of LC2.

Figure 3.6: Types of local constraints (LC) used: (a) LC1, (b) LC2, and (c) LC3. After [2].

Table 3.3 shows the performance of the different local constraints for CZ, HU, and RU posteriorgrams. It can be seen from Table 3.3 that the local constraint LC2 gives relatively better performance than LC1 [2, 53]. The local constraint LC2 allows more frames to be inserted from the test utterance relative to the query, which is suited for QbE-STD [53]. Hence, the local constraint LC2 is used for the rest of the experiments reported in this thesis, if not specified otherwise.

Table 3.3: Performance of the SWS 2013 QbE-STD system for various local constraints (in MTWV). The numbers in the brackets indicate ATWV. After [2].

Feature          Dev Set                       Eval Set
Vector     LC1     LC2     LC3     LC1             LC2             LC3
CZ         0.348   0.375   0.356   0.318 (0.313)   0.342 (0.340)   0.324 (0.323)
HU         0.357   0.374   0.368   0.331 (0.325)   0.341 (0.340)   0.342 (0.336)
RU         0.367   0.386   0.378   0.340 (0.338)   0.359 (0.357)   0.347 (0.347)

Table 3.4: Performance of SWS 2013 QbE-STD for on-the-fly length normalization (OLN) for local constraint LC1 (in MTWV). The numbers in the brackets indicate ATWV. After [2].
Feature       Dev Set              Eval Set
Vector     no OLN    OLN       no OLN          OLN
CZ         0.348     0.357     0.318 (0.313)   0.324 (0.322)
HU         0.357     0.359     0.331 (0.325)   0.334 (0.332)
RU         0.367     0.360     0.340 (0.338)   0.349 (0.346)
Figure 3.7: Performance of various dissimilarity functions on the SWS 2013 QbE-STD task: (a) Dev set, and (b) Eval set. The numbers on the top of the bars indicate MTWV. After [2].
In the search algorithm, the accumulated distance matrix S is computed independently of the path length matrix T, and the matrix T is then applied to normalize the matrix S in order to compute the DTW alignment distance. For the local constraint LC2, the optimization of the matrix S does not involve normalization by the length matrix T. This is because, for LC2, T(a, b) = b = j − 1 and hence,
(r, s) = argmin_{(a,b)∈Ω} S(a, b)/T(a, b) = argmin_{(a,b)∈Ω} S(a, b)/(j − 1) = argmin_{(a,b)∈Ω} S(a, b). (3.14)
After complete path tracing, the accumulated distance matrix S is normalized by the path length matrix T to compute the DTW alignment cost. Thus, for the local constraint LC2, OLN does not make any difference because of the constant denominator in the minimization.
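The constant-denominator argument can be checked numerically. This tiny sketch (with made-up candidate distances) confirms that dividing all candidates by the same positive path length leaves the argmin unchanged:

```python
import numpy as np

# Candidate accumulated distances S(a, b) over the predecessor set Omega;
# under LC2 they all share the same path length T(a, b) = j - 1.
scores = np.array([3.2, 2.7, 4.1])  # illustrative values
j = 5                               # so T(a, b) = j - 1 = 4 for every candidate

# Dividing by the common positive constant never changes the argmin.
assert np.argmin(scores / (j - 1)) == np.argmin(scores)
```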
3.7 Effect of Dissimilarity Functions
Various studies in QbE-STD used different dissimilarity functions (such as cosine distance, correlation distance, log-cosine distance, Euclidean distance, and KL divergence) to compute the local distance matrix. For example, the log-cosine distance was used in [39, 77, 121, 126, 148], the cosine measure in [103, 111, 148], and Pearson's correlation coefficient in [66, 148]. The studies presented in [52, 86, 98] use KL divergence metrics. As shown in Figure 3.7, the symmetrical version of KL divergence gave relatively better performance than the other distance metrics (for both Dev and Eval sets), followed by the negative logarithm of the cosine similarity. KL divergence is better suited because the posterior probabilities have a flatter distribution [86]. The performance obtained using the cosine and correlation distance metrics is similar. The Euclidean distance metric is not suitable for posterior template matching, as suggested in [81].
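For reference, the five dissimilarity functions compared above could be computed per frame pair roughly as follows. This is a sketch: the function name `local_distances` and the `eps` smoothing are illustrative choices, and the exact definitions vary across the cited studies:

```python
import numpy as np

def local_distances(q, u, eps=1e-12):
    """Frame-level dissimilarities between two posterior vectors q and u,
    as used to build the local distance matrix."""
    q, u = np.asarray(q, float), np.asarray(u, float)
    cos_sim = q @ u / (np.linalg.norm(q) * np.linalg.norm(u) + eps)
    qc, uc = q - q.mean(), u - u.mean()
    corr = qc @ uc / (np.linalg.norm(qc) * np.linalg.norm(uc) + eps)
    kl = lambda p, r: np.sum(p * np.log((p + eps) / (r + eps)))
    return {
        "cosine": 1.0 - cos_sim,              # cosine distance
        "correlation": 1.0 - corr,            # Pearson correlation distance
        "logcos": -np.log(cos_sim + eps),     # negative log of cosine similarity
        "euclidean": np.linalg.norm(q - u),   # Euclidean distance
        "sym_kl": kl(q, u) + kl(u, q),        # symmetrical KL divergence
    }
```

Identical posterior vectors give (near-)zero values for all five measures, while frames dominated by different phone classes give large symmetric KL values, which is the regime where flatter posterior distributions favor KL over Euclidean matching.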
3.8 Chapter Summary
In this chapter, we discussed the experimental setup for the QbE-STD systems used in this thesis. The front-end component converts the speech signal into a frame-level representation (such as acoustic features or a posteriorgram of acoustic features). It also performs removal of non-speech regions with the help of SAD. The searching subsystem performs matching between the query representation and the representation of the utterance. The detection subsystem pools the distances from several detection candidates and normalizes them, and these are interpreted as detection scores. The performance evaluation metrics are mainly p@N, recall, and MAP for the ranked evaluation task. For the unranked evaluation task, the performance is evaluated with MTWV and Cnxe^min. In the next chapter, we will discuss the representation perspective for the design of a QbE-STD system.
CHAPTER 4
Representation Perspective
4.1 Introduction
This chapter presents the details of acoustic representation for a QbE-STD system. As discussed in Section 2.4, the front-end subsystem transforms the speech (acoustic) signal into an appropriate representative vector, which is used by the search subsystem for audio matching. The acoustic representation should be speaker- and channel-invariant and should exhibit similar behavior for the same spoken content (same word or same sentence). In this chapter, we will discuss the use of a Vocal Tract Length Normalization (VTLN) warping factor estimation approach and a mixture of GMMs approach to represent the speech signal, with the posterior probability restricted w.r.t. broad phoneme classes, for the QbE-STD task. Section 4.2 discusses VTLN warping factor estimation with GMM and the VTL-warped Gaussian posteriorgram. Section 4.3 presents the mixture of GMMs framework for posteriorgram design.
4.2 Vocal Tract Length Normalization (VTLN)
It has been studied in the speech processing literature that for a uniform vocal tract model, the formants of the vocal tract are inversely related to the length of the vocal tract [190]. The formant frequencies of the vocal tract are given by [190]:
F_n = (2n − 1)v / 4L, n = 1, 2, ···, (4.1)

where L = length of the vocal tract (which is typically 13 cm to 18 cm [191]) and v = velocity of the sound wave (≈ 344 m/s, at sea-level and 70 °F [190]). For instance, the formant frequencies of two speakers, namely, speaker A and speaker B, having average vocal tract lengths L_A and L_B, respectively, are F_A ∝ 1/L_A and F_B ∝ 1/L_B. This results in F_A = α_AB F_B, where α_AB represents the VTLN warping
Figure 4.1: Mel filterbank for different VTLN warping factors: (a) α = 0.88, (b) α = 1, and (c) α = 1.12.

factor associated with only two speakers, namely, speaker A and speaker B. This is also referred to as uniform scaling of the frequency-domain spectra of a vowel produced (spoken) by two different speakers [192]. In practice, the VTLN warping factor is estimated for each utterance w.r.t. a general speaker model. The human vocal tract length can vary from nearly 13 cm for an adult female up to 18 cm for an adult male [191]. Due to this, formant frequencies can deviate by 25 % among various speakers. To reflect this deviation, the VTLN warping factor is generally taken from a set of 13 distinct values (at equally-spaced points between 0.88 and 1.12) [191]. The introduction of the VTLN warping factor creates an adjustment in the frequency analysis to cope with such spectral scaling variations. In general, this is performed by considering different versions of the Mel filterbank (whose center frequencies are scaled linearly). Figure 4.1 shows the Mel filterbank for different VTLN warping factors (namely, α = 0.88, 1, and 1.12). In practice, warping factors are obtained via a statistical modeling framework, i.e., Maximum Likelihood Estimation (MLE) [193]:
αˆ = argmax_{0.88 ≤ α ≤ 1.12} P(X^α | λ_T, W). (4.2)
Since a closed-form expression for Eq. (4.2) is not available, the MLE is computed for all the differently warped feature vectors X^α against the model λ_T for a given transcription, W. In this thesis, we refer to this estimation approach as HMM-based
VTLN warping factor estimation (because this approach requires the HMM model λ_T and the transcription W). The transcription is either known (i.e., supervised) or estimated first. This framework computes the alignment cost of the features to the states of the HMM. To investigate the effectiveness of warping factor estimation, we estimated α from 10 male and 10 female subjects of the Gujarati (G) and Marathi (M) languages [194]. As shown in Figure 4.2, higher values of α occur for male speakers (indicating a longer VTL) and smaller values of α for female speakers (indicating a shorter VTL).
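Eq. (4.1) can be evaluated directly for the quoted vocal tract extremes. The sketch below (with `formant` as an illustrative helper, using the values of v and L stated above) recovers the inverse length relation F_A = α_AB F_B:

```python
# Formant frequencies of a uniform tube of length L (eq. 4.1):
# F_n = (2n - 1) * v / (4 * L).
V = 344.0  # speed of sound in m/s (value quoted in the text)

def formant(n, L):
    """n-th formant frequency (Hz) of a uniform vocal tract of length L (m)."""
    return (2 * n - 1) * V / (4 * L)

f1_female = formant(1, 0.13)   # ~13 cm vocal tract (adult female extreme)
f1_male = formant(1, 0.18)     # ~18 cm vocal tract (adult male extreme)

# F_A = alpha_AB * F_B, with alpha_AB = L_B / L_A = 18/13 here.
alpha = f1_female / f1_male
```

For these extremes the first formant moves from roughly 478 Hz (18 cm) to roughly 662 Hz (13 cm), i.e., a scaling of about 1.38, which is why a grid of warping factors around 1 suffices to cover typical inter-speaker variation.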
Figure 4.2: Plot of VTLN warping factor α for 20 speakers (10 male and 10 female). MGJ = Gujarati male speakers, FGJ = Gujarati female speakers, MMR = Marathi male speakers and FMR = Marathi female speakers. After [194].
4.2.1 Prior Studies in VTLN
In previous work, VTLN was addressed by normalizing the vocal tract shape considering the stationary (steady-state) part of American English vowels [54]. VTLN warping factor estimation was mostly performed by considering the estimation of a formant frequency, in particular, the 3rd formant frequency [195, 196]. Data-driven methods, such as elastic registration [197], and dynamic programming (DP)-based methods, such as Dynamic Frequency Warping (DFW) [198], were used for VTLN warping factor estimation. The study presented in [199] presents a scale transform for speaker normalization, whereas the study reported in [200] suggested the use of a conformal map (i.e., an allpass transformation) to perform VTLN in the cepstral-domain instead of the frequency-domain. Linear transformations in the cepstral-domain were also used for VTLN in [201, 202] (called LT-VTLN), with less computational complexity and easy Jacobian matrix computation. A Vector Quantization (VQ) codebook approach was suggested for VTLN for text-independent speaker normalization [203]. Frequency warping-based VTLN was suggested in [204]. The VTLN warping factor estimation problem was posed as a frequency translation (shifting) factor estimation under the MLE framework in [205]. Spectral variations in speech among different speakers are phone-dependent and cannot be fully captured using a single warping factor for a single utterance. In order to capture the dynamics of the VTLN warping factor along utterances, frames are converted into a sequence of regions, and over each region, a VTLN warping factor is estimated [206]. Most of these approaches for VTLN warping factor estimation are used for ASR tasks. This framework is possible when a reference phoneme-level transcription is given along with the speech signal. The phoneme-level transcription is not available when considering an unsupervised QbE-STD task. In the absence of transcription (i.e., in unsupervised scenarios, such as the study reported in [78]), speech sound units are clustered to form a reference transcription. Since these new sound units are generated from the acoustic observations after segmentation, they were called Acoustic Segment Models (ASM) [78]. This process generates a transcription and finds the VTLN warping factor, α, under an MLE framework. The scope of the present work is to exploit the capability of GMM to estimate the VTLN warping factor and its application to the QbE-STD task. The approach is rooted in the fast VTLN approach and the two-pass approach presented in [207, 208]. However, we attempt to perform multiple passes, i.e., an iterative framework of VTLN warping factor estimation, for the QbE-STD task. Later, we consider a reduced number of frames (belonging to vowels) for VTLN warping factor estimation. The conventional method (such as the Lee-Rose method [193]) for VTLN warping factor estimation requires a phoneme-level transcription, whereas the GMM framework does not. In this work, we refer to the Lee-Rose method of VTLN warping factor estimation as HMM-based VTLN warping factor estimation, which is supervised as it requires a transcription. In addition, our approach uses a GMM that can also be exploited for Gaussian posteriorgram design. Hence, the presented work exploits the trained GMM for VTLN warping factor estimation and then uses the VTL-warped features to re-train the GMM for the Gaussian posteriorgrams used in QbE-STD tasks. The major contributions of this work are as follows:
• A GMM-based VTLN warping factor estimation is presented, which does not require manual transcription.
• The correlation between VTLN warping factor estimates obtained using the GMM (unsupervised) and HMM (supervised) frameworks is analyzed.
• Three different feature extraction schemes to incorporate linear VTLN, namely, MFCC, PLP, and MFCC-TMP, are used.
• The iterative approach for VTLN warping factor estimation under the GMM framework and the likelihood values at the different stages of the estimation procedure are presented.
• The application of GMM-based VTLN warping factor estimation to a phoneme recognition task is presented.
• The QbE-STD tasks are performed under various experimental conditions, such as the number of iterations of VTLN warping factor estimation, multiple examples of the spoken query, the score-level fusion of various search systems, and a reduced number of feature vectors for VTLN warping factor estimation.
4.2.2 GMM-based VTLN
A GMM-based framework differs from an HMM-based framework in terms of the objective function used to estimate the VTLN warping factor. In the HMM framework, each of the 61 phoneme units is modeled with three hidden states, and each state of the HMM is modeled with an eight-component GMM. The GMM consists of 64 mixture components. The current work is focused on linear warping factor estimation, which is implemented in the frequency-domain. To that effect, we used Gaussian Mixture Model (GMM) likelihood scores to obtain the VTLN warping factor. The belief behind GMM-based VTLN is that a GMM trained over a large number of speakers may capture speaker-generalized behavior. The spectral variation w.r.t. this generalized model can be used to adjust the filterbank for each speaker. Furthermore, we proceed iteratively to obtain further normalization of the spectral variation w.r.t. the general speaker model. The process of VTLN warping factor estimation and modified posteriorgram feature extraction is as follows.
1. Feature Extraction: Compute the warped features x_t^α that carry information for the different warping factors, namely, α = 0.88, 0.90, ···, 1.12. Note that the number of distinct values of α is user-defined and can be decided empirically.

2. Initial Training: Train the GMM on the unwarped features, i.e., x_t^α with α = 1 (no VTLN warping). Let the initial GMM model be λ_init ∼ (µ_init, Σ_init, w_init).

3. VTLN warping factor estimation: The likelihood is computed for all the differently warped feature vectors x_t^α against the initial model λ_init, i.e.,

αˆ = argmax_{0.88 ≤ α ≤ 1.12} P(X^α | λ_init). (4.3)

The VTLN warping factor is chosen by performing a grid search within 0.88 to 1.12.

4. Retraining GMM: The GMM is re-trained on these optimally warped features, i.e., x^αˆ. This new model λ_r ∼ (µ_r, Σ_r, w_r) is different from the earlier GMM model λ_init.

5. Posteriorgram Computation: The VTLN warping factors of the test and query features are estimated against the new GMM model λ_r. Based on the estimated warping factors, Gaussian posteriorgrams are computed.
The VTL-warped Gaussian posteriorgrams obtained are used for the QbE-STD task. Figure 4.3 shows the overall block diagram for VTLN-based Gaussian posteriorgram feature extraction. In the next sub-Section, we will investigate the phone
recognition performance of VTLN warping factor estimation. The idea of GMM-based VTLN warping factor estimation can be explained as follows. Let the VTL-warped features be X^α, 0.88 ≤ α ≤ 1.12. Initially, the GMM is trained on unwarped features, i.e., α = 1 and hence, X^α ≡ X^1, i.e.,
λ_init = argmax_λ P(X^1 | λ). (4.4)

Now, VTLN warping factors are estimated based on MLE, i.e.,
αˆ = argmax_{0.88 ≤ α ≤ 1.12} P(X^α | λ_init). (4.5)

In the next iteration, we consider the VTL-warped features to train the GMM,
λ^(1) = argmax_λ P(X^αˆ | λ). (4.6)
This implies P(X^αˆ | λ^(1)) ≥ P(X^1 | λ_init). Thus, the maximization of the likelihood results in a better Gaussian posteriorgram representation in the following iterations. Next, we investigate the relation between the two VTLN warping factor estimates (using both the GMM and HMM approaches) on MFCC feature sets. The VTLN warping factor is estimated using the GMM and HMM-based approaches on the train set of TIMIT, consisting of 3696 utterances. We employed linear frequency scaling to implement VTLN, i.e., α = 0.88, 0.90, ···, 1.12. Fig. 4.4 displays the mapping between these two VTLN warping factor estimates using the supervised Lee-Rose method [191] and the unsupervised method. The relatively darker diagonal band in Figure 4.4 indicates that most of the warping factors obtained through these two techniques are nicely correlated with each other. Moreover, it was observed that around 35 % of the utterances have the same VTLN warping factors (falling on the line y = x of Figure 4.4) for both HMM-based estimation and GMM-based estimation. This analysis shows the potential of GMM-based VTLN warping factor estimation in the absence of transcription. The TIMIT corpus contains more male speakers (for which α > 1) than female speakers. Due to the differences between the GMM and HMM-based VTLN warping factor estimates, the VTLN warping factors estimated from the GMM framework are distributed more to the left of the y = x line. This results in an upward bending/tilting line (as shown in Figure 4.4). To understand the effect of multiple utterances on VTLN warping factor estimation, we estimate the VTLN warping factor per speaker for the TIMIT train set.
Figure 4.3: A schematic block diagram of the proposed unsupervised GMM-based VTLN warping factor estimation and Gaussian posteriorgram feature extraction. The dotted box shows the iterative approach for VTLN warping factor estimation.
Figure 4.4: Estimated values of the VTLN warping factor using two different methods, namely, HMM (supervised) and GMM (unsupervised). The red dashed line indicates the line y = x. After [1].

The effect of the number of utterances on VTLN warping factor estimation is shown in terms of the difference in α between utterance-wise and speaker-wise estimates. It can be observed from Figure 4.6 that the estimated VTLN warping factors per speaker and per utterance are very close, with a difference between 0.02 and 0.03. If the number of utterances or the duration increases, the difference between the utterance-specific and speaker-specific VTLN warping factors gets reduced. That means we get almost the same warping factor for multiple utterances.
4.2.3 Iterative Approach for VTLN
As discussed earlier in sub-Section 4.2.2, the iterative scheme of VTLN warping factor estimation leads to an increase in the likelihood of the training data. Algorithm 6 presents the details of the GMM-based approach used for VTLN warping factor estimation and the corresponding Gaussian posteriorgram extraction. The plot of the log-likelihood values w.r.t. the iteration index is shown in Figure 4.5. This plot is for MFCC cepstral features and 128 mixture components. To compute the log-likelihood, we start with no VTLN warping (i.e., α^(0) = 1) and estimate the initial GMM model λ^(1) = λ_init (please refer to eq. (4.4)); the log-likelihood log(P(X^{α^(0)} | λ^(1))) is computed. Then, new VTLN warping factors are estimated as per eq. (4.5) and the log-likelihood log(P(X^{α^(1)} | λ^(1))) is computed. In the GMM-based approach for VTLN warping factor estimation, we initially build a GMM on unwarped (i.e., α = 1, no VTLN) features and estimate the appropriate VTLN warping factor using MLE. Initially, we don't have speaker labels or phonetic transcription and hence,
Algorithm 6 An algorithm for the proposed iterative approach for unsupervised GMM-based VTLN warping factor estimation and Gaussian posteriorgram extraction.

Input: VTL-warped features: training X_tr^α, testing X_te^α; number of iterations N_iter.
Output: Optimal VTLN warping factors and corresponding Gaussian posteriorgrams.
Initialization:
1:  Initial GMM training on α = 1 (no VTLN).
2:  Build GMM λ_init on X_tr^α, (α = 1), i.e., λ_init ∼ (µ_init, Σ_init, w_init).
VTLN warping factor estimation and re-training:
3:  k = 0  # set iteration index counter to 0
4:  while (k < N_iter) do
5:      for each training utterance i do
6:          Estimate the VTLN warping factor using αˆ_i = argmax_α P(X_tr,i^α | λ_init).
7:      end for
8:      Build GMM on the optimally VTL-warped features (X_tr^αˆ), i.e., λ_r ∼ (µ_r, Σ_r, w_r).
9:      k = k + 1  # increment iteration index counter
10:     λ_init ← λ_r  # store new model as old model
11: end while
VTLN warping factor estimation and Gaussian posteriorgram computation for the testing database:
12: for each testing utterance i do
13:     Estimate the VTLN warping factor using αˆ_i = argmax_α P(X_te,i^α | λ_r).
14:     Compute the Gaussian posteriorgram using
        GP(X_te,i^αˆi) = [P(C_1 | X_te,i^αˆi), ···, P(C_NG | X_te,i^αˆi)]^T,
        where P(C_p | X_te,i^αˆi) = w_r^p N(X_te,i^αˆi; µ_r^p, Σ_r^p) / ∑_{j=1}^{NG} w_r^j N(X_te,i^αˆi; µ_r^j, Σ_r^j).
15: end for
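The procedure of Algorithm 6 might be sketched as follows, assuming features have been pre-computed for every warping factor on the grid. The names `estimate_vtln` and `posteriorgram`, the dictionary layout, and the use of scikit-learn's `GaussianMixture` are illustrative assumptions, not the thesis implementation (which used 64- and 128-component GMMs on real warped features):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

ALPHAS = np.round(np.linspace(0.88, 1.12, 13), 2)  # 13-point warping grid

def estimate_vtln(warped_feats, n_components=2, n_iter=3, seed=0):
    """Iterative unsupervised GMM-based VTLN warping factor estimation.

    warped_feats: dict mapping utterance id -> {alpha: (n_frames, dim) array},
    i.e., features pre-computed for every warping factor on the grid.
    Returns the per-utterance warping factors and the final GMM."""
    utts = sorted(warped_feats)
    alphas = {u: 1.0 for u in utts}  # initial training uses unwarped features
    gmm = None
    for _ in range(n_iter):
        # (Re-)train the GMM on the currently selected warped features.
        X = np.vstack([warped_feats[u][alphas[u]] for u in utts])
        gmm = GaussianMixture(n_components, covariance_type="diag",
                              random_state=seed).fit(X)
        # MLE grid search: keep the warp maximizing the per-frame log-likelihood.
        for u in utts:
            ll = {a: gmm.score(warped_feats[u][a]) for a in ALPHAS}
            alphas[u] = max(ll, key=ll.get)
    return alphas, gmm

def posteriorgram(gmm, feats):
    # Gaussian posteriorgram: per-frame posterior over the mixture components.
    return gmm.predict_proba(feats)
```

Each pass re-trains the model on the currently selected warps and then re-estimates the warps, mirroring steps 3 to 11 of Algorithm 6; `predict_proba` computes exactly the component posteriors of step 14.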
Figure 4.5: (a) Log-likelihood values w.r.t. the iteration index, and (b) the change in VTLN warping factor estimates from the previous estimates. The red points in Fig. 4.5 (a) indicate the log-likelihood after GMM training, whereas the blue points indicate the log-likelihood after VTLN warping factor estimation.

we started with unwarped features. We assume that a GMM trained with unwarped features captures information from different speakers, i.e., speaker-independent information. The proposed approach of VTLN warping factor estimation is rooted in the study presented in [207]. As an initial seed, the VTLN warping factors of all speakers are taken as α = 1; then the model is trained, and the GMM is retrained on the warped features. New VTLN warping factors are then estimated with the retrained GMM. This process is continued for 5 iterations. Figure 4.5 (b) indicates the change in VTLN warping factor estimates w.r.t. the previous estimates; it can be observed that as the number of iterations increases, the difference between the VTLN warping factors estimated in the current iteration and the previous iteration gets reduced. In the next sub-Section, we observe the effectiveness of VTLN warping factor estimation for a phoneme recognition task.
4.2.4 Results for Phoneme Recognition
We considered the TIMIT database without the /sa/ sentences, which are common across all the speakers and may bias the results. We used the training set from TIMIT for HMM monophone training for a phoneme recognition task. HTK was used to perform the ASR (phoneme recognition) experiments [180]. The phoneme recognition is evaluated in terms of % phoneme accuracy [180]. We used MFCC and PLP as acoustic features, extracted using HTK. The details of feature extraction were presented in sub-Section 3.3.1.

Figure 4.6: Effect of the number of utterances on GMM-based VTLN warping factor estimation (for MFCC, PLP, and MFCC-TMP features).
Table 4.1: Effect of VTLN warping factor estimation on phoneme recognition per- formance (% Phoneme Accuracy)
VTLN     ×       L (TRAN)  L (DEC)  P (1)   P (2)   P (3)   P (4)   P (5)
MFCC     67.44   70.62     68.90    69.02   69.67   69.73   69.80   69.81
PLP      67.25   70.89     68.70    69.45   70.16   70.19   70.21   70.26

(× = no VTLN, L = Lee-Rose VTLN, and P = proposed iterative approach for GMM-based VTLN warping factor estimation; TRAN = with actual phonetic transcription, DEC = with decoded phonetic transcription. The numbers in brackets indicate the iteration index in the proposed iterative GMM-based VTLN warping factor estimation.)
Table 4.1 shows the performance of VTLN warping factor estimation on the phoneme recognition task in terms of % phoneme accuracy. We consider context-independent monophone models trained over the 61-phone TIMIT phone set, then merged into a 39-phone set as suggested in [209]. A bi-gram phoneme-based language model is trained under the HTK framework. In the HMM-based approach, we considered two cases for VTLN warping factor estimation: the Transcription (TRAN) case, in which the exact phonetic transcription along with the test utterance is given to estimate the VTLN warping factor, and the Decode (DEC) case, where only the test utterance is given without the transcription, and the transcription is decoded with the help of the trained HMM model to estimate the VTLN warping factor. The better phoneme recognition accuracy of the TRAN case over the DEC case indicates the dependency on the transcription, because errors introduced while decoding might result in incorrect VTLN warping factor estimation. In the GMM-based approach, we retrain the GMM on the new VTL-warped features and use the VTL-warped test audio for decoding. It can be observed that GMM-based VTLN warping factor estimation improves the performance due to the VTLN-based speaker normalization, and that as the number of iterations increases, the performance of the VTL-warped features also increases. The improvement in phoneme accuracy saturates as the iteration count reaches 4 or 5. The GMM-based approach gives better phoneme accuracy than the DEC case; however, it gives poorer phoneme accuracy (across all the iterations considered here) than the TRAN case. The better phoneme recognition performance of the GMM-based approach for VTLN warping factor estimation over the no-VTLN case motivated us to exploit the unsupervised GMM-based framework for the QbE-STD task.
4.2.5 Experimental Results
The performance obtained using the GMM-based approach is better than when no VTLN is applied. This is analyzed in Figure 4.7, which shows the accumulated distance matrix when the query is present in the test utterance. In this example, we consider two queries, namely, government and meeting, from the TIMIT corpus. The corresponding detections for the Gaussian posteriorgram and the VTL-warped Gaussian posteriorgram are shown by blue and red arrows, respectively. The green arrow corresponds to the actual endpoint (i.e., the ground truth) of the query within the test utterance. It can be observed that the Gaussian posteriorgram gives a wrong detection, whereas the detection with the VTL-warped Gaussian posteriorgram falls very close to the ground truth. The VTL-warped Gaussian posteriorgram exhibits more similarity towards the actual location of the query within the test utterance and hence can be useful for the QbE-STD task. More results on the TIMIT dataset for various experimental conditions are given in Appendix A. Next, we discuss the experimental results on the MediaEval SWS 2013 dataset.
With 128 mixture components, we investigate the effect of the VTL-warped Gaussian posteriorgram. Figure 4.8 shows the performance for various numbers of iterations used in VTLN warping factor estimation. It can be observed that VTLN improves the QbE-STD performance, and that the performance of the QbE-STD system improves as the number of iterations increases. As discussed earlier in sub-Section 4.2.3, we used 5 iterations in the iterative VTLN warping factor estimation framework; however, the stopping criterion can be set at the 2nd or 3rd iteration, since the performance does not vary significantly after that. The stopping criterion in the iterative approach can be set by examining the MTWV on the Dev set.
Figure 4.7: Dissimilarity analysis with VTL-warped Gaussian posteriorgram. Accumulated distance matrix for query /government/ with (a) no-VTLN Gaussian posteriorgram, (b) VTLN Gaussian posteriorgram, and (c) the DTW distance obtained in each case. (d), (e), and (f) show a similar analysis for a different query, /meeting/. The arrows indicate the minimum detection and the circles indicate the corresponding DTW distance values. The accumulated distance matrix S is normalized by the counting matrix T for better visualization.
Figure 4.8: Effect of the number of iterations for VTLN warping factor estimation on SWS 2013 QbE-STD performance. (a) MTWV for Dev set, (b) ATWV for Eval set. The numbers on top of the bars indicate MTWV.
It can be observed from Figure 4.8 that the MTWV is relatively higher at the 3rd iteration. Hence, 3 iterations would be a reasonable choice, and we may stop after 3 iterations. Interestingly, for the Eval set, the MTWV at the 3rd iteration is higher than in the no-VTLN case. It can also be noted that the different feature sets achieve their maximum MTWV at different iterations, because of the varying acoustical properties captured by the different feature sets used in this thesis.
4.2.5.1 Effect of Number of Gaussians
The number of Gaussian components in the Gaussian posteriorgram plays an important role in the QbE-STD task [53, 77, 98]. In this Section, we investigate the effect of the number of mixture components used in VTLN warping factor estimation on the QbE-STD task. In particular, we considered 64, 128, and 256 mixture components in the GMM for both training and VTLN warping factor estimation. It can be observed from Figure 4.9 that the performance using the GMM-based VTL-warped posteriorgram is better than that of the Gaussian posteriorgram. In addition, it can be observed from Figure 4.9 that an increasing number of mixture components improves the performance of a QbE-STD system. This finding matches a previous study reported in [53], and might be because a larger number of clusters (in the GMM) better represents the speech signal at the frame-level. However, an increasing number of Gaussians demands additional processing and storage cost and hence, we restrict our experiments to 128 clusters. A lower number of Gaussian components might be insufficient to capture the distribution of the feature vectors; hence, the Gaussian posteriorgram with 64 components does not perform as well as the Gaussian posteriorgram with 128 components.
Figure 4.9: Effect of the number of Gaussians on SWS 2013 QbE-STD performance. Results on (a) Dev set, and (b) Eval set.
4.2.5.2 Effect of Local Constraints
Figure 3.6 shows three different local constraints for DTW-based searching in QbE-STD. The relative local temporal mismatch between a query and an utterance, due to the different speaking rates of various speakers, may require additional treatment in the search algorithm. In particular, the local constraints considered during DTW distance accumulation have to be adjusted. The feature alignment is performed by similarity matching of consecutive features under the different local constraints.
Figure 4.10 shows the performance of QbE-STD systems for different local constraints. It can be observed from Figure 4.10 that LC2 performs relatively better than the other local constraints, probably due to its property of mapping more features along the test utterance than along the query. This asymmetry might be suitable in QbE-STD due to the nature of the problem [53], where a test utterance has a longer duration than the query. In the experimental results presented earlier in this thesis, we used local constraint LC2 unless otherwise specified. For every local constraint, it can also be observed that the VTL-warped Gaussian posteriorgram improves QbE-STD performance over the Gaussian posteriorgram.
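The role of a local constraint in DTW accumulation can be sketched as follows. This is only an illustration: the exact step sets LC1-LC3 are those of Figure 3.6 and are not reproduced here, so the step set below is merely an assumed example of an asymmetric constraint that lets a query frame consume more than one test frame.

```python
import numpy as np

def dtw_accumulate(dist, steps):
    """Accumulate a frame-level distance matrix under a given local constraint.
    dist[i, j]: distance between query frame i and test frame j.
    steps: allowed predecessor moves (dq, dt) in (query, test) frames."""
    Q, T = dist.shape
    acc = np.full((Q, T), np.inf)
    acc[0, :] = dist[0, :]   # the query may start anywhere in the test utterance
    for i in range(1, Q):
        for j in range(T):
            prev = [acc[i - dq, j - dt] for dq, dt in steps
                    if i - dq >= 0 and j - dt >= 0]
            if prev:
                acc[i, j] = dist[i, j] + min(prev)
    return acc

# Assumed LC2-like asymmetric step set: each query frame may consume
# one or two test frames, mapping more test frames onto the query.
LC2_LIKE = [(1, 1), (1, 2)]
```

The best detection end frame is then `np.argmin(acc[-1, :])`, which is the minimum indicated by the arrows in Figure 4.7.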
4.2.5.3 Multiple Examples per Query
As discussed earlier in Section 3.2, the SWS 2013 data consists of two sets of queries, namely, normal or basic (which involves only one example per query), and extended (which includes multiple examples per query). It was recommended in
Figure 4.10: Effect of local constraints (LC) on SWS 2013 QbE-STD systems. Results (in MTWV) on (a) Dev set, and (b) Eval set. The numbers on top of the bars indicate MTWV.
MediaEval SWS 2013 to use the basic query sets. However, multiple examples of a single query capture multiple realizations of the spoken content and might be effective, as suggested in [78]. In this Section, experimental results for multiple examples per query are presented. For instance, two languages, namely, Basque and Czech, have 3 and 10 examples, respectively, in the Dev and Eval sets of the SWS 2013 dataset. In order to exploit the multiple examples, we prefer a simple yet effective approach that combines multiple query examples into a single query example, as suggested in [74]. In this method, a single average query example is generated by DTW alignment of the multiple queries onto the longest-duration query. After this operation, the number of queries in the extended set reduces to the number of queries in the normal set (since a single average query represents all the multiple examples). Hence, this way of exploiting multiple query examples is computationally much cheaper than considering each example individually. Figure 4.11 shows the performance of using multiple examples per query. It can be observed from Figure 4.11 that after fusing multiple examples, the performance improved for all the feature sets. The MTWV improves further with VTL-warped Gaussian posteriorgrams (i.e., 4 % absolute and 15.5 % relative improvement for the MFCC feature set).
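The query-averaging step can be sketched as follows, assuming per-query feature matrices (e.g., posteriorgrams) and a plain Euclidean-cost DTW; the thesis follows the scheme of [74], which this sketch only approximates.

```python
import numpy as np

def dtw_path(A, B):
    """DTW between feature matrices A (n, D) and B (m, D);
    returns the alignment path as (i, j) index pairs."""
    n, m = len(A), len(B)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(A[i - 1] - B[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    path, i, j = [], n, m               # backtrack from the end
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        k = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        i, j = (i - 1, j - 1) if k == 0 else (i - 1, j) if k == 1 else (i, j - 1)
    return path[::-1]

def average_query(examples):
    """Average multiple examples of a query by aligning each onto the longest one."""
    ref = max(examples, key=len)
    acc = np.array(ref, dtype=float)
    counts = np.ones(len(ref))
    for ex in examples:
        if ex is ref:
            continue
        for i, j in dtw_path(np.asarray(ref), np.asarray(ex)):
            acc[i] += ex[j]
            counts[i] += 1
    return acc / counts[:, None]        # frame-wise mean over aligned examples
```

The averaged query has the length of the longest example, so downstream search cost is the same as for a single example.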
4.2.5.4 Score-level Fusion of VTL-warped Gaussian Posteriorgrams
The details of score-level fusion for different QbE-STD systems were discussed in sub-Section 3.5.1. Here, we fuse three different systems (i.e., NS = 3), corresponding to the three cepstral feature sets, namely, MFCC, PLP, and MFCC-TMP (as per eq. (3.13)). Table 4.2 shows the performance of score-level fusion after (indicated by X) and before (indicated by ×) VTLN for the normal and extended
Figure 4.11: Effect of the number of iterations for VTLN warping factor estimation on SWS 2013 QbE-STD performance with multiple examples per query: (a) Dev set, (b) Eval set. The numbers on top of the bars indicate MTWV.
Table 4.2: Performance of score-level fusion for SWS 2013 QbE-STD.

Query Type | Features | VTLN | Dev set | Eval set
Normal | ALL | × | 0.236 | 0.178
Normal | ALLV | X | 0.256 | 0.194
Extended | ALL | × | 0.267 | 0.207
Extended | ALLV | X | 0.296 | 0.231
Normal | CZ | P | 0.375 | 0.342
Normal | ALLV+CZ | PX | 0.445 | 0.398
Normal | HU | P | 0.374 | 0.341
Normal | ALLV+HU | PX | 0.445 | 0.394
Normal | RU | P | 0.386 | 0.359
Normal | ALLV+RU | PX | 0.456 | 0.412
(× = No VTLN, X = VTLN, ALL = all three cepstral representations, ALLV = all three cepstral representations with VTLN warping, P = phonetic posteriorgram, PX = score-level fusion of ALLV with the phonetic posteriorgram, + = score-level fusion)
query sets. From Table 4.2, it can be observed that MTWV increases and $C_{nxe}^{min}$ reduces in the case of VTL-warped Gaussian posteriorgrams. The VTLN Gaussian posteriorgram improves the Maximum Term Weighted Value (MTWV) by 0.02 (i.e., 2 %) and 0.016 (i.e., 1.6 %) for the Dev and Eval sets, respectively.
The performance of the VTL-warped Gaussian posteriorgram can be further improved by score-level fusion with a phonetic posteriorgram. We explore three BUT phoneme recognizers, namely, CZ, HU, and RU, for this task. The performance of the score-level fusion of the phonetic posteriorgrams (i.e., CZ, HU, and RU) with the VTL-warped Gaussian posteriorgram of all three cepstral representations, namely, MFCC, PLP, and MFCC-TMP (i.e., ALLV), is shown in Table 4.2. It can be seen that score-level fusion gives an improvement in MTWV over the VTL-warped Gaussian posteriorgram alone. The improvement is consistent, being observed for all three phonetic posteriorgrams over the VTL-warped Gaussian posteriorgram. After score-level fusion with the phonetic posteriorgram, the MTWV of the Eval query set is comparable with several SWS 2013 benchmark systems [31] (in particular, as shown in Table 2.3, GTTS: 0.399, L2F: 0.342, CUHK: 0.306, BUT: 0.297, CMTECH: 0.257, IIITH: 0.224, ELIRF: 0.159, TID: 0.093, GT: 0.084, SPEED: 0.059, Proposed (CZ + ALLV): 0.398, Proposed (HU + ALLV): 0.394, Proposed (RU + ALLV): 0.412, Proposed (ALLV): 0.194).
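A generic score-level fusion of the NS systems can be sketched as below. The exact normalization of eq. (3.13) is not reproduced here; per-system min-max normalization followed by a weighted sum is a common choice and serves only as an illustration.

```python
import numpy as np

def fuse_scores(system_scores, weights=None):
    """Score-level fusion of NS detection-score arrays (one per system).
    Each system's scores are min-max normalized, then combined by a
    weighted sum. (Illustrative scheme; eq. (3.13) may differ in detail.)"""
    NS = len(system_scores)
    weights = np.ones(NS) / NS if weights is None else np.asarray(weights, float)
    fused = np.zeros_like(np.asarray(system_scores[0], dtype=float))
    for w, s in zip(weights, system_scores):
        s = np.asarray(s, dtype=float)
        rng = s.max() - s.min()
        norm = (s - s.min()) / rng if rng > 0 else np.zeros_like(s)
        fused += w * norm               # weighted sum of normalized scores
    return fused
```

Normalizing each system before summing is what makes the fusion of, e.g., a phonetic posteriorgram system (CZ) with the cepstral ALLV systems meaningful, since their raw DTW score ranges differ.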
4.2.5.5 VTLN on Reduced/Expanded Number of Features
In this Section, we investigate the effect of GMM-based VTLN warping factor estimation from a reduced number of feature vectors on the QbE-STD task. The broad phoneme class of vowels is relatively stable and longer in duration compared to other speech sound units, such as fricatives, nasals, etc. With this consideration, we reduced the number of features from a query by considering only the frames associated with a vowel, and then performed GMM-based VTLN warping factor estimation for each query. The objective of this experiment is to reduce the computational complexity of VTLN warping factor estimation. To detect the frames associated with the broad phoneme class of vowels, we took the posterior probabilities of the broad vowel class, computed from the BUT phoneme recognizer. This vowel frame selection retains about 52 % of the frames per SWS 2013 query. The performance on the SWS 2013 QbE-STD task is reported in Table 4.3. It can be observed from Table 4.3 that the VTL-warped Gaussian posteriorgram performs better than the Gaussian posteriorgram.
DET curves for VTL-warped Gaussian posteriorgrams obtained from the reduced features are shown in Figure 4.12. The DET curves indicate that VTLN warping factor estimates obtained from fewer features can also perform better than the Gaussian posteriorgram. It can also be seen from Figure 4.12 that the DET performance is very similar for both VTLN warping factor estimations, i.e., using all the frames and using only the frames that correspond to a vowel. This is an important observation: after considering only vowels to estimate the VTLN warping factor, the performance of QbE-STD remains almost the same. Thus, VTLN warping factor estimation can be executed more rapidly than when considering all frames and hence, this approach is computationally less intensive.
In contrast to the above experimental condition, we also conducted experiments with expanded training data for VTLN warping factor estimation.
To that effect, we pooled the data from the QUESST 2014 dataset [29] and used these features to train the GMM and estimate the VTLN warping factor.
Table 4.3: Performance (MTWV) of QbE-STD with reduced number of frames. S = single spoken example per query, M = multiple spoken examples per query, × = no VTLN, and X = proposed approach. The numbers in brackets indicate ATWV for the Eval set.

Feature sets | # query examples | VTLN | Dev set | Eval set
MFCC | S | × | 0.188 | 0.138 (0.137)
MFCC | S | X | 0.211 | 0.155 (0.154)
MFCC | M | × | 0.218 | 0.161 (0.161)
MFCC | M | X | 0.240 | 0.184 (0.181)
PLP | S | × | 0.195 | 0.145 (0.145)
PLP | S | X | 0.237 | 0.169 (0.168)
PLP | M | × | 0.221 | 0.169 (0.164)
PLP | M | X | 0.270 | 0.207 (0.204)
MFCC-TMP | S | × | 0.197 | 0.147 (0.143)
MFCC-TMP | S | X | 0.230 | 0.166 (0.166)
MFCC-TMP | M | × | 0.227 | 0.169 (0.168)
MFCC-TMP | M | X | 0.263 | 0.199 (0.196)
Figure 4.12: Effect of reduced features in VTLN warping factor estimation (at iteration index = 5) on the DET curve.

The performance in terms of MTWV for the different feature sets is shown in Table 4.4. It can be seen from Table 4.4 that the proposed GMM-based framework improves QbE-STD performance with the pooled data as well.
4.2.5.6 Deterministic Annealing Expectation Maximization (DAEM)
The EM algorithm iteratively estimates the Maximum Likelihood (ML) model parameters in the presence of incomplete or hidden data. However, the EM algorithm suffers from convergence to a local optimum. To address this problem,
Table 4.4: Performance of SWS 2013 QbE-STD with pooled data from QUESST 2014 (in MTWV). The numbers in brackets indicate ATWV for the Eval set.

VTLN | Dev: MFCC | Dev: PLP | Dev: MFCC-TMP | Eval: MFCC | Eval: PLP | Eval: MFCC-TMP
× | 0.188 | 0.198 | 0.181 | 0.143 (0.137) | 0.159 (0.157) | 0.144 (0.141)
X | 0.203 | 0.231 | 0.207 | 0.154 (0.153) | 0.173 (0.172) | 0.150 (0.146)
(× = No VTLN, X = VTLN)
Figure 4.13: Values of the annealing factor (ζ) at every iteration.
the Deterministic Annealing EM (DAEM) algorithm was proposed in 1998 [210]. The DAEM algorithm uses the principle of maximum entropy and an analogy with statistical mechanics. DAEM is an alternative to the Expectation Maximization problem, in which the maximization of the likelihood is posed as the minimization of free energy [210-212]. This results in a modified posterior probability that takes into account an annealing factor ζ, which is inversely proportional to the temperature. The parameters of the GMM in the EM framework, i.e., $\theta := \{\pi_k, \mu_k, \Sigma_k\}_{k=1}^{K}$, can be estimated using the EM algorithm. The class assignment for each observation vector $o_t$ can be made based on the responsibility terms, which are given by [212]:
$$\gamma_t^k = E_{\theta_0}[Z_t^k] = \frac{\tilde{\pi}_k\, \mathcal{N}(o_t; \tilde{\mu}_k, \tilde{\Sigma}_k)}{\sum_{j=1}^{K} \tilde{\pi}_j\, \mathcal{N}(o_t; \tilde{\mu}_j, \tilde{\Sigma}_j)}, \qquad (4.7)$$
where $\theta_0 := \{\tilde{\pi}_k, \tilde{\mu}_k, \tilde{\Sigma}_k\}_{k=1}^{K}$ denotes the old parameter values. For DAEM, eq. (4.7) is modified by the annealing parameter ζ as [212]:
$$\gamma_t^k = E_{\theta_0}[Z_t^k] = \frac{\left(\tilde{\pi}_k\, \mathcal{N}(o_t; \tilde{\mu}_k, \tilde{\Sigma}_k)\right)^{\zeta}}{\sum_{j=1}^{K} \left(\tilde{\pi}_j\, \mathcal{N}(o_t; \tilde{\mu}_j, \tilde{\Sigma}_j)\right)^{\zeta}}. \qquad (4.8)$$
The values of ζ change over the iterations as shown in Figure 4.13. At equilibrium, a thermodynamic system approaches the state of minimum free energy; similarly, as the number of iterations increases, the parameters of the GMM attain maximum likelihood. The annealing factor ζ (in eq. (4.8)) is analogous to the inverse of the temperature in thermodynamics.
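The tempered E-step of eq. (4.8) amounts to raising each weighted component likelihood to the power ζ before normalization. A minimal sketch in the log domain:

```python
import numpy as np

def daem_responsibilities(log_weighted_lik, zeta):
    """Tempered responsibilities of eq. (4.8): raise each weighted component
    likelihood pi_k * N(o_t; mu_k, Sigma_k) to the power zeta, then normalize.
    log_weighted_lik: (T, K) array of log(pi_k * N(o_t; mu_k, Sigma_k)).
    zeta: annealing factor (zeta -> 1 recovers the standard EM E-step)."""
    z = zeta * log_weighted_lik            # exponentiation done in the log domain
    z = z - z.max(axis=1, keepdims=True)   # stabilize before exp
    g = np.exp(z)
    return g / g.sum(axis=1, keepdims=True)
```

For small ζ the responsibilities are flattened towards uniform (a "hot" system), which is what lets DAEM escape poor local optima before the schedule sharpens them again.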
Here, we perform both annealing and anti-annealing in the DAEM algorithm. We experimented with the DAEM-based parameter estimation approach on the SWS 2013 database. For the SWS 2013 QbE-STD task, the performance of EM and DAEM is shown in Table 4.5. It can be observed that the VTL-warped Gaussian posteriorgram gives better performance than the Gaussian posteriorgram. In the next Section, we present a novel mixture of GMMs approach for the QbE-STD task.
Table 4.5: Performance of DAEM on SWS 2013 QbE-STD (in MTWV). The numbers in brackets indicate ATWV. After [1].

VTLN | Dev EM: MFCC | Dev EM: PLP | Dev DAEM: MFCC | Dev DAEM: PLP | Eval EM: MFCC | Eval EM: PLP | Eval DAEM: MFCC | Eval DAEM: PLP
× | 0.188 | 0.195 | 0.188 | 0.200 | 0.138 (0.137) | 0.145 (0.145) | 0.139 (0.137) | 0.146 (0.145)
X | 0.209 | 0.222 | 0.211 | 0.222 | 0.159 (0.154) | 0.169 (0.168) | 0.159 (0.158) | 0.160 (0.159)
4.3 Mixture of GMMs
In this Section, we introduce a modification to the Gaussian posteriorgram by imposing constraints from a broad phoneme recognition system. In particular, we use broad phoneme classes (such as vowels, semi-vowels, fricatives, nasals, and plosives) to provide constraints in Gaussian Mixture Model (GMM) clustering. Earlier studies used prior constraints from labeled data to provide better initialization during GMM training [101]. In addition, a mixture of Auto-Associative Neural Networks (AANNs) was introduced to improve on the performance of a single AANN for the speaker recognition task [213], where the mixtures are tied using broad phoneme class probabilities derived from a Multilayer Perceptron (MLP). With these two motivations, in this thesis, we present a novel mixture of GMMs for the QbE-STD task. The GMM is trained on the complete speech data, where no phonetic constraints are imposed during training; the GMM parameters are instead controlled by prior information supplied by phonetic inferences. The novelty of the proposed approach lies in the prior probability assignment as the weights of the mixture of GMMs. In the next sub-Section, the mathematical formulation of the proposed approach is presented.
4.3.1 Mixture of GMM Posteriorgram
Speech data exhibits acoustically similar broad phonetic structures. The mixture of GMMs comprises a group of GMMs, where each group corresponds to a broad phoneme category.

Figure 4.14: An illustration of the mixture of GMMs as a set of broad phoneme classes (that are mutually exclusive and exhaustive events), each representing a GMM. After [214].

Consider a set of $K$ broad classes, namely, $\mathcal{B} := \{B_1, B_2, \cdots, B_K\}$, where the number of Gaussians in the $k^{th}$ broad class is $M_k$. As illustrated in Figure 4.14, the acoustic feature vectors (except for silence regions) are classified into five broad phoneme classes. All five broad classes of the probabilistic sample space, $B_1, B_2, \cdots, B_5$, are assumed to be mutually exclusive and exhaustive events. The probability of the data under the mixture of GMMs is given by [215]:
$$P(o_t|\theta) = \sum_{k=1}^{K} P(B_k|o_t) \left( \sum_{i=1}^{M_k} \pi_i^k\, \mathcal{N}(o_t; \mu_i^k, \Sigma_i^k) \right), \qquad (4.9)$$
where the parameters $\theta = \{\pi_i^k, \mu_i^k, \Sigma_i^k\}_{k=1, i=1}^{K, M_k}$ contain the set of $M_k$ GMM parameters for each $k^{th}$ broad phoneme class; each broad class consists of $M_k$ Gaussian components. The log-likelihood of an observation feature sequence $o := \{o_t\}_{t=1}^{T}$ of length $T$ can be expressed by considering the observations as independent and identically distributed (i.i.d.) and taking the logarithm on both sides of eq. (4.9):

$$\mathcal{L}_\theta(o) = \sum_{t=1}^{T} \log P(o_t|\theta), \qquad (4.10)$$

$$= \sum_{t=1}^{T} \log \left( \sum_{k=1}^{K} P(B_k|o_t) \sum_{i=1}^{M_k} \pi_i^k\, \mathcal{N}(o_t; \mu_i^k, \Sigma_i^k) \right). \qquad (4.11)$$
Now, applying Jensen's inequality [216] for the logarithm function (which is concave), $\log(\lambda x_1 + (1 - \lambda) x_2) \geq \lambda \log x_1 + (1 - \lambda) \log x_2$, we have:
$$\therefore \log P(o|\theta) \geq \sum_{t=1}^{T} \sum_{k=1}^{K} P(B_k|o_t) \log \left( \sum_{i=1}^{M_k} \pi_i^k\, \mathcal{N}(o_t; \mu_i^k, \Sigma_i^k) \right).$$
Here, the role of the multipliers $\lambda$ and $1 - \lambda$ is played by $P(B_k|o_t)$, since $\sum_{k=1}^{K} P(B_k|o_t) = 1$; these terms act as convex weights, which is why Jensen's inequality applies and they can be pulled outside the logarithm. The parameters of the mixture of GMMs, i.e., $\theta := \{\pi_i^k, \mu_i^k, \Sigma_i^k\}_{k=1, i=1}^{K, M_k}$, can be estimated using the Expectation-Maximization (EM) algorithm. In order to invoke the EM algorithm, a latent variable $Z_{ti}^k$ is introduced, which represents the assignment of the observation vector $o_t$ to the $i^{th}$ Gaussian component of the $k^{th}$ broad phoneme class. The latent variable $Z_{ti}^k$ follows a Bernoulli distribution. Hence, the objective function in the E-step for the initial model $\theta_0 = \{\tilde{\pi}_i^k, \tilde{\mu}_i^k, \tilde{\Sigma}_i^k\}_{k=1, i=1}^{K, M_k}$ is given by [215]:
$$Q(\theta|\theta_0) = \sum_{t=1}^{T} \sum_{k=1}^{K} P(B_k|o_t) \sum_{i=1}^{M_k} E_{\theta_0}[Z_{ti}^k] \log \left( \pi_i^k\, \mathcal{N}(o_t; \mu_i^k, \Sigma_i^k) \right).$$
The class assignments for each observation vector, ot, can be made based on the responsibility term, which is given by [215]:
$$\gamma_{ti}^k = E_{\theta_0}[Z_{ti}^k] = \frac{\tilde{\pi}_i^k\, \mathcal{N}(o_t; \tilde{\mu}_i^k, \tilde{\Sigma}_i^k)}{\sum_{j=1}^{M_k} \tilde{\pi}_j^k\, \mathcal{N}(o_t; \tilde{\mu}_j^k, \tilde{\Sigma}_j^k)}. \qquad (4.12)$$
During maximization (the M-step), the parameters are updated as follows [215]:
$$\pi_i^k = \frac{\sum_{t=1}^{T} P(B_k|o_t)\, \gamma_{ti}^k}{\sum_{t=1}^{T} P(B_k|o_t)}, \qquad (4.13)$$

$$\mu_i^k = \frac{\sum_{t=1}^{T} o_t\, P(B_k|o_t)\, \gamma_{ti}^k}{\sum_{t=1}^{T} P(B_k|o_t)\, \gamma_{ti}^k}, \qquad (4.14)$$

$$\Sigma_i^k = \frac{\sum_{t=1}^{T} (o_t o_t^T)\, P(B_k|o_t)\, \gamma_{ti}^k}{\sum_{t=1}^{T} P(B_k|o_t)\, \gamma_{ti}^k} - \mu_i^k {\mu_i^k}^T. \qquad (4.15)$$

More details of the mixture of GMMs are given in [215].
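The weighted M-step of eqs. (4.13)-(4.15) can be written compactly for one broad class. The sketch below assumes the responsibilities of eq. (4.12) have already been computed and uses full-covariance updates:

```python
import numpy as np

def mstep_broad_class(O, p_broad_k, gamma_k):
    """M-step updates (4.13)-(4.15) for one broad class k.
    O: (T, D) features; p_broad_k: (T,) values of P(B_k|o_t);
    gamma_k: (T, M_k) responsibilities of eq. (4.12) within class k."""
    w = p_broad_k[:, None] * gamma_k                   # P(B_k|o_t) * gamma_{ti}^k
    Nk = w.sum(axis=0)                                 # per-component weighted mass
    pi = Nk / p_broad_k.sum()                          # eq. (4.13)
    mu = (w.T @ O) / Nk[:, None]                       # eq. (4.14)
    # eq. (4.15): weighted E[o o^T] minus mu mu^T, per component
    sigma = np.einsum('tm,td,te->mde', w, O, O) / Nk[:, None, None]
    sigma -= np.einsum('md,me->mde', mu, mu)
    return pi, mu, sigma
```

Setting `p_broad_k` to all ones reduces these updates to the standard GMM M-step, which makes explicit how the broad-class posteriors act as frame-level weights.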
4.3.2 Practical Implementation
4.3.2.1 Broad Phoneme Posterior Probability
To compute the posterior probability associated with broad phonetic classes,
$P(B_k|o_t)$, we have used MLP posterior values. In particular, we have used the open-source BUT phoneme recognizers trained for multiple languages, namely, Czech (CZ), Hungarian (HU), and Russian (RU) [61]. We then combined, for each recognizer, the probabilities of all phonemes associated with the same broad phoneme class by summing them. Thereafter, we normalized each broad phoneme class posterior value by 3, which is the number of phoneme recognizers. We treated affricates and plosives as a common broad phoneme category, namely, the plosive broad phoneme class. An alternative could be the use of an ergodic HMM for broad phoneme posterior computation, which requires labels; initial labels could be generated from a foreign recognizer, such as the BUT phoneme recognizer, to train the ergodic HMM.
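The combination of phoneme posteriors into broad-class posteriors can be sketched as follows. The phone-to-broad-class mappings are hypothetical placeholders here, since the actual phone sets of the CZ, HU, and RU recognizers differ.

```python
import numpy as np

def broad_posteriors(phone_post_list, phone2broad_list, n_broad=5):
    """Sum per-frame phoneme posteriors into broad-class posteriors for each
    recognizer, then divide by the number of recognizers (here, averaging).
    phone_post_list: list of (T, P_r) posterior matrices, one per recognizer.
    phone2broad_list: list of length-P_r index arrays mapping each phone of
    recognizer r to a broad class in {0, ..., n_broad-1}."""
    T = phone_post_list[0].shape[0]
    out = np.zeros((T, n_broad))
    for post, mapping in zip(phone_post_list, phone2broad_list):
        mapping = np.asarray(mapping)
        for b in range(n_broad):
            out[:, b] += post[:, mapping == b].sum(axis=1)  # sum phones in class b
    return out / len(phone_post_list)                       # normalize by # recognizers
```

If each recognizer's per-frame posteriors sum to one, the resulting broad-class posteriors also sum to one per frame, as required for the mixture weights $P(B_k|o_t)$.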
4.3.2.2 Relative Significance of Each Broad Class
In order to evaluate the relative significance and discrimination capability of the different broad phoneme classes, followed by a possible assignment of different numbers of Gaussians, we conducted the following experiment. Spoken queries in the Czech language from the Dev set of SWS 2013 are taken, and two different queries having different content are time-aligned using DTW. Then, the distance corresponding to each broad class is pooled. For instance, let $X^i$ and $Y^i$ be the $i^{th}$ pair of queries; DTW between them gives the time-aligned queries $X_a^i$ and $Y_a^i$ of length $L_i$. The aim is to investigate the relative importance of each broad phoneme class, where a larger distance value indicates greater discrimination. In this context, we propose the discrimination capability $d_B(k)$ associated with each broad phoneme class as:
$$d_B(k) = \frac{1}{IL} \sum_{i=1}^{I} \sum_{l=1}^{L_i} P(B_k|X_a^i(l))\, d_a^i(l), \qquad (4.16)$$
where $d_a^i(l) = dist(X_a^i(l), Y_a^i(l))$ is the distance between aligned feature vectors, and $k$ indexes the five broad phoneme classes ($1 \leq k \leq 5$). Here, $L = \sum_{i=1}^{I} L_i$ is the total number of aligned frames, and $I$ is the total number of query pairs. The values of $d_B(k)$ obtained using MFCC and PLP Gaussian posteriorgram features, i.e., MFCC-GP and PLP-GP, are shown in Table 4.6. A higher distance corresponds to better discrimination, and as shown in Table 4.6, the broad vowel class gives the highest discrimination capability across different word pairs. This higher discrimination of the broad vowel class motivated us to use a larger number of Gaussians for the vowel class. In addition, most phoneme occurrences belong to the vowel category; hence, we assign more Gaussians to the vowel category than to the other broad phoneme categories. Furthermore, it has been shown that vowel sounds can be compressed much more than consonants while preserving the same intelligibility [190]. In this thesis, we considered two cases, where the total numbers of Gaussian components are 64 (= 32, 8, 8, 8, and 8) and 128 (= 64, 16, 16, 16, and 16), assigned to the broad phoneme classes vowel, plosive, semivowel, nasal, and fricative, respectively.
Table 4.6: Contribution of each broad phoneme class (i.e., $d_B(k)$) to discriminating different spoken words. After [214].

Features | Vowel | Plosive | Semivowel | Nasal | Fricative
MFCC-GP | 7.26 | 4.19 | 2.25 | 2.52 | 1.17
PLP-GP | 7.62 | 4.39 | 2.36 | 2.62 | 1.23
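The pooled discrimination measure of eq. (4.16) can be computed directly from the aligned pairs. A minimal sketch, assuming the broad posteriors along the alignment path are already available:

```python
import numpy as np

def broad_class_discrimination(pairs, n_broad=5):
    """Compute d_B(k) of eq. (4.16) from DTW-aligned query pairs.
    pairs: list of (P, d) tuples, where P is an (L_i, n_broad) matrix of broad
    posteriors P(B_k | X_a^i(l)) and d is the (L_i,) frame-distance vector."""
    I = len(pairs)
    L = sum(len(d) for _, d in pairs)      # total number of aligned frames
    dB = np.zeros(n_broad)
    for P, d in pairs:
        dB += P.T @ d                      # sum_l P(B_k | X_a^i(l)) * d_a^i(l)
    return dB / (I * L)                    # normalization of eq. (4.16)
```

Ranking the returned vector reproduces the comparison of Table 4.6, where the vowel class attains the largest pooled distance.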
4.3.2.3 Training Procedure and Posteriorgram Computation
The initialization of the mixture of GMMs is done with $P(B_k|o_t)$. Initially, all the features are split into $K = 5$ broad classes based on:
$$o_t^k := \{o_t \,|\, \underset{1 \leq j \leq K}{\mathrm{argmax}}\, P(B_j|o_t) = k\}. \qquad (4.17)$$
Now, to capture the spread within each broad class, Vector Quantization (VQ) is performed with the Linde-Buzo-Gray (LBG) algorithm with splitting parameter $\epsilon = 0.2$ [184]. More details about the VQ algorithm used in this thesis are given in Figure D.1 of Appendix D. After the initial model parameter estimation, the parameters of the mixture of GMMs are estimated with 10 iterations. The convergence in terms of the log-likelihood, $\mathcal{L}_\theta(o)$, is shown for the MFCC and PLP feature sets in Figure 4.15. It can be observed that as the number of iterations increases, the likelihood $\mathcal{L}_\theta(o)$ converges.
Figure 4.15: Plot of log-likelihood (i.e., Lθ(o)) w.r.t. number of iterations. After [214].
Now, for computing the posterior probability under the mixture of GMMs framework, let $G_i^k$ be the $i^{th}$ Gaussian component of the $k^{th}$ broad class in the mixture of GMMs. We represent $G_i^k$ as a joint event of the $k^{th}$ broad class, $B_k$, and $C_i^k$, the $i^{th}$ Gaussian component in the $k^{th}$ broad class. These two events are conditionally independent.
Figure 4.16: Comparison between the Gaussian posteriorgram and the mixture of GMMs posteriorgram for two different words taken from the TIMIT database, namely, meeting and ocean. Mixture of GMMs posteriorgram for (a) meeting, and (b) ocean. Gaussian posteriorgram for (c) meeting, and (d) ocean. The red circles in (b) and (d) indicate better representation via the mixture of GMMs than its GMM counterpart. After [214].
Hence, the posterior probability is given by:
$$P(G_i^k|o_t) = P(B_k C_i^k|o_t),$$

$$P(G_i^k|o_t) = \frac{P(B_k|o_t)\, P(C_i^k)\, P(o_t|C_i^k)}{\sum_{k=1}^{K} \sum_{j=1}^{M_k} P(B_k|o_t)\, P(C_j^k)\, P(o_t|C_j^k)}, \qquad (4.18)$$

$$P(G_i^k|o_t) = \frac{P(B_k|o_t)\, \pi_i^k\, \mathcal{N}(o_t; \mu_i^k, \Sigma_i^k)}{\sum_{k=1}^{K} \sum_{j=1}^{M_k} P(B_k|o_t)\, \pi_j^k\, \mathcal{N}(o_t; \mu_j^k, \Sigma_j^k)}. \qquad (4.19)$$
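The posteriorgram computation of eq. (4.19) can be sketched as follows for diagonal covariances; the per-class parameter layout is an assumed convention for illustration.

```python
import numpy as np

def mixgmm_posteriorgram(O, p_broad, weights, means, variances):
    """Posterior of each Gaussian G_i^k under the mixture of GMMs (eq. (4.19)),
    with diagonal covariances. p_broad: (T, K) broad-class posteriors P(B_k|o_t);
    weights/means/variances: per-class GMM parameters as lists
    (weights[k]: (M_k,), means[k]: (M_k, D), variances[k]: (M_k, D))."""
    cols = []
    for k in range(len(weights)):
        X = O[:, None, :]                                  # (T, 1, D)
        logN = -0.5 * (np.log(2.0 * np.pi * variances[k]) +
                       (X - means[k]) ** 2 / variances[k])
        lik = np.exp(logN.sum(axis=2)) * weights[k]        # pi_i^k N(o_t; ...)
        cols.append(p_broad[:, k:k + 1] * lik)             # numerator of eq. (4.19)
    post = np.hstack(cols)
    return post / post.sum(axis=1, keepdims=True)          # normalize over all G_i^k
```

Each row of the result is the frame's mixture-of-GMMs posteriorgram vector over all $\sum_k M_k$ Gaussians (e.g., 64 or 128 dimensions for the assignments used in this thesis).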
An algorithm for the training of the mixture of GMMs is shown in Algorithm 7. Figure 4.16 shows the posteriorgram of the mixture of GMMs for the words /meeting/ and /ocean/, along with the corresponding Gaussian posteriorgrams (GP). As shown in Figure 4.16 (c) and (d), the posterior probabilities in the GP are scattered across multiple clusters. This indicates that the mapping between phonetic classes and Gaussian components is one-to-many (which is also discussed in [97]). In the mixture of GMMs posteriorgram, it can be observed that the highest posterior probabilities fall onto the corresponding broad phoneme categories.
96 Algorithm 7 Proposed algorithm for training of mixture of GMMs
Input: Feature vectors $o_t$, broad phoneme posterior probabilities $P(B_k|o_t)$, $1 \leq k \leq K$, and the number of Gaussian components $M_k$ in each $k^{th}$ broad phoneme class.
Output: The parameters of the mixture of GMMs, i.e., $\theta := \{\pi_i^k, \mu_i^k, \Sigma_i^k\}_{k=1, i=1}^{K, M_k}$.
Initialization:
1: for $k = 1$ to $K$ do
2:   Let $o_t^k := \{o_t \,|\, \mathrm{argmax}_{1 \leq j \leq K}\, P(B_j|o_t) = k\}$.
3:   Perform vector quantization (VQ) on the feature vectors $o_t^k$ and let the codebook centroids be $c_i^k$, $1 \leq i \leq M_k$.
4:   Initial mean: $\tilde{\mu}_i^k := c_i^k$.
5:   Codebook data assignment: $o_{ti}^k = \{o_t^k \,|\, \mathrm{argmin}_{1 \leq l \leq M_k}\, d_{Euc}(o_t^k, c_l^k) = i\}$  # set of the feature vectors closest to the centroid
6:   Initial weights: $\tilde{\pi}_i^k := |o_{ti}^k| / |o_t^k|$  # ratio of the number of features in $o_{ti}^k$ to the number of features in $o_t^k$
7:   Initial covariance: $\tilde{\Sigma}_i^k := E[(o_{ti}^k - c_i^k)(o_{ti}^k - c_i^k)^T]$
8: end for
Run the EM algorithm for $N_{iter}$ iterations:
9: for $n_{iter} = 1$ to $N_{iter}$ do
10:   Expectation (E-step): $\gamma_{ti}^k = \frac{\tilde{\pi}_i^k \mathcal{N}(o_t; \tilde{\mu}_i^k, \tilde{\Sigma}_i^k)}{\sum_{j=1}^{M_k} \tilde{\pi}_j^k \mathcal{N}(o_t; \tilde{\mu}_j^k, \tilde{\Sigma}_j^k)}$
11:   Maximization (M-step):
12:   Updated weight: $\pi_i^k = \frac{\sum_{t=1}^{T} P(B_k|o_t)\, \gamma_{ti}^k}{\sum_{t=1}^{T} P(B_k|o_t)}$
13:   Updated mean: $\mu_i^k = \frac{\sum_{t=1}^{T} o_t\, P(B_k|o_t)\, \gamma_{ti}^k}{\sum_{t=1}^{T} P(B_k|o_t)\, \gamma_{ti}^k}$
14:   Updated covariance: $\Sigma_i^k = \frac{\sum_{t=1}^{T} (o_t o_t^T)\, P(B_k|o_t)\, \gamma_{ti}^k}{\sum_{t=1}^{T} P(B_k|o_t)\, \gamma_{ti}^k} - \mu_i^k {\mu_i^k}^T$
15:   Substitution: $\tilde{\pi}_i^k = \pi_i^k$, $\tilde{\mu}_i^k = \mu_i^k$, and $\tilde{\Sigma}_i^k = \Sigma_i^k$.
16: end for
For example, in Figure 4.16 (a), the highest posterior values around the phonemes /m/ and /n/ fall in the nasal broad phoneme category.
Table 4.7: Performance of the mixture of GMMs for SWS 2013 QbE-STD. The numbers in brackets indicate ATWV for the Eval set. After [214].

Feature Sets | NG | Query Type | GMM Dev | GMM Eval | Mixture of GMMs Dev | Mixture of GMMs Eval
MFCC | 64 | Normal | 0.156 | 0.111 (0.111) | 0.217 | 0.180 (0.180)
MFCC | 128 | Normal | 0.188 | 0.138 (0.137) | 0.240 | 0.195 (0.194)
MFCC | 128 | Extended | 0.218 | 0.161 (0.161) | 0.281 | 0.227 (0.224)
PLP | 64 | Normal | 0.158 | 0.126 (0.123) | 0.228 | 0.196 (0.192)
PLP | 128 | Normal | 0.195 | 0.145 (0.145) | 0.246 | 0.208 (0.203)
PLP | 128 | Extended | 0.221 | 0.169 (0.163) | 0.283 | 0.238 (0.233)
NG = Number of Gaussians.
Table 4.8: Performance of score-level fusion with the mixture of GMMs for SWS 2013 QbE-STD. The numbers in brackets indicate ATWV for the Eval set. After [214].

Feature Sets | Normal Dev | Normal Eval | Extended Dev | Extended Eval
CZ | 0.375 | 0.342 (0.340) | 0.401 | 0.371 (0.367)
CZ + MFCCGP | 0.427 | 0.378 (0.375) | 0.463 | 0.418 (0.400)
CZ + MFCCmixGP | 0.442 | 0.398 (0.396) | 0.479 | 0.441 (0.423)
CZ + PLPGP | 0.424 | 0.381 (0.377) | 0.456 | 0.422 (0.403)
CZ + PLPmixGP | 0.444 | 0.397 (0.394) | 0.479 | 0.435 (0.420)
MFCCGP + PLPGP + MFCCmixGP + PLPmixGP | 0.276 | 0.226 (0.211) | 0.315 | 0.265 (0.254)
CZ + MFCCGP + PLPGP + MFCCmixGP + PLPmixGP | 0.455 | 0.412 (0.400) | 0.494 | 0.453 (0.449)
(+ indicates score-level fusion)
4.3.3 Experimental Results
The search performance obtained on the SWS 2013 database is reported in terms of MTWV in Table 4.7. The MTWV for the SWS 2013 dataset is improved by 0.057 and 0.063 w.r.t. the Gaussian posteriorgram for MFCC and PLP, respectively.
4.3.3.1 Score-level Fusion
The details of score-level fusion for different QbE-STD systems were discussed in sub-Section 3.5.1. Here, all four search systems, namely, the Gaussian posteriorgrams and the mixture of GMMs posteriorgrams from MFCC and PLP, participate in the fusion. The performance in terms of MTWV is reported in Table 4.8. It can be observed that score-level fusion further improves the performance of the QbE-STD system. To further improve the query detection performance, a phonetic posteriorgram is used; we employ the BUT phoneme recognizer's CZ phoneme decoder.
[Figure 4.17 appears here: grouped bar plots of MTWV over the detection sources GP, mixGP-25 %, mixGP-50 %, mixGP-75 %, and mixGP-100 % for MFCC, PLP, and MFCC-TMP features, panels (a) and (b); PER values (25.6, 22.3, 20.9, 20.0) and BFER values (16.94, 16.63, 16.56, 16.36) are annotated on top of the plots.]
Figure 4.17: Effect of the amount of labeled data used for broad phoneme posterior probability computation on SWS 2013 QbE-STD for (a) the Dev set, and (b) the Eval set. The percentages indicate the amount of training data used for the DNN. The numbers on top of the plots indicate the Phone Error Rate (PER) and Broad class Frame Error Rate (BFER). The numbers on the bar plots indicate MTWV.

set is comparable with several SWS 2013 benchmark systems [31] (in particular, as shown in Table 2.3: GTTS: 0.399, L2F: 0.342, CUHK: 0.306, BUT: 0.297, proposed CZ + MFCCmixGMM: 0.398, and proposed CZ + PLPmixGMM: 0.397), where '+' indicates score-level fusion.
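The score-level fusion denoted by '+' in Table 4.8 combines, for each (query, utterance) trial, the detection scores of the individual systems. As a minimal sketch, assuming per-system z-normalization followed by unweighted averaging (one common choice; the actual normalization and weighting scheme is the one described in sub-Section 3.5.1):

```python
# Sketch of score-level fusion of multiple QbE-STD systems.
# Each system produces one detection score per trial; scores are
# z-normalized per system and then averaged across systems.
# The z-norm + mean combination is an assumed illustration.

def znorm(scores):
    """Zero-mean, unit-variance normalization of one system's scores."""
    mu = sum(scores) / len(scores)
    var = sum((s - mu) ** 2 for s in scores) / len(scores)
    sd = var ** 0.5 or 1.0          # guard against constant scores
    return [(s - mu) / sd for s in scores]

def fuse(system_scores):
    """Average z-normalized scores across systems, trial by trial."""
    normed = [znorm(s) for s in system_scores]
    return [sum(col) / len(col) for col in zip(*normed)]

# Hypothetical scores from two systems over four trials.
cz    = [0.9, 0.2, 0.4, 0.8]   # e.g., a CZ phonetic posteriorgram system
mixgp = [0.7, 0.1, 0.5, 0.9]   # e.g., an MFCC mixGP system
fused = fuse([cz, mixgp])
print([round(f, 2) for f in fused])
```

The per-system normalization matters because raw DTW distances from different posteriorgram front-ends live on different scales; without it, one system would dominate the average.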
4.3.3.2 Effect of Amount of Labeled Data
In the mixture of GMMs, we use the phoneme posterior probabilities to derive broad phoneme posterior probabilities, i.e., P(Bk|ot). In order to investigate the importance of this supervised posterior probability, we vary the amount of labeled data used for posterior probability computation. We use the TIMIT training set without the |sa| sentences and take different amounts of training data (i.e., 25 %, 50 %, 75 %, and 100 %). A deep neural network (DNN)-based phoneme recognizer was trained using the KALDI toolkit following Karel's DNN training implementation [217]. The DNN contains 3 hidden layers with 1024 hidden units each. These broad phoneme posterior probabilities are used to produce the mixture of GMMs. It can be observed from Figure 4.17 that the amount of training data does not change the performance significantly. The numbers on top of the plots show the Phone Error Rate (PER) measured on the TIMIT test set (excluding the |sa| sentences). It can be observed from Figure 4.17 that the PER decreases as the amount of labeled data increases; however, the Broad classwise Frame Error Rate (BFER) does not decrease much. This is because grouping into broad classes already reduces the classification error, as different phonemes of the same broad class are treated as the same class. The MTWV does not vary significantly with the amount of labeled data used in DNN training; however, it remains better than that of the Gaussian posteriorgram alone. The MTWV obtained with TIMIT-trained posteriors is not as good as the results presented in Table 4.7. This may be due to the language mismatch and the sampling rate conversion: we downsampled the TIMIT (16 kHz) data by a factor of 2 to match the 8 kHz sampling rate of SWS 2013. Nevertheless, the MTWV is better for every amount of labeled data, which indicates that even 25 % of the TIMIT data suffices to train a DNN whose broad phoneme posteriors give better performance than the Gaussian posteriorgram alone. This observation holds for both the Dev and Eval sets.
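The broad phoneme posterior P(Bk|ot) used above can be obtained from the DNN's per-frame phoneme posteriors simply by summing the probabilities of all phonemes belonging to each broad class. A minimal sketch, where the four-class grouping and the phoneme labels are illustrative assumptions (the thesis's actual broad classes may differ):

```python
# Sketch: deriving broad phoneme posteriors P(B_k | o_t) for one frame
# by summing per-phoneme posteriors within each broad class.
# The grouping below is an assumed, simplified illustration.

BROAD_CLASSES = {
    "vowel":     ["aa", "iy", "uw", "eh"],
    "fricative": ["s", "sh", "f", "z"],
    "nasal":     ["m", "n", "ng"],
    "stop":      ["p", "t", "k", "b"],
}

def broad_posteriors(phone_post):
    """phone_post: dict phoneme -> P(phoneme | o_t) for one frame."""
    return {
        bk: sum(phone_post.get(ph, 0.0) for ph in phones)
        for bk, phones in BROAD_CLASSES.items()
    }

frame = {"aa": 0.5, "iy": 0.2, "s": 0.1, "m": 0.1, "t": 0.1}
post = broad_posteriors(frame)
# Summing within classes keeps the result a valid distribution.
assert abs(sum(post.values()) - 1.0) < 1e-9
```

This marginalization is also why BFER is much lower than PER in Figure 4.17: confusions between phonemes of the same broad class (e.g., /aa/ vs. /iy/ above) no longer count as errors.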
4.4 Chapter Summary
This chapter presented novel frame-level acoustic representations, namely, the VTL-warped Gaussian posteriorgram and the mixGP, for the QbE-STD task. The GMM-based approach to estimating the VTLN warping factor α reduces speaker variability and does not require transcription. We experimented with QbE-STD under different evaluation conditions, such as local constraints, number of Gaussians, EM vs. DAEM, etc. The experimental results show that the iterative approach gave the best performance at 3 or 4 iterations; further iterations may lead to overfitting. In the second part of this chapter, we discussed the novel mixture of GMMs for posteriorgram representation. The mixture of GMMs utilizes broad phoneme constraints, which makes it better than the state-of-the-art GP. The experimental results show that the mixture of GMMs gave better performance than the GP in all the experiments conducted in this chapter, and thus, the proposed mixture of GMMs posteriorgram shows promising results over the GP. In the next chapter, we will discuss the matching perspective for the design of a QbE-STD system.
CHAPTER 5
Audio Matching Perspective
5.1 Introduction
Chapter 4 discussed several representations for improving the performance of the QbE-STD system. In this chapter, we will discuss the matching perspective of the QbE-STD system. In particular, this chapter presents a modified DTW search algorithm to deal with partial matching. Partial matching is essential when the instance of a query present in a test document varies at the suffix, at the prefix, or in word order. The proposed modified DTW search algorithm combines the evidences from various partial matching strategies via different functions (namely, harmonic mean, arithmetic mean, and minimum value). The MediaEval QUESST 2014 database was used to investigate the effectiveness of partial matching of spoken queries. We found that the combined distance pooled using the harmonic mean gave better performance for non-exact matches (i.e., T2 and T3 queries), whereas it slightly degrades the performance for the exact match (T1 queries) because of the influence of the other partial matching distances.

In the next part, we will discuss computational improvements to DTW-based searching using two approaches, namely, a feature reduction approach and a BoAW approach. DTW-based search requires huge computation as the dataset size grows, so speeding up subDTW is necessary for scaling the QbE-STD task to significantly larger datasets. In the feature reduction approach, consecutive feature vectors within phonetic segment boundaries are merged, and DTW is performed on the reduced number of feature vectors. A smaller number of feature vectors reduces the computational cost by reducing the number of comparison operations. BoAW is a two-stage search approach for the QbE-STD task: in the first stage, BoAW models are built by computing the term frequency (tf) and inverse document frequency (idf); in the posterior feature framework, a term corresponds to a phoneme class.

The organization of this chapter is as follows: Section 5.2 discusses partial matching for the non-exact query matching task. The frame-merging approach for search space reduction is discussed in Section 5.3. The proposed segment-level BoAW is discussed in Section 5.4.
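Since the first BoAW stage described above reduces each utterance to tf-idf weights over phoneme-class "terms", it can be sketched as follows. The tokenization (one phoneme label per frame or segment) and the plain tf × log-idf weighting are assumed for illustration; Section 5.4 gives the actual formulation.

```python
# Sketch of first-stage BoAW indexing: represent each test utterance
# by tf-idf weights over phoneme-class "terms". Tokenization and the
# exact weighting variant are assumed for illustration.
import math
from collections import Counter

def tf_idf(documents):
    """documents: list of phoneme-label sequences (one per utterance).
    Returns one dict {term: tf-idf weight} per document."""
    n_docs = len(documents)
    df = Counter()                    # document frequency per term
    for doc in documents:
        df.update(set(doc))           # count each term once per document
    vectors = []
    for doc in documents:
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors

# Three hypothetical utterances as phoneme-label sequences.
docs = [["aa", "s", "aa", "t"], ["s", "s", "m"], ["aa", "m", "t", "t"]]
vecs = tf_idf(docs)
```

A coarse first-stage ranking against these vectors prunes most test utterances cheaply, so the expensive DTW of the second stage runs only on a shortlist.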
5.2 Partial Matching for Non-exact Query Matching
The proposed modified approach does not require running DTW multiple times for each query and test utterance pair. The following are the various partial matching strategies employed in the modified DTW search algorithm [2]. Figure 5.1 demonstrates the partial matching cases along with the exact-matching case.
• Type-1 (T1) exact match: We do not need any modification to detect a T1 query, since it is an exact match. Hence,
\[
\mathrm{dist}_1(t_x, q_y) = \min_{i} \frac{S(i, N)}{T(i, N)}. \tag{5.1}
\]
• Type-2 (T2) forward partial match: Here, the query is truncated at the end, and we assume that the truncated duration is no more than 250 ms. Hence,
\[
\mathrm{dist}_2(t_x, q_y) = \min_{i,\; N-24 \le j \le N} \frac{S(i, j)}{T(i, j)}. \tag{5.2}
\]
Here, 250 ms corresponds to 25 frames (as discussed in sub-Section 3.3.1, the frame rate is 100 frames per second). The warping path of this partial matching aligns with the initial portion of the exact-match warping path, and hence, this partial matching distance supports the alignment distance obtained using the exact match.
• Type-2 (T2) backward partial match: The truncation is at the beginning and is no more than 250 ms. Since the DTW algorithm accumulates distance in the forward direction, this procedure needs to perform backtracking. Here, our approach considers only a single backtracking instead of multiple backtrackings, in order to avoid false detections due to partial matching. To perform this, we first detect the end frame index, e_bt, and the corresponding start frame index, s_bt, of the test utterance, i.e.,
\[
e_{bt} = \operatorname*{arg\,min}_{i} \frac{S(i, N)}{T(i, N)}, \tag{5.3}
\]
\[
s_{bt} = P(e_{bt}, N). \tag{5.4}
\]
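The T1 and T2-forward distances of Eqs. (5.1)–(5.2) can be sketched directly, assuming a subDTW pass has already filled the accumulated-distance matrix S(i, j) and the path-length matrix T(i, j) (i over test end-frames, j over query frames, with j = N the query end). The toy matrices below are illustrative, not real data.

```python
# Sketch of the partial-match distances in Eqs. (5.1)-(5.2), given
# precomputed subDTW matrices S (accumulated distance) and T (path
# length). Rows index test end-frames, columns index query frames.

def dist_exact(S, T):
    """Eq. (5.1): min_i S(i, N) / T(i, N) -- T1 exact match."""
    N = len(S[0]) - 1
    return min(S[i][N] / T[i][N] for i in range(len(S)))

def dist_forward_partial(S, T, max_trunc=25):
    """Eq. (5.2): also allow the path to stop up to max_trunc - 1
    query frames early (25 frames = 250 ms at 100 frames/s) --
    T2 forward partial match."""
    N = len(S[0]) - 1
    lo = max(0, N - (max_trunc - 1))
    return min(S[i][j] / T[i][j]
               for i in range(len(S))
               for j in range(lo, N + 1))

# 3 test end-frames x 4 query frames of toy accumulated distances.
S = [[4.0, 6.0, 7.5, 9.0],
     [3.0, 4.5, 5.0, 8.0],
     [5.0, 6.5, 8.0, 8.5]]
T = [[1.0, 2.0, 3.0, 4.0]] * 3   # corresponding path lengths
print(dist_exact(S, T), dist_forward_partial(S, T, max_trunc=2))
```

Because the forward-partial minimum is taken over a superset of the exact-match candidates, dist2 can never exceed dist1, which is why it "supports" rather than contradicts the exact-match alignment.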
Figure 5.1: Modified DTW for non-exact matching: (a) exact match case, (b) T2 query forward partial match case, (c) T3 query with filler case, and (d) T3 query with word reordering case (local distance matrix on the top panel and optimal warping path on the bottom panel). The dotted circle shows the non-matching portion for approximate search. After [2].