AN INDIRECT SPEECH ENHANCEMENT FRAMEWORK THROUGH INTERMEDIATE NOISY SPEECH TARGETS

A Dissertation Presented to The Academic Faculty

By

Sicheng Wang

In Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the School of Engineering Department of Electrical and Computer Engineering

Georgia Institute of Technology

Mar 2021

© Sicheng Wang 2021

Thesis committee:

Prof. Chin-Hui Lee, Advisor
Electrical and Computer Engineering
Georgia Institute of Technology

Prof. Sabato Marco Siniscalchi, Co-Advisor
Computer Engineering
Università degli Studi di Enna "Kore"

Prof. David Anderson
Electrical and Computer Engineering
Georgia Institute of Technology

Prof. Alexander Lerch
School of Music
Georgia Institute of Technology

Prof. Elliot Moore II
Electrical and Computer Engineering
Georgia Institute of Technology

Date approved: Feb. 24, 2021

A thousand roads lead a man forever toward Rome.
Alain de Lille

For my parents

ACKNOWLEDGMENTS

I want to express my sincere gratitude to my advisor and mentor, Prof. Chin-Hui Lee, for offering me the opportunity to join the human language technology lab and leading me to research in the fast-evolving field of automatic speech processing. As my advisor, he lends his insights on many complex problems. As a mentor, he is patient and guides me through setbacks. I am deeply indebted to him for his advice and kindness.

I also want to thank my co-advisor, Prof. Sabato Marco Siniscalchi. I am immensely thankful for his guidance and encouragement during my study. As an experienced researcher, he helped me brainstorm many ideas, plan my experiments, and proofread my writing. I am deeply impressed by his diligence at work and attention to detail. I will always cherish the memories of the many discussions and hang-outs we had together.

I am also grateful to my committee members, Prof. David Anderson, Prof. Elliot Moore II, and Prof. Alexander Lerch, for serving on my thesis reading committee. Prof. Anderson and Prof. Moore provided much helpful feedback after my proposal. Prof. Moore also kindly offered me the use of his computing resources for my experiments.

I was fortunate to work with a team of talented lab mates at Georgia Tech. Dr. You-Chi Cheng shared his internship experience with me. Dr. I-Fan Chen helped me furnish the voice tag package. Dr. Zhen Huang provided many useful tips during my job search. Dr. Kehuang Li coached me to maintain the server cluster. Dr. Wei Li, Qi Jun, Hu Hu, Chao-Han Huck Yang, and Yongliang He collaborated with me on several projects.

I had the opportunity to collaborate with some industrial researchers both during semesters and during the summers. Each project rewarded me with invaluable experience. I want to thank Dr. Pongtep Angkititrakul, Dr. Zhe Feng, Mr. Peter Eastty, Dr. Chao Weng, Dr. Joshua Atkins, Dr. Jason Wung, Dr. Ramin Pishehvar, Dr. Ming Lei, Dr. Zhijie Yan, Yongtao Jia, and Linzhang Wang.

My life at the lab was also enjoyable and memorable because of the visiting scholars, including Dr. Yong Xu, Dr. Ji Wu, Dr. Jing Zhang, Dr. Fengpei Ge, Dr. Hu Chen, Dr. Haifeng Sun, Dr. Bo Wu, Dr. Yi Lin, Dr. Quandong Wang, Dr. Gang Chen, Leonard Loo Tiang Kuan, Meri Tan, and friends at CSIP, Sam Li, Dr. Meng Zhong, Dr. Kyle Xu, Dr. Zhen Wang, Dr. Yuting Hu, Dr. Le Liang, Dr. Hao Ye, Ziyan He, and Helen Li. Finally, I want to thank Pat Dixon, Raquel Plaskett, Dr. Daniela Staiculescu, and Tasha Torrence for their great administrative support.

Lastly, I want to thank my parents for their support and encouragement during my study away from home. I am grateful to my friends, Huaidong Yang, Ruxiu Liu, Congshan Wan, Li Wang, Yuan Gao, Yingdan Wu, Qiming Zhang, and Xiaoyao Liu, for being part of my life in Atlanta.

TABLE OF CONTENTS

Acknowledgments

List of Tables

List of Figures

List of Acronyms

Chapter 1: Introduction
1.1 Overview
1.2 Main contributions

Chapter 2: Background
2.1 Speech enhancement
2.1.1 Classical signal processing techniques
2.1.2 Deep learning methods
2.1.3 Progressive speech enhancement
2.2 Characterization of noise
2.3 Speech transformation
2.3.1 Mean-variance normalization
2.3.2 Exemplar-based methods
2.3.3 Multi-layer perceptrons
2.3.4 Generative models
2.4 A high-level description of the proposed progressive enhancement approach with intermediate noisy speech target

Chapter 3: Characterization of Additive Noise
3.1 Introduction
3.1.1 Noise in speech enhancement
3.1.2 Enhancement quality and improvement depending on noise types
3.2 On the criteria to select intermediate targets
3.2.1 Simple noise with high PESQ score
3.2.2 Noise shape
3.2.3 Bandwidth
3.2.4 Stationarity
3.3 Summary

Chapter 4: Indirect Speech Enhancement with Supervised Learning
4.1 Introduction
4.2 Matching feature statistics
4.2.1 Effects of noise in feature normalization in speech enhancement
4.2.2 Deviation of mean in normalizing speech in difficult noise types
4.2.3 Deviation of variance in normalizing speech in difficult noise types
4.2.4 Mean-variance matching
4.2.5 Histogram equalization
4.2.6 Experiments and discussions
4.3 Speech conversion with DNN mapping
4.3.1 DNN training
4.3.2 Experiments and discussions
4.4 Interference of multiple noise sources
4.4.1 Framework of indirect enhancement with two noise sources
4.4.2 Experiments and discussions
4.5 Summary

Chapter 5: Indirect Speech Enhancement with Latent Space Learning
5.1 Introduction
5.2 Representational learning via auto-encoder
5.2.1 Latent space of speech features using PCA
5.2.2 Use of nonlinear auto-encoders to convert speech features
5.2.3 Experimental results
5.3 Dictionary-based indirect speech conversion and enhancement
5.3.1 Problem formulation
5.3.2 Experiments and discussions
5.4 Summary

Chapter 6: Conclusions
6.1 Summary of research
6.2 Contributions
6.2.1 Noise characterization
6.2.2 Indirect enhancement via supervised learning
6.2.3 Indirect enhancement via representational learning
6.3 Future work
6.3.1 Theoretical characterization of noise types
6.3.2 Noisy speech with multiple sources
6.3.3 Disentanglement of latent feature
6.3.4 Explorations of different deep architectures for speech transformation

Appendices
Appendix A: Derivation of mean deviation in the normalization of LPS feature
Appendix B: Derivation of the variance deviation in the normalization of LPS feature
Appendix C: Definition of colored noise
Appendix D: Description of Nonspeech noise
Appendix E: Description of Noisex92 noise

References

Vita

LIST OF TABLES

3.1 PESQ on Noisex92 noise types
3.2 PESQ of conversion using simple noise as the intermediate targets
3.3 Comparison of MSE of direct and indirect methods on the 1st quantile of noisy speech
3.4 Effects of spectral shapes on the suitability as intermediate targets
3.5 MSE comparison between volvo and ovlov noise
3.6 Effects of bandwidth of conversion targets
3.7 Effects of stationarity of conversion targets
4.1 PESQ score with and without matching statistics
4.2 KLD between other noise and volvo in a hidden layer
4.3 Progressive indirect enhancement with volvo intermediate noise
4.4 MSE between various learning pairs
4.5 3 dB babble noise mixed with various colored noise at 3 dB. The intermediate target is babble for Path 1 and the corresponding colored noise for Path 2
4.6 3 dB factory noise mixed with various colored noise at 3 dB. The intermediate target is factory1 for Path 1 and the corresponding colored noise for Path 2
4.7 Babble and colored noise at various SNR
4.8 Factory1 and colored noise at various SNR
5.1 Effects of stationarity of conversion targets
5.2 PESQ after enhancing using different depths of auto-encoders
5.3 PESQ after enhancing using different bottleneck width
5.4 PESQ of converted speech at various SNR levels
5.5 Results of noise aware training
5.6 Results of domain adversarial auto-encoder
5.7 Effect of spectral compression in exemplar conversion
5.8 Effect of codebook size in NMF conversion
5.9 NMF-based conversion on various SNR levels

LIST OF FIGURES

2.1 Suppression vs. a priori SNR, ξk, of a Wiener filter
2.2 Suppression vs. a posteriori SNR, γk, of an MLE, spectral subtraction, and a Wiener filter
2.3 Comparison of the suppression gain between an MMSE and a Wiener filter
2.4 Progressive speech enhancement by SNR
2.5 SNR affecting the results of automatic speech processing systems
2.6 Progressive learning in speech enhancement
2.7 Relocation of noise in the time domain
2.8 Relocation of noise in the frequency domain
2.9 Framework of indirect speech enhancement
2.10 The existence of intermediate targets
3.1 PESQ of 100 types of Nonspeech noise
3.2 Spectrograms of sample noises. Top row contains difficult noise types. Middle row contains simple noise types. Bottom row shows the outliers
3.3 Spectral shape of clean speech and some easy noise. Long-term average speech spectrum shown in red
3.4 k-means clustering with t-SNE projection on 115 noise types
3.5 Long-term average spectra of speech, volvo, and ovlov
3.6 Non-stationary examples of volvo-like noise
4.1 Framework of indirect speech enhancement
4.2 PCA projections of normalized features for different noises
4.3 The observed mean deviation agrees with the estimated mean deviation
4.4 The observed variance deviation agrees with the estimated variance deviation
4.5 Deviation from zero mean (left) and unit variance (right) between simple and difficult noise samples. White is an example of difficult noise and volvo is an example of simple noise
4.6 Comparison of feature distribution by SNR and by noise
4.7 Effect of histogram equalization on feature distribution
4.8 Effect of histogram equalization in hidden layers
4.9 DNN architecture for feature mapping and joint training
4.10 Conversion and enhancement of speech in babble noise
4.11 Conversion and enhancement of speech in …
4.12 Conversion and enhancement of speech in …
4.13 Effect of conversion on feature distribution
4.14 Indirect enhancement of multiple interferences
4.15 Average power spectrum density of speech and various colored noise
5.1 First 9 latent speech bases extracted from clean and noisy speech
5.2 First latent white noise basis
5.3 Conversion of noisy speech from white noise into pink and volvo with PCA
5.4 PCA vs Deep AE in manifold learning
5.5 Use auto-encoders to convert noisy speech into simpler noise
5.6 Architecture of noise aware speech conversion
5.7 Use domain adversarial auto-encoders to convert noisy speech into simpler noise
5.8 Architecture of vector quantized auto-encoder
5.9 Conversion quality with respect to data size in auto-encoders
5.10 Size of codebook and dimension of codebook features
5.11 … compression with exponentiation factor, ρ
5.12 Comparison between AE and NMF converted speech
5.13 Comparison between different codebook composition
C.1 Spectrograms and PSDs of some colored noise
E.1 Spectrograms and PSDs of Noisex92 noise
E.1 Spectrograms and PSDs of Noisex92 noise (cont.)

LIST OF ACRONYMS

AE auto-encoder
ASR automatic speech recognition
CASA computational auditory scene analysis
CDF cumulative distribution function
CMVN cepstral mean variance normalization
CNN convolutional neural network
DFT discrete Fourier transform
DNN deep neural network
FFT fast Fourier transform
GAN generative adversarial network
GMM Gaussian mixture model
HMM hidden Markov model
IRM ideal ratio mask
KLD Kullback–Leibler divergence
LPS log power spectrum
LSTM long short-term memory
MLE maximum likelihood estimator
MLP multilayer perceptron
MMSE minimum mean square error
MSE mean squared error
NMF non-negative matrix factorization
OLA overlap-add
PCA principal component analysis
PESQ Perceptual Evaluation of Speech Quality
PLP perceptual linear prediction
PSD power spectrum density
RASTA relative spectra
RBM restricted Boltzmann machine
ReLU rectified linear unit
SE speech enhancement
SNR signal-to-noise ratio
STFT short-time Fourier transform
SVD singular value decomposition
SVM support vector machine
t-SNE t-distributed stochastic neighbor embedding
TF time-frequency
VAD voice activity detector
VAE variational auto-encoder
VQ vector quantization
WSJ Wall Street Journal

SUMMARY

Noise presents a severe challenge in speech communication and processing systems. Speech enhancement aims at removing the interference and restoring speech quality. It is an essential step in the speech processing pipeline of many modern electronic devices, such as mobile phones and smart speakers. Traditionally, speech engineers have relied on signal processing techniques, such as spectral subtraction or Wiener filtering. Since the advent of deep learning, data-driven methods have offered an alternative solution to speech enhancement. Researchers and engineers have proposed various neural network architectures to map noisy speech features into clean ones. In this thesis, we refer to this class of mapping-based data-driven techniques collectively as a direct method in speech enhancement. The output speech from direct mapping methods usually contains noise residue and unpleasant artifacts if the speech power is low relative to the noise power or the noise is very complex. The former adverse condition refers to a low signal-to-noise ratio (SNR). The latter condition implies difficult noise types. Researchers have proposed improving the SNR of the speech signal incrementally during enhancement to overcome such difficulty, an approach known as SNR-progressive speech enhancement. This design breaks down the problem of direct mapping into manageable sub-tasks. Inspired by the previous work, we propose to adopt a multi-stage indirect approach to speech enhancement in challenging noise conditions. Unlike SNR-progressive speech enhancement, we gradually transform noisy speech from difficult background noise to speech in simple noise types. The thesis's focus will include the characterization of background noise, speech transformation techniques, and the integration of an indirect speech enhancement system.

CHAPTER 1
INTRODUCTION

1.1 Overview

Single-channel speech enhancement aims at recovering clean speech from a mixture of interfering speech, background noise, and channel distortion [1]. Many classic speech enhancement techniques, such as spectral subtraction [2, 3] and Wiener filtering [4, 5], rely on an accurate estimation of the noise spectrum, usually calculated as a smoothed average of past observations during speech pauses. However, when the noise spectrum is non-stationary, an accurate estimation cannot be reliably obtained. Adaptive methods [6, 7, 8] partially alleviate this issue by recursively averaging the noise's short-time spectrum. More recently, data-driven methods based on deep neural networks (DNNs) have achieved an impressive improvement in terms of perceptual quality and intelligibility, especially in some non-stationary noise conditions [9, 10, 11]. The DNN-based approach finds a non-linear function to map noisy speech features into enhanced features. Unlike many classical methods, this approach does not assume that speech or noise follows a particular distribution or that they are independent. Authors of [9, 10] showed that the DNN-based approach was more effective in handling non-stationary noise than statistical models, such as the minimum mean square error (MMSE) short-time spectral amplitude estimator [12], and rendered enhanced speech with better quality.

Many works on conventional speech enhancement algorithms have noted that non-stationary noise is typically more challenging to handle than stationary noise [8]. The disparity in speech enhancement performance across different backgrounds is also observed in DNN-based speech enhancement systems [13]. A direct mapping, such as the DNN in [10], does not address the variability of noise types or different signal-to-noise ratio (SNR) conditions. On the other hand, an indirect approach decomposes the process of regression-based mapping into smaller tasks. Notable examples include [14, 15], in which the authors designed a series of sub-tasks to improve the SNR in a noisy speech signal incrementally. In the SNR-progressive learning paradigm, each sub-network explicitly learns an intermediate target with a slightly higher SNR. The authors of [14, 15] first mapped speech in acoustically challenging environments to a partially de-noised signal. Next, it was refined to clean speech in subsequent stages. The authors showed that this approach consistently outperformed direct mapping, especially in low SNR environments.

Instead of gradually improving the SNR in speech, we could consider replacing the background noise types to remove the noise with greater ease. In this indirect framework of enhancement, we first convert speech in challenging acoustic conditions, such as loud machinery or interfering speakers, to speech in less destructive noise, such as an office or home environment. Since speech in an office or home environment is simpler to handle than speech in machinery or babble, we could refine it to clean speech with better quality. When the original acoustic environment is complicated, for example, an environment with multiple noise sources, the indirect approach can be extended to establish multiple intermediate representations with different background noise in the process.

Furthermore, the indirect approach based on noise type conversion is different from the SNR-based progressive learning in [14, 15]. In their work, the authors of [14, 15] assumed that a higher SNR in the signal corresponded to better speech quality. Hence, they designed SNR-based progressive learning to improve the SNR in each stage, which naturally led to an incremental improvement of speech quality. In contrast, there is no obvious criterion to gauge the difficulty level of noise types, even though one could find some noise conditions more disruptive to speech communication in daily life. Previous work [2, 12] has often cited non-stationarity of the background noise as a key factor responsible for the quality degradation. However, the discussion of the noise characteristics to which DNN-based enhancement is sensitive is somewhat limited in the literature. Hence, we need a detailed characterization of additive noise and its interaction with speech. We will use such knowledge to calibrate noise types and determine suitable intermediate targets in the proposed framework of indirect speech enhancement.

After anchoring the intermediate targets for noisy speech in difficult noise types, we design a noisy speech transformer that converts difficult speech to simpler speech. It is followed by a refinement module that maps the simpler speech to clean speech. An ideal noisy speech converter in this indirect approach should replace the background noise while keeping speech unchanged. Nonetheless, speech suppression and artifacts are usually inevitable [16]. We need to consider a trade-off between speech distortion and noise transformation to achieve optimal conversion and enhancement. Another issue with converter design is the availability of converted samples as training targets. Data-driven methods such as DNN-based mapping [10] often require a large amount of aligned data. Techniques developed with stronger assumptions of signal properties, such as vector quantization (VQ) [17], the Gaussian mixture model (GMM) [18], and non-negative matrix factorization (NMF) [19], are usually less parameterized, so less or no aligned data is required to train these models. We will evaluate some of the methods above, too. The third issue in converter design is integrating each sub-task into an overall speech enhancement system. The authors of SNR-progressive learning [14] adopted smaller DNNs to perform each conversion stage, so the networks could be easily concatenated and jointly optimized. On the other hand, it is not straightforward to combine a sample-level converter with a frame-level refinement module. As a result, all these factors need to be considered when designing the conversion and refinement steps in our indirect approach.

Thanks to substantial interest in voice conversion, music morphing, and hearing-aid design, an extensive collection of waveform and spectral conversion techniques has been proposed, including the aforementioned VQ, GMM, NMF, and DNN, as well as unit selection [20] and frequency warping [21]. We will select and compare the conversion techniques that best address the various issues in indirect speech enhancement.

1.2 Main contributions

This thesis aims to investigate the feasibility of a multi-stage speech enhancement approach that gradually replaces the background noise in noisy speech. The contributions of our work are summarized as follows.

Our first contribution is the characterization of additive noises in the context of speech enhancement. In Chapter 3, we consider the frequency and temporal properties of noise signals and empirically evaluate their effects on speech enhancement. We also show how adverse noise conditions cause feature mismatch as a result of improper normalization.

Our second contribution is the design of speech transformation techniques using supervised learning. Chapter 4 presents our first architecture of the indirect approach to speech enhancement. We transform source noisy speech into intermediate target speech by matching their feature distributions or frame-level details. Experimental studies also demonstrate that we can extend the proposed method to handle multiple noise sources.

Our third contribution is the design of indirect speech enhancement and speech transformation when no parallel utterances are available for supervised learning. In Chapter 5, we leverage representation learning to discover hidden structures of speech and noise in noisy speech mixtures. The latent representation allows us to manipulate speech and noise separately by replacing the background noise in the latent space. This operation accomplishes speech transformation, a critical step in our indirect speech enhancement framework.

Lastly, we conducted thorough experiments to validate the proposed framework of indirect speech enhancement. We use the knowledge derived from Chapter 3 to select reasonable intermediate targets. The speech transformers in Chapter 4 and Chapter 5 are combined with refinement modules to perform indirect multi-stage speech enhancement. Our experimental results show that the indirect approach can yield performance gains over direct mapping in challenging acoustic conditions.

CHAPTER 2
BACKGROUND

2.1 Speech enhancement

When speech is corrupted by background noise, speech enhancement can recover clean speech for better quality and intelligibility. We consider the following additive noise model in this thesis. The additive noise model assumes that the noise-corrupted speech or noisy speech, y[n], is the sum of the clean speech signal, x[n], and the additive interference, d[n]. In Equation 2.1, we assume that the speech and noise are additive and uncorrelated.

y[n] = x[n] + d[n]. (2.1)
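As a concrete illustration of Equation 2.1, the following minimal Python/NumPy sketch scales a noise signal and adds it to a clean signal so that the mixture reaches a chosen SNR; the synthetic tone and white noise stand in for real recordings, and the 16 kHz length and 5 dB target are arbitrary choices rather than settings from this thesis.

import numpy as np

def mix_at_snr(x, d, snr_db):
    """Scale noise d and add it to clean speech x so the mixture y = x + alpha * d has the given SNR."""
    d = d[: len(x)]                                        # align lengths
    p_x = np.mean(x ** 2)                                  # clean speech power
    p_d = np.mean(d ** 2)                                  # noise power
    alpha = np.sqrt(p_x / (p_d * 10 ** (snr_db / 10.0)))   # noise gain for the target SNR
    return x + alpha * d

rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000.0)   # stand-in "clean" signal
d = rng.standard_normal(16000)                             # stand-in white noise
y = mix_at_snr(x, d, snr_db=5.0)                           # noisy speech at 5 dB SNR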

Converting the signal into the frequency domain offers the following advantages:

• Filters at different frequencies or frequency bands can be designed and handled independently from one another. Therefore, there is significant flexibility in dealing with colored noise, which generally possesses prominent frequency characteristics.

• Most of our knowledge and understanding of speech production and perception are related to frequencies.

• Thanks to the fast Fourier transform (FFT), the implementation of frequency-domain filters is generally very efficient.

Because speech is a non-stationary signal in general, its temporal and spectral characteristics could vary over time. We can nevertheless assume that speech is stationary within a short analysis window, typically 10-30 ms. We define the short-time N-point discrete Fourier transform (DFT) [22] of the noisy speech signal, Y(m, k):

Y(m, k) \triangleq \sum_{l=-\infty}^{\infty} y[l]\, w[mR - l]\, e^{-j 2\pi l k / N}.   (2.2)

In Equation 2.2, y[l] is the speech signal, and w[l] is a window function of length N, such as a Hamming window [23] or a Hann window [24]. The hop size is R samples. The frame index, m, is the location of the analysis window. The frequency index, k, corresponds to the frequency 2πk/N, for k = 0, 1, ..., N − 1. We can define the short-time DFTs of the clean speech and the noise in the same manner. Thus, Equation 2.1 in the short-time frequency domain is

Y(m, k) = X(m, k) + D(m, k).   (2.3)
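The short NumPy sketch below frames a waveform and takes a DFT per frame in the spirit of Equation 2.2; the Hann window, 512-point frame length, and 256-sample hop are illustrative choices, not the analysis settings used later in this thesis.

import numpy as np

def stft(y, n_fft=512, hop=256):
    """Short-time DFT of Eq. (2.2): window each length-n_fft frame and take its DFT."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack([y[m * hop : m * hop + n_fft] * window for m in range(n_frames)])
    return np.fft.rfft(frames, axis=1)                 # Y(m, k), one row per frame m

y = np.random.randn(16000)                             # stand-in for a noisy waveform
Y = stft(y)
magnitude, phase = np.abs(Y), np.angle(Y)              # |Y(m, k)| and its phase
lps = np.log(magnitude ** 2 + 1e-12)                   # log power spectrum (LPS) features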

In its polar form, the DFT coefficients can be expressed as

Y(m, k) = |Y(m, k)|\, e^{j \angle Y(m, k)},   (2.4)

where |Y (m, k)| is the magnitude and ∠Y (m, k) is the phase. The power spectrum, Py(k), can be defined as

P_y(k) = E\big[ |Y(m, k)|^2 \big],   (2.5)

where the expectation is taken over the observed signal for a unit duration. The power

spectrum of clean speech, Px(k), and that of noise, Pd(k), can be defined similarly. From the power spectra, we can define two SNR quantities frequently used in the derivation of

spectrum estimators: the a priori SNR, ξk, and the a posteriori SNR, γk

\xi_k = \frac{P_x(k)}{P_d(k)},   (2.6)

and

\gamma_k = \frac{P_y(k)}{P_d(k)}.   (2.7)

The a priori SNR, ξk, represents the oracle SNR at the frequency bin, k, whereas the a posteriori SNR, γk, is the observed SNR at bin, k, in the noisy speech.

Because the authors of [25, 26] have shown that the phase spectrum does not affect intelligibility and is less critical for speech quality, most works focus only on the restoration of the magnitude spectrum. Speech enhancement can then be formulated as an estimation of the clean magnitude spectrum, |X̂(m, k)|, from the noisy speech magnitude spectrum, |Y(m, k)|. To convert the estimated DFT, X̂(m, k), back to a waveform, we perform the inverse DFT and the overlap-add (OLA) algorithm [27].

The performance of a speech enhancement system can be evaluated subjectively and objectively. In a subjective test, human listeners are asked to rate the quality of enhanced speech or identify intelligible words. These tests do not generally yield reliable conclusions on their own. They need to be combined with appropriate statistical tests to assess whether a speech enhancement system can improve speech quality [28]. Furthermore, they are time-consuming and costly. In contrast, objective metrics are efficient and reliable if the metrics maintain a high correlation with subjective listening. Some commonly seen metrics include the log spectral distortion [29], weighted-slope spectral distance [30], segmental SNR [31], and Perceptual Evaluation of Speech Quality (PESQ) [32]. Among these objective measures, PESQ yielded the highest correlation with subjective assessments [33]. It ranges from -0.5 to 4.5, with higher scores indicating better speech quality. It will serve as the main evaluation metric in this thesis.

Due to the importance of speech enhancement, the topic has received much attention in the speech community. The classical methods fit into three main categories. Spectral subtraction algorithms, first proposed by Weiss in the time domain [34] and Boll in the frequency domain [3], are the most intuitive to understand. They assume noise is additive, and one can obtain an estimate of clean speech by subtracting the noise spectrum estimated during speech pauses. The second primary class includes the statistical model-based algorithms. These algorithms consider speech enhancement in a statistical estimation framework. They assume that the DFT coefficients of noisy speech depend on the DFTs of clean speech. The task is to find an estimator of the DFT coefficients of the clean signal. Notable algorithms include Wiener filtering [5], speech/non-speech detection [35], and MMSE estimators [12]. The last class, subspace algorithms, is based on linear algebra theory. These algorithms assume that clean speech is confined to a subspace of noisy speech. Vector decomposition techniques, such as singular value decomposition (SVD), can be exploited to separate the speech and noise subspaces. This line of work was pioneered by Dendrinos [36] and later by Ephraim and Van Trees [37]. The speech community has also studied and developed other approaches based on the multilayer perceptron (MLP) [38, 39], GMM [40], principal component analysis (PCA) [41], mask estimation [42], NMF [43], support vector machine (SVM) [44], and more recently, DNN [9, 10, 45]. In the rest of the section, we will briefly review the basic principles of the three major classes of speech enhancement algorithms.

2.1.1 Classical signal processing techniques

Spectral subtraction

Spectral subtraction, first proposed in [3], is one of the most intuitive and heuristic methods. This class of algorithms exploits the assumption that background noise is additive and stationary. In its most basic form, one first obtains an estimate of the noise spectrum during speech pauses using a voice activity detector (VAD). Since noise is assumed to be stationary, its spectrum does not change at the next speech onset. We recover the clean speech spectrum by subtracting the noise spectrum from the noisy speech spectrum. We then update the estimate of the noise spectrum at the next speech pause. To recover the speech waveform from the frequency domain, one performs the inverse DFT and OLA in reconstruction. The following equation summarizes the principle of spectral subtraction

|\hat{X}(m, k)| = |Y(m, k)| - |\hat{D}(m, k)|.   (2.8)

In Equation 2.8, |D̂(m, k)| and |X̂(m, k)| denote the estimated noise spectrum and clean speech spectrum at frequency bin, k, respectively. Alternatively, spectral subtraction can also be formulated with the power spectrum instead of the magnitude spectrum

|\hat{X}(m, k)|^2 = |Y(m, k)|^2 - |\hat{D}(m, k)|^2.   (2.9)

We can rearrange the terms in Equation 2.9 and make substitutions with a priori and a posteriori SNR

|\hat{X}(m, k)| = \sqrt{|Y(m, k)|^2 - |\hat{D}(m, k)|^2}
                = \sqrt{\frac{P_y(k) - \hat{P}_d(k)}{P_y(k)}}\, |Y(m, k)|
                = \sqrt{\frac{\xi_k}{\xi_k + 1}}\, |Y(m, k)|
                = \sqrt{\frac{\gamma_k - 1}{\gamma_k}}\, |Y(m, k)|.   (2.10)

In linear systems, the factor in front of |Y (m, k)| is known as the system’s transfer function. In the context of speech enhancement, it is also referred to as the suppression gain [16]. Hence, the suppression gain, H(k), for spectral subtraction is

H(k) = \sqrt{\frac{\xi_k}{\xi_k + 1}} = \sqrt{\frac{\gamma_k - 1}{\gamma_k}}.   (2.11)

Spectral subtraction is straightforward to understand and implement. It is also an efficient algorithm, as it only requires one forward computation in the subtraction [16]. Because the clean speech magnitude estimate, |X̂(m, k)|, must stay positive, one must exercise caution in applying Equation 2.8. An easy solution is to apply a half-wave rectifier on the difference spectrum

|\hat{X}(m, k)| = \begin{cases} \sqrt{|Y(m, k)|^2 - |\hat{D}(m, k)|^2}, & \text{if } |Y(m, k)|^2 \ge |\hat{D}(m, k)|^2, \\ 0, & \text{otherwise.} \end{cases}   (2.12)

Equation 2.12 ensures that the estimated speech magnitude always stays non-negative. Nevertheless, the nonlinear truncation of negative values creates short and unrelated peaks in the speech spectrum. After converting the signal back to the time domain, these peaks translate to tones with frequencies varying from frame to frame. Such distortions are commonly referred to in the literature as musical noise [2]. They are particularly noticeable during unvoiced speech, where the speech power is relatively low. Some studies have reported that musical noise can be perceptually more disruptive to human listeners than the original background noise [16]. For this reason, much research has gone into finding ways to reduce musical noise.

A notable example of works in this area is spectral oversubtraction by Berouti [2]. It was motivated by the observation that some peaks in the difference spectrum, |Y(m, k)| − |D̂(m, k)|, were broadband, whereas others were narrowband. By subtracting an amplified noise spectrum controlled by an augmentation factor, one could reduce the magnitude of the broadband peaks. Oversubtraction also levels deep valleys in the spectrum by applying a spectral floor when speech is absent. Berouti conducted thorough empirical studies to evaluate the choice of the augmentation factor and the spectral floor level. The oversubtraction method was further extended by Lockwood et al. in [46]. They modified the augmentation factor so that it depended on the frequency. The modification was motivated by the observation that much real-world noise affects different frequency regions unevenly. Thus, larger values should be subtracted from frequency bands with low SNR; smaller values should be subtracted from bands with high SNR.

The effectiveness of spectral subtraction methods has been studied extensively. Most studies concurred that this class of algorithms improves speech quality but not speech intelligibility [47, 48]. Its adverse effect on speech intelligibility can be explained by the occasional elimination of low-power speech regions due to inaccurate noise estimation. It remains an open question whether more advanced noise estimation techniques can improve speech intelligibility in spectral subtraction methods.
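The following sketch applies power spectral subtraction with an oversubtraction (augmentation) factor and a spectral floor in the spirit of Berouti's method [2]; it assumes the noise PSD has already been estimated during speech pauses, and the factor values alpha and beta are illustrative, not the values studied in [2].

import numpy as np

def oversubtract(Y, noise_psd, alpha=4.0, beta=0.01):
    """Power spectral subtraction (Eq. 2.9) with oversubtraction factor alpha and spectral floor beta."""
    noisy_psd = np.abs(Y) ** 2
    clean_psd = noisy_psd - alpha * noise_psd              # subtract an amplified noise estimate
    clean_psd = np.maximum(clean_psd, beta * noise_psd)    # spectral floor instead of half-wave rectification
    gain = np.sqrt(clean_psd / np.maximum(noisy_psd, 1e-12))
    return gain * Y                                        # enhanced complex spectrum; the noisy phase is kept

# Y is an STFT matrix (frames x bins); a crude noise PSD could come from a few speech-free frames:
# noise_psd = np.mean(np.abs(Y[:10]) ** 2, axis=0)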

Wiener filters

Another well-known class of speech enhancement algorithms is derived from Wiener filtering [4] by minimizing the mean squared error (MSE) between the filtered output and the desired response. Recall the short-time DFT coefficients defined in Equation 2.2. The desired output is the clean signal, X(m, k). The frequency response of the Wiener filter is denoted as H(k). Hence, the filtered output is H(k)Y(m, k). One can then define an error signal, E(m, k), for each frequency bin, k, at each frame, m, as

E(m, k) = X(m, k) − H(k)Y (m, k). (2.13)

The Wiener filter minimizes the energy of the error signal in Equation 2.13

\hat{H}(k) = \arg\min_{H(k)} E\big[ |E(m, k)|^2 \big]
           = \arg\min_{H(k)} E\big[ \big(X(m, k) - H(k)Y(m, k)\big)\big(X(m, k) - H(k)Y(m, k)\big)^* \big],   (2.14)

In the equation above, * denotes the complex conjugate. We can determine the minimizer of the error signal's energy by taking its derivative with respect to H(k) and setting it to zero. The solution to the Wiener filter in the frequency domain is

\hat{H}(k) = \frac{E[|X(m, k)|^2]}{E[|Y(m, k)|^2]} = \frac{P_x(k)}{P_y(k)} = \frac{P_y(k) - P_d(k)}{P_y(k)}.   (2.15)

In Equation 2.15, the power spectrum density (PSD) of the clean signal, Px(k), is generally not available. It is approximated by Py(k) − Pd(k) by assuming the speech and the noise are uncorrelated. The Wiener filter in the frequency domain can also be written in terms of the a priori and a posteriori SNR as

\hat{H}(k) = \frac{\xi_k}{\xi_k + 1} = \frac{\gamma_k - 1}{\gamma_k}.   (2.16)

It can be seen from Equation 2.16 that 0 ≤ Ĥ(k) ≤ 1. When ξk → ∞, i.e., the SNR is high, Ĥ(k) ≈ 1, which means there is no noise suppression. On the other hand, when ξk → 0, i.e., the SNR is low, Ĥ(k) ≈ 0, suggesting complete attenuation of the spectrum. In [16], the author plots the suppression gain of a Wiener filter with respect to the a priori SNR, ξk, replicated in Figure 2.1. There is almost no suppression at ξk > 10 dB. For SNR below -5 dB, the attenuation becomes linear with respect to the SNR. Compared to spectral subtraction, Wiener filtering is more aggressive in de-noising. The filtered clean signal's power is always lower than the oracle clean signal's power, which accounts for the speech attenuation of a Wiener filter [49].

[Figure 2.1: Suppression vs. a priori SNR, ξk, of a Wiener filter.]
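To make Equations 2.15 and 2.16 concrete, the sketch below forms a Wiener gain per time-frequency bin from estimated noisy and noise PSDs; estimating the noise PSD from a handful of leading, speech-free frames is an assumption made only for this illustration.

import numpy as np

def wiener_enhance(Y, noise_psd, eps=1e-12):
    """Apply the Wiener gain of Eq. (2.16), H = xi / (xi + 1), to each time-frequency bin of Y."""
    noisy_psd = np.abs(Y) ** 2
    xi = np.maximum(noisy_psd - noise_psd, 0.0) / (noise_psd + eps)  # a priori SNR via Py - Pd
    gain = xi / (xi + 1.0)                                           # 0 <= H(k) <= 1
    return gain * Y

# Example usage with an STFT matrix Y (frames x bins):
# noise_psd = np.mean(np.abs(Y[:10]) ** 2, axis=0)   # assume the first 10 frames are speech-free
# X_hat = wiener_enhance(Y, noise_psd)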

Statistical model-based methods

In this framework, the DFT of noisy speech serves as a set of measurements that depend on some unknown parameters, which are the DFTs of clean speech. We need to estimate these unknown parameters given the observation of noisy speech. The literature in estimation theory has provided us with tools to derive these estimators, including maximum likelihood estimators (MLE) [35] and Bayesian estimators [16]. A major difference between these nonlinear estimators is that the MLE assumes the parameters of interest are deterministic but unknown, whereas the Bayesian estimators assume that the parameters are random variables. The MLE of the magnitude spectrum of clean speech is [16]

|\hat{X}(m, k)| = \frac{1}{2}|Y(m, k)| + \frac{1}{2}\sqrt{\frac{\gamma_k - 1}{\gamma_k}}\, |Y(m, k)|.   (2.17)

The author in [16] compared the suppression gains of spectral subtraction, the Wiener filter, and the MLE in Figure 2.2. The suppression gain of the MLE is plotted as a solid line. Compared to spectral subtraction and the Wiener filter, it suffers from insufficient attenuation because of the noisy speech residue, (1/2)|Y(m, k)|. It is thus rarely used by itself [16].

[Figure 2.2: Suppression vs. a posteriori SNR, γk, of an MLE, spectral subtraction, and a Wiener filter.]

The MMSE estimator, introduced by Ephraim and Malah, shares a similar motivation with the Wiener filter. When formulating the Wiener filter in Equation 2.13, we attempt to minimize the error signal of the complex spectrum. In order to derive the optimal magnitude estimator, Ephraim and Malah proposed to minimize the MSE, E(k), between the ground truth, |X(m, k)|, and the estimated magnitude, |X̂(m, k)|:

E(k) = E\big[ \big(|X(m, k)| - |\hat{X}(m, k)|\big)^2 \big].   (2.18)

In their subsequent work, Ephraim and Malah further suggested minimizing the MSE between the log-magnitude spectra because they could be more subjectively meaningful [50]

E(k) = E\big[ \big(\log|X(m, k)| - \log|\hat{X}(m, k)|\big)^2 \big].   (2.19)

The derivation of the solutions to the optimal estimators in Equation 2.18 and Equation 2.19 was relatively involved. The closed-form solutions can be found in [12, 50]. The suppression gains depend on both ξk and γk. We can compare the suppression gain of an MMSE estimator with that of a Wiener filter. Figure 2.3 shows the suppression gain of an MMSE estimator (solid) and a Wiener filter (dotted) plotted against the a priori SNR, ξk. Figure 2.3a shows that when the a posteriori SNR is high, the MMSE gain is similar to that of the Wiener filter. When the a posteriori SNR is low, as in Figure 2.3b, we can tell that the MMSE estimator is not as aggressive as the Wiener filter. This behavior could help reduce speech distortion in adverse conditions.

[Figure 2.3: Comparison of the suppression gain between an MMSE estimator and a Wiener filter. (a) γk − 1 = 20 dB; (b) γk − 1 = 0 dB.]

The authors of [12, 51] compared the performance of an MMSE estimator with spectral subtraction, the Wiener filter, and the MLE. They found that there was no perceptible musical noise if the a priori SNR was estimated correctly. It also resulted in less speech distortion compared to Wiener filters. The cause for the effective suppression of musical noise was discussed in detail by Cappe [52]. He discovered that the suppression relied on a reliable estimation of the a priori SNR, ξk, more so than the a posteriori SNR, γk. Consequently, the suppression in the MMSE estimator will not change abruptly from frame to frame. On the other hand, algorithms like spectral subtraction relied more heavily on the estimation of the a posteriori SNR, which might change rapidly between frames. Hence, the MMSE estimator yielded a smoother transition and avoided undesirable musical noise.
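A small sketch comparing the suppression gains of spectral subtraction (Eq. 2.11), the Wiener filter (Eq. 2.16), and the MLE (Eq. 2.17) as functions of the a posteriori SNR, in the spirit of Figure 2.2; the SNR grid is arbitrary, and the curves are meant only to reproduce the qualitative behavior described above.

import numpy as np

gamma_db = np.linspace(0.5, 30.0, 300)                   # a posteriori SNR grid in dB
gamma = 10.0 ** (gamma_db / 10.0)

gain_ss = np.sqrt((gamma - 1.0) / gamma)                 # spectral subtraction, Eq. (2.11)
gain_wiener = (gamma - 1.0) / gamma                      # Wiener filter, Eq. (2.16)
gain_mle = 0.5 + 0.5 * np.sqrt((gamma - 1.0) / gamma)    # MLE, Eq. (2.17)

# The MLE gain never drops below 0.5 (about -6 dB), matching the "insufficient attenuation"
# noted earlier, while the Wiener gain attenuates most aggressively at low SNR.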

Subspace methods

Another major class of classical speech enhancement techniques is derived from linear algebra theory. The subspace methods seek to decompose noisy speech into a signal subspace and a noise subspace. The signal subspace can be retrieved by nulling the noise subspace using algebraic tools, such as the SVD or eigenvector-eigenvalue factorization [53]. After arranging speech samples into a matrix, Y, one can estimate the speech matrix, X, using either the least squares approach with low-rank modeling or the minimum variance approach. The least-squares approach [16] is formulated as

\hat{X}_{LS} = \arg\min_{\hat{X}} \| X - \hat{X} \|_F,   (2.20)

where ||·||_F is the Frobenius norm of a matrix. The solution is given as the truncated SVD of Y with presumed rank, r

\hat{X}_{LS} = U_y \Sigma_y V_y^H = \sum_{j=1}^{r} \sigma_j u_j v_j^H.   (2.21)

In the equation above, the superscript H denotes the Hermitian transpose of a matrix. The minimum variance approach determines a transfer matrix, H, such that the reconstruction error is minimized [54, 55]

\hat{H} = \arg\min_{H} \| YH - X \|_F^2.   (2.22)

The estimator is given as

\hat{X}_{MV} = U_y \Sigma_{MV} V_y^H = \sum_{j=1}^{r} \frac{\sigma_j^2 - \sigma^2}{\sigma_j}\, u_j v_j^H,   (2.23)

where σ^2 is the variance of the noise. One can observe that both estimates share the same singular vectors but use different estimates for the singular values. The first use of the SVD in speech enhancement appeared in [36]. The authors assumed that the singular vectors with larger singular values corresponded to speech. They demonstrated it to be a practical algorithm at high SNR. At low SNR (around 0 dB), there was perceptible musical noise [56, 57]. Subsequent work showed that a higher rank, r, is required for unvoiced speech than for voiced speech [16]. The minimum variance approach was also less sensitive to the choice of r than the least squares approach [56]. Authors of [51] also found that subspace methods generally did not improve speech quality as much as statistical methods, such as the log-MMSE estimator, but some [37, 58] could outperform log-MMSE in terms of speech intelligibility.
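The following NumPy sketch implements the truncated-SVD estimates of Equations 2.21 and 2.23; the matrix shape, rank r, and noise variance are placeholders that would normally come from framing the waveform and estimating the noise.

import numpy as np

def subspace_estimates(Y, r, noise_var):
    """Least-squares (Eq. 2.21) and minimum-variance (Eq. 2.23) estimates from a truncated SVD of Y."""
    U, s, Vh = np.linalg.svd(Y, full_matrices=False)
    U_r, s_r, Vh_r = U[:, :r], s[:r], Vh[:r]
    X_ls = U_r @ np.diag(s_r) @ Vh_r                      # keep only the r largest singular values
    s_mv = np.maximum(s_r ** 2 - noise_var, 0.0) / s_r    # modified singular values of Eq. (2.23)
    X_mv = U_r @ np.diag(s_mv) @ Vh_r
    return X_ls, X_mv

Y = np.random.randn(64, 256)                              # stand-in for a matrix of noisy speech samples
X_ls, X_mv = subspace_estimates(Y, r=10, noise_var=0.5)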

2.1.2 Deep learning methods

The use of MLPs as nonlinear filters to predict the clean spectrum dates back to the 1990s [38, 39, 59, 60]. An MLP is an extension of Rosenblatt's perceptron [61] obtained by inserting hidden layers between the input and output layers. The MLP is parameterized by weights and biases. The weights, W, are matrices connecting adjacent layers. The bias, b, is added to each layer's output to model any linear shift in the data distribution. Forward propagation through a layer, j, is a matrix product between the layer's input, vj, and the weight matrix in that layer, Wj, followed by the addition of the corresponding bias, bj, to produce the output, vj+1. Generally, an element-wise nonlinear operation such as a sigmoid or rectified linear unit (ReLU) [62] function is also inserted between hidden layers to prevent the MLP from degenerating into a linear operation. Forward propagation through a layer with a ReLU activation is expressed as

v_{j+1} = \mathrm{ReLU}(W_j v_j + b_j).   (2.24)

The weights and biases are initialized randomly. The predicted output can be computed layer by layer, according to Equation 2.24. At the last output layer of an M-layer MLP, the prediction vM can be compared with the ground truth, u, with an appropriate loss function. In the case of speech enhancement, MSE loss is most widely used [10].

L_{MSE} = \frac{1}{N} \| v_M - u \|_2^2 = \frac{1}{N} \sum_{i=1}^{N} (v_i - u_i)^2.   (2.25)

Here the dimension of the output and the ground truth is assumed to be N. By computing the gradient of each parameter in the network with respect to the chosen loss function, we can perform iterative updates on the parameters to reduce the overall loss. The gradients of lower layers that are not directly connected to the output can be computed using the chain rule. This principle is known as back-propagation [63]. With stochastic gradient descent, the parameters are updated by a small amount in the negative direction of the gradient. This update lowers the loss after each iteration. This gradient descent step is the most basic procedure in the optimization of an MLP [64]. After successive updates of the parameters, the loss decreases and converges, at which point the network training is complete. Past research [65, 66] has shown that MLPs are universal approximators that can describe a wide variety of functions if they have sufficient width per hidden layer.

During the early stages of the application of MLPs in enhancement, the neural networks typically had relatively small sizes. Each layer had fewer than 200 neurons. There was also no consensus on the best features or targets for the mapping. Time-domain waveforms were used directly in [38, 39]. Log spectral features were adopted in [67]. In [68], the author estimated the instantaneous SNR of spectrograms to suppress noise. However, the frequency resolution was low, and the system was unable to handle noise with sharp spectral peaks. In general, neural networks back then were usually shallow in terms of the number of layers and small in terms of the number of hidden neurons per layer.

One of the major limitations of an MLP is its lack of closed-form solutions. The error surface of the loss function is generally not convex. Hence, there is no guarantee that a local minimum found by the gradient descent algorithms is a global minimum. Furthermore, MLPs with too many layers cannot be trained (in the sense of reducing the training loss) because of the vanishing gradient problem [69]. It refers to the phenomenon that the gradients become so small at layers close to the inputs that the parameters cannot be updated. As a result, most of the early work relied on the use of single-layer MLPs. Understandably, the complex nonlinear relationship was tough for a small network to approximate. The performance of speech enhancement with shallow networks was unsatisfactory.

In [70], Hinton et al. first used restricted Boltzmann machines (RBMs) to greedily pre-train MLPs layer-wise without labels. This unsupervised pre-training step yielded better initialization for the parameters in each layer. This procedure allowed DNNs with layers of pre-trained RBMs to be fine-tuned [70]. The vanishing gradient problem was also alleviated by using other nonlinear activations, such as the ReLU, in place of the sigmoid function [62]. Gradually, even pre-training was no longer considered necessary if a large amount of training data was available.

Inspired by the breakthrough of MLPs in automatic speech recognition (ASR), speech enhancement based on DNNs flourished in subsequent years. In [71, 72], DNNs were used to perform binary classification of sub-bands of noisy speech into speech- or noise-dominated bins. The classification results were used as ideal binary masks to recover clean speech, similar to masks in computational auditory scene analysis (CASA) [73]. The authors postulated that neural networks could learn more discriminative features than spectral features. Afterward, they concatenated the DNN's output to an SVM to estimate the mask.

Their design was a tandem system where DNNs were used mainly for feature extraction, and other classification algorithms were required to make the final classification decision. However, researchers soon experimented with stacking DNNs on top of other DNNs to create more "all-neural" models [74]. Authors of [75] used a deep de-noising auto-encoder in place of the RBM to pre-train a DNN. They mapped noisy speech features directly to clean features. Concurrently, authors of [9] framed speech enhancement as a regression task to map the log power spectrum (LPS) of noisy speech to that of clean speech. Unlike [75], the DNN in [9] was a standard MLP with RBM pre-training. Multiple frames were concatenated as inputs to include more temporal information, which significantly helped the enhancement quality. Experimental results showed improved speech quality in terms of both subjective and objective measures over traditional methods. Notably, the deep learning-based models could more effectively handle non-stationary noise and yielded enhanced waveforms with little of the musical noise commonly found in traditional techniques. A primary reason for the effectiveness of the system in [9] could be attributed to its use of a large volume of noise types and SNR conditions to simulate noisy speech during training.

Subsequent research directions included the use of other neural network models, such as the convolutional neural network (CNN) [76], fully convolutional networks [77], and long short-term memory (LSTM) [78]. DNNs with skip connections between non-consecutive layers were investigated to obtain better enhancement quality [79]. A myriad of work explored the suitability of other learning targets besides spectral features, such as the ideal ratio mask (IRM) [75, 76], the phase-sensitive mask [80], and the complex IRM [81]. More recently, direct mapping of speech waveforms in the time domain has also been attempted [11, 82]. The use of adversarial learning, such as the speech enhancement generative adversarial network (GAN) [83, 84], also led to many new approaches to deep learning-based speech enhancement.
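To tie Equations 2.24 and 2.25 together, here is a minimal NumPy sketch of one forward pass and one stochastic gradient descent step for a single-hidden-layer MLP that maps a noisy LPS frame to a clean one; the layer sizes, learning rate, and random stand-in features are illustrative only and not the configurations used in the works cited above.

import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out, lr = 257, 128, 257, 1e-3          # illustrative sizes for one LPS frame

W1, b1 = 0.01 * rng.standard_normal((n_hid, n_in)), np.zeros(n_hid)
W2, b2 = 0.01 * rng.standard_normal((n_out, n_hid)), np.zeros(n_out)

v0 = rng.standard_normal(n_in)                        # noisy LPS frame (stand-in)
u = rng.standard_normal(n_out)                        # clean LPS target (stand-in)

a1 = W1 @ v0 + b1                                     # Eq. (2.24): v_{j+1} = ReLU(W_j v_j + b_j)
v1 = np.maximum(a1, 0.0)
v2 = W2 @ v1 + b2                                     # linear output layer

loss = np.mean((v2 - u) ** 2)                         # MSE loss of Eq. (2.25)

g2 = 2.0 * (v2 - u) / n_out                           # dL/dv2
g1 = (W2.T @ g2) * (a1 > 0)                           # back-propagation through the ReLU
W2 -= lr * np.outer(g2, v1); b2 -= lr * g2            # one SGD step
W1 -= lr * np.outer(g1, v0); b1 -= lr * g1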

2.1.3 Progressive speech enhancement

Researchers have observed that speech quality improvement with a DNN-based speech enhancement system is not uniform across SNR levels [15] and noise types [13]. Noisy speech at lower SNR still contains many noticeable noise residues and artifacts after enhancement. This observation is in line with our intuition, since lower SNR implies more severe distortion in the input signal; hence more details have to be recovered. The performance gap between speech in different background noise further suggests that noise types, just like SNR, could be classified into difficult versus simple groups.

The work in [14, 15] pioneered a multi-stage indirect approach to speech enhancement. In their progressive learning framework [14], the authors divided the direct mapping between noisy and clean speech with a DNN into multiple stages. The signal gained higher SNR as it propagated through the system. To enforce SNR gains in the neural network, the authors provided explicit learning targets at intermediate SNR levels as secondary labels in selected hidden layers. It effectively decomposed the neural network into a sequence of smaller networks, which only needed to handle simpler tasks in each stage due to smaller SNR differences. Smaller neural networks with constrained learning capability were stacked in [85] to approximate a larger teacher network's performance. The stacked smaller networks also showed gradual improvement in speech quality as noisy speech propagated through the sequential model. This process is shown in Figure 2.4, where noisy speech in babble noise at 0 dB is gradually enhanced to 5 dB, 10 dB, and ultimately to an estimate of the clean speech.

[Figure 2.4: Progressive speech enhancement by SNR. Panels: (a) noisy speech at 0 dB; (b) noisy speech at 5 dB; (c) noisy speech at 10 dB; (d) clean speech.]

The work in [14, 15] has inspired several more studies pursuing progressive learning in speech enhancement. In [86], the authors divided the enhancement task into two sub-tasks: suppression of additive noise and dereverberation. They then combined the sub-tasks into an overall enhancement task. The system in [86] had three branches, one for each sub-task. Each task had three difficulty levels guided by different intermediate targets. For example, the dereverberation task had intermediate targets of gradually weaker reverberation. The suppression task's intermediate targets were noisy speech at incremental SNRs. The third task combined the intermediate outputs from the previous two sub-tasks and the original noisy speech. It used pre-enhanced features from the simpler tasks to enhance the original difficult speech.

Following the progressive learning paradigm, the authors of [87] examined the intermediate outputs and offered some analysis of each enhancement block's behavior in the pipeline. They found that earlier blocks, working with signals at lower SNR, took care of the more noticeably distorted areas of the spectrum. The network did this by establishing a pattern of what was distortion and what was not. They also noted that the network mainly focused on the spectral valleys where the SNR was low. The later modules softened the spectrum in order to produce slow spectral magnitude changes. This operation avoided undesirable auto-generated distortions, such as annoying musical noise. However, it could also cause an over-softening effect in the final spectrum output.

The design of the progressive learning pipeline in [88] focused more on using efficient models for parallelism and real-time learning. The authors promoted parameter sharing because of the similar functionalities of enhancement blocks among different stages and the demand for small models. Moreover, a more parameter-efficient model, a convolutional-recurrent neural network, was used instead of an LSTM in their work.

These methods fall under the umbrella of curriculum learning [89], in which a bigger or tougher task is dissected into simpler sub-tasks, hoping that each smaller task can be better designed and trained for overall better performance. The experiments in [14, 15, 86] have confirmed that progressive learning is a useful technique in highly adverse conditions with low SNR or high reverberation.

[Figure 2.5: SNR affecting the results of automatic speech processing systems. (a) Word error rate vs. SNR; (b) speaker accuracy vs. SNR.]
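As a sketch of how the intermediate targets for SNR-progressive training can be simulated from a clean/noise pair, the snippet below builds a sequence of mixtures at increasing SNRs; the 0/5/10 dB schedule mirrors Figure 2.4 but is otherwise arbitrary, and mix_at_snr is the hypothetical helper defined after Equation 2.1, not a routine from [14, 15].

def progressive_targets(x, d, snr_schedule=(0.0, 5.0, 10.0)):
    """Intermediate learning targets for SNR-progressive training: the same clean/noise pair
    mixed at gradually higher SNRs, ending with the clean signal itself."""
    targets = [mix_at_snr(x, d, snr) for snr in snr_schedule]   # reuses the helper from Section 2.1
    targets.append(x)                                           # the final target is clean speech
    return targets

# Stage t of a progressive network would be trained to map targets[t] toward targets[t + 1].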

2.2 Characterization of noise

Low SNR poses problems for speech processing systems. In Figure 2.5, one could tell low SNR drastically increases the word error rates on an ASR system [90]. Similarly, low SNR also adversely affects the accuracy of speaker identification [91]. Even though the effects are different across noise types, the overall trend showing degrading performance with respect to lower SNR is consistent.

22 The difficulty of noise types cannot be interpreted straightforwardly. The source of ad- ditive noise is ubiquitous in everyday acoustic environments, such as traffic noise outdoors, babble noise in a meeting room, and even electrical noise in microphones. Due to high variabilities, the discussion on the interaction between general noise types and speech in speech enhancement is limited. Noisex92 [92] highlighted the drastic performance differences on a noise masking rec- ognizer among noise types but did not extend the discussion into the properties of each noise contributing to their performance. A particular noise type, car environment, was discussed in [93] using a harmonic noise model. Even though it was designed for a spe- cific noise type, the analysis-synthesis framework using harmonic noise models could be extended to other similar noise. Authors of [94] adopted a post-processing step after tra- ditional speech enhancement (SE) methods to cope with factory-like noise with a high burst of energy in stationary noise. In [95], the authors noted that white likely reduced the dynamic range of cepstral coefficients within frames. The difficulty of recognizing each phoneme was assessed in [96]. The authors reported that consonants, including ”s, sh, zh (as in vision),” were resistant to noise, including speech-shaped noise, babble noise, or white noise. These fricatives occupied the high-frequency bands that were less cor- rupted by the aforementioned noise. Other consonants did not have a steady classification accuracy due to different noise types reported in each work [97, 98, 99]. In [100], the author acknowledged the importance of designing noise-aware hearing aids for speech en- hancement. It used energy-based features to first identify the presence of speech under the assumption that the background noise remained wide-sense stationary in a sustained noise environment. The deviation in the input speech signal’s energy level was computed for stationary, non-stationary, and semi-stationary noise to perform classification. The authors of [101] attempted noise classification in a limited scope and utilized this classification result in a DNN-based enhancement. First, a VAD isolated speech-absent frames. Then a GMM-based classifier determined the noise type. They found that the noise

23 specific enhancement model achieved better objective speech quality than noise agnostic systems. Most recently, authors of [102] applied different noise types from Noisex92 [103] corpus to analyze how the spectrum of each noise affected formant shifts. They found that wideband noise, such as white noise, consistently has a greater distortion on formant locations than narrowband noises, such as volvo car noise across between 5 to 15 dB. In order to develop more noise resistant features, researchers have proposed various pre-processing steps throughout the decades. Unfortunately, those methods only work best when noise follows a presumed pattern. For example, researchers in [104] proposed the use of cepstral mean normalization to remove the mean value from all cepstral vectors. This technique is effective in counteracting the effect of channel distortion, but not additive noise. Representations relative spectrum (RASTA) proposed in [105] attempted to suppress constant additive noise in every log spectral component of the short term spectrum. This method has been extended to mel and perceptual linear prediction (PLP) features [106]. Its effectiveness was demonstrated in [107]. Nevertheless, it is impractical in sub-word models due to its high memory usage. In summary, these feature pre-processing steps are effective in filtering out steady noise but less useful in real-world non-stationary noise environments. The separation of noise interference from the speech is also a topic of study in the field of auditory scene analysis. Human listeners tend to separate and group audio objects be- fore identification [108]. The clues for clustering and separating include onset-offset time, temporal dynamics of amplitudes and frequencies, and spatial locations. More formally, auditory scene analysis parses auditory inputs into perceptual objects representing either physical sources or temporal sound patterns, such as melodies, which contribute to sound waves reaching the ears [109]. CASA is the study of auditory scene analysis by compu- tational means [73]. Some advances in CASA include Bayesian principles [110], neural models [111], and temporal coherence models [112]. Although there have been limited efforts to quantify the effects of noise in speech en- hancement systems, its effects on speech perception have received more attention. In [113],

the authors observed that people were less sensitive to acoustic stimuli, including noise or artifacts, near high-energy regions in speech, such as formant peaks. When listening to speech in a noisy environment, human listeners can reduce the noise effect through a masking mechanism. The phenomenon, called noise masking, has been exploited in the design of speech coding and enhancement systems [114, 115, 116]. Artifacts, such as quantization noise, were masked by formant peaks and hence became inaudible to human ears [117]. Speech coders were designed with a perceptually weighted error criterion [115, 118], which placed more emphasis on spectral valleys, where the noise was more noticeable than at spectral peaks. Noise floor normalization [113] was applied after filterbank analysis and the log operation to modify the noisy speech spectra so that the system became more resistant against variations in background noise. The noise floor was chosen such that only bands with energy higher than the threshold were considered in the classification process. Subsequent work used a global noise threshold [114]. It has also been shown that the technique could be applied to a hidden Markov model (HMM) in speaker-dependent digit recognition, where improvement was achieved at low SNRs [119].

2.3 Speech transformation

As both enhancement and conversion require a transformation from source speech to target speech, many techniques applicable to speech enhancement are also suited for speech transformation. Voice conversion is a popular topic studied under the umbrella of speech transformation [120]. It generally attempts to modify a source speaker's speech signal into that of a target speaker while keeping the linguistic content intact. Even though voice conversion is not required or even desired in enhancement systems, some techniques that transform the speech spectrum could be modified to perform speech enhancement. The similarities stem from the fact that both speech enhancement and voice conversion traditionally require analysis and synthesis of the speech signal. Transformation is performed at a frame level by establishing a mapping between spectral features. The transformed magnitude

spectrum and phase spectrum are then synthesized to reconstruct the speech waveform. Indirect speech enhancement, which will be introduced at a high level in section 2.4, outputs speech in another background noise as an intermediate output. This conversion process mimics some aspects of voice conversion, which also modifies some characteristics of the speech signal. In the following subsections, we will survey some typical voice conversion methods that can be applied to speech enhancement and transformation.

2.3.1 Mean-variance normalization

Though not used as a speech transformation technique by itself, mean-variance normalization of speech vectors has found many uses as a feature pre-processing step in many tasks, including recognition, enhancement, and speaker identification [121]. The linear transform works with many types of speech features, including but not limited to power [122], cepstral features [123], line spectral pairs [124], and perceptual linear prediction [125]. It shifts and scales speech features to an appropriate range for downstream processing. It is an effective technique to ensure the training and test data follow a similar distribution. This adjustment makes the overall speech recognition or enhancement more robust against changing noise conditions [49].

In its most basic form, mean normalization subtracts the mean statistic from each utterance. The resulting feature vector will have zero mean in each feature dimension. Variance normalization standardizes each feature dimension's variance to 1 by dividing the feature vector by the estimated standard deviation. If required, it is done after the mean normalization. The validity of mean normalization of cepstral features lies in the fact that the channel effect becomes additive in the cepstral domain. For example, distortion at the microphone can be modeled as linear filtering on the signal. The distortion varies depending on the transfer function of the electronics in the microphones, the distance between the speaker and the microphones, and the room acoustics. After removing the sample mean from the feature, the effect of such channel distortion can be reduced. Unlike mean normalization, cepstral

variance normalization lacks a physical interpretation [49]. Still, many empirical studies have confirmed its usefulness in scaling the speech feature vectors to a better range [126].

Histogram equalization extends the idea of mean-variance normalization to higher moments [127]. Similar to mean and variance normalization, feature transformation is also performed in each dimension. However, a target histogram has to be determined beforehand. A unit Gaussian distribution can be selected when no prior information is available [49]. Several studies have found that, though more complicated, histogram equalization does not yield a significant performance boost over simple mean-variance normalization [126].

Works in this field also include selections of different feature types, speech tasks, and, most importantly, methods to estimate the desired mean and variance statistics. Three cepstral mean variance normalization (CMVN) techniques are proposed and compared in [128]. The authors concluded that the long-term average is better than the short-term average and the maximum likelihood estimate with respect to the model parameters. SNR-dependent cepstral normalization was first introduced in [129]. It applied an additive correction dependent on the instantaneous SNR of the signal. The algorithm was simple and effective, but it required environment-specific training. Fixed codeword-dependent cepstral normalization was subsequently developed to provide greater recognition accuracy than the SNR-dependent normalization [130]. It was further extended into multiple fixed codeword-dependent cepstral normalization. It exploited the simplicity and effectiveness of fixed codeword normalization, yet it did not need environment-specific training. Lastly, the authors of [131] and [132] studied online and recursive normalization to enable feature normalization in real-time speech processing.

In most of the literature above, mean-variance normalization is performed on clean speech or noisy speech. Speech in different backgrounds does not receive special attention regarding applying different normalization statistics depending on the environment. This thesis will consider the effects of noise on feature normalization and adjust its use for indirect speech

enhancement.

2.3.2 Exemplar-based methods

This class of methods assumes the converted speech, \hat{X}_{tgt}, can be decomposed as a linear sum

of a set of exemplars, t_i,

\hat{X}_{tgt} = \sum_{i} w_i t_i = \mathbf{w}^{T} \mathbf{T}.    (2.26)

Each exemplar, t_i, is a row in the matrix, T. The weights of the exemplars, w_i, form the weight vector, w. The weights are computed to minimize the distance between the source and target features. It is desired for w to be sparse, as too many non-zero weights may cause the combined features to be over-smoothed. Since either the magnitude or power spectrum is guaranteed to be non-negative, assuming the weights are non-negative too, one can use NMF to solve for the sparse weights iteratively [19, 133]. Its robustness in noisy environments might be of additional interest for noisy speech transformation.

Exemplar-based voice conversion echoes the methods based on unit selection. Unit selection considers two costs: a target cost that measures the distance between converted vectors and a concatenation cost, representing the distortion after joining the sequence. It could synthesize converted speech with a more natural tone. The challenging job of choosing the optimal selection sequence is performed with dynamic programming, such as Viterbi decoding [134]. Exemplar-based methods are known for the high quality of reconstructed speech [19]. This property makes them attractive for indirect speech enhancement, as speech quality is an important metric in assessing a speech enhancement system. We will explore the use of exemplars to perform speech transformation in this thesis.
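To make the exemplar decomposition in Equation 2.26 concrete, the following is a minimal numpy sketch of the NMF-style activation estimation, assuming paired source/target exemplar matrices (here called A_src and A_tgt, hypothetical names) whose columns are parallel magnitude-spectrum exemplars. It uses multiplicative updates for a Euclidean objective with an L1 sparsity penalty; it illustrates the idea rather than reproducing the exact algorithms of [19, 133].

```python
import numpy as np

def sparse_nmf_activations(y, A_src, n_iter=200, sparsity=0.1, eps=1e-12):
    """Estimate non-negative, sparse activations w such that A_src @ w approximates y.

    y     : (F,) non-negative magnitude spectrum of one source frame
    A_src : (F, N) dictionary whose columns are source exemplars
    Multiplicative updates keep w non-negative; `sparsity` acts as an L1 penalty.
    """
    w = np.full(A_src.shape[1], 1.0 / A_src.shape[1])
    for _ in range(n_iter):
        numer = A_src.T @ y
        denom = A_src.T @ (A_src @ w) + sparsity + eps
        w *= numer / denom
    return w

def convert_frame(y, A_src, A_tgt):
    """Re-synthesize the frame with the paired target exemplars (same activations)."""
    w = sparse_nmf_activations(y, A_src)
    return A_tgt @ w
```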

2.3.3 Multi-layer perceptrons

The use of MLP in speech enhancement has been surveyed in subsection 2.1.2. The same tool has also seen its application in voice conversion. Both GMM and MLP can model

nonlinear transformations. A GMM models the transformation with a weighted combination of class-based linear transformations, where the weights are the posterior probabilities. An MLP uses nonlinear activation functions in hidden layers to realize a nonlinear mapping. Its first use in voice conversion was in [135], which transformed only the formants. More follow-ups involving MLPs include the work in [136, 137, 138]. MLPs and similar deep neural networks have seen growing interest in the speech enhancement community in general. As discussed in section 2.1, they have been demonstrated to be very effective in handling some noise that used to be difficult for traditional methods. The application of MLPs in speech transformation is very similar to their application in speech enhancement, as both can be framed as mappings of spectral features. Hence, we can still employ MLPs for feature transformation in indirect speech enhancement.

2.3.4 Generative models

Recently, generative models, including the variational auto-encoder (VAE) [139, 140] and the GAN [141], have also been applied to speech conversion. The authors of [142] implemented non-parallel voice conversion with a VAE. Its benefits included the use of unaligned corpora, which are usually more cost-effective to gather than parallel corpora. The encoder network learned speaker-independent phonetic representations, and the decoder learned to reconstruct speech from the target speaker. It relied on the assumption that a VAE could decouple the speaker and phonetic representations. It also assumed that the decoder could blend the two representations to synthesize a new frame. In order to better understand how a VAE can perform voice conversion, Hsu et al. explored feature disentanglement in [143]. They observed that vector arithmetic in latent spaces allowed speech attributes, such as speaker or tonal information, to be manipulated. The modified latent representation could then be transformed into converted speech. This disentanglement property is relevant to speech enhancement and transformation, as we only wish to transform the background noise while keeping the phonetic content unchanged.

Another class of generative models is based on the GAN. The original GAN paper [141] let two neural networks, known as the generator and the discriminator, play a zero-sum game. The discriminator was trained to differentiate the generator's outputs from ground truths, whereas the generator was trained to fool the discriminator by generating outputs similar to the ground truths. Cycle-GAN [144] built on this idea and used two pairs of generators and discriminators to transform features from a source domain to a target domain and back. The authors argued that an additional cycle-consistency loss forced the generated features to stay on a low-dimensional manifold shared by desirable targets. It was first applied to voice conversion in [145, 146] to learn the forward and inverse mappings between a source and a target speaker. They found the converted feature sequences to be near natural in terms of global variances and modulation spectra. A subjective evaluation showed that the converted speech quality was comparable to traditional methods that required parallel data. Star-GAN [147] was an extension of Cycle-GAN that enabled one-to-many mapping among a group of speakers. Even though the applications of the aforementioned generative models are relatively broad, we will be able to exploit techniques including feature disentanglement and adversarial losses in indirect speech enhancement, which will be introduced in the next section.

2.4 A high-level description of the proposed progressive enhancement approach with intermediate noisy speech target

When speech enhancement is too difficult due to low SNR or challenging noise, it is often not easy to obtain good enhancement results in a single step. The unsatisfactory quality is due to the highly non-linear relationship between the high-dimensional noisy and clean speech features. Prior works [14, 15] described in subsection 2.1.3 decomposed the problem of overcoming a large SNR gap with direct enhancement into a series of tasks with smaller SNR gaps. The assumption was that the smaller tasks were more manageable for neural networks to learn. When combined, the benefit of reduced difficulty outweighed the distortions generated in each

Figure 2.6: Progressive learning in speech enhancement

sub-task. We could visualize the incremental improvement in SNR along the path in the center in Figure 2.6. The raw signal is denoted as the summation of the clean signal, x, and noise,

n, scaled by an SNR factor. Instead of mapping from the noisy speech, x + 10^{-SNR_1} n, to the clean speech, x, directly, the first sub-task only learns to improve the signal to SNR_2, where SNR_2 is higher than SNR_1. The first stage's outputs are used as inputs to the next stage, again to learn a signal at an even better SNR, SNR_3. Eventually, the clean signal, x, can be recovered. Each stage's results can boost learning in subsequent stages since they receive pre-enhanced features at higher SNR.

When the authors of [14, 15] improve the speech SNR incrementally, the global average SNR in the signal improves. In other words, the improvement is spread across all time samples and all frequency ranges after each sub-task. We can formulate progressive learning on a more local scale. Consider the example of the contaminated speech signal in Figure 2.7. The noisy speech in the middle panel and the bottom panel both have the same average SNR, but it is evident that the bottom signal will be a simpler task for an enhancement system since it only needs to trim the noise-dominated segment. The signal in the middle requires speech enhancement to separate speech from the overlapping noise, which is a harder task.
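For concreteness, the noisy mixtures and the intermediate targets at progressively higher SNRs can be simulated by rescaling the noise before addition. The following is a minimal sketch under the usual convention that SNR is specified in dB over the whole utterance; the loader call is a hypothetical placeholder.

```python
import numpy as np

def mix_at_snr(x, n, snr_db):
    """Add noise n to clean speech x, rescaled so the mixture has the requested SNR in dB."""
    n = n[: len(x)]
    p_x = np.mean(x ** 2)
    p_n = np.mean(n ** 2) + 1e-12
    scale = np.sqrt(p_x / (p_n * 10.0 ** (snr_db / 10.0)))
    return x + scale * n

# Progressive targets: the same clean utterance at increasingly favorable SNRs.
# x, n = load_waveforms(...)             # hypothetical loader
# noisy_input   = mix_at_snr(x, n, 0)    # SNR_1
# stage1_target = mix_at_snr(x, n, 10)   # SNR_2 > SNR_1
# stage2_target = x                      # clean target
```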

Figure 2.7: Relocation of noise in the time domain

A similar example can be illustrated in the frequency domain. In Figure 2.8, the signal on the left is contaminated by white noise at 0dB. The noise on the right is band-limited, but the overall SNR is kept at 0dB, too. A speech enhancement for the left signal must learn to find a good regressive mapping for all frequency bins. The signal on the right only requires the system to concentrate on making predictions in the corrupted bands. One could argue that even for speech signals at the same SNR, the tasks’ difficulties are unequal depending on the relationship between speech and noise. While some noises are harder to handle, as in Figure 2.7 center and Figure 2.8 left, other noises are simpler, as in Figure 2.7 bottom and Figure 2.8 right. These examples motivate us to design sub-tasks in progressive learning along a different

Figure 2.8: Relocation of noise in the frequency domain

Figure 2.9: Framework of indirect speech enhancement

route. Specifically, we can design each sub-task to modify the background noise into a simpler type. Schematically, along the right path in Figure 2.6, the original noise, n, in the noisy

input speech, x + 10^{-SNR_1} n, is replaced by a simpler noise, n_2, in the first sub-task. Its output is fed into subsequent tasks to obtain speech in even simpler conditions. Eventually, clean speech, x, is recovered after multiple intermediate stages. We refer to this flow as indirect speech enhancement via conversion to intermediate targets. The system is indirect as opposed to direct mapping using a black-box DNN for speech feature mapping. For each sub-task, the outputs are intermediate because subsequent tasks use them for enhancement. To accomplish each sub-task, we need to convert speech in one noise background into a simpler noise. To better formulate indirect speech enhancement, we simplify the progressive path via

noise type conversion to Figure 2.9. We use X_i and Y_i to denote instances of clean and noisy speech features in the figure. The bottom path represents a direct mapping between noisy and clean speech, which is a DNN-based speech enhancement network, F. When trained with the MSE loss criterion, the task of direct enhancement is formulated as

\hat{F} = \arg\min_{F} \sum_{i} \| X_i - F(Y_i) \|_2^2.    (2.27)

The optimal parameter set, \hat{F}, is found by iterative updates with back-propagation. The indirect method comprises at least a conversion step and a refinement step, shown in the upper path of Figure 2.9. The conversion step is accomplished by a converter, G.

With an appropriate intermediate target, Z_i, the conversion step and the enhancement step seek to find \hat{G} and \hat{Z}_i such that

\hat{G} = \arg\min_{G} \sum_{i} \| Z_i - G(Y_i) \|_2^2,
\hat{Z}_i = \arg\min_{Z_i} \sum_{i} \| X_i - F'(Z_i) \|_2^2,    (2.28)

assuming MSE is still chosen as the loss criterion. Here, the refinement network, F', could be the same network as the direct enhancement network, F, or it could be adapted to the specific noisy

speech, Z_i. The indirect approach jointly finds G, Z_i, and F' such that the combined loss

\sum_{i} \big( \| Z_i - G(Y_i) \|_2^2 + \lambda \| X_i - F'(G(Y_i)) \|_2^2 \big) is minimized. \lambda is a weight coefficient that balances the errors from the two steps.

Progressive learning requires us to identify simple and difficult noise conditions. This step can be done based either on the noise signal's characteristics or on some prior enhancement results. A detailed discussion of noise types will be given in Chapter 3. To illustrate the existence of intermediate targets, we use an example in Figure 2.10. The noisy input and its output from a pre-trained speech enhancement system are on the left. There is considerable residual noise in the high-frequency bands. During the neural network

training, one uses the MSE between the predicted output and the clean label as the loss to back-propagate and update the network parameters. Instead of updating the network parameters, we can use the error gradients to modify the inputs to minimize the loss. In particular, we find an "optimized input," \hat{Y}, such that

\hat{Y} = \arg\min_{Y} \| X_0 - F(Y) \|_2^2,    (2.29)

where Y_0 is the original input, and X_0 is the ground truth. Just as in regular network training, the error surface may not be convex, but one can still use gradient methods to

iteratively find \hat{Y} such that \| X_0 - F(\hat{Y}) \|_2^2 < \| X_0 - F(Y_0) \|_2^2. One such \hat{Y} after 20 iterations of updates is shown in the top right corner of Figure 2.10. Even though it is far from being a clean signal, its enhanced result, F(\hat{Y}), shown on the bottom right, is much better than the original output, F(Y_0). Thus, the signal, \hat{Y}, could be a good example of an intermediate target for this input. The previous example does not prescribe how to derive intermediate targets in practice, but it implies that intermediate targets exist, at least for a fixed neural network model. We will discuss how to obtain intermediate targets in Chapter 3.

We also need to ensure that there are practical ways of obtaining suitable intermediate targets and that there exist good speech conversion techniques for noise type conversion. These targets are essential in supervised training of the neural networks in each sub-task. In offline training, we could synthesize these training targets using clean speech and the chosen noise type, just like how we simulate paired data for direct training [9]. Since the training is offline, we could create a large amount of paired data this way. Chapter 4 and Chapter 5 will further explore data simulation procedures. In Chapter 4, we will explore a more traditional simulation of paired training data, whereas Chapter 5 discusses speech transformation with unsupervised learning.

The design of the sub-tasks, i.e., the conversion steps, is essential too. In Chapter 4, we will explore the use of direct mapping between difficult and simple speech. These direct

Figure 2.10: The existence of intermediate targets

mapping methods can be boosted by additional techniques such as multi-task training and noise-aware training. The conversion module is also concatenated with the refinement module for joint training. In Chapter 5, we consider speech transformation based on latent space methods. We consider an acoustic space made up of speech bases and noise bases. Noisy speech conversion can then be formulated as a change of basis in the acoustic space or the latent space. We can also leverage representation learning to find a structured representation of speech and noise. Such latent structures enable us to replace the difficult noise background with a simpler one. Lastly, the converters are integrated with refiners to create a complete indirect speech enhancement system.
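As a concrete illustration of the two-step objective in Equation 2.28 and the combined loss above, the following is a minimal PyTorch-style sketch of jointly training a converter G and a refiner F'. The module names and data layout are assumptions for illustration; the actual training recipes are described in later chapters.

```python
import torch
import torch.nn as nn

def joint_indirect_loss(converter, refiner, Y, Z, X, lam=1.0):
    """Combined loss ||Z - G(Y)||^2 + lambda * ||X - F'(G(Y))||^2, averaged over the batch.

    converter (G) maps noisy features Y toward the intermediate target Z;
    refiner (F') maps the converted features toward the clean target X.
    """
    mse = nn.MSELoss()
    Z_hat = converter(Y)       # conversion step
    X_hat = refiner(Z_hat)     # refinement step
    return mse(Z_hat, Z) + lam * mse(X_hat, X)

# params = list(converter.parameters()) + list(refiner.parameters())
# optimizer = torch.optim.Adam(params, lr=1e-4)
# loss = joint_indirect_loss(converter, refiner, Y, Z, X); loss.backward(); optimizer.step()
```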

CHAPTER 3
CHARACTERIZATION OF ADDITIVE NOISES

3.1 Introduction

3.1.1 Noise in speech enhancement

Noise is known to affect speech perception in human communication. Different types of noise have different impacts on the quality and intelligibility of speech. In [148], the authors noted that speech-shaped noise, such as babble, could mask out speech. This type of noise made speech less intelligible. The effects of non-stationary noise were discussed in [149, 150]. Noise can also degrade an ASR system's performance, as it results in feature mismatch [151] or model mismatch [152]. In [153], the authors highlighted how babble and speech-shaped noise could obscure the F2 formant in vowel sounds. Intelligibility was also heavily compromised with additive noise, and many conventional speech enhancement methods have failed to improve it [51]. Moreover, most of the previous studies considered only a handful of noise types, such as babble and white noise, or did not address the performance gap of ASR or SE systems due to different noise backgrounds. In the rest of the chapter, we will demonstrate large performance gaps among various types of noise. We will next discuss noise characteristics and how they affect feature pre-processing. Lastly, we conduct empirical studies to validate our proposed categorization of simple and difficult noise.

3.1.2 Enhancement quality and improvement depending on noise types

We use PESQ introduced in Chapter 2 to evaluate the quality before and after a DNN speech enhancement system. Based on the enhancement results, we demonstrate that differences in quality exist for different noise types. The differences include the final PESQ scores after enhancement and the extent of improvement. In other words, speech in different

Figure 3.1: PESQ of 100 types of Nonspeech noise

backgrounds reaches different qualities after enhancement. Moreover, the improvements in PESQ score from unprocessed speech to enhanced speech are also not the same across different noise types.

An enhancement system based on a DNN is trained following the procedure in [10]. One hundred types of noise from the Nonspeech corpus [154] are used in training. The same 100 types of noise and 15 additional types from Noisex92¹ [103] are used to evaluate the DNN's performance with the PESQ score. PESQ is a standard that automates the assessment of speech quality. It mimics a mean opinion score as if a human listener rated the enhanced speech. Its scores range from -0.5 to 4.5. Higher scores correlate with better perceived quality. More details on PESQ can be found in [32].

From Figure 3.1 and Table 3.1, the difference in enhancement quality is noticeable. Figure 3.1 shows the enhancement results on the 100 types of noise used in the training set. The greatest improvement is over 0.95 for noise n086, indicated by the longest red bar. On the other hand, noise n054 is adversely affected by the enhancement network, showing a

¹Details of the Nonspeech and Noisex92 noises can be found in Appendix D and Appendix E, respectively.

Table 3.1: PESQ on Noisex92 noise types

Types    white   buccan2   pink   factory1   buccan1
Before    2.10      2.16   2.15       2.17      2.01
After     2.09      2.19   2.22       2.30      2.37

Types   hfchan    desops    f16  desengine    babble
Before    1.90      2.33   2.15       2.16      2.22
After     2.42      2.44   2.45       2.46      2.52

Types     m109  factory2  machinegun  leopard   volvo
Before    2.57      2.57        2.79     2.71    3.65
After     2.76      2.93        2.95     3.01    3.57

drop of 0.25 in score. The noise conditions also do not achieve the same level of quality after enhancement. The best noise type, n054, achieves a PESQ score of 3.7, whereas the PESQ for the worst noise, n086, is only 2.6. There is a gap of over 1.0 in perceptual quality. Similar differences exist for the unseen noise in Table 3.1. Noise such as hfchan sees the greatest improvement (greater than 0.5), but noise such as white and volvo barely improves.
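The before/after PESQ comparisons in Table 3.1 can be reproduced with any ITU-T P.862 implementation. The following is a hedged sketch using the open-source `pesq` and `soundfile` Python packages (an assumption about tooling, not necessarily what was used for this thesis); file names are placeholders.

```python
import soundfile as sf
from pesq import pesq   # PyPI "pesq" package; any P.862 implementation would do

ref, fs = sf.read("clean.wav")         # clean reference
noisy, _ = sf.read("noisy.wav")        # unprocessed noisy speech
enhanced, _ = sf.read("enhanced.wav")  # DNN output

mode = "wb" if fs == 16000 else "nb"   # wideband vs. narrowband PESQ
before = pesq(fs, ref, noisy, mode)
after = pesq(fs, ref, enhanced, mode)
print(f"PESQ before: {before:.2f}  after: {after:.2f}  gain: {after - before:.2f}")
```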

3.2 On the criteria to select intermediate targets

In section 2.4, the intermediate targets are chosen such that the distortion of either the conversion or the refinement stage is relatively small. When the intermediate target is selected to be close to the source noise type, the conversion is relatively easy, but the refinement needs to handle a difficult task. On the other hand, a simpler noise type as the intermediate target will ease the refinement task. In the rest of the section, we argue that a

simpler noise type should be chosen instead of a hard noise type as an intermediate target.

Figure 3.2: Spectrograms of sample noises. Top row: difficult noise types (a) n086, (b) n010, (c) white. Middle row: simple noise types (d) n054, (e) n099, (f) volvo. Bottom row: the outliers (g) machinegun, (h) n047, (i) n079.

Visual inspection of the spectrograms shown in Figure 3.2 partially re-affirms our prior understanding of the nature of noise. We expect steady and narrowband noise to be easier to handle than non-stationary and wideband noise. Based on the PESQ scores, we could infer that noise types including n086, n010, and white are all relatively difficult. Their spectrograms show varying temporal characteristics and wide bandwidths. Simple noise types, such as n054 and volvo, on the other hand, have stable temporal variations and are generally band-limited. Such observations are in agreement with past literature [8, 155]. However, we also notice that there are a few outliers. Non-stationary noise types, such as machinegun, n047, and n079, are unexpectedly good both before and after enhancement. Despite their non-stationary temporal characteristics, these noise types have long-term

Figure 3.3: Spectral shape of clean speech and some easy noise. Long-term average speech spectrum shown in red.

average spectra that only obscure part of the speech spectrum, as shown in Figure 3.3. As a result, when the input features of the speech enhancement network are short-time spectral features, only a small fraction of the frequency bins are dominated by noise, effectively simplifying the enhancement task.

Such a proposition is primarily based on our prior knowledge about noise robustness from experience. We examine similarities and differences among noise types by leveraging clustering. Each waveform in our noise corpus is segmented into 5-second chunks. Each segment is one observation sample. Welch's method [156] is then used to compute the PSD estimate of the sample, since averaged periodograms represent approximately uncorrelated estimates of the true PSD with reduced variability. Figure 3.4 shows the t-distributed stochastic neighbor embedding (t-SNE) visualization [157] of k-means with three clusters colored red, green, and blue. Comparing the clustering result with the PESQ scores in Table 3.1, we could tell that the green cluster comprises the "best-performing" noise types, including volvo, machinegun, and leopard. The red cluster contains challenging noise types. The blue cluster represents noise with medium difficulty. The clustering result shows that simple noise shares some common traits, as discovered by the k-means algorithm. Such traits may include band-limitedness, stationarity, and other characteristics, such as the overall spectral shape. We will verify each attribute's effects by converting noisy speech to intermediate targets with or without these attributes.

We adopt an empirical approach to determine what characteristics make a noise simple and suitable as an intermediate target. The enhancement system is implemented with a

Figure 3.4: k-means clustering with t-SNE projection on 115 noise types

DNN in [9], which remains fixed for this study. It consists of 3 hidden layers of 2048 nodes with sigmoid activation. The input features are 11 consecutive frames of LPS features. The framing window length is 512 samples with a frame shift of 256 samples. We train the enhancement network to map noisy LPS features into a single frame of clean LPS features. Both the inputs and the targets are normalized by the global mean and standard deviation per feature dimension. The noisy utterances are synthesized using the train si84 part of the Wall Street Journal (WSJ) corpus [158] mixed with the Nonspeech noise corpus [154]. The total size of the training speech is 32 hours. The 333 testing utterances are from the test eval92 directory of the WSJ corpus. In testing, both Nonspeech and Noisex92 noises are used to create the matched and mismatched test sets.
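For reference, the baseline mapping network described above can be written in a few lines. The following PyTorch sketch reflects only the sizes stated in the text (257-bin LPS, 11 context frames, three 2048-node sigmoid layers); the optimizer, initialization, and other training details are assumptions not specified here.

```python
import torch.nn as nn

class BaselineEnhancementDNN(nn.Module):
    """Maps 11 frames of normalized noisy LPS (11 x 257 inputs) to one clean LPS frame."""

    def __init__(self, n_bins=257, context=11, hidden=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_bins * context, hidden), nn.Sigmoid(),
            nn.Linear(hidden, hidden), nn.Sigmoid(),
            nn.Linear(hidden, hidden), nn.Sigmoid(),
            nn.Linear(hidden, n_bins),   # linear output layer predicts the clean LPS frame
        )

    def forward(self, y):                # y: (batch, 11 * 257)
        return self.net(y)
```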

Table 3.2: PESQ of conversion using simple noise as the intermediate targets

Intermediate            Nonspeech                          Noisex92
Target       Quality   Avg   1st quant   4th quant   Avg   1st quant   4th quant
direct          2.99  2.99        2.80        3.22  2.57        2.26        3.10
n079            3.55  3.01        2.86        3.18  2.57        2.22        3.06
n099            3.55  3.00        2.84        3.17  2.52        2.18        3.05
volvo           3.57  3.05        2.90        3.21  2.58        2.28        3.09
n047            3.67  2.84        2.88        3.14  2.60        2.28        3.07
n054            3.75  2.89        2.76        3.06  2.50        2.20        2.98

3.2.1 Simple noise with high PESQ score

Since n054 and volvo have high PESQ scores, they can be considered simple noise and good candidates as intermediate targets. To confirm whether the PESQ scores are a good way to select intermediate targets, we design experiments to perform indirect enhancement with speech in noise types with high PESQ scores as intermediate targets. In order to simulate the intermediate target speech, we add the target noise to clean speech to synthesize noisy speech in a desirable background. The conversion of noisy speech from the original noise background to the intermediate target is done with a DNN that maps the LPS features. A general enhancement system will then refine the converted speech. The conversion DNN is trained to minimize the MSE loss between its converted output and the intermediate target speech.

Table 3.2 compares the PESQ of the two test sets using the direct and indirect approaches with a list of intermediate targets considered as simple noise based on enhancement quality. When noisy speech in a noise type receives a PESQ score higher than 3.5, it is considered simple and used as an intermediate target. The original PESQ score is labeled as "Quality" in column 2 of Table 3.2. In addition to the overall average PESQ of each test set, two subsets are created in each test to evaluate the performance of the best and worst noise conditions. They are labeled as the first and fourth quantiles in Table 3.2. Such grouping helps analyze the effects of conversion on both simple and challenging noise conditions. The first quantile consists of about a quarter of noise types with the worst performance, whereas

Table 3.3: Comparison of MSE of direct and indirect methods on the 1st quantile of noisy speech

                     Nonspeech                          Noisex92
Noise    PESQ   Direct   Convert   Refine    PESQ   Direct   Convert   Refine
direct   2.80      380         -        -    2.26      550         -        -
n079     2.86      380       100      340    2.22      550       110      340
n099     2.84      380       190       10    2.18      550       200       10
volvo    2.90      380       140       90    2.28      550       140       90
n047     2.88      380        50      410    2.28      550        80      410
n054     2.76      380        80      170    2.20      550       150      170

(The "Direct" column is the MSE of direct enhancement, 380 for Nonspeech and 550 for Noisex92; it does not depend on the intermediate target and is therefore the same in every row.)

the fourth quantile contains noise types with the best performance. One could argue that the first quantile represents the difficult noise conditions, and the fourth quantile represents easy conditions.

We could draw several conclusions from Table 3.2. First, a good performance on the baseline DNN does not guarantee that a noise type can be used as an effective target for conversion. In the case of noise types n054 and n047, despite their relatively high quality of 3.75 and 3.67 when evaluated with the DNN, using them as intermediate targets does not improve the overall performance, as indicated by the bottom two rows in Table 3.2. Second, even though the overall improvement from 2.99 to 3.05 is small in volvo noise, the benefits are more pronounced for noise types within the first quantile, where the improvement is from 2.80 to 2.90 when volvo is the intermediate target. A similar trend could be observed for other noise types. Conversion is generally not useful for simple noise types in the fourth quantile, as the PESQ scores all drop after conversion, and is more useful for difficult conditions in the first quantile. Third, the indirect approach with DNN-based conversion does not address the domain mismatch problem. Comparing the results on the Nonspeech test set with those on the Noisex92 test set, one could tell that the improvement on the latter is much smaller. This difference might be because the converter is also implemented with a similar DNN trained on the same features. Consequently, it also suffers from a similar domain mismatch problem.

To explain why the indirect approach works for challenging noise, we can compute the

Table 3.4: Effects of spectral shapes on the suitability as intermediate targets

            Nonspeech                          Noisex92
Noise     Avg   1st quant   4th quant    Avg   1st quant   4th quant
direct   2.99        2.80        3.22   2.57        2.26        3.10
volvo    3.05        2.90        3.21   2.58        2.28        3.09
ovlov    2.82        2.69        2.98   2.44        2.20        2.87

distance between noisy features and clean targets. Likewise, we could compute the distances from the conversion targets to both the noisy and the clean speech. MSE can be used as a distance metric. It is a rough measure of the difficulty of neural network-based mapping. In Table 3.3, we could see that the sum of the MSEs of the conversion and refinement stages is generally lower than the MSE of direct enhancement. For example, in the case of volvo noise on the Nonspeech test set, the average MSE is 380 between the original noisy speech and the clean targets. However, converting to volvo only needs to overcome an average MSE of 140, followed by another 90 in the refinement. The sum of the two stages is lower than the MSE of direct enhancement. This smaller MSE could help explain the benefits of indirect enhancement.

3.2.2 Noise shape

Instead of selecting intermediate targets solely based on their performance on the baseline DNN system, one could choose targets based on some signal attributes. We invert the spectrum of volvo by modulating the signal by e^{jπn} to create an artificial noise, named ovlov, so that the two share the same bandwidth, stationarity, and instantaneous energy. The long-term average spectra of clean speech, volvo, and ovlov noise are shown in Figure 3.5. The experimental results of converting noisy speech to both volvo and ovlov are listed in Table 3.4. It is evident that ovlov is a less ideal target than volvo in terms of the quality after conversion, as its PESQ scores of 2.82 and 2.44 on the Nonspeech and Noisex92 test sets are lower than those of volvo noise. As Figure 3.5 suggests, the spectrum of volvo lies beneath that of speech in most frequency bins. This energy distribution reduces the

Figure 3.5: Long-term average spectra of speech, volvo, and ovlov

Table 3.5: MSE comparison between volvo and ovlov noise

        Direct   Conversion   Refinement   Indirect
volvo     3300         2700          490       3200
ovlov     3300         4100         1300       5400

spectral mismatch between the target and clean speech. In terms of MSE loss, it is clear that ovlov noise is both harder to convert to and harder to refine due to its spectral shape contrasting with that of speech. In other words, a good intermediate target should exhibit a spectral shape that generally lies beneath the speech spectrum. This idea echoes the psycho-acoustic weighting used in conventional speech enhancement and speech coding [49, 117], where more emphasis is placed on spectral valleys to minimize distortions in these regions. In this case, the target noise should have low energy at frequencies where the speech is weak. Otherwise, the dominant noise in weak speech regions will make the subsequent refinement task difficult.
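The spectrum inversion that turns volvo into ovlov is a one-line operation: multiplying a discrete signal by e^{jπn} = (−1)^n shifts its spectrum by half the sampling rate, mirroring low-frequency energy into the high band while leaving the temporal envelope untouched. A minimal sketch, with placeholder file names:

```python
import numpy as np
import soundfile as sf

# Modulate by e^{j*pi*n} = (-1)^n: the spectrum shifts by fs/2, so the low-frequency
# energy of volvo moves into the high-frequency band while bandwidth, stationarity,
# and instantaneous energy are preserved.
volvo, fs = sf.read("volvo.wav")
ovlov = volvo * (-1.0) ** np.arange(len(volvo))
sf.write("ovlov.wav", ovlov, fs)
```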

Table 3.6: Effects of bandwidth of conversion targets

Passbands            Nonspeech                          Noisex92
/Hz          Avg   1st quant   4th quant    Avg   1st quant   4th quant
-           2.99        2.80        3.22   2.57        2.26        3.10
0-50        3.04        2.89        3.20   2.58        2.27        3.09
0-100       3.02        2.88        3.17   2.62        2.38        3.09
0-200       2.91        2.79        3.03   2.56        2.33        2.99
0-500       2.80        2.71        2.90   2.49        2.27        2.86
100-200     2.96        2.81        3.10   2.58        2.34        3.01
200-400     2.80        2.70        2.92   2.48        2.26        2.86

3.2.3 Bandwidth

To study the effects of the bandwidth of the intermediate noise type, we filter white noise with low-pass or band-pass filters with specific passbands to create artificial noise with the desired bandwidths. We also shift the narrowband noise's peak to investigate the effect of the location of the spectral peak. In the first series of experiments, the passbands gradually increase from 50 Hz up to 500 Hz. The trend in Table 3.6 indicates that the performance degrades steadily above 100 Hz. This result confirms our earlier observation that wideband noise tends to be difficult for enhancement, and intermediate targets should be band-limited. In the second experiment, the peak of the noise spectrum is shifted. By comparing the rows (0-100) vs. (100-200) and (0-200) vs. (200-400), we could conclude that noise occupying lower frequency bands is more suitable as a conversion target. Such a result could be explained by the speech spectrum having higher energy in lower frequency bands, which makes it easier to mask the noise.
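The band-limited conversion targets in Table 3.6 can be generated by shaping white noise with a standard filter. The following sketch uses a Butterworth filter from scipy; the filter order and the sample rate in the usage comment are arbitrary illustrative choices.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def bandlimited_noise(n_samples, fs, f_lo, f_hi, order=6):
    """White noise passed through a low-pass (f_lo <= 0) or band-pass filter."""
    white = np.random.randn(n_samples)
    if f_lo <= 0:
        sos = butter(order, f_hi, btype="lowpass", fs=fs, output="sos")
    else:
        sos = butter(order, [f_lo, f_hi], btype="bandpass", fs=fs, output="sos")
    return sosfilt(sos, white)

# e.g. the 0-100 Hz and 100-200 Hz targets from Table 3.6, 5 s at 16 kHz:
# noise_0_100   = bandlimited_noise(16000 * 5, 16000, 0, 100)
# noise_100_200 = bandlimited_noise(16000 * 5, 16000, 100, 200)
```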

3.2.4 Stationarity

The last signal attribute to examine is the stationarity of the conversion targets. To maintain a fair comparison with another noise of similar bandwidth and spectral shape, we introduce non-stationarity to the volvo noise by creating interleaving patterns in its temporal envelope, as shown in the second and third spectrograms in Figure 3.6. The performance of

Figure 3.6: Non-stationary examples of volvo-like noise: (a) volvo, (b) volvo2, (c) volvo3

Table 3.7: Effects of stationarity of conversion targets

            Nonspeech                          Noisex92
Noise     Avg   1st quant   4th quant    Avg   1st quant   4th quant
direct   2.99        2.80        3.22   2.57        2.26        3.10
volvo    3.05        2.90        3.21   2.58        2.28        3.09
volvo2   3.07        2.92        3.24   2.59        2.27        3.10
volvo3   3.05        2.90        3.21   2.58        2.29        3.08

using each variation of volvo noise is given in Table 3.7. Contrary to our prior assumption that non-stationary noise is typically harder than stationary noise, the results in Table 3.7 do not reveal a significant difference among the three noise types for either test set. This result agrees with the earlier observation of non-stationary noise types with good performance, such as machinegun, n047, and n079 in Figure 3.2. It also suggests that stationarity is a less crucial factor when evaluating intermediate targets.

We can conclude through this series of experiments that a good performance on the baseline DNN is insufficient to guarantee that a noise type is suitable as an intermediate target. Simple noise should ideally have a long-term average spectrum that lies below the clean speech spectrum. A band-limited signal that resides in low-frequency regions tends to perform better than wideband signals or signals occupying higher frequency bands. Its stationarity is less relevant as a conversion target.
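For reference, the interleaving temporal envelope used to create the volvo2 and volvo3 variants above can be imposed with a simple on/off gate. The period and duty cycle below are illustrative assumptions, not the exact values used to generate Figure 3.6.

```python
import numpy as np

def interleave_envelope(noise, fs, period_s=0.5, duty=0.5):
    """Gate a stationary noise with a periodic on/off envelope to make it non-stationary."""
    t = np.arange(len(noise)) / fs
    gate = ((t % period_s) < duty * period_s).astype(noise.dtype)
    return noise * gate
```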

3.3 Summary

In this chapter, we discuss the difficulty of processing speech input when background noise is present. Noise generally degrades the quality of perceived speech and hinders our ability to comprehend its content. The same issue exists for speech enhancement or ASR systems. We highlight the disparity in the difficulty of enhancing speech in different background noises. The differences can be observed in terms of both the gain in PESQ scores and the final quality. We then attempt to cluster background noises into simple and difficult groups based on their spectral and temporal characteristics. Out of the many signal characteristics, we find the average spectral shape and the bandwidth most relevant when evaluating a noise's difficulty in enhancement applications. Both of these factors are satisfied if the long-term average speech spectrum can effectively mask the noise. Such knowledge will aid us in finding suitable intermediate targets for indirect speech enhancement.

CHAPTER 4
INDIRECT SPEECH ENHANCEMENT WITH SUPERVISED LEARNING

4.1 Introduction

In Chapter 3, we identified noise characteristics that determine whether the speech in that noise can be easily enhanced. Such a simple noise could be a suitable intermediate target in our indirect approach to SE. Once we have selected an intermediate target, we need to design the other components in the progressive enhancement framework. Specifically, we define a conversion step, during which the original noisy speech is converted to speech in a less difficult noise. It is followed by a refinement step responsible for recovering the final clean speech using the intermediate speech as an input. This process is depicted in the upper path of Figure 4.1. In comparison, the direct approach is shown in the lower path. In this chapter, we will first outline a speech conversion technique based on matching the statistics of speech features. This step is inspired by the observation in Chapter 3 that improper feature normalization leads to degraded enhancement performance. The next approach is based on a frame-level mapping. The motivation behind the frame-level mapping is to leverage the universal approximation theorem of neural networks [159] to perform feature transformation. Lastly, we extend the indirect approach to handle noisy conditions with

Figure 4.1: Framework of indirect speech enhancement

multiple noise sources. We show that the noise interferences can be removed sequentially to achieve indirect enhancement.

4.2 Matching feature statistics

This section first shows the effects of noise on feature normalization, which is crucial in DNN-based speech enhancement. An adverse condition arises when difficult noise at low SNR affects both mean and variance normalization. For the mean standardization, there will be a deviation from the global mean statistics. The deviation creates an offset from 0 in the normalized features. The offset is greater at frequency bands with more dominant noise power. As a result, for speech in simple noise, mean normalization translates feature vectors to almost zero. However, the more difficult noisy speech will not be centered at 0, as its mean is far from the global mean. Similarly, for difficult noise, features after variance normalization will also not have unit variance at bands with more noise power. Consequently, this results in a mismatch between the input distributions in training and testing. To alleviate such deviations, we propose to apply a transformation to match the statistics of the normalized features of a difficult noise to those of a simpler noise. To achieve this goal, we will investigate the use of mean-variance matching and histogram equalization algorithms.

4.2.1 Effects of noise in feature normalization in speech enhancement

Feature normalization, or feature standardization, refers to the practice of scaling input features to the same range so that they have similar magnitudes. Min-max scaling could be used if the lower and upper bounds of the feature values are known [160]. The normalized feature will be constrained in the range of [−1, 1] or [0, 1]. This is commonly seen in image processing, where pixel values are finite after digitization, as in [0, 255] for an 8-bit gray-scale image. LPS is a commonly used feature in SE [16]. It is defined as

X_{LPS} = \log\big( X^{*}(m, k)\, X(m, k) \big) = 2 \log |X(m, k)|,    (4.1)

where X(m, k) is the short-time DFT of the m-th frame at frequency bin k. One can see that such features are not bounded. This property makes min-max scaling difficult. Features like LPS can be normalized with z-score normalization instead [161]. Z-score normalization can be defined as the following linear transform

\bar{X}_{LPS} = \frac{X_{LPS} - \mu_{LPS}}{\sigma_{LPS}},    (4.2)

where µ_LPS and σ_LPS are the mean and standard deviation of the feature, X_LPS, accumulated over the frames in each dimension. As a result of z-score normalization, each feature dimension's mean will be 0, and the standard deviation will be 1. Theoretically, this linear transform is not essential, as the linear operation can be captured by the input layer in a DNN. However, practical reasons exist for the benefits of feature normalization. Optimizers such as stochastic gradient descent could converge faster. Without feature normalization, the error surface could become elongated, and a global learning rate will make learning in some dimensions very slow [159]. In speech enhancement, both input and target features are standardized to possess zero mean and unit variance [10]. An inverse linear transform of Equation 4.2 is performed after DNN prediction to reconstruct LPS in the original scale.

We wish to understand the effects of normalization on the LPS features depending on the background noise. The LPS features used in our enhancement experiments have 257 dimensions. We rely on dimension reduction techniques to find their projections for visualization. The two-dimensional projections of the normalized LPS features are obtained with PCA. Figure 4.2 shows the distribution of these projected speech vectors in a few noise types, including white, pink, and volvo noise, after normalization. One could observe that the normalized noisy speech vectors in different noises have dissimilar distributions. Specifically, noisy speech vectors in white and pink noise drift away from the center. Their clusters are also more compressed compared to that of volvo.
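The LPS extraction in Equation 4.1 and the z-score normalization in Equation 4.2 can be summarized in a few lines of numpy/scipy. This is a minimal sketch assuming the 512-sample window and 256-sample shift used elsewhere in this thesis; the global statistics would normally be accumulated over the whole training set rather than a single utterance.

```python
import numpy as np
from scipy.signal import stft

def lps_features(wave, fs, n_fft=512, hop=256):
    """Log-power spectrum (Equation 4.1): one row per frame, n_fft // 2 + 1 bins."""
    _, _, X = stft(wave, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    return 2.0 * np.log(np.abs(X).T + 1e-12)

def zscore(lps, mu, sigma):
    """Z-score normalization (Equation 4.2) with global per-dimension statistics."""
    return (lps - mu) / (sigma + 1e-12)

# Global statistics accumulated over training data:
# mu = train_lps.mean(axis=0); sigma = train_lps.std(axis=0)
```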

Figure 4.2: PCA projections of normalized features for different noises

Intuitively, clean speech can be considered a simple noise since it can be reconstructed with high fidelity. We understand that different speech sounds, including vowels and consonants, have very different spectra. Vowels possess formant peaks and mostly occupy lower frequency bands. Consonants, such as stops and fricatives, are noise-like and occupy high-frequency bands [162]. Thus, it is intuitive to expect clean speech features to spread apart, because different vowel and consonant sounds have very different spectra. By comparing volvo and white, we can infer that volvo is a simpler noise type because it is spread out in the same way clean speech is. On the other hand, speech in white or pink noise always has a noise background. The noise background can be considered a marker that makes noisy speech features alike. Hence, the projections of speech features in pink or white noise are more tightly clustered. To better understand such a difference, we will discuss how different noise affects the normalization of the LPS feature in the next section.

4.2.2 Deviation of mean in normalizing speech in difficult noise types

To understand why speech in different noise backgrounds follows different distributions after the z-score normalization defined in Equation 4.2, we will show that the normalization of LPS features suffers from mean and variance deviation if the overall SNR level is low or the noise type is difficult. Recall the additive noise model after the short-time Fourier transform. The complex spectrum of noisy speech at frame index m and frequency bin k is

Y(m, k) = X(m, k) + D(m, k)    (4.3)
        = |X(m, k)| e^{j\angle X(m,k)} + |D(m, k)| e^{j\angle D(m,k)}.    (4.4)

|X(m, k)| and |D(m, k)| are the clean speech and noise magnitudes. ∠X(m, k) and ∠D(m, k) are their phases. Since z-score normalization is applied to each dimension independently, we will drop the frequency indicator, k, in the discussion below. We denote |Y(m, k)| as Y_m, and likewise for X_m and D_m. The instantaneous power spectrum is computed as

Y_m^2 = Y_m Y_m^{*}    (4.5)
      = (X_m e^{j\angle X(m)} + D_m e^{j\angle D(m)})(X_m e^{-j\angle X(m)} + D_m e^{-j\angle D(m)})    (4.6)
      = X_m^2 + D_m^2 + 2 X_m D_m \cos(\angle X(m) - \angle D(m)).    (4.7)

Let φ_XD = ∠X(m) − ∠D(m) denote the phase angle difference, and let ξ_m = X_m / D_m be the instantaneous SNR. We can take the logarithm of both sides of Equation 4.5 to derive the noisy LPS,

\log Y_m^2 = \log\big( X_m^2 + D_m^2 + 2 X_m D_m \cos(\angle X(m) - \angle D(m)) \big)    (4.8)
           = \log X_m^2 + W(\xi_m, \phi_{XD}),    (4.9)

where
W(\xi_m, \phi_{XD}) = \log\Big( 1 + \frac{D_m^2}{X_m^2} + \frac{2 D_m \cos\phi_{XD}}{X_m} \Big).    (4.10)

To perform feature normalization, we need to compute the global mean and variance of clean LPS features. The mean of the clean LPS in each dimension is defined as

\mu_{LPS} = \mathbb{E}\big[ \log X_m^2 \big].    (4.11)

Similarly, the variance of the LPS in each dimension is defined as

\sigma_{LPS}^2 = \mathrm{Var}(\log X_m^2) = \mathbb{E}\big[ (\log X_m^2 - \mu_{LPS})^2 \big].    (4.12)

We want to show that the normalized feature will not be centered at 0 or have unit variance after normalization by Equation 4.2 for difficult noise types, which could help us understand the translation and compression of noisy speech features in pink and white noise in Figure 4.2. If the normalized features are zero-centered, the expectation of the noisy LPS features must equal the global mean. However, if they are not equal, there will be a deviation. We define the deviation in mean as

\Delta_{\mu} = \mathbb{E}\big[ \log Y_m^2 \big] - \mu_{LPS}.    (4.13)

By expanding W(ξ_m, φ_XD) in Equation 4.10 with a Taylor expansion and assuming φ_XD follows a uniform distribution from −π to π, we can show that

 h i E 2 cos φXD , if ξ → ∞  ξm m ∆µ ≈ (4.14)  E  −2 log ξm . if ξm → 0

The derivation of Equation 4.14 can be found in Appendix A. We can interpret the result in Equation 4.14 depending on ξ_m. When the signal is at high SNR, there are a lot of time-frequency (TF) bins with high ξ_m. Similarly, if noise is effectively masked by speech, i.e., it fits the criterion of a simple noise, most TF bins have high ξ_m, and the deviation after mean normalization can be approximated as E[2 cos(φ_XD) / ξ_m]. As ξ_m → ∞, the deviation, ∆_µ, approaches 0. Hence, for high SNR or simple noise, there is little deviation in the mean. This is the case for speech in volvo noise in Figure 4.2.

When the signal is at low SNR, there are many TF bins with low ξ_m. Speech in difficult noise has many TF bins with low ξ_m, too. In these scenarios, ∆_µ can be approximated as −2 E[log ξ_m]. This implies that the deviation depends on ξ_m. When there are many TF bins with low SNR, or when some instantaneous SNRs are very low, the deviation will be significant.
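The two regimes of Equation 4.14 can be checked numerically. The sketch below draws instantaneous SNRs ξ_m from assumed high- and low-SNR ranges (arbitrary choices for illustration), samples φ_XD uniformly, and compares the exact expectation of W(ξ_m, φ_XD) from Equation 4.10 with the corresponding approximation.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_deviation(xi):
    """Monte-Carlo estimate of E[W(xi, phi)] with phi ~ Uniform(-pi, pi)."""
    phi = rng.uniform(-np.pi, np.pi, size=xi.shape)
    W = np.log(1.0 + 1.0 / xi**2 + 2.0 * np.cos(phi) / xi)
    return W.mean()

xi_high = rng.uniform(20.0, 40.0, size=200_000)   # bins where speech dominates
xi_low = rng.uniform(0.01, 0.1, size=200_000)     # bins where noise dominates

print(mean_deviation(xi_high))                                  # close to 0
print(mean_deviation(xi_low), -2.0 * np.mean(np.log(xi_low)))   # the two agree closely
```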

Figure 4.3: The observed mean deviation agrees with the estimated mean deviation: (a) red noise, (b) purple noise

Next, we rely on simulation to verify the results in Equation 4.14. An utterance of clean speech is separately mixed with red noise and purple noise to create two segments of recordings. The details of red and purple noise can be found in Appendix C. Because the spectrum of red noise is more similar to that of speech, it can be better masked by speech than purple noise. On the other hand, purple noise contains energy at speech spectral

valleys. Our analysis in Chapter 3 suggests that speech in red noise is simpler than speech in purple noise. We would expect speech in red noise to suffer less from mean deviation, whereas speech in purple noise will experience significant deviation.

We compute the estimated mean deviation of speech in red noise using the first condition in Equation 4.14. The deviation of speech in purple noise is estimated using the second condition in Equation 4.14. We also measure the actual deviation and denote it as "observed." The estimated and observed deviations for each noise are plotted in orange and blue, respectively, in Figure 4.3. For red noise, shown in Figure 4.3a, both the observed and the estimated deviations are very close to 0, suggesting little mean deviation for this simple noise. As a result, the normalized features will be effectively centered around 0. This behavior is desirable, as it matches the distribution of centered features used in training the SE DNN.

For purple noise, shown in Figure 4.3b, both the theoretical prediction and the empirical observation show that the deviation increases at higher frequency bins. This agreement is expected, as purple noise has a larger power density in higher frequency ranges. The result confirms that the mean deviation depends on −2 E[log(ξ_m)], where ξ_m is generally very low in high-frequency bins for purple noise.

4.2.3 Deviation of variance in normalizing speech in difficult noise types

We consider the difference between the variance of the noisy speech feature, Var(log Y_m^2), and the global variance, σ_LPS^2 = Var(log X_m^2),

\Delta_{\sigma^2} = \mathrm{Var}(\log Y_m^2) - \mathrm{Var}(\log X_m^2).    (4.15)

If ∆_σ² is around 0, then the noisy speech feature will have unit variance after normalization. Otherwise, the feature will not be scaled to a desirable range. If ∆_σ² < 0, normalizing the noisy speech feature with σ_LPS^2 will over-compress the features, resulting in tightly clustered features, such as speech in pink and white noise in Figure 4.2.

We can simplify the expression in Equation 4.15 depending on ξ_m. We leave the details to Appendix B and present the final results directly:

\Delta_{\sigma^2} \approx \begin{cases} \mathbb{E}\Big[ \big( \frac{2 \cos\phi_{XD}}{\xi_m} \big)^{2} \Big], & \text{if } \xi_m \to \infty \\ \mathrm{Var}(\log D_m^2) - \mathrm{Var}(\log X_m^2), & \text{if } \xi_m \to 0 \end{cases}    (4.16)

The results in Equation 4.16 can be interpreted as follows. At high SNR or in simple noise, ξ_m → ∞. Then the deviation of variance is negligible, as E[(2 cos(φ_XD) / ξ_m)^2] → 0. This corresponds to volvo noise in Figure 4.2. In contrast, for speech in difficult noise or at low SNR, the deviation depends on how different the noise spectrum is from the speech spectrum. This corresponds to pink and white noise in Figure 4.2.

Figure 4.4: The observed variance deviation agrees with the estimated variance deviation: (a) red noise, (b) purple noise

We also examine the agreement between the estimated deviation and the observed values. The same speech in red and purple noise from the previous section is reused for our measurement. We plot the results in Figure 4.4. Both the estimated and the observed deviations are very small for red noise in Figure 4.4a. This small deviation suggests that the normalized feature will have unit variance and be scaled properly. For purple noise in Figure 4.4b,

the deviation is significant in both our prediction and our measurement. Furthermore, the deviation is negative, suggesting that the noisy speech LPS has a smaller variance than the global variance. As a result, the normalized feature will have a variance of less than 1. This explains the tight clusters for difficult noise types in Figure 4.2. Over-compression of normalized features makes the features more tightly clustered and less separable. In a classification task such as recognition, the densely clustered feature vectors leave less margin for a DNN to find decision boundaries. In a regression task such as enhancement, the features are less distinguishable, generating a smeared spectrogram in reconstruction.

The analysis above shows that noise can affect feature normalization in speech enhancement. Difficult noise makes the normalized features off-center and over-compressed. The offset and the compression introduce a mismatch in the normalized features at testing time. In the next section, we will discuss some techniques to address such a mismatch in the indirect enhancement pipeline.

4.2.4 Mean-variance matching

Mean-variance normalization is a linear transformation of input features so that the shifted and rescaled features possess the desired mean and variance. Mean normalization in the cepstral domain, as reviewed in Chapter 2, removes the convolutional effects of the channel if the channel is assumed to be stationary. Though lacking a physical interpretation, variance normalization is still widely adopted in DNNs to allow more efficient convergence of back-propagation [159]. The training process converges faster due to the improved numerical conditioning of the optimization. Normalization also ensures that the default initialization of network layer weights is appropriate [160].

The rationale behind matching the normalized statistics is to enforce the training and testing features to follow a similar distribution [163]. When training a DNN, normalization statistics can be adjusted to account for variations in features. The adjustment ensures that

the training features are all normalized properly to zero mean and unit variance. When testing or enhancing an utterance in difficult noise types, we cannot perfectly normalize the input features due to the factors analyzed in the previous section. Hence, we wish to change the distribution of the input features into another distribution that can be normalized properly, such as that of speech in simple noise. We name this process mean-variance matching. Even though speech features are converted, there is no need to transform features frame by frame. We are only interested in matching the mean and the variance of the transformed features.

In order to match the mean and variance statistics of noisy speech, we first estimate these statistics from simple and difficult speech, respectively. Speech in difficult noise is denoted as the source domain with subscript S. Speech in simple noise is denoted as the target domain with subscript T. We could estimate the mean and variance statistics in either domain reliably by accumulating sufficient frames of speech. Let the dimension-wise mean

and variance of speech features in the target domain be µ_T and σ_T^2, respectively. The mean statistic in each feature dimension is estimated by averaging all M frames, X_{Ti},

\mu_T = \frac{1}{M} \sum_{i=1}^{M} X_{Ti}.  (4.17)

The variance is estimated over observed frames in the target domain

\sigma_T^2 = \frac{1}{M-1} \sum_{i=1}^{M} (X_{Ti} - \mu_T)^2.  (4.18)

The mean, µ_S, and variance, σ_S^2, of speech in the source domain are estimated in a similar manner. The standard deviations, σ_S and σ_T, are the square roots of their respective variances. With these mean and standard deviation statistics available, a simple affine transformation (a shift followed by a scaling) maps speech feature vectors in the source domain, X_S, to the distribution of those in the target domain, X_T, in each feature dimension,

\hat{X}_T = \sigma_T \frac{X_S - \mu_S}{\sigma_S} + \mu_T.  (4.19)

The transformed speech \hat{X}_T follows a distribution more similar to that of speech in simple noise. Hence, it is reasonable to expect the transformed speech to yield better enhancement results than the directly enhanced speech.
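As an illustration, the per-dimension matching in Equation 4.19 can be written in a few lines of NumPy. This is a minimal sketch; the function name, the (frames x dimensions) array layout, and the small epsilon guard are assumptions made for illustration, not the exact implementation used here.

    import numpy as np

    def mean_variance_match(X_src, X_tgt, eps=1e-8):
        # X_src: (M_src, D) LPS frames of speech in difficult noise (source domain).
        # X_tgt: (M_tgt, D) LPS frames of speech in the intermediate target noise.
        # Returns source frames shifted and rescaled to carry the target-domain
        # statistics, as in Equation 4.19.
        mu_s, sigma_s = X_src.mean(axis=0), X_src.std(axis=0, ddof=1)
        mu_t, sigma_t = X_tgt.mean(axis=0), X_tgt.std(axis=0, ddof=1)
        return (X_src - mu_s) / (sigma_s + eps) * sigma_t + mu_t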

4.2.5 Histogram equalization

Clean speech features are more Laplacian than Gaussian [164]. Hence, matching the low-order statistics, including mean and variance, does not necessarily match the overall distribution. If more data is available, we could obtain reliable estimates of higher-order statistics, allowing us to match higher moments of the speech vectors. Histogram equalization [127] is one such technique that matches all moments. As with mean-variance normalization, a one-to-one transformation is created from each utterance in the difficult noisy speech domain to the simpler domain. The target distribution is the distribution of speech feature vectors from an appropriate intermediate target environment. The distribution in each dimension is estimated by accruing sufficient features in the simple noise. For an LPS feature, X_T, from the target domain, we approximate the distribution by its probability mass function, f_T(i), which represents the fraction of LPS values at level i,

f_T(i) = \mathrm{Prob}(X_T = i) = \frac{n_i}{n},  (4.20)

where n_i is the count of values at level i out of n total observations. Since the range of LPS features is unbounded, we have to estimate its extreme values. We can then specify a proper range for i. The cumulative distribution function (CDF), F_T(i), can be obtained by summing f_T(i),

F_T(i) = \sum_{j=-\infty}^{i} \mathrm{Prob}(X_T = j).  (4.21)

The CDF for speech in the source domain, F_S(i), can be obtained in the same way for the source feature, X_S. It is shown that we can first apply F_S to X_S to obtain a uniform distribution [49]. Subsequently, the inverse CDF, F_T^{-1}, will transform the original feature, X_S, to possess the same distribution as X_T. Hence, the overall transformation process is

\hat{X}_T = F_T^{-1}\left( F_S(X_S) \right).  (4.22)

The refinement module then processes the transformed speech for final enhancement. It is nevertheless important to note that accurate estimates of the full distribution require many more observations, so many utterances must be collected for histogram equalization to work well. We will evaluate the data size requirement in the next section.
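As a sketch, the mapping in Equation 4.22 can be realized with empirical quantile matching on each feature dimension. This assumes the accumulated frames are stored as NumPy arrays, and it sidesteps the explicit level discretization in Equation 4.20 by interpolating the empirical CDFs, so it is one possible realization rather than the exact procedure used here.

    import numpy as np

    def histogram_equalize(X_src, X_tgt):
        # X_src: (M_src, D) LPS frames in the difficult (source) noise.
        # X_tgt: (M_tgt, D) LPS frames in the simple (target) noise.
        # For each dimension, apply the empirical source CDF, then the inverse
        # empirical target CDF, i.e., X_hat = F_T^{-1}(F_S(X_S)).
        X_out = np.empty_like(X_src)
        for d in range(X_src.shape[1]):
            src_sorted = np.sort(X_src[:, d])
            tgt_sorted = np.sort(X_tgt[:, d])
            # F_S: fraction of source samples not exceeding each value.
            ranks = np.searchsorted(src_sorted, X_src[:, d], side="right") / len(src_sorted)
            # F_T^{-1}: read the target value off at the same quantile.
            X_out[:, d] = np.interp(ranks, np.linspace(0, 1, len(tgt_sorted)), tgt_sorted)
        return X_out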

4.2.6 Experiments and discussions

In the following experiments, clean speech from the WSJ0 corpus [158] is mixed with babble, pink, and white noise from Noisex92 [103] to synthesize noisy speech. We first show that speech in difficult noise, such as white noise, suffers from mismatched normalization. Chapter 3 discussed that the average power spectral density of volvo noise is masked by speech, but that of white noise is not. Thus, we could consider volvo a simple noise and white a difficult noise. Figure 4.5 compares the mean and variance of noisy speech in white noise (blue) and volvo noise (red). The yellow curves are the references: 0 for the mean and 1 for the variance. While the statistics of speech in volvo noise (red) hover above and below the reference, the statistics of white noise (blue) are consistently off from the reference. This suggests that the speech in white noise is likely sampled from a different distribution than the data used to compute the reference. Hence, its enhancement result is likely to be unsatisfactory.

Figure 4.5: Deviation from zero mean (left) and unit variance (right) between simple and difficult noise samples. White is an example of difficult noise and volvo is an example of simple noise.

Besides comparing the mean and variance, we also examine the overall distribution. To better visualize the shift in distribution due to noise, we first compare the distribution of input features subject to different SNR levels. Unlike noise type, SNR has a natural interpretation in relation to the noisy environment's adversity, so it lends more direct insight into its influence. A random dimension in the clean input feature vector is selected, and its distribution is shown in purple in Figure 4.6a. The clean feature, denoted as ref, is shown in purple. At 5dB white noise, the feature shifts to the yellow distribution. As the SNR level further decreases, the distributions at 0dB and -5dB are shown in red and blue, respectively. The more adverse the condition, i.e., the lower the SNR, the further apart the distribution is from the reference. This confirms our assumption that a lower SNR results in a larger deviation in normalized features. With this intuition, we next examine the shift in the distribution of noisy speech in different background noise at 0dB, shown in Figure 4.6b. Speech in volvo (blue histogram) is chosen as the reference since it is considered a simple noise. The yellow histogram is speech in babble noise. The purple and orange histograms belong to speech in pink and white noise, respectively. Since white noise shows the greatest deviation from the simple noise, it should be the most challenging noise type, just as speech at -5dB is the most challenging SNR condition in Figure 4.6a. This observation is consistent with the quality assessment with PESQ. The first column in Table 4.1 tabulates the PESQ scores of speech in volvo, babble, pink, and white noise at 0dB. Since speech in white noise shows larger variation from the reference than babble, its PESQ score is also the lowest, as expected. The improvement of direct enhancement is limited in babble noise. There is no gain for speech in white noise.

Figure 4.6: Comparison of feature distribution (a) by SNR and (b) by noise

PESQ      raw     direct   mean-variance normalization   histogram equalization
volvo     3.20    3.25     -                             -
babble    1.81    2.12     1.99                          2.03
pink      1.63    1.70     2.07                          2.16
white     1.56    1.48     2.07                          2.16

Table 4.1: PESQ score with and without matching statistics

We apply mean-variance normalization as described in subsection 4.2.4 to speech in babble, pink, and white noise. The PESQ results in Table 4.1 show that it is a good solution for very challenging noisy speech, such as speech in pink and white noise. After applying mean-variance normalization, the feature mismatch decreases. Enhancement of the transformed speech yields a PESQ score of 2.07 for speech in both noise types, a large improvement from 1.70 and 1.48, respectively. Mean-variance normalization is ineffective with babble noise. The reason could be that babble is only moderately difficult, as its mismatch is not as serious as that of pink and white noise, as indicated in Figure 4.6b. Hence, if the mean and variance are not estimated reliably, the distortion is likely to outweigh the benefits of matching statistics. Next, we examine the results with histogram equalization. Figure 4.7 displays the effect of histogram equalization. The source domain is speech in white noise (blue).

Figure 4.7: Effect of histogram equalization on feature distribution

The target domain is speech in volvo noise (orange). Even though the two have very different distributions, we can apply the transformation described in Equation 4.22. The result of histogram equalization is shown in yellow in Figure 4.7. The transformed features follow a distribution similar to that of speech in volvo noise. Applying speech enhancement to the transformed features further improves speech quality beyond mean-variance normalization. The last column in Table 4.1 shows that the PESQ score of speech in pink and white noise can be further improved to 2.16. Hence, indirect speech enhancement with matched feature statistics is an effective strategy for difficult noise environments. The effect of histogram equalization can be better understood by analyzing the distribution of latent variables inside the neural network before and after the feature transformation. We sample a random node at the middle layer of the enhancement DNN for volvo, babble, white, and pink noisy speech. Figure 4.8a displays the distribution of each noise type. Similar to Figure 4.6b, the distributions in the latent layer show a recognizable trend reminiscent of that in the input features. Specifically, speech in pink and white noise (pink and red) is more mismatched from volvo (blue) than babble noise (yellow) because these noise types are more challenging.

Figure 4.8: Effect of histogram equalization in hidden layers; (a) without equalization, (b) with equalization

                     babble   pink   white
No equalization       1.23    4.75    6.85
With equalization     0.05    0.13    0.20

Table 4.2: KLD between other noise and volvo in a hidden layer

After histogram equalization, the distributions become more alike in Figure 4.8b. The Kullback–Leibler divergence (KLD) between the hidden representations of noisy speech and volvo speech is also greatly reduced, as shown in Table 4.2. As a result, the PESQ scores of enhanced speech in the last column of Table 4.1 are more comparable, regardless of the original noise type.

4.3 Speech conversion with DNN mapping

In the previous section, we only matched the overall distribution of the noisy speech features to that of a simpler noisy speech. It is effective when the overall distribution can be accurately estimated. We use simple linear transformations with mean-variance normalization because we only need to estimate a few statistics. When we can simulate more data, or use a more complicated model, we could perform indirect enhancement by transforming speech in difficult noise to speech in simple noise frame by frame. In this section, we focus on this frame-based conversion technique using DNNs.

4.3.1 DNN training

We implement the conversion stage in Figure 4.1 with a DNN. Let the original noisy speech in the difficult noise condition be x1[n] + d1[n], where x1[n] is clean speech and d1[n] is background noise. x1[n] could be from a new speaker's speech not included in the training corpus. The intermediate target noise, d2[n], is chosen based on the factors discussed in Chapter 3. Silence segments are selected from x1[n] + d1[n] to extract noise-only excerpts of d1[n]. Next, clean speech from a speech corpus, x2[n], is mixed with both the original noise, d1[n], and the target noise, d2[n]. This creates parallel training pairs, x2[n] + d1[n] and x2[n] + d2[n]. They are parallel because the underlying clean speech is matched sample by sample, hence frame by frame after the short-time Fourier transform (STFT). During the training of the converter DNN, the input to the DNN is the LPS feature of speech in difficult noise, X2 + D1. The label is the LPS of speech in the intermediate target noise, X2 + D2. We then train the parameters of the neural network with stochastic gradient descent to minimize the MSE loss. During conversion, intermediate speech is predicted by feeding the original noisy speech features forward through the DNN. We then expand the converted features to include multiple context windows by concatenating adjacent frames. The concatenated features are fed into the refinement network to complete the process of indirect enhancement. This process is depicted pictorially on the left side of Figure 4.9. The second stage, denoted as refinement, can be further fine-tuned. When used with converted features, it only needs to enhance speech in the specific intermediate noise environment. Hence, the refinement network can be a special-purpose enhancer rather than a general-purpose one. This is possible because the intermediate target is selected beforehand. We can fine-tune the refinement DNN to map X2 + D2 to X2 following the same DNN training procedure.
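A minimal PyTorch sketch of such a converter DNN and its MSE training loop is given below. The class and function names, layer sizes, learning rate, and the data loader of parallel (X2 + D1, X2 + D2) frame pairs are illustrative assumptions, not the exact configuration used in the experiments.

    import torch
    import torch.nn as nn

    class ConverterDNN(nn.Module):
        def __init__(self, dim=257, width=2048, layers=3):
            super().__init__()
            blocks, last = [], dim
            for _ in range(layers):
                blocks += [nn.Linear(last, width), nn.Sigmoid()]
                last = width
            blocks.append(nn.Linear(last, dim))   # linear output layer for LPS regression
            self.net = nn.Sequential(*blocks)

        def forward(self, x):
            return self.net(x)

    def train_converter(model, loader, epochs=10, lr=1e-3):
        # loader yields (noisy_lps, intermediate_lps) minibatches of parallel frames.
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        mse = nn.MSELoss()
        for _ in range(epochs):
            for x_src, x_mid in loader:
                opt.zero_grad()
                loss = mse(model(x_src), x_mid)   # MSE toward the intermediate target
                loss.backward()
                opt.step()
        return model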

Figure 4.9: DNN architecture for feature mapping and joint training

The top-right block in Figure 4.9 shows this process. Lastly, the two stages in indirect enhancement can be jointly optimized. The refinement DNN is stacked on top of the converter, and a context-expansion layer that splices a frame with its neighboring frames is inserted between the two networks. The joint system's input is the noisy speech feature, X2 + D1, and the target label is the clean speech feature, X2. The whole system is fine-tuned to minimize the MSE between the predicted output and the clean targets with a gradient descent optimizer.
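The joint fine-tuning stage could be sketched as follows, again under assumptions: the context-expansion module, splice width, optimizer, and learning rate are illustrative, and the refinement network is assumed to accept the spliced feature dimension.

    import torch
    import torch.nn as nn

    class ContextExpand(nn.Module):
        def __init__(self, frames=7):
            super().__init__()
            self.frames = frames                          # number of spliced frames

        def forward(self, x):                             # x: (batch, time, dim)
            pad = self.frames // 2
            xp = nn.functional.pad(x, (0, 0, pad, pad))   # pad along the time axis
            spliced = [xp[:, i:i + x.size(1)] for i in range(self.frames)]
            return torch.cat(spliced, dim=-1)             # (batch, time, frames * dim)

    def joint_finetune(converter, refiner, loader, lr=1e-4, epochs=5):
        stack = nn.Sequential(converter, ContextExpand(), refiner)
        opt = torch.optim.SGD(stack.parameters(), lr=lr)
        for _ in range(epochs):
            for noisy, clean in loader:                   # parallel noisy / clean LPS frames
                opt.zero_grad()
                loss = nn.functional.mse_loss(stack(noisy), clean)
                loss.backward()
                opt.step()
        return stack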

4.3.2 Experiments and discussions

We evaluate the same noisy speech, i.e., speech in babble, pink, and white noise, as discussed in section 4.2. We select speech in volvo noise as the intermediate target for each environment in the indirect approach. The converter and the refinement DNNs are separately trained using the techniques described in the previous section. Both networks are 3-layer DNNs with a width of 2048 in each layer. The nonlinear activations between layers are sigmoid functions. We also perform joint training by concatenating the converter and the refinement networks and training the combined system at a lower learning rate.

Figure 4.10: Conversion and enhancement of speech in babble noise; (a) raw, (b) converted, (c) enhanced

The PESQ scores of various direct and indirect systems in different noise conditions are tabulated in Table 4.3. The 95% confidence intervals are appended after each score value. The column labeled "direct" shows that direct enhancement is difficult for speech in pink and white noise, as the PESQ scores are still below 2.0 after direct enhancement. The proposed DNN mapping method, labeled "indirect," is better than the mean-variance normalization introduced in section 4.2. This is because the feature mapping method matches the features frame by frame. On a detailed level, noise is substantially removed, and speech distortion is minimized. More importantly, it is also effective with speech in moderately challenging noise, such as babble, as the PESQ scores all improve to above 2.4. Figure 4.10, Figure 4.11, and Figure 4.12 present speech examples of indirect enhancement with a DNN in babble, pink, and white noise. In each figure, the three spectrograms correspond to the noisy speech without processing (raw), converted with a DNN (converted), and post-enhanced (enhanced). First, most of the original noise has been removed after the conversion stage, as the original noise is no longer visible in the converted spectrograms at the center. The converted features facilitate the refinement stage, so that the enhanced spectrograms show no visible residual noise. The last two columns in Table 4.3 illustrate the effect of joint training vs. direct adaptation. For all three noise types, the indirect method with joint training, labeled "indirect refined," outperforms direct adaptation, labeled "direct refined."

Figure 4.11: Conversion and enhancement of speech in pink noise; (a) raw, (b) converted, (c) enhanced

Figure 4.12: Conversion and enhancement of speech in white noise; (a) raw, (b) converted, (c) enhanced

We could explain the improvement by comparing the MSE between various learning targets in Table 4.4. For example, direct enhancement from white noise to clean speech needs to reduce an MSE gap of 4.53. If volvo is selected as the intermediate target, the first stage needs to close an MSE gap of 4.40, and the second stage only has a gap of 1.29. That makes either stage a simpler task for a DNN to learn. We could draw the same conclusions for babble and pink noise in Table 4.4.

4.4 Interference of multiple noise sources

In many real-world situations, multiple noise sources may exist during a conversation.

          raw           direct        MV matching   indirect      direct refined   indirect refined
babble    1.81 ± 0.05   2.11 ± 0.05   1.99 ± 0.05   2.47 ± 0.05   2.56 ± 0.05      2.62 ± 0.04
pink      1.63 ± 0.05   1.69 ± 0.09   2.07 ± 0.05   2.43 ± 0.05   2.50 ± 0.04      2.54 ± 0.04
white     1.56 ± 0.06   1.48 ± 0.09   2.07 ± 0.05   2.45 ± 0.05   2.49 ± 0.05      2.60 ± 0.05

Table 4.3: Progressive indirect enhancement with volvo intermediate noise

Figure 4.13: Effect of conversion on feature distribution; (a) before transformation, (b) after transformation

In the simple additive noise model introduced in Chapter 3, the corrupted speech, y, can be modeled as a linear combination of clean speech, x, and noise, d_i, scaled by its SNR factor, γ_i. The respective SNR is \mathrm{SNR}_i = 10\log_{10}\frac{1}{\gamma_i} = -10\log_{10}\gamma_i, and the mixture is

y = x + \sum_i \gamma_i d_i.  (4.23)

If the noise sources are independent, the overall SNR is approximately

\mathrm{SNR} = 10\log_{10}\frac{1}{\sum_i \gamma_i} = -10\log_{10}\sum_i \gamma_i.  (4.24)

                       babble   pink   white
noisy-clean            2.54     3.94   4.53
noisy-intermediate     2.16     3.71   4.40
intermediate-clean     1.29     1.29   1.29

Table 4.4: MSE between various learning pairs

In direct DNN-based speech enhancement, we could generally consider the mixture of noise a single source of interference, i.e., d = \sum_i \gamma_i d_i, so the noisy speech, y = x + d, is mapped to clean speech, x, directly. To achieve better performance of DNN-based enhancement, we want the noise acoustic space to be broad [10]. This implies that the training data must also contain superpositions of noise. However, the number of noise combinations increases exponentially as the noise database grows. It is not easy to enumerate and simulate all such combinations during direct training. The indirect approach to enhancement offers an alternative solution by removing only one interference at a time, thus simplifying the task for each stage.
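As a quick numeric check of Equation 4.24, the overall SNR of independently mixed sources can be computed from the per-source SNRs; the small helper below is purely illustrative.

    import math

    def overall_snr(snr_list_db):
        # gamma_i = 10^(-SNR_i / 10) is the noise-to-speech power ratio of source i.
        gamma_sum = sum(10 ** (-snr / 10.0) for snr in snr_list_db)
        return -10.0 * math.log10(gamma_sum)

    print(overall_snr([3.0, 3.0]))    # ~0.0 dB (two 3dB sources)
    print(overall_snr([1.25, 6.0]))   # ~0.0 dB (the 3:1 energy split used later in Table 4.7)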

4.4.1 Framework of indirect enhancement with two noise sources

In this section, we consider speech mixed with two noise sources. It serves as a starting point to discuss speech in multiple noise sources. When two loud noise interferences exist in speech, the overall SNR is low if the noise sources are independent. The overall low SNR level makes direct mapping difficult. Eliminating only one noise source at a time is simpler because the SNR gap is smaller than when removing all noise at once. Furthermore, a single noise often corrupts only part of the speech spectrum, so a neural network can still rely on the rest of the signal spectrum in prediction. After each stage, one noise source is removed from the mixture. This process is repeated until no noise source remains and clean speech has been recovered. An example process involving two noise sources, n1 and n2, is presented in Figure 4.14. The original noisy speech consists of speech in two interferences, n1 and n2. Direct mapping attempts to transform the noisy speech, x + n1 + n2, to x in a single step, as shown on the left. There are two options for the indirect path. The first path, shown in the middle of Figure 4.14, uses x + n1 as the intermediate target. In this case, noise n2 is removed first. The intermediate speech is subsequently enhanced to clean speech, x. The other option is to remove n1 first. We could visualize this path on the right in Figure 4.14. The order of removal of the two

Figure 4.14: Indirect enhancement of multiple interferences

interferences has been switched. In general, either path should yield an improvement over direct enhancement, as shown in section 4.3. Nevertheless, one path is preferable to the other based on the noise characteristics of n1 and n2. We could apply the same analysis in Chapter 3 to select the appropriate intermediate stage. Figure 4.14 assumes that n1 is a more difficult noise, since the mapping from x + n1 to x incurs a larger loss. In contrast, n2 is a simpler noise. If Path 1 is chosen, the second transformation from x + n1 to x remains difficult. The distortion in DNN mapping might overshadow the benefit of the incremental improvement. Path 2, in contrast, splits the task into sub-tasks of commensurate difficulty, which ensures that each sub-task is easier to learn.

4.4.2 Experiments and discussions

In the following experiments, we assume that there are only two noise sources in the noisy speech. Furthermore, we assume that we can obtain isolated recordings of both noise types. This scenario could arise when we detect a new noise source in an environment where we have already collected some audio recordings before. For instance, suppose we have previously recorded the background noise in a mechanic shop and we also have babble noise in our noise database. When we need to enhance speech mixed with both babble and machine noise in a mechanic shop, we could apply the indirect approach proposed in the previous section. In our experiment, two different noise sources are added to clean speech to simulate noisy speech in such an environment. Clean speech is sampled from the WSJ corpus [158]. The first interfering noise is either babble or factory1 noise from the Noisex92 [119] corpus. The second interference is a colored noise, including brown, gray, blue, pink, purple, and white. A detailed description of each noise can be found in the appendices. For each noisy-clean pair, we compute the mean squared difference of their LPS features. The difference allows us to gauge the distance between the conversion or enhancement pairs. We conduct and compare indirect enhancement along both paths in Figure 4.14. In Table 4.5, the noise sources are babble noise and a colored noise. Both noise sources are scaled to 3dB, so the overall SNR between the clean speech and the interference is 0dB. As the SNR is relatively low, the raw audio has PESQ scores below 2.0, as shown in the row labeled "raw PESQ" in Table 4.5 and Table 4.6. The direct enhancement achieves some improvement. On average, audio quality can be improved to 2.52, as shown in the row labeled "direct PESQ" in the same table. In comparison, the indirect approach along either path shows further improvement over the direct approach. The results are labeled "path1 PESQ" and "path2 PESQ" in Table 4.5 and Table 4.6. Along Path 1, the noise from Noisex92 is always the intermediate target. That means speech in babble is the intermediate target in Table 4.5, and factory1 is the intermediate target in Table 4.6. Along Path 2, speech in the corresponding colored noise listed in the table is the intermediate target. On average, either path achieves an improvement of about 0.1 in PESQ score over direct enhancement, which is perceptually significant. The last column of Table 4.5 shows that direct enhancement achieves a score of 2.52 on average. The indirect approach along either path achieves a score of 2.61 or 2.59. The improvement could be explained by simpler sub-tasks. In general, each stage in the proposed indirect path has a lower MSE compared to the direct task. Nonetheless, we have argued that it is better to choose sub-tasks with comparable difficulty. We could confirm this claim in Table 4.5.

               Brown       Gray        Blue         Pink         Purple       White        Average
direct MSE     8.0         8.9         22.0         16.1         22.3         22.1         16.6
path1 MSE      0.4 / 8.2   0.8 / 8.2   8.2 / 8.2    4.1 / 8.2    8.7 / 8.2    8.0 / 8.2    5.0 / 8.2
path2 MSE      2.6 / 3.6   2.8 / 4.6   0.8 / 21.7   0.5 / 16.2   1.5 / 21.4   0.5 / 22.6   1.4 / 15.0
raw PESQ       1.99        1.99        1.72         1.74         1.80         1.59         1.81
direct PESQ    2.70        2.68        2.47         2.39         2.48         2.37         2.52
path1 PESQ     2.75        2.77        2.55         2.50         2.62         2.44         2.61
path2 PESQ     2.80        2.74        2.51         2.45         2.61         2.41         2.59

Table 4.5: 3dB babble noise mixed with various colored noise at 3dB. The intermediate target is babble for Path 1 and the corresponding colored noise for Path 2. Path MSE entries list the first-stage / second-stage values.

               Brown        Gray         Blue         Pink         Purple       White        Average
direct MSE     13.7         13.9         22.9         17.8         23.4         22.4         19.0
path1 MSE      0.2 / 14.7   0.2 / 14.7   3.1 / 14.7   1.4 / 14.7   3.4 / 14.7   3.0 / 14.7   1.9 / 14.7
path2 MSE      6.1 / 3.6    5.2 / 4.6    1.1 / 21.7   1.1 / 16.2   1.8 / 21.4   0.7 / 22.6   2.7 / 15.0
raw PESQ       1.88         1.89         1.68         1.68         1.74         1.56         1.74
direct PESQ    2.64         2.66         2.44         2.40         2.49         2.35         2.50
path1 PESQ     2.68         2.77         2.54         2.49         2.60         2.50         2.61
path2 PESQ     2.79         2.81         2.52         2.53         2.58         2.40         2.61

Table 4.6: 3dB factory1 noise mixed with various colored noise at 3dB. The intermediate target is factory1 for Path 1 and the corresponding colored noise for Path 2. Path MSE entries list the first-stage / second-stage values.

For example, in the first column under brown noise, Path 2 contains two sub-tasks with similar MSE, 2.6 and 3.6. The two sub-tasks along Path 1 are more dissimilar in difficulty, with MSE values of 0.4 and 8.2, respectively. The PESQ score along Path 2 (2.80) is higher than that along Path 1 (2.75). For white noise, the MSE difference indicates that Path 1 is preferred. The PESQ scores agree: Path 1 achieves a higher score of 2.44 compared to 2.41 along Path 2. Thus, when we design an indirect path that allows us to choose the order of intermediate targets, it is more favorable to select intermediate targets that result in sub-tasks with comparable difficulties. The same experiment is repeated by replacing the first noise source, babble, with factory1, which has more high-frequency components than babble. In Table 4.6, the results show a similar trend to the previous experiment. For simpler noise types such as brown and gray, it is better to choose Path 2, as brown and gray are simpler than babble. Brown and gray noise are easily masked by speech, as evident from the power spectral estimates shown in Figure 4.15.

PESQ           Babble/Brown                 Babble/White
SNR/dB         1.25/6    3/3     6/1.25     1.25/6    3/3     6/1.25
raw            1.89      1.99    2.16       1.67      1.59    1.54
direct         2.62      2.71    2.87       2.37      2.37    2.44
indirect       2.72      2.80    2.94       2.46      2.44    2.54
improvement    0.83      0.81    0.78       0.79      0.85    1.0

Table 4.7: Babble and colored noise at various SNR

PESQ           Factory1/Brown               Factory1/White
SNR/dB         1.25/6    3/3     6/1.25     1.25/6    3/3     6/1.25
raw            1.77      1.88    2.08       1.61      1.56    1.54
direct         2.53      2.64    2.81       2.34      2.35    2.40
indirect       2.68      2.79    2.92       2.43      2.50    2.50
improvement    0.91      0.91    0.84       0.82      0.94    0.96

Table 4.8: Factory1 and colored noise at various SNR

The red curves show the average PSD of speech, and the blue curves represent those of the various colored noise types. The average spectra of brown and gray are effectively masked by speech, whereas the other colored noise types are not. For the remaining, more difficult types, speech in factory1 noise becomes relatively easy. It is thus a more appropriate intermediate target. In conclusion, for indirect enhancement of multiple noise interferences, it is better to convert to speech in a simpler noise that can be effectively masked by speech, which is consistent with the analysis in Chapter 3. In the next experiment, we adjusted the mixing ratio of the two noise sources while maintaining the overall SNR level at 0dB. We want to know whether the choice of intermediate targets would remain the same as the relative SNRs of each interference change. Table 4.7 shows the result for noisy speech when babble is mixed with brown or white noise. Table 4.8 repeats the experiment by replacing babble with factory1 noise. The second rows in both tables specify the mixing ratios of the two interferences: 1.25dB/6dB, 3dB/3dB, and 6dB/1.25dB. In terms of relative energy, the two noises are mixed at 3:1, 1:1, and 1:3, respectively. The overall SNR remains at 0dB. By inspecting the first three columns in Table 4.7 and Table 4.8, we could see that the

improvement is greater when babble or factory1 has a lower SNR than brown noise, because babble or factory1 is harder than brown. A lower SNR for the harder noise makes the noisy speech more difficult overall. Consistent with our results in earlier sections, we find that the indirect approach is more effective when speech is corrupted by difficult noise. Following the same argument, we can see the reversed trend for white noise. Namely, a larger improvement is seen when babble or factory1 noise is at a higher SNR. That is because white noise is a more difficult noise type than babble or factory1. Hence, the noise power of each source should be considered together with the noise types when selecting intermediate targets in indirect speech enhancement.

Figure 4.15: Average power spectral density of speech and various colored noise; (a) brown, (b) gray, (c) blue, (d) pink, (e) purple, (f) white

4.5 Summary

In this chapter, we discussed several approaches to realizing indirect speech enhancement, focusing on the design of the conversion step. A crucial factor in the degraded performance for speech in some difficult environments is the shift in feature distribution, which creates a mismatch between the features used in training and testing. We demonstrated that difficult noise types cause the features to be normalized improperly, aggravating this mismatch. The first method in this chapter uses mean-variance normalization or histogram equalization to reduce the mismatch. Reducing the mismatch in the feature space also translates into a reduced mismatch in the hidden activations of DNNs, which explains the improvement with indirect speech enhancement. If more paired data is available for supervised learning, one can convert the original noisy features into speech features in a simple noise frame by frame using regression-based mapping, such as a DNN. Compared to only matching the distribution, the mapping method reduces distortion in spectral details and yields better enhancement quality. It outperforms the normalization method in terms of perceptual quality scores. Lastly, indirect speech enhancement can be effectively applied when multiple noise interferences are present. In this scenario, noise can be progressively removed from the mixture one source at a time. It is better than direct enhancement, which removes all noise at once. We performed an analysis of task difficulties by measuring the MSE gap between input and output features. The MSE gap confirms that it is useful to decompose difficult learning tasks into simpler sub-tasks. Empirical evidence also shows that it is more useful to use speech in simple noise as intermediate targets.

CHAPTER 5
INDIRECT SPEECH ENHANCEMENT WITH LATENT SPACE LEARNING

5.1 Introduction

In Chapter 3, we developed a few guidelines for finding noise types suitable as intermediate targets. Nonetheless, it is possible that speech in that intermediate target is difficult to collect or synthesize. For instance, the amount of target noise collected may be too little to synthesize any meaningful training data set, or an online adaptation system may not offer enough time to collect data and train a converter separately. Hence, in this chapter, we explore ways to perform noise type conversion with unsupervised learning when there are no direct learning targets of speech in the desirable noise background. Representational learning is one such tool we can utilize. Given enough unlabeled training data, a good representation learning algorithm can discover a structured representation of features in a latent space. Features in the latent space lie on a manifold, a continuous non-intersecting surface [159]. Manifolds have some interesting properties, such as feature disentanglement and latent vector arithmetic. Feature disentanglement aims to extract different aspects of the latent representation features. In our application, we wish to decompose the noisy speech into speech features, noise features, and SNR features. Subsequently, latent vector arithmetic enables us to manipulate the latent space components by replacing some attributes while keeping the rest fixed, thus accomplishing conversions. Afterward, the re-synthesis step combines the new latent features and creates a converted output. In the rest of the chapter, we will introduce various forms of the autoencoder (AE) that are very popular in representational learning [159]. Since the nonlinearities in deep AEs make them harder to analyze, we start by discussing linear models, which allow us to understand the latent structure better. Then we will see that linear models fall short of extracting structured latent features. Modifications are required, such as building deeper architectures and imposing latent constraints. We will follow the analysis with a series of experiments to show that we can convert noisy speech from difficult noise into simpler noise for indirect enhancement using unsupervised learning.

5.2 Representational learning via auto-encoder

AEs have been popular in unsupervised learning as a tool to perform feature extraction or dimension reduction [159]. More recently, their use has extended beyond deterministic mappings to describing probabilistic distributions, as in generative modeling. A basic AE consists of an encoding block and a decoding block. Let x stand for an input feature vector.

In speech enhancement, it is usually an LPS vector. The encoding block, Fenc, transforms x to a latent representation, h, which is generally much more compact and structured

h = Fenc(x). (5.1)

In its most common form, an AE is under-complete. The latent layer, h, also commonly referred to as the bottleneck layer, has a smaller dimension than the input. It enforces data

compression in the encoding process. The decoding block, Gdec, has to learn to reconstruct the original feature at the output layer from the compact latent representation, h. The following equation describes the whole process of auto-encoding

\hat{x} = G_{dec}(h) = G_{dec}(F_{enc}(x)).  (5.2)

An AE is often trained with MSE loss to minimize the difference between original inputs and reconstructed outputs,

F_{enc}, G_{dec} = \arg\min_{F,G} \sum_i L_{MSE}(x_i, \hat{x}_i),  (5.3)

80 where the reconstruction loss, LMSE, is the standard MSE

L_{MSE}(x_i, \hat{x}_i) = \| x_i - G_{dec}(F_{enc}(x_i)) \|_2^2.  (5.4)

Compared to the reconstructed outputs, we are usually more interested in the latent representation, as features are more saliently organized in this low-dimensional subspace. The most basic AE can be modified to include additional constraints on the bottleneck layer, h. For example, a regularization loss with weight λ could be added to Equation 5.3 to enforce sparsity in the latent space,

F_{enc}, G_{dec} = \arg\min_{F,G} \sum_i \left( L_{MSE}(x_i, \hat{x}_i) + \lambda \| F_{enc}(x_i) \|_1^2 \right).  (5.5)

A de-noising AE is another variation; it minimizes the loss in Equation 5.6, where δx_i is a small perturbation of the input features, added to promote more robust feature extraction and reconstruction. This property makes it a handy tool in feature selection. Its training finds

two networks, Fenc and Gdec, that optimize the following loss

F_{enc}, G_{dec} = \arg\min_{F,G} \sum_i L_{MSE}(x_i, G_{dec}(F_{enc}(x_i + \delta x_i))).  (5.6)

The exact realization of F_enc and G_dec can take many forms depending on the application. Usually, some nonlinear transformations are included in F_enc so that G_dec ∘ F_enc does not degenerate into a single linear multiplication. In the case where the whole AE is linear, it is similar to PCA; both can be used for dimension reduction and feature extraction. In the next section, we will first use a linear model to analyze the problem of unsupervised noisy speech conversion.
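For concreteness, a minimal PyTorch sketch of an under-complete AE with the MSE objective in Equations 5.2-5.4, plus an optional L1 latent penalty in the spirit of Equation 5.5, is shown below; the layer sizes and names are illustrative assumptions.

    import torch
    import torch.nn as nn

    class AutoEncoder(nn.Module):
        def __init__(self, dim=257, bottleneck=128):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(dim, bottleneck), nn.Sigmoid())  # F_enc
            self.decoder = nn.Linear(bottleneck, dim)                               # G_dec

        def forward(self, x):
            h = self.encoder(x)            # h = F_enc(x)
            return self.decoder(h), h      # x_hat = G_dec(h)

    def reconstruction_loss(model, x, sparsity_weight=0.0):
        x_hat, h = model(x)
        loss = nn.functional.mse_loss(x_hat, x)                 # L_MSE of Equation 5.4
        if sparsity_weight > 0:
            loss = loss + sparsity_weight * h.abs().mean()      # L1 latent penalty
        return loss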

5.2.1 Latent space of speech features using PCA

In a linear AE, F_enc and G_dec are simply two matrices in Equation 5.3. Then the reconstructed output could be written as

\hat{x} = G_{dec}(F_{enc}(x)) = G_{dec} F_{enc} x,  (5.7)

where G_dec and F_enc are the matrices instantiating the decoder and the encoder, respectively. This draws a close parallel with PCA. By collecting many speech features, x, we could create an observation matrix, X, where each row corresponds to a feature vector. The feature matrix, X, is first centered by removing its mean in each dimension, X̄. Then we can perform PCA on the centered matrix, X̃ = X − X̄, to obtain an orthonormal loading matrix, V. It is the same matrix, V, as obtained when SVD is performed on X̃ = UΣV^T.¹ It is straightforward to see that the hidden representation, H, is

H = \tilde{X} V = U \Sigma,  (5.8)

and the reconstruction, X̂, is

\hat{X} = \tilde{X} V V^T = U \Sigma V^T.  (5.9)

If a lower dimension, R, is desired, some columns of V could be discarded. Alternatively, define a column-trimming matrix, D, such that

D = \left[ e_1, e_2, \ldots, e_R, 0, \ldots, 0 \right],  (5.10)

¹V^T = V^{-1} since V is an orthogonal matrix.

where e_i is an indicator vector with a 1 only in the ith position and 0 everywhere else. A total of R dimensions will be selected. In this case, the lossy reconstruction would be

\hat{X} = \tilde{X} (VD)(VD)^{-1},  (5.11)

where (VD)^{-1} denotes the pseudo-inverse of VD, since rank(VD) < rank(V). The ideal binary mask, M, discussed in Chapter 2 indicates whether a time-frequency bin is dominated by speech or noise energy. It is usually multiplied element-wise with the feature matrix, X, to filter out noise-dominated bins. In a linear AE such as PCA, the binary classification of speech and noise bins leads to masking in the latent representations. In particular, the latent speech and noise components are, respectively,

H_{speech} = M \otimes \tilde{X} V D,  (5.12)

and

H_{noise} = (1 - M) \otimes \tilde{X} V D,  (5.13)

where ⊗ is the element-wise product. To better understand this result, we use the following toy example. Let m = [1, 1, 0, 0] be a four-dimensional mask vector, meaning that the first two dimensions are dominated by speech and the last two by noise. The feature vector, x̃ = [x1, x2, x3, x4], is a

frame of a centered feature vector in R^4. Each x_i is a frequency bin. Assume the AE has a bottleneck width of 3, so the truncation matrix, D, can be written as D = [e1, e2, e3, 0]. V is the encoding/loading matrix with column vectors v1, v2, v3, and v4. Combining them, VD = [v1, v2, v3]. Then the noisy speech in the latent space is x̃VD,

H_{noisy} = \tilde{x} V D = \left[ \sum_{i=1}^{4} x_i v_{i1}, \; \sum_{i=1}^{4} x_i v_{i2}, \; \sum_{i=1}^{4} x_i v_{i3} \right].

Figure 5.1: First 9 latent speech bases extracted from (a) clean speech and (b) speech in white noise

Since the mask, m, only keeps the lower two dimensions of x as speech

H_{speech} = m \otimes \tilde{x} V D = \left[ \sum_{i=1}^{2} x_i v_{i1}, \; \sum_{i=1}^{2} x_i v_{i2}, \; \sum_{i=1}^{2} x_i v_{i3} \right],

and the noise occupies the upper two dimensions,

H_{noise} = (1 - m) \otimes \tilde{x} V D = \left[ \sum_{i=3}^{4} x_i v_{i1}, \; \sum_{i=3}^{4} x_i v_{i2}, \; \sum_{i=3}^{4} x_i v_{i3} \right].

The equations above show that encoding with a linear AE is similar to selecting and combining speech and noise bases in the loading matrix, V. We perform SVD on the centered feature matrix of speech to get V. The first nine columns, corresponding to the nine largest singular values, are selected and plotted in Figure 5.1a. We only select nine because the other singular values are considerably smaller. First, we can tell that these nine bases resemble the PSD of various speech activities. For example, the first basis resembles the long-term average PSD. The second basis contains stronger high-frequency components, reminding us of the spectrum of some consonant sounds. Bases four to nine seem to suggest some formant patterns. Next, we plot the first nine bases of speech in white noise in Figure 5.1b. Comparing

Figure 5.2: First latent white noise basis

Figure 5.1a and Figure 5.1b, we are almost able to see a one-to-one correspondence between the bases extracted from noisy speech and those extracted from the clean speech. The overall shapes of the bases are very similar. A notable difference between the two is that the bases in Figure 5.1b have stronger high-frequency perturbations, which is due to the contribution of white noise. We visualize this contribution by plotting the most significant latent vector of white noise in Figure 5.2. Unsurprisingly, this basis resembles the flat spectrum of white noise. It also contains the high-frequency components visible in Figure 5.1b. This visualization helps us better understand our earlier argument that speech bases and noise bases span the latent space in a linear AE. The analysis above hints at how to perform noisy speech conversion with an AE. Since the latent representation is a superposition of the speech bases and noise bases in the latent space, one can replace the bases to obtain a different noisy speech reconstruction. In particular, let Ỹ be the centered observation matrix of noisy speech in another noise, with SVD Ỹ = U_Y Σ_Y V_Y^T; we can then perform a basic noise type conversion,

\hat{Y} = \tilde{X} V V_Y^T.  (5.14)

Figure 5.3: Conversion of noisy speech from white noise into pink and volvo with PCA

First, the encoder, V, transforms noisy speech in the source domain into a latent representation. This latent representation can be decomposed into speech and noise subspaces. The decoder, V_Y, is calculated from the noisy speech in a target noise domain. It corresponds to the speech subspace and another noise subspace, and it transforms the latent weights into speech in another noise. Figure 5.3 presents a sample result of conversion using this technique. A potential issue with this approach is the permutation of bases in different loading matrices. We cannot guarantee that the bases are ordered in the same way after performing SVD. Due to the different effects of noise on speech, this order may change from noise to noise, which could create misaligned weights and bases. As a result, this technique does not work well when the noise types are too different, as dissimilar noise types are likely to shuffle the order of the bases. For example, in Figure 5.3, since both white noise and pink noise are wide-band noise, the conversion of noisy speech from white noise (left) to noisy speech in pink noise (bottom center) results in less distortion compared to conversion from white noise to volvo (bottom right). By measuring the MSE between the converted speech and the ground truth in Table 5.1, we can confirm that conversion to a very different noise is indeed very tough with this approach. This leads us to explore deep and nonlinear architectures for noisy speech conversion in the subsequent sections.

Table 5.1: Effects of stationarity of conversion targets

source noise              white    white
target noise              pink     volvo
MSE to ground truth       3.7      16.7
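A rough NumPy sketch of the loading-matrix swap in Equation 5.14 is given below; re-adding the target-domain mean and the exact handling of the transpose are assumptions made for this illustration, and the basis-permutation problem discussed above is not addressed.

    import numpy as np

    def pca_noise_conversion(X_src, Y_tgt):
        # X_src: (M, D) noisy LPS frames in the source (difficult) noise.
        # Y_tgt: (N, D) noisy LPS frames in the target (simple) noise.
        X_mean, Y_mean = X_src.mean(axis=0), Y_tgt.mean(axis=0)
        # Loading matrices from the SVD of the centered observation matrices.
        _, _, Vx_t = np.linalg.svd(X_src - X_mean, full_matrices=False)
        _, _, Vy_t = np.linalg.svd(Y_tgt - Y_mean, full_matrices=False)
        latent = (X_src - X_mean) @ Vx_t.T       # encode with the source loadings
        return latent @ Vy_t + Y_mean            # decode with the target loadings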

5.2.2 Use of nonlinear auto-encoders to convert speech features

A deep AE also consists of an encoder and a decoder. Nonlinear activation layers are usually inserted after each linear layer to prevent the linear transformations from collapsing into a single matrix multiplication. Deep networks have much more representational power than shallow networks. As pointed out in [165, 166], some functions expressed by deep networks cannot be approximated by shallower ones unless the shallow networks are impractically wide. Thus, depth exponentially reduces the number of parameters required to represent a function and lowers the demand for training data. From a compression point of view, deeper auto-encoders can achieve better compression results than their shallow counterparts [167]. In practice, one could pre-train a deep network with layers of shallow networks and stack them to create a deeper AE. Due to its stronger representational power, a deep AE is more general than PCA for discovering latent structures in data. As a linear transform, PCA projects features onto a hyper-plane in lower dimensions. For many problems, the input data may not have a linear representation. For example, in Figure 5.4, the latent data dwell on a nonlinear manifold that cannot be described by a hyper-plane. This shows the need for nonlinear transformations in AEs. Noisy speech conversion based on deep AEs can be formulated as follows. For two domains, src and tgt, we assume that the encoders, F_enc, and decoders, G_dec, in the source and target domains have been well trained. Hence,

Xsrc ≈ Gsrc(Fsrc(Xsrc)), (5.15)

Figure 5.4: PCA vs. deep AE in manifold learning

and

Xtgt ≈ Gtgt(Ftgt(Xtgt)). (5.16)

By assuming F_src(X_src) = F_tgt(X_tgt), we can generate the converted feature, X̂_tgt, with

\hat{X}_{tgt} = G_{tgt}(F_{src}(X_{src})).  (5.17)

There is much freedom in designing the architecture of a deep AE, as one could explore encoders and decoders of different depth and width. Depth-wise, both the encoder and decoder can be a concatenation of several nonlinear transformations. The layers can consist of dense, convolutional, or recurrent layers. Width-wise, even though the encoder and the decoder are usually connected by a bottleneck layer that is much thinner than the rest of the network, one could design over-complete auto-encoders with a wide bottleneck layer [159]. Regardless of the specific architecture, the network parameters are still obtained via back-propagation with the same rule as in Equation 5.3. However, the number of parameters in the auto-encoder could increase if the model

gets large. This could be problematic if there are relatively little data but many parameters to estimate. In such cases, the model may only "memorize" the specific inputs seen in the data set and fail to generalize to new features. We could address the problem with regularization, such as using the L1 loss in Equation 5.5 to enforce sparsity. With the additional L1 loss, the decoder must rely on a small fraction of neurons in the bottleneck layer to reconstruct its output, effectively limiting the bottleneck width. An alternative way to promote sparsity and discourage memorization is to include a KLD term. It requires us to pre-select a sparsity constant, ρ, which incorporates our prior belief of how often a neuron in the latent layer should activate. ρ follows a Bernoulli distribution. The computation of ρ is discussed in detail in [159]. By minimizing the KLD between the prior distribution, ρ, and the empirical observation, ρ̂, we constrain the neurons in the bottleneck layer to activate only occasionally. The modified loss is

F_{enc}, G_{dec} = \arg\min_{F,G} \sum_i \left( L_{MSE}(x_i, G_{dec}(F_{enc}(x_i))) + \sum_j \mathrm{KLD}(\rho \,\|\, \hat{\rho}_j) \right),  (5.18)

where j indexes a neuron in the latent layer. We could use AEs to perform representational learning on speech features. Afterward, we can convert speech in difficult noise to speech in the intermediate target to achieve indirect speech enhancement. Figure 5.5 shows an example converter. We use unlabeled audio features to train the source AE. These audio features are from the source domain, which typically is speech in difficult noise conditions. A similar set-up is used to train an AE in the target domain, which corresponds to speech in simpler noise. After the two separate AEs are trained, we could perform conversion by replacing the source decoder with the target decoder. The conversion is possible if both the source and target encoders encode noisy speech into the same latent space. Additionally, both decoders must decode from that same latent space. This is a strong assumption. We will validate or challenge this assumption by examining a series of factors in the experimental section.

Figure 5.5: Use of auto-encoders to convert noisy speech into simpler noise; (a) train an AE in the source domain, (b) train an AE in the target domain, (c) convert from the source domain to the target domain
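The conversion step in Figure 5.5(c), i.e., Equation 5.17, amounts to chaining the two separately trained networks; below is a minimal sketch, assuming the source encoder and target decoder are PyTorch modules operating on LPS frames.

    import torch

    def convert_with_autoencoders(src_encoder, tgt_decoder, x_src):
        # X_hat_tgt = G_tgt(F_src(X_src)); relies on the (strong) assumption that
        # both separately trained AEs map to and from the same latent space.
        with torch.no_grad():
            return tgt_decoder(src_encoder(x_src))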

Noise aware conversion

In the first part of this chapter, we have shown that conversion with a linear AE is similar to changing bases in the decoder. In a nonlinear AE, we can include target-domain noise vectors to give the decoder more explicit information with which to perform the conversion. This allows the encoder to concentrate on learning latent representations of speech, since the decoder can rely on the additional noise vector input in its reconstruction. A set-up of "noise-aware conversion" following this design is presented in Figure 5.6.

The first encoder, F_s, takes noisy speech, x + d, as input. The encoder then maps the noisy speech to the bottleneck, F_s(x + d). The second encoder, H_t, only encodes noise information and outputs the noise bottleneck, H_t(d). By giving the decoder, G_t, the noise information, H_t(d), more explicitly, the first encoder, F_s, does not need to encode the noise, thus creating a speech bottleneck with a better speech representation. We will examine whether a better speech representation in the latent layer facilitates better reconstruction and conversion, which could lead to better enhancement results.

Figure 5.6: Architecture of noise aware speech conversion
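A sketch of the noise-aware converter in Figure 5.6 is shown below; the bottleneck split of 246 speech and 10 noise neurons follows the experiments later in this chapter, while the layer types and activation are illustrative assumptions.

    import torch
    import torch.nn as nn

    class NoiseAwareConverter(nn.Module):
        def __init__(self, dim=257, speech_bn=246, noise_bn=10):
            super().__init__()
            self.speech_enc = nn.Sequential(nn.Linear(dim, speech_bn), nn.Sigmoid())  # F_s
            self.noise_enc = nn.Sequential(nn.Linear(dim, noise_bn), nn.Sigmoid())    # H_t
            self.decoder = nn.Linear(speech_bn + noise_bn, dim)                       # G_t

        def forward(self, noisy_lps, noise_lps):
            h_speech = self.speech_enc(noisy_lps)   # speech bottleneck F_s(x + d)
            h_noise = self.noise_enc(noise_lps)     # noise bottleneck H_t(d)
            return self.decoder(torch.cat([h_speech, h_noise], dim=-1))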

Domain adversarial auto-encoder

The set-up in noise-aware AEs allows a speech encoder to encode only speech, but it may not impose enough constraints to enforce such behavior. Since the task during training is only to re-synthesize noisy speech, a speech encoder may still encode speech and noise together. We can impose additional constraints on speech encoding with a domain adversarial loss. Introduced in [168], domain adversarial training extracts latent features that are indiscriminative with respect to domain knowledge, such as background noise information. A domain classifier is appended after a latent layer to classify which domain the features come from. To achieve better classification accuracy, the domain classifier encourages the latent features to be more discriminative. In other words, speech features from the source noise domain would become more separable from the features in the target noise domain. This is exactly the opposite of what we need for background noise conversion, since we do not wish to encode domain information after the speech encoder. We can reverse this phenomenon by adding a gradient reversal layer [168] between the latent bottleneck layer and the domain classifier. When training a domain adversarial auto-encoder (DAAE) in the forward pass in Figure 5.7a, the gradient reversal layer functions as an identity. The domain classifier predicts whether noisy speech is from the source or the target domain. Its loss function is the binary cross-entropy function typically used in classification tasks [168].

Figure 5.7: Use of domain adversarial auto-encoders to convert noisy speech into simpler noise; (a) forward pass in training the DAAE, (b) backward pass in training the DAAE, (c) convert from the source domain to the target domain

L_{xent} = -q_s \log(\hat{q}_s) - (1 - q_s) \log(1 - \hat{q}_s).  (5.19)

In the above equation, q_s = 1 when the feature is from the source domain, and q_s = 0 when the feature is from the target domain; q̂_s is the output of the domain classifier. During back-propagation, the domain classifier is updated with standard DNN training. The gradient passing through the gradient reversal unit is multiplied by a negative constant before entering the bottleneck layer, as shown in Figure 5.7b. Since the gradient is reversed, optimization with gradient descent becomes gradient ascent. Hence, subsequent updates in the lower layers of the DAAE will maximize the domain confusion. The adversarial mechanism makes the bottleneck features indiscriminative of domain information.

Let D_cls be the domain classifier. The modified objective function of this system is

F_{enc}, G_{dec}, D_{cls} = \arg\min_{F,G,D} \sum_i \left( L_{MSE}(x_i, G_{dec}(F_{enc}(x_i))) + \lambda L_{xent}(q_i, D_{cls}(F_{enc}(x_i))) \right),  (5.20)

where L_MSE and L_xent are the reconstruction loss and the classification loss defined in

Equation 5.4 and Equation 5.19, respectively, and q_i is the ground-truth domain label. Since the gradient reversal unit is parameter-free and automatically adjusts the gradient in back-propagation, the adversarial loss is not explicit in the loss function in Equation 5.20. During conversion in Figure 5.7c, we still replace the source decoder with the target decoder. Unlike in a vanilla AE, the latent features are now more domain-indiscriminative, so they contain less information about the source domain, which helps the target decoder reconstruct noisy speech in the target domain.
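The gradient reversal unit itself is compact. The following PyTorch sketch is a standard formulation, with the scaling constant lam as an assumed hyper-parameter: it acts as an identity in the forward pass and negates the gradient in the backward pass, so the domain classifier's cross-entropy loss in Equation 5.19 pushes the bottleneck toward domain-indiscriminative codes.

    import torch

    class GradReverse(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x, lam):
            ctx.lam = lam
            return x.view_as(x)                     # identity in the forward pass

        @staticmethod
        def backward(ctx, grad_output):
            return -ctx.lam * grad_output, None     # reversed (scaled) gradient

    def grad_reverse(x, lam=1.0):
        # Apply to the bottleneck activations before the domain classifier.
        return GradReverse.apply(x, lam)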

Vector quantized auto-encoder

Inspired by VQ [169], we could use a set of fixed bases so that the AEs in the source and target domains share the same span. Specifically, the encoder's output, Z = F_enc(X), will be quantized to a fixed code, Z', in the codebook, C, given some distance metric, such as the Euclidean distance. The decoder will then map the quantized code, Z', back to X,

Z' = \arg\min_{c \in C} \| F_{enc}(X) - c \|_2^2.  (5.21)

We present a frame-based VQAE system in Figure 5.8. Each frame of the acoustic feature, X, in the source noise background is quantized by the VQ block. It is then decoded with the target decoder to reconstruct speech features in the target noise background. The benefit of using a fixed codebook is clear. Since the auto-encoders from both the source and target domains are trained with the same codebook, the encoded activations now reside in the same latent space. Furthermore, there is no longer a permutation issue, as the codebook is fixed. Compared to a conventional AE, the VQAE also has some drawbacks. As the VQ

Figure 5.8: Architecture of vector quantized auto-encoder

block discretizes the latent codes in Figure 5.8, the decoding output will be less smooth. The non-smoothness will contribute to distortions in reconstruction, making the reconstructed speech less natural. Secondly, the quantization step involves an arg min, which is not differentiable. This breaks back-propagation when updating the parameter weights in an AE. Lastly, the codebook must be carefully designed to facilitate encoding in both the source and target domains. To address the first issue, we choose a relatively large codebook to reduce the quantization loss. At the extreme, there could be as many codebook entries as the number of acoustic frames. This is equivalent to no quantization, effectively eliminating the quantization loss. For a speaker-dependent system, the codebook size can be greatly reduced, as we do not need to consider speaker variability. As there are few practical guidelines for setting the codebook size, this quantity can be determined empirically. There are several workarounds for the issue of the differentiability of the arg min operation. Sub-gradient or finite difference methods could be used to approximate the gradient flow. Another way to overcome the difficulty is to replace the arg min, a hard assignment, with a soft decision, for example, the softmax function [159]. Compared to the original VQ, the modified VQ with soft decisions allows the latent representations to reside in a continuous subspace, which helps the reconstructed features to be smoother.
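A small PyTorch sketch of the nearest-neighbour quantization in Equation 5.21 is given below. Note that it uses a straight-through gradient copy as the differentiability workaround, which is an alternative to the sub-gradient and softmax options mentioned above, and the codebook is assumed to be a fixed tensor shared by both domains.

    import torch

    def quantize(z, codebook):
        # z: (batch, latent_dim) encoder outputs; codebook: (K, latent_dim) fixed codes.
        dists = torch.cdist(z, codebook)       # Euclidean distances to every code
        z_q = codebook[dists.argmin(dim=1)]    # nearest-neighbour assignment (Equation 5.21)
        # Straight-through estimator: the forward pass uses z_q, gradients flow to z.
        return z + (z_q - z).detach()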

5.2.3 Experimental results

This section evaluates speech transformation with the methods proposed in section 5.2. The original noisy speech is the source domain, and the intermediate target speech is the target domain. AEs in the source and the target domains are first separately trained, after which we obtain a pair of encoder and decoder in the source domain and another pair in the target domain. During conversion, the source encoder first generates latent codes of the original noisy speech. The latent codes are then fed through the decoder in the target domain to achieve speech transformation. The transformed speech is subsequently used as the intermediate input to a general-purpose enhancement model described in Chapter 2. We perform speech transformation on a low-level spectral feature, i.e., LPS. In the following experiments, 257-dimensional LPS features of noisy speech are encoded into latent codes by the encoders. The decoder output is also a 257-dimensional LPS. We evaluate the quality of the final output with PESQ. We sample three noise types from the Noisex92 database: white, pink, and babble noise. White noise is wide-band, whereas pink noise is more band-limited. Babble noise is non-stationary. For all three types of noise in the source domain, we choose volvo noise as the target domain due to its characteristics, as discussed in Chapter 3. In the rest of the section, we will evaluate and analyze the factors that affect the proposed unsupervised conversion technique's performance, including the neural network architecture, SNR dependency, size of the data set, noise-aware training, domain adversarial training, and vector quantized training.

Network architecture

In subsection 5.2.2, we argued that deep auto-encoders can learn the underlying structure in noisy speech better than linear models due to their greater depth and width. To investigate the effects of different auto-encoder depths, we gradually increase the number of nonlinear layers. Table 5.2 presents the results for all three types of noisy speech at

           Direct   depth=0   depth=1   depth=2   depth=3
white      1.48     1.83      1.94      1.95      1.91
babble     2.12     2.04      2.09      2.15      2.17
pink       1.70     1.92      2.03      2.05      2.03
average    1.77     1.93      2.02      2.05      2.04

Table 5.2: PESQ after enhancing using different depths of auto-encoders

0dB. The noise types white and pink are very difficult for the enhancement network, so the PESQ scores of direct enhancement are only 1.48 and 1.70, respectively. Enhancing the features converted with a linear auto-encoder (d=0) improves the performance for these two challenging noise types considerably (1.48 to 1.83 for white noise and 1.70 to 1.92 for pink noise) but lowers the quality of speech in babble noise (2.12 to 2.04). As we include more nonlinear layers in the AEs, the quality improves for all three noise types with one or two nonlinear layers (d=1 and d=2). For instance, the PESQ of noisy speech in white noise improves from 1.48 to 1.95. It suggests that nonlinear networks can extract features better than linear networks, as we expected. Adding more nonlinearity beyond two layers is not beneficial, as the quality drops slightly at d=3 (from 1.95 to 1.91 for white noise). This could be attributed to the difficulty of training deeper models due to vanishing gradients and the vulnerability of over-parameterized models to over-fitting. Traditional AEs employ narrow bottleneck layers to compress features into more succinct representations in a latent space. However, our focus is on the disentanglement of the speech and noise components in the latent space. Hence, a wider bottleneck layer, as in an over-complete AE, could allow the encoders to find a better structure for speech and noise separately. For an AE with only one nonlinear layer, we vary the nonlinear layer's width from 64 to 1024. When the width is larger than the feature dimension (257), the auto-encoder is over-complete. The results are presented in Table 5.3. For each of the three noise types, the PESQ of enhanced speech improves as the bottleneck width grows, up to 512. For example, the PESQ of speech in white noise improves from 1.48 up to 2.13. When the latent layer is too narrow, compression in the latent space results in a high reconstruction loss, thus

            Direct      64     128     256     512    1024
white        1.48     1.76    1.94    2.09    2.13    2.13
babble       2.12     1.95    2.09    2.12    2.16    2.13
pink         1.70     1.84    2.03    2.15    2.20    2.18
average      1.77     1.85    2.02    2.12    2.17    2.15

Table 5.3: PESQ after enhancing using different bottleneck widths

                    Non-conversion                        Conversion
SNR (dB)   babble   pink   white   average   babble   pink   white   average
-5           1.70   1.24    1.08      1.34     1.62   1.70    1.71      1.68
0            2.12   1.70    1.48      1.76     2.12   2.15    2.09      2.12
5            2.51   2.16    1.99      2.22     2.54   2.50    2.42      2.49

Table 5.4: PESQ of converted speech at various SNR levels

However, if the latent layer is too wide, such as 1024 nodes, there is a risk that the AE memorizes input features instead of extracting meaningful representations. Hence, an intermediate width of 256 or 512 is more appropriate for our application.
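For reference, a hedged sketch of how such a family of models might be parameterized follows; the helper name build_autoencoder and the exact layer layout are assumptions, not the thesis code.

```python
import torch.nn as nn

def build_autoencoder(feat_dim=257, width=512, depth=1):
    """Illustrative builder for the depth/width grid of Tables 5.2 and 5.3.

    `depth` counts the nonlinear (ReLU) encoder layers; depth=0 yields a purely
    linear AE, and width > feat_dim yields an over-complete AE."""
    if depth == 0:
        encoder = nn.Linear(feat_dim, width)               # linear bottleneck
    else:
        layers, in_dim = [], feat_dim
        for _ in range(depth):
            layers += [nn.Linear(in_dim, width), nn.ReLU()]
            in_dim = width
        encoder = nn.Sequential(*layers)
    decoder = nn.Linear(width, feat_dim)                   # linear reconstruction
    return nn.Sequential(encoder, decoder)

# e.g. one of the better settings above: two nonlinear layers, width 512.
model = build_autoencoder(width=512, depth=2)
```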

SNR dependency

The analysis in Chapter 3 and the results in Chapter 4 show that enhancement through intermediate conversion is more beneficial at low SNR conditions. We are also interested in its performance at various SNR levels for unsupervised speech conversion. For the following experiments, a simple AE with one nonlinear layer is used for the three noise environments mentioned before. The results are shown in Table 5.4. Comparing the columns labeled "average" between non-conversion and conversion, we observe an improvement in speech quality across all SNR levels. For example, PESQ improves from 1.76 to 2.12 when speech is at 0 dB. The trend is also in line with previous results in Chapter 4, as the improvement is more noticeable for the most difficult noise type (white) than for the simpler noise type (babble). The gain is also more pronounced when noise is at a lower SNR. For instance, PESQ improves from 1.34 to 1.68 at -5 dB, a gain of 0.34. In contrast, the gain reduces to 0.25 for environments at 5 dB, shown in the last row of Table 5.4.

Training data size

A primary reason for the degradation of DNN-based enhancement is domain mismatch in an unseen deployment environment. The proposed method adopts unsupervised learning to perform speech conversion, making it a reasonable candidate for system adaptation in mismatched conditions. In the following experiments, the target domain AE in volvo noise is trained with 10 minutes of speech by one speaker. The source domain AE, which depends on the environment at the deployment stage, is trained with a varying number of utterances, from 1 to 40. Each utterance is 10 seconds long on average.

Figure 5.9: Conversion quality with respect to data size in auto-encoders

Figure 5.9 shows that very few utterances are required to achieve conversion with a decent improvement for challenging noise types, such as white and pink. The improvement is quite noticeable with even just one utterance for white noise, and the performance grows steadily as more utterances become available. Figure 5.9 also shows that a less challenging environment, such as babble noise, requires many more utterances to be effective. It is better not to perform conversion if there are insufficient utterances to train the AE, as it will lead to performance degradation. When 40 utterances are used to train the source encoder, all three noise environments converge to similar performances.

            Direct   No noise information   Noise-aware
white        1.48                    2.09          2.17
babble       2.12                    2.12          2.12
pink         1.70                    2.15          2.20
average      1.77                    2.12          2.16

Table 5.5: Results of noise aware training

Constraints on the latent space

Previous experiments demonstrated the effectiveness of using deep AEs to perform speech conversion in indirect speech enhancement, particularly in challenging noise and SNR conditions. Next, we analyze the addition of constraints on the latent space that help an encoder better separate the speech and noise representations. Noise-aware training, as proposed in Figure 5.6, aims at providing explicit domain information for the AE to better disentangle speech from noise in the latent space. The encoder receives noise LPS features in addition to noisy speech features, which may help it identify the noise components in noisy speech. For a fair comparison with the results in the previous section, we keep the latent layer's dimension fixed. Because speech is generally more non-stationary and has greater spectral variation, more neurons are required for its projection than for background noise. In the following experiment, we use 246 neurons for speech and 10 for noise. Table 5.5 compares the conversion results with and without noise-aware training. There are incremental improvements for white (2.09 to 2.17) and pink noise (2.15 to 2.20), but the improvement is less significant for babble noise. We think that white and pink noise have stable long-term average spectra, whereas babble, as a non-stationary noise, possesses a varying spectrum. Hence, it is harder for the encoder to capture its varying spectral characteristics.
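As an illustration only, the input concatenation and latent split described above could be sketched as follows in PyTorch; the class name and the variable names for the 246/10 split are mine, and the decoder and training loop are omitted.

```python
import torch
import torch.nn as nn

class NoiseAwareEncoder(nn.Module):
    """Sketch of noise-aware training: the encoder sees noisy-speech LPS
    concatenated with a noise LPS estimate, and its latent layer is split
    into a speech part (246 units) and a noise part (10 units)."""
    def __init__(self, feat_dim=257, speech_units=246, noise_units=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * feat_dim, speech_units + noise_units), nn.ReLU())
        self.speech_units = speech_units

    def forward(self, noisy_lps, noise_lps):
        z = self.net(torch.cat([noisy_lps, noise_lps], dim=-1))
        z_speech = z[..., :self.speech_units]   # speech part of the latent code
        z_noise = z[..., self.speech_units:]    # noise part of the latent code
        return z_speech, z_noise

enc = NoiseAwareEncoder()
z_s, z_n = enc(torch.randn(8, 257), torch.randn(8, 257))  # dummy batch
```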

            Direct     AE   DAAE
white        1.48    2.09   2.17
babble       2.12    2.12   2.13
pink         1.70    2.15   2.17
average      1.77    2.12   2.14

Table 5.6: Results of domain adversarial auto-encoder

The domain adversarial auto-encoder (DAAE), as described in Figure 5.7, is investigated next. By encouraging the latent representation to be indiscriminative of the source and target domain, the latent spaces of both AEs become similar. This allows the target domain decoder to replace the source decoder without creating a mismatch between the latent bases. Table 5.6 presents the results of the DAAE. Compared to a simple AE, the DAAE achieves an additional gain in enhancement quality (from 2.12 to 2.14), thanks to the latent space being shared across domains.

The VQAE is similar to the DAAE in using a common set of bases in the latent space. The DAAE promotes shared bases by penalizing latent features that carry domain-dependent information. This specification is more explicit in the VQAE, as the encoded representation is restricted to be one of the entries in the codebook in the case of hard VQ, or a linear combination of them in soft VQ. Since we are converting noisy speech into speech in the target noise domain by using a decoder trained in the target domain, the decoder must be able to recognize the quantized codes after encoding. Thus, we design a codebook by quantizing a basic AE's latent codes. The K-means algorithm is a straightforward method to group the latent vectors into clusters. A problem with K-means is that it requires a pre-defined number of clusters. Furthermore, the Euclidean metric used in K-means tends not to perform well in high dimensions [170, 171], so it is not easy to obtain good cluster centroids that can be used as codebook entries in high dimensions. As discussed in the narrow-bottleneck experiments in Table 5.3, reconstruction is also not good if the dimension is too low. Thus, we vary the latent feature dimension (8, 32, and 128) and the number of codebook entries (10, 50, and 500) for each noise type. We present the results in Figure 5.10. For each noise condition in Figure 5.10, the results are grouped by the size of the codebook (10, 50, and 500). Within each group, the blue, orange, and gray bars correspond to latent vectors of dimension 8, 32, and 128, respectively. We first note that the trend is uniform across all three noise types.

Figure 5.10: Size of codebook and dimension of codebook features: (a) White, (b) Babble, (c) Pink

K-means works well on low-dimensional vectors, but the reconstruction of 257-dimensional LPS from an 8-dimensional latent code is difficult. Hence, the blue bars, corresponding to latent vectors of dimension 8, show the worst conversion in Figure 5.10. As the latent dimension grows to 32 or 128, the conversion quality improves, indicated by the orange and gray bars rising above the blue bars in Figure 5.10. Hence, the benefit of better feature reconstruction outweighs the challenges of K-means clustering for the VQAE in a high-dimensional space. Another trade-off to consider is the size of the codebook. With a small codebook, the encoder has an easier job fitting an acoustic feature to a code vector. However, the quantization error could be too large, resulting in unsatisfactory reconstruction. The results in Figure 5.10 suggest that a larger codebook is more favorable than a smaller one: a codebook size of 50 or 500 outperforms a size of 10. A large codebook corresponds to a wide output layer of the encoder. A risk of using a wide softmax layer in the encoder is low activation in all dimensions without significant peaks. This can be rectified by adding a multiplier β larger than 1 to the softmax. A sharper softmax of degree β is defined as

\xi(\beta v) = \frac{\exp(\beta v)}{\sum_i \exp(\beta v_i)}. \quad (5.22)
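A small numerical sketch of Equation 5.22 follows; the β values below are purely illustrative.

```python
import torch

def sharp_softmax(v, beta=2.0):
    """Sharpened softmax of Equation 5.22: a larger beta concentrates the
    code-selection weights on fewer codebook entries."""
    return torch.softmax(beta * v, dim=-1)

v = torch.randn(500)                      # encoder logits over a 500-entry codebook
print(sharp_softmax(v, beta=1.0).max())   # relatively flat weights
print(sharp_softmax(v, beta=5.0).max())   # a clear peak emerges
```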

The modification above allows us to use wide softmax layers, and hence a large codebook, for more faithful reconstruction. In summary, the above experiments show that the VQAE can be an effective technique to perform unsupervised speech conversion for indirect speech enhancement.

5.3 Dictionary-based indirect speech conversion and enhancement

In section 5.2, we encode noisy speech from the source domain into a latent representation that is implicitly decomposed into speech and noise components. The target noise component replaces the source noise component to perform background noise replacement. By keeping the speech component unchanged, the reconstructed speech is expected to maintain the same speech content but in the target noise environment. In a vanilla AE, there are no constraints on the latent layer that enforce the disentanglement of speech from noise. Three techniques, namely noise-aware training, domain adversarial loss, and vector quantized auto-encoding, have been proposed in section 5.2 to impose constraints on the latent layer that promote separation. This section will further develop a technique akin to the VQAE by using an explicit codebook to represent speech and noise activations in a latent space.

5.3.1 Problem formulation

The additive noise model of speech introduced in Chapter 2 assumes that the speech mixture is the sum of the speech subspace and the noise subspace. In the frequency domain, the speech subspace can be written as X W_X, where X is a codebook of speech bases and W_X are the activation weights. Similarly, the noise subspace for noise type A is D_A W_{D_A}. The codebooks, X and D_A, are collections of basis vectors that span the speech and noise acoustic spaces. The weight vectors W_X and W_{D_A} represent the activation in the latent space. Then the mixed noisy speech, Y, is the sum of these two subspaces:

Y = \begin{bmatrix} X & D_A \end{bmatrix} \begin{bmatrix} W_X \\ W_{D_A} \end{bmatrix}, \qquad W_X, W_{D_A} \geq 0. \quad (5.23)

By constraining the weights, W_X and W_{D_A}, to be non-negative, we can use NMF to solve for the activation weights since the magnitude spectra are always non-negative. Solving Equation 5.23 for the weights, W_X and W_{D_A}, can be interpreted as encoding the noisy speech, Y, into latent activations. The codebooks are equivalent to encoder weights. The converted speech, Ŷ, can be reconstructed as

\hat{Y} = \begin{bmatrix} X & D_B \end{bmatrix} \begin{bmatrix} W_X \\ W_{D_A} \end{bmatrix}. \quad (5.24)

To replace the background noise type while keeping the speech content intact, we must keep the speech subspace unchanged. Hence, in Equation 5.24, both X and W_X stay the same. We further assume that the noise activation is the same, so the same weights, W_{D_A}, are used. Only the noise basis is replaced by the target noise codebook, D_B, to convert the noise subspace. This is analogous to employing a decoder from the target noise domain when using AEs to perform noisy speech conversion. As the codebook is usually over-complete, to capture as much acoustic variation as possible, the activation weights are usually constrained to be sparse. Let P = [X D_A] be the combined codebook and W be the combined weights. The activation weights, W, are computed by minimizing the following loss, L_NMF [19]:

L_{NMF} = \mathrm{KLD}(Y, PW) + \lambda \|W\|_1. \quad (5.25)

The L1 norm is used in place of the L0 norm since it is computationally tractable and promotes sparsity. It is shown in [172] that the weight vectors, W_X and W_{D_A}, can be computed iteratively as

W \leftarrow W \otimes \frac{P^{T}\left(Y \oslash (PW)\right)}{P^{T}\mathbf{1} + \lambda}, \quad (5.26)

where ⊗ and ⊘ denote element-wise multiplication and division, respectively. At deployment, spectral features are converted according to Equation 5.24 and then converted into LPS by taking the logarithm.
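The following is a minimal NumPy sketch of this sparse multiplicative update, assuming non-negative magnitude spectra Y (frequency by frames) and a fixed combined codebook P = [X, D_A]; the function name, the random initialization, and the small eps added for numerical stability are my own choices rather than details from the thesis.

```python
import numpy as np

def nmf_activations(Y, P, lam=0.1, n_iter=200, eps=1e-10):
    """Sparse KLD-NMF activations in the spirit of Eq. 5.26: Y is an (F x T)
    matrix of magnitude spectra, P an (F x K) fixed codebook [X, D_A];
    returns a non-negative (K x T) activation matrix W."""
    rng = np.random.default_rng(0)
    W = rng.random((P.shape[1], Y.shape[1])) + eps    # random non-negative init
    ones = np.ones_like(Y)
    for _ in range(n_iter):
        ratio = Y / (P @ W + eps)                     # element-wise division Y ./ (PW)
        W *= (P.T @ ratio) / (P.T @ ones + lam)       # multiplicative, sparsity-penalized update
    return W
```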

Figure 5.11: Dynamic range compression with exponentiation factor, ρ

The LPS features can then be normalized and enhanced by a downstream enhancer. The raw magnitude features exhibit a large dynamic range between high-intensity and low-intensity frames. LPS features use the logarithm to compress this dynamic range but forgo non-negativity. The large dynamic range may be problematic, as exemplars with high intensity could easily overshadow those with less energy. In this application, we introduce a spectral exponentiation factor, ρ, to compress the range: the magnitude spectrum is compressed as Y^ρ. The effect of this exponentiation factor can be visualized in Figure 5.11. By selecting a value of ρ less than 1, high-energy and low-energy regions become more comparable and hence contribute more evenly to the reconstruction. To incorporate the exponentiation factor, Equation 5.23 and Equation 5.24 are updated as follows:

Y^{\rho} = \begin{bmatrix} X & D_A \end{bmatrix}^{\rho} \begin{bmatrix} W_X \\ W_{D_A} \end{bmatrix}, \quad (5.27)

and

\hat{Y} = \left( \begin{bmatrix} X & D_B \end{bmatrix}^{\rho} \begin{bmatrix} W_X \\ W_{D_A} \end{bmatrix} \right)^{1/\rho}. \quad (5.28)

5.3.2 Experiments and discussions

Codebook construction

Three codebooks need to be constructed: the speech exemplars, X, the noise exemplars from the source domain, D_A, and the noise exemplars from the target domain, D_B. The speech codebook, X, is a collection of feature frames sampled from a specific speaker in a clean environment. The target domain is an intermediate noise type selected according to the criteria in Chapter 3, so its exemplars, D_B, can be collected as a training step. The noise exemplars from the source domain, D_A, can be sampled from speech silence in a recording using a voice activity detector. The number of noise exemplars from the source and target domains must be the same. There are no other steps in the training phase of the converter; the codebooks and hyper-parameters, such as ρ, are pre-defined. At test time, the activation weights are first randomly initialized. We then perform iterative updates to minimize the loss in Equation 5.25, according to Equation 5.26. Lastly, we achieve conversion to the target domain by multiplying the activation weights with the combined codebook in the target noise domain, according to Equation 5.28. The converted speech is used as intermediate noisy speech to be further processed by a speech enhancer.
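Putting the pieces together, a hedged sketch of the test-time conversion with spectral compression might look as follows; it reuses the nmf_activations helper sketched earlier, and the placement of the ρ exponent on the codebooks follows the reconstruction of Equations 5.27 and 5.28 above.

```python
import numpy as np

def convert_exemplar(Y, X, D_A, D_B, rho=0.6, lam=0.1):
    """Exemplar-based conversion at test time (cf. Eqs. 5.27-5.28):
    decompose the compressed source-domain spectra on [X, D_A], then
    reconstruct with the target-domain codebook [X, D_B] and undo the
    compression. All names are illustrative."""
    P_src = np.concatenate([X, D_A], axis=1) ** rho   # compressed source codebook
    P_tgt = np.concatenate([X, D_B], axis=1) ** rho   # compressed target codebook
    W = nmf_activations(Y ** rho, P_src, lam=lam)     # activations on compressed spectra
    Y_hat = (P_tgt @ W) ** (1.0 / rho)                # converted magnitude spectra
    return Y_hat

# The converted spectra are then log-compressed to LPS, normalized, and fed to
# the downstream enhancer.
```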

Spectral compression factor, ρ

Figure 5.11 shows that exponentiation of the spectral magnitude by ρ < 1 can reduce the dynamic range, hence allowing exemplars with low energy to be included. We examine how compression affects the conversion quality. PESQ scores with respect to the compression factor, ρ, in the range of 0.2 to 1.0, are tabulated in Table 5.7. The default value of 1.0 means no spectral compression.

                               NMF (ρ)
            Direct     AE    1.0    0.8    0.6    0.4    0.2
white        1.48    2.09   2.23   2.28   2.34   2.30   1.00
babble       2.12    2.12   2.35   2.41   2.45   2.37   1.01
pink         1.70    2.15   2.27   2.17   2.27   2.29   0.97
average      1.77    2.12   2.28   2.29   2.35   2.32   0.98

Table 5.7: Effect of spectral compression in exemplar conversion

Exemplar-based conversion presents a significant improvement over direct enhancement. On average, PESQ improves from 1.70 to 2.27. It is also better than unsupervised conversion with an AE (2.15). A major reason for the better quality of NMF-based reconstruction over AE-based reconstruction is the use of speech exemplars in the feature domain and a large codebook size, which significantly improves the quality of the reconstructed speech. This is evident in Figure 5.12: the source domain's speech (in white noise) is shown on the top left, speech in volvo noise is the intermediate target, and the oracle converted speech in volvo noise is shown on the top right for reference. The AE-converted speech on the bottom left still has significant noise residue from the source domain because the encoder still encodes noise from the source domain. As a result, the converted speech still contains white-like background noise. The exemplar-based approach, shown on the bottom right, uses a completely different set of exemplars from the target domain for reconstruction. Consequently, the reconstructed speech is only spanned by feature vectors in the domain of volvo noise, and its output is much closer to the oracle output in Figure 5.12b. Furthermore, we also observe that dynamic range compression helps achieve better conversion. Table 5.7 shows that moderate compression at ρ = 0.6 achieves the overall best conversion quality. We compute the standard deviation of the activation weights W in Equation 5.26. When there is no spectral compression, i.e., ρ = 1.0, the standard deviation of the activation is 0.0173, compared to 0.0132 for ρ = 0.6. This implies that the activation is less uniform at large values of ρ: only a few high-energy exemplars are chosen for reconstruction. A more uniform selection at low ρ allows more exemplars to be selected and creates a smoother reconstruction.

Figure 5.12: Comparison between AE- and NMF-converted speech: (a) speech in the source domain, (b) speech in the target domain, (c) AE-converted speech, (d) NMF-converted speech

Exemplar dimensions

Since there is no compression or quantization in exemplar-based conversion, the codebook must be much larger than the codebook used in AE-based conversion. In this section, we examine the effect of the size of the codebook of exemplars. In the first experiment, we focus on the first three columns listed under NMF in Table 5.8. The combined codebook has a fixed size of 5000 entries, and we adjust the proportion of speech and noise exemplars. Intuitively, speech exhibits greater variability, requiring more dimensions in the latent space. Surprisingly, we find the opposite to be true. The first three columns under NMF show that as the speech dimensions decrease and the noise dimensions increase, the conversion quality improves. We can understand this unexpected observation by comparing the converted speech obtained using different compositions of speech and noise exemplars.

                  Direct     AE                         NMF
Speech size            -    256   4000   2500   1000    500   1000   1000   1000
Noise size             -      -   1000   2500   4000   2000   2000   4000   8000
white               1.48   2.09   2.17   2.34   2.55   2.38   2.42   2.55   2.17
babble              2.12   2.12   2.36   2.49   2.51   2.40   2.48   2.51   2.54
pink                1.70   2.15   1.98   2.27   2.48   2.34   2.31   2.48   2.56
average             1.77   2.12   2.17   2.37   2.51   2.37   2.40   2.51   2.42

Table 5.8: Effect of codebook size in NMF conversion

Figure 5.13: Comparison between different codebook compositions: (a) speech exemplars = 1k, noise exemplars = 4k; (b) speech exemplars = 4k, noise exemplars = 1k

The converted speech using 4000 speech exemplars and 1000 noise exemplars is shown in Figure 5.13a. Compared to the result obtained using 1000 speech exemplars and 4000 noise exemplars in Figure 5.13b, the former spectrogram still contains much residual noise, which appears as vertical stripes during speech silence. A relatively large number of noise exemplars is required to decompose the background noise in the source domain for challenging noise conditions. A large number of noise exemplars also helps the decoder render more natural background noise. On the other hand, the speech components are unchanged in the conversion process, so less resolution may be needed. As a result, NMF-based speech transformation favors larger dimensions for the noise exemplars.

In the second experiment, we attempt to investigate a reasonable range for the size of both the speech and noise codebooks for optimal conversion quality. By comparing the fourth (500/2000) and the fifth (1000/2000) columns under NMF in Table 5.8, we can tell that a reasonably large codebook is required to represent speech well, as 500 speech exemplars are not rich enough to decompose the speech subspace.

                    Non-conversion                        Conversion
SNR (dB)   babble   pink   white   average   babble   pink   white   average
-5           1.70   1.24    1.08      1.34     2.00   2.18    2.23      2.14
0            2.12   1.70    1.48      1.76     2.36   2.48    2.55      2.46
5            2.51   2.16    1.99      2.22     2.64   2.69    2.76      2.70
average      2.11   1.70    1.52      1.77     2.33   2.45    2.51      2.43

Table 5.9: NMF-based conversion at various SNR levels

The last three columns in Table 5.8 show that an overly large codebook can also adversely affect the conversion quality. As the codebook grows, many similar exemplars are included in the codebook. These repeated or similar exemplars do not bring further improvement in conversion quality; on the other hand, they slow down the iterative optimization significantly. In our experiments, the configuration of 1000 speech exemplars and 4000 noise exemplars provides the best and most efficient conversion and subsequent enhancement quality.

SNR dependency

Lastly, we demonstrate that NMF-based conversion is effective across SNR levels. Table 5.9 shows that the proposed conversion scheme achieves noticeable improvements at all tested SNR levels for all three noise types. It is also important to note that the improvement is larger on pink and white noise than on babble noise, since the first two are more difficult, with lower PESQ scores.

5.4 Summary

This chapter focuses on indirect enhancement by converting noisy speech from a source noise domain to a target noise domain without the explicit mapping targets used in Chapter 4. This setting is handy when noisy-clean speech pairs are not available. The principle behind unsupervised speech conversion is to decompose noisy speech into speech and noise subspaces, a form of representational learning. We applied two classes of techniques to solve this problem: auto-encoders and matrix decomposition. With AEs, we find latent representations of speech and noise in a bottleneck layer. With NMF, we construct weight matrices representing the activity of speech or noise exemplars. In either case, the speech and noise subspaces are assumed to be separable in the latent space. We can then replace the noise subspace of the source noise with that of the target noise to accomplish speech conversion and indirect speech enhancement. Various techniques designed to promote greater disentanglement of the speech and noise subspaces were discussed, including noise-aware training, domain-adversarial training, and vector-quantized training. Conversion based on these AE models has been shown to improve the overall enhancement quality for the noise types investigated. Several factors that affect the performance, including width, depth, and SNR levels, were also discussed. The AE models cannot completely disentangle the speech and noise subspaces, leaving considerable noise residue in the converted speech and degrading the enhancement quality. NMF addresses this concern by explicitly using different noise bases for the source and target domains. Consequently, it achieves conversion with much less residual noise. A downside of NMF-based conversion is its long latency due to its iterative optimization, which makes it less ideal for applications such as online adaptation. A possible solution could incorporate exemplars into the VQAE to achieve both good conversion with high-resolution exemplars and fast inference in neural networks.

CHAPTER 6
CONCLUSIONS

6.1 Summary of research

This thesis proposes an indirect approach to speech enhancement by leveraging the framework of curriculum learning [89] described in Chapter 2. Conventional DNN-based enhancement systems [9] trained using data-driven techniques generally do not distinguish between noise environments. As a result, their performance is unsatisfactory in adverse acoustic conditions with mismatched noise at low SNR. We recognize the difficulty in enhancing such noisy speech directly and propose to divide the process into simpler sub-tasks. In the indirect set-up, we first transform noisy speech features into speech in another background noise that is easier to process. This step does not require all background noise to be removed at once; the residual noise is eliminated in subsequent refinement stages. Since each stage is only responsible for partially enhancing the speech, the sub-tasks can be designed with greater flexibility to address the issues at every step. We empirically demonstrate that, for each of the proposed methods, the indirect approach yields substantial performance gains over direct methods in traditionally adverse acoustic environments.

6.2 Contributions

6.2.1 Noise characterization

We introduce the indirect approach to speech enhancement in section 2.3. Compared to direct approaches, our indirect approach offers more benefits in challenging noise conditions due to its simpler sub-tasks. It requires us to first identify applicable scenarios with adverse noise conditions. Next, we can select an intermediate speech target after calibrating the difficulty levels of the noise types. Our first contribution is the clustering and classification of additive noise into simple or difficult noise types in the context of enhancement, with empirical validation. Consistent with previous studies, we confirm that wideband noise is more difficult to enhance than narrowband noise. However, we also discover that non-stationarity does not pose a significant challenge, since dynamic noise can be approximated with pseudo-stationary noise at the frame level, given that the frame rate is relatively high. This observation also corroborates the advantage of DNN-based speech enhancement over traditional statistical methods in handling non-stationary noise. We identify the long-term average PSD of noise as a reasonable indicator of the difficulty of additive noise. In general, simple noise can be masked by the average speech spectrum, whereas difficult noise has high energy at spectral valleys. We also show that one way difficult noise adversely affects speech enhancement is through improper normalization caused by mismatched feature statistics. Enhancement experiments are conducted on simulated noisy speech in various noise conditions to evaluate how the aforementioned factors affect enhancement quality.

6.2.2 Indirect enhancement via supervised learning

After identifying the intermediate stages, we proposed several speech transformation techniques. The first technique transforms difficult noisy speech by normalizing its feature statistics toward an easier noise type. It is motivated by the observation that speech features in more challenging noise conditions follow a different distribution from clean speech. This difference also translates into different activations in the hidden and output layers of a DNN. Feature normalization and histogram matching can reduce such mismatches. We conducted experiments to show that this is an effective technique for handling very difficult noise; for moderately challenging noise, the improvement is not as noticeable. When parallel training pairs can be synthesized or recorded, we may perform mapping at the frame level to achieve speech transformation with greater effectiveness. We use DNNs to map speech features to a target domain frame by frame while minimizing the MSE loss. This supervised learning approach can largely alleviate the problem of domain mismatch. Experimental findings confirm that it is also effective in moderately challenging noisy conditions, as it reduces speech distortion in the conversion process. We further extend this technique to handle interference from several sources. When multiple noise sources are present, the indirect approach can be leveraged to remove the disturbance progressively. In this multi-stage set-up, the noise mixture is considered a difficult noise condition, as it possesses complex temporal and frequency structures. After each stage, one noise type is removed from the noisy speech. We remove the noise sources progressively until the clean speech is recovered. By comparing and evaluating the operating order, we determine that it is generally better to remove the more challenging noise upfront. Such knowledge would help speech engineers develop speech enhancement pipelines in practical situations.

6.2.3 Indirect enhancement via representational learning

This thesis’s third contribution includes various speech transformation techniques without requiring parallel data to train speech converters. It is achieved by discovering underly- ing structures in noisy speech features in a latent space. Even though speech and noise acoustic features are not linearly separable, we take advantage of auto-encoders and dictio- nary learning to transform speech features into a latent space where they become separable. We could then replace the noise sub-space from the original domain with that from a tar- get domain while keeping the speech sub-space unchanged. Subsequently we synthesize converted speech by combining the speech sub-space and the target noise sub-space. We explore latent structures using auto-encoders and dictionary-based learning. With auto- encoders, source encoders transform input speech features into latent vectors. We impose constraints on the latent space to promote greater separability of the speech and noise. With the dictionary-based method, we utilize NMF to determine the activation weights of a set of over-complete speech and noise exemplars. Speech conversion could then be conducted

113 by changing the set of noise bases. We evaluate the validity of the proposed techniques with a series of experiments on simulated data. We observe that the dictionary-based meth- ods can convert noisy speech with greater fidelity because of the use of over-complete exemplars. However, its iterative procedure may make it unsuitable for some online or resource-constraint applications.

6.3 Future work

This thesis attempts to develop a noise-aware strategy for deep learning-based speech enhancement models. By recognizing that the acoustic environment is challenging, we resort to the proposed indirect approaches to decompose speech enhancement into multiple stages. Unlike SNR, which has a clear relationship with enhancement difficulty, noise types do not possess such a natural interpretation and have not received widespread study. As an initial step, much of the analysis and validation is performed empirically. To this end, some potential directions for future investigation can be suggested.

6.3.1 Theoretical characterization of noise types

In Chapter 3, we identified, using an experimental approach, several factors that make some noise environments more difficult to enhance than others. However, acoustic conditions can be very diverse, and it is impractical to enumerate and archive all of them. Hence, it is desirable to develop a more theoretical understanding of the interaction of noise and speech in the context of speech enhancement. Our attempt in subsection 4.2.1 is a starting point for explaining how difficult noise affects feature normalization, but we expect the interaction between noise and speech to go beyond the input space. As the community gradually gains more insight into the inner workings of deep models, it is reasonable to study how deep models treat the noise types differently in the context of speech enhancement.

114 6.3.2 Noisy speech with multiple sources

This thesis only addresses the interference of additive noise in a single channel. In reality, disturbances in speech can also arise from echo, reverberation, and competing speakers on top of additive noise. In some commercial smart speakers, the audio is generally processed in multiple stages [173], where each stage only handles one aspect of the interference. However, there is still room to explore which interference should be processed first. It would be useful for future researchers to relate the degree of degradation caused by each interference to the order of its removal in the enhancement process.

6.3.3 Disentanglement of latent feature

Disentanglement of the latent features helps us achieve unsupervised speech transformation in this application, and helps speech researchers understand structures of speech in deep learning in general. By developing methods to separate speech features by some desirable traits, such as phonemes, gender, speaker, tone, and emotion, we can apply this technique to many other speech-related tasks, including speech recognition, speaker identification, and emotion classification.

6.3.4 Explorations of different deep architectures for speech transformation

Deep model-based speech enhancement has received much attention in the speech community in the last decade. More advanced models and features have been proposed in the literature, such as the use of complex ratio masks to include phase prediction [174], raw waveform enhancement [175], and models combining beamforming techniques [176]. Since few of these models focus on noise-aware training, we expect the proposed indirect approach to benefit these advanced models. It is nevertheless non-trivial to scale the proposed work to multi-channel or complex models, and further investigation is warranted to extend our framework to these recent developments.

Appendices

APPENDIX A
DERIVATION OF MEAN DEVIATION IN THE NORMALIZATION OF LPS FEATURE

The mean deviation, ∆µ, is defined in Equation 4.13. Substituting the definition of µ_LPS in Equation 4.11 into Equation 4.13,

\Delta\mu = E\left[\log Y_m^2\right] - E\left[\log X_m^2\right] = E\left[W(\xi_m, \phi_{XD})\right]. \quad (A.1)

When the SNR is high, i.e., ξ_m → ∞, we can further show that the first-order Taylor series expansion of W(ξ_m, φ_XD) is

W(\xi_m, \phi_{XD}) \approx \frac{2\cos\phi_{XD}}{\xi_m}. \quad (A.2)

Hence,

E\left[W(\xi_m, \phi_{XD})\right] = E\left[\frac{2\cos\phi_{XD}}{\xi_m}\right] \approx 0. \quad (A.3)

When the SNR is low, i.e., ξ_m → 0,

W(\xi_m, \phi_{XD}) \approx -2\log(\xi_m) + 2\xi_m \cos\phi_{XD}. \quad (A.4)

Furthermore, we assume the phase difference follows a uniform distribution, i.e., φ_XD ∼ U(−π, π]. Hence, for low SNR,

E\left[W(\xi_m, \phi_{XD})\right] = E\left[-2\log(\xi_m) + 2\xi_m\cos\phi_{XD}\right] \approx -2E\left[\log(\xi_m)\right]. \quad (A.5)

APPENDIX B
DERIVATION OF THE VARIANCE DEVIATION IN THE NORMALIZATION OF LPS FEATURE

The variance deviation, ∆_{σ²}, is defined in Equation 4.15. We attempt to simplify it as

\Delta_{\sigma^2} = \mathrm{Var}(\log Y_m^2) - \mathrm{Var}(\log X_m^2)
= \mathrm{Cov}\left(\log Y_m^2 - \log X_m^2,\; \log Y_m^2 + \log X_m^2\right)
= \mathrm{Cov}\left(W(\xi_m, \phi),\; W(\xi_m, \phi) + 2\log X_m^2\right) \quad (B.1)
= 2\,\mathrm{Cov}\left(\log X_m^2,\; W(\xi_m, \phi)\right) + \mathrm{Var}(W(\xi_m, \phi))
= 2E\left[\log X_m^2\, W(\xi_m, \phi)\right] - 2E\left[\log X_m^2\right]E\left[W(\xi_m, \phi)\right] + E\left[W(\xi_m, \phi)^2\right] - E\left[W(\xi_m, \phi)\right]^2,

where φ = ∠X(m, k) − ∠D(m, k) represents the phase difference. The result in Equation B.1 can be simplified depending on whether ξ_m is high or low.

High SNR (ξ_m → ∞)

When the SNR is high, we know that E[W(ξ_m, φ)] ≈ E[2 cos φ / ξ_m] ≈ 0 from Equation A.3. We assume that log X_m² and φ are independent. This allows us to simplify the first term in Equation B.1:

2E\left[\log X_m^2\, W(\xi_m, \phi)\right] \approx 2E\left[\log X_m^2 \cdot \frac{2\cos\phi}{\xi_m}\right] \quad (B.2)
= 4E\left[\cos\phi\right] E\left[\frac{\log X_m^2}{\xi_m}\right] \quad (B.3)
= 0. \quad (B.4)

The second term, −2E[log X_m²] E[W(ξ_m, φ)], and the last term, E[W(ξ_m, φ)]², are 0 too, because we can make the substitution in Equation A.3. For the third term,

E\left[W(\xi_m, \phi)^2\right] = \int_0^{\infty}\!\!\int_{-\pi}^{\pi} f(\xi_m, \phi)\, W(\xi_m, \phi)^2 \, d\phi\, d\xi
\approx \int_0^{\infty}\!\!\int_{-\pi}^{\pi} f(\xi_m, \phi) \left(\frac{2\cos\phi}{\xi}\right)^2 d\phi\, d\xi
= \int_0^{\infty} f(\xi) \left(\frac{1}{2\pi}\int_{-\pi}^{\pi} \frac{4\cos^2\phi}{\xi^2}\, d\phi\right) d\xi
= E\left[\frac{2}{\xi^2}\right]. \quad (B.5)

Substituting these results back into Equation B.1, we find that

\Delta_{\sigma^2} \approx 0 \quad (B.6)

when most ξ_m correspond to high SNRs.

Low SNR (ξ_m → 0)

When most ξ_m are at low SNR, W(ξ_m, φ) ≈ −2 log ξ_m = log D_m² − log X_m² by Equation A.5. We can simplify E[W(ξ_m, φ)²] by making the substitution above,

E\left[W(\xi_m, \phi)^2\right] = \int_0^{\infty}\!\!\int_{-\pi}^{\pi} f(\xi_m, \phi)\, W(\xi_m, \phi)^2 \, d\phi\, d\xi_m
\approx \int_0^{\infty}\!\!\int_{-\pi}^{\pi} f(\xi_m, \phi) \left(-2\log\xi_m + 2\xi_m\cos\phi\right)^2 d\phi\, d\xi
= \int_0^{\infty} f(\xi) \left(\frac{1}{2\pi}\int_{-\pi}^{\pi} 4\log^2\xi_m - 8\xi_m\log\xi_m\cos\phi + 4\xi_m^2\cos^2\phi \, d\phi\right) d\xi_m
= E\left[4\log^2\xi_m + 2\xi^2\right]. \quad (B.7)

We can then simplify Equation B.1 when the SNR is low:

\Delta_{\sigma^2} = 2E\left[\log X_m^2\, W(\xi_m, \phi)\right] - 2E\left[\log X_m^2\right]E\left[W(\xi_m, \phi)\right] + E\left[W(\xi_m, \phi)^2\right] - E\left[W(\xi_m, \phi)\right]^2
\approx 2E\left[\log X_m^2\left(\log D_m^2 - \log X_m^2\right)\right] - 2E\left[\log X_m^2\right]E\left[\log D_m^2 - \log X_m^2\right] + E\left[4\log^2(X_m/D_m)\right] + 2E\left[\frac{X_m^2}{D_m^2}\right] - E^2\left[\log D_m^2 - \log X_m^2\right] \quad (B.8)
= E\left[\log^2 D_m^2\right] - E^2\left[\log D_m^2\right] - \left(E\left[\log^2 X_m^2\right] - E^2\left[\log X_m^2\right]\right) + 2E\left[\xi^2\right]
= \mathrm{Var}(\log D_m^2) - \mathrm{Var}(\log X_m^2) + 2E\left[\xi^2\right]
\approx \mathrm{Var}(\log D_m^2) - \mathrm{Var}(\log X_m^2).

If many TF bins have low SNR, 2E[ξ²] ≈ 0. The deviation in variance is then approximated by the difference between the variance of the noise spectrum and that of the speech spectrum.

APPENDIX C
DEFINITION OF COLORED NOISE

The naming convention for colored noise originated from white noise, which has a flat power spectrum on a linear frequency axis. It is called white in analogy to white light, which is assumed to have a flat power spectrum over the electromagnetic waves in the visible range. The other colors are named to loosely reflect a similarity with the visible light spectrum of the corresponding color. In other words, the spectrum of pink noise would translate into pink light if the audio frequency axis were mapped onto an electromagnetic frequency axis of an appropriate range. The spectrograms and PSDs of these noise types are included at the end of this appendix.

C.1 White noise

White noise has a flat spectrum over linear frequency in Hz. Consequently, the signal has uniform PSD in the linear frequency axis. Digital white noise can be generated by randomly and independently selecting samples.

C.2 Gray noise

Gray noise is perceptually weighted white noise: human hearing is not linear, as reflected in the equal-loudness contours. Hence, the spectrum in each frequency range is modified to give the listener the perception of equal loudness across all frequencies.

C.3 Pink noise

Pink noise has a PSD inversely proportional to frequency. Its power density falls off at 10 dB per decade, or 3 dB per octave [177]. It is also known as 1/f noise, as its PSD, S(f), follows

S(f) \propto \frac{1}{f}. \quad (C.1)

It is commonly observed as flicker noise in electronics, and also in astrophysics and neurobiology [178].

C.4 Red noise

Red noise has a PSD inversely proportional to the square of frequency. Its power density falls off faster than that of pink noise at 6dB per octave

S(f) \propto \frac{1}{f^2}. \quad (C.2)

Red noise can be generated by temporal integration of white noise. It is also known as Brownian noise, as it is the type of noise produced by Brownian motion, i.e., a random walk.

C.5 Blue noise

Blue noise, a.k.a. azure noise, can be considered the complement of pink noise. Its PSD rises at 3 dB per octave, as it is proportional to frequency over a finite frequency range:

S(f) \propto f. \quad (C.3)

It has been observed in Cherenkov radiation and used in dithering.

C.6 Purple noise

Purple noise, a.k.a. violet noise, is the counterpart of red noise. Its PSD rises at 6 dB per octave, since it is proportional to f²:

S(f) \propto f^2. \quad (C.4)

122 Just as red noise can be generated from integration of white noise, purple noise can be obtained by differentiating the white noise. It has been observed in acoustic thermal noise of ocean water [179] and applied to dithering in digital audio.
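As a small illustration of the generation rules mentioned in this appendix (white noise as independent random samples, red noise as its running sum, purple noise as its difference), here is a NumPy sketch; the sample count and normalization are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 16000                                   # one second at 16 kHz (illustrative)

white = rng.standard_normal(n)              # independent random samples (cf. C.1)
red = np.cumsum(white)                      # temporal integration of white noise (cf. C.4)
purple = np.diff(white, prepend=0.0)        # differentiation of white noise (cf. C.6)

# Normalize each signal to unit variance for a fair spectral comparison.
white, red, purple = (x / x.std() for x in (white, red, purple))
```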

C.7 Black noise

The term is sometimes used to denote the absence of energy at any frequency, hence black.

Figure C.1: Spectrograms and PSDs of some colored noise: (a) White, (b) Gray, (c) Pink, (d) Red, (e) Blue, (f) Purple

APPENDIX D
DESCRIPTION OF NONSPEECH NOISE

The noise types in the Nonspeech corpus were collected by Hu and Wang [154]. The corpus includes 100 types of commonly encountered noise, classified into the following categories by their file names, n001 to n100.

• n001-n017: Crowd noise
• n018-n029: Machine noise
• n030-n043: Alarm and siren
• n044-n046: Traffic and car noise
• n047-n055: Animal sound
• n056-n069: Water sound
• n070-n078: Wind
• n079-n082: Bell
• n083-n085: Cough
• n086: Clap
• n087: Snore
• n088: Click
• n088-n090: Laugh
• n091-n092: Yawn
• n093: Cry
• n094: Shower
• n095: Tooth brushing
• n096-n097: Footsteps
• n098: Door moving
• n099-n100: Phone dialing

APPENDIX E
DESCRIPTION OF NOISEX92 NOISE

This appendix catalogues the Noisex92 noise types used in this study. They are mostly noise measured in the field by the speech research unit at the Institute for Perception-TNO, the Netherlands, and in the United Kingdom in February 1990. Except for the high frequency channel, white, and pink noise, the audio was recorded with a 1/2" B&K condenser microphone on digital audio tapes, with an anti-aliasing filter but no pre-emphasis, at a sampling rate of 19.98 kHz and a bit depth of 16 bits [180]. The spectrograms and PSDs of each noise type can be viewed at the end of this appendix.

E.1 Babble

The source of this babble was 100 people talking in a canteen environment. The room is over 2 m wide, and individual voices are barely intelligible. The sound level was 88 dBA.

E.2 Buccaneer1, Buccaneer2

They are also referred to as cockpit noise in some other literature. Buccaneer1 was recorded when a Buccaneer jet was traveling at 190 knots at an altitude of 1000 ft without airbrakes. The sound level was 109 dBA. Buccaneer2 was recorded when the jet was traveling at 450 knots at 300 ft. The sound level was 116 dBA.

E.3 Destroyer Engine Room, Destroyer Operation Room

They were recorded on a destroyer. The sound level was 101 dBA and 70 dBA respectively.

126 E.4 F16

It is another noise recorded in the cockpit of a fighter jet. The microphone was placed at the co-pilot’s seat in a two-seat F-16 traveling at a speed of 500 knots at an altitude between 300 and 600 ft. The sound level was 103 dBA.

E.5 Factory1, Factory2

Factory1 was recorded near plate-cutting and electrical welding equipment and Factory2 was recorded in a car production hall. Factory2 has a narrow energy band.

E.6 Hfchan

The noise is extracted from a high frequency radio channel after demodulation.

E.7 Leopard, M109

The Leopard noise was created by a Canadian Leopard 1 tank moving at 70 mph; the sound level was 114 dBA at the time of recording. M109 is another tank noise: an M109 self-propelled howitzer traveling at 19 mph was recorded at a sound level of 100 dBA.

E.8 Machinegun

It is a non-stationary noise produced by repeated firing of a 0.5 calibre machine gun.

E.9 Pink

The definition follows that in Appendix C. The recording was acquired by sampling a high-quality analog noise generator.

127 E.10 Volvo

It is a narrowband signal. The sound of a Volvo 340 automobile traveling at 70 mph in fourth gear on a tarmacked road in rainy weather was recorded.

E.11 White

The definition follows that in Appendix C. The recording was acquired by sampling a high-quality analog noise generator.

Figure E.1: Spectrograms and PSDs of Noisex92 noise: (a) Babble, (b) Buccaneer1, (c) Buccaneer2, (d) Destroyer engine, (e) Destroyer room, (f) F16, (g) Factory1, (h) Factory2, (i) High frequency channel, (j) Leopard tank, (k) M109 tank, (l) Machinegun, (m) Pink, (n) Volvo, (o) White

REFERENCES

[1] L. R. Rabiner, “Digital processing of speech signal,” Digital Process. of Speech Signal, 1978.

[2] M. Berouti et al., “Enhancement of speech corrupted by acoustic noise,” in ICASSP, IEEE, vol. 4, 1979, pp. 208–211.

[3] S. Boll, “Suppression of acoustic noise in speech using spectral subtraction,” TASLP, vol. 27, no. 2, pp. 113–120, 1979.

[4] N. Wiener, Extrapolation, interpolation, and smoothing of stationary time series: with engineering applications. MIT Press, 1950.

[5] J. S. Lim and A. V. Oppenheim, “Enhancement and bandwidth compression of noisy speech,” Proc. of the IEEE, vol. 67, no. 12, pp. 1586–1604, 1979.

[6] R. Frazier et al., “Enhancement of speech by adaptive filtering,” in ICASSP, IEEE, vol. 1, 1976, pp. 251–253.

[7] D. O’Shaughnessy, “Linear predictive coding,” IEEE Potentials, vol. 7, no. 1, pp. 29– 32, 1988.

[8] I. Cohen and B. Berdugo, “Speech enhancement for non-stationary noise environ- ments,” Signal process., vol. 81, no. 11, pp. 2403–2418, 2001.

[9] Y. Xu et al., “An experimental study on speech enhancement based on deep neural networks,” IEEE Signal Process. Letters, vol. 21, no. 1, pp. 65–68, 2013.

[10] ——, “A regression approach to speech enhancement based on deep neural net- works,” TASLP, vol. 23, no. 1, pp. 7–19, 2015.

[11] S.-W. Fu et al., “Raw waveform-based speech enhancement by fully convolutional networks,” in APSIPA, IEEE, 2017, pp. 006–012.

[12] Y. Ephraim and D. Malah, “Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator,” TASLP, vol. 32, no. 6, pp. 1109– 1121, 1984.

[13] S. Wang et al., “A cross-task transfer learning approach to adapting deep speech enhancement models to unseen background noise using paired senone classifiers,” in ICASSP, IEEE, 2020, pp. 6219–6223.

[14] T. Gao et al., "Snr-based progressive learning of deep neural network for speech enhancement," in Interspeech, 2016, pp. 3713–3717.

[15] Y.-H. Tu et al., “A multi-target snr-progressive learning approach to regression based speech enhancement,” TASLP, 2020.

[16] P. C. Loizou, Speech enhancement: theory and practice. CRC press, 2013.

[17] M. Abe et al., “Voice conversion through vector quantization,” Journal of the Acoustical Society of Japan (E), vol. 11, no. 2, pp. 71–76, 1990.

[18] T. Toda et al., “Voice conversion based on maximum-likelihood estimation of spec- tral parameter trajectory,” TASLP, vol. 15, no. 8, pp. 2222–2235, 2007.

[19] Z. Wu et al., “Exemplar-based voice conversion using non-negative spectrogram deconvolution,” in 8th ISCA Workshop on Speech Synthesis, 2013.

[20] A. J. Hunt and A. W. Black, “Unit selection in a concatenative speech synthesis system using a large speech database,” in ICASSP, IEEE, vol. 1, 1996, pp. 373– 376.

[21] D. Erro et al., “Voice conversion based on weighted frequency warping,” TASLP, vol. 18, no. 5, pp. 922–931, 2009.

[22] J. B. Allen and L. R. Rabiner, “A unified approach to short-time fourier analysis and synthesis,” Proc. of the IEEE, vol. 65, no. 11, pp. 1558–1564, 1977.

[23] R. W. Hamming, Digital filters. Courier Corporation, 1998.

[24] A. V. Oppenheim, Discrete-time signal processing. Pearson Education India, 1999.

[25] K. K. Paliwal and L. D. Alsteris, “On the usefulness of stft phase spectrum in human listening tests,” Speech commun., vol. 45, no. 2, pp. 153–170, 2005.

[26] D. Wang and J. Lim, “The unimportance of phase in speech enhancement,” TASLP, vol. 30, no. 4, pp. 679–681, 1982.

[27] R. Crochiere, “A weighted overlap-add method of short-time fourier analysis/synthesis,” TASLP, vol. 28, no. 1, pp. 99–102, 1980.

[28] P. C. Loizou, “Speech quality assessment,” in Multimedia analysis, processing and communications, Springer, 2011, pp. 623–654.

[29] K. Li and C.-H. Lee, “A deep neural network approach to speech bandwidth ex- pansion,” in ICASSP, IEEE, 2015, pp. 4395–4399.

[30] D. Klatt, "Prediction of perceived phonetic distance from critical-band spectra: A first step," in ICASSP, IEEE, vol. 7, 1982, pp. 1278–1281.

[31] J. H. Hansen and B. L. Pellom, “An effective quality evaluation protocol for speech enhancement algorithms,” in 5th Int. Conf. on Spoken Language Process., 1998.

[32] A. W. Rix et al., “Perceptual evaluation of speech quality (pesq)-a new method for speech quality assessment of telephone networks and codecs,” in ICASSP, IEEE, vol. 2, 2001, pp. 749–752.

[33] Y. Hu and P. C. Loizou, “Evaluation of objective quality measures for speech en- hancement,” TASLP, vol. 16, no. 1, pp. 229–238, 2007.

[34] M. R. Weiss et al., “Study and development of the intel technique for improving speech intelligibility,” Nicolet Scientific Corp Northvale NJ, Tech. Rep., 1975.

[35] R. McAulay and M. Malpass, “Speech enhancement using a soft-decision noise suppression filter,” TASLP, vol. 28, no. 2, pp. 137–145, 1980.

[36] M. Dendrinos, S. Bakamidis, and G. Carayannis, “Speech enhancement from noise: A regenerative approach,” Speech Commun., vol. 10, no. 1, pp. 45–57, 1991.

[37] Y. Ephraim and H. L. Van Trees, “A signal subspace approach for speech enhance- ment,” TASLP, vol. 3, no. 4, pp. 251–266, 1995.

[38] S. Tamura and A. Waibel, “Noise reduction using connectionist models,” in ICASSP, 1988, pp. 553–554.

[39] S. Tamura, “An analysis of a noise reduction neural network,” in ICASSP, IEEE, 1989, pp. 2001–2004.

[40] S. Roweis, “One microphone source separation,” Advances in Neural Inf. Process. Syst., vol. 13, pp. 793–799, 2000.

[41] G.-J. Jang and T.-W. Lee, “A maximum likelihood approach to single-channel source separation,” J. of Mach. Learning Research, vol. 4, pp. 1365–1392, 2003.

[42] M. L. Seltzer et al., “A bayesian classifier for spectrographic mask estimation for missing feature speech recognition,” Speech Commun., vol. 43, no. 4, pp. 379–393, 2004.

[43] K. W. Wilson et al., “Speech denoising using nonnegative matrix factorization with priors,” in ICASSP, IEEE, 2008, pp. 4029–4032.

[44] K. Han and D. Wang, "A classification based approach to speech segregation," J. of the Acoust. Soc. of America, vol. 132, no. 5, pp. 3475–3483, 2012.

[45] Y.Wang and D. Wang, “Towards scaling up classification-based speech separation,” TASLP, vol. 21, no. 7, pp. 1381–1390, 2013.

[46] P. Lockwood and J. Boudy, “Experiments with a nonlinear spectral subtractor (nss), hidden markov models and the projection, for robust speech recognition in cars,” Speech commun., vol. 11, no. 2-3, pp. 215–228, 1992.

[47] J. Lim, “Evaluation of a correlation subtraction method for enhancing speech de- graded by additive white noise,” IEEE TASLP, vol. 26, no. 5, pp. 471–472, 1978.

[48] Y. Hu and P. C. Loizou, “Subjective comparison of speech enhancement algo- rithms,” in ICASSP, IEEE, vol. 1, 2006, pp. I–I.

[49] J. Benesty et al., Springer Handbook of Speech Processing. Springer, 2007.

[50] Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean-square error log-spectral amplitude estimator,” TASLP, vol. 33, no. 2, pp. 443–445, 1985.

[51] Y. Hu and P. C. Loizou, “A comparative intelligibility study of single-microphone noise reduction algorithms,” J. of the Acoust. Soc. of America, vol. 122, no. 3, pp. 1777–1786, 2007.

[52] O. Cappé, "Elimination of the musical noise phenomenon with the Ephraim and Malah noise suppressor," TASLP, vol. 2, no. 2, pp. 345–349, 1994.

[53] K. Hermus et al., “A review of signal subspace speech enhancement and its ap- plication to noise robust speech recognition,” J. on Advances in Signal Process., vol. 2007, no. 1, p. 45 821, 2006.

[54] S. Van Huffel, “Enhanced resolution based on minimum variance estimation and exponential data modeling,” Signal process., vol. 33, no. 3, pp. 333–355, 1993.

[55] B. De Moor, “The singular value decomposition and long and short spaces of noisy matrices,” Trans. on signal process., vol. 41, no. 9, pp. 2826–2838, 1993.

[56] S. H. Jensen et al., “Reduction of broad-band noise in speech by truncated qsvd,” TASLP, vol. 3, no. 6, pp. 439–448, 1995.

[57] P. S. K. Hansen et al., “Experimental comparison of signal subspace based noise reduction methods,” in ICASSP, IEEE, vol. 1, 1999, pp. 101–104.

[58] U. Mittal and N. Phamdo, "Signal/noise klt based approach for enhancing speech degraded by colored noise," TASLP, vol. 8, no. 2, pp. 159–167, 2000.

[59] F. Xie and D. Van Compernolle, “A family of mlp based nonlinear spectral estima- tors for noise reduction,” in ICASSP, IEEE, vol. 2, 1994, pp. II–53.

[60] E. A. Wan et al., “Networks for speech enhancement,” Handbook of Neural Net- works for Speech Process., vol. 139, no. 1, p. 7, 1999.

[61] F. Rosenblatt, “Principles of neurodynamics. perceptrons and the theory of brain mechanisms,” Cornell Aeronautical Lab Inc Buffalo NY, Tech. Rep., 1961.

[62] V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann ma- chines,” in ICML, 2010.

[63] J. Leonard and M. Kramer, “Improvement of the backpropagation algorithm for training neural networks,” Comput. & Chem. Eng., vol. 14, no. 3, pp. 337–341, 1990.

[64] S. Ruder, “An overview of gradient descent optimization algorithms,” arXiv:1609.04747, 2016.

[65] K. Hornik et al., “Multilayer feedforward networks are universal approximators.,” Neural networks, vol. 2, no. 5, pp. 359–366, 1989.

[66] G. Cybenko, “Approximations by superpositions of a sigmoidal function,” Mathe- matics of Control, Signals and Syst., vol. 2, pp. 183–192, 1989.

[67] F. Xie and D. Van Compernolle, "Speech enhancement by nonlinear spectral estimation: A unifying approach," in 3rd European Conf. on Speech Commun. and Technol., 1993.

[68] J. Tchorz and B. Kollmeier, “Snr estimation based on amplitude modulation anal- ysis with applications to noise suppression,” TASLP, vol. 11, no. 3, pp. 184–192, 2003.

[69] S. Hochreiter, “The vanishing gradient problem during learning recurrent neural nets and problem solutions,” Int. J. of Uncertainty, Fuzziness and Knowledge-Based Syst., vol. 6, no. 02, pp. 107–116, 1998.

[70] G. Hinton et al., “A fast learning algorithm for deep belief nets,” Neural Computa- tion, vol. 18, no. 7, pp. 1527–1554, 2006.

[71] Y. Wang and D. Wang, "Boosting classification based speech separation using temporal dynamics," in 13th Annual Conf. of the Int. Speech Commun. Association, 2012.

[72] ——, “Cocktail party processing via structured prediction,” in Advances in Neural Inf. Process. Syst., 2012, pp. 224–232.

[73] D. Wang and G. J. Brown, Computational auditory scene analysis: Principles, al- gorithms, and applications. Wiley-IEEE press, 2006.

[74] E. W. Healy et al., “An algorithm to improve speech recognition in noise for hearing-impaired listeners,” J. of the Acoust. Soc. of America, vol. 134, no. 4, pp. 3029–3038, 2013.

[75] X. Lu et al., “Speech enhancement based on deep denoising autoencoder.,” in In- terspeech, vol. 2013, 2013, pp. 436–440.

[76] S.-W. Fu et al., “Snr-aware convolutional neural network modeling for speech en- hancement.,” in Interspeech, 2016, pp. 3768–3772.

[77] S. R. Park and J. Lee, “A fully convolutional neural network for speech enhance- ment,” arXiv preprint arXiv:1609.07132, 2016.

[78] F. Weninger et al., “Speech enhancement with lstm recurrent neural networks and its application to noise-robust asr,” in Int. Conf. on Latent Variable Analysis and Signal Separation, Springer, 2015, pp. 91–99.

[79] M. Tu and X. Zhang, “Speech enhancement based on deep neural networks with skip connections,” in ICASSP, IEEE, 2017, pp. 5565–5569.

[80] H. Erdogan, J. R. Hershey, S. Watanabe, and J. Le Roux, “Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks,” in ICASSP, IEEE, 2015, pp. 708–712.

[81] D. S. Williamson et al., “Complex ratio masking for monaural speech separation,” TASLP, vol. 24, no. 3, pp. 483–492, 2015.

[82] Y. Luo and N. Mesgarani, “Tasnet: Time-domain audio separation network for real- time, single-channel speech separation,” in ICASSP, IEEE, 2018, pp. 696–700.

[83] S. Pascual et al., “Segan: Speech enhancement generative adversarial network,” arXiv preprint arXiv:1703.09452, 2017.

[84] C. Donahue et al., “Exploring speech enhancement with generative adversarial net- works for robust speech recognition,” in ICASSP, IEEE, 2018, pp. 5024–5028.

[85] S. Wang et al., "A transfer learning and progressive stacking approach to reducing deep model sizes with an application to speech enhancement," in ICASSP, IEEE, 2017, pp. 5575–5579.

[86] X. Shu et al., “A progressive enhancement method for noisy and reverberant speech,” in 23rd Int. Conf. on Digital Signal Process., IEEE, 2018, pp. 1–5.

[87] J. Llombart et al., “Progressive speech enhancement with residual connections,” arXiv preprint arXiv:1904.04511, 2019.

[88] A. Li et al., “Speech enhancement using progressive learning-based convolutional recurrent neural network,” Applied Acoustics, vol. 166, p. 107 347, 2020.

[89] Y. Bengio et al., “Curriculum learning,” in 26th ICML, ACM, 2009, pp. 41–48.

[90] A. H. Moore et al., “Speech enhancement for robust automatic speech recognition: Evaluation using a baseline system and instrumental measures,” Computer Speech & Language, vol. 46, pp. 574–584, 2017.

[91] X. Zhao et al., “Robust speaker identification in noisy and reverberant conditions,” IEEE TALSP, vol. 22, no. 4, pp. 836–845, 2014.

[92] A. Varga and H. Steeneken, “Assessment for automatic speech recognition: Ii. noisex-92,” Speech Commun., vol. 12, no. 3, pp. 247–251, 1993.

[93] R. Chen et al., "Speech enhancement in car noise environment based on an analysis-synthesis approach using harmonic noise model," in ICASSP, 2009, pp. 4413–4416.

[94] K. Manohar and P. Rao, “Speech enhancement in nonstationary noise environments using noise properties,” Speech Commun., vol. 48, no. 1, pp. 96–109, 2006.

[95] D. Bateman et al., “Spectral contrast normalization and other techniques for speech recognition in noise,” in ICASSP, IEEE, vol. 1, 1992, pp. 241–244.

[96] J. Meyer et al., “Speech recognition in natural background noise,” PloS one, vol. 8, no. 11, e79279, 2013.

[97] L. Varnet et al., “Phoneme resistance during speech-in-speech comprehension,” in 13th Annual Conf. of the Int. Speech Commun. Association, 2012.

[98] B. T. Meyer et al., “Human phoneme recognition depending on speech-intrinsic variability,” J. of the Acoust. Soc. of America, vol. 128, no. 5, pp. 3126–3141, 2010.

[99] S. A. Phatak and J. B. Allen, “Consonant and vowel confusions in speech-weighted noise,” J. of the Acoust. Soc. of America, vol. 121, no. 4, pp. 2312–2326, 2007.

[100] F. Saki and N. Kehtarnavaz, “Automatic switching between noise classification and speech enhancement for hearing aid devices,” in 38th Annual Int. Conf. of the Eng. in Med. and Biol. Soc., IEEE, 2016, pp. 736–739.

[101] W. Shi et al., “Deep neural network and noise classification-based speech enhancement,” Modern Physics Letters B, vol. 31, no. 19-21, p. 1740096, 2017.

[102] M. Sadeghi et al., “The effect of different acoustic noise on speech signal formant frequency location,” Int. J. of Speech Technol., vol. 21, no. 3, pp. 741–752, 2018.

[103] A. Varga et al., “The NOISEX-92 study on the effect of additive noise on automatic speech recognition system,” Reports of NATO Research Study Group, 1992.

[104] B. S. Atal, “Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification,” J. of the Acoust. Soc. of America, vol. 55, no. 6, pp. 1304–1312, 1974.

[105] H. Hermansky et al., “Compensation for the effect of the communication channel in auditory-like analysis of speech (RASTA-PLP),” in 2nd European Conf. on Speech Commun. and Technol., 1991.

[106] H.-G. Hirsch et al., “Improved speech recognition using high-pass filtering of subband envelopes,” in 2nd European Conf. on Speech Commun. and Technol., 1991.

[107] H. Hermansky and N. Morgan, “RASTA processing of speech,” TASLP, vol. 2, no. 4, pp. 578–589, 1994.

[108] D. F. Rosenthal and H. G. Okuno, Computational auditory scene analysis. Lawrence Erlbaum Associates Publishers, 1998.

[109] B. T. Szabó et al., “Computational models of auditory scene analysis: A review,” Frontiers in Neuroscience, vol. 10, p. 524, 2016.

[110] D. Kersten et al., “Object perception as Bayesian inference,” Annual Reviews Psychol., vol. 55, pp. 271–304, 2004.

[111] A. Gutschalk et al., “Neuromagnetic correlates of streaming in human auditory cortex,” J. of Neuroscience, vol. 25, no. 22, pp. 5382–5388, 2005.

[112] J. A. O’Sullivan et al., “Evidence for neural computations of temporal coherence in an auditory scene and their enhancement during active listening,” J. of Neuroscience, vol. 35, no. 18, pp. 7256–7263, 2015.

[113] D. Klatt, “A digital filter bank for spectral matching,” in ICASSP, vol. 1, IEEE, 1976, pp. 573–576.

[114] J. Holmes and N. Sedgwick, “Noise compensation for speech recognition using probabilistic models,” in ICASSP, IEEE, vol. 11, 1986, pp. 741–744.

[115] B. Atal and M. Schroeder, “Predictive coding of speech signals and subjective error criteria,” TASLP, vol. 27, no. 3, pp. 247–254, 1979.

[116] M. Hasegawa-Johnson and A. Alwan, “Speech coding: Fundamentals and applications,” Wiley Encyclopedia of Telecommun., 2003.

[117] P. C. Loizou, “Speech enhancement based on perceptually motivated Bayesian estimators of the magnitude spectrum,” TASLP, vol. 13, no. 5, pp. 857–869, 2005.

[118] M. Schroeder et al., “Optimizing digital speech coders by exploiting masking properties of the human ear,” J. of the Acoust. Soc. of America, vol. 66, no. 6, pp. 1647–1652, 1979.

[119] A. Varga and K. Ponting, “Control experiments on noise compensation in hidden Markov model based continuous word recognition,” in 1st European Conf. on Speech Commun. and Technol., 1989.

[120] S. H. Mohammadi and A. Kain, “An overview of voice conversion systems,” Speech Commun., vol. 88, pp. 65–82, 2017.

[121] C.-P. Chen and J. A. Bilmes, “MVA processing of speech features,” TASLP, vol. 15, no. 1, pp. 257–270, 2006.

[122] X. Xiao et al., “Temporal structure normalization of speech feature for robust speech recognition,” IEEE Signal Process. Letters, vol. 14, no. 7, pp. 500–503, 2007.

[123] O. Viikki and K. Laurila, “Cepstral domain segmental feature vector normalization for noise robust speech recognition,” Speech Commun., vol. 25, no. 1-3, pp. 133– 147, 1998.

[124] K. Paliwal, “A study of LSF representation for speaker-dependent and speaker-independent HMM-based speech recognition systems,” in ICASSP, IEEE, 1990, pp. 801–804.

[125] H. Hermansky, “Perceptual linear predictive (PLP) analysis of speech,” J. of the Acoust. Soc. of America, vol. 87, no. 4, pp. 1738–1752, 1990.

[126] C.-P. Chen et al., “Frontend post-processing and backend model enhancement on the Aurora 2.0/3.0 databases,” in 7th Int. Conf. on Spoken Language Process., 2002.

[127] A. De La Torre et al., “Histogram equalization of speech representation for robust speech recognition,” TASLP, vol. 13, no. 3, pp. 355–366, 2005.

[128] A. E. Rosenberg et al., “Cepstral channel normalization techniques for HMM-based speaker verification,” in 3rd Int. Conf. on Spoken Language Process., 1994.

[129] A. Acero and R. M. Stern, “Environmental robustness in automatic speech recognition,” in ICASSP, IEEE, 1990, pp. 849–852.

[130] A. Acero and R. M. Stern, “Robust speech recognition by normalization of the acoustic space,” in ICASSP, 1991.

[131] J.-W. Hung, “Cepstral statistics compensation using online pseudo stereo codebooks for robust speech recognition in additive noise environments,” in ICASSP, IEEE, vol. 1, 2006, pp. I–I.

[132] O. Viikki et al., “A recursive feature vector normalization approach for robust speech recognition in noise,” in ICASSP, IEEE, vol. 2, 1998, pp. 733–736.

[133] C.-C. Hsu et al., “Dictionary update for NMF-based voice conversion using an encoder-decoder network,” in ISCSLP, IEEE, 2016, pp. 1–5.

[134] Ö. Salor and M. Demirekler, “Dynamic programming approach to voice transformation,” Speech Commun., vol. 48, no. 10, pp. 1262–1272, 2006.

[135] M. Narendranath et al., “Transformation of formants for voice conversion using artificial neural networks,” Speech Commun., vol. 16, no. 2, pp. 207–216, 1995.

[136] S. Desai et al., “Spectral mapping using artificial neural networks for voice conversion,” TASLP, vol. 18, no. 5, pp. 954–964, 2010.

[137] E. Azarov et al., “Real-time voice conversion using artificial neural networks with rectified linear units,” in Interspeech, 2013, pp. 1032–1036.

[138] L.-J. Liu et al., “Using bidirectional associative memories for joint spectral envelope modeling in voice conversion,” in ICASSP, IEEE, 2014, pp. 7884–7888.

[139] D. J. Rezende et al., “Stochastic backpropagation and approximate inference in deep generative models,” arXiv:1401.4082, 2014.

[140] D. Kingma and M. Welling, “Auto-encoding variational Bayes,” in 2nd ICLR, 2013.

[141] I. Goodfellow et al., “Generative adversarial nets,” in Advances in Neural Inf. Process. Syst., 2014, pp. 2672–2680.

[142] C.-C. Hsu et al., “Voice conversion from non-parallel corpora using variational auto-encoder,” in APSIPA, IEEE, 2016, pp. 1–6.

[143] W.-N. Hsu et al., “Learning latent representations for speech generation and transformation,” arXiv:1704.04222, 2017.

[144] J.-Y. Zhu et al., “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in ICCV, IEEE, 2017, pp. 2223–2232.

[145] T. Kaneko and H. Kameoka, “Parallel-data-free voice conversion using cycle-consistent adversarial networks,” arXiv preprint arXiv:1711.11293, 2017.

[146] T. Kaneko et al., “CycleGAN-VC2: Improved CycleGAN-based non-parallel voice conversion,” in ICASSP, IEEE, 2019, pp. 6820–6824.

[147] H. Kameoka et al., “StarGAN-VC: Non-parallel many-to-many voice conversion using star generative adversarial networks,” in Spoken Language Technol. Workshop, IEEE, 2018, pp. 266–273.

[148] D. Byrne et al., “An international comparison of long-term average speech spectra,” J. of the Acoust. Soc. of America, vol. 96, no. 4, pp. 2108–2120, 1994.

[149] H. E. Cullington and F.-G. Zeng, “Speech recognition with varying numbers and types of competing talkers by normal-hearing, cochlear-implant, and implant simulation subjects,” J. of the Acoust. Soc. of America, vol. 123, no. 1, pp. 450–461, 2008.

[150] J. R. Dubno et al., “Recognition of filtered words in noise at higher-than-normal levels: Decreases in scores with and without increases in masking,” J. of the Acoust. Soc. of America, vol. 118, no. 2, pp. 923–933, 2005.

[151] J. P. Openshaw and J. Mason, “On the limitations of cepstral features in noise,” in ICASSP, IEEE, vol. 2, 1994, pp. II–49.

[152] A. P. Varga and R. K. Moore, “Hidden Markov model decomposition of speech and noise,” in ICASSP, IEEE, 1990, pp. 845–848.

[153] G. Parikh and P. C. Loizou, “The influence of noise on vowel and consonant cues,” J. of the Acoust. Soc. of America, vol. 118, no. 6, pp. 3874–3888, 2005.

[154] G. Hu and D. Wang, “A tandem algorithm for pitch estimation and voiced speech segregation,” TASLP, vol. 18, no. 8, pp. 2067–2079, 2010.

[155] J. Li et al., “An overview of noise-robust automatic speech recognition,” TASLP, vol. 22, no. 4, pp. 745–777, 2014.

[156] P. Welch, “The use of fast Fourier transform for the estimation of power spectra: A method based on time averaging over short, modified periodograms,” Trans. on Audio and Electroacoustics, vol. 15, no. 2, pp. 70–73, 1967.

[157] L. v. d. Maaten and G. Hinton, “Visualizing data using t-SNE,” J. of Mach. Learning Research, vol. 9, no. Nov, pp. 2579–2605, 2008.

[158] J. Garofolo et al., “CSR-I (WSJ0) LDC93S6B,” Linguistic Data Consortium, 1993.

[159] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.

[160] Y. A. LeCun et al., “Efficient backprop,” in Neural Networks: Tricks of the Trade, Springer, 2012, pp. 9–48.

[161] C. Barras and J.-L. Gauvain, “Feature and score normalization for speaker verification of cellular data,” in ICASSP, IEEE, vol. 2, 2003, pp. II–49.

[162] L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition. Prentice-Hall, 1993.

[163] E. Vincent et al., “An analysis of environment, microphone and data simulation mismatches in robust speech recognition,” Computer Speech & Language, vol. 46, pp. 535–557, 2017.

[164] B. Chen and P. C. Loizou, “A Laplacian-based MMSE estimator for speech enhancement,” Speech Commun., vol. 49, no. 2, pp. 134–143, 2007.

[165] R. Eldan and O. Shamir, “The power of depth for feedforward neural networks,” in Conf. on Learning Theory, 2016, pp. 907–940.

[166] M. Telgarsky, “Benefits of depth in neural networks,” arXiv preprint arXiv:1602.04485, 2016.

[167] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.

[168] Y. Ganin and V. Lempitsky, “Unsupervised domain adaptation by backpropagation,” in 32nd ICML, vol. 37, 2015.

[169] A. Van Den Oord et al., “Neural discrete representation learning,” in Advances in Neural Inf. Process. Syst., 2017, pp. 6306–6315.

[170] P. Domingos, “A few useful things to know about machine learning,” Commun. ACM, vol. 55, no. 10, pp. 78–87, 2012.

[171] C. C. Aggarwal et al., “On the surprising behavior of distance metrics in high dimensional space,” in Int. Conf. on Database Theory, Springer, 2001, pp. 420–434.

[172] J. F. Gemmeke et al., “Exemplar-based sparse representations for noise robust automatic speech recognition,” TASLP, vol. 19, no. 7, pp. 2067–2080, 2011.

[173] Optimizing Siri on HomePod in far-field settings, https://machinelearning.apple.com/research/optimizing-siri-on-homepod-in-far-field-settings, Accessed: 2020-09-02.

[174] K. Tan and D. Wang, “Learning complex spectral mapping with gated convolutional recurrent networks for monaural speech enhancement,” TASLP, vol. 28, pp. 380–390, 2019.

[175] Y. Luo and N. Mesgarani, “Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation,” TASLP, vol. 27, no. 8, pp. 1256–1266, 2019.

[176] N. Tawara et al., “Multi-channel speech enhancement using time-domain convolutional denoising autoencoder,” in Interspeech, 2019, pp. 86–90.

[177] Institute for Telecommunication Sciences.

[178] W. H. Press, “Flicker noises in astronomy and elsewhere,” Comments on Astrophysics, vol. 7, pp. 103–119, 1978.

[179] J. A. Hildebrand, “Anthropogenic and natural sources of ambient noise in the ocean,” Marine Ecology Progress Series, vol. 395, pp. 5–20, 2009.

[180] R. Lewis, “Noise data,” Signal Process. Inf. Base, 2013.

VITA

Sicheng Wang was born on June 28, 1989, and grew up in Mianyang, a city in Southwest China. He completed his high school education at Hwa Chong Institution in Singapore and earned a Bachelor of Science degree from Lafayette College in Easton, Pennsylvania. Upon graduation, he joined the Georgia Institute of Technology to pursue master’s and Ph.D. degrees in electrical and computer engineering.
