Speech Enhancement Using Speech Synthesis Techniques

City University of New York (CUNY) CUNY Academic Works Dissertations, Theses, and Capstone Projects CUNY Graduate Center 2-2021 Speech Enhancement Using Speech Synthesis Techniques Soumi Maiti The Graduate Center, City University of New York How does access to this work benefit ou?y Let us know! More information about this work at: https://academicworks.cuny.edu/gc_etds/4202 Discover additional works at: https://academicworks.cuny.edu This work is made publicly available by the City University of New York (CUNY). Contact: [email protected] Speech Enhancement Using Speech Synthesis Techniques by Soumi Maiti A dissertation submitted to the Graduate Faculty in Computer Science in partial fulfillment of the requirements for the degree of Doctor of Philosophy, The City University of New York 2021 ii © 2021 SOUMI MAITI All Rights Reserved iii Speech Enhancement Using Speech Synthesis Techniques by Soumi Maiti This manuscript has been read and accepted for the Graduate Faculty in Computer Science in satisfaction of the dissertation requirement for the degree of Doctor of Philosophy. Date Michael I. Mandel Chair of Examining Committee Date Ping Ji Executive Officer Supervisory Committee Rivka Levitan Kyle Gorman Ron Weiss THE CITY UNIVERSITY OF NEW YORK iv Abstract Speech Enhancement Using Speech Synthesis Techniques by Soumi Maiti Advisor: Michael I. Mandel Traditional speech enhancement systems reduce noise by modifying the noisy signal to make it more like a clean signal, which suffers from two problems: under-suppression of noise and over-suppression of speech. These problems create distortions in enhanced speech and hurt the quality of the enhanced signal. We propose to utilize speech synthesis techniques for a higher quality speech enhancement system. Synthesizing clean speech based on the noisy signal could produce outputs that are both noise-free and high quality. We first show that we can replace the noisy speech with its clean resynthesis from a previously recorded clean speech dictionary from the same speaker (concatenative resynthesis). Next, we show that using a speech synthesizer (vocoder) we can create a “clean” resynthesis of the noisy speech for more than one speaker. We term this parametric resynthesis (PR). PR can generate better prosody from noisy speech than a TTS system which uses textual information only. Additionally, we can use the high quality speech generation capability of neural vocoders for better quality speech enhancement. When trained on data from enough speakers, these vocoders can generate speech from unseen speakers, both male, and female, with similar quality as seen speakers in training. Finally, we show that using neural vocoders we can achieve better objective signal and overall quality than the state-of-the-art speech enhancement systems and better subjective quality than an oracle mask-based system. v To my parents, Suvabrata and Sumita Maiti ACKNOWLEDGEMENTS I am thankful for my advisor Prof. Michael I. Mandel. I am truely grateful for your constant guid- ance, support and mentorship. I am especially indebted to you for your kindness and encouragement over the years. I would like to express my sincere gratitude to Prof. Rivka Levitan, for her unwavering support. I am grateful to the members of my thesis committee, Prof. Kyle Gorman and Ron Weiss for their valuable advice and support. I would also like to thank David Guy Brizan, for introducing me to my advisor and his encouragement. I extremely grateful to Svetalana Stoyanchev for her constant support and encouragement. I am thankful for Prof. Shweta Jain, for her unwavering support and motivation throughout the years. A special thank you goes to Prof. Robert Haralick for his great understanding, help and support when I needed it the most. I would also like to thanks Dilvania Rodriguez her warmth and kindness. I would also like to thank my friends and lab-mates Viet Anh Trinh, Zhaoheng Ni, Priya Chakrabarty, Andreas Weise, Priyanka Samanta. I would not be here without the support of my friends and family. To my parents, Suvabrata and Sumita Maiti and my brother Soumyadipta Maiti, thank you for your love and continued support. Lastly I would like to thank my husband, Sourjya Dutta, for being a partner in this adventure and for all your support. vi Contents 1 Introduction1 1.1 Motivation.......................................1 1.2 Summary of Research.................................3 1.2.1 Concatenative Resynthesis..........................4 1.2.2 Parametric Resynthesis............................5 1.3 Background and Related Work............................7 I Concatenative Resynthesis 11 2 Concatenative Resynthesis 12 2.1 Introduction...................................... 12 2.2 System Overview................................... 14 2.2.1 Paired-Input Network............................ 14 2.2.2 Ranking Paired-Input Network....................... 17 2.2.3 Twin Networks................................ 18 2.3 Experiments...................................... 18 2.4 Evaluation....................................... 20 vii CONTENTS viii 2.4.1 Ranking of Dictionary Elements....................... 20 2.4.2 Listening Tests................................ 23 2.5 Conclusion...................................... 24 3 Large Vocabulary Concat Resyn 25 3.1 Introduction...................................... 25 3.2 System Overview................................... 26 3.2.1 Similarity Network.............................. 27 3.2.2 Efficient Decoding.............................. 28 3.2.3 Approximating Transition Affinities..................... 28 3.3 Small Vocabulary Experiments............................ 30 3.3.1 Efficient Decoding using Approximate Nearest Neighbors......... 31 3.3.2 Approximating the Transition Matrix.................... 32 3.3.3 Denoising................................... 33 3.4 Large Vocabulary Experiments............................ 33 3.4.1 Ranking Test................................. 35 3.4.2 Denoising using Increasing Dictionary Size................. 36 3.4.3 Listening Test................................. 37 3.5 Conclusions...................................... 38 II Parametric Resynthesis 41 4 Background: Vocoders 42 4.1 Non-neural Vocoders................................. 43 CONTENTS ix 4.1.1 The Channel Vocoder............................. 43 4.1.2 Phase Vocoder................................ 45 4.1.3 STRAIGHT.................................. 47 4.1.4 WORLD................................... 50 4.2 Neural Vocoders.................................... 53 4.2.1 WaveNet................................... 54 4.2.2 SampleRNN................................. 59 4.2.3 Parallel WaveNet............................... 61 4.2.4 WaveRNN.................................. 63 4.2.5 WaveGlow.................................. 65 5 Parametric Resynthesis 71 5.1 Introduction...................................... 71 5.2 System Overview................................... 72 5.3 Experiments...................................... 74 5.3.1 Dataset.................................... 74 5.3.2 Evaluation.................................. 75 5.3.3 TTS Objective Measures........................... 75 5.3.4 Speech Enhancement Objective Measures.................. 77 5.3.5 Subjective Intelligibility and Quality..................... 78 5.4 Conclusion...................................... 81 6 Parametric Resynthesis with Neural Vocoders 82 6.1 Introduction...................................... 82 CONTENTS x 6.2 System Overview................................... 83 6.2.1 Prediction Model............................... 84 6.2.2 Neural Vocoders............................... 84 6.2.3 Joint Training................................. 87 6.3 Experiments...................................... 87 6.4 Results......................................... 90 6.5 Discussion of Joint Training............................. 93 6.6 Conclusion...................................... 94 7 Speaker Independence 95 7.1 Introduction...................................... 95 7.2 System Overview................................... 96 7.3 Experiments...................................... 98 7.3.1 Noisy Dataset................................. 98 7.3.2 Speaker Independence of Neural Vocoders................. 99 7.3.3 Speaker Independence of Parametric Resynthesis.............. 101 7.3.4 Tolerance to error............................... 105 7.4 Conclusion...................................... 106 8 Analysis and Ablation Studies 108 8.1 Prediction Model: L1 vs. L2 loss........................... 109 8.2 Training Vocoder................................... 109 8.3 Training with Smaller Hop Size........................... 110 8.4 Training Vocoder from Noisy Features........................ 111 CONTENTS xi 8.4.1 Comparing Acoustic Features........................ 112 8.5 Conclusion...................................... 113 9 Conclusions and Future Work 116 9.1 Summary....................................... 116 9.2 Future Work...................................... 117 List of Figures 2.1 Overview of Concatenative Resynthesis....................... 15 2.2 Concatenative Resynthesis: Similarity functions................... 16 2.3 Concatenative Resynthesis Small Vocabulary: Intelligibility Test.......... 21 2.4 Concatenative Resynthesis Small Vocabulary: Quality Test............. 22 3.1 Twin Networks.................................... 27 3.2 Concat Resyn: Recall vs. Time with ANN...................... 30 3.3 Concatenative Resynthesis: Large Vocabulary Intelligibility test.......... 39 3.4 Concatenative

Speech Enhancement Using Speech Synthesis Techniques

The Phase Vocoder: a Tutorial Author(S): Mark Dolson Source: Computer Music Journal, Vol

TAL-Vocoder-II

The Futurism of Hip Hop: Space, Electro and Science Fiction in Rap

Overview of Voice Over IP

Software Sequencers and Cyborg Singers

A Vocoder (Short for Voice Encoder) Is a Synthesis System, Which Was

Mediated Music Makers. Constructing Author Images in Popular Music

Robotic Voice Effects

Microkorg Owner's Manual

History of Electronic Sound Modification'

Advanced Speech Compression VIA Voice Excited Linear Predictive Coding Using Discrete Cosine Transform (DCT)

Revoicer Manual