This document is downloaded from DR‑NTU (https://dr.ntu.edu.sg) Nanyang Technological University, Singapore.

Multi‑task learning for end‑to‑end noise‑robust bandwidth extension

Hou, Nana; Xu, Chenglin; Zhou, Joey Tianyi; Chng, Eng Siong; Li, Haizhou

2020

Hou, N., Xu, C., Zhou, J. T., Chng, E. S., & Li, H. (2020). Multi‑task learning for end‑to‑end noise‑robust bandwidth extension. Interspeech 2020, 4069‑4073. https://hdl.handle.net/10356/144855

© 2020 International Speech Communication Association (ISCA). All rights reserved. This paper was published in Interspeech 2020 and is made available with permission of International Speech Communication Association (ISCA).

INTERSPEECH 2020
October 25–29, 2020, Shanghai, China

Multi-task Learning for End-to-end Noise-robust Bandwidth Extension

Nana Hou¹, Chenglin Xu¹,⁴, Joey Tianyi Zhou³, Eng Siong Chng¹,², Haizhou Li⁴,⁵

¹ School of Computer Science and Engineering, Nanyang Technological University, Singapore
² Temasek Laboratories, Nanyang Technological University, Singapore
³ Institute of High Performance Computing (IHPC), A*STAR, Singapore
⁴ Department of Electrical and Computer Engineering, National University of Singapore, Singapore
⁵ Machine Listening Lab, University of Bremen, Germany

Abstract

Bandwidth extension aims to reconstruct wideband speech signals from narrowband inputs to improve perceptual quality. Most prior studies perform bandwidth extension under the assumption that the narrowband signals are clean. This greatly limits the use of such techniques in practice, where signals are often corrupted by noise. To address this problem, we propose an end-to-end time-domain framework for noise-robust bandwidth extension that jointly optimizes a mask-based speech enhancement module and an ideal bandwidth extension module with multi-task learning. Because the proposed framework does not decompose the signals into magnitude and phase spectra, it requires no phase estimation. Experimental results show that the proposed method achieves 14.3% and 15.8% relative improvements over the best baseline in terms of perceptual evaluation of speech quality (PESQ) and log-spectral distortion (LSD), respectively. Furthermore, our method is three times more compact than the best baseline in terms of the number of parameters.

Index Terms: noise-robust bandwidth extension, multi-task learning, time-domain masking, temporal convolutional network

[Figure 1 diagram: noisy narrowband signals → Step 1 → clean narrowband signals → Step 2 → clean wideband signals; spans labeled "noise-robust bandwidth extension" and "ideal bandwidth extension".]
Figure 1: The workflow of noise-robust bandwidth extension. In Step 1, the noisy narrowband signal is enhanced to remove noise. In Step 2, the enhanced narrowband signal is bandwidth-extended to generate the clean wideband signal.

1. Introduction

Speech signals with broader bandwidth provide higher perceptual quality and intelligibility. Bandwidth extension aims to recover the high-frequency information from narrowband signals

With the advent of deep learning, recent studies [17] suggest a unified approach that combines speech enhancement and bandwidth extension (UEE) in a jointly trained neural network. As shown in Figure 2(a), the UEE approach first applies a bi-directional long short-term memory (BLSTM) layer as the speech enhancement module to map the noisy narrowband input to enhanced narrowband features. Then, another BLSTM layer is applied as the ideal bandwidth extension module [18] to recover the missing high-frequency information from the enhanced narrowband features. The speech enhancement and bandwidth extension modules are first pre-trained separately and then fine-tuned with a single mean square error (MSE) loss between the clean wideband ground truth and the enhanced-plus-extended output. Overall, the UEE ap-