
IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. XX, NO. YY, SEPTEMBER 2021

arXiv:2103.14776v1 [eess.AS] 27 Mar 2021

Scalable and Efficient Neural Speech Coding

Kai Zhen, Student Member, IEEE, Jongmo Sung, Mi Suk Lee, Seungkwon Beack, Minje Kim, Senior Member, IEEE

Abstract—This work presents a scalable and efficient neural waveform codec (NWC) for speech compression. We formulate the speech coding problem as an autoencoding task, where a convolutional neural network (CNN) performs encoding and decoding as its feedforward routine. The proposed CNN autoencoder also defines quantization and entropy coding as a trainable module, so the coding artifacts and bitrate control are handled during the optimization process. We achieve efficiency by introducing compact model architectures to our fully convolutional network model, such as gated residual networks and depthwise separable convolution. Furthermore, the proposed models adopt a scalable architecture, cross-module residual learning (CMRL), to cover a wide range of bitrates. To this end, we employ the residual coding concept to concatenate multiple NWC autoencoding modules, where each NWC module performs residual coding to restore the reconstruction loss that its preceding modules have created. CMRL can also scale down to cover lower bitrates, for which it employs a linear predictive coding (LPC) module as its first autoencoder. Once again, instead of a mere concatenation of LPC and NWC, we redefine LPC's quantization as a trainable module to enhance the bit allocation tradeoff between LPC and its following NWC modules. Compared to other autoregressive decoder-based neural speech coders, our decoder has a significantly smaller architecture, e.g., with only 0.12 million parameters, more than 100 times smaller than a WaveNet decoder. Compared to the LPCNet-based speech codec, which leverages the speech production model to reduce the network complexity at low bitrates, ours can scale up to higher bitrates to achieve transparent performance. Our lightweight neural speech coding model achieves subjective scores comparable to AMR-WB at the low bitrate range and provides transparent coding quality at 32 kbps.

Index Terms—Neural speech coding, waveform coding, representation learning, model complexity

I. INTRODUCTION

Speech coding can be implemented as an encoder-decoder system, whose goal is to compress input speech signals into a compact bitstream (encoder) and then to reconstruct the original speech from the code with the least possible quality degradation (decoder). Speech coding facilitates telecommunication and saves data storage among many other applications. There is a typical trade-off a speech codec must handle: the more the system reduces the amount of bits per second (bitrate), the worse the perceptual similarity between the original and recovered signals is likely to be. In addition, speech coding systems are often required to maintain an affordable computational complexity when the hardware resource is at a premium.

For decades, speech coding has been intensively studied, yielding various standardized codecs that can be categorized into two types: vocoders and waveform codecs. A vocoder, also referred to as parametric speech coding, distills a set of physiologically salient features, such as the spectral envelope (equivalent to the vocal tract response, including the contribution from mouth shape, tongue position, and nasal cavity), fundamental frequencies, and gain (voicing level), from which the decoder synthesizes the speech. Typically, a vocoder is computationally efficient, but it usually operates in the narrowband mode due to its limited performance [1][2]. A waveform codec aims to perfectly reconstruct the speech signal, which features up-to-transparent quality at a higher bitrate range. The latter can be generalized to non-speech audio signal compression as it is not restricted to speech production priors, although waveform coding and parametric coding can be coupled for a hybrid design [3][4][5].

Under the notion of unsupervised speech representation learning, deep neural network (DNN)-based codecs have revitalized the speech coding problem and provided different perspectives. The major motivation for employing neural networks in speech coding is twofold: to fill the performance gap between vocoders and waveform codecs towards a near-transparent speech synthesis quality, and to use the trainable encoder to learn latent representations which may benefit other DNN-implemented downstream applications, such as speech enhancement [6][7], speaker identification [8], and automatic speech recognition [9][10]. Hence, a neural codec can serve as a trainable acoustic unit integrated in future digital signal processing engines [11].

Recently proposed neural speech codecs have achieved high coding gain and reasonable quality by employing deep autoregressive models.
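The cross-module residual learning (CMRL) cascade described in the abstract can be illustrated with a toy sketch: each stage codes only the residual that its predecessors failed to restore, and the decoder sums the per-stage reconstructions. The uniform scalar quantizers below are hypothetical stand-ins for the actual NWC autoencoder modules, so this is a minimal numerical illustration of the cascading idea, not the paper's implementation.

```python
def cmrl_encode(signal, modules):
    """Each module codes the residual that its predecessors failed to restore."""
    residual = list(signal)
    codes = []
    for encode, decode in modules:
        code = encode(residual)
        codes.append(code)
        recon = decode(code)
        # leftover error becomes the input of the next module in the cascade
        residual = [r - y for r, y in zip(residual, recon)]
    return codes

def cmrl_decode(codes, modules):
    """Final output = sum of all per-module reconstructions."""
    out = [0.0] * len(codes[0])
    for (_, decode), code in zip(modules, codes):
        out = [o + y for o, y in zip(out, decode(code))]
    return out

def make_quantizer(step):
    """Stand-in 'codec' for one stage: a uniform scalar quantizer."""
    encode = lambda xs: [round(x / step) for x in xs]  # integer codes
    decode = lambda cs: [c * step for c in cs]         # dequantize
    return encode, decode

modules = [make_quantizer(1.0), make_quantizer(0.1)]   # coarse stage, then residual stage
x = [0.33, 1.72]
x_hat = cmrl_decode(cmrl_encode(x, modules), modules)  # -> approximately [0.3, 1.7]
```

In this toy example, adding the fine residual stage shrinks the worst-case reconstruction error from 0.33 (coarse stage alone) to about 0.03, mirroring how each appended NWC module refines the output of the cascade.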
The superior speech synthesis performance achieved by WaveNet-based models [12] has successfully transferred to neural speech coding systems, such as in [13], where WaveNet serves as a decoder synthesizing wideband speech samples from a conventional non-trainable encoder at 2.4 kbps. Although its reconstruction quality is comparable to that of waveform codecs at higher bitrates, the computational cost is significant due to the model size of over 20 million parameters.

Meanwhile, VQ-VAE [14] integrates a trainable vector quantization scheme into the variational autoencoder (VAE) [15] for discrete speech representation learning. While the bitrate can be lowered by reducing the sampling rate 64 times, the downside for VQ-VAE is that the prosody can be significantly altered. Although [16] provides a scheme to pass the pitch and timing information to the decoder as a remedy, it does not generalize to non-speech signals. More importantly, VQ-VAE as a vocoder does not address the complexity issue, since it uses WaveNet as the decoder. Although these neural speech synthesis systems noticeably improve the speech quality at low bitrates, they are not feasible for real-time speech coding on hardware with limited memory and bandwidth.

LPCNet [17] focuses on efficient neural speech coding via a WaveRNN [18] decoder by leveraging traditional linear predictive coding (LPC) techniques. The input of LPCNet is formed by 20 parameters (18 Bark-scaled cepstral coefficients and 2 additional parameters for the pitch information) for every 10-millisecond frame. All these parameters are extracted from the non-trainable encoder and vector-quantized with a fixed codebook. As discussed previously, since LPCNet functions as a vocoder, the decoded speech quality is not considered transparent [19].

In this paper, we propose a novel neural waveform codec (NWC) with the following features:

• Compactness: Having achieved the superior speech quality, the model has a much lower complexity than WaveNet [12] and VQ-VAE [14] based codecs. Our decoder contains only 0.12 million parameters, which is 100 times more compact than a WaveNet counterpart. The execution time to encode and decode a signal is only 42.44% of its duration on a single-core CPU, which facilitates real-time communications.

• Trainability: Our method has a trainable encoder, as in VQ-VAE, which can be integrated into other DNNs for acoustic signal processing. Besides, it is not constrained to speech, and can be generalized to audio coding with minimal effort as shown in [22].

TABLE I highlights the comparison.

TABLE I: Categorical summary of recently proposed neural speech coding systems. "Y" means the system features the characteristic, "N" means it does not, and "-" means not reported.

                             WaveNet [13]   VQ-VAE [16]   LPCNet [17]   Proposed
  Transparent coding              Y              -             N            Y
  Less than 1M parameters         N              N             Y            Y
  Real-time communications        N              N             Y            Y
  Encoder trainable               N              Y             N            Y

In Sec. IV, both objective comparisons and subjective listening tests are conducted for model evaluation. With a trainable quantizer for the LPC coefficients, the neural codec compresses the residual signal with noticeable performance gain: it outperforms Opus and is on a par with AMR-WB at lower bitrates; it is slightly superior to AMR-WB and Opus when operating at 20 kbps; and at 32 kbps it is capable of scaling up to near transparency when residual coding among neural codecs is enabled in CMRL. Additionally, we investigate the effect of various blending ratios of loss terms and bit allocation schemes on the experimental results via ablation analysis. The execution time and delay analysis is given under 4 hardware specifications, too. We conclude in Sec. V.

This work was supported by the Institute for Information and Communications Technology Promotion (IITP) funded by the Korea government (MSIT) under Grant 2017-0-00072 (Development of Audio/Video Coding and Light Field Media Fundamental Technologies for Ultra Realistic Tera-Media). Kai Zhen is with the Department of Computer Science and Cognitive Science Program at Indiana University, Bloomington, IN 47408 USA. Jongmo Sung, Mi Suk Lee, and Seungkwon Beack are with Electronics and Telecommunications Research Institute, Daejeon, Korea 34129. Minje Kim is with the Dept. of Intelligent Systems Engineering at Indiana University (e-mails: [email protected], [email protected], [email protected], [email protected], [email protected]). Manuscript received March XX, 2021; revised YYY ZZZ, 2021.

II. END-TO-END NEURAL WAVEFORM CODEC (NWC)