
Scalable and Efficient Neural Speech Coding

Kai Zhen, Student Member, IEEE, Jongmo Sung, Mi Suk Lee, Seungkwon Beack, Minje Kim, Senior Member, IEEE

Abstract—This work presents a scalable and efficient neural waveform codec (NWC) for speech compression. We formulate the speech coding problem as an autoencoding task, where a convolutional neural network (CNN) performs encoding and decoding as its feedforward routine. The proposed CNN autoencoder also defines quantization and entropy coding as a trainable module, so the coding artifacts and bitrate control are handled during the optimization process. We achieve efficiency by introducing compact model architectures to our fully convolutional network model, such as gated residual networks and depthwise separable convolution. Furthermore, the proposed models are with a scalable architecture, cross-module residual learning (CMRL), to cover a wide range of bitrates. To this end, we employ the residual coding concept to concatenate multiple NWC autoencoding modules, where an NWC module performs residual coding to restore any reconstruction loss that its preceding modules have created. CMRL can scale down to cover lower bitrates as well, for which it employs a linear predictive coding (LPC) module as its first autoencoder. Once again, instead of a mere concatenation of LPC and NWC, we redefine LPC's quantization as a trainable module to enhance the bit allocation tradeoff between LPC and its following NWC modules. Compared to the other autoregressive decoder-based neural speech coders, our decoder has a significantly smaller architecture, e.g., with only 0.12 million parameters, more than 100 times smaller than a WaveNet decoder. Compared to the LPCNet-based speech codec, which leverages the speech production model to reduce the network complexity at low bitrates, ours can scale up to higher bitrates to achieve transparent performance. Our lightweight neural speech coding model achieves comparable subjective scores against AMR-WB at the low bitrate range and provides transparent coding quality at 32 kbps.

Index Terms—Neural speech coding, waveform coding, representation learning, model complexity

This work was supported by the Institute for Information and Communications Technology Promotion (IITP) funded by the Korea government (MSIT) under Grant 2017-0-00072 (Development of Audio/Video Coding and Light Field Media Fundamental Technologies for Ultra Realistic Tera-Media). Kai Zhen is with the Department of Computer Science and Cognitive Science Program at Indiana University, Bloomington, IN 47408 USA. Jongmo Sung, Mi Suk Lee, and Seungkwon Beack are with Electronics and Telecommunications Research Institute, Daejeon, Korea 34129. Minje Kim is with the Dept. of Intelligent Systems Engineering at Indiana University (e-mails: [email protected], [email protected], [email protected], [email protected], [email protected]). Manuscript received March XX, 2021; revised YYY ZZZ, 2021.

I. INTRODUCTION

SPEECH coding can be implemented as an encoder-decoder system, whose goal is to compress input speech signals into a compact bitstream (encoder) and then to reconstruct the original speech from the code with the least possible quality degradation. Speech coding facilitates telecommunication and saves data storage among many other applications. There is a typical trade-off a speech codec must handle: the more the system reduces the amount of bits per second (bitrate), the worse the perceptual similarity between the original and recovered signals is likely to be perceived. In addition, speech coding systems are often required to maintain an affordable computational complexity when the hardware resource is at a premium.

For decades, speech coding has been intensively studied, yielding various standardized codecs that can be categorized into two types: vocoders and waveform codecs. A vocoder, also referred to as parametric speech coding, distills a set of physiologically salient features, such as the spectral envelope (equivalent to vocal tract responses, including the contribution from mouth shape, tongue position, and nasal cavity), fundamental frequencies, and gain (voicing level), from which the decoder synthesizes the speech. Typically, a vocoder is computationally efficient, but it usually operates in the narrowband mode due to its limited performance [1][2]. A waveform codec aims to perfectly reconstruct the speech signal, which features up-to-transparent quality at a higher bitrate range. The latter can be generalized to non-speech audio signal compression as it is not restricted to those speech production priors, although waveform coding and parametric coding can be coupled for a hybrid design [3][4][5].

Under the notion of unsupervised speech representation learning, deep neural network (DNN)-based codecs have revitalized the speech coding problem and provided different perspectives. The major motivation of employing neural networks for speech coding is twofold: to fill the performance gap between vocoders and waveform codecs towards a near-transparent quality; and to use the trainable encoder to learn latent representations which may benefit other DNN-implemented downstream applications, such as speech enhancement [6][7], speaker identification [8], and automatic speech recognition [9][10]. Having that, a neural codec can serve as a trainable acoustic unit integrated in future digital signal processing engines [11].

Recently proposed neural speech codecs have achieved high coding gain and reasonable quality by employing deep autoregressive models. The superior speech synthesis performance achieved in WaveNet-based models [12] has successfully transferred to neural speech coding systems, such as in [13], where WaveNet serves as a decoder synthesizing wideband speech samples from a conventional non-trainable encoder at 2.4 kbps. Although its reconstruction quality is comparable to waveform codecs at higher bitrates, the computational cost is significant due to the model size of over 20 million parameters.

Meanwhile, VQ-VAE [14] integrates a trainable vector quantization scheme into the variational autoencoder (VAE) [15] for discrete speech representation learning.

While the bitrate can be lowered by reducing the sampling rate 64 times, the downside for VQ-VAE is that the prosody can be significantly altered. Although [16] provides a scheme to pass the pitch and timing information to the decoder as a remedy, it does not generalize to non-speech signals. More importantly, VQ-VAE as a vocoder does not address the complexity issue since it uses WaveNet as the decoder. Although these neural speech synthesis systems noticeably improve the speech quality at low bitrates, they are not feasible for real-time speech coding on hardware with limited memory and bandwidth.

LPCNet [17] focuses on efficient neural speech coding via a WaveRNN [18] decoder by leveraging traditional linear predictive coding (LPC) techniques. The input of LPCNet is formed by 20 parameters (18 Bark-scaled cepstral coefficients and 2 additional parameters for the pitch information) for every 10 millisecond frame. All these parameters are extracted from the non-trainable encoder, and vector-quantized with a fixed codebook. As discussed previously, since LPCNet functions as a vocoder, the decoded speech quality is not considered transparent [19].

In this paper, we propose a novel neural waveform coding model, serving as a trainable acoustic processing unit with a lightweight design and scalable performance. In Sec. II, we introduce our basic model: a compact neural waveform codec with only 0.35 million parameters, much more lightweight than WaveNet, VQ-VAE, and our previous neural codec [20]. Based on this neural codec, we introduce two mechanisms to integrate speech production theory and residual coding techniques in Sec. III. First, benefiting from residual-excited linear prediction (RELP) [21], we conduct LPC and apply the neural waveform codec to the excitation signal, as illustrated in Sec. III-A. In this integration, a trainable soft-to-hard quantizer bridges the encoding of linear spectral pairs and the corresponding LPC residual, making the entire quantization and entropy coding modules trainable. Second, to scale up the performance for high bitrates, we propose cross-module residual learning (CMRL), a cascaded model architecture that adds up neural codecs for residual coding, sequentially (Sec. III-B). As a result, the proposed neural speech coding systems have the following characteristics:

• Scalability: Similar to LPCNet [17], the proposed scheme is compatible with conventional spectral envelope estimation techniques. However, ours operates at a much wider bitrate range with comparable or superior speech quality to standardized waveform codecs.
• Compactness: Having achieved the superior speech quality, the model is with a much lower complexity than WaveNet [12] and VQ-VAE [14] based codecs. Our decoder contains only 0.12 million parameters, which is 100 times more compact than a WaveNet counterpart. The execution time to encode and decode a signal is only 42.44% of its duration on a single-core CPU, which facilitates real-time communications.
• Trainability: Our method is with a trainable encoder as in VQ-VAE, which can be integrated into other DNNs for acoustic signal processing. Besides, it is not constrained to speech, and can be generalized to audio coding with minimal effort as shown in [22].

TABLE I highlights the comparison.

TABLE I: Categorical summary of recently proposed neural speech coding systems. ✓ means the system features the characteristic, ✗ means it does not, and — means not reported.

                             WaveNet [13]   VQ-VAE [16]   LPCNet [17]   Proposed
  Transparent coding              ✓              —             ✗            ✓
  Less than 1M parameters         ✗              ✗             ✓            ✓
  Real-time communications        ✗              ✗             ✓            ✓
  Encoder trainable               ✗              ✓             ✗            ✓

In Sec. IV, both objective comparisons and subjective listening tests are conducted for model evaluation. With a trainable quantizer for the LPC coefficients, the neural codec compresses the residual signal, showing a noticeable performance gain: it outperforms Opus and is on a par with AMR-WB at lower bitrates; our codec is slightly superior to AMR-WB and Opus at higher bitrates, e.g., when operating at 20 kbps; at 32 kbps, our codec is capable of scaling up to near transparency when residual coding among neural codecs is enabled in CMRL. Additionally, we investigate the effect of various blending ratios of loss terms and bit allocation schemes on the experimental results via ablation analyses. The execution time and delay analysis is given under four hardware specifications, too. We conclude in Sec. V.

II. END-TO-END NEURAL WAVEFORM CODEC (NWC)

The neural waveform codec (NWC) is an end-to-end autoencoder that forms the base of our proposed coding systems. It directly encodes and quantizes the input waveform x ∈ R^T into the bitstream h̃ ∈ R^N using a convolutional neural network (CNN) encoder module, and then reconstructs the output waveform using a decoder with a similar topology:

x ≈ x̂ ← F_dec(h̃),   h̃ ← Q(h),   h ← F_enc(x).   (1)

Fig. 1 (a) depicts NWC's overall system architecture. The structure is detailed in TABLE II. It serves as a basic component in the proposed speech coding system. In Sec. III-A and III-B we introduce our scaling mechanisms using NWC as the building block of the CMRL framework.

NWC is defined with a compact architecture while achieving high reconstruction quality. The codec is a fully convolutional network (FCN) defined with 1-D convolutional layers. Both encoder and decoder adopt gated linear units (GLU) [23], which are based on ResNet's cross-layer residual learning [24] with dilation [25] to achieve a compact model architecture and to expand the receptive field in the time domain. The gating mechanism (Fig. 1 (b)) boosts the gradient flow with superior performance as evidenced in [26].
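To make the gated residual block concrete, below is a minimal PyTorch sketch of one dilated GLU block with the bottleneck and kernel sizes of TABLE II; the exact nonlinearities, padding scheme, and any normalization are illustrative assumptions rather than the authors' released implementation.

    import torch
    import torch.nn as nn

    class DilatedGLUBlock(nn.Module):
        # One gated residual block: 1x1 bottleneck -> dilated convolutions with gating ->
        # channel restoration, wrapped by a ResNet-style identity shortcut.
        def __init__(self, channels=100, bottleneck=20, kernel=15, dilation=1):
            super().__init__()
            pad = (kernel - 1) // 2 * dilation
            self.reduce = nn.Conv1d(channels, bottleneck, 1)                      # (1, 100, 20)
            self.conv = nn.Conv1d(bottleneck, bottleneck, kernel,
                                  padding=pad, dilation=dilation)                 # (15, 20, 20)†
            self.gate = nn.Conv1d(bottleneck, bottleneck, kernel,
                                  padding=pad, dilation=dilation)                 # gating branch
            self.expand = nn.Conv1d(bottleneck, channels, 9, padding=4)           # (9, 20, 100)

        def forward(self, x):                                # x: (batch, channels, samples)
            h = torch.tanh(self.reduce(x))
            h = self.conv(h) * torch.sigmoid(self.gate(h))   # GLU-style gating
            return x + self.expand(h)                        # identity shortcut keeps gradients flowing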

TABLE II: Architecture of the neural waveform codec: input and output tensors are shaped as (sample, channel), while the kernel is represented as (kernel size, in channel, out channel).

Encoder
  Layer                Input shape   Kernel shape                                                    Output shape
  Channel expansion    (512, 1)      (55, 1, 100)                                                    (512, 100)
  Gated linear unit    (512, 100)    [(1, 100, 20), (15, 20, 20)†, (15, 20, 20)†, (9, 20, 100)] ×2   (512, 100)
  Downsampling         (512, 100)    (9, 100, 100)                                                   (256, 100)
  Gated linear unit    (256, 100)    [(1, 100, 20), (15, 20, 20)†, (15, 20, 20)†, (9, 20, 100)] ×2   (256, 100)
  Channel reduction    (256, 100)    (9, 100, 1)                                                     (256, 1)

Decoder
  Layer                Input shape   Kernel shape                                                    Output shape
  Channel expansion    (256, 1)      (9, 1, 100)                                                     (256, 100)
  Gated linear unit    (256, 100)    [(1, 100, 20), (15, 20, 20)†, (15, 20, 20)†, (9, 20, 100)] ×2   (256, 100)
  Upsampling           (256, 100)    (9, 100, 1), (1, 100, 100)                                      (512, 50)
  Gated linear unit    (512, 50)     [(1, 50, 20), (15, 20, 20)†, (15, 20, 20)†, (9, 20, 50)] ×2     (512, 50)
  Channel reduction    (512, 50)     (55, 50, 1)                                                     (512, 1)

Fig. 1: Proposed lightweight neural waveform codec. (a) The high-level structure of the proposed neural waveform codec. (b) Dilated gated linear unit (GLU). (c) Separable convolution.

The depthwise separable convolution [27] is used to further save the computational cost in one of the decoder layers (Fig. 1 (c)). For example, to transform a feature map of size 256×100 (features, channels) into the same-sized tensor, we need a kernel of size c×100×100 (features, input channels, output channels). With the depthwise separable convolution, it first conducts a channel-wise convolution with a kernel shape of c×1×1 on every input channel, respectively, thus requiring 100 such kernels. It results in a 256×100 tensor. A depthwise Hadamard product with the kernel shape of 1×100×100 follows. It is easy to show that c×100×100 > c×100 + 100×100 for a small c.

The proposed NWC achieves compression through two strategies: feature map compression and trainable quantization.

A. Feature Map Compression

One way to compress the input signal in the proposed encoder architecture is to reduce the data rate. The CNN encoder function takes an input frame, x ∈ R^T, and converts it into a feature map h ∈ R^N,

h ← F_enc(x),   (2)

which then goes through quantization, transmission, and decoding to recover the input as shown in Fig. 1 (a). During the encoding process, we introduce a downsampling operation, reducing the dimension of the code vector h. We employ a dedicated downsampling layer by setting its stride value to 2 during the convolution, reducing the data rate by 50%, i.e., N = T/2. Accordingly, the decoder needs a corresponding upsampling operation to recover the original sampling rate.

We use the subpixel CNN layer proposed in [28] to recover the original sampling rate. Concretely, the subpixel upsampling involves a feature transformation implemented as a depthwise convolution, and a shuffle operation that interlaces features from two channels into a single channel, as shown in Eq. (3), where the input feature of the shuffle operation is shaped as (N, 2) and the output is shaped as (2N, 1):

[h_11, h_21, h_12, h_22, ..., h_1N, h_2N] ← Upsampling([h_11, h_12, ..., h_1N; h_21, h_22, ..., h_2N]).   (3)
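A minimal sketch of the shuffle step in Eq. (3) is given below, assuming a (batch, channel, sample) tensor layout; the depthwise feature transformation that precedes it is omitted.

    import torch

    def subpixel_shuffle(h):
        # h: (batch, 2, N) feature map; returns (batch, 1, 2N) by interlacing the two
        # channels as [h11, h21, h12, h22, ..., h1N, h2N], as in Eq. (3).
        batch, channels, n = h.shape
        assert channels == 2
        return torch.stack((h[:, 0, :], h[:, 1, :]), dim=-1).reshape(batch, 1, 2 * n)

Applied to channel pairs, the same interlacing turns the decoder's (256, 100) feature map into the (512, 50) tensor of TABLE II without a transposed convolution.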

B. The Trainable Quantizer for Bit Depth Reduction

The dimension-reduced feature map can be further compressed via bit depth reduction. Hence, the floating-point code h goes through quantization and entropy coding, which will finalize the bitrate based on the entropy of the code value distribution. Typically, a bit depth reduction procedure lowers the average amount of bits to represent each sample. In our case, we could employ a quantization process that assigns the output of the encoder to one of the pre-defined quantization bins. If there are 2^5 = 32 quantization bins, for example, a single-precision floating-point value's bit depth reduces from 32 to 5. In addition, various entropy coding techniques, such as Huffman coding, can be further employed to losslessly reduce the bit depth. While the quantization could be done in a traditional way, e.g., using Lloyd-Max quantization [29] after the neural codec is fully trained, we encompass the quantization step as a trainable part of the neural network as proposed in [30]. Consequently, we expect that the codec is aware of the quantization error, which the training procedure tries to reduce. It is also convenient to control the bitrate by controlling the entropy of the code value distribution, which can also be done as a part of network training.

In NWC, the quantization process is represented as classification on each scalar value of the encoder output. Given a vector with K centroids, β = [β_1, β_2, ..., β_K]^⊤, the quantizer's goal is to assign each feature h_n to the closest centroid in terms of ℓ2 distance, which is defined as follows:

D = [ ‖h_n − β_k‖_2 ]_{n=1..N, k=1..K},   (4)

where the n-th row of D is a vector of ℓ2 distances between the n-th code value h_n and all K quantization bins. Then, we employ the softmax function to turn each row of D into a K-dimensional probabilistic assignment vector:

A^(soft)_{n:} = softmax(−αD_{n:}),   n = 1, ..., N,   (5)

where we turn the distance into a similarity metric by multiplying it by a negative number −α, such that the shortest distance is converted to the largest probability.

Note that Eq. (5) yields a soft assignment matrix A^(soft) ∈ R^{N×K}. In practice, though, the quantization process must perform a hard assignment, so each code value h_n is replaced by an integer index to the closest centroid: z_n ∈ {1, 2, ..., K}, which is represented by ⌈log2 K⌉ bits as the quantization result. The hard kernel assignment matrix A^(hard), where each row is a one-hot vector, can be induced by turning on the maximum element of A^(soft) while suppressing the non-maximum:

A^(hard)_{nk} = 1 if k = argmax_{j∈{1,2,...,K}} A^(soft)_{nj}, and 0 otherwise.   (6)

On the decoder side, h̃ = A^(hard)β recovers h. Since the argmax operation in Eq. (6) is not differentiable, a soft-to-hard scheme is proposed in [30], where A^(hard) is used only at test time. During backpropagation for training, the soft classification mode is enabled with A^(soft) so as not to block the gradient flow. In other words, ĥ = A^(soft)β represents each encoder output with a linear combination of all quantization bins. The process is summarized in Algorithm 1. Although this soft quantization process is differentiable and desirable during training, the discrepancy between A^(soft) and A^(hard) creates higher error at test time, requiring a mechanism to reduce the discrepancy as in the following section.

Algorithm 1 Trainable softmax quantization, Q(h, α, β)
 1: Input: the code, e.g., the encoder output, h = F_enc(x); the softmax scaling factor, α; the centroid vector, β ∈ R^K
 2: Output: the quantized code, ĥ (training) or h̃ (testing)
 3: Compute the dissimilarity matrix: D_{nk} ← ℓ2(h_n || β_k)
 4: Softmax conversion: A^(soft)_{n:} ← softmax(−αD_{n:})
 5: if Training then
 6:   Soft quantization: ĥ ← A^(soft)β
 7: else if Testing then
 8:   Hard quantization: h̃ ← A^(hard)β
 9: end if

1) Soft-to-hard quantization penalty: Although the limit of A^(soft) is A^(hard) as α approaches ∞, the change of α should be gradual to allow gradient flows in the initial phase of training. We control the hardness of A^(soft) using the soft-to-hard quantization loss derived from [31]:

L_Q = (1/N) Σ_{n,k} √(A^(soft)_{nk}),   (7)

whose minimum, 1, is achieved when A^(soft)_{n:} is a one-hot vector for all n. Conversely, when A^(soft)_{nk} = 1/K, the loss is maximum. Hence, by minimizing this soft-to-hard quantization penalty term, we can regularize the model to have harder A^(soft) values by updating α and the other model parameters accordingly. As a result, the test-time quantization loss will be reasonably small when A^(soft) is replaced by A^(hard).

2) Bitrate calculation and entropy control: The bitrate is calculated as a product of the number of code values per second and the average bit depth for each code. The former is defined by the dimension of the code vector N multiplied by the number of frames per second, F/(T − o), where T, o, and F are the input frame size, overlap size, and the original sampling rate, respectively. If we denote the average bit depth per sample by a function g(h̃_n), the bitrate can be computed as in Eq. (8),

bitrate = g(h̃_n) N F / (T − o).   (8)

When F = 16,000, T = 512, o = 32, and N = 256 after downsampling, for example, there are about 8,533 samples per second. If g(h̃_n) = 3 bits, the bitrate is estimated as 25.6 kbps. On the contrary, the uncompressed bitrate is 256 kbps because N = T = 512, o = 0, and g(x_t) = 16 bits.

We adjust the entropy of β to indirectly control the codec's bitrate, because the entropy serves as the lower bound of g(h̃_n) based on Shannon's entropy theory. We first estimate the entropy using the sample distribution,

H(β) ≈ −Σ_{k=1}^{K} p(β_k) log2 p(β_k),   (9)

where p(β_k) = (1/N) Σ_n A^(hard)_{nk} is the relative frequency of the k-th centroid being chosen during quantization. It is only an estimate of the true entropy, because it depends on the quality of the sample distribution p(β_k).

To navigate the model training towards the target bitrate, H(β) defined in Eq. (9) is included in the loss function as a regularizer: its smaller value leads to a lower bitrate, and vice versa. However, during training, we approximate the relative frequency p(β_k) using the soft assignment matrix A^(soft) rather than the hard one A^(hard), i.e., p(β_k) ≈ (1/N) Σ_n A^(soft)_{nk}, due to the discrete nature of A^(hard) that prevents gradient-based updates. Since H(β) is parameterized by A^(soft) and β, their optimal values are learned during training, making the entire quantization process trainable. For example, if the current codec's bitrate is higher than desired, the optimization process will make the regularization effect stronger to lower the entropy, i.e., to increase the "spikiness" of the distribution. More details on the training process are discussed in Sec. IV-B.
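The following is a compact sketch of Algorithm 1 together with the penalty of Eq. (7) and the entropy estimate of Eq. (9), written in PyTorch for illustration; the function and variable names are assumptions, not the authors' code.

    import torch

    def softmax_quantize(h, beta, alpha, training=True):
        # h: (N,) encoder output code; beta: (K,) centroids; alpha: softmax scaling factor.
        D = (h.unsqueeze(1) - beta.unsqueeze(0)).abs()       # Eq. (4): scalar l2 distance |h_n - beta_k|
        A_soft = torch.softmax(-alpha * D, dim=1)            # Eq. (5)
        if training:
            h_hat = A_soft @ beta                            # soft, differentiable assignment
            penalty = torch.sqrt(A_soft).sum(dim=1).mean()   # soft-to-hard penalty, Eq. (7)
            p = A_soft.mean(dim=0)                           # p(beta_k) approximated with A_soft
            entropy = -(p * torch.log2(p + 1e-12)).sum()     # Eq. (9), lower bound of the bit depth
            return h_hat, penalty, entropy
        idx = A_soft.argmax(dim=1)                           # hard assignment, Eq. (6)
        return beta[idx], idx                                # idx is what gets entropy-coded and transmitted

With K = 32 bins, for example, each code value costs at most 5 bits before entropy coding, matching the bit depth example above.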

Fig. 2: The signal flow chart for the LPC analyzer (a) and windowing schemes in LPC (b)-(d): (a) the trainable LPC analyzer, where the linear spectral pairs are quantized by our trainable soft-to-hard quantizer (in the dotted box); (b) cross-frame windowing; (c) sub-frame windowing; (d) synthesis windowing.

III. THE PROPOSED SCALABLE NWC MODELS

With NWC introduced in Sec. II as the basic module, we propose two different extension mechanisms to improve the codec's performance in a wider range of bitrates, but without increasing the model complexity significantly. The NWC module's capability to perform the whole speech reconstruction process is limited due to its compact topology as well as the discrepancy between the optimization objectives and the hard-to-quantify perceptual speech quality. We resolve this issue by introducing a scalable model architecture that can concatenate multiple speech coding modules, so a module can improve upon the mistakes its predecessors made. First, in Sec. III-A, we propose to harmonize LPC as our first coding module, followed by an NWC module; by making LPC's quantization module trainable, we achieve a win-win strategy that fuses the traditional DSP technique and the modern deep learning model. Starting from this idea, which works well in the low-bitrate cases, we also extend it to cascading more autoencoders (Sec. III-B). The proposed CMRL system relays residual signals among the series of NWCs to scale up the coding performance at high bitrates.

A. Trainable LPC Analyzer

LPC has been widely used to facilitate speech compression and synthesis, where the source-filter model "explains out" the envelope of a speech spectrum, leaving a low-entropy residual signal [32]. Similarly, LPC serves as a pre-processor in our system before its residual signal is compressed by NWC, as we will see in Sec. III-B. In this subsection, we redesign the LPC coefficient quantization process as a trainable module. We introduce collaborative quantization (CQ) to jointly optimize the LPC analyzer and NWCs as a residual coder.

1) Speech resonance modeling: In the speech production process, the source, as a wide-band excitation signal, goes through the vocal tract tube. The shape-dependent resonances of the vocal tract filter the excitation before it is transformed into the speech signal [33]. In speech coding, the "vocal tract response" is often modeled as an all-pole filter [34]. Having that, the t-th sample x_t can be approximated by an autoregressive model using M previous samples, for instance,

x_t = Σ_{k=1}^{M} l_k x_{t−k} + e_t,   (10)

where the estimation error e_t represents the LPC residual, and l_k denotes the filter coefficients. Typically, l_k can be efficiently estimated via the Levinson-Durbin algorithm [35], and is to be quantized before the LPC residual is calculated, i.e., e_t encompasses the quantization error. The LPC residual e_t serves as input to the NWC module, which works as explained in Sec. II, but on e rather than x. Hence, how the LPC coefficients are quantized determines NWC's input, the LPC residual.

2) Collaborative quantization: The conventional LPC coefficient quantization process is standardized in ITU-T G.722.2 (AMR-WB) [36]: 2.4k bits per second are assigned to represent the LPC coefficients through multistage vector quantization (MSVQ) [37] in a classic LPC analyzer. Once again, we employ the soft-to-hard quantizer as illustrated in Sec. II to make the quantization and bit allocation steps in the LPC analyzer trainable and communicatable with the neural codec.

We compute the LPC coefficients as in [38], first by applying high-pass filtering followed by pre-emphasizing (Fig. 2 (a)). When calculating LPC coefficients, the window in Fig. 2 (b) is used. The window is symmetric, with the left and right 25% parts being tapered by a 512-point Hann window. After representing the 16 LPC coefficients in linear spectral pairs (LSP) [39], we quantize them using the soft-to-hard quantization scheme. Then, the sub-frame window in Fig. 2 (c) is applied to calculate the LPC residual, which assures a more accurate residual calculation. The frame that covers samples [256:768], for instance, is decomposed into 7 sub-frames to calculate LPC residuals separately. Each 128-point Hann window in Fig. 2 (c) is with 50% overlap, except for the first and last window. They altogether form a constant overlap-add operation. Finally, after the synthesis using the reconstructed residual signal and the corresponding LPC coefficients, the window in Fig. 2 (d) tapers both ends of the synthesized signal, covering 512 samples with 32 overlapping samples between adjacent windows.
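As a reference point for Eq. (10), a minimal NumPy sketch of the LPC analysis via Levinson-Durbin and of the residual computation is given below; the high-pass/pre-emphasis filtering, the windowing of Fig. 2, and the LSP conversion and quantization are deliberately left out.

    import numpy as np

    def lpc_levinson_durbin(frame, order=16):
        # Estimate the predictor coefficients l_k of Eq. (10) from one (windowed) frame.
        frame = np.asarray(frame, dtype=float)
        r = np.array([np.dot(frame[:len(frame) - k], frame[k:]) for k in range(order + 1)])
        l = np.zeros(order)                  # predictor coefficients
        err = r[0]                           # prediction error energy
        for i in range(order):
            k = (r[i + 1] - np.dot(l[:i], r[i:0:-1])) / err   # reflection coefficient
            l[:i] = l[:i] - k * l[:i][::-1]
            l[i] = k
            err *= 1.0 - k * k
        return l

    def lpc_residual(frame, l):
        # e_t = x_t - sum_k l_k x_{t-k}, with zero initial conditions.
        frame = np.asarray(frame, dtype=float)
        e = frame.copy()
        for k, lk in enumerate(l, start=1):
            e[k:] -= lk * frame[:-k]
        return e

In the proposed system the coefficients are converted to line spectral pairs and quantized with the same trainable soft-to-hard quantizer before the residual is formed.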

As an intuitive example, given the samples [1:1024] as the input, after the LPC analysis, neural residual coding, and LPC synthesis, samples [257:768] are decoded; the next input frame is [481:1504] (the dotted window in Fig. 2 (b)), whose decoded samples are within [737:1248]. The overlap-add operation is applied to the final decoded samples [737:768] (Fig. 2 (d)). During this process, the calculated LPC coefficients are quantized using Algorithm 1, where the code vector is with 16 dimensions, i.e., h ∈ R^16. The number of kernels is set to be K = 2^8 = 256. Note that the soft assignment matrix for the LPC quantization, A^(LPC), is also involved in the loss function to regularize the bitrate. We investigate the impact of the trainable LPC quantization in collaboration with the rest of the NWC modules in Sec. IV.

B. Cross-Module Residual Learning (CMRL)

To achieve scalable coding performance towards transparency at high bitrates, we propose cross-module residual learning (CMRL) to conduct bit allocation among multiple neural codecs in a cascaded manner. CMRL can be regarded as a natural extension of what is described in Sec. III-A, where LPC as a codec conducts the first round of coding by only modeling the spectral envelope. It leaves the residual signal for a subsequent NWC to be further compressed. With CMRL, we employ the concept of residual coding to cascade more NWCs. We also present a dual-phase training scheme to effectively train the CMRL model.

CMRL's scalability comes from its residual coding concept that enables a concatenation of multiple autoencoding modules. We define the residual signal recursively: the i-th codec takes the residual of its predecessor as input, and the i-th reconstruction creates another residual for the next round, and so on. Hence, we have

x̂^(i) ← F^(i)(x^(i)),   x^(i) ← x^(i−1) − x̂^(i−1),   x^(1) ← x,   (11)

where x̂^(i) stands for the reconstruction of the i-th input using the i-th coding module F^(i)(·), while the input to the first codec is defined by the raw input frame x. If we expand the recursion, we arrive at the non-recursive definition of x^(i),

x^(i) = x − Σ_{j=1}^{i−1} x̂^(j),   (12)

which means the input to the i-th model is the residual of the sum of all preceding i−1 codecs' decoded signals. It ensures the additivity of the entire system: adding more modules keeps improving the reconstruction quality. Hence, CMRL can scale up to high bitrates at the cost of increased model complexity.

CMRL is optimized in two phases. During Phase-I training, we sequentially train each codec from the first to the last one using a module-specific residual reconstruction goal,

E(x^(i) ‖ x̂^(i)).   (13)

The purpose of Phase-I training is to get the parameters of each codec properly initialized. Then, Phase-II finetunes all trainable parameters of the concatenated modules to minimize the global reconstruction loss,

E(x ‖ Σ_{i=1}^{N} x̂^(i)).   (14)

Phase-II also re-adjusts the quantization components of all modules, seeking the optimal bit allocation for all modules.

C. Signal Flow during Inference

Fig. 3 shows the full CMRL signal flow with N sub-codecs, having an LPC module as the first one. On the transmitter side, the LPC analyzer first processes the input frame x of 512 samples and computes 16 coefficients, h^(1), as well as the residual samples x^(2). Then, the residual signal goes through the N−1 NWCs sequentially. Note that the transmission process's primary job is to produce a quantized bitstring h̃^(i) from LPC and each NWC. To this end, NWC's decoder part must also run to compute the residual signal and relay it to the next NWC module. The bitstring is generated as a concatenation of all encoder outputs: h̃ = [h̃^(1); h̃^(2); ...; h̃^(N)]. Once the bitstring is available on the receiver side, all NWC decoders run to reconstruct the LPC residual signal, i.e., x̂^(2) ≈ Σ_{i=2}^{N} F_dec^(i)(h̃^(i)). Then it is used as the input of the LPC synthesizer, along with the LPC coefficients.

Fig. 3: The flow diagram of the test-time inference.
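To summarize the cascading and the dual-phase optimization of Eqs. (11)-(14), here is a schematic PyTorch-style sketch; the mean squared error stands in for the full loss of Eq. (15), and the loop structure, function names, and optimizer choices are illustrative assumptions rather than the released training code.

    import torch

    def mse(a, b):                                   # stand-in for the full loss of Eq. (15)
        return torch.mean((a - b) ** 2)

    def train_cmrl(codecs, loader, make_opt):
        # Phase-I: greedily train each module on its predecessors' residual (Eqs. (11)-(13)).
        for i, codec in enumerate(codecs):
            opt = make_opt(codec.parameters())
            for x in loader:
                residual = x
                with torch.no_grad():                # earlier modules are fixed in this phase
                    for prev in codecs[:i]:
                        residual = residual - prev(residual)
                loss = mse(residual, codec(residual))
                opt.zero_grad(); loss.backward(); opt.step()

        # Phase-II: finetune all modules jointly on the global reconstruction (Eq. (14)).
        opt = make_opt([p for c in codecs for p in c.parameters()])
        for x in loader:
            residual, total = x, 0.0
            for codec in codecs:
                x_hat = codec(residual)
                total = total + x_hat
                residual = residual - x_hat
            loss = mse(x, total)
            opt.zero_grad(); loss.backward(); opt.step()

Here make_opt could be, e.g., lambda p: torch.optim.Adam(p, lr=2e-4), mirroring the learning rates reported in Sec. IV-B.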

IV. EVALUATION

In this section, we examine the proposed neural speech coding models presented in Sec. II and III. The evaluation criteria include both objective measures, such as PESQ [40] and signal-to-noise ratio (SNR), and subjective scores from MUSHRA listening tests [41]. In addition, we conduct ablation analyses to provide a detailed comparison between various loss terms and bit allocation schemes. Finally, we report the system delay and execution time under four hardware specifications.

A. Data Processing

The training dataset is created from 300 speakers randomly selected from the TIMIT corpus [42] with no gender preference. Each speaker contributes 10 utterances, totaling a 2.6-hour-long training set, which is a reasonable size due to our compact design. The same scheme is adopted when creating the validation dataset and test dataset with 50 speakers each. All three datasets are mutually exclusive. All neural codecs in this work are trained and tested on the same set of data for a fair comparison. We normalize each utterance to have a unit variance, then divide it by the global maximum amplitude, before framing it into segments with the size of 512 samples. On the receiver side, we conduct overlap-and-add after the synthesis of the frames, where a 32-sample Hann window is applied to the overlapping region of the same size.

With the LPC codec, we apply high-pass filtering defined in the z-space,

G_hp(z) = (0.989502 − 1.979004z^{-1} + 0.989502z^{-2}) / (1 − 1.978882z^{-1} + 0.979126z^{-2}),

to the normalized waveform. A pre-emphasis filter, G_preemp(z) = 1 − 0.68z^{-1}, follows to boost the high frequencies.

B. Training Targets and Hyperparameters

The loss function is defined as

L = λ_MSE Σ_{t=1}^{T} (x_t − x̂_t)^2 + λ_mel Σ_{b=1}^{4} Σ_{f=1}^{F_b} (y_f^(b) − ŷ_f^(b))^2 + λ_Q L_Q + λ_ent H(β),   (15)

where the first term measures the mean squared error (MSE) between the raw waveform samples and their reconstruction. Ideally, if the model complexity and the bitrate are sufficiently large, an accurate reconstruction is feasible by using MSE as the only loss function. Otherwise, the result is usually sub-optimal due to the lack of bits: coupled with the MSE loss, the decoded signals tend to contain broadband artifacts. The second term supplements the MSE loss and helps suppress this kind of artifact. To this end, we follow the common steps to conduct mel-scaled filter bank analysis, which results in a mel spectrum y that has a higher resolution in the low frequencies than in the high frequencies. The filter bank size defines the granularity level of the comparison. Following [31], we conduct a coarse-to-fine filter bank analysis by setting four filter bank sizes, F_1 = 8, F_2 = 16, F_3 = 32, F_4 = 128, as shown in Fig. 4, which results in four kinds of resolutions for mel spectra y^(b) indexed by b ∈ {1, 2, 3, 4}.

Fig. 4: The coarse-to-fine filter bank analysis in the mel scale (filter banks with 8, 16, 32, and 128 filters, plotted as filter coefficients over 0 to 8 kHz).

All models are trained with the Adam optimizer with default learning rate adaptation rates [43]. The batch size is fixed at 128 frames. The initial learning rate is 2×10^{-3} for the first neural codec. With CMRL, the learning rate for the successive neural codecs is 2×10^{-4}. Finetuning of all those models is with a smaller learning rate of 2×10^{-5}. All models are sufficiently trained until the validation loss converges, after being exposed to about 5×10^{5} batches. These hyperparameters were chosen based on validation.

The blending weights in the loss function in Eq. (15) are also selected based on the validation performance. Empirically, the ratio between the time-domain loss and the mel-scaled frequency loss affects the trade-off between the SNR and the perceptual quality of decoded signals. If the time-domain loss dominates the optimization process, the model compresses each sub-band with an equal effort. In that case, the artifact will be audible unless the SNR reaches a rather high level (over 30 dB), which entails a high bitrate and model complexity. On the other hand, if only the mel-scaled frequency loss is in place, the reconstruction quality in the high frequencies will degrade. The impact of these blending weights for these two loss terms is detailed in Sec. IV-F via an ablation analysis. The weights for the quantization regularizer λ_Q and the entropy regularizer λ_ent are set to be 0.5 and 0.0, respectively. As for λ_ent, we alter it after every epoch by 0.015: if the current model's bitrate is higher than the target bitrate, λ_ent increases to penalize the model's entropy more, and vice versa. Note that we omit the module index i in Eq. (15), so the meaning of x̂_t depends on the context: either the module-specific reconstruction as in Eq. (13) or the sum of all recovered residual signals for Phase-II finetuning as in Eq. (14). Similarly, L_Q and H(β) can encompass all modules' quantization and entropy losses, including LPC's, for Phase-II. We delay the introduction of the quantization and entropy loss until the fifth epoch.

C. Bitrate Modes and Competing Models

We consider three bitrates, 12, 20, and 32 kbps, to validate the models' performance in a range of use cases. We evaluate the following different versions of the neural speech codec:

• Model-I: The NWC baseline (Sec. II).
• Model-II: Another baseline that combines the legacy LPC and an NWC module for residual coding.
• Model-III: A trainable LPC quantization module followed by an NWC and finetuning (Sec. III-A).
• Model-IV: Similar to Model-III but with two NWC modules: the full-capacity CMRL implementation (Sec. III-B). It is tested to cover the high bitrate case, 32 kbps.

Regarding the standard codecs, AMR-WB [44] and Opus [45] are considered for comparison. AMR-WB, as an ITU standard speech codec, operates in nine different modes covering a bitrate range from 6.6 kbps to 23.85 kbps, providing excellent speech quality with a bitrate as low as 12.65 kbps in wideband mode. As a more recent codec, Opus shows state-of-the-art performance in most bitrates up to 510 kbps for stereo audio coding, except for the very low bitrate range.

We first compare all models with respect to the objective measures, while being aware that they are not consistent with the subjective quality. Hence, we also evaluate these codecs in two rounds of MUSHRA subjective listening tests: the neural codecs are compared in the first round, whose winner is compared with other standard codecs in the second round.
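As a side note on the objective in Sec. IV-B, the coarse-to-fine mel term of Eq. (15) can be sketched as follows; the spectral front-end (power spectra over 512-sample frames) and the way the four mel filter banks are built are assumptions for illustration.

    import torch

    def coarse_to_fine_mel_loss(x, x_hat, mel_banks, frame_len=512):
        # x, x_hat: (batch, T) waveforms; mel_banks: four torch matrices with 8, 16, 32,
        # and 128 mel filters, each shaped (n_mels, frame_len // 2 + 1).
        window = torch.hann_window(frame_len)
        X = torch.stft(x, frame_len, window=window, return_complex=True).abs() ** 2
        X_hat = torch.stft(x_hat, frame_len, window=window, return_complex=True).abs() ** 2
        loss = 0.0
        for M in mel_banks:                          # coarse (8 filters) to fine (128 filters)
            y, y_hat = torch.matmul(M, X), torch.matmul(M, X_hat)
            loss = loss + torch.mean((y - y_hat) ** 2)
        return loss

The four resolutions correspond to F_1 = 8 through F_4 = 128 in Fig. 4.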

TABLE III: Objective measurements for neural codec comparison under three bitrate cases.

                          SNR (dB)                                                PESQ-WB
  Bitrate (kbps)   Model-I  Model-II  Model-III  Model-IV  AMR-WB  Opus    Model-I  Model-II  Model-III  Model-IV  AMR-WB  Opus
  ∼12              12.37    10.69     10.85      –         11.60   9.63    3.67     3.45      3.60       –         3.92    3.93
  ∼20              16.87    10.73     13.65      –         13.14   9.46    4.37     3.95      4.01       –         4.18    4.37
  ∼32              20.24    11.84     14.46      17.11     –       17.66   4.42     4.15      4.18       4.35      –       4.38

Fig. 5: Speech reconstruction performance stays almost the same when the model size decreases from 0.45 to 0.35 million parameters with the help from the structural modification. (a) The validation SNR curve during training; (b) the validation PESQ curve during training (both plotted against epochs, together with the code entropy of each model).

Fig. 6: In CMRL, performance leaps when a new neural codec is added for residual cascading. (a) Scalability with respect to SNR; (b) scalability with respect to PESQ (plotted against training steps for CMRL with one to five modules, a non-greedy variant, and a non-CMRL model of the same size).

D. Objective Measurements

1) The compact NWC module and its performance: Compared to our previous models in [20][46][47] that use 0.45 million parameters, the newly proposed NWC in this work only has 0.35 million parameters. It is also a significant reduction from the other compact neural speech codec [31] with 1.6 million parameters. As introduced in Sec. II, the model size reduction is achieved via the GLU [26] and the depthwise separable convolution for upsampling [27]. In our first experiment, we show that the objective measures stay the same. Fig. 5 compares the NWC modules before and after the structural modification proposed in Sec. II in terms of (a) signal-to-noise ratio (SNR) and (b) PESQ-WB [40]. We can see that the newly proposed model with 0.35M parameters is comparable to the larger model. Therefore, it justifies its use as the basic module in the range of models from Model-I to IV.

2) The impact of CMRL's residual coding: To validate the merit of CMRL's residual coding concept, we scale up the CMRL model by incrementally adding more NWC modules, up to five. In Fig. 6, both SNR and PESQ values keep increasing when CMRL keeps adding a new NWC module. There are two noticeable points in these graphs. First, the greedy module-wise pretraining is important for the performance: whenever a new model is added, it is pretrained to minimize the module-specific loss Eq. (13) first (Phase-I), then the global loss Eq. (14), subsequently (Phase-II). A model that does not perform Phase-II (thick gray line) stagnates no matter how many NWCs are added. Second, we also train a very large NWC model with the same amount of parameters as CMRL with five NWCs combined (grey dash). It turns out the equally large model fails to scale up due to its single integrated architecture. While we eventually decide to use only up to two NWCs for speech coding for our highest bitrate case, 32 kbps, CMRL's scalability is clearly beneficial if one has to extend to higher bitrates for non-speech audio coding.

3) Overall objective comparison of all competing models: TABLE III reports SNR and PESQ-WB from all competing systems. AMR-WB operates at 12.65 kbps in the low-range bitrate setting and at 23.05 kbps for the mid-range. It is noticeable that, among neural codecs, the simplest Model-I outperforms the others in all three bitrate setups both in terms of SNR and PESQ-WB.

TABLE IV: Ablation analysis on blending weights.

(a) Neural codec only
  Blending ratio (MSE : mel)   Decoded SNR (dB)   Decoded PESQ
  1 : 0                        18.12              3.67
  0 : 1                        0.16               4.23
  1 : 1                        6.23               4.31
  10 : 1                       16.88              4.37

(b) Collaboratively trained LPC codec and neural codec
  Blending ratio (MSE : mel)   Residual SNR (dB)   Decoded SNR (dB)   Decoded PESQ
  1 : 0                        9.73                14.25              3.84
  0 : 1                        1.79                17.23              4.02
  1 : 1                        7.11                17.82              4.08
  10 : 1                       8.26                17.55              4.01

Fig. 7: MUSHRA subjective listening test results, shown as boxplots over the low-, medium-, and high-range bitrate sessions with a low-pass anchor (LP-4k) and hidden reference (Ref). (a) Neural waveform codecs comparison. (b) Comparison between the proposed methods and standard codecs.

Even compared with AMR-WB and Opus, Model-I is the winner except for the low bitrate case, where Opus achieves the highest PESQ score. It is because the single autoencoder model is highly optimized for the objective loss during training, although it does not necessarily mean that the higher objective score leads to a better subjective quality, as presented in Sec. IV-E. It is also observed that, with CQ, Model-III gains slightly higher SNR and PESQ scores compared to Model-II, which uses the legacy LPC. Aside from the objective measure comparison, to further evaluate the quality of the proposed codec, we discuss the subjective test in the next section.

E. Subjective Test

We conduct two rounds of MUSHRA tests: (a) to select the best one out of the proposed models (from Model-I to IV), and (b) to compare it with the standard codecs, i.e., AMR-WB and Opus. Each round covers three different bitrate ranges, totaling six MUSHRA sessions. A session consists of ten trials, for which ten gender-balanced test signals are randomly selected. Each trial has one low-pass filtered signal serving as the anchor (with a cutoff frequency at 4 kHz), the hidden reference, as well as signals decoded from the competing systems. We recruit ten participants who are audio experts with prior experience in speech/audio quality evaluation. The subjective scores are rendered in Fig. 7 as boxplots. Each box ranges from the 25th to the 75th percentile with a 95% confidence interval. The mean and median are presented as the green dotted line and the pink solid line, respectively. Outliers are represented as circles.

1) Comparison among the proposed neural codecs: In Fig. 7 (a) we see that Model-III produces decoding results that are much more preferred than both Model-I and Model-II, which are a pure end-to-end model and one with the non-trainable legacy LPC module, respectively. The advantage is more significant at lower bitrates. It is contradictory to the objective scores reported in TABLE III, where Model-I often achieved the highest scores. It is because of Model-III's joint training of the LPC quantization module and NWCs. Compared to the deterministic quantization module in the legacy LPC, CQ can assign a different amount of bits to different frames in collaboration with the following NWC module, maximizing the coding efficiency. We also note that Model-III's performance stagnates in the high bitrate experiments, suggesting its poor scalability. To this end, for the high bitrate experiment, we additionally test Model-IV with two NWC residual coding modules instead of just one. Model-IV outperformed Model-II by a large margin, showcasing a transparent quality.

2) Comparison with standardized codecs: Fig. 7 (b) shows that our Model-II is on par with AMR-WB for the low-range bitrate case, while outperforming Opus, which tends to lose high frequency components. In the medium range, Model-II at 19.2 kbps is comparable to Opus at 20.0 kbps and AMR-WB at 23.05 kbps. In the high bitrate range, our Model-IV outperforms Opus operating at 32 and 24 kbps, while AMR-WB is omitted as it does not support those high bitrates.

F. Ablation Analysis

In this section, we perform some ablation analyses to justify our choices that led to CQ and CMRL's superior subjective test results. We investigate how different blending ratios between loss terms can alter the performance. We will also explore the optimal bit allocation strategy among coding modules.

1) Blending weights for the loss terms: Out of the two major reconstruction loss terms, MSE serves as the main loss for the end-to-end NWC system, while the mel-scaled loss prioritizes certain frequency bands over the others. In TABLE IV (a), the SNR reaches the highest value when there is only the MSE term, which, however, leads to the lowest PESQ score.

Fig. 8: The ablation analysis on CQ (PESQ and entropy over training epochs, with and without CQ).

Fig. 9: The frame-wise bit allocation analysis (bits per frame assigned to the LPC codec and the neural codec, with a phone-level annotation of the test utterance on the horizontal axis).

TABLE V: Bit allocation among coding components.
Bitrate Modes (kbps) | LPC Coefficients (bits/frame) | LPC Residual (bits/frame) | Total (bits/frame)
∼11.77               | 58                            | 295                       | 353
∼19.20               | 74                            | 502                       | 576
∼30.72               | 74                            | 486+384                   | 944

By only keeping the mel-scaled loss term, the PESQ score is decent (4.23), while the sample-by-sample reconstruction is poor, as suggested by the SNR value (0.16 dB). Similarly, in TABLE IV (b), where the input of the neural codec is the LPC residual, MSE alone yields the highest SNR for the reconstruction of the LPC residual, which, however, does not benefit the final synthesized signal even in terms of SNR. We choose a blending ratio of 10:1, which consistently shows good performance in all proposed models.
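To make the blending concrete, the following minimal NumPy sketch (not the authors' implementation; the 512-sample frame, FFT size, and 40-band triangular mel filter bank are illustrative assumptions) combines a time-domain MSE term with a mel-scaled log-spectral term at the 10:1 ratio chosen above.

import numpy as np

def mel_filterbank(sr=16000, n_fft=512, n_mels=40):
    # Triangular mel filter bank: rows are mel bands, columns are FFT bins.
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return fb

def blended_loss(x, x_hat, fb, n_fft=512, w_mse=10.0, w_mel=1.0):
    # Time-domain MSE, the main reconstruction loss.
    mse = np.mean((x - x_hat) ** 2)
    # Mel-scaled loss on log magnitudes of a single frame (for illustration).
    mag = lambda s: np.abs(np.fft.rfft(s, n=n_fft))
    mel = lambda s: np.log(fb @ mag(s) + 1e-7)
    mel_loss = np.mean((mel(x) - mel(x_hat)) ** 2)
    return w_mse * mse + w_mel * mel_loss  # 10:1 blending

rng = np.random.default_rng(0)
frame = rng.standard_normal(512)
print(blended_loss(frame, frame + 0.01 * rng.standard_normal(512), mel_filterbank()))

The MSE term anchors sample-by-sample fidelity (hence the high SNR), while the mel term shifts the penalty toward perceptually salient bands, which is consistent with the PESQ behavior reported in TABLE IV.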

2) CQ's impact on the speech quality: We compare the PESQ values of the decoded signals from Model-II and Model-III. Since Model-III shares the same architecture as Model-II except for the CQ training strategy, the comparison verifies whether CQ can effectively allocate bits between the LPC and NWC modules. Fig. 8 shows that the total entropy of the two models is kept under control regardless of whether the CQ mechanism is used. However, Model-III with CQ achieves a higher PESQ both during and after the entropy control, showcasing that the CQ approach benefits the codec's performance.

Fig. 10: Ablation analysis on bit allocation schemes between codec-1 and codec-2 in Model-IV at 32 kbps (SNR and PESQ for 33% more bits for codec-1, the same bits for both codecs, and 33% more bits for codec-2).

3) Bit allocation between the LPC and NWC modules: Since the proposed CQ method is capable of assigning different bits to the LPC and NWC modules dynamically, i.e., in a frame-by-frame manner, we analyze its impact in more detail. In the mid-range bitrate setting, Fig. 9 shows the amount of bits assigned to both modules per frame (b/f). First of all, we observe that the dynamic bit allocation scheme indeed adjusts the LPC and NWC bitrates over time. It is also noticeable that the LPC module consumes more bits than average for near-silent frames, while the corresponding NWC bitrate is reduced. Given that the LPC module operates at a significantly smaller order of magnitude, i.e., fewer than 80 b/f in general, this is an efficient bit allocation behavior: the NWC module saves about 100 b/f at the cost of a small increase of about 2 b/f in the LPC module. However, a significant amount of bits is still required to represent even those near-silent frames, which can be further optimized. Finally, it appears that NWC is less efficient for fricatives (e.g., /f/ and /ʃ/) and affricates (e.g., /tʃ/). TABLE V shows the overall bit allocation among the different modules. In the low bitrate case, it is worth noting that CQ spends 58 b/f, or 1.93 kbps, on the LPC coefficients, which differs from the 2.4 kbps allocation in the AMR-WB standard.

4) Bit allocation between the two NWC modules in Model-IV: To find the optimal bit allocation between the two NWC modules, we first conduct an ablation analysis on three different bit allocation choices. In Fig. 10, both the SNR and PESQ scores degrade when the second NWC uses 33.3% more bits than the first one. Among these three choices, the highest PESQ score is obtained when the first NWC module uses 33.3% more bits. In practice, the bit allocation is automatically determined during the optimization process. In TABLE V, for example, the bit ratio between the two NWC modules of Model-IV in the high bitrate case is about 486:384 ≈ 55.9:44.1, in accord with the observation from the ablation analysis that the first module should use more bits.
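As a quick, illustrative sanity check of these numbers (the 30 ms frame hop below is an assumption made here; it reproduces the per-frame-to-kbps conversions quoted in the text), the budgets in TABLE V map to bitrates and split ratios as follows.

def bits_per_frame_to_kbps(bits_per_frame, hop_ms=30.0):
    # Bits per frame times frames per second, expressed in kbps.
    return bits_per_frame * (1000.0 / hop_ms) / 1000.0

print(round(bits_per_frame_to_kbps(353), 2))  # ~11.77 kbps, the low-bitrate total
print(round(bits_per_frame_to_kbps(576), 2))  # ~19.20 kbps, the mid-bitrate total
print(round(bits_per_frame_to_kbps(58), 2))   # ~1.93 kbps spent on the LPC coefficients

nwc1, nwc2 = 486, 384                         # high-bitrate split between the two NWC modules
print(round(100 * nwc1 / (nwc1 + nwc2), 1),   # ~55.9
      round(100 * nwc2 / (nwc1 + nwc2), 1))   # ~44.1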

G. Complexity and Delay

The proposed NWC model has 0.35 million parameters, half of which belong to the decoder. Hence, at 32 kbps with two NWC modules for residual coding, the model size totals 0.7M parameters, with a decoder size of 0.35M. Even though our decoder is still not as compact as those in traditional codecs, it is 100× smaller than a WaveNet decoder. Aside from the model size, we investigate the codec's delay and processing time. The codec incurs algorithmic delay if it relies on future samples to predict the current sample, and the processing time during encoding and decoding adds further runtime overhead.

TABLE VI: Execution time ratios during model inference (%).
Hardware       | 0.45M | 0.35M | 0.45M×2 | 0.35M×2
1× Tesla V100  | 12.49 | 13.38 | 20.69   | 21.12
1× Tesla K80   | 24.45 | 22.53 | 39.42   | 38.82
8× CPU cores   | 20.76 | 18.91 | 35.17   | 33.80
1× CPU core    | 46.88 | 42.44 | 87.38   | 80.21

1) Algorithmic delay: The delay of our system is defined by the frame size: the first sample of a frame can be processed only after the entire frame is buffered, i.e., 512/16000 = 32 ms. Causal convolution can minimize such delay at the expense of reduced speech quality, because it only uses past samples.

2) Analysis of the processing time: The execution time is another important factor for real-time communications. The bottom line is that the encoding and decoding processes must finish within the duration of the hop length so as not to add extra delays. For example, the WaveNet codec [13] minimizes the system delay using causal convolution, but its processing time, though not reported, can be rather high as it is an autoregressive model with over 20 million parameters. TABLE VI lists the execution time ratios of our models. The ratio (in percentage) is defined as the time to encode and decode the test signals divided by the duration of those signals. Meanwhile, Kankanahalli's model requires 4.78 ms to encode and decode a hop length of 30 ms on an NVIDIA GeForce GTX 1080 Ti GPU, and 21.42 ms on an Intel Core i7-4970K CPU (3.8 GHz), which amount to execution time ratios of 15.93% and 71.40%, respectively [31]. Our small-sized models (0.45M and 0.35M) run faster than Kankanahalli's on both CPU (Intel Xeon Processor E5-2670 V3, 2.3 GHz) and GPU, although a direct comparison is not fair due to the different computing environments. The CMRL models with two NWC modules require more execution time. Note that all our models compared in this test achieve the real-time processing goal, as their ratios are under 100%. However, the ratio comparison between the 0.45M and 0.35M models is not consistent. We believe this is because the internal implementation of different residual learning blocks in TensorFlow may lead to different runtime optimization effects.
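The two quantities used in this subsection can be reproduced with simple arithmetic; the helper names below are illustrative and not part of any released code.

def algorithmic_delay_ms(frame_size=512, sample_rate=16000):
    # One full frame must be buffered before coding can start.
    return 1000.0 * frame_size / sample_rate

def execution_time_ratio(processing_seconds, signal_seconds):
    # Percentage of real time spent on encoding and decoding;
    # values below 100% indicate real-time capability.
    return 100.0 * processing_seconds / signal_seconds

print(algorithmic_delay_ms())                          # 32.0 ms, the frame-buffering delay above
print(round(execution_time_ratio(4.78e-3, 30e-3), 2))  # ~15.93%, Kankanahalli's GPU case [31]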
V. CONCLUDING REMARKS

Recent neural waveform codecs have outperformed conventional codecs in terms of coding efficiency and speech quality, at the expense of model complexity. We proposed a scalable and lightweight neural acoustic processing unit for waveform coding. Our smallest model contains only 0.35 million parameters, and its decoder is more than 100× smaller than the WaveNet-based codec. By incorporating a trainable LPC analyzer with collaborative quantization and residual cascading, our model demonstrates superior or comparable performance to the standardized codecs. Our model operates frame-wise, leading to a delay as small as 16 msec; even on a single-core CPU without compiler-level matrix multiplication optimization, it achieved real-time processing.

Admittedly, conventional codecs already perform well from narrowband to fullband scenarios. However, it is not easy to integrate these standalone DSP components into a neural network-powered conversational AI engine, since their acoustic signal representation is predetermined by the bit allocation logic rather than learned in a latent space via a global training process. To this end, our trainable codec provides an affordable path toward unsupervised speech waveform representation learning that also offers a competent level of compression.

REFERENCES

[1] M. R. Schroeder, "Vocoders: Analysis and synthesis of speech," Proceedings of the IEEE, vol. 54, no. 5, pp. 720–734, 1966.
[2] A. V. McCree and T. P. Barnwell, "A mixed excitation LPC vocoder model for low bit rate speech coding," IEEE Transactions on Speech and Audio Processing, vol. 3, no. 4, pp. 242–250, 1995.
[3] J. Stachurski and A. McCree, "Combining parametric and waveform-matching coders for low bit-rate speech coding," in 2000 10th European Signal Processing Conference. IEEE, 2000, pp. 1–4.
[4] M. Schroeder and B. Atal, "Code-excited linear prediction (CELP): High-quality speech at very low bit rates," in Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP'85, vol. 10. IEEE, 1985, pp. 937–940.
[5] J. Stachurski and A. McCree, "A 4 kb/s hybrid MELP/CELP coder with alignment phase encoding and zero-phase equalization," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). IEEE, 2000, pp. 1379–1382.
[6] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, "A regression approach to speech enhancement based on deep neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 1, pp. 7–19, 2014.
[7] Y. Luo and N. Mesgarani, "TasNet: Surpassing ideal time-frequency masking for speech separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 8, pp. 1256–1266, 2019.
[8] A. Nagrani, J. S. Chung, and A. Zisserman, "VoxCeleb: A large-scale speaker identification dataset," arXiv preprint arXiv:1706.08612, 2017.
[9] A. Graves, A.-R. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2013, pp. 6645–6649.
[10] O. Abdel-Hamid, A.-R. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu, "Convolutional neural networks for speech recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 10, pp. 1533–1545, 2014.
[11] J. Chorowski, R. J. Weiss, S. Bengio, and A. van den Oord, "Unsupervised speech representation learning using WaveNet autoencoders," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 12, pp. 2041–2053, 2019.
[12] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, "WaveNet: A generative model for raw audio," Speech Synthesis Workshop, vol. 125, 2016.
[13] W. B. Kleijn, F. S. C. Lim, A. Luebs, J. Skoglund, F. Stimberg, Q. Wang, and T. C. Walters, "WaveNet based low rate speech coding," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018, pp. 676–680.
[14] A. van den Oord, O. Vinyals, and K. Kavukcuoglu, "Neural discrete representation learning," in Advances in Neural Information Processing Systems (NIPS), 2017, pp. 6306–6315.
[15] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," arXiv preprint arXiv:1312.6114, 2013.
[16] C. Garbacea, A. van den Oord, and Y. Li, "Low bit-rate speech coding with VQ-VAE and a WaveNet decoder," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2019.
[17] J.-M. Valin and J. Skoglund, "LPCNet: Improving neural speech synthesis through linear prediction," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 5891–5895.

[18] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. Oord, S. Dieleman, and K. Kavukcuoglu, "Efficient neural audio synthesis," in International Conference on Machine Learning. PMLR, 2018, pp. 2410–2419.
[19] J.-M. Valin and J. Skoglund, "A real-time wideband neural vocoder at 1.6 kb/s using LPCNet," in Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), 2019.
[20] K. Zhen, J. Sung, M. S. Lee, S. Beack, and M. Kim, "Cascaded cross-module residual learning towards lightweight end-to-end speech coding," in Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), 2019.
[21] C. Un and D. Magill, "The residual-excited linear prediction vocoder with transmission rate below 9.6 kbits/s," IEEE Transactions on Communications, vol. 23, no. 12, pp. 1466–1474, 1975.
[22] K. Zhen, M. S. Lee, J. Sung, S. Beack, and M. Kim, "Psychoacoustic calibration of loss functions for efficient end-to-end neural audio coding," IEEE Signal Processing Letters, vol. 27, pp. 2159–2163, 2020.
[23] K. Tan, J. T. Chen, and D. L. Wang, "Gated residual networks with dilated convolutions for supervised speech separation," in Proc. ICASSP, 2018.
[24] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[25] F. Yu and V. Koltun, "Multi-scale context aggregation by dilated convolutions," arXiv preprint arXiv:1511.07122, 2015.
[26] K. Tan, J. T. Chen, and D. L. Wang, "Gated residual networks with dilated convolutions for monaural speech enhancement," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 1, pp. 189–198, 2019.
[27] F. Chollet, "Xception: Deep learning with depthwise separable convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1251–1258.
[28] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, "Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1874–1883.
[29] M. Garey, D. Johnson, and H. Witsenhausen, "The complexity of the generalized Lloyd-Max problem (Corresp.)," IEEE Transactions on Information Theory, vol. 28, no. 2, pp. 255–256, 1982.
[30] E. Agustsson, F. Mentzer, M. Tschannen, L. Cavigelli, R. Timofte, L. Benini, and L. V. Gool, "Soft-to-hard vector quantization for end-to-end learning compressible representations," in Advances in Neural Information Processing Systems (NIPS), 2017, pp. 1141–1151.
[31] S. Kankanahalli, "End-to-end optimized speech coding with deep neural networks," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.
[32] F. Itakura, "Early developments of LPC speech coding techniques," in ICSLP, 1990, pp. 1409–1410.
[33] I. R. Titze and D. W. Martin, "Principles of voice production," 1998.
[34] D. O'Shaughnessy, "Linear predictive coding," IEEE Potentials, vol. 7, no. 1, pp. 29–32, 1988.
[35] J. Franke, "A Levinson-Durbin recursion for autoregressive-moving average processes," Biometrika, vol. 72, no. 3, pp. 573–581, 1985.
[36] ITU-T G.722.2, "Wideband coding of speech at around 16 kbit/s using adaptive multi-rate wideband (AMR-WB)," 2003.
[37] A. Gersho and V. Cuperman, "Vector quantization: A pattern-matching technique for speech coding," IEEE Communications Magazine, vol. 21, no. 9, pp. 15–21, 1983.
[38] B. Bessette, R. Salami, R. Lefebvre, M. Jelinek, J. Rotola-Pukkila, J. Vainio, H. Mikkola, and K. Jarvinen, "The adaptive multirate wideband speech codec (AMR-WB)," IEEE Transactions on Speech and Audio Processing, vol. 10, no. 8, pp. 620–636, 2002.
[39] F. Soong and B. Juang, "Line spectrum pair (LSP) and speech data compression," in ICASSP'84, IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9. IEEE, 1984, pp. 37–40.
[40] A. Rix, J. Beerends, M. Hollier, and A. Hekstra, "Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'01), vol. 2. IEEE, 2001, pp. 749–752.
[41] ITU-R Recommendation BS.1534-1, "Method for the subjective assessment of intermediate quality levels of coding systems (MUSHRA)," 2003.
[42] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, N. L. Dahlgren, and V. Zue, "TIMIT acoustic-phonetic continuous speech corpus," Linguistic Data Consortium, Philadelphia, 1993.
[43] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proceedings of the International Conference on Learning Representations (ICLR), 2015.
[44] B. Bessette, R. Salami, R. Lefebvre, M. Jelinek, J. Rotola-Pukkila, J. Vainio, H. Mikkola, and K. Jarvinen, "The adaptive multirate wideband speech codec (AMR-WB)," IEEE Transactions on Speech and Audio Processing, vol. 10, no. 8, pp. 620–636, 2002.
[45] J.-M. Valin, G. Maxwell, T. B. Terriberry, and K. Vos, "High-quality, low-delay music coding in the Opus codec," arXiv preprint arXiv:1602.04845, 2016.
[46] K. Zhen, M. S. Lee, J. Sung, S. Beack, and M. Kim, "Efficient and scalable neural residual waveform coding with collaborative quantization," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 361–365.
[47] K. Zhen, M. S. Lee, J. Sung, S. Beack, and M. Kim, "Psychoacoustic calibration of loss functions for efficient end-to-end neural audio coding," IEEE Signal Processing Letters, pp. 1–1, 2020.