<<

Cascaded Cross-Module Residual Learning towards Lightweight End-to-End

Kai Zhen1,2, Jongmo Sung3, Mi Suk Lee3, Seungkwon Beack3, Minje Kim1,2

1Indiana University, School of Informatics, Computing, and Engineering, Bloomington, IN 2Indiana University, Cognitive Science Program, Bloomington, IN 3Electronics and Research Institute, Daejeon, South Korea

[email protected], [email protected], [email protected], [email protected], [email protected]

Abstract paradigm, they are computationally expensive to run on smart devices. Many DNN-based achieve both low bitrates Speech codecs learn compact representations of speech and high perceptual quality, two main targets for speech codecs to facilitate data transmission. Many recent deep neural net- [17][18][19], but with a high model complexity. A WaveNet work (DNN) based end-to-end speech codecs achieve low bi- based variational autoencoder (VAE) [16] outperforms other trates and high perceptual quality at the cost of model com- low bitrate codecs in the listening test, however, with 20 mil- plexity. We propose a cross-module residual learning (CMRL) lions parameters, a too big model for real-time processing in pipeline as a module carrier with each module reconstructing a resource-constrained device. Similarly, codecs built on Sam- the residual from its preceding modules. CMRL differs from pleRNN [20][21] can also be energy-intensive. other DNN-based speech codecs, in that rather than modeling Motivated by DNN based end-to-end codecs [14] and resid- speech compression problem in a single large neural network, it ual cascading [22][23], this paper proposes a “cross-module” optimizes a series of less-complicated modules in a two-phase residual learning (CMRL) pipeline, which can lower the model training scheme. The proposed method shows better objec- complexity while maintaining a high perceptual quality and tive performance than AMR-WB and the state-of-the-art DNN- compression ratio. CMRL hosts a list of less-complicated end- based speech with a similar network architecture. As an to-end speech coding modules. Each module learns to re- end-to-end model, it takes raw PCM signals as an input, but cover what is failed to be reconstructed by its preceding mod- is also compatible with (LPC), show- ules. CMRL differs from other residual learning networks, ing better subjective quality at high bitrates than AMR-WB and e.g. ResNet [24], in that rather than adding identical short- . The gain is achieved by using only 0.9 million trainable cuts between layers, CMRL cascades residuals across a series parameters, a significantly less complex architecture than the of DNN modules. We introduce a two-round model training other DNN-based codecs in the literature. scheme to train CMRL models. In addition, we also show that Index Terms: speech coding, deep neural network, entropy CMRL is compatible with LPC by having it as one of the mod- coding, residual learning ules. With LPC coefficients being predicted, CMRL recovers the LPC residuals which, along with the LPC coefficients, syn- 1. Introduction thesize the decoded speech at the receiver side. Speech coding, where the encoder converts the speech signal The evaluation of the propose method is threefold: ob- into bitstreams and the decoder synthesizes reconstructed signal jective measures, subjective assessment and model complexity. from received bitstreams, serves an important role for various Comparing with AMR-WB, OPUS, and the recently proposed purposes: to secure a communication [1][2], to facilitate end-to-end system [14], CMRL showed promising performance data transmission [3], etc. There have been various conventional both in objective and subjective quality assessments. As for speech coding methodologies, including linear predictive cod- complexity, CMRL contains only 0.9 million model parameters, ing (LPC) [4], adaptive encoding [5], and perceptual weighting significantly less complicated than the WaveNet based speech [6] among other domain specific knowledge about the speech codec [16] and the end-to-end baseline [14]. signals, that are used to construct classic codecs, such as AMR- WB [7] and OPUS [8] with high perceptual quality. 2. Model description Since the last decade, data-driven approaches have vital- arXiv:1906.07769v4 [eess.AS] 13 Sep 2019 ized the use of deep neural networks (DNN) for speech cod- Before introducing CMRL as a module carrier, we describe the ing. A speech coding system can be formulated by DNN as component module to be hosted by CMRL. an autoencoder (AE) with a code layer discretized by (VQ) [9] or bitwise network techniques [10], etc. 2.1. The component module Many DNN methods [11][12] take inputs in time-frequency (T- F) domain from short time (STFT) or modi- Recently, an end-to-end DNN speech codec (referred to as fied discrete cosine transform (MDCT), etc. Recent DNN-based Kankanahalli-Net) has shown competitive performance compa- codecs [13][14][15][16] model speech signals in time domain rable to one of the standards (AMR-WB) [14]. We describe our directly without T-F transformation. They are referred to as end- component model derived from Kankanahalli-Net that consists to-end methods, yielding competitive performance comparing of bottleneck residual learning [24], soft-to-hard quantization with current speech coding standards, such as AMR-WB [7]. [25], and sub-pixel convolutional neural networks for upsam- While DNN serves a powerful parameter estimation pling [26]. Figure 1 depicts the component module. clustering assignment

0 1 … 0 entropy … Down … 0 0 so… 1 coding sampling … … … …

Encoder 0 1 … 0 (a). Raw PCM or LPC residual signal (b). bottle-neck blocks (b). bottle-neck blocks (c). softmax quantization

entropy(d). coding entropy Up … … sampling decoding Decoder

(f). decoded signal (b). bottle-neck blocks (b). bottle-neck blocks (e). quantized code

Figure 1: A schematic diagram for the end-to-end speech coding component module: some channel change steps are omitted.

2.2. The module carrier: CMRL Figure 3 shows the proposed cascaded cross-module residual learning (CMRL) process. In CMRL, each module does its best Figure 2: The interlacing-based upsampling process. to reconstruct its input. The procedure in the i-th module is (i) (i) (i) denoted as F(x ; W ), which estimates the input as xˆ . The input for the i-th module is defined as 2.1.1. Four non-linear mapping types i−1 (i) X (j) In the end-to-end speech codec, we take S = 512 time domain x = x − xˆ , (1) samples per frame, 32 of which are windowed by the either left j=1 or right half of a Hann window and then overlapped with the ad- (1) jacent ones. This forms the input to the first 1-D convolutional where the first module takes the input speech signal, i.e., x = layer of C kernels, whose output is a tensor of size S × C . x. The meaning is that each module learns to reconstruct the There are four types of non-linear transformations involved residual which is not recovered by its preceding modules. Note in this fully convolutional network: downsampling, upsam- that module homogeneity is not required for CMRL: for exam- pling, channel changing, and residual learning. The downsam- ple, the first module can be very shallow to just estimate the pling operation reduces S down to S/2 by setting the stride d envelope of MDCT spectral structure while the following mod- of the convolutional layer to be 2, which turns an input example ules may need more parameters to estimate the residuals. S × C into S/2 × C. The original dimension S is recovered in Each AE decomposes into the encoder and decoder parts: the decoder with recently proposed sub-pixel [25], (i) h(i) = F (x(i); (i)), xˆ(i) = F (h(i); ), (2) which forms the upsampling operation. The super-pixel convo- enc Wenc dec Wdec lution is done by interlacing multiple feature maps to expand where h(i) denotes the part of code generated by the i-th en- the size of the window (Figure 2). In our case, we interlace a (i) (i) coder, and ∪ = (i). pair of feature maps, and that is why in Table 1 the upsampling Wenc Wdec W The encoding process: For a given input signal x, the en- layer reduces the channels from 100 to 50 while recovers the coding process runs all N AE modules in a sequential order. original 512 dimensions from 256. Then, the bitstring is generated by taking the encoder outputs In this work, to simplify the model architecture we have > h (1)> (2)> (N)>i identical shortcuts only for cross-layer residual learning, while and concatenating them: h = h , h , ··· , h . Kankanahalli-Net employs them more frequently. Furthermore, The decoding process: Once the bitstring is available inspired by recent work in source separation with dilated con- on the receiver side, all the decoder parts of the modules, volutional neural network [27], we use a “bottleneck” residual (i) (i) Fdec(x ; Wdec) ∀N, run to produce the reconstructions which learning block to further reduce the number of parameters. This are added up to approximate the initial input signal with the can lower the amount of parameters, because the reduced num- global error defined as ber of channels within the bottleneck residual learning block de- creases the depth of the kernels. See Table 1 for the size of our N ! X (i) Eˆ x xˆ . (3) kernels. Likewise, the input S × 1 tensor is firstly converted to a S ×C feature map, and then downsampled to S/2×C. Even- i=1 tually, the code vector shrinks down to S/2 × 1. The decoding 2.2.1. The two-round training scheme process recovers it back to a signal of size S × 1, reversely. Intra-module greedy training: We provide a two-round train- 2.1.2. Softmax quantization: ing scheme to make CMRL optimization tractable. The first round adopts a greedy training scheme, where each AE tries The coded output from each encoder is still a real-valued vector (i) (i) (i) its best to minimize the error: arg min E(x ||F(x ; W )). of size S/2. Softmax quantization [25] performs scalar quanti- (i) W zation by assigning each real value to the nearest representative The greedy training scheme echoes a divide-and-conquer man- (Figure 1 (c)). In the proposed system, softmax quantization ner, leading to an easier optimization for each module. The maps the input scalar to one of the 32 clusters, or quantization thick gray arrows in Figure 3 show the flow of the backpropaga- levels, which requires log2 32 = 5 bits per dimension. Huffman tion error to minimize the individual module error with respect (i) coding further reduces the bitrate [28]. to the module-specific parameter set W . N ˆ (N) (x(1) xˆ(1)) (x(2) xˆ(2)) (x(N) xˆ(N)) x xˆ AAACL3icZZDPShxBEMZ7TDRmNWaNx1waF2EFsztjBHMKQhLI0YCrgrMu3b01bmf7z9BdIy7jPEWexGOu+hAhl5BryEukd3YPUQsKfvV1f0Xx8VxJj3H8M1p48nRx6dny88bK6ou1l831V8feFk5AT1hl3SlnHpQ00EOJCk5zB0xzBSd8/GH6fnIJzktrjnCSQ1+zCyMzKRgGadB8k9Y7Sq4KqFLB1Kd2yq/Oy3ayXV1fpyOGZZirmbA9aLbiTlwXfQzJHFpkXoeD5t90aEWhwaBQzPuzJM6xXzKHUiioGmnhIWdizC7gLKBhGny/rE+q6FZQhjSzLrRBWqv/O0qmvZ9ovkMDaIajHcp1sE3R31+N2bt+KU1eIBgx25wViqKl01DoUDoQqCYBmHAyHEfFiDkmMETXSGtj2e35MHW1NF9hLHX3o7M5t1fdIWQdD1g1QjrJwywew/FuJ3nb2f2y1zp4P89pmbwmm6RNErJPDshnckh6RJBv5Du5JXfRTfQj+hX9nn1diOaeDXKvoj//AI97qKo= AAACL3icZZDfShtBFMZnrbU2tprWy94MhkIEm+ymgr0qQlvopQWjghvDzOSsGTN/lpmzxbDuU/RJvOytPkTpTelt6Ut0ssmF1gMHfueb+Q6Hj+dKeozjn9HSo+XHK09WnzbWnj1f32i+eHnkbeEE9IVV1p1w5kFJA32UqOAkd8A0V3DMJx9m78dfwXlpzSFOcxhodm5kJgXDIA2bb9J6R8lVAVUqmPrUTvnlWdnubVdXV+mYYRnmai5sD5utuBPXRR9CsoAWWdTBsPk3HVlRaDAoFPP+NIlzHJTMoRQKqkZaeMiZmLBzOA1omAY/KOuTKvo6KCOaWRfaIK3Vu46Sae+nmu/QAJrheIdyHWwz9PdXY/ZuUEqTFwhGzDdnhaJo6SwUOpIOBKppACacDMdRMWaOCQzRNdLaWHb7PkxdLc0FTKTufnQ25/ayO4Ks4wGrRkgn+T+Lh3DU6yRvO70vu63994ucVskrskXaJCF7ZJ98JgekTwT5Rr6TG3IbXUc/ol/R7/nXpWjh2ST3KvrzD5LMqKw= AAACL3icZZDPahRBEMZ7EmPiJupGj14al8AG4u5MDMSTBKLgSSJkk0Bms3T31mQ723+G7hrJMpmn8Ek8etWHkFyCV/El7J3dQ/4UFPzq6/6K4uO5kh7j+DpaWHy09Hh55Uljde3ps+fN9RdH3hZOQE9YZd0JZx6UNNBDiQpOcgdMcwXHfLw/fT/+Cs5Law5xkkNfs3MjMykYBmnQfJPWO0quCqhSwdTHdsovz8r2583q6iodMSzDXM2EzUGzFXfiuuhDSObQIvM6GDT/pUMrCg0GhWLenyZxjv2SOZRCQdVICw85E2N2DqcBDdPg+2V9UkU3gjKkmXWhDdJave0omfZ+ovkWDaAZjrYo18E2RX93NWbv+qU0eYFgxGxzViiKlk5DoUPpQKCaBGDCyXAcFSPmmMAQXSOtjWW358PU1dJcwFjq7gdnc24vu0PIOh6waoR0kvtZPISj7U7ytrP9Zae1936e0wp5RV6TNknILtkjn8gB6RFBvpEf5Cf5FX2Pfkc30Z/Z14Vo7nlJ7lT09z/vqKjk E ! E || E || E || AAACT3icZZDNbhMxFIU94a8NfwGWbCwipFSqkpmCBBtQJUBiQ1Uk0laq08j23JmY2DMj+w5KZOapeBKW3cKSB2CHcKazoPRItj4f33t1dUSllcM4Po96167fuHlra7t/+87de/cHDx4eubK2Eqay1KU9EdyBVgVMUaGGk8oCN0LDsVi+2fwffwHrVFl8wnUFM8PzQmVKcgzWfPCBtTO8hbRhC46eSa7fNUxDhiMmVkyoPP/a3a42c69eJc3ZwUWtWDVnfnSw0zCr8gXuzAfDeBy3olch6WBIOh3OB79YWsraQIFSc+dOk7jCmecWldTQ9FntoOJyyXM4DVhwA27m25Ub+jQ4Kc1KG06BtHX/7fDcOLc2YpcGMBwXu1SY0LZBd3k0Zi9nXhVVjVDIi8lZrSmWdBMaTZUFiXodgEurwnJULrjlEkO0fdY2+snUhdfEqOIzLJWZvLVlJcrVJIVs7ACbfkgn+T+Lq3C0N06ejfc+Ph/uv+5y2iKPyRMyIgl5QfbJe3JIpkSSb+Sc/CA/o+/R7+hPryvtRR08IpfU2/4Lx0e1Sw== i=1 xˆ(1) (x(1); (1)) xˆ(2) (x(2); (2)) xˆ(N) (x(N); (N)) X AAACT3icZZDfahQxFMYzW7Xt+m/VS2+Ci7CFsjtTCwqCFCrijVDB7RY665JkznTTTSZDcqbtEuapfBIve6uXPoB3YmZ3Lqw9EPidk+87JB8vlXQYx9dRZ+PO3XubW9vd+w8ePnrce/L02JnKChgLo4w94cyBkgWMUaKCk9IC01zBhC8Om/vJBVgnTfEFlyVMNTsrZC4FwzCa9T6lc4Y+5Vf1Vz9IdupUQY7MWnNJU81wLpjyH+pByo3Kmr6RrpX127WCcz9pzTuzXj8exquityFpoU/aOpr1fqWZEZWGAoVizp0mcYlTzyxKoaDuppWDkokFO4PTgAXT4KZ+9e2avgyTjObGhlMgXU3/dXimnVtqvksDNE/dpVwHW4Pu5mrM30y9LMoKoRDrzXmlKBrahEYzaUGgWgZgwsrwOCrmzDKBIdpuujL60diFbqRlcQ4LqUfvrSm5uRplkA8dYN0N6ST/Z3EbjveGyavh3uf9/sG7Nqct8py8IAOSkNfkgHwkR2RMBPlGrskP8jP6Hv2O/nRaaSdq4Rm5UZ3tvwlTs7k= W AAACT3icZZDNahsxFIU17l/i/rnpshtRU3Ag2DNuoYVCCLSUbgop1HEg4xpJcydWLI0G6U4SI+ap+iRdZtsu+wDdlWrsWTTNBcF3r865SIeXSjqM46uoc+v2nbv3tra79x88fPS492TnyJnKCpgIo4w95syBkgVMUKKC49IC01zBlC/fNffTc7BOmuILrkqYaXZayFwKhmE0731KFwx9yi/rr34w3q1TBTkya80FTTXDhWDKf6gHKTcqa/pGulHWbzcKzv20Ne/Oe/14GK+L3oSkhT5p63De+5VmRlQaChSKOXeSxCXOPLMohYK6m1YOSiaW7BROAhZMg5v59bdr+iJMMpobG06BdD391+GZdm6l+R4N0Dx1j3IdbA2666sxfzPzsigrhEJsNueVomhoExrNpAWBahWACSvD46hYMMsEhmi76droRxMXupGWxRkspR69t6bk5nKUQT50gHU3pJP8n8VNOBoPk5fD8edX/YP9Nqct8ow8JwOSkNfkgHwkh2RCBPlGrsgP8jP6Hv2O/nRaaSdq4Sm5Vp3tvw6Ds7w= W AAACT3icZZBdaxNBFIZnU7Vt/Ip62ZvBIKRQkt0qVBCkoIg3SgXTFLoxzMyebcbMxzJz1jYs+6v8JV72tl76A7wTZ5O9sPbAwHPOvO9h5uWFkh7j+DLqbNy6fWdza7t79979Bw97jx4fe1s6AWNhlXUnnHlQ0sAYJSo4KRwwzRVM+OJNcz/5Bs5Laz7jsoCpZmdG5lIwDKNZ70M6Z1il/KL+Ug0+7tapghyZc/acpprhXDBVvasHKbcqa/pGulbWr9YKzqtJa96d9frxMF4VvQlJC33S1tGs9yvNrCg1GBSKeX+axAVOK+ZQCgV1Ny09FEws2BmcBjRMg59Wq2/X9FmYZDS3LhyDdDX911Ex7f1S8z0aoHnqHuU62Br011dj/nJaSVOUCEasN+elomhpExrNpAOBahmACSfD46iYM8cEhmi76cpYjcY+dCMtzVdYSD1662zB7cUog3zoAetuSCf5P4ubcLw/TJ4P9z+96B++bnPaIjvkKRmQhByQQ/KeHJExEeQ7uSRX5Gf0I/od/em00k7UwhNyrTrbfwGfw7QQ W st F F F 1 Round Residual signal Residual signal Ba ckpropa ga tion Flow Decoder Decoder Decoder 2nd Round BP for BP for AAACBnicZZDLSgMxFIYz3q23qks3wSK4KO2MCLoSQRcuFWwV2iKZzJkam8uQnBHL0L1Lt/oQ7sStr+Ez+BKmtQu1BwLf+ZP/cPLHmRQOw/AzmJqemZ2bX1gsLS2vrK6V1zeazuSWQ4Mbaex1zBxIoaGBAiVcZxaYiiVcxb2T4f3VPVgnjL7EfgYdxbpapIIz9FKzzROD7qZcCWvhqOgkRGOokHGd35S/2onhuQKNXDLnWlGYYadgFgWXMCi1cwcZ4z3WhZZHzRS4TjHadkB3vJLQ1Fh/NNKR+ttRMOVcX8VV6kExvK3SWHnbEN3f0ZgedgqhsxxB85/JaS4pGjr8K02EBY6y74FxK/xylN8yyzj6RErtkbGoN5zv6kroO+gJVT+1JovNQz2BtOYAByWfTvQ/i0lo7tWisBZd7FeOj8Y5LZAtsk12SUQOyDE5I+ekQTi5I0/kmbwEj8Fr8Ba8/zydCsaeTfKngo9vqEeYhg== BP for (1) (2) (N) Ba ckpropa ga tion Flow

(1) (2) AAACGHicbVDLSgMxFM34rPVVdelmsAi6aWeqoMuiLlxJBacV21oy6R0bm8eQZMQy9C9cCfot7sStOz/Fneljoa0HAifn3ENuThgzqo3nfTkzs3PzC4uZpezyyuraem5js6ploggERDKprkOsgVEBgaGGwXWsAPOQQS3sng782gMoTaW4Mr0YmhzfCRpRgo2Vbhph5zbdu9jvZ1u5vFfwhnCniT8meTRGpZX7brQlSTgIQxjWuu57sWmmWBlKGPSzjURDjEkX30HdUoE56GY63Ljv7lql7UZS2SOMO1R/J1LMte7x0E5ybDp60huI/3n1xETHzZSKODEgyOihKGGuke7g+26bKiCG9SzBRFG7q0s6WGFibEnZxjCYFgNtb0VOxT10KS+eKRmH8rHYhqigwfRtV/5kM9OkWir4B4XS5WG+fDJuLYO20Q7aQz46QmV0jiooQAQJ9IRe0Kvz7Lw5787HaHTGGWe20B84nz+cV6Ae (N) AAACGHicbVDLTgIxFO34RHyhLt1MJCa4gRk00SVRFy4xkUcEJJ1yByp9TNqOkUz4C1cm+i3ujFt3foo7y2Oh4EmanJ5zT3p7gohRbTzvy1lYXFpeWU2tpdc3Nre2Mzu7VS1jRaBCJJOqHmANjAqoGGoY1CMFmAcMakH/YuTXHkBpKsWNGUTQ4rgraEgJNla6bQa9uyTnHw3T7UzWy3tjuPPEn5IsmqLcznw3O5LEHIQhDGvd8L3ItBKsDCUMhulmrCHCpI+70LBUYA66lYw3HrqHVum4oVT2COOO1d+JBHOtBzywkxybnp71RuJ/XiM24VkroSKKDQgyeSiMmWukO/q+26EKiGEDSzBR1O7qkh5WmBhbUro5DiaFira3AqfiHvqUFy6VjAL5WOhAmNdghrYrf7aZeVIt5v3jfPH6JFs6n7aWQvvoAOWQj05RCV2hMqogggR6Qi/o1Xl23px352MyuuBMM3voD5zPH2uhoAE= AAACGHicbVDLTgIxFO34RHyhLt1MJCa4gRk00SVRFy4xkUcEJJ1yByp9TNqOkUz4C1cm+i3ujFt3foo7y2Oh4EmanJ5zT3p7gohRbTzvy1lYXFpeWU2tpdc3Nre2Mzu7VS1jRaBCJJOqHmANjAqoGGoY1CMFmAcMakH/YuTXHkBpKsWNGUTQ4rgraEgJNla6bQa9uyRXPBqm25msl/fGcOeJPyVZNEW5nfludiSJOQhDGNa64XuRaSVYGUoYDNPNWEOESR93oWGpwBx0KxlvPHQPrdJxQ6nsEcYdq78TCeZaD3hgJzk2PT3rjcT/vEZswrNWQkUUGxBk8lAYM9dId/R9t0MVEMMGlmCiqN3VJT2sMDG2pHRzHEwKFW1vBU7FPfQpL1wqGQXysdCBMK/BDG1X/mwz86RazPvH+eL1SbZ0Pm0thfbRAcohH52iErpCZVRBBAn0hF7Qq/PsvDnvzsdkdMGZZvbQHzifP21PoAI= h h Code Layer AAACIXicbVDLSsNAFJ34rPHV6tJNsAh10yZV0GVRFy4VrC20tUymNzp2HmFmopaQT3El6Le4E3fil7hzGrvwdWDgzLn3cA8njBnVxvffnKnpmdm5+cKCu7i0vLJaLK2da5koAk0imVTtEGtgVEDTUMOgHSvAPGTQCoeH43nrBpSmUpyZUQw9ji8FjSjBxkr9YqnLsbkKw7SVXaSVYDtz+8WyX/VzeH9JMCFlNMFJv/jRHUiScBCGMKx1J/Bj00uxMpQwyNxuoiHGZIgvoWOpwBx0L82jZ96WVQZeJJV9wni5+t2RYq71iId2cxxU/56Nxf9mncRE+72UijgxIMjXoShhnpHeuAdvQBUQw0aWYKKozeqRK6wwMbYtt5sb01pT21+NU3ENQ8prR0rGobyrDSCqajCZ7Sr43cxfcl6vBjvV+uluuXEwaa2ANtAmqqAA7aEGOkYnqIkIukX36BE9OQ/Os/PivH6tTjkTzzr6Aef9E/1+o2s= W h Code Layer AAACIXicbVC7TgJBFJ3FF+ILtLTZSEywgV000ZKohaUmIiSAZHa4CyPz2MzMqmSzn2Jlot9iZ+yMX2LngBSKnmSSM+fek3tygohRbTzv3cnMzS8sLmWXcyura+sb+cLmlZaxIlAnkknVDLAGRgXUDTUMmpECzAMGjWB4Mp43bkFpKsWlGUXQ4bgvaEgJNlbq5gttjs0gCJJGep2UqntprpsvemVvAvcv8aekiKY47+Y/2z1JYg7CEIa1bvleZDoJVoYSBmmuHWuIMBniPrQsFZiD7iST6Km7a5WeG0plnzDuRP3pSDDXesQDuzkOqmdnY/G/WSs24VEnoSKKDQjyfSiMmWukO+7B7VEFxLCRJZgoarO6ZIAVJsa2lWtPjEmlru2vwqm4gSHllVMlo0DeV3oQljWY1Hblzzbzl1xVy/5+uXpxUKwdT1vLom20g0rIR4eohs7QOaojgu7QA3pCz86j8+K8Om/fqxln6tlCv+B8fAH/LKNs W ··· Code Layer AAACIXicbVDLSsNAFJ34tr6qLt0Ei1A3TVIFXYq6cCUK1gptLZPJTTt2HmFmopaQT3El6Le4E3fil7hzWrtQ64GBM+fewz2cMGFUG99/dyYmp6ZnZufmCwuLS8srxdW1Sy1TRaBGJJPqKsQaGBVQM9QwuEoUYB4yqIe9o8G8fgtKUykuTD+BFscdQWNKsLFSu7ja5Nh0wzCr59dZ+XQ7L7SLJb/iD+GOk2BESmiEs3bxsxlJknIQhjCsdSPwE9PKsDKUMMgLzVRDgkkPd6BhqcAcdCsbRs/dLatEbiyVfcK4Q/WnI8Nc6z4P7eYgqP47G4j/zRqpifdbGRVJakCQ70Nxylwj3UEPbkQVEMP6lmCiqM3qki5WmBjbVqE5NGZeTdufx6m4gR7l3rGSSSjvvQjiigaT266Cv82Mk8tqJdipVM93SweHo9bm0AbaRGUUoD10gE7QGaohgu7QA3pCz86j8+K8Om/fqxPOyLOOfsH5+AIuQ6OI W Residua l Signa ls Flow Encoder Encoder Encoder (x(1Nst )Roundxˆ(N Error)) Function AAACL3icZZDPahRBEMZ7EmPiJupGj14al8AG4u5MDMSTBKLgSSJkk0Bms3T31mQ723+G7hrJMpmn8Ek8etWHkFyCV/El7J3dQ/4UFPzq6/6K4uO5kh7j+DpaWHy09Hh55Uljde3ps+fN9RdH3hZOQE9YZd0JZx6UNNBDiQpOcgdMcwXHfLw/fT/+Cs5Law5xkkNfs3MjMykYBmnQfJPWO0quCqhSwdTHdsovz8r2583q6iodMSzDXM2EzUGzFXfiuuhDSObQIvM6GDT/pUMrCg0GhWLenyZxjv2SOZRCQdVICw85E2N2DqcBDdPg+2V9UkU3gjKkmXWhDdJave0omfZ+ovkWDaAZjrYo18E2RX93NWbv+qU0eYFgxGxzViiKlk5DoUPpQKCaBGDCyXAcFSPmmMAQXSOtjWW358PU1dJcwFjq7gdnc24vu0PIOh6waoR0kvtZPISj7U7ytrP9Zae1936e0wp5RV6TNknILtkjn8gB6RFBvpEf5Cf5FX2Pfkc30Z/Z14Vo7nlJ7lT09z/vqKjk N (1) (2) (1) (N) N 1 (i) E ||

AAACIHicbVDLSgMxFM34rPU16tJNsAi6aWeqoBuhqAuXCrYV2loy6Z02No8hyYhl6J+4EvRb3IlL/RN3prULXwcCJ+fcw72cKOHM2CB486amZ2bn5nML+cWl5ZVVf229ZlSqKVSp4kpfRcQAZxKqllkOV4kGIiIO9ah/MvLrt6ANU/LSDhJoCdKVLGaUWCe1fb8Z3V1nO+HuEB9hx/NtvxAUgzHwXxJOSAFNcN72P5odRVMB0lJOjGmEQWJbGdGWUQ7DfDM1kBDaJ11oOCqJANPKxpcP8bZTOjhW2j1p8Vj9nsiIMGYgIjcpiO2Z395I/M9rpDY+bGVMJqkFSb8WxSnHVuFRDbjDNFDLB44Qqpm7FdMe0YRaV1a+OQ5mpapxv5Jg8gb6TJROtUoidVfqQFw0YIeuq/B3M39JrVwM94rli/1C5XjSWg5toi20g0J0gCroDJ2jKqLoFt2jR/TkPXjP3ov3+jU65U0yG+gHvPdPzvyiMQ== AAACMnicbZDLbhMxFIY9KdAQLk3Lko2VCCldkMykldoNUgVdsAwSSStlQuRxziRufBnZZ1Cj0ex5GlZI7auUHWLLC7DDuSwg5UiWPv/n/Dr2n2RSOAzDu6Cy8+Dho93q49qTp8+e79X3DwbO5JZDnxtp7GXCHEihoY8CJVxmFphKJFwk83fL/sVnsE4Y/REXGYwUm2qRCs7QS+N6I06uPxWt7mFJ31DP9DWNZwwLj6XXo8OyNq43w3a4Knofog00yaZ64/rveGJ4rkAjl8y5YRRmOCqYRcEllLU4d5AxPmdTGHrUTIEbFau/lPSVVyY0NdYfjXSl/u0omHJuoRI/qRjO3HZvKf6vN8wxPR0VQmc5gubrRWkuKRq6DIZOhAWOcuGBcSv8WymfMcs4+vhq8cpYdPrO3zpK6CuYC9U5tyZLzHVnAmnbAZY+q2g7mfsw6Lajo3b3w3Hz7O0mtSp5SRqkRSJyQs7Ie9IjfcLJF/KV3JDb4FvwPfgR/FyPVoKN5wX5p4JffwAUNajy x = x xˆ nd (N) x = x x = x xˆ AAACUnicbVLLThsxFHUCLTTQNtBlN1YjpLAgmaFIsEFClEVXCKQGkDIh8jh3iIkfI/sOIrLmf/gaVkjQT2lXOCELXleydHwesn3kNJfCYRT9rVTn5j98XFj8VFta/vzla31l9cSZwnLocCONPUuZAyk0dFCghLPcAlOphNN09Guin16BdcLoPzjOoafYhRaZ4AwD1a/vJ+n1uW8erpd0lwZMN6hPHLciR4djCTRxhep7sRuX5/5wIy5LmgwZ+mANRFOsl7V+vRG1ounQtyCegQaZzVG//i8ZGF4o0Mglc64bRzn2PLMouISylhQOcsZH7AK6AWqmwPX89K0lXQvMgGbGhqWRTtnnCc+Uc2OVBqdiOHSvtQn5ntYtMNvpeaHzAkHzp4OyQlI0dFIcHQgLHOU4ABb6CXelfMgs4xjqrSXToG93XNi1ldCXMBKqfWBNnprr9gCylgMsQ1fx62begpPNVvyztXm81djbn7W2SL6TH6RJYrJN9shvckQ6hJMbckvuyUPlrvK/Gn7Jk7VamWW+kRdTXX4ErwazxA== ˆ 2 Round Error Function i=1 x xˆ st nd th P E ! The 1 module The 2 module The N module AAACT3icZZDNbhMxFIU94a8NfwGWbCwipFSqkpmCBBtQJUBiQ1Uk0laq08j23JmY2DMj+w5KZOapeBKW3cKSB2CHcKazoPRItj4f33t1dUSllcM4Po96167fuHlra7t/+87de/cHDx4eubK2Eqay1KU9EdyBVgVMUaGGk8oCN0LDsVi+2fwffwHrVFl8wnUFM8PzQmVKcgzWfPCBtTO8hbRhC46eSa7fNUxDhiMmVkyoPP/a3a42c69eJc3ZwUWtWDVnfnSw0zCr8gXuzAfDeBy3olch6WBIOh3OB79YWsraQIFSc+dOk7jCmecWldTQ9FntoOJyyXM4DVhwA27m25Ub+jQ4Kc1KG06BtHX/7fDcOLc2YpcGMBwXu1SY0LZBd3k0Zi9nXhVVjVDIi8lZrSmWdBMaTZUFiXodgEurwnJULrjlEkO0fdY2+snUhdfEqOIzLJWZvLVlJcrVJIVs7ACbfkgn+T+Lq3C0N06ejfc+Ph/uv+5y2iKPyRMyIgl5QfbJe3JIpkSSb+Sc/CA/o+/R7+hPryvtRR08IpfU2/4Lx0e1Sw== i=1 X Figure 3: Cross-module residual learning pipeline

Cross-module finetuning: The greedy training scheme ac- Table 1: Architecture of the component module as in Figure 1. cumulates module-specific error, which the earlier modules do Input and output tensors sizes are represented by (width, chan- not have a chance to reduce, thus leading to a suboptimal re- nel), while the kernel shape is (width, in channel, out channel). sult. Hence, the second-round cross-module finetuning follows Layer Input shape Kernel shape Output shape to further improve the performance by reducing the total error: Change channel (512, 1) (9, 1, 100) (512, 100) N ! (9, 100, 20) ˆ X (i) (i) 1st bottleneck (512, 100) (9, 20, 20) ×2 (512, 100) arg min E x F x ; W . (4) (9, 20, 100) # (1) (N) W ···W i=1 Downsampling (512, 100) (9, 100, 100) (256, 100) (9, 100, 20) During the finetuing step, we first (a) initialize the parameters of 2nd bottleneck (256, 100) (9, 20, 20) ×2 (256, 100) each module with those estimated from the greedy training step (9, 20, 100) # (b) perform cascaded feedforward on all the modules sequen- Change channel (256, 100) (9, 100, 1) (256, 1) tially to calculate the total estimation error in (3) (c) backprop- Change channel (256, 1) (9, 1, 100) (256, 100) (9, 100, 20) agate the error to update parameters in all modules altogether 1st bottleneck (256, 100) (9, 20, 20) ×2 (256, 100) (thin black arrows in Figure 3). Aside from the total recon- (9, 20, 100) # struction error (3), we inherit Kankanahalli-Net’s other regu- Upsampling (256, 100) (9, 100, 100) (512, 50) (9, 50, 20) larization terms, i.e., perceptual loss, quantization penalty, and 2nd bottleneck (512, 50) (9, 20, 20) ×2 (512, 50) entropy regularizer. (9, 20, 50) # Change channel (512, 50) (9, 50, 1) (512, 1) 2.3. Bitrate and entropy coding The bitrate is calculated from the concatenated bitstrings from residual input works better than AMR-WB and OPUS at high all modules in CMRL. Each encoder module produces S/d bitrates. quantized symbols from the softmax quantization process (Fig- ure 1 (e)), where the stride size d divides the input dimensional- 3.1. Experimental setup ity. Let c(i) be the average bit length per symbol after in the i-th module. Then, c(i)S/d stands for the bits per 300 and 50 speakers are randomly selected from TIMIT [32] frame. By dividing the , (S − o)/f, where o and f training and test datasets, respectively. We consider two types denote the overlap size in samples and the sampling rate, re- of inputs in time-domain: raw PCM and LPC residuals. For spectively, the bitrates per module add up to the total bitrate: the raw PCM input, the data is normalized to have a unit vari- PN fcS ance, and then directly fed to the model. For the LPC residual ξLPC + i=1 (S−o)d , where the overhead to transmit LPC co- input, we conduct a spectral envelope estimation on the raw sig- efficients is ξlpc=2.4kbps, which is 0 for the case with raw PCM signals as the input. nals to get LPC residuals and corresponding coefficients. The By having the entropy control scheme proposed in LPC residuals are modeled by the proposed end-to-end CMRL Kankanahalli-Net as the baseline to keep a specific bitrate, we pipeline, while the LPC coefficients are quantized and sent di- further enhance the coding efficiency by employing the Huff- rectly to the receiver side at 2.4 kbps. The decoding process re- man coding scheme on the vectors. Aside from encoding each covers the speech signal based on the LPC synthesis procedure symbol (i.e., the softmax result) separately, encoding short se- using the LPC coefficients and the decoded residual signals. quences can further leverage the temporal correlation in the se- We consider four bitrate cases: 8.85 kbps, 15.85 kbps, ries of quantized symbols, especially when the entropy is al- 19.85 kbps and 23.85 kbps. All convolutional layers in CMRL ready low [29] [30]. We found that encoding a short symbol se- use 1-D kernel with the size of 9 and the Leaky Relu activation. quence of adjacent symbols, i.e., two symbols, can lower down CMRL hosts two modules: each module is with the topology the average bit length further in the low bitrates. as in Table 1. Each residual learning block contains two bottle- neck structures with the dilation rate of 1 and 2. Note that for 3. Experiments the lowest bitrate case, the second encoder downsamples each window to 128 symbols. The learning rate is 0.0001 to train the We first show that for the raw PCM input CMRL outperforms first module, and 0.00002 for the second module. Finetuning AMR-WB and Kankanahalli-Net in terms of objective metrics uses 0.00002 as the learning rate, too. Each window contains in the experimental setup proposed in [14], where the use of 512 samples with the overlap size of 32. We use Adam opti- LPC was not tested. Therefore, for the subjective quality, we mizer [33] with the batch size of 128 frames. Each module is perform MUSHRA tests [31] to show that CMRL with an LPC trained for 30 epochs followed by finetuning until the entropy is (a) 8.85kbps (b) 15.85kbps (c) 19.85kbps (d) 23.85kbps (e) 23.85kbps

Figure 4: MUSHRA test results. From (a) to (d): the performance of CMRL on raw and LPC residual input signals compared against AMR-WB at different bitrates. (e) An additional test shows that the performance of CMRL with the LPC input competes with OPUS, which is known to outperform AMR-WB in 23.85kbps.

1st module training 2nd module training fine-tuning 25 3.3. Subject test

20 Figure 4 shows MUSHRA test results done by six audio experts 20 on 10 decoded test samples randomly selected with gender eq- uity. At 19.85 kbps and 23.85 kbps, CMRL with LPC residual 15 inputs outperforms AMR-WB. At lower bitrates though, AMR- WB starts to work better. CMRL on raw PCM is found less 10 favored by listeners. We also compare CMRL with OPUS in the high bitrate where OPUS is known to perform well, and 5 find that CMRL slightly outperforms OPUS1. Model parameters (million) 1.5 0.9 0 3.4. Model complexity K-Net CMRL CMRL

WaveNet WaveNet The cross-module residual learning simplifies the topology of (a) (b) each component module. Hence, CMRL has less than 5% of the model parameters compared to the WaveNet based codec [16], and outperforms Kankanahalli-Net with 40% less model Figure 5: (a) SNR and PESQ per epoch (b) model complexity parameters. Figure 5 (b) summarizes the comparison. Table 2: SNR and PESQ scores on raw PCM test signals. 4. Conclusion Metrics SNR (dB) PESQ In this work, we demonstrated that CMRL as a lightweight Bitrate (kbps) 8.85 15.85 19.85 23.85 8.85 15.85 19.85 23.85 model carrier for DNN based speech codecs can compete with AMR-WB 9.82 11.93 12.46 12.73 3.41 3.99 4.09 4.13 the industrial standards. By cascading two end-to-end mod- K-Net - - - - 3.63 4.13 4.22 4.30 ules, CMRL achieved a higher PESQ score at 15.85 kbps than CMRL 13.45 16.35 17.18 17.33 3.69 4.21 4.34 4.42 AMR-WB at 23.85 kbps. We also showed that CMRL can con- sistently outperform a state-of-the-art DNN codec in terms of PESQ. CMRL is compatible with LPC, by having it as the first within the target range. pre-processing module and by using its residual signals as the input. CMRL, coupled with LPC, outperformed AMR-WB in 3.2. Objective test 19.85 kbps and 23.85 kbps, and worked better than OPUS at 23.85 kbps in the MUSHRA test. More work is required to We evaluate 500 decoded utterances in terms of SNR and PESQ examine other module structures to further improve the perfor- with wide band extension (P862.2) [34]. Figure 5 (a) shows the mance at low bitrates. effectiveness of CMRL against a system with a single module in terms of SNR and PESQ values per epoch. The single mod- ule is with three more bottleneck blocks and twice more codes 5. Acknowledgements for a fair comparison. It is trained for 90 epochs with other hy- This work was supported by Institute for Information & com- perparameters are unaltered. For both SNR and PESQ, the plot munications Technology Promotion (IITP) grant funded by the shows a noticeable performance jump as the second module is Korea government (MSIT) (2017-0-00072, Development of included, followed by another jump by finetuning. Audio/ Coding and Light Field Media Fundamental Tech- Table 2 compares CMRL with AMR-WB and nologies for Ultra Realistic Tera-media). The authors also ap- Kankanahalli-Net at four bitrates for the raw PCM input preciate Srihari Kankanahalli for valuable discussion and shar- case. CMRL achieves both higher SNR and PESQ at all four ing audio samples. bitrate cases. Note that the SNR for CMRL at 8.85 kbps is greater than AMR-WB at 23.85 kbps. CMRL also gives a 1Samples are available at http://saige.sice.indiana. better PESQ score at 15.85 kbps than AMR-WB at 23.85 kbps. edu/research-projects/neural-audio-coding 6. References [21] Y. L. Cristina Garbacea, Aaron van den Oord, “High-quality speech coding with samplernn,” in Proc. ICASSP, 2019. [1] P. Noll, “MPEG coding,” IEEE signal processing magazine, vol. 14, no. 5, pp. 59–81, 1997. [22] N. Johnston, D. Vincent, D. Minnen, M. Covell, S. Singh, T. Chi- nen, S. Jin Hwang, J. Shor, and G. Toderici, “Improved lossy im- [2] K. Brandenburg and G. Stoll, “ISO/MPEG-1 audio: A generic age compression with priming and spatially adaptive bit rates for standard for coding of high-quality digital audio,” Journal of the recurrent networks,” in Proceedings of the IEEE Conference on Audio Engineering Society, vol. 42, no. 10, pp. 780–792, 1994. Computer Vision and Pattern Recognition, 2018, pp. 4385–4393. Techniques and standards for image, [3] K. R. Rao and J. J. Hwang, [23] G. Schuller, B. Yu, and D. Huang, “Lossless coding of audio sig- video, and audio coding . Prentice Hall New Jersey, 1996, vol. 70. nals using cascaded prediction,” in 2001 IEEE International Con- [4] D. O’Shaughnessy, “Linear predictive coding,” IEEE potentials, ference on Acoustics, Speech, and Signal Processing. Proceedings vol. 7, no. 1, pp. 29–32, 1988. (Cat. No. 01CH37221), vol. 5. IEEE, 2001, pp. 3273–3276. [5] B. S. Atal and M. R. Schroeder, “Adaptive predictive coding of [24] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning speech signals,” Bell System Technical Journal, vol. 49, no. 8, pp. for image recognition,” in Proceedings of the IEEE conference on 1973–1986, 1970. computer vision and pattern recognition, 2016, pp. 770–778. [6] M. Schroeder and B. Atal, “Code-excited linear prediction [25] E. Agustsson, F. Mentzer, M. Tschannen, L. Cavigelli, R. Timofte, (CELP): High-quality speech at very low bit rates,” in Acoustics, L. Benini, and L. V. Gool, “Soft-to-hard vector quantization for Speech, and Signal Processing, IEEE International Conference end-to-end learning compressible representations,” in Advances on ICASSP’85., vol. 10. IEEE, 1985, pp. 937–940. in Neural Information Processing Systems, 2017, pp. 1141–1151. [7] B. Bessette, R. Salami, R. Lefebvre, M. Jelinek, J. Rotola-Pukkila, [26] W. Shi, J. Caballero, F. Huszar,´ J. Totz, A. P. Aitken, R. Bishop, J. Vainio, H. Mikkola, and K. Jarvinen, “The adaptive multirate D. Rueckert, and Z. Wang, “Real-time single image and video wideband speech codec (amr-wb),” IEEE transactions on speech super-resolution using an efficient sub-pixel convolutional neural and audio processing, vol. 10, no. 8, pp. 620–636, 2002. network,” in Proceedings of the IEEE conference on computer vi- sion and pattern recognition, 2016, pp. 1874–1883. [8] J.-M. Valin, G. Maxwell, T. B. Terriberry, and K. Vos, “High- quality, low-delay music coding in the opus codec,” arXiv preprint [27] K. Tan, J. Chen, and D. Wang, “Gated residual networks with arXiv:1602.04845, 2016. dilated for supervised speech separation,” in Proc. ICASSP, 2018. [9] J. Makhoul, S. Roucos, and H. Gish, “Vector quantization in speech coding,” Proceedings of the IEEE, vol. 73, no. 11, pp. [28] D. A. Huffman, “A method for the construction of minimum- 1551–1588, 1985. redundancy codes,” Proceedings of the IRE, vol. 40, no. 9, pp. 1098–1101, 1952. [10] M. Kim and P. Smaragdis, “Bitwise neural networks,” in Inter- national Conference on Machine Learning (ICML) Workshop on [29] I. H. Witten, R. M. Neal, and J. G. Cleary, “ for Resource-Efficient Machine Learning, Jul 2015. ,” Communications of the ACM, vol. 30, no. 6, pp. 520–541, 1987. [11] L. Deng, M. Seltzer, D. Yu, A. Acero, A. Mohamed, and G. Hin- ton, “Binary coding of speech spectrograms using a deep auto- [30] L. R. Welch and E. R. Berlekamp, “Error correction for algebraic encoder,” in Eleventh Annual Conference of the International block codes,” Dec. 30 1986, uS Patent 4,633,470. Speech Communication Association , 2010. [31] R. B. ITU-R, “1534-1,method for the subjective assessment of in- [12] M. Cernak, A. Lazaridis, A. Asaei, and P. Garner, “Composition termediate quality levels of coding systems (mushra),” Interna- of deep and spiking neural networks for very low speech tional Union, 2003. arXiv preprint arXiv:1604.04383 coding,” , 2016. [32] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pal- [13] W. B. Kleijn, F. S. Lim, A. Luebs, J. Skoglund, F. Stimberg, lett, N. L. Dahlgren, and V. Zue, “TIMIT acoustic-phonetic con- Q. Wang, and T. C. Walters, “Wavenet based low rate speech cod- tinuous speech corpus,” Linguistic Data Consortium, Philadel- ing,” arXiv preprint arXiv:1712.01120, 2017. phia, 1993. [14] S. Kankanahalli, “End-to-end optimized speech coding with deep [33] D. P. Kingma and J. Ba, “Adam: A method for stochastic opti- neural networks,” arXiv preprint arXiv:1710.09064, 2017. mization,” arXiv preprint arXiv:1412.6980, 2014. [15] L.-J. Liu, Z.-H. Ling, Y. Jiang, M. Zhou, and L.-R. Dai, “Wavenet [34] A. Rix, J. Beerends, M. Hollier, and A. Hekstra, “Perceptual eval- with limited training data for voice conversion,” in Proc. uation of speech quality (PESQ)-a new method for speech qual- Interspeech, 2018, pp. 1983–1987. ity assessment of telephone networks and codecs,” in Acoustics, Speech, and Signal Processing, 2001. Proceedings.(ICASSP’01). [16] Y. L. Cristina Garbacea, Aaron van den Oord, “Low bit-rate 2001 IEEE International Conference on, vol. 2. IEEE, 2001, pp. speech coding with vq-vae and a wavenet decoder,” in Proc. 749–752. ICASSP, 2019. [17] R. Salami, C. Laflamme, B. Bessette, and J.-P. Adoul, “ITU-TG. 729 annex a: reduced complexity 8 kb/s CS-ACELP codecfor dig- ital simultaneous voice and data,” IEEE Communications Maga- zine, vol. 35, no. 9, pp. 5663, 1997. [18] G. Recommendation, “722.2:wideband coding of speech at around 16 kbit/s using adaptive multi-rate wideband (amr-wb),” 2003. [19] M. Neuendorf, M. Multrus, N. Rettelbach, G. Fuchs, J. Robil- liard, J. Lecomte, S. Wilde, S. Bayer, S. Disch, C. Helmrich et al., “MPEG unified speech and audio coding-the iso/mpeg standard for high-efficiency audio coding of all content types,” in Audio Engineering Society Convention 132. Audio Engineering Soci- ety, 2012. [20] S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, and Y. Bengio, “Samplernn: An unconditional end-to-end neural audio generation model,” arXiv preprint arXiv:1612.07837, 2016.