A Packetization and Variable Bitrate Interframe Compression Scheme For Vector Quantizer-Based Distributed Speech Recognition

Bengt J. Borgström and Abeer Alwan

Department of Electrical Engineering, University of California, Los Angeles
[email protected], [email protected]

Abstract

We propose a novel packetization and variable bitrate compression scheme for DSR source coding, based on the Group of Pictures concept from video coding. The proposed algorithm simultaneously packetizes and further compresses source coded features using the high interframe correlation of speech, and is compatible with a variety of VQ-based DSR source coders. The algorithm approximates vector quantizers as Markov chains, and empirically trains the corresponding probability parameters. Feature frames are then compressed as I-frames, P-frames, or B-frames, using Huffman tables. The proposed scheme can perform lossless compression, but is also robust to lossy compression through VQ pruning or frame puncturing. To illustrate its effectiveness, we applied the proposed algorithm to the ETSI DSR source coder. The algorithm provided compression rates of up to 31.60% with negligible recognition accuracy degradation, and rates of up to 71.15% with performance degradation under 1.0%.

Index Terms: distributed speech recognition, speech coding, VQ

1. Introduction

Speech recognition systems involving separated clients and servers have become popular due to the reduced computational load at the client and the ease of model updating at the server. However, such systems, referred to as distributed speech recognition (DSR) systems, introduce the need for additional communication between the clients and server, either over wireless channels or over IP networks.

Thus, bandwidth restrictions immediately become an issue for DSR systems, since multiple clients must simultaneously communicate with a server over a single channel or over a single network. Therefore, efficient compression of transmitted speech recognition features is an important topic of research.
Many DSR compression algorithms involve vector quantization techniques to compress speech features prior to transmission. This paper introduces a compression and packetization scheme, based on the GOP concept from video coding [1], that further compresses the source coded features. The algorithm organizes speech feature frames into Groups of Frames (GOF) structures, comprised of intra-coded I-frames, predictively-coded P-frames, or bidirectionally predictively-coded B-frames. The algorithm then applies frame-dependent Huffman coding to compress the individual feature frames. Furthermore, the proposed scheme is compatible with a variety of VQ-based DSR source coding algorithms.

Existing packetization schemes do not aim to compress signals, but rather to organize information prior to transmission. For example, the ETSI [2] packetization scheme simply concatenates 2 adjacent 44-bit source coded frames, and transmits the resulting signal along with header information. The proposed algorithm performs further compression using interframe correlation. The high time correlation present in speech has previously been studied for speech coding [3] and DSR coding [4].

The proposed algorithm allows for lossless compression of the quantized speech features. However, the algorithm is also robust to various degrees of lossy compression, either through VQ pruning or frame puncturing. VQ pruning refers to the exclusion of low probability VQ codebook labels prior to Huffman coding, and thus excludes longer Huffman codewords; frame puncturing refers to the non-transmission of certain frames, which drastically reduces the final bitrate.

2. Vector Quantizer-Based Source Coding for Distributed Speech Recognition

There exists a variety of source coding techniques that have been presented and analyzed for DSR applications, many of which involve traditional speech processing features. For example, the ETSI standard [2] uses a subset of the Mel-Frequency Cepstral Coefficients (MFCCs), as does [5]. In [6], the authors provide performance analysis for various subsets of the Linear Prediction Cepstral Coefficients (LPCCs) and the Line Spectral Pairs (LSPs).

Due to the inherent bandwidth-restrictive nature of DSR systems, quantization of speech features is a major issue. It has been shown in [4] and [7] that speech features such as those previously discussed display high correlation both across time and across coefficients. Thus VQ techniques offer the ability to efficiently reduce the bitrate of such speech features.

Though optimal quantization of vectors is generally obtained through quantization of the entire vector, as the dimension increases, the computational load and required memory of such an algorithm become costly [8], which is especially problematic on distributed devices. Thus, suboptimal VQ techniques such as Split VQ (SVQ) or Multi-Stage VQ (MSVQ) algorithms have been proposed for various speech coding applications [8].

The output of a general VQ-based DSR source coding algorithm can thus be given as:

f = SC(s) ,   (1)

where SC(·) represents the source coding function. Also, s = [s_1, \ldots, s_{N_{sc}}]^T is the original unquantized speech feature vector, where the s_i are the subvectors of s chosen for suboptimal vector quantization and N_sc is the number of subvectors.

Let a given VQ scheme be referred to as VQ_k, and let the corresponding vector of codebook labels, or states, be represented by:

c^k = [c_1^k, c_2^k, \ldots, c_{N_k}^k]^T ,   (2)

where N_k is the number of codebook labels.
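To make Equations 1 and 2 concrete, the following Python sketch (not from the paper) implements a generic split-VQ encoder of the kind described above: each subvector s_i is matched to its nearest entry in its own codebook, and the frame is represented by the resulting label vector. The codebook sizes, dimensions, and data are illustrative placeholders, not the ETSI allocation.

```python
# Minimal split-VQ source coder, Eq. (1)-(2): quantize each subvector s_i
# with its own codebook and emit the vector of codebook labels.
# All sizes and data here are illustrative, not the ETSI tables.
import numpy as np

def split_vq_encode(subvectors, codebooks):
    """Return the codebook label (state) chosen for each subvector s_i."""
    labels = []
    for s_i, cb in zip(subvectors, codebooks):
        dists = np.sum((cb - s_i) ** 2, axis=1)  # nearest-neighbor search
        labels.append(int(np.argmin(dists)))     # one entry of c in Eq. (2)
    return labels

# toy usage: 7 two-dimensional subvectors, one 64-entry codebook each
rng = np.random.default_rng(0)
codebooks = [rng.standard_normal((64, 2)) for _ in range(7)]
frame = [rng.standard_normal(2) for _ in range(7)]
f = split_vq_encode(frame, codebooks)  # the source-coded frame of Eq. (1)
```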

Also, let the vector quantization of a given vector s_i by the given scheme VQ_k to the new vector corresponding to codebook label c_j^k be represented by:

\hat{s}_j^k = VQ_k(s_i) .   (3)

Furthermore, define the codebook label function CB(·), which returns the codebook label of a given quantized vector, as:

c_j^k = CB(\hat{s}_j^k) .   (4)

Since there is a one-to-one relationship between quantized vectors and codebook labels within a given VQ scheme, there exists an inverse codebook label function such that:

\hat{s}_j^k = CB^{-1}(c_j^k) .   (5)

Since a VQ scheme represents a discrete group of codebook labels between which a quantized signal transitions in time, the VQ scheme can be interpreted as a discrete hidden Markov model (HMM). In this case, the given VQ scheme, say VQ_k, can be completely characterized by its corresponding state probabilities, \pi_i^k, and its corresponding transitional probabilities, a_{ij}^k, for 1 ≤ i, j ≤ N_k, where N_k represents the number of codebook labels in VQ_k [9]. The state probability vector, \vec{\pi}_k, and state transitional matrix, P_k, can be formed as:

\vec{\pi}_k = [\pi_1^k, \pi_2^k, \ldots, \pi_{N_k}^k] ,   (6)

and

P_k = \begin{bmatrix} a_{11}^k & \cdots & a_{1 N_k}^k \\ \vdots & \ddots & \vdots \\ a_{N_k 1}^k & \cdots & a_{N_k N_k}^k \end{bmatrix} .   (7)

In order to parameterize VQ_k, the probability vector \vec{\pi}_k and the transitional probability matrix P_k must be estimated. Using training data, this can be done empirically [10]:

\pi_i^k \approx \frac{\text{no. of samples quantized to } c_i^k}{\text{no. of total samples}} ,   (8)

and

a_{ij}^k \approx \frac{\text{no. of samples transitioning from } c_i^k \text{ to } c_j^k}{\text{no. of total samples quantized to } c_i^k} .   (9)
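The counting estimates of Equations 8 and 9 are straightforward to implement. The sketch below (an assumption of how such training could look, not the authors' code) accumulates state and transition counts over a sequence of codebook labels and normalizes them; the synthetic label sequence stands in for labels obtained by source coding a training corpus.

```python
# Empirical training of the Markov chain parameters of one VQ scheme:
# pi (Eq. 8) from state counts, P (Eq. 9) from transition counts.
import numpy as np

def train_markov_params(label_seq, n_states):
    counts = np.zeros(n_states)
    trans = np.zeros((n_states, n_states))
    for c in label_seq:
        counts[c] += 1.0
    for c_prev, c_next in zip(label_seq[:-1], label_seq[1:]):
        trans[c_prev, c_next] += 1.0
    pi = counts / counts.sum()
    row_sums = trans.sum(axis=1, keepdims=True)
    # states never visited in training fall back to a uniform row
    P = np.where(row_sums > 0, trans / np.maximum(row_sums, 1.0), 1.0 / n_states)
    return pi, P

rng = np.random.default_rng(1)
labels = rng.integers(0, 64, size=10_000)   # stand-in training labels
pi_k, P_k = train_markov_params(labels, 64)
assert np.allclose(P_k.sum(axis=1), 1.0)    # each row is a distribution
```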
3. Group of Frames (GOF) Packetization

The Group of Frames (GOF) packetization system introduced in this paper is based on the Group of Pictures (GOP) concept used for video coding [1]. The GOP concept categorizes video frames as intra-coded frames (I-frames), predictively-coded frames (P-frames), and bidirectionally predictively-coded frames (B-frames). Within each GOP, the I-frame is compressed as a sole image. P-frames are predicted based on the I-frame and prior P-frames using block motion vectors, and the error signal is compressed. Finally, B-frames are predicted based on the I-frame and both prior and future P-frames, and the error signal is compressed. Various GOP structures allow for various compression rates, but also induce certain decoding delays and computational loads.

3.1. GOF Intra-Coded Frames

Similar to video coding utilizing the GOP structure, GOF packetized I-frames are coded as sole frames. Using the notation derived in Section 2, Equation 1 can be expressed as:

f = [CB(VQ_1(s_1)), \ldots, CB(VQ_{N_{sc}}(s_{N_{sc}}))]^T = [c_{L_1}^1, \ldots, c_{L_{N_{sc}}}^{N_{sc}}]^T .   (10)

The variable bitrate I-frame packetization of speech frame f can then be performed by Huffman encoding [11] of each element c^m separately, for 1 ≤ m ≤ N_sc. Distinct codeword tables are created for each c^m using the corresponding probability vectors:

h_I^m = [\pi_1^m, \pi_2^m, \ldots, \pi_{N_m}^m] .   (11)

The N_sc chosen variable rate codewords created by the Huffman algorithm are then concatenated to form the bitstream representing the current intra-coded speech frame. Once the transmitted bitstream is received at the server, Huffman decoding is performed to extract the VQ codebook labels. The reconstructed feature vector, \tilde{s}, can then be determined as:

\tilde{s} = [CB^{-1}(c_{L_1}^1), CB^{-1}(c_{L_2}^2), \ldots, CB^{-1}(c_{L_{N_{sc}}}^{N_{sc}})]^T .   (12)
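As an illustration of Section 3.1, the sketch below builds one Huffman table per subvector stream from the stationary probabilities of Equation 11 and concatenates the resulting codewords into an I-frame bitstream. The heap-based construction is the textbook algorithm [11]; the probability vectors and labels are toy values, not trained ETSI statistics.

```python
# I-frame coding: per-stream Huffman tables built from h_I^m (Eq. 11),
# codewords concatenated into the frame bitstream (toy example).
import heapq
import itertools

def huffman_table(probs):
    """Map each symbol index to its Huffman codeword (a bit string)."""
    counter = itertools.count()  # tie-breaker keeps heap entries comparable
    heap = [(p, next(counter), {i: ""}) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, codes0 = heapq.heappop(heap)
        p1, _, codes1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes0.items()}
        merged.update({s: "1" + c for s, c in codes1.items()})
        heapq.heappush(heap, (p0 + p1, next(counter), merged))
    return heap[0][2]

h_I = [[0.5, 0.25, 0.125, 0.125]] * 3   # toy h_I^m for N_sc = 3 streams
frame_labels = [0, 2, 1]                # toy VQ labels c^m for one frame
tables = [huffman_table(h) for h in h_I]
bitstream = "".join(t[c] for t, c in zip(tables, frame_labels))
print(bitstream)                        # "0" + "110" + "10" -> "011010"
```

At the server, the same tables (reproducible from the trained probabilities) are used to decode the bitstream back to labels, which Equation 12 then maps to quantized feature vectors.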

3.2. GOF Predictively-Coded Frames

GOF packetized P-frames are coded as predictions of the most recent I-frame or P-frame. Therefore, the packetization of speech frame f is dependent on the transitional probabilities defined in the matrices P_m, for 1 ≤ m ≤ N_sc. The packetization is also dependent on the degree of separation between the current P-frame and the frame on which it is being predicted.

The (i, j)-th element of P_m, given by [P_m]_{i,j}, represents the probability of transitioning from state i to state j within the VQ scheme VQ_m in 1 iteration. This relationship can be extended to the n-iteration case: [P_m^n]_{i,j} represents the probability of transitioning from state i to state j in n transitions, where P_m^n represents the n-th power of the matrix P_m.

Just as in the I-frame case, P-frames are compressed using separate Huffman codes for each vector quantized speech feature in the current frame. However, the Huffman probability vector for VQ_m in the P-frame case is given by:

h_P^m = [[P_m^n]_{i,1}, [P_m^n]_{i,2}, \ldots, [P_m^n]_{i,N_m}] ,   (13)

given that the previous I-frame or P-frame VQ_m state was state i, and that the current frame is separated from the dependent frame by n iterations. Once again, the Huffman codewords are concatenated to form the bitstream representing the compressed P-frame. After Huffman decoding is performed to extract the transmitted VQ codebook labels, the reconstructed speech feature vector is determined using Equation 12.

3.3. GOF Bidirectionally Predictively-Coded Frames

GOF packetized B-frames are coded as predictions of both the most recent I-frame or P-frame and the nearest future P-frame. The packetization of B-frames is therefore dependent on the transitional probabilities defined in the matrices P_m, as well as the degrees of separation from the prior and the future dependent frames.

Just as in the I-frame and P-frame cases, Huffman codes are created to encode each vector quantized feature separately. In the B-frame case, the Huffman probability vector for VQ_m is given by:

h_B^m = [[P_m^n]_{i,1} [P_m^q]_{1,j}, \ldots, [P_m^n]_{i,N_m} [P_m^q]_{N_m,j}] ,   (14)

given that the previous dependent frame was in state i with a separation of n iterations, and the future dependent frame will be in state j with a separation of q iterations. Finally, the chosen codewords are concatenated to create the bitstream for the current packetized B-frame. Similarly to the I- and P-frame cases, the codebook labels are extracted through Huffman decoding of the bitstream, and the speech features are reconstructed via Equation 12.
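Both conditional probability vectors reduce to simple operations on powers of the trained transition matrix. The following sketch (assuming a generic row-stochastic P_m) computes h_P via Equation 13 and h_B via Equation 14 with numpy's matrix_power; the B-frame vector is renormalized here, which leaves the resulting Huffman tree unchanged since uniform scaling preserves the relative magnitudes the algorithm merges on.

```python
# Conditional Huffman probability vectors for P-frames (Eq. 13) and
# B-frames (Eq. 14), from powers of the transition matrix P_m.
import numpy as np
from numpy.linalg import matrix_power

def p_frame_probs(P, i, n):
    """Eq. (13): row i of P^n, the label distribution n frames past state i."""
    return matrix_power(P, n)[i, :]

def b_frame_probs(P, i, n, j, q):
    """Eq. (14): forward probabilities from state i (n steps) times
    backward probabilities into state j (q steps), elementwise."""
    h = matrix_power(P, n)[i, :] * matrix_power(P, q)[:, j]
    return h / h.sum()  # assumes state j is reachable, so h.sum() > 0

# toy usage with a random 8-state row-stochastic matrix
rng = np.random.default_rng(2)
P = rng.random((8, 8))
P /= P.sum(axis=1, keepdims=True)
h_P = p_frame_probs(P, i=3, n=2)            # feeds the P-frame Huffman table
h_B = b_frame_probs(P, i=3, n=1, j=5, q=2)  # feeds the B-frame Huffman table
```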
3.4. Lossy Compression Through VQ Pruning

The packetization scheme introduced in Sections 3.1 through 3.3 provides lossless compression of source coded features. In general, for many information sources, lossless compression is ideal and at times necessary. However, due to the inherent bandwidth restrictions placed on DSR systems, lossy compression may be acceptable if performance degradation is not significant. Thus, we introduce an algorithm for lossy compression based on the GOF packetization structure and VQ pruning.

The lossy GOF algorithm is identical to the lossless scheme described in Section 3, except that certain codewords are excluded from each of the VQ schemes for each frame. A minimum probability, p_min, is chosen, which controls the amount of pruning, and therefore determines the amount of compression and the resulting performance degradation.

It is important to note that pruning can only be done on VQ tables that are not relied on for further compression. For example, in a GOF structure involving I-, P-, and B-frames, pruning can only be done on B-frames.

Pruning entails the exclusion from a VQ table of those entries for which the corresponding Huffman probability does not exceed the given minimum probability. For a given B-frame to be packetized, a Huffman codebook is created for each quantized feature using the probability vector h_B^m, as shown in Equation 14. The resulting quantized codebook label using the pruned VQ table, given by \widetilde{VQ}_m, can then be determined by:

\tilde{c}_i^m = \arg\min_{j \,:\, h_B^m(c_j^m) > p_{min}} \left\| CB^{-1}(c_j^m) - CB^{-1}(c_i^m) \right\|^2 ,   (15)

where c_i^m represents the original quantized state. Note that:

h_B^m(c_i^m) > p_{min} \iff \tilde{c}_i^m = c_i^m .   (16)

Huffman encoding of the quantized elements of the current frame is then performed on the elements \tilde{c}_i^m.

3.5. Lossy Compression Through Frame Puncturing

The results of the lossy compression and packetization algorithm described earlier are highly dependent on the predetermined value of the minimum probability p_min. As p_min is increased, the compression rate increases, but the performance of the system may degrade. We now introduce the concept of frame puncturing, which can be interpreted as VQ pruning as p_min → 1.0.

In a punctured frame, if the minimum probability p_min approaches 1.0, there will be only one remaining valid codebook label for each suboptimal VQ scheme. Thus, there will be only one valid Huffman codeword for every subvector within each punctured frame. However, since it is known at both the transmitter and the receiver that the most likely Huffman codeword will be chosen for every subvector, there is no need to transmit the actual bitstream for the punctured frames. Thus, the bitrate can be drastically reduced.

Assume that B-frames were punctured; then, for given prior and future degrees of separation, the corresponding transitional probability vector is given by h_B^m, as shown in Equation 14. The reconstructed speech feature vector, \tilde{s}, can then be determined as:

\tilde{s} = [CB^{-1}(c_{max}^1), \ldots, CB^{-1}(c_{max}^{N_{sc}})]^T ,   (17)

where

c_{max}^m = \arg\max_i h_B^m(c_i^m) .   (18)
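A possible implementation of the two lossy mechanisms is sketched below: pruning remaps a label to the nearest surviving codebook entry in the quantized-vector domain (Equation 15), while puncturing simply selects the most probable label at the receiver (Equation 18), so no bits need be sent. The codebook, probability vector, and p_min value are illustrative stand-ins for CB^{-1}, h_B^m, and the tuned threshold.

```python
# VQ pruning (Eq. 15) and frame puncturing (Eq. 18), toy version.
import numpy as np

def prune_remap(c_i, h_B, codebook, p_min):
    """Eq. (15): nearest valid label among those with h_B[j] > p_min."""
    valid = np.flatnonzero(h_B > p_min)
    d = np.sum((codebook[valid] - codebook[c_i]) ** 2, axis=1)
    return int(valid[np.argmin(d)])

def puncture(h_B):
    """Eq. (18): receiver-side argmax; nothing is transmitted."""
    return int(np.argmax(h_B))

rng = np.random.default_rng(3)
codebook = rng.standard_normal((64, 2))   # stand-in CB^{-1} lookup table
h_B = rng.random(64)
h_B /= h_B.sum()                          # stand-in for Eq. (14)
c_tilde = prune_remap(c_i=10, h_B=h_B, codebook=codebook, p_min=0.01)
# Eq. (16): if label 10 itself survives pruning, it maps to itself
assert h_B[10] <= 0.01 or c_tilde == 10
c_max = puncture(h_B)                     # the only label the decoder can pick
```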
3.6. Delay and Complexity Analysis

The majority of the computations required for lossless compression and packetization in the proposed algorithm results from creating frame-dependent Huffman tables, which is of order O(n log n) [11], where n is the number of codebook entries. During lossy compression through VQ pruning, an additional search step of order O(n) is required at the transmitter. During frame puncturing, however, the computational load is decreased, since the creation of the frame-dependent Huffman tables is replaced by a simple maximization search of order O(n) at the transmitter.

The GOF structure requires a certain buffer, since reconstruction of B-frames is dependent on future P-frames. Define N_p as the number of P-frames per packet, and define N_b as the number of B-frames per P-frame. The required buffer for transmission of the proposed compression and packetization scheme is given by d_GOF = (N_b + 1) frames.

4. Experimental Results

The algorithms developed in this paper were generalized to be compatible with various VQ-based DSR source coders. However, for testing purposes, we used a source coding scheme similar to the ETSI standard coder [2][5]. We extracted the first 13 MFCCs along with the log-energy value, to form a 14-element feature vector. The source coder used an SVQ algorithm that allocated 8 bits to the log-energy and 0th MFCC coefficient pair, and 6 bits to each consecutive pair. The speech windowing frequency was set to f_w = 100 Hz, resulting in a payload transmission rate of 4.400 kbps as in [2].

For the ETSI front end, N_sc = 7, and each s_i corresponds to an MFCC vector pair. Furthermore, N_k is either equal to 256 or 64, depending on the number of bits allocated to the corresponding s_i. Lossy compression was achieved by VQ pruning and B-frame puncturing. Finally, the decoding delay for the ETSI front end with the [3, 3] GOF structure is determined to be d_GOF = 40 ms. However, for GOF structures which include B-frame puncturing, the induced decoding delay is d_GOF = 0 ms.
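As a quick sanity check of the figures above (our arithmetic, not code from the paper), the SVQ bit allocation and framing rate reproduce the 4.400 kbps payload, and the (N_b + 1)-frame buffer of Section 3.6 gives the 40 ms delay quoted for the [3, 3] GOF structure:

```python
# Verify the payload bitrate and buffering delay for the ETSI-like setup.
bits_per_frame = 8 + 6 * 6      # (log-E, MFCC-0) pair + six 6-bit pairs = 44
f_w = 100                       # windowing frequency, Hz
print(bits_per_frame * f_w)     # 4400 bps = 4.400 kbps payload
N_b = 3                         # B-frames per P-frame in the [3, 3] GOF
print((N_b + 1) * 1000 // f_w)  # 4-frame buffer * 10 ms = 40 ms delay
```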

The algorithms discussed in Section 3 were tested on the Aurora-2 database, specifically 1000 training utterances and 500 testing utterances. The Aurora-2 database consists of digit strings spoken by both males and females. The recognition engine used word-level HMMs with 16 states and 3 mixtures per state.

Table 1: Word recognition results for various compression types applied to the ETSI front end [2] and tested on the Aurora-2 database. Results include compression rates (R_c), payload transmission rates, the SNR of the compressed speech features (SNR_q), word recognition accuracy (WAcc), and the absolute difference in word recognition accuracy relative to the ETSI standard (ΔWAcc). Note that the accuracy for unquantized features is 98.47%.

GOF ([N_p, N_b])  | Compression Type          | R_c     | Rate (kbps) | SNR_q (dB) | WAcc    | ΔWAcc
ETSI Standard [2] | None                      | 0.00 %  | 4.400       | ∞          | 98.20 % | 0.00 %
[3, 3]            | Lossless                  | 21.83 % | 3.440       | ∞          | 98.20 % | 0.00 %
[3, 3]            | Pruning, p_min = 5·10^-4  | 31.60 % | 3.006       | 13.57      | 98.20 % | 0.00 %
[3, 3]            | Pruning, p_min = 10^-2    | 41.65 % | 2.567       | 7.87       | 97.46 % | −0.74 %
[3, 3]            | Punctured                 | 71.15 % | 1.270       | 7.69       | 97.36 % | −0.84 %

Table 1 shows the results obtained for the various compression types. The results shown include the compression rate (R_c), the payload transmission rate, the SNR of the compressed speech features (SNR_q) averaged over time and over SVQ channels, the word recognition accuracy (WAcc), and the absolute difference in word recognition accuracy relative to the ETSI standard (ΔWAcc). As can be concluded from Table 1, the proposed algorithm can achieve lossless compression of up to 21.83%. Furthermore, the proposed algorithms can achieve compression rates of up to 31.60% with negligible effect on the recognition accuracy, and can achieve compression rates of up to 71.15% with only a 0.84% drop in recognition accuracy. Note that numerous other rate-performance points can be achieved with the proposed GOF scheme by varying the GOF structure, varying the value of p_min, and using the option of frame puncturing. Furthermore, it is the relative compression rate achieved, R_c, as opposed to the absolute bitrate achieved, that serves as the most accurate metric of success for the proposed scheme, since many DSR source coders operate at various bitrates [4][6].

5. Conclusions

This paper proposes a novel packetization and variable bitrate compression algorithm compatible with VQ-based DSR source coding schemes. The proposed algorithm organizes speech feature frames as Groups of Frames (GOF) structures, and compresses the corresponding frames. Huffman coding is utilized to perform lossless compression of the VQ-based source coded speech features, and the proposed algorithm is also robust to various degrees of lossy compression through VQ pruning or frame puncturing. When applied to the ETSI DSR source coder, the GOF packetization and compression algorithm is shown to provide lossless compression rates of up to 21.83%, and lossy compression rates of up to 71.15% with a performance degradation of only 0.84% in recognition accuracy. Future work includes studying the proposed scheme in the presence of background noise and over a noisy channel.

6. Acknowledgments

This work was supported in part by UC Micro and ST Microelectronics, and by a Fellowship from the Radcliffe Institute for Advanced Study to Professor Alwan.
7. References

[1] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, "Overview of the H.264/AVC Video Coding Standard," IEEE Trans. on Circuits and Systems for Video Technology, vol. 13, no. 7, July 2003.

[2] ETSI ES 201 108 v1.1.2, "Distributed Speech Recognition; Front-end Feature Extraction Algorithm; Compression Algorithms," April 2000.

[3] J. Linden, J. Skoglund, and T. Eriksson, "Improving Predictive Vector Quantizers in Speech Coding Applications," Proc. NORSIG, Helsinki, Finland, 1996.

[4] Q. Zhu and A. Alwan, "An Efficient and Scalable 2D DCT-Based Feature Coding Scheme for Remote Speech Recognition," Proc. ICASSP, vol. 1, May 2001, pp. 113-116.

[5] A. M. Peinado, V. Sanchez, J. L. Perez-Cordoba, and A. J. Rubio, "Efficient MMSE-Based Channel Error Mitigation Techniques. Application to Distributed Speech Recognition over Wireless Channels," IEEE Trans. on Wireless Communications, vol. 4, no. 1, January 2005.

[6] A. Bernard and A. Alwan, "Low-Bitrate Distributed Speech Recognition for Packet-Based and Wireless Communication," IEEE Trans. on Speech and Audio Processing, vol. 10, no. 8, November 2002.

[7] A. Bernard and A. Alwan, "Source and Channel Coding for Remote Speech Recognition over Error-Prone Channels," Proc. ICASSP, vol. 4, May 2001, pp. 2613-2616.

[8] A. M. Kondoz, Digital Speech: Coding for Low Bitrate Communication Systems, John Wiley and Sons, 2004.

[9] R. G. Gallager, Discrete Stochastic Processes, Kluwer Academic Publishers, 1996.

[10] A. M. Peinado, A. M. Gomez, V. Sanchez, J. L. Perez-Cordoba, and A. J. Rubio, "Packet Loss Concealment Based on VQ Replicas and MMSE Estimation Applied to Distributed Speech Recognition," Proc. ICASSP, vol. 1, pp. 329-332, 2005.

[11] T. M. Cover and J. A. Thomas, Elements of Information Theory, Wiley Interscience, 2006.