Experiment on Multi Channel Speech Decoding: G711, G722

Category: Signal & Audio Processing - SP02

Poster contact Name P4163 Sifa Serdar Ozen: [email protected]

Şifa Serdar ÖZEN (1), Alptekin TEMİZEL (2)
[email protected], [email protected]
(1) Inforcept Networks, Gebze Organize Sanayi Bölgesi, High-Tech Building B1, Gebze, Kocaeli, Turkey
(2) Graduate School of Informatics, Middle East Technical University, Ankara, Turkey

Abstract

In Voice over IP (VoIP) communication, speech is transmitted in coded form in order to preserve network bandwidth. Upon arriving at the receiving side, these packets must be properly decoded in order to feed the application layer. Multi channel speech decoding tries to handle many of these simultaneous VoIP packets, which must be decoded on the fly. Examples of such multi channel decoders include the legally required voice recording systems of contact centers, which provide future reference. The general flowchart of such a multi channel system can be seen in Figure 1. Popular license-free codec choices include g711a, g711u and g722. By default, these codecs produce coded speech packets of 160 bytes for each 20 ms frame interval. This poster presents experimental results of using a GPU in multi channel speech decoding.

Implementation Details

g711a decoding

G711a coding is usually used in Europe in traditional telephone switching systems. Its decoding requires a one-to-one exponential mapping from 8-bit codes to 13-bit PCM, implementing the A-law expansion formula with A usually set to 87.7 ([1][2]). The decoding graph corresponding to this formula is shown in Figure 2.

g722 decoding

G722 is known as the 7 kHz wideband audio codec operating at 64 kbps. It is usually used on LANs, where higher bandwidth is available, to yield enhanced signal quality. The codec uses sub-band adaptive differential pulse code modulation and may be used in three different modes to carry data along with audio ([4]). In this work, mode 1 operation is assumed (there is no data channel and all 8 bits correspond to audio), and the lower sub-band is used in decoding to form 8 kHz sampled, 14-bit PCM decoded output. G722 uses internal states for its quantization steps and prediction filters, so a state vector must also be supplied to the decoder in addition to the 160 bytes of coded data.

Experiment

Experiments are performed on a Debian 7.2 Wheezy 64-bit system with 4 GB RAM, using an AMD Athlon core for CPU calculations and a GTX650Ti Boost based graphics card for GPU calculations. Execution times are calculated to reflect all necessary operations and are taken from the host side as a function call execution time (in the case of CUDA, a wrapper function is called that performs the device operations, which include all memory transfers and kernel executions). For the g722 case, execution time also includes the transfer of state vectors to and from the GPU. All but the last experiment are performed with randomly initialized input data and repeated 1000 times (the experiment for data size 1024000 is iterated 100 times). Displayed values are the mean values of those simulations. Cases that can run in real time (execution time less than 20 ms) are displayed in bold font. Reference CPU code for the g711 and g722 decoders can be found at [5]. In this work, modified and more or less optimized versions are used, and GPU versions are prepared, which can be seen on GitHub [6]. The experiment flow can be seen in Figure 4.
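The measurement procedure described in the Experiment section can be sketched on the host side as follows. This is a minimal C++ sketch under stated assumptions, not the code from [6]: the names `random_frames`, `mean_seconds` and `is_real_time` are hypothetical, the decoder is passed in as any callable (for CUDA this would be the wrapper function that covers memory transfers and kernel execution), and data size is assumed to count channels with one 160-byte frame each.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

constexpr std::size_t kFrameBytes = 160;   // coded bytes per channel per frame
constexpr double kFrameSeconds = 0.020;    // 20 ms frame interval

// Randomly initialized coded input for `channels` simultaneous channels.
std::vector<std::uint8_t> random_frames(std::size_t channels, unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_int_distribution<int> byte(0, 255);
    std::vector<std::uint8_t> data(channels * kFrameBytes);
    for (auto& b : data) b = static_cast<std::uint8_t>(byte(gen));
    return data;
}

// Mean wall-clock time of one decode call, measured from the host side.
// Input generation is kept outside the timed region.
template <typename Decoder>
double mean_seconds(Decoder decode, std::size_t channels, int iterations) {
    assert(iterations > 0);
    double total = 0.0;
    for (int i = 0; i < iterations; ++i) {
        const auto input = random_frames(channels, static_cast<unsigned>(i));
        const auto start = std::chrono::steady_clock::now();
        decode(input);  // hypothetical decoder or CUDA wrapper call
        const auto stop = std::chrono::steady_clock::now();
        total += std::chrono::duration<double>(stop - start).count();
    }
    return total / iterations;
}

// A case runs in real time if one frame is decoded in under 20 ms.
inline bool is_real_time(double mean) { return mean < kFrameSeconds; }
```

A call such as `mean_seconds(decoder, 250, 1000)` would then correspond to one cell of the results table, with `is_real_time` deciding whether that cell meets the 20 ms budget.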

Figure 3 displays the block diagram of the lower sub-band decoder used in the g722 decoding experiment.

g711u decoding

G711u coding is used in North America and Japan. Its decoding requires a one-to-one exponential mapping from 8-bit codes to 14-bit PCM using the μ-law expansion formula, where μ is set to 255 ([1][3]).
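As a reference point, the continuous expansion (decoding) curves commonly given for the two laws, as in [2] and [3], can be written as below. The symbols y (normalized coded value in [-1, 1]) and F^{-1} are notation assumed for this sketch; G.711 [1] itself specifies a segmented, piecewise-linear approximation of these curves.

```latex
% mu-law expansion (g711u), with \mu = 255
F^{-1}_{\mu}(y) = \operatorname{sgn}(y)\,\frac{(1+\mu)^{|y|} - 1}{\mu},
\qquad -1 \le y \le 1

% A-law expansion (g711a), with A as stated in the text
F^{-1}_{A}(y) = \operatorname{sgn}(y)
\begin{cases}
  \dfrac{|y|\,(1+\ln A)}{A}, & |y| < \dfrac{1}{1+\ln A} \\[2ex]
  \dfrac{e^{\,|y|\,(1+\ln A) - 1}}{A}, & \dfrac{1}{1+\ln A} \le |y| \le 1
\end{cases}
```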

The decoding graph of g711u is similar to that of g711a, with a slight modification to represent the mapping to 14 bits (segments). It yields a slightly greater dynamic range at the cost of increased distortion at smaller signal levels.
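A table-free expansion of both G.711 laws might look like the sketch below. It uses the widely known segment-based integer form and is not necessarily the code used in [5] or [6]. The function names are hypothetical, and outputs are left-justified in 16 bits (13-bit A-law values scaled by 8, 14-bit μ-law values scaled by 4). `decode_channels` assumes the coded frames of all channels are stored back to back, 160 bytes each.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kFrameBytes = 160;  // coded bytes per channel per 20 ms frame

// A-law byte -> 16-bit PCM (13-bit value, left-justified).
inline std::int16_t alaw_to_linear(std::uint8_t code) {
    const unsigned v = code ^ 0x55u;              // undo the even-bit inversion
    const unsigned seg = (v & 0x70u) >> 4;        // 3-bit segment number
    int t = static_cast<int>((v & 0x0Fu) << 4);   // 4-bit step within the segment
    t = (seg == 0) ? t + 8 : (t + 0x108) << (seg - 1);
    return static_cast<std::int16_t>((v & 0x80u) ? t : -t);  // sign bit set = positive
}

// mu-law byte -> 16-bit PCM (14-bit value, left-justified).
inline std::int16_t ulaw_to_linear(std::uint8_t code) {
    const unsigned v = static_cast<std::uint8_t>(~code);  // codes are stored inverted
    const unsigned seg = (v & 0x70u) >> 4;                 // 3-bit segment number
    const int t = static_cast<int>((((v & 0x0Fu) << 3) + 0x84u) << seg);  // 0x84 = bias
    return static_cast<std::int16_t>((v & 0x80u) ? (0x84 - t) : (t - 0x84));
}

// Decode one frame for every channel. Each sample is independent of all
// others, which is what makes the task easy to spread over many threads.
inline std::vector<std::int16_t> decode_channels(const std::vector<std::uint8_t>& coded,
                                                 std::size_t channels, bool a_law) {
    assert(coded.size() == channels * kFrameBytes);
    std::vector<std::int16_t> pcm(coded.size());
    for (std::size_t i = 0; i < coded.size(); ++i) {
        pcm[i] = a_law ? alaw_to_linear(coded[i]) : ulaw_to_linear(coded[i]);
    }
    return pcm;
}
```

Because no sample depends on its neighbours, the loop body in `decode_channels` is the part a GPU version would most naturally assign to individual threads; g722 differs in that each channel also carries a state vector from frame to frame.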

Result of Experiment

Execution times are in seconds. Speedup is CPU execution time divided by GPU execution time.

          g711a decoding results        g711u decoding results        g722 decoding results
Data size CPU (s)   GPU (s)   Speedup   CPU (s)   GPU (s)   Speedup   CPU (s)   GPU (s)   Speedup
250       0.00079   0.00058   1.36      0.00105   0.00070   1.50      0.01750   0.00094   18.62
500       0.00157   0.00068   2.31      0.00188   0.00078   2.41      0.03791   0.00105   36.11
1000      0.00290   0.00085   3.41      0.00342   0.00098   3.49      0.07180   0.00142   50.56
2000      0.00624   0.00121   5.16      0.00738   0.00136   5.43      0.13688   0.00203   67.43
4000      0.01226   0.00193   6.35      0.01495   0.00219   6.83      0.29596   0.00358   82.67
8000      0.02466   0.00311   7.93      0.03035   0.00308   9.85      0.56429   0.00583   96.79
16000     0.04880   0.00490   9.96      0.06126   0.00500   12.25     1.13155   0.00983   115.11
32000     0.09717   0.00860   11.30     0.12285   0.00835   14.71     2.26779   0.01786   126.98
64000     0.19421   0.01582   12.28     0.24592   0.01580   15.56     4.54728   0.03294   138.05
128000    0.38900   0.03029   12.84     0.49174   0.03017   16.30     9.10895   0.06339   143.70
256000    0.84400   0.06500   12.98     0.97420   0.05900   16.51     18.78200  0.12800   146.73
512000    1.61400   0.12400   13.02     1.99880   0.11900   16.80     37.15400  0.25000   148.62
1024000   3.20880   0.24200   13.26     3.99774   0.23500   17.01     73.10099  0.48700   150.10

Conclusions

• Decoding multi channel VoIP calls is an inherently parallel task, and using a GPU decreases decode time significantly even for decoders of low compute complexity such as g711a and g711u.
• Speedup increases as compute complexity increases. G722 is a much more complex decoder than g711: decode code for g722 is typically around a hundred lines, whereas g711 decoding takes about 10 lines of source code.
• Decoding of G722 is nearly 20 times faster (speedup 18.62) even in the low-end case of 250 channel decoding, which corresponds to 125 calls, roughly the workload of a small call center.
• Gains are expected to be greater for more complex decoders such as g729, g723, speex and opus, which can be implemented as future work.

References

[1] ITU-T G.711: Pulse code modulation (PCM) of voice frequencies, www.itu.int/rec/T-REC-G.711
[2] http://en.wikipedia.org/wiki/A-law_algorithm
[3] http://en.wikipedia.org/wiki/M-law_algorithm
[4] ITU-T G.722: 7kHz audio-coding within 64kbit/s, www.itu.int/rec/T-REC-G.722
[5] ITU-T G.191: Software tools for speech and audio standardization, www.itu.int/rec/T-REC-G.191
[6] https://github.com/sifaserdarozen/ConcurrentDecoder