Speex: a Free Codec for Free Speech

Speex: A Free Codec For Free Speech Jean-Marc Valin CSIRO ICT Centre, Cnr Vimiera & Pembroke Roads, Marsfield NSW 2122, Australia Xiph.Org Foundation [email protected] Abstract—The Speex project has been started in 2002 without a floating-point unit, allowing Speex to be used to address the need for a free, open-source speech codec. in embedded devices equipped with a fixed-point CPU Speex is based on the Code Excited Linear Prediction or DSP. (CELP) algorithm and, unlike the previously existing The paper is organised as follows. Section II presents Vorbis codec, is optimised for transmitting speech for low overview of Speex. Then, an explanation of the underly- latency communication over an unreliable packet network. This paper presents an overview of Speex, the technology ing coding algorithm (CELP) is provided in Section III. involved in it and how it can be used in applications. The Section IV provides information and tips about using most recent developments in Speex, such as the fixed-point the Speex library in applications. Finally, Section V port, acoustic echo cancellation and noise suppression are discusses recent developments in Speex and Section VI also addressed. concludes this paper. Index Terms—speech coding, voice over IP II. OVERVIEW I. INTRODUCTION The Speex project was stated in February 2002 be- When I started implementing Speex in 2002, there cause there was a need for a speech codec that was was no good free codec that could be used for speech open-source and free from software patents. These are communication. The situation was greatly harming the essential conditions for being used by any open-source development of voice over IP (VoIP) on Linux and software. At that time, the only options for compressing other free operating systems. Unlike the Vorbis1 codec speech were the G.711 codec (also known as µ-law that already existed at that time, the Speex codec is and A-law) and the open-source implementation of the optimised for speech and is designed for low latency aging GSM-FR codec2, although the patent status of the communication over an unreliable packet network. latter is unclear. The Vorbis codec was also available for Speex is based on the popular Code Excited Lin- general audio compression, but it was not really suitable ear Prediction (CELP) algorithm but because of patent for speech. restrictions, some techniques like the use of algebraic Unlike many other speech codecs, Speex is not tar- codes (ACELP) could not be used. Nevertheless, Speex geted at cell phones but rather at voice over IP (VoIP) is able to achieve a quality comparable to proprietary and file-based compression. Because any codec suitable arXiv:1602.08668v1 [cs.SD] 28 Feb 2016 codecs at the same bitrate, while supporting features not for VoIP is also suitable for file-based compression, only found in other speech codecs. These features include the VoIP requirements are taken into account. Version 1.0 variable bitrate, which makes better use of the bits when was released in March 2003 and the bitstream has been encoding to a file, and embedded coding, which provides frozen since that time, which means that all versions an easy way to interface wideband (16 kHz) channels since then are compatible with version 1.0. with legacy narrowband (8 kHz) telephony equipment. Speex is now evolving into a complete toolkit for voice A. Design Decisions over IP (VoIP) development, including algorithms for The use of Speex for VoIP imposes the following noise cancellation, acoustic echo cancellation and adapt- requirements: ive jitter buffering. This allows a developer without any • The frame size and algorithmic delay must be small signal processing knowledge to implement a VoIP client. • Both the encoder and decoder must run in real-time In the mean time, Speex is being ported to architectures with limited resources 1http://www.vorbis.com/ 2http://kbs.cs.tu-berlin.de/~jutta/toast.html • The effect of lost packets during transmission must longer answer is that first, we (Xiph.Org, me) obviously be minimised did not patent Speex. During development, we were • The codec must support both narrowband and wide- careful not to use techniques known to be patented band and we searched through the USPTO database to see • Multiple bit-rates and quality settings must be sup- if anything we were doing was covered. ported to take into account different connection That being said, we cannot offer any absolute warranty speeds regarding patents. No company does that, simply because • Good compression must be achieved while avoiding of the mess the patent system is. Even if you pay for known speech coding patents a codec, you cannot have the guaranty that nobody else The CELP algorithm [1] was chosen because of its owns a (previously unknown) patent on that codec. Even proven track record from low bitrates (e.g. DoD CELP if someone comes forward and claims to have a patent at 4.8 kbps) to high bitrates (e.g. G.728 LD-CELP at on a certain piece of software, it takes years of legal 16 kbps) and because the patents on the base algorithm procedures to sort out. This has been happening recently have now expired. Unfortunately, newer variants such as with the JPEG standard, which has always been thought the use of arithmetic codebooks in ACELP could not be to be free of patents. It is not yet clear what will happen used in Speex because of patent issues. It was decided in this particular case. Sadly, this is how things are when to encode in chucks of 20 ms (frame size) and only use it comes to patents – at least in the United States because 10 ms of additional buffering (look-ahead). every country has different patent laws, which makes the Because Speex is designed for VoIP instead of cell matter even more complicated. phones (like many other speech codecs) it must be robust In any program written that is longer than "Hello to lost packets, but not to corrupted ones – UDP/RTP World" (and even that!), there might be a patent issue, packets either arrive unaltered or they don’t arrive at all. generally a trivial patent (e.g. 1-click shopping) nobody For this reason, we have chosen to restrict the amount of thought about. There are probably patents out there inter-frame dependency to pitch prediction only, without claiming rights on "using a computer for doing useful incurring the overhead of coding frames independently things" or something like that. What this means for as done in the iLBC [2] codec. Speex is that we did our best to avoid known patents As much as we would like it to be true, Speex is and minimise the risk. This is about the best that can be not the most technologically advanced speech codec done when writing software at the moment, regardless available and it was never meant to be. The original goal of whether the software is Free or proprietary. was to achieve the same quality as state-of-the-art codecs III. SPEEX AND CELP with comparable or slightly higher bit-rate and this goal has been achieved. Speex is based on CELP, which stands for Code Speex also has the following features, some of which Excited Linear Prediction. The CELP technique is based are not found in other speech codecs: on three ideas: • Integration of narrowband and wideband using an 1) Using a linear prediction (LP) model to model the embedded bit-stream vocal tract; • Wide range of bit-rates available (from 2 kbps to 2) Using of an adaptive and a fixed codebook as the 44 kbps) input (excitation) of the LP model; • Dynamic bit-rate switching and Variable Bit-Rate 3) Performing a search in closed-loop in a “percep- (VBR) tually weighted domain”. • Voice Activity Detection (VAD, integrated with The description of the Speex encoder and decoder in this VBR) section has been greatly simplified for clarity purposes. • Variable complexity Nonetheless, speech processing knowledge is required to • Ultra-wideband mode at 32 kHz understand some parts. • Intensity stereo encoding option A. Source-Filter Speech Model B. Legal Status Before going into Speex and CELP in details, let us One of the most frequently asked question about introduce the source-filter model of speech. This model Speex is: “But how can you be sure it is not patented?”. assumes that the vocal cords are the source of spectrally The short answer is that we don’t know for sure. The flat sound (the “excitation signal”), and that the vocal tract acts as a filter to spectrally shape the various sounds of speech. While still an approximation, the model is widely used in speech coding because of its simplicity. Its use is also the reason why most speech codecs (Speex included) perform badly on music signals. The different phonemes can be distinguished by their excitation (source) and spectral shape (filter). Voiced sounds (e.g. vowels) have an excitation signal that is periodic and that can be approximated by an impulse train in the time domain or by regularly-spaced harmon- ics in the frequency domain. On the other hand, fricatives (such as the “s”, “sh” and “f” sounds) have an excitation signal that is similar to white Gaussian noise. So called “voice fricatives” (such as “z” and “v”) have excitation signal composed of an harmonic part and a noisy part. Figure 1. A generic CELP decoder. B. Speex Decoder Before exploring the complex encoding process of where g0, g1 and g2 are the pitch gains and are jointly Speex we introduce the Speex decoder here.

Load more