<<

arXiv:1602.08668v1 [cs.SD] 28 Feb 2016 ntema ie pe sbigpre oarchitectures to ported being is time, mean client. VoIP the a any In implement without to developer knowledge a processing allows signal This buffering. adapt- for jitter and ive cancellation algorithms echo including acoustic cancellation, development, noise (VoIP) IP over channels equipment. kHz) telephony (16 kHz) (8 wideband narrowband interface legacy to with way provides which easy coding, when embedded an include bits and the file, features a of to use These encoding better makes . which speech bitrate, not variable proprietary other features to supporting in while comparable bitrate, found quality same the a at achieve codecs Speex to Nevertheless, algebraic used. able of be is use not could the patent (ACELP) like of codes techniques because some but algorithm restrictions, (CELP) Prediction latency ear is low network. for packet Speex designed unreliable is the an over and time, communication speech that for at optimised existed already that eeomn fvieoe P(oP nLnxand on the Unlike (VoIP) systems. IP operating over speech free the for other voice harming used of greatly be was could development situation that The codec free communication. good no was are suppression noise addressed. and The also fixed-point cancellation applications. the echo in as acoustic used such port, be Speex, can in developments it technology recent how the most and Speex, it of in overview involved an presents existing network. paper low packet This for unreliable previously Prediction an speech over transmitting the Linear communication for latency optimised Excited unlike is Code codec, and, Vorbis the codec. algorithm speech on open-source (CELP) based free, a is for Speex need the address to 1 pe snweovn noacmlt oli o voice for toolkit complete a into evolving now is Speex Lin- Excited Code popular the on based is Speex hnIsatdipeetn pe n20,there 2002, in Speex implementing started I When Terms Index Abstract http://www.vorbis.com/ SR ete n iir eboeRas asedNSW Marsfield Roads, Pembroke & Vimiera Cnr Centre, ICT CSIRO TeSexpoethsbe tre n2002 in started been has project Speex —The sec oig oc vrIP over voice coding, —speech pe:AFe oe o reSpeech Free For Codec Free A Speex: .I I. NTRODUCTION [email protected] ihOgFoundation Xiph.Org 1 enMr Valin Jean-Marc codec n -a)adteoe-oreipeetto fthe as of known implementation codec open-source GSM-FR (also aging the codec and G.711 A-law) compressing and the for options were only open-source was the speech any time, that by that are codec used At These software. being speech patents. for a software conditions for from essential free need and a open-source was there cause V VI Section Section paper. and Finally, this Speex concludes applications. in using developments about in recent tips discusses library and Speex information III. the Section provides in IV provided is Section underly- (CELP) the algorithm of coding explanation ing an Then, Speex. of overview CPU fixed-point a with DSP. used or equipped be devices to Speex embedded allowing in unit, floating-point a without requirements: Decisions versions Design all A. 1.0. that version with means compatible which are been then has time, since bitstream that the since and 2003 frozen March in 1.0 released Version account. was into only taken compression, are requirements file-based (VoIP) VoIP for the IP suitable suitable over also codec is voice VoIP any for at Because rather compression. but file-based and phones cell at geted suitable really not speech. was forfor it available but also compression, was general Vorbis The unclear. is latter 2 h pe rjc a ttdi eray20 be- 2002 February in stated was project Speex The presents II Section follows. as organised is paper The h s fSexfrVI moe h following • the imposes • VoIP for Speex of use The - not is Speex codecs, speech other many Unlike http://kbs.cs.tu-berlin.de/~jutta/toast.html ohteecdraddcdrms u nreal-time in resources run limited must with decoder small and be encoder must the delay Both algorithmic and size frame The I O II. 2 lhuhteptn ttso the of status patent the although , VERVIEW 12 Australia 2122, µ -law • The effect of lost packets during transmission must longer answer is that first, we (Xiph.Org, me) obviously be minimised did not patent Speex. During development, we were • The codec must support both narrowband and wide- careful not to use techniques known to be patented band and we searched through the USPTO database to see • Multiple bit-rates and quality settings must be sup- if anything we were doing was covered. ported to take into account different connection That being said, we cannot offer any absolute warranty speeds regarding patents. No company does that, simply because • Good compression must be achieved while avoiding of the mess the patent system is. Even if you pay for known patents a codec, you cannot have the guaranty that nobody else The CELP algorithm [1] was chosen because of its owns a (previously unknown) patent on that codec. Even proven track record from low bitrates (e.g. DoD CELP if someone comes forward and claims to have a patent at 4.8 kbps) to high bitrates (e.g. G.728 LD-CELP at on a certain piece of software, it takes years of legal 16 kbps) and because the patents on the base algorithm procedures to sort out. This has been happening recently have now expired. Unfortunately, newer variants such as with the JPEG , which has always been thought the use of arithmetic codebooks in ACELP could not be to be free of patents. It is not yet clear what will happen used in Speex because of patent issues. It was decided in this particular case. Sadly, this is how things are when to encode in chucks of 20 ms (frame size) and only use it comes to patents – at least in the United States because 10 ms of additional buffering (look-ahead). every country has different patent laws, which makes the Because Speex is designed for VoIP instead of cell matter even more complicated. phones (like many other speech codecs) it must be robust In any program written that is longer than "Hello to lost packets, but not to corrupted ones – UDP/RTP World" (and even that!), there might be a patent issue, packets either arrive unaltered or they don’t arrive at all. generally a trivial patent (e.g. 1-click shopping) nobody For this reason, we have chosen to restrict the amount of thought about. There are probably patents out there inter-frame dependency to pitch prediction only, without claiming rights on "using a computer for doing useful incurring the overhead of coding frames independently things" or something like that. What this means for as done in the iLBC [2] codec. Speex is that we did our best to avoid known patents As much as we would like it to be true, Speex is and minimise the risk. This is about the best that can be not the most technologically advanced speech codec done when writing software at the moment, regardless available and it was never meant to be. The original goal of whether the software is Free or proprietary. was to achieve the same quality as state-of-the-art codecs III. SPEEX AND CELP with comparable or slightly higher bit-rate and this goal has been achieved. Speex is based on CELP, which stands for Code Speex also has the following features, some of which Excited Linear Prediction. The CELP technique is based are not found in other speech codecs: on three ideas: • Integration of narrowband and wideband using an 1) Using a linear prediction (LP) model to model the embedded bit-stream vocal tract; • Wide range of bit-rates available (from 2 kbps to 2) Using of an adaptive and a fixed codebook as the 44 kbps) input (excitation) of the LP model; • Dynamic bit-rate switching and Variable Bit-Rate 3) Performing a search in closed-loop in a “percep- (VBR) tually weighted domain”. • (VAD, integrated with The description of the Speex encoder and decoder in this VBR) section has been greatly simplified for clarity purposes. • Variable complexity Nonetheless, speech processing knowledge is required to • Ultra-wideband mode at 32 kHz understand some parts. • Intensity stereo encoding option A. Source-Filter Speech Model B. Legal Status Before going into Speex and CELP in details, let us One of the most frequently asked question about introduce the source-filter model of speech. This model Speex is: “But how can you be sure it is not patented?”. assumes that the vocal cords are the source of spectrally The short answer is that we don’t know for sure. The flat sound (the “excitation signal”), and that the vocal tract acts as a filter to spectrally shape the various sounds of speech. While still an approximation, the model is widely used in speech coding because of its simplicity. Its use is also the reason why most speech codecs (Speex included) perform badly on signals. The different phonemes can be distinguished by their excitation (source) and spectral shape (filter). Voiced sounds (e.g. ) have an excitation signal that is periodic and that can be approximated by an impulse train in the time domain or by regularly-spaced harmon- ics in the frequency domain. On the other hand, fricatives (such as the “s”, “sh” and “f” sounds) have an excitation signal that is similar to white Gaussian noise. So called “voice fricatives” (such as “z” and “v”) have excitation signal composed of an harmonic part and a noisy part. Figure 1. A generic CELP decoder. B. Speex Decoder

Before exploring the complex encoding process of where g0, g1 and g2 are the pitch gains and are jointly Speex we introduce the Speex decoder here. For the sake quantised (VQ). of simplicity only the narrowband decoder is presen- Many current CELP codecs use moving average (MA) ted. Figure 1 describes a generic CELP decoder. The prediction to encode the fixed codebook gain. This excitation is produced by summing the contributions provides slightly better coding at the expense of intro- from an adaptive (aka pitch) codebook and a fixed (aka ducing some extra dependency with past frames. Speex innovation) codebook: encodes the fixed codebook gain as a global excitation gain for the frame as well as a set of sub-frame gain e[n]= e [n]+ e [n] (1) a f corrections. where ea[n] is the adaptive codebook contribution and Speex uses sub-vector quantisation of the innovation ef [n] is the fixed codebook contribution. (fixed codebook) signal. Unfortunately, this is not a very The filter that shapes the excitation has an all-pole efficient quantisation method, but it was the best that (infinite impulse response) model of the form 1/A(z), could be used given patent restrictions. Each sub-frame where A(z) is called the prediction filter and is obtained is divided into sub-vectors of length ranging between using linear prediction (Levinson-Durbin algorithm). An 5 and 20 samples. Each sub-vector is chosen from a all-pole filter is used because it is a good representation bitrate-dependent codebook and all sub-vectors are then of the human vocal tract and because it is easy to concatenated. As an example, the 3.95 kbps mode uses compute. a sub-vector size of 20 samples with 32 entries in the In Speex, frames are 20 ms long, which is 160 samples codebook (5 bits). This means that the innovation is for narrowband. Furthermore, each frame is divided into encoded with 10 bits per sub-frame, or 2000 bps. On 4 sub-frames of 40 samples that are encoded sequentially. the other hand, the 18.2 kbps mode uses a sub-vector In most modes, only the synthesis filter and the global size of 5 samples with 256 entries in the codebook (8 excitation gain (see below) is encoded on a frame basis, bits), so the innovation uses 64 bits per sub-frame, or the other parameters are encoded on a sub-frame basis. 12800 bps. Speex includes three main differences compared to C. Speex Encoder most recent CELP codecs. First, while most recent CELP codecs make use of fractional pitch estimation [3] with The main principle behind CELP is called Analysis- a single gain, Speex uses an integer to encode the pitch by-Synthesis (AbS) and means that the encoding (ana- period, but uses a 3-tap predictor [4] (3 gains). The lysis) is performed by perceptually optimising the de- adaptive codebook contribution can thus be expressed coded (synthesis) signal in a closed loop. In theory, as: the best CELP stream would be produced by trying all possible bit combinations and selecting the one ea[n]= g0e[n−T −1]+g1e[n−T ]+g2e[n−T +1] (2) that produces the best-sounding decoded signal. This is obviously not possible in practice for two reasons: the high-band information is encoded after then narrowband required complexity is beyond any currently available (low-band) data, so simply discarding the last part of hardware and the “best sounding” selection criterion a wideband frame produces a valid narrowband frame. implies a human listener. Alternatively, a wideband decoder is able to identify that In order to achieve real-time encoding using limited the high-band information is missing and thus is able to computing resources, the CELP optimisation is broken decode a narrowband frame. down into smaller, more manageable, sequential searches One particularity of the wideband mode made possible using a simple perceptual weighting function. In the case by the SB-CELP scheme is that for low-bitrate, only the of Speex, the optimisation is performed in four steps spectral shape (envelope) of the high-band is encoded. 1) Linear prediction analysis is used to determine the The high-band excitation is obtained simply by “folding” synthesis filter, which is converted to Line Spectral the narrowband excitation. The technique is similar to the Pair (LSP) coefficients and vector-quantised. low bit-rate encoding of the high frequencies described 2) The adaptive codebook entry and gain are jointly in [5]. searched for the best pitch-gain combination using IV. IMPLEMENTING SPEEX SUPPORT AbS. 3) The fixed codebook gain is determined in an Unlike most speech codecs, Speex supports several “open-loop” manner only the energy of the excit- sampling rates and several bit-rates for each sampling ation signal. rate. The thing one must determine before choosing a 4) The fixed codebook is searched for the best entry codec (or in this case, which mode of the codec) is: using AbS. • What is the bandwidth available? Optimisation for steps 2) and 4) are performed in the • What is the quality requirement? so called “perceptually weighted domain”, meaning that • What is the latency requirement? we try to minimise the perceptual difference with the When designing a Voice over IP application, the first original (as opposed to simply maximising the signal-to- aspect one must consider is the fact that the combination noise ratio). In order to do that, the following weighting of IP, UDP and RTP incurs an overhead of 40 bytes per filter is applied on the input signal: packet transmitted. When transmitting one packet every 20 ms as is commonly done, the overhead is equal to A(z/γ1) W (z)= (3) 16 kbps! Considering that, it is obvious that in most A(z/γ2) situations, there is little incentive to use very low bit- where A(z) is the linear prediction filter and γ1 and γ2 rate. control the shape of the filter by moving the poles and A. Choosing the Right Mode zeros toward the centre of the z-transform unit circle (Speex uses γ1 = 0.9 and γ2 = 0.6). The filter defined The first choice one must make when using Speex in 3 is in fact a very, very rough approximation for what is the sampling rate used. For new applications, it is is known as the “masking curve” in audio codecs such recommended to use wideband (16 kHz sampling rate) as Vorbis. The overall effect of the weighting filter W (z) audio whenever possible. Wideband speech sounds much is that the encoder is allowed to introduce more noise at clearer than narrowband speech and is also easier to un- frequencies where the power level is high and less noise derstand. In many case, for example when implementing at frequencies where the power level is low. a VoIP application, it is a good idea to also support narrowband – either directly or through the wideband D. Wideband CELP (aka SB-CELP) encoder/decoder. In this context, narrowband is useful The wideband mode in Speex differs from most for both compatibility reasons (with clients that may wideband speech codecs as it separates the signal into not support wideband) and because at very low bit-rates two sub-bands (SB-CELP). The lower band is encoded (<10 kbps), it sometimes provides better quality than using the narrowband encode. This has the advantage wideband. of making it easy to interoperate with “legacy” sys- Once the sampling rate is chosen, one must decide on tems (e.g. PSTN) operating in narrowband. The high- the bitrate. It is generally recommended to use constant band is encoded using a CELP encoder similar to the bitrate (CBR) in voice over IP applications because of narrowband encoder with the difference that only a the fixed channel (especially if using a modem) capacity. fixed codebook is used, with no adaptive codebook. The On the other hand, VBR is usually more efficient for file compression, because bits can be allocated where written by calling speex_bits_nbytes(&bits), which needed. When calculating for a target bit-rate, it is im- returns a number of bytes. portant to also consider the overhead. The VoIP overhead After you’re done with the encoding, all resources can when using RTP and one frame per packet is 16 kbps, be freed using: so when using a 56 kbps modem, this leaves very little speex_bits_destroy(&bits); bandwidth for the codec itself. For file encoding in , speex_encoder_destroy(enc_state); the overhead is approximately 1 byte per frame, so it The decoder API can be used in a similar manner. For 3 is usually not problematic. For very low bit-rates, the more details, refer the Speex manual . –nframes option to speexenc can be useful as it packs C. Common Layer 8 Problems multiple frames in an Ogg packet. There are problems that frequently arise when imple- B. Libspeex API Overview menting support for Speex. While some of them are due The reference implementation of Speex uses a C API. to programming errors, most are by sending or receiving Only an overview of the codec API is provided here. the wrong signals. The most important aspect for the When using Speex, one must first create an instance of input signal is the amplitude. Speex can encode any − an encoder and/or a decoder. If more than one channel signal in the [ 32768, 32767] range. However, it is best is used, then an instance is required for each channel to make use of the full dynamic range without being too (Speex is not stateless). close to saturation because that could cause the decoded In order to encode speech using Speex, the following speech to saturate even if the encoded speech does not. header must be included: In practice, it is best to scale the input signal so the #include maximum amplitude is between 5000 and 20000. Another important characteristics of the signals is the Then, a Speex bit-packing struct and a encoder state are frequency content. DC offsets should be avoided at all necessary: cost as it causes the Speex encoder to produce bad SpeexBits bits; void *enc; sounding files (DC is never needed for audio). A DC These are initialized by: offsets is sometimes caused by a cheap soundcard and speex_bits_init(&bits); means that the signal is not centred around the zero enc = speex_encoder_init(&speex_nb_mode); (zero-mean). In the most simple case, a DC offset can For wideband coding, speex_nb_mode will be replaced easily be removed using a notch filter of the form: by speex_wb_mode. In most cases, you will need to know 1 − z−1 N(z)= (4) the frame size used by the mode you are using. The value 1 − αz−1 can be obtained in the frame_size variable using: where a value of α = 0.98 is usually appropriate for speex_encoder_ctl(enc, both narrowband and wideband. The filter described in SPEEX_GET_FRAME_SIZE,&frame_size); (4) can be implemented as: In practice, frame_size will correspond to 20 ms when #define ALPHA .98f using 8, 16, or 32 kHz sampling rate. static float mem=0; Once the initialization is complete, frames can be for (i=0;i