
Mobile Peer-to-Peer Audio Streaming

Andreas Lüthi
Bachelor Thesis
Computer Science Department, ETH Zürich
8092 Zürich, Switzerland
Email: [email protected]

Abstract

A peer-to-peer network has several characteristics that differ from a standard network connection and that need to be accounted for while developing a scalable audio codec. There is a multitude of possible ways to implement such a codec, with tradeoffs ranging from partitioning the network to computationally complex ways of scaling the compressed signal (which is not useful in a mobile setting where energy efficiency is key). The developed codec scales along the available channels, with a base layer providing a mono signal and one or more enhancement layers enabling, for example, stereo signals. This scaling technique allows for an elegant implementation in the existing Vorbis I codec and provides the properties that are important in a peer-to-peer environment. The codec aims to deliver multimedia content (such as movie audio or pieces of music) in a way that makes efficient use of the available bandwidth on a mobile peer-to-peer client.

1 Introduction

While the mobile revolution has already begun with the dawn of devices like the iPhone or Google Android devices, the evolution of audio codecs has been more or less stagnant in this area of use. Currently available scalable audio codec technologies usually implement one of two ways of encoding an audio signal. One way is to encode the audio signal into several independent streams, each at a different quality (e.g. 32 kbit/s, 64 kbit/s and 128 kbit/s), and to switch between the streams when the available bandwidth changes (stream switching). The other way is a technique called "bitrate peeling", where the sending side reduces, for example, the resolution at which a signal is represented. This reduction of resolution can be achieved by lowering the number of bits used to represent the quantization values of a signal.


Stream switching usually works well in server-based environments where central entities in a network distribute the content; with regard to a peer-to-peer network there are some differences which need to be accounted for. On one hand the peer-to-peer network would need to be partitioned into peers of different bitrate capabilities, and on the other hand switching between streams would be extremely costly because new peers need to be found for the new bitrate [1]. This implies that there could be trouble with seamless stream switching. These facts led to the idea of developing a scalable audio codec which uses a hierarchical layering technique much like the one available in today's video codecs. For lower-bandwidth connections a base layer would suffice for reasonably good audio quality, and as bandwidth increases with better connections like modern mobile networks (UMTS, HSDPA, LTE, etc.) and/or wireless LAN (WiFi), more layers could be added to improve the overall quality of the audio signal.

1.1 Organization of This Report

In Section 2 we first discuss what we are going to encode and how the current Vorbis reference implementation works. Section 3 covers possible ways in which the Vorbis codec could be modified to implement scaling, including the reasoning that led to our decisions. In Section 4 we give a short overview of how the prototype implementation has been developed. Section 5 discusses and analyzes our findings while developing a scalable Vorbis codec. The report closes with Section 6 and a short conclusion which summarizes the report and gives pointers to future directions.

2 Theoretical Background

2.1 What will be encoded

In order to encode an audio signal it has to go through the process of digitalization. This process consists of a sampling and a quantization phase. To represent a continuous analogue signal in a finite system we have to select and record the ordinate values of the continuous signal in a time-dependent manner. Usually we take so-called samples of the signal at equally spaced distances along the time axis; i.e. for a signal that is to be sampled at 44.1kHz we take the ordinate value of the signal every 1/44100th of a second. The resulting discrete signal is often referred to as a time series of the signal [2].

The signal, now discrete in time rather than continuous, is nevertheless still an analogue signal because the pulse amplitudes (ordinate values) can still take on every possible value. Therefore we quantize the signal: each pulse is adjusted in amplitude to coincide with the nearest value of a predefined finite set of possible values. The signal is then no longer analogue but digital in nature and as a consequence can be represented by a finite set of symbols [2].

The resulting digital format is denoted as a PCM (Pulse Code Modulation) representation of the audio signal. This PCM representation is often regarded as 'raw' or as the original reference in digital multimedia signal processing applications [3] (see figure 1 for an illustration of the process).
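As a small, self-contained illustration of these two phases (a sketch for this report, not part of any codec), the following C program samples a 1kHz sine at 44.1kHz and quantizes each sample to 3 bits, i.e. 8 levels, mirroring the quantization shown in figure 1:

    #include <math.h>
    #include <stdio.h>

    #ifndef M_PI
    #define M_PI 3.14159265358979323846
    #endif

    int main(void) {
        const double fs = 44100.0;  /* sampling rate: one sample every 1/44100 s */
        const double f  = 1000.0;   /* frequency of the test tone */
        const int levels = 8;       /* 3-bit quantizer: 2^3 possible values */

        for (int n = 0; n < 10; n++) {
            double t = n / fs;                   /* sampling instant */
            double x = sin(2.0 * M_PI * f * t);  /* "analogue" ordinate value */
            /* quantization: snap x in [-1,1] to the nearest of the 8 levels */
            int code = (int)floor((x + 1.0) / 2.0 * (levels - 1) + 0.5);
            printf("n=%d  t=%.6f s  x=%+.4f  code=%d\n", n, t, x, code);
        }
        return 0;
    }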


Fig. 1: An illustration of the sampling and quantization process. From top to bottom we see first the analogue input signal, then the sampled time series (arrows), i.e. the input signal measured at a regular interval. The last part shows the quantization, which approximates the analogue input signal by a finite discrete representation.

2.2 How the existing encoder works

To explain how the Ogg Vorbis codec works internally, we go step by step through the encoding pipeline as depicted in figure 2.

[Figure 2: encoder pipeline diagram. PCM audio enters stage A (windowing, MDCT and DFT); stage B is the psychoacoustic model, stage C the floor function, stage D the floor subtraction yielding the spectrum residue, stage E the channel coupling, stage F the vector quantization, stage G the Huffman coding and stage H the bitrate selection, producing the final Vorbis packets.]

Fig. 2: Encoder pipeline (Source: [4]). The enumeration of the stages is only for referencing purposes and does not reflect the flow of the signal through the pipeline.

As denoted by step A in figure 2, the first task is to transform the signal into a frequency domain representation. The reference implementation of the Vorbis I encoder does this in two ways: on one side the input signal gets windowed and transformed by an MDCT (Modified Discrete Cosine Transform), and on the other side it gets windowed and transformed by a DFT (Discrete Fourier Transform). Because it is impossible to encode arbitrarily large signals at once (for example, a radio stream which runs non-stop over several years cannot be transformed in one pass because of size and time restrictions on a computer), the input signal gets windowed. Windowing is a process which takes a part of a signal, e.g. a chunk of 2048 or 512 samples (Vorbis I uses only those two window lengths, and the shorter one only for critical music signals with sudden changes), applies the requested transformation(s) and returns the encoded signal.

The MDCT works, as the name suggests, in principle like the widely used DCT (Discrete Cosine Transform) with some minor changes. To understand the differences between the MDCT and the DCT we can look at some of the widely used picture compression algorithms [2]. Those algorithms usually take a block (window) of a picture and encode it separately. This leads to the often-seen blocking artifacts, where the edges of a block become visible due to different local attributes within each block. Essentially the same happens with audio signals. To reduce this effect the MDCT overlaps the windows, as can be seen in figure 3. This overlapping leads to smoother edges and therefore a signal with fewer fast changes (such as crackling noises).
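To make the windowing step concrete, here is a minimal C sketch (illustrative only; buffer handling and the frame loop are not the codec's actual code) that applies the window shape given in the Vorbis I specification [5] to one frame:

    #include <math.h>

    #ifndef M_PI
    #define M_PI 3.14159265358979323846
    #endif

    /* Vorbis I window: w(n) = sin(pi/2 * sin^2((n + 0.5) / N * pi)), see [5] */
    static double vorbis_window(int n, int N) {
        double s = sin((n + 0.5) / (double)N * M_PI);
        return sin(0.5 * M_PI * s * s);
    }

    /* Window one frame of N samples starting at sample offset `off`.
     * Consecutive frames advance by N/2 samples, so neighbouring frames
     * overlap by half their length, as illustrated in figure 3. */
    static void window_frame(const float *pcm, double *out, long off, int N) {
        for (int n = 0; n < N; n++)
            out[n] = pcm[off + n] * vorbis_window(n, N);
        /* `out` would now be handed to the MDCT (step A in figure 2) */
    }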

Fig. 3: The visual representation of the overlapping of two MDCT frames. The red and green parts are signal chunks (windows) which overlap to ensure a smooth reconstruction. (Source: [5])

The DFT in the Vorbis I codec serves primarily as an instrument to analyze the incoming signal (besides the MDCT). The DFT has the advantages of better tonal estimation and easier handling and is therefore used as an additional input to the psychoacoustic model besides the MDCT output.

After the signal has been transformed from the time domain into the frequency domain, the first lossy processing step happens (step B in figure 2). The psychoacoustic model exploits important properties of the human hearing system, in particular its masking behaviour [3]. As can be seen in figure 4, not every frequency can be heard equally well; we are therefore able to remove inaudible parts from the spectrum to compress the signal in a first step.

The output of the psychoacoustic model is then used to create a low-level approximation of the frequency representation of the input signal, the floor (depicted as C in figure 2 and shown in figure 5). This floor is afterwards used to remove a large part of the spectrum via the calculated approximation (stage D in the encoding pipeline depicted in figure 2). The remaining parts, called residuals (the fine structure of the audio spectrum), are encoded separately. The floor data itself is independently vector quantized and entropy encoded for packing.

The residuals are channel coupled in a separate step. Channel coupling exploits the high correlation between the encoded channels to further reduce the amount of data that needs to be transmitted. The principle behind channel coupling is to use one channel as a base and encode only the differences to the other channels. After the coupling the residual data is also vector quantized and entropy encoded.

The last step before the final entropy encoding is the vector quantization, a lossy technique which uses block code principles.

[Figure 4: plot of sound pressure level L [dB] (0 to 140) versus frequency f [kHz] (0.02 to 20), showing the threshold of hearing, the threshold of pain, and the regions occupied by speech and music.]

Fig. 4: Parameters of the human hearing depending on sound pressure level. The dark grey region represents the frequencies and sound pressures usually occurring with natural speech. The light grey region represents music. This region map can be used as a basis for a psychoacoustic model where frequencies outside of the relevant frequency band are discarded from the recording. (Source: [3])

This quantization technique works as follows: we take n input values (for an n-sized block) and place them as a point in the n-dimensional vector space. We do this for every n-sized block of our input. Now we can approximate those points by their nearest neighbour among a set of code vectors and let that code vector represent them. This partitions our vector space into so-called encoding regions, each represented by a single code vector. If we do this rounding according to the distribution of the points in the vector space we lose a bit of information but still keep the overall picture. The more approximation points we have, the better the approximation of the input values. An example of such a partitioned vector space can be seen in figure 6 as a 2-dimensional vector quantization map; such a map is often called a Voronoi diagram.

As the last step in our pipeline, standard entropy encoding with prefix-free codes is used to further reduce the size of the input signal. This entropy coding is based on the idea of Huffman coding, a technique that encodes the symbols of an input according to their relative frequency in the whole input sequence: the value that appears most often is assigned the shortest codeword and the value that appears least often is assigned the longest. Such a code can be shown to be optimal in the sense that it best approximates the entropy of any given input [6].
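As a small illustration of the nearest-neighbour rounding behind vector quantization (a sketch, not the actual Vorbis codebook search), the following C function maps a 2-dimensional input vector to the index of its nearest code vector, i.e. to its encoding region in figure 6:

    #include <float.h>

    /* Return the index of the code vector in `cb` closest to `v`.
     * `cb` holds cbsize 2-dimensional code vectors (the centroids). */
    static int vq_nearest(const double v[2], const double cb[][2], int cbsize) {
        int best = 0;
        double bestd = DBL_MAX;
        for (int i = 0; i < cbsize; i++) {
            double dx = v[0] - cb[i][0], dy = v[1] - cb[i][1];
            double d = dx * dx + dy * dy;  /* squared Euclidean distance */
            if (d < bestd) { bestd = d; best = i; }
        }
        return best;  /* index of the encoding region (Voronoi cell) */
    }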

[Figure 5: plot of intensity (dB) versus frequency, showing the interpolated floor and the exact signal.]

Fig. 5: A visualization of the floor and the residuals. The light grey line is the interpolated floor while the black line is the exact signal. The subtraction of the floor from the exact signal results in the residuals which resemble white noise.


Fig. 6: A Voronoi diagram which illustrates vector quantization. Each two dimensional vector gets rounded to its nearest centroid.

The result after all these steps is an encoded Vorbis packet. These packets can afterwards be stored in an Ogg container, which makes our signal ready for distribution.

3 Different Possible Implementation Approaches

First we have to introduce the notion of layer splitting and what we mean by it. Layer splitting is a technique currently often seen with video in a web setting. An input stream (like the aforementioned video) is split into several streams. The first stream is called the base layer. The base layer accommodates all the information necessary to set up a decoding mechanism for a given stream and carries data for a low-quality representation of the signal. Besides the base layer there can be several other layers, called enhancement layers. The general idea behind these enhancement layers is to transport additional data to the stream consumer so that the low-quality representation can be enhanced. This enhancement data is often just a calculated delta between the data we want to present and the low-quality representation of the base layer.

There are several possibilities to scale such a multimedia signal: a scaling of the used bitrate (which essentially is a quality scaling), a scaling along the input channels (for example, a stereo audio signal can be scaled down to a mono signal), a sample rate scaling, where the sampling resolution is adjusted, and many more. We primarily look at bitrate scaling and channel scaling approaches. To implement such a splitting we identified the following approaches:

3.1 Provide Streams With Different Bandwidths (stream switching, see [1])

An easy approach would be to encode a stream into several different streams with different quality attributes. This would enable a mobile device to switch between the streams according to its available bandwidth. The advantage of this approach would be the low implementation complexity.

A large disadvantage would be the fact that a server would have to provide several different streams with preset audio qualities. This would increase the outgoing traffic on the server side and make it hard for mobile devices to perform a seamless handover between the streams because of buffering issues and probably different routing paths. In a peer-to-peer setting even more disadvantages can be identified. If we assume that not every peer provides all streams (which makes sense due to different bandwidth constraints), we would partition our peer network according to the streams provided by each peer. This makes switching between bitrates prohibitively expensive: peers would have to find new neighbours with each bitrate change and could not necessarily reuse the connections they already have [1].

3.2 Quantization Approach

The first non-trivial idea was to change the quantization of the floor and, in a second step, of the residuals. The quantization would be changed such that the least significant bits of the values of a quantization vector are removed, resulting in a coarser-grained quantization. The assumption here is that each coarser quantization value represents all the values within a finer resolution, so all those values can be replaced by one value. The signal would degrade, but in a controllable way.

This idea would have the following advantage: it could in theory be accomplished fairly easily. We would just reduce the number of partitions of the vector space and therefore the number of vectors used to represent input values (see figure 6 for a 2-dimensional example).

A large disadvantage would be that we would not consider the position of the quantization vectors in the vector space (the values to which all other values are rounded). It can and will happen that the newly created quantization vectors are no longer in an optimal position in the vector space, which would reduce the overall optimality of this approach. The modification of the codebooks would also not be easy. Regarding a peer-to-peer scalable solution this approach would also imply some difficulties: we would need a protocol that lets the sending side scale the quantization, otherwise we would again need to partition the network into peers with different provided qualities.

3.3 Bitrate Peeling Approach

The bitrate peeling approach would take a given Vorbis input stream and reduce its bitrate by simply stripping bits out of the stream. This is accomplished by removing bits from, for example, the residual part of a file. If a residual is represented by a 16-bit short value, we would remove bit by bit from the least significant side. This technique would result in a) a smaller representation of the stream and b) a smaller codebook section needed to decode the stream (because with the bits peeled away, fewer symbols are used to represent the stream and therefore a smaller codebook suffices).

The bitrate peeling approach is very closely related to the quantization approach, but instead of creating a coarser quantization during encoding we strip the bits out of the final file. The advantage of this approach would be that it can be applied to any Vorbis file without specially encoding it. In addition to the same disadvantages we would encounter with the quantization approach, we could not take any influence on the maximum bitrate, which is given by the initial encoding of the file. A further disadvantage would be that we would need to change the codebook entries according to the new values with the bits stripped.
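To make the peeling idea concrete, a minimal sketch (purely illustrative; as discussed above, a real implementation would also have to rewrite the codebooks): zeroing the k least significant bits of a 16-bit residual value yields the coarser representation both of these approaches aim for.

    #include <stdint.h>

    /* Zero out the k least significant bits of a 16-bit residual value.
     * The surviving high bits form the coarser quantization value. */
    static int16_t strip_lsbs(int16_t value, int k) {
        return (int16_t)(value & ~((1 << k) - 1));
    }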

3.4 Floor and Residual Splitting Approach

The fourth approach we identified involves serious modifications to the original Vorbis codec. The idea behind this approach is to first pre-process the signal by combining, for example, both channels of a stereo input into a mono channel plus a left or right channel (for clarity we take the right). The mono channel is calculated by adding the original left and right channels and dividing by two (so that we get an averaged signal). Now that we have only one channel, we can use the remaining free channel of the Vorbis encoder to encode the right channel. With these two channels – the mono and the right – we are able to recover the original stereo signal by multiplying the mono channel by two and subtracting the right channel (this gives us the left channel back). With this approach we still exploit the channel coupling of the Vorbis encoder (because the channels remain highly correlated).

The layering can now no longer happen via bitrate peeling or a different quantization, because the signal would degrade further. The idea instead is to split the signal into a mono layer (which transports the setup data like the codebook and packet information) and a stereo enhancement layer. This splitting would enable the playback of the mono layer by a normal Vorbis implementation, while a specially designed decoder could additionally take the stereo enhancement layer to reconstruct a stereo signal.

With this approach we would also partition our peer-to-peer network into peers providing different layers, but one peer would receive its signal from multiple other peers. A client peer could have a peer A which provides it with the base layer and peers B and/or C which provide enhancement layers. This complicates the peer handling a bit because there are multiple connections to manage (we need to find peers for every layer), but we could switch seamlessly between different quality levels and would not need to bootstrap again when switching to another quality level.
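As a small worked example of this channel preparation (using only the formulas above): take stereo sample values L = 0.6 and R = 0.2. The encoder transmits

    M = (L + R) / 2 = (0.6 + 0.2) / 2 = 0.4    (base layer)
    R = 0.2                                    (enhancement layer)

and the decoder recovers the left channel as

    L = 2M - R = 2 * 0.4 - 0.2 = 0.6.

With only the base layer available, both output channels are set to M = 0.4, so the signal remains audible but loses its spatial information.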

4 Implementation

While looking at the different approaches for implementing such a layering of an audio signal we came to the following conclusions. The first approach (stream switching) would not satisfy our requirement of keeping outgoing traffic to a minimum. The quantization approach would not be easily feasible because of the mentioned disadvantage that the created quantization vectors would no longer be in an optimal position, and we would have to somehow transfer a large codebook or create different streams as with the stream switching approach. The bitrate peeling approach has not yet been implemented, but it is a topic brought up time and again in the Ogg Vorbis community; there is currently a bounty set for implementing bitrate peeling in Vorbis [7].

The problem all of the previously mentioned approaches suffer from is the partitioning of the peer network in a peer-to-peer setting. So we decided to go with the fourth approach: develop a layered Vorbis codec and split the floors and residues according to the two available channels. We used the Xiph.org reference implementation of the Vorbis codec as a basis for our implementation. This way we did not need to implement components like the MDCT or DFT ourselves.

4.1 Implementation Overview

In order to accomplish the layering we had to adjust the codec on several levels. First, the input signal needed to be converted into a format which lets us easily restore it and provide an audio source to a client in a transparent fashion (meaning that the client does not necessarily know whether one or more layers are available). To accomplish this, the audio consumer always needs to receive all channels (in the extreme case, all channels carry the same signal).

The next step was to make sure that all the necessary codec setup data was available independently of how many layers were received by the client, and that the base signal (floor) was distributed correctly among the different layers. The remaining data (the residuals) needed to be processed and split carefully to make sure that the important parts of a Vorbis sound signal get transferred completely. Finally, the decoder part of the codec needed to be able to interpret the incoming signal, switch between layer modes depending on how many layers it was receiving, and provide the aforementioned transparent signal to a client.

4.2 Input Signal Preparation

The incoming PCM signal is first adjusted such that we can a) reconstruct the source signal and b) calculate a pseudo-stereo signal from one channel. To accomplish this we take the average of the incoming channels (adding them and dividing by two) and interlace the calculated signal with the signal of either the left or the right stereo channel. This way we are able to reconstruct the input signal with almost no error by simply multiplying the averaged channel by two and subtracting the other channel. If we only receive one channel (the one with the averaged signal), we can still reconstruct two channels from it – the main information we lose is the spatial information. Besides keeping enough information for just one layer, this technique has the additional advantage that it does not compromise the channel correlation and coupling.

After the input signal has been prepared we can encode it by feeding it into the Vorbis codec. To layer the Vorbis output signal we have to modify how Vorbis encodes and decodes the signal. After analyzing how a general audio signal gets encoded we were able to identify the relevant sections where modifications were needed.
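A minimal C sketch of this preparation and reconstruction step (the function names are ours for illustration and not taken from the prototype):

    /* Preparation: channel 0 becomes the average of left and right,
     * channel 1 keeps the right channel (carried in the enhancement layer). */
    static void prepare_channels(float *left, const float *right, long n) {
        for (long i = 0; i < n; i++)
            left[i] = 0.5f * (left[i] + right[i]);  /* mono = (L + R) / 2 */
    }

    /* Reconstruction: with the enhancement layer, recover L = 2M - R;
     * with only the base layer, duplicate the averaged signal. */
    static void reconstruct(const float *mono, const float *right,
                            float *leftout, long n, int have_enhancement) {
        for (long i = 0; i < n; i++) {
            if (have_enhancement)
                leftout[i] = 2.0f * mono[i] - right[i];
            else
                leftout[i] = mono[i];  /* pseudo-stereo: both channels = M */
        }
    }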

4.3 Splitting the Headers and Floors

In a first step we had to change some of the initialization code in mapping0_forward() to ensure that there are enough buffers to accommodate the multiple output layers (the standard codec provides only one output buffer). This includes the creation of multiple oggpack_buffer output buffers – one for each layer. The first oggpack_buffer (layer 0) includes, in addition to the encoded signal, the header structure with information about the layer count, type of packet, packet number, etc.

Afterwards the input channels are analyzed by an MDCT and a DFT. Because the DFT provides a better tonal estimation it is better suited as an input for the psychoacoustic model. The psychoacoustic model is the part of the codec where, for example, frequencies outside the audible range are deleted. Up to the psychoacoustic model nothing had to be changed – which is very good, because a good psychoacoustic model needs to be thoroughly implemented and tested with real test subjects in a special test setting. This originates from the properties of the human hearing system, where some frequencies are heard better than others or are even masked and shadowed.

The next step in mapping0_forward() includes the interpolation of the floors for both channels and the packing of them into the output stream. Vorbis encodes each PCM input channel into a low-resolution spectral floor which is used as a whitening filter for the input signals [5]. These floors are then saved in their respective oggpack_buffer for each layer. If layering is off, the floors are stored sequentially in one oggpack_buffer as in the standard Vorbis codec. This sequential storing helps a later entropy encoding phase to further compress the floors, which is unfortunately not possible in the scalable version.
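The per-layer buffer setup described at the beginning of this subsection could look roughly as follows (a sketch using libogg's oggpack API; the function name and the 4-bit header field width are assumptions, not the prototype's actual code):

    #include <ogg/ogg.h>

    /* Set up one output buffer per layer; layer 0 additionally carries
     * a small header (layer count here; packet type, packet number,
     * etc. would follow in the same fashion). */
    static void init_layer_buffers(oggpack_buffer *opb, int layers) {
        for (int l = 0; l < layers; l++)
            oggpack_writeinit(&opb[l]);
        /* illustrative 4-bit field for the layer count */
        oggpack_write(&opb[0], (unsigned long)layers, 4);
    }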

4.4 Encoding the Residuals

The data remaining after the removal of the floor represents the fine structure of the audio spectrum (the residuals) [5]. These resemble white noise, with one main difference: the channels are usually highly correlated with each other. The residue format of the standard Vorbis codec partitions each channel into a bundle of chunks, classifies each chunk, encodes the chunk's classification and finally encodes the chunk itself with a specified vector quantization arrangement (see figure 7). The standard Vorbis codec implements three different modes for this classification. Modes 0 and 1 encode each channel separately and differ only in how the partitions are encoded (mode 0 interleaves the partitions, mode 1 does not). Mode 2, which is usually chosen by the encoder itself to encode a signal with 2 (or more) channels, first interleaves the channels (flattens them into one long vector) and then encodes the signal like mode 1.
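For clarity, the mode 2 flattening looks conceptually like this (a sketch; the real codec operates on residue partitions, not on raw sample arrays):

    /* Residue mode 2 conceptually: two channel vectors of length n are
     * flattened into one vector of length 2n with samples alternating
     * between the channels; the result is then encoded like mode 1. */
    static void interleave_mode2(const float *ch0, const float *ch1,
                                 float *flat, long n) {
        for (long i = 0; i < n; i++) {
            flat[2 * i]     = ch0[i];  /* even positions: channel 0 */
            flat[2 * i + 1] = ch1[i];  /* odd positions: channel 1  */
        }
    }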

Fig. 7: Illustration of the residue encoding. In a first step a partition is classified and multiple classification values are concatenated into a codeword. Afterwards the classification codewords are put into a buffer followed by the partition data itself, interleaved between the different channels. (Source: [5])

This enables the codec to further exploit the high correlation between channels of the same audio signal with special channel coupling techniques. In our implementation we had to make sure that the flattening which occurs in mode 2 does not happen, because we could not reconstruct the signal from just one channel where not all information is available (see figure 8). This requirement results from the way we split the channels – we separate the data chunks so that a chunk which originates from channel 0 is saved in our layer 0, a chunk from channel 1 in layer 1, and so on. If the data of the chunks were interlaced we could not distribute the chunks over the different layers, because each chunk would carry information that does not belong to the current layer and would be missing information that sits in the next chunk.

Fig. 8: Illustration of residue mode 2. This is a variant of the residue mode 1 where the channel vectors are interleaved into one large vector. This vector is afterwards encoded with the same process that can be seen in figure 7. (Source: [5])

In the last part of the mapping0_forward() function we see the call to the residual encoding, which performs the previously described steps. As implemented now, if the channel_layering flag is set, the residual encoding mode is fixed to 1. This avoids the discussed flattening of the input channels and lets us easily separate the input signal into multiple layers. This separation is done in the function _01forwardsc() in the file res0.c of the codec. The main modification is that we call this function with an array of oggpack_buffer pointers as an argument, which represent the separate layers, and afterwards make a case differentiation on the channel_layering

flag. If we are not in channel layering mode we only use the first pointer (only this one is allocated) and save our output as the standard Vorbis implementation does. If channel layering is active we first encode the partition codewords for each channel and save them into their respective layer's oggpack_buffer. In the next step we encode the residual values for each partition. Standard Vorbis would now interleave the input channels on each partition word to optimize for the later entropy encoding stage. With channel layering enabled we do not interleave but instead split the signal into the different Ogg buffers; therefore we need more space, because we cannot really benefit from entropy encoding effects.
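Conceptually, the case differentiation looks like this (a sketch; apart from libogg's oggpack_write, all names are illustrative):

    #include <ogg/ogg.h>

    /* Route one partition's codewords either into the single standard
     * buffer (layering off) or into the buffer of the channel's layer
     * (layering on). */
    static void emit_partition(oggpack_buffer *layers, int channel,
                               int channel_layering,
                               const unsigned long *words, const int *bits,
                               int count) {
        oggpack_buffer *dst = channel_layering ? &layers[channel] : &layers[0];
        for (int i = 0; i < count; i++)
            oggpack_write(dst, words[i], bits[i]);  /* codeword and length */
    }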

4.5 The Decoder

The rest of the encoding process remained the same as in the reference implementation of the Vorbis codec from Xiph.org. The decoding process is, generally speaking, the inverse of the described encoding functions. The main difference is that for decoding the primary unit of work is not the oggpack_buffer but the vorbis_block. After the encoding process we receive an ogg_packet from the encoder, a unit that encloses an oggpack_buffer and augments it with additional information like packet sequence number, size, etc. (values that usually come with the packet header). The vorbis_block structure represents an unpacked ogg_packet structure with additional information like frame size (for the MDCT windowing) and additional buffer pointers for the PCM signal which is going to be decoded.

In order to decode the signal correctly we need to know two things about the stream in addition to the stream data itself. First we need to know whether the current input is channel-layered or not, and if so, how many layers are delivered for decoding (starting with the base layer). To get the first piece of information we had to extend the Vorbis header by an additional bit flag which represents the channel layering mode. This change unfortunately broke compatibility with the standard Vorbis codec; it has been made as a convenience for the developer using the codec. In principle this setting could also be preset externally by calling a function that sets the codec into scalable mode, but in order to easily test and implement the prototype this break of compatibility has been accepted. The second piece of information is determined by the client software which uses the codec (it depends, for example, on network parameters like how many peers are available and which layers they provide). This parameter needs to be given to the decoder as an argument while decoding the received signal. The signal then needs to be fed into the decoder as an array of vorbis_blocks.

With this information in place the decoder can allocate the needed space to decode the final audio signal and loop through each available layer to decode the signal. This decoding works exactly inversely to the description given for the encoding above.
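A small, hypothetical sketch of the decoder-side layer selection implied by these two pieces of information (none of these names are taken from the prototype):

    /* Decide how many layers the decoder should actually decode, given
     * the header flag and the number of layers the network delivered. */
    typedef struct {
        int channel_layering;  /* the added header bit flag */
        int total_layers;      /* layers present in the stream */
    } layer_info;

    static int layers_to_decode(const layer_info *hdr, int layers_received) {
        if (!hdr->channel_layering)
            return 1;                          /* plain Vorbis stream */
        if (layers_received > hdr->total_layers)
            layers_received = hdr->total_layers;
        return layers_received > 0 ? layers_received : 1;  /* at least base */
    }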

5 Findings and Analysis

To test our prototype version of the scalable Vorbis codec we used several artificially generated input signals as well as a sample of a movie and an excerpt of a song, to get some intuition on how the codec will perform in a real-life setting where mainly movies and/or music are streamed.

The artificial signals (figure 9) have been generated with Matlab and try to cover the human hearing range (figure 4) to see the potential effects different frequencies have under the influence of the psychoacoustic model of Vorbis. To see how the compression behaves we also generated a random signal (white noise). The idea of testing sine waves was to have some kind of reference for how the codec performs, while the random signal represents an extreme corner case where the two channels do not correlate with each other.

In order to test the codec, a benchmark has been created where we could compare the different signals under the different codec modes. The modes tested are the following: a) scaling turned off (we use the standard compression of Vorbis), b) scaling turned on with two layers received, c) scaling on but only one layer received. We tested the codec under these premises with different sample sizes (from 2^13 to 2^22 samples, which equals 2^14 to 2^23 samples of data because of the two channels) to see whether the size of the signal has any influence on how it is compressed (larger codebook size, different codebook use, etc.).

A measure often used in signal processing to test whether a lossy compression codec provides the necessary quality while reducing the data to a reasonable size is the PSNR value (Peak Signal-to-Noise Ratio). While it helps to evaluate, for example, video compression algorithms, it is not very useful for audio compression because of the very special characteristics of the human hearing system. These characteristics include the non-linear distribution of the perceived power over the audible frequency bandwidth as well as effects of different frequencies shadowing or masking each other when occurring at the same time. Such measures are therefore not used in audio codec development, which is why we did not use them either.

If we start by comparing the artificially generated sine wave signals, we see that with our modified Vorbis solution we need about a factor of 2.37 (figure 9(c)) to 2.72 (figure 9(f)) times more space for the same signal than with the standard Vorbis compression method. These extreme factors are caused by the fact that the standard implementation uses residual mode 2 with a signal where each channel has a 1:1 correlation to the other. Another aspect visible here is that the psychoacoustic model has some optimizations in residual mode 2 which obviously can be exploited. This is reflected in the fact that the higher compression factors are achieved at the borders of the tested ranges (100Hz and 15kHz, where much of the signal gets discarded) and the smaller ones near the center of the human hearing spectrum (1kHz). A very interesting property of these sine waves is that the standard Vorbis compression works even better than the single-layer compression; this is also related to the better use of the psychoacoustic model.

Now that we have established a baseline with pure sine signals, we can test the other end of the spectrum with a completely chaotic signal. White noise is ideal in this case because it is generated by a random process and has a nearly flat power spectrum (see figure 10(b)). This means that the signal contains each frequency at equal power within our measured spectrum. In addition, there is almost no correlation between the two channels. If we compare the standard compression with the compression of our two-layer variant we see that the factor is only about 1.1 (see figure 10(a)). With the one-layer variant we now even have a gain: one layer compressed needs only about 65% of the space of the standard compression mode with scaling turned off.

Having seen the extreme cases of high and almost no correlation, we can now compare signals from a piece of music and a movie excerpt. If we first consider the music signal (figure 11), we see that the correlation of the channels is relatively high and the use of residual mode 2, which the standard codec chooses for this kind of input signal, has a high impact on the compression. Since the layered modes use the aforementioned residual mode 1, we cannot (at least at the moment) further exploit any correlation between the two channels (even the interleaving of partition words is not available, so entropy encoding cannot work as effectively as in the other modes). The overall size increase when compressing the music stream with our codec instead of the standard Vorbis codec settles at about 25% over the output of the standard encoder. If we only use one layer we still end up at about 90% of the standard output size, which provides two channels.

A rather interesting case is the compression of audio streams from a movie. In figure 12 we clearly see that this is an ideal use case for this kind of scalable codec. With only about 15% more space required, our two-layer scalable codec achieves the same output quality as the standard codec. This good result can be explained by the sophisticated mixing of movie sounds: since the different audio sources are usually placed in a virtual room around the audience, the correlation between the different channels is reduced (e.g. a shot of a firearm from the right side mainly affects the right channel). This minimizes the effect of the entropy encoding on the overall signal and therefore improves the relative performance of the scalable Vorbis codec against its standard implementation. If only one layer is available we reduce the space of our transmitted signal to about 77% of the standard encoding.

It seems that this scaling with respect to the channels fits the use case of movie streaming very well, especially if we consider the possible use of more than two channels, e.g. for 5.1 surround sound (with 6 channels). The use of more channels would also help to reduce the impact of the overhead we carry in our first layer for the setup of the codec (codebooks, etc.). In general we see that although we lose the ability to compress such a signal as well as possible – because we are no longer able to exploit the high correlation of the input channels in the entropy encoding stage – we are able to achieve reasonable compression while enabling scaling along the available channels.
This result is very good, especially if we consider that the primary use of this solution lies in a peer-to-peer network, where we should minimize the need to search for new peers (and therefore interrupt the stream) every time the bandwidth is not sufficient to sustain a high-quality stream.

6 Future Work

While implementing a scalable version of the Vorbis codec we found several possibilities for further improving the scaling. With the current codec not many optimizations are feasible on a local scale. There are possible ways of further degrading, for example, the base layer signal and enlarging the enhancement layers to get a larger spread between the layer bitrates, but this would not change the maximal bitrate.

A better way to further improve the codec would be to cascade multiple instances of the codec. One approach could create several codec instances which work synchronously on the same input signal but are configured to produce output at different quality levels. The lowest-level output would represent the base layer, and the better-quality instances of the codec would use the information generated by the base layer instance for their encoding. This could, for example, be the residuals calculated in the base layer, afterwards inverted and fed into the next quality-step encoder. The encoder on the next stage would only need to encode the difference between its quality level and the previously calculated quality level. Our idea is that such a configuration could create more distinct quality levels and would better exploit the similarities between the different quality signals. A problem with this approach could be the different MDCT window sizes and the alignment of the packets (keeping the packet numbering consistent). An advantage would be the good quality across all layers, but at the price of higher overheads.

Another idea would be, instead of using multiple instances, to use only one encoder but change its quantization to calculate many different quality levels. To accomplish this, an encoder would encode a low-quality residual, decode it again, encode the same signal with a higher quality and subtract the low-quality residuals. The signal difference can then be used for a higher-quality layer. The reconstruction of the signal would use the low-quality layer for a low-bandwidth output signal, and add the higher-quality layer (the calculated difference) on top to decode a high-quality signal. An advantage would be that the best signal would only use about as much space as the equivalent standard Vorbis signal. One problem could be that the low-quality layers would carry a signal of really bad quality.

Both of these approaches could suffer from rounding errors: each subtracted quality step could introduce a small error which would eventually degrade the signal in a way that is audible.

A different approach would be to exploit the high correlation of the partition words. That means we would only encode the difference between the base layer and the enhancement layer. This would reduce the possible values in a partition word and therefore help the entropy encoding stage to compress the stream better.

These approaches seem worthwhile to explore in the future, especially regarding the current mobile market and the ever-increasing demands of transmitting multimedia data over the air. This could decrease the burden on the networks of mobile providers as well as enable the transmission of multimedia data over poor network connections, which is unsatisfying or even infeasible at the moment.
Regarding our peer-to-peer approach, such a codec could include mobile devices to help spread a stream, for example by just providing enhancement layers (which are significantly smaller than the base layer) over their low-bandwidth connections. Peers could then receive the base layer from a server and the enhancement layers from devices nearby.

7 Conclusion

Concluding this work, we see tremendous potential in developing scalable audio codecs. While the main focus of research has been on further optimizing the compression of codecs, almost no progress has been made in the development of scalable audio codecs. Today, if a scalable audio codec is needed, people often just use an established video compression codec – even if they use only its audio part. With the use of a scalable audio codec, in particular considering the use case of a peer-to-peer multimedia network, much improvement could be made. With the ever-growing market for mobile solutions and today's information and entertainment needs, there seems to be huge potential to lessen the burden on network and content providers by offloading some of the distribution work into the network itself – and multimedia codecs like this scalable Vorbis version enable this process in a very efficient way.

8 References

[1] R. Meier, Toward Structured and Time-Constraint Content Delivery Systems. 2011.

[2] I. Glover and P. Grant, Digital Communications. Prentice Hall, 2009.

[3] J. Ohm, Multimedia communication technology: representation, transmission and identification of multimedia signals. Signals and Communication Technology, Springer, 2004.

[4] J. Svitek, “Ogg vorbis: Subjective assessment of sound quality at very low bit rates,” Technical Report 27, CESNET, 2006.

[5] Vorbis I specification, August 2011.

[6] T. Cover and J. Thomas, Elements of information theory. Wiley Series in Telecommunications and Signal Processing, Wiley-Interscience, 2006.

[7] "Vorbis bounty on bitrate peeling." http://wiki.xiph.org/Bounties#Ogg_Vorbis_Bitrate_Peeling, January 2012.

[Figure 9, panels (a) to (f): size [bytes] versus number of samples for sine waves at 100Hz, 400Hz, 1kHz, 4kHz, 10kHz and 15kHz. Each panel compares the raw signal, standard Vorbis, layered Vorbis (2 layers) and layered Vorbis (1 layer).]

Fig. 9: Illustration of the data sizes achieved with different versions of the Vorbis codec. We compare the standard Vorbis I codec (blue solid line) with the two-channel layered version (red dashed line) and the one-channel layered version (green dashed line). The frequencies used are sampled across the human audible spectrum.

[Plot: size [bytes] versus number of samples for the white noise signal, comparing the raw signal, standard Vorbis, layered Vorbis (2 layers) and layered Vorbis (1 layer).]

(a) Size properties of the white noise signal.

[Plot: power spectrum of the white noise signal, power versus frequency [Hz].]

(b) The power spectrum of white noise. This type of signal has equal power in any frequency band of a given bandwidth. The sampling frequency used here was 44.1kHz.

Fig. 10: White noise.

[Plot: size [bytes] versus number of samples for the music signal, comparing the raw signal, standard Vorbis, layered Vorbis (2 layers) and layered Vorbis (1 layer).]

Fig. 11: Size properties of a music signal.

[Plot: size [bytes] versus number of samples for the movie audio excerpt, comparing the raw signal, standard Vorbis, layered Vorbis (2 layers) and layered Vorbis (1 layer).]

Fig. 12: Size properties of an audio excerpt from a movie.