An Analytical Comparison of Digital Audio Encoding Technologies
Sean McGrath
October 18, 2005

Executive Summary

With the recent popularity of the Internet and the use of personal computers as media devices, digital audio, especially digital music, has become a common component of most people's lives. As the number of uses for digital audio has grown, so has the number of different ways to store and encode audio digitally. This report presents a critical analysis and comparison of three of the main methods used for encoding digital audio: MPEG-1 Layer-3, MPEG-4 Advanced Audio Coding, and Vorbis I. The majority of this analysis focuses on the technical features of these methods and the approaches they use in their encoding algorithms. Each of these encoding technologies has its own unique features, benefits and drawbacks, and this report outlines them in detail.

Contents

1 Introduction
2 Digital Audio
2.1 Sampling
2.2 Bit Rate
2.3 Audio Bandwidth
2.4 Encoding and Decoding
3 Determining Audio Quality
4 Audio Codecs
4.1 MPEG-1 Layer-3
4.2 MPEG-4 Advanced Audio Coding
4.3 Vorbis I
5 Performance Comparison
6 Recommendations
7 Conclusion

1 Introduction

In an age where the distribution of digital content is beginning to surpass the distribution of physical media, the manner in which this content is represented digitally can play a huge role in its acceptance by end users. This is especially true for digital forms of media (i.e. images, audio and video), since they often require large amounts of data to be represented accurately. This report focuses on the digital storage of audio and several of the more popular methods for encoding it. The topic has become the subject of heated discussion in the last few years as the worldwide music industry has started distributing its content over the internet.
When choosing how to encode digital music, you are in fact choosing the quality of the audio (how closely it resembles the original source), the amount of disk space required to store the audio at that level of quality, how compatible the audio will be with players and portable devices, and how limited you will be in its use. Limiting the end user's use of audio is accomplished through digital rights management (DRM), something that varies from encoding to encoding. DRM is often the deciding factor for the music industry when choosing how to distribute its music digitally. We will see that by doing so, the industry in fact drastically limits the quality and portability of the music.

This report covers the features and specifications of the MPEG-1 Layer-3 (MP3), MPEG-4 Advanced Audio Coding (AAC) and Vorbis I (Ogg Vorbis) encoding/decoding schemes (or codecs) and the benefits that each has over the others. These are three of the most popular codecs in use today and were chosen for discussion for that reason, along with the fact that they are supported on multiple platforms. Another large player in the codec world, especially with the music industry, is the Windows Media 9 audio codec, which was left out of this discussion because it is supported only on the Microsoft Windows platform.

2 Digital Audio

The primary goal of a digital audio codec is to take an existing digital audio stream, compress it (also called encoding) and store it in a new format. To play the encoded stream, the codec must first decode it. Before we can make sense of the encoding/decoding process it is necessary to explain how audio is represented digitally.

2.1 Sampling

Sampling of digital audio refers to the process of digitally storing the amplitude of the sound wave at a given point in time.
Each time a sample is taken, the amplitude is stored typically as a 2-byte (16-bit) or 3-byte (24-bit) value that is capable of capturing even subtle differences in volume. When encoding digital audio, one of the key choices to be made is how often to sample the original sound source. This is known as the sampling frequency, with 44.1 kHz (44,100 samples taken per second) and higher being desired for high-quality audio. Compact discs (CDs) use a sampling rate of 44.1 kHz and store each sample using 16 bits.

The Nyquist Theorem states that in order to prevent abnormal audio signals (aliasing) in the representation, a sampling frequency of at least twice the highest recorded frequency is needed [8]. The highest audio frequency that the human ear can hear is 20 kHz, so a sampling frequency of 44.1 kHz is above the minimum required to avoid these abnormalities.

Uncompressed digital audio, such as that found on CDs or in WAV files, is stored using what is called the Pulse Code Modulation (PCM) format, which uses this method of sampling and provides a very accurate reflection of the original sound.

2.2 Bit Rate

Another measurement that plays an important role in audio encoding is the bit rate, or the number of bits used to store a segment of audio. Bit rates are typically measured in kilobits per second (kbps), and range from 8 kbps to 1411 kbps (the bit rate used on compact discs). Lower bit rates are often associated with lower quality, and the aim of some encoders is to overcome this and maximize audio quality at lower bit rates.

There are three different methods used to manage bit rates while encoding an audio stream. The first is constant bit rate (CBR), which uses the same number of bits to store each sample. In contrast, an average bit rate (ABR) stores each second of the audio stream with the same number of bits on average, but the number of bits for each sample may vary. The final type of bit rate that is commonly used is a variable bit rate (VBR).
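The relationship between the sampling parameters above and the CD bit rate can be checked with a few lines of Python (a sketch; the constant names are illustrative, not from any standard):

```python
# Tie the sampling section's figures to the Nyquist Theorem and the
# raw PCM bit rate of a compact disc.

SAMPLE_RATE_HZ = 44_100      # CD sampling frequency
BIT_DEPTH = 16               # bits per sample
CHANNELS = 2                 # stereo
HIGHEST_AUDIBLE_HZ = 20_000  # upper limit of human hearing

# Nyquist: the sampling rate must be at least twice the highest
# frequency we want to capture.
required_rate = 2 * HIGHEST_AUDIBLE_HZ
assert SAMPLE_RATE_HZ >= required_rate  # 44,100 > 40,000

# Raw PCM bit rate = samples/sec x bits/sample x channels.
bit_rate_bps = SAMPLE_RATE_HZ * BIT_DEPTH * CHANNELS
print(bit_rate_bps)  # 1411200, i.e. the ~1411 kbps CD bit rate
```

This is where the oft-quoted 1411 kbps figure for CD audio comes from.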
With a variable bit rate, the encoder chooses the best bit rate for each segment of audio depending on its characteristics, in order to keep quality high while saving disk space.

2.3 Audio Bandwidth

The audio bandwidth of an audio source refers to the frequency range of that source. The higher the audio bandwidth of a signal, the more of the original sound it can represent. The full bandwidth required to reproduce signals audible to the human ear ranges from 20 Hz to 20 kHz [17]. The importance of audio bandwidth will become apparent in the later sections of this report when we look at how encoders attempt to minimize disk storage.

2.4 Encoding and Decoding

By doing a few simple calculations on the uncompressed digital audio stored on a CD, we see that the disk space required to store audio in this form is an issue. An audio CD uses a bit rate of approximately 1411 kbps, or about 1,411,000 bits per second. This works out to roughly 172 KB per second of audio, or 10 MB per minute. Now imagine someone wanted to store their entire CD collection, consisting of 200 discs at 40 minutes apiece. This would require roughly 80 GB of storage, which even with today's large, inexpensive hard drives is impractical.

By compressing this audio to lower bit rates, we can effectively reduce the file size of an audio file with very little loss in quality. Studies have shown that under optimal listening conditions, even expert listeners are unable to distinguish uncompressed audio from compressed audio (stereo, 16-bit samples, 256 kbps, 48 kHz sampling frequency) at roughly a sixth of the original size [10]. Using these compression settings we would be able to shrink the CD library mentioned above to about 14 GB, where it could then be stored on a portable device. So if compressing audio to a bit rate of 256 kbps is enough to reduce the size to a sixth, why do encoders bother encoding at levels such as 128 kbps and 64 kbps?
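The storage arithmetic above can be reproduced directly (a sketch; variable names are illustrative, and sizes use binary megabytes/gigabytes):

```python
# Reproduce the report's disk-space figures from the CD parameters.

CD_BIT_RATE_BPS = 1_411_200               # 44.1 kHz x 16 bits x 2 channels
bytes_per_second = CD_BIT_RATE_BPS // 8   # 176,400 bytes (~172 KB/s)
mb_per_minute = bytes_per_second * 60 / 2**20  # ~10.1 MB per minute

# A 200-disc collection at 40 minutes apiece:
library_seconds = 200 * 40 * 60
library_gb = library_seconds * bytes_per_second / 2**30
print(round(library_gb, 1))     # ~78.9 GB of uncompressed PCM

# Re-encoded at 256 kbps, the same library shrinks by a factor of ~5.5:
compressed_gb = library_seconds * (256_000 // 8) / 2**30
print(round(compressed_gb, 1))  # ~14.3 GB
```

The exact ratio is 1411.2 / 256 ≈ 5.5, which is where "a sixth of the original size" comes from.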
The answer is simply that by slightly decreasing the quality of the audio, we can decrease storage even further, to sizes that are more attractive for use on the internet and portable devices. We will see in the later sections that the three codecs discussed focus heavily on providing high-quality audio at these lower bit rates.

The compression of audio signals differs greatly from the compression of regular data files such as text files and executables. With these basic file types, compression must be non-destructive in the sense that once they are uncompressed you have the exact same file, bit for bit, as the original. Audio compression, or encoding, is instead typically based on a psychoacoustic model that eliminates sounds in the input signal that are not perceived by the human ear. This results in the encoded signal sounding the same to humans, but being represented much differently, on a bit-for-bit basis, once decoded (uncompressed). Audio can also be lightly compressed without destroying information when lossless compression is used; if the decoding process does not produce a bit-for-bit replica of the original, lossy encoding is being used. The psychoacoustic models used by the codecs covered in this report succeed by exploiting the limitations of the human ear to remove imperceptible components of the audio signal, a technique called perceptual coding.
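The lossless/lossy distinction can be illustrated with a toy sketch. This uses general-purpose zlib compression (not an audio codec) and crude bit-depth quantization (not a psychoacoustic model); it only shows that discarding information makes data compress smaller while breaking bit-for-bit reconstruction:

```python
import math
import struct
import zlib

# One second of a 440 Hz sine as 16-bit mono PCM, a stand-in for
# uncompressed CD-style audio.
RATE = 44_100
samples = [int(32_767 * math.sin(2 * math.pi * 440 * n / RATE))
           for n in range(RATE)]
pcm16 = struct.pack(f"<{RATE}h", *samples)

# Lossless: decompression returns the original, bit for bit.
lossless = zlib.compress(pcm16)
assert zlib.decompress(lossless) == pcm16

# "Lossy" (crude quantization, NOT perceptual coding): zero the low
# 8 bits of every sample, discarding fine amplitude detail.
quantized = [s & ~0xFF for s in samples]
pcm_lossy = struct.pack(f"<{RATE}h", *quantized)
lossy = zlib.compress(pcm_lossy)

# The quantized stream compresses far smaller, but the original
# samples can no longer be recovered.
print(len(lossless), len(lossy))
```

Real perceptual coders make this trade-off far more intelligently, discarding only what a psychoacoustic model predicts the ear cannot hear.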