
Communication: Paper ICA2016-109

Evolution of sound reproduction – from mechanical solutions to digital techniques optimized for human hearing

Ville Pulkki

Aalto University, Finland, Ville.Pulkki@aalto.fi

Abstract

Sound reproduction consists of the processes of recording, processing, storing and recreating sound, typically speech, music, environmental or other sounds. The applications include such fields as public address, telephony, music, cinema, virtual reality, and aided hearing. The unifying factor is the common endpoint of the chain, the human listener. Historically, reproduction of sound has come a long way from the first monophonic devices. Nowadays audio is available efficiently in various digital formats, and immersive 3D spatial sound can be reproduced over different multichannel set-ups or headphones. In a general view, the trend in the development of sound reproduction during the last decades has been the dedicated design of reproduction systems to better match the resolution of the human hearing system. The reproduction methods are designed to reproduce acoustic parameters in time, frequency, and space with only slightly better resolution than the hearing system has. This paper discusses the needs and challenges faced in sound reproduction, and it also presents various solutions and technologies.

Keywords: audio, sound reproduction, spatial sound

1 Introduction

The methods to store audio over time have been developed for more than a hundred years, starting from mechanical solutions and ending with masking-based audio coding. The first reproduction methods were monophonic, and the signal was made audible using a single acoustic radiator. The evolution of systems that also reproduce the spatial properties of sound has produced a wide variety of different listening set-ups, with different numbers of loudspeakers in 2D or 3D positioning, or headphone reproduction with or without head tracking. The audio formats for spatial set-ups have been based on delivering a single signal for each loudspeaker. Only recently have techniques emerged to represent audio in a generic format that can be listened to with any spatial reproduction technique.

2 Storing audio signals

2.1 Analog methods

The first method to transfer sound over time was the phonograph, a device invented in 1877 for the mechanical recording and reproduction of sound. In its later forms it is also called a gramophone. The sound is captured with a horn into a tube, and the waveforms propagating in the tube are recorded as corresponding physical deviations of a spiral groove in the surface of a rotating cylinder or disc. When reproducing the sound, the surface is similarly rotated while a playback stylus traces the groove. The motions of the stylus are then transferred to vibrations of a membrane. The membrane then radiates into a tube with a horn at its open end, and the reproduced sound is made audible. In later electric record players or turntables, the motions of the stylus are converted into an analog electrical signal by a transducer called a pickup or cartridge. The signal is further electronically amplified with a power amplifier, then made audible using a loudspeaker. [29]

As an alternative to mechanical storage, magnetic solutions were developed in the 1930s. The most used ones were based on tape recorders. The audio signal is represented in analog form in the strength of magnetization of the magnetic coating on the tape. The playback head of the recorder then reads the signal and converts it into an electric signal. [28]

2.2 Digital methods

Analog methods suffered from multiple problems, such as limited signal-to-noise ratio and various distortions in the signal. Digital methods solve most of the problems in storing audio signals: the electric audio signal from a microphone is converted into digital (numeric) form using analog-to-digital (A/D) conversion.

The most straightforward representation is to use PCM coding (Pulse-Code Modulation) in the A/D conversion. Each sample is quantized into a binary number, where the number of bits implies the precision of the result. Sample values of an analog signal are mapped onto binary numbers so that there are 2^n discrete levels when the number of bits is n. [30] Quantization with finite precision generates an error called quantization noise. Each added bit improves the SNR between the maximal signal and the quantization noise by about 6 dB, so that the 16 bits often used in audio yield a maximum SNR of about 96 dB. Because the dynamic range of the auditory system is about 130 dB, even more than 22 bits may be needed.

Figure 1: The block diagram of an audio encoder and decoder based on perceptual masking. Adapted from [3].

Digital signal processing has many advantages compared to analog techniques. It is predictable, and a single DSP processor may be programmed to compute any DSP program that does not exceed its processing capacity. However, the amount of data needed to represent audio signals is too high for some applications, and the delay due to the coding process may be harmful in communication.
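To make the 6-dB-per-bit rule concrete, the following short Python sketch (an illustration added here, not part of the original paper) quantizes a near-full-scale sinusoid with a uniform quantizer and measures the resulting SNR for different word lengths.

import numpy as np

def quantize(x, n_bits):
    # Uniform mid-tread quantization of a signal in [-1, 1)
    step = 2.0 / (2 ** n_bits)             # quantization step size
    xq = np.round(x / step) * step
    return np.clip(xq, -1.0, 1.0 - step)

fs = 48000
t = np.arange(fs) / fs
x = 0.99 * np.sin(2 * np.pi * 997 * t)     # near-full-scale test tone

for n_bits in (8, 16, 24):
    e = x - quantize(x, n_bits)            # quantization error
    snr_db = 10 * np.log10(np.sum(x ** 2) / np.sum(e ** 2))
    print(f"{n_bits:2d} bits: SNR = {snr_db:.1f} dB")  # roughly 6 dB per bit

With 16 bits the printed value lands within a couple of decibels of the 96 dB figure quoted above; the exact value depends on the crest factor of the test signal.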

2.3 Masking-based audio coding

Considerable development has been conducted in the compression of audio signals into lower amounts of data, where the perceptual effects caused by the compression are minimized. With audio coding methods such as MPEG-1 Layer-3 (MP3) [10] and MPEG-2 Advanced Audio Coding (AAC) [11, 2], the main means of reducing the bit-rate is to transform the signals into the time-frequency domain and to optimize the quantization of the time-frequency samples using a perceptual masking model, as shown in Fig. 1. If quantization noise were spread equally over the entire frequency region, it would be easily audible in those frequency regions where the signal level is low. The spectrum of quantization noise can instead be shaped so that it follows the masking curve created by the signal, but shifted slightly lower in level. In general, if the noise level is set to about 13 dB lower than that of the signal, the quantization noise is no longer audible, although corresponding quantization noise with a flat spectrum is clearly annoying.

This effect is known as the "13-dB miracle" from an audio demonstration given by J. D. Johnston and K. Brandenburg at AT&T in 1990. The audio signals used in the demonstration are described in [4].
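The idea of keeping the noise a fixed margin below the signal can be sketched as a toy per-band bit allocation. This is a simplification added for illustration, not the actual MP3/AAC psychoacoustic model; it assumes roughly 6 dB of noise reduction per allocated bit and treats the band level itself as the masking curve.

import numpy as np

def allocate_bits(band_levels_dbfs, margin_db=13.0):
    # Toy perceptual bit allocation: choose just enough bits per
    # band so that the quantization noise (about -6.02*b dBFS)
    # stays margin_db below the band's signal level.
    bits = []
    for level in band_levels_dbfs:
        noise_floor = level - margin_db          # allowed noise level
        bits.append(max(0, int(np.ceil(-noise_floor / 6.02))))
    return bits

band_levels = [-10.0, -25.0, -40.0, -70.0]       # example band levels, dBFS
print(allocate_bits(band_levels))                # -> [4, 7, 9, 14]

Note that in this full-scale-referenced toy model, quieter bands need more bits to push the noise floor far enough down; a real codec quantizes each band relative to its own scale factor, so the allocation behaves differently in detail.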

3 Channel-based sound systems

3.1 Monophony

Monophonic reproduction was the state-of-the-art method for sound reproduction for decades, and for many applications it is still an adequate method. For example, in telephony the sound captured near the mouth of the talker is most commonly reproduced monophonically with a loudspeaker next to the ear. This gives maximal quality in speech intelligibility, which is the key requirement in mobile communication. However, due to coloration effects, monophonic reproduction is considered suboptimal in many other applications, such as music and cinema.

A disadvantage of monophonic reproduction is that the coloration caused by the recording room is exaggerated in listening when compared to binaural listening in the recording venue [19]. The reason for this is that the reproduced single sound signal is filtered by the room, manifesting as a complex structure in the magnitude spectrum. In natural listening conditions, the sound reaching the two ears of a listener through the recording room has different magnitude and phase spectra at each ear, which results in binaural decoloration. However, in monophonic reproduction the spectrum emanating from the loudspeaker has already been filtered by the recording room response, and the binaural decoloration mechanisms can only try to compensate for the listening room acoustics, not the recording room acoustics. This makes the acoustical effect of the recording room overemphasized in listening.

When the microphone is at a distance of a few centimeters from the source, the acoustics of the recording room does not have much of an effect on the result, and the timbral quality of the reproduction is good. The disadvantage discussed in the previous paragraph is thus valid only when the captured effect of the room is significant, or in technical terms, when the level of the direct sound is comparable to or lower than that of the reverberant field.

3.2 Stereophony

The motivation to use more than one loudspeaker in reproduction is the potentially better spatial quality in a larger listening area. The most common loudspeaker set-up is the two-channel stereophonic set-up. Its use became widespread after the development of the single-groove 45°/45° two-channel record in the late 1950s. Two loudspeakers are positioned in front of the listener, 60° apart. The set-up enables the positioning of virtual sources between the loudspeakers, and it also improves the timbral quality of reproduction when compared to monophonic reproduction.

The reduced coloration in stereophonic reproduction can be understood with a simple discussion. When recording for a stereophonic setup, at least two microphones are used. Since each microphone is located in a different position in the recording room, or alternatively is in the same position but has a different directivity, the room effect is different in the corresponding signals.

This results in loudspeaker signals that have different manifestations of the room effect, with different phase and magnitude spectra, enabling binaural decoloration effects in hearing. The reproduced sound thus does not have as emphasized a room effect as monophonic reproduction has, and the perceived quality is improved considerably.
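The positioning of virtual sources between the loudspeakers mentioned above is typically realized with amplitude panning. As an illustrative sketch added here (the tangent law is one classic formulation among several), the following function computes left/right gains for the standard ±30° pair:

import numpy as np

def stereo_pan(azi_deg, base_deg=30.0):
    # Tangent-law amplitude panning for a stereo pair at +/- base_deg.
    # Positive azimuth points toward the left loudspeaker.
    r = np.tan(np.radians(azi_deg)) / np.tan(np.radians(base_deg))
    gl, gr = 1.0 + r, 1.0 - r          # gain ratio from the tangent law
    norm = np.hypot(gl, gr)            # normalize so gl^2 + gr^2 = 1
    return gl / norm, gr / norm

print(stereo_pan(0.0))     # center: equal gains, about (0.71, 0.71)
print(stereo_pan(30.0))    # hard left: (1.0, 0.0)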

3.3 Multi-channel set-ups

Different multi-channel loudspeaker set-ups have been specified in the history of multichannel audio [26, 6, 27]. In the 1970s, the quadraphonic set-up was proposed, where four loudspeakers are positioned evenly around the listener at azimuth angles ±45° and ±135°. This layout was never successful because of problems related to the content delivery techniques of that time, and because the layout itself has too few loudspeakers to provide good spatial quality in all directions around the listener [23].

A sound reproduction system was developed for cinema, where the front image stability of the standard stereophonic set-up was enhanced by an extra center channel, and two surround channels were added to create atmospheric effects and room perception. This system for cinemas was first used in 1976 [6], and the ITU made a recommendation for the layout in 1992 [5]. In the late 1990s some households also acquired this 5.1 surround system, where the figure before the dot stands for the number of loudspeakers and the figure after the dot is the number of low-frequency channels. In the ITU recommendation, the three frontal loudspeakers are in the directions 0° and ±30°, and the two surround channels in the directions ±110° ± 10°. The system has been criticized for not being able to deliver good directional quality anywhere other than in the front [23], so other layouts with 6–12 loudspeakers have been proposed to enhance the directional quality in other directions as well.

All the loudspeaker layouts described above have loudspeakers only in the horizontal plane. There are also systems in which loudspeakers are placed above and/or below the listener, for use in theaters and virtual environment systems. This enables positioning sound sources above and below the listener as well, thus enhancing the perceived realism especially in situations where the 3-D position of a virtual source is important [24]. Typical examples of such situations are virtual sources for flying vehicles or the sound of raindrops on a roof. Such 3-D set-ups have been proposed for use in domestic listening too and are currently being standardized [12]. For example, the Japan Broadcasting Corporation has proposed a 22.2 loudspeaker set-up [8], which has 22 loudspeakers in planes at three heights and two low-frequency channels. A common setup adds four elevated loudspeakers to the 7.1 or 5.1 set-up, denoted as 7.1.4 or 5.1.4, where the last digit denotes the number of elevated loudspeakers.

3.4 Binaural reproduction

The basic technique is to reproduce a recorded binaural sound track through headphones. The recording is made by inserting miniature microphones into the ear canals of a real human listener, or by using a manikin with microphones in the ears [31, 1]. Such a recording is reproduced by playing the recorded signals to the ears of the listener. In principle, this is a very simple technique and can provide effective results.

A simple implementation is to replace the transducers of in-ear headphones with miniature insert microphones, use a portable audio recorder to record the sounds of the surroundings, and play back the sound with headphones. Even without any further processing, a convincing spatial effect is achieved, as the left-right directions of the sound sources and the reverberant sound field are reproduced naturally. Especially if the person who made the recording is also the listener, the effect can be striking.

Unfortunately, there are also technical challenges with the technique. The sound may appear colored, the perceived directions may move from front to back, and everything may be localized inside the head. To partially avoid these problems, the recording and the reproduction should be carefully equalized, because headphone listening typically produces a different magnitude spectrum at the ear drum than natural listening. Careful equalization of headphone listening is, unfortunately, a complicated business, and it requires very accurate measurements of the acoustical transmission of sound from the headphone to the ear drum [32].

A further challenge in binaural reproduction is that the auditory system also utilizes dynamic cues to localize sound sources. When listening to a binaural recording with headphones, the movements of the listener do not change the binaural reproduction at all, and the best explanation for the auditory system is that the sources must be inside the listener's head. This is one reason why headphone reproduction easily tends to be localized inside the head of the listener [1].
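The core signal processing of binaural rendering is simple to sketch: a mono signal is convolved with the head-related impulse responses (HRIRs) of the desired direction, one per ear. The sketch below is added for illustration and assumes a hypothetical pair of HRIR arrays; real HRIRs would come from measurements such as those discussed above.

import numpy as np
from scipy.signal import fftconvolve

def binauralize(mono, hrir_left, hrir_right):
    # Convolve a mono signal with one HRIR per ear to place it
    # at the direction the HRIR pair was measured for.
    left = fftconvolve(mono, hrir_left)
    right = fftconvolve(mono, hrir_right)
    return np.stack([left, right], axis=-1)   # (samples, 2) headphone feed

# Hypothetical measured responses for, e.g., 30 degrees to the left:
# hrir_l = np.load("hrir_azi30_left.npy")
# hrir_r = np.load("hrir_azi30_right.npy")
# out = binauralize(dry_signal, hrir_l, hrir_r)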

4 Loudspeaker-setup-agnostic formats

The drawback of channel-based spatial audio systems is the requirement to listen to each format with a fixed loudspeaker setup. The consumer has to purchase the audio content specifically for his or her listening system, and the interchange between different systems is awkward. A generic audio format suitable for any standard or arbitrary loudspeaker layout or headphones would be an answer to these needs, and a considerable amount of research has been conducted in that direction.

4.1 Generic linear audio formats

The Ambisonics reproduction technique [7] provides a theoretical framework for coincident recording techniques for 2-D and 3-D multichannel loudspeaker set-ups. In theory, Ambisonics is a compact-format, loudspeaker-set-up-agnostic, efficient and comprehensive method for the capture, storage and reproduction of spatial sound.

All coincident multichannel microphone techniques produce signals that can be transformed into B-format signals. B-format signals have directional patterns that correspond to spherical harmonics. The spherical harmonics can thus be seen as basis functions for the design of arbitrary patterns. The most common microphone device for Ambisonics is the first-order four-capsule B-format microphone, producing signals with the directional patterns of spherical harmonics up to the first order. Higher-order microphones with more capsules have also been developed and are commercially available. The higher-order components can then be derived in a specific frequency window.

There are some higher-order microphones available which can extract harmonics up to about the 4th–6th order in a certain frequency window. The number of good-quality microphone capsules in such devices is relatively high, something like 32–64. Outside the frequency window the microphones suffer from low-frequency noise and from deformation of the directional patterns at high frequencies [21, 15].

In principle, first-order Ambisonics could be used for any loudspeaker set-up, but unfortunately it has a very limited range of use. The broad first-order directional patterns make the listening area where the desired effect is audible very small, extending to the size of the head of the listener only at frequencies below about 700 Hz [25]. At higher frequencies, the high coherence between the loudspeaker signals leads to undesired effects, such as coloration and loss of spaciousness. The number of loudspeakers in a horizontal set-up should not exceed 2N+1, where N is the order of the B-format microphone, to avoid high coherence between the loudspeaker signals. Thus, first-order microphones can be used only with three-loudspeaker set-ups, which are far too few to produce the perception of virtual sources between the loudspeakers. This calls for the use of microphone set-ups able to capture signals with higher-order spherical harmonic directional patterns. With higher-order directional signal components, the directional patterns of the loudspeaker signals can be made narrower, which solves these issues, though with the known problems of low-frequency noise and deformation of the patterns at high frequencies.
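As a minimal sketch of these ideas, added here for illustration and assuming the traditional first-order B-format convention (a 1/sqrt(2) factor on the W channel), the following encodes a mono signal into horizontal B-format and decodes it to a three-loudspeaker ring with virtual cardioid-like patterns, matching the 2N+1 rule for N = 1:

import numpy as np

def encode_bformat(s, azi_rad):
    # Horizontal first-order B-format encoding of a plane wave
    # arriving from azimuth azi_rad (traditional convention).
    return s / np.sqrt(2.0), s * np.cos(azi_rad), s * np.sin(azi_rad)

def decode_basic(w, x, y, ls_azis_rad):
    # One virtual cardioid-like microphone aimed at each loudspeaker.
    return [np.sqrt(2.0) * w + x * np.cos(a) + y * np.sin(a)
            for a in ls_azis_rad]

ring = np.radians([0.0, 120.0, -120.0])    # 2N+1 = 3 loudspeakers for N = 1
s = np.random.randn(1024)                  # test signal
w, x, y = encode_bformat(s, np.radians(45.0))
ls_signals = decode_basic(w, x, y, ring)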

4.2 Generic non-linear audio formats

The problems with the utilization of spherical harmonic signals in the reproduction of spatial sound led to the question of whether the resolution of human hearing could be exploited to enhance spatial audio reproduction. The spatial resolution of hearing is limited within the auditory frequency bands [1]. In principle, all sound within one critical band can only be perceived as a single source, with a broader or narrower extent. The limitations of spatial auditory perception raise the question of whether the spatial accuracy in the reproduction of an acoustical field can be compromised without a decrease in perceptual quality. When some assumptions on the resolution of human spatial hearing are used to derive reproduction techniques, a potentially enhanced quality of reproduction is obtained [16].

The audio recording and reproduction technique called Directional Audio Coding (DirAC) [18, 14] was the first non-linear, signal-dependent method to record and reproduce spatial audio exploiting the resolution of human hearing. It assumes that the spatial resolution of the auditory system at any one time instant and in one critical band is limited to extracting one cue for direction and another for inter-aural coherence. It further assumes that if the direction and diffuseness of the sound field are measured and reproduced correctly with a suitable time resolution, a human listener will perceive the directional and coherence cues correctly.

An example implementation of DirAC is shown in Figure 2. In the analysis, the direction and diffuseness parameters of the sound field are estimated using temporal energy analysis in the auditory frequency bands. The parameters are then used in the reproduction. The direction is expressed in azimuth and elevation angles, indicating the most important direction of arrival of sound energy. Diffuseness is a real number between zero and one that indicates whether the sound field resembles mostly a plane wave or a diffuse field. Virtual microphones are then formed from the B-format signals, which are divided into a diffuse stream and a non-diffuse stream using the diffuseness parameter.

Figure 2: Reproduction of spatial audio with Directional audio coding (DirAC). The system is non-linear and signal-dependent, and it is based on assumptions of human hearing.

The non-diffuse stream is assumed to contain sound that originates primarily from one source, and in DirAC the corresponding time-frequency part of the sound is reproduced in the direction analyzed for it. The diffuse stream in turn is assumed to contain sound originating from reverberation or from multiple concurrent sources in different directions, which should produce low interaural coherence. The diffuse stream is thus applied to all directions after some decorrelation process. Both streams thus reduce the coherence between the loudspeaker channels, which mitigates the artifacts of the corresponding linear systems. According to listening tests, DirAC indeed enhances the reproduction quality when compared to first-order linear decoding, with both loudspeakers and headphones [17].

The system performs well if the recorded spatial sound does not strongly violate the implicit sound-field assumption of DirAC, i.e., that at each frequency band only a single source is dominant at one time, with a moderate level of reverberation. In other cases audible distortions, i.e., artifacts, may occur. Typical cases are, e.g., surrounding applause, speech in the presence of broadband noise from opposing directions, or strong early reflections within a single temporal analysis window. A number of methods to overcome the artifacts have been developed [17], and the utilization of higher-order input in the parametric analysis makes the artifacts vanish in any realistic acoustical condition [20].
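The analysis step can be sketched compactly. Under the traditional B-format convention, the direction of arrival can be estimated from an intensity-like vector, and the diffuseness from the ratio of its magnitude to the energy density. The sketch below is an illustration under these assumptions; it operates on STFT coefficients of the B-format channels and omits the temporal averaging that a real implementation applies.

import numpy as np

def dirac_analysis(W, X, Y, Z, eps=1e-12):
    # W, X, Y, Z: complex STFT coefficients of the B-format channels,
    # arrays of shape (bins, frames). Traditional convention assumed:
    # pressure p = sqrt(2)*W, and (X, Y, Z) points toward the source.
    p = np.sqrt(2.0) * W
    v = np.stack([X, Y, Z])                    # (3, bins, frames)
    I = np.real(np.conj(p) * v)                # pseudo-intensity vector
    E = 0.5 * (np.abs(p) ** 2 + np.sum(np.abs(v) ** 2, axis=0))
    I_norm = np.linalg.norm(I, axis=0)
    azi = np.arctan2(I[1], I[0])               # azimuth of arrival
    ele = np.arcsin(I[2] / (I_norm + eps))     # elevation of arrival
    psi = 1.0 - I_norm / (E + eps)             # diffuseness in [0, 1]
    return azi, ele, psi

For a single plane wave the intensity magnitude equals the energy density in these units, so psi approaches zero; for an ideal diffuse field the time-averaged intensity vanishes and psi approaches one.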

4.3 Object-based audio

While the previous two sections discussed the recording of existing spatial sound fields for listening with arbitrary set-ups, there have also been advances in the delivery of audio content to consumers in a listening-set-up-agnostic format. In traditional mixing, the outcome of the mixing desk is a set of audio signals, each meant to be played back over a loudspeaker. Recently, several audio formats meant primarily for cinema have been proposed which allow different reproduction set-ups. The audio signals of discrete sources and multi-channel mixed material are sent accompanied by spatial metadata. The metadata then defines how the tracks are rendered to the loudspeakers.

The corresponding methods are called object-based audio techniques [22, 13]. The first movies with sound tracks in such formats were released in 2012.

Figure 3: The delivery of spatial audio using the MPEG-H standard [9].

For example, the MPEG-H format [9] can be used to deliver a large number of monophonic sound signals (objects) to be positioned in the reproduction setup, which can be either any loudspeaker setup or headphone playback; see Fig. 3. This can be accompanied by the transmission of audio in a channel-based format, which is then either reproduced directly with the loudspeakers or converted to match the layout in reproduction. Furthermore, higher-order Ambisonics (HOA) signals can also be transmitted with MPEG-H.
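A minimal sketch of the object-based idea, added here for illustration (the field names are hypothetical, not the MPEG-H syntax): each object is a mono track plus rendering metadata, and the renderer maps the metadata onto whatever loudspeaker layout is present.

import numpy as np
from dataclasses import dataclass

@dataclass
class AudioObject:
    samples: np.ndarray      # mono audio track
    azimuth_deg: float       # desired rendering direction
    elevation_deg: float
    gain: float = 1.0

def render(objects, pan_fn, n_channels):
    # pan_fn maps a direction to per-loudspeaker gains for the
    # layout actually present at the listener's end.
    length = max(len(o.samples) for o in objects)
    out = np.zeros((n_channels, length))
    for o in objects:
        gains = pan_fn(o.azimuth_deg, o.elevation_deg)
        for ch, g in enumerate(gains):
            out[ch, :len(o.samples)] += o.gain * g * o.samples
    return out

The same object stream can thus be rendered to 5.1, to a 3-D layout, or, with an HRIR-based pan_fn, to headphones.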

References

[1] J. Blauert. Spatial Hearing — Psychophysics of Human Sound Localization. MIT Press, 1996.
[2] M. Bosi, K. Brandenburg, S. Quackenbush, L. Fielder, K. Akagiri, H. Fuchs, and M. Dietz. ISO/IEC MPEG-2 advanced audio coding. J. Audio Eng. Soc., 45(10):789–814, 1997.
[3] K. Brandenburg. MP3 and AAC explained. In 17th Int. Conf. Audio Eng. Soc., September 1999.
[4] K. Brandenburg and T. Sporer. "NMR" and "Masking Flag": Evaluation of quality using perceptual criteria. In 11th Int. Audio Eng. Soc. Conf.: Test & Measurement. AES, 1992.
[5] ITU-R BS.775-2. Multichannel stereophonic sound system with and without accompanying picture. Recommendation, International Telecommunication Union, Geneva, Switzerland, 2006.
[6] M. F. Davis. History of spatial coding. J. Audio Eng. Soc., 51(6):554–69, June 2003.
[7] M. J. Gerzon. Periphony: With-height sound reproduction. J. Audio Eng. Soc., 21(1):2–10, January/February 1973.

[8] K. Hamasaki, K. Hiyama, and R. Okumura. The 22.2 multichannel sound system and its application. In Audio Eng. Soc. Convention 118. AES, 2005.
[9] J. Herre, J. Hilpert, A. Kuntz, and J. Plogsties. MPEG-H audio — the new standard for universal spatial/3D audio coding. J. Audio Eng. Soc., 62(12):821–830, 2015.
[10] ISO/IEC. Coding of moving pictures and associated audio for digital storage media at up to about 1,5 Mbit/s – Part 3: Audio. Standard 11172-3, 1993.
[11] ISO/IEC. MPEG-2 advanced audio coding, AAC. Standard JTC1/SC29/WG11 (MPEG), 1997.
[12] ISO/IEC 23008-1. High efficiency coding and media delivery in heterogeneous environments. Standard, 2014.
[13] P.-A. S. Lemieux, W. Dressler, and J.-M. Jot. Object-based audio system using vector base panning, May 2013. US Patent App. 13/906,214.
[14] J. Merimaa and V. Pulkki. Spatial impulse response rendering. In 7th Intl. Conf. on Digital Audio Effects (DAFX04), 2004.
[15] S. Moreau, J. Daniel, and S. Bertet. 3D sound field recording with higher order Ambisonics – objective measurements and validation of spherical microphone. In Audio Eng. Soc. Convention 120, May 2006.
[16] A. Politis and V. Pulkki. Overview to time-frequency-domain parametric spatial audio techniques. In V. Pulkki, S. Delikaris-Manias, and A. Politis, editors, Parametric Time-Frequency-Domain Spatial Audio, chapter 4. Wiley, 2016. In press.
[17] V. Pulkki, M.-V. Laitinen, J. Vilkamo, J. Ahonen, and A. Politis. First-order directional audio coding. In V. Pulkki, S. Delikaris-Manias, and A. Politis, editors, Parametric Time-Frequency-Domain Spatial Audio, chapter 5. Wiley, 2016. In press.
[18] V. Pulkki. Spatial sound reproduction with directional audio coding. J. Audio Eng. Soc., 55(6):503–16, 2007.
[19] V. Pulkki and M. Karjalainen. Communication Acoustics: An Introduction to Speech, Audio and Psychoacoustics. John Wiley & Sons, 2015.
[20] V. Pulkki, A. Politis, G. Del Galdo, and A. Kuntz. Parametric spatial audio reproduction with higher-order B-format microphone input. In Audio Eng. Soc. Convention 134, May 2013.
[21] B. Rafaely, B. Weiss, and E. Bachmat. Spatial aliasing in spherical microphone arrays. IEEE Trans. Signal Proc., 55(3):1003–10, 2007.
[22] C. Q. Robinson, S. Mehta, and N. Tsingos. Scalable format and tools to extend the possibilities of cinema audio. SMPTE Motion Imaging Journal, 121(8):63–9, 2012.
[23] F. Rumsey. Spatial Audio. Taylor & Francis, 2001.
[24] A. Silzle, S. George, E. Habets, and T. Bachmann. Investigation on the quality of 3D sound reproduction. Proceedings of ICSA, page 334, 2011.
[25] A. Solvang. Spectral impairment of two-dimensional higher order Ambisonics. J. Audio Eng. Soc., 56(4):267–79, April 2008.
[26] G. Steinke. Surround sound — the new phase. An overview. In Audio Eng. Soc. 100th Convention, Copenhagen, Denmark, 1996.
[27] E. Torick. Highlights in the history of multichannel sound. J. Audio Eng. Soc., 46(1/2):27–31, 1998.
[28] Wikipedia. Magnetic tape. https://en.wikipedia.org/wiki/Magnetic_tape, 2016.
[29] Wikipedia. Phonograph. https://en.wikipedia.org/wiki/Phonograph, 2016.
[30] Wikipedia. Pulse-code modulation. https://en.wikipedia.org/wiki/Pulse-code_modulation, 2016.
[31] A. Wilska. Untersuchungen über das Richtungshören (Studies on Directional Hearing). PhD thesis, Helsinki University, 1938. English translation available: http://www.acoustics.hut.fi/publications/Wilskathesis/.
[32] B. Xie. Head-Related Transfer Function and Virtual Auditory Display, volume 2. J. Ross Publishing, 2013.
