SOUNDFIELD ANALYSIS AND SYNTHESIS: Recording, Reproduction and Compression

by

SHUAI WANG

A thesis presented to the University of New South Wales in fulfilment of the thesis requirement for the degree of Master of Engineering (Research) in Electrical Engineering

Kensington, Sydney, Australia

© Shuai Wang, 2007

Originality Statement

I hereby declare that this submission is my own work and to the best of my knowledge it contains no material previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at UNSW or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by others, with whom I have worked at UNSW or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project’s design and conception or in style, presentation and linguistic expression is acknowledged.

I further authorize the University of NSW to reproduce this thesis by photocopying or by other means, in total or in part, at the request of other institutions or individuals for the purpose of scholarly research.

Acknowledgments

As the author of this dissertation, I would like to express my deep and sincere gratitude to my supervisor, Dr. D. Sen, for his inspiration and guidance throughout all the work of this research. It has been a great pleasure to conduct my research under his supervision, and his extraordinary patience and encouragement have helped me conquer various kinds of problems along the way.

Many thanks also go to the other members of our research group and of the School of Electrical Engineering and Telecommunications at the University of New South Wales for their kind help in various forms.

Last but not least, I wish to dedicate this thesis to my family for their constant support and love.

Publications

• S. Wang, D. Sen, W. Lu, “Subband Analysis of Time Delay Estimation in STFT Domain,” Proc. Eleventh Australian International Conference on Speech Science and Technology, pp. 211–215, 2006.

Abstract

Globally, ever increasing consumer interest in multichannel audio is a major factor driving research into soundfield reconstruction and compression. The popularity of the well commercialized 5.1 system and its 6-Channel audio has been strongly supported by the advent of a powerful storage medium, the DVD, as well as by efficient telecommunication techniques. However, this popularity has also revealed potential problems in the development of soundfield systems.

Firstly, currently available soundfield systems have rather poor compatibility with irregular speaker arrangements. Secondly, the bandwidth requirement increases dramatically for multichannel audio representation with good temporal and spatial fidelity.

This master’s thesis addresses these two major issues in soundfield systems. It introduces a new approach to analyzing and synthesizing soundfields, and compares this approach with currently popular systems. To facilitate this comparison, the behavior of soundfields is reviewed from both physical and psychoacoustic perspectives, along with an extensive study of past and present soundfield systems and multichannel audio compression algorithms. A first-order High Spatial Resolution (HSR) soundfield recording and reproduction system has been implemented in this project, and subjectively evaluated using a series of MUSHRA tests to finalize the comparison.

Contents

1 Introduction
  1.1 Background Overview
  1.2 Reconstruction Systems
  1.3 Motivation for Current Project
  1.4 Contribution
  1.5 Dissertation Overview

2 Soundfield Physics and Psychoacoustics
  2.1 Physical Representation
  2.2 Psychoacoustics
    2.2.1 Spatial Hearing with One Sound Source
    2.2.2 Spatial Hearing with Two Sound Sources
  2.3 Summary

3 Soundfield Systems and Multichannel Audio Compression Techniques
  3.1 Historic Review of Soundfield Reconstruction Systems
    3.1.1 3-Channel System
    3.1.2 Stereophony
    3.1.3 Quadraphony
    3.1.4 Ambisonics
    3.1.5 Wave Field Synthesis (WFS)
    3.1.6 Ambiophonics and Vector Based Amplitude Panning (VBAP)
    3.1.7 Summary
  3.2 Summary of Multichannel Audio Compression Techniques
    3.2.1 Lossless Audio Coding
    3.2.2 Lossy Audio Coding
  3.3 Summary

4 Multichannel Audio Compression Technique: Binaural Cue Coding (BCC)
  4.1 Problem Review
  4.2 BCC Design Scheme
    4.2.1 Macroscopic Design
    4.2.2 Frequency Processing
    4.2.3 Encoding
    4.2.4 Decoding
  4.3 Summary

5 High Spatial Resolution (HSR) Soundfield System
  5.1 Review of Microphones
  5.2 High Spatial Resolution (HSR) Recording
  5.3 Reproduction
  5.4 Post-recording Processing
    5.4.1 Analysis
    5.4.2 Synthesis
  5.5 Summary

6 Experiments and Results
  6.1 Experiments on BCC Implementation
    6.1.1 Estimation of ICTD
    6.1.2 Pre-echoing Effect
  6.2 Acoustic Experiments Preparation
    6.2.1 Anechoic Chamber
    6.2.2 Speaker Array Configuration
    6.2.3 Microphone Array Configuration
    6.2.4 Subjective Tests
  6.3 MUSHRA Tests and Results
    6.3.1 MUSHRA Tests on BCC
    6.3.2 MUSHRA Tests on HSR Soundfield Systems

7 Conclusion
  7.1 Summary of This Project
  7.2 Future Works

Appendices

A Spherical Solutions to the Wave Equation
  A.1 Solving the Wave Equation in Spherical Coordinates
  A.2 Spherical Bessel Functions of the 1st Kind
    A.2.1 Plot
    A.2.2 Properties

B Information about Speaker and Wedge
  B.1 GENELEC Loudspeakers (8130A)
  B.2 Wedge ‘A’

C Stimuli of MUSHRA Tests on Soundfield Systems
  C.1 Matrix F and Parameter μ
  C.2 Variations in Stimuli

List of Figures

1.1 General Structure of Sound Reconstruction Systems
1.2 Different Sound Reconstruction Systems
2.1 A Chirp Signal Recorded by an Omnidirectional Microphone
2.2 Spherical Coordinates
2.3 Spatial Hearing with One Sound Source
2.4 Binaural Signals
2.5 The Head Shadowing Effect
2.6 Summing Localization
2.7 Superposition of Multiple Auditory Events
3.1 The Huygens’ Principle
3.2 Steinberg and Snow’s 3-Channel System
3.3 Stereophonic Recording and Reproduction
3.4 Quadraphonic Setup
3.5 Soundfield Microphone
3.6 B-format Directionality
3.7 ITU-R BS.775, 5.1 Surround Sound Speakers Placement
3.8 The Principle of Wave Field Synthesis
3.9 Ambiophonics Reproduction
3.10 Generic Structure of Lossy Audio Coding
4.1 Generic Structure of Binaural Cue Coding (BCC)
4.2 Macroscopic Structures of BCC Analysis and Synthesis
4.3 Overlapped Hann Windows
4.4 Fine Structure of BCC Encoder
4.5 Uniform Quantization Scheme
4.6 Fine Structure of BCC Decoder
5.1 Generic Structure of Soundfield Systems
5.2 Standard Microphones’ 2D Polar Pattern
5.3 The Decomposition of B-format Directionality
5.4 Double M/S Microphone Design Scheme
5.5 ORTF Stereo Recording Microphone
5.6 The Configuration of Decca Tree Microphone
5.7 Post-Recording Analysis System
5.8 Post-recording Analysis
5.9 Post-recording Synthesis
5.10 High Spatial Resolution Soundfield System
6.1 TDE for Multi-Sinusoids: Integer Samples
6.2 TDE for Multi-Sinusoids: Non-integer Samples
6.3 TDE for Multi-Sinusoids: Complex
6.4 Pre-echoing Effects
6.5 Adaptive Windowing Scheme
6.6 3D Geometric Model of the Anechoic Chamber
6.7 Geometric Simulation of Speaker Mounting
6.8 Vogels Loudspeaker Support
6.9 HSR Recording Microphone Array
6.10 Recording Position
6.11 Graphic Terminal (Qterm-Z60) and GUI in MUSHRA Tests
6.12 The Continuous Quality Scale in MUSHRA
6.13 Results of MUSHRA Tests on BCC Performance
6.14 Scores of MUSHRA Tests on HSR Soundfield Systems
6.15 Number of Subjects Grading ‘Good’ Part 1
6.16 Number of Subjects Grading ‘Good’ Part 2
A.1 Various Bessel Functions of the First Kind
B.1 GENELEC Loudspeaker
B.2 Side {A1, A2, A3} and Mounting Point
B.3 Side {A0, A1, A3}
B.4 Side {A2, A0, A3}

List of Tables

4.1 Comparison of Multichannel Audio Compression Algorithms
4.2 Critical Band Boundaries (Hz)
4.3 Comparison Between BCC and Other Popular Multichannel Audio Compression Algorithms
6.1 Correspondence Between DST Audio Channels and Speakers
6.2 Stimuli for MUSHRA Tests on HSR Soundfield Systems

Chapter 1

Introduction

1.1 Background Overview

Human beings live in a space congested by a variety of sound sources originating from different directions. Since we cannot close our ears the way we can close our eyes, people are constantly exposed to a world of sound. This sonic environment serves as an essential element in our lives, contributing to both orientation and entertainment. Beyond the limited range of vision, the information delivered from sound sources helps people to discover, identify and localize their surroundings. The perception of sound also enriches a human’s capability of learning and of exploring the physically unreachable world through communication. Moreover, sound can bring considerable pleasure to listeners. By stimulating the imagination of recipients, sound may present extraordinary auditory scenarios. The enthusiasm of an audience may also be strongly boosted by the melody and harmony of an inspirational piece of music.

However, the reproduction of a particular sound event was limited in both time and space until 1877[1], when human speech, the famous “Mary Had a Little Lamb”, was first recorded by Thomas Edison. In the same year, a patent on transporting sound was filed by Alexander Graham Bell[2]. These inventions revealed the possibility of

transmitting and replicating a sound event at another time and location. As a result, the reproduction of sound events is now extensively deployed in the film and broadcasting industries to enhance perceptual quality and pleasure[3]. Provided with well reconstructed and synchronized audio tracks, viewers are able to tolerate the occasionally missing frames in motion pictures[4]. However, an inappropriate delivery and reproduction of sound sources over a long period of time or distance can cause serious confusion or misunderstanding. Hence, improving the quality of replicated sound events has been a major issue under ongoing investigation ever since[5][6][7].

In the first half of the previous century, most of the research work on sound reproduction focused on retaining the accurate temporal characteristics of a sound event during reconstruction, despite the fact that hearing, the perception of sound, operates not only in the time domain but also possesses spatiality[8]. In other words, the directional perspective of a sound source, which governs its localization and interaction with listeners, has long been neglected in the process of reconstructing a sound event at another time or location. The lack of spatial analysis and synthesis may lead to a mis-localized sound event and cause confusion to recipients during listening. Even high fidelity sound reproduction can be deceptive when implemented via different arrangements of loudspeakers.

This situation has gradually changed in recent years. An increasing number of publications discuss the importance of spatial analysis and synthesis, study different scenarios of spatial hearing and propose various kinds of sound reproduction systems[9][10][11][12]. As a result, sound reconstruction has become one of the fastest growing segments in the consumer audio market[13]. With the assistance of Digital Signal Processing (DSP) technology, the applications of sonic replication have expanded to the fields of virtual reality[14], education[15] and air traffic control[16]. In addition, reproduced sound that maintains good spatial quality has found its way into an extensive range of commercial products, like home theater

entertainment systems[17][18], teleconferencing equipment[19] and gaming surround sound systems[20].

Fundamental to these applications, sound reconstruction is often incorporated into multimedia systems. Involving a mixture of senses, such systems are able to provide a fairly realistic experience for recipients and allow their active participation. Some researchers, for example from Dolby Laboratories, are trying to merge auditory localization with visual perception to immerse recipients in a recreated virtual space, a developing trend in the modern gaming and film industries[21]. It is believed that, compared to visual perception, hearing is usually peripheral and considered the secondary sense for acquiring information within these multimedia systems[22]. As the perception of sound is an integrated process of physics, physiology and psychology[23], it can be considerably affected by the introduction of visual information from the psychological perspective. Our brains seem to trust what we can see more than what we can hear. Nevertheless, this belief underestimates the importance of precisely replicating sound events in multimedia systems. Moreover, the consumer market in the audio/music industry has a tremendous interest in creating an incredibly realistic and immersive sound environment. Therefore, the pursuit of a faithful reconstruction of live sound events, without the presence of any other perceptual information, never ends in the Audio Engineering Society (AES)[24]. For instance, the ultimate goal of using home theater systems is defined by Dermot Furlong[25] as being able “to optimally reconstruct the concert hall experience for the domestic living room listener.” Achieving a precise reproduction of sound events, in terms of both spatiality and temporal fidelity, is the primary goal of this master’s project as well.

1.2 Reconstruction Systems

In general, currently available sound reconstruction systems in the consumer market consist of three sections (cf. Figure 1.1): recording, encoding and decoding, and playback.

Figure 1.1: General Structure of Sound Reconstruction Systems

The microphone capsule signals are passed through a codec to generate the inputs for the playback section, whose configuration has normally been decided in advance. Compression techniques may be applied to reduce the bandwidth required for representing these signals.

According to the different sound reproduction mechanisms and apparatus in the playback section, these sound reconstruction systems can be roughly classified into the following three families (cf. Figure 1.2).

Figure 1.2: Different Sound Reconstruction Systems

• Binaural Reconstruction System[26]

As implied by its name, binaural reconstruction focuses on correctly reproducing the sound signals at the ears. Typically, such a system uses headphones to play sound. In this way, signals can be delivered directly to the ear entrances. With this kind of system, recipients are always at the sweet spot, the best location for good auditory perception.

One of the demerits of the binaural system is that the awareness of wearing headphones may prevent listeners from feeling completely immersed in an auditory scenario. In addition, as the recipient’s head may move during replay, the reproduced sound can be misleading and cause confusion.

• Transaural Reconstruction System[27][28]

Similar to binaural reconstruction, the transaural approach is also dedicated to retaining the sound signals at the listener’s ear entrances, but speakers, instead of headphones, are deployed in such a system. Using speakers avoids the side effects of headphones while maintaining good precision.

However, listeners are extremely space-limited in such a system, since the sweet spot of a transaural reconstruction system is rather small. As soon as the subject moves away from the ideal position, the quality of perception degrades dramatically.

• Soundfield Reconstruction System[29]

Different from the previous two families, a soundfield reconstruction system (cf. Figure 1.2c) utilizes a number of speakers to faithfully reproduce the sound pressure level (SPL) across a large area. In such a system, listeners are free to move around and still able to perceive sound with relatively good quality. In addition, it is also possible for a group of listeners to enjoy the well replicated sound environment together within the same listening space.

However, such systems suffer from a severe problem: an enormous number of speakers and channels is required to achieve a good reconstruction of soundfields over a large area[25][30]. This leads to problems in storing and transmitting data[31]. The huge amount of data resulting from a multichannel recording of soundfields with such a system is extremely difficult to store, despite the advent of a powerful storage medium, the Digital Versatile Disc (DVD)[32], in the 1990s. Transmitting the data is also problematic, even with the “huge” bandwidth networks currently available.

A second problem of soundfield reconstruction systems is that currently well commercialized systems have to coincide with certain arrangements of speakers. In other words, adapting these systems to an irregular or arbitrary layout of loudspeakers can be extremely complicated.

Hence, this master’s thesis focuses on one of the most challenging families, the soundfield reconstruction system, with the primary goal of faithfully replicating a soundfield from both spatial and temporal perspectives. To clarify the terminology, all reconstruction systems mentioned in the following chapters of this thesis refer to soundfield reconstruction systems using multiple speakers.

1.3 Motivation for Current Project

In modern societies, the need for a sensational reproduction of soundfields is booming in the consumer market, due to the influence of the rapidly growing entertainment industry. Take computer games as an example: there is high demand for an immersive gaming environment. Such a realistic sonic environment can be produced with the incorporation of visual perception. Meanwhile, the accurate reconstruction of soundfields, in terms of both spatiality and fidelity, without the assistance of any other senses, is also a vital part of the market. Being able to enjoy the same incredible

sensation at home as in a real concert hall is the primary goal of most research on soundfield reconstruction[25][33][34].

For a long time, the analysis and synthesis of soundfields focused on improving temporal fidelity, while leaving the spatial information unattended. The lack of spatial analysis can eventually affect the perceptual quality of the reproduced soundfields, possibly causing severe disorientation for the audience. Hence, it is necessary to take directional characteristics into account during soundfield recording and reproduction.

Meanwhile, currently available soundfield reconstruction systems have several major issues.

Firstly, as mentioned previously in Section 1.2, an extraordinarily large number¹ of speakers is required to reproduce a soundfield of good perceptual quality over a large area, and this potentially causes problems in storage and transmission. It has been suggested in the literature that a large number of channels is essential for an immersive soundfield reconstruction. When better spatial and temporal performance over a large area is targeted, this number increases dramatically. As a result, an enormous increment will be introduced to the bitrates required to represent the speaker input signals, if currently popular compression schemes are used. Arguably, the development of storage media capacity and networking bandwidth is currently approaching its limit, and it is still inadequate to handle such a dramatic increase in the bandwidth of multichannel audio representations.
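For a rough sense of scale (an illustrative calculation only, assuming uncompressed 16-bit PCM at a 48 kHz sampling rate), even the six channels of a standard 5.1 system already require

$$6 \times 48\,000\ \mathrm{Hz} \times 16\ \mathrm{bit} \approx 4.6\ \mathrm{Mbit/s},$$

and this raw rate grows linearly with every additional reproduction channel.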

Secondly, currently well commercialized sound reconstruction systems are limited by the pre-determined arrangement of speakers in the playback section. In most cases, a certain layout of speakers is chosen, and then systems, like Ambisonics², are designed accordingly. The process of converting microphone signals to speaker feeds within these systems is relatively easy to implement. However, the situation on site, whether a consumer’s home or a stadium, is not always the same. The placement of speakers often differs from the ideal layout, due to variations in space, room structure and furniture arrangement. In this case, the quality of reproduction cannot match what is proposed. Such low compatibility of soundfield reconstruction systems with irregular speaker layouts can lead to large spatial distortion and an unsatisfactory listening experience[35]. Moreover, as the number of speakers increases, it becomes more complex to design a suitable process to calculate the speaker feeds from microphone signals.

1 Details about this number can be found in Section 3.1.4.
2 Details about this system will be reviewed in Section 3.1.4.

These two issues in current soundfield reconstruction systems are the premises of this thesis. Some research on the topics of multichannel audio and soundfield reconstruction[36][35] has addressed these issues separately in recent years. However, few studies are able to synthesize these two problems and implement a soundfield reconstruction system which succeeds in both quality and efficiency. Meanwhile, a comparison between such a soundfield reconstruction system and currently popular systems in the market can be incredibly valuable for consumers seeking a better system.

1.4 Contribution

This master’s project addresses both of the above issues in the analysis and synthesis of soundfields, and offers a comparison between a state-of-the-art soundfield reconstruction system and currently well commercialized systems. This comparison can potentially be used to direct the consumer market towards the development of better soundfield reconstruction systems.

An extensive range of audio recording, reproduction and compression techniques is reviewed during this research project to provide the basis of the comparison. In particular, the principles of the High Spatial Resolution (HSR)³ recording technique

3 The High Spatial Resolution (HSR) technique was proposed by Arnaud Laborie et al. in 2003[110].

and the Binaural Cue Coding (BCC)⁴ algorithm are studied in detail. In order to realize an effective and efficient soundfield system, the entire process of multichannel recording, compression and reproduction using the HSR technique and the BCC compression algorithm is implemented in an anechoic chamber that also serves as the listening environment for the later subjective tests.

A framework, controlled by a number of factors including the configurations of both the microphone array and the speaker set, is built to convert multichannel recordings into loudspeaker input signals for reproduction. In addition to its function of encoding and decoding soundfields, this framework is also a flexible process that can be adapted to the various arrangements of speakers in different home theater systems. Therefore, such a framework can not only be used in future research on soundfield reconstruction, but also potentially be integrated into home theater systems or set top boxes to improve the quality of on-site soundfield reconstructions. Furthermore, the internal product of this framework, the vector $\hat{p}$ (cf. Equation 5.8), can be regarded as another format of audio recording and utilized in storage and transmission in the future, because this vector, similar to the B-format in Ambisonics (cf. Section 3.1.4), can be used to describe the recorded soundfields.

As an important section of this research project, Binaural Cue Coding, the key algorithm in the recently ratified MPEG⁵ multichannel audio codec, has been reviewed, implemented and applied to compress the feeds of multiple speakers. Compared with currently popular audio compression algorithms, this multichannel coding technique is purported to dramatically reduce the bandwidth required for multichannel audio representation while maintaining spatiality and temporal fidelity. It is the intention of this project to evaluate just how well these cues are maintained in comparison to the uncompressed soundfield reproduction. During the implementation of BCC, several algorithms are also proposed and compared, in terms of accuracy and computational complexity, to determine the most efficient technique for spatial cue estimation.

4 The Binaural Cue Coding (BCC) algorithm was invented by Christof Faller et al. in 2002[109].
5 MPEG is the abbreviation of Moving Pictures Expert Group.

Eventually, subjective experiments were conducted on various replicated soundfields to verify the performance of the reconstruction systems. The results are analyzed and used to conclude the comparison of the different systems. To facilitate this conclusion, a novel approach was undertaken during these experiments: the whole process of soundfield analysis and synthesis was staged in a single acoustic environment. In other words, the subjective evaluation tests were carried out in exactly the same acoustic environment as the recordings, so that the reproduced soundfields are directly comparable with the original.

1.5 Dissertation Overview

Motivated by the pursuit of faithfully reproduced soundfields, this master’s thesis studies two dominant issues in currently popular soundfield reconstruction systems, realizes a good soundfield reconstruction using state-of-the-art techniques, and offers a comparison between such a system and currently well-commercialized multichannel systems. The content of this thesis is organized into seven chapters.

Chapter 2 briefly reviews the representation of sound from both physical and psychoacoustic perspectives. The wave equations of sound in both Cartesian and spherical coordinates are first explained. A frequency domain equivalent solution is then described. This chapter also illustrates several psychoacoustic phenomena which are fundamental to soundfield reconstruction systems.

In order to clearly understand the major issues in reconstructing soundfields, a variety of currently available systems is reviewed in Chapter 3: everything from the earliest systems to the most up-to-date and popular, as well as some soundfield reconstruction techniques yet to be commercialized. In addition, a brief introduction to modern audio coding schemes is also included in this chapter. It helps with the understanding of problems in audio storage and transmission.

A multichannel audio compression algorithm, Binaural Cue Coding (BCC), is discussed in Chapter 4. The chapter first introduces the fundamentals of BCC and its generic coding scheme. The details of the encoding and decoding process, including the estimation of the essential information, are then explained. A comparison with other audio coding techniques is included in this chapter to show the advantage of BCC in regard to coding efficiency.

Chapter 5 looks into the new approach to faithfully reproducing a soundfield, the High Spatial Resolution (HSR) technique. The chapter is divided into several sections which explore the different stages of this soundfield reconstruction system. It starts with a general review of recording techniques, followed by the recording principle of this new approach. The stage of post-recording analysis, where the acoustic characteristics of a recording are analyzed and refined, is described next. Chapter 5 also discusses the decoding process prior to the playback stage. In general, this chapter proposes a simple framework to enable an accurate reconstruction of soundfields on any loudspeaker arrangement.

All the experiments conducted during this project are described in Chapter 6. Firstly, several methods proposed to estimate the essential cues in BCC are compared in terms of accuracy and computational complexity. The best option is then applied during the implementation. The following sections of this chapter explain the setup and conduct of the subjective experiments. The temporal fidelity of several reproduced multichannel audio clips which have only been through the BCC codec is examined during the first part of the tests, while other experiments are also conducted to evaluate both the spatiality and the temporal quality of various reproduced soundfields. The experimental results are then analyzed to assess the performance of the different soundfield reconstruction systems.

Finally, Chapter 7 briefly summarizes the research work involved in this thesis. Some suggestions for future research are then proposed.

Chapter 2

Soundfield Physics and Psychoacoustics

The perception of sound is recognized as an integrated process involving physics, psychology and physiology[8]. Therefore, the physical representation of soundfields is first explained in this chapter to reveal the possibility of accurately recording/analyzing and reproducing/synthesizing soundfields. The second part of this chapter summarizes various psychoacoustic phenomena of spatial sound perception which are fundamental to soundfield reproduction and compression. Such a theoretical review of the characteristics of soundfields from these perspectives provides a better understanding of the mechanisms of existing soundfield systems and compression techniques, which will be reviewed in Chapter 3. It also indicates the essentials of soundfield systems which should be preserved throughout the entire process of recording, reproduction and compression. As this research project does not look into the physiological perspective of hearing, details of the physiological studies on sound perception will not be described further in the following sections.

2.1 Physical Representation

From a physical perspective, sound is considered the product of vibrations occurring among the molecules of a medium. These vibrations cause condensation and rarefaction in the local regions of the medium where they occur and create differences in pressure, which is fundamental to the propagation of sound. As a result, the dynamic sound pressure, p(x, y, z, t), is often measured by microphones (cf. Figure 2.1) to describe a sound event.

Figure 2.1: A Chirp Signal Recorded by an Omnidirectional Microphone

The propagation of sound in a homogeneous medium can be described by the wave equation (cf. Equation 2.1) in three dimensions[37]. Solutions to this equation can be used to describe a certain soundfield, or explain sonic phenomena within this soundfield, since any valid soundfield must comply with the wave equation:

$$\Delta p = \frac{1}{c^{2}}\,\frac{\partial^{2} p}{\partial t^{2}} \tag{2.1}$$

where $\Delta = \partial^{2}/\partial x^{2} + \partial^{2}/\partial y^{2} + \partial^{2}/\partial z^{2}$ is the Laplace operator in three dimensional (3D) Cartesian coordinates.

Figure 2.2: Spherical Coordinates: r is the distance, θ is the polar angle and φ is the azimuth angle[37].

Due to the convenience of the alternative spherical coordinate system (cf. Figure 2.2), the wave equation 2.1 is often expressed in spherical coordinates as[37],

$$\frac{\partial^{2} p}{\partial r^{2}} + \frac{2}{r}\frac{\partial p}{\partial r} + \frac{1}{r^{2}\sin\theta}\,\frac{\partial}{\partial\theta}\!\left(\sin\theta\,\frac{\partial p}{\partial\theta}\right) + \frac{1}{r^{2}\sin^{2}\theta}\,\frac{\partial^{2} p}{\partial\phi^{2}} = \frac{1}{c^{2}}\,\frac{\partial^{2} p}{\partial t^{2}} \tag{2.2}$$

where $p(r, \theta, \phi, t)$ is the alternative expression of sound pressure in spherical coordinates.

A general solution to the partial differential equation 2.2 can be written as equation 2.3[38] using the Fourier Bessel Expansion:

$$P(r,\theta,\phi,k) = \sum_{l=0}^{\infty}\sum_{m=-l}^{l} \alpha_{l,m}(k)\, j_{l}(kr)\, y_{l}^{m}(\theta,\phi) \;+\; \sum_{l=0}^{\infty}\sum_{m=-l}^{l} \beta_{l,m}(k)\, h_{l}(kr)\, y_{l}^{m}(\theta,\phi) \tag{2.3}$$

where $k$ is the wave number, inversely proportional to wavelength: $k = 2\pi f/c$.

In this equation, $j_{l}(kr)$ is the spherical Bessel function of the first kind, which is associated with the interior case[38] where no sources lie within the sphere of consideration. When all sources are inside the considered region, the outgoing sound wave[38], which is the solution to wave equation 2.2 for the exterior region, can be described by the second part of equation 2.3 utilizing the spherical Hankel function, $h_{l}(kr)$¹.

Because the interior region is the focus of soundfield systems, which deploy multiple speakers to recreate soundfields within the space of interest, the exterior case can be excluded during soundfield analysis and synthesis by neglecting the second part of the equation. Therefore, equation 2.3 can be refined as,

$$P(r,\theta,\phi,k) = 4\pi \sum_{l=0}^{\infty}\sum_{m=-l}^{l} P_{l,m}(k)\, i^{l}\, j_{l}(kr)\, y_{l}^{m}(\theta,\phi) \tag{2.4}$$

where the complex series $P_{l,m}(k)$ are often referred to as the Fourier Bessel Coefficients, which can be used to describe a soundfield. Given that $k$ is proportional to frequency, $P(r,\theta,\phi,k)$ and $P_{l,m}(k)$ are often written as $P(r,\theta,\phi,f)$ and $P_{l,m}(f)$, which can be interpreted as the Fourier Transforms of $p(r,\theta,\phi,t)$ and $p_{l,m}(t)$ respectively.

The radial part of equation 2.4, $j_{l}(kr)$, is given by,

$$j_{l}(kr) = \sqrt{\frac{\pi}{2kr}}\; J_{l+\frac{1}{2}}(kr) \tag{2.5}$$

where $J_{v}(kr)$ denotes the cylindrical Bessel function of the first kind at order $v$ (cf. Appendix A.2).

The spherical harmonics²[39] are described by the angular part of the solution (Equation 2.4), $y_{l}^{m}(\theta,\phi)$:

$$y_{l}^{m}(\theta,\phi) = \frac{1}{\sqrt{2\pi}}\; \bar{P}_{l}^{|m|}(\cos\theta)\; \mathrm{trg}_{m}\phi \tag{2.6}$$

where $\mathrm{trg}_{m}\phi$ and $\bar{P}_{l}^{|m|}(x)$ are defined respectively by the following equations 2.7 and 2.8:

$$\mathrm{trg}_{m}\phi = \begin{cases} \sqrt{2}\cos m\phi & \text{for } m > 0 \\ 1 & \text{for } m = 0 \\ \sqrt{2}\sin m\phi & \text{for } m < 0 \end{cases} \tag{2.7}$$

$$\bar{P}_{l}^{m}(x) = \sqrt{\frac{2l+1}{2}\,\frac{(l-m)!}{(l+m)!}}\;(1-x^{2})^{m/2}\,\frac{d^{m}}{dx^{m}}P_{l}(x) \tag{2.8}$$

where $\bar{P}_{l}^{|m|}(x)$ are the fully normalized associated Legendre functions, and $P_{l}(x)$ are the Legendre polynomials:

$$P_{l}(x) = \frac{1}{2^{l}\, l!}\,\frac{d^{l}}{dx^{l}}(x^{2}-1)^{l} \tag{2.9}$$

1 The spherical Hankel function, $h_{l}(kr)$, can be expressed as a linear combination of the spherical Bessel function of the first kind, $j_{l}(kr)$, and the spherical Bessel function of the second kind, often referred to as the spherical Neumann function, $y_{l}(kr)$: $h_{l}(kr) = j_{l}(kr) + i\,y_{l}(kr)$.
2 The orthonormal basis functions on the unit sphere.

The currently popular soundfield system, Ambisonics, is based on the concept of spherical harmonics, but it is also limited by its lack of consideration of the soundfield’s radial behavior.

The subscripts $l$ and $m$ in the above equations 2.3 - 2.9 are integers introduced to help derive the simplest form of solution to the wave equation 2.2 (cf. Appendix A.1). Their relation can be described by $l \geq 0$ and $-l \leq m \leq l$, because the associated Legendre functions satisfy $\bar{P}_{l}^{|m|}(x) = 0$ when $|m| > l$ (cf. Equation 2.8).

As inferred from equation 2.4, the Fourier Bessel Coefficients, the spherical Bessel functions and the spherical harmonics together define the soundfield completely at each point within the sphere of interest, independent of the transducer location. Given these functions, the soundfield can be perfectly reproduced at another time or location. This is fundamental to an accurate soundfield reconstruction, as well as the basic approach to resolving the issue in soundfield systems using irregular speaker configurations (cf. Chapter 5).
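To make the preceding expansion concrete, the following is a minimal numerical sketch of evaluating the truncated interior solution (Equation 2.4) with SciPy. It is illustrative only: the coefficient array `coeffs`, the truncation order and all function names are assumptions made for this sketch, not part of the HSR implementation described in Chapter 5.

```python
import numpy as np
from scipy.special import spherical_jn, lpmv, factorial

def trg(m, phi):
    # Equation 2.7: normalized azimuthal basis trg_m(phi).
    if m > 0:
        return np.sqrt(2.0) * np.cos(m * phi)
    if m < 0:
        return np.sqrt(2.0) * np.sin(m * phi)
    return 1.0

def y_lm(l, m, theta, phi):
    # Equations 2.6 and 2.8: real spherical harmonic y_l^m(theta, phi).
    # SciPy's lpmv includes the Condon-Shortley phase (-1)^m, which the
    # definition above omits, so it is cancelled here.
    am = abs(m)
    norm = np.sqrt((2 * l + 1) / 2.0 * factorial(l - am) / factorial(l + am))
    plm = (-1.0) ** am * lpmv(am, l, np.cos(theta))
    return norm * plm * trg(m, phi) / np.sqrt(2.0 * np.pi)

def interior_pressure(r, theta, phi, k, coeffs, order):
    # Equation 2.4 truncated at `order`; coeffs[l][m + l] holds the
    # Fourier Bessel Coefficient P_{l,m}(k).
    p = 0.0 + 0.0j
    for l in range(order + 1):
        for m in range(-l, l + 1):
            p += (coeffs[l][m + l] * (1j ** l)
                  * spherical_jn(l, k * r) * y_lm(l, m, theta, phi))
    return 4.0 * np.pi * p
```

With the coefficients known, `interior_pressure` reconstructs the pressure at any point inside the region of validity, which is the sense in which the coefficients describe the soundfield completely.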

2.2 Psychoacoustics

In addition to the physical study, the perception of sound/soundfields has been extensively investigated from the psychoacoustic perspective, which looks into the psychological correlates of the physical parameters of acoustics[8][40][41]. Ideally, the crucial parameters identified by psychoacoustics should be well preserved during reproduction to ensure accurate localization for listeners when audio compression techniques are introduced into soundfield systems to remove redundant information. This section thereby summarizes the psychoacoustic principles of various spatial hearing scenarios based on the research work of Blauert[8], and focuses on identifying the important parameters which listeners use to determine the direction and distance of sound sources during reconstruction. Such a review helps with a better understanding of current soundfield systems, and lends itself to resolving the issue³ of the increasing bandwidth of multichannel audio representation in soundfield systems (cf. Chapter 4). The following part starts with the simplest case: one sound source in a free field⁴.

2.2.1 Spatial hearing with one sound source

As we all may have experienced, the location at which a listener thinks the sound source lies may sometimes differ dramatically from its true location, due to reflections, diffractions or other factors in the acoustic field. Hence, the terms sound event and auditory event are used by researchers in their studies[8][42] of spatial hearing to represent, respectively, the real sound source and the virtual auditory image perceived by the recipients. This terminology is inherited in this thesis. In addition, ‘Localization’ is defined as the process of identifying the direction and distance of an auditory event according to the attributes of a sound event. ‘Localization Blur’ denotes the smallest change in some of the sound event’s attributes which introduces a perceptible change to the location of its auditory event. It is widely accepted that the localization blur is smaller for a sound source in the front than to the rear, while a sound event on the side introduces the largest localization blur[8]. In order to provide a perfect localization perception to recipients, a precise soundfield system should be able to deliver these attributes faithfully.

3 The first issue mentioned in Section 1.3.
4 A free field is an ideally noiseless environment which is frequently adopted as a premise of psychoacoustic experiments.

Figure 2.3: Spatial Hearing with One Sound Source in Horizontal Plane

Figure 2.3 illustrates the case of spatial hearing with only one sound source present in the horizontal plane. Given an anechoic listening condition, the recipient can only perceive direct sound. Therefore, the straight path from the single sound source to the left or right ear can be considered as a filter whose transfer function is often referred to as the Head Related Transfer Function (HRTF)[43]. During the perception of a sound source away from the median plane, the source signal, S, is converted by the two HRTFs, H1 and H2, into the signals, s1 and s2, presented at the entrances to the left and right ear canals respectively. Due to the difference between those two paths, especially the length difference, Δd, their HRTFs differ from each other. Eventually, these different HRTFs introduce a time difference and an intensity difference, often referred to as the Interaural Time Difference (ITD) and the Interaural Level Difference (ILD), to the binaural signals s1 and s2. Figure 2.4 shows the ITD and ILD between the recordings of s1 and s2.

Figure 2.4: Binaural Signals
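As a small illustration of this filtering view (a sketch under the assumption that measured head-related impulse responses are available; the array and function names are hypothetical), the two ear-entrance signals are simply the source convolved with the left and right HRIRs:

```python
from scipy.signal import fftconvolve

def binaural_signals(source, hrir_left, hrir_right):
    # s1, s2: ear-entrance signals obtained by filtering the source S with
    # the left and right head-related impulse responses (time-domain HRTFs).
    s1 = fftconvolve(source, hrir_left)
    s2 = fftconvolve(source, hrir_right)
    return s1, s2
```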

Around 1900, Lord Rayleigh[44] identified ITD and ILD as two potential cues for the localization of sound sources in the horizontal plane. It was also noticed that human listeners are more sensitive to ITD at low frequencies (less than 1.5 kHz), while the difference in level is prominent in localizing high frequency components (above 1.5 kHz). These two renowned observations were later named the Duplex Theory. The reason behind this theory can be found from a physical perspective. When a sound source is in the horizontal plane but far away from the median plane (cf. Figure 2.5), the sound perceived by the contralateral ear may be shadowed by the head, introducing ILD. However, when the frequency is below approximately 1.5 kHz, the wavelength of sound is comparable with, or larger than, the distance between the two ears, Δd = 22 cm, rendering ILD less salient or negligible. In contrast, the head shadowing is more effective for a high frequency (above 1.5 kHz) sound whose wavelength is smaller, resulting in a perceptually much more noticeable ILD.
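As a quick check of this crossover (assuming a speed of sound of about 343 m/s),

$$\lambda = \frac{c}{f} = \frac{343\ \mathrm{m/s}}{1500\ \mathrm{Hz}} \approx 0.23\ \mathrm{m},$$

which is on the order of the interaural distance Δd = 22 cm; only above roughly this frequency does the head shadow the contralateral ear effectively.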

Figure 2.5: The Head Shadowing Effect

The research work by Gaik in 1993[45] confirms that these two cues, ITD and ILD, are approximately in a causal relation to the direction of a sound source in the frontal section of the horizontal plane. In other words, in this single sound source scenario, the localization of auditory events strongly depends on the parameters ITD and ILD carried by the binaural signals. This reveals the significant importance of these two parameters as spatial cues in the process of localization/spatial hearing in the horizontal plane.

On the other hand, the pair of ITD and ILD offers much less indication for localizing an auditory event in the median plane. This is because, if a sound source is in the listener’s median plane, no significant time or level difference can be observed between the corresponding binaural signals, given two similar HRTFs. In this case there are usually other cues offering assistance in dealing with the front-back and top-bottom ambiguity. For example, as mentioned in [43], visual cues, spectral cues and head movement can help to determine the location of the sound event in the median plane.

Another important attribute is Interaural Coherence (IC), which is defined as the maximum value of the normalized cross-correlation between the binaural signals. The value of IC implies the similarity of these two signals. Chernyak and Dubrovsky, in their study[46], have shown that IC can affect the width of the perceived auditory event in the case of headphone listening. As IC increases and the two ear entrance signals share more similarity, recipients are likely to perceive a more focused auditory event during headphone listening. On the other hand, the width of the virtual auditory image can be increased by reducing the coherence between these two signals. Eventually, two different auditory events may appear at the same time when IC is below a certain value. This phenomenon is within the scope of lateralization, which deals with the causal relation between the relevant attributes of binaural signals and the lateral displacement of the resulting auditory events during headphone playback. It is believed that the lateralization of an auditory event can be manipulated by ITD and ILD, while IC controls its width.

2.2.2 Spatial Hearing with Two Sound Sources

It is more practical to study the case of spatial hearing with two sound sources. This is because, in reality, soundfield and transaural reproduction systems normally use at least two speakers (sound sources) for playback and can thus be generalized as a superposition of several two-source systems for simplification and better understanding. As soundfield systems are the focus of this project, a review of psychoacoustic phenomena related to two-source perception is essential and helps identify the important parameters in soundfield reconstructions.

As studied previously in the case of a single sound source, there are two directional cues, ITD and ILD, which denote the time difference and level difference between binaural signals respectively and dominate the localization of auditory events. Similarly, two other parameters, Inter-Channel Time Difference (ICTD) and Inter-Channel Level Difference (ICLD), are found to be very important in the process of localizing auditory events when two sound sources are involved. However, different from ITD and ILD between the perceived binaural signals, ICTD and ICLD represent the time difference and level difference respectively between the two source signals. On the other hand, ITD and ILD can be approximated by ICTD and ICLD respectively when binaural reproduction (cf. Section 1.2) is deployed, since the two sound sources are extremely close to the listener’s ears in this case.

It was reviewed in [8] that depending on the value of ICTD, the phenomena related to the perception of two coherent sound sources are mainly governed by the following three laws:

• Summing Localization

When the ICTD is under 1 ms, only a single auditory event is perceived between the two sound sources, biased towards the earlier source (as shown in Figure 2.6). In the extreme case, the auditory event coincides with one of the true sources if only one source is sounding. Stereo reproduction⁵ is based on this effect.

Figure 2.6: Summing Localization, the localization of an auditory event is determined by ICTD and ICLD between the coherent source signals.

5 Details about stereo reproduction will be reviewed in Chapter 3.

22 • Law of the First Wavefront[47][48]

When the ICTD is over 1 ms, the auditory event’s location will be solely determined by the primary source whose signal arrives first at the listener. This is also known as the precedence effect.

• Inhibition of the Primary Sound[49]

If the time difference increases beyond this upper limit, each sound source may produce a separate auditory event, one delayed relative to the other. It is even possible that, under some circumstances, the second sound source becomes dominant and masks the primary sound. This effect was observed by Georg von Békésy[49].

In addition, the attributes ICTD and ICLD play an important role in localizing multiple incoherent auditory events during transaural or soundfield playback. This is because a system generating incoherent auditory events simultaneously can be interpreted as the superposition of subsystems which each produce an individual auditory event independently (cf. Figure 2.7). Due to the linearity of HRTFs, the overall ICTD and ICLD depend on the subsystems’ ICTDs and ICLDs respectively, which are fundamental to the localization of those multiple auditory events.

Figure 2.7: Superposition of Multiple Auditory Events

A third important parameter in spatial hearing involving two sound sources is Inter-Channel Coherence (ICC), which is defined as the maximum absolute value of the normalized cross-correlation between the two source signals. Similar to IC, ICC describes the coherence of two sound sources and can be manipulated to control the width of the auditory event during a transaural or soundfield reproduction.
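To make these three definitions concrete, the following is a minimal sketch of estimating ICTD, ICLD and ICC for one channel pair via normalized cross-correlation over physically plausible lags. It is illustrative only; the estimators actually used in BCC are the subject of Chapter 4, and the function and parameter names here are assumptions.

```python
import numpy as np

def spatial_cues(x1, x2, fs, max_lag_ms=1.0):
    # ICLD: level difference in dB between the two channel signals.
    e1, e2 = np.sum(x1 ** 2), np.sum(x2 ** 2)
    icld = 10.0 * np.log10(e2 / e1)
    # Normalized cross-correlation over lags d = -max_lag .. +max_lag.
    max_lag = int(fs * max_lag_ms / 1000.0)
    denom = np.sqrt(e1 * e2)
    corr = np.array([
        np.sum(x1[max(0, -d):len(x1) - max(0, d)]
               * x2[max(0, d):len(x2) - max(0, -d)]) / denom
        for d in range(-max_lag, max_lag + 1)
    ])
    icc = np.max(np.abs(corr))                        # Inter-Channel Coherence
    ictd = (np.argmax(np.abs(corr)) - max_lag) / fs   # ICTD in seconds
    return ictd, icld, icc
```

Here a positive ICTD means the second channel lags the first, and the 1 ms default search range mirrors the summing localization limit discussed above.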

However, the manipulation of the characteristics of auditory events in soundfield systems is not the focus of this thesis. In contrast, this thesis partially focuses on sustaining the values of the attributes ICTD and ICLD during the compression process in soundfield systems, thereby resolving the issue relating to the bandwidth of the multichannel audio representation involved in these systems. Ideally, no additional audible noise is introduced into soundfield systems during audio compression as long as the applied compression technique faithfully preserves these vital spatial cues.

2.3 Summary

In this chapter, the perception of sound is reviewed from both a physical perspective and a psychoacoustic perspective to identify the essentials of the various stages of soundfield systems. Firstly, the Fourier Bessel Decomposition (cf. Equation 2.4) can be utilized to describe a soundfield. This is the basis of precisely analyzing and synthesizing a soundfield at another time or location. Secondly, ICTD, ICLD and ICC are identified as important cues for localizing auditory events in soundfield systems. Therefore, it is extremely important to preserve these parameters during compression to secure faithful reconstructions of soundfields. Overall, the two sections of this chapter indicate the approaches to resolving the two issues posed previously in Section 1.3. The following chapter will review why existing popular soundfield systems and compression techniques cannot overcome those two problems.

Chapter 3

Soundfield Systems and Multichannel Audio Compression Techniques

In this chapter, a variety of soundfield systems and multichannel audio compression algorithms are reviewed to explain their existing problems and facilitate their comparison. The implementation details of past and present soundfield systems are first explained along with a discussion of their features. The second section reviews the design principles and performance of several audio coding techniques, and unravels the issue of compressing the multichannel audio product of soundfield systems using these coding techniques.

3.1 Historic Review of Soundfield Reconstruction Systems

Ever since Thomas Edison recorded a human voice for the first time in 1877, the recording and reproduction of sound/soundfields has developed and grown in popularity in various applications. This section therefore reviews past and current soundfield systems in the chronological order in which they were introduced. The focus of this section is the reproduction stage of soundfield systems, since current attempts at reconstructing soundfields mostly involve a certain pre-determined arrangement of speakers, which limits the flexibility and performance of the available systems. Recording techniques will also be mentioned during the description in this section, but more details will be given in Chapter 5.

Nowadays, most people are more familiar with 2-Channel stereophonic¹ reproduction and may think it was the first soundfield system in history. However, a number of systems had been implemented prior to the commercialization of stereophonic systems, which dates back to the 1950s[50], including a 3-Channel system implemented by scientists from Bell Labs[51]-[56] and Alan Blumlein’s research work[57] on 2-Channel reproduction at Electric and Musical Industries Ltd. (E.M.I.) in the 1930s. These systems are the basis of the development and commercialization of stereophonic systems. Therefore, this section starts with a review of the 3-Channel system at Bell Labs.

3.1.1 3-Channel System

It is widely accepted that every point on a propagating wavefront can be considered as an elementary sound source emitting spherical waves (cf. Figure 3.1). In other words, a wavefront can be considered as the superposition of such elementary waves. This is the renowned Huygens’ Principle[58], from which Wave Field Synthesis² (WFS), another soundfield reconstruction technique, originated. Following this principle, the original sonic wavefront can be faithfully recreated by connecting an infinite number of closely placed microphones in the pickup room with an infinite number of loudspeakers placed at the corresponding positions in the playback room³[51]. However, such a faithful reproduction of wavefronts requires an infinite number of transducers, which is impractical for either public or domestic use.

Figure 3.1: The Huygens’ Principle

1 The word “stereo” or “stereophony” is frequently used with diverse meanings covering any soundfield system that can provide a realistic acoustic impression. However, in order to simplify the description, it is only used to represent two-channel systems in this thesis.
2 More details will be given in Section 3.1.5.

In the early 1930s, scientists from Bell Labs attempted to reduce the number of transducers required for the recreation of original wavefronts in a reproduction environment. Amongst these scientists, Steinberg and Snow implemented a 3-Channel soundfield system[52] (cf. Figure 3.2) and carried out several subjective tests of this system in 1934.

Series of experiments were carried out to subjectively test the performance of the 3- Channel system at the time. In those experiments, sound was uttered by a ‘caller’[52]

3Assume the size, shape and acoustic property of these two rooms are identical in these two rooms. 4Please refer to Section 5.1 for more details on microphone directionality.

27 Figure 3.2: Steinberg and Snow’s 3-Channel System from nine pre-determined locations in the pickup room. Then, a listener was asked to estimate the locations of the resulted phantom ‘caller’ during playback in the listening room. By comparing the corresponding locations upon reproduction with the original positions in the recording room, Steinberg and Snow concluded the localization of auditory events in such a 3-Channel system was governed mainly by the precedence effect (cf. Section 2.2.2) as well as the level difference between channels. It was also noticed that the performance of the 3-Channel soundfield system was reasonably good and better than any other systems up to that date, even though the quality of the reproduced soundfield had been compromised as a result of the decrement in the number of transducers used in this system.

In the meanwhile, Steinberg and Snow also realized several 2-Channel systems by bridging the center microphone or the center loudspeaker or both in the above 3-Channel system. Similar subjective experiments were carried out to justify the per- formance of these 2-Channel reconstructions. After comparing the results with those

28 of the 3-Channel system, it was found that the 3-Channel system generally provided better performance than 2-Channel systems in terms of eliminating the recession of the center stage. This is because the center loudspeaker is effective in filling the gap between two far separated speakers in those 2-Channel systems and consequently stabilizing the reproduced center stage.

Due to its superior performance over 2-Channel reproduction, the 3-Channel sys- tem was adopted for the first time as the essential unit of the audio system for the film, Fantasia, in 1939[59], and this three front channel format has been commonly used in cinema ever since. However, this 3-Channel format was thought to be less economical for domestic use, because such a system required three independent amplifiers and loudspeakers which were only suitable for public use at the time[60].

Around the same time, a British engineer, Alan Blumlein, from EMI was also in- vestigating soundfield systems independently in a different approach which eventually lead to the invention of the well-known stereo system. The following paragraphs will review Blumlein’s stereo system.

3.1.2 Stereophony

Different from the scientists at Bell Labs, Blumlein focused on delivering a realistic impression of the original soundfield, rather than reproducing the exact wavefronts, using only two channels. As reviewed in previous chapter, the perception of sound, in the case of two sound sources or more, depends mostly on the time and level difference between channels which are resultantly considered as essential cues for localization. The importance of these directional cues were also recognized by Blumlein and utilized in his Stereo system to imitate the actual listening situation.

According to Blumlein’s patent[57], a pair of directional microphones (eg. figure- of-eight microphones5), also known as a “Blumlein Pair”, are positioned coincidently

5Please refer to Section 5.1 for more details on microphone directionality.

29 with their axes at 90o to each other to record the original sound/sound field (cf. Figure 3.3). Due to their coincident configuration, these two microphones can be considered

Figure 3.3: Stereophonic Recording and Reproduc- tion to share approximately the same recording position which is the cross point of their center axes. As a result, the output signals do not have phase difference but differ in level. Such level difference is closely correlated to the offset angle of sound source with reference to the center axis of the coincident pair, which was analyzed by Clark et al from a mathematical perspective in 1957. As explained in [60], such correlation can be described by, L − R = tan(θs). (3.1) L + R where L and R are the outputs of the left and right microphones respectively, and θs denotes the offset angle of sound source (cf. Figure 3.3). The pair of mono signals, L+R and L−R,inEquation3.1areoftenreferredtoasthesum(M)anddifference(S)

signals, so this technique is named the M/S Coincident Technique6 or the Blumlein Difference Technique in various literature[61]-[64]. Because the output signals, L and R, are in phase with each other but possess different amplitudes, the Blumlein Difference Technique also reveals the possibility of down-mixing stereo signals into a single channel without any cancellations.
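As a minimal illustration of this sum/difference idea (my own Python sketch, not code from any of the cited works), the M/S conversion and the cancellation-free mono downmix are a pair of linear operations:

import numpy as np

def ms_encode(L, R):
    # Sum (M) and difference (S) signals of the Blumlein Difference Technique.
    return L + R, L - R

def ms_decode(M, S):
    # Exact inverse: recover the left/right pair from M and S.
    return 0.5 * (M + S), 0.5 * (M - S)

L = np.array([0.2, 0.5, -0.1])
R = np.array([0.1, 0.4, 0.0])
M, S = ms_encode(L, R)                       # M alone is the mono downmix
assert np.allclose(ms_decode(M, S), (L, R))  # round trip is lossless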

Upon reproduction, two loudspeakers are positioned symmetrically on each side of the listener's median plane with an offset angle, φ. Given that the outputs of the Blumlein Pair, L and R, are fed directly into the two loudspeakers, LS and RS, respectively, and that the impulse responses of these two speakers are identical, the direction of the resulting virtual image can be estimated by Equation 3.2[60].

\sin(\theta_a) = \frac{L - R}{L + R} \, \sin(\phi). \qquad (3.2)

This equation suggests that the interaural time difference at the listener, especially at low frequencies7, produced by the phantom image can be simulated by the level difference between the two speaker feeds. Comparing Equations 3.1 and 3.2, we obtain,

\tan(\theta_s) = \lambda \, \frac{\sin(\theta_a)}{\sin(\phi)}. \qquad (3.3)

where λ is a constant. It is obvious that, given the value of φ, the direction of the replicated sound source, θa, in the playback room may agree with the direction of the real source, θs, in the pickup room by carefully adjusting the amplitudes of the recording outputs (or loudspeaker inputs). As a result, listening in such a system approximates the true listening situation in the pickup room. This is the basis of stereophonic reproduction.
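To make Equations 3.1-3.3 concrete, the following Python sketch (illustrative only; the function names and the assumption λ = 1 are my own) predicts the perceived image direction θa when a Blumlein Pair recording is played over a stereo pair with offset angle φ:

import numpy as np

def blumlein_outputs(theta_s_deg):
    # Figure-of-eight capsules at +/-45 degrees from the pair's center axis:
    # each output is proportional to the cosine of the source's angle to the
    # capsule axis, so (L - R)/(L + R) = tan(theta_s), as in Equation 3.1.
    theta = np.radians(theta_s_deg)
    return np.cos(theta - np.pi / 4), np.cos(theta + np.pi / 4)

def image_direction(L, R, phi_deg):
    # Equation 3.2: sin(theta_a) = (L - R)/(L + R) * sin(phi).
    s = (L - R) / (L + R) * np.sin(np.radians(phi_deg))
    return np.degrees(np.arcsin(np.clip(s, -1.0, 1.0)))

L, R = blumlein_outputs(10.0)       # source 10 degrees off-axis in the pickup room
print(image_direction(L, R, 30.0))  # ~5 degrees for a +/-30 degree speaker pair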

Currently, in a typical stereo configuration, the speaker offset angle, φ, is often

6 Some literature may particularly refer the M/S Coincident Technique to the coincident microphone composed of one cardioid capsule and one figure-of-eight capsule positioned ninety degrees to each other (cf. Section 5.1).
7 According to the Duplex Theory (cf. Chapter 2), for high frequency components, ITD is less important than ILD in sound/soundfield perception due to the head shadowing effect.

chosen to be ±30°[65]. This is because a wider angle may increase the width of the reproduced phantom image and potentially tear it apart, leaving a hole in the middle, which is undesirable during soundfield reproduction. Such an unstable listening situation was observed by John Eargle during his experiments using two loudspeakers placed 80° apart[66]. Therefore, without introducing an additional center speaker,

φ has to be reduced to overcome this issue. A variety of literature[67] recommends positioning the loudspeakers 60° apart for stereo reproduction.

A small value of φ causes stereo systems another problem: the position of the phantom image is extremely limited in such 2-Channel reproduction. According to Equation 3.2, the direction of the virtual source, θa, is approximately proportional to the offset angle of the loudspeakers, φ, when φ is less than 45°. Moreover, since L and R are in phase, (L − R)/(L + R) cannot exceed unity, so the phantom image is confined between the two loudspeakers, i.e. |θa| ≤ φ.

Stereo reproduction is also prone to a small sweet spot which, in a configuration as illustrated in Figure 3.3, is ideally the crossing point of the two speakers' acoustic axes. If a listener moves away from that particular spot, the quality of the perception is dramatically degraded. This is one of the reasons why stereo reproduction is well suited to headphone listening.

Another aspect of Blumlein's patent worth mentioning is the "shuffler" circuit designed to suit spaced omnidirectional microphone recording. During Blumlein's early experiments between 1929 and 1931, directional microphones were not yet available. Therefore, he placed two pressure (omnidirectional) microphones approximately 21 cm (the typical distance between the ears) apart to pick up binaural signals during recording, causing a time (phase) difference between the outputs of the two spaced microphones. Feeding these binaural signals directly into a stereo pair of speakers would result in inaccurate ITD at the listener's ears due to crosstalk. A shuffler circuit[57] was

thereby designed to convert the time difference, especially at low frequencies, between the outputs of the spaced microphones into the required level difference at the speakers, while preserving the level difference at high frequencies. In this way, the appropriate interaural directional information picked up during recording could be conveyed to listeners upon reproduction.

Despite the problems stereo systems entail, 2-Channel recording and reproduction has grown in popularity over the years, particularly in the radio broadcasting industry. The main reason is that only two channels of signals are involved, which requires fewer speakers during playback, cutting down both the cost of apparatus and the bandwidth required for signal transmission. Meanwhile, the stereo system has kept developing, with variations emerging in the market. For example, Enhanced Stereo[68] and Dolby Stereo[69] generate the sensation of multichannel audio from traditional stereo recordings and provide a better spatial perception using multiple speakers.

3.1.3 Quadraphony

In the early 1970s, Quadraphonic (Quad) sound[70] was introduced into the home entertainment industry. It contained two additional channels over traditional stereo signals. Panning techniques were therefore utilized in this system to create and control virtual sound sources among four speakers, which are usually arranged in a quadrangular shape (cf. Figure 3.4)[71] and labeled Left Front (LF), Right Front (RF), Left Back (LB) and Right Back (RB).

In order to be compatible with the stereophonic format, matrixing techniques were developed to encode the 4-Channel audio into two channels; a sketch of the underlying idea follows. Among the available systems were UD-4/UMX, developed by Duane Cooper et al.[70], and Matrix-H, invented by BBC engineers[72] for delivering quadraphonic sound via FM radio.
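The following Python sketch illustrates 4:2 matrixing in its simplest amplitude-only form. The coefficients are invented for illustration and are not those of UD-4/UMX or Matrix-H, which also employ phase shifts:

import numpy as np

# Rows map the four source channels (LF, RF, LB, RB) to the two
# transmission channels (Lt, Rt); coefficients are illustrative only.
ENCODE = np.array([[1.0, 0.0, 0.7, 0.3],
                   [0.0, 1.0, 0.3, 0.7]])

def matrix_encode(quad):
    # quad has shape (4, n_samples); the result has shape (2, n_samples).
    return ENCODE @ quad

def matrix_decode(two_ch):
    # Least-squares "unmatrixing": four channels cannot be recovered exactly
    # from two, which is why practical decoders add steering logic.
    return np.linalg.pinv(ENCODE) @ two_ch

quad = np.random.randn(4, 480)
recovered = matrix_decode(matrix_encode(quad))  # only an approximation of quad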

However, the various quadraphonic systems led to confusion among consumers over the different encoding formats on the market. Besides, the sweet spot of the Quad system

was rather small. These two factors could well have contributed to its commercial failure[73].

Figure 3.4: Quadraphonic Setup

3.1.4 Ambisonics

Despite the commercial failure of Quadraphony, the matrixing techniques developed at the time contributed significantly to the advent of Ambisonics[74][75]. In the 1970s, Michael Gerzon et al. designed the Ambisonics system to record and reproduce soundfields spatially. Based on the spherical harmonic decomposition of soundfields (cf. Section 2.1), this system was believed to be a truly 3D system, able to capture and replicate the acoustic characteristics of the entire soundfield. The appearance of Ambisonics in the 70s changed the world's perspective on soundfield reconstruction, from 2D stereophony to 3D recording and reproduction.

A series of multichannel audio recording and reproduction techniques are required and incorporated in Ambisonics systems, along with a codec in between to convert microphone signals to speaker feeds. Inspired by the Blumlein Difference Technique (cf. Section 3.1.2), Peter Craven and Michael Gerzon invented another coincident microphone, the Soundfield microphone, using a different combination of standard

directional microphones in 1975. According to their patent[76], four cardioid or hypercardioid microphone capsules8 are deployed, one on each face of a regular tetrahedron, in a Soundfield microphone (cf. Figure 3.5). The outputs of

Figure 3.5: Soundfield Microphone

these microphone capsules, referred to as A, B, C and D in [76], are collectively known as A-format.

W = 0.5(A + B + C + D), \qquad (3.4)
X = 0.5(A + B - C - D), \qquad (3.5)
Y = 0.5(B + C - A - D), \qquad (3.6)
Z = 0.5(B + D - A - C). \qquad (3.7)

Applying Equations 3.4-3.7[76] and further equalization, these A-format signals can be refined to generate the B-format signals[77], W, X, Y and Z, among which W possesses omnidirectional directionality while the others, X, Y and Z, are bidirectional signals providing front-back, left-right and top-bottom information respectively (cf. Figure 3.6). Therefore, a Soundfield microphone is equivalent to a combination of one omnidirectional microphone and three figure-of-eight capsules picking up the directional information of soundfields three-dimensionally.
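A direct transcription of Equations 3.4-3.7 into Python is given below (my own sketch; the further equalization stage mentioned above is omitted):

import numpy as np

def a_to_b_format(A, B, C, D):
    # Combine the four tetrahedral capsule outputs (A-format) into the
    # four B-format signals, per Equations 3.4-3.7.
    W = 0.5 * (A + B + C + D)  # omnidirectional component
    X = 0.5 * (A + B - C - D)  # front-back figure-of-eight
    Y = 0.5 * (B + C - A - D)  # left-right figure-of-eight
    Z = 0.5 * (B + D - A - C)  # top-bottom figure-of-eight
    return W, X, Y, Z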

8 Chapter 5 will explain cardioid and hypercardioid microphones in detail.

During playback, B-format is converted into the loudspeaker feeds according to the configuration of the speaker array. In order to simplify the transformation between B-format signals and the speaker feeds, the choice of speaker layouts is often limited. Currently, one of the most popular arrangements in the market is the 5.1 surround sound speaker placement (cf. Figure 3.7). This arrangement has been standardized by the International Telecommunication Union (ITU) in its Recommendation 775[78], and extensively adopted by home theater systems and Dolby Surround[50]. As illustrated in Figure 3.7, three speakers, often referred to as 'Left' (L), 'Center' (C) and 'Right' (R), are placed in the frontal plane. The left and right speakers are often positioned 30° off the median plane, which is also compatible with stereo reproduction. These three front speakers are powerful in enhancing and stabilizing the auditory images from the front. Two other speakers, 'Left Surround' (LS) and 'Right Surround' (RS), are located to the rear to provide spatial information and an extra sensation of ambience. In addition, a sixth speaker, the '.1' speaker, can be added to emphasize the Low Frequency Effects (LFE).

Figure 3.6: B-format Directionality[77]
Figure 3.7: ITU-R BS.775, 5.1 Surround Sound Speaker Placement

Overall, the Ambisonics technique is able to record the acoustic characteristics of a soundfield three-dimensionally and excels in defining a simple intermediate format, B-format, which can be further processed to fit the pre-determined speaker arrangements. This is one of the main reasons why the Ambisonics system has been a commercial success since it entered the consumer market. However, the reproduction is not in 3D. The speakers in an Ambisonics reproduction are all ideally placed in the horizontal plane at ear level. The elevational information about the localization of auditory events is inadequate with this speaker arrangement. Furthermore, the options for speaker layouts are limited. This is because, as suggested in [79], the synthesis from B-format to speaker feeds may be much more complicated if an irregular speaker layout is adopted or a large number of speakers is used.
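To make the conversion from B-format to speaker feeds concrete, here is a minimal first-order horizontal decode for a regular speaker ring (a textbook-style sketch under the assumption of equidistant, symmetric speakers; it is not the decoder of any particular commercial system and sidesteps the irregular-layout complications noted above):

import numpy as np

def decode_b_format_horizontal(W, X, Y, speaker_azimuths_deg):
    # Each speaker at azimuth az receives a weighted sum of the omni (W)
    # component and the two horizontal figure-of-eights (X, Y) aimed at az.
    feeds = []
    for az in np.radians(np.asarray(speaker_azimuths_deg, dtype=float)):
        feeds.append(np.sqrt(2.0) * W + X * np.cos(az) + Y * np.sin(az))
    return np.array(feeds) / len(speaker_azimuths_deg)

# Example: a regular horizontal ring of eight speakers, 45 degrees apart.
azimuths = np.arange(8) * 45.0
W, X, Y = (np.random.randn(1024) for _ in range(3))
feeds = decode_b_format_horizontal(W, X, Y, azimuths)  # shape (8, 1024)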

3.1.5 Wave Field Synthesis (WFS)

Wave Field Synthesis (WFS)[80] is another technique that provides a good analysis and synthesis of a soundfield. Originating from Steinberg and Snow's experiments at Bell Labs, the WFS technique attempts to reconstruct the original wavefront during reproduction based on Huygens' Principle (cf. Figure 3.1). But unlike the early 3-Channel or 2-Channel systems, WFS deploys a much larger number of speakers.

As illustrated in Figure 3.8, WFS is capable of manipulating a number of speakers to replicate the primary wavefront and substantially reproduce the originally recorded soundfield.

Figure 3.8: The Principle of Wave Field Synthesis

Ideally, if a soundfield is reconstructed using WFS, listeners are not able to notice any difference from the original, even if they move around during playback. In other words, WFS possesses a large sweet spot. The reproduced sound is of high fidelity as well.

However, WFS also has a couple of disadvantages. Firstly, as with playback in Ambisonics, WFS reproduces soundfields two-dimensionally. This means the soundfield is only precisely reproduced in the horizontal plane at ear level, while the elevation information in the median plane can be missing, since no vertical speaker array is included in its primary design. Furthermore, an enormous number of speakers is required to faithfully reproduce a soundfield using WFS. For example, over 100 speakers are used at Delft University[81], and 24 speakers were used during one of the experiments at the University of Erlangen-Nuremberg[82]. Such a huge number of channels clearly leads to an overwhelming demand for storage and transmission bandwidth, which is certainly one of the reasons why WFS has not been commercialized.

3.1.6 Ambiophonics and Vector Based Amplitude Panning (VBAP)

Several novel techniques have emerged within the past decade trying to improve the quality of 3D soundfield reconstruction. One of them is Ambiophonics[83], which is dedicated to providing listeners at home with a concert hall acoustic environment. In the Ambiophonic system (cf. Figure 3.9) proposed by Angelo Farina et al.[84], a listener is surrounded by a speaker array comprising two closely-placed frontal speakers (a stereo dipole) playing crosstalk-canceled stereo recordings and eight surround speakers driven by convolved B-format signals. Hence, this system can create an ambience that is perceptually similar to a well-designed performance space.

Figure 3.9: Ambiophonics Reproduction[34]

However, it is a rather complicated process. As proposed in [84], a stereo recording has to be carefully edited to prevent the crosstalk effect. The reflected signals have to be removed as well, since the frontal stereo dipole only plays the direct sound. On the other hand, delicate processing is also required to eliminate the direct signals from the B-format surround recordings before playback through the eight surround speakers. In addition, Ralph Glasgal put a barrier in front of the listener in his earlier implementation

of Ambiophonics[83] to cancel the possible crosstalk between the frontal speakers. This is obviously not a desirable solution in terms of convenience and comfort.

A second approach is Vector Based Amplitude Panning (VBAP), developed by Pulkki[85] in 1997. Based on the radial behavior of a soundfield (cf. Section 2.1), this technique is able to spatially create and manipulate auditory events in a soundfield by directly panning the playback speakers. However, such external control over the presence and motion of auditory events is not within the scope of this project. Therefore, VBAP will not be discussed further in this thesis.

3.1.7 Summary

As reviewed in this section, the soundfield systems available to date can arguably be separated into two categories according to their approaches to reconstruction. The first category includes the early 3-Channel system implemented by Steinberg and Snow and the WFS technique, which attempt to reconstruct the original wavefronts based on Huygens' Principle. Obviously, a more accurate reconstruction requires a large number of speakers, which is not economical for domestic use. In contrast, Stereo, Quad, Ambisonics and Ambiophonics take a second approach, which adopts the Blumlein Difference Technique. The focus of this category is to provide a realistic impression for listeners. However, it is prone to a small sweet spot and, most importantly, the systems in this category have limited compatibility, being tied to their pre-determined speaker arrangements. Any slight offset of the speakers may degrade the quality of reproduction. A soundfield system that adapts to irregular speaker arrangements will be explained in Chapter 5.

In addition, it is clear from this section that soundfield systems, from WFS to Ambisonics and even to Stereo, always involve multiple channels. Without compression, the multichannel audio obtained after recording undoubtedly requires a large bandwidth for representation, which may cause potential problems

during storage and transmission. The following section will therefore summarize the available audio compression techniques and explain the potential problem in soundfield systems due to the use of these coding algorithms.

3.2 Summary of Multichannel Audio Compression Techniques

Audio compression, or audio coding, is the process of achieving a compact representation of audio signals, especially multichannel signals, for the sake of convenience in storage and transmission. It has become a vital part of soundfield systems, since the number of channels used in these systems has gradually increased over the past century, from mono to stereo, to six (in typical Ambisonics), and to even larger numbers (in WFS). Therefore, a variety of multichannel audio coding techniques will be summarized in this section to complete the review of soundfield systems and to explore the issue these compression techniques face in representing a large number of audio channels.

In general, audio compression algorithms can be classified into two families: lossless audio coding and lossy audio coding. The following paragraphs will thus describe these two families separately.

3.2.1 Lossless Audio Coding

Lossless audio coding[86], as implied by its name, is a reversible process of changing the representation of audio signals without compromising their quality. Ideally, the exact quality of the original signals can be preserved after decoding if a lossless audio compression algorithm is applied. This is because the algorithms in this family focus on identifying and removing the redundancy within the original audio signals to reach a

more concise representation without introducing errors. Since lossless coding has no quality issues, it is often applied after lossy audio coding to remove the residual redundancy.

The available lossless audio coding techniques in the market include Direct Stream Transfer (DST)[87] in the Super Audio CD (SACD), Meridian Lossless Packing (MLP) for DVD-Audio, and Apple Lossless, developed by Apple Inc. There are also several free codecs, such as Monkey's Audio and the Free Lossless Audio Codec (FLAC). Due to their reversible performance, these lossless audio coding techniques attract attention mostly from audiophiles who are enthusiastic about high fidelity sound reproduction. However, the intention of achieving a perfect reconstruction after decoding substantially limits the compression ratio of lossless audio coding algorithms. In other words, multichannel audio signals compressed using lossless coding techniques will require a "huge" bandwidth for either transmission or storage. For example, a SACD disk with a physical storage capacity of 4.7 GB can only accommodate 74 minutes of both stereo and multichannel audio9 (altogether eight channels) compressed using the DST algorithm10[88].

3.2.2 Lossy Audio Coding

In contrast to lossless compression, lossy audio coding is capable of using a much lower bitrate to represent the same signals, at the expense of quality. During the process of lossy audio coding, both the perceptually irrelevant information and the statistical redundancy are removed by the coders. As a result, the original audio signals are represented with a high compression ratio, while the fidelity may be degraded due to the irreversible changes introduced by lossy coders. A psychoacoustic model (cf. Figure 3.10), which calculates the threshold of human auditory perception as a function of time and frequency, may be incorporated in lossy audio coders, allowing the possibility of "hiding" quantization noise below this threshold of perception. This threshold of

9 The sampling frequency is approximately 2.8MHz (64 times 44.1kHz).
10 The compression ratio varies between 2.6 and 2.9 for different types of music[88].

perception is also known as the masking threshold[40].

Figure 3.10: Generic Structure of Lossy Audio Coding

Since lossy compression schemes possess the advantage of high coding efficiency, and the quantization noise can be controlled by the coders, lossy audio coding has been popular in the market and extensively used in soundfield systems and various other applications, such as digital broadcasting, internet streaming media, and satellite and cable media. Examples of this family include MPEG1-Layer III (MP3)[89],

Advanced Audio Coding (AAC)[90] and Dolby Digital (AC-3)[91], which are among the most popular audio compression schemes in the consumer market. The following paragraphs will review these three techniques in detail.

• MPEG1-Layer III (MP3)

The renowned MPEG-1 audio coder[89] consists of three layers. Each of these three layers deploys a masking model to control the quantization noise, which is inaudible if the level of the noise is under the masking threshold. However, compared

with Layer I and Layer II, MP3 is more complex but excels in quality, since several mechanisms are introduced to improve its performance. The first additional part is a hybrid filterbank, which applies an adaptive Modified Discrete Cosine Transform (MDCT)11[92] after a 32-subband filterbank, resulting in a

11 The MDCT is a Fourier-related transform featuring perfect reconstruction in the time domain due to its use of overlapped windows.

considerable increase in frequency resolution. The adaptive segmentation used during the MDCT helps to restrain the pre-echoing effect (cf. Section 6.1.2) and substantially improves the quality of the outputs. Other schemes, such as a bit allocation loop, nonuniform quantization and entropy coding (Huffman coding[93]), are also added to effectively reduce bitrates and improve quality.

Overall, MP3 supports up to 2-Channel[89]12 audio compression with sampling rates varying from 16kHz to 48kHz. A wide range of bitrates, from 32kbps to

320kbps per channel, is available depending on the desired perceptual quality of the compressed audio. In particular, MP3 possesses a transparent rate of 64kbps per channel[94], above which the encoded audio is perceptually the same as the original. In 1993, the MP3 coder was standardized by ISO/IEC as a part of MPEG-1,

and it has grown in popularity since then. The MP3 scheme has been chosen as the standard compression technique for audio over the internet[95]. Because of its high coding efficiency, it is also deployed by some satellite Digital Audio Broadcasting (DAB) systems[96] in Britain and by many portable digital audio

players.

• Advanced Audio Coding (AAC)

Initially, MP3 was designed to suit the need of coding monophonic or stereophonic signals. As consumer interest in high quality compression of multichannel audio grew in the 1990s, MP3 was modified to compress up to 5.1 channels and included in MPEG-2 BC/LSF[97], which is a backward compatible algorithm. However, it was found difficult to reach a bitrate

lower than 640kbps[98] using the MPEG-2 BC/LSF algorithm during the compression of 5-Channel audio material, because of the imposed backward compatibility. Therefore, a different coding algorithm, the non-backward compatible Advanced Audio Coding (AAC)[99] algorithm, was developed to compress

12 MP3 only supports monophonic and stereophonic audio compression in MPEG-1, but up to 5.1 channels in MPEG-2 mode[97].

multichannel audio signals more efficiently, eventually becoming part of the MPEG standard as MPEG-2 NBC/AAC[90].

Being part of the MPEG-2 standard, AAC does inherit some coding schemes from MP3, such as the adaptive MDCT filter bank, the masking model and the nonuniform quantization strategies. However, it also introduces several changes. Firstly, the 32-subband filter bank is abandoned, and a high resolution adaptive MDCT filter bank is used solely in AAC to produce the subband spectrum. According to the presence of transients in the input signals, the length of the windows used for the MDCT is chosen between 256 samples and 2048 samples. The window shape, on the other hand, is determined with the purpose of achieving the largest coding gain. If strong frequency components are close to one another, a sine window is chosen to produce narrow passbands, which are more capable of separating these dense spectra. A Kaiser-Bessel Derived (KBD) window function is used instead when the signal segment under consideration does not have adjacent strong frequency components. Details of the KBD window can be found in

[100]. Secondly, a noiseless coding scheme is added, working with bit allocation to improve the reduction of redundancy. As a result, even lower bitrates can be achieved. Thirdly, AAC adopts a Temporal Noise Shaping (TNS) module[101] and a spectral predictor to further control pre-echoes and substantially improve its coding efficiency.

Compared with MP3, AAC’s coding efficiency is much higher due to the intro- duction of the above modules. According to [102], AAC possesses a transparent rate of 320kbps for 5-Channel audio compression which is only half of the bitrates required by MP3 coders. Besides, AAC supports the audio compression of up to 48 channels even more than MP3 in MPEG-II mode. Because it outperforms MP3 in dealing with multichannel signals, AAC has been used in an extensive range of multichannel audio applications. For example, it is the default coder for Apple iTunes and used in iPod. Sony’s Play Station Portable (PSP) and even

the latest PlayStation 3 (PS3) both support AAC encoded audio. A satellite Digital Audio Radio Service (DARS) in the U.S. adopts the AAC technique to compress its audio signals as well.

• Dolby Digital (AC-3)

A third multichannel audio coding algorithm is Dolby Digital13, also known as AC-3[91][103][104], which was independently invented and commercialized by Dolby Laboratories in the 1990s. In fact, AC-3 appeared chronologically earlier than AAC, and some of its coding mechanisms were later adopted by AAC during

the collaboration between MPEG and Dolby towards the MPEG-2 standard. Originating from several prior monophonic audio compression algorithms, collectively referred to as AC-2[100], AC-3 was designed specifically for the compression of the standard 5.1 channel surround sound in home theater systems. At the same

time, both monophonic (single channel) and stereophonic (two channels) systems were also supported by AC-3.

Similar to AAC, the AC-3 algorithm deploys an MDCT filter bank in which adaptive KBD14 windows[100] are used to separate the input signals into frames.

Short windows of 256 samples are selected when a transient appears in the frame, while 512-sample windows, in contrast to AAC's long 2048-sample windows, are used for stationary signals. As a result, the frequency resolution of AC-3 is approximately four times lower than that of AAC, given the same sampling

rate. A more significant difference between AC-3 and AAC is the use of a spectral envelope encoder for bit allocation in AC-3. In such an encoder, the MDCT coefficients resulting from the filter bank are represented in a binary exponent/mantissa format. The exponents provide an estimate of the input signal's spectral envelope when several consecutive frames are considered

13 Dolby Digital is the promotional name of an early lossy audio compression technique invented at Dolby Labs. The Dolby Digital logo appears on various audio products using this compression technique.
14 In fact, the KBD window was first used in AC-3, and then introduced to AAC during the collaboration between MPEG and Dolby towards a new ISO/IEC standard.

together. Therefore, a spectral envelope encoder is used in AC-3 to exploit the existing time-frequency redundancy, in conjunction with a psychoacoustic model and a bit allocation module. Eventually, a high compression ratio can be achieved using AC-3.

In general, the AC-3 audio compression algorithm supports several configurations of reproduction, from monophonic to 5.1 surround, allowing bitrates adjustable between 32kbps and 640kbps. High-quality results are guaranteed at a bitrate

of 64kbps per channel[105] when AC-3 is applied. Due to such high coding efficiency for multichannel audio compression, AC-3 was first used in the film, Star Trek VI, in 1991, and its first official public introduction soon followed at the premiere of another film, Batman Returns, in the following year. In 1993, the U.S. Advanced Television Systems Committee (ATSC) selected AC-3

to be the audio coding standard for the North American High Definition Television (HDTV) service[106]. Although the AAC algorithm, which reportedly outperformed AC-3 in perceptual quality thanks to additional modules such as TNS and nonuniform quantization[98][107][108], emerged

in the late 1990s, AC-3 still managed to gain more acceptance in the market, because it is simpler to implement while possessing good performance. Over the past decade, AC-3 has extended its applications to DVD, cable television, the Direct Broadcast Satellite (DBS) service, etc.

Generally speaking, good quality outputs can be guaranteed at an average bitrate of 64kbps per channel for the above three lossy multichannel audio coding algorithms. However, even at this coding efficiency the total bitrate can become tremendous, since the bitrate required for representing the entire multichannel audio signal scales proportionally with the number of channels. Take AAC, which supports up to 48 channels, as an example: in order to maintain good quality, a total of 64 × 48 = 3072kbps is needed to compress a 48-Channel audio file with the AAC algorithm, let alone the 100 channels used in the WFS system at Delft

University[81]. Therefore, a multichannel audio compression technique with higher coding efficiency is essential for good-quality soundfield systems.

3.3 Summary

This chapter has reviewed the techniques involved in reproduction and compression in turn, showing the existing issues in these two stages of a soundfield system.

Firstly, the different reproduction strategies of past and present systems have been reviewed in chronological order. This historical review reveals that the performance of these soundfield systems suffers from the poor compatibility enforced by their limited, pre-determined speaker configurations. As a result, the use of irregular speaker arrangements, often caused by imperfect mounting, can substantially degrade the reproduced soundfields. Secondly, the bandwidth required for multichannel audio representation in soundfield systems appears to be another issue. As explained in the review of the various multichannel audio coding techniques, the total bitrate scales proportionally with the number of channels. So a huge bandwidth will be required when the popular audio coding algorithms, MP3, AAC and AC-3, are used to compress a large number of channels.

In order to improve the performance of soundfield systems and the quality of the reconstructed soundfields, the following two chapters will introduce techniques resolving the issues of soundfield systems in regard to bandwidth and compatibility respectively. Chapter 4 presents another multichannel audio compression algorithm, Binaural Cue Coding (BCC)[109], followed by an explanation of the High Spatial Resolution (HSR)[110] system in Chapter 5.

Chapter 4

Multichannel Audio Compression Technique: Binaural Cue Coding (BCC)

4.1 Problem Review

Globally, as an associated product of soundfield systems, multichannel audio material is attracting tremendous consumer interest in the market. This situation drives the research intent in soundfield reconstruction partly towards enhancing the capability of accommodating all the data required for representing the multiple channels of signals. For example, the renowned 5.1 surround sound configuration and its associated 6-Channel audio have been extensively used in home entertainment systems for the past decade. The increasing popularity of these home theater systems has been strongly supported by the advent of the "powerful" storage medium, DVD, in the 1990s, as well as the constant improvement in telecommunication techniques, which respectively provide sufficient storage and transmission bandwidth. However, this also indicates the potential problem and limitation that may occur in the development of soundfield systems and their multichannel audio. As reviewed in the previous chapter, when the number of channels increases, the amount of data required for a high quality representation of such multichannel audio signals scales accordingly. Consequently, the required bandwidth can be so overwhelming that it may exceed the limit of the DVD's storage capacity or the maximum speed of current networks.

Table 4.1: Comparison of Multichannel Audio Compression Algorithms in Terms of the Resulting Bitrates (kbps)

Compression Technique    Mono (1 chan)    Stereo (2 chan)    5.1 Surround (5.1 chan)
MP3                      64               128                640¹
AAC                      64               128                320²
AC-3                     64               128                384

Table 4.1 lists the resulting bitrates of compressing various numbers of audio channels with MP3, AAC and AC-3, which are commonly deployed for high-quality multichannel coding in the current market. As implied by this table, the bitrate achieved by these audio compression schemes is approximately proportional to the number of channels involved. As a result, the bandwidth required for storing or transmitting the compressed audio can eventually "burst out of control" if an increasing number of channels is used in a system to improve the quality of the reproduced soundfields.

Such an issue can be handled by research towards more powerful storage media possessing larger capacity, and towards faster networks. Alternatively, a fundamental improvement in the efficiency of multichannel audio compression techniques can help to reduce the resulting bitrate and substantially resolve this issue while preserving good quality. This chapter will introduce another multichannel audio coding technique, Binaural Cue Coding (BCC)[109], which is a key component of the audio compression algorithm in the recently ratified MPEG-4 standard[111]. The BCC algorithm was invented by Christof Faller during his pursuit of a doctoral degree a few

1 This bitrate can be achieved by MP3 in MPEG-2 BC/LSF mode.
2 The bitrate is specified for a 5-Channel audio compression using AAC in MPEG-2 mode.

years ago[109][112]-[114]. It follows the latter approach, encoding multichannel audio with an emphasis on preserving spatial information. Details about the coding scheme and my implementation of BCC will be explained in the following sections.

4.2 BCC Design Scheme

As suggested in Section 2.2, in order to improve the efficiency of representing multichannel audio signals while preserving the spatial clarity and temporal quality, the coding algorithms should be capable of identifying and sustaining the essential information in the original multichannel signals. In 2002, Christof Faller proposed an audio compression technique, named Binaural Cue Coding (BCC), utilizing ICTD and ICLD, which are the vital cues for the perception of a soundfield in a soundfield system

(cf. Section 2.2). This technique suits the compression of multichannel signals in particular[109][112], as it will be shown later to possess a much higher compression ratio than the three commonly used algorithms.

Figure 4.1 presents the generic structure of BCC. As illustrated in the diagram, this algorithm compresses one single audio channel using one of the standardized coding techniques (as listed in Table 4.1) and encodes the side information, which has been extracted from the multiple channels of signals, at a much lower bitrate. A perceptually faithful reconstruction is expected after BCC synthesis, since the side information, including ICTD and ICLD, is well preserved to help listeners localize auditory events accurately.

The following subsections further explain the details of BCC coding as implemented in this project, starting with its macroscopic structure.

Figure 4.1: Generic Structure of Binaural Cue Coding (BCC)

4.2.1 Macroscopic Design

A better understanding of multichannel audio coding and decoding using the BCC algorithm can be achieved by looking at the macroscopic structures of BCC Analysis and Synthesis blocks in Figure 4.2.

Figure 4.2: Macroscopic Structures of BCC Analysis and Synthesis

As shown in the above figure, each channel is compressed by one BCC encoder during the process of BCC Analysis, except for the reference channel, which is encoded as a mono signal. At the receiver end, the reference channel is decoded using

the relevant mono decoder. This decoded reference signal is then passed into multiple BCC decoders, simultaneously with the decompressed side information. Eventually, each of the other audio channels can be reconstructed by restoring the relevant spatial information into the decoded reference channel.

As a result, by coding the side information at a much lower bitrate, the total bitrate required for representing the entire multichannel audio can be dramatically reduced. It may be comparable with a standardized compression of stereo signals, since the majority of the compressed data is used to represent the single reference channel. Ideally, this technique will not introduce much perceptual distortion, as long as the spatial cues, ICTD and ICLD, can be precisely extracted, sustained and restored, as discussed in the following paragraphs.

4.2.2 Frequency Processing

Ideally, audio compression can be implemented either in the time domain or in the frequency domain, depending on the individual coding scheme. During my implementation of BCC coding in this project, frequency domain processing techniques were deployed. This is based on the consideration that the human perception of sound/soundfield is strongly frequency dependent, which has been widely accepted as a result of previous psychoacoustic research[8][40][115]. People can normally perceive sound with a frequency varying between 20Hz and 20kHz. Any frequency below 20Hz or beyond 20kHz usually becomes undetectable for human ears3. Furthermore, the audible frequency range is usually partitioned into several segments according to the sensitivity of the human auditory system to individual frequencies. Adjacent frequencies incurring similar responses are grouped together to form a segment. These segments are often referred to as Critical Bands (CBs)[115] and are normally used to define the bandwidths of the filter bank that models the function of the human auditory system. It is

3 The audible frequency range may vary between individuals. For example, young people may sometimes be able to hear sounds with frequencies higher than 20kHz.

also worth mentioning that these critical bands have nonuniform bandwidths. As the center frequency of a CB increases, its bandwidth enlarges accordingly. This is because, to the human auditory system, low frequency components are more distinguishable from one another than high frequencies. It is therefore better to estimate the side information in the frequency domain on a CB basis, to ensure perceptual precision across the audible frequency range while avoiding either redundant estimations for every individual frequency component or insufficient frequency resolution averaging out the differences.

Due to this significant frequency dependence of spatial hearing, a frequency domain processing approach is chosen for the multichannel audio analysis in BCC. Consequently, the Discrete Fourier Transform (DFT) is required to convert the original time signal (a discrete time sequence) of finite duration into its spectral components in the frequency domain. Generally speaking, the DFT is applicable to signals composed of sinusoids whose frequency, magnitude and phase are time-invariant. Decomposing such a signal into its critical bands using this frequency domain approach is computationally cheaper, with the same performance, than temporally filtering/convolving the time signal with a complex auditory filter bank[113].

However, natural audio signals and the characteristics of their frequency components often change over time. Therefore, in digital signal processing, the DFT is usually replaced by the Short Time Fourier Transform (STFT), which calculates the DFT of a windowed segment of a signal of indefinite duration. Assuming the frequency spectrum remains invariant within a short window, the spectral components for the entire duration can be estimated by applying the STFT to each of the consecutive segments as the window, w[n], sweeps across the signal in the time domain.

An extensively deployed window is the zero padded Hann window (cf. Figure 4.3),

which is defined by the following equation:

w[n] =
\begin{cases}
0, & 0 \le n < Z \\
\frac{1}{2}\left(1 - \cos\frac{2\pi (n - Z)}{W}\right), & Z \le n < Z + W \\
0, & Z + W \le n < N
\end{cases}
\qquad (4.1)

where W is the length of the Hann section, Z is the number of padded zeros at each end, and N = W + 2Z is the total window length (cf. Figure 4.3).

Figure 4.3: Overlapped Hann Windows, with Zero Padding at Both Ends

Commonly in digital signal processing, Hann windows with 50% overlap are used to separate the original signal into frames. Such overlapped windowing possesses two obvious advantages. Firstly, the overlapping assures a 'perfect' reconstruction, because the summed gain of the overlapped consecutive windows remains constant at every instant. As a result, no aliasing caused by possible discontinuities at the junction between two adjacent frames will appear in the reconstructed signal.

In addition, the padded zeros can, to a certain extent, prevent the temporal circular shifts caused by the compensation of time delay (time difference) in the frequency domain during processing.
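A short Python sketch of this windowing scheme (my own; the choice Z = 64 is arbitrary, as the thesis does not fix Z) shows that the zero padded Hann windows of Equation 4.1, hopped by W/2, sum to a constant gain in the interior, which is the property behind the 'perfect' reconstruction:

import numpy as np

def zero_padded_hann(W, Z):
    # Equation 4.1: a length-W Hann section with Z zeros at each end.
    n = np.arange(W)
    hann = 0.5 * (1.0 - np.cos(2.0 * np.pi * n / W))
    return np.concatenate([np.zeros(Z), hann, np.zeros(Z)])

W, Z = 896, 64                   # W = 896 as used later in this chapter
w = zero_padded_hann(W, Z)
hop = W // 2                     # 50% overlap of the Hann sections
total = np.zeros(10 * hop + len(w))
for k in range(10):              # overlap-add ten consecutive windows
    total[k * hop : k * hop + len(w)] += w
print(total[len(w) : len(w) + hop])  # interior samples are all ~1.0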

In order to simplify the expression, only one frame of the multichannel audio,

{s_i[n]}, is used in this thesis as an example to explain the process of encoding and decoding in the following subsections. This frame of the signal in the time domain is given by,

s_i[n] = w\!\left[n - K_i \frac{W}{2}\right] \mathrm{chan}_i[n] \qquad (4.2)

where i specifies the channel under consideration, and K_i is the index of the current frame in channel i. Applying the STFT to {s_i[n]}, we obtain the frequency spectrum of this frame, which can then be passed into the BCC encoder for further analysis and compression. The following subsection explains the details of the encoding procedure.

4.2.3 Encoding

The fine structure of a BCC encoder is shown in Figure 4.4. Given the multichannel audio signals {s_i[n]} and one of the channels as the reference, s_ref[n], this figure sketches the key encoding processes of BCC, including a separate MP3 encoder for the reference channel, a cue analysis block that extracts the spatial cues, ICTD and ICLD, from two input channels, and a process of quantizing these spatial cues. These three parts of the BCC encoder are discussed separately in the following paragraphs.

Reference Channel Compression

Figure 4.4: Fine Structure of BCC Encoder

In this project, the reference channel is arbitrarily chosen from the original multichannel audio material, since no significant change in reconstruction quality has been perceived to depend on the choice of reference channel. However, it is obviously preferable to use the channel with relatively more information across the entire time domain, so that the reconstruction quality of the other channels will not suffer from occasional silent gaps in the reference. For example, for a piece of 5.1-Channel audio material, it is desirable to choose the center channel as the reference, because this channel is often used to enhance and stabilize the frontal auditory stage and therefore requires more temporal information/coverage than the other channels. Based on this consideration, the center channel is chosen as the preferred reference during the compression of 5.1-Channel audio materials using BCC within this thesis. As

8-Channel recording and playback (cf. Chapter 5) is also involved in this project, one of the front channels is selected as the reference when BCC is applied to code the 8-Channel audio materials. More details about this case will be given in Chapter 6.

Coding the reference channel can be realized by applying any of the audio compression techniques reviewed in the previous chapter, since it is only a monophonic compression process. In this project, LAME[116], a patent-free MP3 audio compression codec, is adopted to carry out the compression of the reference channel, s_ref,

and its decoding back to s̄_ref. It has been observed that the length of s̄_ref usually differs from that of the original, s_ref, and that this difference varies between versions

of the LAME codec. This is mainly due to variations in the window size or windowing scheme involved in the implementation. Leaving this difference unattended would introduce undesired errors into the reconstruction of the other channels. So a pre-analysis process is necessary to detect and compensate for the delay, synchronizing the decoded reference channel with the other channels.
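One simple way to implement this pre-analysis (my own sketch; the thesis does not prescribe a particular method beyond detecting and compensating the offset) is to locate the peak of the cross-correlation between the original and the decoded reference, then trim the decoded signal accordingly:

import numpy as np

def codec_delay(original, decoded, max_lag=4096):
    # Score each candidate lag by correlating the original against the
    # decoded signal shifted by that lag; the best-scoring lag is the delay.
    n = min(len(original), len(decoded)) - max_lag
    scores = [np.dot(original[:n], decoded[lag : lag + n])
              for lag in range(max_lag)]
    return int(np.argmax(scores))

# s_ref and s_ref_decoded are hypothetical arrays holding the reference
# channel before encoding and after the LAME encode/decode round trip:
# delay = codec_delay(s_ref, s_ref_decoded)
# s_ref_decoded = s_ref_decoded[delay:]   # now synchronized with the others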

Spatial Cue Analysis

The second function block in the BCC encoder performs spatial cue analysis, in which each of the other channels is compared with the reference to generate the essential inter-channel cues, ICTD and ICLD. Obviously, it is better to compare the channel i

signal, s_i, with the decoded reference channel, s̄_ref, rather than with its original, s_ref, when estimating these spatial cues. This is because a certain amount of information may be lost after applying the lossy MP3 compression to the reference channel. Such loss in the reference will accumulate in the other channels during reconstruction if different

signals, s_ref and s̄_ref, are used for analysis and synthesis respectively. Using s̄_ref, which is assured of being present at the decoder, for cue analysis can, on the other hand, eliminate such loss and prevent the resulting quantization noise in the reference from being passed on to the other channels. So in my implementation of BCC, the reference channel, s_ref, is MP3 encoded and decoded before being passed into cue analysis to estimate the ICTD and ICLD for the compression of each of the other channels.

As reviewed previously, the perception of sound is strongly frequency dependent. Therefore, the directional cues are estimated on a CB basis when comparing the

synchronized inputs, s_i and s̄_ref, for cue analysis. In other words, a pair of ICTD and ICLD values is calculated for each critical band at each instant by averaging the time and level differences, respectively, of all the frequency components within that band between the two channels. As the critical band boundaries deployed in this thesis are the same as those adopted in the MPEG-1 Standard[89] (cf. Table 4.2), a total of 24 pairs of directional cues per frame results from the cue analysis for each channel, other than the reference, within the multichannel audio recording.

Table 4.2: Critical Band Boundaries (Hz)

Band    Lower Boundary    Higher Boundary    Width
0       0                 100                100
1       100               200                100
2       200               300                100
3       300               400                100
4       400               510                110
5       510               630                120
6       630               770                140
7       770               920                150
8       920               1080               160
9       1080              1270               190
10      1270              1480               210
11      1480              1720               240
12      1720              2000               280
13      2000              2320               320
14      2320              2700               380
15      2700              3150               450
16      3150              3700               550
17      3700              4400               700
18      4400              5300               900
19      5300              6400               1100
20      6400              7700               1300
21      7700              9500               1800
22      9500              12000              2500
23      12000             15500              3500
24      15500             24000              8500

In order to obtain the signals for each critical band, spectral decomposition is applied in the STFT domain. During this process, the above critical band boundaries (in Hz) are converted and rounded to the nearest spectral indices (in samples) according to the sampling frequency, F_s, of the recording. Assuming A_b and

B_b are the lower and higher boundaries (in samples) of the b-th critical band respectively, the segment, [A_b, B_b), of the signal's frequency spectrum is equivalent to the spectrum of the signal obtained by filtering the original signal with the corresponding bandpass filter in the time domain. However, much less computation is required in the STFT approach, which can be considered another advantage of implementing BCC coding in the STFT domain. In this way, the frequency spectrum of the original channel can be separated into critical bands ready for cue analysis. The following paragraphs take the b-th critical band as an example to explain the algorithms for estimating the directional cues, ICTD and ICLD, separately. From now

on, the notations S_i[θ_k] and S̄_ref[θ_k] will be used to represent the STFTs of s_i and s̄_ref respectively.
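Mapping the boundaries of Table 4.2 to STFT bin indices is a one-line conversion (my own sketch; the FFT size of 1024 is only an example value):

import numpy as np

def band_edges_to_bins(edges_hz, fft_size, fs):
    # Round each boundary in Hz to the nearest spectral index, so that
    # band b occupies the segment [A_b, B_b) of the frame's spectrum.
    return np.round(np.asarray(edges_hz) * fft_size / fs).astype(int)

edges = [0, 100, 200, 300, 400, 510, 630]  # first boundaries from Table 4.2
print(band_edges_to_bins(edges, fft_size=1024, fs=48000))  # [0 2 4 6 9 11 13]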

1. ICTD

As proposed in [117], the value of the ICTD, ΔT_{i,b}, for the b-th critical band of channel i can be determined in various ways, based either on the concept of cross-correlation or on the modeling of group delay.

From the cross-correlation perspective, the time difference, ΔT_{i,b}, between s_i and

s̄_ref can be estimated from the STFT domain implementation of their cross-correlation, following Equations 4.3 and 4.4.

\Delta T_{i,b} = \arg\max_{d_{test}} \{ c_0[d_{test}] \} \qquad (4.3)

where c_0[d_{test}] is given by,

c_0[d_{test}] = \Re\left\{ \sum_{k=A_b}^{B_b-1} S_i^{*}[\theta_k] \, \bar{S}_{ref}[\theta_k] \, e^{j d_{test} \theta_k} \right\} \qquad (4.4)

In the above equations, {d_test} denotes a set of candidate ICTD values, while c_0 can be interpreted as the implementation of the cross-correlation in the STFT approach. In addition, S_i^*[θ_k] stands for the complex conjugate of S_i[θ_k].

Instead of computing the cross-correlation in the time domain, the frequency domain approach is taken, where one signal can be arbitrarily shifted by introducing a phase delay. This method possesses the same high precision as a simple time-domain cross-correlation while requiring much less computation.

This method is referred to as the circular cross-correlation (CXC) algorithm in [117].
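A direct transcription of Equations 4.3 and 4.4 (my own sketch; S_i and S_ref_bar hold one frame's STFT bins and theta the corresponding normalized frequencies):

import numpy as np

def ictd_cxc(S_i, S_ref_bar, theta, A_b, B_b, d_tests):
    # Equations 4.3/4.4: evaluate the frequency-domain cross-correlation of
    # one critical band at every candidate delay and keep the maximizer.
    k = slice(A_b, B_b)
    scores = [np.real(np.sum(np.conj(S_i[k]) * S_ref_bar[k]
                             * np.exp(1j * d * theta[k])))
              for d in d_tests]
    return d_tests[int(np.argmax(scores))]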

Another way of evaluating the ICTD is based on the use of multiple linear regression to model the group delay between the two signals within the specified critical

band. As s_i and s̄_ref are two channels of signals from the same recording, they are most likely coherent. As shown in Equations 4.5-4.8, the group delay between these two coherent signals reflects their time difference. Therefore, the multiple linear regression method can be applied to determine ΔT_{i,b}. This algorithm is named Linear Regression Modeling (LRM).

\bar{s}_{ref}[n] \Rightarrow \bar{S}_{ref}[\theta]; \qquad (4.5)

s_i[n] \approx \bar{s}_{ref}[n - d] \Rightarrow S_i[\theta] \approx \bar{S}_{ref}[\theta] e^{-jd\theta}; \qquad (4.6)

\Psi[\theta] = \angle \bar{S}_{ref}[\theta] - \angle S_i[\theta]; \qquad (4.7)

\Delta T_{i,b} = \frac{\partial \Psi[\theta]}{\partial \theta}. \qquad (4.8)

CXC, LRM and their variations have been tested and compared in terms of computational complexity and accuracy during my implementation of BCC. The experimental results of this comparison have been presented in [117], and details will also be given in Chapter 6. Overall, it has been observed that CXC generally possesses the best performance among the compared algorithms. It is more

accurate at a much lower computational expense. As a result, CXC has been adopted for the estimation of the ICTD in this project.

2. ICLD

The value of the ICLD, ΔI_{i,b}, simply depends on the power of each signal within the

b-th critical band, V_{i,b} and V_{ref,b}. It is often defined as the ratio between these two powers (cf. Equation 4.9).

\Delta I_{i,b} = 10 \log_{10} \frac{V_{i,b}}{V_{ref,b}} \qquad (4.9)

In this equation, the powers of s_i and s̄_ref within the corresponding band are given by,

V_{i,b} = \sum_{k=A_b}^{B_b-1} |S_i[\theta_k]|^2 \qquad (4.10)

V_{ref,b} = \sum_{k=A_b}^{B_b-1} |\bar{S}_{ref}[\theta_k]|^2 \qquad (4.11)

Therefore, Equations 4.9-4.11 have been utilized in my implementation to evaluate the ICLD.
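The corresponding ICLD computation is equally compact (my own sketch; eps guards against an all-zero band):

import numpy as np

def icld(S_i, S_ref_bar, A_b, B_b, eps=1e-12):
    # Equations 4.9-4.11: ratio of the band powers, expressed in dB.
    k = slice(A_b, B_b)
    V_i = np.sum(np.abs(S_i[k]) ** 2)
    V_ref = np.sum(np.abs(S_ref_bar[k]) ** 2)
    return 10.0 * np.log10((V_i + eps) / (V_ref + eps))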

Quantization

A third function block within the BCC encoder is quantization. During this process, the output of the cue analysis, 24 pairs of ICTD and ICLD values for each frame, can be encoded via various quantization algorithms, ranging from the simplest scalar quantization schemes (cf. Figure 4.5) to more sophisticated entropy coding techniques such as Huffman coding[93].

Generally speaking, assuming Bit_t bits on average are used for the quantization of each ICTD, the overall bitrate (kbps) required to represent the ICTD for each channel

during BCC coding is given by,

Bitrate_{ICTD} = N_{CBs} \cdot Bit_t \cdot N_{Frames} / 1024; \qquad (4.12)

where N_CBs and N_Frames denote the number of critical bands adopted during cue analysis and the number of frames obtained within one second's duration after windowing, respectively. A similar equation (cf. Equation 4.13) can be used to calculate the overall bitrate (kbps) of the ICLD.

Bitrate_{ICLD} = N_{CBs} \cdot Bit_l \cdot N_{Frames} / 1024. \qquad (4.13)

Depending on the sampling frequency, F_s, and the length of the window applied, W, N_Frames in the above equations can be approximated as,

N_{Frames} \approx F_s / (W/2) \qquad (4.14)

if 50% overlapped windows, as suggested in the previous sections, are used during implementation. Apparently, the overall bitrate required to code the side information for each channel is proportional to the average number of bits used during quantization, the number of critical bands, and the sampling frequency. Besides, an increase in window size can certainly reduce the resulting bitrate, but it also substantially reduces the time resolution of the entire system. As mentioned previously, W has been chosen to be 896, while 24 CBs have been used in this project.
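As a concrete check of Equations 4.12 and 4.14 (my own arithmetic, assuming each of the 7 quantization levels mentioned below is carried in 3 bits): with F_s = 48000 Hz and W = 896, N_Frames ≈ 48000/448 ≈ 107 frames per second, so Bitrate_ICTD ≈ 24 × 3 × 107/1024 ≈ 7.5 kbps per channel, which agrees with the figure quoted below for uniform quantization.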

As far as the quantization technique is concerned, both the most common uniform quantization and an entropy coding algorithm have been applied to encode the ICTD and ICLD. As the ICTD and ICLD may take both positive and negative values, during uniform quantization the peak-to-peak range is divided equally into an odd number of quantization levels, each of which is specified by one quantizer. In this project, 7 quantizers have been used to code both the ICTD and the ICLD, while their code books

are different for these directional cues. As a result, an approximate bitrate of 7.5kbps per channel can be achieved for quantizing each cue with uniform quantizers, if the original multichannel audio has a sampling rate of 48000Hz. Such a bitrate can be further reduced if an entropy coding algorithm, Huffman coding, is applied. Compared with simple uniform quantization, the Huffman coding scheme has the advantage of producing optimal prefix codes, in which the representation of a certain value is never the prefix of another's, according to the input probability distribution. By assigning fewer bits to the ICTD or ICLD values with higher probabilities, high coding efficiency is assured. In my implementation, an approximate bitrate of 6kbps per channel has been achieved for the quantization of each directional cue using Huffman coding.

Figure 4.5: Uniform Quantization Scheme
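A minimal uniform quantizer of this kind (my own sketch, with an odd number of levels symmetric about zero as in Figure 4.5; the peak value would in practice come from the cue's expected range) can be written as:

import numpy as np

def uniform_quantize(x, peak, levels=7):
    # Map values in [-peak, +peak] onto 'levels' evenly spaced reconstruction
    # points; returns the quantized values and non-negative codewords.
    step = 2.0 * peak / (levels - 1)
    index = np.clip(np.round(x / step), -(levels // 2), levels // 2)
    return index * step, (index + levels // 2).astype(int)

values, codes = uniform_quantize(np.array([-0.9, 0.1, 0.6]), peak=1.0)
# values -> [-1.0, 0.0, 0.667], codes -> [0, 3, 5]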

Overall, the coding efficiency for the reference channel is comparable with that of a monophonic compression using the popular audio compression techniques, while a much lower bitrate is required in BCC to represent the vital information of the other channels. Therefore, a much higher compression ratio can be reached if multichannel audio materials are coded with BCC following the above steps, and this ratio grows as the number of channels increases. The resulting BCC compressed

data can then be either stored or transmitted via high-quality transmission channels. At the other end, the multichannel signals can be reconstructed by BCC decoders, which will be explained in the following section.

4.2.4 Decoding

Figure 4.6: Fine Structure of BCC Decoder

Generally speaking, decoding is the inverse process of encoding. As illustrated in Figure 4.6, the BCC decoder also includes three parts: an MP3 decoder, a de-quantization process and a cue synthesis block. Upon arrival at the receiver, the BCC encoded data corresponding to the reference channel and the side information are separated from each other. The MP3 encoded mono signal is first decompressed by the relevant MP3 decoder, the LAME decoder in my implementation. The directional cues, on the other hand, are retrieved through a de-quantization process and then restored into the decoded reference on a CB basis to reconstruct the other channels via cue synthesis.

Given the decoded reference channel, S̄_ref[θ_k], in the STFT domain and the de-quantized

directional cues, ΔT_{i,b} and ΔI_{i,b}, the restoration of the ICTD and ICLD for the b-th critical band of channel i can be implemented following Equations 4.15 and 4.16.

ICLD:

\hat{S}_i[\theta_k] = \bar{S}_{ref}[\theta_k] \, 10^{\Delta I_{i,b}/20} \qquad (4.15)

ICTD:

\tilde{S}_i[\theta_k] = \hat{S}_i[\theta_k] \, e^{-j \Delta T_{i,b} \theta_k} \qquad (4.16)

In this way, the first half of the frequency spectrum of the channel i signal can be obtained after the synthesis of the ICTD and ICLD in each critical band. The other half of the spectrum can be recovered from the conjugate symmetry of the STFT of a real signal. Within each frame, the time sequence can then be retrieved by computing the inverse STFT of the reconstructed frequency spectrum. Taking the overlapped frames into consideration, the entire digital time signal of channel i can eventually be reconstructed by overlap-adding all the consecutive frames. A similar procedure is followed to rebuild the reference channel after decoding.
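Pulling Equations 4.15 and 4.16 together, the per-frame cue synthesis can be sketched as follows (my own code; band_bins would hold the [A_b, B_b) index pairs derived from Table 4.2):

import numpy as np

def synthesize_channel(S_ref_bar, theta, band_bins, icld_db, ictd):
    # Impose the de-quantized ICLD (Equation 4.15) and ICTD (Equation 4.16)
    # of each critical band onto the decoded reference spectrum. Only the
    # first half-spectrum is processed; the second half follows from the
    # conjugate symmetry before the inverse STFT.
    S_i = np.array(S_ref_bar, dtype=complex)
    for b, (A_b, B_b) in enumerate(band_bins):
        k = slice(A_b, B_b)
        S_i[k] = S_i[k] * 10.0 ** (icld_db[b] / 20.0)       # Equation 4.15
        S_i[k] = S_i[k] * np.exp(-1j * ictd[b] * theta[k])  # Equation 4.16
    return S_i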

In the end, multiple channels of signals can be recovered from the entire BCC coding process, resulting in decoded multichannel audio material with the essential directional information preserved.

4.3 Summary

This chapter has introduced the design principle of Binaural Cue Coding, which can be used to resolve the bandwidth related issue, and has described my implementation of it. As explained previously, the compression ratio of BCC is strongly affected by the coding scheme of the reference channel, while the other channels have much less impact, since far fewer bits on average are required to compress them. In general, the overall bitrate,

66 Bitrateallkbps, of an N-Channel audio compression using BCC is given by,

$$Bitrate_{all} = Bitrate_{ref} + (Bitrate_{ICTD} + Bitrate_{ICLD}) \times (N - 1) \quad (4.17)$$

where $Bitrate_{ref}$ denotes the bitrate used to compress the reference channel. The parameters $Bitrate_{ICTD}$ and $Bitrate_{ICLD}$ are given by Equations 4.12 and 4.13 respectively. Therefore, a bitrate of 64 + (6 + 6) * (8 − 1) = 148kbps has been reached in my implementation when compressing an 8-Channel audio material with the BCC coding algorithm4.
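For clarity, Equation 4.17 can be checked with a few lines of Python; the default parameter values mirror the figures quoted above.

    def bcc_bitrate(n_channels, ref_kbps=64.0, ictd_kbps=6.0, icld_kbps=6.0):
        """Overall BCC bitrate in kbps for N channels (Equation 4.17)."""
        return ref_kbps + (ictd_kbps + icld_kbps) * (n_channels - 1)

    assert bcc_bitrate(8) == 148.0   # the 8-Channel case quoted in the text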

A comparison between BCC coding and other currently popular multichannel audio compression algorithms, in terms of bitrate, is listed in Table 4.3. It is apparent from this table that BCC possesses the highest compression ratio among the four compared algorithms, and its superiority in coding gain over the others strengthens as the number of channels grows. Due to such preponderant coding efficiency, the BCC coding scheme has been included as a key component in the recently ratified MPEG-4 standard.

Table 4.3: Comparison Between BCC and Other Popular Multichannel Audio Compression Algorithms, in terms of bitrate (kbps)

Compression   Mono          Stereo         5.1 Surround   Multichannel Audio
Techniques    (1 channel)   (2 channels)   (6 channels)   (N channels)
MP3           64            128            640^5          64*N
AAC           64            128            320^6          64*N
AC-3          64            128            384            64*N
BCC           64            64+12^7        64+12*5        64+12*(N-1)

As far as reconstruction quality is concerned, the BCC coding scheme ideally ensures good preservation of directional information. This is because the essential cues, ICTD and ICLD, are refined from the original and quantized effectively. However, there are also some issues potentially threatening the quality of BCC. Compared with the other three coding algorithms, BCC utilizes the psychoacoustic model to control the quantization noise only during the compression of the reference, leaving the other channels unattended. In addition, the uniform window size adopted during encoding makes the reconstructed signal prone to the pre-echoing effect. Therefore, a set of subjective tests have been carried out to evaluate the clarity of BCC coded multichannel audio signals. The conduct and results of these tests will be presented in Chapter 6.

4MP3 is used to encode the reference and Huffman coding is applied for quantization in this case.
5This bitrate can be achieved by MP3 in MPEG-2 BC/LSF mode.
6The bitrate is specified for a 5-Channel audio compression using AAC in MPEG-2 mode.
7Assume a bitrate of 6kbps has been reached to quantize each of ICTD and ICLD for each channel.

Generally speaking, BCC is an efficient coding scheme and has thus been deployed to resolve the bandwidth issue of soundfield systems. The next chapter will focus on the second issue and review the principle of the HSR approach to accurately recording and reproducing soundfields, which has been used in my project to enhance the compatibility of the soundfield system.

Chapter 5

High Spatial Resolution (HSR) Soundfield System

In previous chapters, a variety of soundfield systems have been reviewed, along with a comparison among several audio coding techniques used to compress the multichannel audio materials involved in soundfield systems. The Binaural Cue Coding algorithm explained in Chapter 4, in particular, has been implemented in this project to resolve the bandwidth requirement issue of soundfield systems. This chapter will move on to the second issue: the compatibility of soundfield systems with irregular speaker configurations. A different soundfield recording and reconstruction technique, deployed in this project to precisely record and reproduce soundfields with a high spatial resolution, will be introduced in this chapter. The approach taken to convert a given recording into the speaker inputs of an arbitrary speaker configuration will also be explained.

Generally speaking, a soundfield system can be separated into three stages: recording, playback, and a codec in between that converts microphone signals into speaker inputs. As illustrated in Figure 5.1, recording is always isolated from playback within such a soundfield system. This has been the case from the early stereophonic systems through Ambisonics, and even in more recent WFS system studies. The codec, comprising a series of analysis and synthesis steps, is thus required to convert microphone pickup signals into speaker feeds without losing the essential temporal and spatial information of the soundfield. In this way, a microphone recording can be made independently of the configuration of the speakers, yet still be used to replicate the original soundfield at another time or place. This separation between recording and reproduction opens up greater flexibility in soundfield reconstruction. Therefore, the High Spatial Resolution (HSR) soundfield recording and reproduction techniques will be described separately in this chapter, as well as the post-recording analysis and synthesis systems between them. But this chapter starts with a general review of various microphones.

Figure 5.1: Generic Structure of Soundfield Systems

5.1 Review of Microphones

A microphone is simply a transducer that translates acoustic energy (acoustic information) into electric energy. Ever since its first appearance as a 'sound transmitter' in a telephone system in 1876[118], the microphone has undergone extraordinary developments in terms of its element: from the earliest carbon microphone to the crystal microphones that are best suited to telephones. Nowadays, the most common microphones for audio recording are dynamic microphones, which are based on electromagnetic properties, as well as condenser microphones, which possess better sonic quality but require an external power supply. Another option is the electret microphone, a type of condenser microphone that does not require an external power supply due to the use of a permanently-charged electret material. There are also some other microphones available, such as piezoelectric, laser and liquid microphones. In order to ensure the quality of recording, condenser microphones are deployed in this research work.

As far as directionality is concerned, microphones can also be classified as omnidirectional, bi-directional, cardioid, hypercardioid and supercardioid microphones. The 2D polar pickup patterns of these microphones are shown in Figure 5.2.

Figure 5.2: Standard Microphones' 2D Polar Patterns (a. Omnidirectional; b. Bi-directional; c. Cardioid; d. Hypercardioid)

• Omnidirectional Microphone (cf. Figure 5.2a), has the simplest polar pattern and records the sound pressure at its location in the soundfield;

• Bi-directional Microphone (cf. Figure 5.2b), or figure-of-eight microphone, converts the pressure gradient into electrical signals, as is evident from its polar pattern. Its directional pickup pattern introduces the proximity effect[119], the phenomenon of low frequency distortion in its frequency response when the sound source is close to the microphone;

• Cardioid Microphone (cf. Figure 5.2c), with a heart-shaped polar pickup pattern, only picks up sound from the front. Similar to figure-of-eight microphones, the cardioid microphone is also prone to the proximity effect as a result of its unidirectional pattern;

• Hypercardioid Microphone (cf. Figure 5.2d), is similar to the cardioid microphone while having a negative lobe at the rear. This type of microphone and the cardioid both excel in picking up vocal or speech signals. Therefore, they have been extensively deployed in teleconferencing;

• Supercardioid Microphone, has a polar pickup pattern similar to the hypercardioid's, while its negative lobe is smaller than that of the hypercardioid microphone.

These standard microphone capsules are often used separately in recording, conferencing and a variety of other settings. On the other hand, it is also possible to combine some of them to form a new microphone that possesses a rather complex directional pattern. This is the premise of 3D soundfield recording. In general, the currently available microphone assemblies can be categorized into two classes:

Coincident microphones, are designed on the basis of the M/S Coincident technique or the Blumlein Difference Technique (cf. Section 3.1.2), which was originally developed by Alan Blumlein in the 1930s to carry out stereo recording using a coincident pair of figure-of-eight microphones. Within a coincident microphone, the standard microphone capsules share exactly the same recording location. As a result, the multichannel microphone output does not possess any inter-channel time difference, but produces salient inter-channel level differences. According to the M/S Coincident technique, the sum and difference of these multiple in-phase channels of a coincident microphone output, bearing such level difference information, can be used to localize the sound source. The coincident configuration of microphone assembly has therefore been extensively adopted in recording.

Figure 5.3: The 3D Decomposition of B-format Directionality: X, Y, Z are three figure-of-eight microphones; W is an omnidirectional microphone

The most popular surround sound microphone, the Soundfield microphone (cf. Figure 3.5), belongs to this class. As reviewed previously in Section 3.1.4, such a Soundfield microphone is composed of four cardioid microphone capsules producing A-format signals. These internal microphone capsule pickup signals can be refined using matrixing techniques (cf. Equations 3.4 - 3.7). The resulting signals, collectively known as B-format, are able to characterize the recorded soundfield, since they possess the equivalent directionality of a combination of three bi-directional microphones placed along the three orthogonal axes and one omnidirectional capsule. The decomposition of such directionality is shown in Figure 5.3.

Figure 5.4: Double M/S Microphone Design Scheme

Another example of the coincident microphones is the Double M/S[120], which is an improved version of the M/S microphone. As illustrated in Figure 5.4a, a single M/S microphone is composed of one cardioid capsule (Mid microphone) and one figure-of-eight capsule (Side microphone) placed at a right angle to each other. By mixing the output signals of these two microphone capsules, two unidirectional pickup patterns (Left and Right) can be achieved. In addition to such an M/S pair, another cardioid microphone can be placed on the other side of the figure-of-eight microphone to create a Double M/S microphone array. As shown in Figure 5.4b, the two cardioid microphone capsules facing the opposite directions share the same bi-directional microphone. In this way, a Double M/S microphone is able to produce 4-Channel audio recordings using only three microphone capsules to pick up both the frontal and the rear soundfield.

Generally speaking, the coincident microphones excel in stabilizing the localization of auditory images. However, the lack of ICTD, a vital piece of directional information, leads to a lower capability of lateralisation. In other words, the width of the perceived virtual image will be reduced, which is undesirable in soundfield systems. In addition, the proximity effect (cf. the description of bi-directional microphones) of the internal standard directional microphone capsules may introduce distortion into the low frequency response of the recording. This is a salient issue for designs using figure-of-eight, cardioid or other directional microphones internally.

Non-coincident microphones, in contrast to the coincident microphone design strategy, introduce ICTD into the recording by spacing their internal microphone capsules. As a result, the output signals of non-coincident microphones may possess both time and level differences between channels, which consequently improves the accuracy of lateralisation during soundfield reconstruction.

One example of this class is the assembly invented and adopted by the French national radio broadcaster, the Office de Radiodiffusion Télévision Française (ORTF). This microphone assembly is now referred to as the ORTF configuration[121]. Figure 5.5 is a picture of the ORTF microphone. As shown in this picture, an ORTF microphone is composed of two cardioid microphones spaced 17cm apart at an angle of 110°. In this way, both intensity differences and time differences will be captured by the ORTF microphone, as long as the microphone is placed far enough from the sound source to avoid the proximity effect.

Figure 5.5: ORTF Stereo Recording Microphone

The Decca Tree microphone[122], which is commonly deployed for better orchestral recording, is another typical example of the non-coincident microphones. Invented by Roy Wallace, Arthur Haddy and their team at Decca Studio in the 1950s, this type of microphone (cf. Figure 5.6) adopts three omnidirectional microphone capsules1. The use of omnidirectional microphones successfully avoids the proximity effect caused by directional microphones. At the same time, the center microphone, which is placed in front of the other two, helps to stabilize the center auditory image.

Figure 5.6: The Configuration of the Decca Tree Microphone

There are also some other non-coincident microphones available on the market, including the Optimized Cardioid Triangle (OCT) microphone[123], the INA-5 spider surround microphone[124], and the microphones using the Multichannel Microphone Array Design (MMAD) technique[125]. Besides, the microphone array used in Steinberg and Snow's early 3-Channel soundfield system in the 1930s belongs to this class as well. One non-coincident microphone array composed of eight omnidirectional microphones has been deployed during my implementation of a soundfield system in this project to pick up the original soundfield. The following sections will describe the different parts of this soundfield system separately, starting with the recording stage.

1Traditionally, the Neumann M50 microphones were used.

5.2 High Spatial Resolution (HSR) Recording

Fundamentally, the recording process of soundfield systems can be considered as a 3D spatial sampling of the original soundfield using discretely placed microphone capsules, which is similar to the temporal sampling of continuous signals in digital signal processing. Therefore, from a mathematical perspective, soundfield recording can be modeled by the following equation2:

$$c = Bp \quad (5.1)$$

where c is a vector comprising the output signals of the discrete microphones, and p denotes a vector of coefficients that represents the acoustic information of the recorded/input soundfield independently of the microphone array configuration. The matrix B, referred to as the sampling matrix in [110], describes the spatial sampling process during recording. Equation 5.1 is thus known as the sampling relation in soundfield systems.

On the other hand, as reviewed previously in Section 2.1, the sound pressure, $P(r, \theta, \phi, f)$, which is the solution to wave equation 2.2 at $(r, \theta, \phi)$ within the sphere of interest, can be decomposed into terms of Fourier Bessel Coefficients, Spherical Bessel functions and Spherical Harmonics using the Fourier Bessel Expansion (cf. Equation 2.4). Among these three terms, only the Bessel Coefficients are absolutely independent of the orientation and position of the microphone capsule within the recorded soundfield, which makes them the best option to represent the uniqueness/acoustic information of the soundfield. In other words, vector p in Equation 5.1 can be assigned these Fourier Bessel Coefficients in practice. This is the basis of the High Spatial Resolution (HSR) soundfield recording technique, which was developed by Laborie in 2003[110].

2In order to simplify the expression, equations in this chapter are all written in frequency domain with the variables omitted, unless otherwise stated.

Given that omnidirectional microphones are used during recording, the multiple channels of signals contained in vector c correspond to the sound pressure at the recording spots of the soundfield. Comparing the sampling relation, Equation 5.1, with the Fourier Bessel Expansion of sound pressure, Equation 2.4, the sampling matrix B is given by,

$$B_{n,(l,m)}(f) = 4\pi i^l j_l(kr_n) \, y_l^m(\theta_n, \phi_n) \quad (5.2)$$

where n denotes the index of each microphone capsule. This equation illustrates the dependence of matrix B on the configuration, $(r_n, \theta_n, \phi_n)$, of the assembly of omnidirectional microphones. Therefore, the sampling matrix, which is the transfer function of the recording system, essentially expresses the sampling ability of the deployed microphone array. At the same time, the parameter l is defined as the order of the microphone array.

An extensive range of microphone assemblies, including linear, circular, spherical and 3D microphone arrays, have been studied by Laborie et al[110]. Comparing the performance of their sampling matrices, they discovered that, among those tested, irregular placements of microphone capsules generally outperform regular microphone arrays from the spatial perspective. Following this discovery, an irregular array composed of omnidirectional microphones was designed to carry out high spatial resolution soundfield recording. The configuration of such a microphone array is presented in [126]. In order to increase the accuracy of the soundfield system, especially its spatial performance, a microphone array3 (cf. Figure 6.9) has been made according to this configuration and deployed for recording in this project.

3More details about this microphone array and the recording process in this project will be given in Section 6.2.3.

5.3 Reproduction

Commonly, the last stage of a soundfield system is to play back multichannel audio through a speaker array in an attempt to recreate the original soundfield. Similar to soundfield recording, such a reproduction process can be modeled by a radiation system which is described by,

$$\hat{p} = Mg \quad (5.3)$$

where p̂ contains the acoustic information of the reproduced soundfield, and vector g denotes the multichannel input signals of the loudspeakers. The matrix M in Equation 5.3 is often referred to as the radiation matrix, which can be understood as the transfer function of this system describing the radial behavior of the speaker array.

In order to determine the matrix M of a speaker array with a random/irregular arrangement, we can investigate the radial behavior of individual loudspeakers due to the linearity of the reproduction process. Assume the $v$th speaker is placed at $(r_v, \theta_v, \phi_v)$; the soundfield, $\hat{p}_v(r, \theta, \phi, t)$, produced by this speaker with the input signal $g_v(t)$ is given by,

$$\hat{p}_v(r, \theta, \phi, t) = \int_{-\infty}^{+\infty} g_v(\tau) \, f_v^M(r, \theta, \phi, t - \tau) \, d\tau \quad (5.4)$$

where $f_v^M(r, \theta, \phi, t)$ is the transfer function corresponding to the $v$th speaker in the spatio-temporal domain.

Comparing Equation 5.4 and Equation 5.3, we can see that the entries of the radiation matrix, $M_{(l^2+l+m+1),v}(f)$, are the Fourier Bessel Coefficients of $f_v^M(r, \theta, \phi, t)$. As given in [127], these coefficients can be calculated using the frequency response, $H_v(f)$, of the $v$th speaker at $(r_v, \theta_v, \phi_v)$, if the radiation pattern of this speaker is omnidirectional:

$$M_{(l^2+l+m+1),v}(f) = H_v(f) \, \frac{e^{-jkr_v}}{r_v} \, y_l^m(\theta_v, \phi_v)^* \, \xi_l(kr_v) \quad (5.5)$$

where

$$\xi_l(kr_v) = \sum_{q=0}^{l} \beta_{l,q} \, (jkr_v)^{-q} \quad (5.6)$$

$$\beta_{l,q} = \frac{(l+q)!}{2^q \, q! \, (l-q)!} \quad (5.7)$$
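The sketch below evaluates Equations 5.5 - 5.7 for an array of omnidirectional point sources. As with the sampling matrix above, the scipy spherical harmonic convention and a flat default frequency response `H` are assumptions of this illustration.

    import numpy as np
    from math import factorial
    from scipy.special import sph_harm

    def xi(l, kr):
        """xi_l of Equation 5.6, with beta_{l,q} from Equation 5.7."""
        return sum(factorial(l + q) / (2 ** q * factorial(q) * factorial(l - q))
                   * (1j * kr) ** (-q) for q in range(l + 1))

    def radiation_matrix(f, spk_pos, order, H=None, c=343.0):
        """Radiation matrix M of Equation 5.5 (omnidirectional sources).
        spk_pos : list of (r_v, theta_v, phi_v) speaker positions
        H       : optional per-speaker frequency responses (default flat)"""
        k = 2 * np.pi * f / c
        L = (order + 1) ** 2
        M = np.zeros((L, len(spk_pos)), dtype=complex)
        for v, (r, theta, phi) in enumerate(spk_pos):
            Hv = 1.0 if H is None else H[v]
            row = 0
            for l in range(order + 1):
                for m in range(-l, l + 1):
                    # Row index l**2 + l + m matches the (l,m) ordering of B
                    M[row, v] = (Hv * np.exp(-1j * k * r) / r
                                 * np.conj(sph_harm(m, l, phi, theta))
                                 * xi(l, k * r))
                    row += 1
        return M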

Another way of determining the matrix, M, is through direct measurement. In other words, the spatio-temporal response of the speaker array can be estimated through the recording of given multichannel speaker feeds. Some more information on this topic can be found in [128]. This approach, however, was not adopted in this project.

Given Equations 5.1 and 5.3, the ultimate goal of this project, faithful soundfield reconstruction using a given irregular speaker arrangement, can be interpreted as adjusting the speaker feeds, g, to ensure the characteristics of the replicated soundfield, p̂, agree with the original, p. In other words, post-recording processing including soundfield analysis and synthesis is essential to convert the microphone outputs, c, into the proper speaker feeds, g, according to the configuration of the speaker array. This post-recording processing stage in soundfield systems will be discussed in the next section.

5.4 Post-recording Processing

The post-recording processing of soundfield systems, as illustrated in Figure 5.1, converts multichannel recording signals into multiple speaker inputs according to the placement of the speakers. Generally speaking, currently available soundfield systems define their recording output formats based on their specified speaker arrangements (cf. Section 3.1), which simplifies the post-recording processing stage in these systems.

However, it is rather complicated for them to be altered for another configuration of speaker array, especially an irregular arrangement. This is one of the issues raised in Chapter 1, showing the limited compatibility of those well commercialized soundfield systems.

This section addresses this issue by discussing the post-recording analysis and synthesis processes successively. A framework that suits any configuration of the speaker array can be structured by combining analysis and synthesis. Eventually, the proper multichannel speaker feeds can be derived from the recording using this framework.

5.4.1 Analysis

As mentioned in the previous section, faithful soundfield reconstruction can be achieved if p̂ is equal to p. However, the original soundfield, p, is normally unknown in practice, which means the target soundfield of the reconstruction process is unknown as well.

To resolve this issue, p̂ needs to be compared with another soundfield representation, p̃, that is derivable from given conditions, such as the recording signals, while being equivalent to the original, p. In this way, the accuracy of the soundfield reconstruction can be assured. The derivation of this vector is referred to as the post-recording analysis process in this project.

Given the multichannel recording signals and the microphone array configuration, p̃ can be derived using Equation 5.8:

$$\tilde{p} = Ec \quad (5.8)$$

where E denotes the transfer function of the analysis system that extracts the Fourier Bessel Coefficient representation of the recorded soundfield from the multiple microphone output signals. Since the output of this analysis system, p̃, is independent of both the microphone and speaker arrays and characterizes the recorded soundfield, it can be defined as the ultimate recording output format, which is handy for storage or transmission. Therefore, Equation 5.8 can also be interpreted as an encoding process, and E is the corresponding encoding matrix. Figure 5.7 shows this encoding relation in post-recording analysis. The parameter $L = (l+1)^2$ in this figure denotes the number of Fourier Bessel Coefficients that can be derived from the output signals of an $l$th order microphone array, and N is the number of microphone capsules used to construct this array. It is suggested in [127] that $L \le N$.

Figure 5.7: Post-Recording Analysis System

To derive the expression of matrix E, we compare the recording and post-recording analysis processes, which are the reverse of each other. From Equations 5.8 and 5.1, E is an inverse matrix of B, which suggests the encoding matrix E may be expressed as $B^{-1}$. However, B is usually not invertible, nor does it have a unique inverse, due to its singularity. In this case, an inverse of B can be determined using the Generalized Inverse:

$$E = B^T (BB^T)^{-1} \quad (5.9)$$

where $B^T$ is the transpose conjugate of matrix B.

In order to control the noise introduced by this matrix inversion, a parameter, μ, was introduced by Laborie et al in [110]. Equation 5.9 thereby changes to,

$$E = \mu B^T \left( \mu BB^T + (1 - \mu) I \right)^{-1} \quad (5.10)$$

where I denotes an identity matrix. The value of μ varies between 0 and 1, specifying the amplification of noise. As μ decreases, the amplification of intrusive noise is restrained, which consequently increases the temporal fidelity. A smaller μ, on the other hand, reduces the spatial resolution of the system. In order to enable such control over the intrusive noise, Equation 5.10 is adopted for the post-recording analysis in this project.
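A minimal sketch of Equation 5.10, assuming the superscript T denotes the transpose conjugate as stated above:

    import numpy as np

    def encoding_matrix(B, mu=0.9):
        """Regularized encoding matrix of Equation 5.10. mu in (0, 1]
        trades spatial resolution (mu -> 1) against noise amplification
        (mu -> 0); mu = 1 reduces to the plain generalized inverse of
        Equation 5.9."""
        N = B.shape[0]                   # number of microphone capsules
        Bh = B.conj().T                  # transpose conjugate of B
        return mu * Bh @ np.linalg.inv(mu * B @ Bh + (1 - mu) * np.eye(N))

With E in hand, the analysis of Figure 5.8 is simply p̃ = E c, evaluated per frequency bin.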

Figure 5.8: Post-recording Analysis

Figure 5.8 illustrates the entire process of post-recording analysis, including the introduction of the noise control parameter, μ. Following this diagram, information of the target soundfield, which is equivalent to the original, can be extracted/encoded from the multichannel recording signals. The output of this post-recording analysis process is independent of the microphone array configuration and ready to be adapted to any speaker arrangement. The decoding process that converts the output of the analysis into speaker feeds will be discussed next in the synthesis section.

5.4.2 Synthesis

Figure 5.9: Post-recording Synthesis

As illustrated in Figure 5.9, during post-recording synthesis, the output of the prior analysis needs to be converted into speaker feeds according to the configuration of the chosen speaker array. V in this figure denotes the number of speakers used for reproduction. This synthesis system can be modeled by,

$$g = D\tilde{p} \quad (5.11)$$

where D, referred to as the decoding matrix, is the transfer function of this synthesis/decoding system.

Since p̂ is designed to equal p̃, Equations 5.11 and 5.3 can be compared to generate matrix D. Obviously, the decoding matrix in synthesis can be obtained simply by calculating the inverse of the radiation matrix, M. However, this incurs the same problem as in the analysis, because the radiation matrix usually is not square either. A similar approach can be taken to resolve this problem. Minimizing the squared error between p̂ and p̃, we eventually have D as the generalized inverse of M:

$$D = (M^T M)^{-1} M^T \quad (5.12)$$

According to [127], several additional parameters have been introduced by Laborie et al into the above expression of D for optimization. These parameters are explained as follows:

• Matrix W, a spatial weighting window

This matrix is used to specify a certain space in which the soundfield is optimized. The reproduction of the exterior soundfield, on the other hand, will not be considered. This is realized by filling the diagonal entries of W with elements $W_l(f)$, while the remaining entries of this matrix are set to 0. The elements $W_l(f)$ can be determined in two ways.

The first option is to directly assign $W_l(f)$ a specific number, varying between 0 and 1, which specifies the level of concentration on optimizing the $l$th order soundfield reproduction, ranging from the least to the most.

A second approach is to rely on a given window model, $W(r, f)$, in direct space $(r, \theta, \phi)$. One example of such a window model is a weighting ball defined as,

$$W(r, f) = \begin{cases} 1 & \text{if } r \le R(f) \\ 0 & \text{if } r > R(f) \end{cases} \quad (5.13)$$

where $R(f)$ denotes the radius of this weighting ball. It was suggested in [127] that, applying the Fourier Bessel Expansion, $W_l(f)$ can in general be computed from $W(r, f)$ by,

$$W_l(f) = 16\pi^2 \int_0^\infty W(r, f) \, j_l^2(kr) \, r^2 \, dr \quad (5.14)$$

Substituting Equation 5.13 into Equation 5.14, we have $W_l(f)$ as,

$$W_l(f) = 16\pi^2 \int_0^{R(f)} j_l^2(kr) \, r^2 \, dr \quad (5.15)$$

According to the properties of spherical Bessel functions of the 1st kind[37] (cf. Appendix A.2.2), Equation 5.15 can be computed as,

$$W_l(f) = 8\pi^2 R^3 \left[ j_l^2(kR) + j_{l+1}^2(kR) - \frac{2l+1}{kR} \, j_l(kR) \, j_{l+1}(kR) \right] \quad (5.16)$$

given that $R(f)$ is a constant, R, across the entire frequency domain. This equation has been adopted in this project.

• Matrix F, imposed Fourier Bessel Coefficients

Although it is impossible to replicate a soundfield exactly as it was using a limited number of speakers, we can still enforce some of the Fourier Bessel Coefficients to be reproduced properly by adding matrix F. It is a matrix of size K × L, indicating the Fourier Bessel Coefficients that are required to be reproduced faithfully. L is the same parameter as in matrix E, specifying the number of coefficients considered in the entire system, while K denotes the number of coefficients imposed for faithful reproduction. Because perfect reproduction is unlikely to be achieved, K is always smaller than the number of speakers, V [127]. In this project, K is chosen to be 4, since four out of eight speakers are mounted in the front to stabilize the frontal auditory images (cf. Chapter 6).

This matrix enables control over the variation of spatial resolution for different speakers and increases the flexibility in selecting speaker setups. Even an irregular configuration can be used to faithfully reproduce soundfields.

• μ, a control parameter

Similar to the post-recording analysis, an additional parameter, μ, can be introduced to provide control over the accuracy of the various Fourier Coefficients. If μ = 0, only the imposed coefficients specified by F will be faithfully reproduced. As μ approaches 1, the possibility of reproducing the other coefficients more precisely improves accordingly.

Introducing the above additional parameters, we have the optimized D given by[127],

$$D = \mu A M^T W + A M^T F^T \left( F M A M^T F^T \right)^{-1} F \left( I_L - \mu M A M^T W \right) \quad (5.17)$$

where

$$A = \left( (1 - \mu) I_N + \mu M^T W M \right)^{-1} \quad (5.18)$$

According to this expression, D simplifies to,

$$D = (M^T W M)^{-1} M^T W \quad (5.19)$$

when μ = 1 and matrix F is null. On the other hand, if μ = 0 and matrix F is not null, Equation 5.17 reduces to,

$$D = M^T F^T (F M M^T F^T)^{-1} F \quad (5.20)$$

The control imposed by the additional matrix F and parameter μ is thereby demonstrated in Equations 5.19 and 5.20.
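The sketch below assembles Equations 5.17 and 5.18 as reconstructed above. It assumes that $I_N$ in Equation 5.18 denotes the V × V identity (the dimension of $M^T W M$) and that the superscript T again denotes the transpose conjugate.

    import numpy as np

    def decoding_matrix(M, W, F, mu):
        """Optimized decoding matrix of Equations 5.17 and 5.18.
        M : L x V radiation matrix
        W : L x L spatial weighting window
        F : K x L imposed-coefficient selector
        mu: control parameter in [0, 1]"""
        L, V = M.shape
        Mh = M.conj().T
        A = np.linalg.inv((1 - mu) * np.eye(V) + mu * Mh @ W @ M)  # Eq. 5.18
        G = A @ Mh @ F.conj().T
        # Eq. 5.17: exact reproduction of the K imposed coefficients plus a
        # weighted fit of the remaining ones
        return (mu * A @ Mh @ W
                + G @ np.linalg.inv(F @ M @ G)
                    @ F @ (np.eye(L) - mu * M @ A @ Mh @ W))

Setting μ = 1 with an empty F, or μ = 0 with a non-null F, recovers the two limiting cases of Equations 5.19 and 5.20.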

Eventually, the decoding matrix, D, can be obtained using Equation 5.17 and applied to convert the Fourier Bessel Coefficients of the target soundfield into the input signals of the selected speaker array. As a result, a faithful reproduction of the original soundfield is expected during playback.

5.5 Summary

Figure 5.10: High Spatial Resolution Soundfield System

The previous sections have discussed the various stages of the High Spatial Resolution (HSR) soundfield system (cf. Figure 5.10). As illustrated in this figure4, the separated recording and reproduction stages are successfully connected by an additional processing block in this soundfield system. The post-recording processing block, comprising analysis and synthesis, can convert multichannel recording signals into speaker feeds, producing an intermediate product, the Fourier Bessel Coefficients of the original soundfield, which is independent of both the microphone array configuration and the speaker set arrangement. Therefore, a faithful reconstruction of the original soundfield is guaranteed using this post-recording processing framework even if an irregular speaker array is deployed for playback. In other words, compatibility is enhanced in HSR soundfield systems.

At the same time, the HSR soundfield system bears the potential for external control over the soundfield reconstruction. By carefully selecting values for the additional parameters, the inverse matrices used in post-recording processing can be optimized and used to improve the reconstruction quality. In addition, according to the HSR recording technique, the simplest omnidirectional microphones can be used for high spatial quality soundfield recording.

4The controlled noise, Q(f), is a generalized concept including all the artificial noise that may appear during post-recording analysis and synthesis.

Another thing worth mentioning is that the Fourier Bessel Coefficients resulting from the post-recording analysis can be defined as the output format of the recording. This is because those coefficients are independent of both the microphone array and the speaker set while characterizing the features of the recorded soundfield.

In this project, several soundfields have been recorded and reconstructed using this HSR approach, followed by subjective experiments to evaluate the quality of the reproduced soundfields. The conduct and results of those tests will be discussed in the next chapter.

Chapter 6

Experiments and Results

Previous chapters have reviewed the design of various soundfield systems as well as the multichannel audio compression techniques that can be applied in these systems, revealing the existing issues of the popular soundfield systems in terms of compatibility and bandwidth. In order to address these two issues, a soundfield system that deploys BCC coding (cf. Chapter 4) for its multichannel audio compression has been implemented following the HSR approach (cf. Chapter 5). Experiments including subjective tests have been conducted in this project to evaluate the performance of such a system compared with several other systems. Details of these experiments, including the implementation of BCC and the setup of the acoustic experiments, will be explained in this chapter. Test results will also be presented and discussed at the end of this chapter as a part of the comparison.

As this project started with the implementation of the BCC coding algorithm, experiments on the BCC implementation will be presented first in this chapter.

6.1 Experiments on BCC Implementation

In the early stage of this project, the BCC coding algorithm was first investigated and implemented over 6-Channel audio materials to resolve the bandwidth related problem in soundfield systems. Following the steps of BCC coding as explained in Chapter 4, the essential spatial cues, ICTD and ICLD, which affect the localization of auditory images during listening, were extracted from each channel and then carefully quantized. In this way, the spatial information carried by ICTD and ICLD could be retained after reconstruction. Therefore, the major issue during the BCC implementation was to obtain an accurate estimation of ICTD and ICLD for each critical band of the multichannel signals. Several experiments have been carried out to determine the best method of time difference estimation (TDE) in the frequency domain. Results of these experiments will be given and discussed in the following subsection.

6.1.1 Estimation of ICTD

The estimation of time differences is extensively involved in radar, sonar, auditory localization and a variety of other signal processing and telecommunications applications. Current TDE algorithms, in general, are variations on the concept of cross-correlation. The time difference can also be determined from the group delay in a frequency-domain approach. Four algorithms[117] have been proposed in this project to estimate the time difference in the STFT domain, especially for subband analysis. These algorithms are:

• Circular Cross-Correlation (CXC)

Circular cross-correlation (CXC) is developed to improve the resolution of the delay estimation while keeping the computational complexity penalties as low as possible. As explained in Chapter 4, instead of computing the cross-correlation in the time domain, a frequency domain approach is taken (cf. Equations 4.3 and 4.4), where one signal can be arbitrarily shifted by introducing a phase delay (see the sketch after this list). This method results in a much lower complexity compared to an interpolated time-domain approach, while still possessing a higher precision than a simple cross-correlation method. The CXC technique may however be prone to circular shifts.

• Non-Circular Cross-Correlation (NCXC)

This method is designed to avoid the effects of circular shifts in the previous method by padding zeros at the end of the signals, while maintaining the same level of accuracy as the CXC approach. However, a dramatic increase in computational complexity results from this approach, especially when transforming signals between the time domain and the frequency domain. This is due to the consequent increase in signal length after zero padding.

• Linear Regression Modeling (LRM)

As shown in Equations 4.5 - 4.8, the Linear Regression Modeling (LRM) method of TDE is based on the concept of group delay between signals in the frequency domain. Multiple Linear Regression modeling can thus be applied to estimate such group delay and thereby obtain the time difference. However, because of the limited1 number of samples available in subbands, errors can be significant when using this technique.

• Zero-Padded Linear Regression Modeling (ZPLRM)

To alleviate the error caused by the limited number of frequency samples, while not imposing a coding-delay penalty, we consider zero padding the original sequences in the time domain. Padding the time signals with zeros has the effect of interpolating samples in the frequency domain. Therefore, the precision is expected to improve, especially in the low frequency region, at the expense of computational complexity.

In order to determine the optimal method for estimating ICTD in the BCC implementation, the performance of the above four algorithms has been compared in terms of accuracy and computational complexity[117]. Since sinusoidal signals are the fundamental elements of both natural and artificial sound that can be processed by human ears[129], signals synthetically generated from multiple sinusoids are deployed as stimuli in the experiments. The original time delay can be artificially introduced into the stimuli and ultimately compared with the estimated time difference to evaluate the precision of each algorithm. In addition, the original sampling frequency Fs is set to 32kHz, the signal length or frame size is chosen as 1024 samples, and the testing delay set, {dtest} (cf. Equation 4.4), has a resolution of 1/8 sample.

1When the subbands represent critical bands, and coding-delay considerations preclude the usage of long time frames, only a couple of frequency samples are available at the lower end of the spectrum.

Figure 6.1: TDE for Multi-Sinusoids: Integer Samples, Mean Error vs Subband Center Frequency fc

Figure 6.1 shows results for the case where the 32 sinusoidal frequency components of the stimuli correspond exactly to STFT-domain indices. 32 subbands with uniform bandwidth are used to divide the frequency range from 0 to Fs/2 Hz in this case. This is symbolized as 32/32 in the figure, meaning each subband covers at least one frequency component of the signal, although this component may switch among frequency samples within the band. For each algorithm, the errors resulting from such variation in each band are averaged and plotted against the corresponding center frequency. We can see from this figure that the best overall performance is obtained when CXC is applied. Besides, both CXC and NCXC are more accurate than the other two methods. Comparing LRM with ZPLRM, the mean error achieved by the latter is generally lower than that of the former. Thus, as expected, zero padding does produce better accuracy. However, even with zero padding, the linear regression methods do not compare with the cross-correlation based methods.

Figure 6.2: TDE for Multi-Sinusoids: Non-integer Samples, Mean Error vs Subband Center Frequency fc

The second case tested is similar to the above, considering 32 subbands and 32 sinusoidal components (32/32). This time, however, the frequency components lie between integer samples in the STFT domain. Results for this case are shown in Figure 6.2. Obviously, the CXC and NCXC algorithms again outperform LRM and ZPLRM. Another thing worth noticing is that the mean error is larger than that of the previous case, especially at low frequencies, when using CXC. This is because the energy of each frequency component is smeared into adjacent frequency samples. This is perhaps a more realistic representation of what happens with natural audio signals, whose frequency components will invariably lie on the frequency continuum rather than at distinct frequency samples.

A third case, where the number of frequency components per subband is gradually increased, is also examined to investigate the performance of CXC and NCXC with more complex stimuli. In this part of the experiments, 32 subbands with uniform bandwidth continue to be used, while the number of sinusoidal components in the stimuli doubles each time, labeled as 32/32, 64/32, 128/32 and so on in Figure 6.3. As shown in this figure, both CXC and NCXC maintain good accuracy for TDE at relatively high frequencies (above 4kHz), while in the low frequency region the mean error tends to increase with more sinusoidal components.

Figure 6.3: TDE for Multi-Sinusoids: Complex, Mean Error vs Subband Center Frequency fc

Overall, it is concluded in [117] that the CXC and NCXC algorithms perform better than LRM and ZPLRM, which means Multiple Linear Regression may not be a suitable model for the application of ICTD estimation in multichannel audio compression. Besides, zero padding in the time domain, applied in both NCXC and ZPLRM, is shown to reduce the mean error, especially at high frequencies. However, a complexity penalty is imposed by zero padding, which makes the NCXC and ZPLRM algorithms computationally more expensive than CXC and LRM respectively. Therefore, the CXC algorithm is selected to estimate ICTD in the implementation of BCC coding in this project.

6.1.2 Pre-echoing Effect

Following the procedure described in Chapter 4, the BCC coding algorithm has been implemented using CXC for the estimation of ICTD. It is then applied to compress several 6-Channel audio clips, including 'circus'2, 'glock'3, 'dixieland' and 'violin'. After BCC decoding, the reconstructed clips are used in subjective experiments to evaluate the performance of this multichannel audio compression technique. However, it was observed during informal listening tests that the quality of the reconstruction varies significantly over different types of audio content. Among the above four audio clips, 'glock' has the worst perceptual quality after decoding. This can be explained by the pre-echoes[98] that are artificially introduced into the time sequences after the frequency domain processing in BCC.

As illustrated in Figure 6.4, the pre-echo distortion is caused by sharp attacks (energy transients) in the recording. Due to the Gibbs phenomenon4 in Fourier Analysis[130], the energy of sharp attacks is spread backwards and forwards across the window, introducing temporal artifacts after processing in the STFT domain. Since forward masking is much stronger than backward masking, the forward artifacts are less audible, while the backward distortion, the pre-echo, more severely damages the perceptual quality.

Nowadays, pre-echo control has become an essential unit in lossy audio compression techniques involving frequency domain analysis. A common way to restrain the pre-echoing effect is adaptive windowing[132], in which the window length is switched between long and short according to the appearance of energy transients in the time signal. As shown in Figure 6.4c, a shorter window is used when a transient appears. In this case, less pre-echo distortion is introduced during reconstruction, and the perceptual quality can be substantially improved. On the other hand, long windows are applied to increase the coding efficiency whenever the audio signal is stationary. A simple adaptive windowing process used in this project to improve the performance of BCC coding will be explained in the following paragraph.

2The 'circus' clip is a recording of the noisy sound environment in a circus arena.
3The 'glock' clip is a recording of a glockenspiel.
4The Gibbs phenomenon[131], also known as ringing artifacts, refers to the overshoot behavior that occurs at each jump discontinuity when a Fourier series is used to approximate a piecewise smooth periodic function. It was first analyzed in detail by Josiah Willard Gibbs between 1898 and 1899.

Figure 6.4: Pre-echoing Effects

Prior to adaptive windowing, the audio signal is first separated into frames using short windows (256 points). The energy within each frame is calculated and then compared to an energy threshold which is set to distinguish high energy frames from the others. If the energy changes from low to high between two adjacent frames, the latter one is identified as the frame that contains a significant transient. During adaptive windowing, the frames containing energy transients and their adjacent overlapped frames remain short, while the rest are changed to long frames, unless there is a transition (cf. Figure 6.5) between a long and a short frame. In this way, the coding gain can be improved with long windows applied to stationary signals. At the same time, the pre-echo distortion can be effectively controlled by short windows.

Figure 6.5: Adaptive Windowing Scheme
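The transient detection that precedes the window switching can be sketched as follows; the `threshold` value is a placeholder for the energy threshold described above, not a figure from the project.

    import numpy as np

    def flag_transient_frames(x, frame_len=256, threshold=0.01):
        """Short-window energy scan preceding adaptive windowing (a sketch).
        A frame is flagged when the energy classification changes from
        low to high between two adjacent frames."""
        n_frames = len(x) // frame_len
        energy = np.array([np.sum(x[i * frame_len:(i + 1) * frame_len] ** 2)
                           for i in range(n_frames)])
        high = energy > threshold           # classify frames as high/low energy
        flags = np.zeros(n_frames, dtype=bool)
        flags[1:] = high[1:] & ~high[:-1]   # low -> high transition
        return flags                        # True marks a transient frame

Frames flagged here, together with their overlapped neighbours, keep the short window, while the remaining stretches are merged into long frames as described above.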

Such an adaptive windowing process has been embedded in my implementation of BCC coding during this project. The performance of this implementation, along with the HSR soundfield system, is then subjectively evaluated via several acoustic listening experiments. The preparation of all the acoustic experiments carried out in this project will be described in the next section, followed by a discussion of the results collected in the subjective tests.

6.2 Acoustic Experiments Preparation

In addition to the experiments described in the previous section, this project involves a series of acoustic experiments, including HSR soundfield recording, reproduction, and several listening tests for subjective evaluation. Details about the apparatus deployed in these acoustic experiments will be given in this section, starting with the acoustic environment of recording and reproduction.

6.2.1 Anechoic chamber

In order to evaluate the performance of the HSR soundfield system, a free-field, i.e. a noiseless acoustic environment, is essential for recording and reproduction, because additional acoustic noise introduced during recording or reproduction may consequently affect the results of the subjective listening tests. Therefore, the entire process of HSR soundfield recording, reproduction and the following subjective tests in this project is carried out in an anechoic chamber, the kind of facility extensively deployed in acoustic experiments as a free-field. In such a chamber, the walls, as well as the ceiling and floor, are completely covered with acoustic absorbent material that prevents internal sound from being reflected. At the same time, the inside space of an anechoic chamber is acoustically isolated from any external noise, preventing the internal soundfield from being interfered with by external soundfields.

An anechoic chamber has been installed within the acoustic laboratory in the school of Electrical Engineering and Telecommunications (EE&T) at the University of New South Wales (UNSW). Figure 6.6 presents a 3D simulation of this anechoic chamber explaining the geometric design of its effective space. This particular anechoic chamber has been qualified according to the ISO 3745 standard[133] with a cut-off frequency of 250Hz[134]. This means the anechoic chamber is reflection free for any frequency above 250Hz.

In this project, the anechoic chamber has been equipped with eight digital loudspeakers to create soundfields for recording and reproduction. The configuration of these speakers will be described in the following subsection.

Figure 6.6: 3D Geometric Model of the Anechoic Chamber

6.2.2 Speaker Array Configuration

The speaker array deployed in this project consists of eight GENELEC digital loudspeakers (cf. Appendix B.1), but it does not have a regular configuration like 5.1 surround sound (cf. Section 3.1.4) or the stereo dipole (cf. Sections 3.1.2 & 3.1.6). Instead, these digital speakers, regarded as point sources, are allocated three dimensionally on a sphere. In addition, the radius of this sphere is chosen so that the speaker array makes the best use of the effective space to generate soundfields in the chamber. Therefore, the speakers, arbitrarily labeled with numbers (cf. Figure 6.6), are mounted in the corners of the room with their acoustic axes pointing towards the center. In this case, the radius of the sphere is approximately 3.03m.

Figure 6.7 shows a geometric simulation of how speaker No.3 is mounted in the corresponding corner. In this figure, 'A' is an irregular tetrahedral wedge that has been firmly fastened onto the wall. Details about the size of this wedge can be found in Appendix B.2. 'B' is a Vogels loudspeaker wall support whose dimensions are shown in Figure 6.8. The tilt-and-turn function of this speaker support guarantees that the acoustic axis, labeled 'D', of each speaker can be pointed towards the center point of the chamber. Letter 'C' symbolizes the GENELEC digital loudspeaker used in this project.

Figure 6.7: Geometric Simulation of Speaker Mounting: No. 3

Figure 6.8: Vogels Loudspeaker Support

One thing worth clarifying about the speaker mounting is that the bottom four speakers shown in Figure 6.6 are inversely mounted, in an upside down position. This is because the acoustic axis, 'D', of the GENELEC digital loudspeaker is closer to the top (cf. Appendix B.1). The inverse mounting makes sure the speakers do not touch the floor or walls on either side.

Prior to the recording, where these digital speakers are used to create the original soundfield, and the other acoustic experiments, the loudness of these speakers was adjusted to the same volume. In addition, their positions were calibrated about the center of the anechoic chamber using an 8-Channel chirp signal. Within this calibrating signal, each channel contains a short period of chirp signal, which is a linear swept-frequency cosine signal, with the rest of the channel filled with silence. The position of this chirp differs in each channel without any overlapping, meaning the eight speakers sound consecutively during calibration. An omnidirectional microphone is placed in the center of the chamber to record the sound. From this calibration recording, the sound from each speaker can be easily discriminated and compared with the original speaker feeds for delay estimation. According to the differences in these delays, the directions of the speakers can be adjusted to make sure their delays ultimately agree with one another. In this way, no ICTD or ICLD due to the speaker configuration is introduced into the soundfield to mislead the subjective test results.
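The delay estimation underlying this calibration can be sketched as below; the chirp parameters and the per-channel `slot` spacing are assumptions for illustration rather than the exact calibration signal used.

    import numpy as np
    from scipy.signal import chirp, correlate

    fs = 48000
    t = np.arange(int(0.5 * fs)) / fs                  # 0.5 s chirp per speaker
    probe = chirp(t, f0=100, f1=20000, t1=t[-1])       # linear swept-frequency cosine

    def speaker_delays(recording, n_speakers=8, slot=fs):
        """Estimate each speaker's arrival delay from the centre-microphone
        recording of the consecutive chirps (a sketch; `slot` is the
        assumed per-channel spacing of the calibration signal in samples)."""
        delays = []
        for v in range(n_speakers):
            segment = recording[v * slot:(v + 1) * slot]
            xc = correlate(segment, probe, mode="valid")
            delays.append(int(np.argmax(np.abs(xc))))  # lag of the best match
        # Speaker positions are then adjusted until all delays agree
        return np.array(delays)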

Overall, the same speaker array described above is used for both recording and reproduction in this project, since this is the premise of the subjective evaluation. The original soundfield that has been created during recording can be regenerated in listening tests using the same set of speakers to provide subjects with the reference soundfield. Therefore, the performance of soundfield systems can be rated according to the comparison between the reconstructed soundfields and this reference/original soundfield. The next section will describe the microphone array that has been used to record the original soundfield in this project.

6.2.3 Microphone Array Configuration

Figure 6.9: HSR Recording Microphone Array: consists of 8 omnidirectional microphones

As mentioned in Chapter 5, a prototype HSR recording microphone array[126] has been produced and utilized in this project for soundfield recording. Figure 6.9 shows the top view and side view of this microphone array. As shown in this figure, eight omnidirectional microphones are embedded in a butterfly-like frame. This frame is made of wood and covered with acoustic absorbent material, so that the potential acoustic reflections caused by this stiff frame can be eliminated.

When recording the original soundfield, this HSR microphone array is placed horizontally in the center5 of the anechoic chamber. A microphone stand (cf. Figure 6.10) is used to hold this array at the desired height and angle. In this way, the sound pressure around the center point of the original soundfield can be successfully picked up by these eight omnidirectional microphones and converted into 8-Channel digital audio signals6 using an RME Multiface audio interface7[135] and a Hammerfall HDSP sound card[136]. A powerful workstation is used to control the entire recording process and store the 8-Channel output signals of the recording.

Figure 6.10: Recording Position

Given these recording outputs, the post-recording analysis and synthesis (cf. Chapter 5) can be applied to generate the speaker feeds according to the specific configuration of the speaker array. The speaker input signals can also be compressed using the BCC coding algorithm to reduce the bandwidth required for the representation of these signals.

5The HSR microphone array's position refers to the position of microphone No.3 in Figure 6.9. In other words, microphone No.3 is placed at the center point of the chamber.
6The original recording output signals are sampled at a rate of 48kHz.
7In total, 16-Channel recording is supported by this RME Multiface audio interface.

In the end, playing these speaker feeds back through the speaker array completes the HSR soundfield system. Subjective tests have been conducted to evaluate the performance of this system. The next subsection will briefly describe these subjective experiments.

6.2.4 Subjective Tests

The subjective evaluation in this project has been conducted according to an ITU standardized method for the subjective assessment of audio materials, the Multi Stimulus Test with Hidden Reference and Anchor (MUSHRA). This method, mostly used for mono and stereo sound, has been extended here to assess the quality of multichannel audio signals. Details about the standard procedure of MUSHRA tests can be found in [137].

Figure 6.11: Graphic Terminal (Qterm-Z60) and GUI in MUSHRA Tests

During the MUSHRA tests in this project, a Qlarity graphic terminal (Qterm-Z60) is placed inside the anechoic chamber for scoring, while a workstation that sits outside the room is connected to this terminal and controls the progress of the tests.

Figure 6.11 shows the Graphical User Interface (GUI) which is displayed on the terminal during the MUSHRA tests. As illustrated in this figure, up to 7 stimuli in addition to the reference are supported by this GUI in one session of a MUSHRA test. In order to avoid subjective bias from both the listeners and the test conductors, the playlist is automatically randomized and recorded by the workstation at the beginning of each session. Therefore, the order of presentation of the stimuli is secret to both parties until the end of the MUSHRA test, when all the data are collected and stored on the workstation. This has made the subjective tests of this project double-blind.

Once a MUSHRA session starts, the subject is instructed to first listen to all eight multichannel audio clips, including the reference and 7 other stimuli, before starting to score. Then, he or she can rate the quality of each stimulus individually against the reference (the original). In addition, stimuli can also be rated against one another. A Continuous Quality Scale (CQS) is adopted during evaluation.

Figure 6.12: The Continuous Quality Scale in MUSHRA

As shown in Figure 6.12, this scale can be divided into five equal bins with the corresponding descriptions written alongside. The upper and lower boundaries of each bin are also presented in this figure. The subject can always replay a certain stimulus by selecting numbers through the touch screen of the graphic terminal at any time during the test. After all 7 stimuli are rated, the subject can press 'Done' to end this session of the MUSHRA test. The scores are then transferred from the terminal to the workstation and saved into a result file along with the corresponding playlist, the name of the subject and the time of the test. The experimenter can access the result files to retrieve and analyze these scores in the end.

Subjective tests, as explained in the previous paragraphs, have been conducted in this project to evaluate the performance of the BCC coding and HSR soundfield systems. In total, eight subjects have participated in these tests. The original multichannel audio clips have served as both the reference and the hidden reference, while the hidden anchor is the uncompressed signal with its amplitude shaped in the time domain. The next section will explain the subjective tests on the BCC coding and HSR soundfield systems separately and discuss their results in detail.

6.3 MUSHRA Tests and Results

The performance of the BCC coding algorithm and the HSR soundfield systems implemented in this project has been subjectively tested using MUSHRA in an anechoic chamber. In the part of the MUSHRA tests on the BCC multichannel audio compression technique, subjects are required to score the decoded 6-Channel signals based on their temporal quality only. The overall performance, including both spatial and temporal quality, is instead the focus of the MUSHRA tests conducted for the HSR soundfield systems. This part of the MUSHRA tests also involves the evaluation of a BCC embedded HSR soundfield reproduction. Details and results of these MUSHRA tests are given in the following two subsections.

6.3.1 MUSHRA Tests on BCC

After being implemented, BCC coding, a key component of the recently ratified MPEG-4 standard, is applied to compress several 6-Channel audio clips (cf. Section 6.1.2). The clarity of these multichannel audio clips after decoding is subjectively evaluated in the first part of the MUSHRA tests. Only one clip and its various BCC coding outputs are tested in each session. These outputs are labeled ‘1024’, ‘256’, ‘256.1’, ‘ad1’ and ‘ad2’ according to the window length adopted in each BCC coding run. ‘1024’ indicates that 1024-sample windows are used in compression, while ‘256’ deploys uniform windows of 256 samples. The ‘.1’ in ‘256.1’ indicates that the ICTD estimation has a 0.1-sample resolution rather than the 1-sample resolution of the previous cases. Adaptive windowing has also been applied: ‘ad1’ and ‘ad2’ denote the corresponding outputs involving adaptive windowing with a higher and a lower energy threshold respectively. In addition, the anchor of each session, labeled ‘noisy’, is an amplitude shaped clip. Eight subjects are required to score each stimulus on the graphic terminal according to its temporal clarity. It is important to ensure that these BCC coded audio stimuli are played in the same manner as the original (reference) clip during the subjective tests, because a mismatch between audio channels and speakers can cause salient additional spatial differences that distract the subjects’ attention from the temporal fidelity of the stimuli.
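The sub-sample ICTD estimation behind the ‘256.1’ label can be pictured with a cross-correlation sketch. This is illustrative only: parabolic interpolation of the correlation peak is one common way to reach fractional-sample resolution, not necessarily the exact method used in this project, and the function and frame names are hypothetical.

```python
import numpy as np

def estimate_ictd(x, y, max_lag=32):
    """Estimate the inter-channel time difference between two channel
    frames from the peak of their cross-correlation, refined to
    sub-sample resolution by parabolic interpolation."""
    lags = np.arange(-max_lag, max_lag + 1)
    # Circular cross-correlation over the candidate lag range
    # (a simplification; a real codec would window the frames).
    xc = np.array([np.dot(x, np.roll(y, k)) for k in lags])
    i = int(np.argmax(xc))
    d = float(lags[i])                       # 1-sample resolution
    if 0 < i < len(xc) - 1:
        denom = xc[i - 1] - 2 * xc[i] + xc[i + 1]
        if denom != 0:
            d += 0.5 * (xc[i - 1] - xc[i + 1]) / denom
    return round(d * 10) / 10                # quantize to 0.1 sample
```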

Figure 6.13 shows the results of the MUSHRA tests on ‘dixieland’ and ‘circus’. As illustrated in this figure, BCC generally performs relatively well on ‘dixieland’, which does not contain many energy transients in its time sequence. However, the use of large windows (1024 samples) in compression dramatically degrades the temporal clarity of ‘circus’ due to its audio content: the original 6-Channel clip contains many claps and drumbeats where energy transients appear. Pre-echo distortion, a consequence of the frequency domain analysis in BCC, therefore becomes more salient when the signal is windowed into large frames. As the window length decreases, the quality of BCC coded ‘circus’ improves significantly.

Figure 6.13: Results of MUSHRA Tests on BCC Performance: Dixieland and Circus

As shown in Figure 6.13, ‘circus’ has the best perceptual quality when 256-sample windows are used uniformly in BCC. However, no significant improvement is observed for ‘dixieland’ when the window length is reduced.

Furthermore, none of the subjects described the quality of BCC coded ‘circus’ or ‘dixieland’ as ‘Excellent’. This may be due to the following two issues. Firstly, BCC coding, as explained in Chapter 4, does not deploy any psychoacoustic model to control quantization noise. Although the coding error of the reference channel is kept under control by the MP3 codec, it can still be amplified when the reference channel is used to reconstruct the others. In addition, noise may also arise in the quantization of the directional cues. The quality of the BCC coded multichannel audio has possibly been damaged by these noises, since they are not restrained by any psychoacoustic model in the BCC coding scheme. Secondly, only one single channel from the original clip, rather than the output of multichannel downmixing, is used as the reference channel of BCC coding in this project. This can cause insufficient temporal information to be delivered and recovered for the reconstruction of the other channels, especially when the original channels are not significantly cross-correlated. Introducing some other parameters, such as Inter-Channel Cross-correlation (ICC), may help to ease this problem. These two issues can be further investigated in future research.
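For reference, one common definition of the ICC cue in the BCC literature [113][114] is the peak of the normalized cross-correlation between two channels. A minimal sketch (function and frame names hypothetical):

```python
import numpy as np

def estimate_icc(x, y, max_lag=32):
    """Inter-Channel Cross-correlation: the maximum of the normalized
    cross-correlation between two channel frames over candidate lags."""
    num = max(abs(np.dot(x, np.roll(y, k)))
              for k in range(-max_lag, max_lag + 1))
    den = np.sqrt(np.dot(x, x) * np.dot(y, y))
    return num / den if den > 0 else 0.0
```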

The next subsection explains the MUSHRA tests on the HSR soundfield systems, followed by an analysis of the collected experimental results.

6.3.2 MUSHRA Tests on HSR Soundfield Systems

In this second part of the MUSHRA tests, the performance of various 1st order⁸ HSR soundfield systems is subjectively evaluated by eight listeners. The subjects are seated in the center of the anechoic chamber facing the right wall (cf. Figure 6.6). During the experiments, they are instructed to score the overall quality, including both spatial and temporal clarity, of the reconstructed soundfields (stimuli) against that of the original (reference). Head movement is prohibited while listening⁹. The following paragraphs explain the original soundfield (reference) and the stimuli in detail.

Original Soundfield (Reference)

The original soundfield is generated by playing 5.1-Channel DST quality audio through the speaker array mounted in the anechoic chamber. The correspondence between the six audio channels and the speakers is listed in Table 6.1. According to this table, speakers No.3 and No.4 are not assigned any channel, so they remain silent when the original soundfield is created. To provide a reference for this part of the MUSHRA tests, the original soundfield is regenerated during the evaluation using the same speaker array and the same channel-to-speaker correspondence.

As listed in Table 6.1, the center (C) channel of the DST audio is connected to the rear speaker No.7, which is not the typical allocation in 5.1 surround sound. Such an irregular arrangement prevents the results from being biased by the subjects’ familiarity with a 5.1 surround sound listening environment, because the subjects can thereby focus purely on comparing the quality of the reproduced soundfields with that of the original.

8. The order of a soundfield system is specified by the parameter l of the Fourier Bessel Coefficients of the target soundfield (cf. Chapter 5); l = 1 in this case.
9. Head movement can significantly affect the perception of a soundfield, especially the localization of auditory images [43]. Therefore, subjects are asked not to move their heads while listening.

Table 6.1: Correspondence Between DST Audio Channels and Speakers

DST Quality Audio (Channel)    Speakers (No.)
C                              7
LF                             2
RF                             6
LS                             1
RS                             8
LFE                            5

Three DST quality audio clips have been used to create the original soundfield independently. They are labeled ‘Vocal 1’, ‘Vocal 2’ and ‘Instruments’. The first two are pieces of vocal music with instrumental accompaniment, while ‘Instruments’ contains instrumental sound only. It is therefore an intention of this part of the MUSHRA tests to evaluate the performance of the various HSR soundfield systems over different types of music.

Reconstructed Soundfield (Stimuli)

The above three original soundfields are recorded with an HSR microphone array (cf. Figure 6.9), and then processed post-recording to generate speaker feeds according to the configuration of the microphone array and the speaker array. Additional BCC coding may be applied to compress these multiple speaker feeds. By varying the parameters F and μ in the post-recording analysis and synthesis, several multichannel speaker input clips (cf. Table 6.2) can be produced for each of the original soundfields (created by the DST quality audio clips). The stimuli for this part of the MUSHRA tests, the reconstructed soundfields, are thus provided by playing these different input clips through the same speaker array that was used in recording. To simplify the later results analysis, these reconstructed soundfields will be referred to by the labels of their corresponding speaker input clips in Table 6.2.

Table 6.2: Stimuli for MUSHRA Tests on HSR Soundfield Systems: Various Multichannel Speaker Input Clips

Speaker Input Clips (No.)    Variations in Speaker Feeds Generation (Compression, F, μ)
1                            Uncompressed, F = I¹⁰, μ = 0.4
2                            Uncompressed, F = InI¹¹, μ = 0.4
3                            Compressed by BCC, F = InI, μ = 0.4
4                            Uncompressed, F = InI, μ = 0.9
5                            Uncompressed, F = InI, μ = 0.1
6                            Uncompressed, Amplitude Shaped by 0.02
7                            Uncompressed, Imperfect Decoding Matrix

10. I denotes an identity matrix.
11. InI denotes an anti-diagonal matrix.

A more detailed explanation of the above speaker input clips can be found in Appendix C.2.

During this part of the MUSHRA tests, each DST audio clip and its various related speaker input clips are tested in the same session. Soundfield No.6 plays the anchor in each session. As in the previous part, the eight subjects are required to listen to all of the stimuli and the reference and become familiar with them before starting to score.

According to the overall quality, including both spatial and temporal fidelity, of each stimulus, subjects assign a score to the corresponding soundfield by adjusting the slider position on the CQS. It is important for each subject to grade a reconstructed soundfield consistently against the reference and the other stimuli. At the end of each session, the scores are collected by the terminal and transferred to the workstation for storage and further analysis.

The following paragraphs will present and discuss the experimental results collected for the HSR soundfield systems and the BCC embedded HSR soundfield systems in turn.



1. Results on HSR Soundfield Systems

Figure 6.14: Scores of MUSHRA Tests on HSR Soundfield Systems

Figure 6.14 shows the subjective evaluation results for the reconstructed soundfields, excluding soundfield No.3, whose generation involves the BCC multichannel compression algorithm. The mean value and standard deviation of the MUSHRA test scores are plotted in this figure against the different soundfields, showing the performance of the various HSR systems.

It is obvious from this figure that, for the three pieces of music ‘Vocal 1’, ‘Vocal 2’ and ‘Instruments’, HSR soundfield systems No.1 and No.2 perform best among all the systems. An average subjective grade between 60 and 80, corresponding to ‘Good’ performance, has been achieved by these two systems for all three original clips. This shows that the HSR soundfield system implemented in this project is capable of reproducing good soundfields using an irregular speaker array. However, the figure does not indicate which of these two systems, which apply different impositions onto the Fourier Bessel Coefficients (cf. Table 6.2 and Appendix C), performs better than the other, because the ordering of their mean scores is not consistent across the DST clips. This can be further investigated in the future.

Secondly, as shown in Figure 6.14, soundfield systems No.4 and No.5 score less than 40 on average, suggesting that they do not perform well in preserving overall quality. Moreover, several subjects advised the test conductor after the MUSHRA tests that some clips possessing the best spatial fidelity have the most temporal distortion, and vice versa. These observations confirm the role of the parameter μ as a tradeoff factor in the post-recording processing. The temporal fidelity of the reconstructed soundfield generally improves at the expense of its spatial quality as the value of μ decreases towards 0. Conversely, the reconstructed soundfield will have much better spatial quality when μ increases towards 1, compromising its temporal clarity. Therefore, the overall quality, as the combination of spatial and temporal quality, drops as μ is biased towards either 0 or 1. The results corroborate this.

As for soundfield system No.7, which deploys an imperfect decoding matrix for the post-recording synthesis, the average score is rather low, only slightly above that of the anchor (cf. Figure 6.14). This result is consistent with the deficiency of that decoding matrix and is within expectation.

Another point worth noticing in Figure 6.14 is the large standard deviation resulting from these tests. This is because the MUSHRA tests conducted in this project cannot separate the spatial quality assessment from the evaluation of temporal clarity. No suitable training materials are currently available to help subjects develop the ability to discriminate spatial and temporal information. Without suitable training, subjects may have diverse interpretations of the overall quality of soundfields and focus on different aspects of the stimuli. The large standard deviation is a consequence of this issue.

Figure 6.15: Number of Subjects Grading ‘Good’: Clips No.1, No.2, No.4 - No.7

From the music content perspective, the performance of the different HSR soundfield systems varies over the three DST audio clips. As shown in Figure 6.15, a majority of subjects, seven out of eight (87.5%), consider that vocal music accompanied by instrumental sound can be reproduced well using soundfield systems No.1 and No.2, while the purely instrumental clip receives fewer ‘Good’ ratings with these two systems. Similar results can be observed in Figure 6.14, where the overall quality of ‘Vocal 1’ and ‘Vocal 2’ in systems No.1 and No.2 is rated higher, on average, than that of ‘Instruments’. This phenomenon may be explained in future research.

2. Results on BCC Embedded HSR Soundfield Systems

In addition to the simple HSR soundfield systems, a BCC embedded HSR system, referred to as No.3, is also subjectively evaluated in this part of the MUSHRA tests. In such a system, the BCC coding algorithm is applied to compress the multichannel speaker feeds before soundfield reproduction. The results for this system are shown in Figure 6.16, along with those of systems No.1 and No.2.

Figure 6.16: Number of Subjects Grading ‘Good’: Clips No.1, No.2, No.3

As illustrated in this figure, more than half of the subjects consider that a soundfield of good overall quality can be reproduced for the clips ‘Vocal 1’ and ‘Instruments’ using the BCC embedded HSR system. However, ‘Vocal 2’ only scores three out of eight.

The lower rating for this clip may be explained by pre-echo distortion. Since adaptive windowing is not adopted for the BCC coding in this system, energy transients in the original ‘Vocal 2’ audio clip may cause many salient pre-echoes that consequently degrade the overall quality of the reconstructed soundfield.

When comparing the overall quality of soundfield No.3 with that of No.1 and No.2 in Figure 6.16, we can notice that the BCC embedded soundfield system generally performs worse than the other two systems, especially HSR soundfield system No.2. This means that the introduction of BCC coding for the speaker feeds can potentially damage the quality of the reconstructed soundfields.

Another point worth mentioning is that the BCC multichannel compression technique is applied to the speaker feeds in this project, which, to a certain extent, limits the compatibility of such a BCC embedded soundfield system. This is because speaker feeds that are generated for a particular speaker array, once compressed, cannot be transmitted and then used for another speaker array. In other words, speaker input signals are not the best transmission format for HSR soundfield systems in terms of compatibility. Neither are the microphone capsule signals, since quantization noise introduced by compressing the microphone signals can be amplified by the post-recording analysis. As mentioned in Section 5.4.1, the Fourier Bessel Coefficients may be the best option for transmission, and they can be compressed with a suitable algorithm before transmission. This issue can be further investigated in the future.

Chapter 7

Conclusion

7.1 Summary of This Project

Due to the influence of the fast-growing entertainment industry, the demand for faithful reconstruction of soundfields is currently booming in the consumer market. However, past and present soundfield systems, including Steinberg and Snow’s 3-Channel system, stereophony, quadraphony, Ambisonics and WFS, face two major issues. Firstly, the compatibility of these systems is limited by their pre-determined speaker array configurations. Secondly, the bandwidth required for multichannel audio representation scales proportionally with the increasing number of channels involved in these systems.

This thesis has looked into the above issues and reviewed the representation of soundfields from various perspectives. An extensive range of soundfield systems and multichannel audio compression techniques, including HSR and BCC, has also been investigated to facilitate comparison. Based on these reviews, 1st order HSR soundfield systems that are compatible with any speaker layout have been successfully implemented in this project. A framework, including post-recording analysis and synthesis, has been built to generate speaker feeds from microphone signals according to the configurations of the microphone array and the speaker array. BCC coding, the key algorithm in the recently ratified MPEG multichannel audio codec, has also been implemented, using CXC for ICTD estimation, and embedded in these HSR systems to address the bandwidth issue.

To evaluate the performance of these soundfield systems, subjective listening tests have been conducted in an anechoic chamber according to an ITU standard method, MUSHRA. The results confirm the role of the parameter μ as a tradeoff factor between the spatial quality and temporal clarity of an HSR soundfield system. The results also show that it is possible to achieve ‘Good’ overall quality reconstruction from an irregular speaker array using HSR soundfield systems. Furthermore, no significant improvement in quality has been observed for the BCC embedded HSR soundfield system; on the contrary, the introduction of BCC potentially damages the performance of HSR systems. In addition, the compatibility of HSR soundfield systems can be somewhat limited by applying BCC to compress their speaker feeds.

7.2 Future Work

The work in this dissertation can be further extended and strengthened by the following:

1. The MUSHRA tests conducted in this project cannot separate the spatial quality assessment from the evaluation of temporal clarity, because no suitable training materials are currently available to help subjects develop the ability to discriminate spatial and temporal information. Therefore, it is essential to design a set of subjective tests specifically for spatial quality assessment, as well as training materials that help subjects develop their ability to discriminate the spatial characteristics of a soundfield.

2. This project has confirmed the role of the parameter μ in HSR soundfield systems through MUSHRA tests. However, the effect of the matrix F is not apparent in the subjective test results, so further subjective tests can be carried out to investigate this issue.

3. Since the Fourier Bessel Coefficients may be a better format for transmission and storage, these coefficients, rather than the speaker feeds, can be compressed to reduce bandwidth in HSR soundfield systems. In this way, the compatibility of the soundfield systems is not limited by the introduction of multichannel audio coding algorithms. This is another topic that can be investigated in future research.

Appendix A

Spherical Solutions to the Wave Equation

A.1 Solving the Wave Equation in Spherical Coordinates

In the time domain, the wave equation in spherical coordinates is given by

$$\frac{\partial^2 p}{\partial r^2} + \frac{2}{r}\frac{\partial p}{\partial r} + \frac{1}{r^2\sin\theta}\frac{\partial}{\partial\theta}\left(\sin\theta\,\frac{\partial p}{\partial\theta}\right) + \frac{1}{r^2\sin^2\theta}\frac{\partial^2 p}{\partial\phi^2} = \frac{1}{c^2}\frac{\partial^2 p}{\partial t^2} \qquad (A.1)$$

Assume the solution to this equation is separable as shown in Equation A.2:

$$p(r,\theta,\phi,t) = R(r)\,U(\theta)\,V(\phi)\,T(t) \qquad (A.2)$$

The following four ordinary differential equations are obtained by substituting Equation A.2 into Equation A.1 and performing some manipulation:

$$\frac{1}{c^2 T}\frac{d^2 T}{dt^2} = -k^2, \qquad (A.3)$$

$$\frac{1}{V}\frac{d^2 V}{d\phi^2} = -m^2, \qquad (A.4)$$

$$\frac{1}{\sin\theta}\frac{d}{d\theta}\left(\sin\theta\,\frac{dU}{d\theta}\right) + \left[l(l+1) - \frac{m^2}{\sin^2\theta}\right]U = 0, \qquad (A.5)$$

$$\frac{1}{r^2}\frac{d}{dr}\left(r^2\,\frac{dR}{dr}\right) + \left[k^2 - \frac{l(l+1)}{r^2}\right]R = 0, \qquad (A.6)$$

where the separation constants k, l and m are introduced in deriving the above equations.

Therefore, the time (t), azimuthal angle (φ), elevation angle (θ) and radial (r) dependence of the solution to the wave equation in spherical coordinates can be obtained from Equations A.3-A.6 respectively.¹
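For reference, the standard forms of these separated solutions (cf. [37][38]) are a harmonic time dependence from Equation A.3 together with

$$V(\phi) \propto e^{\pm jm\phi}, \qquad U(\theta) \propto P_l^m(\cos\theta), \qquad R(r) \propto j_l(kr) \ \text{or} \ y_l(kr),$$

where $P_l^m$ is the associated Legendre function and $j_l$, $y_l$ are the spherical Bessel functions of the 1st and 2nd kind; the angular products $U(\theta)V(\phi)$ form the spherical harmonics $Y_l^m(\theta,\phi)$.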

A.2 Spherical Bessel Functions of the 1st Kind

A.2.1 Plot

Figure A.1: Various Bessel Functions of the First Kind

1. More details are given in [37][38].

A.2.2 Properties

Given n > 0, the spherical Bessel functions of the 1st kind have the following properties:

$$j_{n-1}(x) + j_{n+1}(x) = \frac{2n+1}{x}\,j_n(x); \qquad (A.7)$$

$$\int j_n^2(x)\,x^2\,dx = \frac{x^3}{2}\left[j_n^2(x) - j_{n-1}(x)\,j_{n+1}(x)\right]. \qquad (A.8)$$
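Both identities are standard. As a quick numerical sanity check of the recurrence in Equation A.7, assuming SciPy is available:

```python
import numpy as np
from scipy.special import spherical_jn

# Spot-check the recurrence (A.7) at a few sample points.
n = 3
x = np.linspace(0.5, 10.0, 5)
lhs = spherical_jn(n - 1, x) + spherical_jn(n + 1, x)
rhs = (2 * n + 1) / x * spherical_jn(n, x)
assert np.allclose(lhs, rhs)
```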

Appendix B

Information about Speaker and Wedge

B.1 GENELEC Loudspeakers (8130A)

Figure B.1: GENELEC Loudspeaker

B.2 Wedge ‘A’

Figure B.2: Side {A1,A2,A3} and Mounting Point

Figure B.3: Side {A0,A1,A3}

Figure B.4: Side {A2,A0,A3}

Appendix C

Stimuli of MUSHRA Tests on Soundfield Systems

C.1 Matrix F and Parameter μ

Matrix F

In this project, 1st order HSR soundfield systems have been implemented, so l = 1. Therefore, as explained in Section 5.4.2, the matrix F can be expressed in the following form:

$$\mathbf{F} = \begin{pmatrix} F_{1,0,0}(f) & F_{1,1,-1}(f) & F_{1,1,0}(f) & F_{1,1,1}(f) \\ F_{2,0,0}(f) & F_{2,1,-1}(f) & F_{2,1,0}(f) & F_{2,1,1}(f) \\ F_{3,0,0}(f) & F_{3,1,-1}(f) & F_{3,1,0}(f) & F_{3,1,1}(f) \\ F_{4,0,0}(f) & F_{4,1,-1}(f) & F_{4,1,0}(f) & F_{4,1,1}(f) \end{pmatrix} \qquad (C.1)$$

where the elements F_{k,l,m}(f) are either 0 or 1, specifying the imposed Fourier Bessel Coefficients. F_{k,l,m}(f) = 1 indicates that the mth lobe/direction of the lth order soundfield is faithfully reproduced for the kth speaker.
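As a concrete illustration, the two impositions used for speaker input clips No.1 and No.2 (cf. Table 6.2 and Section C.2) correspond to

$$\mathbf{F}_{\text{No.1}} = \mathbf{I} = \begin{pmatrix} 1&0&0&0 \\ 0&1&0&0 \\ 0&0&1&0 \\ 0&0&0&1 \end{pmatrix}, \qquad \mathbf{F}_{\text{No.2}} = \mathbf{InI} = \begin{pmatrix} 0&0&0&1 \\ 0&0&1&0 \\ 0&1&0&0 \\ 1&0&0&0 \end{pmatrix}.$$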

Parameter μ

The parameter μ specifies the accuracy of the remaining Fourier Bessel Coefficients (the rest of the soundfield) that are not imposed by the matrix F for faithful reproduction. μ = 1 suggests that these coefficients are also required to be faithfully reconstructed in addition to those imposed by F. On the other hand, only the coefficients specified by F will be accurately reproduced if μ = 0. The parameter μ is therefore a tradeoff factor between spatial precision and temporal fidelity.
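One plausible way to picture how F and μ enter the speaker feed computation is as a weighted least-squares problem. The sketch below is illustrative only (the actual formulation is given in Chapter 5), and all names are hypothetical: coefficients imposed by F receive full weight, while the remaining coefficients are weighted by μ.

```python
import numpy as np

def synthesize_feeds(C, b, f_mask, mu):
    """Solve for speaker feeds g so that the reproduced Fourier Bessel
    Coefficients C @ g approximate the targets b, with imposed
    coefficients (f_mask == 1) weighted fully and the rest by mu."""
    w = np.where(f_mask == 1, 1.0, mu)     # per-coefficient weights
    g, *_ = np.linalg.lstsq(w[:, None] * C, w * b, rcond=None)
    return g
```

With μ = 0 the non-imposed rows drop out of the fit entirely, while μ = 1 weights all coefficients equally, matching the two limiting cases described above.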

C.2 Variations in Stimuli

Clip No.1 is generated using an identity matrix F. Such an F suggests that Fourier Bessel Coefficients (0, 0) of the 1st speaker, (1, −1) of the 2nd speaker, (1, 0) of the 3rd speaker, and (1, 1) of the 4th speaker are faithfully reproduced. In addition, μ is assigned the value 0.4, which implies that spatial quality and temporal clarity are almost equally compromised.

Different from No.1, clip No.2 uses an anti-diagonal matrix to impose the faithful reproduction of Fourier Bessel Coefficients. As a result, coefficients (1, 1) of the 1st speaker, (1, 0) of the 2nd speaker, (1, −1) of the 3rd speaker, and (0, 0) of the 4th speaker are expected to be faithfully reproduced. The value of μ remains the same for this clip.

Clip No.3 is generated by applying an additional BCC codec to No.2. Its corresponding HSR soundfield system is therefore a BCC embedded system that involves multichannel audio compression.

For both clips No.4 and No.5, the matrix F remains the same anti-diagonal matrix as that of clip No.2, but the value of μ is 0.9 and 0.1 for clips No.4 and No.5 respectively. Since these values are close to the upper or lower limit of μ, No.4 and No.5 are expected to have either good spatial quality or good temporal fidelity.

Clip No.6 is generated by shaping the amplitude of the original signals by 0.02 in the time domain. The purpose of introducing No.6 is to set an anchor for this part of the MUSHRA tests.

Clip No.7 is simply obtained via an imperfect soundfield reconstruction approach, so no further explanation is given here.

Bibliography

[1] T. A. Edison, “Phonograph or Speaking Machine,” US Patent No. 200521, patented on Feb 19th, 1878.
[2] A. G. Bell, “Improvements on Electric Telephony,” Patent No. 7789, filed in 1877.
[3] J. Aldred, “Fifty years of sound,” American Cinematographer, pp.888-889 & 892-897, September 1981.
[4] T. Holman, Sound for Film and Television, Focal Press, Boston, 1997.
[5] J. Moir, High Quality Sound Reproduction, 2nd edition, Chapman & Hall, London, 1961.
[6] G. H. R. Taylor and D. L. Watson, “An Improvement in the Sound Quality of High Speed Duplicated Musicassettes,” synopsis of a lecture given at the 50th Annual Convention of the European Audio Engineering Society, Preprint No. L-22, London, March 1975.
[7] J. Moir, “Phase and Sound Quality,” presented at the 50th Annual Convention of the European Audio Engineering Society, Preprint No. L-9, London, March 1975.
[8] J. Blauert, Spatial Hearing: The Psychophysics of Human Sound Localization, MIT Press, Cambridge, MA, 1983.
[9] R. M. Warren, Auditory Perception: A New Analysis and Synthesis, Cambridge University Press, Cambridge, UK, 1999.
[10] W. M. Hartmann, “Listening in a Room and the Precedence Effect,” in Binaural and Spatial Hearing in Real and Virtual Auditory Environments, R. H. Gilkey and T. R. Anderson, Eds., pp.191-210, Lawrence Erlbaum Associates, Mahwah, New Jersey, 1997.
[11] L. R. Bernstein, “Detection and Discrimination of Interaural Disparities: Modern Earphone-Based Studies,” in Binaural and Spatial Hearing in Real and Virtual Auditory Environments, R. H. Gilkey and T. R. Anderson, Eds., pp.117-138, Lawrence Erlbaum Associates, Mahwah, New Jersey, 1997.
[12] H. F. Olson and H. Belar, “Acoustics of Sound Reproduction in the Home,” Journal of the Audio Engineering Society, vol.8, no.1, pp.7-11, January 1960.
[13] C. Kyriakakis, “Virtual Loudspeakers and Virtual Microphones for Multichannel Audio,” in IEEE International Conference on Consumer Electronics, Los Angeles, June 2000.
[14] K. M. Stanney, Ed., Handbook of Virtual Environments: Design, Implementation, and Applications, Lawrence Erlbaum Associates Inc., Mahwah, New Jersey, 2002.
[15] R. J. Ellis-Geiger, “Designing Surround Sound Facilities for Higher Education,” in the 19th AES International Conference on Surround Sound - Techniques, Technology, and Perception, Preprint No. 1887, Schloss Elmau, Germany, June 2001.
[16] M. O’Dwyer, G. Potard and I. Burnett, “A 16-Speaker 3D Audio-Visual Display Interface and Control System,” in Proceedings of the 2004 International Conference on Auditory Displays, Sydney, Australia, July 2004.
[17] W. Woszczyk, S. Bech and V. Hansen, “Interactions Between Audio-Visual Factors in a Home Theater System: Definition of Subjective Attributes,” in Proceedings of the Audio Engineering Society 99th International Convention, Preprint No. 4133, New York, USA, September 1995.
[18] S. Bech, V. Hansen and W. Woszczyk, “Interactions Between Audio-Visual Factors in a Home Theater System: Experimental Results,” in Proceedings of the Audio Engineering Society 99th International Convention, Preprint No. 4096, New York, USA, September 1995.
[19] J. H. Snyder, “The Quorum™ Teleconferencing Microphone,” in Proceedings of the Acoustical Society of America Meeting, vol.74, November 1983.
[20] L. Haddon, “Interactive Games,” in Future Visions: New Technologies on the Screen, P. Hayward and T. Wollon, Eds., British Film Institute Publishing, London, 1993.
[21] M. Kleiner et al., “Emerging Technology Trends in the Areas of the Technical Committees of the Audio Engineering Society,” Journal of the Audio Engineering Society, vol.51, no.5, pp.442-452, May 2003.
[22] D. R. Begault, 3-D Sound for Virtual Reality and Multimedia, Academic Press Professional Inc., Cambridge, MA, 1994.
[23] L. A. Werner and G. C. Marean, Human Auditory Development, Westview Press, Boulder, CO, 1996.
[24] The next Audio Engineering Society (AES) Convention will be held in Austria on May 5-8, 2007.
[25] D. J. Furlong, “Comparative Study of Effective Soundfield Reconstruction,” in the 87th Convention of the Audio Engineering Society, Preprint No. 2842, New York, October 1989.
[26] A. Persterer, “Binaural Simulation of an ‘Ideal Control Room’ for Headphone Reproduction,” at the 90th Convention of the Audio Engineering Society, Preprint No. 3062, January 1991.
[27] D. H. Cooper, “Comments on the distinction between stereophonic and binaural sound,” Journal of the Audio Engineering Society, vol.39, pp.261-266, 1991.
[28] D. H. Cooper and J. L. Bauck, “Prospects for Transaural Recording,” Journal of the Audio Engineering Society, vol.37, pp.3-19, 1989.
[29] J. D. Johnston and Y. H. Lam, “Perceptual Soundfield Reconstruction,” in the 109th Convention of the Audio Engineering Society, Los Angeles, CA, September 2000.
[30] R. Nicol and M. Emerit, “3D-Sound Reproduction Over an Extensive Listening Area: A Hybrid Method Derived from Holophony and Ambisonics,” in AES 16th International Conference on Spatial Sound Reproduction, Helsinki, March 1999.
[31] M. Poletti, “The Design of Encoding Functions for Stereophonic and Polyphonic Sound Systems,” Journal of the Audio Engineering Society, vol.44, no.11, pp.948-963, November 1996.
[32] J. Markoff, “Business Technology: A Battle for Influence Over Insatiable Disks,” New York Times, January 11, 1995.
[33] M. A. Gerzon, “Recording Concert Hall Acoustics for Posterity,” Journal of the Audio Engineering Society, vol.23, no.7, pp.569 & 571, 1975.
[34] A. Farina and R. Ayalon, “Recording Concert Hall Acoustics for Posterity,” in AES 24th International Conference on Surround Sound: Techniques, Technology and Perception, Banff, Canada, June 2003.
[35] R. Bruno, A. Laborie and S. Montoya, “Reproducing Multichannel Sound on any Speaker Layout,” in the 118th Audio Engineering Society Convention, Preprint No. 6375, Barcelona, Spain, May 2005.
[36] Y. Wang, M. Vilermo and L. Yaroslavsky, “A Multichannel Audio Coding Algorithm for Inter-Channel Redundancy Removal,” in the 110th Audio Engineering Society Convention, Preprint No. 5295, Amsterdam, Netherlands, May 2001.
[37] E. Skudrzyk, The Foundations of Acoustics: Basic Mathematics and Basic Acoustics, Springer-Verlag, Wien, New York, 1971.
[38] E. G. Williams, Fourier Acoustics: Sound Radiation and Nearfield Acoustical Holography, Academic Press, London, UK, 1999.
[39] N. N. Lebedev, Special Functions and Their Applications, translated from the Russian by R. A. Silverman, Prentice-Hall Inc., Englewood Cliffs, New Jersey, 1965.
[40] S. Gelfand, Hearing: An Introduction to Psychological and Physiological Acoustics, 3rd edition, Dekker, New York, USA, 1998.
[41] E. Zwicker and H. Fastl, Psychoacoustics: Facts and Models, Springer-Verlag, Berlin, 1990.
[42] H. Lungwitz, “The localization of acoustical objects,”¹ in Textbook of Psychobiology, Walter de Gruyter, Berlin.
[43] J. Blauert, “Investigations of Directional Hearing in the Median Plane with the Head Immobilized,”² Dissertation, Technische Hochschule, Aachen, 1969.
[44] L. Rayleigh, “On Our Perception of Sound Direction,” Philosophical Magazine, vol.13, pp.214-232, 1907.

1. Original document is in German.
2. Original document is in German.

[45] W. Gaik, “Combined Evaluation of Interaural Time and Intensity Differences: Psychoacoustic Results and Computer Modeling,” Journal of the Acoustical Society of America, vol.94, no.1, pp.98-110, July 1993.
[46] R. I. Chernyak and N. A. Dubrovsky, “Pattern of the noise images and the binaural summation of loudness for the different interaural correlation of noise,” in Proceedings of the 6th International Congress on Acoustics, vol.1, pp.A53-A56³, Tokyo, 1968.
[47] M. B. Gardner, “Historical Background of the Haas and/or Precedence Effect,” Journal of the Acoustical Society of America, vol.43, pp.1243-1248, 1968.
[48] J. Henry, “Presentation before the American Association of Advanced Sciences on the 21st of August,” in Scientific Writings of Joseph Henry, part II, pp.295-296, Smithsonian Institution, Washington DC, 1849.
[49] G. von Békésy, “Auditory Backward Inhibition in Concert Halls,” Science, vol.171, pp.529-536, 1971.
[50] J. Hull, “Surround Sound Past, Present and Future,” Dolby Laboratories Licensing Corporation, 1994.
[51] H. Fletcher, “Auditory Perspective - Basic Requirements,” Electrical Engineering, vol.53, pp.9-11, January 1934.
[52] J. C. Steinberg and W. B. Snow, “Auditory Perspective - Physical Factors,” Electrical Engineering, vol.53, pp.12-17, January 1934.
[53] E. C. Wente and A. L. Thuras, “Auditory Perspective - Loud Speakers and Microphones,” Electrical Engineering, vol.53, pp.17-24, January 1934.
[54] E. O. Scriven, “Auditory Perspective - Amplifiers,” Electrical Engineering, vol.53, pp.25-28, January 1934.

3. cf. [8] Fig. 3.24.

[55] H. A. Affel, R. W. Chesnut and R. H. Mills, “Auditory Perspective - Transmission Lines,” Electrical Engineering, vol.53, pp.28-32 & 214-216, January 1934.
[56] E. H. Bedell and I. Kerney, “Auditory Perspective - System Adaptation,” Electrical Engineering, vol.53, pp.216-219, January 1934.
[57] A. D. Blumlein, “Improvements in and relating to Sound-transmission, Sound-recording and Sound-reproducing Systems,” British Patent No. 394,325, 1931.
[58] Y. H. Pao and V. Varatharajulu, “Huygens’ Principle, Radiation Conditions and Integral Formulas for the Scattering of Elastic Waves,” Journal of the Acoustical Society of America, vol.59, no.6, pp.1361-1371, June 1976.
[59] D. G. Malham and A. Myatt, “3-D Sound Spatialization Using Ambisonic Techniques,” Computer Music Journal, vol.19, no.4, pp.58-70, 1995.
[60] H. A. M. Clark, G. F. Dutton and P. B. Vanderlyn, “The ‘Stereosonic’ Recording and Reproducing System,” Institute of Radio Engineers Transactions on Audio, vol.AU-5, pp.96-111, 1957.
[61] H. Robjohns, “Stereo Microphone Technique Explained: Part 1,” Sound on Sound, February 1997.
[62] M. Gerzon, “Ultra-Directional Microphones: Applications of Blumlein Difference Technique: Part 1,” Studio Sound, vol.12, pp.434-437, October 1970.
[63] M. Gerzon, “Ultra-Directional Microphones: Applications of Blumlein Difference Technique: Part 2,” Studio Sound, vol.12, pp.501-504, November 1970.
[64] M. Gerzon, “Ultra-Directional Microphones: Applications of Blumlein Difference Technique: Part 3,” Studio Sound, vol.12, pp.539-543, December 1970.
[65] C. Hugonnet and P. Walder, Stereophonic Sound Recording: Theory and Practice, translated by Patrick R. W. Roe, Wiley, Chichester, W. Sussex, England, 1998.
[66] J. M. Eargle, “Stereophonic Localization: An Analysis of Listener Reactions to Current Techniques,” Institute of Radio Engineers Transactions on Audio, vol.AU-8, pp.174-178, 1960.
[67] F. Rumsey and T. McCormick, Sound & Recording - An Introduction, Focal Press, Oxford, 1994.
[68] R. W. Benson, “Enhanced Stereo,” Institute of Radio Engineers Transactions on Audio, vol.MJ-5, pp.63-65, 1961.
[69] L. Blake, “Mixing Dolby Stereo Film Sound,” Recording Engineer/Producer, vol.12, no.1, February 1981.
[70] D. H. Cooper and T. Shiga, “Discrete-Matrix Multichannel Stereo,” Journal of the Audio Engineering Society, vol.20, no.5, pp.346-360, June 1972.
[71] J. G. Woodward, “Quadraphony - A Review,” Journal of the Audio Engineering Society, vol.25, no.10/11, pp.843-854, Oct/Nov 1977.
[72] P. A. Ratcliff and D. J. Meares, “BBC Matrix H: Compatible System for Broadcasting,” Wireless World, March 1974.⁴
[73] M. A. Gerzon, “What’s wrong with Quadraphonics,” Studio Sound, vol.16, no.5, pp.50-51 & 56, May 1974.
[74] P. Fellgett, “Ambisonics. Part One: General System Description,” Studio Sound, vol.17, no.8, pp.20-22 & 40, August 1975.
[75] M. A. Gerzon, “Ambisonics. Part Two: Studio Techniques,” Studio Sound, vol.17, no.8, pp.24-26, 28 & 30, August 1975.
[76] P. G. Craven and M. A. Gerzon, “Coincident microphone simulation covering three dimensional space and yielding various directional outputs,” US Patent No. 4042779, July 1975.

4. It can also be found in British Patent No. 1514162.

[77] SoundField Ltd., The Surround Zone, Surround Sound and Stereo Production Solutions (User Manual Version 1.0), retrieved from http://www.soundfield.com/downloads/surroundzone userguide.pdf, in March 2007.
[78] International Telecommunications Union, Multichannel stereophonic sound system with and without accompanying picture, ITU Recommendation BS.775-1, approved in July 1994.
[79] M. A. Gerzon and G. J. Barton, “Ambisonic Decoders for HDTV,” at the 92nd Audio Engineering Society Convention, Preprint No. 3345, Vienna, March 1992.
[80] A. J. Berkhout, “A Holographic Approach to Acoustic Control,” Journal of the Audio Engineering Society, vol.36, no.12, pp.977-995, December 1988.
[81] M. M. Boone, “Acoustic Rendering with Wave Field Synthesis,” ACM SIGGraph and EUROGraphics Campfire: Acoustic Rendering for Virtual Environments, Snowbird, Utah, May 2001.
[82] S. Spors, H. Teutsch and R. Rabenstein, “High-Quality Acoustic Rendering with Wave Field Synthesis,” Vision, Modeling and Visualization, pp.101-108, November 2002.
[83] R. Glasgal, “Ambiophonics: The Synthesis of Concert-Hall Sound Fields in the Home,” in the 99th Audio Engineering Society Convention, Preprint No. 4113, New York, October 1995.
[84] A. Farina, R. Glasgal, E. Armelloni and A. Torger, “Ambiophonic Principles for the Recording and Reproduction of Surround Sound for Music,” in Proceedings of the 19th Audio Engineering Society Conference on Surround Sound, Techniques, Technology and Perception, Schloss Elmau, Germany, June 2001.
[85] V. Pulkki, “Virtual Sound Source Positioning Using Vector Base Amplitude Panning,” Journal of the Audio Engineering Society, vol.45, no.6, pp.456-466, 1997.
[86] P. G. Craven and M. A. Gerzon, “Lossless Coding for Audio Discs,” Journal of the Audio Engineering Society, vol.44, no.9, pp.706-720, September 1996.
[87] D. Reefman and E. Janssen, “One-bit audio: an overview,” Journal of the Audio Engineering Society, vol.52, no.2, February 2004.
[88] E. Janssen, E. Knapen, D. Reefman and R. Bruekers, “Lossless compression of one-bit audio,” in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Montreal, May 2004.
[89] ISO/IEC, Coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s - Part 3: Audio, ISO/IEC 11172-3 International Standard, ISO/IEC JTC1/SC29/WG11, 1993.
[90] ISO/IEC, Generic coding of moving pictures and associated audio information - Part 7: Advanced Audio Coding, ISO/IEC 13818-7 International Standard, ISO/IEC JTC1/SC29/WG11, 1997.
[91] M. F. Davis, “The AC-3 Multichannel Coder,” at the 95th Audio Engineering Society Convention, Preprint No. 3774, New York, October 1993.
[92] H. Malvar, Signal Processing with Lapped Transforms, Artech House, Boston, MA, 1992.
[93] K. Sayood, Introduction to Data Compression, Morgan Kaufmann, San Francisco, California, 1996.
[94] K. Brandenburg and G. Stoll, “The ISO/MPEG-Audio Codec: A Generic Standard for Coding of High Quality Digital Audio,” in the 92nd Audio Engineering Society Convention, Preprint No. 3336, Vienna, 1992.
[95] International Telecommunications Union, Low Bitrate Audio Coding, ITU Recommendation BS.1115, Geneva, Switzerland, 1994.
[96] R. Buchta, S. Meltzer and O. Kunz, “The WorldStar™ Sound Format,” at the 101st Audio Engineering Society Convention, Preprint No. 4385, Los Angeles, California, November 1996.
[97] ISO/IEC, Generic coding of moving pictures and associated audio information - Part 3: Audio, ISO/IEC 13818-3 International Standard, ISO/IEC JTC1/SC29/WG11, 1994.
[98] T. Painter and A. Spanias, “Perceptual coding of digital audio,” in Proceedings of the IEEE, vol.88, no.4, pp.451-513, 2000.
[99] M. Bosi et al., “ISO/IEC MPEG-2 Advanced Audio Coding,” Journal of the Audio Engineering Society, vol.45, no.10, pp.789-814, October 1997.
[100] L. D. Fielder et al., “AC-2 and AC-3: Low Complexity Transform-Based Audio Coding,” in Collected Papers on Digital Audio Bit-Rate Reduction for Audio Engineering Society Conference, N. Gilchrist and C. Grewin, Eds., pp.54-72, 1996.
[101] J. Herre and J. Johnston, “Enhancing the Performance of Perceptual Audio Coders by Using Temporal Noise Shaping (TNS),” in Proceedings of the 101st Audio Engineering Society Convention, Preprint No. 4384, 1996.
[102] ISO/IEC, Overview of the Report on the Formal Subjective Listening Tests of MPEG-2 AAC Multichannel Audio Coding, ISO/IEC JTC1/SC29/WG11 N1420, November 1996.
[103] C. Todd et al., “AC-3: Flexible Perceptual Coding for Audio Transmission and Storage,” in Proceedings of the 96th Audio Engineering Society Convention, Preprint No. 3796, February 1994.
[104] G. Davidson, “Digital Audio Coding: Dolby AC-3,” in The Digital Signal Processing Handbook, V. Madisetti and D. Williams, Eds., CRC Press, Boca Raton, FL, pp.41.1-41.21, 1998.
[105] International Telecommunications Union, Low Bitrate Multichannel Audio Coder Test Results, ITU Recommendation Document 10/51-E, Geneva, Switzerland, May 1995.
[106] U.S. Advanced Television Systems Committee (ATSC), “Digital Audio Compression (AC-3) Standard,” Document A/52/10, December 1995.
[107] K. Brandenburg and M. Bosi, “Overview of MPEG Audio: Current and Future Standards for Low-Bit-Rate Audio Coding,” Journal of the Audio Engineering Society, vol.45, no.1/2, pp.4-21, 1997.
[108] D. Thom, H. Purnhagen and the MPEG Audio Subgroup, MPEG Audio FAQ Version 9: MPEG-2 AAC, ISO/IEC JTC1/SC29/WG11 N2431, Atlantic City, October 1998.
[109] C. Faller and F. Baumgarte, “Binaural Cue Coding: A novel and efficient representation of spatial audio,” in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, vol.2, pp.1841-1844, Orlando, FL, USA, May 2002.
[110] A. Laborie, R. Bruno and S. Montoya, “A New Comprehensive Approach of Surround Sound Recording,” in Proceedings of the 114th Audio Engineering Society Convention, Preprint No. 5717, Amsterdam, Netherlands, March 2003.
[111] ISO/IEC, Parametric Coding for High Quality Audio, ISO/IEC 14496-3 International Standard, ISO/IEC JTC1/SC29/WG11, 2001/AMD2:2004.
[112] C. Faller and F. Baumgarte, “Binaural Cue Coding Applied to Stereo and Multichannel Audio Compression,” in the 112th Audio Engineering Society Convention, Preprint No. 5574, Munich, Germany, May 2002.
[113] F. Baumgarte and C. Faller, “Binaural Cue Coding - Part I: Psychoacoustic Fundamentals and Design Principles,” IEEE Transactions on Speech and Audio Processing, vol.11, no.6, pp.509-519, November 2003.
[114] C. Faller and F. Baumgarte, “Binaural Cue Coding - Part II: Schemes and Applications,” IEEE Transactions on Speech and Audio Processing, vol.11, no.6, pp.520-531, November 2003.
[115] J. V. Tobias, Ed., Foundations of Modern Auditory Theory, vol.1, Academic Press, New York, 1970.
[116] LAME codec, retrieved from http://lame.sourceforge.net/index.php, last visited in May 2006.
[117] S. Wang, D. Sen and W. Lu, “Subband Analysis of Time Delay Estimation in STFT Domain,” in Proceedings of the Eleventh Australian International Conference on Speech Science and Technology, pp.211-215, Auckland, New Zealand, 2006.
[118] H. Robjohns, “A Brief History of Microphones,” Microphone Data Book, 2001.
[119] J. M. Woram, Sound Recording Handbook, 1st edition, Howard W. Sams & Company, Indianapolis, 1989.
[120] J. Wuttke, “General Considerations on Audio Multichannel Recording,” in Proceedings of the 19th International Conference of the Audio Engineering Society, Schloss Elmau, Germany, June 2001.
[121] W. Woszczyk, “A Review of Microphone Techniques Optimized for Spatial Control of Sound in Television,” in Proceedings of the 9th International Conference of the Audio Engineering Society: Television Sound Today and Tomorrow, Detroit, Michigan, February 1991.
[122] W. Woszczyk, “Microphone Arrays Optimized for Music Recording,” Journal of the Audio Engineering Society, vol.40, no.11, pp.926-933, November 1992.
[123] G. Thiele, “Multichannel Natural Music Recording Based on Psychoacoustic Principles,” in Proceedings of the 19th International Conference of the Audio Engineering Society on Surround Sound, Schloss Elmau, Germany, June 2001.
[124] U. Herrmann, V. Henkels and D. Braun, “Comparison of 5 Surround Microphone Methods,”⁵ in Proceedings of the 20th Tonmeistertagung, pp.508-517, 1998.
[125] M. Williams and G. Le Du, “Multichannel Microphone Array Design,” at the 108th Audio Engineering Society Convention, Preprint No. 5157, Paris, February 2000.
[126] A. Laborie, R. Bruno and S. Montoya, “High Spatial Resolution Multichannel Recording,” in Proceedings of the 116th Audio Engineering Society Convention, Preprint No. 6116, Berlin, Germany, May 2004.
[127] A. Laborie, R. Bruno and S. Montoya, “Reproducing Multichannel Sound on Any Speaker Layout,” in Proceedings of the 118th Audio Engineering Society Convention, Preprint No. 6375, Barcelona, Spain, May 2005.
[128] A. Farina, “Simultaneous Measurement of Impulse Response and Distortion with a Swept-Sine Technique,” in the 108th Audio Engineering Society Convention, Preprint No. 5093, Paris, France, February 2000.
[129] J. O. Smith III, Mathematics of the Discrete Fourier Transform (DFT), with Music and Audio Applications, W3K Publishing, Menlo Park, California, 2003.
[130] J. S. Walker, Fourier Analysis, Oxford University Press, Oxford, 1988.
[131] J. W. Gibbs, “Fourier Series,” Nature, vol.59, pp.200 & 606, 1898-1899.
[132] B. Edler, “Codierung von Audiosignalen mit überlappender Transformation und adaptiven Fensterfunktionen” (Coding of audio signals with overlapping transforms and adaptive window functions), Frequenz, pp.252-256, 1989. Cited by [98].

5. Original document is in German.

[133] ISO, Acoustics - Determination of Sound Power Levels of Noise Sources Using Sound Pressure - Precision Methods for Anechoic and Hemi-anechoic Rooms, ISO 3745, second edition, Geneva, Switzerland, 2003.
[134] Q. Meng, Qualification of an Anechoic Chamber, Project Report for Master of Engineering Science Degree, University of New South Wales, 2007.
[135] RME Intelligent Audio Solutions, User’s Guide, retrieved from http://rme-audio.co/english/download/mface e.pdf, in September 2006.
[136] RME Intelligent Audio Solutions, http://www.rme-audio.com/english, last visited in March 2007.
[137] International Telecommunications Union, Method for the Subjective Assessment of Intermediate Sound Quality (MUSHRA), ITU-R Recommendation BS.1534-1, Geneva, Switzerland, 2001.