Audiovisual Spatial Congruence, and Applications to 3D Sound and Stereoscopic Video
Total Page:16
File Type:pdf, Size:1020Kb
Audiovisual spatial congruence, and applications to 3D sound and stereoscopic video. C´edricR. Andr´e Department of Electrical Engineering and Computer Science Faculty of Applied Sciences University of Li`ege,Belgium Thesis submitted in partial fulfilment of the requirements for the degree of Doctor of Philosophy (PhD) in Engineering Sciences December 2013 This page intentionally left blank. ©University of Li`ege,Belgium This page intentionally left blank. Abstract While 3D cinema is becoming increasingly established, little effort has fo- cused on the general problem of producing a 3D sound scene spatially coher- ent with the visual content of a stereoscopic-3D (s-3D) movie. The percep- tual relevance of such spatial audiovisual coherence is of significant interest. In this thesis, we investigate the possibility of adding spatially accurate sound rendering to regular s-3D cinema. Our goal is to provide a perceptually matched sound source at the position of every object producing sound in the visual scene. We examine and contribute to the understanding of the usefulness and the feasibility of this combination. By usefulness, we mean that the technology should positively contribute to the experience, and in particular to the storytelling. In order to carry out experiments proving the usefulness, it is necessary to have an appropriate s-3D movie and its corresponding 3D audio soundtrack. We first present the procedure followed to obtain this joint 3D video and audio content from an existing animated s-3D movie, problems encountered, and some of the solutions employed. Second, as s-3D cinema aims at providing the spectator with a strong impression of being part of the movie (sense of presence), we investigate the impact of the spatial rendering quality of the soundtrack on the reported sense of presence. The short 3D audiovisual content is pre- sented with three different soundtracks. These soundtracks differ by their spatial rendering quality, from stereo (low spatial coherence) to Wave Field Synthesis (WFS, high spatial coherence). The original stereo version serves as a reference. Results show that the sound condition does not impact on the sense of presence of all participants. However, participants can be classi- fied according to three different levels of presence sensitivity with the sound condition impacting only on the highest level (12 out of 33 participants). Within this group, the spatially coherent soundtrack provides a lower re- ported sense of presence than the other custom soundtrack. The analysis of the participants’ heart rate variability (HRV) shows that the frequency- domain parameters correlate to the reported presence scores. ABSTRACT By feasibility, we mean that a large portion of the spectators in the audience should benefit from this new technology. In this thesis, we explain why the combination of accurate sound positioning and stereoscopic-3D images can lead to an incongruence between the sound and the image for multiple spectators. Then, we adapt to s-3D viewing a method originally proposed for 2D images in the literature to reduce this error. Finally, a subjective experiment is carried out to prove the efficiency of the method. In this experiment, an angular error between an s-3D video and a spatially accurate sound reproduced through WFS is simulated. The psychometric curve is measured with the method of constant stimuli, and the threshold for bimodal integration is estimated. The impact of the presence of background noise is also investigated. A comparison is made between the case without any background noise and the case with an SNR of 4 dBA. Estimates of the thresholds and the slopes, as well as their confidence intervals, are obtained for each level of background noise. When background noise is present, the point of subjective equality (PSE) is higher (19.4◦ instead of 18.3◦) and the slope is steeper ( 0.077 instead of 0.062 per degree). Because of the overlap − − between the confidence intervals, however, it is not possible to statistically differentiate between the two levels of noise. The implications for the sound reproduction in a cinema theater are discussed. vi Acknowledgements I am truly thankful to my two supervisors Jacques G. Verly and Jean-Jacques Embrechts at the University of Li`egefor letting me pursue this research on such a timely topic. I would like to show my deepest gratitude to Brian F.G. Katz, at LIMSI- CNRS, for welcoming me in his lab and provide me with his constant support. Credit for many aspects of the results in this work goes to him as well as Marc R´ebillat,formerly at LIMSI-CNRS, now at Universit´eParis Descartes, and Etienne´ Corteel, at sonic emotion labs, for their fruitful collaboration on the experiments in this thesis and their work on the SMART-I2 platform. I would also like to thank Matthieu Courgeon, at LIMSI-CNRS, for his work on the software tool MARC and its integration in the SMART-I2. The open source movie “Elephants Dream” was a very important part of this doctoral work. I would like to acknowledge Wolfgang Draxinger who provided me with the sources of the stereoscopic version of the movie and Marc Evrard´ for his work on the creation of a 3D soundtrack for the movie. This dissertation would not have been possible without the financial support of the University of Li`ege,under the form of a salary for a teaching assistant position and two financial aids for my research stays at the LIMSI-CNRS, in France. Finally, I also wish to thank all the colleagues, the friends and my fam- ily who supported me along the way. In particular, I thank Pierre T. and G´eraldinewith whom I shared an office, the whole Audio & Acoustics team at LIMSI-CNRS for the very good time I spent there, Gr´egoire,Pierre P., Fran¸cois,Th´er`ese,and all the others with whom I enjoyed spending numer- ous lunchtimes, and all the PhD students I met during the various events of the PhD Student Network I had the chance to organize and manage. A special thanks goes to my mother whose countless sacrifices led me to where I am today. This page intentionally left blank. Contents Abstractv Acknowledgements vii Contents ix Nomenclature xiii 1 Introduction1 1.1 Motivation for the thesis . .2 1.2 Objectives of the thesis . .3 1.3 Hardware and software considerations . .4 1.4 Review of the key concepts . .5 1.5 Contributions of the thesis . .5 1.6 Organization of the manuscript . .6 I State of the art9 2 Overview of auditory, visual, and auditory-visual perception 11 2.1 Overview of auditory spatial perception . 12 2.2 Overview of visual spatial perception . 20 2.3 Overview of auditory-visual perception . 28 3 Spatial sound reproduction 49 3.1 Two-channel sound reproduction . 50 3.2 A brief history of sound reproduction in movie theaters . 55 3.3 3D audio technologies . 62 4 Principles and limitations of stereoscopic-3D visualization 89 4.1 Principles of stereoscopic-3D imaging . 90 4.2 Limitations of stereoscopic-3D imaging . 94 CONTENTS 5 Combining 3D sound with stereoscopic-3D images on a single platform 111 5.1 New challenges of stereoscopic-3D cinema . 112 5.2 The SMART-I2 ............................. 115 II Contributions of the thesis 123 6 The making of a 3D audiovisual content 125 6.1 Introduction . 126 6.2 Selecting a movie . 127 6.3 Precise sound positioning in the SMART-I2 .............. 128 6.4 Experimental setup . 141 6.5 Creation of the audio track . 142 6.6 Discussion . 146 6.7 Conclusion . 149 7 Impact of 3D sound on the reported sense of presence 151 7.1 Introduction . 152 7.2 Soundtracks . 153 7.3 Method . 156 7.4 Results from post-session questionnaires . 163 7.5 Results from Heart Rate Variability . 169 7.6 Results from the participants’ feedback . 173 7.7 Conclusions . 174 8 Subjective Evaluation of the Audiovisual Spatial Congruence in the Case of Stereoscopic-3D Video and Wave Field Synthesis 179 8.1 Introduction . 180 8.2 Method . 185 8.3 Results . 191 8.4 Discussion . 196 8.5 Conclusion . 201 9 Conclusions and perspectives 207 9.1 Conclusions . 207 9.2 Perspectives . 209 x CONTENTS III Appendices 217 A List of utilized software tools 219 A.1 Open source software tools . 219 A.2 Proprietary software tools . 220 B Other limitations of s-3D imaging 223 B.1 Crosstalk . 223 B.2 Flicker . 225 C An overview of the OSC protocol 227 C.1 Basics . 227 C.2 Implementations . 228 D Affine geometry, homogeneous coordinates, and rigid transforma- tions 229 D.1 A brief introduction to affine geometry . 230 D.2 Representation and manipulation of 3D coordinates . 233 E Post-session presence questionnaire 237 E.1 Presentation of the questions . 238 E.2 Temple Presence Inventory . 239 E.3 Swedish viewer-user presence questionnaire . 242 E.4 Bouvier’s PhD thesis . 243 E.5 Additional questions . 244 F Analysis of variance and multiple imputation 247 F.1 Point and variance estimates . 248 F.2 Analysis of variance . 248 G The chi–squared test for differences between proportions 253 List of Figures 257 List of Tables 261 xi This page intentionally left blank. Nomenclature Roman Symbols c Speed of sound (340 m/s) A, B, X,... A number in R (possibly a coordinate) f Frequency (temporal) (Hz) 1 k Wavenumber (k = ω/c) (m− ) M Order of a spherical (or cylindrical) wave expansion a, b, x,... A point in R3 Ax The point x expressed in the coordinate system (A) B AR The rotation matrix describing the rotation from the coordinate system (A) to the coordinate system (B) u, v,... A vector in R3, or a point or a vector in homogeneous coordinates −→ab, ab, or b a The vector linking a and b in R3 − Greek Symbols ω Angular frequency