3-D Audio Using Loudspeakers
Total Page:16
File Type:pdf, Size:1020Kb
3-D Audio Using Loudspeakers William G. Gardner B. S., Computer Science and Engineering, Massachusetts Institute of Technology, 1982 M. S., Media Arts and Sciences, Massachusetts Institute of Technology, 1992 Submitted to the Program in Media Arts and Sciences, School of Architecture and Planning in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy at the Massachusetts Institute of Technology September, 1997 © Massachusetts Institute of Technology, 1997. All Rights Reserved. Author Program in Media Arts and Sciences August 8, 1997 Certified by Barry L. Vercoe Professor of Media Arts and Sciences Massachusetts Institute of Technology Accepted by Stephen A. Benton Chair, Departmental Committee on Graduate Students Program in Media Arts and Sciences Massachusetts Institute of Technology 3-D Audio Using Loudspeakers William G. Gardner Submitted to the Program in Media Arts and Sciences, School of Architecture and Planning on August 8, 1997, in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy. Abstract 3-D audio systems, which can surround a listener with sounds at arbitrary locations, are an important part of immersive interfaces. A new approach is presented for implementing 3-D audio using a pair of conventional loudspeakers. The new idea is to use the tracked position of the listener’s head to optimize the acoustical presentation, and thus produce a much more realistic illusion over a larger listening area than existing loudspeaker 3-D audio systems. By using a remote head tracker, for instance based on computer vision, an immersive audio environment can be created without donning headphones or other equipment. The general approach to a 3-D audio system is to reconstruct the acoustic pressures at the listener’s ears that would result from the natural listening situation to be simulated. To accomplish this using loudspeakers requires that first, the ear signals corresponding to the target scene are synthesized by appropriately encoding directional cues, a process known as “binaural synthesis,” and second, these signals are delivered to the listener by inverting the transmission paths that exist from the speakers to the listener, a process known as “crosstalk cancellation.” Existing crosstalk cancellation systems only function at a fixed listening location; when the listener moves away from the equalization zone, the 3-D illusion is lost. Steering the equalization zone to the tracked listener preserves the 3-D illu- sion over a large listening volume, thus simulating a reconstructed soundfield, and also provides dynamic localization cues by maintaining stationary external sound sources during head motion. This dissertation will discuss the theory, implementation, and testing of a head-tracked loudspeaker 3-D audio system. Crosstalk cancellers that can be steered to the location of a tracked listener will be described. The objective performance of these systems has been evaluated using simulations and acoustical measurements made at the ears of human subjects. Many sound localization experiments were also conducted; the results show that head-tracking both significantly improves localization when the listener is displaced from the ideal listening location, and also enables dynamic localiza- tion cues. Thesis Supervisor: Barry L. Vercoe Professor of Media Arts and Sciences This work was performed at the MIT Media Laboratory. Support for this work was provided in part by Motorola. The views expressed within do not necessarily reflect the views of the supporting sponsors. 3 4 Doctoral Dissertation Committee Thesis Advisor Barry L. Vercoe Professor of Media Arts and Sciences Massachusetts Institute of Technology Thesis Reader William M. Rabinowitz Principal Research Scientist MIT Research Laboratory for Electronics Thesis Reader David Griesinger Lexicon, Inc. Waltham, MA Thesis Reader Jean-Marc Jot Chargé de Recherche IRCAM Paris, France 5 6 Acknowledgments First and foremost, I would like to thank my advisor, Barry Vercoe, for providing constant encouragement and support through this long process. Barry’s vision has resulted in the creation of the Machine Listening Group (formerly the Music and Cognition Group) at the MIT Media Lab, where researchers such as myself can freely pursue topics in the under- standing and synthesis of music and audio. Barry has always encouraged me to “look at the big picture,” a task which I have not mastered. I can’t imagine having a better doctoral committee than Bill Rabinowitz, David Griesinger, and Jean-Marc Jot. Each has a particular perspective that complements the others. Bill has been cheerful and supportive throughout; he has been particularly helpful with the psycho- acoustic validation portion of this work. Bill also arranged for the use of the KEMAR and MIT’s anechoic chamber, and he assisted me in building miniature microphones for mak- ing ear recordings. David has worked closely with me on previous projects, most notably a study of reverberation perception. We’ve met numerous times to discuss room reverbera- tion, loudspeaker audio systems, and spatial hearing. He strives to find simple solutions to difficult problems; I hope I have emulated that ideal here. Jean-Marc’s work in spatial audio synthesis, strongly grounded in theory, has been inspirational. Some of the key ideas in this work are the result of many illuminating email discussions with him. My parents are no doubt responsible for my interest in science and for my choice of schooling at MIT. At a young age, they took me to MIT to see an exhibit of moon dust recently brought back from the lunar surface. From that time on I was certain that I would attend MIT; after a total of twelve years here (both undergraduate and graduate) it seems hard to believe that I’m going to get out. Mr. Aloian at Belmont Hill School deserves par- ticular credit for getting me into MIT as an undergrad. I would like to thank my colleagues in the Machine Listening Group for providing a stim- ulating and fun place to work. Thanks are extended to current group members Keith Mar- tin, Michael Casey, Eric Scheirer, Paris Smaragdis, and Jonathan Feldman. Keith Martin, my current officemate, was closely involved with the early stages of this work, assisting me with the measurement of the KEMAR HRTF data. Keith is also to be thanked for pro- viding a large music library for my late night listening pleasure. Eric Scheirer provided valuable proofreading assistance. Mike Casey, by using my systems in a number of audio productions, has both advertised my work and reminded me of the need for production- friendly designs. Thanks are also extending to former members of our group, including Dan Ellis, Nicolas Saint-Arnaud, Jeff Bilmes, Tom Maglione, and Mary Ann Norris. Dan Ellis, my former 7 officemate, deserves particular accolades. In addition to being a dear friend and providing emotional support during the initial stages of this work, he helped me solve countless prob- lems ranging from computer glitches to theoretical issues. Dan is also credited with sug- gesting that I use the Tcl/Tk Toolkit to build a graphical interface for the sound localization experiment software; this was a huge time saver. Connie Van Rheenen, Betty Lou McClanahan, Greg Tucker, and Molly Bancroft provided essential support at the Media Lab. Thanks are also extended to Bob Chidlaw at Kurzweil Music Systems, who first introduced me to the wonderful world of digital signal process- ing. Finally, none of this would be possible without Felice, who has been an unfaltering source of confidence, support, and love in this long and difficult process. This thesis is dedicated to her. 8 Contents 1 Introduction..............................................................................................11 1.1 Motivation: spatial auditory displays ................................................................................ 11 1.2 Ideas to be investigated .....................................................................................................13 1.3 Applications ...................................................................................................................... 14 1.4 Organization of this document .......................................................................................... 15 2 Background ..............................................................................................17 2.1 Sound localization in spatial auditory displays................................................................. 17 2.1.1 Interaural cues..................................................................................................... 17 2.1.2 Spectral cues ....................................................................................................... 18 2.1.3 Individualized HRTFs......................................................................................... 19 2.1.4 Externalization.................................................................................................... 20 2.1.5 Dynamic localization.......................................................................................... 20 2.2 Loudspeaker displays........................................................................................................ 21 2.2.1 Stereo .................................................................................................................. 21 2.2.2 Crosstalk cancellation......................................................................................... 22 2.2.3 Inverse filtering