
Filmuniversität Babelsberg KONRAD WOLF

Interactive design for perceptually motivated HRTF selection

Interaktives Design für wahrnehmungsmotivierte HRTF-Auswahl

Master's thesis by

Vensan Mazmanyan

November, 2017

Submitted by: Vensan Mazmanyan

Degree program: Master for picture

Matriculation number: 5353

Supervisor: Dipl.-Ing. Felix Fleischmann

Reviewer: Prof. Dr.-Ing. Klaus Hobohm

Appendix: USB stick - video examples

Acknowledgments

The following work was accomplished at the Fraunhofer Institute for Integrated Circuits (IIS) in Erlangen, Germany.

I am very grateful to Dipl.-Ing. Felix Fleischmann for engaging with the idea of my thesis and for his invaluable support. I would also like to especially thank Dipl.-Ing. Jan Plogsties for his help during the preliminary research prior to formulating the thesis and for his essential advice afterwards. Special thanks also to Dipl.-Tonmeister Ulli Scuda for all the opportunities and resources provided for the implementation of the test design.

My warmest thanks to all who have supported me with advice and feedback during my research and to all who have participated in the listening tests.

Vensan Mazmanyan

List of abbreviations

3D - Three-dimensional
6DoF - Six degrees of freedom
API - Application programming interface
AR - Augmented reality
ARI - Acoustics Research Institute
BEM - Boundary element method
BRIR - Binaural room impulse response
CIPIC - Center for Image Processing and Integrated Computing
DIRAC - short name of the Dirac impulse or Dirac function, named after the English theoretical physicist Paul Dirac
DFEQ - Diffuse-field equalization
DSP - Digital signal processing
EQ - Equalizer
FIR - Finite impulse response
GPU - Graphics processing unit
GUI - Graphical user interface
HMD - Head-mounted display (e.g. Oculus Rift™, HTC Vive™ etc.)
HPIR - Headphone impulse response
HRIR - Head-related impulse response
HRTF - Head-related transfer function
IIS - Institut für Integrierte Schaltungen
ILD - Interaural level difference
IR - Impulse response
IRCAM - Institut de Recherche et Coordination Acoustique/Musique
ISD - Interaural spectral differences
ITD - Interaural time difference
ITU-R - International Telecommunication Union, Radiocommunication Sector
LS - Loudspeaker
LTI - Linear time-invariant (system)
MLS - Maximum length sequence
MUSHRA - Multi-Stimulus Test with Hidden Reference and Anchor
PRTF - Pinna-related transfer function
QE - Quality element (in the physical domain) [Jekosch 2004]
QF - Quality feature (in the perceptual domain) [Jekosch 2004]
QoE - Quality of Experience [Jekosch 2004]
QoS - Quality of system or service [Jekosch 2004]
SOFA - Spatially Oriented Format for Acoustics
TF - Transfer function
VR - Virtual reality

Abstract

The development of immersive media has led to increased attention to the reproduction possibilities of spatial sound design. In the field of VR technology, binaural sound reproduction over headphones represents an important aspect of personalization. The creation of individual HRTFs remains difficult even today and is only possible under special conditions. Because well-matching HRTFs are decisive for the overall perceived quality of binaural content, different solutions are being investigated to approximate the individual HRTF or to adapt another HRTF. With the development of new exchange formats for HRTFs and their support by various newly developed binaural systems, it becomes possible to examine existing HRTF databases and to evaluate them perceptually.

This motivated the author to investigate the effects and interplay of concrete technical HRTF parameters and perceptual criteria. As part of the research, a database of individually measured HRTFs was created and analyzed in conjunction with other existing HRTF databases. Based on the results, listening tests were conducted to examine perceptual criteria and the responsible technical parameters. In the process, multivariate relations between different criteria were also identified. A procedure for perceptually motivated HRTF selection was proposed and tested in a VR implementation in Unity. The results showed that the listeners were able to listen critically within a VR environment and to express their preference based on isolated criteria and parameters. The statistical evaluation showed that the selected HRTFs brought no significant improvement over individual HRTFs, the HRTF of a dummy head, and a numerically averaged HRTF. However, a validation of the overall quality in the context of a complex reproduction was not possible due to the restriction of the rendering to a single dynamic object.

Contents

1 Introduction

2 Fundamentals
  2.1 Binaural hearing
  2.2 VR
  2.3 HRTF acquisition methods
  2.4 HRTF selection methods
    2.4.1 Seeber-Fastl method
    2.4.2 DOMISO method

3 Proposed HRTF selection method
  3.1 General considerations
  3.2 Proposed method overview
  3.3 BRIR measurements
    3.3.1 Measurement setup
    3.3.2 Postprocessing
    3.3.3 HRTF database preparation in SOFA format
  3.4 Selecting the most relevant perceptual criteria and isolation of the responsible technical parameters
    3.4.1 Perceptual criteria
    3.4.2 Possible relevant technical parameters
    3.4.3 Analysis of HRTF databases
    3.4.4 ITD analysis
    3.4.5 Main pinna notch analysis

4 Implementation of the VR-environment
  4.1 System overview
  4.2 VR-rendering
  4.3 Controller
  4.4 GUI

5 Preliminary listening tests
  5.1 Sound coloration and spectral dynamics
  5.2 Externalization, generic HRTF, DFEQ and main notch effect
  5.3 Horizontal localization
  5.4 Vertical localization
  5.5 Data analysis and interpretation

6 Summary of the selected QF-QE pairs. Selection design.

7 Selection and validation tests
  7.1 Database preparation for the selection procedure
  7.2 Selection test
  7.3 Validation test

8 Discussion and outlook

9 References

10 List of figures

11 Declaration of authorship / Eidesstattliche Erklärung

1 Introduction

The recent rise of consumer-oriented VR technologies and applications has drawn attention to personalized spatial audio rendering solutions. In order to enable a completely immersive experience for the end-user, not only an immersive 360° visual environment but also a plausible three-dimensional (3D) audio listening experience through headphones is needed. This represents a challenge in many aspects of content creation and rendering that need to be addressed [Rumsey 2016].

If we compare conventional multichannel audio reproduction over loudspeakers with spatial audio reproduction over headphones, there is one obvious difference that has to be pointed out. In contrast to loudspeaker reproduction, headphone reproduction utilizes a binaural perceptual model that exists as an additional layer in the rendering engine. This model, often referred to by the generalized term HRTF, contains the so-called binaural cues, which provide information about the acoustic effects caused by the human head, torso and outer ear (as well as the room acoustics, if present) when interacting with an acoustic wave front emitted by a sound source in three-dimensional space. This information is stored as a set of transfer functions describing those effects for a given source position; when they are applied to an incoming signal, the human cognitive system reacts to the sound similarly to how it would if the signal originated from a real acoustic source positioned in the natural surrounding space.

‘Authenticity, in this context, means that the subjects at the receiving end do not sense a difference between their actual auditory events and those which they would have had at the recording position when the recording was made.’ [Jens Blauert 1997]

Jens Blauert, 1997

Many researchers throughout the years have tested and analyzed the nature of the binaural cues of human spatial hearing, studying the principles that enable us to localize the direction of a sound, to evaluate its distance, to analyze the interaction between sound sources and the reflective environment they are placed in, etc. There are still many unanswered questions concerning the psychophysics of human hearing that need to be further investigated, but one thing has become clear so far: the perceived binaural effect can be significantly different for different individuals, which is due to a complex mix of many factors involving not only the purely physical qualities of the human body structure, but among others also personal auditive experience and education, health condition, the type of sound source, headphone qualities, rendering artifacts, signal parameters etc.

A highly regarded example within those studies is binaural rendering for a particular person by applying a perceptual model based on the actual anthropometric features of that same person, a perceptual model also known as the individual HRTF. Theoretically, this is the only way to achieve the most precise approximation of the individual perception under real, natural (without headphones) listening conditions.

To date, the process of acquiring a personal HRTF remains quite expensive and time-consuming, and it really could not be categorized as a user-friendly experience. This is why such procedures are normally conducted by rather large research institutions or universities. However, with the creation of publicly available databases of HRTFs measured from real persons, in conjunction with the recently developed HRTF exchange file format [Sridhar, Tylka, and Choueiri 2017] and its support by a growing number of consumer binaural engines, a new aspect of the binaural auditive experience has been brought up. It is now possible to load different HRTFs into a binaural renderer as part of a VR application and to experience the same given content with different perceptual models. As a result, the quality assessment of the binauralization is no longer limited only to criteria concerning the match or difference between the spatial perception of a binaural virtual sound source and the natural perception of a real source. The perceptual model can now also be discussed as a matter of personal preference with respect to the particular VR environment and application [B. Boren and Roginska 2011].

All the questions concerning this new 'aesthetic' or 'hedonistic' aspect of the perceptual model have motivated the current work, in which an effort has been made to provide some insight into the perceptual evaluation of spatial audio reproduction in the environment of consumer-based VR rendering. Questions like:

- How would a listener use an opportunity to exchange the provided HRTF for another one, based on perceptual criteria?
- Would they use it at all, or would they not notice any difference because of the overwhelming VR experience?
- If they would use it, what criteria would they use to decide?
- Would they try to approximate the perception they know from a real environment?
- Would they instead rely on something else that seems more appealing in the context of a virtual environment?
- How would the visual component affect their perception and/or preference?

Further considerations lead to questions of a higher order, which ultimately marked the main hypothesis of this study.

Would it be possible for an individual listener to use some kind of interactive interface in a VR environment that would allow them to select particular qualities of the binaural perceptual model in real time, according to their subjective liking?

If so, how would they then rate their choice against established alternatives, like their personal HRTF or an HRTF from a dummy head, with respect to those particular criteria?

‘One important component is that even though technology rapidly evolves, humans who use it do not. It is therefore crucial to understand how our sensory systems function, especially when matched with artificial stimulation.’ [LaValle 2017]

Steven M. LaValle, 2017

2 Fundamentals

2.1 Binaural hearing

Spatial hearing is of great importance for the human cognitive system. It provides information about the surrounding environment and the particular occurrences within it, whose relevance can cover a wide palette of categories: informative, emotionally appealing, aesthetically pleasing, energizing, relaxing, distracting, dangerous, life-threatening and many others. Because the human anthropomorphic structure does not provide primary control over the acoustic flow entering the auditory system, a complex auditory analysis is constantly maintained by the central nervous system. In other words, by having our ears open all the time, we are always aware of everything happening in the space around us [Moore 2012]. This points to the ability to quickly and precisely identify all surrounding events: not only their character and environmental significance, but also their relation to the listener's own spatial position.

Similar to the way we can see with two eyes, binaural hearing is based on combining the information from the two ears in the brain, creating a robust impression that confers on the stimulus a special character of perspective known as three-dimensional (3D) depth and localisation [Avan, Giraudet, and Büki 2015].

‘Both in the visual and auditory modalities, this character contributes to creating ‘objects’, which are easier to segregate and identify than what would have happened if a single receiver had been available.’ [Avan, Giraudet, and Büki 2015]

Avan/Giraudet/Büki, 2015

Researchers have contributed to the investigation and understanding of the human hearing apparatus in order to enable the implementation of binaural technology in many different applications.

‘The advent of microcomputers and, consequently, the availability of the necessary computational power for real-time processing of audio signals have initiated and fostered the development of a new technology, "binaural technology," which has established itself as an enabling technology in many fields of application, such as information and communication systems, measurement technology, hearing aids, speech technology, multimedia systems, and virtual reality.’ [Jens Blauert 1997]

Jens Blauert, 1997

Figure 1: Ear anatomy. Source: Blausen.com staff (2014). "Medical gallery of Blausen Medical 2014". WikiJournal of Medicine 1 (2). DOI:10.15347/wjm/2014.010. ISSN 2002-4436

The physical layout of the human auditory system may be divided into the outer, middle, and inner ear [Gelfand 2016]. The outer ear consists of the pinna and the ear canal, which propagate the sound waves to the eardrum (middle ear). Their exact shape varies between individuals [Monge 2011]; thus the acoustic impact they produce can differ significantly between listeners. However, the aspect of morphological variance is still being investigated [Purkait 2016].

‘The knowledge of the role of the external ears in spatial hearing and the availability of quantitative data for modeling external ears paves the way for various applications, for example, creating auditory events at prescribed directions and distances in a subject’s perceptual space.’ [Jens Blauert 1997]

Jens Blauert, 1997

The exact mechanisms of human binaural hearing are still not completely clear, but a number of related phenomena are already being discussed and extensively investigated. The distinction between the different attributes of auditory events is important for better isolation and understanding of the causes of those effects. Localization, localization blur, summing localization, cones of confusion, the precedence effect, the cocktail-party effect, binaural masking etc. are all effects discovered and summarized thanks to a systematic approach and diligent tests.

It is common to divide the attributes of a signal entering the auditory system into two main categories: monaural, consisting mainly of time and level differences between the individual spectral components of each single ear input signal; and interaural, consisting mainly of time and level differences between corresponding spectral components of the two ear input signals. Additional parameters, like head movements during listening or the involvement of other senses simultaneously with the occurrence of the auditive event, also contribute to the final perceptual effect and must be considered. The improvement of localization confidence and the reduction of front-back confusion thanks to head movements is discussed in several studies. The actual parameters of the stimulus also play an important role by affecting the proportions between the different time, level and spectral cues [Jens Blauert 1997].
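For reference, the two classic interaural cues can be written down compactly. The following is a standard textbook formulation added here for illustration only; the symbols (arrival times τ and ear input spectra H) are chosen for this sketch and do not come from the cited sources:

```latex
\mathrm{ITD} = \tau_L - \tau_R \quad \text{(typically below about 1 ms)}, \qquad
\mathrm{ILD}(f) = 20\,\log_{10}\frac{|H_L(f)|}{|H_R(f)|}\ \mathrm{dB}
```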

To be able to formally discuss auditive localization effects, it is necessary to relate all occurrences to a particular spatial coordinate system. Unless stated differently, the following convention for spatial coordinates will be used throughout this work.

Figure 2: Left - polar coordinates top view (horizontal plane). Right - polar coordinates side view (median plane).
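To make the convention concrete, the following minimal Matlab sketch converts a direction given in the polar convention of Figure 2 into Cartesian coordinates. It assumes azimuth increases towards the listener's left (as stated in section 3.3.1) and elevation upward from the horizontal plane; the function name and the axis orientation (x front, y left, z up) are illustrative choices, not part of the original text.

```matlab
% Polar direction (azimuth, elevation in degrees; radius in meters)
% to Cartesian coordinates: x points to the front, y left, z up.
function [x, y, z] = directionToCartesian(azDeg, elDeg, r)
    x = r .* cosd(elDeg) .* cosd(azDeg);
    y = r .* cosd(elDeg) .* sind(azDeg);   % positive azimuth -> left
    z = r .* sind(elDeg);
end
```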

2.2 VR

When we say virtual reality, a common first association is with consumer head-mounted displays (HMDs) like the Oculus Rift™. However, the exact historical origin of the concept is a subject of discussion. First elements of the idea of virtual reality can be traced all the way back to cave paintings from thousands of years ago [LaValle 2017]. Actual hardware designs relating to modern devices can be found in the nineteen-thirties¹.

A common point of discussion is what exactly VR is. It would probably be incorrect to try to describe it in only one sentence. Still, there are some suggestions.

‘Defining Virtual Reality can prove to be a difficult task because there is no standard definition for it. It is said to be an oxymoron, as it is referred to by some school of thought as “reality that does not exist”.’ [Bamodu and Ye 2013]

Bamodu, Oluleke and Ye, Xu Ming, 2013

‘Virtual reality - a medium composed of interactive computer simulations that sense the participant’s position and actions and replace or augment the feedback to one or more senses, giving the feeling of being mentally immersed or present in the simulation (a virtual world).’ [Sherman and Craig 2003]

Sherman, William R and Craig, Alan B, 2003

‘Inducing targeted behavior in an organism by using artificial sensory stimulation, while the organism has little or no awareness of the interfer- ence.’ [LaValle 2017]

Steven M. LaValle, 2017

Another term related to virtual reality is augmented reality, which denotes the enhancement of the real environment by the means of virtual reality.

‘Augmented Reality (AR) superposes synthetic elements like 3D objects, multimedia contents or text information onto real-world images (Hsieh and Lin, 2011), increasing its possibilities of interaction with the user.’ [Jorge Martín-Gutiérrez and Beatriz Añorbe-Díaz 2017]

J. M. Gutiérrez et al., 2016

The two-way influence between VR and AR and their derivations motivates the need for a more precise definition and also provokes concepts concerning the categorization of mixed realities [Milgram and Kishino 1994].

¹ https://www.vrs.org.uk/virtual-reality/history.html

‘The conventionally held view of a Virtual Reality (VR) environment is one in which the participant-observer is totally immersed in, and able to interact with, a completely synthetic world. Such a world may mimic the properties of some real-world environments, either existing or fictional; however, it can also exceed the bounds of physical reality by creating a world in which the physical laws ordinarily governing space, time, mechanics, material properties, etc. no longer hold. What may be overlooked in this view, however, is that the VR label is also frequently used in association with a variety of other environments, to which total immersion and complete synthesis do not necessarily pertain, but which fall somewhere along a virtuality continuum.’ [Milgram and Kishino 1994]

Milgram, Paul and Kishino, Fumio, 1994

Figure 3: Simplified representation of a ‘virtuality continuum’. Source: Giovanni Vincenti

‘VR systems can be classified into 3 major categories. These are, non-immersive, immersive and semi-immersive, based on one of the important features of VR, which is immersion and the type of interfaces or components utilized in the system. Non-Immersive VR system, also called Desktop VR system, Fish tank or Window on World system is the least immersive and least expensive of the VR systems, as it requires the least sophisticated components. It allows users to interact with a 3D environment through a stereo display monitor and glasses, other common components include space ball, keyboard and data gloves. Its application areas include modeling and CAD systems. Immersive VR system on the other hand is the most expensive and gives the highest level of immersion; its components include HMD, tracking devices, data gloves and others, which encompass the user with computer generated 3D animation that give the user the feeling of being part of the virtual environment. One of its applications is in virtual walk-through of buildings. Semi-Immersive VR system, also called hybrid systems ... or augmented reality system, provides high level of immersion, while keeping the simplicity of the desktop VR or utilizing some physical model. Example of such system includes the CAVE (Cave Automatic Virtual Environment) and an application is the driving simulator. Distributed-VR also called Networked-VR is a new category of VR system, which exists as a result of rapid development of internet. Its goal is to remove the problem of distance, allowing people from many different locations to participate and interact in the same virtual world through the help of the internet and other networks.’ [Bamodu and Ye 2013]

Bamodu, Oluleke and Ye, Xu Ming, 2013

Despite the attractive new experience that VR technology promises, there are also ethical and philosophical considerations. Some authors are concerned about the impact of VR on our social and economic life.

‘With the development of new VR technologies (e.g. goggles, gloves, interfaces, etc.), virtual reality will be able to approximate the real world with greater and greater accuracy. High-end flight simulators are already nearly indistinguishable from the real thing. In time, virtual reality environments, which are able to simulate "real world experience," will be available to the general public. In other words, virtual reality will become indistinguishable from the real, at least in terms of perceptual and cognitive processing. With the development of life-like virtual worlds (often replicating the real world in perfect virtual detail), virtual reality (i.e. "The Virtual World") will come into its own.’ [Cline 2005]

Mychilo Stephenson Cline, 2005

A modern consumer VR system consists of three main components. The first one comprises all input utilities and controllers; they enable the subject to actively engage with the VR environment. The second component is the VR engine, which maintains the rendering and interaction process. The third main component is represented by all output devices: all units providing feedback from the VR engine back to the subject [Bamodu and Ye 2013]. This rather complex constellation presents some challenges to current consumer rendering solutions. Concerning the video rendering, because the HMD is placed right in front of the eyes, the inter-pixel space appears as visible artifacts over the whole visual area, which spoils the immersion to some extent. Newer, higher-resolution systems are currently in development, but this relates to other difficulties concerning GPU performance requirements. It is widely suggested that for seamless visual performance the video frame needs to be refreshed at least 90 times per second. This high frame rate combined with a higher pixel resolution is not achievable with the available consumer hardware at the time of writing (Nov. 2017). Another problematic aspect of the VR system remains the head-tracking latency and the overall system reaction times. A common term to describe the amount of time it takes to update the display in response to a change in head orientation and position is the so-called motion-to-photon latency [LaValle 2017].

‘At Oculus we believe the threshold for compelling VR to be at or below 20ms of latency. Above this range, users tend to feel less immersed and comfortable in the environment. When latency exceeds 60ms, the disjunction between one's head motions and the motions of the virtual world start to feel out of sync, causing discomfort and disorientation.’ [Oculus 2017]

Oculus Best Practices, 2017

Figure 4: Example of visual artifacts in a typical HMD. Top - around 70 megapixels per eye. Bottom - around 1.2 megapixels per eye. Source: http://www.varjo.com/

Diverse practical aspects limit the user experience when wearing an HMD.

‘A head-mounted display also weighs considerably more than a pair of glasses. The longer a heavy device is worn, the more tired the users will become. Further, HMDs usually require cables to carry the video and audio signals to and from the display. Tracking systems often have cables as well, so HMDs have a large cable burden. Cables limit the movement of the user, allowing them to walk only a short distance or rotate no more than one full turn.’ [Sherman and Craig 2003]

Sherman, William R and Craig, Alan B, 2003

Figure 5: The input/output loop in a typical VR system. Source: Shmuel Csaba Otto Traian

Despite all current challenges and uncertainties, the VR market has shown increasing growth over the last couple of years. Some analysts foresee exponential growth of the total industry revenue, reaching 80 billion US dollars by 2025 [Sachs 2016], with video games and video entertainment holding the biggest share.

The importance of binaural technology for the immersive experience in this new medium is unquestionable. This motivates improving the existing state of the art by addressing issues in many aspects, one of the most important of which is the personalization of the rendering system. The VR environment relies on effective control over our senses. In order to achieve plausible results and fulfill a wide range of expectations among the largest possible audience, a purely technical approach alone is probably not going to ensure success. Addressing subjective preference concerning the rendering system and offering individual customization options could effectively contribute to the broader acceptance of this new reality.

2.3 HRTF acquisition methods

‘The process of scattering/diffraction/reflection by human anatomical structures converts the information in sound fields into binaural pressure signals. In the case of a free field and a fixed listener, the transmission from a point sound source to each of the two ears can be regarded as a linear time-invariant (LTI) process, and characterized by head-related transfer function (HRTF). HRTFs contain the main spatial information of sound sources, and are thus vital for research on binaural hearing.’ [Xie 2013]

Bo-sun Xie, 2012

The examination and acquisition of HRTFs has a fairly long tradition. The classical approach relies on actual acoustic measurement and storage of the acquired TFs as (Dirac) impulse responses (IRs). A typical case requires an acoustically controlled environment and treated room acoustics. A professional loudspeaker (LS), or multiple loudspeakers, is used as the source of the excitation signal. The subject or dummy head is positioned accordingly to ensure precise orientation, and is often restricted in movement by means of a special construction which holds the subject perfectly steady. Pressure-field microphones are typically placed in the blocked ear canal of the subject to record the excitation signal emitted by the LS and scattered by the human head and ears on its way to the ear entrance.

Figure 6: Typical block diagram of acoustic HRTF measurements.

Understandably, this process is rather expensive and can get quite tedious for the subjects in the case of individual human measurements, especially if many source positions need to be measured.

Because of the impracticality of acoustic measurements, other approaches are being investigated. A possible alternative to measurement is numerical calculation, the most common variant of which is the boundary element method (BEM) [Huttunen and Vanne 2017].

‘BEM is an extensively used method for acoustic radiation and scattering problems ... , in which the boundary problem of the wave or Helmholtz equation is first converted into a boundary surface integral. The boundary surface is then made discrete, into a mesh of elements, resulting in a set of linear algebra equations.’ [Xie 2013]

Bo-sun Xie, 2012

While this alternative method effectively avoids the common difficulties of acoustic measurements, other constraints apply. To accurately reconstruct the anatomical features of a person into the 3D model required by the method, a highly accurate scanning device is needed, capable of probing the skin surface of the head and pinna with sufficiently high resolution [Reichinger et al. 2013]. Moreover, the acquired 3D model needs to be examined for possible artifacts, which have to be fixed in a post-processing environment [Ziegelwanger, Kreuzer, and Majdak 2016]. Another aspect of the BEM procedure is the significant computational power required to calculate the individual HRTF based on the 3D mesh [Young, Tew, and Kearney 2016].

Another alternative which deserves attention is the application of artificial neural networks, an approach also known as deep learning [Kaneko, Suenaga, and Sekine 2016]. Deep learning is part of the bigger family of machine learning. The application of the method in this particular case consists of the development of a special learning algorithm which is capable (after a training period) of recognizing relations between morphological patterns and reconstructing the individual HRTF corresponding to those patterns. The big advantage of this approach is that it could eventually achieve consistent results based on two-dimensional anthropometric input like a common picture of the head and pinna. However, in order to reach that level, a large, high-quality input base for the learning process is required, as well as diligent tests and optimizations during the training stage.

2.4 HRTF selection methods

As opposed to the analytical and numerical approaches described above, the purely subjective selection of an HRTF out of an existing pool based on listening tests represents a different way of acquiring an HRTF that could be perceptually plausible and more appealing [Grasser, Rothbucher, and Diepold 2014]. In the following, two known selection methods are briefly presented.

2.4.1 Seeber-Fastl method

The objective of the Seeber-Fastl method is to investigate a fast selection procedure for subjects without particular training. The procedure is divided into two stages [Seeber and Fastl 2003]. In the first selection stage the subjects listen to 12 HRTFs and select 5 of them based on a score from 0 to 9 given for the greatest spatial perception in the frontal area. In the second stage the 5 preselected HRTFs are rated according to a new set of criteria:

- The direction of the sound is perceived from -40° left to 40° right, but not further outside
- The sound moves horizontally in equally-spaced steps
- The sound has a constant elevation at all times
- The sound is perceived in the frontal plane, at a constant distance, and preferably far away

White noise bursts are used as the test signal, structured as five sequential 30 ms pulses separated by 70 ms pauses, with 5 ms Gaussian slopes applied, at 60 dB SPL. The database used is the AUDIS catalogue [Blauert et al. 1998]. The 12 HRTFs are used to generate virtual sound sources on the horizontal plane so that the subject can evaluate the localization effect with different HRTFs. The subjects use a laser pointer to indicate the perceived direction. The test design is implemented as a Matlab GUI; a PC keyboard is used for input.
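For illustration, such a test stimulus can be generated with a few lines of Matlab; the sample rate and the exact Gaussian slope construction are assumptions, since they are not specified here.

```matlab
% Five 30 ms white noise bursts with 70 ms pauses and 5 ms Gaussian slopes.
fs    = 48000;                                   % assumed sample rate
burst = randn(round(0.03*fs), 1);
slope = exp(-0.5 * (linspace(-3, 0, round(0.005*fs))').^2);  % rising edge
burst(1:numel(slope)) = burst(1:numel(slope)) .* slope;
burst(end-numel(slope)+1:end) = burst(end-numel(slope)+1:end) .* flipud(slope);
gap   = zeros(round(0.07*fs), 1);
stim  = repmat([burst; gap], 5, 1);
stim  = 0.5 * stim / max(abs(stim));   % normalize; SPL calibration is separate
```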

The findings of this selection method confirmed the feasibility of the approach, and 'directional anchors' were suggested for future implementations.

2.4.2 DOMISO method

The DOMISO method builds upon the findings of the Seeber-Fastl method, but a different selection design was proposed [Yairi, Iwaya, and Suzuki 2008]. It utilizes a Swiss-style tournament approach to account for larger HRTF databases. All the HRTFs in the database are first clustered into groups of 32 HRTFs after cepstral analysis. In the original tests the main criterion given is localization on the horizontal plane. The 32 HRTFs in every group compete in a Swiss-style tournament and all the winners are validated in round-robin comparisons.
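To illustrate the tournament idea, here is a minimal Matlab sketch of a knockout-style reduction of one candidate group. It is a simplified stand-in for DOMISO's Swiss-style tournament (a Swiss system plays fixed rounds and ranks by score rather than eliminating), and the function name and the preference callback are illustrative assumptions, not the original implementation.

```matlab
% Knockout selection over a group of HRTF indices. preferAB is a handle
% that plays candidates a and b to the listener and returns true if a wins.
function winnerIdx = tournamentSelect(hrtfIdx, preferAB)
    pool = hrtfIdx(:)';
    while numel(pool) > 1
        next = [];
        for k = 1:2:numel(pool)-1
            if preferAB(pool(k), pool(k+1))
                next(end+1) = pool(k);      %#ok<AGROW>
            else
                next(end+1) = pool(k+1);    %#ok<AGROW>
            end
        end
        if mod(numel(pool), 2) == 1
            next(end+1) = pool(end);        % odd pool: last entry gets a bye
        end
        pool = next;
    end
    winnerIdx = pool;
end
```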

While both the Seeber-Fastl and DOMISO methods show partial improvements for the particular examined criteria, their effectiveness could be questioned [Grasser, Rothbucher, and Diepold 2014].

Alternative perceptually based approaches try to make use of cross-modal learning effects in order to 'recalibrate' the perceptual degradation [Zhong, Zhang, and Yu 2015], [Mayenfels 2015].

3 Proposed HRTF selection method

3.1 General considerations

As already discussed in the previous chapter, the limitations concerning individual HRTF acquisition motivate the need for a more detailed observation and understanding of the dependencies between particular HRTF parameters and their influence on particular perceptual effects. If such dependencies can be clearly identified and isolated, this would allow a partial parametrization of those elements, which could then be adjusted on demand by the end-user according to their preference and the actual application. If a selection procedure proves to be effective and easy to manage, this could provide a reasonable compromise compared to an individually acquired HRTF, and possibly a better solution compared to a dummy head HRTF without customization options.

The availability of HRTF databases in the recently developed universal HRTF file container format [AES 2015] provides convenient ways to analyze and perceptually compare different HRTFs in an application-oriented environment, with the opportunity to observe the listener's response to particular qualities of the binaural rendering under complex conditions. A traditional approach to the assessment of HRTFs is the conducting of listening tests, normally carried out in a controlled environment under very strict conditions without any visual distractions, so that the listener is not biased in their judgment of the audio rendering quality. These kinds of tests provide reliable and reproducible evidence of the listener's reactions under those specific conditions. However, in a real-world scenario, many aspects besides the pure auditive experience can have a significant impact on the listener's judgment. In the case of contemporary consumer VR rendering, which involves an HMD and ultimately isolates the consumer in an environment very different from the natural one, the complexity of the multi-modal stimulus can have a significant perceptual impact that needs to be examined in its particular context. This way, a broader range of relevant dependencies that need to be analyzed can be revealed, and a more profound understanding of the significance of particular QEs and/or QFs under typical application conditions can be provided. On the other hand, this is a good opportunity to evaluate the margins in overall QoE between HRTFs acquired using significantly different approaches. An individually measured HRTF is generally regarded as a ground truth in many aspects, but there is still little clear evidence about the performance of the individual HRTF in the context of a commercial VR application. A similar interest would also be valid for an HRTF acquired from a typical dummy head.

3.2 Proposed method overview

The main objective of the proposed selection procedure in this study is to investigate a specific approach to HRTF manipulation and selection that gives the listener much greater freedom to dynamically interact with the stimulus and, essentially, with all perceptual criteria of interest: a type of controlled interaction in which only the criterion of interest is affected by the change of conditions, while the effect is presented in the context of many other perceptual criteria. This can be implemented in a full-featured VR environment in order to isolate the subject from surrounding distractions, and also to enable a complex, high-level

experience, which closely approximates a typical application scenario. A game-like interaction allows very efficient engagement with many focused QFs, which could potentially enable better overall perceptual conditions [Pike and Stenzel 2017]. In order to provide effective support and guidance to the listener, so that they do not get lost in the process of interaction, a strategically designed set of isolated criteria is used as a sequence of perceptual tasks, in which only one QE is offered to the listener for adjustment by selection. By dynamically adjusting this isolated technical parameter, the listener has to decide which setting corresponds to their personal liking with respect to the examined criterion. Their decision in a particular trial is acknowledged and retained in the trials that follow. By the completion of the whole sequence, the listener ideally will have designed a more appealing HRTF than the one they started with, just by expressing their personal preference under the given conditions and criteria in every trial.

One challenge to address is the selection of the isolated criteria that have the most significant impact on the QoE during binaural listening, and the isolation of the corresponding technical parameter that most significantly influences a given criterion. Another important aspect is the implementation of the interactive environment that presents the selection sequence to the listener based on audio-visual stimuli rendered in real time: an environment with the important feature of being able to switch dynamically between different HRTFs, and containing a basic interactive VR GUI to acquire the listener's input concerning their preference in an efficient but sufficiently user-friendly fashion.

To be able to fulfill the aforementioned requirements and deal with the existing challenges, the following steps were taken.

Throughout the current study a measurement session was conducted in which BRIRs from 67 listeners were measured and their anthropomorphic features were protocoled by means of hand measurements and 3D scanning. The gathered BRIRs were processed further to eliminate the influence of the room acoustics and to acquire the dry individual HRTFs. The newly created HRTF database was then analyzed and compared with other existing HRTF databases. These databases were analyzed to look for common patterns, to evaluate the data statistically and to localize potential parameters that could relate to certain perceptual criteria. Informal and preliminary tests were conducted in order to localize the most significant relations between the examined QFs and QEs. After the perceptual criteria and technical parameters were selected, an HRTF database containing 243 HRTFs in SOFA format was created for the purpose of the proposed selection procedure. The test design was implemented in a VR renderer that provided dynamic real-time HRTF convolution of a single dynamic object, as well as HRTF switching and a basic GUI. A total of 16 selection procedures and validations were conducted. In 12 of the validations, experienced listeners compared their perceptually selected HRTF with their individual HRTF, with an average HRTF and with a dummy head HRTF. In the remaining 4 validations, the listeners compared their perceptually selected HRTF only with an average HRTF and a dummy head HRTF.

In the following chapters, the steps mentioned above will be described in detail.

3.3 BRIR measurements

To be able to evaluate the QoE of a selected HRTF after the selection procedure, a robust validation method is required. This could be a direct comparison against the individual HRTF, considered the reference for a particular listener. At the same time, a comparison against a dummy head HRTF, usually considered a compromise, would also be interesting, because this kind of HRTF originates from a statistical approximation and is not directly related to a particular individual.

In order to acquire a sufficient number of individual HRTFs from potential participants in the subsequent listening tests, 67 individual HRTFs of listeners with experience in the field of spatial audio research were measured.

3.3.1 Measurement setup

The measurements were made in a reference listening room [Silzle et al. 2009] at Fraunhofer IIS in Erlangen (Germany) within 5 days.

Figure 7: Reference listening room (ITU-R BS.1116) "Mozart" - Fraunhofer IIS, Erlangen (Germany)

The room has a room-in-room construction with rectangular geometry: length 9.3 m, width 7.5 m, height 4.2 m. The net floor area is 69.75 m² and the volume is 300 m³. The playback system consists of 60 loudspeakers, mainly attached to free-hanging circular trusses and also to the walls. The room meets the NR 10 noise specification and is adapted to the wide range of requirements of demanding audio scientists and engineers.

Figure 8: Subject 1 - measurement setup

Figure 9: Subject 1 with in-ear microphones

The subjects were seated in the middle of the room on a special chair with adjustable height, without wheels and with a custom head support, as shown in Figure 8. In order to maintain correct lateral orientation, every subject had to wear a head-strapped laser pointer showing the viewing direction at all times. HTM-1 microphones, normally bundled with the Smyth Realiser A8, were used together with a custom-designed external preamplifier providing phantom power. The microphone capsules were placed at the blocked ear canals of the subjects, as demonstrated in Figure 9. The outputs of the preamplifier were fed into a MADI-based audio router, which then sent them to a PC running Matlab; Matlab was used to manage the whole playback and recording routine. Twenty-six LS (Dynaudio BM6A MKII) were selected as source positions in a circular layout with the following coordinates with respect to the subject:

1. Azimuth 30°, Elevation 0°
2. Azimuth -30°, Elevation 0°
3. Azimuth 0°, Elevation 0°
4. Azimuth 180°, Elevation 0°
5. Azimuth 110°, Elevation 0°
6. Azimuth -110°, Elevation 0°
7. Azimuth 90°, Elevation 0°
8. Azimuth -90°, Elevation 0°
9. Azimuth 135°, Elevation 0°
10. Azimuth -135°, Elevation 0°
11. Azimuth 45°, Elevation 0°
12. Azimuth -45°, Elevation 0°
13. Azimuth 60°, Elevation 0°
14. Azimuth -60°, Elevation 0°
15. Azimuth 30°, Elevation 39°
16. Azimuth -30°, Elevation 39°
17. Azimuth 0°, Elevation 39°
18. Azimuth 180°, Elevation 39°
19. Azimuth 110°, Elevation 39°
20. Azimuth -110°, Elevation 39°
21. Azimuth 90°, Elevation 39°
22. Azimuth -90°, Elevation 39°
23. Azimuth 135°, Elevation 39°
24. Azimuth -135°, Elevation 39°
25. Azimuth 45°, Elevation 39°
26. Azimuth -45°, Elevation 39°

As shown in the second chapter, the coordinate convention uses positive values on the left side of the coordinate system. The excitation signal used was a 4-second-long logarithmic sine sweep at a 48000 Hz sample rate, with a start frequency of 80 Hz and a stop frequency of 24000 Hz. By applying a 10° lateral rotation of the subject (rotating the chair) once to the left and once to the right of the 0° azimuth axis, it was possible to record additional source directions in order to achieve a greater spatial resolution of the personal HRIR dataset, with a total of 74 directions. After the BRIR measurements, HPIR measurements of a Beyerdynamic DT 770 PRO were made per subject. Additionally, the position of the subject's head, without the subject present, was measured with a measurement microphone to be able to linearize the transfer function of the electro-acoustic system. Subsequently, the anthropometric data of all subjects was measured and protocoled by hand and categorized according to the CIPIC convention.
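For illustration, the excitation signal described above can be generated as follows. This is a minimal Matlab sketch of the standard exponential-sweep construction (Farina's method); the inverse filter and the deconvolution step are assumptions about the processing chain, not something documented in this section.

```matlab
% 4 s logarithmic sine sweep, 80 Hz to 24 kHz at 48 kHz, plus the matching
% inverse filter used to deconvolve recordings into impulse responses.
fs = 48000; T = 4; f1 = 80; f2 = 24000;
t  = (0:1/fs:T-1/fs)';
L  = T / log(f2/f1);                       % sweep rate constant
sweep = sin(2*pi*f1*L*(exp(t/L) - 1));
invFilt = flipud(sweep) .* exp(-t/L);      % decaying envelope whitens the sweep
% Deconvolving a recorded response y gives the impulse response:
% ir = conv(y, invFilt);  % the direct sound sits near sample numel(sweep)
```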

3.3.2 Postprocessing

After all measurements were completed, the gathered BRIRs were further processed to make them suitable for general-purpose use.

For every subject, three HRTF sets were prepared. The first one contains the 4096-sample-long BRIRs of the 26 main loudspeaker directions listed above, trimmed at the beginning by the overall system delay and faded out at the end with a 2048-sample-long, asymmetrical, right-sided-only Tukey window applied to samples 2049-4096. The second set consists of the pure HRTFs windowed to the first room reflection (the floor reflection, see Fig. 10), which occurs at roughly 140 samples from the beginning of the IR, calculated as the time difference between the direct sound path and the reflected sound path between the loudspeaker membrane and the ear entrance.
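As a concrete illustration of the fade-out described above for the first set, here is a minimal Matlab sketch (requires the Signal Processing Toolbox; the taper shape is an assumption, since only the window's length and right-sided application are stated):

```matlab
% Fade out the last 2048 samples of a 4096-sample BRIR with the right
% half of a Tukey window. brir is one measured channel (column vector).
fadeLen = 2048;
w = tukeywin(2*fadeLen, 1);            % ratio 1 gives a Hann-shaped taper
fadeOut = w(fadeLen+1:end);            % right-sided half only
brirTrimmed = brir;
brirTrimmed(fadeLen+1:2*fadeLen) = brirTrimmed(fadeLen+1:2*fadeLen) .* fadeOut;
```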

The third set takes the pure HRTFs and applies a smooth low-frequency compensation via a 6 dB/oct IIR filter to account for the low-frequency energy drop caused by the shorter HRIR length. After the compensation, the resulting IRs were stored with a length of 256 samples. For the 140- and 256-sample-long sets, the DFEQs were calculated and stored separately as IRs, to be applied when needed. The HPIR was also stored separately.

Figure 10: First room reflection

After the HRTF sets for all subjects were prepared, they were all stored in SOFA format for more convenience during further analysis or perceptual evaluation.
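A minimal sketch of this export step with the freely available Matlab/Octave SOFA API (see the sofaconventions.org link in section 3.4.3); the variable names are illustrative, and the metadata handling is reduced to the essentials.

```matlab
% Store one subject's HRIR set as SimpleFreeFieldHRIR.
% irs: [M directions x 2 receivers x N samples]; pos: M x 3 rows of
% [azimuth elevation radius] in the SOFA spherical convention.
SOFAstart;
obj = SOFAgetConventions('SimpleFreeFieldHRIR');
obj.Data.IR = irs;
obj.Data.SamplingRate = 48000;
obj.SourcePosition = pos;
obj = SOFAupdateDimensions(obj);
SOFAsave('subject_01.sofa', obj);
```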

3.3.3 HRTF database preparation in SOFA format

3.4 Selecting the most relevant perceptual criteria and isolation of the responsible technical parameters

3.4.1 Perceptual criteria

In order to test the feasibility of a perceptually motivated HRTF selection procedure that could help the listener find a solution matching their personal preference, a specific approach is needed in which the qualities of the desired HRTF are categorized into a number of perceptual criteria [Reardon et al. 2017]. These criteria ideally represent all main auditive aspects that affect the QoE and could potentially be of interest to the listener [Olko et al. 2017]. Selecting criteria that are as mutually isolated as possible could have a significant advantage in this sense, because it would potentially allow the different HRTF parameters responsible for the different perceptual effects to be isolated more precisely, and could also reduce the time the listener needs to make a selection. Recent research offers a set of perceptual attributes resulting from a consensus vocabulary protocol produced by a panel of sound engineers, experts in spatial audio and binaural mixing [Simon, Zacharov, and Katz 2016]. Because, similarly to the current work, these criteria are placed in the context of improving the selection of non-individual HRTFs when individual HRTF measurements are not available for a given listener, the proposed perceptual attributes represent a suitable starting point for the design of the investigated interactive HRTF selection procedure. In order to further optimize the design structure and minimize the required rendering complexity, the perceptual attributes were reduced to a total of five: sound coloration, front-back differentiation, externalization, horizontal localization and vertical localization. Because the pure HRTF does not contain room information, all other aspects and criteria concerning the room acoustic environment or the qualities of the rendering engine were neglected.

3.4.2 Possible relevant technical parameters

After the perceptual criteria were selected, the next step was to find the most efficient ways to affect those criteria, in order to be able to offer sufficiently distinctive options to the listeners, so that they can more easily express their preference.

As already pointed out at the beginning, the exact function of the human hearing apparatus is still not completely clear, which forces the assumption that there could be multivariate relations between the physical parameters and the perceptual criteria [Silzle 2007]: one single parameter could affect more than one criterion, or a single criterion could be affected by a mix of parameters in particular proportions. This is why special attention was needed to identify possible interactions between parameters.

Another noteworthy point is the significance of the rendering system and its influence on the perception. Currently, most of the efforts to establish standards concerning VR technology aim at application development, production and distribution formats², but the issues concerning the personal consumer binaural rendering system are still not being addressed in a direct manner.

² https://www.khronos.org/news/press/khronos-announces-vr-standards-initiative
http://www.bbc.co.uk/rd/projects/binaural-broadcasting
http://www.vr-if.org/guidelines/

Multiple studies have shown the significance of the headphone transfer function for the quality of the binauralization [Schärer and Lindau 2009], and despite some efforts to address this issue [B. B. Boren et al. 2014], the common assumption remains that the end-user will just use the headphones at their disposal, no matter the difference in QoE. Moreover, there is still no clear consensus concerning the target transfer function [Fleischmann, Silzle, and Plogsties 2012] that a reference headphone should have, which can be partially explained by the high variance between the transfer functions of individual ear canals [A. T. Christensen et al. 2013] and the difficulties of acquiring them [Hoffmann, F. Christensen, and Hammershøi 2013].

This brings up once more the aspect of personal preference concerning the magnitude response provided by a particular headphone with respect to the particular ear canal properties and the given headphone placement on the head. Because the consumer's headphone choice in most cases depends not on particular standards or guidelines ensuring optimal QoE of VR content, but only on personal preference and/or economics, the impact of that choice on the QoE of the binauralized content has to be considered on the rendering side of the reproduction chain. Such considerations naturally focus attention on the common technical aspects of consumer headphones as a valid parameter significantly affecting the overall perception of sound coloration [Olive, Welti, and Khonsaripour 2017a], [Olive, Welti, and Khonsaripour 2017b]. Another important aspect is the diffuse-field equalization of the headphone magnitude response and its relevance to perceptual preference. Despite the common assumption that DFEQ is a necessity when it comes to binaural audio reproduction, perceptual evaluations of this equalization do not always show a clear consensus across diverse content [Lorho 2009].

Based on those common technical aspects of headphone TFs and the diffuse-field equalization, it was decided to use the individual DFEQ from the HRTF measurements as a technical parameter for broadband adjustment of the spectrum that significantly affects the subjective evaluation of the overall sound coloration. After some informal listening tests, implemented in Max 7 (MSP)³ using the IRCAM Spat⁴ package and conducted in a smaller circle, it was decided to present three sufficiently distinctive options to the listener: the DFEQ and two derivatives with differently scaled magnitudes, factors 0.8 and 1.2 respectively, chosen as a good compromise between noticeability and effectiveness.
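A minimal Matlab sketch of how such scaled DFEQ variants could be derived; scaling the equalizer magnitude in the dB domain and rebuilding the filter with fir2 are assumptions about the exact processing, which is not specified above.

```matlab
% Scale a DFEQ's magnitude response (in dB) by a factor, e.g. 0.8 or 1.2,
% and rebuild it as an FIR filter. Requires the Signal Processing Toolbox.
function eqScaled = scaleDfeq(eqIr, factor, nTaps)
    n     = 2^nextpow2(numel(eqIr));
    magDb = 20*log10(max(abs(fft(eqIr, n)), eps));
    half  = magDb(1:n/2+1) * factor;             % scale the EQ depth in dB
    fAx   = linspace(0, 1, n/2+1);               % normalized frequency axis
    eqScaled = fir2(nTaps, fAx, 10.^(half/20));  % use an even order, e.g. 512
end
```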

The perceptual effect of externalization is also commonly described as 'out-of-the-head hearing', which naturally involves the subjective perception of the distance between one or more sound sources in three-dimensional space and the listener's own position. It is already clear that the presence of room acoustics can significantly affect the perception in this context; still, the perceptual effects of externalization in the context of a VR room simulation require further investigation [Klein et al. 2017]. This becomes even more important in the case of a 6DoF-based VR environment. As already pointed out, the current work focuses on the pure HRTF model. The aspect of room acoustics is not part of the investigated HRTF selection procedure. Accordingly, the HRTFs used in this experiment did not contain any room reflections or reverb.

http://www2.iis.fraunhofer.de/mpeghaa/papers/AES137_MPEG-H_v14_final.pdf
³ https://cycling74.com/
⁴ http://forumnet.ircam.fr/product/spat-en/

Figure 11: IIS database, DFEQ of subject 67 with two scaled versions. Blue - original DFEQ, yellow - scaled DFEQ with factor 0.8, red - scaled DFEQ with factor 1.2

An interesting insight into the importance of the dynamics of the magnitude response as part of the pure HRTF is given in the investigation of Juha Merimaa [Merimaa 2010], which shows an inverse relation between the criteria of timbral coloration and externalization with respect to the spectral dynamics. His investigation showed that by manipulating the spectral dynamics, a trade-off between coloration and externalization according to the particular application is possible. This has motivated the selection of the spectral dynamics of an HRTF as the technical parameter to influence the perceptual criterion externalization, but also to test the expected effect on sound coloration, given that the two criteria can influence each other. A newer study also confirms such a relation [Salmon et al. 2017]. Initial informal listening tests showed that spectral dynamics scaling with factors 0.5 and 2 represents a good compromise between noticeability and excessive sound coloration (a sketch of this manipulation follows later in this subsection).

Often the effect of externalization is also considered as the ability to differentiate between auditive events occurring in front of or behind the listener's head. Because the impact of ITDs on lateral localization has already been intensively investigated [Jens Blauert 1997], [Estrella 2011], the general assumption is that front-back differentiation and externalization are mostly due to monaural cues contained in the spectrum and less due to ITDs, given their absence in the case of a source without low-frequency components placed on the median plane [Langendijk and Bronkhorst 2002]. Other considerations concern the importance of individual ear symmetry, which has been partly confirmed in a recent study [Bomhardt and Fels 2017], or the importance of the acoustic resonances of the outer ear, also known as pinna notches [Clarke and Lee 2017]. All these factors need to be considered when searching for the most relevant responsible parameter, and because the exact relations still remain to be discovered, a decision was made to statistically analyze the individual HRTFs from the IIS, ARI and IRCAM databases and look for further clues.

The human perception of the direction of auditive events depends on many factors simultaneously, whose relations change depending on the source movement and position [Jens Blauert 1997]. Those exact relations are still being investigated, but there are already indications pointing to the important role of the ITD for the localization of sources on the horizontal plane [Wightman and Kistler 1992]. This has motivated a number of researchers to look for different possibilities to acquire or approximate the individual ITD as alternatives to an acoustic measurement [Estrella 2011], [Juras, Miller, and Roginska 2015]. Given the existing information and scientific work to date, it is clear that the ITD is highly significant for localization on the horizontal plane, which makes it the natural technical parameter responsible for this perceptual criterion. Still, a preliminary test was conducted to verify the degree of parameter change in the context of the VR environment. The manipulation of the ITD has already been investigated and perceptually validated [Busson, Nicol, and Katz 2005], so the tools required for controlling the parameter are already known.
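Returning to the spectral-dynamics manipulation selected above (scaling factors 0.5 and 2), here is a minimal Matlab sketch of the idea: the log-magnitude is split into a smooth trend and a fine-structure residual, and only the residual is scaled. The moving-average smoother, its window size and the minimum-phase reconstruction are illustrative assumptions, not Merimaa's exact procedure.

```matlab
% Compress (factor < 1) or expand (factor > 1) an HRIR's spectral dynamics.
function hOut = scaleSpectralDynamics(h, factor)
    n     = 2^nextpow2(numel(h));
    magDb = 20*log10(max(abs(fft(h, n)), eps));
    trend = movmean(magDb, 65);               % coarse spectral envelope
    magSc = trend + factor*(magDb - trend);   % scale only the fine structure
    lin   = 10.^(magSc/20);
    % rebuild a causal IR with the new magnitude (minimum-phase idiom)
    hMin  = real(ifft(exp(conj(hilbert(log(lin))))));
    hOut  = hMin(1:numel(h));
end
```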

The elevation cues responsible for height perception, also known as localization on the median plane, are generally considered monaural and heavily dependent on the PRTF [Spagnol, Geronazzo, and Avanzini 2010], [Fan, Wan, and McMullen 2016]. While several pieces of evidence for the relation between the pinna resonances and height perception have already been discovered [Spagnol and Avanzini 2015], the exact contribution of the PRTF compared to other parameters like shoulder reflections or head scattering effects is still to be clarified. This motivated many researchers to examine the structure of the pinna and the elements responsible for the resonances, also known as pinna notches, in an attempt to extract and manipulate them [Raykar, Duraiswami, and Yegnanarayana 2005]. While the relation between the frequency positions of the notches and the elevation angle is obvious, there is still no clear evidence whether, on the perceptual side, this relation is valid only for a particular spectral range or up to a particular frequency border [Schönstein and Katz 2010], or whether it is valid for the whole spectrum [Ghorbal et al. 2017]. Middlebrooks introduced the idea of reducing the inter-subject differences in HRTF spectra by means of frequency scaling [Middlebrooks 1999]. Because of its efficiency in that particular case and its potential to affect the examined criteria on a larger scale, it was decided to test frequency scaling as a possible way of effectively influencing the perception of elevation.

3.4.3 Analysis of HRTF databases

A logical step in the search for the QEs responsible for particular perceptual effects was the comparative analysis of existing databases with individually measured HRTFs. Thanks to the SOFA standard, several high-quality HRTF databases are freely available for download and analysis5, which enables statistical comparisons in terms of different HRTF parameters like ITD margins, direction-dependent spectral patterns, anthropomorphic features etc. For the purpose of the current research, a deeper look was taken into the databases from IIS, ARI, IRCAM and CIPIC. Their relatively similar technical specifications as well as the well documented additional information were important factors, since suitable sets for comparisons and statistical analyses were required. Apart from these, there are also other freely available sets, but dissimilarities in the specifications, like incompatible spatial layout or resolution, missing meta-data etc., narrowed the choice to the aforementioned databases.

3.4.4 ITD analysis

For the purpose of the ITD analysis, the SOFA HRTFs were loaded in Matlab and cross-correlation (xcorr) was used to determine the time lag between the left and right ears, given in samples.
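A minimal sketch of this step is shown below, assuming the SOFA Matlab API (SOFAload) and hypothetical names such as 'subject.sofa' and dirIdx; the thesis's actual script may differ in detail.

    % Minimal sketch: estimate the ITD in samples for one measurement
    % direction by cross-correlating the two ear impulse responses.
    Obj = SOFAload('subject.sofa');                 % SOFA Matlab API
    hrirL = squeeze(Obj.Data.IR(dirIdx, 1, :));     % left ear IR, direction dirIdx
    hrirR = squeeze(Obj.Data.IR(dirIdx, 2, :));     % right ear IR

    [c, lags] = xcorr(hrirL, hrirR);                % full cross-correlation
    [~, iMax] = max(abs(c));                        % lag of maximum correlation
    itdSamples = lags(iMax);                        % ITD in samples (sign = leading ear)
    itdSeconds = itdSamples / Obj.Data.SamplingRate;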

Figure 12: Raw ITDs from all 67 subjects of the IIS database - all LS from elevation 0°

5https://www.sofaconventions.org/mediawiki/index.php/Main_Page

Figure 13: Raw ITDs from all 67 subjects of the IIS database - all LS from elevation 39°

Figure 14: ITDs sorted from min to max from all 67 subjects of the IIS database - all LS from elevation 0°

Figure 15: ITDs sorted from min to max from all 67 subjects of the IIS database - all LS from elevation 39°

The plots show the characteristic time delay curve describing the delay change based on the source position with respect to the subject. It is to be noted that the artifacts of the ITD curve seen when the source fires from the area around 90° and -90° are most probably caused by irregularities in the sitting position of the subjects. The difference between the smallest and biggest ITD on the middle loudspeaker ring (ear height elevation) is 14 samples, and the difference on the upper loudspeaker ring (elevation = 39°) is 7 samples. The mean values are roughly 39 and 23 samples for the middle and upper ring respectively, and the standard deviations are roughly 2.9 and 1.5. An analytical comparison between the signal measurements and the anthropometric measurements showed agreement concerning the ITDs.

A similar analysis was made with the ARI and IRCAM databases.

Figure 16: ITDs all subjects - ARI, all LS from elevation 0°

Figure 17: ITDs all subjects - IRCAM, all LS from elevation 0°

3.4.5 Main pinna notch analysis

Because of the significance of the PRTF for the binaural cues and spatial hearing, a detailed look was taken into the pinna resonance with the highest energy contribution. This resonance, called the main pinna notch throughout the rest of this work, in most cases causes an obvious narrow-band energy drop in the magnitude spectrum of the HRTF in the frequency range above 5 kHz [Iida et al. 2007]. In order to gain an overview of the main notch behavior with respect to the source position on the horizontal plane, the magnitude spectra from the IIS, ARI, Listen (IRCAM) and CIPIC HRTF databases were analyzed. The HRTFs from all subjects of the IIS database were plotted to provide some initial impressions about the position and size of the main pinna notch across subjects.

As seen in the figures that follow, despite the obvious differences concerning the position and shape of the main pinna notch, a subtle trend could be recognized showing some difference in the distribution depending on whether the sound source is situated in front of or behind the listener. This becomes even clearer when visualizing the magnitude spectra of all source directions per subject at elevation 0°. With many of the subjects, the frequency deviation of the main pinna notch could be clearly followed even if its shape varies with the source direction. It seems that when the sound source is behind the listener, the main notch6 position almost always lies higher in the frequency spectrum than when the sound source is in front. Moreover, the frequency area of the main notch at azimuth 0° has less energy compared to azimuth 180°. The spatial color plot from subject 19 represents a good example of this relation.

Figure 18: IIS - 67 subjects, left ears, azimuth 0°, elevation 0°

Figure 19: IIS - 67 subjects, right ears, azimuth 0°, elevation 0°

Figure 20: IIS - 67 subjects, left ears, azimuth 180°, elevation 0°

Figure 21: IIS - 67 subjects, right ears, azimuth 180°, elevation 0°

After some informal listening tests in Max 7, the author recognized the effect of the main notch position, and the difference in its position between azimuth 0° and azimuth 180°, as a significant technical parameter affecting the perception of the sound source in front of or behind the head.

The next step was deciding which exact notch frequency positions should be offered to the listeners for selection. In order to find these, a purely statistical approach was used to define the subsets covering the largest number of potential matches. The wish to base this on as much data as possible motivated the analysis of several HRTF databases.

6 The term notch is consistently applied in this particular context only to systematically illustrate a particular trend. For many subjects/directions it would be arguable whether the dips in the spectrum could be described as notches.

Figure 22: IIS database - Subject 19, spectral plot of left and right ears over all LS with elevation 0°. The horizontal black line shows the main notch frequency for all directions. Note the differences between directions 0° and 180°

To be able to accomplish the analysis in an efficient manner, a script was developed in Matlab to automatically detect and report the notch frequencies in a predefined range.
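The following is a minimal sketch of such a detection step, not the thesis's actual script; the FFT size, the search range and the use of findpeaks are assumptions for illustration, and hrir is a hypothetical variable holding one ear's impulse response.

    % Minimal sketch (assumptions: search range and FFT size are illustrative).
    % Detect the deepest narrow-band dip ("main notch") above 5 kHz.
    fs   = 48000;
    nfft = 1024;
    H = 20*log10(abs(fft(hrir(:), nfft)));        % magnitude spectrum in dB
    H = H(:).';                                   % row vector, matching f
    f = (0:nfft-1) * fs / nfft;                   % frequency axis in Hz
    inRange = f >= 5000 & f <= 13000;             % predefined search range

    % Dips in the spectrum are peaks of the inverted spectrum.
    [~, iNotch] = findpeaks(-H(inRange), 'SortStr', 'descend', 'NPeaks', 1);
    fRange = f(inRange);
    mainNotchHz = fRange(iNotch);                 % reported notch frequency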

Figure 23: Histogram of the main notches from all left ears from the IIS, ARI, IRCAM, CIPIC databases at azimuth 0°, elevation 0° (304 samples, 76 bins)

Figure 24: Histogram of the main notches from all right ears from the IIS, ARI, IRCAM, CIPIC databases at azimuth 0°, elevation 0° (304 samples, 76 bins)

Figure 25: Histogram of the main notches from all left and right ears combined from the IIS, ARI, IRCAM, CIPIC databases at azimuth 0°, elevation 0° (608 samples, 76 bins)

The three frequencies selected as most typical and sufficiently distant from each other were 7000 Hz, 8400 Hz and 9800 Hz. It is to be noted that for the purpose of the proposed HRTF selection procedure, the ear asymmetry normally present with individual measurements has not been taken into account in the HRTF manipulations applied.

4 Implementation of the VR-environment

4.1 System overview

Figure 26: IIS - mobile VR workstation consisting of Windows PC, MacPro and PC laptop. Unity VR rendering to Oculus Rift via NVIDIA GTX1080. Immersive triple display configuration with 4K KVM switch. RME Madiface XT, RME ADI-2 Pro, Avid Artist Mix. Dual headphones output with independent volume control.

The actual design of the VR workstation used for the test-design implementation was the subject of many considerations regarding different application scenarios, and was improved over several months. The mobility of the system allows for easy relocation and deployment in different rooms at the Fraunhofer IIS. The dual Mac-and-PC design, in conjunction with the dynamic KVM assignment and the flexible audio routing over MADI, enables a wide palette of combinations, including dynamic switching between headphones and loudspeakers.

For the purpose of the HRTF selection procedure, the system was located in a small edit suite with moderate sound proofing. The PC configuration utilizes an extremely silent construction.

4.2 VR-rendering

The VR environment was implemented in Unity 3D as separate scenes containing basic visual elements. The newly developed external Unity plug-in SOFAlizer-for-Unity7 enabled dynamic binaural synthesis as a C# asset attached to a VR object. It enabled the flexible scene design required to prepare stimuli suitable for the different trials, as well as dynamic switching between different HRTFs in SOFA format. Being able to use the Oculus Rift as a head-tracker for both audio and video is a clear advantage. The binaural renderer was specifically compiled with the automatic normalization of the loaded HRTFs disabled. Because it does not apply any interpolation during rendering, but only convolution in the frequency domain, all the HRTFs prepared for the selection procedure were interpolated beforehand to 3° azimuth resolution and 5° elevation resolution using the 'pchip' algorithm in Matlab.
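A minimal sketch of such a pre-interpolation step for one elevation ring is given below; the data layout (hrirRing as a matrix of HRIRs over azimuth) and the 15° measured grid are assumptions for illustration, and interpolating raw time-domain IRs sample-wise is only one simple possibility.

    % Minimal sketch (assumed layout): densify one elevation ring from a 15°
    % measured azimuth grid to the target 3° resolution using 'pchip'.
    % hrirRing: [nAz x nSamples] HRIRs, azMeas: measured azimuths in degrees.
    azMeas  = 0:15:345;                        % illustrative measured grid
    azDense = 0:3:357;                         % target 3° azimuth resolution

    % Wrap the first measurement to 360° so the interpolation closes the circle.
    azWrap   = [azMeas, 360];
    hrirWrap = [hrirRing; hrirRing(1, :)];

    % interp1 with 'pchip' interpolates column-wise, i.e. per IR sample index.
    hrirDense = interp1(azWrap, hrirWrap, azDense, 'pchip');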

4.3 Controller

Figure 27: Xbox One controller used by the subjects as dynamic tactile input in the VR environment. D-pad horizontal axis controls the sound level, D-pad vertical axis controls the ratings between the conditions (HRTFs). The 'Task' button shows/hides the trial description. X, A, B, Y on the right were used to switch between the conditions. The bumper and trigger on the front side were dynamically programmed for each trial.

7https://github.com/sofacoustics/SOFAlizer-for-Unity

4.4 GUI

Figure 28: Screenshot showing the start screen of the trial 'horizontal localization'. On pressing 'Task' on the controller, the description disappears and the trial begins. A second press pauses the trial and brings back the description.

Figure 29: Screenshot from the actual selection. The red cross indicates deviation of the viewing direction from the green bar. On the top left some ratings are visible on the sliders. The currently selected condition is B with 57 points.

5 Preliminary listening tests

Because the effect of the DFEQ was already widely discussed in a variety of studies, and obvious enough throughout the informal listening tests, it was decided to proceed with testing other parameters and their perceptual impact in a VR environment. Furthermore, it was important to investigate the relation between the two criteria sound coloration and externalization.

5.1 Sound coloration and spectral dynamics

This test aimed at judging the ability of the listeners to recognize differences in the spectral TF, particularly changes in the spectral dynamics, and to express their preference according to the perceptual criterion 'natural sounding'. A total of 17 listeners participated. For 10 of the listeners the manipulation of the spectral dynamics was applied to their individually measured HRTFs; for the remaining 7 listeners the same manipulation was applied to a dummy head HRTF - the spherical far-field compilation of the KU 100 from the publicly available database of TH Köln8. In order to gain a better understanding of the impact of spectral dynamics scaling on the perception of natural sound coloration, three candidates were offered for selection. One was the unaltered original HRTF. The second was a manipulated version of the HRTF with the spectral dynamics reduced by a factor of 0.3. The third was a manipulated version of the HRTF with the spectral dynamics increased by a factor of 3. The manipulation was applied element-wise in the range 1.5-12 kHz via a custom designed window function, gradually increasing from factor 1 at the range borders to the defined scaling factor in the center area of the window. The range was selected so that it covers all three potentially selected notch frequencies.
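The sketch below illustrates one way such a windowed scaling could look; the raised-cosine window, the dB-domain formulation and the moving-average envelope are assumptions for illustration, not the custom design used in the thesis (HdB is a hypothetical magnitude response in dB).

    % Minimal sketch (assumptions: dB-domain scaling around a smoothed envelope;
    % the window shape is illustrative). Scale the spectral dynamics in 1.5-12 kHz.
    fs   = 48000;
    nfft = numel(HdB);                         % HdB: row vector, magnitude in dB
    f    = (0:nfft-1) * fs / nfft;

    % Raised-cosine window: 0 at the range borders, 1 in the center area.
    w = zeros(size(f));
    band = f >= 1500 & f <= 12000;
    w(band) = 0.5 - 0.5*cos(2*pi*(f(band) - 1500)/(12000 - 1500));

    % Blend from factor 1 at the borders to the target factor in the center.
    scale = 3;                                 % e.g. 0.3 or 3 as in the test
    gain  = 1 + w*(scale - 1);

    % Scale deviations from a coarse spectral envelope, keeping overall level.
    ref    = movmean(HdB, 31);
    HdBout = ref + gain .* (HdB - ref);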

Figure 30: IIS - Subject 1, spectral dynamics scaling - left ear, azimuth 30°, elevation 0°. Blue - original magnitude, red - scaling factor 0.3, yellow - scaling factor 3

Figure 31: IIS - Subject 1, spectral dynamics scaling - right ear, azimuth 330°, elevation 0°. Blue - original magnitude, red - scaling factor 0.3, yellow - scaling factor 3

8http://audiogroup.web.th-koeln.de/index.html

The stimuli used were four music tracks of different genres, which the listener could switch at will via the Xbox controller's bumpers and triggers.

- Classic: Beethoven, Symphony N.5, fourth movement "Allegro"
- Pop: Sia, "Unstoppable"
- Heavy metal: Audioslave, "Cochise"
- Jazz: Gershwin, "Summertime"

From all 4 pieces the loudest chorus parts, each around 30 seconds long, were selected and looped, so that a more efficient comparison is possible in the absence of transport controls. The loudness differences were reduced but not completely eliminated in order to retain the natural genre-based loudness differences. For this test all stimuli were pre-binauralized in Matlab (conv) so that the left stereo channel was convolved with the HRTF IR at azimuth 30°, elevation 0°, and the right stereo channel was convolved with the HRTF IR at azimuth 330°, elevation 0°. This arrangement mimics a standard stereo loudspeaker layout that is perceptually familiar to the listener. The rendering in the VR environment was implemented without head-tracking and without visual cues related to the stimuli. This was important to eliminate possible dynamic sound coloration caused by the renderer during IR switching, and also to focus the listener's attention only on the timbral qualities of the sound rendering, without distraction by other related criteria or issues. The main task was to rate the three conditions with different spectral dynamics. The listeners were instructed to focus solely on the sound coloration and to try to ignore any other effects possibly caused by the manipulation. At first, they had to confirm whether they heard differences between the offered conditions. If they did not, they had to rate all conditions equally at 50 points. If there were differences, the listeners had to decide which condition provided the most natural sound coloration and give it 100 points. The condition providing the least natural sound coloration had to receive 0 points. The third remaining condition could be rated freely, and could be rated equally with one of the other two if there was no perceivable difference.
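A minimal sketch of this pre-binauralization, assuming hypothetical file and variable names (hrir30 and hrir330 as [N x 2] two-ear HRIRs of equal length for the two directions):

    % Minimal sketch: pre-binauralize a stereo track over two virtual
    % loudspeakers at azimuth 30° and 330°, elevation 0°.
    [x, fs] = audioread('stimulus.wav');       % x: [nSamples x 2] stereo track

    % Each channel is convolved with both ears of its HRIR; the ear signals
    % of the two virtual loudspeakers are then summed.
    earL = conv(x(:,1), hrir30(:,1)) + conv(x(:,2), hrir330(:,1));
    earR = conv(x(:,1), hrir30(:,2)) + conv(x(:,2), hrir330(:,2));

    y = [earL, earR];
    y = y / max(abs(y(:)));                    % avoid clipping on export
    audiowrite('stimulus_binaural.wav', y, fs);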

Figure 32 shows the results of the test from all subjects.

One of the interesting observations is that out of the 10 subjects with individual HRTFs, only 2 rated their unprocessed HRTFs as most natural sounding. The majority of the listeners found the reduced spectral dynamics to sound more natural than the increased spectral dynamics; still, three listeners had the exact opposite perception. It must be pointed out that reducing the spectral dynamics naturally also reduces the ISDs, as described by Merimaa [Merimaa 2010], even when applied peak-based and independently to each ear. This could explain the reports from some of the subjects about a noticeable change in the symmetry of the sound image between the candidates. The HRTF of the KU 100, because of its nature, contains comparably fewer ISDs than the individual HRTFs. This could partially explain the differences between the judgments based on individual HRTFs and the judgments based on the KU 100 HRTF. Still, a clear consensus concerning the perceptual effect of the spectral dynamics manipulation is difficult to see, despite the obvious significance of the parameter.

Figure 32: Preliminary listening test sound coloration - table with the results from 17 listeners. 100 = most natural, 0 = least natural. The green highlighting indicates that the reduced spectral dynamics was rated more natural sounding than the increased spectral dynamics. The yellow highlighting indicates the opposite. The listeners having their own HRTFs as input for the manipulation are marked in red; the blue highlighting indicates the dummy head HRTF.

5.2 Externalization, generic HRTF, DFEQ and main notch effect

In order to be able to test the influence of notch presence or absence on the externalization perception, it was important to have an HRTF base that is as neutral as possible. The KU 100 was considered a valid option, but because of its synthetic nature, it was difficult to recognize a direct relation to the available individual HRTFs based on the spectrum plots. This motivated a rather experimental approach to the problem.

A generic HRTF was synthesized from the IIS database by averaging all 67 HRTFs. The absolute spectra were averaged over all subjects, for each direction separately. The same was done with the ITDs, separately for the two available loudspeaker rings. The result was an average HRTF over all subjects, with a much smoother magnitude spectrum than the individual HRTFs, while still showing a closer relation to them than the KU 100. After some initial listening tests it was soon recognized that, perceptually and systematically, the newly synthesized HRTF, called simply 'generic' in the following, is the better input for further manipulations.
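A minimal sketch of this averaging, under an assumed data layout (hrirAll as a 4-D array of all subjects' HRIRs); the zero-phase reconstruction shown is an illustrative simplification, since the thesis recombines the averaged magnitude spectra with the separately averaged ITDs:

    % Minimal sketch (assumed layout): average the absolute spectra of all
    % subjects per direction and ear.
    % hrirAll: [nSubjects x nDirections x 2 x nSamples] time-domain HRIRs.
    nfft  = 512;
    H     = fft(hrirAll, nfft, 4);             % spectra along the sample axis
    Hmean = squeeze(mean(abs(H), 1));          % [nDirections x 2 x nfft]

    % Back to the time domain; shown here as a zero-phase reconstruction only
    % (the averaged ITDs would be reapplied afterwards).
    hrirGen = ifft(Hmean, nfft, 3, 'symmetric');
    hrirGen = circshift(hrirGen, nfft/2, 3);   % center the zero-phase IRs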

To test the effect of the main pinna notch on the perception of externalization in the context of different DFEQs, a separate scene in Unity was created. It contained a moving audio-visual object representing a talking male person reciting a short text passage, looped over time. A slow shuttling movement was intentionally restricted to around azimuth 0° (±15°), elevation 0°, to focus on the binaural cues in the frontal area. Head-tracking was enabled and the Xbox controller was programmed so that the listener could 'teleport' the object behind them (azimuth +180°) at a button press, retaining the shuttling pattern. A separate option was also provided to pause the movement completely if desired. Another object was placed in the scene to indicate the required viewing direction. The task description also stated that the listeners were allowed to close their eyes if desired, and that they could experiment with different object teleporting speeds.

The first offered condition was the pure generic HRTF; the second was the generic HRTF with an added notch at 8400 Hz; the third condition had a broadband reduction of the DFEQ's spectral dynamics by a factor of 0.8, with the same notch applied scaled by a factor of 2.

Figure 33: IIS - generic HRTF, right ear - azimuth 0°, elevation 0°. Red - pure generic, yellow - generic with notch, blue - generic with scaled DFEQ and notch.

Figure 34: IIS - generic HRTF, right ear - azimuth 180°, elevation 0°. Red - pure generic, yellow - generic with notch, blue - generic with scaled DFEQ and notch.

To provide a better overview of the manipulation across all azimuth directions, the following surface plots show the differences of the TFs between the three conditions offered for rating according to the criterion externalization.

Figure 35: IIS - generic HRTF, left ear at elevation 0°.

Figure 36: IIS - generic HRTF, right ear at elevation 0°.

The notch application was implemented as an automated script in Matlab with options to change different parameters such as the notch frequency at azimuth 0° and 180°, as well as the notch shape.
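A minimal sketch of such a parametric notch insertion is shown below; the Gaussian dip shape, depth and width are illustrative assumptions and not the thesis's actual parameters (HdB again denotes a hypothetical magnitude response in dB):

    % Minimal sketch (assumptions: Gaussian dip in the dB domain; depth and
    % width are illustrative). Insert a notch with a switchable center frequency.
    fs      = 48000;
    nfft    = numel(HdB);
    f       = (0:nfft-1) * fs / nfft;      % frequency axis of HdB
    fc      = 8400;                        % notch center in Hz (e.g. azimuth 0°)
    depthDb = -12;                         % notch depth in dB
    bwHz    = 1200;                        % notch width (FWHM) in Hz

    notch = depthDb * exp(-(f - fc).^2 / (2*(bwHz/2.355)^2));
    HdBnotched = HdB + notch;              % apply in the dB domain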

Figure 37: IIS - generic HRTF with notch, left ear at elevation 0°.

Figure 38: IIS - generic HRTF with notch, right ear at elevation 0°.

Figure 39: IIS - generic HRTF with scaled DFEQ and notch, left ear at elevation 0°.

Figure 40: IIS - generic HRTF with scaled DFEQ and notch, right ear at elevation 0°.

As with the sound coloration test, the listeners were forced to give 100 and 0 points to two of the conditions if they heard differences concerning the perceptual criterion externalization. Figure 41 shows the results of the test from all subjects.

Similarly to the sound coloration test, the variation of the selected technical parameters showed a significant impact on the externalization perception. Still, it could be seen that the manipulation could lead to different perceptual extremes with different subjects. Despite the efforts put into explaining the importance of focusing on the single questioned criterion, some of the subjects reported after the test that their judgment was influenced by the difference in sound coloration. It is also worth mentioning that, although no normalization was applied to the HRTFs after the manipulations, none of the listeners reported loudness differences between the conditions. A couple of subjects reported that the perceivable differences were bigger when the object was placed behind them. It was also reported that closing the eyes helped some of the subjects to better perceive the change in effect. Other listeners reported that quickly alternating the object's position from front to back was helpful as well. Finally, some of the subjects took more than 10 minutes for their final decision, so it is possible that learning effects took place.

Figure 41: Preliminary listening test externalization - table with the results from 18 listeners. 100 = most externalized, 0 = least externalized. The green highlighting indicates that the generic with DFEQ and notch scaling was rated more externalized sounding compared to the pure generic. The yellow highlighting indicates the opposite.

5.3 Horizontal localization

In order to test the effect of ITD scaling on horizontal localization in a VR environment, a new Unity scene was created. It contained a static object that could be moved by the listener to two predefined directions on the horizontal plane. A helicopter object was used as stimulus, representing an optimal broadband signal most probably already known to the subject. The listener was instructed how to change the object's direction with the Xbox controller, first to -90° (right hand side), then to -60° (right hand side), and the task was to judge, for those two directions, which condition caused an auditive change of the direction of the sound event on the horizontal plane. They were forced to give 0 points to the condition causing a change in direction tending towards azimuth 0°, and to give 100 points to the condition causing a change in direction tending towards azimuth 180°. They were forced to begin with -90° in order to first quickly notice the change in effect, and then switch to -60° to make the final decision with less impact from the 'cone of confusion'.
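The thesis does not detail the ITD scaling implementation; the following is a minimal sketch of one common approach, shifting the two ears' impulse responses in opposite directions by an amount derived from an onset-based ITD estimate (hrirL/hrirR are hypothetical single-ear HRIRs with sufficient leading and trailing zeros):

    % Minimal sketch (assumption: onset-based ITD scaling; not necessarily the
    % thesis's method). Scale the interaural delay of one HRIR pair.
    scale = 1.2;
    [~, onL] = max(abs(hrirL));            % crude onset estimate, left ear
    [~, onR] = max(abs(hrirR));            % crude onset estimate, right ear
    itd   = onL - onR;                     % ITD in samples
    shift = round((scale - 1) * itd / 2);  % split the change between both ears

    hrirLs = circshift(hrirL(:),  shift);  % needs leading/trailing zero padding
    hrirRs = circshift(hrirR(:), -shift);  % so the wrap-around stays silent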

Figure 42: IIS generic HRTF - ITD scaling, middle ring. Blue - original ITD, yellow - ITD scaled by factor 1.2, red - ITD scaled by factor 0.8

Figure 43: IIS generic HRTF - ITD scaling, upper ring. Blue - original ITD, yellow - ITD scaled by factor 1.2, red - ITD scaled by factor 0.8

It must be noted that in this particular VR setup, the viewing angle of the Oculus Rift covered not more than ±40°, which is why the subjects had to move their head to be able to see the visual position of the object. Again, a visual marker in the VR environment indicated the required viewing direction to enable optimal judgment conditions. Figure 44 shows the results of the test from all subjects. Despite the surprising results from listeners 14 and 16, a clear consensus among the remaining 16 subjects shows that increasing the ITDs causes a perceptual change in the localization towards azimuth 180°, and that decreasing the ITDs causes a perceptual change in the localization towards azimuth 0°. Many of the listeners expressed uncertainty when the object was placed at -90° (right hand side). This is most probably due to the effect of the 'cones of confusion', causing degradation of localization precision when sound sources are placed directly on the corresponding plane axis. Still, by subsequently positioning the object at -60° (right hand side), possible confusions were avoided and the final decision was reached easily.

Figure 44: Preliminary listening test horizontal localization - table with the results from 18 listeners. 100 = the direction changes towards azimuth 180°, 0 = the direction changes towards azimuth 0°. The green highlighting indicates that increasing the ITD causes a change in the localization towards azimuth 180°. The yellow highlighting indicates the opposite.

5.4 Vertical localization

In order to test the effect of frequency scaling on the elevation perception, a new Unity scene was created and three different conditions were offered. One was the unaltered generic HRTF, the second was the generic HRTF with frequency scaling by a factor of 1.15 applied on the upper loudspeaker ring, and the third was the generic HRTF with frequency scaling by a factor of 0.85 applied on the upper loudspeaker ring. The energy change caused by the IR resampling was intentionally not compensated, to account for the natural level drop with increasing sound source height.
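Since the text mentions IR resampling, the following minimal sketch shows frequency scaling implemented that way; the use of rat and resample and the zero-padding back to the original length are assumptions for illustration (hrir is a hypothetical single-ear HRIR):

    % Minimal sketch (assumption: frequency scaling via IR resampling).
    % A factor of 1.15 compresses the IR in time, shifting spectral features
    % up in frequency by roughly 15% when played back at the original rate.
    scaleFactor = 1.15;
    [p, q] = rat(scaleFactor, 1e-4);       % rational approximation of the factor
    hrirScaled = resample(hrir(:), q, p);  % shorter IR -> spectrum shifted up

    % Pad or truncate back to the original length for the renderer.
    n = numel(hrir);
    hrirScaled(end+1:n) = 0;
    hrirScaled = hrirScaled(1:n);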

Figure 45: IIS generic HRTF - left ear. Blue - azimuth 90°, elevation 0°; red - azimuth 90°, elevation 39°

Figure 46: IIS generic HRTF - right ear. Blue - azimuth 90°, elevation 0°; red - azimuth 90°, elevation 39°

Figure 47: IIS generic HRTF - frequency scaling, left ear, azimuth 90°, elevation 30°. Blue - original spectrum, yellow - frequency scaling by factor 1.15, red - frequency scaling by factor 0.85

Figure 48: IIS generic HRTF - frequency scaling, right ear, azimuth 90°, elevation 30°. Blue - original spectrum, yellow - frequency scaling by factor 1.15, red - frequency scaling by factor 0.85

The VR implementation of the test in Unity was based on a static elevated object placed at azimuth 90° (left hand side), elevation 30°. This direction was chosen to avoid possible 'cones of confusion' effects and to enable easier judgment of the perceptual elevation effect. To account for the inevitable timbral colorations with the different conditions, four different sound sources/objects were offered, so that perceptual evaluation with different stimulus types was possible. The listeners could switch freely between those sources via the Xbox controller. At the end, they had to assign 100 points to the condition sounding most elevated and give 0 points to the condition sounding least elevated. Figure 49 shows the results of the test from all subjects.

It was rather surprising to discover a very clear consensus among all subjects, with one exception, showing that frequency scaling by a factor of 1.15 caused a higher elevation perception and frequency scaling by a factor of 0.85 caused a lower elevation perception than the unaltered generic HRTF. Moreover, many of the subjects reported that the task was rather easy to accomplish. It is to be noted that, while all listeners heard the timbral changes, none of them reported unusual artifacts or pitch changes with any particular condition.

Figure 49: Preliminary listening test vertical localization - table with the results from 18 listeners. 100 = most elevated sounding, 0 = least elevated sounding. The green highlighting indicates that frequency scaling with factor 1.15 caused a higher elevation perception than the original and scaling with factor 0.85 caused a lower elevation perception than the original. The yellow highlighting indicates the opposite.

5.5 Data analysis and interpretation

To summarize the results of the preliminary tests, it was confirmed that the tested technical parameters significantly influenced the questioned perceptual criteria in the context of a VR environment and audio-visual stimuli.

The most confident results were acquired for the criteria horizontal localization and vertical localization, where the technical borders of the manipulation influenced the extremes on the perceptual side with small variance between the subjects. Because these two trials were described by many subjects as easy, it was decided to slightly reduce the parameter change values for the ITD scaling and the spectral frequency scaling, in order to maintain better statistical coverage and provide potentially more precise options in the following selection test. ITD scaling with factors 1.15 and 0.85, and spectral frequency scaling with factors 1.12 and 0.88, were chosen.

For the criterion externalization, the HRTF manipulation also proved to have a significant perceptual effect, but there was less obvious consensus among the subjects about the direction of the effect change. Nevertheless, the presence of a notch at 8400 Hz proved to influence the externalization effect even with different DFEQs.

The exact notch frequency position varies between people, and because it was already decided to split front-back differentiation and externalization into two different selection trials, notch frequency selection is applied in the front-back differentiation trial. Spectral dynamics scaling is applied in the externalization trial, covering all three possibly selected notch frequencies - 7000 Hz, 8400 Hz and 9800 Hz.

For the criterion sound coloration, the spectral dynamics scaling proved to have an impact on the perceptual effect, but there were many different interpretations among the subjects about the direction of the effect change. This means that an influence of a spectral dynamics change on the sound coloration is to be expected, but with an effect that is difficult to predict.

In the following charts (see fig. 50) the averaged results are presented as an overview of the preliminary tests. Because the tests were aimed strictly at identifying perceptual effects and not listening experience, the point bars shown are not to be confused with a preference ranking. They only represent an average trend of what perceptual effect could be expected from a particular HRTF manipulation concerning an isolated criterion.

Figure 50: Average results from the preliminary listening tests (100 points = corresponds fully to the criterion). From left to right: 1. Sound coloration (17 listeners) → Which one sounds most natural? (100 = most natural, 0 = least natural) 2. Externalization (18 listeners) → Which one of the three conditions sounds most externalized? (100 = most externalized, 0 = least externalized) 3. Horizontal localization (18 listeners) → Which one tends to sound towards azimuth 180° and which one tends to sound towards azimuth 0°? (100 = tends towards azimuth 180°, 0 = tends towards azimuth 0°) 4. Vertical localization (18 listeners) → Which one sounds most elevated? (100 = most elevated, 0 = least elevated)

6 Summary of the selected QF-QE pairs. Selection design.

A total of five perceptual quality features and five physical quality elements were selected for further implementation.

- DFEQ → Sound coloration

- Main notch frequency position → Front-back differentiation

- Spectral dynamics scaling → Externalization

- ITD scaling → Horizontal localization

- Spectral frequency scaling → Vertical localization

Throughout several preliminary tests inside the implemented VR test-design, the manipulated technical parameters proved to have an impact on particular isolated criteria, perceivable by the subjects in the VR environment. All tests went without accidents or other issues concerning the health and well-being of the subjects.

It was confirmed that there are multivariate relations between the criteria sound coloration and externalization that need to be considered when designing the HRTF selection procedure.

The numerically synthesized average HRTF from the IIS database was selected as the base input for the selection procedure, because it approximates an individual HRTF (from the IIS database) more adequately and because it is more convenient for parametric manipulation.

Based on the observations so far, the following selection design was implemented.

Trial N.1 Sound coloration: The subject is presented with 4 pre-binauralized music tracks of different genres and must select the most pleasant sound coloration by switching between three different DFEQs with scaled magnitude responses (as shown in section 3.4.2). The three scaling factors are 1, 1.2 and 0.8. The listener is forced to try out all 4 music tracks. Head-tracking is not active.

Trial N.2 Front-back differentiation: The subject is presented with a dynamically jumping auditive object (male speech) with active head-tracking. The auditive object jumps rather quickly between two identical static visual objects. At the position of each visual object, a short sentence is recited while the auditive object is stationary. The two visual objects are represented by loudspeakers. One is placed at azimuth 0°, elevation 0°, the other at azimuth 180°, elevation 0°. The subject must switch between three conditions to select between three different main notch positions - 7000 Hz, 8400 Hz, 9800 Hz - and decide with which condition the perceivable direction of the audio object is easiest to identify as coming from the front or the back loudspeaker.

Trial N.3 Externalization: The subject is presented with a dynamically moving audio-visual object (talking male head) with active head-tracking. The auditive object moves rather slowly and oscillates between azimuth 30°, elevation 0° and azimuth -30°, elevation 0°. The subject must switch between the conditions to select one of three different spectral scaling factors - 0.5, 1, 2. They must decide with which condition the perceivable externalization or distance to the audio-visual object feels most realistic. The listener is forced to look forward at a visual anchor element.

Trial N.4 Horizontal localization: The subject is presented with an audio-visual object (helicopter) with active head-tracking. They have to move the object to azimuth -90°, elevation 0° with the tactile controller, then switch between three conditions with differently scaled ITDs, with factors 1, 1.15 and 0.85. The listener must decide with which condition the perceivable horizontal localization of the object feels most realistic. They are instructed to look straight forward at a visual anchor element during selection. After experiencing the effect with the object placed at azimuth -90°, they are forced to move the object to azimuth -60° and azimuth -30° and repeat the process to make the final decision.

Trial N.5 Vertical localization: The subject is presented with a static audio-visual object that can be dynamically exchanged (helicopter, male speech, music playing over a loudspeaker, violin) with active head-tracking. The object is placed statically at azimuth 90°, elevation 30°. The subject must switch between three conditions with different frequency scalings, with factors 1, 1.12 and 0.88, and decide with which condition the perceivable vertical localization of the object feels most realistic. They are instructed to look forward at a visual anchor element during selection. They are forced to try out all 4 objects and repeat the process to make the final decision.

A decision made in a particular trial about a certain parameter is retained and equally applied to all conditions of the next trial. After completing the selection procedure, the final selected condition (also called the 'winner' HRTF for convenience) is validated in a second listening test after a short break.

To validate the selected winner HRTF, the same trials will be offered to the listener to compare the winner HRTF against the unaltered generic HRTF, the KU 100 and the individual HRTF. All conditions in the validation stage will be rated with a score between 0 and 100.

Finally, the scores from all subjects will be averaged over the isolated trials.

7 Selection and validation tests

7.1 Database preparation for the selection procedure

Because of the implementation effort involved in a test-design of this kind, some aspects of the rendering system needed to be taken into account.

All of the parameter manipulations were made off-line, so that the computational requirements were kept low enough to enable the test-design to run on a consumer platform. This means that a new HRTF database had to be created for the purpose of the selection process, containing all possible outcomes. These pre-made HRTFs incorporated all chosen parameter changes, whose values were motivated by actual statistical examples, covering all 243 possible selection outcomes (five trials with three conditions each, i.e. 3^5 combinations).

Because the current implementation of the audio renderer required strict file names in order to load the HRTFs, separate Unity projects in a nested structure were created to be able to quickly navigate from trial to trial.

To overcome this hierarchical challenge and keep track of the selection progress, a proprietary file naming scheme was necessary that contained a coded description of the particular HRTF manipulation as well as the individual selection path.
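The actual naming scheme is not documented here; purely as an illustration of the idea, a hypothetical encoding of one selection path could look like this:

    % Hypothetical naming sketch - the real proprietary scheme differs.
    dfeq = 1.2; notchHz = 8400; sd = 0.5; itd = 1.15; fscale = 1.12;
    name = sprintf('gen_DFEQ%.2f_NOTCH%d_SD%.2f_ITD%.2f_FS%.2f.sofa', ...
                   dfeq, notchHz, sd, itd, fscale);
    % -> 'gen_DFEQ1.20_NOTCH8400_SD0.50_ITD1.15_FS1.12.sofa'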

Figure 51: Selection scheme - 243 possible outcomes. Only the branches from starting point A are shown. The same structure was applied for B and C.

7.2 Selection test

Some short video examples from the selection procedure are to be found on the associated USB stick.

Figure 52: Example protocol from one selection procedure.

7.3 Validation test

The validation of the selected 'winner' HRTF from the selection stage was carried out as a direct comparison against the HRTF of the KU 100, the generic HRTF and, if available, the individual HRTF. The comparisons were based on the same VR scenes and criteria used during the selection stage. The only difference was that this time the conditions had to be rated between 0 and 100 according to the subjective preference for the current criterion.

Figure 53: Results from all 12 listeners with individual HRTFs.

As shown by the results of the 12 listeners whose comparison included their own HRTFs, the differences between the ratings are generally too small to recognize a clear trend. Only in trial 1 do the generic and 'winner' HRTFs have a significantly higher score than the individual HRTF and the KU 100, with non-overlapping confidence intervals in different ranges. In trial 2 all conditions have close ratings. It must be pointed out that the individual HRTF and the KU 100 score slightly better than the generic and 'winner' HRTFs, but still within the same confidence range. The scores in trial 3 are similar, only this time the KU 100 shows lower performance compared to the rest, with a confidence range roughly 20 points lower. In trial 4 we again see close results with a slight lead for the generic and 'winner' HRTFs. In trial 5, despite the close results, the KU 100 again performs worse than the generic and 'winner' HRTFs, with the generic HRTF slightly in the lead.

Figure 54: Results from all 4 listeners without individual HRTFs.

The validation test with the 4 listeners without individual HRTFs shows a similar picture. All condition ratings are fairly close; only in trials 1 and 5 does the KU 100 show significantly lower performance than the generic and 'winner' HRTFs.

Based on the statistical evaluation of the results from the listening tests, the conclusion can be made that the 'winner' HRTF did not achieve a significant perceptual improvement over the generic HRTF in the isolated trials. The proposed selection method, as designed, failed to deliver the expected results.

One thing to note is the performance of the generic HRTF, which significantly outperformed the KU 100 in trials 1 and 5.

8 Discussion and outlook

In the current study, several relations between HRTF parameters and isolated perceptual criteria were confirmed, which allowed for a targeted influence on those criteria.

Throughout a series of listening tests, multivariate interrelations between QEs and QFs were investigated and evaluated in a VR environment. With very few exceptions, the majority of the listeners were able to familiarize themselves with the proposed interface relatively quickly. No complaints concerning the experience or potentially related health issues were reported during or after the tests.

The VR implementation of the suggested test-design proved its feasibility, because the listeners were able to distinguish between the conditions in combination with visual stimuli in an interactive environment. Many listeners reported that the visual element helped in making a decision.

The validation of the selection procedure has not shown a statistically significant improvement of the perceptual quality with respect to the five isolated perceptual criteria. It seems that the multivariate relations between the QFs and QEs were not handled effectively enough by the selection design, so that a particular decision on a given criterion caused side effects on other criteria evaluated at an earlier or later stage. Nevertheless, the majority of listeners were able to identify and express their personal preference at a particular selection stage, and for the most part the judgments were made confidently.

Because of renderer limitations, it was not possible to validate the performance of the conditions in the context of a complex audio-visual scene (with many spatial objects simultaneously), so no clear statement about the overall listening experience with the different conditions was possible based on the evaluation of the isolated criteria. It must also be pointed out that the validation stage brought a new dimension by offering conditions differing in many parameters simultaneously. That could have affected the listeners' judgment when transitioning from the selection to the validation stage, which were separated by a short pause.

Some insights during the investigation brought up further interesting aspects that need to be discussed. The individual HRTF, generally regarded as a reference in many studies, was rated surprisingly low in many of the tests compared to the alternatives. The reasons for this outcome need to be investigated further. It could be assumed that the ear asymmetry of the individual HRTFs probably contributed to the low score in the sound coloration trial.

One other possible factor could be that the perceptual impact of the particular VR context has its own environmental dimensions and causes a different spatial awareness than a real natural environment. This would mean that occurrences perceived as realistic in a real environment could be perceived as less realistic in a VR environment, given the same perceptual model.

Another noteworthy point is that the numerically synthesized generic HRTF, along with its derivatives, performed consistently well in many perceptual aspects. One explanation would be that, because of its comparably smoother frequency response, it sounded timbrally more natural to the majority of listeners regardless of the particular questioned criterion. Another possible reason could be the dynamic spatial properties of this HRTF. Understandably, it possesses higher uniformity compared to the other alternatives. Because of the averaging, the transitions from the TF of one direction to another are generally smooth and monotonic. This could have played a role in the quality of the dynamic sound coloration effects during head or object movements.

Considering the gathered observations about the perceptual expectations in a VR environment, the subjective preferences concerning the auditive experience, and the complexity involved, it is debatable whether a single HRTF can meet all perceptual criteria equally well.

Figure 55 illustrates the complexity of the multivariate relations between the perceptual criteria (blue area) and the physical parameters (orange area). Throughout this study, some of those relations were obvious (bold arrow connections), others can be assumed (normal arrow connections) based on various reports from the listeners, and others are still to be tested (dotted arrow connections).

Figure 55: Multivariate relations between QFs and QEs in the context of HRTF evaluation as examined in the current study. Figure design inspired by Dr.-Ing. Andreas Silzle [Silzle 2007]

9 References

AES (2015). AES standard for file exchange - Spatial acoustic data file format. Audio Engineering Society, Standard AES69-2015.
Avan, Paul, Fabrice Giraudet, and Béla Büki (2015). "Importance of binaural hearing". In: Audiology and Neurotology 20.Suppl. 1, pp. 3–6.
Bamodu, Oluleke and Xu Ming Ye (2013). "Virtual reality and virtual reality system components". In: Advanced Materials Research. Vol. 765. Trans Tech Publ, pp. 1169–1172.
Blauert, Jens (1997). Spatial hearing: the psychophysics of human sound localization. MIT Press.
Blauert, J et al. (1998). "Der AUDIS-Katalog menschlicher Außenohrübertragungsfunktionen (The AUDIS-catalogue of human head-related transfer functions)". In: Fortschritte der Akustik - DAGA'98, pp. 174–175.
Bomhardt, Ramona and Janina Fels (2017). "The influence of symmetrical human ears on the front-back confusion". In: Audio Engineering Society Convention 142. Audio Engineering Society.
Boren, Braxton B et al. (2014). "PHOnA: a public dataset of measured headphone transfer functions". In: Audio Engineering Society Convention 137. Audio Engineering Society.
Boren, Braxton and Agnieszka Roginska (2011). "The Effects of Headphones on Listener HRTF Preference". In: Audio Engineering Society Convention 131. Audio Engineering Society.
Busson, Sylvain, Rozenn Nicol, and Brian Katz (2005). "Subjective investigations of the interaural time difference in the horizontal plane". In: 118th Audio Engineering Society Convention.
Christensen, Anders T et al. (2013). "Magnitude and phase response measurement of headphones at the eardrum". In: Audio Engineering Society Conference: 51st International Conference: Loudspeakers and Headphones. Audio Engineering Society.
Clarke, Jade Raine and Hyunkook Lee (2017). "The Effects of Decreasing the Magnitude of Elevation-Dependent Notches in HRTFs on Median Plane Localization". In: Audio Engineering Society Convention 142. Audio Engineering Society.
Cline, Mychilo Stephenson (2005). Power, madness, and immortality: The future of virtual reality. Mychilo Cline.
Estrella, Jorgos (2011). "Real time individualization of interaural time differences for dynamic binaural synthesis". In:
Fan, Ziqi, Yunhao Wan, and Kyla McMullen (2016). "Quantitatively Validating Subjectively Selected HRTFs for Elevation and Front-Back Distinction". In: International Community on Auditory Display.
Fleischmann, Felix, Andreas Silzle, and Jan Plogsties (2012). "Identification and evaluation of target curves for headphones". In: Audio Engineering Society Convention 133. Audio Engineering Society.
Gelfand, Stanley A (2016). Hearing: An introduction to psychological and physiological acoustics. CRC Press.
Ghorbal, Slim et al. (2017). "Pinna morphological parameters influencing HRTF sets". In:

Grasser, Thomas, Martin Rothbucher, and Klaus Diepold (2014). Auswahlverfahren für HRTFs zur 3D Sound Synthese. Tech. rep. Lehrstuhl für Datenverarbeitung.
Grijalva, Felipe (2016). A Manifold Learning Approach for Personalizing HRTFs from Anthropometric Features. IEEE/ACM.
Hoffmann, Pablo F, Flemming Christensen, and Dorte Hammershøi (2013). "Insert earphone calibration for hear-through options". In: Audio Engineering Society Conference: 51st International Conference: Loudspeakers and Headphones. Audio Engineering Society.
Hsieh, Min-Chai and Hao-Chiang Koong Lin (2011). "A conceptual study for augmented reality e-learning system based on usability evaluation". In: J. Communications in Information Science and Management Engineering 1.8, pp. 5–8.
Huttunen, Tomi and Antti Vanne (2017). "End-To-End Process for HRTF Personalization". In: Audio Engineering Society Convention 142. Audio Engineering Society.
Iida, Kazuhiro et al. (2007). "Median plane localization using a parametric model of the head-related transfer function based on spectral cues". In: Applied Acoustics 68.8, pp. 835–850.
Jekosch, Ute (2004). "Basic Concepts and Terms of 'Quality', Reconsidered in the Context of Product-Sound Quality". In: acta acustica united with Acustica 90.6, pp. 999–1006.
Martín-Gutiérrez, Jorge, Carlos Efrén Mora, Antonio González-Marrero, and Beatriz Añorbe-Díaz (2017). "Virtual technologies trends in education". In: EURASIA Journal of Mathematics Science and Technology Education 13.2, pp. 469–486.
Juras, Jordan, Chris Miller, and Agnieszka Roginska (2015). "Modeling ITDs based on photographic head information". In: Audio Engineering Society Convention 139. Audio Engineering Society.
Kaneko, Shoken, Tsukasa Suenaga, and Satoshi Sekine (2016). "DeepEarNet: Individualizing Spatial Audio with Photography, Ear Shape Modeling, and Neural Networks". In: Audio Engineering Society Conference: 2016 AES International Conference on Audio for Virtual and Augmented Reality. Audio Engineering Society.
Klein, Florian et al. (2017). "Training on the Acoustical Identification of the Listening Position in a Virtual Environment". In: Audio Engineering Society Convention 143. Audio Engineering Society.
Langendijk, Erno HA and Adelbert W Bronkhorst (2002). "Contribution of spectral cues to human sound localization". In: The Journal of the Acoustical Society of America 112.4, pp. 1583–1596.
LaValle, Steven M. (2017). Virtual Reality. Copyright Steven M. LaValle.
Lorho, Gaëtan (2009). "Subjective evaluation of headphone target frequency responses". In: Audio Engineering Society Convention 126. Audio Engineering Society.
Mayenfels, Thomas (2015). "Equity research virtual and augmented reality". In:
Merimaa, Juha (2010). "Modification of HRTF Filters to Reduce Timbral Effects in Binaural Synthesis, Part 2: Individual HRTFs". In: Audio Engineering Society Convention 129. Audio Engineering Society.
Middlebrooks, John C (1999). "Virtual localization improved by scaling nonindividualized external-ear transfer functions in frequency". In: The Journal of the Acoustical Society of America 106.3, pp. 1493–1510.

Milgram, Paul and Fumio Kishino (1994). "A taxonomy of mixed reality visual displays". In: IEICE TRANSACTIONS on Information and Systems 77.12, pp. 1321–1329.
Monge, Janet M (2011). "Ear Photographs: Examination and Forensic Potential". In: Wiley Encyclopedia of Forensic Science.
Moore, Brian CJ (2012). An introduction to the psychology of hearing. Brill.
Oculus, Team (2017). Oculus Best Practices, Version 310-30000-02. 2017 Oculus VR.
Olive, Sean, Todd Welti, and Omid Khonsaripour (2017a). "A Statistical Model that Predicts Listeners' Preference Ratings of In-Ear Headphones: Part 1 - Listening Test Results and Acoustic Measurements". In: Audio Engineering Society Convention 143. Audio Engineering Society.
— (2017b). "A Statistical Model that Predicts Listeners' Preference Ratings of In-Ear Headphones: Part 2 - Development and Validation of the Model". In: Audio Engineering Society Convention 143. Audio Engineering Society.
Olko, Marta et al. (2017). "Identification of Perceived Sound Quality Attributes of 360º Audiovisual Recordings in VR Using a Free Verbalization Method". In: Audio Engineering Society Convention 143. Audio Engineering Society.
Pike, Cleopatra and Hanne Stenzel (2017). "Direct and Indirect Listening Test Methods - A Discussion Based on Audio-Visual Spatial Coherence Experiments". In: Audio Engineering Society Convention 143. Audio Engineering Society.
Purkait, Ruma (2016). "External ear: An analysis of its uniqueness". In: Egyptian Journal of Forensic Sciences 6.2, pp. 99–107.
Raykar, Vikas C, Ramani Duraiswami, and B Yegnanarayana (2005). "Extracting the frequencies of the pinna spectral notches in measured head related impulse responses". In: The Journal of the Acoustical Society of America 118.1, pp. 364–374.
Reardon, Gregory et al. (2017). "Evaluation of Binaural Renderers: A Methodology". In: Audio Engineering Society Convention 143. Audio Engineering Society.
Reichinger, Andreas et al. (2013). "Evaluation of methods for optical 3-D scanning of human pinnas". In: 3DTV-Conference, 2013 International Conference on. IEEE, pp. 390–397.
Rumsey, Francis (2016). "Virtual Reality: Mixing, Rendering, Believability". In: Journal of the Audio Engineering Society 64.12, pp. 1073–1077.
Sachs, Goldman (2016). "Equity research virtual and augmented reality". In:
Salmon, François et al. (2017). "Optimization of Interactive Binaural Processing". In: Audio Engineering Society Convention 143. Audio Engineering Society.
Schärer, Zora and Alexander Lindau (2009). "Evaluation of equalization methods for binaural signals". In: Audio Engineering Society Convention 126. Audio Engineering Society.
Schönstein, David and Brian Katz (2010). HRTF selection for binaural synthesis from a database using morphological parameters. Proceedings of the 20th International Congress on Acoustics, ICA 2010.
Seeber, Bernhard U and Hugo Fastl (2003). "Subjective selection of non-individual head-related transfer functions". In: Georgia Institute of Technology.

Sexton, Connor (2017). "Immersive Audio: Optimizing Creative Impact without Increasing Production Costs". In: Audio Engineering Society Convention 143. Audio Engineering Society.
Sherman, William R and Alan B Craig (2003). "Understanding Virtual Reality - Interface, Application, and Design". In: Presence 12.4, pp. 441–442.
Silzle, Andreas, et al. (2009). Vision and Technique behind the New Studios and Listening Rooms of the Fraunhofer IIS Audio Laboratory. Audio Engineering Society, Convention paper 7672.
Silzle, Andreas (2007). Generation of Quality Taxonomies for Auditory Virtual Environments by Means of Systematic Expert Survey. Shaker, 2008.
Simon, Laurent, Nick Zacharov, and Brian Katz (2016). Perceptual attributes for the comparison of head-related transfer functions. J. Acoust. Soc. Am. 140 (5), November 2016.
Spagnol, Simone and Federico Avanzini (2015). "Frequency estimation of the first pinna notch in head-related transfer functions with a linear anthropometric model". In: Proceedings of the 18th International Conference on Digital Audio Effects (DAFx-2015), pp. 231–236.
Spagnol, Simone, Michele Geronazzo, and Federico Avanzini (2010). "Structural modeling of pinna-related transfer functions". In: Proc. Int. Conf. on Sound and Music Computing (SMC 2010). Vol. 34.
Sridhar, Rahulram, Joseph G Tylka, and Edgar Choueiri (2017). "A Database of Head-Related Transfer Functions and Morphological Measurements". In: Audio Engineering Society Convention 143. Audio Engineering Society.
Struck, Christopher and Steve Temme (2015). "Headphone Response: Target Equalization Trade-offs and Limitations". In: Audio Engineering Society Convention 139. Audio Engineering Society.
Wightman, Frederic L and Doris J Kistler (1992). "The dominant role of low-frequency interaural time differences in sound localization". In: The Journal of the Acoustical Society of America 91.3, pp. 1648–1661.
Xie, Bosun (2013). Head-related transfer function and virtual auditory display. J. Ross Publishing.
Yairi, Satoshi, Yukio Iwaya, and Yôiti Suzuki (2008). "Individualization feature of head-related transfer functions based on subjective evaluation". In: Proc. of International Conference on Auditory Display (ICAD2008), Paris.
Young, Kat, Tony Tew, and Gavin Kearney (2016). "Boundary element method modelling of KEMAR for binaural rendering: Mesh production and validation". In:
Zhong, Xiaoli, Jie Zhang, and Guangzheng Yu (2015). "Recalibration of Virtual Sound Localization Using Audiovisual Interactive Training". In: Audio Engineering Society Convention 139. Audio Engineering Society.
Ziegelwanger, Harald, Wolfgang Kreuzer, and Piotr Majdak (2016). "A priori mesh grading for the numerical calculation of the head-related transfer functions". In: Applied Acoustics 114, pp. 99–110.

10 List of Figures

1  Ear anatomy. Source: Blausen.com staff (2014). "Medical gallery of Blausen Medical 2014". WikiJournal of Medicine 1 (2). DOI: 10.15347/wjm/2014.010. ISSN 2002-4436 ... 4
2  Left - polar coordinates, top view (horizontal plane). Right - polar coordinates, side view (median plane) ... 5
3  Simplified representation of a 'virtuality continuum'. Source: Giovanni Vincenti ... 7
4  Example of visual artifacts in a typical HMD. Top - around 70 megapixels per eye. Bottom - around 1.2 megapixels per eye. Source: http://www.varjo.com/ ... 9
5  The input/output loop in a typical VR system. Source: Shmuel Csaba Otto Traian ... 10
6  Typical block diagram of acoustic HRTF measurements ... 11
7  Reference listening room (ITU-R BS.1116) "Mozart" - Fraunhofer IIS, Erlangen (Germany) ... 17
8  Subject 1 - measurement setup ... 18
9  Subject 1 with in-ear microphones ... 18
10  First room reflection ... 19
11  IIS database, DFEQ of subject 67 with two scaled versions. Blue - original DFEQ, yellow - DFEQ scaled by factor 0.8, red - DFEQ scaled by factor 1.2 ... 22
12  Raw ITDs from all 67 subjects of the IIS database - all LS at elevation 0° ... 24
13  Raw ITDs from all 67 subjects of the IIS database - all LS at elevation 39° ... 25
14  ITDs sorted from min to max from all 67 subjects of the IIS database - all LS at elevation 0° ... 25
15  ITDs sorted from min to max from all 67 subjects of the IIS database - all LS at elevation 39° ... 25
16  ITDs of all subjects - ARI, all LS at elevation 0° ... 26
17  ITDs of all subjects - IRCAM, all LS at elevation 0° ... 26
18  IIS - 67 subjects, left ears, azimuth 0°, elevation 0° ... 27
19  IIS - 67 subjects, right ears, azimuth 0°, elevation 0° ... 27
20  IIS - 67 subjects, left ears, azimuth 180°, elevation 0° ... 27
21  IIS - 67 subjects, right ears, azimuth 180°, elevation 0° ... 27
22  IIS database - Subject 19, spectral plot of the left and right ears over all LS at elevation 0°. The horizontal black line shows the main notch frequency for all directions. Note the differences between directions 0° and 180° ... 28
23  Histogram of the main notches from all left ears of the IIS, ARI, IRCAM and CIPIC databases at azimuth 0°, elevation 0° (304 samples, 76 bins) ... 29
24  Histogram of the main notches from all right ears of the IIS, ARI, IRCAM and CIPIC databases at azimuth 0°, elevation 0° (304 samples, 76 bins) ... 29
25  Histogram of the main notches from all left and right ears combined of the IIS, ARI, IRCAM and CIPIC databases at azimuth 0°, elevation 0° (608 samples, 76 bins) ... 30

26  IIS - mobile VR workstation consisting of a Windows PC, a Mac Pro and a PC laptop. Unity VR rendering to Oculus Rift via NVIDIA GTX 1080. Immersive triple-display configuration with a 4K KVM switch. RME Madiface XT, RME ADI-2 Pro, Avid Artist Mix. Dual headphone outputs with independent volume control ... 32
27  Xbox One controller used by the subjects as dynamic tactile input in the VR environment. The D-pad horizontal axis controls the sound level, the D-pad vertical axis controls the ratings between the conditions (HRTFs). The 'Task' button shows/hides the trial description. X, A, B, Y on the right were used to switch between the conditions. The bumper and trigger on the front side were dynamically programmed for each trial ... 33
28  Screenshot showing the start screen of the trial 'horizontal localization'. On pressing 'Task' on the controller the description disappears and the trial begins. A second press activates pause and brings back the description ... 34
29  Screenshot from the actual selection. The red cross indicates the deviation of the viewing direction from the green bar. At the top left some ratings are visible on the sliders. The currently selected condition is B with 57 points ... 34
30  IIS - Subject 1, spectral dynamics scaling - left ear, azimuth 30°, elevation 0°. Blue - original magnitude, red - scaling factor 0.3, yellow - scaling factor 3 ... 35
31  IIS - Subject 1, spectral dynamics scaling - right ear, azimuth 330°, elevation 0°. Blue - original magnitude, red - scaling factor 0.3, yellow - scaling factor 3 ... 35
32  Preliminary listening test on sound coloration - table with the results from 17 listeners. 100 = most natural, 0 = least natural. Green highlighting indicates that the reduced spectral dynamics was rated as more natural sounding than the increased spectral dynamics; yellow highlighting indicates the opposite. Listeners having their own HRTFs as input for the manipulation are marked in red; blue highlighting indicates the dummy-head HRTF ... 37
33  IIS - generic HRTF, right ear - azimuth 0°, elevation 0°. Red - pure generic, yellow - generic with notch, blue - generic with scaled DFEQ and notch ... 39
34  IIS - generic HRTF, right ear - azimuth 180°, elevation 0°. Red - pure generic, yellow - generic with notch, blue - generic with scaled DFEQ and notch ... 39
35  IIS - generic HRTF, left ear at elevation 0° ... 39
36  IIS - generic HRTF, right ear at elevation 0° ... 39
37  IIS - generic HRTF with notch, left ear at elevation 0° ... 40
38  IIS - generic HRTF with notch, right ear at elevation 0° ... 40
39  IIS - generic HRTF with scaled DFEQ and notch, left ear at elevation 0° ... 40
40  IIS - generic HRTF with scaled DFEQ and notch, right ear at elevation 0° ... 40

41  Preliminary listening test on externalization - table with the results from 18 listeners. 100 = most externalized, 0 = least externalized. Green highlighting indicates that the generic HRTF with DFEQ and notch scaling was rated as more externalized sounding than the pure generic; yellow highlighting indicates the opposite ... 42
42  IIS generic HRTF - ITD scaling, middle ring. Blue - original ITD, yellow - ITD scaled by factor 1.2, red - ITD scaled by factor 0.8 ... 43
43  IIS generic HRTF - ITD scaling, upper ring. Blue - original ITD, yellow - ITD scaled by factor 1.2, red - ITD scaled by factor 0.8 ... 43
44  Preliminary listening test on horizontal localization - table with the results from 18 listeners. 100 = the direction changes towards azimuth 180°, 0 = the direction changes towards azimuth 0°. Green highlighting indicates that increasing the ITD causes a change in the localization towards azimuth 180°; yellow highlighting indicates the opposite ... 44
45  IIS generic HRTF - left ear. Blue - azimuth 90°, elevation 0°; red - azimuth 90°, elevation 39° ... 45
46  IIS generic HRTF - right ear. Blue - azimuth 90°, elevation 0°; red - azimuth 90°, elevation 39° ... 45
47  IIS generic HRTF - frequency scaling, left ear, azimuth 90°, elevation 30°. Blue - original spectrum, yellow - frequency scaling by factor 1.15, red - frequency scaling by factor 0.85 ... 45
48  IIS generic HRTF - frequency scaling, right ear, azimuth 90°, elevation 30°. Blue - original spectrum, yellow - frequency scaling by factor 1.15, red - frequency scaling by factor 0.85 ... 45
49  Preliminary listening test on vertical localization - table with the results from 18 listeners. 100 = most elevated sounding, 0 = least elevated sounding. Green highlighting indicates that frequency scaling by factor 1.15 caused a higher elevation perception than the original and scaling by factor 0.85 caused a lower elevation perception than the original; yellow highlighting indicates the opposite ... 47
52  Example protocol from one selection procedure ... 53
55  Multivariate relations between QFs and QEs in the context of HRTF evaluation as examined in the current study. Figure design inspired by Dr.-Ing. Andreas Silzle [Silzle 2007] ... 58

11 Declaration of authorship / Eidesstattliche Erklärung

Declaration of authorship

I hereby certify that this thesis has been composed by me and is based on my own work under the supervision of Dipl.-Ing. Felix Fleischmann. No other person’s work has been used without due acknowledgment in this thesis. All references and verbatim extracts have been quoted, and all sources of information, including graphs and data sets, have been specifically acknowledged.

Eidesstattliche Erklärung

Hiermit erkläre ich, dass ich die vorliegende Arbeit selbstständig unter der Betreuung durch Dipl.-Ing. Felix Fleischmann angefertigt habe. Sämtliche Stellen der Arbeit, die im Wortlaut oder dem Sinn nach anderen gedruckten oder im Internet verfügbaren Werken entnommen sind, habe ich durch genaue Quellenangaben kenntlich gemacht.

Vensan Mazmanyan
Erlangen, 13. November 2017
