On the human ability to discriminate audio ambiances from similar locations of an urban environment

Dani Korpi¹, Toni Heittola¹, Timo Partala², Antti Eronen³, Annamaria Mesaros¹, and Tuomas Virtanen¹

¹Department of Signal Processing, Tampere University of Technology, dani.korpi@tut.fi
²Human-Centered Technology, Tampere University of Technology
³Nokia Research Center

The final publication is available at www.springerlink.com (http://link.springer.com/article/10.1007/s00779-012-0625-z)

Abstract

When developing advanced location-based systems augmented with audio ambiances, it would be cost-effective to use a few representative samples from typical environments for describing a larger number of similar locations. The aim of this experiment was to study the human ability to discriminate audio ambiances recorded in similar locations of the same urban environment. A listening experiment consisting of material from three different environments and nine different locations was carried out with nineteen subjects to study the credibility of audio representations for certain environments, which would diminish the need for collecting huge audio databases. The first goal was to study to what degree humans are able to recognize whether the recording has been made in an indicated location or in another similar location, when presented with the name of the place, location on a map, and the associated audio ambiance. The second goal was to study whether the ability to discriminate audio ambiances from different locations is affected by a visual cue, by presenting additional information in the form of a photograph of the suggested location. The results indicate that audio ambiances from similar urban areas of the same city differ enough that it is not acceptable to use a single recording as the ambiance to represent different yet similar locations. Including an image was found to increase the perceived credibility of all the audio samples in representing a certain location. The results suggest that developers of audio-augmented location-based systems should aim at using audio samples recorded on-site for each location in order to achieve a credible impression.

1 Introduction

In online location-based services, locations are usually depicted based on visual aspects only. Examples include the online services Nokia Maps 3D [14] and Google Street View [6], which provide 360° street-level views of numerous places on the globe. At present, few services exist that utilize audio ambiances for depicting real locations. However, in computer games and movies, for example, audio is used effectively for conveying more information about the environment to the user. It is intuitively clear that in order to achieve an authentic virtual reality, audio ambiances must also be utilized.

An example of a service which builds heavily on everyday audio ambiances is the Urban Remix project initiated by Georgia Institute of Technology [19]. It is mostly based on artistic aspirations, and all the audio is crowdsourced from users carrying mobile phones. The recordings are then automatically annotated with the location in which they were recorded, and they represent, in a way, the soundscape of the location. The recordings can be used, for example, in creating a remix consisting of several audio samples.

An alternative to using recordings from the actual environments is to use artificially created soundscapes. Early methods for combining visual and audio descriptions of a location suggest using abstract or natural sounds for representing additional information about the location [11, 12], while later methods suggest characteristic audio sequences for mapping environmental noise into city models [17].

From an application development perspective, and considering the cost of collecting audio ambiance data, it would be attractive to use as few audio recordings as possible for representing everyday audio environments. In [15], the authors analyze the human accuracy in classifying the everyday environment, such as street, car, or restaurant, based on audio ambiance. The authors report that, for 19 subjects, the average correct recognition rate was 70 % for 25 different environments. On average, the recognition took the subjects 20 s. Furthermore, the most commonly reported cues on which the subjects based the recognition were prominent identifiable sound events.

The results in this earlier work suggest that humans are good at recognizing environments based on audio only, and therefore, it is unlikely that one could successfully use audio samples from a different environment (e.g., a street) for depicting a location on a map (corresponding to a restaurant, for example). However, the question arises of how accurately humans are able to perceive differences between audio samples captured in different locations that nevertheless belong to the same environment, such as two streets of a city. In this paper, we address this research question.

The perception of audio has been investigated in numerous publications. In [9], a listening experiment was carried out in order to determine how certain discontinuities in synthesized speech are perceived, with the aim of improving existing speech synthesis algorithms. In [16], test subjects had to rate the similarity of several audio samples consisting of different types of sonar echoes to study perceptually significant signal features that should be useful for automatic classification. An investigation of the perceptual structure of everyday sounds was performed in [1] by asking the test subjects to group everyday sounds based on similar auditory characteristics. In [7], it was determined how well humans are able to recognize spoken digits under noisy conditions. This was done in order to improve automatic recognition of speech in noisy environments. These examples indicate that using a listening experiment to extract information regarding the limits of human auditory perception and cognition is well justified in different contexts. Furthermore, this information can be used to develop automated audio signal processing algorithms.

Studies related to cross-modal perception are abundant in various disciplines, such as psychoacoustics, psychology, and computer-human interaction. For the technology field, the most important cross-modal combination is the audio-visual one [3, 18], and the psychological studies reveal that the perception of visual quality is increased when visual information is presented together with high-quality auditory information. The influence of the sound is positive when the two perception modes are in synchrony regarding the information that is presented, or when the visual information leaves an ambiguity that can be resolved by the sound [5, 21]. Humans are also able to create a visual image based on the audio information they receive. The study presented in [10] asked subjects to select a depiction of bars that corresponds in size to the characteristics of different sounds of two wooden or metal bars struck together. The subjects were able to select a depiction of the bars that correctly corresponded to the approximate physical properties of the two bars.

The experiments in [8, 20] are related to the perception of urban environments. In [20], the test subjects were asked to rate the pleasantness of different urban sound environments while being presented with pictures of different levels of urbanization. It was observed that the more urban the environment depicted in the picture, the less pleasant the sound environment was rated, leading to the conclusion that the visual cues indeed affected the auditory perception of the test subjects. In [8], test subjects were asked to walk a certain route in a city. After that, they were asked to describe prominent sounds and rate the acoustical comfort of the locations along the path. The results of the experiment indicated that the most significant factors affecting the overall impression of the soundscape were odors and acoustic comfort.

The objective of the current study was to investigate whether several locations within one environment could be credibly represented by audio samples from a single location and whether visual cues affect the perceived credibility of the audio samples. By an environment, we mean a set of similar locations, such as the parks or streets in a city. A location is a particular street or park in the corresponding environment.

We conducted an experiment studying two central research questions. First, we studied how easily subjects who are familiar with the locations notice whether an audio sample was recorded in the indicated location or in another similar location. Second, we studied how an additional visual cue, more specifically a photograph of the location, affects the recognition of the location based on audio ambiances.

2 Methods

Nineteen naïve test subjects participated voluntarily in the test. The subjects were between 20 and 40 years of age, and they included 16 males and three females.

[...] the different environments were done under as similar conditions as possible. The relative traffic levels on the streets were kept as similar as possible by doing the recordings outside the busy traffic hours. In parks, the recordings were done in the afternoon, after people had got home from work and possibly gone out to relax. The recordings from the marketplaces were done under typical conditions. In some cases, this meant doing the recordings in the morning when there were market stalls present (the marketplace Tammelantori) and in other cases, in the afternoon when there were people walking by (the marketplaces Laukontori and Keskustori).

One aspect to note was that on the day of the recording the weather was sunny, but the temperature was quite low. Thus, there were not as many people in the parks or in certain marketplaces as on warmer days. Because the temperatures are usually relatively low in Finland, these weather