
Audio Engineering Society Convention Paper Presented at the 135th Convention 2013 October 17–20 New York, USA

This Convention paper was selected based on a submitted abstract and 750-word precis that have been peer reviewed by at least two qualified anonymous reviewers. The complete manuscript was not peer reviewed. This convention paper has been reproduced from the author’s advance manuscript without editing, corrections, or consideration by the Review Board. The AES takes no responsibility for the contents. Additional papers may be obtained by sending request and remittance to Audio Engineering Society, 60 East 42nd Street, New York, New York 10165-2520, USA; also see www.aes.org. All rights reserved. Reproduction of this paper, or any portion thereof, is not permitted without direct permission from the Journal of the Audio Engineering Society.

Audio Effect Classification Based on Auditory Perceptual Attributes

Thomas Wilmering, György Fazekas, and Mark B. Sandler
Centre for Digital Music (C4DM), Queen Mary University of London, London, E1 4NS, UK
Correspondence should be addressed to Thomas Wilmering ([email protected])

ABSTRACT
While the classification of audio effects has several applications in music production, the heterogeneity of possible taxonomies, as well as the many viable points of view for organising effects, presents research problems that are not easily solved. Creating extensible Semantic Web ontologies provides a possible solution to this problem. This paper presents the results of a listening test that facilitates the creation of a classification system based on auditory perceptual attributes that are affected by the application of audio effects. The obtained results act as a basis for a classification system to be integrated in a Semantic Web ontology covering the domain of audio effects in the context of music production.

1. INTRODUCTION

Musicians and music producers have a large number of effects at their disposal. For instance, over 70 types of effects have been identified in [23]. For this reason, the task of audio effect classification is a challenging problem, to which, depending on a variety of factors, many different approaches may be taken. We can group audio effects by their perceptual attributes for instance, or classify them by their underlying signal processing implementations. A developer for example is likely to emphasise signal processing techniques, whereas a musician would prefer to classify effects by their perceptual qualities. An example of inter-disciplinary effect classification has been proposed in [19], as part of an effort to facilitate communication and collaboration between DSP programmers, sound engineers, composers, performers and musicologists. In this paper, we present the results of a listening test, which aims to validate a previously proposed classification system. This system is based on auditory perceptual attributes that are affected by the application of audio effects. The applied perceptual axes are loudness, duration/rhythm, pitch/harmony, space, and timbre/quality. The listening test has been designed as a web-based application. We collected results from more than 100 participants with backgrounds in audio engineering and musicology.

First, participants were asked to rate their expertise in different music and audio related disciplines, such as music production, instrument playing skills, composition, and audio digital signal processing. During the test, participants were presented with audio material before and after the application of a given audio effect. The participants were then asked to choose which of the perceptual attributes were affected most by the transformation. To facilitate the test we applied 33 different digital audio effects to 40 audio files that consist of short (3 to 10s) mono-timbral sound samples, ranging from vocal recordings and percussive instruments to harmonic instruments. Each test participant was presented with a total of 66 audio pairs (2 per effect type) in random order, without disclosing the effect type.

The results are used to establish an audio effect classification system that can be integrated into the Audio Effects Ontology [22] for the description of audio effects. This Semantic Web ontology can act as a framework for the description and sharing of knowledge in the domain of audio effects, and can facilitate novel ways of retrieving information about audio effects and their implementations.

Although the listening test results largely confirm the classification proposed in [19], they also reveal differences in the association of effects and perceptual attributes, depending on the level of expertise in audio and music related fields of the test subjects. The broader outcome of this research is a new classification system embedded in a Semantic Web ontology, which includes an updated and perceptually validated version of the originally proposed taxonomy [19]. The resulting ontology reflects the needs of its intended users, as well as audio applications that are based on the Studio and Audio Effects ontologies described in [8] and [21]. These applications range from adaptive audio effects using high-level semantic metadata and content-aware music production tools to searchable audio effect databases.

2. AUDITORY PERCEPTUAL ATTRIBUTES

In the fields of psychology and neurology there is a large body of literature discussing the mechanisms of human auditory perception with regard to perceptual attributes. Since a comprehensive discussion exceeds the scope of this paper, we will only introduce the concepts and define the perceptual attributes used in our work. The topic of auditory and musical perception is discussed in detail in the literature (see for instance [7, 14]).

The classification in our listening test is based on the main perceptual axes of sound as identified in [1] in the context of adaptive digital audio effects. These include duration/rhythm, loudness, pitch/harmony, space and timbre/quality.

2.1. Loudness

Loudness is defined as the magnitude of the sensation resulting from sound or noise perceived by the human auditory system [13]. Although mainly dependent on the sound pressure level, this does not directly translate to loudness, since perceived loudness varies depending on the frequency of the changes in air pressure picked up by the human ear. Sound level is measured in decibels (dB). The logarithmic scale is chosen because it approximates the human sense of hearing. Since dB describes a ratio between two values, for the calculation of the absolute sound level a reference level is defined, usually at 20µPa, corresponding to the threshold of sensitivity of the human ear. It takes the form:

    P_a = 20 log10 (p / p0)    (1)

where P_a is the absolute loudness level in dB, p is the measured sound pressure in µPa and p0 is the reference sound pressure of 20µPa.

The human auditory system does not respond to the same degree over the whole audible frequency range of roughly 20Hz to 20kHz, i.e., the perception of loudness differs for tones having different frequencies at equal sound pressure levels. Thus, loudness may be measured in phon. The phon scale is derived from psychophysical measurement of the ear. Participants of listening tests were asked to adjust the level of signals of different frequencies until they perceived the frequencies as loud as a given 1kHz signal. At 1kHz phon is equal to dB, while at other frequencies phon is determined by the equal-loudness curves.

The unit sone is a loudness unit which is also the result of psychophysical measurements. Here, test
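Equation (1) is straightforward to implement. A minimal sketch follows; the function and variable names are ours, not from the paper:

```python
import math

def spl_db(p_upa, p0_upa=20.0):
    """Absolute sound pressure level in dB (Eq. 1), with pressures in
    microPascal and the reference p0 at 20 uPa, the threshold of hearing."""
    return 20.0 * math.log10(p_upa / p0_upa)

# The reference pressure itself maps to 0 dB SPL,
# and every tenfold increase in pressure adds 20 dB.
print(spl_db(20.0))      # 0.0
print(spl_db(200000.0))  # 80.0
```

Note that this is the physical level only; as the phon discussion makes clear, perceived loudness additionally depends on frequency weighting.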

AES 135th Convention, New York, USA, 2013 October 17–20 Page 2 of 10 Wilmering et al. Perceptual Audio Effects Classification

subjects were asked to adjust sounds until they perceive them as twice as loud. One sone is defined as a 1000Hz sinusoidal tone with a sound pressure level of 40dB. Hence the tone with twice the loudness has a loudness of 2 sones, and with half the loudness 0.5 sones. While the sone scale is a purely psychophysical loudness scale, the phon scale is considered a physical-psychophysical loudness scale, since at 1000Hz the scale values correspond to dB [15].

2.2. Duration and Rhythm

A change in duration of a sound means that the length is changed independently from pitch and timbre. This is commonly referred to as time scaling. Rhythm is defined by the time periods between successive onsets of events [11]. In most Western music, the inter-onset times are usually integer multiples of each other. Changing the duration of rhythmic audio material affects the tempo, i.e. the rate of beats that occur in the music. Although the times between the audio events can be measured objectively by analysing the waveforms, human perception of temporal information of acoustical events is not as accurate and has been the subject of numerous publications, most notably by Paul Fraisse. A summary of his findings is given in [4, 9], which emphasises the theory of distinguishing perception of time and estimation of time. Within a time frame of up to approximately 5s, referred to as the perceptual present, perceived events are linked, i.e. perceived more or less simultaneously. In longer time frames the estimation of duration involves memory. Moreover, durations less than 100ms are perceived as instantaneous events. Rhythm perception mainly takes place within sequences of shorter time frames. Changes in the rhythmic structure however do not only depend on the timing of events. Rhythmic patterns may also be altered by changing attributes of individual sonic events, such as their loudness or timbre.

2.3. Pitch and Harmony

Pitch is a perceptual attribute related to the fundamental frequency of a complex tone or, in the case of a simple tone, to its frequency [15]. The human auditory system is capable of discerning pitch in a range of approximately 20-5000Hz after a few periods of a tone. Above this range audible frequencies are perceived without having a clear pitch. While tones with a spectrum of a single frequency have a definite pitch directly related to the frequency, the detection of pitch of complex tones is not as straightforward and has to take into account psychoacoustic phenomena. The spectrum of complex periodic sounds that have a discernible pitch includes several predominant sinusoidal components that are usually close to integer multiples (harmonics) of the fundamental frequency. These components are referred to as partials and are perceived as one perceptual unit. Although the perceived periodicity pitch is widely considered to be directly derivable from the fundamental frequency, experimental studies revealed that complex tones are perceived as having a lower pitch than the spectral pitch of a single tone of the fundamental frequency [17]. A residue of the fundamental frequency is perceived when it is missing from a complex tone. For instance, a sound consisting of harmonics at 200Hz, 300Hz and 400Hz without the fundamental at 100Hz is still perceived as having a pitch related to 100Hz, a phenomenon attributed to auditory Gestalt perception. However, pitch perception is often ambiguous among listening test subjects, in that certain complex tones are related to a pitch that differs from that around the fundamental frequency by an octave or a fifth [18].

The human auditory system is also capable of discriminating multiple tones simultaneously. Apart from the octave equivalence, combinations of different tones played simultaneously are perceived as consonant or dissonant depending on the ratios of the frequencies determining their pitch. The ratios of unison (1:1) and octave (2:1) are the most consonant. These are followed by the perfect intervals, the perfect fifth (3:2) and the perfect fourth (4:3). The major third (5:4), minor third (6:5), major sixth (5:3), and minor sixth (8:5) are regarded as less consonant, with the major seventh (15:8) and the tritone (45:32) exhibiting the least consonance. The dissonance of tones is attributed to interference or beating in harmonic spectra of tones at a ratio below 1.2 [12]. A musical interval is also perceived when tones are played in succession. Three or more tones form chords, which are classified by the ratios and the order in which the ratios occur. The major triad and the minor triad are perceived differently, although they consist of the same ratios: in the major triad the minor third lies above the major third and vice versa [6]. An important factor in music perception is the mood associated with tonality.
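The missing-fundamental example above can be checked numerically. For exactly harmonic partials, the idealised residue pitch is the greatest common divisor of the component frequencies; this is a deliberate simplification that ignores the octave and fifth ambiguities reported in [18], and the helper below is ours:

```python
from functools import reduce
from math import gcd

def residue_pitch_hz(partials_hz):
    """Idealised residue pitch of exactly harmonic partials (integer Hz):
    the greatest common divisor of the component frequencies."""
    return reduce(gcd, partials_hz)

# Harmonics at 200, 300 and 400 Hz with the 100 Hz fundamental
# removed are still heard as a pitch related to 100 Hz.
print(residue_pitch_hz([200, 300, 400]))  # 100
```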


It has been shown that children as young as three years old associate music in minor mode with negative emotions (sadness), and music in major mode with positive emotions (happiness), supporting the theory that mood perception in music is not solely based on cultural psychological conditioning [5].

Harmony as a perceptual attribute affected by audio transformations may alter the level of consonance or the mode of musical audio material. Those audio effects are therefore capable of having a large impact on the mood associated with the transformed audio.

2.4. Space

Space as a perceptual attribute in our context refers to room characteristics audible in recorded sound in the form of reverberation, and to the spatial position of sound sources. Reverberation results in the perception of a room effect giving information about the characteristics of a room (or space) a sound is propagating in, and the position of a sound source in the room. Delay and echo effects are produced by reflections with longer delay times and can also be considered spatial effects. The spatial position of a sound source is perceived by the listener through the analysis of differences in intensity, timing and spectral cues. The human auditory system determines the position on the lateral plane by evaluating the inter-aural time difference (ITD) and inter-aural intensity difference (IID) of a sound arriving at the ears. Spatial information in the median plane is derived from spectral differences caused by filtering of the sound that occurs due to the shape of the outer ear (the head, shoulders and torso also contribute to the outer ear transfer function). Since higher frequencies are dampened more quickly than lower frequencies when travelling through air, the resulting filtering effect gives cues about the distance of an object. In enclosed spaces distance is also evaluated by comparing direct sound and reflections [3].

2.5. Timbre and Quality

Timbre is a multidimensional attribute, which the American National Standards Institute (ANSI) defines as “that attribute of auditory sensation in terms of which a listener can judge that two sounds similarly presented and having the same loudness and pitch are dissimilar” [2]. The frequency spectrum of a sound is the main factor in determining timbre; however, musical instruments can be recognised even if a recording is of poor quality or reverberant, both factors that significantly alter the spectral profile. This is due to the fact that timbre is also influenced by the temporal envelope of a given sound. The attack, decay and modulation of the steady state region of a sound play a significant role in distinguishing instruments [16]. Houtsma [10] argues that due to their subjective nature and partial dependency on the same physical attributes, timbre and pitch should not be presented as separate variables. However, in our work we treat timbre according to the ANSI definition, which separates the two perceptual attributes.

The attribute quality is highly subjective and may entail several of the dimensions we described in this section. For instance, the quality of sound-reproducing equipment is generally judged by how accurately it recreates original recordings. This is tied to the frequency response of the system and how it deviates from a flat response. In music production, which also includes working with artificially generated sound, quality may be judged by the brightness (linked to the spectral centroid) of a sound. Changes in quality in our context may be seen as more subtle differences in timbre that do not interfere for instance with the identification of a particular musical instrument, but may involve an excess or lack of brightness.

3. DATASET

To carry out the test we applied 33 different digital audio effects to audio files that consist of short (3 to 10s) mono-timbral musical recordings, ranging from vocal sounds and percussive instruments to harmonic instruments. The effects applied for the experiment and their descriptions are given in Table 2. Details about the composition of the source audio material in the data set are shown in Table 1. All audio effects have been applied to each audio file, with the exception of harmonisation and pitch discretisation. These effects have only been applied to the vocal recordings, which is arguably their primary usage in music production. In total, we generated 1245 audio stimuli in addition to the 40 dry audio files. The parameter settings for the audio effects were aimed at representing standard settings, however at intensities that create an obvious change in the audio material. The audio files have been encoded to MP3 at


a constant bit rate of 320 kbit/s. Since this listening test has been conducted online, we chose a suitable trade-off between sound quality and bit rate to ensure that it is possible to stream the audio comfortably even on lower-bandwidth internet connections.

Type        Sound source               Quantity
Harmonic    Acoustic guitar (pluck)    4
            Acoustic guitar (strum)    3
            Bass guitar                1
            Brass section              2
            Classic choir              1
            Electric piano             1
            Flute solo                 2
            Harpsichord                1
            Organ                      2
            Piano                      2
            Solo voice (female, R&B)   2
            Solo voice (male, tenor)   1
            Solo violin                1
            Synth bass                 2
            String ensemble            4
            Trumpet solo               1
Percussive  Drum kit                   5
            Percussion combo           5
Total                                  40

Table 1: Sound sources for the audio material in the data set.

4. METHODOLOGY

We designed our listening test as an online application that can be run in a standard Web browser. On the welcome page of the online listening test the participants are asked to provide some information about themselves (see Section 5). During the test the users are presented with pairs of audio recordings, prior to and after the application of audio effects. The users are asked to compare the audio files of each of the pairs and choose the perceptual attributes that have been affected by the given effect via two rows of pull-down menus below the audio players. At least one main attribute has to be selected. Optionally a second main attribute, and up to two other (or secondary) attributes may be specified. The secondary attributes are those that are affected noticeably but to a lesser degree than the main attributes. Designed as a blind experiment, information about the type of a given transformation that has been applied is hidden. Each test participant is presented with a total of 66 audio pairs (2 per effect type). For each effect type the recordings are selected randomly from the dataset. A partial screenshot of the user interface is given in Figure 1, showing the audio players and pull-down menus for one audio pair.

Fig. 1: The user interface of the online listening test.

5. PARTICIPANTS STATISTICS

The call for participants for the experiments has been sent to various mailing lists related to research in the fields of audio technology and musicology¹. Prior to the test, the participants were asked to provide some details about themselves. Apart from age (Figure 2) and gender (Figure 3), we gathered information about how the test participants rate their expertise in areas related to music and music production. The areas in question are audio production, electronic music composition, playing a musical instrument and audio DSP. The level of expertise for each of these areas is chosen from a list that includes not familiar, somewhat familiar, very familiar and expert. The expertise distribution is shown in the bar charts of Figure 4. Although we recommend high quality listening conditions for the test, we do not impose this as a requirement for participating in the experiment. We argue that equal listening conditions are not a requirement to conduct this experiment, since we evaluate rather large changes in

¹The call has been sent to: [email protected], [email protected], [email protected], music-ir@listes..fr, [email protected] and [email protected], and the internal mailing lists of C4DM.
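The random pair selection described above can be sketched as follows. The original test code is not published, so the data layout and names here are illustrative assumptions:

```python
import random

def build_trial_list(dataset, pairs_per_effect=2):
    """For each effect type, draw `pairs_per_effect` random recordings and
    shuffle all resulting pairs so their order carries no information.
    The effect type is stored internally but never shown to the participant."""
    trials = []
    for effect, recordings in dataset.items():
        for recording in random.sample(recordings, pairs_per_effect):
            trials.append({"effect": effect, "recording": recording})
    random.shuffle(trials)
    return trials

# With 33 effect types this yields the 66 pairs of the experiment;
# three hypothetical types give 6 pairs.
demo = {e: [f"{e}_{i}.mp3" for i in range(40)] for e in ("chorus", "echo", "fuzz")}
print(len(build_trial_list(demo)))  # 6
```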


audio. However, test participants are asked to specify their listening conditions from a list of available choices that we deem sufficient for the test. The available choices and the distribution of listening conditions are given in Figure 5. The majority of participants (63.8%) used high quality headphones while running the test. The participants’ ages are given in specific age groups. Assuming the ages in each group are evenly spread, the mean age is 35.7. The majority of the participants were male (78.1%). The mean expertise on a scale from 1 (not familiar) to 4 (expert) in the music and music technology related disciplines is as follows: audio production: 2.6, electronic music composition: 2.3, playing a musical instrument: 2.9, audio DSP: 2.6.

Fig. 2: Age distribution of the test participants.

Fig. 3: Gender distribution of the test participants.

Effect            Description
Bandpass filter   24dB bandpass filter at 5kHz.
Bitcrusher        Bit depth reduction introducing quantisation noise.
Chorus            Added modulated delay lines with delay times around 10ms.
Comb filter       Single added feedback delay line, tuned to 420Hz (2.38ms delay).
Compressor        Dynamic gain compression.
Distortion        Tube amplifier emulation.
Doppler           Emulation of the Doppler effect created by a moving sound source passing.
Echo              Feedback delay effect at around 200ms.
Enhancer          Combination of equalising and non-linear processing to subjectively enhance the sound.
Flanger           Added modulated delay lines with delay times around 2ms.
Fuzz              Completely non-linear distortion.
Gain              Change in SPL.
Grain delay       Grain tap delay.
Harmoniser        Addition of two harmonically pitch-shifted copies of the original.
Highpass filter   24dB highpass filter at 5kHz.
Lowpass filter    24dB lowpass filter at 5kHz.
Phaser (mono)     0.85Hz phaser sweep from 1500Hz to 7000Hz.
Phaser (stereo)   0.85Hz phaser sweep from 1500Hz to 7000Hz and 30% phase difference between the stereo channels.
Pan               Change of the positioning of the sound source in the stereo field.
Pitch discret.    Auto-tune. Time-varying pitch shift controlled by the input signal.
Pitch (up)        Pitch shift by 3 semitones.
Pitch (down)      Pitch shift by 3 semitones.
Reverb            Schroeder/Moorer based digital reverberation with RT60 of 750ms.
Ring modulation   Sinusoidal ring modulation at 1kHz.
Robotisation      Phase-vocoder based “robot” effect produced by zeroing the phase in the time-frequency representation.
Rotary speaker    Leslie/rotary speaker emulation.
Speed down        Playback slow-down by 20% (resampling).
Speed up          Playback speed-up by 20% (resampling).
Tempo down        Tempo decrease by 30%.
Tempo up          Tempo increase by 30%.
Tremolo           Sinusoidal amplitude modulation at 6Hz with 40% depth.
Vibrato           Sinusoidal frequency modulation (15% depth) and amplitude modulation (10% depth) at 4Hz.
Whisperisation    Phase-vocoder based “whisper” effect produced by setting a random phase in the time-frequency representation.

Table 2: Overview of the digital audio effects for the perceptual attributes listening test.
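As an illustration of the Table 2 entries, the tremolo (6 Hz sinusoidal amplitude modulation, 40% depth) can be realised as below. The paper does not give the exact modulation law, so this is one plausible sketch in which the gain swings between 1 - depth and 1:

```python
import math

def tremolo(samples, sample_rate=44100, rate_hz=6.0, depth=0.4):
    """Sinusoidal amplitude modulation: a low-frequency oscillator
    scales the gain between (1 - depth) and 1.0 at `rate_hz`."""
    out = []
    for n, x in enumerate(samples):
        lfo = 0.5 * (1.0 - math.cos(2.0 * math.pi * rate_hz * n / sample_rate))
        out.append(x * (1.0 - depth * lfo))
    return out

# One second of a constant-amplitude input shows the gain range.
y = tremolo([1.0] * 44100)
print(round(min(y), 3), round(max(y), 3))  # 0.6 1.0
```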


Fig. 4: Distributions of the levels of expertise of the test participants in music-related areas (self-assessed).

Listening condition                   Participants
High quality headphones               63.8%
Low quality headphones                20.0%
Speakers / professional studio         3.8%
Speakers / semi-professional studio    5.7%
Hi-fi speakers / home                  6.7%

Fig. 5: Listening conditions distribution of the test participants.

6. RESULTS

The number of participants in the test was 105. For each audio effect we counted the choices made for the main and secondary perceptual attributes respectively. The percentage of participants that chose no secondary attribute is relatively high. For this reason, we focus on the combined results and analyse the total of attributes assigned to a given effect. The values given in Table 5(a) are the percentages of audio files transformed by a given effect for which a given attribute has been selected. For instance, 95.7% of audio files to which a bandpass filter has been applied exhibit a change in loudness as a perceptual attribute. The choice none/unknown is only counted if all choices of the main and secondary attributes have been selected as such.

The results show that for the majority of effects there is one particular attribute that has been chosen the most. Notable exceptions are the lowpass filter, where timbre/quality (75.2%) is closely followed by loudness (71.4%), and the stereo phaser, where timbre/quality scores 64.3% and space scores 59.0%. There are several effects for which many test participants selected no attribute. The highest number of cases of this selection can be found for the vibrato effect. Indeed, after pitch/harmony (42.4%) this choice has here the highest percentage (32.4%). Many participants also selected none/unknown for the compressor, which may be due to the fact that we did not use an extreme threshold and release setting, making the effect rather subtle. Furthermore, the effect is not equally noticeable in the different audio examples that are selected randomly in the test. Interestingly, the participants choosing loudness as one main attribute have a higher average expertise in the areas concerned specifically with electronic music and audio DSP than those not identifying any other main attribute for the compressor (see Table 3). The expertise in playing a musical instrument did not affect the selections. This trend can also be observed with the vibrato effect if we compare the number of selections of either pitch/harmony or timbre with the number of selections of none/unknown (see Table 4). The listening conditions of these participant groups did not differ significantly from the overall statistics.

We compared the results from the listening test to the classification system proposed by [19, 20]. In Table 5(b) the effects we included in our experiment that also appear in the classification system are shown. The percentages in brackets following the abbreviations for the perceptual attributes are taken from our test results. In some cases our test
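The combined counting described in Section 6 can be sketched as follows; the response encoding and names are our assumptions, and the data shown is hypothetical:

```python
from collections import Counter

def attribute_percentages(responses):
    """Pool main and secondary attribute selections per effect and return
    each attribute's share as a percentage of responses for that effect."""
    counts, totals = Counter(), Counter()
    for effect, attributes in responses:
        totals[effect] += 1
        for attribute in set(attributes):  # one vote per response
            counts[(effect, attribute)] += 1
    return {key: 100.0 * n / totals[key[0]] for key, n in counts.items()}

# Hypothetical responses: (effect, attributes chosen by one participant).
responses = [
    ("bandpass filter", ["loudness", "timbre/quality"]),
    ("bandpass filter", ["loudness"]),
    ("vibrato", ["pitch/harmony"]),
    ("vibrato", ["none/unknown"]),
]
shares = attribute_percentages(responses)
print(shares[("bandpass filter", "loudness")])  # 100.0
print(shares[("vibrato", "none/unknown")])      # 50.0
```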


results do not directly reflect the compared classification system. For instance, only a relatively small percentage of test participants chose pitch/harmony for the distortion effect. For the Doppler effect the test result identifies pitch/harmony as the most often associated perceptual attribute. In the filter category (bandpass, lowpass and highpass) the results for the bandpass filter differ in that loudness was chosen most, with timbre/quality following second. The classification system states pitch/harmony as a secondary perceptual attribute for fuzz distortion, whereas in our test this attribute has been selected for this effect in only 10% of the audio examples. Furthermore, the weighting of the main and secondary attribute for pitch discretisation differs between the classification system and the evaluation. Loudness was not perceived as an important attribute for variable speed replay. The low percentage for loudness for the vibrato effect may be the result of our choice to only perform mild amplitude modulation as compared to the frequency modulation for our audio examples. Our results also show a relatively high percentage of selections of pitch/harmony for the comb filter. The listening test included a resonant comb filter, which is not specified in the compared classification system. This may explain the high number of selections for this attribute, due to the resonant harmonics caused by the feedback filter.

Discipline               loudness   none/unknown
audio production         2.70       2.24
elec. music composition  2.38       1.88
playing an instrument    2.85       2.88
audio DSP                2.71       2.28

Table 3: Average expertise scores for test participants choosing loudness, and none/unknown, as perceptual attributes for the dynamic range compression effect.

Discipline               pitch/harmony   none/unknown
audio production         2.73            2.46
electr. music compos.    2.39            2.10
playing an instrument    2.87            2.90
audio DSP                2.72            2.46

Table 4: Average expertise scores for test participants choosing loudness or pitch/harmony, and none/unknown, as perceptual attributes for the vibrato effect.

Although overall the test results confirm the compared classification system to a large degree, in some cases the perceptual attributes that have been objectively determined do not reflect the choices test participants would make when given audio examples. For instance, although pitch discretisation mostly affects the pitch of the audio material, our examples showed the preferred choice being timbre/quality. Although in this case we chose an effect setting of relatively high intensity, this outcome seems to confirm [10], stressing the interdependency of pitch and timbre perception.

7. CONCLUSIONS

We conducted a listening test in order to create and evaluate an audio effect classification system based on auditory perceptual attributes. Such a taxonomy facilitates the identification and selection of audio effects by non-technical factors. The listening test participants consisted of members of the music technology and academic music communities approached through calls on various mailing lists. The results show that for the majority of effects there is agreement with the attributes assigned to the effects in the taxonomy.

More precise results may be achieved with a larger group of test participants, and a larger database that consists of audio examples that have been transformed by audio effects with varying settings. This may reveal differences in the weighting of perceptual attributes depending on these settings. Furthermore, with a larger test group the relationship between the audio material and the effects may be established, for instance by comparing the results for one effect applied to harmonic and percussive content. For the audio effects ontology we use a classification system that is expressed using the Web Ontology Language (OWL), based on a combination of the compared classification system and the results of the listening test.

[Table 5 appears here; its numerical contents could not be recovered from the extracted text.]

Table 5: a) Results of the listening test. The percentages represent the proportion of audio files transformed by a given effect that have been assigned a given perceptual attribute. In brackets are the percentages of audio files transformed with the effect that have been labeled with a given attribute in our listening test. b) Classification of audio effects by perceptual attributes affected by their application, based on the system proposed in [19, 20] (L: loudness, D: duration and rhythm, P: pitch and harmony, S: space, T: timbre and quality). Italics denote effects for which there are discrepancies between the test results and the classification system. (1: combined results of lowpass, highpass and bandpass filters; 2: combined results of upwards and downwards pitch shifting; 3: combined results of speed increase and speed decrease.)
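As a rough illustration of how per-effect response proportions like those in Table 5(a) can be turned into a main/secondary attribute classification like Table 5(b), the following sketch assigns the most frequently chosen attribute as the main one. Note this is a hypothetical reconstruction, not the paper's procedure: the `classify` function, the 20% secondary threshold, and the example proportions are all illustrative assumptions.

```python
# Hypothetical sketch (not the paper's actual procedure): derive an effect's
# "main" and "secondary" perceptual attributes from listening-test response
# proportions. Assumptions: main = most frequently chosen attribute;
# secondary = any other attribute chosen above an illustrative 20% threshold.

ATTRIBUTES = ("L", "D", "P", "S", "T")  # loudness, duration/rhythm,
                                        # pitch/harmony, space, timbre/quality

def classify(responses, secondary_threshold=0.20):
    """responses: dict mapping attribute code -> proportion of participants."""
    ranked = sorted(responses.items(), key=lambda kv: kv[1], reverse=True)
    main = ranked[0][0]
    secondary = [a for a, p in ranked[1:] if p >= secondary_threshold]
    return main, secondary

# Made-up proportions in the style of Table 5(a), e.g. a spatial effect:
main, secondary = classify({"L": 0.05, "D": 0.03, "P": 0.12, "S": 0.76, "T": 0.25})
print(main, secondary)  # -> S ['T']
```

Comparing such derived labels against the system of [19, 20] is then a per-effect check of whether the test-derived main attribute matches the predicted one.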


8. REFERENCES

[1] X. Amatriain, J. Bonada, À. Loscos, J. L. Arcos, and V. Verfaille. Content-based transformations. Journal of New Music Research, 32(1), 2003.

[2] ANSI. S1.1-1994 (R2004): American National Standard Acoustical Terminology. American National Standards Institute, 2004.

[3] J. Blauert. Spatial Hearing: The Psychophysics of Human Sound Localization. The MIT Press, revised edition, 1983.

[4] E. F. Clarke. Rhythm and timing in music. In D. Deutsch, editor, The Psychology of Music, 2nd Edition, pages 473–500. Academic Press, 1999.

[5] N. D. Cook and T. Hayashi. The psychoacoustics of harmony perception. American Scientist, 96, 2008.

[6] D. Deutsch. The processing of pitch combinations. In D. Deutsch, editor, The Psychology of Music, 2nd Edition. Academic Press, 1999.

[7] D. Deutsch, editor. The Psychology of Music, 2nd Edition. Academic Press, 1999.

[8] G. Fazekas and M. B. Sandler. The Studio Ontology framework. Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR 2011), 2011.

[9] P. Fraisse. Perception and estimation of time. Annual Review of Psychology, 35, 1984.

[10] A. J. M. Houtsma. Pitch and timbre: Definition, meaning and use. Journal of New Music Research, 26(2), 1997.

[11] T. C. Justus and J. J. Bharucha. Music perception and cognition. In H. Pashler and S. Yantis, editors, Steven's Handbook of Experimental Psychology, 3rd Edition, volume 1: Sensation and Perception. John Wiley & Sons, 2002.

[12] C. L. Krumhansl. The cognition of tonality - as we know it today. Journal of New Music Research, 33(3), 2004.

[13] H. F. Olson. The measurement of loudness. Audio, Our 25th Year, February 1972.

[14] H. Pashler and S. Yantis, editors. Steven's Handbook of Experimental Psychology, 3rd Edition. John Wiley & Sons, 2002.

[15] R. Rasch and R. Plomp. The perception of musical tones. In D. Deutsch, editor, The Psychology of Music, 2nd Edition, pages 89–112. Academic Press, 1999.

[16] J.-C. Risset and D. L. Wessel. Exploration of timbre by analysis and synthesis. In D. Deutsch, editor, The Psychology of Music, 2nd Edition. Academic Press, 1999.

[17] E. Terhardt. Die Tonhöhe harmonischer Klänge und das Oktavintervall. Acustica, 24, 1971.

[18] E. Terhardt. Pitch, consonance and harmony. Journal of the Acoustical Society of America, 55(5), 1974.

[19] V. Verfaille, C. Guastavino, and C. Traube. An interdisciplinary approach to audio effect classification. Proceedings of the 9th International Conference on Digital Audio Effects (DAFx-06), Montreal, Canada, September 2006.

[20] V. Verfaille, M. Holters, and U. Zölzer. Introduction. In U. Zölzer, editor, DAFX - Digital Audio Effects, 2nd Edition. J. Wiley & Sons, 2011.

[21] T. Wilmering, G. Fazekas, and M. B. Sandler. High level semantic metadata for the control of multitrack adaptive audio effects. Presented at the 133rd Convention of the AES, San Francisco, USA, 2012.

[22] T. Wilmering, G. Fazekas, and M. B. Sandler. The Audio Effects Ontology. Proceedings of the 14th International Society for Music Information Retrieval Conference (ISMIR 2013), 2013.

[23] U. Zölzer. DAFX - Digital Audio Effects. J. Wiley & Sons, 2nd edition, 2011.
