
QUANTIFYING THE CONSONANCE OF COMPLEX TONES WITH MISSING FUNDAMENTALS

A THESIS SUBMITTED TO THE DEPARTMENT OF ELECTRICAL ENGINEERING AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD UNIVERSITY IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF ENGINEER

Song Hui Chon, June 2008

© Copyright by Song Hui Chon 2008. All Rights Reserved.

Approved for the department.

(Julius O. Smith, III) Advisor

Approved for the Stanford University Committee on Graduate Studies.

Abstract

Consonance is one of the more fundamental perceptual criteria in music, and one on which Western music theory is built. It is closely related to musical intervals, explaining why certain intervals sound better than others. Recently, the concept of tonal consonance has been distinguished from that of musical consonance. Tonal consonance is about the consonance or "pleasantness" of tones when sounded together, while musical consonance is about intervals. Tonal consonance appears to be a superset of musical consonance, in that musically consonant intervals are tonally consonant. The "missing fundamental" is an interesting psychoacoustical phenomenon. When we hear a set of tones whose frequencies are integer multiples of a fundamental that is missing from the sound, we identify it as a complex tone whose pitch is that of the missing fundamental. This phenomenon enables producing decent bass sound out of mini-sized speakers. Then the question is: why do we hear something that is not there? This thesis deals with the intersection of tonal consonance and the missing fundamental. While trying to explain some data from a psychoacoustical experiment, I stumbled onto the concept of tonal consonance. This work builds upon that earlier work, with the addition of missing fundamental analysis. The work covered in this thesis finds that the consonance of most sound stimuli remained nearly constant regardless of the loudness level at which they were presented.

It also supports a previous conclusion that each type of stimulus has its own intrinsic consonance value. A new discovery here is that the consonance values for the stimuli considered seem to be grouped by bandwidth.

Acknowledgments

I would first like to thank my advisor, Julius O. Smith, for his constant support and guidance. It was his advice that first prompted me to embark on this exciting journey of applying engineering techniques to the applications of music. I would also like to thank the CCRMA community, including the faculty there – Jonathan Berger, Chris Chafe, Jonathan Abel, Marina Bosi, Malcolm Slaney and David Berners. I had the honor of being the teaching assistant for Drs. Abel, Berners and Bosi, to whom I owe much of my audio compression and effects knowledge. They were teachers, mentors and friends for the past four years of my life.

There are my friends and colleagues at CCRMA and Electrical Engineering who have been there for me when I needed guidance – Woon Seung Yeo, Kyogu Lee, Sook Young Won, Juhan Nam, Minjong Kim, Greg Sell, Gautham Mysore, Ed Berdahl and David Yeh. They would teach me, help me and discuss with me, and that eventually led me here. I would especially like to thank Daniel Steele, with whom I worked on a research project and published a paper. In doing the research we collaborated on, I found the topic of consonance quantification and its underlying problems, which eventually became this thesis.

I owe a big thank-you to my church community at Calvary Chapel Mountain View, including Inma and David Robinson, Lisa Erickson, Regina and Kirill Buryak, Bridget Ingham and others. They have been unvaryingly understanding, supportive and loving during my many ups and downs in the last few years, with prayer and encouragement. My life at Stanford would have been a lot more challenging without their support.

My special thanks go to Doctor Ik Keun Hwang of Chonbuk National University in Korea. Without his medical care, I would not be here now studying what I am curious about. I am eternally indebted to him for his thorough care and the miracle he produced.

The biggest thanks is for my family, for their love and support. They have shown continuous and unconditional love throughout my entire life. I feel very blessed to belong to my family. It was my parents' love and encouragement that enabled me to pursue a degree at Stanford. My caring sister was always there for me with outstretched arms. My brother, who is also studying at Seoul National University in Korea, was a friend and colleague with whom I had numerous research conversations.

Last, but certainly not least, I thank God for my past four years at Stanford. He brought me here and opened the doors for me to study music and engineering together. My life often took an unexpected turn, but no matter what I was going through, God was always faithful to His Words. I hope that this thesis is a testimony to His glory.

Contents

Abstract

Acknowledgments

1 Introduction
  1.1 Introduction and Motivation
  1.2 Consonance
      1.2.1 Tonal Consonance
  1.3 Missing Fundamental

2 Quantifying Consonance
  2.1 Introduction
  2.2 The Proposed Algorithm
      2.2.1 How YIN Works
  2.3 Experiment
      2.3.1 A Simple Counterexample for Plomp and Levelt

3 Annoyance Perception Experiment
  3.1 Stimuli
  3.2 Procedure
  3.3 The Matter of Weighting
  3.4 Results

4 Results and Discussion
  4.1 QTC Values of Annoyance Stimuli
  4.2 Correlation Coefficients
  4.3 Fundamental Estimation using YIN
      4.3.1 Comparison of Estimated Fundamentals with Human Perception

5 Conclusions and Future Works

A A Perceptual Study of Sound Annoyance

B Sound Annoyance as a Function of Consonance

Bibliography

List of Tables

3.1 Theoretically calculated perceived loudness values using dB(A), dB(B), dB(C) and dB(ITU-R 468)
4.1 The QTC values calculated with the proposed algorithm
4.2 The correlation coefficients of QTC values and loudness levels per stimulus type
4.3 The correlation coefficients of QTC values and annoyance order, per loudness level

List of Figures

2.1 The original QTC algorithm (from Chon et al. [5])
2.2 Illustration of consonant and dissonant parts in a pair of pure tones (from Chon et al. [5])
2.3 The modified QTC algorithm (from [6])
2.4 Perceived consonance and dissonance of a pair of pure tones as defined by Plomp and Levelt in [17] (reproduced from [6])
2.5 The proposed QTC algorithm
2.6 Magnitude spectrum of a) the complex tone, b) the pure tone and c) the mixture
2.7 Spectrogram of a) the complex tone, b) the pure tone and c) the mixture
2.8 Time-domain waveforms of a) the complex tone, b) the pure tone and c) the mixture
3.1 The magnitude spectrum of the six stimuli
3.2 Weighting standards: A-, B-, C-weighting and ITU-R 468
3.3 Weighted decibel levels of six stimuli using A- (solid red line), B- (blue dash-dotted line), C- (black dotted line) and ITU-R 468 (green dashed line) weighting
3.4 The result from the annoyance perception experiment [20]
4.1 QTC values of twenty-four stimuli
4.2 Magnitude spectrum of the mixture from the Simple Test in section 2.3.1 (from figure 2.6)
4.3 Magnitude spectrum of the mixture from the Simple Test in section 2.3.1 and the estimated fundamentals using YIN
4.4 Fundamental estimation procedure in the Peaks and Fundamentals block in figure 2.5

Chapter 1

Introduction

1.1 Introduction and Motivation

Consonance is one of the oldest concepts in music and sound. It is an intuitive concept that resists a formal definition. During my annoyance perception studies [6] [20] I became curious about this concept of consonance and started to wonder whether I could somehow quantify the consonance of sound stimuli and whether it could in turn explain the annoyance experiment data. The attempt was relatively successful, in that it could explain most cases.

While studying papers on consonance, I realized that the literature was sparse, and I began to wonder whether there should be more questions asked in this field, now that psychoacoustics and computational processing capabilities have both developed significantly. It seems time to make advances.

I have decided to pursue the issue of consonance quantification in the case of missing fundamentals, described in section 1.3. My last paper [6] used a classic algorithm by Plomp and Levelt [17] to quantify dissonance between two tones, which regards two pure tones (i.e., sinusoids) as completely consonant when they are more than 1.2 times a critical bandwidth apart. This is the right judgment when only two sinusoids are considered. But in real life, most audio stimuli have complex harmonic structures, so Plomp and Levelt's theory may not hold up well. In fact, a simple listening test reveals a perceived beating between a pure tone and a complex tone with a missing fundamental, where the missing fundamental and the pure tone lie within the same critical bandwidth while the lowest existing harmonic and the pure tone are more than a critical bandwidth apart. We can see using MATLAB that the time-domain waveform exhibits a physical amplitude modulation whose rate corresponds to the rate of the beats in this case. This suggests not only a perceptual phenomenon that can be explained by nonlinear processing in the ear, but also a physical phenomenon of beating even though the fundamental is not there; in psychoacoustics, that beating is perceived as dissonance.

This thesis is an attempt to answer my own question of 'what happens to consonance when the fundamental is missing.' It is by no means a thorough case study, although I have tried my best to study most of the published relevant works. Still, I would like to think this is a small contribution to the field of consonance theory.

1.2 Consonance

In traditional Western music theory, consonance means the harmonic 'togetherness' of components, and is usually used in descriptions of intervals such as a perfect 5th or an octave. There have been different approaches to defining consonance throughout the history of music. Pythagoras used a mathematical approach, saying that two tones whose frequencies form a ratio of small whole numbers are more consonant than those forming ratios of larger numbers [24]. Galileo observed that "agreeable consonances are pairs of tones which strike the ear with a certain regularity; this regularity consists in the fact that the pulses delivered by the two tones, in the same time, shall be commensurable in number, so as not to keep the eardrum in perpetual torment, bending in two different directions in order to yield to the ever-discordant impulses" [18]. Helmholtz argued that the ear is a frequency analyzer, and therefore the more harmonics (or partials) are in coincidence, the more consonant those tones are [24]. Terhardt [21] as well as Gerson and Goldstein [9] used the concepts of tonal fusion and pattern matching. From the audiological perspective, Patterson [16] tried to explain it in terms of neural-firing coincidence. Terhardt defined consonance as "a link between music and psychoacoustics" in [22]. Along the same line, there have recently been attempts to employ consonance as a perceptual parameter [5] [19]. This thesis proposes the Modified Quantified Total Consonance (QTC) algorithm, based on the QTC algorithm proposed by Chon et al. [5], with special consideration of missing fundamentals.

The concept of tonal consonance emerged separately from musical consonance with the developments in psychoacoustics. One central concept in describing tonal consonance is the critical bandwidth, defined as the range of frequency within which two pure tones interfere with each other and thus create dissonance in the form of beating or roughness of sound. This intersection of musical and tonal consonance is sometimes called sensory consonance [19].

1.2.1 Tonal Consonance

Tonal consonance refers to the pleasantness of two simultaneous tones, arising from the absence of roughness or beating between two pure tones. Carol Krumhansl [14] discusses tonal consonance as a concept distinct from musical consonance. A paragraph from her book [14] says:

...Tonal consonance refers to the attribute of particular pairs of tones that, when sounded simultaneously in isolation, produce a harmonious or pleasing effect. Musical consonance, on the other hand, refers to intervals that are considered stable or free from tension, and constitute good resolutions. This kind of consonance depends strongly on the musical style and also the particular context in which the interval is sounded. Thus, musical consonance may bear only a rough correspondence to tonal consonance. ...[14].

Tonal consonance can be thought of as a superset of musical consonance, in that all musically consonant intervals are tonally consonant. This thesis deals with audio stimuli, not necessarily musical ones; therefore, the word "consonance" used in this thesis will refer to tonal consonance.

1.3 Missing Fundamental

Although pure tones are very useful in experiments and research, most sound in the real world is composed of complex tones, where a complex tone here is a harmonic set of pure tones. In a complex tone, all the frequency components are at integer multiples of the fundamental frequency. When we listen to a complex tone, we hear the pitch corresponding to the fundamental frequency, with a certain timbre that comes from the specific frequency distribution of the harmonic components. Even though in most cases the fundamental is included in the complex tone, sometimes it may not be there, and those cases are called "missing fundamentals". Curiously enough, in this case the human brain figures out the missing fundamental and still hears its pitch, although the details of "how" this process works are still in debate.

Examples of missing fundamentals can be found more easily than one may expect. Some stops in pipe organs use this phenomenon to produce very low-pitched sounds. There is also the example of a tiny speaker producing bass sounds, which would not be possible without missing fundamental perception. Telephone lines take advantage of this phenomenon to make low-pitched speech, whose fundamental is below 200 Hz, heard.

Plomp and Levelt broke new ground in the study of consonance perception, but failed to consider this case in their classic paper [17] by assuming a pair of pure tones is totally consonant when they are more than 1.2 times a critical band apart. Kameoka and Kuriyagawa [12] and Geary [8] took the harmonic structure of complex tones into consideration, but they did not include the case of missing fundamentals in their experiments. This thesis is an attempt to add consideration of the missing fundamental case to consonance theory.

Chapter 2

Quantifying Consonance

This chapter introduces the original algorithm from a reference paper and the proposed algorithm for consonance quantification.

2.1 Introduction

The Quantified Total Consonance (QTC) Algorithm was first proposed in [5] as a quantified perceptual measure of tonal consonance in audio stimuli. As indicated in figure 2.1, the algorithm performs harmonic analysis, loudness analysis, critical band analysis and dissonance analysis in the frequency spectrum of the given audio input. Here is a summary of each block’s functionality [5]:

1. Pitch Perception: Assuming that each sound source has two complex tones, this block tries to find harmonic structures, thus estimating fundamentals. The authors used the algorithm by Tolonen and Karjalainen [23] for verification.

2. Harmonic Analysis: Using the fundamental pitches estimated in the previous block, this step produces the harmonic index and amplitude of each peak in the harmonic structure.


Figure 2.1: The original QTC algorithm (from Chon et al. [5]).

3. Loudness Contour: This module converts the physical sound pressure level of each peak into phons, a perceptual measurement unit, using the Equal Loudness Contours from ISO 226 [7].

4. Critical Band Analysis: Based on the consonance model by Plomp and Levelt [17], this block examines whether a pair of peaks is within the interfering range, which is defined as 1.2 times the critical bandwidth.

5. Dissonance Analysis: For a pair of peaks that are within the interfering range,

Figure 2.2: Illustration of consonant and dissonant parts in a pair of pure tones (from Chon et al. [5]).

their dissonance is calculated using the curve defined by Plomp and Levelt [17] in figure 2.4.

6. Weighting and Summation: After calculating the dissonance of every pair of peaks, weighting and summation are performed. The dissonant part is weighted following the idea presented in figure 2.2, which distinguishes the consonant part from the dissonant part in a pair of pure tones. Figure 2.2 is the illustration reproduced from [5] – assuming f1 and f2 are within a critical bandwidth, the loudness of the dissonant part is determined by the smaller magnitude of the two pure tones, and the loudness difference determines the consonance in the sound. After separating dissonant parts from consonant parts, all dissonance values are added together and then normalized by the total energy in the input, which yields the 'total dissonance' value for the particular input audio file. The 'total consonance' is defined as the difference between 1 and the 'total dissonance'.

I came across the original algorithm [5] while trying to explain data from a perceptual experiment [20]. I subsequently developed a modified version of the algorithm, indicated schematically in figure 2.3, in an attempt to find a relationship between annoyance perception and consonance [6].

Figure 2.3: The modified QTC algorithm (from [6]).

The major difference between the original algorithm and the modified one in figure 2.3 is the Masking Analysis module. It was added to take more psychoacoustic phenomena into account. The masking phenomenon is one of the main bases for the success of perceptual audio compression [3].

Figure 2.4: Perceived consonance and dissonance of a pair of pure tones as defined by Plomp and Levelt in [17] (reproduced from [6]).

2.2 The Proposed Algorithm

One question that the model in [6] could not answer regards missing fundamental cases. The algorithm followed the model of Plomp and Levelt [17], and therefore assumed that two tones are completely consonant when they are more than 1.2 times a critical bandwidth apart. That is an appropriate assumption if the audio stimuli are composed of two pure tones only, but in real life most audio stimuli consist of complex tones. Figure 2.4 shows the consonance-dissonance curve that Plomp and Levelt defined in [17].

As is widely accepted, the human auditory pathway performs an ingenious process when a complex tone is heard without its fundamental. Although the details of how it is done are still under debate, the autocorrelation theory seems to be accepted as the explanation of this phenomenon. This theory uses time-domain autocorrelation of the auditory input to detect the periodicity of the missing fundamental. One of these "time-domain analysis" models is the one by Karjalainen and Tolonen [13] – a two-band model using the half-wave rectification that happens in the hair cells.

Kameoka and Kuriyagawa [12] pointed out that Plomp and Levelt considered the dissonance of only adjacent pairs of pure tones [17], and proposed to calculate the dissonance of all unique combinations of two tones. In other words, if there are $m$ tones in the audio stimulus, Plomp and Levelt considered $m - 1$ adjacent pairs for the dissonance calculation, while Kameoka and Kuriyagawa considered $\binom{m}{2} = m(m-1)/2$ pairs. In contrast, what I propose here is to consider the harmonic structure before calculating dissonance. If the given $m$ tones have a harmonic structure with $n$ fundamentals (where some may not be physically present), the new algorithm calculates the dissonance of at most $n(n-1)/2$ pairs. The number of pairs is bounded because the new model still follows the proposal from [17] that two tones are regarded as completely consonant when they are not in the same critical bandwidth.

Figure 2.5 shows the proposed QTC algorithm. It looks very much like the modified QTC algorithm of figure 2.3 from [6]; the only change is in the "Peaks and Fundamentals" block, which is surrounded by a dashed rectangle. Below is a description of each block.

1. Sound Source: The source is assumed to be clean or already de-noised, and stored in a wav file.

2. T/F Conversion: A 44100-point FFT (Fast Fourier Transform) is used for time-to-frequency conversion. The magnitude spectrum is normalized by its maximum absolute magnitude.

Figure 2.5: The proposed QTC algorithm.

3. Peaks and Fundamentals: A peak is defined to be any frequency at which there is a local maximum whose magnitude is over −10 dB. After finding all peaks in the signal, (missing) fundamentals are calculated using YIN, an excellent fundamental estimation algorithm by Alain de Cheveigne and Hideki Kawahara [4]. In case a missing fundamental is detected by YIN, its magnitude is assumed to be the largest magnitude among the partials. For a given fundamental $f_0$, a partial is defined to be a local peak in the window $N f_0 \pm \Delta$, where $N$ is any integer; $\Delta$ is currently set, empirically, to 6 Hz. This assumption that the missing fundamental has the maximum magnitude among its harmonics is a simplification that may not match the loudness perception of missing fundamentals in reality. My literature review revealed no substantial work on this topic, and I think it demands further consideration. This block is discussed in more detail in section 4.3.

4. Masking Analysis: This block takes the set of peaks and estimated fundamentals from the previous block and eliminates local peaks (and estimated fundamentals) that are masked by nearby loud components. The MPEG Psychoacoustic Model 2 [3] was used for the masking effect calculation.

5. Loudness Contour: This block produces Equal Loudness Contours by interpolating the ISO 226 specification [7]. The amplitudes (in dB SPL) of the peaks are then converted into phons using these contours. One of the stimuli for the experiment, pink noise, had a frequency range that extended well beyond 12500 Hz, but only the 20 to 12500 Hz range of the stimulus was considered in this step, because ISO 226 is only defined over that frequency range.

6. Critical Band Analysis: The critical bandwidth formula by Moore and Glasberg [15] was used to determine whether two neighboring peaks are within the same critical bandwidth, hence creating interference in the perception of those tones. A critical bandwidth (defined as an ERB, or Equivalent Rectangular Bandwidth, in the original paper [15]) is given by

$$\mathrm{ERB} = 6.23 f^{2} + 93.39 f + 28.52 \ \mathrm{Hz} \tag{2.1}$$

where $f$ is the center frequency of the critical band, in kHz.

I followed the concept of consonance defined by Plomp and Levelt in [17] and assumed that two peaks do not create any dissonance when they are apart by more than 1.2 times the critical bandwidth at their mid-frequency (the arithmetic mean of the two peak frequencies); a MATLAB sketch of this test appears after this list.

7. Dissonance Calculation: Consider a pair of pure tones with unequal magnitudes, as shown in figure 2.2. The dissonance perceived is determined by the tone with the smaller magnitude. Using all the analyses above and the formula below (from [5]), the total dissonance $D$ of the input signal is calculated:

$$D = \frac{1}{L} \sum_{\forall i < j} \min(l_i, l_j)\, d_{ij} \tag{2.2}$$

where $l_i$ is the loudness (in phons) of the $i$th peak or estimated fundamental, $d_{ij}$ the dissonance between the $i$th and $j$th peaks, and $L$ the total loudness of all peaks.

8. QTC: The quantified total consonance $C$ of the input signal is calculated as $C = 1 - D$.
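To make blocks 6 and 7 concrete, the following MATLAB sketch implements the critical-band test and the dissonance summation of equation (2.2). The function names are mine, and pl_dissonance, standing in for the Plomp and Levelt dissonance curve of figure 2.4, is assumed to exist rather than reproduced here; this is an illustration, not the thesis's actual implementation.

```matlab
% Critical bandwidth (ERB) of Moore and Glasberg [15]; f_khz in kHz.
function bw = erb_hz(f_khz)
  bw = 6.23*f_khz.^2 + 93.39*f_khz + 28.52;   % equation (2.1), result in Hz
end

% Two peaks (in Hz) are treated as interfering when they are closer
% than 1.2 times the ERB at their mid-frequency (arithmetic mean).
function tf = interferes(f1, f2)
  tf = abs(f1 - f2) < 1.2 * erb_hz((f1 + f2) / 2000);
end

% Equation (2.2): total dissonance D over all unique interfering pairs,
% normalized by total loudness, and the QTC value C = 1 - D.
% freqs: peak/fundamental frequencies in Hz; louds: loudnesses in phons.
function C = qtc(freqs, louds)
  D = 0;
  for i = 1:numel(freqs)-1
    for j = i+1:numel(freqs)
      if interferes(freqs(i), freqs(j))
        D = D + min(louds(i), louds(j)) * pl_dissonance(freqs(i), freqs(j));
      end
    end
  end
  C = 1 - D / sum(louds);
end
```

For the counterexample of section 2.3.1, interferes(100, 123) returns true: the two frequencies are 23 Hz apart, while 1.2 times the ERB at their 111.5 Hz mid-frequency is roughly 47 Hz.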

2.2.1 How YIN Works

This section summarizes how YIN by De Cheveigne and Kawahara [4] estimates the fundamental frequency. YIN is described in six steps.

Step 1. Autocorrelation Method: With the given signal $x$, YIN calculates the autocorrelation

$$r_t(\tau) = \sum_{j=t+1}^{t+W} x_j \, x_{j+\tau} \tag{2.3}$$

where $t$ is the time index, $\tau$ is the time lag and $W$ is the window size. The peak of the autocorrelation (excluding the one at $\tau = 0$) indicates the period of the fundamental.

Step 2. Difference Function: The authors used the difference function as another way to estimate the fundamental period of the signal. The difference function is given by

$$d_t(\tau) = \sum_{j=1}^{W} (x_j - x_{j+\tau})^2 \tag{2.4}$$

$$d_t(\tau) = r_t(0) + r_{t+\tau}(0) - 2\, r_t(\tau) \tag{2.5}$$

and the difference function will yield the minimum at the fundamental period in principle.

Step 3. Cumulative Mean Normalized Difference Function: The difference function $d_t(\tau)$ is normalized by a short running average of old values:

$$d'_t(\tau) = \begin{cases} 1, & \text{if } \tau = 0, \\ d_t(\tau) \Big/ \left[ \dfrac{1}{\tau} \displaystyle\sum_{j=1}^{\tau} d_t(j) \right], & \text{otherwise} \end{cases} \tag{2.6}$$

It differs from $d_t(\tau)$ in that it starts at 1 rather than 0, tends to remain large at low lags, and drops below 1 only where $d_t(\tau)$ falls below average.

Step 4. Absolute Threshold: The difference function sometimes produces a subharmonic error, meaning the model estimates a wrong period because higher-order dips happen to be deeper than the period dip. The authors set an absolute threshold and choose the smallest value of $\tau$ that gives a minimum of $d'$ deeper than the threshold.

Step 5. Parabolic Interpolation: This step uses parabolic interpolation to calculate the true frequency from the integer period estimated in the steps above.

Step 6. Best Local Estimate: For each time index $t$, it searches for a minimum of $d'_\theta(T_\theta)$ for $\theta$ within a small interval $[t - T_{\max}/2,\ t + T_{\max}/2]$, where $T_\theta$ is the estimate at time $\theta$ and $T_{\max}$ is the largest expected period.

Notice how the steps are connected to one another. Step 1 (the autocorrelation function) is replaced by Step 2 (the difference function), which in turn is replaced by Step 3 (the cumulative mean normalization operation), upon which Step 4 (the absolute threshold) and Step 6 (the best local estimate) are based. Step 5 is independent of the other steps, but it still relies on the spectral properties of the autocorrelation function (Step 1).
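To make steps 2 through 4 concrete, here is a minimal MATLAB sketch of the difference function, its cumulative mean normalization, and the absolute threshold for a single analysis frame. It omits parabolic interpolation and the best local estimate, and is my simplified illustration rather than the reference YIN implementation:

```matlab
% Sketch of YIN steps 2-4 on one frame x (column vector, length >= 2W).
% fs: sample rate in Hz; thresh: absolute threshold, e.g. 0.1.
function f0 = yin_frame(x, fs, thresh)
  W = floor(numel(x) / 2);                % window size; lags up to W
  d = zeros(W, 1);
  for tau = 1:W                           % step 2: difference function (2.4)
    d(tau) = sum((x(1:W) - x(1+tau:W+tau)).^2);
  end
  dprime = d ./ (cumsum(d) ./ (1:W)');    % step 3: cumulative mean norm (2.6)
  tau = find(dprime < thresh, 1);         % step 4: smallest lag under threshold
  if isempty(tau)
    [~, tau] = min(dprime);               % fall back to the global minimum
  end
  f0 = fs / tau;                          % period in samples -> frequency
end
```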

2.3 Experiment

2.3.1 A Simple Counterexample for Plomp and Levelt

A preliminary test was carried out to verify the dissonance (as exemplified by the existence of beating, physical and/or perceptual). A pure tone was created at 123 Hz with a duration of 500 milliseconds. A complex tone consisting of 800, 900 and 1000 Hz tones, corresponding to the 8th, 9th and 10th harmonics of a 100 Hz complex tone, was generated for the same duration. The pure tones all have the same amplitude. The complex tone is heard at the pitch of 100 Hz with a very bright and sharp timbre, caused by the removal of the lower harmonics. Incidentally, the resulting sound has a speech-like quality, resembling the vowel /a/.

According to the critical bandwidth calculation formula in [15], the critical bandwidth of a 123 Hz tone is 100 Hz, meaning any tone from 73 Hz to 173 Hz will interfere with this 123 Hz tone. The critical bandwidths of the 800, 900 and 1000 Hz tones are 150, 150 and 160 Hz, respectively. So, if Plomp and Levelt were right that there is no dissonance between two tones outside of a critical band [17], we should not hear any dissonance in the combination of the pure tone and the complex tone, since 800,


Figure 2.6: Magnitude spectrum of a) the complex tone, b) the pure tone and c) the mixture.

900, 1000 Hz are harmonious with one another and 123 Hz is certainly too far from those complex-tone components to create dissonance. The magnitude spectra of the three signals are presented in figure 2.6. Clearly there is no frequency component corresponding to the missing fundamental at 100 Hz. The spectrogram in figure 2.7 also confirms that there are no frequency components besides the tones at 123, 800, 900 and 1000 Hz. Figure 2.8 shows the time-domain waveforms of the three signals; the x-axis is time in seconds and the y-axis amplitude. The first subfigure shows a period of 0.01 seconds, which corresponds to the 100 Hz (missing) fundamental.


Figure 2.7: Spectrogram of a) the complex tone, b) the pure tone and c) the mixture.

The second subfigure looks as expected; it shows a periodicity of about 0.008 seconds, close to the 0.0081-second period of a 123 Hz pure tone. What is interesting is the third subfigure: it shows a modulation in amplitude. The modulation causes an audible beating, and the beating frequency can be roughly estimated from the figure to be about 25 Hz (the visible period is slightly larger than 0.04 seconds), which is quite close to the true beating frequency of 23 Hz. This comes from the frequency difference between the missing fundamental at 100 Hz and the pure tone at 123 Hz.


Figure 2.8: Time-domain waveforms of a) the complex tone, b) the pure tone and c) the mixture.

This figure clearly illustrates that the beating caused by the missing fundamental and its neighboring pure tone is a physical phenomenon, not just a psychoacoustical one. This contradicts the theory of Plomp and Levelt that only adjacent pairs of tones within the same critical bandwidth contribute to dissonance. One can hear that the mixture has a distinct dissonance coming from the beating. The same beating is heard when adding together two missing-fundamental complex tones whose fundamentals are within the same critical bandwidth; the only notable difference is in the sound quality, in that the dissonance in that case seems to be greater than in the simple example above.
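The test signals are easy to reproduce. The thesis does not list its script, so the following MATLAB sketch is my reconstruction from the description above; it synthesizes the pure tone, the missing-fundamental complex tone and their mixture, and makes the roughly 23 Hz envelope modulation visible and audible:

```matlab
fs = 44100;                          % sample rate in Hz
t  = (0:round(0.5*fs)-1) / fs;       % 500 ms time axis
pure  = sin(2*pi*123*t);             % pure tone at 123 Hz
ctone = sin(2*pi*800*t) + sin(2*pi*900*t) + sin(2*pi*1000*t);
mix   = pure + ctone;                % all components with equal amplitude

plot(t(1:4410), mix(1:4410));        % first 100 ms: ~23 Hz amplitude beats
xlabel('Time (seconds)'); ylabel('Amplitude');
soundsc(mix, fs);                    % the beating is clearly audible
```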

Chapter 3

Annoyance Perception Experiment

The main goal behind the proposed algorithm in section 2.2 was to explain the annoyance perception data from an earlier work, included in Appendix A. The data were previously analyzed in terms of consonance without consideration of the harmonic structure (e.g., missing fundamentals); that analysis is included in Appendix B. This chapter presents the data from the perceptual experiment of Appendix A and the analysis results using the proposed algorithm.

3.1 Stimuli

Six types of MATLAB-generated audio stimuli were used for this experiment, at four different loudness levels: 50, 60, 70 and 80 dB(A), where dB(A) means the A-weighted level in decibels (see figure 3.2). There are other popular weightings, such as B- and C-weighting [2] and ITU-R 468 [11] (also shown in figure 3.2), and section 3.3 will show that the choice of weighting method does not affect this experiment. The six stimuli are:

1. Pink noise (PN),



Figure 3.1: The magnitude spectrum of the six stimuli

2. Two sinusoids (TS), which consists of two pure tones at 1000 and 1023 Hz with equal amplitudes.

3. Narrowband noise (NBN), which is generated by filtering Gaussian white noise with a bandpass filter whose passband is from 1000 to 1500 Hz.

4. Broadband noise (BBN), which is generated in the same fashion as NBN; the only difference is the passband of the filter – from 500 to 5000 Hz for BBN.

5. Narrowband tone complex (NTC), which has ten sinusoids of random amplitudes and frequencies, uniformly distributed between 1000 and 1500 Hz.

6. Broadband tone complex (BTC), which consists of 100 sinusoids of equal amplitudes between 500 and 5000 Hz whose frequencies are logarithmically distributed so that there are an equal number of tones per octave.

Figure 3.1 shows the spectra of the six stimuli.
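The thesis does not list its generation scripts; as an illustration, here is a minimal MATLAB sketch for the two tone-complex stimuli, following the descriptions above (variable names are mine):

```matlab
fs = 44100; t = (0:fs-1)/fs;                    % 1 second at 44.1 kHz

% NTC: ten sinusoids with random amplitudes and frequencies
% uniformly distributed between 1000 and 1500 Hz.
fN  = 1000 + 500*rand(10, 1);
aN  = rand(10, 1);
ntc = sum(aN .* sin(2*pi*fN*t), 1);

% BTC: 100 equal-amplitude sinusoids, logarithmically spaced between
% 500 and 5000 Hz, so there is an equal number of tones per octave.
fB  = logspace(log10(500), log10(5000), 100)';
btc = sum(sin(2*pi*fB*t), 1);
```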

3.2 Procedure

Each subject was asked to fill out a questionnaire concerning potentially relevant distinctions before the experiment. The subjects were asked for their gender, age, ethnic and language background, musical training, and whether they identified as an introvert or extrovert.

All 276 unique pairs of stimuli were presented (corresponding to 24 choose 2), each stimulus lasting 1000 milliseconds plus 50 milliseconds of ramp-up and ramp-down time to avoid clicks, with a 500 millisecond pause between subsequent stimuli. Each subject was asked to select which of the two stimuli was more annoying according to their judgment (the authors did not define the term 'annoyance' for the subjects). The experiment did not allow the subjects to respond that both stimuli were equally annoying, even if that may have applied in some cases.

The experiment was performed monaurally (left ear) on Sony MDR-7506 headphones, coupled with an M-Audio Omni I/O mixer connected to a Linux computer running MATLAB, Version 7.3. Before each session, the authors used a sound meter set to measure decibel levels with the A-weighting scheme to calibrate the headphone volume.


Figure 3.2: Weighting standards: A-, B-, C-weighting and ITU-R 468

3.3 The Matter of Weighting

There are a few standard weighting methods in sound research, such as A-, B- and C-weighting [2] and ITU-R 468 [11]. As shown in figure 3.2, each weighting scheme looks quite different and therefore has a different use. The A-weighting is often used for environmental studies, while B- and C-weightings are useful for louder noises such as aircraft turbulence [1]. ITU-R 468 was later proposed as an appropriate perceptually based weighting scheme for all types of sound, though it is not much used in the U.S. The bump in the ITU-R 468 curve in figure 3.2 reflects the psychoacoustical fact that humans are considerably more sensitive to noise in the region of 6 kHz.


Figure 3.3: Weighted decibel levels of six stimuli using A- (solid red line), B- (blue dash-dotted line), C- (black dotted line) and ITU-R 468 (green dashed line) weighting

There was a concern about using A-weighting in the experiment, since A-weighting has been criticized as inappropriate for human hearing measurement [10] because it is valid only for relatively quiet sounds around 40 phons (i.e., 40 dB at 1 kHz). But the following analysis demonstrates that, at least for this particular experiment, this difference is irrelevant.

Figure 3.3 shows the perceptually weighted dB values of the twenty-four stimuli (six types at four loudness levels) using A-, B-, C- and ITU-R 468 weightings. What one sees first is that all the lines are positioned very closely, almost on top of one another. The exception is the ITU-R weighted values (the green dashed line), which sometimes have a visible offset from the other lines, as in PN, BBN and BTC. This makes sense, since A-, B- and C-weightings are quite similar in figure 3.2 while ITU-R 468 shows a much higher gain between 1000 and 10000 Hz. While TS, NBN and NTC have small bandwidths around 1000 Hz, where the gain is around 0 dB in all four weighting schemes, PN, BBN and BTC have wider bandwidths, so ITU-R 468's higher gain generates consistently higher weighted dB values.

Table 3.1 lists the perceptually weighted values shown in figure 3.3. As figure 3.3 already shows, the numbers can be quite different depending on the weighting scheme used. They also show that those differences come from the spectral content of the stimuli: they are quite small when the bandwidth of the stimuli is narrow (the cases of TS, NBN and NTC) and the frequencies lie in the range where all four weighting methods are close in gain, while the differences can be larger for stimuli with wider bandwidths (the cases of PN, BBN and BTC).

What should be noticed in figure 3.3 is the equal slope in all cases. The equal slope between successive points in each graph (i.e., the relationship between the x-axis and the y-axis) is important because it means a 10 dB increase in loudness corresponds to a consistent increase in perceived loudness, whether the increase was from 50 to 60 dB or from 70 to 80 dB. Also, the consistent offset between each pair of lines illustrates that even though the theoretically calculated perceived loudness may differ, the difference is consistent, and therefore we can ignore it (and, furthermore, the choice of weighting method) for the purposes of this study.
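For reference, the A-weighting gain plotted in figure 3.2 has a closed form; a minimal MATLAB sketch follows (the constants are the standard IEC 61672 values, not taken from the thesis):

```matlab
% A-weighting gain in dB at frequency f (Hz, scalar or vector).
function a = a_weight_db(f)
  num = (12194^2) .* f.^4;
  den = (f.^2 + 20.6^2) ...
        .* sqrt((f.^2 + 107.7^2) .* (f.^2 + 737.9^2)) ...
        .* (f.^2 + 12194^2);
  a = 20*log10(num ./ den) + 2.00;   % +2.00 dB gives 0 dB gain at 1 kHz
end
```

For example, a_weight_db(1000) is approximately 0 dB and a_weight_db(100) about −19 dB, matching the low-frequency roll-off visible in figure 3.2.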

Table 3.1: Theoretically calculated perceived loudness values using dB(A), dB(B), dB(C) and dB(ITU-R 468)

Stimulus   Loudness     ITU-R 468   A-weighted   B-weighted   C-weighted
           level (dB)   values      values       values       values
PN         50           49.5574     42.1305      42.5399      44.0690
PN         60           59.5573     52.1306      52.5399      54.0690
PN         70           69.5573     62.1305      62.5399      64.0690
PN         80           79.5573     72.1306      72.5399      74.0690
TS         50           45.8231     45.7640      45.7025      45.6982
TS         60           55.8229     55.7639      55.7025      55.6982
TS         70           65.8228     65.7639      65.7024      65.6981
TS         80           75.8228     75.7639      75.7024      75.6981
NBN        50           47.3884     46.4201      45.8575      45.8204
NBN        60           57.3883     56.4202      55.8576      55.8205
NBN        70           67.3883     66.4201      65.8575      65.8204
NBN        80           77.3883     76.4202      75.8575      75.8204
BBN        50           53.9672     46.5973      45.3785      45.3116
BBN        60           63.9673     56.5974      55.3785      55.3116
BBN        70           73.9674     66.5975      65.3786      65.3117
BBN        80           83.9674     76.5975      75.3786      75.3117
NTC        50           46.8876     46.2835      45.8772      45.8500
NTC        60           56.8873     56.2832      55.8770      55.8498
NTC        70           66.8872     66.2832      65.8769      65.8497
NTC        80           76.8872     76.2832      75.8769      75.8497
BTC        50           52.1434     46.2109      45.5288      45.5122
BTC        60           62.1435     56.2109      55.5289      55.5122
BTC        70           72.1435     66.2109      65.5289      65.5122
BTC        80           82.1435     76.2109      75.5289      75.5122


Figure 3.4: The result from the annoyance perception experiment [20]

3.4 Results

Steele and Chon [20] recruited sixteen subjects aged between 16 and 41, with various degrees of musical training. They suspected that factors like age, gender, cultural background (first language) and musical training may play a role in the annoyance perception of audio stimuli, but did not find any conclusively supporting data.

Figure 3.4 represents the degree of annoyance as a function of loudness per stimulus type. It was generated by summing the annoyance ranking data from all subjects. The graph shows that it is possible for rankings to switch places depending on the level, especially, it appears, at low stimulus levels. It also shows that loudness can have an unexpected nonlinear effect on stimulus annoyance, most prominently on the Narrowband Tone Complex.

The numbers on the y-axis of figure 3.4 do not carry any importance; what is important is the relative height of a point with respect to the others and the slope between points. Even at a relatively quiet level of 50 dB, Broadband Tone Complex was highly annoying; in fact, it was more annoying than the other five types of stimuli at a 60 dB level. The opposite case is Two Sinusoids: even though it was slightly more annoying than Narrowband Noise at 50 dB, the difference there is negligible, and it continues to be the least annoying stimulus type at all loudness levels.

As one would expect, annoyance perception is a function of loudness level. Figure 3.4 shows this in the consistent rise of the data points with loudness for each stimulus type. The figure also suggests that there is an absolute annoyance per stimulus type that is independent of loudness. Among the stimulus types, the graph maintains the order [TS, NBN, NTC, PN, BBN, BTC] at loudness levels of 60, 70 and 80 dB (from bottom to top). It is noticeable that this annoyance order is an increasing function of bandwidth. And for two stimuli of the same bandwidth (such as Narrowband Noise and Narrowband Tone Complex, or Broadband Noise and Broadband Tone Complex), the noise-based stimuli are found to be less annoying than the tone-based ones. This is probably based on the distribution of the noise-derived stimuli, since sounds from uniform maximum-entropy time-frequency distributions are known to be easier to tune out. This may be related to the fact that the perception of tones is different from that of noises, as frequently mentioned in connection with masking effects [3].

Chapter 4

Results and Discussion

4.1 QTC Values of Annoyance Stimuli

This section presents the Quantified Total Consonance (QTC) values calculated with the proposed algorithm and compares them with the QTC values from the earlier study in Appendix B. Table 4.1 shows the QTC values of the twenty-four stimuli, which are illustrated in figure 4.1. Two Sinusoids (TS) exhibited the least consonance, and Narrowband Noise (NBN) turns out to be slightly less consonant than Broadband Noise (BBN); a similar pattern is noticed between Narrowband Tone Complex (NTC) and Broadband Tone Complex (BTC). The x-axis of each subfigure in figure 4.1 is in increasing order of annoyance perception, as shown in chapter 3.

Table 4.1: The QTC values calculated with the proposed algorithm

Stimulus   50 dB     60 dB     70 dB     80 dB
PN         0.5308    0.5308    0.5294    0.5109
TS         0.0833    0.0915    0.1635    0.2658
NBN        0.5272    0.5297    0.5299    0.5314
NTC        0.6110    0.6094    0.6040    0.6040
BBN        0.5661    0.5651    0.5644    0.5639
BTC        0.5860    0.5829    0.5817    0.5814



Figure 4.1: QTC values of the twenty-four stimuli

Unlike the old data in Appendix B, the annoyance order is no longer a linear function of consonance (or QTC values) and stimulus bandwidth. Rather, what can be observed from this figure is that the consonance seems to follow different patterns for narrowband signals (i.e., TS, NBN and NTC) and broadband signals (i.e., PN, BBN and BTC). For stimuli with the same bandwidth (NBN and NTC, or BBN and BTC), the tone-based signals seem to be more consonant than the noise-based ones. This appears to be related to the psychoacoustical fact that the brain reacts differently to tones than to noises [3]. Perhaps it is more sensitive to a more consonant stimulus.

Table 4.2: The correlation coefficients of QTC values and loudness levels, per stimulus type

PN        TS        NBN       BBN       NTC       BTC
0.7093    0.7129    -0.7023   -0.6086   -0.7423   -0.6921

Table 4.3: The correlation coefficients of QTC values and annoyance order, per loudness level

50 dB     60 dB     70 dB     80 dB
0.5709    0.5672    0.5715    0.5682

We can see from figure 4.1 that, with the exception of TS, the QTC values for each stimulus type stay about the same regardless of the loudness level. The similar-looking graphs in all four loudness cases in figure 4.1 suggest that consonance is not a function of loudness. These similar QTC values are as expected, since the proposed model calculates the dissonance of each unique pair of peaks from their magnitudes, and the ratios of the magnitudes within a given stimulus remain the same across different loudness levels.

4.2 Correlation Coefficients

As in Appendix B, correlation coefficients were calculated to verify the relationship between QTC and annoyance perception. The correlation coefficient $\rho_{SQ_i}$ is calculated using the formula below:

$$\rho_{SQ_i} = \frac{\mathrm{cov}(S, Q_i)}{\delta_S\,\delta_{Q_i}} \tag{4.1}$$

$$= \frac{E\left[(S - \mu_S)(Q_i - \mu_{Q_i})\right]}{\delta_S\,\delta_{Q_i}} \tag{4.2}$$

for $i = 1, 2, \ldots, 6$, where $S$ is the ordering of input signal loudness levels [50 60 70 80] in dB SPL, $Q_i$ the QTC values of the $i$th stimulus type (in the annoyance perception order [TS, NBN, NTC, PN, BBN, BTC]), and $\mu$ and $\delta$ the mean and standard deviation of the subscripted variable, respectively.

The correlation coefficient $\rho_{AQ_j}$ for QTC values and the stimulus types was calculated in a similar manner:

$$\rho_{AQ_j} = \frac{\mathrm{cov}(A, Q_j)}{\delta_A\,\delta_{Q_j}} \tag{4.3}$$

$$= \frac{E\left[(A - \mu_A)(Q_j - \mu_{Q_j})\right]}{\delta_A\,\delta_{Q_j}} \tag{4.4}$$

for $j = 1, 2, 3, 4$, where $A$ is the ordering [TS, NBN, NTC, PN, BBN, BTC] from annoyance perception, $Q_j$ the QTC values at the $j$th loudness level, and $\mu$ and $\delta$ the mean and standard deviation of the subscripted variable, respectively.

Table 4.2 shows the correlation coefficients of QTC values and loudness levels, and table 4.3 those of QTC values and annoyance order (from the least annoying to the most). As can be seen in table 4.1, the stimuli PN and TS were positively correlated with QTC as the loudness level increased. The other stimuli showed a slight decrease in their respective QTC values as they got louder, which corresponds to the negative correlation coefficients in table 4.2. This means that the QTC value is an increasing function of loudness level for PN and TS, and a decreasing function for the other stimuli. As discussed in chapter 3, the annoyance order [TS, NBN, NTC, PN, BBN, BTC] is in order of increasing bandwidth, with the noise-based stimulus type before the tone-based type within similar bandwidths. In table 4.3, we can see that the annoyance order and the QTC values are moderately correlated, with about the same magnitude regardless of the loudness level. This indicates that at a fixed loudness level, the annoyance order is a function of the QTC values, which in turn are a function of loudness level and stimulus type, as shown in table 4.2. This also hints that the QTC values probably do not change much with a change in loudness, which is illustrated in table 4.1 (with the exception of TS). The slight fluctuation in QTC values per stimulus type seems to come from the fact that phons do not correlate linearly with SPL.
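Each coefficient of equations (4.1) through (4.4) is a standard Pearson correlation and can be computed with MATLAB's corrcoef; here is a sketch for one stimulus type, using the PN row of table 4.1:

```matlab
S  = [50 60 70 80];                    % loudness levels in dB SPL
Qi = [0.5308 0.5308 0.5294 0.5109];    % QTC values for PN, from table 4.1
R  = corrcoef(S, Qi);                  % 2x2 correlation matrix
rho = R(1, 2);                         % the off-diagonal entry is rho_{S,Q_i}
```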


Figure 4.2: Magnitude spectrum of the mixture from the Simple Test in section 2.3.1 (from figure 2.6)

4.3 Fundamental Estimation using YIN

Let’s consider the simple test from section 2.3.1 again. With the given magnitude spectrum of figure 4.2 (reproduced from figure 2.6), human ears perceive the pure CHAPTER 4. RESULTS AND DISCUSSION 35


Figure 4.3: Magnitude spectrum of the mixture from the Simple Test in section 2.3.1 and the estimated fundamentals using YIN

Figure 4.3 shows the magnitude spectrum of the estimated fundamentals of the mixture (shown as two peaks in solid line) together with that of the mixture itself (shown in dotted line). The fundamentals are estimated to be at 112.4481 Hz and 40.0082 Hz.

Figure 4.4: Fundamental estimation procedure in the Peaks and Fundamentals block in figure 2.5

Here is how the estimation is found:

1. With the entire 'mixture' signal given, the 'Peaks and Fundamentals' module in the proposed algorithm (shown in figure 2.5) first analyzes the given signal to find all peaks. Here, a peak is defined to be a frequency whose magnitude, normalized by the maximum magnitude value, is greater than −10 dB.

2. YIN is called and finds that the best estimate for the given signal is 112.4481 Hz. Note that this value is close to the mean of the pure tone frequency of 123 Hz and the missing fundamental at 100 Hz.

3. With the estimated $f_0$, remove all frequency components that are harmonically related to $f_0$. The peak at 900 Hz gets removed in this step. Calculate the ratio of the new total energy after the removal to the total energy before the removal. If the ratio is greater than 0.2, continue with a further fundamental estimation.

4. Run YIN again on the new spectrum to find a new fundamental estimate at 40.0082 Hz. In fact, if we were given the spectral content of [123, 800, 1000] Hz tones, our brain would treat the 123 Hz tone as slightly displaced from 120 Hz, and we would hear a missing fundamental at around 40 Hz, which is the greatest common divisor of 120, 800 and 1000.

5. As in step 3, remove all the harmonics related to the newly estimated fundamental. In this particular example, this removes all three remaining peaks, so the ratio of new-to-old energy becomes almost zero. Since the ratio is less than 0.2, the fundamental estimation process stops.

In more general cases, the fundamental estimation stops when the newly estimated $f_0$ is either greater than the peaks found in step 1 or within 1 Hz of a previous estimate. Figure 4.4 shows the flow of this procedure.
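A compact sketch of this iterative loop follows, with assumed helper functions find_peaks_db, yin_estimate and remove_harmonics behaving as the steps above describe; these names and signatures are mine, not the thesis code, and yin_estimate is taken to run YIN on whatever signal representation remains after each removal:

```matlab
% Iterative (missing-)fundamental estimation, as sketched in figure 4.4.
% spec: magnitude spectrum; f: frequency axis in Hz.
f0s   = [];
peaks = find_peaks_db(spec, f, -10);         % step 1: local maxima above -10 dB
while true
  f0 = yin_estimate(spec, f);                % steps 2 and 4: call YIN
  if f0 > max(peaks) || any(abs(f0 - f0s) < 1)
    break;                                   % general stopping rules
  end
  f0s(end+1) = f0;
  e_old = sum(spec.^2);
  spec  = remove_harmonics(spec, f, f0, 6);  % steps 3 and 5: strip N*f0 +/- 6 Hz
  if sum(spec.^2) / e_old <= 0.2             % stop when the new-to-old energy
    break;                                   % ratio drops to 0.2 or below
  end
end
```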

4.3.1 Comparison of Estimated Fundamentals with Human Perception

It is apparent that the fundamental estimation process returns values different from what humans perceive. As one can easily conclude from the magnitude spectrum in figure 4.2, there are two groups of sounds: a pure tone at 123 Hz and a 100 Hz missing-fundamental complex tone consisting of the 8th, 9th and 10th harmonics. Humans perceive a rough tone around 100 to 123 Hz with beating. This is because we do hear two tones whose frequencies are 100 Hz (whose fundamental is missing) and 123 Hz (a pure tone), but fail to resolve their interference in our perception of the sound. In contrast, when we listen to a mixture of pure tones at 40 Hz and 112 Hz, we are able to distinguish the two frequencies even though the mixture does have some roughness to it.

This discrepancy seems to come from the original YIN algorithm, which works very well for an input with only one harmonic structure in it. It was originally designed for cases with one fundamental and is quite robust in those cases. The 'Simple Test' example, however simple its frequency content appears, had two harmonic structures, which made it much harder for YIN to figure out the fundamentals. This is not desirable, since we want fundamental estimates that are meaningful.

I think, however, that this 'bad' estimation may not matter too much on 'real sound' examples. The stimuli considered for analysis in this thesis were all artificially generated in MATLAB. No stimulus had the regular harmonic structure usually expected in natural sounds; half of the stimuli were sets of pure tones with little or no harmonic structure. On the contrary, I would expect a (natural) sound input to have a harmonic structure when someone is interested in finding a (missing) fundamental in that sound. That sound will probably be monophonic, meaning that there will probably be very little chance (or purpose) in trying to figure out the fundamentals in a polyphonic sound, though humans are innately very good at figuring out several missing fundamentals in a chord. Perhaps someday it may become possible for a computer to find multiple fundamentals sounding at the same time, for applications such as automatic transcription.

Chapter 5

Conclusions and Future Works

Consonance is a fundamental component of sound perception. In this thesis, I revisited the classic consonance theory by Plomp and Levelt [17] and proposed a model that extends the original theory.

The algorithm presented in this thesis quantifies consonance with special consideration for missing fundamental cases. A fundamental estimation algorithm, YIN [4], was used for this. The proposed algorithm was used to analyze annoyance perception data (see Appendix A), which involved audio stimuli artificially generated in MATLAB. These new results did not differ much from the previous ones in Appendix B, and they confirmed that the consonance of a sound input does not change with the loudness level at which it is presented. While most stimuli showed roughly the same Quantified Total Consonance (QTC) values regardless of the loudness level, the QTC values of Two Sinusoids did change quite a bit with loudness level. It is not clear what causes this, or how important this phenomenon is, at this time. Also, the new algorithm does not shed any light on why Broadband Tone Complex was much more annoying than any other stimulus, even at a low loudness level.


What was different in this analysis is that there seems to be a grouping of the QTC values among the narrowband stimuli (i.e., Two Sinusoids, Narrowband Noise and Narrowband Tone Complex) versus those among the broadband stimuli (i.e., Pink Noise, Broadband Noise and Broadband Tone Complex). These groupings seem to be related to the psychoacoustical fact that the brain reacts to tones in a different way than to noises. A further investigation is required to verify this conjecture and judge its significance.

An audio engineer once said that people tend to be sensitive to a sudden change in the frequency spectrum. Since most of the stimuli (with the exception of Pink Noise) have abrupt changes of frequency content around their passband limits, this may affect the annoyance perception of those stimuli. It may also explain why Pink Noise is less annoying than the other broadband stimuli (Broadband Noise and Broadband Tone Complex) even though its bandwidth is larger than those of the other two. The fact that Pink Noise exhibits a natural 1/f frequency decay may be one of the reasons for its low annoyance ranking: it sounds natural and people are used to it.

When tested with a simple synthesized example, the estimated fundamentals returned by the proposed algorithm did not match those in human perception. This is reasoned to be due to the fact that the sound stimulus was artificially synthesized, hence lacking a natural harmonic structure. I would expect the new algorithm to perform as expected with monophonic 'real' input sounds, such as those from instruments. This is left for future work.

Studying consonance has been an interesting journey. For such a perceptually fundamental concept, consonance is found to have been little researched. I hope this thesis brought us closer to understanding human perception of sound and that it will be a building block for other perception-related research in sound.

Appendix A

A Perceptual Study of Sound Annoyance

This appendix is a reproduction of an earlier study by Steele and Chon [20].

Appendix B

An Assessment of Sound Annoyance as a Function of Consonance

This appendix is a reproduction of an earlier study by myself [6].

Bibliography

[1] A-weighting page on Wikipedia, as of 3 June 2008. http://en.wikipedia.org/wiki/A-weighting.

[2] American National Standards Institute (ANSI). American National Standard Specification for Sound Level Meters. 2003.

[3] Marina Bosi and Richard E. Goldberg. Introduction to Digital Audio Coding and Standards. Kluwer Academic Publishers, 2003.

[4] Alain de Cheveigne and Hideki Kawahara. YIN, a fundamental frequency estimator for speech and music. Journal of the Acoustical Society of America, 111(4):1917–1930, 2002.

[5] Sang Bae Chon, In Yong Choi, Mingu Lee, and Koeng-Mo Sung. Quantified total consonance as an assessment parameter for the sound quality. In Proceedings of the Audio Engineering Society, 2006.

[6] Song Hui Chon. An assessment of sound annoyance as a function of consonance. In Proceedings of the Neural Information Processing Systems, Music, Brain and Cognition Workshop, 2007.

[7] International Organization for Standardization (ISO). Standard 226:2003, Acoustics – Normal equal-loudness-level contours. 2003.


[8] J. M. Geary. Consonance and dissonance of pairs of inharmonic sounds. Journal of the Acoustical Society of America, 67(5):1785–1789, 1980.

[9] A. Gerson and J. L. Goldstein. Evidence for a general template in central optimal processing for pitch of complex tones. Journal of the Acoustical Society of America, 63:498–510, 1978.

[10] Rhona P. Hellman and Eberhard Zwicker. Why can a decrease in dB(A) produce an increase in loudness? Journal of the Acoustical Society of America, 82(5):1700–1705, 1987.

[11] International Telecommunications Union (ITU). ITU-R Recommendation 468- 4, Measurement of Audio Frequency Noise in Broadcasting, Sound Recording Systems and on Sound Programme Circuits. 1986.

[12] A. Kameoka and M. Kuriyagawa. Consonance theory part ii: Consonance of complex tones and its calculation method. Journal of the Acoustical Society of America, 45:1460–1469, 1969.

[13] Matti Karjalainen and Tero Tolonen. Multi-pitch periodicity analysis model for sound separation and auditory scene analysis. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, 1999.

[14] Carol Krumhansl. Cognitive Foundations of Musical Pitch. Oxford University Press, 2001.

[15] B. C. J. Moore and B. P. Glasberg. Suggested formulae for calculating auditory-filter bandwidths and excitation patterns. Journal of the Acoustical Society of America, 74(3):750–753, 1983.

[16] Roy D. Patterson. Spiral detection of periodicity and the spiral form of musical scales. 14:44–61, 1986.

[17] R. Plomp and W.J.M. Levelt. Tonal consonance and critical bandwidth. Journal of the Acoustical Society of America, 38(4):548–560, 1965.

[18] Thomas D. Rossing, F. Richard Moore, and Paul A. Wheeler. The Science of Sound. Addison Wesley, 2002.

[19] Esben Skovenborg and Soren H. Nielsen. Measuring sensory consonance by auditory modelling. In Proceedings of the 5th International Conference on Digital Audio Effects (DAFx), 2002.

[20] Daniel L. Steele and Song Hui Chon. A perceptual study of sound annoyance. In Audio Mostly 2007, 2007.

[21] Ernst Terhardt. On the perception of periodic sound fluctuations (roughness). Acustica, 30(4):201–213, 1974.

[22] Ernst Terhardt. The concept of musical consonance: A link between music and psychoacoustics. Music Perception, 1(3):276–295, 1984.

[23] Tero Tolonen and Matti Karjalainen. A computationally efficient multipitch analysis model. IEEE Transactions on Speech and Audio Processing, 8(6):708–716, 2000.

[24] Herman von Helmholtz. On the Sensations of Tone as a Physiological Basis for the Theory of Music, 4th edition, translated by A. J. Ellis (1954). Dover Publications, 1877.