Hidden Markov Models for Music Classification

BY

Xinru Liu

A Study

Presented to the Faculty

of

Wheaton College

in Partial Fulfillment of the Requirements

for

Graduation with Departmental Honors

in Mathematics

Norton, Massachusetts

May, 2019

Acknowledgement

I would like to first thank my thesis advisor, Professor Michael Kahn, for helping me accomplish this year-long honors thesis. I have been taking classes with him since I was a freshman, and I was lucky to be able to do this honors thesis with him in my senior year. I appreciate all the instruction, care, encouragement and love he gave me over the past four years.

Secondly, I would like to thank my parents and my sister, Xinyi Liu, for giving me mental support when I was struggling and frustrated during my senior year. Without their support, I would not have been able to get through this hardship.

Then I would like to thank my piano teacher, Professor Lisa Romanul, for teaching me for four years. This thesis is motivated by my curiosity about the similarity between Western piano pieces, and it was taking piano lessons with her that made me fascinated with classical music. I want to thank her for all the care she gave me.

I would also like to thank all the professors at Wheaton College whose classes I took, especially my committee members, Professor Mike Gousie and Professor Tommy Ratliff, who gave many good suggestions on the thesis I wrote.

Finally, I would like to thank all my friends who shared joy and sorrow with me during the past four years, including Xinyi Liu, Zhuo Chen, Cheng Zhang, Shi Shen, Martha Bodell, Jenny Migotsky, Keran Yang, Weiqi Feng and many others. I had my happiest four years at Wheaton, and they made my life full of fun and energy.

Contents

Acknowledgement

1 Introduction

2 Music Information Retrieval
   2.1 Music Information Retrieval
   2.2 Mel-Frequency Cepstral Coefficients
   2.3 Implementation of MFCCs

3 Hidden Markov Model
   3.1 Hidden Markov Model
       3.1.1 Markov Chain
       3.1.2 Hidden Markov Models
   3.2 Forward algorithm
   3.3 Backward Algorithm
   3.4 Parameter Estimation
       3.4.1 Expectation-Maximization Algorithm
       3.4.2 Baum-Welch algorithm
   3.5 Similarity Metric

4 Initialization
   4.1 Initial parameter estimation
       4.1.1 Model-based agglomerative Hierarchical Clustering

5 Experiments and results
   5.1 Composers
       5.1.1 Bach
       5.1.2 Beethoven
       5.1.3 Schubert
       5.1.4 Chopin
       5.1.5 Debussy
       5.1.6 Schumann
       5.1.7 Schoenberg
   5.2 Experiments
   5.3 Discussion
       5.3.1 Validation
       5.3.2 Analysis of the result
       5.3.3 Accuracy
       5.3.4 Problems and Concerns

6 Conclusion and Future work

A Appendix

List of Figures

Chapter 1

Introduction

Music Information Retrieval (MIR) is concerned with extracting features from music (audio signal or notated music), as well as developing different search and retrieval schemes[1]. With the explosion in the availability of music in the past two decades (both digital audio and musical scores), more individuals have access to large music collections. People use online or streaming music repositories, such as Spotify[2] and Pandora[3], to access music. Others obtain music scores from either music stores or online repositories such as IMSLP[4]. MIR developed in response to music retrieval applications focused on matching personal tastes to corresponding music. For the digital audio representation of music, the main goal of MIR is to characterize different musical features from the audio and bridge the gap between extractable features and human music perception[1]. One of the factors that determines the perception of music is music "similarity", which is difficult to define and is particularly complex because of its numerous parameters (timbre, melody, rhythm, harmony). Similarity metrics measure some inherent structure of a music collection, and the acceptance of a music retrieval system crucially depends on whether the user can recognize similarities between the given piece and the retrieved music. One way of comparing audio recordings is to extract features from the audio signal that reflect the relevant aspects of the recordings, and then to define and compute a measure of similarity on the extracted information.

Timbre is a feature used to distinguish the same tone performed by different instruments, and it is one of the most important dimensions of a piece of music. Mel-frequency cepstral coefficients (MFCCs) are a good measure of the perceptual timbre space[5]. MFCCs also capture the melodic structure in the music, but pitch-related features, like chroma-based features (which meaningfully categorize pitches), are the most powerful representation for describing harmonic information[5]. Rhythmic features also provide information about the music's structure. Timbre, melody and rhythm are three of the most important features representing perceptual cues in a music piece, and researchers mainly focus on extracting these features from the music.

Motivated by the goal of recognizing the similarities and relationships between different music pieces, several statistical models have been developed for analysis. Both supervised and unsupervised classification methods have been applied in previous research. In Xu et al.'s paper, the authors proposed a music genre classification method using a support vector machine, a supervised machine learning algorithm[6]. The paper concludes that multi-layer support vector machines perform better than traditional Euclidean-distance-based methods and statistical learning methods. In Tzanetakis and Cook's paper[7], Gaussian Mixture Models (GMMs) and K-Nearest Neighbors are employed to classify music pieces. However, those methods treated the data as independent and identically distributed samples and failed to take the dependent, dynamic features of the music into consideration[8]. In Qi et al.'s paper[8], the Hidden Markov Model (HMM) was proposed to accurately represent the characteristics of sequential data. The authors apply an HMM mixture model in a Bayesian setting, using a non-parametric Dirichlet process (DP) as a prior distribution, and compute a similarity matrix between the HMM mixture models trained on each piece. The paper compares the results from DP HMM mixture models and DP Gaussian mixture models and concludes that the HMM mixture models better distinguish the content of the given music by taking its temporal character into account, providing sharper contrasts in similarities than the GMMs[8].

The motivation for this thesis comes from my own curiosity about the similarity between Western classical piano pieces by different composers. In its narrow sense, "classical music" refers to the period from 1750 to 1820; the major time divisions of classical music include the Baroque, Classical and Romantic periods. Prominent composers across the entire classical era include Johann Sebastian Bach, Wolfgang Amadeus Mozart, Ludwig van Beethoven, Franz Schubert, Frederic Chopin, Robert Schumann, Claude Debussy and more. Although composing styles differ between composers across the different time periods, there exist interesting connections between those composers and their music. For example, Franz Schubert and Ludwig van Beethoven are often compared and contrasted in a number of ways due to their temporal and spatial proximity to each other in the early nineteenth century. Indeed, there are many similarities in the two composers' piano works. However, their music can be distinguished not only by their different compositional processes but also by their distinctive personalities. Although these characteristics cannot be measured directly, they are reflected in the music, which can be quantified. Similar connections occur when comparing Schumann and Chopin. One of the piano pieces in Schumann's Carnaval Op. 9 is called "Chopin", which Schumann wrote in homage to his colleague Frederic Chopin. There are so many "Chopin elements" in the piece that many people misclassify it as a work by Chopin. So the question is: if the human ear can detect similarities and differences between piano pieces, can a model be built that is able to tell which pieces are more similar to each other? Is there a metric that can be built to measure the similarity?

The goal of this thesis is to build Hidden Markov Models on piano pieces from different composers and develop a similarity metric to measure the similarity between piano works. In order to examine the effectiveness of the model and the similarity metric, a database of pre-characterized piano pieces, each with its own trained Hidden Markov Model, will be built. A new piece can then be put into those trained models, and the "similarity" between the new piece and the pieces in the database will be computed. This way, a music retrieval system can be built that lets people search and retrieve pieces in the database based on a given piece. One difference between this research and other music genre classification research is that here we consider only piano pieces.

Chapter 2 introduces the history of music information retrieval and the feature extraction process, including the definition of Mel-Frequency Cepstral Coefficients (MFCCs), the main feature measurement applied in this research, how MFCCs are computed, and the way to extract them from digital music. Chapter 3 introduces the Markov chain and the Hidden Markov Model, focusing mainly on parameter estimation using an Expectation-Maximization (E-M) algorithm called the Baum-Welch algorithm; a similarity metric is defined using a likelihood value from the Baum-Welch algorithm. Chapter 4 introduces the initialization of model parameters using model-based agglomerative hierarchical clustering. Chapter 5 covers the experiment conducted on the collected piano pieces. The first part introduces seven composers' composing styles, accompanied by MFCC plots from their piano pieces; then the results from computing similarities between new test pieces and the pieces in the trained database are presented. The three pieces in the database most similar to the test piece are returned, and a measure of accuracy is defined based on the pieces returned. The results are analyzed and the problems are discussed. Chapter 6 gives conclusions on the strengths and weaknesses of the retrieval system we developed, and future work to improve the retrieval process.

Chapter 2

Music Information Retrieval

2.1 Music Information Retrieval

Music Information Retrieval (MIR) is a relatively young field, but it has grown vastly in the last two decades for several reasons: 1) the development of audio compression techniques starting in the late 1990s, 2) increased computing power enabling users to extract musical features in reasonable time, 3) the widespread availability of mobile music players, and 4) the emergence of music streaming services such as Spotify[2], iTunes[9], Pandora[3], etc.[1]. MIR is concerned with the extraction of meaningful features from music (audio signal or symbolic representation), the indexing of music using these features, and the development of different search and retrieval schemes[1]. Music retrieval applications are intended to help users find music in large collections using various similarity criteria[1].

One of the most common methods of accessing music is through textual metadata. An example of a metadata-driven music system is Pandora: the user types in the name of their favorite artist or song, and Pandora uses textual metadata to suggest "similar" artists and tracks[10]. Textual metadata provides a reliable method to efficiently retrieve music from millions of entries. One shortcoming, however, is that it becomes much less useful when users do not know the exact name, or other information, of the music or artist they want. This gap gives an opportunity to other retrieval methods. One method, query by example, also called the content-based method, aims at retrieving information from a given melody that is described in terms of features and then compared to documents in a music collection[10]. Figure 2.1 shows a flow chart describing how content-based retrieval is achieved. Music is first digitized into quantitative data and features are extracted from the data. The system then compares those extracted features with the music in the database and matches similar entries to the input features. Finally, the most similar music is retrieved for the users. Shazam[11] is a system that identifies a particular recording from a sample of music and returns the artist, album and track title to the users. Other online music services such as Najio[12] and Soundhound[13] allow users to sing a query and identify the work.

Recently, music streaming services have shifted to user-centric strategies, aiming to take into account different factors in the perception of musical qualities, in particular "music similarity"[1]. Music similarity is complicated to define because of music's numerous features. In order to improve the accuracy of recommendations, much research has been conducted on extracting meaningful features from music content and computing similarities between two pieces, or classifying pieces according to certain criteria[1]. Most of this research focuses on music-genre classification, which classifies music into different genres including classical, jazz, blues, hip-hop, etc. In this research, we study classical piano music. Motivated by the essence of query-by-example, we want to build a database that includes a number of pre-characterized piano works. By comparing the "similarity" between a piece of interest and the music in the database, users can retrieve the pieces in the database similar to the piece they are interested in.

[Query → Digitization → Quantitative data → Feature extraction → Features → Match (against Database) → Retrieved music → Result]

Figure 2.1: Flow chart of content-based MIR query system

2.2 Mel-Frequency Cepstral Coefficients

This research focuses only on the audio signal representation of a piece of music. Audio signals have frequencies ranging roughly from 20 to 20,000 Hz, which correspond to the lower and upper limits of human hearing[10]. The frequency of a sound is defined as the number of times that a cycle is repeated per second, measured in Hertz (Hz)[10]. Figure 2.2 is a waveform representing an audio signal from Bach's Prelude in B flat minor. The amplitude measures the loudness of the audio and is a measure of the intensity of the sound. Figure 2.3 is a spectrogram of the same audio signal, a visualization of the spectrum of frequencies of the sound as it varies with time.

Figure 2.2: Audio signal from Bach Fugue in B Flat Minor

Figure 2.3: Spectrogram from Bach Fugue in B Flat Minor

Important information such as pitch, timbre and rhythm can be used to characterize music. Pitch is the frequency of the respective tone, and a linear series of pitches forms a melody. If the melody is seen as a "horizontal" representation of pitch, the harmony can be described as "vertical" combinations of pitches into chords. Timbre is the characteristic of a musical sound and is usually used to differentiate instruments. A common way of extracting both melody and timbre information from an audio signal is the Mel-Frequency Cepstral Coefficients (MFCCs). The word "cepstrum" is a play on the word "spectrum" and is meant to transform the spectrum into something that better describes the sound characteristics as perceived by a human listener[14]. A mel is a unit of measure for the perceived pitch of a tone. The human ear is sensitive to linear changes in frequency below 1000 Hz and logarithmic changes above[15]. Mel-frequency is a scaling of frequency that takes this fact into account: a logarithmic function is applied to the frequencies, which yields a new scale commonly known as the Mel scale. The Mel scale enables the features extracted from the signal to more closely match what humans hear[15]. Below are the five steps for obtaining the coefficients:

1. Divide the signal into short frames. The audio signal is usually divided into frames of 20-40 ms. If the frame is too short, we do not get a reliable spectral estimate; if the frame is too long, the signal changes too much within the frame and can no longer be treated as statistically stationary[15].

2. For each frame, calculate the periodogram estimate of the power spectrum. A periodogram is an estimate of the spectral density of a signal. The human cochlea (an organ in the ear) vibrates at different spots depending on the frequency of the incoming sounds[15]. As different locations in the cochlea vibrate, different nerves fire and inform the brain that certain frequencies are present. The periodogram estimate works similarly to the cochlea, identifying which frequencies are present in the frame[15].

3. Apply the mel filter bank to the power spectra and sum the energy in each filter. In signal processing, a filter bank is an array of band-pass filters that separates the input signal into multiple components, each one carrying a single frequency sub-band of the original signal[15]. Since the cochlea cannot discern the difference between two closely spaced frequencies, and this effect becomes more pronounced as the frequencies increase, periodogram bins are aggregated and summed to obtain the amounts of energy in various frequency regions[15]. This is performed by the mel filter bank: the first filter is very narrow and gives an indication of how much energy exists near 0 Hz. As the frequencies get higher, the filters get wider, as we become less concerned about variations. The Mel scale tells us exactly how to space the filter banks and how wide each bank is. Figure 2.4 shows a mel filter bank of 12 filters.

Figure 2.4: A Mel-filter bank containing 12 filters

4. Take the log of all the filter bank energies. This is motivated by the fact that humans perceive the frequencies on a non-linear scale.

5. Apply the Discrete Cosine Transform (DCT):

$c_i = \sqrt{\frac{2}{M}} \sum_{j=1}^{M} x_j \cos\!\left(\frac{\pi i}{M}(j - 0.5)\right), \qquad i = 0, 1, 2, \ldots, M$   (2.1)

where $c_i$ is the $i$th cepstral coefficient, $M$ is the number of filter bank channels, and $x_j$ are the calculated log filter bank energies. If $M$ is 12, the transformation returns 12 cepstral coefficients. Usually 2-13 DCT coefficients are computed.

The DCT is applied because the filter banks overlap and there are correlations between the filter energies. The DCT decorrelates the energies so that diagonal covariance matrices can be used to model the features. Only 12 coefficients are kept because the higher DCT coefficients represent fast changes in the filter bank energies, and it turns out that these fast changes actually degrade recognition performance.

2.3 Implementation of MFCCs

This section introduces the procedure for implementing MFCCs on the audio signal. In this research, each audio signal is divided into short frames 25 ms long. For a 90-second piece, there will be around 3600 frames. A 12-dimensional MFCC vector is computed for each frame, so the final feature vector for each frame has the form:

$o_t = \begin{pmatrix} c_{t,1} \\ c_{t,2} \\ \vdots \\ c_{t,12} \end{pmatrix}$

For a data sequence of length 3600, the whole data sequence has the form:

$O = \begin{pmatrix} c_{1,1} & \cdots & c_{3600,1} \\ \vdots & \ddots & \vdots \\ c_{1,12} & \cdots & c_{3600,12} \end{pmatrix}$
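The 25 ms framing and the five-step computation described in Section 2.2 can be sketched end-to-end in NumPy. This is only an illustration, not the extraction code actually used in this research; the FFT length, hop size, filter count and the synthetic test tone are all assumptions made for the sketch:

```python
import numpy as np

def hz_to_mel(f):
    # Standard mel-scale mapping: linear below ~1000 Hz, logarithmic above.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(n_filters, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale (step 3).
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def mfcc(signal, sr, frame_len=0.025, frame_step=0.010,
         n_fft=512, n_filters=26, n_coeffs=12):
    n = int(sr * frame_len)
    hop = int(sr * frame_step)
    fbank = mel_filter_bank(n_filters, n_fft, sr)
    j = np.arange(1, n_filters + 1)
    feats = []
    for start in range(0, len(signal) - n + 1, hop):            # step 1: framing
        frame = signal[start:start + n]
        power = np.abs(np.fft.rfft(frame, n_fft)) ** 2 / n_fft  # step 2: periodogram
        log_e = np.log(fbank @ power + 1e-10)                   # steps 3-4: filter bank + log
        # step 5: DCT, c_i = sqrt(2/M) * sum_j x_j cos(pi*i/M * (j - 0.5)), eq. (2.1)
        c = [np.sqrt(2.0 / n_filters)
             * np.sum(log_e * np.cos(np.pi * i / n_filters * (j - 0.5)))
             for i in range(1, n_coeffs + 1)]
        feats.append(c)
    return np.array(feats).T  # 12 x (number of frames), like the matrix O above

sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)  # synthetic 1-second tone standing in for a recording
O = mfcc(tone, sr)
print(O.shape)  # 12 rows, one column per frame
```

Each column of the returned matrix corresponds to one frame's $o_t$, matching the layout of $O$ above.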

Figure 2.5 shows the 12 MFCCs from Bach's B flat minor fugue, BWV 867. The gaps between the top, middle and bottom coefficients indicate that different coefficients span different frequency ranges. Figure 2.6 shows the histogram of the first MFCC, $c_{1,1}, \ldots, c_{3600,1}$, spanning the whole piece. The other eleven MFCC histograms are included in Appendix A. The histograms indicate that the coefficients in every dimension are reasonably symmetric and exhibit characteristics of a sample from a normal population.

Figure 2.5: 12-dimensional MFCC data from Bach B flat minor BWV867 fugue

Figure 2.6: Histogram of first MFCC from Bach B flat minor BWV867 fugue

Chapter 3

Hidden Markov Model

3.1 Hidden Markov Model

Many statistical models can be applied to the given data. However, an appropriate model assumes that the data can be well characterized by a parametric process whose parameters are precise and well-defined[16]. The MFCC data sequence extracted from the audio signal is a time series: a series of equally spaced data points indexed in time order. The dynamic structural changes in musical content, including melody, harmony and rhythm, get encoded into this musical time series data. A model is desired that can not only capture the dynamic structure of the music but also uncover the hidden information behind the data. A Hidden Markov Model (HMM) is a statistical model with two stochastic processes, one observable and one hidden. In a simple Markov model, the state is visible and all the information is observable. In a Hidden Markov Model, there is a state sequence that is not directly visible but can be "observed" through another observation sequence. Through the observations generated by the hidden states, we can gain a better understanding of the hidden state sequence. HMMs have been applied in fields such as speech recognition and genetic prediction[8]. But how does an HMM work in our music setting? The following sections give a detailed explanation of the Markov chain, the Hidden Markov Model, and how they can be applied to the MFCCs.

3.1.1 Markov Chain

Definition 3.1. A discrete-time Markov chain with finite state space $S = \{1, \ldots, N\}$ is a sequence of random variables $Q_1, \ldots, Q_T$ such that

$P(Q_{t+1} = i_{t+1} \mid Q_1 = i_1, \ldots, Q_t = i_t) = P(Q_{t+1} = i_{t+1} \mid Q_t = i_t) = a_{i_t i_{t+1}}$   (3.1)

where $i_k \in S$ and N is the number of states. A Markov chain satisfies the Markov property: the probability of moving to a certain state $i_k \in \{1, \ldots, N\}$ depends on the whole past history only through the most recent state. A transition matrix is often used to describe a Markov chain; it gives the probability of moving from one state to another:

$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{pmatrix}, \qquad a_{ij} = P(Q_{t+1} = j \mid Q_t = i)$

where $\sum_{j=1}^{n} a_{ij} = 1$ for each $i \in S$. Many phenomena are explained well by Markov chains. For example, finance and economics use Markov chains to model phenomena such as market crashes and asset prices. In weather forecasting, Markov chains help predict long-run weather. They are also used in Google's PageRank algorithm. In this research, the Markov property is relevant because of the sequential character of the data: from the perspective of chord progression, when the music is played, the probability of a certain note or chord occurring at time t + 1 depends only on the note or chord a very short time before t + 1, rather than on what happened a few minutes ago.
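As a small illustration of these definitions, the following sketch builds a hypothetical 3-state transition matrix, checks that each row sums to 1, and simulates a chain using the Markov property; all numbers are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-state transition matrix; row i holds P(Q_{t+1} = j | Q_t = i),
# so each row must sum to 1.
A = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])
assert np.allclose(A.sum(axis=1), 1.0)

def simulate(A, pi, T):
    # Draw Q_1 from pi, then each Q_{t+1} from row Q_t of A: the next state
    # depends on the past only through the current state (the Markov property).
    states = [rng.choice(len(pi), p=pi)]
    for _ in range(T - 1):
        states.append(rng.choice(A.shape[0], p=A[states[-1]]))
    return np.array(states)

pi = np.array([1/3, 1/3, 1/3])
chain = simulate(A, pi, 1000)
print(chain[:10])
```

Over a long run, the empirical state frequencies of such a chain approach its stationary distribution, a left eigenvector of A with eigenvalue 1.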

3.1.2 Hidden Markov Models

A Markov chain is used when we want to compute the probability of a sequence of data that we can observe. However, in many cases the data we are interested in may not be directly observable. For example, in speech recognition, different people pronounce words with different accents; the actual words are hidden behind the pronunciation that we hear. Hidden Markov Models have been applied in speech recognition to identify those spoken words. In this work, the MFCC data sequence is sequential and extractable, but the dynamic feature ranges in the music that represent the composition styles of different composers cannot be directly "observed" from the data sequence. A Hidden Markov Model provides a way to obtain information about the hidden dynamic structure through the observable data (MFCC vectors).

Definition 3.2. A Hidden Markov Model is a doubly embedded stochastic process: an underlying unobservable state sequence, governed by transition probabilities, generates a concurrently running observable chain through an emission probability distribution.

A continuous HMM is specified by the following components:

• T: the number of observations
• $O = X_1, X_2, \ldots, X_T$: the sequence of observations
• N: the number of states
• $Q_1, \ldots, Q_T$: the sequence of hidden states, with state space $\{1, \ldots, N\}$
• $\pi = \{\pi_1, \pi_2, \ldots, \pi_N\} = \{P(Q_1 = i), i \in \{1, \ldots, N\}\}$: the initial probability vector for the hidden states
• $A = \{a_{ij}\} = \{P(Q_{t+1} = j \mid Q_t = i), i, j \in \{1, \ldots, N\}\}$: the transition probability matrix for the hidden process
• $b_i(X_t) = P(O_t = X_t \mid Q_t = i) \sim N(\mu_i, \Sigma_i)$: the emission probability that "connects" the hidden states and the observations

Here A is the transition probability matrix, with each $a_{ij}$ representing the probability of moving from state i at time t to state j at time t + 1; O is the sequence of MFCC vectors in this case; and $b_i(X_t)$ is the emission probability, the probability of an observed MFCC vector being generated at time t from hidden state i through a probability distribution. Therefore,

$b_i(X_t) = \frac{1}{\sqrt{(2\pi)^k |\Sigma_i|}} \exp\!\left(-\frac{1}{2}(X_t - \mu_i)' \Sigma_i^{-1} (X_t - \mu_i)\right)$

where $(X_t - \mu_i)'$ is the transpose of $(X_t - \mu_i)$, and $\mu_i$ and $\Sigma_i$ are the mean vector and covariance matrix of the multivariate normal distribution, with k = 12. Note that $X_k$ and $X_m$ are conditionally independent given $Q_k$ and $Q_m$. Below is an illustration of the hidden Markov chain:

Hidden (music feature):   Q_1 → Q_2 → Q_3 → ... → Q_T      (transitions $a_{ij}$)
                            ↓     ↓     ↓            ↓      (emissions $b_i(X_t)$)
Observed (MFCC vectors):  X_1   X_2   X_3   ...    X_T

The dynamic feature range in the music can be split into a certain number of discrete states, shown as the hidden states in the figure, each characterizing a certain range of features. Those discrete feature ranges are connected by the Markov chain, so the stochastic process takes the dynamic part of the whole process into consideration. Those features are hidden behind the MFCC vectors extracted from the audio signal; through the observed MFCCs, we can obtain information about the dynamic features behind them. The parameters that define the model are the transition matrix A for the hidden process, the starting probabilities π for the hidden states, and the emission probability distribution $b_i(X_t)$ for each hidden state i. The complete parameter set of the HMM is specified as $\lambda = (\pi, A, \mu, \Sigma)$, where the components have dimensions $N$, $N \times N$, $12 \times 1$ per state, and $\frac{12 \times 13}{2} = 78$ free entries per symmetric covariance matrix, respectively. Notice that the number of parameters depends on the number of hidden states; the method used to choose N will be explained in detail in a later chapter. Suppose N is 3; then the total number of parameters is $N^2 + (12 + 78) \times N + N = 3^2 + (12 + 78) \times 3 + 3 = 282$. This indicates that the estimation of these parameters is complicated.
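The parameter count just given can be verified with a few lines of Python (a sketch; the helper name is ours):

```python
# Parameter count of the HMM lambda = (pi, A, mu, Sigma) with k-dimensional
# observations and N hidden states: A has N^2 entries, pi has N, and each state
# carries a k-dim mean plus a symmetric k x k covariance with k(k+1)/2 free entries.
def n_parameters(N, k=12):
    return N**2 + (k + k * (k + 1) // 2) * N + N

print(n_parameters(3))  # 282, matching the count above
```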

In order to estimate the parameters, we need to answer the following two questions:

1. Training: Given a set of observations, how do we estimate λ by maximizing $L(\lambda) = P(O \mid \lambda)$? ($L(\lambda)$ is called the likelihood function.)

2. Evaluation: Given parameter estimates $\hat{\lambda}$ and an observation sequence O, how do we compute the likelihood $P(O \mid \hat{\lambda})$?

The so-called forward algorithm is used to answer the second question, and the Baum-Welch algorithm is used to answer the first.

3.2 Forward algorithm

To start, we discuss the second question. We wish to calculate the probability of the observation sequence $O = X_1, X_2, \ldots, X_T$ given the parameters λ, i.e., $P(O \mid \lambda)$. The most straightforward approach is to enumerate every possible state sequence of length T[16]. The probability of the observation sequence O for a state sequence $Q_1, Q_2, \ldots, Q_T$ is

$P(O \mid Q, \lambda) = \prod_{t=1}^{T} P(O_t \mid Q_t, \lambda)$   (3.2)

where we use the conditional independence of the observations given the hidden states. From the definition of the emission probability, the observations are conditionally independent of one another given the hidden states, so

$P(O \mid Q, \lambda) = b_{Q_1}(X_1) \cdot b_{Q_2}(X_2) \cdots b_{Q_T}(X_T)$   (3.3)

By the chain rule of probability, the probability of the state sequence $Q_1, Q_2, \ldots, Q_T$ can be written as a product of sequential conditional probabilities:

$P(Q \mid \lambda) = \pi_{Q_1} \cdot a_{Q_1 Q_2} \cdot a_{Q_2 Q_3} \cdots a_{Q_{T-1} Q_T}$   (3.4)

The definition of conditional probability then yields

$P(O, Q \mid \lambda) = P(O \mid Q, \lambda)\, P(Q \mid \lambda)$   (3.5)

The probability of O given λ is obtained by summing this joint probability over all possible state sequences:

$P(O \mid \lambda) = \sum_{Q_1, Q_2, \ldots, Q_T} P(O \mid Q, \lambda)\, P(Q \mid \lambda)$   (3.6)

However, since there are N possible states at each time step, there are $N^T$ possible state sequences, and this calculation is computationally infeasible even for small values of N and T[16]. In his paper, Rabiner introduces a more efficient procedure to solve this problem, called the forward algorithm[16].

Definition 3.3. The forward probability for state j, $\alpha_t(j)$, is the probability of observing the sequence $X_1, X_2, \ldots, X_t$ and being in hidden state j at time t ($Q_t = j$). So

$\alpha_t(j) = P(X_1, \ldots, X_t, Q_t = j \mid \lambda)$   (3.7)

The quantity $\alpha_t(j)$ can be computed recursively:

$\alpha_t(j) = \sum_{i=1}^{N} \alpha_{t-1}(i)\, P(Q_t = j \mid Q_{t-1} = i)\, P(X_t \mid Q_t = j) = \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\, b_j(X_t)$   (3.8)

Here the product $\alpha_{t-1}(i)\, a_{ij}$ is the probability of the joint event that $X_1, X_2, \ldots, X_{t-1}$ are observed and state j is reached at time t from state i at time t − 1. Summing this product over all N possible states i at time t − 1 gives the probability of being in state j at time t with all the accompanying previous partial observations[16].

Finally, P(O|λ) is derived from the summation of the terminal forward variables αT (i)

$P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$   (3.9)

by the definition

$\alpha_T(i) = P(X_1, X_2, \ldots, X_T, Q_T = i \mid \lambda)$   (3.10)
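The forward recursion can be sketched in a few lines and checked against the brute-force enumeration of Equation (3.6). This is an illustrative implementation with randomly generated parameters; the matrix B stands in for the Gaussian emission likelihoods $b_i(X_t)$:

```python
import itertools
import numpy as np

def forward(pi, A, B):
    # B[t, i] = b_i(X_t): likelihood of observation t under state i.
    T, N = B.shape
    alpha = pi * B[0]                     # initialization: alpha_1(i) = pi_i b_i(X_1)
    for t in range(1, T):
        alpha = (alpha @ A) * B[t]        # recursion, eq. (3.8)
    return alpha.sum()                    # evaluation, eq. (3.9)

def brute_force(pi, A, B):
    # Direct sum over all N^T state sequences, eq. (3.6); feasible only for tiny T.
    T, N = B.shape
    total = 0.0
    for q in itertools.product(range(N), repeat=T):
        p = pi[q[0]] * B[0, q[0]]
        for t in range(1, T):
            p *= A[q[t - 1], q[t]] * B[t, q[t]]
        total += p
    return total

rng = np.random.default_rng(1)
N, T = 3, 6
pi = np.full(N, 1 / N)
A = rng.random((N, N)); A /= A.sum(axis=1, keepdims=True)
B = rng.random((T, N))  # stand-in emission likelihoods b_i(X_t)
print(np.isclose(forward(pi, A, B), brute_force(pi, A, B)))  # True
```

In practice the recursion is run in log space to avoid numerical underflow for long sequences; that detail is omitted here for clarity.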

The brute-force procedure takes approximately $2T \cdot N^T$ calculations, while the forward algorithm takes about $TN^2$, a considerable reduction[16]. Below is a summary of the steps to find $P(O \mid \lambda)$:

• Initialization: $\alpha_1(i) = \pi_i\, b_i(X_1)$, for $1 \le i \le N$
• Recursion: $\alpha_t(j) = \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\, b_j(X_t)$, for $1 \le j \le N$ and $2 \le t \le T$
• Evaluation: $P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$

3.3 Backward Algorithm

After knowing how to find $P(O \mid \lambda)$, we may want to make further inferences about the sequence. For example, we may want to know the probability that observation $X_t$ came from hidden state i given the whole observation sequence. This can be obtained from the joint probability

$P(O, Q_t = i \mid \lambda) = P(X_1, \ldots, X_t, Q_t = i)\, P(X_{t+1}, \ldots, X_T \mid X_1, \ldots, X_t, Q_t = i) = P(X_1, \ldots, X_t, Q_t = i)\, P(X_{t+1}, \ldots, X_T \mid Q_t = i)$

The first term of the product is exactly the value computed by the forward algorithm. The second term is part of the backward algorithm. Define

$\beta_t(i) = P(X_{t+1}, \ldots, X_T \mid Q_t = i)$   (3.11)

Definition 3.4. The backward probability for state i, $\beta_t(i)$, is the probability of the remaining partial sequence $X_{t+1}, \ldots, X_T$ given hidden state i at time t.

$\beta_t(i)$ can be obtained recursively, similarly to $\alpha_t(i)$:

$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(X_{t+1})\, \beta_{t+1}(j)$   (3.12)

$P(O \mid \lambda)$ can then be derived as

$P(O \mid \lambda) = \sum_{i=1}^{N} \pi_i\, b_i(X_1)\, \beta_1(i)$   (3.13)
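The backward recursion can likewise be sketched in a few lines. As a consistency check, Equation (3.13) must give the same value of $P(O \mid \lambda)$ as the forward algorithm; the sketch verifies this with random illustrative parameters (the matrix B again stands in for the emission likelihoods):

```python
import numpy as np

def backward_likelihood(pi, A, B):
    # beta[t, i] = P(X_{t+1}, ..., X_T | Q_t = i), computed via eq. (3.12).
    T, N = B.shape
    beta = np.zeros((T, N))
    beta[-1] = 1.0                               # base case: beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[t + 1] * beta[t + 1])   # sum_j a_ij b_j(X_{t+1}) beta_{t+1}(j)
    # eq. (3.13): P(O | lambda) = sum_i pi_i b_i(X_1) beta_1(i)
    return (pi * B[0] * beta[0]).sum()

def forward_likelihood(pi, A, B):
    alpha = pi * B[0]
    for t in range(1, B.shape[0]):
        alpha = (alpha @ A) * B[t]
    return alpha.sum()

rng = np.random.default_rng(2)
N, T = 3, 50
pi = np.full(N, 1 / N)
A = rng.random((N, N)); A /= A.sum(axis=1, keepdims=True)
B = rng.random((T, N))
print(np.isclose(backward_likelihood(pi, A, B), forward_likelihood(pi, A, B)))  # True
```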

The backward algorithm is useful for answering the training question of finding the maximum likelihood estimate of λ.

3.4 Parameter Estimation

In applications of HMMs, the most important problem is to estimate the parameters based on the observations, the first question (training) stated above. However, computing the global maximum directly for HMMs is computationally intractable[17]. Therefore, the Expectation-Maximization (E-M) algorithm becomes an ideal method to estimate the parameter values that maximize the likelihood of the observations. The Baum-Welch algorithm, or forward-backward algorithm, is a special case of the E-M algorithm that works iteratively, starting with an initial guess. Based on expected values, the algorithm then iteratively computes a better estimate from the previous one, and so on, to maximize the probability of observing the sequence.

3.4.1 Expectation-Maximization Algorithm

The Expectation-Maximization algorithm enables parameter estimation in probabilistic models with incomplete data[18]. Parameter estimation in an HMM, where the hidden states are unknown, is known as the incomplete-data case[18]. In fact, there is no optimal way of estimating the HMM parameters globally, but we can find λ̂ such that P(O|λ̂) is locally maximized. The E-M algorithm consists of three parts: initialization, iteration, and evaluation. It works by starting with some initial guess of the parameters (initialization), re-estimating the parameters iteratively using the E-step and M-step (iteration), and calculating the likelihood of observing the data under the new parameter value λ̂ (evaluation). The procedure repeats until the difference between two consecutive likelihoods is small enough (convergence).

In HMM parameter estimation, the E-M algorithm alternates between computing the expected state-occupancy counts γ and the expected state-transition counts δ based on the current parameters (known as the E-step), and then re-estimating the model parameters using γ and δ (known as the M-step)[18]. In the E-step, the hidden data are estimated from the observed data and the current parameter estimate using conditional expectation. In the M-step, new parameters are estimated by maximizing the likelihood function, treating the hidden data as known. Below is a more detailed explanation of the E-M algorithm.
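Before the derivation, the initialize/iterate/evaluate structure can be illustrated on a model simpler than an HMM. The sketch below runs E-M on a two-component one-dimensional Gaussian mixture (not the Baum-Welch updates themselves); all names and data are made up for the example.

```python
import numpy as np

def em_gmm_1d(x, iters=200, tol=1e-8):
    """Generic E-M loop on a two-component 1-D Gaussian mixture:
    initialization, E-step, M-step, and a convergence check."""
    # initialization: a rough guess from the data
    w = np.array([0.5, 0.5])
    mu = np.array([x.min(), x.max()], dtype=float)
    var = np.array([x.var(), x.var()])
    prev_ll = -np.inf
    for _ in range(iters):
        # E-step: responsibilities (posterior probability of each component)
        dens = w * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        ll = np.log(dens.sum(axis=1)).sum()           # evaluation
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, variances
        nk = r.sum(axis=0)
        w = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        if ll - prev_ll < tol:                        # convergence check
            break
        prev_ll = ll
    return w, mu, var, ll

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-3, 1, 500), rng.normal(3, 1, 500)])
w, mu, var, ll = em_gmm_1d(x)
```

The log-likelihood does not decrease between iterations, which is exactly the property the derivation establishes.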

Derivation of the EM-algorithm

Suppose we have the observation variables O and hidden variables Q. Then the likelihood P(O|λ) can be written in terms of the hidden variables Q as

P(O|λ) = Σ_Q P(O|Q, λ) P(Q|λ)    (3.14)

where Σ_Q denotes summation over all state sequences Q = (i_1, ..., i_T) with i_t ∈ {1, ..., N}. The goal is to find λ that maximizes l(λ) = ln P(O|λ). The way to do this maximization is to maximize the difference between l(λ_i) and l(λ_{i+1}), where λ_i is the current parameter value and λ_{i+1} is the new parameter value. The difference of ln P(O|λ_{i+1}) and ln P(O|λ_i) is:

l(λ_{i+1}) − l(λ_i) = ln Σ_Q P(O|Q, λ_{i+1}) P(Q|λ_{i+1}) − ln P(O|λ_i)

= ln Σ_Q P(O|Q, λ_{i+1}) P(Q|λ_{i+1}) · P(Q|O, λ_i)/P(Q|O, λ_i) − ln P(O|λ_i)

= ln Σ_Q P(Q|O, λ_i) [ P(O|Q, λ_{i+1}) P(Q|λ_{i+1}) / P(Q|O, λ_i) ] − ln P(O|λ_i)

≥ Σ_Q P(Q|O, λ_i) ln [ P(O|Q, λ_{i+1}) P(Q|λ_{i+1}) / P(Q|O, λ_i) ] − ln P(O|λ_i)

(by Jensen's inequality: if f is a concave function on domain D, then f(Σ a_i x_i) ≥ Σ a_i f(x_i), where the a_i are positive weights summing to one and x_i ∈ D; here f = ln and the weights are P(Q|O, λ_i))

= Σ_Q P(Q|O, λ_i) ln [ P(O|Q, λ_{i+1}) P(Q|λ_{i+1}) / ( P(Q|O, λ_i) P(O|λ_i) ) ]

= ∆(λ_{i+1}|λ_i)

where the last equality uses Σ_Q P(Q|O, λ_i) = 1 to move ln P(O|λ_i) inside the sum.

Then we have l(λ_{i+1}) ≥ l(λ_i) + ∆(λ_{i+1}|λ_i), so that ∆(λ_{i+1}|λ_i) + l(λ_i) is bounded above by l(λ_{i+1}). Therefore, any λ that increases ∆(λ|λ_i) cannot decrease l(λ). So we want to choose a λ that maximizes ∆(λ|λ_i). We denote the updated parameter value as λ̂, such that

λ̂ = argmax_λ { l(λ_i) + ∆(λ|λ_i) }

= argmax_λ { l(λ_i) + Σ_Q P(Q|O, λ_i) ln [ P(O|Q, λ) P(Q|λ) / ( P(Q|O, λ_i) P(O|λ_i) ) ] }

= argmax_λ { Σ_Q P(Q|O, λ_i) ln( P(O|Q, λ) P(Q|λ) ) }    (dropping terms that are constant in λ)

= argmax_λ { Σ_Q P(Q|O, λ_i) ln P(O, Q|λ) }

= argmax_λ E_{Q|O,λ_i}{ ln P(O, Q|λ) }

where Σ_Q P(Q|O, λ_i) ln P(O, Q|λ) is the expectation of ln P(O, Q|λ) under the conditional probability distribution P(Q|O, λ_i); E_{Q|O,λ_i}{ln P(O, Q|λ)} denotes this conditional expectation.

The E-M algorithm stops when the difference between l(λ_{i+1}) and l(λ_i) is smaller than a threshold ε. That is,

l(λ_{i+1}) − l(λ_i) < ε

Baum-Welch is a specific case of the E-M algorithm; below is a detailed description of the Baum-Welch algorithm and of how the HMM parameters are derived using it.

3.4.2 Baum-Welch algorithm

Let γ_t(j) be the probability of being in hidden state j at time t (Q_t = j) given the observation sequence and the parameters:

γ_t(j) = P(Q_t = j | O, λ)

= P(Q_t = j, O | λ) / P(O|λ)

= P(Q_t = j, X_1, ..., X_t | λ) P(X_{t+1}, ..., X_T | Q_t = j, λ) / P(O|λ)

= α_t(j) β_t(j) / P(O|λ)

δ_t(i, j) is the probability of being in hidden state i at time t−1 (Q_{t−1} = i) and hidden state j at time t (Q_t = j), given the observed sequence:

δ_t(i, j) = P(Q_{t−1} = i, Q_t = j | O, λ)

= P(Q_{t−1} = i, Q_t = j, O | λ) / P(O|λ)

= P(O | Q_{t−1} = i, Q_t = j, λ) P(Q_{t−1} = i, Q_t = j | λ) / P(O|λ)

= P(X_1, ..., X_{t−1} | Q_{t−1} = i, λ) P(Q_{t−1} = i | λ) P(Q_t = j | Q_{t−1} = i, λ) P(X_t | Q_t = j, λ) P(X_{t+1}, ..., X_T | Q_t = j, λ) / P(O|λ)

= α_{t−1}(i) a_{ij} b_j(X_t) β_t(j) / P(O|λ)

The relationship between γ_t(j) and δ_t(i, j) is

γ_t(j) = Σ_{i=1}^N δ_t(i, j)    (3.15)

The summation of γ_t(i) over time t gives a value that can be interpreted as the expected number of transitions made from state i:

Σ_{t=1}^{T−1} γ_t(i) = expected number of transitions from state i to any state    (3.16)

The summation of δ_t(i, j) over t gives the expected number of transitions from state i to state j:

Σ_{t=1}^{T−1} δ_t(i, j) = expected number of transitions from state i to state j    (3.17)

Here is how the E-M algorithm works. Start with an initial value for λ_i and begin the iteration:

• E-step: calculate the probabilities δ_t(i, j) and γ_t(i) according to the formulas above, given λ_i.

• M-step: estimate the parameter λ_{i+1} based on the δ_t(i, j) and γ_t(i) determined in the E-step.
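In code, the E-step amounts to one forward pass, one backward pass, and two normalizations. The sketch below uses discrete emissions and made-up model values (the thesis uses Gaussian emissions); here delta[t] holds the transition probabilities from time t to t+1.

```python
import numpy as np

def e_step(pi, A, B, obs):
    """E-step: gamma[t, i] = P(Q_t = i | O, lambda) and
    delta[t, i, j] = P(Q_t = i, Q_{t+1} = j | O, lambda)."""
    N, T = len(pi), len(obs)
    alpha = np.zeros((T, N))
    beta = np.ones((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):                        # forward pass
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    for t in range(T - 2, -1, -1):               # backward pass
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    pO = alpha[-1].sum()                         # P(O | lambda)
    gamma = alpha * beta / pO
    # delta[t, i, j] = alpha_t(i) a_ij b_j(X_{t+1}) beta_{t+1}(j) / P(O|lambda)
    delta = (alpha[:-1, :, None] * A[None, :, :]
             * (B[:, obs[1:]].T * beta[1:])[:, None, :]) / pO
    return gamma, delta

# Made-up 2-state, 2-symbol model
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.5, 0.5],
              [0.1, 0.9]])
gamma, delta = e_step(pi, A, B, [0, 1, 0, 1])
```

Summing delta over its first state index recovers gamma, which is exactly relation (3.15).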

Derivation of parameters

But how are the parameters derived? In an HMM, the expectation function is

E_{Q|O,λ}{ln P(O, Q|λ)} = Σ_Q P(Q|O, λ) ln( Π_{t=1}^T P(Q_t = j | Q_{t−1} = i, λ) P(X_t | Q_t = j, λ) )

= Σ_Q P(Q|O, λ) Σ_{t=1}^T ( ln a_{ij} + ln b_j(X_t) )

where, inside the product and sum, i and j stand for the states visited at times t−1 and t along the sequence Q.

We denote two indicator functions

I(Q_{t−1} = i, Q_t = j) = 1 if Q_{t−1} = i and Q_t = j, and 0 otherwise,

and

I(Q_t = j) = 1 if Q_t = j, and 0 otherwise,

so that the expectation can be written as

E_{Q|O,λ}{ln P(O, Q|λ)} = Σ_Q P(Q|O, λ) ( Σ_{i=1}^N Σ_{j=1}^N Σ_{t=2}^T I(Q_{t−1} = i, Q_t = j) ln a_{ij} + Σ_{j=1}^N Σ_{t=1}^T I(Q_t = j) ln b_j(X_t) )

Then the maximization of the expectation can be solved by the Lagrangian duality theorem[19]. The final estimate of each parameter is specified below:

• π̂_i = γ_1(i) = expected number of times in state i at time t = 1, i ∈ {1, ..., N}

• â_{ij} = Σ_{t=1}^{T−1} δ_t(i, j) / Σ_{t=1}^{T−1} γ_t(i) = (expected number of transitions from state i to state j) / (expected number of transitions from state i), i, j ∈ {1, ..., N}

• μ̂_i = Σ_{t=1}^T γ_t(i) X_t / Σ_{t=1}^T γ_t(i), i ∈ {1, ..., N}

• Σ̂_i = Σ_{t=1}^T γ_t(i) (X_t − μ̂_i)(X_t − μ̂_i)′ / Σ_{t=1}^T γ_t(i), i ∈ {1, ..., N}

where (X_t − μ̂_i)′ is the transpose of (X_t − μ̂_i). The E-M algorithm stops when it meets the termination condition, that is, when the difference between the likelihoods P(O|λ̂) and P(O|λ) is less than the given tolerance.
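Given the E-step quantities, the re-estimation formulas above reduce to a few weighted averages. The sketch below assumes gamma of shape (T, N) and delta of shape (T−1, N, N) as produced by an E-step, and X holding the T observation vectors; the function name and the toy inputs are hypothetical.

```python
import numpy as np

def m_step(gamma, delta, X):
    """M-step re-estimates for a Gaussian-emission HMM.
    gamma: (T, N) occupancies; delta: (T-1, N, N) transitions; X: (T, d)."""
    pi_hat = gamma[0]                                   # pi_i = gamma_1(i)
    a_hat = delta.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    w = gamma / gamma.sum(axis=0)                       # per-state weights over time
    mu_hat = w.T @ X                                    # (N, d) weighted means
    N, d = gamma.shape[1], X.shape[1]
    sigma_hat = np.zeros((N, d, d))
    for i in range(N):
        diff = X - mu_hat[i]                            # deviations from state mean
        sigma_hat[i] = (w[:, i, None] * diff).T @ diff  # weighted covariance
    return pi_hat, a_hat, mu_hat, sigma_hat

# Toy, internally consistent E-step outputs: each delta[t] sums to 1,
# and gamma is recovered from delta as in eq. (3.15)
rng = np.random.default_rng(1)
T, N, d = 6, 3, 2
delta = rng.random((T - 1, N, N))
delta /= delta.sum(axis=(1, 2), keepdims=True)
gamma = np.concatenate([delta.sum(axis=2), delta[-1].sum(axis=0)[None]], axis=0)
X = rng.random((T, d))
pi_hat, a_hat, mu_hat, sigma_hat = m_step(gamma, delta, X)
```

Because each row of â is a ratio of matched expected counts, the rows automatically sum to one, so the estimate is a valid transition matrix.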

3.5 Similarity Metric

Since our final goal is to select music from the retrieval database that is most similar to a given new piece, we first define a similarity metric. The metric we use is the probability value P(O|λ). Recall that P(O|λ) is the probability of observing a data sequence given the parameters of a certain model, computed using the forward algorithm explained in Section 3.2. Given a data sequence and a trained model (i.e., λ), we pass the data sequence through the model once and compute the likelihood value P(O|λ). This value tells us how likely the model is to generate the observation sequence. We then seek the piece from the database that yields the highest P(O|λ).

Chapter 4

Initialization

As introduced in Section 3.4, the Baum-Welch algorithm requires an initial guess of the parameters to start the parameter estimation. This chapter introduces a model-based clustering algorithm that helps us determine a reasonable parameter set before the parameter estimation.

4.1 Initial parameter estimation

Some remaining questions need to be answered before the parameter estimation. First, we must determine the number of hidden states. Second, results from some simple cases show that the Baum-Welch algorithm is sensitive to the initial parameters. That is, the accuracy of the estimation depends on the chosen initial parameter values, and randomly setting the parameters may lead to an unfavorable result. Therefore, to determine the parameters of an HMM, it is necessary to make a reasonable initial guess at what they might be. Once this is done, more accurate (in the maximum likelihood sense) parameters can be found by applying the Baum-Welch algorithm.

But how should we decide how many hidden states are needed? How do we obtain a reasonable guess of the parameters? From multiple trials, we found that the transition probabilities a_ij and starting probabilities π_i can be uniformly assigned. The initialization therefore focuses on the emission probabilities, which means finding N sets of observation-distribution parameters {μ_i, Σ_i}, i ∈ {1, ..., N}, where N is the number of states.

Before we explain the initialization steps for the application, we first introduce the clustering algorithm that we apply in our experiment: model-based agglomerative hierarchical clustering.

4.1.1 Model-based agglomerative Hierarchical Clustering

Cluster analysis

Cluster analysis is the identification of groups among observations of unknown structure, such that the observations in each group are cohesive and separated from the other groups[20]. Clustering algorithms vary based on the cluster model, but the main goal is to minimize the "distance" between the observations within a group, that is, to classify the objects into different groups such that similar objects are placed together. There are many clustering algorithms, including centroid-based clustering (e.g. k-means), hierarchical clustering, and model-based clustering. In this paper we implement one of the model-based clustering algorithms, the Gaussian mixture model, to initialize our HMM parameters. Model-based agglomerative hierarchical clustering was first introduced by Fraley and Raftery (2002)[20]. The strategy of implementing this clustering method comprises three elements: initialization via model-based hierarchical agglomerative clustering, maximum likelihood estimation via the EM algorithm, and selection of the number of clusters with the Bayesian Information Criterion (BIC)[20].

Gaussian finite mixture modeling

For a mixture model, let X = (X_1, X_2, ..., X_n) be a sample of n independent, identically distributed observations. The distribution of every observation is specified by a probability density function through a finite mixture model of N components, which has the form

f(x_i|ψ) = Σ_{k=1}^N w_k f_k(x_i|θ_k)    (4.1)

where ψ = (w_1, ..., w_N, θ_1, ..., θ_N) are the parameters of the mixture model, f_k(x_i|θ_k) is the kth component density for observation x_i with parameter vector θ_k, w_1, ..., w_N are the mixing weights or probabilities such that w_k > 0 and Σ_{k=1}^N w_k = 1, and N is the number of mixture components[21]. Notice that in this situation we treat the dependent chain of observations from the MFCCs as independent, identically distributed variables. We want to group "similar" MFCC vectors together to characterize certain feature ranges. The Markov chain itself will later account for the dependence between the variables.

During the clustering, N is fixed and the mixture model parameters ψ are what we want to estimate. The likelihood for a mixture model with N components is

L(ψ | x_1, ..., x_n) = Π_{i=1}^n Σ_{k=1}^N w_k f_k(x_i|θ_k)    (4.2)

Since the direct maximization of the log-likelihood function is computationally complicated, the maximum likelihood estimate (MLE) of a finite mixture model is usually obtained via the EM algorithm[21].

In model-based clustering, each component of a finite mixture density is usually associated with a cluster. In our case, each component has a multivariate Gaussian distribution f_k(x_i|θ_k) ~ N(μ_k, Σ_k).

Bayesian Information Criterion

Before the clustering process, the number of components must first be selected. Here we use the Bayesian Information Criterion (BIC) to determine the number of clusters. BIC is a criterion for model selection among a finite set of models: the lower the BIC, the better the model. When fitting the clustering model, over-fitting may occur when unnecessary clusters are added, even though they increase the likelihood. BIC attempts to solve this problem by introducing a penalty term for the number of parameters in the model. To implement BIC, the E-M algorithm is also applied. Here are the steps of how BIC works in the clustering[20]:

1. Determine a maximum number of clusters, M.

2. Perform the hierarchical agglomeration to initialize the EM algorithm. If there are L observations, each observation starts in its own cluster, so there are L clusters in total. Pairs of clusters are then merged while moving up the hierarchy, choosing at each step the merge that provides the smallest decrease in the classification likelihood for the Gaussian mixture model. The classification likelihood calculates the probability that a given observation belongs to a specific cluster.

3. Estimate the parameters of the model for a given number of clusters using the EM algorithm initialized by step 2.

4. Compute the BIC for the GMM with the optimal parameters from EM for that cluster number.

5. Repeat steps 2 to 4 for each cluster number 1, ..., M. Select the cluster number with the lowest BIC.
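The BIC loop above can be sketched as follows. This uses scikit-learn's GaussianMixture rather than the R package mclust used in the thesis, and fits only the unconstrained full-covariance model (the analogue of mclust's VVV); the synthetic data are made up for the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def select_n_clusters(X, max_clusters=9, seed=0):
    """Fit a full-covariance GMM by EM for each candidate cluster
    count and pick the count with the lowest BIC."""
    bics = []
    for n in range(1, max_clusters + 1):
        gmm = GaussianMixture(n_components=n, covariance_type="full",
                              random_state=seed).fit(X)
        bics.append(gmm.bic(X))          # sklearn's BIC: lower is better
    return int(np.argmin(bics)) + 1, bics

# Synthetic 2-D data with three well-separated clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.3, size=(100, 2)) for m in (-4, 0, 4)])
best_n, bics = select_n_clusters(X)
```

For real MFCC data one would also loop over the constrained covariance structures; mclust does this automatically across its 14 model forms.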

Below are the detailed steps for determining the state space and initializing the parameters:

1. Use the Bayesian Information Criterion (BIC) for parameterized Gaussian mixture models, fitted by the EM algorithm initialized with model-based hierarchical clustering, to decide the number of clusters, which is taken to be the number of states N.

2. Uniformly distribute the starting probability of each state: π_i = 1/N, i ∈ {1, ..., N}.

3. Uniformly distribute the transition probabilities: a_ij = 1/N, i, j ∈ {1, ..., N}.

4. The probability distribution associated with each state is a normal distribution specified by two parameters: a mean μ_i and a covariance matrix Σ_i, i ∈ {1, ..., N}. The means and covariances are computed using the model-based agglomerative hierarchical clustering described above.
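Steps 2-4 translate directly into code. A minimal sketch, assuming the clustering has already produced a label for each MFCC frame; the function name and toy data are hypothetical.

```python
import numpy as np

def init_hmm_params(X, labels):
    """Initial HMM parameters: uniform pi and A (steps 2-3), and
    per-cluster Gaussian emission parameters (step 4).
    X: (T, d) MFCC frames; labels: (T,) cluster index per frame."""
    N = int(labels.max()) + 1
    pi = np.full(N, 1.0 / N)                 # step 2: uniform start
    A = np.full((N, N), 1.0 / N)             # step 3: uniform transitions
    mu = np.array([X[labels == i].mean(axis=0) for i in range(N)])
    sigma = np.array([np.cov(X[labels == i].T) for i in range(N)])
    return pi, A, mu, sigma

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                 # toy stand-in for MFCC frames
labels = np.arange(50) % 4                   # toy cluster assignment, N = 4
pi, A, mu, sigma = init_hmm_params(X, labels)
```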

Figure 4.1 shows the BIC plot of the MFCC data from one of the Bach fugue pieces using mclust[21]. The clusters in a Gaussian mixture model are ellipsoidal, centered at the mean vector μ_k, with other geometric features, such as volume, shape and orientation, determined by the covariance matrix Σ_k[21]. In the multivariate setting, the volume, shape, and orientation of the covariances can be constrained to be equal (E) or varying (V) across the groups[21]. Table 4.1 shows the 14 combinations of volume, shape and orientation with the corresponding distribution structure[21]. The result shows that 7 clusters with the VVV model is the best GMM fitted by the EM algorithm.

Figure 4.1: BIC plot of model selection for Bach A flat major BWV862 fugue

Model   Distribution   Volume     Shape      Orientation
EII     Spherical      Equal      Equal      -
VII     Spherical      Variable   Equal      -
EEI     Diagonal       Equal      Equal      Coordinate axes
VEI     Diagonal       Variable   Equal      Coordinate axes
EVI     Diagonal       Equal      Variable   Coordinate axes
VVI     Diagonal       Variable   Variable   Coordinate axes
EEE     Ellipsoidal    Equal      Equal      Equal
EVE     Ellipsoidal    Equal      Variable   Equal
VEE     Ellipsoidal    Variable   Equal      Equal
VVE     Ellipsoidal    Variable   Variable   Equal
EEV     Ellipsoidal    Equal      Equal      Variable
VEV     Ellipsoidal    Variable   Equal      Variable
EVV     Ellipsoidal    Equal      Variable   Variable
VVV     Ellipsoidal    Variable   Variable   Variable

Table 4.1: Cluster geometric characteristics for multidimensional data available in the mclust package

The final step is to cluster the observations using the cluster number provided by BIC; the cluster number becomes the number of states. Then the mean vector μ_k and covariance matrix Σ_k are computed for each state, becoming the parameters of the emission distribution. The following chapter explains the experiments we conduct, applying the model-based agglomerative hierarchical clustering to determine the state space and initialize the parameters, and the Baum-Welch algorithm for parameter estimation.

Chapter 5

Experiments and results

After establishing the procedure for building the HMM, we want to examine the effectiveness of the models. Can the model capture the characteristics of different piano pieces? How can we test the effectiveness of the trained models? In this section, experiments are conducted on the data extracted from the audio signals. We build a database of 70 models trained on 70 piano pieces by 7 composers, Bach, Beethoven, Schubert, Chopin, Debussy, Schoenberg, and Schumann, from different time periods of classical music.

Baroque (1600-1750): Bach
Classical (1750-1820): Beethoven, Schubert
Romantic (1820-1910): Chopin, Schumann, Debussy
Contemporary (1910-present): Schoenberg

Figure 5.1: Flow chart of western music time periods

The audio files are collected from the International Music Score Library Project (IMSLP)[4] and ClassicalArchive[22]. Section 5.1 gives a brief background introduction to each composer and an analysis of their composing styles, accompanied by MFCC plots that provide some insight into how the Mel-frequency coefficients change in each piece. Section 5.2 introduces the procedure for training the models. Section 5.3 discusses the results from the experiment and potential improvements that can be made in the future.

5.1 Composers

5.1.1 Bach

Johann Sebastian Bach is the representative composer of the Baroque period, and his composing style is characterized by simple rhythms and steady shifts of the underlying harmony[23]. Baroque music is known for tonality, that is, arranging pitches and chords for the greatest stability. Bach is known for his fugues, a contrapuntal form built on a musical theme that is introduced at the beginning of a piece and recurs frequently in the course of the composition[23]. The prelude is a musical form featuring a short melodic motif, usually written as a preface to a fugue. Bach's preludes and fugues for keyboard are among the landmarks of western classical music[23]. Figure 5.2 shows the MFCCs from the Bach fugue in A flat major and the Bach prelude in B flat minor. The extent to which the coefficients change over time is generally uniform, which reflects the flat dynamics of Bach's fugues and preludes.

Figure 5.2: MFCC of Bach Fugue and Prelude

5.1.2 Beethoven

Ludwig van Beethoven is one of the most significant and influential composers of western classical music. His music features intense emotion and passion, foreshadowing the transition from the Classical style, "full of poise and balance", to the Romantic style, which is more expressive and impactful[24]. Although Beethoven admired and was motivated by Bach, his music is more personally expressive and intense, owing in part to the fact that he began to lose his hearing when he was 26 and became completely deaf in his later years. Beethoven was a prolific composer who wrote 9 symphonies, 32 piano sonatas, 5 piano concertos and much chamber music[24]. His piano sonatas are representative of the powerful and varied expression he drew from a single instrument. Similar to Bach's composing principles, Beethoven perceived tonality as the most important aspect of a sonata. A Beethoven sonata typically starts with a lively and brisk first movement, followed by a slow and gentle second movement, and ends with a fast-tempo and fevered third movement. Sections from first and third movements are chosen in our experiment. Figure 5.3 shows the MFCCs from Beethoven piano Sonata Op.31 No.2 1st movement and Sonata Op.14 No.1 1st movement. Compared to Bach, Beethoven's MFCCs change dramatically through the whole piece, the top plot most remarkably. This provides evidence that MFCCs are able to capture and distinguish the dynamic features in different music.

Figure 5.3: MFCC from Beethoven Sonata

5.1.3 Schubert

Franz Schubert is regarded as the last of the Classical composers and one of the first Romantic composers. Schubert was awed by Beethoven; he was even too timid to introduce himself, even though the two were present at the same occasions several times. Nevertheless, Schubert wrote masterpieces with rich harmonies and legendary melodies for a variety of genres. He wrote approximately 20 piano sonatas at a time when the genre was in decline. The styles of the sonatas vary widely: some are dramatic and intense, like the Sonata D.845 in A minor, while some are tranquil and terse, like the Sonata D.664 in A major[25]. There are differences between Beethoven's sonatas and Schubert's sonatas. In my perspective, the biggest difference is that modulation (the change from one key to another) in a Beethoven sonata is more sudden and unexpected, while Schubert's sonatas modulate more seamlessly, and sometimes listeners will not notice the transition until it has happened. Figure 5.4 shows the MFCCs from Schubert Sonata D845 in A minor 1st movement and Sonata D664 in A major 1st movement. There is a clear distinction between those two plots; however, it is hard to tell either plot apart from Beethoven's sonatas.

Figure 5.4: MFCCs from Schubert Sonatas

5.1.4 Chopin

Frederic Chopin was a Polish composer and pianist of the Romantic period (1820-1910). His works for solo piano include mazurkas, polonaises, preludes, etudes, waltzes, sonatas, nocturnes, ballades, and scherzos. Chopin is well known as a composer with a very personal melodic voice. Expressive of heartfelt emotion, his music is penetrated by a poetic feeling that has almost universal appeal[26]. Although many of Chopin's piano works are romantic in their essence, they also reflect the tragic story of Polish history[26]. Chopin's nocturnes are known for their poetic harmony and lyrical melody; the ornamentation in the nocturnes becomes an integral element of the melody and represents Chopin's unique composing style. The waltz is a dance in 3/4 time written for dancing and social functions, and Chopin wrote waltzes throughout his lifetime. Within a short length, a Chopin waltz not only preserves the traditional hopping and bouncing rhythm of dance music, but also contains refinement and nuanced detail, making it more elegant and deliberate[27]. Figure 5.5 shows the MFCCs from Chopin Nocturne Op.32 No.1 in B major and Chopin Waltz Op.69 No.1 in A flat major. An interesting finding from these MFCCs is the repetitive pattern in the melody (marked with a black box on the plot). This reflects a characteristic of Chopin's miniature works: repeating a certain melody within a short length of time.

Figure 5.5: MFCCs from Chopin Nocturne and Waltz

5.1.5 Debussy

Claude Debussy was a French composer regarded as the first impressionist composer, active from the late 19th to the early 20th century (although he strongly objected to the word "impressionism" describing his work). The word "impressionism" was originally used to describe a style of late 19th-century French painting. As Richard Langham Smith wrote in his article on impressionism, the term was later transferred to describe composers "using landscape or natural phenomenon, particularly water and light imagery, through subtle textures suffused with instrumental color"[28]. Debussy's music "visualizes" the color strokes of a painting, and people can always "see" those colors when listening to his music. The Images are 6 compositions for solo piano by Debussy and are representative of his impressionism. The Children's Corner suite is a six-movement suite for solo piano that Debussy wrote for his three-year-old daughter Claude-Emma. Figure 5.6 shows the MFCCs from two of the Debussy Images. The coefficients rise and fall throughout the pieces. In addition, compared to the MFCCs from other composers, the coefficient plotted in green is more separated from the others, which are clustered together, providing evidence of the uniqueness of Debussy's pitch range.

Figure 5.6: MFCC from Debussy Image i and ii

5.1.6 Schumann

Robert Schumann is known as a miniaturist, a composer of short pieces in small forms. His piano pieces were often inspired by lyric poems. Much of his most characteristic work is introverted and tends to record precise moments and their moods[29]. His music, subtle and veiled, accurately reflects his uncertain and sensitive personality. Schumann was highly inspired by Chopin, and he showed his homage and respect in his work Op.9, which contains a short piece named Chopin. Schumann's Carnaval Op.9 contains 20 short pieces depicting different characters, including his friends, his colleagues, and himself; the work reflects an inner portrait of Schumann. Schumann put a musical cryptogram (a cryptogrammatic sequence of musical notes rearranged to refer to letters or words) in his Carnaval: the 20 pieces are connected by a recurring motif consisting of 4 notes, C, Eb, A, B. The four notes are repeated in different musical sequences to represent three German names: Asch, the town where his fiancee was born; Fasching, "carnival" in English; and his own name, Robert Alexander Schumann. Carnaval Op.9 is known for its "resplendent chord passage and rhythmic displacement"[30]. Schumann's Fantasiestuke Op.12 is a suite of 8 short pieces that Schumann wrote inspired by novels, letters, and poems. The pieces' names reflect their nature; for example, "Des Abends" means "In the evening", and Schumann wrote this piece to depict a dreamy picture of an introverted part of himself. Since Schumann's piano works are written for different characters, the form varies from piece to piece. Therefore, it is not easy to detect common patterns by eye.

Figure 5.7: MFCCs from Schumann Carnaval Op.9 Chopin and Fantasiestuke Op.12 Des Abends

5.1.7 Schoenberg

Arnold Schoenberg is a contemporary composer and one of the most influential composers of the 20th century. His later piano works (after the rise of the Nazi Party) were no longer tonal, meaning that harmony and melody play less important roles in the music. Atonality is the most characteristic feature of 20th-century music. Schoenberg's Five Pieces for piano Op.23 "evokes an astonishing and delightful impression of freedom"[31]. It is a transitional work from Schoenberg's atonal piano writing to his twelve-tone music (music built on a series containing all 12 pitch classes in a particular order)[31]. The Piano Suite Op.25 is the earliest piano work in which Schoenberg began to use the twelve-tone method, a way to achieve his goal of unity and regularity[31]. The first of the Five Pieces Op.23, called Sehr langsam ("very slowly"), demonstrates Schoenberg's approach to the principle of developing variation[31]. The variation is reflected when the opening melody reappears: the same pitches are moved to different octaves, the shape of the phrase is changed, and the material is presented in a different rhythmic configuration. Schoenberg intended to return to the theme by keeping the same pitches, but not the rhythms or melody, to maintain the core of the composition. This concept led directly to his formulation of the 12-tone method[31]. Very different from the previous composers, Schoenberg's piano work is no longer harmonic and tonal; it sounds dissonant and even inartistic. But he was one of the most important composers in the transition of western music from the Romantic era to Modernism.

Figure 5.8: MFCCs of two pieces from Schoenberg Piano suite, Massige Achtel and Sehr Rasch

5.2 Experiments

The different composing styles of the seven composers were introduced and compared in Section 5.1. As we can see, though each composer's style is distinguishable from the others, it is sometimes still hard to detect the difference between two pieces from the MFCCs alone (e.g. a Beethoven sonata and a Schubert sonata). Therefore, Hidden Markov Models are applied to help us figure out which pieces are more similar to, or different from, a given piece that a user is interested in. In our experiment, 103 piano pieces by the seven composers were collected from IMSLP and ClassicalArchive. The 103 pieces include 14 Bach fugues, 14 Bach preludes, 14 Beethoven sonata movements (1st and 3rd movements), 7 Chopin nocturnes, 7 Chopin waltzes, 6 pieces from Debussy's Children's Corner suite, 6 Debussy Images, 15 Schubert sonata movements, 2 pieces from Schoenberg's Piano Suite Op.25, 3 pieces from Schoenberg's Op.23, 2 pieces from Schoenberg's Op.11, 7 pieces from Schumann's Carnaval Op.9, 1 Schumann variation, 1 Schumann sonata, and 3 pieces from Schumann's Fantasiestuke. We split the samples into a training set and a testing set. Below are the detailed steps of how we build the database by training the HMMs on the training set and test the effectiveness of the models using the testing set.

1. Extract a 90-second audio signal from each piece.

2. Divide the 90-second audio signal into short frames of length 25ms. Compute 12 Mel-frequency cepstral coefficients for each frame. The final data extracted from each audio signal is a 12-dimensional data sequence of length 3600.

3. Split the samples into a training set of 70 pieces and a testing set of 33 pieces.

4. Build a Hidden Markov Model for each of the training pieces:

• Determine the number of hidden states using the Bayesian Information Criterion (BIC).

• Initialize the emission probability parameters (μ_i, Σ_i) using model-based agglomerative hierarchical clustering.

• Estimate the HMM parameters using the Baum-Welch algorithm.

After training the HMM for each piece, we put the trained models λ̂_Bach1, ..., λ̂_Schumann9 into our database. To test how similar a new piece is to the pieces in the database, we put the new piece into each trained model, iterate the Baum-Welch algorithm only once, and compute the likelihood value P(·|λ̂_i), where λ̂_i is the parameter estimate for the ith piece, i ∈ {1, ..., 70}. The similarity metric P(·|λ̂_i) tells us how likely it is that a testing piece was generated by a trained model.

5. Compare the similarity between the new piece and every model in the database. Select the three pieces that generate the three highest likelihood values as the final result.
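Step 5 is a simple ranking by likelihood. A sketch with hypothetical piece names and made-up log-likelihood values:

```python
def top_matches(scores, k=3):
    """Return the names of the k models with the highest likelihood
    for a test piece, best first. scores: name -> log P(O | lambda)."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical log-likelihoods of one test piece under four models
scores = {"Bach_BWV869_fugue": -410.2,
          "Beethoven_Op2_No2": -523.9,
          "Chopin_Waltz_Op34_No2": -488.1,
          "Bach_BWV884_fugue": -402.7}
print(top_matches(scores))
# prints the two Bach models first, then the Chopin waltz
```

In practice the comparison is done on the log of P(O|λ), since the raw likelihood of a sequence of length 3600 underflows floating-point range.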

We use the R library mhsmm[32] to estimate the parameters and generate the likelihood values P(O|λ). Tables A.1, A.2 and A.3 in Appendix A show the full test set (33 testing pieces in total) with the corresponding results. The following section gives a detailed discussion of the results.

5.3 Discussion

In this section, the results from the experiment, their implications, and remaining problems and concerns are discussed and analyzed.

5.3.1 Validation

Before testing a new piece, we want to first validate our model to see how it works.

We chose three training pieces from the retrieved database as testing pieces and put them through all trained models, including the ones trained on themselves. The result is shown in Table 5.1. The leftmost column is the testing piece name; the second column is the composer of the testing piece; the rightmost columns are the pieces in the database that generate the three largest likelihood values (in decreasing order) after putting the testing piece into the 70 models. Table 5.1 shows that the highest likelihood values are generated by the models trained on the pieces themselves. That means our trained models can “pick” the data sequence that is the most “similar” to themselves.
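In code, this validation amounts to checking that each row maximum of the piece-by-model log-likelihood matrix lies on the diagonal; the scores below are invented for illustration.

```python
# Rows: the three re-tested training pieces; columns: the models, in
# the same order. Values are hypothetical log-likelihoods; a sound
# database puts each row maximum on the diagonal, i.e. every piece
# prefers the model trained on itself.
scores = [
    [-100.0, -151.0, -162.0],
    [-143.0, -95.0, -150.0],
    [-155.0, -147.0, -91.0],
]
self_consistent = all(
    max(range(len(row)), key=row.__getitem__) == i
    for i, row in enumerate(scores)
)
```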

Testing piece | Composer | 1st piece | 2nd piece | 3rd piece
B minor BWV869 fugue | Bach | Bach B minor BWV869 fugue | Bach G major BWV884 fugue | Bach G major BWV884 fugue
Sonata Op.2 No.2 | Beethoven | Beethoven Sonata Op.2 No.2 | Beethoven Sonata Op.31 No.2 1mvt | Beethoven Sonata Op.2 No.1
Waltz Op.34 No.2 in A minor | Chopin | Chopin Waltz Op.34 No.2 in A minor | Chopin Waltz Op.64 No.2 in C# minor | Schubert Eb Major D568 Allegro Moderato

Table 5.1: Validation table

5.3.2 Analysis of the result

After validating our model, we want to see how it performs on the 33 new testing pieces. Table 5.2 shows some examples from the Bach results. For each testing piece, the model chooses Bach pieces every time. However, the result does not show that the model can distinguish the tonality (major or minor) of the piece.

Testing piece | Composer | 1st piece name | 2nd piece name | 3rd piece name
A minor BWV865 Fugue | Bach | Bach G major BWV884 Fugue | Bach A major BWV864 Fugue | Bach E minor BWV879 Fugue
A minor BWV865 Prelude | Bach | Bach B minor BWV869 Prelude | Bach A major BWV864 Prelude | Bach F# major BWV882 Prelude
Bb minor BWV867 Fugue | Bach | Bach F# major BWV882 Prelude | Bach F# major BWV882 Fugue | Bach Bb minor BWV867 Prelude
B major BWV868 Prelude | Bach | Bach B major BWV868 Fugue | Bach C# major BWV872 Fugue | Bach G minor BWV861 Fugue

Table 5.2: Part of Bach Classification Result

The classifications for the Beethoven Sonatas are also quite promising, despite a few misclassifications.

Testing piece | Composer | 1st piece name | 2nd piece name | 3rd piece name
Sonata Op.10 No.3 | Beethoven | Beethoven Op.31 No.2 | Beethoven Op.2 No.2 | Schoenberg Gigue
Sonata Op.2 No.3 | Beethoven | Beethoven Op.2 No.2 | Beethoven Op.2 No.1 | Beethoven Op.31 No.2 1mvt
Sonata Op.49 No.1 | Beethoven | Beethoven Sonata Op.78 No.24 | Beethoven Sonata Op.49 No.2 | Beethoven Sonata Op.2 No.1

Table 5.3: Part of Beethoven Classification Result

An interesting observation in the classifications of the Schubert Sonatas is that the Schubert testing pieces vacillate between choosing Schubert and Beethoven. In Section 4.1, we described how Beethoven and Schubert are two composers from similar eras with similar composing styles. This indicates that the modeling process finds similarities as well as distinguishes pieces of different styles. Yet it also indicates that there are improvements to be made.

Testing piece | Composer | 1st piece name | 2nd piece name | 3rd piece name
Sonata Op.78 D894 3mvt | Schubert | Schubert Sonata D959 1mvt | Beethoven Sonata Op.54 No.1 | Schoenberg Gigue
Sonata D664 A major 1mvt | Schubert | Schubert Sonata D664 A major 3mvt | Beethoven Sonata Op.14 No.1 | Schoenberg Gigue
Sonata D960 3rd mvt | Schubert | Schubert Sonata D845 3mvt | Schubert Sonata D959 1mvt | Schubert Sonata D958 1mvt

Table 5.4: Part of Schubert Classification Result

The accuracy of the classifications of Debussy's pieces is quite good. This is not surprising because Debussy's composing style is very distinct from that of the other composers: he was influenced early on by Russian and Eastern music, which led him to develop his unique style of harmony and musical color. The NA entries for the fourth testing piece indicate that the other pieces in the retrieval data are so unlike Debussy's Images II3 that the likelihood falls below machine precision.

Testing piece | Composer | 1st piece name | 2nd piece name | 3rd piece name
Jimbos Lullaby | Debussy | Schoenberg Schwungvoll | Schoenberg Sehr rasch | Schubert D960 1mvt
Images I2 | Debussy | Debussy Images I1 | Debussy Images II1 | Debussy Doctor Gradus Ad Parnassum
Images I3 | Debussy | Debussy Images I1 | Debussy Images II1 | Debussy Doctor Gradus Ad Parnassum
Images II3 | Debussy | Debussy Images II1 | NA | NA

Table 5.5: Part of Debussy Classification Result

An unexpected result occurs in the classifications of Chopin's pieces, which show no clear pattern indicating which models are most likely to generate the piece, and many are classified to the contemporary composer Schoenberg. This is surprising for the Chopin Waltzes because of their unique 3/4 rhythm and bouncing melodic feature. This misclassification may be caused by the fact that MFCCs are not good at capturing rhythmic features. Further investigation is needed to find the cause of this discrepancy.

Testing piece | Composer | 1st piece name | 2nd piece name | 3rd piece name
Nocturne No.9 in b minor | Chopin | Schoenberg Schwungvoll | Beethoven Sonata Op.78 No.24 | Schoenberg Gigue
Waltz B.150 in A minor | Chopin | Schoenberg Schwungvoll | Chopin Waltz Op.64 No.2 in C sharp minor | Schoenberg Gigue
Waltz G flat Major Op.70 No.1 | Chopin | Chopin Waltz Op.64 No.2 in C sharp minor | Schubert E Flat Major D568 Allegro moderato | Schubert D959 1mvt

Table 5.6: Part of Chopin Classification Result

We also test Schumann's Carnaval Op.9 No.12 “Chopin”, and the result is quite interesting. The piece easily excludes the composers Bach, Beethoven, Debussy and Schoenberg. From Table 5.7, we can see that it selects Schubert pieces for the top three choices. Then the choices oscillate between Schumann himself and Chopin, and the differences between the likelihoods of those two composers are not large. However, this result is not that unexpected, since the two chosen Schubert pieces share many dynamic patterns with Chopin, such as the Allegretto tempo (a moderately fast speed, often played with a light character) and a lyrical melody.

Schumann Carnaval Op.9 No.12 Likelihood

1st Schubert D959 1st mvt 75595.83

2nd Schubert D958 1st mvt 75113.22

3rd Schubert D845 3rd mvt 72189.3

4th Schumann Variations abegg 72188.83

5th Schumann Sonata f sharp minor 70449.60

6th Schumann Carnaval Op. 9 No.2 69554.24

7th Chopin Waltz Op.64 No. 2 in C sharp minor 65375.72

8th Schumann Fantasiestücke, Op.12 Warum 64832.86

Table 5.7: Result table for Schumann Carnaval Op.9 No.12

5.3.3 Accuracy

To examine accuracy, we define the accuracy of the classification as follows: for each testing piece, as long as the true composer appears among the top three pieces, we count it as a “success”. We sum up all the successes and divide by the number of testing pieces n. The accuracy for each composer can be computed in the same way, where n in that case is the number of testing pieces by that composer. Table 5.8 summarizes the accuracy of the classification. Among the 33 testing pieces, 24 are classified correctly.

However, the accuracy varies from one composer to another. Bach has the highest accuracy, 100%. This is not surprising because he is the only composer among the seven who belongs to the Baroque period, and his composing style is relatively consistent. From the table, it seems hard to tell Chopin's work apart from the other composers' work, given his lowest accuracy. However, this accuracy table by itself may not be the best way to characterize a composer's work, since a composer may write in diverse styles across different pieces (especially composers of the Romantic period, who have more freedom in form and more dramatic dynamics). Still, the table can to some extent reflect the consistency and distinctiveness of a composer's style.

Total accuracy 72.7% n=33

Bach 100% n=7

Beethoven 100% n=4

Schubert 80% n=5

Debussy 75% n=4

Schoenberg 50% n=4

Schumann 50% n=4

Chopin 40% n=5

Table 5.8: Accuracy table
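The accuracy just defined is a top-three hit rate and is straightforward to compute from the result tables; a small sketch follows (the composer labels in the demo data are illustrative, not the thesis's actual per-piece results).

```python
from collections import defaultdict

def top3_accuracy(results):
    """results: list of (true_composer, top3_composers) pairs.
    A piece is a success when its true composer appears anywhere
    among the composers of the three retrieved pieces."""
    return sum(true in top3 for true, top3 in results) / len(results)

def per_composer_accuracy(results):
    """Same success rule, computed separately for each composer."""
    grouped = defaultdict(list)
    for true, top3 in results:
        grouped[true].append(true in top3)
    return {c: sum(hits) / len(hits) for c, hits in grouped.items()}

# Illustrative: two of these three test pieces hit their composer.
demo = [
    ("Bach", ["Bach", "Bach", "Beethoven"]),
    ("Chopin", ["Schoenberg", "Schoenberg", "Beethoven"]),
    ("Schubert", ["Beethoven", "Schubert", "Schubert"]),
]
```

Applying this rule to the full test set gives 24 successes out of 33 pieces, i.e. the 72.7% reported in Table 5.8.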

5.3.4 Problems and Concerns

We now wish to describe some aspects of the results that suggest further work is necessary. Firstly, only the likelihood value P(O|λ) is used to suggest similarity, and there are other potential ways to measure it. For example, in Qi et al.'s paper, music similarity is computed from the distance between the respective HMM mixture models [8]: for two HMM mixture models, they generate a data sequence from each and compute a distance between the sequences. This may give us an alternative way to measure the similarity between pieces. Secondly, we need to develop a more systematic and reliable way to measure the accuracy of the classification. So far, classification accuracy is defined as the percentage of “correct” classifications over the total number of classifications, with an error-tolerant rule under which a classification counts as “correct” as long as the true composer appears at least once among the chosen models. There are more sophisticated and rigorous ways to define accuracy. For example, we could take into account the number of occurrences of the true composer among the three chosen models, or give different weights to the three rankings, with the first weighted the highest. Thirdly, when building the model, errors occur when the starting values for an emission distribution are very unlikely to generate any of the emitted observations; that means the initialization of the parameters may sometimes be unstable. When testing new pieces, the same error occurs when the given piece is very distinct from a certain model, for example, putting the Beethoven Sonatas into Debussy's model. Finally, we only have 70 training pieces and 33 testing pieces so far. The sample size needs to be expanded to make the conclusions more general.

Chapter 6

Conclusion and Future work

This research introduces a statistical model, the Hidden Markov Model, to classify western classical piano works. Harmonic and melodic features are extracted from the audio files using Mel-frequency cepstral coefficients (MFCCs). Model parameters are estimated with an Expectation-Maximization procedure, the Baum-Welch algorithm, and the similarity metric P(O|λ) is computed with the forward algorithm. The emission parameters are initialized using model-based agglomerative hierarchical clustering, and the number of hidden states is determined using the Bayesian Information Criterion. The accuracy of the classification shows the advantage and effectiveness of applying Hidden Markov Models to music selection.

Still, much more can be done based on the current work. First of all, MFCC is only one promising technique for capturing musical features. The chroma feature, a feature vector used for chord or harmonic recognition, may be another effective way to capture more features of the music, and rhythm features could capture different rhythmic patterns (which may be useful for distinguishing special rhythms, e.g. the Chopin Waltzes). Secondly, the estimation of the parameters may be improved by adding a prior distribution on them. In Qi et al.'s paper, the work develops an HMM mixture model in a Bayesian setting using a non-parametric Dirichlet process as the prior distribution on the parameters of each individual HMM [8]. This way, the posterior of the model parameters can be learned, so that the process generates an ensemble of HMMs rather than a point estimate of the model parameters; adding a prior may also improve the initialization of the parameters. Thirdly, as stated in Chapter 4, a more systematic way to measure the accuracy of the model needs to be developed, since there is a lot of variability in classifying piano music by composer and one composer may write in diverse styles. Last but not least, the database needs to be expanded to generalize the results, and a user interface could be built to turn this classification method into a music information retrieval system that is applicable to users.

Appendix A

Appendix

Figure A.1: 12 MFCC from Bach A flat major BWV862 fugue

Table A.1: Result table

Table A.2: Result table (cont.)

Table A.3: Result table (cont.)

[The full result tables list, for each of the 33 testing pieces, the ground-truth composer and the three pieces with the highest likelihood values.]

List of Figures

2.1 Flow chart of content-based MIR query system...... 7

2.2 Audio signal from Bach Fugue in B Flat Minor...... 8

2.3 Spectrogram from Bach Fugue in B Flat Minor...... 8

2.4 A Mel-filter bank containing 12 filters...... 10

2.5 12-dimensional MFCC data from Bach B flat minor BWV867 fugue. 12

2.6 Histogram of first MFCC from Bach B flat minor BWV867 fugue.. 12

4.1 BIC plot of model selection for Bach A flat major BWV862 fugue.. 32

5.1 Flow chart of western music time period...... 34

5.2 MFCC of Bach Fugue and Prelude...... 36

5.3 MFCC from Beethoven Sonata...... 38

5.4 MFCCs from Schubert Sonatas...... 40

5.5 MFCCs from Chopin Nocturne and Waltz...... 42

5.6 MFCC from Debussy Image i and ii...... 44

5.7 MFCCs from Schumann Carnaval Op.9 Chopin and Fantasiestücke Op.12 Des Abends...... 46

5.8 MFCCs of two pieces from Schoenberg Piano suite, Mässige Achtel and Sehr Rasch...... 48

A.1 12 MFCC from Bach A flat major BWV862 fugue...... 62

Bibliography

[1] Markus Schedl and Emilia Gómez. “Music Information Retrieval: Recent Developments and Applications”. In: Foundations and Trends in Information Retrieval 8.2–3 (2014), p. 55. doi: 10.1561/1500000042.

[2] Music for everyone. url: http://www.spotify.com/.

[3] Music and Podcasts, Free and On-Demand. url: http://www.pandora.com/.

[4] imslp.com. url: http://www.imslp.com/.

[5] H. Terasawa, M. Slaney, and J. Berger. “The thirteen colors of timbre”. In: IEEE Workshop on Applications of Signal Processing to Audio and Acoustics. 2005, pp. 323–326. isbn: 978-0-7803-9154-3. doi: 10.1109/ASPAA.2005.1540234.

[6] Changsheng Xu et al. “Musical genre classification using support vector machines”. In: 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, Proceedings (ICASSP '03). 2003. doi: 10.1109/icassp.2003.1199998.

[7] G. Tzanetakis and P. Cook. “Musical genre classification of audio signals”. In: IEEE Transactions on Speech and Audio Processing 10.5 (2002), pp. 293–302. doi: 10.1109/tsa.2002.800560.

[8] Yuting Qi, John William Paisley, and Lawrence Carin. “Music Analysis Using Hidden Markov Mixture Models”. In: IEEE Transactions on Signal Processing 55.11 (Nov. 2007), pp. 5209–5224. issn: 1053-587X. doi: 10.1109/TSP.2007.898782.

[9] iTunes. url: https://www.apple.com/itunes/.

[10] M. A. Casey et al. “Content-Based Music Information Retrieval: Current Directions and Future Challenges”. In: Proceedings of the IEEE 96.4 (2008), pp. 668–696. doi: 10.1109/jproc.2008.916370.

[11] shazam. url: https://www.shazam.com/.

[12] nayio. url: http://www.nayio.com/.

[13] soundhound. url: https://www.soundhound.com/.

[14] Peter Knees and Markus Schedl. “Introduction to Music Similarity and Retrieval”. In: Music Similarity and Retrieval. The Information Retrieval Series (2016), pp. 1–30. doi: 10.1007/978-3-662-49722-7_1.

[15] Crypto. url: http://practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-cepstral-coefficients-mfccs/.

[16] Lawrence R. Rabiner. “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition”. In: Readings in Speech Recognition (1990), pp. 267–296. doi: 10.1016/b978-0-08-051584-7.50027-9.

[17] Fanny Yang, Sivaraman Balakrishnan, and Martin J. Wainwright. “Statistical and Computational Guarantees for the Baum-Welch Algorithm”. In: arXiv:1512.08269 [cs, math, stat] (Dec. 27, 2015). arXiv: 1512.08269. url: http://arxiv.org/abs/1512.08269 (visited on 10/18/2018).

[18] Chuong B. Do and Serafim Batzoglou. “What is the expectation maximization algorithm?” In: Nature Biotechnology 26.8 (Aug. 2008), pp. 897–899. issn: 1087-0156, 1546-1696. doi: 10.1038/nbt1406.

[19] Frederick S. Hillier and Gerald J. Lieberman. Introduction to Operations Research. 1995.

[20] Chris Fraley and Adrian E. Raftery. “Model-Based Clustering, Discriminant Analysis, and Density Estimation”. In: Journal of the American Statistical Association 97.458 (June 2002), pp. 611–631. issn: 0162-1459, 1537-274X. doi: 10.1198/016214502760047131.

[21] Luca Scrucca et al. “mclust 5: Clustering, Classification and Density Estimation Using Gaussian Finite Mixture Models”. In: The R Journal 8 (2016), p. 29.

[22] Classical Archives LLC. Collect the “Must-Know/Must-Have” Classical Hits. url: http://www.classicalarchives.com/.

[23] Colin Wight. Johann Sebastian Bach (1685–1750). Mar. 2014. url: https://www.bl.uk/onlinegallery/onlineex/musicmanu/bach/.

[24] Colin Wight. Ludwig van Beethoven (1770–1827). Mar. 2014. url: http://www.bl.uk/onlinegallery/onlineex/musicmanu/beethoven/index.html.

[25] Franz Peter Schubert - Life and Music. url: http://www.franzpeterschubert.com/schuberts_sonatas.html.

[26] Arthur Hedley and Leon Plantinga. Frédéric Chopin. Feb. 2019. url: https://www.britannica.com/biography/Frederic-Chopin.

[27] url: http://www.classicalnotes.net/classics3/chopinwaltzes.html.

[28] Richard Langham Smith. “Impressionism”. In: The Oxford Companion to Music (2011).

[29] Gerald E. H. Abraham. Robert Schumann. Jan. 2019. url: https://www.britannica.com/biography/Robert-Schumann.

[30] Eric Frederick Jensen. “Endenich”. In: Schumann (2012), pp. 297–317. doi: 10.1093/acprof:osobl/9780199737352.003.0015.

[31] Ine Heneghan. “Composing with Tones: A Musical Analysis of Schoenberg's Op. 23 Pieces for Piano, by Kathryn Bailey”. In: Music Analysis 26.3 (2007), pp. 373–380. doi: 10.1111/j.1468-2249.2008.00264.x.

[32] Package mhsmm. url: https://cran.r-project.org/web/packages/mhsmm/index.html.