<<

The Single Hidden Layer Neural Network Based Classifiers for Han Chinese Folk Songs

Sui Sin Khoo

A thesis submitted in fulfilment of the requirements for the degree of Doctor of Philosophy, Faculty of Engineering and Industrial Sciences, Swinburne University of Technology, Australia

2013


Abstract

This thesis investigates the application of a few powerful machine learning techniques in music classification using a symbolic database of folk songs: The Essen Folksong Collection.

Firstly, a meaningful and representative theory-based method of encoding Chinese folk songs, called the musical feature density map (MFDMap), is developed to enable efficient classification by machines. This encoding method effectively encapsulates useful musical information in a form that is machine-readable and, at the same time, easily interpreted by humans. This encoding will aid ethnomusicologists in future folk song research.

The extreme learning machine (ELM), an extremely fast machine learning algorithm that utilizes the structure of the single-hidden layer feedforward neural network (SLFN), is employed as the machine classifier. This algorithm performs at a very fast speed and has good generalization performance. The application of the ELM classifier and its enhanced variant, the regularized extreme learning machine (R-ELM), to real-world multi-class folk song classification is examined in this thesis. The effectiveness of the MFDMap encoding technique combined with the ELM classifiers for multi-class folk song classification is verified.

The finite impulse response extreme learning machine (FIR-ELM) is a relatively new learning algorithm. It is a powerful algorithm in the sense that its robustness is reflected in the design of both the input weights and the output weights. This algorithm can effectively remove input disturbances and undesired frequency components in the input data. The capability of the FIR-ELM in solving complex real-world multi-class classification is examined in this thesis. The MFDMap performed more effectively with the FIR-ELM, and the classification accuracy using the FIR-ELM is significantly better than that of both the ELM and the R-ELM.

The techniques of folk song classification proposed in this thesis are further investigated on different data samples. These techniques are also applied to European folk songs, whose culture is very different from the Chinese culture, to investigate the flexibility of the learning machines. In addition, the roles and relationships of four music elements, solfege, interval, duration and duration ratio, are investigated.

Acknowledgement

I would like to express my gratitude to my supervisor, Professor Zhihong Man, who has given me both guidance and courage to pursue the work in this thesis. Special thanks for his patience with my slow responses and for his advice that led me along the path.

I would also like to express my utmost gratefulness to my parents for their love and constant support, interest and encouragement that have led me to this point in my life. I would love to express my deepest appreciation to my dearest brother, who has led and inspired me along my way.

Sweet thanks to Aiji, Kevin, Fei Siang, Hai, and Tuan Do for all the laughter and companionship during my years of research at Swinburne.



Declaration

This is to certify that:

1. This thesis contains no material which has been accepted for the award to the candidate of any other degree or diploma, except where due reference is made in the text of the examinable outcome.

2. To the best of the candidate’s knowledge, this thesis contains no material previously published or written by another person except where due reference is made in the text of the examinable outcome.

3. Where the work is based on joint research and publications, the relative contributions of the respective authors are disclosed.

______

Sui Sin Khoo, 2013




Table of Contents

ABSTRACT ..... i
ACKNOWLEDGEMENT ..... iii
LIST OF FIGURES ..... xi
LIST OF TABLES ..... xiii
LIST OF ACRONYMS ..... xix

1. INTRODUCTION ..... 1
   1.1 Motivation ..... 1
   1.2 Contribution ..... 3
   1.3 Organization of the Thesis ..... 4

2. LITERATURE REVIEW ..... 7
   2.1 Artificial Neural Network ..... 7
       2.1.1 McCulloch-Pitts Threshold Processing Unit ..... 8
       2.1.2 Rosenblatt's Perceptron ..... 9
       2.1.3 Multi-Layer Perceptron ..... 11
       2.1.4 Learning Algorithms ..... 13
       2.1.5 Extreme Learning Machine ..... 16
   2.2 Music Representations ..... 20
       2.2.1 Audio Format ..... 21
       2.2.2 Symbolic Format ..... 32
   2.3 Discussion ..... 41

3. MUSIC REPRESENTATION AND THE MUSICAL FEATURE DENSITY MAP ..... 43
   3.1 Ethnomusicology Background on Geographical Based Han Chinese Folk Song Classification ..... 44
       3.1.1 Rationale for the Choice of the Five Classes ..... 49
   3.2 Music Data Set – The Essen Folksong Collection ..... 51
       3.2.1 The **Kern Representation ..... 52
       3.2.2 An Example of Han Chinese Folk Song in **Kern Format ..... 53
       3.2.3 Assumptions in **Kern Version of the Essen Folksong Collection ..... 59
   3.3 Music Elements and Encoding ..... 60
       3.3.1 Pitch Elements ..... 61
       3.3.2 Duration Elements ..... 66
   3.4 The Musical Feature Density Map ..... 72
       3.4.1 Advantage of the Musical Feature Density Map ..... 79
       3.4.2 Future Enhancement to the Musical Feature Density Map ..... 92

4. THE EXTREME LEARNING MACHINE FOLK SONG CLASSIFIER ..... 93
   4.1 Introduction ..... 94
   4.2 Extreme Learning Machine ..... 95
   4.3 Regularized Extreme Learning Machine ..... 98
   4.4 Experiment Design and Setting ..... 100
       4.4.1 Data Pre-Processing and Post-Processing ..... 100
       4.4.2 Parameter Setting ..... 109
   4.5 Experiment Results ..... 110
   4.6 Discussion ..... 122
   4.7 Conclusion ..... 125

5. THE FINITE IMPULSE RESPONSE EXTREME LEARNING MACHINE FOLK SONG CLASSIFIER ..... 127
   5.1 Introduction ..... 127
   5.2 Finite Impulse Response Extreme Learning Machine ..... 129
   5.3 Experiment Design and Setting ..... 135
       5.3.1 Data Pre-Processing and Post-Processing ..... 136
       5.3.2 Parameter Setting ..... 137
   5.4 Experiment Results ..... 138
   5.5 Discussion ..... 149
   5.6 Conclusion ..... 152

6. A TWO-CASE EUROPEAN FOLK SONG CLASSIFICATION ..... 155
   6.1 Introduction ..... 155
   6.2 Experiment Design and Setting ..... 156
       6.2.1 The Musical Feature Density Map ..... 156
       6.2.2 Data Set ..... 160
       6.2.3 Parameter Setting ..... 160
   6.3 Experiment Results ..... 161
   6.4 Discussion ..... 164
   6.5 Conclusion ..... 167

7. CONCLUSION ..... 169
   7.1 Summary ..... 169
   7.2 Future Works ..... 171

REFERENCES ..... 173
APPENDIX A. FOLK SONG CLASSIFICATION USING AUDIO REPRESENTATION ..... 189
LIST OF PUBLICATIONS ..... 213



List of Figures

2.1 An example of a Threshold Processing Unit ..... 9
2.2 A single-hidden layer feedforward neural network ..... 12
2.3 The flow diagram of the construction of a beat histogram ..... 30
3.1 Map of the three main rivers: the Yellow River, the Yangtze River and the Pearl River ..... 46
3.2 Map of the regions in China with the five classes studied in this thesis highlighted ..... 50
3.3 The music score of a Jiangsu folk song – Si Ji Ge ..... 54
3.4 A **kern representation of the Jiangsu folk song – Si Ji Ge ..... 55
3.5 An example of a Jiangsu folk song encoded using solfege representation ..... 64
3.6 An example of a Jiangsu folk song encoded using interval representation ..... 65
3.7 The seven most commonly used durations ..... 67
3.8 Examples of tie notes and their equivalence in duration ..... 67
3.9 Examples of dotted notes and their equivalence in duration ..... 67
3.10 Examples of triplets and their equivalence in duration ..... 68
3.11 An example of a Jiangsu folk song encoded using duration representation ..... 69
3.12 An example of a Jiangsu folk song encoded using duration ratio representation ..... 71
3.13 The flow chart for constructing a MFDMap ..... 74
3.14 The music score and the encoded solfege, interval, duration and duration ratio representations (Step 1 to 4 in constructing Case 1 MFDMap) ..... 75
3.15 The Case 1 MFDMap for Shanxi folk song – Zou Xi Kou ..... 77
3.16 The music score and the encoded solfege, interval, duration and duration ratio representations (Step 1 to 4 in constructing Case 2 MFDMap – rests omitted) ..... 80
3.17 The Case 2 MFDMap for Shanxi folk song – Zou Xi Kou ..... 82
3.18 Example of Class 1 folk song using windowing method ..... 84
3.19 Example of Class 2 folk song using windowing method ..... 85
3.20 Example of Class 3 folk song using windowing method ..... 85
3.21 Example of Class 4 folk song using windowing method ..... 86
3.22 Example of Class 5 folk song using windowing method ..... 86
3.23 Example of Class 1 folk song using Case 1 MFDMap ..... 87
3.24 Example of Class 2 folk song using Case 1 MFDMap ..... 87
3.25 Example of Class 3 folk song using Case 1 MFDMap ..... 88
3.26 Example of Class 4 folk song using Case 1 MFDMap ..... 88
3.27 Example of Class 5 folk song using Case 1 MFDMap ..... 89
3.28 Example of Class 1 folk song using Case 2 MFDMap ..... 89
3.29 Example of Class 2 folk song using Case 2 MFDMap ..... 90
3.30 Example of Class 3 folk song using Case 2 MFDMap ..... 90
3.31 Example of Class 4 folk song using Case 2 MFDMap ..... 91
3.32 Example of Class 5 folk song using Case 2 MFDMap ..... 91
4.1 A single-hidden layer feedforward neural network ..... 96
5.1 A single hidden layer neural network with linear neurons and time-delay elements ..... 130
6.1 An example of raw musical data of Austrian folk song ..... 157
6.2 An example of raw musical data of German folk song ..... 157
6.3 An example of a MFDMap of Austrian folk song ..... 158
6.4 An example of a MFDMap of German folk song ..... 158
6.5 The FIR-ELM network structure with linear neurons and time-delay elements ..... 161
6.6 Classification accuracy of the low-pass FIR-ELM with 100 hidden neurons (MFDMap: interval, duration and duration ratio) ..... 166
6.7 Classification accuracy of four filters FIR-ELM with cutoff frequency 0.1 (MFDMap: interval, duration and duration ratio) ..... 167


List of Tables

3.1 The solfege encoding reference table (all tonics start within principal octave) ..... 63
3.2 List of durations and the encoded representation ..... 70
3.3 List of encoded music representations and their respective occurrence percentage (Step 5 to 7 in constructing Case 1 MFDMap) ..... 76
3.4 List of encoded music representations and their respective occurrence percentage (Step 5 to 7 in constructing Case 2 MFDMap — rests omitted) ..... 81
4.1 Selected list of reduced MFDMaps and their respective list of features (Case 1, notes and rests) ..... 103
4.2 Selected list of reduced MFDMaps and their respective list of features (Case 2, only notes) ..... 106
4.3 Classification accuracy of the ELM classifier using Case 1 and Case 2 MFDMap with the original map size ..... 111
4.4 Classification accuracy of the ELM classifier using Case 1 and Case 2 MFDMap with x = 3 ..... 112
4.5 Classification accuracy of the ELM classifier using Case 1 and Case 2 MFDMap with x = 5 ..... 112
4.6 Classification accuracy of the ELM classifier using Case 1 and Case 2 MFDMap with x = 10 ..... 113
4.7 Classification accuracy of the ELM classifier using Case 1 and Case 2 MFDMap with x = 15 ..... 113
4.8 Classification accuracy of the ELM classifier using Case 1 and Case 2 MFDMap with x = 20 ..... 114
4.9 Classification accuracy of the ELM classifier using Case 1 and Case 2 MFDMap with x = 30 ..... 114
4.10 Classification accuracy of the ELM classifier using Case 1 and Case 2 MFDMap with x = 40 ..... 115
4.11 Classification accuracy of the ELM classifier using Case 1 and Case 2 MFDMap with x = 50 ..... 115
4.12 Classification accuracy of the R-ELM classifier using Case 1 and Case 2 MFDMap with the original map size ..... 116
4.13 Classification accuracy of the R-ELM classifier using Case 1 and Case 2 MFDMap with x = 3 ..... 116
4.14 Classification accuracy of the R-ELM classifier using Case 1 and Case 2 MFDMap with x = 5 ..... 117
4.15 Classification accuracy of the R-ELM classifier using Case 1 and Case 2 MFDMap with x = 10 ..... 117
4.16 Classification accuracy of the R-ELM classifier using Case 1 and Case 2 MFDMap with x = 15 ..... 118
4.17 Classification accuracy of the R-ELM classifier using Case 1 and Case 2 MFDMap with x = 20 ..... 118
4.18 Classification accuracy of the R-ELM classifier using Case 1 and Case 2 MFDMap with x = 30 ..... 119
4.19 Classification accuracy of the R-ELM classifier using Case 1 and Case 2 MFDMap with x = 40 ..... 119
4.20 Classification accuracy of the R-ELM classifier using Case 1 and Case 2 MFDMap with x = 50 ..... 120
4.21 Confusion matrix for Case 1 MFDMap with map size 71 (x = 15) at 8000 hidden neurons, using the ELM classifier ..... 120
4.22 Confusion matrix for Case 2 MFDMap with map size 63 (x = 15) at 8000 hidden neurons, using the ELM classifier ..... 121
4.23 Confusion matrix for Case 1 MFDMap with map size 121 (x = 3) at 3000 hidden neurons, using the R-ELM classifier ..... 121
4.24 Confusion matrix for Case 2 MFDMap with map size 63 (x = 15) at 5000 hidden neurons, using the R-ELM classifier ..... 122
5.1 Classification accuracy using Case 1 MFDMap with the original map size (map size = 172, ω_c = 0.6, d/γ = 0.001) ..... 139
5.2 Classification accuracy using Case 1 MFDMap with x = 3 (map size = 121, ω_c = 0.6, d/γ = 0.001) ..... 140
5.3 Classification accuracy using Case 1 MFDMap with x = 5 (map size = 101, ω_c = 0.6, d/γ = 0.001) ..... 140
5.4 Classification accuracy using Case 1 MFDMap with x = 10 (map size = 81, ω_c = 0.6, d/γ = 0.001) ..... 141
5.5 Classification accuracy using Case 1 MFDMap with x = 15 (map size = 71, ω_c = 0.6, d/γ = 0.001) ..... 141
5.6 Classification accuracy using Case 1 MFDMap with x = 20 (map size = 63, ω_c = 0.6, d/γ = 0.001) ..... 142
5.7 Classification accuracy using Case 1 MFDMap with x = 30 (map size = 55, ω_c = 0.6, d/γ = 0.001) ..... 142
5.8 Classification accuracy using Case 1 MFDMap with x = 40 (map size = 47, ω_c = 0.6, d/γ = 0.001) ..... 143
5.9 Classification accuracy using Case 1 MFDMap with x = 50 (map size = 40, ω_c = 0.6, d/γ = 0.001) ..... 143
5.10 Classification accuracy using Case 2 MFDMap with the original map size (map size = 145, ω_c = 0.6, d/γ = 0.001) ..... 144
5.11 Classification accuracy using Case 2 MFDMap with x = 3 (map size = 102, ω_c = 0.6, d/γ = 0.001) ..... 144
5.12 Classification accuracy using Case 2 MFDMap with x = 5 (map size = 88, ω_c = 0.6, d/γ = 0.001) ..... 145
5.13 Classification accuracy using Case 2 MFDMap with x = 10 (map size = 73, ω_c = 0.6, d/γ = 0.001) ..... 145
5.14 Classification accuracy using Case 2 MFDMap with x = 15 (map size = 63, ω_c = 0.6, d/γ = 0.001) ..... 146
5.15 Classification accuracy using Case 2 MFDMap with x = 20 (map size = 58, ω_c = 0.6, d/γ = 0.001) ..... 146
5.16 Classification accuracy using Case 2 MFDMap with x = 30 (map size = 49, ω_c = 0.6, d/γ = 0.001) ..... 147
5.17 Classification accuracy using Case 2 MFDMap with x = 40 (map size = 44, ω_c = 0.6, d/γ = 0.001) ..... 147
5.18 Classification accuracy using Case 2 MFDMap with x = 50 (map size = 37, ω_c = 0.6, d/γ = 0.001) ..... 148
5.19 Confusion matrix for Case 1 MFDMap with x = 15 at 500 hidden neurons (map size = 71, ω_c = 0.6, d/γ = 0.001) ..... 148
5.20 Confusion matrix for Case 2 MFDMap with x = 15 at 500 hidden neurons (map size = 63, ω_c = 0.6, d/γ = 0.001) ..... 149
5.21 Classification accuracy of the RPROP, ELM, R-ELM, FIR-ELM and SVM classifier ..... 149
6.1 The fifteen MFDMaps ..... 159
6.2 Classification accuracy (%) using one music element in the MFDMap ..... 162
6.3 Classification accuracy (%) using two music elements in the MFDMap ..... 162
6.4 Classification accuracy (%) using three music elements in the MFDMap ..... 163
6.5 Confusion matrix for MFDMap using interval, duration and duration ratio elements ..... 163
6.6 Classification accuracy of the RPROP, ELM, R-ELM, FIR-ELM and SVM classifier ..... 163
A.1 Classification accuracy (%) of the RPROP classifier using median ..... 191
A.2 Classification accuracy (%) of the RPROP classifier using mean ..... 191
A.3 Classification accuracy (%) of the RPROP classifier using variance ..... 192
A.4 Classification accuracy (%) of the RPROP classifier using median and mean ..... 192
A.5 Classification accuracy (%) of the RPROP classifier using median and variance ..... 193
A.6 Classification accuracy (%) of the RPROP classifier using mean and variance ..... 193
A.7 Classification accuracy (%) of the RPROP classifier using median, mean and variance ..... 194
A.8 Classification accuracy (%) of the ELM classifier using median ..... 194
A.9 Classification accuracy (%) of the ELM classifier using mean ..... 195
A.10 Classification accuracy (%) of the ELM classifier using variance ..... 195
A.11 Classification accuracy (%) of the ELM classifier using median and mean ..... 196
A.12 Classification accuracy (%) of the ELM classifier using median and variance ..... 196
A.13 Classification accuracy (%) of the ELM classifier using mean and variance ..... 197
A.14 Classification accuracy (%) of the ELM classifier using median, mean and variance ..... 197
A.15 Classification accuracy (%) of the low-pass FIR-ELM classifier using median ..... 198
A.16 Classification accuracy (%) of the low-pass FIR-ELM classifier using mean ..... 198
A.17 Classification accuracy (%) of the low-pass FIR-ELM classifier using variance ..... 199
A.18 Classification accuracy (%) of the low-pass FIR-ELM classifier using median and mean ..... 199
A.19 Classification accuracy (%) of the low-pass FIR-ELM classifier using median and variance ..... 200
A.20 Classification accuracy (%) of the low-pass FIR-ELM classifier using mean and variance ..... 200
A.21 Classification accuracy (%) of the low-pass FIR-ELM classifier using median, mean and variance ..... 201
A.22 Classification accuracy (%) of the high-pass FIR-ELM classifier using median ..... 201
A.23 Classification accuracy (%) of the high-pass FIR-ELM classifier using mean ..... 202
A.24 Classification accuracy (%) of the high-pass FIR-ELM classifier using variance ..... 202
A.25 Classification accuracy (%) of the high-pass FIR-ELM classifier using median and mean ..... 203
A.26 Classification accuracy (%) of the high-pass FIR-ELM classifier using median and variance ..... 203
A.27 Classification accuracy (%) of the high-pass FIR-ELM classifier using mean and variance ..... 204
A.28 Classification accuracy (%) of the high-pass FIR-ELM classifier using median, mean and variance ..... 204
A.29 Classification accuracy (%) of the band-pass FIR-ELM classifier using median ..... 205
A.30 Classification accuracy (%) of the band-pass FIR-ELM classifier using mean ..... 205
A.31 Classification accuracy (%) of the band-pass FIR-ELM classifier using variance ..... 206
A.32 Classification accuracy (%) of the band-pass FIR-ELM classifier using median and mean ..... 206
A.33 Classification accuracy (%) of the band-pass FIR-ELM classifier using median and variance ..... 207
A.34 Classification accuracy (%) of the band-pass FIR-ELM classifier using mean and variance ..... 207
A.35 Classification accuracy (%) of the band-pass FIR-ELM classifier using median, mean and variance ..... 208
A.36 Classification accuracy (%) of the band-stop FIR-ELM classifier using median ..... 208
A.37 Classification accuracy (%) of the band-stop FIR-ELM classifier using mean ..... 209
A.38 Classification accuracy (%) of the band-stop FIR-ELM classifier using variance ..... 209
A.39 Classification accuracy (%) of the band-stop FIR-ELM classifier using median and mean ..... 210
A.40 Classification accuracy (%) of the band-stop FIR-ELM classifier using median and variance ..... 210
A.41 Classification accuracy (%) of the band-stop FIR-ELM classifier using mean and variance ..... 211
A.42 Classification accuracy (%) of the band-stop FIR-ELM classifier using median, mean and variance ..... 211


List of Acronyms

ANN      artificial neural network
BH       beat histogram
BP       backpropagation
bpm      beats-per-minute
DFT      discrete Fourier transform
DWT      discrete wavelet transform
ELM      extreme learning machine
ERM      empirical risk minimization
EsAC     Essen Associative Code
FFT      fast Fourier transform
FIR      finite impulse response
FIR-ELM  finite impulse response extreme learning machine
FNN      feedforward neural network
FPH      folded pitch histogram
LPC      linear predictive coding
MFCC     Mel-frequency cepstral coefficients
MIDI     Musical Instrument Digital Interface
MLP      multi-layer perceptron
MFDMap   musical feature density map
OSC      Open Sound Control
PH       pitch histogram
R-ELM    regularized extreme learning machine
RMS      root mean square
RPROP    resilient propagation
SACF     summary enhanced autocorrelation function
SC       spectral centroid
SF       spectral flux
SR       spectral roll-off
SLFN     single-hidden layer feedforward neural network
SRM      structural risk minimization
SVM      support vector machine
TPU      threshold processing unit
UPH      unfolded pitch histogram
ZC       zero-crossing


Chapter 1

Introduction

This thesis will investigate the application of a few powerful learning machines in music classification using a symbolic database of folk songs: the Essen Folksong Collection [1]. A meaningful and representative theory-based method of encoding Chinese folk songs is first developed to enable classification by learning machines. This encoding will aid ethnomusicologists in future folk song research, and the superiority of the finite impulse response extreme learning machine (FIR-ELM) for music classification is confirmed.

1.1 Motivation

The single-hidden layer feedforward neural network (SLFN) is the simplest and most popular structure of multi-layer perceptrons. It has been shown in [2-3] that SLFNs with any continuous bounded nonlinear activation function, or any arbitrary (continuous or non-continuous) bounded activation function that has unequal limits at infinity, can approximate any continuous function and implement any classification application given a sufficiently large number of hidden neurons. Such an architecture has vast applications, particularly in pattern recognition.

In recent years, an emerging technology called the extreme learning machine (ELM) [4] has been attracting attention within the machine learning domain. The ELM is a learning algorithm designed specifically for single-hidden layer feedforward neural networks. Unlike conventional gradient descent-based algorithms, its main attractions are a very fast learning speed and good generalization performance. The ELM has wide applications in the pattern classification domain. Some examples are handwritten character recognition [5-6], classification of bioinformatics datasets [7-11], financial credit scoring [12-13], internet-based information processing [14-17] and music genre classification [18].

The finite impulse response extreme learning machine (FIR-ELM), an enhanced variation of the ELM which has been theoretically proven to greatly improve the robustness of the ELM, was proposed in [19]. The FIR-ELM is also designed for single-hidden layer feedforward neural networks. This algorithm adopts the concept of the FIR filter in the design of the hidden layer of the neural network to effectively remove input disturbances and undesired frequency components. This modification has greatly improved the robustness of the original ELM algorithm, especially in handling noisy data. In addition, an objective function that includes both the weighted sum of the output error squares and the weighted sum of the output weight squares is minimized in the output weight space of the neural network to compute a set of optimal output weights, further improving the robustness of the neural network. This new algorithm was employed for a real-world binary classification task with a bioinformatics dataset in [20]. However, until now there has been no application of such an algorithm to any real-world multi-class classification problem.

Chinese folk songs are an important part of Chinese culture. They are a valuable source for humanities research. They reflect the history, society, customs, tradition and everyday life of the nation. They are the faithful companion of the people in their daily life, serving as a form of entertainment and as an aid in labouring work. They serve as a medium to transfer and exchange knowledge and information, to express feelings, thoughts and emotions, to communicate and to entertain.

Chinese folk songs have a significant influence on the development of other forms of traditional music, including traditional dance music, opera, instrumental music and quyi (曲艺) [21]. Many instrumental and dance pieces are adapted or rearranged from folk songs. Chinese folk songs also have an active influence on court music, religious music and cultivated music. In addition, many contemporary composers produce works that use folk songs, or components of folk songs, as their themes, and works that reflect great influence from folk songs.

Chinese folk songs are unquestionably a very important asset of humanity. This thesis intends to contribute to preserving and sustaining this important art.

1.2 Contribution

The main contributions of this thesis are summarized as follows:

• A novel music encoding method is developed for encoding Chinese folk songs. This encoding method utilizes the symbolic representations of the musical elements and enables music to be represented in a manner that is as close to human perception as possible.

• The ELM technique is successfully implemented for folk song classification using real-world Han Chinese folk song data set.

• The FIR-ELM, an improved version of the ELM, gives a better outcome in solving folk song classification. The capability of such an algorithm in multi-class classification is verified. In addition, a potentially useful method of encoding songs is demonstrated which may be helpful in future ethnomusicology research of Chinese folk songs.

• The developed song encoding technique and the machine learning based classification algorithms are then applied to European folk songs and the performance and usefulness are successfully verified.


1.3 Organization of the Thesis

The main contents of this thesis are organized as follows:

Chapter 2 presents a brief overview of the artificial neural networks (ANNs), focusing on the SLFNs and the conventional learning algorithms developed for the network structure. A brief review on the techniques used for the representations of music in machine classification is included.

Chapter 3 discusses the ethnomusicology background for geographically based Han Chinese folk song classification, and the format and musical contents of the real-world data set employed for the research in this thesis. The music elements employed to characterize each class of folk songs and their respective methods of representation are presented. Finally, the novel technique of developing a feature map to meaningfully represent folk songs for machine classification without loss of musical meaning is proposed.

Chapter 4 presents an outline of the extreme learning machine (ELM) and the regularized extreme learning machine (R-ELM) algorithms, followed by a detailed description on the experiments’ design and settings for the implementation of machine classification. This chapter also includes a careful discussion on the technique of automatic classification for Chinese folk songs.

Chapter 5 investigates the capability of a new robust algorithm called the finite impulse response extreme learning machine (FIR-ELM) on multi-class classification problems. At the same time, the enhancement to the performance of automatic classification for Han Chinese folk songs is tested using such algorithms on a series of different experiments.

Chapter 6 presents a two-case European folk song classification task using the conclusions derived in Chapter 5, to further investigate the success rate of such techniques on folk songs of other cultures.


Chapter 7 concludes the research activities in this thesis and presents a summary of the findings. Some suggestions for future work are included in this chapter.




Chapter 2

Literature Review

This chapter presents a brief overview of the artificial neural networks, focusing particularly on the structure of the single-hidden layer feedforward neural networks and the conventional learning algorithms applicable to this network structure. A brief review on the techniques for the representations of music in machine classification is also included.

2.1 Artificial Neural Network

The human brain is a highly complex system. It is capable of performing parallel computation in a non-linear manner. Neurons in the human brain can be organized to perform multiple tasks such as pattern recognition, motor control and perception. Artificial neural networks (ANNs) mimic the organization and functionality of the human brain. The work on ANNs, commonly referred to simply as "neural networks", is vast and usually mimics the natural behaviour and phenomena of such a thinking system.

In order to have the capability of performing complex tasks, neural networks employ a massive interconnection of simple computing cells, which are usually referred to as "neurons" or "processing units". A good definition of a neural network is as follows [22]:


A neural network is a massively parallel distributed processor that has a natural propensity for storing experiential knowledge and making it available for use. It resembles the brain in two respects: 1. Knowledge is acquired by the network through a learning process. 2. Interneuron connection strengths known as synaptic weights are used to store the knowledge.

The learning algorithm is the procedure used to perform the learning process; it functions by modifying the synaptic weights in the network in an orderly fashion so as to achieve a desired design objective.

2.1.1 McCulloch-Pitts Threshold Processing Unit

The McCulloch-Pitts Threshold Processing Unit (TPU) is a concept developed by McCulloch and Pitts [22-24] in 1943. It can be considered the "initial" structure of the ANN. This model only takes binary inputs (0 or 1), each of which is connected to a fixed weight; a bias term is also included. The output of the model is obtained by multiplying the inputs with the weights, summing them together with the bias, and passing the result through a threshold activation function. The output is also in binary form.

An example of the TPU model is shown in Figure 2.1. The computation of the

TPU is as follows. For a sample input data vector x = [x_1, x_2, ..., x_n], the output of the model is

y = g(x) = g\left( \sum_{i=1}^{n} w_i x_i + b \right)    (2.1)

where w_i is the weight connecting the ith input, b is the threshold term and g(x) is the threshold activation function. In the TPU, the weights are decimal numbers which rank the relative importance of each input, and the threshold is a small value that has the effect of applying an affine transformation to the output y. The threshold activation function, g(x), is defined as follows:


g(x) = \begin{cases} 1 & \text{if } x \geq 0 \\ 0 & \text{if } x < 0 \end{cases}    (2.2)

Figure 2.1: An example of a Threshold Processing Unit.
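As a small illustration of (2.1) and (2.2), the Python sketch below implements a threshold processing unit with NumPy. The weights, bias and the AND-style example inputs are hypothetical values chosen only for demonstration; they do not come from the thesis.

```python
import numpy as np

def tpu_output(x, w, b):
    """McCulloch-Pitts threshold processing unit: binary output per (2.1)-(2.2)."""
    a = np.dot(w, x) + b           # weighted sum of the inputs plus the bias
    return 1 if a >= 0 else 0      # threshold activation g(.)

# Example: a unit that fires only when both binary inputs are 1 (logical AND)
print(tpu_output(np.array([1, 1]), w=np.array([1.0, 1.0]), b=-1.5))  # -> 1
print(tpu_output(np.array([1, 0]), w=np.array([1.0, 1.0]), b=-1.5))  # -> 0
```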

2.1.2 Rosenblatt’s Perceptron

The first perceptron was developed by Frank Rosenblatt [25-26] in 1958. It was the first model proposed for learning with a teacher (supervised learning). It is the simplest form of a neural network used for classification of patterns that are linearly separable. The perceptron resembles the structure of a TPU: it has a single neuron, weights and a bias. Unlike the TPU, the perceptron has adjustable synaptic weights. In order to tune the synaptic weights, an error-correction rule known as the perceptron convergence algorithm was developed. The synaptic weights are adjusted on an iteration-by-iteration basis.

In the perceptron model, the input vector is defined as

In the perceptron model, the input vector is defined as x(n) = [+1, x_1(n), x_2(n), ..., x_m(n)]^T, where the fixed input "+1" corresponds to the bias term b, and n denotes the time-step in applying the algorithm. Correspondingly, the weight vector is defined as w(n) = [b, w_1(n), w_2(n), ..., w_m(n)]^T. Then, the linear combiner output of the neuron can be written in the compact form

y(n) = \sum_{i=0}^{m} w_i(n) x_i(n) = w^T(n) x(n)    (2.3)

In order for the perceptron to function properly, the two classes C_1 and C_2 must be linearly separable. This means that the patterns must be sufficiently separated from each other so that the decision surface consists of a hyperplane. Equivalently, there exists a weight vector w such that

w^T x > 0  for every input vector x belonging to class C_1
w^T x ≤ 0  for every input vector x belonging to class C_2    (2.4)

The error-correction algorithm for adapting the weights in the perceptron can be summarized as follows [26]:

1. Initialization. Set w(0) = 0. Then, perform the following computations for time- step n = 1,2,…

2. Activation. At time-step n, activate the perceptron by applying the continuous-valued input vector x(n) and desired response d(n).

3. Computation of actual response. Compute the actual response y(n) of the perceptron as

y(n) = \mathrm{sgn}[w^T(n) x(n)]    (2.5)

where sgn(·) is the signum function.

4. Adaptation of weight vector. Update the weight vector of the perceptron to obtain

w(n+1) = w(n) + η [d(n) − y(n)] x(n)    (2.6)


where

d(n) = \begin{cases} +1 & \text{if } x(n) \text{ belongs to class } C_1 \\ -1 & \text{if } x(n) \text{ belongs to class } C_2 \end{cases}    (2.7)

η is the learning-rate parameter, a positive constant limited to the range 0 < η ≤ 1, and the difference d(n) – y(n) plays the role of the error signal.

5. Continuation. Increase time-step n by one and go back to step 2.
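The five steps above can be collected into a short training routine. The following Python sketch is one possible reading of the perceptron convergence algorithm; the toy data, the learning rate and the fixed number of epochs are assumptions made for illustration (the original algorithm simply continues until the weights converge).

```python
import numpy as np

def train_perceptron(X, d, eta=0.5, epochs=20):
    """Perceptron convergence algorithm of Section 2.1.2.

    X : (N, m) array of input patterns; d : (N,) array of labels in {+1, -1}.
    Returns the weight vector w = [b, w_1, ..., w_m].
    """
    Xa = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend the fixed input +1 (bias)
    w = np.zeros(Xa.shape[1])                       # step 1: w(0) = 0
    for _ in range(epochs):
        for x_n, d_n in zip(Xa, d):                 # steps 2-5: sweep over time-steps
            y_n = 1.0 if w @ x_n >= 0 else -1.0     # step 3: y(n) = sgn(w^T(n) x(n))
            w += eta * (d_n - y_n) * x_n            # step 4: error-correction update (2.6)
    return w

# Toy linearly separable example (hypothetical data)
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
d = np.array([1, 1, -1, -1])
print(train_perceptron(X, d))
```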

2.1.3 Multi-Layer Perceptron

The multi-layer perceptron (MLP) [22] extends the capability of Rosenblatt's perceptron from classifying linearly separable patterns to solving non-linear problems. The typical structure of an MLP consists of one or more hidden layers in between the input layer and output layer, forming a cascade structure of perceptrons. The input signals are propagated through the network in a forward direction, on a layer-by-layer basis. The MLP is also known as the multi-layer feedforward neural network (FNN) due to the direction of the propagation of signals.

The structure of an MLP can be of any size. The simplest MLP is the single-hidden layer feedforward neural network (SLFN), whose structure consists of only one hidden layer besides the input and output layers.

2.1.3.1 Single-Hidden Layer Feedforward Neural Network

A single-hidden layer feedforward neural network has three network layers: the input layer, the hidden layer and the output layer. The input layer consists of sensory units called input neurons that receive activation signals from an external source and then supply the respective elements of the activation pattern (input vector) to the neurons in the second layer, i.e. the hidden layer. The hidden layer consists of computational nodes called hidden neurons and serves to intervene between the input layer and the output layer. It acts as a pre-processor that receives the input pattern from the input layer and projects it

into the feature space in order for the features to be more easily separated. Finally, the output layer, which consists of computational nodes that are called output neurons, receives the pre-processed pattern from the hidden layer and performs further computation to produce a set of output signals that constitutes the overall response of the neural network to the set of activation patterns supplied by the input neurons. It is to be noted that the input neurons are non-computational nodes. They simply receive activation signals and supply them to the hidden layer for computation.

The network structure of a single-hidden layer feedforward neural network is shown in Figure 2.2. In this neural network, there are n input neurons, Ñ hidden neurons and m output neurons. The analytic function corresponding to the SLFN in Figure 2.2 can be written as follows. The output of the jth hidden neuron is obtained by first forming a weighted linear combination of all n input values and adding a bias to give

a_j = \sum_{i=1}^{n} w_{ji} x_i + b_j    (2.8)

for j = 1,2,3,…, Ñ with wji the weight connecting the ith input to the jth hidden neuron and bj the bias term for the jth hidden neuron.

Figure 2.2: A single-hidden layer feedforward neural network.


Then, the linear sum in (2.8) is transformed using a non-linear activation function g(x) to give the activated output

y_j = g(a_j).    (2.9)

The final outputs of the neural network are obtained by transforming the activations of the hidden neurons using a second layer of processing elements, i.e. the output neurons in the output layer. Thus, for each output neuron k, a linear combination of the outputs of the hidden neurons is formed to give

a_k = \sum_{j=1}^{Ñ} β_{kj} y_j + b_k    (2.10)

for k = 1,2,3,…,m where βkj is the output weight connecting the jth hidden neuron to the kth output and bk the bias term for the kth output.

Similarly, an activation function is applied to the linear sum in (2.10) to give the final output

o_k = g̃(a_k).    (2.11)

The notation g̃(x) is used to emphasize that the activation function for the output layer need not be the same as the activation function for the hidden layer. Often, the activation function for the output neurons is different from that for the hidden neurons because the output neurons perform a different role than the hidden neurons. In most cases, instead of a non-linear function, a linear activation function is used for the output layer.
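A minimal sketch of the forward computation in (2.8)-(2.11) is given below, assuming a tanh activation for the hidden layer and a linear output layer; the dimensions, random values and the activation choice are illustrative assumptions rather than settings used in the thesis.

```python
import numpy as np

def slfn_forward(x, W, b_hidden, beta, b_out, g=np.tanh):
    """Forward pass of the SLFN in Figure 2.2, following (2.8)-(2.11).

    x        : (n,) input vector
    W        : (N_hidden, n) input weights, row j holds w_j
    b_hidden : (N_hidden,) hidden biases b_j
    beta     : (m, N_hidden) output weights
    b_out    : (m,) output biases b_k
    g        : hidden-layer activation (tanh here; the output layer is kept linear)
    """
    a = W @ x + b_hidden        # (2.8): weighted sums of the hidden neurons
    y = g(a)                    # (2.9): activated hidden outputs
    o = beta @ y + b_out        # (2.10)-(2.11) with a linear output activation
    return o

# Tiny usage example with arbitrary sizes (n = 4 inputs, 6 hidden neurons, m = 3 outputs)
rng = np.random.default_rng(0)
o = slfn_forward(rng.normal(size=4), rng.normal(size=(6, 4)), rng.normal(size=6),
                 rng.normal(size=(3, 6)), rng.normal(size=3))
print(o.shape)  # (3,)
```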

2.1.4 Learning Algorithms

The functionality of a multi-layer perceptron is its capability in learning a suitable mapping from a given data set. The efficiency of the MLP mainly depends on the learning algorithm. The learning algorithm determines the ideal adjustments and settings to the parameters of the MLP. There are two broad categories of learning algorithm: supervised learning and unsupervised learning.


Supervised learning is also known as active learning [22], where an external "teacher" is supplied. The role of the teacher is to provide the desired or targeted response for a training vector in order for the network to learn a good mapping of the input-output patterns. The desired response represents the optimum action to be performed by the neural network. The network parameters are then adjusted under the combined influence of the training vector and the error signal (the error signal is defined as the difference between the actual response of the network and the desired response). The adjustment continues iteratively in a step-by-step fashion with the aim that the neural network will eventually emulate the teacher.

Unsupervised learning, also known as self-organized learning [22], is the opposite of supervised learning. There is no teacher present in the learning. Rather, the parameters of the network are optimized with respect to a task-independent measure of the input. An internal representation of the input is formed without influence from any external source.

The techniques employed in this thesis are supervised learning techniques. Hence, discussions on unsupervised learning will not be included.

2.1.4.1 Gradient Descent-Based Algorithms

The most popular, also one of the simplest learning algorithms for the MLPs is the gradient descent method (also known as steepest descent). In gradient descent method, the network learning process starts with an initial random weight vector. The weight vector is then iteratively updated in steps such that, at each step, it moves a short distance in the direction of the negative gradient (i.e. the greatest rate of decrease) of the error surface. At each successive step the value of the error function, E, will decrease, eventually leading to a weight vector at which

∇E = 0 . (2.12)

The error function, typically the mean sum of squares, is defined as


E = \frac{1}{m} \sum_{k=1}^{m} (o_k − t_k)^2    (2.13)

where m is the number of outputs, ok is the actual neural network response of the kth output neuron in (2.10) and (2.11) and tk is the corresponding target for a particular input pattern xn.

In order to reduce the error value, E, the network weights are updated as follows:

w_{ji}^{new} = w_{ji}^{old} + Δw_{ji}    (2.14)

where wji is the weight connecting the ith input to the jth hidden neuron and

Δw_{ji} = −η \frac{\partial E}{\partial w_{ji}}    (2.15)

η is the learning rate parameter for the gradient descent algorithm. The output weights are updated using the analogous expressions.
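For concreteness, the sketch below performs one gradient-descent update of the output weights of an SLFN with linear output neurons, using the error function (2.13) and the update rules (2.14)-(2.15). Restricting the update to the output layer, and the learning rate value, are simplifying assumptions for illustration; a full implementation would also backpropagate the error to the input weights.

```python
import numpy as np

def output_weight_step(y_hidden, beta, b_out, t, eta=0.01):
    """One gradient-descent update of the output weights per (2.14)-(2.15).

    For linear output neurons and E = (1/m) * sum_k (o_k - t_k)^2 as in (2.13),
    the gradient is dE/d(beta_kj) = (2/m) * (o_k - t_k) * y_j.
    """
    m = t.shape[0]
    o = beta @ y_hidden + b_out                      # actual outputs, (2.10) with linear output
    err = o - t                                      # o_k - t_k
    grad_beta = (2.0 / m) * np.outer(err, y_hidden)  # dE/d(beta_kj)
    grad_b = (2.0 / m) * err                         # dE/d(b_k)
    return beta - eta * grad_beta, b_out - eta * grad_b
```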

The main advantage of gradient descent-based methods is the relatively simple computation of the algorithm. However, although the optimization always arrives at a minimum, depending on the initial starting point this minimum might be a local minimum instead of the global minimum. Unfortunately, once the algorithm converges to a minimum, there is no further way to decrease the error value and the optimization process has to be restarted. Hence, the solution obtained will often be non-optimal.

The other problem with gradient descent-based methods is the long learning time. As the optimization process of the gradient descent algorithms is performed in an iterative, step-by-step manner, they require a long learning time. In addition, the number of parameters that need tuning also leads to a time-consuming process.

2.1.4.2 Discriminant-Based Algorithms

The discriminant-based algorithms are fairly different from the gradient descent algorithms. Unlike the gradient descent algorithms which approximate the parameters


for the feature probability distribution, the discriminant-based algorithms focus on finding the discriminants that separate members of different classes by estimating these discriminants directly.

The support vector network, also known as the support vector machine (SVM) [27-28], is one of the most popular algorithms in this group. It is an alternative method proposed to overcome the problems of the gradient descent algorithms. Unlike the gradient descent algorithms, the SVM is a non-probabilistic binary linear classifier which utilizes Lagrange multipliers in its output weight optimization operation.

The SVM was initially designed to solve binary classification problems. The SVM works by non-linearly mapping the input vectors to a very high-dimensional feature space where a linear decision surface can then be constructed. Unlike the gradient descent algorithms, the non-linear mapping function of the SVM is decided based on a priori knowledge and the output layer decision surface is then computed using the optimization method.

One of the drawbacks of the SVM is the complexity of the optimization procedure and the high degrees of polynomial used for forming the decision surfaces. This leads to a considerably long learning time. The running times of the state-of-the-art SVM learning algorithms scale approximately quadratically with the number of training samples. In addition, as the SVM is designed for binary classification, in order to solve a multi-class classification problem the algorithm has to break down the single multi-class problem into multiple binary classification problems.

2.1.5 Extreme Learning Machine

The major bottlenecks of the gradient descent-based feedforward neural network, such as the one described in Section 2.1.4.1, are the very slow learning speed and the issue of converging to local minima. It has been shown in [29] and [30] that single-hidden layer feedforward neural networks with N hidden neurons and arbitrarily chosen input weights (weights connecting input layer to hidden layer) can learn N distinct observations with arbitrarily small error. This method has been proved to produce good


generalization performance and extremely fast learning speed on both artificial and real applications in [31]. It has also been further proved in [32] that SLFNs with arbitrarily assigned input weights and hidden layer biases and with almost any non-zero activation function are capable of universally approximating any continuous functions on any compact input sets.

The extreme learning machine (ELM), an emerging learning algorithm that utilizes the structure of a single-hidden layer feedforward neural network, has been proved to overcome the limitations of both the gradient descent-based algorithms and the support vector machine through its technique of parameter assignment, and has been proved to outperform both algorithms [33].

Unlike conventional gradient descent-based algorithms, the ELM randomly assigns the input weights and hidden layer biases and deterministically computes the optimal output weights using the generalized inverse of the hidden layer outputs. Hence, the ELM's learning speed can be many times faster than that of conventional gradient descent-based algorithms while obtaining better performance. In addition, the generalized inverse operation allows the ELM to reach the smallest training error and the smallest norm of weights.

The ELM uses the network structure as shown in Figure 2.2, i.e. a single-hidden layer feedforward neural network. For a dataset with N distinct samples {(X, T) | X = [x_1, x_2, …, x_N], T = [t_1, t_2, …, t_N]}, where x_i = [x_i1, x_i2, …, x_in]^T ∈ R^n is the input vector and t_i = [t_i1, t_i2, …, t_im]^T ∈ R^m is the target vector, the SLFN with Ñ hidden neurons can be written as

\sum_{j=1}^{Ñ} β_j g(w_j \cdot x_i + b_j) = o_i    (2.16)

for i = 1,2,…,N, where β_j = [β_j1, β_j2, …, β_jm]^T is the output weight vector connecting the jth hidden neuron and the output neurons, w_j = [w_j1, w_j2, …, w_jn]^T is the input weight vector connecting the input neurons and the jth hidden neuron, b_j is the bias of the jth hidden neuron, w_j · x_i denotes the inner product of w_j and x_i, g(x) is the activation function and o_i = [o_i1, o_i2, …, o_im]^T ∈ R^m is the output vector with respect to the input vector x_i = [x_i1, x_i2, …, x_in]^T. It is to be noted that the output neurons are linear, i.e. the activation function of the output neurons is a linear function.

For the SLFN with Ñ hidden neurons and activation function g(x) to approximate N data samples with zero error, there exist βj, wj and bj such that

\sum_{j=1}^{Ñ} β_j g(w_j \cdot x_i + b_j) = t_i    (2.17)

for i = 1,2,…,N. Equation (2.17) can then be written compactly in matrix form Hβ = T, where

H(w_1, …, w_Ñ, b_1, …, b_Ñ, x_1, …, x_N) =
\begin{bmatrix}
g(w_1 \cdot x_1 + b_1) & g(w_2 \cdot x_1 + b_2) & \cdots & g(w_Ñ \cdot x_1 + b_Ñ) \\
g(w_1 \cdot x_2 + b_1) & g(w_2 \cdot x_2 + b_2) & \cdots & g(w_Ñ \cdot x_2 + b_Ñ) \\
\vdots & \vdots & \ddots & \vdots \\
g(w_1 \cdot x_N + b_1) & g(w_2 \cdot x_N + b_2) & \cdots & g(w_Ñ \cdot x_N + b_Ñ)
\end{bmatrix}_{N \times Ñ},    (2.18)

β = \begin{bmatrix} β_1^T \\ β_2^T \\ \vdots \\ β_Ñ^T \end{bmatrix}_{Ñ \times m}  and    (2.19)

T = \begin{bmatrix} t_1^T \\ t_2^T \\ \vdots \\ t_N^T \end{bmatrix}_{N \times m}.    (2.20)

The output weight matrix, β, of the SLFN is then computed as follows:

β = (H^T H)^{-1} H^T T.    (2.21)
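The training procedure described above reduces to a few lines of linear algebra. The following Python/NumPy sketch assumes a sigmoid hidden activation and uses the Moore-Penrose pseudo-inverse for the generalized inverse in (2.21); the uniform weight-initialization range is an assumption made for illustration, not a setting taken from the thesis.

```python
import numpy as np

def elm_train(X, T, n_hidden, rng=np.random.default_rng(0)):
    """Train an ELM per Section 2.1.5: random input weights, analytic output weights.

    X : (N, n) input matrix; T : (N, m) target matrix. Returns (W, b, beta).
    """
    n = X.shape[1]
    W = rng.uniform(-1.0, 1.0, size=(n_hidden, n))   # random input weights w_j
    b = rng.uniform(-1.0, 1.0, size=n_hidden)        # random hidden biases b_j
    H = 1.0 / (1.0 + np.exp(-(X @ W.T + b)))         # hidden layer output matrix (2.18), sigmoid g
    beta = np.linalg.pinv(H) @ T                     # (2.21) via the generalized (pseudo-)inverse
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W.T + b)))
    return H @ beta                                  # linear output neurons
```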

2.1.5.1 Regularized Extreme Learning Machine

Although the ELM greatly improves upon the performance of conventional gradient descent-based algorithms, the design of the output layer weights in the ELM gives rise to an issue. As the output weights are determined through the generalized inverse of the hidden layer output matrix, this minimum norm least-squares solution is an empirical risk minimization (ERM) operation, which tends to result in an overfitting model, especially if the training set is not sufficiently large.

Deng, Zheng and Chen [34] proposed to overcome this drawback in the output weights by introducing a regularization term into the ELM algorithm. A weight factor, γ, for the empirical risk is inserted to regularize the proportion between the empirical risk, ‖ε‖², and the structural risk, ‖β‖². Their improved algorithm is called the regularized extreme learning machine (R-ELM).

In the R-ELM algorithm, the output weights are calculated by minimizing both the weighted sum of the output error squares and the sum of the output weights squares of the SLFN:

Minimize \left\{ \frac{1}{2} γ ‖ε‖^2 + \frac{1}{2} ‖β‖^2 \right\}    (2.22)

subject to  ε = O − T = Hβ − T.    (2.23)

The problem is solved by using the method of Lagrange multipliers:

L = \frac{γ}{2} \sum_{i=1}^{N} \sum_{j=1}^{m} ε_{ij}^2 + \frac{1}{2} \sum_{i=1}^{Ñ} \sum_{j=1}^{m} β_{ij}^2 − \sum_{k=1}^{N} \sum_{p=1}^{m} λ_{kp} \left( h_k β_p − T_{kp} − ε_{kp} \right)    (2.24)

where ε_ij is the ijth element of the error matrix ε, β_ij is the ijth element of the output weight matrix β, T_ij is the ijth element of the output data matrix T, h_i is the ith row of the hidden layer output matrix H, β_j is the jth column of the output weight matrix β,

λ_ij is the ijth Lagrange multiplier and γ is the constant parameter used to adjust the empirical risk. Differentiating L in (2.24) with respect to (β_ij, ε_ij) and setting the derivatives equal to zero gives

\frac{\partial L}{\partial β_{ij}} = 0 \;\rightarrow\; β = H^T λ  and    (2.25)


\frac{\partial L}{\partial ε_{ij}} = 0 \;\rightarrow\; λ = −γ ε.    (2.26)

Considering the constraint in (2.23), (2.26) can be expressed as

λ = −γ (Hβ − T).    (2.27)

Using (2.27) in (2.25) leads to the computation of the output weight matrix, β, of the SLFN:

β = \left( \frac{I}{γ} + H^T H \right)^{-1} H^T T.    (2.28)
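A corresponding sketch of the R-ELM output-weight computation in (2.28) is shown below; the value of γ is an arbitrary placeholder, and the hidden layer output matrix H is assumed to have already been computed as in the ELM sketch above.

```python
import numpy as np

def relm_output_weights(H, T, gamma=1000.0):
    """R-ELM output weights per (2.28): beta = (I/gamma + H^T H)^{-1} H^T T."""
    n_hidden = H.shape[1]
    A = np.eye(n_hidden) / gamma + H.T @ H   # regularized normal-equation matrix
    return np.linalg.solve(A, H.T @ T)       # solve instead of explicit inversion
```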

In this thesis, the application of the single-hidden layer feedforward neural network using the extreme learning machine technique as the folk song classifier will be investigated. The discussion of this superior technique for folk song classification will be presented in Chapter 4 and Chapter 5.

2.2 Music Representations

As computer technology has become more sophisticated and accessible, interest in involving machines in music classification has flourished. Automatic music classification consists of using a machine to obtain useful features from music and using these features to identify which of a set of classes a new piece of music most likely belongs to.

The two main formats of digital representation of music are the audio format and the symbolic format. In the audio format, music is represented in the form of raw audio signals. There is no explicit information about the music notes, voicing and phrasing, nor any musical symbols and tags. WAV and MP3 files are the most commonly used audio representations [35]. The symbolic format, on the other hand, uses symbols and notations with direct musical meaning to model the visual aspects of a music score, and audio information or annotations related to the music piece. Symbolic representations contain information about what and how a music piece is to be played. Some commonly used


symbolic representations are MIDI, Humdrum, abc and MusicXML [36]. In music classification, the choice of the format usually depends on the availability of the data samples.

2.2.1 Audio Format

In music classification using audio data, features used to characterize each of the classes are constructed based on information directly derived from audio signal properties. These features are usually referred to as low-level features, which do not provide direct and precise information regarding the musical context and content. This information is usually obtained by performing feature extraction on a fixed-size segment of the audio signal called a window or frame. A window can contain audio samples ranging from a few milliseconds to seconds and sometimes even minutes. While most features extracted from the audio signal are based on short windows, some longer windows can be used if information on a large-scale structure is desired. A music audio signal is usually segmented into many overlapping windows in order to increase time localization. The distance between the starts of two overlapping windows is usually called the hop size. Although there is no fixed standard for the hop size, the common hop size applied in music classification is half the size of the analysis window.

In the following, a summarized list of common audio features employed in [18,37-59] is briefly discussed. The first section of the discussion focuses on features extracted in the time domain. The second section discusses features that are extracted in the frequency domain using the discrete Fourier transform (DFT) technique. In order to allow the combination of features from both the time domain and the frequency domain, the size of the analysis window in both cases is usually made the same. Nonetheless, it is not compulsory to do so. Finally, in the third section, two high-level features that can be extracted from the audio signal are discussed.


2.2.1.1 Common Audio Features Extracted in Time Domain

Features derived from the audio signal in the time domain are usually calculated directly from the sequence of samples. Whilst most features described in this section are extracted from individual short windows, usually with a time scale ranging from 10 milliseconds to 40 milliseconds (with the purpose of being consistent with the window size applied to feature extraction in the frequency domain), some features are calculated based on a collection of consecutive short windows in order to capture the pattern of the signal changes over time. To define the features mathematically, first let the music signal be denoted as x and the tth analysis window, constructed using N samples at a time from the music signal x with hop size h, be denoted as x_t; we then have

x_t[n] = \begin{cases} x[n + (t−1)h] & 0 ≤ n ≤ N−1 \\ 0 & \text{otherwise} \end{cases}    (2.29)

For non-overlapping analysis window, the hop size h is equivalent to the window size N.

Root Mean Square (RMS)

The root mean square is a measure of the power in the music signal. It is often used as a loudness feature in audio based music classification. The RMS is defined as follows:

RMS_t = \sqrt{ \frac{1}{N} \sum_{n=0}^{N−1} x_t[n]^2 }.    (2.30)

Fraction of Low Energy Windows

The fraction of low energy windows is a measure of the fraction of analysis windows, within a set of consecutive windows, that have a root mean square value below some threshold value. The common calculation of the low energy fraction uses the average RMS of the set of windows under consideration as the threshold value. This feature gives an indication of the fraction of silence or near silence in the segment of signal under consideration. Therefore, music with little silence, for example music with high instrumental activity, will have a low fraction of low energy windows.


Zero-Crossing (ZC)

The zero-crossing is the number of times the waveform changes sign within a given music frame of length N. In other words, it is the number of times a signal passes the zero midpoint of the signal range. It is used as an indication of noisiness, as signals with no DC component will tend to cross the midpoint more often. The zero-crossing count is highly correlated with the spectral centroid of clean (non-noisy) signals. The zero-crossing count is computed as:

ZC_t = \sum_{n=0}^{N−1} \left| \mathrm{sign}(x_t[n]) − \mathrm{sign}(x_t[n−1]) \right|    (2.31)

where

\mathrm{sign}(x_t[n]) = \begin{cases} 1 & x_t[n] ≥ 0 \\ 0 & x_t[n] < 0 \end{cases}, \quad 0 ≤ n ≤ N−1.    (2.32)
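The time-domain features above can be computed directly from the framed signal. The sketch below follows (2.29)-(2.32); the use of the set-average RMS as the low-energy threshold follows the text, while the array layout and the assumption that the signal is at least one frame long are implementation choices.

```python
import numpy as np

def frame_signal(x, N, h):
    """Split signal x into analysis windows of length N with hop size h, cf. (2.29).

    Assumes len(x) >= N; returns an array of shape (n_frames, N)."""
    n_frames = 1 + (len(x) - N) // h
    return np.stack([x[t * h : t * h + N] for t in range(n_frames)])

def rms(frames):
    """Root mean square per frame, (2.30)."""
    return np.sqrt(np.mean(frames ** 2, axis=1))

def low_energy_fraction(frames):
    """Fraction of frames whose RMS falls below the average RMS of the set."""
    r = rms(frames)
    return np.mean(r < r.mean())

def zero_crossings(frames):
    """Zero-crossing count per frame, (2.31)-(2.32): number of sign changes."""
    s = (frames >= 0).astype(int)                    # sign() as defined in (2.32)
    return np.sum(np.abs(np.diff(s, axis=1)), axis=1)
```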

Linear Predictive Coding (LPC)

The linear predictive coding is a method initially developed to analyze and encode human speech signals. The LPC works by first estimating the formants (spectral bands corresponding to the resonance frequencies in the human vocal tract), then performing inverse filtering to remove the effects of these formants from the speech signal. It then estimates the intensity and frequency of the residue (the remaining signal after the subtraction). The result of these steps is a vector of values that describe the intensity and frequency of the residue, the formants and the residue signal. This vector can be used to recreate speech: the intensity and frequency of the residue and the residue signal can be used to create the source signal, and the formants can be used to create a filter. Speech is produced by running the source signal through the filter. A detailed explanation of the LPC can be found in [60].

The most important aspect of the LPC is that it allows a music sample to be approximated as a linear combination of previous samples. The unique set of predictor coefficients is determined by minimizing the sum of the squared differences between


the actual signal and the predicted signal. Different approaches can be used for the minimization such as autocorrelation method, covariance method and lattice method. One common application of the LPC in music is for identifying instrument types.

2.2.1.2 Common Audio Features Extracted in Frequency Domain

In order to extract features from a music audio signal in the frequency domain, the audio signal is first segmented into overlapping, very short analysis frames on a time scale between 10 milliseconds and 40 milliseconds, over which the signal is considered stationary. The overlap step size is usually within the range of 5 milliseconds to 20 milliseconds. Each of these analysis frames is then multiplied with a windowing function. The windowing function preserves the continuity of the first and last samples in an analysis frame and reduces the problem of spectral leakage, which refers to power being assigned to frequency components that are not actually present in the signal being analyzed. There are many windowing functions, but the most commonly used in music classification are the Hamming window and the Hann window. If the music signal in the tth frame is denoted as

x_t[n] = \begin{cases} x[n + (t-1)h], & 0 \le n \le N-1 \\ 0, & \text{otherwise} \end{cases} \qquad (2.33)

where h is the hop size and N is the number of samples within a frame, then the signal after applying the windowing function is

x_t^w[n] = x_t[n] \times w[n] \qquad (2.34)

where, for the Hamming window,

w[n] = 0.54 - 0.46 \cos\!\left( \frac{2\pi n}{N-1} \right), \quad 0 \le n \le N-1 \qquad (2.35)

and for the Hann window,

w[n] = 0.5 \left( 1 - \cos\!\left( \frac{2\pi n}{N-1} \right) \right), \quad 0 \le n \le N-1. \qquad (2.36)


Finally, after applying the windowing function, the fast Fourier transform (FFT), an optimized implementation of the DFT, is performed on each analysis frame to obtain the magnitude frequency response. A detailed discussion of the Fourier transform can be found in [61]. In the following, M_t[n] is the magnitude spectrum of the Fourier transform at frequency bin n, out of N bins, for Fourier analysis frame t.
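A sketch combining (2.34)-(2.36) with the FFT step, assuming NumPy; keeping only the non-negative frequency bins returned by rfft is an implementation detail rather than part of the definitions.

import numpy as np

def magnitude_spectrum(frame, window="hamming"):
    N = len(frame)
    n = np.arange(N)
    if window == "hamming":
        w = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))   # eq. 2.35
    else:
        w = 0.5 * (1 - np.cos(2 * np.pi * n / (N - 1)))      # eq. 2.36 (Hann)
    return np.abs(np.fft.rfft(frame * w))                    # M_t[n]: eq. 2.34 followed by the FFT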

Spectral Centroid (SC)

The spectral centroid (SC) is a measure of the spectral brightness of a music signal. Higher centroid values correspond to “brighter” textures with more high frequencies. The spectral centroid is usually used to characterize the timbre of musical instruments. It is defined as the center of gravity of the magnitude spectrum of the Fourier transform:

SC_t = \frac{\sum_{n=0}^{N-1} M_t[n] \times n}{\sum_{n=0}^{N-1} M_t[n]}. \qquad (2.37)
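A direct transcription of (2.37), assuming M is the magnitude spectrum of one analysis frame:

import numpy as np

def spectral_centroid(M):
    # Centre of gravity of the magnitude spectrum (eq. 2.37).
    n = np.arange(len(M))
    return np.sum(M * n) / np.sum(M)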

Spectral Roll-off (SR)

The spectral roll-off measures the spectral shape and indicates how much of the energy is concentrated in the lower frequencies. It is usually used in speech analysis to differentiate between voiced and unvoiced speech. In music analysis, it is used as a feature to characterize the timbre of musical instruments. It is defined as the frequency value below which resides the 85% (can be any number but 85% is the typical value) of the magnitude distribution:

\sum_{n=0}^{SR_t} M_t[n] = 0.85 \times \sum_{n=0}^{N-1} M_t[n]. \qquad (2.38)
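A sketch of (2.38) that returns the roll-off bin index; the cumulative-sum search is an implementation choice under the assumption that the smallest bin satisfying the condition is wanted.

import numpy as np

def spectral_rolloff(M, fraction=0.85):
    # Smallest bin SR_t such that the magnitudes up to SR_t reach `fraction` of the total (eq. 2.38).
    return int(np.searchsorted(np.cumsum(M), fraction * np.sum(M)))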


Spectral Flux (SF)

The spectral flux measures the amount of local spectral change in the signal. It is computed by calculating the change in the normalized magnitude spectrum, Nt[n], between successive frames:

SF_t = \sum_{n=0}^{N-1} \left( N_t[n] - N_{t-1}[n] \right)^2. \qquad (2.39)

The spectral flux is another feature used for characterizing timbre of musical instruments.
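A sketch of (2.39), assuming the magnitude spectra of two successive frames are normalized by their sums (one common choice of normalization, introduced here as an assumption):

import numpy as np

def spectral_flux(M_curr, M_prev):
    # Squared change between the normalized magnitude spectra of successive frames (eq. 2.39).
    N_curr = M_curr / np.sum(M_curr)
    N_prev = M_prev / np.sum(M_prev)
    return np.sum((N_curr - N_prev) ** 2)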

Mel-Frequency Cepstral Coefficients (MFCC)

The Mel-frequency cepstral coefficients are among the most widely used features in both speech recognition and audio-based music classification. The MFCC takes into account human perception sensitivity with respect to frequencies. It is computed as follows: (1) take the log-amplitude of the magnitude spectrum; (2) group and smooth the frequency bins according to the perceptually motivated Mel-frequency scaling; (3) apply the discrete cosine transform to de-correlate the resulting feature vectors. Typically, 13 coefficients are used in speech analysis. Tzanetakis and Cook [38] found that the first five coefficients provide the best performance in music genre classification. Further details of the MFCC are presented in [62]. The Mel-frequency cepstral coefficients are also a feature for timbre-based representation.
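In practice the MFCC computation is usually delegated to a library. A hedged sketch using librosa (assumed to be installed; the file name is hypothetical, and averaging over frames is only one possible summary):

import librosa

y, sr = librosa.load("folk_song.wav", sr=None, mono=True)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # 13 coefficients per analysis frame
mfcc_per_song = mfcc.mean(axis=1)                    # one common way to summarise the frame-level values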

2.2.1.3 High-Level Features Extracted from Audio

Although low-level features are useful, in most cases, they are not representative in other applications where high-level music features such as pitch and rhythmic patterns are required. Extracting high-level information from audio signals is less straightforward and less accurate than from symbolic musical data. Nonetheless, under the assumption that imperfections in the extracted information can be averaged out in broad high-level representations, it is possible to derive some useful high-level


information from audio signals. The two main high-level representations that can be constructed from audio signals are the pitch histogram and the beat histogram.

Pitch Histograms

Tzanetakis and Cook [38] proposed a technique for deriving pitch information from sound signal through constructing a variety of different pitch histograms. The pitch content feature detection algorithm employed to construct the pitch histograms is based on the multi-pitch detection algorithm described by Tolonen and Karjalainen [63]. In their algorithm, the sound signal is first decomposed into two frequency bands: below 1000 Hz and above 1000Hz. Amplitude envelopes are then extracted for each frequency band. The envelope extraction is performed by applying half-wave rectification and low-pass filtering on the signals.

The extracted envelopes are summed and an enhanced autocorrelation function called the summary enhanced autocorrelation function (SACF) is then computed in order to reduce the effect of the integer multiples of the peak frequencies to the multiple pitch detection. The prominent peaks of the SACF are treated as the main pitches of a corresponding short segment of sound signals. The three dominant peaks of the SACF are then accumulated into a pitch histogram (PH) over the entire sound file.

Next, the frequencies corresponding to each histogram peak are converted to musical pitches such that each bin of the PH corresponds to a musical note with a specific pitch; for example, the musical note A4 is equivalent to 440 Hz. The musical pitches are labeled using the MIDI note numbering scheme, where the conversion from frequency to MIDI note number can be performed using the following equation:

n = 12 \log_2\!\left( \frac{f}{440} \right) + 69 \qquad (2.40)

where f is the frequency in Hertz and n is the histogram bin (MIDI note number).
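A sketch of (2.40) together with the accumulation of SACF peaks into an unfolded pitch histogram; the peak lists are assumed to come from the multi-pitch detection stage, and rounding to the nearest integer bin is an implementation choice.

import numpy as np

def frequency_to_midi(f):
    # Eq. 2.40: frequency in Hz to MIDI note number (440 Hz maps to 69, i.e. A4).
    return int(round(12 * np.log2(f / 440.0) + 69))

def unfolded_pitch_histogram(peak_frequencies, peak_amplitudes):
    # Accumulate the amplitude of each detected SACF peak into its MIDI-pitch bin.
    uph = np.zeros(128)
    for f, amplitude in zip(peak_frequencies, peak_amplitudes):
        bin_index = frequency_to_midi(f)
        if 0 <= bin_index < 128:
            uph[bin_index] += amplitude
    return uph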

There are two versions of pitch histogram proposed in [38]: the unfolded pitch histogram (UPH) and the folded pitch histogram (FPH). The UPH is constructed using

(2.40). The FPH method discards the octave information of a note and groups notes according to pitch classes. In the FPH, the octave information of all notes is normalized to a single octave using the mapping equation:

c = n \bmod 12 \qquad (2.41)

where c is the folded histogram bin (i.e. the pitch class or chroma value) and n is the unfolded histogram bin (MIDI note number).

The main difference between the UPH and the FPH is that the unfolded pitch histogram contains information about the pitch range of a musical piece while the folded pitch histogram contains information regarding the pitch classes or harmonic content of the music. The FPH method is similar to the chroma-based representation employed in [64] for audio thumbnailing. Detailed explanations of the chroma and height dimensions of musical pitch can be found in [65], and the relation of musical scales to frequency is discussed in [66].

A variant of the FPH called the circle of fifths histogram is designed such that adjacent histogram bins are spaced a fifth apart rather than a semitone apart as in the original FPH. The authors [38] believed that the distances between adjacent bins in this variant are better suited for expressing tonal music relations (tonic-dominant) and that the extracted features result in better classification accuracy. The mapping from the original FPH to the new circle of fifths histogram can be achieved by

c' = (7 \times c) \bmod 12 \qquad (2.42)

where c' is the new circle of fifths histogram bin after the mapping and c is the original folded histogram bin. The number '7' corresponds to the seven semitones of the musical interval of a fifth.
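The two mappings (2.41) and (2.42) can be sketched as follows, assuming the 128-bin unfolded histogram from the previous sketch:

import numpy as np

def folded_pitch_histogram(uph):
    # Eq. 2.41: collapse MIDI note numbers onto the 12 pitch classes (c = n mod 12).
    fph = np.zeros(12)
    for n, value in enumerate(uph):
        fph[n % 12] += value
    return fph

def circle_of_fifths_histogram(fph):
    # Eq. 2.42: reorder the folded bins so that adjacent bins are a fifth (7 semitones) apart.
    cfh = np.zeros(12)
    for c, value in enumerate(fph):
        cfh[(7 * c) % 12] += value
    return cfh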

There are many useful features that can be calculated from the pitch histograms. For example, the difference between the lowest pitch and the highest pitch in a pitch histogram can indicate the pitch range. The bin label of the pitch class histogram with the highest amplitude may indicate the primary key of the piece, or at least the dominant.


The interval between the two strongest pitches of the folded pitch class histogram can give an indication of the centrality of tonality in a piece.

Beat Histograms

A common automatic beat detector structure consists of signal decomposition into frequency bands using a filterbank, followed by an envelope extraction step, and finally a periodicity detection algorithm to detect the lags at which the signal's envelope is most similar to itself. The process of beat detection is similar to pitch detection except that it is performed on a larger time scale (approximately 0.5 seconds to 1.5 seconds for beat detection compared to 2 milliseconds to 50 milliseconds for pitch).

The concept of constructing a histogram of time intervals between note onsets, which gives some overall information about the rhythmic patterns in the signal as a whole, was first promoted by Tzanetakis and Cook [38]. In their approach, the features used to represent the rhythmic structure of a piece of music are based on the most salient periodicities of the sound signal. Figure 2.3 shows the flow diagram of the construction of a beat histogram (BH) [38]. To construct the beat histogram, the sound signal is first decomposed into a number of octave frequency bands using the discrete wavelet transform. Then, the time domain amplitude envelope of each band is extracted by applying full-wave rectification, low-pass filtering and downsampling to each octave frequency band, followed by a mean removal. These envelopes are then summed together and the autocorrelation of the sum is computed. The dominant peaks of the autocorrelation function each correspond to one of the various periodicities of the signal's envelope. The peaks obtained are then accumulated over the whole sound file to build the beat histogram. Each histogram bin in the BH corresponds to a peak lag, i.e. the beat period in beats-per-minute (bpm). When compiling the beat histogram, instead of adding one to a bin, the amplitude of each peak is added to the bin. Using this method, if the signal is very similar to itself (strong beat), the histogram peaks will be higher.

The equations used in each step of the beat analysis algorithm [38] are listed below. In the equations, x is the sound signal and n = 1, 2, …, N, where N is the total number of samples in the signal.


Figure 2.3: The flow diagram of the construction of a beat histogram.

Full wave rectification

y[n] = \left| x[n] \right| \qquad (2.43)

Full wave rectification is applied to extract the temporal envelope of the sound signal rather than the time domain signal.

Low pass filtering

y[n] = (1 - \alpha)\, x[n] + \alpha\, y[n-1] \qquad (2.44)

is a one-pole filter with an alpha value (α) of 0.99. It is used to smooth the envelope.

Downsampling

y[n] = x[kn] \qquad (2.45)


where k = 16 is used in [38]. Due to the large periodicities of beat analysis, the objective of applying downsampling is to reduce computation time for the autocorrelation computation without affecting the performance of the algorithm.

Mean removal

y[n] = x[n] - E\big[x[n]\big] \qquad (2.46)

Mean removal is used to center the signal at zero for the autocorrelation stage.

Autocorrelation

y[\mathrm{lag}] = \frac{1}{N} \sum_{n=1}^{N} Y[n]\, Y[n - \mathrm{lag}] \qquad (2.47)

where lag is the number of samples of delay. The autocorrelation is calculated for all integer values of lag, subject to 0 ≤ lag < N. Y is the outcome of pre-processing the sound signal, which includes the full wave rectification, low-pass filtering, downsampling and mean removal.

Autocorrelation is a technique that involves comparing a signal with versions of itself delayed by successive intervals, which yields the relative strength of different periodicities within the signal. In music processing, autocorrelation allows one to find the relative strength of different rhythmic pulses.

The calculation of the autocorrelation results in a histogram where each bin corresponds to a different lag time. Since the sampling rate of the signal is known, the histogram can provide an indication of the relative importance of the time intervals that pass between strong peaks.
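A compact sketch of the envelope extraction chain (2.43)-(2.46) and the autocorrelation (2.47), assuming the octave-band signals have already been obtained (for instance from a wavelet decomposition) and all have the same length; SciPy's lfilter is used for the one-pole smoothing, and the function names are hypothetical.

import numpy as np
from scipy.signal import lfilter

def band_envelope(band, alpha=0.99, k=16):
    rectified = np.abs(band)                                  # eq. 2.43, full wave rectification
    smoothed = lfilter([1 - alpha], [1, -alpha], rectified)   # eq. 2.44, one-pole low-pass filter
    downsampled = smoothed[::k]                               # eq. 2.45, keep every k-th sample
    return downsampled - downsampled.mean()                   # eq. 2.46, mean removal

def envelope_autocorrelation(bands):
    Y = np.sum([band_envelope(b) for b in bands], axis=0)     # summed envelopes of all bands
    N = len(Y)
    # eq. 2.47: autocorrelation for every lag; its dominant peaks indicate candidate beat periods.
    return np.array([np.dot(Y[lag:], Y[: N - lag]) / N for lag in range(N)])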

High-level rhythmic information can be derived from a beat histogram. For example, the number of strong peaks can provide some measure of rhythmic sophistication. The periods of the highest peaks can provide good information about the tempo of an audio signal. The ratios between the highest peaks, in terms of both amplitude and period, can give metrical insights and an indication as to whether a signal


is likely polyrhythmic or not. The sum of the histogram as a whole can give an indication of beat strength. The proportional collective strength of low-level bins can give an indication of the degree of rubato or rhythmic looseness.

2.2.2 Symbolic Format

Musical information is represented in an essentially different way in symbolic musical file formats than in audio files. Unlike audio files, which store a digital approximation of the actual sound signals, symbolic files store higher-level notions about the music rather than a direct representation of the sound. For example, an audio file will store an approximation of the actual sound waves produced by a singer singing "Jasmine Flower" (a Jiangsu folk song), whereas a symbolic file will store information such as the pitch of each note sung by the singer, the instrument used to produce the sound (in this case, the human voice), and the duration of each note.

The symbolic representation of music can exist in many formats, for example, the physical forms of written or printed scores, holes punched in player piano rolls and keypunched cards, and of course the digital files of the modern age such as MIDI [67-68], Open Sound Control [69-71], GUIDO [72], Humdrum [73] and MusicXML [74-75]. A short overview of some commonly encountered symbolic music file formats is presented below. A good overview of symbolic formats (except MIDI) can be found in [36]. Dannenberg presented a useful survey on symbolic music representation in [76].

2.2.2.1 Symbolic Music File Formats

In general, the digital symbolic music file formats can be divided into three broad categories: (1) formats for communicating performance information between controllers, computers and synthesizers; (2) formats for representing musical scores and associated visual formatting information; and (3) formats for facilitating theoretical and musicological analysis [77].


Formats for communicating information between controllers, computers and synthesizers

The most well-known format in this group is MIDI – the Musical Instrument Digital Interface [68] format. MIDI is a technical standard which describes a set of protocols, digital interfaces and connectors that allow a wide variety of electronic musical instruments, computers and other related devices to connect and communicate with each other. Due to its popularity, a very large amount of music of many kinds is stored in this format. Consequently, a large portion of music classification research that employs symbolic music representation uses MIDI files as the research data set.

Open Sound Control (OSC), developed by Wright and Freed [69-71], is a successor to the MIDI format. It is a real-time, performance-oriented symbolic file format that is widely recognized as technically superior to its predecessor, the MIDI format. Some advantages of OSC include improved time resolution, explicit compatibility with modern networking technology and improved general flexibility.

Formats for representing musical scores and associated visual formatting information

The most commonly used file formats within this group are the file formats of the two leading score editing applications: Finale1 (.mus format) and Sibelius2 (.sib format). Both applications are commercial software and the details of their file format representations are not published. One needs to purchase the software in order to read or write files in these formats. This limitation greatly reduces the research value of these file formats. Nonetheless, there are some research-oriented formats that can be used for representing musical scores. Two of the better-known formats are GUIDO [72] and LilyPond [78]. Both of them are text-based formats.

MusicXML [74-75] is an XML-based file format for representing Western music notation. Although the format is proprietary, it can be freely used under a

1 www.finalemusic.com
2 www.sibelius.com


Public License. MusicXML has achieved relatively high popularity due to its adoption by a variety of commercial and non-commercial music notation programs such as Finale, Sibelius, MuseScore3, SmartScore4, Steinberg Cubase5 and Rosegarden6. MusicXML can serve well as an intermediate file format to transfer data between .mus and .sib files.

Formats intended for facilitating theoretical and musicological analysis

The most prominent file formats in this category are the formats associated with the Humdrum Toolkit [79]. Among them, the most popular and most general is the **kern format [80]. Some of the many Humdrum file formats are designed to represent more specialized music types, such as the **bhatk format for transcribing Hindustani music, the **hildegard format for the German manuscripts and the **koto format for the koto (a traditional Japanese stringed musical instrument similar to the Chinese zheng). Humdrum also facilitates translation to and from MIDI data.

2.2.2.2 Benefits of Using Symbolic Music File Formats

To date, more research in music classification has been performed using audio files than symbolic music files. This is largely due to the increase in commercial music information retrieval applications and users that are much more interested in processing audio files. In general, this body of research views music as a type of sound rather than considering its other contexts beyond the sound perspective. Nonetheless, some examples of automatic music classification using symbolic data can be seen in [81-91].

On the other hand, musicologists and music theorists usually prefer the symbolic musical representations. The main reason is that features extracted from audio data

3 .org 4 www.musitek.com 5 www.steinberg.net 6 www.rosegardenmusic.com


generally have little intuitive meaning to humans. For example, although features such as the zero-crossing rate and the Mel-frequency cepstral coefficients extracted over a sequence of audio windows are useful for automatic music classification, they are unlikely to give any insight or inspiration to music theorists. Conversely, features extracted from symbolic data, such as those related to the key and meter of a piece of music, are much more straightforward and meaningful to humans. These features often provide useful insights on music.

The main advantage of symbolic data is that it provides much more immediate and reliable access to musically meaningful information than audio data. Since the fundamental elements in symbolic files are usually a precise representation of musical notation while the fundamental elements in audio files are typically sound samples, it is much easier to extract high-level music information with high accuracy from symbolic files than from audio files.

In addition, some symbolic file formats such as MIDI are usually more compact than audio recordings. This makes storing, processing and transmitting much faster and easier. Furthermore, it is much easier to correct and edit symbolic files than audio recordings, and the correction can be made more accurately.

Existing optical music recognition techniques such as [92-94] and software such as SmartScore, OpenOMR7, SharpEye8 and Gamera9 allow printed or written scores to be processed into symbolic file formats from which music features can then be extracted. This is particularly useful in cases where an audio recording of a music score does not exist or is hard to obtain. From a musicological perspective, it is better to use features extracted from music scores than from audio recordings, as this eliminates potential performance biases and errors. This enables analysis to be based entirely on the artifact provided by the composer.

7 sourceforge.net/projects/openomr
8 www.visiv.co.uk
9 gamera.informatik.hsnr.de


2.2.2.3 Symbolic Features Extracted from MIDI

Among the digital formats of symbolic representation, MIDI is the most popular and widely used format in automatic music classification research, largely owing to its popularity and hence the availability of data. MIDI, short for Musical Instrument Digital Interface, is an encoding system used to represent, transfer and store musical information. Information is represented as sequences of instructions called MIDI messages. Each MIDI message corresponds to either an event or a change in a control parameter. The details of MIDI and its specifications are not covered in this thesis, but many books on MIDI are available; for example, [67] can be consulted for further details. Also, the official web site of the MIDI Manufacturers Association10 provides comprehensive information and documentation on MIDI. An extensive list of features extracted from MIDI can be grouped into seven groups [84]: pitch based, melody based, chord based, rhythm based, instrumentation based, musical texture based and dynamics based. The derived features can then be compiled into relevant feature vectors for music classification.

Pitch Based Features

Three pitch histograms are constructed in [84] based on the technique proposed by [37-38]. The first histogram is the basic pitch histogram, which consists of 128 bins, one for each MIDI pitch. The magnitude of each bin corresponds to the number of times a Note On event occurred at that particular pitch. This histogram gives an insight into the range and spread of notes in a music piece.

The second histogram is called the pitch class histogram, which has 12 bins, one for each of the twelve pitch classes. The magnitude of each bin corresponds to the number of times a Note On event occurred for that particular pitch class. This histogram gives insights into the types of scales used and the amount of transposition present.

10 www.midi.org


The third histogram is the fifths pitch histogram, which consists of 12 bins. The bins are a reordering of the bins in the pitch class histogram such that adjacent bins are a perfect fifth apart rather than a semitone apart.
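A sketch of the three histograms, assuming the MIDI pitch numbers of all Note On events have already been collected into a list (how they are parsed from the file is outside the scope of this sketch, and the function name is hypothetical):

import numpy as np

def midi_pitch_histograms(note_on_pitches):
    basic = np.zeros(128)          # one bin per MIDI pitch
    pitch_class = np.zeros(12)     # one bin per pitch class
    for p in note_on_pitches:
        basic[p] += 1
        pitch_class[p % 12] += 1
    # Reorder the pitch class bins so that adjacent bins are a perfect fifth apart.
    fifths = np.array([pitch_class[(7 * b) % 12] for b in range(12)])
    return basic, pitch_class, fifths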

The list of pitch based features proposed in [84] includes the most common pitch prevalence, the most common pitch class prevalence, the relative strength of top pitches, the relative strength of top pitch classes, the interval between strongest pitches, the interval between strongest pitch classes, the number of common pitches, the pitch variety, the pitch class variety, the pitch range, the most common pitch, the primary register, the importance of the bass register, the importance of the middle register, the importance of the high register, the most common pitch class, the dominant spread, the strong tonal centres, the basic pitch histogram, the pitch class distribution, the fifths pitch histogram, the quality, the glissando prevalence, the average range of glissandos, the vibrato prevalence and the prevalence of micro-tones.

Melody Based Features

The pitch based features discussed above do not reflect the information relating to the order in which pitches are played. The melody is a very important part of how humans reflect on music that they hear. In order to achieve this, the statistics about melodic motion and intervals are used. A melodic interval histogram is proposed in [84] where each bin of the histogram is labeled with a number indicating the number of semitones separating sequentially adjacent notes in a given channel. The magnitude of each bin indicates the fraction of all melodic intervals that correspond to the melodic interval of the given bin. Features are then derived from this histogram.

The list of melody based features includes the melodic interval histogram, the average melodic interval, the most common melodic interval, the distance between most common melodic intervals, the most common melodic interval prevalence, the relative strength of most common intervals, the number of common melodic intervals, the amount of arpeggiation, the repeated notes, the chromatic motion, the stepwise motion, the melodic thirds, the melodic fifths, the melodic tritones, the melodic octaves, the embellishment, the direction of motion, the duration of melodic arcs, the size of melodic arcs and the melodic pitch variety.


Chord Based Features

Musical chords are created when different notes are played simultaneously. Some techniques of chord analysis presented in Rowe [95] are adopted to design the chord based features in [84]. Two histograms are constructed for the chord based features. The first histogram is the vertical interval histogram, which consists of bins labeled with different vertical intervals. The magnitude of each bin in the histogram is the sum of all vertical intervals that are sounded at each tick.

The second histogram is the chord type histogram. In this histogram, each bin is labeled with one of the following types of chords: two pitch class chord, major triad, minor triad, other triad, diminished, augmented, dominant seventh, major seventh, minor seventh, other chord with four pitch classes and chord with more than four pitch classes.

The list of chord based features proposed are the vertical intervals, the chord types, the most common vertical interval, the second most common vertical interval, the distance between two most common vertical intervals, the prevalence of most common vertical interval, the prevalence of second most common vertical interval, the ratio of prevalence of two most common vertical intervals, the average number of simultaneous pitch classes, the variability of number of simultaneous pitch classes, the minor major ratio, the perfect vertical intervals, the unisons, the vertical minor seconds, the vertical thirds, the vertical fifths, the vertical tritones, the vertical octaves, the vertical dissonance ratio, the partial chords, the minor major triad ratio, the standard triads, the diminished and augmented triads, the dominant seventh chords, the seventh chords, the complex chords, the non-standard chords and the chord duration.

Rhythm Based Features

Studies such as [95-98] emphasized that rhythm plays a very important role in many types of music. In defining the rhythm based features, a beat histogram is constructed using the technique proposed in [37-38]. However, instead of using the not-quite-accurate beat information derived from audio signals, the precise representation of beat information in


MIDI is employed to construct the beat histogram. The rhythm based features employed are then derived from the beat histogram.

The rhythm based features that are derived from the beat histogram include the strongest rhythmic pulse, the second strongest rhythmic pulse, the harmonicity of the two strongest rhythmic pulses, the strength of the strongest rhythmic pulse, the strength of the second strongest rhythmic pulse, the strength ratio of the two strongest rhythmic pulses, the combined strength of the two strongest rhythmic pulses, the number of strong pulses, the number of moderate pulses, the number of relatively strong pulses, the rhythmic looseness, the polyrhythms, the rhythmic variability and the beat histogram itself.

There are other rhythmic features that are not derived from the beat histogram. These are the note density, the note density variability, the average note duration, the variability of note duration, the maximum note duration, the minimum note duration, the staccato incidence, the average time between attacks, the variability of time between attacks, the average time between attacks for each voice, the average variability of time between attacks for each voice, the incidence of complete rests, the maximum complete rest duration, the average rest duration per voice, the average variability of rest duration across voices, the initial tempo, the initial time signature, the compound or simple meter, the triple meter, the quintuple meter and the change of meter.

Instrumentation Based Features

This group of features utilizes the capability of the General MIDI (Level 1) specification, which allows recordings to make use of 128 pitched-instrument patches and a further 47 percussion instruments in the Percussion Key Map. The instrumentation based features proposed include the presence of pitched instruments, the presence of unpitched instruments, the note prevalence of pitched instruments, the note prevalence of unpitched instruments, the time prevalence of pitched instruments, the variability of note prevalence of pitched instruments, the variability of note prevalence of unpitched instruments, the number of pitched instruments, the number of unpitched instruments, the percussion prevalence, the string keyboard fraction, the acoustic guitar fraction, the electric guitar fraction, the violin fraction, the saxophone fraction, the brass fraction, the


woodwinds fraction, the orchestral strings fraction, the string ensemble fraction and the electric instrument fraction.

Musical Texture Based Features

The musical texture based features make use of the fact that MIDI notes can be assigned to different channels and to different tracks, thus making it possible to segregate the notes belonging to different voices. The texture related features include the maximum number of independent voices, the average number of independent voices, the variability of number of independent voices, the voice equality number of notes, the voice equality note duration, the voice equality dynamics, the voice equality melodic leaps, the voice equality range, the importance of loudest voice, the relative range of loudest voice, the relative range isolation of loudest voice, the range of highest line, the relative note density of highest line, the relative note durations of lowest line, the melodic intervals in lowest line, the simultaneity, the variability of simultaneity, the voice overlap, the parallel motion and the voice separation.

Dynamic Based Features

In music, dynamics usually refer to the loudness of a piece. In [84], the dynamic of a note refers to its velocity value scaled by the channel volume messages:

\text{note dynamic} = \text{note velocity} \times \left( \frac{\text{channel volume}}{127} \right). \qquad (2.48)

All dynamic features in [84] use relative measures rather than absolute measures, as the default volume and velocity values set by different sequencers vary. The list of dynamic based features includes the overall dynamic range, the variation of dynamics, the variation of dynamics in each voice and the average note-to-note dynamics change.
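A sketch of (2.48) and of the overall dynamic range derived from it, assuming parallel lists of Note On velocities and the channel volumes in effect at those events (both on the MIDI 0-127 scale); the function name is hypothetical.

import numpy as np

def note_dynamics(velocities, channel_volumes):
    v = np.asarray(velocities, dtype=float)
    cv = np.asarray(channel_volumes, dtype=float)
    dynamics = v * (cv / 127.0)                       # eq. 2.48
    dynamic_range = dynamics.max() - dynamics.min()   # overall dynamic range (a relative measure)
    return dynamics, dynamic_range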

It can be seen in [84] that an extensive list of potential features can be extracted from MIDI data. However, not all of them are applicable to the various existing types of music. In addition, depending on the author of the MIDI files, some data necessary to derive certain features might not be available in the files. Apart from that, not all symbolic file formats are as compact as MIDI files. Hence, depending on the

application of the tasks and the file formats available, the number and types of features derived and employed in the research will vary. In short, there is no standard list of features; the choice of features depends on the availability and capability of the data and the characteristics of the music under investigation. In any case, however, the use of pitch and rhythmic information is inevitable.

2.3 Discussion

The two main components of folk song classification are the machine classifier and the music encoding. In this thesis, the single-hidden layer feedforward neural network is employed as the machine classifier for the folk song classification research. However, neither the gradient descent-based learning algorithm nor the discriminant-based learning algorithm is employed for the SLFN. Instead, a superior technique which has been proven capable of overcoming the drawbacks of both types of algorithm, called the extreme learning machine, is employed in order to examine and verify its performance in multi-class classification tasks, particularly folk song classification. Nevertheless, the classification performance of the gradient descent-based learning algorithm and the support vector machine will be included for comparison.

As mentioned in Chapter 1, this thesis intends to contribute towards preserving and sustaining the art of Chinese folk songs. Hence, during the data collection, efforts were made to obtain data in a format that facilitates ethnomusicological analysis. Since folk songs in the Essen Folksong Collection [1] are documented using the kern representation, this database is employed for this purpose. It is to be noted that many symbolic music formats can be easily converted into audio formats. There are many commercial converters for converting MIDI to MP3 or MIDI to WAV. In addition, symbolic formats other than MIDI usually provide tools for conversion into standard MIDI files. For example, hum2mid11 is a program for converting kern files into standard MIDI files.

11 extra.humdrum.org/man/hum2mid/


Preliminary investigations were performed to examine the performance of folk song classification using audio representation technique. The results are recorded in the Appendix. The overall result for the research using audio representation is below average. This preliminary research suggests that a more appropriate and efficient music representation technique is required for encoding the Han Chinese folk songs for machine classification.


Chapter 3

Music Representation and the Musical Feature Density Map

As presented in Chapter 2, there are two main approaches to music representation: audio and symbolic. In most cases, the choice of the music representation is greatly dependent on the format of the available data set. In this thesis, the Essen Folksong Collection [1] is employed as the data set for the research. Folk songs in this database are recorded in kern format. Hence, in this thesis, discussions are framed around the symbolic approach to music representation, i.e. from a high-level, musicological point of view instead of the low-level audio signal view.

This chapter begins with a discussion on the ethnomusicology background of the research topic – geographical based Han Chinese folk song classification – followed by a discussion of the music database employed for the research. Then, the music elements employed to define the different classes of Han Chinese folk songs are discussed. Finally, a novel encoding method called the musical feature density map (MFDMap) is proposed for encoding useful music information. The MFDMap is designed to incorporate ethnomusicology theory into the structure of the music feature vector.


3.1 Ethnomusicology Background on Geographical Based Han Chinese Folk Song Classification

The Han Chinese is an ethnic group native to East Asia and is the largest ethnic group in China. Most scholars use the general word "Chinese" to refer to the Han Chinese. However, there is considerable linguistic, customary and social diversity among the subgroups of the Han, mostly due to historical events, geographical conditions and the assimilation of various regional ethnicities and tribes. As discussed in Chapter 1, folk songs are an important part of traditional Chinese music. They reflect the ideals and emotions of the common people and illustrate their customs and social life over thousands of years of Chinese history.

There are two major classification systems for the study of Han folk songs [99]: (i) according to the place of origin (geographical based) and (ii) according to the occasion when they are sung (song type based). The study of the classification of Han folk songs according to geographical factors falls into the first group. It was pioneered by two prominent ethnomusicologists: Jing Miao and Jianzhong Qiao [100] in the late 1980’s. As a whole, geographical factors such as the environment, weather and landscape structure determine the social and economic activities. Subsequently, the social and economic structures influence the development trends and characteristics of the cultures. Naturally, these cultural elements are then reflected in the folk songs. Therefore, geographical based classification of Han folk songs is not a meaningless and unproven task.

In their research [100], Miao and Qiao suggest that due to factors such as inter- marriages, social exchanges, business communications, etc. the cultural practices in neighbouring regions are usually very similar and closely related. Therefore, there are many similar music elements exhibited in the folk songs originating from these places. This suggests that these closely related regions should be grouped instead of being viewed individually. However, there is usually no single way of drawing the boundaries. Depending on the level of understanding, the point of view and the amount of available information, Han folk songs can be partitioned into any number of categories. For example, from the world viewpoint, Han folk songs are placed in the


eastern group (as opposed to the western). If the viewpoint is narrowed down to just the Han culture, these folk songs can be broadly placed into any other number of categories. If the viewpoint is further narrowed down (to the maximum extent), each folk song is unique and has its own texture and style. In other words, any fashion of division has its relativity and should be used as a reference instead of an ultimate convention.

In addition, Miao and Qiao [100] also emphasized that, the features used to identify and define a particular class label might not be universally demonstrated in all songs that are originated from that class. In many cases, as a result of population migration, social transformations, revolutions, wars, social exchanges and other historical factors, some folk songs were “mutated”, “propagated” or “migrated”. During the process, these folk songs lost some or all of their originality while adapting to other influences. Hence, in the classification of folk songs according to geographical region of origin, it is fairly common to have candidates that do not exhibit similar characteristics to others in the same class or folk songs that exhibit characteristics that belong to more than one class. Miao and Qiao highlighted that, in many cases, the research outcomes can only be applied to the typical examples. As a result, the features used to define a class in geographical-based folk song classification can only be an approximation.

In order to derive useful and meaningful attributes to describe each geographical class, it is important to understand the factors that contribute to the forming of the musical style in the folk songs. As mentioned previously, folk songs reflect the culture of the people, hence it is important to understand the culture of a population in a certain geographical region. It is commonly said that human civilization originated around river basins. In their study [100], Miao and Qiao indicate that the geographical structure of the Han folk song culture can be divided into two broad regions: the north and the south, each of which is closely associated with the two main rivers in China: the Huang He (黄河, Yellow River) of the north and the Chang Jiang (长江, Yangtze River) of the south. In the north, the Huang He basin is further divided into two: the plains in the east and the plateaus in the west. Due to a more complex geographical structure, the south region is divided into more chunks. The Chang Jiang basin is divided into three regions: east, center and west. The east is mainly plains, the center regions comprise mountainous areas, hilly areas and lake areas, and the west is mainly plateaus. There is


another significant river in the south called the Zhu Jiang (珠江, Pearl River) that also plays an important role in forming the folk song culture. Besides those regions that are situated on the river basins, the regions situated between them are classified as transitional areas. A map of the three main rivers is shown in Figure 3.1.

Figure 3.1: Map of the three main rivers: the Yellow River, the Yangtze River and the Pearl River.

The east of Huang He basin covers regions such as Hebei, Shandong, Liaoning, Jilin and Heilongjiang. These regions have fertile land, rich natural resources, convenient transportation and a prosperous economy. Agriculture, forestry, animal husbandry and fishery are active in these regions. The natural environments and the various economic activities create diversity in the society in these regions. In addition, active trading that happened in these regions gives rise to external influence in the folk songs in these regions. There are substantial numbers of folk songs in these regions that are dispersed from the northwest and southwest regions. The common forms of folk songs in these regions are xiaodiao (ditty) and haozi (work song). Xiaodiao is usually


sung as a form of entertainment or as folk art performances. The melody is usually well organized and very decorative. Haozi are sung during collective physical work. They usually have fast and powerful rhythms which synchronize with the movements of the laborers. Folk songs from these regions usually have intervallic jumps of a fifth, sixth or seventh.

The northwest regions (west of the Huang He basin), such as Shaanxi, Shanxi, Ningxia and Gansu, have large areas covered by the Loess Plateau (also known as the Huangtu Plateau). These regions are sparsely populated, have many gullies and ravines, and contain many areas that are difficult to access. Unlike the northeast regions, the land here is not suitable for agriculture and people have to travel to other regions for jobs. The land structure also leads to problems in building a good transportation system. The main means of transportation are the horse and the donkey. This gives rise to a unique social class – the porter, who is responsible for transporting local products for trading. The job obliges people to travel long distances through rugged and remote mountain roads. While traveling, they sing songs to relieve tiredness and as a form of self-entertainment. These songs usually have free rhythm and a bold, unrestrained, dark and "long-drawn-out" texture along with a hint of misery and gloom. This style of folk song is usually unique to the plateau land structure and very uncommon in plains and watery regions such as those in the east of the Chang Jiang basin. The common form of folk songs is the shange (mountain song). Shange are songs sung in open areas like the mountains or open fields. Some shange are sung while working but, unlike haozi, the associated physical movements are usually minimal and less intense. The interval distances of a fourth (especially the perfect fourth) and a second (especially the major second) are the common representatives of the style of folk songs in these regions. They can effectively express the dreary, desolate and sorrowful mood of the plateau.

The southwest regions, including Sichuan, Guizhou, Yunnan and the northwest part of Guangxi, are where the majority of the Han people resides. These regions are located on the western part of the Chang Jiang basin and have a land structure similar to the northwest regions, which is mainly plateaus. Unlike the dry and windy climate of the northwest, these southwest regions fall in the temperate and subtropical climate zones with sufficient rainfall throughout the year. Rice is one of the main crops in these regions. The most popular form of folk song in these regions is the shange. Most


shange from these southwest regions are lyrical. Some of them are love songs, and many of the lyrics in these songs include words that picture beautiful scenes of the villages and landscapes. Chuanfu haozi (boatman work song) is also very common in Sichuan. Gewu xiaodiao (dancing ditty) is popular in Yunnan and Guizhou. Folk songs in these regions usually have a small pitch range and small intervals. It should be noted that since ancient times, these regions in the southwest of China have been populated by peoples from many different ethnic groups. The influence of non-Han materials in the Han folk songs is therefore bound to be common. In addition, some folk songs are commonly shared among the Han and non-Han peoples.

The regions around the Zhu Jiang basin include the majority of Guangdong (except the non-Han areas), the southern part of Guangxi and Hainan. The climate here belongs to the subtropical zone. These regions are surrounded by islands and harbours in the south. Fishery is very active, and many forms of folk songs are common in these regions: gewu xiaodiao, yuge (fishermen song), shange, haozi and xiaodiao. The folk songs in these regions focus a lot on the life of the fishermen and farmers (the two main occupations in these regions). The pitch range used in folk songs originating from these regions is usually slightly more than an octave. "Sol" and "re" are fairly commonly used, and interval distances of a fifth, sixth and seventh are common among folk songs in these regions.

The southeast regions such as Jiangsu, Zhejiang and Anhui are on the plains of the Chang Jiang basin. These regions have a mild climate, rich resources and adequate rainfall, and are a suitable area for growing rice. Many forms of folk song circulate around these regions. Among them are tiange (farm field song), xiaodiao, haozi, yuge, shange and chage (tea song). Tiange is usually sung by farmers when working in the rice fields to create a lively atmosphere and to make the work less tiresome. Similarly, chage is sung during tea-picking. Xiaodiao is the most popular and most representative form of folk song from these regions, and has great influence on folk songs in other regions of China. Folk songs in the southeast usually proceed in stepwise movement. It is also a common feature to insert a big interval in the stepwise progression of folk songs. The interval is usually a minor sixth (especially "mi" to "do") or a perfect octave. Most folk songs in these regions, especially Jiangsu, follow the pentatonic scale closely (i.e. "fa" and "ti" rarely occur).


This section does not include all regions in China. The regions that are left out are mainly within the transitional zone which is not within the focus of this thesis. A thorough analysis and discussion on geographical based classification of folk songs is presented in [100].

3.1.1 Rationale for the Choice of the Five Classes

This thesis focuses on Han folk songs from five classes: Dongbei1 (东北), Shanxi (山西), Sichuan (四川), Guangdong (广东) and Jiangsu (江苏). A few factors were taken into consideration when selecting the classes for the research.

1. The five classes selected are all within the main regions of the folk song culture highlighted in the previous section and also in [100]. Dongbei is part of the plains located east of the Huang He basin while Shanxi is in the west of the Huang He basin. Sichuan is on the plateau in the west of Chang Jiang basin and Jiangsu, on the other hand, is in the east. Finally, Guangdong is located on the Zhu Jiang basin. These classes are highlighted in Figure 3.2.

2. In [100], the authors point out that folk songs from neighbouring regions usually possess similar characteristics and texture. This is generally due to the similar customs, social structures and practices, and other cultural activities that are shared among people in those areas. These similarities usually result from communication and social exchanges among the people. However, mountains and rivers usually act as natural barriers that break off communication and hence naturally encourage the growth of different cultures. The five classes selected are geographically reasonably far apart from each other. Hence, it is practical to categorize them as separate classes. However, as mentioned earlier, the migration of people and the propagation of popular folk tunes remain a concern, causing similarity between folk songs from different regions. In other words, although each of these five classes can be regarded as geographically

1 Dongbei comprises Liaoning, Jilin and Heilongjiang.


separate, it is unavoidable that some folk songs are related to more than one class.

Figure 3.2: Map of the regions in China with the five classes studied in this thesis highlighted.

3. Another concern when selecting the five classes for the research is the size of the sample data. When there is more than one choice, the region with the largest number of samples is employed. For example, Shanxi, Gansu, Ningxia and Shaanxi are all located within the western part of the Huang He basin and are all considered to have a similar "folk song colour" in [100]. Hence, when deciding the candidate region for research, the region with the largest data sample is used. It is important to note that, even though these regions fall within the same "colour area", each of them still possesses its own differences. In other words, they are similar from a broader perspective but dissimilar in the


narrower details. In many cases, especially when the area covered by a region is significantly large, folk songs within the region might show dissimilarities in many subtle details.

4. A major concern when deciding the candidates for the study is to avoid regions that have large populations of non-Han (minority ethnic group) people. For example, Ningxia has a large population of Hui people in addition to Han people, and Qinghai is home to a large population of ethnic Tibetans. As pointed out in [100], cross-cultural phenomena can easily be traced in folk songs originating from regions that are under the influence of cultures other than Han. Hence, if these regions were included in the study, the overall classification task might be "contaminated" and "complicated".

5. Although it might be of minor importance, the popularity of the folk songs contributes to another concern when performing the selection. The five targeted classes encompass folk songs that are more familiar and well known to both professional and non-professional people within and outside China.

It is important to note that Dongbei comprises three regions: Liaoning, Jilin and Heilongjiang. The reasons for combining these regions into one are: (i) in the Essen Folksong Collection, the origin information of some folk songs does not clearly state which of the three regions the folk song originated from (only "Dongbei" is stated); (ii) these three areas are commonly referred to as Dongbei in much of the literature; (iii) in [100], these three regions are usually viewed as one combined region; (iv) the number of folk songs from each region is limited, especially when there is no indication as to which of the three regions a particular folk song should be classified under.

3.2 Music Data Set – The Essen Folksong Collection

The Essen Folksong Collection [1] was originally created by Helmut Schaffrath and later edited by Ewa Dahlig and David Huron. It was developed in the early 1980s for


monophonic folk music research. The Humdrum **kern2 version of the Essen Folksong Collection was prepared and edited by David Huron. It is publicly available at http://kern.ccarh.org/browse?l=essen. The database contains folk songs from Europe, Asia and the Americas. The largest categories in the database are Germany and China.

There are a total of 1,222 songs from the Han Chinese category in the online version of the Essen Folksong Collection. This thesis focuses on Chinese folk songs from five classes: Dongbei (东北), Shanxi (山西), Sichuan (四川), Guangdong (广东) and Jiangsu (江苏). There are 333 folk songs belonging to these five classes: 70 from Dongbei, 75 from Shanxi, 43 from Sichuan, 61 from Guangdong and 84 from Jiangsu.

3.2.1 The **Kern Representation

The **kern representation [80] is the most popular of the many predefined Humdrum representations developed by David Huron. It is a symbolic format of music encoding that allows researchers to encode, manipulate and output musically pertinent representations. **Kern conforms to the broader Humdrum Syntax, a grammar for representing musical information. The details of the Humdrum Syntax are beyond the scope of this thesis, but further information about Humdrum can be found in [73,79,101]. Kern permits the representation of core musical information, including information on pitch, duration, accidentals, ties, slurs, phrasing, bar lines, articulation, ornamentation, stem direction, etc. The main purpose of the kern representation is to facilitate analytic applications. It represents the underlying syntactic information conveyed by a musical score. In other words, kern encodes the canonical score rather than its visual or orthographic rendering. The **kern representation supports the score-related signifiers listed as follows [80]:

Pitch: concert pitch, accidentals, clefs, position, key signatures, key, harmonics, glissandi, arpeggiations, unpitched events, multiple stops, etc.;

2 **kern is the formal name, especially within the context of other Humdrum representations. Kern is used in a less strict manner.


Duration: canonic musical durations, rests, augmentation dots, n-tuplets, ties, tempo (beats per minute), meter signatures, gruppetto designations, acciaccaturas, indefinite or durationless events;
Articulation and ornamentation: staccato, spiccato, pizzicato, attacca, accent mark, sforzando, breath mark, generic articulation, trills (half-step), trills (whole-step), mordent, inverted mordent, turn, inverted turn, generic ornaments;
Timbre: instrument name, instrument class;
Other: phrase marks, slurs, elision markers, bar lines, double bar lines, dotted bar lines, partial bar lines, invisible bar lines, measure numbers, system/staff arrangement, beams, partial beams, stem directions, up-bows, down-bows;
Editorial: sic, editorial interpretation markers, editorial intervention markers, editorial footnotes, global comments, local comments, user-defined symbols.

A musical work encoded using kern can comprise any number of the above listed signifiers but none of this information is mandatory. For example, a bona fide kern file might consist of just phrase marks and bar lines. Kern is able to encode the bare bones of traditional Western musical notation but it still lacks the ability to represent additional types of information, for example, transposed pitch, pitch frequency, scale degree, MIDI key number, cents, melodic contour and pitch intervals. One notable limit of kern is its inability to represent musical dynamics.

3.2.2 An Example of Han Chinese Folk Song in **Kern Format

All kern files are standard ASCII files. Typically, one file is used to encode a single work or movement. Figure 3.3 and Figure 3.4 are used as an example to demonstrate the **kern representation. Figure 3.3 is an example of a Han Chinese folk song from Jiangsu (江苏) titled Si Ji Ge (四季歌). Figure 3.4 is the illustration of **kern representation of the music in Figure 3.3. In **kern representation, a single column of data is used to represent a part or an instrument. For music with more than one part or


one instrument, there will be multiple columns within a file, each representing one of the parts or instruments. The kern representation proceeds vertically down the page.

Figure 3.3: The musical score of a Jiangsu folk song – Si Ji Ge.

In kern files, comments are lines (records) that begin with an exclamation mark: global comments “!!” pertain to the entire encoding and local comments “!” pertain to a single column of data. Reference records are a special type of comment. They are a formal way of encoding “library-type” information pertaining to a Humdrum document and provide standardized ways of encoding bibliographic information. Reference records usually start with three exclamation marks (!!!). The following is an explanation of each of the reference records shown in Figure 3.4:

!!!OTL: Title. This item records the title of the specific work or section or segment. Titles are rendered in the original language.

!!!ARE: Geographical region of origin. This reference identifies the geographical location from which the work originates. Location is usually encoded using the local language. The location begins with the continent designation and becomes more refined. Depending on the available information, the refinement can include suburban district or even street address.


!!!OTL: Si ji ge
!!!ARE: Asia, China, nan Jiangsu
!! Ethnic Group: Han
!!!SCT: C0954
!!!YEM: Copyright 1995, estate of Helmut Schaffrath.
**kern
*ICvox
*Ivox
*M2/4
*k[f#c#g#]
*A:
{8cc# 8b 8cc# 8ee
=1 8a 8f# 8ee 8cc#
=2 4.b 16cc# 16b}
=3 {8a 16a 16f# 8a 8b
=4 8a 8f# 8e 8c#
=5 2e}
=6 {4f# 8a 8e
=7 8.f# 16a 8b 8cc#
=8 8b 8a 8f# 8e
=9 2f#}
=10 {8.f# 16a 8b 8cc#
=11 8b 8a 8f# 8e
=12 8.c# 16e 8a 8f#
=13 2e}
==
!!!AGN: Xiaodiao, Shidiao, Kuqiqi, Mengjiangnüdiao. Siji sixiang
!!!ONB: ESAC (Essen Associative Code) Database: CHINA
!!!AMT: simple duple
!!!AIN: vox
!!!EED: Helmut Schaffrath
!!!EEV: 1.0
*-

Figure 3.4: A **kern representation of the Jiangsu folk song – Si Ji Ge. (The note and bar-line tokens are grouped by measure here for readability; in the actual kern file each token occupies its own record, proceeding vertically down the page.)


!!!SCT: Scholarly catalogue abbreviation and number.

!!!YEM: Copyright message. This record conveys any special texts related to copyright. It might convey a simple warning, registration or licensing information, or indicate that the document is shareware.

!!!AGN: Genre designation. This is a free-form text message that can be used to identify the genre of the work.

!!!ONB: Free format note related to the title or identity of the encoded work.

!!!AMT: Metric classification. There are eight categories in which the meter of a file may be classified: simple duple, simple triple, simple quadruple, compound duple, compound triple, compound quadruple, irregular and various.

!!!AIN: Instrumentation. This reference is used to list all instruments (and voices) used in the work. Instrumentation is encoded using the abbreviations specified by the *I tandem interpretation.

!!!EED: Electronic editor. This reference identifies the name of the editor of the electronic document.

!!!EEV: Electronic edition version. This reference identifies the specific editorial version of the work.

All kern representations begin with the keyword **kern to indicate that the subsequent encoded material conforms to the kern representation. The encoded passage always ends with a special terminator token (*-). Tandem interpretations are used in all Humdrum documents to encode additional or supplementary information. Tandem interpretations are identified by a single asterisk (*). The following explains the meaning of each of the tandem interpretations in Figure 3.4:

*ICvox: *IC is the tandem interpretation to identify the instrument class. *ICvox identifies the pre-defined instrument class for voice.

*Ivox: *I is the tandem interpretation to identify the instrument name. *Ivox signifies generic voice.

*M2/4: *M identifies the meter signatures. *M2/4 signifies simple duple, i.e. two quarter notes per bar.


*k[f#c#g#]: *k identifies the key signatures. *k[f#c#g#] signifies the key signatures for A major.

*A:: *A: is the tandem interpretation to identify the key (in this case, A major). Major keys are represented using upper-case letters and minor keys are represented using lower-case letters. Key specifications are always preceded by an asterisk (*) and terminated with a colon (:).

In Humdrum documents, information is encoded in data tokens. The **kern representation distinguishes three types of data tokens: (1) notes, (2) rests and (3) bar lines. In all kern files, all non-null data tokens must be one of these three types. The note tokens can encode a variety of attributes including absolute pitch, accidental, canonic musical duration, ties, articulation, ornamentation, slurs, musical phrasing, stem direction and beaming. Pitch information is encoded using upper- and lower-case letters. Middle C (C4) is represented using the lower-case letter “c”. Successive octaves are denoted by repeating the letter, i.e. C5 is “cc”, C6 is “ccc” and so on; the higher the octave, the more repetitions of the letter. Pitches below Middle C are represented using upper-case letters: C3 is designated as “C”, C2 is “CC” and so on. The same scheme is applied to all other pitch letter-names. Changes of octave are deemed to occur between B and C. For example, the B below Middle C is represented as “B” while the B below “cc” is represented as “b”. All pitches are encoded as equally-tempered values. In the very rare cases where a tuning system other than equal temperament is used, a special tandem interpretation is provided to indicate the new tuning system.

Accidentals are encoded immediately after the diatonic pitch information. Sharps are encoded using the hash sign (#), flats using the minus sign (-) and naturals using the lower-case letter “n”. Double-flats and double-sharps are represented by repetition of their respective signs. It is to be noted that in **kern representation, all pitches are encoded as contextually independent absolute values. This means that all pitches are encoded as isolated entities, regardless of the events going on around them. For example, pitches must be encoded with the appropriate accidental even if the accidental is specified by the key signature. In addition, for transposing instruments, the pitches are encoded regardless of the transposition, i.e. the pitches are represented at the sounding (concert) pitch. However, a special tandem interpretation will be provided to indicate the nature of the transposition.
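To make the pitch rules above concrete, the following Python sketch (not part of the thesis; the function and dictionary names are illustrative) converts the pitch portion of a **kern note token into a MIDI note number, assuming equal temperament and the octave and accidental conventions just described.

```python
# Illustrative sketch only: map a **kern pitch token such as "8cc#", "B-" or
# "16f" to a MIDI note number, following the letter/octave/accidental rules
# described above. Names are hypothetical, not taken from any kern library.
PITCH_CLASS = {'c': 0, 'd': 2, 'e': 4, 'f': 5, 'g': 7, 'a': 9, 'b': 11}

def kern_pitch_to_midi(token: str) -> int:
    """Return the MIDI note number encoded by the pitch letters of a kern token."""
    letters = [ch for ch in token if ch.lower() in PITCH_CLASS]
    if not letters:
        raise ValueError(f"no pitch letters in token {token!r}")
    letter = letters[0].lower()
    count = len(letters)
    if letters[0].islower():        # lower case: Middle C octave and above
        octave = 4 + (count - 1)    # "c" -> C4, "cc" -> C5, "ccc" -> C6, ...
    else:                           # upper case: octaves below Middle C
        octave = 4 - count          # "C" -> C3, "CC" -> C2, ...
    midi = 12 * (octave + 1) + PITCH_CLASS[letter]
    midi += token.count('#')        # each sharp raises the pitch by a semitone
    midi -= token.count('-')        # each flat lowers the pitch by a semitone
    return midi

# Examples: "8cc#" -> 73 (C#5), "B" -> 59 (B3), "16f#" -> 66 (F#4)
```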

In kern, durations are encoded using reciprocal numerical values corresponding to the American duration names, i.e. “1” for the whole note, “4” for the quarter note, “8” for the eighth note and so on. The number zero (0) is used for the breve duration (i.e. a duration of twice the length of a whole note). The augmentation dot (.) is used to indicate dotted durations. It is added immediately following the numerical value. For example, a dotted-quarter note is represented as “4.”. Any number of augmentation dots may follow the duration integer. Hence, “4..” signifies a doubly dotted-quarter note.

Triplet and other irregular durations are represented using the same logic. Take for example, the half-note triplet duration. Three half-note triplets occur in the time of two half notes, which is the duration of one whole note. If the whole note duration “1” is divided equally into three parts, each part has the duration of one-third. The corresponding reciprocal integer for 1/3 is 3. Hence, the half-note triplet is represented as “3” in kern representation.
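The reciprocal scheme, augmentation dots and tuplets can likewise be illustrated with a short sketch. The code below is an illustration, not the thesis implementation: it converts the numeric part of a **kern token into a length measured in crotchets (quarter notes), the base duration used later in Section 3.3.2, with the breve token "0" handled as stated above.

```python
from fractions import Fraction

def kern_duration_to_crotchets(token: str) -> Fraction:
    """Length of a **kern duration token in crotchet (quarter-note) units."""
    digits = ''.join(ch for ch in token if ch.isdigit())
    dots = token.count('.')
    value = int(digits)
    if value == 0:
        length = Fraction(8, 1)      # "0" is the breve: two whole notes = 8 crotchets
    else:
        length = Fraction(4, value)  # reciprocal notation: "4" -> 1, "8" -> 1/2, "3" -> 4/3
    # each augmentation dot adds half of the previously added value
    total, add = length, length
    for _ in range(dots):
        add /= 2
        total += add
    return total

# Examples: "4." -> 3/2 (dotted crotchet), "16" -> 1/4 (semiquaver),
#           "3"  -> 4/3 (minim triplet),   "4.." -> 7/4 (doubly dotted crotchet)
```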

All curved lines found in printed scores are explicitly interpreted as either ties, slurs or phrases in **kern representation. There is no generic representation for them. Ties are represented by square brackets. The open square bracket ([) denotes the first note of a tie and the closed square bracket (]) marks the last note of the tie. The underscore character (_) denotes the middle notes of a tie. Slurs are represented by parenthesis. Open parenthesis (() signifies the beginning of a slur and closed parenthesis ()) signifies the end of the slur. Phrases are marked with open brace ({) and closed brace (}) to denote the beginning and end of a phrase respectively. Slur and phrase markings can be nested and may also be elided.

The rest tokens in **kern representation are denoted by a single lower-case letter “r” along with the numerical value of the duration signifier. Bar lines are signified by the equals sign (=). Immediately following the equals sign is an optional integer value indicating the measure number. For example, “=2” indicates the end of the second measure (bar). Double bar lines are signified by a minimum of two successive equals signs (==). Several consecutive equals signs might be used to enhance readability.

As seen in Figure 3.4, not all signifiers listed in the previous section are used in the kern file. The **kern representation of Han Chinese folk songs in the Essen


Folksong Collection consists of the concert pitch, accidentals, key signatures, key, canonic musical durations, rests, augmentation dots, n-tuplets, ties, meter signatures, explicit phrase marks, bar lines, double bar lines, measure numbers, instrument name, instrument class, global comments and reference records. It should be noted that song lyrics are not included in the kern files. A complete list of kern signifiers, their function, application in music encoding, coding conventions and other examples of music documented using kern representation can be found in [73,36].

3.2.3 Assumptions in **Kern Version of the Essen Folksong Collection

The **kern version of the Essen Folksong Collection was translated automatically from the Essen Associative Code (EsAC) with hand editing. During the production, a number of assumptions were made and they should be carefully considered before using the **kern version of the Essen Folksong Collection. These assumptions [1] are listed below:

1. The original EsAC database was encoded from various sources. In most cases, the citations to the sources were significantly abbreviated in the original data. No attempt was made to provide more complete reference information in the **kern version of the database. All citation information present in the source database has been retained. Nonetheless, some original sources may remain obscure.

2. The EsAC does not encode absolute pitch height. Instead, the pitch information is presented as a combination of the tonic pitch and the (diatonic) solfege-type designations. In translating to **kern representation, an estimation of the appropriate octave placement was made. In the **kern translation, the tonic pitch for the “principal octave” is assigned to the range C4 to B4. Hence, for example, a work beginning on the mediant pitch with a tonic of C would be assigned to E4.


3. The **kern key designators require a distinction between major and minor keys. Unfortunately, this information was not present in the original EsAC database. Hence, major/minor designations were assigned according to the following method: keys were assumed to be major unless a lowered mediant or lowered submediant tone appeared in the first phrase – in which case the key is assumed to be minor.

3.3 Music Elements and Encoding

In this section, the music information that can be derived from the folk songs in the Essen Folksong Collection and that is useful for Han Chinese folk song classification is discussed. This section also discusses the encoding method used to convert this music information into numerical representations that can later be meaningfully and easily adopted to construct feature vectors for machine classification.

The upbringing and ideology that accompanies a person’s knowledge about music can lead to favoritism of certain characteristics of music. In this thesis, efforts were made to minimize favoritism by employing musical elements that are essential to music of most styles. Nonetheless, the Western tonal musical training that the author received and the nature of the music representation employed in the dataset might involuntarily lead to a discussion using terms that are based on Western music tradition.

Music is the art of sound. Owen [102] characterized musical sound using four main elements: pitch, duration, dynamic and timbre. Pitch is the human perception of the relative highness or lowness of a sound and may be further described as definite pitch or indefinite pitch. Duration is the relative length of a sound. Dynamic is the relative strength or loudness of a sound. Timbre is the quality or tone colour of a sound. It varies between voices and types of musical instruments. It is the result of complex interactions between various pitches (harmonics), durations and dynamics over time. These four elements have a common dimension – time. Therefore, music is characterized as a temporal art.


As demonstrated by the example in Figure 3.4, the **kern representation of Han folk songs encoded only pitch and duration information. No information on dynamic and timbre is presented. Hence, the discussion in the following sections will focus only on the pitch and duration information of the folk songs. It is important to note that all discussions are based on the twelve-tone equal temperament tuning system.

3.3.1 Pitch Elements

3.3.1.1 Solfege

One of the main characteristics of folk songs is their method of transmission. Unlike other styles of music, such as art songs and composed songs, folk songs are transmitted through oral tradition. When dealing with the pitch element of music, most existing research employs the absolute pitch representation of the song melody. However, in this thesis, an alternative pitch representation – solfege is proposed. Solfege is a solmization technique that is commonly used for sight singing where each note of the scale is represented by a special syllable. The seven most commonly used syllables are: do, re, mi, fa, sol, la and ti. There are two methods of applying solfege: fixed-do and movable-do. In the fixed-do solfege system, the syllables each correspond to the name of a note. Hence, do, re, mi, fa, sol, la and si (si is used instead of ti in this system) are used to name notes the same way that the letters C, D, E, F, G, A and B are used. In the movable-do system, each of the syllables corresponds to a scale degree instead of to a pitch. The first degree of a major scale is always denoted as “do”, the second as “re”, the third as “mi” and so on. For minor keys, the first degree is “la”, the second is “ti”, the third is “do” and so on. For example, if a piece is in C major, then C is denoted as “do”, D is “re” and so on. If a piece is in C minor, then C is “la”, D is “ti” and so on. In the movable-do system, a tune is always sol-faed on the same syllables no matter what key it is in. In this thesis, the movable-do solfege system is employed.

The advantage of the movable-do solfege system is its ability to assist in the theoretical understanding of music. From an established tonic, the melodic and chordal implication, through the whole music piece, can be easily inferred. The movable-do


solfege system thus serves to cultivate a mental apparatus for relating each tone to its neighbours, hence capturing the overall progression of the notes. Through the emphasis on pitch contours, solfege can convey the similarity (in a relative sense) of works built (in an absolute sense) from different components. That is, for example, the syllables do-re-mi can represent the pitch names C-D-E and A-B-C# equally well. A similar concept is applied in the oral tradition of folk song. As a folk song is passed on (orally) to the next generation, due to varied voice ranges between the predecessor and the successor, instead of learning the song at the same absolute pitch (where some pitches might not exist within the successor’s voice range), the successor learns the pitch contour of the song and reproduces the melody using a more comfortable pitch range. Hence, if the singing of both the predecessor and the successor of the same folk song were to be documented in pitch names, there would be two different versions of the same folk song. However, if the solfege system is used instead, both outcomes will be identical.

The solfege notation can be easily computed using pitch and key information. It measures the scale distance between a note and its referencing key note in the musical scale. Table 3.1 portrays the reference table for computing the solfege representation. The entries in the table are calculated using the pitch name and the key available in the kern files. As highlighted previously, the music system used in this thesis is the twelve-tone equal temperament tuning system. Hence, there are 12 notes in the musical scale as shown in Table 3.1.

One of the assumptions made in translating the Essen Folksong Collection into kern representation was to assign the “principal octave” to the range C4 to B4. In this thesis, when calculating the solfege translation, the same assumption was applied. In addition, as vocal ranges are usually around two octaves [103], this thesis assumes that the “safe” range for the solfege representation is from two octaves below the principal octave up to two octaves above the principal octave, i.e. the solfege representation ranges from 1 to 60 for each of the 12 major keys. Solfege representation for minor keys takes the same value as their respective relative major key. Table 3.1 shows the solfege representation for all 12 major keys with the tonic starting within the range of the principal octave. For the other octaves, the solfege representation is calculated as follows:


$\text{solfege}_i = \text{solfege}_{\text{principal octave}} + (i \times 12)$ (3.1)

where i is the number of octaves above or below the principal octave; i can be either -1 or -2 for octaves below the principal octave and +1 or +2 for octaves above the principal octave. For informational purposes, the MIDI note number equivalent of the pitch name is also included in Table 3.1 to facilitate applicability in MIDI-format datasets. It is to be noted that all rests are encoded as “0” solfege. An illustrative sketch of this computation is given after Table 3.1. Figure 3.5 shows an example of a Jiangsu folk song – Si Ji Ge encoded with solfege representation.

Table 3.1: The solfege encoding reference table (all tonics start within principal octave).

| Pitch Name | C4 | C#4/Db4 | D4 | D#4/Eb4 | E4 | F4 | F#4/Gb4 | G4 | G#4/Ab4 | A4 | A#4/Bb4 | B4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MIDI Note Number | 60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 |
| C major | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 |
| Db major | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 |
| D major | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 |
| Eb major | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 |
| E major | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 |
| F major | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 |
| F# major | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 |
| G major | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 |
| Ab major | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 |
| A major | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 |
| Bb major | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 |
| B major | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 |

3 The lower case ‘b’ is used in this case to represent the musical ‘flat’.
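As a concrete illustration of (3.1) and Table 3.1, the sketch below (hypothetical names, not taken from the thesis) computes a solfege value from a MIDI note number and the tonic of the relative major key, with the tonic anchored in the principal octave C4 to B4 as assumed above.

```python
# Illustrative sketch of the solfege encoding in (3.1) and Table 3.1.
# The tonic of the (relative) major key is anchored in the principal octave
# C4-B4, so the tonic itself receives the value 25; two octaves below and
# above give the assumed range 1..60. Rests are encoded separately as 0.
TONIC_MIDI = {'C': 60, 'Db': 61, 'D': 62, 'Eb': 63, 'E': 64, 'F': 65,
              'F#': 66, 'G': 67, 'Ab': 68, 'A': 69, 'Bb': 70, 'B': 71}

def solfege_value(midi_note: int, major_key: str) -> int:
    """Solfege value of a note relative to the major key's principal-octave tonic."""
    value = midi_note - TONIC_MIDI[major_key] + 25
    if not 1 <= value <= 60:
        raise ValueError("note lies outside the assumed five-octave range")
    return value

# Example: E4 (MIDI 64) in C major -> 29, matching the Table 3.1 entry;
#          C4 (MIDI 60) in D major -> 23.
```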


Figure 3.5: An example of a Jiangsu folk song encoded using solfege representation.

3.3.1.2 Interval

In music theory, an interval measures the distance between two musical pitches. Music gets its richness from intervals. There are two types of intervals: harmonic interval and melodic interval. A harmonic interval is the interval between two simultaneously sounding musical notes, such as a chord. A melodic interval is the interval between two notes that are separate in time, one after the other, such as two adjacent pitches in a melody. As the Essen Folksong Collection consists of monophonic melody of folk songs, the intervals discussed in this thesis are therefore the melodic intervals.

The interval reflects the pitch ratio between two adjacent notes in a melody. The most commonly used intervals are those formed between the notes of the chromatic scale, with the smallest interval being a semitone. In this thesis, the interval is measured in terms of the number of semitones between two adjacent notes and can be computed from the solfege representation:

$\text{interval}(n) = \text{solfege}(n) - \text{solfege}(n-1)$ (3.2)


where n = 2,3,4,…,N and N is the number of notes in each folk song. The value for the interval can be either a positive or a negative integer. The sign of the interval reflects the direction of the notes progression. Positive intervals denote ascending progression and negative intervals denote descending movement. It is to be noted that rests are silence and do not carry any pitch information hence are omitted when constructing the interval representation.
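A minimal sketch of (3.2) in Python (illustrative only), assuming rests are encoded with the solfege value 0 and are skipped as described above:

```python
def interval_sequence(solfege_seq):
    """Melodic intervals in semitones between successive pitched notes.

    Rests (solfege value 0) carry no pitch information and are dropped
    before the differences are taken, as described in the text.
    """
    pitches = [s for s in solfege_seq if s != 0]
    return [pitches[n] - pitches[n - 1] for n in range(1, len(pitches))]

# Example: [25, 27, 0, 29, 27] -> [2, 2, -2]
```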

As discussed in Section 3.1, the geographical aspects such as the environment, climate and landscape structure play a significant role in forming the characteristics of the folk songs. These characteristics are reflected through the size of interval. For example, the cold, dry and windy climate in the northwest regions that affect the agricultural activities and lifestyle of the people are reflected by a more intense and disjunct progression of melody in folk songs from that area [99]. In addition, the mountainous land structure makes traveling difficult. This is also reflected by the large intervals in the folk songs sung by the travelers from those regions [100]. On the other hand, folk songs from the southeast regions are usually more lyrical and conjunct [99]. Hence, the melody progression is usually smoother with small intervals between notes. This important information of the musical flow and movement of folk songs can be featured through the measurement of the interval distances. Figure 3.6 shows an example of the interval representation of the Jiangsu folk song – Si Ji Ge.

Figure 3.6: An example of a Jiangsu folk song encoded using interval representation.


3.3.2 Duration Elements

3.3.2.1 Duration

In music, duration is the relative length in time of a musical note or rest. It measures the relative amount of time a note should sound or a silence should be perceived (for rests). Duration is the fundamental property that forms the rhythm. The longest note duration in Western music is the breve (double whole note) but it is rarely used. The longest duration that is normally encountered is the semibreve (whole note), which is half the length of the breve. The next longest is the minim (half note), then the crotchet (quarter note).

The list of durations follows a logical scheme whereby, descending down the list, each successive duration is half the length of the preceding duration. Therefore, the quaver (eighth note) is half the length of the crotchet, or one eighth of the length of the semibreve; the semiquaver (sixteenth note) is half as long as the quaver; the demisemiquaver (thirty-second note) is half the length of the semiquaver and the hemidemisemiquaver (sixty-fourth note) is half the duration of the demisemiquaver. Notes shorter than the hemidemisemiquaver are very rarely used. Figure 3.7 depicts the relationships between the seven most commonly used durations ranging from the semibreve to the hemidemisemiquaver. It should be noted that the total duration on each line is the same, except that as the length of each note gets shorter, the number of notes on each line is twice that of the previous line.

In music, silence is often as important as sound. Silences are represented by rests. The note durations discussed previously each have a corresponding rest duration. Ties and dots are two ways of extending durations. In order to achieve a longer duration, ties can be used to join notes of various durations. When tied notes are encountered, only the first note within the group of tied notes is played and the sound is sustained for as long as the total duration of all the notes. For example, two tied crotchets are equivalent to the length of a minim, and a minim tied to two crotchets is equivalent to the length of a semibreve. Dots are used to increase the duration of a note or a rest by 50 percent of its original value. There can be more than one dot added to a note or a rest. The first dot represents 50 percent of the value of the initial duration, the second dot is equivalent to half the value of the first dot and so on. Figure 3.8 shows some examples of the use of ties and Figure 3.9 shows some examples of dotted notes and their equivalent duration.

Figure 3.7: The seven most commonly used durations.

Figure 3.8: Examples of tie notes and their equivalence in duration.

Figure 3.9: Examples of dotted notes and their equivalence in duration.


Tuplets are divisions of a beat into combinations of durations that do not possess simple duple ratios. In other words, they are combinations of notes or rests that have unusual durations. The most commonly used is the triplet, where three equal durations are to be performed within the same time span that two of the notated values would usually occupy. Figure 3.10 demonstrates some examples of triplets. The first example (leftmost) in Figure 3.10 is to be interpreted as performing the three minims slightly shorter so that the total time of three minims is the same as the time of two normal minims. Similarly, the second example in Figure 3.10 is to be interpreted as performing the three crotchets within the time of a minim (two normal crotchets). In the third example, the quavers are to be performed in the time of a crotchet (two normal quavers).

Figure 3.10: Examples of triplets and their equivalence in duration.

There are other divisions of durations in tuplets but they are very rare in Han Chinese folk songs and hence are not covered in this thesis. In addition, there are other variations to common durations such as grace notes which were also not employed in the Essen Folksong Collection, hence not discussed. In order for the duration information to be transformed into feature vectors for machine classification in the later stage, this information needs to be encoded into meaningful numerical representations. The duration information is encoded using the relative duration with respect to the designated base duration, the crotchet. This method is invariant to the change of tempo and yet meaningfully retains the information about the differences in length between these durations. Table 3.2 lists the durations that are common to the Han Chinese folk songs, particularly the durations that were employed in the Han folk songs within the Essen Folksong Collection, and their respective encoded representation. Rests are encoded using the negative equivalent of the same numerical value of each of their


corresponding note durations. For example, a crotchet rest is encoded as “-1” and a quaver rest is “-1/2”. Figure 3.11 demonstrates an example of duration encoding for the Jiangsu folk song – Si Ji Ge.

Figure 3.11: An example of a Jiangsu folk song encoded using duration representation.

3.3.2.2 Duration Ratio

In music, intervals serve to reveal the pattern of the pitch contour. Similarly, duration ratios can be used to express the rhythmic contour of the folk songs. Duration ratios signify the amount of change from one note (or rest) duration to the subsequent note (or rest) duration. The duration ratio is a useful measurement for expressing the rhythmic pattern of the folk songs, especially as it is invariant to the tempo and the meter of the songs.


Table 3.2: List of durations and the encoded representation.

| Duration | Encoded Representation | Dotted Duration | Encoded Representation |
|---|---|---|---|
| semibreve | 4 | dotted-semibreve | 6 |
| minim | 2 | dotted-minim | 3 |
| minim triplet | 4/3 (1.3333) | dotted-minim triplet | 2 |
| crotchet | 1 | dotted-crotchet | 3/2 (1.5) |
| crotchet triplet | 2/3 (0.6667) | dotted-crotchet triplet | 1 |
| quaver | 1/2 (0.5) | dotted-quaver | 3/4 (0.75) |
| quaver triplet | 1/3 (0.3333) | dotted-quaver triplet | 1/2 (0.5) |
| semiquaver | 1/4 (0.25) | dotted-semiquaver | 3/8 (0.375) |
| semiquaver triplet | 1/6 (0.1667) | dotted-semiquaver triplet | 1/4 (0.25) |
| demisemiquaver | 1/8 (0.125) | dotted-demisemiquaver | 3/16 (0.1875) |
| demisemiquaver triplet | 1/12 (0.0833) | dotted-demisemiquaver triplet | 1/8 (0.125) |

The information regarding the rhythmic patterns is important as it describes the distinction between music of diverse style and form. For example, a melody with lots of short notes presents an agitated impression but a chunk of lengthy notes usually creates a more graceful feeling. This knowledge can be applied to differentiating folk songs from different geographical locations. As discussed in the beginning of this chapter, the geographical characteristics of the folk songs influence the texture of these songs. Since duration ratios measure the rhythmical changes, they can be used to describe the different textures of the folk songs. For example, a large duration ratio reflects a drastic change in rhythmic progression which portrays a rougher, more intense and disjunct texture. On the other hand, smaller ratios depict a smooth, more continuous and lyrical texture.

The duration ratio is calculated based on the duration encoding discussed in the previous section. The following equation demonstrates the method of computing the duration ratio for a note or a rest:

$\text{duration ratio}(n) = \dfrac{\text{duration}(n)}{\text{duration}(n-1)}$ (3.3)


for n = 2,3,4,…,N where N is the number of notes (including rests) in a folk song.

The value for the duration ratio can be either positive or negative. A negative duration ratio denotes a change from a note to a rest or vice versa. Positive duration ratios denote the ratio between two notes or two rests. The absolute values of the duration ratios determine the pattern of the change. An absolute value of ‘1’ denotes no change between the two consecutive durations. Absolute duration ratio values that are greater than one denote a change from a shorter duration to a longer duration, while absolute values of duration ratios that are smaller than one denote changes from longer durations to shorter durations. Figure 3.12 depicts an example of a Jiangsu folk song encoded using the duration ratio representation.
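A minimal sketch of (3.3) in Python (illustrative only), assuming rests carry negative encoded durations as in Table 3.2 so that note-to-rest changes produce negative ratios automatically:

```python
from fractions import Fraction

def duration_ratio_sequence(duration_seq):
    """Duration ratios between successive encoded durations (rests are negative)."""
    return [Fraction(duration_seq[n]) / Fraction(duration_seq[n - 1])
            for n in range(1, len(duration_seq))]

# Example: durations [1, 1/2, -1/2, 1] (crotchet, quaver, quaver rest, crotchet)
# give ratios [1/2, -1, -2]: shortening, note-to-rest, rest-to-note.
```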

Figure 3.12: An example of a Jiangsu folk song encoded using duration ratio representation.


3.4 The Musical Feature Density Map

An efficient music encoding scheme must preserve the most significant musical information. In this section, a new encoding scheme for constructing music feature vectors using the music information encoded in Section 3.3 will be introduced.

The musical feature density map (MFDMap) is an encoding method proposed for constructing music feature vectors from monophonic folk song melodies using information derived from the two fundamental music elements: the musical pitch and musical duration. The concept of using these music elements corresponds to the main ethnomusicology ideas of geographically based classification of Han folk songs discussed in [99] and [100]. In addition, as these music elements are the most essential elements that must be present in all types of music, they exist in all formats of (digitized) music representations. Hence, from the practical perspective, the MFDMap can be used to construct music feature vectors from any music format.

The musical feature density map is a size v vector where each index of the vector represents an encoded representation of one of the four defined music elements: solfege, interval, duration and duration ratio. The content of each index is the percentage of the occurrence of that particular music representation within a folk song. The vector size, v, is flexible and can be altered accordingly to match the type of music it represents. In order to construct the MFDMap, a folk song has to be encoded into its solfege representation, interval representation, duration representation and duration ratio representation. Then, a list of the encoded music representation for each of the four music elements is constructed and the percentage of occurrence frequency of these representations is calculated using the following equation:

$\text{content of index}_i = \dfrac{\text{occurrence frequency of music representation}_i}{N} \times 100$ (3.4)

for i = 1,2,3,…,v, where N is the total number of notes and rests in a folk song and, regardless of the number of notes tied together, each set of tied notes is counted as one note. Note that instead of N, N – (number of rests) – 1 is used for the interval representation and N – 1 is used for the duration ratio representation.


The arrangement of the music representations in the vector is in the ascending order ranging from the smallest value to the largest value for each of the four elements and according to the sequence: solfege, interval, duration then duration ratio. Figure 3.13 shows the flow chart for constructing a MFDMap from a folk song.
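The construction in Steps 5 to 7 can be sketched as follows (illustrative code, not the thesis implementation). Because the interval sequence already excludes rests and both difference sequences are one element shorter than the note sequence, dividing each element's counts by the length of its own sequence reproduces the denominators described for (3.4).

```python
from collections import Counter

def build_mfdmap(solfege, interval, duration, duration_ratio, unified_lists):
    """Build an MFDMap from the four encoded sequences of one folk song.

    unified_lists: dict with keys 'solfege', 'interval', 'duration', 'ratio',
    each holding the ascending, unified list of admissible encoded values.
    """
    sequences = {'solfege': solfege, 'interval': interval,
                 'duration': duration, 'ratio': duration_ratio}
    mfdmap = []
    for element in ('solfege', 'interval', 'duration', 'ratio'):
        seq = sequences[element]
        counts = Counter(seq)
        for value in unified_lists[element]:          # fixed ascending order
            mfdmap.append(100.0 * counts.get(value, 0) / len(seq))
    return mfdmap          # length equals the total number of representations, e.g. 172
```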

Silences in music are created through the use of rests. These silences are often as important as the sounded parts. In this thesis, the contribution of silences in Han Chinese folk songs towards the classification performance is examined through a comparison between two cases. In the first case, all folk songs are encoded into their respective MFDMap representation by considering all sounded notes and rests. In the second case, rests in all folk songs are removed and the folk songs are encoded as if there is no rest in any of the songs.

By using a Shanxi folk song – Zou Xi Kou as an example of including rests, Figure 3.14, Table 3.3 and Figure 3.15 demonstrate the steps taken to construct the MFDMap. Figure 3.14 shows the music score of the folk song together with the encoded solfege representation, interval representation, duration representation and duration ratio representation (Step 1 to Step 4). Table 3.3 shows the combined outcome from Step 5, Step 6 and Step 7, that is, the sorted list of encoded music representations for each of the four music elements: solfege, interval, duration and duration ratio, together with their corresponding occurrence percentage. The column “Vector Index” in Table 3.3 is used to identify the position of each encoded music representation in the music feature vector. Finally, Figure 3.15 shows the final MFDMap, a music feature vector used to represent the Shanxi folk song, Zou Xi Kou. In this particular example, N is 45 and the vector size, v, is 39.


The flow chart in Figure 3.13 starts from a folk song and ends with its MFDMap, via the following steps:

Step 1: Encode the folk song into solfege representation.
Step 2: Encode the folk song into interval representation.
Step 3: Encode the folk song into duration representation.
Step 4: Encode the folk song into duration ratio representation.
Step 5: Sort all values in each representation in ascending order.
Step 6: Group the sorted values in the following sequence: solfege, interval, duration and duration ratio.
Step 7: Calculate the occurrence percentage of each value.

Figure 3.13: The flow chart for constructing a MFDMap.


Figure 3.14: The music score and the encoded solfege, interval, duration and duration ratio representations (Step 1 to 4 in constructing Case 1 MFDMap).


Table 3.3: List of encoded music representations and their respective occurrence percentage (Step 5 to 7 in constructing Case 1 MFDMap).

| Music Element (Total) | Encoded Music Representation | Occurrence Frequency | Occurrence Percentage | Vector Index |
|---|---|---|---|---|
| Solfege (45) | 0 | 2 | 4.44 | 1 |
| | 27 | 1 | 2.22 | 2 |
| | 29 | 3 | 6.67 | 3 |
| | 32 | 4 | 8.89 | 4 |
| | 34 | 3 | 6.67 | 5 |
| | 36 | 2 | 4.44 | 6 |
| | 37 | 12 | 26.67 | 7 |
| | 39 | 7 | 15.56 | 8 |
| | 41 | 7 | 15.56 | 9 |
| | 42 | 1 | 2.22 | 10 |
| | 44 | 3 | 6.67 | 11 |
| Interval (42) | -8 | 1 | 2.38 | 12 |
| | -4 | 5 | 11.90 | 13 |
| | -3 | 5 | 11.90 | 14 |
| | -2 | 9 | 21.43 | 15 |
| | -1 | 2 | 4.76 | 16 |
| | 0 | 5 | 11.90 | 17 |
| | 2 | 9 | 21.43 | 18 |
| | 3 | 1 | 2.38 | 19 |
| | 5 | 4 | 9.52 | 20 |
| | 8 | 1 | 2.38 | 21 |
| Duration (45) | -1/2 | 2 | 4.44 | 22 |
| | 1/4 | 10 | 22.22 | 23 |
| | 1/2 | 24 | 53.33 | 24 |
| | 3/4 | 4 | 8.89 | 25 |
| | 1 | 1 | 2.22 | 26 |
| | 2 | 3 | 6.67 | 27 |
| | 5/2 | 1 | 2.22 | 28 |
| Duration Ratio (44) | -1 | 4 | 9.09 | 29 |
| | 1/5 | 1 | 2.27 | 30 |
| | 1/4 | 2 | 4.55 | 31 |
| | 1/3 | 4 | 9.09 | 32 |
| | 1/2 | 4 | 9.09 | 33 |
| | 1 | 14 | 31.82 | 34 |
| | 3/2 | 4 | 9.09 | 35 |
| | 2 | 7 | 15.91 | 36 |
| | 4 | 2 | 4.55 | 37 |
| | 5 | 1 | 2.27 | 38 |
| | 8 | 1 | 2.27 | 39 |


Figure 3.15: The Case 1 MFDMap for Shanxi folk song – Zou Xi Kou.

As can be seen from the example, the list of encoded music representations and the size of the MFDMap are dependent on the particular folk song they represent. Hence, all 333 Han Chinese folk songs studied in this thesis will result in MFDMaps of various lengths and different encoded music representation lists. In order to achieve unity among all folk songs, a standard list of encoded music representations needs to be defined. In Section 3.3, the range for the solfege representation is set as two octaves below the principal octave up to two octaves above the principal octave, i.e. it ranges from 1 to 60. The duration list contains 22 durations. One method to define a standard list is to include all 60 solfege representations, all possible combinations of intervals between these 60 solfeges, the 22 durations and all possible duration ratios that can be computed using these 22 durations. However, this will result in an enormous list of representations with many redundant representations that are inapplicable to Chinese folk songs.

A more practical method of defining a standard list of representations is to combine the representation list of all 333 folk songs to form a unified list of encoded music representations. Thus, all the possibilities that do not occur in any of the 333 folk songs are excluded. The list of encoded music representations that are to be used to construct the MFDMap for all 333 folk songs is as follows:


Solfege Representation: 0, 13, 15, 17, 18, 19, 20, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 41, 42, 43, 44, 46 and 49.

Interval Representation: -15, -14, -12, -10, -9, -8, -7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15, 16, 17 and 19.

Duration Representation: -3, -2, -1.5, -1, -0.75, -0.5, -0.25, 0.0833, 0.1667, 0.25, 0.3333, 0.375, 0.5, 0.6667, 0.75, 1, 1.1667, 1.25, 1.5, 1.75, 2, 2.25, 2.5, 2.75, 3, 3.25, 3.5, 4, 4.5 and 5.

Duration Ratio Representation: -8, -6, -4, -3, -2, -1.5, -1.3333, -1, -0.75, -0.6667, -0.5, -0.4, -0.3333, -0.2857, -0.25, -0.2, -0.1667, -0.1429, -0.125, -0.1111, -0.0909, 0.0417, 0.05, 0.0667, 0.0714, 0.0833, 0.1, 0.1111, 0.125, 0.1429, 0.1667, 0.1875, 0.2, 0.2143, 0.2222, 0.25, 0.2857, 0.3, 0.3333, 0.375, 0.4, 0.4286, 0.4444, 0.5, 0.625, 0.6667, 0.7273, 0.75, 0.8, 0.8571, 0.875, 1, 1.2308, 1.3333, 1.5, 2, 2.25, 2.5, 2.6667, 2.75, 3, 3.5, 4, 4.5, 5, 5.3333, 6, 7, 7.5, 8, 9, 10, 11, 12, 13, 14, 16, 18, 20 and 24.

The list contains 31 solfege representations, 31 interval representations, 30 duration representations and 80 duration ratio representations, which sum up to a total of 172 representations.

The unified list of encoded music representations for constructing MFDMap for all 333 folk songs excluding rests is derived as before. The unified list of encoded representations for the second case contains 30 solfege representations, 31 interval representations, 23 duration representations and 61 duration ratio representations, which sum up to a total of 145 representations. These representations are listed as follows:

Solfege Representation: 13, 15, 17, 18, 19, 20, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 41, 42, 43, 44, 46 and 49.

Interval Representation: -15, -14, -12, -10, -9, -8, -7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15, 16, 17 and 19.

Duration Representation: 0.0833, 0.1667, 0.25, 0.3333, 0.375, 0.5, 0.6667, 0.75, 1, 1.1667, 1.25, 1.5, 1.75, 2, 2.25, 2.5, 2.75, 3, 3.25, 3.5, 4, 4.5 and 5.

Duration Ratio Representation: 0.0417, 0.05, 0.0625, 0.0667, 0.0714, 0.0833, 0.0909, 0.1, 0.1111, 0.125, 0.1429, 0.1667, 0.1875, 0.2, 0.2143, 0.2222, 0.25, 0.2858, 0.3, 0.3333, 0.375, 0.4, 0.4286, 0.4444, 0.5, 0.625, 0.667, 0.7273, 0.75, 0.8, 0.8571, 0.875, 1, 1.2308, 1.3333, 1.5, 2, 2.25, 2.5, 2.6667, 2.75, 3, 3.5, 4, 4.5, 5, 5.3333, 6, 7, 7.5, 8, 9, 10, 11, 12, 13, 14, 16, 18, 20 and 24.

Figure 3.16, Table 3.4 and Figure 3.17 demonstrate the steps to construct a MFDMap that adheres to the second case (ignoring the rests) using the same folk song. Figure 3.16 shows the music score of the folk song together with the encoded solfege representation, interval representation, duration representation and duration ratio representation (Step 1 to Step 4). The differences in the encoded representations (between the first case and the second case) are highlighted in Figure 3.16. Table 3.4 then shows the combined outcome from Step 5, Step 6 and Step 7, that is, the sorted list of encoded music representations for each of the four music elements: solfege, interval, duration and duration ratio, together with their corresponding occurrence percentage. The column “Vector Index” in Table 3.4 is used to identify the position of each encoded music representation in the music feature vector. Finally, Figure 3.17 shows the final MFDMap, with rests excluded. For this example, N is 43 and the vector size, v, is 36. It is important to note that in the second case, all rests are omitted. Hence, the N in (3.4) is the total number of notes (only notes, rests are not included) in a folk song and instead of N, N – 1 is used for both interval and duration ratio representations.

3.4.1 Advantage of the Musical Feature Density Map

The musical feature density map is designed to incorporate the ethnomusicology theory into the structure of the music feature vector. The main characteristic of the MFDMap is that it utilizes the occurrence frequency of the music elements in folk songs to distinguish the differences between Han Chinese folk songs of various geographical origins.

The main advantage of the MFDMap is that the need to segment folk songs into fragments can be completely avoided and that each folk song is analyzed as a whole. Hence, the problems of finding a representative window size and the representative window position can be easily avoided. In addition, since no segmentation of folk songs is done, the loss of continuity and integrity of music can be prevented.


Figure 3.16: The music score and the encoded solfege, interval, duration and duration ratio representations (Step 1 to 4 in constructing Case 2 MFDMap — rests omitted).


Table 3.4: List of encoded music representations and their respective occurrence percentage (Step 5 to 7 in constructing Case 2 MFDMap — rests omitted).

| Music Element (Total) | Encoded Music Representation | Occurrence Frequency | Occurrence Percentage | Vector Index |
|---|---|---|---|---|
| Solfege (43) | 27 | 1 | 2.33 | 1 |
| | 29 | 3 | 6.98 | 2 |
| | 32 | 4 | 9.30 | 3 |
| | 34 | 3 | 6.98 | 4 |
| | 36 | 2 | 4.65 | 5 |
| | 37 | 12 | 27.91 | 6 |
| | 39 | 7 | 16.28 | 7 |
| | 41 | 7 | 16.28 | 8 |
| | 42 | 1 | 2.33 | 9 |
| | 44 | 3 | 6.98 | 10 |
| Interval (42) | -8 | 1 | 2.38 | 11 |
| | -4 | 5 | 11.90 | 12 |
| | -3 | 5 | 11.90 | 13 |
| | -2 | 9 | 21.43 | 14 |
| | -1 | 2 | 4.76 | 15 |
| | 0 | 5 | 11.90 | 16 |
| | 2 | 9 | 21.43 | 17 |
| | 3 | 1 | 2.38 | 18 |
| | 5 | 4 | 9.52 | 19 |
| | 8 | 1 | 2.38 | 20 |
| Duration (43) | 1/4 | 10 | 23.26 | 21 |
| | 1/2 | 24 | 55.81 | 22 |
| | 3/4 | 4 | 9.30 | 23 |
| | 1 | 1 | 2.33 | 24 |
| | 2 | 3 | 6.98 | 25 |
| | 5/2 | 1 | 2.33 | 26 |
| Duration Ratio (42) | 1/5 | 1 | 2.38 | 27 |
| | 1/4 | 2 | 4.76 | 28 |
| | 1/3 | 4 | 9.52 | 29 |
| | 1/2 | 4 | 9.52 | 30 |
| | 1 | 16 | 38.10 | 31 |
| | 3/2 | 4 | 9.52 | 32 |
| | 2 | 7 | 16.67 | 33 |
| | 4 | 2 | 4.76 | 34 |
| | 5 | 1 | 2.38 | 35 |
| | 8 | 1 | 2.38 | 36 |


Figure 3.17: The Case 2 MFDMap for Shanxi folk song – Zou Xi Kou.

Musical works 4 are often of varied length. In order to perform machine classification on these musical works, feature vectors need to be constructed. Research in music classification employed the windowing method on each of the musical works in the database when constructing feature vectors. In the windowing method, instead of considering the musical work as a whole, each musical work is broken into one or more (depending on the window size) fixed size fragments. These music fragments are then encoded using the similar encoding technique discussed in Section 3.3 and feature vectors are constructed using these encoded fragments of music.

The main limitation of the windowing method is that there is no fixed standard for the size of the window. The choice of the window size determines the amount of information captured to define the characteristics of a musical work. In many cases, a small window size often poses the problem of having repeated window content for different musical works, which results in non-unique representations of completely independent musical works. On the other hand, an oversized window might result in information overload where redundant information is jumbled with useful information

4 Musical works here refer to all types of music, which includes folk music, folk songs and other composed music and songs.


which disguises the identity of the musical work and complicates the analysis process. Hence, varied window size often results in diverse classification performance. Therefore, exhaustive testing is usually needed to determine the most suitable window size for the classification task.

In addition to the size, the positioning of the window within a musical work is an issue that needs careful consideration. In music, each musical form has its own musical structure. Each structure usually presents a specific musical content or theme at a different musical section within the whole musical work. In other words, different musical sections contain different music information. Hence, it is important to choose the most representative position to place the window when segmenting a musical work in order to capture useful and meaningful music information.

The musical feature density map does not employ the windowing method. Instead, the MFDMap attempts to overcome the limitations of the windowing method by utilizing features that can meaningfully define a complete folk song without the need to segment it into fragments. In order to achieve this, the frequency of occurrence of the music elements is used. This method is new in machine classification of music but has ethnomusicological grounds. Béla Bartók – the great Hungarian composer, pianist, teacher and scholar – used the frequency of occurrence in his Serbo-Croatian folk song research, where certain conclusions regarding the musical characteristics and texture were made based on these statistics [104]. In addition, Miao and Qiao’s research in geographically based Han Chinese folk song classification [100] also involves investigation of frequently occurring music elements and patterns.

In typical cases, folk songs from varied geographical origins can be differentiated by the high occurrence of certain music elements or the absence of them [100]. This is effectively modeled in the MFDMap. By encapsulating this useful information, the MFDMap makes the differences between the five classes of folk songs more obvious. This can be seen from Figure 3.18 to Figure 3.32. Figure 3.18 to Figure 3.22 are examples of folk song from each of the five classes using the windowing method. In these examples, the window size of 10 musical notes is employed and the folk songs are encoded using the technique discussed in Section 3.3. All four music elements: solfege, interval, duration and duration ratio are used. Figure 3.23 to Figure


3.27 are the same examples of folk songs using the first case MFDMap method where both notes and rests are taken into consideration. Figure 3.28 to Figure 3.32 are the examples using the second case MFDMap where rests are omitted from folk songs. Notice that the various locations of the peaks and the spread reveal patterns for differentiating between the five classes.

Figure 3.18: Example of Class 1 folk song using windowing method.


Figure 3.19: Example of Class 2 folk song using windowing method.

Figure 3.20: Example of Class 3 folk song using windowing method.


Figure 3.21: Example of Class 4 folk song using windowing method.

Figure 3.22: Example of Class 5 folk song using windowing method.


Figure 3.23: Example of Class 1 folk song using Case 1 MFDMap.

Figure 3.24: Example of Class 2 folk song using Case 1 MFDMap.


Figure 3.25: Example of Class 3 folk song using Case 1 MFDMap.

Figure 3.26: Example of Class 4 folk song using Case 1 MFDMap.


Figure 3.27: Example of Class 5 folk song using Case 1 MFDMap.

Figure 3.28: Example of Class 1 folk song using Case 2 MFDMap.


Figure 3.29: Example of Class 2 folk song using Case 2 MFDMap.

Figure 3.30: Example of Class 3 folk song using Case 2 MFDMap.


Figure 3.31: Example of Class 4 folk song using Case 2 MFDMap.

Figure 3.32: Example of Class 5 folk song using Case 2 MFDMap.


3.4.2 Future Enhancement to the Musical Feature Density Map

The musical feature density map is initially designed to address the task of Han Chinese folk song classification where folk songs consist of a single melody line. It is possible that the same concept can be extended to accommodate polyphonic music. However, the design of the structure of the MFDMap needs careful consideration to accommodate changes in the number of parts in a piece of music.

The MFDMap uses four music elements: solfege, interval, duration and duration ratio. This is sufficient for geographical based Han Chinese folk song classification but might not be adequate for other music classification tasks. In classification tasks that involve other types of music, for example, symphony or other instrumental music, there is often more than one instrument involved in the whole musical work. In this case, the current MFDMap is not capable of accommodating the music information that is needed to define the various instruments that are involved. (It is to be noted that folk songs usually involve only human voices).


Chapter 4

The Extreme Learning Machine Folk Song Classifier

The multi-layer perceptron (MLP) is one of the classic neural network architectures and is very popular in the pattern recognition domain. Hornik [2] and Hornik, Stinchcombe and White [3] have proved that a multi-layer perceptron, using any arbitrarily bounded non-constant activation function, is a universal approximator. Such a network is capable of approximating any function if given a sufficient number of hidden neurons and a sufficiently large training set. The extreme learning machine (ELM), an emerging technique that utilizes the structure of a single-hidden layer feedforward neural network (SLFN), is receiving increased attention in the pattern recognition domain. However, only one previous study in the music classification area has employed this technique. In this chapter, the ELM algorithm and its enhanced variant called the regularized extreme learning machine (R-ELM) are employed as the neural network based music classifier. The performance of these machine learning algorithms in a complex real-world multi-class classification task, namely Han Chinese folk song classification, will be investigated. This chapter begins with an outline of the ELM and the R-ELM algorithms. This is followed by discussions on the design and results of the experiments for the study of machine classification of Han Chinese folk songs using both the ELM and the R-ELM.


4.1 Introduction

The single-hidden layer feedforward neural network is the simplest and most popular structure of the multi-layer perceptron. It has been found in [105] that the SLFNs with any continuous bounded nonlinear activation functions or any arbitrary (continuous or non-continuous) bounded activation function which has unequal limits at infinities can approximate any continuous function and implement any classification application with a sufficiently large number of hidden neurons.

The functionality of a single-hidden layer feedforward neural network is its capability in learning a suitable mapping from a given data set. The learning in neural network is based on the definition of a suitable error function, which is then minimized with respect to the weights and biases in the network. Therefore, the learning process comprises two stages. In the first stage, the derivatives of the error function with respect to the weights are evaluated. In the second stage, the derivatives are then used to compute weight values which minimize the error function by using an optimization method and the weights are adjusted accordingly.

The gradient descent-based algorithms are the most commonly used non-linear learning algorithm for the SLFNs. The process of gradient descent-based algorithms begins with a random selection of initial weights for the network, resulting in the network being placed at a random position on the error surface. Then, these weights are modified in an iteratively step-by-step process where each step taken on the error surface is in a direction that reduces the error. The direction is calculated using the gradient of the error surface at the current position.

The main advantage of the gradient descent-based methods is the relatively simple computation of the algorithm. However, these methods possess two major bottlenecks: the very slow learning speed and the issue of converging to local minima. In the conventional gradient descent-based learning algorithms, all parameters (weights and biases) of the neural network need iterative tuning to improve the learning performance. This iterative process is time consuming and resource consuming. In addition, gradient descent-based algorithms also suffer from the problem of choosing


the learning step that gives good convergence. In the iterative learning process, a gradient descent-based algorithm searches for the solution by gradually descending the slope of the cost function and finally arrives at a minimum. However, due to the unknown shape of the cost function, the function curve might contain more than one minimum. In the search process, depending on the random initial starting point, the gradient descent-based algorithm will stop at the nearest minimum, which might just be a local minimum and not the global minimum. Hence, the solution obtained will be non-optimal.

The extreme learning machine algorithm proposed by Huang, Zhu and Siew [33] has been proved to be capable of overcoming the above-mentioned limitations through its technique of parameter assignment.

4.2 Extreme Learning Machine

It has been shown in [33] that a single-hidden layer feedforward neural network with arbitrarily chosen input weights and hidden layer biases can be viewed as a linear system and the output weights (connecting the hidden layer to the output layer) of this SLFN can be analytically determined through a simple generalized inverse operation on the hidden layer output matrix. Such concepts form the foundation of the extreme learning machine.

In the ELM algorithm, the input weights and hidden layer biases of the SLFN are randomly assigned and the optimal output weights of the SLFN are deterministically computed using the Moore-Penrose generalized inverse of the hidden layer outputs. Hence, the ELM’s learning speed can be many times faster than conventional gradient descent-based algorithms while obtaining better performance. According to Bartlett’s [106] theory on the generalization performance of feedforward neural networks, for a feedforward neural network reaching smaller training error, the generalization performance of the network tends to be better if the norm of the network weights is smaller. The ELM algorithm tends to reach the smallest training error and the smallest norm of weights through the output weights computation using the generalized inverse operation. Hence, the ELM algorithm tends to have good generalization performance for feedforward neural networks.

The ELM algorithm utilizes the structure of a single-hidden layer feedforward neural network as in Figure 4.1. The overview of the ELM algorithm is described in the following.

Figure 4.1: A single-hidden layer feedforward neural network.

For a dataset with N distinct samples $\{(\mathbf{X},\mathbf{T}) \mid \mathbf{X} = [\mathbf{x}_1,\mathbf{x}_2,\ldots,\mathbf{x}_N],\ \mathbf{T} = [\mathbf{t}_1,\mathbf{t}_2,\ldots,\mathbf{t}_N]\}$, where $\mathbf{x}_i = [x_{i1},x_{i2},\ldots,x_{in}]^T \in \mathbb{R}^n$ is the input vector and $\mathbf{t}_i = [t_{i1},t_{i2},\ldots,t_{im}]^T \in \mathbb{R}^m$ is the target vector, the SLFN with $\tilde{N}$ hidden neurons can be written as

$$\sum_{j=1}^{\tilde{N}} \boldsymbol{\beta}_j\, g(\mathbf{w}_j \cdot \mathbf{x}_i + b_j) = \mathbf{o}_i \qquad (4.1)$$

for $i = 1,2,\ldots,N$, where $\boldsymbol{\beta}_j = [\beta_{j1}, \beta_{j2},\ldots, \beta_{jm}]^T$ is the output weight vector connecting the $j$th hidden neuron and the output neurons, $\mathbf{w}_j = [w_{j1}, w_{j2},\ldots, w_{jn}]^T$ is the input weight vector connecting the input neurons and the $j$th hidden neuron, $b_j$ is the bias of the $j$th hidden neuron, $\mathbf{w}_j \cdot \mathbf{x}_i$ denotes the inner product of $\mathbf{w}_j$ and $\mathbf{x}_i$, $g(x)$ is the activation function and $\mathbf{o}_i = [o_{i1},o_{i2},\ldots,o_{im}]^T \in \mathbb{R}^m$ is the output vector with respect to the input vector $\mathbf{x}_i = [x_{i1},x_{i2},\ldots,x_{in}]^T$. It is to be noted that the output neurons are linear, i.e. the activation function of the output neurons is a linear function.

For such a SLFN with Ñ hidden neurons and activation function g(x) to approximate N data samples with zero error, i.e.

$$\sum_{i=1}^{N} \left\| \mathbf{o}_i - \mathbf{t}_i \right\| = 0, \qquad (4.2)$$

there exist βj, wj and bj such that

$$\sum_{j=1}^{\tilde{N}} \boldsymbol{\beta}_j\, g(\mathbf{w}_j \cdot \mathbf{x}_i + b_j) = \mathbf{t}_i \qquad (4.3)$$

for $i = 1,2,\ldots,N$. (4.3) can then be written compactly in matrix form as $\mathbf{H}\boldsymbol{\beta} = \mathbf{T}$, where

$$\mathbf{H}(\mathbf{w}_1,\ldots,\mathbf{w}_{\tilde{N}}, b_1,\ldots,b_{\tilde{N}}, \mathbf{x}_1,\ldots,\mathbf{x}_N) =
\begin{bmatrix}
g(\mathbf{w}_1 \cdot \mathbf{x}_1 + b_1) & g(\mathbf{w}_2 \cdot \mathbf{x}_1 + b_2) & \cdots & g(\mathbf{w}_{\tilde{N}} \cdot \mathbf{x}_1 + b_{\tilde{N}}) \\
g(\mathbf{w}_1 \cdot \mathbf{x}_2 + b_1) & g(\mathbf{w}_2 \cdot \mathbf{x}_2 + b_2) & \cdots & g(\mathbf{w}_{\tilde{N}} \cdot \mathbf{x}_2 + b_{\tilde{N}}) \\
\vdots & \vdots & \ddots & \vdots \\
g(\mathbf{w}_1 \cdot \mathbf{x}_N + b_1) & g(\mathbf{w}_2 \cdot \mathbf{x}_N + b_2) & \cdots & g(\mathbf{w}_{\tilde{N}} \cdot \mathbf{x}_N + b_{\tilde{N}})
\end{bmatrix}_{N \times \tilde{N}}, \qquad (4.4)$$

$$\boldsymbol{\beta} = \begin{bmatrix} \boldsymbol{\beta}_1^T \\ \boldsymbol{\beta}_2^T \\ \vdots \\ \boldsymbol{\beta}_{\tilde{N}}^T \end{bmatrix}_{\tilde{N} \times m} \qquad \text{and} \qquad (4.5)$$

$$\mathbf{T} = \begin{bmatrix} \mathbf{t}_1^T \\ \mathbf{t}_2^T \\ \vdots \\ \mathbf{t}_N^T \end{bmatrix}_{N \times m}. \qquad (4.6)$$

H is known as the hidden layer output matrix of the neural network [107-108] where the jth column of H is the jth hidden neuron outputs with respect to inputs x1,x2,…,xN.

As mentioned before, the ELM algorithm randomly assigns the input weights and hidden layer biases of the SLFN and deterministically computes the output weights using the Moore-Penrose generalized inverse operation on the hidden layer output matrix, H. Hence, the output weights, β, of an SLFN trained with the ELM algorithm can be computed as follows:

$$\boldsymbol{\beta} = \left(\mathbf{H}^T\mathbf{H}\right)^{-1}\mathbf{H}^T\mathbf{T}. \qquad (4.7)$$
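As an illustration of the algorithm summarized above, the following is a minimal sketch of ELM training and prediction. NumPy is assumed, the hyperbolic tangent is used as the hidden activation, and the Moore-Penrose pseudo-inverse is used for the output weights, which coincides with (4.7) when H has full column rank; all identifiers are illustrative rather than the author's implementation.

```python
import numpy as np

def elm_train(X, T, n_hidden, rng=None):
    """Train an SLFN with the ELM algorithm: random input weights and biases,
    output weights from the Moore-Penrose generalized inverse of H."""
    rng = np.random.default_rng(0) if rng is None else rng
    n_features = X.shape[1]
    W = rng.uniform(-1.0, 1.0, (n_hidden, n_features))   # random input weights
    b = rng.uniform(-1.0, 1.0, n_hidden)                  # random hidden biases
    H = np.tanh(X @ W.T + b)                              # hidden layer output matrix (N x Ñ)
    beta = np.linalg.pinv(H) @ T                          # output weights, cf. eq. (4.7)
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Linear output neurons: outputs are the hidden activations times beta."""
    return np.tanh(X @ W.T + b) @ beta
```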

4.3 Regularized Extreme Learning Machine

The extreme learning machine has attracted much attention in the pattern recognition domain owing to its extremely fast learning speed and relatively good generalization performance. However, the ELM is an empirical risk minimization (ERM) based algorithm and therefore tends to generate models that overfit. In addition, the ELM algorithm provides weak control capacity as it directly calculates the minimum norm least-squares solution [34]. In order to overcome these drawbacks, Deng, Zheng and Chen [34] proposed an improved algorithm called the regularized extreme learning machine. This variant of the ELM is based on the structural risk minimization (SRM) principle of statistical learning theory and is hence able to provide better generalization capability than the ELM algorithm.

Similar to the ELM algorithm, the R-ELM utilizes the single-hidden layer feedforward neural network structure, and the input weights and hidden layer biases of the SLFN are also randomly assigned. However, in the R-ELM algorithm, Lagrange multipliers are utilized in the output weight optimization.

The definition of the SLFN with $\tilde{N}$ hidden neurons and m output neurons using the R-ELM algorithm is the same as for the ELM, (4.1) to (4.6). In order to improve the generalization performance, a weighting factor, γ, is introduced in the R-ELM to regularize the proportion between the empirical risk, represented by the sum of squared errors $\|\boldsymbol{\varepsilon}\|^2$, and the structural risk $\|\boldsymbol{\beta}\|^2$.

The proposed mathematical model for computing the output weights of the SLFN is as follows [34]:


$$\text{Minimize } \left\{ \frac{\gamma}{2}\|\boldsymbol{\varepsilon}\|^2 + \frac{1}{2}\|\boldsymbol{\beta}\|^2 \right\} \qquad (4.8)$$

subject to $\boldsymbol{\varepsilon} = \mathbf{O} - \mathbf{T} = \mathbf{H}\boldsymbol{\beta} - \mathbf{T} \qquad (4.9)$

where γ is a constant balancing parameter for adjusting the balance between the empirical risk and the structural risk. This problem can be solved by using the method of Lagrange multipliers:

$$L = \frac{\gamma}{2}\sum_{i=1}^{N}\sum_{j=1}^{m}\varepsilon_{ij}^2 + \frac{1}{2}\sum_{i=1}^{\tilde{N}}\sum_{j=1}^{m}\beta_{ij}^2 - \sum_{k=1}^{N}\sum_{p=1}^{m}\lambda_{kp}\left(\mathbf{h}_k^T\boldsymbol{\beta}_p - T_{kp} - \varepsilon_{kp}\right) \qquad (4.10)$$

where εij is the ijth element of the error matrix ε, βij is the ijth element of the output weight matrix β, Tij is the ijth element of the output data matrix T, hk is the kth row of the hidden layer output matrix H (written as a column vector), βj is the jth column of the output weight matrix β, λij is the ijth Lagrange multiplier and γ is the constant parameter used to adjust the empirical risk. Differentiating L in (4.10) with respect to (βij, εij) and setting the derivatives to zero gives

$$\frac{\partial L}{\partial \beta_{ij}} = 0 \;\Rightarrow\; \boldsymbol{\beta} = \mathbf{H}^T\boldsymbol{\lambda} \qquad \text{and} \qquad (4.11)$$

$$\frac{\partial L}{\partial \varepsilon_{ij}} = 0 \;\Rightarrow\; \boldsymbol{\lambda} = -\gamma\boldsymbol{\varepsilon}. \qquad (4.12)$$

Considering the constraint in (4.9), (4.12) can be expressed as

$$\boldsymbol{\lambda} = -\gamma\left(\mathbf{H}\boldsymbol{\beta} - \mathbf{T}\right). \qquad (4.13)$$

Using (4.13) in (4.11) leads to the computation of the output weight matrix, β, of the SLFN:

$$\boldsymbol{\beta} = \left(\frac{\mathbf{I}}{\gamma} + \mathbf{H}^T\mathbf{H}\right)^{-1}\mathbf{H}^T\mathbf{T}. \qquad (4.14)$$

In [34], (4.14) is known as the unweighted regularized extreme learning machine, and the ELM is a special case of the unweighted R-ELM when γ → ∞.
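A corresponding sketch of the R-ELM output weight computation in (4.14) is given below, under the same assumptions as the previous listing (NumPy; H already computed from random input weights and biases). As γ grows large the regularization term vanishes and the solution approaches the ELM solution.

```python
import numpy as np

def relm_output_weights(H, T, gamma):
    """Regularized ELM output weights, cf. eq. (4.14):
    beta = (I/gamma + H^T H)^{-1} H^T T."""
    n_hidden = H.shape[1]
    A = np.eye(n_hidden) / gamma + H.T @ H
    return np.linalg.solve(A, H.T @ T)   # solve instead of an explicit inverse
```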


4.4 Experiment Design and Setting

This section describes the design of the experiments used to study the capability of machines in classifying five different classes of Han Chinese folk songs defined by geographical region. In order for a machine to perform a music classification task, it first needs to learn the characteristics that distinguish the folk songs. Knowledge acquisition is accomplished through the neural network learning process, and the effectiveness of the learning is verified in the testing phase, where the machine classifier is presented with a new set of folk songs (folk songs not used in learning) and must classify them based on the learned knowledge.

The process of translating folk songs into useful musical information that can be processed by the machine classifier is explained in the following sections, together with descriptions of the settings of the machine classifier used for the classification task.

4.4.1 Data Pre-Processing and Post-Processing

Data Pre-Processing

In order to perform machine classification of Han Chinese folk songs, each of the folk songs needs to go through a pre-processing phase where, firstly, useful music information is derived from the folk songs and then converted into meaningful representations. Next, musical feature density maps (MFDMaps) are constructed from these representations. Each of these MFDMaps is a musical feature vector which is then used as an input vector to the neural network classifier. The techniques and procedures for each stage of the pre-processing phase are covered in detail in Chapter 3 and will not be repeated here.


As explained in Chapter 3, there are two cases of MFDMap design: (1) including all notes and rests in the folk songs and (2) only considering sounded notes, i.e. omitting rests in the folk songs.

Feature Selection and Dimensionality Reduction

As described in Chapter 3, the initial MFDMap of each folk song varies in size because different folk songs contain different values of the music elements. In order to standardize the size of the final musical feature vector, the MFDMap is normalized to include all musical values that appear in any of the 333 folk songs. This results in a larger MFDMap.

Increasing the number of features in the feature vector might improve performance, but this is not always the case. Under the curse-of-dimensionality phenomenon [109-110], increasing the number of features can instead degrade performance, and this is a recurring problem when working with a limited number of data samples. Learning a "state-of-nature" from a finite number of data samples in a high-dimensional feature space, with each feature having a number of possible values, requires an enormous amount of training data in order to ensure that there are several samples for each combination of values. Thus, with a limited quantity of data samples, increasing the dimensionality of the feature space quickly leads to the point where the data is very sparse, in which case it provides a very poor representation of the mapping.

From the discussion above, working with a large MFDMap might not achieve good classification performance. Hence, it might be necessary to reduce the size of the MFDMap in order to achieve better classification performance. There is a range of feature selection methods with which reduced-size feature vectors can be constructed from selected features that are more "significant" or "representative". Although few of these methods can guarantee an optimal feature selection, they often significantly improve the classification results compared with cases where no feature selection is done.


For the experiments in this chapter, the feature selection method is based on the significance of a particular musical feature1. A feature is considered significant to a particular class if a certain proportion of the members of that class possess that feature. For each of the five classes, a list of features possessed by at least x% of the members of that class is constructed. The reduced-size MFDMap is then built using the combined list of features from the five classes. The experiments start with the original MFDMap for the two cases (notes and rests, and notes only) and gradually reduce the MFDMap size by varying the value of x from 1 to 50, as sketched in the listing below. A selected list of reduced-size MFDMaps, together with the list of features each MFDMap contains, is shown in Table 4.1 for Case 1 MFDMaps and Table 4.2 for Case 2 MFDMaps. It is to be noted that for x = 1, the MFDMap size for both Case 1 and Case 2 is the same as the respective original MFDMap (see Section 3.4).
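The following sketch illustrates this significance-based selection. It assumes the MFDMaps are stored as a NumPy matrix with one row per folk song, that a feature is "possessed" when its density entry is non-zero, and that class labels are integer-coded; the names are illustrative only.

```python
import numpy as np

def select_significant_features(mfdmaps, labels, x_percent):
    """Keep a feature if at least x% of the members of some class possess it
    (non-zero density), then take the union of the per-class feature lists."""
    keep = np.zeros(mfdmaps.shape[1], dtype=bool)
    for c in np.unique(labels):
        class_maps = mfdmaps[labels == c]                   # songs of this class
        presence = (class_maps > 0).mean(axis=0) * 100.0    # % of members with the feature
        keep |= presence >= x_percent                       # union over the five classes
    return mfdmaps[:, keep], keep

# Example: build the reduced-size MFDMaps for x = 1, 3, 5, ..., 50.
# reduced, mask = select_significant_features(mfdmaps, labels, x_percent=15)
```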

Data Post-Processing

The targets for the neural classifiers in all experiments are set using the 1-of-c method, assigning one target to each of the five classes. In this method, for a set of targets, the one representing the class label of the sample is assigned '1' and the remaining targets are assigned '0'. In order to obtain the class label from the neural classifier's outputs on the test data set, the winner-takes-all method [22] is employed: among the set of outputs, only the output with the highest activation is taken into consideration, and this output determines the class label that the neural classifier assigns to a particular test sample.
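A minimal sketch of this post-processing, assuming NumPy and integer class labels from 0 to 4 for the five classes; identifiers are illustrative.

```python
import numpy as np

def one_of_c_targets(labels, n_classes=5):
    """1-of-c coding: the target of the true class is 1, all others are 0."""
    T = np.zeros((len(labels), n_classes))
    T[np.arange(len(labels)), labels] = 1.0
    return T

def winner_takes_all(outputs):
    """Assign each test sample to the class whose output has the highest activation."""
    return np.argmax(outputs, axis=1)
```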

1 Musical feature here also means the individual encoded music representation used in constructing the MFDMap. The word "feature" is used to be consistent with the concept of the feature vector.

Table 4.1: Selected list of reduced MFDMaps and their respective list of features (Case 1, notes and rests), Part 1 of 3.

MFDMap x List of Features Size Solfege Representations: 0, 13, 15, 17, 18, 19, 20, 22, 23, 24, 25, 31 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 41, 42, 43, 44, 46, 49. Interval Representations: -15, -14, -12, -10, -9, -8, -7, -6, -5, -4, - 31 3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15, 16, 17, 19. Duration Representations: -3, -2, -1.5, -1, -0.75, -0.5, -0.25, 30 0.0833, 0.1667, 0.25, 0.3333, 0.375, 0.5, 0.6667, 0.75, 1, 1.1667, 1.25, 1.5, 1.75, 2, 2.25, 2.5, 2.75, 3, 3.25, 3.5, 4, 4.5, 5. 1 172 Duration Ratio Representations: -8, -6, -4, -3, -2, -1.5, -1.3333, -1, -0.75, -0.6667, -0.5, -0.4, -0.3333, -0.2857, -0.25, -0.2, - 0.1667, -0.1429, -0.125, -0.1111, -0.0909, 0.0417, 0.05, 0.0667, 0.0714, 0.0833, 0.1, 0.1111, 0.125, 0.1429, 0.1667, 0.1875, 0.2, 80 0.2143, 0.2222, 0.25, 0.2857, 0.3, 0.3333, 0.375, 0.4, 0.4286, 0.4444, 0.5, 0.625, 0.6667, 0.7273, 0.75, 0.8, 0.8571, 0.875, 1, 1.2308, 1.3333, 1.5, 2, 2.25, 2.5, 2.6667, 2.75, 3, 3.5, 4, 4.5, 5, 5.3333, 6, 7, 7.5, 8, 9, 10, 11, 12, 13, 14, 16, 18, 20, 24. Solfege Representations: 0, 15, 17, 18, 20, 22, 24, 25, 26, 27, 29, 24 30, 31, 32, 33, 34, 36, 37, 38, 39, 41, 42, 44, 46. Interval Representations: -12, -10, -9, -8, -7, -5, -4, -3, -2, -1, 0, 26 1, 2, 3, 4, 5, 7, 8, 9, 10, 12, 14, 15, 16, 17, 19. Duration Representations: -2, -1.5, -1, -0.75, -0.5, -0.25, 0.0833, 22 0.1667, 0.25, 0.3333, 0.375, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2, 2.5, 3, 3 121 3.5, 4. Duration Ratio Representations: -4, -3, -2, -1.5, -1.3333, -1, - 0.75, -0.6667, -0.5, -0.4, -0.3333, -0.25, -0.1667, 0.0417, 0.0667, 49 0.0833, 0.1, 0.1111, 0.125, 0.1429, 0.1667, 0.1875, 0.2, 0.2143, 0.25, 0.3333, 0.375, 0.4, 0.5, 0.6667, 0.75, 0.875, 1, 1.3333, 1.5, 2, 2.6667, 3, 4, 5, 5.3333, 6, 7, 7.5, 8, 9, 10, 12, 14.


Table 4.1: Selected list of reduced MFDMaps and their respective list of features (Case 1, notes and rests), Part 2 of 3.

MFDMap x List of Features Size Solfege Representations: 0, 15, 17, 18, 20, 22, 24, 25, 26, 27, 29, 22 30, 31, 32, 34, 36, 37, 39, 41, 42, 44, 46. Interval Representations: -12, -10, -9, -8, -7, -5, -4, -3, -2, -1, 0, 23 1, 2, 3, 4, 5, 7, 8, 9, 10, 12, 14, 15. Duration Representations: -2, -1, -0.75, -0.5, -0.25, 0.0833, 5 101 21 0.1667, 0.25, 0.3333, 0.375, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2, 2.5, 3, 3.5, 4. Duration Ratio Representations: -4, -2, -1.5, -1, -0.6667, -0.5, - 0.3333, -0.25, 0.1, 0.125, 0.1429, 0.1667, 0.1875, 0.2, 0.25, 35 0.3333, 0.375, 0.4, 0.5, 0.6667, 0.75, 1, 1.3333, 1.5, 2, 3, 4, 5, 5.3333, 6, 7, 8, 9, 10, 12. Solfege Representations: 0, 15, 17, 20, 22, 24, 25, 27, 29, 30, 31, 20 32, 34, 36, 37, 39, 41, 42, 44, 46. Interval Representations: -10, -9, -8, -7, -5, -4, -3, -2, -1, 0, 1, 2, 21 3, 4, 5, 7, 8, 9, 10, 12, 15. 10 81 Duration Representations: -1, -0.5, -0.25, 0.1667, 0.25, 0.375, 15 0.5, 0.75, 1, 1.5, 1.75, 2, 2.5, 3, 4. Duration Ratio Representations: -2, -1, -0.5, -0.3333, -0.25, 25 0.125, 0.1667, 0.1875, 0.25, 0.3333, 0.375, 0.5, 0.6667, 0.75, 1, 1.3333, 1.5, 2, 3, 4, 5, 5.3333, 6, 7, 8. Solfege Representations: 0, 17, 20, 22, 24, 25, 27, 29, 30, 32, 34, 16 36, 37, 39, 41, 44. Interval Representations: -10, -9, -8, -7, -5, -4, -3, -2, -1, 0, 1, 2, 20 3, 4, 5, 7, 8, 9, 10, 12. 15 71 Duration Representations: -1, -0.5, 0.1667, 0.25, 0.375, 0.5, 12 0.75, 1, 1.5, 2, 3, 4. Duration Ratio Representations: -2, -1, -0.5, -0.3333, -0.25, 23 0.125, 0.1667, 0.1875, 0.25, 0.3333, 0.375, 0.5, 0.6667, 0.75, 1, 1.3333, 1.5, 2, 3, 4, 5.3333, 6, 8.


Table 4.1: Selected list of reduced MFDMaps and their respective list of features (Case 1, notes and rests), Part 3 of 3.

MFDMap x List of Features Size Solfege Representations: 0, 20, 22, 24, 25, 27, 29, 30, 32, 34, 36, 15 37, 39, 41, 44. Interval Representations: -10, -8, -7, -5, -4, -3, -2, -1, 0, 2, 3, 4, 18 5, 7, 8, 9, 10, 12. 20 63 Duration Representations: -1, -0.5, 0.25, 0.375, 0.5, 0.75, 1, 1.5, 10 2, 3. Duration Ratio Representations: -2, -1, -0.5, -0.3333, 0.125, 20 0.1667, 0.1875, 0.25, 0.3333, 0.375, 0.5, 0.75, 1, 1.5, 2, 3, 4, 5.3333, 6, 8. Solfege Representations: 0, 20, 22, 24, 25, 27, 29, 30, 32, 34, 36, 14 37, 39, 41. Interval Representations: -8, -7, -5, -4, -3, -2, -1, 0, 2, 3, 4, 5, 7, 15 10, 12. 30 55 Duration Representations: -1, -0.5, 0.25, 0.375, 0.5, 0.75, 1, 1.5, 9 2. Duration Ratio Representations: -2, -1, -0.5, 0.125, 0.1667, 17 0.25, 0.3333, 0.5, 0.75, 1, 1.5, 2, 3, 4, 5.3333, 6, 8. Solfege Representations: 0, 20, 22, 24, 25, 27, 29, 32, 34, 36, 37, 13 39, 41. Interval Representations: -8, -7, -5, -4, -3, -2, -1, 0, 2, 3, 4, 5, 7, 14 40 47 12. 7 Duration Representations: -0.5, 0.25, 0.5, 0.75, 1, 1.5, 2. Duration Ratio Representations: -1, -0.5, 0.25, 0.3333, 0.5, 13 0.75, 1, 1.5, 2, 3, 4, 6, 8. 10 Solfege Representations: 0, 20, 22, 25, 27, 29, 32, 34, 37, 39. 12 Interval Representations: -7, -5, -4, -3, -2, -1, 0, 2, 3, 4, 5, 7. 50 40 7 Duration Representations: -0.5, 0.25, 0.5, 0.75, 1, 1.5, 2. Duration Ratio Representations: -1, 0.25, 0.3333, 0.5, 0.75, 1, 11 1.5, 2, 3, 4, 8.


Table 4.2: Selected list of reduced MFDMaps and their respective list of features (Case 2, only notes), Part 1 of 3.

MFDMap x List of Features Size Solfege Representations: 13, 15, 17, 18, 19, 20, 22, 23, 24, 25, 30 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 41, 42, 43, 44, 46, 49. Interval Representations: -15, -14, -12, -10, -9, -8, -7, -6, -5, -4, - 31 3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15, 16, 17, 19. Duration Representations: 0.0833, 0.1667, 0.25, 0.3333, 0.375, 23 0.5, 0.6667, 0.75, 1, 1.1667, 1.25, 1.5, 1.75, 2, 2.25, 2.5, 2.75, 3, 1 145 3.25, 3.5, 4, 4.5, 5. Duration Ratio Representations: 0.0417, 0.05, 0.0625, 0.0667, 0.0714, 0.0833, 0.0909, 0.1, 0.1111, 0.125, 0.1429, 0.1667, 0.1875, 0.2, 0.2143, 0.2222, 0.25, 0.2858, 0.3, 0.3333, 0.375, 0.4, 61 0.4286, 0.4444, 0.5, 0.625, 0.667, 0.7273, 0.75, 0.8, 0.8571, 0.875, 1, 1.2308, 1.3333, 1.5, 2, 2.25, 2.5, 2.6667, 2.75, 3, 3.5, 4, 4.5, 5, 5.3333, 6, 7, 7.5, 8, 9, 10, 11, 12, 13, 14, 16, 18, 20, 24. Solfege Representations: 15, 17, 18, 20, 22, 24, 25, 26, 27, 29, 23 30, 31, 32, 33, 34, 36, 37, 38, 39, 41, 42, 44, 46. Interval Representations: -12, -10, -9, -8, -7, -5, -4, -3, -2, -1, 0, 26 1, 2, 3, 4, 5, 7, 8, 9, 10, 12, 14, 15, 16, 17, 19. Duration Representations: 0.0833, 0.1667, 0.25, 0.3333, 0.375, 3 102 16 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2, 2.5, 3, 3.5, 4. Duration Ratio Representations: 0.0417, 0.0667, 0.0833, 0.1, 0.1111, 0.125, 0.1429, 0.1667, 0.1875, 0.2, 0.2143, 0.2222, 0.25, 37 0.3333, 0.375, 0.4, 0.5, 0.6667, 0.75, 0.875, 1, 1.3333, 1.5, 2, 2.6667, 3, 4, 5, 5.3333, 6, 7, 7.5, 8, 9, 10, 12, 14.


Table 4.2: Selected list of reduced MFDMaps and their respective list of features (Case 2, only notes), Part 2 of 3.

MFDMap x List of Features Size Solfege Representations: 15, 17, 18, 20, 22, 24, 25, 26, 27, 29, 21 30, 31, 32, 34, 36, 37, 39, 41, 42, 44, 46. Interval Representations: -12, -10, -9, -8, -7, -5, -4, -3, -2, -1, 0, 23 1, 2, 3, 4, 5, 7, 8, 9, 10, 12, 14, 15. 5 88 Duration Representations: 0.0833, 0.1667, 0.25, 0.3333, 0.375, 16 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2, 2.5, 3, 3.5, 4. Duration Ratio Representations: 0.0833, 0.1, 0.125, 0.1429, 28 0.1667, 0.1875, 0.2, 0.25, 0.3333, 0.375, 0.4, 0.5, 0.6667, 0.75, 1, 1.3333, 1.5, 2, 3, 4, 5, 5.3333, 6, 7, 8, 9, 10, 12. Solfege Representations: 15, 17, 20, 22, 24, 25, 27, 29, 30, 31, 19 32, 34, 36, 37, 39, 41, 42, 44, 46. Interval Representations: -10, -9, -8, -7, -5, -4, -3, -2, -1, 0, 1, 2, 21 3, 4, 5, 7, 8, 9, 10, 12, 15. 10 73 Duration Representations: 0.1667, 0.25, 0.375, 0.5, 0.75, 1, 1.5, 12 1.75, 2, 2.5, 3, 4. Duration Ratio Representations: 0.125, 0.1429, 0.1667, 0.1875, 21 0.25, 0.3333, 0.375, 0.5, 0.6667, 0.75, 1, 1.3333, 1.5, 2, 3, 4, 5, 5.3333, 6, 7, 8. Solfege Representations: 17, 20, 22, 24, 25, 27, 29, 30, 32, 34, 15 36, 37, 39, 41, 44. Interval Representations: -10, -9, -8, -7, -5, -4, -3, -2, -1, 0, 1, 2, 20 3, 4, 5, 7, 8, 9, 10, 12. 15 63 Duration Representations: 0.1667, 0.25, 0.375, 0.5, 0.75, 1, 1.5, 10 2, 3, 4. Duration Ratio Representations: 0.125, 0.1667, 0.1875, 0.25, 18 0.3333, 0.375, 0.5, 0.6667, 0.75, 1, 1.3333, 1.5, 2, 3, 4, 5.3333, 6, 8.


Table 4.2: Selected list of reduced MFDMaps and their respective list of features (Case 2, only notes), Part 3 of 3.

MFDMap x List of Features Size Solfege Representations: 20, 22, 24, 25, 27, 29, 30, 32, 34, 36, 14 37, 39, 41, 44. Interval Representations: -10, -8, -7, -5, -4, -3, -2, -1, 0, 2, 3, 4, 18 5, 7, 8, 9, 10, 12. 20 58 8 Duration Representations: 0.25, 0.375, 0.5, 0.75, 1, 1.5, 2, 3. Duration Ratio Representations: 0.125, 0.1667, 0.1875, 0.25, 18 0.3333, 0.375, 0.5, 0.6667, 0.75, 1, 1.3333, 1.5, 2, 3, 4, 5.3333, 6, 8. Solfege Representations: 20, 22, 24, 25, 27, 29, 30, 32, 34, 36, 13 37, 39, 41. Interval Representations: -8, -7, -5, -4, -3, -2, -1, 0, 2, 3, 4, 5, 7, 15 30 49 10, 12. 7 Duration Representations: 0.25, 0.375, 0.5, 0.75, 1, 1.5, 2. Duration Ratio Representations: 0.125, 0.1667, 0.25, 0.3333, 14 0.5, 0.75, 1, 1.5, 2, 3, 4, 5.3333, 6, 8. Solfege Representations: 20, 22, 24, 25, 27, 29, 32, 34, 36, 37, 12 39, 41. Interval Representations: -8, -7, -5, -4, -3, -2, -1, 0, 2, 3, 4, 5, 7, 14 40 44 12. 6 Duration Representations: 0.25, 0.5, 0.75, 1, 1.5, 2. Duration Ratio Representations: 0.125, 0.25, 0.3333, 0.5, 0.75, 12 1, 1.5, 2, 3, 4, 6, 8. 9 Solfege Representations: 20, 22, 25, 27, 29, 32, 34, 37, 39. 12 Interval Representations: -7, -5, -4, -3, -2, -1, 0, 2, 3, 4, 5, 7. 50 37 6 Duration Representations: 0.25, 0.5, 0.75, 1, 1.5, 2. Duration Ratio Representations: 0.25, 0.3333, 0.5, 0.75, 1, 1.5, 10 2, 3, 4, 8.


4.4.2 Parameter Setting

Two classifiers are employed in this chapter: the extreme learning machine and the regularized extreme learning machine. For all experiments, the single-hidden layer feedforward neural network structure is used for both learning algorithms. 333 Han Chinese folk songs from five different classes are employed to study the capability of such neural network classifiers in differentiating Han Chinese folk songs from different geographical regions. The data sets used for training the neural classifier and for testing its classification performance are constructed as follows: for each class of folk songs, 10% of the songs are used for testing and the remaining 90% are used for training.

A 10-fold cross-validation is employed to thoroughly assess the folk song classification technique in this chapter. In this method, each experiment is repeated 10 times, each time using a different combination of samples as the training and testing data sets. In this way no data sample is favoured: each of the 333 folk songs is assigned as a test sample once and only once, and no combination of data sets is duplicated.
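The evaluation protocol can be sketched as follows: a stratified 10-fold split in which roughly 10% of each class is held out per fold and every song is tested exactly once, repeated over many random weight initializations. NumPy is assumed; train_fn and predict_fn are placeholders standing for the ELM or R-ELM routines sketched earlier, wrapped to accept integer class labels (for example via the 1-of-c coding and winner-takes-all decoding above).

```python
import numpy as np

def stratified_folds(labels, n_folds=10, seed=0):
    """Split sample indices into folds so that roughly 10% of each class lands in each fold."""
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(n_folds)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        for i, sample in enumerate(idx):
            folds[i % n_folds].append(sample)
    return [np.array(f) for f in folds]

def evaluate(X, labels, train_fn, predict_fn, n_repeats=50):
    """Mean and standard deviation of test accuracy: the 10-fold cross-validation is
    run n_repeats times, with fresh random network weights inside train_fn each time."""
    folds = stratified_folds(labels)
    accuracies = []
    for _ in range(n_repeats):
        correct = 0
        for test_idx in folds:
            train_idx = np.setdiff1d(np.arange(len(labels)), test_idx)
            model = train_fn(X[train_idx], labels[train_idx])
            predictions = predict_fn(model, X[test_idx])
            correct += np.sum(predictions == labels[test_idx])
        accuracies.append(correct / len(labels) * 100.0)
    return np.mean(accuracies), np.std(accuracies)
```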

In all experiments, the input weights and hidden layer biases of the SLFN are randomly assigned and the output weights are computed accordingly for each of the two classifiers. Because of this random element in the neural classifier, each experiment is repeated 50 times and the classification accuracy is reported as the mean accuracy of the 50 repetitions.

In the previous discussion, a non-linear activation function is used to activate the hidden neurons in the hidden layer. In general, a multi-layer perceptron learns better when the sigmoidal activation function built into the neuron model of the network is antisymmetric than when it is non-symmetric [22]. In all experiments, the hyperbolic tangent function is employed. The hyperbolic tangent function is an antisymmetric (odd) function and can be defined as

$$\tanh(x) = \frac{\sinh(x)}{\cosh(x)} = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}. \qquad (4.15)$$


In all experiments, the simulation of the neural network begins with one hidden neuron in the hidden layer and this number is gradually increased to a maximum number of 10000 hidden neurons.

In (4.14), a balancing parameter, γ, is employed in the design of the network output weights for the R-ELM algorithm. This parameter balances the empirical risk and the structural risk of the neural network. In this chapter, a set of different values is used to investigate the effect of this parameter on classifier performance; the values assigned to γ are 0.001, 0.01, 0.1, 1, 10, 100 and 1000.

4.5 Experiment Results

The single-hidden layer feedforward neural network, trained independently using either the extreme learning machine algorithm or the regularized extreme learning machine algorithm, is employed as the machine classifier and experiments are designed according to the settings discussed in the previous section.

Table 4.3 to Table 4.11 report the mean classification accuracy (in %) of the ELM classifier on the testing data set over 50 repetitions of the experiments using the different MFDMap designs, while Table 4.12 to Table 4.20 report the mean classification accuracy of the R-ELM classifier. In each repetition, the experiment is run over all 10 of the cross-validation training and testing data sets. The standard deviations of the mean values are also reported. Each table contains the results for both Case 1 and Case 2 MFDMaps.

Table 4.3 and Table 4.12 show the results using the original MFDMaps with 172 and 145 elements, as shown in Table 4.1 and Table 4.2. Table 4.4 and Table 4.13 contain the classification accuracy using the list of features for the MFDMap design where at least 3% of the members of a class possess such features. Table 4.5 and Table 4.14 are the case with at least 5%, Table 4.6 and Table 4.15 with at least 10%, and Table 4.7 and Table 4.16 with at least 15% of members of a class possessing such features. Table 4.8 and Table 4.17, Table 4.9 and Table 4.18, Table 4.10 and Table 4.19, and Table 4.11 and Table 4.20 are the cases for 20%, 30%, 40% and 50% respectively.

Table 4.21 shows the confusion matrix for the experiment using the Case 1 MFDMap at significance value x = 15 (MFDMap size 71) at 8000 hidden neurons. Table 4.22 is the confusion matrix for the experiment using the Case 2 MFDMap which is also at significance value x = 15 (MFDMap size 63) at 8000 hidden neurons. The ELM classifier is employed in both cases. Table 4.23 and Table 4.24 are the confusion matrices using the R-ELM classifier. Table 4.23 is the confusion matrix for Case 1 MFDMap at significance value x = 3 (MFDMap size 121) at 3000 hidden neurons while Table 4.24 is the Case 2 MFDMap at significance value x = 15 (MFDMap size 63) at 5000 hidden neurons.
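The reported confusion matrices are averages over the repeated runs. A minimal sketch of accumulating such a matrix is given below; NumPy is assumed, rows index the true class and columns the assigned class, and each run is assumed to cover all 333 songs once (as in the cross-validation above).

```python
import numpy as np

def mean_confusion_matrix(true_labels_per_run, predicted_labels_per_run, n_classes=5):
    """Average the per-run confusion counts: each cell is the mean number of
    test songs of a true class (row) assigned to a predicted class (column)."""
    total = np.zeros((n_classes, n_classes))
    for truth, pred in zip(true_labels_per_run, predicted_labels_per_run):
        for t, p in zip(truth, pred):
            total[t, p] += 1
    return total / len(true_labels_per_run)
```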

It is to be noted that all classification results reported in this section for the R-ELM classifier are the results using γ = 0.001.

Table 4.3: Classification accuracy of the ELM classifier using Case 1 and Case 2 MFDMap with the original map size. Case 1 MFDMap Case 2 MFDMap Number of (Map Size = 172) (Map Size = 145) Hidden Standard Standard Neurons Accuracy (%) Accuracy (%) Deviation (%) Deviation (%) 50 51.18 8.15 50.64 9.09 100 52.81 8.20 51.13 8.57 500 54.18 7.54 54.97 6.75 1000 56.71 7.91 56.13 8.22 1500 63.24 6.93 63.73 7.30 2000 65.28 8.57 66.69 7.87 2500 68.08 7.69 69.94 8.08 3000 68.98 7.40 70.25 8.37 4000 70.71 7.87 71.44 7.86 5000 71.37 7.29 73.43 8.79 8000 70.14 7.99 72.40 7.76 10000 69.88 7.72 71.00 8.16


Table 4.4: Classification accuracy of the ELM classifier using Case 1 and Case 2 MFDMap with x = 3. Case 1 MFDMap Case 2 MFDMap Number of (Map Size = 121) (Map Size = 102) Hidden Standard Standard Neurons Accuracy (%) Accuracy (%) Deviation (%) Deviation (%) 50 51.81 8.33 51.73 7.77 100 52.32 8.04 52.21 7.94 500 53.10 8.44 53.71 8.74 1000 56.20 8.16 55.80 7.57 1500 63.01 7.61 62.60 8.39 2000 66.67 7.10 66.23 8.15 2500 68.97 8.93 68.97 7.88 3000 69.15 7.70 69.99 7.47 4000 70.00 8.05 72.45 7.86 5000 71.79 8.28 72.65 7.96 8000 70.60 8.06 71.02 7.27 10000 69.88 7.55 70.07 8.10

Table 4.5: Classification accuracy of the ELM classifier using Case 1 and Case 2 MFDMap with x = 5. Case 1 MFDMap Case 2 MFDMap Number of (Map Size = 101) (Map Size = 88) Hidden Standard Standard Neurons Accuracy (%) Accuracy (%) Deviation (%) Deviation (%) 50 51.50 7.59 51.24 7.48 100 52.31 7.95 52.22 8.94 500 53.41 8.17 53.40 8.47 1000 56.48 7.55 55.41 8.39 1500 63.43 7.86 62.86 7.56 2000 65.77 8.18 65.95 8.16 2500 68.19 7.08 67.84 7.72 3000 68.92 8.52 68.14 7.84 4000 69.32 9.21 69.57 7.71 5000 71.99 8.48 70.90 7.49 8000 73.50 7.84 73.42 7.55 10000 71.00 8.21 70.00 7.90


Table 4.6: Classification accuracy of the ELM classifier using Case 1 and Case 2 MFDMap with x = 10. Case 1 MFDMap Case 2 MFDMap Number of (Map Size = 81) (Map Size = 73) Hidden Standard Standard Neurons Accuracy (%) Accuracy (%) Deviation (%) Deviation (%) 50 51.78 7.15 50.96 8.63 100 52.62 8.30 51.92 8.24 500 53.68 7.94 55.08 8.59 1000 55.61 8.35 57.88 8.06 1500 63.24 8.13 62.95 7.67 2000 66.33 8.25 64.00 7.26 2500 67.30 7.26 68.22 8.08 3000 67.47 7.49 69.29 8.41 4000 70.08 7.78 69.89 7.85 5000 70.97 8.05 71.40 7.73 8000 71.68 8.56 71.67 7.85 10000 70.09 7.86 70.88 7.45

Table 4.7: Classification accuracy of the ELM classifier using Case 1 and Case 2 MFDMap with x = 15. Case 1 MFDMap Case 2 MFDMap Number of (Map Size = 71) (Map Size = 63) Hidden Standard Standard Neurons Accuracy (%) Accuracy (%) Deviation (%) Deviation (%) 50 51.00 7.75 51.73 8.30 100 52.98 7.91 52.21 8.30 500 55.38 8.02 53.71 6.91 1000 58.24 7.97 55.80 7.34 1500 62.32 8.31 62.60 7.28 2000 65.58 8.05 66.23 8.20 2500 67.00 8.91 67.83 8.74 3000 68.52 8.16 69.35 7.84 4000 70.18 8.90 69.95 8.41 5000 71.37 7.64 71.38 8.27 8000 73.54 8.26 73.90 7.47 10000 70.94 7.87 72.00 8.20


Table 4.8: Classification accuracy of the ELM classifier using Case 1 and Case 2 MFDMap with x = 20. Case 1 MFDMap Case 2 MFDMap Number of (Map Size = 63) (Map Size = 58) Hidden Standard Standard Neurons Accuracy (%) Accuracy (%) Deviation (%) Deviation (%) 50 51.69 7.62 50.76 7.69 100 52.23 8.76 53.94 8.12 500 53.62 7.98 55.45 7.50 1000 55.48 8.82 57.47 8.49 1500 61.59 7.79 60.46 7.68 2000 64.21 8.29 66.25 8.90 2500 67.30 7.97 68.33 7.46 3000 68.63 6.99 70.38 8.10 4000 69.34 7.51 71.27 7.24 5000 69.75 8.31 71.32 7.64 8000 71.43 7.97 72.22 7.70 10000 68.99 7.44 70.55 8.20

Table 4.9: Classification accuracy of the ELM classifier using Case 1 and Case 2 MFDMap with x = 30. Case 1 MFDMap Case 2 MFDMap Number of (Map Size = 55) (Map Size = 49) Hidden Standard Standard Neurons Accuracy (%) Accuracy (%) Deviation (%) Deviation (%) 50 50.37 8.47 50.60 8.09 100 51.70 8.15 51.84 7.85 500 53.19 7.81 55.13 7.93 1000 55.11 8.41 57.56 8.30 1500 60.44 8.40 60.47 8.52 2000 64.88 8.06 66.34 7.90 2500 66.82 8.29 68.03 8.16 3000 68.07 8.21 68.68 7.88 4000 69.63 7.51 69.52 8.11 5000 69.64 8.38 71.43 8.22 8000 71.19 7.67 72.65 7.69 10000 70.00 7.70 70.25 8.14


Table 4.10: Classification accuracy of the ELM classifier using Case 1 and Case 2 MFDMap with x = 40. Case 1 MFDMap Case 2 MFDMap Number of (Map Size = 47) (Map Size = 44) Hidden Standard Standard Neurons Accuracy (%) Accuracy (%) Deviation (%) Deviation (%) 50 50.28 8.30 52.76 8.06 100 51.95 8.05 53.33 8.22 500 55.73 8.86 54.98 8.05 1000 58.07 7.70 55.65 9.39 1500 62.79 7.63 62.44 7.42 2000 66.72 7.13 63.47 7.07 2500 67.12 8.46 67.66 7.43 3000 67.48 8.43 68.08 7.45 4000 70.64 7.96 69.12 7.78 5000 70.95 8.45 71.31 7.92 8000 69.51 8.09 69.94 7.89 10000 68.99 8.15 69.53 8.27

Table 4.11: Classification accuracy of the ELM classifier using Case 1 and Case 2 MFDMap with x = 50. Case 1 MFDMap Case 2 MFDMap Number of (Map Size = 40) (Map Size = 37) Hidden Standard Standard Neurons Accuracy (%) Accuracy (%) Deviation (%) Deviation (%) 50 51.66 8.19 50.48 7.75 100 52.03 8.38 51.27 8.12 500 53.12 8.89 52.12 8.12 1000 55.36 8.61 52.90 8.04 1500 59.30 7.36 60.69 7.70 2000 63.56 6.84 64.42 7.39 2500 65.56 8.45 65.09 8.16 3000 67.28 7.08 67.90 7.33 4000 67.47 8.03 67.35 7.48 5000 70.66 8.02 70.90 8.67 8000 69.69 9.11 70.00 7.79 10000 67.10 7.97 67.43 7.93


Table 4.12: Classification accuracy of the R-ELM classifier using Case 1 and Case 2 MFDMap with the original map size. Case 1 MFDMap Case 2 MFDMap Number of (Map Size = 172) (Map Size = 145) Hidden Standard Standard Neurons Accuracy (%) Accuracy (%) Deviation (%) Deviation (%) 50 45.69 8.05 45.16 7.90 100 52.87 8.09 52.91 7.84 500 67.92 8.13 68.43 8.08 1000 73.35 7.75 72.21 8.12 1500 74.43 7.98 74.47 8.16 2000 74.91 8.04 75.61 7.85 2500 76.94 8.07 75.24 7.87 3000 76.73 7.87 75.34 7.85 4000 76.13 7.90 76.63 8.00 5000 76.13 8.08 77.16 7.94 8000 75.83 7.99 76.84 8.13 10000 74.74 8.06 75.75 7.85

Table 4.13: Classification accuracy of the R-ELM classifier using Case 1 and Case 2 MFDMap with x = 3. Case 1 MFDMap Case 2 MFDMap Number of (Map Size = 121) (Map Size = 102) Hidden Standard Standard Neurons Accuracy (%) Accuracy (%) Deviation (%) Deviation (%) 50 46.84 7.83 48.05 7.97 100 53.29 8.02 52.80 8.18 500 70.26 8.12 70.38 8.09 1000 73.32 7.92 72.72 7.99 1500 74.53 7.87 74.26 8.13 2000 75.15 7.99 75.73 7.90 2500 76.22 7.84 76.33 7.97 3000 77.04 8.00 76.25 8.14 4000 75.92 8.08 77.65 8.15 5000 75.53 8.02 76.05 8.07 8000 75.24 7.81 76.03 8.05 10000 74.74 7.95 74.85 7.94


Table 4.14: Classification accuracy of the R-ELM classifier using Case 1 and Case 2 MFDMap with x = 5. Case 1 MFDMap Case 2 MFDMap Number of (Map Size = 101) (Map Size = 88) Hidden Standard Standard Neurons Accuracy (%) Accuracy (%) Deviation (%) Deviation (%) 50 46.24 8.07 47.36 7.97 100 54.25 7.92 52.03 8.05 500 68.15 8.06 69.90 7.97 1000 73.60 8.03 73.15 8.04 1500 75.72 8.11 74.98 7.99 2000 75.19 7.96 75.83 8.00 2500 76.73 8.06 74.64 8.05 3000 75.33 8.08 76.95 8.14 4000 75.45 7.91 77.35 8.05 5000 77.04 8.02 77.96 8.16 8000 76.54 8.16 77.65 7.80 10000 73.73 8.01 75.75 7.96

Table 4.15: Classification accuracy of the R-ELM classifier using Case 1 and Case 2 MFDMap with x = 10. Case 1 MFDMap Case 2 MFDMap Number of (Map Size = 81) (Map Size = 73) Hidden Standard Standard Neurons Accuracy (%) Accuracy (%) Deviation (%) Deviation (%) 50 45.91 8.02 46.39 8.24 100 54.40 7.92 51.88 7.95 500 69.53 7.87 69.72 8.06 1000 73.20 8.06 73.57 7.90 1500 74.88 8.06 75.63 8.13 2000 76.18 8.03 75.64 7.90 2500 76.33 8.04 76.43 8.02 3000 76.73 7.91 77.13 7.94 4000 75.85 7.95 77.95 8.05 5000 75.83 7.99 77.45 8.00 8000 75.12 7.93 76.84 8.00 10000 74.74 8.03 73.73 8.29


Table 4.16: Classification accuracy of the R-ELM classifier using Case 1 and Case 2 MFDMap with x = 15. Case 1 MFDMap Case 2 MFDMap Number of (Map Size = 71) (Map Size = 63) Hidden Standard Standard Neurons Accuracy (%) Accuracy (%) Deviation (%) Deviation (%) 50 48.18 7.94 46.62 7.77 100 54.05 8.00 52.75 7.84 500 70.20 8.27 69.81 8.04 1000 72.96 7.89 73.93 7.93 1500 74.43 8.06 75.28 7.97 2000 74.30 7.89 75.97 7.85 2500 74.93 8.10 76.65 7.91 3000 75.33 8.03 77.04 7.96 4000 76.34 8.07 77.34 7.93 5000 76.72 7.97 78.64 7.91 8000 75.54 8.02 76.95 7.96 10000 74.74 8.15 76.35 7.91

Table 4.17: Classification accuracy of the R-ELM classifier using Case 1 and Case 2 MFDMap with x = 20. Case 1 MFDMap Case 2 MFDMap Number of (Map Size = 63) (Map Size = 58) Hidden Standard Standard Neurons Accuracy (%) Accuracy (%) Deviation (%) Deviation (%) 50 47.03 8.13 47.68 7.94 100 51.93 7.90 51.18 7.82 500 68.45 8.18 70.89 7.96 1000 73.26 7.96 73.42 8.09 1500 73.59 7.85 74.94 8.07 2000 74.29 7.94 76.39 8.23 2500 74.73 8.09 76.14 8.02 3000 75.53 8.11 76.34 7.78 4000 75.72 8.02 77.04 8.17 5000 75.72 8.03 77.14 8.13 8000 75.73 8.06 76.84 7.94 10000 72.72 7.85 76.14 8.02


Table 4.18: Classification accuracy of the R-ELM classifier using Case 1 and Case 2 MFDMap with x = 30. Case 1 MFDMap Case 2 MFDMap Number of (Map Size = 55) (Map Size = 49) Hidden Standard Standard Neurons Accuracy (%) Accuracy (%) Deviation (%) Deviation (%) 50 46.41 8.08 45.07 7.99 100 53.21 8.04 52.73 7.97 500 68.11 8.07 67.40 8.12 1000 72.73 7.99 74.67 8.06 1500 73.83 8.00 74.62 8.05 2000 74.37 8.01 74.98 7.86 2500 76.82 7.92 75.73 7.90 3000 75.42 8.14 76.13 8.02 4000 75.64 8.00 76.95 7.97 5000 73.81 8.07 77.55 8.13 8000 73.73 7.91 76.04 8.13 10000 73.73 7.89 74.94 7.94

Table 4.19: Classification accuracy of the R-ELM classifier using Case 1 and Case 2 MFDMap with x = 40. Case 1 MFDMap Case 2 MFDMap Number of (Map Size = 47) (Map Size = 44) Hidden Standard Standard Neurons Accuracy (%) Accuracy (%) Deviation (%) Deviation (%) 50 45.31 8.09 47.58 8.18 100 52.34 8.14 49.15 8.02 500 68.02 8.03 69.08 8.27 1000 72.15 8.16 71.98 7.97 1500 72.81 8.11 73.41 8.06 2000 74.50 8.02 75.76 8.16 2500 74.83 8.09 74.94 8.27 3000 75.43 8.02 75.83 8.03 4000 75.63 7.96 76.43 7.92 5000 75.43 8.04 75.83 8.08 8000 75.05 8.00 75.83 7.87 10000 73.73 7.96 74.93 7.97


Table 4.20: Classification accuracy of the R-ELM classifier using Case 1 and Case 2 MFDMap with x = 50. Case 1 MFDMap Case 2 MFDMap Number of (Map Size = 40) (Map Size = 37) Hidden Standard Standard Neurons Accuracy (%) Accuracy (%) Deviation (%) Deviation (%) 50 45.44 8.03 45.33 8.13 100 52.38 8.15 51.56 8.21 500 67.72 8.14 68.24 8.01 1000 71.73 8.00 71.90 8.02 1500 73.23 8.09 73.05 7.94 2000 73.52 7.97 73.77 8.16 2500 73.60 8.07 73.93 7.99 3000 73.73 8.19 75.74 7.95 4000 74.12 8.02 74.11 8.22 5000 75.31 7.97 76.55 8.08 8000 73.21 7.95 73.64 8.07 10000 72.72 8.07 72.62 7.90

Table 4.21: Confusion matrix for Case 1 MFDMap with map size 71 (x = 15) at 8000 hidden neurons, using the ELM classifier.

                      Class 1     Class 2     Class 3     Class 4       Class 5
                      (Dongbei)   (Shanxi)    (Sichuan)   (Guangdong)   (Jiangsu)
Class 1 (Dongbei)       5.80        0.34        0.10        0.06          0.70
Class 2 (Shanxi)        0.44        5.88        0.24        0.42          0.62
Class 3 (Sichuan)       0.36        0.36        2.26        0.38          0.84
Class 4 (Guangdong)     0.12        0.14        0.76        4.22          0.76
Class 5 (Jiangsu)       1.12        0.36        0.22        0.46          6.34


Table 4.22: Confusion matrix for Case 2 MFDMap with map size 63 (x = 15) at 8000 hidden neurons, using the ELM classifier.

                      Class 1     Class 2     Class 3     Class 4       Class 5
                      (Dongbei)   (Shanxi)    (Sichuan)   (Guangdong)   (Jiangsu)
Class 1 (Dongbei)       5.78        0.32        0.10        0.10          0.70
Class 2 (Shanxi)        0.58        6.02        0.26        0.30          0.44
Class 3 (Sichuan)       0.08        0.40        2.26        0.38          1.08
Class 4 (Guangdong)     0.06        0.08        0.88        4.12          0.86
Class 5 (Jiangsu)       1.04        0.18        0.40        0.44          6.44

Table 4.23: Confusion matrix for Case 1 MFDMap with map size 121 (x = 3) at 3000 hidden neurons, using the R-ELM classifier.

                      Class 1     Class 2     Class 3     Class 4       Class 5
                      (Dongbei)   (Shanxi)    (Sichuan)   (Guangdong)   (Jiangsu)
Class 1 (Dongbei)       6.00        0.23        0.13        0.10          0.53
Class 2 (Shanxi)        0.40        6.27        0.07        0.20          0.67
Class 3 (Sichuan)       0.43        0.30        2.07        0.57          0.83
Class 4 (Guangdong)     0.23        0.07        0.63        4.50          0.57
Class 5 (Jiangsu)       0.90        0.17        0.06        0.53          6.83


Table 4.24: Confusion matrix for Case 2 MFDMap with map size 63 (x = 15) at 5000 hidden neurons, using the R-ELM classifier.

                      Class 1     Class 2     Class 3     Class 4       Class 5
                      (Dongbei)   (Shanxi)    (Sichuan)   (Guangdong)   (Jiangsu)
Class 1 (Dongbei)       6.07        0.20        0.10        0.17          0.47
Class 2 (Shanxi)        0.20        6.47        0.27        0.17          0.50
Class 3 (Sichuan)       0.13        0.50        2.33        0.33          0.90
Class 4 (Guangdong)     0.07        0           0.70        4.70          0.53
Class 5 (Jiangsu)       1.00        0.13        0.10        0.63          6.63

4.6 Discussion

The results of the experiments using the extreme learning machine and the regularized extreme learning machine as classifiers for Han Chinese folk song classification were presented in the previous section. Overall, the best classification accuracy is 73.90% for the ELM classifier and 78.64% for the R-ELM classifier. The confusion matrices in Table 4.21 to Table 4.24 show that Sichuan folk songs are the most difficult class to classify. It seems that the current music encoding technique might not be sophisticated enough to clearly distinguish Sichuan folk songs from the other classes of folk songs.

The experiments in this chapter are designed to investigate the influence of the following factors on the classification performance: (1) size of the neural network hidden layer (i.e. number of hidden neurons), (2) size of the MFDMaps and (3) the effect of the rests in folk song classification task.


Size of the Hidden Layer

The size of a hidden layer is the number of hidden neurons employed in that layer. The role of the hidden neurons is to allow the neural network to extract high-order signal features from the inputs. The hidden layer serves as a pre-processor that projects the high-dimensional input (feature) space into a simpler (abstract) feature space; patterns represented in this space are more easily separated by the network output layer. The size of the hidden layer can therefore influence the performance of the neural network. In classification tasks, a larger hidden layer (a greater number of hidden neurons) allows the features to be projected more sparsely, so they can be more easily separated and a better mapping of the feature patterns can be achieved.

It can be clearly seen from the results in Table 4.3 to Table 4.20 that the classification accuracy improves as the number of hidden neurons increases, and that the accuracy begins to deteriorate once the best accuracy is achieved. This is where the saturation point of the neural network is observed.

In most experiments using the ELM classifier, the best classification accuracy is achieved at 8000 hidden neurons, except in Table 4.3, Table 4.4, Table 4.10 and Table 4.11, where the best accuracy is achieved at 5000 hidden neurons. As the hidden layer size increases from 50 to 8000 hidden neurons, the classifier accuracy improves from around 50% to around 73%.

The saturation point of the neural network for Case 1 MFDMaps using the R-ELM classifier varies within the range of 2500 to 5000 hidden neurons. For Case 2 MFDMaps, the best classification accuracy is achieved at 5000 hidden neurons, except in Table 4.13, Table 4.15 and Table 4.19, where the saturation point is at 4000 hidden neurons. The classification accuracies using the R-ELM classifier start at around 45% and improve to around 78%.


Size of the MFDMaps

As discussed before, although it is useful in a classification task to have a greater number of features in order to better represent the characteristics of each of the different classes, a large feature vector does not always guarantee good classification results. Although the input vectors (MFDMaps) in the experiments are not of particularly high dimension, it is useful to examine the classifier performance with various designs of the input vectors.

The original size of the MFDMap (also, the length of the input vector) is 172 for Case 1 MFDMap and 145 for Case 2 MFDMap. A method of dimensionality reduction based on the significance level of features is employed to reduce the size of the MFDMap. In Table 4.3 to Table 4.11 and Table 4.12 to Table 4.20, the size of the MFDMap is reduced from 172 to 40 for Case 1 MFDMap and from 145 to 37 for Case 2 MFDMap. Overall, the difference in the classification accuracy is not significant.

For experiments using the ELM classifier, the performance shows slight deterioration as the MFDMap size is reduced. For experiments using the R-ELM classifier, a similar pattern is observed for Case 1 MFDMaps. For Case 2 MFDMaps, the classification accuracy improves as the MFDMap size is reduced down to the significance value x = 15; beyond that point, the classification accuracy shows slight deterioration. Nonetheless, the changes in classification accuracy are not significant, especially if the standard deviations are taken into consideration.

Effect of the Musical Rests

As rests are included as part of the features, all Case 1 MFDMaps are larger than their Case 2 equivalents. The classification results in Table 4.3 to Table 4.20 show that, in most cases, the classification accuracy is slightly higher when rests are excluded from the MFDMap. Again, however, these differences are not significant if the standard deviations are taken into consideration.


4.7 Conclusion

In this chapter, the extreme learning machine algorithm and the regularized extreme learning machine were employed to train the single-hidden layer feedforward neural network classifier for Han Chinese folk song classification. There is no previous example of this technique being employed in folk song related research.

Overall, a classification accuracy of 73.90% is achieved for Han Chinese folk song classification using the extreme learning machine and 78.64% using the regularized extreme learning machine. These results are fairly good compared with previous research [51].

The confusion matrices in Section 4.5 show that Sichuan folk songs are the most difficult of the five classes to distinguish. This is because this particular class of folk songs possesses characteristics that are fairly close to those of the other classes, and the current music encoding method is not sophisticated enough to distinguish the differences.

A smaller MFDMap neither significantly improves the classification accuracy nor causes significant deterioration in the classification results. However, a smaller MFDMap allows some saving in training time and resources.

The author believes that the classification accuracy can be further improved if a more robust classifier, such as the one described in the next chapter, is employed for the classification task.




Chapter 5

The Finite Impulse Response Extreme Learning Machine Folk Song Classifier

In the previous chapter, the extreme learning machine and the regularized extreme learning machine were employed as classifiers for geographically based Han Chinese folk song classification. In this chapter, an improved variant of the extreme learning machine, called the finite impulse response extreme learning machine (FIR-ELM), is employed as the music classifier to study the effect of the new algorithm on Han Chinese folk song classification. The chapter begins with an overview of the FIR-ELM algorithm and proceeds to the design and setting of the experiments that verify the performance of the FIR-ELM on multi-class classification of Han Chinese folk songs. The chapter concludes with a discussion of the experimental results.

5.1 Introduction

As discussed in the previous chapter, the extreme learning machine algorithm greatly improves on the learning speed of conventional gradient descent-based learning algorithms for feedforward neural networks and yet manages to achieve good generalization performance. The ELM algorithm, designed for single-hidden layer feedforward neural networks, randomly assigns the input weights and hidden layer biases, after which the SLFN is simply treated as a linear network. The output weights of the SLFN are then analytically computed through a generalized inverse operation on the hidden layer output matrix. Such a technique eliminates the need to iteratively tune the network's parameters as in gradient descent-based algorithms.

Although the extreme learning machine greatly improves on the performance of conventional gradient descent-based algorithms, it still has drawbacks. The random assignment of the input weights and hidden layer biases results in a poor robustness property of the SLFN when the ELM algorithm is employed for signal processing with noisy data [19]. When the input weights and hidden layer biases are randomly assigned, input disturbances can cause large changes in the hidden layer output matrix, which subsequently result in large changes in the output weight matrix of the SLFN.

Two modified ELM algorithms were proposed in [34] and [111] to improve the robustness of the ELM algorithm in [33]. In these variants of the ELM, the cost function consists of the sum of the weighted error squares and the sum of the weighted output weight squares. This modification balances and reduces the structural and empirical risks through the optimization of the cost function in the output weight space and a proper choice of the weights of the error squares. However, the structural and empirical risks are not significantly reduced, and the robustness of the SLFN is not significantly improved, because the input weights and the hidden layer biases are still randomly assigned.

The statistical learning theory [112-118] shows that significant changes in the output weight matrix will largely increase both the structural risk and empirical risk of the SLFNs. Therefore, in order for the empirical risk and the structural risk to be significantly reduced, and for the robustness property of the network with respect to input disturbances to be improved, the input weights of the SLFNs need to be properly chosen instead of arbitrarily assigned.


5.2 Finite Impulse Response Extreme Learning Machine

The finite impulse response extreme learning machine is an improved variant of the ELM algorithm in which robustness is introduced through careful design of the input and output weights. The FIR-ELM learning algorithm is also designed for single-hidden layer feedforward neural networks. However, instead of having non-linear nodes in the hidden layer of the SLFN (such as those in the ELM), the hidden layer is designed with linear nodes and an input tapped-delay-line memory for signal processing purposes. The output layer contains linear nodes.

Since the output of each hidden neuron in the SLFN is the sum of the weighted input data, this design of the hidden layer (linear nodes with an input tapped-delay-line memory) enables each hidden neuron to be treated as an FIR filter. By using FIR filter design techniques from signal processing [119-122], these hidden neurons can be designed as a group of low-pass, high-pass, band-pass, band-stop or any other desired types of filters. This allows the hidden layer to act as a pre-processor that removes input disturbances and undesired frequency components from the input data. In addition, the structural and empirical risks of the SLFN can be greatly reduced in terms of the output of the SLFN.

In designing the output weight matrix for the SLFN, an objective function that includes both the weighted sum of the output error squares and the weighted sum of the output weight squares is chosen [26,34,111,123-125]. This objective function is then minimized in the output weight space. The use of the FIR filter design technique for the input weights, together with this objective function for the output weights, allows the structural and empirical risks to be reduced and balanced for signal processing purposes.

The network structure of a single-hidden layer feedforward neural network with linear hidden nodes and an input tapped-delay-line memory is shown in Figure 5.1. In the figure, D denotes a time-delay element; n – 1 such elements are added to the input layer to form the tapped-delay-line memory. The input sequence x(k), x(k – 1),…, x(k – n + 1) represents a time series consisting of the present observation x(k) and the past n – 1 observations of the process. The hidden layer consists of Ñ linear neurons and the output layer has m linear neurons.

Figure 5.1: A single-hidden layer neural network with linear neurons and time-delay elements.

The input data vector x(k) and the output data vector O(k) of the SLFN in Figure 5.1 can be expressed as follows:

$$\mathbf{x}(k) = \left[\, x(k) \;\; x(k-1) \;\; \cdots \;\; x(k-n+1) \,\right]^T \qquad (5.1)$$

$$\mathbf{o}(k) = \left[\, o_1(k) \;\; o_2(k) \;\; \cdots \;\; o_m(k) \,\right]^T. \qquad (5.2)$$

Then, the output of the ith hidden neuron is computed as

$$y_i(k) = \sum_{j=1}^{n} w_{ij}\, x(k-j+1) = \mathbf{w}_i^T \mathbf{x}(k) \qquad (5.3)$$

for $i = 1,2,\ldots,\tilde{N}$ and

$$\mathbf{w}_i = \left[\, w_{i1} \;\; w_{i2} \;\; \cdots \;\; w_{in} \,\right]^T \qquad (5.4)$$


for i = 1,2,…,Ñ. The ith output of the neural network, oi(k), is of the form

$$o_i(k) = \sum_{p=1}^{\tilde{N}} \beta_{pi}\, \mathbf{w}_p^T \mathbf{x}(k) \qquad (5.5)$$

for $i = 1,2,\ldots,m$. Thus, the output data vector of the SLFN, $\mathbf{o}(k)$, can be written as

$$\mathbf{o}(k) = \sum_{p=1}^{\tilde{N}} \boldsymbol{\beta}_p\, \mathbf{w}_p^T \mathbf{x}(k) \qquad (5.6)$$

where

$$\boldsymbol{\beta}_i = \left[\, \beta_{i1} \;\; \beta_{i2} \;\; \cdots \;\; \beta_{im} \,\right]^T \qquad (5.7)$$

for $i = 1,2,\ldots,\tilde{N}$.

The N distinct sample data vector pairs $(\mathbf{x}_i, \mathbf{t}_i)$ used to train the SLFN in Figure 5.1 can be expressed as

$$\mathbf{x}_i = \left[\, x_{i1} \;\; x_{i2} \;\; \cdots \;\; x_{in} \,\right]^T \qquad \text{and} \qquad (5.8)$$

$$\mathbf{t}_i = \left[\, t_{i1} \;\; t_{i2} \;\; \cdots \;\; t_{im} \,\right]^T \qquad (5.9)$$

for i = 1,2,…,N where xi is the desired input data vector and ti is the desired target data vector. Hence, for the ith input data vector xi, the corresponding neural network output vector oi can be expressed as

$$\mathbf{o}_i = \sum_{p=1}^{\tilde{N}} \boldsymbol{\beta}_p\, \mathbf{w}_p^T \mathbf{x}_i \qquad (5.10)$$

for $i = 1,2,\ldots,N$, and all N equations can then be written in matrix form:

$$\mathbf{H}\boldsymbol{\beta} = \mathbf{O} \qquad (5.11)$$

where


$$\mathbf{H} = \begin{bmatrix} h_{11} & h_{12} & \cdots & h_{1\tilde{N}} \\ h_{21} & h_{22} & \cdots & h_{2\tilde{N}} \\ \vdots & \vdots & \ddots & \vdots \\ h_{N1} & h_{N2} & \cdots & h_{N\tilde{N}} \end{bmatrix} = \begin{bmatrix} \mathbf{w}_1^T\mathbf{x}_1 & \cdots & \mathbf{w}_{\tilde{N}}^T\mathbf{x}_1 \\ \vdots & \ddots & \vdots \\ \mathbf{w}_1^T\mathbf{x}_N & \cdots & \mathbf{w}_{\tilde{N}}^T\mathbf{x}_N \end{bmatrix}, \qquad (5.12)$$

$$\boldsymbol{\beta} = \begin{bmatrix} \boldsymbol{\beta}_1^T \\ \boldsymbol{\beta}_2^T \\ \vdots \\ \boldsymbol{\beta}_{\tilde{N}}^T \end{bmatrix}_{\tilde{N} \times m} \qquad \text{and} \qquad (5.13)$$

$$\mathbf{O} = \begin{bmatrix} \mathbf{o}_1^T \\ \mathbf{o}_2^T \\ \vdots \\ \mathbf{o}_N^T \end{bmatrix}_{N \times m}. \qquad (5.14)$$

H is known as the hidden layer output matrix of the neural network [33], where the ith column of H contains the outputs of the ith hidden neuron with respect to the input vectors x1, x2,…, xN.

As discussed above, the robustness of the SLFN can be improved through a proper choice of its input weights. This can be achieved by designing the hidden layer as a group of FIR filters that remove input disturbances and undesired frequency components from the input data. It can be seen that (5.3) has the typical structure of an FIR filter [119-122]: the input weight set {wij} can be treated as the set of filter coefficients (the impulse response coefficients of the filter), the output yi(k) is the convolution sum of the filter impulse response and the input data, and the length of the filter is the number of inputs of the SLFN. Since it is possible to obtain a priori knowledge of the frequency responses from the training data sets of the neural network, the weights wij for the ith hidden neuron can be designed as follows:

$$w_{i1} = \hat{h}_{id}[0], \quad w_{i2} = \hat{h}_{id}[1], \quad \ldots, \quad w_{in} = \hat{h}_{id}[n-1] \qquad (5.15)$$

where


$$\hat{h}_{id}[k] = \frac{1}{2\pi}\int_{-\omega_c}^{\omega_c} e^{-j\omega(n-1)/2}\, e^{j\omega k}\, d\omega = \frac{\sin\!\left[\omega_c\left(k - (n-1)/2\right)\right]}{\pi\left(k - (n-1)/2\right)} \qquad (5.16)$$

for $k = 0, 1, \ldots, n-1$. In (5.16), $\hat{h}_{id}[k]$ is the impulse response of a truncated low-pass filter for the ith hidden neuron and ωc is the cutoff frequency of the low-pass filter. It is to be noted that similar design methods from signal processing [119-122] can be used to design the hidden neurons as high-pass, band-pass, band-stop or any other type of filter, for the purpose of pre-processing the input data to remove input disturbances and undesired frequency components.
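A sketch of the truncated low-pass design in (5.15)-(5.16) is given below. NumPy is assumed; np.sinc (which computes sin(πx)/(πx)) is used so that the centre point k = (n-1)/2 needs no special casing. The filter length and cutoff in the example are hypothetical.

```python
import numpy as np

def lowpass_fir_weights(n, cutoff):
    """Impulse response of a truncated ideal low-pass filter, cf. eq. (5.16):
    h[k] = sin(cutoff*(k-(n-1)/2)) / (pi*(k-(n-1)/2)), for k = 0,...,n-1."""
    m = np.arange(n) - (n - 1) / 2.0
    return (cutoff / np.pi) * np.sinc(cutoff * m / np.pi)

# Example: one hidden neuron of an SLFN with n inputs, designed as a low-pass FIR filter.
w_i = lowpass_fir_weights(n=64, cutoff=0.3 * np.pi)   # hypothetical length and cutoff
```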

The robust output weights of the SLFNs can be achieved by minimizing both the weighted sum of the output error squares and the weighted sum of the output weights squares of the SLFNs [26,34,111,123-125]:

$$\text{Minimize } \left\{ \frac{\gamma}{2}\|\boldsymbol{\varepsilon}\|^2 + \frac{d}{2}\|\boldsymbol{\beta}\|^2 \right\} \qquad (5.17)$$

subject to $\boldsymbol{\varepsilon} = \mathbf{O} - \mathbf{T} = \mathbf{H}\boldsymbol{\beta} - \mathbf{T} \qquad (5.18)$

where γ and d are constant balancing parameters for adjusting the balance between the empirical risk and the structural risk.

The above problem can be solved by using the method of Lagrange multipliers:

$$L = \frac{\gamma}{2}\sum_{i=1}^{N}\sum_{j=1}^{m}\varepsilon_{ij}^2 + \frac{d}{2}\sum_{i=1}^{\tilde{N}}\sum_{j=1}^{m}\beta_{ij}^2 - \sum_{k=1}^{N}\sum_{p=1}^{m}\lambda_{kp}\left(\mathbf{h}_k^T\boldsymbol{\beta}_p - T_{kp} - \varepsilon_{kp}\right) \qquad (5.19)$$

where εij is the ijth element of the error matrix ε, βij is the ijth element of the output weight matrix β, Tij is the ijth element of the output data matrix T, hk is the kth row of the hidden layer output matrix H (written as a column vector), βj is the jth column of the output weight matrix β, λij is the ijth Lagrange multiplier, and γ and d are constant parameters used to adjust the balance between the empirical risk and the structural risk.

Differentiate L in (5.19) with respect to βij to obtain


$$\frac{\partial L}{\partial \beta_{ij}} = d\beta_{ij} - \frac{\partial}{\partial \beta_{ij}}\left(\sum_{k=1}^{N}\sum_{p=1}^{\tilde{N}}\lambda_{kj}\,\beta_{pj}\,h_{kp}\right) = d\beta_{ij} - \sum_{k=1}^{N}\lambda_{kj}\,h_{ki} = d\beta_{ij} - \left(\lambda_{1j}h_{1i} + \lambda_{2j}h_{2i} + \cdots + \lambda_{Nj}h_{Ni}\right).$$ (5.20)

Then, setting $\partial L / \partial \beta_{ij} = 0$ and using (5.20), the following is obtained:

$$d\beta_{ij} = \lambda_{1j}h_{1i} + \lambda_{2j}h_{2i} + \cdots + \lambda_{Nj}h_{Ni} = \begin{bmatrix} h_{1i} & h_{2i} & \cdots & h_{Ni} \end{bmatrix}\begin{bmatrix} \lambda_{1j} \\ \lambda_{2j} \\ \vdots \\ \lambda_{Nj} \end{bmatrix}$$ (5.21)

and

$$d\begin{bmatrix} \beta_{i1} & \beta_{i2} & \cdots & \beta_{im} \end{bmatrix} = \begin{bmatrix} h_{1i} & h_{2i} & \cdots & h_{Ni} \end{bmatrix}\begin{bmatrix} \lambda_{11} & \lambda_{12} & \cdots & \lambda_{1m} \\ \lambda_{21} & \lambda_{22} & \cdots & \lambda_{2m} \\ \vdots & \vdots & & \vdots \\ \lambda_{N1} & \lambda_{N2} & \cdots & \lambda_{Nm} \end{bmatrix}.$$ (5.22)

Thus,

$d\boldsymbol{\beta} = \mathbf{H}^T\boldsymbol{\lambda}.$ (5.23)

Then, differentiate L with respect to ɛij to obtain

$$\frac{\partial L}{\partial \varepsilon_{ij}} = \gamma\varepsilon_{ij} + \lambda_{ij}.$$ (5.24)

Similarly, setting $\partial L / \partial \varepsilon_{ij} = 0$, the following relationship can be obtained:

$\boldsymbol{\lambda} = -\gamma\boldsymbol{\varepsilon}.$ (5.25)

Finally, by considering the constraint in (5.18), (5.25) is then expressed as


$\boldsymbol{\lambda} = -\gamma\left(\mathbf{H}\boldsymbol{\beta} - \mathbf{T}\right)$ (5.26)

and using (5.26) in (5.23) leads to

$d\boldsymbol{\beta} = -\gamma\,\mathbf{H}^T\left(\mathbf{H}\boldsymbol{\beta} - \mathbf{T}\right).$ (5.27)

Hence, the output weight matrix β of the SLFN can be obtained as follows:

$$\boldsymbol{\beta} = \left(\frac{d}{\gamma}\,\mathbf{I} + \mathbf{H}^T\mathbf{H}\right)^{-1}\mathbf{H}^T\mathbf{T}.$$ (5.28)
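A minimal numerical sketch of (5.28) is given below; the variable names are illustrative and a linear solve is used rather than an explicit matrix inverse.

```python
import numpy as np

def output_weights(H, T, d_over_gamma):
    """Regularized output weight solution of (5.28):
    beta = (d/gamma * I + H^T H)^(-1) H^T T.

    H : (N, N_tilde) hidden layer output matrix
    T : (N, m)       target matrix (1-of-c coded)
    """
    n_hidden = H.shape[1]
    A = d_over_gamma * np.eye(n_hidden) + H.T @ H
    return np.linalg.solve(A, H.T @ T)  # solve instead of explicitly inverting A
```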

5.3 Experiment Design and Setting

The FIR-ELM algorithm is employed on a single-hidden layer feedforward neural network for a multi-class classification task. The experimental designs and settings in this chapter are similar to those in Chapter 4 but are not exactly the same. Therefore, for clarity and convenience, all details are given in this section rather than by reference to Chapter 4.

This section begins with the explanations of the pre-processing phase where folk songs are encoded into feature vectors that are ready to be fed to the machine classifier for classification. Next, the post-processing phase where the classifier output is processed to obtain the classification results is outlined. Finally, the section concludes with a discussion on the parameter settings and network structure of the neural classifier.


5.3.1 Data Pre-Processing and Post-Processing

Data Pre-Processing

The inputs to a neural classifier must be in a form to which further mathematical operations can be applied. The folk songs in the Essen Folksong Collection are encoded into respective feature vectors using a novel encoding technique called the musical feature density map (MFDMap). The details of this encoding technique are presented in Chapter 3. There are two variants. Case 1 MFDMaps are designed such that all musical notes and rests are included in the final feature vectors. Case 2 MFDMaps completely omit the musical rests and use only sounded musical notes. Both designs of MFDMaps are used in the experiments.

Feature Selection and Dimensionality Reduction

The phenomenon of the curse-of-dimensionality [109-110] arises when the number of data samples is not large enough to represent each combination of values in a high-dimensional feature space, which leads to a poor representation of the mapping. The curse-of-dimensionality phenomenon typically leads to poor classification performance. Hence, it is important to ensure that such a phenomenon is prevented.

In order to investigate the effect of the curse-of-dimensionality, a set of MFDMaps of various sizes was designed from the two variants of MFDMaps (Case 1 and Case 2 MFDMaps); see Table 4.1 and Table 4.2. The criterion for selecting the features from the original MFDMaps to be included in the new reduced-size MFDMaps is based on the level of significance of each feature in each of the five classes of folk songs.
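The exact selection rule is defined in Chapter 3; the sketch below is one plausible reading of it, based on the later statement that a feature is kept when at least x% of the folk songs in some class possess it. The function and variable names are hypothetical.

```python
import numpy as np

def select_features(V, labels, x_percent):
    """Keep a feature if at least x_percent of the songs of some class use it.

    V        : (num_songs, num_features) MFDMap feature matrix
    labels   : (num_songs,) class label of each song
    x_percent: significance level x, e.g. 3, 5, 10, 15, 20, 30, 40 or 50
    Returns the indices of the retained features.
    """
    keep = np.zeros(V.shape[1], dtype=bool)
    for c in np.unique(labels):
        class_songs = V[labels == c]
        presence = (class_songs > 0).mean(axis=0) * 100.0  # % of songs in class c with the feature
        keep |= presence >= x_percent
    return np.flatnonzero(keep)
```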


Data Post-Processing

Similar to the experiments in Chapter 4, all experiments in this chapter employ the 1-of-c method to encode the targets for the neural classifiers. In the 1-of-c method, the number of target outputs is the same as the number of classes. The output with the highest activation is taken as the class label assigned by the neural classifier for a particular test sample. This method of post-processing the neural network outputs is known as the winner-takes-all method [22].
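A short sketch of the two steps described above, using hypothetical function names, is given below.

```python
import numpy as np

def one_of_c_targets(labels, num_classes):
    """1-of-c coding: one target output per class, '1' for the true class, '0' otherwise."""
    T = np.zeros((len(labels), num_classes))
    T[np.arange(len(labels)), labels] = 1.0
    return T

def winner_takes_all(outputs):
    """Assign each test sample to the class whose network output is largest."""
    return np.argmax(outputs, axis=1)
```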

5.3.2 Parameter Setting

The single-hidden layer feedforward neural network is employed as the folk song classifier. In this chapter, the neural network is trained using the finite impulse response extreme learning machine algorithm. It is to be noted that the SLFN structure has both linear hidden neurons and linear output neurons, i.e. no activation function is employed for the neurons in either the hidden or the output layer.

As in Chapter 4, the training and testing data sets for all experiments in this chapter are constructed from a total of 333 Han Chinese folk songs of five different classes. The division of the training and testing data in each of the five classes is 90% and 10% respectively. A 10-fold cross-validation method is employed to assess all 333 folk songs. Using this method, each experiment is repeated 10 times, employing a different set of testing data at each repetition.
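A minimal sketch of such a 10-fold split is shown below. For brevity it permutes all samples together, whereas the thesis splits 90%/10% within each class; a stratified split per class would follow the same pattern.

```python
import numpy as np

def ten_fold_splits(num_songs, seed=0):
    """Yield (train, test) index pairs; each fold serves once as the ~10% test set."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(num_songs), 10)
    for k in range(10):
        test_idx = folds[k]
        train_idx = np.hstack([folds[j] for j in range(10) if j != k])
        yield train_idx, test_idx
```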

In the experiments, four commonly used filters are employed: low-pass, high-pass, band-pass and band-stop filters. As described in Section 5.2, the cutoff frequencies used for the four filters are normalized cutoff frequencies. Various cutoff frequencies are used to investigate the effect of the filtered data on the classification results. The range of the normalized cutoff frequency ωc is set from 0.1 to 0.9, with a step size of 0.1, for both the low-pass and high-pass filters. For the band-pass and band-stop filters, a bandwidth of ±0.05 around each frequency in the same range (i.e. 0.1 to 0.9) is used. It is to be noted that the rectangular window method is employed for the design of the input weights.

In (5.28), there are two balancing parameters, d and γ, employed in the design of the robust network output weights. These parameters are used to balance and reduce the empirical risk and the structural risk of the neural network. In this chapter, a set of different values is used to investigate the effect of the ratio d/γ on classifier performance. The values assigned to the ratio d/γ are 0.001, 0.01, 0.1, 1, 10, 100 and 1000.
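Taken together with the previous paragraph, the experiment settings amount to a sweep over filter type, cutoff frequency and the ratio d/γ; a hypothetical sketch of that sweep is shown below.

```python
from itertools import product

# Hypothetical sweep over the settings quoted in this section.
filters = ["low-pass", "high-pass", "band-pass", "band-stop"]
cutoffs = [round(0.1 * i, 1) for i in range(1, 10)]   # 0.1, 0.2, ..., 0.9
ratios = [0.001, 0.01, 0.1, 1, 10, 100, 1000]         # d / gamma

for filt, wc, d_over_gamma in product(filters, cutoffs, ratios):
    pass  # design the input weights for (filt, wc), train the FIR-ELM, record the 10-fold accuracy
```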

Unlike the extreme learning machine classifiers discussed in the previous chapter, the FIR-ELM classifier has no random elements, i.e. no random assignment of weights or other network parameters. Hence, no repetition of the experiments is required (apart from the 10-fold cross-validation).

In all experiments, the number of hidden neurons employed in the hidden layer of the neural classifier starts from one hidden neuron and is gradually increased to a maximum of 10000 hidden neurons. The results of each experiment are shown in the next section.

5.4 Experiment Results

This section presents the results of the experiments designed to verify the performance of the finite impulse response extreme learning machine classifier for geographically based Han Chinese folk song classification. The details of the experiment settings are presented in the previous section.

Table 5.1 to Table 5.9 present the results for Case 1 MFDMaps using the original MFDMap and eight reduced-size MFDMaps. The results for Case 2 MFDMaps are presented in Table 5.10 to Table 5.18. The sequence of both sets of results is as follows: (1) original MFDMap, (2) reduced-size map with x = 3, (3) reduced-size map with x = 5, (4) reduced-size map with x = 10, (5) reduced-size map with x = 15, (6) reduced-size map with x = 20, (7) reduced-size map with x = 30, (8) reduced-size map with x = 40 and (9) reduced-size map with x = 50. Recall that x is the significance level required to include a particular feature in a MFDMap. In these tables, μ represents the mean classification accuracy across all 10 cross-validation data sets and σ represents the standard deviation of the mean.

Table 5.19 and Table 5.20 show the confusion matrices for the Case 1 and Case 2 MFDMaps at significance value x = 15 with 500 hidden neurons. It is to be noted that all classification results reported in this section are obtained using the four filters at cutoff frequency ωc = 0.6 and with d/γ = 0.001.

Table 5.21 presents a comparison of the classification accuracy between five machine classifiers: the resilient propagation (RPROP) [126] single-hidden layer feedforward neural network, the extreme learning machine (ELM) [33], the regularized extreme learning machine (R-ELM) [34], the finite impulse response extreme learning machine (FIR-ELM) and the support vector machine (SVM) [27]. In the table, the linear kernel is used in the SVM classifier as it performs the best among some of the popular kernels such as the Gaussian radial basis function kernel and the polynomial kernel.

Table 5.1: Classification accuracy using Case 1 MFDMap with the original map size

(map size = 172, ωc = 0.6, d/γ = 0.001).

Number of        Low-Pass        High-Pass       Band-Pass       Band-Stop
Hidden Neurons   μ (%)   σ (%)   μ (%)   σ (%)   μ (%)   σ (%)   μ (%)   σ (%)
50               57.89   5.67    59.41   5.19    63.91   5.00    44.11   6.22
100              66.63   5.39    66.32   5.02    71.44   6.42    57.30   5.75
500              74.45   5.86    74.73   7.30    71.73   5.79    74.15   6.05
1000             74.43   5.55    73.51   6.49    70.23   6.96    72.96   6.60
1500             72.65   5.86    72.59   6.13    71.73   5.80    72.56   6.06
2000             71.44   5.77    72.59   5.51    71.72   6.20    72.56   5.48
2500             71.14   5.80    72.59   5.43    71.73   5.43    72.56   5.57
3000             71.14   5.75    72.59   6.27    71.72   5.69    71.76   5.92
4000             71.13   6.62    72.31   6.78    71.14   5.42    71.76   5.90
5000             70.82   6.31    72.30   5.15    70.84   6.20    71.02   5.57
8000             70.22   6.03    70.21   5.78    70.53   6.65    70.57   6.09
10000            69.89   5.27    69.78   5.96    69.99   5.70    69.81   6.63


Table 5.2: Classification accuracy using Case 1 MFDMap with x = 3

(map size = 121, ωc = 0.6, d/γ = 0.001).

Number of        Low-Pass        High-Pass       Band-Pass       Band-Stop
Hidden Neurons   μ (%)   σ (%)   μ (%)   σ (%)   μ (%)   σ (%)   μ (%)   σ (%)
50               64.23   5.87    64.23   6.25    54.91   7.01    66.64   5.77
100              75.93   5.90    71.41   6.38    59.11   6.01    78.04   6.05
500              75.64   4.90    72.32   6.39    63.31   6.15    78.07   5.87
1000             75.34   5.61    72.30   5.26    66.02   5.53    77.75   5.91
1500             75.33   5.30    71.69   6.27    68.14   6.84    76.44   5.48
2000             74.72   5.81    71.69   5.95    68.75   6.06    75.15   5.84
2500             74.44   6.26    71.69   5.62    68.73   6.27    74.75   6.38
3000             74.14   6.76    71.69   5.65    68.14   5.52    74.44   6.87
4000             73.53   6.90    71.66   6.64    67.84   6.43    74.15   5.42
5000             73.52   5.94    71.30   5.60    67.84   6.19    73.86   7.19
8000             72.01   5.84    71.00   5.38    67.22   5.42    73.86   6.76
10000            71.56   6.41    70.50   6.11    66.85   6.02    72.94   6.08

Table 5.3: Classification accuracy using Case 1 MFDMap with x = 5

(map size = 101, ωc = 0.6, d/γ = 0.001).

Number of        Low-Pass        High-Pass       Band-Pass       Band-Stop
Hidden Neurons   μ (%)   σ (%)   μ (%)   σ (%)   μ (%)   σ (%)   μ (%)   σ (%)
50               61.22   5.85    61.22   5.66    66.64   7.31    54.91   6.08
100              73.26   5.65    72.63   6.59    72.34   6.28    74.75   5.86
500              75.94   6.42    75.83   6.40    74.13   6.15    80.13   6.58
1000             75.64   5.65    73.93   6.14    72.91   5.61    75.06   5.43
1500             74.45   5.77    72.63   6.00    72.01   5.47    74.44   6.34
2000             74.44   6.44    72.32   6.18    71.71   5.12    74.15   5.67
2500             74.42   6.22    72.22   7.76    71.61   5.79    74.00   5.80
3000             74.42   6.45    72.22   5.94    71.55   5.47    73.84   5.66
4000             74.41   6.25    72.21   5.22    71.31   6.32    73.04   6.29
5000             74.40   5.80    72.30   6.96    71.11   5.84    73.04   5.61
8000             72.91   5.74    72.00   6.30    70.58   6.88    72.73   5.47
10000            71.56   6.40    71.50   5.68    70.02   6.76    71.93   6.28


Table 5.4: Classification accuracy using Case 1 MFDMap with x = 10

(map size = 81, ωc = 0.6, d/γ = 0.001).

Number of        Low-Pass        High-Pass       Band-Pass       Band-Stop
Hidden Neurons   μ (%)   σ (%)   μ (%)   σ (%)   μ (%)   σ (%)   μ (%)   σ (%)
50               65.74   5.79    69.32   6.05    66.60   5.17    66.02   5.62
100              76.52   6.18    77.44   5.58    70.84   6.97    78.04   5.44
500              75.63   5.82    76.83   5.82    72.62   5.46    80.45   6.04
1000             74.73   6.13    74.13   5.91    72.62   6.11    78.05   7.05
1500             74.73   4.72    73.82   5.76    71.72   6.55    77.17   5.64
2000             73.82   6.23    73.82   6.42    71.42   6.07    75.05   5.86
2500             73.51   6.93    73.82   7.27    71.42   7.15    74.94   6.58
3000             73.20   6.52    73.73   5.34    71.13   7.38    74.64   6.61
4000             72.91   6.46    73.53   6.06    70.55   6.07    74.04   6.24
5000             72.58   5.88    72.32   5.28    70.23   5.05    73.73   6.51
8000             71.68   6.09    72.01   6.65    69.64   5.82    72.93   6.44
10000            70.59   6.12    71.55   6.70    69.04   5.58    71.69   5.81

Table 5.5: Classification accuracy using Case 1 MFDMap with x = 15

(map size = 71, ωc = 0.6, d/γ = 0.001).

Number of        Low-Pass        High-Pass       Band-Pass       Band-Stop
Hidden Neurons   μ (%)   σ (%)   μ (%)   σ (%)   μ (%)   σ (%)   μ (%)   σ (%)
50               68.73   6.21    69.63   5.61    73.52   6.26    67.51   5.52
100              76.52   5.85    75.63   6.39    73.24   6.43    79.22   5.18
500              75.62   5.55    75.37   6.70    72.44   6.67    81.03   6.38
1000             74.43   6.32    73.24   5.73    71.97   4.75    80.45   6.60
1500             74.71   6.03    73.54   6.96    71.45   5.92    78.33   6.82
2000             74.02   5.91    73.34   5.91    70.85   6.18    75.63   5.23
2500             74.02   6.15    73.24   5.88    70.25   6.36    74.64   5.33
3000             73.43   6.49    72.63   5.55    70.25   5.35    74.43   5.26
4000             73.12   6.20    72.63   5.60    69.74   5.50    74.03   5.98
5000             72.93   6.10    72.33   5.52    69.35   6.40    73.63   5.69
8000             71.63   6.14    72.02   6.18    69.32   5.94    72.72   6.66
10000            71.03   6.03    71.51   6.80    68.87   6.28    72.11   5.27


Table 5.6: Classification accuracy using Case 1 MFDMap with x = 20

(map size = 63, ωc = 0.6, d/γ = 0.001).

Number of        Low-Pass        High-Pass       Band-Pass       Band-Stop
Hidden Neurons   μ (%)   σ (%)   μ (%)   σ (%)   μ (%)   σ (%)   μ (%)   σ (%)
50               70.22   5.13    72.62   5.85    72.63   6.37    73.50   5.85
100              74.39   6.10    78.65   6.91    74.40   5.59    75.02   6.23
500              76.82   6.60    72.91   6.46    74.12   6.29    78.32   5.86
1000             75.93   5.60    72.92   5.97    73.83   6.14    75.64   6.22
1500             73.83   5.37    73.82   6.65    73.23   6.57    75.03   5.93
2000             74.43   5.93    74.43   5.48    73.54   5.79    74.72   5.99
2500             73.53   5.18    74.43   5.83    73.52   6.32    74.72   6.23
3000             73.51   6.01    74.43   6.71    73.41   6.40    74.72   6.68
4000             72.93   6.41    74.14   6.75    72.94   5.55    74.13   6.23
5000             72.65   6.11    74.13   6.37    72.94   6.08    74.13   6.82
8000             72.34   5.05    73.83   6.25    72.34   6.80    73.53   4.99
10000            72.00   5.73    73.65   5.71    72.13   6.06    73.04   5.78

Table 5.7: Classification accuracy using Case 1 MFDMap with x = 30

(map size = 55, ωc = 0.6, d/γ = 0.001).

Number of        Low-Pass        High-Pass       Band-Pass       Band-Stop
Hidden Neurons   μ (%)   σ (%)   μ (%)   σ (%)   μ (%)   σ (%)   μ (%)   σ (%)
50               70.80   6.12    69.95   7.18    76.25   5.68    70.83   4.86
100              72.91   5.58    74.47   5.76    73.55   5.98    73.50   5.18
500              73.52   5.36    74.72   6.32    72.63   7.34    77.12   6.21
1000             74.13   6.31    73.22   5.48    74.11   5.43    76.32   5.67
1500             74.11   6.31    73.52   6.67    73.81   6.28    75.84   5.85
2000             72.92   6.14    73.41   5.52    73.51   5.46    73.82   5.25
2500             72.92   6.20    73.30   6.10    73.51   6.52    73.85   5.55
3000             72.82   5.56    73.30   5.69    73.21   6.16    73.82   5.80
4000             72.51   5.75    73.21   6.26    73.12   6.33    73.53   5.64
5000             72.31   5.95    73.12   6.01    72.82   5.86    73.21   5.57
8000             71.92   5.66    72.81   5.98    72.41   6.12    72.91   5.79
10000            71.69   6.17    72.17   7.47    71.54   6.74    72.75   5.53


Table 5.8: Classification accuracy using Case 1 MFDMap with x = 40

(map size = 47, ωc = 0.6, d/γ = 0.001).

Number of        Low-Pass        High-Pass       Band-Pass       Band-Stop
Hidden Neurons   μ (%)   σ (%)   μ (%)   σ (%)   μ (%)   σ (%)   μ (%)   σ (%)
50               68.71   6.67    75.04   5.71    72.34   6.39    72.02   5.95
100              72.94   5.51    75.33   5.08    75.92   6.19    76.23   5.87
500              75.94   6.91    76.53   5.78    73.83   6.35    76.22   6.60
1000             73.84   5.81    73.82   6.47    73.24   5.94    73.52   6.30
1500             74.14   5.27    73.53   6.36    73.24   5.98    73.51   6.27
2000             73.83   5.69    73.22   7.14    73.24   6.04    72.91   5.28
2500             73.83   6.47    73.13   6.08    72.84   5.61    73.04   5.52
3000             73.83   6.53    73.51   4.92    72.24   6.71    73.84   6.10
4000             72.83   6.08    73.51   6.84    72.14   6.00    73.75   5.83
5000             72.53   6.14    73.13   6.64    71.74   6.34    73.55   6.65
8000             72.73   6.32    72.91   5.71    71.14   5.57    73.35   6.67
10000            72.00   5.27    72.50   6.11    70.64   5.46    72.34   5.71

Table 5.9: Classification accuracy using Case 1 MFDMap with x = 50

(map size = 40, ωc = 0.6, d/γ = 0.001).

Number of        Low-Pass        High-Pass       Band-Pass       Band-Stop
Hidden Neurons   μ (%)   σ (%)   μ (%)   σ (%)   μ (%)   σ (%)   μ (%)   σ (%)
50               70.53   6.44    72.92   6.89    68.70   6.14    69.59   6.67
100              71.13   6.70    73.82   6.11    70.21   6.74    70.82   7.06
500              74.14   6.16    73.84   7.37    71.71   6.72    76.01   6.03
1000             73.23   6.81    74.45   5.85    73.81   5.99    75.81   6.08
1500             72.93   6.53    74.14   6.28    72.91   6.46    75.44   5.68
2000             72.34   6.11    73.85   6.79    73.83   5.84    74.85   6.81
2500             72.22   6.44    73.85   7.36    73.81   6.33    74.82   5.96
3000             71.93   6.10    73.25   6.15    72.91   6.96    73.83   5.76
4000             71.65   5.79    72.95   5.60    72.61   6.08    73.44   7.09
5000             71.56   6.18    72.74   6.40    72.60   5.85    73.14   6.40
8000             71.24   6.02    72.33   5.34    72.30   5.75    72.25   6.36
10000            70.89   5.82    71.81   5.86    71.72   6.36    72.17   5.50


Table 5.10: Classification accuracy using Case 2 MFDMap with the original map size

(map size = 145, ωc = 0.6, d/γ = 0.001).

Number of        Low-Pass        High-Pass       Band-Pass       Band-Stop
Hidden Neurons   μ (%)   σ (%)   μ (%)   σ (%)   μ (%)   σ (%)   μ (%)   σ (%)
50               58.81   6.22    59.13   6.60    61.82   5.89    61.51   6.31
100              68.47   6.26    72.96   6.26    71.16   5.56    73.84   5.86
500              79.27   5.45    77.17   6.20    75.36   5.84    76.87   6.30
1000             78.96   5.89    75.94   5.76    75.64   5.61    74.46   5.88
1500             78.98   5.80    75.94   5.88    76.54   5.82    73.84   5.11
2000             78.36   6.26    75.15   6.31    76.23   6.06    73.84   4.83
2500             78.10   5.50    75.96   6.84    76.84   6.09    73.55   5.14
3000             77.74   6.54    75.96   6.28    76.14   5.89    73.54   5.88
4000             77.45   6.89    75.26   5.40    76.05   5.92    73.45   5.69
5000             77.44   5.85    75.26   6.22    75.45   6.02    73.15   5.64
8000             76.84   6.00    74.86   5.95    75.06   6.23    73.14   6.02
10000            76.43   6.25    74.23   5.88    74.87   6.64    73.05   5.67

Table 5.11: Classification accuracy using Case 2 MFDMap with x = 3

(map size = 102, ωc = 0.6, d/γ = 0.001).

Number of        Low-Pass        High-Pass       Band-Pass       Band-Stop
Hidden Neurons   μ (%)   σ (%)   μ (%)   σ (%)   μ (%)   σ (%)   μ (%)   σ (%)
50               60.63   5.68    59.71   5.71    63.05   6.08    64.23   6.09
100              75.35   6.30    77.42   6.25    74.14   5.62    78.95   6.46
500              80.19   6.39    79.25   6.50    75.34   5.91    80.75   5.94
1000             79.86   7.22    78.95   6.54    78.65   5.90    80.14   6.79
1500             80.16   6.15    78.65   6.39    77.75   6.45    79.23   6.28
2000             79.86   6.03    78.04   4.87    79.55   6.21    79.55   5.79
2500             78.96   5.71    79.56   5.72    79.86   6.27    79.25   5.92
3000             78.95   5.90    79.25   6.45    79.57   6.07    79.24   5.86
4000             78.35   5.97    78.86   6.20    79.27   5.82    78.94   6.12
5000             78.35   5.12    78.56   6.00    79.26   6.03    78.93   6.38
8000             78.05   5.87    78.06   6.22    78.88   6.11    78.64   5.85
10000            77.99   6.37    77.86   6.57    78.28   5.30    78.24   6.23


Table 5.12: Classification accuracy using Case 2 MFDMap with x = 5

(map size = 88, ωc = 0.6, d/γ = 0.001).

Number of        Low-Pass        High-Pass       Band-Pass       Band-Stop
Hidden Neurons   μ (%)   σ (%)   μ (%)   σ (%)   μ (%)   σ (%)   μ (%)   σ (%)
50               65.45   6.88    65.45   4.89    69.97   6.09    63.60   7.00
100              72.99   6.47    77.74   6.22    74.76   6.03    81.04   6.71
500              77.78   6.41    78.96   6.00    74.75   6.34    82.25   6.01
1000             80.17   5.59    80.34   5.62    76.85   5.30    79.85   5.53
1500             79.87   5.73    78.35   6.20    77.76   6.71    78.64   5.13
2000             80.47   6.12    79.55   5.60    77.75   5.55    78.65   6.01
2500             80.45   5.95    79.45   6.43    77.65   6.02    79.55   6.11
3000             79.34   5.19    79.45   6.03    77.58   5.82    79.55   6.52
4000             79.17   5.24    79.17   5.18    77.25   6.07    78.65   5.52
5000             79.17   6.51    79.17   4.79    77.23   5.03    78.56   6.40
8000             78.78   5.62    79.17   5.86    76.97   6.76    78.55   6.04
10000            78.15   7.04    79.05   6.57    76.81   6.27    78.17   5.61

Table 5.13: Classification accuracy using Case 2 MFDMap with x = 10

(map size = 73, ωc = 0.6, d/γ = 0.001).

Number of        Low-Pass        High-Pass       Band-Pass       Band-Stop
Hidden Neurons   μ (%)   σ (%)   μ (%)   σ (%)   μ (%)   σ (%)   μ (%)   σ (%)
50               67.83   6.39    71.17   6.18    71.76   5.57    66.93   5.80
100              74.47   6.13    80.76   6.94    72.05   5.29    81.64   5.26
500              81.37   5.88    81.36   5.85    74.71   5.74    83.76   5.83
1000             80.17   6.94    80.77   5.49    78.05   5.46    85.57   5.34
1500             80.78   6.30    81.16   5.88    79.28   5.65    82.56   5.71
2000             79.57   5.94    81.06   5.76    79.55   5.85    81.08   5.63
2500             79.88   6.14    81.06   5.84    79.57   5.40    82.86   5.41
3000             79.88   6.04    80.45   6.24    79.26   5.99    81.39   4.97
4000             79.56   6.09    80.45   5.93    79.26   6.17    80.78   6.28
5000             79.56   6.23    80.14   5.70    79.23   5.52    80.49   7.00
8000             79.26   5.70    79.85   5.78    78.97   6.56    80.18   7.11
10000            78.88   5.60    79.24   5.34    78.66   5.21    79.93   5.75


Table 5.14: Classification accuracy using Case 2 MFDMap with x = 15

(map size = 63, ωc = 0.6, d/γ = 0.001).

Number of        Low-Pass        High-Pass       Band-Pass       Band-Stop
Hidden Neurons   μ (%)   σ (%)   μ (%)   σ (%)   μ (%)   σ (%)   μ (%)   σ (%)
50               72.34   5.98    71.16   5.16    70.84   5.75    70.53   5.75
100              73.83   5.77    79.87   6.21    75.15   6.20    78.64   5.69
500              80.46   6.04    80.45   6.25    77.62   6.54    85.58   5.53
1000             79.87   5.54    79.26   6.04    78.66   6.48    82.85   5.86
1500             79.28   5.04    79.56   6.08    78.93   6.14    81.35   5.90
2000             79.17   5.98    79.26   5.74    78.93   5.68    80.76   6.07
2500             80.47   5.39    79.55   6.36    78.93   6.31    82.25   5.56
3000             79.88   5.05    79.55   5.58    78.64   5.96    81.65   6.04
4000             79.86   7.19    79.24   5.60    78.63   6.69    80.45   5.70
5000             79.56   5.88    79.24   6.36    78.23   5.30    80.15   5.82
8000             79.25   6.20    78.64   6.84    77.93   6.07    79.85   5.58
10000            78.78   6.60    78.36   5.81    77.64   6.64    79.55   5.86

Table 5.15: Classification accuracy using Case 2 MFDMap with x = 20

(map size = 58, ωc = 0.6, d/γ = 0.001).

Number of        Low-Pass        High-Pass       Band-Pass       Band-Stop
Hidden Neurons   μ (%)   σ (%)   μ (%)   σ (%)   μ (%)   σ (%)   μ (%)   σ (%)
50               74.73   7.78    72.66   6.00    75.34   6.41    74.14   5.29
100              77.11   7.70    78.05   6.57    76.24   5.89    77.44   5.98
500              79.63   6.57    79.93   6.21    76.84   6.13    81.94   6.13
1000             78.56   6.39    79.65   6.09    78.34   5.55    81.07   6.26
1500             78.96   5.36    79.85   6.33    79.86   4.92    80.45   6.34
2000             78.36   5.71    79.25   6.29    78.95   6.05    78.65   6.34
2500             78.97   5.69    79.54   5.19    78.96   5.52    80.45   6.34
3000             78.36   6.27    79.25   6.01    78.93   6.59    80.45   5.84
4000             78.35   6.24    79.25   5.79    78.85   6.87    80.06   6.19
5000             78.15   6.05    79.16   4.98    78.55   5.81    79.87   4.87
8000             77.86   5.83    78.76   5.34    77.93   6.60    79.26   5.42
10000            77.15   5.54    78.25   5.84    77.15   6.48    78.84   6.60


Table 5.16: Classification accuracy using Case 2 MFDMap with x = 30

(map size = 49, ωc = 0.6, d/γ = 0.001).

Number of        Low-Pass        High-Pass       Band-Pass       Band-Stop
Hidden Neurons   μ (%)   σ (%)   μ (%)   σ (%)   μ (%)   σ (%)   μ (%)   σ (%)
50               72.92   5.67    76.85   5.73    75.04   6.21    69.31   5.80
100              74.42   6.16    77.35   5.87    75.94   5.53    72.02   4.91
500              77.16   6.75    78.43   5.45    78.34   6.42    79.83   5.40
1000             76.53   5.56    76.85   6.27    77.46   6.55    77.16   6.18
1500             76.25   5.52    76.25   4.91    76.86   6.16    77.16   7.21
2000             76.24   5.75    76.25   6.07    76.55   6.01    76.84   6.37
2500             76.54   6.22    76.55   6.33    76.55   5.42    76.74   6.15
3000             76.24   5.90    76.46   5.58    76.16   6.25    76.74   5.73
4000             76.15   5.51    76.16   5.97    75.85   5.39    76.16   6.06
5000             76.05   5.21    76.06   6.15    75.78   6.80    76.16   6.48
8000             75.76   5.81    75.85   5.69    75.05   6.26    75.81   5.87
10000            75.06   5.89    75.16   5.99    74.87   6.13    75.45   6.95

Table 5.17: Classification accuracy using Case 2 MFDMap with x = 40

(map size = 44, ωc = 0.6, d/γ = 0.001).

Number of        Low-Pass        High-Pass       Band-Pass       Band-Stop
Hidden Neurons   μ (%)   σ (%)   μ (%)   σ (%)   μ (%)   σ (%)   μ (%)   σ (%)
50               71.44   5.39    69.95   6.18    74.14   5.76    72.34   6.50
100              73.83   5.81    74.75   5.37    74.14   5.29    75.04   6.21
500              75.36   6.03    76.21   5.48    76.52   6.49    79.54   5.50
1000             77.45   5.02    77.44   5.79    77.13   6.14    77.44   5.68
1500             77.16   6.28    77.14   6.04    76.84   6.58    76.27   5.48
2000             77.15   6.04    76.84   6.90    76.74   5.55    76.56   6.25
2500             76.85   6.04    76.75   6.40    76.74   6.12    76.65   6.17
3000             76.85   6.40    76.75   5.78    76.64   5.67    76.45   5.18
4000             76.55   6.52    76.44   5.69    76.33   6.64    76.45   5.05
5000             76.25   6.40    76.14   6.77    76.03   5.72    76.04   5.98
8000             75.85   5.77    75.54   7.37    75.74   5.96    75.85   6.35
10000            75.46   5.98    75.25   6.08    75.15   5.83    75.45   6.08


Table 5.18: Classification accuracy using Case 2 MFDMap with x = 50

(map size = 37, ωc = 0.6, d/γ = 0.001).

Number of        Low-Pass        High-Pass       Band-Pass       Band-Stop
Hidden Neurons   μ (%)   σ (%)   μ (%)   σ (%)   μ (%)   σ (%)   μ (%)   σ (%)
50               69.62   5.23    72.93   7.37    72.04   6.27    69.31   5.04
100              71.73   5.22    75.34   5.24    71.74   5.93    72.02   6.31
500              77.73   6.43    77.45   6.22    75.63   6.31    77.16   6.38
1000             77.13   5.93    77.15   5.89    76.84   6.00    77.16   6.11
1500             76.85   5.81    76.86   5.59    76.86   6.55    76.84   5.61
2000             76.55   6.66    76.85   5.58    76.55   5.91    76.54   6.00
2500             76.26   5.60    76.86   6.25    76.45   5.44    76.54   6.05
3000             76.26   6.07    76.85   7.16    76.05   6.12    76.45   6.47
4000             76.03   6.21    76.85   5.60    75.84   6.78    76.22   6.33
5000             75.96   6.41    76.27   6.27    75.75   5.40    75.98   5.82
8000             75.12   5.57    75.96   5.72    75.05   5.88    75.44   6.81
10000            74.84   6.18    75.75   6.99    74.44   6.50    74.99   5.97

Table 5.19: Confusion matrix for Case 1 MFDMap with x = 15 at 500 hidden neurons

(map size = 71, ωc = 0.6, d/γ = 0.001).

                     Class 1     Class 2    Class 3     Class 4       Class 5
                     (Dongbei)   (Shanxi)   (Sichuan)   (Guangdong)   (Jiangsu)
Class 1 (Dongbei)    6.3         0.1        0           0.1           0.5
Class 2 (Shanxi)     0.2         6.8        0.2         0             0.4
Class 3 (Sichuan)    0.5         0.1        2.3         0.1           1.2
Class 4 (Guangdong)  0.5         0.2        0.6         4.1           0.6
Class 5 (Jiangsu)    0.8         0          0           0.2           7.5


Table 5.20: Confusion matrix for Case 2 MFDMap with x = 15 at 500 hidden neurons

(map size = 63, ωc = 0.6, d/γ = 0.001).

                     Class 1     Class 2    Class 3     Class 4       Class 5
                     (Dongbei)   (Shanxi)   (Sichuan)   (Guangdong)   (Jiangsu)
Class 1 (Dongbei)    6.1         0.1        0           0.2           0.6
Class 2 (Shanxi)     0.1         6.9        0.2         0.1           0.3
Class 3 (Sichuan)    0.1         0          2.9         0.5           0.7
Class 4 (Guangdong)  0.3         0.2        0.3         4.7           0.5
Class 5 (Jiangsu)    0.4         0.1        0           0.1           7.9

Table 5.21: Classification accuracy of the RPROP, ELM, R-ELM, FIR-ELM and SVM classifier.

Classifier               RPROP   ELM     R-ELM   FIR-ELM   SVM
Accuracy (%)             59.84   73.90   78.64   85.58     63.92
Standard Deviation (%)   9.40    7.47    7.91    5.53      11.29

5.5 Discussion

Overall, the best performance achieved in all experiments is 85.58%, using the band-stop filter at cutoff frequency ωc = 0.6 and ratio d/γ = 0.001. The MFDMap used to represent the folk songs is the Case 2 MFDMap of size 63. This MFDMap is constructed using features possessed by at least 15% of the folk songs in a particular class. For Case 1 MFDMaps, the best accuracy achieved is 81.03%. This result is also obtained using the band-stop filter at cutoff frequency ωc = 0.6 and ratio d/γ = 0.001, and with a similar MFDMap structure, i.e. the MFDMap of size 71 constructed using features possessed by at least 15% of the folk songs in a particular class. Both results are achieved using 500 hidden neurons in the SLFN.

It can be seen that, in most cases, the neural classifier achieves its best performance within 1000 hidden neurons. The number of hidden neurons required is many times smaller than that used by the ELM and R-ELM classifiers presented in the previous chapter, and yet the accuracy is much better (85.58% compared with 73.90% for the ELM classifier and 78.64% for the R-ELM). This can be interpreted as evidence of the robustness of the FIR-ELM design.

In the discussion in Chapter 4, the experimental results did not reveal much of a relationship between the size of the MFDMaps (i.e. the size of the input vectors) and the classification performance. However, in this chapter, with the FIR filtering capability, the "cleaner" input samples allow the effect of using various MFDMaps to be revealed. The most obvious pattern appears among the results using the band-stop FIR-ELM. The classification accuracies of the nine different MFDMaps follow a bell-like pattern with the peak at the MFDMap with x = 15. In other words, starting from the original MFDMap (the MFDMap with the largest size), the classification accuracy improves as the size is reduced, up to the peak at the fifth MFDMap (x = 15), after which the accuracy starts to deteriorate. This pattern is consistent in both Case 1 and Case 2 MFDMaps. This interesting pattern suggests that the MFDMap with x = 15 might be the optimal design to adopt. In addition, the pattern in the classification accuracies shows a trace of the curse-of-dimensionality phenomenon, where a large input dimensionality results in poorer performance.

Overall, the band-stop FIR-ELM shows a fairly consistent performance across all MFDMaps. Its best accuracy in each of the MFDMap cases is almost always achieved using 500 hidden neurons. Although this is not always the case, in the majority of cases the band-stop FIR-ELM performed the best among the four filters.

The band-pass FIR-ELM, on the other hand, is the worst among the four. For Case 2 MFDMaps, unlike the band-stop FIR-ELM, the band-pass FIR-ELM always requires more hidden neurons, usually double that required by the band-stop FIR-ELM. The band-pass FIR-ELM typically requires about 1500 hidden neurons and yet its performance is not as good as that achieved by the band-stop FIR-ELM. However, the situation is somewhat different for Case 1 MFDMaps, where the band-pass FIR-ELM starts off with better accuracy and then deteriorates. This phenomenon can be seen in Table 5.5 to Table 5.8.

Recall that the reason for having Case 1 MFDMaps and Case 2 MFDMaps is to investigate the effect of the presence of musical rests on the task of Han Chinese folk song classification. From Table 5.1 to Table 5.18, it can be seen that Case 2 MFDMaps always perform better than Case 1 MFDMaps in each of the individual cases. These results suggest that, instead of contributing more features to represent each of the five classes, the extra rest-related features might in fact distort the overall feature patterns.

The confusion matrices for the best performing experiment designs for Case 1 and Case 2 MFDMaps are shown in Table 5.19 and Table 5.20, respectively. It can be seen that Sichuan folk songs are still the most difficult to differentiate among the five classes. Part of the reason could be that there are not enough samples to represent the class. It could also hint that Sichuan folk songs possess characteristics similar to folk songs from other classes. It is interesting to see that, in both confusion matrices (and also those in Chapter 4), Sichuan folk songs are most often "confused" with Jiangsu folk songs.

The comparison of the classification accuracy among the five classifiers in Table 5.21 shows that the FIR-ELM classifier achieves the best accuracy with the smallest standard deviation. The regularized extreme learning machine, an enhanced variant of the extreme learning machine, ranks second. The extreme learning machine classifier comes third, followed by the support vector machine and then the resilient propagation neural network.


5.6 Conclusion

In this chapter, the finite impulse response extreme learning machine algorithm is employed as the training algorithm for the single-hidden layer feedforward neural network in Han Chinese folk song classification. This technique has not previously been applied to music classification tasks, and this is its first application to multi-class classification.

The main characteristics of the FIR-ELM lie in the design of the robust input weights and output weights of the SLFN. FIR filter techniques are employed to design the hidden layer so that each hidden neuron functions as an FIR filter to remove input disturbances and undesired frequency components. The robust output weights are obtained by minimizing an objective function that includes both the weighted sum of squared output errors and the weighted sum of squared output weights of the SLFN.

The robustness of the FIR-ELM neural classifier, particularly in Han Chinese folk song classification, was verified through the experiments in this chapter. It can be seen that the overall performance of the classifier is much better than that in Chapter 4. The classification accuracies were significantly improved using the FIR-ELM classifier. In addition, the FIR-ELM classifier successfully achieved a better accuracy than some commonly used machine classifiers in music classification, such as the gradient descent-based neural network and the support vector machine.

In this chapter, two variants of MFDMaps, Case 1 MFDMaps (including both musical notes and rests) and Case 2 MFDMaps (musical notes only), are designed to verify the role of musical rests in Han Chinese folk song classification. In each variant, a number of reduced-size MFDMaps are designed to verify the phenomenon of the curse-of-dimensionality.

In the experiments using the FIR-ELM classifier, two useful patterns were successfully revealed. Firstly, the experiments using various reduced-size MFDMaps show that a large number of features does not guarantee good performance. This coincides with the theory of the curse-of-dimensionality where, for a fixed set of samples, a high-dimensional input will lead to poor performance. In the experiments, the best accuracy for each of the two variants of MFDMap is achieved using the MFDMap with x = 15. Each of these reduced-size MFDMaps is more than 50% smaller than its respective original MFDMap. It is very encouraging to conclude from the experimental results that better classification accuracy can be achieved using smaller MFDMaps. In addition, the best classification accuracy is achieved using the SLFN with 500 hidden neurons. A smaller MFDMap and a smaller network structure mean that fewer resources are required.

Next, a comparison between the experimental results of Case 1 and Case 2 MFDMaps reveals that the presence of musical rests in Han Chinese folk song classification not only increases the size of the MFDMaps but also results in poorer classification accuracy. This suggests that musical rests might not be representative of the characteristics of folk songs from different classes and might instead distort the representation.

It is interesting to see from the confusion matrices that, although the classification accuracy is greatly improved by using the FIR-ELM classifier, the problem of confusion in recognizing the Sichuan folk songs has not been completely solved.

A good classification result is successfully obtained for Han Chinese folk song classification using the technique of the MFDMap and the FIR-ELM. This shows the potential of using such techniques for Han Chinese folk song classification. Therefore, the techniques employed in this chapter are worth further investigation. In this chapter, the rectangular window method is employed for the FIR filter. One potential direction is to investigate the effect of other window methods, such as the Kaiser, Hamming and Bartlett windows. As mentioned in [19], in some cases non-linear sigmoid hidden nodes can help reduce the effects of the disturbances and lower both the structural and empirical risks. Although this is not always the case, it might be worthwhile to investigate the effect of applying non-linear hidden neurons in Han Chinese folk song classification.




Chapter 6

A Two-Case European Folk Song Classification

This chapter presents a two-case (German and Austrian) European folk song classification using the technique developed in Chapter 5. The main objective of the research in this chapter is to investigate and validate the capability of this technique to solve the folk song classification problem using an entirely different set of data samples. In addition, the capability to generalize the technique to folk songs of other countries, particularly those with significant cultural differences, needs to be verified. This chapter begins with a brief review of the issue, followed by the design of the experiments. Then, the experimental results are presented and discussed.

6.1 Introduction

It is well known that Western and Eastern cultures differ significantly. Hence, it is expected that folk songs from these cultures possess significant differences. The technique of employing the musical feature density map to construct input feature vectors and the finite impulse response extreme learning machine as the machine classifier was shown to give good performance on Han Chinese folk song classification. Hence, the functionality of this technique in a very different setting is worth investigating.


In order to verify the performance of the proposed technique in European folk song classification, it is helpful to have existing results to refer to. To the author's knowledge, there is one prior study related to the topic discussed in this chapter.

Chai and Vercoe [81] examine folk music of three countries: Ireland, Germany and Austria. In their paper, they performed two-case and three-case classification using a hidden Markov model as the machine classifier. Four melody representations were employed: absolute pitch representation, absolute pitch with duration representation, interval representation and contour representation. They reported 77% classification accuracy for two-case classification between Irish and Austrian folk music, 75% between Irish and German folk music and 66% between German and Austrian folk music. According to the authors, the results of the two-case classification coincide with their intuition that German folk music and Austrian folk music are less distinguishable from each other than from Irish folk music. Their three-case classification achieves an accuracy of 63%.

6.2 Experiment Design and Setting

6.2.1 The Musical Feature Density Map

The musical feature density map is a music encoding method that is designed to represent music elements in folk songs. It portrays the occurrence frequency of each music representation in a folk song. Figure 6.1 shows an example of the raw musical data of an Austrian folk song and Figure 6.2 shows an example for a German folk song. As seen in Figure 6.1 and Figure 6.2, the original musical data patterns of the two folk songs are fairly disordered and do not reveal much of a pattern to differentiate the two classes. Figure 6.3 and Figure 6.4 show the same folk songs encoded using the MFDMap. After encoding, the MFDMap is able to reveal some differences in pattern between the two classes. Notice that the locations of the peaks in the figures and their spread provide good hints of the differences.


Figure 6.1: An example of the raw musical data of an Austrian folk song.

Figure 6.2: An example of the raw musical data of a German folk song.


Figure 6.3: An example of the MFDMap of an Austrian folk song.

Figure 6.4: An example of the MFDMap of a German folk song.


In this chapter, 15 different Case 2 MFDMaps are designed to investigate the roles of the four music elements (solfege, interval, duration and duration ratio) in differentiating between folk songs of the two countries. These MFDMaps are designed using different combinations of the four elements. The 15 MFDMaps are divided into three groups. The first group consists of MFDMaps that employ only one music element, the second group consists of MFDMaps that encompass two music elements, and each MFDMap in the third group employs at least three music elements. The list of the elements in the different combinations is presented in Table 6.1.

Table 6.1: The fifteen MFDMaps.

Group   MFDMap      Number of Music Elements   List of Music Elements
1       MFDMap 1    1                          Solfege only
1       MFDMap 2    1                          Interval only
1       MFDMap 3    1                          Duration only
1       MFDMap 4    1                          Duration ratio only
2       MFDMap 5    2                          Solfege & interval
2       MFDMap 6    2                          Solfege & duration
2       MFDMap 7    2                          Solfege & duration ratio
2       MFDMap 8    2                          Interval & duration
2       MFDMap 9    2                          Interval & duration ratio
2       MFDMap 10   2                          Duration & duration ratio
3       MFDMap 11   3                          Solfege, interval & duration
3       MFDMap 12   3                          Solfege, interval & duration ratio
3       MFDMap 13   3                          Solfege, duration & duration ratio
3       MFDMap 14   3                          Interval, duration & duration ratio
3       MFDMap 15   4                          Solfege, interval, duration & duration ratio
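As a small illustration, the 15 maps in Table 6.1 are simply the non-empty subsets of the four music elements (4 + 6 + 4 + 1 = 15), and enumerating them in order of subset size reproduces the listing in the table.

```python
from itertools import combinations

elements = ["solfege", "interval", "duration", "duration ratio"]

# All non-empty combinations of the four elements, grouped by size,
# in the same order as Table 6.1.
maps = [c for r in range(1, 5) for c in combinations(elements, r)]
for i, combo in enumerate(maps, start=1):
    print(f"MFDMap {i}: {' & '.join(combo)}")
```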


6.2.2 Data Set

The two classes of the folk song classification task are folk songs from Germany and Austria. Due to their geographic location and cultural influences, German and Austrian folk songs are closely related, hence they are not easily distinguishable. This is one of the main reasons they are chosen as the data set. The other main reason for using these folk songs is the availability of the data.

The melodies of 106 German folk songs and 104 Austrian folk songs in the Essen folksong database [1] are used as the data set for the classification task discussed in this chapter. 20% of the folk songs from each country are randomly selected by the computer to form the testing set. There are a total of 168 songs in the training set and 42 songs in the testing set. It is to be noted that the same database is used in [81].
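A minimal sketch of such a split is given below; the function name and the fixed seed are illustrative only. Applied per country, 106 German and 104 Austrian songs give roughly 168 training songs and 42 testing songs in total, as stated above.

```python
import numpy as np

def hold_out_20_percent(num_songs, seed=0):
    """Randomly hold out 20% of one country's songs for testing."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(num_songs)
    n_test = round(0.2 * num_songs)
    return idx[n_test:], idx[:n_test]   # (train indices, test indices)
```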

6.2.3 Parameter Setting

The finite impulse response extreme learning machine algorithm is used as the training algorithm, and a single-hidden layer feedforward neural network with the structure shown in Figure 6.5 is employed. This SLFN has linear hidden neurons, linear output neurons and an input tapped-delay-line memory.

The simulations were performed using four different types of FIR filter, namely low-pass, high-pass, band-pass and band-stop filters, over a range of cutoff frequencies ωc from 0.1 to 0.9 with a step size of 0.1. A bandwidth of ±0.05 is used for the band-pass and band-stop filters. The number of hidden neurons employed in the simulations ranges from 1 to 2000. The targets for the FIR-ELM classifier are set using the 1-of-c method by assigning each of the two countries to one target. For a set of targets, the one representing a particular country is assigned '1' and the remaining target is assigned '0'. The final output of the neural network is determined using the winner-takes-all method.


Figure 6.5: The FIR-ELM network structure with linear neurons and time-delay elements.

6.3 Experiment Results

The low-pass filter performs the best among the four different filters. Hence the FIR-ELM shown in Table 6.2 to Table 6.6 is the low-pass FIR-ELM with cutoff frequency 0.1 and balancing parameter ratio d/γ = 0.1. Table 6.2 to Table 6.4 show the classification accuracy achieved by the FIR-ELM in two-case European folk song classification using the 15 MFDMaps. In Table 6.2, a single music element is used in each MFDMap. Table 6.3 shows the classification accuracy of the FIR-ELM classifier using two music elements in each of the MFDMaps. Finally, the MFDMaps in Table 6.4 are each constructed with either three music elements or all four elements. Table 6.5 is the confusion matrix for the highest accuracy of 83.33%, namely the two-case folk song classification using MFDMap 14 (interval, duration and duration ratio) at 100 hidden neurons. Table 6.6 shows a comparison of the classification accuracy between five classifiers: the gradient descent-based resilient propagation (RPROP) [126] single-hidden layer feedforward neural network, the extreme learning machine (ELM) [33], the regularized extreme learning machine (R-ELM) [34], the finite impulse response extreme learning machine (FIR-ELM) [19] and the support vector machine (SVM) [27].


The classification accuracy for RPROP, ELM and R-ELM is the average of 50 repetitions, each using a different set of random initial weights.

Table 6.2: Classification accuracy (%) using one music element in the MFDMap.

Number of Hidden Neurons   Solfege Only   Interval Only   Duration Only   Duration Ratio Only
50                         52.38          66.67           45.24           61.90
100                        52.38          71.43           45.24           61.90
500                        54.76          71.43           52.38           57.14
1000                       57.14          71.43           52.38           61.90
1500                       61.90          64.29           59.52           59.52
2000                       61.90          59.52           57.14           59.52

Table 6.3: Classification accuracy (%) using two music elements in the MFDMap.

Number of        Solfege &   Solfege &   Solfege &        Interval &   Interval &       Duration &
Hidden Neurons   Interval    Duration    Duration Ratio   Duration     Duration Ratio   Duration Ratio
50               47.62       52.38       50.00            73.81        71.43            54.76
100              61.90       47.62       54.76            78.57        78.57            54.76
500              61.90       50.00       66.67            64.29        69.05            59.52
1000             66.67       54.76       64.29            66.67        71.43            61.90
1500             64.29       52.38       57.14            69.05        73.81            61.90
2000             59.52       57.14       59.52            66.67        76.19            61.90


Table 6.4: Classification accuracy (%) using three music elements in the MFDMap.

Number of        Solfege, Interval   Solfege, Interval   Solfege, Duration   Interval, Duration   Solfege, Interval,
Hidden Neurons   & Duration          & Duration Ratio    & Duration Ratio    & Duration Ratio     Duration & Duration Ratio
50               54.76               59.52               57.14               66.67                59.52
100              59.52               64.29               57.14               83.33                59.52
500              66.67               78.57               61.90               80.95                76.19
1000             66.67               73.81               69.05               71.43                76.19
1500             69.05               69.05               61.90               69.05                71.43
2000             64.29               69.05               61.90               69.05                71.43

Table 6.5: Confusion matrix for MFDMap using interval, duration and duration ratio elements.

           German   Austrian
German     17       4
Austrian   3        18

Table 6.6: Classification accuracy of the RPROP, ELM, R-ELM, FIR-ELM and SVM classifier.

Classifier               RPROP   ELM     R-ELM   FIR-ELM   SVM
Accuracy (%)             57.86   62.62   64.05   83.33     57.14
Standard Deviation (%)   5.27    7.10    2.62    0         0


6.4 Discussion

The best classification accuracy achieved by the FIR-ELM classifier in two-case European folk song classification is 83.33%, using the MFDMap with three music elements (interval, duration and duration ratio). In general, improvement in classification accuracy is expected as the number of hidden neurons increases until the classifier reaches a saturation point. In this case, the neural network only requires 100 hidden neurons, i.e. the saturation point is at 100 hidden neurons.

In Table 6.2, the interval element appears to be the best of the four elements. The classification accuracy using the interval element reaches its best (71.43%) at 100 hidden neurons and maintains this good performance through to 1000 hidden neurons. The solfege element comes second among the four, with consistent improvement in performance as the number of hidden neurons increases. On the other hand, the duration ratio element gives its best performance at the very early stages and the accuracy then fluctuates above 100 hidden neurons. As for the duration element, the performance improves gradually as the number of hidden neurons increases and drops slightly at 2000 hidden neurons.

The performance of the MFDMaps using two music elements (Table 6.3) seems less stable, except for MFDMap 5 (solfege and interval) and MFDMap 10 (duration and duration ratio). It is interesting to note that both of these MFDMaps use elements of the same character (solfege and interval are both pitch-related elements, while duration and duration ratio are both rhythmical elements). Despite this instability, the MFDMap using interval and duration and the MFDMap using interval and duration ratio perform the best within this group of MFDMaps.

Finally, the classification accuracy stabilizes as the third (and fourth) music element is added to the MFDMap. Although each MFDMap in Table 6.4 reaches its best performance at various stages, the classification accuracy generally improves as the number of hidden neurons increases. The best performing MFDMap within the group uses the element combination of interval, duration and duration ratio. The MFDMap using solfege, interval and duration ratio comes second, while the MFDMap that employs all four music elements comes third.

An interesting pattern is observed in the classification performance from Table 6.2 to Table 6.4. Firstly, a conclusion might be drawn from Table 6.2 that the interval element is the best performing element. Next, the best performing combinations in Table 6.3 are interval and duration, and interval and duration ratio. The classification accuracy of these two combinations (in Table 6.3) is higher than the best performance in Table 6.2. Finally, the best performing MFDMap in Table 6.4 is also the best performing among all 15 MFDMaps; it uses the interval, duration and duration ratio elements. Again, its classification accuracy is higher than that in Table 6.3.

Another pattern can be easily spotted from Table 6.2 through to Table 6.4. Firstly, in Table 6.2, the duration ratio element performs better than the duration element. In Table 6.3, the MFDMap using the combination of solfege and duration ratio performs better than the MFDMap using solfege and duration; the same holds between the MFDMap using interval and duration ratio and the MFDMap using interval and duration. Finally, the same pattern occurs again in Table 6.4, where the combination of solfege, interval and duration ratio performs better than the combination of solfege, interval and duration.

The classification accuracy using combined elements is usually better than that using individual elements, and it usually improves as the number of elements increases. For example, the classification accuracy using the interval element alone is 71.43%. The accuracy improves to 78.57% when the interval element is used in combination with either the duration or the duration ratio element. Finally, the accuracy improves to 83.33% using the combination of interval, duration and duration ratio.

Table 6.6 clearly shows that the FIR-ELM classifier performs the best among all five classifiers. The R-ELM classifier comes second, followed by the ELM classifier. The gradient descent-based RPROP classifier achieves a classification accuracy similar to that of the SVM classifier.

Figure 6.6 depicts the classification accuracy of the low-pass FIR-ELM using the MFDMap with three elements (interval, duration and duration ratio) at different cutoff frequencies with 100 hidden neurons. Figure 6.7 portrays the classification accuracy of the FIR-ELM using each of the four filters with cutoff frequency 0.1. The MFDMap structure employed in these classifiers is the three-element combination of interval, duration and duration ratio.

Figure 6.6: Classification accuracy of the low-pass FIR-ELM with 100 hidden neurons (MFDMap: interval, duration and duration ratio).


Figure 6.7: Classification accuracy of four filters FIR-ELM with cutoff frequency 0.1 (MFDMap: interval, duration and duration ratio).

6.5 Conclusion

In this chapter, the technique of using the musical feature density map and the finite impulse response extreme learning machine is verified on a two-case German and Austrian folk song classification task. The experiments show that even with a single music element, the FIR-ELM classifier gives a reasonably good performance. The classification accuracy improves as the number of music elements increases. The highest accuracy achieved is 83.33%, using the combination of interval, duration and duration ratio elements. The low-pass FIR-ELM classifier performs better than the high-pass, band-pass and band-stop FIR-ELMs.

The classification accuracy achieved is fairly encouraging. A poorer accuracy is expected if the number of classes is increased for the European folk song data. To further investigate the machine's capability in folk song classification, future work should include folk songs from other European countries.




Chapter 7

Conclusion

This chapter presents a summary of the research activities and contributions presented in this thesis. Some suggestions for further development of the topic discussed in this thesis are also included.

7.1 Summary

This thesis presents the research topic of machine classification of Han Chinese folk songs. In this research, a machine is used in place of the human. In order for the machine to be able to read and understand the music, a simple yet meaningful encoding method is developed in Chapter 3 to represent the folk songs. This encoding method effectively encapsulated useful musical information that is readable by the machine and at the same time can be easily interpreted by humans. The main aim of developing an encoding method that benefits both humans and the machines was achieved effectively by the MFDMap.

In Chapter 4, this encoding method is put to the test to verify its functionality in an actual machine classification task. The extreme learning machine, an extremely fast learning algorithm, and its enhanced variant, the regularized extreme learning machine, demonstrated that the learning of folk songs can be accomplished several hundred times faster than with a conventional gradient descent algorithm, and yet the classification accuracy can be 14% better than that achieved by a gradient descent learning algorithm and 10% better than the support vector machine. In addition, the effect of incorporating or eliminating musical rests is investigated to verify their importance in folk song classification. The outcomes of the experiments in Chapter 4 do not suggest that musical rests play a significant role.

Further investigations were carried out in Chapter 5, where a more robust learning algorithm is employed as the classifier. The finite impulse response extreme learning machine is a powerful learning algorithm whose robustness is reflected in the design of both the input weights and the output weights. This algorithm is designed so that input disturbances and undesired frequency components in the data samples can be handled through the pre-processing capability of the hidden neurons of the neural network. It is believed that the undesired features in the input vectors (MFDMaps), which act as "noise" in the overall structure of the input data, were effectively suppressed and hence the classification accuracy was significantly improved. The classification accuracy using the FIR-ELM is about 11% better than the ELM and 7% better than the R-ELM. The capability of the FIR-ELM in real-world multi-class classification is verified through its performance on a real-world five-class folk song classification task.

In Chapter 5, it was found that the inclusion of musical rests in the encoding of the songs into MFDMaps worsened the overall performance of the classification task. In addition, the various experiments show that it is not necessary to include all features in the input vector; better classification accuracy is achieved by using a reduced-size MFDMap.

Finally, in Chapter 6, the technique employed in Chapter 5 is applied on the European folk songs to verify its performance on folk songs of other countries, particularly folk songs with significant cultural differences. The outcomes achieved in Chapter 6 show that the same technique employed for Han Chinese folk songs can be successfully applied to European folk songs.


7.2 Future Works

Below is a list of potential future developments on the research topic presented in this thesis:

1. As seen in Chapter 5, the design of the structure of the SLFN consists of linear hidden neurons, linear output neurons and an input tapped-delay-line memory. This structure contains finite-depth memory, which enables the network to perform dynamic learning. However, the conventional structure of the SLFN uses non-linear hidden neurons and linear output neurons without dynamics to learn complex non-linear mappings. Therefore, an in-depth analysis should be performed on the effect of non-linear hidden nodes in the FIR-ELM network structure, particularly in the application of Han Chinese folk song classification.

2. In Chapter 5, the rectangular window method is employed for the FIR filter. The effect of other window methods, such as the Kaiser, Hamming and Bartlett methods, on the input samples and the classifier performance should be investigated.

3. In Chapters 4 and 5, the optimal MFDMap is determined through exhaustive testing. An effective method to analytically determine the optimal MFDMap should be developed to eliminate this tedious procedure.

4. The musical feature density map is designed to address the task of Han Chinese folk song classification where folk songs consist of a single melody line. It is possible that the same concept can be extended to accommodate polyphonic music.

5. This thesis focuses on features extracted from the melody of the folk songs. The song lyrics in many cases convey extra information. Further research on the fusion of features from both the melody and the lyrics might reveal more patterns that can be used to develop a theoretical basis for encoding songs for machine classification.


6. The folk song classification techniques developed in this thesis focus on Han Chinese folk songs. Although a two-case European folk song classification was performed to verify the effectiveness of the technique on folk songs from other countries, more diversity should be included in the examples.


References

[1] H. Schaffrath, The Essen Folksong Collection in Kern Format, D. Huron (ed.), Menlo Park, CA: Center for Computer Assisted Research in the Humanities, 1995.

[2] K. Hornik, “Approximation capabilities of multilayer feedforward networks,” Neural Networks, vol. 4, no. 2, pp. 251-257, 1991.

[3] K. Hornik, M. Stinchcombe, and H. White, “Multilayer feedforward networks are universal approximators,” Neural Networks, vol. 2, no. 5, pp. 359-366, 1989.

[4] G.-B. Huang, D. H. Wang, and Y. Lan, “Extreme learning machines: a survey,” International Journal of Machine Learning and Cybernetics, vol. 2, pp. 107-122, 2011.

[5] D. Yu and L. Deng, “Efficient and effective algorithms for training single-hidden-layer neural networks,” Pattern Recognition Letters, vol. 33, no. 5, pp. 554-558, 2012.

[6] B. P. Chacko and A. P. Babu, “Online sequential extreme learning machine based handwritten character recognition,” in IEEE Students’ Technology Symposium, TechSym 2011, pp. 142-147, 2011.

[7] T. Helmy and Z. Rasheed, “Multi-category bioinformatics dataset classification using extreme learning machine,” in Proceedings of the IEEE Congress on Evolutionary Computation, CEC 2009, pp. 3234-3240, 2009.


[8] D. Wang and G.-B. Huang, “Protein sequence classification using extreme learning machine,” in Proceedings of the IEEE International Joint Conference on Neural Networks, IJCNN 2005, vol. 3, pp. 1406-1411, 2005.

[9] G. Wang, Y. Zhao, and D. Wang, “A protein secondary structure prediction framework based on the extreme learning machine,” Neurocomputing, vol. 72, no. 1-3, pp. 262-268, 2008.

[10] R. Zhang, G.-B. Huang, N. Sundararajan, and P. Saratchandran, “Multicategory classification using an extreme learning machine for microarray gene expression cancer diagnosis,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 4, no. 3, pp. 485-495, 2007.

[11] S. Baboo and S. Sasikala, “Multicategory classification using an extreme learning machine for microarray gene expression cancer diagnosis,” in Proceedings of the IEEE International Conference on Communication Control and Computing Technologies, ICCCCT 2010, pp. 748-757, 2010.

[12] F.-C. Li, P.-K. Wang, and G.-E. Wang, “Comparison of the primitive classifiers with extreme learning machine in credit scoring,” in Proceedings of the IEEE International Conference on Industrial Engineering and Engineering Management, IEEM 2009, pp. 685-688, 2009.

[13] G. Duan, Z. Huang, and J. Wang, “Extreme learning machine for bank clients classification,” in Proceedings of the International Conference on Information Management, Innovation Management and Industrial Engineering, vol. 2, pp. 496-499, 2009.

[14] W. Deng, Q.-H. Zheng, S. Lian, and L. Chen, “Adaptive personalized recommendation based on adaptive learning,” Neurocomputing, vol. 74, no. 11, pp. 1848-1858, 2011.


[15] X.-G. Zhao, G. Wang, X. Bi, P. Gong, and Y. Zhao, “XML document classification based on ELM,” Neurocomputing, vol. 74, no. 16, pp. 2444-2451, 2011.

[16] Y. Sun, Y. Yuan, and G. Wang, “An OS-ELM based distributed ensemble classification framework in P2P networks,” Neurocomputing, vol. 74, no. 16, pp. 2438-2443, 2011.

[17] W. Deng, Q.-H. Zheng, and L. Chen, “Real-time collaborative filtering using extreme learning machine,” in Proceedings of the IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technologies, WI-IAT 2009, vol. 1, pp. 466-473, 2009.

[18] Q. J. B. Loh and S. Emmanuel, “ELM for the classification of music genres,” in Proceedings of the 9th International Conference on Control, Automation, Robotics and Vision, ICARCV 2006, pp. 1-6, 2006.

[19] Z. Man, K. Lee, D. Wang, Z. Cao, and C. Miao, “A new robust training algorithm for a class of single-hidden layer feedforward neural networks,” Neurocomputing, vol. 74, pp. 2491-2501, 2011.

[20] K. Lee, Z. Man, D. H. Wang, and Z. Cao, “Classification of bioinformatics dataset using finite impulse response extreme learning machine for cancer diagnosis,” Neural Computing & Applications, Available online 30 Jan. 2012, Doi: 10.1007/s00521-012-0847-z.

[21] J. Jin, Chinese Music. Cambridge: Cambridge University Press, 2011.

[22] S. Haykin, Neural Networks: A Comprehensive Foundation. New York: Macmillan, 1994.

[23] W. McCulloch and W. Pitts, “A logical calculus of the ideas immanent in nervous activity,” Bulletin of Mathematical Biophysics, vol. 7, pp. 115-133, 1943.


[24] J. C. Principe, N. R. Euliano, and W. C. Lefebvre, Neural and Adaptive Systems: Fundamentals through Simulations. New York: John Wiley & Sons, Inc., 1999.

[25] F. Rosenblatt, “The perceptron: A probabilistic model for information storage and organization in the brain,” Cornell Aeronautical Laboratory, Psychological Review, vol. 65, no. 6, pp. 386-408, 1958.

[26] S. Haykin, Neural Networks and Learning Machines. Third Edition, New Jersey: Pearson Prentice Hall, 2009.

[27] C. Cortes and V. Vapnik, “Support-Vector Networks,” Machine Learning, vol. 20, pp. 273-297, 1995.

[28] S. Abe, Support Vector Machines for Pattern Classification. Springer, 2005.

[29] S. Tamura and M. Tateishi, “Capabilities of a four-layered feedforward neural network: Four layers versus three,” IEEE Transactions on Neural Networks, vol. 8, no. 2, pp. 251-255, 1997.

[30] G.-B. Huang, “Learning capability and storage capacity of two-hidden-layer feedforward networks,” IEEE Transactions on Neural Networks, vol. 14, no. 2, pp. 274-281, 2003.

[31] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, “Real-time learning capability of neural networks,” in School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, Technical Report ICIS/45/2003, Apr. 2003.

[32] G.-B. Huang, L. Chen, and C.-K. Siew, “Universal approximation using incremental feedforward networks with arbitrary input weights,” in School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, Technical Report ICIS/46/2003, Oct. 2003.

[33] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, “Extreme learning machine: Theory and applications,” Neurocomputing, vol. 70, pp. 489-501, 2006.


[34] W. Deng, Q. Zheng, and L. Chen, “Regularized extreme learning machine,” in Proceedings of the IEEE Symposium on Computational Intelligence and Data Mining, pp. 389-395, 2009.

[35] M. Bosi and R. E. Goldberg, Introduction to digital audio coding and standards. (The Kluwer International Series in Engineering and Computer Science). New York: Springer, 2003.

[36] E. Selfridge-Field, Beyond MIDI: The Handbook of Musical Codes. MIT Press, 1997.

[37] G. Tzanetakis, G. Essl, and P. Cook, “Automatic musical genre classification of audio signals,” in Proceedings of the International Symposium of Music Information Retrieval, ISMIR, 2001.

[38] G. Tzanetakis and P. Cook, “Musical genre classification of audio signals,” IEEE Transactions on Speech and Audio Processing, vol. 10, no. 5, pp. 293-302, 2002.

[39] M. Cooper and J. Foote, “Automatic music summarization via similarity analysis,” in Proceedings of the International Conference on Music Information Retrieval, pp. 81-85, 2002.

[40] Y. Zhang and J. Zhou, “A study on content based music classification,” in Proceedings of the 7th International Symposium on Signal Processing and Its Applications, pp. 113-116, 2003.

[41] M. F. McKinney and J. Breebaart, “Features for audio and music classification,” in Proceedings of the International Conference on Music Information Retrieval, pp. 151-158, 2003.


[42] C. Xu, N. C. Maddage, X. Shao, F. Cao, and Q. Tian, “Musical genre classification using support vector machines,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 429-432, 2003.

[43] T. Li, M. Ogihara, and Q. Li, “A comparative study on content-based music genre classification,” in Proceedings of the 26th Annual ACM Conference Research and Development in Information Retrieval, SIGIR 2003, pp. 282-289, 2003.

[44] F. Gouyon, S. Dixon, E. Pampalk, and G. Widmer, “Evaluating rhythmic descriptors for musical genre classification,” in Proceedings of the AES 25th International Conference, pp. 196-20, 2004.

[45] R. Malheiro, R. Paiva, A. Mendes, T. Mendes, and A. Cardoso, “A prototype for classification of classical music using neural networks,” in Proceedings of the 8th International Conference on Artificial Intelligence and Soft Computing, pp. 294-299, 2004.

[46] G. Mitri, A. Uitdenbogerd, and V. Ciesielski, “Automatic music classification problems,” in Proceedings of the 27th Australasian Conference on Computer Science, vol. 26, pp. 315-322, 2004.

[47] C. Xu, N. C. Maddage, and X. Shao, “Automatic music classification and summarization,” IEEE Transactions on Speech and Audio Processing, vol. 13, no. 3, pp. 441-450, 2005.

[48] A. Meng and J. Shawe-Taylor, “An investigation of feature models for music genre classification using the support vector classifier,” in Proceedings of the International Conference on Music Information Retrieval, pp. 604-609, 2005.

[49] N. Scaringella, G. Zoia, and D. Mlynek, “Automatic genre classification of music content: a survey,” IEEE Signal Processing Magazine, vol. 23, no. 2, pp. 133-141, 2006.


[50] Y. Liu, J. Xu, L. Wei, and Y. Tian, “The study of the classification of Chinese folk songs by regional style,” in Proceedings of the International Conference on Semantic Computing, pp. 657-662, 2007.

[51] J. Xu, P. Wang, and L. Yan, “Feature selection for automatic classification of Chinese folk songs,” in Proceedings of the Congress on Image and Signal Processing, pp. 441-446, 2008.

[52] Y. Panagakis, E. Benetos, and C. Kotropoulos, “Music genre classification: a multilinear approach,” in Proceedings of the International Conference on Music Information Retrieval, pp. 583-588, 2008.

[53] Y. Liu, L. Wei, and P. Wang, “Regional style automatic identification for Chinese folk songs,” in Proceedings of the World Congress on Computer Science and Information Engineering, pp. 5-9, 2009.

[54] T. Langlois and G. Marques, “A music classification method based on timbral features,” in Proceedings of the International Conference on Music Information Retrieval, pp. 81-86, 2009.

[55] Y.-L. Lo, “Content-based music classification,” in Proceedings of the 3rd IEEE International Conference on Computer Science and Information Technology, vol. 2, pp. 112-116, 2010.

[56] C. Xiang and Z. Zhou, “A new music classification method based on BP neural network,” International Journal of Digital Content Technology and its Applications, vol. 5, no. 6, 2011.

[57] Z. Fu, G. Lu, K. M. Ting, and D. Zhang, “A survey of audio-based music classification and annotation,” IEEE Transactions on Multimedia, vol. 13, no. 2, pp. 303-319, 2011.


[58] Z. Fu, G. Lu, K. Ting, and D. Zhang, “Music classification via the bag-of-features approach,” Pattern Recognition Letters, vol. 32, no. 14, pp. 1768-1777, 2011.

[59] J. Salamon, B. Rocha, and E. Gomez, “Musical genre classification using melody features extracted from polyphonic music signals,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 81-84, 2012.

[60] D. Li, I. K. Sethi, N. Dimitrova, and T. McGee, “Classification of general audio data for content-based retrieval,” Pattern Recognition Letters, vol. 22, pp. 533-544, 2001.

[61] D. Manolakis and V. Ingle, Applied Digital Signal Processing. Cambridge: Cambridge University Press, 2011.

[62] L. Rabiner and B. H. Juang, Fundamental of Speech Recognition. Prentice Hall, 1993.

[63] T. Tolonen and M. Karjalainen, “A computationally efficient multipitch analysis model,” IEEE Transactions on Speech and Audio Processing, vol. 8, no. 6, pp. 708–716, 2000.

[64] M. A. Bartsch and G. H. Wakefield, “To catch a chorus: Using chroma-based representation for audio thumbnailing,” in Proceedings of the IEEE International Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 15-19, Mohonk, NY, 2001.

[65] R. N. Shepard, “Circularity in judgments of relative pitch,” Journal of the Acoustical Society of America, vol. 35, pp. 2346–2353, 1964.

[66] J. Pierce, “Consonance and scales,” in P. Cook, editor, Music Cognition and Computerized Sound, pp. 167–185, MIT Press, 1999.


[67] J. Rothstein, MIDI: A Comprehensive Introduction. Madison, WI: A-R Editions, 1995.

[68] MIDI Manufacturers Association, Complete MIDI 1.0 detailed specification v96.1. Los Angeles, International MIDI Association, 2001.

[69] M. Wright and A. Freed, “Open Sound Control: A new protocol for communicating with sound synthesizers,” in Proceedings of the International Computer Music Conference, pp. 101-104, 1997.

[70] M. Wright, OpenSound Control Specification v1.0, http://archive.cnmat.berkeley.edu/OpenSoundControl/OSC-spec.html, 2002. Retrieved 26 January 2013.

[71] M. Wright, “Open Sound Control: An enabling technology for musical networking,” Organised Sound, vol. 10, no. 3, pp. 193-200, 2005.

[72] H. Hoos, K. Hamel, K. Renz, and J. Kilian, “Representing score-level music using the GUIDO music-notation format,” Computing in Musicology, vol. 12, 2001.

[73] D. Huron, Music Research Using Humdrum: A User’s Guide. Menlo Park, CA: Center for Computer Assisted Research in the Humanities, 1999.

[74] M. Good, “MusicXML for notation and analysis”, in The Virtual Score: Representation, Retrieval, Restoration, W. B. Hewlett and E. Selfridge-Field, Ed. Cambridge, MA: MIT Press, pp. 113-124, 2001.

[75] M. Good, “MusicXML: An internet-friendly format for sheet music,” in Proceedings of the XML 2001 Conference, Orlando, FL, 2001.

[76] R. B. Dannenberg, “Music representation issues, techniques, and systems,” Computer Music Journal, vol. 17, no. 3, pp. 20-30, 1993.


[77] C. McKay, “Automatic music classification with jMIR,” Ph.D. dissertation, Department of Music Research, McGill University, Montreal, 2010.

[78] H. W. Nienhuys and J. Nieuwenhuizen, “LilyPond, a system for automated music engraving,” in Proceedings of the XIV Colloquium on Musical Informatics, 2003.

[79] D. Huron, The Humdrum Toolkit: Reference Manual. Menlo Park, CA: Center for Computer Assisted Research in the Humanities, 1995.

[80] D. Huron, “Humdrum and Kern: Selective feature encoding,” in Beyond MIDI: The Handbook of Musical Codes, E. Selfridge-Field, Ed. Cambridge, Massachusetts: MIT Press, pp. 375-401, 1997.

[81] W. Chai and B. Vercoe, “Folk music classification using hidden Markov models,” in Proceedings of the International Conference on Artificial Intelligence, 2001.

[82] C. Lin, N. Liu, Y. Wu, and A. Chen, “Music classification using significant repeating patterns,” Database Systems for Advanced Applications, vol. 2973, pp. 506-518, 2004.

[83] P. León and J. Iñesta, “Musical style classification from symbolic data: a two- styles case study,” Computer Music Modeling and Retrieval, vol. 2771, pp. 166– 177, 2004.

[84] C. McKay and I. Fujinaga, “Automatic genre classification using large high-level musical feature sets,” in Proceedings of the 5th International Conference on Music Information Retrieval, 2004.

[85] Y.-P. Huang, G.-L. Guo, and C.-T. Lu, “Using back propagation model to design a MIDI music classification system,” in Proceedings of the International Computer Symposium, pp. 15-17, 2004.


[86] R. Basili, A. Serafini, and A. Stellato, “Classification of musical genre: a machine learning approach,” in Proceedings of the International Symposium on Music Information Retrieval, 2004.

[87] C. Pérez-Sancho, J. Iñesta, and J. Calera-Rubio, “Style recognition through statistical event models,” Journal of New Music Research, vol. 34, no. 4, pp. 331-339, 2005.

[88] A. Nanopoulos, Y. Manolopoulos, and I. Karydis, “Symbolic musical genre classification based on repeating patterns,” in Proceedings of the 1st ACM Workshop on Audio and Music Computing Multimedia, pp. 53-58, 2006.

[89] C. Kofod and D. O. Arroyo, “Exploring the design space of symbolic music genre classification using data mining techniques,” in Proceedings of the International Conference on Computational Intelligence for Modelling, Control and Automation, pp. 43-48, 2008.

[90] R. Hillewaere, B. Manderick, and D. Conklin, “Global feature versus event models for folk song classification,” in Proceedings of the 10th International Society for Music Information Retrieval Conference, ISMIR 2009, pp. 729-733, 2009.

[91] C. Pérez-Sancho, D. Rizo, S. Kersten, and R. Ramirez, “Genre classification of music by tonal harmony,” Intelligent Data Analysis, vol. 14, no. 5, pp. 533-545, 2010.

[92] D. Bainbridge and N. Carter, “Automatic reading of music notation,” in Handbook of character recognition and document image analysis, H. Bunke and P. Wang, Ed. Singapore: World Scientific, 1997.

[93] P. Bellini, I. Bruno, and P. Nesi, “Assessing optical music recognition tools,” Computer Music Journal, vol. 31, no. 1, pp. 68-93, 2007.


[94] P. Bellini, I. Bruno, and P. Nesi, “Optical music recognition: Architecture and algorithms,” in Interactive multimedia music technologies, K. Ng and P. Nesi, Ed. Hershey, PA: Information Science Reference, 2008.

[95] R. Rowe, Machine musicianship. Cambridge, MA: MIT Press, 2001.

[96] R. Middleton, Studying Popular Music. Philadelphia: Open University Press, 1990.

[97] R. Middleton, “Popular music analysis and musicology: Bridging the gap,” in Reading Pop: Approaches to Textual Analysis in Popular Music, R. Middleton, Ed. New York: Oxford University Press, 2000.

[98] G. Cooper and L. B. Meyer, The Rhythmic Structure of Music. Chicago: University of Chicago Press, 1960.

[99] K.-H. Han, “Folk songs of the Han Chinese: Characteristics and classifications,” Asian Music, vol. 20, no. 2, pp. 107-128, 1989.

[100] J. Miao and J. Qiao, Lun Hanzu Minge Jinshi Secaiqu de Huafen (A Study of Similar Color Area Divisions in Han Folk Songs). Beijing: Wenhua Yishu, 1987.

[101] D. Huron, “Music information processing using the Humdrum toolkit: Concepts, examples, and lessons,” Computer Music Journal, vol. 26, no. 2, pp. 11-26, 2002.

[102] H. Owen, Music Theory Resource Book. Oxford University Press, 2000.

[103] J. McKinney, The Diagnosis and Correction of Vocal Faults: A Manual for Teachers of Singing and for Choir Directors. Nashville, TN: Genovex Music Group, 1994.

[104] B. Bartók and A. B. Lord, Serbo-Croatian Folk Songs. New York: Columbia University Press, 1951.


[105] G.-B. Huang, Y.-Q. Chen, and H. A. Babri, “Classification ability of single hidden layer feedforward neural networks,” IEEE Transactions on Neural Networks, vol. 11, no. 3, pp. 799-801, 2000.

[106] P. L. Bartlett, “The sample complexity of pattern classification with neural networks: The size of the weights is more important than the size of the network,” IEEE Transactions on Information Theory, vol. 44, no. 2, pp. 525-536, 1998.

[107] G.-B. Huang and H. A. Babri, “Upper bounds on the number of hidden neurons in feedforward networks with arbitrary bounded nonlinear activation functions,” IEEE Transactions on Neural Networks, vol. 9, no. 1, pp. 224-229, 1998.

[108] G.-B. Huang, “Learning capability and storage capacity of two-hidden-layer feedforward networks,” IEEE Transactions on Neural Networks, vol. 14, no. 2, pp. 274-281, 2003.

[109] R. Bellman, Adaptive Control Processes: A Guided Tour. New Jersey: Princeton University Press, 1961.

[110] C. M. Bishop, Neural Networks for Pattern Recognition. Oxford University Press, Walton Street, Oxford, 1995.

[111] G.-B. Huang, X. Ding and H. Zhou, “Optimization method based extreme learning machine for classification,” Neurocomputing, vol. 74, pp. 155–163, 2010.

[112] V. Vapnik, Statistical Learning Theory. New York: John Wiley, 1998.

[113] M. Anthony and P. L. Bartlett, Neural Network Learning: Theoretical Foundations. Cambridge: Cambridge University Press, 1999.

[114] L. Breiman, J. Friedman, R. Olshen and C. Stone, Classification and Regression Trees. Belmont, CA: Wadsworth International, 1984.


[115] L. Devroye, L. Gyorfi and G. Lugosi, A Probabilistic Theory of Pattern Recognition. New York: Springer-Verlag, 1996.

[116] R. Duda and P. Hart, Pattern Classification and Scene Analysis. New York: John Wiley, 1973.

[117] K. Fukunaga, Introduction to Statistical Pattern Recognition. New York: Academic Press, 1972.

[118] M. Kearns and U. Vazirani, An Introduction to Computational Learning Theory. Cambridge, Massachusetts: MIT Press, 1994.

[119] J. Proakis and D. Manolakis, Digital Signal Processing. 3rd Edition, Prentice Hall, 1996.

[120] S. M. Kuo, B. H. Lee and W. Tian, Real-Time Digital Signal Processing. John Wiley & Sons Ltd, 2007.

[121] A. V. Oppenheim, R. W. Schafer and J. R. Buck, Discrete-Time Signal Processing. Prentice Hall, 1999.

[122] E. C. Ifeachor and B. W. Jervis, Digital Signal Processing: A Practical Approach. 2nd Edition, Prentice Hall, 2002.

[123] S. S. Rao, Engineering Optimization: Theory and Practice. John Wiley & Sons Inc., 1996.

[124] P. S. Iyer, Operations Research. Tata McGraw-Hill, 2008.

[125] F. S. Hillier and G. J. Lieberman, Introduction to Operations Research. McGraw Hill, 2005.


[126] M. Riedmiller and H. Braun, “A direct adaptive method for faster backpropagation learning: The RPROP algorithm,” in Proceedings of the IEEE International Conference on Neural Networks, 1993.


This page is intentionally left blank.


Appendix A

Folk Song Classification Using Audio Representation

This appendix records the results of the preliminary experiments on five-class folk song classification using audio data. The audio files are 16-bit monophonic wave files with a sampling rate of 22,050 Hz, obtained by converting the ready-made MIDI files of the 333 Han Chinese folk songs in the Essen Folksong Collection. The conversion tool employed is the Direct MIDI to MP3 Converter by Piston Software (www.pistonsoft.com/midi2mp3.html).

As folk songs vary in length, each folk song is segmented into five-second clips. For feature extraction, the analysis window size is set to 20 milliseconds with a hop size of 10 milliseconds (i.e. 50% overlap). There is a total of 1466 data samples, of which 80% are used for training and the remaining 20% for testing; hence, there are 1173 training samples and 293 testing samples.

Two types of audio features are employed: time domain features and frequency domain features. The time domain features are the root mean square (RMS), fraction of low energy windows and zero-crossing (ZC). The frequency domain features are the spectral centroid (SC), spectral roll-off (SR), spectral flux (SF) and Mel-frequency cepstral coefficients (MFCC). The experiments consist of three groups: the first group employs only the time domain features, the second group employs only the frequency domain features, and the third group uses features from both the time and frequency domains.


For each feature in all the groups (except the fraction of low energy windows feature), statistics such as the median, mean and variance are computed over all analysis windows of each 5-second clip and then used as the inputs for the machine classifier. Different combinations of these statistics are used to design the experiments within each of the three groups.
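As a concrete illustration of the framing and the per-clip statistics described above, the sketch below (hypothetical function and variable names; not the code used for these experiments) computes two of the time domain features, the RMS and the zero-crossing count, over 20 ms windows with a 10 ms hop at 22,050 Hz, and aggregates them with the median, mean and variance over a 5-second clip.

import numpy as np

def time_domain_stats(clip, fs=22050, win_ms=20, hop_ms=10):
    """Illustrative sketch (hypothetical names): RMS and zero-crossing
    counts per analysis window, aggregated over one 5-second clip."""
    win = int(fs * win_ms / 1000)               # 441 samples per 20 ms window
    hop = int(fs * hop_ms / 1000)               # 220 samples per 10 ms hop
    frames = [clip[i:i + win] for i in range(0, len(clip) - win + 1, hop)]
    rms = np.array([np.sqrt(np.mean(f ** 2)) for f in frames])
    zc = np.array([np.count_nonzero(np.diff(np.signbit(f).astype(np.int8)))
                   for f in frames])
    feats = []
    for x in (rms, zc):
        feats.extend([np.median(x), np.mean(x), np.var(x)])
    return np.array(feats)                      # one input vector for a classifier

# Example on a synthetic 5-second clip.
clip = 0.01 * np.random.randn(5 * 22050)
print(time_domain_stats(clip))                  # six statistics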

Three machine classifiers are employed to examine the performance of Han Chinese folk song classification using audio data. The first classifier is a conventional gradient descent-based single-hidden layer feedforward neural network. The learning algorithm employed is the resilient propagation (RPROP) algorithm. The second classifier is the extreme learning machine (ELM) and the third classifier is the finite impulse response extreme learning machine (FIR-ELM).

In all experiments, the hyperbolic tangent function is used to activate the hidden neurons for both the RPROP and the ELM, and the linear activation function is used for the output layer of the RPROP. Due to the random characteristics of both the RPROP and the ELM classifiers, each experiment is repeated 50 times and the classification accuracy is reported as the mean accuracy of the 50 repetitions. In the experiments using the FIR-ELM classifier, the simulations are performed with four types of FIR filter: low-pass, high-pass, band-pass and band-stop, with cutoff frequencies ωc from 0.1 to 0.9 in steps of 0.1. A bandwidth of ±0.05 is used for the band-pass and band-stop filters. In all experiments, the simulation begins with one hidden neuron in the hidden layer and the number is gradually increased to a maximum of 10,000 hidden neurons.
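For reference, a minimal ELM classifier consistent with the setup described above (random input weights and biases, hyperbolic tangent hidden neurons, least-squares output weights via the Moore-Penrose pseudoinverse) can be sketched as follows; the function names and the use of NumPy are illustrative assumptions rather than the thesis implementation.

import numpy as np

def elm_train(X, T, n_hidden, seed=0):
    """Basic ELM sketch (illustrative only): random hidden layer, tanh
    activation, output weights solved by least squares."""
    rng = np.random.default_rng(seed)
    n_in = X.shape[1]
    W = rng.standard_normal((n_hidden, n_in))   # random input weights, never trained
    b = rng.standard_normal(n_hidden)           # random hidden biases
    H = np.tanh(X @ W.T + b)                    # hidden-layer output matrix
    beta = np.linalg.pinv(H) @ T                # Moore-Penrose least-squares solution
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.argmax(np.tanh(X @ W.T + b) @ beta, axis=1)

The FIR-ELM differs only in how the hidden layer is obtained: the random weights and the tanh activation are replaced by fixed FIR filter coefficients and linear neurons, and the output weights are regularized as noted below.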

It should be noted that all classification results for the FIR-ELM classifier recorded in this appendix are obtained using the four filters at a cutoff frequency of ωc = 0.5 and with d/γ = 0.001.


Table A.1: Classification accuracy (%) of the RPROP classifier using median.

Number of Hidden Neurons | Time Domain Features | Frequency Domain Features | Time & Frequency Domain Features
50    | 30.26 | 31.40 | 32.54
100   | 29.69 | 34.13 | 32.76
500   | 31.29 | 36.18 | 33.67
1000  | 31.63 | 35.49 | 33.22
1500  | 31.51 | 33.56 | 32.76
2000  | 31.06 | 33.45 | 32.54
2500  | 30.60 | 31.63 | 31.63
3000  | 29.69 | 30.49 | 30.57
4000  | 29.58 | 29.69 | 30.21
5000  | 29.24 | 29.47 | 29.45
8000  | 29.01 | 29.24 | 28.11
10000 | 28.21 | 28.78 | 28.09

Table A.2: Classification accuracy (%) of the RPROP classifier using mean.

Number of Hidden Neurons | Time Domain Features | Frequency Domain Features | Time & Frequency Domain Features
50    | 31.06 | 32.88 | 34.13
100   | 30.15 | 33.45 | 34.24
500   | 29.58 | 34.33 | 35.38
1000  | 29.47 | 36.41 | 38.45
1500  | 29.58 | 34.58 | 38.57
2000  | 28.33 | 32.08 | 32.31
2500  | 28.78 | 31.51 | 31.50
3000  | 29.81 | 30.38 | 30.19
4000  | 29.47 | 29.69 | 29.58
5000  | 29.49 | 27.87 | 29.16
8000  | 29.01 | 27.87 | 28.97
10000 | 28.73 | 26.85 | 27.91


Table A.3: Classification accuracy (%) of the RPROP classifier using variance.

Number of Hidden Neurons | Time Domain Features | Frequency Domain Features | Time & Frequency Domain Features
50    | 32.08 | 31.17 | 31.29
100   | 31.06 | 31.63 | 30.94
500   | 30.60 | 31.51 | 31.51
1000  | 30.26 | 30.72 | 31.97
1500  | 30.26 | 26.05 | 30.86
2000  | 29.69 | 27.30 | 30.15
2500  | 29.24 | 26.39 | 29.57
3000  | 30.03 | 25.37 | 29.00
4000  | 29.69 | 25.37 | 28.67
5000  | 29.24 | 24.23 | 27.31
8000  | 28.78 | 23.66 | 25.60
10000 | 28.12 | 23.44 | 24.43

Table A.4: Classification accuracy (%) of the RPROP classifier using median and mean.

Number of Hidden Neurons | Time Domain Features | Frequency Domain Features | Time & Frequency Domain Features
50    | 31.85 | 30.83 | 31.74
100   | 32.08 | 32.08 | 31.51
500   | 31.97 | 34.70 | 34.58
1000  | 31.51 | 34.02 | 35.27
1500  | 31.29 | 32.65 | 32.65
2000  | 31.17 | 32.31 | 32.54
2500  | 30.94 | 31.63 | 32.33
3000  | 30.83 | 30.49 | 31.80
4000  | 30.38 | 30.12 | 31.49
5000  | 30.26 | 29.58 | 30.91
8000  | 30.17 | 29.46 | 29.85
10000 | 29.47 | 28.94 | 27.90


Table A.5: Classification accuracy (%) of the RPROP classifier using median and variance.

Number of Hidden Neurons | Time Domain Features | Frequency Domain Features | Time & Frequency Domain Features
50    | 35.95 | 30.60 | 30.26
100   | 35.15 | 34.13 | 33.33
500   | 33.56 | 35.38 | 32.88
1000  | 34.24 | 31.51 | 32.20
1500  | 33.90 | 30.83 | 30.72
2000  | 33.33 | 29.54 | 30.15
2500  | 32.42 | 28.66 | 29.84
3000  | 32.88 | 28.31 | 29.13
4000  | 32.76 | 27.42 | 28.86
5000  | 32.20 | 26.73 | 28.01
8000  | 31.63 | 26.85 | 27.94
10000 | 30.72 | 25.88 | 26.30

Table A.6: Classification accuracy (%) of the RPROP classifier using mean and variance.

Number of Hidden Neurons | Time Domain Features | Frequency Domain Features | Time & Frequency Domain Features
50    | 34.58 | 33.56 | 30.72
100   | 34.47 | 34.36 | 35.49
500   | 33.33 | 34.47 | 33.90
1000  | 32.42 | 32.31 | 32.85
1500  | 33.67 | 31.51 | 31.10
2000  | 32.20 | 28.78 | 29.12
2500  | 32.08 | 27.65 | 29.57
3000  | 31.51 | 26.96 | 28.64
4000  | 31.40 | 26.15 | 28.10
5000  | 31.29 | 25.99 | 27.65
8000  | 30.83 | 25.45 | 27.11
10000 | 30.49 | 24.65 | 26.91


Table A.7: Classification accuracy (%) of the RPROP classifier using median, mean and variance.

Number of Hidden Neurons | Time Domain Features | Frequency Domain Features | Time & Frequency Domain Features
50    | 35.72 | 32.76 | 27.99
100   | 34.93 | 34.36 | 33.69
500   | 34.02 | 33.11 | 34.24
1000  | 35.49 | 32.99 | 31.85
1500  | 34.36 | 30.94 | 32.30
2000  | 34.36 | 29.58 | 31.83
2500  | 34.70 | 28.94 | 30.65
3000  | 33.67 | 28.16 | 29.90
4000  | 33.56 | 27.83 | 28.21
5000  | 33.22 | 26.89 | 27.08
8000  | 33.56 | 26.73 | 26.99
10000 | 32.76 | 25.90 | 25.51

Table A.8: Classification accuracy (%) of the ELM classifier using median.

Number of Hidden Neurons | Time Domain Features | Frequency Domain Features | Time & Frequency Domain Features
50    | 30.44 | 30.31 | 30.03
100   | 31.74 | 30.72 | 31.19
500   | 31.26 | 31.40 | 32.01
1000  | 31.26 | 32.08 | 33.21
1500  | 30.92 | 35.49 | 38.98
2000  | 32.01 | 31.54 | 35.58
2500  | 32.90 | 31.19 | 34.78
3000  | 33.24 | 30.27 | 33.34
4000  | 32.08 | 29.83 | 32.83
5000  | 31.95 | 29.83 | 31.13
8000  | 31.54 | 28.19 | 30.83
10000 | 30.79 | 27.51 | 30.58


Table A.9: Classification accuracy (%) of the ELM classifier using mean.

Number of Hidden Neurons | Time Domain Features | Frequency Domain Features | Time & Frequency Domain Features
50    | 30.79 | 30.31 | 29.35
100   | 31.06 | 31.06 | 30.99
500   | 31.13 | 32.63 | 31.98
1000  | 31.40 | 34.33 | 32.01
1500  | 30.58 | 37.00 | 38.43
2000  | 29.83 | 35.80 | 35.80
2500  | 29.42 | 34.20 | 34.85
3000  | 29.08 | 32.89 | 33.52
4000  | 29.01 | 31.19 | 32.70
5000  | 28.94 | 30.65 | 31.60
8000  | 28.81 | 29.08 | 30.38
10000 | 28.60 | 28.65 | 29.35

Table A.10: Classification accuracy (%) of the ELM classifier using variance.

Number of Hidden Neurons | Time Domain Features | Frequency Domain Features | Time & Frequency Domain Features
50    | 28.26 | 29.08 | 29.69
100   | 29.01 | 30.85 | 30.79
500   | 29.08 | 31.74 | 31.06
1000  | 28.94 | 32.29 | 32.97
1500  | 28.74 | 36.11 | 35.97
2000  | 28.67 | 35.18 | 33.31
2500  | 28.46 | 34.34 | 33.04
3000  | 28.46 | 32.92 | 32.66
4000  | 27.65 | 31.47 | 31.95
5000  | 27.24 | 31.47 | 31.13
8000  | 27.24 | 30.58 | 29.69
10000 | 27.10 | 29.42 | 29.35


Table A.11: Classification accuracy (%) of the ELM classifier using median and mean.

Number of Hidden Neurons | Time Domain Features | Frequency Domain Features | Time & Frequency Domain Features
50    | 32.63 | 31.67 | 30.22
100   | 30.72 | 32.63 | 30.72
500   | 28.74 | 33.11 | 32.35
1000  | 28.67 | 36.04 | 34.13
1500  | 28.67 | 39.52 | 39.22
2000  | 28.26 | 38.89 | 38.84
2500  | 28.05 | 35.62 | 36.25
3000  | 27.71 | 34.98 | 34.74
4000  | 27.44 | 33.24 | 33.99
5000  | 27.37 | 33.24 | 33.69
8000  | 27.24 | 32.97 | 32.35
10000 | 27.11 | 31.60 | 31.47

Table A.12: Classification accuracy (%) of the ELM classifier using median and variance.

Number of Hidden Neurons | Time Domain Features | Frequency Domain Features | Time & Frequency Domain Features
50    | 30.44 | 31.74 | 30.44
100   | 30.31 | 32.22 | 32.63
500   | 30.24 | 32.63 | 34.54
1000  | 30.17 | 36.21 | 35.56
1500  | 30.03 | 37.82 | 39.66
2000  | 29.90 | 36.72 | 39.28
2500  | 29.90 | 34.68 | 38.12
3000  | 29.56 | 33.65 | 35.97
4000  | 29.56 | 33.42 | 35.36
5000  | 29.49 | 32.26 | 34.47
8000  | 29.42 | 31.21 | 32.49
10000 | 29.01 | 30.61 | 30.44


Table A.13: Classification accuracy (%) of the ELM classifier using mean and variance.

Number of Hidden Neurons | Time Domain Features | Frequency Domain Features | Time & Frequency Domain Features
50    | 30.10 | 29.15 | 31.95
100   | 30.24 | 30.94 | 33.38
500   | 30.85 | 32.63 | 34.16
1000  | 30.79 | 32.97 | 35.22
1500  | 30.72 | 33.92 | 35.29
2000  | 30.58 | 38.63 | 38.57
2500  | 28.94 | 35.29 | 38.26
3000  | 28.40 | 34.08 | 35.97
4000  | 28.40 | 33.65 | 35.56
5000  | 28.19 | 33.31 | 35.15
8000  | 27.65 | 33.24 | 34.26
10000 | 27.44 | 31.26 | 32.90

Table A.14: Classification accuracy (%) of the ELM classifier using median, mean and variance.

Number of Hidden Neurons | Time Domain Features | Frequency Domain Features | Time & Frequency Domain Features
50    | 29.90 | 28.26 | 30.79
100   | 30.15 | 30.85 | 31.81
500   | 32.22 | 33.45 | 33.38
1000  | 31.83 | 34.16 | 35.56
1500  | 30.95 | 35.15 | 36.01
2000  | 30.65 | 37.13 | 37.44
2500  | 30.16 | 40.89 | 39.80
3000  | 29.90 | 37.34 | 36.66
4000  | 29.39 | 35.85 | 36.18
5000  | 28.96 | 35.77 | 34.68
8000  | 28.37 | 34.85 | 33.72
10000 | 27.85 | 34.58 | 32.35


Table A.15: Classification accuracy (%) of the low-pass FIR-ELM classifier using median.

Number of Hidden Neurons | Time Domain Features | Frequency Domain Features | Time & Frequency Domain Features
50    | 26.96 | 34.47 | 35.15
100   | 30.72 | 34.13 | 33.45
500   | 28.33 | 35.49 | 33.11
1000  | 27.99 | 34.47 | 35.15
1500  | 27.99 | 34.47 | 35.49
2000  | 27.30 | 34.47 | 36.18
2500  | 26.96 | 34.47 | 35.84
3000  | 26.96 | 34.47 | 35.84
4000  | 26.96 | 34.13 | 35.84
5000  | 26.96 | 34.13 | 35.84
8000  | 26.96 | 34.13 | 35.49
10000 | 26.96 | 34.13 | 35.49

Table A.16: Classification accuracy (%) of the low-pass FIR-ELM classifier using mean.

Number of Hidden Neurons | Time Domain Features | Frequency Domain Features | Time & Frequency Domain Features
50    | 26.62 | 34.13 | 33.79
100   | 30.72 | 33.79 | 33.45
500   | 28.67 | 35.84 | 35.15
1000  | 27.99 | 36.52 | 35.15
1500  | 27.30 | 35.84 | 36.18
2000  | 27.30 | 35.49 | 36.18
2500  | 26.96 | 35.49 | 36.86
3000  | 26.96 | 35.49 | 37.20
4000  | 26.62 | 35.49 | 37.20
5000  | 26.62 | 35.49 | 37.20
8000  | 26.62 | 35.49 | 36.86
10000 | 26.62 | 35.49 | 36.18


Table A.17: Classification accuracy (%) of the low-pass FIR-ELM classifier using variance.

Number of Hidden Neurons | Time Domain Features | Frequency Domain Features | Time & Frequency Domain Features
50    | 31.74 | 37.88 | 35.84
100   | 32.08 | 35.84 | 33.11
500   | 31.74 | 34.47 | 36.52
1000  | 31.74 | 34.13 | 36.52
1500  | 31.74 | 34.13 | 36.52
2000  | 31.74 | 34.47 | 36.18
2500  | 31.74 | 34.47 | 36.18
3000  | 31.74 | 34.47 | 36.18
4000  | 31.74 | 34.47 | 36.18
5000  | 31.74 | 34.47 | 36.18
8000  | 31.74 | 34.47 | 36.18
10000 | 31.74 | 34.47 | 36.18

Table A.18: Classification accuracy (%) of the low-pass FIR-ELM classifier using median and mean.

Number of Hidden Neurons | Time Domain Features | Frequency Domain Features | Time & Frequency Domain Features
50    | 25.26 | 32.42 | 35.15
100   | 30.03 | 33.11 | 34.81
500   | 30.03 | 34.81 | 38.23
1000  | 29.69 | 37.20 | 38.23
1500  | 29.35 | 37.54 | 37.88
2000  | 29.35 | 37.54 | 37.88
2500  | 29.35 | 38.23 | 37.54
3000  | 29.01 | 37.20 | 37.54
4000  | 29.01 | 37.20 | 37.20
5000  | 29.01 | 36.86 | 36.86
8000  | 28.67 | 36.86 | 36.86
10000 | 28.67 | 36.86 | 36.86


Table A.19: Classification accuracy (%) of the low-pass FIR-ELM classifier using median and variance.

Number of Hidden Neurons | Time Domain Features | Frequency Domain Features | Time & Frequency Domain Features
50    | 30.38 | 35.15 | 36.82
100   | 31.06 | 35.15 | 35.84
500   | 33.11 | 36.52 | 36.86
1000  | 32.76 | 38.23 | 37.20
1500  | 32.76 | 36.86 | 37.20
2000  | 32.76 | 36.86 | 37.20
2500  | 32.76 | 36.86 | 37.20
3000  | 32.08 | 36.86 | 37.20
4000  | 31.74 | 36.86 | 36.86
5000  | 31.40 | 36.52 | 36.86
8000  | 31.06 | 36.18 | 36.86
10000 | 30.38 | 36.18 | 36.52

Table A.20: Classification accuracy (%) of the low-pass FIR-ELM classifier using mean and variance.

Number of Hidden Neurons | Time Domain Features | Frequency Domain Features | Time & Frequency Domain Features
50    | 30.38 | 35.49 | 37.20
100   | 31.06 | 36.52 | 38.23
500   | 32.08 | 37.54 | 38.23
1000  | 31.74 | 38.23 | 37.88
1500  | 31.74 | 37.88 | 37.88
2000  | 31.74 | 37.88 | 37.88
2500  | 31.06 | 37.88 | 37.88
3000  | 31.06 | 37.88 | 37.88
4000  | 31.06 | 37.54 | 37.54
5000  | 31.06 | 37.54 | 37.54
8000  | 31.06 | 37.54 | 37.54
10000 | 30.38 | 37.54 | 37.20


Table A.21: Classification accuracy (%) of the low-pass FIR-ELM classifier using median, mean and variance.

Number of Hidden Neurons | Time Domain Features | Frequency Domain Features | Time & Frequency Domain Features
50    | 31.40 | 38.23 | 34.81
100   | 32.08 | 38.91 | 36.86
500   | 33.79 | 39.25 | 39.21
1000  | 33.79 | 39.93 | 41.30
1500  | 33.45 | 39.93 | 40.96
2000  | 33.45 | 39.59 | 40.96
2500  | 33.45 | 39.25 | 40.61
3000  | 33.11 | 39.25 | 39.93
4000  | 33.11 | 39.25 | 39.59
5000  | 32.76 | 39.25 | 39.25
8000  | 32.76 | 39.25 | 39.25
10000 | 32.08 | 39.25 | 38.91

Table A.22: Classification accuracy (%) of the high-pass FIR-ELM classifier using median.

Number of Hidden Neurons | Time Domain Features | Frequency Domain Features | Time & Frequency Domain Features
50    | 26.96 | 32.08 | 31.40
100   | 27.30 | 33.79 | 32.76
500   | 30.72 | 34.81 | 35.15
1000  | 27.99 | 34.47 | 36.52
1500  | 27.99 | 34.81 | 36.18
2000  | 27.30 | 34.81 | 35.84
2500  | 27.30 | 34.81 | 35.84
3000  | 26.96 | 34.81 | 35.84
4000  | 26.96 | 34.81 | 35.84
5000  | 26.96 | 34.47 | 35.49
8000  | 26.96 | 33.79 | 35.49
10000 | 26.96 | 33.79 | 35.49


Table A.23: Classification accuracy (%) of the high-pass FIR-ELM classifier using mean.

Number of Hidden Neurons | Time Domain Features | Frequency Domain Features | Time & Frequency Domain Features
50    | 26.62 | 34.81 | 34.13
100   | 27.99 | 34.81 | 33.45
500   | 30.38 | 35.49 | 35.49
1000  | 27.99 | 35.49 | 36.52
1500  | 27.30 | 35.49 | 36.18
2000  | 27.30 | 35.49 | 35.84
2500  | 27.30 | 35.49 | 35.84
3000  | 26.62 | 35.49 | 35.49
4000  | 26.62 | 34.81 | 35.49
5000  | 26.62 | 34.81 | 35.49
8000  | 26.62 | 34.47 | 35.49
10000 | 26.62 | 34.47 | 35.49

Table A.24: Classification accuracy (%) of the high-pass FIR-ELM classifier using variance.

Number of Hidden Neurons | Time Domain Features | Frequency Domain Features | Time & Frequency Domain Features
50    | 31.74 | 34.13 | 33.45
100   | 32.08 | 36.18 | 35.15
500   | 31.74 | 34.81 | 36.18
1000  | 31.74 | 34.47 | 36.52
1500  | 31.74 | 34.47 | 36.18
2000  | 31.74 | 34.47 | 36.18
2500  | 31.74 | 34.47 | 36.18
3000  | 31.74 | 34.13 | 36.18
4000  | 31.74 | 34.13 | 36.18
5000  | 31.74 | 34.13 | 36.18
8000  | 31.74 | 34.13 | 36.18
10000 | 31.74 | 34.13 | 36.18


Table A.25: Classification accuracy (%) of the high-pass FIR-ELM classifier using median and mean.

Number of Hidden Neurons | Time Domain Features | Frequency Domain Features | Time & Frequency Domain Features
50    | 25.26 | 33.79 | 35.49
100   | 29.01 | 35.84 | 35.15
500   | 30.38 | 37.20 | 37.88
1000  | 30.03 | 37.54 | 37.88
1500  | 30.03 | 37.20 | 37.88
2000  | 30.03 | 37.20 | 37.54
2500  | 29.69 | 36.86 | 37.54
3000  | 29.01 | 36.86 | 37.54
4000  | 28.67 | 36.86 | 37.54
5000  | 28.33 | 36.52 | 37.54
8000  | 28.33 | 36.18 | 36.86
10000 | 27.30 | 35.49 | 36.52

Table A.26: Classification accuracy (%) of the high-pass FIR-ELM classifier using median and variance.

Number of Hidden Neurons | Time Domain Features | Frequency Domain Features | Time & Frequency Domain Features
50    | 30.38 | 35.15 | 37.20
100   | 31.74 | 35.15 | 37.54
500   | 33.11 | 38.23 | 37.20
1000  | 32.76 | 37.20 | 37.20
1500  | 32.76 | 36.86 | 37.20
2000  | 32.76 | 36.86 | 36.86
2500  | 32.42 | 36.86 | 36.86
3000  | 32.08 | 36.52 | 36.86
4000  | 31.74 | 36.52 | 36.86
5000  | 31.11 | 36.52 | 36.86
8000  | 31.11 | 36.18 | 36.52
10000 | 30.38 | 36.18 | 36.52


Table A.27: Classification accuracy (%) of the high-pass FIR-ELM classifier using mean and variance.

Number of Hidden Neurons | Time Domain Features | Frequency Domain Features | Time & Frequency Domain Features
50    | 30.38 | 35.49 | 36.18
100   | 30.72 | 37.20 | 38.23
500   | 31.40 | 37.54 | 37.88
1000  | 31.74 | 37.88 | 37.54
1500  | 31.74 | 37.88 | 37.54
2000  | 31.40 | 37.88 | 37.54
2500  | 31.06 | 37.88 | 37.54
3000  | 31.06 | 37.88 | 37.20
4000  | 31.06 | 37.88 | 37.20
5000  | 31.06 | 37.88 | 36.86
8000  | 31.06 | 37.54 | 36.86
10000 | 31.06 | 37.20 | 36.52

Table A.28: Classification accuracy (%) of the high-pass FIR-ELM classifier using median, mean and variance.

Number of Hidden Neurons | Time Domain Features | Frequency Domain Features | Time & Frequency Domain Features
50    | 29.35 | 36.18 | 36.18
100   | 30.74 | 38.57 | 38.23
500   | 33.11 | 38.57 | 38.91
1000  | 34.13 | 38.91 | 40.96
1500  | 33.11 | 40.51 | 40.96
2000  | 33.11 | 40.27 | 40.61
2500  | 32.76 | 40.27 | 40.61
3000  | 32.42 | 39.59 | 40.61
4000  | 32.42 | 39.59 | 40.61
5000  | 32.42 | 39.59 | 40.27
8000  | 31.74 | 39.59 | 40.27
10000 | 31.74 | 39.59 | 39.59


Table A.29: Classification accuracy (%) of the band-pass FIR-ELM classifier using median.

Number of Hidden Neurons | Time Domain Features | Frequency Domain Features | Time & Frequency Domain Features
50    | 26.96 | 33.45 | 36.18
100   | 26.96 | 34.81 | 36.86
500   | 30.72 | 33.79 | 36.18
1000  | 27.99 | 33.79 | 36.18
1500  | 27.30 | 33.79 | 36.18
2000  | 27.30 | 33.79 | 35.84
2500  | 26.96 | 33.79 | 35.84
3000  | 26.96 | 33.45 | 35.84
4000  | 26.96 | 33.45 | 35.84
5000  | 26.96 | 33.45 | 35.49
8000  | 26.96 | 33.11 | 35.49
10000 | 26.96 | 33.11 | 35.49

Table A.30: Classification accuracy (%) of the band-pass FIR-ELM classifier using mean.

Number of Hidden Neurons | Time Domain Features | Frequency Domain Features | Time & Frequency Domain Features
50    | 26.62 | 34.13 | 35.15
100   | 26.62 | 35.84 | 35.84
500   | 30.38 | 36.18 | 37.20
1000  | 27.99 | 35.84 | 36.86
1500  | 27.30 | 35.84 | 36.52
2000  | 27.30 | 35.49 | 36.52
2500  | 26.96 | 35.49 | 36.52
3000  | 26.96 | 35.49 | 36.52
4000  | 26.96 | 35.49 | 36.18
5000  | 26.62 | 35.49 | 36.18
8000  | 26.62 | 35.15 | 36.18
10000 | 26.62 | 35.15 | 36.18


Table A.31: Classification accuracy (%) of the band-pass FIR-ELM classifier using variance.

Number of Hidden Neurons | Time Domain Features | Frequency Domain Features | Time & Frequency Domain Features
50    | 31.74 | 34.47 | 35.15
100   | 31.74 | 34.47 | 35.49
500   | 31.74 | 34.13 | 36.52
1000  | 31.74 | 34.13 | 36.18
1500  | 31.74 | 34.13 | 36.18
2000  | 31.74 | 34.13 | 36.18
2500  | 31.74 | 34.13 | 36.18
3000  | 31.74 | 34.13 | 36.18
4000  | 31.74 | 34.13 | 36.18
5000  | 31.74 | 34.13 | 36.18
8000  | 31.74 | 34.13 | 36.18
10000 | 31.74 | 34.13 | 36.18

Table A.32: Classification accuracy (%) of the band-pass FIR-ELM classifier using median and mean.

Number of Hidden Neurons | Time Domain Features | Frequency Domain Features | Time & Frequency Domain Features
50    | 25.26 | 36.52 | 33.45
100   | 26.96 | 36.86 | 36.18
500   | 29.01 | 38.23 | 38.91
1000  | 30.03 | 38.23 | 39.93
1500  | 30.03 | 38.23 | 39.93
2000  | 30.03 | 38.23 | 39.93
2500  | 30.03 | 37.88 | 39.25
3000  | 30.03 | 37.88 | 39.25
4000  | 29.35 | 37.88 | 39.25
5000  | 29.35 | 37.88 | 38.91
8000  | 29.01 | 37.20 | 38.23
10000 | 28.67 | 37.20 | 37.20


Table A.33: Classification accuracy (%) of the band-pass FIR-ELM classifier using median and variance.

Number of Hidden Neurons | Time Domain Features | Frequency Domain Features | Time & Frequency Domain Features
50    | 32.08 | 36.52 | 35.84
100   | 32.08 | 37.20 | 36.52
500   | 32.76 | 38.57 | 37.88
1000  | 32.76 | 38.57 | 37.20
1500  | 32.76 | 38.57 | 36.86
2000  | 32.76 | 38.57 | 36.86
2500  | 32.42 | 38.57 | 36.18
3000  | 32.42 | 38.57 | 36.18
4000  | 32.08 | 38.57 | 35.84
5000  | 32.08 | 38.23 | 35.84
8000  | 32.08 | 38.23 | 35.49
10000 | 32.08 | 37.88 | 35.49

Table A.34: Classification accuracy (%) of the band-pass FIR-ELM classifier using mean and variance.

Number of Hidden Neurons | Time Domain Features | Frequency Domain Features | Time & Frequency Domain Features
50    | 31.40 | 36.86 | 36.86
100   | 32.08 | 37.54 | 37.20
500   | 32.42 | 38.57 | 37.54
1000  | 32.08 | 38.57 | 37.20
1500  | 32.08 | 38.57 | 36.52
2000  | 32.08 | 38.57 | 36.52
2500  | 32.08 | 38.23 | 36.18
3000  | 31.74 | 38.23 | 36.18
4000  | 31.40 | 38.23 | 35.84
5000  | 31.40 | 37.88 | 35.84
8000  | 31.06 | 37.88 | 35.49
10000 | 31.06 | 37.88 | 35.15


Table A.35: Classification accuracy (%) of the band-pass FIR-ELM classifier using median, mean and variance.

Number of Hidden Neurons | Time Domain Features | Frequency Domain Features | Time & Frequency Domain Features
50    | 29.69 | 37.88 | 37.20
100   | 30.38 | 38.57 | 39.25
500   | 32.08 | 40.96 | 40.96
1000  | 34.47 | 41.30 | 40.96
1500  | 33.79 | 40.96 | 40.61
2000  | 33.79 | 40.96 | 40.27
2500  | 33.79 | 40.61 | 39.93
3000  | 33.79 | 40.61 | 39.93
4000  | 33.45 | 39.93 | 39.59
5000  | 33.45 | 39.93 | 39.59
8000  | 33.11 | 39.59 | 39.25
10000 | 33.11 | 38.57 | 39.25

Table A.36: Classification accuracy (%) of the band-stop FIR-ELM classifier using median.

Number of Hidden Neurons | Time Domain Features | Frequency Domain Features | Time & Frequency Domain Features
50    | 26.96 | 28.26 | 29.01
100   | 26.96 | 31.67 | 35.02
500   | 26.96 | 34.40 | 37.88
1000  | 26.96 | 39.52 | 40.89
1500  | 26.96 | 40.61 | 41.71
2000  | 26.96 | 41.37 | 42.66
2500  | 26.96 | 41.98 | 41.91
3000  | 26.96 | 41.16 | 41.71
4000  | 26.96 | 41.02 | 40.27
5000  | 26.96 | 40.89 | 40.00
8000  | 26.96 | 39.93 | 39.86
10000 | 26.96 | 38.29 | 39.39


Table A.37: Classification accuracy (%) of the band-stop FIR-ELM classifier using mean.

Number of Hidden Neurons | Time Domain Features | Frequency Domain Features | Time & Frequency Domain Features
50    | 26.62 | 27.44 | 30.17
100   | 26.62 | 29.83 | 31.60
500   | 26.62 | 33.45 | 34.13
1000  | 26.62 | 37.54 | 37.82
1500  | 26.62 | 39.04 | 40.27
2000  | 26.62 | 41.09 | 43.82
2500  | 26.62 | 43.41 | 41.77
3000  | 26.62 | 42.25 | 41.57
4000  | 26.62 | 41.30 | 41.09
5000  | 26.62 | 41.09 | 40.55
8000  | 26.62 | 40.55 | 40.27
10000 | 26.62 | 39.32 | 39.66

Table A.38: Classification accuracy (%) of the band-stop FIR-ELM classifier using variance.

Number of Hidden Neurons | Time Domain Features | Frequency Domain Features | Time & Frequency Domain Features
50    | 31.74 | 26.76 | 28.67
100   | 31.74 | 29.22 | 31.54
500   | 31.74 | 36.11 | 37.20
1000  | 31.74 | 38.02 | 38.43
1500  | 31.74 | 39.25 | 39.32
2000  | 31.74 | 40.27 | 39.93
2500  | 31.74 | 39.59 | 39.93
3000  | 31.74 | 39.39 | 39.66
4000  | 31.74 | 39.25 | 39.32
5000  | 31.74 | 38.77 | 38.63
8000  | 31.74 | 38.43 | 38.50
10000 | 31.74 | 38.02 | 37.95


Table A.39: Classification accuracy (%) of the band-stop FIR-ELM classifier using median and mean.

Number of Hidden Neurons | Time Domain Features | Frequency Domain Features | Time & Frequency Domain Features
50    | 26.28 | 29.62 | 28.19
100   | 26.28 | 31.13 | 31.33
500   | 26.28 | 36.59 | 35.49
1000  | 26.96 | 39.32 | 40.48
1500  | 26.96 | 40.96 | 41.71
2000  | 26.96 | 44.03 | 44.10
2500  | 29.01 | 43.75 | 43.75
3000  | 27.65 | 43.62 | 43.41
4000  | 26.96 | 43.28 | 42.53
5000  | 26.96 | 42.12 | 42.39
8000  | 26.96 | 41.64 | 42.32
10000 | 26.96 | 41.23 | 42.05

Table A.40: Classification accuracy (%) of the band-stop FIR-ELM classifier using median and variance.

Number of Hidden Neurons | Time Domain Features | Frequency Domain Features | Time & Frequency Domain Features
50    | 31.40 | 31.47 | 30.44
100   | 31.40 | 37.06 | 33.45
500   | 31.40 | 39.53 | 38.91
1000  | 31.40 | 40.14 | 39.80
1500  | 31.40 | 41.77 | 41.16
2000  | 31.40 | 43.55 | 43.41
2500  | 31.74 | 43.28 | 43.07
3000  | 31.74 | 42.18 | 42.46
4000  | 31.40 | 42.12 | 42.39
5000  | 31.40 | 41.84 | 41.50
8000  | 31.06 | 41.57 | 40.14
10000 | 31.06 | 40.14 | 40.00


Table A.41: Classification accuracy (%) of the band-stop FIR-ELM classifier using mean and variance.

Number of Hidden Neurons | Time Domain Features | Frequency Domain Features | Time & Frequency Domain Features
50    | 31.40 | 31.95 | 31.67
100   | 31.40 | 32.01 | 34.47
500   | 31.74 | 37.95 | 38.29
1000  | 31.74 | 40.68 | 40.20
1500  | 31.74 | 42.32 | 43.21
2000  | 31.74 | 43.41 | 43.96
2500  | 32.08 | 42.25 | 43.82
3000  | 31.40 | 41.71 | 43.34
4000  | 31.40 | 41.37 | 42.39
5000  | 31.06 | 41.30 | 42.05
8000  | 30.72 | 40.00 | 41.91
10000 | 30.72 | 38.16 | 40.75

Table A.42: Classification accuracy (%) of the band-stop FIR-ELM classifier using median, mean and variance.

Number of Hidden Neurons | Time Domain Features | Frequency Domain Features | Time & Frequency Domain Features
50    | 30.03 | 31.47 | 32.01
100   | 30.72 | 35.70 | 38.29
500   | 30.72 | 39.11 | 39.25
1000  | 30.72 | 41.91 | 40.20
1500  | 30.72 | 42.73 | 41.84
2000  | 32.42 | 44.10 | 44.64
2500  | 32.42 | 43.69 | 43.07
3000  | 31.40 | 43.69 | 42.80
4000  | 31.06 | 43.62 | 42.59
5000  | 30.72 | 42.66 | 41.57
8000  | 30.38 | 42.59 | 40.20
10000 | 30.38 | 41.68 | 40.00


This page is intentionally left blank.

List of Publications

1. S. Khoo, Z. Man and Z. Cao, “Automatic Han Chinese folk song classification using extreme learning machines,” 25th Australasian Joint Conference on Artificial Intelligence, AI 2012, pp. 49-60, 4-7 Dec. 2012.

2. S. Khoo, Z. Man and Z. Cao, “Automatic Han Chinese folk song classification using the musical feature density map,” 6th International Conference on Signal Processing and Communication Systems, ICSPCS 2012, 12-14 Dec. 2012.

3. S. Khoo, Z. Man, Z. Cao and J. Zheng, “German vs. Austrian folk song classification,” 8th IEEE Conference on Industrial Electronics and Applications, ICIEA 2013, 19-21 Jun. 2013.
