<<

UNIVERSITÀ DEGLI STUDI DI BRESCIA
FACOLTÀ DI INGEGNERIA
Dipartimento di Ingegneria dell'Informazione

DOTTORATO DI RICERCA IN INGEGNERIA DELLE TELECOMUNICAZIONI
XXVII CICLO
SSD: ING-INF/03

Harmonic Analysis for Music Transcription and Characterization

Ph.D. Candidate: Ing. Alessio Degani

Ph.D. Supervisor: Prof. Pierangelo Migliorati


Ph.D. Coordinator: Prof. Riccardo Leonardi


ANNO ACCADEMICO 2013/2014

to my family

Sommario

L'oggetto di questa tesi è lo studio dei vari metodi per la stima dell'informazione tonale in un brano musicale digitale. Il lavoro si colloca nel settore scientifico denominato Music Information Retrieval, il quale studia le innumerevoli tematiche che riguardano l'estrazione di informazioni di alto livello attraverso l'analisi del segnale audio. Nello specifico, in questa dissertazione andremo ad analizzare quelle procedure atte ad estrarre l'informazione tonale e armonica a diversi livelli di astrazione.

Come prima cosa verrà presentato un metodo per stimare la presenza e la precisa localizzazione frequenziale delle componenti sinusoidali stazionarie a breve termine, ovvero le componenti fondamentali che identificano note e accordi, quindi l'informazione tonale/armonica.

Successivamente verrà esposta un'analisi esaustiva dei metodi di stima della frequenza di riferimento (usata per accordare gli strumenti musicali) basati sui picchi spettrali. Di solito la frequenza di riferimento è considerata standard e associata al valore di 440 Hz, ma non sempre è così. Vedremo quindi che, per migliorare le prestazioni dei vari metodi che si affidano ad una stima del contenuto armonico e melodico per determinati scopi, è fondamentale avere una stima coerente e robusta della frequenza di riferimento.

In seguito, verrà presentato un sistema innovativo per misurare la rilevanza di una data componente frequenziale sinusoidale in un ambiente polifonico. Questo può essere usato come front-end per un metodo per la trascrizione automatica di partiture polifoniche. Poi vedremo come usare dei descrittori audio di tipo armonico, chiamati Pitch Class Profile, per identificare i cambi di accordo in una composizione musicale.

Infine verrà affrontato il tema dell'identificazione delle canzoni cover (versioni alternative di una canzone). A tal proposito proponiamo una strategia automatica per combinare i risultati ottenuti da diversi metodi, in modo da migliorare le performance dell'intero sistema.

Abstract

This thesis is concerned with the analysis of the digital music signal for the extraction of meaningful information about the tonal content of the audio excerpt. This work lies in the field of Music Information Retrieval, a science whose goal is to extract high-level, human-readable information from a musical composition. In this work we cover the retrieval of the tonal content at different levels of abstraction.

First, we present a method for the estimation of the presence of short-term stationary sinusoidal components, with a precise resolution. The sinusoidal components are the main atoms that compose a musical note or a musical chord, and thus the tonal/harmonic information.

Next, we show an exhaustive comparative analysis of different spectral-peak based tuning frequency estimation algorithms. The tuning frequency is usually set to the widely accepted standard of 440 Hz. However, several musical pieces exhibit a slight deviation in the tuning frequency. Therefore, a reliable reference frequency estimation method is fundamental in order not to deteriorate the performance of the systems that use this information.

Then, we present a novel system to measure the salience of a given sinusoidal component in a mixture of partials generated in a polyphonic composition. This measure can be used as a front-end for an automatic music transcription system. We then show how to use the harmonic mid-level representation called Pitch Class Profile for detecting the musical chord boundaries in a song.

Finally, we deal with the task of cover song identification (identifying different renditions of a given song). We propose an automatic method to combine the results of several different systems in order to improve the detection accuracy of a cover song identification algorithm.

Acknowledgements

I would like to thank my supervisor, Prof. Pierangelo Migliorati, for his guidance and support throughout my three years of PhD. I would also like to thank Prof. Riccardo Leonardi, my PhD coordinator, for giving me the opportunity to work in this field and for making possible my experience abroad in the beautiful city of Paris. My special thanks go to Ing. Marco Dalai for his precious help and his enlightening conversations. I would also like to thank HDR-Dr. Geoffroy Peeters for his supervision during one of the best experiences of these years: my stay at IRCAM, Paris, in the Analysis/Synthesis team. There, I met a lot of great people who share a passion for music and science. Finally, a big thanks to my family and my girlfriend for believing in me during the PhD years.

Contents

1 Introduction
  1.1 Motivation
  1.2 Context
    1.2.1 Music Information Retrieval
    1.2.2 Frequency Analysis
    1.2.3 Audio Features
  1.3 Overview of the presented work
  1.4 Contributions
  1.5 Outline

2 Background
  2.1 Elements of Music Theory
    2.1.1 Pitch and Musical Notes
    2.1.2 Tuning and Temperaments
    2.1.3 MIDI Tuning Standard
    2.1.4 Musical Scales
    2.1.5 Musical Chords, Melody and Harmony
    2.1.6 Timbre and Dynamics

3 Phase-based sinusoids localization
  3.1 Time-Frequency analysis
    3.1.1 Short Time Fourier Transform
    3.1.2 Phase evolution of the STFT
  3.2 Phase coherence measure
    3.2.1 Coherence measure
    3.2.2 Coherence function
  3.3 Results and Applications
  3.4 Conclusions

4 Tuning Frequency Estimation
  4.1 Tuning Frequency Estimation Methods
    4.1.1 Frequency Deviation Histogram
    4.1.2 Circular statistics
    4.1.3 Least-Squares Estimation
  4.2 Evaluation Strategy
    4.2.1 Ideal case performances and global reference frequency estimation
    4.2.2 Speed of convergence and estimation stability
    4.2.3 Local tuning estimation
    4.2.4 Computational cost and complexity
  4.3 Data Set
    4.3.1 Cover Song 80 (covers80)
    4.3.2 MuseScore Symbolic Music Dataset (MS2012)
  4.4 Results
    4.4.1 Ideal case performances and global reference frequency estimation results
    4.4.2 Speed of convergence and estimation stability results
    4.4.3 Local tuning estimation results
    4.4.4 Computational cost and complexity
  4.5 Conclusions

5 Polyphonic Pitch Salience Function
    5.0.1 Classical approach
    5.0.2 Proposal
  5.1 Proposed Method
    5.1.1 Overview
    5.1.2 Motivations for using frequency deviations for pitch salience computation
    5.1.3 Short Time Fourier Transform
    5.1.4 Spectrum Peak Picking
    5.1.5 Reference Frequency Estimation
    5.1.6 Salience Function Computation
  5.2 Evaluation
    5.2.1 Multiple-pitch estimation: post-processing of the salience function
    5.2.2 Evaluation measures
    5.2.3 Test-Set
    5.2.4 Results
  5.3 Conclusions

6 Chord Bounds Detection
  6.1 Harmonic Change Detection Function
    6.1.1 Algorithm for HCDF calculation
  6.2 Chroma Features and Novelty calculation
    6.2.1 Other Chroma Features
    6.2.2 Distance measure
  6.3 Evaluation
  6.4 Results
  6.5 Conclusions

7 Distance Fusion for Cover Song Id.
  7.1 Audio Features and distance metrics
    7.1.1 Audio Features
    7.1.2 Distance Measures
  7.2 Distance Selection
  7.3 Results
  7.4 Conclusions

8 Conclusions
  8.1 Summary of contributions
  8.2 Future Perspectives

Bibliography

List of Figures

1.1 Outline of the dissertation

2.1 Piano keys and their note names
2.2 Examples of Western musical scales
2.3 Examples of musical chords
2.4 An excerpt of "Polonaise in G minor" by J. S. Bach
2.5 Amplitude spectrum of two different instruments

3.1 Example of Phase coherence measure
3.2 Phase Coherence Function for a single frequency component
3.3 Amplitude spectrum versus Phase Coherence Weighted Modulus (one freq. component)
3.4 Amplitude spectrum versus Phase Coherence Weighted Modulus (two freq. components)
3.5 Amplitude spectrum versus Phase Coherence Weighted Modulus (SNR = 10 dB)
3.6 Amplitude spectrum versus Phase Coherence Weighted Modulus (SNR = −10 dB)

4.1 f_ref estimation of a sawtooth sweep signal
4.2 f_ref estimation histogram (k = 5)
4.3 f_ref estimation histogram (k = 30)
4.4 Convergence results for covers80
4.5 Convergence results for MS2012
4.6 Estimated Σ for covers80
4.7 Estimated Σ for MS2012
4.8 Local tuning estimation of the song "Let It Be"
4.9 Local tuning estimation of "Variations 16-20"
4.10 Local tuning estimation of Choir performance

5.1 Frequency location of the pitches of the equal tempered scale
5.2 General scheme of the method for salience computation
5.3 Deviation of the first 20 harmonics of a complex tone
5.4 Piano roll representation obtained using our salience function
5.5 Pitch estimation results (Harmonic model)
5.6 Pitch estimation results
5.7 Pitch-Class estimation results
5.8 Pitch Recall/Precision curve

6.1 Annotated chord progression and ideal HCDF for chords

7.1 General scheme of the method of distance fusion
7.2 Scatter plot of the 2D distance space

List of Tables

4.1 Length of the songs in the datasets for f_ref estimation
4.2 General MIDI instruments name
4.3 Genre tags
4.4 f_ref mean and standard deviation results
4.5 Total and average execution time

5.1 Comparison of Pitch F-Measure results on MAPS test-set

6.1 Test-set: 16 Beatles songs
6.2 Result summary of the HCDF segmentation

7.1 Accuracy results of distance fusion method

1 Introduction

“Mathematics compares the most diverse phenomena and discovers the secret analogies that unite them” — Jean Baptiste Joseph Fourier

1.1 Motivation

Music Information Retrieval (MIR) is a relatively young science and there are many open questions in this field. In recent years, the number of researchers involved in MIR has grown exponentially and the topics are continuously increasing; consequently, the problems become more complex and interesting. Furthermore, music is for the author, from both the artistic and the scientific point of view, an interesting and stimulating field.

MIR covers a large spectrum of facets that goes from the understanding of the physiological and psychological aspects of the cognitive process of sound perception, to the mathematical formulation of the methods involved in making musical features understandable by a machine. For example, for a trained musician, music transcription is not a difficult task, but for a computer this can be a very tough assignment. It is a great challenge to "teach" a machine how to listen to and understand music as efficiently as a human does. The goal of MIR is to deeply understand the cognitive process and, through the means of scientific analysis and tools, to give the user some usable equipment to help him in organizing, indexing, learning, creating, studying or visualizing music.

Recently, MIR research has come to light with different user-centered applications such as automatic song identification software (e.g. Shazam, http://www.shazam.com/), music recommendation based on user preferences (e.g. Spotify, http://www.spotify.com/) and many other facilities for musicians, such as automatic score followers (e.g. AnteScoFo, http://forumnet.ircam.fr/product/antescofo/). The aim of the author, as a MIR researcher and as a user of such MIR applications, is to give a personal contribution to this field in the understanding of the behaviour of different algorithms, and to propose some new tools for the tonal analysis of musical pieces.

1.2 Context

This thesis is concerned with the analysis of the digital music signal for the extraction of meaningful information about the tonal content of the audio excerpt. In this work we cover the tonal aspect at different levels, from low-level layers such as sinusoidal estimation, to higher musically-oriented levels such as automatic note/chord estimation and audio similarity.

The complicated task of automatic music transcription is considered the "Holy Grail" of Music Information Retrieval since, once the note-level transcription of a song is obtained, it would be much easier to perform other tasks such as chord estimation, music similarity and so on. Nowadays there is still no automatic music transcription system that works satisfactorily, except in some restrictive cases (monophonic or mono-instrument music). This is the main reason why, for example, a chord recognition system uses a mid-level audio representation, instead of the more obvious note-level transcription, in order to label musical chords.

In the context of musical signal processing, a mid-level representation is a summarization of the low-level musical facets, such as the spectrum or the sinusoidal content, into a sort of human-readable representation in between the physical representation of the audio (air pressure vs. time) and the symbolic representation, and it can itself be analysed by a computer. The mid-level information is constructed by the feature extraction task explained in Section 1.2.3.

We now give a brief overview of the emerging discipline of Music Information Retrieval and of the basics of audio and music signal processing.

1.2.1 Music Information Retrieval

Music Information Retrieval is an interdisciplinary science whose aim is to retrieve information from the music signal. An MIR researcher may have a background in machine learning, statistics, digital signal processing, music/musicology and psychology/psychoacoustics. MIR is used for categorizing (music indexing and recommendation), creating or manipulating music. Some example applications are:

• Automatic indexing
This class includes tasks such as musical genre categorization (pop, jazz, rock, classical, ...) and artist recognition. In general, this category can be summarized as music auto-tagging, which can be useful for organizing big libraries of music in a fully automatic (or supervised) manner. This is mainly exploited as a classification or clustering problem and it usually makes use of machine learning techniques. With the rise of huge on-line music libraries, automatic music indexing becomes very useful both for database organization and for recommendation systems. Other tasks that fall in this category are musical structure segmentation/summarization and music similarity/cover song identification.

• Automatic music transcription
Automatic music transcription aims to convert an audio recording into a symbolic music representation such as a musical score or a MIDI file. This task can be subdivided into several sub-systems that are dedicated to solving specific problems such as onset detection, pitch/multi-pitch estimation, key/chord estimation, tempo/rhythm estimation and instrument identification. This is perhaps the most difficult task in MIR. The complexity (and thus the error rate) grows very rapidly with the number of instruments or the polyphony degree in a musical piece. For this task, ad-hoc signal processing methods and models have been developed; however, for key/chord recognition, machine learning systems that use Hidden Markov Models (HMM) or Dynamic Bayesian Networks (DBN) are the best performing.

• Recommendation systems
Recommendation systems use musical tags and user preferences in order to suggest new songs on the basis of the user's taste. This is strictly related to the task of automatic indexing, but can integrate automatic annotations, user/expert tagging and preferences into a big-data machine learning system in order to generate recommendations for a specific user.

• Source separation and instrument recognition
The aim of source separation is to split a music composition into its constituent instruments. Instrument recognition is then used to classify the separated tracks. Source separation in audio is also referred to as audio upmix (the opposite of downmix, the work done in the recording studio during mixing). This task has more or less the same difficulty as automatic music transcription, although it uses different approaches such as Non-negative Matrix Factorization (NMF).

1.2.2 Frequency Analysis

Frequency analysis techniques include a set of (mathematical) tools used to transform the sampled audio signal in order to obtain useful information in a human-readable and usable form. One of the basic and perhaps most used tools is the discrete formulation of the Fourier Transform, the Discrete Fourier Transform (DFT, or its efficient implementation known as the FFT). Since the audio signal has non-stationary content, a sliding-window DFT is needed in order to analyse the musical signal in the time-frequency domain. This extension of the DFT is called the Short-Time Fourier Transform (STFT). The majority of the mid-level representations and methods used in MIR share an STFT computation as a starting point.

1.2.3 Audio Features

As said before, MIR usually uses a representation of the audio signal that can be directly understandable by both the computer and the user. The feature extraction process aims to create a mid-level representation, or summarization, of the audio/musical content upon which a MIR algorithm can work. A sampled audio stream can be very expensive in terms of memory storage, and thus for computational purposes. Extracting meaningful information from the audio signal is crucial for any MIR task, and the type of feature to extract from the musical signal depends on the application. With a specific audio mid-level representation, an MIR algorithm can perform further analysis, such as applying Machine Learning methods on the audio features in order to accomplish a certain task. As mentioned before, the STFT is a common starting point for feature extraction; indeed, it can be considered a mid-level representation as well. Two key parameters of the STFT analysis are the duration and the time shift of the so-called analysis frame. The analysis frame is a temporal window indexed by a time-stamp, and the frame duration and the hop-size (distance between adjacent frames) are crucial for the trade-off between time and frequency resolution. There is no optimal choice in general; the hop-size and the frame duration (usually measured in samples or in seconds) are set according to the application.
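As a concrete illustration of this trade-off, the following short sketch computes the time and frequency resolutions implied by a given frame length and hop size. The numerical values are only examples (they happen to match the settings used later in Chapter 3), not a general recommendation:

```python
# Example analysis-frame parameters (illustrative values only).
sample_rate = 22050          # Hz
frame_length = 4096          # samples per analysis frame
hop_size = 1024              # samples between adjacent frames

frame_duration = frame_length / sample_rate      # seconds of signal per frame
time_resolution = hop_size / sample_rate         # seconds between frame time-stamps
freq_resolution = sample_rate / frame_length     # Hz between DFT bins

print(f"frame duration : {frame_duration * 1000:.1f} ms")
print(f"hop (time step): {time_resolution * 1000:.1f} ms")
print(f"bin spacing    : {freq_resolution:.2f} Hz")
```

A longer frame gives finer frequency spacing but coarser time localization; the hop size only controls how densely the time axis is sampled.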

1.3 Overview of the presented work

In our work we studied different methods of feature extraction and their applications at different levels of abstraction.

First, a method to detect the presence, and estimate the frequency, of short-term stationary sinusoidal components of an audio signal is presented. Here, short-term stationary means a sinusoid, at a given frequency, that lasts for two or more analysis frames. Usually, a sinusoid is associated to a frequency-localized peak in the STFT domain. Since the sinusoidal components of the audio signal are the main elements of timbre and pitch perception, a method that detects and precisely locates these sinusoidal atoms in time and frequency is crucial.

Then, different methods of reference frequency estimation have been studied. We have investigated the methods that use the spectral peaks as audio features for the estimation task, and we have compared their performances under different criteria and constraints.

Next, we propose a new pitch salience function for the analysis of polyphonic music. The aim of a pitch salience function is to assign a score to a single pitch component (i.e. spectral peak) in order to identify the most perceptually strong pitches in a mixture. This can be used as a pre-processing step of a pitch/multi-pitch detection algorithm.

At a higher level of abstraction we find chord estimation algorithms, which are in rapid development nowadays. Several Automatic Chord Estimation (ACE) methods are evaluated yearly at the Music Information Retrieval Evaluation eXchange (MIREX, http://www.music-ir.org/mirex/wiki/MIREX_HOME) campaign. Our contribution in this field is a study on chord boundary detection algorithms. Identifying the correct chord boundaries is beneficial for the ACE task, since this information can be exploited in order to improve estimation accuracy using feature smoothing methods that can preserve chord bounds and improve the Signal-to-Noise Ratio (SNR) of an audio feature.

The last problem addressed in this dissertation is the Cover Song Identification task. The aim of a Cover Song Identification algorithm is to find different renditions of the same musical piece. This can also be used for detecting copyright infringement (Content-Based Copy Detection). Basically, a song is considered a cover if it exhibits a low "distance" with respect to the original piece. The distance has to be sensitive to the harmonic and melodic content of a song, and not to the musical facets that are considered "noise", such as non-tonal or percussive sounds, the use of different musical instruments (timbre), different tempo and equalization. We propose a heuristic method to combine several distance measures in order to obtain better detection performances. We will see that a combination of different distance measures can be beneficial when a simple combination scheme is used.

In the next two sections, we list our scientific contributions to international conferences and journal papers and we describe the outline of the thesis.

1.4 Contributions

In this section we present the papers on which this dissertation is based.

Journal Papers

i Alessio Degani, Marco Dalai, Riccardo Leonardi and Pierangelo Migliorati, “Comparison of tuning frequency estimation methods”, Multimedia Tools and Applications, to appear (Published online: March 2014)

International Conference Papers

iii Alessio Degani, Riccardo Leonardi, Pierangelo Migliorati and Geoffroy Peeters, “A Pitch Salience Function Derived from Harmonic Frequency Deviations for Polyphonic Music Analysis”, International Conference on Digital Audio Effects, DAFx 2014

iv Alessio Degani, Marco Dalai, Riccardo Leonardi and Pierangelo Migliorati, "Time-Frequency Analysis of Musical Signals Using the Phase Coherence", International Conference on Digital Audio Effects, DAFx 2013

v Alessio Degani, Marco Dalai, Riccardo Leonardi and Pierangelo Migliorati, "A Heuristic for Distance Fusion in Cover Song Identification", International Workshop on Image Analysis for Multimedia Interactive Services, WIAMIS 2013

vi Alessio Degani, Marco Dalai, Riccardo Leonardi and Pierangelo Migliorati, “Real-Time Performance Comparison of Tuning Frequency Estimation Algo- rithms”, International Symposium on Image and Signal Processing and Analysis, ISPA 2013

Submitted

vii Alessio Degani, Marco Dalai, Riccardo Leonardi and Pierangelo Migliorati, "Harmonic Change Detection for Musical Chords Segmentation"

1.5 Outline

This thesis is organized as follows:

• Chapter 2: In this chapter, we give a brief introduction on the elements of music theory that are needed in order to fully understand this dissertation, and we provide the terms and the notation used in this work.

• Chapter 3: This chapter covers the theoretical aspects of the time-frequency analysis designed for the localization of short-term stationary sinusoids. Furthermore we propose a phase-based method for measuring the exact frequency that inherently offers a coherence score in order to estimate a confidence measure for the existence of a given frequency component.

• Chapter 4: In this chapter we investigate the characteristics of different tuning frequency estimation algorithms that are based on a peak-picking procedure. We provide a detailed analysis of the accuracy performances under different conditions and constraints (i.e. real-time estimation). Furthermore, we propose a new dataset of symbolic music that can be used to synthesize reliable audio with a corresponding aligned ground-truth.

• Chapter 5: In this chapter we present a novel approach for the computation of the pitch salience function. Our proposal is intended as a pre-processing step for polyphonic music analysis. The main goal is to measure how salient a pitch is in a mixture of harmonic partials. In other words, our method is designed to distinguish a note's fundamental frequency from its partials. This work was developed in collaboration with the Institut de Recherche et Coordination Acoustique/Musique (IRCAM), Paris, France.

• Chapter 6: In this chapter we investigate how different chroma features (Pitch Class Profile) and different Harmonic Change calculation strategies impact the accuracy performances of a chord segmentation method. Furthermore, we will see that some chroma features are particularly effective in capturing chord changes.

• Chapter 7: In this chapter we deal with the concept of music "similarity", especially in the Cover Song Identification task. A cover song is a new performance or recording of a previously recorded song, with some differences in orchestration, instrumentation, tempo, and other musical facets. It is very difficult to give a formal definition of what a cover song is, and this is a main concern for the task of cover song identification. In the literature there are different strategies to calculate the similarity between two songs, and each one focuses on a particular musical facet. We propose a heuristic method to efficiently combine different similarity (or, conversely, distance) measures that provides an improvement in the detection accuracy of a cover song identification task.

• Chapter 8: In this last chapter we draw the conclusions and give an overview of future perspectives.

Fig. 1.1 depicts a graphical representation of the outline of this thesis. We put the focus on the different abstraction levels of our work. Furthermore, this schematic overview puts each of our published papers in the right context, giving the chapter number and the paper number (in [roman numeral]) as a reference for the reader.

[Figure 1.1 block diagram: Low level (physical quantities, frequency components) – sinusoid localization, Chapter 3 [iv], and reference frequency estimation, Chapter 4 [i, vi]; Mid-level (fine grain: notes) – polyphonic music analysis, Chapter 5 [iii]; Mid-level (coarse grain: chords/harmony) – chord segmentation, Chapter 6 [vii]; High level (macro level: songs) – cover song distance fusion, Chapter 7 [v].]

Figure 1.1: Graphical outline of the dissertation. From low to high level analysis of musical facets. Each section is characterized by the corresponding chapter number and a roman numeral that identifies the related paper as labelled in Section 1.4.

2 Background

“We must see that music theory is not only about music, but about how people process it. To understand any art, we must look below its surface into the psychological details of its creation and absorption” — Marvin Minsky

In this chapter we give a brief overview of the basic music theory and definitions in order to fix the terminology and to avoid ambiguities.

2.1 Elements of Music Theory

Music theory is a collection of rules, definitions and relations that underlie musical compositions. Here, we will not cover music theory in detail, which by itself would require an entire book, but we will focus on the fundamental aspects that rule modern Western music. It is worth noting that the concepts behind music theory are not always strictly applied as they are. For example, the artist can decide to reinterpret some rules and definitions, and that is a very challenging matter from the point of view of a MIR researcher.

As we mentioned, our objective is to give a starting reference in order to understand the several aspects of a MIR algorithm designed to treat musical facets of Western music. The term "Western music" can be misleading and needs some clarification. We classify as Western music all of the musical compositions that belong to the Western culture (European heritage), mainly characterized by the division of the octave into a series of twelve tones, called a chromatic scale, within which the interval between adjacent tones is called a half step or semitone. Non-Western music systems (sometimes called World Music), for example African, Arabic and Indian, to name a few, often make use of multiples of quarter tones; the notation is not as significant as in Western music (and therefore is not standardized) and it is usually passed down from generation to generation through word of mouth. Due to these characteristics, it is more difficult to treat Non-Western music from a MIR point of view.

2.1.1 Pitch and Musical Notes

Pitch is a perceptual phenomenon that allows us to perceive the lowness or highness of a tone. That property makes it possible to arrange the musical notes in an ordered frequency scale. The perceived highness of a musical tone is strictly related to the note's fundamental frequency, but it is not the same thing. The frequency is a physical property of a note's sound, while the pitch is a subjective auditory sensation induced by the complex interaction of objective quantities such as the combination of frequencies (harmonic content) of a musical tone and its amplitude [72]. Since the pitch is a subjective quantity, it is common practice to approximate the pitch sensation with the measurable fundamental frequency f0, and therefore these two terms are considered synonyms in the scope of this dissertation.

Musical notes are a set of labelled pitches, and the difference in frequency between two notes is called an interval. The most important interval is the octave, which corresponds to doubling (or halving) the frequency. The octave is subdivided into 12 notes (the chromatic scale). Each of these 12 tones is labelled using letters from A to G and modifiers such as ♭ (flat) and ♯ (sharp) that uniquely identify the so-called pitch class. The corresponding octave is identified using a number; for example, the note named A4 is the note A in the 4th octave. Note names associated to some octaves of the piano are depicted in Fig. 2.1.

Figure 2.1: Piano keys and their note names.

The basic interval between adjacent notes is called semitone (or half tone). The mapping between note names and note frequencies is ruled by the tuning system, or temperament, described in the following section.

2.1.2 Tuning and Temperaments

The note names are a sort of abstraction with respect to the actual frequencies of the pitches. A note name without a specification of the tuning system used is not automatically associated to a specific frequency. Basically speaking, a tuning system is composed of two elements: a reference frequency and a temperament, that is, a set of ratios that constitute the intervals between the 12 notes of the Western chromatic scale and the reference frequency f_ref. Although there are many temperaments, such as the Pythagorean tuning and so on, the most commonly used tuning system in Western music is the equal temperament.

The equal temperament divides an octave into 12 equal-ratio intervals, so that the frequency ratio between adjacent notes is 2^{1/12}. This ratio defines the frequency spacing between semitones, imitating the human perception of musical intervals, which is approximately logarithmic in the frequency space. For example, for a given note frequency, the adjacent note frequency, one semitone apart, is calculated by multiplying (or dividing) the actual note frequency by 2^{1/12}. Starting from f_ref it is possible to calculate the frequencies of all notes by iteratively multiplying (or dividing) by 2^{1/12}.

A more convenient way to represent the pitch space is the logarithmic-frequency cents scale, in which the notes are linearly spaced by an interval of 100 cents (the width of a semitone in the cents scale). In the cents scale, the in-tune notes correspond to an integer number that measures the distance (in cents) from the reference frequency f_ref. The note index in the cents scale is defined as

    c = 1200 \cdot \log_2\!\left(\frac{f}{f_{\mathrm{ref}}}\right),    (2.1)

where f is the note frequency in Hertz and f_ref is the reference frequency, also in Hertz. With that system, an octave interval (doubling the frequency) corresponds to 1200 cents.
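A direct implementation of eq. (2.1) could look as follows. This is only a minimal sketch; the function name frequency_to_cents is ours and not part of any standard library:

```python
import math

def frequency_to_cents(f, f_ref=440.0):
    """Map a frequency in Hz to the cents scale of eq. (2.1).

    In-tune equal-tempered notes fall on integer multiples of 100 cents;
    an octave above f_ref corresponds to +1200 cents."""
    return 1200.0 * math.log2(f / f_ref)

# A4 itself maps to 0 cents, A5 to 1200 cents,
# and a note one semitone above A4 to roughly 100 cents.
print(frequency_to_cents(440.0))                  # 0.0
print(frequency_to_cents(880.0))                  # 1200.0
print(frequency_to_cents(440.0 * 2 ** (1 / 12)))  # ~100.0
```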

The reference frequency f_ref plays a fundamental role in the frequency-to-note mapping. In common Western music this value is often set to a widely accepted standard, that is f_ref = 440 Hz. This is the frequency of the note A4, the A at the 4th (central) octave. The reference frequency is also referred to as concert pitch or tuning frequency. The estimation of f_ref from the audio data is a crucial step for some MIR algorithms and is discussed in Chapter 4. Another method for indexing the notes of the equal tempered scale is the MIDI Tuning Standard (MTS). Since the cents value c in eq. (2.1) can assume negative values, the MTS, described in the following section, uses an equation similar to (2.1) to map the frequency value to a positive integer in the range [21, ..., 108] that corresponds to the keys of a standard piano, from the note A0 to C8 (8 octave span). To be precise, the MIDI tuning standard can use note numbers in the range [0, ..., 127] but, in general, the useful note range is from 21 to 108.

2.1.3 MIDI Tuning Standard

The Musical Instrument Digital Interface (MIDI) protocol is a standard serial protocol that allows digital instruments, either software (virtual instruments) or hardware, to intercommunicate in real time. With the MIDI protocol it is possible to send control messages to a digital audio device, such as volume control, program change and audio parameters. Furthermore, it is possible to send messages that tell a music synthesizer to play a specified note. Without going into details, the MIDI message NOTE ON indexes the notes with an integer MIDI note number m ∈ [0, ..., 127]. The MIDI note number is calculated from the note frequency f as

    m = 69 + 12 \cdot \log_2\!\left(\frac{f}{440}\right).    (2.2)

Here, the reference frequency is implicitly set to f_ref = 440 Hz and 69 is an offset that makes m a positive value (m = 69 corresponds to the piano central A, or A4). If the note frequency f is tuned with the standard tuning frequency and the equal temperament, the MIDI note number m is an integer value. Without some tricks it is impossible for the MIDI protocol to use different temperaments or tuning frequencies, since the low-level protocol transmits the note number as an unsigned 7-bit integer. However, it is possible to transmit the "exact frequency" of a note using the Frequency Data Format, which permits transmitting the note frequency as a fraction of a semitone with a granularity of 0.0061 cents. For further details refer to http://www.midi.org.
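Eq. (2.2) and its inverse can be sketched in a few lines (the helper names below are ours; in practice m is often rounded to the nearest integer to obtain the note number, keeping the residual as a detuning in cents):

```python
import math

def frequency_to_midi(f):
    """MIDI note number of eq. (2.2); 440 Hz maps to 69 (A4)."""
    return 69 + 12 * math.log2(f / 440.0)

def midi_to_frequency(m):
    """Inverse mapping: equal-tempered frequency of MIDI note m."""
    return 440.0 * 2 ** ((m - 69) / 12)

print(frequency_to_midi(440.0))   # 69.0
print(midi_to_frequency(60))      # ~261.63 Hz (C4, the central C)
```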

The MIDI note format is a useful tool for MIR algorithms such as melody or pitch/multi-pitch estimation, since it offers a common interchange format (the standard MIDI file, smf) that also permits visualizing the output as a music score, which is supported in almost all score editor/synthesizer software.

2.1.4 Musical Scales

Notes can be arranged in a variety of scales. A scale is created by selecting a subset of tones (usually seven, but other combinations are possible) from the 12 tones of the chromatic scale, arranged in patterns of semitones and whole tones (two semitones). The first tone of the scale is called the key, and the intervals (called degrees and denoted with roman numerals or with specific adjectives) between the key and the other notes of the scale identify the mode of the scale and thus the note selection rule. The most commonly encountered scale modes in Western music are the Major scale and the Minor scale, depicted in Fig. 2.2. In traditional Western notation, the scale used for a composition is usually indicated by a key signature at the beginning to designate the pitches that make up that scale. Music can be transposed from one scale to another for various purposes. Transposition is a positive or negative shift of the overall pitch range that changes the key, but preserves the intervallic relationships (the mode) of the original scale.

Figure 2.2: Examples of musical scales in Western music notation ((a) C Major scale; (b) C Minor scale). The labels above the note symbols are the note names and the labels below are the degrees of the notes in the scale. At the beginning of the staff there are the key and time signatures.

2.1.5 Musical Chords, Melody and Harmony

A group of 3 or more notes that sound simultaneously is called a chord. The notes that form a chord are usually (but not mandatorily) taken from the subset of pitches that identifies the scale of the musical piece. The chord name depends on the pitches that compose the chord and, like the scale name, is formed by a root name (the primary tone) and a specification of the type of the chord. The type of a chord, just like the name of a scale, depends on the intervallic relations between the root and the other notes of the chord. The most used types of chords in Western music are Major, Minor, Seventh, Suspended, Augmented and Diminished. Some examples are shown in Fig. 2.3.

Figure 2.3: Examples of musical chords. C is C Major, Cm is C Minor, G7 is G Seventh and Asus4 is A Suspended fourth.

The melody is a series of tones that sound in succession. The notes of the melody are usually (but not mandatorily) taken from the subset of pitches of the scale of the actual musical piece. The basic elements of melody are pitch, duration of the notes, rhythm patterns, and tempo (speed, measured in Beats Per Minute, BPM). See Fig. 2.4 for an example of melody in Western notation.

Figure 2.4: An excerpt of "Polonaise in G minor" by J. S. Bach.

The relations that simultaneously occur between the melody notes and the chords are referred to as harmony. Harmony is a key point for MIR tasks such as chord or key estimation, because exploiting the harmonic relationships that underlie a musical piece can improve the accuracy of MIR algorithms and also reduce their computational cost. It is clear that the knowledge of the musical scale that underlies a musical composition can reduce the subset of "valid" chords for that given song.

2.1.6 Timbre and Dynamics

Timbre, also called "color", is the principal characteristic that allows us to distinguish one musical instrument from another when both play at the same pitch and volume. Timbre is mainly defined by the relative balance of overtones (harmonic partial frequencies) produced by a given instrument, and the temporal envelope of the sound, including changes in the overtone structure over time. In Fig. 2.5, the spectral envelope of a piano sound is compared to the trumpet spectral envelope.

Figure 2.5: Difference in the amplitude spectrum of the C (523.25 Hz) note played by two different instruments ((a) C note on a piano; (b) C note on a trumpet). Note that the note's fundamental frequency f0 = 523.25 Hz is the highest in the piano spectrum, while in the trumpet sound the second harmonic is the highest.

Dynamics in music normally refers to variations of intensity or volume in the musical composition. In music notation the dynamics are treated as relative quantities, usually identified by Italian words like forte (f), fortissimo (ff), piano (p), pianissimo (pp), crescendo (<), ... and so on. From a digital signal processing point of view, the dynamic is an absolute measure and is expressed in decibels (dB). Both dynamics and timbre need to be adequately treated by an MIR algorithm. For example, a multi-pitch transcription method must, ideally, work consistently regardless of timbre (the instrument used) and dynamics (same recognition accuracy for both loud and soft notes).

3 Phase-based sinusoids localization

In this chapter we show a technique based on the phase evolution of the Short Time Fourier Transform (STFT) for increasing the spectral resolution in the time-frequency analysis of a musical signal. It is well known that the phase information of the STFT coefficients brings important information on the spectral components of the analysed signal. This property has already been exploited in different ways to improve the accuracy in the estimation of the frequency of a single component. We propose a different approach, where all the coefficients of the STFT are used jointly to build a measure of how likely all the frequency components are, in terms of their phase coherence evaluated in consecutive analysis windows. In more detail, we construct a phase coherence function which is then integrated with the usual amplitude spectrum to obtain a refined description of the spectral components of an audio signal.

Time-frequency analysis is a central tool in most of the applications of audio/music signal processing, Music Information Retrieval algorithms [11] and audio coding systems. The most commonly used tool for this purpose is the Short Time Fourier Transform (STFT) [6], which is the non-stationary counterpart of the Discrete Fourier Transform (DFT). The STFT decomposes the discrete signal in partially overlapping frames, and it expands each of these frames in the discrete Fourier basis [64]. It thus provides a time-varying discrete-frequency content description of the signal. Usually, only the amplitude spectrum of the STFT is taken into account. In some applications which require both good frequency accuracy and good time localization, this may not suffice. To overcome this issue, some additional processing may be required to increase the frequency resolution (see [62]) or to add a-priori information (see [58]). One possible approach consists in using the phase of the STFT to improve the frequency resolution. Some works in the literature propose specific techniques to refine the frequency estimation by using the phase evolution of STFT coefficients [63, 25, 34]. These methods, however, operate on a coefficient-wise basis to improve the frequency estimation of a single sinusoid in a local frequency interval. In particular, they do not allow a global exploitation of the full phase spectrum evolution to blindly enhance the frequency analysis over the entire range. This is instead the aim of other approaches based on the reassignment method first proposed in [32]. This technique corrects the information contained in the spectrogram by "moving" the energy in the time-frequency plane according to phase information (see [2] for more details).

Here, we propose a different approach where phase information is used to assign a "coherence score" to spectral amplitude components. As for other works based on the phase evolution of the STFT, the underlying idea goes back to Flanagan and Golden [19] (see [44, 33] for recent advances). Here, however, we propose a technique for combining the phase evolution of different coefficients in order to obtain a function X_m(f), that we call Phase Coherence Function (PCF), which measures the likelihood of the presence of a sinusoidal component at the unquantized frequency f at time instant m. The function X_m(f) is computed using only the phase information, and we then combine it with the STFT amplitude spectrum to obtain a refined spectral analysis of the signal. The main difference between our method and other available techniques is that our method does not try to move the components from one frequency to another, but rather assigns coherence scores to components. In particular, the function X_m(f) takes on positive (respectively, negative) values for those f that are likely (respectively, unlikely) to be present in the signal according to the phase evolution of nearby coefficients of the spectrogram. The "coherence score", furthermore, is computed in a way which inherently takes into account the issue of phase unwrapping, which often constitutes a problem in many of the methods mentioned above.

This chapter is structured as follows. In Section 3.1 we give the basic notions on the STFT and we introduce the key idea of the phase coherence. In Section 3.2 we show how to combine the information given by different coefficients to obtain the PCF. In Section 3.3 we present the experimental results and we discuss the possible applications of our technique.

3.1 Time-Frequency analysis

3.1.1 Short Time Fourier Transform

The classic time-frequency analysis is performed using the STFT. The N-term STFT, at time frame m, of a discrete signal x[n] is defined as

    X_{m,k} = \sum_{n=0}^{N-1} x[n + \tau m] \cdot w[n] \cdot e^{-j 2\pi \frac{k}{N} n},    (3.1)

where k = −N/2+1, ..., N/2, τ is the hop size (in samples) between two subsequent frames and w[n] is the windowing function. The STFT is a complex-valued function which can be equivalently described in terms of its amplitude |X_{m,k}| and its phase

    \Phi_{m,k} = \angle X_{m,k}.    (3.2)

If x[n] is sampled at frequency F_s, the frequency resolution of the STFT is given [71] by the expression

    \Delta_k = \frac{F_s}{N},    (3.3)

which can be seen as the width of the frequency interval associated to each coefficient X_{m,k}, ignoring the windowing effects, and thus it is related to the STFT accuracy in positioning the spectral components of the signal. We assume, where not otherwise specified, F_s = 22050 Hz, N = 4096 without zero-padding, τ = 1024 samples, and w[n] the Hanning analysis window. We will see that, under certain hypotheses on the structure of the analysed signal, we can partially overcome this limit. If we are interested in the detection of the frequency location of the short-term sinusoidal components in an audio signal, we can exploit the phase evolution of two consecutive frames of the STFT to increase the frequency resolution of a time-frequency representation. This basic idea is shared by all the frequency estimation methods that use the phase spectrum.
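For illustration, eq. (3.1) with the analysis parameters stated above can be sketched as follows. This is a minimal NumPy implementation, not the exact code used in this work; note that np.fft.fft returns the bins ordered k = 0, ..., N−1, with the negative frequencies of eq. (3.1) in the upper half of the array:

```python
import numpy as np

def stft(x, N=4096, hop=1024):
    """Compute X[m, k] of eq. (3.1) for a real signal x.

    Returns a (num_frames, N) complex array; the amplitude spectrum is
    np.abs(X) and the phase of eq. (3.2) is np.angle(X)."""
    w = np.hanning(N)                       # analysis window w[n]
    num_frames = 1 + (len(x) - N) // hop
    X = np.empty((num_frames, N), dtype=complex)
    for m in range(num_frames):
        frame = x[m * hop : m * hop + N] * w
        X[m] = np.fft.fft(frame)
    return X

Fs = 22050
t = np.arange(2 * Fs) / Fs
x = np.sin(2 * np.pi * 440 * t)             # two seconds of a 440 Hz test tone
X = stft(x)
Phi = np.angle(X)                           # phase spectrum, eq. (3.2)
```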

3.1.2 Phase evolution of the STFT

In this section, we introduce the principle at the base of the coherence measure that will be described in the next section. The key point is in the phase evolution of the STFT coefficients of a pure sinusoidal signal. For the sake of simplicity, we consider complex exponential functions; the effect on real signals will then be intuitively derived from this analysis. Consider then a signal x[n] assumed to be a sampled version of a signal x(t) = e^{j 2\pi F_0 t} at sampling frequency F_s, that is

    x[n] = e^{j 2\pi f_0 n},    (3.4)

where f_0 = F_0 / F_s is the normalized frequency. In the continuous domain, if we let X(F) = \mathcal{F}\{x(t)\}(F) be the Fourier transform of x(t), we know that

    \mathcal{F}\{x(t + t_0)\}(F) = e^{j 2\pi F t_0} \, \mathcal{F}\{x(t)\}(F).    (3.5)

Since our signal is a pure exponential, however, its Fourier transform is a Dirac delta function and thus we may as well write

    \mathcal{F}\{x(t + t_0)\}(F) = e^{j 2\pi F_0 t_0} \, \mathcal{F}\{x(t)\}(F).    (3.6)

Since the STFT is a sliding window discrete version of the Fourier transform, intuition suggests that its coefficients in (3.1) evolve for varying m ruled by this property of the Fourier transform. Since a pure sinusoidal function, in general, affects different coefficients due to the windowing effect, one may be induced to expect eq. (3.5) to hold rather than eq. (3.6). This is not the case, however, since it is easily checked that

    X_{m,k} = \sum_{n=0}^{N-1} e^{j 2\pi f_0 (n + \tau m)} \, w[n] \cdot e^{-j 2\pi \frac{k}{N} n}    (3.7)
            = e^{j 2\pi f_0 \tau m} \sum_{n=0}^{N-1} e^{j 2\pi f_0 n} \, w[n] \cdot e^{-j 2\pi \frac{k}{N} n}    (3.8)
            = e^{j 2\pi f_0 \tau m} \, X_{0,k}.    (3.9)

That is, the k-th coefficient evolves as a complex exponential function of m with frequency f0, regardless of the value of k.

This relation holds exactly for all k for a complex exponential signal x[n]. For a real sinusoidal function x[n] = cos(2π f0 n), one can use Euler's relation to write x[n] as a composition of exponential functions with frequencies ±f0. It is then seen that, if the windowing function w[n] has a discrete transform that decreases sufficiently fast in k, the above analysis shows that the k-th coefficient evolves as a complex exponential in m with frequency ±f0 if k/N is sufficiently close to ±f0. Since we always work with real signals in the audio frequency range, we may only consider the positive frequency f0.

So, in the presence of a pure tone at frequency f0, one should expect the coefficients with k/N in the neighbourhood of f0 to evolve as exponential functions of m with frequency f0. It is this property that can be used to measure the likelihood of having a sinusoidal component at some given frequency by only considering the phase evolution of the STFT coefficients. In the next section, we propose a method to jointly perform such an analysis over different coefficients to "test" different possible frequency components.
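The property of eq. (3.9) can be verified numerically for a complex exponential (a small sketch under the analysis parameters of Section 3.1.1; the frame_fft helper and the choice of inspecting only the bins around k0 = round(N · f0) are ours):

```python
import numpy as np

Fs, N, hop = 22050, 4096, 1024
f0 = 440 / Fs                                  # normalized frequency
n = np.arange(Fs)                              # one second of signal
x = np.exp(1j * 2 * np.pi * f0 * n)            # complex exponential of eq. (3.4)
w = np.hanning(N)

def frame_fft(m):
    """STFT coefficients of frame m (eq. (3.1) computed via the FFT)."""
    return np.fft.fft(x[m * hop : m * hop + N] * w)

m = 3
k0 = int(round(N * f0))
bins = np.arange(k0 - 2, k0 + 3)               # a few bins around the tone
predicted = np.exp(1j * 2 * np.pi * f0 * hop * m) * frame_fft(0)[bins]   # eq. (3.9)
measured = frame_fft(m)[bins]
print(np.max(np.abs(predicted - measured)))    # ~0, up to numerical precision
```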

3.2 Phase coherence measure

3.2.1 Coherence measure

The analysis in the previous section shows that in the presence of a pure sinusoid we can predict Xm,k from X0,k according to (3.9). In practice, we will only need to consider two adjacent frames in the STFT and in this case we can say that, given Xm,k and τ, we can write the one step forward prediction for Xm+1,k as

    \hat{X}_{m+1,k} = X_{m,k} \cdot e^{j 2\pi f_0 \tau}.    (3.10)

Equation (3.10) gives the ideal evolution of the k-th coefficient when the signal is a pure exponential at frequency f0. Since we do not know f0 but rather measure the true coefficient phases, we can say that a frequency f0 is compatible with the measured phase evolution if

    \Phi_{m+1,k} - \Phi_{m,k} - 2\pi f_0 \tau = 0 \mod 2\pi.    (3.11)

This equation is often used, with a fixed value of k, as a way to extract the value of f0 from the knowledge of the other terms. This approach has however some problems. First, the fact that the equation only holds (mod 2π) leads to the problem of phase unwrapping. That is, the frequency f0 is not certain even in the ideal case, since it can only be determined up to multiples of 1/τ. Second, we should not expect to have exact equality in (3.11), since our real signal will not be an ideal sinusoid, but a combination of sinusoidal components usually affected by noise. Hence, when considering (3.11) for different k values, different noisy estimates for f0 are obtained. Here, we suggest a different approach that does not try to estimate one single f0 from eq. (3.11), but rather uses that equation to test whether a frequency f is compatible with the phase evolution of the coefficient k. In our test, moreover, we choose to adopt a "soft" approach, defining a coherence measure that is a function of the three variables m, k and f. More precisely, setting ∆Φ_{m,k} = Φ_{m+1,k} − Φ_{m,k}, we define a coherence measure as given by the expression

    C_{m,k}(f) = \cos(\Delta\Phi_{m,k} - 2\pi f \tau) \in [-1, 1].    (3.12)

It is easy to see that for all f that exhibit coherent phase evolution we have

Cm,k(f) = 1. Conversely, Cm,k(f) = −1 for all those f for which ∆Φm,k −2πfτ = π mod (2π), which means that we have phase opposition between the predicted coefficient and the measured one. It may be useful to note here that we can rewrite (3.12) in terms of the cross-spectral components as follows

    C_{m,k}(f) = \Re\!\left\{ \frac{X^{*}_{m,k}}{|X_{m,k}|} \cdot \frac{X_{m+1,k}}{|X_{m+1,k}|} \cdot e^{-j 2\pi f \tau} \right\},    (3.13)

where \Re\{\cdot\} denotes the real part and (\cdot)^{*} the complex conjugate. Thus, our coherence function is a measure of the contribution given by the frequency bin k to the "modified" cross-correlation between the two frames of the signal, according to the fact that equation (3.6) holds in place of the usual (3.5).

For the sake of simplicity we look at the coherence measure for a fixed m = m0 and k = k0. From (3.12), we know that for a given m and k, C_{m,k}(f) is a sinusoidal function of the variable f with a "frequency" τ, since ∆Φ_{m0,k0} = ∆Φ is a constant value. This function has local maxima f_M at

    f_M(n) = \frac{n}{\tau} + \frac{\Delta\Phi}{2\pi\tau}, \qquad \forall n \in \mathbb{Z}.    (3.14)

If we consider the maximum obtained for n = 0, we find that N · f_M(0) gives an instantaneous frequency in the range [k0 − 1/2, k0 + 1/2] as defined in [19]. This is the frequency which is usually selected by other methods based on the phase evolution of the STFT. Here, however, we will not select this frequency a-priori; we will instead use the whole function C_{m,k}(f) over different values of k to test a generic f value. Considering a test signal defined as x_t[n] = sin(2π (440/F_s) n), the phase coherence measure of x_t[n] at the time frame m = m0 is shown in Fig. 3.1. We can easily see that at the top of the graph there is a group of bins k in which C_{m0,k}(f) shares the same phase, which suggests, as seen in Section 3.1.2, that the neighbouring coefficients of the STFT evolve with an identical phase difference due to the presence of a stationary tone. In this "coherence band", the local maxima of C_{m0,k}(f) are located at f = 440/F_s.
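The coherence measure of eq. (3.12) only requires the phase spectra of two adjacent frames, and is straightforward to compute for every bin k at once. The sketch below is illustrative; the variable and function names are ours:

```python
import numpy as np

def coherence_measure(Phi_m, Phi_m1, f, hop=1024):
    """C_{m,k}(f) of eq. (3.12) for every bin k, for one candidate
    normalized frequency f (f = F / Fs, with F in Hz).

    Phi_m and Phi_m1 are the phase spectra of frames m and m+1."""
    delta_phi = Phi_m1 - Phi_m                       # ∆Φ_{m,k}
    return np.cos(delta_phi - 2 * np.pi * f * hop)   # values in [-1, 1]

# Example usage (reusing the STFT X of the sketch in Section 3.1.1):
# Fs, m0 = 22050, 10
# C = coherence_measure(np.angle(X[m0]), np.angle(X[m0 + 1]), 440 / Fs)
```

The cosine form makes explicit phase unwrapping unnecessary: any 2π ambiguity in the measured phase difference leaves the value of C_{m,k}(f) unchanged.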

3.2.2 Coherence function

Our aim is to refine the amplitude spectrum using the coherence measure in order to obtain a better resolution in the localization of sinusoidal components.

Intuitively, not all the values of C_{m0,k}(f) are of practical interest. More precisely, for a given f = f0, it is clear that only the neighbouring values of the discrete frequency k0 = N · f0 give a reliable phase coherence measure. The spreading of the frequency components due to the windowing effects of the STFT, together with the propagation of the phase coefficients described in Section 3.1.2, suggests in fact that, in the presence of a component at frequency f0, the coefficients in the neighbourhood of N · f0 evolve according to f0. Far from this region, instead, the coefficients evolve independently of this component.

Figure 3.1: C_{m0,k}(f) for x_t[n] at the m0-th time frame. The black solid line is the reference for f = 440/F_s.

We automatically consider only the relevant coefficients by choosing a weighting procedure that uses the amplitude spectrum of the analysis window to weight the coefficients of the phase coherence measure around a specified k0. The weighting coefficients are calculated as the unity-energy amplitude spectrum of the analysis window modulated at the normalised frequency f, as follows:

    W_k(f) = \frac{\mathrm{DFT}^{N}_{k}\!\left[ w[n] \cdot e^{j 2\pi f n} \right]}{\sqrt{\sum_{n=0}^{N-1} |w[n]|^{2}}}, \qquad \forall f \in [0, \tfrac{1}{2}],    (3.15)

where DFT^{N}_{k}[\cdot] is the Discrete Fourier Transform using N samples. Now we can define the Phase Coherence Function (PCF) as a weighted sum of the coherence measure:

    X_m(f) = \sum_{k} W_k(f) \cdot C_{m,k}(f), \qquad \forall f \in [0, \tfrac{1}{2}].    (3.16)

The PCF is a phase coherence indicator between time frames m and m + 1 for all f. Fig. 3.2 shows the PCF around f · F_s = 440 Hz, for the test signal x_t[n] previously defined, at a given time frame m = m0.

By looking at Fig. 3.2, we can notice high values of coherence also at f · F_s ≠

440 Hz. This is due to the periodicity of C_{m_0,k_0}(f); two local maxima are obtained

Figure 3.2: Plot of X_{m_0}(f) for x_t[n] at the m_0-th time frame. The frequency axis shows the non-normalised frequencies in Hz.

near f · F_s = 440 Hz, at a distance F_s/τ, since C_{m_0,k_0}(f) has period 1/τ. However, the two local maxima show different amplitudes, since the weighting coefficients W_k(f) change there with respect to the true central frequency. The obtained function gives a measure of the likelihood that the signal contains a pure sinusoidal component at each frequency f by only considering the phase of the STFT. Taking one step further, a more useful representation of the signal is obtained by combining this phase information with the amplitude spectrum. We define the Phase Coherence Function Weighted Modulus (PCFWM) as

\bar{X}_m(f) = \sum_{k} |X_{m,k}| \cdot W_k(f) \cdot C_{m,k}(f), \qquad \forall f \in [0, \tfrac{1}{2}].    (3.17)

The PCFWM is an amplitude spectrum-like representation with improved localization of the spectral peaks of pure tones. However, it is not strictly an amplitude spectrum, because negative values of the PCFWM may occur. A negative phase coherence is measured when the phase difference between two consecutive time frames approaches ±π.
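Building on the previous sketch, the PCFWM of (3.17) only adds the amplitude weighting; a hedged illustration:

    import numpy as np

    def pcfwm(X_m, X_m1, w, f, tau):
        # PCFWM of (3.17): the coherence measure weighted by W_k(f) and by |X_{m,k}|
        phasor = (np.conj(X_m) / np.abs(X_m)) * (X_m1 / np.abs(X_m1))
        C = np.real(phasor * np.exp(-2j * np.pi * f * tau))
        return np.sum(np.abs(X_m) * window_weights(w, f) * C)   # window_weights from the previous sketch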

The advantages of this combination can be seen in Fig. 3.3. The secondary peaks in the coherence function are strongly attenuated by the amplitude, since no spectral energy is present there. Furthermore, the negative peaks of the coherence function fall in a region where the spectrum does take relevant values. Those negative values of the phase coherence indicate that the energy contained in these coefficients of the STFT is in some sense "spurious", being due to pure components at nearby frequencies.

Figure 3.3: Amplitude spectrum (solid line) and \bar{X}_{m_0}(f) (dashed line) of x_t[n]. Plots are scaled to unit amplitude.

3.3 Results and Applications

Our method finds its main application in signal processing tasks that require blind but accurate frequency localization of the pure sinusoidal components of a signal. As shown in Fig. 3.3, in the presence of a single component our method leads to a sharper lobe in the frequency analysis, and it allows for a more precise estimation of the peak position, if desired, ensuring that the results of [63, 25, 34] are recovered.

Figure 3.4: Amplitude spectrum (solid line) and \bar{X}_{m_0}(f) (dashed line) of x_d[n] = sin[2π(440/F_s)n] + sin[2π(450/F_s)n]. Plots are scaled to unit amplitude.

Fig. 3.4 then shows the advantage when more than one component is present. Here, two pure sinusoids with very close frequencies are analysed. Our method does not assume any a-priori knowledge of the number of components. Figures 3.5 and 3.6 show the effect of the parameter τ. Here, we used a noisy synthetic sound with three sinusoids at frequencies 440 Hz, 445 Hz and 450 Hz. In these figures, the amplitude spectrum, the interpolated amplitude spectrum and the PCFWM are compared. The interpolated amplitude spectrum is calculated using zero-padding during the FFT in order to obtain the same frequency resolution as the PCFWM. It is important to keep in mind that a negative value of the PCFWM at a frequency f indicates that a pure sinusoidal component at that frequency is very unlikely to be present in the signal. Hence, the alternation of large positive and negative peaks allows us to give a sharp estimation of the true peak positions. Due to the coherence measure adopted according to equation (3.12), the value of τ determines how fast these positive and negative peaks alternate. This, however, also impacts the number of peaks with a large positive value that are generated around each single sinusoidal component in the signal. Figures 3.5 and 3.6 show this trade-off. The parameter τ sets a trade-off between how narrow the peaks in \bar{X}_{m_0}(f) are and the number of "false positive" peaks. In these examples τ = 1024 samples (50% of overlap) is a good compromise between resolution and "false" peak detection.

3.4 Conclusions

In this chapter we have introduced a novel technique that combines the amplitude spectrum and the phase coherence measure in order to refine the time-frequency representation of musical signals. We have demonstrated how this method can improve the frequency localization of short-term stationary sinusoids in an audio signal. Since musical signals are composed primarily of notes and tones, our technique brings benefits to the time-frequency analysis of this kind of signals, when accurate frequency measures are needed.

Furthermore, this technique can be employed in place of the standard spectral peak picking procedure used for calculating higher-level features such as the Harmonic Pitch Class Profile (for further details refer to Section 6.2.1.2), or for the reference frequency estimation task that is explained in detail in the next chapter.

Figure 3.5: Comparison of the amplitude spectrum, the interpolated (zero-padding) spectrum and \bar{X}_{m_0}(f) for different τ ((a) τ = 1536, (b) τ = 1024, (c) τ = 512, (d) τ = 256 samples), calculated on a signal with three sinusoids at frequencies 440 Hz, 445 Hz and 450 Hz plus noise with SNR = 10 dB. In this example N = 2048 and F_s = 5513 Hz.

Figure 3.6: Comparison of the amplitude spectrum, the interpolated (zero-padding) spectrum and \bar{X}_{m_0}(f) for different τ ((a) τ = 1536, (b) τ = 1024, (c) τ = 512, (d) τ = 256 samples), calculated on a signal with three sinusoids at frequencies 440 Hz, 445 Hz and 450 Hz plus noise with SNR = −10 dB. In this example N = 2048 and F_s = 5513 Hz.

4 Tuning Frequency Estimation

In this chapter, a comparison of different algorithms for concert pitch (i.e., tuning frequency or reference frequency) estimation is presented and discussed. The unavailability of ground-truth datasets makes this evaluation on real music recordings less trivial than it may initially appear. Hence, we use two datasets, one of real music (covers80, provided by LabROSA) and one of synthesized music (MS2012, constructed by the authors). The algorithms have been compared in terms of speed of convergence and stability of the estimated value over an increasing length of the analysed signal. A local tuning frequency estimation was also performed in order to compare the ability of the algorithms to follow the local variations of the reference frequency in a real-time environment. Moreover, an analysis of the execution time has been provided. While the various algorithms perform comparably in terms of asymptotic precision, they show a quite different behaviour in terms of speed of convergence and local tuning frequency estimation accuracy.

In several Music Information Retrieval (MIR) tasks such as melody extraction, key or chord estimation and other pitch-based feature extraction processes, a quantization of frequency values to the equal-tempered scale is performed. Usually, the equal-tempered scale is expressed in equally spaced integer values called cents, where an equal-tempered semitone interval corresponds to 100 cents. In order to compute the exact frequency of the note scale, the correct tuning frequency must be known or estimated from the data. For this reason, the task of concert pitch estimation finds its main application as a pre-processing step of chord transcription algorithms [28, 41], musical key estimation [50, 51], pitch/multi-pitch or melody extraction [35], or in order to monitor the pitch drift of an a cappella choir. Another possible application of a tuning frequency estimation method is audio restoration, where a pitch drift occurs due to the non-constant rotation speed of a turntable or a reel-to-reel player. In all of these tasks, a reliable tuning frequency estimation algorithm is needed in order to avoid accuracy degradation when a different reference frequency is used. Besides that, different estimation algorithms have been proposed, but no evaluation has been conducted in order to establish which method is well suited for a given algorithm or constraint.

Although the frequency-to-cent mapping is in many cases supposed to be trivial, the precision of this task relies on the correct estimation of the tuning (reference) frequency f_ref. This conversion is in fact computed according to the equation

c = 1200 \cdot \log_2\left(\frac{f}{f_{ref}}\right),    (4.1)

where f is the frequency value in Hz, and c is the corresponding value in cents.

In many cases, a 440 Hz frequency for the reference (A4) pitch is chosen a priori. This assumption is justified by the fact that this tuning frequency is internationally considered as a standard [30]. However, for timbre preferences or due to instrumentation issues, the actual A4 pitch may be different from 440 Hz. In this case, a wrong assumption or a wrong estimation of the concert pitch might affect the overall performance of a Music Information Retrieval system. Moreover, it has been proved that an estimation of the reference frequency is beneficial for the accuracy of various MIR algorithms [31, 36, 9]. The estimation of the tuning frequency usually constitutes a pre-processing block (or post-processing in some cases) in the work-flow of larger systems, and no specific evaluation of this block is available in the literature, to the best of the authors' knowledge.

Furthermore, one might think that the assumption that the analysed musical piece is played using the equal temperament is too strong. Concerning this, the author of [36] states that when an alternative temperament is used, while equal temperament is assumed, the results are only slightly less accurate. In [10], where the goal is to estimate the used temperament, the reference frequency is calculated assuming equal temperament and the results of tuning estimation agree with the ground truth values.

Another issue that may affect the performance of a tuning estimation is the deviation of the tone's partials from the equal-tempered scale. This is an intrinsic characteristic of the harmonic series of the overtones. Assuming that the fundamental frequency is perfectly in tune, we can easily check [24] that, for example, the 5th and the 7th harmonic have a detuning factor of, respectively, −13.69 and −31.17 cents. However, this is not a problem for a tuning estimation algorithm, because these detuned partials appear significantly less frequently than the in-tune partials [24] and, usually, the detuned partials become negligible in terms of their energy.

The deviation of the partials can also be caused by inharmonicity, a physical phenomenon of stringed instruments, in which the partials are not integer multiples of the fundamental frequency. However, in [69] it is shown that taking into account the inharmonicity coefficient during the estimation of the reference frequency does not improve the overall performance of the polyphonic pitch transcription method described in that paper.

We have studied how different algorithms perform regardless of the structure of the analysed audio signal. In more detail, we have compared three methods of tuning frequency estimation. In Section 4.1 we provide a brief introduction of the considered algorithms. In Section 4.2 we explain the evaluation and comparison procedure, whereas in Section 4.3 we describe the used datasets. Results and concluding remarks are given in Section 4.4.

4.1 Tuning Frequency Estimation Methods

We perform an evaluation of a specific class of tuning frequency estimation methods that share a common pre-processing step. While other methods are presented in the literature (for example, [36]), we focus our attention on those methods which require a sequence of spectral peaks as an input. The spectral peaks can be calculated from the spectrum (windowed Fast Fourier Transform) in several ways [11]. For our evaluation purposes, we have used the peak picking algorithm developed in the context of the Sinusoidal Modelling Synthesis framework (SMS) [1] and presented in [62]. We give some details of this algorithm in Section 5.1.4. An estimation of each frequency location and peak amplitude is calculated by fitting the peaks in the discrete spectrum with a parabola, and using the vertex of the parabola as an estimation of the position of the true non-quantized peak. The output of this peak picking process is the common starting point shared by all of the three analysed tuning frequency estimation algorithms. All of the analysed audio pieces are mono wave files sampled at 22050 Hz. The Fast Fourier Transform (FFT) is calculated over a window of 8192 samples with an overlap of 75%, which leads to a hop size of approximately 93 ms. The peak picking algorithm returns k = 30 peaks in the range of 50–5000 Hz (reduced to k = 5 in some tests) for each analysis window, as suggested in [24]. These 30 peaks are then sorted from the highest to the lowest peak magnitude.

The first type of tuning frequency estimation method makes use of the frequency deviation histogram (see [53, 28, 24, 70]), and we call it Hist01. Here, the width of the histogram bin is set to 1 cent. The second tested method is that presented in [14], which is based on circular statistics; from now on we will call it Circ. The last evaluated algorithm uses a Least-Squares optimization of the mapping error to the equal-tempered scale [23]; from now on we will call it L-S. We now give a brief description of these different approaches.

4.1.1 Frequency Deviation Histogram

The underlying idea of this type of algorithm is to build a histogram of the deviations of each spectral peak from the equal-tempered scale. The deviation in semitones for each peak frequency f_i can be estimated as follows

d_i = \frac{c_i - 100 \cdot \mathrm{round}(c_i/100)}{100},    (4.2)

where c_i is the cent value of the frequency f_i calculated using (4.1) with the standard f_ref = 440 Hz. From (4.2), one can see that d ∈ [−0.5, 0.5[ due to the rounding operation to the nearest semitone. At this point, a histogram H of all the possible deviations d_i is computed. Each peak deviation is weighted by its peak magnitude r_i to avoid a high impact of small (noise) peaks. The overall estimated deviation d̂ is the deviation value associated to the histogram bin with the maximum value:

\hat{d} = \arg\max(H).    (4.3)

The reference frequency of the entire music piece can then be computed as

f_{ref} = 440 \cdot 2^{\hat{d}/12}.    (4.4)

A fundamental parameter for this algorithm is the histogram resolution; many algorithms of this kind differ only in this parameter. In our tests we consider only the resolution of 1 cent.
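For illustration, a compact sketch of this histogram estimator (not the thesis implementation; the bin handling is an assumption) could be:

    import numpy as np

    def hist01_tuning(peak_freqs, peak_mags, n_bins=100):
        # Equations (4.1)-(4.4) with a 1-cent histogram resolution and magnitude weighting
        cents = 1200.0 * np.log2(peak_freqs / 440.0)                  # (4.1), f_ref = 440 Hz
        dev = (cents - 100.0 * np.round(cents / 100.0)) / 100.0       # (4.2), d in [-0.5, 0.5)
        hist, edges = np.histogram(dev, bins=n_bins, range=(-0.5, 0.5), weights=peak_mags)
        d_hat = 0.5 * (edges[np.argmax(hist)] + edges[np.argmax(hist) + 1])   # (4.3)
        return 440.0 * 2.0 ** (d_hat / 12.0)                          # (4.4)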

4.1.2 Circular statistics

A different approach for tuning frequency estimation, which makes use of circular statistics, was presented in [14]. This approach is entirely based on the observation that the deviation d is a periodic measure and not an absolute measure, since it is a "wrapped around" quantity that should be evaluated from the nearest 100-cent grid point. Each cent value is mapped onto a 100-cent-periodic unit circle, and represented as a unit-modulus vector as follows

u = 1 \cdot e^{j\phi},    (4.5)

where

\phi = \frac{2\pi}{100} \cdot c.    (4.6)

For each peak i, with frequency f_i, we consider the vector u_i = r_i e^{jφ_i}, where r_i is the peak amplitude, and then we take the mean vector û of all circular quantities u_i as follows

\hat{u} = \frac{\sum_{i=1}^{N} r_i \left[\cos(\phi_i) + j \sin(\phi_i)\right]}{\sum_{i=1}^{N} r_i}.    (4.7)

Each peak's φ_i is therefore weighted by its peak magnitude r_i to avoid a high impact of small (noise) peaks. The overall deviation is then computed from the angle of the resulting vector û, that is

\hat{d} = \frac{1}{2\pi} \angle(\hat{u}).    (4.8)

The tuning frequency can be finally estimated using (4.4).
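A possible sketch of this estimator (a simplified illustration, not the reference implementation of [14]) follows directly from (4.5)-(4.8) and (4.4):

    import numpy as np

    def circ_tuning(peak_freqs, peak_mags):
        cents = 1200.0 * np.log2(peak_freqs / 440.0)                       # (4.1) with f_ref = 440 Hz
        phi = 2.0 * np.pi * cents / 100.0                                   # (4.6): 100-cent periodic angle
        u_hat = np.sum(peak_mags * np.exp(1j * phi)) / np.sum(peak_mags)    # (4.7): weighted mean vector
        d_hat = np.angle(u_hat) / (2.0 * np.pi)                             # (4.8): deviation in semitones
        return 440.0 * 2.0 ** (d_hat / 12.0)                                # (4.4)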

4.1.3 Least-Squares Estimation

This algorithm uses a Least-Squares optimization approach, and is presented in [23]. The aim of this method is to estimate the reference frequency in real time, in order to visualize the evolution of the tuning frequency of a choir ensemble. In this context, the author of this algorithm has developed an application that runs on a modern smartphone. As the choir changes the tuning frequency during singing, the conductor, with the aid of this application, is able to observe this variation and can take some countermeasures to correct the pitch drift.

In a nutshell, this method, at each analysis frame k, calculates the equal-tempered frequency scale values using the previous estimation f_{ref}^{k-1} and updates this estimation by minimizing the average squared error when mapping each peak frequency f_i to the "new" frequency scale. At first, an integer semitone index is calculated for each peak frequency f_i as

s_i = \mathrm{round}\left[ 12 \cdot \log_2\left( \frac{f_i}{f_{ref}^{k-1}} \right) \right],    (4.9)

then the new reference frequency is estimated in a Least-Squares sense by minimizing the squared position error

E = \sum_{i} \left( f_i - f_{ref}^{k} \, 2^{s_i/12} \right)^{2}.    (4.10)

Differentiating E with respect to f_{ref}^{k}, we obtain

\frac{\delta E}{\delta f_{ref}^{k}} = \sum_{i} 2 f_{ref}^{k} \, 2^{s_i/6} - \sum_{i} 2 f_i \, 2^{s_i/12},    (4.11)

and setting \frac{\delta E}{\delta f_{ref}^{k}} = 0, the estimation of f_{ref}^{k} is given by:

f_{ref}^{k} = \frac{\sum_{i} f_i \cdot 2^{s_i/12}}{\sum_{i} 2^{s_i/6}}.    (4.12)

This method is able to estimate an f_ref that goes out of the bounds of 440 Hz ± 50 cents. For this reason, the authors of the L-S algorithm take some countermeasures in order to prevent this possibility. First, a Reset button is present in the real-time application. Second, the current estimated f_{ref}^{k} is stored in a circular buffer of B estimations, and the new estimated f_ref is taken as the median value of this buffer. In our experiments, we chose B = 20, and if the estimation exceeds the limit of 440 Hz ± 50 cents, we force f_ref = 440 Hz, simulating the Reset button. Further details of this method can be found in [23]. In order to give a fair comparison between the tested algorithms, the L-S is slightly modified for all of the tests except for the local tuning estimation task. Since the L-S algorithm was created for a frame-by-frame estimation, the result of a global estimation depends only on the last analysed frames. This behaviour makes the L-S algorithm not well suited for a global f_ref estimation. For that reason we have modified the algorithm in order to consider all the audio frames of the analysed audio excerpt as a single frame, concatenating the spectral peaks of all frames and treating them as one big macro-frame.
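A minimal sketch of the frame-wise update (4.9)-(4.12), without the circular-buffer median and Reset safeguards described above (an illustrative reading of [23], not its reference code):

    import numpy as np

    def ls_tuning_update(peak_freqs, f_ref_prev):
        s = np.round(12.0 * np.log2(peak_freqs / f_ref_prev))                      # (4.9): semitone indices
        return np.sum(peak_freqs * 2.0 ** (s / 12.0)) / np.sum(2.0 ** (s / 6.0))   # (4.12)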

4.2 Evaluation Strategy

Since a ground truth for reference tuning estimation algorithms is hard to find, a comparison of the considered methods in terms of absolute precision is difficult using real-world recordings. In the literature, some authors use synthesized tones (for example sinusoids or sawtooths) in order to make a preliminary evaluation of the precision [23]. However, this kind of signal does not give a completely fair representation of real-world recordings, and not all the characteristics of the tested algorithms can be exploited. Even symbolic music data such as MIDI files or MusicXML sheets synthesized by a sequencer do not provide a valid ground truth. This is because the sound banks used by a sequencer (called soundfonts) may exhibit an unpredictable detuning. However, synthesized symbolic music provides a good starting point, because we can make the assumption that all of the synthesized songs share the same "detuning behaviour". In light of these observations, we decided to adopt a different strategy for the evaluation of the algorithms. First of all, two different datasets (discussed in Sec. 4.3) are considered, one of synthesized symbolic music (MS2012) and one of real-world recordings (covers80). Second, our evaluation strategy is not based on a comparison with a ground truth, but is mainly based on the statistical properties of the estimation algorithms, such as the speed of convergence, the stability, the robustness, and the standard deviation of the estimator. Finally, the computational cost and complexity are also discussed. A detailed description of the evaluation methods is presented in the following sections.

4.2.1 Ideal case performances and global reference frequency estimation

In this task, a test with a simple synthetic sound is performed in order to verify the best-case accuracy of the algorithms. The synthetic sound is a sawtooth sweep, defined as:

s(t) = \frac{1}{H} \sum_{h=1}^{H} \frac{1}{h} \sin\left[ 2\pi h \left( f_0 t + \alpha t^{2} \right) \right].    (4.13)

For a 5-second signal sampled at 44100 Hz, and a base frequency f_0 = 440

Hz, we set α = 5 and the number of harmonics to H = 45 in order to obtain an aliasing-free test signal. An estimation of f_ref is calculated for each analysis frame. Then, we study the global reference frequency estimation for each song in the two datasets under the hypothesis that the tuning frequency is constant over the entire musical piece. Every algorithm is tested using 5 and 30 spectral peaks per frame. A histogram of the estimated f_ref is presented for each test run. Observing the histograms, we can study how similarly the analysed methods behave. For the covers80 dataset, we cannot make any a priori assumptions on the distribution of the reference frequency along the songs. For the synthesized MS2012 dataset, although we do not know the exact f_ref, we can reasonably expect that all of the music pieces share the same reference frequency since they have been synthesized using the same soundfont set.
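The sweep of (4.13) is straightforward to synthesize; a sketch with the parameters quoted in the text (F_s = 44100 Hz, 5 s, f_0 = 440 Hz, α = 5, H = 45):

    import numpy as np

    Fs, dur, f0, alpha, H = 44100, 5.0, 440.0, 5.0, 45
    t = np.arange(int(Fs * dur)) / Fs
    # Equation (4.13): sum of H harmonics of a frequency-swept sawtooth-like tone
    s = sum(np.sin(2 * np.pi * h * (f0 * t + alpha * t ** 2)) / h for h in range(1, H + 1)) / H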

4.2.2 Speed of convergence and estimation stability

In this evaluation task, we test how much data each algorithm needs to obtain the final global f_ref estimation that would be obtained using the whole song, and how stable this estimation is. Again, we make the assumption that for a given song the f_ref remains constant over the entire music piece. An increasing fraction p of the song's frames is randomly extracted; then a reference frequency estimation is performed on these data and compared to the global estimation in terms of absolute error and estimation deviation in Hertz. For each song, we take from p = 10% to p = 100% of the analysis frames, with a step of 10%. The results are averaged over N = 50 runs of this test. For a given percentage p, the mean absolute error E_p for a set of songs S = {s_1, s_2, ..., s_S} is defined as

E_p = \frac{1}{NS} \sum_{n=1}^{N} \sum_{s=1}^{S} \left| f_{ref_s}^{n,p} - f_{ref_s}^{global} \right|,    (4.14)

where f_{ref_s}^{n,p} is the reference frequency of the song s at the extraction n using p% of the analysis frames, and f_{ref_s}^{global} is the global reference frequency of the song s using the whole song (100%).

Reasonably, a trusted algorithm must carry out the same f_ref estimation regardless of which part of the song it is examining. We analyse the standard deviation of the estimation for all the songs in the dataset. The standard deviation,

Σ_p, is calculated using

\Sigma_p = \sqrt{ \frac{1}{NS} \sum_{n=1}^{N} \sum_{s=1}^{S} \left( f_{ref_s}^{n,p} - f_{ref_s}^{global} \right)^{2} }.    (4.15)
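Both statistics are direct to compute once the partial and global estimates are collected; a sketch, under the assumption that they are stored in arrays indexed by run and song:

    import numpy as np

    def convergence_metrics(f_partial, f_global):
        # f_partial: shape (N runs, S songs); f_global: shape (S songs,)
        err = f_partial - f_global[None, :]
        E_p = np.mean(np.abs(err))            # (4.14): mean absolute error
        Sigma_p = np.sqrt(np.mean(err ** 2))  # (4.15): deviation around the global estimate
        return E_p, Sigma_p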

4.2.3 Local tuning estimation

In this test we study the ability of a tuning estimation algorithm to "follow" the local variations of the tuning frequency. This evaluation points out the suitability of the methods for real-time applications. In the case where the hypothesis f_ref = const for a given song is not satisfied (for example in choral or a cappella performances where a pitch drift is present, or in tape or vinyl recordings with a non-constant motor rotation of the playing gear), we need another kind of test. For this purpose we simulate a real-time reference frequency estimation where a value of f_ref is produced every 4 or 2 seconds during the playback of the musical piece. In more detail, we give an f_ref estimation for every local analysis window that groups L_wnd = 80 or L_wnd = 40 analysis frames. Each local window has a 50% overlap with the adjacent windows. With this test we can observe the variations of the tuning frequency along the time axis of a musical piece.

4.2.4 Computational cost and complexity

Since the test script is written in Matlab code, we use the Matlab Profiler to calculate the execution time of the algorithms. The computational cost of the various implementations of the algorithms could be different using other programming languages (for example C/C++), especially for the Hist01 algorithm. For this reason, our measured execution time may vary from one implementation to another. However, since the Circ and L-S methods consist of a single algebraic calculation (see (4.8) and (4.12)), further optimization is not required.

In addition to the execution time, we provide a brief analysis of the asymptotic complexity of the algorithms as a function of the number of input peaks.

4.3 Data Set

In our evaluation process, two datasets have been considered. As stated in Section 4.2, we use both a synthesized symbolic music dataset (MS2012), and a real world recording (covers80) collection. Table 4.1 shows some statistics about the length of the musical pieces in the two datasets.

Table 4.1: Minimum, Maximum, Average and Total length of the songs in the datasets in HH:MM:SS format.

Dataset     min        max        average    total
covers80    00:01:57   00:09:55   00:04:10   11:07:28
MS2012      00:00:08   00:25:34   00:02:46   14:06:16

The following sections illustrate some useful details of the two datasets.

4.3.1 Cover Song 80 (covers80)

This dataset is a well-known music collection used in several Music Information Retrieval research topics such as Cover Song Identification or music similarity measurements. Cover Song 80 [16] is a freely available dataset of real-world recordings created by LabROSA at Columbia University. It contains 80 songs, and for each song there is an alternative version (cover song), for a total of 160 songs. All the songs are in 32 kb/s MP3 format. For our tests, we converted all the songs to 16-bit, mono PCM wave format, sampled at 22050 Hz.

4.3.2 MuseScore Symbolic Music Dataset (MS2012)

We have constructed a collection of 306 songs in MIDI and MusicXML format. All the songs in this dataset are distributed under the free-to-share Creative Commons CC0 license, and are kindly provided by MuseScore¹. The complete dataset with meta-data (title, genre and instrument statistics) is freely available at [8]. For the evaluation test, we have to synthesize the symbolic data to a 22050 Hz, 16-bit mono PCM wave file. For this task we have used an open-source software synthesizer named FluidSynth², with the soundfont named FluidR3.

In Table 4.2 the most used musical instruments (using the General MIDI standard nomenclature) in the dataset are reported. The values represent the percentages of songs in which an instrument is used in one or more tracks of the song. Table 4.3 shows the distribution of the user-annotated musical genre labels for all of the songs in the dataset.

¹ http://www.musescore.com
² http://www.fluidsynth.org/

Table 4.2: General MIDI instrument names.          Table 4.3: Genre tags.

Instrument                %        Genre           %
Acoustic Grand Piano      26%      Classical       45.4%
String Ensemble 1         19%      Christian       12.7%
Choir Aahs                16%      Christmas       6.2%
—                         15%      Pop             6.2%
Clarinet                  14%      Jazz            5.6%
French Horn               13%      Contemporary    5.2%
Trumpet                   12%      Traditional     3.3%
Trombone                  11%      Film            2.9%
Tuba                      10%      Rock            2.3%
Violin                    10%      Folk            1.6%
Alto Sax                  10%      Latin           1.3%
Acoustic Guitar (nylon)   9%       Other           12.5%

4.4 Results

4.4.1 Ideal case performances and global reference frequency estimation results

As we mentioned in Sec. 4.2, an evaluation of the precision of the fref estimation is inapplicable for real world data because of the lack of a suitable ground truth.

Since we do not know the true f_ref of the soundfont FluidR3, the precision cannot be evaluated for the MS2012 dataset either. However, we can investigate how the different algorithms behave in terms of the ideal performances, and how the distribution of the estimated tuning frequency differs between the two datasets. As shown in Fig. 4.1, all the algorithms follow the pitch drift of the sawtooth sweep signal. The Circ and L-S show a very close behaviour, but we can notice the effect of the quantization in the histogram for the Hist01 algorithm, which introduces non-linearities in the ideal behaviour.

Table 4.4 shows the mean µ_k and the standard deviation σ_k of the distribution

Figure 4.1: Frame by frame fref estimation of a sawtooth sweep signal.

of the estimated fref for each dataset, calculated using respectively k = 5 and k = 30 peaks per frame.

Table 4.4: Mean µk and standard deviation σk of fref using k = {5, 30} peaks per frame.

Dataset     Measure   Circ     Hist01   L-S
covers80    σ5        2.41     2.74     1.03
covers80    µ5        440.29   440.69   440
covers80    σ30       2.33     2.55     0.63
covers80    µ30       440.2    440.39   439.78
MS2012      σ5        0.87     1.2      1.07
MS2012      µ5        439.71   439.91   439.87
MS2012      σ30       0.79     1.12     0.7
MS2012      µ30       439.6    440      439.37

Considering the covers80 dataset, the Circ and Hist01 algorithms exhibit a very similar behaviour. For the MS2012 dataset we can notice that the standard deviation is smaller for all methods except the L-S, compared with the results on the covers80 dataset, since all of the songs are synthesized using the same soundfont.

Figure 4.2: fref histograms for MS2012 dataset using k = 5 peaks per frame. The solid line indicates the reference frequency of 440 Hz. The dashed line is the mean.

Figure 4.3: fref histograms for MS2012 dataset using k = 30 peaks per frame. The solid line indicates the reference frequency of 440 Hz. The dashed line is the mean.

Moreover, since the minimum audible frequency interval is 3–4 cents [72], and around 440 Hz an interval of ±4 cents means a difference of about ±1 Hz, we can say that all of the tuning frequency estimation algorithms give reliable results, because they exhibit an estimation standard deviation close to 1 Hz for the MS2012 dataset. The same consideration cannot be extended to the results on the covers80 dataset, because the hypothesis that all of the music pieces in the dataset share the same tuning frequency is not satisfied. Furthermore, Fig. 4.2 and Fig. 4.3 seem to confirm the fact that Circ and Hist01 exhibit a similar behaviour also on the MS2012 dataset. However, as we will see in the following tests, the histogram of the estimated f_ref is not sufficient for an accurate analysis of the tuning frequency estimation algorithms.

Figure 4.4: Convergence results for covers80 dataset using k = {5, 30} peaks per frame.

The number of peaks per frame k does not seem to significantly affect the performances of the considered algorithms. Only a slight improvement on the σk can be noticed using k = 30 for all considered methods.

4.4.2 Speed of convergence and estimation stability results

In this evaluation task we calculate the mean absolute error E_p for each dataset with an increasing percentage p of analysed frames per song. The tests are made using both k = 5 and k = 30 peaks per frame. The underlying idea of this evaluation is to assess how much data are needed to obtain the global f_ref estimation, assuming that the reference frequency is constant over the entire music piece.

In Fig. 4.4 and Fig. 4.5 the graphs of E_p in the various test conditions are shown.

As we can see in Fig. 4.4 and Fig. 4.5, the Circ and Hist01 algorithms are less sensitive to the number of peaks k considered in the estimation. Conversely, if we consider the performance of the L-S algorithm, we can see a better convergence behaviour if we use k = 30 peaks per frame, especially for the covers80 dataset, where the L-S achieves the same performance as the Circ algorithm.

Regarding the estimation stability of each algorithm, we use the estimated standard deviation Σ as defined in (4.15) as an indicator of the reliability of the

Figure 4.5: Convergence results for MS2012 dataset using k = {5, 30} peaks per frame.

Figure 4.6: Estimated Σ for covers80 dataset using k = {5, 30} peaks per frame.

estimation. For each song we calculate Σ over N = 50 estimations using only p% of the song's analysis frames. In Figs. 4.6 and 4.7 the results of this test are reported. The estimation stability calculated here agrees with the results shown in Figs. 4.4 and 4.5; that is, the L-S algorithm benefits more than the others from using k = 30 peaks, even with the MS2012 dataset. Moreover, we can see that only the Circ and the L-S algorithms guarantee (on average) an inaudible estimation error using only 10% of the song.

Figure 4.7: Estimated Σ for MS2012 dataset using k = {5, 30} peaks per frame.

Figure 4.8: Local tuning estimation of the song “Let It Be” played by The Beatles.

4.4.3 Local tuning estimation results

We analyse the local tuning estimation performances by simulating a real-time f_ref estimation. For the first two cases, we use a local analysis window of length

L_wnd = 80 frames. This means that, in our case, an estimation is carried out every 4 seconds. In this test we measure the deviation in cents between the estimated tuning frequency and a reference set to 440 Hz. In Figs. 4.8 and 4.9, the local estimation performances of the considered algorithms are reported. In order to get a more stable reference frequency estimation, we use k = 30 peaks per frame. In more detail, in Fig. 4.8 we show

Figure 4.9: Local tuning estimation of “Variations 16-20” in J.S. Bach, Goldberg Variations, BWV 988, played by Wanda Landowska, Paris (1933), CD version of 78 rpm recording.

the local estimation results for a song with constant reference tuning, "Let It Be" performed by The Beatles, which we assume to have a constant f_ref, while in Fig. 4.9 we show how the algorithms behave when a pitch drift occurs during playback. This particular recording comes from a remastered CD version of an old (1933) 78 rpm vinyl recording of classical music. In both cases, the Circ and the L-S methods give a less "shaky" estimation with respect to the Hist01, in accordance with the results shown in Section 4.4.2. Moreover, the Circ and the L-S are consistent with each other even though a vertical offset is visible in many frames.

In Fig. 4.10, we show the results of the estimation with the same choir ensemble recording used in the evaluation test presented in [23], setting

L_wnd = 40 frames. The performance suffers from a gradual pitch fall that exceeds one semitone from the starting tuning frequency. In that case, the Circ and L-S algorithms perform quite well, and show a similar behaviour. For this reason, these two algorithms are more suitable for real-time frequency estimation than the Hist01.

Figure 4.10: Local tuning estimation of Choir performance used in [23].

4.4.4 Computational cost and complexity

The last evaluation test considers the computational complexity and cost (time) required by each algorithm. This test is performed using the Matlab Profiler utility on the global f_ref estimation task over all the 160 songs in the covers80 dataset, with k = 30 peaks per frame. In the total execution time, we consider only the time spent in the reference frequency estimation routines, so the FFT and the peak picking algorithm are not taken into account. Only a relative evaluation is given here, since optimization of the algorithms in terms of computational efficiency is not the goal of this analysis. We think that a further improvement of the absolute execution time may be possible, for example using a C/C++ implementation of the methods. The average execution time in seconds is reported in Table 4.5. The tests are made on a 64-bit GNU/Linux laptop, with 4 GB of RAM and an Intel i5 M 430 CPU at 2.27 GHz, without parallel Matlab optimization.

Table 4.5: Total and average execution time.

Algorithm Tot. Time/number of songs (s)

Circular   0.19
Hist01     2.63
L-S        1.15

As we can see in Table 4.5, all of the algorithms are fast enough to run in real-time. The L-S has been designed to be real-time even in a computationally limited environment such as a modern smartphone. In our test, however, the best performing algorithm in terms of computational cost is the Circ. Moreover, it is easy to check that the asymptotic complexity is linear in the number of considered peaks k for all of the presented algorithms. Since all the algorithms are O(k), the longer execution time of the Hist01 is caused only by the overhead of some Matlab functions used by this method, such as find(·) and max(·).

4.5 Conclusions

The lack of a scientific ground truth makes the evaluation of the performances of tuning estimation algorithms a non-trivial task. We proposed a set of tests that allow comparing different algorithms in terms of desirable behaviours such as speed of convergence, stability, and computational cost and complexity. As seen in Section 4.4.1, all the considered algorithms perform quite well in the global f_ref estimation. However, a remarkable difference is noticeable in terms of speed of convergence and estimation stability between the Hist01 algorithm and the other studied methods. The Circ and L-S algorithms exhibit a very similar behaviour in the tests presented in Sections 4.4.2 and 4.4.3, which suggests that they are more suitable for real-time local frequency estimation. Moreover, it appears that the number of peaks per frame k is not a fundamental parameter for the Circ and Hist01 methods, while, especially with synthesized songs, the L-S can benefit from using more peaks. However, from the computational cost optimization point of view, using more peaks means more computation time. In our tests we demonstrated that for Circ and L-S, k = 5 peaks per frame are sufficient for all of the estimation tasks presented here. Our tests also show that the Circ algorithm outperforms the other ones on all of the considered evaluation tasks. We have also shown that all of the MIR algorithms that need a reference frequency estimation and already use a peak picking algorithm can benefit from using

the Circ and L-S methods for the local and global f_ref estimation, with consistent estimations even when using small amounts of data. However, the main advantage of the Hist01 method is that it can be applied successfully, with small modifications, also without a peak picking procedure [36, 53].

Furthermore, as we will see in the next chapter, a reliable reference frequency estimation procedure is beneficial for our proposed pitch salience function calculation method. For our needs, we chose to adopt the Circ algorithm.

5 Polyphonic Pitch Salience Function

In this chapter, a novel approach for the computation of a pitch salience function is presented. The aim of a pitch (considered here as a synonym for fundamental frequency) salience function is to estimate the relevance of the most salient musical pitches that are present in a certain audio excerpt. Such a function is used in numerous Music Information Retrieval (MIR) tasks such as pitch and multiple-pitch estimation, melody extraction and audio feature computation (such as chroma or Pitch Class Profiles). In order to compute the salience of a pitch candidate f, the classical approach uses a weighted sum of the energy of the short-time spectrum at its integer multiple frequencies hf. In the present work, we propose a different approach which does not rely on energy but only on frequency location. For this, we first estimate the peaks of the short-time spectrum. From the frequency location of these peaks, we evaluate the likelihood that each peak is a harmonic of a given fundamental frequency. The specificity of our method is to use as likelihood the deviation of the harmonic frequency locations from the pitch locations of the equal tempered scale. This is used to create a theoretical sequence of deviations which is then compared to an observed one. The proposed method is then evaluated on a task of multiple-pitch estimation using the MAPS test-set.

A salience function is a function that provides an estimation of the predominance of different frequencies in an audio signal at every time frame. It allows obtaining an improved spectral representation in which the fundamental frequencies have a greater relevance compared to the higher partials of a complex tone. The computation of a salience function is commonly used as a first step in melody, predominant-pitch (pitch is considered here as a synonym for fundamental frequency or f_0) or multiple-pitch estimation systems [12, 54, 37, 65].

5.0.1 Classical approach

In the classical approach [56], the salience (or strength) of each f_0 candidate is calculated as a weighted sum of the amplitudes of the spectrum at its harmonic frequencies (integer multiples of f_0). In the discrete frequency case, this can be expressed as:

S[k] = \sum_{h=1}^{H} w_h |X[hk]|,    (5.1)

where k is the spectral bin, H is the number of considered partials, w_h is a partials' weighting scheme and |X[k]| is the amplitude spectrum. This process is repeated for each time frame m. In this approach, the choice of the number of considered harmonics H and the used weighting scheme w_h are important factors and directly affect the obtained results [56]. The weighting scheme w_h implicitly models the sound source. Since the classical approach is based on the amplitude/energy of the spectrum, it is sensitive to the timbre of the sources. In order to make the estimation more robust against timbre variations, spectral whitening or flattening processes have been proposed [54, 4, 3, 55]. Among other approaches, the one of [67] proposes to estimate the salient pitch of a complex tone mixture using a psychoacoustically motivated approach. It uses the notions of masking and virtual pitch (sub-harmonic coincidence) calculation.
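As a point of comparison, a minimal sketch of the harmonic-sum salience of (5.1), with a flat weighting scheme assumed for illustration:

    import numpy as np

    def classical_salience(mag, H=8, w=None):
        # S[k] = sum_h w_h * |X[h*k]| over the harmonics that stay inside the spectrum
        w = np.ones(H) if w is None else w
        K = len(mag)
        S = np.zeros(K)
        for k in range(1, K):
            idx = np.arange(1, H + 1) * k
            idx = idx[idx < K]
            S[k] = np.sum(w[:len(idx)] * mag[idx])
        return S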

5.0.2 Proposal

We propose a novel salience function which does not rely on the amplitude/energy of the spectrum but only on the frequency location of the peaks of the spectrum. Doing this, our method is not sensitive to timbre variations and hence does not necessitate whitening processes. The specificity of our method is to use as likelihood the deviation of the harmonic frequency locations from the pitch locations of the equal tempered scale. This is illustrated in Figure 5.1 for the harmonic frequencies of the pitch C4 (MIDI Key Number i = 60), whose 3rd and 6th harmonic frequencies are slightly above the pitches i = 79 and i = 91 respectively, while its 5th and 7th are below the pitches i = 88 and i = 94 respectively (in the equal tempered scale). This is used to create a theoretical sequence of deviations which is then compared to an observed one derived from the peaks detected in the spectrum.

Figure 5.1: (Lower part) Frequency location of the pitches of the equal tempered scale for a tuning of 440 Hz. (Upper part) Frequencies of the harmonic series of the pitch C4 (261.6 Hz).

Chapter organization: In Section 5.1 we present the motivation behind the concept of this novel salience computation approach (Section 5.1.2) and the details of its computation (Sections 5.1.3, 5.1.4, 5.1.5 and 5.1.6). In Section 5.2 we propose a basic evaluation framework for salience functions based on a multi-pitch estimation paradigm (Section 5.2.1) and assess the performance of our proposed method (Section 5.2.4). We finally conclude in Section 5.3 and provide directions for future work.

5.1 Proposed Method

5.1.1 Overview

The global flowchart of our method is represented in Fig. 5.2. The content of the audio signal is first analyzed using Short Time Fourier Transform (STFT). At each time frame m, the peaks of the local Discrete Fourier Transform (DFT) are estimated using a peak-picking algorithm.

We denote by P_m = {(f_1, a_1), ..., (f_P, a_P)} the set of peaks detected at frame m, where f_p and a_p are the frequency and amplitude of the p-th peak. Since our salience function is based on an equal-tempered cent grid, we then need to estimate the tuning frequency f_ref of the audio signal. We then compute at each frame m the salience value of each peak p by comparing its frequency to those of an equal tempered scale tuned on f_ref. This salience allows discriminating peaks which are fundamental frequencies from those that are harmonic partials.

Figure 5.2: General scheme of the method (input song → STFT → spectrum peak-picking → reference frequency estimation → salience computation → salience output).

5.1.2 Motivations for using frequency deviations for pitch salience computation

The computation of our salience function relies only on the frequency positions of the peaks of the spectrum (not on their energy). The basic idea we develop is the following: for a given note at fundamental frequency f_0, its h-th harmonic frequency exhibits a specific deviation from the equal tempered scale. For example, for a tuning at 440 Hz, the third (h = 3) harmonic of an A4 note (f_0 = 440 Hz) is at frequency 1320 Hz, while the closest note of the equal tempered scale is at 1318.5 Hz. The specific deviation of the third harmonic is then 1.95 cents.

For a given frequency f0, the frequency of its h-th harmonic is defined by

f_h^{f_0} = h \cdot f_0.    (5.2)

The deviation in cents of the harmonic f_h^{f_0} from the equal tempered grid is defined as:

d_h^{f_0} = 100 \left[ 12 \log_2\left( \frac{f_h^{f_0}}{f_{ref}} \right) - \left\lfloor 12 \log_2\left( \frac{f_h^{f_0}}{f_{ref}} \right) \right\rceil \right],    (5.3)

where \lfloor \cdot \rceil is the rounding operator and f_ref is the A4 tuning frequency estimated from the data¹. We denote by {f_h^{f_0}} the sequence of all the harmonic frequencies of f_0 and by {d_h^{f_0}} the theoretical sequence of deviations. This deviation is independent of the actual f_0. We therefore simply denote it by {d_h} in the following.

¹ Or blindly chosen as 440 Hz.

Proof that d_h^{f_0} is independent of f_0: Under the hypothesis that the analysed musical excerpt is played on the equal temperament scale and using an accurately tuned instrument, we can calculate each equal-tempered note frequency f_i in the audio spectrum using an integer number i as follows:

f_i = f_{ref} \cdot 2^{i/12}.    (5.4)

The integer number i represents the note index in the MIDI notation without the offset of 69 (for the sake of simplicity, we assume that A4 corresponds to i = 0 instead of i = 69). Now, it is easy to check that for all fundamental frequencies f_0 = f_i, (5.3) can be rewritten as:

d_h^{f_i} = 100 \left[ 12 \log_2(h) + i - \left\lfloor 12 \log_2(h) + i \right\rceil \right]
          = 100 \left[ 12 \log_2(h) - \left\lfloor 12 \log_2(h) \right\rceil \right].    (5.5)

Since i ∈ \mathbb{Z}, we can say that \lfloor 12 \log_2(h) + i \rceil = i + \lfloor 12 \log_2(h) \rceil, and it is clear that the sequence {d_h^{f_i}} does not depend on the fundamental frequency f_i.
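A few lines are enough to generate this theoretical sequence and check the values quoted earlier; a sketch, with an optional inharmonicity coefficient β anticipating the extension discussed below:

    import numpy as np

    def theoretical_deviations(H=20, beta=0.0):
        h = np.arange(1, H + 1)
        ratio = h * np.sqrt(1.0 + beta * h ** 2)        # beta = 0 gives the harmonic case (5.2)
        semis = 12.0 * np.log2(ratio)
        return 100.0 * (semis - np.round(semis))        # {d_h} in cents, cf. (5.5)

    print(theoretical_deviations()[:8])   # d_3 ≈ 1.95, d_5 ≈ -13.69, d_7 ≈ -31.17 cents, ...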

In Fig. 5.3, we illustrate the deviation of the first 20 harmonics of a complex tone from the equal tempered note scale.

Figure 5.3: Deviation of the first 20 harmonic frequencies of a complex tone from the pitches of the equal tempered scale.

Salience computation: Since the sequence {d_h} is independent of the fundamental frequency, we can simply compute the salience of each f_0 candidate at frequency f_p as the correlation between the theoretical sequence of deviations {d_h} and the measured sequence of deviations {d̂_h^{f_p}}. The measured sequence of deviations is the one corresponding to the peaks detected in the spectrum, P_m.

Extension to inharmonic signals: Inharmonicity is a phenomenon related to the physical characteristics of a non-ideal string. The frequencies of the modes of vibration of an ideal string are exact integer multiples of the fundamental, but the stiffness of the material of real strings shifts the modes of vibration to non-integer multiples [20]. In mathematical terms, the relation between the h-th partial f_h^{f_0} and the fundamental frequency f_0 can be modelled as

f_h^{f_0}(\beta) = h f_0 \sqrt{1 + \beta h^{2}},    (5.6)

where β is the inharmonicity coefficient, which is related to the physical properties of a string. In order to take inharmonicity into account we use (5.6) instead of (5.2) in equation (5.3). It should be noted that, whether inharmonicity is taken into account or not, the theoretical sequence of deviations is always independent of f_0. However, the theoretical sequence of deviations now depends on the parameter β, and it is denoted by {d_h(β)}. In the next sections, we describe in detail each block of our algorithm (see Fig. 5.2).

5.1.3 Short Time Fourier Transform

The N-term STFT, at time frame m, of a discrete signal x[n] is defined in equation (3.1) and explained in Section 3.1.1. For our computation, we only use the amplitude of the STFT, denoted by

|X_{m,k}|. We use N = 4096 samples (which corresponds to 92.9 ms for a sampling rate of 44.1 kHz), τ = 2048 samples (overlap of 50%) and a Hanning windowing function.

5.1.4 Spectrum Peak Picking

In order to detect the local peaks of the spectrum, we use the algorithm proposed in the context of the Sinusoidal Modelling Synthesis framework (SMS) [1, 62]. In this context, a fixed number P of local maxima is detected in the amplitude spectrum |X_{m,k}|. For each local maximum, its frequency M_p is refined using a 3-point parabolic interpolation over [M_p − 1, M_p, M_p + 1]. The obtained frequency is denoted by f_p, in Hz. The result of the peak picking algorithm is the sequence P_m = {(f_1, a_1), ..., (f_P, a_P)}, made of pairs of peak frequency location f_p and amplitude a_p. The peak picking is performed at each time frame m ∈ [1 ... M]. The concatenation of all peak sequences, P_tot = P_1 ∥ P_2 ∥ ... ∥ P_M, is used as input for the reference tuning estimation algorithm.
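The parabolic refinement itself reduces to a closed-form vertex formula; a sketch (an illustration only: it operates on the linear magnitude spectrum, whereas SMS-style implementations typically interpolate the dB magnitude):

    import numpy as np

    def refine_peaks(mag, peak_bins, N, Fs):
        # 3-point parabolic interpolation around each local maximum of the magnitude spectrum
        a, b, c = mag[peak_bins - 1], mag[peak_bins], mag[peak_bins + 1]
        delta = 0.5 * (a - c) / (a - 2 * b + c)      # vertex offset in bins, in (-0.5, 0.5)
        freqs = (peak_bins + delta) * Fs / N         # refined frequencies f_p in Hz
        amps = b - 0.25 * (a - c) * delta            # interpolated amplitudes a_p
        return freqs, amps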

5.1.5 Reference Frequency Estimation

Since our algorithm relies on the equal-tempered cent scale, the tuning f_ref (or reference frequency) of the audio signal needs to be estimated. For our purposes, based on the findings presented in Chapter 4, we chose the method presented in [14] and explained in Section 4.1.2.

5.1.6 Salience Function Computation

As previously said, the salience S_p(β) of a given peak p can be calculated as the correlation C between the theoretical sequence of deviations {d_h(β)} and the measured one {d̂_h^{f_p}(β)}. From an abstract point of view, S_p(β) is calculated using:

S_p(\beta) = C\left( \{d_h(\beta)\}, \{\hat{d}_h^{f_p}(\beta)\} \right),    (5.7)

where C(·, ·) is a generic correlation measure. The two deviation sequences can be seen as two vectors d(β) = [d_1(β), ..., d_H(β)] and d̂^{p}(β) = [d̂_1^{f_p}(β), ..., d̂_H^{f_p}(β)], so that a good correlation measure can be the inner product ⟨·, ·⟩. In practice, in order to reduce the influence of very small values (hence often noisy) in the computation of the salience, the correlation is weighted by the local amplitude a_p of the f_0 candidate f_p:

S_p(\beta) = a_p \left\langle \mathbf{d}(\beta), \hat{\mathbf{d}}^{p}(\beta) \right\rangle = a_p \sum_{h=1}^{H} d_h(\beta) \cdot \hat{d}_h^{f_p}(\beta).    (5.8)

Computation of d̂_h^{f_p}(β): {f_h^{f_p}(β)} is the sequence made of the harmonic frequencies of a detected peak p: f_h^{f_p}(β) = h f_p \sqrt{1 + \beta h^{2}}. {d̂_h^{f_p}(β)} is the vector of measured deviations corresponding to {f_h^{f_p}(β)}. {d̂_h^{f_p}(β)} is computed for all the detected peaks p ∈ P_m at frame m (i.e., we consider each detected peak as a potential pitch candidate).

To validate a given pitch candidate f_p, we look among the detected peaks for the ones that are harmonics of this candidate. This is done by using a function G centered on the h-th harmonic of f_p and evaluated at the detected peaks f_{p'}.

More precisely, G(f_{p'}; µ_{h,p}(β), σ_{h,p}(β)) is a Gaussian function evaluated at f_{p'}, with

• mean µ_{h,p}(\beta) = h f_p \sqrt{1 + \beta h^{2}}, and

• standard deviation σ_{h,p}(\beta) = µ_{h,p}(\beta) \left( 2^{\alpha/1200} - 1 \right),

where the parameter α = 20 cents is chosen experimentally in order to take into account the effect of the frequency location error of the peak picking step. We chose α such that the number of False Positives is reduced without losing Precision (see Section 5.2.2 for the explanation of the evaluation measures).

The Gaussian function we use has a maximum value of one when f_{p'} = µ_{h,p}(β) = h f_p \sqrt{1 + \beta h^{2}}; in other words, G will only take non-zero values for the f_{p'} (the detected peaks) which are close to h f_p \sqrt{1 + \beta h^{2}}. To each detected peak f_{p'} a deviation d̄_{p'} is associated, as defined in (5.3):

\bar{d}_{p'} = 100 \left[ 12 \log_2\left( \frac{f_{p'}}{f_{ref}} \right) - \left\lfloor 12 \log_2\left( \frac{f_{p'}}{f_{ref}} \right) \right\rceil \right].    (5.9)

The deviation of f_h^{f_p}(β) is then computed as the following weighted sum:

\hat{d}_h^{f_p}(\beta) = \sum_{p'=1}^{P} G(f_{p'}; \mu_{h,p}(\beta), \sigma_{h,p}(\beta)) \cdot \bar{d}_{p'}.    (5.10)

A single value of β is assigned to each pitch candidate f_p. The typical range of β for a piano string [20] is β ∈ B = {0} ∪ [10^{-5}, 10^{-3}]. In order to estimate β we maximize

S_p = \max_{\beta \in B} \left[ S_p(\beta) \right].    (5.11)

Notice that, in the practical case, all the values of β in the search range must be tested exhaustively, because S_p(β) is an "unpredictable" function and no numerical optimization algorithm can be used in order to find the maximum of that function. The maximization of (5.11) simultaneously provides the value of S_p and that of the inharmonicity coefficient β for each spectral peak p. Of course, only the values of β corresponding to true notes make sense.

The limits of the equal temperament: Using the equal-tempered grid of semitones is fundamental for the considerations made in Sec. 5.1.2. Moreover, it is reasonable to think that only exactly tuned instruments² are needed in order to maintain the validity of equation (5.5). However, the Gaussian weighting scheme used in (5.10) ensures that slightly deviated fundamental frequencies are not much negatively affected. However, spectral peaks that are detuned by more than ±α cents can be excessively penalized.
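Putting (5.8)-(5.11) together for one analysis frame, a hedged sketch of the salience computation might look as follows (the β grid, the number of harmonics H and the Gaussian tolerance are illustrative assumptions, not the thesis settings):

    import numpy as np

    def salience(peak_freqs, peak_amps, f_ref, H=8, alpha=20.0,
                 betas=np.concatenate(([0.0], np.logspace(-5, -3, 20)))):
        # Salience S_p of each detected peak, maximized over the inharmonicity grid B
        semis = 12.0 * np.log2(peak_freqs / f_ref)
        d_bar = 100.0 * (semis - np.round(semis))                     # (5.9): observed deviations
        h = np.arange(1, H + 1)
        S = np.zeros(len(peak_freqs))
        for p, fp in enumerate(peak_freqs):
            best = -np.inf
            for beta in betas:
                r = h * np.sqrt(1.0 + beta * h ** 2)
                d_theo = 100.0 * (12.0 * np.log2(r) - np.round(12.0 * np.log2(r)))  # {d_h(beta)}, cf. (5.5)
                mu = fp * r                                           # harmonic frequencies of candidate fp
                sigma = mu * (2.0 ** (alpha / 1200.0) - 1.0)          # +- alpha cents tolerance
                G = np.exp(-0.5 * ((peak_freqs[None, :] - mu[:, None]) / sigma[:, None]) ** 2)
                d_hat = G @ d_bar                                     # (5.10): measured sequence
                best = max(best, peak_amps[p] * float(np.dot(d_theo, d_hat)))   # (5.8), (5.11)
            S[p] = best
        return S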

5.2 Evaluation

There is no standard method to evaluate the performances of a salience function by itself. This is because such a function is usually a pre-processing step of a more complicated algorithm (as for example a pitch-estimation method [54, 13]). Therefore, in order to be able to test our salience, we chose to construct a very simple and straightforward multiple-pitch estimation algorithm from our salience function. In Section 5.2.1, we explain the post-processing applied to the salience function in order to obtain a multi-pitch estimation.

5.2.1 Multiple-pitch estimation: post-processing of the salience function

In order to test our salience function as a multi-pitch estimation algorithm, we chose to apply a basic post-processing step that transforms the salience function into a piano-roll representation. The piano-roll R̂_{m,i} can be seen as a spectrogram-like binary representation where the rows are the time frames m and the columns are the MIDI Key Numbers i³. If a note i is marked as detected at the time frame m, the corresponding element R̂_{m,i} is set to 1; otherwise it is set to 0.

² For example, the octave stretching in piano tuning can be a problem.
³ Ranging from 21 (A0) to 108 (C8).

At each time frame m, we have a sequence of P pairs of peak frequency and salience values S_m = {(f_1, S_1), ..., (f_P, S_P)}. We normalize the values S_p in order to obtain a maximum amplitude of one at each time frame. The negative values of salience are set to zero. Each peak frequency f_p is quantized to the nearest MIDI Key Number using

i_p = 69 + \left[ 12 \log_2\!\left( \frac{f_p}{f_{\mathrm{ref}}} \right) \right]    (5.12)

where 69 corresponds to the MIDI Key Number associated to the note A4 in the MIDI Tuning Standard (MTS), and [·] denotes rounding to the nearest integer. In order to remove holes (estimation errors) in the middle of notes (disruptions in the salience value), we then apply a sliding median filter of size L frames along the time dimension m. Finally, the binary piano-roll is obtained by applying a fixed threshold T to the values of R̂_{m,i}: we set to 1 all the values that are above T, and to 0 the other ones. In Fig. 5.4, an example of piano-roll transcription is shown, and different colors are used in order to highlight the True Positives, False Positives and False Negatives. Notice that a considerable number of False Positives occur just after a True Positive on the same MIDI Key Number. This is caused by the release time of the piano sound, which is extended by the reverberation time simulated in the recordings.
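The post-processing just described can be sketched in Python (numpy/scipy) as follows. The per-frame normalisation, the MIDI quantization of (5.12), the median filtering and the thresholding follow the text above; the function names, the key-range handling and the use of scipy.signal.medfilt are illustrative assumptions, not the exact implementation used in the experiments.

    import numpy as np
    from scipy.signal import medfilt

    def frame_to_pianoroll_row(freqs, saliences, f_ref=440.0, n_keys=88, key_offset=21):
        # Map one frame of (frequency, salience) pairs to MIDI-key bins, as in (5.12).
        row = np.zeros(n_keys)
        s = np.clip(saliences, 0.0, None)            # negative saliences are set to zero
        if s.max() > 0:
            s = s / s.max()                          # per-frame normalisation to one
        midi = np.round(69 + 12 * np.log2(freqs / f_ref)).astype(int)
        for i, v in zip(midi, s):
            if key_offset <= i < key_offset + n_keys:
                row[i - key_offset] = max(row[i - key_offset], v)
        return row

    def pianoroll(frames, L=6, T=0.2):
        # Stack per-frame rows, median-filter each key along time, then binarise with T.
        R = np.vstack([frame_to_pianoroll_row(f, s) for f, s in frames])
        L_odd = L if L % 2 == 1 else L + 1           # medfilt requires an odd kernel size
        R = np.apply_along_axis(lambda col: medfilt(col, L_odd), 0, R)
        return (R > T).astype(int)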

5.2.2 Evaluation measures

In order to evaluate our salience-based piano-roll, we have to compute a ground-truth piano-roll R_{m,i} for each song in the dataset. R_{m,i} is obtained from the ground-truth text annotation that reports onset time, offset time and MIDI Key Number for each note played in a specific song. The note onset and offset times are quantized with the same hop size τ (converted to seconds) used by the algorithm.

We compare the ground-truth piano-roll R_{m,i} to the estimated piano-roll R̂_{m,i} by comparing the values in each cell (m, i). We then compute the Precision (P), the Recall (R) and the F-Measure (F) defined as follows:

P = \frac{tp}{tp + fp}, \qquad R = \frac{tp}{tp + fn}, \qquad F = \frac{2PR}{P + R}    (5.13)

where tp (True Positive) is the total number of correctly identified notes, fn (False Negative) the number of missed notes, and fp (False Positive) the number of spurious notes detected.

Figure 5.4: Piano roll representation R̂_{m,i} obtained using our salience function. In this case P = 0.65, R = 0.8 and F = 0.72 (see the explanation of the evaluation measures in Section 5.2.2).
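A minimal sketch of the frame-level evaluation of (5.13), assuming the ground-truth and estimated piano-rolls are binary numpy arrays of the same shape:

    import numpy as np

    def prf(R_true, R_est):
        # Frame-level Precision, Recall and F-measure (5.13) from binary piano-rolls.
        tp = np.sum((R_true == 1) & (R_est == 1))
        fp = np.sum((R_true == 0) & (R_est == 1))
        fn = np.sum((R_true == 1) & (R_est == 0))
        P = tp / (tp + fp) if tp + fp else 0.0
        R = tp / (tp + fn) if tp + fn else 0.0
        F = 2 * P * R / (P + R) if P + R else 0.0
        return P, R, F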

5.2.3 Test-Set

Experiments are performed on the MIDI Aligned Piano (MAPS) test-set [18]. MAPS provides CD quality piano recordings (44.1 kHz, 16-bit). This test-set is available under a Creative Commons license and consists of about 40 GB (65 hours) of audio files recorded using both real and synthesized pianos. The aligned ground-truth is provided as MIDI or plain text files. The alignment and the reliability of the ground-truth are guaranteed by the fact that the sound files are generated from these MIDI files with high quality samples or a Disklavier (a real piano with MIDI input). In order to have a generalized test-set, the pianos have been played in different conditions, such as various environments with different reverberation characteristics (9 combinations in total). This collection is subdivided into four different subsets. The set ISOL contains monophonic excerpts, MUS contains polyphonic music, UCHO is a set of usual chords in western music, and RAND is a collection of chords with random notes.

5.2.4 Results

Setting the parameters: The parameters of our algorithm are:

• H: the total number of considered harmonics,

• L: the length of the median filter,

• T : the salience threshold.

In order to tune these parameters we used the AkPnStgb audio files of the test-set⁴. The values that maximize the F-Measure are H = 8, L = 6 and T = 0.2. The total number of peaks per frame, P, is not itself a parameter of the salience algorithm; P = 40 is chosen experimentally.

Harmonic vs Inharmonic model: We first compare in Figure 5.5 the pitch estimation obtained by our model in the harmonic setting (the β parameter is forced to 0) with the inharmonic setting (β is estimated). This is done using the whole MAPS test-set. As we expected, taking into account the inharmonicity brings an overall improvement. The Precision P increases by 12 percentage points (from 0.43 to 0.55) and the F-Measure increases by 5 percentage points. Since the Recall does not change significantly, while the Precision does, we can say that considering the string inharmonicity allows reducing the number of False Positives. Because the results are better with our inharmonic model, we only consider this one in the following. Detailed Analysis: In Fig. 5.6, we provide the results in terms of pitch estimation for each subset of MAPS using the inharmonic model. In Fig. 5.7, we provide the results in terms of Pitch Class (i.e., without octave information). As we can see from the figures, our approach is prone to octave errors. This is due to the fact that the deviation template itself does not exploit the octave information⁵.

⁴ This is one of the nine different piano and recording condition set-ups in the MAPS test-set.

Figure 5.5: Pitch estimation results for our model in the Harmonic (model forced to β = 0) vs Inharmonic setting (β is estimated).

This octave ambiguity could only be solved with an ad hoc procedure. Figures 5.6 and 5.7 also show that, on average, the Precision P is greater than the Recall R. For a fixed number of True Positives, this means that the number of False Negatives (missed notes) is greater than the number of False Positives (added notes).

Influence of the T parameter: In Figure 5.8, we show the variation of the Recall and Precision as a function of the parameter T (threshold on the salience values). We see that the choice of T is a key parameter for the Precision/Recall trade-off, hence for the FP/FN trade-off. If our system is used as a front-end of a more complicated system which can filter out the False Positives, we should use a value of T which maximizes the Recall. It should be noted

⁵ In a hypothetical scenario where the peak picking algorithm detects peaks at frequency f_p and at an infinite number of its harmonics with amplitude equal to a_p, the salience value for the peak p and for the peaks with frequency h̄f_p, with h̄ = 2^j, j ∈ ℕ⁺, will be the same. To put it another way, the peaks whose frequency is j octaves above f_p will measure the same salience value as f_p.

Figure 5.6: Pitch estimation results for each subset and the overall average (β is estimated).

Figure 5.7: Pitch-Class estimation results for each subset and the overall average (β is estimated).

that Figure 5.8 is computed using only the MUS subset of MAPS. Because of this, the best value for T (in an F-Measure sense) is T = 0.1, which is different from the global optimum value for the entire MAPS test-set.

         Peak   P1     P2     Emiya et al.   Benetos et al.
F-Meas.  0.31   0.44   0.49   0.82           0.87

Table 5.1: Comparison of Pitch F-Measure results on the MAPS test-set. Peak is a fixed threshold on the detected peaks, P1 is the proposed method without considering inharmonicity (β forced to 0) and P2 is with the inharmonic model (β is estimated). Emiya et al. is presented in [18] and Benetos et al. in [3].

Comparison to state-of-the-art: In Table 5.1, we report the Pitch F-Measure results of our system in the harmonic setting (P1, β forced to 0) and in the inharmonic setting (P2, β estimated). We compare our results to the ones obtained by Emiya et al. [18] and Benetos et al. [3] on the same test-set. Also, the results obtained by directly applying a threshold on the detected peaks are reported as a baseline. As expected, the results obtained with our methods are not as good as the ones obtained with dedicated multi-pitch estimation algorithms.

The main reason is that our system is not a multi-pitch estimation method but only a pre-processing step to be used in a more complex system. Our straightforward post-processing procedure has been introduced only to assess the potential performance of our novel salience function design. In this context, our salience exhibits very promising results.

5.3 Conclusions

The performances obtained by our proposed salience function for the estimation of pitch classes (Fig. 5.7) show that this kind of salience, even with a simple post-processing procedure, is suitable for extracting audio features like the Pitch Class Profile (PCP [22]) used in cover song detection or key/chord recognition tasks [59, 50]. Moreover, especially for a piano music test-set such as MAPS, considering the string inharmonicity is beneficial in terms of Precision and F-Measure. Despite the fact that our salience function looks promising, further development of an ad-hoc post-processing procedure is needed in order to use it for multi-pitch estimation. Moreover, as indicated in Section 5.2.4, the parameter T should be tuned depending on the application, in order to favour the F-Measure or the Recall.

Figure 5.8: Pitch Recall/Precision curve for different values of T for the MUS subset. The best F-Measure (0.62) is obtained for T = 0.1 and is marked with the "O".

During our tests we have identified some weaknesses that are subjects for future research. The accuracy of the peak picking algorithm is a key factor: a missing peak can negatively affect the overall accuracy. The octave ambiguity discussed in the previous section can be treated by developing a specific procedure. Furthermore, the coarser resolution in the low frequency spectrum can lead to a large error when calculating the high order harmonic frequencies. Conversely, the notes in the high portion of the audio spectrum do not have a sufficient number of partials to give a consistent value of salience, because of the spectral roll-off near the Nyquist limit.

Due to the better accuracy in detecting the correct pitch class (i.e., without the octave information), this salience can be used to build Pitch Class Profiles that are invariant to differences in timbre between instruments. Furthermore, the rough note transcription procedure described here allows discarding all of the percussive/noise components of the musical signal. These two characteristics are desirable behaviours for detecting musical chord boundaries, as we will see in detail in the following chapter.

6 Chord Bounds Detection

In this chapter, different strategies for the calculation of the Harmonic Change Detection Function (HCDF) are discussed. HCDFs can be used for detecting chord boundaries for Automatic Chord Estimation (ACE) tasks. The chord transitions are identified as peaks in the HCDF. We show that different audio features and different novelty metrics have a significant impact on the overall accuracy of a chord segmentation algorithm. Furthermore, we show that certain combinations of audio features and novelty measures provide a significant improvement with respect to the current chord segmentation algorithms. We study the influence of different audio features and harmonic change measures (also called novelty in the harmonic content) on the calculation of the Harmonic Change Detection Function developed by Harte et al. [29]. The HCDF is a powerful tool for detecting harmony boundaries in digitally recorded songs; hence it can be used for audio segmentation based on harmonic rules, such as musical chords (local), or for detecting modulations of the musical key (global). The HCDF is particularly beneficial for Automatic Chord Estimation or Key Detection algorithms. For example, an ACE method that knows the boundaries of the musical chords in a given song can apply an informed local averaging strategy. This post-processing step allows reducing the impact of noise in the audio features (usually Chroma Features, also called Chromagram), improving the correct chord estimation accuracy without altering the temporal precision. Some naïve averaging schemes, such as a temporal moving average on Chroma

Features, usually bring great benefits in terms of correctly detected chords, but in many cases the chord boundaries are not preserved. It has been proven that more clever averaging methods, like beat-averaging or music-structure-informed averaging, are beneficial for ACE tasks [42].

We will show how different Chroma Features and distance metrics impact the performance of the HCDF used for detecting chord change boundaries. This chapter is organized as follows. In Section 6.1 we give an overview of the HCDF of Harte et al. presented in [29]. Section 6.2 describes the audio features that we use in addition to the simple chroma vectors, and a few alternatives for the calculation of the novelty function. Finally, in Sections 6.3 and 6.4 we explain the evaluation procedure and we discuss the results.

6.1 Harmonic Change Detection Function

Ideally, the HCDF is a function that measures how likely a change in the harmonic content of a song is at a given time frame n with respect to a neighboring time frame n + 1 or n + 2, where n is the frame index, assuming a sliding window analysis of the song (i.e. Short Time Fourier Transform or Chroma Features). Low values of the HCDF mean that no substantial harmonic differences were detected between consecutive frames. On the contrary, when sudden changes in the harmonic content occur, they appear as peaks (i.e. local maxima) in the HCDF. What exactly a harmonic change is depends on the context. For example, a harmonic change can be a note change, a chord change, or a modulation of the musical key. Depending on the granularity of the change we want to detect, we have to choose the appropriate audio descriptor that is suitable for our needs. We focus on chord change detection, therefore we use Chroma Features (also denoted as Chromagram), the audio descriptor that is the most used for ACE methods [43]. An example of the ideal behaviour of an HCDF suitable for chord detection (or segmentation) tasks is depicted in Fig. 6.1.

Figure 6.1: Annotated chord progression and ideal HCDF for chords.

A straightforward yet powerful method for HCDF calculation is presented in [29], which will be described in the following sections.

6.1.1 Algorithm for HCDF calculation

The first step of the Harte et al. HCDF calculation approach is a 36 bins-per-octave Constant-Q spectral analysis with frequency f in the (log-scaled) range f ∈ [110, 3520] Hz, which corresponds to a note span from A2 to A6. They used the efficient implementation of the Constant-Q transform (CQT) described in [5]. In order to have sufficient resolution for the low frequencies, an analysis window of 743 msec is used, with an overlap of 7/8 = 87.5% (hop-size of 1/8 of a frame) between adjacent frames for preserving time resolution, obtaining a hop-size of 93 msec. Then, the Constant-Q transform is used to compute a 12 bin Chromagram with the method described in [28]. In order to compute the HCDF, Harte et al. apply a further transformation: the 12 bin Chromagram is mapped into a six dimensional representation called Tonal Centroid (TC), explained in Section 6.1.1.2. The HCDF at a given time frame n is then calculated using the euclidean distance between the time frames n − 1 and n + 1.

6.1.1.1 Chromagram calculation

A 36 bin Chromagram (1/3 of a semitone resolution) is constructed from the CQT coefficients in order to obtain an octave-independent representation of the spectrum. Let CQT_n(k) be the Constant-Q Transform at frame n and frequency bin k, and P_n(s) the 36 bin Chromagram at frame n, with s the Chromagram bin; we can compute the Chromagram as:

P_n(s) = \sum_{m=0}^{M} CQT_n(s + 36m), \qquad 1 \le s \le 36.    (6.1)

Then, a procedure for tuning frequency estimation is performed in order to compensate the frequency deviation due to a non-standard reference frequency tuning (the standard tuning reference frequency is widely accepted as f_ref = 440 Hz). In that way, the positions of the boundaries between the semitones are correctly determined and it is possible to allocate each bin of P_n to the correct pitch class of the final 12 bin Chromagram C_n. Since we test different types of

Chromagram, we will refer to Cn with the generic name CHF (CHroma Feature).
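As an illustration of the Chromagram computation described above, the sketch below folds a Constant-Q frame into 36 chroma bins as in (6.1) and then collapses it to 12 pitch classes. The grouping of the three 1/3-semitone bins per pitch class and the integer tuning_shift compensation are simplifying assumptions; the actual tuning compensation in [28] is more involved.

    import numpy as np

    def chroma_36(cqt_frame, bins_per_octave=36):
        # Fold one Constant-Q frame into a 36-bin chroma vector, as in (6.1).
        n = len(cqt_frame)
        pad = (-n) % bins_per_octave
        x = np.concatenate([cqt_frame, np.zeros(pad)])
        return x.reshape(-1, bins_per_octave).sum(axis=0)

    def chroma_12(p36, tuning_shift=0):
        # Collapse the 36-bin chromagram into 12 pitch classes; tuning_shift
        # (in 1/3-semitone bins) roughly compensates a non-standard reference tuning.
        p = np.roll(p36, -tuning_shift)
        return p.reshape(12, 3).sum(axis=1)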

6.1.1.2 Tonal Centroid mapping

The Tonal Centroid representation is particularly useful for locating changes in harmonic content, since it allows close harmonic relations to have a small euclidean distance in the TC space. The TC representation, also known as Tonnetz and first introduced by Euler [7], is widely used for modelling tonal relationships in European classical music.

For mapping the 12 bin Chromagram Cn into the TC representation ζn, a simple linear transformation is applied as follows:

\zeta_n(d) = \frac{1}{\|C_n\|_1} \sum_{l=0}^{11} \phi(d, l)\, C_n(l), \qquad 0 \le d \le 5,    (6.2)

where \|\cdot\|_1 is the L1 norm and φ is the transformation matrix obtained by concatenating the 12 column vectors φ_l defined as:

\phi_l = \begin{bmatrix} r_1 \sin(l\,\tfrac{7\pi}{6}) \\ r_1 \cos(l\,\tfrac{7\pi}{6}) \\ r_2 \sin(l\,\tfrac{3\pi}{2}) \\ r_2 \cos(l\,\tfrac{3\pi}{2}) \\ r_3 \sin(l\,\tfrac{2\pi}{3}) \\ r_3 \cos(l\,\tfrac{2\pi}{3}) \end{bmatrix}, \qquad 0 \le l \le 11    (6.3)

where r1 = 1, r2 = 1 and r3 = 0.5, chosen in order to match our perception of harmonic relations. For further details refer to [29].
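The Tonal Centroid mapping of (6.2)-(6.3) reduces to a small matrix product, as the following sketch shows (variable names are illustrative):

    import numpy as np

    # 6 x 12 transformation matrix of (6.3): column l is phi_l, 0 <= l <= 11.
    r1, r2, r3 = 1.0, 1.0, 0.5
    l = np.arange(12)
    PHI = np.vstack([
        r1 * np.sin(l * 7 * np.pi / 6), r1 * np.cos(l * 7 * np.pi / 6),
        r2 * np.sin(l * 3 * np.pi / 2), r2 * np.cos(l * 3 * np.pi / 2),
        r3 * np.sin(l * 2 * np.pi / 3), r3 * np.cos(l * 2 * np.pi / 3),
    ])

    def tonal_centroid(chroma):
        # Map a 12-bin chroma vector to the 6-D Tonal Centroid of (6.2).
        norm = np.sum(np.abs(chroma))                 # L1 norm
        return PHI @ chroma / norm if norm > 0 else np.zeros(6)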

6.1.1.3 HCDF calculation and segment identification

Harte et al. [29] calculate the HCDF H_n using the smoothed ζ_n sequence. The smoothing procedure is needed in order to reduce the effect of transients and noise. Each centroid vector is convolved in a row-by-row fashion with a Gaussian window with σ = 8. However, in our tests we can calculate the HCDF either on the smoothed TC sequence or directly on the smoothed Chroma Feature sequence. Let γ_n be the audio feature (Chroma Feature with or without the TC mapping) at the time frame n; the HCDF is computed as:

H_n = D(\gamma_{n-1}, \gamma_{n+1}),    (6.4)

where D(p, q) is a generic distance value between vector p and vector q. Harte et al. use the euclidean distance defined as:

D^e(p, q) = \sqrt{\sum_{i=0}^{I} (p_i - q_i)^2},    (6.5)

where I is the total number of elements in the vector (i.e., I = 6 if we use TC vectors, I = 12 if we use Chroma vectors). After the calculation of H, a further step is required to identify the harmonic boundaries. A peak-picking algorithm based on a simple thresholding technique is applied to the HCDF in order to detect the harmonic segment bounds.
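Putting the pieces together, a minimal sketch of the HCDF computation of (6.4)-(6.5) and of the threshold-based peak picking could look as follows; the Gaussian smoothing via scipy.ndimage.gaussian_filter1d and the simple local-maximum rule are assumptions standing in for the exact implementation.

    import numpy as np
    from scipy.ndimage import gaussian_filter1d

    def hcdf(features, sigma=8, dist=None):
        # HCDF of (6.4): distance between the frames surrounding each frame n.
        # `features` is a (num_frames x I) matrix of TC or chroma vectors.
        if dist is None:
            dist = lambda p, q: np.sqrt(np.sum((p - q) ** 2))   # euclidean (6.5)
        smoothed = gaussian_filter1d(features, sigma, axis=0)   # smooth each dimension over time
        H = np.zeros(len(smoothed))
        for n in range(1, len(smoothed) - 1):
            H[n] = dist(smoothed[n - 1], smoothed[n + 1])
        return H

    def segment_bounds(H, threshold):
        # Simple peak picking: local maxima of the HCDF above a fixed threshold.
        return [n for n in range(1, len(H) - 1)
                if H[n] > threshold and H[n] >= H[n - 1] and H[n] >= H[n + 1]]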

6.2 Chroma Features and Novelty calculation

Here we investigate how different Chroma Features influence the HCDF performance and, furthermore, we test the effect of using the TC mapping or not, and how other distance measures impact the accuracy of the segmentation algorithm. In the past few years, several novel approaches to Chroma Feature calculation have been proposed. Most of them rely on involved DSP techniques that allow computing Chromagrams that are invariant with respect to some musical facets. Timbre variation, transient noise and other unwanted noise can bring potential undesirable behaviour into the final result. The following sections give a brief description of the approaches that we have considered in our analysis. Furthermore, we describe the distance measures that we use to calculate the HCDF.

6.2.1 Other Chroma Features

In this section we describe the advanced methods for computing the 12 bin Chromagram that we use in our tests. The main difference with respect to the straightforward approach described in Section 6.1.1.1 is that the following methods are more robust to some musical facets that can potentially pollute the Chromagram representation. In other words, the advanced Chroma Features described below try to extract only the harmonically/melodically meaningful characteristics instead of mapping the entire musical spectrum into the 12 bin Chromagram.

6.2.1.1 Loudness-Based Chromagram (LBC)

This type of Chromagram [47] is based on the fact that the perception of loudness is not linearly proportional to the amplitude spectrum. Therefore the standard Chroma Features do not reflect the actual human perception of the audio content. Fletcher et al. [21] showed that loudness perception is almost linearly proportional to the 10 log10(·) of the power spectrum. Furthermore, Fletcher et al. discovered the so-called equal-loudness curve that describes the sensitivity of the human hearing system at different frequency ranges. This curve is used in the calculation of the LBC as a weighting coefficient (called A-weighting) for the power spectrum. In that way, the LBC Chroma reflects the behaviour of the human perception of audio signals. Another important aspect of the LBC lies in the pre-processing step. Before calculating the loudness-based spectrum, the tuning frequency is first estimated and then the harmonic part of the audio signal is extracted via Harmonic/Percussive Signal Separation (HPSS [48]). This procedure makes the LBC more robust against percussive/transient noise and, in general, non-tonal components. As an optional post-processing step, a beat-synchronous average can be applied to the LBC. For further details, please refer to [47].

6.2.1.2 Harmonic Pitch Class Profile (HPCP)

The main feature of the Harmonic Pitch Class Profile (HPCP [24]) is that the Chromagram is calculated using only the spectral peaks in the 50 Hz to 5 kHz range. The idea is to separate the sinusoidal components, which usually identify a peak in the audio spectrum, from the "noise" part. The concept is based on the sound modelling by sinusoids plus noise presented in [1]. In this way, the majority of the non-tonal components are discarded and the resulting Chroma Feature is more robust to transient and broad-band noise. Like the LBC, the HPCP also performs a reference frequency estimation, in order to tune the detected peaks to the equal-temperament note scale. Another important feature is that the energy of a single pitch class does not depend only on the energy of the peaks that belong to that specific pitch class, but also on the energy of the peaks at its partial frequencies up to 8 terms. This procedure is supported by the fact that the human hearing system uses both the fundamental frequency and its partials in order to detect the pitch of a note.

Note that in this case there can be some terminology ambiguity. We call HPCP the Chroma Feature proposed by Gómez et al., as they named it in [24]. Harte et al. in [28] named HPCP their Chroma Feature that, although it shares some common concepts with Gómez's HPCP, is calculated in a different way. We refer to the one developed by Gómez et al. as HPCP.

6.2.1.3 Chroma DCT-Reduced log Pitch (CRP)

The Chroma DCT-Reduced log Pitch was introduced by Müller et al. in [46] and its main goal is to derive a timbre-invariant Chroma representation. That means that if we imagine having different recordings of the same song that differ only in the instrumentation, they virtually have the same CRP representation. First, the audio signal is decomposed into 88 frequency bands with center frequencies corresponding to the frequencies of the keys of a piano. Then, the short-time mean-square power (local energy) for each of the 88 sub-bands is calculated, using a rectangular window of a fixed length and an overlap of 50%. The actual number of sub-bands is increased to 120 by adding 20 bands with zero energy in the lower part and 12 in the upper part. The resulting representation is a sequence of 120-dimensional vectors, called the pitch representation. Each entry e of the resulting sequence is replaced by the value log(C · e + 1), where C is a positive constant. The motivation for this logarithmic compression is to take into account the sensation of sound intensity, as mentioned for the LBC in Section 6.2.1.1. Next, a discrete cosine transform (DCT) is applied to each of the 120-dimensional pitch vectors, resulting in 120 coefficients, which are referred to as pitch-frequency cepstral coefficients (PFCC). Since the goal is to achieve timbre-invariance, the lower n−1 coefficients of the PFCC (we use n = 55 as suggested in [46]) are discarded by setting them to zero [66]. Each resulting 120-dimensional vector is then transformed by the inverse DCT. In the last stage, the entries of each enhanced pitch vector are projected onto the 12 chroma bins to obtain a 12-dimensional Chroma representation.

6.2.1.4 NNLS Chroma

This method [40] uses a Non-Negative Least-Squares (NNLS) approximate note transcription prior to the Chroma mapping. In order to do this, a log-frequency

DFT spectrum is calculated. Then, assuming equal temperament, the global tuning frequency of the piece is estimated from the spectrogram. After that, the log-frequency spectrogram is recalculated using a linear interpolation, taking into account the estimated reference frequency. Then, the background spectrum is estimated and subtracted from the original spectrum as explained in [40]. In order to estimate the notes that originate the current audio frame, we need a note dictionary describing the assumed profile of (idealised) notes and an inference procedure to determine the note activation patterns. Further details on this are given in [40]. After the NNLS decomposition the transcription spectrum is returned, and it is mapped into a 12 bin Chroma representation. This kind of Chroma Feature has virtually no noise due to transient or non-tonal components, since it is calculated using the estimated note transcription.

6.2.2 Distance measure

In our tests, we use different combinations of Chroma Features and distance measures. In the following we briefly describe the distance measures we consider in our evaluation.

1. Euclidean distance. This is probably the most used distance measure in several applications. It is calculated using (6.5).

2. Correlation distance. The correlation distance is defined as:

D^x(p, q) = \frac{\langle \bar{p}, \bar{q} \rangle}{\|\bar{p}\|_2 \cdot \|\bar{q}\|_2},    (6.6)

where ⟨·,·⟩ is the scalar product, \|\cdot\|_2 the L2 norm of the vector, and the bar denotes the removal of the average value.

3. Cosine distance. The cosine distance is quite similar to (6.6), except for the fact that the vectors are not, in general, zero-mean vectors. The cosine distance is defined as:

D^c(p, q) = \frac{\langle p, q \rangle}{\|p\|_2 \cdot \|q\|_2}.    (6.7)
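For reference, the three measures of (6.5)-(6.7) can be written compactly in Python as follows (a sketch; p and q are numpy vectors):

    import numpy as np

    def d_euclidean(p, q):                      # (6.5)
        return np.sqrt(np.sum((p - q) ** 2))

    def d_correlation(p, q):                    # (6.6): mean-removed vectors
        pb, qb = p - p.mean(), q - q.mean()
        return np.dot(pb, qb) / (np.linalg.norm(pb) * np.linalg.norm(qb))

    def d_cosine(p, q):                         # (6.7): no mean removal
        return np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))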

Table 6.1: Test-set: 16 Beatles songs with some additional information such as the song duration (min:sec) and the number of annotated chord transitions.

Song Title                            Duration   # of trans.
Please Please Me                      01:59      76
Do You Want To Know A Secret          01:55      113
All My Loving                         02:05      68
Till There Was You                    02:11      79
A Hard Day's Night                    02:28      91
If I Fell                             02:17      73
Eight Days A Week                     02:41      86
Every Little Thing                    01:59      70
Help!                                 02:16      49
Yesterday                             02:03      82
Drive My Car                          02:26      81
Michelle                              02:39      85
Eleanor Rigby                         02:03      35
Here There And Everywhere             02:21      91
Lucy In The Sky With Diamonds         03:25      104
Being For The Benefit Of Mr Kite      02:34      102

6.3 Evaluation

For the evaluation of the performances of the HCDF, we use the same procedure explained in [29]; hence, our results are directly comparable with the ones obtained by Harte et al. For the same reason, we use the same test-set, a collection of 16 Beatles songs shown in Table 6.1. The ground-truth chord transcription is freely available as a sub-set of the Isophonics dataset² used for the evaluation of ACE algorithms. The numbers of annotated transitions reported in Table 6.1 are slightly different from those shown in [29] due to minor updates to the dataset after the publication of [29].

² http://www.isophonics.net/

We then compute the performance measures such as the Precision (P), the Recall (R) and the F-Measure (F) defined in equation (5.13), where tp (True Positive) is the total number of correctly detected chord transitions, fn (False Negative) the number of missed transitions, and fp (False Positive) the number of detected transitions that are not real transitions. As in [29], we consider a detected transition as a True Positive (tp) if there is an annotated transition within ±278 msec with respect to the time-stamp of the detected transition. All the Chroma Features that we use are implemented by the respective authors as a Vamp plug-in for Sonic Annotator or in MATLAB, and all of them are available on-line and free to use. In our tests, we set the effective frame length to 93 msec for all extracted features. For the considered features, a simple thresholding technique is enough to do the peak picking. In order to avoid over-fitting, we chose a single threshold for each kind of distance, which is used regardless of the chosen Chroma Feature and of the beat-synchronization or Tonal Centroid mapping. More precisely, we use a threshold τ = 0.3 for the euclidean distance and τ = 0.03 for the correlation and cosine distances. In the following section we present the results of our tests and, for comparison, the results of Harte et al. and of the harmonic onset detection algorithm by Hainsworth et al. [26].
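A sketch of the transition matching used for the evaluation, assuming detected and annotated transition times are given in seconds; the greedy one-to-one matching within the ±278 msec tolerance is an assumption about how the counts are obtained:

    def match_transitions(detected, annotated, tol=0.278):
        # Count tp/fp/fn by matching each detected transition to an annotated one
        # within +/- tol seconds; each annotation can be used at most once.
        annotated = sorted(annotated)
        used = [False] * len(annotated)
        tp = 0
        for t in sorted(detected):
            for j, a in enumerate(annotated):
                if not used[j] and abs(t - a) <= tol:
                    used[j] = True
                    tp += 1
                    break
        fp = len(detected) - tp
        fn = len(annotated) - tp
        return tp, fp, fn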

6.4 Results

The results presented in Table 6.2 show the performance measures of our tests (the methods named HCDF[1 ... 24]) compared with the performance measures obtained by Harte et al. and Hainsworth et al. The highlighted row is the best performing combination in an F-Measure sense (HCDF5). The results are rounded to the second decimal for clarity, but we can state that HCDF[3, 5, 6] give almost the same results. One important fact to notice is that the Tonal Centroid mapping with the euclidean distance gives the best Recall performances, both globally and with a fixed Chroma Feature. That suggests that this combination is very useful for minimizing the number of False Negatives. Conversely, this strategy exhibits, for all Chroma Features, the worst Precision performances, since the number of False Positives remains high. This behaviour indicates that the HCDF calculated using the euclidean distance on the Tonal Centroid is too sensitive to small variations in the Chromagram. When the cosine distance is used instead of the euclidean one, we obtain the complete opposite effect for almost all Chroma Features. The cosine distance is the best performing distance in an F-measure sense for all of the considered features. Although it may not seem true for the NNLS feature, there is only one percentage point of difference with respect to the best F-measure for the NNLS feature. This means that the cosine distance sets a good trade-off between False Negatives and False Positives, and thus allows optimizing the F-measure. Overall, the most important thing to notice is that the Recall performances obtained in the works of Harte et al. and Hainsworth et al. are comparable to those obtained in our tests (except for a few outliers). However, the Precision indicators are significantly better for almost all of the configurations presented in our tests. The main reason is that all of the Chroma Features considered in our tests have been designed to deal with the effect of non-tonal noise while, at the same time, preserving the harmonic characteristics of the musical signal.

6.5 Conclusions

We presented a detailed evaluation of the methods to calculate the HCDF for musical chord segmentation, proposing the use of more effective Chroma Features. We have shown that the distance measure used for the HCDF computation is crucial for setting the trade-off between Precision and Recall, and we find that the cosine distance is the distance measure that exhibits the best compromise. Furthermore, we demonstrate how the most evolved Chroma Features improve the F-measure performance of a musical chord segmentation algorithm by up to 17%.

In this chapter we have shown how the Chroma Features are used for the task of chord boundary detection. Another use of the Chroma Features is to detect different renditions of the same musical composition (Cover Song Identification). This topic is covered in the next chapter.

Table 6.2: Results summary. The methods HCDF[1 ... 24] are the combinations of feature, Tonal Centroid mapping and distance measure tested in our work. For comparison, the last two rows of the table show the results of Hainsworth et al. [26] and Harte et al. [29].

7 Distance Fusion for Cover Song Identification

In this chapter, we propose a method to integrate the results of different cover song identification algorithms into one single measure which, on average, gives better results than the initial algorithms. The fusion of the different distance measures is made by projecting all the measures into a multi-dimensional space, where the dimensionality of this space is the number of the considered distances. In our experiments, we test two distance measures, namely Dynamic Time Warping and the Qmax measure, applied in different combinations to two features, namely a Salience feature and the Harmonic Pitch Class Profile (HPCP). While the HPCP is meant to extract a purely harmonic description, the Salience allows better discerning melodic differences. It is shown that the combination of two or more distance measures improves the overall performance. Cover song identification aims at finding different versions of the same musical piece within a large database of songs. In the last 10 years, a lot of work has been done to try to successfully accomplish this task. Thanks to the MIREX evaluation campaign for Music Information Retrieval (MIR) algorithms, this research topic has gained attention and methods have improved in accuracy. Different algorithms have been developed in the literature, and the standard approach to measuring similarity between cover songs is to exploit music facets shared between them. This similarity measure is computed using different descriptors (or features) extracted from the raw audio file.

In order for these descriptors to be effective, they have to be relatively insensitive to the majority of musical changes among covers, like tempo or key. Once the descriptors are extracted, a measure of distance between song descriptions is evaluated and a similarity score between songs is thus obtained. Hence, a cover song identification algorithm usually takes a query song as an input and, after a processing step for the extraction of the descriptors, performs a comparison between this song and all songs in a database, using the extracted feature. The result of this run is an ordered list of songs ranked with a distance criterion, where the most similar song must ideally rank first in this list. Of course, different features and different distances over such features can be used. The first largely used descriptor, the so-called Pitch Class Profile (PCP), was introduced in 1999 by Fujishima [22]. Over the years, the PCP (sometimes also called chromagram) was extended and modified in different variants, some of which are still successfully used in cover song identification (see HPCP [24]).

Among the possible techniques to compute distances between sequences of features, we can list two which are of particular importance for the cover song identification problem. One is Dynamic Time Warping (DTW [45, Chapter 4]), which aims to find an optimal alignment path between two given time-dependent sequences. It gives us a warping cost value between a song u and a song v. Another technique is introduced in [59], which uses the Cross-Recurrence

Plot [38] and recurrence quantification analyses like Qmax [61]. Different performances are obtained when using a specific feature with a specific distance measure, and it is not always easy to understand which feature-distance combination behaves better, since this obviously depends on the dataset at hand, on the query, etc. Moreover, even a single feature, when extracted with different settings, can give different performance. Based on this fact, for example, the system Hydra [52] combines features and distances extracted with different parameters, which are fed to a Support Vector Machine that outputs, for each pair of songs, a single bit decision of the type cover/non-cover. A similar approach is used in [57], where a distance is calculated over three different audio descriptors and a classifier is trained with a subset of known cover and non-cover song pairs. In our work instead, we do not apply any classification and no training is needed.

Figure 7.1: General scheme of the method.

We discuss an approach to cover song identification that involves a blind combination of different features and different distance measures, without making any assumption on the audio descriptors used. For the evaluation, we consider two types of features, the Salience function and the HPCP, combined with two distance measures, namely Dynamic Time Warping and Qmax. Each feature, when used with a given distance, allows sorting the songs in the database in order of decreasing similarity with a given query song. We propose a technique to combine the lists obtained by N feature-distance combinations into one single N-dimensional space in order to assess a "globally informed" distance measure. As we will see in Section 7.3, the combined list so obtained leads to a relative improvement in accuracy with respect to the results obtained by the single measures separately. Figure 7.1 shows the general outline of the algorithm.

In Section 7.1 we explain what features and what distances are involved in our tests, and in Section 7.2 we show how we combine them. In Sections 7.3 and 7.4 we present and discuss the accuracy results.

7.1 Audio Features and distance metrics

7.1.1 Audio Features

Here we give a brief introduction to the audio descriptors used in our tests. The described features are computed for each frame, for a total of Nf frames. This leads to a huge amount of data, which would make the complexity of any distance measure evaluation impractical. Hence, a temporal down-sampling is applied to the sequence of features to obtain a shorter sequence of length Nt < Nf. In our case, we use an adaptive decimation factor that depends on the beat duration, in order to obtain a beat-synchronous time average. This part is based on the algorithm presented in [15]. Where not explicitly mentioned, all of the feature and distance calculation algorithms have been re-implemented by the authors of this dissertation.

7.1.1.1 Pitch Salience Function

As presented in [56], a salience function for a given frequency f_i is calculated as a weighted sum of the energy at the first 8 harmonics of the fundamental frequency f_i, i.e., f_i, 2f_i, 3f_i, . . . . The pitch salience function is calculated at each frame using the amplitude spectrum and covers a frequency range from 55 Hz to 1.76 kHz (a 5 octave range, from A1 to A6) with a resolution of 1 bin/semitone.
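A simplified sketch of such a harmonic-summation salience, computed from one amplitude-spectrum frame, is given below; the harmonic weighting factor (here 0.8 per harmonic), the nearest-bin lookup and the bin count are illustrative assumptions and do not reproduce the exact weighting scheme of [56].

    import numpy as np

    def pitch_salience(spectrum, freqs, n_harmonics=8, harmonic_weight=0.8,
                       f_min=55.0, n_bins=61):
        # Salience for each semitone bin from A1 (55 Hz) up to A6 (1760 Hz): a weighted
        # sum of the amplitude spectrum at the first n_harmonics multiples of the
        # candidate frequency, using a nearest-spectral-bin lookup.
        salience = np.zeros(n_bins)
        for b in range(n_bins):
            f0 = f_min * 2.0 ** (b / 12.0)
            for h in range(1, n_harmonics + 1):
                fh = h * f0
                if fh > freqs[-1]:
                    break                         # harmonic above the analysed band
                k = int(np.argmin(np.abs(freqs - fh)))
                salience[b] += (harmonic_weight ** (h - 1)) * spectrum[k]
        return salience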

7.1.1.2 HPCP

An introduction to the computation of the HPCP descriptor has already been provided in Section 6.2.1.2. In brief, the signal is decomposed into a sequence of vectors that represent the energy of each Pitch Class of the twelve-tone equal tempered scale, calculated from the corresponding spectral peaks and the weighted summation of the energy of the peaks at its harmonic frequencies up to 8 terms. The reader is referred to [24] for a complete description.

7.1.2 Distance Measures

In this section we give a brief description of the distance measures used for the evaluation of our method.

7.1.2.1 CRP/Qmax

Basically, the Qmax distance calculates the length of the longest time segment in which two songs u and v exhibit similar feature patterns. This is done by using a cross-recurrence plot. A cross-recurrence plot (CRP) is a binary similarity matrix

C whose elements c_{i,j} are set to 1 when there is a recurrence between the i-th feature vector of song u and the j-th feature vector of song v, and zero otherwise. Here, a recurrence means that the euclidean distance between these two vectors is below a specified threshold. For more details, such as the threshold value and the CRP algorithm, see [38, 59]. When consecutive feature vectors are similar for a certain number of frames, a diagonal pattern of ones becomes visible in the CRP.

What the Qmax algorithm does is to quantify the presence and the length of these diagonal patterns in the CRP using an efficient recurrence quantification analysis [59]. In a nutshell, a cumulative matrix Q is computed over the elements of C, starting from the element c_{1,1} and counting the elements with value equal to 1 that are aligned in a diagonal way. Finally, the Qmax value is calculated as the maximum amplitude of the elements Q_{i,j} of the matrix Q as

Qmax = max (Qi,j) . (7.1)

The Qmax measure gives a similarity quantification of two input songs. In our case we need a distance measure that can be calculated as

d_{u,v} = \frac{\sqrt{N_t^v}}{Q_{max}},    (7.2)

where N_t^v is the length of the salience function of song v and plays the role of a normalization factor [59].
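The following sketch builds the CRP and a simplified Qmax score by accumulating diagonal runs of ones; the full algorithm in [59] also tolerates short disruptions of the diagonals and uses an adaptive recurrence threshold, so the fixed eps used here is an assumption.

    import numpy as np

    def cross_recurrence_plot(U, V, eps):
        # Binary CRP: c[i, j] = 1 when the euclidean distance between the i-th
        # feature vector of song u and the j-th of song v is below eps.
        diff = U[:, None, :] - V[None, :, :]
        return (np.sqrt((diff ** 2).sum(axis=2)) < eps).astype(int)

    def qmax_distance(U, V, eps):
        # Simplified Qmax: length of the longest diagonal run of ones in the CRP,
        # converted to the normalised distance of (7.2).
        C = cross_recurrence_plot(U, V, eps)
        Q = np.zeros(C.shape)
        for i in range(C.shape[0]):
            for j in range(C.shape[1]):
                if C[i, j]:
                    Q[i, j] = 1 + (Q[i - 1, j - 1] if i > 0 and j > 0 else 0)
        q_max = Q.max()
        return np.sqrt(len(V)) / q_max if q_max > 0 else np.inf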

7.1.2.2 Dynamic Time Warping

Dynamic Time Warping [45, Chapter 4] is a technique to find an optimal path to align two time sequences. Ideally, the two sequences are warped in a non-linear way to reach the maximum matching between each other. DTW itself gives a measure of distance between two sequences, and it can thus be used to assess similarity between two songs [60, 68]. With DTW, we obtain the total alignment cost DTW_{u,v} between two feature sequences u and v. For more details, see [45, Chapter 4]. We used the DTW implementation freely available at [17], with only one minor modification, namely that a normalization similar to that of (7.2) is applied to obtain d_{u,v} as follows:

d_{u,v} = \frac{DTW_{u,v}}{\sqrt{N_t^v}}.    (7.3)

In our tests this normalization leads to a performance improvement.

Figure 7.2: Cloud for a given q = q_0; d^1 is calculated using DTW on the Salience feature and d^2 with Qmax over the HPCP. The triangle identifies the correct cover song. Note that in this case the correct cover does not minimize d^2.
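A minimal DTW sketch with the normalization of (7.3); the standard match/insert/delete step pattern and the euclidean frame cost are assumptions, since the actual experiments rely on the implementation of [17].

    import numpy as np

    def dtw_cost(U, V):
        # Total DTW alignment cost between two feature sequences (rows = frames).
        n, m = len(U), len(V)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                c = np.linalg.norm(U[i - 1] - V[j - 1])
                D[i, j] = c + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
        return D[n, m]

    def dtw_distance(U, V):
        # Normalised DTW distance of (7.3).
        return dtw_cost(U, V) / np.sqrt(len(V))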

7.2 Distance Selection

In this section we describe the proposed technique for merging two or more feature and distance measure combinations in order to create a single ranking with improved performance. Independently from the used features and metrics, we assume that N different distance metrics d_{q,s} = [d^1_{q,s}, ..., d^N_{q,s}] are computed between the query song q and each song s ∈ [1 ... S] in the database. In a nutshell, the proposed method mixes the N distances by projecting them into an N-dimensional space in order to refine the ranked list in a more reliable way. The process is now described in detail. Assuming the cover song identification algorithm returns a square S × S cross-distance matrix D, where each element d_{q,s} of this matrix represents a distance between the song q and the song s in the database, and that we can calculate more than one distance matrix using different combinations of features and metrics, we obtain N distance matrices D^1, ..., D^N. For a fixed query song q = q_0 and a fixed distance metric n = n_0, we normalize the distance vector as

\bar{d}^{n_0}_{q_0,s} = \frac{d^{n_0}_{q_0,s}}{\max_{s \in [1...S]} \left[ d^{n_0}_{q_0,s} \right]}, \qquad \forall s \in [1 ... S]    (7.4)

in order to ensure that \bar{d}^{n_0}_{q_0,s} ∈ [0, 1]. Now, for a fixed query q = q_0 and a fixed song s = s_0 in the database, we define a point in an N-dimensional space that uniquely identifies the position of the pair (q_0, s_0) in the distance space:

\bar{d}_{q_0,s_0} = [\bar{d}^1_{q_0,s_0}, ..., \bar{d}^N_{q_0,s_0}] \in [0, 1]^N.    (7.5)

At this point we are able to compute \bar{d}_{q,s} ∈ [0, 1]^N for each q, s ∈ [1 ... S].

The points \bar{d}_{q,s} form an N-dimensional cloud. An example of one such cloud with N = 2 for a given query q is shown in Fig. 7.2. We now compute a new

S × S refined distance matrix R whose elements rq,s are defined as

r_{q,s} = \|\mathbf{1}\| - \|\bar{d}_{q,s} - \mathbf{1}\|,    (7.6)

where \|\cdot\| is the l2 norm and \mathbf{1} = [1, ..., 1] is the N-dimensional "one" vector. Since the vector \mathbf{1} represents the point in our space where all the distances are maximum, the term \|\bar{d}_{q,s} - \mathbf{1}\| in (7.6) expresses how far the pair (q, s) is from being a non-cover pair. It follows that (7.6) can be seen as a measure of how likely the pair is a cover pair. The origin of the N-dimensional space is the ideal place where a cover pair (q, s) should be placed, so intuitively one may think that each element of R could be calculated as r_{q,s} = \|\bar{d}_{q,s}\|. Though this approach does work in practice, in our tests this strategy leads to worse results compared to (7.6).
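The whole fusion step of (7.4)-(7.6) amounts to a per-row normalization followed by a distance from the all-ones corner, as in this sketch (the distance matrices are assumed to be stacked as numpy arrays):

    import numpy as np

    def fuse_distances(D_list):
        # Fuse N (S x S) distance matrices into the refined matrix R of (7.6).
        # Each row of each matrix is first normalised to [0, 1] as in (7.4).
        D = np.stack(D_list, axis=-1).astype(float)        # shape (S, S, N)
        row_max = D.max(axis=1, keepdims=True)             # per-query, per-metric maximum
        D_bar = D / np.where(row_max > 0, row_max, 1.0)
        N = D.shape[-1]
        one = np.ones(N)
        R = np.linalg.norm(one) - np.linalg.norm(D_bar - one, axis=-1)
        return R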

7.3 Results

The evaluation task is performed using the well known cover song dataset named covers80 [16]. We perform a comparison of the performances between the basic algorithms, which use one distance metric over a single feature type, and a number of combinations of features and distance metrics. The performance indicators used are some of the commonly used indicators in Music Information Retrieval: Precision, Mean Rank of the First Correctly Identified Cover (MR1st) and Mean Average Precision (MAP). The total virtual score T is calculated by counting the total number of unique correctly identified covers for all of the methods in the combination and normalizing by S. As we stated in Section 7.1, for our evaluation we use two feature types: the HPCP (H) and the Salience function

(S). For the distance metrics we use the Qmax measure (Q) and Dynamic Time Warping (D). Table 7.1 reports the accuracy indicators for different combinations of feature/distance. For the HPCP with Qmax approach, the accuracy indicators may differ from the original implementation of the algorithm by J. Serrà [59], since it has been completely rewritten by the authors. As we can see in Table 7.1, all the accuracy results with N > 1 bring an improvement with respect to the single distance measures. Although the best results are obtained using all (N = 4) of the possible combinations of basic distances, we can see that using (S,D)+(H,Q) we obtain a comparable result but with a lower dimensionality N = 2. Furthermore, we can see that the indicator T is proportional to the MAP. A higher T means a higher MAP and consequently a better precision.

7.4 Conclusions

We presented a method to merge N combinations of features and distance measures to increase the accuracy results of a cover song identification algorithm.

Combination              N   Prec.   MR1st   MAP    T
(S,D)                    1   0.47    12      0.55   -
(S,Q)                    1   0.52    15      0.60   -
(H,D)                    1   0.35    17      0.43   -
(H,Q)                    1   0.59    9       0.65   -
(S,D)+(S,Q)              2   0.61    12      0.66   0.61
(S,D)+(H,D)              2   0.46    12      0.53   0.55
(S,D)+(H,Q)              2   0.65    7       0.71   0.66
(S,Q)+(H,D)              2   0.56    13      0.63   0.60
(S,Q)+(H,Q)              2   0.64    10      0.69   0.69
(H,D)+(H,Q)              2   0.60    9       0.66   0.60
(S,D)+(S,Q)+(H,D)        3   0.60    11      0.66   0.66
(S,D)+(S,Q)+(H,Q)        3   0.65    9       0.71   0.74
(S,Q)+(H,D)+(H,Q)        3   0.66    9       0.70   0.70
(S,D)+(H,D)+(H,Q)        3   0.60    7       0.67   0.68
(ALL)                    4   0.66    8       0.72   0.75

Table 7.1: Accuracy results.

This method is based uniquely on a geometric N-dimensional distance measure that has a very low computational cost. A particularly useful combination has been obtained by using the Salience Feature with the Dynamic Time Warping similarity measure and the HPCP with the Qmax measure. This combination, in our tests, proved to give excellent performance with a low dimensionality N. The percentage T of virtual total unique correctly identified covers plays a fundamental role in the accuracy performance of the distance fusion process. The most important property of the method, however, is that it can be used to combine any set of different distance metrics, regardless of what they measure and without making any assumption on the specific audio features involved.

8 Conclusions

“Whether you can observe a thing or not depends on the theory which you use. It is the theory which decides what can be observed” — Albert Einstein

In this thesis we have covered several aspects of tonal/harmonic content extraction from digital audio recordings. We dealt with this problem at different levels, from the low level, where the detection of the basic atoms (sinusoids) was studied, up to the higher level tasks of cover song identification and harmonic bounds detection. Despite the fact that the research contributions in the MIR field are growing exponentially, there are still open questions. In this dissertation, we have focused on some of them and we have proposed our new contributions. Furthermore, we have seen that in some cases a lack of a reliable ground truth for testing or training is still an issue.

In the next section we give a brief summary on the contribution of our work.

8.1 Summary of contributions

In Chapter 3 we have introduced a novel technique that combines the amplitude spectrum and the phase coherence measure in order to refine the time-frequency representation of musical signals. Furthermore, we have demonstrated how the information given by the phase spectrum can improve the frequency localization of short term stationary sinusoids in an audio signal. When accurate frequency measures are needed, such as in partials tracking or note detection, our approach can bring benefits when the classical time-frequency analysis (e.g. STFT, or Constant-Q transform) is not enough.

In Chapter 4 we have shown that the lack of a scientific ground truth makes the evaluation of the performances of tuning estimation algorithms a non-trivial task. We have proposed a new data-set and a number of tests that allow us to compare different algorithms in terms of desirable behaviours such as speed of convergence, stability, and computational cost. Furthermore, we have shown how different tuning frequency estimation algorithms perform in different conditions, such as local or global estimation or under a real-time constraint. Moreover, in our tests we have demonstrated that using k = 5 spectral peaks per frame suffices for all of the estimation tasks presented here. Our tests have also shown that the Circ algorithm outperforms the other ones on all of the considered evaluation tasks. We have also shown that all of the MIR algorithms that need a reference frequency estimation and already use a peak picking algorithm can benefit from using the Circ and L-S methods for the local and global f_ref estimation, with consistent estimation even when using small amounts of data.

In Chapter 5, we have proposed a novel pitch salience measure. The performance tests have shown that this kind of salience, even with simple post-processing procedures, is suitable for extracting audio features like the Pitch Class Profile used in cover song detection or key/chord recognition tasks. Moreover, especially for a piano music test-set such as MAPS, considering the string inharmonicity is beneficial in terms of Precision and F-Measure. Despite the fact that our salience function looks promising, further development of an ad-hoc post-processing procedure is needed in order to use it for multi-pitch estimation. Furthermore, the threshold parameter T can be tuned depending on the application, in order to favour the Precision or the Recall. During our tests we have identified some weaknesses that are the subjects of future research. For example, the accuracy of the peak picking algorithm and the octave ambiguity can be treated by developing specific procedures. Furthermore, a bad resolution in the low frequency spectrum can lead to a large error when calculating the high order harmonic frequencies. Conversely, the notes in the high portion of the audio spectrum do not have a sufficient number of partials to give a consistent value of salience, because of the spectral roll-off near the Nyquist limit.

In Chapter 6, we have presented a detailed evaluation of the methods to calculate the HCDF for musical chord segmentation, and we have proposed the use of more effective Chroma Features. We have shown that the distance measure used for the HCDF computation is crucial for setting the trade-off between Precision and Recall, and we have found that the cosine distance is the distance measure that exhibits the best compromise. Furthermore, we have demonstrated how the most evolved Chroma Features improve the F-measure performance of a musical chord segmentation algorithm by up to 17%.

Finally, in Chapter 7, we have presented a method to merge N combinations of features and distance measures in order to increase the accuracy results of a cover song identification algorithm. This method is based uniquely on a geometric N-dimensional distance measure that has a very low computational cost. A particularly useful combination has been obtained by using the Salience Feature with the Dynamic Time Warping similarity measure and the HPCP with the Qmax measure. This combination, in our tests, has proved to give excellent performance with a low dimensionality N. The most important property of the method, however, is that it can be used to combine any set of different distance metrics, regardless of what they measure and without making any assumption on the specific audio features involved.

8.2 Future Perspectives

Almost all of the aspects of MIR are open to new contributions. In order to guarantee fair testing and comparison, datasets and reliable ground-truths are needed. Standardized testing procedures and performance indicators are not clear for some tasks, such as tuning frequency estimation. Furthermore, for some tasks such as musical structure segmentation, it is difficult to develop rigorous testing rules, mainly because of the subjectivity about what the structure of a given song is and what the level of granularity of the segmentation should be. This is one of the difficulties in this relatively young research field.

Another example is Automatic Chord Estimation (ACE). Ground-truth alignment is one of the issues of this task. Also, the evaluation method for this kind of algorithm is a source of debate, though the recent works presented in [27, 39] and [49] are widely accepted, and the latter has become the current evaluation procedure used in the MIREX evaluation campaign for ACE methods. Defining a commonly accepted chord vocabulary, nomenclature and score measure is crucial for ensuring a fair comparison between different algorithms and for providing a good performance assessment. Furthermore, this can also be used to help the analysis of the methods and allows moving towards better algorithms.

In MIR there are also some tasks that are still far from having satisfactory accuracy, as for example multi-pitch estimation. There are some methods that perform very well in the case of mono-instrument polyphonic compositions, like the work presented in [18] that shows good accuracy, but it works, as it is, only for piano sounds. In general this is a complicated task, mainly because it is difficult to associate each spectral peak to the note's fundamental frequency that generates it, especially when a mixture of tones creates several overlapping partials. This fact, if compared with the relative ease with which the trained human ear can distinguish all of the played notes, even in a poly-instrumental environment, suggests that we are far away from developing accurate multi-pitch estimation algorithms.

In this dissertation we have studied different signal processing strategies, from locating single sinusoidal components with increased frequency accuracy to methods that exploit higher-level musical features, such as harmonic content. High-level musical features are used to detect chord boundaries or to find different renditions of the same musical piece. Furthermore, we have proposed a novel salience measure aimed at the analysis of polyphonic musical compositions. Nevertheless, our salience measure is only suitable as a preliminary step, since an ad-hoc procedure needs to be developed in order to use our findings in a more complex algorithm, such as a multi-pitch musical transcription procedure.

In all of the different abstraction levels of the tonal content analysis of a digital recording addressed in this thesis, several issues have emerged. Some of them are still an open subject for future research and represent an interesting challenge for the author of this dissertation.

Bibliography

[1] Xavier Amatriain, Jordi Bonada, Alex Loscos, and Xavier Serra. In Udo Zölzer, editor, DAFX - Digital Audio Effects, chapter Spectral processing, pages 373–438. John Wiley & Sons, Baffins Lane, Chichester, West Sussex, PO 19 1UD, England, 2002.

[2] Francois Auger and Patrick Flandrin. Improving the readability of time-frequency and time-scale representations by the reassignment method. IEEE Transactions on Signal Processing, 43(5), 1995.

[3] Emmanouil Benetos and Simon Dixon. Multiple-f0 estimation of piano sounds exploiting spectral structure and temporal evolution. Proc. of ISCA Tutorial and Research Workshop on Statistical and Perceptual Audition, pages 13–18, 2010.

[4] Emmanouil Benetos and Simon Dixon. Polyphonic music transcription using note onset and offset detection. Proc. of IEEE International Conference on , Speech and Signal Processing, (ICASSP), pages 37–40, 2011.

[5] Judith C. Brown and Miller S. Puckette. An efficient algorithm for the calculation of a constant q transform. Journal of the Acoustical Society of America, 92(5):2698–2701, 1992.

[6] Leon Cohen. Time-Frequency Analysis. Prentice Hall PTR, Upper Saddle River, New Jersey 07458, 1994.

[7] Richard Cohn. Introduction to neo-Riemannian theory: A survey and a historical perspective. Journal of Music Theory, 48(2):167–180, 1998.

[8] A. Degani. The ’ms2012’ symbolic music dataset. URL: http://www.ing.unibs.it/alessio.degani/?p=ms2012, 2012.

[9] Simon Dixon. A dynamic modelling approach to music recognition. Proc. of the Int. Computer Music Conf., (ICMC), pages 83–86, 1996.

[10] Simon Dixon, Dan Tidhar, and Emmanouil Benetos. The temperament police: The truth, the ground truth, and nothing but the truth. Proc. of the 12th Int. Conf. on Music Information Retrieval, (ISMIR), 2011.

[11] Karin Dressler. Sinusoidal extraction using an efficient implementation of a multi-resolution fft. Proc. of the 9th Int. Conf. on Digital Audio Effects, (DAFX), 2006.

[12] Karin Dressler. Audio melody extraction for mirex 2009. 5th Music Information Retrieval Evaluation eXchange, (MIREX), 2009.

[13] Karin Dressler. Pitch estimation by the pair-wise evaluation of spectral peaks. Proc. of Audio Engineering Society Conference: 42nd International Conference: Semantic Audio, 2011.

[14] Karin Dressler and Sebastian Streich. Tuning frequency estimation using circular statistics. Proc. of the 8th Int. Conf. on Music Information Retrieval, (ISMIR), pages 357–360, 2007.

[15] Daniel P. W. Ellis. Beat tracking by dynamic programming. Journal of New Music Research, 36(1):51–60, 2007.

[16] Daniel P. W. Ellis. The ’covers80’ cover song data set. URL: http://labrosa.ee.columbia.edu/projects/coversongs/covers80/, 2007.

[17] Daniel P. W. Ellis. Dynamic time warp (dtw) in matlab. URL: http://labrosa.ee.columbia.edu/matlab/dtw/, 2008.

[18] Valentin Emiya, Roland Badeau, and Bertrand David. Multipitch estimation of piano sounds using a new probabilistic spectral smoothness principle. IEEE Transactions on Audio, Speech, and Language Processing, 18(6):1643–1654, 2010.

[19] J. L. Flanagan and R. M. Golden. Phase vocoder. Bell Systems Technical Journal, 45:1493–1509, 1966.

[20] Harvey Fletcher, E. Donnell Blackham, and Richard Stratton. Quality of piano tones. Journal of Acoustical Society of America, 34(6):749–761, 1962.

[21] Harvey Fletcher and W. A. Munson. Loudness, its definition, measurement and calculation. The Journal of the Acoustical Society of America, 5(2):82–108, 1933.

[22] Takuya Fujishima. Realtime chord recognition of musical sound: a system using common lisp music. Proc. of the Int. Computer Music Conference, (ICMC), pages 464–467, 1999.

[23] Volker Gnann, Markus Kitza, Julian Becker, and Martin Spiertz. Least-squares local tuning frequency estimation for choir music. Proc. of the 131st AES Convention, 2011.

[24] Emilia Gómez. Tonal Description of Music Audio Signals. PhD thesis, Universitat Pompeu Fabra, Barcelona, Spain, 2006.

[25] Bertrand Gottin, Irena Orovic, Cornel Ioana, Srdjan Stankovic, and Jocelyn Chanussot. Signal characterization using generalized “time-phase derivatives” representation. Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3001–3004, 2009.

[26] S. Hainsworth and M. Macleod. Onset detection in musical audio signals. Proceedings of International Computer Music Conference, ICMC, 2003.

[27] Christopher Harte. Towards Automatic Extraction of Harmony Information from Music Signals. PhD thesis, Department of Electronic Engineering, Queen Mary, University of London, 2010.

[28] Christopher Harte and Mark Sandler. Automatic chord identification using a quantised chromagram. Proceedings of the 118th AES Convention, pages 66–71, 2005.

[29] Christopher Harte, Mark Sandler, and Martin Gasser. Detecting harmonic change in musical audio. Proceedings of the 1st ACM workshop on Audio and music computing multimedia, pages 21–26, 2006.

[30] ISO. Acoustics - standard tuning frequency (standard musical pitch). ISO 16:1975, 1975.

[31] Maksim Khadkevich and Maurizio Omologo. Phase-change based tuning for automatic chord recognition. Proc. of the 12th Int. Conf. on Digital Audio Effects, (DAFx), 2009.

[32] Kunihiko Kodera, Roger Gendrin, and Claude Villedary. Analysis of time- varying signals with small bt values. IEEE Transactions on Acoustics, Speech and Signal Processing, 26(1):64–76, 1978.

[33] Sebastian Kraft, Martin Holters, Adrian von dem Knesebeck, and Udo Zölzer. Improved pvsola time-stretching and pitch-shifting for polyphonic audio. Proc. of the 15th Int. Conference on Digital Audio Effects (DAFx-12), 2012.

[34] Mathieu Lagrange and Sylvain Marchand. Estimating the instantaneous frequency of sinusoidal components using phase-based methods. Journal of AES, 55(5), 2007.

[35] Cheng-Te Lee, Yi-Hsuan Yang, and Homer H. Chen. Multipitch estimation of piano music by exemplar-based sparse representation. IEEE Transaction on Multimedia, 14(2):608–618, 2012.

[36] Alexander Lerch. On the requirement of automatic tuning frequency estimation. Proc. of the 7th Int. Conf. on Music Information Retrieval, (ISMIR), pages 212–215, 2006.

[37] Yipeng Li and DeLiang Wang. Pitch detection in polyphonic music using instrument tone models. Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing, (ICASSP), 2:481–484, 2007.

[38] Norbert Marwan, M. Carmen Romano, Marco Thiel, and Jürgen Kurths. Recurrence plots for the analysis of complex systems. Physics Reports, 438(5):237–329, 2007.

[39] Matthias Mauch. Automatic Chord Transcription from Audio Using Computational Models of Musical Context. PhD thesis, School of Electronic Engineering and Computer Science, Queen Mary, University of London, 2010.

[40] Matthias Mauch and Simon Dixon. Approximate note transcription for the improved identification of difficult chords. Proceedings of 11th International Society for Music Information Retrieval Conference, ISMIR, pages 135–140, 2010.

[41] Matthias Mauch and Simon Dixon. Simultaneous estimation of chords and musical context from audio. IEEE Transactions on Audio, Speech, and Language Processing, 18(6):1280–1289, 2010.

[42] Matthias Mauch, Katy Noland, and Simon Dixon. Using musical structure to enhance automatic chord transcription. Proceedings of 10th International Society for Music Information Retrieval Conference, ISMIR, 2009.

[43] Matt McVicar, Raúl Santos-Rodríguez, Yizhao Ni, and Tijl De Bie. Automatic chord estimation from audio: A review of the state of the art. IEEE Transactions on Audio, Speech, and Language Processing, 22(2):556–575, 2014.

[44] Alexis Moinet and Thierry Dutoit. Pvsola: A phase vocoder with synchronized overlap-add. Proc. of the 14th Int. Conference on Digital Audio Effects (DAFx-11), 2011.

[45] Meinard Müller. Information Retrieval for Music and Motion. Springer, 2007.

[46] Meinard Müller and Sebastian Ewert. Towards timbre-invariant audio features for harmony-based music. IEEE Transactions on Audio, Speech, and Language Processing, 18(3):649–662, 2010.

[47] Yizhao Ni, Matt McVicar, Raul Santos-Rodriguez, and Tijl De Bie. An end-to-end machine learning system for harmonic analysis of music. IEEE Transaction on Audio, Speech, and Language Processing, 20(6):1771–1782, 2012.

[48] Nobutaka Ono, Kenichi Miyamoto, Jonathan Le Roux, Hirokazu Kameoka, and Shigeki Sagayama. Separation of a monaural audio signal into harmonic/percussive components by complementary diffusion on spectrogram. Proceedings of 16th European Signal Processing Conference, EUSIPCO, 2008.

[49] Johan Pauwels and Geoffroy Peeters. Evaluating automatically estimated chord sequences. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 749–753, 2013.

[50] Geoffroy Peeters. Chroma-based estimation of musical key from audio-signal analysis. Proc. of the 7th Int. Conf. on Music Information Retrieval, (ISMIR), 2006.

[51] Geoffroy Peeters. Musical key estimation of audio signal based on hidden Markov modeling of chroma vectors. Proc. of the 9th Int. Conf. on Digital Audio Effects, (DAFx), pages 127–131, 2006.

[52] Suman Ravuri and Daniel P.W. Ellis. The hydra system of unstructured cover song detection. MIREX 2009 extended abstract, 2009.

[53] Matti Ryynänen. Probabilistic modelling of note events in the transcription of monophonic melodies. Master's thesis, Tampere University of Technology, 2004.

[54] Matti P. Ryynanen and Anssi P. Klapuri. Automatic transcription of melody, bass line, and chords in polyphonic music. Computer Music Journal, 32(3), 2008.

[55] Justin Salamon and Emilia Gómez. A chroma-based salience function for melody and bass line estimation from music audio signals. Proc. of Sound and Music Computing Conference (SMC), pages 331–336, 2009.

[56] Justin Salamon, Emilia Gómez, and Jordi Bonada. Sinusoid extraction and salience function design for predominant melody estimation. Proc. of the 14th Int. Conference on Digital Audio Effects, (DAFx), 2011.

[57] Justin Salamon, Joan Serrà, and Emilia Gómez. Melody, bass line, and harmony representations for music version identification. Proceedings of the 21st international conference companion on World Wide Web (WWW 2012), 2012.

[58] Ralph O. Schmidt. Multiple emitter location and signal parameter estimation. Proc. RADC Spectrum Estimation Workshop, pages 243–258, 1973.

[59] Joan Serrà. Identification of Versions of the Same Musical Composition by Processing Audio Descriptions. PhD thesis, Universitat Pompeu Fabra, Barcelona, Spain, 2011.

[60] Joan Serrà, Emilia Gómez, Perfecto Herrera, and Xavier Serra. Chroma binary similarity and local alignment applied to cover song identification. IEEE Trans. on Audio, Speech and Language Processing, 16(6):1138–1152, 2008.

[61] Joan Serrà, Xavier Serra, and Ralph G. Andrzejak. Cross recurrence quantification for cover song identification. New Journal of Physics, 11, 2011.

[62] Xavier Serra. Musical sound modeling with sinusoids plus noise. In C. Roads, S. Pope, A. Picialli, and G. De Poli, editors, Musical Signal Processing, pages 91–122. Swets & Zeitlinger Publishers, Lisse, the Netherlands, 1997.

[63] Kevin M. Short and Ricardo A. Garcia. Signal analysis using the complex spectral phase evolution (cspe) method. 120th AES Convention, Paris, France, 2006.

[64] Elias M. Stein and Rami Shakarchi. Fourier Analysis: an introduction. Princeton University Press, 41 William Street, Princeton, New Jersey 08540, 2002.

[65] Tiago Fernandes Tavares, Jayme Garcia Arnal Barbedo, and Amauri Lopes. Improving a multiple pitch estimation method with ar models. Proc. of Audio Engineering Society Conference: 42nd International Conference: Semantic Audio, 2011.

[66] Hiroko Terasawa, Malcolm Slaney, and Jonathan Berger. The thirteen colors of timbre. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, WASPAA, 2005.

[67] Ernst Terhardt, Gerhard Stoll, and Manfred Seewann. Algorithm for extraction of pitch and pitch salience from complex tonal signals. Journal of the Acoustical Society of America, 71:679–688, 1982.

[68] Wei-Ho Tsai, Hung-Ming Yu, and Hsin-Min Wang. Using the similarity of main melodies to identify cover versions of popular songs for music document retrieval. Journal of Information Science and Engineering, 24:1669–1687, 2008.

[69] Emmanuel Vincent, Nancy Bertin, and Roland Badeau. Harmonic and inharmonic nonnegative matrix factorization for polyphonic pitch transcription. Proc. of IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pages 109–112, 2008.

[70] Yongwei Zhu, Mohan S. Kankanhalli, and Sheng Gao. Music key detection for musical audio. Proc. of the 11th Int. Conf. on Multimedia Modelling, (MMM), pages 30–37, 2005.

[71] Udo Zölzer. DAFX - Digital Audio Effects. John Wiley & Sons, LTD, Baffins Lane, Chichester, West Sussex, PO 19 1UD, England, 2002.

[72] Eberhard Zwicker. Psychoacoustics: Facts and Models. Springer-Verlag, Berlin, Heidelberg, 2006.