A Report on the Project

Cover Song Identification

Submitted in partial fulfillment of the requirements for the award of the degree of

BACHELOR OF ENGINEERING

in

ELECTRONICS AND TELECOMMUNICATION

by

Dhikesh Karuvankandy (Roll No. 47)
Shreyas Menon (Roll No. 59)
Manoj Molankar (Roll No. 63)
Shyamsundar Gupta (Roll No. 72)

Under the guidance of Mr. Santosh Chapaneri

Department of Electronics & Telecommunication Engineering
St. Francis Institute of Technology, Mumbai
University of Mumbai
2017-2018

ABSTRACT

People's interest in music has been growing for ages, and the scenario remains the same even today. More and more people have discovered and acknowledged their love of music and have contributed substantially to the development of the music industry. A recent trend is the recording of covers of original songs. "Cover song" is the generic term used to denote a new performance of a previously recorded track; for example, a cover song may refer to a live performance, a remix or an interpretation in a different musical style. Interest in cover songs has increased, as people sometimes find them better than the original track. Our project aims to identify the cover songs available for every track by calculating the distance between them. We study the use of DTW for calculating this distance and check its precision and accuracy. To improve the precision, accuracy and time complexity we use SiMPle (Similarity Matrix Profile). We first divide songs into subsequences and then find the distance between them using MASS (Mueen's Algorithm for Similarity Search), the fastest known technique for this distance calculation. It makes use of a 'subsequence similarity join', which significantly reduces the computational time and provides better results.

Keywords: Cover song, Distance calculation, MASS, SiMPle, Subsequence, Similarity Join.


CONTENTS

Chapter 1: Introduction ...... 1

1.1 Motivation ...... 2

1.2 Scope of the Project ...... 4

1.3 Organization of the Project ...... 5

Chapter 2: Theoretical Background ...... 6

2.1 Musical Theory ...... 6
2.2 Musical Dataset ...... 12
2.3 Chromagram ...... 13

2.4 Longest Common Subsequence ...... 18

2.5 Distance Calculation Algorithms ...... 18

Chapter 3: Literature Survey ...... 22

3.1 Chroma features and Dynamic Programming beat tracking ...... 23

3.2 Music Shapelets for Fast Cover Song Recognition ...... 25

3.3 Audio Cover Song Identification Based On Tonal Sequence Alignment ...... 26

3.4 A Heuristic for Distance Fusion in Cover Song Identification ...... 28
3.5 Music Fingerprint Extraction for Cover Song Identification ...... 29
3.6 Large-Scale Cover Song Identification Using Chord Profiles ...... 30

3.7 Large-Scale Cover Song Recognition Using Hashed Chroma Landmarks ...... 31

3.8 Cover Song Identification with 2D Fourier Transform Sequences ...... 33
Chapter 4: Cover Song Recognition Using SiMPle ...... 34
4.1 Description of SiMPle ...... 35

4.2 Optimal Transposition Index (OTI) ...... 37

4.3 Process of SiMPle ...... 39

Chapter 5: Performance Evaluation ...... 40

5.1 Evaluation Measures ...... 40

5.2 Evaluation of dataset using DTW and SiMPle ...... 41

5.2.1 Covers80 Dataset ...... 41

5.2.2 YouTube Covers Dataset ...... 41

5.3 Result Analysis...... 45

5.4 Application User Interface ...... 46

Chapter 6: Conclusion and Future work ...... 50

6.1 Conclusion ...... 50

6.2 Future work ...... 50
Appendix A ...... 52

MATLAB ...... 52

Chroma Toolbox ...... 52

Application User Interface ...... 54

COVERS 2018 ...... 55

Appendix B ...... 60

Timeline Chart of the Project ...... 60

References ...... 62

Acknowledgement ...... 64


LIST OF FIGURES

Fig. 1.1 Summer of 69 by various artists ...... 1

Fig. 1.2 Organized Records ...... 2

Fig. 1.3 Copyright Management ...... 3

Fig. 1.4 Options available to a user ...... 4

Fig. 2.1 Musical Elements ...... 6

Fig. 2.2 Octaves ...... 7

Fig. 2.3 C Major Scale ...... 7

Fig. 2.4 Chords in Music ...... 8

Fig. 2.5 Intervals in Music ...... 8
Fig. 2.6 Music editor re-mastering some music piece ...... 9

Fig. 2.7 Instrumental Performance ...... 9

Fig. 2.8 Acoustic Instruments ...... 10

Fig. 2.9 (a) Musical score of a C-major scale. (b) Chromagram obtained from the score. (c) Audio recording of the C-major scale played on a piano. (d) Chromagram obtained from the audio recording ...... 14

Fig. 2.10 Normalized Chromagram ...... 15

Fig. 2.11 Normalized Log Chromagram ...... 16

Fig. 2.12 CENS Chromagram ...... 16

Fig. 2.13 Overview of the feature extraction pipeline ...... 17

Fig. 2.14 Example of DTW calculation ...... 19

Fig. 3.1 Example of Beat Synchronous Chroma Features ...... 23

Fig. 3.2 Illustration of Cover song matching ...... 24

Fig. 3.3 Excerpts Retrieval ...... 25
Fig. 3.4 General Procedure to generate Triplets ...... 26
Fig. 3.5 General Block diagram of the System ...... 27
Fig. 4.1 Process of SiMPle ...... 38

Fig. 5.1 Covers 2018 ...... 41

Fig. 5.2 Distance profile of a song with itself using DTW ...... 42

Fig. 5.3 Distance profile of a song with its cover using DTW ...... 42

Fig. 5.4 Distance profile of a song with a random song using DTW ...... 42

Fig. 5.5 Distance profile of a song with itself using SiMPle ...... 43

Fig. 5.6 Distance profile of a song with its cover using SiMPle ...... 43

Fig. 5.7 Distance profile of a song with a random song using SiMPle ...... 43

Fig. 5.8 Distance profile of the Covers 2018 dataset using DTW ...... 44

Fig. 5.9 Distance profile of the Covers 2018 dataset using SiMPle ...... 44

Fig 5.10 GUI guide...... 46

Fig 5.11 Cover Song Identifier (User interface) ...... 47

Fig 5.12 Display of Music list ...... 48

Fig 5.13 Playing ‘Havana’ song ...... 48

Fig 5.14 Searching for cover songs ...... 49
Fig 5.15 Display of list of cover songs ...... 49


LIST OF TABLES

Table I Calculation of Sliding Dot Products ...... 20

Table II Mueen's Algorithm for Similarity Search (MASS) ...... 21

Table III Calculate SiMPle and SiMPle index ...... 35

Table IV System Evaluation Results using DTW ...... 45

Table V System Evaluation Results using SiMPle ...... 45

Table VI Some MATLAB Functions in the Chroma Toolbox ...... 53


LIST OF ABBREVIATIONS

MIR Music Information Retrieval

MIREX Music Information Retrieval Evaluation Exchange

CP Chroma Pitch

CLP Chroma Log Pitch

CENS Chroma Energy Normalized Statistics

MIDI Musical Instrument Digital Interface

DTW Dynamic Time Warping

MASS Mueen’s Algorithm for Similarity Search

MAP Mean Average Precision

MR1 Mean Rank 1

GUI Graphical User Interface

SiMPle Similarity Matrix Profile


Chapter 1: Introduction

A cover song is a generic term used to denote a new performance of a previously recorded track; it is basically an alternative rendition of a previously recorded song. Musicians play covers as a homage or tribute to the original performer or band. Sometimes, new versions are rendered to translate songs into other languages, to adapt them to the tastes of a particular country or region, to contemporize old songs, to introduce new artists, or just for the simple pleasure of playing a familiar song. In addition, cover songs represent an opportunity for beginners as well as established artists to perform a radically different interpretation of a musical piece. Therefore, today, and even though it may not be the most precise way to name it, a cover song can mean any new version, performance, rendition, or recording of a previously recorded track.


Fig. 1.1: Summer of 69 by various artists

From the perspective of audio content processing, cover song identification yields important information on how musical similarity can be measured and modeled. Automatic cover song identification is a surprisingly difficult classical problem that has long been of interest to the music information retrieval community. This problem is significantly more challenging than traditional audio fingerprinting because a combination of tempo changes, musical key transpositions, embellishments in time and expression, and changes in vocals and instrumentation can all occur simultaneously

between the original version of a song and its cover. The Music Information Retrieval (MIR) community has paid much attention to this task in recent years and many approaches have been proposed.

Our project aims to identify the cover songs available for every track in the list of available songs. These songs are then added to a playlist as and when demanded by the user. This makes it easy and comfortable for the user to get cover songs of his or her choice, thereby saving a lot of time. There are several techniques available for this task, but our project uses a 'subsequence similarity join' which significantly reduces the computational time and provides much better results.

1.1 Motivation

A recent trend shows that people love to sing covers of various songs. Love for cover songs has increased, as people sometimes find a cover better than the original song. So whenever an individual listens to an original song, he or she may also want to hear covers of the same. In such a situation, searching for the covers in the music collection of a device becomes a difficult task. With the ease of producing music today, managing a collection of songs has become a demanding task. Consider a common iPod, which can carry about 20,000 songs, a storage device which can carry up to 250,000 songs, or even an online music store which has millions of songs available in its database for the customer.

Fig. 1.2: Organized Records

Such a large musical database raises new and interesting computational problems which need solving, such as direct access to large collections; thus there is a need for algorithms for automatic search and automatic music similarity estimation. This estimation is based on musical features such as harmony, melody, rhythm and instrumentation, which describe the audio at a new level.

Identifying the cover given the original as a query, or finding the original given a cover, using only musical features is quite a challenge, and it is researched by a community named the Music Information Retrieval Evaluation eXchange (MIREX). MIREX is a community-based endeavor to scientifically evaluate Music Information Retrieval (MIR) algorithms and techniques. These algorithms address various tasks such as audio classification, audio beat tracking, audio tempo estimation, audio music similarity and retrieval, and audio cover song identification. MIR systems have the ability to manage extremely large digital music databases in an efficient and reliable manner, paving the way for future development of the music-related industry.

Fig. 1.3: Copyright Management

Copyright issues can also be identified directly, since a reference database can be stored to establish the original recording. Thus, it is an application which serves the security of the music industry. This technique could be used in a number of applications such as: automatic playlist generators; organization and visualization of large music collections; systems for recommending unknown pieces and/or artists matching one's preferences; or even a system which helps a DJ choose the next song so that the current and the next have a similar rhythm and/or melody. Applying this technique helps in organizing covers according to their content instead of searching by the name of the song.

Another practical use of musical similarity is a Query-by-Example (QBE) system, whereby the user wants to search the music collection for a particular song but can remember neither the song's title nor the artist's name to find the song in the database. But what if he remembers a melody or possesses a recorded part of the song?

We may use musical similarity in such a way that the user can load, sing, whistle or hum the melody (in this case a Query-by-Humming (QBH) system) and obtain a list of the closest matches ranked by their similarity measure. Identifying two different versions of a song requires considering variations along several musical dimensions. The main attributes that may vary between two cover songs are: timbre, representing differences in instrumentation and production techniques; noise, present e.g. during audience manifestations; key, when a song is transposed to a different key to adapt the pitch range to a particular singer; language of the lyrics, when the song is translated; timing, depending on the performer's feeling; and tempo, structure, duration or even genre. The feature which is robust enough to endure several of the above-mentioned changes, and which is used throughout this work, is the chromagram, otherwise known as the Pitch Class Profile (PCP) of the song; it is described in more detail later in this project.

Fig. 1.4: Options available to a user

1.2 Scope of the Project

We intend to make a project which is capable of identifying similar songs. This would tremendously help in the collection and organization of large musical databases according to the content present, instead of organizing by the name of the song. Copyright issues can also be directly identified, contributing to the security of the music industry. This technique could be used in a number of applications such as automatic playlist generators; a system for recommendation of unknown pieces and/or artists matching one's preferences; or even a system which helps a DJ choose the next song so that the current and the next have a similar rhythm and/or melody.

1.3 Organization of the Project

Chapter 2 gives a brief explanation of various musical terminologies such as pitch, scale, structure, etc., which are essential for understanding the basics of music theory. We discuss the different facets that might change in a cover song, which add to the difficulty of cover song identification. These changing facets lead to different types of cover songs, which are discussed briefly in this chapter. We also discuss what a chromagram is and its variants, as well as distance calculation algorithms such as DTW and MASS.

Chapter 3 is a Literature Survey giving short explanations of different work done in this field and existing methods used to identify Cover Songs.

Chapter 4 is focused on SiMPle-based cover song recognition and explains in detail what SiMPle and the SiMPle index are, the algorithm used in SiMPle, and the steps to calculate SiMPle and the SiMPle index. We also discuss the concepts of similarity join and self-similarity join in brief.

Chapter 5 evaluates the performance of the system based on Mean Average Precision (MAP), Precision at 10 and Mean Rank 1. This chapter also describes the various datasets used in our project and the results obtained for these datasets, and provides an analysis of the same.

Chapter 6 concludes the discussion, giving an insight into future work and possible modifications to our project.


Chapter 2: Theoretical Background

2.1 Musical Theory

Music theory [1] is frequently concerned with describing how musicians and composers make music, including tuning systems and composition methods among other topics. Music is an art form and cultural activity whose medium is sound organized in time. The common elements of music are pitch, rhythm, dynamics, timbre and texture. Different styles or types of music may emphasize, de-emphasize or omit some of these elements. Music is performed with a vast range of instruments and vocal techniques ranging from singing to rapping; there are solely instrumental pieces, solely vocal pieces and pieces that combine singing and instruments. The creation, performance, significance, and even the definition of music vary according to culture and social context.


Fig. 2.1: Musical Elements

2.1.1 Related Terminologies

 Pitch: The frequency of a note, determining how high or low it sounds.
 Timbre: Tone color; the quality of sound that distinguishes one voice or instrument from another. It is determined by the harmonics of the sound.
 Structure: The organization or form of a piece of music.
 Tempo: Indicates the speed of the music.
 Octave: The interval between one musical pitch and another with half or double its frequency.


Fig. 2.2: Octaves

 Scale: Successive notes of a key or mode either ascending or descending. Different types of Scales are Major Scale, Minor Scale, etc.


Fig. 2.3: C Major Scale

 Chord: Three or four notes played simultaneously in harmony. Chord progression means a string of chords played in succession.


Fig. 2.4: Chords in Music

 Interval: The distance in pitch between two notes.


Fig. 2.5: Intervals in Music

 Dynamics: Pertaining to the loudness or softness of a musical composition.
 Silence: No sound.

2.1.2 Types of Cover Songs

A cover song can mean any new version, performance, rendition, or recording of a previously recorded track. A cover song can be recorded in the following ways:

 Remaster: Creating a new master for an album or song generally implies some sort of sound enhancement (compression, equalization, different endings, fadeouts, etc.) to a previous, existing product.


Fig. 2.6: Music editor re-mastering some music piece

 Instrumental: Sometimes, versions without any sung lyrics are released. These might include karaoke versions to sing or play along with, cover songs for different record-buying public segments (e.g. classical versions of pop songs, children's versions, etc.), or rare instrumental takes of a song in CD-box editions specially made for collectors.


Fig. 2.7: Instrumental Performance

 Live Performance: A recorded track from live performances. This can correspond to a live recording of the original artist who previously released the song in a studio album, or to other performers.

 Acoustic: The piece is recorded with a different set of acoustical instruments in a more intimate situation.

Fig. 2.8: Acoustic Instruments

 Demo: A demo is a way for musicians to approximate their ideas on tape or disc, and to provide an example of those ideas to record labels, producers, or other artists. Musicians often use demos as quick sketches to share with band mates or arrangers. In other cases, a songwriter might make a demo to be sent to artists in the hope of having the song professionally recorded, or a music publisher may need a simplified recording for publishing or copyright purposes.
 Duet: A successful piece can often be re-recorded or performed by extending the number of lead performers beyond the original members of the band.
 Medley: Mostly in live recordings, and in the hope of catching listeners' attention, a band covers a set of songs without stopping between them, linking several themes.
 Remix: This word may be very ambiguous. From a 'traditionalist' perspective, a remix implies an alternate master of a song, adding or subtracting elements, or simply changing the equalization, dynamics, pitch, tempo, playing time, or almost any other aspect of the various musical components. But some remixes involve substantial changes to the arrangement of a recorded work and barely resemble the original one. Finally, a remix may also refer to a re-interpretation of a given work, such as a hybridizing process simultaneously combining fragments of two or more works.

2.1.3 Involved musical facets

With today's concept of a cover song, one might consider the musical dimensions in which such a piece may vary from the original one. In classical music,

different performances of the same piece may show subtle variations and differences, including different dynamics, tempo, timbre, articulation, etc. On the other hand, in popular music, the main purpose of recording a different version can be to explore a radically different interpretation of the original one. Therefore, important changes and different musical facets might be involved. It is in this scenario that cover song identification becomes a very challenging task. Some of the main characteristics that might change in a cover song are listed below:

 Timbre: Many variations changing the general color or texture of sounds might be included in this category. Two predominant groups are:

• Production techniques: Different sound recording and processing techniques (e.g. equalization, microphones, dynamic compression, etc.) introduce texture variations in the final audio rendition.

• Instrumentation: The fact that the new performers can be using different instruments, configurations, or recording procedures, can confer different timbres to the cover version.

 Tempo: Even in a live performance of a given song by its original artist, the tempo might change, as it is not so common to control tempo in a concert. In fact, this might become detrimental for expressiveness and contextual feedback. Even in classical music, small tempo fluctuations are introduced in different renditions of the same piece. In general, tempo changes abound (sometimes on purpose) with different performers.
 Timing: In addition to tempo, the rhythmical structure of the piece might change depending on the performer's intention or feeling, not only by means of changes in the drum section, but also through more subtle expressive deviations by means of swing, syncopation, pauses, etc.
 Structure: It is quite common to change the structure of the song. This modification can be as simple as skipping a short 'intro', repeating the chorus, introducing an instrumental section, or shortening one. But it can also imply a radical change in the ordering of the musical sections.
 Key: The piece can be transposed to a different key or tonality. This is usually done to adapt the pitch range to a different singer or instrument, for 'aesthetic' reasons, or to induce some mood changes in the listener.

 Harmonization: While maintaining the key, the chord progression might change (adding or deleting chords, substituting them with relatives, modifying the chord types, adding tensions, etc.). This is very common in introduction and bridge passages. Also, in instrumental solo parts, the lead instrument voice is practically always different from the original one.
 Lyrics and language: One purpose of performing a cover song is to translate it into other languages. This is commonly done by high-selling artists to become better known in large speaker communities.
 Noise: In this category we consider other audio manifestations that might be present in a song recording. Examples include audience manifestations such as claps, shouts, or whistles, audio compression and encoding artifacts, speech, etc.

Notice that, in some cases, all the characteristics of the song might change except, perhaps, a lick or a phrase in the background, which is the only thing that reminds the listener of the original song (e.g. in remixes or quotations). In these cases, it becomes a challenge to recognize the original song, even if the song is familiar to the listener.

2.2 Musical Dataset

A dataset is a collection [2] of related sets of information that is composed of separate elements but can be manipulated as a unit by a computer. To create a dataset for this project, the first step is to collect sufficient original and cover songs. For our project, we downloaded the standard musical dataset 'Covers80', which is easily available online. It consists of 80 original songs and 84 cover songs (approximately one cover song for each original), so the dataset consists of a total of 164 songs with an average duration of approximately 3 minutes. This dataset is further split into two parts, Train Data and Test Data.

2.2.1 Train Data

Train Data is a set of data used to train the model to discover potentially predictive relationships. In other words, it is a set of examples used for learning, that is, to fit the parameters of the classifier. In our project, all the original songs are classified under Train Data. Every song in this dataset is labelled so that we can verify our obtained results.

2.2.2 Test Data

Test Data is used to assess the performance of the model (e.g. predictive power). The test data set may not be used in the model building process. In our Project, all the cover songs are classified under Test Data. Every cover song in this dataset is given the same label as its corresponding original song in the Train dataset. This helps us to check the precision and accuracy of our results. We can assess the performance of our system by using these labels.

2.3 Chromagram

A chromagram [3] is a variation on the time-frequency distribution which represents the spectral energy at each of the 12 different pitch classes. The underlying observation is that humans perceive two musical pitches as similar in color if they differ by an octave. Based on this observation, a pitch can be separated into two components, which are referred to as tone height and chroma. Assuming the equal-tempered scale, one considers twelve chroma values represented by the set {C, C#, D, D#, E, F, F#, G, G#, A, A#, B}.

A pitch class is defined as the set of all pitches that share the same chroma. For example, using the scientific pitch notation, the pitch class corresponding to the chroma

C is the set {..., C−2, C−1, C0, C1, C2, C3 ...}. Given a music representation, the main idea of chroma features is to aggregate for a given local time window (e.g. specified in beats or in seconds) all information that relates to a given chroma into a single coefficient. Shifting the time window across the music representation results in a sequence of chroma features each expressing how the representation's pitch content within the time window is spread over the twelve chroma bands. TheLIBRARY resulting time-chroma representation is also referred to as chromagram. Because of the close relation between the terms chroma and pitch class, chroma features are also referred to as pitch class profiles.

Chroma features are a powerful mid-level feature representation in content-based audio retrieval tasks such as cover song identification or audio matching. There are many ways of computing and enhancing chroma features, which results in a large number of chroma variants with different properties, and no single chroma variant works best in all applications. Furthermore, the properties of chroma features can be changed significantly by introducing suitable pre- and post-processing steps modifying spectral, temporal, and dynamical aspects.


Fig. 2.9: (a) Musical score of a C-major scale. (b) Chromagram obtained from the score. (c) Audio recording of the C-major scale played on a piano. (d) Chromagram obtained from the audio recording.

In music, notes exactly one octave apart are perceived as particularly similar. Knowing the distribution of chroma, even without the absolute frequency, can give useful musical information about the audio and may even reveal perceived musical similarity

that is not apparent in the original spectra. Some of the different chroma variants that can be generated are CP (Chroma Pitch), CLP (Chroma Log Pitch), CENS (Chroma Energy Normalized Statistics), etc.

The visual representations of Chroma variants for “C Major Chord” are as follows:


Fig. 2.10: Normalized Chromagram


From the above figure, we can see that the bar on the right-hand side indicates the intensity levels. Higher intensities are represented by brighter patches and lower intensities by darker patches. The C major scale is as follows: C D E F G A B, wherein the first, third and fifth notes, i.e. C, E, and G, together form the C major chord. This can be clearly seen from the bright patches near C, E and G.


Fig. 2.11: Normalized Log Chromagram


Fig. 2.12: CENS Chromagram

 Steps to generate chroma variants

Fig. 2.13: Overview of the feature extraction pipeline.

We first decompose a given audio signal into 88 frequency bands with center frequencies corresponding to the MIDI pitches p = 21 to p = 108. For this we employ a constant-Q multirate filter bank using a sampling rate of 22050 Hz for high pitches, 4410 Hz for medium pitches, and 882 Hz for low pitches. In the next step, for each of the 88 pitch sub-bands, we compute the short-time mean-square power. The resulting features are called pitch features. From the pitch representation, we can obtain a chroma representation simply by adding up the values that belong to the same chroma. For example, to compute the entry corresponding to the chroma C, one adds up the values corresponding to the musical pitches C1, C2, ..., C8 (MIDI pitches p = 24, 36, ..., 108). Further, to achieve invariance to dynamics, we can normalize the features with respect to some suitable norm. As the name suggests, in the CLP variant we apply logarithmic compression: each energy value e of the pitch representation is replaced by the value log(η · e + 1), where η is called the compression parameter. Another important chroma variant is CENS. CENS features have turned out to be very useful in audio matching and retrieval applications. To compute CENS features, each chroma vector is first normalized with the help of a suitable norm. Then we apply quantization. After quantization, the features are smoothed to add a further degree of abstraction.
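
The following MATLAB sketch illustrates the pitch-to-chroma folding, logarithmic compression, and CENS-style quantization and smoothing described above. It is only an illustrative outline and not the Chroma Toolbox implementation; the function names, the quantization thresholds and the smoothing window length are our own assumptions, and the input P is assumed to be the 88-band pitch-power representation (MIDI pitches 21 to 108).

% Illustrative sketch (not the Chroma Toolbox code): derive chroma variants
% from an 88 x T pitch-power matrix P (rows = MIDI pitches 21..108).
function [CP, CLP, CENSlike] = pitch_to_chroma_variants(P, eta)
midiPitch = 21:108;
C = zeros(12, size(P, 2));
for k = 1:88
    bin = mod(midiPitch(k), 12) + 1;          % C -> 1, C# -> 2, ..., B -> 12
    C(bin, :) = C(bin, :) + P(k, :);          % add energies of the same chroma
end
CP  = normcols(C);                            % chroma pitch, unit norm per frame
CLP = normcols(log(eta * C + 1));             % chroma log pitch, log(eta*e + 1)
Q = zeros(size(CP));                          % CENS-style coarse quantization
Q(CP > 0.4) = 4;
Q(CP > 0.2 & CP <= 0.4) = 3;
Q(CP > 0.1 & CP <= 0.2) = 2;
Q(CP > 0.05 & CP <= 0.1) = 1;
w = ones(1, 41) / 41;                         % simple temporal smoothing window
CENSlike = normcols(conv2(Q, w, 'same'));     % smooth, then normalize again
end

function X = normcols(X)
n = sqrt(sum(X.^2, 1)); n(n == 0) = 1;        % guard against all-zero frames
X = X ./ n;
end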

For our project we are using the CENS chroma variant. The CENS variant is used because it is able to absorb variations in timbre, tempo and other micro-deviations. Therefore, the CENS variant basically absorbs all the non-relevant information and gives us only the relevant and meaningful information. Applying energy thresholds also makes the CENS features insensitive to noise components. Furthermore, because of their low temporal resolution, CENS features can be processed efficiently.

2.4 Longest Common Subsequence

It is the sequence of characters of a shorter string that appears in order (but possibly separated) in a longer text string when two strings of different lengths are compared. For example: String 1 - "nano"

String 2- “nematode knowledge”

The characters 'n', 'a', 'n', 'o' of string 1 appear in order (possibly separated) in string 2, i.e. nematode knowledge.

If this condition is satisfied, then we can say that "nano" is the Longest Common Subsequence.
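
As an illustration, the following MATLAB sketch computes the length of the longest common subsequence using the standard dynamic-programming recurrence; it is a generic helper shown for clarity rather than part of the project code.

function len = lcs_length(s1, s2)
% Longest Common Subsequence length via dynamic programming.
n = length(s1); m = length(s2);
L = zeros(n + 1, m + 1);
for i = 1:n
    for j = 1:m
        if s1(i) == s2(j)
            L(i + 1, j + 1) = L(i, j) + 1;                 % extend the match
        else
            L(i + 1, j + 1) = max(L(i, j + 1), L(i + 1, j));
        end
    end
end
len = L(n + 1, m + 1);
end

% Example: lcs_length('nano', 'nematode knowledge') returns 4.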

2.5 Distance Calculation Algorithms

To match any two songs we calculate the distance between them. There are various methods available to calculate the distance between two time series, such as the Euclidean Distance (ED), Dynamic Time Warping (DTW), etc. In our project, the algorithm used to calculate distance is Mueen's Algorithm for Similarity Search (MASS), the fastest known algorithm to calculate the distance between two time series. But before implementing MASS in our project, we first implement the project using DTW, which is the most commonly used method in Music Information Retrieval (MIR) projects.

2.5.1 Dynamic Time Warping (DTW)

DTW is one of the well-known and widely applied algorithms for measuring the similarity between two time series by finding an optimal alignment between them under certain restrictions. A potential solution would be to segment the song before applying the DTW similarity estimation. DTW was originally developed for speech recognition. It aims at aligning two sequences of feature vectors by warping the time axis iteratively until an optimal match between the two sequences is found. Any data which

can be turned into a linear sequence can be analyzed using DTW. It has been applied to temporal sequences of video, audio, and graphics data. The time complexity of the algorithm is O(n²). Various methods exist to quickly approximate the DTW calculation; however, the maximum level of error cannot be bounded.

Calculation of DTW:

Example of DTW calculation is illustrated below:


Fig. 2.14: Example of DTW calculation

The Euclidean distance between every pair of instants of the two time series is computed. To determine an optimal path, one could test every possible warping path between X and Y. We define the prefix sequences X(1:n) = (x1, ..., xn) for n ∈ [1:N] and Y(1:m) = (y1, ..., ym) for m ∈ [1:M], and set

D(n, m) = DTW(X(1:n), Y(1:m)).

The values D(n, m) define an N x M matrix D, which is also referred to as the Distance Profile Matrix. An optimal path is found close to the diagonal, and adding up the values along the optimal path gives the final distance between the two time series.
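
A minimal MATLAB sketch of this accumulated-cost computation is given below. It assumes two feature sequences stored column-wise (for example chroma frames), uses the frame-wise Euclidean distance as the local cost with the basic step conditions, and omits path constraints and normalization.

function d = dtw_distance(X, Y)
% Basic DTW sketch: X is d x N, Y is d x M; returns the accumulated alignment cost.
N = size(X, 2); M = size(Y, 2);
D = inf(N + 1, M + 1);                         % accumulated cost matrix
D(1, 1) = 0;
for n = 1:N
    for m = 1:M
        cost = norm(X(:, n) - Y(:, m));        % local Euclidean distance
        D(n + 1, m + 1) = cost + min([D(n, m), D(n, m + 1), D(n + 1, m)]);
    end
end
d = D(N + 1, M + 1);                           % distance between the two series
end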

2.5.2SFIT Mueen’s Algorithm for Similarity Search (MASS)

It is the Fastest Similarity Search Algorithm [4] for Time Series Subsequences under Euclidean Distance and Correlation Coefficient. This algorithm does not just find the nearest neighbor to a query and return its distance; it returns the distance to every

subsequence. The algorithm requires just O(n log n) time by exploiting the FFT to calculate the dot products between the query and all subsequences of the time series. In particular, it computes the distance profile.

MASS is an algorithm to create Distance Profile of a query to a long time series.

This algorithm is independent of the data and the query. The underlying concept of the algorithm has been known to the signal processing community for a long time.

There are dozens of algorithms for time series similarity search that utilize index structures to efficiently locate neighbors. While such algorithms can be faster in the best case, all these algorithms degenerate to brute force search in the worst case. Following are the key features of Mueen’s Algorithm:

1. The algorithm has an overall time complexity of O(n log n) which does not depend on the dataset and is the lower bound for similarity search over time series subsequences.

2. The algorithm produces all of the distances from the query to the subsequences of a long time series.

Calculation of MASS:

Table I: Calculation of Sliding Dot Products

Procedure SlidingDotProduct(Q, T)
Input: A query Q, and a user-provided time series T
Output: The dot product between Q and all subsequences in T
1  n ← Length(T), m ← Length(Q)
2  Ta ← Append T with n zeros
3  Qr ← Reverse(Q)
4  Qra ← Append Qr with 2n − m zeros
5  Qraf ← FFT(Qra), Taf ← FFT(Ta)
6  QT ← InverseFFT(ElementwiseMultiplication(Qraf, Taf))
7  return QT

Considering the above table, line 1 determines the lengths of both the time series T and the query Q. In line 2, we use that information to append T with an equal number of zeros. In line 3, we obtain the mirror image of the original query. This reversing ensures that a convolution (i.e., a "criss-crossed" multiplication) essentially produces an in-order alignment, because we require both vectors to be the same length. In line 4, we append enough zeros to the (now reversed) query so that, like Ta, it is also of length 2n. In line 5, the algorithm calculates the Fourier transforms of the appended-reversed query (Qra) and the appended time series Ta. Note that we use the FFT, which is an O(n log n) algorithm. The Qraf and Taf produced in line 5 of the above table are vectors of complex numbers representing the frequency components of the two time series.

This algorithm calculates the element-wise multiplication of the two complex vectors and performs an inverse FFT on the product. Lines 5-6 are the classic convolution operation on two vectors. The algorithm's time complexity does not depend on the length of the query (m).
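
The procedure of Table I translates almost directly into MATLAB. The sketch below mirrors it line by line (zero padding, query reversal, FFTs, element-wise multiplication and inverse FFT); only the extraction of the valid dot products at the end is added.

function QT = sliding_dot_product(Q, T)
% Dot products of query Q (length m) with every subsequence of T (length n),
% computed through FFT-based convolution as in Table I.
n = length(T); m = length(Q);
Ta  = [T(:); zeros(n, 1)];                     % line 2: append n zeros to T
Qra = [flipud(Q(:)); zeros(2*n - m, 1)];       % lines 3-4: reverse Q, pad to 2n
QT  = ifft(fft(Qra) .* fft(Ta));               % lines 5-6: multiply spectra, invert
QT  = real(QT(m:n));                           % keep the n-m+1 valid dot products
end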

Normally, it takes O(m) time to calculate the mean and standard deviation for every subsequence of a long time series. We cache cumulative sums of the values and square of the values in the time series. At any stage the two cumulative2018 sum vectors are sufficient to calculate the mean and the standard deviation of any subsequence of any length.
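
This cumulative-sum trick can be written in a few lines; the sketch below returns the mean and standard deviation of every length-m subsequence in a single pass.

function [mu, sigma] = moving_mean_std(T, m)
% Mean and standard deviation of all length-m subsequences of T.
T = T(:); n = length(T);
c1 = [0; cumsum(T)];                           % cumulative sums of the values
c2 = [0; cumsum(T.^2)];                        % cumulative sums of the squares
s1 = c1(m+1:n+1) - c1(1:n-m+1);                % window sums
s2 = c2(m+1:n+1) - c2(1:n-m+1);                % window sums of squares
mu = s1 / m;
sigma = sqrt(s2 / m - mu.^2);                  % per-window standard deviation
end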

Table II: Mueen’s Algorithm for Similarity Search (MASS)

Procedure of MASS(Q,T)

Input: A query Q, and a user provided time series T

Output: A distance profile D of the query Q

1  QT ← SlidingDotProduct(Q, T)
2  sumQ, sumT ← Compute cumulative sums
3  D ← Calculate distance profile from QT, sumQ, sumT
4  return D

This algorithm calculates the distance to every subsequence, i.e. the distance profile of the time series T. Alternatively, in join nomenclature, the algorithm produces one full row of the all-pair similarity matrix. Our join algorithm is simply a loop that computes each full row of the all-pair similarity matrix and updates the current "best-so-far" matrix profile.
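
Putting the two previous sketches together gives the distance profile of Table II. The z-normalized Euclidean distance between the query and each subsequence can be expressed through the dot products and the subsequence statistics; the formula below is the standard one used in MASS.

function D = mass_distance_profile(Q, T)
% Distance profile: z-normalized Euclidean distance from Q to every
% subsequence of T, following Table II.
m = length(Q);
QT = sliding_dot_product(Q, T);                % line 1
[muT, sigmaT] = moving_mean_std(T, m);         % line 2
muQ = mean(Q); sigmaQ = std(Q, 1);
D = sqrt(2 * m * (1 - (QT - m * muQ .* muT) ./ (m * sigmaQ .* sigmaT)));  % line 3
end

The similarity join is then a loop over all query subsequences that keeps, for each position, the element-wise minimum of these distance profiles, which is exactly the "best-so-far" matrix profile update mentioned above.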

Chapter 3: Literature Survey

With the tremendous increase in cover songs being recorded these days, various techniques have been developed to identify them, and cover song identification has recently been gathering a lot of attention in the Music Information Retrieval (MIR) domain. The focus is to build an algorithm which is able to process two music signals, find the similarity between them, and identify whether the two signals are covers of each other or not. Various attempts have been made in this regard, wherein the basic idea is to divide the time series into subsequences and find the distances between these subsequences. The distance between two subsequences determines the similarity between them: a smaller distance corresponds to more similarity and a larger distance indicates less similarity. Various projects made to identify cover songs make use of distance-finding algorithms such as the Euclidean distance (ED), Dynamic Time Warping (DTW), and early-abandoning techniques. This chapter comprehensively summarizes the work done in cover song identification while covering the background related to this area of research. In the literature, one can find plenty of approaches addressing song similarity and retrieval, both in the symbolic and the audio domains. Within these, research done in areas such as query-by-humming systems, content-based music retrieval, genre classification, or audio fingerprinting is relevant for addressing cover song similarity. Many ideas for cover song identification systems come from the symbolic domain, and query-by-humming systems are paradigmatic examples. In query-by-humming systems, the user sings or hums a melody and the system searches for matches in a musical database. This query-by-example situation is parallel to retrieving cover songs from a database. In fact, many of the note encoding or alignment techniques employed in query-by-humming systems could be useful in future approaches to cover song identification. However, the kind of musical information that query-by-humming systems manage is symbolic (usually MIDI files), and the query, as well as the music material, must be transcribed into the symbolic domain. Unfortunately, transcription systems of this kind do not yet achieve a significantly high accuracy on real-world audio music signals.

3.1 Chroma features and Dynamic Programming beat tracking

Daniel P.W. Ellis and Graham E. Poliner. “Identifying ‘Cover Songs’ with Chroma Features and Dynamic Programming Beat Tracking”. In MIREX 2006 Audio Beat Tracking Contest system description, 2006

This method [5] involves three major steps. First, to overcome variability in tempo, beat tracking is used to describe each piece with one feature vector per beat. To deal with variation in instrumentation, 12-dimensional 'chroma' feature vectors are used, since chroma features record the intensity associated with each of the 12 semitones. We perform these two steps to get a "beat-synchronous chroma" representation for both the query and the original song, as shown in Fig. 3.1. We expect cover versions to have long stretches (verses, choruses, etc.) that match reasonably well, although we cannot expect these to occur in exactly the same places, in absolute or relative terms, in the two versions, for instance due to minor errors in the beat tracking, or as a result of variations in the structure (number of verses, etc.). Thus, to find similarity, the simpler approach is to cross-correlate the two entire feature matrices. Although this is unable to reward the situation when multiple fragments align but at different relative alignments, it does have the nice property of rewarding both a good correlation between the chroma vectors and a long sequence of aligned beats, since the overall peak correlation is a product of both of these.


Fig. 3.1: Example [5] of Beat Synchronous Chroma Features

The cross-correlation is further normalized by the length of the shorter segment, so the correlation results are bounded to lie between zero and one. The cross-correlation and the results obtained for the song "Between the Bars" are shown in Fig. 3.2. We perform the cross-correlation twelve times, once for each possible relative rotation (transposition) of the two feature matrices.


Fig. 3.2: Illustration [5] of Cover Song Matching

Genuine matches were indicated not only by cross-correlations of large magnitude, but also by the fact that these large values occurred in narrow local maxima of the cross-correlations that fell off rapidly as the relative alignment changed from its best value. To emphasize these sharp local maxima, we choose the transposition that gives the largest peak correlation, then high-pass filter that cross-correlation function with a 3 dB point at 0.1 rad/sample. The 'distance' value measured between two pieces is simply the reciprocal of the peak value of this high-pass filtered cross-correlation; matching tracks typically score below 20, whereas unrelated tracks are usually above 50.
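
The following MATLAB sketch illustrates the core of this matching scheme: cross-correlating two beat-synchronous chroma matrices over time for each of the twelve relative transpositions and keeping the largest normalized peak. It is an illustration of the idea in [5], not the authors' code; it uses xcorr from the Signal Processing Toolbox and omits the final high-pass filtering and reciprocal-distance steps.

function score = chroma_xcorr_score(X, Y)
% X and Y are 12 x N and 12 x M beat-synchronous chroma matrices.
N = size(X, 2); M = size(Y, 2);
score = 0;
for r = 0:11
    Yr = circshift(Y, r, 1);                   % rotate pitch classes (transposition)
    c = zeros(1, 2 * max(N, M) - 1);
    for b = 1:12
        c = c + xcorr(X(b, :), Yr(b, :));      % correlate each chroma band over time
    end
    c = c / min(N, M);                         % normalize by the shorter length
    score = max(score, max(c));                % best lag for this transposition
end
end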

3.2 Music Shapelets for Fast Cover Song Recognition

Diego F. Silva, Vinícius M. A. Souza, Gustavo E. A. P. A. Batista. "Music Shapelets for Fast Cover Song Recognition", 16th International Society for Music Information Retrieval Conference, 2015

Time series shapelets are a well-known approach [6] to time series classification. In this technique, a training phase is added that finds small excerpts of the feature vectors that best describe each song. Such small segments can be used to identify cover songs with higher identification rates and more than one order of magnitude faster than methods that use features describing the whole recording.


Fig. 3.3: Excerpts [6] retrieval

In classification, there exists a training set of labeled instances. A typical learning system uses the information in the training set to create a classification model, in a step known as the training phase. When a new instance is available, the classification algorithm associates it with one of the classes in the training set. A time series shapelet may be informally defined as the subsequence that is the most representative of a class. The original algorithm finds a set of shapelets and uses them to construct a decision tree classification model. The training phase of such a learning system consists of three basic steps:

 Generate candidates: This step consists in extracting all subsequences from each training time series.
 Candidates' quality assessment: This step assesses the quality of each subsequence.
 Classification model generation: This step induces a decision tree. The decision in each node is based on the distance between the query time series and a shapelet associated to that node.

To improve the results, each recording is represented by three shapelets. Figure 3.4 illustrates this procedure. The first step of this procedure is to divide the feature vector into three parts of the same length. After that, the most representative subsequence of each segment is found. Finally, during the retrieval phase, the mean distance from a query recording to each of the three shapelets is used. This triple of shapelets is referred to as a triplet.


Fig. 3.4: General [6] procedure to generate triplets

Although this method is fast, it requires a training phase that is absent in similarity search with DTW.

3.3 Audio Cover Song Identification Based On Tonal Sequence Alignment

Joan Serrà, Emilia Gómez, Perfecto Herrera, and Xavier Serra, "Chroma Binary Similarity and Local Alignment Applied to Cover Song Identification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 6, pp. 1138-1151, 2008

The tonal sequence alignment technique [7] is a method for determining the similarity between tonal sequences for identifying cover songs. It is based on a novel chroma similarity measure and on a newly developed dynamic programming local alignment technique. Tonal sequences are useful descriptors for cover song identification. The method uses sequences of feature vectors describing tonality (in this case Harmonic Pitch Class Profiles), but it presents relevant differences in two important aspects: a new binary similarity function between chroma features, and a new local alignment algorithm for assessing the resemblance between sequences.

From each pair of songs A and B being compared (the inputs), one can obtain a distance between them. Preprocessing comprises extracting HPCP sequences and a global HPCP for each song. Then, one song is transposed to the key of the other by means of an Optimal Transposition Index (OTI). From these two sequences, a binary similarity matrix is computed. This matrix is the only input needed for a Dynamic Programming Local Alignment (DPLA) algorithm, which calculates a score matrix that gives the highest ratings to the best-aligned subsequences. Finally, in the post-processing step, a normalized distance between the two processed songs is found.


Fig. 3.5: General [7] Block Diagram of the System

The system's main novelty lies in two elements: a new binary similarity measure for chroma features and a custom-made dynamic programming local alignment algorithm for determining subsequence similarity.

3.4 A Heuristic for Distance Fusion in Cover Song Identification

Alessio Degani, Marco Dalai, Riccardo Leonardi and Pierangelo Migliorati, "A Heuristic for Distance Fusion in Cover Song Identification", Image Analysis for Multimedia Interactive Services (WIAMIS), 14th International Workshop, Paris, France, 2013.

In this technique, a method [8] has been proposed to integrate the results of different cover song identification algorithms into one single measure which gives better results than the initial algorithms. In this method, the different distance measures are fused into a multi-dimensional space. Two distance measures, namely Dynamic Time Warping and the Qmax measure, are tested. A method to merge N combinations of features and distance measures to increase the accuracy of a cover song identification algorithm is used. This method is based uniquely on a geometric N-dimensional distance measure, so the overall distance refinement has a very low computational cost. A particularly useful combination was obtained by using a salience feature with a Dynamic Time Warping similarity measure and an HPCP with a Qmax measure.

A salience function for a given frequency fi is calculated as a weighted sum of the energy at the first 8 harmonics of the fundamental frequency fi, i.e. fi, 2fi, 3fi, and so on. The pitch salience function is calculated at each frame using the amplitude spectrum and covers a fixed frequency range. The computation of the HPCP descriptor begins with some pre-processing steps such as spectral peak detection. The energy of each pitch class is calculated from the corresponding spectral peak and the weighted summation of the energy of its harmonic frequency peaks, up to 8 terms.

Basically, the Qmax distance calculates the length of the longest time segment in which two songs u and v exhibit similar feature patterns. This is done by using a cross-recurrence plot. A cross-recurrence plot (CRP) is a binary similarity matrix C whose elements c(i,j) are set to 1 when there is a recurrence between the i-th feature vector of song u and the j-th feature vector of song v, and to zero otherwise. Here, a recurrence means that the Euclidean distance between these two vectors is below a specified threshold. When consecutive feature vectors are similar for a certain number of frames, a diagonal pattern of ones becomes visible in the CRP. If the number of combinations of different features increases, accuracy will increase, but at the same time the computing time will also increase and the system would no longer be useful for real-time applications.
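
A cross-recurrence plot of this kind is straightforward to construct; the sketch below (an illustration, not the authors' implementation) simply thresholds the pairwise Euclidean distances between the two feature sequences.

function C = cross_recurrence_plot(U, V, epsilon)
% Binary CRP between feature sequences U (d x N) and V (d x M); an entry is
% 1 when the corresponding feature vectors are closer than epsilon.
N = size(U, 2); M = size(V, 2);
C = zeros(N, M);
for i = 1:N
    for j = 1:M
        C(i, j) = norm(U(:, i) - V(:, j)) < epsilon;
    end
end
end
% Diagonal stretches of ones in C correspond to segments in which the two
% songs evolve similarly, which is what the Qmax measure quantifies.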

3.5 Music Fingerprint Extraction for Classical Music Cover Song Identification

Samuel Kim, Erdem Unal, and Shrikanth Narayanan, "Music Fingerprint Extraction for Classical Music Cover Song Identification", IEEE International Conference on Multimedia and Expo, Hannover, Germany, 2008.

The proposed method deals with extracting music fingerprints [9] directly from an audio signal. It aims to encapsulate various aspects of musical information, such as the overall note distribution, the harmony structure, and their temporal changes, all in a compact representation. The utility of the proposed music fingerprinting method for the task of automatic classical music cover song identification is explored through experimental studies; specifically, the goal is to identify different versions of the same music through similarity comparisons of the music fingerprints.

Chroma features based on Shepard's helix model are used, which factorize the perception of frequency into tone height and chroma. It is desirable to have lower memory requirements and complexity as well as higher accuracy in capturing the unique characteristics of the music for the target application. In the covariance matrix of the chroma feature vectors, each element of the matrix is an energy-related quantity. The diagonal elements of the covariance matrix represent the degree of presence of each pitch class in terms of energy. Furthermore, each column of the covariance matrix denotes the degree of co-presence of each pitch class with a given pitch class. Since the co-presence of two or more pitch classes represents harmony information, it reveals the harmony structure of the music. Feature vectors are extracted in a beat-synchronous way, and the delta features represent the dynamics between two consecutive beats.
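
The fingerprint construction can be sketched as follows. The function below is an illustrative outline only, and the exact feature layout and normalization of [9] may differ: it builds the covariance of a beat-synchronous chroma matrix and of its delta features and applies a column-wise normalization.

function F = chroma_cov_fingerprint(C)
% C is a 12 x Nbeats beat-synchronous chroma matrix.
D  = diff(C, 1, 2);                            % delta features between beats
F  = [cov(C'), cov(D')];                       % chroma and delta covariances, 12 x 24
nc = sqrt(sum(F.^2, 1)); nc(nc == 0) = 1;
F  = F ./ nc;                                  % column-wise normalization
end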

29 normalization scheme. The fingerprinting method with column-wise normalization outperforms the overall normalization method, discerned more from the harmony structure than the overall note energy distribution. Improved performance over the state- of-the-art cover song identification systems in terms of both accuracy and speed. The accuracy improved by approximately 40% while the search speed is about 60 times faster than the conventional system.

3.6 Large-Scale Cover Song Identification Using Chord Profiles

Maksim Khadkevich, MaurizioOmologo, “Large-Scale Cover Song Identification Using Chord Profiles”, Conference on International Society for Music Information Retrieval, 2013 2018

A compact representation [10] of music contents plays an important role in large- scale analysis and retrieval. The proposed approach is based on high-level summarization of musical songs using chord profiles. Search is performed in two steps. In the first step, the Locality Sensitive Hashing (LHS) method is used to retrieve songs with similar chord profiles. On the resulting list of songs a second processing step is applied to progressively refine the ranking. In this technique an approach to large-scale cover song identification using chord progressions and chord profiles has been proposed. A chord progression is extracted from audio or from chroma features provided with the Million Song Dataset (MSD). For the most part, the approaches proposed based on the alignment of local features, which is typicallyLIBRARY performed by Dynamic Time Warping (DTW) or string alignment, and require a significant amount of computational resources. To build a system that can operate on a large scale, chord profiles are used for indexing and fast retrieval. A chord profile of a song is a compact representation that summarizes the rate of occurrence of each chord. To solve the problem of cover song identification, a well- establishedSFIT approach for fast retrieval in multidimensional spaces has been used, which is Locality-Sensitive Hashing (LHS).

Chord progressions and chord profiles are the two high-level features used in the proposed system. The extraction of beat-synchronous chord progression is the first step. Chord profiles and chord progressions extracted for each song of a given dataset are

30 stored in a database. In the retrieval stage, high-level features extracted from a queried song are used to derive a ranked list of possible covers from the database. The proposed cover song identification system relies on a two-step retrieval schema. Given a large database of chord profiles and a queried song the problem of finding the nearest neighbors has been addressed. Finding the nearest neighbors of an element in large databases is a well-known problem addressed in many areas of information retrieval. Locality Sensitive Hashing is a probabilistic approach to reduce dimensionality by hashing feature so that items that are close to each other fall in the same bucket with high probability. When working with larger databases, containing several millions of songs, a significant increase in speed can be achieved by using LHS.

In the second step, the top k results are re-ranked by computing edit distances between chord progressions. The edit distance is the number of insertions,2018 deletions and substitutions to transform one sequence into another. Thus, a more accurate matching is performed, which takes temporal information into account the output is refined taking into account temporal alignment between a queried song and the top k items from the merged ranked list.

3.7 Large-Scale Cover Song Recognition Using Hashed Chroma Landmarks

Thierry Bertin-Mahieux and Daniel P. W. Ellis, "Large-scale cover song recognition using hashed chroma landmarks," IEEE Workshop on Applications of Signal Processing to Audio and AcousticsLIBRARY (WASPAA) on, pp. 117-120, November, 2011

The proposed method [11] considers the problem of finding covers in a database of a million songs. Using a fingerprinting inspired model, they presented the first results of cover song recognition on the Million Song Dataset. This task has been renewed by the availabilitySFIT of so many tracks, and this work was intended to be the first step towards a practical solution. Typical Music Information Retrieval (MIR) tasks includes two things

(1) Fingerprinting and (2) Music recommendation. The first one involves identifying some specific piece of audio, the second deals with finding "similar" songs according to some human metric. Cover song recognition sits somewhere in the middle. The goal is to

31 identify a common musical work that might have been highly transformed by two different musicians. We look for a fingerprinting-inspired set of features that can be used as hash codes. They identified landmarks in the audio signal, i.e. recognizable peaks, and measured the distance between them. These "jumps" constitute a very accurate identifier of each song, robust to noise due to encoding or coming from the background. For cover songs, since the melody and the general structure (e.g. chord changes) are often preserved, we can use sequences of jumps between pitch landmarks as hash codes.

A hashing system contains two main parts. The extraction part computes the hash codes (also called fingerprints) from a piece of audio. The second part compares hash codes in a database given a query song. This has to be optimized so

1) Hash codes are easy to compare

2) The number of hash codes per song is tractable

They then averaged the Chroma values over beats and identified2018 landmarks, i.e. "prominent" bins in the resulting Chroma matrix. Once these landmarks were identified, they looked at all possible combinations, i.e. set of jumps from landmarks to landmarks, over a window of size W. To avoid non-informative duplicates, they only consider jumps in one direction within a same time frame. The simplest set of hash codes consists of all pairs of landmarks that fall within the window length W. Such a pair can already give information about musical objects like chords, or changes in the melody line.

These sets of jumps are hash codes characteristic of the song, and also characteristic of its covers. For the sake of clarity, they are referred to as "jumpcodes". A jumpcode is a list of differences in time and semitone between landmarks, plus an initial semitone. The jumpcodes are entered into a database so that they can be used to compare songs. This is considered an initial step towards the music information retrieval system.


3.8 Cover Song Identification with 2D Fourier Transform Sequences

Prem Seetharaman, Zafar Rafii, "Cover song identification with 2D Fourier Transform sequences," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 616-620, June, 2017

In this method, cover song identification [12] is performed using a novel time-series representation of audio based on the 2D Fourier Transform (2DFT). The audio is represented as a sequence of magnitude 2D Fourier Transforms. This representation is robust to key changes, timbral changes, and small local tempo deviations. The main aim is to compute cross-similarity between these time series and to extract a distance measure that is invariant to changes in musical structure.

Identifying cover songs automatically involves finding a representation of the audio that is robust to these transformations. In this technique, an audio representation using sequences of 2D Fourier Transforms (2DFT) is presented. For this purpose, an approach based on beat-synchronous chromagram representations of audio is used. The chromagrams of the covers and the originals are cross-correlated in pitch and time. When a cover and its original are compared, there will be a peak in the cross-correlation matrix. The beat-tracking makes the method resilient to tempo variation, and the cross-correlation in chroma and time makes it resilient to key changes and time skews. As the chromagram retains only pitch class information, it is somewhat resilient to instrumentation changes.

The 2DFT breaks down images into sums of sinusoidal grids at different periods and orientations, represented by points in the 2DFT. A useful representation of musical audio is the Constant Q Transform (CQT). The CQT is a transform with logarithmic frequency resolution, with spacing between frequencies mirroring the human auditory system and the Western musical scale. A linear shift in frequency in the CQT corresponds to a pitch shift in the music. By taking the magnitude of the 2DFT of the CQT, we obtain a key-invariant representation of the audio, robust to key changes, timbral changes and small local tempo deviations. The similarity between these time-series representations is then assessed, and a distance measure that is invariant to structural changes is extracted.
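The key-invariance property can be illustrated with a short MATLAB sketch; the CQT patch below is a random stand-in for an actual constant-Q analysis:

% Key-invariant patch representation via the magnitude 2DFT.
patch   = rand(96, 64);                 % stand-in CQT magnitude patch (bins x frames)
shifted = circshift(patch, 5, 1);       % simulate a pitch shift of 5 CQT bins

F1 = abs(fft2(patch));                  % magnitude 2DFT of the original patch
F2 = abs(fft2(shifted));                % ... and of the pitch-shifted patch

% A circular shift only changes the phase of the 2DFT, so the magnitudes agree:
max(abs(F1(:) - F2(:)))                 % on the order of 1e-13 (numerical noise)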

Chapter 4: Cover Song Recognition Using SiMPle

With the growing interest in applications related to music processing [13], the area of music information retrieval (MIR) has attracted considerable attention. A common approach to assessing similarity in music recordings is to use a self-similarity matrix. This representation reveals the relationship between each snippet, or a particular part of a track, and all the other segments in the same recording. This idea has been generalized to measure the relationships between subsequences of different songs, as in the application of cross recurrence analysis for cover song recognition.

The main advantage of similarity matrices is the fact that they simultaneously reveal both the global and the local structure of music recordings. However, this representation requires space that is quadratic in the length of the feature vector used to describe the audio. For this reason, most methods to find patterns in the similarity matrix are quadratic in time complexity. Also, most of the information contained in the similarity matrix is irrelevant or has very little impact on its analysis. Thus, the proposed method uses the "subsequences all-pairs-similarity-search", also known as similarity join, in order to assess the similarity between audio recordings for MIR tasks. A new data structure called the matrix profile is exploited, which allows a space-efficient representation of the similarity join matrix between subsequences. Also, by making use of recent FFT-based all-neighbor search algorithms, the matrix profile can be computed much more efficiently.

In summary, the proposed method has the following advantages:

• It is a novel approach to assess audio similarity and can be used in several MIR algorithms.

• The fastest known subsequence similarity search technique called MASS (Mueen’s Algorithm for Similarity Search) is used which makes this method fast and exact.

• This method is simple and only requires a single parameter, which is intuitive to set for MIR applications.

• It is space efficient requiring the storage of only O(n) values.

• Once the similarity profile for a dataset is calculated it can be efficiently updated, which has implications for streaming audio processing.

4.1 Description of SiMPle

SiMPle stands for Similarity Matrix Profile. The term “time series” is used to refer to the ordered set of features that describe a whole recording and the term “subsequence” to define any continuous subset of features from the time series.

Similarity join: Given two time series A and B with the desired subsequence length m, the similarity join identifies the nearest neighbor of each subsequence (with length m) in A from all the possible subsequence set of B.

Through such a similarity join, we can gather two pieces of information about each subsequence in A, which are:

1) The Euclidean distance to its nearest neighbor in B

2) The position of its nearest neighbor in B.

Such information can be compactly stored in vectors, referred to as the similarity matrix profile (SiMPle) and the similarity matrix profile index (SiMPle index) respectively. One special case of similarity join is when both input time series refer to the same recording; this is defined as the self-similarity join.

Self-similarity join: Given a time series A with the desired subsequence length m, the self-similarity join identifies the non-trivial nearest neighbor of each subsequence (with length m) in A from all the possible subsequence set of A.

Table III: Calculate SiMPle and SiMPle index

Procedure to calculate SiMPle and SiMPle index
Input: Two user-provided time series, A and B, and the desired subsequence length m
Output: The SiMPle PAB and the associated SiMPle index IAB
1  nB ← Length(B)
2  PAB ← infs, IAB ← zeros, idxes ← 1 : nB - m + 1
3  for each idx in idxes
4      D ← MASS(B[idx : idx + m - 1], A)
5      PAB, IAB ← ElementWiseMin(PAB, IAB, D, idx)
6  end for
7  return PAB, IAB

The method to calculate SiMPle is described in the above algorithm. The steps are as follows:

In line 1, the length of B is recorded. In line 2, we allocate memory and initialize the SiMPle PAB and the SiMPle index IAB. From line 3 to line 6, the distance profile vector D is calculated, which contains the distances between a given subsequence in time series B and each subsequence in time series A. The particular algorithm used to compute D is MASS (Mueen's Algorithm for Similarity Search), which is claimed to be the most efficient algorithm known for distance vector computation. We then perform the pairwise minimum for each element in D with the paired element in PAB (i.e., min(D[i], PAB[i]) for i = 0 to length(D) - 1). We also update IAB[i] with idx when D[i] ≤ PAB[i] as we perform the pairwise minimum operation. Finally, the results PAB and IAB are returned in line 7.

Note that the above algorithm computes SiMPle for the general similarity join. To modify it to compute the self-similarity join SiMPle of a time series A, we can simply replace B by A in lines 1 and 4 and ignore trivial matches in D when performing ElementWiseMin in line 5. The method MASS (used in line 4) is important for speeding up the similarity calculations; it has a time complexity of O(n log n).
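A minimal MATLAB transcription of the procedure in Table III is sketched below for univariate time series. The helper names simple_profile and dist_profile are ours, and plain (non-normalized) Euclidean distance is used for brevity, whereas MASS as originally described operates on z-normalized subsequences; the 12-dimensional chroma case would additionally sum the squared differences over the chroma bins.

function [P, I] = simple_profile(A, B, m)
% Sketch of the join in Table III for univariate time series given as
% column vectors. P(i) holds the distance from the i-th length-m
% subsequence of A to its nearest neighbour among the subsequences of B,
% and I(i) holds that neighbour's starting position in B.
nA = length(A);
P  = inf(nA - m + 1, 1);
I  = zeros(nA - m + 1, 1);
for idx = 1:(length(B) - m + 1)
    D = dist_profile(B(idx:idx+m-1), A);   % distances from this B-subsequence
                                           % to every subsequence of A
    better    = D < P;                     % the ElementWiseMin of line 5
    P(better) = D(better);
    I(better) = idx;
end
end

function D = dist_profile(q, t)
% FFT-based sliding Euclidean distance between the query q (length m) and
% every length-m subsequence of t -- the O(n log n) trick used by MASS.
q = q(:); t = t(:);
m = length(q); n = length(t);
qr = zeros(n, 1); qr(1:m) = q(end:-1:1);
QT = real(ifft(fft(t) .* fft(qr)));        % sliding dot products via FFT
QT = QT(m:n);
cum2  = [0; cumsum(t.^2)];
sum2T = cum2(m+1:n+1) - cum2(1:n-m+1);     % sliding sums of squares of t
D = sqrt(max(sum(q.^2) + sum2T - 2*QT, 0));
end

With this sketch, the cover song distance of the next subsection corresponds to taking the median of the returned profile for a query/candidate pair.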

SiMPle-Based Cover Song Recognition

This method proposes the use of SiMPle to measure the distance between recordings in order to identify cover songs. It exploits the fact that the global relation between the tracks is composed of many local similarities. In this way, we are able to simultaneously take advantage of both local and global pattern matching.

SiMPle obtained by comparing a cover song to its original version is composed mostly of low values. In contrast,LIBRARY two completely different songs will result in a SiMPle constituted mainly of high values. Therefore, the median value of the SiMPle is adopted as the global distance estimation. Formally, the distance between a query B and a candidate original recording A is defined in the equation below

dist(A,B) = median(SiMPle(B,A))

Note that several other summary statistics could be used instead of the median. However, the median is robust to outliers in the matrix profile. Such distortions may appear when a performer decides, for instance, to add a new segment (e.g., an improvisation or drum solo) to the song.

Structural Invariance

Structural variation is a critical concern when comparing different songs, but this method is robust to changes in structure. From a high-level point of view, SiMPle describes a global similarity outline between songs by providing information from local comparisons. This makes it largely invariant to structural variations.

• If two performances are virtually identical, except for the order and the number of repetitions of each representative excerpt (i.e., chorus, verse, bridge, etc.), all the values that compose SiMPle are close to zero.

• If a segment of the original version is deleted in the cover song, this will cause virtually no changes in the SiMPle.

• If a new feature is inserted into a cover, the consequence is a peak in the SiMPle that causes only a slight increase in its median value.

4.2 Optimal Transposition Index (OTI)

Before calculating the similarity between songs, we transpose one of them to the key of the other using the optimal transposition index (OTI). Transposing musical excerpts to a common key or tonality is a necessary feature when comparing melodies, harmonies or any tonal representation of these musical excerpts. This process is especially crucial in many music information retrieval (MIR) tasks related to music similarity, such as audio matching and alignment, song structure analysis or cover song identification.

The first step is to extract the chroma features of both songs. A chromagram is a matrix of dimensions 12xN, where N depends on the length of the song. By adding the elements in each row, we get a 12x1 vector. We do this for both songs and obtain two 12x1 vectors. We circularly shift one of the vectors, keeping the other unchanged, and multiply them element-wise. The vector is circularly shifted twelve times and multiplied each time, so we get twelve values, one for each shift. Out of these twelve values, the highest corresponds to the maximum resemblance between the two sequences. We note the number of circular shifts needed to obtain this maximum value and finally circularly shift the original sequence (chromagram) by this number.
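A minimal MATLAB sketch of this OTI computation is given below; the chromagrams are random stand-ins and the shift-direction convention is an assumption:

% Sketch of the OTI computation described above.
CA = rand(12, 300);                  % chromagram of the candidate original (stand-in)
CB = rand(12, 280);                  % chromagram of the query (stand-in)

gA = sum(CA, 2);                     % 12x1 global chroma profiles
gB = sum(CB, 2);

scores = zeros(12, 1);
for s = 0:11
    scores(s+1) = sum(gA .* circshift(gB, s));   % dot product at each circular shift
end
[~, best] = max(scores);
oti = best - 1;                      % number of semitone shifts giving maximum resemblance

CB_transposed = circshift(CB, oti, 1);   % transpose the query chromagram to A's key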


Fig. 4.1: Process of SiMPle

4.3 Process of SiMPle

• Consider two time series (chromagrams of songs) 1 and 2.

• Compare every subsequence in song 1 (S1, S2, S3) with all the subsequences in song 2 to get a column vector called the distance matrix. Therefore we get three different distance matrices, one for each of the three subsequences in song 1.

• From each distance matrix, the smallest distance and its corresponding position are selected. This distance is stored in the Matrix Profile (MP) and the corresponding position is stored in the Matrix Profile Index (MP Index).

• An MP and an MP index pair gives the overall distance of one song with another.

• MP and MP index pair is calculated for each and every song in the dataset. This forms the MP Pairs matrix.

• Finally, the distance profile matrix is computed by taking the median of all corresponding MPs in the MP pairs, as sketched below.
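The following MATLAB sketch outlines this dataset-level computation; it reuses the hypothetical simple_profile helper sketched in Section 4.1, and the song data and subsequence length are placeholders:

% Sketch of the pairwise distance computation over a dataset.
songs = {randn(500, 1), randn(450, 1), randn(520, 1)};   % stand-in time series
m     = 20;                                              % subsequence length (illustrative)

numSongs = numel(songs);
distMat  = zeros(numSongs);
for i = 1:numSongs
    for j = 1:numSongs
        P             = simple_profile(songs{i}, songs{j}, m);
        distMat(i, j) = median(P);       % the median rule of Section 4.1
    end
end

% For a query q, the candidate with the smallest distance is the most likely
% original/cover:
% [~, bestMatch] = min(distMat(q, :));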


Chapter 5: Performance Evaluation

This chapter is dedicated to a discussion of simulations performed and the results thereof. We have used MATLAB for the simulations.

5.1 Evaluation Measures

In order to assess the performance of our method, we used three commonly applied evaluation measures: Mean Average Precision (MAP), Precision at 10 (P@10) and Mean Rank of First correctly identified cover (MR1).

5.1.1 Mean Average Precision (MAP)

For a given set of queries, MAP is the mean of the average precision scores for each query. Precision is the fraction of the retrieved documents that are relevant to the user's information need:

Precision = |{relevant documents} ∩ {retrieved documents}| / |{retrieved documents}|        (1)

Average precision is the mean of the precision scores after each relevant document is retrieved. The average precision is very sensitive to the ranking of retrieval results.

MAP = (1/n) Σ_{j=1}^{n} [ Ω(r_i, j) · (1/j) Σ_{k=1}^{j} Ω(r_i, k) ]        (2)

where Ω(r_i, j) indicates whether the item at rank j for query i is relevant.

MAP is simply the mean of the average precision scores taken over all queries.
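A short MATLAB sketch of the average precision computation for a single query is shown below; the relevance vector is hypothetical, and the normalization follows the prose description above (averaging over the relevant ranks):

% Average precision for one query, given a logical relevance vector `rel`
% ordered by increasing distance (rel(j) = true if the item ranked j-th is
% a cover of the query).
rel = logical([0 1 0 0 1 0 0 0 0 1]);            % hypothetical ranked relevance list

precisionAtJ = cumsum(rel) ./ (1:numel(rel));    % precision after each rank
AP = sum(precisionAtJ(rel)) / sum(rel);          % mean over the relevant ranks

% MAP is then the mean of the AP values obtained for every query:
% MAP = mean(APperQuery);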

5.1.2 Precision at 10

In an information retrieval system that retrieves a ranked list, the top-n documents are the first n in the ranking. Precision at n is the proportion of the top-n documents that are relevant.

P@10 = (1/n) Σ_{i=1}^{n} (1/10) Σ_{j=1}^{10} Ω(r_i, j)        (3)

The cut-off (here 10) can be chosen based on an assumption about how many documents the user will view. The mean of these Precision at 10 values over all queries gives the desired result.

5.1.3 Mean Rank 1

Rank refers to the rank position of the first relevant document. The mean value of Rank for all the queries gives the Mean Rank 1.

MR1 = (1/n) Σ_{i=1}^{n} p_f(r_i)        (4)

where p_f(r_i) is the rank position of the first relevant document for query i.

The system treats the result at rank 1 as its best guess; ideally, the Mean Rank 1 value should therefore be one.
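Both P@10 and MR1 can be computed from a per-query relevance matrix, as in the following MATLAB sketch with hypothetical data:

% Precision at 10 (Eq. 3) and Mean Rank of the first correct cover (Eq. 4),
% given a logical matrix relAll where relAll(i, j) is true if the j-th
% ranked result for query i is a relevant cover.
relAll = logical([0 1 0 0 1 0 0 0 0 1;       % hypothetical results, 3 queries
                  1 0 0 0 0 0 0 0 0 0;
                  0 0 0 1 0 0 0 0 0 0]);

Pat10 = mean(sum(relAll(:, 1:10), 2) / 10);  % mean over queries of P@10

firstRank = zeros(size(relAll, 1), 1);
for i = 1:size(relAll, 1)
    firstRank(i) = find(relAll(i, :), 1);    % rank of the first relevant item
end
MR1 = mean(firstRank);                        % ideally equal to 1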

5.2 Evaluation of dataset using DTW and SiMPle

5.2.1 Covers80 Dataset

The Covers80 dataset consists of 80 original songs [14] and 84 cover songs (approximately one cover per original), for a total of 164 songs with an average duration of about 3 minutes. These songs were classified into a train set and a test set. After extracting the CENS features for each song, we applied DTW and Mueen's algorithm (MASS) respectively to calculate the distance between every pair of songs (one from the train set and the other from the test set). The lowest distance between any two songs implies that one song is the cover of the other.

5.2.2 YouTube Covers Dataset

This dataset has 350 recordings, of which 100 songs belong to the train set and the remaining 250 songs to the test set. We downloaded this dataset [15] from the internet. Only the chromagrams of these 350 songs are available, without much other information about the songs.

5.2.3 Covers 2018

For our project we created our own dataset (Covers 2018). Covers 2018 is a collection of 112 songs, of which 50 are originals and the rest are covers. On average there is one cover for each original song. This dataset includes both Hindi and English songs, as well as mash-ups, live performances, etc.

Fig. 5.1: Covers 2018

Distance Calculation Using DTW

Fig. 5.2: Distance profile of a song with itself

Alan Walker: Faded (with itself)

Distance = 0

Fig. 5.3: Distance profile of a song with its cover

Alan Walker: Faded

with

Sara Farell: Faded

Distance = 195.2301

Fig. 5.4: Distance profile of a song with a random song

Alan Walker: Faded

with

Camila Cabello: Havana

Distance = 349.6295

Distance Calculation Using SiMPle

Fig. 5.5: Distance profile of a song with itself

Alan Walker: Faded (with itself)

Distance = 6.6613e-15


Fig. 5.6: Distance profile of a song with its cover

Alan Walker: Faded

With

Sara Farell: Faded

Distance = 6.1022


Fig. 5.7: Distance profile of a song with a random song

Alan Walker: Faded

With

Camila Cabello: Havana

Distance = 12.2785


Fig. 5.8: Distance profile of the Covers 2018 dataset using DTW


Fig. 5.9: Distance profile of the Covers 2018 dataset using SiMPle

After evaluating for precision and accuracy, the results for our system are as follows:

Table IV: System Evaluation Results using DTW

Evaluation Measures       YouTube Covers dataset    Covers 2018 dataset
Mean Average Precision    0.4249                    0.7455
Precision at 10           0.1144                    0.1830
Mean Rank 1               11.6920                   13.4286

Table V: System Evaluation Results using SiMPle

Evaluation Measures       YouTube Covers dataset    Covers 2018 dataset
Mean Average Precision    0.4128                    0.7765
Precision at 10           0.1196                    0.1955
Mean Rank 1               10.4080                   7.5089

5.3 Result Analysis

Firstly, we observe that DTW has a much higher time complexity than SiMPle. For cover song identification to be automatic and real-time, we cannot afford to use DTW due to its high time complexity. Also, the precision and accuracy provided by DTW are unsatisfactory. In order to overcome these limitations, SiMPle uses MASS for its distance calculation. The performance of SiMPle is much better than that of DTW, as is evident from the results obtained.

By comparing the precision and accuracy obtained for the above three datasets, we see that the results are much better for the Covers 2018 dataset. Changes in vocals (male/female), tempo, beats, lyrics, structure, timing, etc. are some of the facets that act as a barrier in the cover song identification task. Using SiMPle, we aim to make cover song identification invariant to these changing facets. This method could detect cover songs in scenarios like mash-ups, live performances, changes in vocals, absence of instruments and, in a few cases, when completely different instruments were used.

5.4 Application User Interface


Fig. 5.10: GUI guide

The GUI layout is shown below. It consists of the following:

1. Music List: It is a push button which displays the entire list of songs when pressed

2. List Box: When ‘Music List’ is pressed the songs are displayed in the list box

3. Play: It is a push button which can be used to play a particular song

4. Pause/Resume: It is a push button which can be used to pause a song which is currently being played. This same button can be used to resume a song which is paused.

5. Stop: It is a push button which is used to stop a song which is currently being played.

6. Cover Songs: In case, the user wishes to get a list of cover songs, this push button is pressed.


Fig. 5.11: Cover song identifier (user interface)

On pressing the 'Music List' button, the entire list of songs is displayed.


Fig. 5.12: Display of music list

By selecting a song from the list, we can play that particular song by pushing the ‘Play’ button. The name of the song is displayed as well.


Fig. 5.13: Playing ‘Havana’ song

To get a list of cover songs for the selected track, push the 'Cover Songs' button.


Fig. 5.14: Searching for cover songs

Finally the GUI displays all the cover songs of the selected track.


Fig. 5.15: Display of list of cover songs

Once all the covers of the selected song are displayed, the user can select any track from the list and play it.

Chapter 6: Conclusion and Future work

6.1 Conclusion

The proposed method exploits a new data structure called the 'Similarity Matrix Profile'. This allows a space-efficient representation of the similarity join matrix between subsequences. Recent optimizations in FFT-based all-neighbor search are used, which allow the matrix profile to be computed quickly and efficiently. From the results it is evident that SiMPle is much better than DTW in terms of precision and accuracy. SiMPle uses MASS, which is claimed to be the fastest distance calculation algorithm.

Intuitively, we should expect that the SiMPle obtained by comparing a cover song to its original version is composed mostly of low values. In contrast, two completely different songs will result in a SiMPle constituted mainly of high values. But it may happen that two completely different songs have a very similar patch (e.g. silence at the beginning of a song), or that a cover and its original have a patch of large dissimilarity (e.g. a mash-up of cover songs), which would result in ambiguity. For this reason, the median value of the SiMPle is adopted, which helps to remove this ambiguity. Using this method we get better results as compared to DTW, but there is still scope for improvement.

6.2 Future work

In the broader context of musical similarity, an interesting future direction would be the situation in which the input is a stream of audio and the output is a sorted list of similar objects in a database (not necessarily covers). SiMPle cannot be directly used to identify regions where several similar subsequences are next to each other; for this reason, an alternative is to measure the impact of the reduction in the amount of information in different tasks. Incorporating additional information into SiMPle, without loss of time and space efficiency, can also be explored. Beyond many conceptual open issues, there are still some technical aspects that deserve effort to improve the efficiency of a system. Firstly, perfecting a music processing system requires careful examination and analysis of errors. When errors are patterned, they can reveal specific deficiencies or shortcomings in the algorithm; this kind of in-depth analysis is lacking. Secondly, achieving a robust, scalable, and efficient method is still an issue.

It is notable that systems achieving the highest accuracies are quite computationally expensive, while fast retrieval systems fail to recognize many of the cover songs a music collection might contain. There exists a trade-off between a system's accuracy, efficiency and time complexity. However, these and many other technical as well as conceptual issues can be solved in the years to come.


Appendix A

MATLAB

MATLAB is a high-performance language for technical computing. It integrates computation, visualization, and programming in an easy-to-use environment where problems and solutions are expressed in familiar mathematical notation. Typical uses include Math and computation; Algorithm development; Modeling, simulation, and prototyping; Data analysis, exploration, and visualization; Scientific and engineering graphics; Application development, including Graphical User Interface building; etc.

MATLAB is an interactive system whose basic data element is an array that does not require dimensioning. This allows you to solve many technical computing problems, especially those with matrix and vector formulations. The name MATLAB stands for matrix laboratory. In industry, MATLAB is the tool of choice for high-productivity research, development, and analysis. It is therefore one of the most useful technical computing languages, with a large set of standard toolboxes.

MATLAB features a family of application-specific solutions called toolboxes. Very important to most users of MATLAB, toolboxes allow you to learn and apply specialized technology. Toolboxes are comprehensive collections of MATLAB functions (M-files) that extend the MATLAB environment to solve particular classes of problems. Areas in which toolboxes are available include signal processing, control systems, neural networks, fuzzy logic, wavelets,LIBRARY simulation, and many others.

Chroma Toolbox

The Chroma Toolbox [2] has been developed by Meinard Müller and Sebastian Ewert. It contains MATLAB implementations for extracting various types of novel pitch-based and chroma-based audio features. The Chroma Toolbox basically helps in extracting various musically meaningful features from waveform-based audio signals. In particular, it contains feature extractors for pitch features as well as parameterized families of variants of chroma-like features.

The different chroma variants that can be extracted using this chroma toolbox are as follows:

1] Chroma Pitch (CP)

2] Chroma Log Pitch (CLP)

3] Chroma Energy Normalized Statistics (CENS)

4] Chroma DCT-Reduced log Pitch (CRP)

The main advantage of using this chroma toolbox is that for a particular input audio signal we get a number of chroma variants. We can compare all these chroma variants and find the best possible match based on our requirements.

For the Chroma Toolbox the MATLAB Signal Processing Toolbox is required. Some of the MATLAB functions contained in the chroma toolbox are as follows.

Table VI: Some MATLAB Functions in the Chroma Toolbox

Filename                             Main Parameters                        Description
wav_to_audio.m                       ---                                    Converts WAV files into the expected audio format.
estimateTuning.m                     pitchRange                             Estimation of the filterbank shift parameter σ.
audio_to_pitch_via_FB.m              winLenSTMSP                            Extraction of pitch features from audio data.
pitch_to_chroma.m                    applyLogCompr, factorLogCompr = η      Derivation of CP and CLP features from pitch features.
pitch_to_CENS.m                      winLenSmooth = w, downsampSmooth = d   Derivation of CENS features from pitch features.
pitch_to_CRP.m                       coeffsToKeep = n, factorLogCompr = η   Derivation of CRP features from pitch features.
visualizePitch.m                     featureRate                            Visualization of pitch features.
visualizeChroma.m                    featureRate                            Visualization of chroma features.
generateMultiratePitchFilterbank.m   ---                                    Generation of filterbanks (used in audio_to_pitch_via_FB.m).
smoothDownsampleFeature.m            winLenSmooth = w, downsampSmooth = d   Post-processing of features: smoothing and downsampling.

Parameter info

1. The struct sideinfo returns the meta information about the WAV file.

2. The struct paramPitch is used to pass optional parameters to the feature extraction function. If some parameters or the whole struct are not set manually, then meaningful default settings are used.

3. The parameter winLenSTMSP specifies the window length in samples.

4. The logarithmic compression is activated using the parameter applyLogCompr. The compression level is specified by the parameter factorLogCompr, which corresponds to the parameter η.

5. For computing the CRP features, the main parameter is denoted by n and corresponds to the lower bound of the range specified by the parameter coeffsToKeep.

6. The function smoothDownsampleFeature has two main parameters, winLenSmooth (denoted by w) and downsampSmooth (denoted by d). They are used for computing the CENS variant; a typical extraction chain using these functions is sketched below.
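A typical CENS extraction chain using these functions might look as follows; the file name, parameter values and exact argument conventions are assumptions based on the toolbox documentation and may need adjusting to the installed version:

% Sketch of CENS feature extraction with the Chroma Toolbox functions of
% Table VI (argument order follows the toolbox demo scripts; placeholders).
[f_audio, sideinfo] = wav_to_audio('', 'audio/', 'song.wav');

shiftFB = estimateTuning(f_audio);                 % filterbank shift parameter

paramPitch.winLenSTMSP = 4410;                     % window length in samples
paramPitch.shiftFB     = shiftFB;
[f_pitch, sideinfo] = audio_to_pitch_via_FB(f_audio, paramPitch, sideinfo);

paramCENS.winLenSmooth   = 41;                     % smoothing window w
paramCENS.downsampSmooth = 10;                     % downsampling factor d
[f_CENS, sideinfo] = pitch_to_CENS(f_pitch, paramCENS, sideinfo);

paramVis.featureRate = 2;                          % frames per second after downsampling
visualizeChroma(f_CENS, paramVis);                 % quick sanity check of the features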

Application User Interface

The graphical user interface is a type of user interface that allows users to interact with electronic devices through graphical icons and visual indicators such as secondary notation, instead of text-based user interfaces, typed command labels or text navigation.

To create a MATLAB GUI interactively we used GUIDE. GUIDE (GUI Development Environment) provides tools to design user interfaces for custom apps. Using the GUIDE Layout Editor, you can graphically design your UI. GUIDE then automatically generates the MATLAB code for constructing the UI, which you can modify to program the behavior of your app. We can add dialog boxes, user interface controls (such as push buttons and sliders) and containers (such as panels and button groups).

We have developed a Graphical User Interface using MATLAB for demonstrating the working of our project. GUIs provide point-and-click control of software applications, eliminating the need to learn a language or type commands in order to run applications. MATLAB apps are self-contained MATLAB programs with GUI front ends

that automate a task or calculation. The GUI typically contains controls such as menus, toolbars, buttons, and sliders.

Basic components of GUI include:

• Icons: Small pictures that represent commands, files, or windows. By moving the pointer to the icon and pressing a mouse button, you can execute a command or convert the icon into a window.

• Push Button: It has a textual label and is designed to invoke an action when pushed.

• List Box: It is a component that defines a scrollable list of text items.

• Text Field: It is a component that implements a single line of text.

• Panels: It is a container for grouping together UI components.

• Figure Window: It is a container for graphics or user interface components (a minimal sketch of some of these components is given below).
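A minimal programmatic MATLAB sketch of some of these components (push button, list box, static text) is shown below; in our project the equivalent layout was generated with GUIDE, and the song names and callback here are placeholders:

% Minimal programmatic GUI sketch with a list box, a static text field and
% a push button (placeholder content; not the GUIDE-generated project code).
fig = figure('Name', 'Cover Song Identifier', 'MenuBar', 'none', ...
             'NumberTitle', 'off', 'Position', [200 200 480 260]);

songList = uicontrol(fig, 'Style', 'listbox', ...
    'String', {'Faded', 'Havana', 'Closer'}, ...           % placeholder music list
    'Position', [20 60 200 160]);

infoText = uicontrol(fig, 'Style', 'text', ...
    'String', 'Select a song and press Play', ...
    'Position', [240 180 220 30]);

uicontrol(fig, 'Style', 'pushbutton', 'String', 'Play', ...
    'Position', [240 60 80 30], ...
    'Callback', @(src, evt) set(infoText, 'String', 'Playing selected song...'));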

COVERS 2018

Sr.no Song Name Artists Duration

Rahat Fateh Ali Khan, Momina 1 Afreen Afreen 0:10:05 Mustehsan 2 Apologize Timbaland, OneRepublic 0:03:09 3 Apologize (Cover) Christina Grimmie 0:03:22 4 Apologize (Cover) Victoria & The Islanders 0:03:45 5 Blank Space LIBRARYTaylor Swift 0:04:32 6 Blank Space (Cover) Vidya Vox 0:03:11 7 Blank Space (Cover ) J. Fla 0:02:58 8 Call Me Maybe Carly Rae Jepsen 0:03:19 9 Call Me Maybe J. Fla 0:02:32 SFIT10 Chandelier Sia 0:03:51 11 Chandelier (Cover) J. Fla 0:03:20 12 Cheap Thrills Sia 0:03:37 Vidya Vox, Shankar Tucker & 13 Cheap Thrills (Cover) 0:02:57 Akshaya Tucker 14 Closer The Chainsmokers Ft Halse 0:04:21

55 Samarth Swarup, The CLOSER - AFREEN (Mash- 15 Chainsmokers , Rahat Fateh Ali 0:03:54 up Cover) Khan The Chainsmokers, Halsey Boyce 16 Closer (Cover) 0:03:54 Avenue, Sarah Hyland 17 Closer (Cover) J. Fla 0:02:36 Casey Breves Vidya Vox ft 18 Closer - kabira (Cover) 0:03:10 mashup co. 19 Counting Stars One Republic 0:04:17 20 Counting Stars (Cover) J. Fla 0:04:17 21 Don't Let Me Down The Chainsmokers, Daya 0:03:28 22 Don't Let Me Down Vidya & KHS 0:03:28 23 Faded Alan Walker 20180:03:32 24 Faded (Cover) Sara Farell 0:03:12 25 Firework Katy Perry 0:03:53 26 Firework (Cover) Avery 0:03:36 27 Fix You Coldplay 0:04:55 David Guetta, Coldplay One 28 Fix You vs Apologize 0:04:53 Republic, Otto Knows 29 Happier Ed Sheeran 0:03:21 Happier (Live Performance- 30 Ed Sheeran 0:03:41 Cover) 31 Havana Camila Cabello 0:03:36 32 Havana (Cover) LIBRARYAndie Case 0:02:57 33 Havana (Cover) Boy band 0:03:34 34 How Deep Is Your Love Calvin Harris, Disciples 0:04:20 How Deep Is Your Love & 35 This is what you came for ( J. Fla Mashup Cover) 0:02:32 Iktara - Closer - Afreen 36 DJ Harshal 0:03:23 SFIT(Cover) DJ Khaled, Justin Bieber, Quavo, 37 I'm the One 0:05:21 Chance, Lil Wayne 38 I'm The One (cover) Gen Halilintar 0:05:04 Kygo, Selena Gomez, Casey 39 It Ain't Me 0:04:01 Breves Vidya Vox

56 40 It Ain't Me (Cover) J. Fla 0:02:28 41 Just A Dream Nelly 0:04:02 42 Just A Dream Sam Tsui & Christina Grimmie 0:04:29 43 Kabira Arjit Singh 0:04:11 44 Kya Hua Tera Wada Mohammed Rafi 0:02:56 45 Kya Hua Tera Wada (Cover) Pranav Chandran 0:04:23 46 Let Me Love You DJ Snake, Justin Bieber 0:03:25 47 Let me love you _ tum hi ho Vidya vox 0:03:39 48 Love Me Like You Do Ellie Goulding 0:04:09 Love Me Like You Do _ 49 Vidya Vox 0:03:21 Hosanna (Cover) 50 Love You Like A Love Song Selena Gomez & The Scene 20180:03:40 Love You Like A Love Song 51 J. Fla 0:02:31 (Cover) 52 Paradise Coldplay 0:04:20 53 Paradise (Cover) Coldplay 0:05:35 54 Paris The Chainsmokers 0:03:42 55 Paris (Cover) J. Fla 0:02:55 56 Pehla Nasha Udit Narayan, Sadhana Sargam 0:04:29 57 Pehla Nasha (Cover) Siddharth Slathia 0:03:30 58 Pehla Nasha (Cover) Amrita Nayak 0:03:54 59 Perfect Ed Sheeran 0:04:39 60 Perfect LIBRARYOne Direction Band 0:03:08 61 Perfect (Cover) Tiffany Alvord & Chester 0:04:47 62 Perfect (Cover) GAC & KHS 0:03:48 63 Photograph Sam Tsui & KHS 0:03:08 64 Photograph Ed Sheeran 0:03:48 65 Price Tag Jessie J, B.O.B 0:04:06 SFIT66 Price Tag (Cover) J. Fla 0:02:43 67 Roar (Cover) J. Fla 0:02:37 68 Roar (Official) Katy Perry 0:04:29 Clean Bandit, Sean Paul & Anne- 69 Rockabye 0:04:13 Marie

57 70 Rockabye (Cover) J. Fla 0:02:30 71 See You Again Wiz Khalifa, Charlie Puth 0:03:57 72 See You Again (Cover) Cimorelli, The Joh 0:07:03 73 Shape of You Ed sheran 0:03:33 74 Shape Of You (Cover) J. Fla 0:02:53 75 Side To Side Ariana Grande ,Nicki Minaj 0:03:57 76 Side To Side (Cover) J. Fla 0:02:19 77 Something Just Like This The Chainsmokers & Coldplay 0:03:37 Something Just Like This 78 J. Fla 0:02:34 (Cover ) Something Just Like This 79 Romy Wave 0:02:25 (Cover ) 80 Summer_of_69 Bryan_Adams 20180:04:33 81 Summer_of_69 (Cover) Mxpx band 0:04:33 82 Symphony Zara Larsson 0:04:06 83 Symphony (Cover) J. Fla 0:02:24 84 Tera Jaisa Yaar Kahan DJ Aqeel 0:06:05 85 Tere Jaisa Yaar Kahan Kishore Kumar 0:03:34 86 Tere Jaisa Yaar Kahan Rahul Jain 0:04:06 Tere Jaisa Yaar Kahan 87 Suryaveer (Cover) 0:03:17 88 The Greatest Sia 0:05:51 89 The Greatest (Cover) J. Fla 0:02:04 90 The Humma SongLIBRARY A.R. Rahman 0:01:33 91 The Humma Song (Cover) Badshah, Tanishk Bagchi 0:03:15 92 The Spectre Alan Walker 0:03:26 93 The Spectre (Cover) J. Fla 0:02:50 94 Thunder Imagine Dragons 0:03:24 SFIT95 Thunder (Cover) J. Fla 0:01:50 96 We Found Love Rihanna ,Calvin Harris 0:04:35 97 We Found Love (Cover) J. Fla 0:03:19 98 What About Us Pink 0:05:22 99 What About Us (Cover) J. Fla 0:03:47

58 100 Where Have You Been Rihanna 0:04:28 Where Have You Been 101 J. Fla 0:02:07 (Cover) 102 Wolves Selena Gomez, Marshmello 0:03:32 103 Wolves (Cover) J. Fla 0:02:46 104 This Is What You Came For Calvin Harris, Rihanna 0:03:59 105 Hymn_For_The_Weekend Coldplay 0:04:20 Hymn_For_The_Weekend 106 Anurag Mohn 0:03:43 (Cover)

107 Million Voices Otto knows 0:05:57


Appendix B

Timeline Chart of the Project

TIMELINE CHART FOR SEMESTER VII (JULY - OCTOBER)

WORK TASKS

1. PROBLEM DEFINITION
   • Search for topics
   • Identify the goal of the project

2. PREPARATION
   • Study of IEEE papers related to our project
   • Study of basics of music theory
   • Study of various topics and terminologies pertaining to our project
   • Study of Chromagram and various algorithms to find distance
   • Implementation of Chroma Toolbox

3. EXECUTION OF THE PROJECT
   • Implemented various datasets using DTW and evaluated the results

TIMELINE CHART FOR SEMESTER VIII (JANUARY - APRIL)

WORK TASKS

1. EXECUTION OF THE PROJECT
   • Implemented SiMPle on various datasets
   • Created our own dataset for the project
   • Implemented SiMPle on our dataset

2. GUI
   • Design of final GUI

3. BLACKBOOK & DOCUMENTATION
   • Black book
   • Presentation & Documentation


References

[1] Joan Serrà, Emilia Gómez, and Perfecto Herrera, "Transposing chroma representations to a common key," IEEE CS Conference on The Use of Symbols to Represent Music and Multimedia Objects, pp. 45–48, 2008.

[2] D. F. Silva, C.-C. M. Yeh, G. E. A. P. A. Batista, E. Keogh. “Supporting website for this work”, url: http://sites.google.com/site/ismir2016simple/ (accessed 24 May, 2016).

[3] Meinard Müller and Sebastian Ewert, "Chroma Toolbox: MATLAB Implementations for Extracting Variants of Chroma-Based Audio Features," International Society for Music Information Retrieval Conference, 2011.

[4] Mueen, K. Viswanathan, C. K. Gupta and E. Keogh. “The fastest similarity search algorithm for time series subsequences under Euclidean2018 distance”, url: www.cs.unm.edu/~mueen/FastestSimilaritySearch.html (accessed 24 May, 2016).

[5] Daniel P. W. Ellis and Graham E. Poliner, "Identifying 'Cover Songs' with Chroma Features and Dynamic Programming Beat Tracking," MIREX 2006 Audio Beat Tracking Contest system description, 2006.

[6] D. F. Silva, V. M. A. Souza, and G. E. A. P. A. Batista. “Music shapelets for fast cover song recognition”. International Society for Music Information Retrieval Conference, pp. 441–447, 2015.

[7] Joan Serrà and Emilia Gómez, "Audio Cover Song Identification Based on Tonal Sequence Alignment," May 2008.

[8] Alessio Degani, Marco Dalai, Riccardo Leonardi and Pierangelo Migliorati, "A Heuristic for Distance Fusion in Cover Song Identification," 2013.

[9] Samuel Kim, Erdem Unal, and Shrikanth Narayanan, "Music Fingerprint Extraction for Classical Music Cover Song Identification," 2008.

[10] Maksim Khadkevich, Fondazione Bruno Kessler - irst, "Large-Scale Cover Song Identification Using Chord Profiles," 2013.

[11] Thierry Bertin-Mahieux and Daniel P. W. Ellis, "Large-scale cover song recognition using hashed chroma landmarks," November 2011.

[12] Prem Seetharaman, Zafar Rafii, "Cover song identification with 2D Fourier Transform sequences," June, 2017.

[13] Diego F. Silva, Chin-Chia M. Yeh, Gustavo E. A. P. A. Batista, Eamonn Keogh, "SiMPle: Assessing Music Similarity Using Subsequences Joins," International Society for Music Information Retrieval Conference, 2016.

[14] https://labrosa.ee.columbia.edu/projects/coversongs/covers80/

[15] D. F. Silva, V. M. A. Souza, and G. E. A. P. A. Batista. “Music shapelets for fast cover song recognition”. International Society for Music Information Retrieval Conference, pp. 441–447, 2015.
