Masaryk University Faculty of Informatics

Sheet Music Search

Bachelor’s Thesis

Jan-Sebastian Fabík

Brno, Spring 2018

MASARYK UNIVERSITY Faculty of Informatics

BACHELOR'S THESIS ASSIGNMENT

Student: Jan-Sebastian Fabík

Programme: Informatics

Field of study: Computer Systems and Data Processing

Field guarantor: prof. RNDr. Václav Matyáš, M.Sc., Ph.D. (PSZD)

Advisor: doc. RNDr. Aleš Horák, Ph.D.

Department: Department of Machine Learning and Data Processing

Thesis title: Vyhledávání notových zápisů

Thesis title in English: Sheet Music Search

Assignment: Sheet music of various kinds is freely available on the Internet, but there is no suitable way to search for it. The aim of the thesis is to design and implement a metasearch engine able to filter the results of standard search engines and recognize sheet music in the form of images and PDF documents. Based on machine learning, the designed system must recognize whether a result is sheet music, and it must use an OCR tool to determine the textual information of the piece (title, author, ...).

The resulting thesis will consist of a written part describing existing techniques and projects focused on the recognition of (metadata of) sheet music and on sheet music search, followed by the analysis, the design of the solution, and the description and evaluation of the implementation, as well as a working implemented system.

Literature: Chamberlain, A., & Crabtree, A. (2016). Searching for music: understanding the discovery, acquisition, processing and organization of music in a domestic setting for design. Personal and Ubiquitous Computing, 20(4), 559-571.

The International Society for Music Information Retrieval, www.ismir.net

The assignment was approved via IS MU.

Author's Declaration for a School Work

Name, surname and UČO of the student:

I acknowledge that, under the law (§ 35 (3) and (4) of Copyright Act No. 121/2000 Coll.), Masaryk University may use my qualification thesis or other school work that I created as an author in order to fulfil my study obligations towards this university, for teaching purposes or for its own internal needs, not for the purpose of direct or indirect commercial or other economic profit. The internal needs of Masaryk University are understood to include the use of the work not only in its original form but also in a processed or otherwise modified form, including such use of my work by this university that consists in assigning my school work for further processing to another student of this university (a member of the same academic community) in order to create another qualification thesis or other school work derived from my work, provided my authorship, the title of my work and the source are stated; all of this in accordance with the development of education at Masaryk University and the interest of this public university in building on the results of my work and further developing my school work within the same academic community. I am obliged to inform the university, through the Office for Studies, of circumstances deserving special consideration on my part, for example my interest in further developing my qualification thesis at Masaryk University or elsewhere, at the latest when submitting the qualification thesis.

In Brno on ______ student's signature


Declaration

Hereby I declare that this paper is my original authorial work, which I have worked out on my own. All sources, references, and literature used or excerpted during elaboration of this work are properly cited and listed in complete reference to the due source.

Jan-Sebastian Fabík

Advisor: doc. RNDr. Aleš Horák, Ph.D.


Acknowledgements

I would like to thank my advisor, doc. RNDr. Aleš Horák, Ph.D., for his helpful guidance.

Abstract

Various sheet music is freely available online, but there is no suitable search tool for it. The aim of this thesis is to design and implement a metasearch engine which filters and recognizes sheet music among the search results of standard image search engines. The designed system uses convolutional neural networks and deep residual networks to recognize whether a given result is sheet music, and uses OCR to determine the textual information of the tracks (the track title and artist name).

Keywords

classification, convolutional neural networks, deep residual networks, OCR, sheet music identification, web application, search engine


Contents

1 Introduction
  1.1 Motivation
  1.2 A brief description of the system
    1.2.1 Image classifiers
    1.2.2 Metadata extractor
    1.2.3 Term search
    1.2.4 Web application
  1.3 Structure of the thesis

2 State of the art
  2.1 Sheet music metasearch engines
  2.2 Catalogs and music databases
  2.3 Optical music recognition (OMR)
  2.4 Sheet music identification
  2.5 Optical character recognition (OCR)
  2.6 Feedforward neural networks
    2.6.1 Architecture
    2.6.2 Activation functions
    2.6.3 Cost function
    2.6.4 Stochastic gradient descent algorithm (SGD)
    2.6.5 Back-propagation algorithm
    2.6.6 Regularization
  2.7 Convolutional neural networks
    2.7.1 Convolution operation
    2.7.2 Architecture
    2.7.3 Convolution stage
    2.7.4 Detector stage
    2.7.5 Pooling stage
  2.8 Deep residual networks

3 Image Classifiers
  3.1 Dataset
    3.1.1 Downloading dataset
    3.1.2 Manual classification of images
    3.1.3 Dataset storage
    3.1.4 Preprocessing
  3.2 Neural network architectures
    3.2.1 Convolutional neural networks
    3.2.2 Deep residual networks
  3.3 Implementation
    3.3.1 Format of the HTTP API
    3.3.2 Command line interface
  3.4 Test results
    3.4.1 Sheet music classification
    3.4.2 Watermark detection
    3.4.3 Title detection

4 Metadata extraction
  4.1 Database of track titles and artists
    4.1.1 Normalization
  4.2 Implementation
    4.2.1 Text entries recognition
    4.2.2 Text entries processing
    4.2.3 Weight calculation
  4.3 Search with an error tolerance
    4.3.1 Levenshtein distance
    4.3.2 Prefix trees
    4.3.3 Searching for strings in a prefix tree by the Levenshtein distance from a given string
    4.3.4 Implementation
  4.4 Test results
    4.4.1 Text recognition
    4.4.2 Metadata recognition

5 Conclusion
  5.1 Image classification
  5.2 Metadata extraction
  5.3 Web application

A System Implementation
  A.1 System components
  A.2 Web application
    A.2.1 Controllers
    A.2.2 Command line interface
  A.3 Deployment
  A.4 Installation instructions

Bibliography


List of Tables

3.1 Category sizes
3.2 Categories included in the classes
3.3 Parameters of the ConvNets with topology A
3.4 Parameters of the ConvNets with topology B
3.5 Modified dropout rates of the ConvNets
3.6 Accuracy of the models evaluated on the test set
3.7 Accuracy of the ConvNets with modified dropout rates
4.1 Metadata recognition statistics


List of Figures

2.1 The rectified linear function
2.2 The logistic sigmoid function
2.3 An example of 2-D convolution (based on Figure 9.1 from Deep Learning by Ian Goodfellow et al. [28])
2.4 Structure of a convolutional neural network (source: Mathworks.com [29])
2.5 Training error (left) and test error (right) on the CIFAR-10 dataset with 20-layer and 56-layer networks (source: Deep Residual Learning for Image Recognition, Figure 1 [30])
2.6 Residual learning: building block (source: Deep Residual Learning for Image Recognition, Figure 2 [30])
3.1 Manual classification application UI
3.2 Examples of images by categories (cropped)
3.3 Resize modes
3.4 Sample handwritten sheet music
3.5 ConvNet topologies
3.6 Error and loss of convnetA2 and convnetA3 on the sheet music dataset
3.7 Sample images incorrectly marked as sheet music
3.8 Sample images incorrectly marked as watermarked
3.9 Sample images with incorrectly detected title
4.1 The structure of the database tables for the Discogs.com dataset
4.2 Sample prefix tree
4.3 Examples of images not recognized by the OCR tool
4.4 Examples of images with incorrectly recognized metadata
A.1 Search form and results page
A.2 Deployment diagram


1 Introduction

1.1 Motivation

The primary aim of this thesis is to design and implement a metasearch engine for sheet music freely available on the Internet. As of April 2018, there is no specialized search engine for freely available sheet music. People who want to find sheet music online typically use standard search engines and append the phrase “sheet music” to the query. The search results sometimes include images that are not sheet music or contain a different track than the user expected. A metasearch engine designed directly for sheet music search would help them find the most relevant results faster.

1.2 A brief description of the system

The system designed and implemented in this thesis provides image classifiers, a metadata extractor, an application for term search parameterized by the Levenshtein distance, and a web application with a search form.

1.2.1 Image classifiers

The classifiers classify images by three criteria: whether they contain sheet music, whether they are watermarked, and whether they contain a heading with a track title. They are implemented using neural networks. I experimented with convolutional neural networks and deep residual networks. For the training, I downloaded a dataset of 13,000 images from an image search engine and manually assigned them to the corresponding classes. After training, I evaluated the models by the accuracy rate on the test set.

1.2.2 Metadata extractor

The metadata extractor uses an OCR tool to extract text from the input image. Then, it compares the lines from the OCR output against a database of track titles and artist names. As the OCR tool is susceptible to noise, the database is searched not only for an exact match but also for the entries closest to the recognized text, i.e., the terms with the lowest Levenshtein distance from it.

1.2.3 Term search

The term search application is utilized by the metadata extractor for searching the database of track titles and artist names. The application loads a database of search terms and provides an API for finding the term with the lowest Levenshtein distance from a given query. The database of terms is stored in a prefix tree, which allows an efficient implementation of the search operation using a dynamic programming algorithm.

1.2.4 Web application

The web application provides a search form. After the form is submitted, a query is sent to an image search engine. The thumbnail images from the results are then classified using the image classifiers. The images classified as sheet music are displayed on the results page along with their metadata. The metadata is extracted from the full-size version of the images that contain a heading with a track title. Watermarked images are sometimes difficult to read; therefore, they are displayed at the end of the results page.

1.3 Structure of the thesis

Chapter 2 summarizes the state of the art in optical music recognition, sheet music identification, optical character recognition, feedforward neural networks, convolutional neural networks, and deep residual networks. Chapter 3 describes the downloading and manual classification of the dataset, the architecture of the image classifiers, the implementation, and the analysis of the test results. The techniques used for metadata extraction are explained in Chapter 4. The chapter also covers the implementation and evaluates the test results.


Chapter 5 formulates the conclusion and proposes directions for further research and improvement. Appendix A describes the implementation of the web application, and the architecture and deployment of the whole system.


2 State of the art

2.1 Sheet music metasearch engines

Several existing projects are focused on sheet music metasearch [1]. One of them is the Sheet Music Consortium [2]. This project consolidates multiple digital sheet music collections, mostly from universities and libraries, and provides a search tool for them. The system loads the data from several data sources using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) [3] and stores them in a database, which is utilized by the search engine. The database currently contains 226,914 items from 22 collections [4]. The service OAIster [5] also provides a searchable catalog of records aggregated from libraries that implement the OAI-PMH protocol. Its database contains 50 million records from more than 2,000 sources; however, not all of the records are sheet music. The search results of various online search engines for the term “sheet music metasearch engine” contain links to several tools that are not actual metasearch engines, but only redirect the queries to standard search engines [6, 7]. For example, the tool Musicedmagic.com [8] redirects the users to Google search results. Another website, Looksheetmusic.com [9], contains only an inline frame with the sheet music catalog Sheetmusicplus.com [10]. Hence, it seems there is no publicly available sheet music metasearch engine on the Internet that would filter and recognize sheet music from the search results of standard search engines.

2.2 Catalogs and music databases

However, there are catalogs of sheet music available online. For example, the websites Musicnotes.com [11] and Sheetmusicplus.com [10] provide interfaces for sheet music search by artist name and track title. It is possible to preview the sheet music found and purchase a copy in a printable format.

In addition to the search functionality, the catalogs provide alphabetic lists of artists and tracks. In this thesis, such lists make it possible to compare the recognized text against a database and automatically correct minor mistakes in it. For this purpose, online music databases can be used, e.g., the web services AllMusic.com [12], Discogs.com [13] and Last.fm [14].

For the automatic correction of mistakes, it is necessary to use a music database with either a search functionality supporting automatic spelling mistake correction, or a possibility to download the whole database of artists and tracks. In a local database, it is possible to find the most similar entry for the recognized text. The web service Discogs.com provides XML datasets with artists and releases [15]. The dataset from March 2018 contains approximately 5.5 million artists and 83 million tracks.

2.3 Optical music recognition (OMR)

Research on optical music recognition was already conducted in the late 1960s at the Massachusetts Institute of Technology [16, 17]. A paper focused on new approaches to OMR was published at the International Society for Music Information Retrieval Conference in 2011 [18]. It states that OMR is a much more challenging problem than OCR and proposes a model-based top-down recognition approach inspired by modern OCR techniques. Several publications covering the state of the art of OMR have been published [19, 20, 21]. All of them describe OMR as a process consisting of the following steps:

∙ image preprocessing
∙ detection of staves and their removal from the image
∙ location and recognition of notes and other symbols on the staves
∙ semantic analysis of the notes
∙ conversion of the sheet music to a computer-readable format

The Handbook of Document Image Processing and Recognition [20] provides an overview of different approaches to these steps and of existing systems that implement these approaches. It also covers the recognition of handwritten sheet music.

2.4 Sheet music identification

The process of OMR described above does not include the detection and recognition of text information in the document, such as the track title, artist name, dynamic markings, or lyrics. For this purpose, it is possible to utilize optical character recognition (OCR). In 2010, a dissertation on the automatic organization of digital music documents was published [22]. The author identified several problems associated with sheet music identification. One of the most problematic issues is that there are no general conventions for the position of the track title, artist name, and other text information. Another problem is that the recognized text may contain errors. As a possible solution, the author suggests using a database of artist names, work types (sonata, opera, etc.) and track titles to determine the meaning of the textual entries.

2.5 Optical character recognition (OCR)

Modern OCR systems can recognize printed texts with high accuracy. These systems include Tesseract OCR [23], ABBYY FineReader [24], OCRopus [25], Asprise OCR SDK [26], OmniPage [27] and others.

2.6 Feedforward neural networks

Feedforward neural networks, also known as multilayer perceptrons, are machine learning models used for classification. This chapter contains their summary based on their description in the book Deep Learning by Ian Goodfellow et al. [28] An understanding of the internal structure of neural networks is essential for the choice of architecture and hyperparameters. The goal of a feedforward neural network is to approximate a function f*(x) using a set of sample pairs (x, f*(x)). The output of the network is represented by a function f(x, θ). The aim of the learning algorithm is to find a set of parameters θ for which the function f(x, θ) is the best approximation of f*(x).

7 2. State of the art

2.6.1 Architecture

The network is usually a composition of several functions. The dependencies between the functions form a directed acyclic graph. One of the most common topologies is a chain structure, i.e.,

  f(x) = f^(n)(... (f^(3)(f^(2)(f^(1)(x))))).

The functions f^(1), f^(2), f^(3), ..., f^(n) are called layers, and they are referred to by their order – f^(1) is the first layer, etc. For each input x from the set of samples, the algorithm is only provided with the expected value of f(x), but not the values of f^(i)(x). Hence, it must determine the values of the functions f^(i)(x) itself so that the function f(x) gives the best approximation of f*(x). The layers f^(1), f^(2), ..., f^(n−1) are called the hidden layers.

The layers usually receive and output vectors. Therefore, it is possible to consider them to be sets of units. Each of the units receives a vector (the outputs of the units from the previous layer) and produces a scalar value. The units are loosely inspired by biological neurons. The typical approach is that the output of each unit is computed by an activation function which is applied to z = wx + b, where x is the input vector, w is the vector of weights of the unit and b is the bias of the unit (w and b are included in θ). The value z is called the inner potential.
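To make the unit computation concrete, the following sketch evaluates one fully connected layer as an affine transformation followed by an activation function; the shapes and values are illustrative and are not taken from the thesis implementation.

import numpy as np

# One fully connected layer: the inner potentials z = Wx + b followed by an
# element-wise activation function. All names and shapes are illustrative.

def dense_layer(x, W, b, activation):
    z = W @ x + b                      # inner potentials of all units
    return activation(z)

relu = lambda z: np.maximum(0.0, z)    # the rectified linear function

x = np.array([0.5, -1.2, 3.0])         # input vector
W = 0.1 * np.random.randn(4, 3)        # weights of a layer with 4 units
b = np.zeros(4)                        # biases of the 4 units
print(dense_layer(x, W, b, relu))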

2.6.2 Activation functions

If all the layers used only linear functions, the network would output a linear transformation of the input. Many functions cannot be expressed by a linear transformation, even some simple functions such as XOR. For this reason, non-linear activation functions are used.

∙ Nowadays, one of the most popular activation functions is the rectified linear function (Figure 2.1):

  g(z) = max{0, z}

A significant advantage of this function is that it preserves a lot of properties of linear functions, which makes its optimization efficient. It also has a low computational complexity. Units with this activation function are referred to as rectified linear units (ReLU).


Figure 2.1: The rectified linear function

∙ Another broadly used activation function is the logistic sigmoid (Figure 2.2):

  σ(z) = 1 / (1 + e^(−z))

An advantage of this function is that its derivative is dσ(z)/dz = σ(z) · (1 − σ(z)), which simplifies the gradient computation. Another useful property is that its graph is close to a linear function around zero. However, the graph is very flat for very high and very low values of z, which sometimes makes the learning process slow.


Figure 2.2: The logistic sigmoid function


∙ The softmax function is often used in the output layer of neural networks for classification. Sometimes, it is also used in hidden layers for “summarization”. It is defined as

  softmax(z)_i = e^(z_i) / ∑_(j=1..m) e^(z_j),

where z_i = w_i x + b_i is the inner potential of the i-th neuron in the output layer and m is the number of units in the output layer. The units in the output layer with this activation function provide a probability distribution over the classes.

2.6.3 Cost function

The learning algorithm minimizes a cost function, also referred to as a loss function, C(θ, (x_n), (y_n)), where (x_n), (y_n) are the sample inputs and outputs. One of the possible cost functions is an average of a per-sample loss function L(θ, x, y) over all samples:

  C(θ, (x_n), (y_n)) = (1/n) ∑_(i=1..n) L(θ, x_i, y_i)

The per-sample loss functions of most modern neural networks correspond to the maximum likelihood, for example, the categorical cross-entropy

  L(θ, x, y) = − log p_model(y|x),

where p_model(y|x) is the probability of the class y for the input x assigned by the model. If the output layer uses the softmax activation function, the probability of classifying x into the i-th class is the output value of the i-th unit in the output layer.
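The following minimal sketch shows how the softmax output and the categorical cross-entropy fit together for a single sample; it only illustrates the formulas above and is not the thesis code.

import numpy as np

# Softmax over the inner potentials of the output layer and the categorical
# cross-entropy loss for one sample whose true class has index y.

def softmax(z):
    e = np.exp(z - z.max())            # subtract the maximum for stability
    return e / e.sum()

def categorical_cross_entropy(p_model, y):
    return -np.log(p_model[y])         # -log p_model(y | x)

z = np.array([2.0, 0.5, -1.0])         # inner potentials z_1, ..., z_m
p = softmax(z)
print(p)                               # probability distribution over classes
print(categorical_cross_entropy(p, y=0))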

2.6.4 Stochastic gradient descent algorithm (SGD)

Networks are usually trained using the stochastic gradient descent algorithm, which is an extension of the gradient descent algorithm.

The gradient descent algorithm, also known as the method of steepest descent, aims to minimize a cost function c(θ). For this purpose, it uses the gradient of the cost function ∇_θ c(θ) to iteratively modify θ so that the value of the cost function decreases. In the ideal case, it should converge to the global minimum, or at least to a local minimum that is close to the global minimum. In each step, the algorithm computes the direction u of the steepest descent of the function c(θ). It is possible to prove that the optimal direction is

  u = −∇_θ c(θ).

The algorithm then calculates the new weights as θ′ = θ + εu, where ε is a learning rate. The learning rate is most often either a small constant or found using the line search strategy. The line search strategy examines several values of ε and chooses the value for which the cost function c(θ′) gives the lowest result.

The SGD algorithm minimizes the cost function

  c(θ) = C(θ, (x_n), (y_n)).

With large data sets, the computation of C(θ, (x_n), (y_n)) is very slow, which makes it impractical. The minibatch variant of SGD resolves this issue. In each step, it chooses a batch of samples (x_m), (y_m) of size m and optimizes the value of C(θ, (x_m), (y_m)). In most cases, the batch size m is a constant much lower than the number of all samples, so the computation of the cost function is significantly faster.
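A minimal sketch of the minibatch SGD update follows; grad_fn stands for the gradient of the cost function on a minibatch (in practice computed by back-propagation, Chapter 2.6.5), and all names are illustrative.

import numpy as np

# One epoch of minibatch SGD: split the shuffled samples into batches and
# update theta against the gradient on each batch. grad_fn is assumed to
# return the gradient of C(theta, x_batch, y_batch).

def sgd_epoch(theta, grad_fn, xs, ys, batch_size=32, eps=1e-4):
    order = np.random.permutation(len(xs))
    for start in range(0, len(xs), batch_size):
        idx = order[start:start + batch_size]
        grad = grad_fn(theta, xs[idx], ys[idx])
        theta = theta - eps * grad     # step in the direction of steepest descent
    return theta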

2.6.5 Back-propagation algorithm

The back-propagation algorithm is used for the computation of the gradient ∇_θ C(θ, (x_n), (y_n)), which is required by the SGD algorithm. This algorithm iterates over the layers of the neural network from the output layer to the first hidden layer and in each iteration calculates the partial derivatives of the weights between the current and the previous layer.

2.6.6 Regularization

The purpose of regularization is to generalize the model so that it works not only on the training set but also on new inputs. The phenomenon of a neural network fitting the training data too closely is called overfitting. There are several techniques that prevent neural networks from overfitting.

Early stopping

Complex neural networks with enough capacity to represent the training data are prone to overfitting over time, i.e., the error rate on the training set keeps decreasing while the error rate on the validation set stagnates or increases. The early stopping strategy resolves this issue by using the weights with the lowest error rate on the validation set. If the error rate on the validation set does not decrease for several epochs in a row, the training is stopped completely.
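In Keras, the library used for the classifiers in Chapter 3, this strategy is available as a callback. The sketch below is only an illustration: the tiny model and the random data are placeholders, the thesis does not state that exactly this callback configuration was used, and the restore_best_weights option requires a newer Keras release.

import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import EarlyStopping

# Placeholder model and data; the callback stops training when the validation
# loss has not improved for 10 epochs and restores the best weights seen.
model = Sequential([Dense(16, activation='relu', input_shape=(8,)),
                    Dense(2, activation='softmax')])
model.compile(optimizer='rmsprop', loss='categorical_crossentropy',
              metrics=['accuracy'])

x = np.random.rand(200, 8)
y = np.eye(2)[np.random.randint(0, 2, 200)]

early_stopping = EarlyStopping(monitor='val_loss', patience=10,
                               restore_best_weights=True)
model.fit(x, y, validation_split=0.2, epochs=200, callbacks=[early_stopping])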

Dropout

The dropout method randomly chooses a set of units for each training example. These units are then ignored during forward propagation, back-propagation, and the updating of weights and biases. It is usually implemented by forcing the output of the unit to be zero. The probability of dropping out a specific unit is a parameter that is constant throughout the training process. The dropout method ensures that the classification does not depend on a small subset of units, which makes the model more robust and helps it generalize better.

L2 parameter regularization

The L2 parameter regularization adds a regularization term

  (1/2) ∑_i w_i²,

where w_i are the weights of the units, to the cost function, which means that the model is penalized for high weights. The output of a unit is therefore derived from many inputs with small weights rather than from a small number of inputs with high weights. Hence, the model is again more robust.

2.7 Convolutional neural networks

This chapter summarizes the chapter Convolutional Networks from the book Deep Learning [28]. Convolutional neural networks (ConvNets) are neural networks that use convolution in at least one of their layers. They are useful for processing data in which a pattern may be located at different positions in different inputs. They are used with 1-D data, such as text inputs, as well as with multidimensional data, such as images represented by 2-D grids of pixels.

2.7.1 Convolution operation

The convolution operation is denoted by an asterisk (*) and takes two arguments – an input and a kernel. The result of the operation is called a feature map. In the case of a 2-D input I and a 2-D kernel K, it is defined as

  (I * K)(i, j) = ∑_m ∑_n I(m, n) K(i − m, j − n).

The operation is commutative, so the value is equal to

  (K * I)(i, j) = ∑_m ∑_n I(i − m, j − n) K(m, n),

which allows a more efficient computation. After the commutativity property is applied, the kernel becomes flipped. Cross-correlation is a similar operation which does not flip the kernel but keeps the important properties of convolution:

  (I * K)(i, j) = ∑_m ∑_n I(i + m, j + n) K(m, n).

Both of these operations are useful for searching for the same pattern in all possible locations in the input. Figure 2.3 shows an example of a 2-D convolution without kernel flipping.
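The following sketch implements the valid-padding 2-D cross-correlation defined above (convolution without kernel flipping) with explicit loops; it mirrors the 3×4 input and 2×2 kernel of Figure 2.3 and is written for clarity rather than speed.

import numpy as np

# 2-D cross-correlation with "valid" padding: slide the kernel over every
# position where it fully fits and take the sum of element-wise products.

def cross_correlate2d(I, K):
    ih, iw = I.shape
    kh, kw = K.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(I[i:i + kh, j:j + kw] * K)
    return out

I = np.arange(12, dtype=float).reshape(3, 4)   # 3x4 input grid
K = np.array([[1.0, 0.0],
              [0.0, -1.0]])                    # 2x2 kernel
print(cross_correlate2d(I, K))                 # 2x3 feature map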



Figure 2.3: An example of 2-D convolution (based on Figure 9.1 from Deep Learning by Ian Goodfellow et al. [28])

Figure 2.4: Structure of a convolutional neural network (source: Mathworks.com [29])

2.7.2 Architecture

Convolutional neural networks contain one or more convolutional layers (Figure 2.4). A convolutional layer typically consists of three stages: the convolution stage, detector stage, and pooling stage. On top of the network, the grid is usually flattened and further processed by a multilayer perceptron.


2.7.3 Convolution stage

The convolution stage consists of one or more filters which apply the convolution operation to the input. The stage outputs the feature maps of the filters. For each filter and each position in the input, there is a set of units that apply the weights of the kernel to the input at the specific position. All units share their weights with the corresponding units in all the sets related to the filter. Stride is a hyperparameter which specifies the horizontal and vertical distance between two neighboring sets of units. The convolution stage is parameterized by |K| · f weights, where |K| is the kernel size and f the number of filters. As the kernel size is usually much smaller than the input size, the number of parameters is significantly lower compared to matrix multiplication, which is parameterized by |I| · m weights, where |I| is the input size and m the number of units in the layer.

2.7.4 Detector stage

The detector stage applies an activation function, which is usually the rectified linear function, to the outputs of the convolution stage.

2.7.5 Pooling stage

The pooling stage computes a statistic for each position in the output of the detector stage, e.g., a maximum or an average of a rectangular neighborhood.

2.8 Deep residual networks

This chapter summarizes the paper Deep Residual Learning for Image Recognition [30]. Deep convolutional neural networks used in practice have a problem with error rate degradation when the number of convolutional layers is increased (Figure 2.5). This observation is counterintuitive because for the modified network with added layers there is always a solution with the same weights as the original model that performs an identity mapping in the added layers, which has the same error rate as the original model. However, it is difficult for modern training algorithms to reach this solution.

Figure 2.5: Training error (left) and test error (right) on the CIFAR-10 dataset with 20-layer and 56-layer networks (source: Deep Residual Learning for Image Recognition, Figure 1 [30])

Deep residual neural networks (ResNets) are an extension of convolutional neural networks that resolve this issue by modifying the convolutional layers in a way that makes it easier for the training algorithms to learn an identity mapping or a transformation very close to an identity mapping. Residual networks extend the convolutional layers with “shortcut connections” which add the input x to the output f(x) (Figure 2.6). The layer output is, therefore, g(x) = f(x) + x.

Figure 2.6: Residual learning: building block (source: Deep Residual Learning for Image Recognition, Figure 2 [30])


For every convolutional neural network which at the i-th convolutional layer receives input x_i and outputs f_i(x_i), there exists a deep residual network with the same topology whose i-th convolutional layer outputs f_i(x_i) − x_i, so it produces the same output for all inputs. The assumption is that it is easier for the training algorithms to decrease f(x) to zero than to make it an identity mapping, as the original convolutional layer had to do in the case that an identity mapping was optimal. Practical applications confirm this assumption. The authors of the paper experimented with various datasets, such as ImageNet and CIFAR-10, and achieved better results with residual networks compared to the original networks. Their 152-layer deep residual network took first place in the ImageNet Large Scale Visual Recognition Challenge in 2015 [31].
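The residual building block can be expressed compactly with the Keras functional API (the library used in Chapter 3); the sketch below only illustrates g(x) = f(x) + x, with filter counts and kernel sizes chosen arbitrarily rather than taken from [30].

from keras.layers import Input, Conv2D, Add, Activation
from keras.models import Model

# A residual building block: two convolutions form f(x), and the shortcut
# connection adds the block input x before the final activation.

def residual_block(x, filters=16):
    f = Conv2D(filters, (3, 3), padding='same', activation='relu')(x)
    f = Conv2D(filters, (3, 3), padding='same')(f)
    out = Add()([f, x])                 # g(x) = f(x) + x
    return Activation('relu')(out)

inputs = Input(shape=(32, 32, 16))
model = Model(inputs, residual_block(inputs))
model.summary()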


3 Image Classifiers

Using neural networks, I implemented three classifiers for the classification of thumbnail images. The first classifier receives an arbitrary image and determines whether it contains sheet music. The remaining two classifiers are used to check whether an image with sheet music has particular features – a watermark or a track title.

3.1 Dataset

3.1.1 Downloading dataset

It is possible to obtain a set of sample images by querying an image search engine using a suitable list of search terms. Search terms containing phrases such as “sheet music” or “chords” are likely to have sheet music results. By combining these words with names of musical instruments or musicians of various genres and by using different languages (in this case, English and Czech), it is possible to obtain a diverse set of images containing sheet music. A collection of images not containing sheet music is also necessary for the training. Querying the search engines with names of ordinary objects results in a set of such images. The classifier should be able to distinguish between sheet music and other kinds of documents and papers; therefore, the names of these objects must also be present in the list of search terms. The dataset used for training in this thesis was created by performing 130 queries on Google Image Search, which resulted in a collection of 13,000 images.

3.1.2 Manual classification of images

After downloading the dataset, it is necessary to manually classify the images based on whether they are sheet music, whether they are handwritten or printed, and whether they contain a title of the track or a watermark.

I implemented a simple web application for the manual classification, as shown in Figure 3.1. It displays a batch of images for every search query and proposes a category for each batch, based on whether the search term is likely to have sheet music results or not. It is then possible to classify the individual images by clicking the links below them. By clicking the link at the bottom of the page, all the remaining unclassified images are assigned to the proposed category.

Figure 3.1: Manual classification application UI

I manually classified all the images from the dataset using this application. Table 3.1 shows the image counts and Figure 3.2 image examples by categories.

Table 3.1: Category sizes

Category                                      Count of images
Sheet music with title and no watermark       4196
Sheet music with no title and no watermark    869
Sheet music with watermark and title          731
Sheet music with watermark and no title       112
Handwritten sheet music                       43
Other images                                  7049


(a) sheet_music, (b) with_watermark, (c) with_no_title, (d) with_watermark_and_no_title, (e) handwritten, (f) other

Figure 3.2: Examples of images by categories (cropped)

3.1.3 Dataset storage

The dataset is stored in a MySQL database. The table images has the following columns:

∙ uuid: the unique identifier of the image (UUID)
∙ source: the source of the image (in this case, all images were downloaded from Google Image Search, which is represented by the value "google")
∙ query: the search query used to download the image
∙ language: the language used to download the image
∙ index: the zero-based index of the image in the search results
∙ thumbnail_data_url: the data URL of the thumbnail image
∙ fullsize_image_url: the HTTP URL of the full-size image
∙ assumed_type: the category assumed from the search query
∙ type: the category assigned using the manual classification application
∙ is_sheet_music_probability: the probability that the image contains sheet music
∙ contains_watermark_probability: the probability that the image contains a watermark
∙ contains_title_probability: the probability that the image contains a heading with a track title
∙ recognized_text: the recognized text
∙ recognized_meta: a boolean value indicating whether the metadata was recognized
∙ meta_track_title: the recognized track title
∙ meta_artist: the recognized artist name

The table is used not only for the dataset, but also for other images downloaded by the web application for users’ search queries. For efficient querying, two indexes are defined on it: a primary key on the uuid column, and a composite index over the source, query and language columns. The latter is utilized by the web application for finding already downloaded images. The columns with probabilities are assigned by the classifiers, the column with recognized text by an OCR tool, and the columns with metadata by the metadata extractor. All these columns default to null.

3.1.4 Preprocessing

For the training, the images are exported to CSV files by the command line script bin/export_data_set. Each thumbnail is resized to 128 × 128 pixels. Each pixel is represented by 3 bytes, one for every channel of the RGB color model. The images are stored in the CSV files as base64-encoded byte arrays. I experimented with two resize modes (Figure 3.3). The first mode filled the whole image, ignoring the original aspect ratio; the second mode kept the ratio and added a white background around the image.


(a) ignore_aspect_ratio (b) embed

Figure 3.3: Resize modes
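A sketch of the two resize modes using the Pillow library is shown below; the exact implementation in the thesis code may differ, and the function names are only illustrative.

from PIL import Image

SIZE = 128

def resize_ignore_aspect_ratio(img):
    # Stretch the image so that it fills the whole 128x128 square.
    return img.convert('RGB').resize((SIZE, SIZE))

def resize_embed(img):
    # Keep the aspect ratio and pad the image with a white background.
    img = img.convert('RGB')
    img.thumbnail((SIZE, SIZE))                       # in-place downscaling
    canvas = Image.new('RGB', (SIZE, SIZE), 'white')
    canvas.paste(img, ((SIZE - img.width) // 2,
                       (SIZE - img.height) // 2))
    return canvas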

Table 3.2: Categories included in the classes

Classifier           Positive class                Negative class
is_sheet_music       sheet_music                   handwritten
                     with_watermark                other
                     with_no_title
                     with_watermark_and_no_title
contains_watermark   with_watermark                sheet_music
                     with_watermark_and_no_title   with_no_title
contains_title       sheet_music                   with_no_title
                     with_watermark                with_watermark_and_no_title

The dataset is divided into classes and normalized before further processing. Each classifier recognizes two classes (Table 3.2). The images with handwritten sheet music are often difficult to read and do not contain a title, so the search results with such images are not very relevant (Figure 3.4). Furthermore, there are only 43 images from this category in the dataset, which would cause problems with the training of the classifiers that detect watermarks and titles. Therefore, I decided to treat these images as if they were not sheet music. For each classifier, a random set of images from the larger class is omitted so that both classes are of the same size.


Figure 3.4: Sample handwritten sheet music

After that, the input vectors are normalized. The byte values are divided by 255 so that they are in the range [0, 1]. Then, the mean of the values with the same coordinates in the whole training set is subtracted from each value. During training, the images are randomly shifted horizontally and vertically for better generalization.
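A minimal sketch of this normalization step follows; it assumes the thumbnails are already decoded into uint8 arrays of shape (n, 128, 128, 3). In Keras, the random horizontal and vertical shifts can be obtained, for example, with an ImageDataGenerator using width_shift_range and height_shift_range, although the thesis does not state which mechanism it uses.

import numpy as np

# Scale byte values to [0, 1] and subtract the per-pixel mean of the
# training set from both the training and the evaluation images.

def normalize(train_images, eval_images):
    train = train_images.astype('float32') / 255.0
    evaluation = eval_images.astype('float32') / 255.0
    mean = train.mean(axis=0)          # mean over the training set only
    return train - mean, evaluation - mean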

3.2 Neural network architectures

3.2.1 Convolutional neural networks

I experimented with various neural network architectures. As the inputs are images and the locations of the searched features (staves, notes, watermarks, titles, etc.) are not always the same, convolutional neural networks (Chapter 2.7) are likely to produce good results. Figure 3.5 shows the topologies of the convolutional neural networks I used, and Tables 3.3 and 3.4 their parameters. The first topology uses two convolutional layers and the second uses four convolutional layers, so that the results show whether adding more convolutional layers increases accuracy. I started with specific numbers of filters, units, and kernel sizes that I estimated might produce good results. Then, I tried to modify these parameters in different ways, so it was possible to determine from the results how these changes affected the accuracy.


Topology A: input layer (128×128×3 units) → 2-D convolutional layer (f1 filters, kernel k×k, stride 1) with a ReLU detector layer → 2-D convolutional layer (f2 filters, kernel k×k, stride 1) with a ReLU detector layer → 2-D max pooling layer (pool size 2×2) with a 25% dropout rate → dense layer (m units) with the rectified linear activation function and a 50% dropout rate → output layer (2 units) with the softmax activation function

Topology B: input layer (128×128×3 units) → 2-D convolutional layer (f1 filters, kernel k1×k1, stride 1) with a ReLU detector layer → 2-D convolutional layer (f2 filters, kernel k1×k1, stride 1) with a ReLU detector layer → 2-D max pooling layer (pool size 2×2) with a 25% dropout rate → 2-D convolutional layer (f3 filters, kernel k2×k2, stride 1) with a ReLU detector layer → 2-D convolutional layer (f4 filters, kernel k2×k2, stride 1) with a ReLU detector layer → 2-D max pooling layer (pool size 2×2) with a 25% dropout rate → dense layer (m units) with the rectified linear activation function and a 50% dropout rate → output layer (2 units) with the softmax activation function

Figure 3.5: ConvNet topologies
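As an illustration, topology A can be written down in Keras as follows; this is a reconstruction from Figure 3.5 and Table 3.3 (using the parameter values of model A3), not the thesis source code.

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense

# Topology A: two convolutional layers with ReLU detector stages, one max
# pooling stage with dropout, a dense layer and a softmax output layer.

def build_topology_a(f1=32, f2=32, m=512, k=3):
    return Sequential([
        Conv2D(f1, (k, k), strides=1, activation='relu',
               input_shape=(128, 128, 3)),
        Conv2D(f2, (k, k), strides=1, activation='relu'),
        MaxPooling2D(pool_size=(2, 2)),
        Dropout(0.25),
        Flatten(),
        Dense(m, activation='relu'),
        Dropout(0.5),
        Dense(2, activation='softmax'),
    ])

model = build_topology_a()
model.compile(optimizer='rmsprop', loss='categorical_crossentropy',
              metrics=['accuracy'])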


Table 3.3: Parameters of the ConvNets with topology A

Name   f1   f2   m      k   learning rate η
A1     16   16   256    3   10^-3
A2     32   32   512    3   10^-3
A3     32   32   512    3   10^-4
A4     64   64   1024   3   10^-4
A5     32   32   512    5   10^-4

Table 3.4: Parameters of the ConvNets with topology B

Name   f1   f2   f3    f4    k1   k2   m      learning rate η
B1     16   16   32    32    3    3    256    10^-4
B2     32   32   64    64    3    3    512    10^-4
B3     64   64   128   128   3    3    1024   10^-4
B4     32   64   128   256   3    3    1024   10^-4
B5     32   32   64    64    5    3    512    10^-4

The models use the categorical cross-entropy loss function and the RMSprop optimizer, which is an adaptive learning rate method that aims to make the model converge faster [32]. RMSprop is configured with a learning rate η, whose recommended default value is 10^-3 [32]. When I trained the model A2 with this value of η, the accuracy reached only approximately 50% and stopped improving. Therefore, I decreased η to 10^-4 for the model A3 and all the following models, which resolved the issue.

A paper from 2012 focused on training convolutional neural networks on the ImageNet dataset suggests using a dropout rate of 50% [33]. I experimented with a dropout rate of 25% after every max pooling layer and 50% after the final dense layer. After evaluating all the models with these parameter values, I tried to modify the dropout rates in the models with the highest accuracy.

The convnetB4 model achieved the highest accuracy rate in the case of sheet music classification and the convnetB3 in the remaining two cases. I modified these models to use a dropout rate d1 after every max pooling layer and d2 after the final dense layer. Table 3.5 shows the values of d1 and d2.

Table 3.5: Modified dropout rates of the ConvNets

Models      d1    d2
B3, B4      25%   50%
B3a, B4a    5%    10%
B3b, B4b    10%   20%
B3c, B4c    15%   30%
B3d, B4d    20%   40%
B3e, B4e    30%   60%

3.2.2 Deep residual networks

I also experimented with deep residual networks (Chapter 2.8). I used the architecture that the authors of the paper designed for the CIFAR-10 dataset [30]. I trained four ResNet models with various depths (20, 32, 44 and 56 layers).

3.3 Implementation

The source code is written in Python and uses various open source libraries, including:

∙ Keras, a high-level machine learning library with an API designed primarily for researchers and data scientists
∙ TensorFlow, a backend for the Keras library, which provides an API for efficient numerical computations on CPUs and GPUs
∙ NumPy, which provides an interface for multidimensional arrays and tools for their manipulation
∙ Matplotlib, a library for plotting graphs
∙ Django, a web framework, which is used in this thesis for the implementation of the HTTP API


3.3.1 Format of the HTTP API

The API has only one HTTP endpoint – POST /classify. The request body must contain a JSON object with the images in both resize modes formatted as base64-encoded byte arrays of size 128 × 128 × 3, e.g.:

{ "images":[ { "ignore_aspect_ratio":"a25ma2ZzZG5mamFzZ...", "embed":"ZGZnc3NkZmtzZGZrc..." }, { "ignore_aspect_ratio":"YWRzZmRzZnNkZHNmc...", "embed":"ZHNhbWxkc2FrZHNrZ..." } ] }

The total length of the request body must not exceed 20 MB. The API returns a JSON object with class probabilities for each of the images in the response body. The results are in the same order as the images in the input. The following is a sample response body:

{ "classes":[ { "is_sheet_music": 0.9, "contains_watermark": 0.0, "contains_title": 1.0 }, { "is_sheet_music": 0.1, "contains_watermark": 0.5, "contains_title": 0.5 } ] }


3.3.2 Command line interface

The command python manage.py runserver starts the HTTP server with the API on port 8000. The commands sh train-convnet.sh and sh train-resnet.sh train the models and save them into the models directory. Both commands expect the dataset to be in the dataset directory. After training, it is possible to evaluate the models with the commands sh test-convnet.sh and sh test-resnet.sh. Both commands output the results to the results.csv files in the models directory.

3.4 Test results

There are six datasets – one for each combination of a classification criterion and resize mode. In each of the datasets, the training set is randomly assigned 60% of all images, the validation set 20%, and the test set the remaining 20%. The neural network architectures are trained on all datasets for 200 epochs. The accuracy on the validation set is measured after every epoch. For every architecture, the model with the highest accuracy on the validation set is chosen for further evaluation. The models are then compared by the accuracy rate on the test set, which contains inputs that were unknown to the model during training. The accuracy rates on the test set are shown in Table 3.6.

Table 3.6: Accuracy of the models evaluated on the test set

Model        Sheet music           Watermark             Title
             Ign.i      Emb.ii     Ign.i      Emb.ii     Ign.i      Emb.ii
convnetA1    97.34%     97.17%     92.01%     93.79%     94.15%     91.09%
convnetA2    49.79%     96.53%     57.40%     57.40%     94.40%     44.27%
convnetA3    96.70%     97.38%     92.90%     91.72%     93.89%     92.88%
convnetA4    97.25%     97.72%     92.01%     93.20%     93.38%     93.64%
convnetA5    97.67%     97.42%     92.60%     93.79%     94.40%     93.38%
convnetB1    96.95%     97.34%     90.24%     91.12%     91.60%     93.38%
convnetB2    97.55%     97.50%     93.20%     92.90%     94.40%     95.42%
convnetB3    97.63%     97.88%     93.79%     93.49%     96.44%     94.40%
convnetB4    98.05%     97.63%     93.49%     92.90%     94.66%     95.17%
convnetB5    97.59%     97.63%     93.49%     90.53%     91.86%     94.15%
resnet20     97.88%     97.63%     92.90%     94.67%     95.42%     96.44%
resnet32     97.88%     97.88%     94.08%     94.08%     95.67%     95.42%
resnet44     97.88%     97.88%     94.38%     94.97%     93.38%     95.42%
resnet56     97.93%     97.55%     93.79%     92.90%     95.42%     97.20%

i Resize mode: ignore_aspect_ratio
ii Resize mode: embed


After evaluating the models, I tried to modify the dropout rates of the ConvNets with the highest accuracy rate. Table 3.7 shows the accuracy rates on the test set. For all three criteria, the highest accuracy was obtained with the ignore_aspect_ratio resize mode; therefore, the models with the modified dropout rates were trained with this resize mode.

Table 3.7: Accuracy of the ConvNets with modified dropout rates

Model       Sheet music (B4)   Watermark (B3)   Title (B3)
B3/B4       98.05%             93.79%           96.44%
B3a/B4a     98.05%             92.90%           94.91%
B3b/B4b     98.14%             93.79%           94.15%
B3c/B4c     97.97%             93.20%           93.89%
B3d/B4d     97.72%             93.20%           93.64%
B3e/B4e     97.67%             94.38%           95.42%

3.4.1 Sheet music classification

In the case of sheet music classification, the accuracy rates of most of the models are comparable. The only exception is the convnetA2 model, which uses the RMSprop optimizer with η = 10^-3. With this learning rate, the accuracy stops improving at approximately 50%. The model convnetA3, which has the same architecture and parameters except for η = 10^-4, is not susceptible to this issue (Figure 3.6). In most cases, the convolutional neural networks with more filters achieve negligibly higher accuracy rates. The model convnetB4 with the ignore_aspect_ratio resize mode achieves the highest accuracy rate. After the dropout rate modification, the model convnetB4b achieves a negligibly higher accuracy. This model has a 1.86% error rate on the test set. Figure 3.7 shows two examples of images that were not classified correctly. There were no images not containing sheet music that were marked as sheet music; however, there were sheet music images that the model did not classify correctly. One of them is an image of sheet music with a large picture next to it. Another image contains sheet music with colored notes.


(a) error of convnetA2 (b) loss of convnetA2

(c) error of convnetA3 (d) loss of convnetA3

Figure 3.6: Error and loss of convnetA2 and convnetA3 on the sheet music dataset

(a) not marked as sheet music (b) not marked as sheet music

Figure 3.7: Sample images incorrectly marked as sheet music


3.4.2 Watermark detection

The watermark detection is less accurate compared to the sheet music classification, and the accuracy rates of different models vary more significantly. The convnetA2 model again stops learning with a low accuracy rate. The deep residual networks achieve slightly higher accuracy rates compared to the convolutional neural networks. Some of the models work better with the ignore_aspect_ratio resize mode and some with the embed resize mode. The most accurate model uses the embed resize mode. The modification of the dropout rates to higher values improves the accuracy of the convnetB3 model; however, its accuracy is still lower than the accuracy of the best ResNet model. The model resnet44 achieves the highest accuracy rate of 94.97%. Figure 3.8 shows sample images that are classified incorrectly. One of them contains an illustration, which the model considers to be a watermark. Another image contains a small triangular watermark in the bottom right corner, which the model failed to detect.

(a) marked as watermarked (b) not marked as watermarked

Figure 3.8: Sample images incorrectly marked as watermarked


3.4.3 Title detection

The title detection results are similar to the watermark detection results. The deep residual networks again provide better results than the convolutional neural networks. The modification of the dropout rates of the convnetB3 model does not improve the accuracy rate. The model resnet56 achieves the highest accuracy rate of 97.2%. Examples of incorrectly classified images are shown in Figure 3.9. There were no images without a title that were classified incorrectly, but there were some images in which the model failed to detect a title. One of them contains a colored title. Another image is extensively illustrated.

(a) title not detected (b) title not detected

Figure 3.9: Sample images with incorrectly detected title

4 Metadata extraction

The full-size versions of the images whose thumbnails were classified as sheet music containing a heading with a track title are downloaded and processed by the metadata extractor. As mentioned in Chapter 2.4, a possible approach to track and artist identification is to recognize the text using an OCR tool and compare the text entries against a database of track titles and artist names.

4.1 Database of track titles and artists

Chapter 2.2 lists available catalogs and music databases. The service Discogs.com provides an extensive dataset of artists and releases. Every month, the service exports the current data from its database to XML files and publishes them at the data.discogs.com website [15]. I downloaded the dataset from March 2018 and implemented an import script which loads the track titles and artist names from the XML files into a local relational database. I used the MySQL database because it was already used for the classification dataset (Chapter 3.1.3). Figure 4.1 describes the database structure. The import script loads the tracks, artists and their relationships into the discogs_tracks, discogs_artists and discogs_tracks_artists tables.

discogs_tracks: id (int unsigned), release_id (int unsigned), index (int unsigned), title (text)
discogs_artists: id (int unsigned), name (text, nullable), realname (text, nullable)
discogs_tracks_artists: track_id (int unsigned), artist_id (int unsigned)
discogs_tracks_normalized: id (int unsigned), title (varchar(255))
discogs_artists_normalized: id (int unsigned), field (enum('name', 'realname')), name (varchar(255))

Figure 4.1: The structure of the database tables for the Discogs.com dataset


4.1.1 Normalization

The artist names and track titles are then normalized:

∙ Some of the track titles start with a number representing their order (e.g., “3. Ghost Child”). Other track titles end with a bracket (e.g., “Missing You (Original Version)”). Both of these labels are removed from the titles, because they may not be present in the sheet music.
∙ Non-English letters are replaced with their corresponding English variants (e.g., “ř” is replaced with “r”).
∙ All letters are converted to lower case.
∙ The article “the”, if present, is removed from the beginning of the title.
∙ All characters other than letters, digits and spaces are removed.
∙ Whitespace is normalized. Whitespace characters at the beginning and end of the text entries are removed. All consecutive whitespace character sequences are replaced with a single space.
∙ All text entries longer than 50 characters are shortened.

The script bin/normalize_discogs_data normalizes the artist names and track titles in the database. The normalized entries are stored in the corresponding tables with the normalized suffix (Figure 4.1).
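A sketch of these normalization rules in Python is shown below; the regular expressions and their order are illustrative and may differ from the actual bin/normalize_discogs_data script.

import re
import unicodedata

def normalize_entry(text):
    text = re.sub(r'^\d+\.\s*', '', text)            # leading track number
    text = re.sub(r'\s*\([^)]*\)\s*$', '', text)     # trailing bracket
    text = unicodedata.normalize('NFKD', text)        # "ř" -> "r" + accent
    text = text.encode('ascii', 'ignore').decode()    # drop the accents
    text = text.lower()
    text = re.sub(r'^the\s+', '', text)               # leading article "the"
    text = re.sub(r'[^a-z0-9 ]', '', text)            # letters, digits, spaces
    text = re.sub(r'\s+', ' ', text).strip()          # normalize whitespace
    return text[:50]                                   # shorten long entries

print(normalize_entry('3. Ghost Child'))                   # ghost child
print(normalize_entry('Missing You (Original Version)'))   # missing you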

4.2 Implementation

4.2.1 Text entries recognition

The implementation recognizes the text from the images using Tesseract OCR [23]. The OCR tool is configured to search for text in the language in which the user performed the search query, and to use a whitelist of characters consisting of the alphabet of that language, digits and punctuation. Then, the first five lines of the recognized text are compared against the database of track titles and artist names. If a line contains a dash or the word “by”, it is split and its parts are processed individually.
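The sketch below shows how such a configuration could look with the pytesseract wrapper around Tesseract; the thesis does not state which interface to Tesseract it uses, and the whitelist is only an example for English queries.

from PIL import Image
import pytesseract

# Recognize text in an image with a restricted character set and return the
# first few non-empty lines for the metadata lookup.

WHITELIST = ('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
             '0123456789.,:;!?()-')

def recognize_lines(image_path, lang='eng', max_lines=5):
    config = '-c tessedit_char_whitelist=' + WHITELIST
    text = pytesseract.image_to_string(Image.open(image_path),
                                       lang=lang, config=config)
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return lines[:max_lines]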


4.2.2 Text entries processing

As the OCR tool is susceptible to minor errors, the search is performed with a 20% error tolerance, i.e., the results are allowed to have a Levenshtein distance (Chapter 4.3.1) of up to ⌈n/5⌉, where n is the length of the text entry. The search algorithm is further explained in Chapter 4.3. If the results contain an artist and a track with a relationship between them in the database, such an artist-track pair is preferred over the other pairs. If there are still multiple possible artists and tracks, they are sorted by their weights (Chapter 4.2.3), and the artist-track pair with the greatest total weight is returned.

4.2.3 Weight calculation

The weight of a result is calculated as the number of correct letters, i.e., n − d, where n is the length of the result and d is the Levenshtein distance from the searched entry. As the OCR tool is prone to returning short lines with noise (e.g., “mm”, “oo”) and there are tracks with such names in the database, I implemented a penalization for results in which all words are very short. If the average word length is at least three letters per word, the weight is increased by 100.
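A sketch of this weight calculation is given below; the bonus value of 100 and the threshold of three letters per word are taken from the description above, everything else is illustrative.

# Weight of a search result: the number of correct letters (length minus the
# Levenshtein distance), plus a bonus when the words are not suspiciously short.

def result_weight(result, levenshtein_distance):
    weight = len(result) - levenshtein_distance
    words = result.split()
    if words and sum(len(w) for w in words) / len(words) >= 3:
        weight += 100                  # penalizes noise such as "mm" or "oo"
    return weight

print(result_weight('yesterday', 1))   # 8 + 100 = 108
print(result_weight('mm', 0))          # 2, no bonus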

4.3 Search with an error tolerance

The metadata extractor requires a tool for searching the database of artist names and track titles with an error tolerance.

The search operation is parameterized by a search string s, a maximum Levenshtein distance d, and a maximum number of results q. When it is executed, it must return all entries from the database whose Levenshtein distance from the string s is less than or equal to d. If there are more than q results, it must return only the first q results to prevent system overload in case of a query with too many results.

As the MySQL database, which is used for storing the Discogs.com data, does not support this operation, I implemented a custom application for this purpose. The application loads a dictionary of entries from a text file, reads search queries from the standard input and writes the search results to the standard output. Since the database of track titles contains tens of millions of entries, the implementation must be time and memory efficient. I decided to implement the application in C++ because of its low overhead compared to other languages, such as JavaScript or Python.

4.3.1 Levenshtein distance

The Levenshtein distance [34] of two strings s, t is equal to the minimum number of insert, delete and change operations needed to transform the string s into t. The insert operation adds one character to the string, the delete operation removes one character from the string, and the change operation replaces one character in the string with another character.

It is possible to compute the Levenshtein distance of two strings s, t using a dynamic programming algorithm with O(nm) time complexity, where n, m are the lengths of the strings s, t, respectively. Algorithm 1 calculates an array dist of Levenshtein distances between all prefixes of s and t. It is possible to determine each of the distances in O(1) time by computing the distance in case of insertion, removal and change of the last character using the already computed distances.

Algorithm 1 was rediscovered several times by different authors [35]. Wagner and Fischer described it in a paper in 1974 [36], so it is sometimes referred to as the Wagner–Fischer algorithm.

Algorithm 1 Computation of the Levenshtein distance
 1: function LevenshteinDistance(string s, string t)
 2:   n ← length of s
 3:   m ← length of t
 4:   dist[0, 0] ← 0                         ▷ distance between empty strings
 5:   for i ← 1 to n do
 6:     dist[i, 0] ← i                       ▷ distance between empty string and s[1..i]
 7:   for j ← 1 to m do
 8:     dist[0, j] ← j                       ▷ distance between empty string and t[1..j]
 9:   for i ← 1 to n do                      ▷ for every prefix s[1..i]
10:     for j ← 1 to m do                    ▷ and every prefix t[1..j]
11:       distInsert ← dist[i, j − 1] + 1        ▷ insert t[j]
12:       distDelete ← dist[i − 1, j] + 1        ▷ delete s[i]
13:       if s[i] = t[j] then
14:         distChange ← dist[i − 1, j − 1]      ▷ no change
15:       else
16:         distChange ← dist[i − 1, j − 1] + 1  ▷ s[i] → t[j]
17:       dist[i, j] ← min{distInsert, distDelete, distChange}
18:   return dist[n, m]
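As a concrete illustration of Algorithm 1, the following C++ function is a direct transcription of the same dynamic programming recurrence; it is a sketch for reference, not the exact code of the term search application.

#include <algorithm>
#include <string>
#include <vector>

// Levenshtein distance between s and t, computed with the O(n*m)
// recurrence of Algorithm 1 (Wagner-Fischer).
int levenshteinDistance(const std::string& s, const std::string& t) {
    const std::size_t n = s.size(), m = t.size();
    // dist[i][j] = distance between s[0..i) and t[0..j)
    std::vector<std::vector<int>> dist(n + 1, std::vector<int>(m + 1));
    for (std::size_t i = 0; i <= n; ++i) dist[i][0] = static_cast<int>(i);
    for (std::size_t j = 0; j <= m; ++j) dist[0][j] = static_cast<int>(j);
    for (std::size_t i = 1; i <= n; ++i) {
        for (std::size_t j = 1; j <= m; ++j) {
            int distInsert = dist[i][j - 1] + 1;               // insert t[j-1]
            int distDelete = dist[i - 1][j] + 1;               // delete s[i-1]
            int distChange = dist[i - 1][j - 1]
                             + (s[i - 1] == t[j - 1] ? 0 : 1); // change s[i-1] to t[j-1]
            dist[i][j] = std::min({distInsert, distDelete, distChange});
        }
    }
    return dist[n][m];
}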

4.3.2 Prefix trees

A prefix tree, also known as a trie [37], is a data structure that stores a set of strings in a tree. The nodes of the tree represent prefixes of the strings, with the root node representing an empty prefix. The parent of a node with a non-empty prefix is the node with the same prefix without the last character. Each node also contains a boolean flag indicating whether the set contains a word equal to the prefix. Figure 4.2 shows an example of a prefix tree for the words a, ape, app, and bar.

(empty prefix, isWordEnd: false)
├── prefix "a" (isWordEnd: true)
│   └── prefix "ap" (isWordEnd: false)
│       ├── prefix "ape" (isWordEnd: true)
│       └── prefix "app" (isWordEnd: true)
└── prefix "b" (isWordEnd: false)
    └── prefix "ba" (isWordEnd: false)
        └── prefix "bar" (isWordEnd: true)

Figure 4.2: Sample prefix tree


It is possible to add a string to the trie in O(n|Σ|) time, where n is the string length and |Σ| is the size of the alphabet.
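A minimal C++ sketch of such a trie is shown below. It stores the children in a std::map keyed by character, which stays within the O(n|Σ|) insertion bound mentioned above; the node layout is illustrative and not necessarily the one used by the term search application.

#include <map>
#include <memory>
#include <string>

// One node of the prefix tree: children indexed by the next character,
// plus a flag marking whether a stored word ends here.
struct TrieNode {
    std::map<char, std::unique_ptr<TrieNode>> children;
    bool isWordEnd = false;
};

// Inserts word into the trie rooted at root; each child lookup is at most
// logarithmic in the number of children, i.e. bounded by the alphabet size.
void insert(TrieNode& root, const std::string& word) {
    TrieNode* node = &root;
    for (char c : word) {
        auto& child = node->children[c];   // creates an empty slot if missing
        if (!child) child = std::make_unique<TrieNode>();
        node = child.get();
    }
    node->isWordEnd = true;
}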

4.3.3 Searching for strings in a prefix tree by the Levenshtein distance from a given string

It is possible to extend the dynamic programming algorithm from Chapter 4.3.1 to find all strings in a prefix tree whose Levenshtein distance from the string s is less than or equal to a parameter d.

Algorithm 2 Finding all strings in a prefix tree satisfying a Levenshtein distance constraint
 1: function SearchTree(Tree tree, string s, int d)
 2:   for i ← 0 to |s| do
 3:     dist[i] ← i                        ▷ distance between empty string and s[1..i]
 4:   return SearchSubtree(tree.root, dist, s, d)
 5: function SearchSubtree(Node node, int[] dist, string s, int d)
 6:   results ← {}
 7:   if node.isWordEnd and dist[|s|] ≤ d then
 8:     add node.prefix to results
 9:   for child in node.children do
10:     t ← child.prefix
11:     childDist[0] ← |t|                 ▷ distance between empty string and t
12:     for i ← 1 to |s| do                ▷ for every prefix s[1..i]
13:       distInsert ← dist[i] + 1         ▷ insert t[|t|]
14:       distDelete ← childDist[i − 1] + 1  ▷ delete s[i]
15:       if s[i] = t[|t|] then
16:         distChange ← dist[i − 1]       ▷ no change
17:       else
18:         distChange ← dist[i − 1] + 1   ▷ s[i] → t[|t|]
19:       childDist[i] ← min{distInsert, distDelete, distChange}
20:     add SearchSubtree(child, childDist, s, d) to results
21:   return results

Algorithm 2 recursively traverses the tree. In each node, it receives an array dist of distances from all prefixes of s to the prefix of the current node. If a word ends in the current node, it checks the Levenshtein distance constraint, and if it is satisfied, it adds the word to the set of results. Then, it iterates over all children of the current node, calculates the distances from all prefixes of s to the child node prefix, and recursively continues processing the child subtree. The time complexity of the algorithm is O(n|s|), where n is the number of tree nodes and |s| is the length of the string. There are several possible optimizations:

∙ If all the values in the array dist are greater than d, there cannot be any solution in the current subtree. Therefore, it is possible to check for this condition at the beginning of the SearchSubtree function and stop processing the subtree if the condition is satisfied. This optimization decreases the number of processed nodes, especially when the parameter d is low.

∙ Because the metadata extractor penalizes strings with high Levenshtein distances from the original string, it is reasonable to modify the search algorithm to return only the strings with the minimum Levenshtein distance. Thus, if the algorithm finds a result with a Levenshtein distance d′ < d, it decreases the parameter d to d′ and removes all the results with a higher Levenshtein distance. This optimization also decreases the number of processed nodes, especially when the algorithm finds a result with a low Levenshtein distance from the string s.
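The following C++ sketch combines Algorithm 2 with the first optimization (pruning a subtree once every value in dist exceeds d). It reuses the illustrative TrieNode from the previous sketch and passes the current prefix down the recursion instead of storing it in the node; it is meant only to illustrate the recursion, not to reproduce the application's exact code.

#include <algorithm>
#include <map>
#include <memory>
#include <string>
#include <vector>

struct TrieNode {
    std::map<char, std::unique_ptr<TrieNode>> children;
    bool isWordEnd = false;
};

// Collects into results every word stored in the subtree of node whose
// Levenshtein distance from s is at most d. dist[i] is the distance between
// s[0..i) and the prefix represented by node; prefix is that prefix itself.
void searchSubtree(const TrieNode& node, const std::string& prefix,
                   const std::vector<int>& dist, const std::string& s, int d,
                   std::vector<std::string>& results) {
    // Optimization 1: if every entry of dist exceeds d, no extension of this
    // prefix can come within distance d of s, so the subtree is skipped.
    if (*std::min_element(dist.begin(), dist.end()) > d) return;
    if (node.isWordEnd && dist[s.size()] <= d) results.push_back(prefix);
    for (const auto& [c, child] : node.children) {
        std::vector<int> childDist(s.size() + 1);
        childDist[0] = static_cast<int>(prefix.size()) + 1;    // |child prefix|
        for (std::size_t i = 1; i <= s.size(); ++i) {
            int distInsert = dist[i] + 1;                       // insert c
            int distDelete = childDist[i - 1] + 1;              // delete s[i-1]
            int distChange = dist[i - 1] + (s[i - 1] == c ? 0 : 1);
            childDist[i] = std::min({distInsert, distDelete, distChange});
        }
        searchSubtree(*child, prefix + c, childDist, s, d, results);
    }
}

// Entry point corresponding to SearchTree in Algorithm 2.
std::vector<std::string> searchTree(const TrieNode& root, const std::string& s, int d) {
    std::vector<int> dist(s.size() + 1);
    for (std::size_t i = 0; i <= s.size(); ++i) dist[i] = static_cast<int>(i);
    std::vector<std::string> results;
    searchSubtree(root, "", dist, s, d, results);
    return results;
}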

4.3.4 Implementation

I implemented an application with this algorithm and both of these optimizations in C++. The application loads the database from a text file into a prefix tree, reads search queries from the standard input, searches the prefix tree for each query, and writes the results to the standard output. The metadata extractor is implemented in ECMAScript 6, so it runs the search application in a subprocess.
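A stripped-down version of such a main loop might look as follows. The line-oriented query format, the command line arguments and the omission of the result limit q are assumptions made for the sketch, and the prefix tree search is replaced by a linear scan with the Levenshtein function from Chapter 4.3.1 so that the example stays self-contained; the actual application's interface may differ.

#include <algorithm>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

// Same recurrence as Algorithm 1, repeated here to keep the sketch self-contained.
int levenshteinDistance(const std::string& s, const std::string& t) {
    std::vector<std::vector<int>> dist(s.size() + 1, std::vector<int>(t.size() + 1));
    for (std::size_t i = 0; i <= s.size(); ++i) dist[i][0] = static_cast<int>(i);
    for (std::size_t j = 0; j <= t.size(); ++j) dist[0][j] = static_cast<int>(j);
    for (std::size_t i = 1; i <= s.size(); ++i)
        for (std::size_t j = 1; j <= t.size(); ++j)
            dist[i][j] = std::min({dist[i][j - 1] + 1, dist[i - 1][j] + 1,
                                   dist[i - 1][j - 1] + (s[i - 1] == t[j - 1] ? 0 : 1)});
    return dist[s.size()][t.size()];
}

int main(int argc, char* argv[]) {
    if (argc < 2) {
        std::cerr << "usage: term_search <dictionary.txt>\n";
        return 1;
    }
    // Load the dictionary, one entry per line.
    std::ifstream file(argv[1]);
    std::vector<std::string> dictionary;
    for (std::string line; std::getline(file, line); )
        dictionary.push_back(line);

    // Read queries of the hypothetical form "<max distance> <search string>"
    // from the standard input and print the matching entries, one per line,
    // followed by an empty line.
    int d;
    std::string query;
    while (std::cin >> d && std::getline(std::cin >> std::ws, query)) {
        for (const std::string& entry : dictionary)
            if (levenshteinDistance(query, entry) <= d)
                std::cout << entry << '\n';
        std::cout << '\n' << std::flush;
    }
    return 0;
}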

4.4 Test results

I ran the metadata extraction on all images from the classification dataset (Chapter 3.1) that contained a title. Table 4.1 shows the statistics.

Table 4.1: Metadata recognition statistics

Total number of images                          4927
Number of valid full-size images                4804
Number of images with recognized text           4469
Number of images with recognized track title    3778
Number of images with recognized artist name    1416

4.4.1 Text recognition

The text was recognized from the full-size version of the images. However, some of the full-size image URLs (2.5%) were outdated or linked to files that were not valid images. The OCR tool successfully extracted text from 93.0% of the downloaded images. There are various reasons why it failed to process the remaining images (Figure 4.3):

a) Some of the images contained an overlay when they were downloaded directly by the link, and not from the web page.

b) Other images were too small.

c) Although the OCR tool automatically adjusts the contrast of images, it failed to process several scanned images with a darker background.

d) There were some rotated images, but only infrequently.

e) Some of the images used handwritten fonts or were extensively illustrated.

f) Some images were blurred or poorly scanned.


Figure 4.3: Examples of images not recognized by the OCR tool


4.4.2 Metadata recognition

After the text recognition, the application searched the text for track titles and artist names. It managed to recognize a track title from 84.5% and an artist name from 31.7% of the images with recognized text. The most frequent reasons why the metadata recognition failed were (Figure 4.4):

a) There was extensive noise in the recognized text, i.e., the Levenshtein distance from the correct track title or artist name was more than 20% of its length.

b) The track title or artist name was not present in the database.

c) The OCR tool did not support the font used in the sheet music.

d) There was an overlay in the full-size version of the image.

The primary reason why the success rate of the artist name recognition is much lower compared to the track title recognition is that many sheet music images do not contain the artist name. Furthermore, many images are assigned an incorrect track title because of the noise in the OCR output. As the noise that matches existing track titles and artist names in the database usually consists only of short words, the metadata recognizer penalizes results with a short average word length and, secondarily, those with a high Levenshtein distance (Chapter 4.2.3).


(a) SMELLS WFFEIEN SPIRIT (b) BETLEMSKÉ HVĚZDIČKY

(c) Come 6mm va11igthQQgg (d) FULL RESOLUTION CLICK HERE

Figure 4.4: Examples of images with incorrectly recognized metadata


5 Conclusion

The major contributions of this thesis are the design and implementation of the image classifiers, the metadata extractor, the term search parameterized by the Levenshtein distance, and the web application for sheet music search.

5.1 Image classification

The accuracy of the most accurate model for sheet music detection is 98.05%, for watermark detection 94.97%, and for title detection 97.2%. In the case of sheet music detection, a convolutional neural network with four convolutional layers achieved the highest accuracy. In the remaining two cases, better results were obtained with deep residual networks with 44 and 56 layers. There is still room for improvement of the classification accuracy. Chapter 3.4 describes some of the erroneous cases. Future work in this area should analyze these cases and extend the dataset with more samples of the problematic cases to help the neural networks generalize better.

5.2 Metadata extraction

The metadata extractor successfully recognized the text from 93.0% of the images and matched 84.5% of them with track titles and 31.7% with artist names. Chapter 4.4 describes some of the images that were not recognized correctly. The most frequent cause of the artist name recognition failure was that many images did not contain an artist name. Some of the track titles and artist names were recognized incorrectly because the OCR tool is susceptible to noise that is sometimes matched with entries in the database. Future work should focus on image preprocessing to eliminate the noise. One possible approach is to train a machine learning model that would find the position of the track title and artist name in the image. If only the cropped part of the image were processed by the OCR tool, a lot of the noise would be eliminated.


Another possible improvement is to add more data sources of track titles and artist names. Some tracks and artists are not present in the Discogs.com music database, especially folk songs and non-English artists. It would also be possible to improve the accuracy by prioritizing tracks and artists. The Discogs.com database contains many entries that are matched with the noise in the OCR output. If the tracks and artists were sorted by their priorities (e.g., by the number of sold copies of the artists’ releases, or the number of listens on online music streaming services), well-known track titles and artist names would be preferred over the entries matched with the noise.

5.3 Web application

The web application works and is optimized for speed. It returns the image thumbnails only a few seconds after the search form is submitted; the exact response time depends on the Internet connection speed and the server performance. The metadata is recognized from the full-size images, so downloading them is time-consuming. For this reason, the metadata extraction runs asynchronously. There are several possible ways to improve the web application. A very useful feature would be a filter for notes, chords and lyrics. Users who want to play a song on a guitar would probably search for the chords, whereas users who want to play it on a piano would search for the notes. It would be necessary to train an additional classifier to implement this filter. It would also be possible to increase the number of search results by adding more data sources (e.g., Bing images [38] or language-specific search engines such as Obrazky.cz [39]), and to support more file formats such as PDF. The support for PDF documents would require a search engine that displays their thumbnails in the search results.

A System Implementation

A.1 System components

The whole system consists of three components: the sheet music recognizer, the term search, and the sheet music search. The sheet music recognizer is a Python application that provides an HTTP API endpoint for image classification. The implementation details are explained in Chapter 3.3. The term search is a C++ application that is utilized by the metadata extractor for searching the database of text entries with an error tolerance. Chapter 4.3 describes its implementation. The sheet music search is a Node.js application which uses the Puppeteer library for downloading images from Google Image Search, filters the results using the sheet music recognizer API, extracts the metadata from the images and provides a web user interface. The implementation is further explained in Chapter A.2.

A.2 Web application

I implemented the web application using ECMAScript 6. The project dependencies are installed using Node Package Manager [40] and include the following packages:

∙ Babel, a library for transpilation of ECMAScript 6 to older versions of JavaScript
∙ Config, a tool for application configuration
∙ EJS, a template engine
∙ Express, a web application framework
∙ MySQL, a library for communication with the MySQL server
∙ node-tesseract, a tool for running Tesseract processes
∙ Progress, a console utility for rendering progress bars
∙ Puppeteer, a library for controlling a headless Chromium web browser
∙ request and request-promise-native, libraries for asynchronous sending of HTTP requests
∙ Sharp, a tool for image manipulation
∙ xml-stream, a library for parsing XML documents


Figure A.1: Search form and results page

A.2.1 Controllers

The web application uses the Model-View-Controller architectural pattern [41] and contains two controllers:

∙ ManualClassificationController, which provides a user interface for manual classification (Chapter 3.1.2)
∙ SearchController, which renders the search form and results (Figure A.1)

A.2.2 Command line interface

The application provides the following command line scripts:

∙ bin/initialize_data_set, which downloads the dataset from Google Image Search to the MySQL database
∙ bin/export_data_set, which exports the manually classified dataset from the MySQL database to the dataset directory for model training


∙ bin/import_discogs_data, which imports the Discogs.com data from the XML files to the MySQL database
∙ bin/normalize_discogs_data, which normalizes the Discogs.com data in the MySQL database
∙ bin/export_discogs_artists and bin/export_discogs_tracks, which export the normalized Discogs.com textual entries to CSV files utilized by the term search application
∙ bin/extract_song_meta, which extracts metadata from the dataset images and outputs statistics
∙ bin/export_incorrectly_classified_images, which exports the incorrectly classified images
∙ bin/export_incorrectly_recognized_images, which exports the images with incorrectly recognized text or metadata
∙ bin/www, which starts the web server

A.3 Deployment

Figure A.2 shows the deployment diagram of the system.

Figure A.2: Deployment diagram (the end user's web browser and Google Image Search communicate with the sheet music search component on the application server over HTTP; the sheet music search uses the sheet music recognizer through an HTTP API, runs the term search in a subprocess, and accesses the MySQL database over a MySQL connection)

A.4 Installation instructions

It is possible to install the system by following these instructions:

1) Copy the source code from the attachment of this thesis and extract it.

2) Install Python 3.6 or higher and run the following commands in the sheet_music_recognizer directory:

   # Create a virtual environment
   python -m venv ./venv
   source venv/bin/activate

   # Install the project dependencies
   pip install pipenv
   pipenv install
   pip install tensorflow-gpu  # optional

3) Install Node Package Manager and Tesseract OCR and run the following command in the sheet_music_search directory:

   npm install

4) Install MySQL and create an empty database. Copy the SQL files from the attachment of this thesis, extract them and import them into the database. There are three SQL files:

   ∙ structure.sql, which creates the tables
   ∙ images.sql.gz with the image dataset
   ∙ discogs.sql.gz with the imported and normalized data from the Discogs.com XML files from March 2018

5) Open the sheet_music_search/config directory and create a configuration file local.js using the file local.template.js as a template. The only required parameters are the database credentials. Optionally, it is possible to override any of the default options in the default.js configuration file.

6) Copy the models from the attachment of this thesis to the directory sheet_music_recognizer/models. It is also possible to train the models by the commands mentioned in Chapter 3.3.2.


7) Export the Discogs.com data for the term search application by executing the following commands in the sheet_music_search directory:

   bin/export_discogs_artists
   bin/export_discogs_tracks

8) Install CMake and G++. Then, compile the term search application by running the following commands in the term_search directory:

   mkdir build
   cd build
   cmake -DCMAKE_BUILD_TYPE=Release ..
   make

9) Start the sheet music recognizer HTTP server by executing the following commands in the sheet_music_recognizer directory:

   source venv/bin/activate
   python manage.py runserver

The HTTP server starts listening on port 8000.

10) Start the sheet music search web server by running the following command in the sheet_music_search directory:

   bin/www

The HTTP server starts listening on port 3000.

11) Navigate your web browser to http://127.0.0.1:3000/search/


Bibliography

1. Sheet music metasearch [online]. Google Scholar [visited on 2018-05-15]. Available from: https://scholar.google.com/scholar?q=sheet+music+metasearch.
2. SAMPSEL, Laurie J. Review: Reviewed Work: Sheet Music Consortium by The Consortium. Notes [online]. 2007, no. 3, pp. 663–667 [visited on 2018-05-15]. ISSN 00274380, 1534150X. Available from: http://www.jstor.org/stable/4487851.
3. Protocol for Metadata Harvesting - v.2.0 [online]. Open Archives Initiative [visited on 2018-05-15]. Available from: https://www.openarchives.org/OAI/openarchivesprotocol.html.
4. Sheet Music Consortium [online]. UCLA Digital Library Program [visited on 2018-05-15]. Available from: http://digital2.library.ucla.edu/sheetmusic/.
5. OAIster [online]. OAIster [visited on 2018-05-15]. Available from: https://www.oclc.org/en/oaister.html.
6. Sheet music metasearch engine [online]. Google [visited on 2018-02-20]. Available from: https://www.google.cz/search?q=sheet+music+metasearch+engine.
7. Bing: Sheet music metasearch engine [online]. Microsoft [visited on 2018-02-20]. Available from: https://www.bing.com/search?q=sheet+music+metasearch+engine.
8. Musicedmagic: Free Sheet Music Search Engine [online]. Chad Criswell [visited on 2018-02-20]. Available from: https://www.musicedmagic.com/uncategorised/sheet-music-search-engine.html.
9. Sheet Music Meta Search Engine [online]. Looksheetmusic.com [visited on 2018-02-20]. Available from: http://www.looksheetmusic.com/info.php?sl=smp&sku=HL.240024.
10. Sheet Music Plus [online]. Sheet Music Plus [visited on 2018-02-20]. Available from: https://www.sheetmusicplus.com/.


11. Sheet Music Downloads at Musicnotes.com [online]. Musicnotes, Inc. [visited on 2018-02-20]. Available from: https://www.musicnotes.com/.
12. Record Reviews, Streaming Songs, Genres & Bands [online]. AllMusic [visited on 2018-02-20]. Available from: https://www.allmusic.com/.
13. Database and Marketplace for Music on Vinyl, CD, Cassette and More [online]. Discogs [visited on 2018-02-20]. Available from: https://www.discogs.com/.
14. Play music, find songs, and discover artists [online]. Last.fm [visited on 2018-02-20]. Available from: https://www.last.fm/.
15. Discogs Data [online]. Discogs [visited on 2018-02-20]. Available from: https://data.discogs.com/.
16. PRUSLIN, Dennis Howard. Automatic recognition of sheet music. Cambridge, Massachusetts, USA, 1966. PhD thesis. Massachusetts Institute of Technology.
17. PRERAU, David Stewart. Computer Pattern Recognition of Standard Engraved Music Notation. Cambridge, Massachusetts, USA, 1970. PhD thesis. Massachusetts Institute of Technology.
18. RAPHAEL, Christopher; WANG, Jingya. New Approaches to Optical Music Recognition. Proceedings of the 12th International Society for Music Information Retrieval Conference, ISMIR 2011, Miami, Florida, USA, October 24-28, 2011 [online]. 2011, pp. 305–310 [visited on 2018-05-15]. Available from: http://ismir2011.ismir.net/papers/OS3-3.
19. REBELO, Ana et al. Optical music recognition: state-of-the-art and open issues. International Journal of Multimedia Information Retrieval [online]. 2012, vol. 1, no. 3, pp. 173–190 [visited on 2018-02-20]. ISSN 2192-662X. Available from: https://doi.org/10.1007/s13735-012-0004-6.
20. FORNÉS, Alicia; SÁNCHEZ, Gemma. Handbook of Document Image Processing and Recognition [online]. 2014, pp. 749–774 [visited on 2018-05-22]. ISBN 978-0-85729-859-1. Available from: https://doi.org/10.1007/978-0-85729-859-1_24.


21. NOVOTNÝ, J.; POKORNÝ, J. Introduction to optical music recognition: Overview and practical challenges. CEUR Workshop Proceedings. 2015, vol. 1343, pp. 65–76.
22. FREMEREY, Christian. Automatic Organization of Digital Music Documents – Sheet Music and Audio [online]. Bonn, Germany, 2010 [visited on 2018-02-20]. Available from: http://hss.ulb.uni-bonn.de/2010/2242/2242.htm. Dissertation. University of Bonn.
23. tesseract-ocr/tesseract: Tesseract Open Source OCR Engine (main repository) [online]. GitHub [visited on 2018-02-20]. Available from: https://github.com/tesseract-ocr/tesseract.
24. Best OCR for Windows – ABBYY FineReader 14 [online]. ABBYY [visited on 2018-02-20]. Available from: https://www.abbyy.com/en-ee/finereader/.
25. tmbdev/ocropy: Python-based tools for document analysis and OCR [online]. GitHub [visited on 2018-02-20]. Available from: https://github.com/tmbdev/ocropy.
26. Asprise: global imaging leader; offering software SDK library API for OCR, Java/.NET document scanner scanning component (dll) [online]. Asprise Software [visited on 2018-02-20]. Available from: https://asprise.com/.
27. OmniPage Family | Nuance [online]. Nuance Communications, Inc. [visited on 2018-02-20]. Available from: https://www.nuance.com/print-capture-and-pdf-solutions/optical-character-recognition/.html.
28. GOODFELLOW, Ian; BENGIO, Yoshua; COURVILLE, Aaron. Deep Learning [online]. MIT Press, 2016 [visited on 2018-04-02]. Available from: http://www.deeplearningbook.org.
29. Convolutional Neural Network - MATLAB & Simulink [online]. MathWorks, Inc. [visited on 2018-05-15]. Available from: https://www.mathworks.com/solutions/deep-learning/convolutional-neural-network.html.
30. HE, Kaiming; ZHANG, Xiangyu; REN, Shaoqing; SUN, Jian. Deep Residual Learning for Image Recognition. CoRR [online]. 2015, vol. abs/1512.03385 [visited on 2018-04-02]. Available from: http://arxiv.org/abs/1512.03385.


31. RUSSAKOVSKY, Olga et al. ImageNet Large Scale Visual Recognition Challenge. CoRR [online]. 2014, vol. abs/1409.0575 [visited on 2018-04-02]. Available from: http://arxiv.org/abs/1409.0575.
32. RUDER, Sebastian. An overview of gradient descent optimization algorithms. CoRR [online]. 2016, vol. abs/1609.04747 [visited on 2018-04-02]. Available from: http://arxiv.org/abs/1609.04747.
33. KRIZHEVSKY, Alex; SUTSKEVER, Ilya; HINTON, Geoffrey E. ImageNet Classification with Deep Convolutional Neural Networks. Proceedings of the 25th International Conference on Neural Information Processing Systems – Volume 1 [online]. 2012, pp. 1097–1105 [visited on 2018-05-15]. Available from: http://dl.acm.org/citation.cfm?id=2999134.2999257.
34. LEVENSHTEIN, V. I. Binary Codes Capable of Correcting Deletions, Insertions and Reversals. Soviet Physics Doklady. 1966, vol. 10, p. 707.
35. NAVARRO, Gonzalo. A Guided Tour to Approximate String Matching. ACM Comput. Surv. [online]. 2001, vol. 33, no. 1, pp. 31–88 [visited on 2018-05-15]. ISSN 0360-0300. Available from: http://doi.acm.org/10.1145/375360.375365.
36. WAGNER, Robert A.; FISCHER, Michael J. The String-to-String Correction Problem. J. ACM [online]. 1974, vol. 21, no. 1, pp. 168–173 [visited on 2018-05-15]. ISSN 0004-5411. Available from: http://doi.acm.org/10.1145/321796.321811.
37. BRASS, Peter. Advanced Data Structures. Cambridge University Press, 2008. Available from DOI: 10.1017/CBO9780511800191.
38. Bing images [online]. Microsoft [visited on 2018-04-02]. Available from: https://www.bing.com/images/.
39. Obrázky.cz [online]. Seznam.cz, a.s. [visited on 2018-04-02]. Available from: https://www.obrazky.cz/.
40. npm [online]. npm, Inc. [visited on 2018-04-02]. Available from: https://www.npmjs.com/.
41. The DCI Architecture: A New Vision of Object-Oriented Programming [online]. Trygve Reenskaug and James O. Coplien [visited on 2018-04-02]. Available from: https://www.artima.com/articles/dci_vision.html.
