TOWARDS FULL-PIPELINE HANDWRITTEN OMR WITH MUSICAL SYMBOL DETECTION BY U-NETS

Jan Hajič jr.¹  Matthias Dorfer²  Gerhard Widmer²  Pavel Pecina¹
¹ Institute of Formal and Applied Linguistics, Charles University
² Institute of Computational Perception, Johannes Kepler University
[email protected]

ABSTRACT

Detecting music notation symbols is the most immediate unsolved subproblem in Optical Music Recognition for musical manuscripts. We show that a U-Net architecture for semantic segmentation combined with a trivial detector already establishes a high baseline for this task, and we propose tricks that further improve detection performance: training against convex hulls of symbol masks, and multichannel output models that enable feature sharing for semantically related symbols. The latter is helpful especially for clefs, which have severe impacts on the overall OMR result. We then integrate the networks into an OMR pipeline by applying a subsequent notation assembly stage, establishing a new baseline result for pitch inference in handwritten music at an f-score of 0.81. Given the automatically inferred pitches, we run retrieval experiments on handwritten scores, providing first empirical evidence that utilizing these powerful image processing models brings content-based search in large musical manuscript archives within reach.

Figure 1. OMR pipeline in this work. Top-down: (1) input score, (3) symbol detection output, (4) notation assembly output. Obtaining MIDI from the output of the notation assembly stage (for evaluating pitch accuracy and retrieval performance) is then deterministic. Our work focuses on the symbol detection step (1) → (3); notation reconstruction is done only with a simple baseline.

1. INTRODUCTION

Optical Music Recognition (OMR), the field of automatically reading music notation from images, has long held the significant promise for music information retrieval of making a great diversity of music available for further processing. More compositions have probably been written than recorded, and more have remained in manuscript form rather than being typeset; this is not restricted to the tens of thousands of manuscripts from before the age of recordings, but holds also for contemporary music, where many manuscripts have been left unperformed for reasons unrelated to their musical quality. Making the content of such manuscript collections accessible digitally and searchable is one of the long-held promises of OMR, and at the same time OMR is reported to be the bottleneck there [17]. On printed music or simpler early music notation, this has been attempted by the PROBADO [17, 28] or SIMSSA/Liber Usualis [3] projects. However, for manuscripts, results are not forthcoming.

The usual approach to OMR is to break the problem down into a four-step pipeline: (1) preprocessing and binarization, (2) staffline removal, (3) symbol detection (localization and classification), and (4) notation reconstruction [2]. Once stage (4) is done, the musical content — pitch, duration, and onsets — can be inferred, and the score itself can be encoded in a digital format such as MIDI, MEI¹ or MusicXML. We term OMR systems based on explicitly modeling these stages Full-Pipeline OMR.

Binarization and staff removal have been successfully tackled with convolutional neural networks (CNNs) [4, 11], formulated as semantic segmentation. Symbol classification achieves good results as well [12, 13, 33]. However, detecting the symbols on a full page remains the next major bottleneck for handwritten OMR. As CNNs have not yet been applied to this task, they are a natural choice.

Full-Pipeline OMR is not necessarily the only viable approach: recently, end-to-end OMR systems have been proposed [16, 24]. However, they have so far been limited to short excerpts of monophonic music, and it is not clear how to generalize their output design from MIDI equivalents to lossless structured encoding such as MEI or MusicXML, so full-pipeline approaches remain justified.

¹ http://music-encoding.org/


Our work mainly addresses step (3) of the pipeline, applied in the context of a baseline full-pipeline system, as depicted in Fig. 1. We skip stage (2): since we jointly segment and classify, we treat stafflines as any other object and do not have to remove them in order to obtain a more reasonable pre-segmentation. We claim the following contributions:

(1) U-Nets used for musical symbol detection. Applying fully convolutional networks, specifically the U-Net architecture [38], for musical symbol segmentation and classification, without the need for staffline removal. We apply improvements in the training setup that help overcome OMR-specific issues. The results in Sec. 5 show that the improvements one expects from deep learning in computer vision are indeed present.

(2) Full-Pipeline Handwritten OMR Baseline for Pitch Accuracy and Retrieval. We combine our stage (3) symbol detection results with a baseline stage (4) system for notation assembly and pitch inference. This OMR system already achieves promising pitch-based retrieval results on handwritten music notation; to the best of our knowledge, its pitch inference f-score of 0.81 is the first reported result of its kind, and it is the first published full-pipeline OMR system to demonstrably perform a useful task well on handwritten music.

2. RELATED WORK

U-Nets. U-Nets [38] are fully convolutional networks shaped like an autoencoder that introduce skip-connections between corresponding layers of the downsampling and upsampling halves of the model (see Fig. 2). For each pixel, they output a probability of belonging to a specific class. U-Nets are meant for semantic segmentation, not instance segmentation/object detection, which means that they require an ad-hoc detector on top of the pixel-wise output. On the other hand, this formulation avoids domain-specific hyperparameters such as choosing R-CNN anchor box sizes, is agnostic towards the shapes of the objects we are looking for, and does not assume any implicit priors on their sizes. This promises that the same hyperparameter settings can be used for all the visually disparate classes (the one neuralgic point being the choice of receptive field size). Furthermore, U-Nets process the entire image in a single shot — a considerable advantage, as music notation often contains upwards of 500 symbols on a single page. A disadvantage of U-Nets (as of most CNNs) is their sensitivity to the training data distribution, including the digital imaging process. Because of the variability of musical manuscripts, real-world applications will likely require case-specific training data, and data augmentation would therefore be used to mitigate this sensitivity; fortunately, fully convolutional networks are known to respond well to data augmentation over sheet music [30] as well as in other application scenarios [9, 23]. We therefore consider this choice reasonable, at the very least to establish a strong baseline for handwritten musical symbol detection with deep learning.

Object Detection CNNs. A standard architecture for object detection is the Regional CNN (R-CNN) family, most notably Faster R-CNN [40] and Mask R-CNN [26]. These networks output probabilities of an object's presence in each of a pre-defined set of anchor boxes, and make the bounding box predictions more accurate with regression. In comparison, the U-Net architecture may have an advantage in dealing with musical symbols that have significantly varying extents, such as beams or stems, as it does not require specifying appropriate anchor box sizes, and it is significantly faster, requiring only one pass of the network (the detector then requires one connected component search). Furthermore, Faster R-CNN does not output pixel masks, which are useful for archival- and musicology-oriented applications downstream of OMR, such as handwriting-based authorship attribution. Mask R-CNN, admittedly, does not have this limitation, but still requires the same bounding box setup. Another option is the YOLO architecture [25], specifically its most recent version, YOLOv3 [36], which predicts bounding boxes and confidence degrees without the need to specify anchor boxes. A similar approach was proposed in [22], achieving a notehead detection f-score of 0.97, but only with a post-filtering step.

Convolutional Networks in OMR. Convolutional networks have been applied in OMR to symbol classification [33], indicating that they can in principle handle the variability of music notation symbols, but not yet to also finding the symbols on the page. Fully convolutional networks have been successfully applied to staff removal [4], and to resolving the document into background, staff, text, and symbol layers [11]. However, these are semantic segmentation tasks, whereas we need to make decisions about individual symbols. The potential of U-Nets for symbol detection was preliminarily demonstrated on noteheads [22, 31], but compared to other symbol classes, noteheads are "easy targets": they look different from other elements, have constant size, and appear only in one pose (as opposed to, e.g., beams).

OMR Symbol Detection. Localizing symbols on the page has previously been addressed with heuristics rather than machine learning, e.g. with projections [8, 18], Kalman filters [14], Line Adjacency Graphs [37], or other combinations of low-level image features [39]. On handwritten music, due to its variability, more complex heuristics have been applied, such as the algorithm of [1], which consists of 14 interdependent steps.

OMR for Content-Based Retrieval. The idea of using imperfect OMR for retrieval is not new, although originally OMR was attempted in the context of transcribing individual scores. In the PROBADO project [17, 28], an off-the-shelf OMR system was applied to printed Common Western Music Notation (CWMN) scores, allowing retrieval and measure-level score following in a database of 1200 printed scores. The Liber Usualis project at SIMSSA is another such project, on square plainchant notation; it operates at a more fine-grained level that allows for example accurate motif retrieval [3]. However, for CWMN manuscripts, we are not aware of similar experiments.
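To make the "ad-hoc detector" concrete: once a network outputs per-pixel class probabilities, instance detection can be as simple as thresholding followed by connected-component labeling. The following is a minimal sketch of such a detector in Python with SciPy; the threshold, the minimum-area filter, and all names are our illustrative assumptions, not the authors' code.

    import numpy as np
    from scipy import ndimage

    def detect_symbols(prob_map, threshold=0.5, min_area=10):
        """Turn a pixel-wise probability map for one symbol class
        into a list of detected symbol instances.

        prob_map: 2D float array of per-pixel class probabilities
                  (sigmoid outputs of the segmentation network).
        Returns a list of (bounding_box, mask) pairs.
        """
        binary = prob_map > threshold
        # Each connected component of foreground pixels is one candidate symbol.
        labels, n_components = ndimage.label(binary)
        detections = []
        for i, slc in enumerate(ndimage.find_objects(labels), start=1):
            mask = labels[slc] == i
            if mask.sum() < min_area:
                continue  # drop tiny speckles (a heuristic of ours, not from the paper)
            bbox = (slc[0].start, slc[1].start, slc[0].stop, slc[1].stop)  # top, left, bottom, right
            detections.append((bbox, mask))
        return detections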

Figure 2. Baseline U-Net model architecture.

3. MODEL

For all experiments, we use as a basis the same fully convolutional network architecture [38] as shown in Figure 2. There are three down-sampling blocks and three corresponding up-sampling blocks. Each down-sampling block consists of two convolutional layers with batch normalization, using the same number of filters; down-sampling is done through 2x2 max pooling. After each downsizing step, we use twice the number of filters. The output layer uses sigmoid activation; otherwise, the ELU nonlinearity is used. Additionally, we add element-wise-sum residual connections between symmetric layers of the encoder and decoder parts of the network.

In the rest of this section, we propose modifications to both the architecture and the training strategy for symbol detection in handwritten sheet music.

3.1 Convex Hull Segmentation Targets

Our first proposal is to use the convex hull region of individual symbols as a target for training, instead of the original segmentation masks. Figure 3 shows an example of the modified training targets. This simple adaptation is an elegant way of dealing with symbols such as f-clefs or c-clefs, which by definition consist of multiple components. As we employ a connected components detector for recognizing the symbols in our experiments in Section 4, we circumvent the need for treating these symbol classes in any special way. This advantage also holds "pre-emptively" for complex symbols which, for example, contain "holes" and might break up into multiple components after imperfect automatic segmentation, or may be disconnected due to handwriting style (e.g., flats).

Figure 3. Training on convex hulls circumvents detection problems for symbols consisting of multiple connected components (such as the f-clef).

3.2 Multichannel Training

Our second proposal is to train multichannel U-Nets predicting the segmentation simultaneously for multiple symbol classes. This design choice has two advantages over training separate detectors for each class. Firstly, at runtime we can predict the segmentation for multiple symbols with a single forward pass of the network. Furthermore, by simultaneously training on multiple symbols at the same time, we allow the model to share low-level feature maps for a certain symbol group (e.g., noteheads, beams, flags, and stems), and on the other hand force the model to learn upper-layer features that discriminate well between the various symbols, which — because the capacity of the model stays fixed, and the output layer only uses 1x1 convolutions — could lead to more descriptive representations of the image. In other words, due to the strong correlations across classes induced by music notation syntax, whatever features are learned for one output channel will at the same time be relevant for a different channel; the 1x1 convolution will simply weigh them differently.

However, this setup presents an optimization problem due to imbalanced classes, both in terms of how many foreground pixels there are (e.g., beams vs. duration dots) and with respect to how often they occur on an "average" page of sheet music (noteheads vs. clefs). We address the first issue by splitting the multichannel model into groups of symbols with a roughly similar amount of foreground pixels across the dataset. To overcome the second issue, as the training setup operates on randomly chosen windows of the input image (see Sec. 4), we use oversampling: when drawing the random window while a training batch is being built, we check whether the window contains at least one pixel of the target class, and we retry up to five times if there is none. If no target class pixel is found in five tries, we concede and use the last sampled window, even though no pixel of the target class is in it. (As opposed to this oversampling, adjusting the weights of the output channels did not lead to improvements.)

Furthermore, if model capacity becomes a limiting factor, we can opt out of sharing the up-sampling part of the model and keep a separate "decoder" for each output channel. This is a compromise that retains some of the speed, space, and feature-sharing advantages, but at the same time does not so severely restrict the capacity of the model.
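For concreteness, the following is a condensed PyTorch sketch of the kind of architecture this section describes: three down-sampling blocks of two batch-normalized convolutions each, 2x2 max pooling with filter doubling, ELU activations, element-wise-sum connections between symmetric encoder and decoder layers, and a sigmoid output with one channel per symbol class. The base filter count, the up-sampling operator, and the 1x1 channel-reduction convolutions are our assumptions; the paper does not specify them.

    import torch
    import torch.nn as nn

    def conv_block(c_in, c_out):
        # Two 3x3 convolutions with batch normalization and ELU.
        return nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ELU(),
            nn.Conv2d(c_out, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ELU(),
        )

    class UNet(nn.Module):
        def __init__(self, n_classes=1, base=16):  # `base` filter count is an assumption
            super().__init__()
            self.enc1 = conv_block(1, base)
            self.enc2 = conv_block(base, 2 * base)
            self.enc3 = conv_block(2 * base, 4 * base)
            self.bottleneck = conv_block(4 * base, 8 * base)
            self.pool = nn.MaxPool2d(2)
            self.up = nn.Upsample(scale_factor=2, mode="nearest")
            # 1x1 convolutions reduce channels so decoder features can be
            # summed element-wise with the corresponding encoder features.
            self.red3 = nn.Conv2d(8 * base, 4 * base, 1)
            self.red2 = nn.Conv2d(4 * base, 2 * base, 1)
            self.red1 = nn.Conv2d(2 * base, base, 1)
            self.dec3 = conv_block(4 * base, 4 * base)
            self.dec2 = conv_block(2 * base, 2 * base)
            self.dec1 = conv_block(base, base)
            self.out = nn.Conv2d(base, n_classes, 1)  # one sigmoid channel per class

        def forward(self, x):
            e1 = self.enc1(x)
            e2 = self.enc2(self.pool(e1))
            e3 = self.enc3(self.pool(e2))
            b = self.bottleneck(self.pool(e3))
            d3 = self.dec3(self.red3(self.up(b)) + e3)   # element-wise-sum skip
            d2 = self.dec2(self.red2(self.up(d3)) + e2)
            d1 = self.dec1(self.red1(self.up(d2)) + e1)
            return torch.sigmoid(self.out(d1))           # per-pixel class probabilities

A four-channel instance, UNet(n_classes=4), would then predict, e.g., noteheads, stems, beams, and flags in a single forward pass, as in the multichannel setup of Sec. 3.2.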

4. EXPERIMENTAL SETUP

We restrict ourselves to the subset of symbol classes that are necessary for pitch and duration inference (we currently do not detect tuplets: detecting handwritten digits is straightforward enough, but the difficulty with tuplets lies in the notation assembly stage). Already this selection contains symbols with heterogeneous appearance: constant size and trivial shape (specifically, noteheads, ledger lines, whole and half rests, duration dots), constant size and non-trivial shape (clefs, flags, accidentals, quarter, 8th, and 16th rests), and symbols that have simple shapes but varying extent (stems, beams, barlines).² We assume binary inputs, not least because large-scale OMR symbol detection ground truth is only available for binary images; however, binarization can be done with the same model.

² There are also notation symbols that can have non-trivial shape and varying extent, such as slurs or hairpins; however, these are required for neither pitch nor duration inference, and we therefore leave them out.

Dataset. We use the MUSCIMA++ dataset, version 1.0 [20]. This is the only publicly available dataset of handwritten music notation with ground truth for symbol detection at a scale that is feasible for machine learning. The dataset contains over 90 000 manually annotated symbols with pixel masks. We use the designated writer-independent test set from MUSCIMA++.

Training Details. We set the network input size to a 256 × 512 window and randomly sample crops of this size as training samples. We train all our models using the Adam update rule with an initial learning rate of 0.001 [27] and a batch size of 2 (with the 256 × 512 input window, this is equivalent to batches of a single 512 × 512 image as in [38]). Once the validation loss has not improved for 25 epochs, we divide the learning rate by 5 and continue training from the previously best model. This procedure is repeated two times.
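A minimal sketch of the random window sampling, with the oversampling retries from Sec. 3.2, might look as follows (names and data layout are our assumptions; the page is assumed to be at least 256 × 512 pixels):

    import numpy as np

    def sample_training_window(page, target, h=256, w=512, max_tries=5, rng=np.random):
        """Randomly crop an (h, w) training window, retrying up to `max_tries`
        times until the window contains at least one target-class pixel.

        page:   2D binary score image.
        target: 2D binary mask of the symbol class (e.g., its convex hull target).
        """
        for _ in range(max_tries):
            y = rng.randint(0, page.shape[0] - h + 1)
            x = rng.randint(0, page.shape[1] - w + 1)
            win_x = page[y:y + h, x:x + w]
            win_y = target[y:y + h, x:x + w]
            if win_y.any():   # oversampling: prefer windows with target pixels,
                break         # but concede after max_tries and keep the last one
        return win_x, win_y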
5. RESULTS

As there is no work to which we can compare directly, we first gather at least related OMR solutions, in order to provide whatever context we can for the reader. Then, we report results for symbol detection, and evaluate it in the context of downstream tasks: pitch inference in a baseline full-pipeline OMR scenario, as well as first experiments applying our models in retrieval settings.

5.1 Comparison to Existing Systems

Comparison to existing systems is hard, because there are few symbol detection results reported, and even fewer full-pipeline OMR results. Direct comparison is not possible, as the MUSCIMA++ dataset we use has been released only very recently, and previous OMR pipelines (see Sec. 2) generally do not have publicly available code. Furthermore, earlier literature on OMR rarely provides evaluation scores; most previous work on OMR has (sensibly) focused on printed music rather than manuscripts, and there are few established evaluation practices in OMR anyway [15, 21]. We do our best to at least gather literature where some results on related tasks are given, in order to provide context for our work.

Pitch accuracy, printed music. In printed music, results for pitch accuracy have been consistently very good, when reported. Already in [32], the GAMUT system is said to correctly recover 96 % of pitches in printed music. The complex fuzzy system of [39] achieves near-perfect pitch accuracy (98.7 %). Similarly, the CANTOR system evaluated in [5] achieves 98 % semantic accuracy — this time, including polyphonic music. On printed square notation, [19] achieves 95 % pitch accuracy. A combination of systems in [42] achieves over 85 % joint pitch and duration accuracy.

Symbol detection, handwritten music. The most extensive evaluation of symbol detection in handwritten music has been carried out in [1]. Using a complex combination of robust heuristics for segmentation and machine learning for classification, they achieve an average symbol detection f-score of 0.75. These results seem ripe to be surpassed with CNNs: in [31], 98 % handwritten notehead detection accuracy has been reported. For staff detection, a similar architecture has been used in [4] with over 97 % pixel-wise f-score, and similar results are available with a ConvNet pixel classification approach for semantic segmentation into background, text, staffs, and notation symbols [11]. At the same time, [33] reports symbol classification (without localization) accuracy over 98 %, indicating that CNNs are well capable of generalizing over the variety of handwritten musical symbols. However, we are not aware of pitch accuracy results reported on handwritten CWMN scores.

OMR for Retrieval. For retrieval, it is even harder to find comparable results, since evaluation metrics for retrieval depend on the test collection, and there is no such established collection for OMR. Using the open-source Audiveris³ OMR software, [7] matches 9803 printed monophonic fragments from A Dictionary of Musical Themes to their electronic counterparts, using a comparable DTW alignment that also (mostly) ignores note duration, reporting a top-1 accuracy of 0.44; however, the collection of themes is a difficult one, since it often contains very similar melodies.

³ https://github.com/audiveris/audiveris

Figure 4. Results for binary segmentation models for individual symbols. Blue: baseline training with mask output; green: training with convex hulls.

5.2 Symbol Detection

We report detection f-scores for the chosen subset of symbols. Aggregating the results is not too meaningful: some rare symbols have an outsized impact on downstream processing (clefs). In Fig. 4, we show the baseline results and compare them to the convex hull setup. Training against convex hulls of objects does address the issue of detecting otherwise disjoint symbols using connected components; otherwise, it achieves mixed results.
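For reference, the convex-hull targets of Sec. 3.1 can be built with off-the-shelf tools; the following is a sketch using scikit-image, where the per-symbol (position, mask) layout is our assumption about how the MUSCIMA++ annotations are consumed:

    import numpy as np
    from skimage.morphology import convex_hull_image

    def convex_hull_target(page_shape, symbols):
        """Build a training target where each symbol is replaced by its convex hull.

        page_shape: (height, width) of the page.
        symbols:    iterable of (top, left, mask) triples, one per annotated symbol
                    of the class; `mask` is the symbol's binary pixel mask.
        """
        target = np.zeros(page_shape, dtype=bool)
        for top, left, mask in symbols:
            hull = convex_hull_image(mask)  # fills gaps between disconnected strokes
            h, w = mask.shape
            target[top:top + h, left:left + w] |= hull
        return target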

Method                                 c-clef   g-clef   f-clef
single channel – no convex hull         0.48     0.58     0.52
single channel                          0.70     0.83     0.95
multi-channel – all                     0.16     0.37     0.49
multi-decoder – clefs, oversampling     0.77     0.96     0.93

Table 1. Comparison of detection performance (f-score) of clefs using different segmentation strategies.

Compressing the detector with multichannel training without a loss of performance was possible on correlated sub-groups of symbols that bypass the class imbalance problem, such as training together noteheads, stems, beams, and flags; the results worsened when all classes were trained at once. The clefs were most affected by all the changes to the model described in Section 3: improved by convex hull training, neglected when the multichannel model was trained to predict all symbol classes at once, and then drastically improved again when trained as a group with separate decoders and the oversampling strategy. Table 1 summarizes the results for clef detection. Clefs are critical for useful OMR, since they affect the pitch inferred from all subsequent noteheads.

6. APPLICATION SCENARIO: FULL-PIPELINE HANDWRITTEN OMR IN RETRIEVAL

We now explore the utility of the symbol detectors within an OMR pipeline. It is known in OMR that low-level errors can lead to effects on recognition of wildly different magnitudes [15, 35]; in the presence of detection errors, one should therefore see how severely they impact downstream applications. We choose a retrieval scenario as the application context for evaluating symbol detection. As opposed to applications where we produce the transcribed score [15, 21, 41], this is straightforward to evaluate.

To verify that our symbol detection approach can yield useful results in an application context, we add a simple notation assembly and pitch inference system on top of the symbol detection results. We choose retrieval as the most feasible application of handwritten OMR: there are music manuscript archives with thousands of scores that contain manual copies, and matching them cannot be done without their musical content.

For inferring pitch, we must re-introduce stafflines. However, we can safely assume they have been detected correctly: both [4] and our replication of their experiments with stafflines on this dataset exhibit extremely few errors, and these can be filtered away with a trivial projection heuristic such as that of [18].

6.1 Notation Assembly and Music Inference

Symbol detection alone is not sufficient for decoding musical information: meaningful units are configurations of symbols rather than the symbols themselves [6, 20]. The notation assembly stage is the step where these configurations are recovered (step (4) in the OMR pipeline; see Fig. 1). In the MUSCIMA++ dataset, they are represented as an oriented graph; once this graph is recovered, one can perform deterministic pitch inference.⁴

⁴ A proof-of-concept implementation: https://github.com/hajicj/muscima.

Symbol detection outputs vertices of the notation graph; we therefore need to recover graph edges. Replicating the baseline established in [20], we train a binary classifier over ordered symbol pairs. While this classifier achieves an f-score of 0.92, it makes embarrassing errors: noteheads connected to irrelevant ledger lines in chords, to beams that belong to an entirely different staff, and sometimes to multiple adjacent stems. We discard these obviously wrong edges using straightforward heuristics. We also discard detected objects that are entirely contained within another detected object. The last step is recovering precedence edges: we simply order rests and noteheads on each staff left-to-right; all noteheads connected to the same stem are considered simultaneous, but actual polyphony is ignored.

Once the pitches, durations, and onsets are inferred for the detected noteheads, we export them as a MIDI file. MIDI is appropriate for retrieval, since it presents straightforward ways of computing similarity. This file can then serve as both the query and the database key for the given score. To compute the similarity of two MIDI files, we align them using Dynamic Time Warping (DTW) [29] over sequences of time frames that contain onsets. The DTW score function for a pair of frames is 1 minus the Dice coefficient of the onset pitch sets in the frames. Then, we match individual pitches within the frame sets that are aligned by DTW and measure the f-score of predicted pitches. DTW is used as the similarity function in [7]; however, we do not reduce polyphonic music to its upper pitch envelope.
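A sketch of this similarity computation, under our reading that a "frame" is the set of pitches sharing one onset time and that plain DTW with a 1 minus Dice cost is used (function names are ours):

    import numpy as np

    def dice(a, b):
        """Dice coefficient of two pitch sets."""
        if not a and not b:
            return 1.0
        return 2.0 * len(a & b) / (len(a) + len(b))

    def dtw_align(frames_a, frames_b):
        """Align two sequences of onset-frame pitch sets with DTW.

        frames_a, frames_b: lists of sets of MIDI pitch numbers, one set
        per distinct onset time. Returns the aligned index pairs.
        """
        n, m = len(frames_a), len(frames_b)
        cost = np.full((n + 1, m + 1), np.inf)
        cost[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = 1.0 - dice(frames_a[i - 1], frames_b[j - 1])
                cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
        # Backtrack to recover the warping path.
        path, i, j = [], n, m
        while i > 0 and j > 0:
            path.append((i - 1, j - 1))
            step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
            if step == 0:
                i, j = i - 1, j - 1
            elif step == 1:
                i -= 1
            else:
                j -= 1
        return list(reversed(path))

The pitch f-score is then computed by matching individual pitches inside the aligned frame pairs.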

6.2 Results

We now report how the full-pipeline baseline on top of the object detection U-Nets predicts pitches, and how it can be used to retrieve related scores.

Pitch accuracy. We use the DTW alignment to directly evaluate pitch classification.⁵ Performing DTW on the inference outputs for page images, we achieve a (micro-)averaged F-score of only 0.59. Rather than errors in symbol detection, this is mostly due to the polyphony de-synchronization effects of bad duration inference; indeed, on (mostly) monophonic music, the pitch F-score jumps to 0.78. In order to bypass the de-synchronization problems that in fact obscure correct pitch recognition, we split the scores into individual staffs (118 in total) and evaluate pitch accuracy on these. The results for the test set staffs are reported in Fig. 5. On average, we obtain a pitch F-score of 0.81, with 0.83 for monophonic staffs (and, ignoring clef errors, 0.88).

⁵ We could evaluate duration classification as well, but due to errors by the notation assembly baseline, this is too low to be worth reporting.

Figure 5. Pitch F-score after DTW alignment on the 118 individual staffs in the writer-independent test set, ordered by result. Monophonic staffs (darker green) predictably score better than staffs with multiple voices or chords (yellow). We found no clear relationship between pitch accuracy and handwriting style.

Figure 6. Pitch f-score between predictions on test set staffs and (predictions on) training set pages. Notice the pages 07, 09 and 11: these are three movements from J. S. Bach's Cello suite no. 1, which contain musically highly related material.

Finally, we evaluate our detector in the context of a retrieval application. We run experiments both on gold-standard MIDI retrieval and duplicate score retrieval, using the predicted scores; since the similarity metric is pitch f-score, all retrieval experiments work in both directions. Experiments with ground truth MIDI correspond to cross-modal retrieval, where the modalities are a symbolic representation and the score projected into the MIDI modality using the OMR system; queries with predictions correspond to a simpler scenario where we are querying scores with scores, using the OMR system as a hash function.

Retrieving gold MIDI with scores. Given how small the test set is, retrieving the correct ground truth page — and even staff — should be near-perfect. For staff-to-staff retrieval, Prec@1 is 0.93; for page-to-page and staff-to-page retrieval, it is 1.0, indicating that with our U-Net object detection stage, retrieving gold-standard MIDI using handwritten scores (and vice versa, as the similarity metric is symmetrical) is feasible.

Retrieving scores with scores. The next scenario is to run retrieval not against the ground truth, but against MIDI predicted from different versions of the test set scores. While errors related to differences in handwriting get compounded, the rest of the pipeline imposes consistent limitations on both the database and query recognition outputs and may make the same errors on both query and database scores, making the task actually easier. Therefore, we select a confuse-retrieval subset of 7 scores from MUSCIMA++ that are as similar to each other as possible: mostly monophonic, and with 0–2 sharps. Some of these pieces are musically closely related. For these experiments, our database consists of recognition outputs computed from all confuse-retrieval pages in the training subset of MUSCIMA++. Queries are taken from predictions on the writer-independent test set: we use both the 7 entire pages and individual staffs (34 of those).

The system achieves perfect Prec@1 when pages are used as queries, and 0.94 when using staff queries (2 staff queries did not return the right piece as the top result). The retrieval scores are plotted in Fig. 6. We checked this score also with ground truth queries; that system likewise made only 2 errors, but on different queries, which we take as circumstantial evidence that the ground truth MIDI has different issues when matching against a predicted MIDI than a different prediction does. When measuring MAP with the cutoff k=6 (as there are 7 versions of each page in MUSCIMA++ and one of them is used for querying), it drops to 0.86.

7. DISCUSSION & CONCLUSIONS

We consider our work a successful step towards enabling applications of hitherto problematic handwritten OMR. The retrieval scenario results are an indication that U-Nets are a workable solution to the handwritten symbol detection bottleneck in the context of full-pipeline OMR. (Here, we must re-state that these results should not be interpreted as more than supporting evidence that our object detection method is viable for such scenarios!)

However, U-Nets are still in principle limited by the size of the receptive field: for instance, the middle of a long stem looks exactly the same as a barline. We could further leverage syntactic properties of music notation: e.g., the self-attention layer of [34] allows building up the final output from partial recognition results. Fragmenting of long symbols could be overcome with instance segmentation embeddings [10].

To the best of our knowledge, this is also the first time OMR was done with a machine-learning method for notation assembly. We in fact consider this the most interesting line of follow-up work. Recovering the notation graph itself seems like the next bottleneck, especially for duration inference. The non-independent nature of the edges poses an interesting structured prediction challenge, and one could also work towards models that jointly detect symbols and recover their relationships.

Despite their limitations, U-Nets can be used to detect handwritten music notation symbols. They establish a new CNN-based baseline for the object detection task, and we believe the results in pitch inference and a proof-of-concept retrieval scenario indicate that a significant step has been taken towards full-pipeline OMR systems, so that the content of musical manuscripts can become accessible digitally.

8. ACKNOWLEDGMENTS

Jan Hajič jr. and Pavel Pecina acknowledge support by the Czech Science Foundation grant no. P103/12/G084, Charles University Grant Agency grants 1444217 and 170217, and by SVV project 260 453.

9. REFERENCES

[1] Ana Rebelo. Robust Optical Recognition of Handwritten Musical Scores based on Domain Knowledge. PhD thesis, 2012.
[2] Ana Rebelo, Ichiro Fujinaga, Filipe Paszkiewicz, Andre R. S. Marcal, Carlos Guedes, and Jaime S. Cardoso. Optical Music Recognition: State-of-the-Art and Open Issues. Int J Multimed Info Retr, 1(3):173–190, Mar 2012.
[3] Andrew Hankinson, John Ashley Burgoyne, Gabriel Vigliensoni, Alastair Porter, Jessica Thompson, Wendy Liu, Remi Chiu, and Ichiro Fujinaga. Digital Document Image Retrieval Using Optical Music Recognition. In Proceedings of the 13th International Society for Music Information Retrieval Conference, ISMIR 2012, Mosteiro S. Bento da Vitória, Porto, Portugal, October 8-12, 2012, pages 577–582. FEUP Edições, 2012.
[4] Antonio-Javier Gallego and Jorge Calvo-Zaragoza. Staff-line removal with selectional auto-encoders. Expert Systems with Applications, 89:138–148, 2017.
[5] David Bainbridge. Extensible optical music recognition. page 112, 1997.
[6] David Bainbridge and Tim Bell. A music notation construction engine for optical music recognition. Software – Practice and Experience, 33(2):173–200, 2003.
[7] Stefan Balke, Sanu Pulimootil Achankunju, and Meinard Müller. Matching Musical Themes based on noisy OCR and OMR input. pages 703–707, 2015.
[8] P. Bellini, I. Bruno, and P. Nesi. Optical music sheet segmentation. In Proceedings First International Conference on WEB Delivering of Music, WEDELMUSIC 2001, pages 183–190. IEEE, 2001.
[9] Avi Ben-Cohen, Idit Diamant, Eyal Klang, Michal Amitai, and Hayit Greenspan. Fully Convolutional Network for Liver Segmentation and Lesions Detection. In Gustavo Carneiro et al., editors, Deep Learning and Data Labeling for Medical Applications, pages 77–85, Cham, 2016. Springer International Publishing.
[10] Bert De Brabandere, Davy Neven, and Luc Van Gool. Semantic Instance Segmentation with a Discriminative Loss Function. CoRR, abs/1708.02551, 2017.
[11] Jorge Calvo-Zaragoza, Gabriel Vigliensoni, and Ichiro Fujinaga. A machine learning framework for the categorization of elements in images of musical documents. In Third International Conference on Technologies for Music Notation and Representation, A Coruña, 2017. University of A Coruña.
[12] Sukalpa Chanda, Debleena Das, Umapada Pal, and Fumitaka Kimura. Offline Hand-Written Musical Symbol Recognition. 2014 14th International Conference on Frontiers in Handwriting Recognition, pages 405–410, Sep 2014.
[13] Cuihong Wen, Jing Zhang, Ana Rebelo, and Fanyong Cheng. A Directed Acyclic Graph-Large Margin Distribution Machine Model for Music Symbol Classification. PLOS ONE, 11(3):e0149688, Mar 2016.
[14] V. P. d'Andecy, J. Camillerapp, and I. Leplumey. Kalman filtering for segment detection: application to music scores analysis. In Proceedings of 12th International Conference on Pattern Recognition. IEEE Comput. Soc. Press, 1994.
[15] Donald Byrd and Jakob Grue Simonsen. Towards a Standard Testbed for Optical Music Recognition: Definitions, Metrics, and Page Images. Journal of New Music Research, 44(3):169–195, 2015.
[16] Eelco van der Wel and Karen Ullrich. Optical Music Recognition with Convolutional Sequence-to-Sequence Models. CoRR, abs/1707.04877, 2017.
[17] Christian Fremerey, Meinard Müller, Frank Kurth, and Michael Clausen. Automatic mapping of scanned sheet music to audio recordings. Proceedings of the International Conference on Music Information Retrieval, pages 413–418, 2008.
[18] Ichiro Fujinaga. Optical Music Recognition using Projections. Master's thesis, 1988.
[19] Gabriel Vigliensoni, John Ashley Burgoyne, Andrew Hankinson, and Ichiro Fujinaga. Automatic Pitch Detection in Printed Square Notation. In Proceedings of the 12th International Society for Music Information Retrieval Conference, ISMIR 2011, Miami, Florida, USA, October 24-28, 2011, pages 423–428. University of Miami, 2011.
[20] Jan Hajič jr. and Pavel Pecina. The MUSCIMA++ Dataset for Handwritten Optical Music Recognition. In 14th International Conference on Document Analysis and Recognition, pages 39–46, New York, USA, November 2017. IEEE Computer Society.
[21] Jan Hajič jr., Jiří Novotný, Pavel Pecina, and Jaroslav Pokorný. Further Steps towards a Standard Testbed for Optical Music Recognition. In Proceedings of the 17th International Society for Music Information Retrieval Conference, pages 157–163, New York, USA, 2016. New York University.
[22] Jan Hajič jr. and Pavel Pecina. Detecting Noteheads in Handwritten Scores with ConvNets and Bounding Box Regression. CoRR, abs/1708.01806, 2017.
[23] Jesús Muñoz-Bulnes, Carlos Fernández, Ignacio Parra, David Fernández-Llorca, and Miguel A. Sotelo. Deep fully convolutional networks with random data augmentation for enhanced generalization in road detection. In 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC). IEEE, Oct 2017.
[24] Jorge Calvo-Zaragoza, Jose J. Valero-Mas, and Antonio Pertusa. End-to-End Optical Music Recognition Using Neural Networks. In Proceedings of the 18th International Society for Music Information Retrieval Conference, ISMIR 2017, Suzhou, China, October 23-27, 2017, pages 472–477, 2017.
[25] Joseph Redmon, Santosh Kumar Divvala, Ross B. Girshick, and Ali Farhadi. You Only Look Once: Unified, Real-Time Object Detection. CoRR, abs/1506.02640, 2015.
[26] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross B. Girshick. Mask R-CNN. CoRR, abs/1703.06870, 2017.
[27] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations (ICLR) (arXiv:1412.6980), 2015.
[28] F. Kurth, M. Müller, C. Fremerey, Y. Chang, and M. Clausen. Automated synchronization of scanned sheet music with audio recordings. Proc. ISMIR, Vienna, AT, pages 261–266, 2007.
[29] Lawrence R. Rabiner and Biing-Hwang Juang. Fundamentals of speech recognition. Prentice Hall signal processing series. Prentice Hall, 1993.
[30] Matthias Dorfer, Andreas Arzt, and Gerhard Widmer. Learning Audio-Sheet Music Correspondences for Score Identification and Offline Alignment. In Proceedings of the 18th International Society for Music Information Retrieval Conference, ISMIR 2017, Suzhou, China, October 23-27, 2017, pages 115–122, 2017.
[31] Matthias Dorfer, Jan Hajič jr., and Gerhard Widmer. On the Potential of Fully Convolutional Neural Networks for Musical Symbol Detection. In Proceedings of the 12th IAPR International Workshop on Graphics Recognition, pages 53–54, New York, USA, 2017. IAPR TC10, IEEE Computer Society.
[32] Michael Droettboom and Ichiro Fujinaga. Symbol-level groundtruthing environment for OMR. Proceedings of the 5th International Conference on Music Information Retrieval (ISMIR 2004), pages 497–500, 2004.
[33] Alexander Pacha and Horst Eidenberger. Towards a Universal Music Symbol Classifier. In Proceedings of the 12th International Workshop on Graphics Recognition, 2017.
[34] N. Parmar, A. Vaswani, J. Uszkoreit, Ł. Kaiser, N. Shazeer, and A. Ku. Image Transformer. ArXiv e-prints, February 2018.
[35] Pierfrancesco Bellini, Ivan Bruno, and Paolo Nesi. Assessing Optical Music Recognition Tools. Computer Music Journal, 31(1):68–93, Mar 2007.
[36] Joseph Redmon and Ali Farhadi. YOLOv3: An Incremental Improvement. Technical report, 2018.
[37] K. T. Reed and J. R. Parker. Automatic computer recognition of printed music. Proceedings – International Conference on Pattern Recognition, 3:803–807, 1996.
[38] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, Part III, pages 234–241, Cham, 2015. Springer International Publishing.
[39] Florence Rossant and Isabelle Bloch. Robust and Adaptive OMR System Including Fuzzy Modeling, Fusion of Musical Rules, and Possible Error Detection. EURASIP Journal on Advances in Signal Processing, 2007(1):081541, 2007.
[40] Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Advances in Neural Information Processing Systems 28, pages 91–99, 2015.
[41] Mariusz Szwoch. Using MusicXML to Evaluate Accuracy of OMR Systems. Proceedings of the 5th International Conference on Diagrammatic Representation and Inference, pages 419–422, 2008.
[42] Victor Padilla, Alan Marsden, Alex McLean, and Kia Ng. Improving OMR for Digital Music Libraries with Multiple Recognisers and Multiple Sources. Proceedings of the 1st International Workshop on Digital Libraries for Musicology – DLfM '14, pages 1–8, 2014.