Handwritten Optical Music Recognition Phd Thesis Proposal
Total Page:16
File Type:pdf, Size:1020Kb
Handwritten Optical Music Recognition PhD Thesis Proposal Jan Hajicˇ jr. Faculty of Mathematics and Physics, Charles University Institute of Formal and Applied Linguistics [email protected] 1 Introduction cess of reading music that OMR is seeking to emulate (Subsec. 1.2), and from this description we draw con- Optical Music Recognition (OMR) is a field of doc- clusions about why OMR is a difficult problem, espe- ument analysis that aims to automatically read music cially compared to OCR (Subsec. 1.3). We finally de- scores. Music notation encodes music in a graphical scribe the individual “flavors” of OMR based on the in- form; OMR backtracks through this process to extract put modalities and output requirements (Subsec. 1.4). the musical information from this graphical representa- Next, in Section2, we describe how OMR solu- tion. tions are typically structured and review the literature Common western music notation (CWMN) is an in- on OMR. We then proceed to present our contribu- tricate system for visually representing music that has tions to OMR so far: in Section3, we analyze the shown considerable resilience in the face of many de- requirements for designing an OMR benchmark, and velopments in the musical world, and has thrived be- present the MUSCIMA++ dataset,1 and in Section4, yond its original European cultural sphere, being able we present a data-driven approach to OMR end-goal to represent sufficiently well much of the world’s mu- evaluation.2 In Section5, we then present a plan of ex- sical traditions. It has been evolving for about 1000 periments towards completing a workable handwritten years, with the current standard having emerged in the offline optical music recognition system. 17th century and stabilized around the end of the 18th Section6 summarizes current achievements and con- century. In the latter half of the 20th century, there have cludes the thesis proposal. been multiple attempts to find new ways of conveying musical thought visually from composer to performer; 1.1 Motivation: Why OMR? however, none have yet threatened the dominance of the CWMN system. As long as musicians think of mu- There are three main application domains where OMR sic in terms of notes, and have the need to communicate can make a mark. this kind of music visually, CWMN is – probably – here 1.1.1 Live and post-hoc digitalization of music to stay, or at most keep evolving. Throughout these scores years, CWMN has settled into an optimum for quickly and accurately decoding the musical information. Un- OMR can be a useful tool for composers and arrangers, fortunately for OMR, this optimum has proven rather helping to digitize their work quicker; and for orches- complex for document analysis systems, and there are tral and ensemble leaders, especially early music prac- only a few sub-problems of OMR that have been re- titioners, who often need to digitize and transcribe solved successfully since OMR has first been attempted manuscripts. Even though digital note entry tools are over 50 years ago (Denis Pruslin, 1966; David S Prerau, available, they tend to restrict creative thought, and de- 1971). spite the best efforts of many developers, using these The main aim of the proposed thesis is to create an tools is often tedious and frustrating, which is detri- 3 OMR system capable of reading handwritten CWMN mental for the composition process. music notation. Two tasks will be attempted: offline 1 handwriting recognition, and augmenting the system The section on data,3, is taken from Haji cˇ jr. and Pecina (2017); all the work from that publication used in the thesis with audio inputs for multimodal human-in-the-loop proposal is the author’s own. recognition. Given the state of the art in OMR, how- 2The section on evaluation,4, is taken from the paper of ever, it was necessary to first create the supporting in- Hajicˇ jr. et al.(2016); all the work from that publication used frastructure of evaluation and datasets, including some in the thesis proposal is the author’s own. theoretical work on ground truth design. 3We have conducted a survey among current Czech com- posers at the Jana´cekˇ Academy of Music and Performing In the rest of this introduction, we first briefly present Arts, where 8 of 10 respondents stated they compose primar- the motivation for OMR and its application domains ily by hand, and 9 out of 10 stated they would use OMR soft- (Subsec. 1.1). We then describe in more detail the pro- ware if it was available and good enough. 1.1.2 Making accessible the musical content of standing the objectives of OMR. We now analyze the large music notation archives music reading process, in order to better understand the There is also a large amount of music that only ex- structure of the problem is OMR trying to solve. ists in its written form, in archives: both the works The “music” that CWMN encodes can be described of contemporary composers, and earlier works, espe- as an arrangement of note musical objects in time. Each cially from the 17th and 18th centuries. OMR would of these objects has four characteristics: pitch, dura- enable content-based search in these archives, at scale. tion, strength, and timbre. When performed, these In conjunction with open-source music analysis tools,4 properties translate to fundamental frequency, projec- OMR has the potential to expand the scope of possi- tion of the note into time, loudness (perceptual, not ble in musicology. In the Czech Republic,there are always straightforwardly correlated to amplitude), and thousands of early music pieces in archives such as the spectral characteristics, both for the harmonic series as- Kromeˇrˇ´ızˇ collection,5 or the Church of St. Jacob col- sociated with the note, and for noise outside this row. lection.6 Even more importantly, many 20th and 21st- Reading music denotes the process of gradually decod- century composers have produced significant amounts ing this arrangement of notes from the written score. of manuscripts that have not been published or digi- The rules of reading music determine how to decode tized (e.g.: I. Hurn´ık, K. Sklenicka,ˇ J. Novak,´ F. G. notes from music notation. It is a stateful process: ac- Emmert, A. Haba)´ 7 that are, or may well become, a part cording to these rules, in order to correctly decode the of the “classical music canon” – given proper exposure. next note, one needs to remember some of what has al- OMR could make such collections accessible at a frac- ready been read, especially in order to correctly decode tion of the current costs, thus enabling preservation of note onsets: when to start playing a certain note. valuable cultural resources and their active utilization Pitch and duration are recorded most carefully. The and dissemination worldwide.8 Besides significantly entirety of music notation is centered around conveying cutting down on publishing costs, OMR would open them precisely – and very quickly, to allow for sight- this trove of musical documents to digital humanities reading: real-time performance, where the music is in musicology. played just as the musician is reading. Encoding pitch also entails a complex apparatus that is intimately con- 1.1.3 Integration of written music into other nected to the prevailing musical theory of harmony and digital music applications relationships of pitches within this theoretical frame- OMR may also be useful as a component of a multi- work. The notation apparatus for strength and timbre modal system, where visual information from the score is less complex, and CWMN does not attempt to en- is combined with the corresponding audio signal. Score code these two attributes very precisely, merely pro- following, where the “active” part of the musical score viding some clues for the musician to interpret. is tracked according to the audio, is one such estab- The primary point of contact between the space of lished application,9 but applications in music peda- music and the space of music notation is the notehead. gogy or a “musician’s assistant” require more thorough Overwhelmingly often, this is a bijection: one notehead OMR capabilities. encodes one note object, and vice versa. Other music notation symbols are then used to encode the four at- 1.2 How music notation works tributes of notes: the aforementioned pitch, duration, We have stated that “music notation encodes music in strength, and timbre. a written form, OMR backtracks through this process” Pitch is a categorical variable, with values from a to decode the music from its written form; this is read- subset of frequencies defined by the scale and tuning ing music. Understanding this process is key to under- used. In the western music tradition, the scale of avail- able pitches usually consists of semitones, using pre- 4Such as the music21 library: http: dominantly – since the late 18th century – the well- //web:mit:edu/music21/ tempered tuning system, where the frequency of the 5http://wwwold nkp cz/iaml/ 1 : : next higher semitone si+1 is defined as si ∗ 2 12 . studie mahel2006:pdf The linear ordering of pitches by frequency, from 6Inventory of musical items of the St. Jacob church in Brno, 1763, De Adventu, no. 4, Moravian Land Museum, low to high, is arranged into a repeating pattern of 10 copy G 5.035 steps: A, B , C, D, E, F and G. The distance between 7From personal communication with persons knowledge- neighboring steps is one tone, except for the step from able or in charge of the collections. B to C and from E to F, which is a semitone. One rep- 8The fact that OMR is a topic of interest to music etition of this pattern is called an octave.