PhD Thesis, University of Helsinki, Department of Computer Science, November 2000
5 Quantisation

Rhythm together with melody is one of the basic elements in music. According to Longuet-Higgins [LH76], human listeners are much more sensitive to the perception of rhythm than to the perception of melody or pitch-related features. For trained and untrained listeners as well as for musicians it is usually easier to transcribe a rhythm than to transcribe details about heard intervals, harmonic changes, or absolute pitch.

“The term rhythm refers to the general sense of movement in time that characterizes our experience of music (Apel, 1972). Rhythm often refers to the organization of events in time, such that they combine perceptually into groups or induce a sense of meter (Cooper & Meyer, 1960; Lerdahl & Jackendoff, 1983). In this sense, rhythm is not an objective property of music, it is an experience that has both objective and subjective components. The experience of rhythm arises from the interaction of the various materials of music – pitch, intensity, timbre, and so forth – within the individual listener.” [LK94]

In general, the task of transcribing the rhythm of a performance differs from the beat tracking and tempo detection tasks described previously. For estimating a tempo profile it is sufficient to infer score time positions for a set of dedicated anchor notes (i.e., beats or clicks); for the rhythm transcription task, score time positions and durations must be inferred for all performance notes. Because the possible time positions in a score are specified as fractions using a rather small set of possible denominators, a score represents a grid of discrete time positions. The resolution of the grid is context and style dependent: common scores often use only a resolution of 1/16th notes, but higher resolutions (in most cases binary) or non-standard resolutions for arbitrary tuplets can always occur and might also be correct.

In the context of computer-aided transcription the process of rhythm transcription is called (musical) quantisation. It is equivalent to transferring timing information from a continuous domain (absolute performance timing in seconds) into a discrete domain (metrical score timing). Depending on the way an input file was created (e.g., free performance recording, recording synchronous to a MIDI click, mechanical performance), tempo detection must be performed before the quantisation; quantisation requires a given tempo profile. Because the tempo detection module might create a tempo profile that indicates score time positions only for anchor notes (e.g., start of a measure or down beats), the score time information for notes between these anchor points can still be imprecise because of slight tempo fluctuations between two beats. These errors must then be removed by the quantisation module. For example, an unquantised onset time at a score time position of 191/192 needs to be quantised to a reasonable grid position (e.g., quarter or eighth notes) to be displayable in conventional music notation. If a too high resolution is chosen for the conversion from performance timing into score timing (e.g., 1/128 of a whole note), the resulting scores will be displayable correctly but very hard to read because of complex note durations. If the resolution is chosen too low (e.g., a quarter note), the resulting score will contain overlap errors (i.e., notes that did not overlap in the performance have overlapping durations in the score) and will be very inaccurate compared to the performance data.
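To make this trade-off concrete, here is a worked example (not from the thesis) for the onset at score time 191/192 mentioned above, assuming simple rounding to the nearest grid position and all times given in whole notes:

r = \tfrac{1}{128}: \quad \tfrac{191}{192} \mapsto \tfrac{127}{128} \quad (\text{distance } \tfrac{1}{384}),
r = \tfrac{1}{8}: \quad \tfrac{191}{192} \mapsto 1 \quad (\text{distance } \tfrac{1}{192}).

The fine grid keeps the onset almost exactly where it was played, but 127/128 can hardly be notated readably; the coarser eighth-note grid moves the onset only slightly further (by 1/192 of a whole note) and yields the simple beat position 1.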
A human transcriber would here – if possible – prefer ‘simple’ transcriptions over complex solutions, but would intuitively detect the point where a correct, ‘simple’ transcription is not possible and a more complex solution needs to be chosen. For example, five equally spaced notes cannot be transcribed as eighth notes if they were all played during a single half note; here a more complex quintuplet needs to be chosen.

Musical quantisation can be divided into onset time quantisation and duration quantisation. It depends on the specific approach whether the durations of notes are treated and quantised as durations, or whether they are quantised indirectly by quantising the respective offset time positions. Commercial sequencer applications often also implement a quantisation approach usually called swing- or groove-quantisation. This type of quantisation is used for creating ‘performance-like’ MIDI files from quantised (mechanical) scores or from very inaccurate performance data. This performance simulation approach is outside the scope of this thesis and will therefore not be discussed in the following.

Similar to the already discussed issues of tempo detection and voice separation, the basic problem of quantisation is that the translation from score to performance, as well as the inverse task of transcription, are highly ambiguous relations and not injective functions. A single score can result in different but ‘correct’ performances (including different tempo profiles), and a single (non-mechanical) performance can be transcribed correctly by different scores of different complexity. If we, for example, assume that a rhythmically complex performance is a mechanical performance (e.g., played by a notation software system), then its rhythmical complexity should be preserved in a transcription. If we instead assume that this performance was created by an untrained human performer, a much simpler transcription should be preferred, because the complexity of the performance has likely been caused only by inexact playing: an untrained performer would probably not be able to play a complex piece very precisely.

A human transcriber would here intuitively evaluate the performance quality of the input data. If the whole piece was played with poor accuracy, he would allow more deviation between the written and the performed data than for a very precisely played performance. This means that if a piece is played with low accuracy, the transcriber intuitively increases the grid size for the quantisation. As shown in Chapter 4, the quality or accuracy of a performance can be inferred algorithmically from the performance data itself before starting tempo detection or quantisation. Different from other known approaches, our system uses this accuracy information to adapt some resolution and search window parameters.

As already shown in previous sections (e.g., Chapter 4), the onset time of a performed note is usually much closer to the timing specified in the score than its performed duration. In [DH89], [Pov77] is cited: “… the note durations in performance can deviate by up to 50 percent from their metrical value.” Therefore the expected and allowed amount of inaccuracy (i.e., deviation between observed performance and inferred score) is higher for the duration of a note than for its onset time.
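The thesis describes this adaptation only qualitatively; the following is a minimal sketch (function name, scaling, and constants are hypothetical assumptions, not the thesis model) of how an accuracy estimate could widen the allowed deviation windows, keeping the duration window wider than the onset window as motivated above:

```python
def adapt_windows(accuracy: float,
                  onset_window: float = 0.10,
                  duration_window: float = 0.50) -> tuple[float, float]:
    """Return (onset, duration) deviation windows in beats.

    accuracy: estimated performance accuracy in [0, 1] (cf. Chapter 4),
    where 1.0 means near-mechanical playing.  The default duration
    window is wider than the onset window, following [Pov77]: performed
    durations can deviate by up to 50 percent from their metrical value.
    All names and constants here are illustrative assumptions.
    """
    accuracy = max(0.0, min(1.0, accuracy))
    scale = 2.0 - accuracy          # 1.0 for exact, 2.0 for sloppy playing
    return onset_window * scale, duration_window * scale

# A sloppy performance (accuracy 0.3) gets markedly wider windows:
print(adapt_windows(0.3))           # roughly (0.17, 0.85)
```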
As shown in [Cam00b] and in our own experiments, the output quality of a quantisation algorithm also depends on a correct voice separation (see Chapter 2). If, because of an incorrect voice separation, the correct local context of a note is not available (previous or successive notes have been attached to other voice streams), the onset time position and even more the duration of this note might be quantised incorrectly. Figure 5.1 shows a correct quantisation of the last note in the first measure (note c”) and a wrong quantisation if this note is incorrectly attached to the voice stream of the upper voice.

[Figure: two notated versions of the passage, labelled ‘correct’ and ‘wrong’; the music notation itself was lost in text extraction]
Figure 5.1: Possible effect of wrong voice separation on quantisation.

In the remainder of this chapter we first give an overview of existing quantisation approaches and then describe an approach developed in the context of this thesis, which consists of a pattern-based part and a quantisation model that evaluates the local context of notes. All described approaches work on a single voice stream (including chords, but no overlapping notes) as input data.

5.1 Existing Approaches

In the following we give an introduction to musical quantisation. First we show and formally define the principles of the simplest type of quantisation, called grid quantisation. Then we discuss the details of some more advanced approaches known from the literature.

5.1.1 Grid Quantisation

This simplest form of musical quantisation maps (shifts) all performance time information to the closest time position of a fixed, equidistant grid of quantised score positions. For a performance time position t, always the closest grid time position is chosen as the quantised time position. This method – also called hard quantisation – is easy to calculate and is implemented in most commercial sequencer and notation software products. Let G be an ordered set of equidistant score time grid positions with

G = \{ g_1, \ldots, g_{|G|} \mid g_i = (i - 1) \cdot r,\; i \in \mathbb{N} \},   (5.1)

where r is the grid resolution given as a beat duration in score time units. For an arbitrary unquantised time position t and a given grid G (with g_{|G|} \geq t), a quantisation function t_{qpos} returning the quantised time position for t can be defined as

t_{qpos}(t, G) = \arg\min_{g \in G} |g - t|.   (5.2)

If |g_i − t| = |g_{i+1} − t| (because t = (n + 1/2) · r), the earlier grid position g_i should be returned as the result of t_{qpos}(t, G). In general, depending on the actual implementation, other selection strategies are also possible (e.g., latest grid position, avoiding collisions between notes). This approach requires that the grid size r is selected in advance (e.g., quavers) and kept fixed for the range of notes to which t_{qpos} is applied.
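A minimal sketch of this hard quantisation in Python (not thesis code): instead of materialising the finite set G, the closest multiple of r is computed directly, which is equivalent as long as g_{|G|} ≥ t; ties are resolved towards the earlier grid position as required above.

```python
from fractions import Fraction

def t_qpos(t: Fraction, r: Fraction) -> Fraction:
    """Hard quantisation following Eqs. (5.1)-(5.2): snap score time t
    to the nearest grid position g_i = (i - 1) * r; on a tie, return
    the earlier grid position."""
    i = t // r                       # index of the grid point at or before t
    g_prev = i * r
    g_next = g_prev + r
    # the earlier position wins when both grid points are equally close
    return g_prev if t - g_prev <= g_next - t else g_next

# Example: quantise unquantised onsets to an eighth-note grid (r = 1/8)
for t in (Fraction(191, 192), Fraction(3, 16), Fraction(9, 32)):
    print(t, "->", t_qpos(t, Fraction(1, 8)))
# 191/192 -> 1,  3/16 -> 1/8 (tie, earlier wins),  9/32 -> 1/4
```

Exact rational arithmetic (fractions.Fraction) is used instead of floating point so that score times like 191/192 are represented without rounding error and the tie rule behaves deterministically.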