
Pattern Discovery in Old Hispanic Neume Notation

Paul Rouse, University of Bristol

16th May 2021

1 Introduction

The earliest surviving notation for medieval liturgical chant consists of neumes, which do not provide the precise pitch and rhythmic information that forms the basis of much musicological scholarship. For traditions such as the Old Hispanic chant, which were suppressed or fell out of use before pitched notation was developed, there is no later, pitched record of the same melodies,¹ and they cannot be transcribed into modern notation. However, we do know how the neumes describe the direction of melodic movement, although not the exact intervals, and their shapes have a rich structure which is clearly used in a consistent manner, even if we do not understand its meaning. Thus they do contain sufficient information for meaningful analysis, albeit using methodologies largely developed in the last decade.[1]

A class of musicological questions can be phrased in terms of identifying frequently occurring patterns of neumes, especially ones which are used in positions of particular musical significance, such as cadences or phrase beginnings, or settings of important words, or are used in a particular relationship with stressed and unstressed text syllables. In previous work, we have developed computer-aided methods for building a database of notated chants, for codifying the written shape and pitch contour represented by neumes, and for searching for given patterns of neumes.[2] Our Chant Editing and Analysis Program (CEAP) provides musicologists with a tool to test hypotheses concerning the use of specific neume patterns, but not, until now, to discover new patterns. The work reported here extends these techniques to discover recurring patterns automatically, presenting the musicologist with a display of those which occur frequently.

The patterns are necessarily approximate. A musical formula may vary according to the context, for example to accommodate an extra syllable of text, or a different accent pattern, and, while there is much consistency in the use of the notation, scribal variations do occur. In some cases one scribe uses two separate neumes, while another runs them together to form a larger unit. Additionally, our transcriptions may differ from the scribe’s intent, either by using two distinct, but almost identical, shapes where the medieval musician would have seen no difference, or by making a different judgement as to whether a gap between strokes starts a fresh neume. As a consequence of such variations, the result of comparing two patterns must be a measure of similarity, not a simple yes or no answer.

Before we can discuss the algorithm for discovering patterns, we must first explore the measurement of similarity between two patterns, which, in turn, involves an understanding of the information which can be gathered from neume notation. In the next section, we formulate a set of features which characterise each note within a neume, providing a new approach to encoding the neume interpretations established in our previous work. Then, in section 3, we show how this formulation is used to calculate a quantitative measure of the similarity, or otherwise, between two patterns. In section 4, we are then in a position to discuss how this measure is used in the discovery of recurring patterns.

¹ While two dozen Old Hispanic chants do survive in heighted Aquitanian notation, the evidence they provide about cognates in unpitched notation is limited, and must be treated with care.[1]

2 Encoding Neume Interpretations

For pattern-matching purposes, we focus on the sequence of individual notes represented by a neume, in terms of both melodic shape and the characteristics of the pen strokes used to notate them. This provides a structured way to encode the meaning of each neume, and to compare corresponding parts of the notated music without assuming that division into neumes is completely consistent. However, the beginning and ending notes are given specific markers in their encodings, allowing the division into neumes to be taken into account in measuring the quality of match. Comparison between patterns will rely on encoding this information in a particular way, which we present first below. This is a fine-grained encoding which should, nonetheless, be meaningful to musicologists, albeit not the most useful representation for everyday work. Section 2.2 shows how the new formulation is related to, and derived from, concepts and encodings developed previously for describing neume notation.

2.1 Note Features

In the encoding used in this paper, the characteristics of each note within a neume are expressed as a set (in the mathematical sense) of binary features, which are explained in this section. The term “feature” is always used with this meaning, and we will use the term “feature vector” to refer to the set of features which apply to a given note. Representing neume characteristics as sets of binary features leads to a straightforward measure of similarity between patterns, presented in section 3. As shown in Fig. 1, a neume is divided into components representing separate notes. Each component can be described by the shape of the main part of the pen stroke, the shape of the connection with the previous stroke, and the direction of pitch movement relative to the preceding note. Unless there is a sharp change of direction, a connection covers a small region as the pen moves out of one stroke and into the next, and has its own direction of curvature.

[Fig. 1: example neumes (a) and (b)]

After the first note of a neume, we usually know whether each note is higher than its predecessor, lower, or at the same pitch. Sometimes we know only that the pitch is unlikely to be lower, so the movement must be broadly upward, or conversely that it is unlikely to be higher, so the movement must be broadly downward. The appendix lists all of the symbols used in the present encoding; those beginning with the letter P are used for the features describing pitch relationships. Each of the cases just described is encoded by a pair of features in the feature vector, chosen so that overlapping meanings share one common feature. For example, a note which is definitely higher than the one before is described by the pair of features {Ph, Pph}, while a note which is merely unlikely to be lower is described by the pair {Pph, Pnl}. These pairs share the feature Pph, so are considered to match each other partially. At the start of a new neume, we have no information relating the pitch to the previous neume, so no pitch-related features are included in the feature vector of the first note.

Similar methods are used to encode the remaining characteristics. For example, again using feature names defined in the appendix, a continuous, smooth connection between strokes is represented by the features {Cj, Cs} in the feature vector, always accompanied by an additional feature which describes the direction of curvature and whether a loop is formed. The shared features Cj and Cs provide a partial match between any pair of smooth connections, even if they have different curvature directions, or one makes a loop while the other does not. Likewise, the features used to represent stroke shapes are designed to produce partial matching between the long strokes, whether curved or straight, and regardless of slope or curve direction; this happens via the feature called Sl in the appendix. Angled strokes with different directions also partially match each other because of the shared feature Sa. Only one feature is used for short, horizontal strokes, but for consistency with most strokes, which are encoded using two features, it is given double weight in the calculation of the metric in section 3. The same applies to the feature for a wavy stroke.

In addition to the features arising from the interpretation of the note itself, some extra feature types show the context in which the note appears in the neume. One feature (Ns) is used to mark the first note of a neume, and another (Ne) is used to mark the last (both are present on the same note in the case of a single-note neume). Another feature (Nsyl) is present in addition on the first note when the neume is the first on a syllable. When the neume is terminated with a hook, an additional feature (Nh) is present on its last note. In the future, further features may be added to encode characteristics of the text, such as whether the syllable is stressed.

As an example, the complete feature vector for the third note of Fig. 1(a) is:

{Ph, Pph, Cj, Cs, Cw, Sl, Scr}

The symbols used here are all described in the appendix, where there is also a table showing the encodings of all five notes of both example neumes.

This is a low-level encoding, which would normally be derived automatically from a more user-friendly notation, such as the neume descriptions used by CEAP. Before moving on to discussing its use in constructing a measure of similarity between patterns, we briefly comment on the relationship with other notations.
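The partial matching described in this section can be illustrated with ordinary sets. The following is only a minimal sketch: the feature names are those of the appendix, but the constant and helper names (`HIGHER`, `NOT_LOWER`, `shared`, and so on) are hypothetical and are not taken from CEAP.

```python
# Feature names follow the appendix; all Python names are illustrative only.
HIGHER = frozenset({"Ph", "Pph"})         # definitely higher than the previous note
NOT_LOWER = frozenset({"Pph", "Pnl"})     # merely unlikely to be lower
SMOOTH_ANTICLOCKWISE = frozenset({"Cj", "Cs", "Cw"})   # smooth join, anti-clockwise curve
SMOOTH_CLOCKWISE_LOOP = frozenset({"Cj", "Cs", "Ccl"})  # smooth join, clockwise loop

# Third note of Fig. 1(a), built from the pieces above plus its stroke shape.
note_a3 = HIGHER | SMOOTH_ANTICLOCKWISE | {"Sl", "Scr"}


def shared(x, y):
    """Return the features two notes have in common, sorted for display."""
    return sorted(x & y)


print(shared(HIGHER, NOT_LOWER))                            # ['Pph']
print(shared(SMOOTH_ANTICLOCKWISE, SMOOTH_CLOCKWISE_LOOP))  # ['Cj', 'Cs']
```

Overlapping interpretations share some, but not all, of their features, which is what allows the metric of section 3 to register a partial match.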

2.2 Relationship with Existing Encodings

The feature-based encoding rests on the same conceptual framework as the neume descriptions used in CEAP. The neumes module of MEI accommodates similar ideas.[3] The crucial change required for the pattern-matching methods of this paper is the use of binary features, since, as shown in the next section, this allows similarity between patterns to be measured simply by counting the number of features which are shared, and the number which are not shared.² By contrast, CEAP and MEI notations both make use of attributes which can hold a range of values.

² Our previous work has performed pattern matching using tables of similarity scores, indexed by the components of the reading. This is sufficient for the algorithms used then, but is not easily extended to have the mathematical properties required for clustering algorithms used in the present work.

For comparison, the complete description in CEAP of the neume in Fig. 1(a) is NUHLH:*())/:gwca. The pitch interpretations of all five notes are written first: neutral–probably upward–higher–lower–higher. These are followed by the stroke shapes: short horizontal–curved outward to left–curved rightward–curved rightward–straight to top right. Finally the connections are written, omitting the implicit gap which precedes the first note of the neume: gap–anti-clockwise curve–clockwise curve–sharp angle.

In this description, each attribute of a note takes one of several values: pitch can be any of N, H, U, S, D, or L; stroke shape can be *, ~, \, /, (, ), <, or >; and the connection can be g, a, c, or w. Two additional modifier characters effectively add further alternatives to these attributes. Firstly, the final note of a neume sometimes ends with a small hook-shaped stroke, which does not appear to represent a separate note; we represent this using a quote character (’) appended to the abbreviation for the last stroke. Secondly, the join between two strokes is sometimes written in a backwards direction, so that a loop is formed; our notation represents this using a small circle (◦) placed between the abbreviations denoting the two strokes which cross.

In practice, features are derived when needed from the CEAP notation, together with the context in which each neume appears. The right hand column of table A.1 in the appendix shows the CEAP symbols which result in a given feature being included in the feature vector. For example, the table shows that the pitch letters H, U, S, D, and L are each translated to two features in such a way that a single feature is shared by any adjacent pair in the order just given (S and D, for example, share Pnh). Note that the neutral pitch (N) is not associated with any features because it gives no information about pitch direction. In a few cases, such as a very short curved stroke, the translation to features might be modified as noted in the table, but this is not currently done.

This section has concentrated on the relationship between the existing CEAP notation and the features defined in this paper. However, the neumes module of MEI[3] can also describe the individual notes within neumes in terms of their shape and pitch direction, so the techniques of this paper could be applied to MEI input, using a broadly similar collection of features.³
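As an illustration of this derivation, the sketch below translates a CEAP description of the form pitches:strokes:connections into per-note feature sets, using the correspondences of table A.1. It is an assumption-laden sketch rather than CEAP's own code: the function name `ceap_to_features` is hypothetical, the hook (’) and loop (◦) modifiers are not handled, and the context-dependent features Nsyl and Nh are omitted.

```python
# Correspondences taken from table A.1; all Python names are hypothetical.
PITCH = {"N": set(), "H": {"Ph", "Pph"}, "U": {"Pph", "Pnl"},
         "S": {"Pnl", "Pnh"}, "D": {"Pnh", "Ppl"}, "L": {"Ppl", "Pl"}}
STROKE = {"\\": {"Sl", "Ssl"}, "/": {"Sl", "Ssr"}, "(": {"Sl", "Scl"},
          ")": {"Sl", "Scr"}, "<": {"Sa", "Sal"}, ">": {"Sa", "Sar"},
          "*": {"Sp"}, "~": {"So"}}
CONNECTION = {"g": {"Cg", "Cd"}, "a": {"Cj", "Cd"},
              "c": {"Cj", "Cs", "Cc"}, "w": {"Cj", "Cs", "Cw"}}


def ceap_to_features(description):
    """Translate 'pitches:strokes:connections' into one feature set per note."""
    pitches, strokes, connections = description.split(":")
    connections = "g" + connections  # restore the implicit gap before the first note
    if not len(pitches) == len(strokes) == len(connections):
        raise ValueError("the three parts must describe the same number of notes")
    notes = []
    last = len(pitches) - 1
    for i, (p, s, c) in enumerate(zip(pitches, strokes, connections)):
        features = PITCH[p] | STROKE[s] | CONNECTION[c]  # a new set each time
        if i == 0:
            features.add("Ns")   # first note of the neume
        if i == last:
            features.add("Ne")   # last note of the neume
        notes.append(frozenset(features))
    return notes


notes = ceap_to_features("NUHLH:*())/:gwca")   # the neume of Fig. 1(a)
print(sorted(notes[2]))   # third note: Ph, Pph, Cj, Cs, Cw, Sl, Scr in sorted order
```

With these tables, each adjacent pair of pitch letters in the order H, U, S, D, L shares exactly one feature, as described above.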

3 Measuring (Dis)similarity Between Patterns

Having defined the concepts of features and feature vectors in the last section, we are now in a position to show how to quantify the similarity between two patterns, or rather its converse, the dissimilarity. Initially, we define the measure of dissimilarity between pairs of single notes, calculated from their corresponding feature vectors, and then discuss how note by note measures are combined to produce a measure for entire patterns.

3.1 Single Notes

Given feature vectors x and y corresponding to two single notes, the dissimilarity between them will be measured by the Tanimoto distance.⁴ [5, 6] This is calculated by counting the number of features which are the same in both feature vectors, and the number which differ. To be precise, define the following values in terms of x and y:

a   the number of features present in x but not present in y
b   the number of features present in y but not present in x
c   the number of features present in both x and y
d   the number of features present in either x or y or both, i.e. d = a + b + c

Certain features (Sp and So) are always counted doubly in these values, as if there were two distinct features with the same meaning.⁵ The Tanimoto distance between the feature vectors is then defined as⁶

D(x,y) = (a + b) / d

Table A.2 in the appendix shows the values of this function for note-by-note comparisons of the neumes of Fig. 1. Taking the third note of each of these neumes to illustrate the calculation, and writing the encodings so that the features which are shared are aligned in columns:

Neume (a) 3rd note (x):  Ph   Pph  Cj   Cs   Cw        Sl   Scr
Neume (b) 3rd note (y):  Ph   Pph  Cj             Cd   Sl        Scl

we see that a = 3, b = 2, and c = 4, giving D(x,y) = 5/9 as shown in the table.

The function D(x,y) has a value between 0 and 1, where 0 arises from identical feature vectors, and 1 indicates completely dissimilar feature vectors, with no features in common. Comparing randomly chosen feature vectors gives values around 2/3, so genuine matches are indicated by distances significantly less than this.⁷

While the dissimilarity form of the measure is needed for the algorithms of section 4, output intended for users is more naturally presented in terms of a score which has larger values when the agreement is better:

S(x,y) = 1 − D(x,y) = c/d

The function D satisfies the requirements for a metric on the space of feature vectors. The clustering algorithms we use require only that D(x,y) ≥ 0, D(x,x) = 0, and D(x,y) = D(y,x) for all x and y. However, scores based on the measure are visible to the user, and might appear unintuitive if there were any large departures from the remaining metric axioms: D(x,y) > 0 if x ≠ y, and the triangle inequality D(x,y) + D(y,z) ≥ D(x,z) for all x, y and z.

A valuable property of this metric is that it depends only on the features actually used in the two vectors being compared. The existence of rarely used features does not reduce the sensitivity of the comparison between those patterns which do not use them. A related, convenient property is that the value of the metric is automatically normalised to lie in the range 0 to 1.

The choice of features used in the encoding has been made with this metric in mind. An application with different requirements might use another metric, for example the Hamming distance or the Euclidean distance, but the use of a different metric may require adjustments to the choice or weighting of features.

³ Some of the characteristics we use to describe neume shapes in CEAP can be encoded in MEI.[4] A full translation of our descriptions would require extensions to the current specification of the MEI neumes module.

⁴ The nomenclature used for this measure is inconsistent in the literature, since in simple cases it is also called either the Jaccard distance or the Soergel distance; however, in some situations, we rely on Lipkus’ extended definition[5], in which features may be weighted differently, so we use his terminology.

⁵ We may use non-integer weighting factors in the future, perhaps for features representing characteristics of the text.

⁶ Assuming that at least one feature is present in at least one of x and y, which is always true in our case.

⁷ The value 2/3 is the expected value of the distance between two randomly chosen feature vectors in which the probability of each feature being present is 0.5. Observationally, a similar value arises when random pieces of chant notation are compared using the set of features described in this paper.
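The calculation for single notes can be sketched in a few lines. This is an illustrative sketch only: the names `WEIGHTS`, `tanimoto_counts` and `tanimoto_distance` are hypothetical, feature vectors are assumed to be Python sets of the symbols listed in the appendix, and exact fractions are used so that the worked example for the third notes of Fig. 1 can be reproduced without rounding.

```python
from fractions import Fraction

# Sp and So are counted doubly; every other feature has weight 1.
WEIGHTS = {"Sp": 2, "So": 2}


def weight(features):
    """Total weight of a collection of features."""
    return sum(WEIGHTS.get(f, 1) for f in features)


def tanimoto_counts(x, y):
    """Return (a, b, c): weights of features only in x, only in y, and in both."""
    return weight(x - y), weight(y - x), weight(x & y)


def tanimoto_distance(x, y):
    """D(x, y) = (a + b) / d, with d = a + b + c."""
    a, b, c = tanimoto_counts(x, y)
    d = a + b + c
    if d == 0:
        raise ValueError("at least one feature must be present")
    return Fraction(a + b, d)


# Third notes of the two neumes of Fig. 1, as given in the text.
x = frozenset({"Ph", "Pph", "Cj", "Cs", "Cw", "Sl", "Scr"})
y = frozenset({"Ph", "Pph", "Cj", "Cd", "Sl", "Scl"})

print(tanimoto_counts(x, y))        # (3, 2, 4)
print(tanimoto_distance(x, y))      # 5/9
print(1 - tanimoto_distance(x, y))  # similarity score S = c/d = 4/9
```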

3.2 Sequences of Notes

Longer patterns can be compared by taking the corresponding notes of each, and combining the values of the dissimilarity at each position. We will assume that both patterns have the same length N. Encoding them as sequences of feature vectors x_1, x_2, ..., x_N and y_1, y_2, ..., y_N, the distances between corresponding notes are:

d_i = D(x_i, y_i)

Any type of average provides a natural way of combining these values. Two alternatives, both of which produce a metric satisfying the mathematical laws, and which are simple to calculate in our context, are:⁸

1. The sum of the individual d_i, or, equivalently for a fixed length N, their mean.

2. The mediant, assuming that the fractions d_i are not reduced to lowest terms. This can be viewed as a direct extension of the note-level metric, since the result is equivalent to the Tanimoto distance between the combined sets of features for the whole sequences, provided that occurrences of a given feature at different places in the sequence are treated as distinct.

For example, taking the two sequences of five notes contained in the neumes of Fig. 1, the mean (computed from the dissimilarities shown in table A.2 of the appendix) is 0.17, while the mediant is 7/35 = 0.2. Phrased in terms of the similarity scores presented to the user, these two neumes are therefore either 83% similar, or 80%, depending on which method of calculation is used.

So far, this example has assumed that the neumes are aligned with each other. However, they might occur at different positions in longer patterns, resulting in a much larger dissimilarity: for example, if neume (a) is matched against a punctum followed by the first four notes of neume (b), the dissimilarity is 0.71 (using the mean), which would be presented to the user as a similarity score of 29%, indicating correctly that the patterns do not match when aligned this way.

For the clustering algorithm of section 4, we have implemented both the mean and the mediant, but further work is required to determine whether there is any significant difference in performance between them. However, as discussed further in section 5, the mediant cannot be used with the type of algorithm we use to search for individual patterns.

Now that we have defined the machinery needed to measure the similarity between sequences of notes, we can discuss the way it is used in discovering patterns in chant notation.
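The two ways of combining note-level distances can be compared directly in code. The sketch below is illustrative only: the function names are hypothetical, the double weighting of Sp and So is ignored for brevity, and the two three-note sequences are invented for the example rather than taken from Fig. 1 or table A.2.

```python
from fractions import Fraction


def note_counts(x, y):
    """Return (a + b, d) for two notes, ignoring the double weights of Sp and So."""
    mismatched = len(x ^ y)
    return mismatched, mismatched + len(x & y)


def sequence_counts(xs, ys):
    if len(xs) != len(ys):
        raise ValueError("both patterns must have the same length N")
    return [note_counts(x, y) for x, y in zip(xs, ys)]


def mean_distance(xs, ys):
    """Mean of the note-by-note Tanimoto distances."""
    pairs = sequence_counts(xs, ys)
    return sum(Fraction(n, d) for n, d in pairs) / len(pairs)


def mediant_distance(xs, ys):
    """Mediant: sum of numerators over sum of denominators, without reducing first."""
    pairs = sequence_counts(xs, ys)
    return Fraction(sum(n for n, _ in pairs), sum(d for _, d in pairs))


# Hypothetical three-note sequences, for illustration only.
xs = [{"Cg", "Cd", "Sl", "Ssr"},
      {"Ph", "Pph", "Cj", "Cs", "Cw", "Sl", "Scr"},
      {"Pl", "Ppl", "Cj", "Cd", "Sl", "Ssl"}]
ys = [{"Cg", "Cd", "Sl", "Ssr"},
      {"Ph", "Pph", "Cj", "Cd", "Sl", "Scl"},
      {"Ppl", "Pnh", "Cj", "Cd", "Sl", "Ssl"}]

print(mean_distance(xs, ys))     # 53/189
print(mediant_distance(xs, ys))  # 7/20
```

The mediant gives more influence to notes with many features, whereas the mean treats every note position equally.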

4 Pattern Discovery by Clustering

The previous sections have shown how to construct a metric between groups of notes contained in unpitched neume notation, starting by identifying features of the notational signs, and counting matching and mismatching features. The intent is that musically similar groups should be close according to this metric, and those which are musically unrelated should be distant, the success of which ultimately has to be decided by musicologists. The primary motivation for this formulation is to allow the use of a clustering algorithm to discover new recurring patterns, as opposed to searching for patterns already identified by the researcher (which is the subject of section 5).

⁸ Further examples of product metrics are given in [7].

Three main stages are used in pattern discovery:

1. Find every sequence of notes in the group of chants which satisfies some chosen constraints.

2. Separately for each length of sequence, cluster them according to the distance function defined in section 3.

3. Post-process the clusters to

(a) merge shorter sequences into longer ones, and remove resulting duplicates

(b) group related clusters for presentation to the user.

The first step defines some limits on patterns included in the search. One is the length, in terms of the number of notes: initially a range of lengths, say 8 to 25 notes, is used, but longer sequences can be created by step 3. Sequences consisting mostly of single-note neumes are omitted, since they are unlikely to be interesting; at present we omit sequences with an average of fewer than two notes per neume. Finally, the note sequences may be limited to those having particular relationships with the text: for example, cadential patterns can be found by considering only sequences ending at a text phrase boundary, or patterns within melismas can be found by considering those which lie entirely on a single syllable.

Once the candidate sequences have been identified, they are grouped in step 2 using a hierarchical clustering algorithm. Complete-linkage clustering, in its optimised form (CLINK[8]), is used to build a hierarchy of progressively larger clusters, stopping when the member sequences would become too far apart according to our metric. The result is a set of clusters, each of which contains note sequences (in which each note is represented by its feature vector) which are within some distance dmax of each other. In our current implementation, dmax is fixed at 0.35 (recall that random pairs of sequences are likely to have a distance around 0.7), but may in the future be made adjustable by the user. Many sequences cannot be clustered with any other sequence within dmax, and are discarded as not representing recurring patterns.

The clusters resulting from step 2 contain many near-duplicate copies, which are merged, as far as possible, in step 3a. This duplication occurs as a result of working with sequences of several different lengths: for example, if a pattern is discovered in ten-note sequences, a closely related nine-note pattern is likely to be discovered in the nine-note sequences as well; in fact, two overlapping nine-note patterns may be discovered unless the search is fixed to text phrase boundaries. Two cases can be merged safely:

1. When two clusters contain the same number of member sequences, and each sequence in one cluster overlaps a sequence in the other, and all of the overlaps are by the same amount, then the pairs of sequences can be merged, and put into a single cluster replacing the original two.

2. If cluster A consists of longer note sequences than cluster B, and every member of B is wholly contained within a member of A, then B can be removed. However, to avoid misleading the user, this is not done if the quality of matching in A is poor, and that in B is significantly better.
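The second of these cases can be sketched as follows. This is a simplified illustration under stated assumptions: each member sequence is represented by a hypothetical `Occurrence` record (chant identifier, starting note position, length in notes), every cluster is assumed to be non-empty with members of equal length, and the exception concerning the relative quality of matching in A and B is not implemented.

```python
from typing import NamedTuple


class Occurrence(NamedTuple):
    """One member sequence of a cluster (hypothetical representation)."""
    chant: str
    start: int    # index of the first note within the chant
    length: int   # number of notes


def contains(outer, inner):
    """True if `inner` lies wholly within `outer` in the same chant."""
    return (outer.chant == inner.chant
            and outer.start <= inner.start
            and inner.start + inner.length <= outer.start + outer.length)


def is_redundant(cluster_b, cluster_a):
    """True if every member of B is wholly contained within some member of A."""
    return all(any(contains(a, b) for a in cluster_a) for b in cluster_b)


def drop_contained(clusters):
    """Remove clusters whose members are all covered by a cluster of longer sequences."""
    kept = []
    for b in clusters:
        longer = [a for a in clusters if a is not b and a[0].length > b[0].length]
        if not any(is_redundant(b, a) for a in longer):
            kept.append(b)
    return kept


cluster_a = [Occurrence("chant1", 10, 10), Occurrence("chant2", 40, 10)]
cluster_b = [Occurrence("chant1", 11, 9), Occurrence("chant2", 40, 9)]    # inside A
cluster_c = [Occurrence("chant1", 100, 9), Occurrence("chant3", 5, 9)]    # unrelated
print(drop_contained([cluster_a, cluster_b, cluster_c]) == [cluster_a, cluster_c])
```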

After merging, there are still distinct clusters which contain recognisably related patterns. As an aid to understanding these, a second level of grouping is used when presenting the clusters to the user, without changing the clusters themselves. This grouping looks for pairs of clusters where a member of one overlaps a member of the other, and assigns a degree of closeness according to the extent of the overlap (or the best if there are several overlaps). Using this notion of closeness, grouping is implemented by an additional complete-linkage clustering operation.

The final result seen by the user is translated back to the neumes from which the note sequences were derived, rounding up to whole neumes if a sequence does not start and end at a neume boundary. It offers the researcher a set of suggested patterns which occur more than once in the set of chants searched, and so may be worth pursuing further.
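Stages 1 and 2 of the procedure can be sketched as follows. This is a deliberately naive illustration, not the optimised CLINK algorithm: the function names are hypothetical, the number of neumes in a window is assumed to be the number of neumes it touches (including partial ones), and the distance function is passed in as a parameter so that any of the sequence metrics of section 3.2 could be supplied.

```python
from itertools import combinations


def candidate_windows(neume_lengths, min_len=8, max_len=25, min_notes_per_neume=2.0):
    """Yield (start, length) note windows with enough notes per neume on average.

    `neume_lengths` gives the number of notes in each successive neume.
    """
    # Map every note position to the index of the neume containing it.
    neume_of = [i for i, n in enumerate(neume_lengths) for _ in range(n)]
    for length in range(min_len, max_len + 1):
        for start in range(len(neume_of) - length + 1):
            touched = neume_of[start + length - 1] - neume_of[start] + 1
            if length / touched >= min_notes_per_neume:
                yield start, length


def complete_linkage(items, distance, d_max=0.35):
    """Naive complete-linkage clustering, stopping once clusters exceed d_max."""
    clusters = [[i] for i in range(len(items))]

    def linkage(c1, c2):
        # Complete linkage: the distance between the two farthest members.
        return max(distance(items[i], items[j]) for i in c1 for j in c2)

    while len(clusters) > 1:
        (p, q), best = min(
            (((p, q), linkage(clusters[p], clusters[q]))
             for p, q in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1])
        if best > d_max:
            break
        clusters[p] = clusters[p] + clusters[q]   # p < q, so index p stays valid
        del clusters[q]
    # Sequences left on their own do not represent recurring patterns.
    return [[items[i] for i in c] for c in clusters if len(c) > 1]


# Toy usage with numbers standing in for note sequences.
print(complete_linkage([0, 1, 2, 10, 11, 50], lambda a, b: abs(a - b), d_max=3))
# [[0, 1, 2], [10, 11]]
```

In practice the clustering would be run separately for each sequence length, as described in step 2.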

5 Searching for Individual Patterns

Searching for patterns nominated by the researcher, as opposed to discovering them automatically as described in the previous section, has been an important part of CEAP functionality from the beginning.⁹ The search operation finds approximate matches which may differ from the exemplar by insertion and deletion, as well as by substitution of one symbol with another. Prior to the present work, this has made use of manually-constructed tables to determine how much the quality of match is affected by each insertion, deletion, or substitution. As our neume descriptions have become more complete, we have been able to improve CEAP’s pattern matching, but the tables have become more complicated, and the values they contain harder to justify. This section discusses replacing these tables with a cost function based on the metric defined in this paper.

Pattern matching is performed using the Needleman-Wunsch alignment algorithm,[9, 10] with boundary conditions which allow the exemplar to match a portion of a target chant, not necessarily the whole of it.¹⁰ When the user asks for matches restricted to single syllables, or to the beginnings or ends of text phrases, this is achieved by placing further constraints on the boundary conditions, together with a modification to the calculation of the score matrix to avoid crossing the relevant text divisions.

This algorithm finds the lowest-cost succession of edit steps needed to change one sequence of symbols into the other. Here a “symbol” is a note within a neume, with the shape descriptions and melodic interpretation we have been using throughout this paper. Previously the cost of substituting one symbol with another was obtained by looking up these characteristics in tables, but in the new implementation, the substitution cost is simply the metric (of section 3) applied to the feature vectors of the two notes involved. A substitution is therefore considered more costly if it makes a greater change to the note features.
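A minimal sketch of such an alignment is given below. It is not CEAP's implementation: the function names are hypothetical, the double weights of Sp and So and the text-division constraints are ignored, the traceback stage is omitted, and insertions and deletions are charged a fixed cost (`GAP_COST`, set here to 0.9, the value quoted in this section). Leaving the first row of the matrix at zero and taking the minimum of the last row is one common way to let the whole exemplar match any portion of the target.

```python
GAP_COST = 0.9  # fixed cost of an insertion or deletion


def tanimoto(x, y):
    """Unweighted Tanimoto distance between two feature sets."""
    union = len(x | y)
    return len(x ^ y) / union if union else 0.0


def best_match(exemplar, target, gap_cost=GAP_COST):
    """Return (cost, end): cheapest alignment of the whole exemplar to part of target.

    `end` is the number of target notes up to and including the end of the match.
    """
    m, n = len(exemplar), len(target)
    # cost[i][j]: cheapest alignment of exemplar[:i] with a stretch of target ending at j.
    cost = [[0.0] * (n + 1) for _ in range(m + 1)]
    # Row 0 stays at zero, so a match may begin anywhere in the target for free.
    for i in range(1, m + 1):
        cost[i][0] = i * gap_cost
        for j in range(1, n + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + tanimoto(exemplar[i - 1], target[j - 1]),  # substitution
                cost[i - 1][j] + gap_cost,   # exemplar note with no counterpart
                cost[i][j - 1] + gap_cost,   # extra note in the target
            )
    end = min(range(n + 1), key=lambda j: cost[m][j])
    return cost[m][end], end


# Hypothetical one-feature-pair notes, for illustration only.
target = [{"Sp"}, {"Sl", "Ssr"}, {"Sl", "Scr"}, {"Sa", "Sal"}, {"So"}]
exemplar = [{"Sl", "Scr"}, {"Sa", "Sal"}]
print(best_match(exemplar, target))   # (0.0, 4): exact match ending at the fourth note
```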
One additional cost parameter has to be chosen, and does not follow directly from the metric: this is the cost of an edit consisting of an insertion or deletion. This “gap cost” is a fixed value lying in the same range as the metric (0 to 1), and is currently set to 0.9.

The overall cost of the edit steps (which is being minimised by this algorithm) is analogous to the dissimilarity between sequences of notes discussed in section 3.2. While the mediant rule mentioned there might be an attractive option, because of its consistency with the construction of the metric on note features, it cannot be used in this algorithm. As with all dynamic programming algorithms, a fundamental requirement is that an optimal solution to a sub-problem remains optimal when used as part of the solution to a bigger problem. This is true when, as usual, the algorithm is formulated to minimise the sum of costs, but fails if the mediant is used instead.¹¹ Therefore, the summation form of the whole-sequence metric is optimised by the algorithm, and is used to derive a score for presentation to the user: it is divided by the length (of the aligned sequences), to create a value between 0 and 1, and subtracted from one to produce a score which is larger when the match is better.

Compared with our best previous version, using the feature-based metric as the substitution cost has two main advantages. Firstly, it is based on a theory, namely the choice of features, which can be explained in musicological terms, instead of relying on large tables of numbers which are hard to interpret. Secondly, it appears to work well for matching sequences of notes, whereas our previous version had to work on whole neumes to achieve acceptable performance. The set of features we have defined here does include some which mark notes at the start and end of neumes, giving a preference for matches which have the same neume boundaries, but matches can also be found with different division into neumes, say in different manuscripts, if the other properties agree sufficiently closely.

⁹ CEAP’s pattern searching grew out of related methods used in earlier work by Hornby, Ripley and Caldwell.

¹⁰ Boundary conditions refer both to the way the first row and column of the score matrix are initialised, and to the choice of starting point for the traceback stage. The general principles are discussed in section 2.3 of [10].

¹¹ Here is a counterexample using numbers of realistic sizes: if at an intermediate stage two possible alignments give metrics 113/355 and 106/333, the former being slightly smaller (so better), and if the next elements have metric 4/13, then using the mediant would give combined metrics (113+4)/(355+13) and (106+4)/(333+13) respectively, which are in the reverse order: the second one is the smaller.
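The counterexample in footnote 11 can be checked with exact rational arithmetic. The short sketch below uses only the numbers given there; the helper name `mediant` is hypothetical.

```python
from fractions import Fraction


def mediant(*pairs):
    """Combine (numerator, denominator) pairs by summing each part separately."""
    return Fraction(sum(n for n, _ in pairs), sum(d for _, d in pairs))


first, second, step = (113, 355), (106, 333), (4, 13)

# At the intermediate stage the first alignment is (slightly) better.
print(Fraction(*first) < Fraction(*second))             # True

# Extending both by the same step reverses the order under the mediant...
print(mediant(first, step) > mediant(second, step))     # True

# ...but not under summation, which is why the sum is safe for dynamic programming.
print(Fraction(*first) + Fraction(*step) < Fraction(*second) + Fraction(*step))  # True
```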

6 Conclusion

This paper has shown that recasting the neume descriptions we use in CEAP in terms of per-note sets of features provides a useful basis for defining a metric for comparing patterns of unpitched musical notation. This has been used for a new pattern-discovery operation in CEAP, as well as becoming the default comparison method in its existing pattern-search function. The discussion has focused on Old Hispanic notation, and the particular set of note features described here has been tuned for this case. However, the same principles should be applicable to other unpitched notations, provided the neumes can be subdivided into identifiable pieces corresponding to notes or other musical units. In the future, we plan to add additional features representing characteristics of the text on which each note appears, such as whether the syllable is accented, or is part of a word of particular significance. These extra features may contribute to the metric with a weight other than one, and the weight might be adjustable by the user depending on the focus of their work. The formalism presented here does not constitute a musicological theory per se, although it has certainly been informed by both musicology and practical experience of writing pattern-matching code in CEAP. Primarily it is a mathematical construction, crafted to have the properties required for the computer algorithms we wish to use, and the criterion for its validity is the extent to which the results it produces are seen as relevant by the musicologists working with these chants. However, the fact that it does appear to be successful in this sense may, in turn, provide further insights into what is important about the musical notation.

References

[1] Emma Hornby, Rebecca Maloy and Raquel Rojo Carrillo. “Old Hispanic musical notation: its characteristics and an analytical methodology”. In: Emma Hornby et al. Liturgical and Musical Culture in Early Medieval Iberia: Decoding a Lost Tradition. Cambridge University Press, To appear. Chap. 6.

[2] Emma Hornby, Rebecca Maloy and Paul Rouse. “Chant Editing and Analysis Program: A Tool for Analyzing Liturgical Chant”. In: Journal of Medieval Iberian Studies (To appear).

[3] Music Encoding Initiative. Guidelines. Repertoire: Neume Notation. Version 4.0.1. 12th Apr. 2019. URL: https://music-encoding.org/guidelines/v4/content/neumes.html (visited on 10/05/2021).

[4] Elsa De Luca et al. “Capturing Early Notations in MEI”. In: Musiktheorie: Zeitschrift für Musikwissenschaft 34.3 (2019), pp. 229–249.

[5] Alan H. Lipkus. “A proof of the triangle inequality for the Tanimoto distance”. In: Journal of Mathematical Chemistry 26.1 (1999), pp. 263–265. ISSN: 1572-8897. DOI: 10.1023/A:1019154432472. URL: https://doi.org/10.1023/A:1019154432472.

[6] Peter Willett, John M. Barnard and Geoffrey M. Downs. “Chemical Similarity Searching”. In: Journal of Chemical Information and Computer Sciences 38.6 (1998), pp. 983–996. DOI: 10.1021/ci9800211. URL: https://doi.org/10.1021/ci9800211.

[7] Michel Marie Deza and Elena Deza. “Metric Transforms”. In: Encyclopedia of Distances. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013, pp. 79–87. ISBN: 978-3-642-30958-8. DOI: 10.1007/978-3-642-30958-8_4. URL: https://doi.org/10.1007/978-3-642-30958-8_4.

[8] Daniel Defays. “An efficient algorithm for a complete link method”. In: The Computer Journal 20.4 (1977), pp. 364–366.

[9] S. B. Needleman and C. D. Wunsch. “A general method applicable to the search for similarities in the amino acid sequence of two proteins”. In: Journal of Molecular Biology 48.3 (Mar. 1970), pp. 443–453. ISSN: 0022-2836. DOI: 10.1016/0022-2836(70)90057-4. URL: http://www.ncbi.nlm.nih.gov/pubmed/5420325.

[10] Richard Durbin et al. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1998. DOI: 10.1017/CBO9780511790492.

A Appendix

The available features are defined here. Each note of a neume is described by the set of all features which apply. The final column shows the CEAP readings for which each feature is included.

Table A.1 Features which can be used in the encoding

| Feature | Symbol | Notes | Included when CEAP reading is |
|---|---|---|---|
| Pitch relative to previous note (no pitch feature is used when the relationship is unknown, i.e. N) | | | |
| Higher | Ph | | `H` |
| Probably higher | Pph | | `U`, `H` |
| Not lower | Pnl | For U, this is slightly weaker: “probably not lower”. | `S`, `U` |
| Not higher | Pnh | For D, this is slightly weaker: “probably not higher”. | `S`, `D` |
| Probably lower | Ppl | | `D`, `L` |
| Lower | Pl | | `L` |
| Stroke shape | | | |
| Long | Sl | A long stroke, either straight or curved. | `\`, `/`, `(`, `)` (Sl may be omitted when a curved shape is short) |
| Straight leftward | Ssl | Straight stroke inclined towards the top left | `\` |
| Straight rightward | Ssr | Straight stroke inclined towards the top right | `/` |
| Curved leftward | Scl | Curved stroke convex as seen from the left | `(` |
| Curved rightward | Scr | Curved stroke convex as seen from the right | `)` |
| Angled leftward | Sal | Angled stroke with corner towards the left | `<` |
| Angled rightward | Sar | Angled stroke with corner towards the right | `>` |
| Angled | Sa | An angled stroke (either of the previous two) | `<`, `>` |
| Short | Sp | Short horizontal stroke, such as a punctum (used with weight 2 in the metric of sec. 3) | `*` |
| Wavy | So | Short wavy stroke (possibly oriscus) (used with weight 2 in the metric of sec. 3) | `~` |
| The way the previous stroke is connected to the current stroke | | | |
| Gapped | Cg | There is a gap between the previous stroke and this stroke | `g` (always used for the first note of a neume) |
| Joined | Cj | There is no gap between the previous stroke and this stroke | `a`, `c`, `w` |
| Discontinuous | Cd | A sharp change of direction, or lift of the pen, between strokes | `g`, `a` |
| Continuous | Cs | Continuous (smooth) transition between strokes | `c`, `w` |
| Curve clockwise | Cc | Clockwise curve between strokes | `c` in the absence of ◦ between strokes |
| Loop clockwise | Ccl | Clockwise curve between strokes forming a loop | `c` when ◦ appears between strokes |
| Curve anti-clockwise | Cw | Anti-clockwise (“widdershins”) curve between strokes | `w` in the absence of ◦ between strokes |
| Loop anti-clockwise | Cwl | Anti-clockwise curve between strokes forming a loop | `w` when ◦ appears between strokes |
| Context within the neume | | | |
| Start neume | Ns | Used on the first note of the neume | |
| Start syllable | Nsyl | First note on a text syllable (used in addition to Ns) | |
| End neume | Ne | Last note of a neume | |
| Hooked | Nh | A hook, which appears not to be a separate note, is appended to the neume. (Nh is used in addition to Ne on the final note.) | `’` |
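As an illustration of how Table A.1 might be applied mechanically, the following Python sketch turns a CEAP reading into one feature set per note. It is not the CEAP implementation. The function name `encode_neume`, the lookup tables, and the reading layout (three colon-separated fields for pitch, shape and connection, with the connection field starting at the second note because the first note is always gapped) are assumptions inferred from the readings quoted in Table A.2. Loops (◦) and hooks (’) are left out to keep the sketch short.

```python
# Lookup tables transcribed from Table A.1 (loop and hook features omitted).
PITCH = {
    "N": set(),            # relationship unknown: no pitch feature
    "H": {"Ph", "Pph"},
    "U": {"Pph", "Pnl"},
    "S": {"Pnl", "Pnh"},
    "D": {"Pnh", "Ppl"},
    "L": {"Ppl", "Pl"},
}
SHAPE = {
    "\\": {"Sl", "Ssl"},
    "/": {"Sl", "Ssr"},
    "(": {"Sl", "Scl"},
    ")": {"Sl", "Scr"},
    "<": {"Sa", "Sal"},
    ">": {"Sa", "Sar"},
    "*": {"Sp"},
    "~": {"So"},
}
CONNECTION = {
    "g": {"Cg", "Cd"},
    "a": {"Cj", "Cd"},
    "c": {"Cj", "Cs", "Cc"},
    "w": {"Cj", "Cs", "Cw"},
}


def encode_neume(reading, first_on_syllable=False):
    """Return a list of feature sets, one per note (hypothetical helper)."""
    pitches, shapes, connections = reading.split(":")
    # Assumption: the first note carries no connection letter and is
    # treated as gapped, as Table A.1 states for Cg.
    connections = "g" + connections
    if not (len(pitches) == len(shapes) == len(connections)):
        raise ValueError("fields of the reading have inconsistent lengths")
    notes = []
    last = len(pitches) - 1
    for i, (p, s, c) in enumerate(zip(pitches, shapes, connections)):
        features = PITCH[p] | SHAPE[s] | CONNECTION[c]  # fresh set each time
        if i == 0:
            features.add("Ns")
            if first_on_syllable:
                features.add("Nsyl")
        if i == last:
            features.add("Ne")
        notes.append(features)
    return notes


# Neume (a) of Fig. 1, using the reading quoted in Table A.2.
for note in encode_neume("NUHLH:*())/:gwca"):
    print(sorted(note))
```

Under these assumptions the sketch gives the same feature sets as Table A.2 for both neumes of Fig. 1.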

Using the features shown above, the full encodings of the two neumes in Fig. 1 of the main text are given in the following table. The encoding of the first note of each assumes that these neumes are not the first on their respective syllables; if they were, the first note would also have the feature Nsyl. The final column shows the dissimilarity measure between the corresponding notes, calculated by the method of section 3 (in the case of the first note, recall that Sp is counted twice, weighting it by a factor of two).

Table A.2 Encodings of the two neumes shown in Fig. 1

| Note | Neume (a) `NUHLH:*())/:gwca` | Neume (b) `NUHLH:*\()/:gaca` | Dissimilarity D(x,y) (see section 3) |
|---|---|---|---|
| First note | {Ns,Cg,Cd,Sp} | {Ns,Cg,Cd,Sp} | 0/5 |
| Second note | {Pph,Pnl,Cg,Cd,Sl,Scl} | {Pph,Pnl,Cg,Cd,Sl,Ssl} | 2/7 |
| Third note | {Ph,Pph,Cj,Cs,Cw,Sl,Scr} | {Ph,Pph,Cj,Cd,Sl,Scl} | 5/9 |
| Fourth note | {Pl,Ppl,Cj,Cs,Cc,Sl,Scr} | {Pl,Ppl,Cj,Cs,Cc,Sl,Scr} | 0/7 |
| Fifth note | {Ne,Ph,Pph,Cj,Cd,Sl,Ssr} | {Ne,Ph,Pph,Cj,Cd,Sl,Ssr} | 0/7 |
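The dissimilarity values in Table A.2 can be checked with a short Python sketch. It assumes that the measure of section 3 is a weighted Tanimoto (Jaccard) distance: the total weight of the features found in exactly one of the two sets, divided by the total weight of their union, with Sp and So weighted 2 and every other feature weighted 1. That reading is an inference from Tables A.1 and A.2 rather than a restatement of section 3, and the names `WEIGHTS`, `weight` and `dissimilarity` are illustrative only.

```python
from fractions import Fraction

# Assumed weights: Table A.1 marks Sp and So as "used with weight 2".
WEIGHTS = {"Sp": 2, "So": 2}


def weight(features):
    """Total weight of a feature set; unlisted features weigh 1."""
    return sum(WEIGHTS.get(f, 1) for f in features)


def dissimilarity(x, y):
    """Weighted Tanimoto distance between two feature sets (assumed form)."""
    union = weight(x | y)
    if union == 0:
        return Fraction(0)  # two empty sets are treated as identical
    # x ^ y is the symmetric difference: features in exactly one set.
    return Fraction(weight(x ^ y), union)


# Third note of neumes (a) and (b), copied from Table A.2.
third_a = {"Ph", "Pph", "Cj", "Cs", "Cw", "Sl", "Scr"}
third_b = {"Ph", "Pph", "Cj", "Cd", "Sl", "Scl"}
print(dissimilarity(third_a, third_b))  # 5/9
```

Exact fractions are used so that the result can be compared directly with the table. For the first note the union {Ns,Cg,Cd,Sp} has weight 5 because Sp counts twice, which matches the denominator of the 0/5 entry.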
