Towards a General Model of Interlinear Text

Cathy Bow, Baden Hughes and Steven Bird Department of Computer Science and Software Engineering University of Melbourne, Victoria 3010, Australia {cbow, badenh, sb}@cs.mu.oz.au

Abstract The use of interlinear text has long been a valuable tool in linguistic description, and the development of a number of different software tools has facilitated the creation and processing of such texts. In this paper we survey a range of interlinear texts, focusing on issues such as grouping and alignment. Abstracting away from the presentation, we look specifically at the structure of the data, in an attempt to create a general-purpose data model for interlinear text. Our findings are that a four-level model, incorporating Text, Phrase, Word and Morpheme levels, is sufficient to represent a very wide range of practice. We present an XML format for representing data in this model, and describe stylesheets for converting such data into presentational formats. Because of its generality, and the way it abstracts away from presentation, we believe the model is a suitable basis for developing archival storage formats for interlinear text and delivering interlinear text to end-users and external software tools in a web environment.

1 Introduction Interlinear texts serve a variety of purposes in linguistic research. In many cases they are used to demonstrate various linguistic principles in a certain language, such as a traditional narrative included in a descriptive grammar. In textbooks or other reference material, a single line of text may be given an interlinear treatment to highlight a specific issue under discussion. Some references may include interlinear texts to demonstrate higher level discourse features, while others may wish to consider a fine-grained analysis of morphology or even phonology. Religious texts are often given interlinear treatments, and may include a source language, a transliteration, a vernacular and any number of comments or layers of analysis. While we can observe many tendencies concerning the display of interlinear texts (for example, the gloss is generally given below the text), there are no explicit standards or universals constraining either the form or content of such material.

The goal of this paper is to build a general-purpose model for interlinear text. An initial survey of the structure and content of a range of texts allows us to consider what information is represented and how it needs to be stored. Any general model for interlinear text needs to be sufficiently granular to meet all the requirements of the linguist. Our survey will show how varied these requirements can be, whether referring to a field linguist collecting initial data or an end user analysing it at a later point in time. The model also needs to be typologically inclusive, being flexible enough to cover the wide range of typological features across languages. Finally, the model needs to be usable by software developers (whether as an interchange format or an internal data model) as well as by field workers and analysts.

The importance of developing appropriate methods of archiving and the flexibility of formats in order to make data available to others are key components of the E-MELD proposal (EMELD, 2001). Since one of the goals of the current workshop is to establish best-practice methods, the ratification of such a model could serve as the basis for archival storage of interlinear text sources in the E-MELD showroom.

The paper begins with a survey of interlinear text representations (Section 2). Issues of content and presentation raised by this survey are discussed in Section 3, followed by our proposal of a general-purpose four-level model (Section 4). We present an XML format for the model and show how the surveyed interlinear texts can be represented in the format. In Section 5 we show how the structural markup can be rendered into presentational markup and displayed in familiar forms.

2 Survey of interlinear text A variety of sources were surveyed to collect diverse types of interlinear text, with the goal of considering a representative sample rather than developing a comprehensive inventory. Some texts were supplied by colleagues, some taken from the internet and some from printed grammars, and full texts were preferred over isolated examples. The majority were text-based, with some linked to audio files, and an attempt was made to cover the major geographic and typological areas. The broad range of samples surveyed attempts to adequately reflect the types of information contained in interlinear texts, as well as to examine how such information is structured.

This survey begins with the most common form of interlinear text in linguistic description with three lines of content: a transcribed text aligned in some way to a gloss, and some form of free translation. The survey then progresses to more complex examples, discussing various issues of both display and organisation, such as alignment, mapping and wrapping. Issues raised on these topics in the survey will then be discussed prior to the proposal of a model of interlinear texts.

In order to clearly define the terms of reference, we will use the word "text" to refer to a piece of interlinearised material as well as information in a given language (e.g. a line of text followed by a line of translation). A "line" of interlinear text will refer to a row of text plus all the rows of analysis pertaining to it. Within the line, there is a horizontal arrangement of words and their analysis, plus the vertical alignment of row elements to indicate the structure of the interlinear text. There is also the vertical arrangement of one row above the other, and this arrangement is consistent from one block of rows to the next. The term "free translation" is commonly applied to any translation in which more emphasis is given to the overall meaning of the text than to the exact wording. Since this term may suggest that liberty is taken in translating the text, we will occasionally substitute the term "phrasal translation."


Figure 1: Nepali

A typical example of a basic three-line interlinear text comes from Nepali (Genetti, 1994:166). In this sample, one line of transcription data (in this case phonetic rather than orthographic) is aligned word-by-word with a gloss, and an unaligned free translation is given on the third line. The phrases are numbered, and there is one line of metadata giving a title (the footnote missing from the scanned image notes who recorded, transcribed and glossed the text). The words are broken into morphemes using hyphens, the gloss is aligned to the whole word rather than to individual morphemes, and complex glosses are separated by full stops (e.g. line 3, nisk-yo = come.out-3smL.PST). There are examples of portmanteau morphemes, for example in line 3 where the morpheme -yo is glossed as "-3smL.PST", which is explained elsewhere in the text as indicating a third person singular masculine low-grade honorific past tense marker.

Figure 2: Ainu

The Ainu example (Shibatani, 1990:85) is similar to the Nepali in that each word of the text is aligned to a gloss, the phrases are numbered, and there is one line of metadata giving the source, though the display is different in that a free translation of the entire text is given below the interlinear text. On closer analysis, however, the free translation is itself divided into numbered phrases with a one-to-one correspondence to the interlinearised phrases. We contend that structurally this resembles the Nepali, though its visual presentation is different. The free translation does not always read fluently, which may be an attempt to reproduce the phrase structure of the source text in the English translation. Like the Nepali example, the words are broken into morphemes by hyphens, though multi-word English glosses are not linked, e.g. cirikinka = "rise high", enkasike = "over there". This can lead to some ambiguity, as in the relationship in line 6 between oka nankor and the gloss "be perhaps". There appear to be some gaps in glossing, where polymorphemic units are given a monomorphemic gloss (e.g. asso-kotor = "wall", ran-pes = "cliff"). It is unclear whether such gaps reflect uncertainty or oversight, but they are common in field notes and need to be representable in an interlinear text editor.

Figure 3: Pitjantjatjara text

A Pitjantjatjara story (Bowe, 1986) marked up using Microsoft Word appears to have the same three-line format, with some metadata (title) relating to the entire text. The words are broken into morphemes using hyphens, and the glosses are aligned to the word (rather than to each individual morpheme). In the glosses, a punctuation distinction is made between complex glosses using full stops (e.g. kulila = "listen.IMP") and morpheme boundaries using hyphens (tjukurpa-na = "story-I"). A free translation is given for each phrase, vertically aligned to the beginning of the phrase, although many phrases wrap over several lines.

Figure 4: Nivkh

Comrie's book on the languages of the Soviet Union (Comrie, 1981:276-277) gives examples of texts from the languages described in each chapter, as in the Nivkh text shown in Figure 4. On the page, the interlinear text has a line of text (given in a phonetic rather than orthographic form) and a line of glossing, then some metadata on the source of the text, a section of Notes, followed by a free translation. The alignment between word and gloss is different to those seen previously, as each word is broken into morphemes by both white space and hyphens, with vertical alignment to the gloss line for each morpheme. Compound glosses are joined by hyphens (e.g. "younger-brother", "go-up"). The words can be reconstructed by visual concatenation, but the text itself is not very 'readable', partly due to the page layout, with little separation between lines of text. This example also includes indentation for paragraph breaks, and other examples have punctuation for speech. The 'Notes' section highlights any interesting elements of the text, and a note may link to a morpheme, word, phrase, or possibly the entire text; however, no distinction is made between comments at any of these levels.

Figure 5: Tundra Nenets (Susoi)


Figure 6: Tundra Nenets (Paakkan)

A passage from a highly agglutinative language, Tundra Nenets, is given two different interlinear treatments. The Susoi version (Susoi, 1990) has three levels, with words aligned to glosses, and a separate free translation of the entire text, plus some metadata. However, while the glosses are broken into morphemes using full stops, the words themselves are not, making it impossible to identify which portion of the word corresponds to each morphemic gloss. The second treatment of basically the same text, from Paakan (Paakan, 1997), gives a five-level analysis, with the word level (\TEXT) broken down into morphemes (\UNIT) which are then aligned to inflectional coding (\MNNG). A 'base' form of the word is then given (\BASE), and a free translation (\MITA) which is aligned to the phrase level (\ref). Interestingly, this text lacks any actual gloss; however, the combination of the Paakan and Susoi treatments would give sufficient information for a thorough analysis of this text.


Figure 7: Diyari


Figure 8: Yidinj

The Diyari (Austin, 1981:252) and Yidinj (Dixon, 1977:527) texts are laid out in basically the same way on the page (and both are published by the same publisher). Each has three lines: text, gloss and free translation. There is metadata and a précis about the text, the phrases are numbered, intonation groups are marked with a slash (/), and there are some comments inserted between some phrases. In both cases, comments can apply to a specific word or morpheme (e.g. Yidinj line 98), or provide a syntactic note (e.g. Diyari line 7), an anthropological observation (e.g. Diyari line 5 explaining a custom referred to in the text), or a discourse note. Comments are distinguished from interlinearised text on the page by lack of indentation. Despite these visual similarities, closer examination reveals structural differences between these two samples. The Diyari text breaks the words into morphemes separated by hyphens, which the Yidinj does not. It therefore seems that the transcription of the text in Yidinj is at the level of the word, while in Diyari it is at the level of the morpheme. Once we abstract away from the presentation on the page, the visual similarity is not matched by structural similarity.


Figure 9: South Efate

Extra information can be included in some interlinear texts without adding much complexity. This sample from South Efate (Namaf, 2001) analysed in Shoebox (SIL, 2002) has 12 levels, including four of metadata relating to the full text (reference \_sh, title \itm, information about the text \nt and a reference to the audio/video file \aud). Each phrase also includes reference to the audio start and end (\as, \ae), then two levels of text: one at word level (\tx may correspond to an orthographic representation) which is then broken down into morphemes separated by hyphens (\mr), which is then vertically aligned to a gloss (\mg) and each morpheme is labelled with a part of speech (\POS). Free translation is given for each phrasal unit in two languages, English (\fg) and Bislama (\fgb), and this translation may correspond to a sentence or sometimes a series of phrases. It is not clear whether the separation into these units is linguistically significant.
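The backslash-coded layout described above can be read with very little code. The following is a minimal sketch, not a Shoebox implementation: the function name `parse_fields` is our own, the sample phrase is a hypothetical placeholder rather than South Efate data, and we assume one field per line with the marker separated from its value by a space.

```python
def parse_fields(block):
    """Split backslash-coded lines into (marker, value) pairs."""
    fields = []
    for line in block.splitlines():
        line = line.strip()
        if not line.startswith("\\"):
            continue  # this sketch ignores blank and continuation lines
        marker, _, value = line[1:].partition(" ")
        fields.append((marker, value.strip()))
    return fields


# Hypothetical phrase: the forms and glosses are placeholders only.
SAMPLE = r"""
\as 0.00
\ae 2.50
\tx wordA wordB
\mr a-b c
\mg G1-G2 G3
\POS n-suf v
\fg placeholder English translation
\fgb placeholder Bislama translation
"""

# The tiers that the description above treats as vertically aligned.
ALIGNED = ("tx", "mr", "mg", "POS")

fields = dict(parse_fields(SAMPLE))
aligned = {marker: fields[marker].split() for marker in ALIGNED}

# Every aligned tier should supply the same number of cells.
cell_counts = {len(cells) for cells in aligned.values()}
print(cell_counts)
```

Splitting only the aligned tiers on white space, and leaving \fg and \fgb as whole strings, mirrors the distinction between rows that are aligned cell by cell and translations that belong to the phrase as a whole.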

Garrwa text
Source: \dn14.003.01
Speakers: Bindie West (Moreland) (speaker code: B)
          Stumpy George (speaker code: S)

In the transcript, Flint actually uses three code letters: B, S and G. G may stand for "Stumpy George". The sound recording of this material is not available. The conversation recorded on the audio-tape (\cm R297C) presumably corresponds to the text transcribed in \dn 14.003.01. However, the informants' speech was very rapid and somewhat unclear, and no clear connections between the recording and the transcript could be made. This text material is unanalysed, so the analysis, gloss and "other sources" fields are not included.

\sp G
\ft nanyi??/nganyi?? ngaranjan jilajbaya ngabayan yanba cont .
\fg my mother go white man talk cont .
\ft ngangangi
\fg want
\ncr
\ncft It is difficult to determine word order here, and which sentences words belong to.
\ncfg "white man" refers to a "Mr Haely??", whose name is written beneath ngabayan.
\fft
\ncfft

\sp G??
\ft kudbinji jilabaya kingkarina (namukiya yan ba lajba) cont .
\fg happy go down paint paint?? cont .
\ft yalungka
\fg all fellows down at [camp]
\ncr
\ncft
\ncfg
\fft
\ncfft

Figure 10: Garrwa

Another example of a text which incorporates supplementary information comes from Garrwa (Laughren, Keith and Hughes, 2002). In this sample, a dialogue between two speakers, the original notes taken by a researcher in the 1960s are supplemented by levels for notes by later analysts who approach the data from different perspectives. This allows for extra complexity, permitting multiple commentaries at different levels. While most of the notes apply to a specific level of the data (i.e. the entire text, a particular phrase or word, even phonological analysis), there may be cases of notes which correspond to a number of different points (e.g. a phonological feature may appear throughout a text, however reference to it may be linked to a certain instance). Annotations at multiple levels and by multiple analysts raise issues of consistency in the markup, which is of particular interest to the EMELD project.


Figure 11: Latin

Having seen that there appears to be no upper limit on the number of lines an interlinear text may have, the question is then raised as to whether or not a lower limit exists. As noted above, the most common basic form of interlinear text has three lines; however, examples with two lines do exist. The Latin example (Valiulis & Wasson, 1998:12) in Figure 11 presents as a two-level interlinear translation with one line of Latin text beneath a line of English translation (an unusual arrangement given the tendency to show the gloss beneath the text). There is also a line of metadata at the bottom giving the source of the text. Conceptually, the English line sits on the boundary between gloss and free translation, as many individual words are broadly glossed, and the glosses can be read as a form of free translation. There is no attempt to break words into morphemes, or analyse the data in any way. The issue of alignment is interesting in this sample, as it appears that the gloss is 'centred' above the corresponding word/s, but the correspondences are not always clear. There are some many-to-one correspondences, where one Latin word corresponds to more than one English word (e.g. hominum = "of men", radicitus = "by its roots"), and also cases where a group of words in one language is translated by a group in the other, with no indication of one-to-one correspondences (e.g. cum enim = "for while", dei dicat esse = "he says is god's", id quod = "the very thing which is"). There are also some items with no corresponding glosses (e.g. est in line 4). Such correspondences can be difficult to interpret clearly, though the purpose of this sample is to demonstrate the interlinear function of a widely used database system, which has not been specifically designed for linguistic analysis. It is interesting to note that having fewer lines in the interlinear text does not make it simpler, but raises many more issues.


Figure 12: Ngiti

The concept of 'text' can vary considerably, as in the previous example which was part of a narrative of several minutes, compared to the following example from Ngiti (Lojenga, 1994:424) which gives a collection of proverbs. The interlinearisation includes two levels of transcription, with morphophonemic alternations indicated in the second line (i.e. narongo is separated into n --ar -ongo ), which is then aligned to the gloss line. A free translation is given to the text, then an extra 'translation' is added, giving the proverbial explanation of the literal meaning.

Figure 13: Hebrew

So far all the surveyed texts have used phonetic transcription as the basis for analysis. While non-Roman scripts should no longer be a problem for linguistic data with the adoption of the Unicode Standard (The Unicode Consortium, 2000), scripts which do not read left-to-right create a new level of complexity in interlinear text, as demonstrated in this example from the first chapter of Genesis in Hebrew (Mullins, 2001). In this example, the directionality of the text is juxtaposed with the directionality of the transliteration. The orthographic text reads correctly from right to left, while the corresponding transliteration reads from left to right. The glosses read from left to right at the word level but the opposite way at the phrase level. The transliteration and gloss are aligned to the left edge of the word, and the text is aligned to the left of the page, despite reading from the right. Each word is given a gloss, though there are one-to-many correspondences (e.g. line 5 vayavdel = "and began to cause a division"). There is no attempt at any morphemic gloss (which could offer insight into interlinearisation of templatic morphology), though in some cases the gloss line includes supplementary information (e.g. line 1: 'Elohim = "God (plural of excellence)"). This sample is unusual in that there is no free translation of the text, though the meaning can be retrieved through reading the gloss line, an awkward process due to the alignment with the left-to-right text. An additional line of information contains the reference number of each word in Strong's Concordance (Strong, 1988). The numbers in the left margin ostensibly refer to the canonical biblical versification; however, these do not in fact correspond exactly (e.g. the break between verses 1 and 2 actually occurs between the words ha'arets and veha'arets in the first line), the label for verse 3 appears to be in the wrong place, and there is some confusion caused by the repetition of the number 5. This text raises many interesting and complex issues for interlinear text processing.

Figure 14: Indonesian

A sample of Indonesian (p.c. Musgrave, 2003) gives a different perspective on interlinear text. The SPINOZA Typological Database (Musgrave, 2002) investigates situations of language contact between unrelated languages. The interlinear text is not typical, as it gives syntactic analysis of data entries and explores different source languages. In the screen shot shown, the full text consists of a single line of data, represented here in orthographic and phonemic form (with a slot for transliterated form also), and a free translation (labelled 'Idiomatic'), all aligned to the left edge of the phrase. The morphological gloss is aligned to a morphological breakdown of the text, and on a separate screen the gloss line is also aligned to a Word Class identification for each morpheme. Bracketing is used to identify syntactic structures, and each word can be identified by its source language (e.g. the "Sasak" label aligned to the final word of the text). This text raises issues relating to the ability of interlinear text processors to handle syntactic material.

Figure 15: Ega

Interlinear text can also be used for phonetic analysis, as in the sample from Ega, one of the ten endangered languages of the EMELD showroom of best practice. This sample is generated from the Praat program (Boersma & Weenink, 2003) and is aligned to a sound file, which can be segmented at any user-defined level. In this sample the file is segmented into words, with four annotation tiers: two phonetic transcriptions (using SAMPA), one tonetic transcription, and a gloss of the entire text in French. Besides access to useful information on the acoustic qualities of the sound file, the alignment between the sound file and the transcription allows the user to listen to each section, whether word (by selecting one interval of the first three tiers) or phrase (by selecting the gloss tier). This format also allows further annotations, which could include word-by-word gloss, part-of-speech labels, etc. This sample gives a different perspective on how interlinear text can be used for analysis.

Figure 16: Pitjantjatjara Scripture

Most of the texts surveyed so far have some sense of completeness, in that all information is somehow coded into the output. However, the practice of field work more commonly results in an incomplete analysis of data, which must still be incorporated. An example combining full and partial analysis is shown in the interlinearised biblical extract from Pitjantjatjara (p.c., Bowe, 2003). The printed version resembles the standard three-line texts already described (words, glosses, free translation of phrase); however, additional information is hand-written on the page, such as use of square brackets to link certain constituents, part-of-speech annotations on some elements, evidence of corrections and revisions, and some queries noted in the margin. Such a text reflects the practice of a field worker (who annotates the text in various ad hoc ways) more accurately than many of the published texts previously described (which formalise the material for presentation). These annotations raise questions of how much additional information needs to be incorporated in interlinear texts, as well as how uncertain information should be incorporated.

Figure 17: Denya


Figure 18: Yele

More complex examples of interlinear text may include higher level analysis, such as discourse features. The Denya example (Abangma, 1997:112-113) has two lines of text, with words separated into morphemes by hyphens, aligned with the gloss. A free translation is given separately on the page; however, it is numbered according to each phrase of the text, and therefore conceptually belongs to each phrase, even though visually it appears to correspond to the entire text. The numbered phrases themselves are further divided into sub-phrases indicated by letters, and a further dimension is added through the charting of the text on the page such that the verb phrase appears in a second column. This relates to the purpose of this particular text, which is part of a monograph examining verb functions in this language. Similarly, the Yele example (Henderson, 1995:92-93) is laid out in clauses, with notes on the semantic relationships between the clauses. While such features may be useful for some interlinear texts, the question remains as to whether or not interlinear text processors should include such functionality.

3 Discussion This survey has raised several issues relating to both the form and content of interlinear texts. The issues of alignment, wrapping, display and mapping are all inter-related, and some correspond specifically to either form or content, while some fit both. In this section we will consider these issues in turn, beginning with content, then considering the broad category of page layout, then layout issues within the text, such as typographic issues, alignment, mapping and wrapping.

The first consideration relates to what information is included in an interlinear text. The most basic examples give three sets of information: a row of text, a row of gloss and a phrasal translation. Beyond this, it is common to include metadata relating to the text, minimally a title. Some examples (e.g. Nivkh, Diyari, Yidinj) include notes or comments relating to some aspect of the text, some include part-of-speech information (e.g. South Efate, Indonesian), and some prosodic information (e.g. intonation boundaries in Diyari and Yidinj). Multiple speakers and analysts (e.g. Garrwa), multiple (e.g. South Efate, Ngiti) and even different source languages (e.g. Indonesian) can be incorporated. Specific aspects of the text can be analysed in detail, such as phonetic (Ega), syntactic (Indonesian), or discourse information (Denya, Yele). For a tool to be useful to a field linguist, it must allow for the partial documentation of a text as well as for the gradual, systematic process of filling in gaps during the collection and analysis of data (as in the Pitjantjatjara Scripture example). The system must also be flexible enough to allow for marking of missing or uncertain information. While no one text requires all these components, they should at least be considered in the design of an interlinear text editor.
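One way to make room for partial documentation is to let every gloss be optional and to carry an explicit flag for uncertainty. The sketch below is only an illustration of that requirement; the class and field names (`Morpheme`, `Word`, `uncertain`, `gaps`) are our own assumptions and are not a proposed format. The example data is the Ainu word ran-pes from the survey, which has a word-level gloss but no morpheme glosses.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Morpheme:
    form: str
    gloss: Optional[str] = None  # None means "not yet glossed"
    uncertain: bool = False      # True marks an analyst's query


@dataclass
class Word:
    form: str
    gloss: Optional[str] = None
    morphemes: List[Morpheme] = field(default_factory=list)


def gaps(words):
    """Return the forms of all morphemes that still lack a gloss."""
    return [m.form for w in words for m in w.morphemes if m.gloss is None]


# Ainu ran-pes: glossed as a whole word, with both morphemes unglossed.
ran_pes = Word("ran-pes", gloss="cliff",
               morphemes=[Morpheme("ran"), Morpheme("pes")])

print(gaps([ran_pes]))  # ['ran', 'pes']
```

Keeping "missing" (`gloss is None`) separate from "uncertain" lets an editor list what remains to be filled in without discarding tentative analyses.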

With regard to the way interlinear texts are displayed on the page, the samples surveyed demonstrate how flexible such presentation can be, yet this flexibility can also lead to ambiguity in interpretation. The most consistent feature is that the line of text to be analysed is placed above the interlinear gloss, and this appears to be standard practice in interlinearising texts (with the exception of the Latin example). There appears to be considerable freedom in where to place the phrasal translation, either line-by-line within the text or separately as a block. Numbering the lines within the text may assist in cross-referencing to an external reference (e.g. discussion in a grammar), or may refer to canonical versification (e.g. Hebrew). Metadata or other information about the text can be placed anywhere on the page (e.g. right-aligned beneath text in Ainu, left-aligned above text in Hebrew). More complex examples such as Denya and Yele use the visual layout of the text to add another dimension to the analysis. The flexibility demonstrated in the surveyed texts requires that an interlinear text editor be able to generate a range of different page displays from a single data set.

With regard to the typographical issues within a text, again there is great flexibility. Many of the texts sampled here use punctuation in some way, most commonly the use of full stops to indicate sentence breaks, and occasionally the use of commas within sentences. Some texts include indentation for paragraphs (e.g. Nivkh), or inverted commas to indicate direct speech (e.g. Pitjantjatjara Scripture). In around half the samples surveyed, capital letters are used to indicate the beginning of a new sentence in the first row of text. Almost all the texts use typeface to represent the role of a particular row of interlinear text. This can assist in the interpretation of the visual layout of the material, for example rendering the text line in bold (Ngiti) or italic (Diyari) to distinguish it from rows of analysis or translation. In one case (Susoi's treatment of Tundra Nenets) there is a combination of styles within one line of data, where certain words are given in the free translation in italics. In some cases (e.g. Yidinj) indentation is used to separate the lines of interlinearised texts from additional comments. The use of capital letters or small caps is a common way of distinguishing grammatical information from glosses within the text (e.g. the Nepali gloss "dog-GEN"). The requirements for an interlinear text editor are that it should be able to generate a full range of typographical features for the visual presentation of a text.

With regard to alignment, all the texts considered in the survey conform to some standard of alignment between rows of a text, though the Latin example is atypical. Issues of alignment are two-dimensional: the horizontal alignment of lines on the page leads to issues of wrapping (discussed below), and there is also vertical alignment across rows. The entries in the gloss row can be aligned with the entries in the text row either at the word level, where the gloss takes the word as a unit and maps completely, or at the morpheme level. These can be considered as cells in a table: some texts use the word as the cell, and so all sub-level glosses correspond to this, whether the words are broken into morphemes (as in Ainu) or not (as in Susoi's Tundra Nenets), while others take the morpheme as the cell (as in Nivkh).
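The two choices of cell can be illustrated with a small rendering sketch. It assumes plain monospaced output and pads each column to the width of its widest entry; the function names `align_rows` and `morpheme_cells` are ours. The forms are the two Pitjantjatjara words quoted in the survey (kulila = listen.IMP, tjukurpa-na = story-I), used here only as sample cells and not necessarily adjacent in the original text.

```python
def align_rows(columns, sep="  "):
    """Render columns of (text, gloss, ...) cells as vertically aligned rows."""
    widths = [max(len(cell) for cell in col) for col in columns]
    rows = []
    for r in range(len(columns[0])):
        padded = (col[r].ljust(w) for col, w in zip(columns, widths))
        rows.append(sep.join(padded).rstrip())
    return rows


def morpheme_cells(word, gloss):
    """Split a hyphenated word and its gloss into one cell per morpheme."""
    forms, glosses = word.split("-"), gloss.split("-")
    if len(forms) != len(glosses):
        raise ValueError("morpheme and gloss counts differ")
    return list(zip(forms, glosses))


# Word as the cell: each gloss is aligned to a whole word.
word_level = [("kulila", "listen.IMP"), ("tjukurpa-na", "story-I")]
for row in align_rows(word_level):
    print(row)

# Morpheme as the cell: the second word becomes two columns.
morph_level = [("kulila", "listen.IMP")] + morpheme_cells("tjukurpa-na", "story-I")
for row in align_rows(morph_level):
    print(row)
```

A word such as the Susoi Tundra Nenets forms, where the gloss has more parts than the word has marked morphemes, would raise the `ValueError` in `morpheme_cells`, which is exactly the case where only word-level alignment is possible.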

Language                 Gloss row
Ainu                     1SG/O-raise-PL
Tundra Nenets (Susoi)    what kind LIM ABS NOM SG
Nivkh                    REFL younger-brother AND

Table 1: Examples of alignment

Phrasal translation of a complete text or of separate phrases cannot be aligned vertically with its corresponding text, but is simply left-aligned wherever it is situated on the page. It appears then that alignment is linguistically significant only between the word / morpheme and the gloss, whereas other alignment is simply page presentation, which has visual rather than linguistic motivation. The requirements for an interlinear text editor are that it should be able to manage alignment at either word or morpheme level, and to manage different alignment of rows within a text.

With regard to mapping issues, the alignment between the word or morpheme and the gloss usually indicates a one-to-one mapping. There are however examples of one-to-many mapping, where a single form can be analysed as more than one morpheme (as in the form -yo in Nepali analysed above), or where variant glosses are given (e.g. Hebrew "in the beginning (head)"). A case of one-to-zero mapping was identified in Ainu, where a word separated into two morphemes was given a single gloss (ran-pes = "cliff"), and also in Latin, where est was given no gloss. While no cases of zero-to-one mapping were found in the survey, there are cases of phonological insertion which do not correspond to morphemes (e.g. French 't' in y-a t'il). Cases of many-to-one mapping were found in the Latin example, where multiple words corresponded to a single gloss, though these are generally not permitted in interlinear texts. The issue of mapping is significant in texts which use contrasting directionality, discussed above in relation to the Hebrew text. It was noted previously that some gaps in glossing may serve a purpose, particularly in incomplete field notes where some detail cannot be included until later. The requirements for an interlinear text editor are that it should normally require one-to-one mapping, yet also be able to manage these other configurations.
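One way an editor might meet this requirement is to accept every cell but flag those that depart from one-to-one mapping. The Python sketch below is only an illustration under that assumption. The names `mapping_type` and `check_cells` are hypothetical, and the labels are derived purely from counting forms and glosses, which is a simplification: the labels used in the survey depend on the linguistic analysis, not only on counts. `FORM` stands in for a source form that is not reproduced here.

```python
def mapping_type(forms, glosses):
    """Label a cell by the number of forms and glosses it pairs."""
    def size(n):
        return "zero" if n == 0 else "one" if n == 1 else "many"
    return f"{size(len(forms))}-to-{size(len(glosses))}"


def check_cells(cells):
    """Return (index, label) for every cell that is not one-to-one."""
    flagged = []
    for i, (forms, glosses) in enumerate(cells):
        label = mapping_type(forms, glosses)
        if label != "one-to-one":
            flagged.append((i, label))
    return flagged


cells = [
    (["syoq"], ["song ABS NOM PL"]),            # ordinary cell
    (["est"], []),                              # Latin est, left unglossed
    (["FORM"], ["in the beginning", "head"]),   # variant glosses; FORM is a placeholder
]
print(check_cells(cells))
```

Flagging rather than rejecting keeps incomplete field notes editable while still making the default expectation visible.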

With regard to wrapping, some samples in the survey highlighted the issue of where a line or lines of text can be broken. A distinction needs to be made between breaks which are linguistically significant and consequently have certain restrictions on wrapping, and those which are not. There appear to be restrictions on which lines stay together when wrapping is required. In the Diyari text for example, aligned sets of text-plus-gloss carry over to another line as a unit (e.g. line 5), while the free translation of this line wraps separately. The following tables show line 5 without wrapping, then with wrapping at different points.

(The Diyari transcription rows are not recoverable; gloss rows and free translation only.)

Line 5 without wrapping:
  then-LOC clothing-ABS take off-PART AUX-PRES naked go-PART AUX-IMPL ss meat rotten-ERG
  Then they took their clothes off to walk around naked painted with some rotten meat

Line 5 wrapped into four interlinear lines:
  then-LOC clothing-ABS take off-PART
  AUX-PRES naked go-PART
  AUX-IMPL ss meat rotten-ERG
  paint-PART AUX-REL ss
  Then they took their clothes off to walk around naked painted with some rotten meat

Line 5 wrapped into two interlinear lines:
  then-LOC clothing-ABS take off-PART AUX-PRES naked go-PART
  AUX-IMPL ss meat rotten-ERG paint-PART AUX-REL ss
  Then they took their clothes off to walk around naked painted with some rotten meat

Table 2: Text wrapping in Diyari

This demonstrates that wrapping of the complete phrase is independent of wrapping of the lines of interlinearised text, and that no internal wrapping is permitted within the word or morpheme. Wrapping in the South Efate example is slightly more complex, following a similar pattern but with more lines of information: the tags \tx, \mr, \mg and \POS stay together when wrapping across lines, while \fg and \fgb wrap separately. The use of numbers can assist in determining which parts of a text stay together, yet the Diyari example shows that there are different types of wrapping even within numbered phrases. It therefore seems that an interlinear text editor must make explicit the rules concerning which lines of text stay together when wrapping. While it is permissible to break a line of free translation between any two words for wrapping, line breaks are never permitted within words, even across designated morpheme boundaries. The requirements for an interlinear text editor are that it should forbid line breaks within a word or a morpheme, but allow them at other levels.
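The wrapping rules can be sketched in a few lines of Python. This is an illustrative greedy algorithm, not the authors' implementation: each (text, gloss) pair is an indivisible column, columns are packed onto a line until the width is exceeded, and the free translation is wrapped independently with the standard `textwrap` module. The glosses are those of Diyari line 5; the transcription is replaced by placeholders `w1`, `w2`, ..., and the grouping of the glosses into eleven units is an assumption.

```python
import textwrap


def wrap_interlinear(units, translation, width):
    """Wrap aligned (text, gloss) units as blocks; wrap the translation separately."""
    blocks, current, used = [], [], 0
    for text, gloss in units:
        w = max(len(text), len(gloss))          # a unit is as wide as its widest row
        needed = w if not current else used + 1 + w
        if current and needed > width:          # unit does not fit: start a new block
            blocks.append(current)
            current, needed = [], w
        current.append((text, gloss, w))
        used = needed
    if current:
        blocks.append(current)

    lines = []
    for block in blocks:
        lines.append(" ".join(t.ljust(w) for t, g, w in block).rstrip())
        lines.append(" ".join(g.ljust(w) for t, g, w in block).rstrip())
    lines.extend(textwrap.wrap(translation, width))  # breaks between any words
    return lines


GLOSSES = ["then-LOC", "clothing-ABS", "take off-PART", "AUX-PRES", "naked",
           "go-PART", "AUX-IMPL ss", "meat", "rotten-ERG", "paint-PART",
           "AUX-REL ss"]
UNITS = [(f"w{i}", gloss) for i, gloss in enumerate(GLOSSES, 1)]  # placeholder text row
TRANSLATION = ("Then they took their clothes off to walk around naked "
               "painted with some rotten meat")

for line in wrap_interlinear(UNITS, TRANSLATION, 40):
    print(line)
```

Changing the `width` argument moves the break points of the interlinear rows and of the translation independently, which is the behaviour the Diyari example displays.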

4 Towards a formal model for interlinear text

The preceding survey tried to adduce the structure of an interlinear text from its presentation on the page. This process of abstracting from visual format to conceptual model has some subtleties, and even requires guesswork in some ambiguous cases.

4.1 Hierarchical Organization

The alignments and groupings that we observe in interlinear text can be represented using a hierarchical model. We will represent two rows (e.g. morphemes and their glosses) within a single node of this hierarchy if they are aligned with each other. We will represent one row as dominating another (e.g. words dominating morphemes) if the dominated content is aligned as a group to the dominating content. We now illustrate this hierarchical organization by considering several cases from our survey.
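As a rough illustration of these two rules, the following Python sketch models a node that holds several aligned rows and dominates a sequence of children one level down. The class name `Node`, the `rows` dictionary and the row labels `txt` and `gls` are assumptions chosen for the example, not names taken from any existing tool. The sample uses one morpheme and its gloss from the Nepali text, with the word node left empty; the grouping shown is illustrative only.

```python
from dataclasses import dataclass, field

LEVELS = ("text", "phrase", "word", "morpheme")


@dataclass
class Node:
    level: str
    rows: dict = field(default_factory=dict)      # rows aligned with each other
    children: list = field(default_factory=list)  # content dominated by this node

    def add(self, child):
        """Attach a child, which must sit exactly one level below this node."""
        i = LEVELS.index(self.level)
        if i + 1 >= len(LEVELS) or child.level != LEVELS[i + 1]:
            raise ValueError(f"{self.level} cannot directly dominate {child.level}")
        self.children.append(child)
        return child


text = Node("text", {"title": "Lobhi Kukur"})
phrase = text.add(Node("phrase", {"gls": "He was walking."}))
word = phrase.add(Node("word"))                   # empty node: still groups its morphemes
word.add(Node("morpheme", {"txt": "u", "gls": "3L"}))
```

Rows stored in the same `rows` dictionary are aligned with one another; anything reached through `children` is aligned as a group to its parent.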

Let us begin with a consideration of the Susoi treatment of Tundra Nenets. On the page, a row of text appears above a row of glosses, with a free translation of the entire text given separately at the bottom of the page. In a hierarchical structure of the text, the phrase is represented as a sequence of words, and the word as a sequence of morphemes, as indicated in the following tree diagram.

[Figure 19: Tundra Nenets (Susoi) tree & text. Phrase translation: "Traditional folk songs"; word glosses: ancient ABS NOM SG; person ABS GEN PL; song ABS NOM PL.]

In this case, the words are not separated into morphemes, so there is a one-to-many mapping between the word and morpheme level. Contrast this with the first proverb from the Ngiti text, where the words are transcribed twice, once as whole units and then broken into morphemes, which therefore requires two representations in the model, as in the following tree.

Figure 20: Ngiti tree & text

When considering what information belongs at each level of the tree, we allow more than one row of information in each node. In this case, the morphemes and their glosses belong in the same node, as do the phrasal translation of the proverb and its meaning. This allows for the concatenation of morphemes into words, words into phrases, and phrases into text, as required by the interlinear text. It also allows for complex information structures to be represented in simple tree diagrams.

In the Nepali example, the concept of 'word' is in some ways ambiguous. While there is a visual representation of the word on the page, it is basically a concatenation of linked morphemes, separated from other words by white space. This implies that each word is transcribed at the level of the morpheme, leaving the 'word' node empty. Line 4 of the text fits the tree in the following way:


Figure 21: Nepali tree & text

It was noted above that some texts which look similar on the page are structurally different. The Diyari and Yidinj texts are a case in point: the Diyari text breaks the words into morphemes where the Yidinj does not. According to our model, Diyari has a transcription and gloss both at morpheme level, while Yidinj has transcription at word level and gloss at morpheme level, as indicated in the following diagrams. Additionally, informed guesswork is needed to determine where each comment would fit in the tree structure, whether at text, phrase, word or morpheme level.

Figure 22: Diyari tree & text


Figure 23: Yidinj tree & text

Conversely, samples which look different on the page may in fact be conceptually similar, such as the Nepali and Ainu examples. The free translation in Ainu is displayed on the page as if it corresponded to text level, but the numbering system indicates that it corresponds to phrase level.

Figure 24: Nepali tree

Figure 25: Ainu tree

Above the level of the phrase, text-level information can also be included in the hierarchy. The South Efate data includes four lines of metadata at text level, plus another eight lines of information, all of which fit into a four-level tree structure as follows. As implied by this example, further phrases can be attached to the text node and analysed accordingly. Note that the arrangement in the model does not match the order of items on the page.


Figure 26: South Efate tree & text

4.2 Four-levels

Given that interlinear text is hierarchically organized, how many levels of hierarchy are required? From our survey, we believe that four levels are sufficient, and we will call them Text, Phrase, Word and Morpheme. While these labels correspond to linguistic units, they are to be interpreted with reference to the common forms of interlinear display. A phrase is a collection of transcription and its interlinear analysis arrayed across two or more rows, normally represented in interlinear text as beginning on a new line, and wrapping only if necessary. A word is a smaller collection of material (e.g. transcriptions, morphemes and glosses) that must be kept together on the same line. A morpheme is the smallest possible level of alignment between linguistic forms and their meanings. Finally, a text is the highest level of structure corresponding to the external linguistic artefact being investigated. Just as it is possible for a word to contain a single morpheme, a phrase may contain a single word, and a text may contain a single phrase. Thus there is no prior commitment to the total length of this external linguistic artefact. According to this four-level scheme, the user has flexibility in the assignment of content to levels, and the decision may be influenced by both linguistic considerations (what is a "word"? Is this one text or two?) and layout considerations (what should be kept on the same line in a display?). The model is neutral about these considerations. The following diagram shows a schematic representation of the four levels:

Figure 27: Model tree

Text level: This is the complete unit of data under examination which functions as a unit in its entirety, whether it is a 3-word proverb or a 500-page epic narrative. The text level includes metadata which we represent in an ad hoc fashion here (we propose to extend the model with a metadata container which can hold one or more metadata records in standard formats). Of the texts considered in the survey, the majority at least have a title, while many have a source, such as a speaker (e.g. Garrwa) or a printed reference (e.g. Nivkh). Some have notes or comments relevant to the entire text (e.g. South Efate), while others have a free translation of the entire text (cf. the Nivkh example). An unanalysed source text would also be represented at this level.

Phrase level: In most of the examples the phrase level corresponds to a sentence, though in some cases it is a smaller unit (e.g. the first Nenets phrase) or a larger one (e.g. South Efate). Simons (1989) refers to "units of convenient size for translation, often more than a single clause but less than a sentence." The majority of the texts include free translation at this level, and some number the units at this level (Nepali, Hebrew). Some texts have no information at this level at all (e.g. Nivkh, Latin), leading to empty phrase nodes. Note that they are empty, not simply deleted, since each of the four levels represents grouping, not just content.

Word level: In some cases this can be constrained by the typology of the language. For example, languages written without white space separation (e.g. Thai, Pali) may not indicate word breaks in the orthographic system; however, phonetic transcriptions should account for the separation appropriately.

Morpheme level: The separation of a word into morphemes is also subject to linguistic principles outside the immediate scope of the interlinearisation process. As noted throughout the survey, issues of alignment between word and morpheme level are wide-ranging. Significant issues at this level include the standard use of markup tags and the use of punctuation (e.g. whether hyphens or full stops are used to separate morphemes or complex glosses). Questions of how to handle such items as discontinuous morphemes, contractions, etc. are raised at this level.

4.3 An XML representation

Each level consists of some content, and a sequence of children at the next level down. The content is given a user-defined type which documents its intended semantic interpretation. We assume, but do not discuss, orthogonal content properties specified at this level, such as language, encoding, directionality and so forth.

[Schematic XML listing; element markup not recoverable. Its two parts, in order:]
  Content at the text level, such as metadata, or an unaligned transcription of the entire text, or a pointer to an unaligned audio file
  Nested XML content to represent the phrasal constituents of the text

This structure is repeated at each of the four levels.

[XML listing; element markup not recoverable. Text content, in nesting order:]
  The Title
    A phrasal translation
      Word
        Morph   Gloss
        Morph   Gloss
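To make the nesting concrete, the following Python sketch builds a document of this shape with the standard `xml.etree.ElementTree` module. The element names `interlinear-text`, `phrase` and `item`, and the type value `title`, are mentioned elsewhere in this paper; the container names `phrases`, `words`, `morphemes` and `morph`, the attribute name `type`, and the type values `txt` and `gls` are assumptions made for this illustration and may differ from the actual format.

```python
import xml.etree.ElementTree as ET


def item(parent, type_, text):
    """Add one typed content item to a node at any level."""
    el = ET.SubElement(parent, "item", {"type": type_})
    el.text = text
    return el


text = ET.Element("interlinear-text")
item(text, "title", "The Title")

phrase = ET.SubElement(ET.SubElement(text, "phrases"), "phrase")
item(phrase, "gls", "A phrasal translation")

word = ET.SubElement(ET.SubElement(phrase, "words"), "word")
item(word, "txt", "Word")

morphemes = ET.SubElement(word, "morphemes")
for _ in range(2):
    morph = ET.SubElement(morphemes, "morph")
    item(morph, "txt", "Morph")
    item(morph, "gls", "Gloss")

xml_string = ET.tostring(text, encoding="unicode")
print(xml_string)
```

Each level follows the same pattern of typed content items followed by a container of children, so an empty level is simply a node with no items.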

We now illustrate it with examples chosen from our survey, beginning with the first phrase from the Susoi treatment of Tundra Nenets, corresponding to the tree diagram in Figure 19. At this early stage of development, the text processor does not support non-standard characters, which have therefore been simplified for the purpose of demonstration.

[XML listing; element markup not recoverable. Text content, in document order:]

  An excerpt from Susoi 1990, p.20
  Traditional folk songs. Besides presenting various kinds of tales (lax'nako), lament recitatives (yarobc'), and heroic recitatives (syud'bobc'),...

  Nyew'xi'               ancient ABS NOM SG
  nyenecyoyeq            person ABS GEN PL
  syoq                   song ABS NOM PL
  Xurkaryi               what kind LIM ABS NOM SG
  lax'naku               tale ABS ACC PL
  yarobc'                yarabts ABS ACC PL
  syud'bobc'             syudbabts ABS ACC PL
  ngodyibyelyewantoh     present INF IMPERF POSS GEN SG3PL
  xaw'na                 besides ABS
  ...

As noted earlier, the free translation includes a combination of renderings, with Nenets words included in italics. To support this properly would require markup inside item elements, which is beyond the scope of the current effort. In the original displayed form, punctuation (commas and full stops) appeared in the source text row. We deliberately omit them from the abstract representation, and assume they can be inserted in the rendering process.

The next example takes a small section from the Ngiti proverbs, corresponding to the tree diagram in Figure 20.

[XML listing; element markup not recoverable. Text content, in document order:]

  Proverbs
  Leaves that have fallen down can't go back upward
  Take advantage of the present moment, because once the opportunity is gone it can never come back
  ...
  sobi
    sobi      leaf
  narongo
    n[character not recoverable]   RSM
    ari       HAB
    ongo      return:NOM1
  ...

To demonstrate how extra lines of data can be incorporated into the model, the following example is an extract from the South Efate data, corresponding to the tree diagram in Figure 26.

[XML listing; element markup not recoverable. Text content, in document order:]

  SE Text
  kalsrap.mov
  Story from tape 20001bx told by Kalsarap Namaf.
  Transcribed and translated into Bislama by ...
  We all know that place ...
  Yumi evriwan isave ples ia ...
  Akit
    akit      1plincS      pron
  tumaui
    tu        1plincRS     pron
    mau       all          quantifier
    ...
  ...

The original interlinear text also includes audio offsets which are not represented here. Note also the use of complex glosses like "1plincRS" and the fact that we do not model portmanteau morphemes.

The following example from Nepali (corresponding to Figure 21 plus some words from the following phrase) demonstrates how the processor handles empty nodes, as in this case the word level is empty.

[XML listing; element markup not recoverable. Text content, in document order:]

  Lobhi Kukur (Lo)
  He was walking.
    u         3L
    hi~D      walk
    day       SP
    gar       do
    eko       PP
    thi       be.PST
    yo        3smL.PST
  While walking,...
    hi~D      walk
    day       SP
    jaa~      go
    daa       SP
    ...
  ...

5 Rendering

Now that we have a model of the structure of interlinear text, we need to demonstrate its adequacy by showing how it can be used in generating the kinds of layouts covered in our survey. (This logic is analogous to that used in conventional generative grammar: having analyzed surface forms and proposed an underlying representation, it is incumbent upon us to provide the rules and constraints which can be applied to generate the original surface forms from the putative underlying representations.) In this section we discuss a variety of presentation issues, before defining the mapping using XSL.

5.1 Presentation issues

Grouping of content. Recall that the Denya example has a free translation co-indexed to the phrases of the interlinear text, but displayed after the entire text. The Pitjantjatjara example has segmentation down to the morpheme level, but the alignment of morphemes with their glosses is only shown at the word level. Generalizing from these examples, it should be possible to view an interlinear text with coarser-grained alignment, e.g. combining phrase-level items to form a single text-level item, or combining morpheme-level items to form a single word-level item. More generally, when a text is enriched by the addition of finer-grained segmentation, it should still be possible to view it with coarser alignment. In terms of our tree diagrams, this amounts to taking material from a lower-level set of nodes, concatenating it together, and storing it in a higher-level node. Just as we can ignore the low-level structures, we must also be able to ignore the high-level structures. For example, given an interlinear text it should be possible to extract its words or morphemes in order to construct a wordlist. Therefore, it is a requirement on the rendering process that it can ignore aspects of the structure for the purpose of display and alignment.
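Both directions of this requirement can be sketched in Python over a plain nested dictionary. The functions `coarsen` and `wordlist` and the dictionary layout are hypothetical, introduced only for illustration. The morphemes and glosses come from the Nepali example; their grouping into two words is an assumption, since the word level is empty in that text.

```python
def coarsen(word, row, joiner="-"):
    """Concatenate one row of a word's morphemes into a single word-level string."""
    return joiner.join(m[row] for m in word["morphemes"] if row in m)


def wordlist(text, row="txt"):
    """Ignore the phrase level and return every word, built from its morphemes."""
    return [coarsen(w, row, joiner="")
            for p in text["phrases"]
            for w in p["words"]]


nepali = {
    "phrases": [{
        "rows": {"gls": "He was walking."},
        "words": [
            {"morphemes": [{"txt": "u", "gls": "3L"}]},
            {"morphemes": [{"txt": "hi~D", "gls": "walk"},
                           {"txt": "day", "gls": "SP"}]},
        ],
    }]
}

print(coarsen(nepali["phrases"][0]["words"][1], "gls"))   # word-level gloss
print(wordlist(nepali))                                   # flat list of words
```

Coarsening reads from lower nodes without altering them, so the finer-grained analysis remains available for other views.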

Which rows to display. The examples covered in our survey have widely varying levels of detail, ranging from two rows to a dozen. When a text is enriched by the addition of another row of information, it should still be possible to view it in its original form without this extra row. Therefore, it is a requirement on the rendering process that it can omit specific rows from the display.

Row styles. The Nepali text presents source data in boldface, and free translations in italics. In the Diyari example, source data is shown in italics, while phrase-level commentary is shown using a smaller point size. Such variation in form may reflect the preference of the author or the requirements of the publisher. It should be possible to view the rows of a given text in different styles without having to manually change the style of every item in the text. Therefore, it is a requirement on the rendering process that it can distinguish different types of rows and display them in user-specified fonts, typefaces, point sizes and so forth.

Ordering of rows. There are some widespread conventions concerning the vertical arrangement of the rows of interlinear text. For example, morphemes are usually written underneath the corresponding words, and glosses are usually written underneath the corresponding morphemes. However, there are occasional exceptions to such patterns (e.g. the Latin example). At the text level we can observe considerable variation. In the Nivkh example, the notes on the text are placed after the phrases of the text, while in the Tundra Nenets example this order is reversed. It should be possible to view the rows of a given text in different orders without having to manually reorganize the layout of every item. Therefore, it is a requirement on the rendering process that it can distinguish different types of rows and display them in a user-defined sequence.
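The three row-related requirements (omitting rows, styling rows by type, and ordering rows) can be met by a single selection step, as in this Python sketch. The function `select_rows`, the row type names and the style names are all hypothetical; in particular the `note` row and its content are invented for the example.

```python
def select_rows(rows, order, styles=None):
    """Return (type, style, content) triples for the requested rows, in order.

    rows:   mapping from row type to content for one node
    order:  row types to display, top to bottom; types not listed are omitted
    styles: optional mapping from row type to a style name
    """
    styles = styles or {}
    return [(t, styles.get(t, "plain"), rows[t]) for t in order if t in rows]


rows = {"txt": "syoq", "gls": "song ABS NOM PL", "note": "example note"}

# Gloss above text, source form in bold, note row suppressed.
for row_type, style, content in select_rows(rows, ["gls", "txt"], {"txt": "bold"}):
    print(f"{row_type:4} [{style}] {content}")
```

Because the display order and styles are parameters rather than properties of the data, the same text can be rendered to suit different authors or publishers without editing any item.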

5.2 An XSL implementation

Our implementation of rendering is based on the Extensible Stylesheet Language (XSL), which can be used to transform XML documents into other formats. By choosing different stylesheets, or selecting different parameters for a given stylesheet, it should be possible to generate a variety of useful formats, whether for human consumption using a particular technology (e.g. conversion to HTML for delivery to a web browser), or for machine consumption (e.g. conversion to another XML format for delivery to another program). Figure 28 is a schematic diagram of this process.

[Figure 28: Rendering an abstract representation. Abstract representation (XML) -> stylesheet transformation (XSL) -> rendered version (XML, PDF, HTML, ...).]


The transformation performed by this model must accomplish two things: (i) convert the abstract XML representation into a format which specifies grouping, row ordering and styles, and (ii) convert the XML markup into the formatting instructions of some other language like PDF or HTML. The second of these can be further broken down into two stages: conversion to XSL formatting objects, and conversion to the delivery format. The full model is shown in Figure 29. On the left we see the abstract representation, which is mapped using one of our stylesheets XSL_i to a surface representation that fixes the grouping of items, ordering of rows, and so forth. A different choice of stylesheet will result in different groupings and orderings in the surface representation XML_SR. This format may be further transformed using third-party XSL stylesheets (XSL_PUB) to meet the requirements of publishers. We supply another transform, XSL_FO, which converts the surface representation into a low-level representation using XSL formatting objects. Third-party software can then be used to generate the format to be delivered to the end-user.

[Figure 29: Detailed rendering model. Abstract representation (XML_UR) -> stylesheets XSL_1, XSL_2, XSL_3 -> surface representation (XML_SR), optionally via XSL_PUB -> XSL_FO -> rendered in XML (XML_FO) -> delivery formats (PDF, HTML, RTF, SVG, JPEG).]

5.3 Implementation Details

The key rendering challenge posed by interlinear text is line-wrapping. A line of interlinear text contains multiple rows, and these must be wrapped as a group, keeping words and their morphological analysis together on the same line. Thus, these multirow constituents must be treated as indivisible entities with respect to line wrapping, and there should be no line-wrapping within these entities. Formatting languages such as HTML and DocBook model a document as a collection of blocks (paragraphs, quotes, tables, lists) each containing lines of text. Critically, blocks must appear on a line of their own, or equivalently, lines cannot contain multiple blocks. In order to handle the line wrapping requirements of interlinear text, we need a language which permits inline blocks, such as TeX or XSL-FO (XSL Formatting Objects). To facilitate flexible integration with web services we have chosen to use XSL-FO.

Unfortunately, some XSL-FO engines do not handle inline blocks correctly at the present time (e.g. Apache FOP and XEP). We have found that XSL Formatter correctly handles inline blocks, and so we have used it to generate the examples included below. For a discussion of XSL-FO engines and their conformance level, see [http://xml.coverpages.org/KimberProductionQuality-XSL-FO.html].

XSL processing of the abstract XML format to specify grouping, row-ordering and styles is quite straightforward. We give three examples to illustrate. The following template matches a phrase, and orders its constituent words before any content at the phrase level (such as a free translation).

The following template matches an interlinear-text, ordering any content of type —title“ before the constituent phrases, and ordering any other content (e.g. notes) afterwards.

<xsl:value-of select="."/>

The following template matches a document, constructs a single interlinear-text consisting of just the words (including any morphemes and glosses), and sorts them.
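For readers without an XSL processor to hand, the operation performed by this third template can be approximated in Python with the standard `xml.etree.ElementTree` module. This is a stand-in for illustration only, not the stylesheet itself. The element names `document`, `interlinear-text`, `phrase` and `item` appear in this paper; the names `phrases`, `words` and `word`, the attribute `type`, and the type values `txt` and `gls` are assumptions. The sample words and glosses are taken from the Tundra Nenets example.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<document>
  <interlinear-text>
    <phrases><phrase><words>
      <word><item type="txt">syoq</item><item type="gls">song ABS NOM PL</item></word>
      <word><item type="txt">xaw'na</item><item type="gls">besides ABS</item></word>
      <word><item type="txt">lax'naku</item><item type="gls">tale ABS ACC PL</item></word>
    </words></phrase></phrases>
  </interlinear-text>
</document>"""


def sorted_wordlist(xml_text):
    """Collect every word in the document, ignoring phrase structure, and sort."""
    root = ET.fromstring(xml_text)
    entries = []
    for word in root.iter("word"):
        rows = {i.get("type"): (i.text or "") for i in word.findall("item")}
        entries.append((rows.get("txt", ""), rows.get("gls", "")))
    return sorted(entries, key=lambda entry: entry[0].lower())


for form, gloss in sorted_wordlist(SAMPLE):
    print(f"{form:12} {gloss}")
```

The sort key here is a simple case-insensitive comparison of the transcription; a real wordlist would need a collation suited to the language concerned.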


Using such templates we can generate documents for delivery to end-users. We give three examples below, showing different styles for the XML examples presented in the previous section.

Figure 30: Rendered version of four XML examples, with four levels of nested rectangles to show the four levels of hierarchy.


Figure 31: Nenets example showing correct wrapping of inline blocks


Figure 32: Sorted wordlist generated from the four interlinear texts

Further work on the XSL mapping is needed: to support common layout styles and to include additional rendering parameters; to permit display parameters to be stored within the abstract XML representation itself (or in a special XML format declaration file); and to model prefixes and suffixes so that hyphens are introduced appropriately at morpheme boundaries.

6 Topics for further research

Syntactic analysis. The conceptual space between the word and phrase level leaves open the possibility of syntactic representation, which has not usually been addressed by interlinear text tools. While syntactic analysis may not be a primary goal of interlinear text, some level of syntactic analysis is involved in the process of interlinearisation, for example in the breaking up of words into morphemes. Samples such as the Pitjantjatjara text above demonstrate that linguists find it useful to use bracketing or some kind of labelling to encode syntactic structure. While tools such as Spinoza (see comments above on Ambon) crucially require syntactic analysis as part of the interlinear representation, many other tools do not have this functionality. Drude (2002) gives an example of advanced glossing in Shoebox which allows for complex interlinear markup of both syntax and morphology. An interface with an existing tool such as TreeTrans (Ma, Maeda, Bird 2002) would allow an analyst to build syntactic trees from the data in an interlinear text.


Directionality. One constraint of the model advocated here is the requirement for interlinear text to be in a horizontal modality (as opposed to a vertical modality exhibited by some writing system types, a possible research extension). Within the horizontal plane, there is the additional requirement for multi-directionality in text flow. Whilst it can be argued that directionality is inherent at the level of the character (The Unicode Consortium, 2000:55), or at the level of the Small Linguistic Unit (Sproat, 2000:25), in our context it is more important at the level of the word or phrase. As previously described, our survey found that interlinear texts can exhibit multi-directional text flow between the original text and its subsequent glossing. In such circumstances the higher level element of the word or phrase allows apparently contradictory flows to be linked across interlinearisation layers. One possible solution is to enhance the model to use a directionality marker at the level of the word, which is analogous to the recommendation in the XML specification.

Editing operations. We have presented a data model but apart from our discussion of rendering we have not attempted to define the legal operations that construct, access and otherwise manipulate the structures. In future work we plan to define the data types and legal operations formally, independently of their possible XML and XSL representations. This will serve as the basis for object-oriented implementation, for the definition of application programming interfaces, and for research on suitable query languages. More generally, we need to consider interfaces to other data sources such as texts and lexicons.

Annotation graphs. The model presented here is a specialization of the interlinear text model presented by Maeda and Bird (2000), and can be represented using annotation graphs. In future work we will implement a tool that is part of the Annotation Graph Toolkit, building on the existing InterTrans tool. This will require extending the XML format to support the specification of media file offsets.

7 Conclusion

In presenting a model of interlinear text, we have drawn a sharp distinction between structure and presentation. We have described a four-level, hierarchical model which represents the structure of a diverse range of interlinear text types and which may serve as a starting point for discussions of suitable archival formats for interlinear text. We have shown how data in this model can be rendered into conventional presentation forms. The implementation is based on published XML standards and freely available technologies.

References

Abangma, Samson Negbo. 1987. Modes in Dényá Discourse. Summer Institute of Linguistics and University of Texas at Arlington.

Andrews, Avery. 1991. Primitive Linguistics Tree Formatter. http://www-nlp.stanford.edu/~manning/tex/trees.sty

Austin, Peter. 1981. A Grammar of Diyari, South Australia. Cambridge: Cambridge University Press.

Bird, Steven, Kazuaki Maeda, Xiaoyi Ma, Haejoong Lee, Beth Randall, and Salim Zayat. 2002. TableTrans, MultiTrans, InterTrans and TreeTrans: Diverse Tools Built on the Annotation Graph Toolkit. Proceedings of the Third International Conference on Language Resources and Evaluation, Paris: European Language Resources Association, pp 364-370.

Boersma, Paul and David Weenink. 2003. Praat: Doing Phonetics By Computer. Institute of Phonetic Sciences, University of Amsterdam. http://www.fon.hum.uva.nl/praat/

Bowe, Heather. 1986. Pitjantjatjara Stories for Children – Fregon 1984-85. Research Report to the Australian Institute of Aboriginal Studies, Canberra.

Comrie, Bernard. 1981. The Languages of the Soviet Union. Cambridge: Cambridge University Press.

Dixon, R. M. W. 1977. A grammar of Yidinj. Cambridge University Press.

Drude, Sebastian. 2002. Advanced Glossing – a language documentation format and its implementation with Shoebox. http://www.mpi.nl/lrec/papers/lrec-pap-10-ag.pdf

EMELD. 2001. Electronic Metastructures for Endangered Language Data – Research Proposal. http://linguist.emich.edu/%7Eworkshop/E-MELD.html

Genetti, Carol (ed.). 1994. Aspects of Nepali Grammar. Santa Barbara: Dept. of Linguistics, University of California.

Henderson, James. 1995. Phonology and Grammar of Yele, Papua New Guinea. Canberra: Research School of Pacific and Asian Studies, Australian National University.

Kimber, W. Eliot. 2002. Production Quality XSL-FO 2002-12. http://xml.coverpages.org/KimberProductionQuality-XSL-FO.html

Laughren, Mary Naomi Keith and Baden Hughes 2002. Australian Aboriginal Language Data : Garrwa. http://emsah.uq.edu.au/linguistics/austlang/garrwa/index.html


Lojenga, Constance Kutsch. 1994. Ngiti: A Central-Sudanic Language of Zaire. Köln: R. Köppe.

Maeda, Kazuaki and Steven Bird. 2000. A Formal Framework for Interlinear Text. Proceedings of the Workshop on Web-based Documentation and Description, Philadelphia, USA; December 12-15, 2000.

Manning, Christopher. 1992. Linguistic Macros for TeX. http://www-nlp.stanford.edu/~manning/tex/cm-lingmacros.sty

Mullins, Mick. 2001. Genesis Hebrew Interlinear. http://community.webshots.com/sym/image3/1/7/42/9510742TGEvfHqlnA_fs.jpg

Musgrave, Simon. 2002. Inducing typological generalizations in a cross-linguistic database. International Workshop on Resources and Tools for Field Linguistics at LREC 2002. Las Palmas, Spain. http://www.mpi.nl/lrec/papers/lrec-pap-21-LREC_COL.pdf

Namaf, Kalsarap. 2001. Litrapong [transcribed story from audio tape (PDSC:NT1- 20001\20001.B.wav)], MS supplied by Nick Thieberger, University of Melbourne.

Paakan, Jan. 1997. A Proposal for an Arrangement of an Interlinear Morphological Corpus of Tundra Nenets. http://www.helsinki.fi/~jpaakkan/Interlinear.html#Appendix2

Pacific Linguistics, 2002. A Guide for Pacific Linguistics Authors. Pacific Linguistics. http://pacling.anu.edu.au/for_authors/guide_for_PL_authors.pdf

Shibatani, Masayoshi. 1990. The Languages of Japan. Cambridge: Cambridge University Press.

SIL. 2002. The Linguist's Shoebox. Summer Institute of Linguistics. http://www.sil.org/computing/shoebox/

Simons, Gary. 1989. Proposed framework for encoding analysis and interpretation of running text. http://tei-c.org/Vault/AI/aiw01.txt

Sproat, Richard. 2000. A Computational Theory of Writing Systems. Cambridge Studies in Natural Language Processing. Cambridge: Cambridge University Press.

Strong, James. 1998. Strong's Exhaustive Concordance of the Bible. Hendrickson Publishers.

Susoj, E. G. 1990. Nenècie literatura. Leningrad: Prosveshchenie. Transcribed at http://www.helsinki.fi/~tasalmin/text.html


The Unicode Consortium. 2000. Unicode Standard Version 3.0. Boston, MA: Addison-Wesley.

Valiulis, David & Gregory Wasson, 1998. Adobe FrameMaker Template Series Template Pack 4 Guide. Adobe Systems Incorporated. http://www.adobe.com/products/framemaker/tempseries/pdfs/tempac4.pdf

Acknowledgements

The authors would like to thank Heather Bowe, Simon Musgrave and Nick Thieberger for their provision of interlinear texts and associated commentary. This material is based upon work supported by the National Science Foundation under Grant No. 0094934 Electronic Metastructure for Endangered Languages Data.
