Towards a General Model of Interlinear Text

Cathy Bow, Baden Hughes and Steven Bird
Department of Computer Science and Software Engineering
University of Melbourne, Victoria 3010, Australia
{cbow, badenh, sb}@cs.mu.oz.au

Abstract

The use of interlinear text has long been a valuable tool in linguistic description, and the development of a number of different software tools has facilitated the creation and processing of such texts. In this paper we survey a range of interlinear texts, focusing on issues such as grouping and alignment. Abstracting away from the presentation, we look specifically at the structure of the data, in an attempt to create a general purpose data model for interlinear text. Our findings are that a four level model, incorporating Text, Phrase, Word and Morpheme levels, is sufficient to represent a very wide range of practice. We present an XML format for representing data in this model, and describe stylesheets for converting such data into presentational formats. Because of its generality, and the way it abstracts away from presentation, we believe the model is a suitable basis for developing archival storage formats for interlinear text and delivering interlinear text to end-users and external software tools in a web environment.

1 Introduction

Interlinear texts serve a variety of purposes in linguistic research. In many cases they are used to demonstrate various linguistic principles in a certain language, as when a traditional narrative is included in a descriptive grammar. In textbooks or other reference material, a single line of text may be given an interlinear treatment to highlight a specific issue under discussion. Some references may include interlinear texts to demonstrate higher level discourse features; others may wish to consider a fine-grained analysis of morphology or even phonology.
Religious texts are often given interlinear treatments, and may include a source language, a transliteration, a vernacular translation and any number of comments or layers of analysis. While we can observe many tendencies concerning the display of interlinear texts (for example, the gloss is generally given below the text), there are no explicit standards or universals constraining either the form or content of such material.

The goal of this paper is to build a general-purpose model for interlinear text. An initial survey of the structure and content of a range of texts allows us to consider what information is represented and how it needs to be stored. Any general model for interlinear text needs to be sufficiently granular to meet all the requirements of the linguist. Our survey will show how varied these requirements can be, whether they come from a field linguist collecting initial data or from an end user analysing it at a later point in time. The model also needs to be typologically inclusive, flexible enough to cover the wide range of typological features across languages. Finally, the model needs to be usable by software developers (whether as an interchange format or an internal data model) as well as by field workers and analysts.

The importance of developing appropriate methods of archiving, and of flexible formats that make data available to others, are key components of the E-MELD proposal (EMELD, 2001). Since one of the goals of the current workshop is to establish best-practice methods, the ratification of such a model could serve as the basis for archival storage of interlinear text sources in the E-MELD showroom.

The paper begins with a survey of interlinear text representations (Section 2). Issues of content and presentation raised by this survey are discussed in Section 3, followed by our proposal of a general-purpose four level model (Section 4).
We present an XML format for the model and show how the surveyed interlinear texts can be represented in the format. In Section 5 we show how the structural markup can be rendered into presentational markup and displayed in familiar forms.

2 Survey of interlinear text

A variety of sources were surveyed to collect diverse types of interlinear text, with the goal of considering a representative sample rather than developing a comprehensive inventory. Some texts were supplied by colleagues, some taken from the internet and some from printed grammars, and full texts were preferred over isolated examples. The majority were text-based, with some linked to audio files, and an attempt was made to cover the major geographic and typological areas. The broad range of samples surveyed attempts to adequately reflect the types of information contained in interlinear texts, as well as to examine how such information is structured.

This survey begins with the most common form of interlinear text in linguistic description, with three lines of content: a transcribed text aligned in some way to a gloss, and some form of free translation. The survey then progresses to more complex examples, discussing various issues of both display and organisation, such as alignment, mapping and wrapping. Issues raised on these topics in the survey will then be discussed prior to the proposal of a model of interlinear texts.

In order to clearly define the terms of reference, we will use the word "text" to refer to a piece of interlinearised material as well as information in a given language (e.g. a line of text followed by a line of translation). A "line" of interlinear text will refer to a row of text plus all the rows of analysis pertaining to it. Within the line, there is a horizontal arrangement of words and their analysis, together with the vertical alignment of row elements to indicate the structure of the interlinear text. There is also the vertical arrangement of one row above the other.
There is a consistency from one block of rows to the next in the vertical arrangement of these rows. The term "free translation" is commonly applied to any translation in which more emphasis is given to the overall meaning of the text than to the exact wording. Since this term may suggest that liberty is taken in translating the text, we will occasionally substitute the term "phrasal translation."

Figure 1: Nepali

A typical example of a basic three-line interlinear text comes from Nepali (Genetti, 1994:166). In this sample, one line of transcription data (in this case phonetic rather than orthographic) is aligned word-by-word with a gloss, and an unaligned free translation is given on the third line. The phrases are numbered, and there is one line of metadata giving a title (the footnote missing from the scanned image notes who recorded, transcribed and glossed the text). The words are broken into morphemes using hyphens, the gloss is aligned to the whole word rather than to individual morphemes, and complex glosses are separated by full stops (e.g. line 3, nisk-yo = come.out-3smL.PST). There are three examples of portmanteau morphemes, for example in line 3 where the morpheme -yo is glossed as "-3smL.PST," which is explained elsewhere in the text as indicating a third person singular masculine low-grade honorific past tense marker.

Figure 2: Ainu

The Ainu example (Shibatani, 1990:85) is similar to the Nepali in that each word of the text is aligned to a gloss, the phrases are numbered, and there is one line of metadata giving the source, though the display is different in that a free translation of the entire text is given below the interlinear text. On closer analysis, however, the free translation is itself divided into numbered phrases with a one-to-one correspondence to the interlinearised phrases. We contend that structurally this resembles the Nepali, though its visual presentation is different.
The free translation doesn't always read fluently, which may be an attempt to reproduce the phrase structure of the source text in the English translation. Like the Nepali example, the words are broken into morphemes by hyphens, though multi-word English glosses are not linked, e.g. cirikinka = "rise high", enkasike = "over there," which can lead to some ambiguity, as in the relationship in line 6 between oka nankor and the gloss "be perhaps." There appear to be some gaps in glossing, where polymorphemic units are given a monomorphemic gloss (e.g. asso-kotor = "wall", ran-pes = "cliff"). It is unclear how such gaps should be interpreted, and whether they reflect uncertainty or oversight. However, such gaps are common features in field notes, and need to be representable in an interlinear text editor.

Figure 3: Pitjantjatjara text

A Pitjantjatjara story (Bowe, 1986) marked up using Microsoft Word appears to have the same three line format, with some metadata (title) relating to the entire text. The words are broken into morphemes using hyphens, and the glosses are aligned to the word (rather than to each individual morpheme). In the glosses, a punctuation distinction is made between complex glosses using full stops (e.g. kulila = "listen.IMP") and morpheme boundaries using hyphens (tjukurpa-na = "story-I"). A free translation is given for each phrase, vertically aligned to the beginning of the phrase; however, many phrases wrap over several lines.

Figure 4: Nivkh

Comrie's book on the languages of the Soviet Union (Comrie, 1981:276-277) gives examples of texts from the languages described in each chapter, as in the Nivkh text shown in Figure 4. On the page, the interlinear text has a line of text (given in a phonetic rather than orthographic form) and a line of glossing, then some metadata on the source of the text, a section of Notes, followed by a free translation.
The alignment between word and gloss is different to those seen previously, as each word is broken into morphemes by both white space and hyphens, with vertical alignment to the gloss line for each morpheme.
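
The punctuation conventions observed in the examples above (hyphens for morpheme boundaries, full stops inside complex glosses, and occasional gaps where a polymorphemic word receives a single gloss) can be illustrated with a small Python sketch. The names Morpheme, Word and align_word are hypothetical and introduced only for illustration; they are not the format proposed in this paper, and the sketch assumes that a hyphen never occurs inside a single gloss element.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Morpheme:
    """One morpheme of a word (hypothetical structure, for illustration only)."""
    form: str                # e.g. "nisk"
    gloss: Optional[str]     # e.g. "come.out"; None when no gloss could be assigned

    @property
    def gloss_parts(self) -> List[str]:
        # Full stops separate the elements of a complex gloss, e.g. "3smL.PST".
        return self.gloss.split(".") if self.gloss else []


@dataclass
class Word:
    form: str                # hyphenated word as it appears in the text row
    gloss: str               # word-level gloss as it appears in the gloss row
    morphemes: List[Morpheme]
    fully_aligned: bool      # False when form and gloss segment counts differ


def align_word(form: str, gloss: str) -> Word:
    """Pair the hyphen-separated segments of a word with those of its gloss.

    Assumption: hyphens mark morpheme boundaries in both rows. When the
    counts differ (a glossing gap), the morpheme forms are kept and their
    glosses are left unassigned rather than guessed.
    """
    forms = form.split("-")
    glosses = gloss.split("-")
    if len(forms) == len(glosses):
        morphemes = [Morpheme(f, g) for f, g in zip(forms, glosses)]
        return Word(form, gloss, morphemes, True)
    morphemes = [Morpheme(f, None) for f in forms]
    return Word(form, gloss, morphemes, False)


if __name__ == "__main__":
    nepali = align_word("nisk-yo", "come.out-3smL.PST")
    print(nepali.fully_aligned, nepali.morphemes[1].gloss_parts)

    ainu = align_word("asso-kotor", "wall")
    print(ainu.fully_aligned, ainu.gloss)
```

Keeping the word-level gloss alongside unassigned morpheme glosses means that a gap such as asso-kotor = "wall" is recorded as found in the source, without forcing an interpretation of whether it reflects uncertainty or oversight.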
