An Ontology for Accessing Transcription Systems (OATS)

Steven Moran
University of Washington
Seattle, WA, USA

Proceedings of the EACL 2009 Workshop on Language Technologies for African Languages – AfLaT 2009, pages 112–120, Athens, Greece, 31 March 2009. © 2009 Association for Computational Linguistics

Abstract

This paper presents the Ontology for Accessing Transcription Systems (OATS), a knowledge base that supports interoperation over disparate transcription systems and practical orthographies. The knowledge base includes an ontological description of writing systems and relations for mapping transcription system segments to an interlingua pivot, the IPA. It includes orthographic and phonemic inventories from 203 African languages. OATS is motivated by the desire to query data in the knowledge base via IPA or native orthography, and for error checking of digitized data and conversion between transcription systems. The model in this paper implements these goals.

1 Introduction

The World Wide Web has emerged as the predominant source for obtaining linguistic field data and language documentation in textual, audio and video formats. A simple keyword search on the nearly extinct language Livonian [liv]¹ returns numerous results that include text, audio and video files. As data on the Web continue to increase, including material posted by native language communities, researchers are presented with an ideal medium for the automated discovery and analysis of linguistic data (e.g. Lewis, 2006). However, resources on the Web are not always accessible to users or software agents. The data often exist in legacy or proprietary software and data formats. This makes them difficult to locate and access.

¹ ISO 639-3 language codes are in [].

Interoperability of linguistic resources can make disparate linguistic data accessible to researchers. It is also beneficial for data aggregation. Through the use of ontologies, applications can be written to perform intelligent search (deriving implicit knowledge from explicit information). They can also interoperate between resources, thus allowing data to be shared across applications and between research communities with different terminologies, annotations, and notations for marking up data.

OATS is a knowledge base, i.e. a data source that uses an ontology to specify the structure of entities and their relations. It includes general knowledge of writing systems and transcription systems that are core to the General Ontology of Linguistic Description (GOLD)² (Farrar and Langendoen, 2003). Other portions of OATS, including the relationships encoded for relating segments of transcription systems, or the computational representations of these elements, extend GOLD as a Community of Practice Extension (COPE) (Farrar and Lewis, 2006). OATS provides interoperability for transcription systems and practical orthographies that map phones and phonemes in unique relationships to their graphemic representations. These systematic mappings thus provide a computationally tractable starting point for interoperating over linguistic texts. The resources that are targeted also encompass a wide array of data on lesser-studied languages of the world, as well as low density languages, i.e. those with few electronic resources (Baldwin et al., 2006).

² http://linguistics-ontology.org/

This paper is structured as follows: in section 2, linguistic and technological definitions and terminology are provided. In section 3, the theoretical and technological challenges of interoperating over heterogeneous transcription systems are described. The technologies used in OATS and its design are presented in section 4. In section 5, OATS' implementation is illustrated with linguistic data that was mined from the Web, thereby motivating the general design objectives taken into account in its development.
Section 6 concludes with future research goals.

2 Conventions and Terminology

2.1 Conventions

Standard conventions are used for distinguishing between graphemic < >, phonemic / / and phonetic representations [ ].³ For character data information, I follow the Unicode Standard's notational conventions (The Unicode Consortium, 2007). Character names are represented in small capital letters (e.g. LATIN SMALL LETTER SCHWA) and code points are expressed as 'U+n' where n is a four to six digit hexadecimal number (e.g. U+0259), which is rendered as <ə>.

³ Phonemic and phonetic representations are given in the International Phonetic Alphabet (IPA).

2.2 Linguistic definitions

In the context of this paper, a transcription system is a system of symbols and rules for graphically transcribing the sounds of a language variety. A practical orthography is a phonemic writing system designed for practical use by speakers already competent in the language. The mapping relation between phonemes and graphemes in practical orthographies is purposely shallow, i.e. there is a faithful mapping from a unique sound to a unique symbol.⁴ The IPA is often used by field linguists in the development of practical orthographies for languages without writing systems. An orthography specifies the symbols, punctuation, and the rules by which a language is correctly written in a standardized way. All orthographies are language specific.

⁴ Practical orthographies are intended to jump-start written materials development by correlating a writing system with its sound units, making it easier for speakers to master and acquire literacy.

Practical orthographies and transcription systems are both kinds of writing systems. A writing system is a symbolic system that uses visible or tactile signs to represent a language in a systematic way. Differences in the encoding of meaning and sound form a continuum for representing writing systems in a typology whose categories are commonly referred to as either logographic, syllabic, phonetic or featural. A logographic system denotes symbols that visually represent morphemes (and sometimes morphemes and syllables). A syllabic system uses symbols to denote syllables. A phonetic system represents sound segments as symbols. Featural systems are less common and encode phonological features within the shapes of the symbols represented in the script.

The term script refers to a collection of symbols (or distinct marks) as employed by a writing system. The term script is often confused with, and used interchangeably with, 'writing system'. A writing system may be written with different scripts, e.g. the alphabetic writing system can be written in Roman and Cyrillic scripts (Coulmas, 1999). A grapheme is the unit of writing that represents a particular abstract representation of a symbol employed by a writing system. Just as the phoneme is an abstract representation of a distinct sound in a language, a grapheme is a contrastive graphical unit in a writing system. A grapheme is the basic, minimally distinctive symbol of a writing system. A script may employ multiple graphemes to represent a single phoneme, e.g. the graphemes <c> and <h>, when conjoined in English, represent one phoneme: <ch>, pronounced /tʃ/ (or /k/). The opposite is also found in writing systems, where a single grapheme represents two or more phonemes, e.g. <x> in English is a combination of the phonemes /ks/.

A graph is the smallest unit of written language (Coulmas, 1999). The electronic counterpart of the graph is the glyph. Glyphs represent the variation of graphemes as they appear when rendered or displayed. In typography, glyphs are created using different illustration techniques. These may result in homoglyphs, pairs of characters with shapes that are either identical or are beyond differentiation by swift visual inspection. When rendered by hand, a writer may use different styles of handwriting to produce glyphs in standard handwriting, cursive, or calligraphy. When rendered computationally, a repertoire of glyphs makes up a font.

A final distinction is needed for interoperating over transcription systems. The term scripteme is used for the use of a grapheme within a writing system with the particular semantics (i.e., pronunciation) it is assigned within that writing system. The notion scripteme is needed because graphemes may be homoglyphic across scripts and languages, and the semantics of a grapheme is dependent on the writing system using it. For example, the grapheme <p> in Russian represents a dental or alveolar trill, /r/ in IPA. However, <p> is realized by English speakers as a voiceless bilabial stop /p/. Defining the scripteme is necessary for interoperability because it provides a level for mapping a writing system specific grapheme to the phonological level, allowing the same grapheme to represent different sounds across different transcription and writing systems.

2.3 Technological definitions

A document refers to an electronic document that contains language data. Each document is associated with metadata and one or more transcription systems or practical orthographies. A document's content comprises a set of scriptemes from its transcription system. A mapping relation is an [...]

[...] phoneme /d/ in Sisaala Pasaale (Toupin, 1995).⁵ These three orthographies also differ because of their authors' choices in assigning graphemes to phonemes. In Sisaala Pasaale and Sisaala West- [...]

Table 1: Phoneme-to-grapheme relations

       /kp/   /d/    /tʃ/   /ɪ/   /ʊ/   Tone
sig    kp     d, r   ky     ɩ     ʋ     not marked
sil    kp     d      ch     i     u     accents
ssl    -      d      ky     ɪ     ʊ     accents
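The scripteme notion from section 2.2 can be pictured as a lookup keyed on both the writing system and the grapheme, with IPA as the pivot. The following is a minimal sketch for illustration only, not the OATS implementation: the names SCRIPTEMES, code_point, to_ipa and graphemes_for are hypothetical, the Russian grapheme is assumed to be encoded as CYRILLIC SMALL LETTER ER (U+0440), and the three Sisaala entries are the /tʃ/ graphemes as read from Table 1.

```python
import unicodedata
from typing import Dict, Tuple

# Hypothetical scripteme table: (writing system, grapheme) -> IPA phoneme.
# The same-looking grapheme can carry different semantics per writing system.
SCRIPTEMES: Dict[Tuple[str, str], str] = {
    ("English", "p"): "p",        # LATIN SMALL LETTER P, voiceless bilabial stop
    ("Russian", "\u0440"): "r",   # assumed CYRILLIC SMALL LETTER ER, alveolar trill
    ("sig", "ky"): "tʃ",          # Table 1 rows, keyed by ISO 639-3 code
    ("sil", "ch"): "tʃ",
    ("ssl", "ky"): "tʃ",
}


def code_point(char: str) -> str:
    """Format a single character in the 'U+n' notation used in section 2.1."""
    return f"U+{ord(char):04X}"


def to_ipa(writing_system: str, grapheme: str) -> str:
    """Resolve a grapheme to IPA within one writing system."""
    return SCRIPTEMES[(writing_system, grapheme)]


def graphemes_for(ipa: str) -> Dict[str, str]:
    """Query via the IPA pivot: which grapheme does each writing system use?"""
    return {ws: g for (ws, g), phoneme in SCRIPTEMES.items() if phoneme == ipa}


if __name__ == "__main__":
    # Homoglyphs: visually near-identical, but distinct code points and phonemes.
    for ws, g in (("English", "p"), ("Russian", "\u0440")):
        print(ws, code_point(g), unicodedata.name(g), "/" + to_ipa(ws, g) + "/")
    print(graphemes_for("tʃ"))
```

Keying on the pair rather than on the grapheme alone is what lets one symbol map to different sounds, and it also supports the reverse query from an IPA segment to each orthography's native spelling.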
