
An Ontology for Accessing Transcription Systems (OATS)

Steven Moran
University of Washington
Seattle, WA, USA

Abstract

This paper presents the Ontology for Accessing Transcription Systems (OATS), a knowledge base that supports interoperation over disparate transcription systems and practical orthographies. The knowledge base includes an ontological description of writing systems and relations for mapping transcription system segments to an interlingua pivot, the IPA. It includes orthographic and phonemic inventories from 203 African languages. OATS is motivated by the desire to query data in the knowledge base via IPA or native orthography, and for error checking of digitized data and conversion between transcription systems. The model in this paper implements these goals.

1 Introduction

The World Wide Web has emerged as the predominant source for obtaining linguistic field data and language documentation in textual, audio and video formats. A simple keyword search on the nearly extinct language Livonian [liv]¹ returns numerous results that include text, audio and video files. As data on the Web continue to increase, including material posted by native language communities, researchers are presented with an ideal medium for the automated discovery and analysis of linguistic data, e.g. (Lewis, 2006). However, resources on the Web are not always accessible to users or software agents. The data often exist in legacy or proprietary software and data formats. This makes them difficult to locate and access.

Interoperability of linguistic resources has the ability to make disparate linguistic data accessible to researchers. It is also beneficial for data aggregation. Through the use of ontologies, applications can be written to perform intelligent search (deriving implicit knowledge from explicit information). They can also interoperate between resources, thus allowing data to be shared across applications and between research communities with different terminologies, annotations, and for marking up data.

OATS is a knowledge base, i.e. a data source that uses an ontology to specify the structure of entities and their relations. It includes general knowledge of writing systems and transcription systems that are core to the General Ontology for Linguistic Description (GOLD)² (Farrar and Langendoen, 2003). Other portions of OATS, including the relationships encoded for relating segments of transcription systems, or the computational representations of these elements, extend GOLD as a Community of Practice Extension (COPE) (Farrar and Lewis, 2006). OATS provides interoperability for transcription systems and practical orthographies that map phones and phonemes in unique relationships to their graphemic representations. These systematic mappings thus provide a computationally tractable starting point for interoperating over linguistic texts. The resources that are targeted also encompass a wide array of data on lesser-studied languages of the world, as well as low density languages, i.e. those with few electronic resources (Baldwin et al., 2006).

This paper is structured as follows: in section 2, linguistic and technological definitions and terminology are provided. In section 3, the theoretical and technological challenges of interoperating over heterogeneous transcription systems are described. The technologies used in OATS and its design are presented in section 4. In section 5, OATS' implementation is illustrated with linguistic data that was mined from the Web, therefore motivating the general design objectives taken into account in its development. Section 6 concludes with future research goals.

¹ ISO 639-3 language codes are in [].
² http://linguistics-ontology.org/

Proceedings of the EACL 2009 Workshop on Language Technologies for African Languages – AfLaT 2009, pages 112–120, Athens, Greece, 31 March 2009. © 2009 Association for Computational Linguistics

2 Conventions and Terminology

2.1 Conventions

Standard conventions are used for distinguishing between graphemic < >, phonemic / / and phonetic representations [ ].³ For data information, I follow the Unicode Standard's notational conventions (The Unicode Consortium, 2007). Character names are represented in small capital letters (e.g. LATIN SMALL LETTER SCHWA) and code points are expressed as 'U+n' where n is a four to six digit hexadecimal number (e.g. U+0259), which is rendered as <ə>.

2.2 Linguistic definitions

In the context of this paper, a transcription system is a system of symbols and rules for graphically transcribing the sounds of a language variety. A practical orthography is a phonemic writing system designed for practical use by speakers already competent in the language. The mapping relation between phonemes and graphemes in practical orthographies is purposely shallow, i.e. there is a faithful mapping from a unique sound to a unique symbol.⁴ The IPA is often used by field linguists in the development of practical orthographies for languages without writing systems. An orthography specifies the symbols, , and the rules in which a language is correctly written in a standardized way. All orthographies are language specific.

Practical orthographies and transcription systems are both kinds of writing systems. A writing system is a symbolic system that uses visible or tactile signs to represent a language in a systematic way. Differences in the encoding of meaning and sound form a continuum for representing writing systems in a typology whose categories are commonly referred to as either logographic, syllabic, phonetic or featural. A logographic system denotes symbols that visually represent morphemes (and sometimes morphemes and syllables). A syllabic system uses symbols to denote syllables. A phonetic system represents sound segments as symbols. Featural systems are less common and encode phonological features within the shapes of the symbols represented in the script.

The term script refers to a collection of symbols (or distinct marks) as employed by a writing system. The term script is confused with and often used interchangeably with ‘’. A writing system may be written with different scripts, e.g. the writing system can be written in Roman and Cyrillic scripts (Coulmas, 1999). A grapheme is the unit of writing that represents a particular abstract representation of a symbol employed by a writing system. Just as the phoneme is an abstract representation of a distinct sound in a language, a grapheme is a contrastive graphical unit in a writing system. A grapheme is the basic, minimally distinctive symbol of a writing system. A script may employ multiple graphemes to represent a single phoneme, e.g. the graphemes <c> and <h> when conjoined in English represent one phoneme, pronounced /ʧ/ (or /k/). The opposite is also found in writing systems, where a single grapheme represents two or more phonemes, e.g. <x> in English is a combination of the phonemes /ks/.

A graph is the smallest unit of written language (Coulmas, 1999). The electronic counterpart of the graph is the glyph. Glyphs represent the variation of graphemes as they appear when rendered or displayed. In typography glyphs are created using different illustration techniques. These may result in homoglyphs, pairs of characters with shapes that are either identical or are beyond differentiation by swift visual inspection. When rendered by hand, a writer may use different styles of handwriting to produce glyphs in standard handwriting, cursive, or calligraphy. When rendered computationally, a repertoire of glyphs makes up a font.

A final distinction is needed for interoperating over transcription systems. The term scripteme is used for the use of a grapheme within a writing system with the particular semantics (i.e., pronunciation) it is assigned within that writing system. The notion scripteme is needed because graphemes may be homoglyphic across scripts and languages, and the semantics of a grapheme is dependent on the writing system using it. For example, the grapheme <p> in Russian represents a dental or alveolar trill; /r/ in IPA. However, <p> is realized by English speakers as a voiceless bilabial stop /p/. The defining of scripteme is necessary for interoperability because it provides a level for mapping a writing system specific grapheme to the phonological level, allowing the same grapheme to represent different sounds across different transcription and writing systems.

³ Phonemic and phonetic representations are given in the International Phonetic Alphabet (IPA).
⁴ Practical orthographies are intended to jump-start written materials development by correlating a writing system with its sound units, making it easier for speakers to master and acquire literacy.
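The notational conventions and the scripteme distinction above can be made concrete with a small sketch. It is illustrative only and is not taken from the OATS code base: the `SCRIPTEMES` table and the helper names `describe` and `to_ipa` are assumptions introduced here, and the only library used is Python's standard `unicodedata` module.

```python
import unicodedata


def describe(char: str) -> str:
    """Format one character as 'U+n NAME', following the Unicode notation."""
    return f"U+{ord(char):04X} {unicodedata.name(char)}"


# Hypothetical scripteme table: the key pairs a writing system with a
# grapheme, so look-alike graphemes can carry different pronunciations.
SCRIPTEMES = {
    ("Russian", "\u0440"): "r",  # Cyrillic letter that resembles Latin <p>
    ("English", "p"): "p",
}


def to_ipa(writing_system: str, grapheme: str) -> str:
    """Return the IPA value a grapheme has within one writing system."""
    return SCRIPTEMES[(writing_system, grapheme)]


if __name__ == "__main__":
    print(describe("\u0259"))
    print(describe("\u0440"), "->", to_ipa("Russian", "\u0440"))
    print(describe("p"), "->", to_ipa("English", "p"))
```

Keying the table on the pair (writing system, grapheme) rather than on the grapheme alone is the design point: the lookup stays unambiguous even when two writing systems share a glyph shape.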

2.3 Technological definitions

A document refers to an electronic document that contains language data. Each document is associated with metadata and one or more transcription systems or practical orthographies. A document's content is comprised of a set of scriptemes from its transcription system. A mapping relation is an unordered pair of a scripteme in a transcription system and its representation in IPA.

OATS first maps scriptemes to their grapheme equivalent(s). Graphemes are then mapped to their character equivalents. A character in OATS is a computational representation of a grapheme. Character encodings represent a range of integers known as the code space. A code point is a unique integer, or point, within this code space. An abstract character is then mapped to a unique code point and rendered as an encoded character and typographically defined by the font used to render it. A set of encoded characters is a character set and different character encodings encode characters as numbers via different encoding schemes.

3 Interoperating Over Transcription Systems

Section 3.1 uses the Sisaala languages to illustrate interoperability challenges posed by linguistic data. Section 3.2 addresses technological issues including encoding and ambiguity.

3.1 Linguistic challenges

Three genetically related languages spoken in Northern Ghana, Sisaala Pasaale [sig], Sisaala Tumulung [sil] and Sisaala Western [ssl], differ slightly in their orthographies for two reasons: they have slightly divergent phonemic inventories and their orthographies may differ graphemically when representing the same phoneme. See Table 1.

Table 1: Phoneme-to-grapheme relations

       /kp/   /d/    /ʧ/   /ɪ/   /ʊ/   Tone
sig    kp     d, r   ky    ɩ     ʋ     not marked
sil    kp     d      ch    i     u     accents
ssl    -      d      ky    ɪ     ʊ     accents

The voiceless labial-velar phoneme /kp/ appears in both Sisaala Tumulung and Sisaala Pasaale, but has been lost in Sisaala Western. There is a convergence of the allophones [d] and [r] into one phoneme /d/ in Sisaala Pasaale (Toupin, 1995).⁵ These three orthographies also differ because of their authors' choices in assigning graphemes to phonemes. In Sisaala Pasaale and Sisaala Western, the phonemes /ʧ/ and /ʤ/ are written as <ky> and . In Sisaala Tumulung, however, these sounds are written <ch> and . Orthography developers may have made these choices for practical reasons, such as ease of learnability or technological limitations (Bodomo, 1997). During the development of practical orthographies for Sisaala Pasaale and Sisaala Western, the digraphs and were chosen because children learn Dagaare [dga] in schools, so they are already familiar with their sounds in the Dagaare orthography (McGill et al., 1999) (Moran, 2008).

Another difference lies in the representation of vowels. Both Sisaala Pasaale and Sisaala Western represent their full sets of vowels orthographically. These orthographies were developed relatively recently, when computers, character encodings, and font support had become less problematic. In Sisaala Tumulung, however, the phonemes /i/ and /ɪ/ are collapsed to <i>, and /u/ and /ʊ/ to <u> (Blass, 1975). Sisaala Tumulung's orthography was developed in the 1970s and technological limitations may have led its developers to collapse these phonemes in the writing system. For example, the Ghana Alphabet Committee's 1990 Report lacks an individual grapheme for the phoneme /ŋ/ for Dagaare. This difficulty of rendering unconventional symbols on typewriters once posed a challenge for orthography development (Bodomo, 1997).

Tone is both lexically and grammatically contrastive in Sisaala languages. In Sisaala Pasaale's official orthography tone is not marked and is not used in native speaker materials. On the other hand, in linguistic descriptions that use this orthography, tone is marked to disambiguate tonal minimal pairs in lexical items and grammatical constructions (McGill, 2004). In the Sisaala (Tumulung)-English dictionary, tone is marked only to disambiguate lexical items (Blass, 1975). In linguistic descriptions of Sisaala Western, noncontrastive tone is marked. When tone is marked, it appears as acute (high tone) and grave (low tone) accents over vowels or nasals.

⁵ The phoneme /d/ has morphologically conditioned allographs <d> (word initial) or <r> (elsewhere) (McGill, 2004).
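The consonant columns of Table 1 can be held in a small lookup structure, which makes the divergence between the three orthographies easy to inspect. This is a minimal sketch, not the representation OATS itself uses: the dictionary `TABLE_1` and the function `graphemes_for` are names invented for illustration, and only the /kp/, /d/ and /ʧ/ columns are transcribed.

```python
from typing import Dict, List

# Consonant columns of Table 1, keyed by ISO 639-3 code and IPA phoneme.
# A phoneme missing from a language's dictionary is absent from its inventory.
TABLE_1: Dict[str, Dict[str, List[str]]] = {
    "sig": {"kp": ["kp"], "d": ["d", "r"], "ʧ": ["ky"]},
    "sil": {"kp": ["kp"], "d": ["d"], "ʧ": ["ch"]},
    "ssl": {"d": ["d"], "ʧ": ["ky"]},
}


def graphemes_for(phoneme: str) -> Dict[str, List[str]]:
    """Map each language that has `phoneme` to the graphemes it writes it with."""
    return {
        lang: inventory[phoneme]
        for lang, inventory in TABLE_1.items()
        if phoneme in inventory
    }


if __name__ == "__main__":
    for ipa in ("kp", "d", "ʧ"):
        print(f"/{ipa}/", graphemes_for(ipa))
```

Looking up /ʧ/ returns both <ky> and <ch>, which is the kind of cross-orthography answer that is hard to obtain from the source documents alone.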

Language researchers would quickly pick up on these minute differences in orthographies. However, what first seem to be trivial differences illustrate one issue of resource discovery on the Web: without methods for interoperability, even slightly divergent resources are more difficult to discover, query and compare. How would someone researching a comparative analysis of /ʧ/ sounds of languages in Northern Ghana discover that it is represented as <ky> and <ch> without first locating the extremely sparse grammatical information available on these languages? Furthermore, automatic phonetic research is possible on languages with shallow orthographies (Zuraw, 2006), but crosslinguistic versions of such work require interoperation over writing systems.

3.2 Technological challenges

The main technological challenges in interoperating over textual electronic resources are: encoding multilingual language text in an interoperable format and resolving ambiguity between mapping relations. These are addressed below.

Hundreds of character encoding sets for writing systems have been developed, e.g. ASCII, GB 18030⁶ and Unicode. Historically, different standards were formalized differently and for different purposes by different standards committees. A lack of interoperability between character encodings ensued. Linguists, restricted to standard character sets that lacked IPA support and other language-specific graphemes that they needed, made their own solutions (Bird and Simons, 2003). Some chose to represent unavailable graphemes with substitutes, e.g. the combination of to represent . Others redefined selected characters from a character encoding to map their own fonts to. One linguist's redefined character set, however, would not render properly on another linguist's computer if they did not share the same font. If two character encodings defined two character sets differently, then data could not be reliably and correctly displayed.

To circumvent these problems, OATS uses the Unicode Standard⁷ for multilingual character encoding of electronic textual data. Unicode encodes 76 scripts and includes the IPA.⁸ In principle this allows OATS to interoperate over IPA and all scripts currently encoded in Unicode. However, writing systems, scripts and transcriptions are often themselves encoded ambiguously.

Unicode encodes characters, not glyphs, in scripts and sometimes unifies duplicate characters across scripts. For example, IPA characters of Greek and Latin origin, such as and are not given a distinct position within Unicode's IPA character block. The Unicode code space is subdivided into character blocks, which generally encode characters from a single script, but as is illustrated by the IPA, characters may be dispersed across several different character blocks. This poses a challenge for interoperation, particularly with regard to homographs. Why shouldn't a speaker of Russian use the CYRILLIC SMALL LETTER A at code point U+0430 for IPA transcription, instead of LATIN SMALL LETTER A at code point U+0061, when visually they are indistinguishable?

Homoglyphs come in two flavors: linguistic and non-linguistic. Linguists are unlikely to distinguish between the <ə> LATIN SMALL LETTER SCHWA at code point U+0259 and <ǝ> LATIN SMALL LETTER TURNED E at U+01DD. And non-linguists are unlikely to differentiate any semantic difference between an open back unrounded vowel <ɑ>, the LATIN SMALL LETTER ALPHA at U+0251, and the open front unrounded vowel <a>, LATIN SMALL LETTER A at U+0061.

Another challenge is how to handle ambiguity in transcription systems and orthographies. In Serbo-Croatian, for example, the digraphs , and represent distinct phonemes and each are comprised of two graphemes, which themselves represent distinct phonemes. Words like ‘to outlive’ are composed of the morphemes , a prefix, and the verb . In this instance the combination of and does not represent a single digraph ; they represent two neighboring phonemes across a morpheme boundary. Likewise in English, the grapheme sequence can be both a digraph as well as a sequence of graphemes, as in and . When parsing words like and both disambiguations are theoretically available. Another example is illustrated by , , and . How should be interpreted before when English gives us both /tɔməs/ ‘Thomas’ and /θioudor/ ‘Theodore’? The Sisaala Western word <niikyuru> ‘waterfall’ could be parsed as /niik.yuru/ instead of /nii.ʧuru/ by speakers unfamiliar with the digraph <ky> of orthographies of Northwestern Ghana.

These ambiguities are due to mapping relations between phonemes and graphemes. Transcription systems and orthographies often have complex grapheme-to-phoneme relationships and they vary in levels of phonological abstraction. The transparency of the relation between spelling and phonology differs between languages like English and French, and say Serbo-Croatian. The former represent deep orthographic systems where the same grapheme can represent different phonemes in different contexts. The latter, a shallow orthography, is less polyvalent in its grapheme-to-phoneme relations. Challenges of ambiguity resolution are particularly apparent in data conversion.

4 Ontological Structure and Design

4.1 Technologies

In Philosophy, Ontology is the study of existence and the meaning of being. In the Computer and Information Sciences, ontology has been co-opted to represent a data model that represents concepts within a certain domain and the relationships between those concepts. At a low level an ontology is a taxonomy and a set of inference rules. At a higher level, ontologies are collections of information that have formalized relationships that hold between entities in a given domain. This provides the basis for automated reasoning by computer software, where content is given meaning in the sense of interpreting data and disambiguating entities. This is the vision of the Semantic Web,⁹ a common framework for integrating and correlating linked data from disparate resources for interoperability (Beckett, 2004). The General Ontology for Linguistic Description (GOLD) is grounded in the Semantic Web and provides a foundation for the interoperability of linguistic annotation to enable intelligent search across linguistic resources (Farrar and Langendoen, 2003). Several technologies are integral to the architecture of the Semantic Web, including Unicode, XML,¹⁰ and the Resource Description Framework (RDF).¹¹ OATS has been developed with these technologies and uses SPARQL¹² to query the knowledge base of linked data.

The Unicode Standard is the standard text encoding for the Web, the recommended best-practice for encoding linguistic resources, and the underlying encoding for OATS. XML is a general purpose specification for markup languages and provides a structured language for data exchange (Yergeau, 2006). It is the most widely used implementation for descriptive markup, and is in fact so extensible that its structure does not provide functionality for encoding explicit relationships across documents. Therefore RDF is needed as the syntax for representing information about resources on the Web and it is itself written in XML and is serializable. RDF describes resources in the form subject-predicate-object (or entity-relationship-entity) and identifies unique resources through Uniform Resource Identifiers (URIs). In this manner, RDF encodes meaning in sets of triples that resemble subject-verb-object constructions. These triples form a graph data structure of nodes and arcs that are non-hierarchical and can be complexly connected. Numerous algorithms have been written to access and manipulate graph structures. Since all URIs are unique, each subject, object and predicate are uniquely defined resources that can be referred to and reused by anyone. URIs give users flexibility in giving concepts a semantic representation. However, if two individuals are using different URIs for the same concept, then a procedure is needed to know that these two objects are indeed equivalent. A common example in linguistic annotation is the synonymous use of genitive and possessive. By incorporating domain specific knowledge into an ontology in RDF, disambiguation and interoperation over data becomes possible. GOLD addresses the challenge of interoperability of disparate linguistic annotation and termsets in morphosyntax by functioning as an interlingua between them. In OATS, the interlingua between systems of transcription is the IPA.

⁶ Guójiā Biāozhǔn, the national standard character set for the People's Republic of China
⁷ ISO/IEC 10646
⁸ http://www.unicode.org/Public/UNIDATA/Scripts.txt
⁹ http://www.w3.org/2001/sw/
¹⁰ http://www.w3.org/XML/
¹¹ http://www.w3.org/RDF/
¹² http://www.w3.org/TR/rdf-sparql-query/
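The subject-predicate-object form described above can be sketched without any RDF library, using plain tuples and a wildcard pattern match. This is a simplified stand-in for a triple store, not RDFLib and not the OATS model: the namespace `EX`, the predicate names `inWritingSystem` and `mapsToIPA`, and the resource names are all hypothetical.

```python
from typing import List, Optional, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical namespace; every URI below is invented for illustration.
EX = "http://example.org/oats-sketch#"

TRIPLES: Set[Triple] = {
    (EX + "ssl_ky", EX + "inWritingSystem", EX + "SisaalaWestern"),
    (EX + "ssl_ky", EX + "mapsToIPA", "ʧ"),
    (EX + "sil_ch", EX + "inWritingSystem", EX + "SisaalaTumulung"),
    (EX + "sil_ch", EX + "mapsToIPA", "ʧ"),
}


def match(s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> List[Triple]:
    """Return all triples that fit the pattern; None acts as a wildcard."""
    return sorted(
        t for t in TRIPLES
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def subjects_mapped_to(ipa: str) -> List[str]:
    """List the scripteme URIs whose IPA value is `ipa`."""
    return [s for s, _, _ in match(p=EX + "mapsToIPA", o=ipa)]


if __name__ == "__main__":
    for uri in subjects_mapped_to("ʧ"):
        print(uri)
```

Because every node is named by a URI, two scriptemes from different writing systems can point at the same IPA value, and the pattern match recovers both without knowing either orthography in advance.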

4.2 IPA as interlingua

OATS uses the IPA as an interlingua (or pivot) to which elements of systems of transcription are mapped. The IPA was chosen for its broad coverage of the sounds of the world's languages, its mainstream adoption as a system for transcription by linguists, and because it is encoded (at least mostly) in Unicode. The pivot component resides at the Character ID entity, which is in a one-to-one relationship with a Unicode Character. The Character ID entity is provided for mapping characters to multiple character encodings. This is useful for mapping IPA characters to legacy character encoding sets like IPA Kiel and SIL IPA93, allowing for data conversion between character encodings. The IPA also encodes phonetic segments as small feature bundles. Phonological theories extend the idea and interpretation of proposed feature sets, an area of debate within Linguistics. These issues should be taken into consideration when encoding interoperability via an interlingua, and should be leveraged to expand current theoretical questions that can be asked of the knowledge base. Character semantics also require consideration (Gibbon et al., 2005). Glyph semantics provide implicit information such as a resource's language, its language family assignment, its use by a specific social or scientific group, or corporate identity (Trippel et al., 2007). Documents with IPA characters or in legacy IPA character encodings provide semantic knowledge regarding the document's content, namely, that it contains transcribed linguistic data.

4.3 Ontological design

OATS consists of the following ontological classes: Character, Grapheme, Document, Mapping, MappingSystem, WritingSystem, and Scripteme. WritingSystem is further subdivided into OrthographicSystem and TranscriptionSystem. Each Document is associated with the OLAC Metadata Set,¹³ an extension of the Dublin Core Type Vocabulary¹⁴ for linguistic resources. This includes uniquely identifying the language represented in the document with its ISO 639-3 three letter language code. Each Document is also associated with an instance of WritingSystem. Each TranscriptionSystem is a set of instances of Scripteme. Every Scripteme instance is in a Mapping relation with its IPA counterpart. The MappingSystem contains a list of TranscriptionSystem instances that have Scripteme instances mapped to IPA. The Grapheme class provides the mapping between Scripteme and Character. The Character class is the set of Unicode characters and contains the Unicode version number, character name, HTML entity and code point.

5 Implementation

5.1 Data

The African language data used in OATS were mined from Systèmes alphabétiques des langues africaines,¹⁵ an online database of Alphabets des langues africaines (Hartell, 1993). Additional languages were added by hand. Currently, OATS includes 203 languages from 23 language families. Each language contains its phonemic and orthographic inventories.

5.2 Query

Linguists gain unprecedented access to linguistic resources when they are able to query across disparate data in standardized notations regardless of how the data in those resources is encoded. Currently OATS contains two phonetic notations for querying: IPA and X-SAMPA. To illustrate the querying functionality currently in place, the IPA is used to query the knowledge base of African language data¹⁶ for the occurrence of two segments. The first is the voiced palatal nasal /ɲ/. The results are captured in Table 2.

Table 2: Occurrences of voiced palatal nasal /ɲ/

Grapheme   Languages   % of Data
           114         84%
           11          8%
<ñ>        8           6%
           2           1%
           1           .05%

The voiced palatal nasal /ɲ/ is accounted for in 136 languages, or roughly 67% of the 203 languages queried.

¹³ http://www.language-archives.org/OLAC/metadata.html
¹⁴ http://dublincore.org/usage/terms/dcmitype/
¹⁵ http://sumale.vjf.cnrs.fr/phono/
¹⁶ For a list of these languages, see http://phoible.org
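A query through the IPA pivot, of the kind summarized in the table above, amounts to collecting every scripteme mapped to one IPA segment and counting the graphemes. The sketch below models two of the ontological classes as plain Python dataclasses and uses the three Sisaala orthographies rather than the full 203-language sample; the field names and the function `grapheme_distribution` are assumptions for illustration, and the real system performs this step with SPARQL over RDF.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Scripteme:
    """A grapheme together with the IPA value it has in one writing system."""
    grapheme: str
    ipa: str


@dataclass(frozen=True)
class TranscriptionSystem:
    """A writing system identified by ISO 639-3 code, as a set of scriptemes."""
    iso_code: str
    scriptemes: Tuple[Scripteme, ...]


def grapheme_distribution(systems: List[TranscriptionSystem],
                          ipa: str) -> List[Tuple[str, int, int]]:
    """Return (grapheme, language count, rounded percent) rows for one segment."""
    counts = Counter(
        s.grapheme for ts in systems for s in ts.scriptemes if s.ipa == ipa
    )
    total = sum(counts.values())
    return [(g, n, round(100 * n / total)) for g, n in counts.most_common()]


SYSTEMS = [
    TranscriptionSystem("sig", (Scripteme("kp", "kp"), Scripteme("ky", "ʧ"))),
    TranscriptionSystem("sil", (Scripteme("kp", "kp"), Scripteme("ch", "ʧ"))),
    TranscriptionSystem("ssl", (Scripteme("ky", "ʧ"),)),
]

if __name__ == "__main__":
    for grapheme, languages, percent in grapheme_distribution(SYSTEMS, "ʧ"):
        print(f"<{grapheme}>  {languages}  {percent}%")
```

The rows come back ordered by frequency, which mirrors the layout of the occurrence tables in this section.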

Orthographically the voiced palatal nasal /ɲ/ is represented as , , <ñ>, , and interestingly as . The two languages containing , Koonzime [ozm] and Akoose [bss] of Cameroon, both lack a phonemic /ŋ/. In these languages' orthographies, both and are used to represent the phoneme /ɲ/. With further investigation, one can determine if they are contextually determined allographs like the <d> and <r> in Sisaala Pasaale.

The second simple query retrieves the occurrence of the voiced alveo-palatal affricate /ʤ/. Table 3 displays the results from the same sample of languages.

Table 3: Occurrences of voiced alveo-palatal affricate /ʤ/

Grapheme   Languages   % of Data
           84          92%
           2           2%
           2           2%
           1           1%
<ʤ>        1           1%
           1           1%

The voiced alveo-palatal affricate /ʤ/ is accounted for in 92 languages, or 45%, of the 203 languages sampled. The majority, roughly 92%, use the same grapheme to represent /ʤ/. Other graphemes found in the language sample include , , , <ʤ>, and . The stands out in this data sample. Interestingly, it comes from Sudanese Arabic, which uses Latin-based characters in its orthography. It contains the phonemes /g/, /ɣ/, and /ʤ/, which are graphemically represented as , and .

These are rather simplistic examples, but the graph data structure of RDF and the power of SPARQL provide an increasingly complex system for querying any data stored in the knowledge base and relationships as encoded by its ontological structure. For example, by combining queries such as ‘which languages have the phoneme /gb/’ and ‘of those languages which lack its voiceless counterpart /kp/’, 11 results are found from this sample of African languages, as outlined in Table 4.

Table 4: Occurrence of /gb/ and lack of /kp/

Code   Language Name   Genetic Affiliation
emk    Maninkakan      Mande
kza    Karaboro        Gur
lia    Limba           Atlantic
mif    Mofu-Gudur      Chadic
sld    Sissala         Gur
ssl    Sisaala         Gur
sus    Susu            Mande
ted    Krumen          Kru
tem    Themne          Atlantic
tsp    Toussian        Gur

5.3 Querying for phonetic data via orthography

The ability to query the knowledge base via a language-specific orthography is ultimately the same task as querying the knowledge base via the pivot. In this case, however, a mapping relation from the language-specific grapheme to IPA is first established. Since all transcription systems' graphemes must have an IPA counterpart, this relationship is always available. A query is then made across all relevant mapping relations from IPA to languages within the knowledge base.

For example, a user familiar with the Sisaala Western orthography queries the knowledge base for languages with <ky>. Initially, the OATS system establishes the relationship between <ky> and its IPA counterpart. In this case, <ky> represents the voiceless alveo-palatal affricate /ʧ/. Having retrieved the IPA counterpart, the query next retrieves all languages that have /ʧ/ in their phonemic inventories. In the present data sample, this query retrieves 97 languages with the phonemic voiceless alveo-palatal affricate. If the user then wishes to compare the graphemic distributions of /ʧ/ and /ʤ/, which was predominately , these results are easily provided. They are displayed in Table 5.

The 97 occurrences of /ʧ/ account for five more than the 92 languages sampled in section 5.2 that had its voiced alveo-palatal affricate counterpart. Such information provides statistics for phoneme distribution across languages in the knowledge base. OATS is a powerful tool for gathering such knowledge about the world's languages.
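The two-step lookup described in section 5.3 (grapheme to IPA, then IPA to languages) and the combined "has one phoneme but lacks another" query from section 5.2 can be sketched over a toy inventory. The data below cover only the three Sisaala orthographies discussed earlier, and the names `INVENTORIES`, `query_by_grapheme` and `has_but_lacks` are hypothetical; the paper's own queries are written in SPARQL against the RDF graph.

```python
from typing import Dict, List, Tuple

# Toy knowledge base: ISO 639-3 code -> {grapheme: IPA value}.
INVENTORIES: Dict[str, Dict[str, str]] = {
    "sig": {"kp": "kp", "ky": "ʧ"},
    "sil": {"kp": "kp", "ch": "ʧ"},
    "ssl": {"ky": "ʧ"},
}


def languages_with(ipa: str) -> List[str]:
    """Return the codes of all languages whose inventory contains `ipa`."""
    return sorted(c for c, inv in INVENTORIES.items() if ipa in inv.values())


def query_by_grapheme(iso_code: str, grapheme: str) -> Tuple[str, List[str]]:
    """Resolve a language-specific grapheme to IPA, then query via the pivot."""
    ipa = INVENTORIES[iso_code][grapheme]
    return ipa, languages_with(ipa)


def has_but_lacks(has: str, lacks: str) -> List[str]:
    """Return languages that have one phoneme but not another."""
    return [c for c in languages_with(has) if c not in languages_with(lacks)]


if __name__ == "__main__":
    print(query_by_grapheme("ssl", "ky"))
    print(has_but_lacks("ʧ", "kp"))
```

Starting from Sisaala Western <ky>, the sketch reaches Sisaala Tumulung even though that language writes the same sound as <ch>, which is the point of routing every query through the pivot.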

Table 5: Occurrences of voiceless alveo-palatal affricate /ʧ/

Grapheme   Languages   % of Data
           60          62%
<ch>       28          29%
           3           3%
           2           2%
<ʧ>        1           1%
           1           1%
           1           1%
           1           1%

5.4 Code

There were two main steps in the implementation of OATS. The first was the design and creation of the OATS RDF model. This task was undertaken using Protégé,¹⁷ an open source ontology editor developed by Stanford Center for Biomedical Informatics Research. The use of Protégé was primarily to jump start the design and implementation of the ontology. The software provides a user interface for ontology modeling and development, and exports the results into RDF. After the architecture was in place, the second step was the development of a code base in Python¹⁸ for gathering data and working with RDF. This code base includes two major pieces. The first was the development of a scraper, which was used to gather phonemic inventories off of the Web by downloading Web pages and scraping them for relevant contents. Each language was collected with its ISO 639-3 code, its orthographic inventory, and the mapping relation between these symbols and their IPA phonemic symbols. The second chunk of the code base provides functionality for working with the RDF graph and uses RDFLib,¹⁹ an RDF Python module. The code includes scripts that add all relevant language data that was scraped from the Web to the OATS RDF graph, fill the graph with the Unicode database character tables, and provide SPARQL queries for querying the graph as illustrated above. There is also Python code for using OATS to convert between two character sets, and for error checking of characters within a document that are not in the target set.

6 Conclusion and Future Work

OATS is a knowledge base that supports interoperation over disparate transcription systems. By leveraging technologies for ontology description, query, and multilingual character encoding, OATS is designed to facilitate resource discovery and intelligent search over linguistic data. The current knowledge base includes an ontological description of writing systems and specifies relations for mapping segments of transcription systems to their IPA equivalents. IPA is used as the interlingua pivot that provides the ability to query across all resources in the knowledge base. OATS' data source includes 203 African languages' orthographic and phonemic inventories.

The case studies proposed and implemented in this paper present functionality to use OATS to query all data in the knowledge base via standards like the IPA. OATS also supports query via any transcription system or practical orthography in the knowledge base. Another outcome of the OATS project is the ability to check for inconsistencies in digitized lexical data. The system could also test linguist-proposed phonotactic constraints and look for exceptions in data. Data from grapheme-to-phoneme mappings, phonotactics and character encodings can provide an orthographic profile/model of a transcription or writing system. This could help to bootstrap software and resource development for low-density languages. OATS also provides prospective uses for document conversion and development of probabilistic models of orthography-to-phoneme mappings.

¹⁷ http://protege.stanford.edu/
¹⁸ http://python.org
¹⁹ http://rdflib.net/

Acknowledgements

This work was supported in part by the Max-Planck-Institut für evolutionäre Anthropologie and thanks go to Bernard Comrie, Jeff Good and Michael Cysouw. For useful comments and reviews, I thank Emily Bender, Scott Farrar, Sharon Hargus, Will Lewis, Richard Wright, and three anonymous reviewers.

References

Timothy Baldwin, Steven Bird, and Baden Hughes. 2006. Collecting Low-Density Language Materials on the Web. In Proceedings of the 12th Australasian World Wide Web Conference (AusWeb06).

David Beckett. 2004. RDF/XML Syntax Specification (Revised). Technical report, W3C.

Steven Bird and Gary F. Simons. 2003. Seven Dimensions of Portability for Language Documentation and Description. Language, 79(3):557–582.

Regina Blass. 1975. Sisaala-English, English-Sisaala Dictionary. Institute of Linguistics, Tamale, Ghana.

Adams Bodomo. 1997. The Structure of Dagaare. Stanford Monographs in African Languages. CSLI Publications.

Florian Coulmas. 1999. The Blackwell Encyclopedia of Writing Systems. Blackwell Publishers.

Scott Farrar and Terry Langendoen. 2003. A Linguistic Ontology for the Semantic Web. GLOT, 7(3):97–100.

Scott Farrar and William D. Lewis. 2006. The GOLD Community of Practice: An Infrastructure for Linguistic Data on the Web. In Language Resources and Evaluation.

Dafydd Gibbon, Baden Hughes, and Thorsten Trippel. 2005. Semantic Decomposition of Character Encodings for Linguistic Knowledge Discovery. In Proceedings of Jahrestagung der Gesellschaft für Klassifikation 2005.

Rhonda L. Hartell. 1993. Alphabets des langues africaines. UNESCO and Société Internationale de Linguistique.

William D. Lewis. 2006. ODIN: A Model for Adapting and Enriching Legacy Infrastructure. In Proceedings of the e-Humanities Workshop, held in cooperation with e-Science 2006: 2nd IEEE International Conference on e-Science and Grid Computing.

Stuart McGill, Samuel Fembeti, and Mike Toupin. 1999. A Grammar of Sisaala-Pasaale, volume 4 of Language Monographs. Institute of African Studies, University of Ghana, Legon, Ghana.

Stuart McGill. 2004. Focus and Activation in Paasaal: the particle rɛ. Master's thesis, University of Reading.

Steven Moran. 2008. A Grammatical Sketch of Isaalo (Western Sisaala). VDM.

The Unicode Consortium. 2007. The Unicode Standard, Version 5.0. Boston, MA, Addison-Wesley.

Mike Toupin. 1995. The Phonology of Sisaale Pasaale. Collected Language Notes, 22.

Thorsten Trippel, Dafydd Gibbon, and Baden Hughes. 2007. The Computational Semantics of Characters. In Proceedings of the Seventh International Workshop on Computational Semantics (IWCS-7), pages 324–329.

Francois Yergeau. 2006. Extensible Markup Language (XML) 1.0 (Fourth Edition). W3C Recommendation 16 August 2006, edited in place 29 September 2006.

Kie Zuraw. 2006. Using the Web as a Phonological Corpus: a case study from Tagalog. In Proceedings of the 2nd International Workshop on Web as Corpus.
