EACL 2009

Proceedings of the EACL 2009 Workshop on Language Technologies for African Languages

AfLaT 2009

31 March 2009
Megaron Athens International Conference Centre
Athens, Greece

Production and Manufacturing by
TEHNOGRAFIA DIGITAL PRESS
7 Ektoros Street
152 35 Vrilissia
Athens, Greece

This workshop is sponsored by

© 2009 The Association for Computational Linguistics
ISBN 1-932432-25-6

Order copies of this and other ACL proceedings from:

Association for Computational Linguistics (ACL) 209 N. Eighth Street Stroudsburg, PA 18360 USA Tel: +1-570-476-8006 Fax: +1-570-476-0860 [email protected]

Preface

In multilingual situations, language technologies are crucial for providing access to information and opportunities for economic development. With somewhere between 1,000 and 2,000 different languages, Africa is a multilingual continent par excellence and presents acute challenges for those seeking to promote and use African languages in the areas of business development, education, research, and relief aid.

In recent times a number of researchers and institutions, both from Africa and elsewhere, have come forward to share the common goal of developing capabilities in language technologies for the African languages. The goal of the workshop, then, is to provide a forum to meet and share the latest developments in this field. It also seeks to attract linguists who specialize in African languages and would like to leverage the tools and approaches of computational linguistics, as well as computational linguists who are interested in learning about the particular linguistic challenges posed by the African languages.

The workshop consists of an invited talk on African language families and their structural properties by Prof. Sonja Bosch (UNISA, South Africa), followed by refereed research papers in computational linguistics. The call for papers specified that the focus would be on the less-commonly studied and lesser-resourced languages, such as those of sub-Saharan Africa. These could include languages from all four families: Niger-Congo, Nilo-Saharan, Khoisan and Afro-Asiatic, with the exception of Arabic (which is covered by the ‘Computational Approaches to Semitic Languages’ workshop). Variants of European languages such as African French, African English or were also excluded from the call. The call was well answered, with 24 proposals being submitted. Following a rigorous review process, nine were accepted as full papers, and a further seven as poster presentations. These proceedings contain the double-blind, peer-reviewed texts of the accepted contributions.

We would like to thank the US Army RDECOM International Technology Center - Atlantic for generously sponsoring this workshop.

We wish you a very enjoyable time at AfLaT 2009!

— The Organizers.


AfLaT 2009 Organizers

Proceedings Editors:

Guy De Pauw, Gilles-Maurice de Schryver, Lori Levin

Organizers:

Lori Levin, Language Technologies Institute, Carnegie Mellon University (USA)
Guy De Pauw, CNTS - Language Technology Group, University of Antwerp (Belgium) - School of Computing and Informatics, University of Nairobi (Kenya) - AfLaT.org
John Kiango, Institute of Kiswahili Research, University of Dar Es Salaam (Tanzania)
Judith Klavans, Institute for Advanced Computer Studies, University of Maryland (USA)
Manuela Noske, Microsoft Corporation (USA)
Gilles-Maurice de Schryver, African Languages and Cultures, Ghent University (Belgium) - Xhosa Department, University of the Western Cape (South Africa) - AfLaT.org
Peter Waiganjo Wagacha, School of Computing and Informatics, University of Nairobi (Kenya) - AfLaT.org

Program Committee:

Alan Black, Carnegie Mellon University (USA)
Sonja Bosch, University of South Africa (South Africa)
Christopher Cieri, University of Pennsylvania, Linguistic Data Consortium (USA)
Guy De Pauw, University of Antwerp (Belgium)
Gilles-Maurice de Schryver, Ghent University (Belgium)
Robert Frederking, Carnegie Mellon University (USA)
Dafydd Gibbon, University of Bielefeld (Germany)
Jeff Good, SUNY Buffalo (USA)
Mike Gasser, Indiana University (USA)
Gregory Iverson, University of Maryland, Center for Advanced Study of Language (USA)
Stephen Larocca, US Army Research Lab (USA)
Michael Maxwell, University of Maryland, Center for Advanced Study of Language (USA)
Manuela Noske, Microsoft Corporation (USA)
Tristan Purvis, University of Maryland, Center for Advanced Study of Language (USA)
Clare Voss, US Army Research Lab (USA)
Peter Waiganjo Wagacha, University of Nairobi (Kenya)
Briony Williams, University of Wales, Bangor (Great Britain)

Invited Speaker:

Sonja Bosch, Department of African Languages, University of South Africa (South Africa)


Table of Contents

Collecting and Evaluating Speech Recognition Corpora for Nine Southern Bantu Languages
Jaco Badenhorst, Charl Van Heerden, Marelie Davel and Etienne Barnard

The SAWA Corpus: A Parallel Corpus English - Swahili
Guy De Pauw, Peter Waiganjo Wagacha and Gilles-Maurice de Schryver

Information Structure in African Languages: Corpora and Tools
Christian Chiarcos, Ines Fiedler, Mira Grubic, Andreas Haida, Katharina Hartmann, Julia Ritz, Anne Schwarz, Amir Zeldes and Malte Zimmermann

A Computational Approach to Yorùbá Morphology
Raphael Finkel and Odetunji Ajadi Odejobi

Using Technology Transfer to Advance Automatic Lemmatisation for Setswana
Hendrik Johannes Groenewald

Part-of-Speech Tagging of Northern Sotho: Disambiguating Polysemous Function Words
Gertrud Faaß, Ulrich Heid, Elsabé Taljard and Danie Prinsloo

Development of an Amharic Text-to-Speech System Using Cepstral Method
Tadesse Anberbir and Tomio Takara

Building Capacities in Human Language Technology for African Languages
Tunde Adegbola

Initial Fieldwork for LWAZI: A Telephone-Based Spoken Dialog System for Rural South Africa
Tebogo Gumede and Madelaine Plauché

Setswana Tokenisation and Computational Verb Morphology: Facing the Challenge of a Disjunctive Orthography
Rigardt Pretorius, Ansu Berg, Laurette Pretorius and Biffie Viljoen

Interlinear Glossing and its Role in Theoretical and Descriptive Studies of African and other Lesser-Documented Languages
Dorothee Beermann and Pavel Mihaylov

Towards an Electronic Dictionary of Tamajaq Language in Niger
Chantal Enguehard and Issouf Modi

A Repository of Free Lexical Resources for African Languages: The Project and the Method
Piotr Bański and Beata Wójtowicz

Exploiting Cross-Linguistic Similarities in Zulu and Xhosa Computational Morphology
Laurette Pretorius and Sonja Bosch

Methods for Amharic Part-of-Speech Tagging
Björn Gambäck, Fredrik Olsson, Atelach Alemu Argaw and Lars Asker

An Ontology for Accessing Transcription Systems (OATS)
Steven Moran


Conference Program

Tuesday, March 31, 2009

09:00–10:30 Invited Talk: “African Language Families and their Structural Properties” by Sonja Bosch

10:30–11:00 Coffee Break

Session 1: Corpora (11:00–12:30)

11:00–11:30 Collecting and Evaluating Speech Recognition Corpora for Nine Southern Bantu Languages
Jaco Badenhorst, Charl Van Heerden, Marelie Davel and Etienne Barnard

11:30–12:00 The SAWA Corpus: A Parallel Corpus English - Swahili
Guy De Pauw, Peter Waiganjo Wagacha and Gilles-Maurice de Schryver

12:00–12:30 Information Structure in African Languages: Corpora and Tools
Christian Chiarcos, Ines Fiedler, Mira Grubic, Andreas Haida, Katharina Hartmann, Julia Ritz, Anne Schwarz, Amir Zeldes and Malte Zimmermann

12:30–14:00 Lunch Break

Session 2: Morphology, Speech and Part-of-Speech Tagging (14:00–16:00)

14:00–14:30 A Computational Approach to Yorùbá Morphology
Raphael Finkel and Odetunji Ajadi Odejobi

14:30–15:00 Using Technology Transfer to Advance Automatic Lemmatisation for Setswana
Hendrik Johannes Groenewald

15:00–15:30 Part-of-Speech Tagging of Northern Sotho: Disambiguating Polysemous Function Words
Gertrud Faaß, Ulrich Heid, Elsabé Taljard and Danie Prinsloo

15:30–16:00 Development of an Amharic Text-to-Speech System Using Cepstral Method
Tadesse Anberbir and Tomio Takara

16:00–16:30 Coffee Break


Session 3: General Papers and Discussion (16:30–18:00)

16:30–17:00 Building Capacities in Human Language Technology for African Languages
Tunde Adegbola

17:00–17:30 Initial Fieldwork for LWAZI: A Telephone-Based Spoken Dialog System for Rural South Africa
Tebogo Gumede and Madelaine Plauché

17:30–18:00 Discussion

Poster Session (During Coffee and Lunch Breaks)

Setswana Tokenisation and Computational Verb Morphology: Facing the Challenge of a Disjunctive Orthography
Rigardt Pretorius, Ansu Berg, Laurette Pretorius and Biffie Viljoen

Interlinear Glossing and its Role in Theoretical and Descriptive Studies of African and other Lesser-Documented Languages
Dorothee Beermann and Pavel Mihaylov

Towards an Electronic Dictionary of Tamajaq Language in Niger
Chantal Enguehard and Issouf Modi

A Repository of Free Lexical Resources for African Languages: The Project and the Method
Piotr Bański and Beata Wójtowicz

Exploiting Cross-Linguistic Similarities in Zulu and Xhosa Computational Morphology
Laurette Pretorius and Sonja Bosch

Methods for Amharic Part-of-Speech Tagging
Björn Gambäck, Fredrik Olsson, Atelach Alemu Argaw and Lars Asker

An Ontology for Accessing Transcription Systems (OATS)
Steven Moran

Collecting and evaluating speech recognition corpora for nine Southern Bantu languages

Jaco Badenhorst, Charl van Heerden, Marelie Davel and Etienne Barnard
HLT Research Group, Meraka Institute, CSIR, South Africa
[email protected], [email protected], [email protected], [email protected]

Abstract

We describe the Lwazi corpus for automatic speech recognition (ASR), a new telephone speech corpus which includes data from nine Southern Bantu languages. Because of practical constraints, the amount of speech per language is relatively small compared to major corpora in world languages, and we report on our investigation of the stability of the ASR models derived from the corpus. We also report on phoneme distance measures across languages, and describe initial phone recognisers that were developed using this data.

1 Introduction

There is a widespread belief that spoken dialog systems (SDSs) will have a significant impact in the developing countries of Africa (Tucker and Shalonova, 2004), where the availability of alternative information sources is often low. Traditional computer infrastructure is scarce in Africa, but telephone networks (especially cellular networks) are spreading rapidly. In addition, speech-based access to information may empower illiterate or semi-literate people, 98% of whom live in the developing world.

Spoken dialog systems can play a useful role in a wide range of applications. Of particular importance in Africa are applications such as education, using speech-enabled learning software or kiosks, and information dissemination through media such as telephone-based information systems. Significant benefits can be envisioned if information is provided in domains such as agriculture (Nasfors, 2007), health care (Sherwani et al., ; Sharma et al., 2009) and government services (Barnard et al., 2003). In order to make SDSs a reality in Africa, technology components such as text-to-speech (TTS) systems and automatic speech recognition (ASR) systems are required. The latter category of technologies is the focus of the current contribution.

Speech recognition systems exist for only a handful of African languages (Roux et al., ; Seid and Gambäck, 2005; Abdillahi et al., 2006), and to our knowledge no service available to the general public currently uses ASR in an indigenous African language. A significant reason for this state of affairs is the lack of sufficient linguistic resources in the African languages. Most importantly, modern speech recognition systems use statistical models which are trained on corpora of relevant speech (i.e. appropriate for the recognition task in terms of the language used, the profile of the speakers, speaking style, etc.). This speech generally needs to be curated and transcribed prior to the development of ASR systems, and for most applications speech from a large number of speakers is required in order to achieve acceptable system performance. On the African continent, where infrastructure such as computer networks is less developed than in countries such as America, Japan and the European countries, the development of such speech corpora is a significant hurdle to the development of ASR systems.

The complexity of speech corpus development is strongly correlated with the amount of data that is required, since the number of speakers that need to be canvassed and the amount of speech that must be curated and transcribed are major factors in determining the feasibility of such development. In order to minimise this complexity, it is important to have tools and guidelines that can be used to assist in designing the smallest corpora that will be sufficient for typical applications of ASR systems. As minimal corpora can be extended by sharing data across languages, tools are also required to indicate when data sharing will be beneficial and when detrimental.

Proceedings of the EACL 2009 Workshop on Language Technologies for African Languages – AfLaT 2009, pages 1–8, Athens, Greece, 31 March 2009. © 2009 Association for Computational Linguistics

In this paper we describe and evaluate a new speech corpus of South African languages currently under development (the Lwazi corpus) and evaluate the extent to which computational analysis tools can provide further guidelines for ASR corpus design in resource-scarce languages.

2 Project Lwazi

The goal of Project Lwazi is to provide South African citizens with information and information services in their home language, over the telephone, in an efficient and affordable manner. Commissioned by the South African Department of Arts and Culture, the activities of this three-year project (2006-2009) include the development of core language technology resources and components for all the official languages of South Africa, where, for the majority of these, no prior language technology components were available.

The core linguistic resources being developed include phoneme sets, electronic pronunciation dictionaries and the speech and text corpora required to develop automated speech recognition (ASR) and text-to-speech (TTS) systems for all eleven official languages of South Africa. The usability of these resources will be demonstrated during a national pilot planned for the third quarter of 2009. All outputs from the project are being released as open source software and open content (Meraka-Institute, 2009).

Resources are being developed for all nine Southern Bantu languages that are recognised as official languages in South Africa (SA). These languages are: (1) isiZulu (zul¹) and isiXhosa (xho), the two Nguni languages most widely spoken in SA. Together these form the home language of 41% of the SA population. (2) The three Sotho languages: Sepedi (nso), Setswana (tsn), Sesotho (sot), together the home language of 26% of the SA population. (3) The two Nguni languages less widely spoken in SA: siSwati (ssw) and isiNdebele (nbl), together the home language of 4% of the SA population. (4) Xitsonga (tso) and Tshivenda (ven), the home languages of 4% and 2% of the SA population, respectively (Lehohla, 2003). (The other two official languages of South Africa are Germanic languages, namely English (eng) and Afrikaans (afr).)

¹ After each language name, the ISO 639-3:2007 language code is provided in brackets.

For all these languages, new pronunciation dictionaries, text and speech corpora are being developed. ASR speech corpora consist of approximately 200 speakers per language, producing read and elicited speech, recorded over a telephone channel. Each speaker produced approximately 30 utterances; 16 of these were randomly selected from a phonetically balanced corpus and the remainder consist of short words and phrases: answers to open questions, answers to yes/no questions, spelt words, dates and numbers. The speaker population was selected to provide a balanced profile with regard to age, gender and type of telephone (cellphone or landline).

3 Related work

Below, we review earlier work relevant to the development of speech recognisers for languages with limited resources. This includes both ASR system design (Sec. 3.1) and ASR corpus design (Sec. 3.2). In Sec. 3.3, we also review the analytical tools that we utilise in order to investigate corpus design systematically.

3.1 ASR for resource-scarce languages

The main linguistic resources required when developing ASR systems for telephone-based systems are electronic pronunciation dictionaries, annotated audio corpora (used to construct acoustic models) and recognition grammars. An ASR audio corpus consists of recordings from multiple speakers, with each utterance carefully transcribed orthographically and markers used to indicate non-speech and other events important from an ASR perspective. Both the collection of appropriate speech from multiple speakers and the accurate annotation of this speech are resource-intensive processes, and therefore corpora for resource-scarce languages tend to be very small (1 to 10 hours of audio) when compared to the speech corpora used to build commercial systems for world languages (hundreds to thousands of hours per language).

Different approaches have been used to best utilise limited audio resources when developing ASR systems. Bootstrapping has been shown to be a very efficient technique for the rapid development of pronunciation dictionaries, even when utilising linguistic assistants with limited phonetic training (Davel and Barnard, 2004).

Small audio corpora can be used efficiently by utilising techniques that share data across languages, either by developing multilingual ASR systems (a single system that simultaneously recognises different languages), or by using additional source data to supplement the training data that exists in the target language. Various data sharing techniques for language-dependent acoustic modelling have been studied, including cross-language transfer, data pooling, language adaptation and bootstrapping (Wheatley et al., 1994; Schultz and Waibel, 2001; Byrne et al., 2000). Both (Wheatley et al., 1994) and (Schultz and Waibel, 2001) found that useful gains could be obtained by sharing data across languages, with the size of the benefit dependent on the similarity of the sound systems of the languages combined. In the only cross-lingual adaptation study using African languages (Niesler, 2007), similar gains have not yet been observed.

3.2 ASR corpus design

Corpus design techniques for ASR are generally aimed at specifying or selecting the most appropriate subset of data from a larger domain in order to optimise recognition accuracy, often while explicitly minimising the size of the selected corpus. This is achieved through various techniques that aim to include as much variability in the data as possible, while simultaneously ensuring that the corpus matches the intended operating environment as accurately as possible.

Three directions are primarily employed: (1) explicit specification of phonotactic, speaker and channel variability during corpus development, (2) automated selection of informative subsets of data from larger corpora, with the smaller subset yielding comparable results, and (3) the use of active learning to optimise existing speech recognition systems. All three techniques provide a perspective on the sources of variation inherent in a speech corpus, and the effect of this variation on speech recognition accuracy.

In (Nagroski et al., 2003), Principal Component Analysis (PCA) is used to cluster data acoustically. These clusters then serve as a starting point for selecting the optimal utterances from a training database. As a consequence of the clustering technique, it is possible to characterise some of the acoustic properties of the data being analysed, and to obtain an understanding of the major sources of variation, such as different speakers and genders (Riccardi and Hakkani-Tur, 2003).

Active and unsupervised learning methods can be combined to circumvent the need for transcribing massive amounts of data (Riccardi and Hakkani-Tur, 2003). The most informative untranscribed data is selected for a human to label, based on acoustic evidence of a partially and iteratively trained ASR system. From such work, it soon becomes evident that the optimisation of the amount of variation inherent to training data is needed, since randomly selected additional data does not necessarily improve recognition accuracy. By focusing on the selection (based on existing transcriptions) of a uniform distribution across different speech units such as words and phonemes, improvements are obtained (Wu et al., 2007).

In our focus on resource-scarce languages, the main aim is to understand the amount of data that needs to be collected in order to achieve acceptable accuracy. This is achieved through the use of analytic measures of data variability, which we describe next.

3.3 Evaluating phoneme stability

In (Badenhorst and Davel, 2008) a technique is developed that estimates how stable a specific phoneme is, given a specific set of training data. This statistical measure provides an indication of the effect that additional training data will have on recognition accuracy: the higher the stability, the less the benefit of additional speech data.

The model stability measure utilises the Bhattacharyya bound (Fukunaga, 1990), a widely-used upper bound of the Bayes error. If $P_i$ and $p_i(X)$ denote the prior probability and class-conditional density function for class $i$, respectively, the Bhattacharyya bound $\epsilon$ is calculated as:

$$ \epsilon = \sqrt{P_1 P_2} \int \sqrt{p_1(X)\, p_2(X)}\, dX \tag{1} $$

When both density functions are Gaussian with mean $\mu_i$ and covariance matrix $\Sigma_i$, evaluating the integral in Eq. (1) leads to a closed-form expression for $\epsilon$:

$$ \epsilon = \sqrt{P_1 P_2}\, e^{-\mu(1/2)} \tag{2} $$

where

$$ \mu(1/2) = \frac{1}{8} (\mu_2 - \mu_1)^T \left[ \frac{\Sigma_1 + \Sigma_2}{2} \right]^{-1} (\mu_2 - \mu_1) + \frac{1}{2} \ln \frac{\left| \frac{\Sigma_1 + \Sigma_2}{2} \right|}{\sqrt{|\Sigma_1|\,|\Sigma_2|}} \tag{3} $$

is referred to as the Bhattacharyya distance.
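The closed-form expressions in Eqs. (2) and (3) can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used for the experiments: it assumes NumPy, full (square) covariance matrices, and equal priors by default, and the function names `bhattacharyya_distance` and `bhattacharyya_bound` are hypothetical.

```python
import numpy as np


def bhattacharyya_distance(mean1, cov1, mean2, cov2):
    """Bhattacharyya distance mu(1/2) between two Gaussians, following Eq. (3)."""
    mean1 = np.asarray(mean1, dtype=float)
    mean2 = np.asarray(mean2, dtype=float)
    cov1 = np.asarray(cov1, dtype=float)
    cov2 = np.asarray(cov2, dtype=float)

    cov_avg = (cov1 + cov2) / 2.0
    diff = mean2 - mean1

    # First term: (1/8) * diff^T * cov_avg^{-1} * diff, using a linear solve
    # instead of an explicit matrix inverse.
    mean_term = diff @ np.linalg.solve(cov_avg, diff) / 8.0

    # Second term: (1/2) * ln(|cov_avg| / sqrt(|cov1| * |cov2|)),
    # computed from log-determinants to avoid overflow or underflow.
    _, logdet_avg = np.linalg.slogdet(cov_avg)
    _, logdet1 = np.linalg.slogdet(cov1)
    _, logdet2 = np.linalg.slogdet(cov2)
    cov_term = 0.5 * (logdet_avg - 0.5 * (logdet1 + logdet2))

    return float(mean_term + cov_term)


def bhattacharyya_bound(mean1, cov1, mean2, cov2, prior1=0.5, prior2=0.5):
    """Bhattacharyya bound of Eq. (2); equal priors are an assumption here."""
    distance = bhattacharyya_distance(mean1, cov1, mean2, cov2)
    return float(np.sqrt(prior1 * prior2) * np.exp(-distance))


# Hypothetical usage: two 2-dimensional models with identity covariance.
identical = bhattacharyya_bound([0.0, 0.0], np.eye(2), [0.0, 0.0], np.eye(2))
shifted = bhattacharyya_bound([0.0, 0.0], np.eye(2), [1.0, 0.0], np.eye(2))
print(identical, shifted)
```

With equal priors, two identical Gaussians have distance zero and the bound takes its largest possible value, 0.5; any difference in means or covariances lowers it.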

In order to estimate the stability of an acoustic model, the training data for that model is separated into a number of disjoint subsets. All subsets are selected to be mutually exclusive with respect to the speakers they contain. For each subset, a separate acoustic model is trained, and the Bhattacharyya bound between each pair of models is calculated. By calculating both the mean of this bound and the standard deviation of this measure across the various model pairs, a statistically sound measure of model estimation stability is obtained.

4 Computational analysis of the Lwazi corpus

We now report on our analysis of the Lwazi speech corpus, using the stability measure described above. Here, we focus on four languages (isiNdebele, siSwati, isiZulu and Tshivenda) for reasons of space; later, we shall see that the other languages behave quite similarly.

4.1 Experimental design

For each phoneme in each of our target languages, we extract all the phoneme occurrences from the 150 speakers with the most utterances per phoneme. We utilise the technique described in Sec. 3.3 to estimate the Bhattacharyya bound both when evaluating phoneme variability and model distance. In both cases we separate the data for each phoneme into 5 disjoint subsets. We calculate the mean of the 10 distances obtained between the various intra-phoneme model pairs when measuring phoneme stability, and the mean of the 25 distances obtained between the various inter-phoneme model pairs when measuring phoneme distance.

In order to be able to control the number of phoneme observations used to train our acoustic models, we first train a speech recognition system and then use forced alignment to label all of the utterances using the systems described in Sec. 5. Mel-frequency cepstral coefficients (MFCCs) with cepstral mean and variance normalisation are used as features, as described in Sec. 5.

4.2 Analysis of phoneme variability

In an earlier analysis of phoneme variability of an English corpus (Badenhorst and Davel, 2008), similar trends were observed when utilising different numbers of mixtures in a Gaussian mixture model. For both context dependent and context independent models similar trends are also observed. (Asymptotes occur later, but trends remain similar.) Because of the limited size of the Lwazi corpus, we therefore only report on single-mixture context-independent models in the current section.

As we also observe similar trends for phonemes within the same broad categories, we report on one or two examples from several broad categories which occur in most of our target languages. Using SAMPA notation, the following phonemes are selected: /a/ (vowels), /m/ (nasals), /b/ and /g/ (voiced plosives) and /s/ (unvoiced fricatives), after verifying that these phonemes are indeed representative of the larger groups.

Figures 1 and 2 demonstrate the effects of variable numbers of phonemes and speakers, respectively, on the value of the mean Bhattacharyya bound. This value should approach 0.5 for a model fully trained on a sufficiently representative set of data. In Fig. 1 we see that the various broad categories of sounds approach the asymptotic bound in different ways. The vowels and nasals require the largest number of phoneme occurrences to reach a given level, whereas the fricatives and plosives converge quite rapidly (with 10 observations per speaker, both the fricatives and plosives achieve values of 0.48 or better for all languages, in contrast to the vowels and nasals, which require 30 observations to reach similar stability). Note that we employed 30 speakers per phoneme group, since that is the largest number achievable with our protocol.

For the results in Fig. 2, we keep the number of phoneme occurrences per speaker fixed at 20 (this ensures that we have sufficient data for all phonemes, and corresponds with reasonable convergence in Fig. 1). It is clear that additional speakers would still improve the modelling accuracy, especially for the vowels and nasals. We observe that the voiced plosives and fricatives quickly achieve high values for the bound (close to the ideal 0.5).

Figures 1 and 2 – as well as similar figures for the other phoneme classes and languages we have studied – suggest that all phoneme categories require at least 20 training speakers to achieve reasonable levels of convergence (bound levels of 0.48 or better). The number of phoneme observations required per speaker is more variable, ranging from fewer than 10 for the voiceless fricatives to 30 or more for vowels, liquids and nasals. We return to these observations below.

Figure 1: Effect of number of phoneme utterances per speaker on mean of Bhattacharyya bound for different phoneme groups using data from 30 speakers

4.3 Distances between languages

In Sec. 3.1 it was pointed out that the similarities between the same phonemes in different languages are important predictors of the benefit achievable from pooling the data from those languages. Armed with the knowledge that stable models can be estimated with 30 speakers per phoneme and between 10 and 30 phoneme occurrences per speaker, we now turn to the task of measuring distances between phonemes in various languages.

We again use the mean Bhattacharyya bound to compare phonemes, and obtain values between all possible combinations of phonemes. Results are shown for the isiNdebele phonemes /n/ and /a/ in Fig. 3. As expected, similar phonemes from the different languages are closer to one another than different phonemes of the same language. However, the details of the distances are quite revealing: for /a/, siSwati is closest to the isiNdebele model, as would be expected given their close linguistic relationship, but for /n/, the Tshivenda model is found to be closer than either of the other Nguni languages. For comparative purposes, we have included one non-Bantu language (Afrikaans), and we see that its models are indeed significantly more dissimilar from the isiNdebele model than any of the Bantu languages. In fact, the Afrikaans /n/ is about as distant from isiNdebele /n/ as isiNdebele and isiZulu /l/ are!

5 Initial ASR results

In order to verify the usability of the Lwazi corpus for speech recognition, we develop initial ASR systems for all 11 official South African languages. A summary of the data statistics for the Bantu languages investigated is shown in Tab. 1, and recognition accuracies achieved are summarised in Tab. 2. For these tests, data from 30 speakers per language were used as test data, with the remaining data being used for training.

Although the Southern Bantu languages are tone languages, our systems do not encode tonal

information, since tone is unlikely to be important for small-to-medium vocabulary applications (Zerbian and Barnard, 2008).

As the initial pronunciation dictionaries were developed to provide good coverage of the language in general, these dictionaries did not cover the entire ASR corpus. Grapheme-to-phoneme rules are therefore extracted from the general dictionaries using the Default&Refine algorithm (Davel and Barnard, 2008) and used to generate missing pronunciations.

We use HTK 3.4 to build a context-dependent cross-word HMM-based phoneme recogniser with triphone models. Each model had 3 emitting states with 7 mixtures per state. 39 features are used: 13 MFCCs together with their first and second order derivatives. Cepstral Mean Normalisation (CMN) as well as Cepstral Variance Normalisation (CMV) are used to perform speaker-independent normalisation. A diagonal covariance matrix is used; to partially compensate for this incorrect assumption of feature independence, semi-tied transforms are applied. A flat phone-based language model is employed throughout.

As a rough benchmark of acceptable phoneme-recognition accuracy, recently reported results obtained by (Morales et al., 2008) on a similar-sized telephone corpus in American English (N-TIMIT) are also shown in Tab. 2. We see that the Lwazi results compare very well with this benchmark.

Figure 2: Effect of number of speakers on mean of Bhattacharyya bound for different phoneme groups using 20 utterances per speaker

Language     total # minutes   # speech minutes   # distinct phonemes
isiNdebele   564               465                46
isiXhosa     470               370                52
isiZulu      525               407                46
Tshivenda    354               286                38
Sepedi       394               301                45
Sesotho      387               313                44
Setswana     379               295                34
siSwati      603               479                39
Xitsonga     378               316                54
N-TIMIT      315               -                  39

Table 1: A summary of the Lwazi ASR corpus: Bantu languages.

An important issue in ASR corpus design is

the trade-off between the number of speakers and the amount of data per speaker (Wheatley et al., 1994). The figures in Sec. 4.2 are not conclusive on this trade-off, so we have also investigated the effect of reducing either the number of speakers or the amount of data per speaker when training the isiZulu and Tshivenda recognisers. As shown in Fig. 4, the impact of both forms of reduction is comparable across languages and different degrees of reduction, in agreement with the results of Sec. 4.2.

These results indicate that we now have a firm baseline to investigate data-efficient training methods such as those described in Sec. 3.1.

Figure 3: Effective distances in terms of the mean of the Bhattacharyya bound between a single phoneme (/n/-nbl top and /a/-nbl bottom) and each of its closest matches within the set of phonemes investigated.

Language     % corr   % acc   avg # phons   total # speakers
isiNdebele   74.21    65.41   28.66         200
isiXhosa     69.25    57.24   17.79         210
isiZulu      71.18    60.95   23.42         201
Tshivenda    76.37    66.78   19.53         201
Sepedi       66.44    55.19   16.45         199
Sesotho      68.17    54.79   18.57         200
Setswana     69.00    56.19   20.85         207
siSwati      74.19    64.46   30.66         208
Xitsonga     70.32    59.41   14.35         199
N-TIMIT      64.07    55.73   -             -

Table 2: Initial results for South African ASR systems. The column labelled “avg # phonemes” lists the average number of phoneme occurrences for each phoneme for each speaker.

Figure 4: The influence of a reduction in training corpus size on phone recognition accuracy.

6 Conclusion

In this paper we have introduced a new telephone speech corpus which contains data from nine Southern Bantu languages. Our stability analysis shows that the speaker variety as well as the amount of speech per speaker is sufficient to achieve acceptable model stability, and this conclusion is confirmed by the successful training of phone recognisers in all the languages. We confirm the observation in (Badenhorst and Davel, 2008) that different phone classes have different

data requirements, but even for the more demanding classes (vowels, nasals, liquids) our amount of data seems sufficient. Our results suggest that similar accuracies may be achievable by using more speech from fewer speakers – a finding that may be useful for the further development of speech corpora in resource-scarce languages.

Based on the proven stability of our models, we have performed some preliminary measurements of the distances between the phones in the different languages; such distance measurements are likely to be important for the sharing of data across languages in order to further improve ASR accuracy. The development of real-world applications using this data is currently an active topic of research; for that purpose, we are continuing to investigate additional methods to improve recognition accuracy with such relatively small corpora, including cross-language data sharing and efficient adaptation methods.

References

Nimaan Abdillahi, Pascal Nocera, and Jean-François Bonastre. 2006. Automatic transcription of Somali language. In Interspeech, pages 289–292, Pittsburgh, PA.
J.A.C. Badenhorst and M.H. Davel. 2008. Data requirements for speaker independent acoustic models. In PRASA, pages 147–152.
E. Barnard, L. Cloete, and H. Patel. 2003. Language and technology literacy barriers to accessing government services. Lecture Notes in Computer Science, 2739:37–42.
W. Byrne, P. Beyerlein, J. M. Huerta, S. Khudanpur, B. Marthi, J. Morgan, N. Peterek, J. Picone, D. Vergyri, and W. Wang. 2000. Towards language independent acoustic modeling. In ICASSP, volume 2, pages 1029–1032, Istanbul, Turkey.
M. Davel and E. Barnard. 2004. The efficient creation of pronunciation dictionaries: human factors in bootstrapping. In Interspeech, pages 2797–2800, Jeju, Korea, Oct.
M. Davel and E. Barnard. 2008. Pronunciation prediction with Default&Refine. Computer Speech and Language, 22:374–393, Oct.
K. Fukunaga. 1990. Introduction to Statistical Pattern Recognition. Academic Press, Inc., 2nd edition.
Pali Lehohla. 2003. Census 2001: Census in brief. Statistics South Africa.
Meraka-Institute. 2009. Lwazi ASR corpus. Online: http://www.meraka.org.za/lwazi.
N. Morales, J. Tejedor, J. Garrido, J. Colas, and D.T. Toledano. 2008. STC-TIMIT: Generation of a single-channel telephone corpus. In LREC, pages 391–395, Marrakech, Morocco.
A. Nagroski, L. Boves, and H. Steeneken. 2003. In search of optimal data selection for training of automatic speech recognition systems. ASRU workshop, pages 67–72, Nov.
P. Nasfors. 2007. Efficient voice information services for developing countries. Master's thesis, Department of Information Technology, Uppsala University.
T. Niesler. 2007. Language-dependent state clustering for multilingual acoustic modeling. Speech Communication, 49:453–463.
G. Riccardi and D. Hakkani-Tur. 2003. Active and unsupervised learning for automatic speech recognition. In Eurospeech, pages 1825–1828, Geneva, Switzerland.
J.C. Roux, E.C. Botha, and J.A. du Preez. Developing a multilingual telephone based information system in african languages. In LREC, pages 975–980, Athens, Greece.
T. Schultz and A. Waibel. 2001. Language-independent and language-adaptive acoustic modeling for speech recognition. Speech Communication, 35:31–51, Aug.
Hussien Seid and Björn Gambäck. 2005. A speaker independent continuous speech recognizer for Amharic. In Interspeech, pages 3349–3352, Lisboa, Portugal, Oct.
A. Sharma, M. Plauche, C. Kuun, and E. Barnard. 2009. HIV health information access using spoken dialogue systems: Touchtone vs. speech. Accepted at IEEE Int. Conf. on ICTD.
J. Sherwani, N. Ali, S. Mirza, A. Fatma, Y. Memon, M. Karim, R. Tongia, and R. Rosenfeld. Healthline: Speech-based access to health information by low-literate users. In IEEE Int. Conf. on ICTD, pages 131–139.
R. Tucker and K. Shalonova. 2004. The Local Language Speech Technology Initiative. In SCALLA Conf., Nepal.
B. Wheatley, K. Kondo, W. Anderson, and Y. Muthusamy. 1994. An evaluation of cross-language adaptation for rapid HMM development in a new language. In ICASSP, pages 237–240, Adelaide.
Y. Wu, R. Zhang, and A. Rudnicky. 2007. Data selection for speech recognition. ASRU workshop, pages 562–565, Dec.
S. Zerbian and E. Barnard. 2008. Phonetics of intonation in South African Bantu languages. Southern African Linguistics and Applied Language Studies, 26(2):235–254.

The SAWA Corpus: a Parallel Corpus English - Swahili

Guy De Pauw CNTS - Language Technology Group, University of Antwerp, Belgium School of Computing and Informatics, University of Nairobi, Kenya [email protected]

Peter Waiganjo Wagacha School of Computing and Informatics, University of Nairobi, Kenya [email protected]

Gilles-Maurice de Schryver African Languages and Cultures, Ghent University, Belgium Xhosa Department, University of the Western Cape, South Africa [email protected]

Abstract

Research in data-driven methods for Machine Translation has greatly benefited from the increasing availability of parallel corpora. Processing the same text in two different languages yields useful information on how words and phrases are translated from a source language into a target language. To investigate this, a parallel corpus is typically aligned by linking linguistic tokens in the source language to the corresponding units in the target language. An aligned parallel corpus therefore facilitates the automatic development of a machine translation system and can also bootstrap annotation through projection. In this paper, we describe data collection and annotation efforts and preliminary experimental results with a parallel corpus English - Swahili.

1 Introduction

Language technology applications such as machine translation can provide an invaluable, but all too often ignored, impetus in bridging the digital divide between the Western world and Africa. Quite a few localization efforts are currently underway that improve ICT access in local African languages. Vernacular content is now increasingly being published on the Internet, and the need for robust language technology applications that can process this data is high.

For a language like Swahili, digital resources have become increasingly important in everyday life, both in urban and rural areas, particularly thanks to the increasing number of web-enabled mobile phone users in the language area. But most research efforts in the field of natural language processing (NLP) for African languages are still firmly rooted in the rule-based paradigm. Language technology components in this sense are usually straight implementations of insights derived from grammarians. While the rule-based approach definitely has its merits, particularly in terms of design transparency, it has the distinct disadvantage of being highly language-dependent and costly to develop, as it typically involves a lot of expert manual effort.

Furthermore, many of these systems are decidedly competence-based. The systems are often tweaked and tuned towards a small set of ideal sample words or sentences, ignoring the fact that real-world language technology applications must in principle be able to handle the performance aspect of language. Many researchers in the field are quite rightly growing weary of publications that ignore quantitative evaluation on real-world data or that report implausibly high accuracy scores, excused by the erroneously perceived regularity of African languages.

In a linguistically diverse and increasingly computerized continent such as Africa, the need for a more empirical approach to language technology is high. The data-driven, corpus-based approach described in this paper establishes such an alternative, so far not yet extensively investigated for African languages. The main advantage of this

Proceedings of the EACL 2009 Workshop on Language Technologies for African Languages – AfLaT 2009, pages 9–16, Athens, Greece, 31 March 2009. c 2009 Association for Computational Linguistics

approach is its language independence: all that is needed is (linguistically annotated) language data, which is fairly cheap to compile. Given this data, existing state-of-the-art algorithms and resources can consequently be re-used to quickly develop robust language applications and tools.

Most African languages are however resource-scarce, meaning that digital text resources are few. An increasing number of publications however are showing that carefully selected procedures can indeed bootstrap language technology for Swahili (De Pauw et al., 2006; De Pauw and de Schryver, 2008), Northern Sotho (de Schryver and De Pauw, 2007) and smaller African languages (Wagacha et al., 2006a; Wagacha et al., 2006b; De Pauw and Wagacha, 2007; De Pauw et al., 2007a; De Pauw et al., 2007b).

In this paper we outline on-going research on the development of a parallel corpus English - Swahili. The parallel corpus is designed to bootstrap a data-driven machine translation system for the language pair in question, as well as open up possibilities for projection of annotation.

We start off with a short survey of the different approaches to machine translation (Section 2) and showcase the possibility of projection of annotation (Section 3). We then concentrate on the required data collection and annotation efforts (Section 4) and describe preliminary experiments on sentence, word and morpheme alignment (Sections 5 and 6). We conclude with a discussion of the current limitations to the approach and provide pointers for future research (Section 7).

2 Machine Translation

The main task of Machine Translation (MT) can be defined as having a computer take a text input in one language, the Source language (SL), decode its meaning and re-encode it, producing as output a similar-meaning text in another language, the Target language (TL). The idea of building an application to automatically convert text from one language to an equivalent text-meaning in a second language traces its roots back to Cold War intelligence efforts in the 1950s and 60s for Russian-English text translations. Since then a large number of MT systems have been developed with varying degrees of success. For an excellent overview of the history of MT, we refer the reader to Hutchins (1986).

The original dream of creating a fully automatic MT system has long since been abandoned and most research in the field currently concentrates on minimizing human pre- and post-processing effort. A human translator is thus considered to work alongside the MT system to produce faster and more consistent translations.

The Internet brought in an interesting new dimension to the purpose of MT. In the mid 1990s, free on-line translation services began to surface with an increasing number of MT vendors. The most famous example is Yahoo!'s Babelfish, offering on-line versions of Systran to translate English, French, German, Spanish and other Indo-European languages. Google Inc. currently also offers translation services. While these systems provide far from perfect output, they can often give readers a sense of what is being talked about on a web page in a language (and often even character set) foreign to them.

There are roughly three types of approaches to machine translation:

1. Rule-based methods perform translation using extensive lexicons with morphological, syntactic and semantic information, and large sets of manually compiled rules, making them very labor intensive to develop.

2. Statistical methods entail the collection and statistical analysis of bilingual text corpora, i.e. parallel corpora. The technique tries to find the highest probability translation of a sentence or phrase among the exponential number of choices.

3. Example-based methods are similar to statistical methods in that they are parallel corpus driven. An Example-Based Machine Translator (EBMT) scans for patterns in both languages and relates them in a translation memory.

Most MT systems currently under development are based on methods (2) and/or (3). Research in these fields has greatly benefited from the increasing availability of parallel corpora, which are needed to bootstrap these approaches. Such a parallel corpus is typically aligned by linking, either automatically or manually, linguistic tokens in the source language to the corresponding units in the target language. Processing this data enables the development of fast and effective MT systems in both directions with a minimum of human involvement.
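To make the idea of automatically linking tokens in a sentence-aligned parallel corpus more concrete, the sketch below implements IBM Model 1, the simplest of the IBM alignment models, trained with expectation-maximization. It is a minimal illustration only: the toy sentence pairs are invented for the example, the NULL source token of the full model is omitted for brevity, and the function names are ours. It is not the configuration used in the experiments reported in this paper.

```python
from collections import defaultdict

def train_ibm1(pairs, iterations=10):
    """Estimate lexical translation probabilities t(f|e) with EM.

    pairs: list of (source_tokens, target_tokens) sentence pairs.
    Returns a dict keyed by (target_word, source_word).
    """
    tgt_vocab = {f for _, tgt in pairs for f in tgt}
    uniform = 1.0 / len(tgt_vocab)
    t = defaultdict(lambda: uniform)          # uniform initialisation
    for _ in range(iterations):
        count = defaultdict(float)            # expected co-occurrence counts
        total = defaultdict(float)            # expected counts per source word
        for src, tgt in pairs:
            for f in tgt:
                norm = sum(t[(f, e)] for e in src)
                for e in src:
                    frac = t[(f, e)] / norm   # posterior that e generated f
                    count[(f, e)] += frac
                    total[e] += frac
        for (f, e), c in count.items():       # M-step: renormalise per source word
            t[(f, e)] = c / total[e]
    return t

def align(src, tgt, t):
    """Link every target token to its most probable source token."""
    links = []
    for j, f in enumerate(tgt):
        i = max(range(len(src)), key=lambda k: t[(f, src[k])])
        links.append((i, j))
    return links

# Invented toy corpus (English tokens, Swahili tokens).
pairs = [
    (["cat", "fell"], ["paka", "alianguka"]),
    (["cat"], ["paka"]),
    (["water"], ["maji"]),
]
t = train_ibm1(pairs)
print(align(["cat", "fell"], ["paka", "alianguka"], t))
```

Each link is a pair (source index, target index). The one-word sentence pairs give the model unambiguous evidence, which is why co-occurrence statistics are enough to separate "paka" from "alianguka" in the two-word pair.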

                      English     Swahili     English   Swahili
                      Sentences   Sentences   Words     Words
New Testament         7.9k        7.9k        189.2k    151.1k
Quran                 6.2k        6.2k        165.5k    124.3k
Declaration of HR     0.2k        0.2k        1.8k      1.8k
Kamusi.org            5.6k        5.6k        35.5k     26.7k
Movie Subtitles       9.0k        9.0k        72.2k     58.4k
Investment Reports    3.2k        3.1k        52.9k     54.9k
Local Translator      1.5k        1.6k        25.0k     25.7k
Full Corpus Total     33.6k       33.6k       542.1k    442.9k

Table 1: Overview of the data in the SAWA Corpus

3 Projection of Annotation

While machine translation constitutes the most straightforward application of a parallel corpus, projection of annotation has recently become an interesting alternative use of this type of resource. As previously mentioned, most African languages are resource-scarce: annotated data is not only unavailable, but commercial interest to develop these resources is limited. Unsupervised approaches can be used to bootstrap annotation of a resource-scarce language (De Pauw and Wagacha, 2007; De Pauw et al., 2007a) by automatically finding linguistic patterns in large amounts of raw text.

Projection of annotation attempts to achieve the same goal, but through the use of a parallel corpus. These techniques try to transport the annotation of a well resourced source language, such as English, to texts in a target language. As a natural extension of the domain of machine translation, these methods employ parallel corpora which are aligned at the sentence and word level. The direct correspondence assumption coined in Hwa et al. (2002) hypothesizes that words that are aligned between source and target language must share linguistic features as well. It therefore allows for the annotation of the words in the source language to be projected onto the text in the target language. The following general principle holds: the closer source and target language are related, the more accurate this projection can be performed. Even though lexical and structural differences between languages prevent a simple one-to-one mapping, knowledge transfer is often able to generate a well directed annotation of the target language.

This holds particular promise for the annotation of dependency analyses for Swahili, as exemplified in Figure 1, since dependency grammar focuses on semantic relationships, rather than core syntactic properties, that are much more troublesome to project across languages. The idea is that a relationship that holds between two words in the source language (for instance the subj relationship between cat and fell), also holds for the corresponding linguistic tokens in the target language, i.e. paka and alianguka.

[Figure: dependency analysis (labels root, subj, main, comp, pp, dt) of "The cat fell into the water", projected onto "Paka alianguka ndani ya maji".]

Figure 1: Projection of Dependency Analysis Annotation

In the next section we describe data collection and preprocessing efforts on the SAWA Corpus, a parallel corpus English - Swahili (cf. Table 1), which will enable this type of projection of annotation, as well as the development of a data-driven machine translation system.

4 Data Collection and Annotation

While digital data is increasingly becoming available for Swahili on the Internet, sourcing useful

bilingual data is far from trivial. At this stage in the development of the MT system, it is paramount to use faithfully translated material, as this benefits further automated processing. The corpus-based MT approaches we wish to employ require word alignment to be performed on the texts, during which the words in the source language are linked to the corresponding words in the target language (also see Figures 1 and 2).

Figure 2: Manual word alignment using the UMIACS interface

But before we can do this, we need to perform sentence-alignment, during which we establish an unambiguous mapping between the sentences in the source text and the sentences in the target text. While some data is inherently sentence-aligned, other texts require significant preprocessing before word alignment can be performed.

The SAWA Corpus currently consists of a reasonable amount of data (roughly half a million words in each language), although this is not comparable to the resources available to Indo-European language pairs, such as the Hansard corpus (Roukos et al., 1997) (2.87 million sentence pairs). Table 1 gives an overview of the data available in the SAWA Corpus. For each segment it lists the number of sentences and words in the respective languages.

4.1 Sentence-aligned Resources

We found digitally available Swahili versions of the New Testament and the Quran for which we sourced the English counterparts. This is not a trivial task when, as in the case of the Swahili documents, the exact source of the translation is not provided. By carefully examining subtle differences in the English versions, we were however able to track down the most likely candidate. While religious material has a specific register and may not constitute ideal training material for an open-ended MT system, it does have the advantage of being inherently aligned on the verse level, facilitating further sentence alignment. Another typical bilingual text is the UN Declaration of Human Rights, which is available in many of the world's languages, including Swahili. The manual sentence alignment of this text is greatly facilitated by the fixed structure of the document.

The downloadable version of the on-line dictionary English-Swahili (Benjamin, 2009) contains individual example sentences associated with the dictionary entries. These can be extracted and used as parallel data in the SAWA corpus. Since at a later point we also wish to study the specific linguistic aspects of spoken language, we opted to have some movie subtitles manually translated. These can be extracted from DVDs and while the

language is compressed to fit on screen and constitutes scripted language, they nevertheless provide a reasonable approximation of spoken language. Another advantage of this data is that it is inherently sentence-aligned, thanks to the technical time-coding information. It also opens up possibilities for MT systems with other language pairs, since a commercial DVD typically contains subtitles for a large number of other languages as well.
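The way time-coding makes subtitle data inherently sentence-aligned can be sketched as follows. The example assumes, purely for illustration, that both subtitle tracks are available as SubRip-style text (a cue number, a line with start and end time separated by "-->", and the subtitle text) and that a translated cue keeps the time codes of the original; the file format, the function names and the time codes are our assumptions and are not taken from the corpus itself. The example sentences are those used elsewhere in this paper.

```python
def parse_cues(text):
    """Map (start, end) time codes to subtitle text for a SubRip-style string."""
    cues = {}
    for block in text.strip().split("\n\n"):
        lines = block.strip().splitlines()
        if len(lines) < 3 or "-->" not in lines[1]:
            continue                              # skip malformed blocks
        start, end = (part.strip() for part in lines[1].split("-->"))
        cues[(start, end)] = " ".join(lines[2:])  # join multi-line subtitles
    return cues

def pair_by_timecode(src_text, tgt_text):
    """Pair subtitles that share exactly the same time codes."""
    src, tgt = parse_cues(src_text), parse_cues(tgt_text)
    return [(src[key], tgt[key]) for key in sorted(src) if key in tgt]

# Hypothetical time codes; the sentences are taken from the examples in this paper.
english = (
    "1\n00:00:01,000 --> 00:00:03,000\nThe cat fell into the water\n\n"
    "2\n00:00:04,000 --> 00:00:05,500\nI have turned him down\n"
)
swahili = (
    "1\n00:00:01,000 --> 00:00:03,000\nPaka alianguka ndani ya maji\n\n"
    "2\n00:00:04,000 --> 00:00:05,500\nNimemkatalia\n"
)
aligned = pair_by_timecode(english, swahili)
for en, sw in aligned:
    print(en, "|||", sw)
```

Cues whose time codes do not match exactly are dropped here; a more tolerant variant could pair cues with overlapping intervals instead.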

4.2 Paragraph-aligned Resources

The rest of the material consists of paragraph-aligned data, which was manually sentence-aligned. We obtained a substantial amount of data from a local Kenyan translator. Finally, we also included Kenyan investment reports. These are yearly reports from local companies and are presented in both English and Swahili. A major difficulty was extracting the data from these documents. The company reports are presented in colorful brochures in PDF format, meaning automatic text exports require manual post-processing and paragraph alignment (Figure 3). They nevertheless provide a valuable resource, since they come from a fairly specific domain and are a good sample of the type of text the projected MT system may need to process in a practical setting.

Figure 3: Text Extraction from Bilingual Investment Report

The reader may note that there is a very diverse variety of texts within the SAWA corpus, ranging from movie subtitles to religious texts. While it certainly benefits the evaluation to use data from texts in one specific language register, we have chosen to maintain variety in the language data at this point. Upon evaluating the decoder at a later stage, we will however investigate the bias introduced by the specific language registers in the corpus.

4.3 Word Alignment

All of the data in the corpus was subsequently tokenized, which involves automatically cleaning up the texts, conversion to UTF-8, and splitting punctuation from word forms. The next step involved scanning for sentence boundaries in the paragraph-aligned text, to facilitate the automatic sentence alignment method described in Section 5.

While not necessary for further processing, we also performed manual word-alignment annotation. This task can be done automatically, but it is useful to have a gold-standard reference against which we can evaluate the automated method. Monitoring the accuracy of the automatic word-alignment method against the human reference will allow us to tweak parameters to arrive at the optimal settings for this language pair.

We used the UMIACS word alignment interface (Hwa and Madnani, 2004) for this purpose and asked the annotators to link the words between the two sentences (Figure 2). Given the linguistic differences between English and Swahili, this is by no means a trivial task. Particularly the morphological richness of Swahili means that there is a lot of convergence from words in English to words in Swahili (also see Section 6). This alignment was done on some of the manual translations of movie subtitles, giving us a gold-standard word-alignment reference of about 5,000 words. Each annotator's work was cross-checked by another annotator to improve correctness and consistency.

5 Alignment Experiments

There are a number of packages available to process parallel corpora. To preprocess the paragraph-aligned texts, we used Microsoft's bilingual sentence aligner (Moore, 2002). The

output of the sentence alignment was subsequently manually corrected. We found that 95% of the sentences were correctly aligned with most errors being made on sentences that were not present in English, i.e. instances where the translator decided to add an extra clarifying sentence to the direct translation from English. This also explains why there are more Swahili words in the paragraph aligned texts than in English, while the situation is reversed for the sentence aligned data.

For word-alignment, the state-of-the-art method is GIZA++ (Och and Ney, 2003), which implements the word alignment methods IBM1 to IBM5 and HMM. While this method has a strong Indo-European bias, it is nevertheless interesting to see how far we can get with the default approach used in statistical MT.

We evaluate by looking at the word alignments proposed by GIZA++ and compare them to the manually word-aligned section of the SAWA Corpus. We can quantify the evaluation by calculating precision and recall and their harmonic mean, the F-score (Table 2). Precision is the number of correct links divided by the total number of links suggested by GIZA++. Recall is the number of correct links divided by the total number of links in the manual annotation. The underwhelming results presented in Table 2 can be attributed to the strong Indo-European bias of the current approaches. It is clear that extra linguistic data sources and a more elaborate exploration of the experimental parameters of GIZA++ will be needed, as well as a different approach to word-alignment. In the next section, we describe a potential solution to the problem by defining the problem on the level of the morpheme.

Precision   Recall   F(β = 1)
39.4%       44.5%    41.79%

Table 2: Precision, Recall and F-score for the word-alignment task using GIZA++

6 Alignment into an Agglutinating Language

The main problem in training a GIZA++ model for the language pair English - Swahili is the strong agglutinating nature of the latter. Alignment patterns such as the ones in Figures 1 and 2 are not impossible to retrieve. But no corpus is exhaustive enough to provide enough linguistic evidence to unearth strongly converging alignment patterns, such as the one in Example 1.

(1) I have turned him down
    Nimemkatalia

Morphologically deconstructing the word however can greatly relieve the sparse data problem for this task:

(2) I have turned him down
    Ni- me- m- katalia

The isolated Swahili morphemes can more easily be linked to their English counterparts, since there will be more linguistic evidence in the parallel corpus, linking for example ni to I and m to him. To perform this kind of morphological analysis, we developed a machine learning system trained and evaluated on the Helsinki corpus of Swahili (Hurskainen, 2004). Experimental results show that the data-driven approach achieves state-of-the-art performance in a direct comparison with a rule-based method, with the added advantage of being robust to word forms for previously unseen lemmas (De Pauw and de Schryver, 2008). We can consequently use morphological deconstruction as a preprocessing step for the alignment task, similar to the method described by Goldwater and McClosky (2005), Oflazer (2008) and Stymne et al. (2008).

Precision   Recall   F(β = 1)
50.2%       64.5%    55.8%

Table 3: Precision, Recall and F-score for the morpheme/word-alignment task using GIZA++

We have no morphologically aligned parallel data available, so evaluation of the morphology-based approach needs to be done in a roundabout way. We first morphologically decompose the Swahili data and run GIZA++ again. Then we recompile the Swahili words from the morphemes and group the word alignment links accordingly. Incompatible linkages are removed. The updated scores are presented in Table 3. While this certainly improves on the scores in Table 2, we need to be aware of the difficulty that the morphological preprocessing step will introduce in the decoding phase, necessitating the introduction of a language model that not only works on the word level, but

also on the level of the morpheme.

For the purpose of projection of annotation, this is however not an issue. We performed a preliminary experiment with a dependency-parsed English corpus, projected onto the morphologically decomposed tokens in Swahili. We are currently lacking the annotated gold-standard data to perform quantitative evaluation, but have observed interesting annotation results that open up possibilities for the morphological analysis of more resource-scarce languages.

7 Discussion

In this paper we presented parallel corpus collection work that will enable the construction of a machine translation system for the language pair English - Swahili, as well as open up the possibility of corpus annotation through projection. We are confident that we are approaching a critical amount of data that will enable good word alignment that can subsequently be used as a model for an MT decoding system, such as the Moses package (Koehn et al., 2007). While the currently reported scores are not yet state-of-the-art, we are confident that further experimentation and the addition of more bilingual data as well as the introduction of extra linguistic features will raise the accuracy level of the proposed MT system.

Apart from the morphological deconstruction described in Section 6, the most straightforward addition is the introduction of part-of-speech tags as an extra layer of linguistic description, which can be used in word alignment model IBM5. The current word alignment method tries to link word forms, but knowing that for instance a word in the source language is a noun will facilitate linking it to a corresponding noun in the target language, rather than considering a verb as a possible match. Both for English (Ratnaparkhi, 1996) and Swahili (De Pauw et al., 2006), we have highly accurate part-of-speech taggers available.

Another extra information source that we have so far ignored is a digital dictionary as a seed for the word alignment. The kamusiproject.org electronic dictionary will be included in further word-alignment experiments and will undoubtedly improve the quality of the output.

Once we have a stable word alignment module, we will further conduct learning curve experiments, in which we train the system with gradually increasing amounts of data. This will provide us with information on how much more data we need to achieve state-of-the-art performance. This additional data can be automatically found by parallel web mining, for which a few systems have recently become available (Resnik and Smith, 2003).

Furthermore, we will also look into the use of comparable corpora, i.e. bilingual texts that are not straight translations, but deal with the same subject matter. These have been found to work as additional material within a parallel corpus (McEnery and Xiao, 2007) and may further help improve the development of a robust, open-ended and bidirectional machine translation system for the language pair English - Swahili. The most innovative prospect of the parallel corpus is the annotation of dependency analysis in Swahili, not only on the syntactic level, but also on the level of the morphology. The preliminary experiments indicate that this approach might provide a valuable technique to bootstrap annotation in truly resource-scarce languages.

Acknowledgments

The research presented in this paper was made possible through the support of the VLIR-IUC-UON program and was partly funded by the SAWA BOF UA-2007 project. The first author is funded as a Postdoctoral Fellow of the Research Foundation - Flanders (FWO). We are greatly indebted to Dr. James Omboga Zaja for contributing some of his translated data, to Mahmoud Shokrollahi-Far for his advice on the Quran and to Anne Kimani, Chris Wangai Njoka and Naomi Maajabu for their annotation efforts.

References

M. Benjamin. 2009. The Kamusi Project. Available at: http://www.kamusiproject.org (Accessed: 14 January 2009).
G. De Pauw and G.-M. de Schryver. 2008. Improving the computational morphological analysis of a Swahili corpus for lexicographic purposes. Lexikos, 18:303–318.
G. De Pauw and P.W. Wagacha. 2007. Bootstrapping morphological analysis of Gĩkũyũ using unsupervised maximum entropy learning. In Proceedings of the eighth INTERSPEECH conference, Antwerp, Belgium.
G. De Pauw, G.-M. de Schryver, and P.W. Wagacha. 2006. Data-driven part-of-speech tagging of Kiswahili. In P. Sojka, I. Kopeček, and K. Pala, editors, Proceedings of Text, Speech and Dialogue, 9th International Conference, volume 4188 of Lecture Notes in Computer Science, pages 197–204, Berlin, Germany. Springer Verlag.
G. De Pauw, P.W. Wagacha, and D.A. Abade. 2007a. Unsupervised induction of Dholuo word classes using maximum entropy learning. In K. Getao and E. Omwenga, editors, Proceedings of the First International Computer Science and ICT Conference, pages 139–143, Nairobi, Kenya. University of Nairobi.
G. De Pauw, P.W. Wagacha, and G.-M. de Schryver. 2007b. Automatic diacritic restoration for resource-scarce languages. In Václav Matoušek and Pavel Mautner, editors, Proceedings of Text, Speech and Dialogue, Tenth International Conference, volume 4629 of Lecture Notes in Computer Science, pages 170–179, Heidelberg, Germany. Springer Verlag.
G.-M. de Schryver and G. De Pauw. 2007. Dictionary writing system (DWS) + corpus query package (CQP): The case of Tshwanelex. Lexikos, 17:226–246.
S. Goldwater and D. McClosky. 2005. Improving statistical MT through morphological analysis. In Proceedings of the Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 676–683, Vancouver, Canada.
Google. 2009. Google Translate. Available at http://www.google.com/translate (Accessed: 14 January 2009).
A. Hurskainen. 2004. HCS 2004 – Helsinki Corpus of Swahili. Technical report, Compilers: Institute for Asian and African Studies (University of Helsinki) and CSC.
W.J. Hutchins. 1986. Machine translation: past, present, future. Ellis, Chichester.
R. Hwa and N. Madnani. 2004. The UMIACS Word Alignment Interface. Available at: http://www.umiacs.umd.edu/~nmadnani/alignment/forclip.htm (Accessed: 14 January 2009).
R. Hwa, Ph. Resnik, A. Weinberg, and O. Kolak. 2002. Evaluating translational correspondence using annotation projection. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 392–399, Philadelphia, PA, USA.
Ph. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Annual Meeting of the Association for Computational Linguistics (ACL), demonstration session, Prague, Czech Republic.
A.M. McEnery and R.Z. Xiao. 2007. Parallel and comparable corpora: What are they up to? In Incorporating Corpora: Translation and the Linguist. Translating Europe. Multilingual Matters, Clevedon, UK.
R.C. Moore. 2002. Fast and accurate sentence alignment of bilingual corpora. In Proceedings of the 5th Conference of the Association for Machine Translation in the Americas on Machine Translation: From Research to Real Users, volume 2499 of Lecture Notes in Computer Science, pages 135–144, Berlin, Germany. Springer Verlag.
F.J. Och and H. Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.
K. Oflazer. 2008. Statistical machine translation into a morphologically complex language. In Computational Linguistics and Intelligent Text Processing, pages 376–388, Berlin, Germany. Springer Verlag.
A. Ratnaparkhi. 1996. A maximum entropy model for part-of-speech tagging. In E. Brill and K. Church, editors, Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 133–142. Association for Computational Linguistics.
Ph. Resnik and N.A. Smith. 2003. The web as a parallel corpus. Computational Linguistics, 29(1):349–380.
S. Roukos, D. Graff, and D. Melamed. 1997. Hansard French/English. Available at: http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC95T20 (Accessed: 14 January 2009).
S. Stymne, M. Holmqvist, and L. Ahrenberg. 2008. Effects of morphological analysis in translation between German and English. In Proceedings of the Third Workshop on Statistical Machine Translation, pages 135–138, Columbus, USA.
P.W. Wagacha, G. De Pauw, and K. Getao. 2006a. Development of a corpus for Gĩkũyũ using machine learning techniques. In J.C. Roux, editor, Proceedings of LREC workshop - Networking the development of language resources for African languages, pages 27–30, Genoa, Italy, May, 2006. European Language Resources Association, ELRA.
P.W. Wagacha, G. De Pauw, and P.W. Githinji. 2006b. A grapheme-based approach for accent restoration in Gĩkũyũ. In Proceedings of the Fifth International Conference on Language Resources and Evaluation, pages 1937–1940, Genoa, Italy, May, 2006. European Language Resources Association, ELRA.
Yahoo! 2009. Babelfish. Available at http://babelfish.yahoo.com (Accessed: 14 January 2009).

Information Structure in African Languages: Corpora and Tools

Christian Chiarcos*, Ines Fiedler**, Mira Grubic*, Andreas Haida**, Katharina Hartmann**, Julia Ritz*, Anne Schwarz**, Amir Zeldes**, Malte Zimmermann*

* Universität Potsdam, Potsdam, Germany
{chiarcos|grubic|julia|malte}@ling.uni-potsdam.de

** Humboldt-Universität zu Berlin, Berlin, Germany
{ines.fiedler|andreas.haida|k.hartmann|anne.schwarz|amir.zeldes}@rz.hu-berlin.de

Abstract

In this paper, we describe tools and resources for the study of African languages developed at the Collaborative Research Centre “Information Structure”. These include deeply annotated data collections of 25 sub-Saharan languages, described together with their annotation scheme, and the corpus tool ANNIS, which provides unified access to a broad variety of annotations created with a range of different tools. By applying ANNIS to several African data collections, we illustrate its suitability for language documentation, distributed access and the creation of data archives.

1 Information Structure

The Collaborative Research Centre (CRC) "Information structure: the linguistic means for structuring utterances, sentences and texts" brings together scientists from different fields of linguistics and neighbouring disciplines from the University of Potsdam and the Humboldt-University Berlin. Our research comprises the use and advancement of corpus technologies for complex linguistic annotations, such as the annotation of information structure (IS). We define IS as the structuring of linguistic information in order to optimize information transfer within discourse: information needs to be prepared ("packaged") in different ways depending on the goals a speaker pursues within discourse.

Fundamental concepts of IS include 'topic', 'focus', 'background' and 'information status'. Broadly speaking, the topic is the entity a specific sentence is construed about, focus represents the new or newsworthy information a sentence conveys, background is that part of the sentence that is familiar to the hearer, and information status refers to different degrees of familiarity of an entity.

Languages differ with respect to the means of realization of IS, due to language-specific properties (e.g., lexical tone). This makes a typological comparison of traditionally less-studied languages to existing theories, which are mostly based on European languages, very promising. Particular emphasis is laid on the study of focus, its functions and manifestations in different sub-Saharan languages, as well as the differentiation between different types of focus, i.e., term focus (focus on arguments/adjuncts), predicate focus (focus on verb/verb phrase/TAM/truth value), and sentence focus (focus on the whole utterance).

We describe corpora of 25 sub-Saharan languages created for this purpose, together with ANNIS, the technical infrastructure developed to support linguists in their work with these data collections. ANNIS is specifically designed to support corpora with rich and deep annotation, as IS manifests itself on practically all levels of linguistic description. It provides user-friendly means of querying and visualizations for different kinds of linguistic annotations, including flat, layer-based annotations as used for linguistic glosses, but also hierarchical annotations as used for syntax annotation.

2 Research Activities at the CRC

Within the Collaborative Research Centre, several projects elicit data in large amounts and great diversity. These data, originating from different languages, different modes (written and spoken language) and specific research questions, shape the specification of the linguistic database ANNIS.

2.1 Linguistic Data Base

The project “Linguistic database for information structure: Annotation and Retrieval”, henceforth

Proceedings of the EACL 2009 Workshop on Language Technologies for African Languages – AfLaT 2009, pages 17–24, Athens, Greece, 31 March 2009. © 2009 Association for Computational Linguistics

database project, coordinates annotation activities in the CRC, provides service to projects in the creation and maintenance of data collections, and conducts theoretical research on multi-level annotations. Its primary goals, however, are the development and investigation of techniques to process, to integrate and to exploit deeply annotated corpora with multiple kinds of annotations. One concrete outcome of these efforts is the linguistic data base ANNIS described further below. For the specific facilities of ANNIS, its application to several corpora of African languages and its use as a general-purpose tool for the publication, visualization and querying of linguistic data, see Sect. 5.

2.2 Gur and Kwa Languages

Gur and Kwa languages, two genetically related West African language groups, are in the focus of the project “Interaction of information structure and grammar in Gur and Kwa languages”, henceforth Gur-Kwa project. In a first research stage, the precise means of expression of the pragmatic category focus were explored, as well as their functions in Gur and Kwa languages. For this purpose, a number of data collections for several languages were created (Sect. 3.1). Findings obtained with these data led to different subquestions which are of special interest from a cross-linguistic and a theoretical point of view. These concern (i) the analysis of syntactically marked focus constructions with features of narrative sentences (Schwarz & Fiedler 2007), (ii) the study of verb-centered focus (i.e., focus on verb/TAM/truth value), for which there are special means of realization in Gur and Kwa (Schwarz, forthcoming), and (iii) the identification of systematic focus-topic overlap, i.e., coincidence of focus and topic in sentence-initial nominal constituents (Fiedler, forthcoming). The project's findings on IS are evaluated typologically on 19 selected languages. The questions raised by the project serve the superordinate goal of expanding our knowledge of linguistically relevant information structural categories in the less-studied Gur and Kwa languages, as well as of the interaction between IS, grammar and language type.

2.3 Chadic Languages

The project “Information Structure in the Chadic Languages”, henceforth Chadic project, investigates focus phenomena in Chadic languages. The Chadic languages are a branch of the Afro-Asiatic language family mainly spoken in northern Nigeria, Niger, and Chad. As tone languages, the Chadic languages represent an interesting subject for research into focus because here, intonational/tonal marking (a commonly used means for marking focus in European languages) is in potential conflict with lexical tone, and so Chadic languages resort to alternative means for marking focus.

The languages investigated in the Chadic project include the western Chadic languages Hausa, Tangale, and Guruntum and the central Chadic languages Bura, South Marghi, and Tera. The main research goal of the Chadic project is a deeper understanding of the following asymmetries: (i) subject focus is obligatorily marked, but marking of object focus is optional; (ii) in Tangale and Hausa there are sentences that are ambiguous between an object-focus interpretation and a predicate-focus interpretation, but in intonation languages like English and German, object focus and predicate focus are always marked differently from each other; (iii) in Hausa, Bole, and Guruntum there is only a tendency to distinguish different types of focus (new-information focus vs. contrastive focus), but in European languages like Hungarian and Finnish, this differentiation is obligatory.

2.4 Focus from a Cross-linguistic Perspective

The project "Focus realization, focus interpretation, and focus use from a cross-linguistic perspective", henceforth focus project, investigates the correspondence between the realization, interpretation and use of focus, with an emphasis on African and south-east Asian languages. It is structured into three fields of research: (i) the relation between differences in realization and differences in semantic meaning or pragmatic function, (ii) realization, interpretation and use of predicate focus, and (iii) association with focus.

The relation between differences in realization and semantic/pragmatic differences (i) pertains particularly to the semantic interpretation of focus: for Hungarian and Finnish, a differentiation between two semantic types of foci corresponding to two different types of focus realization was suggested, and we investigate whether the languages studied here have a similar distinction between two (or more) semantic focus types, whether this may differ

from language to language, and whether differences in focus realization correspond to semantic or pragmatic differences.

The investigation of realization, interpretation and use of predicate focus (ii) involves the questions why different forms of predicate focus are often realized in the same way, why they are often not obligatorily marked, and why they are often marked differently from term focus.

Association with focus (iii) means that the interpretation of the sentence is influenced by the focusing of a particular constituent, marked by a focus-sensitive expression (e.g., particles like 'only', or quantificational adverbials like 'always'), while usually, focus does not have an impact on the truth value of a sentence. The project investigates which focus-sensitive expressions there are in the languages studied, what kinds of constituents they associate with, how this association works, and whether it works differently for focus particles and quantificational adverbials.

3 Collections of African Language Data at the CRC

3.1 Gur and Kwa Corpora

The Gur and Kwa corpora currently comprise data from 19 languages.

Due to the scarcity of information available on IS in the Gur and Kwa languages, data had to be elicited. Most of this was done during field research, mainly in West Africa, and some in Germany with the help of native speakers of the respective languages. The typologically diverse languages for which we elicited data ourselves are: Baatonum, Buli, Byali, Dagbani, Ditammari, Gurene, Konkomba, Konni, Nateni, Waama, Yom (Gur languages) and Aja, Akan, Efutu, Ewe, Fon, Foodo, Lelemi, Anii (Kwa languages).

The elicitation of the data was based mainly on the questionnaire on information structure developed by our research group (QUIS, see Section 4.1). This ensured that comparable data for the typological comparison were obtained. Moreover, language-specific additional tasks and questionnaires tailored to a more detailed analysis of language-specific traits were developed.

As the coding of IS varies across different types of texts, different text types were included in the corpus, such as (semi-)spontaneous speech, translations, monologues and dialogues. Most of the languages do not have a long literacy tradition, so that the corpus data mainly represent oral communication.

In all, the carefully collected heterogeneous data provide a corpus that gives a comprehensive picture of IS, and in particular the focus systems, in these languages.

3.2 Hausar Baka Corpus

In the Chadic project, data from six Chadic languages are considered.

One of the larger data sets annotated in the Chadic project is drawn from Hausar Baka (Randell, Bature & Schuh 1998), a collection of videotaped Hausa dialogues recording natural interaction in various cultural milieus, involving over fifty individuals of different age and gender. The annotated data set consists of approximately 1500 sentences.

The corpus was annotated according to the guidelines for Linguistic Information Structure Annotation (LISA, see Section 4.2). The Chadic languages show various forms of syntactic displacement, and in order to account for this, an additional annotation level was added: constituents are marked as ex-situ="+" if they occur displaced from their canonical, unmarked position.

An evaluation of the focus type and the displacement status reveals tendencies in the morphosyntactic realization of different focus types, see Sect. 5.2.

3.3 Hausa Internet Corpus

Besides these data collections that are currently available in the CRC and in ANNIS, further resources are continuously created. One of these is a corpus of written Hausa, created in cooperation with another NLP project of the CRC.

The corpora previously mentioned mostly comprise elicited sentences from little-documented languages with rather small language communities. Hausa, in contrast, is spoken by more than 24 million native speakers, with large amounts of Hausa material (some of it parallel to material in other, more-studied languages) available on the internet. This makes Hausa a promising language for the creation of resources that enable a quantitative study of information structure.

The Hausa internet corpus is designed to cover different kinds of written language, including news articles from international radio stations (e.g., http://www.dw-world.de), religious texts, literary prose but also material similar to spontaneous spoken language (e.g., in chat logs).

Parallel sections of the corpus comprise excerpts from the novel Ruwan Bagaja by Abubakar Imam, Bible and Qur’an sections, and the Declaration of Human Rights. As will be described in Section 4.3, these parallel sections open the possibility of semiautomatic morphosyntactic annotation, providing a unique source for the study of information structure in Hausa. Sect. 5.2 gives an example of bootstrapping ex-situ constituents in ANNIS on the basis of morphosyntactic annotation only.

4 Data Elicitation and Annotation

4.1 Elicitation with QUIS

The questionnaire on information structure (Skopeteas et al., 2006) provides a tool, first, for the collection of natural linguistic data, both spoken and written, and, secondly, for the elaboration of grammars of IS in genetically diverse languages. Focus information, for instance, is typically elicited by embedding an utterance in a question context. To avoid the influence of a mediator (working) language, the main body of QUIS is built on the basis of pictures and short movies representing a nearly culture- and language-neutral context. Besides highly controlled experimental settings, less controlled settings serve the purpose of eliciting longer, cohesive, natural texts for studying categories such as focus and topic in a near-natural environment.

4.2 Transcription and Manual Annotation

In the CRC, the annotation scheme LISA has been developed with special regard to applicability across typologically different languages (Dipper et al., 2007). It comprises guidelines for the annotation of phonology, morphology, syntax, semantics and IS.

The data mentioned above are, in the case of speech, transcribed according to IPA conventions, otherwise written according to orthographic conventions, and annotated with glosses and IS, a translation of each sentence into English or French, (optionally) additional notes, references to QUIS experiments, and references to audio files and metadata.

4.3 (Semi-)automatic Annotation

As to the automation of annotation, we pursue two strategies: (i) the training of classifiers on annotated data, and (ii) the projection of annotations on texts in a source language to parallel texts in a target language.

Machine Learning. ANNIS can export query matches and all their annotated features to the table format ARFF, which serves as input to the data mining tool WEKA (Witten & Frank, 2005), where instances can be clustered or used to train classifiers for any annotation level.

Projection. Based on (paragraph-, sentence- or verse-) aligned sections in the Hausa internet corpus, we are about to project annotations from linguistically annotated English texts to Hausa, in a first step parts of speech and possibly nominal chunks. On the projected annotation, we will train a tagger/chunker to annotate the remaining, non-parallel sections of the Hausa internet corpus. Existing manual annotations (e.g., of the Hausar Baka corpus) will then serve as a gold standard for evaluation purposes.

Concerning projection techniques, we expect to face a number of problems: (i) the question of how to assign part of speech tags to categories existing only in the target language (e.g., the person-aspect complex in Hausa that binds together information about both the verb (aspect) and its (pronominal subject) argument), and (ii) issues of orthography: the official orthography of Hausa (Boko) is systematically underspecified with respect to linguistically relevant distinctions. Neither vowel length nor different qualities of certain consonants (r) are represented, and there is no marking of tones (see Examples 1 and 2, fully specified word forms in brackets). Distinguishing such homographs, however, is essential to the appropriate interpretation and linguistic analysis of utterances.

(1) ciki: 1. [cíkìi, noun] stomach, 2. [cíkí, prep.] inside

(2) dace: 1. [dàacée, noun] coincidence, 2. [dáacèe, verb] be appropriate

We expect that in these cases, statistical techniques using context features may help to predict correct vowelization and tonal patterns.

5 ANNIS – the Linguistic Database of Information Structure Annotation

5.1 Conception and Architecture

ANNIS (ANNotation of Information Structure) is a web-based corpus interface built to query and visualize multilevel corpora. It allows the user to formulate queries on arbitrary, possibly nested annotation levels, which may be

conflictingly overlapping or discontinuous. The types of annotations handled by ANNIS include, among others, flat, layer-based annotations (e.g., for glossing) and hierarchical trees (e.g., syntax).

Source data. As an architecture designed to facilitate diverse and integrative research on IS, ANNIS can import formats from a broad variety of tools from NLP and manual annotation, the latter including EXMARaLDA (Schmidt, 2004), annotate (Brants and Plaehn, 2000), Synpathy (www.lat-mpi.eu/tools/synpathy/), MMAX2 (Müller and Strube, 2006), RSTTool (O'Donnell, 2000), PALinkA (Orasan, 2003), Toolbox (Busemann & Busemann, 2008) etc. These tools allow researchers to annotate data for syntax, semantics, morphology, prosody, phonetics, referentiality, lexis and much more, as their research questions require.

All annotated data are merged together via a general interchange format PAULA (Dipper 2005, Dipper & Götze 2005), a highly expressive standoff XML format that specifically allows further annotation levels to be added at a later time without disrupting the structure of existing annotations. PAULA, then, is the native format of ANNIS.

Backend. The ANNIS server uses a relational database that offers many advantages including full Unicode support and regular expression searches. Extensive search functionalities are supported, allowing complex relations between individual word forms and annotations, such as all forms of overlapping, contained or adjacent annotation spans, dominance axes (children, ancestors etc., as well as common parent, left- or right-most child and more), etc.

Search. In the user interface, queries can be formulated using the ANNIS Query Language (AQL). It is based on the definition of nodes to be searched for and the relationships between these nodes (see below for some examples). A graphical query builder is also included in the web interface to make access as easy as possible.

Visualization. The web interface, realized as a window-based AJAX application written in Java, provides visualization facilities for search results. Available visualizations include token-based annotations, layered annotations, tree-like annotations (directed acyclic graphs), and a discourse view of entire texts for, e.g., coreference annotation. Multimodal data are represented using an embedded media player.

Special features. By allowing queries on multiple, conflicting annotation levels simultaneously, the system supports the study of interdependencies between a potentially limitless variety of annotation levels.

At the same time, ANNIS allows us to integrate and to search through heterogeneous resources by means of a unified interface, a powerful query language and an intuitive graphical query editor, and is therefore particularly well-suited for the purpose of language documentation. In particular, ANNIS can serve as a tool for the publication of data collections via the internet. A fine-grained user management makes it possible to grant privileged users access to specific data collections, to make a corpus available to the public, or to seal (but preserve) a resource, e.g., until legal issues (copyright) are settled. This also makes publishing linguistic data collections possible without giving them away.

Moreover, ANNIS supports deep links to corpora and corpus queries. This means that queries and query results referred to in, e.g., a scientific paper, can be reproduced and quoted by means of (complex) links (see following example).

5.2 Using ANNIS. An Example Query

As an illustration of the application of ANNIS to the data collections presented above, consider a research question previously discussed in the study of object focus in Hausa.

In Hausa, object focus can be realized in two ways: either ex-situ or in-situ (Section 3.2). It was found that these realizations do not differ in their semantic type (Green & Jaggar 2003, Hartmann & Zimmermann 2007): instead, the marked form signals that the focused constituent (or the whole speech act) is unexpected for the hearer (Zimmermann 2008). These assumptions are consistent with findings for other African languages (Fiedler et al. 2006).

In order to verify such claims on corpora with morphosyntactic and syntactic annotation for the example of Hausa, a corpus query can be designed on the basis of the Hausar Baka corpus, which comprises not only annotations for grammatical functions and information-structural categories, but also an annotation of ex-situ elements.


Figure 2: ANNIS Query Builder, cf. example (5).

Figure 1: ANNIS partitur view, Hausar Baka corpus.
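The node-and-relation idea behind the queries in examples (3) and (5) can be made concrete with a small Python sketch. It is an illustration only and does not reproduce the ANNIS implementation, which evaluates queries in a relational database: the `Node` class, the helper names, the naive enumeration of candidate tuples, and the assumption that a pattern must match a whole annotation value are all choices made for this example, and the five-token sentence consists of invented placeholders.

```python
import re
from dataclasses import dataclass
from itertools import product


@dataclass
class Node:
    start: int   # index of the first token covered by this node
    end: int     # index of the last token covered (inclusive)
    annos: dict  # annotation name -> value, e.g. {"CHUNK": "NC"}


def has(node, key, pattern):
    """True if the node carries annotation `key` and its value fully matches `pattern`."""
    value = node.annos.get(key)
    return value is not None and re.fullmatch(pattern, value) is not None


def precedes(a, b):
    """Counterpart of '#a . #b': a ends on the token directly before b starts."""
    return a.end + 1 == b.start


def included_in(a, b):
    """Counterpart of '#a _i_ #b': the extent of a lies inside the extent of b."""
    return b.start <= a.start and a.end <= b.end


def search(nodes, constraints, relations):
    """Return all node tuples satisfying the constraints and the pairwise relations.

    constraints: one (annotation name, regex) pair per query variable
    relations:   (i, j, predicate) triples over 0-based variable positions
    """
    candidates = [[n for n in nodes if has(n, key, pat)] for key, pat in constraints]
    return [combo for combo in product(*candidates)
            if all(rel(combo[i], combo[j]) for i, j, rel in relations)]


# Invented placeholder data (not from the corpus): five tokens and two spans.
nodes = [
    Node(0, 0, {"tok": ".", "CLASS": "PUNCT"}),
    Node(1, 1, {"tok": "w1", "CLASS": "N"}),
    Node(2, 2, {"tok": "w2", "CLASS": "ADJ"}),
    Node(3, 3, {"tok": "w3", "CLASS": "PRON"}),
    Node(4, 4, {"tok": "w4", "CLASS": "V"}),
    Node(1, 2, {"CHUNK": "NC", "EX-SITU": "+"}),
    Node(1, 4, {"TRANSLATION": "placeholder statement."}),
]

# Analogue of (5): punctuation . nominal chunk . pronoun . verb
hits5 = search(
    nodes,
    [("tok", r"[,.!?]"), ("CHUNK", r"NC"), ("CLASS", r"PRON.*"), ("CLASS", r"V")],
    [(0, 1, precedes), (1, 2, precedes), (2, 3, precedes)],
)

# Analogue of (3): ex-situ constituent included in a sentence not translated as a question
hits3 = search(
    nodes,
    [("EX-SITU", r"\+"), ("TRANSLATION", r".*[^?]")],
    [(0, 1, included_in)],
)

for combo in hits5 + hits3:
    print([(n.start, n.end) for n in combo])
```

Treating tokens, chunks and sentences uniformly as spans over token positions is what lets a single precedence or inclusion test apply across annotation levels. The exhaustive product over candidates is only workable for toy data.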

So, in (3), we look for ex-situ constituents (variable #1) in declarative sentences in the Hausar Baka corpus, i.e., sentences that are not translated as questions (variable #2), such that #1 is included in #2 (#1 _i_ #2).

(3) EX-SITU="+" & TRANSLATION=".*[^?]" & #1 _i_ #2

Considering the first 25 matches for this query on Hausar Baka, 16 examples turn out to be relevant (excluding interrogative pronouns and elliptical utterances). All of these are directly preceded by a period (sentence-initial) or a comma (preceded by ee ‘yes’, interjections or exclamations), with one exception, which is preceded by a sentence-initial negation marker.

Only seven examples are morphologically marked by focus particles (nee, cee), focus-sensitive adverbs (kawài ‘only’) or quantifiers (koomee ‘every’). In nine cases, a personal pronoun follows the ex-situ constituent, followed by the verb. Together, these constraints describe all examples retrieved, and as a generalization, we can now postulate a number of patterns that only make use of morphosyntactic and syntactic annotation (token tok, morphological segmentation MORPH, parts of speech CLASS, nominal chunks CHUNK), with two examples given below:

(4) tok=/[,.!?]/ & CHUNK="NC" & MORPH=/[cn]ee/ & #1 . #2 & #2 . #3

(5) tok=/[,.!?]/ & CHUNK="NC" & CLASS=/PRON.*/ & CLASS=/V/ & #1 . #2 & #2 . #3 & #3 . #4

In (4), we search for a nominal chunk following a punctuation sign and preceding a focus particle (cee or nee); in (5), we search for a nominal chunk preceding a sequence of pronoun/aspect marker and verb.

One example matching template (5) from the Hausar Baka corpus is given in Fig. 1.

While AQL can be used in this way to help linguists understand the grammatical realization of certain phenomena, and the grammatical context they occur in, patterns like (5) above are probably not very readable to an interested user. This deficit, however, is compensated by the graphical query builder, which allows users to create AQL queries in a more intuitive way, cf. Fig. 2.

Of course, these patterns are not exhaustive, and they overgenerate. However, they can be directly evaluated against the manual ex-situ annotation in the Hausar Baka corpus and further refined.

So, the manual annotation of ex-situ constituents in the Hausar Baka corpus provides templates for the semi-automatic detection of ex-situ constituents in a morphosyntactically annotated corpus of Hausa: the patterns generate a set of candidate examples from which a human annotator can then choose real ex-situ constituents. Indeed, for a better understanding of ex-situ object focus, a study with a larger database of more natural language would be of great advantage, and this pattern-based approach represents a way to create such a database of ex-situ constructions in Hausa.

Finally, it would also help find instances of predicate focus. When a V(P) constituent is focused in Hausa, it is nominalized and fronted like a focused nominal constituent (Hartmann & Zimmermann 2007).

5.3 Related Corpus Tools

Some annotation tools come with search facilities, e.g. Toolbox (Busemann & Busemann, 2008), a system for annotating, managing and

analyzing language data, mainly geared to lexicographic use, and ELAN (Hellwig et al., 2008), an annotation tool for audio and video data.

In contrast, ANNIS is not intended to provide annotation functionality. The main reason behind this is that both Toolbox and ELAN are problem-specific annotation tools with limited capabilities for application to phenomena other than those they were designed for. Toolbox provides an intuitive annotation environment and search facilities for flat, word-oriented annotations; ELAN, on the other hand, for annotations that stand in a temporal relation to each other.

These tools, as well as the other annotation tools used within the CRC, are tailored to a particular type of annotation, none of them being capable of sufficiently representing the data from all other tools. Annotation on different levels, however, is crucial for the investigation of information structural phenomena. In order to fill this gap, ANNIS was designed primarily with the focus on visualization and querying of multi-layer annotations. In particular, ANNIS allows the integration of annotations originating from different tools (e.g., syntax annotation created with Synpathy, coreference annotation created with MMAX2, and flat, time-aligned annotations created with ELAN) that nevertheless refer to the same primary data. In this respect, ANNIS, together with the data format PAULA and the libraries created for the work with both, is best compared to general annotation frameworks such as ATLAS, NITE and LAF.

Taking the NITE XML Toolkit (NXT) as a representative example of this kind of framework, it provides an abstract data model, XML-based formats for data storage and metadata, a query language, and a library with JAVA routines for data storage and manipulation, querying and visualization. Additionally, a set of command line tools and simple interfaces for corpus querying and browsing are provided, which illustrate how the libraries can be used to create one's own, project-specific corpus interfaces and tools.

Similarly to ANNIS, NXT supports time-aligned, hierarchical and pointer-based annotation, conflicting hierarchies and the embedding of multi-modal primary data. The data storage format is based on the bundling of multiple XML files, similar to the standoff concept employed in LAF and PAULA.

One fundamental difference between NXT and ANNIS, however, is to be seen in the primary clientele it targets: the NITE XML Toolkit is aimed at the developer and allows them to build more specialized displays, interfaces, and analyses as required by their respective end users when working with highly structured data annotated on multiple levels.

As compared to this, ANNIS is directly targeted at the end user, that is, a linguist trying to explore and to work with a particular set of corpora. Therefore, an important aspect of the ANNIS implementation is the integration with a data base and convenient means for visualization and querying.

6 Conclusion

In this paper, we described the Africanist projects of the CRC “Information Structure” at the University of Potsdam and the Humboldt University of Berlin/Germany, together with their data collections from currently 25 sub-Saharan languages. Also, we have presented the linguistic database ANNIS that can be used to publish, access, query and visualize these data collections. As one specific example of our work, we have described the design and ongoing construction of a corpus of written Hausa, the Hausa internet corpus, and sketched the relevant NLP techniques for (semi-)automatic morphosyntactic annotation and the application of the ANNIS Query Language to filter out ex-situ constituents and their contexts, which are relevant with regard to our goal, a better understanding of focus and information structure in Hausa and other African languages.

References

T. Brants, O. Plaehn. 2000. Interactive Corpus Annotation. In: Proc. of LREC 2000, Athens, Greece.

A. Busemann, K. Busemann. 2008. Toolbox Self-Training. Technical Report (Version 1.5.4 Oct 2008) http://www.sil.org/

J. Carletta, S. Evert, U. Heid, J. Kilgour, J. Robertson, H. Voormann. 2003. The NITE XML Toolkit: Flexible Annotation for Multi-modal Language Data. Behavior Research Methods, Instruments, and Computers 35(3), 353-363.

J. Carletta, S. Evert, U. Heid, J. Kilgour. 2005. The NITE XML Toolkit: data model and query. Language Resources and Evaluation Journal, 313-334.

S. Dipper. 2005. XML-based Stand-off Representation and Exploitation of Multi-Level Linguistic Annotation. In: Proc. of Berliner XML Tage 2005 (BXML 2005), Berlin, Germany, 39-50.

S. Dipper, M. Götze. 2005. Accessing Heterogeneous Linguistic Data - Generic XML-based Representation and Flexible Visualization. In: Proc. of the 2nd Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics. Poznan, Poland, 206-210.

S. Dipper, M. Götze, S. Skopeteas (eds.). 2007. Information structure in cross-linguistic corpora: Annotation guidelines for phonology, morphology, syntax, semantics and information structure. Interdisciplinary Studies on Information Structure 7, 147-187. Potsdam: University of Potsdam.

I. Fiedler. forthcoming. Contrastive topic marking in Gbe. In: Proc. of the 18th International Conference on Linguistics, Seoul, Korea, 21-26 July 2008.

I. Fiedler, K. Hartmann, B. Reineke, A. Schwarz, M. Zimmermann. forthcoming. Subject Focus in West African Languages. In: M. Zimmermann, C. Féry (eds.), Information Structure. Theoretical, Typological, and Experimental Perspectives. Oxford: Oxford University Press.

M. Green, P. Jaggar. 2003. Ex-situ and in-situ focus in Hausa: syntax, semantics and discourse. In: J. Lecarme et al. (eds.), Research in Afroasiatic Grammar 2 (Current Issues in Linguistic Theory). Amsterdam: John Benjamins, 187-213.

K. Hartmann, M. Zimmermann. 2004. Nominal and Verbal Focus in the Chadic Languages. In: F. Perrill et al. (eds.), Proc. of the Chicago Linguistics Society, 87-101.

K. Hartmann, M. Zimmermann. 2007. In Place - Out of Place? Focus in Hausa. In: K. Schwabe, S. Winkler (eds.), On Information Structure, Meaning and Form: Generalizing Across Languages. Amsterdam: Benjamins, 365-403.

B. Hellwig, D. Van Uytvanck, M. Hulsbosch. 2008. ELAN – Linguistic Annotator. Technical Report (as of 2008-07-31). http://www.lat-mpi.eu/tools/elan/

É. Kiss. 1998. Identificational Focus Versus Information Focus. Language 74: 245-273.

M. Krifka. 1992. A compositional semantics for multiple focus constructions. In: J. Jacobs (ed.), Informationsstruktur und Grammatik. Opladen: Westdeutscher Verlag, 17-53.

C. Müller, M. Strube. 2006. Multi-Level Annotation of Linguistic Data with MMAX2. In: S. Braun et al. (eds.), Corpus Technology and Language Pedagogy. New Resources, New Tools, New Methods. Frankfurt: Peter Lang, 197–214.

M. O’Donnell. 2000. RSTTool 2.4 – A Markup Tool for Rhetorical Structure Theory. In: Proc. of the International Natural Language Generation Conference (INLG'2000), 13-16 June 2000, Mitzpe Ramon, Israel, 253–256.

C. Orasan. 2003. Palinka: A Highly Customisable Tool for Discourse Annotation. In: Proc. of the 4th SIGdial Workshop on Discourse and Dialogue, Sapporo, Japan.

R. Randell, A. Bature, R. Schuh. 1998. Hausar Baka. http://www.humnet.ucla.edu/humnet/aflang/hausarbaka/

T. Schmidt. 2004. Transcribing and Annotating Spoken Language with Exmaralda. In: Proc. of the LREC-Workshop on XML Based Richly Annotated Corpora, Lisbon 2004. Paris: ELRA.

A. Schwarz. forthcoming. Verb and Predication Focus Markers in Gur. In: I. Fiedler, A. Schwarz (eds.), Information structure in African languages (Typological Studies in Language), 307-333. Amsterdam, Philadelphia: John Benjamins.

A. Schwarz, I. Fiedler. 2007. Narrative focus strategies in Gur and Kwa. In: E. Aboh et al. (eds.), Focus strategies in Niger-Congo and Afroasiatic – On the interaction of focus and grammar in some African languages, 267-286. Berlin: Mouton de Gruyter.

S. Skopeteas, I. Fiedler, S. Hellmuth, A. Schwarz, R. Stoel, G. Fanselow, C. Féry, M. Krifka. 2006. Questionnaire on information structure (QUIS). Interdisciplinary Studies on Information Structure 4. Potsdam: University of Potsdam.

I. H. Witten, E. Frank. 2005. Data mining: Practical machine learning tools and techniques, 2nd edn. San Francisco: Morgan Kaufmann.

M. Zimmermann. 2008. Contrastive Focus and Emphasis. Acta Linguistica Hungarica 55: 347-360.

A computational approach to Yorùbá morphology

Raphael Finkel
Department of Computer Science, University of Kentucky, USA
[email protected]

Ọdẹ́túnjí Àjàdí, ỌDẸ́JỌBÍ
Cork Constraint Computation Center, University College Cork, Cork, Ireland.
[email protected]

Abstract

We demonstrate the use of default inheritance hierarchies to represent the morphology of Yorùbá verbs in the KATR formalism, treating inflectional exponences as markings associated with the application of rules by which complex word forms are deduced from simpler roots or stems. In particular, we suggest a scheme of slots that together make up a verb and show how each slot represents a subset of the morphosyntactic properties associated with the verb. We also show how we can account for the tonal aspects of Yorùbá, in particular, the tone associated with the emphatic ending. Our approach allows linguists to gain an appreciation for the structure of verbs, gives teachers a foundation for organizing lessons in morphology, and provides students a technique for generating forms of any verb.

1 Introduction

Recent research into the nature of morphology has demonstrated the feasibility of several approaches to the definition of a language’s inflectional system. Central to these approaches is the notion of an inflectional paradigm. In general terms, the inflectional paradigm of a lexeme L can be regarded as a set of cells, where each cell is the pairing of L with a set of morphosyntactic properties, and each cell has a word form as its realization; for instance, the paradigm of the lexeme walk includes cells such as <WALK, {3rd singular present indicative}> and <WALK, {past}>, whose realizations are the word forms walks and walked.

Given this notion, one approach to the definition of a language’s inflectional system is the realizational approach (Matthews 1972, Zwicky 1985, Anderson 1992, Corbett & Fraser 1993, Stump 2001); in this approach, each word form in a lexeme’s paradigm is deduced from the lexical and morphosyntactic properties of the cell that it realizes by means of a system of morphological rules. For instance, the word form walks is deduced from the cell <WALK, {3rd singular present indicative}> by means of the rule of -s suffixation, which applies to the root walk of the lexeme WALK to express the property set {3rd singular present indicative}.

We apply the realizational approach to the study of Yorùbá verbs. Yorùbá, an Edekiri language of the Niger-Congo family (Gordon 2005), is the native language of more than 30 million people in West Africa. Although it has many dialects, all speakers can communicate effectively using Standard Yorùbá (SY), which is used in education, mass media and everyday communication (Adewọ́lé 1988).

We represent our realizational analysis of SY in the KATR formalism (Finkel, Shen, Stump & Thesayi 2002). KATR is based on DATR, a formal language for representing lexical knowledge designed and implemented by Roger Evans and Gerald Gazdar (Evans & Gazdar 1989). Our information about SY is primarily due to the expertise of the second author.

This research is part of a larger effort aimed at elucidating the morphological structure of natural languages. In particular, we are interested in identifying the ways in which default-inheritance relations describe a language’s morphology as well as the theoretical relevance of the traditional notion of principal parts. To this end, we have applied similar techniques to Hebrew (Finkel & Stump 2007), Latin (Finkel & Stump to appear, 2009b), and French (Finkel & Stump to appear, 2009a).

Proceedings of the EACL 2009 Workshop on Language Technologies for African Languages – AfLaT 2009, pages 25–31, Athens, Greece, 31 March 2009. © 2009 Association for Computational Linguistics

1.1 Benefits

As we demonstrate below, the realizational approach leads to a KATR theory that provides a clear picture of the morphology of SY verbs. Different audiences might find different aspects of it attractive.

• A linguist can peruse the theory to gain an appreciation for the structure of SY verbs, with all exceptional cases clearly marked either by morphophonological diacritics or by rules of sandhi, which are segregated from all the other rules.

• A teacher of the language can use the theory as a foundation for organizing lessons in morphology.

• A student of the language can suggest verb roots and use the theory to generate all the appropriate forms, instead of locating the right paradigm in a book and substituting consonants.

2 SY phonetics

SY has 18 consonants (b, d, f, g, gb, h, j, k, l, m, n, p, r, s, ṣ, t, w, y), 7 simple vowels (a, e, ẹ, i, o, ọ, u), 5 nasalized vowels (an, en, in, ọn, un), and 2 syllabic nasals (m, n). SY has 3 phonologically contrastive tones: High, Mid and Low. Phonetically, there are also two tone variants, rising and falling (Laniran & Clements 2003). SY orthography employs two transcription formats for these tones. In one format, the two tones are marked on one vowel. For example, the vowel a with a low tone followed by a high tone is written as ǎ and with a high tone followed by a low tone as â. This paper follows the alternative orthography, in which each tone is carried by exactly one vowel. We write ǎ as àá and â as áà.

3 A Realizational KATR theory for SY

The purpose of the KATR theory described here is to generate verb forms for SY, specifically, the realizations of all combinations of the morphosyntactic properties of tense (present, continuous, past, future), polarity (positive, negative), person (1, 2 older, 3 older, 2 not older, 3 not older), number (singular, plural), and strength (normal, emphatic). The combinations form a total of 160 morphosyntactic property sets (MPSs).

Our analysis posits that SY verbs consist of a sequence of morphological formatives, arranged in six slots:

• Person, which realizes the person and number but is also influenced by tense and polarity,

• Negator marker 1, which appears only in the negative, but is slightly influenced by person and number,

• Tense, which realizes the tense, influenced by polarity,

• Negator marker 2, which appears only in the negative, influenced by tense,

• Stem, which realizes the verb’s lexeme,

• Ending, which appears only for emphatic verbs.

Unlike many other languages, SY does not distinguish conjugations of verbs, making its KATR theory simpler than ones for languages such as Latin and Hebrew. However, the tonality of SY adds a small amount of complexity.

A theory in KATR is a network of nodes. The network of nodes constituting SY verb morphology is very simple: every lexeme is represented by a node that specifies its stem and then refers to the node Verb. The node Verb refers to nodes for each of the slots. We use rules of Sandhi as a final step before emitting verb forms.

Each of the nodes in a theory houses a set of rules. We represent the verb mún ‘take’ by a node:

    Take:
      1  <stem> = m ún
      2  <> = Verb

The node, named Take, has two rules, which we number for discussion purposes only. KATR syntax requires that a node be terminated by a single period (full stop), which we omit here. Our convention is to name the node for a lexeme by a capitalized English word (here Take) representing its meaning.

Rule 1 says that a query asking for the stem of this verb should produce a two-atom result containing m and ún. Rule 2 says that all other queries are to be referred to the Verb node, which we introduce below.
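As a quick check on the size of the property space described at the start of this section, the following Python sketch enumerates every combination of tense, polarity, person, number and strength. It is not part of the KATR theory: the constant and function names are illustrative assumptions, and the atom spellings (2Older, 3NotOlder and so on) are those used in the theory's queries.

```python
from itertools import product

# Morphosyntactic dimensions as listed in Section 3.
# Atom spellings follow the theory's queries; the variable names are ours.
TENSE = ("present", "continuous", "past", "future")
POLARITY = ("positive", "negative")
PERSON = ("1", "2Older", "3Older", "2NotOlder", "3NotOlder")
NUMBER = ("sg", "pl")
STRENGTH = ("normal", "emphatic")


def all_property_sets():
    """Yield every morphosyntactic property set (MPS) as a tuple of atoms."""
    for strength, polarity, tense, person, number in product(
        STRENGTH, POLARITY, TENSE, PERSON, NUMBER
    ):
        yield (strength, polarity, tense, person, number)


MPS = list(all_property_sets())

if __name__ == "__main__":
    # 2 strengths x 2 polarities x 4 tenses x 5 persons x 2 numbers
    print(len(MPS))
    print(",".join(MPS[0]))
```

Each tuple can be read as one query to a lexeme node such as Take, so the list gives the full set of cells in a verb's paradigm.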

A query is a list of atoms, such as <stem> or <… sg>, addressed to a node such as Take. In our theory, the atoms in queries either represent morphological formatives (such as stem) or morphosyntactic properties (such as 3Older and sg).

A query addressed to a given node is matched against all the rules housed at that node. A rule matches if all the atoms on its left-hand side match the atoms in the query. A rule can match even if its atoms do not exhaust the entire query. In the case of Take, the query <stem> is matched by Rules 1 and 2; the query <… sg> is only matched by Rule 2.

Left-hand sides expressed with path notation (<angle brackets>) only match if their atoms match an initial substring of the query. Left-hand sides expressed with set notation ({braces}) match if their atoms are all expressed, in whatever position, in the query. We usually use set notation for queries based on morphological formatives and morphosyntactic properties, where order is insignificant.

When several rules match, KATR picks the best match, that is, the one whose left-hand side “uses up” the most of the query. This choice embodies Pāṇini’s principle, which entails that if two rules are applicable, the more restrictive rule applies, to the exclusion of the more general rule. We sometimes speak of a rule’s Pāṇini precedence, which is the cardinality of its left-hand side. If a node in a KATR theory houses two applicable rules with the same Pāṇini precedence, we consider that theory malformed.

In our case, Rule 2 of Take only applies when Rule 1 does not apply, because Rule 1 is always a better match if it applies at all. Rule 2 is called a default rule, because it applies by default if no other rule applies. Default rules define a hierarchical relation among some of the nodes in a KATR theory.

KATR generates output based on queries directed to nodes representing individual lexemes. Since these nodes, such as Take, are not referred to by other nodes, they are called leaves, as opposed to nodes like Verb, which are called internal nodes. The theory also specifies a list of queries to be addressed to all leaves. Here is the output that KATR generates for several queries directed to the Take node.

    normal,positive,present,1,sg                 mo mún
    normal,positive,present,1,pl                 a mún
    normal,positive,present,2Older,sg            ẹ mún
    normal,positive,present,2Older,pl            ẹ mún
    normal,positive,present,3Older,sg            wọ́n mún
    normal,positive,present,3Older,pl            wọ́n mún
    normal,positive,present,2NotOlder,sg         o mún
    normal,positive,present,2NotOlder,pl         ẹ mún
    normal,positive,present,3NotOlder,sg         ó mún
    normal,positive,present,3NotOlder,pl         wọ́n mún
    normal,positive,past,2NotOlder,sg            o ti mún
    normal,positive,continuous,2NotOlder,sg      ò ńmún
    normal,positive,future,2NotOlder,sg          o óò mún
    normal,negative,present,2NotOlder,sg         o (k)ò mún
    normal,negative,past,2NotOlder,sg            o (k)ò tíì mún
    normal,negative,continuous,2NotOlder,sg      o (k)ò mún
    normal,negative,future,2NotOlder,sg          o (k)ò ní (kìóò) mún
    emphatic,positive,present,2NotOlder,sg       o múnun
    emphatic,positive,past,2NotOlder,sg          o ti múnun

The rule for Take illustrates the strategy we term provisioning (Finkel & Stump 2007): It provides information (here, the letters of the verb’s stem) needed by a more general node (here, Verb).

3.1 The Verb node

We now turn to the Verb node, to which the Take node refers.

    Verb:
      1  {continuous negative} = <present negative>
      2  {} = Person Negator1 Tense Negator2 , "<stem>" Ending

Rule 1 of Verb reflects the continuous negative to the present negative, because they have identical forms.
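The matching procedure described earlier in this section (prefix matching for path notation, subset matching for set notation, and selection by Pāṇini precedence) can be sketched in Python. This is only an illustration under simplifying assumptions, not the KATR implementation: the Rule class and the best_rule function are hypothetical names, right-hand sides are kept as opaque strings rather than evaluated, and the !future and ++ notations are ignored. The two Take rules follow the node shown above, with Rule 2 assumed to be an empty path; the three set-notation rules are those of the Negator1 node of Section 3.2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    atoms: tuple     # atoms on the left-hand side
    is_path: bool    # True for <path> notation, False for {set} notation
    rhs: str         # right-hand side, kept as an opaque string in this sketch

    def matches(self, query):
        if self.is_path:
            # Path notation: the atoms must be an initial substring of the query.
            return tuple(query[: len(self.atoms)]) == self.atoms
        # Set notation: every atom must occur somewhere in the query.
        return set(self.atoms) <= set(query)

    @property
    def precedence(self):
        # Panini precedence: the cardinality of the left-hand side.
        return len(self.atoms)


def best_rule(rules, query):
    """Return the matching rule with the highest precedence, or None."""
    candidates = [r for r in rules if r.matches(query)]
    if not candidates:
        return None
    top = max(r.precedence for r in candidates)
    winners = [r for r in candidates if r.precedence == top]
    if len(winners) > 1:
        # Two applicable rules with equal precedence: the theory is malformed.
        raise ValueError("malformed theory: tie in Panini precedence")
    return winners[0]


# The Take node (Rule 2 is assumed to be the empty path, which matches any query).
TAKE = [
    Rule(("stem",), True, "m ún"),
    Rule((), True, "Verb"),
]

# The Negator1 node, in set notation.
NEGATOR1 = [
    Rule(("negative",), False, ", (k)ò"),
    Rule(("negative", "3NotOlder", "sg"), False, "kò"),
    Rule((), False, ""),
]

if __name__ == "__main__":
    print(best_rule(TAKE, ("stem",)).rhs)
    query = ("normal", "negative", "present", "3NotOlder", "sg")
    print(best_rule(NEGATOR1, query).rhs)
```

For the query normal,negative,present,3NotOlder,sg all three Negator1 rules match, and the three-atom rule wins because it uses up the most of the query; the empty rule only wins for positive queries, which is what makes it a default.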

Rule 2 is a default rule that composes the surface form by referring to a node for each slot except the stem. This rule directs a query that does not satisfy Rule 1 to each of the nodes mentioned. In this way, the theory computes values for each of the slots that represent the morphological formatives. The KATR phrase "<stem>" directs a new query to the original node (in our case, Take), which has provisioned information about the stem (in our case, m ún). The comma in the right-hand side of rule 2 is how we represent a word division; our post-processing removes ordinary spaces.

3.2 Auxiliary nodes

The Verb node invokes several auxiliary nodes to generate the surface forms for each slot.

    Person:
      1  {1 sg} = mo
      2  {1 sg negative} = mi
      3  {1 sg future} = m
      4  {1 pl} = a
      5  {2Older} = ẹ
      6  {2Older continuous} = ẹ̀
      7  {2Older continuous pl} = w ọ́n
      8  {3Older positive !future} = w ọ́n
      9  {3Older} = w ọn
      10 {2NotOlder sg} = o
      11 {2NotOlder pl} = ẹ
      12 {2NotOlder continuous sg} = ò
      13 {2NotOlder continuous pl} = ẹ̀
      14 {3NotOlder} = ó
      15 {3NotOlder negative sg} =
      16 {3NotOlder future} = yí
      17 {3NotOlder pl ++} = <3Older>

Generally, the Person slot depends on person and number, but it depends to a small extent on polarity and tense. For example, the exponence¹ of 1 sg is m, but it takes an additional vowel in the negative and the non-future positive. On the other hand, the exponence of 1 pl is always a. Rule 8 applies to tenses other than future, as marked by the notation !future; in the future, the more general Rule 9 applies. Rule 17 reflects any query involving 3NotOlder pl to the same node (Person) and 3Older forms, to which it is identical. The ++ notation increases the Pāṇini precedence of this rule so that it applies in preference to Rules 15 and 16, even if one of them should apply.

¹ An exponence is a surface form or part of a surface form, that is, the way a given lexeme appears when it is attached to morphosyntactic properties.

    Negator1:
      1  {negative} = , (k)ò
      2  {negative 3NotOlder sg} = kò
      3  {} =

The first negation slot introduces the exponence ò for negative forms (Rules 1 and 2) and the null exponence for positive forms. In most situations, this vowel starts a new word (represented by the comma), and careful speech may place an optional k before the vowel (represented by the parenthetical k); in 3NotOlder sg, this consonant is mandatory.

    Tense:
      1  {} =
      2  {past} = , t i
      3  {continuous positive} = , ń -
      4  {future positive} = , óò
      5  {future 1 sg positive} = , àá
      6  {future 3NotOlder positive} = <future 3Older positive>

The Tense slot is usually empty, as indicated by Rule 1. However, for both negative and positive past, the word ti appears here. In the positive continuous, the following slot (the stem) is prefixed by ń. We use the hyphen (-) to remove the following word break by a spelling rule (shown later). Similarly, future positive forms have a tense marker, with a special form for 1 sg. As often happens, the 3NotOlder form reflects to the 3Older form.

    Negator2:
      1  {future negative} = , ní
      2  {past negative} = ´ ì
      3  {} =

The second negator slot adds the word ní in the future (Rule 1). In the past (Rule 2), it changes the tone of the tense slot from ti to tíì. In all other cases, Rule 3 gives a null default. Rule 2 follows an assumption that tone and vowel can be specified independently in SY; without this assumption, this slot would be more cumbersome to specify. Such floating tones are in keeping with theories of autosegmental phonology (Goldsmith 1976) and are seen in other Niger-Congo languages, such as Bambara (Mountford 1983).

    Ending:
      1  {} =
      2  {emphatic} = ↓

The Ending slot is generally null (Rule 1), but in emphatic forms, it reduplicates the final vowel with a mid tone, unless the vowel already has a mid tone, in which case the tone becomes low. (We disagree slightly with Akinlabi and Liberman, who suggest that this suffix is in low tone except after a low tone, in which case it becomes mid (Akinlabi & Liberman 2000).) For this case, we introduce a jer², represented by “↓”, for postprocessing in the Sandhi phase, discussed below. Such forms are important as a way to simplify presentation, covering many cases in one rule. When we tried to develop a SY KATR theory without a jer, we needed to separate the stem of each word into onset and coda so we could repeat the coda in emphatic forms, but we had no clear way to indicate the regular change in tone. The jer accomplishes both reduplication and tone change with a single, simple mechanism. It also suggests that the emphatic ending is really a matter of tone Sandhi, not a matter of default inheritance.

² A jer, also called a morphophoneme, is a phonological unit whose phonemic expression depends on its context. It is an intermediate surface form that is to be replaced in a context-sensitive way during postprocessing.

4 Postprocessing: Sandhi, Spelling and Alternatives

After the rules produce a surface form, we postprocess that form to account for Sandhi (language-specific rules dictating sound changes for euphony), spelling conventions, and alternative exponence. We have only one Sandhi matter to account for, the jer “↓”. We accomplish this postprocessing with these rules:

    1  #vars $vowel: a ẹ e i ọ o u .
    2  #vars $tone: ´ ` .
    3  #sandhi $vowel ↓ => $1 $1 ` .
    4  #sandhi $vowel $tone ↓ => $1 $2 $1 .
    5  #sandhi $vowel n ↓ => $1 n $1 ` n .
    6  #sandhi $vowel $tone n ↓ => $1 $2 n $1 n .

The first two lines introduce shorthands so we can match arbitrary vowels and tone marks. Sandhi rules are applied in order, although in this case, at most one of them will apply to any surface form.

Rules 3–6 represent tone Sandhi by showing how to replace intermediate surface strings with final surface strings. Each rule has a left-hand side that is to be replaced by the information on the right-hand side. Numbers like $1 on the right-hand side refer to whatever a variable (in this case, the first variable) on the left-hand side has matched.

Rule 3 indicates that if we see a vowel without a tone mark (indicating mid tone) followed by the jer, we replace it with the vowel (represented by $1) repeated with low tone. This specification follows our assumption that tone and vowel may be treated independently. Rule 4 indicates that a vowel followed by a tone mark and the jer is repeated with mid tone (without a mark). Rules 5 and 6 are similar, but they deal with nasalized vowels.

There is one spelling rule to remove word breaks that would otherwise be present. We have used “-” to indicate that a word break should disappear. We use the following rule to enforce this strategy:

    #sandhi - , => .

That is, a hyphen before a comma removes both.

SY allows the negative future forms (k)ò ní and kò ní to be expressed instead as kìóò. We provide rules of alternation for this purpose:

    #alternative \(k\)ò , ní => kìóò .
    #alternative kò , ní => kìóò .

These alternation rules effectively collapse the three slots, Negator1, Tense, and Negator2 into a single exponence.

5 Processing

The interested reader may see the entire SY theory and run it through our software by directing a browser to http://www.cs.uky.edu/~raphael/KATR.html, where theories for several other languages can also be found. Our software runs in several steps:

1. A Perl script converts the KATR theory into two files: a Prolog representation of the theory and a Perl script for post-processing.

2. A Prolog interpreter runs a query on the Prolog representation.

3. The Perl post-processing script treats the Prolog output.

4. Another Perl script either generates a textual output for direct viewing or HTML output for a browser.

This software is available from the first author under the GNU General Public License (GPL).
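The postprocessing rules of Section 4 can be approximated with ordinary regular expressions. The Python sketch below is a hedged illustration, not the Perl script produced by the KATR software. It assumes that tone marks are represented as Unicode combining characters following their vowel, that ordinary spaces are dropped before the Sandhi rules run, and that commas become word breaks at the end; it omits the #alternative rules. The function name postprocess and the raw input strings in the usage lines are illustrative, the latter assembled by hand from the slot rules of Section 3.2.

```python
import re
import unicodedata

ACUTE, GRAVE, DOT_BELOW = "\u0301", "\u0300", "\u0323"  # combining marks
JER = "\u2193"                                          # the jer, written as a down arrow

# A vowel is a base letter plus an optional dot below, so e and its dotted
# counterpart both match. This class is slightly broader than the seven SY vowels.
VOWEL = f"[aeiou]{DOT_BELOW}?"
TONE = f"[{ACUTE}{GRAVE}]"

# Sandhi rules 3-6 as (pattern, replacement) pairs, tried in order.
SANDHI = [
    # Rule 3: mid-tone vowel + jer -> vowel repeated with low tone.
    (re.compile(f"({VOWEL}){JER}"), rf"\g<1>\g<1>{GRAVE}"),
    # Rule 4: toned vowel + jer -> vowel repeated with mid tone (no mark).
    (re.compile(f"({VOWEL})({TONE}){JER}"), r"\g<1>\g<2>\g<1>"),
    # Rule 5: mid-tone nasalized vowel + jer.
    (re.compile(f"({VOWEL})n{JER}"), rf"\g<1>n\g<1>{GRAVE}n"),
    # Rule 6: toned nasalized vowel + jer.
    (re.compile(f"({VOWEL})({TONE})n{JER}"), r"\g<1>\g<2>n\g<1>n"),
]


def postprocess(raw: str) -> str:
    """Turn a raw atom sequence (commas mark word breaks) into a surface form."""
    s = unicodedata.normalize("NFD", raw)  # separate letters from tone marks
    s = s.replace(" ", "")                 # ordinary spaces carry no meaning
    for pattern, replacement in SANDHI:    # resolve the jer
        s = pattern.sub(replacement, s)
    s = s.replace("-,", "")                # spelling rule: hyphen + comma vanish
    s = s.replace(",", " ")                # remaining commas are word breaks
    return unicodedata.normalize("NFC", s).strip()


if __name__ == "__main__":
    # Emphatic present, 2NotOlder sg: Person "o", stem "m ún", Ending jer.
    print(postprocess("o , m u\u0301 n \u2193"))
    # Positive continuous: the hyphen joins the tense marker to the stem.
    print(postprocess("o\u0300 , n\u0301 - , m u\u0301 n"))
    # Negative past: the floating high tone of Negator2 lands on "ti".
    print(postprocess("o , (k)o\u0300 , t i \u0301 i\u0300 , m u\u0301 n"))
```

Because tone marks are kept as separate combining characters, the floating tone in the negative past needs no special handling: once spaces are removed, the acute simply attaches to the preceding vowel.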

6 Discussion and Conclusions

This exercise demonstrates that the realizational approach to defining language morphology leads to an effective description of SY verbs. We have applied language-specific knowledge and insight to create a default inheritance hierarchy that captures the morphological structure of the language, with slots pertaining to different morphosyntactic properties. In particular, our KATR theory nicely accounts for the slot structure of SY verbs, even though most slots are dependent on multiple morphosyntactic properties, and we are easily able to deal with the tone shifts introduced by the emphatic suffix.

This work is not intended to directly address the problem of parsing, that is, converting surface forms to pairings of lexemes with morphosyntactic properties. We believe that our KATR theory for SY correctly covers all verb forms, but there may certainly be exceptional cases that do not follow the structures we have presented. Such cases are usually easy to account for by introducing information in the leaf node of such lexemes. Further, this work is not in the area of automated learning, so questions of precision and ability to deal with unseen data are not directly relevant.

We have constructed the SY theory in KATR instead of DATR for several reasons.

• We have a very fast KATR implementation, making for speedy prototyping and iterative improvement in morphological theories. This implementation is capable of taking standard DATR theories as well.

• KATR allows bracket notation ({ and }) on the left-hand side of rules, which makes it very easy to specify morphosyntactic properties for queries in any order and without mentioning those properties that are irrelevant to a given rule. Rules in DATR theories tend to have much more complicated left-hand sides, obscuring the morphological rules.

• KATR has a syntax for Sandhi that separates its computation, which we see as postprocessing of surface forms, from the application of morphological rules. It is possible to write rules for Sandhi in DATR, but the rules are both unpleasant to write and difficult to describe.

As we have noted elsewhere (Finkel & Stump 2007), writing KATR specifications requires considerable effort. Early choices color the structure of the resulting theory, and the author must often discard attempts and rethink how to represent the target morphology. The first author, along with Gregory Stump, has built KATR theories for verbs in Hebrew, Slovak, Polish, Spanish, Irish, Shughni (an Iranian language of the Pamir) and Lingala (a Bantu language of the Congo), as well as for parts of Hungarian, Sanskrit, and Pali.

Acknowledgments

We would like to thank Gregory Stump, the first author’s collaborator in designing KATR and applying it to many languages. Lei Shen and Suresh Thesayi were instrumental in implementing our Java™ version of KATR. Nancy Snoke assisted in implementing our Perl/Prolog version.

Development of KATR was partially supported by the US National Science Foundation under Grants IIS-0097278 and IIS-0325063 and by the University of Kentucky Center for Computational Science. The second author is supported by Science Foundation Ireland Grant 05/IN/I886 and Marie Curie Grant MTKD-CT-2006-042563. Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding agencies.

References

Adewọ́lé, L. O. (1988). The categorical status and the function of the Yorùbá auxiliary verb with some structural analysis in GPSG, PhD thesis, University of Edinburgh, Edinburgh.

Akinlabi, A. & Liberman, M. (2000). The tonal phonology of Yoruba clitics, in B. Gerlach & J. Grizenhout (eds), Clitics in phonology, morphology and syntax, John Benjamins Publishing Company, Amsterdam/Philadelphia, pp. 64–66.

Anderson, S. R. (1992). A-morphous morphology, Cambridge University Press.

Corbett, G. G. & Fraser, N. M. (1993). Network Morphology: A DATR account of Russian nominal inflection, Journal of Linguistics 29: 113–142.

Evans, R. & Gazdar, G. (1989). Inference in DATR, Proceedings of the Fourth Conference of the European Chapter of the Association for Computational Linguistics, Manchester, pp. 66–71.

Finkel, R., Shen, L., Stump, G. & Thesayi, S. (2002). KATR: A set-based extension of DATR, Technical Report 346-02, University of Kentucky Department of Computer Science, Lexington, KY. ftp://ftp.cs.uky.edu/cs/techreports/346-02.pdf.

Finkel, R. & Stump, G. (2007). A default inheritance hierarchy for computing Hebrew verb morphology, Literary and Linguistic Computing 22(2): 117–136. dx.doi.org/10.1093/llc/fqm004.

Finkel, R. & Stump, G. (to appear, 2009a). Stem alternations and principal parts in French verb inflection, Cascadilla Proceedings Project.

Finkel, R. & Stump, G. (to appear, 2009b). What your teacher told you is true: Latin verbs have four principal parts, Digital Humanities Quarterly.

Goldsmith, J. A. (1976). Autosegmental phonology, PhD thesis, Massachusetts Institute of Technology, Boston, MA.

Gordon, R. G. (2005). Ethnologue: Languages of the World, 15th edn, SIL International, Dallas, Texas.

Laniran, Y. O. & Clements, G. N. (2003). Downstep and high rising: interacting factors in Yorùbá tone production, J. of Phonetics 31(2): 203–250.

Matthews, P. H. (1972). Inflectional morphology, Cambridge University Press.

Mountford, K. W. (1983). Bambara declarative sentence intonation, PhD thesis, Indiana University, Bloomington, IN.

Stump, G. T. (2001). Inflectional morphology, Cambridge University Press, Cambridge, England.

Zwicky, A. M. (1985). How to describe inflection, Proceedings of the 11th annual meeting of the Berkeley Linguistics Society, pp. 372–386.

Using Technology Transfer to Advance Automatic Lemmatisation for Setswana

Hendrik J. Groenewald Centre for Text Technology (CTexT) North-West University Potchefstroom 2531, South Africa [email protected]

Proceedings of the EACL 2009 Workshop on Language Technologies for African Languages – AfLaT 2009, pages 32–37, Athens, Greece, 31 March 2009. © 2009 Association for Computational Linguistics

Abstract

South African languages (and indigenous African languages in general) lag behind other languages in terms of the availability of linguistic resources. Efforts to improve or fast-track the development of linguistic resources are required to bridge this ever-increasing gap. In this paper we emphasize the advantages of technology transfer between two languages to advance an existing linguistic technology/resource. The advantages of technology transfer are illustrated by showing how an existing lemmatiser for Setswana can be improved by applying a methodology that was first used in the development of a lemmatiser for Afrikaans.

1 Introduction

South Africa has eleven official languages. Of these eleven languages, English is the only language for which ample HLT resources exist. The rest of the languages can be classified as so-called “resource scarce languages”, i.e. languages for which few digital resources exist. However, this situation is changing, since research in the field of Human Language Technology (HLT) has enjoyed rapid growth in the past few years, with the support of the South African Government. Part of this development is a strong focus on the development of core linguistic resources and technologies. One such technology/resource is a lemmatiser.

The focus of this article is on how technology transfer between two languages can help to improve and fast-track the development of an existing linguistic resource. This is illustrated in the way that an existing lemmatiser for Setswana is improved by applying the method that was first used in the development of a lemmatiser for Afrikaans.

The rest of this paper is organised as follows: The next section provides general introductory information about lemmatisation. Section 3 provides specific information about lemmatisation and the concept of a lemma in Setswana. Section 4 describes previous work on lemmatisation in Afrikaans. Section 5 gives an overview of memory based learning (the machine learning techniques used in this study) and the generic architecture developed for machine learning based lemmatisation. Data requirements and the data preparation process are discussed in Section 6. The implementation of a machine learning based lemmatiser for Setswana is explained in Section 7, while some concluding remarks and future directions are provided in Section 8.

2 Lemmatisation

Automatic Lemmatisation is an important process for many applications of text mining and natural language processing (NLP) (Plisson et al, 2004). Within the context of this research, lemmatisation is defined as a simplified process of morphological analysis (Daelemans and Strik, 2002), through which the inflected forms of a word are converted/normalised under the lemma or base-form.

For example, the grouping of the inflected forms 'swim', 'swimming' and 'swam' under the base-form 'swim' is seen as an instance of lemmatisation. The last part of this definition applies to this research, as the emphasis is on recovering the base-form from the inflected form of the word. The base-form or lemma is the simplest form of a word as it would appear as headword in a dictionary (Erjavec and Džeroski, 2004).

Lemmatisation should, however, not be confused with stemming. Stemming is the process whereby a word is reduced to its stem by the removal of both inflectional and derivational morphemes (Plisson et al, 2004). Stemming can thus be viewed as a "greedier" process than lemmatisation, because a larger number of morphemes are removed by stemming than lemmatisation. Given this general background, it would therefore be necessary to have a clear understanding of the inflectional affixes to be removed during the process of lemmatisation for a particular language.

There are essentially two approaches that can be followed in the development of lemmatisers, namely a rule-based approach (Porter, 1980) or a statistically/data-driven approach (Chrupala, 2006). The rule-based approach is a traditional method for stemming/lemmatisation (i.e. affix stripping) (Porter 1980; Gaustad and Bouma, 2002) and entails the use of language-specific rules to identify the base-forms (i.e. lemmas) of word forms.

3 Lemmatisation in Setswana

The first automatic lemmatiser for Setswana was developed by Brits (2006). As previously mentioned, one of the most important aspects of developing a lemmatiser in any language is to define the inflectional affixes that need to be removed during the transformation from the surface form to the lemma of a particular word. In response to this question, Brits (2006) found that only stems (and not roots) can act independently as words and therefore suggests that only stems should be accepted as lemmas in the context of automatic lemmatisation for Setswana.

Setswana has seven different parts of speech. Brits (2006) indicated that five of these seven classes cannot be extended by means of regular morphological processes. The remaining two classes, namely nouns and verbs, require the implementation of alternation rules to determine the lemma. Brits (2006) formalized rules for the alternations and implemented these rules as regular expressions in FSA 6 (Van Noord, 2002), to create finite state transducers. These finite state transducers generated C++ code that was used to implement the Setswana lemmatiser. This lemmatiser achieved a linguistic accuracy figure of 62,17%, when evaluated on an evaluation subset of 295 randomly selected Setswana words. Linguistic accuracy is defined as the percentage of words in the evaluation set that were correctly lemmatised.

4 Lia: Lemmatiser for Afrikaans

In 2003, a rule-based lemmatiser for Afrikaans (called Ragel – “Reëlgebaseerde Afrikaanse Grondwoord- en Lemma-identifiseerder”) [Rule-Based Root and Lemma Identifier for Afrikaans] was developed at the North-West University (RAGEL, 2003). Ragel was developed by using traditional methods for stemming/lemmatisation (i.e. affix stripping) (Porter, 1980; Kraaij and Pohlmann, 1994) and consists of language-specific rules for identifying lemmas. Although no formal evaluation of Ragel was done, it obtained a disappointing linguistic accuracy figure of only 67% in an evaluation on a random 1,000 word data set of complex words. This disappointing result motivated the development of another lemmatiser for Afrikaans.

This “new” lemmatiser (named Lia – “Lemma-identifiseerder vir Afrikaans” [Lemmatiser for Afrikaans]) was developed by Groenewald (2006). The difference between Ragel and Lia is that Lia was developed by using a so-called data driven machine learning method. Machine learning requires large amounts of annotated data. For this purpose, a data set consisting of 73,000 lemma-annotated words was developed. Lia achieves a linguistic accuracy figure of 92,8% when trained on this data set. This result confirms that the machine learning based approach outperforms the rule-based approach for lemmatisation in Afrikaans.

The increased linguistic accuracy figure obtained with the machine learning based approach motivated the research presented in this paper. Since Ragel and the rule-based Setswana lemmatiser obtained comparable linguistic accuracy figures, the question arises whether the application of machine learning techniques, together with the methodology and architecture developed for Lia, can also be utilised to improve on the linguistic accuracy figure obtained by the Setswana rule-based lemmatiser.

5 Methodology

5.1 Memory Based Learning

Memory based learning (Aha et al, 1991) is based on the classical k-NN classification algorithm. k-NN has become known as a powerful pattern classification algorithm (Daelemans et al, 2007), and is considered the most basic instance-based algorithm. The assumption here is that all instances of a certain problem correspond to points in the n-dimensional space (Aha et al, 1991). The nearest neighbours of a certain instance are computed using some form of distance metric Δ(X,Y). This is done by assigning the most frequent category within the found set of most

similar example(s) (the k-nearest neighbours) as the category of the new test example. In case of a tie amongst categories, a tie-breaking resolution method is used.

The memory based learning system on which Lia is based, is called TiMBL (Tilburg Memory-Based Learner). TiMBL was specifically developed with NLP tasks in mind, but it can be used successfully for classification tasks in other domains as well (Daelemans et al, 2007).

5.2 Architecture

The architecture presented in this subsection was first developed and implemented for Lia, the machine learning based lemmatiser for Afrikaans. This same architecture was used for the development of the machine learning based lemmatiser for Setswana. The first step in this “generic” architecture consists of training the system with data. During this phase, the training data is examined and various statistical calculations are computed that aid the system during classification. This training data is then stored in memory as sets of data points. The evaluation instance(s) are then presented to the system and their class is computed by interpolation to the stored data points according to the selected algorithm and algorithm parameters. The last step in the process consists of generating the correct lemma(s) of the evaluation instance(s), according to the class that was awarded during the classification process. The generic architecture of the machine learning based lemmatiser is illustrated in Figure 1.

[Flowchart. Key: Start, Process, Decision, Data. Steps: Choose Algorithm; Compute Statistics on Training Data (input: Training Data); Store Data in Memory; Classify Evaluation Data (input: Evaluation Data); Generate Lemma (output: Lemma).]

Figure 1. Generic Architecture of the Machine Learning Based Lemmatiser.

6 Data

6.1 Data Size

A negative aspect of the Machine Learning method for developing a lemmatiser is that a large amount of lemma-annotated training data is required. Currently, there is a data set available that contains only 2,947 lemma-annotated Setswana words. This is the evaluation data set constructed by Brits (2006) to evaluate the performance of the rule-based Setswana lemmatiser. A data set of 2,947 words is considered to be very small in machine learning terms.

6.2 Data Preparation

Memory based learning requires that lemmatisation be performed as a classification task. The training data should therefore consist of feature vectors with assigned class labels (Chrupala, 2006). The feature vectors for each instance consist of the letters of the inflected word. The class labels contain the information required to transform the involved word form from the inflected form to the lemma.

The class labels are automatically derived by determining the character string (and the position thereof) to be removed and the possible replacement string during the transformation from word-form to lemma. This is determined by firstly obtaining the longest common substring between the inflected word and the manually identified lemma. Once the longest common substring is known, a comparison of the remaining strings in the inflected word form and the lemma indicates the strings that need to be removed (as well as the possible replacement strings) during the transformation from word form to lemma. The positions of the character string to be removed are annotated as L (left) or R (right).

If a word-form and its lemma are identical, the class awarded will be “0”, denoting that the word should be left in the same form. This annotation scheme yields classes like in column four of Table 1.
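As a rough illustration of how the memory-based classification of Section 5.1 could operate on the letter-based feature vectors of Section 6.2, the following minimal sketch stores instances and classifies a new word by a majority vote among its k nearest neighbours. It is not TiMBL: the plain overlap count used as the distance, the tie-breaking rule, and all function names are assumptions made for illustration, and TiMBL's own metrics and feature weighting are not reproduced. The fixed width of 20 letters and the four stored instances are taken from the paper's training-data example.

```python
from collections import Counter

WIDTH = 20  # fixed feature count; the paper pads every word to 20 letters

# Stored instances (word, class) taken from the paper's C4.5 example.
MEMORY = [
    ("tsogatsoga", "0"),
    ("edimolanya", "Rnya>"),
    ("dinyepo", "Ldi>"),
    ("tsiseditse", "Ltsisedi>Rse>la"),
]


def features(word, width=WIDTH):
    """Right-align the letters of a word and pad on the left with underscores."""
    letters = list(word[-width:])  # assumption: longer words keep their last letters
    return ["_"] * (width - len(letters)) + letters


def overlap_distance(x, y):
    """Count the feature positions at which two instances differ."""
    return sum(1 for a, b in zip(x, y) if a != b)


def knn_classify(word, memory, k=1):
    """Return the majority class among the k nearest stored instances.

    Ties between classes are broken in favour of the class whose closest
    member is nearest (an assumed tie-breaking rule).
    """
    target = features(word)
    ranked = sorted(
        (overlap_distance(target, features(w)), label) for w, label in memory
    )
    nearest = ranked[:k]
    votes = Counter(label for _, label in nearest)
    best = max(votes.values())
    tied = {label for label, n in votes.items() if n == best}
    # `nearest` is sorted by distance, so the first tied label is the closest.
    return next(label for _, label in nearest if label in tied)


if __name__ == "__main__":
    print(knn_classify("menoganya", MEMORY))
```

Right-aligning the letters means that words sharing a suffix agree on many feature positions, which suits classes that mostly describe changes at the right-hand side of the word.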

Inflected     Manually Identified   Longest Common   Automatically
Word-Form     Lemma                 Substring        Derived Class
matlhwao      letlhwao              hwao             Lma>le
menoganya     menoga                menoga           Rya>
itebatsa      lebala                ba               Lit>lRtsa>la

Table 1. Data Preparation and Classes.
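The class derivation described in Section 6.2 can be sketched as follows. This is a minimal sketch under stated assumptions, not the author's implementation: the function names, the use of Python's `difflib.SequenceMatcher` to find the longest common substring, and the regular expression for reading a class label back are all illustrative choices. `apply_class` returns `None` when the strings to be removed are not present in the word.

```python
import re
from difflib import SequenceMatcher

# Assumed label grammar: optional "L<remove>><add>" followed by optional
# "R<remove>><add>"; words are assumed to be lowercase.
CLASS_RE = re.compile(r"^(?:L([^>]*)>([^LR>]*))?(?:R([^>]*)>([^>]*))?$")


def derive_class(word, lemma):
    """Derive a class label from an inflected word and its lemma."""
    if word == lemma:
        return "0"
    match = SequenceMatcher(None, word, lemma, autojunk=False).find_longest_match(
        0, len(word), 0, len(lemma)
    )
    # Whatever lies outside the longest common substring must be replaced.
    left_remove, left_add = word[: match.a], lemma[: match.b]
    right_remove = word[match.a + match.size:]
    right_add = lemma[match.b + match.size:]
    label = ""
    if left_remove or left_add:
        label += f"L{left_remove}>{left_add}"
    if right_remove or right_add:
        label += f"R{right_remove}>{right_add}"
    return label


def apply_class(word, label):
    """Apply a class label to a word; return None if it cannot apply."""
    if label == "0":
        return word
    parsed = CLASS_RE.match(label)
    if parsed is None:
        raise ValueError(f"unrecognised class label: {label!r}")
    l_rem, l_add, r_rem, r_add = (g or "" for g in parsed.groups())
    if not word.startswith(l_rem) or not word[len(l_rem):].endswith(r_rem):
        return None  # the strings to remove are not present in the word
    core = word[len(l_rem): len(word) - len(r_rem)]
    return l_add + core + r_add


if __name__ == "__main__":
    for word, lemma in [("matlhwao", "letlhwao"), ("itebatsa", "lebala")]:
        label = derive_class(word, lemma)
        print(word, label, apply_class(word, label))
```

For the first and third rows of Table 1 this yields the classes Lma>le and Lit>lRtsa>la, and applying each class to its inflected form gives back the lemma.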

For example, Table 1 shows that the class of “matlhwao” is Lma>le. This means that the string “ma” needs to be replaced by the string “le” (at the left-hand side of the word) during the transformation from the inflected form “matlhwao” to the lemma “letlhwao”. Accordingly, the class of the word “menoganya” is Rya>, denoting that the string “ya” should be removed at the right-hand side of the inflected form during lemmatisation. In this particular case, there is no replacement string. Some words like “itebatsa” undergo alterations to both sides of the inflected form during lemmatisation. The class Lit>lRtsa>la indicates that the string “it” must be replaced at the left-hand side of the word with the letter “l”, while the string “tsa” should be replaced with the string “la” at the right-hand side of the word.

An example of the training data of the lemmatiser is shown in Figure 2. The data is presented in C4.5 format (Quinlan, 1993) to the memory based learning algorithm, where each feature is separated by a comma. The algorithm requires that every instance must have the same number of features. In order to achieve this, it was decided that each instance should contain 20 features. 20 features were chosen, since less than 1% of the words in the data contained more than 20 letters. All instances were formatted to contain 20 features by adding underscores to the words that contained fewer than 20 letters. The result of this process is displayed in Figure 2.

_,_,_,_,_,_,_,_,_,_,t,s,o,g,a,t,s,o,g,a,0
_,_,_,_,_,_,_,_,_,_,e,d,i,m,o,l,a,n,y,a,Rnya>
_,_,_,_,_,_,_,_,_,_,_,_,_,d,i,n,y,e,p,o,Ldi>
_,_,_,_,_,_,_,_,_,_,t,s,i,s,e,d,i,t,s,e,Ltsisedi>Rse>la

Figure 2. Training Data in C4.5 Format.

7 Implementation

Each of the 2,947 lemma-annotated words in the evaluation data of the rule-based Setswana lemmatiser was formatted in C4.5 format. The data was then split up further into a training data set, consisting of 90% of all the data, with an evaluation set consisting of 10% of all the data. A machine learning based lemmatiser was trained (by utilising default parameter settings) and evaluated with these two datasets. This lemmatiser obtained an accuracy figure of 46.25%. This is a disappointing result when compared to the linguistic accuracy figure of 62.17% obtained with the rule-based Setswana lemmatiser when evaluated on the same data set. Algorithmic parameter optimisation with PSearch (Groenewald, 2008) resulted in an improved accuracy figure of 58.98%. This represents an increase of 12.73 percentage points, but is still less than the accuracy figure obtained by the rule-based lemmatiser.

Error analysis indicated that in some cases the class predicted by TiMBL is conspicuously wrong. This is evident from the instances shown in Table 2, where the assigned classes contain strings to be removed that are not present in the inflected forms.

Inflected Word   Correct Class   Assigned Class
tlamparele       Re>a            Lmm>bRele>a
phologileng      Rileng>a        Regileng>a

Table 2. Instances with Incorrectly Assigned Classes.

Inflected Word   Assigned Class   Class Distribution
tlamparele       Lmm>bRele>a      0             0.934098
                                  Re>a          1.82317
                                  Rele>a        0.914829
                                  Lmm>bRele>a   1.96103
phologileng      Regileng>a       Rileng>a      3.00014
                                  Relang>a      1.24030
                                  Regileng>a    4.20346

Table 3. Instances Containing Additional Class Distribution Information.

For example, the class assigned to the second instance in Table 2 is Regileng>a. This means that the string “egileng” must be replaced with the

character “a” at the right-hand side of the word. However, the inflected word “phologileng” does not contain the string “egileng”, which means that the assigned class is sure to be incorrect.

This problem was overcome by utilizing the TiMBL option (+v db) that adds the class distribution in the nearest neighbour set to the output file. The result of this is an additional output that contains the class distribution information shown in Table 3. The class distribution information contains the nearest classes with their associated distances from the involved evaluation instance. A post-processing script was developed that automatically recognises this type of incorrectly assigned class and replaces the incorrect class with the second most likely class (according to the class distribution). The result of this was a further increase in accuracy to 64.06%. A summary of the obtained results is displayed in Table 4.

Method                                               Linguistic Accuracy
Rule-based                                           62.17%
Machine Learning with default parameter settings     46.25%
Machine Learning with optimised parameter settings   58.98%
Machine Learning with optimised parameter settings
and class distributions                              64.06%

Table 4. Summary of Results.

8 Conclusion

The best result obtained by the machine learning based Setswana lemmatiser was a linguistic accuracy figure of 64.06%. This represents an increase of 1.9 percentage points on the accuracy figure obtained by the rule-based lemmatiser. This seems to be a small increase in accuracy compared to the increase of 25.8 percentage points obtained when using a machine learning based method for Afrikaans lemmatisation. The significance of this result becomes evident when considering the fact that it was obtained by training the machine learning based Setswana lemmatiser with a training data set consisting of only 2,652 instances. This data set is very small in comparison with the 73,000 instances contained in the training data of Lia. The linguistic accuracy figure of 64.06% furthermore indicates that a machine learning based lemmatiser for Setswana that yields better results than a rule-based lemmatiser can be developed with a relatively small data set. We are confident that further increases in the linguistic accuracy figure will be obtained by enlarging the training data set. Future work will therefore entail the employment of bootstrapping techniques to annotate more training data for improving the linguistic accuracy of the machine learning based Setswana lemmatiser.

The most important result of the research presented in this paper is, however, that existing methodologies and research can be applied to fast-track the development of linguistic resources or improve existing linguistic resources for resource-scarce languages, a result that is especially significant in the African context.

Acknowledgments

The author wishes to acknowledge the work of Jeanetta H. Brits, performed under the supervision of Rigardt Pretorius and Gerhard B. van Huyssteen, on developing the first automatic lemmatiser for Setswana.

References

David W. Aha, Dennis Kibler and Marc K. Albert. 1991. Instance-Based Learning Algorithms. Machine Learning, 6:37-66.

Jeanetta H. Brits. 2006. Outomatiese Setswana Lemma-identifisering ‘Automatic Setswana Lemmatisation’. Master’s Thesis. North-West University, Potchefstroom, South Africa.

Grzegorz Chrupala. 2006. Simple Data-Driven Context-Sensitive Lemmatization. Proceedings of SEPLN 2006.

Walter Daelemans, Antal Van den Bosch, Jakub Zavrel and Ko Van der Sloot. 2007. TiMBL: Tilburg Memory-Based Learner. Version 6.1, Reference Guide. ILK Technical Report 07-03.

Walter Daelemans and Helmer Strik. 2002. Actieplan Voor Het Nederlands in de Taal- en Spraaktechnologie: Prioriteiten Voor Basisvoorzieningen. Report for the Nederlandse Taalunie. Nederlandse Taalunie.

Tomaž Erjavec and Sašo Džeroski. 2004. Machine Learning of Morphosyntactic Structure: Lemmatising Unknown Slovene Words. Applied Artificial Intelligence, 18(1):17-40.

Tanja Gaustad and Gosse Bouma. 2002. Accurate Stemming of Dutch for Text Classification. Language and Computers, 45(1):104-117.

Hendrik J. Groenewald. 2007. Automatic Lemmatisation for Afrikaans. Master’s Thesis. North-West University, Potchefstroom, South Africa.

Hendrik J. Groenewald. 2008. PSearch 1.0.0. North-West University, Potchefstroom, South Africa.

Wessel Kraaij and Renee Pohlmann. 1994. Porter’s Stemming Algorithm for Dutch. Informatiewetenschap 1994: Wetenschappelijke bijdragen aan de derde STINFON Conferentie, 1(1):167-180.

Joel Plisson, Nada Lavrač and Dunja Mladenić. 2004. A Rule-based Approach to Word Lemmatization. Proceedings C of the 7th International Multi-Conference Information Society IS 2004, 1(1):83-86.

Martin Porter. 1980. An Algorithm for Suffix Stripping. Program, 14(3):130-137.

John R. Quinlan. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, USA.

RAGEL. 2003. Reëlgebaseerde Afrikaanse Grondwoord- en Lemma-identifiseerder ‘Rule-based Afrikaans Stemmer and Lemmatiser’. http://www.puk.ac.za/opencms/export/PUK/html/fakulteite/lettere/ctext/ragel.html. 11 January 2009.

Gertjan Van Noord. 2002. Finite State Utilities. http://www.let.rug.nl/~vannoord/Fsa/. 12 January 2009.

Part-of-Speech tagging of Northern Sotho: Disambiguating polysemous function words

Gertrud Faaß†‡, Ulrich Heid†, Elsabé Taljard‡, Danie Prinsloo‡
† Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart, Germany
‡ University of Pretoria, South Africa
[email protected] [email protected] [email protected] [email protected]

Abstract

A major obstacle to part-of-speech (=POS) tagging of Northern Sotho (Bantu, S 32) is posed by ambiguous function words. Many are highly polysemous and very frequent in texts, and their local context is not always distinctive. With certain taggers, this issue leads to comparatively poor results (between 88 and 92 % accuracy), especially when sizeable tagsets (over 100 tags) are used. We use the RF-tagger (Schmid and Laws, 2008), which is particularly designed for the annotation of fine-grained tagsets (e.g. including agreement information), and we restructure the 141 tags of the tagset proposed by Taljard et al. (2008) in a way to fit the RF-tagger. This leads to over 94 % accuracy. Error analysis in addition shows which types of phenomena cause trouble in the POS-tagging of Northern Sotho.

1 Introduction

In this paper, we discuss issues of the part-of-speech (POS) tagging of Northern Sotho, one of the eleven official languages of South Africa, spoken in the North-east of the country. Northern Sotho is a Bantu language belonging to the Sotho family (Guthrie, 1967: S32). It is written disjunctively (contrary to e.g. Zulu), i.e. certain morphemes appear as character strings separated by blank spaces. It makes use of 18 noun classes (1, 1a, 2, 2b, 3 to 10, 14, 15, and the locative classes 16, 17, 18, N-, which may be summarized as LOC for their identical syntactic features). A concordial system helps to verify agreement or resolve ambiguities.

We address questions of the ambiguity of function words, in the framework of an attempt to use “standard” European-style statistical POS taggers on Northern Sotho texts.

In the remainder of this section, we briefly discuss our objectives (section 1.1) and situate our work within the state of the art (section 1.2). Section 2 deals with the main issues at stake, the handling of unknown open class words, and the polysemy of Northern Sotho function words. In section 3, we discuss our methodology, summarizing the tagset and the tagging technique used, and reporting results from other taggers. Section 4 is devoted to details of our own results, the effects of the size of training material (4.2), the effects of polysemy and reading frequency (4.3), and it includes a discussion of proposals for quality improvement (Spoustová et al., 2007). We conclude in section 5.

1.1 Objectives

The long term perspective of our work is to support information extraction, lexicography, as well as grammar development of Northern Sotho with POS-tagged and possibly parsed corpus data. We currently use the 5.5 million word University of Pretoria Sepedi Corpus (PSC, cf. de Schryver and Prinsloo (2000)), as well as a 45,000 word training corpus. We aim at high accuracy in the POS-tagging, and at minimizing the number of unknown word forms in arbitrary unseen corpora, by using guessers for the open word classes.

1.2 Recent work

A few publications, so far, deal with POS-tagging of Northern Sotho; most prominently, de Schryver and de Pauw (2007) have presented the MaxTag method, a tagger based on Maximum Entropy

Proceedings of the EACL 2009 Workshop on Language Technologies for African Languages – AfLaT 2009, pages 38–45, Athens, Greece, 31 March 2009. © 2009 Association for Computational Linguistics

Learning (Berger et al., 1996) as implemented in the machine learning package Maxent (Le, 2004). When trained on manually annotated text, it extracts features such as the first and last letter of each word, or the first two and last two letters or the first three and last three letters of each word; it takes the word and the tag preceding and following the item to be tagged, etc., to decide about word/tag probabilities. De Schryver and de Pauw report an accuracy of 93.5 % on unseen data, using a small training corpus of only ca. 10,000 word forms.

Other work is only partly engaged in POS-tagging, e.g. Kotzé’s (2008) finite state analysis of the verb complex of Northern Sotho. This study does not cover all parts of speech and can thus not be directly compared with our work. Taljard et al. (2008) and Van Rooy and Pretorius (2003) present tagsets for Northern Sotho and the closely related language Setswana, but they focus on the definition of the tagsets without discussing their automatic application in detail. In (Prinsloo and Heid, 2005), POS-tagging is mentioned as a step in a corpus processing pipeline for Northern Sotho, but no experimental results are reported.

2 Challenges in tagging Northern Sotho

POS-tagging of Northern Sotho and of any disjunctively written Bantu language has to deal especially with two major issues which are consequences of their morphology and their syntax. One is the presence, in any unseen text, of a number of lexical items which are not covered by the lexicon of the tagger (“unknown words”), and the other is an extraordinarily high number of ambiguous function words.

2.1 Unknown words

In Northern Sotho, nouns, verbs and adverbs are open class items; all other categories are closed word classes: their items can be listed. The open classes are characterized in particular by a rich morphology: nouns can form derivations to express diminutives and augmentatives, as well as locative forms, to name just a few. Adding the suffix -ng to toropo ‘town’, for example, forms toropong, ‘in/at/to town’. For verbs, tense, voice, mood and many other dimensions, as well as nominalization, lead to an even larger number of derived items. Prinsloo (1994) distinguishes 18 clusters of verbal suffixes which give rise to over 260 individual derivation types per verb. Only a few of these derivations are highly frequent in corpus text; however, due to productivity, a large number of verbal derivation types can potentially appear in any unseen text.

For tagging, noun and verb derivations show up as unknown items, and an attempt to cover them within a large lexicon will partly fail due to productivity and recursive applicability of certain affixes. The impact of the unknown material on tagging quality is evident: de Schryver and de Pauw (2007) report 95.1 % accuracy on known items, but only 78.9 % on unknowns; this leads to a total accuracy of 93.5 % on their test corpus. We have carried out experiments with a version of the memory-based tagger, MBT (Daelemans et al., 2007), which arrives at 90.67 % for the known items of our own test corpus (see section 3.2), as opposed to only 59.68 % for unknowns.

To counterbalance the effect of unknown items, we use rule-based and partly heuristic guessers for noun and verb forms (cf. Prinsloo et al. (2008) and Heid et al. (2008)) and add their results to the tagger lexicon before applying the statistical tagger: the possible annotations for all words contained in the text are thus part of the knowledge available to the tagger.

Adverbs are also an open word class in Northern Sotho; so far, we have no tools for identifying them. In high quality tagging, the suggestions of our guessers are examined manually, before they are added to the tagger lexicon.

2.2 Polysemous function words and ambiguity

Function words of Northern Sotho are highly ambiguous, and because of the disjunctive writing system of the language, a number of bound morphemes are written separately from other words. A single form can have several functions. For example, the token -a- is nine-ways ambiguous: it can be a subject concord of class 1 or 6, an object concord of class 6, a possessive concord of class 6, a demonstrative of class 6, a hortative or a question particle or a verbal morpheme indicating present tense or past tense (Appendix A illustrates the ambiguity of -a- with example sentences). Furthermore, the most polysemous function words are also the most frequent word types in corpora. The highly ambiguous item go [1] alone accounts for over 5 % of all occurrences in our training corpus, where 88 types of function words, with an average frequency of well over 200, make up about 40 % of all occurrences.

[1] 11 different functions of go may be distinguished: object concord of class 15, object concord of the locative classes, object concord of the 2nd person singular, subject concord of class 15, indefinite subject concord, subject concord of the locative classes, class prefix of class 15, locative particle, copulative indicating either an indefinite subject, or a subject of class 15 or a locative subject.

The different readings of the function words are not evenly distributed: some are highly frequent, others are rare. Furthermore, many ambiguous function words appear in the context of other function words; thus the local context does not necessarily disambiguate individual function words. This issue is particularly significant with ambiguities between concords which can have the same function (e.g. object) in different noun classes. As mentioned, -a- can be a subject concord of either noun class 1 or 6: though there are some clearcut cases, like the appearance of a noun of class 6 (indicating class 6), or an auxiliary or the conjunction ge in the left context (both rather indicating class 1), there still remain a number of occurrences of -a- in the corpora where only a broader context, sometimes even information from preceding sentences, may help to disambiguate this item.

Consequently, comparing tagging performance across different tagsets does not give very clear results: if a tagset, like the one used by de Schryver and de Pauw (2007), does not distinguish noun classes, obviously a large number of difficult disambiguation cases does not appear at all (their tagset distinguishes, for example, subject and object concord, but gives no information on noun class numbers). For the lexicographic application we are interested in, and more generally as a preparatory step to chunking or parsing of Northern Sotho texts, an annotation providing information on noun classes is however highly desirable.

3 Methodology

3.1 Tagset

There are several proposals for tagsets to be used for Northern Sotho and related languages. Van Rooy and Pretorius (2003) propose a detailed tagset for Setswana, which is fully in line with the guidelines stated by the EAGLES project, cf. Leech and Wilson (1999). This tagset encodes a considerable number of semantic distinctions in its nominal and verbal tags. In Kotzé’s work on the Northern Sotho verb complex (Kotzé, 2008), a number of POS tags are utilized to distinguish the elements of the verb; however, due to Kotzé’s objectives, her classification does not cover other items. De Schryver and de Pauw (2007) use a tagset of only 56 different tags, whereas the proposal by Van Rooy and Pretorius leads to over 100 tags. Finally, Taljard et al. (2008) propose a rather detailed tagset: contrary to the other authors mentioned, they do encode noun classes in all relevant tags, which leads to a total of 141 tags. Furthermore, they encode a number of additional morphosyntactic distinctions on a second level of their tagset, which leads to a total of 262 different classifications of Northern Sotho morphemes.

Our current tagset is inspired by Taljard et al. (2008). However, we disregard some of their second level information for the moment (which in many cases encodes lexical properties of the items, e.g. the subdivision of particles: hortative, question, instrumental, locative, connective, etc.). We use the RF-tagger (Schmid and Laws, 2008) (cf. section 3.3), which is geared towards the annotation of structured tagsets, by separating information which partitions the inventory of forms (e.g. broad word classes) from feature-like information possibly shared by several classes, such as the Sotho noun classes. With this method, we are able to account for Taljard et al.’s (2008) 141 tags by means of only 25 toplevel tags, plus a number of feature-like labels of lower levels. We summarize the properties of the tagsets considered in table 1.

3.2 Training corpus

Our training corpus consists of ca. 45,000 manually annotated word forms, from two text types. Over 20,000 word forms come from a novel of the South African author Oliver K. Matsepe (Matsepe, 1974); over 10,000 forms come from a Ph.D. dissertation by Raphehli M. Thobakgale (Thobakgale, 2005), and another 10,000 from a second Ph.D. dissertation, by Ramalau R. Maila (Maila, 2006). Obviously, this is not a balanced corpus; it was indeed chosen because of its easy accessibility. We use this corpus to train our taggers and to test them; in a ten-fold cross validation, we split the text into ten slices of roughly equal size, train on 9 of them and test on the tenth. In this article, we give figures for the median of these results.
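The evaluation protocol just described (ten slices, train on nine, test on the tenth, report the median) can be sketched as below. This is only an illustrative sketch: the contiguous slicing, the function names, and the most-frequent-tag baseline standing in for a real tagger are assumptions, not the setup used in the paper. The input is assumed to be a list of (token, tag) pairs with at least as many items as folds.

```python
from collections import Counter, defaultdict
from statistics import median


def split_into_slices(items, n=10):
    """Split a sequence into n contiguous slices of roughly equal size."""
    size, extra = divmod(len(items), n)
    slices, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        slices.append(items[start:end])
        start = end
    return slices


def train_baseline(tagged):
    """Stand-in tagger: remember the most frequent tag of every token."""
    counts = defaultdict(Counter)
    for token, tag in tagged:
        counts[token][tag] += 1
    return {token: c.most_common(1)[0][0] for token, c in counts.items()}


def accuracy(model, tagged):
    """Share of tokens whose predicted tag equals the gold tag."""
    correct = sum(1 for token, tag in tagged if model.get(token) == tag)
    return correct / len(tagged)


def cross_validate(tagged, train, evaluate, folds=10):
    """Train on folds-1 slices, test on the held-out slice, return the median."""
    slices = split_into_slices(tagged, folds)
    scores = []
    for i, held_out in enumerate(slices):
        training = [pair for j, s in enumerate(slices) if j != i for pair in s]
        scores.append(evaluate(train(training), held_out))
    return median(scores)


if __name__ == "__main__":
    toy = [("a", "CS.01"), ("a", "CS.06"), ("go", "PART")] * 10  # toy data only
    print(cross_validate(toy, train_baseline, accuracy))
```

Reporting the median rather than the mean keeps a single unusually easy or hard slice from dominating the figure.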

Authors                           No. of tags   ± noun class   Tool?
Van Rooy and Pretorius (2003)     106           - noun class   no
De Schryver and De Pauw (2007)    56            - noun class   yes
Kotzé (2008)                      partial       N.R.           yes
Taljard et al. (2008)             141/262       + noun class   no
This paper                        25/141        + noun class   yes

Table 1: Tagsets for N. Sotho: authors, # of tags, consideration of the noun class system, use in tools
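The last row of Table 1 (25 top-level tags covering 141 full tags) rests on decomposing tags into levels, so that a tag such as N01 becomes N.01 and CS01 becomes CS.01, with the joint probability of a tag taken as the product of the per-level probabilities. The sketch below only illustrates that idea: the regular expression, the function names, and the probability values are assumptions, and it does not reproduce how the RF-tagger estimates the level probabilities.

```python
import re
from math import prod

# Assumed tag shape: an upper-case category followed by a noun class number,
# optionally with a lower-case letter (e.g. class 1a). Other tags are left as-is.
CLASS_TAG = re.compile(r"([A-Z]+)(\d+[a-z]?)")


def split_tag(tag):
    """Split a flat tag with a noun class number into two levels."""
    match = CLASS_TAG.fullmatch(tag)
    return f"{match.group(1)}.{match.group(2)}" if match else tag


def top_level_tags(tags):
    """Collect the distinct top-level categories of a flat tag inventory."""
    return {split_tag(tag).split(".")[0] for tag in tags}


def joint_probability(level_probabilities):
    """Joint probability of a structured tag as the product of its levels."""
    return prod(level_probabilities)


if __name__ == "__main__":
    flat = ["N01", "N02", "CS01", "CS06", "CO06", "PART"]
    print([split_tag(tag) for tag in flat])
    print(sorted(top_level_tags(flat)))
    # Hypothetical per-level probabilities for the levels "CS" and "01".
    print(joint_probability([0.9, 0.8]))
```

Splitting the class number off as a separate level is what keeps the top-level inventory small while the class information stays available to the tagger.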

3.3 Tagging techniques: the RF-tagger

We opt for the RF-tagger (Schmid and Laws, 2008), because it is a Hidden-Markov-Model (HMM) tagger which was developed especially for POS tagsets with a large number of (fine-grained) tags. Tests with our training corpus have shown that this tagger outperforms the Tree-tagger (Schmid, 1994; Schmid, 1995), as shown in figure 1. An additional external lexicon may serve as input, too. The development of the RF-tagger was based on the widely shared opinion that for languages like German or Czech, agreement information (e.g. case, number or person) should preferably appear as part of all appropriate part of speech tags. However, as tagsets increase immensely in size when such information is part of the tags, the data are decomposed, i.e. split into several levels of processing. The probability of each level is then calculated separately (the joint probability of all levels is afterwards calculated as their product). With such methodology, a tag of the German determiner das may contain five levels of information, e.g. ART.Def.Nom.Sg.Neut to describe a definite, nominative singular, neuter determiner (article).

This approach makes sense for the Bantu languages as well, since information on noun class numbers should be part of the noun tags, too, as in Taljard et al.’s (2008) tagset. A noun here is not only tagged “N”, but Nclass, e.g. mohumagadi ‘(married) woman’ as N01. All concords, pronouns or other types that concordially agree with nouns are also labelled with a class number, e.g. o, the subject concord of class 1, is labelled CS01. This approach makes sense, especially in the view of chunking/parsing and reference resolution, because any of those elements can acquire a pronominal function when the noun that they refer to is deleted (Louwrens, 1991).

To utilize the RF-tagger, we split all tags containing noun class numbers into several levels (e.g. the tag N01 becomes N.01). Emphatic and possessive pronouns are represented on three levels (e.g. PROPOSSPERS becomes PRO.POSS.PERS) [2].

[2] Tests have shown that the quantitative pronouns should be treated separately; their tags are thus only split into two levels.

4 Results

In a preliminary experiment, we compared several taggers [3] on our manually annotated data. Apart from the RF-tagger (Schmid and Laws, 2008), we also used the Tree-Tagger (Schmid, 1994), the TnT tagger (Brants, 2000) and MBT (Daelemans et al., 2007).

[3] Check the References section for the availability of these taggers.

4.1 Comparing taggers

The results give a relatively homogeneous picture, with the RF-tagger achieving a median of 94.16 % when utilising a lexicon containing several thousand nouns and verbs. It reaches 91 % accuracy without this lexicon (to simulate similar conditions as for TnT or MBT, where no external lexicon was offered). TnT achieves 91.01 %, and MBT 87.68 %. Data from the Tree-Tagger were not comparable, for they had been obtained at an earlier stage using the lexicon (92.46 %).

4.2 Effects of the size of the training corpus on the tagging results

All probabilistic taggers are in need of training data, the size of which depends on the size of the tagset and on the frequencies of occurrence of each context. De Schryver and de Pauw (2007) demonstrated that when utilizing a tagset that contains only about a third of the tags (56) contained in Taljard et al.’s (2008) tagset (141), their MaxTag POS-tagger reaches a 93.5 % accuracy with a training corpus of only about 10,000 tokens.

Figure 1 depicts the effects of the size of the training corpus on the accuracy figures of the Tree-tagger and the RF-tagger. Tests with training corpora of the sizes 15,000, 30,000 and 45,000 tokens

showed that the results might not improve much if more data is added. The RF-tagger already reaches more than 94 % correctness when utilizing the current 45,000 token training corpus.

Figure 1: Effects of the size of the training corpus on tagger results.

4.3 Effects of the highly polysemous function words of Northern Sotho

The less frequently a token-label pair appears in the corpus, the lower is its probability (leading to the sparse data problem, when probability guesses become unreliable because of low numbers of occurrences). This issue poses a problem for Northern Sotho function words: if they occur very frequently with a certain label, the chances of them being detected with another label are fairly low. This effect is demonstrated in table 2, which describes the detection of the parts of speech of the highly ambiguous function word -a-. The word -a- as PART(icle) occurs only 45 times while -a- as CS.01 occurs 1,182 times. More than 50 % of the particle occurrences (23) are wrongly labelled CS.01 by the tagger. In table 2, we list the correct tags of all occurrences of -a-, as well as the tags assigned to each of them by our tagger. Each block of table 2 is ordered by decreasing numbers of occurrence of each tag in the output of the RF-tagger. For easier reference, the correct tags assigned by the RF-tagger are printed in bold. Table 2 also clearly shows the effect of ambiguous local context on the tagging result: the accuracy of the CS.06-annotation (subject concord of class 6) is considerably lower than that of the more frequent CS.01 (63.08 % vs. 96.45 %), and CS.01 is the most frequent error in the CS.06 assignment process.

-a- as      freq    RF-tagger    sums       %
CS.01       1182    CS.01        1140    96.4
                    CS.06          19     1.6
                    MORPH          14     1.2
                    CDEM.06         3     0.3
                    PART            2     0.2
                    CPOSS.06        2     0.2
                    CO.06           2     0.2
CS.06        176    CS.06         111    63.1
                    CS.01          43    24.4
                    CPOSS.06       10     5.7
                    CDEM.06         5     2.8
                    MORPH           3     1.7
                    PART            3     1.7
                    CO.06           1     0.6
CO.06         18    MORPH           7    38.9
                    CS.01           6    33.3
                    CO.06           3    16.7
                    CS.06           2    11.1
PART          45    CS.01          23    51.1
                    MORPH          11    24.4
                    CDEM.06         5    11.1
                    PART            5    11.1
                    CPOSS.06        1     2.2
CDEM.06       97    CDEM.06        89    91.8
                    CPOSS.06        4     4.1
                    CS.06           2     2.1
                    CS.01           1     1.0
                    PART            1     1.0
CPOSS.06     209    CPOSS.06      186    89.0
                    CDEM.06        12     5.7
                    CS.06           6     2.9
                    PART            4     1.9
                    CS.01           1     0.5
MORPH         89    MORPH          44    49.4
                    CO.06          26    29.2
                    CS.01          15    16.9
                    CPOSS.06        4     4.5
sums        1816                 1816

Table 2: RF-tagger results for -a-

4.4 Suggestions for increasing correctness percentages

Spoustová et al. (2007) describe a significant increase of accuracy in statistical tagging for Czech when rule-based macros are used as a preprocessing step. At an earlier stage of our work (Prinsloo and Heid, 2005), we contemplated adopting a

similar strategy, i.e. to design rule-based macros for the (partial) disambiguation of high-frequency function words. However, the fact that the local context of many function words is similar (i.e. the ambiguity of this local context, see above) is a major obstacle to a disambiguation of single function words by means of rules. Rules would interact in many ways, be dependent on the application order, or disambiguate only partially (i.e. leave several tag options). An alternative would be to design rules for the disambiguation of word or morpheme sequences. This would however amount to partial parsing. The status of such rules within a tagging architecture would then be unclear.

4.5 Effects of tagset size and structure

While a preprocessing with rule-based disambiguation does not seem to be promising, there are other methods of improving accuracy, such as the adaptation of the tagset. Obviously, types appearing in different contexts should have different labels. For example, in the tagset of Taljard et al. (2008), auxiliary verbs are a sub-class of verbs (V_aux). In typical Northern Sotho contexts, however, auxiliaries are surrounded by subject concords, while verbs are only preceded by them. When 'promoting' the auxiliaries to the first level by labelling them VAUX, the RF-tagger result increases by 0.13 % to 94.16 % accuracy. We still see room for further improvement here. For example, ga as PART (either locative particle PART_loc or hortative particle PART_hort) is identified correctly in only 29.2 % of all cases at the moment. The hortative particle usually appears at the beginning of a verbal segment, while the locative in most cases follows the segment. Results may increase to an even higher accuracy when 'promoting' these second level annotations, hort(ative) and loc(ative), to the first annotation level.

5 Conclusions and future work

This article gives an overview of work on POS-tagging for Northern Sotho. Depending on the place of tagging in an overall NLP chain for this language, different choices with respect to the tagset and to the tagging technology may prove adequate.

In our work, which is part of a detailed linguistic annotation of Northern Sotho corpora for linguistic exploration with a view to lexicons and grammars, it is vital to provide a solid basis for chunking and/or parsing, by including information on noun class numbers in the annotation. We found that the RF-tagger (Schmid and Laws, 2008) performs well on this task, partly because it allows us to structure the tagset into layers, and to deal with noun class information in the same way as with agreement features for European languages. We reach over 94 % correctness, which indicates that at least a first attempt at covering the PSC corpus may now be in order.

Our error analysis, however, also highlights a few more general aspects of the POS annotation of Northern Sotho and related languages: obviously, frequent items and items in distinctive local contexts are tagged quite well. When noun class information is part of the distinctions underlying the tagset, function words usable for more than one noun class tend, however, to appear in non-distinctive local contexts and thus to lead to a considerable error rate. Furthermore, we found a few cases of uses of, e.g., subject concords that are anaphoric, with antecedents far away and thus not accessible to tagging procedures based on the local context. These facts raise the question whether, to achieve the highest quality of lexical classification of the words and morphemes of a text, chunking/parsing might be required altogether, rather than tagging.

Our experiments also showed that several parameters are involved in fine-tuning a Sotho tagger. The size and structure of the tagset is one such prominent parameter. Tendencies towards simpler and smaller tagsets obviously conflict with the needs of advanced processing of the texts and of linguistically demanding applications. It seems that tagset design and tool development go hand in hand.

We intend to apply the current version of the RF-tagger to the PSC corpus and to evaluate the results carefully. We expect a substantial gain from the use of the guessers for nouns and verbs, cf. (Prinsloo et al., 2008) and (Heid et al., 2008). Detailed error analysis should allow us to also design specific rules to correct the output of the tagger. Instead of preprocessing (as proposed by Spoustová et al. (2007)), a partial postprocessing may contribute to further improving the overall quality. Rules would then probably have to be applied to particular sequences of words and/or morphemes which cause difficulties in the statistical process.
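The layering of the tagset described in Section 3.3 (N01 becomes N.01, PROPOSSPERS becomes PRO.POSS.PERS) and the product rule for the level probabilities can be illustrated with a small Python sketch. It is not the conversion script used for the experiments. The regular expression, the lookup table `MANUAL_SPLITS` and the function names are assumptions made for illustration only; the real tagset may contain tags that do not follow the "category plus two-digit class" pattern assumed here.

```python
import math
import re

# Hypothetical lookup for tags whose levels cannot be found by a pattern;
# the single entry is the example given in the text.
MANUAL_SPLITS = {"PROPOSSPERS": "PRO.POSS.PERS"}

# Assumption: a class-bearing tag is an upper-case category name followed
# by a two-digit noun class number (e.g. N01, CS01).
CLASS_TAG = re.compile(r"([A-Z]+)(\d{2})")


def to_levels(tag):
    """Return the tag with its levels separated by dots."""
    if tag in MANUAL_SPLITS:
        return MANUAL_SPLITS[tag]
    match = CLASS_TAG.fullmatch(tag)
    if match:
        return match.group(1) + "." + match.group(2)
    return tag  # tags without class information stay on one level


def joint_probability(level_probabilities):
    """Joint probability of a layered tag: the product of its level probabilities."""
    return math.prod(level_probabilities)


print(to_levels("N01"), to_levels("CS01"), to_levels("PROPOSSPERS"))
```

Keeping the class number on its own level means that the first level (N, CS, ...) can be estimated from all occurrences of the category, regardless of noun class.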

References

Adam L. Berger, Stephen Della Pietra, and Vincent J. Della Pietra. 1996. A Maximum Entropy Approach to Natural Language Processing. Computational Linguistics 22(1): pp. 39 – 71.

Thorsten Brants. 2000. TnT - A Statistical Part-of-Speech Tagger. In Proceedings of the Sixth Applied Natural Language Processing Conference ANLP-2000, Seattle, WA.

Walter Daelemans, Jakub Zavrel, Antal van den Bosch. 2007. MBT: Memory-Based Tagger, version 3.1. Reference Guide. ILK Technical Report Series 07–08 [online]. Available: http://ilk.uvt.nl/mbt (10th Jan, 2009).

Gilles-Maurice de Schryver and Guy de Pauw. 2007. Dictionary Writing System (DWS) + Corpus Query Package (CQP): The Case of TshwaneLex. Lexikos 17 AFRILEX-reeks/series 17:2007: pp. 226 – 246. [Online tagger:] http://aflat.org/?q=node/177 (10th Feb, 2009).

Gilles-Maurice de Schryver and Daan J. Prinsloo. 2000. The compilation of electronic corpora with special reference to the African languages. Southern African Linguistics and Applied Language Studies 18(1-4): pp. 89 – 106.

Malcolm Guthrie. 1971. Comparative Bantu: an introduction to the comparative linguistics and prehistory of the Bantu languages, vol 2. Farnborough: Gregg International.

Ulrich Heid, Daan J. Prinsloo, Gertrud Faaß, and Elsabé Taljard. 2008. Designing a noun guesser for part of speech tagging in Northern Sotho (33 pp). ms: University of Pretoria.

Petronella M. Kotzé. 2008. Northern Sotho grammatical descriptions: the design of a tokeniser for the verbal segment. Southern African Linguistics and Applied Language Studies 26(2): pp. 197 – 208.

Zhang Le. 2004. Maximum Entropy Modeling Toolkit for Python and C++ (Technical Report) [online]. Available: http://homepages.inf.ed.ac.uk/s0450736/maxent_toolkit.html (10th Jan, 2009).

Geoffrey Leech, Andrew Wilson. 1999. Standards for Tagsets. In van Halteren (Ed.): Syntactic Wordclass Tagging: pp. 55 – 80. Dordrecht/Boston/London: Kluwer Academic Publishers.

Louis J. Louwrens. 1991. Aspects of the Northern Sotho Grammar, p. 154. Pretoria: Via Afrika.

Ramalau A. Maila. 2006. Kgolo ya tiragatso ya Sepedi [="Development of the Sepedi Drama"]. Doctoral thesis. University of Pretoria, South Africa.

Oliver K. Matsepe. 1974. Tša Ka Mafuri [="From the homestead"]. Pretoria: Van Schaik.

Daan J. Prinsloo. 1994. Lemmatization of verbs in Northern Sotho. SA Journal of African Languages 14(2): pp. 93 – 102.

Daan J. Prinsloo and Ulrich Heid. 2005. Creating word class tagged corpora for Northern Sotho by linguistically informed bootstrapping. In: Isabella Ties (Ed.): LULCL, Lesser used languages and computational linguistics, 27/28-10-2005, Bozen/Bolzano, (Bozen: Eurac) 2006: pp. 97 – 113.

Daan J. Prinsloo, Gertrud Faaß, Elsabé Taljard, and Ulrich Heid. 2008. Designing a verb guesser for part of speech tagging in Northern Sotho. Southern African Linguistics and Applied Languages Studies (SALALS) 26(2).

Helmut Schmid and Florian Laws. 2008. Estimation of Conditional Probabilities with Decision Trees and an Application to Fine-Grained POS Tagging [online]. COLING 2008. Manchester, Great Britain. Available: http://www.ims.uni-stuttgart.de/projekte/corplex/RFTagger/ (10th Jan, 2009).

Helmut Schmid. September 1994. Probabilistic Part-of-Speech Tagging Using Decision Trees. Proceedings of the International Conference on New Methods in Language Processing [online]. Available: http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/ (10th Jan, 2009).

Helmut Schmid. March 1995. Improvements in Part-of-Speech Tagging with an Application to German. Proceedings of the ACL SIGDAT-Workshop.

Drahomíra "johanka" Spoustová, Jan Hajič, Jan Votrubec, Pavel Krbec, and Pavel Květoň. Jun 29, 2007. The best of two worlds: Cooperation of Statistical and Rule-based Taggers for Czech. Balto-Slavonic Natural Language Processing: pp. 67 – 74 [online]. Available: http://langtech.jrc.it/BSNLP2007/m/BSNLP-2007-proceedings.pdf (10th Jan, 2009).

Elsabé Taljard, Gertrud Faaß, Ulrich Heid and Daan J. Prinsloo. 2008. On the development of a tagset for Northern Sotho with special reference to standardisation. Literator 29(1) 2008. Potchefstroom, South Africa.

Raphehli M. Thobakgale. Khuetšo ya OK Matsepe go bangwadi ba Sepedi [="Influence of OK Matsepe on the writers of Sepedi"]. Doctoral thesis. University of Pretoria, South Africa.

Bertus van Rooy and Rigardt Pretorius. 2003. A word-class tagset for Setswana. Southern African Linguistics and Applied Language Studies 21(4): pp. 203 – 222.

Appendix A. The polysemy of -a-

Each entry gives the description of -a-, followed by an example, its morphological analysis, a word-by-word gloss and a translation.

1  Subject concord of nominal cl. 1
     ge monna a fihla
     conjunctive + noun cl. 1 + subject concord cl. 1 + verb stem
     if/when + man + subj-cl1 + arrive
     "when the man arrives"

2  Subject concord of nominal cl. 6
     masogana a thuša basadi
     noun cl. 6 + subject concord cl. 6 + verb stem + noun cl. 2
     young men + subj-cl6 + help + women
     "the young men help the women"

3  Possessive concord of nominal cl. 6
     maoto a gagwe
     noun cl. 6 + possessive concord cl. 6 + possessive pronoun cl. 1
     feet + of + his
     "his feet"

4  Present tense morpheme
     morutiši o a bitša
     noun cl. 1 + subject concord cl. 1 + present tense marker + verb stem
     teacher + subj-cl1 + pres + call
     "the teacher is calling"

5  Past tense morpheme
     morutiši ga o a bitša masogana
     noun cl. 1 + negation morpheme + subject concord cl. 1 + past tense marker + verb stem + noun cl. 6
     teacher + neg + subj-cl1 + past + call + young men
     "the teacher did not call the young men"

6  Demonstrative concord of nominal cl. 6
     ba nyaka masogana a
     subject concord cl. 2 + verb stem + noun cl. 6 + demonstrative concord
     they + look for + young men + these
     "they are looking for these young men"

7  Hortative particle
     a ba tsene
     hortative particle + subject concord cl. 2 + verb stem
     let + subj-cl2 + come in
     "let them come in"

8  Interrogative particle
     a o tseba Sepedi
     interrogative particle + subject concord 2nd pers sg. + verb stem + noun cl. 7
     ques + subj-2nd-pers-sg + know + Sepedi
     "do you know Sepedi"

9  Object concord of nominal cl. 6
     moruti o a biditše
     noun cl. 1 + subject concord cl. 1 + object concord cl. 6 + verb stem
     teacher + subj-cl1 + obj-cl6 + called
     "the teacher called them"
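The effect of this polysemy on tagging can be quantified from the counts reported in Table 2 for -a-. The following Python sketch recomputes the per-tag percentages and an overall accuracy from those counts. The data structure and the function names are illustrative assumptions, and the overall figure is simply derived from the table; it is not a result stated in the text.

```python
# Counts from Table 2: gold tag of -a- -> {tag assigned by the RF-tagger: count}.
CONFUSION = {
    "CS.01": {"CS.01": 1140, "CS.06": 19, "MORPH": 14, "CDEM.06": 3,
              "PART": 2, "CPOSS.06": 2, "CO.06": 2},
    "CS.06": {"CS.06": 111, "CS.01": 43, "CPOSS.06": 10, "CDEM.06": 5,
              "MORPH": 3, "PART": 3, "CO.06": 1},
    "CO.06": {"MORPH": 7, "CS.01": 6, "CO.06": 3, "CS.06": 2},
    "PART": {"CS.01": 23, "MORPH": 11, "CDEM.06": 5, "PART": 5, "CPOSS.06": 1},
    "CDEM.06": {"CDEM.06": 89, "CPOSS.06": 4, "CS.06": 2, "CS.01": 1, "PART": 1},
    "CPOSS.06": {"CPOSS.06": 186, "CDEM.06": 12, "CS.06": 6, "PART": 4, "CS.01": 1},
    "MORPH": {"MORPH": 44, "CO.06": 26, "CS.01": 15, "CPOSS.06": 4},
}


def per_tag_accuracy(confusion):
    """Percentage of occurrences of each gold tag that received that same tag."""
    return {
        gold: 100.0 * assigned.get(gold, 0) / sum(assigned.values())
        for gold, assigned in confusion.items()
    }


def overall_accuracy(confusion):
    """Percentage of all occurrences that were tagged correctly."""
    correct = sum(assigned.get(gold, 0) for gold, assigned in confusion.items())
    total = sum(sum(assigned.values()) for assigned in confusion.values())
    return 100.0 * correct / total


for tag, accuracy in per_tag_accuracy(CONFUSION).items():
    print(f"{tag:9s} {accuracy:5.1f} %")
```

The rare readings (PART, CO.06) come out far below the frequent CS.01 reading, which is the sparse data effect discussed in Section 4.3.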

Development of an Amharic Text-to-Speech System Using Cepstral Method

Tadesse Anberbir
ICT Development Office, Addis Ababa University, Ethiopia
[email protected]

Tomio Takara
Faculty of Engineering, University of the Ryukyus, Okinawa, Japan
[email protected]

Abstract

This paper presents a speech synthesis system for the Amharic language and describes how the important prosodic features of the language were modeled in the system. The developed Amharic Text-to-Speech system (AmhTTS) is a parametric, rule-based system that employs a cepstral method. The system uses a source-filter model for speech production and a Log Magnitude Approximation (LMA) filter as the vocal tract filter. The intelligibility and naturalness of the system were evaluated by word and sentence listening tests respectively; we achieved a 98% correct rate for words and an average Mean Opinion Score (MOS) of 3.2 (which is categorized as good) for sentences. The synthesized speech has high intelligibility and moderate naturalness. Compared with the previous similar study, our system produced speech of similar quality with fairly good prosody. In particular, our system is suitable for extension to new languages with little modification.

1 Introduction

Text-to-Speech (TTS) synthesis is a process which artificially produces synthetic speech for various applications such as services over the telephone, e-document reading, and speaking systems for handicapped people.

The two primary technologies for generating synthetic speech are concatenative synthesis and formant (rule-based) synthesis. Concatenative synthesis produces the most natural-sounding synthesized speech. However, it requires a large amount of linguistic resources, and generating various speaking styles is a challenging task. In general the amount of work required to build a concatenative system is enormous, and it is even more difficult for languages with limited linguistic resources. On the other hand, the formant synthesis method requires few linguistic resources and is able to generate various speaking styles. It is also suitable for mobile applications and easier to customize. However, this method produces less natural-sounding synthesized speech, and the complex rules required to model the prosody are a big problem.

In general, each method has its own strengths and weaknesses and there is always a tradeoff. Therefore, which approach to use will be determined by the intended applications, the availability of linguistic resources for a given language, etc. In our research we used the formant (rule-based) synthesis method because we intend to prepare a general framework for Ethiopian Semitic languages and apply it to mobile devices and web embedded applications.

Currently, many speech synthesis systems are available, mainly for 'major' languages such as English and Japanese, and successful results have been obtained in various application areas. However, thousands of the world's 'minor' languages lack such technologies, and research in the area is very scarce. Although many localization projects (like the customization of Festvox [1]) are under way for many languages, they are quite inadequate, and the localization process is not an easy task, mainly because of the lack of linguistic resources and the absence of similar works in the area. Therefore, there is a strong demand for the development of a speech synthesizer for many of the African minor languages such as Amharic.

Amharic, the official language of Ethiopia, is a Semitic language that has the greatest number of speakers after Arabic. According to the 1998 census, Amharic has 17.4 million speakers as a mother tongue and 5.1 million speakers as a second language. However, it is one of the

[1] Festvox is a voice building framework which offers general tools for building unit selection voices for new languages.

Proceedings of the EACL 2009 Workshop on Language Technologies for African Languages – AfLaT 2009, pages 46–52, Athens, Greece, 31 March 2009. c 2009 Association for Computational Linguistics

least supported and least researched languages in the world. Although the development of different natural language processing (NLP) tools for analyzing Amharic text has recently begun, it is still far behind that of other languages (Alemu et al., 2003). In particular, research on language technologies like speech synthesis and the application of such technologies is very limited or unavailable. To the knowledge of the authors, so far there is only one published work (Sebsibe, 2004) in the area of speech synthesis for Amharic. That study described the issues to be considered in developing a concatenative speech synthesizer using Festvox and recommended using syllables as a basic unit for high quality speech synthesis.

In our research we developed a syllable-based TTS system with a prosodic control method, which is the first rule-based system published for Amharic. The designed Amharic TTS (AmhTTS) is a parametric and rule-based system that employs a cepstral method and uses a Log Magnitude Approximation (LMA) filter. Unlike the previous study (Sebsibe, 2004), our study provides a total solution for prosodic information generation, mainly by modeling the durations. The system is expected to have a wide range of applications, for example in software aids for visually impaired people and in mobile phones, and it can also be easily customized for other Ethiopian languages.

2 Amharic Language's Overview

Amharic (አማርኛ) is a Semitic language and it is one of the most widely spoken languages in Ethiopia. It has its own non-Latin syllabic script called "Fidel" or "Abugida". The orthographic representation of the language is organized into orders (derivatives) as shown in Fig.1. Six of the seven orders are CV (C is a consonant, V is a vowel) combinations, while the sixth order is the consonant itself. In total there are 32 consonants and 7 vowels with 7x32 = 224 syllables. But since there are redundant symbols that represent the same sounds, the phonemes are only 28 (see the Appendix).

1st     2nd     3rd     4th     5th      6th    7th
C/e/    C/u/    C/i/    C/a/    C/ie/    C      C/o/
ቸ       ቹ       ቺ       ቻ       ቼ        ች      ቾ

Figure 1: Amharic Syllables structure

Like other languages, Amharic also has its own typical phonological and morphological features that characterize it. The following are some of the striking features of Amharic phonology that give the language its characteristic sound when one hears it spoken: the weak, indeterminate stress; the presence of glottalic, palatal, and labialized consonants; the frequent gemination of consonants and central vowels; and the use of an automatic helping vowel (Bender et al., 1976).

Gemination in Amharic is one of the most distinctive characteristics of the cadence of the speech, and also carries a very heavy semantic and syntactic functional weight (Bender and Fulass, 1978). Amharic gemination is either lexical or morphological. Gemination as a lexical feature cannot be predicted. For instance, አለ may be read as alä meaning 'he said', or allä meaning 'there is'. Although this is not a problem for Amharic speakers, it is a challenging problem in speech synthesis. As a morphological feature, gemination is more predictable in the verb than in the noun (Bender and Fulass, 1978). However, this is also a challenging problem in speech synthesis, because automatically identifying the location of geminated syllables requires analysis and modeling of the complex morphology of the language. The main problem is that Amharic orthography does not show geminates. In this study, we used our own manual gemination mark (') insertion technique (see Section 3.3.1).

The sixth order syllables are the other important feature of the language. Like geminates, the sixth order syllables are very frequent. In our previous study (Tadesse and Takara, 2006) we found that geminates and sixth order syllables are the two most important features that play a key role for proper pronunciation of words. Therefore, in our study we mainly consider these language specific features to develop a high quality speech synthesizer for the Amharic language.

3 AmhTTS System

The Amharic TTS synthesis system is a parametric and rule-based system whose design is based on the general speech synthesis system (Takara and Kochi, 2000). Fig.2 shows the scheme of the Amharic speech synthesis system. The system has three main components: a text analysis subsystem, a prosodic generation module and a speech synthesis subsystem. The following three sub-sections discuss the details of each component.
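The order structure of Figure 1 can be illustrated with a short Python sketch that recovers the order of a Fidel character from its Unicode code point. This relies on an assumption that is not part of the paper: in the Unicode Ethiopic block, the seven orders of a basic consonant occupy consecutive code points starting at a multiple of 8 (ቸ to ቾ are U+1278 to U+127E). Rows with labialized or irregular forms do not all follow this pattern, so the sketch rejects the eighth position and should not be read as a complete treatment of the script. The function names are hypothetical.

```python
# Vowel of each order as listed in Figure 1; the sixth order is the bare consonant.
ORDER_VOWELS = ["e", "u", "i", "a", "ie", "", "o"]

ETHIOPIC_START, ETHIOPIC_END = 0x1200, 0x137F


def fidel_order(char):
    """Return the order (1 to 7) of a basic Fidel character.

    Assumption: the seven orders of a consonant sit at offsets 0 to 6
    within an aligned group of eight code points.
    """
    code = ord(char)
    if not ETHIOPIC_START <= code <= ETHIOPIC_END:
        raise ValueError("not in the Ethiopic block: %r" % char)
    offset = code % 8
    if offset > 6:
        raise ValueError("not one of the seven basic orders: %r" % char)
    return offset + 1


def first_order_form(char):
    """Return the first-order character of the same consonant row."""
    return chr(ord(char) - ord(char) % 8)


def is_sixth_order(char):
    """True if the character is written without an inherent vowel."""
    return fidel_order(char) == 6


for ch in (chr(c) for c in range(0x1278, 0x127F)):
    print(ch, fidel_order(ch), "C/" + ORDER_VOWELS[fidel_order(ch) - 1] + "/")
```

A text analysis front end could use such a test to find the sixth order syllables, which need the special duration handling described later in the paper.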

Figure 2: Amharic Speech Synthesis System (block diagram: text input → text analysis subsystem → prosodic generation → speech synthesis subsystem → voice output; the speech database supplies the fundamental frequency (F0), the voiced/unvoiced decision and the cepstral coefficients)

3.1 Text Analysis

The text analysis subsystem extracts the linguistic and prosodic information from the input text. The program iterates through the input text and extracts the gemination and other marks, and the sequences of syllables using the syllabification rule. The letter-to-sound conversion is a simple one-to-one mapping between orthography and phonetic transcription (see Appendix). As defined by (Baye, 2008; Dawkins, 1969) and others, Amharic can be considered a phonetic language with a relatively simple relationship between orthography and phonology.

3.2 Speech Analysis and Synthesis systems

First, as a speech database, all Amharic syllables (196) were collected and their sounds were recorded on digital audio tape (DAT) at a 48 kHz sampling rate with 16-bit resolution. After that, they were down-sampled to 10 kHz for analysis. All speech units were recorded at a normal speaking rate.

Then, the speech sounds were analyzed by the analysis system. The analysis system adopts short-time cepstral analysis with a frame length of 25.6 ms and a frame shift of 10 ms. A time-domain Hamming window with a length of 25.6 ms is used in the analysis. The cepstrum is defined as the inverse Fourier transform of the short-time logarithm amplitude spectrum (Furui, 2001). Cepstral analysis has the advantage that it can separate the spectral envelope part and the excitation part. The resulting parameters of a speech unit include the number of frames and, for each frame, the voiced/unvoiced (V/UV) decision, the pitch period and the cepstral coefficients c[m], 0 ≤ m ≤ 29. The speech database contains these parameters as shown in fig.2.

Finally, the speech synthesis subsystem generates speech from the pre-stored parameters under the control of the prosodic rules. For speech synthesis, the general source-filter model is used as a speech production model as shown in fig.3. The synthetic sound is produced using the Log Magnitude Approximation (LMA) filter (Imai, 1980) as the system filter, for which cepstral coefficients are used to characterize the speech sound.

Figure 3: Diagram of Speech Synthesis Model

The LMA filter represents the vocal tract characteristics, which are estimated in the 30 lower-order quefrency elements. The LMA filter is a pole-zero filter that is able to efficiently represent the vocal tract features for all speech sounds. The LMA filter is controlled by the cepstrum parameters as vocal tract parameters, and it is driven by a fundamental period impulse series for voiced sounds and by white noise for unvoiced sounds. The fundamental frequency (F0) of the speech is controlled by the impulse series of the fundamental period. The gain of the filter, i.e. the power of the synthesized speech, is set by the 0th order cepstral coefficient, c[0].

3.3 Prosody Modeling

For any language, appropriate modeling of prosody is the most important issue for developing a high quality speech synthesizer.

In Amharic, segment duration is the most important and useful component in prosody control. It has been shown that, unlike English, in which the rhythm of the speech is mainly characterized by stress (loudness), rhythm in Amharic is mainly marked by longer and shorter syllables depending on the gemination of consonants, and by certain features of phrasing (Bender et al., 1976). Therefore it is very important to model the syllable durations in the AmhTTS system. In this paper we propose a new segmental duration control method for synthesizing high quality speech. Our rule-based TTS system uses a compact rule-based prosodic generation method in three phases:

• modeling geminated consonants duration

• controlling of sixth order syllables duration
• assignment of a global intonation contour

Prosody is modeled by variations of pitch and relative duration of speech elements. Our study deals only with the basic aspects of prosody such as syllable duration and phrase intonation. Gemination is modeled as lengthened duration, in such a way that geminated syllables are modeled on the word level. Phrase level duration is modeled as well to improve prosodic quality. Prosodic phrases are determined in a simplified way by using text punctuation. To synthesize the F0 contour, the Fujisaki pitch model, which superimposes both word level and phrase level prosody modulations, is used (Takara and Jun, 1988).

The following sub-sections discuss the prosodic control methods employed in our system.

3.3.1 Gemination rule

Accurate estimation of segmental duration for different groups of geminate consonants (stops, nasals, liquids, glides, fricatives) is crucial for natural sounding output of the AmhTTS system. In our previous study (Tadesse and Takara, 2006), we studied the durational difference between singletons and geminates in contrastive words and determined the threshold duration for different groups of consonants. The following rule was implemented based on these threshold durations.

The gemination rule is programmed in the system and generates geminates from singletons by using a simple durational control method. It generates geminates by lengthening the duration of the consonant part of the syllable following the gemination mark. Two types of rules were prepared for two groups of consonants, continuant (voiced and unvoiced) and non-continuant (stops and glottalized) consonants. If a gemination mark (') is followed by a syllable with a voiced or unvoiced consonant, then the last three frames of the cepstral parameters (c[0]) of the vowel are adjusted linearly and then 120 ms of frame 1, 2 and 3 of the second syllable is added. Then the second syllable is connected after frame 4. In total, 90 ms of cepstral parameters is added. Otherwise, if a gemination mark (') is followed by a syllable with a glottal or non-glottal consonant, then the last three frames of the cepstral parameters (c[0]) of the vowel are adjusted linearly and then 100 ms of silence is added. Finally the second syllable is directly connected.

Since Amharic orthography does not use a gemination mark, in our study we used our own gemination mark and a manual gemination insertion mechanism for input texts. Although some scholars make use of two dots (፟, which is proposed in UNICODE version 5.1 as 135F) over a consonant to show gemination, so far there is no software which supports this mark.

3.3.2 Sixth order syllables rule

As mentioned earlier, the sixth order syllables are very frequent and play a major role for proper pronunciation of words. The sixth order orthographic syllables, which do not have any vowel unit associated with them in the written form, may take the helping vowel (epenthetic vowel /ix/, see the Appendix) in their spoken form (Dawkins, 1969). The sixth order syllables are ambiguous; they can stand for either a consonant in isolation or a consonant with the short vowel. In our study we prepared a simple algorithm to control the sixth order syllables duration. The algorithm uses the following rules:

1. The sixth order syllables at the beginning of a word are always voweled (see /sxix/ in fig.5).
2. The sixth order syllables at the end of a word are unvoweled (without vowel), but if the syllable is geminated, it becomes voweled.
3. The sixth order syllables in the middle of words are always unvoweled (see /f/ in fig.5). But if there is a cluster of three or more consonants, it is broken up by inserting the helping vowel /ix/.

The following figures show sample words synthesized by our system by applying the prosodic rules. Fig.5 and fig.7 show the waveform and duration of words synthesized by applying both the gemination and the sixth order syllables rules. Fig.4 and fig.6 show the waveforms of the original words for comparison only. The two synthesized words are contrastive words which differ only by the presence or absence of gemination. In the first word /sxixfta/ ሽፍታ meaning 'bandit', the sixth order syllable /f/ is unvoweled (see fig.5). However, in the second word /sxixffixta/ ሽፍታ meaning 'rash', the sixth order syllable /f/ is voweled and longer /ffix/ (see fig.7) because it is geminated. In our previous study (Tadesse and Takara, 2006), we observed that vowels are needed for singletons to be pronounced as geminates.
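The three sixth order rules above can be sketched in Python as follows. This is an illustrative reading of the rules, not the implementation used in AmhTTS. The `Syllable` type, the function name and the phoneme strings are hypothetical. Two points are assumptions that the text does not spell out: a geminated consonant is counted as two consonants when looking for a cluster of three or more, and the helping vowel is placed after the sixth order consonant that completes such a cluster.

```python
from typing import NamedTuple


class Syllable(NamedTuple):
    consonant: str
    vowel: str                # "" marks a sixth order syllable
    geminated: bool = False   # set when preceded by the gemination mark (')


HELPING_VOWEL = "ix"


def realize(word):
    """Return the phoneme sequence of a word given as a list of Syllable."""
    phones = []
    pending = 0               # consonants emitted since the last vowel
    last = len(word) - 1
    for i, syl in enumerate(word):
        weight = 2 if syl.geminated else 1
        onset = syl.consonant * weight
        if syl.vowel:         # orders 1 to 5 and 7: the vowel is written
            phones += [onset, syl.vowel]
            pending = 0
            continue
        if i == 0:            # rule 1: word-initial, always voweled
            voweled = True
        elif i == last:       # rule 2: word-final, voweled only if geminated
            voweled = syl.geminated
        else:                 # rule 3: word-medial, voweled only to break a cluster
            next_weight = 2 if word[i + 1].geminated else 1
            voweled = pending + weight + next_weight >= 3
        phones.append(onset)
        if voweled:
            phones.append(HELPING_VOWEL)
            pending = 0
        else:
            pending += weight
    return phones


bandit = [Syllable("sx", ""), Syllable("f", ""), Syllable("t", "a")]
rash = [Syllable("sx", ""), Syllable("f", "", geminated=True), Syllable("t", "a")]
print("".join(realize(bandit)), "".join(realize(rash)))
```

With these assumptions the two contrastive words of the text come out as /sxixfta/ and /sxixffixta/, matching the transcriptions given for figures 5 and 7.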

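The short-time cepstral analysis described in Section 3.2 (10 kHz sampling, 25.6 ms Hamming window, 10 ms frame shift, coefficients c[0] to c[29]) can be sketched in plain Python. At 10 kHz the window is 256 samples and the shift is 100 samples. The sketch uses a direct DFT for clarity rather than speed, and the small constant added before the logarithm is an assumption to avoid log(0). Pitch extraction and the voiced/unvoiced decision are not shown, and the function names are hypothetical.

```python
import cmath
import math

FRAME_LENGTH = 256   # 25.6 ms at 10 kHz
FRAME_SHIFT = 100    # 10 ms at 10 kHz
N_COEFFS = 30        # c[0] .. c[29]


def hamming(n):
    """Time-domain Hamming window of n samples."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]


def split_frames(signal, length=FRAME_LENGTH, shift=FRAME_SHIFT):
    """Cut the signal into overlapping frames of full length."""
    return [signal[s:s + length] for s in range(0, len(signal) - length + 1, shift)]


def real_cepstrum(frame, n_coeffs=N_COEFFS):
    """Inverse Fourier transform of the log amplitude spectrum of one frame."""
    n = len(frame)
    windowed = [s * w for s, w in zip(frame, hamming(n))]
    log_magnitude = []
    for k in range(n):
        bin_value = sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n)
                        for t in range(n))
        log_magnitude.append(math.log(abs(bin_value) + 1e-12))
    cepstrum = []
    for m in range(n_coeffs):
        value = sum(log_magnitude[k] * cmath.exp(2j * math.pi * k * m / n)
                    for k in range(n))
        cepstrum.append(value.real / n)
    return cepstrum


# Usage: analyse the first frame of a synthetic 200 Hz tone sampled at 10 kHz.
tone = [math.sin(2 * math.pi * 200 * t / 10000) for t in range(456)]
coefficients = real_cepstrum(split_frames(tone)[0])
print(len(split_frames(tone)), len(coefficients))
```

The low-order coefficients describe the spectral envelope, which is why the paper keeps only the first 30 of them as vocal tract parameters.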
49 3.3.4 Intonation /sx ix/ /f/ /t a/ The intonation for a sentence is implemented by applying a simple declination line in the log fre- quency domain adopted from similar study for Japanese TTS system by Takara and Jun (1988). Figure 4: Waveform & duration of original word Fig.8 shows the intonation rule. The time ti is the ሽፍታ /sxixfta/, meaning ‘bandit’ initial point of syllable, and the initial value of F0 (circle mark) is calculated from this value. This is a simple linear line, which intends to ex- periment the very first step rule of intonation of /sx ix/ /f/ /t a/ Amharic. In this study, we simply analyzed some sample sentences and take the average slope = - 0.0011. But as a future work, the sentence pros-

Figure 5: Waveform & duration of synthesized ody should be studied more. ሽፍታ word ሽፍታ /sxixfta/, meaning ‘bandit’ -0.0011 tn fn = f0 * 2 f0

f1 f2 /sx ix/ /ff ix/ /t a/ f3

Figure 6: Waveform & duration of original word Log frequency [Hz] ሽፍታ /sxixffixta/, meaning ‘rash’ 0 t1 t2 t3 Time [frame] Figure 8: Intonation rule

Figure 7: Waveform & duration of synthesized word ሽፍታ /sxixffixta/, meaning ‘rash’ (syllable segments: /sx ix/ /ff ix/ /t a/)

3.3.3 Syllables connection rules

For syllable connections, we prepared four types of syllable connection rules based on the type of consonants. The syllable units, which are represented by cepstrum parameters and stored in the database, are connected based on the types of consonants joining with vowels. The connections are implemented either by smoothing or by interpolating the cepstral coefficients, F0 and amplitude at the boundary. Generally, we derive two types of syllabic joining patterns. The first pattern is a smoothly continuous linkage where pitch, amplitude and spectral assimilation occur at the boundary. This pattern occurs when the boundary phonemes of the joining syllables are unvoiced. The other joining pattern is interpolation; this pattern occurs when one or both of the boundary phonemes of the joining syllables are voiced. If the boundary phonemes are plosives or glottal stops, then a pre-plosive or glottal-stop closure pause 40 ms in length is inserted between them.

4 Evaluation and Discussion

In order to evaluate the intelligibility and naturalness of our system, we performed two types of listening tests. The first listening test was performed to evaluate the intelligibility of words synthesized by the system and the second listening test was to evaluate the naturalness of synthesized sentences. The listening tests were used to evaluate the effectiveness of the prosodic rule employed in our system.

4.1 Recordings

For both listening tests the recording was done in a soundproof room, with a digital audio tape recorder (SONY DAT) and a SONY ECM-44S Electret Condenser Microphone. The sampling rate of the DAT was set at 48 kHz, and the recorded data were then transferred from the DAT to a PC via a digital audio interface (A/D, D/A) converter. Finally, the data were converted from stereo to mono, downsampled to 10 kHz, and amplitude-normalized using the Cool Edit program. All recording was done by a male native speaker of the language who was not included in the listening tests.
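The preprocessing chain described for the recordings (stereo to mono, downsampling from 48 kHz to 10 kHz, amplitude normalization) can be sketched roughly as follows. This is only an illustration under stated assumptions: the paper does not say which algorithms Cool Edit applied, so the moving-average low-pass filter, the linear-interpolation resampler, the peak normalization, and all function names here are hypothetical stand-ins rather than the authors' procedure.

```python
# Illustrative sketch only: the filter, resampler and normalization below
# are assumptions, not the algorithms used by Cool Edit in the paper.

def to_mono(stereo):
    """Average the left and right channels of (left, right) sample pairs."""
    return [(left + right) / 2.0 for left, right in stereo]


def moving_average(samples, width):
    """Crude low-pass filter: centered moving average over `width` samples."""
    half = width // 2
    out = []
    for i in range(len(samples)):
        lo = max(0, i - half)
        hi = min(len(samples), i + half + 1)
        out.append(sum(samples[lo:hi]) / (hi - lo))
    return out


def resample_linear(samples, src_rate, dst_rate):
    """Resample by linear interpolation between neighbouring input samples."""
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for k in range(n_out):
        pos = k * src_rate / dst_rate      # fractional position in the input
        i = int(pos)
        frac = pos - i
        nxt = samples[min(i + 1, len(samples) - 1)]
        out.append(samples[i] * (1.0 - frac) + nxt * frac)
    return out


def normalize_peak(samples, peak=1.0):
    """Scale so that the largest absolute sample equals `peak`."""
    largest = max((abs(s) for s in samples), default=0.0)
    if largest == 0.0:
        return list(samples)               # silence: nothing to scale
    return [s * peak / largest for s in samples]


def preprocess(stereo, src_rate=48000, dst_rate=10000):
    """Stereo -> mono -> low-pass -> downsample -> peak-normalize."""
    mono = to_mono(stereo)
    width = max(1, round(src_rate / dst_rate))  # about 5 samples for 48k -> 10k
    filtered = moving_average(mono, width)
    downsampled = resample_linear(filtered, src_rate, dst_rate)
    return normalize_peak(downsampled)
```

A production pipeline would normally use a proper anti-aliasing filter before decimation; the moving average is used here only to keep the sketch dependency-free.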

4.2 Speech Materials

The stimuli for the first listening test consisted of 200 words selected from an Amharic-English dictionary. The selected words are commonly and frequently used in day-to-day activities and were adopted from (Yasumoto and Honda, 1978). Among the 200 words we selected, 80 words (40% of words) contain one or more geminated syllables and 75% of the words contain sixth order syllables. This shows how frequent geminates and sixth order syllables are.

Then, using these words, two types of synthesized speech data were prepared: analysis/synthesis sounds and rule-based synthesized sounds using the AmhTTS system. The original speech sounds were also added to the test for comparison purposes.

For the second listening test we used five sentences which contain words with either geminated syllables or sixth order syllables or both. The sentences were selected from the Amharic grammar book by Baye (2008), where they are used as examples. We prepared three kinds of speech data: original sentences, analysis/synthesis sentences, and sentences synthesized by our system by applying prosodic rules. In total we prepared 15 speech sounds.

4.3 Methods

Both listening tests were conducted with four Ethiopian adults who are native speakers of the language (2 female and 2 male). All listeners are 20-35 years old, born and raised in the capital city of Ethiopia. For both listening tests we prepared listening test programs, and a brief introduction was given before the listening test.

In the first listening test, each sound was played once at 4 second intervals and the listeners wrote the word they heard in Amharic script on the given answer sheet.

In the second listening test, for each listener, we played all 15 sentences together in random order. Each subject listened to the 15 sentences and gave a judgment score using the listening test program, on the following quality scale: (5 - Excellent, 4 - Good, 3 - Fair, 2 - Poor, 1 - Bad). They evaluated the system by considering the naturalness aspect. Each listener did the listening test fifteen times and we took the last ten results, considering the first five tests as training.

4.4 Results and discussion

After collecting all listeners' responses, we calculated the average values and found the following results.

In the first listening test, the average correct-rate for original and analysis/synthesis sounds was 100% and that of rule-based synthesized sounds was 98%. We found the synthesized words to be very intelligible.

In the second listening test the average Mean Opinion Score (MOS) for synthesized sentences was 3.2, and those of original and analysis/synthesis sentences were 5.0 and 4.7 respectively. The result showed that the prosodic control method employed in our system is effective and produced fairly good prosody. However, durational modeling alone may not be enough to properly generate natural sound. Appropriate syllable connection rules and proper intonation modeling are also important. Therefore, studying the typical intonation contour by modeling word level prosody and improving the syllable connection rules by using quality speech units are necessary for synthesizing high quality speech.

5 Conclusions and future works

We have presented the development of a syllable-based AmhTTS system capable of synthesizing intelligible speech with fairly good prosody. We have shown that syllables produce reasonably natural quality speech and that durational modeling is crucial for naturalness. However, the system still lacks naturalness and needs automatic gemination assignment mechanisms for better durational modeling.

Therefore, as future work, we will mainly focus on improving the naturalness of the synthesizer. We are planning to improve the duration model using the data obtained from the annotated speech corpus, to properly model the co-articulation effect of geminates and to study the typical intonation contour. We are also planning to integrate a morphological analyzer for automatic gemination assignment and sophisticated generation of prosodic parameters.

References

Atelach Alemu, Lars Asker and Mesfin Getachew. 2003. Natural Language Processing For Amharic: Overview And Suggestions for a Way Forward, Proc. 10th Conference ‘Traitement Automatique des Langues Naturelles’, pp. 173-182, Vol. 2, Batz-Sur-Mer, France.

Sebsibe H/Mariam, S P Kishore, Alan W Black, Rohit Kumar, and Rajeev Sangal. 2004. Unit Selection Voice for Amharic Using Festvox, 5th ISCA Speech Synthesis Workshop, Pittsburgh.

M.L. Bender, J.D. Bowen, R.L. Cooper and C.A. Ferguson. 1976. Language in Ethiopia, London, Oxford University Press.

M. Lionel Bender, Hailu Fulass. 1978. Amharic Verb Morphology: A Generative Approach, Carbondale.

Tadesse Anberbir and Tomio Takara. 2006. Amharic Speech Synthesis Using Cepstral Method with Stress Generation Rule, INTERSPEECH 2006 ICSLP, Pittsburgh, Pennsylvania, pp. 1340-1343.

T. Takara and T. Kochi. 2000. General speech synthesis system for Japanese Ryukyu dialect, Proc. of the 7th WestPRAC, pp. 173-176.

Baye Yimam. 2008. የአማርኛ ሰዋሰው (“Amharic Grammar”), Addis Ababa. (in Amharic).

C.H. Dawkins. 1969. The Fundamentals of Amharic, Bible Based Books, SIM Publishing, Addis Ababa, Ethiopia, pp. 5-7.

S. Furui. 2001. Digital Speech Processing, Synthesis, and Recognition, Second Edition, Marcel Dekker, Inc., pp. 266-270.

S. Imai. 1980. Log Magnitude Approximation (LMA) filter, Trans. of IECE Japan, J63-A, 12, pp. 886-893. (in Japanese).

Tomio Takara and Jun Oshiro. 1988. Continuous Speech Synthesis by Rule of Ryukyu Dialect, Trans. IEEE of Japan, Vol. 108-C, No. 10, pp. 773-780. (in Japanese).

B. Yasumoto and M. Honda. 1978. Birth Of Japanese, pp. 352-358, Taishukan-Shoten. (in Japanese).

Appendix

Amharic Phonetic List, IPA Equivalence and its ASCII Transliteration Table. (Mainly adopted from (Sebsibe, 2004; Baye, 2008))

Building Capacities in Human Language Technology for African Languages

Tunde Adegbola
African Languages Technology Initiative (Alt-i), Ibadan, Nigeria
[email protected]

Abstract

The development of Human Language Technology (HLT) is one of the important by-products of the information revolution. However, the level of knowledge and skills in HLT for African languages remains unfortunately low as most scholars continue to work within the frameworks of knowledge production for an industrial society while the information age dawns. This paper reports the work of African Languages Technology Initiative (Alt-i) over a five-year period, and thereby presents a proposal for the acceleration of the development of knowledge and skills in HLT for African languages.

1 Introduction

The world is undergoing a transformation from industrial economies to an information economy, in which the indices of value are shifting from material to non-material resources. This transformation has been rightly described as a revolution because of the height of its pace and intensity. Of necessity therefore, the response to the changes brought about by the information revolution has to be commensurate, both in pace and intensity.

At the root of the information revolution is the development of digital technology which has brought about a major shift in the way we conceptualize, describe and anticipate our world. One of the salient social imperatives of the information revolution is the need for humans to communicate through and with machines. This has brought about the need to make machines capable of handling natural language used by humans as against formal language used by machines. The field of Human Language Technology (HLT) was developed to provide the necessary knowledge and skills that will enhance the effectiveness and efficiency with which machines mediate communication between humans as well as facilitate communication between humans and machines.

So far, developments in HLT have not sufficiently addressed African languages. This is attributable to a low level of awareness of the importance of, and lack of interest in, HLT among scholars of African languages and scholars of technology on the African continent. Furthermore, there is little or no immediate economic incentive in working in HLT. Consequently, there is a dearth of scholars with the requisite impetus, knowledge and skills to support the development of HLT for African languages. Hence, even though there are pockets of activities in Africa, the level of development of HLT for African languages remains low.

The circumstances that necessitated the development of HLT have been described, and rightly so, as a revolution. The response therefore has to be commensurate both in pace and intensity. Hence, the need for accelerated development of HLT for African languages is urgent.

This paper presents a proposal aimed at accelerating the development of HLT for African languages based on the experiences gathered at African Languages Technology Initiative (Alt-i) over a five-year period of activities mainly in Nigeria.

2 The State of Language technology in Africa

The application of Language technology to African languages is relatively new and most efforts seem to be incidental. The most consistent efforts motivated and guided by national policy come from South Africa while projects in other countries are based primarily on private initiatives. In a report on HLT development in Sub-Saharan Africa, Justus Roux (2008) reported nine organizations involved in HLT activities in South Africa, one organization in West Africa and two in East Africa. Seven out of the nine organizations in South Africa are based in universities, one is a Semi-Government institution and one is an agency of the Government. The seven universities with HLT projects are:

Proceedings of the EACL 2009 Workshop on Language Technologies for African Languages – AfLaT 2009, pages 53–58, Athens, Greece, 31 March 2009. © 2009 Association for Computational Linguistics

• University of Cape Town
• University of Limpopo
• University of the North West (Potchefstroom)
• University of Pretoria
• University of South Africa
• University of Stellenbosch
• University of the Witwatersrand (Johannesburg)

The Meraka institute is semi-governmental while there is a Human Language Technology Unit under the Department of Arts and Culture of the Government of South Africa.

In West Africa the only organization reported is African Languages Technology Initiative (Alt-i), while the two reported in East Africa are The Djibouti Center for Speech Research and Technobyte Speech Technologies in Kenya.

Apart from the organisations reported above, there are individual efforts in some universities. These include Dr. Odetunji Odejobi working on Text to Speech Synthesis at Obafemi Awolowo University, Ile Ife, Nigeria, Dr. Wanjiku Ng'ang'a working on Machine Translation and Dr. Peter Wagacha working on Machine Learning, both at the University of Nairobi, Kenya.

Apart from these organizations and individuals that are strictly located in Africa, there are a number of other efforts in various parts of the world that address HLT for African languages, usually in cooperation with some organizations in Africa. Examples include:

• Local Language Speech Technology Initiative (LLSTI), a project of Outside Echo in the UK
• West African Language Documentation, a project of the University of Bielefeld, Germany in collaboration with the University of Uyo, Nigeria and the University of Cocody, Cote D'Ivoire.

Also, there are significant short-term activities on language technology for African languages both within and outside Africa which have not been sufficiently publicized. For example, in 2002, there was an undergraduate project in Yoruba-English machine translation at the St Mary's College of Maryland, USA.1

1 It is also necessary to mention the efforts of Bisharat and SIL. Even though both organizations are not strictly founded for developing language technology for African languages, they have both done important work in making various resources and tools for developing language technology for African languages available.

3 On-going Alt-i Activities

Since its inception, Alt-i has done more work in Yoruba than in other African languages. This is due primarily to the ready availability of intellectual and other resources for Yoruba at Alt-i's base in Ibadan. However, work in a few other languages with available local resources has also been undertaken. The main projects undertaken so far are:

3.1 Automatic Speech Recognition (ASR) of Yoruba

The ASR project started in 2001, and is still on-going. A PhD thesis with the title of Application of Tonemic Information for Search-Space Reduction in the Automatic Speech Recognition of Yoruba is one of the results of the ASR project. The project approaches ASR of Yoruba from the point of view that Yoruba tones carry so much information that “talking drums” can “speak” the Yoruba language; hence, ASR of Yoruba (and probably other African tone languages) should be based primarily on tones or at least should direct considerable computational resources towards correct identification of tones. Experiments have shown that a tone-guided search of the recognition space as proposed in the above PhD thesis leads to improvement in recognition speed and accuracy. ASR efforts are continuing within the project “Redefining Literacy”, Alt-i's main on-going project funded by the Open Society Initiative for West Africa (OSIWA).

3.2 Text to Speech (TTS) synthesis of Yoruba

This project was conceived and initiated in 2002. TTS is a major component of the “Redefining Literacy” project but we have not yet succeeded in attracting funding for this component of the project. Hence, it is in abeyance. However, one of our Associates, Dr. Odetunji Odejobi of the Department of Computer Science and Engineering of the Obafemi Awolowo University, Ile Ife is actively working on TTS and we are collaborating with him on this project. Dr. Odejobi applies fuzzy logic to formalise Yoruba prosody.

3.3 Machine Translation

Work is on-going on Igbo-English and Yoruba-English Machine Translation. The machine translation projects are not funded at present, but

they are taking advantage of the efforts of student volunteers from the Department of Linguistics and African Languages as well as the Africa Regional Center for Information Science, both at the University of Ibadan. The main thrust of the present stage of the project is identifying and developing formal specifications for various challenges of rule-based machine translation as it relates to translation between Igbo/Yoruba and English.

3.4 Yoruba Spelling checker

As a member of the African Network of Localizers (AnLoc), Alt-i is developing a spelling checker for Yoruba in Open Office. This has provided the opportunity to undertake a computational study of Yoruba morphology. Staff and students of the Department of Linguistics and African Languages at the University of Ibadan are playing an active role in this project. The work is producing new insights for interpreting the existing literature of Yoruba morphology and is already leading to interest in similar projects for other languages. As at the time of writing, a dictionary file of about 5000 Yoruba root words and over 100 highly productive affix rules have been developed. Even though some important Yoruba morphological rules cannot be efficiently coded in Hunspell (the software on which the spelling checker is based), the modest dictionary and affix files have produced a useful spelling checker. The project is funded by the International Development Research Center (IDRC) of Canada.

3.5 Automatic diacritic application for Yoruba

As a by-product of the Yoruba spelling checker project, Alt-i is developing an automatic diacritic application program, using the Bayesian learning approach. This project is not funded, but it is taking due advantage of some of the resources, particularly the corpus produced in the IDRC funded spelling checker project. Work is on-going to expand the corpus used for the automatic diacritic application program.

3.6 Localization of Microsoft Vista and Office Suite

Alt-i was appointed by Microsoft as moderators for the localization of Microsoft Vista and Office Suite into Hausa, Igbo and Yoruba. This project is making steady progress.

3.7 Academic assistance to the University of Ibadan

Apart from the above projects, Alt-i offers academic support to the University of Ibadan. The Executive Director of Alt-i teaches post-graduate courses in Artificial Intelligence and Information Networking as well as supervises post-graduate projects at the Africa Regional Center for Information Science (ARCIS) in the University of Ibadan. He also gives various levels of support in the supervision of post-graduate projects, particularly in the area of acoustic analysis of speech at the Department of Linguistics and African Languages.

Some of the HLT issues addressed in the Master in Information Science projects between 2002 and the present include: Statistical Language Model (SLM) of Yoruba, Man-Machine Communication in Yoruba, Machine Translation between spoken and signed Yoruba for the deaf and impaired in hearing, Yoruba phonology multimodal learning courseware and phonetically motivated automatic language identification.

Many PhD students of the Department of Linguistics and African Languages have also enjoyed intellectual support and use of Alt-i's speech laboratory in their studies.

3.8 Support to other universities and scholarly associations

Many staff members and students from far and wide travel to Ibadan to use Alt-i's speech laboratory. PhD students as well as faculty members from the University of Lagos, University of Ilorin, University of Benin and University of Abuja come regularly to use the facilities. At present a member of the teaching staff of the Department of Systems Engineering of the University of Lagos is undertaking a PhD programme on ASR of Yoruba. This PhD candidate visits Ibadan frequently and regularly to use Alt-i's library and speech laboratory, as well as consult Alt-i staff.

Alt-i is involved in the activities of various scholarly associations such as the West African Linguistic Society (WALS), Linguistics Association of Nigeria (LAN) and the Yoruba Studies Association of Nigeria (YSAN). In 2004, Alt-i collaborated with the West African Linguistic Society to organize the West African Languages Congress with the theme: Globalisation and the Future of African Languages. Alt-i's collaboration in the organization of this congress influenced the proceedings towards language technology, which brought about great awareness of language technology issues among West African linguists.

3.9 Bridge building seminar series

Alt-i runs a Bridge Building Seminar series as a way of encouraging cross-disciplinary studies in universities and research centers. These seminars have so far been run in eight Nigerian universities and the National Institute for Nigerian Languages (NINLAN). The one-day seminars bring together scholars from Linguistics, Literature, Psychology, Mathematics, Physics, Computer Science and other relevant departments to build awareness of language technology problems and the need for knowledge and skills from a wide range of departments for their solutions.

4 Observations

While undertaking the above activities in the last five years between 2003 and 2008, the following observations were made:

• The intellectual resources needed for developing knowledge, skills and academic programmes in Language Technology are largely available in Nigerian universities.
• The lack of awareness of the need to address language technology problems has made it difficult to harness and direct these resources towards the development of language technology for African languages.
• Strong sentimental attachments to departmental traditions make it extremely difficult for scholars to venture far outside their departmental cocoons.
• The importance of linguistics as a field of study and the role of linguists in society are not properly understood. Hence students of linguistics may not be sufficiently motivated to aspire to their important roles in society.
• Inappropriate admission criteria, limited curricula, and a low level of formal interaction between different faculties in the universities make it extremely difficult for students of the pure sciences, technology and linguistics to share courses and thereby have opportunities for academic interaction beneficial to the development of HLT.

5 Recommendations

• Intensive and sustained awareness building programmes on the importance of linguistics and language technology should be undertaken in institutions of higher learning. This will make it possible to harness some of the available intellectual resources that remain yet untapped for the development of language technology for African languages in these institutions.
• Admission criteria and curricula should be reviewed in order to encourage and capacitate students to widen their intellectual horizon beyond the artificial traditional departmental boundaries.
• Modern techniques for the management of learning resources should be employed in order to address the logistic challenges that discourage students from taking courses across the faculties of science, technology and arts.

6 Alt-i: a historical perspective

Initial interests in language technology that ultimately led to the founding of Alt-i date back to 1978, but it was in 1985 that a small group of one Electrical Engineer and two Physicists started to investigate Text to Speech synthesis of Yoruba in Ibadan, Nigeria. All they were armed with was a book (Electronic Speech Synthesis by Geof Bristow), a microphone and a storage oscilloscope. It was extremely difficult to get the required materials in the Nigeria of those days.

Unfortunately, neither the Physicists nor the Engineer realized the relevance of linguistics in their work because the academic environment within which they grew did not provide the necessary impetus for interdisciplinary or multidisciplinary studies between certain fields of study, certainly not between linguistics, physics and engineering. This brought about a lot of misdirected efforts and frustration. By 2001 however, with better access to the scientific literature of computational linguistics and HLT, it had become clear to what was left of the group that HLT is as much an issue in language as it is an issue in technology.

A careful review of the relevant aspects of the scientific literature of linguistics was then undertaken. Contacts were made with some of the known scholars of Yoruba phonology and their insights brought new impetus to the work, leading to the founding of Alt-i in 2001.

The salient point in this historical perspective is that the academic environment that produced the members of the original group did not provide the necessary impetus for the level of cross-disciplinary cooperation demanded by the solutions to the problems the group was addressing. Even though there were many papers that illustrated fruitful connections between linguistics and computer science in the mid 1980's, the Nigerian economy was in such a bad state that the universities could not afford to keep their libraries updated with current publications in any field. Nigerian universities now have noticeably better access to the global academic literature but a systemic weakness that does not encourage interdisciplinary scholarship still subsists and needs to be addressed.

7 Proposal

A project with advocacy and service components aimed at accelerating the development of language technology for African languages is hereby proposed. The aim of the proposed project is to produce lecturers, researchers and other experts in language technology for African languages.

The advocacy component will identify and develop policy thrusts that will encourage the development of language technology and raise awareness at various levels of the importance of linguistics and language technology. These include raising awareness among secondary school students, university undergraduates, cultural activists and relevant policy makers.

Within the service component, in affiliation with a university, a post-graduate course of study aimed at producing a number of PhDs in language technology/computational linguistics within the space of about five to six years is to be developed. The candidates for this post-graduate course shall be university graduates of various relevant fields. The programme shall start with a one year diploma programme of intensive course work in linguistics, computational and cognitive sciences. These will serve to widen the knowledge-base of the participants thereby creating the necessary connections between their backgrounds and various aspects of language technology. Those that attain a high level of achievement in the diploma course may stay on for another six months to undertake a practical project in language technology. Success in this project will earn such candidates a master degree in language technology or computational linguistics. Graduates of the master programme that attain a high level of performance in the project will be encouraged to stay on for the PhD programme.

The main faculty for the programme shall be drawn from relevant departments in the university. They shall undergo induction courses (locally and overseas) to re-orientate their knowledge towards applications in language technology.

To kick-start the programme, the support of scholars in the international language technology community shall be sought for curriculum development as well as teaching. Occasional or short-term visiting lectureships will be accommodated within sabbatical, fellowship and exchange programmes.

As an on-going experiment in this regard, two students of the University of Ibadan are at present working together on Yoruba-English machine translation. One student is a graduate of Computer Science, working towards a master degree in Information Science, while the other is a graduate of Linguistics working towards a master degree in Linguistics. The two students are jointly supervised by a lecturer in Information Science and a lecturer in Linguistics. Even though the computer science graduate has never had any formal training in Linguistics and the Linguistics graduate has never had any formal training in computing, their collaboration has served to widen their knowledge-bases. The student of linguistics is approaching the project from the point of view of comparative syntax and is now able to express syntax rules in the form of a context-free grammar in Prolog, while the Information Science student is approaching the project from the point of view of predicate logic as a knowledge representation formalism and now has a fair understanding of the principles of Yoruba and English grammars.

Even though the experiment is still on-going, the emerging results suggest that the one year of formal study in the proposed diploma programme will provide adequate knowledge and skills for graduates of the physical sciences, computer science, technology, linguistics and psychology to undertake productive research in language technology.

8 Conclusion

The development of language technology for African languages is at a rather embryonic stage. Apart from the efforts in South Africa, there are

little or no coherent programmes on language technology in African universities. National language policies where they exist do not accommodate language technology issues and there is a generally low level of awareness of the benefits derivable from language technology.

With one-third of the world's languages spoken in Africa, there is an urgent need for the development of new techniques that address the peculiar features of these languages and thereby make it possible for the cultures that use them to benefit from the information revolution without having to adopt foreign languages.

The implementation of the proposed academic programme in HLT would require active support of the Nigerian government in cooperation and collaboration with other friendly governments as well as various multilateral agencies for funding and other resources. However, in the absence of a language policy and a coherent language technology programme, the advocacy component becomes the necessary starting point.

Developments in HLT and computational linguistics present avenues for Nigerian universities to re-invigorate the study of linguistics, provide new impetus for students of linguistics and prepare graduates of linguistics for more roles in society than the traditional teaching of local languages at secondary level. Nigerian universities must therefore play an important role in the necessary advocacy.

As the world moves further into the information age, concerted efforts are needed to ensure that developments in HLT take due account of African languages so that African languages and cultures can benefit from the information revolution.

Acknowledgments

Alt-i acknowledges with thanks the funding support of Tiwa Systems Ltd., Bait-al-Hikma, Open Society Initiative for West Africa (OSIWA) and International Development Research Center (IDRC) in its activities.

References

Tunde Adegbola. 2005. Application of tonemic information for search-space reduction in the automatic speech recognition of Yoruba. Unpublished PhD Thesis, December, 2005, University of Ibadan, Nigeria.

Tunde Adegbola. 2007. Hitting the right tone. ICT Update. http://ictupdate.cta.int/en/Feature-Articles/Hitting-the-right-tone

Tunde Adegbola. 2009. The Future of African Languages in a Globalising World. Presented at the Bamako Summit on Multilingualism, 19 - 21 January 2009. Bamako, Mali.

Tunde Adegbola. 2009. Indigenising Human Language Technology for National Development. To be presented at the Africa Regional Center for Information Science (ARCIS), University of Ibadan, Guest Lecture, March 18, 2009.

eLearning Africa. 2007. Interview with Dr. Tunde Adegbola. http://www.elearning-africa.com/newsportal/english/news56_print.php, International Conference on ICT for Development Education

Justus Roux. 2008. HLT Development in Sub-Saharan Africa. Report to COCOSDA/WRITE Workshop, LREC2008, Marrakesh. http://www.ilc.cnr.it/flarenet/documents/lrec2008_cocosda-write_workshop_roux.pdf

Adam Samassekou. 2007. Linguistic Diversity. ICT Update. http://ictupdate.cta.int/en/Regulars/Carte-blanche/Linguistic-diversity

Roger Tucker. 2003. Local language speech technology initiative. http://www.llsti.org

Initial fieldwork for LWAZI: A Telephone-Based Spoken Dialog System for Rural South Africa

Tebogo Gumede
CSIR Meraka Institute
PO Box 395
Pretoria, 0001
[email protected]

Madelaine Plauché
CSIR Meraka Institute
PO Box 395
Pretoria, 0001
[email protected]

Abstract

This paper describes sociological fieldwork conducted in the autumn of 2008 in eleven rural communities of South Africa. The goal of the fieldwork was to evaluate the potential role of automated telephony services in improving access to important government information and services. Our interviews, focus group discussions and surveys revealed that Lwazi, a telephone-based spoken dialog system, could greatly support current South African government efforts to effectively connect citizens to available services, provided such services be toll free, in local languages, and with content relevant to each community.

1 Introduction

There is a growing interest in deploying spoken dialog systems (SDSs) in developing regions. In rural communities of developing regions, where infrastructure, distances, language and literacy are barriers to access, but where mobile phones are prevalent, an SDS could be key to unlocking social and economic growth (Barnard et al., 2003). Some notable recent studies in this field include “Tamil Market” (Plauché et al., 2006) and “VoiKiosk” (Agarwal et al., 2008). Both were kiosk-based SDSs providing agricultural information that were tested in rural, semi-literate communities in India. Nasfors (2007) also developed an agricultural information service, aimed at mobile telephone users in Kenya. “Healthline” was evaluated by a small set of community health workers in Pakistan (Sherwani et al., 2007), who had trouble with the voice-based interface, presumably due to their limited literacy. In a more recent study, Sharma et al. (2008) evaluated an SDS designed for caregivers of HIV positive children in Botswana. The researchers found that the users performed equally well using touchtone as speech input when navigating the system. In the current paper, we expand on this body of work by investigating the potential role for SDSs in connecting rural citizens of South Africa with government services, such as free education opportunities and stipends.

South Africa is the leader in Information and Communications Technology (ICT) in Africa and has the most developed telecommunications network on the continent (SA year book 2006/2007: 131). In particular, mobile phone usage has experienced massive growth due in part to its accessibility by non-literate people and its “leapfrog” development, which skipped the interim solutions adopted in the developed world (Tongia & Subrahmanian, 2006). The number of mobile phone users in South Africa is an astonishing 30 million people, out of a total population of 47 million (Benjamin, 2007). The percentage of both rural and urban households with mobile phones tripled from 2001 to 2007, while “landline” use declined. The accessibility and widespread use of mobile phones make SDSs a good candidate for low-cost information access.

In South Africa, there are eleven official languages. Private companies, NGOs and government offices who wish to reach South Africans through print or audio find it extremely costly to do so for each language. Heugh (2007) shows that in terms of speakers' proficiency, there is no single lingua franca for South Africans (see Figure 1). In fact, in 2001, only 3 million of 44 million South Africans were speakers of English, the language in which most government messages are currently disseminated (Household survey 2001).

Proceedings of the EACL 2009 Workshop on Language Technologies for African Languages – AfLaT 2009, pages 59–65, Athens, Greece, 31 March 2009. © 2009 Association for Computational Linguistics

Figure 1: Percentage of speakers per language in South Africa. [Bar chart comparing 1996 and 2001; y-axis: % of speakers; languages shown: isiZulu, isiXhosa, Afrikaans, Sepedi, English, Setswana, Sesotho, Xitsonga, SiSwati, Tshivenda, IsiNdebele, Other.]

Heugh (2007) reports that between 35% and 45% of South Africans above the age of 16 cannot read or write. Illiteracy is disproportionately high for women and for people living in the primarily rural provinces: KwaZulu-Natal, Limpopo and Mpumalanga. Mobile phone use is widespread in these areas among semi-literate citizens and speakers of all languages.

Communities in rural areas struggle to access government services due to their remote locations. Most community members must travel long distances by foot or by rare and costly public transport to access basic services. In their longitudinal household study on costs and coping strategies with chronic illnesses, Goudge et al. (2007), for example, found that people in rural areas of South Africa do not go to free health care facilities because they cannot afford transport.

NGOs face the same challenge when trying to reach rural populations. Many produce information to assist households affected by HIV/AIDS, for example, but most of the materials are published on websites; the cost of providing multilingual print materials is often too high. Due to low literacy levels, language, and a lack of infrastructure, the information remains inaccessible to the people who need it, especially those living in rural areas and townships (Benjamin, 2007).

Given the well developed mobile phone network and the relatively sparse alternative options in rural South Africa, the authors believe that multilingual SDSs can provide a low-cost solution to improving the ICT access of citizens who may currently be excluded from government services due to language, literacy and location. However, it is imperative to understand the target users and their environmental context as a first step to designing such a system (Nielsen, 1993). In this paper, we provide background on the current state of rural government service delivery in South Africa and introduce the Lwazi project. We describe our field methods, and finally, we present our findings from the initial field work with design and deployment implications for the Lwazi system.

2. Background

In South Africa, rural citizens are faced with a lack of economic activities and limited access to resources. The South African government is aware of both the need to improve citizen access to services and the specific challenges that rural communities face.

In this section, we note two successful rural initiatives of the South African government (Section 2.1) and we describe Lwazi (Section 2.2), an SDS designed to augment government’s accessibility by eligible citizens.

2.1 Rural Initiatives in South Africa

Two national efforts that are successfully connecting rural South African citizens to government services are (1) the Thusong Service Centres (TSCs) and (2) Community Development Workers (CDWs).

Thusong Service Centres (TSCs), formerly known as MPCCs (Multi-Purpose Community Centres), were initiated in 1999 as a national initiative to integrate government services into primarily rural communities, where services and participation by citizens were limited due to the long distances they needed to travel (TSC 2008). On June 7, 2008, the 100th TSC was opened. Each TSC is a one-stop centre providing integrated services and information from government to rural community members close to where they live.

Community Development Workers (CDWs) were formed in 2004 as a national initiative to further bridge the gap between government services and eligible citizens, especially the rural poor (CDW 2008). CDWs are members of rural communities who are trained and employed by the Department of Public Service and Administration (DPSA) under the office of the presidency. They work within their communities and coordinate with municipal and provincial offices. The primary responsibilities of a CDW are

to keep informed, notify citizens about events, and inform them about services for which they are eligible, then follow up to ensure they successfully receive these services.

2.2 Project Lwazi

As part of the ICT initiative, an ambitious, three-year project is currently being conducted by the Human Language Technology (HLT) research group under the Meraka Institute, at the Council for Scientific and Industrial Research (CSIR) in South Africa. This project is funded by the South African Department of Arts and Culture to develop a multilingual telephone-based SDS that would assist South African government service delivery. “Lwazi,” derived from the IsiZulu word for knowledge, aims to make a positive impact in the daily lives of South Africans by connecting them to government and health services (Lwazi, 2008). The ability of SDSs to overcome barriers of language, literacy, and distances led the Lwazi team to explore a low-cost application that would support the current rural initiatives mentioned in 2.1.

3. Method

First, we consulted previous research on development and technology in South and Southern Africa. We reviewed the most recent census conducted (Statistics SA, 2007) for data on infrastructure, income, language, and technology use.

Then, eleven communities were visited by small, interdisciplinary teams of researchers over a period of 3 months in 2008. Of these eleven communities, two were peri-urban, another two were urban and the rest were rural (Table 1). In each visit, the Lwazi team gained access to the community through the manager of the Thusong Service Centre (TSC). These individuals provided materials, office space, and meetings with CDWs and key people at the TSC.

We conducted between one and five key informant interviews at each site with the TSC employees, CDWs, and community members. In four of the eleven sites, we also conducted a focus group discussion. In two sites, we shadowed a CDW during a typical day. We visited farms, day-care centres, churches, markets, youth centres, clinics, businesses and households.

Community       Type        TSC   CDWs
Sterkspruit     Rural       Yes   Yes
Tshidilamolomo  Rural       Yes   Yes
Botshabelo      Peri-urban  Yes   Yes
Kgautswane      Rural       Yes   Yes
Waboomskraal    Peri-urban  Yes   No
Durban          Urban       No    No
Orhistad        Rural       Yes   Yes
Sediba          Rural       Yes   Yes
Atteridgeville  Urban       Yes   Yes
Laingsburg      Rural       Yes   Yes
Vredendal       Rural       Yes   Yes

Table 1: Sites visited in Spring 2008.

Data collection in these communities was primarily to investigate the suitability of the Lwazi SDS and to determine key user and contextual factors that would drive its design. In particular, we sought to:
● Gather rural community information needs.
● Investigate how people currently get information.
● Determine which cultural factors would impact the Lwazi system.
● Determine level of technical competency.
● Gauge interest in a low-cost SDS that improves access to government services.

4. Results

In this section, we present our overall results from field visits in eleven communities of South Africa (Section 4.1). In particular, we report on factors that influence the design and potential uptake of the Lwazi system in this context: the information needs and sources (Section 4.2), cultural and social factors (Section 4.3), suitability of the technology (Section 4.4), and user expertise (Section 4.5).

4.1 Overall Results

The eleven communities we visited were located throughout seven of the nine provinces of South Africa. They varied greatly in available infrastructure and languages spoken. They shared an economic dependency on nearby cities and, in

some cases, reported social problems.

4.2. Information Needs and Sources

During interviews and focus group discussions with government employees and community members, interviewees identified what they perceived as the primary problems in their communities. Across all eleven sites visited, unemployment was most often reported as the primary problem (Figure 2). In fact, our team observed that in at least 8 of the sites visited, the community's livelihood was entirely sustained by government grants.

After unemployment, access to health and social services was viewed as a primary problem in six of the sites visited. Crime and substance abuse were also reported as community problems.

Figure 2: Number of the eleven communities that cite these as primary problems in the community, as reported by interviewees. [Bar chart titled “Primary problems in the 11 communities visited”; x-axis: type of problem (unemployment; access to services such as health and social services; alcohol and drug abuse; crime; orphans); y-axis: number of communities.]

There are four mobile providers in South Africa, namely Cell-C, MTN, Virgin Mobile and Vodacom. The two landline companies, Neotel and Telkom, are not familiar in the communities visited by the Lwazi team. Community members prefer and use mobile phones because of ease of use and accessibility. Figure 3 illustrates the use of mobile providers in the eleven visited communities.

Figure 3: Mobile provider in communities visited. [Bar chart; x-axis: mobile providers (Cell-C, MTN, Virgin, Vodacom); y-axis: number of communities.]

These communities also share a dependency on government social grants. The majority of communities visited reported lack of economic activity as the primary problem in the community, and as could be expected, we observed very high levels of unemployment. Grants are offered by the South African government to address the imbalance and stimulate the economy of these areas. There are six types of grants, namely: War Veteran grant, Old Age grant, Disability grant, Care dependency grant, Foster care grant and Child support grant. Citizens can apply for these at their nearest local South African Social Security Agency (SASSA) or district office.

Figure 4 shows that all eleven communities visited received Vukuzenzele magazine, a monthly government information sharing medium. This is, however, not effective in these communities, where literacy levels are low. This, as mentioned earlier, is one of the problems the government is trying to address. The second most commonly used source of information was the CDWs. This is a useful source because they are the ‘foot soldiers’ of the government; they are responsible for door to door visits, collecting and delivering information.

Figure 4: Sources of government information. [Bar chart; x-axis: sources of information, including CDW, public meetings, print media (including Vukuzenzele) and local radio station in local language; y-axis: number of communities.]

At the time of the visits, the eleven communities received information from the Thusong Service Centres. Government departments, NGOs, municipalities and research institutes such as the CSIR used the TSCs as a platform to disseminate information. At a more grass roots level, African communities in South Africa share information by word of mouth. Local radio stations and newspapers in local languages are also important sources of information.

4.3 Cultural and Social Factors

The population in the communities we visited consists mostly of older people taking care of grandchildren whose parents work in nearby

cities and visit only once or twice a year. As we have previously mentioned, the lack of economic activity means that these communities depend heavily on government social grants. Older people in these communities are often unable to read and write. In some cases, their low literacy levels and possible unfamiliarity with technology restrict their ability to use the cheaper “texting” feature of their mobile phones.

In the communities we visited, ten out of the eleven official South African languages were spoken. Each community typically spoke two or more languages. A Lwazi system that delivers information about SASSA services would be more accessible to the older, rural population than current print methods; the proposed system would support a majority of the eleven languages in order to offer a better information channel.

4.4 Suitability of Technology

The research prior to the design of the Lwazi system investigated how to ensure that the proposed system will be suitable to the lifestyle of the community to be served. We do know that currently, communities have other means of accessing government information, including the free “Vukuzenzele” monthly magazine and local radio stations. Like these current means, Lwazi must be free in order to be effective and it must contain content that is locally-relevant to these peri-urban and rural communities. Government departments will find Lwazi to be a very useful and low-cost way of disseminating their information. Rural South Africans will benefit from the alternative source of critical information.

4.5 User Expertise

We also sought to evaluate the current expertise of potential Lwazi system users with telephony and other ICT technologies. User expertise with telephony systems differs between young and old. As mentioned earlier, households have at least one cell phone. The older members of the community use it to call and receive calls from friends and children in neighbouring urban areas. They do not know how to send text messages. Some children have saved the full ‘Please call me’ line as a quick dial so that their elder family members can just press it in cases of emergency. The young people, on the other hand, are well versed with telephony technology. Most of them are also familiar with the basic use of a computer, despite their limited access to them. The Lwazi system must be as simple as making a phone call to a friend or relative in order to be accessible to all. In most households, however, there is someone who is technically competent. Based on our fieldwork and recent user studies of SDSs in developing regions, a Lwazi system could be useable in the rural context of South Africa, especially among the elderly and those who do not read and write.

5. Discussion

5.1 Potential Uptake

We saw two main areas where a telephony service could be very useful. The first is in supporting communication between community and government. For example, a multilingual, automated service could direct calls from community members to the appropriate TSC office or CDW, or perhaps provide locally-relevant information such as office hours of operation, directions to the office, and eligibility requirements for services. Such a service might reduce the number of calls that a TSC office or CDW would need to take personally. It could also likely save a community member a trip if they were sure beforehand what paperwork they needed to bring and when the office was open. It is important to mention here that the project will require buy-in from the local councillors of the communities we will be piloting in.

The second area in which a telephony service could be useful would be in facilitating internal communication among government service providers. CDWs may need to meet community members face to face whenever possible. Coordinating with government staff across the municipality, district, province, or country could happen remotely and efficiently if government staff could use an automated telephony service to send audio messages to several staff members at once. The national coordinator for CDWs, for example, could notify every CDW in the country of upcoming events and policy changes with a single phone call.

Many government officials, including Thusong centre managers, felt the system might assist them in communicating with the communities they serve. There was, however, a concern from one site that there are sections of the population that do not have mobile

connection or a reliable source of electricity to charge their phones. These could be the communities that need government services the most. In these cases, Lwazi will have to play a supportive role to the existing services provided by the CDWs, rather than allowing direct access by community members.

Our field work revealed the effectiveness of national government programs to connect rural citizens to available government services. Our major finding was that although particulars about the communities differed, individuals in the eleven communities visited experienced barriers to information access that could be mitigated with automated telephony services, provided that such services are toll free and localized to the language and information relevant to the particular rural community. Whereas infrastructure such as roads, services, and in some cases, electricity were limited, the mobile phone network of South Africa is reliable and widespread. We feel optimistic that the Lwazi system will build on the available infrastructure to transcend the barriers of geography and improve the connection between citizens and services.

5.2 Challenges

In a large and culturally diverse country such as South Africa, deploying an SDS intended to provide universal access is a great challenge. Designing for any given user group often requires multiple iterations of testing and user feedback. Our fieldwork revealed a diverse set of end users; a successful design will require a greater investment in time and resources to gather detailed and accurate information about rural South Africans. Although we plan to rely on our partners (TSC and CDWs) on the ground for a great deal of this information, we believe it is an ambitious goal to expect deployment of the Lwazi project in eleven languages country wide by summer 2009.

Not only is the technological aspect very ambitious, this kind of national government sponsored system requires tactful management of stakeholders. The success of the Lwazi project relies on community members, government partners, researchers, NGOs, and corporate interests, all of whom have conflicting needs and interests. Our team recognizes the importance of managing stakeholder interests and has devised a problem structuring method to facilitate feedback and discussion (Plauché et al., submitted).

Community buy-in is critical to the success of an ICT deployment. We found that not only the TSC and CDW national coordinators but also each of the communities visited were excited about the potential of the proposed system.

Generally, rural communities are comfortable with the use of a mobile phone. But there is an age difference in preference of different applications. Because Lwazi is voice-based, the senior citizens of the community will be more likely to be excited about it than the younger generations. Younger South Africans are comfortable with the cheaper, text interface to mobile phones. We recognise that Lwazi may not suit the needs of all South Africans, but we aim to make it accessible to those who are historically excluded. In doing so, we hope to have an overall impact in this country where only 26% of the population has a Matric or tertiary education (Stats SA, 2007).

6. Conclusion

In this paper we evaluated the potential role of Lwazi, a proposed telephone-based SDS, in improving rural access to important government services. The Lwazi project will create an open platform for telephone-based services for government to provide information in all eleven languages. Lwazi will be especially useful if it can reduce cost or the distances that people travel to access government services and the distances that government workers travel to check in with municipal offices. Our team plans to conduct pilots in two communities in the summer of 2009. A successful pilot in one of these communities will then burgeon into a national service for all South Africans to empower themselves through improved access to information, services and government resources.

References

Aditi Sharma, Madelaine Plauché, Etienne Barnard, and Christiaan Kuun. (To appear). HIV health information access using spoken dialog systems: Touchtone vs. Speech. In Proc. of IEEE ICTD‘09, 2009.

Bernhard Suhm. 2008. IVR Usability Engineering using Guidelines and Analyses of end-to-end calls. In D. Gardener-Bonneau and H.E. Blanchard (Eds). Human Factors and Voice Interactive Systems. pp. 1-41, Second Edition, Springer Science: NY, USA.

CDW 2008: www.info.gov.za/issues/cdw.htm. Accessed August 20, 2008.

Etienne Barnard, Lawrence Cloete and Hina Patel. 2003. Language and Technology Literacy Barriers to Accessing Government Services. Lecture Notes in Computer Science, vol. 2739, pp. 37-42.

Government Communication and Information System. 2007. South African Year Book 2006/2007.

Jahanzeb Sherwani, Nosheen Ali, Sarwat Mirza, Anjum Fatma, Yousuf Memon, Mehtab Karim, Rahul Tongia, Roni Rosenfeld. 2007. Healthline: Speech-based Access to Health Information by low-literate users. In Proc. of IEEE ICTD‘07, Bangalore, India.

Jane Goudge, Tebogo Gumede, Lucy Gilson, Steve Russell, Steve Tollman & Anne Mills. 2007. Coping with the cost burdens of illness: Combining qualitative and quantitative methods in longitudinal household research. Scandinavian journal of public health. 35 (Suppl 69), 181-185.

Kathleen Heugh. 2007. Language and Literacy issues in South Africa. In Rassool, Naz (ed) Global Issues in Language, Education and Development. Perspectives from Postcolonial Countries. Clevedon: Multilingual matters, 187-217.

Lwazi. 2008. http://.meraka.org.za/lwazi. Accessed August 20, 2008.

Madelaine Plauché, Alta De Waal, Aditi Sharma, and Tebogo Gumede. (submitted). 2008. Morphological Analysis: A method for selecting ICT applications in South African government service delivery. ITID.

Madelaine Plauché, Udhyakumar Nallasamy, Joyojeet Pal, Chuck Wooters and Divya Ramachandran. 2006. Speech Recognition for Illiterate Access to Information and Technology. In Proc. of IEEE ICTD‘06, pp. 83-92.

Ministry of Public Service and Administration. 2007. Community Development Workers Master Plan.

Nielsen Jakob. 1993. Usability Engineering. AP Professional, Boston, MA, USA.

PANSALB. 2001. Language use and Language Interaction in South Africa: A National Sociolinguistic Survey Summary Report. Pan South African Language Board. Pretoria.

Pernilla Nasfors. 2007. Efficient Voice Information Services for Developing Countries, Master Thesis, Department of Information technology, Uppsala University, Sweden.

Peter Benjamin. 2007. The cellphone information channel for HIV/AIDS. Unpublished information newsletter.

Rahul Tongia and Eswaran Subrahmanian. 2006. Information and Communications Technology for Development (ICT4D) - A design challenge? In Proc. of IEEE ICTD'06, Berkeley, CA.

Sheetal Agarwal, Arun Kumar, AA Nanavati and Nitendra Rajput. 2008. VoiKiosk: Increasing Reachability of Kiosks in Developing Regions. In Proc. of the 17th International Conference on World Wide Web, pp. 1123-1124, 2008.

Statistics South Africa. 2007. Community Survey: http://www.statsa.gov.za/publications/P0301/P0301.pdf (last accessed 15 Sept 2008).

Tebogo Gumede, Madelaine Plauché, and Aditi Sharma. 2008. Evaluating the Potential of Automated Telephony Systems in Rural Communities. CSIR biannual conference.

TSC 2008: www.thusong.gov.za/. Accessed August 20, 2008.

Setswana Tokenisation and Computational Verb Morphology: Facing the Challenge of a Disjunctive Orthography

Rigardt Pretorius
School of Languages
North-West University
Potchefstroom, South Africa
[email protected]

Ansu Berg
School of Languages
North-West University
Potchefstroom, South Africa
[email protected]

Laurette Pretorius
School of Computing
University of South Africa
and
Meraka Institute, CSIR
Pretoria, South Africa
[email protected]

Biffie Viljoen
School of Computing
University of South Africa
Pretoria, South Africa
[email protected]

Proceedings of the EACL 2009 Workshop on Language Technologies for African Languages – AfLaT 2009, pages 66–73, Athens, Greece, 31 March 2009. © 2009 Association for Computational Linguistics

Abstract

Setswana, a Bantu language in the Sotho group, is one of the eleven official languages of South Africa. The language is characterised by a disjunctive orthography, mainly affecting the important word category of verbs. In particular, verbal prefixal morphemes are usually written disjunctively, while suffixal morphemes follow a conjunctive writing style. Therefore, Setswana tokenisation cannot be based solely on whitespace, as is the case in many alphabetic, segmented languages, including the conjunctively written Nguni group of South African Bantu languages. This paper shows how two tokeniser transducers and a finite-state (rule-based) morphological analyser may be combined to effectively solve the Setswana tokenisation problem. The approach has the important advantage of bringing the processing of Setswana beyond the morphological analysis level in line with what is appropriate for the Nguni languages. This means that the challenge of the disjunctive orthography is met at the tokenisation/morphological analysis level and does not in principle propagate to subsequent levels of analysis such as POS tagging and shallow parsing. Indeed, the approach ensures that an aspect such as orthography does not obfuscate sound linguistics and, ultimately, proper semantic analysis, which remains the ultimate aim of linguistic analysis and therefore also computational linguistic analysis.

1 Introduction

Words, syntactic groups, clauses, sentences, paragraphs, etc. usually form the basis of the analysis and processing of natural language text. However, texts in electronic form are just sequences of characters, including letters of the alphabet, numbers, punctuation, special symbols, whitespace, etc. The identification of word and sentence boundaries is therefore essential for any further processing of an electronic text. Tokenisation or word segmentation may be defined as the process of breaking up the sequence of characters in a text at the word boundaries (see, for example, Palmer, 2000). Tokenisation may therefore be regarded as a core technology in natural language processing.

Since disjunctive orthography is our focus, we distinguish between an orthographic word, that is a unit of text bounded by whitespace, but not containing whitespace, and a linguistic word, that is a sequence of orthographic words that together functions as a member of a word category such as, for example, nouns, pronouns, verbs and adverbs (Kosch, 2006). Therefore, tokenisation may also be described as the process of identifying linguistic words, henceforth referred to as tokens.

While the Bantu languages are all agglutinative and exhibit significant inherent structural similarity, they differ substantially in terms of their orthography. The reasons for this difference are both historical and phonological. A detailed

discussion of this aspect falls outside the scope of this article, but the interested reader is referred to Cole (1955), Van Wyk (1958 & 1967) and Krüger (2006).

Setswana, Northern Sotho and Southern Sotho form the Sotho group belonging to the South-Eastern zone of Bantu languages. These languages are characterised by a disjunctive (also referred to as semi-conjunctive) orthography, affecting mainly the word category of verbs (Krüger, 2006:12-28). In particular, verbal prefixal morphemes are usually written disjunctively, while suffixal morphemes follow a conjunctive writing style. For this reason Setswana tokenisation cannot be based solely on whitespace, as is the case in many alphabetic, segmented languages, including the conjunctively written Nguni group of South African Bantu languages, which includes Zulu, Xhosa, Swati and Ndebele.

The following research question arises: Can the development and application of a precise tokeniser and morphological analyser for Setswana resolve the issue of disjunctive orthography? If so, subsequent levels of processing could exploit the inherent structural similarities between the Bantu languages (Dixon and Aikhenvald, 2002:8) and allow a uniform approach.

The structure of the paper is as follows: The introduction states and contextualises the research question. The following section discusses tokenisation in the context of the South African Bantu languages. Since the morphological structure of the Setswana verb is central to the tokenisation problem, the next section comprises a brief exposition thereof. The paper then proceeds to discuss the finite-state computational approach that is followed. This entails the combination of two tokeniser transducers and a finite-state (rule-based) morphological analyser. The penultimate section concerns a discussion of the computational results and insights gained. Possibilities for future work conclude the paper.

2 Tokenisation

Tokenisation for alphabetic, segmented languages such as English is considered a relatively simple process where linguistic words are usually delimited by whitespace and punctuation. This task is effectively handled by means of regular expression scripts. Mikeev (2003) however warns that “errors made at such an early stage are very likely to induce more errors at later stages of text processing and are therefore very dangerous.” The importance of accurate tokenisation is also emphasised by Forst and Kaplan (2006). While Setswana is also an alphabetic segmented language, its disjunctive orthography causes token internal whitespace in a number of constructions of which the verb is the most important and widely occurring. Since the standard tokenisation issues of languages such as English have been extensively discussed (Farghaly, 2003; Mikeev, 2003; Palmer, 2000), our focus is on the challenge of Setswana verb tokenisation specifically. We illustrate this by means of two examples:

Example 1: In the English sentence “I shall buy meat” the four tokens (separated by “/”) are I / shall / buy / meat. However, in the Setswana sentence Ke tla reka nama (I shall buy meat) the two tokens are Ke tla reka / nama.

Example 2: Improper tokenisation may distort corpus linguistic conclusions and statistics. In a study on corpus design for Setswana lexicography Otlogetswe (2007) claims that a is the most frequent “word” in his 1.3 million “words” Setswana corpus (Otlogetswe, 2007:125). In reality, the orthographic word a in Setswana could be any of several linguistic words or morphemes. Compare the following:

A/ o itse/ rre/ yo/? (Do you know this gentleman?) Interrogative particle;
Re bone/ makau/ a/ maabane/. (We saw these young men yesterday.) Demonstrative pronoun;
Metsi/ a/ bollo/. (The water is hot.) Descriptive copulative;
Madi/ a/ rona/ a/ mo/ bankeng/. (Our money (the money of us) is in the bank.) Possessive particle and descriptive copulative;
Mosadi/ a ba bitsa/. (The woman (then) called them.) Subject agreement morpheme;
Dintswa/ ga di a re bona/. (The dogs did not see us.) Negative morpheme, which is concomitant with the negative morpheme ga when the negative of the perfect is indicated, thus an example of a separated dependency.

In the six occurrences of a above only four represent orthographic words that should form part of a word frequency count for a.

The above examples emphasise the importance of correct tokenisation of corpora, particularly in the light of the increased exploitation of electronic corpora for linguistic and lexicographic research. In particular, the correct tokenisation of verbs in disjunctively written languages is crucial for all reliable and accurate corpus-based research. Hurskeinen et al. (2005:450) confirm this by stating that “a carefully designed tokeniser is a prerequisite for identifying verb structure in text”.

3 Morphological Structure of the Verb in Setswana

A complete exposition of Setswana verb morphology falls outside the scope of this article (see Krüger, 2006). Main aspects of interest are briefly introduced and illustrated by means of examples.

The most basic form of the verb in Setswana consists of an infinitive prefix + a root + a verb-final suffix, for example, go bona (to see) consists of the infinitive prefix go, the root bon- and the verb-final suffix -a.

While verbs in Setswana may also include various other prefixes and suffixes, the root always forms the lexical core of a word. Krüger (2006:36) describes the root as “a lexical morpheme [that] can be defined as that part of a word which does not include a grammatical morpheme; cannot occur independently as in the case with words; constitutes the lexical meaning of a word and belongs quantitatively to an open class”.

morpheme sa (still) and the potential morpheme ka (can). Examples are o a araba (he answers), ba sa ithuta (they are still learning) and ba ka ithuta (they can learn).

The temporal morpheme: The temporal morpheme tla (indicating the future tense) is written disjunctively, for example ba tla ithuta (they shall learn).

The negative morphemes ga, sa and se: The negative morphemes ga, sa and se are written disjunctively. Examples are ga ba ithute (they do not learn), re sa mo thuse (we do not help him), o se mo rome (do not send him).

3.2 Suffixes of the Setswana verb

Various morphemes may be suffixed to the verbal root and follow the conjunctive writing style:

Verb-final morphemes: Verbal-final suffixes a, e, the relative -ng and the imperative -ng, for example, ga ba ithute (they are not learning).

The causative suffix -is-: Example, o rekisa (he sells (he causes to buy)).

The applicative suffix -el-: Example, o balela (she reads for).
The reciprocal suffix -an-: Example, re a 3.1 Prefixes of the Setswana verb thusana (we help each other). The perfect suffix -il-: Example, ba utlwile The verbal root can be preceded by several pre- (they heard). fixes (cf. Krüger (2006:171-183): The passive suffix -w-: Example, o romiwa Subject agreement morphemes: The subject (he is sent). agreement morphemes, written disjunctively, include non-consecutive subject agreement mor- 3.3 Auxiliary verbs and copulatives phemes and consecutive subject agreement mor- Krüger (2006:273) states that “Syntactically an phemes. This is the only modal distinction that auxiliary verb is a verb which must be followed influences the form of the subject morpheme. by a complementary predicate, which can be a The same subject agreement morpheme therefore verb or verbal group or a copulative group or an has a consecutive as well as a non-consecutive auxiliary verbal group, because it cannot func- form. For example, the non-consecutive subject tion in isolation”. Consider the following exam- agreement morpheme for class 5 is le as in lekau ple of the auxiliary verb tlhola: re tlhola/ re ba le a tshega (the young man is laughing), while thusa/ (we always help them). For a more de- the consecutive subject agreement morpheme for tailed discussion of auxiliary verbs in Setswana class 5 is la as in lekau la tshega (the young man refer to Pretorius (1997). then laughed). Copulatives function as introductory members Object agreement morphemes: The object to non-verbal complements. The morphological agreement morpheme is written disjunctively in forms of copula are determined by the copulative most instances, for example ba di bona (they see relation and the type of modal category in which it). they occur. These factors give rise to a large va- The reflexive morpheme: The reflexive mor- riety of morphological forms (Krüger, 2006: pheme i- (-self) is always written conjunctively 275-281). to the root, for example o ipona (he sees him- self). 
3.4 Formation of verbs The aspectual morphemes: The aspectual morphemes are written disjunctively and include The formation of Setswana verbs is governed by the present tense morpheme a, the progressive a set of linguistic rules according to which the various prefixes and suffixes may be sequenced

to form valid verb forms (so-called morphotactics) and by a set of morphophonological alternation rules that model the sound changes that occur at morpheme boundaries. These formation rules constitute a model of Setswana morphology that forms the basis of the finite-state morphological analyser, discussed in subsequent sections.

This model, supported by a complete set of known, attested Setswana roots, may be used to recognise valid words, including verbs. It will not recognise either incorrectly formed or partial strings as words. The significance of this for tokenisation specifically is that, in principle, the model and therefore also the morphological analyser based on it can and should recognise only (valid) tokens.

Morphotactics: While verbs may be analysed linearly or hierarchically, our computational analysis follows the former approach, for example:

ba a kwala (they write)
    Verb(INDmode),(PREStense,Pos):AgrSubj-Cl2+AspPre+[kwal]+Term
o tla reka (he will buy)
    Verb(INDmode),(FUTtense,Pos):AgrSubj-Cl1+TmpPre+[rek]+Term
ke dirile (I have worked)
    Verb(INDmode),(PERFtense,Pos):AgrSubj-1P-Sg+[dir]+Perf+Term

The above analyses indicate the part-of-speech (verb), the mode (indicative) and the tense (present, future or perfect), followed by a ‘:’ and then the morphological analyses. The tags are chosen to be self-explanatory and the verb root appears in square brackets. For example the first analysis is ba: subject agreement class 2; a: aspectual prefix; kwal: verb root; a: verb terminative (verb-final suffix). The notation used in the presentation of the morphological analyses is user-defined.

In linear analyses the prefixes and suffixes have a specific sequencing with regard to the verbal root. We illustrate this by means of a number of examples. A detailed exposition of the rules governing the order and valid combinations of the various prefixes and suffixes may be found in Krüger (2006).

Object agreement morphemes and the reflexive morpheme always appear directly in front of the verbal root, for example le a di reka (he buys it). No other prefix can be placed between the object agreement morpheme and the verbal root or between the reflexive morpheme and the verbal root.

The position of the negative morpheme ga is always directly in front of the subject agreement morpheme, for example, ga ke di bône (I do not see it/them).

The negative morpheme sa follows the subject agreement morpheme, for example, (fa) le sa dire ((while) he is not working).

The negative morpheme se also follows the subject agreement morpheme, for example, (gore) re se di je ((so that) we do not eat it). However, if the verb is in the imperative mood the negative morpheme se is used before the verbal root, for example, Se kwale! (Do not write!).

The aspectual morphemes always follow the subject agreement morpheme, for example, ba sa dira (they are still working).

The temporal morpheme also follows the subject agreement morpheme, for example, ba tla dira (they shall work).

Due to the agglutinating nature of the language and the presence of long distance dependencies, the combinatorial complexity of possible morpheme combinations makes the identification of the underlying verb rather difficult. Examples of rules that assist in limiting possible combinations are as follows:

The object agreement morpheme is a prefix that can be used simultaneously with the other prefixes in the verb, for example, ba a di bona (they see it/them).

The aspectual morphemes and the temporal morpheme cannot be used simultaneously, for example, le ka ithuta (he can learn) and le tla ithuta (he will learn).

Since (combinations of) suffixes are written conjunctively, they do not add to the complexity of the disjunctive writing style prevalent in verb tokenisation.

Morphophonological alternation rules: Sound changes can occur when morphemes are affixed to the verbal root.

The prefixes: The object agreement morpheme of the first person singular ni/n in combination with the root causes a sound change and this combination is written conjunctively, for example ba ni-bon-a > ba mpona (they see me).

In some instances the object agreement morpheme of the third person singular and class 1 causes sound changes when used with verbal roots beginning with b-. They are then written conjunctively, for example, ba mo-bon-a > ba mmona (they see him).

When the subject agreement morpheme ke (the first person singular) and the potential morpheme ka are used in the same verb, the sound change ke ka > nka appears, for example, ke ka opela > nka opela (I can sing).
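The three prefix sound changes just described can be illustrated as ordered rewrite rules. The sketch below is only an illustration in Python, not the xfst rules of the analyser itself. It assumes a hypothetical input notation in which "-" marks a conjunctive morpheme boundary and a space marks a disjunctive one, and it assumes that the ni and mo changes apply to every root beginning with b-, which the text states only for particular instances. The name to_surface is likewise hypothetical.

```python
import re

# Ordered rewrite rules over a morpheme-segmented form, where "-" marks a
# conjunctive morpheme boundary and a space marks a disjunctive one.
# The rule set is illustrative only and covers just the three changes named
# in the text; generalising them to every b-initial root is an assumption.
RULES = [
    (re.compile(r"\bni-b"), "mp"),      # ni + b... > mp...  (ba ni-bon-a > ba mpona)
    (re.compile(r"\bmo-b"), "mm"),      # mo + b... > mm...  (ba mo-bon-a > ba mmona)
    (re.compile(r"\bke ka\b"), "nka"),  # ke ka > nka        (ke ka opela > nka opela)
    (re.compile(r"-"), ""),             # finally, drop remaining boundary marks
]


def to_surface(segmented: str) -> str:
    """Apply the rules in order and return the orthographic surface form."""
    for pattern, replacement in RULES:
        segmented = pattern.sub(replacement, segmented)
    return segmented


print(to_surface("ba ni-bon-a"))
print(to_surface("ba mo-bon-a"))
print(to_surface("ke ka opela"))
```

Rule order matters here: the boundary marks are removed last so that the earlier rules can still see where a prefix meets the root.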

The suffixes: Sound changes also occur under certain circumstances, but do not affect the conjunctive writing style.

Summarising, the processing of electronic Setswana text requires precise tokenisation; the disjunctive writing style followed for verb constructions renders tokenisation on whitespace inappropriate; morphological structure is crucial in identifying valid verbs in text; due to the regularity of word formation, linguistic rules (morphotactics and morphophonological alternation rules) suggest a rule-based model of Setswana morphology that may form the basis of a tokeniser transducer, and together with an extensive word root lexicon, also the basis for a rule-based morphological analyser. Since the Bantu languages exhibit similar linguistic structure, differences in orthography should be addressed at tokenisation / morphological analysis level so that subsequent levels of computational (syntactic and semantic) analysis may benefit optimally from prevalent structural similarities.

4 Facing the Computational Challenge

Apart from tokenisation, computational morphological analysis is regarded as central to the processing of the (agglutinating) South African Bantu languages (Bosch & Pretorius, 2002; Pretorius & Bosch, 2003). Moreover, standards and standardisation are pertinent to the development of appropriate software tools and language resources (Van Rooy & Pretorius, 2003), particularly for languages that are similar in structure. While such standardisation is an ideal worth striving for, it remains difficult to attain. Indeed, the non-standard writing styles pose a definite challenge.

4.1 Other approaches to Bantu tokenisation

Taljard and Bosch (2006) advocate an approach to word class identification that makes no mention of tokenisation as a central issue in the processing of Northern Sotho and Zulu text. For Northern Sotho they propose a hybrid system (consisting of a tagger, a morphological analyser and a grammar) “containing information on both morphological and syntactic aspects, although biased towards morphology. This approach is dictated at least in part, by the disjunctive method of writing.” In contrast, Hurskainen et al. (2005), in their work on the computational description of verbs of Kwanjama and Northern Sotho, conclude that “a carefully designed tokeniser is a prerequisite for identifying verb structures in text”. Anderson and Kotzé (2006) concur that in their development of a Northern Sotho morphological analyser “it became obvious that tokenisation was a problem that needed to be overcome for the Northern [Sotho language] as distinct from the ongoing morphological and morpho-phonological analysis”.

4.2 Our approach

Our underlying assumption is that the Bantu languages are structurally very closely related. Our contention is that precise tokenisation will result in comparable morphological analyses, and that the similarities and structural agreement between Setswana and languages such as Zulu will prevail at subsequent levels of syntactic analysis, which could and should then also be computationally exploited.

Our approach is based on the novel combination of two tokeniser transducers and a morphological analyser for Setswana.

4.3 Morphological analyser

The finite-state morphological analyser prototype for Setswana, developed with the Xerox finite state toolkit (Beesley and Karttunen, 2003), implements Setswana morpheme sequencing (morphotactics) by means of a lexc script containing cascades of so-called lexicons, each of which represents a specific type of prefix, suffix or root. Sound changes at morpheme boundaries (morphophonological alternation rules) are implemented by means of xfst regular expressions. These lexc and xfst scripts are then compiled and subsequently composed into a single finite state transducer, constituting the morphological analyser (Pretorius et al., 2005 and 2008). While the implementation of the morphotactics and alternation rules is, in principle, complete, the word root lexicons still need to be extended to include all known and valid Setswana roots. The verb morphology is based on the assumption that valid verb structures are disjunctively written. For example, the verb token re tla dula (we will sit/stay) is analysed as follows:

    Verb(INDmode),(FUTtense,Pos): AgrSubj-1p-Pl+TmpPre+[dul]+Term
or
    Verb(PARmode),(FUTtense,Pos): AgrSubj-1p-Pl+TmpPre+[dul]+Term

Both modes, indicative and participial, constitute valid analyses. The occurrence of multiple valid morphological analyses is typical and would require (context dependent) disambiguation at subsequent levels of processing.
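To make the analysis notation above easier to work with downstream, the following is a minimal Python sketch that splits an analysis string into its header features and its morpheme tags, and then compares the two competing analyses of re tla dula. The Analysis class and the parse_analysis function are hypothetical helpers introduced here for illustration; they are not part of the analyser or of the Xerox toolkit, and the sketch assumes only the notation shown in the examples (optional header, ‘:’, tags joined by ‘+’, root in square brackets).

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Analysis:
    category: Optional[str]   # e.g. "Verb"; None when the string has no header
    features: List[str]       # e.g. ["INDmode", "FUTtense", "Pos"]
    morphemes: List[str]      # e.g. ["AgrSubj-1p-Pl", "TmpPre", "[dul]", "Term"]
    root: Optional[str]       # e.g. "dul"; None when no bracketed root occurs


def parse_analysis(line: str) -> Analysis:
    """Split 'Header: tag+tag+[root]+tag' into its parts."""
    head, colon, body = line.partition(":")
    if not colon:                       # no header, e.g. "NPre10+[tsebe]"
        head, body = "", line
    name = re.match(r"\s*([A-Za-z]+)", head)
    category = name.group(1) if name else None
    # Every parenthesised group in the header contributes its comma-separated features.
    features = [f.strip()
                for group in re.findall(r"\(([^)]*)\)", head)
                for f in group.split(",")]
    morphemes = [m.strip() for m in body.split("+") if m.strip()]
    root = next((m[1:-1] for m in morphemes
                 if m.startswith("[") and m.endswith("]")), None)
    return Analysis(category, features, morphemes, root)


ind = parse_analysis("Verb(INDmode),(FUTtense,Pos): AgrSubj-1p-Pl+TmpPre+[dul]+Term")
par = parse_analysis("Verb(PARmode),(FUTtense,Pos): AgrSubj-1p-Pl+TmpPre+[dul]+Term")

# The two readings share root and morpheme sequence and differ only in mode.
differing = [(a, b) for a, b in zip(ind.features, par.features) if a != b]
print(ind.root, differing)
```

Seen this way, the ambiguity is confined to a single header feature, which is the kind of difference a later, context-dependent disambiguation step would have to resolve.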

4.4 Tokeniser

Since the focus is on verb constructions, the Setswana tokeniser prototype makes provision for punctuation and alphabetic text, but not yet for the usual non-alphabetic tokens such as dates, numbers, hyphenation, abbreviations, etc. A grammar for linguistically valid verb constructions is implemented with xfst regular expressions. By way of illustration we show a fragment thereof, where SP represents a single blank character, WS is general whitespace and SYMBOL is punctuation. In the fragment of xfst below ‘...’ indicates that other options have been removed for conciseness and is not strict xfst syntax:

    define WORD [Char]+[SP | SYMBOL];
    define WORDwithVERBEnding [Char]+[a | e | n g] [SP | SYMBOL];
    echo >>> define object concords
    define OBJ [g o | r e | l o | l e | m o | b a | o | e | a | s e | d i | b o] WS+;
    echo >>> define subject concords
    define SUBJ [k e | o | r e | l o | l e | a | b a | e | s e | d i | b o | g o] WS+;
    echo >>> define verb prefixes
    echo >>> define indicative mode
    define INDPREF [(g a WS+) SUBJ ([a | s a] WS+) ([a | k a | s a] WS+) (t l a WS+) (OBJ)];
    define VPREF [...| INDPREF | ...];
    echo >>> define verb groups
    define VGROUP [VPREF WORDwithVERBEnding];
    echo >>> define tokens
    define Token [VGROUP | WORD | ...];

Finally, whitespace is normalised to a single blank character and the right-arrow, right-to-left, longest match rule for verb tokens is built on the template

    A ->@ B || L _ R;

where A, B, L and R are regular expressions denoting languages, and L and R are optional (Beesley and Karttunen, 2003:174).

We note that (i) it may happen that a longest match does not constitute a valid verb construct; (ii) the right-to-left strategy is appropriate since the verb root and suffixes are written conjunctively and therefore should not be individually identified at the tokenisation stage while disjunctively written prefixes need to be recognised.

The two aspects that need further clarification are (i) How do we determine whether a morpheme sequence is valid? (ii) How do we recognise disjunctively written prefixes? Both these questions are discussed in the subsequent section.

4.5 Methodology

Our methodology is based on a combination of a comprehensive and reliable morphological analyser for Setswana catering for disjunctively written verb constructions (see section 4.3), a verb tokeniser transducer (see section 4.4) and a tokeniser transducer that tokenises on whitespace. The process is illustrated in Figure 1. Central to our approach is the assumption that only analysed tokens are valid tokens and strings that could not be analysed are not valid linguistic words.

[Figure 1: Tokenisation procedure. Flowchart: Normalise running text; Verb Tokeniser yielding longest matches; Morphological Analyser, giving analysed strings (typically verbs) and unanalysed strings; Whitespace Tokeniser yielding orthographic words, applied to the unanalysed strings; Morphological Analyser, giving analysed strings (typically other word categories) and unanalysed strings (Errors); the analysed strings from both passes form the Tokens.]

Tokenisation procedure:
Step 1: Normalise test data (running text) by removing capitalisation and punctuation;
Step 2: Tokenise on longest match right-to-left;
Step 3: Perform a morphological analysis of the “tokens” from step 2;
Step 4: Separate the tokens that were successfully analysed in step 3 from those that could not be analysed;
Step 5: Tokenise all unanalysed “tokens” from step 4 on whitespace; [Example: unanalysed wa me becomes wa and me.]
Step 6: Perform a morphological analysis of the “tokens” in step 5;
Step 7: Again, as in step 4, separate the analysed and unanalysed strings resulting from step 6;
Step 8: Combine all the valid tokens from steps 4 and 7.

This procedure yields the tokens obtained by computational means. Errors are typically strings that could not be analysed by the morphological analyser and should be rare. These strings should be subjected to human elicitation. Finally a comparison of the correspondences and differences

between the hand-tokenised tokens (hand-tokens) and the tokens obtained by computational means (auto-tokens) is necessary in order to assess the reliability of the described tokenisation approach.

The test data: Since the purpose was to establish the validity of the tokenisation approach, we made use of a short Setswana text of 547 orthographic words, containing a variety of verb constructions (see Table 1). The text was tokenised by hand and checked by a linguist in order to provide a means to measure the success of the tokenisation approach. Furthermore, the text was normalised not to contain capitalisation and punctuation. All word roots occurring in the text were added to the root lexicon of the morphological analyser to ensure that limitations in the analyser would not influence the tokenisation experiment.

Examples of output of step 2:
    ke tla nna
    o tla go kopa
    le ditsebe

Examples of output of step 3:
Based on the morphological analysis, the first two of the above longest matches are tokens and the third is not. The relevant analyses are:
    ke tla nna
        Verb(INDmode), (FUTtense,Pos): AgrSubj-1p-Sg+TmpPre+[nn]+Term
    o tla go kopa
        Verb(INDmode), (FUTtense,Pos): AgrSubj-Cl1+TmpPre+AgrObj-2p-Sg+[kop]+Term

Examples of output of step 5:
    le, ditsebe

Examples of output of step 6:
    le
        CopVerb(Descr), (INDmode), (FUTtense,Neg): AgrSubj-Cl5
    ditsebe
        NPre10+[tsebe]

5 Results and Discussion

The results of the tokenisation procedure applied to the test data are summarised in Tables 1 and 2.

    Token length              Test data    Correctly
    (in orthographic words)                tokenised
    2                         84           68
    3                         25           25
    4                         2            2

Table 1. Verb constructions

Table 1 shows that 111 of the 409 tokens in the test data consist of more than one orthographic word (i.e. verb constructions) of which 95 are correctly tokenised. Moreover, it suggests that the tokenisation improves with the length of the tokens.

                              Tokens         Types
    Hand-tokens, H            409            208
    Auto-tokens, A            412            202
    H ∩ A                     383 (93.6%)    193 (92.8%)
    A \ H                     29             9
    H \ A                     26             15
    Precision, P              0.93           0.96
    Recall, R                 0.94           0.93
    F-score, 2PR/(P+R)        0.93           0.94

Table 2. Tokenisation results

The F-score of 0.93 in Table 2 may be considered a promising result, given that it was obtained on the most challenging aspect of Setswana tokenisation. The approach scales well and may form the basis for a full scale, broad coverage tokeniser for Setswana. A limiting factor is the as yet incomplete root lexicon of the morphological analyser. However, this may be addressed by making use of a guesser variant of the morphological analyser that contains consonant/vowel patterns for phonologically possible roots to cater for absent roots.

It should be noted that the procedure presented in this paper yields correctly tokenised and morphologically analysed linguistic words, ready for subsequent levels of parsing.

We identify two issues that warrant future investigation:

• Longest matches that allow morphological analysis, but do not constitute tokens. Examples are ba ba neng, e e siameng and o o fetileng. In these instances the tokeniser did not recognise the qualificative particle. The tokenisation should have been ba/ ba neng, e/ e siameng and o/ o fetileng.

• Longest matches that do not allow morphological analysis and are directly split up into single orthographic words instead of allowing verb constructions of intermediate length. An example is e le monna, which was finally tokenised as e/ le/ monna instead of e le/ monna.

Finally, perfect tokenisation is context sensitive. The string ke tsala should have been tokenised as ke/ tsala (noun), and not as the verb construction ke tsala. In another context it can however be a verb with tsal- as the verb root.

In conclusion, we have successfully demonstrated that the novel combination of a precise tokeniser and morphological analyser for

Setswana could indeed form the basis for resolving the issue of disjunctive orthography.

6 Future work

• The extension of the morphological analyser to include complete coverage of the so-called closed word categories, as well as comprehensive noun and verb root lexicons;

• The refinement of the verb tokeniser to cater for a more extensive grammar of Setswana verb constructions and more sophisticated ways of reducing the length of invalid longest right-to-left matches;

• The application of the procedure to large text corpora.

Acknowledgements

This material is based upon work supported by the South African National Research Foundation under grant number 2053403. Any opinion, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Research Foundation.

References

Anderson, W.N. and Kotzé, P.M. 2006. Finite state tokenisation of an orthographical disjunctive agglutinative language: The verbal segment of Northern Sotho. In Proceedings of the 5th International Conference on Language Resources and Evaluation, Genoa, Italy, May 22-28, 2006.

Bosch, S.E. and Pretorius, L. 2002. The significance of computational morphology for Zulu lexicography. South African Journal of African Languages, 22(1):11-20.

Cole, D.T. 1955. An Introduction to Tswana Grammar. Longman, Cape Town, South Africa.

Dixon, R.M.W. and Aikhenvald, A.Y. 2002. Word: A Cross-linguistic Typology. Cambridge University Press, Cambridge, UK.

Forst, M. and Kaplan, R.M. 2006. The importance of precise tokenization for deep grammars. In Proceedings of the 5th International Conference on Language Resources and Evaluation, Genoa, Italy, May 22-28, 2006.

Hurskainen, A., Louwrens, L. and Poulos, G. 2005. Computational description of verbs in disjoining writing systems. Nordic Journal of African Studies, 14(4):438-451.

Kosch, I.M. 2006. Topics in Morphology in the African Language Context. Unisa Press, Pretoria, South Africa.

Krüger, C.J.H. 2006. Introduction to the Morphology of Setswana. Lincom Europe, München, Germany.

Megerdoomian, K. 2003. Text mining, corpus building and testing. In Handbook for Language Engineers, Farghaly, A. (Ed.). CSLI Publications, California, USA.

Mikheev, A. 2003. Text segmentation. In The Oxford Handbook of Computational Linguistics, Mitkov, R. (Ed.). Oxford University Press, Oxford, UK.

Otlogetswe, T.J. 2007. Corpus Design for Setswana Lexicography. PhD thesis. University of Pretoria, Pretoria, South Africa.

Palmer, D.D. 2000. Tokenisation and sentence segmentation. In Handbook of Natural Language Processing, Dale, R., Moisl, H. and Somers, H. (Eds.). Marcel Dekker, Inc., New York, USA.

Pretorius, R.S. 1997. Auxiliary Verbs as a Sub-category of the Verb in Tswana. PhD thesis. PU for CHE, Potchefstroom, South Africa.

Pretorius, L. and Bosch, S.E. 2003. Computational aids for Zulu natural language processing. South African Linguistics and Applied Language Studies, 21(4):267-281.

Pretorius, R., Viljoen, B. and Pretorius, L. 2005. A finite-state morphological analysis of Setswana nouns. South African Journal of African Languages, 25(1):48-58.

Pretorius, L., Viljoen, B., Pretorius, R. and Berg, A. 2008. Towards a computational morphological analysis of Setswana compounds. Literator, 29(1):1-20.

Schiller, A. 1996. Multilingual finite-state noun-phrase extraction. In Proceedings of the ECAI 96 Workshop on Extended Finite State Models of Language, Kornai, A. (Ed.).

Taljard, E. 2006. Corpus based linguistic investigation for the South African Bantu languages: a Northern Sotho case study. South African Journal of African Languages, 26(4):165-183.

Taljard, E. and Bosch, S.E. 2006. A Comparison of Approaches towards Word Class Tagging: Disjunctively versus Conjunctively Written Bantu Languages. Nordic Journal of African Studies, 15(4):428-442.

Van Wyk, E.B. 1958. Woordverdeling in Noord-Sotho en Zoeloe. ‘n Bydrae tot die Vraagstuk van Woordidentifikasie in die Bantoetale. University of Pretoria, Pretoria, South Africa.

Van Wyk, E.B. 1967. The word classes of Northern Sotho. Lingua, 17(2):230-261.
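As a supplement to section 4.5, the eight-step tokenisation procedure can be sketched in a few lines of Python. This is a rough sketch under stated assumptions, not the finite-state implementation described in the paper: the verb grammar is a Python regular expression covering only the indicative prefix fragment shown in section 4.4, the right-to-left longest match is approximated at the level of orthographic words, and TOY_ANALYSER is a hypothetical lookup table standing in for the morphological analyser. The input sentence is a toy string assembled from the step 2 examples in the paper.

```python
import re

# Concord inventories copied from the xfst fragment in section 4.4.
SUBJ = r"(?:ke|o|re|lo|le|a|ba|e|se|di|bo|go)"
OBJ = r"(?:go|re|lo|le|mo|ba|o|e|a|se|di|bo)"
# Indicative prefix sequence followed by a word ending in a, e or ng.
VERB_GROUP = re.compile(
    rf"(?:ga )?{SUBJ} (?:(?:a|sa) )?(?:(?:a|ka|sa) )?(?:tla )?(?:{OBJ} )?\w+(?:a|e|ng)"
)


def normalise(text):
    """Step 1: lower-case, drop punctuation, split into orthographic words."""
    return re.sub(r"[^\w\s]", " ", text.lower()).split()


def longest_matches(words):
    """Step 2: scan right to left, taking the longest verb group at each point."""
    tokens, end = [], len(words)
    while end > 0:
        start = end - 1                 # fall back to a single orthographic word
        for j in range(end - 1):        # smallest j first, i.e. longest span first
            if VERB_GROUP.fullmatch(" ".join(words[j:end])):
                start = j
                break
        tokens.append(" ".join(words[start:end]))
        end = start
    return tokens[::-1]


def tokenise(text, analyse):
    """Steps 3-8: keep analysed matches, re-split and re-analyse the rest."""
    tokens, errors = [], []
    for candidate in longest_matches(normalise(text)):
        if analyse(candidate):          # steps 3 and 4
            tokens.append(candidate)
            continue
        for word in candidate.split():  # steps 5 to 7
            (tokens if analyse(word) else errors).append(word)
    return tokens, errors               # step 8, in text order


# Hypothetical stand-in for the finite-state analyser: a lookup table.
TOY_ANALYSER = {"ke tla nna", "o tla go kopa", "le", "ditsebe"}.__contains__

tokens, errors = tokenise("O tla go kopa le ditsebe.", TOY_ANALYSER)
print(tokens, errors)
```

In this toy run the longest match le ditsebe is rejected by the stand-in analyser and falls back to the whitespace pass, which mirrors the step 5 and step 6 examples in section 4.5. Processing candidates one at a time, rather than in two batches, is a design choice made here so that the combined tokens of step 8 stay in text order.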

Interlinear glossing and its role in theoretical and descriptive studies of African and other lesser-documented languages

Dorothee Beermann
Norwegian University of Science and Technology
Trondheim, Norway
[email protected]

Pavel Mihaylov
Ontotext
Sofia, Bulgaria
[email protected]

Abstract

In a manuscript William Labov (1987) states that although linguistics is a field with a long historical tradition and with a high degree of consensus on basic categories, it experiences a fundamental division concerning the role that quantitative methods should play as part of the research progress. Linguists differ in the role they assign to the use of natural language examples in linguistic research and in the publication of its results. In this paper we suggest that the general availability of richly annotated, multi-lingual data directly suited for scientific publications could have a positive impact on the way we think about language, and how we approach linguistics. We encourage the systematic generation of linguistic data beyond what emerges from fieldwork and other descriptive studies and introduce an online glossing tool for textual data annotation. We argue that the availability of such an online tool will facilitate the generation of in-depth annotated linguistic examples as part of linguistic research. This in turn will allow the build-up of linguistic resources which can be used independent of the research focus and of the theoretical framework applied. The tool we would like to present is a non-expert-user system designed in particular for the work with lesser documented languages. It has been used for the documentation of several African languages, and has served for two projects involving universities in Africa.

1 Introduction

The role that digital tools play in all fields of modern linguistics cannot be underestimated. This is partially due to the success of computational linguistics and its involvement in fields such as lexicography, corpus linguistics and syntactic parsing, to name just some. Most crucially however this development is due to the success of IT in general and in particular to the World Wide Web which has created new standards also for linguistic research. Through the internet our perception of ’data’ and publication of linguistic results has changed drastically in a matter of only a few years. Although the development of language resources and language technology for African languages is increasing steadily, the digital revolution and the resources and possibilities it offers to linguistics are mostly seized by researchers in the First World connected to work centering around the key languages. For this paper we would like to conceive of this situation in terms of lost opportunities: At present formal linguistics and linguistic research conducted on Third World languages are mostly undertaken with very little knowledge of each other and hardly any exchange of research results. Likewise, language documentation, which has roots in language typology and computational linguistics, only partially coincides with work in African linguistics. Yet, it is evident that the general availability of linguistic material from a bigger sample of languages will eventually not only affect the way in which we think about language, but also might have an impact on linguistic methodology and on the way we go about linguistic research. If you are only a few mouse clicks away from showing that a certain generalization only holds for a limited set of languages, but truly fails to describe a given phenomenon for a wider sample, statements claiming linguistic generality have to be phrased much more carefully. Our perception of the nature of language could truly benefit from general access to representative multi-lingual data. It therefore would seem a linguistic goal in itself to (a) work towards a more general

Proceedings of the EACL 2009 Workshop on Language Technologies for African Languages – AfLaT 2009, pages 74–80, Athens, Greece, 31 March 2009. c 2009 Association for Computational Linguistics

and more straightforward access to linguistic resources, (b) encourage the systematic generation of linguistic data beyond what emerges from fieldwork and other descriptive studies and (c) advocate the generation of a multi-lingual data pool for linguistic research.

2 Annotation tools in linguistic research

It is well known that the generation of natural language examples enriched by linguistic information in the form of symbols is a time-consuming enterprise, quite independent of the form that the raw material has and the tools that were chosen. Equally well known are problems connected to the generation and storage of linguistic data in the form of standard document files or spreadsheets (Bird and Simons 2003). Although it is generally agreed that linguistic resources must be kept in a sustainable and portable format, it remains less clear what a tool should look like that would help the linguist to accomplish these goals. For the individual researcher it is not easy to decide which of the available tools serves his purpose best. To start with, it is often not clear which direction research will take, which categories of data are needed and in which form the material should be organized and stored. But perhaps even more importantly, most tools turn out to be so complex that the goal of mastering them becomes an issue in its own right. Researchers who work together with communities that speak an endangered or lesser documented language find that digital tools used for language documentation can be technically too demanding. Training periods for annotators become necessary, together with technical help and maintenance by experts who are not necessarily linguists themselves. In this way tool management develops into an issue in itself, taking away resources from the original task at hand - the linguistic analysis. Linguists too often experience that some unlucky decision concerning technical tools gets data locked in systems which cannot be accessed anymore after a project, and the technical support coming along with it, has run out of funding.

2.1 TypeCraft - an overview

In the following we would like to introduce a linguistic tool for text annotation called TypeCraft, which we have created through combining several well-understood tools of knowledge management. Needless to say, TypeCraft will not solve all the problems mentioned above, yet it has some new features that make data annotation an easier task while adding discernibility and general efficiency. That one can import example sentences directly into research papers is one of these features. In addition TypeCraft is a collaboration and knowledge sharing tool, and, combined with database functionality, it offers some of the most important functions we expect to see in a digital language documentation tool.

In the following we will address glossing and illustrate present day glossing standards with examples from Akan, a Kwa language spoken in Ghana, to then turn to a more detailed description of TypeCraft. However, a brief overview of the main features of TypeCraft seems in order at this point.

TypeCraft is a relational database for natural language text combined with a tabular text editor for interlinearized glossing, wrapped into a wiki which is used as a collaborative tool and for online publication. The system, which has at present 50 users and a repository of approximately 4000 annotated phrases, is still young. Table 1 gives a first overview of TypeCraft's main functionalities.

Table 1: Overview over TypeCraft Functionalities

Annotation
- tabular interface for word level glossing - automatic sentence break-up
- drop down reference list of linguistic symbols
- word and morpheme deletion and insertion
- lazy annotation mode (sentence parsing)
- customized sets of sentence level tags for the annotation of construction level properties

Collaboration
- individual work spaces for users that would like to keep data private
- data sharing for predefined groups such as research collaborations
- data export from the TypeCraft database to the TypeCraft wiki
- access to tag sets and help pages from the TypeCraft wiki
- access to information laid out by other annotators or projects

Data Migration
- manual import of text and individual sentence
- export of annotated sentence tokens (individual tokens or sets) to Microsoft Word, Open Office and LaTeX
- export of XML (embedded DTD) for further processing of data

3 Glossing

The use of glosses in the representation of primary data became a standard for linguistic publications as late as in the 1980s (Lehmann, 2004), when interlinear glosses for sample sentences started to be required for all language examples except those coming from English. However, the use of glossed examples in written research was, and still is, not accompanied by a common understanding of its function, neither concerning its role in research papers nor its role in research itself. It seems that glosses, when occurring in publications, are mostly seen as a convenience to the reader. Quite commonly information essential to the understanding of examples is given in surrounding prose, and often without any appropriate reflection in the glosses themselves.

Let us look at a couple of examples with interlinear glosses taken at random from the list of texts containing Akan examples. These examples are taken from the online database Odin at Fresno State University. The Odin database (http://www.csufresno.edu/odin/) is a repository of interlinear glossed texts which have been extracted mainly from linguistic papers. The database itself consists of a list of URLs ordered by language leading the user to the texts of interest.

3.1 The glossing of Akan - an example

Akan is one of the Kwa languages spoken in Ghana. The first example from the Odin database, here given as (1), comes from a paper by (Haspelmath, 2001):

(1) Ámá màà mè sìká.
    Ama give 1SG money
    'Ama gave me money.'

The second example is extracted from a paper by (Ameka, 2001):

(2) Ámá dè sìká nó máá mè.
    Ama take money the give 1SG
    'Ama gave me the money'
    (Lit: 'Ama took money gave me')

The third example is quoted in a manuscript by (Wunderlich, 2003):

(3) O-fEmm me ne pOflnko no.
    3sg-lent 1sg 3sgP horse that
    'He lent me a horse'

and the fourth one comes from a manuscript by (Drubig, 2000) who writes about focus constructions:

(4) Hena na Ama rehwehwE?
    who FOC Ama is-looking-for?
    'Who is it that Ama is looking for?'

Except for Ameka, the authors quote Akan examples which are excerpted from the linguistic literature. Often examples coming from African languages have a long citation history and their validation is in most cases nearly impossible. When we compare (1) to (4) we notice a certain inconsistency in the annotation of nó, which is glossed as 'the' (1), 'that' (3) and as DEF (2) respectively. This difference could indicate that Akan does not make a lexical distinction between definiteness and deixis; most likely, however, we simply observe a 'glossing figment'. The general lack of part of speech information in all examples easily leads us astray; should we for example assume that na in example (4) is a relative pronoun? The general lack of proper word level glossing makes the data quite useless for other linguists, in particular if they are not themselves native speakers or experts in exactly this language. Màà is a past form, but that tense marking is derived by suffixation is only indicated in (2) via a hyphen between the translational gloss and the PAST tag. Likewise rehwehwE in (4) is a progressive form, yet the lack of morpheme boundaries and of consistent annotation prevents these and similarly glossed examples from serving as a general linguistic resource. Purely translational glosses might be adequate for text strings which serve as mere illustrations; however, for linguistic data, that is those examples that are (a) either crucial for the evaluation of the theoretical development reported on, or (b) portray linguistic patterns of general interest, to provide morpho-syntactic and morpho-functional as well as part of speech information would seem best practice.

It seems that linguists underestimate the role that glossing, if done properly, could play as part of linguistic research. Symbolic rewriting and formal-grammar development are two distinct modes of linguistic research. Yet there is no verdict that forces us to express descriptive generalizations exclusively by evoking a formal apparatus of considerable depth. Instead, given simplicity and parsimony of expression, it might well be that symbolic rewriting serves better for some research purposes than theoretical modeling. One cannot replace one by the other. Yet which form of linguistic rendering is the best in a given situation should be a matter of methodological choice. Essential is that we realize that we have a choice. Seizing the opportunity that lies in the application of symbolic rewriting, of which interlinear glossing is one form, could make us realize that the generation of true linguistic resources is not exclusively a matter best left to computational linguists.

4 A short description of TypeCraft

TypeCraft is an interlinear 'glosser' designed for the annotation of natural language phrases and small corpora. The TypeCraft wiki serves as an access point to the TypeCraft database. We use standard wiki functionality to direct the TypeCraft user from the index page of the TypeCraft wiki to the TC interface of the database, called My Texts. My Texts is illustrated in Figure 1. The interface is taken from a user that not only possesses private data (Own texts), but who also shares data with other users (Shared Texts). At present sharing of text is a feature set by the database administrator, but in the near future the user will be able to choose from the TypeCraft user list the people with whom he wants to share his data. Note that data is stored as texts which consist of annotated tokens, standardly sentences. 'Text' in TypeCraft does not necessarily entail coherent text, but may also refer to any collection of individual tokens that the user has grouped together. A TypeCraft user can publish his data online; yet his own texts are by default 'private', that is, only he as the owner of the material can see the data and change it. To share data within the system or online is a function that can be selected by the user.

Figure 1: My texts in TypeCraft

Different from Toolbox, which is a linguistic data management system known to many Africanists, TypeCraft is a relational database and therefore by nature has many advantages over file based systems like Toolbox. This concerns both data integrity and data migration. In addition databases in general offer a greater flexibility for data search. For example, it is not only possible to extract all serial verb constructions for all (or some) languages known to TypeCraft, it is also possible to use the gloss index to find all serial verb constructions where a verb receives a marking specific to the second verb in an SVC. The other major difference between Toolbox and TypeCraft is that TypeCraft is an online system, which brings many advantages, but also some disadvantages. An online database is a multi-user system, that is, many people can access the same data at the same time independent of where they physically are. Distributed applications are efficient tools for international research collaboration. TypeCraft is designed to allow data sharing and collaboration during the process of annotation. Yet although there are many advantages to an online tool, to be only online is at the same time a major disadvantage. Not all linguists work with a stable internet connection, and in particular for work in the field TypeCraft is not suitable.

TypeCraft uses Unicode, so that every script that the user can produce on his or her PC can be entered into the browser,1 which for TypeCraft must be Mozilla Firefox. Different from Toolbox, TypeCraft insists on a set of linguistic glosses, reflecting standards advocated for example by the Leipzig Convention distributed by the Max Planck Institute for Evolutionary Anthropology or an initiative such as GOLD (Farrar and Lewis, 2005). Yet, TypeCraft still allows a user-driven flexibility when it comes to the extension of the tag-set, as explained in the next section.

1 Note however that self-defined characters or characters that are not Unicode will also cause problems in TypeCraft.

5 Glossing with TypeCraft

TypeCraft supports word-to-word glossing on eight tiers as shown in Figure 2. After having imported a text and run it through the sentence splitter, a process that we will not describe here, the user can select via mouse click one of the phrases and enter the annotation mode. The system prompts the user for the Lazy Annotation Mode (in Toolbox called sentence parsing) which will automatically insert (on a first choice basis) the annotation of already known words into the annotation table. TypeCraft distinguishes between translational, functional and part-of-speech glosses. They are visible to the annotator as distinct tiers called Meaning, Gloss and POS. Every TypeCraft phrase, which can be either a linguistic phrase or a sentence, is accompanied by a free translation. In addition the specification of construction parameters is possible. Although the user is restricted to a set of pre-defined tags, the TypeCraft glossary is negotiable. User discussion on the TCwiki, for example in the context of project work, or by individual users, has led to an extension of the TypeCraft tag set. Although TypeCraft endorses standardization, the system is user-driven. Glosses are often rooted in traditional grammatical terminology, which we would like to set in relation to modern linguistic terminology. The TCwiki is an adequate forum to discuss these traditions and to arrive at an annotation standard which is supported by the users of the system. Under annotation the user has access to a drop-down menu, showing standard annotation symbols. These symbols together with short explanations can also be accessed from the TypeCraft wiki so that they can be kept open in tabs during annotation. In Figure 2 we also see the effect of 'mousing over' symbols, which displays their 'long-names'. Some symbols have been ordered in classes. In Figure 2 we see for example that the feature past is a subtype of the feature Tense. This classification will in the future also inform search. Further features of the annotation interface that we cannot describe here are the easy representation of non-Latin scripts, deletion and insertion of words and morphemes during annotation, the accessibility of several phrases under annotation and the grouping of tokens into texts.

Figure 2: Glossing in TypeCraft

6 Data Migration

Export of data to the main text editors is one of the central functions of TypeCraft. TC tokens can be exported to Microsoft Word, OpenOffice.org Writer and LaTeX. This way the user can store his data in a database, and when the need arises, he can integrate it into his research papers. Although annotating in TypeCraft is time consuming, even in Lazy Annotation Mode, the reusability of data stored in TypeCraft will pay off in the long run. Export can be selected from the text editing window or from the SEARCH interface. After import the examples can still be edited in case small adjustments are necessary. Example (5) is an example exported from TypeCraft.

(5) Omu   nju         hakataahamu                 abagyenyi
    òmù   njù         hàkàtààhàmù                 àbàgyéngyì
    Omu   n   ju      ha   ka  taah  a   mu       a  ba  gyenyi
    in    CL9 house   CL16 PST enter IND LOC      IV CL2 visitor
    PREP  N           V                           N
    'In the house entered visitors'

(5) illustrates locative inversion in Runyakitara, a Bantu language spoken in Uganda. The translational and functional glosses, which belong to two distinct tiers in the TypeCraft annotation interface, appear as one line when imported to one of the word processing programs supported by TypeCraft. Although glossing on several tiers is conceptually more appropriate, linguistic publications require a more condensed format. As for now we have decided on an export which displays 6 tiers. Next to export to the main editors, TypeCraft allows XML export, which allows the exchange of data with other applications.

Figure 3 gives an overview over the top 15 languages in TypeCraft. In January 2009 Lule Sami with 2497 phrases and Runyakitara (Runyankore Rukiga) with 439 phrases were the top two languages. At present the database contains approximately 4000 phrases from 30 languages. Most of the smaller languages (with between 40 and 300 sentences) are African languages.

Figure 3: Top 15 TypeCraft languages by number of phrases
- Lule Sami (2497)
- Runyankore-Rukiga (439)
- Norwegian (411)
- Akan (314)
- Nyanja (192)
- Ganda (114)
- German (94)
- Sekpele (87)
- Abron (77)
- Bini (71)
- Koyraboro Senni Songhai (64)
- Tumbuka (64)
- English (60)
- Icelandic (56)
- Ga (55)

7 Conclusion

In this paper we suggest that the general availability of richly annotated, multi-lingual data directly suited for scientific publication could have a positive impact on the way we think about language, and how we approach linguistics. We stress the opportunity that lies in the application of symbolic rewriting, of which interlinear glossing is one form, and encourage the systematic generation of linguistic data beyond what emerges from fieldwork and other descriptive studies. With TypeCraft we introduce an online glossing tool for textual data which has two main goals: (a) to allow linguists to gloss their data without having to learn how to install software and without having to undergo a long training period before they can use the tool and (b) to make linguistically annotated data available to a bigger research community. We hope that the use of this tool will add to the standardization of language annotation. We further hope that TypeCraft will be used as a forum for linguistic projects that draw attention to the lesser-studied languages of the world.
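The condensed export described in Section 6, where the Meaning and Gloss tiers appear as one line, can be illustrated with a short Python sketch. This is only an illustration under our own assumptions: the names Morpheme, Word and condensed_lines are hypothetical and do not come from TypeCraft, whose export code is not described in the paper. The data is example (5).

```python
from dataclasses import dataclass


@dataclass
class Morpheme:
    form: str
    meaning: str = ""  # translational gloss (Meaning tier), e.g. "house"
    gloss: str = ""    # functional gloss (Gloss tier), e.g. "CL9"


@dataclass
class Word:
    morphemes: list    # list of Morpheme objects
    pos: str           # part-of-speech tag (POS tier)


def merged_label(m):
    # Hypothetical merge policy: keep whichever tier is filled;
    # if both are filled, join them with a dot.
    return ".".join(x for x in (m.meaning, m.gloss) if x)


def condensed_lines(words):
    """Return (morpheme line, merged gloss line, POS line) for a phrase."""
    forms = " ".join("-".join(m.form for m in w.morphemes) for w in words)
    glosses = " ".join("-".join(merged_label(m) for m in w.morphemes) for w in words)
    pos = " ".join(w.pos for w in words)
    return forms, glosses, pos


# Example (5), Runyakitara, with the tier values given in the paper.
EXAMPLE_5 = [
    Word([Morpheme("Omu", meaning="in")], "PREP"),
    Word([Morpheme("n", gloss="CL9"), Morpheme("ju", meaning="house")], "N"),
    Word([Morpheme("ha", gloss="CL16"), Morpheme("ka", gloss="PST"),
          Morpheme("taah", meaning="enter"), Morpheme("a", gloss="IND"),
          Morpheme("mu", gloss="LOC")], "V"),
    Word([Morpheme("a", gloss="IV"), Morpheme("ba", gloss="CL2"),
          Morpheme("gyenyi", meaning="visitor")], "N"),
]

if __name__ == "__main__":
    for line in condensed_lines(EXAMPLE_5):
        print(line)
```

Keeping the two gloss tiers as separate fields and merging them only at export time mirrors the distinction the paper draws between annotation (several tiers) and publication (a condensed format).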

References

Felix K. Ameka. 2001. Multiverb constructions in a West African areal typological perspective. In Dorothee Beermann and Lars Hellan, editors, Online Proceedings of TROSS – Trondheim Summer School 2001.

Hans Bernhard Drubig. 2000. Towards a typology of focus and focus constructions. Manuscript, University of Tübingen, Germany.

Scott Farrar and William D. Lewis. 2005. The GOLD community of practice: An infrastructure for linguistic data on the web. In Proceedings of the EMELD 2005 Workshop on Digital Language Documentation: Linguistic Ontologies and Data Categories for Language Resources.

Martin Haspelmath. 2001. Explaining the ditransitive person-role constraint: A usage-based approach. Manuscript, Max-Planck-Institut für evolutionäre Anthropologie.

William Labov. 1987. Some observations on the foundation of linguistics. Unpublished manuscript, University of Pennsylvania, USA.

Christian Lehmann. 2004. Interlinear morphological glossing. In Morphologie: Ein Internationales Handbuch zur Flexion und Wortbildung. de Gruyter, Berlin-New York.

Dieter Wunderlich. 2003. Was geschieht mit dem dritten Argument? Manuscript, University of Düsseldorf, Germany.

Towards an electronic dictionary of Tamajaq language in Niger

Chantal Enguehard
LINA - UMR CNRS 6241
2, rue de la Houssinière
BP 92208
44322 Nantes Cedex 03
France
chantal.enguehard@univ-nantes.fr

Issouf Modi
Ministère de l'Education Nationale
Direction des Enseignements du Cycle de Base1
Section Tamajaq
Republique du Niger
[email protected]

Abstract

We present the Tamajaq language and the dictionary we used as main linguistic resource in the first two parts. The third part details the complex morphology of this language. In part 4 we describe the conversion of the dictionary into electronic form, the inflectional rules we wrote and their implementation in the Nooj software. Finally we present a plan for our future work.

1. The Tamajaq language

1.1 Socio-linguistic situation

In Niger, the official language is French and there are eleven national languages. Five are taught in experimental schools: Fulfulde, Hausa, Kanuri, Tamajaq and Soŋay-Zarma.
According to the last census in 1998, the Tamajaq language is spoken by 8.4% of the 13.5 million people who live in Niger. This language is also spoken in Mali, Burkina-Faso, Algeria and Libya. It is estimated there are around 5 million Tamajaq speakers around the world.
The Tamacheq language belongs to the group of Berber languages.

1.2 Tamajaq alphabet

The Tamajaq alphabet used in Niger (Republic of Niger, 1999) uses 41 characters, 14 with diacritical marks that all figure in the Unicode standard (see appendix A). There are 12 vowels: a, â, ă, ə, e, ê, i, î, o, ô, u, û.

1.3 Articulatory phonetics

Consonants               Voiceless  Voiced
Bilabial     Plosive                b
             Nasal                  m
             Trill                  r
             Semivowel              w
Labiodental  Fricative   f
Dental       Plosive     t          d
             Fricative   s          z
             Nasal                  n
             Lateral                l
Pharyngeal   Plosive     ṭ          ḍ
             Fricative   ṣ          ẓ
             Lateral                ḷ
Palatal      Plosive     c          ǰ
             Fricative   š          j
             Semivowel              y
Velar        Plosive     k          g, ǧ
             Fricative   x          ɣ
             Nasal                  ŋ
Glottal      Plosive     q
             Fricative   h

Table 1a: Articulatory phonetics of Tamajaq consonants

Proceedings of the EACL 2009 Workshop on Language Technologies for African Languages – AfLaT 2009, pages 81–88, Athens, Greece, 31 March 2009. c 2009 Association for Computational Linguistics

Vowels    Close  Close-mid  Open-mid  Open
Palatal   i      e
Central          ə          ă         a
Labial    u      o

Table 1b: Articulatory phonetics of Tamajaq vowels

1.4 Tools on computers

There are no specific NLP tools for the Tamajaq language.
However, characters can be easily typed on French keyboards thanks to the AFRO keyboard layout (Enguehard and Naroua, 2008).

2 Lexicographic resources

We use the school editorial dictionary "dictionnaire Tamajaq-français destiné à l'enseignement du cycle de base 1". It was written by the SOUTEBA1 project of the DED2 organisation in 2006. Because it targets children, this dictionary consists only of 5,390 entries. Words have been chosen by compiling school books.

1 Soutien à l'éducation de base (support for basic education).
2 DED: Deutscher Entwicklungsdienst.

2.1 Structure of an entry

Each entry generally details:
- lemma,
- lexical category,
- translation in French,
- an example,
- gender (for nouns),
- plural form (for nouns).

Examples:

« ăbada1: sn. bas ventre. Daw tǝdist. Bărar wa yǝllûẓăn ad t-yǝltǝɣ ăbada-net. tǝmust.: yy. igǝt: ibadan. »

« ăbada2: sn. flanc. Tasăga meɣ daw ădăg ǝyyăn. Imǝwwǝẓla ǝklăn dăɣ ăbada n ǝkašwar. Anammelu.: azador. tǝmust.: yy. Ǝsǝfsǝs.: ă. Igǝt: ibadan. »

Homonyms are described in different entries and followed by a number, as in the above example.

2.2 Lexical categories

The linguistic terms used in the dictionary are written in the Tamajaq language using the abbreviations presented in table 2. In addition, this table gives the number of entries of each lexical category.

Lexical category (Tamajaq)  English              Abbreviation  Number of entries
əḍəkuḍ                      number               ḍkḍ.          3
ənalkam                     determinant          nlkm.         1
anamal                      verb                 nml.          1450
samal                       adjective            sml.          48
əsəmmadaɣ ən təla           possessive pronoun   smmdɣtl.      5
isən                        noun                 sn.           3648
isən n ənamal               verbal noun          snnml.        33
isən an təɣərit             name of shout        sntɣrt.       2
isən xalalan                proper noun          snxln.        29
isən iẓẓəwen                complex noun         snẓwn.        137
əstakar                     adverb               stkr.         8
əsatkar n adag              adverb of location   stkrdg.       10
əṣatkar n igət              adverb of quantity   stkrgt.       1
təɣərit                     onomatopoeia         tɣrt.         8
tənalkamt                   particle             tnlkmt.       2

Table 2: Tamajaq lexical categories

3 Morphology

The Tamajaq language presents a rich morphology (Aghali-Zakara, 1996).

3.1 Verbal morphology

Verbs are classified according to the number of consonants of their lexical root and then in different types. There are monoliteral, biliteral, triliteral, quadriliteral verbs...
Three moods are distinguished: imperative, simple injunctive and intense injunctive.
Three aspects present different possible values:
- accomplished: intense or negative;
- non accomplished: simple, intense or negative;
- aorist future: simple or negative.

Examples:
əktəb (to write): triliteral verb, type 1.
əṣṣən (to know): triliteral verb, type 2 (ṣṣn).
əməl (to say): biliteral verb, type 1
akər (to steal): biliteral verb, type 2
awəy (to carry): biliteral verb, type 3
ašwu (to drink): biliteral verb, type 4
aru (to love): monoliteral verb, type 2
aru (to open): monoliteral verb, type 3

Each class of verb has its own rules of conjugation.

3.2 Nominal morphology

a. Simple nouns

Nouns present three characteristics:
- gender: masculine or feminine;
- number: singular or plural;
- annexation state, which is marked by the change of the first vowel.

Terminology  English           Abbreviation
təmust       gender            tmt.
yey          masculine         yy.
tənte        feminine          tnt.
awdəkki      singular          wdk.
iget         plural            gt.
əsəfsəs      annexation state  sfss.

Table 3: Tamajaq terminology for nouns

Example:

« aṭrǝkka: sn. morceau de sucre. Akku: ablǝɣ n°2. tǝmust.: yy. Ǝsǝfsǝs.: ǝ. Igǝt: ǝṭrǝkkatăn. »

"aṭrǝkka" is a masculine noun. Its plural is "ǝṭrǝkkatăn". It becomes "ǝṭrǝkka" when annexation state is expressed.

The plural form of nouns is not regular and has to be specifically listed.

b. Complex nouns

Complex nouns are composed of several lexical units connected together by hyphens. They can include nouns, determiners or prepositions as well as verbs. Examples:

Noun + determiner + noun
"ejaḍ-n-əjḍan" literally means "donkey of birds" (this is the name of a bird).

Verb + noun
"awəy-əhuḍ" literally means "it follows harmattan" (kite).
"gaẓẓay-təfuk" literally means "it looks at sun" (sunflower).

Preposition + noun
"In-tamaṭ" means "the one of the tree acacia" (of acacia).

Verb + verb
"azəl-azəl" means "run run" (return).

We counted 238 complex nouns in the studied dictionary.

4 Natural Language Processing of Tamajaq

4.1 Nooj software (Silberztein, 2007)

« Nooj is a linguistic development environment that includes tools to create and maintain large-coverage lexical resources, as well as morphological and syntactic grammars. » This software is specifically designed for linguists, who can use it to test hypotheses on real corpora. « Dictionaries and grammars are applied to texts in order to locate morphological, lexical and syntactic patterns and tag simple and compound words. » Nooj puts all possible tags on each token or group of tokens but does not disambiguate between the multiple possibilities. However, the user can build his own grammar to choose between the multiple possible tags. The analysis can be displayed as a syntactic tree.
This software runs on Windows.
We chose to construct resources for this software because it is fully compatible with Unicode.

4.2 Construction of the dictionary

We convert the edited dictionary for the Nooj software. 3,463 simple nouns, 128 complex nouns, 46 adjectives and 33 verbo-nouns are given with their plural form. Annexation state is indicated for 987 nouns, 23 complex nouns, 2 adjectives and 7 verbo-nouns.

We created morphological rules that we expressed as Perl regular expressions and also in the Nooj format (with the associated tag).

a. Annexation state rules

Thirteen morphological rules calculate the annexation state.

Examples:

The 'A1ă' rule replaces the first letter of the word by 'ă'.

'A1ă' rule
Nooj  ă/sfss
Perl  ^.(.*)$ → ă$1
Table 4: Rule 'A1ă'

The 'A2ǝ' rule replaces the second letter of the word by 'ǝ'.

'A2ǝ' rule
Nooj  A2ǝ=ǝ/sfss
Perl  ^(.).(.*)$ → $1ǝ$2
Table 5: Rule 'A2ǝ'

b. Plural form rules

We searched for formal rules to unify the calculation of plural forms. We found 126 rules that each fit from 2 up to 446 words. 2932 words could be associated with at least one inflectional rule.

Examples:

The 'I4' rule deletes the last letter, adds "-ăn" at the end and "i-" at the beginning.

Nooj  I4=ăni/Iget
Perl  ^(.*).$ → i$1ăn
# 446 words
Table 6: Rule 'I4'

The 'I2' rule deletes the last and the second letters and includes "-en" at the end and "-i-" in the second position.

Nooj  I2=eni/Iget
Perl  ^(.).(.*).$ → $1i$2en
# 144 words
Table 7: Rule 'I2'

The 'I45' rule deletes the final letter and includes "-en" at the end.

Nooj  I45=en/Iget
Perl  ^(.*).$ → $1en
# 78 words
Table 8: Rule 'I45'

The 'I102' rule deletes the two last letters and the second one and includes a final "-a" and an "-i-" in the second position.

Nooj  I102=ai/Iget
Perl  ^(.).(.*)..$ → $1i$2a
# 6 words
Table 9: Rule 'I102'

c. Combined rules

When it was necessary, the above rules have been combined to calculate singular and plural forms with or without annexation state.
We thus finally obtained 319 rules.

Example:
I2RA2ă = :Rwdk + :I2 + :Rwdk :A2ă + :I2 :A2ă

Fig. 1: Rule I2RA2ă

This rule recognizes the singular form (:Rwdk), the plural form (:I2), the singular form with the annexation state (:Rwdk :A2ă) and the plural form with the annexation state (:I2 :A2ă).
25 words meet this rule.

For instance, "taḍlǝmt" (accusation, provocation) is inflected in:
- taḍlǝmt,taḍlǝmt,SN+tnt+wdk
- tiḍlǝmen,taḍlǝmt,SN+tnt+Iget
- tăḍlǝmen,taḍlǝmt,SN+tnt+Iget+sfss
- tăḍlǝmt,taḍlǝmt,SN+tnt+wdk+sfss

d. Conjugation rules

Verb classes are not indicated in the dictionary. We only describe a few conjugation rules, just to check the expressivity of the Nooj software.

Here is the rule of the verb "əṣṣən" (to know), intense accomplished aspect, represented as a transducer.

Fig. 2: Verb "əṣṣən", intense accomplished aspect

We obtain, in the inflected dictionary, the correct conjugated forms.
əṣṣanaɣ+əṣṣən,V+accompli+wdk+1
təṣṣanaɣ+əṣṣən,V+accompli+wdk+2
iṣṣan+əṣṣən,V+accompli+wdk+yy+3
təṣṣan+əṣṣən,V+accompli+wdk+tnt+3
nəṣṣan+əṣṣən,V+accompli+gt+1
təṣṣanam+əṣṣən,V+accompli+gt+yy+2
təṣṣanmat+əṣṣən,V+accompli+gt+tnt+2
əṣṣanan+əṣṣən,V+accompli+gt+yy+3
əṣṣannat+əṣṣən,V+accompli+gt+tnt+3

e. Irregular words

Finally, the singular and plural forms of 2,457 words were explicitly written in the Nooj dictionary because they do not follow any regular rule.

Examples:

Singular      Plural        Translation
ag-awnaf      kel-awnaf     tourist
amanẓo        imenẓa        young animal
ănaffarešši   inǝffǝrǝšša   somebody in a bad mood
ănesbehu      inǝsbuha      liar
efange        ifangăyan     bank
efajanfăj     ifajanfăɣăn   sling
emagărmăz     imagămăzăn    plant
emazzăle      imazzaletăn   singer
taḍaggalt     tiḍulen       daughter-in-law
tejăṭ         tizḍen        goal (football)

Table 10: Examples of irregular plural forms

f. Result

There are 6,378 entries in the Nooj dictionary. The inflected dictionary, calculated from the above dictionary with the inflectional and conjugation rules, contains 11,223 entries.
Nooj is able to use the electronic dictionary we have created to automatically tag a text (see an example in appendix B).

4.3 Future work

a Conversion into XML format
We will convert the inflectional dictionary into the international standard Lexical Markup Framework format (Francopoulo et al., 2006) in order to make it easily usable by other NLP applications.

b Automatic search of rules
Due to the high morphological complexity of the Tamajaq language, we plan to develop a Perl program that would automatically determine the derivational and conjugation rules.

c Completion and correction of the resource
The linguistic resource will be completed during the next months in order to add the class of verbs that are absent for the moment, and also to correct the errors that we noticed during this study.

d Enrichment of the resource
We plan to construct a corpus of school texts to evaluate the out-of-vocabulary rate of this dictionary. This corpus could then be used to enrich the dictionary. The information given by Nooj would be useful to choose the words to add.
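The regular-expression rules of Section 4.2 can also be tried out outside Nooj. The sketch below restates them with Python's re module rather than Perl; it is an illustration under stated assumptions, not the authors' implementation. In particular, 'A2ă' is assumed to work like 'A2ǝ' with a different vowel (it is only named inside the combined rule), 'I45' is written from its prose description (drop the last letter, append "-en"), and the function names are hypothetical.

```python
import re
import unicodedata


def nfc(text):
    # Use precomposed characters so that "." matches one letter at a time.
    return unicodedata.normalize("NFC", text)


# (pattern, replacement) pairs; \1 and \2 correspond to Perl's $1 and $2.
RULES = {
    "A1ă": (r"^.(.*)$", r"ă\1"),            # first letter -> ă
    "A2ǝ": (r"^(.).(.*)$", r"\1ǝ\2"),       # second letter -> ǝ
    "A2ă": (r"^(.).(.*)$", r"\1ă\2"),       # assumed analogue of A2ǝ
    "I4": (r"^(.*).$", r"i\1ăn"),           # drop last letter, i- ... -ăn
    "I2": (r"^(.).(.*).$", r"\1i\2en"),     # drop 2nd and last, -i- ... -en
    "I45": (r"^(.*).$", r"\1en"),           # drop last letter, add -en
    "I102": (r"^(.).(.*)..$", r"\1i\2a"),   # drop 2nd and last two, -i- ... -a
}


def apply_rule(name, word):
    """Apply one named rule to a word and return the derived form."""
    pattern, replacement = RULES[name]
    return nfc(re.sub(pattern, nfc(replacement), nfc(word), count=1))


def inflect_i2_a2(lemma):
    """Four forms produced by the combined rule I2RA2ă."""
    plural = apply_rule("I2", lemma)
    return {
        "wdk": nfc(lemma),
        "Iget": plural,
        "wdk+sfss": apply_rule("A2ă", lemma),
        "Iget+sfss": apply_rule("A2ă", plural),
    }


if __name__ == "__main__":
    for tags, form in inflect_i2_a2("taḍlǝmt").items():
        print(f"{form},taḍlǝmt,SN+tnt+{tags}")
```

Normalizing to NFC before matching matters here: if a letter such as ḍ were stored as a base letter plus a combining dot, the "." in the patterns would count it as two letters and the rules would cut words in the wrong place.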

Acknowledgement

Special thanks to John Johnson, reviewer of this text.

References

Aghali-Zakara M. 1996. Éléments de morphosyntaxe touarègue. Paris: CRB-GETIC, 112 p.

Enguehard C. and Naroua H. 2008. Evaluation of Virtual Keyboards for West-African Languages. Proceedings of the Sixth International Language Resources and Evaluation (LREC'08), Marrakech, Morocco.

Francopoulo G., George M., Calzolari N., Monachini M., Bel N., Pet M., Soria C. 2006. Lexical Markup Framework (LMF). LREC, Genoa, Italy.

Republic of Niger. 19 October 1999. Arrêté 214-99 de la République du Niger.

Max Silberztein. 2007. An Alternative Approach to Tagging. NLDB 2007: 1-11.

APPENDIX A : Tamajaq official alphabet (Republic of Niger, 1999)

Character  Code     Character  Code
a          U+0061   A          U+0041
â          U+00E2   Â          U+00C2
ă          U+0103   Ă          U+0102
ǝ          U+01DD   Ǝ          U+018E
b          U+0062   B          U+0042
c          U+0063   C          U+0043
d          U+0064   D          U+0044
ḍ          U+1E0D   Ḍ          U+1E0C
e          U+0065   E          U+0045
ê          U+00EA   Ê          U+00CA
f          U+0066   F          U+0046
g          U+0067   G          U+0047
ǧ          U+01E7   Ǧ          U+01E6
h          U+0068   H          U+0048
i          U+0069   I          U+0049
î          U+00EE   Î          U+00CE
j          U+006A   J          U+004A
ǰ          U+01F0   J̌          U+004A U+030C
ɣ          U+0263   Ɣ          U+0194
k          U+006B   K          U+004B
l          U+006C   L          U+004C
ḷ          U+1E37   Ḷ          U+1E36
m          U+006D   M          U+004D
n          U+006E   N          U+004E
ŋ          U+014B   Ŋ          U+014A
o          U+006F   O          U+004F
ô          U+00F4   Ô          U+00D4
q          U+0071   Q          U+0051
r          U+0072   R          U+0052
s          U+0073   S          U+0053
ṣ          U+1E63   Ṣ          U+1E62
š          U+0161   Š          U+0160
t          U+0074   T          U+0054
ṭ          U+1E6D   Ṭ          U+1E6C
u          U+0075   U          U+0055
û          U+00FB   Û          U+00DB
w          U+0077   W          U+0057
x          U+0078   X          U+0058
y          U+0079   Y          U+0059
z          U+007A   Z          U+005A
ẓ          U+1E93   Ẓ          U+1E92
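A code-point table like the one above can be regenerated and checked mechanically. The following Python sketch is an illustration only: the letter list is typed from Appendix A, and the function names are our own. It relies on the standard unicodedata module and on Python's built-in case mapping.

```python
import unicodedata

# The 41 lowercase letters of the alphabet, as listed in Appendix A.
LETTERS = ("a â ă ǝ b c d ḍ e ê f g ǧ h i î j ǰ ɣ k l ḷ m n ŋ "
           "o ô q r s ṣ š t ṭ u û w x y z ẓ")
ALPHABET = [unicodedata.normalize("NFC", ch) for ch in LETTERS.split()]


def code_points(text):
    """Format each code point of text as U+XXXX."""
    return " ".join(f"U+{ord(c):04X}" for c in text)


def has_diacritic(letter):
    # A letter carries a diacritical mark if it decomposes into
    # a base letter plus at least one combining character.
    return len(unicodedata.normalize("NFD", letter)) > 1


def table_rows():
    """Return (lowercase, code, uppercase, code) for every letter."""
    return [(ch, code_points(ch), ch.upper(), code_points(ch.upper()))
            for ch in ALPHABET]


if __name__ == "__main__":
    for row in table_rows():
        print(*row, sep="\t")
    print(len(ALPHABET), "letters,",
          sum(has_diacritic(ch) for ch in ALPHABET), "with diacritical marks")
```

The uppercase form of ǰ has no precomposed code point, so it comes out as the two-code-point sequence J followed by a combining caron; software that assumes one code point per letter needs to allow for this.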

APPENDIX B : Nooj tagging Tamajaq text

Nooj perfectly recognizes the four forms of the word "awăqqas" (big cat) in the text: "awăqqas, iwaɣsan, awaɣsan"

These forms are listed in the inflectional dictionary as:

awăqqas,awăqqas,SN+yy+wdk

awăqqas,awăqqas,SN+yy+wdk+FLX=A1a+sfss

iwaɣsan,awăqqas,SN+yy+iget

awaɣsan,awăqqas,SN+yy+iget+FLX=A1a+sfss

Fig. 3: Tags on the text "awăqqas, iwaɣsan, awaɣsan"

In figure 3, we can see that the first token "awăqqas" gets two tags:
- "awăqqas,SN+yy+wdk" (singular)
- "awăqqas,SN+yy+wdk+sfss" (singular and annexation state).
The second and third tokens each get a unique tag because there is no ambiguity.
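The lookup behind this tagging can be imitated in a few lines of Python. The sketch below is not Nooj and makes no claim about how Nooj works internally; it only reads the four inflected-dictionary lines quoted in this appendix (format assumed to be form,lemma,tags) and returns every analysis for each token, without disambiguation. All function names are hypothetical.

```python
import re
import unicodedata
from collections import defaultdict

# The four inflected-dictionary lines quoted in Appendix B.
INFLECTED = """\
awăqqas,awăqqas,SN+yy+wdk
awăqqas,awăqqas,SN+yy+wdk+FLX=A1a+sfss
iwaɣsan,awăqqas,SN+yy+iget
awaɣsan,awăqqas,SN+yy+iget+FLX=A1a+sfss
"""


def nfc(text):
    return unicodedata.normalize("NFC", text)


def load_index(dictionary_text):
    """Map each inflected form to the list of its (lemma, category, features)."""
    index = defaultdict(list)
    for line in nfc(dictionary_text).splitlines():
        form, lemma, tags = line.split(",", 2)
        category, *features = tags.split("+")
        index[form].append((lemma, category, features))
    return index


def tag(text, index):
    """Return every known analysis for each token; no disambiguation."""
    tokens = [t for t in re.split(r"[\s,]+", nfc(text).strip()) if t]
    return [(token, index.get(token, [])) for token in tokens]


if __name__ == "__main__":
    index = load_index(INFLECTED)
    for token, analyses in tag("awăqqas, iwaɣsan, awaɣsan", index):
        print(token, len(analyses), analyses)
```

Run on the sample text, the first token receives two analyses and the other two receive one each, which matches the description of figure 3.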

A repository of free lexical resources for African languages: the project and the method

Piotr Bański
Institute of English Studies
University of Warsaw
Warsaw, Poland
[email protected]

Beata Wójtowicz
Institute of Oriental Studies
University of Warsaw
Warsaw, Poland
[email protected]

Proceedings of the EACL 2009 Workshop on Language Technologies for African Languages – AfLaT 2009, pages 89–95, Athens, Greece, 31 March 2009. © 2009 Association for Computational Linguistics

Abstract

We report on a project which we believe to have the potential to become home to, among others, bilingual dictionaries for African languages. Kept in a well-structured XML format with several possible degrees of conformance, the dictionaries can be usable even in their early versions, which will then be subject to supervised improvement as user feedback accumulates. The project is FreeDict, part of SourceForge, a well-known Internet repository of open source content.

We demonstrate a possible process of dictionary development on the example of one of the FreeDict dictionaries, a Swahili-English dictionary that we maintain and have been developing through subsequent stages of increasing complexity and machine-processability. The aim of the paper is to show that even a small bilingual lexical resource can be submitted to this project and gradually developed into a machine-processable form that can then interact with other FreeDict resources. We also present the immediate benefits of locating bilingual African dictionaries in this project.

We have found FreeDict to be a very promising project with a lot of potential, and the present paper is meant to spread the news about it, in the hope of creating an active community of linguists and lexicographers of various backgrounds, where common research subprojects can be fruitfully carried out.

1 Introduction

The FreeDict project was started by Horst Eyermann in 2000 and initially hosted bilingual dictionaries produced by concatenating (crossing) the contents of the dictionaries in the Ergane project (http://download.travlang.com/Ergane/), with Esperanto as the interlanguage. At first, the data was kept in the DICT format (Faith and Martin, 1997).

DICT (Dictionary Server Protocol) is by now a well-established TCP-based query/response protocol that allows a client to access definitions from a set of various dictionary databases. It provides data in textual form, but it also has the potential of providing MIME-encoded content. The clients can be free-standing desktop applications or they can be integrated into editors or web browsers. DICT web gateways also exist (see e.g. http://dict.org/).

The DICT format is a plain text format with an accompanying index file. The FreeDict-DICT interface initially used so-called “c5 files”.1 A c5 example of an entry from version 0.0.1 of the Swahili-English dictionary is presented below.

abiria
     passenger(s)

1 C5 is the format used, among others, by the CIA World Factbook, where the heading is at the left edge and the contents are indented by 5 spaces.

Later on, the project adopted the TEI P4 standard (Sperberg-McQueen and Burnard, 2002) as its primary format, and with the help of its second administrator, Michael Bunk, created tools for conversion from the TEI into a variety of other dictionary platforms. A simple dictionary editor was also created. The change of the primary format was a very fortunate move, thanks to which we can today recommend the project as the possible home for free African language dictionaries big and small.

We are going to base our discussion on the Swahili-English dictionary, the first FreeDict dictionary encoded according to the guidelines of the TEI P5 XML standard (TEI Consortium, 2007). The dictionary in its current form is an offshoot of a different project of ours that we decided to make available on a free license and in version 0.4 contains over 2600 headwords. We use it to demonstrate a possible process of dictionary development, from the simplest to the advanced, machine-processable form.

2 From glossaries to rich lexical databases: the possible shapes of FreeDict dictionaries

A dictionary can begin its life at FreeDict as a simple glossary, with the simplest format possible, as shown in the made-up entry below:

alasiri
afternoon

The next example entry comes from Swahili-English xFried Freedict Dictionary, version 0.0.2, compiled by Beata Wójtowicz. That dictionary contained around 1500 entries of varied quality. It was based on a dictionary extracted from Morris P. Fried’s Swahili-Kiswahili to English Translation Program, to which selected entries from the first FreeDict Swahili-English dictionary (compiled by Horst Eyermann) were added. That version also introduced information on parts of speech.

alasiri
afternoon
n

On the way to version 0.3, the entry looked as follows:

alasiri
n
afternoon (period between 3 p.m. and 5 p.m.)

All the bracketed information was then turned into separate <note> elements, in order to make the translation equivalents easily processable (see Prinsloo and de Schryver, 2002, for remarks on processability of translation equivalents). The change was performed by regex search-and-replace, roughly from \((.+)\) into <note type="hint">$1</note>,2 with a subsequent review of all the new <note> elements extracted by an XPath query.

2 This is in fact a slight simplification of what has been done, made for the purpose of clarity. Naturally, the regexes have to be adapted to the circumstances (regularity of markup, regularity of expressions in brackets, the number of such expressions per single element content, etc.). Sometimes, an XSL transformation may turn out to do a better job, thanks to the many string-handling functions of XPath 2.0.

Depending on the regularity of expressions in brackets, some additional words would be inserted into the search string, to be converted into <note> elements of the appropriate type. Initially, only @type="hint" was used, as the most generic. At the moment, there are several more specialized type values, including @type="editor", which contains editorial remarks that will not be shown to the user but will remain in the source. @type-less <note> elements are used for quick localized communication between editors and are discarded by XSLT scripts when preparing the source version for release (they are also clearly marked by the CSS stylesheet that accompanies the dictionary, so that the editors can easily spot each note when reviewing the dictionary in a browser). FreeDict advocates the use of some other types of notes: recording the last editor of the entry, the date of the latest modification, and the degree of certainty, valuable in this kind of project (where, e.g., some automated changes would set the certainty level to “low”, thereby requiring editorial approval). In the current version, 0.4, the element looks as follows.

afternoon
period between 3 p.m. and 5 p.m.

This is what we decided to keep in version 0.4, exactly for the purpose of illustrating the possible development stages of dictionaries. In the next version, the element will eventually attain the form currently (i.e., after September 2007) recommended by the TEI Guidelines for translation equivalents:

afternoon
period between 3 p.m. and 5 p.m.

The <quote> element holds the translation equivalent that can be an anchor for dictionary
reversal or concatenation. The <def> element above is not abused anymore and holds a real definition, i.e., an “explanatory equivalent”, which may become a sense-discriminating note in the reversed dictionary.

We stress that each of the XML structures presented above (some of them admittedly bordering on tag abuse) conforms to the general P5 format and can be easily processed and published. In other words, dictionary editors are not forced to conform to the final format in order to see their work being used and commented on.

The next section presents another aspect of dictionary creation, where what matters is the ease of data manipulation and filling in the predictable information for the developer.

3 Plural forms: an illustration of automated creation of entries

At the stage of development at which brackets have been eliminated from translation equivalents, a relatively simple entry might look as shown below. This is an entry as created by a developer, to be further processed by XSLT.3

adui
n
enemy
opponent
in games or sports

3 The verbosity of XML markup can be overwhelming, but many XML editors feature content completion and can save the developer a lot of typing, the more so that TEI schemas are often part of the editor package.

This entry is then processed and turned into the form presented below.

adui
maadui
n
enemy
opponent
in games or sports

If the plural form does not exist in the dictionary, the script creates an entry for it:

maadui
n
Plural of adui
enemy
opponent in games or sports

The equivalents in this kind of automatically created entry for plurals will remain within <def> elements even after we adopt the cit/quote system mentioned above in the discussion of alasiri. This is because <def> elements are not anchors for dictionary reversal, and plural entries will be skipped by the reversal tools, unless the plural/collective form has its own unique meaning, as is the case with e.g. majani, which is morphologically the plural form of jani “leaf; blade of grass”, but apart from that, it should also be glossed as “grass”, and it is the latter form that should become a headword in the reversed, English-Swahili dictionary.

Another area where the XML format and tools give excellent results is text normalization. An earlier example shows unnecessary spaces in version 0.0.2: afternoon. Handling these required only the use of an XPath function normalize-space(), which strips all the unwanted whitespace characters.4

4 That version contained more traps for machine-processing, such as bracketed parts of words; sometimes this was done in a nontrivial manner, as in the entry for adui: enemy(-ies). See Prinsloo and de Schryver (2002) for remarks on the non-friendliness of such space-saving devices.

The indexing system in this particular dictionary is based on the shape of the headword, which is easy to convert into a form acceptable for XML ID attributes (the XPath translate() and replace() functions are handy here). All
the entries are first reordered alphabetically, then the script checks for homographs and assigns them appropriate indexes (e.g. chapa-1 for the noun meaning “brand”, chapa-2 for the verb meaning “beat”) and appropriate attributes used later in the creation of superscripts (suppressed in the plain-text DICT format). Multiple senses are treated similarly: each receives its own @xml:id attribute and numbering (see the example of adui above).

We emphasize the fact that the encoding format makes it possible to reduce the developer’s workload, with each stage of the dictionary enhancement being publishable. This allows one to work on the dictionary on and off, in their spare time.

4 Visualization of underlying structure

Some Swahili words are best lemmatized as stems. Version 0.4 of our dictionary does not yet display the differences between bound stems and free forms, but we will transfer this functionality (which we use in another project) to one of the future versions. This will make it possible for us to, e.g., add hyphens to bound forms and prepend “-a ” to adjectives introduced by the “-a of relationship”, adding extra structure visible to the end user, but ignored in sorting or queries.

Disjunctively written languages can be handled similarly. Kiango (2000:36) lists the following examples from Haya, discussing the problems surrounding alphabetization of nouns in print dictionaries:

a ka yaga ‘air’
e m pambo ‘seed’
o mu twe ‘head’

The vocalic pre-prefixes should not be used for the purpose of arranging headwords, because if they are, all nouns end up under one of the three letters. Instead, class prefixes (ka, m, mu) should form the basis for alphabetization. In the XML/TEI format adopted by FreeDict, such problems are easily solved, either by using a separate element to hold the pre-prefix, or by the use of an appropriate attribute. The former solution is illustrated below.

a
ka yaga
air

The default value for the @extent attribute is “full”, so it only needs to be mentioned where the value is different.

An AfLaT reviewer rightly points out that in an electronic dictionary, alphabetization is irrelevant. Indeed, the DICT format features separate “dictionary” and “index” files, and searching is done on the index file, which addresses the relevant portions of the dictionary file. The issue of alphabetization arises, however, in two cases: when preparing a print version of the dictionary, or when using the dictionary outside of the DICT system (this is a “working view” for the maintainers that can also be used as an out-of-the-box view for users).

In connection with the first case (preparing print/PDF versions of dictionaries), it is worth pointing out that conversion from TEI XML into various publication formats is made easy thanks to the open-source XSLT conversion suite maintained by Sebastian Rahtz (http://www.tei-c.org/Tools/Stylesheets/).

As for the “out-of-the-box preview”, FreeDict dictionaries, by virtue of being marked up in XML, can be equipped with CSS stylesheets that make it possible to display the XML source in the browser, as if it were an HTML page. Here, because the user can search the page for the given form, alphabetization is not so relevant, but it can be handy, if only for aesthetic reasons. Below is a screenshot of a fragment of the CSS view of the file swa-eng.tei from the version 0.4 distribution package, opened directly in the Firefox browser.

Figure 1: A CSS preview of the source XML

The CSS adds some text (e.g. “[sg=pl]”, “(pl:”, or sense numbering) and imposes visual structure onto the source XML. As can be seen in the entry for kiungo, we give precedence to formal
properties of headwords over the semantic distinctions, but other macrostructural decisions are obviously possible as well. The figure below shows one more CSS view, demonstrating subcategorization, treatment of notes (all the bracketed strings are contents of separate XML elements), as well as the treatment of homographs.

Figure 2: Subcategorization and grammar notes (another CSS view of the source XML)

Our use of colours (here: shades of grey) is also a function of the CSS, introduced mainly with the developer in mind, as a kind of an error-checking device.

5 Other possible enhancements of the microstructure

In section 3, we have illustrated a possible method of refining dictionary structure in an automatic fashion. Thanks to the many possible variations of the format, other features may be introduced stage by stage. They include, e.g., adding the corresponding plurals (illustrated above) or marking forms where the plural is the same as the singular, as done below by the use of the @type attribute. This example also illustrates the addition of cross-references to synonyms, where eropleni is linked to the second, inanimate, sense of ndege ‘bird; airplane’.

eropleni
n
airplane
(synonym: ndege)

The element below illustrates the handling of alternative spellings of the noun afisa/ofisa “officer”. Both headwords are used in retrieval.

afisa
ofisa

The above is the source as prepared by a developer. This is then processed, the plural forms are created if they do not exist in the dictionary, and the result is as in figure 3 below.

Figure 3: Illustration of alternative spellings/plurals (CSS view)

Other examples of what can be gradually introduced into the dictionary include addition of subcategorization information (illustrated by pako in Figure 2, where “pron” is the POS and “poss” is the content of the element), addition of explicit noun-class- and agreement-marking, introduction of irregularly inflected forms, tables with inflection (linked to the appropriate stems), nested entries, and, obviously, the continuous improvement of lexicographic information (the arrangement and selection of senses, selection of headwords, an appropriate POS system).

Crucially, this system allows a developer to “publish early, publish often”, and few of the enhancements mentioned here depend on others: developers are free to choose and to extend the dictionary at their own pace.

In our case, this is a gradual move towards structures of finer granularity, suitable for reversal (into English-Swahili) and concatenation with other dictionaries (we are going to use English as the bridge language for pairing the Swahili-English dictionary with English-* dictionaries).

6 Potential for the future

We wish to stress the potential that the encoding lexical resources for “non-commercial” languages, where funding and the time that the developers may spend on the dictionary are not always guaranteed. FreeDict dictionary development can proceed in stages, one can start with a
simple format and get the dictionary published on-line practically within days. The project has all the SourceForge publishing facilities at its disposal, together with bug/patch/etc. trackers and community forums. It also has a mailing list and a wiki that can serve to document some possibly difficult aspects of dictionary creation. Thanks to the robust build system of FreeDict, creating a tarball containing a DICT-formatted dictionary and index is only a matter of issuing the “make” command with appropriate arguments, and submitting the resulting archive to the SourceForge file release system.5

5 This is something that a dictionary creator need not bother about: submitting a TEI source of the dictionary to the mailing list is enough.

FreeDict is the nexus for the following:
• XML, with its potential for creating well-structured documents,
• TEI P5, a de-facto standard taking advantage of this potential,
• the SourceForge repository as well as distribution and content-management network,
• the DICT distribution network: apart from being able to query DICT servers straight from the desktop, Firefox users can also take advantage of an add-on client that returns definitions for highlighted words on a web page,
• FreeDict tools (still under development for TEI P5) as means to manipulate dictionaries and to create, among others, the DICT format (usable directly from DICT servers and by other dictionary-providing projects, e.g., StarDict or Open Dict).6

6 The FreeDict build process provides targets for platforms other than DICT, e.g. the Evolutionary Dictionary (http://www.mrhoney.de/y/1/html/evolutio.htm) or zbedic (http://bedic.sourceforge.net/).

Additionally, lexical resources submitted to FreeDict may undergo further transformations: reversal or concatenation, which means that work put into developing a single resource may well be re-used in developing others. Considering the possible re-use of lexical resources, they are expected to be prepared with a view towards clean exposure of translation equivalents (in the cit/quote system or at least by judicious use of separators and brackets).

The project has its own distribution system, in the form of GNU/Linux packages. For example, Kęstutis Biliūnas is the packager for Debian Linux and maintains a page tracking the usage of Debian-FreeDict packages.

Apart from the above, the content published by FreeDict is guaranteed to be free.

7 The costs of developing for FreeDict

An AfLaT reviewer suggested that we provide a measure of the effort required to develop a resource for FreeDict. We hope to show here that this is very much dependent on the quality and form of the resource and on how much time the dictionary creator is willing to invest into it. Crucially, given the open-source nature of the project, even a simple, small list of near-binary pairings of equivalents can be a) quickly made useful to e.g. the readers of web pages written in the given language, and b) extended by others into a more satisfying resource.

It may be that the effort needed to create language resources in e.g. Lexique Pro, an excellent free tool by SIL International (http://www.lexiquepro.com/), is smaller, but there are differences in that Lexique Pro is a Windows-only closed-source program whose native MDF (Multi-Dictionary-Formatter) format is not as flexible as XML and therefore cannot be processed by the many tools that handle XML and TEI in particular. Consequently, the prospects for re-use of Lexique Pro dictionaries in computational linguistic applications are much smaller. To our knowledge, Lexique Pro does not make it possible for users to query words straight off web pages, which can be done thanks to dict, a Firefox add-on (http://dict.mozdev.org/). It admittedly has other advantages that make it a serious alternative.7

7 We do not discuss professional commercial dictionary writing systems such as TshwaneLex (http://tshwanedje.com/tshwanelex/) because, despite the academic discounts, they may be out of range for the average developer. It is worth mentioning that the discounted versions of TshwaneLex come with the understandable “no-commercial-use” restriction, which is in conflict with either the GNU Public License or the nearly equivalent Creative Commons BY-SA license that all SourceForge resources must be under (cf. http://www.gnu.org/licenses/gpl-faq.html).

The ideal solution would be to have an editing front-end such as Lexique Pro coupled with the openness and modifiability of the data offered by FreeDict. Indeed, there are plans for creating a converter from the new LIFT interchange standard (http://code.google.com/p/lift-standard/) that the beta versions of Lexique Pro can read
and write, to the version of the TEI format used by FreeDict. That would undoubtedly enhance the attractiveness of the project.

To sum up, developing for FreeDict minimally requires some basic knowledge of XML. Free XML editors exist (e.g. XML Copy Editor, http://sourceforge.net/projects/xml-copy-editor/) that can make editing easier by autocompleting the elements (inserting closing tags, suggesting elements and attributes that are allowed at the given place in the structure) and signalling encoding mistakes.

8 African Languages and FreeDict

A reviewer remarked that the link to African language technology in this paper appears only to be present in the examples. Indeed, FreeDict is not a project that focuses on African languages; it is a project where African language resources can be hosted and quickly become useful to users. Given an opportunity, we will encourage researchers dealing with other languages to join the project, which will hopefully result in the creation of more cross-language resources, especially given that the encoding format is not tied to any particular language and is able to easily accommodate features characteristic of practically any language.

During the session on “African Languages in Advance” at the 2008 Poznań Linguistic Meeting, where we presented our Swahili-Polish project and also mentioned FreeDict as the place where we wanted to donate parts of our test Swahili-English dictionary that would otherwise remain on our disks, we talked to an organizer of that session, Karien Brits, about the advantages that this project can have for some of her colleagues working on South-African languages. This is what encouraged us to move on with the FreeDict Swahili-English dictionary and we hope that others will also find this project, and the possibilities that it offers, attractive.

The FreeDict project has recently awoken after a period of lower activity, and at the moment, every week brings something new. Currently, as far as African languages are concerned, apart from Swahili↔English dictionaries, the project hosts very basic Afrikaans↔English dictionaries, and an Afrikaans-German dictionary (all of them in need of a maintainer).8 We hope that FreeDict will become home to many African language resources, and, thanks to the possibility of dictionary concatenation, facilitate also the creation of many African↔European dictionary pairs as well as all-African bilingual dictionaries.

8 Dictionary sources can be accessed from the “download” link on the project page: http://sourceforge.net/projects/freedict/. They can be queried online at e.g. http://dict.org/, by setting the appropriate language pair in the database.

Acknowledgements

We are grateful to three anonymous AfLaT-2009 reviewers for their helpful comments on the previous version of this paper, and to Michael Bunk for confirming that our version of the history of the project is close to reality. We also wish to thank Janusz S. Bień for turning our attention to the FreeDict project a few years ago.

References

DICT: http://www.dict.org/

Faith, Rik and Martin, Brett. 1997. A Dictionary Server Protocol. Request for Comments: 2229 (RFC #2229). Network Working Group. Available from ftp://ftp.isi.edu/in-notes/rfc2229.txt (last accessed on January 19, 2009).

FreeDict: http://www.freedict.org/, http://freedict.sourceforge.net/

Kiango, John G. 2000. Bantu lexicography: a critical survey of the principles and process of constructing dictionary entries. Tokyo: Institute for the Study of Languages and Cultures of Asia and Africa, Tokyo University of Foreign Studies.

Prinsloo, Danie J. and Gilles-Maurice de Schryver. 2002. Reversing an African-language lexicon: the Northern Sotho Terminology and Orthography No. 4 as a case in point. South African Journal of African Languages, 22/2: 161–185.

Sperberg-McQueen, Michael and Lou Burnard (eds). 2002. The Text Encoding Initiative Guidelines (P4). Text Encoding Initiative, Oxford.

TEI Consortium, eds. 2007. TEI P5: Guidelines for Electronic Text Encoding and Interchange. Version 1.2.0. Last updated on October 31st 2008. TEI Consortium. Available from http://www.tei-c.org/Guidelines/P5/ (last accessed on January 19, 2009).
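The processing steps described in Sections 2 and 3 of the paper above (turning bracketed remarks into notes, normalizing whitespace, numbering homographs, and creating missing plural entries) can be sketched in a few lines. This is an illustration only: the paper reports that the project does this work with regex search-and-replace and XSLT/XPath, whereas the sketch below uses plain Python, a non-greedy variant of the quoted regex, and hypothetical function names and entry dictionaries that are not part of FreeDict.

```python
import re
from collections import Counter, defaultdict

BRACKETED = re.compile(r"\((.+?)\)")


def normalize_space(text):
    """Approximate XPath normalize-space(): trim and collapse whitespace."""
    return " ".join(text.split())


def to_note_markup(equivalent):
    """Rewrite '(...)' as a hint note, in the spirit of the regex in Section 2."""
    return BRACKETED.sub(r'<note type="hint">\1</note>', equivalent)


def split_hints(equivalent):
    """Separate the bare equivalent from its bracketed remarks."""
    hints = BRACKETED.findall(equivalent)
    bare = normalize_space(BRACKETED.sub("", equivalent))
    return bare, hints


def assign_ids(headwords):
    """Sort headwords and number homographs (chapa-1, chapa-2, ...)."""
    counts = Counter(headwords)
    seen = defaultdict(int)
    ids = []
    for headword in sorted(headwords):
        if counts[headword] > 1:
            seen[headword] += 1
            ids.append(f"{headword}-{seen[headword]}")
        else:
            ids.append(headword)
    return ids


def missing_plural_entries(entries):
    """Build stub entries for plural forms that lack an entry of their own."""
    stubs = {}
    for headword, data in entries.items():
        plural = data.get("plural")
        if plural and plural != headword and plural not in entries:
            stubs[plural] = {
                "pos": data.get("pos"),
                "note": f"Plural of {headword}",
                "senses": list(data.get("senses", [])),
            }
    return stubs


if __name__ == "__main__":
    raw = "afternoon (period between 3 p.m. and 5 p.m.)"
    print(to_note_markup(raw))
    print(split_hints(raw))
    print(assign_ids(["chapa", "adui", "chapa"]))
    print(missing_plural_entries(
        {"adui": {"pos": "n", "plural": "maadui", "senses": ["enemy", "opponent"]}}
    ))
```

As the paper's footnote 2 notes, real data is rarely this regular, so the output of such a conversion would still need the editorial review step that the authors describe.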

Exploiting Cross-linguistic Similarities in Zulu and Xhosa Computational Morphology

Laurette Pretorius
School of Computing
University of South Africa &
Meraka Institute, CSIR
Pretoria, South Africa
[email protected]

Sonja Bosch
Department of African Languages
University of South Africa
Pretoria, South Africa
[email protected]

Proceedings of the EACL 2009 Workshop on Language Technologies for African Languages – AfLaT 2009, pages 96–103, Athens, Greece, 31 March 2009. © 2009 Association for Computational Linguistics

Abstract

This paper investigates the possibilities that cross-linguistic similarities and dissimilarities between related languages offer in terms of bootstrapping a morphological analyser. In this case an existing Zulu morphological analyser prototype (ZulMorph) serves as basis for a Xhosa analyser. The investigation is structured around the morphotactics and the morphophonological alternations of the languages involved. Special attention is given to the so-called “open” class, which represents the word root lexicons for specifically nouns and verbs. The acquisition and coverage of these lexicons prove to be crucial for the success of the analysers under development.

The bootstrapped morphological analyser is applied to parallel test corpora and the results are discussed. A variety of cross-linguistic effects is illustrated with examples from the corpora. It is found that bootstrapping morphological analysers for languages that exhibit significant structural and lexical similarities may be fruitfully exploited for developing analysers for lesser-resourced languages.

1 Introduction

Zulu and Xhosa belong to the Nguni languages, a group of languages from the South-eastern Bantu zone and, as two of the eleven official languages of South Africa, are spoken by approximately 9 and 8 million mother-tongue speakers, respectively. In terms of natural language processing, particularly computational morphology, the Bantu languages including Zulu and Xhosa certainly belong to the lesser-studied languages of the world.

One of the few Bantu languages for which computational morphological analysers have been fully developed so far is Swahili (Hurskainen, 1992; De Pauw and De Schryver, 2008). A computational morphological analyser prototype for Zulu (ZulMorph) is in an advanced stage of development, the results of which have already been used in other applications. Preliminary experiments and results towards obtaining morphological analysers for Xhosa, Swati and Ndebele by bootstrapping ZulMorph were particularly encouraging (Bosch et al., 2008). This bootstrapping process may be briefly summarised as a sequence of steps in which the baseline analyser, ZulMorph, is applied to the new language (in this case Xhosa) and then systematically extended to include the morphology of the other language. The extensions concern the word root lexicon, followed by the grammatical morpheme lexicons and finally by the appropriate morphophonological rules. The guiding principle in this process is as follows: Use the Zulu morphological structure wherever applicable and only extend the analyser to accommodate differences between the source language (Zulu) and the target language (in this case Xhosa). So far the question as to whether the bootstrapped analyser, extended to include Xhosa morphology, could also improve the coverage of the Zulu analyser was not specifically addressed in Bosch et al. (2008).

Cross-linguistic similarity and its exploitation is a rather wide concept. In its broadest sense it aims at investigating and developing resources and technologies that can be compared and linked, used and analysed with common approaches, and that contain linguistic information for the same or comparable phenomena. In this paper the focus is on the morphological similarities and dissimilarities between Zulu and Xhosa and how these cross-linguistic similarities and dissimilarities inform the bootstrapping of a morphological analyser for Zulu and Xhosa. In particular, issues such as open versus closed classes, and language specific morphotactics and alternation rules are discussed. Special attention
is given to the word root lexicons. In addition, the procedure for bootstrapping is broadened to include a guesser variant of the morphological analyser.

The structure of the paper is as follows: Section 2 gives a general overview of the morphological structure of the languages concerned. The modelling and implementation approach is also discussed. This is followed in sections 3 and 4 by a systematic exposition of the cross-linguistic dissimilarities pertaining to morphotactics and morphophonological alternations. Section 5 focuses on the so-called “open” class, which represents the word root lexicons for specifically nouns and verbs. The acquisition and coverage of these lexicons prove to be crucial for the success of the analysers under development. Section 6 addresses the use of the guesser variant of the morphological analyser as well as the application of the bootstrapped morphological analyser to parallel test corpora. A variety of cross-linguistic effects is illustrated with examples from the corpora. This provides novel insights into the investigation and exploitation of cross-linguistic similarities and their significance for bootstrapping purposes. Section 7 concerns future work and a conclusion.

2 General overview

2.1 Morphological structure

Bantu languages are characterised by a rich agglutinating morphological structure, based on two principles, namely the nominal classification system, and the concordial agreement system.

According to the nominal classification system, nouns are categorised by prefixal morphemes. These noun prefixes have, for ease of analysis, been assigned numbers by scholars who have worked within the field of Bantu linguistics. In Zulu a noun such as umuntu 'person' for instance, consists of a noun prefix umu- followed by the noun stem -ntu and is classified as a class 1 noun, while the noun isitha 'rival' consists of a noun prefix isi- and the noun stem -tha and is classified as a class 7 noun. Noun prefixes generally indicate number, with the uneven class numbers designating singular and the corresponding even class numbers designating plural. The plural forms of the above examples would therefore respectively be the class 2 noun abantu 'persons' and the class 8 noun izitha 'rivals'. We follow Meinhof's (1932:48) numbering system which distinguishes between 23 noun prefixes altogether in the various Bantu languages.

The concordial agreement system is significant in the Bantu languages because it forms the backbone of the whole sentence structure. Concordial agreement is brought about by the various noun classes in the sense that their prefixes link the noun to other words in the sentence. This linking is manifested by a concordial morpheme that is derived from the noun prefix, and usually bears a close resemblance to the noun prefix, as illustrated in the following example:

    Izitsha lezi ezine zephukile
    ‘These four plates are broken’

This concordial agreement system governs grammatical correlation in verbs, adjectives, possessives, pronouns, and so forth. Bantu languages are predominantly agglutinating and polymorphematic in nature, with affixes attached to the root or core of the word.

The morphological make-up of the verb is considerably more complex than that of the noun. A number of slots, both preceding and following the verb root, may contain numerous morphemes with functions such as derivations, inflection for tense-aspect and marking of nominal arguments. Examples are cross-reference of the subject and object by means of class- (or person-/number-)specific object markers, locative affixes, morphemes distinguishing verb forms in clause-final and non-final position, negation etc.

Despite the complexities of these domains, they are comparable across language boundaries, specifically Nguni language boundaries, with a degree of formal similarity that lends itself to exploitation for bootstrapping purposes.

2.2 Modelling and Implementation

In the modelling and implementation of the morphological structure a finite-state approach is followed. The suitability of finite-state approaches to computational morphology is well known and has resulted in numerous software toolkits and development environments for this purpose (cf. Koskenniemi, 1997 and Karttunen, 2001). Yli-Jyrä (2005) discusses the importance of a finite-state morphology toolkit for lesser-studied languages. He maintains that “[a]lthough some lexicons and morphological grammars can be learned automatically from texts ... fully automatic or unsupervised methods are not sufficient. This is due to two reasons. First, the amount of freely available corpora is limited for many of the less studied languages. Second, many of the less studied languages have rich morphologies that are difficult to learn accurately with unsupervised methods”.
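As a concrete illustration of the nominal classification system described in Section 2.1, the following is a minimal Python sketch, not part of ZulMorph. It covers only the four forms cited in the text (umuntu/abantu and isitha/izitha); the prefixes aba- and izi- are read off those plural forms, and the names `NOUN_PREFIXES`, `plural_class` and `pluralise` are hypothetical. It assumes the general tendency stated above (uneven class = singular, next even class = plural), which does not hold for every class.

```python
# Hypothetical mini-table: only the class prefixes attested in Section 2.1.
NOUN_PREFIXES = {1: "umu", 2: "aba", 7: "isi", 8: "izi"}


def plural_class(singular_class: int) -> int:
    """Return the plural class paired with an uneven (singular) class."""
    if singular_class % 2 == 0:
        raise ValueError("expected an uneven (singular) class number")
    return singular_class + 1


def pluralise(noun: str, singular_class: int) -> str:
    """Swap the singular class prefix of a noun for its plural counterpart."""
    prefix = NOUN_PREFIXES[singular_class]
    if not noun.startswith(prefix):
        raise ValueError(f"{noun!r} does not carry the class {singular_class} prefix")
    stem = noun[len(prefix):]  # e.g. 'ntu' for umuntu, 'tha' for isitha
    return NOUN_PREFIXES[plural_class(singular_class)] + stem


print(pluralise("umuntu", 1))  # abantu
print(pluralise("isitha", 7))  # izitha
```

A real analyser encodes this pairing in the lexc continuation classes rather than by string replacement; the sketch only makes the prefix/stem split explicit.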

The Xerox finite-state tools (Beesley and Karttunen, 2003), as one of the preferred toolkits for modelling and implementing natural language morphology, are used in this work.

The morphological challenges in computational morphological analysis comprise the modelling of two general linguistic components, namely morphotactics (word formation rules) as well as morphophonological alternations.

Ideally, the morphotactics component should include all and only word roots in the language, all and only the affixes for all parts-of-speech (word categories) as well as a complete description of the valid combinations and orders of these morphemes for forming all and only the words of the language concerned. Moreover, the morphophonological alternation rules should constitute all known sound changes that occur at morpheme boundaries. The combination of these two components constitutes an accurate model of the morphology of the language(s) under consideration.

The Xerox lexicon compiler, lexc, is well-suited to capturing the morphotactics of Zulu. A lexc script, consisting of cascades of so-called continuation classes (of morpheme lexicons) representing the (concatenative) morpheme sequencing, is compiled into a finite-state network. The Xerox regular expression language, xfst, provides an extended regular expression calculus with sophisticated Replace Rules for describing the morphophonological alternation rules of Zulu. The xfst script is also compiled into a finite-state network. These networks are finally combined by means of the operation of composition into a so-called Lexical Transducer that constitutes the morphological analyser and contains all the morphological information of Zulu, including derivation, inflection, alternation and compounding. Pretorius and Bosch (2002) address the suitability of this approach to Zulu morphology and illustrate it by means of examples of lexc and xfst scripts for modelling the Zulu noun.

A detailed exposition of the design and implementation of ZulMorph may be found in Pretorius and Bosch (2003). In addition to considering both the accurate modelling of the morphotactics and the morphophonological alternation rules, they also address implementation and other issues that need to be resolved in order to produce a useful software artefact for automated morphological analysis. Issues of implementation include a justification for the finite-state approach followed, designing for accuracy and correctness and decisions regarding the analyser's interface with its environment and its usage.

Particular attention is paid to the handling of exceptions; the modelling of separated dependencies by means of so-called flag-diacritics; the specification of lexical forms (analyses) in terms of morphological granularity and feature information; the choice of an associated and appropriate morphological tag set and also the positioning of these tags in relation to the morphemes they are associated with in the morphological analyses (lexical forms) that are rendered.

The components of ZulMorph, including its scope in terms of word categories and their morphological structure, are summarised in Table 1 while its lexical coverage as reflected by the number of different noun stems, verb roots etc. is discussed in section 5.

The bootstrapping of ZulMorph to provide for Xhosa as well requires a careful investigation of the cross-linguistic similarities and dissimilarities and how they are best modelled and implemented. This aspect will be discussed in more detail in the following section.

Morphotactics (lexc):
    Affixes for all parts-of-speech (e.g. subject & object concords, noun class prefixes, verb extensions etc.)
    Word roots (e.g. nouns, verbs, relatives, ideophones)
    Rules for legal combinations and orders of morphemes (e.g. u-ya-ngi-thand-a and not *ya-u-a-thand-ngi)
Morphophonological alternations (xfst):
    Rules that determine the form of each morpheme (e.g. ku-lob-w-a > ku-lotsh-w-a, u-mu-lomo > u-m-lomo)

Table 1: Zulu Morphological Analyser Components

3 Morphotactics

In word formation we distinguish between so-called closed and open classes. The open class accepts the addition of new items by means of processes such as borrowing, coining, compounding and derivation. In the context of this paper, the open class represents word roots including verb roots and noun stems. The closed class represents affixes that model the fixed morphological structure of words, as well as items such as conjunctions, pronouns etc. Typically no new items can be added to the closed class (Fromkin et al., 2003:74).

Since our point of departure is ZulMorph, we focus on Xhosa affixes that differ from their Zulu

counterparts. A few examples are given in Table 2.

Noun class prefixes
    Class 1 and 3
        Zulu: um(u)-; full form umu- with monosyllabic noun stems, shortened form with polysyllabic noun stems: umu-ntu, um-fana
        Xhosa: um- with all noun stems: um-ntu, um-fana
    Class 2a
        Zulu: o-: o-baba
        Xhosa: oo-: oo-bawo
    Class 9
        Zulu: in- with all noun stems: in-nyama
        Xhosa: i- with noun stems beginning with h, i, m, n, ny: i-hambo
    Class 10
        Zulu: izin- with monosyllabic and polysyllabic stems: izin-ja; izin-dlebe
        Xhosa: iin- with polysyllabic stems: iin-dlebe

Contracted subject concords (future tense). Examples:
    1ps: Zulu ngo-; Xhosa ndo-
    2ps, Class 1 & 3: Zulu wo-; Xhosa uyo-
    Class 4 & 9: Zulu yo-; Xhosa iyo-

Object concords
    1ps: Zulu ngi-; Xhosa ndi-

Absolute pronouns
    1ps: Zulu mina; Xhosa mna
    Class 15: Zulu khona; Xhosa kona

Demonstrative pronouns: Three positional types of the demonstrative pronouns are listed separately for each language. Examples:
    Class 1
        Zulu: Pos. 1 lo; Pos. 2 lowo; Pos. 3 lowaya
        Xhosa: Pos. 1 lo; Pos. 2 lowo/loo; Pos. 3 lowa
    Class 5
        Zulu: Pos. 1 leli; Pos. 2 lelo; Pos. 3 leliya
        Xhosa: Pos. 1 eli; Pos. 2 elo; Pos. 3 eliya

Adjective basic prefixes
    1ps: Zulu ngim(u-); Xhosa nim-
    2ps: Zulu umu-; Xhosa um-
    Class 1 & 3: Zulu mu-; Xhosa m-
    Class 8: Zulu zin-; Xhosa zi-

Locative demonstrative copulatives: Three positional types of the so-called locative demonstrative copulatives differ considerably for Zulu and Xhosa and are therefore listed separately for each language. Examples:
    Class 1
        Zulu: Pos. 1 nangu; Pos. 2 nango; Pos. 3 nanguya
        Xhosa: Pos. 1 nanku; Pos. 2 nanko; Pos. 3 nankuya
    Class 5
        Zulu: Pos. 1 nanti; Pos. 2 nanto; Pos. 3 nantiya
        Xhosa: Pos. 1 nali; Pos. 2 nalo; Pos. 3 naliya

Copulatives: Formation of copulatives derived from Xhosa nouns differs considerably from Zulu. This construction is class dependent in Xhosa and is modelled differently to its Zulu counterpart. Examples:
    Zulu:
        yi- combines with noun prefixes i-: yi-indoda > yindoda
        ngu- combines with noun prefixes u-, o-, a-: ngu-umuntu > ngumuntu, ngu-obaba > ngobaba, ngu-amakati > ngamakati
        wu- combines with noun prefixes u-, o-: wu-muntu > wumuntu, wu-obaba > wobaba
    Xhosa:
        ngu- combines with classes 1, 1a, 2, 2a, 3 & 6, e.g. ngu-umntu > ngumntu
        yi- combines with classes 4 imi- and 9 in-, e.g. yi-imithi > yimithi
        li- combines with class 5 i(li)-: li-ihashe > lihashe
        si- combines with class 7 isi-: si-isitya > sisitya etc.

Table 2. Examples of variations in Zulu and Xhosa ‘closed’ morpheme information

Certain areas in the Xhosa grammar need to be modelled independently and then built into the analyser, for instance the formation of the so-called temporal form that does not occur in Zulu. The temporal form is an indication of when an action takes place or when a process is carried out, and has a present or past tense form (Louw et al., 1984:163). The simple form consists of a subject concord plus -a- followed by the verb stem in the infinitive, the preprefix of which has been elided, for example si-a-uku-buya > sakubuya ‘when we return’. In terms of the word formation rules this means that an additional Xhosa specific morpheme lexicon (continuation class) needs to be included. To facilitate accurate modelling appropriate constraints also need to be formulated.

The bootstrapping process is iterative and new information regarding dissimilar morphological constructions is incorporated systematically in

the morphotactics component. Similarly, rules are adapted in a systematic manner. The process also inherently relies on similarities between the languages, and therefore the challenge is to model the dissimilarities accurately. The carefully conceptualised and appropriately structured (lexc) continuation classes embodying the Zulu morphotactics provide a suitable framework for including all the closed class dissimilarities discussed above.

4 Morphophonological alternations

Differences in morphophonological alternations between Zulu and Xhosa are exemplified in Table 3. Some occur in noun class prefixes of class 10 and associated constructions, such as prefixing of adverbial morphemes (na-, nga-, etc.). Others are found in instances of palatalisation, “a sound change whereby a bilabial sound in passive formation, locativisation and diminutive formation is replaced by a palatal sound” (Poulos and Msimang, 1998:531).

Class 10 class prefix
    Zulu: izin- occurs before monosyllabic as well as polysyllabic stems, e.g. izinja, izindlebe
    Xhosa: izin- changes to iin- before polysyllabic stems, e.g. izinja, iindlebe

Adverb prefix
    Zulu: na + i > ne, e.g. nezindlebe (na-izin-ndlebe)
    Xhosa: na + ii > nee, e.g. neendlebe (na-iin-ndlebe)

Palatalisation with passive, diminutive & locative formation
    b
        Zulu: b > tsh: -hlab-w-a > -hlatsh-w-a, intaba-ana > intatsh-ana, indaba > entdatsheni
        Xhosa: b > ty: -hlab-w-a > -hlaty-w-a, intaba-ana > intaty-ana, ihlobo > ehlotyeni
    ph
        Zulu: ph > sh: -boph-w-a > -bosh-w-a, iphaphu-ana > iphash-ana, iphaphu > ephasheni
        Xhosa: ph > tsh: -boph-w-a > -botsh-w-a, iphaphu-ana > iphatsh-ana, usapho > elusatsheni

Table 3. Examples of variations in Zulu and Xhosa morphophonology
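The language-specific palatalisation rows of Table 3 can be sketched as a small rule table. This is a simplified Python illustration, not the xfst Replace Rules of the actual analyser: it handles only the two root-final bilabials listed in Table 3 and only passive formation with -w-, and the names `PALATALISATION` and `passivise` are hypothetical.

```python
# Root-final bilabial -> palatal replacement in the passive, per Table 3.
PALATALISATION = {
    "zulu": {"b": "tsh", "ph": "sh"},
    "xhosa": {"b": "ty", "ph": "tsh"},
}


def passivise(root: str, language: str) -> str:
    """Return the segmented passive stem -root-w-a for the given language."""
    rules = PALATALISATION[language]
    # Check longer spellings first so that digraphs such as 'ph' are tried
    # before single letters.
    for bilabial in sorted(rules, key=len, reverse=True):
        if root.endswith(bilabial):
            root = root[: -len(bilabial)] + rules[bilabial]
            break
    return f"-{root}-w-a"


for lang in ("zulu", "xhosa"):
    print(lang, passivise("hlab", lang), passivise("boph", lang))
# zulu -hlatsh-w-a -bosh-w-a
# xhosa -hlaty-w-a -botsh-w-a
```

Keying the rules by language mirrors the requirement that a Xhosa-specific change must never fire on Zulu material and vice versa; in a finite-state setting the same effect would presumably be obtained by restricting rule contexts rather than by a dictionary lookup.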

As before, the Zulu alternations are assumed to apply to Xhosa unless otherwise modelled. Regarding language-specific alternations special care is taken to ensure that the rules fire only in the desired contexts and order. For example, Xhosa-specific sound changes should not fire between Zulu-specific morphemes, and vice versa. This applies, for instance, to the vowel combination ii, which does not occur in Zulu. While the general rule ii > i holds for Zulu, the vowel combination ii needs to be preserved in Xhosa.

5 The word root lexicons

Compiling sufficiently extensive and complete word root lexicons (i.e. populating the “open” word classes) is a major challenge, particularly for lesser-resourced languages (Yli-Jyrä, 2005:2). A pragmatic approach of harvesting roots from all readily available sources is followed. The Zulu lexicon is based on an extensive word list dating back to the mid 1950s (cf. Doke and Vilakazi, 1964), but significant improvements and additions are regularly made. At present the Zulu word roots include noun stems with class information (15 759), verb roots (7 567), relative stems (406), adjective stems (48), ideophones (1 360) and conjunctions (176). Noun stems with class information (4 959) and verb roots (5 984) for the Xhosa lexicon were extracted from various recent prototype paper dictionaries whereas relative stems (27), adjective stems (17), ideophones (30) and conjunctions (28) were only included as representative samples at this stage.

The most obvious difference between the two word root lexicons is the sparse coverage of nouns for Xhosa. A typical shortcoming in the current Xhosa lexicon is limited class information for noun stems.

A first observation concerns shared noun stems (mainly loan words) with different class information, typically class 5/6 for Zulu versus class 9/10 for Xhosa, for example

    ‘box’ -bhokisi (Xhosa 9/10; Zulu 5/6)
    ‘duster’ -dasta (Xhosa 9/10; Zulu 5/6)
    ‘pinafore’ -fasikoti (Xhosa 9/10; Zulu 5/6).

It should be noted that although a Xhosa noun stem may be identical to its Zulu counterpart, analysis is not possible if the class prefix differs from the specified Zulu class prefix + noun stem combination in the morphotactics component of the analyser.

A second observation is identical noun stems with correct class information, valid for both languages, but so far only appearing in the Xhosa lexicon, for example

    ‘number’ -namba (Xhosa and Zulu 9/10)
    ‘dice’ -dayisi (Xhosa and Zulu 5/6).
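The two observations above amount to comparing (stem, class) pairs across the lexicons. The following Python sketch shows one way such a comparison could be expressed; the dictionaries contain only the five stems cited in the text, and the names `ZULU_NOUNS`, `XHOSA_NOUNS` and `compare_stem` are hypothetical, not part of the analysers.

```python
# Hypothetical lexicon excerpts: noun stem -> (singular class, plural class).
ZULU_NOUNS = {"bhokisi": (5, 6), "dasta": (5, 6), "fasikoti": (5, 6)}
XHOSA_NOUNS = {
    "bhokisi": (9, 10),
    "dasta": (9, 10),
    "fasikoti": (9, 10),
    "namba": (9, 10),
    "dayisi": (5, 6),
}


def compare_stem(stem: str) -> str:
    """Classify a noun stem by how the two lexicons list it."""
    zulu, xhosa = ZULU_NOUNS.get(stem), XHOSA_NOUNS.get(stem)
    if zulu and xhosa:
        return "shared" if zulu == xhosa else "class mismatch"
    if xhosa:
        return "xhosa only"
    if zulu:
        return "zulu only"
    return "unknown"


print(compare_stem("bhokisi"))  # class mismatch
print(compare_stem("namba"))    # xhosa only
```

A "class mismatch" stem cannot be borrowed as-is, since the prefix it combines with differs, whereas an "xhosa only" stem with class information that is valid for both languages is a candidate for sharing.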

This phenomenon occurs mainly with borrowed nouns that are more prevalent in the Xhosa lexicon than in the more outdated Zulu lexicon.

A closer look at the contents of the lexicons reveals that the two languages have the following in common: 1027 noun stems with corresponding class information, 1722 verb roots, 20 relative stems, 11 adjective stems, 10 ideophones and 9 conjunctions.

6 A computational approach to cross-linguistic similarity

This section discusses the extension of the bootstrapping procedure of the morphological analyser to include the use of the guesser variant of the morphological analyser. In addition the application of the bootstrapped morphological analyser to parallel test corpora is addressed. A variety of cross-linguistic effects is illustrated with examples from the corpora.

Even in languages where extensive word root lexicons are available, new word roots may occur from time to time. The Xerox toolkit makes provision for a guesser variant of the morphological analyser that uses typical word root patterns for identifying potential new word roots (Beesley and Karttunen, 2003:444). By exploiting the morphotactics and morphophonological alternations of the analyser prototype, the guesser is able to analyse morphologically valid words of which the roots match the specified pattern. Therefore, in cases where both the Zulu and Xhosa word root lexicons do not contain a root, the guesser may facilitate the bootstrapping process.

The extended bootstrapping procedure is schematically represented in Figure 1.

Figure 1. Bootstrapping procedure

Since the available Zulu word list represents a rather outdated vocabulary, it is to be expected that the coverage of word roots/stems from a recent corpus of running Zulu text may be unsatisfactory, due to the dynamic nature of language. For example the word list contains no entry of the loan word utoliki ‘interpreter’ since ‘interpreter’ is rendered only as i(li)humusha ‘translator’, the traditional term derived from the verb stem –humusha ‘to translate, interpret’. Provision therefore needs to be made for the constant inclusion of new roots/stems, be they newly coined, compounds or as yet unlisted foreign roots/stems.

Updating and refining the lexicon requires the availability of current and contemporary language resources in the form of text corpora as well as human intervention in the form of expert lexicographers or linguists to determine the eligibility of such words.

The language resources chosen to illustrate this point are parallel corpora in the form of the South African Constitution (The Constitution, (sa)). The reason for the choice of these corpora is that they are easily accessible on-line, and it is assumed that the nature of the contents ensures accurate translations.

The results of the application of the bootstrapped morphological analyser to this corpus are as follows:

Zulu Statistics
    Corpus size: 7057 types
    Analysed: 5748 types (81.45%)
    Failures: 1309 types (18.55%)
    Failures analysed by guesser: 1239 types
    Failures not analysed by guesser: 70 types

Xhosa Statistics
    Corpus size: 7423 types
    Analysed: 5380 types (72.48%)
    Failures: 2043 types (27.52%)
    Failures analysed by guesser: 1772 types
    Failures not analysed by guesser: 271 types

The output of the combined morphological analyser enables a detailed investigation into cross-linguistic features pertaining to the morphology of Zulu and Xhosa. The outcome of this investigation is illustrated by means of typical examples from the corpora. This provides novel insights into the investigation and exploitation of

cross-linguistic similarities and their significance for bootstrapping purposes, as shown in Figure 1.

Notational conventions include [Xh] for Xhosa specific morphemes; numbers indicate noun class information, e.g. [NPrePre9] tags the noun preprefix of a class 9 noun while [RelConc8] tags the relative concord of a class 8 noun.

Examples from the Zulu corpus:

The analysis of the Zulu word ifomu ‘form’ uses the Xhosa noun stem –fomu (9/10) in the Xhosa lexicon in the absence of the Zulu stem:

    ifomu
    i[NPrePre9]fomu[Xh][NStem]

The analysis of the Zulu word ukutolikwa ‘to interpret’ uses the Xhosa verb root -tolik- in the Xhosa lexicon:

    ukutolikwa
    u[NPrePre15]ku[BPre15]tolik[Xh][VRoot]w[PassExt]a[VerbTerm]

Examples from the Xhosa corpus:

The analysis of the Xhosa words bephondo ‘of the province’ and esikhundleni ‘in the office’ use the Zulu noun stems –phondo (5/6) and -khundleni (7/8) respectively in the Zulu lexicon:

    bephondo
    ba[PossConc14]i[NPrePre5]li[BPre5]phondo[NStem]

    bephondo
    ba[PossConc2]i[NPrePre5]li[BPre5]phondo[NStem]

    esikhundleni
    e[LocPre]i[NPrePre7]si[BPre7]khundla[NStem]ini[LocSuf]

The analysis of the Xhosa words ekukhethweni ‘in the election’ and esihlonyelweyo ‘amended’ use the Zulu verb roots –kheth- and –hlom- respectively in the Xhosa lexicon:

    ekukhethweni
    e[LocPre]u[NPrePre15]ku[BPre15]kheth[VRoot]w[PassExt]a[VerbTerm]ini[LocSuf]

    esihlonyelweyo
    esi[RelConc7]hlom[VRoot]el[ApplExt]w[PassExt]e[VerbTermPerf]yo[RelSuf]

Ideophones used from the Zulu lexicon are:

    ga[Ideoph]
    qho[Ideoph]
    sa[Ideoph]
    tu[Ideoph]
    ya[Ideoph]

Relative stems used from the Zulu lexicon are:

    mandla[RelStem]
    mdaka[RelStem]
    njalo[RelStem]
    mcimbi[RelStem]

Conjunctions used from the Zulu lexicon are:

    futhi[Conj]
    ukuthi[Conj]

Examples of the guesser output from the Zulu corpus:

The compound noun -shayamthetho (7/8) ‘legislature’ is not listed in the Zulu lexicon, but was guessed correctly:

    isishayamthetho
    i[NPrePre7]si[BPre7]shayamthetho-Guess[NStem]

The following are two examples of borrowed nouns (amabhajethi ‘budgets’ and amakhemikali ‘chemicals’) not in the Zulu lexicon, but guessed correctly:

    amabhajethi
    a[NPrePre6]ma[BPre6]bhajethi-Guess[NStem]

    amakhemikali
    a[NPrePre6]ma[BPre6]khemikali-Guess[NStem]

The borrowed verb root -rejest- ‘register’ is not listed in the Zulu lexicon, but was guessed correctly:

    ezirejestiwe
    ezi[RelConc8]rejest-Guess[VRoot]iw[PassExt]e[VerbTermPerf]
    ezi[RelConc10]rejest-Guess[VRoot]iw[PassExt]e[VerbTermPerf]

The relatively small number of failures that are not analysed by the guesser and for which no guessed verb roots or noun stems are offered, simply do not match the word root patterns as specified for Zulu and Xhosa in the analyser prototype, namely

    [C (C (C)) V]+ C (C (C))

for verb roots and

    [C (C (C)) V]+ C (C (C)) V

for noun stems. The majority of such failures is caused by spelling errors and foreign words in the test corpus.

7 Conclusion and Future Work

In this paper we focused on two aspects of cross-linguistic similarity between Zulu and Xhosa, namely the morphological structure (morphotactics and alternation rules) and the word root lexicons.

Regarding the morphological structure only differences between Zulu and Xhosa were added.

Therefore, Zulu informed Xhosa in the sense that the systematically developed grammar for ZulMorph was directly available for the Xhosa analyser development, which significantly reduced the development time for the Xhosa prototype compared to that for ZulMorph.

Special attention was also given to the so-called “open” class, which represents the word root lexicons for specifically nouns and verbs. The acquisition and coverage of these lexicons proved to be crucial for the success of the analysers under development. Since we were fortunate in having access to word root lexicons for both Zulu and Xhosa we included what was available in such a way that word roots could be shared between the languages. Here, although to a lesser extent, Xhosa also informed Zulu by providing a current (more up to date) Xhosa lexicon. In addition, the guesser variant was employed in identifying possible new roots in the test corpora, both for Zulu and for Xhosa.

In general it is concluded that bootstrapping morphological analysers for languages that exhibit significant structural and lexical similarities may be fruitfully exploited for developing analysers for lesser-resourced languages.

Future work includes the application of the approach followed in this work to the other Nguni languages, namely Swati and Ndebele (Southern and Zimbabwe); the application to larger corpora, and the subsequent construction of stand-alone versions. Finally, the combined analyser could also be used for (corpus-based) quantitative studies in cross-linguistic similarity.

Acknowledgements

This material is based upon work supported by the South African National Research Foundation under grant number 2053403. Any opinion, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Research Foundation.

References

Beesley, K.R. and Karttunen, L. 2003. Finite State Morphology. CSLI Publications, Stanford, CA.

Bosch, S., Pretorius, L., Podile, K. and Fleisch, A. 2008. Experimental fast-tracking of morphological analysers for Nguni languages. Proceedings of the 6th International Conference on Language Resources and Evaluation, Marrakech, Morocco. ISBN 2-9517408-4-0.

De Pauw, G. and de Schryver, G-M. 2008. Improving the computational morphological analysis of a Swahili corpus for lexicographic purposes. Lexikos 18 (AFRILEX-reeks/series 18: 2008): 303–318.

Doke, C.M. and Vilakazi, B.W. 1964. Zulu–English Dictionary. Witwatersrand University Press, Johannesburg.

Fromkin, V., Rodman, R. and Hyams, N. 2007. An Introduction to Language. Thomson Heinle, Massachusetts, USA.

Hurskainen, A. 1992. A two-level formalism for the analysis of Bantu morphology: an application to Swahili. Nordic Journal of African Studies, 1(1):87–122.

Koskenniemi, K. 1997. Representations and finite-state components in natural language, in Finite-State Language Processing, E. Roche and Y. Schabes (eds.), pp. 99–116. MIT Press, Boston.

Karttunen, L. 2001. Applications of finite-state transducers in natural language processing, in Implementation and application of automata, S. Yu and A. Paun (eds.). Lecture Notes in Computer Science, 2088:34–46. Springer, Heidelberg.

Louw, J.A., Finlayson, R. and Satyo, S.C. 1984. Xhosa Guide 3 for XHA100-F. University of South Africa, Pretoria.

Meinhof, C. 1932. Introduction to the phonology of the Bantu languages. Dietrich Reimer/Ernst Vohsen, Berlin.

Poulos, G. and Msimang, C.T. 1998. A linguistic analysis of Zulu. Via Afrika, Pretoria.

Pretorius, L. and Bosch, S.E. 2002. Finite state computational morphology: Treatment of the Zulu noun. South African Computer Journal, 28:30–38.

Pretorius, L. and Bosch, S.E. 2003. Finite state computational morphology: An analyzer prototype for Zulu. Machine Translation – Special issue on finite-state language resources and language processing, 18:195–216.

The Constitution. (sa). [O]. Available: http://www.concourt.gov.za/site/theconstitution/thetext.htm. Accessed on 31 January 2008.

Yli-Jyrä, A. 2005. Toward a widely usable finite-state morphology workbench for less studied languages – Part I: Desiderata. Nordic Journal of African Studies, 14(4):479–491.

Methods for Amharic Part-of-Speech Tagging

Björn Gambäck†‡, Fredrik Olsson†, Atelach Alemu Argaw*, Lars Asker*
†Userware Laboratory, Swedish Institute of Computer Science, Kista, Sweden
‡Dpt. of Computer & Information Science, Norwegian University of Science & Technology, Trondheim, Norway
*Dpt. of Computer & System Sciences, Stockholm University, Kista, Sweden
{gamback,fredriko}@sics.se [email protected] {atelach,asker}@dsv.su.se

Abstract

The paper describes a set of experiments involving the application of three state-of-the-art part-of-speech taggers to Ethiopian Amharic, using three different tagsets. The taggers showed worse performance than previously reported results for English, in particular having problems with unknown words. The best results were obtained using a Maximum Entropy approach, while HMM-based and SVM-based taggers got comparable results.

1 Introduction

Many languages, especially on the African continent, are under-resourced in that they have very few computational linguistic tools or corpora (such as lexica, taggers, parsers or tree-banks) available. Here, we will concentrate on the task of developing part-of-speech taggers for Amharic, the official working language of the government of the Federal Democratic Republic of Ethiopia: Ethiopia is divided into nine regions, each with its own nationality language; however, Amharic is the language for country-wide communication.

Amharic is spoken by about 30 million people as a first or second language, making it the second most spoken Semitic language in the world (after Arabic), probably the second largest language in Ethiopia (after Oromo), and possibly one of the five largest languages on the African continent. The actual size of the Amharic speaking population must be based on estimates: Hudson (1999) analysed the Ethiopian census from 1994 and indicated that more than 40% of the population then understood Amharic, while the current size of the Ethiopian population is about 80 million.¹

¹ 82.5 million according to CIA (2009); 76.9 according to Ethiopian parliament projections in December 2008 based on the preliminary reports from the census of May 2007.

In spite of the relatively large number of speakers, Amharic is still a language for which very few computational linguistic resources have been developed, and previous efforts to create language processing tools for Amharic (e.g., Alemayehu and Willett (2002) and Fissaha (2005)) have been severely hampered by the lack of large-scale linguistic resources for the language. In contrast, the work detailed in the present paper has been able to utilize the first publicly available medium-sized tagged Amharic corpus, described in Section 5.

However, first the Amharic language as such is introduced (in Section 2), and then the task of part-of-speech tagging and some previous work in the field is described (Section 3). Section 4 details the tagging strategies used in the experiments, the results of which can be found in Section 6 together with a short discussion. Finally, Section 7 sums up the paper and points to ways in which we believe that the results can be improved in the future.

2 Amharic

Written Amharic (and Tigrinya) uses a unique script originating from the Ge’ez alphabet (the liturgical language of the Ethiopian Orthodox Church). Written Ge’ez can be traced back to at least the 4th century A.D., with the first versions including consonants only, while the characters in later versions represent consonant-vowel (CV) pairs. In modern Ethiopic script each syllograph (syllable pattern) comes in seven different forms (called orders), reflecting the seven vowel sounds. The first order is the basic form; the others are derived from it by modifications indicating vowels. There are 33 basic forms, giving 7*33 syllographs, or fidels (‘fidel’, lit. ‘alphabet’ in Amharic, refers both to the characters and the entire script). Unlike Arabic and Hebrew, Amharic is written from left to right. There is no agreed upon spelling standard for compound words and the writing system uses several ways to denote compounds.
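To make the seven orders concrete, the sketch below enumerates the orders of one basic form using its Unicode encoding. This relies on an assumption that is not taken from the paper: in the Unicode Ethiopic block, the seven orders of a basic consonant such as the one at U+1208 occupy consecutive code points. The names `ORDERS`, `BASIC_FORMS` and `orders_of` are hypothetical.

```python
import unicodedata

ORDERS = 7        # vowel orders per basic form, as stated in the text
BASIC_FORMS = 33  # basic forms, as stated in the text


def orders_of(first_order_char: str) -> list[str]:
    """Return the seven orders of a syllograph, given its first-order form.

    Assumes the seven orders are encoded at consecutive code points,
    which holds for the row starting at U+1208 used below.
    """
    base = ord(first_order_char)
    return [chr(base + i) for i in range(ORDERS)]


for fidel in orders_of("\u1208"):
    print(fidel, unicodedata.name(fidel))

print(ORDERS * BASIC_FORMS)  # 231 syllographs in total
```

The product 7*33 = 231 matches the count given above; the Unicode character names printed by the loop are transliteration labels chosen by the standard and do not follow the romanisation used in this paper.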

Proceedings of the EACL 2009 Workshop on Language Technologies for African Languages – AfLaT 2009, pages 104–111, Athens, Greece, 31 March 2009. © 2009 Association for Computational Linguistics

2.1 Amharic morphology

A significantly large part of the vocabulary consists of verbs, and like many other Semitic languages, Amharic has a rich verbal morphology based on triconsonantal roots with vowel variants describing modifications to, or supplementary detail and variants of the root form. For example, the root sbr, meaning ‘to break’ can have (among others!) the forms shown in Table 1. Subject, gender, number, etc., are also indicated as bound morphemes on the verb, as well as objects and possession markers, mood and tense, beneficative, malfactive, transitive, dative, negative, etc.

    form                     pattern
    root        sbr          CCC
    perfect     säbbär       CVCCVC
    imperfect   säbr         CVCC
    gerund      säbr         CVCC
    imperative  sbär         CCVC
    causative   assäbbär     as-CVCCVC
    passive     täsäbbär     täs-CVCCVC

Table 1: Some forms of the verb sbr (‘break’)

Amharic nouns (and adjectives) can be inflected for gender, number, definiteness, and case, although gender is usually neutral. The definite article attaches to the end of a noun, as do conjunctions, while prepositions are mostly prefixed.

2.2 Processing Amharic morphology

The first effort on Amharic morphological processing was a rule-based system for verbs (and nouns derived from verbs) which used root patterns and affixes to determine lexical and inflectional categories (Bayou, 2000), while Bayu (2002) used an unsupervised learning approach based on probabilistic models to extract stems, prefixes, and suffixes for building a morphological dictionary. The system was able to successfully analyse 87% of a small test data set of 500 words.

The first larger-scale morphological analyser for Amharic verbs used XFST, the Xerox Finite State Tools (Fissaha and Haller, 2003). This was later extended to include all word categories (Amsalu and Gibbon, 2005). Testing with 1620 words of text from an Amharic bible, 88–94% recall and 54–94% precision (depending on the word-class) were reported. The lowest precision (54%) was obtained for verbs; Amsalu and Demeke (2006) thus describe ways to extend the finite-state system to handle 6400 simple verbal stems generated from 1300 root forms.

Alemayehu and Willett (2002) report on a stemmer for Information Retrieval for Amharic, and testing on a random sample of 1221 words stated “Manual assessment of the resulting stems showed that 95.5 percent of them were linguistically meaningful,” but gave no evaluation of the correctness of the segmentations. Argaw and Asker (2007) created a rule-based stemmer for a similar task, and using 65 rules and machine readable dictionaries obtained 60.0% accuracy on fictional text (testing on 300 unique words) and 76.9% on news articles (on 1503 words, of which 1000 unique).²

² Other knowledge sources for processing Amharic include, e.g., Gasser’s verb stem finder (available from nlp.amharic.org) and wordlists as those collected by Gebremichael (www.cs.ru.nl/~biniam/geez).

3 Part-of-Speech Tagging

Part-of-speech (POS) tagging is normally treated as a classification task with the goal to assign lexical categories (word classes) to the words in a text. Most work on tagging has concentrated on English and on using supervised methods, in the sense that the taggers have been trained on an available, tagged corpus. Both rule-based and statistical / machine-learning based approaches have been thoroughly investigated. The Brill Tagger (Brill, 1995) was fundamental in using a combined rule- and learning-based strategy to achieve 96.6% accuracy on tagging the Penn Treebank version of the Wall Street Journal corpus. That is, to a level which is just about what humans normally achieve when hand-tagging a corpus, in terms of interannotator agreement, even though Voutilainen (1999) has shown that humans can get close to the 100% agreement mark if the annotators are allowed to discuss the problematic cases.

Later taggers have managed to improve Brill’s figures a little bit, to just above 97% on the Wall Street Journal corpus using Hidden Markov Models, HMM and Conditional Random Fields, CRF; e.g., Collins (2002) and Toutanova et al. (2003). However, most recent work has concentrated on applying tagging strategies to other languages than English, on combining taggers, and/or on using unsupervised methods. In this section we will look at these issues in more detail, in particular in relation to languages similar to Amharic.

3.1 Tagging Semitic languages

Diab et al. (2004) used a Support Vector Machine, SVM-based tagger, trained on the Arabic Penn

Treebank 1 to tokenize, POS tag, and annotate Arabic base phrases. With an accuracy of 95.5% over a set of 24 tags, the data-driven tagger performed on par with state-of-the-art results for English when trained on similar-sized data (168k tokens). Bar-Haim et al. (2008) developed a lexicon-based HMM tagger for Hebrew. They report 89.6% accuracy using 21 tags and training on 36k tokens of news text. Mansour (2008) ported this tagger into Arabic by replacing the morphological analyzer, achieving an accuracy of 96.3% over 26 tags on an 89k token corpus. His approach modifies the analyses of sentences receiving a low probability by adding synthetically constructed analyses proposed by a model using character information.

A first prototype POS tagger for Amharic used a stochastic HMM to model contextual dependencies (Getachew, 2001), but was trained and tested on only one page of text. Getachew suggested a tagset for Amharic consisting of 25 tags. More recently, CRFs have been applied to segment and tag Amharic words (Fissaha, 2005), giving an accuracy of 84% for word segmentation, using character, morphological and lexical features. The best result for POS-tagging was 74.8%, when adding a dictionary and bigrams to lexical and morphological features, and 70.0% without dictionary and bigrams. The data used in the experiments was also quite small and consisted of 5 annotated news articles (1000 words). The tagset was a reduced version (10 tags) of the one used by Getachew (2001), and will be further discussed in Section 5.2.

3.2 Unsupervised tagging

The desire to use unsupervised machine learning approaches to tagging essentially originates from the wish to exploit the vast amounts of unlabelled data available when constructing taggers. The area is particularly vivid when it comes to the treatment of languages for which there exist few, if any, computational resources, and for the case of adapting an existing tagger to a new language domain.

Banko and Moore (2004) compared unsupervised HMM and transformation-based taggers trained on the same portions of the Penn Treebank, and showed that the quality of the lexicon used for training had a high impact on the tagging results. Duh and Kirchhoff (2005) presented a minimally-supervised approach to tagging for dialectal Arabic (Colloquial Egyptian), based on a morphological analyzer for Modern Standard Arabic and unlabeled texts in a number of dialects. Using a trigram HMM tagger, they first produced a baseline system and then gradually improved on that in an unsupervised manner by adding features so as to facilitate the analysis of unknown words, and by constraining and refining the lexicon.

Unsupervised learning is often cast as the problem of finding (hidden) structure in unlabeled data. Goldwater and Griffiths (2007) noted that most recent approaches to this problem aim to identify the set of attributes that maximizes some target function (Maximum Likelihood Estimation), and then to select the values of these attributes based on the representation of the model. They proposed a different approach, based on Bayesian principles, which tries to directly maximize the probability of the attributes based on observation in the data. This Bayesian approach outperformed Maximum Likelihood Estimation when training a trigram HMM tagger for English. Toutanova and Johnson (2007) report state-of-the-art results by extending the work on Bayesian modelling for unsupervised learning of taggers both in the way that prior knowledge can be incorporated into the model, and in the way that possible tags for a given word are explicitly modeled.

3.3 Combining taggers

A possible way to improve on POS tagging results is to combine the output of several different taggers into a committee, forming joint decisions regarding the labeling of the input. Roughly, there are three obvious ways of combining multiple predicted tags for a word: random decision, voting, and stacking (Dietterich, 1997), with the first way suited only for forming a baseline. Voting can be divided into two subclasses: unweighted votes, and weighted votes. The weights of the votes, if any, are usually calculated based on the classifiers' performance on some initial dataset. Stacking, finally, is a way of combining the decisions made by individual taggers in which the predicted tags for a given word are used as input to a subsequent tagger which outputs a final label for the word.

Committee-based approaches to POS tagging have been in focus the last decade: Brill and Wu (1998) combined four different taggers for English using unweighted voting and by exploring contextual cues (essentially a variant of stacking). Aires et al. (2000) experimented with 12 different ways of combining the output from taggers for Brazilian

Portuguese, and concluded that some, but not all, combinations yielded better accuracy than the best individual tagger. Shacham and Wintner (2007) contrasted what they refer to as being a naïve way of combining taggers with a more elaborate, hierarchical one for Hebrew. In the end, the elaborated method yielded results inferior to the naïve approach. De Pauw et al. (2006) came to similar conclusions when using five different ways of combining four data-driven taggers for Swahili. The taggers were based on HMM, Memory-based learning, SVM, and Maximum Entropy, with the latter proving most accurate. Only in three of five cases did a combination of classifiers perform better than the Maximum Entropy-based tagger, and simpler combination methods mostly outperformed more elaborate ones.

Spoustová et al. (2007) report on work on combining a hand-written rule-based tagger with three statistically induced taggers for Czech. As an effect of Czech being highly inflectional, the tagsets are large: 1000–2000 unique tags. Thus the approach to combining taggers first aims at reducing the number of plausible tags for a word by using the rule-based tagger to discard impossible tags. Precision is then increased by invoking one or all of the data-driven taggers. Three different ways of combining the taggers were explored: serial combination, involving one of the statistical taggers; so called SUBPOS pre-processing, involving two instances of statistical taggers (possibly the same tagger); and, parallel combination, in which an arbitrary number of statistical taggers is used. The combined tagger yielded the best results for Czech POS tagging reported to date, and as a side-effect also the best accuracy for English: 97.43%.³

³ As reported on ufal.mff.cuni.cz/compost/en

4 The Taggers

This section describes the three taggers used in the experiments (which are reported on in Section 6).

4.1 Hidden Markov Models: TnT

TnT, "Trigrams'n'Tags" (Brants, 2000) is a very fast and easy-to-use HMM-based tagger which painlessly can be trained on different languages and tagsets, given a tagged corpus.⁴ A Markov-based tagger aims to find a tag sequence which maximizes $P(\mathit{word}_n \mid \mathit{tag}_n) \cdot P(\mathit{tag}_n \mid \mathit{tag}_{1 \ldots n-1})$, where the first factor is the emit (or lexical) probability, the likelihood of a word given a certain tag, and the second factor is the state transition (or contextual) probability, the likelihood of a tag given a sequence of preceding tags. TnT uses the Viterbi algorithm for finding the optimal tag sequence. Smoothing is implemented by linear interpolation; the respective weights are determined by deleted interpolation. Unknown words are handled by a suffix trie and successive abstraction.

⁴ www.coli.uni-saarland.de/~thorsten/tnt

Applying TnT to the Wall Street Journal corpus, Brants (2000) reports 96.7% overall accuracy, with 97.0% on known and 85.5% on unknown words (with 2.9% of the words being unknown).

4.2 Support Vector Machines: SVMTool

Support Vector Machines (SVM) is a linear learning system which builds two-class classifiers. It is a supervised learning method whereby the input data are represented as vectors in a high-dimensional space and SVM finds a hyperplane (a decision boundary) separating the input space into two by maximizing the margin between positive and negative data points.

SVMTool is an open source tagger based on SVMs.⁵ Comparing the accuracy of SVMTool with TnT on the Wall Street Journal corpus, Giménez and Màrquez (2004) report a better performance by SVMTool: 96.9%, with 97.2% on known words and 83.5% on unknown.

⁵ www.lsi.upc.edu/~nlp/SVMTool

4.3 Maximum Entropy: MALLET

Maximum Entropy is a linear classification method. In its basic incarnation, linear classification combines, by addition, the pre-determined weights used for representing the importance of each feature to a given class. Training a Maximum Entropy classifier involves fitting the weights of each feature value for a particular class to the available training data. A good fit of the weights to the data is obtained by selecting weights to maximize the log-likelihood of the learned classification model. Using a Maximum Entropy approach to POS tagging, Ratnaparkhi (1996) reports a tagging accuracy of 96.6% on the Wall Street Journal.

The software of choice for the experiments reported here is MALLET (McCallum, 2002), a freely available Java implementation of a range of machine learning methods, such as Naïve Bayes, decision trees, CRF, and Maximum Entropy.⁶

⁶ mallet.cs.umass.edu
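The HMM decoding described in Section 4.1 can be illustrated with a minimal Viterbi sketch. This is not TnT's implementation: it assumes a bigram (not trigram) model, replaces interpolation smoothing and suffix-based unknown-word handling with a fixed floor probability, and uses made-up toy probabilities and tokens. The names `viterbi`, `start_p`, `trans_p`, `emit_p` and `floor` are hypothetical.

```python
import math


def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most probable tag sequence for `words` under a bigram HMM.

    Missing probabilities fall back to `floor` (an assumption standing in
    for real smoothing and unknown-word handling).
    """
    # best[i][t] = (log-probability of the best path ending in tag t at
    #               position i, tag chosen at position i - 1)
    best = [{}]
    for t in tags:
        score = math.log(start_p.get(t, floor))
        score += math.log(emit_p[t].get(words[0], floor))
        best[0][t] = (score, None)

    for i in range(1, len(words)):
        best.append({})
        for t in tags:
            # Contextual (transition) part: best predecessor for tag t.
            prev, score = max(
                ((p, best[i - 1][p][0] + math.log(trans_p[p].get(t, floor)))
                 for p in tags),
                key=lambda pair: pair[1],
            )
            # Lexical (emission) part: likelihood of the word given tag t.
            score += math.log(emit_p[t].get(words[i], floor))
            best[i][t] = (score, prev)

    # Follow the back-pointers from the best final tag.
    last = max(tags, key=lambda t: best[-1][t][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return list(reversed(path))


# Toy model with two tags; all numbers are invented for illustration.
TAGS = ["N", "V"]
start_p = {"N": 0.8, "V": 0.2}
trans_p = {"N": {"N": 0.4, "V": 0.6}, "V": {"N": 0.7, "V": 0.3}}
emit_p = {
    "N": {"abebe": 0.5, "beso": 0.45, "bela": 0.05},
    "V": {"bela": 0.9, "beso": 0.1},
}

print(viterbi(["abebe", "beso", "bela"], TAGS, start_p, trans_p, emit_p))
```

Working in log space avoids numerical underflow on long sentences; a trigram tagger would keep pairs of tags as states instead of single tags.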

5 The Dataset

The experiments of this paper utilize the first medium-sized corpus for Amharic (available at http://nlp.amharic.org). The corpus consists of all 1065 news texts (210,000 words) from the Ethiopian year 1994 (parts of the Gregorian years 2001–2002) from the Walta Information Center, a private news service based in Addis Ababa. It has been morphologically analysed and manually part-of-speech tagged by staff at ELRC, the Ethiopian Languages Research Center at Addis Ababa University (Demeke and Getachew, 2006).

The corpus is available both in fidel and transcribed into a romanized version known as SERA, System for Ethiopic Representation in ASCII (Yacob, 1997). We worked with the transliterated form (202,671 words), to be compatible with the machine learning tools used in the experiments.

5.1 "Cleaning" the corpus

Unfortunately, the corpus available on the net contains quite a few errors and tagging inconsistencies: nine persons participated in the manual tagging, writing the tags with pen on hard copies, which were given to typists for insertion into the electronic version of the corpus, a procedure obviously introducing several possible error sources.

Before running the experiments the corpus had to be "cleaned": many non-tagged items have been tagged (the human taggers have, e.g., often tagged the headlines of the news texts as one item, end-of-sentence punctuation), while some double tags have been removed. Reflecting the segmentation of the original Amharic text, all whitespaces were removed, merging multiword units with a single tag into one-word units. Items like '"' and '/' have been treated consistently as punctuation, and consistent tagging has been added to word-initial and word-final hyphens. Also, some direct tagging errors and misspellings have been corrected.

Time expressions and numbers have not been consistently tagged at all, but those had to be left as they were. Finally, many words have been transcribed into SERA in several versions, with only the cases differing. However, this is also difficult to account for (and in the experiments below we used the case sensitive version of SERA), since the SERA notation in general lets upper and lower cases of the English alphabet represent different symbols in fidel (the Amharic script).

5.2 Tagsets

For the experiments, three different tagsets were used. Firstly, the full, original 30-tag set developed at the Ethiopian Languages Research Center and described by Demeke and Getachew (2006). This version of the corpus will be referred to as 'ELRC'. It contains 200,863 words and differs from the published corpus by way of the corrections described in the previous section.

Secondly, the corpus was mapped to 11 basic tags. This set consists of ten word classes: Noun, Pronoun, Verb, Adjective, Preposition, Conjunction, Adverb, Numeral, Interjection, and Punctuation, plus one tag for problematic words (unclear: ). The main differences between the two tagsets pertain to the treatment of prepositions and conjunctions: in 'ELRC' there are specific classes for, e.g., pronouns attached with preposition, conjunction, and both preposition and conjunction (similar classes occur for nouns, verbs, adjectives, and numerals). In addition, numerals are divided into cardinals and ordinals, verbal nouns are separated from other nouns, while auxiliaries and relative verbs are distinguished from other verbs. The full tagset is made up of thirty subclasses of the basic classes, based on type of word only: the tags contain no information on grammatical categories (such as number, gender, tense, and aspect).

Thirdly, for comparison reasons, the full tagset was mapped to the 10 tags used by Fissaha (2005). These classes include one for Residual (R) which was assumed to be equivalent to . In addition, and were mapped to Adposition (AP), and both and to N. The other mappings were straight-forward, except that the 'BASIC' tagset groups all verbs together, while Fissaha kept Auxiliary (AUX) as its own class. This tagset will be referred to as 'SISAY'.

5.3 Folds

For evaluation of the taggers, the corpus was split into 10 folds. These folds were created by chopping the corpus into 100 pieces, each of about 2000 words in sequence, while making sure that each piece contained full sentences (rather than cutting off the text in the middle of a sentence), and then merging sets of 10 pieces into a fold. Thus the folds represent even splits over the corpus, to avoid tagging inconsistencies, but the sequences are still large enough to potentially make knowledge sources such as n-grams useful.
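The fold construction described in Section 5.3 can be sketched roughly as follows. The paper does not say which pieces were merged into which fold; the round-robin assignment below (piece i goes to fold i mod 10) is an assumption chosen because it spreads every fold evenly over the corpus. The function name `make_folds` and its parameters are hypothetical.

```python
def make_folds(sentences, n_pieces=100, n_folds=10):
    """Chop a corpus into contiguous pieces of whole sentences, then merge
    the pieces into folds.

    `sentences` is a list of sentences, each a list of tokens. Pieces are
    never cut in the middle of a sentence.
    """
    total_words = sum(len(sentence) for sentence in sentences)
    target = total_words / n_pieces  # about 2000 words for a 200k-word corpus

    pieces, current, seen = [], [], 0
    for sentence in sentences:
        current.append(sentence)
        seen += len(sentence)
        # Close the piece once the running word count reaches its share of
        # the corpus; the last piece collects whatever remains.
        if seen >= target * (len(pieces) + 1) and len(pieces) < n_pieces - 1:
            pieces.append(current)
            current = []
    if current:
        pieces.append(current)

    # Assumption: round-robin merge, so each fold samples the whole corpus.
    folds = [[] for _ in range(n_folds)]
    for index, piece in enumerate(pieces):
        folds[index % n_folds].extend(piece)
    return folds


# Toy corpus: 1000 four-token sentences (invented data).
toy_corpus = [[f"w{i}_{j}" for j in range(4)] for i in range(1000)]
toy_folds = make_folds(toy_corpus)
print([len(fold) for fold in toy_folds])
```

For cross validation, each fold in turn serves as test data while the other nine are concatenated into the training set.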

Fold       TOTAL    KNOWN   UNKNOWN
fold00    20,027   17,720     2,307
fold01    20,123   17,750     2,373
fold02    20,054   17,645     2,409
fold03    20,169   17,805     2,364
fold04    20,051   17,524     2,527
fold05    20,058   17,882     2,176
fold06    20,111   17,707     2,404
fold07    20,112   17,746     2,366
fold08    20,015   17,765     2,250
fold09    20,143   17,727     2,416
Average   20,086   17,727     2,359
Percent        -    88.26     11.74

Table 2: Statistics for the 10 folds

Table 2 shows the data for each of the folds, in terms of total number of tokens, as well as split into known and unknown tokens, where the term UNKNOWN refers to tokens that are not in any of the other nine folds. The figures at the bottom of the table show the average numbers of known and unknown words, over all folds. Notably, the average number of unknown words is about four times higher than in the Wall Street Journal corpus (which, however, is about six times larger).

6 Results

The results obtained by applying the three different tagging strategies to the three tagsets are shown in Table 3, in terms of average accuracies after 10-fold cross validation, over all the tokens (with standard deviation),⁷ as well as accuracy divided between the known and unknown words. Additionally, SVMTool and MALLET include support for automatically running 10-fold cross validation on their own folds. Figures for those runs are also given. The last line of the table shows the baselines for the tagsets, given as the number of tokens tagged as regular nouns divided by the total number of words after correction.

⁷ The standard deviation is given by $\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}$, where $\bar{x}$ is the arithmetic mean ($\frac{1}{n}\sum_{i=1}^{n} x_i$).

               ELRC   BASIC   SISAY
TnT           85.56   92.55   92.60
  STDDEV       0.42    0.31    0.32
  KNOWN       90.00   93.95   93.99
  UNKNOWN     52.13   82.06   82.20
SVM           88.30   92.77   92.80
  STDDEV       0.41    0.31    0.37
  KNOWN       89.58   93.37   93.34
  UNKNOWN     78.68   88.23   88.74
  Own folds   88.69   92.97   92.99
  STDDEV       0.33    0.17    0.26
MaxEnt        87.87   92.56   92.60
  STDDEV       0.49    0.38    0.43
  KNOWN       89.44   93.26   93.27
  UNKNOWN     76.05   87.29   87.61
  Own folds   90.83   94.64   94.52
  STDDEV       1.37    1.11    0.69
BASELINE      35.50   58.26   59.61

Table 3: Tagging results

6.1 TnT

As the figures for known words in Table 3 indicate, TnT achieves the best scores of all three taggers, on all three tagsets, on known words. However, it has problems with the unknown words, and since these are so frequent in the corpus, TnT overall performs worse than the other taggers. The problems with the unknown words increase as the number of possible tags increases, and thus TnT does badly on the original tagging scheme ('ELRC'), where it only gets a bit over 50% on the unknown words (and 85.6% overall). For the two reduced tagsets TnT does better: overall performance goes up to a bit over 92%, with 82% on unknown words.

Table 3 shows the results on the default configuration of TnT, i.e., using 3-grams and interpolated smoothing. Changing these settings gives no substantial improvement overall: what is gained at one end (e.g., on unknown words or a particular tagset) is lost at the other end (on known words or other tagsets). However, per default TnT uses a suffix trie of length 10 to handle unknown words. Extending the suffix to 20 (the maximum value in TnT) gave a slight performance increase on 'ELRC' (0.13% on unknown words, 0.01% overall), while having no effect on the smaller tagsets.

6.2 SVM

The SVM-tagger outperforms TnT on unknown words, but is a bit worse on known words. Overall, SVM is slightly better than TnT on the two smaller tagsets and clearly better on the large tagset, and somewhat better than MaxEnt on all three tagsets.

These results are based on SVMTool's default parameters: a one-pass, left-to-right, greedy tagging scheme with a window size of 5. Previous experiments with parameter tuning and multiple pass tagging have indicated that there is room for performance improvements by ≈ 2%.

6.3 Maximum Entropy

The MaxEnt tagger gets results comparable to the other taggers on the predefined folds. Its overall

performance is equivalent to TnT's on the smaller tagsets, but significantly better on 'ELRC'.

As can be seen in Table 3, the MaxEnt tagger clearly outperforms the other taggers on all tagsets, when MALLET is allowed to create its own folds: all tagsets achieved classification accuracies higher than 90%, with the two smaller tagsets over 94.5%. The dramatic increase in the tagger's performance on these folds is surprising, but a clear indication of one of the problems with n-fold cross validation: even though the results represent averages after n runs, the choice of the original folds to suit a particular tagging strategy is of utmost importance for the final result.

Table 4 shows the 22 features used to represent an instance (Word_n) in the Maximum Entropy tagger. The features are calculated per token within sentences: the starting token of a sentence is not affected by the characteristics of the tokens ending the previous sentence, nor the other way around. Thus not all features are calculated for all tokens.

Word_n ; Tag of Word_n
Prefixes of Word_n, length 1-5 characters
Postfixes of Word_n, length 1-5 characters
Is Word_n capitalized?
Is Word_n all digits?
Does Word_n contain digits?
Does Word_n contain a hyphen?
Word_{n-1} ; Tag of Word_{n-1}
Word_{n-2} ; Tag of Word_{n-2}
Word_{n+1}
Word_{n+2}

Table 4: Features used in the MaxEnt tagger

6.4 Discussion

In terms of accuracy, the MaxEnt tagger is by far the best of the three taggers, and on all three tagsets, when allowed to select its own folds. Still, as Table 3 shows, the variation of the results for each individual fold was then substantially larger.

It should also be noted that TnT is by far the fastest of the three taggers, in all respects: in terms of time to set up and learn to use the tagger, in terms of tagging speed, and in particular in terms of training time. Training TnT is a matter of seconds, but a matter of hours for MALLET/MaxEnt and SVMTool. On the practical side, it is worth adding that TnT is robust, well-documented, and easy to use, while MALLET and SVMTool are substantially more demanding in terms of user effort and also appear to be more sensitive to the quality and format of the input data.

7 Conclusions and Future Work

The paper has described experiments with applying three state-of-the-art part-of-speech taggers to Amharic, using three different tagsets. All taggers showed worse performance than previously reported results for English. The best accuracy was obtained using a Maximum Entropy approach when allowed to create its own folds: 90.1% on a 30 tag tagset, and 94.6 resp. 94.5% on two reduced sets (11 resp. 10 tags), outperforming an HMM-based (TnT) and an SVM-based (SVMTool) tagger. On predefined folds all taggers got comparable results (92.5-92.8% on the reduced sets and 4-7% lower on the full tagset). The SVM-tagger performs slightly better than the others overall, since it has the best performance on unknown words, which are four times as frequent in the 200K words Amharic corpus used than in the (six times larger) English Wall Street Journal corpus. TnT gave the best results for known words, but had the worst performance on unknown words.

In order to improve tagging accuracy, we will investigate including explicit morphological processing to treat unknown words, and combining taggers. Judging from previous efforts on combining taggers (Section 3.3), it is far from certain that the combination of taggers actually ends up producing better results than the best individual tagger. A pre-requisite for successful combination is that the taggers are sufficiently dissimilar; they must draw on different characteristics of the training data and make different types of mistakes.

The taggers described in this paper use no other knowledge source than a tagged training corpus. In addition to incorporating (partial) morphological processing, performance could be increased by including knowledge sources such as machine readable dictionaries or lists of Amharic stem forms (Section 2.2). Conversely, semi-supervised or unsupervised learning for tagging clearly are interesting alternatives to manually annotating and constructing corpora for training taggers. Since there are few computational resources available for Amharic, approaches such as those briefly outlined in Section 3.2 deserve to be explored.

Acknowledgements

The work was partially funded by Sida, the Swedish International Development Cooperation Agency through SPIDER (the Swedish Programme for ICT in Developing Regions). Thanks to Dr. Girma Demeke, Mesfin Getachew, and the ELRC staff for their efforts on tagging the corpus, and to Thorsten Brants for providing us with the TnT tagger.

References

Rachel V. Xavier Aires, Sandra M. Aluísio, Denise C. S. Kuhn, Marcio L. B. Andreeta, and Osvaldo N. Oliveira Jr. 2000. Combining classifiers to improve part of speech tagging: A case study for Brazilian Portuguese. In 15th Brazilian Symposium on AI, pp. 227–236, Atibaia, Brazil.

Nega Alemayehu and Peter Willett. 2002. Stemming of Amharic words for information retrieval. Literary and Linguistic Computing, 17:1–17.

Saba Amsalu and Dafydd Gibbon. 2005. Finite state morphology of Amharic. In 5th Recent Advances in Natural Language Processing, pp. 47–51, Borovets, Bulgaria.

Saba Amsalu and Girma A. Demeke. 2006. Non-concatinative finite-state morphotactics of Amharic simple verbs. ELRC Working Papers, 2:304–325.

Atelach Alemu Argaw and Lars Asker. 2007. An Amharic stemmer: Reducing words to their citation forms. Computational Approaches to Semitic Languages, pp. 104–110, Prague, Czech Rep.

Michele Banko and Robert C. Moore. 2004. Part of speech tagging in context. In 20th Int. Conf. on Computational Linguistics, pp. 556–561, Geneva, Switzerland.

Roy Bar-Haim, Khalil Simaan, and Yoad Winter. 2008. Part-of-speech tagging of modern Hebrew text. Natural Language Engineering, 14:223–251.

Abiyot Bayou. 2000. Design and development of word parser for Amharic language. MSc Thesis, Addis Ababa University, Ethiopia.

Tesfaye Bayu. 2002. Automatic morphological analyser: An experiment using unsupervised and autosegmental approach. MSc Thesis, Addis Ababa University, Ethiopia.

Thorsten Brants. 2000. TnT – a statistical part-of-speech tagger. In 6th Conf. Applied Natural Language Processing, pp. 224–231, Seattle, Wash.

Eric Brill and Jun Wu. 1998. Classifier combination for improved lexical disambiguation. In 17th Int. Conf. on Computational Linguistics, pp. 191–195, Montreal, Canada.

Eric Brill. 1995. Transformation-based error-driven learning and Natural Language Processing: A case study in part of speech tagging. Computational Linguistics, 21:543–565.

CIA. 2009. The World Factbook – Ethiopia. The Central Intelligence Agency, Washington, DC. [Updated 22/01/09.]

Michael Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Empirical Methods in Natural Language Processing, pp. 1–8, Philadelphia, Penn.

Girma A. Demeke and Mesfin Getachew. 2006. Manual annotation of Amharic news items with part-of-speech tags and its challenges. ELRC Working Papers, 2:1–17.

Mona Diab, Kadri Hacioglu, and Daniel Jurafsky. 2004. Automatic tagging of Arabic text: From raw text to base phrase chunks. In HLT Conf. North American ACL, pp. 149–152, Boston, Mass.

Thomas G. Dietterich. 1997. Machine-learning research: Four current directions. AI magazine, 18:97–136.

Kevin Duh and Katrin Kirchhoff. 2005. POS tagging of dialectal Arabic: A minimally supervised approach. Computational Approaches to Semitic Languages, pp. 55–62, Ann Arbor, Mich.

Sisay Fissaha and Johann Haller. 2003. Amharic verb lexicon in the context of machine translation. In 10th Traitement Automatique des Langues Naturelles, vol. 2, pp. 183–192, Batz-sur-Mer, France.

Sisay Fissaha. 2005. Part of speech tagging for Amharic using conditional random fields. Computational Approaches to Semitic Languages, pp. 47–54, Ann Arbor, Mich.

Mesfin Getachew. 2001. Automatic part of speech tagging for Amharic: An experiment using stochastic hidden Markov model (HMM) approach. MSc Thesis, Addis Ababa University, Ethiopia.

Jesús Giménez and Lluís Màrquez. 2004. SVMTool: A general POS tagger generator based on support vector machines. In 4th Int. Conf. Language Resources and Evaluation, pp. 168–176, Lisbon, Portugal.

Sharon Goldwater and Thomas L. Griffiths. 2007. A fully Bayesian approach to unsupervised part-of-speech tagging. In 45th ACL, pp. 744–751, Prague, Czech Rep.

Grover Hudson. 1999. Linguistic analysis of the 1994 Ethiopian census. Northeast African Studies, 6:89–107.

Saib Mansour. 2008. Combining character and morpheme based models for part-of-speech tagging of Semitic languages. MSc Thesis, Technion, Haifa, Israel.

Andrew Kachites McCallum. 2002. MALLET: A machine learning for language toolkit. Webpage.

Guy De Pauw, Gilles-Maurice de Schryver, and Peter W. Wagacha. 2006. Data-driven part-of-speech tagging of Kiswahili. In 9th Int. Conf. Text, Speech and Dialogue, pp. 197–204, Brno, Czech Rep.

Adwait Ratnaparkhi. 1996. A maximum entropy model for part-of-speech tagging. In Empirical Methods in Natural Language Processing, pp. 133–142, Philadelphia, Penn.

Danny Shacham and Shuly Wintner. 2007. Morphological disambiguation of Hebrew: A case study in classifier combination. In Empirical Methods in Natural Language Processing, pp. 439–447, Prague, Czech Rep.

Drahomíra Spoustová, Jan Hajič, Jan Votrubec, Pavel Krbec, and Pavel Květoň. 2007. The best of two worlds: Cooperation of statistical and rule-based taggers for Czech. Balto-Slavonic Natural Language Processing, pp. 67–74, Prague, Czech Rep.

Kristina Toutanova and Mark Johnson. 2007. A Bayesian LDA-based model for semi-supervised part-of-speech tagging. In 21st Int. Conf. Advances in Neural Information Processing Systems, pp. 1521–1528, Vancouver, B.C.

Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In HLT Conf. North American ACL, pp. 173–180, Edmonton, Alberta.

Atro Voutilainen. 1999. An experiment on the upper bound of interjudge agreement: The case of tagging. In 9th European ACL, pp. 204–208, Bergen, Norway.

Daniel Yacob. 1997. The System for Ethiopic Representation in ASCII – 1997 standard. Webpage.

An Ontology for Accessing Transcription Systems (OATS)

Steven Moran University of Washington Seattle, WA, USA [email protected]

Abstract

This paper presents the Ontology for Accessing Transcription Systems (OATS), a knowledge base that supports interoperation over disparate transcription systems and practical orthographies. The knowledge base includes an ontological description of writing systems and relations for mapping transcription system segments to an interlingua pivot, the IPA. It includes orthographic and phonemic inventories from 203 African languages. OATS is motivated by the desire to query data in the knowledge base via IPA or native orthography, and for error checking of digitized data and conversion between transcription systems. The model in this paper implements these goals.

1 Introduction

The World Wide Web has emerged as the predominant source for obtaining linguistic field data and language documentation in textual, audio and video formats. A simple keyword search on the nearly extinct language Livonian [liv]¹ returns numerous results that include text, audio and video files. As data on the Web continue to increase, including material posted by native language communities, researchers are presented with an ideal medium for the automated discovery and analysis of linguistic data, e.g. (Lewis, 2006). However, resources on the Web are not always accessible to users or software agents. The data often exist in legacy or proprietary software and data formats. This makes them difficult to locate and access.

¹ ISO 639-3 language codes are in [].

Interoperability of linguistic resources has the ability to make disparate linguistic data accessible to researchers. It is also beneficial for data aggregation. Through the use of ontologies, applications can be written to perform intelligent search (deriving implicit knowledge from explicit information). They can also interoperate between resources, thus allowing data to be shared across applications and between research communities with different terminologies, annotations, and notations for marking up data.

OATS is a knowledge base, i.e. a data source that uses an ontology to specify the structure of entities and their relations. It includes general knowledge of writing systems and transcription systems that are core to the General Ontology of Linguistic Description (GOLD)² (Farrar and Langendoen, 2003). Other portions of OATS, including the relationships encoded for relating segments of transcription systems, or the computational representations of these elements, extend GOLD as a Community of Practice Extension (COPE) (Farrar and Lewis, 2006). OATS provides interoperability for transcription systems and practical orthographies that map phones and phonemes in unique relationships to their graphemic representations. These systematic mappings thus provide a computationally tractable starting point for interoperating over linguistic texts. The resources that are targeted also encompass a wide array of data on lesser-studied languages of the world, as well as low density languages, i.e. those with few electronic resources (Baldwin et al., 2006).

² http://linguistics-ontology.org/

This paper is structured as follows: in section 2, linguistic and technological definitions and terminology are provided. In section 3, the theoretical and technological challenges of interoperating over heterogeneous transcription systems are described. The technologies used in OATS and its design are presented in section 4. In section 5, OATS' implementation is illustrated with linguistic data that was mined from the Web, therefore motivating the general design objectives taken into

Proceedings of the EACL 2009 Workshop on Language Technologies for African Languages – AfLaT 2009, pages 112–120, Athens, Greece, 31 March 2009. © 2009 Association for Computational Linguistics

account in its development. Section 6 concludes with future research goals.

2 Conventions and Terminology

2.1 Conventions

Standard conventions are used for distinguishing between graphemic < >, phonemic / / and phonetic representations [ ].³ For character data information, I follow the Unicode Standard's notational conventions (The Unicode Consortium, 2007). Character names are represented in small capital letters (e.g. LATIN SMALL LETTER SCHWA) and code points are expressed as 'U+n' where n is a four to six digit hexadecimal number (e.g. U+0259), which is rendered as <ə>.

2.2 Linguistic definitions

In the context of this paper, a transcription system is a system of symbols and rules for graphically transcribing the sounds of a language variety. A practical orthography is a phonemic writing system designed for practical use by speakers already competent in the language. The mapping relation between phonemes and graphemes in practical orthographies is purposely shallow, i.e. there is a faithful mapping from a unique sound to a unique symbol.⁴ The IPA is often used by field linguists in the development of practical orthographies for languages without writing systems. An orthography specifies the symbols, punctuation, and the rules by which a language is correctly written in a standardized way. All orthographies are language specific.

Practical orthographies and transcription systems are both kinds of writing systems. A writing system is a symbolic system that uses visible or tactile signs to represent a language in a systematic way. Differences in the encoding of meaning and sound form a continuum for representing writing systems in a typology whose categories are commonly referred to as either logographic, syllabic, phonetic or featural. A logographic system denotes symbols that visually represent morphemes (and sometimes morphemes and syllables). A syllabic system uses symbols to denote syllables. A phonetic system represents sound segments as symbols. Featural systems are less common and encode phonological features within the shapes of the symbols represented in the script.

The term script refers to a collection of symbols (or distinct marks) as employed by a writing system. The term script is often confused with, and used interchangeably with, 'writing system'. A writing system may be written with different scripts, e.g. the alphabet writing system can be written in Roman and Cyrillic scripts (Coulmas, 1999). A grapheme is the unit of writing that represents a particular abstract representation of a symbol employed by a writing system. Just as the phoneme is an abstract representation of a distinct sound in a language, a grapheme is a contrastive graphical unit in a writing system. A grapheme is the basic, minimally distinctive symbol of a writing system. A script may employ multiple graphemes to represent a single phoneme, e.g. the graphemes <c> and <h> when conjoined in English <ch> represent one phoneme in English, pronounced /ʧ/ (or /k/). The opposite is also found in writing systems, where a single grapheme represents two or more phonemes, e.g. <x> in English is a combination of the phonemes /ks/.

A graph is the smallest unit of written language (Coulmas, 1999). The electronic counterpart of the graph is the glyph. Glyphs represent the variation of graphemes as they appear when rendered or displayed. In typography glyphs are created using different illustration techniques. These may result in homoglyphs, pairs of characters with shapes that are either identical or are beyond differentiation by swift visual inspection. When rendered by hand, a writer may use different styles of handwriting to produce glyphs in standard handwriting, cursive, or calligraphy. When rendered computationally, a repertoire of glyphs makes up a font.

A final distinction is needed for interoperating over transcription systems. The term scripteme is used for the use of a grapheme within a writing system with the particular semantics (i.e., pronunciation) it is assigned within that writing system. The notion scripteme is needed because graphemes may be homoglyphic across scripts and languages, and the semantics of a grapheme is dependent on the writing system using it. For example, the grapheme <p> in Russian represents a dental or alveolar trill; /r/ in IPA. However, <p> is realized by English speakers as a voiceless bilabial stop /p/. The defining of scripteme is necessary

³ Phonemic and phonetic representations are given in the International Phonetic Alphabet (IPA).
⁴ Practical orthographies are intended to jump-start written materials development by correlating a writing system with its sound units, making it easier for speakers to master and acquire literacy.

for interoperability because it provides a level for mapping a writing system specific grapheme to the phonological level, allowing the same grapheme to represent different sounds across different transcription and writing systems.

2.3 Technological definitions

A document refers to an electronic document that contains language data. Each document is associated with metadata and one or more transcription systems or practical orthographies. A document's content is comprised of a set of scriptemes from its transcription system. A mapping relation is an unordered pair of a scripteme in a transcription system and its representation in IPA.

OATS first maps scriptemes to their grapheme equivalent(s). Graphemes are then mapped to their character equivalents. A character in OATS is a computational representation of a grapheme. Character encodings represent a range of integers known as the code space. A code point is a unique integer, or point, within this code space. An abstract character is then mapped to a unique code point and rendered as an encoded character and typographically defined by the font used to render it. A set of encoded characters is a character set and different character encodings encode characters as numbers via different encoding schemes.

3 Interoperating Over Transcription Systems

Section 3.1 uses the Sisaala languages to illustrate interoperability challenges posed by linguistic data. Section 3.2 addresses technological issues including encoding and ambiguity.

3.1 Linguistic challenges

Three genetically related languages spoken in Northern Ghana, Sisaala Pasaale [sig], Sisaala Tumulung [sil] and Sisaala Western [ssl], differ slightly in their orthographies for two reasons: they have slightly divergent phonemic inventories and their orthographies may differ graphemically when representing the same phoneme. See Table 1.

Table 1: Phoneme-to-grapheme relations

       /kp/   /d/    /ʧ/   /ɪ/   /ʊ/   Tone
sig    kp     d, r   ky    ɩ     ʋ     not marked
sil    kp     d      ch    i     u     accents
ssl    -      d      ky    ɪ     ʊ     accents

The voiceless labial-velar phoneme /kp/ appears in both Sisaala Tumulung and Sisaala Pasaale, but has been lost in Sisaala Western. There is a convergence of the allophones [d] and [r] into one phoneme /d/ in Sisaala Pasaale (Toupin, 1995).⁵ These three orthographies also differ because of their authors' choices in assigning graphemes to phonemes. In Sisaala Pasaale and Sisaala Western, the phonemes /ʧ/ and /ʤ/ are written as and . In Sisaala Tumulung, however, these sounds are written and . Orthography developers may have made these choices for practical reasons, such as ease of learnability or technological limitations (Bodomo, 1997). During the development of practical orthographies for Sisaala Pasaale and Sisaala Western, the digraphs and were chosen because children learn Dagaare [dga] in schools, so they are already familiar with their sounds in the Dagaare orthography (McGill et al., 1999) (Moran, 2008).

Another difference lies in the representation of vowels. Both Sisaala Pasaale and Sisaala Western represent their full sets of vowels orthographically. These orthographies were developed relatively recently, when computers, character encodings, and font support had become less problematic. In Sisaala Tumulung, however, the phonemes /i/ and /ɪ/ are collapsed to , and /u/ and /ʊ/ to (Blass, 1975). Sisaala Tumulung's orthography was developed in the 1970s and technological limitations may have led its developers to collapse these phonemes in the writing system. For example, the Ghana Alphabet Committee's 1990 Report lacks an individual grapheme for the phoneme /ŋ/ for Dagaare. This difficulty of rendering unconventional symbols on typewriters once posed a challenge for orthography development (Bodomo, 1997).

Tone is both lexically and grammatically contrastive in Sisaala languages. In Sisaala Pasaale's official orthography tone is not marked and is not used in native speaker materials. On the other hand, in linguistic descriptions that use this orthography, tone is marked to disambiguate tonal

⁵ The phoneme /d/ has morphologically conditioned allographs (word initial) or (elsewhere) (McGill, 2004).

minimal pairs in lexical items and grammatical constructions (McGill, 2004). In the Sisaala (Tumulung)-English dictionary, tone is marked only to disambiguate lexical items (Blass, 1975). In linguistic descriptions of Sisaala Western, non-contrastive tone is marked. When tone is marked, it appears as acute (high tone) and grave (low tone) accents over vowels or nasals.

Language researchers would quickly pick up on these minute differences in orthographies. However, what first seem to be trivial differences illustrate one issue of resource discovery on the Web – without methods for interoperability, even slightly divergent resources are more difficult to discover, query and compare. How would someone researching a comparative analysis of /ʧ/ sounds of languages in Northern Ghana discover that it is represented as and without first locating the extremely sparse grammatical information available on these languages? Furthermore, automatic phonetic research is possible on languages with shallow orthographies (Zuraw, 2006), but crosslinguistic versions of such work require interoperation over writing systems.

3.2 Technological challenges

The main technological challenges in interoperating over textual electronic resources are: encoding multilingual language text in an interoperable format and resolving ambiguity between mapping relations. These are addressed below.

Hundreds of character encoding sets for writing systems have been developed, e.g. ASCII, GB 18030⁶ and Unicode. Historically, different standards were formalized differently and for different purposes by different standards committees. A lack of interoperability between character encodings ensued. Linguists, restricted to standard character sets that lacked IPA support and other language-specific graphemes that they needed, made their own solutions (Bird and Simons, 2003). Some chose to represent unavailable graphemes with substitutes, e.g. the combination of to represent . Others redefined selected characters from a character encoding to map their own fonts to. One linguist's redefined character set, however, would not render properly on another linguist's computer if they did not share the same font. If two character encodings defined two character sets differently, then data could not be reliably and correctly displayed.

To circumvent these problems, OATS uses the Unicode Standard⁷ for multilingual character encoding of electronic textual data. Unicode encodes 76 scripts and includes the IPA.⁸ In principle this allows OATS to interoperate over IPA and all scripts currently encoded in Unicode. However, writing systems, scripts and transcriptions are often themselves encoded ambiguously.

Unicode encodes characters, not glyphs, in scripts and sometimes unifies duplicate characters across scripts. For example, IPA characters of Greek and Latin origin, such as and are not given a distinct position within Unicode's IPA character block. The Unicode code space is subdivided into character blocks, which generally encode characters from a single script, but as is illustrated by the IPA, characters may be dispersed across several different character blocks. This poses a challenge for interoperation, particularly with regard to homographs. Why shouldn't a speaker of Russian use the CYRILLIC SMALL LETTER A at code point U+0430 for IPA transcription, instead of LATIN SMALL LETTER A at code point U+0061, when visually they are indistinguishable?

Homoglyphs come in two flavors: linguistic and non-linguistic. Linguists are unlikely to distinguish between the <ə> LATIN SMALL LETTER SCHWA at code point U+0259 and <ǝ> LATIN SMALL LETTER TURNED E at U+01DD. And non-linguists are unlikely to differentiate any semantic difference between an open back unrounded vowel <ɑ>, the LATIN SMALL LETTER ALPHA at U+0251, and the open front unrounded vowel <a>, LATIN SMALL LETTER A at U+0061.

Another challenge is how to handle ambiguity in transcription systems and orthographies. In Serbo-Croatian, for example, the digraphs , and represent distinct phonemes and each is comprised of two graphemes, which themselves represent distinct phonemes. Words like 'to outlive' are composed of the morphemes , a prefix, and the verb . In this instance the combination of and does not represent a single digraph ; they represent two neighboring phonemes across a morpheme boundary. Likewise in English, the grapheme sequence can be both a digraph as well as a sequence of graphemes, as in and . When parsing words like and both disambiguations are theoretically available. Another example is illustrated by , , and . How should be interpreted before when English gives us both /tɔməs/ 'Thomas' and /θioudor/ 'Theodore'? The Sisaala Western word 'waterfall' could be parsed as /niik.yuru/ instead of /nii.ʧuru/ to speakers unfamiliar with the digraph of orthographies of Northwestern Ghana.

These ambiguities are due to mapping relations between phonemes and graphemes. Transcription systems and orthographies often have complex grapheme-to-phoneme relationships and they vary in levels of phonological abstraction. The transparency of the relation between spelling and phonology differs between languages like English and French, and say Serbo-Croatian. The former represent deep orthographic systems where the same grapheme can represent different phonemes in different contexts. The latter, a shallow orthography, is less polyvalent in its grapheme-to-phoneme relations. Challenges of ambiguity resolution are particularly apparent in data conversion.

4 Ontological Structure and Design

4.1 Technologies

In Philosophy, Ontology is the study of existence and the meaning of being. In the Computer and Information Sciences, ontology has been co-opted to represent a data model that represents concepts within a certain domain and the relationships between those concepts. At a low level an ontology is a taxonomy and a set of inference rules. At a higher level, ontologies are collections of information that have formalized relationships that hold between entities in a given domain. This provides the basis for automated reasoning by computer software, where content is given meaning in the sense of interpreting data and disambiguating entities. This is the vision of the Semantic Web,⁹ a common framework for integrating and correlating linked data from disparate resources for interoperability (Beckett, 2004). The General Ontology for Linguistic Description (GOLD) is grounded in the Semantic Web and provides a foundation for the interoperability of linguistic annotation to enable intelligent search across linguistic resources (Farrar and Langendoen, 2003). Several technologies are integral to the architecture of the Semantic Web, including Unicode, XML,¹⁰ and the Resource Description Framework (RDF).¹¹ OATS has been developed with these technologies and uses SPARQL¹² to query the knowledge base of linked data.

The Unicode Standard is the standard text encoding for the Web, the recommended best-practice for encoding linguistic resources, and the underlying encoding for OATS. XML is a general purpose specification for markup languages and provides a structured language for data exchange (Yergeau, 2006). It is the most widely used implementation for descriptive markup, and is in fact so extensible that its structure does not provide functionality for encoding explicit relationships across documents. Therefore RDF is needed as the syntax for representing information about resources on the Web and it is itself written in XML and is serializable. RDF describes resources in the form subject-predicate-object (or entity-relationship-entity) and identifies unique resources through Uniform Resource Identifiers (URIs). In this manner, RDF encodes meaning in sets of triples that resemble subject-verb-object constructions. These triples form a graph data structure of nodes and arcs that are non-hierarchical and can be complexly connected. Numerous algorithms have been written to access and manipulate graph structures. Since all URIs are unique, each subject, object and predicate are uniquely defined resources that can be referred to and reused by anyone. URIs give users flexibility in giving concepts a semantic representation. However, if two individuals are using different URIs for the same concept, then a procedure is needed to know that these two objects are indeed equivalent. A common example in linguistic annotation is the synonymous use of genitive and possessive. By incorporating domain specific knowledge into an ontology in RDF, disambiguation and interoperation over data become possible. GOLD addresses the challenge of interoperability of disparate linguistic annotation and termsets in morphosyntax by functioning as an interlingua between them. In OATS, the interlingua

⁶ Guójiā Biāozhǔn, the national standard character set for the People's Republic of China
⁷ ISO/IEC 10646
⁸ http://www.unicode.org/Public/UNIDATA/Scripts.txt
⁹ http://www.w3.org/2001/sw/
¹⁰ http://www.w3.org/XML/
¹¹ http://www.w3.org/RDF/
¹² http://www.w3.org/TR/rdf-sparql-query/

between systems of transcription is the IPA.

4.2 IPA as interlingua

OATS uses the IPA as an interlingua (or pivot) to which elements of systems of transcription are mapped. The IPA was chosen for its broad coverage of the sounds of the world's languages, its mainstream adoption as a system for transcription by linguists, and because it is encoded (at least mostly) in Unicode. The pivot component resides at the Character ID entity, which is in a one-to-one relationship with a Unicode Character. The Character ID entity is provided for mapping characters to multiple character encodings. This is useful for mapping IPA characters to legacy character encoding sets like IPA Kiel and SIL IPA93, allowing for data conversion between character encodings. The IPA also encodes phonetic segments as small feature bundles. Phonological theories extend the idea and interpretation of proposed feature sets, an area of debate within Linguistics. These issues should be taken into consideration when encoding interoperability via an interlingua, and should be leveraged to expand current theoretical questions that can be asked of the knowledge base. Character semantics also require consideration (Gibbon et al., 2005). Glyph semantics provide implicit information such as a resource's language, its language family assignment, its use by a specific social or scientific group, or corporate identity (Trippel et al., 2007). Documents with IPA characters or in legacy IPA character encodings provide semantic knowledge regarding the document's content, namely, that it contains transcribed linguistic data.

4.3 Ontological design

OATS consists of the following ontological classes: Character, Grapheme, Document, Mapping, MappingSystem, WritingSystem, and Scripteme. WritingSystem is further subdivided into OrthographicSystem and TranscriptionSystem. Each Document is associated with the OLAC Metadata Set,¹³ an extension of the Dublin Core Type Vocabulary¹⁴ for linguistic resources. This includes uniquely identifying the language represented in the document with its ISO 639-3 three letter language code. Each Document is also associated with an instance of WritingSystem. Each TranscriptionSystem is a set of instances of Scripteme. Every Scripteme instance is in a Mapping relation with its IPA counterpart. The MappingSystem contains a list of TranscriptionSystem instances that have Scripteme instances mapped to IPA. The Grapheme class provides the mapping between Scripteme and Character. The Character class is the set of Unicode characters and contains the Unicode version number, character name, HTML entity and code point.

5 Implementation

5.1 Data

The African language data used in OATS were mined from Systèmes alphabétiques des langues africaines,¹⁵ an online database of Alphabets des langues africaines (Hartell, 1993). Additional languages were added by hand. Currently, OATS includes 203 languages from 23 language families. Each language contains its phonemic and orthographic inventories.

5.2 Query

Linguists gain unprecedented access to linguistic resources when they are able to query across disparate data in standardized notations regardless of how the data in those resources is encoded. Currently OATS contains two phonetic notations for querying: IPA and X-SAMPA. To illustrate the querying functionality currently in place, the IPA is used to query the knowledge base of African language data¹⁶ for the occurrence of two segments. The first is the voiced palatal nasal /ɲ/. The results are captured in Table 2.

Table 2: Occurrences of voiced palatal nasal /ɲ/

Grapheme   Languages   % of Data
           114         84%
           11          8%
<ñ>        8           6%
           2           1%
           1           .05%

The voiced palatal nasal /ɲ/ is accounted for in 136 languages, or roughly 67% of the 203 languages queried. Orthographically the voiced palatal nasal /ɲ/ is represented as , ,

¹³ http://www.language-archives.org/OLAC/metadata.html
¹⁴ http://dublincore.org/usage/terms/dcmitype/
¹⁵ http://sumale.vjf.cnrs.fr/phono/
¹⁶ For a list of these languages, see http://phoible.org

<ñ>, , and interestingly as . The two languages containing , Koonzime [ozm] and Akoose [bss] of Cameroon, both lack a phonemic /ŋ/. In these languages' orthographies, both and are used to represent the phoneme /ɲ/. With further investigation, one can determine if they are contextually determined allographs like the and in Sisaala Pasaale.

The second simple query retrieves the occurrence of the voiced alveo-palatal affricate /ʤ/. Table 3 displays the results from the same sample of languages.

Table 3: Occurrences of voiced alveo-palatal affricate /ʤ/

Grapheme   Languages   % of Data
           84          92%
           2           2%
           2           2%
           1           1%
<ʤ>        1           1%
           1           1%

The voiced alveo-palatal affricate /ʤ/ is accounted for in 92 languages, or 45%, of the 203 languages sampled. The majority, over 92%, use the same grapheme to represent /ʤ/. Other graphemes found in the language sample include , , , <ʤ>, and . The stands out in this data sample. Interestingly, it comes from Sudanese Arabic, which uses Latin-based characters in its orthography. It contains the phonemes /g/, /ɣ/, and /ʤ/, which are graphemically represented as , and .

These are rather simplistic examples, but the graph data structure of RDF and the power of SPARQL provide an increasingly complex system for querying any data stored in the knowledge base and relationships as encoded by its ontological structure. For example, by combining queries such as 'which languages have the phoneme /gb/' and 'of those languages which lack its voiceless counterpart /kp/', 11 results are found from this sample of African languages, as outlined in Table 4.

Table 4: Occurrence of /gb/ and lack of /kp/

Code   Language Name   Genetic Affiliation
emk    Maninkakan      Mande
kza    Karaboro        Gur
lia    Limba           Atlantic
mif    Mofu-Gudur      Chadic
sld    Sissala         Gur
ssl    Sisaala         Gur
sus    Susu            Mande
ted    Krumen          Kru
tem    Themne          Atlantic
tsp    Toussian        Gur

5.3 Querying for phonetic data via orthography

The ability to query the knowledge base via a language-specific orthography is ultimately the same task as querying the knowledge base via the pivot. In this case, however, a mapping relation from the language-specific grapheme to IPA is first established. Since all transcription systems' graphemes must have an IPA counterpart, this relationship is always available. A query is then made across all relevant mapping relations from IPA to languages within the knowledge base.

For example, a user familiar with the Sisaala Western orthography queries the knowledge base for languages with . Initially, the OATS system establishes the relationship between and its IPA counterpart. In this case, represents the voiceless alveo-palatal affricate /ʧ/. Having retrieved the IPA counterpart, the query next retrieves all languages that have /ʧ/ in their phonemic inventories. In the present data sample, this query retrieves 99 languages with the phonemic voiceless alveo-palatal affricate. If the user then wishes to compare the graphemic distributions of /ʧ/ and /ʤ/, which was predominantly , these results are easily provided. They are displayed in Table 5.

Table 5: Occurrences of voiceless alveo-palatal affricate /ʧ/

Grapheme   Languages   % of Data
           60          62%
<ch>       28          29%
           3           3%
           2           2%
<ʧ>        1           1%
           1           1%
           1           1%
           1           1%

The 97 occurrences of /ʧ/ account for five more than the 92 languages sampled in section 5.2 that had its voiced alveo-palatal affricate counterpart. Such information provides statistics for phoneme distribution across languages in the knowledge base. OATS is a powerful tool for gathering such knowledge about the world's languages.

5.4 Code

There were two main steps in the implementation of OATS. The first was the design and creation of the OATS RDF model. This task was undertaken using Protégé,¹⁷ an open source ontology editor developed by Stanford Center for Biomedical Informatics Research. The use of Protégé was primarily to jump start the design and implementation of the ontology. The software provides a user interface for ontology modeling and development, and exports the results into RDF. After the architecture was in place, the second step was the development of a code base in Python¹⁸ for gathering data and working with RDF. This code base includes two major pieces. The first was the development of a scraper, which was used to gather phonemic inventories off of the Web by downloading Web pages and scraping them for relevant contents. Each language was collected with its ISO 639-3 code, its orthographic inventory, and the mapping relation between these symbols and their IPA phonemic symbols. The second chunk of the code base provides functionality for working with the RDF graph and uses RDFLib,¹⁹ an RDF Python module. The code includes scripts that add all relevant language data that was scraped from the Web to the OATS RDF graph, fill the graph with the Unicode database character tables, and provide SPARQL queries for querying the graph as illustrated above. There is also Python code for using OATS to convert between two character sets, and for error checking of characters within a document that are not in the target set.

6 Conclusion and Future Work

OATS is a knowledge base that supports interoperation over disparate transcription systems. By leveraging technologies for ontology description, query, and multilingual character encoding, OATS is designed to facilitate resource discovery and intelligent search over linguistic data. The current knowledge base includes an ontological description of writing systems and specifies relations for mapping segments of transcription systems to their IPA equivalents. IPA is used as the interlingua pivot that provides the ability to query across all resources in the knowledge base. OATS' data source includes 203 African languages' orthographic and phonemic inventories.

The case studies proposed and implemented in this paper present functionality to use OATS to query all data in the knowledge base via standards like the IPA. OATS also supports query via any transcription system or practical orthography in the knowledge base. Another outcome of the OATS project is the ability to check for inconsistencies in digitized lexical data. The system could also test linguist-proposed phonotactic constraints and look for exceptions in data. Data from grapheme-to-phoneme mappings, phonotactics and character encodings can provide an orthographic profile/model of a transcription or writing system. This could help to bootstrap software and resource development for low-density languages. OATS also provides prospective uses for document conversion and development of probabilistic models of orthography-to-phoneme mappings.

¹⁷ http://protege.stanford.edu/
¹⁸ http://python.org
¹⁹ http://rdflib.net/

Acknowledgements

This work was supported in part by the Max-Planck-Institut für evolutionäre Anthropologie and thanks go to Bernard Comrie, Jeff Good and Michael Cysouw. For useful comments and reviews, I thank Emily Bender, Scott Farrar, Sharon Hargus, Will Lewis, Richard Wright, and three anonymous reviewers.

References

Timothy Baldwin, Steven Bird, and Baden Hughes. 2006. Collecting Low-Density Language Materials on the Web. In Proceedings of the 12th Australasian World Wide Web Conference (AusWeb06).
David Beckett. 2004. RDF/XML Syntax Specification (Revised). Technical report, W3C.
Steven Bird and Gary F. Simons. 2003. Seven Dimensions of Portability for Language Documentation and Description. Language, 79(3):557–582.
Regina Blass. 1975. Sisaala-English, English-Sisaala Dictionary. Institute of Linguistics, Tamale, Ghana.

Adams Bodomo. 1997. The Structure of Dagaare. Stanford Monographs in African Languages. CSLI Publications.
Florian Coulmas. 1999. The Blackwell Encyclopedia of Writing Systems. Blackwell Publishers.
Scott Farrar and Terry Langendoen. 2003. A Linguistic Ontology for the Semantic Web. GLOT, 7(3):97–100.
Scott Farrar and William D. Lewis. 2006. The GOLD Community of Practice: An Infrastructure for Linguistic Data on the Web. In Language Resources and Evaluation.
Dafydd Gibbon, Baden Hughes, and Thorsten Trippel. 2005. Semantic Decomposition of Character Encodings for Linguistic Knowledge Discovery. In Proceedings of Jahrestagung der Gesellschaft für Klassifikation 2005.
Rhonda L. Hartell. 1993. Alphabets des langues africaines. UNESCO and Société Internationale de Linguistique.
William D. Lewis. 2006. ODIN: A Model for Adapting and Enriching Legacy Infrastructure. In Proceedings of the e-Humanities Workshop, held in cooperation with e-Science 2006: 2nd IEEE International Conference on e-Science and Grid Computing.
Stuart McGill, Samuel Fembeti, and Mike Toupin. 1999. A Grammar of Sisaala-Pasaale, volume 4 of Language Monographs. Institute of African Studies, University of Ghana, Legon, Ghana.
Stuart McGill. 2004. Focus and Activation in Paasaal: the particle rɛ. Master's thesis, University of Reading.
Steven Moran. 2008. A Grammatical Sketch of Isaalo (Western Sisaala). VDM.
The Unicode Consortium. 2007. The Unicode Standard, Version 5.0. Boston, MA, Addison-Wesley.
Mike Toupin. 1995. The Phonology of Sisaale Pasaale. Collected Language Notes, 22.
Thorsten Trippel, Dafydd Gibbon, and Baden Hughes. 2007. The Computational Semantics of Characters. In Proceedings of the Seventh International Workshop on Computational Semantics (IWCS-7), pages 324–329.
Francois Yergeau. 2006. Extensible Markup Language (XML) 1.0 (Fourth Edition). W3C Recommendation 16 August 2006, edited in place 29 September 2006.
Kie Zuraw. 2006. Using the Web as a Phonological Corpus: a case study from Tagalog. In Proceedings of the 2nd International Workshop on Web as Corpus.
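
The orthography-based query of section 5.3 and the combined query behind Table 4 can be illustrated with a short Python sketch. This is a minimal illustration under stated assumptions, not the OATS code base: plain tuples stand in for RDF triples (OATS itself uses RDFLib and SPARQL), the predicate name mapsToIPA and all function names are hypothetical, and the only data encoded are the graphemes that Table 1 gives for the voiceless alveo-palatal affricate and for /kp/ in the three Sisaala languages.

```python
from collections import defaultdict

MAPS_TO_IPA = "mapsToIPA"  # hypothetical predicate name, not from OATS

# Each triple links a scripteme (language code, grapheme) to its IPA pivot.
# Data: the /kp/ and voiceless alveo-palatal affricate columns of Table 1.
TRIPLES = [
    (("sig", "kp"), MAPS_TO_IPA, "kp"),  # Sisaala Pasaale
    (("sig", "ky"), MAPS_TO_IPA, "ʧ"),
    (("sil", "kp"), MAPS_TO_IPA, "kp"),  # Sisaala Tumulung
    (("sil", "ch"), MAPS_TO_IPA, "ʧ"),
    (("ssl", "ky"), MAPS_TO_IPA, "ʧ"),   # Sisaala Western (no /kp/)
]


def ipa_for(language, grapheme):
    """Step 1: resolve a language-specific grapheme to its IPA pivot."""
    for (lang, graph), predicate, ipa in TRIPLES:
        if predicate == MAPS_TO_IPA and (lang, graph) == (language, grapheme):
            return ipa
    raise KeyError(f"no mapping for <{grapheme}> in [{language}]")


def graphemes_for(ipa):
    """Step 2: per language, collect the graphemes mapped to an IPA segment."""
    found = defaultdict(list)
    for (lang, graph), predicate, target in TRIPLES:
        if predicate == MAPS_TO_IPA and target == ipa:
            found[lang].append(graph)
    return dict(found)


def query_by_orthography(language, grapheme):
    """Grapheme -> IPA pivot -> every language that has that segment."""
    return graphemes_for(ipa_for(language, grapheme))


def inventory(language):
    """The set of IPA segments mapped for one language."""
    return {ipa for (lang, _), predicate, ipa in TRIPLES
            if predicate == MAPS_TO_IPA and lang == language}


def languages_with_but_without(present, absent):
    """Combined query: languages that have one segment and lack another."""
    languages = sorted({lang for (lang, _), _, _ in TRIPLES})
    return [lang for lang in languages
            if present in inventory(lang) and absent not in inventory(lang)]


if __name__ == "__main__":
    # A Sisaala Western user asks for <ky>; the pivot finds <ch> in Tumulung.
    print(query_by_orthography("ssl", "ky"))
    # Which of the three languages have the affricate but lack /kp/?
    print(languages_with_but_without("ʧ", "kp"))
```

The two-step lookup mirrors the description in section 5.3: the mapping relation from grapheme to IPA is established first, and the query is then made from IPA across all languages. In an RDF setting each tuple would be a triple in the graph and both functions would be SPARQL queries; the tuple list is used here only to keep the sketch self-contained.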

Author Index

Adegbola, Tunde, 53
Alemu Argaw, Atelach, 104
Anberbir, Tadesse, 46
Asker, Lars, 104

Badenhorst, Jaco, 1
Bański, Piotr, 89
Barnard, Etienne, 1
Beermann, Dorothee, 74
Berg, Ansu, 66
Bosch, Sonja, 96

Chiarcos, Christian, 17

Davel, Marelie, 1
De Pauw, Guy, 9
de Schryver, Gilles-Maurice, 9

Enguehard, Chantal, 81

Faaß, Gertrud, 38
Fiedler, Ines, 17
Finkel, Raphael, 25

Gambäck, Björn, 104
Groenewald, Hendrik Johannes, 32
Grubic, Mira, 17
Gumede, Tebogo, 59

Haida, Andreas, 17
Hartmann, Katharina, 17
Heid, Ulrich, 38

Mihaylov, Pavel, 74
Modi, Issouf, 81
Moran, Steven, 112

Odejobi, Odetunji Ajadi, 25
Olsson, Fredrik, 104

Plauché, Madelaine, 59
Pretorius, Laurette, 66, 96
Pretorius, Rigardt, 66
Prinsloo, Danie, 38

Ritz, Julia, 17

Schwarz, Anne, 17

Takara, Tomio, 46
Taljard, Elsabé, 38

Van Heerden, Charl, 1
Viljoen, Biffie, 66

Wagacha, Peter Waiganjo, 9
Wójtowicz, Beata, 89

Zeldes, Amir, 17
Zimmermann, Malte, 17