An Open-Source Morphological Zulu Corpus
Ukwabelana - An open-source morphological Zulu corpus

Sebastian Spiegler, Intelligent Systems Group, University of Bristol
Andrew van der Spuy, Linguistics Department, University of the Witwatersrand
Peter A. Flach, Intelligent Systems Group, University of Bristol

Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pages 1020–1028, Beijing, August 2010

Abstract

Zulu is an indigenous language of South Africa, and one of the eleven official languages of that country. It has about 11 million speakers. Although it is similar in size to some Western languages, e.g. Swedish, it is considerably under-resourced. This paper presents a new open-source morphological corpus for Zulu named the Ukwabelana corpus. We describe the agglutinating morphology of Zulu with its multiple prefixation and suffixation, and also introduce our labeling scheme. Further, the annotation process is described and all individual resources are explained. These comprise a list of 10,000 labeled and 100,000 unlabeled word types, 3,000 part-of-speech (POS) tagged and 30,000 raw sentences as well as a morphological Zulu grammar, and a parsing algorithm which hypothesizes possible word roots and enumerates parses that conform to the Zulu grammar. We also provide a POS tagger which assigns the grammatical category to a morphologically analyzed word type. As it is hoped that the corpus and all resources will be of benefit to any person doing research on Zulu or on computer-aided analysis of languages, they will be made available in the public domain from http://www.cs.bris.ac.uk/Research/MachineLearning/Morphology/Resources/.

1 Introduction

Zulu (also known as isiZulu) is a Bantu language of South Africa, classified as S.30 in Guthrie's classification scheme (Guthrie, 1971). Since 1994, it has been recognized as one of the eleven official languages of South Africa. It has a written history of about 150 years: the first grammar was published by Grout (1859), and the first dictionary by Colenso (1905). There are about 11 million mother-tongue speakers, who constitute approximately 23% of South Africa's population, making Zulu the country's largest language.

Zulu is highly mutually intelligible with the Xhosa, Swati and Southern Ndebele languages, and with Ndebele of Zimbabwe (Lanham, 1960), to the extent that all of these can be considered dialects or varieties of a single language, Nguni. Despite its size, Zulu is considerably under-resourced compared to Western languages with similar numbers of speakers, e.g. Swedish. There are only about four regular publications in Zulu, there are few published books, and the language is not used as a medium of instruction.

This is partly due to the short timespan of its written history, but the main reason is, of course, the apartheid history of South Africa: for most of the twentieth century resources were allocated to Afrikaans and English, the two former official languages, and relatively few resources to the indigenous Bantu languages. Since 1994, Zulu has had a much larger presence in the media, with several television programs being broadcast in Zulu every day. Yet much needs to be done in order to improve the resources available to Zulu speakers and students of Zulu.

The aim of the project reported in this paper was to establish a Zulu corpus, named the Ukwabelana corpus¹, consisting of morphologically labeled words (that is, word types) and part-of-speech (POS) tagged sentences. Along with the labeled corpus, unlabeled words and sentences, a morphological grammar, a semi-automatic morphological analyzer and a POS tagger for morphologically analyzed words will be provided.

¹ Ukwabelana means 'to share' in Zulu, where the 'k' is pronounced voiced like a [g].

The sources used for the corpus were limited to fictional works and the Zulu Bible. This means that there is not a wide variety of registers, and perhaps even of vocabulary items. This defect will have to be corrected in future work.

The Ukwabelana corpus can be used to develop and train automatic morphological analyzers, which in turn can tag a large corpus of written Zulu, similar to the Brown corpus or the British National Corpus. Moreover, the list of POS tagged sentences is an essential step towards building an automatic syntactic tagger, which still does not exist for Zulu, and a tagged corpus of Zulu. Such a corpus would be beneficial to language researchers as it provides them with examples of actual usage, as opposed to elicited or invented examples, which may be artificial or unlikely to occur in real discourse. This would greatly improve the quality of Zulu dictionaries and grammars, most of which rely heavily on the work of Doke (1927) and Doke, Malcolm and Sikakana (1958), with little in the way of innovation.

Morphological tagging is also useful for practical computational applications like predictive text, spell-checking, grammar checking and machine translation; in the case of Zulu, where a large percentage of grammatical information is conveyed by prefixes and suffixes rather than by separate words, it is essential. For example, in English, the negative is expressed by means of a separate word 'not', but in Zulu the negative is constructed using a prefix-and-suffix combination on the verb, and this combination differs according to the mood of the verb (indicative, participial or subjunctive).

The practical computational applications mentioned could have a very great impact on the use of Zulu as a written language, as spell-checking and grammar checking would benefit proofreaders, editors and writers. Machine translation could aid in increasing the number of texts available in Zulu, thus making it more of a literary language, and allowing it to become established as a language of education. The use of Zulu in public life could also increase. Currently, the tendency is to use English, as this is the language that reaches the widest audience. If high-quality automatic translation becomes available, this would no longer be necessary.

As it is hoped that the Ukwabelana corpus will be of benefit to any person doing research on Zulu or on computer-aided analysis of languages, it will be made available as the first morphologically analyzed corpus of Zulu in the public domain.

2 Related work

In this section, we will give an overview of linguistic research on Nguni languages, following the discussions in van der Spuy (2001), and thereafter a summary of computational approaches to the analysis of Zulu.

2.1 Linguistic research on Nguni languages

The five Nguni languages Zulu, Xhosa, South African Ndebele, Swati, and Zimbabwean Ndebele are highly mutually intelligible, and for this reason, works on any of the other Nguni languages are directly relevant to an analysis of Zulu.

There have been numerous studies of Nguni grammar, especially its morphology; in fact, the Nguni languages probably rival Swahili and Chewa for the title of most-studied Bantu language. The generative approach to morphological description (as developed by Aronoff (1976), Selkirk (1982), Lieber (1980), Lieber (1992)) has had very little influence on most of the work that has been done on Nguni morphology.

Usually, the descriptions have been atheoretical or structuralist. Doke's paradigmatic description of the morphology (Doke, 1927; Doke, 1935) has remained the basis for linguistic work in the Southern Bantu languages. Doke (1935) criticized previous writers on Bantu grammars for basing their classification, treatment and terminology on their own mother tongue or Latin. His intention was to create a grammatical structure for Bantu which did not conform to European or classical standards. Nevertheless, Doke himself could not shake off the European mindset: he treated the languages as if they had inflectional paradigms, with characteristics like subjunctive or indicative belonging to the whole word, rather than to identifiable affixes; in fact, he claimed (1950) that Bantu languages are "inflectional with [just] a tendency to agglutination", and assumed that the morphology was linear, not hierarchical. Most subsequent linguistic studies and reference grammars of the Southern Bantu languages have been directed at refining or redefining Doke's categories from a paradigmatic perspective.

Important Nguni examples are Van Eeden (1956), Van Wyk (1958), Beuchat (1966), Wilkes (1971), Nkabinde (1975), Cope (1984), Davey (1984), Louw (1984), Ziervogel et al. (1985), Gauton (1990), Gauton (1994), Khumalo (1992), Poulos and Msimang (1998), Posthumus (1987), Posthumus (1988), Posthumus (1988) and Posthumus (2000). Among the very few generative morphological descriptions of Nguni are Lanham (1971), Mbadi (1988) and Du Plessis (1993). Lanham (1971) gives a transformational analysis of Zulu adjectival and relative forms. This analysis can be viewed as diachronic rather than synchronic.

[…] logical and morphosyntactic rules which are learnt by consulting an oracle, in their case a linguistic expert who corrects analyses. The framework then revises its grammar so that the updated morpheme lists and rules do not contradict previously found analyses. Botha and Barnard (2005) compared two approaches for gathering Zulu text corpora from the World Wide Web. They concluded that using commercial search engines for finding Zulu websites outperforms web crawlers even with a carefully selected starting point. They attributed this to the fact that most documents on the internet are in one of the world's dominant languages. Bosch and Eiselen (2005) presented a spell checker for Zulu based on morphological analysis and regular expressions. It was shown that after a certain threshold for the lexicon size performance could only be im-