Classification of South African Languages Using Text and Acoustic Based Methods: a Case of Six Selected Languages
Peleira Nicholas Zulu
University of KwaZulu-Natal
Electrical, Electronic and Computer Engineering
Durban, 4041, South Africa

Proceedings of NAACL-HLT 2013, pages 280–287, Atlanta, Georgia, 9–14 June 2013. © 2013 Association for Computational Linguistics

Abstract

Language variations are generally known to have a severe impact on the performance of Human Language Technology Systems. In order to predict or improve system performance, a thorough investigation into these variations, similarities and dissimilarities, is required. Distance measures have been used in several applications of speech processing to analyze varying speech attributes. However, not much work has been done on language distance measures, and even less work has been done involving South African languages. This study explores two methods for measuring the linguistic distance of six South African languages: a text-based method (the Levenshtein distance) and an acoustic approach using extracted mean pitch values. The Levenshtein distance uses parallel word transcriptions from all six languages with as few as 144 words, whereas the pitch method is text-independent and compares mean language pitch differences. Cluster analysis resulting from the distance matrices from both methods correlates closely with human perceptual distances and existing literature about the six languages.

1 Introduction

The development of objective metrics to assess the distances between different languages is of great theoretical and practical importance. Currently, subjective measures have generally been employed to assess the degree of similarity or dissimilarity between different languages (Gooskens & Heeringa, 2004; Van-Bezooijen & Heeringa, 2006; Van-Hout & Münstermann, 1981), and those subjective decisions are, for example, the basis for classifying separate languages, and certain groups of language variants as dialects of one another. It is well known that languages are complex; they differ in vocabulary, grammar, writing format, syntax and many other characteristics. This makes it difficult to construct objective comparative measures between languages. Even if one intuitively knows, for example, that English is closer to French than it is to Chinese, what are the objective factors that allow one to assess the levels of distance?

This bears substantial similarities to the analogous questions that have been asked about the relationships between different species in the science of cladistics. As in cladistics, the most satisfactory answer would be a direct measure of the amount of time that has elapsed since the languages' first split from their most recent common ancestor. Also, as in cladistics, it is hard to measure this from the available evidence, and various approximate measures have to be employed instead. In the biological case, recent decades have seen tremendous improvements in the accuracy of biological measurements as it has become possible to measure differences between DNA sequences. In linguistics, the analogue of DNA measurements is historical information on the evolution of languages, and the more easily measured, though indirect, measurements (akin to the biological phenotype) are either the textual or acoustic representations of the languages in question.

In the current article, we focus on language distance measures derived from both text and acoustic formats; we apply two different techniques, namely Levenshtein distance between orthographic word transcriptions, and distances between language pitch means, in order to obtain measures of dissimilarity amongst a set of languages. These methods are used to obtain language groupings which are represented graphically using multidimensional scaling and dendrograms, two standard statistical techniques. This allows us to visualize and assess the methods relative to known linguistic facts in order to judge their relative reliability (Zulu, Botha, & Barnard, 2008).

Our evaluation is based on six of the eleven official languages of South Africa (see footnote 1). The eleven official languages fall into two distinct groups, namely the Germanic group (represented by English and Afrikaans) and the South African Bantu languages, which belong to the South Eastern Bantu group. The South African Bantu languages can further be classified in terms of different sub-groupings: Nguni (consisting of Zulu, Xhosa, Ndebele and Swati), Sotho (consisting of Southern Sotho, Northern Sotho and Tswana), and a pair that falls outside these sub-families (Tsonga and Venda). The six languages chosen for our evaluation are English, Afrikaans, Zulu, Xhosa, Northern Sotho (also known as Sepedi) and Tswana, which equally represent the three groups: Germanic, Nguni and Sotho.

Footnote 1: Data for all eleven languages is available on the Lwazi website (http://www.meraka.org.za/lwazi/index.php).

We believe that an understanding of these language distances is not only of inherent interest, but also of great practical importance. For purposes such as language learning, the selection of target languages for various resources and the development of Human Language Technologies, reliable knowledge of language distances would be of great value. Consider, for example, the common situation of an organization that wishes to publish information relevant to all languages in a particular multi-lingual community, but has insufficient funding to do so. Such an organization can be guided by knowledge of language distances and mutual intelligibility between languages to make an appropriate choice of publication languages.

The following sections describe the Levenshtein distance and pitch characteristics in detail. Thereafter, the paper presents an evaluation on the six languages of South Africa, highlighting language groupings and proximity patterns. In conclusion, the paper discusses the results.

2 Theoretical Background

Orthographic transcriptions are one of the most basic types of annotation used for speech transcription, and are particularly important in most fields of research concerned with spoken language. The orthography of a language refers to the set of symbols used to write a language and includes its writing system. English, for example, has an alphabet of 26 letters which includes both consonants and vowels. However, each English letter may represent more than one phoneme, and each phoneme may be represented by more than one letter. In the current research, we investigate the use of Levenshtein distance on orthographic transcriptions for the assessment of language similarities.

On the other hand, speech has been, and still is, the most natural form of communication. Prosodic characteristics such as rhythm, stress and intonation in speech convey important information regarding the identity of a spoken language. Results of perception studies on spoken language identification confirm that prosodic information, specifically pitch and intensity (which represent intonation and stress respectively), is useful for language identification (Kometsu, Mori, Arai, & Murahara, 2001; Mori et al., 1999). This paper presents a preliminary investigation of pitch and its role in determining acoustic based language distances.

2.1 Levenshtein Distance

There are several ways in which phoneticians have tried to measure the distance between two linguistic entities, most of which are based on the description of sounds via various representations. This section introduces the Levenshtein Distance Measure, one of the more popular sequence-based distance measures. In 1995 Kessler introduced the use of the Levenshtein Distance as a tool for measuring linguistic distances between dialects (Kessler, 1995). The basic idea behind the Levenshtein Distance is to imagine that one is rewriting or transforming one string into another. Kessler successfully applied the Levenshtein Distance measure to the comparison of Irish dialects. In his work, the strings were transcriptions of word pronunciations. In general, rewriting is effected by basic operations, each of which is associated with a cost, as illustrated in Table 1 in the transformation of the string "mošemane" to the string "umfana", which are the orthographic forms of the word "boy" in Northern Sotho and Zulu respectively.

spoken language. This controlled modulation of pitch is referred to as intonation. The sound units are shortened or lengthened in accordance with some underlying pattern, giving rhythmic properties to speech. The information attained from these rhythmic patterns increases the intelligibility of spoken languages, enabling the listener to segment continuous speech into phrases and words with ease (Shriberg, Stolcke, Hakkani-Tur, & Tur, 2000). The characteristics that make us perceive this and other information such as stress, accent and emotion are collectively referred to as prosody. Comparisons have shown that languages differ greatly in their prosodic features (Hirst & Cristo, 1998), therefore providing a basis for objective comparison between languages. Further, pitch is a perceptual attribute of sound, the physical correlate of which is fundamental frequency (F0), which rep-

String      Operation        Cost
mošemane    delete m         1
ošemane     delete š         1
oemane      delete e         1
omane       insert f         1
omfane      substitute o/u   2
umfane      substitute e/a   2
umfana
Total cost                   8

Table 1. Levenshtein