Corpus-Based Lexicography for Sesotho

Total Page:16

File Type:pdf, Size:1020Kb

Corpus-Based Lexicography for Sesotho Corpus-based Lexicography for Sesotho by Mmasibidi Setaka Student Number 10657542 A dissertation submitted in fulfilment of the requirements for the degree Master of Arts in the Department of African Languages at the Faculty of Humanities UNIVERSITY OF PRETORIA, SUPERVISOR: Prof. D.J. Prinsloo August 2018 Acknowledgements Thanks God for allowing this opportunity and forever gracing me with your mercy and blessings. To my queen mama Ellen Setaka and my sister Madibe Setaka and brother Ntsheke Setaka, you have been more than a blessing in my life and I am most grateful for your unwavering support, advices and encouragement. You make my world brighter. To my son Relebohile Setaka, you motivate me to do more. To my late grandparents, I know you are watching over me always. Thank you and Rest in Peace. Prof. Prinsloo, thank you for believing in me. Your contribution in my studies gave me life lessons about patience, hope and success. I cannot thank you enough for all that you have done for me. SADiLaR, all could have been really hard to achieve if it was not for you. Thank you for the opportunity. Pieter Labuschagne, thank you for always helping me. To my prayer warrior, Matlhogonolo Tsae, thank you for your prayers and constantly enquiring about my studies. That is true friendship and I am most grateful to have you in my life. To all my friends and family who stood by me, I salute you. Modimo o le hlohonolofatse. i 1. Table of Contents Acknowledgements ........................................................................................................... i Abstract ........................................................................................................................ viii Chapter 1: Introduction .................................................................................................... 1 1.1 Introduction ...................................................................................................................... 1 1.2 Rationale and background to the study ........................................................................... 4 Figure 1.1: The treatment of medicine............................................................................... 9 Figure 1.2: Bilingual Sesotho dictionaries .......................................................................... 9 1.3 Aims of the study ........................................................................................................... 10 1.4 Building organic corpora for Sesotho ............................................................................. 11 1.5 Objectives of the research ............................................................................................. 12 1.5.1 Description of principles and practice for the compilation of modern Sesotho dictionaries and evaluating selected English dictionaries ................................................... 13 1.5.2 Critical evaluation of African-language dictionaries with specific focus on Sesotho dictionaries. .......................................................................................................................... 13 1.5.3 Designing a comprehensive lexicographic approach for the compilation of Sesotho dictionaries ........................................................................................................................... 14 1.6 Research methodology .................................................................................................. 15 1.7 Preview and layout of chapters ..................................................................................... 15 Chapter 2: The Sesotho language .................................................................................... 18 2.1 Introduction .................................................................................................................... 18 2.2 Classification of African languages ................................................................................. 19 Figure 2.1: Map of Africa .................................................................................................. 20 2.3 South African Bantu languages ...................................................................................... 23 2.4 The Constitution of South Africa .................................................................................... 24 2.5 History and classification of the Sesotho language ....................................................... 26 Figure 2.5.1: Structure of South-eastern Bantu zone ....................................................... 26 Table 2.5.2: Sesotho Noun Classes ................................................................................... 28 2.6 Sesotho dialects, alphabets and vowel variation ........................................................... 29 Figure 2.6.1: Sesotho phonetic chart ............................................................................... 30 Figure 2.6.2: Sesotho clicks (1) ......................................................................................... 30 ii Figure 2.6.3: Sesotho clicks (2) ......................................................................................... 31 Figure 2.6.4: Southern Sotho alphabetic representation and pronunciation .................. 31 2.7 The Sesotho speech community .................................................................................... 32 2.8 Sesotho orthography ...................................................................................................... 32 2.9 The beginning of Sesotho literature............................................................................... 32 2.10 Orthography: Lesotho Sesotho (LSe) vs South African Sesotho (SASe) ....................... 33 2.11 News and Media ........................................................................................................... 34 2.12 Conclusion .................................................................................................................... 36 Chapter 3: Compilation and utilization of Sesotho corpora .............................................. 38 3.1 Introduction .................................................................................................................... 38 3.2 History of electronic corpora ......................................................................................... 40 Table 3.1: English corpora ................................................................................................ 41 3.3 History of electronic corpora: an illustration for African languages .............................. 41 3.4 Corpus size...................................................................................................................... 42 3.5 Corpus designs ............................................................................................................... 44 Table 3.5.1: Basic composition of Brown and LOB corpora ............................................. 45 Figure 3.5.2: Longman/Lancaster English language corpus – current structure.............. 46 Figure 3.5.3: The British National Corpus ......................................................................... 47 Figure 3.5.4: The International Corpus of English ............................................................ 48 3.6 Balanced versus representative corpora ....................................................................... 48 3.7 The organic corpus ......................................................................................................... 50 3.8 The design of the Sesotho corpus .................................................................................. 52 3.9 Conclusion ...................................................................................................................... 53 Chapter 4: Corpus-enabled Macrostructural compilation ................................................ 55 4.1 Introduction .................................................................................................................... 55 4.2 Types of Macrostructures .............................................................................................. 57 Figure 4.2.1: The structure of a dictionary ....................................................................... 58 Figure 4.2.2: Different macrostructures ........................................................................... 60 4.3 The macrostructural components .................................................................................. 61 Figure 4.3.1: Presentation of the front matter................................................................. 62 iii 4.4 Corpora and the Macrostructure ................................................................................... 63 4.5 Frequency tests .............................................................................................................. 64 4.6 Putting frequencies in the dictionary ............................................................................. 65 Figure 4.6: Diamond representation of frequencies ........................................................ 66 4.7 Lemmatisation ................................................................................................................ 66 4.8 Lemmatisation approaches, strategies and lexicographic traditions ............................ 67 Figure 4.8.1: Example of rule-orientated approach ......................................................... 69 Figure 4.8.2: Southern Sotho pronunciation
Recommended publications
  • The Distribution and Interpretation of the Qualificative
    THE DISTRIBUTION AND INTERPRETATION OF THE QUALIFICATIVE IN SESOTHO by ’MADIRA LEONIAH THETSO Submitted in accordance with the requirement for the degree of DOCTOR OF LITERATURE AND PHILOSOPHY in the subject of African Languages at the UNIVERSITY OF SOUTH AFRICA Supervisor: Prof. M.L. Mojapelo Co-Supervisor: Prof. D.E. Mutasa JUNE 2018 DECLARATION STUDENT NUMBER: 55769365 I, ’MADIRA LEONIAH THETSO, declare that THE DISTRIBUTION AND INTERPRETATION OF THE QUALIFICATIVE IN SESOTHO is my own work and that all the sources that I have used or quoted have been indicated and acknowledged by means of complete references. ------------------------- ------------------------ SIGNATURE DATE i ABSTRACT This study explores the syntax of the substantive phrase, more especially substantive phrase composed of more than one qualificative, in Sesotho. Adopting interviews, questionnaires and documents, the study seeks to investigate the syntactic sequence of qualificatives, their relation to the modified head word and influence of such ordering pattern in the phrase. Structurally, qualificatives comprise two components, namely the qualificative concord and stem. The qualificative serves to give varied information about the implicit or explicit substantive resulting in seven types of qualificatives in Sesotho, be they the Adjective, Demonstrative, Enumerative, Interrogative, Possessive, Quantifier and Relative. From the Minimalist perspective, the qualificative is recursive. The study established a maximum of five qualificatives in a single phrase. The number is generally achieved by recurrence of the Adjective, the Possessive and the Relative up to a maximum of four of the same qualificative in a single phrase. It is observed that the recurrence of the Demonstrative, Interrogative, Enumerative and Quantifier is proscribed in Sesotho.
    [Show full text]
  • Sesotho Personal Names As Epithets: a Systemic Functional Linguistics Approach
    Sesotho Personal Names as Epithets: A Systemic Functional Linguistics Approach [PP: 43-56] Masechaba Mahloli L Mokhathi-Mbhele Faculty of Humanities, National University of Lesotho Lesotho ABSTRACT This study examined how the epithet feature of the nominal group described by Systemic Functional Linguistics (SFL) theory characterized independent clause Sesotho personal names. These names were described as authentic social discourse that exchanges information. Their semantics of interaction displayed speech roles such as statements, demands and commands, as questions and as the exclamative. The aim was to explore how these epithet personal names structured with features of these speech roles give the name awarder’s evaluation of the situation (modality) and context in which the child was born. They function as enacted messages pointing in various ways because these speech roles enfold the art of negotiating attitudes and through this art modality is highly incorporated. Data was collected from national examinations pass lists, admissions, telephone directories, media and employment roll lists from Public, Private, Tertiary and Orphanage institutions. Methodology is qualitative and it allows the researcher to make sense of or interpret phenomena in terms of meanings that people bring to them. It explores and describes the names within a specific context of Basotho and thus attempts to make meanings from the language users’ view, not fabricated views. In this way it displays modality and the negotiated attitudes. These help investigate the “why” and the “how” of the people’s decision making. The contribution by this article extends SFL-Onomastica relation and literature and opens up grammatical description functionally using material resourced from the speakers’ creative potential.
    [Show full text]
  • TANKISO LUCIA MOTJOPE-MOKHALI Submitted in Accordance with the Requirements for the Degree of Doctor of Literature and Philosoph
    A COMPARATIVE ANALYSIS OF SESUTO-ENGLISH DICTIONARY AND SETHANTŠO SA SESOTHO WITH REFERENCE TO LEXICAL ENTRIES AND DICTIONARY DESIGN by TANKISO LUCIA MOTJOPE-MOKHALI submitted in accordance with the requirements for the degree of Doctor of Literature and Philosophy in the subject African Language at the UNIVERSITY OF SOUTH AFRICA SUPERVISOR: PROFESSOR I.M. KOSCH CO-SUPERVISOR: PROFESSOR M.J. MAFELA November 2016 Student No.: 53328183 DECLARATION I, Tankiso Lucia Motjope-Mokhali, declare that A Comparative Analysis of Sesuto- English Dictionary and Sethantšo sa Sesotho with Reference to Lexical Entries and Dictionary Design is my own work and that all the sources that I have used or cited have been indicated and acknowledged by means of complete references. Signature……………………………… Date………………………….. i ACKNOWLEDGEMENTS I thank God for giving me the strength, power, good health and guidance to complete this race. I would not have made it if it were not for His mercy. As the African saying affirms ‘A person is a person because of other people’, which in Sesotho is translated as Motho ke motho ka batho, I believe that this thesis would not have been completed if it had not been for the assistance of other people. In particular, I extend my sincere gratitude to my Supervisors Professor I.M. Kosch and Professor M.J. Mafela for their guidance throughout the completion of this thesis. Their hard work, encouragement, constant support (both academically and personally) as well as their insightful comments offered throughout this study, gave me courage to continue. My special thanks also go to NIHSS/SAHUDA for its financial support during the carrying out of this study.
    [Show full text]
  • Anghelescu, Andrei, “Distinctive Transitional Probabilities Across
    Distinctive transitional probabilities across words and morphemes in Sesotho∗ Andrei Anghelescu University of British Columbia Abstract: This project examines a corpus of child-directed Sesotho speech in order to gauge how adequate transitional probability (TP) between units of sound is for predicting word boundaries and word-internal morpheme boundaries. The predictor is the likelihood of some phonological unit, a segment or syllable, given either the preceding or following unit, i:e: the (forward or backward) transitional probability of those two units. I conclude that transitional probability between adjacent segments and between adjacent syllables is not useful as a predictor of boundaries in Sesotho. Keywords: phonological learning, Sesotho, transitional probability 1 Introduction Several models of word segmentation in first language acquisition are based on using the probability of phonological sequences as a predictor. For example, Saffran et al. (1996a) argue that children can segment words out of the speech stream based on the likelihood of a syllable or segment sequence. If transitional probability is a reliable way for children who are acquiring their first language to segment the speech stream, then there should be an observable trend in real language data such that better than chance predictions can be made about boundary-hood based on TP; this has been demon- strated using computational modeling for some languages by Daland and Pierrehumbert (2011). The situation modeled by this examination is that of an infant attempting to posit boundaries between units of speech sounds. An infant learner has access to a wide variety of stimuli that are relevant in this task; however, in this model we abstract away from meaning and just examine what contribution can be made by simply paying attention to the distribution of sounds (specifically, syllables and segments) in the speech stream.
    [Show full text]
  • The Wili Benchmark Dataset for Written Natural Language Identification
    1 The WiLI benchmark dataset for written II. WRITTEN LANGUAGE BASICS language identification A language is a system of communication. This could be spoken language, written language or sign language. A spoken language Martin Thoma can have several forms of written language. For example, the E-Mail: [email protected] spoken Chinese language has at least three writing systems: Traditional Chinese characters, Simplified Chinese characters and Pinyin — the official romanization system for Standard Chinese. Abstract—This paper describes the WiLI-2018 benchmark dataset for monolingual written natural language identification. Languages evolve. New words like googling, television and WiLI-2018 is a publicly available,1 free of charge dataset of Internet get added, but written languages are also refactored. short text extracts from Wikipedia. It contains 1000 paragraphs For example, the German orthography reform of 1996 aimed at of 235 languages, totaling in 235 000 paragraphs. WiLI is a classification dataset: Given an unknown paragraph written in making the written language simpler. This means any system one dominant language, it has to be decided which language it which recognizes language and any benchmark needs to be is. adapted over time. Hence WiLI is versioned by year. Languages do not necessarily have only one name. According to Wikipedia, the Sranan language is also known as Sranan Tongo, Sranantongo, Surinaams, Surinamese, Surinamese Creole and I. INTRODUCTION Taki Taki. This makes ISO 369-3 valuable, but not all languages are represented in ISO 369-3. As ISO 369-3 uses combinations The identification of written natural language is a task which of 3 Latin letters and has 547 reserved combinations, it can appears often in web applications.
    [Show full text]
  • Modeling Word and Morpheme Order in Natural Language As an Efficient Tradeoff of Memory and Surprisal
    Modeling word and morpheme order in natural language as an efficient tradeoff of memory and surprisal Michael Hahn, Judith Degen, Richard Futrell June 24, 2020 Abstract Memory limitations are known to constrain language comprehension and production, and have been argued to account for crosslinguistic word order regularities. However, a systematic assessment of the role of memory limitations in language structure has proven elusive, in part because it is hard to ex- tract precise large-scale quantitative generalizations about language from existing mechanistic models of memory use in sentence processing. We provide an architecture-independent information-theoretic formalization of memory limitations which enables a simple calculation of the memory efficiency of languages. Our notion of memory efficiency is based on the idea of a memory–surprisal tradeoff : a certain level of average surprisal per word can only be achieved at the cost of storing some amount of information about past context. Based on this notion of memory usage, we advance the Efficient Tradeoff Hypothesis: the order of elements in natural language is under pressure to enable favorable memory- surprisal tradeoffs. We derive that languages enable more efficient tradeoffs when they exhibit informa- tion locality: when predictive information about an element is concentrated in its recent past. We provide empirical evidence from three test domains in support of the Efficient Tradeoff Hypothesis: a reanalysis of a miniature artificial language learning experiment, a large-scale study of word order in corpora of 54 languages, and an analysis of morpheme order in two agglutinative languages. These results suggest that principles of order in natural language can be explained via highly generic cognitively motivated principles and lend support to efficiency-based models of the structure of human language.
    [Show full text]
  • Race/Ethnicity And
    Pragmatics 22:3.477-499 (2012) International Pragmatics Association DOI: 10.1075/prag.22.3.06ste ETHNICITY AND CODESWITCHING: ETHNIC DIFFERENCES IN GRAMMATICAL AND PRAGMATIC PATTERNS OF CODESWITCHING IN THE FREE STATE Gerald Stell Abstract This article aims to compare three distinct grammatical and conversational patterns of code-switching, which it tentatively links to three different South African ethnoracial labels: White, Coloured and Black. It forms a continuation of a previous article in which correlations were established between Afrikaans- English code-switching patterns and White and Coloured ethnicities. The typological framework used is derived from Muysken, and the hypotheses are based on his predictions as to which type of grammatical CS (i.e. insertional, alternational, congruent lexicalisation) will dominate in which linguistic and sociolinguistic settings. Apart from strengthening the idea of a correlation between patterns of language variation and ethnicity in general, the article explores the theoretical possibility of specific social factors overriding linguistic constraints in determining the grammatical form of CS patterns. In this regard, it will be shown that – on account of specific social factors underlying ethnicity – CS between two typologically unrelated languages, namely Sesotho and English, can exhibit more marks of congruent lexicalization than CS between two typologically related languages, namely Afrikaans and English, while – from the point of view of linguistic constraints – insertional/alternational
    [Show full text]
  • Felix Banda Orthography Design
    Orthography design and harmonisation in development in Southern Africa By Felix Banda Garth Stead/iAfrika Photos roperly designed cross-border orthographies Of interest to this paper is the fact that Bantu lan- can play a monumental role in promoting the guages can be divided into zones or clusters of lan- use of African languages in all spheres of life, guages with varying degrees of mutual intelligibility. and hence contribute to the socio-economic Some of the languages in the Nguni language cluster development of Africans. In Southern Africa, are isiXhosa, isiZulu, siSwati and isiNdebele. Pwhere people speak related Bantu languages, benefits According to Miti (2006), Guthrie (1971) classifies from mass literacy campaigns for socio-economic devel- these languages in Group 40 of Zone S. The opment can only accrue from a wider readership of mate- Tswana/Sotho group is also classified in Zone S and rial, written in unified standard orthographies. Language includes the following languages: seTswana, sePedi planning and policy across geographical borders should and seSotho. Zone M includes languages such as take advantage of cross-border languages, which are cur- iciBemba, Lala, iciLamba, and ciTonga; Zone N rently promoted in isolation within nation-states. In this includes ciNyanja, ciTumbuka, ciNsenga, and ciKunda; idiom, status and corpus, planning should go beyond pro- while Zone P includes ciYao, eMakua and eLomwe. motion of isolated languages in nation-states, to a broad- What is worth noting at this juncture is that the lan- er and more comprehensive one that accounts for the guage zones or clusters do not constitute a neat pack- relatedness of Southern African languages as a result of a age, and they do not correspond to colonial borders.
    [Show full text]
  • Sesotho Book
    THE SESOTHO BOOK A Language Manual Developed By a Peace Corps Volunteer During Her Service in Lesotho 1 To the reader: I have attempted to make this book as comprehensive and yet as concise as possible. It is largely an adaptation of “Everyday Sesotho Grammar” by M.R.L Sharpe. I have included most of the vocabulary words, phrases, and grammatical points that I WISH someone had explained to me earlier on. I know learning Sesotho can seem like an incredibly daunting task, but it can be done. Here are my words of advice: Start USING Sesotho whenever possible. Carry around a dictionary and a little notebook to look up and write down the new words you come across. The more you practice (and the more mistakes you make), the easier it gets. If you reach a point where your language ability seems to have plateaud and you want to improve quickly, I recommend that you make a firm decision to ONLY speak Sesotho when in your village, explaining to those in your community that you would like them to only speak Sesotho with you as well. The communication barrier will be a big challenge at first, but soon you will start to see a rapid improvement in your language ability. Keep trying out language tutors until you find one that is helpful. It may take many tries, but it’s worth it. Finally, READ any material in Sesotho that you can get your hands on. Reading Sesotho (rather than just listening to people speak a mile a minute) gives you time to pick sentences apart.
    [Show full text]
  • GOO-80-02119 392P
    DOCUMENT RESUME ED 228 863 FL 013 634 AUTHOR Hatfield, Deborah H.; And Others TITLE A Survey of Materials for the Study of theUncommonly Taught Languages: Supplement, 1976-1981. INSTITUTION Center for Applied Linguistics, Washington, D.C. SPONS AGENCY Department of Education, Washington, D.C.Div. of International Education. PUB DATE Jul 82 CONTRACT GOO-79-03415; GOO-80-02119 NOTE 392p.; For related documents, see ED 130 537-538, ED 132 833-835, ED 132 860, and ED 166 949-950. PUB TYPE Reference Materials Bibliographies (131) EDRS PRICE MF01/PC16 Plus Postage. DESCRIPTORS Annotated Bibliographies; Dictionaries; *InStructional Materials; Postsecondary Edtmation; *Second Language Instruction; Textbooks; *Uncommonly Taught Languages ABSTRACT This annotated bibliography is a supplement tothe previous survey published in 1976. It coverslanguages and language groups in the following divisions:(1) Western Europe/Pidgins and Creoles (European-based); (2) Eastern Europeand the Soviet Union; (3) the Middle East and North Africa; (4) SouthAsia;(5) Eastern Asia; (6) Sub-Saharan Africa; (7) SoutheastAsia and the Pacific; and (8) North, Central, and South Anerica. The primaryemphasis of the bibliography is on materials for the use of theadult learner whose native language is English. Under each languageheading, the items are arranged as follows:teaching materials, readers, grammars, and dictionaries. The annotations are descriptive.Whenever possible, each entry contains standardbibliographical information, including notations about reprints and accompanyingtapes/records
    [Show full text]
  • Boston University Graduate School of Arts and Sciences, 2017
    BOSTON UNIVERSITY GRADUATE SCHOOL OF ARTS AND SCIENCES Dissertation “WISDOM DOES NOT LIVE IN ONE HOUSE”: COMPILING ENVIRONMENTAL KNOWLEDGE IN LESOTHO, SOUTHERN AFRICA, C.1880-1965 By CHRISTOPHER R. CONZ B.S., University of Hartford, 1999 M.Ed., University of Massachusetts, Amherst, 2004 M.A., Boston University, 2013 Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2017 ProQuest Number:10267447 All rights reserved INFORMATION TO ALL USERS The quality of this reproduction is dependent upon the quality of the copy submitted. In the unlikely event that the author did not send a complete manuscript and there are missing pages, these will be noted. Also, if material had to be removed, a note will indicate the deletion. ProQuest 10267447 Published by ProQuest LLC ( 2017). Copyright of the Dissertation is held by the Author. All rights reserved. This work is protected against unauthorized copying under Title 17, United States Code Microform Edition © ProQuest LLC. ProQuest LLC. 789 East Eisenhower Parkway P.O. Box 1346 Ann Arbor, MI 48106 - 1346 © Copyright by CHRISTOPHER R. CONZ 2017 Approved by First Reader ______________________________________________________ James C. McCann, Ph.D. Professor of History Second Reader ______________________________________________________ Diana Wylie, Ph.D. Professor of History Third Reader _______________________________________________________ Joanna Davidson, Ph.D. Assistant Professor of Anthropology ACKNOWLEDGEMENTS This dissertation was possible only because of others to whom I owe tremendous gratitude. A Fulbright Student Grant and a Graduate Research Abroad Fellowship from Boston University funded my research in 2014-2015. My two homes at Boston University, the Department of History and the African Studies Center and Library, provided me with the intellectual space for sharing my work with graduate students and faculty.
    [Show full text]
  • Palatalization and Other Non-Local Effects in Southern Bantu Languages
    Palatalization and other non-local effects in Southern Bantu Languages Gloria Baby Malambe Department of Phonetics and Linguistics University College London UCL Thesis submitted to the University of London in partial fulfilment of the requirements for the degree of Doctor of Philosophy 2006 UMI Number: U592274 All rights reserved INFORMATION TO ALL USERS The quality of this reproduction is dependent upon the quality of the copy submitted. In the unlikely event that the author did not send a complete manuscript and there are missing pages, these will be noted. Also, if material had to be removed, a note will indicate the deletion. Dissertation Publishing UMI U592274 Published by ProQuest LLC 2013. Copyright in the Dissertation held by the Author. Microform Edition © ProQuest LLC. All rights reserved. This work is protected against unauthorized copying under Title 17, United States Code. ProQuest LLC 789 East Eisenhower Parkway P.O. Box 1346 Ann Arbor, Ml 48106-1346 Abstract Palatalization in Southern Bantu languages presents a number of challenges to phonological theory. Unlike ‘canonical’ palatalization, the process generally affects labial consonants rather than coronals or dorsals. It applies in the absence of an obvious palatalizing trigger; and it can apply non-locally, affecting labials that are some distance from the palatalizing suffix. The process has been variously treated as morphologically triggered (e.g. Herbert 1977, 1990) or phonologically triggered (e.g. Cole 1992). I take a phonological approach and analyze the data using the constraint- based framework of Optimality Theory. I propose that the palatalizing trigger takes the form of a lexically floating palatal feature [cor] (Mester and ltd 1989; Yip 1992; Zoll 1996).
    [Show full text]