The University of Dodoma Institutional Repository http://repository.udom.ac.tz

Information and Communication Technology Master Dissertations

2015 Development of a rule base checker for

Bamsi, Haji Idd

The University of Dodoma

Bamsi, H. I. (2015). Development of a rule base grammar checker for Swahili language. Dodoma: The University of Dodoma. http://hdl.handle.net/20.500.12661/760

Downloaded from UDOM Institutional Repository at The University of Dodoma, an open access institutional repository.

DEVELOPMENT OF A RULE BASED GRAMMAR CHECKER FOR SWAHILI LANGUAGE

By

Haji Idd Bamsi

A Dissertation Submitted in Partial Fulfillment of the Requirements for the Degree of

Master of Science in Computer Science of the University of Dodoma

The University of Dodoma

October, 2015

CERTIFICATION

The undersigned certifies that he has read and hereby recommends for acceptance by the University of Dodoma a dissertation entitled Development of a rule base grammar checker for Swahili language in fulfilment of the requirements for the Master of Science in Computer Science of the University of Dodoma.

………………………………………………………………. Dr. Salehe I. Mrutu (Supervisor)

Date……………………………………

DECLARATION AND COPYRIGHT

I, Haji Idd Bamsi, declare that this dissertation is my own original work and that it has not been presented and will not be presented to any other University for a similar or any other degree award.

Signature………………………………………

No part of this dissertation may be reproduced, stored in any retrieval system, or transmitted in any form or by any means without prior written permission of the author or the University of Dodoma.

ACKNOWLEDGEMENT

I have put considerable effort into this study. However, it would not have been possible without the blessings of the Almighty God, who kept me safe and healthy so that I could undertake my studies.

I am highly indebted to my supervisor, Dr. Mrutu, for his guidance and constant support up to the completion of this study. His keen interest in the study and his strong opinions were a powerful driving force towards its completion.

I would like to express my sincere gratitude to my colleagues of the MSc class of 2014/2015. Their support, encouragement and collaboration helped me greatly.

This work would not have been possible without the support of my employer, the University of Dodoma (UDOM), which granted me study leave to attend the program.

I would like to acknowledge all staff of the College of Informatics and Virtual Education, Department of Computer Science, for their support and encouragement during my studies.

Last, but not least, I would like to thank my wife and my family for their constant encouragement and patience throughout my studies. I could not finish without mentioning my friend and colleague, Ms Rose, and thanking her for her assistance and encouragement during the course of my study.

DEDICATION

I would like to dedicate this work to my late mother, Fatma Mohammed. I will always remember her guidance and the love she showed me. May Allah rest her in peace.

ABSTRACT

A grammar checker is a writing assistance tool that automatically checks text against the rules of a natural language. Every natural language has a set of rules that guide its users. Swahili is one of the most widely spoken languages in the East African countries, and specifically in Tanzania. Efforts have been made towards developing such tools for Swahili. However, to the best of our knowledge, no tool has been developed and reported for automatically detecting grammatical errors in Swahili sentences. In this study, a rule based grammar checker prototype for the Swahili language has been developed and tested.

The system prototype was developed using a rule based approach. Swahili texts were collected and analyzed, and grammar rules were then developed and tested using the Transformation Based Learning (TBL) algorithm. The grammar checker prototype was designed as two modules: the first module detects spelling errors using Bayesian theory, which finds the most likely spelling correction from the set of possible corrections, and the second module detects grammar errors by matching the input text against the pre-defined grammar rules.

The performance of the developed prototype was evaluated using the standard performance measures of precision and recall. Recall was used to measure the ability of the prototype to detect grammar errors, while precision was used to measure its ability to report only relevant grammar errors. The system prototype achieves 71% recall and 76% precision. Therefore, the accuracy of the grammar checker prototype obtained was 73%.

TABLE OF CONTENTS

CERTIFICATION
DECLARATION AND COPYRIGHT
ACKNOWLEDGEMENT
DEDICATION
ABSTRACT
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF ABBREVIATIONS

CHAPTER ONE : OVERVIEW OF THE STUDY
1.1 Background Information
1.2 Problem Statement
1.3 Research Objectives
1.3.1 Main Objective
1.3.2 Specific Objectives
1.4 Research Questions
1.5 Significance of the Study
1.6 Limitation of the Study
1.7 Dissertation Structure

CHAPTER TWO : LITERATURE REVIEW
2.1 About Swahili Language
2.2 Definitions of Terms
2.2.1 Natural Language Processing
2.3 Rule Based Approach
2.3.1 Part of Speech Tagging
2.3.1.1 Swahili Word Categories
2.3.2 Phrase Chunking
2.3.3 Phrase Structure
2.4 Transformation Based Learning
2.5 Parsing
2.6 Spelling Corrector
2.7 Related Works

CHAPTER THREE : METHODOLOGY
3.1 Data Gathering, Review and Analysis
3.2 Development Model for Grammar Checker
3.2.1 Lexical Analysis
3.2.1.1 Part of Speech Tagging
3.2.1.2 Parse Tree
3.2.1.3 Syntax Analyzer
3.2.2 Grammar Rules Development
3.2.3 Spelling Checker
3.3 System Description
3.4 System Design
3.4.1 Class Description
3.5 System Prototype Implementation
3.6 System Testing and Evaluation

CHAPTER FOUR : RESULTS AND DISCUSSION
4.1 The Developed System Prototype
4.2 General Discussion

CHAPTER FIVE : CONCLUSION AND RECOMMENDATION
5.1 General Conclusion
5.2 Recommendation and Future Work

REFERENCES
APPENDICES

LIST OF FIGURES

Figure 2.1: Open Office Grammar Checker System
Figure 2.2: kNN Example (Source: Sutton, 2012)
Figure 2.3: Chomsky's Sentence Tree Structure
Figure 2.4: Transformation Based Learning (Source: Mangu and Brill, 1997)
Figure 2.5: Syntactic Constituent Structure
Figure 2.6: Syntactic Feature Structure
Figure 2.7: Syntactic Dependency Structure
Figure 3.1: Research Excursion Steps
Figure 3.2: Development Model for Swahili Grammar Checker Prototype
Figure 3.3: Parse Tree Structure
Figure 3.4: Feature of Grammar Rule
Figure 3.5: Activity Diagram of Swahili Grammar Checker Prototype
Figure 3.6: Class Diagram for Swahili Grammar Checker
Figure 4.1: Spelling Error
Figure 4.2: General Error
Figure 4.3: Follow Adjective
Figure 4.4: Verb
Figure 4.5: Agreement between Noun and Possessive Pronoun

LIST OF TABLES

Table 2.1: Swahili Word Categories
Table 2.2: Noun Classes (Source: Ashton, 1982)
Table 2.3: Personal Pronouns (Source: Thompson and Schleicher, 2001)
Table 2.4: Types of Pronouns
Table 2.5: List of Swahili Phrases
Table 3.1: Lexical Analysis Example
Table 4.1: Types of Grammar Errors
Table 4.1: Performance Results for Swahili Grammar Checker Prototype

LIST OF ABBREVIATIONS

HCS : Helsinki Corpus of Swahili

KNN : K-Nearest Neighbor

MBT : Memory Based Tagger

MXPOST : Maximum Entropy Modeling

NLP : Natural Language Processing

NLTK : Natural Language Toolkit

POS : Part of Speech

SALAMA : Swahili Language Manager

SVM : Support Vector Machine

TBL : Transformation Based Learning

TnT : Trigrams'n'Tags

XML : Extensible Markup Language

CHAPTER ONE

OVERVIEW OF THE STUDY

1.1 Background Information

Swahili is one of the most widely spoken languages in the East African countries, and specifically in Tanzania. The language was originally used by Bantu speakers and Arabian traders to communicate as early as the 2nd century (Hinnebusch, 1996; Wikipedia, 2015). The earliest known documents written in Swahili are believed to be letters written in Kilwa using the Arabic script (Wikipedia, 2015). Later, under the influence of European powers, the Latin script was introduced, and it remains the standard today.

Swahili has been the mother tongue on the coast of Tanganyika, Kenya, Zanzibar and the neighboring islands for centuries (Polome, 1967). The language has penetrated further inland as well as into the East and Central African countries. The rapid growth in the number of Swahili speakers has led to an increase in the number of Swahili documents produced and processed over the internet (Tesha, 2012). Such documents can be found on websites, blogs, social networks, Wikipedia and so on. Besides Tanzania, Swahili is also used as a second language in the Democratic Republic of Congo, Kenya, Uganda, Burundi, Rwanda and Mozambique. Swahili is also the main language used in Tanzania's parliament, press and radio; it is the medium of instruction at primary level in Tanzania and is taught as a subject at secondary and higher levels, where English becomes the medium of instruction.

Recently, the government of Tanzania, through the Ministry of Education, announced a new education policy which emphasizes the use of Swahili as the medium of instruction at all levels of education. Although Swahili continues to grow and gain recognition all over the world, there is a danger of accumulating incorrectly written documents in Swahili document archives, because writers inevitably make errors when they type to create, process or retrieve Swahili documents.

Msanjila (2005) reveals six writing problems in a study conducted in two secondary schools in Morogoro region, Tanzania. The problems identified include capitalization and punctuation problems, inexplicitness or fuzziness, poor organization or illogical sequence, spelling problems and grammatical errors. He further argues that the problem is also found at higher education levels, including colleges and universities in Tanzania. Kummert et al (2003) categorize errors which appear in written text into four categories: spelling errors, grammar errors, style errors and semantic errors. A spelling error is an incorrect combination of letters in a word. Such errors can be found by spell checkers that compare each word with a large list of known words. A style error is caused by using uncommon words or complex sentence structures involving long sentences. Semantic errors relate to the meaning of a word or sentence and are hard to detect automatically. A grammar error is caused by combining or inflecting words in a way that violates the syntactic rules of the language. Unlike spell checking, grammar checking needs to make use of the sentence structure of a language. To the best of our knowledge, no tool has been developed and published for checking semantic errors and/or grammar errors in Swahili.

Therefore, in this study we have developed a prototype for checking spelling and grammar of Swahili language.

Grammar is a systematic way of assembling the words and sentences of a language to convey meaning. It is built on a set of rules about the syntax and morphology of the language (Leacock et al, 2010). An example of a grammar rule in both English and Swahili is subject-verb agreement. Each language has its own syntax and semantics. Users should follow these predefined language rules (syntax); otherwise they will commit errors. Experience shows that non-native speakers of a language are more prone to grammatical errors than native speakers. Most Swahili speakers are non-native speakers, which means that they use Swahili as their second language. Therefore, a writing assistance tool will be especially useful to them and to language learners.

The idea of developing tools for writing assistance is not new. The first writing tool was introduced for English in the 1970s (Tesfaye, 2011). It was called Writer's Workbench and was used to check punctuation and style inconsistencies. A few years later, the first complete grammar checker for English was developed, which could detect writing errors in addition to style and punctuation errors. This was followed by a great achievement in 1992, when a grammar checker was integrated into a word processor as an independent component (Wikipedia, 2015).

A grammar checker is an application that finds and corrects syntactic errors in a sentence (Faili, 2011). Syntactic errors are independent of the meaning of a sentence, and most grammar checkers do not consider meaning. For example, word processors and text editors normally use a grammar checker to detect and correct grammatical errors only, not semantic errors. Towards the automation of grammar checking, various scholars have proposed tools for different languages. Jensen et al. (1983) proposed a grammar checker for English; Bustamante and Leon (1996) introduced a grammar and style checker for Spanish and Greek; Vosse (1994) suggested a rule based grammar checker for Dutch; and Martins et al. (1998) proposed one for Brazilian Portuguese. Hurskainen (2004) introduced the Swahili Language Manager (SALAMA) as a computational environment for developing Swahili based applications. However, SALAMA is limited to the morphological description of Swahili words and language translation. Therefore, in this study the researcher has introduced a grammar checker prototype for the Swahili language.

1.2 Problem Statement

Syntactic analysis of sentences is a fundamental problem in computational linguistics. An automated syntax analyzer (grammar checker) is a crucial component in the development of computational linguistic applications. There are three approaches to developing a grammar checker: syntax-based, statistical or corpus-based, and rule-based. The rule-based approach has shown good performance for other languages such as English, Dutch, Spanish and Arabic. Although manual creation of rules is a daunting job, machine learning techniques simplify the process of acquiring rules.

While great achievements have been made in spelling and grammar checker applications for major European languages such as English, Spanish, Dutch, German and Swedish, comparable applications for other languages, including Swahili, are still lacking. Efforts have been made towards the development of such applications. The Swahili Language Manager (SALAMA) was developed by Hurskainen (2004), who collected and analyzed more than twelve million standard Swahili words. However, SALAMA is limited to Swahili word classification and Swahili language translation. To the best of our knowledge, no tool has been developed and reported for automatically detecting grammatical errors in Swahili sentences. Therefore, this study has attempted to solve this syntactic analysis problem by developing an automatic spelling and grammar checker for Swahili based on a Swahili corpus and its grammatical rules.

1.3 Research Objectives

1.3.1 Main Objective

The general objective of this study was to develop a rule-based grammar checker prototype that detects and corrects grammatical errors of Swahili language text automatically.

1.3.2 Specific Objectives

a) To analyze the syntactic structure of sentences in Swahili language.

b) To design grammatical rules that cover the general syntax of the largest possible set of Swahili language sentences.

c) To implement a grammar checker based on the grammatical rules generated from Swahili language corpora.

d) To test and evaluate the performance of the prototype on detecting and correcting grammatical errors of Swahili language texts.

1.4 Research Questions

a) What classes of words does Swahili language have?

b) What parse tree structures can represent the syntactic structures of sentences in Swahili language?

c) How can grammatical rules be designed to represent the syntax of the largest possible set of sentences in Swahili language?

d) What code best represents the grammatical rules generated for checking Swahili grammar?

e) What is the performance of the developed tool on detecting grammatical errors in Swahili language texts?

1.5 Significance of the Study

Swahili originated in Africa and is widely used in many African countries. The number of Swahili users is still growing, and the need for applications that support Swahili is increasing. This study makes a significant contribution to the development of such applications, for example text editors, Swahili language compilers, optical character recognition systems, machine translation, automated digital processing devices and question answering systems. A grammar checker also improves the quality of Swahili texts, saves time while writing and supports non-native speakers when learning the language.

In the realm of natural language processing, a grammar checker for Swahili has not previously been developed and tested. Results from this study will provide a deeper understanding of the problem and reveal more techniques for solving it.

1.6 Limitation of the Study

The developed application was tested on a data set that was collected and analyzed by the researcher. This means that the application can only detect and correct errors in Swahili sentences formed from words that are available in the data set. The application was also confined to 85 errors, of which 75 are grammatical and 10 are general syntax errors such as an extra space before a comma or a missing space after a full stop. This is due to time and resource constraints. The prototype was therefore tested on a small set of words to which the rules apply. However, the architecture allows future expansion of the rules and data to cover the largest possible data set of the Swahili language.

1.7 Dissertation Structure

In this dissertation, an application prototype for detecting and correcting spelling and grammatical errors is introduced. The report is divided into five chapters. Chapter One introduces the study and covers the background to the study, the problem statement, the research objectives and questions, and the significance of the study. The remaining chapters are outlined as follows:

Chapter Two explores the background information about the Swahili language and the existing literature on the development of grammar checkers. It further discusses the strengths and weaknesses of various approaches to developing a grammar checker. It then extends the discussion of the rule based approach by briefly analyzing each associated component, such as corpora, part of speech tagging and grammar rule acquisition.

Chapter Three discusses the research methodology adopted by the researcher to carry out this study. It briefly presents the development model used, which starts with data review and analysis, continues with design, implementation and testing, and ends with system evaluation.

Chapter Four presents the performance of the developed prototype in detecting and correcting spelling and grammatical errors in Swahili. The recall and precision obtained from the experimental results are presented.

Chapter Five summarizes the research findings and concludes the dissertation.

CHAPTER TWO

LITERATURE REVIEW

2.1 About Swahili Language

Swahili, also known as Kiswahili in Tanzania, originated on the coast and islands of the East African countries. Many Swahili words are borrowed from the languages of maritime powers that operated in the area, such as Arabic, Portuguese and English, as well as from other languages.

Haspelmath and Tadmor (2009) provide statistics on loanwords borrowed from other languages. The analysis was done using 1610 words, and the results show that Arabic accounts for 16.4-20% of them and English for 4.6%, while other languages, including Portuguese, Persian and Hindi, range between 0.1% and 1%. The researchers also found that Arabic, and to a lesser extent English, influenced Swahili morphology, syntax and style. The history of the penetration of Swahili from the coast to the interior practically coincides with that of Arab trade towards Central Africa (Polome, 1967). It is generally accepted that the name comes from the Arabic word Sawahil, the plural of Sahil, meaning 'coast', although some scholars still debate its origin, which remains disputed today. The earliest evidence of Arab travelers and traders in coastal areas including Kilwa, Lamu, Malindi and Mombasa suggests that Swahili was used by Arab traders and local people in the area even before the 10th century (Polome, 1967).

There are two groups of Swahili speakers: those whose mother tongue is Swahili, and those who use Swahili as a second language. Along the coast of Kenya and Tanzania, including the neighboring islands, Swahili has been the mother tongue of the local people for centuries (Polome, 1967).

Swahili is recognized as the first language in Tanzania, Kenya and the Democratic Republic of Congo (Wikipedia, 2015). Other countries that use Swahili as a second language include Uganda, Zambia, Mozambique, Malawi, Rwanda, Burundi, Somalia and the Comoro Islands. At present, estimates of the number of Swahili speakers range widely, from 60 million to 150 million (Wikipedia, 2015). Several factors have led to the rapid growth of Swahili within and outside Africa: the media, commerce, education systems and its role in communication all accelerate its growth. In Tanzania, Swahili has been used as the medium of instruction in primary schools and as a compulsory subject in secondary schools since independence (Polome, 1967). In Kenya as well, Swahili is a compulsory subject in primary and secondary schools. More than 100 universities in different parts of the world currently teach the Swahili language and its culture. In 2014, the president of Tanzania, Hon. Jakaya Mrisho Kikwete, launched a new education policy intended to fulfill the current and future needs of Tanzanians. Among other items, the policy formally recognizes Swahili as a medium of instruction at all levels of education and training (MOE, 2014). This will change the way instructors teach, and reference books will need to be written in Swahili. It will therefore accelerate the growth of the number of Swahili documents in Swahili archives. Currently, Swahili is recognized as an official language in all government institutions in Tanzania, and various official documents are written in Swahili. Therefore, its syntax should be observed to ensure common understanding.

2.2 Definitions of Terms

2.2.1 Natural Language Processing

We normally use language to communicate. Language is a broad term that can refer to a computer language or a natural language. Computer or formal languages have a precise structure defined by a set of rules. Examples of formal languages are Java, C++ and Python. Natural language refers to the languages spoken by people. Examples of natural languages are English, Japanese, French and Swahili. Natural languages are very dynamic, which means that they cannot be characterized as a definitive set of sentences. They are difficult to deal with because their rules are not strict compared to those of formal languages. The ability of a computer to learn, understand and interpret a natural language is called Natural Language Processing (NLP).

The grammar of a natural language refers to the set of rules (syntax) and the morphology used to form sentences in that language (Henrich and Reuter, 2009). Languages differ in syntax and morphology. For example, in English an adjective (ADJ) comes before a noun (N), while in Swahili the adjective follows the noun (see example 1 below).

Mtoto/N mzuri/ADJ (Swahili) -> beautiful/ADJ baby/N (English)    (1)

Hence, it is not practical to apply the rules of one language to another. Every language has its own syntactic principles of word order. For instance, English sentences follow subject-verb-object order (abbreviated as S-V-O); Japanese sentences allow words to occur in many possible orders, but the normal arrangement is subject-object-verb; and Irish sentences follow verb-subject-object order. Swahili, like many other Bantu languages, has a fairly fixed base word order (subject/agent–verb–object/theme: SVO), where the agent precedes the verb and the theme (see example 2) (Ashton, 1982; Grigori et al, 2013).

Kijana/S anampiga/V msichana/O (The boy is hitting the girl)    (2)

However, sentences with a derived order of the arguments, such as passives (subject/theme–verb–prepositional phrase/agent) and object relatives (object/theme–subject/agent–verb), are also accepted (see examples 3a and 3b).

Msichana anapigwa na kijana. (The girl is hit by the boy) {Passive: theme-agent}    (3a)

Msichana ambaye kijana anampiga. (The girl who the boy is hitting) {Object relative: relative pronoun}    (3b)

Various approaches have been established for developing spelling and grammar checking tools. In the early 1970s, the interest of researchers was in detecting spelling errors. A few years later, more researchers started to work on grammar as well. Jensen et al. (1983) exploited parse fitting and prose fixing techniques that detect syntactic errors and suggest appropriate corrections. Between 1997 and 1999, Lingsoft developed a grammar checker for Swedish called Grammatifix. In 2000, Lingsoft licensed Grammatifix to Microsoft as the grammar checker component for Microsoft Office 2000. Grammatifix uses the constraint grammar technique together with two-level morphology to produce a system for syntactic analysis. Carlberger et al (2004) reviewed another grammar checker for Swedish called GRANSKA, which combines probabilistic and rule based methods to achieve high efficiency and robustness. Ehsan and Faili (2011) demonstrate a statistical grammar checker for Persian that uses a Statistical Machine Translation framework.

In general, grammar checkers can be categorized into three groups: syntax based, statistical based and rule based. In the syntax based approach the text is parsed completely, and a sentence that cannot be parsed is reported as an error. However, the checker cannot tell exactly where the problem is. The major disadvantage of this technique is that the tool requires a complete grammar which covers all types of sentences, and in practice it is not possible to cover every valid sentence structure (Naber, 2003).

The advantage of the statistical based approach is that it is language independent. It is also very powerful when the morphological structure of a language is very rich (Henrich and Reuter, 2009). On the other hand, the statistical based approach has the following disadvantages: its results are difficult to interpret; it is hard to set a reasonable threshold that separates common from uncommon patterns, so some of the marked errors are misleading; and it is difficult to obtain a large corpus of training data which covers all possible errors (Naber, 2003).

The rule based approach is considered the best and most robust technique, provided that strong rules cover almost every aspect of sentence structure. In practice, however, it is difficult to cover every aspect if the morphological structure of the language is complex. The rules are also tailored to a specific language, since languages differ in their syntax and semantics (Henrich and Reuter, 2009). Despite the above disadvantages, Naber (2003) and Mozgovoy (2010) outline the following advantages:

i. The system can detect errors before the sentence is complete.

ii. Each rule has its own expressive description and can be enabled or disabled individually.

iii. It can provide a detailed error message explaining each grammar error.

iv. Rules can easily be maintained and extended as the language requires.

In this study, the researcher adopted a rule-based approach for developing a grammar checker for Swahili, owing to the flexibility and efficiency of the approach.

2.3 Rule Based Approach

In a rule based approach, a sentence or phrase is checked against a set of rules designed for a specific language. According to Henrich and Reuter (2009), rules cannot be reused across different languages. Therefore, it is usual to find different grammar checkers for different languages. Examples of rule based grammar checkers are the grammar checkers for Microsoft Word and WordPerfect, LanguageTool and Grammarian Pro X. LanguageTool for OpenOffice is described in more detail below.

LanguageTool is an open source proofreading tool whose rules are written in XML. The tool can be integrated into OpenOffice. It is a simple grammar checking tool which uses the process shown in figure 2.1.

Figure 2.1: Open Office Grammar Checker System

The first stage is a lexical analysis of the input text. The input text is broken into sentences, and each sentence is then broken further into words. The next stage is to assign each word to its respective category, i.e., part-of-speech tagging. The analyzed text is then matched against the built-in rules and against the rules loaded from an XML file.
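The final matching stage can be illustrated with a minimal Python sketch. It is not LanguageTool's implementation: the `Rule` class, the `check` function, the rule identifier `ADJ_BEFORE_N` and the tag names are assumptions made for illustration, and the input is assumed to be already tokenized and tagged. The single rule encodes the observation from example 1 that a Swahili adjective follows its noun.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    """A hypothetical grammar rule: a sequence of POS tags that signals an error."""
    rule_id: str
    pattern: tuple
    message: str
    enabled: bool = True


def check(tagged, rules):
    """Return (position, rule_id, message) wherever an enabled rule's tag pattern occurs."""
    tags = [tag for _, tag in tagged]
    found = []
    for rule in rules:
        if not rule.enabled:
            continue  # rules can be switched off individually
        n = len(rule.pattern)
        for i in range(len(tags) - n + 1):
            if tuple(tags[i:i + n]) == rule.pattern:
                found.append((i, rule.rule_id, rule.message))
    return found


# Illustrative rule set with one entry (assumed, not taken from the prototype).
RULES = [
    Rule("ADJ_BEFORE_N", ("ADJ", "N"), "In Swahili the adjective follows the noun."),
]

print(check([("mzuri", "ADJ"), ("mtoto", "N")], RULES))  # wrong order, rule fires
print(check([("mtoto", "N"), ("mzuri", "ADJ")], RULES))  # correct order, no match
```

Keeping each rule as a small record with its own identifier, message and `enabled` flag mirrors the advantages listed for the rule based approach: rules can be described, disabled and extended one at a time.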

The process of breaking down the input text into smaller units (tokens) is called tokenization (Manning and Schutze, 1999). A token can be a word, a punctuation mark or something like a number. Normally, tokenization involves splitting the input text into words using whitespace as a separator. It is simple to split text using functions defined in programming languages like Python, Java and C++.
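As a minimal sketch of this step, the Python code below splits a text into sentences and each sentence into tokens. The function names `split_sentences` and `tokenize` are hypothetical, and regular expressions are used instead of plain whitespace splitting so that punctuation marks become separate tokens; a real tokenizer would also need to handle abbreviations, numbers with decimal points and similar cases.

```python
import re


def split_sentences(text):
    """Split text after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Return word tokens and single punctuation marks as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)


text = "Kijana anampiga msichana. Mtoto mzuri!"
for sentence in split_sentences(text):
    print(tokenize(sentence))
```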

Almost every language has collections of words or phrases. A body of text is called a corpus, and when you have several such collections of texts, you have corpora (Manning and Schutze, 1999). Examples of well-known corpora are the American National Corpus (ANC) for American English, the British National Corpus (BNC) for British English and the Helsinki Corpus of Swahili (HCS) for Swahili. The Helsinki Corpus of Swahili is a result of the SALAMA (Swahili Language Manager) project initiated by Hurskainen in 1985. HCS is a large corpus of Swahili which currently contains more than twelve million standard Swahili words. Hurskainen (2004) adopted a two-level morphology approach to develop HCS. The system collects terminology from the internet as well as from the lists of terminology coined by the National Swahili Council of Tanzania.

2.3.1 Part of Speech Tagging

Part of speech tagging is the process of assigning each word to its category. It is a very important stage in the development of a grammar checker under any of the approaches defined above. Pauw et al. (2006) experimented with Swahili taggers trained and evaluated on the annotated Helsinki Corpus of Swahili. They tested the performance of four data-driven taggers, namely TnT, MBT, SVMTool and MXPOST. The results of the experiment show MXPOST to be the most accurate tagger for the Swahili data set. However, this study used MBT for tagging because of its simplicity and accessibility.

Memory based tagging (MBT) is an approach to sequence tagging based on Memory Based Learning (MBL). MBL adopts the classical k-Nearest Neighbor (k-NN) approach to statistical pattern classification. It has proven successful in a large number of natural language processing tasks (Daelemans and Van den Bosch, 2005).

The idea is to classify an unknown word based on the similarity and position of the words in the sentence. The similarity function can be computed after the application has been trained on training data. Daelemans et al. (2007) have made the MBT application available for free within the TiMBL software package (see footnote 1).

1 MBT is available from http://www.ilk.uvt.nl/software.html

The k-Nearest Neighbour algorithm uses a data set as training data in which each data point is assigned to a class. Using these labelled data points, the class of a new data point can be predicted with a distance function: the new point is assigned to the class of the closest training data points. Sutton (2012) summarizes the k-NN algorithm as follows:

1. A positive integer n is specified, along with a new sample

2. We select the n entries in our data set which are closest to the new sample (unseen data)

3. We find the most common classification of these entries

4. This is the classification we give to the new sample

For example, let the n training instances x1, x2, ..., xn be the nearest neighbours of a new data point xq. The algorithm then returns the class that occurs most often among these n instances.

Figure 2.2: kNN Example

(Source: Sutton, 2012)

In Figure 2.2, if n = 5, the query instance xq is classified as negative, since three of its five nearest neighbours are classified as negative.
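The four steps summarized from Sutton (2012) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two-dimensional points, the Euclidean distance and the function name knn_classify are hypothetical and are not taken from MBT or TiMBL.

```python
import math
from collections import Counter

def knn_classify(training, query, n):
    """Return the most common label among the n training points closest to query."""
    # training is a list of (point, label) pairs; points are tuples of numbers
    by_distance = sorted(training, key=lambda item: math.dist(item[0], query))
    nearest_labels = [label for _, label in by_distance[:n]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical data: three negative and two positive points lie close to the query
training = [
    ((1.0, 1.0), "-"), ((1.2, 0.8), "-"), ((0.9, 1.3), "-"),
    ((1.5, 1.5), "+"), ((0.5, 0.6), "+"),
    ((5.0, 5.0), "+"), ((6.0, 5.5), "+"),
]
label = knn_classify(training, (1.0, 1.0), n=5)
```

With n = 5 the query receives the negative label, because three of its five nearest neighbours are negative; a larger n can change the outcome, which is why the choice of n matters.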

2.3.1.1 Swahili Word Categories

Swahili word categories (parts of speech), sometimes called word classes, are also known in computational linguistics as lexical categories. The term refers to groups of words which are characterized by their semantic content. Words can be classified using various criteria. Traditionally, words in Swahili are classified into eight classes: Noun, Pronoun, Verb, Adverb, Adjective, Conjunction, Interjection and Preposition (see Table 2.1).

Table 2.1: Swahili Word Categories

Class | Translation (English) | Example
Nomino | Noun | mtu (person), mbili (two)
Viwakilishi | Pronoun | mimi (I), wao (they)
Vitenzi | Verb | kula (eat), lala (sleep)
Vielezi | Adverb | shuleni (school), asubuhi (morning)
Vivumishi | Adjective | nzuri (good), mrefu (tall)
Viunganishi | Conjunction | na (and), lakini (but)
Vihisishi | Interjection | lo! (lol!), ah! (ah!)
Vihusishi | Preposition | mbele ya (in front of)

In computational linguistics, the traditional classes can be divided further into smaller units. Tags (markers) are assigned to these units to ensure that the units agree with each other when joined together to form a correct sentence.

Noun Classes (Ngeli za nomino)

Noun classes are categorized based on the semantic meaning of nouns in Swahili. These classes are important in the syntactic analysis of the language because they affect other words, namely modifiers such as adjectives and numbers, as well as verbs. See examples 4 and 5:

Walimu wawili wameanguka ……………………….……. (4)

N ADJ V

Viti viwili vimeanguka ……………………….……. (5)

N ADJ V

In sentence 4, Walimu belongs to the M-wa noun class (see Table 2.2), which requires its adjective to start with wa- (wa-wili) and its verb to start with wa- (wa-meanguka).

On the other hand, in sentence 5, Viti belongs to the Ki-vi noun class (see Table 2.2), which requires its adjective to start with vi- (vi-wili) and its verb to start with vi- (vi-meanguka).

Table 2.2 shows the list of noun classes with their examples.

Table 2.2: Noun Classes (Source: Ashton, 1982)

Noun class | Noun class prefixes | Adjective prefixes | Verb prefixes
M-wa class | m- [singular] – wa- [plural] | m- [singular] – wa- [plural] | a- [singular] – wa- [plural]
Example: Mtu mmoja amepotea.
M-mi class | m- [singular] – mi- [plural] | m- [singular] – mi- [plural] | u- [singular] – i- [plural]
Example: Miti mingi imeungua.
Ji / Ma class | ji-, j-, no prefix [singular] – ma- [plural] | no prefix [singular] – ma- [plural] | li- [singular] – ya- [plural]
Note: except for some nouns that start with a vowel
Example: Mayai mabovu yametupwa
Ki-vi class | ki-, ch- [singular] – vi-, vy- [plural] | ki- [singular] – vi- [plural] | ki- [singular] – vi- [plural]
Example: Vitabu vingi vimeharibika
N class | N-, Ny-, M-, or no prefix | n- [singular] – n- [plural] | i- [singular] – zi- [plural]
Example: Kalamu nyekundu imepotea
U class | u-, w-, uw- | m- [singular] – n- [plural] | u- [singular] – zi- [plural]
Example: Uzi mrefu umekatika
Ku class | ku-, kw- | |
Example: Wanafunzi wanapenda kuimba wimbo wa shule
Pa locative class | pa-, suffix -ni | pa- | pa-
Ku locative class | suffix -ni | ku-, kw- | ku-, kw-
Example: Nyumbani kwangu kule barabarani
Mu locative class | suffix -ni | mu- | mu-
Example: Maji yako mule

Pronouns (Kiwakilishi)

Personal pronoun (Viwakilishi vya nafsi)

Personal pronouns are the same whether used as subject or object. However, they have different forms depending on the person referred to. Table 2.3 shows the list of personal pronouns.

Table 2.3: Personal Pronouns (Source: Thompson and Schleicher, 2001)

Person | Singular | Plural
1st person (Nafsi ya kwanza) | Mimi (I, me) | Sisi (we, us)
2nd person (Nafsi ya pili) | Wewe (you) | Ninyi (you)
3rd person (Nafsi ya tatu) | Yeye (he, she, him, her) | Wao (they, them)

The pronouns it, they and them are not expressed by personal pronouns in Swahili when they do not refer to people. Instead, demonstrative pronouns are used. Table 2.4 shows demonstrative pronouns and other types of pronouns.

Table 2.4: Types of Pronouns

Reflexive pronoun (Viwakilishi virejeshi): Use the reflexive object marker –ji- (shamirisho ya kujirejea). Example: Ulijisaidia (You helped yourself)

Interrogative pronoun (Viwakilishi viulizi): Nani (singular) always takes class 1 noun agreement.

Examples:

Nani atakuja kesho? (as subject)

Ulimwona nani? (as direct object)

Alimpa nani barua? (as indirect object)

Ulisafiri na nani? (as object of a preposition)

Akina nani (plural) always takes class 2 noun agreement.

Examples:

Akina nani watakuja kesho? (as subject)

Also used as direct or indirect object and as object of a preposition.

Nini (asking about things) precedes the verb if functioning as subject and follows the verb if functioning as object.

Example:

Nini kinakusumbua?

Also used as direct or indirect object and as object of a preposition.

Demonstrative pronoun (Viwakilishi vioneshi): Huyu, yule, wale, hao… Example: Hawa ni watoto wangu. Wale ni wa dada yangu.

Possessive pronoun (Viwakilishi vimilikishi): Changu, wangu, yangu… Example: Kitabu hiki ni changu.

Relative pronouns (Viwakilishi virejeshi): Amba-, ye-, o-, ko-, po-…

Examples:

Nina rafiki ambaye anaishi Zanzibar.

Nina rafiki anayeishi Zanzibar.

Nina rafiki aishiye huko Zanzibar. (I have a friend who lives in Zanzibar)

Indefinite pronoun: -ote, -ingi, -ingine, -o -ote, baadhi ya, mojawapo, kadhaa

Example:

Vitu vyote vipo (Everything is there)

Verb (Kitenzi)

Swahili verbs have a complex morphology. Generally, a Swahili verb consists of a subject agreement marker, tense marker, object marker, stem, suffixes and final vowel (ignoring negative markers, relative markers and morphology). See example 6.

Swahili verb structure

Subject agreement-tense marker-object marker-root-suffixes-final vowel

wa-na-chez-a (subject-tense-root-final vowel) ……………………………… (6)

2.3.2 Phrase Chunking

A chunk (phrase) is a syntactic structure which groups several consecutive words into a phrase. Phrase chunking increases the performance and accuracy of a system, since a sentence is analyzed in groups. For instance, in English, typical chunks are the noun phrase (NP) and the verb phrase (VP). Noun phrases consist of determiners, adjectives and nouns or pronouns, while verb phrases consist of a single verb or of an auxiliary verb plus an infinitive (Naber, 2003). In Swahili, there are at least six phrase types: Kirai nomino (Noun phrase), Kirai tenzi (Verb phrase), Kirai vumishi (Adjective phrase), Kirai elezi (Adverb phrase), Kirai husishi (Preposition phrase) and Kirai unganishi (Conjunction phrase). These phrases can be combined to form sentences in Swahili. Normally, a sentence is formed by two important phrases, the noun phrase and the verb phrase (KN and KT). See example 7 below.

Mpira unachezwa na Juma (S -> KN + KT (T + KU(U + KN))) ……………… (7)

2.3.3 Phrase Structure

Phrase structure, sometimes referred to as the visual syntactic structure of a sentence, originated with Noam Chomsky (1957), when he published his most influential work in linguistic theory. The method of analyzing the syntactic structure of a sentence using a tree diagram was first introduced in 1957. Phrase structure is used to break a sentence down into its constituent parts, known as constituent categories, such as lexical categories (parts of speech) and phrase categories (Wikipedia, 2015). See the example of a phrase tree structure in Figure 2.3.


Figure 2.3: Chomsky's Sentence Tree Structure

(Source: Wikipedia, 2015.)

Table 2.5 below shows a list of Swahili phrases used to analyze Swahili sentence structure.

Table 2.5: List of Swahili Phrases

Phrase | Swahili tags
Kirai nomino (Noun phrase) | KN: N | N ADJ | PRO | PRO ADJ
Kirai kitenzi (Verb phrase) | KT: V | V KN | V KN KE
Kirai kielezi (Adverb phrase) | KE: ADV | ADV ADV
Kirai kivumishi (Adjective phrase) | KV: ADJ | ADJ ADV
Kirai kihusishi (Preposition phrase) | KH: PREP | PREP KN
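The way such tag patterns group a tagged sentence into phrases can be illustrated with a small sketch. It is an assumption-laden simplification rather than the prototype's chunker: only the non-recursive alternatives of Table 2.5 are used, matching is greedy and left to right, and the names PHRASE_PATTERNS and chunk are hypothetical.

```python
# Flat tag patterns from Table 2.5 (recursive alternatives such as "V KN" are left out)
PHRASE_PATTERNS = {
    "KN": [["N", "ADJ"], ["PRO", "ADJ"], ["N"], ["PRO"]],
    "KT": [["V"]],
    "KE": [["ADV", "ADV"], ["ADV"]],
    "KV": [["ADJ", "ADV"], ["ADJ"]],
    "KH": [["PREP"]],
}

def chunk(tags):
    """Greedy chunking: at each position take the longest pattern that matches."""
    candidates = sorted(
        ((label, pattern) for label, patterns in PHRASE_PATTERNS.items() for pattern in patterns),
        key=lambda pair: -len(pair[1]),
    )
    chunks, i = [], 0
    while i < len(tags):
        for label, pattern in candidates:
            if tags[i:i + len(pattern)] == pattern:
                chunks.append((label, list(pattern)))
                i += len(pattern)
                break
        else:
            # No pattern covers this tag; keep it as an unlabelled chunk
            chunks.append(("?", [tags[i]]))
            i += 1
    return chunks

# "Walimu wawili wameanguka" tagged N ADJ V, as in example 4
result = chunk(["N", "ADJ", "V"])
```

For the tag sequence N ADJ V the sketch yields a noun phrase followed by a verb phrase, which corresponds to the S -> KN + KT shape described for example 7.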

2.4 Transformation Based Learning

Manual encoding of linguistic information is being challenged by annotated corpus-based learning as a method of extracting linguistic information (Brill, 1995).

Transformation based learning (TBL) has been successfully adopted in rule-learning based natural language processing studies. TBL can be applied to part-of-speech tagging, prepositional phrase attachment disambiguation and syntactic parsing (Brill, 2008). Brill (2008) uses TBL in part-of-speech tagging, Ramshaw and Marcus (1995) use TBL in text chunking, and Kate et al. (2005) use TBL to transform natural language into formal language (Curran and Wong, 2000; Keizer et al., 2009). TBL often achieves state-of-the-art accuracy while capturing the learned knowledge in a small set of rules (Mangu and Brill, 1997).

Figure 2.4: Transformation Based Learning

(Source: Mangu and Brill, 1997)

Figure 2.4 shows the learning process of Transformation Based Learning. Mangu and Brill (1997) identified three items that must be specified when implementing a TBL system:

1. A baseline predictor

2. A set of allowable transformation types

3. An objective function for learning

In learning, there should be a training set consisting of a correctly annotated corpus. The training set is passed through a program to form the initial baseline prediction. The quality of the baseline corpus is measured by comparing it with the pre-determined truth according to the specified objective function. The learner then iteratively learns an ordered sequence of transformation rules. In each learning cycle, a new transformation rule is added to the transformation list and applied to the training corpus. Learning ends when no new transformation can be found whose application improves the training corpus.
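The learning loop described above can be sketched as follows. This is a minimal illustration, not Brill's implementation: the rule shape (change one tag to another when the previous tag matches), the error-count objective and the toy tag sequences are all assumptions made for the example.

```python
def apply_rule(tags, rule):
    """Change from_tag to to_tag wherever the preceding tag equals prev_tag."""
    from_tag, to_tag, prev_tag = rule
    out = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == from_tag and tags[i - 1] == prev_tag:
            out[i] = to_tag
    return out

def count_errors(tags, truth):
    """Objective function: number of positions that disagree with the truth."""
    return sum(1 for predicted, gold in zip(tags, truth) if predicted != gold)

def learn(baseline, truth, candidate_rules):
    """Greedily add the rule that reduces errors most, until none improves."""
    current, learned = list(baseline), []
    while candidate_rules:
        best = min(candidate_rules, key=lambda r: count_errors(apply_rule(current, r), truth))
        if count_errors(apply_rule(current, best), truth) >= count_errors(current, truth):
            break  # no transformation improves the corpus any further
        current = apply_rule(current, best)
        learned.append(best)
    return learned, current

# Hypothetical baseline prediction and annotated truth for a three-word sentence
baseline = ["N", "N", "ADV"]
truth = ["N", "V", "ADV"]
rules = [("N", "V", "N"), ("ADV", "ADJ", "N")]
learned, final = learn(baseline, truth, rules)
```

The order of the learned list matters, because each rule is applied to the output of the rules learned before it.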

2.5 Parsing

Parsing is the process of analyzing a string of symbols, either in natural language or in computer languages, according to the rules of a formal grammar (Wikipedia, 2015).

Parsers produce a syntactic structure of the sentence as output. The syntactic structure may be produced in different forms, such as a constituent tree (c-structure or parse tree), a syntactic feature structure, or a dependency structure (Sangae et al, 2004). See Figures 2.5 to 2.7 below.


Figure 2.5: Syntactic Constituent Structure

Predicate: Anasoma

Tense: Present
Voice: Active
Subject: Predicate: Noun; Person: 1st person; Number: Singular
Object: Predicate: kitabu; Number: Singular

Figure 2.6: Syntactic Feature Structure

subject object

Mwalimu anasoma kitabu

Figure 2.7: Syntactic Dependency Structure

Each structure described above shows how words can be combined to form a sentence. The choice of syntactic structure depends mainly on the purpose of the syntactic analysis. For the purpose of this study, the syntactic constituent structure and the syntactic dependency structure were combined to check for grammar errors.

2.6 Spelling Corrector

A spelling corrector, or spelling checker, is a technique used in many applications to check and correct spelling errors in different languages. The technique is used by Google, Yahoo, Microsoft editors and various editors available online. There are various techniques for checking spelling, such as the Darm Cool Algorithm, the Bloom filter and the good essay by Peter Norvig (see footnote 2). The good essay technique is based on Bayes' probability theory. The Darm Cool algorithm is a tree-based data structure technique, and the Bloom filter uses a space-efficient probabilistic data structure.

2.7 Related Works

Several studies in natural language processing have been conducted by researchers across the continents. For automated grammar checking specifically, several applications are available on the web and in text editors, word processors and other internet applications. For instance, Kernick and Powers (1996) and Naber (2003) proposed rule-based style and grammar checkers for English, Arppe (2000) and Carlberger (2004) developed grammar checkers for Swedish, Bustamante and Leon (1994) presented one (called GramCheck) for Spanish, and Schmidt-Wigger (1998) implemented a grammar checker for German using pattern matching.

2 Good essay is available here: http://norvig.com/spell-correct.html

Despite the many projects using various techniques for different languages, until now there is none available for Swahili. However, there are a number of ongoing NLP projects, such as the Helsinki Corpus of Swahili (HCS), the Kamusi project (1995), and an earlier version of the English-Swahili dictionary by TUKI. Furthermore, Pauw et al. (2006) presented a part-of-speech tagger for Swahili, tested on HCS, using Memory Based Learning.


CHAPTER THREE

METHODOLOGY

The development methodology for this study was adapted from a similar study conducted by Dimalen et al. (2006), who developed a checker add-in for OpenOffice. The development model is described in Figure 3.1, which shows the five main steps followed by the researchers. It started with data collection, review and analysis of Swahili grammar rules, followed by system design, implementation and testing, and finally an experiment to measure system performance.

5 System evaluation: Performance testing using precision and recall.

4 System testing: The improved application

3 System implementation: Designed architecture was transformed into application program using Python.

2 System design: Using object oriented design

1 Data gathering, review and analysis: Preparation of Swahili data set and analysis of grammar rules

Figure 3.1: Research Excursion Steps

3.1 Data Gathering, Review and Analysis

Data were collected from various sources including newspapers, books, Swahili articles and articles from the internet. About 300 sentences were collected and analyzed. A total of 1500 words were used as training data and 500 words as test data. Analysis of the text data involved handling multi-word concepts, reduplicated verbs, domain-specific terms, idioms and other types of words that require special treatment.

To ensure clear lexical data units, Swahili dictionaries such as the KAMUSI project and the TUKI dictionary were used.

3.2 Development Model for Grammar Checker

Figure 3.2 shows the model that was followed by the researchers to develop the prototype. The model describes the important elements of the application prototype and their relationships. Initially, the input text is analyzed into lexical units: the text is broken into tokens and each token is tagged by dictionary look-up. The result is a tagged sentence. A parse tree is then generated for further syntactic analysis. The structure obtained is compared against the rules to find any deviation from them. Incorrect sentences are passed to the correction handler, which diagnoses the problem and suggests a possible correction back to the user.


Figure 3.2: Development Model for Swahili Grammar Checker Prototype

3.2.1 Lexical Analysis

Tokenization is the process of splitting the input text into its smallest units, called tokens. The input sentence or paragraph is broken into words, and each word is analyzed independently. Since the words in a sentence are normally separated by spaces, the process uses a split method with the space as separator. See example 8 below.

Dkt. Nuru anafundisha wanafunzi darasani.

Token = ['Dkt.', 'Nuru', 'anafundisha', 'wanafunzi', 'darasani.'] ………………. (8)
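Example 8 can be reproduced with Python's built-in string splitting. The function name tokenize is hypothetical; the prototype's own tokenizer is described later in the class description.

```python
def tokenize(sentence):
    """Split a sentence into tokens, using whitespace as the separator."""
    return sentence.split()

tokens = tokenize("Dkt. Nuru anafundisha wanafunzi darasani.")
```

Note that punctuation stays attached to the neighbouring word at this stage ('darasani.'), which is why the tagging steps that follow first check for abbreviations and then detach punctuation.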

3.2.1.1 Part of Speech Tagging

In this stage, each token is analyzed and assigned to its word category. This process involves matching each token against the words in the dictionary. The dictionary contains all words in the data set together with their associated tag, class and word root. Part of speech tagging involves the following steps:

i. Find the token in the abbreviation list. If found, assign 'ABBR' to that token and proceed with the next token. If not found, go to step ii;

ii. Check whether the token contains any punctuation mark at its beginning or end. If so, split the token and process each resulting token independently in the next step;

iii. Match the token against each word root in the dictionary. Return the list of all words in the dictionary which have a similar pattern;

iv. Extract the prefixes and suffixes of the token by removing the word root from the token;

v. Match the prefixes and/or suffixes against the prefixes and suffixes of that word root. If they match, assign the tag and class to the token. If not, label the token as unknown.

Table 3.1: Lexical Analysis Example

Step | Dkt. | anafundisha
Step 1 | return ABBR | not found
Step 2 | | no punctuation
Step 3 | | return [fundish]
Step 4 | | pref = [ana], suff = [a]
Step 5 | | return V
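Steps i to v can be sketched in Python as below. Everything concrete here is a hypothetical stand-in: the abbreviation list, the two-entry dictionary, the PUNCT and UNKNOWN labels and the function names are assumptions for illustration, not the prototype's data or code.

```python
import string

ABBREVIATIONS = {"Dkt.", "Bw.", "Bi."}  # hypothetical abbreviation list

# Hypothetical dictionary: root -> (tag, allowed prefixes, allowed suffixes)
DICTIONARY = {
    "fundish": ("V", {"ana", "wana", "nina"}, {"a"}),
    "Nuru": ("N", {""}, {""}),
}

def split_punctuation(token):
    """Step ii: detach punctuation from the beginning and end of a token."""
    core = token.strip(string.punctuation)
    if not core:
        return [token]
    start = token.index(core)
    parts = [token[:start], core, token[start + len(core):]]
    return [part for part in parts if part]

def tag_word(word):
    """Steps iii to v: locate a known root, then check the remaining affixes."""
    for root, (tag, prefixes, suffixes) in DICTIONARY.items():
        position = word.find(root)
        if position == -1:
            continue
        prefix, suffix = word[:position], word[position + len(root):]
        if prefix in prefixes and suffix in suffixes:
            return tag
    return "UNKNOWN"

def tag_tokens(tokens):
    tagged = []
    for token in tokens:
        if token in ABBREVIATIONS:  # step i
            tagged.append((token, "ABBR"))
            continue
        for part in split_punctuation(token):
            if all(ch in string.punctuation for ch in part):
                tagged.append((part, "PUNCT"))
            else:
                tagged.append((part, tag_word(part)))
    return tagged

tagged = tag_tokens(["Dkt.", "Nuru", "anafundisha", "darasani."])
```

Because 'darasani' is missing from the toy dictionary it is labelled UNKNOWN, which mirrors the fallback in step v.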

3.2.1.2 Parse Tree

A parse tree uses symbols to represent a sentence. It is called a parse tree because it has a tree-like structure. Sentences are described from the root to the leaves using phrases and part of speech tags. Figure 3.3 shows an example of a parse tree.

S

KN KT KE

ABBR N V ADV Dkt. Nuru anafundisha darasani.

Figure 3.3: Parse Tree Structure

3.2.1.3 Syntax Analyzer

The parse tree generated in the previous step is then passed through the grammar rules in order to find grammar errors. In practice, the pattern produced is matched against each grammar rule. For example, the parse tree in Figure 3.3 can be analyzed by checking the agreement between noun and verb. Example 9 shows the analysis:

Sentence pattern: ABBR N['1/2-SG3'] V[1/2-SG3-SP+PRES] ADV[17-DER:ni]

Rule: N V -> word_class('N') == '1/2-SG3' and subject_prefix('V') != '1/2-SG3' -> replace(subject_prefix('V'));

Result: The pattern passes through the rule without being flagged, because it does not satisfy the rule's error condition. ………… (9)
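The agreement check in example 9 can be sketched as a small function. The sketch assumes that the verb has already been segmented into a subject prefix and a remainder; the class labels other than '1/2-SG3', the prefix table and the function name are hypothetical and only loosely follow Table 2.2.

```python
# Hypothetical mapping from noun class to the subject prefix its verb must carry
SUBJECT_PREFIX = {"1/2-SG3": "a", "1/2-PL3": "wa", "7/8-SG": "ki", "7/8-PL": "vi"}

def check_subject_verb(noun, verb):
    """Return a suggested correction, or None when noun and verb agree.

    noun is (word, noun_class); verb is (subject_prefix, rest_of_verb).
    """
    word, noun_class = noun
    prefix, rest = verb
    expected = SUBJECT_PREFIX.get(noun_class)
    if expected is None or prefix == expected:
        return None  # the error condition is not satisfied
    return f"{word} {expected}{rest}"

agrees = check_subject_verb(("Nuru", "1/2-SG3"), ("a", "nafundisha"))
suggestion = check_subject_verb(("Mwanasiasa", "1/2-SG3"), ("wa", "meenda"))
```

The first call returns None, as in example 9, while the second proposes replacing the plural subject prefix wa- with the singular a-.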

3.2.2 Grammar Rules Development

Grammar rules have to be developed and tested to ensure that each rule applies to the largest possible data set. To achieve that, the training data set was divided into correct sentences and erroneous sentences. A rule is tested against the erroneous sentences first, and then against the correct sentences. The rule is added to the list of grammar rules if it detects errors in the erroneous corpus and does not return errors against the majority of the correct corpus. To simplify the implementation of grammar rules, the researcher decided to treat rules that are common, such as a missing space after a comma or a missing space before a bracket, as general syntax rules, and rules that are specific to Swahili, such as subject-verb agreement, as grammar rules. Figure 3.4 below shows the general structure of grammar rules.

pattern <- condition -> replacement # message

OR

pattern <- condition -> expression to generate replacement string # message

Figure 3.4: Feature of Grammar Rule

Grammar rule development process took the following steps:

i. Take an error which is not yet covered by the list of grammar rules (normally errors come from the erroneous corpus);

ii. Generate the pattern, condition, replacement and error message associated with that error;

iii. Test the rule pattern against the incorrect sentences. If the rule pattern does not detect the error, rewrite the pattern and go back to step ii. If it does, go to the next step;

iv. Test the rule against the correct sentences. If the rule pattern returns a lot of errors, rewrite the pattern and go back to step ii. If it does not, add the new grammar rule to the list of grammar rules.
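The acceptance test in steps iii and iv can be expressed as a short function. This is a sketch under assumptions: a rule is modelled as a callable that returns True when it flags a sentence, and the 10% tolerance for false flags is a hypothetical threshold, since the text only says "a lot of errors".

```python
def accept_rule(rule, erroneous, correct, max_false_rate=0.1):
    """Accept a rule if it catches an error and rarely flags correct sentences."""
    detects = any(rule(sentence) for sentence in erroneous)           # step iii
    false_flags = sum(1 for sentence in correct if rule(sentence))    # step iv
    few_false = not correct or false_flags / len(correct) <= max_false_rate
    return detects and few_false

def double_space(sentence):
    """A general rule: flag an extra space between words."""
    return "  " in sentence

def flag_everything(sentence):
    """A deliberately bad rule that flags every sentence."""
    return True

erroneous = ["Juma  anakula wali."]
correct = ["Juma anakula wali.", "Walimu wawili wameanguka."]
accepted = accept_rule(double_space, erroneous, correct)
rejected = accept_rule(flag_everything, erroneous, correct)
```

A rule that fails either test would be rewritten and tested again, as in step ii.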

Example: subject-verb agreement rule

Juma/N a-na-kul-a/V wali/N na/CONJ samaki/N. …………. … (10)

3rd SING Sag-Present-root-suffix …..

Agreement rule: Subject (3rd person singular) -> (Subject-Verb agreement – a)

Further subject-verb agreement prefixes in Swahili are presented in chapter 2, Table 2.2.

3.2.3 Spelling Checker

The spelling checker suggests the correct spelling of a word, given the large set of words in the dictionary. The module returns a suggested spelling when a word is not found in the dictionary. All of the collected data (training and test data) is available in the dictionary.

The implementation of the spelling corrector relies on Bayes' theorem. Given an unknown word, we try to find the most likely spelling correction for that word. That is, out of all possible corrections, we look for the correction c that maximizes the probability of c given the word w. See equation 11 below.

$$\hat{c} = \operatorname*{argmax}_{c} P(c \mid w) \qquad (11)$$

Where:

c - is a candidate correction, drawn from the list of possible corrections,

w - is a given word,

P(c|w) - is the probability of correction c given word w.
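A simplified sketch in the spirit of Norvig's essay shows how equation 11 can be approximated. By Bayes' theorem, P(c|w) is proportional to P(w|c) P(c); the sketch assumes that all known words one edit away from w are equally likely under the error model P(w|c), so candidates are ranked by their corpus frequency alone. The toy word counts and function names are hypothetical.

```python
from collections import Counter

# Hypothetical toy corpus counts standing in for the prototype's dictionary
WORD_COUNTS = Counter(["kitabu", "kitabu", "kiti", "mtoto", "watoto", "mwalimu"])
LETTERS = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All strings one deletion, transposition, replacement or insertion away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:] for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in LETTERS]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    """Return the most frequent known word within one edit, else the word itself."""
    if word in WORD_COUNTS:
        return word
    candidates = [w for w in edits1(word) if w in WORD_COUNTS]
    if not candidates:
        return word
    return max(candidates, key=lambda w: WORD_COUNTS[w])

suggestion = correct("kitabo")
```

A fuller version would also consider candidates two edits away and a real error model, but the ranking principle stays the same.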

3.3 System Description

The Swahili grammar checker consists of two parts, a spelling checker and a grammar checker. When a new text is submitted to the grammar checker, it is first checked for spelling by the spelling checker. If there is a misspelled word, the program suggests an alternative word and then terminates. Otherwise, the input text is checked for grammar: if an error rule matches, the program identifies an error at the position of the match, returns a suggestion and then terminates. See Figure 3.5 below.


Figure 3.5: Activity Diagram of Swahili Grammar Checker Prototype

3.4 System Design

The system prototype was designed as a set of modules. This allows future expansion of the prototype to accommodate new requirements when they arise. The system employs an object oriented design approach, which uses classes and objects. Figure 3.6 below shows the class diagram of the Swahili grammar checker.

Figure 3.6: Class Diagram for Swahili Grammar Checker

3.4.1 Class Description

Text Checker: is the main class, which performs the initial operations on the input text. It is associated with the tokenize class, the swa_tagger class and text_chunker. It is the base class that triggers the other operations when a new text is tested against the rules. Refer to appendix A section 1.

Tokenizer: splits the input text into tokens. First it splits the input text into sentences by identifying sentence boundaries using the sentence_boundary() method. This method uses common markers such as the dot, question mark and exclamation mark [.?!] as sentence separators. However, the method takes care of decimal numbers and abbreviations, which normally contain dot symbols. Then the sentences are broken apart into tokens using the split_tense() method. Refer to appendix A section 2.

Swatagger: assigns tags to each token using get_tag() method. This function gets tag from dictionary which contains the list of words with its associated tag, root and class.

Text chunker: divides tagged sentences into chunks. It uses the list of tags defined in the phrase_list variable.

Spelling checker: the purpose of this class is to check spelling. It uses the check_spelling() method to call the spelling algorithm. The algorithm used in this study is borrowed from Peter Norvig's work (see footnote 3). The code was customized to fit the Swahili language.

Grammar checker: is used to check Swahili grammar. This class is associated with the match_rules class, which comprises two sub-classes, grammar rules and general rules. Grammar rules contains 75 rules which are used to check Swahili grammar, and general rules contains 10 rules. Appendix A section 4 lists some of the rules.

3 Spelling corrector source code is available here: http://norvig.com/spell-correct.html

3.5 System Prototype Implementation

The implementation of the system prototype was based on the architecture designed in figure 3.4. The prototype was developed using the Python language, because Python is well suited to the application and has been used by various researchers to develop natural language tools, such as Naber (2003), who developed a grammar checker for the English language, and Bird et al. (2009), who used it to develop the Natural Language Toolkit (NLTK). Python is also an object oriented language with the following characteristics:

i. Python is system independent, i.e., it runs at least on Unix/Linux, Windows and Mac platforms.

ii. Python supports object oriented programming techniques, and it makes use of object-orientation features such as inheritance and association.

iii. Python allows for fast application development due to its built-in support for common data types like lists and dictionaries (hash tables) and a powerful standard library which supports, e.g., regular expressions, object serialization and XML parsing.

iv. Python supports Unicode in a natural way, i.e., input strings are decoded (interpreted) in a given encoding and can then be accessed as Unicode strings.

v. Python is implicitly typed, i.e., a programmer does not have to declare the type of a variable, but the variable will have a type nonetheless.

3.6 System Testing and Evaluation

Finally, the system prototype was tested to ensure that it works properly. Its performance was then measured using recall and precision. The data were divided into two groups, training data and test data. The training data were used to test rules as explained in section 3.2. The test data were then checked against the rules that had been generated and tested using the training data. Experiments were conducted and the results for the recall and precision performance measures are presented.

Precision was used to express the ability of the system to flag only relevant spelling and grammar errors. Recall was used to express the ability of the system to detect the relevant spelling and grammar errors present in the text. See equations 12 and 13 below.

$$\text{Precision} = \frac{\text{Number of correctly detected errors}}{\text{Number of detected errors}} \times 100\% \qquad (12)$$

$$\text{Recall} = \frac{\text{Number of correctly detected errors}}{\text{Number of errors}} \times 100\% \qquad (13)$$

The F-measure (F-score) was then used to measure the accuracy of the results obtained. The F-measure combines precision and recall as shown in equation 14.

$$F\text{-score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (14)$$

The result of experiments is presented in chapter 4.

CHAPTER FOUR

RESULTS AND DISCUSSION

4.1 The Developed System Prototype

The system prototype for checking the spelling and grammar of Swahili language consists of two main parts: one for checking spelling and the other for checking grammar. Incorrect spelling is identified by the application when the user misspells a Swahili word. The spelling module relies on the data set defined by the researchers. Any word outside the defined data set is considered incorrect. The system prototype detects incorrect spelling and suggests the correct spelling back to the user (see Figure 4.1).

Figure 4.1 Spelling Error

The other part is for detecting and correcting grammar errors. The application checks for spelling errors first, and then for grammar errors. Grammar error detection relies on the rules defined. There are 85 rules defined so far in the system prototype. Of these, 10 are general rules which detect general errors such as an extra space between words or a missing space after a full stop (Figure 4.2). The remaining rules detect Swahili grammar errors such as subject-verb agreement and agreement between noun and adjective and between noun and pronoun (Figures 4.3, 4.4 and 4.5). Table 4.1 shows the types of grammar errors which were implemented and tested. More errors can be added to the existing rules in order to improve the performance of the grammar checker.

Figure 4.2: General Error

Figure 4.3: Noun Follow Adjective

Figure 4.4: Subject Verb Agreement


Figure 4.5: Agreement between Noun and Possessive Pronoun

Table 4.1: Types of Grammar Errors

Pattern | Error | Example
ADJ N | Adjective follows noun | Mkubwa nyoka ameuawa.
INF V | Verb follows infinitive verb | Mwalimu anaenda alikuwa sokoni.
ADV V | Adverb follows verb | Juma haraka anatembea.
N ADJ | Agreement between noun and adjective | Mtoto wazuri wanaimba.
V | Subject verb agreement | Mwanasiasa wameenda hospitalini.

4.2 General Discussion

This section presents a general discussion of the study with regard to the development of a rule based grammar checker for Swahili language. The study was guided by four sub studies corresponding to the specific objectives and research questions presented in section 1.3 of chapter one. The results obtained from those sub-studies are presented as follows:

What are the types of classes of words that Swahili language has?

In computational linguistics, the analysis of the word classes of a language is important, since words in the same class have the same characteristics. In Swahili there are about eight word classes, namely nomino (noun), kitenzi (verb), kielezi (adverb), kivumishi (adjective), kiwakilishi (pronoun), kihusishi (preposition), kiunganishi (conjunction) and kihisishi (interjection). These main classes are further divided into sub-classes, as discussed in detail in section 2.3.1.1 of chapter two.

What parse tree structures can represent syntactic structures of sentences in Swahili language?

A parse tree structure uses symbols to represent the characteristics of each word in a sentence. The parse tree structures used by the researcher to represent sentences in Swahili are explained in chapter 3, section 3.2.1.2. The structure consists of phrases and part of speech tags. A phrase consists of at least one tag. Examples of phrases used in Swahili are the noun phrase (Kirai nomino), the verb phrase (Kirai kitenzi), etc.

How grammatical rules can be designed to represent the syntax of the largest possible set of sentences in Swahili?

Swahili grammar has many rules. In this study, the researchers identified and used 85 rules for detecting Swahili grammar errors. Each grammar rule was generated from the erroneous corpus and then tested against both the erroneous and the correct corpus. Grammar rules use tags rather than words to test a pattern. For example, the rule of agreement between noun and adjective ensures that, if the class of the noun is '1/2-SG', then the class of its adjective should be '1/2-SG'; otherwise, an error flag is displayed. This rule therefore covers all nouns which belong to that class when they are followed by an adjective. The coverage of sentences tested by this rule is large compared to a rule that applies to a specific word (refer to chapter 3, section 3.2.2).

What code best represent grammatical rules generated for checking Swahili grammar?

The prototype has been implemented based on the design which follows object oriented approach. Object oriented design was employed due to its maintainability, reliability and flexibility, as well as reusability. The researchers have implemented the design using Python programming language. Python was selected because of its flexibility and popularity on handling natural language processing applications. Therefore, python was identified and used to develop grammar checker prototype for Swahili language using object oriented design approach.

What is the performance of the developed tool in detecting grammatical errors in Swahili texts?

The performance of the proposed grammar checker prototype was measured using precision, recall and accuracy (F-measure). Because the prototype checks both spelling and grammar, the performance of each was measured separately: the technique used to check spelling is independent of the one used to check grammar. However, spelling must be checked before grammar.

The spelling checker achieved 81% accuracy for every ten misspelled words. The performance of the grammar checker is presented in table 4.1 below.


Table 4.1: Performance Results for Swahili Grammar Checker Prototype

Measuring criteria                         Value
Number of incorrect flags                  36
Number of correct flags                    114
Total number of flags                      150
Total number of errors in the test set     160

Based on the results shown in table 4.1, the computed precision is 76% and recall is 71%. The accuracy (F-measure) obtained was 73%.
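The reported figures can be checked with the standard definitions of precision, recall and F-measure. The sketch below assumes that precision is the share of correct flags among all flags and that recall is the share of correct flags among all errors in the test set; the function name is illustrative.

```python
def precision_recall_f(correct_flags, total_flags, total_errors):
    """Compute precision, recall and the balanced F-measure."""
    precision = correct_flags / total_flags
    recall = correct_flags / total_errors
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Counts taken from table 4.1
p, r, f = precision_recall_f(correct_flags=114, total_flags=150, total_errors=160)
print(f"precision={p:.4f} recall={r:.4f} f_measure={f:.4f}")
```

With the counts from table 4.1 this gives a precision of 0.76, a recall of about 0.71 and an F-measure of about 0.735, in line with the reported values.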


CHAPTER FIVE

CONCLUSION AND RECOMMENDATION

5.1 General Conclusion

The main goal of this research work was to design and implement a system prototype that uses a rule-based approach for detecting and correcting grammatical errors in Swahili text automatically. In achieving this goal, four studies were conducted. In these studies, Swahili word categories were identified, and the structure of Swahili sentences was then analyzed and presented in a parse tree structure. Rules were generated from the erroneous corpus (a set of Swahili sentences). Each rule was tested to ensure that it applies to the majority of Swahili sentences. Finally, the rules were tested to measure their performance in detecting grammatical errors.

The performance of the system prototype is promising. Most of the false flags are related to words with multiple classes and to complex sentences. This is because most of the rules were generated from simple sentences.

5.2 Recommendation and Future Work

This study used a small data set to generate Swahili grammar rules for detecting errors. To arrive at an automated grammar checker that covers almost all possible errors, the researcher proposes the following:

i. Increasing the size of the data set used for training and testing the system prototype. This includes the size of the erroneous corpus.

ii. Handling the concept of multiple tags when implementing a grammar checker for the Swahili language.

iii. Expanding the number of grammar rules in order to handle more errors.

iv. Conducting further research into other techniques for solving this problem, such as statistical methods and unsupervised learning.

v. Developing new solutions to the problem of grammar checking for the Swahili language.

50 REFERENCES

Antti Arppe. (2000), Developing a Grammar Checker for Swedish, Lingsoft, Helsinki University, Finland.

Ashton, E. O. (1982). Swahili Grammar. London: Longman, Green and Co. Ltd.

Bird S., Klein E. and Loper E. (2009), Natural Language Processing with Python, O‟Reilly Media.

Brill E. (1995), „Transformation-Based error driven learning and natural language processing: A case study on part-of-speech tagging‟ In Maryland, Proceedings of Computational linguistics Vol. 21.

Brill E. (2008),‟Some advances in Transformation-Based Part of speech tagging‟, In Massachusetts, Proceedings of the National conference on artificial intelligence (AAAI-94).

Bustamante and Leon, (1996), “GramCheck: A Grammar and Style Checker”, in ARXIV, proceedings of COLING-96 Vol 1.

Carlberger J., Domeij R., Kann V., Knutsson O. (2004), “The Development and Performance of a Grammar Checker for Swedish: A Language Engineering Perspective”, Natural Language Engineering.

Claudia Leacock, Martin Chodorow, Michael Gamon, Joel Tetreault, 2010, Automated Grammatical Error Detection for Language Learners, Morgan & ClayPool, Inc.

Daelemans W. and Van den Bosch A. (2005), Memory-based Language Processing, Cambridge University press, Cambridge.

Editha D. Dimalen, Davis D. Dimalen (2006), An OpenOffice Spelling and Grammar CheckerAdd-in Using an Open Source External Engine as

51 Resource Manager and Parser, Unpublished report, MSU- Iligan Institute of Technology, Iligan city.

Eric Brill (1995), “Transformation-Based Error Driven learning and natural language processing: A case study in part of speech tagging”, Computational Linguistics, Vol. 21, No. 4, 544-566

Grigori S., Anubhav G., Martin T., Dolors C., Catena A. and Fuentes S. (2013), Rule-based System for Automatic Grammar Correction Using Syntactic N- grams for English Language Learning (L2), unpublished report, Autonomous University, Barcelona.

Henrich V. and Reuter, T. (2009), LISGrammerChecker: Language Independent Statistical Grammar Checking, unpublished report, Hochschule Darmstadt & Reykjavík University, Aislandi.

Hinnebusch T. (1996), “What kind of language is Swahili?” Afrikanistische Arbeitspapiere, Vol. 1. No. 47, 73-95

Hurskainen, A (2004), “HCS - Helsinki Corpus of Swahili. Compilers” Institute for Asian and African Studies (University of Helsinki) and CSC. No. 3, 363–397.

James R. Curran and Raymond K. Wong (1998), Formalisation of Transformation- based Learning, Unpublished report, University of Sydney, New South Wales.

Jensen, G. E. Heidorn, L. A. Miller, and Y. Ravin (1983), “Parse fitting and prose fixing: Getting a hold on ill-formedness.” Comp. Linguistics, Vol. 9, No. 3, 147–160.

Jurcıcek F., Gasic M., Keizer S., Mairesse F., Thomson B., Yu K., and Young S.(2009), Transformation-based Learning for Semantic parsing, Unpublished report, Cambridge University, Cambridge.

52 Kenji S., Brian M. and Lavie A. (2000), „Automatic parsing of parental verbal input‟, [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1880885], site visited on 7/08/2015.

Kernick S. and Powers M. (1996), A Statistical Grammar Checker, Unpublished report, Flinders University, SA.

Kukich K. (1992), “Techniques for automatically correcting words in text,” ACM Computing Surveys (CSUR), Vol. 24, No. 4, 377–439.

Kummert F., Fakultät T., Witt A., Linguistik F., and Schaft W. (2003), A Rule-Based Style and Grammar Checker, Unpublished report, Technische Fakultät, Universität Bielefeld.

Leacock C., Chodorow M., Gamon M., and Tetreault J. (2010), “Automated grammatical error detection for language learners,” Synthesis Lectures on Human Language Technologies, Vol. 3, No. 1, 1–134

Mangu L. and Brill E. (1997) , Automatic Rule Acquisition for Spelling Correction, John Hopkins University, Maryland.

Mangu, L. & E. Brill. (1997). “Automatic Rule Acquisition for Spelling Correction”. In ICML: Proceedings of the 14th International Conference on Machine Learning. Vol. I.

Manning D. and Schutze H. (1999), Foundations of Statistical Natural Language Processing, The MIT press, London.

Martin Haspelmath, Uri Tadmor. (2009), Loanwords in the World's Languages: A Comparative Handbook, Walter de Gruyter GmbH & Co. KG, Berlin.

Martins T., Hasegawa R., Das Gra M. ¸ VolpeNunes, Montilha G., and De Oliveira O. (1998), Linguistic Issues in the Development of regra: A grammar checker for Brazilian Portuguese. Natural Language Engineering, Vol. 4, No. 4, 287– 307.

53 MOE (2014), Education and Training Policy, Unpublished report, Ministry of Education and Vocational Training, Dar es alaam

Mozgovoy M. (2010), Dependency-based Rules for Grammar Checking with Language Tool, Unpublished report, University of Aizu, Fukushima.

Msanjila P. (2005), “Problems of Writing in Kiswahili: A Case Study of Kiguru nyembe and Morogoro Secondary Schools in Tanzania”, Nordic Journal of African Studies, Vol. 1, No. 14, 15-25

Naber D. (2003), A Rule Based Style and Grammar Checker, Unpublished report, Technische Fakultät, Universität Bielefeld

Nava Ehsan, Heshaam Faili (2011). “Statistical Machine Translation as a Grammar Checker for Persian Language”, In ICCGI, Proceedings of the Sixth International Multi-Conference on Computing in the Global Information Technology (IARIA).

Polome C.(1967), Swahili Language Handbook, Center for applied linguistics, NW.

Ramshaw L. and Marcus M. (1995), „Text Chunking using TBL‟, Proceedings on ACL third workshop on very large corpora. Vol. 1.

Robert and Valin (2001), An Introduction to Syntax, Cambridge University Press, Cambridge.

Rolit J. Yuk W. and Raymond J. (2005), „Leaning to transform natural to formal languages‟ In Pittsburgh, Proceedings of the twentieth National conference on artificial intelligence (AAAI-05) Vol. 2.

Schmidt-Wigger (1998), “Grammar and Style Checking in German”, Proceeding in International Workshop on Controlled Language Applications, Vol 2.

Tesfaye D. (2011), „A rule-based Oromo grammar checker‟, International Journal of Advanced Computer Science and Applications, Vol. 2, No. 8, 126-130

54 Tesha T. (2012), Automating the classification of Swahili documents, unpublished report, The University of Dodoma, Dodoma.

Theodorus G. Vosse (1994). The Word Connection. Grammar-Based Spelling Error Correction in Dutch. Enschede, Neslia Paniculata.

Thompson K. and Schleicher A. (2001), Swahili Learners’ Reference Grammar, NALRC Press, Ohio.

Tom M., Bastiaanse R. and Emma S. (2013), „Sentence Comprehension in Swahili- English Bilingual Agramatic Speakers‟ In UK, Proceedings of Clinical Linguistics and Phonetics Vol. 5. No. 27, 355-370.

Walter Daelemans, Jakub Zavrel, Antal van den Bosch and Ko van der Sloot (2007), “MBT: Memory-Based Tagger, Reference Guide”, [http://ilk.uvt.nl/downloads/pub/papers/ilk.0709.pdf] , site visited on 5/06/2015.

Wikipedia encyclopedia, (2015), “Language tool wiki: Open source proof-reading tool” Available from: http://wiki.languagetool.org/development-overview [Accessed 19 May 2015]

APPENDICES

Appendix A

1.0. Text Checker Class

------Begin class definition------

__author__ = 'Haji Idd Bamsi'

import re
from collections import OrderedDict

# import associated classes
# (tokenize is the project's own module, see section 2.0, not the standard-library one)
from tokenize import tokenize
from swaTagger import swaTagger
from chunks import chunker

chunkFile = 'resource/chunks.txt'


class text_checker(object):

    sentences = OrderedDict()
    tagged_tense = []
    phrase_Set = OrderedDict()
    phrase_Tag = OrderedDict()
    super_Tag = OrderedDict()

    def __init__(self):
        pass

    def set_parameters(self, sentence):
        # tokenize the sentence and tag every token
        self.super_Tag.clear()
        splitText = tokenize()
        tense = splitText.split(sentence)
        self.super_Tag = self.get_tags(tense)
        #return self.sentences

    def split_tense(self, inputText):
        # split the input text into sentences
        splitText = tokenize()
        return splitText.splitInputText(inputText)

    def get_tags(self, sentence):
        # look up the tag of each token and collect the results
        tagging = swaTagger()
        self.tagged_tense.clear()
        for each_item in sentence:
            tag = tagging.get_tag(each_item)
            self.tagged_tense.append(tag.popitem())
        return self.tagged_tense

    def tagged_phrase(self, item):
        # return the part-of-speech tag of a token, or None if it was not tagged
        for each_item in self.tagged_tense:
            if item == each_item[0]:
                return each_item[1][0]['pos']
        return None

    def get_token(self, sentence):
        splitText = tokenize()
        return splitText.split(sentence)

    def get_chunks(self, phrase):
        # group the tagged tokens into phrases
        chunking = chunker(chunkFile)
        chunking.chunk(phrase)
        self.phrase_Set = chunking.phraseSet
        self.phrase_Tag = chunking.phraseTag
        return self.phrase_Tag

------End of class definition------

2.0. Tokenize Class – Split Method

def split_tense(self, sentence):
    lexicon.clear()
    wordList = re.split(r"\s+", sentence)  # split sentence using whitespace
    # loop through each word to check for special characters
    for token in wordList:
        if self.is_abbrev(token):
            lexicon.append(token)  # append token into list
            # lexicon[token] = abbr
        elif self.contains_punct(token):
            continue  # skip this token
        else:
            lexicon.append(token)
    return lexicon

3.0. Spelling Corrector

__author__ = 'Haji Idd Bamsi'

import re
from collections import defaultdict

data_set = 'resource/data_set.txt'
alphabet = 'abcdefghijklmnopqrstuvwxyz'


class spelling:

    # word frequency model, shared by all instances
    NWORDS = defaultdict(lambda: 1)

    def __init__(self):
        self.set_nwords()

    def words(self, text):
        return re.findall('[a-z]+', text.lower())

    def set_nwords(self):
        # build the word frequency model from the training text
        with open(data_set) as f:
            spelling.NWORDS = self.train(self.words(f.read()))

    def train(self, features):
        model = defaultdict(lambda: 1)
        for f in features:
            model[f] += 1
        return model

    def edits1(self, word):
        # all strings that are one edit away from word
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [a + b[1:] for a, b in splits if b]
        transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
        replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
        inserts = [a + c + b for a, b in splits for c in alphabet]
        return set(deletes + transposes + replaces + inserts)

    def known_edits2(self, word):
        # known words that are two edits away from word
        return set(e2 for e1 in self.edits1(word)
                   for e2 in self.edits1(e1) if e2 in spelling.NWORDS)

    def known(self, words):
        return set(w for w in words if w in spelling.NWORDS)

    def correct(self, word):
        # strip trailing whitespace and a final full stop before the lookup
        word = word.rstrip()
        if word[-1:] == '.':
            word = word[:-1]
        candidates = (self.known([word]) or self.known(self.edits1(word))
                      or self.known_edits2(word) or [word])
        return max(candidates, key=spelling.NWORDS.get)
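The listing above depends on the training file resource/data_set.txt. The following self-contained sketch shows the same candidate-generation idea on a toy word list, restricted to candidates one edit away. The word counts and function names are assumptions made for illustration and are not the prototype's data.

```python
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """All strings one deletion, transposition, replacement or insertion away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def correct(word, counts):
    """Return the most frequent known word within one edit, else the word itself."""
    if word in counts:
        return word
    candidates = [w for w in edits1(word) if w in counts]
    if not candidates:
        return word
    return max(candidates, key=counts.get)


# Toy frequency model (hypothetical counts)
counts = Counter(["kitabu", "kitabu", "mtoto", "watoto", "nyumba"])

print(correct("kitbu", counts))
print(correct("mtotto", counts))
```

A word that is already known, or that has no known neighbour within one edit, is returned unchanged.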

4.0. Grammar Rules & General Rules

------General rules------
#Extra space
 +-> #Extra space

#Suggest the missing space after the !, ? or . signs:
\b([?!.])([a-zA-Z]+) ->\1 \2 # Missing space?

#suggest space after (
\(\b ->( # Missing space after (

#suggest space before (
\b(\S)(\()->\1 \2# Missing space before (

#suggest space after )
\b(\))(\S)->\1 \2# Missing space after )

------Grammar rules------
ADJ N->N ADJ #Noun follow adjective

V->self.subject_prefix() != '5/6-PL' and (self.subject_class(tag_set) == '5/6-PL' or self.subject_class(tag_set) == '5/6-0')->self.highlight(tag_set)#Subject does not agree with verb

V->self.subject_prefix() != '7/8-SG' and (self.subject_class(tag_set) == '7/8-SG' or self.subject_class(tag_set) == '7/8-0')->self.highlight(tag_set)#Subject does not agree with verb

V->self.subject_prefix() != '7/8-PL' and (self.subject_class(tag_set) == '7/8-PL' or self.subject_class(tag_set) == '7/8-0')->self.highlight(tag_set)#Subject does not agree with verb

N PRO-POSS->self.word_class('PRO-POSS') != '18-LOC' and self.word_class('N') == '18-LOC'->self.highlight(tag_set)#Possessive pronoun does not agree with noun
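The general rules pair a regular expression with a replacement and a message. One way such rules could be applied is sketched below. The rule table and function name are assumptions for illustration, and the extra-space pattern is a guess because the whitespace of the original pattern did not survive in the listing.

```python
import re

# (pattern, replacement, message) triples modelled on the general rules
GENERAL_RULES = [
    (r" {2,}", " ", "Extra space"),                      # assumed pattern
    (r"\b([?!.])([a-zA-Z]+)", r"\1 \2", "Missing space?"),
]


def apply_general_rules(text):
    """Apply each rule in order; return the corrected text and the messages raised."""
    messages = []
    for pattern, replacement, message in GENERAL_RULES:
        if re.search(pattern, text):
            messages.append(message)
            text = re.sub(pattern, replacement, text)
    return text, messages


fixed, messages = apply_general_rules("Habari.Karibu  sana")
print(fixed)
print(messages)
```

The grammar rules differ in that their left-hand side is a tag pattern and their condition calls methods of the checker, so they cannot be applied with re.sub alone.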

Appendix B

Dictionary data

alikimbia V kimbi 1/2-SG3-SP+PAST
kabla PREP kabla -
dukani N duka 17-LOC
. FULLSTOP - -
shuka N - 3/4-SG
- -
karata N - 9/10-0
barua N barua 9/10-0
zawadi N - 9/10-0
tulimwona V ona 1/2-PL1-SP+PAST+1/2-SG3-OBJ
nyumba N - 9/10-0
alikusanya V - 1/2-SG3-SP+PAST
nyoka N nyoka 9/10-0
? QNMARK - -
latifa N latifa PROPNAME
hiki PRO-DEMO hi 7/8-SG
kanita N kanita PROPNAME
kukutana INFV kutana 15-SG-SP
wamejipamba V pamb 1/2-PL3-SP+PERF+REFL-OBJ
saba N saba NUM
wanataka V tak 1/2-PL3-SP-PRES
jua N jua 5/6-0
wangu ADJ-POSS angu 1/2-SG
juani N jua 17-LOC
chumvi N chumvi 9/10-0
kali ADJ-DESC kali 5/6-0|9/10-0
hajakuandikia V andik 1/2-SG3-NEG-SP+PERF+15-SG-OBJ
liliwaka V wak 5/6-PL-SP+PAST
linawaka V wak 5/6-PL-SP+PAST
kamati N kamati 9/10-0
mchana ADV mchana -
alitembea V tembe 1/2-SG3-SP+PAST
changu ADJ-POSS changu 7/8-SG
mto N to 3/4-SG
kutwa ADV kutwa -
mama N mama 1/2-0
kula V kula -
tayari ADV tayari -
runinga N runinga 9/10-0
haraka ADV haraka -
ya ADJ-POSS ya 3/4-PL
mwalimu N limu 1/2-SG
bwana N bwana 1/2-SG
akaelekea V eleke 1/2-SG3-SP+PERF
anampenda V pend 1/2-SG3-SP+PRES+1/2-SG3-OBJ
kisha ADV kisha -
hawa PRO-DEMO hawa 1/2-PL
kazito N kazito PROPNAME
mwanasiasa N nasiasa 1/2-SG
imeharibika V harib 9/10-SG-SP+PERF
siasa N siasa 9/10-0
tena ADV tena -
jumamosi N jumamosi -
amina N amina PROPNAME

sana ADV sana -
kucheza INFV chez 15-SG-SP
guyo N guyo PROPNAME
yeye PRO-PERS yeye SG3
aliharibu V haribu 1/2-SG3-SP+PAST
alinipiga V pig 1/2-SG3-SP+PAST+1/2-SG1-OBJ
karai N karai 9/10-0
kufua INFV fua 15-SG-SP
wao PRO-PERS wao 1/2-PL3
katika ADV katika -
walimpitisha V pit 1/2-PL-SP+PAST+1/2-SG3-OBJ
kilitokea V toke 7/8-SG+PAST
ilikuwa AUXV kuwa 3/4-PL|9/2-0+PAST
nani PRO-INTEROG nani -
radio N radio 9/10-0
za ADJ-POSS za 9/10-PL
nini PRO-INTEROG nini -
anahitaji V hitaji 1/2-SG3-SP+PRES
ulimpa V pa 1/2-SG2-SP+PAST+1/2-SG3-OBJ
mtu N tu 1/2-SG
watu N tu 1/2-PL
ulisafiri V safiri 1/2-SG2-SP+PAST
kilometa N kilometa 9/10-0
ni V ni UNINFL-V:ni
yake PRO-POSS ke 1/2-SG3
mtoto N toto 1/2-SG
siku N siku 9/10-0
yule PRO-DEMO yule 1/2-SG
wazuri ADJ zuri 1/2-PL
mjane N mjane 1/2-SG
nzuri ADJ zuri 9/10-0
nina PRO-PERS nina 1/2-SG1
mzuri ADJ zuri 1/2-SG
rafiki N rafiki 9/10
nyingine ADJ ingine 9/10-0
ambaye PRO-REL ambaye 1/2-SG3
kizuri ADJ-DESC zuri 7/8-SG
anaishi V ishi 1/2-SG3-SP+PRES
na CONJ na -
zanzibar N zanzibar PROPNAME
riziki N riziki 9/10-0
unataka V tak 1/2-SG3-SP+PRES
nilikumbuka V kumbuk 1/2-SG1-SP+PAST
kitu N tu 7/8-SG
chochote PRO-INDEF chochote -
mkubwa ADJ-DESC kubwa 1/2-SG
kitabu N kitabu 7/8-SG
anasikiliza V sikiliz 1/2-SG3-SP+PRES
walimu N lim 1/2-PL
mwenyewe ADJ-POSS enyewe 1/2-SG
hivi ADV hivi 7/8-PL
ulinunua V nunua 1/2-SG2-SP+PAST
anapenda V pend 1/2-SG3-SP+PRES
gani ADJ-INTEROG gani -
kutia INFV tia 15-SG-SP
hodari ADJ-DESC hodari -

shuleni N shule 17-LOC
juma N juma PROPNAME
, COMMA - -
kwa PREP kwa -
nilikua V kuwa 1/2-SG1-SP+PAST
ataenda V end 1/2-SG3-SP+FUT
mdogo N dogo 1/2-SG
sababu CONJ sababu -
wanachama N nachama 1/2-PL
sukari N sukari 9/10-0
walikuja V kuj 1/2-PL3-SP+PAST
wakati CONJ wakati -
saa N saa -
nilipoenda V enda 1/2-SG1-SP+PAST+16-SG-REL
alipewa V pewa 1/2-SG3-SP+PAST
yote ADJ yote -
asubuhi N asubuhi 9/10-0
hizo ADJ hizo 9/10-PL
babake ADJ-POSS baba 3/4-PL
kuzikunjia V kunj 16-SG-SP+9/10-PL-OBJ
nuru N nuru PROPNAME
alikuwa AUXV kuwa 1/2-SG3-SP+PAST
kwenye ADV kwenye -
moja NUM moja NUM
wavuvi N vuvi 1/2-PL
akaziweka V wek 1/2-SG3-SP+PERF+9/10-PL-OBJ
wenzake ADJ-POSS wenzake 1/2-PL
nane NUM nane NUM
kutafuta INFV tafuta 15-SG-SP
nilimpiga V pig 1/2-SG1-SP+PAST+1/2-SG3-OBJ
tana N tana PROPNAME
gumato N gumato PROPNAME
ameondoka V ndok 1/2-SG3-SP+PERF
pia ADV pia -
ameenda V end 1/2-SG3-SP+PERF
sokoni N soko 17-LOC
nyumbani ADV nyumba 17-LOC
ndugu N ndugu 9/10-0
hii PRO-DEMO hii 3/4-PL
zake ADJ-POSS zake 9/10-PL
kuelekea INFV elekea 15-SG-SP
wadogo ADJ dogo 1/2-PL
yupi PRO-DEMO yupi 1/2-SG
munga N munga PROPNAME
yao ADJ-POSS yao 3/4-PL
habuko N habuko PROPNAME
kaelekea V elekea 1/2-SG3-SP
sawa ADV sawa -
hizi PRO-DEMO hizi 9/10-PL
kazi N kazi 9/10-0
wameondoka V ndok 1/2-PL3-SP+PERF
nguo N nguo 9/10-0
familia N familia 9/10-0
watoto N toto 1/2-PL
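Each dictionary entry appears to hold four whitespace-separated fields. The sketch below reads one entry under that assumption. The field names (word, pos, stem, features) and the treatment of '|' as a separator between alternative classes are interpretations of the data, not definitions taken from the prototype.

```python
from typing import NamedTuple


class LexEntry(NamedTuple):
    word: str      # surface form
    pos: str       # part-of-speech tag, e.g. N, V, ADJ-DESC
    stem: str      # stem, or '-' when none is given
    features: str  # class and inflection tags, or '-' when none is given


def parse_entry(line):
    """Split one dictionary line into its four fields."""
    word, pos, stem, features = line.split()
    return LexEntry(word, pos, stem, features)


def alternatives(entry):
    """Return the alternative class tags of an entry, assuming '|' separates them."""
    return entry.features.split("|")


entry = parse_entry("kali ADJ-DESC kali 5/6-0|9/10-0")
print(entry)
print(alternatives(entry))
```

Entries with several alternatives, such as kali, are the kind of word that the multiple-tags recommendation in section 5.2 refers to.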
