The University of Birmingham School of Computer Science MSc in Advanced Computer Science

Summer Project

Syllable Identification

Norshuhani Zamin

Supervisor: Dr. W H Edmondson

September 2004

Abstract

Syllabification is a long-standing linguistic problem, and developing computer software to predict syllable boundaries is a challenging task. In practice, it is easier to determine syllable boundaries manually, especially in a syllabic spelling system, because we know the linguistic elements of the language. Identifying syllable boundaries for English is a daunting process because English uses an alphabetic spelling system. To write such software, it is traditionally assumed that various sources of linguistic knowledge must be incorporated in order to convert words into their syllable structure with reasonable accuracy. This linguistic knowledge is important for defining the graphotactic and phonetic rules.

The purpose of this project has been to investigate the problem of English syllabification and to present two different approaches to the automatic detection of syllable boundaries. The first approach syllabifies a text from its graphemes, or symbols, while the second approach syllabifies a text from its sounds. It was found that much existing research on syllabification adopts the second approach. Although different researchers propose different knowledge structures, most of them use the typical architecture for grapheme-to-phoneme conversion, whereas going from text to grapheme or symbol is a new technique.

In this project, I demonstrate the use of hand-written rules for English syllabification and knowledge structures trained on both approaches, and I compare the performance and accuracy of these approaches. The evaluation shows that going from text to symbol is easier and performs better at finding the syllable boundaries than going from text to sound. Recommendations for future projects of this nature are made.

Keywords Syllable; syllabification; maximum onset principle; phonotactic; graphotactic; diagraph rule; silent rules; orthography; syllabic consonant; consonant clusters; segmentation; constraints.

Acknowledgement

After the long period spent completing this master's degree thesis, I would like to express my sincere gratitude to the following people, who each contributed in some way to this thesis.

Dr. William Edmondson, my supervisor, for the incredible amount of patience he has had with me since we first met. He was the one who inspired me to work on natural language processing, which I had never considered before. It was an absolute pleasure to have him as my supervisor. Thank you for the many discussions, the motivation and the wise words. I owe him a great deal of gratitude and I am very glad to have got to know him.

Dr. Peter Coxhead, my second supervisor, who took over the supervision when Dr. William Edmondson was away on sabbatical. I learned many things from him and he was always very kind to me. As the Academic Manager in the school he is always busy, but he was always available when I needed his advice. I am grateful for his invaluable support and excellent guidance.

Dr. Ela Claridge, my Academic Advisor, for providing academic and support services. Thank you for your advice while monitoring my academic progress.

Last but not least, I am very grateful to my husband, Azmi, for his love and patience during my study period. One of the best experiences that we lived through in this period was the birth of our son, Anwar Aliff, who provided an additional and joyful dimension to our life.

Contents

Abstract and keywords
Acknowledgement
Figures
Tables

1 Introduction
1.1 Background
1.2 Objective
1.3 Organization of the studies
1.4 Scope and limitation
1.5 Research methodology

2 Literature Review

3 An Overview of English Spelling
3.1 Phonology and orthography
3.2 Consonants and vowels
3.3 Syllable structure
3.4 Problem to overcome

4 Methods
4.1 Approaches
4.1.1 Text → Symbol → Syllable
4.1.2 Text → Sound → Syllable
4.2 Data collection and analysis
4.3 Rules construction
4.4 Syllabification with Maximum Onset Principle

5 Implementations
5.1 Memory
5.2 Data structures
5.3 Modularity
5.4 Input / Output
5.5 Features
5.6 Tools

6 Performance and Justifications

7 Conclusions and Future Work

References

Bibliographies

Appendices

A Summer project declaration
B IPA symbols with corresponding ASCII
C English loanwords
D IPA full chart
E System Requirements and User Guide

Figures

3.1 Poem illustrating difficulties of English spelling and sounds
3.2 Great Vowel Shift process
3.3 The consonants of English
3.4 The vowels of English
3.5 Diagram of vocal organs and articulatory regions
3.6 Conventional Syllable Structure
3.7 Example of William and Zhang's approach on syllable structure
4.1 Syllabification flow chart
4.2 Maximum Onset Principle process
5.1 User interface
5.2 Sample output for 'signification'
5.3 Sample output for word 'surreptitious'
5.4 Sample output for word 'bedridden'
5.5 Sample output for word 'representation'
5.6 Sample output for word 'antidisestablishmentarianism'
5.7 Sample output for text 'access accurate occident accompany'
6.1 Text → Symbol → Sound with syllable boundaries

Tables

3.1 Name of vocal organs and articulatory regions
3.2 English open syllables
3.3 English closed syllables
4.1 Example of Text → Symbol → Syllable approach
4.2 Example of Text → Sound → Syllable approach
4.3 Graphotactic rules
4.4 List of permissible onset sequences for symbol
4.5 Pattern of consonant clusters for symbol
4.6 Diagraph rules
4.7 Vowel rules
4.8 Consonant rules
4.9 Silent rules
4.10 List of permissible onset sequences for sound
4.11 Pattern of consonant clusters for sound
6.1 Output comparison table

CHAPTER 1 INTRODUCTION

I am unaware of just how far back the concept of syllabification goes, but it has been a recurring idea ever since linguists developed the idea of syllable structure in this century. This work examines the nature of syllable structure in English and how syllable boundaries can be identified automatically, considering two different routes to syllabification. Readers with a basic knowledge of English phonology and orthography may have a slight advantage in understanding the rest of the thesis.

1.1 Background

The first words we learn to spell are closely linked to syllables and syllabification. If we look at a foreign word, or at any word we have never come across before, we may be uncertain about the sounds that make it up, but we will usually be able to determine how many syllables the word has. That is because we were taught about syllables in our early education, which helps us to identify the possible syllables of a given word with ease.

Advances in computer technology, integrated with linguistic knowledge, have made it possible to automate the process of finding syllable boundaries in any language of the world. One benefit of such software is the convenience of not having to detect syllable boundaries manually, which can become hard and tedious. As a matter of fact, if a group of English speakers is asked to find the syllables in a word, some disagreement often occurs. For example, how do we decide where the syllable boundaries fall in words like and with some kind of dummy in the latter word?

According to Carney [1], there are many differences in word division. For instance, the following is a small sample of the divisions suggested in the first (1948) and third (1974) editions of the Oxford Advanced Learner's Dictionary of Current English, a leading authority for foreign learners of English:

1st Edition: , , ,
2nd Edition: , , ,
3rd Edition: , , ,
4th Edition: , , ,

Syllabification has various uses. It helps in pronouncing words correctly and in dividing words in writing and typing. One benefit of having a tool or algorithm that syllabifies with reasonable accuracy is that it helps to justify printed text at the edge of the page, which can never be done by an ordinary typewriter. Carney [1] notes that justification can be done by juggling with slightly wider or slightly narrower spaces at the word boundaries, so that the line of print expands or contracts to end at a word boundary. When this cannot be done neatly, the word has to be checked with the syllabification algorithm and, once it conforms to a rule, split over two lines using a hyphen.

1.2 Objective

The aim of this project is to produce phonetic representations of English syllables derived from conventional text input. A computer program is written in Java to attempt the conversion using more than one approach with reasonable accuracy. Given a text as input, a mapping should be achieved that transliterates a string of text into a string of graphemes (also known as symbols) and a string of sounds. The first approach is to syllabify the string of symbols, while the second approach is to syllabify the string of sounds. These are rule-based approaches, in which the conversion of unsyllabified strings into syllabified strings is done by a set of manually defined rules. The outputs of both approaches are compared and evaluated in order to find which approach produces better output with a higher degree of accuracy.

1.3 Organization of the studies

Chapter 2 discusses what has been published on the same topic by accredited scholars and researchers. Chapter 3 gives a brief introduction to English spelling and the nature of the problems of syllabification in English. Chapter 4 introduces the two different routes to syllable identification and explains how the syllabification process is done. Chapter 5 describes the requirements of the software, what the software does, the input it takes and the output it delivers. It also touches on the features of the software and how it interacts with the user. Chapter 6 concentrates on the overall performance of the software, assessing its success by its accuracy, timing, scalability and robustness. It also compares the two approaches and justifies the decisions taken. Chapter 7 discusses the conclusions drawn from the project and gives recommendations for future work.

1.4 Scope and limitation

This project focuses on syllabification for English. The input to the system at present is only a piece of English text; the system does not process text in batches or from a text file. The input can consist of monosyllabic words, but the system is not able to process words with syllabic consonants. For a more detailed discussion of this topic, the reader is encouraged to follow the references.

1.5 Research methodology

Literature studies have been conducted on existing theses, conference papers, journals and Web pages to gain a brief understanding of the topic. All information, citations and quotations are acknowledged in as much detail as possible in the reference section. The Internet has been a valuable resource for obtaining information about this topic, and much of the material referenced in this thesis is available on the Internet. In the reference section, URLs (Uniform Resource Locators) have been included that are current at the time of publication. Due to the changing nature of the Internet, and especially the World Wide Web, there is no guarantee that this information will continue to be available at these locations.

CHAPTER 2 LITERATURE REVIEW

The goal of this literature review is to learn what other researchers have done in the discipline I am working in, and so to gather more relevant material for my project.

The syllable is the most basic element in language and is countable. According to the linguists [2] of Macquarie University in Australia, however, there is no exact definition of the syllable that phonologists agree upon. They believe that the variation in defining the syllable depends on the speaker's awareness. In their research, they also found that finding syllable boundaries is much more difficult than counting the number of syllables, especially for those who have not been exposed to an alphabetic writing system.

The question of the syllable became crucial when Chomsky and Halle [3] published The Sound Pattern of English in 1968. The book presents a view of phonology that fits in with the rest of Chomsky's early theories of language but, surprisingly, considers the syllable to be an unnecessary notion in phonological theory. This position led Kahn [4] to reintroduce the notion into generative phonology in his dissertation, published in 1980. His dissertation became a major breakthrough for the idea of syllabification, and he extended his analysis to account for ambisyllabicity. The concept of syllabifying with the maximum onset principle that he introduced is useful for my project, but I will not cover the ambisyllabicity issue. Admittedly, there are a few objections to Kahn's theory that are harder to dismiss. Lowenstamm [5] is an early critic of Kahn's theory who argues against his concept of syllabification. He proposes that word-initial consonant clusters should be identified with syllable-initial consonant clusters. Another concept he presents is that syllable structure consists of strictly alternating consonantal and vocalic positions. More and more phoneticians became aware of the significance of syllable structure in the syllabification task and started to scrutinize it.

An interesting view of syllable structure is then presented by Clements [6], who introduces a new approach to syllable representation. It proposes an additional level of phonological representation, the CV-tier, which defines functional positions within the syllable. He gives a full-scale phonological justification of his CV-tier theory and, surprisingly, his approach holds for a varied selection of languages, including English, Turkish, Finnish, French, Spanish and Danish. To learn about syllable structure in depth, I looked at other related research topics. Although the purpose of such research may not be directly relevant, parts of it do help, for instance the work of Zhang and Edmondson [7] on the use of syllable patterns in their speech recognition research. They describe the internal structure of the syllable as groupings of segments, which comprise a vowel nucleus with up to three consonants before and three consonants after the vowel letter. Graphically, the syllable pattern is viewed as (C)(C)(C)V(C)(C)(C), where the brackets show the consonants to be optional.
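The (C)(C)(C)V(C)(C)(C) pattern can be checked mechanically. The following is only a minimal sketch and not the implementation used in this project: the class name `SyllableTemplate`, its method names and the decision to treat only a, e, i, o and u as vowel letters are assumptions made for illustration.

```java
import java.util.regex.Pattern;

// Illustrative sketch only: names and the vowel set are assumptions, not part of the thesis software.
final class SyllableTemplate {
    // (C)(C)(C)V(C)(C)(C): up to three consonants, exactly one vowel nucleus, up to three consonants.
    private static final Pattern TEMPLATE = Pattern.compile("C{0,3}VC{0,3}");

    // Reduce a candidate syllable to its C/V skeleton.
    // Assumption: only a, e, i, o, u are vowel letters; 'y' and 'w' count as consonants here.
    static String skeleton(String syllable) {
        StringBuilder sb = new StringBuilder();
        for (char ch : syllable.toLowerCase().toCharArray()) {
            sb.append("aeiou".indexOf(ch) >= 0 ? 'V' : 'C');
        }
        return sb.toString();
    }

    // True when the skeleton fits the single-nucleus template.
    static boolean fitsTemplate(String syllable) {
        return TEMPLATE.matcher(skeleton(syllable)).matches();
    }

    public static void main(String[] args) {
        System.out.println(skeleton("strap") + " " + fitsTemplate("strap")); // CCCVC true
        System.out.println(skeleton("boat") + " " + fitsTemplate("boat"));   // CVVC false
    }
}
```

The second call shows a limitation of working letter by letter: 'boat' has two vowel letters for one nucleus, so a practical check would first need to group such letter pairs into a single vowel unit.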

My literature study then turned to the existing approaches to syllabification for English. Some approaches may not have been developed for English, but it may be possible to adopt the idea, with some changes to the algorithms and rules, to make it compatible with the structure of English spelling. Weber [9] investigates how word segmentation in English can be done using both phonotactic and acoustic cues. The results of the study support the claim that phoneme sequences solve the segmentation problem better than acoustic sequences.

Grapheme-to-phoneme (G2P) conversion is one of the popular methods used in speech synthesis systems and for basic syllabification rules. It is a routine that converts an input word sequence into its corresponding phonetic transcription. There are various approaches to the conversion, including dictionary-based, rule-based and statistical approaches. The simplest is the dictionary-based approach. A dictionary-based G2P requires a large dictionary or corpus for the mapping. Strings of sounds can easily be obtained by a dictionary look-up, but unfortunately this cannot handle unknown words that are not listed in the dictionary. This is currently an unsolved problem, mainly for speech synthesis. Examples of work with this approach are Luksaneeyanawin [9] and Roth [10]. The approach was also followed by MITalk [11], a simulation system that models the human cognitive ability in spelling. It took 25 years to build and mirrors the human capability for reading aloud. There are two stages of word processing in MITalk. The input word is first checked against a very large look-up dictionary, and only if no match is found is the word processed by rules.
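The two-stage idea described above (dictionary look-up first, rules only for unknown words) can be sketched as follows. This is a hedged illustration of the general technique, not MITalk's actual design: the class `TwoStageG2P`, its methods and the toy dictionary entry are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Illustrative sketch of "dictionary first, rules as fallback"; all names are hypothetical.
final class TwoStageG2P {
    private final Map<String, String> dictionary = new HashMap<>();
    private final Function<String, String> ruleFallback;

    TwoStageG2P(Map<String, String> entries, Function<String, String> ruleFallback) {
        this.dictionary.putAll(entries);
        this.ruleFallback = ruleFallback;
    }

    // Stage 1: exact dictionary look-up. Stage 2: rules, used only when the word is unknown.
    String transcribe(String word) {
        String key = word.toLowerCase();
        String hit = dictionary.get(key);
        return hit != null ? hit : ruleFallback.apply(key);
    }

    public static void main(String[] args) {
        Map<String, String> toy = new HashMap<>();
        toy.put("keep", "kip"); // toy entry; a real system would load a large pronunciation dictionary
        // Placeholder rule stage: it simply echoes the spelling, standing in for real letter-to-sound rules.
        TwoStageG2P g2p = new TwoStageG2P(toy, w -> w);
        System.out.println(g2p.transcribe("keep"));  // found in the dictionary
        System.out.println(g2p.transcribe("blick")); // unknown word, handled by the rule stage
    }
}
```

Passing the rule stage in as a function keeps the look-up logic independent of whichever rule set is eventually plugged in.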

The idea of using rules to process unrecorded words has led to widespread attempts at the rule-based approach. With a refined set of mapping rules it is possible to obtain accurate strings of sounds, but this is not an easy task, because English is not a phonetic language and there is not always a direct mapping between letters and sounds. This is called rule-based G2P and is one of the routes that this project attempts. It transfers most of the phonological knowledge of a dictionary into a set of rules. Phonemisation is done by letter-to-sound conversion prior to syllabification. Manually defining the rules, and ordering them correctly, is crucial to avoid conflicts between rules. In addition, expert knowledge and an in-depth understanding of the specific language are needed to maintain the rules. Previous work that has established this approach includes Kumar [12], Schaden [13] and Mareuil [14]. What these works have in common is that their rule-based G2P conversion is meant for speech synthesis, not syllabification.

However, it was found that a hybrid of the dictionary-based and rule-based methods combines the advantages of both approaches and gives extra accuracy to the outputs, as attempted by Kim et al. [15].

The statistical approach to G2P is an automatic technique based on machine learning from large corpora. One achievement can be seen in Muller's [16] work. He presents an approach to supervised learning and automated syllabification that combines the advantages of treebank and bracketed corpora training for the German language. He claims that the accuracy of the grammar increases when more linguistic knowledge is added. The grammar is used for predicting syllable boundaries in the test corpus. Several studies [17,18] found that this method performs very well in predicting syllable boundaries.

Without a doubt, G2P conversion is the most widely discussed method here, popular not only in speech synthesis but also in syllabification. Unfortunately, much of the existing work focuses on the G2P method, and so far no work has been found that is concerned with going from text to symbol. It would therefore be worthwhile to attempt this method and to present it as initial work towards solving the syllabification problem.

In this thesis, I present two different approaches to identifying the syllable boundaries of input words or text. The first approach identifies syllables from a string of phonemes using the rule-based technique. The second approach identifies syllables from a string of symbols. The performance and accuracy of the two approaches are then compared to find the better route for solving the syllabification problem. The methodology and implementation are discussed in the following chapters.

CHAPTER 3 AN OVERVIEW OF ENGLISH SPELLING

Our strange language

When the English tongue we speak
Why is "break" not rhyme with "freak"
Will you tell me why it's true?
We say "sew" but likewise "few";
And the maker of the verse
Cannot cap his "horse" with "worse";
"Beard" sounds not the same as "heard";
"Cord" is different from "word".
Cow is "cow" but low is "low";
"Shoe" is never rhymed with "foe";
Think of "hose" and "dose" and "lose";
And think of "goose" and not of "choose";
Think of "comb" and "tomb" and "bomb";
"Doll" and "roll", "home" and "some";
And since "pay" is rhymed with "say";
Why not "paid" with "said", I pray;
We have "blood" and "food" and "good";
"Mould" is not pronounced like "could";
Wherefore "done" but "gone" and "lone"?
Is there any reason known?
And in short it seems to me
Sounds and letters disagree.

Author unknown but taken from Pennington [19]

Figure 3.1 Poem illustrating difficulties of English spelling and sounds

3.1 Phonology and orthography

The English language, which first took root in England, has been a primary medium of communication around the world for the last three centuries. It is believed that the language originated from the Germanic languages, was then influenced by French during the Middle Ages, and only later settled into the stage now known as Modern English. Samuel Johnson published his influential English dictionary in 1755 [20].

Phonology and orthography are branches of linguistics and are important components of letter-to-sound and sound-to-letter correspondences. There are many definitions of both terms, but they share the same viewpoint, namely the study of the letters and sounds of a particular language. Macmillan English Dictionary [21] defines phonology as "the study of the pattern of speech sounds used in a particular language". Some definitions found on the Web, which vary in wording but have much in common, are as follows:

Branch of linguistics concerned with the sounds of speech, the way the sounds of particular languages change over time and the way the sounds of one language relate to those of another. www.read-the-bible.org/glossary.html

The study of speech sounds. www.mcg.edu/Otolaryngology/glossary.htm

The sound system of language including speech sounds, speech patterns and rules that applies to those sounds. www.beyond-words.org/definitions.htm

Phonology is the study of the sounds in a language, both the study of phonemes (those sounds that have a significance in a language) and phonetics (how we make and hear those sounds). www.think-ink.net/doh/gloss.htm

Phonology is often considered to be a part of grammar: it is the study of the way speech sounds are combined to create words and meaning within a sentence. Just as there are grammar rules that apply to the syntax of a sentence and the morphology of words, there are phonological rules. These are rather complex, however, and do not concern us at this level. www.englishbiz.co.uk/grammar/main_files/definitionsn-z.htm

From the various definitions, it can be concluded that phonology is the study of the sounds of the speaker's language. When talking about phonology, it is impossible to ignore phonetics and phonemes. Pennington [19] describes phonetics as "the study and description of the nature of the raw noises and silences of speech", while Roach's [23] definition is "the comparatively straightforward business of describing the sounds we use in speaking". From these descriptions, it can be said that phonetics is the basis for phonology. A continuous stream of sounds is produced when speaking. In phonology, this stream can be divided into smaller parts called segments. Each of these segments carries its own characteristic speech sound, and this representation is known as a phoneme. The phoneme is the basic unit of speech, as defined by Roach [23] in his book. A special set of phonemic symbols, the International Phonetic Alphabet (IPA), is used to represent each different phoneme. Please refer to Appendix B for the list of IPA symbols with the corresponding ASCII as provided by Coxhead [25]. Please note that these correspondences have been used in the implementation of the system. A phoneme writing system is also called a phonemic transcription.

In the more detailed description given by Pennington [19], "phoneme is a phone or group of phones which contrasts in another phoneme in a minimal pair". A minimal pair exists when two words differ in sound in only one segment. Phone is the term given to the actual sound produced. Phoneme symbols are written between slashes, while phones are written in square brackets. The different realizations of a phoneme are called allophones. The following are some examples of allophones in English:

[p] and [pH] are allophones for the phoneme /p/

[t] and [tH] are allophones for the phoneme /t/

[k] and [kH] are allophones for the phoneme /k/

The letter 'H' after the phone is used to show the predictable difference in pronunciation. For instance, [kH] is aspirated, that is, its pronunciation is followed by a burst of air. A simple experiment gives a clear understanding of the distinction between the two allophones. If you put your hand in front of your mouth and say the words 'keep' and 'skip' one after the other, you can feel an extra burst of air touching your skin for 'keep'. Therefore, the word 'keep' is written as /kip/ phonemically and [kHip] phonetically. In summary, the phoneme is the basic unit of sound, while the phone represents the actual sound produced. The different realizations of a phoneme are known as allophones.
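The relation between the phonemic and the phonetic transcription of 'keep' can be illustrated with a small sketch. It rests on a deliberately simplified assumption, namely that /p/, /t/ and /k/ are aspirated only at the very start of a word; real English aspiration depends on stress and syllable position as well. The class name `Aspiration`, the method `toPhonetic` and the ASCII string used for 'skip' are hypothetical.

```java
// Illustrative sketch only: a deliberately simplified aspiration rule, not a full account of English.
final class Aspiration {
    // Convert a phonemic string (written without slashes) to a bracketed phonetic string.
    // Assumption: a voiceless stop is aspirated only when it is the first segment of the word.
    static String toPhonetic(String phonemic) {
        if (phonemic.isEmpty()) {
            return "[]";
        }
        char first = phonemic.charAt(0);
        boolean voicelessStop = first == 'p' || first == 't' || first == 'k';
        // 'H' marks aspiration, following the notation used in the text.
        String body = voicelessStop ? first + "H" + phonemic.substring(1) : phonemic;
        return "[" + body + "]";
    }

    public static void main(String[] args) {
        System.out.println(toPhonetic("kip"));  // [kHip], as for 'keep' in the text
        System.out.println(toPhonetic("skIp")); // [skIp]: the /k/ after /s/ is left unaspirated
    }
}
```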

Macmillan English Dictionary [21] defines orthography as "the system of spelling that a language uses". There are also various definitions from other sources. The Dictionary.com website [22] describes orthography as "the art or study of correct spelling according to established usage", "the aspect of language study concerned with letters and their sequences in words" and also "a method of representing a language by written symbols; spelling". Klima [24] describes orthography in his own words as "a system designed for readers, who know the language, who understand sentences and therefore know the surface structure of sentences". The English orthographic system is alphabetic, and graphemes are its distinct written symbols. There are 26 letters used to create graphemes. According to Rubba [26], there is no direct mapping between letters and graphemes. Letters in the alphabet are the raw material used to create graphemes, which in turn are used to represent phonemes in English. In her study, it was found that a grapheme can consist of a single letter, a double letter or a combination of letters. The following examples are taken solely from Rubba [26] to give a clear understanding of the types of English graphemes:

• Single vowel letters: a e i o u (w, y sometimes are used to represent vowel sounds)
• Single consonant letters: b c d f g h j k l m n p q r s t v w x y z
• Double letters (a sequence of two identical letters):
  o Vowels: ee, oo ( appears in a few loanwords and proper names).
  o Consonants: all consonants are frequently used doubly except h, j, k, q, w, x, and y. Examples: apple, summer, toss, dizzy, etc.
• Letter combinations: a sequence of two or more different letters used to represent single sounds or sound sequences.
  o Digraphs: a sequence of two or more different letters, which represents a single sound. Examples: <ph> representing /f/ in 'phone'; <ea> representing /i/ in 'seat'; <th> for the voiced and voiceless interdental fricatives in 'that' and 'thing', respectively; <ng> for the voiced velar nasal in most Americans' pronunciation of the words 'sing', 'twang'; <oa> for /o/ in 'boat', etc. Notice that there are both consonant digraphs and vowel digraphs.
  o Blends: Blends are not single graphemes, as are digraphs; a blend is a sequence of 2 or 3 graphemes in which each letter represents a logical consonant sound. Examples are the <bl> of 'black', the <scr> of 'scream', etc. You will find mention of blends in many phonics programs. They are separated out for explicit teaching because children, for reasons that are unclear, often leave out letters in blends. They do this even when they can both hear and say all the sounds in the word. For instance, a child might spell 'bread' as 'bead'.
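The grapheme types listed above suggest a simple way of splitting a word into graphemes. The following is a minimal sketch under stated assumptions, not the method used in this project: the digraph list is inferred from the example words 'phone', 'seat', 'thing', 'sing' and 'boat', any two identical adjacent letters are treated as a double letter, and the class name `GraphemeSplitter` is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only: a greedy left-to-right splitter with a tiny, assumed digraph list.
final class GraphemeSplitter {
    // Assumption: digraphs inferred from the example words in the list above.
    private static final Set<String> DIGRAPHS =
            new HashSet<>(Arrays.asList("ph", "th", "ng", "ea", "oa"));

    static List<String> split(String word) {
        String w = word.toLowerCase();
        List<String> graphemes = new ArrayList<>();
        int i = 0;
        while (i < w.length()) {
            if (i + 1 < w.length()) {
                String pair = w.substring(i, i + 2);
                boolean doubled = pair.charAt(0) == pair.charAt(1);
                // Double letters and known digraphs are kept together as one grapheme.
                if (doubled || DIGRAPHS.contains(pair)) {
                    graphemes.add(pair);
                    i += 2;
                    continue;
                }
            }
            // Everything else, including each letter of a blend, is a single-letter grapheme.
            graphemes.add(String.valueOf(w.charAt(i)));
            i++;
        }
        return graphemes;
    }

    public static void main(String[] args) {
        System.out.println(split("phone"));  // [ph, o, n, e]
        System.out.println(split("summer")); // [s, u, mm, e, r]
        System.out.println(split("black"));  // [b, l, a, c, k]
    }
}
```

Note that the blend in 'black' is split into separate letters, which matches the description of blends as sequences of graphemes rather than single graphemes.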

Rubba [26] also addresses the use of silent letters in English spelling. Silent letters like <e> and <k> appear in a word but are not pronounced, as in 'time' and 'knee'. They have no sound value but are used to differentiate one word from another in spelling, for instance 'mat' vs. 'mate', 'grim' vs. 'grime' and 'tap' vs. 'tape'.

An article [28] on English spelling found on the Internet claims that English spelling is chaotic and inconsistent and has more complicated rules than other spelling systems. It gives three major reasons to support its claim:

1) Historical sound change

Following the Norman invasion, also known as the Norman Conquest, Old English was displaced by Norman French for nearly 300 years before Middle English emerged. The long succession of invasions, starting from the Germanic period, has preserved foreign spellings for loanwords, many of them influenced by French, such as 'lieutenant', 'ballet', 'physics' and 'chancellor'. 'Ketchup', taken from Malay, and 'safari', from Swahili, have also been accepted as English loanwords. A list of loanwords by period of borrowing is available for reference in Appendix C.

2) Great Vowel Shift (GVS)

According to Menzer [28], "the GVS was a massive sound change affecting the long vowels of English during the fifteenth to eighteenth centuries". Basically, the GVS is the process by which the long vowels shifted to higher positions, while the vowels that were already highest became diphthongs. Menzer also describes in detail how the process took place. It happened in eight steps, and each step took a long time to complete. The following diagram illustrates Menzer's account of the GVS steps.

Step 1: i and u drop and become əɪ and əʊ
Step 2: e and o move up, becoming i and u
Step 3: a moves forward to æ
Step 4: ɛ becomes e, ɔ becomes o
Step 5: æ moves up to ɛ
Step 6: e moves up to i (a new e was created in Step 4; now that e moves up to i)
Step 7: ɛ moves up to e (the new ɛ created in Step 5 now moves up)
Step 8: əɪ and əʊ drop to aɪ and aʊ

Figure 3.2 Great Vowel Shift process
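Because the steps in Figure 3.2 are ordered, the overall effect on a single long vowel can be traced by applying them one after another. The sketch below is only an illustration and assumes my reading of the eight steps (for example, that Step 1 produces diphthongs beginning with schwa); the class name `GreatVowelShift` is hypothetical, and the IPA symbols are written as Unicode escapes so that the source stays plain ASCII.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: the eight steps are applied strictly in order, one mapping per step.
final class GreatVowelShift {
    private static final List<Map<String, String>> STEPS = new ArrayList<>();

    // Register one step as a list of (from, to) pairs.
    private static void step(String... pairs) {
        Map<String, String> mapping = new HashMap<>();
        for (int i = 0; i < pairs.length; i += 2) {
            mapping.put(pairs[i], pairs[i + 1]);
        }
        STEPS.add(mapping);
    }

    static {
        step("i", "\u0259\u026A", "u", "\u0259\u028A");             // Step 1: i, u become schwa diphthongs
        step("e", "i", "o", "u");                                   // Step 2: e, o move up
        step("a", "\u00E6");                                        // Step 3: a moves forward to ash
        step("\u025B", "e", "\u0254", "o");                         // Step 4: open e, open o are raised
        step("\u00E6", "\u025B");                                   // Step 5: ash moves up to open e
        step("e", "i");                                             // Step 6: the new e moves up to i
        step("\u025B", "e");                                        // Step 7: the new open e moves up to e
        step("\u0259\u026A", "a\u026A", "\u0259\u028A", "a\u028A"); // Step 8: diphthongs lower their onset
    }

    // Trace one vowel through all eight steps; vowels not affected by a step pass through unchanged.
    static String shift(String vowel) {
        String current = vowel;
        for (Map<String, String> mapping : STEPS) {
            current = mapping.getOrDefault(current, current);
        }
        return current;
    }

    public static void main(String[] args) {
        System.out.println("a -> " + shift("a")); // a passes through Steps 3, 5 and 7 and ends as e
        System.out.println("o -> " + shift("o")); // o is raised to u in Step 2
    }
}
```

Keeping one mapping per step preserves the ordering: a vowel created by a later step, such as the e of Step 4, is not caught by an earlier step such as Step 2.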

3) The nature of English language

The third major reason is English spelling itself. English has 26 letters but more than 50 different sounds. This gap shows that a one-to-one correspondence between letters and sounds is all but impossible for English. The next paragraphs present a few examples of the irregularities in English spelling:

Example 1: <c> represents two sounds: it represents /s/ when followed by <e>, <i> or <y>, as in 'cent', 'city' and 'cycle', while it represents /k/ when followed by <o>, <u> or <a>, as in 'corn', 'cup' and 'cat'. Both sounds can be found in the words 'accent' and 'access'.
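The regularity in Example 1 can be written as a small rule. This is a hedged sketch rather than one of the project's actual rules: the class name `LetterC` and its method are hypothetical, and two simplifications are assumed, namely that every context other than a following e, i or y yields /k/, and that letter combinations such as 'ch' are ignored.

```java
// Illustrative sketch only: predicts the sound of each letter 'c' from the letter that follows it.
final class LetterC {
    // Returns one character per 'c' in the word: 's' before e, i or y, otherwise 'k'.
    // Assumptions: any other context (including end of word) gives 'k'; 'ch' is not treated specially.
    static String soundsOfC(String word) {
        String w = word.toLowerCase();
        StringBuilder sounds = new StringBuilder();
        for (int i = 0; i < w.length(); i++) {
            if (w.charAt(i) != 'c') {
                continue;
            }
            char next = i + 1 < w.length() ? w.charAt(i + 1) : '\0';
            boolean soft = next == 'e' || next == 'i' || next == 'y';
            sounds.append(soft ? 's' : 'k');
        }
        return sounds.toString();
    }

    public static void main(String[] args) {
        System.out.println(soundsOfC("city"));   // s
        System.out.println(soundsOfC("cat"));    // k
        System.out.println(soundsOfC("accent")); // ks: the first c is /k/, the second is /s/
    }
}
```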

Example 2: The /k/ sound can be represented by five graphemes: as in 'kid', as in 'cat', as in , as in 'back' and as in 'quite'.

In addition, an article [41] found on the Internet discusses the silent letters in English spelling. In its own words, it describes the silent letters "as the false etymological spellings", for example the <b> in 'debt', the <s> in 'island', the <k> in 'knee' and the <h> in 'ghost'. The Simplified Spelling Society [29] in the UK takes the same position, namely that English spelling is too difficult for most people. Its research on the difficulties of English spelling among the British has led to the following facts, as reported on its website:

• Even after II years at school barely half of all English speakers become confident spellers. • Italian children can spell accurately after just 2 years at school. • Italy has only half as many identified dyslexics as England. • Around 7 million British adults and 40 million US adults are functionally illiterate. • English speaking adults always come near the bottom in international studies on literacy. • In 1992, a poor spelling standard of university students has been has been reported in the UK. • In 1998, a poor spelling of many students has been reported at Oxford. • In all UK schools there are some teachers who regularly make spelling mistakes on school reports.

A comparative study [29] has been conducted between English and the languages of several European countries. It concludes, "At least 3500 common words do not follow the 90 basic English spelling patterns. German has only about 800 such words, Spanish 600 and Italian merely 400. That is why Italian spelling can be mastered quickly while learning to spell English takes a long time and is never quite conquered by millions of learners". The Society suggests that updating the spelling system and changing the difficult spelling rules would make English more fun and easier to learn. According to the Society, a simpler English spelling system would let children spend more time playing rather than focusing only on a chaotic language.

Weir and Venezky take the opposite view on the inconsistency of English spelling. As cited in Klima's [24] research, they "suggest that one of the advantages of English orthography is that it preserves the identity of common morphemic elements (that would otherwise be obscured) by not requiring a uniform direct correspondence between the letter and the sound". Klima agrees that this identity needs to be kept within the orthography, as English is obviously not a phonetic language. Klima discusses the issue of optimal orthography and highlights the research by Chomsky and Halle on optimality. They define an optimal orthography as an orthographic system with no phonetic variation and which is predictable by general phonological rule. Klima disagrees with Chomsky and Halle, arguing that their notion could only be applied to speakers of the language who already know the orthographic system. He then suggests the following characteristics of an optimal orthography, in his own words:

1) The degree in arbitrariness in the relationship within the orthographic units and the corresponding linguistic units: the less arbitrary the orthography the easier it will be to learn.
2) The degree of redundancy in the orthographic representation vis-à-vis the linguistic form: the greater the parsimony of the orthography, the better.
3) The degree of ambiguity in the orthographic representation with respect to the linguistic form represented: the orthography must be suitably expressive.
4) Standardization: one and the same word should not have several spelling, that is, a difference in spelling should represent a difference in linguistic structure.

Therefore it can be concluded that an optimal orthography is an orthographic system that is less arbitrary, less redundant and more expressive, and that preserves a standard spelling.

3.2 Consonants and Vowels

The terms 'vowel' and 'consonant' are often misused in the literature. Carney [1] finds that many published studies do not address the distinction between sound and letter. He claims that it is bad practice in research not to highlight the difference, as this leads readers to misunderstand the terminology. He also finds that in most cases the terms 'consonant' and 'vowel' refer to the sound rather than the letter. In line with his observation, I state here that in my study 'consonant' and 'vowel' refer to sounds or phonemes and not letters.

Consonants and vowels are the major classes of speech sounds. As with the term 'syllable' discussed in the previous chapter, linguists have not reached universal agreement on the distinction between 'consonant' and 'vowel'. Traditionally, Chomsky and Halle [3] define consonants and vowels in terms of the flow of air from the lungs out of the mouth. Consonants are produced when there is an obstacle or constriction in the flow of air, while vowels are produced when the air comes out freely without any barrier. Roach [23] describes vowels as "sounds in which there is no obstruction to the flow of air as it passes from the larynx to the lips". In his book, however, Roach disagrees that production is the basis for the distinction between consonants and vowels. He claims that the most important difference is the distribution of the sounds and not their production. By distribution he means the "different contexts and positions in which particular sounds can occur".

An interesting discussion of the distinction is found in an online article [30], which presents the contrasting opinions of Ferdinand de Saussure, Leonard Bloomfield and Pike. The writer of the article concludes that the major difference is that consonants are sounds produced with stricture, for which the position of the organ that produces the sound can be identified precisely, while for vowels it cannot.

Every speech sound is produced by articulation, that is, the movement of one or more vocal organs along the vocal tract. According to Pennington [19], the vocal organ that is actively involved in producing a sound is called the active articulator, while the idle vocal organ(s) is called the passive articulator. Any discussion of articulation inevitably involves the place and manner of articulation. Place of articulation is the position along the vocal tract at which the sound is produced. Manner of articulation refers to how the sound is uttered by the speaker. Figure 3.3 and Figure 3.4 show the relationship¹ between manner and place of articulation for English consonants and vowels; the figures are taken from the online lecture notes of the University of Oregon [31].

[Figure: chart of the English consonants arranged by place of articulation and manner of articulation]

Figure 3.3 The consonants of English

¹ A complete chart can be found in Appendix D for further reference.

[Figure: chart of the English vowels; the surviving column labels are Central and Back]

Figure 3.4 The vowels of English

In addition, Pennington states, "The vocal tract is divided into different regions which are used to describe the place of articulation of individual consonants and vowel". Below is a diagram illustrating the articulators frequently referred to in the study of phonetics.

[Figure labels: nasal cavity; oral cavity; larynx (voice box)]

Figure 3.5 Diagram of vocal organs and articulatory regions

She defines the locations of these organs and regions as shown in the following table:

Vocal organs and articulatory regions (Nouns) | Adjectives
nose | nasal
mouth | oral
lips | labial
teeth | dental
alveoli (or alveolar ridge or gum ridge) | alveolar
(hard) palate | palatal
velum (soft palate) | velar
pharynx | pharyngeal
uvula | uvular
larynx | laryngeal
glottis | glottal

Table 3.1 Name of vocal organs and articulatory regions

Many other aspects of English consonants and vowels could be discussed, such as their characteristics, systems and positional variations, but I will not go into detail because my study is concerned only with the symbols for letters and sounds. Further knowledge of general phonetics can be found in two good introductory books by Ladefoged and Abercrombie, as listed in the Bibliography section.

3.3 Syllable Structure

The syllable is one of the most widely discussed units in English phonetics, and its possible structure is described in various ways. In rather basic terms, many agree that the syllable is a very important unit in spelling. Most speakers of English have no trouble cutting words up into syllables because syllables are countable. If a number of English speakers were given a word like antidisestablishmentarianism, they would know that the word consists of more than one syllable, but they might have different views on how to segment the word into its syllables.

This awareness of the syllable has given rise to many variations in defining the syllable structure of a word. Many criteria are used as the basis or properties of the syllable, including intensity, loudness, resonance and quantity (duration and length) [33]. As I have pointed out earlier, phoneticians around the world have not agreed on an exact definition of syllable structure. The conventional approach to describing the syllable structure is shown in Figure 3.6.

syllable
  onset
  rhyme
    nucleus
    coda

Figure 3.6 Conventional Syllable Structure

Basically, I found that most researchers agree that the nucleus is the basic element in the syllable structure. For instance, Smith [34] in his book defines a syllable as "a unit of sound consisting of a vowel preceded and/or followed by consonants". As I mentioned in the first chapter, I am concerned with the syllable structure proposed in the work of William and Zhang [7], [32]. They describe their syllable structure as "a group of segment" consisting of a vowel as the nucleus with up to three consonants preceding the vowel (onset) and up to three consonants following the vowel (coda). Their notion can be written as (C) (C) (C) V (C) (C) (C), where brackets mark the consonant in that position as optional. In addition, they add the Sonority Sequencing Principle to their structure to give the syllables sonority values. Sonority is defined by the opening of the air passages: the more they open, the more sonorous the element. The sonority principle makes the nucleus the most sonorous element in the segment. The sonority wave builds up from the less sonorous segment, i.e. the onset, to the nucleus (the maximally sonorous segment) and decreases progressively when moving away from the nucleus to the coda. The example in Figure 3.7, taken from their research, gives a brief illustration of their idea. The dotted line shows the sonority wave.

[Figure: the phrase "fish 'n' chips" divided into syllable 1, syllable 2 and syllable 3, with rhyme, nucleus and coda labelled and a dotted line tracing the sonority wave]

Figure 3.7 Example of William and Zhang's approach on syllable structure

In English a syllable may have no coda, in other words it may end in a vowel. This is called an "open syllable", and the most common open syllable is the CV syllable. A syllable with at least one consonant after the vowel is called a "closed syllable" and is of the type (C)VC. As the optional onset shows, English permits a closed syllable to begin directly with the nucleus followed by the coda. Some words also have a diphthong as the nucleus. A diphthong is a sound in which two vowels are combined within a single syllable.

English has a lot of monosyllabic words. Examining the sequences of consonants and vowels in them gives a brief understanding of the permissible syllable structures in English. The following examples are some English monosyllabic words taken from the lecture notes [2] of Macquarie University in Australia:

Type of CV structure | Example of a word
V | I
CV | me
CCV | spy
CCCV | spray

Table 3.2 English open syllables
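The template (C)(C)(C)V(C)(C)(C) and the open/closed distinction can be illustrated with a short Python sketch. It is not part of the program developed in this study: the names `SYLLABLE_TEMPLATE` and `classify_syllable` are hypothetical, and the input is assumed to be a CV skeleton over sounds (for example "CCV" for 'spy'), not over letters.

```python
import re

# Up to three onset consonants, exactly one nucleus, up to three coda consonants.
SYLLABLE_TEMPLATE = re.compile(r"(C{0,3})V(C{0,3})")


def classify_syllable(skeleton: str) -> str:
    """Classify a CV skeleton as 'open', 'closed' or 'invalid'."""
    match = SYLLABLE_TEMPLATE.fullmatch(skeleton)
    if match is None:
        return "invalid"  # no nucleus, or more than three consonants on a side
    coda = match.group(2)
    return "open" if coda == "" else "closed"


for skeleton in ("V", "CV", "CCCV", "VC", "CVCC", "CCCCV"):
    print(skeleton, classify_syllable(skeleton))
```

The skeletons of Table 3.2 all come out as open because none of them has a coda.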

3.4 Problem to overcome

In this study, the main problem with syllabification is investigated in order to arrive at the best solution. Basically, people who have prior knowledge of phonetics have no problem dividing words into their syllables. Generally, any speaker of a language who has been taught the structure of its spelling system will also have little difficulty with word division. What concerns me here is the accuracy of the word segmentation. Some existing approaches may draw ambiguous syllable boundaries between adjacent codas and onsets. Let us take words like "Knightsbridge" and "signify". The sequences "ts" in "Knightsbridge" and "gn" in "signify" are clearly ambiguous, and people tend to have difficulty deciding where to put the syllable boundary.

In order to solve this syllabification problem, the following approach is taken: each word is segmented in two ways, the outputs are analyzed, and the approach that gives the most reasonable accuracy is chosen and justified. The method for syllabification in English is the main concern of the next chapter.

CHAPTER 4 METHODS

4.1 Approaches

In this study, a syllable is identified via two different routes or approaches. The reason for having more than one approach is to compare the performance and accuracy of the two approaches and to select the one with the most reasonable output. In simpler words, the aim is to see whether the two approaches place the syllable boundary at the same position. The approaches are:
1) Text → Symbol → Syllable
2) Text → Sound → Syllable
The following diagram shows the idea:

[Flow chart boxes: Get symbols; Get phonemes; Syllabification rules; Apply MOP; Get the corresponding text and insert the syllable boundaries; Syllabified Text; Evaluate & compare]

Figure 4.1 Syllabification flow chart

4.1.1 Text → Symbol → Syllable

In this approach, a word or a string of words (or text) is converted into graphemes or writing symbols and syllabification is done over the string of symbols. The conversion is based on a set of graphotactic rules, and the syllabification process must conform to the syllabification rules. At the end of the process, a string of symbols with inserted hyphens to mark the syllable boundaries is produced. This is not yet the final result, as it is unreadable for those with no linguistic knowledge. To make the output easier to understand, a mapping is made between the initial text and the converted text while keeping the hyphens at their original positions. The following example illustrates this approach:

Input word: skateboard

Processing Stage | Output | Description
1 | skatebVrd | In the graphotactic rules, 'oa' is found to be a vowel, therefore it is replaced by the symbol 'V'.
2 | skatbVrd | In the graphotactic rules, 'e' is found to be a silent grapheme, therefore 'e' is removed.
3 | skatbVd | In the graphotactic rules, 'Vr' is found to be a vowel.
4 | skat-bVd | In the syllabification rules, 'tb' is found NOT to be a permitted consonant cluster, therefore a hyphen is inserted to mark the boundary.
5 | skate-board | A mapping between the initial word and the output produced in stage 4 is made to give the final result.

Table 4.1 Example of Text → Symbol → Syllable approach
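The mapping in the final stage requires knowing which letters of the input each symbol came from. The Python sketch below shows one way to keep that alignment; it is an illustration only, not the program of this study. The functions `to_symbols` and `map_back` are hypothetical, `RULES` encodes only the three rewrites of the 'skateboard' example, and the boundary index is assumed to be supplied by the syllabification stage.

```python
import re

CONSONANTS = "bcdfghjklmnpqrstvwxz"

# Assumption: only the three rewrites of the worked example are encoded, and
# every replacement is either a single symbol or empty.
RULES = [
    (re.compile(r"oa"), "V"),                            # 'oa' is a vowel
    (re.compile(rf"(?<=[aeiouV][{CONSONANTS}])e"), ""),  # silent 'e' after vowel + consonant
    (re.compile(r"Vr"), "V"),                            # 'Vr' is a vowel
]


def to_symbols(word):
    """Apply RULES in order; return [symbol, source_letters] pairs."""
    tokens = [[ch, ch] for ch in word]
    for pattern, replacement in RULES:
        while True:
            symbols = "".join(sym for sym, _ in tokens)
            match = pattern.search(symbols)
            if match is None:
                break
            start, end = match.span()
            source = "".join(src for _, src in tokens[start:end])
            if replacement:
                tokens[start:end] = [[replacement, source]]
            else:
                # A deleted symbol hands its letters to the symbol before it.
                tokens[start - 1][1] += source
                del tokens[start:end]
    return tokens


def map_back(tokens, boundary):
    """Spell the word with its original letters, hyphen before token `boundary`."""
    left = "".join(src for _, src in tokens[:boundary])
    right = "".join(src for _, src in tokens[boundary:])
    return f"{left}-{right}"


tokens = to_symbols("skateboard")
print("".join(sym for sym, _ in tokens))  # skatbVd
print(map_back(tokens, 4))                # skate-board
```

Because the silent 'e' is attached to the preceding 't', the hyphen placed between 't' and 'b' in the symbol string lands after 'skate' in the original spelling.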

The number of processing stages differs from one input to another. A longer word or text may require more processing stages. Each processing stage goes through a set of predefined rules; if any rule matches, the necessary change is made and the output is updated for the next stage. The order of the rules is extremely important, not only for the accuracy of the output but also for the reliability and robustness of the program.

4.1.2 Text → Sound → Syllable

In the second approach, a word or a string of words (or text) is converted into phonemes or IPA ASCII symbols and syllabification is done over the string of sounds. Grapheme-to-phoneme conversion has been practised for some time and has become a very popular method for speech synthesis. The conversion in this study is based on a set of phonotactic rules, and the syllabification process must conform to the syllabification rules for sounds. At the end of the process, a string of sounds with inserted hyphens to mark the syllable boundaries is produced. As in the first approach, this is not yet the final result. A mapping between the initial text and the converted text is made while keeping the hyphens at their original positions. The following example shows how this approach works:

Input word: helpful

Processing Stage | Output | Description
1 | hEHlpfUXl | In the phonotactic rules, based on their sounds, 'e' is converted to 'EH' and 'u' to 'UX'.
2 | hEHlp-fUXl | In the syllabification rules, 'pf' is found NOT to be a permitted consonant cluster, therefore a hyphen is inserted to mark the boundary.
3 | help-ful | A mapping between the initial word and the output produced in stage 2 is made, giving the final result.

Table 4.2 Example of Text → Sound → Syllable approach

4.2 Data collection and analysis

In this study, data refers to the set of raw materials used to construct the rules. Data are collected from a survey of most of the phonetics books and from observation of ordinary English words. Analysis is done to determine how well the data work in a particular rule. The data are then simplified into a code-like form to be used in the process of mapping correspondences and syllabification.

4.3 Rules construction

Graphotactic rules are rules used to define the possible sequences of graphemes in a writing system. They are also known as Letter Distribution Rules in Carney's book [1]. In the Text → Symbol → Syllable approach, the construction of the graphotactic rules is relatively simple because it ignores the difference in sounds. In the survey, the following rules are determined to be relevant as the graphotactic constraints:

No. | Rule's description | Simplified form | Examples
1 | Any combination of 'ey', 'ay', 'ew', 'or', 'ar', 'oy', 'ow', 'ur', 'ir', 'uy', 'ye' is substituted for 'V' | ey, ay, ew, or, ar, oy, ow, ur, ir, uy, ye → V | bowl, fork, buy, oyster
2 | The spelling 'ough' is substituted for 'V' followed by 'f' | ough → Vf |
3 | The spelling 'ight' is substituted for 'V' followed by 't' | ight → Vt | fight, might
4 | Any consonant followed by 'y' is substituted for 'V' | Cy → V | lyric, python, hymn
5 | Any double consonants are substituted for a single consonant, except for the consonant 'c' | CC → C | diffuse, syllable, litter
6 | Any double vowels, including two consecutive vowels, are substituted for a single vowel | VV → V | door, coin, sheer, taut
7 | An 'e' following a vowel and a consonant is removed | VCe → VC | mate, place, frame
8 | Terminal 'e' is removed | e# → ε |
9 | An 'o' followed by 'r' and then by a vowel is substituted for 'V' | orV → VrV | chloride, adorable

Table 4.3 Graphotactic rules
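Because each rule rewrites the output of the rules before it, the rule order decides the result. The following Python sketch applies a subset of the graphotactic rules as ordered regular-expression substitutions. It is an illustration under stated assumptions, not the program of this study: the name `apply_graphotactic_rules` is hypothetical, rules 4 and 9 are left out, rule 3 is read as 'ight' → 'Vt', and 'y' is not treated as a consonant letter.

```python
import re

CONS = "bdfghjklmnpqrstvwxz"  # consonant letters without 'c' (rule 5 exempts 'c')

# Assumption: a subset of Table 4.3 in table order; rules 4 and 9 are omitted.
GRAPHOTACTIC_RULES = [
    (re.compile(r"ey|ay|ew|or|ar|oy|ow|ur|ir|uy|ye"), "V"),  # rule 1
    (re.compile(r"ough"), "Vf"),                             # rule 2
    (re.compile(r"ight"), "Vt"),                             # rule 3
    (re.compile(rf"([{CONS}])\1"), r"\1"),                   # rule 5: double consonant
    (re.compile(r"[aeiouV]{2}"), "V"),                       # rule 6: two vowels
    (re.compile(rf"([aeiouV][c{CONS}])e"), r"\1"),           # rule 7: VCe -> VC
    (re.compile(r"e$"), ""),                                 # rule 8: terminal 'e'
]


def apply_graphotactic_rules(word):
    """Rewrite a lower-case word into symbols, one rule after another."""
    for pattern, replacement in GRAPHOTACTIC_RULES:
        word = pattern.sub(replacement, word)
    return word


for word in ("skateboard", "rough", "fight", "coin", "mate"):
    print(word, "->", apply_graphotactic_rules(word))
```

With this ordering 'skateboard' reaches the same symbol string as in Table 4.1 (skatbVd), although the intermediate stages differ because 'ar' is rewritten before 'oa'.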

Please note that the combination 'ough' can be pronounced in 9 different ways; therefore I treat it as a general case.

1) rough = /rʌf/
2) dough = /dəʊ/
3) thought = /θɔːt/
4) plough = /plaʊ/
5) through = /θruː/
6) borough = /ˈbʌrə/
7) slough = /slʌf/
8) cough = /kɒf/
9) hiccough = /ˈhɪkʌp/

Below is the list of the permissible onset sequences derived from the observation in several dictionaries [21],[22],[35]:

Consonant cluster | Example of a word
bl | blow
br | bread
ch | chap
chr | chrome
cl | clap
cr | crab
dr | draft
dw | dwarf
fl | flag
fr | frame
gh | ghost
gl | glad
gn | gnat
gr | grace
hy | hype
kn | knock
kr | krill
ph | phase
phl | phlegm
phr | phrase
pl | place
pr | praise
rh | rhyme
sc | scale
sch | school
scr | scream
sh | shack
shr | shrink
sk | skate
sl | slam
sm | smack
sn | snag
sp | span
sph | sphinx
spl | split
spr | sprint
sq | square
st | stab
str | strange
sw | swab
th | that
thr | throw
thw | thwart
tr | trace
tw | tweak
wh | whack
wr | wrap

Table 4.4 List of permissible onset sequences for symbol

It is feasible to transform the list into a look-up table (LUT) and simply check for the permitted consonant clusters against the LUT during the syllabification process. In this study, however, I do not take this straightforward route, because I believe a pattern for the possible onset sequences can be derived from the list. By using onset sequence patterns expressed in rule form, syllabification is no longer a time-consuming process. Moreover, new patterns can be added to the programming code at any time. The cluster is divided into three positions, which means that a cluster can accommodate at most three consonants. The following table shows the patterns of the permitted onset sequences.

3 Consonant Clusters
C1 | C2 | C3
t | h | r, w
p | h | r, l
s, c | h | r
s | p | r, h, l
s | c | r, h
s | t | r

2 Consonant Clusters
First consonant | Second consonant
s | c, k, l, m, p, q, t, w, h, n
g | l, r, h, n
k | n, r
b | l, r
d | r, w
r | h
w | h, r
h | y
c | l, r, h
t | h, r, w
p | l, r, h

Table 4.5 Pattern of consonant clusters for symbol

For example, the table shows that 'k' goes with 'n' as in 'knife' and also with 'r' as in 'krill'.
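The pattern form of the onset sequences can be sketched in Python as two small mappings, one for two-consonant and one for three-consonant clusters. This is an illustration only: the names `TWO_CONSONANT`, `THREE_CONSONANT` and `is_permitted_cluster` are hypothetical, and the entries transcribe the rows of Table 4.5 only as far as they can be read.

```python
# First consonant -> permitted second consonants.
TWO_CONSONANT = {
    "s": set("cklmpqtwhn"),
    "g": set("lrhn"),
    "k": set("nr"),
    "b": set("lr"),
    "d": set("rw"),
    "r": set("h"),
    "w": set("hr"),
    "h": set("y"),
    "c": set("lrh"),
    "t": set("hrw"),
    "p": set("lrh"),
}

# (C1, C2) -> permitted C3.
THREE_CONSONANT = {
    ("t", "h"): set("rw"),
    ("p", "h"): set("rl"),
    ("s", "h"): set("r"),
    ("c", "h"): set("r"),
    ("s", "p"): set("rhl"),
    ("s", "c"): set("rh"),
    ("s", "t"): set("r"),
}


def is_permitted_cluster(cluster):
    """Check a cluster of two or three consonant letters against the patterns."""
    if len(cluster) == 2:
        return cluster[1] in TWO_CONSONANT.get(cluster[0], set())
    if len(cluster) == 3:
        return cluster[2] in THREE_CONSONANT.get((cluster[0], cluster[1]), set())
    return False  # other lengths are outside the scope of the table


for cluster in ("kn", "kr", "str", "thw", "tb", "pf"):
    print(cluster, is_permitted_cluster(cluster))
```

Adding a new pattern then means adding one entry to a mapping rather than listing every cluster separately.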

Phonotactic rules are rules used to define the possible sequences of phonemes in a language. Working from sounds or phonemes is different from, and more complicated than, working from symbols. It requires more rules for the phonemisation task, as English is not a phonetic language. The dictionaries of well-known publishers such as Oxford, Collins and Longman are the main references for finding how a particular word should be correctly spelled. There is variation in the sounds of a particular grapheme, especially in the different sounds of the same vowel.

Most graphemes have different sound values because there is no direct correspondence between graphemes and sounds. For instance, 'x' in 'text' has the sound value /ks/. The phoneme /k/ is the sound for 'ck' in 'back', 'ch' in 'chloride' and 'k' in 'kill'. As the rules for sounds are more detailed, I divided them into diagraph rules, vowel rules, consonant rules and silent rules, and these are listed in that order in the programming code. As I pointed out earlier, the order of the rules is extremely important for producing highly accurate output. To decide which rules should go first, many tests were conducted during the implementation phase; this will be discussed further in the next chapter.

The following four tables give the grapheme-to-phoneme conversion rules used in the program. Please note that the transformation of sounds into their IPA ASCII values is based on the IPA ASCII list in Appendix B. A diagraph is a cluster of at least two consonants forming a single speech sound in a word or in a syllable. Applying the diagraph rules first converts the initial text/word into a less complicated form. The list of diagraph rules is shown in the table below:

No. | Rule's description | Simplified form | Examples
1 | 'd' followed by 'g' is substituted for 'JH' | dg → JH | budget, judge, ledge
2 | 'q' followed by 'u' is substituted for 'kw' | qu → kw | quick, request
3 | 'p' followed by 'h' is substituted for 'f' | ph → f | phase, photograph
4 | 'w' followed by 'h' is substituted for 'w' | wh → w | whip, white, why
5 | 'c' followed by 'h' is substituted for 'CH' | ch → CH ¹ | church, chain
6 | 's' followed by 'h' is substituted for 'SH' | sh → SH | shack, shade, mash
7 | 't' followed by 'h' is substituted for 'TH' | th → TH | them, think, thumb
8 | 'n' followed by 'g' is substituted for 'NG' | ng → NG | sing, long, anger
9 | 'k' followed by 'n' is substituted for 'n' | kn → n | knife, knew, knee
10 | When 'b' is followed by 'u' and then by a vowel at the beginning of a word, it is substituted for 'bV' | #buV → bV | build
11 | When 'c' is followed by 'h' and then by 'e' at the end of a word, it is substituted for 'SH' | che# → SH | cache, moustache, gauche
12 | When 'q' is followed by 'u' and then by 'e' at the end of a word, it is substituted for 'k' | que# → k | cheque, plaque
13 | When 's' is followed by 'c', then by 'h' and finally by a vowel at the beginning of a word, it is substituted for 'sk' | #schV → sk ² | school, scheme, schism
14 | 'x' is substituted for 'ks' | x → ks | deluxe, annex, approximate
15 | When 'c' is followed by 'h' and then by 'l' or 'r', it is substituted for 'kl' or 'kr' | chl, chr → kl, kr | chloride, chronic
16 | When 'c' is followed by 'h', then by 'o' and finally by 'l' or 'r' at the beginning of a word, it is substituted for 'kol' or 'kor' | chol, chor → kol, kor | cholera, chord

¹ Except for ache

Table 4.6 Diagraph rules
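A single left-to-right pass over the word is one way to apply such rules without a later rule re-matching the output of an earlier one. The Python sketch below is an illustration, not the program of this study: `apply_diagraph_rules` is a hypothetical name, only part of Table 4.6 is encoded, and trying the context-specific rules before the general ones is an assumption rather than the order of the table.

```python
import re

# Assumption: context-specific rules come first so that 'ch' does not consume
# the start of 'che#', 'chl' or 'chr'. '^' and '$' stand for the '#' boundary.
DIAGRAPH_RULES = [
    (re.compile(r"che$"), ["SH"]),
    (re.compile(r"que$"), ["k"]),
    (re.compile(r"^sch(?=[aeiou])"), ["s", "k"]),
    (re.compile(r"ch(?=[lr])"), ["k"]),
    (re.compile(r"dg"), ["JH"]),
    (re.compile(r"qu"), ["k", "w"]),
    (re.compile(r"ph"), ["f"]),
    (re.compile(r"wh"), ["w"]),
    (re.compile(r"ch"), ["CH"]),
    (re.compile(r"sh"), ["SH"]),
    (re.compile(r"th"), ["TH"]),
    (re.compile(r"ng"), ["NG"]),
    (re.compile(r"kn"), ["n"]),
    (re.compile(r"x"), ["k", "s"]),
]


def apply_diagraph_rules(word):
    """Scan the word once; emit rule output or copy the letter unchanged."""
    tokens, pos = [], 0
    while pos < len(word):
        for pattern, output in DIAGRAPH_RULES:
            match = pattern.match(word, pos)
            if match:
                tokens.extend(output)
                pos = match.end()
                break
        else:
            tokens.append(word[pos])  # no rule applies at this position
            pos += 1
    return tokens


for word in ("school", "cheque", "cache", "chronic", "judge"):
    print(word, apply_diagraph_rules(word))
```

The output is a list of tokens, so two-character codes such as 'SH' and 'CH' stay distinct from the letters that have not yet been converted.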

The next table shows the vowel rules used in the program. As I mentioned earlier, 'vowel' and 'consonant' are used throughout as phonetic terms. Diphthongs are also included in the vowel rules. Ford [36] in his research defines a diphthong as "a complex vowel formed of two phones functioning as a single phoneme". He adds that in English one phone carries the syllabic length while the other phone, which could also be a consonant, functions as a vowel. He gives the following example: the vowel /i/ in English can be represented as 'e', 'i', 'y', 'ee', 'ea', 'ei', 'ie', 'ey', 'eo', 'ae' and 'oe', while the consonant /i/ (also written /j/) is represented by 'y', 'i' and 'e'. A diphthong is mostly heard as a single blended speech sound.

² Except for schedule

No. | Rule's description | Simplified form | Examples
1 | Double vowel 'e', a single-syllable word consisting of a consonant followed by the vowel 'e', or a word-final 'y' is substituted for 'IY' | ee, #Ce#, y# → IY | heel, me
2 | An 'i' between two consonants, or a combination of 'e' and 'y', is substituted for 'IH' | CiC, ey → IH | hit, bit, honey, chimney
3 | A combination of 'e' and 'i' or a combination of 'a' and 'y' is substituted for 'EY' | ei, ay → EY | leisure, eight, day
4 | An 'e' between two consonants, or a combination of 'e' and 'a', is substituted for 'EH' | CeC, ea → EH | met, head
5 | Double vowel 'o', a combination of 'u' and 'e', a combination of 'u' and 'i' or a combination of 'e' and 'w' is substituted for 'UW' | oo, ue, ui, ew → UW | soon, glue, juice, flew
6 | A combination of 'o' and 'a' is substituted for 'OW' | oa → OW | boat, coal
7 | A combination of 'a' and 'r' is substituted for 'AH' | ar → AH | bar, apart
8 | A combination of 'o' and 'r', 'a' and 'w', 'a' and 'u', double vowel 'o' followed by 'r', or the spelling 'oar' is substituted for 'AO' | or, aw, au, oor, oar → AO | fork, taut, door, boar, jaw
9 | A combination of 'a' and 'i', or 'a' followed by a consonant and then by 'e', is substituted for 'EI' | ai, aCe → EI | wait, cake, date
10 | A combination of 'o' and 'i' or a combination of 'o' and 'y' is substituted for 'OY' | oi, oy → OY | coin, toy, oyster
11 | A combination of 'o' and 'w' or a combination of 'o' and 'u' is substituted for 'AW' | ow, ou → AW | cow, out, shoulder
12 | Vowel 'e' followed by 'a', 'e' or 'i' and then by 'r' is substituted for 'IA' | ear, eer, eir → IA | ear, sheer, weird
13 | The spelling 'air' or 'are' is substituted for 'EA' | air, are → EA | share, chair
14 | The sequence 'o', 'u', 'r', or a combination of 'u' and 'r' or of 'i' and 'r', is substituted for 'AX' | our, ur, ir → AX | armour, turn, firm
15 | Vowel 'i' followed by a consonant and then by 'e', a combination of 'u' and 'y', a combination of 'y' and 'e', or a word-final 'ie' is substituted for 'AY' | iCe, uy, ye, ie# → AY | kite, buy, eye, pie
16 | Any two consecutive consonants followed by 'y' are substituted for the same consonants followed by 'AY' | CCy → CCAY | cry, spy, fry
17 | A vowel followed by a consonant and then by 'y' is substituted for the same vowel and consonant followed by 'IH' | VCy → VCIH | busy, lazy, tidy
18 | The string 'alk' is substituted for 'AOlk' | alk → AOlk | talk, walk
19 | The string 'all' is substituted for 'AOll' | all → AOll | tall, wall
20 | The string 'ire' is substituted for 'AYAX' | ire → AYAX | fire, admire, expire
21 | The string 'igh' is substituted for 'AY' | igh → AY | fight, might
22 | Vowel 'o' followed by 'r' and then by a vowel is substituted for 'AOrV' | orV → AOrV | chloride, adorable

Table 4.7 Vowel rules

It was found that there are some irregularities in English spelling with regard to the vowels. Words with unpredictable phonemes are identified and a selection is made from the variation of phonemes. Three cases have been identified based on the similarity of their phones:

Case 1:
1) 'a' that could be substituted for 'AX' as in about, accord, adhere, again, agree and account
2) 'a' that could be substituted for 'AE' as in absence, accurate, add, agriculture and agony
3) 'a' that could be substituted for 'EI' as in age, agent, ailment, aid, aim and alien

Based on the findings of case 1, a rule is constructed for each of the options above, and the result in bold is the chosen result for the rule:

aCV → AX, AE, EI
aCC → AX, AE, EI
aVC → EI

Case 2:
1) 'u' that could be substituted for 'UX' as in up, upper, umbrella, ulcer, ultra, utmost, utter.
2) 'u' that could be substituted for 'jUW' as in unicorn, unique, unify, utensil, use, union.
3) 'u' that could be substituted for 'AX' as in upon

As in case 1, the results for case 2 are as follows:

uCC → UX
uCV → jUW, AX

Case 3:
1) 'o' that could be substituted for 'AX' as in domestic, coconut
2) 'o' that could be substituted for 'AU' as in hose, hope, oath, oasis
3) 'o' that could be substituted for 'OU' as in open, bone, bold, domain, dome, donut

The results for case 3 are as follows:

oCV → OH, AU, OU, OH, AX
oCC → OH
oVC → AU
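The three cases amount to a look-up keyed on the vowel letter and the consonant/vowel pattern of the two letters that follow it. The Python sketch below lists the candidate phonemes for a vowel in its context. It is an illustration only: the names `CANDIDATES` and `candidates` are hypothetical, the choice among several candidates is not modelled, repeated candidates are listed once, and 'y' is assumed to count as a consonant letter.

```python
VOWEL_LETTERS = set("aeiou")

# Candidate phonemes copied from the case results in the text.
CANDIDATES = {
    ("a", "CV"): ["AX", "AE", "EI"],
    ("a", "CC"): ["AX", "AE", "EI"],
    ("a", "VC"): ["EI"],
    ("u", "CC"): ["UX"],
    ("u", "CV"): ["jUW", "AX"],
    ("o", "CV"): ["OH", "AU", "OU", "AX"],
    ("o", "CC"): ["OH"],
    ("o", "VC"): ["AU"],
}


def candidates(word, i):
    """Return the candidate phonemes for the vowel letter at index i."""
    following = word[i + 1:i + 3]
    if len(following) < 2:
        return []  # the patterns need two following letters
    context = "".join("V" if ch in VOWEL_LETTERS else "C" for ch in following)
    return CANDIDATES.get((word[i], context), [])


for word in ("upper", "unique", "aim", "oath"):
    print(word, candidates(word, 0))
```

For 'upper' the context is CC and only 'UX' remains, whereas 'unique' has the context CV and keeps two candidates.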

The following table shows the consonant rules used in the program:

No. | Rule's description | Simplified form | Examples
1 | A word-final 'ck' is substituted for 'k' | ck# → k | back, lack
2 | The consonant 'c' followed by the vowel 'o', 'a' or 'u' is substituted for 'k' | c/o,a,u → k | access, cat, corn
3 | The consonant 'c' followed by the vowel 'i', 'e' or 'y' is substituted for 's' | c/i,e,y → s | access, cent, city
4 | When a vowel is followed by 's' and then by a vowel, the voiceless 's' is substituted for a voiced 'z' | VsV → VzV | lose, user, amuse
5 | A vowel followed by 's' at the end of a word is substituted for the same vowel followed by a voiced 'z' | Vs# → Vz | his, this
6 | A word-final 're' is substituted for 'AX' | re# → AX | adhere, capture
7 | A word-initial 'wr' is substituted for 'r' | #wr → r | write, wrong, wrap
8 | The consonant 'g' followed by the vowel 'a', 'o' or 'u' is substituted for 'g' | g/a,o,u → g | game, gum
9 | The consonant 'g' followed by the vowel 'e', 'i' or 'y' is substituted for 'JH' | g/e,i,y → JH | general, giant, ginger ³
10 | The consonant 'g' followed by 'n' and then by a consonant is substituted for 'nC' | gnC → nC | signpost, alignment
11 | A consecutive 'g' and 'n' followed by a vowel remains as is | gnV → gnV | signal, dignity

Table 4.8 Consonant rules
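Several of these rules depend on the letter that follows the consonant, which a regular-expression lookahead expresses directly. The Python sketch below covers rules 1, 2, 3, 7 and 9 only. It is an illustration, not the program of this study; `apply_consonant_rules` is a hypothetical name and the order of the list is an assumption.

```python
import re

# Lookaheads test the following letter without consuming it.
CONSONANT_RULES = [
    (re.compile(r"^wr"), "r"),          # rule 7: #wr -> r
    (re.compile(r"ck$"), "k"),          # rule 1: ck# -> k
    (re.compile(r"c(?=[iey])"), "s"),   # rule 3: c/i,e,y -> s
    (re.compile(r"c(?=[oau])"), "k"),   # rule 2: c/o,a,u -> k
    (re.compile(r"g(?=[eiy])"), "JH"),  # rule 9: g/e,i,y -> JH
]


def apply_consonant_rules(word):
    """Apply the rules in list order to a lower-case word."""
    for pattern, replacement in CONSONANT_RULES:
        word = pattern.sub(replacement, word)
    return word


for word in ("cat", "city", "back", "write", "giant", "angel"):
    print(word, "->", apply_consonant_rules(word))
```

The word 'angel' shows why ordering against the 'ng' diagraph rule matters: here the 'g' before 'e' becomes 'JH', which would not happen if 'ng' had already been rewritten as 'NG'.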

Some irregularities in English spelling were found with regard to the consonants as well. I dealt with these cases in the same way as with the vowels. Words with unpredictable phonemes are identified and a selection is made from the variation of phonemes. Two cases have been identified based on the similarity of their phones:

Case 1:

1) 'y' that could be substituted for 'AY' as in cycle, hygiene, python, myself, lychee, symbol
2) 'y' that could be substituted for 'IH' as in cygnet, hymn, pyramid, mystery, lyric, syphon

The results for case 1 are as follows:

yCC → AY, IH
yCV → AY, IH

Case 2:
1) 's' that could be substituted for 'SH' as in assure, censure, tonsure, fissure
2) 's' that could be substituted for 'ZH' as in measure, leisure, exposure, closure

The results for case 2 are as follows:

Csure# → CSHAX#
Vsure# → VZHAX#

³ May conflict with the NG rule. Consider other words like angel and plunge.

Silent letters are letters that appear in the spelling but do not represent any sound. Silent letters result from historical change: they used to be pronounced during the Old English period. As we can see, the historical sound change has preserved the spelling but dropped the sound. It may seem strange to accept silent letters as graphemes in English orthography, but they are indeed useful for the correct pronunciation of a word. Silent graphemes are often a cue for the sound of the preceding vowel. For example:

Silent grapheme 'e' | Silent grapheme 'gh'
tap vs. tape | mit vs. might
mat vs. mate | sit vs. sight
pip vs. pipe | lit vs. light

The silent rules that I have defined are shown in the following table:

No. | Rule's description | Simplified form | Examples
1 | A word-initial or word-final 'gn' is substituted for 'n' | #gn, gn# → n (g is silent) | gnash, sign, gnaw
2 | A word-initial 'gh' is substituted for 'g' | #gh → g (h is silent) | ghost, ghoul, ghee
3 | A word-final 'bt' is substituted for 't' | bt# → t (b is silent) | debt, doubt
4 | A word-initial 'rh' is substituted for 'r' | #rh → r (h is silent) | rhyme, rhino, rhubarb
5 | A word-final 'mb' or 'mn' is substituted for 'm' | mb#, mn# → m | lamb, comb, autumn, damn

Table 4.9 Silent rules

With the phonemisation rules finished, the rules for the permissible onset sequences are determined; this is the same concept as applied in the text → symbol approach. From the dictionaries, the following list is constructed, which will be used to derive the patterns for onset sequences:

Cluster  Example      Cluster  Example      Cluster  Example
bj       beauty       bl       blot         br       brim
dj       duty         dr       drink        dw       dwell
fj       future       fl       fly          fr       freak
gj       argue        gl       glue         gr       grace
gw       Gwen         hj       hue          kj       cure
kl       climb        kr       crew         kw       queen
mj       mute         nj       nuclear      pj       pure
pl       please       pr       prime        rj       review
sf       sphinx       SHr      shrink       sj       suitable
sk       sky          skj      skew         skr      screw
skw      squid        sl       slot         sm       smear
sn       snake        sp       spot         spj      spume
spl      spleen       spr      spray        st       stick
stj      stew         str      straw        sw       sweet
THj      Thew         THr      throw        THw      thwart
tj       tune         tr       train        tw       twist

Table 4.10 List of permissible onset sequences for sound

From the list of permissible onset sequences, the patterns of the consonant clusters are constructed as in the following table, which will later be used as the syllabification rules:

3 Consonant Clusters
C1     C2       C3
S,T    H        r
s      p,t,k    r
s      t,k      j
T      H        j,w
s      p        j,l
s      k        w

2 Consonant Clusters
C1                       C2
b,d,f,g,k,m,n,r,s,t,p    j
b,f,g,k,p,s,c            l
b,d,f,g,k,p,t            r
d,g,k,s,t                w
s                        f,m,n,p,t,k

Table 4.11 Pattern of consonant clusters for sound

A few examples from the above tables:
/SHr/ and /THr/ are permitted syllable-initially
/dw/, /gw/, /kw/, /sw/ and /tw/ are permitted syllable-initially
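One simple way to use such a list in a program is a set lookup. The sketch below stores the clusters of Table 4.10 as strings, with SH and TH kept as two-letter ASCII codes; the class `OnsetTable` and the method `isPermissibleOnset` are hypothetical names, and a pattern table such as Table 4.11 would be an alternative encoding of the same knowledge.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class OnsetTable {
    // Clusters taken from Table 4.10 (sound approach).
    private static final Set<String> ONSETS = new HashSet<>(Arrays.asList(
        "bj", "bl", "br", "dj", "dr", "dw", "fj", "fl", "fr", "gj", "gl", "gr",
        "gw", "hj", "kj", "kl", "kr", "kw", "mj", "nj", "pj", "pl", "pr", "rj",
        "sf", "SHr", "sj", "sk", "skj", "skr", "skw", "sl", "sm", "sn", "sp",
        "spj", "spl", "spr", "st", "stj", "str", "sw", "THj", "THr", "THw",
        "tj", "tr", "tw"));

    /** Returns true if the cluster may begin a syllable according to the list. */
    static boolean isPermissibleOnset(String cluster) {
        return ONSETS.contains(cluster);
    }

    public static void main(String[] args) {
        System.out.println("str: " + isPermissibleOnset("str"));
        System.out.println("tl:  " + isPermissibleOnset("tl"));
    }
}
```

A set lookup keeps the check independent of the order in which rules are written, at the cost of listing every cluster explicitly.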

4.4 Syllabification with Maximum Onset Principle

The conversion of the input text into its symbol and sound representation must be done prior to syllabification. This process uses the graphotactic and phonotactic rules to produce a string of text with added knowledge of its orthographic and phoneme sequence. The next step is to determine the ideal method to separate polysyllabic words into syllables. There are diverse models of syllabification, and a widely used one is the Maximum Onset Principle (MOP). An example of work using the MOP approach can be found in Kahn's dissertation [4]. In this study, MOP is chosen as the method to identify the syllable boundaries. A general assumption taken from Harrington and Cox [37] is that "the syllable structure is required to satisfy the MOP only within the limit set by the syntactic, morphological and phonotactic constraints of the language".

Basically, what MOP does is to syllabify the word based on its vowels. The first step in MOP is to look for a vowel, since the vowel is the nucleus of a syllable. The next step is to look at the consonants on its left and to check whether the consonant cluster is a permissible sequence under the graphotactic/phonotactic rules defined earlier. If it conforms to any of the rules, a symbol is inserted to mark the boundary. The remaining consonants are assigned to the next identified vowel as its coda cluster. Across the various definitions of the MOP algorithm, the common point is that MOP prefers onsets to codas. Therefore, in my study I include constraints for the onset only and not for the coda.

The process of MOP can be illustrated in the following diagrams:

C V C V C        C V C V C        C V C V C
   (1)              (2)              (3)

Figure 4.2 Maximum Onset Principle process

To be more specific, below is the MOP algorithm used in my program. Please note that in my study, I chose to work from RIGHT to LEFT.

MOP Algorithm:

1) Work from right to left and identify a possible vowel.
2) Get a maximum of 3 preceding consonants (as this is the maximum number allowed in the syllable structure) or stop when a vowel is met. A single consonant is assigned straight away to the vowel, while a consonant cluster is checked against the related conversion rules for the permissible onset sequences.
3) Any consonant in the consonant cluster that is found to violate the rules is assigned to the coda of the preceding syllable.
4) Associate any remaining consonants to the coda position.
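The four steps can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the program's actual code: the class `MopSketch`, the small `ONSETS` set of letter clusters and the vowel set are hypothetical, the input is assumed to be a symbol string in which every vowel character is one nucleus (with 'V' standing for a collapsed vowel digraph, as in strings such as con-strVnt), and the boundary marker is a hyphen. The scan runs from right to left, as chosen in this study.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class MopSketch {
    private static final int MAX_ONSET = 3;

    // Hypothetical subset of permissible onset clusters, for illustration only.
    private static final Set<String> ONSETS = new HashSet<>(Arrays.asList(
        "bl", "br", "cl", "cr", "dr", "fl", "fr", "gl", "gr", "pl", "pr", "tr",
        "sc", "sk", "sl", "sm", "sn", "sp", "st", "sw", "tw",
        "scr", "spl", "spr", "str"));

    // Assumption: every vowel symbol is one nucleus.
    private static boolean isVowel(char c) {
        return "aeiouV".indexOf(c) >= 0;
    }

    private static boolean hasVowelBefore(String s, int end) {
        for (int i = 0; i < end; i++) {
            if (isVowel(s.charAt(i))) return true;
        }
        return false;
    }

    /** Inserts '-' at syllable boundaries, preferring the longest permissible onset. */
    static String syllabify(String s) {
        StringBuilder out = new StringBuilder(s);
        for (int v = s.length() - 1; v > 0; v--) {           // step 1: right to left
            if (!isVowel(s.charAt(v))) continue;
            int start = v;                                    // step 2: up to 3 consonants
            while (start > 0 && v - start < MAX_ONSET && !isVowel(s.charAt(start - 1))) {
                start--;
            }
            // step 3: drop leftmost consonants until the cluster is permissible
            while (v - start > 1 && !ONSETS.contains(s.substring(start, v))) {
                start++;
            }
            // step 4: consonants left of 'start' stay as coda of the preceding syllable
            if (hasVowelBefore(s, start)) {
                out.insert(start, '-');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(syllabify("constrVnt"));
        System.out.println(syllabify("perpetVl"));
    }
}
```

Because insertions happen at strictly decreasing positions, the indices computed on the original string stay valid while the hyphens are added. With the assumed onset set, constrVnt is split as con-strVnt and perpetVl as per-pe-tVl.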

In conclusion, the syllabification algorithms for each of the approaches are detailed below:

Syllabification Algorithm for Text → Symbol → Syllable:

1) Expand digraphs by applying the digraph rules
2) Apply graphotactic rules for vowels
3) Apply graphotactic rules for consonants
4) Remove silent 'e' and terminal 'e'
5) Remove double consonants
6) Apply MOP

Syllabification Algorithm for Text → Sound → Syllable:

1) Expand digraphs by applying the digraph rules
2) Apply phonotactic rules for vowels
3) Apply phonotactic rules for consonants
4) Remove silent and terminal 'e'
5) Apply special rule for double consonant 'c'
6) Remove double consonants
7) Apply MOP

Generally, the two algorithms are similar; the major difference lies in the rules. The second approach requires more detailed rules because English is not a phonetic language, as said earlier. There is also a special rule to handle the double consonant 'c' in the second approach. The consonant 'c' is considered special because it carries two different sounds, i.e. /s/ or /k/, depending on the word. The special rule applies to a word with a double consonant 'c' followed by 'i' or 'e', where the first consonant becomes /k/ and the second consonant becomes /s/. The following examples give a brief understanding of the problem of the double consonant 'c':

Words that comply with the rule: access accent occident eccentric

Words that do not comply with the rule: accurate accommodate occupant ecclesiastical
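The special rule can be expressed with a regular-expression lookahead, as in the hedged sketch below. The class `DoubleC`, the method `apply` and the choice to leave all other 'cc' sequences untouched (so that the later double-consonant step can handle them) are assumptions made for illustration; 'k' and 's' stand for the sounds /k/ and /s/.

```java
import java.util.Locale;

final class DoubleC {
    /** Rewrites "cc" as "ks" only when the next letter is 'i' or 'e'. */
    static String apply(String word) {
        // (?=[ie]) is a lookahead: the following letter is checked but not consumed.
        return word.toLowerCase(Locale.ROOT).replaceAll("cc(?=[ie])", "ks");
    }

    public static void main(String[] args) {
        for (String w : new String[] {"access", "occident", "accurate", "ecclesiastical"}) {
            System.out.println(w + " -> " + apply(w));
        }
    }
}
```

Under this sketch, access becomes aksess, while accurate and ecclesiastical are returned unchanged because their 'cc' is not followed by 'i' or 'e'.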

CHAPTER 5 IMPLEMENTATIONS

5.1 Memory

The program does not require a large memory space because it uses hand-written rules. The mapping of symbols and sounds is derived from the rules, and so are the syllable generalizations. Without any corpus-based technique such as a table lookup model, a mapping can be generated with reasonable accuracy. A table lookup model would require massive space in computer memory to store the words enriched with default mappings. Such a model also works only on pre-stored words and slows down the processing speed.

5.2 Data structures

Input data are stored in an expandable array known as a vector in Java. Each element of the vector goes through a splitting process that splits the word into its characters. These characters are stored in another vector, which is used by the mapping process. The complete transformation after the mapping process is stored in a further vector, ready to be syllabified. The syllabified characters are then mapped back to their initial characters before the final results, i.e. the output data, are put into the result vector. The output data are the input data with hyphens inserted at the positions detected by the program on the basis of the hand-written rules. Although many of their data elements are public, most manipulations of these data structures are handled through access functions, which will be described in the following sections.
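A minimal sketch of this use of vectors is given below. The class `CharVectors` and the methods `toChars` and `withHyphens` are hypothetical names, and the boundary positions are passed in directly rather than computed; the sketch only shows the splitting of a word into a character vector and the insertion of hyphens into the output.

```java
import java.util.Vector;

final class CharVectors {
    /** Splits a word into a vector of its characters. */
    static Vector<Character> toChars(String word) {
        Vector<Character> chars = new Vector<>();
        for (char c : word.toCharArray()) {
            chars.add(c);
        }
        return chars;
    }

    /** Rebuilds the word, placing a hyphen before each given character index. */
    static String withHyphens(Vector<Character> chars, int... boundaries) {
        boolean[] breakBefore = new boolean[chars.size() + 1];
        for (int b : boundaries) {
            if (b > 0 && b < chars.size()) {
                breakBefore[b] = true; // ignore positions outside the word
            }
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < chars.size(); i++) {
            if (breakBefore[i]) sb.append('-');
            sb.append(chars.get(i).charValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(withHyphens(toChars("bedridden"), 3, 6));
    }
}
```

Keeping the boundaries as indices into the original character vector is one way to make the final mapping back to the initial text straightforward when the mapped string has the same length as the input.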

5.3 Modularity

The program consists of several modular functions. The first function gets the input text and splits it into characters. The second function does the mapping between the characters and the sounds. It runs in parallel with the third function, which does the mapping between the characters and the symbols. The syllabification function syllabifies the strings of symbols and sounds concurrently. The last function in the program maps the syllabified result back to the initial text.

5.4 Input / Output

The program consists of a single user interface. The user's input is captured from a single text box and the processed outputs for each approach are displayed in two separate text boxes with vertical scroll bars. Three simple buttons provide the functionality of the program. The 'syllabify' button triggers the syllabification algorithm and displays the output, the 'clear' button erases all the text boxes and the 'exit' button terminates the program. The user interface is shown in the following figure.

Figure 5.1 User interface

One limitation of the project is a minor part of the program code that remains unsolved. The incomplete part is the reverse mapping between the sound string and the initial text. The problem is simple to state but complicated to implement, as it involves mapping double characters (IPA ASCII) to the initial single characters. Unfortunately, at the time of writing this thesis I had not found a correct algorithm for the reverse mapping. Most of the algorithms tried so far produced messy outputs. After many attempts to improve them, I decided to leave the output as it is. The output can still be predicted from the sound string with ASCII characters. I agree that this problem limits the maximum performance of the results; performance is discussed further in the next chapter. I hope my justification can be considered and my inability to solve this part can be accepted. A few example words are taken as input and the outputs are shown on the next pages.

Figure 5.2 Sample output for word 'signification'

Figure 5.3 Sample output for word 'surreptitious'

Figure 5.4 Sample output for word 'bedridden'

prEHz-EHn-tAE-SHn

Figure 5.5 Sample output for word 'representation'

Figure 5.6 Sample output for word 'antidisestablishmentarianism'

c-cess ac-curate oc-cident c-com-pa-ny

Figure 5.7 Sample output for text 'access accurate occident accompany'

5.5 Features

The program provides an easy visualization of both approaches to syllabification and helps the user towards a better understanding of the process. The comparison feature helps the user to compare the outputs and choose the one with the higher degree of accuracy.

5.6 Tools

At the earlier stage, this program was implemented using Java 2 SDK version 1.4, a set of command-line programs used to create, compile and run Java programs. After a while, it was found that the Java 2 SDK lacks some features that most professional programmers need. There are many sophisticated Java programming tools available, such as Borland JBuilder, WebGain Visual Cafe and IBM Visual Age for Java, but I found that Sun ONE Studio, offered by Sun Microsystems, is simple and easy to use. The Community Edition is free software available from Sun's website, offering a built-in text editor, graphical user interface and project management tools. One advantage is that Sun ONE Studio supports Java 2 SDK version 1.4 and can be configured to support higher versions.

CHAPTER 6 PERFORMANCE AND JUSTIFICATIONS

Improving the syllabification method and changing the rules were recurring tasks during the performance-testing phase. Adding more rules, changing the rules or changing the order of the rules is important to refine the output. After a long period of testing, the program was found to be reasonably stable, running through to completion for most polysyllabic words. With Sun ONE Studio as the development tool, a back-end window is provided for the programmer to monitor the process and to spot the reason if a problem is encountered while running the software.

The primary concern in the performance of the program is accuracy. After extensive testing, it was found that the outputs of the program may vary slightly in both accuracy and validity. One possible cause is that some of the inputs may not comply with some of the rules, thus creating small undetected errors. These errors may grow so large that sometimes a simple word produces incorrect output. After each program and data set was run, all the results were loaded into a table for comparison. The table is shown below:

Text             String of symbols with     String of sounds with
                 syllable boundaries        syllable boundaries
bureaucracy      bVVu-cra-cV                bA-XAOcr-AE-zlli
complaisant      com-plY-sant               kOHmp-lEI-sAHnt
perpetual        per-pe-tVl                 pAXp-EH-tUXal
divergence       divr-genc                  dAYv-AX-gEHnc
correspondence   cV-re-spon-denc            kA-OEHsp-OHn-dEHnc
acclimatize      ac-cli-ma-tiz              AEck-lAYm-AX-tlliz
persevere        UNKNOWN ERROR              UNKNOWN ERROR
constraint       con-strVnt                 kOHnstrEint
knapsack         knap-sack                  nAEp-sAXk
knighthood       knV-thVd                   nllighTHUWd
obsequious       ob-se-qVous                OHbs-EH-kwi-HAWs
courageous       cV-ra-gous                 kUAEI-gA-WUXs

Table 6.1 Output comparison table

Scalability is also an important issue in system performance. Several tests have shown that the program can handle most long English words, such as:

1) Antidisestablishmentarianism: popularly accepted as one of the longest English words.
2) Pseudocontraneoantidisestablishmentarian: formed by adding more affixes to the first example, giving a new version and meaning to the word.
3) Floccinaucinihilipilification: the longest non-technical word in the first edition of the Oxford English Dictionary, describing the habit of esteeming or describing something as worthless, or making something worthless by said means.
4) Contraneoantidisestablishmentarianalistically: one of the longest words listed on the Wikipedia website.
5) Pneumonoultramicroscopicsilicovolcanoconiosis: a coined term for a lung disease.

The above examples are taken from Wikipedia [38]. At present, the program can deal with a single word or a piece of text but not with a significantly increased load such as a text file.

The processing time is found to be short, and results are produced in a few seconds. Reasons that may contribute to this are the correct order of the functions in the source code and the use of rules for the mapping and syllabification processes. Although lookup tables (LUT) are a popular method of finding the mapping correspondences, it is not practical to hold them as a corpus on an IBM PC. The size is very large and it would take quite a while to go through each entry in the LUT unless an efficient and fast searching technique is used. Both approaches perform equally fast and their outputs are displayed at almost the same time.

The program has been designed with robustness in mind, to ensure that it works correctly in any condition regardless of any mistaken input by the user. The assumption made during the implementation and operation process is "Input should consist of characters only". Input is always checked against this assumption before proceeding to the next process. Unfortunately, the program has not yet achieved optimum robustness. There are some tricky errors that are very hard to find even with the best testing techniques. They relate to some unpredictable inputs that tend to trigger the error. These inputs are likely to have the same structure as inputs that are handled correctly, yet when no output is displayed in the result boxes, the input has triggered the error. Fortunately, the error does not terminate the program, and the user can always try other inputs by clearing the input text box and entering a new input.
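A check of this kind could be sketched as below. The reading of "characters only" as letters A to Z in either case, with words separated by whitespace, is an assumption of this sketch, as are the class name `InputCheck` and the method `isValid`; the program's own check may differ.

```java
final class InputCheck {
    /** Accepts one or more alphabetic words separated by whitespace. */
    static boolean isValid(String input) {
        return input != null
            && input.trim().matches("[A-Za-z]+(\\s+[A-Za-z]+)*");
    }

    public static void main(String[] args) {
        System.out.println(isValid("access accurate occident"));
        System.out.println(isValid("abc123"));
    }
}
```

Rejecting invalid input before the mapping step keeps digits and punctuation away from the rules, which are written for letters only.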

The reliability of the program was measured during the implementation and operation process. Because of the problems of inaccuracy and limited robustness, the program can hardly reach peak performance without processing errors or unreliable outputs. Perhaps the popular phrase "you can't write a good software" is true for me. After a long period of hard work, I still could not solve the problem of unpredictable inputs or the reverse mapping from the phoneme strings with syllable boundaries to the corresponding initial text. As I admitted earlier, text → sound → symbol is a cumbersome task and I have no confidence to make any more changes, as they might introduce unforeseen problems to the program. Sometimes the error may be due to the reverse mapping code, in which case no output is displayed in the final result text box.

My performance analysis has found that the first approach (text → symbol → syllable) is more successful than the second approach (text → sound → syllable) with regard to the accuracy of its outputs. Syllabification by symbols is easier because it discards the specification of sounds. It requires fewer rules and thus gives fewer errors. Working from sounds is found to be more difficult because there are ambiguities in the specification of sounds. Many rules are needed in order to produce a refined result. Currently there are almost 60 rules for text to sound mapping in the source code. Having more rules would help to produce accurate results, but an incorrect order of the rules tends to cause rule conflicts and undetectable errors at the end of the process. Therefore, I would suggest that it is easier to syllabify the word/text by its string of symbols and then do the mapping over the syllabification results to give the phonetic representation. The idea is illustrated in the following diagram:

TEXT → String of symbols → Syllabification with MOP → String of symbols with syllable boundaries → SOUND

Figure 6.1 Text → Symbol → Sound with syllable boundaries

CHAPTER 7 CONCLUSIONS AND FUTURE WORK

This thesis contributes methods for automating the syllabification process for English, considering two routes: Text → Symbol → Syllable and Text → Sound → Syllable. In the literature reviewed, no attempt was found to syllabify English words via a string of symbols; existing research identifies syllables via strings of phonemes. This project illustrated the feasibility and accuracy of dividing words into their syllables using the two approaches mentioned earlier. The testing phase of the program also revealed some of the problems that need to be addressed, as outlined in Chapter 4, to improve the effectiveness and level of accuracy of each approach.

Clearly, there are two dimensions of difficulty: the difficulty of handling unexpected input, which gives inaccurate results, and the difficulty of text to phoneme conversion. After highlighting these problems, Chapter 6 concludes that it is easier and less troublesome to go from symbols than from sounds. Unfortunately, the mapping from strings of sounds with syllable boundaries back to the initial text is left unsolved because of the problem of mapping two characters (i.e. the IPA ASCII characters) to a single character.

This thesis also discusses the greater success of the first approach compared with the second. Since going from symbols is less difficult than going from sounds, a new idea is generated. The idea of identifying syllables from the string of symbols and then mapping the string of symbols with syllable boundaries to the corresponding phonemes is found to be much easier and more accurate. This is supported by the extensive testing conducted on the program.

On a personal note, I don't feel that this is a finished project. This is just the beginning of something much bigger. A good algorithm for the reverse mapping from the phoneme strings to the initial text is highly recommended as the first step in pursuing the project. A further refinement of the phonotactic rules is highly encouraged to improve the accuracy of the phoneme strings. Following the investigations described in this thesis, a number of issues could be explored to expand the knowledge of syllable identification.

It would be interesting to cater for a syllabic consonant instead of a vowel as the nucleus in the syllabification process. A syllabic consonant is a condition in English syllable structure where a consonant constitutes the centre or peak of a syllable [39]. In this case, liquids (l, r) or nasals (m, n, NG) stand as the centre of the syllable instead of a vowel. Examine the following words, each of which ends in a syllabic consonant:

words ending in 'le': cattle, bottle
words ending in 'el': tunnel, panel
words ending in 'al': pedal, bridal
words ending in 'en': soften, raven
words ending in 'on': bacon, wagon
words ending in 'Cm': firm, rhythm

Syllabic consonants can also be found in words like muddle, widen, smitten, hospital, loosen, hassle, cousin, weasel, gentle. According to the University of Bucharest [33], the consonant can replace a vowel because of its high sonority value. English is not unique in this respect, and many English words with syllabic consonants can be found. The +/- syllabic feature is one of the features introduced by Chomsky and Halle in their book, The Sound Pattern of English [3].

Another possible issue is ambisyllabicity, or rules for intervocalic consonant clusters. According to Berces [40], "ambisyllabicity is a term from traditional syllabic theory denoting a situation when a segment simultaneously belong to two syllables, namely, when a consonant acts both as a coda and an onset". In simple terms, ambisyllabicity is the condition where an intervocalic cluster of two or more consonants can be broken down into a coda and an onset. For example, in the word "monstrous", the string nstr is ambiguous. There is a rule that str is a valid syllable beginning and ns is a valid syllable ending. Therefore the s is said to be ambisyllabic, as it belongs to both. Further investigation can be carried out to write a rule for ambisyllabicity.

Future work can also focus on the reusability of the mapping and syllabification algorithms for different languages. At present, the algorithms only work for English and the rules were written specifically for English. It would also be an advantage if the program could be scaled gradually to support batch processing of text files.

REFERENCES

[1] Carney, E. (1994). A Survey of English Spelling, Routledge Inc, New York.

[2] Mannell, R. and Cox, F. (2000). An Introduction to Phonetics and Phonology, http://www.ling.mq.edu.au/units/ling210-901/phonology/syllable/index.html, 19 June 2003.

[3] Chomsky, N. and Halle, M. (1968). The Sound Pattern of English. Harper & Row, New York.

[4] Kahn, D. (1976). Syllable-based Generalizations in English Phonology. PhD Thesis, Massachusetts Institute of Technology, US. Published by Garland, 1980.

[5] Lowenstamm, J. (1981). On the Maximal Cluster Approach to Syllable Structure. Linguistic Inquiry. 12:575-604.

[6] Clements, G.N. and Keyser, S.J. (1983). CV Phonology: A Generative Theory of the Syllable. The MIT Press, Cambridge, US.

[7] Edmondson, W.H. and Zhang, L. (2002). Speech Recognition using Syllable Patterns. University of Birmingham, UK. ICSLP2002.

[8] Weber, A. (2000). Phonotactic and Acoustic Cues for Word Segmentation in English. Max Planck Institute of Psycholinguistics, Netherlands. ICSLP2000.

[9] Luksaneeyanawin, S. (1990). A Thai Text to Speech System, Proceeding of the Conference of the Regional Workshops on Computer Processing of Asian Languages, Asian Institute of Technology, pp 305-315.

[10] Roth, A. (2004). The Algorithmic Determination of Rhyme, http://pages.cpsc.ucalgary.ca/-rotha/503/interim.html, 20 June 2004.

[11] Allen, J., Hunnicutt, M.S. and Klatt, D. (1987). From Text to Speech: The MITalk System, Cambridge University Press, UK.

[12] Kumar, R., Kataria, A. and Sofat, S. (2003). Building Non-Native Pronunciation Lexicon for English Using a Rule Based Approach, Proceedings of ICON-2003, CIIL Mysore.

[13] Schaden, S. (2003). Rule-based Lexical Modeling of Foreign-Accented Pronunciation Variants, Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL'03), Budapest, Hungary.

[14] Mareuil, P.B. Linguistic-Prosodic Processing for Text-to-Speech Synthesis in Italian, http://www.elanspeech.com/randd/doc/icslp.doc, 20 June 2004.

[15] Kim, B., Lee, G.G. and Lee, J.H. (2002). Morpheme-based Grapheme to Phoneme Conversion using Phonetic Patterns and Morphophonemic Connectivity Information, ACM Transactions on Asian Language Information Processing (TALIP), 1, 1, pp 65-82.

[16] Muller, K. (2001). Automatic Detection of Syllable Boundaries Combining the Advantages of Treebank and Bracketed Corpora Training, Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, Toulouse.

[17] Chotimongkol, A. (2000). Statistical Trained Orthographic to Sound Models for Thai, Proceeding of the 6th International Conference on Spoken Language Processing, pp 533-554.

[18] Muller, K. Probabilistics Context Free Grammars for Syllabification and Grapheme to Phoneme Conversion, http://staff.science.uva.nl/-kmueller/Onlinepapers/EMNLP.pdf, 20 June 2004.

[19] Pennington, M.C. (1996). Phonology in English Language Teaching. Addison Wesley Longman, New York.

[20] Lynch, J. (2002). History of the English Language, http://www.english.upenn.edu/~jlynch/Tenns/Temp/history.html, 21 June 2002.

[21] Macmillan English Dictionary for Advanced Learners. (2002). Macmillan Publishers Limited, Oxford, UK.

[22] Dictionary.com. (2002). http://www.dictionary.com, 21 June 2002.

[23] Roach, P. (1983). English Phonetics and Phonology, Cambridge University Press, Cambridge, UK.

[24] Klima, E.S. How Alphabets Might Reflect Language, in Kavanagh, J.F. and Mattingly, I. (1972). Language by Ear and by Eye: The Relationship Between Speech and Reading, MIT Press, US.

[25] Coxhead, P. (2002). Natural Language Processing and Application, Lecture Notes, University of Birmingham, UK.

[26] Rubba, J. (2003). Phonology, Phonics and English Spelling, http://cla.calpoly.edu/~jrubbalphon/phon.spel.html, 20 June 2003.

[27] Wikipedia. (2004). English Orthography, http://en.wikipedia.org/wikillinguistics, 1 July 2004.

[28] Melinda, M.J. (2000). What is the Great Vowel Shift?, http://alpha.funnan.edu/~mmenzer/gvs/what.htm. 1 July 2004.

[29] The Simplified Spelling Society. (2004). Why English Spelling Should be Updated. http://www.spellingsociety.org. 1 July 2004.

[30] The Sounds of English. Consonants and Vowels. An Articulatory, Classification and Description. Acoustic Correlates. http://www.unibuc.ro/eBooks/filologie/mateescu/pdf/31.pdf. 1 July 2004.

[31] English Consonants. http//darkwing.uoregon.edu/-astephen. 2 July 2004.

[32] Edmondson, W.H. and Zhang, L. (2002). The Use of Syllable Structure for Speech Recognition, University of Birmingham, UK. Technical Report CSRP-02-07.

[33] University of Bucharest, Romania. (2004). Beyond the Segment: Syllable Structure in English. http://www.unibuc.ro/eBooks/filologie/mateescu/pdf/27.pdf. 15 January 2004.

[34] Smith, G.W. (1991). Computers and Human Language. Oxford University Press, UK.

[35] Oxford English Dictionary Online. (2003). http://www.dictionary.oed.com. 8 February 2003.

[36] Ford, O.T. (1999). Basic Linguistic of English. http://thestewardship.org/research/blte.htm. 30 March 2004.

[37] Harrington, J. and Cox, F. (2004). The Syllable and Phonotactic Constraints. http://www.ling.mq.edu.au/units/ling210-901/phonology/syllable/syll_phonotactics.html. 26 June 2004.

[38] Wikipedia. (2004). The Longest Word in English. http://en.wikipedia.org/wiki/Longest_word_in_English. 30 June 2004.

[39] Cardiff University, UK. (2004). Syllabic Consonant. http://www.cf.ac.uk/encap/staff/tench/syllabic.html. 2 July 2004.

[40] Berces, K.B. (2004). "Ambisyllabicity" Across Word Boundaries: A strict CV Phonological Approach. English Linguistic PhD Programme, ELTE. 2 July 2004.

[41] English Orthography. http://www.fact-ndex.com/e/enlenglish_orthography.html. 26 June 2004.

BIBLIOGRAPHIES

Ladefoged, P. (1982). A Course in Phonetics, 2nd Edition, Harcourt Brace, New York.

Abercrombie, D. (1967). Elements of General Phonetics. Edinburgh University Press, UK.

APPENDIX A

The University of Birmingham School of Computer Science

MSc in Advanced Computer Science

Summer project

This form is to be used to declare your choice of summer project in course. Please complete this form, obtain the signature of your supervisor and post it in the appropriate assessed work pigeon hole.

Student number: (illegible in the scanned form)

The following questions should be answered in conjunction with a reading of the course handbook.

Aim of project: (handwritten entry, illegible in the scanned form)

Project management. Briefly explain how you will devise a management plan to allow your supervisor to evaluate your progress: (handwritten entry, illegible in the scanned form)

APPENDIX B
List of IPA symbols and their corresponding ASCII symbols

IPA    Examples              ASCII   Partial feature set
[i]    heel, me              IY      {vowel, voiced}
[ɪ]    hit                   IH      {vowel, voiced}
[e]    SAE bait              EY      {vowel, voiced}
[ɛ]    met, head             EH      {vowel, voiced}
[æ]    hat                   AE      {vowel, voiced}
[ɑ]    SAE father, pot       AA      {vowel, voiced}
[ə]    about, after, fern    AX      {vowel, voiced}
[ʌ]    up, fun               UX      {vowel, voiced}
[u]    soon                  UW      {vowel, voiced}
[ʊ]    put, foot             UH      {vowel, voiced}
[o]    SAE boat              OW      {vowel, voiced}
[ɔ]    fork, taut            AO      {vowel, voiced}
[ɒ]    hot                   OH      {vowel, voiced}
[ɑː]   bath, bar             AH      {vowel, voiced}
[eɪ]   wait, cake            EI      {vowel, voiced}
[aɪ]   kite, buy             AY      {vowel, voiced}
[ɔɪ]   coin, toy             OY      {vowel, voiced}
[əʊ]   bone, open            OU      {vowel, voiced}
[aʊ]   cow, out              AW      {vowel, voiced}
[ɪə]   ear, sheer            IA      {vowel, voiced}
[eə]   air, share            EA      {vowel, voiced}
[ʊə]   tour                  UA      {vowel, voiced}
[p]    spin                  p       {stop, bilabial, voiceless}
[b]    boo                   b       {stop, bilabial, voiced}
[t]    stop                  t       {stop, alveolar, voiceless}
[d]    dog                   d       {stop, alveolar, voiced}
[k]    scan                  k       {stop, velar, voiceless}
[g]    gate                  g       {stop, velar, voiced}
[m]    mat                   m       {nasal, bilabial, voiced}
[n]    not                   n       {nasal, alveolar, voiced}
[ŋ]    king                  NG      {nasal, velar, voiced}
[f]    fat                   f       {fricative, labiodental, voiceless}
[v]    vat                   v       {fricative, labiodental, voiced}
[θ]    thumb                 TH      {fricative, dental, voiceless}
[ð]    that                  DH      {fricative, dental, voiced}
[s]    sat                   s       {fricative, alveolar, voiceless}
[z]    zip                   z       {fricative, alveolar, voiced}
[ʃ]    mesh                  SH      {fricative, palatal, voiceless}
[ʒ]    measure               ZH      {fricative, palatal, voiced}
[h]    hot                   h       {fricative, glottal}
[tʃ]   chair                 CH      {affricative, palatal, voiceless}

APPENDIX C

Source: http://www.ruf.rice.edu/-kemmer/Words/loanwords.html.

English Loanwords Major Periods of Borrowing in the

I. Germanic period
Latin
The forms given in this section are the Old English ones. The original Latin source word is given in parentheses where significantly different. Some Latin words were themselves originally borrowed from Greek.
ancor 'anchor'
butere 'butter' (L < Gr. butyros)
cealc 'chalk'
ceas 'cheese' (caseum)
cetel 'kettle'
cycene 'kitchen'
cirice 'church' (ecclesia

II. Old English Period (600-1100)
Latin
apostol 'apostle' (apostolus < Gr. apostolos)
casere 'caesar, emperor'
ceaster 'city' (castra 'camp')
cest 'chest' (cista 'box')
circul 'circle'
cometa 'comet' (cometa


III. Middle English Period (1100-1500)
Scandinavian
Most of these first appeared in the written language in Middle English; but many were no doubt borrowed earlier, during the period of the Danelaw (9th-10th centuries).
anger, blight, by-law, cake, call, clumsy, doze, egg, fellow, gear, get, give, hale, hit, husband, kick, kill, kilt, kindle, law, low, lump, rag, raise, root, scathe, scorch, score, scowl, scrape, scrub, seat, skill, skin, skirt, sky, sly, take, they, them, their, thrall, thrust, ugly, want, window, wing
French
Law and government: attorney, bailiff, chancellor, chattel, country, court, crime, defendant, evidence, government, jail, judge, jury, larceny, noble, parliament, plaintiff, plea, prison, revenue, state, tax, verdict
Church: abbot, chaplain, chapter, clergy, friar, prayer, preach, priest, religion, sacrament, saint, sermon
Nobility: baron, baroness; count, countess; duke, duchess; marquis, marquess; prince, princess; viscount, viscountess; noble, royal (contrast native words: king, queen, earl, lord, lady, knight, kingly, queenly)
Military: army, artillery, battle, captain, company, corporal, defense, enemy, marine, navy, sergeant, soldier, volunteer
Cooking: beef, boil, broil, butcher, dine, fry, mutton, pork, poultry, roast, salmon, stew, veal
Culture and luxury goods: art, bracelet, claret, clarinet, dance, diamond, fashion, fur, jewel, oboe, painting, pendant, satin, ruby, sculpture
Other: adventure, change, charge, chart, courage, devout, dignity, enamor, feign, fruit, letter, literature, magic, male, female, mirror, pilgrimage, proud, question, regard, special


IV. Early Modern English Period (1500-1650)

The effects of the Renaissance begin to be seriously felt in England. We see the beginnings of a huge influx of Latin and Greek words, many of them learned words imported by scholars well versed in those languages. But many are borrowings from other languages, as words from European high culture begin to make their presence felt and the first words come in from the earliest period of colonial expansion.

Latin: agile, abdomen, anatomy, area, capsule, compensate, dexterity, discus, disc/disk, excavate, expensive, fictitious, gradual, habitual, insane, janitor, meditate, notorious, orbit, peninsula, physician, superintendent, ultimate, vindicate
Greek (many of these via Latin): anonymous, atmosphere, autograph, catastrophe, climax, comedy, critic, data, ecstasy, history, ostracize, parasite, pneumonia, skeleton, tonic, tragedy
Greek bound morphemes: -ism, -ize
Arabic via Spanish: alcove, algebra, zenith, algorithm, almanac, azimuth, alchemy, admiral
Arabic via other Romance languages: amber, cipher, orange, saffron, sugar, zero, coffee

V. Modern English (1650-present)

Period of major colonial expansion, industrial/technological revolution, and American immigration.

Words from European languages

French
French continues to be the largest single source of new words outside of very specialized vocabulary domains (scientific/technical vocabulary, still dominated by classical borrowings).
High culture: ballet, bouillabaisse, cabernet, cachet, chaise longue, champagne, chic, cognac, corsage, faux pas, nom de plume, quiche, rouge, roulette, sachet, salon, saloon, sang froid, savoir faire
War and Military: bastion, brigade, battalion, cavalry, grenade, infantry, palisade, rebuff, bayonet
Other: bigot, chassis, clique, denim, garage, grotesque, jean(s), niche, shock

Spanish: armada, adobe, alligator, alpaca, armadillo, barricade, bravado, cannibal, canyon, coyote, desperado, embargo, enchilada, guitar, marijuana, mesa, mosquito, mustang, ranch, taco, tornado, tortilla, vigilante

Italian: alto, arsenal, balcony, broccoli, cameo, casino, cupola, duo, fresco, fugue, gazette (via French), ghetto, gondola, grotto, macaroni, madrigal, motto, piano, opera, pantaloons, prima donna, regatta, sequin, soprano, stanza, stucco, studio, tempo, torso, umbrella, viola, violin
Words from Italian American immigrants: cappuccino, espresso, linguini, mafioso, pasta, pizza, ravioli, spaghetti, spumante, zabaglione, zucchini

Dutch
Shipping, naval terms: avast, boom, bow, bowsprit, buoy, commodore, cruise, dock, freight, keel, keelhaul, leak, pump, reef, scoop, scour, skipper, sloop, smuggle, splice, tackle, yawl, yacht
Cloth industry: bale, cambric, duck (fabric), fuller's earth, mart, nap (of cloth), selvage, spool, stripe
Art: easel, etching, landscape, sketch
War: beleaguer, holster, freebooter, furlough, onslaught
Food and drink: booze, brandy(wine), coleslaw, cookie, cranberry, crullers, gin, hops, stockfish, waffle
Other: bugger (orig. French), crap, curl, dollar, scum, split (orig. nautical term), uproar

German: bum, dunk, feldspar, quartz, hex, lager, knackwurst, liverwurst, loafer, noodle, poodle, dachshund, pretzel, pinochle, pumpernickel, sauerkraut, schnitzel, zwieback, (beer)stein, lederhosen, dirndl
20th century German loanwords: blitzkrieg, zeppelin, strafe, U-boat, delicatessen, hamburger, frankfurter, wiener, hausfrau, kindergarten, Oktoberfest, schuss, wunderkind, bundt (cake), spritz (cookies), (apple) strudel

Yiddish (most are 20th century borrowings) bagel, Chanukkah (Hanukkah), chutzpah, dreidel, kibbitzer, kosher, lox, pastrami (orig. from Romanian), schlep, spiel, schlepp, schlemiel, schlimazel, gefilte fish, goy, klutz, knish, matzoh, oy vey, schmuck


Scandinavian: fjord, maelstrom, ombudsman, ski, slalom, smorgasbord
Russian: apparatchik, borscht, czar/tsar, glasnost, icon, perestroika, vodka

Words from other parts of the world

Sanskrit: avatar, karma, mahatma, swastika, yoga
Hindi: bandanna, bangle, bungalow, chintz, cot, cummerbund, dungaree, juggernaut, jungle, loot, maharaja, nabob, pajamas, punch (the drink), shampoo, thug, kedgeree, jamboree
Dravidian: curry, mango, teak, pariah
Persian (Farsi): check, checkmate, chess
Arabic: bedouin, emir, fakir, gazelle, giraffe, harem, hashish, lute, minaret, mosque, myrrh, salaam, sirocco, sultan, vizier, bazaar, caravan
African languages: banana (via Portuguese), banjo, boogie-woogie, chigger, goober, gorilla, gumbo, jazz, jitterbug, jitters, juke(box), voodoo, yam, zebra, zombie
American Indian languages: avocado, cacao, cannibal, canoe, chipmunk, chocolate, chili, hammock, hominy, hurricane, maize, moccasin, moose, papoose, pecan, possum, potato, skunk, squaw, succotash, squash, tamale (via Spanish), teepee, terrapin, tobacco, toboggan, tomahawk, tomato, wigwam, woodchuck (plus thousands of place names, including Ottawa, Toronto, Saskatchewan and the names of more than half the states of the U.S., including Michigan, Nebraska, Illinois)
Chinese: chop suey, chow mein, dim sum, tea, ginseng, kowtow, litchee
Japanese: geisha, harakiri, judo, jujitsu, kamikaze, karaoke, kimono, samurai, soy, sumo, sushi, tsunami
Pacific Islands: bamboo, gingham, rattan, taboo, tattoo, ukulele, boondocks
Australia: boomerang, budgerigar, didgeridoo, kangaroo (and many more in Australian English)

© 2003 Suzanne Kemmer

APPENDIX D

A free copy from the International Phonetic Association (Department of Theoretical and Applied Linguistics, School of English, Aristotle University of Thessaloniki, Thessaloniki 54124, GREECE).

THE INTERNATIONAL PHONETIC ALPHABET (revised to 1993)

[Chart not reproduced; the symbols did not survive text extraction. Legible row labels of the pulmonic consonant table: Nasal, Trill, Tap or Flap, Fricative, Lateral fricative, Approximant, Lateral approximant. Chart note: where symbols appear in pairs, the one to the right represents a voiced consonant; shaded areas denote articulations judged impossible. Other legible panels: CONSONANTS (NON-PULMONIC), with columns Clicks, Voiced implosives and Ejectives; SUPRASEGMENTALS; TONES & WORD ACCENTS.]

SYSTEM REQUIREMENTS AND USER GUIDE

The minimum requirements for running the program are listed below:

Hardware requirements:
Processor: Pentium-class processor, 266 MHz or higher
Operating system: Windows 98
Disk space: 5MB of free disk space
Memory: 64MB of RAM

Software requirements:
Sun ONE Studio Community Edition Release 3.0
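The disk and memory minimums listed above can be checked programmatically before installation. The following is only a minimal sketch: the class `RequirementsCheck` and the method `meetsMinimum` are hypothetical names that are not part of the project code on the CD-ROM, the JVM's maximum heap is used as a rough stand-in for installed RAM, and `File.getUsableSpace()` assumes Java 6 or later, which is newer than the Java version the project was written against. The processor and operating system requirements are not checked.

```java
import java.io.File;

/** Hypothetical helper for illustration; not part of the project code. */
public class RequirementsCheck {
    // Minimums taken from the requirements list: 5MB free disk, 64MB memory.
    static final long MIN_FREE_DISK_BYTES = 5L * 1024 * 1024;
    static final long MIN_MEMORY_BYTES = 64L * 1024 * 1024;

    /** Returns true when both figures meet the stated minimums. */
    public static boolean meetsMinimum(long freeDiskBytes, long memoryBytes) {
        return freeDiskBytes >= MIN_FREE_DISK_BYTES
            && memoryBytes >= MIN_MEMORY_BYTES;
    }

    public static void main(String[] args) {
        // Usable space on the partition that holds the current directory.
        long freeDisk = new File(".").getUsableSpace();
        // Assumption: the JVM's maximum heap approximates available memory;
        // it is not a measurement of physical RAM.
        long memory = Runtime.getRuntime().maxMemory();

        System.out.println("Free disk (bytes): " + freeDisk);
        System.out.println("JVM max memory (bytes): " + memory);
        System.out.println(meetsMinimum(freeDisk, memory)
            ? "Minimum disk and memory requirements met."
            : "Below the minimum disk or memory requirements.");
    }
}
```

The comparison is kept in a separate method from the measurement so that the thresholds can be checked against any figures, for example values read from the operating system's own system information tool.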

Installing Sun ONE Studio: Click on the SunOneStudio software on the CD-ROM and follow the installation wizard. The name of the earlier version, Forte for Java, appears during the installation process. Specify your own destination folder or click OK to continue with the default folder. Remember to copy both Jsyllable110504.java and SimpleFrame.java into your preferred folder.

Executing the program: To run Sun ONE Studio for Windows, select Program/Forte for Java CE/Forte for Java CE from the Start menu. A dialog will pop up to set up iPlanet Web Server; click Cancel to skip this step. Another dialog for configuring the general Java settings will then appear; click Cancel again to proceed to the software.

To open an existing file, select File/Open or click the Open File icon on the toolbar. Browse to the file you wish to open and double-click it. The program is located on the CD-ROM and is named Jsyllable110504. You can now view the source code of the Java program in the editing window. Once you have a complete class definition, compile the file by clicking on the Build menu and selecting the Compile File option. Any errors will be listed in the window below the editing window; if there are none, the message 'Finished < File Name >' will appear after a moment or two. When there are no errors, click on the Build menu and select Execute File. Click on the File menu and use Save As ... to copy your code into your preferred directory. The shortcut to compile and execute the file is to click the green play button. A successful compilation opens two windows: the application window and the output window, which monitors the back-end process. If any errors are found, an error message will be displayed in the output window. Normally you do not need to restart the program between inputs: click the clear button to erase the previous input, type in the new input and click the syllabify button.
