Automatic Syllabification in European Languages: a Comparison of Data-Driven Methods

by Connie R. Adsett

Submitted in partial fulfillment of the requirements for the degree of Master of Computer Science at Dalhousie University, Halifax, Nova Scotia, June 2008

© Copyright by Connie R. Adsett, 2008

DALHOUSIE UNIVERSITY
FACULTY OF COMPUTER SCIENCE

The undersigned hereby certify that they have read and recommend to the Faculty of Graduate Studies for acceptance a thesis entitled "AUTOMATIC SYLLABIFICATION IN EUROPEAN LANGUAGES: A COMPARISON OF DATA-DRIVEN METHODS" by Connie R. Adsett in partial fulfillment of the requirements for the degree of Master of Computer Science.

Dated: June 24, 2008
Supervisors: Dr. Yannick Marchand, Dr. Vlado Keselj
Reader: Dr. Qigang Gao

DALHOUSIE UNIVERSITY

DATE: June 24, 2008
AUTHOR: Connie R. Adsett
TITLE: AUTOMATIC SYLLABIFICATION IN EUROPEAN LANGUAGES: A COMPARISON OF DATA-DRIVEN METHODS
DEPARTMENT OR SCHOOL: Faculty of Computer Science
DEGREE: MCSc
CONVOCATION: October
YEAR: 2008

Permission is herewith granted to Dalhousie University to circulate and to have copied for non-commercial purposes, at its discretion, the above title upon the request of individuals or institutions.

Signature of Author

The author reserves other publication rights, and neither the thesis nor extensive extracts from it may be printed or otherwise reproduced without the author's written permission.

The author attests that permission has been obtained for the use of any copyrighted material appearing in the thesis (other than brief excerpts requiring only proper acknowledgement in scholarly writing) and that all such use is clearly acknowledged.

Syllables govern the world.
- John Selden (1584–1654)

Table of Contents

List of Tables
List of Figures
Abstract
List of Abbreviations and Symbols Used
Acknowledgements

Chapter 1 Introduction
  1.1 Goals and Objectives
  1.2 Main Contributions
  1.3 Thesis Outline

Chapter 2 Motivation and Related Work
  2.1 Why Study Automatic Syllabification?
    2.1.1 Natural Language Processing
    2.1.2 Modeling Human Language Processing
    2.1.3 Comparison of Syllabic Complexity
  2.2 Data-driven or Rule-based Automatic Syllabification?
    2.2.1 English
    2.2.2 Italian
  2.3 Data-driven Syllabification Approaches
  2.4 Problem Definition

Chapter 3 Languages and Lexicons Used
  3.1 Basque
  3.2 Dutch
  3.3 English
  3.4 French
  3.5 Frisian
  3.6 German
  3.7 Italian
  3.8 Norwegian
  3.9 Spanish
  3.10 General Lexicon Information
  3.11 Common Lexicon Creation

Chapter 4 Algorithms
  4.1 Syllabification as a Classification Problem
  4.2 Instance-Based Learning
    4.2.1 IB1
    4.2.2 Look Up Procedure
    4.2.3 Example
  4.3 Liang's Hyphenation Algorithm
  4.4 Syllabification by Analogy
  4.5 Assessing Algorithm Performance

Chapter 5 Results and Discussion
  5.1 Syllabification Benchmark
    5.1.1 IB1 Results
    5.1.2 Liang Results
    5.1.3 Look-up Procedure Results
    5.1.4 Syllabification by Analogy Results
    5.1.5 Comparison of All Algorithms
  5.2 Spelling versus Pronunciation
  5.3 Cross-language Study

Chapter 6 Conclusion

References

List of Tables

Table 3.1 Character and entry information for each lexicon.
Table 3.2 The alphabets used for each lexicon.
Table 3.3 The maximum and average word and syllable lengths for each lexicon.
Table 4.1 The IB1 instance base entries for the word 'able'.
Table 4.2 The weight vectors used in the Look Up Procedure as used by Weijters (1991); Marchand, Adsett, and Damper (in press); Adsett and Marchand (2007). The juncture to be classified lies between the letter at position −1 and the one at +1.
Table 4.3 The instance base resulting from the storage of the words 'syllable', 'able', and 'available' with feature weights for both the Look Up Procedure and IB1-IG.
Table 4.4 The results of the calculations necessary to determine the weight of feature −3.
Table 4.5 The distances between each stored instance and those from 'table' using both the Look Up Procedure and IB1-IG weights.
Table 4.6 The best matches and corresponding classifications for each instance from 'table' according to the Look Up Procedure and IB1-IG weights.
Table 4.7 Potential patterns as processed at level 1 in Liang's algorithm using the words 'syllable', 'table', 'tabulate', and 'able'.
Table 4.8 Potential patterns as processed at level 2 in Liang's algorithm using the words 'syllable', 'table', 'tabulate', and 'able'.
Table 4.9 The parameters used to run Liang's algorithm. The abbreviations g, b, and t are used to represent good weight, bad weight, and threshold, respectively.
Table 4.10 The matches found in the lexicon for the substrings of length six from the word 'table'.
Table 4.11 Sample values for the paths from Figure 4.2 using each of the five scoring strategies for SbA.
Table 4.12 The rankings and points for each candidate according to the scoring strategy results reported in Table 4.11.
Table 4.13 Example of juncture accuracy evaluation for the word 'table'.
Table 5.1 The average IB1-IG word accuracy results for each left and right context size.
Table 5.2 The word accuracy results obtained using Liang's Algorithm for each lexicon and parameter set.
Table 5.3 The juncture accuracy results for each lexicon using version 1 of the parameter sets for Liang's algorithm.
Table 5.4 All combinations of the five scoring strategies used to test the Syllabification by Analogy algorithm.
Table 5.5 Summary of the algorithm results (number of lexicons in which each algorithm had the best score, minimum word accuracy score, standard deviation of word accuracies over all lexicons) for both Full and Common lexicons.
Table 5.6 Comparison of performance between spelling and pronunciation domain mean results and the results of χ² tests for significance. The '*' and '**' indicate that the results are statistically significant with p < 0.05 and p < 0.01, respectively.
Table 5.7 Spelling and pronunciation character set sizes for each language ranked from the greatest difference between the two to the least.
Table 5.8 The rank order of the syllabic complexity of languages according to the mean word accuracy in the spelling and pronunciation domains.
Table 5.9 Languages ordered from highest to lowest frequency of CV syllables and from lowest to highest frequency of closed syllables (according to Frota and Vigário (2001)).

List of Figures

Figure 2.1 Frequency of CV syllables in Dutch (Frota & Vigário, 2001), English (Dauer, 1983), European Portuguese (EP) (Frota & Vigário, 2001), French (Laks, 1995), Italian (Bortolini, 1976), and Spanish (Dauer, 1983).
Figure 3.1 The relationships between the nine languages studied with respect to the Indo-European language family.
Figure 3.2 The word length distribution for the spelling domain Common lexicons.
Figure 3.3 The word length distribution for the pronunciation domain Common lexicons.
Figure 4.1 A subset of the syllabification lattice for 'table' generated using the words 'syllable', 'able', 'available', and 'tabular'.
Figure 4.2 The possible shortest paths for the substring 'a?b?l' from 'table', were the arc from 'a' to 'l' not to exist.
Figure 5.1 The word accuracies for each lexicon with left and right contexts of six. The first letter of each lexicon label denotes the language and the second represents the domain (French and Frisian are distinguished by using 'Fc' and 'Fs', respectively).
Figure 5.2 The normalized IB1 weights generated for the German Spelling domain lexicon using the Information Gain (IG), Gain Ratio (GR), and χ² equations.
Figure 5.3 Average pattern lengths across lexicons for each parameter setting of Liang's algorithm.
Figure 5.4 The average number of patterns generated for each parameter setting of Liang's algorithm (with standard error).
Figure 5.5 The percent of the patterns generated at each level for each parameter setting of Liang's algorithm.
Figure 5.6 The average word accuracy for each weight version of the Look Up Procedure.
Figure 5.7 The word accuracy of version 10 of the Look Up Procedure weights for each lexicon. The first letter of each lexicon label denotes the language and the second represents the domain (French and Frisian are distinguished by using 'Fc' and 'Fs', respectively).
Figure 5.8 The version 10 weights for the Look Up Procedure.
Figure 5.9 The mean word accuracies (with standard error) across all lexicons for each strategy combination of Syllabification by Analogy.
Figure 5.10 The number of lexicons where each scoring strategy combination achieves word and juncture accuracy in the top three for the Syllabification by Analogy algorithm.
Figure 5.11 Comparison of algorithm word accuracy results over all the Full lexicons. The first letter of each lexicon label represents the language and the second denotes the domain.
Figure 5.12 The difference between the pronunciation and spelling domain mean word accuracies on those languages with lexicons in both domains.
Figure 5.13 Each language ranking in the spelling and pronunciation domain and the regression line given by the Spearman correlation statistical test.
