A Spell Checker for Esperanto

MASARYK UNIVERSITY
FACULTY OF INFORMATICS

BACHELOR THESIS

Marek Blahuš

Brno, May 2008

Declaration

I hereby declare that this paper is my original authorial work, which I have worked out on my own. All sources, references and literature used or excerpted during the elaboration of this work are properly cited and listed with complete reference to the due source.

Supervisor: RNDr. Petr Sojka, Ph.D.

Acknowledgement

I would like to thank RNDr. Petr Sojka, Ph.D., the supervisor of my bachelor thesis, for his comments, suggestions and the time he spent helping me with this work. I would also like to thank Dr. Petr Chrdle, CSc., owner of the KAVA-PECH publishing house, who provided me with a complimentary copy of the Plena ilustrita vortaro de Esperanto 2005. I am grateful to Dr. Ludvic Lazar Zamenhof, the initiator of Esperanto.

Abstract

This thesis provides a brief overview of spell checking software and describes the process of constructing a spell checker for the Esperanto language and its implementation as a dictionary (i.e. an affix file and a word list) for the Hunspell spell checker. The word list is an adaptation of word roots coming from the renowned Esperanto dictionary PIV. Recognition of morphologically complex words, which are common in Esperanto due to its agglutinative nature, is made possible by the affix file, which has been built from the ready-made morpheme segmentation of word derivations appearing in the same source. Rules derived in the latter process have been improved by semantic classification of all involved roots, for which a system has been created based on corpus analysis and several specialized dictionaries, in combination with knowledge of the capability of each affix to accept roots from different semantic classes, acquired from the PMEG reference grammar.
The resulting spell checker is a working proof of concept, to be further improved and integrated into the grammar checker project of the E@I organization.

Abstrakto

Tiu ĉi disertaĵo donas koncizan trarigardon de literumkontrola programaro kaj priskribas la procezon de konstruado de literumkontrolilo por la lingvo Esperanto kaj ties implementon forme de vortaro (t.e. afiksa dosiero kaj vortlisto) por la literumkontrolilo Hunspell. La vortlisto estas adaptaĵo de vortradikoj venantaj de la renoma Esperanto-vortaro PIV. Rekono de morfologie kompleksaj vortoj, kiuj estas oftaj en Esperanto pro ties aglutina ĥaraktero, estas ebla pro la afiksa dosiero kiu konstruiĝis surbaze de preta morfemstrukturigo de vortderivaĵoj aperantaj en la sama fonto. Reguloj ekestintaj en tiu procezo estas plibonigitaj per semantika klasado de ĉiuj engaĝitaj radikoj, por kio kreiĝis sistemo bazita sur tekstara analizo kaj kelkaj fakvortaroj, kombine kun scioj pri akceptemo de ĉiu afikso al radikoj el diversaj semantikaj klasoj, akiritaj de la gramatika manlibro PMEG. La rezultinta literumkontrolilo estas funkcianta koncept-pruvo, plibonigota kaj integrota en la gramatikkontrolilan projekton de la organizo E@I.
Keywords

spell checker, Esperanto, Hunspell, corpus, morphology, semantic classification

Ŝlosilvortoj

literumkontrolilo, Esperanto, Hunspell, tekstaro, morfologio, semantika klasado

Table of Contents

1 Introduction
2 Overview of Existing Software
    2.1 Kontrolu Literumadon
    2.2 Esperantilo
    2.3 Ispell
    2.4 GNU Aspell
    2.5 MySpell
    2.6 Hunspell
        2.6.1 Structure of Hunspell Data Files
    2.7 Overview Table
3 A Hunspell Dictionary for Esperanto
    3.1 Strengths and Weaknesses of the Existing Dictionaries
    3.2 Adopting a Suitable Approach to Esperanto Morphology
        3.2.1 The Structure of Words in Esperanto
        3.2.2 System for a Semantic Classification of Stems
    3.3 DFD Diagram for Dictionary Construction
    3.4 Compiling a New Word List
        3.4.1 Identifying Relevant Sources for the Word List
        3.4.2 Moore Machine for Semantic Classification
        3.4.3 Implementation of the Semantic Classification
    3.5 Compiling a New Set of Affix Rules
        3.5.1 Dictionary-Based Word Derivation System
        3.5.2 Implementation of Esperanto Morphology in Hunspell
    3.6 Integration in OpenOffice.org and the E@I Grammar Checker
4 Evaluation of the Newly Constructed Dictionary
5 Conclusion
Bibliography
Appendix A: The 16 Rules of Esperanto Grammar
Appendix B: An Overview of Esperanto Affixes

Chapter 1
Introduction

The development of computer technologies and the internet has had a strong impact on the world in which we live, often introducing major changes into certain fields of human activity and within human communities, and sometimes even giving brand new potential to tools which we already had.
Such is also the case of Esperanto, the international auxiliary language created in 1887 by Dr. Ludvic Lazar Zamenhof, and of its worldwide community of speakers, which according to some sources (Gordon, 2005, the article about Esperanto) is estimated to number up to 2 million people, spread across 115 countries from South America through Europe to eastern Asia.

The impact of these new technologies on the Esperanto community, a language-based diaspora, has been mainly positive: the borders between countries seem to be disappearing, geographical distances are ceasing to be a burden, and there are more opportunities for maintaining international contacts. Esperanto speakers and students can meet each other more easily, and documents, music, literature and other resources in the language are easier to find. Never before has it been possible to address such a wide audience at once, in Esperanto as in other languages. There are hundreds of thousands of web pages in Esperanto on the internet, and writing an article for the Esperanto Wikipedia or for one’s own online diary does not take serious effort. Sending an e-mail message to an online discussion group is much easier than mailing a letter to the editor of a magazine or sending out bulk mail on one’s own, and people are taking advantage of this.

However, apart from all these positive aspects, negative consequences are emerging as well. The language skills of the average Esperantist have probably not improved much over recent decades, but more and more published Esperanto texts are now written by average Esperantists, with little or no attention to their language level, and thus the quality of Esperanto texts accessible to the public is declining. On the other hand, technology does not only create the problem; it also gives us the tools to solve it.
In 2007, a group of young people from E@I (Education@Internet, an international organization promoting the use of the internet among Esperanto speakers) came up with the idea of creating a language checking software package for Esperanto, intended both for the student who still needs to cultivate their language skills and for the skilled Esperantist who types pages of text a day and whose mistakes usually come merely from a lack of attention. A good spell and grammar checker would be useful in every kind of text processor, e-mail client and web browser, and anywhere on the internet where texts in Esperanto are written (forums, chats, blogs). This bachelor thesis is a part of the project and its goal is to design a proof-of-concept implementation of the spell checking part of
